From 4045ddab5472e60ad172cdcb9789360f34b2b196 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Wed, 23 Jan 2019 21:39:18 +0530 Subject: [PATCH 01/98] Create CODE_OF_CONDUCT.md --- CODE_OF_CONDUCT.md | 76 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 76 insertions(+) create mode 100644 CODE_OF_CONDUCT.md diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md new file mode 100644 index 0000000..b2129ef --- /dev/null +++ b/CODE_OF_CONDUCT.md @@ -0,0 +1,76 @@ +# Contributor Covenant Code of Conduct + +## Our Pledge + +In the interest of fostering an open and welcoming environment, we as +contributors and maintainers pledge to making participation in our project and +our community a harassment-free experience for everyone, regardless of age, body +size, disability, ethnicity, sex characteristics, gender identity and expression, +level of experience, education, socio-economic status, nationality, personal +appearance, race, religion, or sexual identity and orientation. + +## Our Standards + +Examples of behavior that contributes to creating a positive environment +include: + +* Using welcoming and inclusive language +* Being respectful of differing viewpoints and experiences +* Gracefully accepting constructive criticism +* Focusing on what is best for the community +* Showing empathy towards other community members + +Examples of unacceptable behavior by participants include: + +* The use of sexualized language or imagery and unwelcome sexual attention or + advances +* Trolling, insulting/derogatory comments, and personal or political attacks +* Public or private harassment +* Publishing others' private information, such as a physical or electronic + address, without explicit permission +* Other conduct which could reasonably be considered inappropriate in a + professional setting + +## Our Responsibilities + +Project maintainers are responsible for clarifying the standards of acceptable +behavior and are expected to take appropriate and fair corrective 
action in +response to any instances of unacceptable behavior. + +Project maintainers have the right and responsibility to remove, edit, or +reject comments, commits, code, wiki edits, issues, and other contributions +that are not aligned to this Code of Conduct, or to ban temporarily or +permanently any contributor for other behaviors that they deem inappropriate, +threatening, offensive, or harmful. + +## Scope + +This Code of Conduct applies both within project spaces and in public spaces +when an individual is representing the project or its community. Examples of +representing a project or community include using an official project e-mail +address, posting via an official social media account, or acting as an appointed +representative at an online or offline event. Representation of a project may be +further defined and clarified by project maintainers. + +## Enforcement + +Instances of abusive, harassing, or otherwise unacceptable behavior may be +reported by contacting the project team at singhal.amogh1995@gmail.com. All +complaints will be reviewed and investigated and will result in a response that +is deemed necessary and appropriate to the circumstances. The project team is +obligated to maintain confidentiality with regard to the reporter of an incident. +Further details of specific enforcement policies may be posted separately. + +Project maintainers who do not follow or enforce the Code of Conduct in good +faith may face temporary or permanent repercussions as determined by other +members of the project's leadership. 
+ +## Attribution + +This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4, +available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html + +[homepage]: https://www.contributor-covenant.org + +For answers to common questions about this code of conduct, see +https://www.contributor-covenant.org/faq From 898ebbf2d7c780cb5a89bad51d0b7e043e25879f Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Wed, 23 Jan 2019 21:40:58 +0530 Subject: [PATCH 02/98] Update issue templates --- .github/ISSUE_TEMPLATE/bug_report.md | 38 +++++++++++++++++++++++ .github/ISSUE_TEMPLATE/feature_request.md | 20 ++++++++++++ 2 files changed, 58 insertions(+) create mode 100644 .github/ISSUE_TEMPLATE/bug_report.md create mode 100644 .github/ISSUE_TEMPLATE/feature_request.md diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md new file mode 100644 index 0000000..dd84ea7 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/bug_report.md @@ -0,0 +1,38 @@ +--- +name: Bug report +about: Create a report to help us improve +title: '' +labels: '' +assignees: '' + +--- + +**Describe the bug** +A clear and concise description of what the bug is. + +**To Reproduce** +Steps to reproduce the behavior: +1. Go to '...' +2. Click on '....' +3. Scroll down to '....' +4. See error + +**Expected behavior** +A clear and concise description of what you expected to happen. + +**Screenshots** +If applicable, add screenshots to help explain your problem. + +**Desktop (please complete the following information):** + - OS: [e.g. iOS] + - Browser [e.g. chrome, safari] + - Version [e.g. 22] + +**Smartphone (please complete the following information):** + - Device: [e.g. iPhone6] + - OS: [e.g. iOS8.1] + - Browser [e.g. stock browser, safari] + - Version [e.g. 22] + +**Additional context** +Add any other context about the problem here. 
diff --git a/.github/ISSUE_TEMPLATE/feature_request.md b/.github/ISSUE_TEMPLATE/feature_request.md new file mode 100644 index 0000000..bbcbbe7 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/feature_request.md @@ -0,0 +1,20 @@ +--- +name: Feature request +about: Suggest an idea for this project +title: '' +labels: '' +assignees: '' + +--- + +**Is your feature request related to a problem? Please describe.** +A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] + +**Describe the solution you'd like** +A clear and concise description of what you want to happen. + +**Describe alternatives you've considered** +A clear and concise description of any alternative solutions or features you've considered. + +**Additional context** +Add any other context or screenshots about the feature request here. From 14f89d0445b5ecf67c99441c6fdbf7e12c12c084 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Sun, 27 Jan 2019 12:02:58 +0530 Subject: [PATCH 03/98] modified requirements.txt, modiefied log. reg. 
--- LDA scikit-learn/nmf_lda_scikitlearn.py | 3 +- .../binary_logisitic_regression.py | 11 - helpers/gradient_descent.py | 47 ++-- helpers/linear_algebra.py | 49 ++-- helpers/probabilty.py | 7 +- helpers/stats.py | 108 +++++---- k_means_clustering/model.py | 2 +- k_means_clustering/utils.py | 2 +- .../banking.csv | 0 .../binary_logisitic_regression.py | 79 ++++++ logistic_regression_banking/utils.py | 40 ++++ multiple_regression/utils.py | 2 +- .../model.py | 2 +- .../naivebayesclassifier.py | 2 +- .../utils.py | 0 neural_network/model.py | 20 +- regression_intro.py | 23 +- requirements.txt | 226 ------------------ 18 files changed, 263 insertions(+), 360 deletions(-) delete mode 100644 Logistic Regression/binary_logisitic_regression.py rename {Logistic Regression => logistic_regression_banking}/banking.csv (100%) create mode 100644 logistic_regression_banking/binary_logisitic_regression.py create mode 100644 logistic_regression_banking/utils.py rename {NaiveBayesClassifier => naive_bayes_classfier}/model.py (96%) rename {NaiveBayesClassifier => naive_bayes_classfier}/naivebayesclassifier.py (87%) rename {NaiveBayesClassifier => naive_bayes_classfier}/utils.py (100%) delete mode 100644 requirements.txt diff --git a/LDA scikit-learn/nmf_lda_scikitlearn.py b/LDA scikit-learn/nmf_lda_scikitlearn.py index d6955b8..513091d 100644 --- a/LDA scikit-learn/nmf_lda_scikitlearn.py +++ b/LDA scikit-learn/nmf_lda_scikitlearn.py @@ -6,4 +6,5 @@ nmf = NMF(n_components=num_topics, random_state=1, alpha=.1, l1_ratio=.5, init='nndsvd').fit(tfidf) # Run LDA -lda = LatentDirichletAllocation(n_topics=num_topics, max_iter=5, learning_method='online', learning_offset=50, random_state=0).fit(tf) +lda = LatentDirichletAllocation(n_topics=num_topics, max_iter=5, learning_method='online', + learning_offset=50, random_state=0).fit(tf) diff --git a/Logistic Regression/binary_logisitic_regression.py b/Logistic Regression/binary_logisitic_regression.py deleted file mode 100644 index 
133c7c1..0000000 --- a/Logistic Regression/binary_logisitic_regression.py +++ /dev/null @@ -1,11 +0,0 @@ -import pandas as pd -import numpy as np -from sklearn import preprocessing -import matplotlib.pyplot as plt - -plt.rc("font", size=14) -from sklearn.linear_model import LogisticRegression -from sklearn.model_selection import train_test_split -import seaborn as sns -sns.set(style="white") -sns.set(style="whitegrid", color_code=True) diff --git a/helpers/gradient_descent.py b/helpers/gradient_descent.py index 5c2d40c..246a684 100644 --- a/helpers/gradient_descent.py +++ b/helpers/gradient_descent.py @@ -1,7 +1,6 @@ -from collections import Counter +import random + from helpers.linear_algebra import distance, vector_subtract, scalar_multiply -from functools import reduce -import math, random def sum_of_squares(v): @@ -14,7 +13,6 @@ def difference_quotient(f, x, h): def plot_estimated_derivative(): - def square(x): return x * x @@ -26,14 +24,13 @@ def derivative_estimate(): # plot to show they're basically the same import matplotlib.pyplot as plt - x = range(-10,10) - plt.plot(x, map(derivative, x), 'rx') # red x + x = range(-10, 10) + plt.plot(x, map(derivative, x), 'rx') # red x plt.plot(x, map(derivative_estimate, x), 'b+') # blue + - plt.show() # purple *, hopefully + plt.show() # purple *, hopefully def partial_difference_quotient(f, v, i, h): - # add h to just the i-th element of v w = [v_j + (h if j == i else 0) for j, v_j in enumerate(v)] @@ -58,11 +55,13 @@ def sum_of_squares_gradient(v): def safe(f): """define a new function that wraps f and return it""" + def safe_f(*args, **kwargs): try: return f(*args, **kwargs) except: - return float('inf') # this means "infinity" in Python + return float('inf') # this means "infinity" in Python + return safe_f @@ -77,9 +76,9 @@ def minimize_batch(target_fn, gradient_fn, theta_0, tolerance=0.000001): step_sizes = [100, 10, 1, 0.1, 0.01, 0.001, 0.0001, 0.00001] - theta = theta_0 # set theta to initial value - target_fn 
= safe(target_fn) # safe version of target_fn - value = target_fn(theta) # value we're minimizing + theta = theta_0 # set theta to initial value + target_fn = safe(target_fn) # safe version of target_fn + value = target_fn(theta) # value we're minimizing while True: gradient = gradient_fn(theta) @@ -113,6 +112,7 @@ def maximize_batch(target_fn, gradient_fn, theta_0, tolerance=0.000001): theta_0, tolerance) + # # minimize / maximize stochastic # @@ -121,22 +121,21 @@ def maximize_batch(target_fn, gradient_fn, theta_0, tolerance=0.000001): def in_random_order(data): """generator that returns the elements of data in random order""" indexes = [i for i, _ in enumerate(data)] # create a list of indexes - random.shuffle(indexes) # shuffle them - for i in indexes: # return the data in that order + random.shuffle(indexes) # shuffle them + for i in indexes: # return the data in that order yield data[i] def minimize_stochastic(target_fn, gradient_fn, x, y, theta_0, alpha_0=0.01): - data = list(zip(x, y)) - theta = theta_0 # initial guess - alpha = alpha_0 # initial step size - min_theta, min_value = None, float("inf") # the minimum so far + theta = theta_0 # initial guess + alpha = alpha_0 # initial step size + min_theta, min_value = None, float("inf") # the minimum so far iterations_with_no_improvement = 0 # if we ever go 100 iterations with no improvement, stop while iterations_with_no_improvement < 100: - value = sum( target_fn(x_i, y_i, theta) for x_i, y_i in data ) + value = sum(target_fn(x_i, y_i, theta) for x_i, y_i in data) if value < min_value: # if we've found a new minimum, remember it @@ -167,17 +166,17 @@ def maximize_stochastic(target_fn, gradient_fn, x, y, theta_0, alpha_0=0.01): print("using the gradient") - v = [random.randint(-10,10) for i in range(3)] + v = [random.randint(-10, 10) for i in range(3)] tolerance = 0.0000001 while True: # print v, sum_of_squares(v) - gradient = sum_of_squares_gradient(v) # compute the gradient at v - next_v = step(v, gradient, 
-0.01) # take a negative gradient step - if distance(next_v, v) < tolerance: # stop if we're converging + gradient = sum_of_squares_gradient(v) # compute the gradient at v + next_v = step(v, gradient, -0.01) # take a negative gradient step + if distance(next_v, v) < tolerance: # stop if we're converging break - v = next_v # continue if we're not + v = next_v # continue if we're not print("minimum v", v) print("minimum value", sum_of_squares(v)) diff --git a/helpers/linear_algebra.py b/helpers/linear_algebra.py index ecab1bf..83898de 100644 --- a/helpers/linear_algebra.py +++ b/helpers/linear_algebra.py @@ -1,9 +1,6 @@ -# -*- coding: iso-8859-15 -*- +import math +from functools import reduce -import re, math, random # regexes, math functions, random numbers -import matplotlib.pyplot as plt # pyplot -from collections import defaultdict, Counter -from functools import partial, reduce # # functions for working with vectors @@ -12,12 +9,12 @@ def vector_add(v, w): """adds two vectors componentwise""" - return [v_i + w_i for v_i, w_i in zip(v,w)] + return [v_i + w_i for v_i, w_i in zip(v, w)] def vector_subtract(v, w): """subtracts two vectors componentwise""" - return [v_i - w_i for v_i, w_i in zip(v,w)] + return [v_i - w_i for v_i, w_i in zip(v, w)] def vector_sum(vectors): @@ -32,7 +29,7 @@ def vector_mean(vectors): """compute the vector whose i-th element is the mean of the i-th elements of the input vectors""" n = len(vectors) - return scalar_multiply(1/n, vector_sum(vectors)) + return scalar_multiply(1 / n, vector_sum(vectors)) def dot(v, w): @@ -54,7 +51,8 @@ def squared_distance(v, w): def distance(v, w): - return math.sqrt(squared_distance(v, w)) + return math.sqrt(squared_distance(v, w)) + # # functions for working with matrices @@ -91,20 +89,16 @@ def is_diagonal(i, j): # user 0 1 2 3 4 5 6 7 8 9 # -friendships = [[0, 1, 1, 0, 0, 0, 0, 0, 0, 0], # user 0 - [1, 0, 1, 1, 0, 0, 0, 0, 0, 0], # user 1 - [1, 1, 0, 1, 0, 0, 0, 0, 0, 0], # user 2 - [0, 1, 1, 0, 1, 0, 
0, 0, 0, 0], # user 3 - [0, 0, 0, 1, 0, 1, 0, 0, 0, 0], # user 4 - [0, 0, 0, 0, 1, 0, 1, 1, 0, 0], # user 5 - [0, 0, 0, 0, 0, 1, 0, 0, 1, 0], # user 6 - [0, 0, 0, 0, 0, 1, 0, 0, 1, 0], # user 7 - [0, 0, 0, 0, 0, 0, 1, 1, 0, 1], # user 8 - [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]] # user 9 - -##### -# DELETE DOWN -# +friendships = [[0, 1, 1, 0, 0, 0, 0, 0, 0, 0], # user 0 + [1, 0, 1, 1, 0, 0, 0, 0, 0, 0], # user 1 + [1, 1, 0, 1, 0, 0, 0, 0, 0, 0], # user 2 + [0, 1, 1, 0, 1, 0, 0, 0, 0, 0], # user 3 + [0, 0, 0, 1, 0, 1, 0, 0, 0, 0], # user 4 + [0, 0, 0, 0, 1, 0, 1, 1, 0, 0], # user 5 + [0, 0, 0, 0, 0, 1, 0, 0, 1, 0], # user 6 + [0, 0, 0, 0, 0, 1, 0, 0, 1, 0], # user 7 + [0, 0, 0, 0, 0, 0, 1, 1, 0, 1], # user 8 + [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]] # user 9 def matrix_add(A, B): @@ -119,23 +113,22 @@ def entry_fn(i, j): return A[i][j] + B[i][j] def make_graph_dot_product_as_vector_projection(plt): - v = [2, 1] w = [math.sqrt(.25), math.sqrt(.75)] c = dot(v, w) vonw = scalar_multiply(c, w) - o = [0,0] + o = [0, 0] plt.arrow(0, 0, v[0], v[1], width=0.002, head_width=.1, length_includes_head=True) plt.annotate("v", v, xytext=[v[0] + 0.1, v[1]]) - plt.arrow(0 ,0, w[0], w[1], + plt.arrow(0, 0, w[0], w[1], width=0.002, head_width=.1, length_includes_head=True) plt.annotate("w", w, xytext=[w[0] - 0.1, w[1]]) plt.arrow(0, 0, vonw[0], vonw[1], length_includes_head=True) plt.annotate(u"(v?w)w", vonw, xytext=[vonw[0] - 0.1, vonw[1] + 0.1]) plt.arrow(v[0], v[1], vonw[0] - v[0], vonw[1] - v[1], linestyle='dotted', length_includes_head=True) - plt.scatter(*zip(v,w,o),marker='.') + plt.scatter(*zip(v, w, o), marker='.') plt.axis('equal') - plt.show() \ No newline at end of file + plt.show() diff --git a/helpers/probabilty.py b/helpers/probabilty.py index ceb3d9a..618188b 100644 --- a/helpers/probabilty.py +++ b/helpers/probabilty.py @@ -13,7 +13,7 @@ def uniform_pdf(x): def uniform_cdf(x): - "returns the probability that a uniform random variable is less than x" + """returns the probability that 
a uniform random variable is less than x""" if x < 0: return 0 # uniform random is never less than 0 elif x < 1: @@ -22,7 +22,7 @@ def uniform_cdf(x): return 1 # uniform random is always less than 1 -def normal_pdf(x, mu=0, sigma=1): +def normal_pdf(x, mu=0, sigma=1.0): sqrt_two_pi = math.sqrt(2 * math.pi) return math.exp(-(x - mu) ** 2 / 2 / sigma ** 2) / (sqrt_two_pi * sigma) @@ -37,7 +37,7 @@ def plot_normal_pdfs(plt): plt.show() -def normal_cdf(x, mu=0, sigma=1): +def normal_cdf(x, mu=0, sigma=1.0): return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2 @@ -60,6 +60,7 @@ def inverse_normal_cdf(p, mu=0, sigma=1, tolerance=0.00001): low_z, low_p = -10.0, 0 # normal_cdf(-10) is (very close to) 0 hi_z, hi_p = 10.0, 1 # normal_cdf(10) is (very close to) 1 + mid_z = None while hi_z - low_z > tolerance: mid_z = (low_z + hi_z) / 2 # consider the midpoint mid_p = normal_cdf(mid_z) # and the cdf's value there diff --git a/helpers/stats.py b/helpers/stats.py index 0ec1d90..8fa8bc9 100644 --- a/helpers/stats.py +++ b/helpers/stats.py @@ -2,14 +2,20 @@ from helpers.linear_algebra import sum_of_squares, dot import math -num_friends = [100, 49, 41, 40, 25, 21, 21, 19, 19, 18, 18, 16, 15, 15, 15, 15, 14, 14, 13, 13, 13, 13, 12, 12, 11, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] +num_friends = [100, 49, 41, 40, 25, 21, 21, 19, 19, 18, 18, 16, 15, 15, 15, 15, 14, 14, 13, 13, 13, 13, 12, 12, 11, 10, + 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9, 9, 9, 9, + 9, 9, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, + 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, + 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, + 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, + 1, 1, 1, 1, 1, 1, 1, 1] def make_friend_counts_histogram(plt): friend_counts = Counter(num_friends) xs = range(101) ys = [friend_counts[x] for x in xs] - plt.bar(xs, ys) + plt.bar(xs, ys) plt.axis([0, 101, 0, 25]) plt.title("Histogram of Friend Counts") plt.xlabel("# of friends") @@ -17,15 +23,16 @@ def make_friend_counts_histogram(plt): plt.show() -num_points = len(num_friends) # 204 +num_points = len(num_friends) # 204 -largest_value = max(num_friends) # 100 -smallest_value = min(num_friends) # 1 +largest_value = max(num_friends) # 100 +smallest_value = min(num_friends) # 1 sorted_values = sorted(num_friends) # smallest_value = sorted_values[0] # 1 -second_smallest_value = sorted_values[1] # 1 -second_largest_value = sorted_values[-2] # 49 +second_smallest_value = sorted_values[1] # 1 +second_largest_value = sorted_values[-2] # 49 + # this isn't right if you don't from __future__ import division @@ -50,7 +57,7 @@ def median(v): return (sorted_v[lo] + sorted_v[hi]) / 2 -def quantile(x, p): +def quantile(x, p): """returns the pth-percentile value in x""" p_index = int(p * len(x)) return sorted(x)[p_index] @@ -60,10 +67,13 @@ def mode(x): """returns a list, might be more than one mode""" counts = Counter(x) max_count = max(counts.values()) - return [x_i for x_i, count in counts.items() + return [x_i for x_i, count in counts.items() if count == max_count] + # "range" already means something in Python, so we'll use a different name + + def data_range(x): return max(x) - min(x) @@ -86,7 +96,8 @@ def 
standard_deviation(x): def interquartile_range(x): - return quantile(x, 0.75) - quantile(x, 0.25) + return quantile(x, 0.75) - quantile(x, 0.25) + #### # @@ -95,56 +106,71 @@ def interquartile_range(x): ##### -daily_minutes = [1, 68.77, 51.25, 52.08, 38.36, 44.54, 57.13, 51.4, 41.42, 31.22, 34.76, 54.01, 38.79, 47.59, 49.1, 27.66, 41.03, 36.73, 48.65, 28.12, 46.62, 35.57, 32.98, 35, 26.07, 23.77, 39.73, 40.57, 31.65, 31.21, 36.32, 20.45, 21.93, 26.02, 27.34, 23.49, 46.94, 30.5, 33.8, 24.23, 21.4, 27.94, 32.24, 40.57, 25.07, 19.42, 22.39, 18.42, 46.96, 23.72, 26.41, 26.97, 36.76, 40.32, 35.02, 29.47, 30.2, 31, 38.11, 38.18, 36.31, 21.03, 30.86, 36.07, 28.66, 29.08, 37.28, 15.28, 24.17, 22.31, 30.17, 25.53, 19.85, 35.37, 44.6, 17.23, 13.47, 26.33, 35.02, 32.09, 24.81, 19.33, 28.77, 24.26, 31.98, 25.73, 24.86, 16.28, 34.51, 15.23, 39.72, 40.8, 26.06, 35.76, 34.76, 16.13, 44.04, 18.03, 19.65, 32.62, 35.59, 39.43, 14.18, 35.24, 40.13, 41.82, 35.45, 36.07, 43.67, 24.61, 20.9, 21.9, 18.79, 27.61, 27.21, 26.61, 29.77, 20.59, 27.53, 13.82, 33.2, 25, 33.1, 36.65, 18.63, 14.87, 22.2, 36.81, 25.53, 24.62, 26.25, 18.21, 28.08, 19.42, 29.79, 32.8, 35.99, 28.32, 27.79, 35.88, 29.06, 36.28, 14.1, 36.63, 37.49, 26.9, 18.58, 38.48, 24.48, 18.95, 33.55, 14.24, 29.04, 32.51, 25.63, 22.22, 19, 32.73, 15.16, 13.9, 27.2, 32.01, 29.27, 33, 13.74, 20.42, 27.32, 18.23, 35.35, 28.48, 9.08, 24.62, 20.12, 35.26, 19.92, 31.02, 16.49, 12.16, 30.7, 31.22, 34.65, 13.13, 27.51, 33.2, 31.57, 14.1, 33.42, 17.44, 10.12, 24.42, 9.82, 23.39, 30.93, 15.03, 21.67, 31.09, 33.29, 22.61, 26.89, 23.48, 8.38, 27.81, 32.35, 23.84] +daily_minutes = [1, 68.77, 51.25, 52.08, 38.36, 44.54, 57.13, 51.4, 41.42, 31.22, 34.76, 54.01, 38.79, 47.59, 49.1, + 27.66, 41.03, 36.73, 48.65, 28.12, 46.62, 35.57, 32.98, 35, 26.07, 23.77, 39.73, 40.57, 31.65, 31.21, + 36.32, 20.45, 21.93, 26.02, 27.34, 23.49, 46.94, 30.5, 33.8, 24.23, 21.4, 27.94, 32.24, 40.57, 25.07, + 19.42, 22.39, 18.42, 46.96, 23.72, 26.41, 26.97, 36.76, 
40.32, 35.02, 29.47, 30.2, 31, 38.11, 38.18, + 36.31, 21.03, 30.86, 36.07, 28.66, 29.08, 37.28, 15.28, 24.17, 22.31, 30.17, 25.53, 19.85, 35.37, 44.6, + 17.23, 13.47, 26.33, 35.02, 32.09, 24.81, 19.33, 28.77, 24.26, 31.98, 25.73, 24.86, 16.28, 34.51, + 15.23, 39.72, 40.8, 26.06, 35.76, 34.76, 16.13, 44.04, 18.03, 19.65, 32.62, 35.59, 39.43, 14.18, 35.24, + 40.13, 41.82, 35.45, 36.07, 43.67, 24.61, 20.9, 21.9, 18.79, 27.61, 27.21, 26.61, 29.77, 20.59, 27.53, + 13.82, 33.2, 25, 33.1, 36.65, 18.63, 14.87, 22.2, 36.81, 25.53, 24.62, 26.25, 18.21, 28.08, 19.42, + 29.79, 32.8, 35.99, 28.32, 27.79, 35.88, 29.06, 36.28, 14.1, 36.63, 37.49, 26.9, 18.58, 38.48, 24.48, + 18.95, 33.55, 14.24, 29.04, 32.51, 25.63, 22.22, 19, 32.73, 15.16, 13.9, 27.2, 32.01, 29.27, 33, 13.74, + 20.42, 27.32, 18.23, 35.35, 28.48, 9.08, 24.62, 20.12, 35.26, 19.92, 31.02, 16.49, 12.16, 30.7, 31.22, + 34.65, 13.13, 27.51, 33.2, 31.57, 14.1, 33.42, 17.44, 10.12, 24.42, 9.82, 23.39, 30.93, 15.03, 21.67, + 31.09, 33.29, 22.61, 26.89, 23.48, 8.38, 27.81, 32.35, 23.84] -def covariance(x, y): +def covariance(x, y): n = len(x) - return dot(de_mean(x), de_mean(y)) / (n - 1) + return dot(de_mean(x), de_mean(y)) / (n - 1) -def correlation(x, y): +def correlation(x, y): stdev_x = standard_deviation(x) stdev_y = standard_deviation(y) if stdev_x > 0 and stdev_y > 0: - return covariance(x, y) / stdev_x / stdev_y + return covariance(x, y) / stdev_x / stdev_y else: - return 0 # if no variation, correlation is zero + return 0 # if no variation, correlation is zero -outlier = num_friends.index(100) # index of outlier +outlier = num_friends.index(100) # index of outlier num_friends_good = [x - for i, x in enumerate(num_friends) + for i, x in enumerate(num_friends) if i != outlier] daily_minutes_good = [x - for i, x in enumerate(daily_minutes) + for i, x in enumerate(daily_minutes) if i != outlier] - # alpha, beta = least_squares_fit(num_friends_good, daily_minutes_good) if __name__ == "__main__": - - 
print("num_points", len(num_friends)) - print("largest value", max(num_friends)) - print("smallest value", min(num_friends)) - print("second_smallest_value", sorted_values[1]) - print("second_largest_value", sorted_values[-2]) - print("mean(num_friends)", mean(num_friends)) - print("median(num_friends)", median(num_friends)) - print("quantile(num_friends, 0.10)", quantile(num_friends, 0.10)) - print("quantile(num_friends, 0.25)", quantile(num_friends, 0.25)) - print("quantile(num_friends, 0.75)", quantile(num_friends, 0.75)) - print("quantile(num_friends, 0.90)", quantile(num_friends, 0.90)) - print("mode(num_friends)", mode(num_friends)) - print("data_range(num_friends)", data_range(num_friends)) - print("variance(num_friends)", variance(num_friends)) - print("standard_deviation(num_friends)", standard_deviation(num_friends)) - print("interquartile_range(num_friends)", interquartile_range(num_friends)) - - print("covariance(num_friends, daily_minutes)", covariance(num_friends, daily_minutes)) - print("correlation(num_friends, daily_minutes)", correlation(num_friends, daily_minutes)) - print("correlation(num_friends_good, daily_minutes_good)", correlation(num_friends_good, daily_minutes_good)) + print("num_points", len(num_friends)) + print("largest value", max(num_friends)) + print("smallest value", min(num_friends)) + + print("second_smallest_value", sorted_values[1]) + print("second_largest_value", sorted_values[-2]) + + print("mean(num_friends)", mean(num_friends)) + print("median(num_friends)", median(num_friends)) + + print("quantile(num_friends, 0.10)", quantile(num_friends, 0.10)) + print("quantile(num_friends, 0.25)", quantile(num_friends, 0.25)) + print("quantile(num_friends, 0.75)", quantile(num_friends, 0.75)) + print("quantile(num_friends, 0.90)", quantile(num_friends, 0.90)) + + print("mode(num_friends)", mode(num_friends)) + print("data_range(num_friends)", data_range(num_friends)) + print("variance(num_friends)", variance(num_friends)) + 
print("standard_deviation(num_friends)", standard_deviation(num_friends)) + print("interquartile_range(num_friends)", interquartile_range(num_friends)) + + print("covariance(num_friends, daily_minutes)", covariance(num_friends, daily_minutes)) + print("correlation(num_friends, daily_minutes)", correlation(num_friends, daily_minutes)) + print("correlation(num_friends_good, daily_minutes_good)", correlation(num_friends_good, daily_minutes_good)) # print("R-squared value", r_squared(alpha, beta, num_friends_good, daily_minutes_good)) diff --git a/k_means_clustering/model.py b/k_means_clustering/model.py index 051cb4b..2e05b55 100644 --- a/k_means_clustering/model.py +++ b/k_means_clustering/model.py @@ -1,7 +1,7 @@ import random from k_means_clustering.data import inputs -from k_means_clustering.utils import KMeans, squared_clustering_errors, recolor_image, bottom_up_cluster, \ +from k_means_clustering.utils import KMeans, bottom_up_cluster, \ generate_clusters, get_values if __name__ == '__main__': diff --git a/k_means_clustering/utils.py b/k_means_clustering/utils.py index b6921d9..ef31e04 100644 --- a/k_means_clustering/utils.py +++ b/k_means_clustering/utils.py @@ -112,7 +112,7 @@ def get_merge_order(cluster): def bottom_up_cluster(inputs, distance_agg=min): # start with every input leaf cluster - clusters = [(input) for input in inputs] + clusters = [input for input in inputs] # as long as we have more than one cluster left... 
while len(clusters) > 1: diff --git a/Logistic Regression/banking.csv b/logistic_regression_banking/banking.csv similarity index 100% rename from Logistic Regression/banking.csv rename to logistic_regression_banking/banking.csv diff --git a/logistic_regression_banking/binary_logisitic_regression.py b/logistic_regression_banking/binary_logisitic_regression.py new file mode 100644 index 0000000..5a546bd --- /dev/null +++ b/logistic_regression_banking/binary_logisitic_regression.py @@ -0,0 +1,79 @@ +import matplotlib.pyplot as plt +import pandas as pd +import seaborn as sns +from sklearn.decomposition import PCA +from sklearn.linear_model import LogisticRegression +from sklearn.metrics import confusion_matrix, classification_report +from sklearn.model_selection import train_test_split + +plt.rc("font", size=14) +sns.set(style="white") +sns.set(style="whitegrid", color_codes=True) + +if __name__ == '__main__': + + data = pd.read_csv('banking.csv', header=0) + data = data.dropna() + print(data.shape) + print(list(data.columns)) + + # plot_data(data) + + # The prediction will be based on the variables selected in plot_data(); all other variables are dropped + + data.drop(data.columns[[0, 3, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19]], axis=1, inplace=True) + + # print(data.shape) + # print(list(data.columns)) + + # Data preprocessing + + """Dummy variables are variables with only two values: one or zero.""" + + data2 = pd.get_dummies(data, columns=['job', 'marital', 'default', 'housing', 'loan', 'poutcome']) + + # drop the unknown columns + data2.drop(data2.columns[[12, 16, 18, 21, 24]], axis=1, inplace=True) + + print(data2.columns) + + # plot the correlation between variables + # sns.heatmap(data2.corr()) + # plt.show() + + # split the data into training and test sets + X = data2.iloc[:, 1:] + y = data2.iloc[:, 0] + + X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0) + + print(X_train.shape) + + # Logistic Regression Model + clf = 
LogisticRegression(random_state=0) + clf.fit(X_train, y_train) + + # predict the test results and compute the confusion matrix + y_pred = clf.predict(X_test) + cm = confusion_matrix(y_test, y_pred) + print(cm) + + print('Accuracy: {:.2f}'.format(clf.score(X_test, y_test))) + + print(classification_report(y_test, y_pred)) + + pca = PCA(n_components=2).fit_transform(X) + X_train, X_test, y_train, y_test = train_test_split(pca, y, random_state=0) + + plt.figure(dpi=120) + plt.scatter(pca[y.values == 0, 0], pca[y.values == 0, 1], alpha=0.5, label='YES', s=2, color='navy') + plt.scatter(pca[y.values == 1, 0], pca[y.values == 1, 1], alpha=0.5, label='NO', s=2, color='darkorange') + plt.legend() + plt.title('Bank Marketing Data Set\nFirst Two Principal Components') + plt.xlabel('PC1') + plt.ylabel('PC2') + plt.gca().set_aspect('equal') + plt.show() + + + diff --git a/logistic_regression_banking/utils.py b/logistic_regression_banking/utils.py new file mode 100644 index 0000000..155c381 --- /dev/null +++ b/logistic_regression_banking/utils.py @@ -0,0 +1,40 @@ +import seaborn as sns +import matplotlib.pyplot as plt + +plt.rc("font", size=14) +sns.set(style="white") +sns.set(style="whitegrid", color_codes=True) + + +def plot_data(data): + # barplot for the dependent variable + sns.countplot(x='y', data=data, palette='hls') + plt.show() + + # check the missing values + print(data.isnull().sum()) + + # customer distribution plot + sns.countplot(y='job', data=data) + plt.show() + + # customer marital status distribution + sns.countplot(x='marital', data=data) + plt.show() + + # barplot for credit in default + sns.countplot(x='default', data=data) + plt.show() + + # barplot for housing loan + sns.countplot(x='housing', data=data) + plt.show() + + # barplot for personal loan + sns.countplot(x='loan', data=data) + plt.show() + + # barplot for previous marketing campaign outcome + sns.countplot(x='poutcome', data=data) + plt.show() + + diff --git 
a/multiple_regression/utils.py b/multiple_regression/utils.py index 898d86c..a10176b 100644 --- a/multiple_regression/utils.py +++ b/multiple_regression/utils.py @@ -3,7 +3,7 @@ from helpers.gradient_descent import minimize_stochastic from helpers.linear_algebra import dot, vector_add -from helpers.probability import normal_cdf +from helpers.probabilty import normal_cdf from helpers.stats import de_mean diff --git a/NaiveBayesClassifier/model.py b/naive_bayes_classfier/model.py similarity index 96% rename from NaiveBayesClassifier/model.py rename to naive_bayes_classfier/model.py index eff37eb..80fc4e5 100644 --- a/NaiveBayesClassifier/model.py +++ b/naive_bayes_classfier/model.py @@ -2,7 +2,7 @@ import re from collections import Counter import random -from naive_bayes_classifier.naivebayesclassifier import NaiveBayesClassifier +from naive_bayes_classfier.naivebayesclassifier import NaiveBayesClassifier def split_data(data, prob): diff --git a/NaiveBayesClassifier/naivebayesclassifier.py b/naive_bayes_classfier/naivebayesclassifier.py similarity index 87% rename from NaiveBayesClassifier/naivebayesclassifier.py rename to naive_bayes_classfier/naivebayesclassifier.py index 66f2107..dd09a34 100644 --- a/NaiveBayesClassifier/naivebayesclassifier.py +++ b/naive_bayes_classfier/naivebayesclassifier.py @@ -1,4 +1,4 @@ -from naive_bayes_classifier.utils import count_words, word_probabilities, spam_probability +from naive_bayes_classfier.utils import count_words, word_probabilities, spam_probability class NaiveBayesClassifier: diff --git a/NaiveBayesClassifier/utils.py b/naive_bayes_classfier/utils.py similarity index 100% rename from NaiveBayesClassifier/utils.py rename to naive_bayes_classfier/utils.py diff --git a/neural_network/model.py b/neural_network/model.py index 8aa86a1..e7b40f8 100644 --- a/neural_network/model.py +++ b/neural_network/model.py @@ -53,11 +53,11 @@ def predict(input): .@@@.""") print([round(x, 2) for x in - predict([0, 1, 1, 1, 0, # .@@@. 
- 0, 0, 0, 1, 1, # ...@@ - 0, 0, 1, 1, 0, # ..@@. - 0, 0, 0, 1, 1, # ...@@ - 0, 1, 1, 1, 0])]) # .@@@. + predict([0, 1, 1, 1, 0, # .@@@. + 0, 0, 0, 1, 1, # ...@@ + 0, 0, 1, 1, 0, # ..@@. + 0, 0, 0, 1, 1, # ...@@ + 0, 1, 1, 1, 0])]) # .@@@. print() print(""".@@@. @@ -67,9 +67,9 @@ def predict(input): .@@@.""") print([round(x, 2) for x in - predict([0, 1, 1, 1, 0, # .@@@. - 1, 0, 0, 1, 1, # @..@@ - 0, 1, 1, 1, 0, # .@@@. - 1, 0, 0, 1, 1, # @..@@ - 0, 1, 1, 1, 0])]) # .@@@. + predict([0, 1, 1, 1, 0, # .@@@. + 1, 0, 0, 1, 1, # @..@@ + 0, 1, 1, 1, 0, # .@@@. + 1, 0, 0, 1, 1, # @..@@ + 0, 1, 1, 1, 0])]) # .@@@. print() diff --git a/regression_intro.py b/regression_intro.py index a247e7a..5ab2c20 100644 --- a/regression_intro.py +++ b/regression_intro.py @@ -1,10 +1,11 @@ -import pandas as pd -import quandl, math, datetime -import numpy as np -from sklearn import preprocessing, model_selection, svm -from sklearn.linear_model import LinearRegression +import datetime +import math import matplotlib.pyplot as plt +import numpy as np +import quandl from matplotlib import style +from sklearn import preprocessing, model_selection +from sklearn.linear_model import LinearRegression # Style file for plotting graph style.use('ggplot') @@ -12,9 +13,9 @@ # Retrieve dataframe from Quandl df = quandl.get('WIKI/GOOGL') -df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume',]] +df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume', ]] # High Low Change => Volatility of the stock -df['HL_PCT'] = (df['Adj. High'] - df['Adj. Low']) / df['Adj. Low'] * 100.0 +df['HL_PCT'] = (df['Adj. High'] - df['Adj. Low']) / df['Adj. Low'] * 100.0 # Percentage Change => Volatility change df['PCT_change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. 
Open'] * 100.0 @@ -69,10 +70,10 @@ next_unix = last_unix + one_day_in_secs for i in forecast_set: - next_date = datetime.datetime.fromtimestamp(next_unix) - next_unix += one_day_in_secs - # loc is used for indexing - df.loc[next_date] = [np.nan for _ in range(len(df.columns)-1)] + [i] + next_date = datetime.datetime.fromtimestamp(next_unix) + next_unix += one_day_in_secs + # loc is used for indexing + df.loc[next_date] = [np.nan for _ in range(len(df.columns) - 1)] + [i] print(df.tail()) diff --git a/requirements.txt b/requirements.txt deleted file mode 100644 index 1e167e4..0000000 --- a/requirements.txt +++ /dev/null @@ -1,226 +0,0 @@ -# Library dependencies for the python code. You need to install these with -# `pip install -r requirements.txt` before you can run this. - -absl-py==0.3.0 -alabaster==0.7.11 -anaconda-client==1.7.1 -anaconda-navigator==1.8.7 -anaconda-project==0.8.2 -appdirs==1.4.3 -asn1crypto==0.24.0 -astor==0.7.1 -astroid==2.0.2 -astropy==3.0.4 -atomicwrites==1.1.5 -attrs==18.1.0 -Automat==0.7.0 -autopep8==1.3.5 -Babel==2.6.0 -backcall==0.1.0 -backports.shutil-get-terminal-size==1.0.0 -beautifulsoup4==4.6.1 -bitarray==0.8.3 -bkcharts==0.2 -blaze==0.11.3 -bleach==2.1.3 -bokeh==0.13.0 -boto==2.49.0 -Bottleneck==1.2.1 -certifi==2018.4.16 -cffi==1.11.5 -chardet==3.0.4 -click==6.7 -cloudpickle==0.5.3 -clyent==1.2.2 -colorama==0.3.9 -conda==4.5.9 -conda-build==3.12.1 -conda-verify==3.1.0 -constantly==15.1.0 -contextlib2==0.5.5 -cryptography==2.3 -cryptography-vectors==2.3 -cycler==0.10.0 -Cython==0.28.5 -cytoolz==0.9.0.1 -dask==0.18.2 -datashape==0.5.4 -decorator==4.3.0 -distributed==1.22.1 -docopt==0.4.0 -docutils==0.14 -entrypoints==0.2.3 -enum34==1.1.6 -et-xmlfile==1.0.1 -fastcache==1.0.2 -filelock==3.0.4 -Flask==1.0.2 -Flask-Cors==3.0.6 -future==0.16.0 -gast==0.2.0 -gevent==1.3.5 -glob2==0.6 -gmpy2==2.0.8 -greenlet==0.4.14 -grpcio==1.12.1 -h5py==2.8.0 -heapdict==1.0.0 -html5lib==1.0.1 -hyperlink==18.0.0 -idna==2.7 -imageio==2.3.0 -imagesize==1.0.0 
-incremental==17.5.0 -ipykernel==4.8.2 -ipython==6.5.0 -ipython-genutils==0.2.0 -ipywidgets==7.4.0 -isort==4.3.4 -itsdangerous==0.24 -jdcal==1.4 -jedi==0.12.1 -jeepney==0.3.1 -Jinja2==2.10 -jsonschema==2.6.0 -jupyter==1.0.0 -jupyter-client==5.2.3 -jupyter-console==5.2.0 -jupyter-core==4.4.0 -jupyterlab==0.33.8 -jupyterlab-launcher==0.11.2 -Keras==2.2.2 -Keras-Applications==1.0.4 -Keras-Preprocessing==1.0.2 -keyring==13.2.1 -kiwisolver==1.0.1 -lazy-object-proxy==1.3.1 -llvmlite==0.24.0 -locket==0.2.0 -lxml==4.2.4 -mandrill==1.0.57 -Markdown==2.6.11 -MarkupSafe==1.0 -matplotlib==3.0.2 -mccabe==0.6.1 -mistune==0.8.3 -mkl-fft==1.0.4 -mkl-random==1.0.1 -more-itertools==4.3.0 -mpmath==1.0.0 -msgpack==0.5.6 -multipledispatch==0.5.0 -navigator-updater==0.2.1 -nbconvert==5.3.1 -nbformat==4.4.0 -networkx==2.1 -nltk==3.3 -nose==1.3.7 -numba==0.39.0 -numexpr==2.6.6 -numpy==1.15.4 -numpydoc==0.8.0 -odo==0.5.1 -olefile==0.45.1 -openpyxl==2.5.5 -packaging==17.1 -pandas==0.23.4 -pandocfilters==1.4.2 -parso==0.3.1 -partd==0.3.8 -path.py==11.0.1 -pathlib2==2.3.2 -patsy==0.5.0 -pep8==1.7.1 -pexpect==4.6.0 -pickleshare==0.7.4 -Pillow==5.2.0 -pkginfo==1.4.2 -pluggy==0.7.1 -ply==3.11 -prometheus-client==0.3.1 -prompt-toolkit==1.0.15 -protobuf==3.6.0 -psutil==5.4.6 -ptyprocess==0.6.0 -py==1.5.4 -pyasn1==0.4.4 -pyasn1-modules==0.2.2 -pycodestyle==2.4.0 -pycosat==0.6.3 -pycparser==2.18 -pycurl==7.43.0.2 -pydocstyle==2.1.1 -pyflakes==2.0.0 -Pygments==2.2.0 -pylint==2.1.1 -pyodbc==4.0.23 -pyOpenSSL==18.0.0 -pyparsing==2.2.0 -Pypubsub==4.0.0 -PySocks==1.6.8 -pytest==3.7.1 -pytest-arraydiff==0.2 -pytest-astropy==0.4.0 -pytest-doctestplus==0.1.3 -pytest-openfiles==0.3.0 -pytest-remotedata==0.3.0 -python-dateutil==2.7.3 -python-language-server==0.19.0 -pytz==2018.5 -PyWavelets==0.5.2 -pyzmq==17.1.0 -QtAwesome==0.4.4 -qtconsole==4.3.1 -QtPy==1.4.2 -ruamel-yaml==0.15.46 -scikit-image==0.14.0 -scikit-learn==0.20.2 -scipy==1.2.0 -seaborn==0.9.0 -SecretStorage==3.0.1 -Send2Trash==1.5.0 
-service-identity==17.0.0 -simplegeneric==0.8.1 -singledispatch==3.4.0.3 -six==1.11.0 -sklearn==0.0 -snowballstemmer==1.2.1 -sortedcollections==1.0.1 -sortedcontainers==2.0.4 -Sphinx==1.7.6 -sphinxcontrib-websupport==1.1.0 -spyder==3.3.0 -spyder-kernels==0.2.4 -SQLAlchemy==1.2.10 -statsmodels==0.9.0 -sympy==1.2 -tables==3.4.4 -tblib==1.3.2 -tensorboard==1.10.0 -tensorflow==1.9.0 -tensorflow-gpu==1.4.0 -tensorflow-tensorboard==0.4.0 -termcolor==1.1.0 -terminado==0.8.1 -testpath==0.3.1 -toolz==0.9.0 -tornado==5.1 -tqdm==4.28.1 -traitlets==4.3.2 -Twisted==18.7.0 -typed-ast==1.1.0 -typing==3.6.4 -unicodecsv==0.14.1 -urllib3==1.23 -wcwidth==0.1.7 -webencodings==0.5.1 -Werkzeug==0.14.1 -widgetsnbextension==3.4.0 -wrapt==1.10.11 -xgboost==0.72 -xlrd==1.1.0 -XlsxWriter==1.0.5 -xlwt==1.3.0 -yapf==0.22.0 -zict==0.1.3 -zope.interface==4.5.0 From c71d1099142a9be773c478ba52fb53127e4d035e Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Sun, 27 Jan 2019 12:03:28 +0530 Subject: [PATCH 04/98] added new requirements.txt --- requirements.txt | 53 ++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 53 insertions(+) create mode 100644 requirements.txt diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000..d01840b --- /dev/null +++ b/requirements.txt @@ -0,0 +1,53 @@ +Keras==2.2.4 +Keras-Application==1.0.6 +Keras-Preprocessing==1.0.5 +PySocks==1.6.8 +PyYAML==3.13 +Pygments==2.3.1 +Quandl==3.4.5 +asn1crypto==0.24.0 +backcall==0.1.0 +beautifulsoup4==4.6.3 +certifi==2018.8.24 +cffi==1.11.5 +chardet==3.0.4 +cryptography==2.3.1 +cycler==0.10.0 +decorator==4.3.2 +h5py==2.9.0 +idna==2.7 +inflection==0.3.1 +ipython==7.2.0 +ipython-genutils==0.2.0 +jedi==0.13.2 +kiwisolver==1.0.1 +matplotlib==3.0.0 +mkl-fft==1.0.6 +mkl-random==1.0.1 +more-itertools==5.0.0 +numpy==1.15.2 +pandas==0.23.4 +parso==0.3.2 +patsy==0.5.0 +pexpect==4.6.0 +pickleshare==0.7.5 +pip==10.0.1 +prompt-toolkit==2.0.7 +ptyprocess==0.6.0 +pyOpenSSL==18.0.0 +pycparser==2.19 
+pyparsing==2.2.1 +python-dateutil==2.7.3 +pytz==2018.5 +requests==2.19.1 +scikit-learn==0.20.0 +scipy==1.1.0 +seaborn==0.9.0 +setuptools==40.2.0 +six==1.11.0 +statsmodels==0.9.0 +tornado==5.1.1 +traitlets==4.3.2 +urllib3==1.23 +wcwidth==0.1.7 +wheel==0.31.1 From d0f160957b29e199065684006e8ade1e2ae82892 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Sun, 27 Jan 2019 22:08:17 +0530 Subject: [PATCH 05/98] Update requirements.txt --- requirements.txt | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/requirements.txt b/requirements.txt index d01840b..0070d0b 100644 --- a/requirements.txt +++ b/requirements.txt @@ -2,7 +2,7 @@ Keras==2.2.4 Keras-Application==1.0.6 Keras-Preprocessing==1.0.5 PySocks==1.6.8 -PyYAML==3.13 +pyyaml>=4.2b1 Pygments==2.3.1 Quandl==3.4.5 asn1crypto==0.24.0 @@ -39,7 +39,7 @@ pycparser==2.19 pyparsing==2.2.1 python-dateutil==2.7.3 pytz==2018.5 -requests==2.19.1 +requests>=2.20.0 scikit-learn==0.20.0 scipy==1.1.0 seaborn==0.9.0 From 8235fadd53edcd326ae6c71e55cff719c1786571 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Tue, 29 Jan 2019 09:07:11 +0530 Subject: [PATCH 06/98] Update README.md --- README.md | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index 155c6b2..afb427e 100644 --- a/README.md +++ b/README.md @@ -29,10 +29,12 @@ environment with the exact version of Python used for development along with all dependencies needed to run MLwP. 1. [Download and install Conda](https://conda.io/docs/download.html). -2. Create a Conda environment with Python 3. +2. Create a Conda environment with Python 3. + +(**Note**: enter ```cd ~``` to go on **$HOME** , then perform these commands) ``` - conda create -n *your env name* python=3.5 + conda create --name *your env name* python=3.5 ``` 3. Now activate the Conda environment. @@ -44,7 +46,7 @@ dependencies needed to run MLwP. 4. Install the required dependencies. 
``` - ./scripts/install_requirements.sh + conda install --yes --file *path to requirements.txt* ## How good is the code ? * It is well tested From 6be2ecd0ab7ecb7af65cc37a12788ed6e572d996 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Tue, 29 Jan 2019 09:17:20 +0530 Subject: [PATCH 07/98] Update README.md --- README.md | 81 ++++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 80 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index afb427e..490de6d 100644 --- a/README.md +++ b/README.md @@ -36,17 +36,96 @@ dependencies needed to run MLwP. ``` conda create --name *your env name* python=3.5 ``` + + You will get the following, mlwp-test is the env name used in this example + + ``` + Solving environment: done + +## Package Plan ## + + environment location: /home/user/anaconda3/envs/mlwp-test + + added / updated specs: + - python=3.5 + + +The following NEW packages will be INSTALLED: + + ca-certificates: 2018.12.5-0 + certifi: 2018.8.24-py35_1 + libedit: 3.1.20181209-hc058e9b_0 + libffi: 3.2.1-hd88cf55_4 + libgcc-ng: 8.2.0-hdf63c60_1 + libstdcxx-ng: 8.2.0-hdf63c60_1 + ncurses: 6.1-he6710b0_1 + openssl: 1.0.2p-h14c3975_0 + pip: 10.0.1-py35_0 + python: 3.5.6-hc3d631a_0 + readline: 7.0-h7b6447c_5 + setuptools: 40.2.0-py35_0 + sqlite: 3.26.0-h7b6447c_0 + tk: 8.6.8-hbc83047_0 + wheel: 0.31.1-py35_0 + xz: 5.2.4-h14c3975_4 + zlib: 1.2.11-h7b6447c_3 + +Proceed ([y]/n)? *Press y* + +Preparing transaction: done +Verifying transaction: done +Executing transaction: done +# +# To activate this environment, use: +# > source activate mlwp-test +# +# To deactivate an active environment, use: +# > source deactivate +# + + ``` + The environment is successfully created. 3. Now activate the Conda environment. 
``` source activate *your env name* ``` + You will get the following + + ``` + (mlwp-test) amogh@hp15X34:~$ + ``` + Enter `conda list` to get the list of available packages + + ``` + (mlwp-test) amogh@hp15X34:~$ conda list + # packages in environment at /home/amogh/anaconda3/envs/mlwp-test: + # + # Name Version Build Channel + ca-certificates 2018.12.5 0 + certifi 2018.8.24 py35_1 + libedit 3.1.20181209 hc058e9b_0 + libffi 3.2.1 hd88cf55_4 + libgcc-ng 8.2.0 hdf63c60_1 + libstdcxx-ng 8.2.0 hdf63c60_1 + ncurses 6.1 he6710b0_1 + openssl 1.0.2p h14c3975_0 + pip 10.0.1 py35_0 + python 3.5.6 hc3d631a_0 + readline 7.0 h7b6447c_5 + setuptools 40.2.0 py35_0 + sqlite 3.26.0 h7b6447c_0 + tk 8.6.8 hbc83047_0 + wheel 0.31.1 py35_0 + xz 5.2.4 h14c3975_4 + zlib 1.2.11 h7b6447c_3 + ``` 4. Install the required dependencies. ``` - conda install --yes --file *path to requirements.txt* + (mlwp-test) amogh@hp15X34:~$ conda install --yes --file /home/amogh/Work/ML_Projects/Machine-Learning-with- Python/requirements.txt ## How good is the code ? 
* It is well tested From d34d39690f78f64a8dbe2425c73f72245b2badd3 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Tue, 29 Jan 2019 09:17:45 +0530 Subject: [PATCH 08/98] Update requirements.txt --- requirements.txt | 8 -------- 1 file changed, 8 deletions(-) diff --git a/requirements.txt b/requirements.txt index 0070d0b..3c5a58a 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,8 +1,6 @@ Keras==2.2.4 -Keras-Application==1.0.6 Keras-Preprocessing==1.0.5 PySocks==1.6.8 -pyyaml>=4.2b1 Pygments==2.3.1 Quandl==3.4.5 asn1crypto==0.24.0 @@ -13,26 +11,20 @@ cffi==1.11.5 chardet==3.0.4 cryptography==2.3.1 cycler==0.10.0 -decorator==4.3.2 h5py==2.9.0 idna==2.7 inflection==0.3.1 ipython==7.2.0 -ipython-genutils==0.2.0 jedi==0.13.2 kiwisolver==1.0.1 matplotlib==3.0.0 -mkl-fft==1.0.6 -mkl-random==1.0.1 more-itertools==5.0.0 numpy==1.15.2 pandas==0.23.4 -parso==0.3.2 patsy==0.5.0 pexpect==4.6.0 pickleshare==0.7.5 pip==10.0.1 -prompt-toolkit==2.0.7 ptyprocess==0.6.0 pyOpenSSL==18.0.0 pycparser==2.19 From 459bd980bb0a09b1f28271555a3a4977762bdd0d Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Tue, 29 Jan 2019 09:19:44 +0530 Subject: [PATCH 09/98] updated readme, verified installation. --- README.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 490de6d..23ec27f 100644 --- a/README.md +++ b/README.md @@ -125,7 +125,8 @@ Executing transaction: done 4. Install the required dependencies. ``` - (mlwp-test) amogh@hp15X34:~$ conda install --yes --file /home/amogh/Work/ML_Projects/Machine-Learning-with- Python/requirements.txt + (mlwp-test) amogh@hp15X34:~$ conda install --yes --file *path to requirements.txt* + ``` ## How good is the code ? 
* It is well tested From b33521f0fadfda1f7acde5deeb9346de033ba140 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Sun, 3 Feb 2019 10:18:00 +0530 Subject: [PATCH 10/98] Update README.md --- README.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/README.md b/README.md index 23ec27f..c217f24 100644 --- a/README.md +++ b/README.md @@ -127,6 +127,10 @@ Executing transaction: done ``` (mlwp-test) amogh@hp15X34:~$ conda install --yes --file *path to requirements.txt* ``` + +5. In case you are not able to install the packages or getting `PackagesNotFoundError` +Use the following command ` conda install -c conda-forge *list of packages separated by space*`. For more info, refer issue #3 **Unable to install requirements** + ## How good is the code ? * It is well tested From a7c360edf88eb39220de72e98423e938c83d03c1 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Sun, 3 Feb 2019 10:20:19 +0530 Subject: [PATCH 11/98] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index c217f24..786f843 100644 --- a/README.md +++ b/README.md @@ -129,7 +129,7 @@ Executing transaction: done ``` 5. In case you are not able to install the packages or getting `PackagesNotFoundError` -Use the following command ` conda install -c conda-forge *list of packages separated by space*`. For more info, refer issue #3 **Unable to install requirements** +Use the following command ` conda install -c conda-forge *list of packages separated by space*`. For more info, refer issue [#3](https://github.com/devAmoghS/Machine-Learning-with-Python/issues/3) **Unable to install requirements** ## How good is the code ? 
From 8f4d1fee59f4d77c247330fe29978bd27cbc665c Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Sun, 3 Mar 2019 12:52:06 +0530 Subject: [PATCH 12/98] Create hparams_grid_search_keras_nn.py --- hparams_grid_search_keras_nn.py | 102 ++++++++++++++++++++++++++++++++ 1 file changed, 102 insertions(+) create mode 100644 hparams_grid_search_keras_nn.py diff --git a/hparams_grid_search_keras_nn.py b/hparams_grid_search_keras_nn.py new file mode 100644 index 0000000..561a631 --- /dev/null +++ b/hparams_grid_search_keras_nn.py @@ -0,0 +1,102 @@ +import numpy as np +import pandas as pd +from keras import Sequential +from keras.layers import Dense +from keras.wrappers.scikit_learn import KerasClassifier +from sklearn.model_selection import GridSearchCV +from sklearn.model_selection import train_test_split + +DATA_FILE = '' + +feature_cols = ['feat1', 'feat2', 'feat3', 'feat4', 'feat5', 'feat6'] +labels = ['y'] + + +def load_data(filepath): + data = pd.read_csv(filepath) + return data + + +def describe_data(data, name): + print('\nGetting the summary for ' + name + '\n') + print('Dataset Length:', len(data)) + print('Dataset Shape:', data.shape) + print(data.columns) + print(data.dtypes) + + +def create_model(): + model = Sequential() + model.add(Dense(12, input_dim=len(feature_cols), kernel_initializer='uniform', activation='relu')) + model.add(Dense(8, kernel_initializer='uniform', activation='relu')) + model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid')) + + model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) + + return model + + +if __name__ == '__main__': + + data_df = load_data(DATA_FILE) + + data_df = data_df.dropna() + print(data_df.isnull().sum(axis=0)) + + X_data = data_df[feature_cols] + y_data = data_df[['y']] + + # seed for reproducibility + seed = 7 + np.random.seed(seed=seed) + + # train test split + X, X_test, y, y_test = train_test_split(X_data, y_data, test_size=.20, random_state=42) + + # train val split + X_train, X_val, 
y_train, y_val = train_test_split(X, y, test_size=.20, random_state=42) + + # summarize the datasets + describe_data(X_train, name="X_train") + describe_data(X_val, name="X_val") + describe_data(X_test, name="X_test") + describe_data(y_train, name="y_train") + describe_data(y_val, name="y_val") + describe_data(y_test, name="y_test") + + # create model + model = KerasClassifier(build_fn=create_model) + + # hyperparameter optimization + batch_size = [10, 20, 40, 60, 80, 100] + epochs = [10, 50, 100] + learn_rate = [0.001, 0.01, 0.1, 0.2, 0.3] + momentum = [0.0, 0.2, 0.4, 0.6, 0.8, 0.9] + weight_constraint = [1, 2, 3, 4, 5] + dropout_rate = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9] + neurons = [1, 5, 10, 15, 20, 25, 30] + + param_grid = dict(batch_size=batch_size, epochs=epochs) + grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1) + grid_result = grid.fit(X=X_train, y=y_train) + + # summarize results + print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_)) + means = grid_result.cv_results_['mean_test_score'] + stds = grid_result.cv_results_['std_test_score'] + params = grid_result.cv_results_['params'] + for mean, stdev, param in zip(means, stds, params): + print("%f (%f) with: %r" % (mean, stdev, param)) + + + + + + + + + + + + + From ea0794c1da334e461bf75fd93ea9f858d4c570f2 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Sun, 3 Mar 2019 12:54:20 +0530 Subject: [PATCH 13/98] Update README.md --- README.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 786f843..fb555b9 100644 --- a/README.md +++ b/README.md @@ -1,7 +1,7 @@ # Machine-Learning-with-Python [![star this repo](http://githubbadges.com/star.svg?user=devAmoghS&repo=Machine-Learning-with-Python)](http://github.com/ddavison/github-badges) [![fork this repo](http://githubbadges.com/fork.svg?user=devAmoghS&repo=Machine-Learning-with-Python)](http://github.com/ddavison/github-badges/fork) ![alt 
text](https://media.istockphoto.com/vectors/machine-learning-3-step-infographic-artificial-intelligence-machine-vector-id962219860?k=6&m=962219860&s=612x612&w=0&h=yricYyUqZbILMHp3IvtenS3xbRDhu1w1u5kk2az5tbo=) -## Small scale machine learning projects to understand the core concepts +## Small scale machine learning projects to understand the core concepts (order: oldest to newest) * Topic Modelling using **Latent Dirichlet Allocation** with newsgroups20 dataset, implemented with Python and Scikit-Learn * Implemented a simple **neural network** built with Keras on MNIST dataset * Stock Price Forecasting on Google using **Linear Regression** @@ -21,6 +21,7 @@ * Sentence generation using n-grams * Sentence generation using **Grammars and Automata Theory; Gibbs Sampling** * Topic Modelling using Latent Dirichlet Analysis (LDA) +* Wrapper for using Scikit-Learn's **GridSearchCV** for a **Keras Neural Network** ## Installation notes MLwP is built using Python 3.5. The easiest way to set up a compatible From c2094f39c1ffc5014d045e6c1729ee2fac4df6c7 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Mon, 3 Jun 2019 18:33:39 +0530 Subject: [PATCH 14/98] Add files via upload --- .../__pycache__/data.cpython-35.pyc | Bin 0 -> 843 bytes .../__pycache__/utils.cpython-35.pyc | Bin 0 -> 3487 bytes recommender_systems/data.py | 17 +++++ recommender_systems/model.py | 36 ++++++++++ recommender_systems/utils.py | 65 ++++++++++++++++++ 5 files changed, 118 insertions(+) create mode 100644 recommender_systems/__pycache__/data.cpython-35.pyc create mode 100644 recommender_systems/__pycache__/utils.cpython-35.pyc create mode 100644 recommender_systems/data.py create mode 100644 recommender_systems/model.py create mode 100644 recommender_systems/utils.py diff --git a/recommender_systems/__pycache__/data.cpython-35.pyc b/recommender_systems/__pycache__/data.cpython-35.pyc new file mode 100644 index 0000000000000000000000000000000000000000..06a51c2c9f5c8a6a1e8ee32f6ff692c6d0a25b42 GIT 
binary patch literal 843 zcmYjQ+iuh_5S?wim+qx0ExiCO4Y!A?g8c!6P}_$JvDMPPSxA*PW4vZ#JF=ak+y3!WHgAX zi}1UEj{%=F7)l}LBf#UvJO+H)bam30)FGv((Gv+Nmk75&{^2C#NmH6a$*P|idba8u zpT~>D{`Z^4Uc^M103+I$aC`cSb*?yCThzUCCDThDIBQ>>b1&G&2Yye+ZNc4wj$oZ; zn-|=BZqQNS+TMQp$hs42q`f?6y{Yvn$^K7n+NE>ElWW9wV zQxTX3Cj>F(YUb~Cy?$|a#=2MBFGQZ#dqg4~*P2q&oEuqkN#KFzVJ=9-q8(DLt5|2k z=e7)WVnpe9o*EIpS-0@@2!$yARX*S@s4ps0gi$6?86}0u1lyeO;znQ@Riv@bXr`XE zClyVtKJaBxSQpZJk{dMe+Rcg-)kd&U5@8gi!s+$&d1X# zn{%h9&VD5tKaQRK9m&{<%xWzR3U}u#AA}|V_3@^tmc6$$`7#&p*5Z$FMB=cu)7qn* NkW5zZkj^3begl?i=F|WH literal 0 HcmV?d00001 diff --git a/recommender_systems/__pycache__/utils.cpython-35.pyc b/recommender_systems/__pycache__/utils.cpython-35.pyc new file mode 100644 index 0000000000000000000000000000000000000000..4e957afc1e868600957e6df01f1b5edd171b91b2 GIT binary patch literal 3487 zcmd6qOK%)S5XY-$_U(0S2gk5^1A%xzv7>;5K$eLhAh=kOj1*~!wHnW~?XhQf?C#mb zw$@x6CAo4#;uCP;$cLFLr+kH+_*c!o5^R}BIWThfcK3AGR9F3~x;-~JS&M@5!#}G; zf796GqJ0;|{(>sRpQ4zknYSH^6}25|I;8lWOR-Ds0yPWh7bq;!6U;4ASfa2jZ8Rzr zR;690utwp8v@uwxaFW7v^h8m!D%~jxr=^Q&idMau7tn!LJe5s^dc*I=X&415x^9@H zfx|EH@NWs_TNHa6RZ7&upQ2-jQbpUS^AVB}d6mDNJyvvS)l(~;Yk1>Xq3x$z&4S%C zspn+sQKlaML8gpXFTY#dO4@qSZzr2ui}w$LE#I{7n`B!Dsa*`U-i@_yI#FkH(dZy) zw{<7frezOos@rz4n?|u+*geQ*g2YB0-Lg?TihUEM2Q1ynW|V<9=#`DJR$T16sGbpR zC^k*#&{NscJ#1jrb5{a|*Zi=Y0?Iomb{^HH%v-xkM~Y4IoM@*+<|@0kjrnZobT znrubJM6CHW8n0P1X9V_Xz@}{1BcZ3~Fy`R^?wssn<*a&@k=0q|(CS~I)3^F zAqs+F{vDT0iC8B5P>?p<;?RaO^n)66fopg}*<_M#!SJ~q!>l%92Tm#XhezHp=dfr7 zMJ!QORp-!aCpH~0CxUD%O-^z`=I-c&Ax=yKBelxX*!Bo4Fw+1oE2f5i-#J-f*N;rD z0=rA7$SO;*-(C;>aCd+uHc4&Pp~hsSZ6in~{AF zO%Z8p&f^PNku}NL^pp(c%3L_{%hRkGV*y~CS=U@bD`0o>!!8{n^9?#c=9yw=4P)@% z)VQ2;dO+sQs9A>iJ~f}8P0e!&=Ehm%G-q;h1_9NILc?IdrW0dh*>txP?RE7@p)D|7 z#F`OIw_dU2K+b7E8BRrS43wG0s@Hfj7ch2>RY9S{Pydq|IZlQN1WJa}sCf^SSI905 zR38LJlrlw^dCC>8c-uMP&e&)_1&s&PKoHxYfkl0y&NET0@b`h_B(!+(5UJnM0YKn| z(@lZ)?^5b;h#^_v?7FwMC?pi=s6_kMiJe}g#=W&!Ix17IK)o^*8@&>J;V$icMi2Hb 
z(}UFI7uy9omO#Zp0S=#_EeQ;H3!ed;=7^`Qj(C;>`zhkPAh~Z9`H3^!TXx<&Hk>oUu?&c;Am?-Jk@r*R?xgkFsnne_b%HBcMkD2*V%4st-qhf%^NNP^YF6Bc6F-%-BUom| zi0=zt#V|+a1Fj^kHKI-scSGF@y2j{EI#MZ{7^!a!Cp3w1!w2zzUKWuH6ME-PR+PVH zc`B*8x~yi@oO8n|se*G}nVT35BvGQlRvS|M7dj(~LF&&!u^f4FmahFM&MDJTg?hO7 z;TUu{!uA)4aRqds*`;F_#9x%DRXVB>!0uHU@R|Vr0|37Qz`<{cwu=Jz;B02V<<`l7 zV_A{RZ#-=ba9lnsa%n1!0bcDNVsAG^eF=l~DQ9Q4{vVW^>rl(QkE+jh&O@0;845}C zd6e75>So=d{B`H&?^lI6T7NShmIc@wyEVt^6LW(B`N(rn?*ZKC}=L);l6@$Z5(e8um!fvdW`FdqJV(V(! QnRe@ida+)uSEr``0RbSu3;+NC literal 0 HcmV?d00001 diff --git a/recommender_systems/data.py b/recommender_systems/data.py new file mode 100644 index 0000000..aa3b9fb --- /dev/null +++ b/recommender_systems/data.py @@ -0,0 +1,17 @@ +users_interests = [ + ["Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"], + ["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"], + ["Python", "scikit-learn", "scipy", "numpy", "statsmodels", "pandas"], + ["R", "Python", "statistics", "regression", "probability"], + ["machine learning", "regression", "decision trees", "libsvm"], + ["Python", "R", "Java", "C++", "Haskell", "programming languages"], + ["statistics", "probability", "mathematics", "theory"], + ["machine learning", "scikit-learn", "Mahout", "neural networks"], + ["neural networks", "deep learning", "Big Data", "artificial intelligence"], + ["Hadoop", "Java", "MapReduce", "Big Data"], + ["statistics", "R", "statsmodels"], + ["C++", "deep learning", "artificial intelligence", "probability"], + ["pandas", "R", "Python"], + ["databases", "HBase", "Postgres", "MySQL", "MongoDB"], + ["libsvm", "regression", "support vector machines"] +] diff --git a/recommender_systems/model.py b/recommender_systems/model.py new file mode 100644 index 0000000..7eb0028 --- /dev/null +++ b/recommender_systems/model.py @@ -0,0 +1,36 @@ +from functools import partial + +from recommender_systems.data import users_interests +from recommender_systems.utils import make_user_interest_vector, cosine_similarity, 
most_similar_users_to, \ + user_based_suggestions, most_similar_interests_to, item_based_suggestions + +if __name__ == '__main__': + unique_interests = sorted(list({interest + for user_interests in users_interests + for interest in user_interests})) + + print("unique interests") + print(unique_interests) + + user_interest_matrix = list(map(partial(make_user_interest_vector, unique_interests), users_interests)) + + user_similarities = [[cosine_similarity(interest_vector_i, interest_vector_j) + for interest_vector_j in user_interest_matrix] + for interest_vector_i in user_interest_matrix] + + print(most_similar_users_to(user_similarities, 0)) + + print(user_based_suggestions(user_similarities, users_interests, 0)) + + # item-based + interest_user_matrix = [[user_interest_vector[j] + for user_interest_vector in user_interest_matrix] + for j, _ in enumerate(unique_interests)] + + interest_similarities = [[cosine_similarity(user_vector_i, user_vector_j) + for user_vector_j in interest_user_matrix] + for user_vector_i in interest_user_matrix] + + print(most_similar_interests_to(interest_similarities, 0, unique_interests)) + + print(item_based_suggestions(interest_similarities, users_interests, user_interest_matrix, unique_interests, 0)) \ No newline at end of file diff --git a/recommender_systems/utils.py b/recommender_systems/utils.py new file mode 100644 index 0000000..7157ad9 --- /dev/null +++ b/recommender_systems/utils.py @@ -0,0 +1,65 @@ +import math +from collections import defaultdict + +from helpers.linear_algebra import dot + + +def cosine_similarity(v, w): + return dot(v, w) / math.sqrt(dot(v, v) * dot(w, w)) + + +def make_user_interest_vector(interests, user_interests): + return [1 if interest in user_interests else 0 + for interest in interests] + + +def most_similar_users_to(user_similarities, user_id): + pairs = [(other_user_id, similarity) + for other_user_id, similarity in + enumerate(user_similarities[user_id]) + if user_id != other_user_id and similarity > 
0] + + return sorted(pairs, key=lambda pair: pair[1], reverse=True) + + +def most_similar_interests_to(interest_similarities, interest_id, unique_interests): + pairs = [(unique_interests[other_interest_id], similarity) + for other_interest_id, similarity in + enumerate(interest_similarities[interest_id]) + if interest_id != other_interest_id and similarity > 0] + + return sorted(pairs, key=lambda pair: pair[1], reverse=True) + + +def user_based_suggestions(user_similarities, users_interests, user_id, include_current_interests=False): + suggestions = defaultdict(float) + for other_user_id, similarity in most_similar_users_to(user_similarities, user_id): + for interest in users_interests[other_user_id]: + suggestions[interest] += similarity + + suggestions = sorted(suggestions.items(), key=lambda pair: pair[1], reverse=True) + + if include_current_interests: + return suggestions + else: + return [(suggestion, weight) + for suggestion, weight in suggestions + if suggestion not in users_interests[user_id]] + + +def item_based_suggestions(interest_similarities, users_interests, user_interest_matrix, unique_interests, user_id, include_current_interests=False): + suggestions = defaultdict(float) + for interest_id, is_interested in enumerate(user_interest_matrix[user_id]): + if is_interested == 1: + for interest, similarity in most_similar_interests_to(interest_similarities, interest_id, unique_interests): + suggestions[interest] += similarity + + suggestions = sorted(suggestions.items(), key=lambda pair: pair[1], reverse=True) + + if include_current_interests: + return suggestions + else: + return [(suggestion, weight) + for suggestion, weight in suggestions + if suggestion not in users_interests[user_id]] + From 0a604f41c44d787381c5e3bba4de8990832a3b6c Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Mon, 3 Jun 2019 18:42:06 +0530 Subject: [PATCH 15/98] Update README.md --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index 
fb555b9..fea633d 100644
--- a/README.md
+++ b/README.md
@@ -22,6 +22,7 @@
 * Sentence generation using **Grammars and Automata Theory; Gibbs Sampling**
 * Topic Modelling using Latent Dirichlet Allocation (LDA)
 * Wrapper for using Scikit-Learn's **GridSearchCV** for a **Keras Neural Network**
+* **Recommender system** using **cosine similarity**, recommending new interests to users as well as matching users as per common interests
 
 ## Installation notes
 MLwP is built using Python 3.5. The easiest way to set up a compatible

From c9f2eecd98ce369d7de6941b4437a124ac5b1d6e Mon Sep 17 00:00:00 2001
From: Amogh Singhal
Date: Mon, 3 Jun 2019 18:44:55 +0530
Subject: [PATCH 16/98] Update README.md

---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index fea633d..ef8d2dd 100644
--- a/README.md
+++ b/README.md
@@ -23,6 +23,7 @@
 * Topic Modelling using Latent Dirichlet Allocation (LDA)
 * Wrapper for using Scikit-Learn's **GridSearchCV** for a **Keras Neural Network**
 * **Recommender system** using **cosine similarity**, recommending new interests to users as well as matching users as per common interests
+* Implementing different methods for **network analysis** such as **PageRank, Betweenness Centrality, Closeness Centrality, Eigenvector Centrality**
 
 ## Installation notes
 MLwP is built using Python 3.5. The easiest way to set up a compatible

From 941a84840fd7bfdac4086f34c7d2a298abbf35d6 Mon Sep 17 00:00:00 2001
From: Amogh Singhal
Date: Mon, 3 Jun 2019 18:45:35 +0530
Subject: [PATCH 17/98] Add files via upload

---
 .../__pycache__/data.cpython-35.pyc | Bin 0 -> 902 bytes
 .../__pycache__/utils.cpython-35.pyc | Bin 0 -> 5930 bytes
 network_analysis/data.py | 19 ++
 network_analysis/model.py | 23 +++
 network_analysis/utils.py | 183 ++++++++++++++++++
 5 files changed, 225 insertions(+)
 create mode 100644 network_analysis/__pycache__/data.cpython-35.pyc
 create mode 100644 network_analysis/__pycache__/utils.cpython-35.pyc
 create mode 100644 network_analysis/data.py
 create mode 100644 network_analysis/model.py
 create mode 100644 network_analysis/utils.py

diff --git a/network_analysis/__pycache__/data.cpython-35.pyc b/network_analysis/__pycache__/data.cpython-35.pyc
new file mode 100644
index 0000000000000000000000000000000000000000..14f6857aceb0e63cd312f742122c98d9b1255511
Binary files /dev/null and b/network_analysis/__pycache__/data.cpython-35.pyc differ
diff --git a/network_analysis/__pycache__/utils.cpython-35.pyc b/network_analysis/__pycache__/utils.cpython-35.pyc
new file mode 100644
index 0000000000000000000000000000000000000000..f9f9e46b1a6c3f76225b4bafd2bd324aaa9bc0c3
Binary files /dev/null and b/network_analysis/__pycache__/utils.cpython-35.pyc differ
diff --git a/network_analysis/data.py b/network_analysis/data.py
new file mode 100644
index 0000000..92c068a
--- /dev/null
+++ b/network_analysis/data.py
@@ -0,0 +1,19 @@
+users = [
+    { "id": 0, "name": "Hero" },
+    { "id": 1, "name": "Dunn" },
+    { "id": 2, "name": "Sue" },
+    { "id": 3, "name": "Chi" },
+    { "id": 4, "name": "Thor" },
+    { "id": 5, "name": "Clive" },
+    { "id": 6, "name": "Hicks" },
+    { "id": 7, "name": "Devin" },
+    { "id": 8, "name": "Kate" },
+    { "id": 9, "name": "Klein" }
+]
+
+friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
+               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]
+
+endorsements = [(0, 1), (1, 0), (0, 2), (2, 0), (1, 2),
+                (2, 1), (1, 3), (2, 3), (3, 4), (5, 4),
+                (5, 6), (7, 5), (6, 8), (8, 7), (8, 9)]
\ No newline at end of file
diff --git a/network_analysis/model.py b/network_analysis/model.py
new file mode 100644
index 0000000..efba4bb
--- /dev/null
+++ b/network_analysis/model.py
@@ -0,0 +1,23 @@
+from network_analysis.data import users
+from network_analysis.utils import eigenvector_centralities, page_rank
+
+if __name__ == '__main__':
+
+    print("Betweenness Centrality")
+    for user in users:
+        print(user["id"], user["betweenness_centrality"])
+    print()
+
+    print("Closeness Centrality")
+    for user in users:
+        print(user["id"], user["closeness_centrality"])
+    print()
+
+    print("Eigenvector Centrality")
+    for user_id, centrality in enumerate(eigenvector_centralities):
+        print(user_id, centrality)
+    print()
+
+    print("PageRank")
+    for user_id, pr in page_rank(users).items():
+        print(user_id, pr)
\ No newline at end of file
diff --git a/network_analysis/utils.py b/network_analysis/utils.py
new file mode 100644
index 0000000..10c41c9
--- /dev/null
+++ b/network_analysis/utils.py
@@ -0,0 +1,183 @@
+import random
+from collections import deque
+from functools import partial
+
+from helpers.linear_algebra import dot, get_row, get_column, shape, make_matrix, magnitude, scalar_multiply, distance
+from network_analysis.data import users, friendships, endorsements
+
+for user in users:
+    user["friends"] = []
+
+# and populate it
+for i, j in friendships:
+    # this works because users[i] is the user whose id is i
+    users[i]["friends"].append(users[j])  # add j as a friend of i
+    users[j]["friends"].append(users[i])  # add i as a friend of j
+
+
+def shortest_paths_from(from_user):
+
+    # a dictionary from "user_id" to *all* shortest paths to that user
+    shortest_paths_to = {from_user["id"]: [[]]}
+
+    # a queue of (previous_user, next user) that we need to check
+    # starts out with all the pairs (from_user, friend_of_from_user)
+    frontier = deque((from_user, friend)
+                     for friend in from_user["friends"])
+
+    # keep going until we empty the deque
+    while frontier:
+
+        prev_user, user = frontier.popleft()  # remove the user who is first in the queue
+        user_id = user["id"]
+
+        # because of the way we are adding to the queue,
+        # necessarily we already know some shortest paths to prev_user
+        paths_to_prev_user = shortest_paths_to[prev_user["id"]]
+        new_paths_to_user = [path + [user_id] for path in paths_to_prev_user]
+
+        # it is possible we already know a shortest path
+        old_paths_to_user = shortest_paths_to.get(user_id, [])
+
+        # what is the shortest path to here that we have seen so far?
+        if old_paths_to_user:
+            min_path_length = len(old_paths_to_user[0])
+        else:
+            min_path_length = float('inf')
+
+        # only keep paths that are not too long and are actually new
+        new_paths_to_user = [path
+                             for path in new_paths_to_user
+                             if len(path) <= min_path_length
+                             and path not in old_paths_to_user]
+
+        shortest_paths_to[user_id] = old_paths_to_user + new_paths_to_user
+
+        # add never-seen neighbors to the frontier
+        frontier.extend((user, friend)
+                        for friend in user["friends"]
+                        if friend["id"] not in shortest_paths_to)
+
+    return shortest_paths_to
+
+
+for user in users:
+    user["shortest_paths"] = shortest_paths_from(user)
+
+for user in users:
+    user["betweenness_centrality"] = 0.0
+
+for source in users:
+    source_id = source["id"]
+    for target_id, paths in source["shortest_paths"].items():
+        if source_id < target_id:  # don't double count
+            num_paths = len(paths)  # how many shortest paths?
+            contrib = 1 / num_paths  # contribution to centrality
+
+            for path in paths:
+                for id in path:
+                    if id not in [source_id, target_id]:
+                        users[id]["betweenness_centrality"] += contrib
+
+
+def farness(user):
+    """the sum of the lengths of the shortest paths to each other user"""
+    return sum(len(paths[0])
+               for paths in user["shortest_paths"].values())
+
+
+for user in users:
+    user["closeness_centrality"] = 1 / farness(user)
+
+"""Eigenvector Centrality"""
+
+
+def matrix_product_entry(A, B, i, j):
+    return dot(get_row(A, i), get_column(B, j))
+
+
+def matrix_multiply(A, B):
+    n1, k1 = shape(A)
+    n2, k2 = shape(B)
+
+    if k1 != n2:
+        raise ArithmeticError("incompatible shapes!")
+
+    return make_matrix(n1, k2, partial(matrix_product_entry, A, B))
+
+
+def vector_as_matrix(v):
+    """returns the vector v (represented as a list) as a n x 1 matrix"""
+    return [[v_i] for v_i in v]
+
+
+def vector_from_matrix(v_as_matrix):
+    """returns the n x 1 matrix as a list of values"""
+    return [row[0] for row in v_as_matrix]
+
+
+def matrix_operation(A, v):
+    v_as_matrix = vector_as_matrix(v)
+    product = matrix_multiply(A, v_as_matrix)
+    return vector_from_matrix(product)
+
+
+def find_eigenvector(A, tolerance=0.00001):
+    guess = [random.random() for _ in A]
+
+    while True:
+        result = matrix_operation(A, guess)
+        length = magnitude(result)
+        next_guess = scalar_multiply(1/length, result)
+
+        if distance(guess, next_guess) < tolerance:
+            return next_guess, length  # eigenvector, eigenvalue
+        guess = next_guess
+
+
+def entry_fn(i, j):
+    return 1 if (i, j) in friendships or (j, i) in friendships else 0
+
+
+n = len(users)
+adjacency_matrix = make_matrix(n, n, entry_fn)
+eigenvector_centralities, _ = find_eigenvector(adjacency_matrix)
+
+"""Directed Graphs and PageRank"""
+for user in users:
+    user["endorses"] = []  # add one list to track outgoing endorsements
+    user["endorsed_by"] = []  # and another to track incoming endorsements
+
+for source_id, target_id in endorsements:
+    users[source_id]["endorses"].append(users[target_id])
+    users[target_id]["endorsed_by"].append(users[source_id])
+
+endorsements_by_id = [(user["id"], len(user["endorsed_by"]))
+                      for user in users]
+
+sorted(endorsements_by_id,
+       key=lambda pair: pair[1],
+       reverse=True)
+
+
+def page_rank(users, damping=0.85, num_iters=100):
+
+    # initially distribute PageRank evenly
+    num_users = len(users)
+    pr = {user["id"]: 1 / num_users for user in users}
+
+    # this is the small fraction of PageRank
+    # that each node gets each iteration
+    base_pr = (1 - damping) / num_users
+
+    for _ in range(num_iters):
+        next_pr = {user["id"]: base_pr for user in users}
+        for user in users:
+            # distribute PageRank to outgoing links
+            links_pr = pr[user["id"]] * damping
+            for endorsee in user["endorses"]:
+                next_pr[endorsee["id"]] += links_pr / len(user["endorses"])
+
+        pr = next_pr
+
+    return pr
\ No newline at end of file

From 30ddacf0c7fae754b4839007127ea10ec9251e18 Mon Sep 17 00:00:00 2001
From: Amogh Singhal
Date: Mon, 3 Jun 2019 18:50:47 +0530
Subject: [PATCH 18/98] Add files via upload

---
 hypothesis_inference.py | 205 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 205 insertions(+)
 create mode 100644 hypothesis_inference.py

diff --git a/hypothesis_inference.py b/hypothesis_inference.py
new file mode 100644
index 0000000..4c50c31
--- /dev/null
+++ b/hypothesis_inference.py
@@ -0,0 +1,205 @@
+from helpers.probability import normal_cdf, inverse_normal_cdf
+import math, random
+
+
+def normal_approximation_to_binomial(n, p):
+    """finds mu and sigma corresponding to a Binomial(n, p)"""
+    mu = p * n
+    sigma = math.sqrt(p * (1 - p) * n)
+    return mu, sigma
+
+
+#####
+#
+# probabilities a normal lies in an interval
+#
+######
+
+# the normal cdf _is_ the probability the variable is below a threshold
+normal_probability_below = normal_cdf
+
+
+# it's above the threshold if it's not below the threshold
+def normal_probability_above(lo, mu=0, sigma=1):
+    return 1 - normal_cdf(lo, mu, sigma)
+
+
+# it's between if it's less than hi, but not less than lo
+def normal_probability_between(lo, hi, mu=0, sigma=1):
+    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)
+
+
+# it's outside if it's not between
+def normal_probability_outside(lo, hi, mu=0, sigma=1):
+    return 1 - normal_probability_between(lo, hi, mu, sigma)
+
+
+######
+#
+# normal bounds
+#
+######
+
+
+def normal_upper_bound(probability, mu=0, sigma=1):
+    """returns the z for which P(Z <= z) = probability"""
+    return inverse_normal_cdf(probability, mu, sigma)
+
+
+def normal_lower_bound(probability, mu=0, sigma=1):
+    """returns the z for which P(Z >= z) = probability"""
+    return inverse_normal_cdf(1 - probability, mu, sigma)
+
+
+def normal_two_sided_bounds(probability, mu=0, sigma=1):
+    """returns the symmetric (about the mean) bounds
+    that contain the specified probability"""
+    tail_probability = (1 - probability) / 2
+
+    # upper bound should have tail_probability above it
+    upper_bound = normal_lower_bound(tail_probability, mu, sigma)
+
+    # lower bound should have tail_probability below it
+    lower_bound = normal_upper_bound(tail_probability, mu, sigma)
+
+    return lower_bound, upper_bound
+
+
+def two_sided_p_value(x, mu=0, sigma=1):
+    if x >= mu:
+        # if x is greater than the mean, the tail is above x
+        return 2 * normal_probability_above(x, mu, sigma)
+    else:
+        # if x is less than the mean, the tail is below x
+        return 2 * normal_probability_below(x, mu, sigma)
+
+
+def count_extreme_values():
+    extreme_value_count = 0
+    for _ in range(100000):
+        num_heads = sum(1 if random.random() < 0.5 else 0  # count # of heads
+                        for _ in range(1000))  # in 1000 flips
+        if num_heads >= 530 or num_heads <= 470:  # and count how often
+            extreme_value_count += 1  # the # is 'extreme'
+
+    return extreme_value_count / 100000
+
+
+upper_p_value = normal_probability_above
+lower_p_value = normal_probability_below
+
+
+##
+#
+# P-hacking
+#
+##
+
+def run_experiment():
+    """flip a fair coin 1000 times, True = heads, False = tails"""
+    return [random.random() < 0.5 for _ in range(1000)]
+
+
+def reject_fairness(experiment):
+    """using the 5% significance levels"""
+    num_heads = len([flip for flip in experiment if flip])
+    return num_heads < 469 or num_heads > 531
+
+
+##
+#
+# running an A/B test
+#
+##
+
+def estimated_parameters(N, n):
+    p = n / N
+    sigma = math.sqrt(p * (1 - p) / N)
+    return p, sigma
+
+
+def a_b_test_statistic(N_A, n_A, N_B, n_B):
+    p_A, sigma_A = estimated_parameters(N_A, n_A)
+    p_B, sigma_B = estimated_parameters(N_B, n_B)
+    return (p_B - p_A) / math.sqrt(sigma_A ** 2 + sigma_B ** 2)
+
+
+##
+#
+# Bayesian Inference
+#
+##
+
+def B(alpha, beta):
+    """a normalizing constant so that the total probability is 1"""
+    return math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)
+
+
+def beta_pdf(x, alpha, beta):
+    if x < 0 or x > 1:  # no weight outside of [0, 1]
+        return 0
+    return x ** (alpha - 1) * (1 - x) ** (beta - 1) / B(alpha, beta)
+
+
+if __name__ == "__main__":
+    mu_0, sigma_0 = normal_approximation_to_binomial(1000, 0.5)
+    print("mu_0", mu_0)
+    print("sigma_0", sigma_0)
+    print("normal_two_sided_bounds(0.95, mu_0, sigma_0)", normal_two_sided_bounds(0.95, mu_0, sigma_0))
+    print()
+    print("power of a test")
+
+    print("95% bounds based on assumption p is 0.5")
+
+    lo, hi = normal_two_sided_bounds(0.95, mu_0, sigma_0)
+    print("lo", lo)
+    print("hi", hi)
+
+    print("actual mu and sigma based on p = 0.55")
+    mu_1, sigma_1 = normal_approximation_to_binomial(1000, 0.55)
+    print("mu_1", mu_1)
+    print("sigma_1", sigma_1)
+
+    # a type 2 error means we fail to reject the null hypothesis
+    # which will happen when X is still in our original interval
+    type_2_probability = normal_probability_between(lo, hi, mu_1, sigma_1)
+    power = 1 - type_2_probability  # 0.887
+
+    print("type 2 probability", type_2_probability)
+    print("power", power)
+    print()
+    print("one-sided test")
+    hi = normal_upper_bound(0.95, mu_0, sigma_0)
+    print("hi", hi)  # is 526 (< 531, since we need more probability in the upper tail)
+    type_2_probability = normal_probability_below(hi, mu_1, sigma_1)
+    power = 1 - type_2_probability  # = 0.936
+    print("type 2 probability", type_2_probability)
+    print("power", power)
+    print()
+
+    print("two_sided_p_value(529.5, mu_0, sigma_0)", two_sided_p_value(529.5, mu_0, sigma_0))
+
+    print("two_sided_p_value(531.5, mu_0, sigma_0)", two_sided_p_value(531.5, mu_0, sigma_0))
+
+    print("upper_p_value(525, mu_0, sigma_0)", upper_p_value(525, mu_0, sigma_0))
+    print("upper_p_value(527, mu_0, sigma_0)", upper_p_value(527, mu_0, sigma_0))
+    print()
+
+    print("P-hacking")
+
+    random.seed(0)
+    experiments = [run_experiment() for _ in range(1000)]
+    num_rejections = len([experiment
+                          for experiment in experiments
+                          if reject_fairness(experiment)])
+
+    print(num_rejections, "rejections out of 1000")
+    print()
+
+    print("A/B testing")
+    z = a_b_test_statistic(1000, 200, 1000, 180)
+    print("a_b_test_statistic(1000, 200, 1000, 180)", z)
+    print("p-value", two_sided_p_value(z))
+    z = a_b_test_statistic(1000, 200, 1000, 150)
+    print("a_b_test_statistic(1000, 200, 1000, 150)", z)
+    print("p-value", two_sided_p_value(z))

From 358e338bbf4f751ce3ceff9792d3492c859a3658 Mon Sep 17 00:00:00 2001
From: Amogh Singhal
Date: Mon, 3 Jun 2019 18:52:41 +0530
Subject: [PATCH 19/98] Update README.md

---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index ef8d2dd..4044c52 100644
--- a/README.md
+++ b/README.md
@@ -24,6 +24,7 @@
 * Wrapper for using Scikit-Learn's **GridSearchCV** for a **Keras Neural Network**
 * **Recommender system** using **cosine similarity**, recommending new interests to users as well as matching users as per common interests
 * Implementing different methods for **network analysis** such as **PageRank, Betweenness Centrality, Closeness Centrality, Eigenvector Centrality**
+* Implementing methods used for **Hypothesis Inference** such as **P-hacking, A/B Testing, Bayesian Inference**
 
 ## Installation notes
 MLwP is built using Python 3.5. The easiest way to set up a compatible

From 5161b2e3402186eafb2ca3e4562f9de13b76b794 Mon Sep 17 00:00:00 2001
From: Amogh Singhal
Date: Mon, 3 Jun 2019 19:20:11 +0530
Subject: [PATCH 20/98] Update README.md

---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index 4044c52..909588a 100644
--- a/README.md
+++ b/README.md
@@ -25,6 +25,7 @@
 * **Recommender system** using **cosine similarity**, recommending new interests to users as well as matching users as per common interests
 * Implementing different methods for **network analysis** such as **PageRank, Betweenness Centrality, Closeness Centrality, Eigenvector Centrality**
 * Implementing methods used for **Hypothesis Inference** such as **P-hacking, A/B Testing, Bayesian Inference**
+* Implemented **K-nearest neighbors** for the next presidential election and predicting voting behavior based on nearest neighbors.
 
 ## Installation notes
 MLwP is built using Python 3.5. The easiest way to set up a compatible

From 27d4f6b922666c35b74e89514423c2f1a91971bb Mon Sep 17 00:00:00 2001
From: Amogh Singhal
Date: Mon, 3 Jun 2019 19:21:07 +0530
Subject: [PATCH 21/98] Add files via upload

---
 .../__pycache__/data.cpython-35.pyc | Bin 0 -> 2619 bytes
 .../__pycache__/utils.cpython-35.pyc | Bin 0 -> 3960 bytes
 k_nearest_neighbors/data.py | 2 +
 k_nearest_neighbors/model.py | 35 ++++++
 k_nearest_neighbors/utils.py | 108 ++++++++++++++++++
 5 files changed, 145 insertions(+)
 create mode 100644 k_nearest_neighbors/__pycache__/data.cpython-35.pyc
 create mode 100644 k_nearest_neighbors/__pycache__/utils.cpython-35.pyc
 create mode 100644 k_nearest_neighbors/data.py
 create mode 100644 k_nearest_neighbors/model.py
 create mode 100644 k_nearest_neighbors/utils.py

diff --git a/k_nearest_neighbors/__pycache__/data.cpython-35.pyc b/k_nearest_neighbors/__pycache__/data.cpython-35.pyc
new file mode 100644
index 0000000000000000000000000000000000000000..461a7cd0d864e53dcb953bcfdf37bd49bfeef09c
Binary files /dev/null and b/k_nearest_neighbors/__pycache__/data.cpython-35.pyc differ
diff --git a/k_nearest_neighbors/__pycache__/utils.cpython-35.pyc b/k_nearest_neighbors/__pycache__/utils.cpython-35.pyc
new file mode 100644
index 0000000000000000000000000000000000000000..fe825e7592faec0db21b5549b0035a2bd645b58c
Binary files /dev/null and b/k_nearest_neighbors/__pycache__/utils.cpython-35.pyc differ

Date: Wed, 5 Jun 2019 08:21:40 +0530
Subject: [PATCH 22/98] Update requirements.txt

---
 requirements.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/requirements.txt b/requirements.txt
index 3c5a58a..a876eba 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -40,6 +40,6 @@ six==1.11.0
 statsmodels==0.9.0
 tornado==5.1.1
 traitlets==4.3.2
-urllib3==1.23
+urllib3>=1.24.2
 wcwidth==0.1.7
 wheel==0.31.1

From f8c69465c825ceb5ed3ab492b29948c8974269aa Mon Sep 17 00:00:00 2001
From: NeolithEra <3226592650@qq.com>
Date: Fri, 23 Aug 2019 01:16:19 +0800
Subject: [PATCH 23/98] Fix dependency conflict for issue

---
 requirements.txt | 1 -
 1 file changed, 1 deletion(-)

diff --git a/requirements.txt b/requirements.txt
index a876eba..6b4806b 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -40,6 +40,5 @@ six==1.11.0
 statsmodels==0.9.0
 tornado==5.1.1
 traitlets==4.3.2
-urllib3>=1.24.2
 wcwidth==0.1.7
 wheel==0.31.1

From 5c8960ea770f9e11f852f9754eeb7ff3a933ac90 Mon Sep 17 00:00:00 2001
From: Amogh Singhal
Date: Sat, 14 Sep 2019 00:06:09 +0530
Subject: [PATCH 24/98] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 909588a..4da9732 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-# Machine-Learning-with-Python [![star this repo](http://githubbadges.com/star.svg?user=devAmoghS&repo=Machine-Learning-with-Python)](http://github.com/ddavison/github-badges) [![fork this repo](http://githubbadges.com/fork.svg?user=devAmoghS&repo=Machine-Learning-with-Python)](http://github.com/ddavison/github-badges/fork)
+# Machine-Learning-with-Python ![GitHub stars](https://img.shields.io/github/stars/devAmoghS/Machine-Learning-with-Python?style=for-the-badge) ![GitHub forks](https://img.shields.io/github/forks/devAmoghS/Machine-Learning-with-Python?label=Forks&style=for-the-badge)
 ![alt text](https://media.istockphoto.com/vectors/machine-learning-3-step-infographic-artificial-intelligence-machine-vector-id962219860?k=6&m=962219860&s=612x612&w=0&h=yricYyUqZbILMHp3IvtenS3xbRDhu1w1u5kk2az5tbo=)
 
 ## Small scale machine learning projects to understand the core concepts (order: oldest to newest)

From e93db9dfc2bf25409bd0eda0601fe1bb11e7cb38 Mon Sep 17 00:00:00 2001
From: Amogh Singhal
Date: Mon, 25 May 2020 12:55:03 +0530
Subject: [PATCH 25/98] Create Understanding the algorithm.md

---
 .../Understanding the algorithm.md | 36 +++++++++++++++++++
 1 file changed, 36 insertions(+)
 create mode 100644 k_nearest_neighbors/Understanding the algorithm.md

diff --git a/k_nearest_neighbors/Understanding the algorithm.md b/k_nearest_neighbors/Understanding the algorithm.md
new file mode 100644
index 0000000..b6eff7e
--- /dev/null
+++ b/k_nearest_neighbors/Understanding the algorithm.md
@@ -0,0 +1,36 @@
+### Introduction
+
+K-nearest neighbors (kNN) is a supervised machine learning algorithm.
+
+### Problem Statement
+
+Given some labelled data points, we have to classify a new data point according to its nearest neighbors.
+
+### Intuition
+
+* In kNN, k is the number of neighbors you evaluate to decide which group a new data point belongs to.
+* The value of k is chosen by plotting the error rate against different values of k
+* Once the value of k is fixed, we take the k nearest neighbors of the data point
+* The distance between data points can be measured using either `Euclidean Distance` or `Manhattan Distance`
+* Once we have the k nearest neighbors, we look for the majority label among them
+* The data point is assigned to the group that has the largest number of neighbors
+
+### Choosing K value
+* First divide the entire data set into a training set and a test set.
+* Fit the kNN algorithm on the training set and validate it against the test set.
+* Let's assume you have a training set `xtrain` and a test set `xtest`
+* Now create the model with `k` set to `1` and predict on the test set
+* Check the accuracy and other metrics, then repeat the same process, increasing k by 1 each time.
+
+
+Here I am increasing k by 1 from `1` to `29` and printing the accuracy for each `k` value.
+![Code](https://qphs.fs.quoracdn.net/main-qimg-9e8fedc07dafba2106eb11f0bfd4ba7d.webp)
+
+### Note
+
+* kNN is affected by `imbalanced datasets`.
+Suppose there are `m` instances of **class 1** and `n` instances of **class 2**, where `n << m`.
+When `k > n`, the k nearest neighbors are likely to contain more instances of class 1,
+which skews the majority vote among the k nearest neighbors
+
+* kNN is also very sensitive to `outliers`

From c6c8d54568c9b9140eb02f3a37058ef86377dbef Mon Sep 17 00:00:00 2001
From: Amogh Singhal
Date: Mon, 25 May 2020 17:46:31 +0530
Subject: [PATCH 26/98] Update Understanding the algorithm.md

---
 k_nearest_neighbors/Understanding the algorithm.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/k_nearest_neighbors/Understanding the algorithm.md b/k_nearest_neighbors/Understanding the algorithm.md
index b6eff7e..9e9d432 100644
--- a/k_nearest_neighbors/Understanding the algorithm.md
+++ b/k_nearest_neighbors/Understanding the algorithm.md
@@ -6,6 +6,10 @@ K-nearest neighbors (kNN) is a supervised machine learning algorithm.
 
 Given some labelled data points, we have to classify a new data point according to its nearest neighbors.
 
+**Example used here**
+
+We have the data for large social network company which ran polls for their favorite programming language. The users come from a group of large cities. Now the VP of Community Engagement wants you to `predict the` **favorite programming language** `for the places that were` **not** `part of the survey`
+
 ### Intuition
 
 * In kNN, k is the number of neighbors you evaluate to decide which group a new data point belongs to.
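
The "Choosing K value" steps in the document above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the repository's `k_nearest_neighbors/utils.py`: the helper names (`euclidean`, `knn_classify`, `accuracy_by_k`) and the toy data are made up for this example, and `k` is assumed to be at least 1.

```python
from collections import Counter
import math


def euclidean(p, q):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_classify(k, labeled_points, new_point):
    """Majority label among the k training points closest to new_point.

    labeled_points is a list of (point, label) pairs. Ties are broken by
    dropping the farthest neighbor and voting again.
    """
    nearest = sorted(labeled_points,
                     key=lambda pair: euclidean(pair[0], new_point))[:k]
    labels = [label for _, label in nearest]
    while True:
        counts = Counter(labels).most_common()
        if len(counts) == 1 or counts[0][1] > counts[1][1]:
            return counts[0][0]
        labels = labels[:-1]  # tie: drop the farthest neighbor and retry


def accuracy_by_k(train, test, k_values):
    """Hold-out accuracy on the test set for each candidate k."""
    results = {}
    for k in k_values:
        correct = sum(knn_classify(k, train, point) == label
                      for point, label in test)
        results[k] = correct / len(test)
    return results


# Toy data (hypothetical, for illustration only): two well-separated groups.
train = [([0, 0], "A"), ([0, 1], "A"), ([1, 0], "A"),
         ([5, 5], "B"), ([5, 6], "B"), ([6, 5], "B")]
test = [([0.5, 0.5], "A"), ([5.5, 5.5], "B")]

for k, acc in accuracy_by_k(train, test, [1, 3, 5]).items():
    print(k, acc)
```

Plotting the resulting accuracy (or error rate) against k is one way to pick a value. The tie-breaking rule here is only one possible design choice; weighting votes by distance would be another.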
From 01a6fdc79887e083da9376db0be899954b0a28d9 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Mon, 25 May 2020 17:50:46 +0530 Subject: [PATCH 27/98] Update data.py --- k_nearest_neighbors/data.py | 77 ++++++++++++++++++++++++++++++++++++- 1 file changed, 76 insertions(+), 1 deletion(-) diff --git a/k_nearest_neighbors/data.py b/k_nearest_neighbors/data.py index 9ed01df..0c5d2f9 100644 --- a/k_nearest_neighbors/data.py +++ b/k_nearest_neighbors/data.py @@ -1,2 +1,77 @@ -cities = [(-86.75, 33.5666666666667, 'Python'), (-88.25, 30.6833333333333, 'Python'), (-112.016666666667, 33.4333333333333, 'Java'), (-110.933333333333, 32.1166666666667, 'Java'), (-92.2333333333333, 34.7333333333333, 'R'), (-121.95, 37.7, 'R'), (-118.15, 33.8166666666667, 'Python'), (-118.233333333333, 34.05, 'Java'), (-122.316666666667, 37.8166666666667, 'R'), (-117.6, 34.05, 'Python'), (-116.533333333333, 33.8166666666667, 'Python'), (-121.5, 38.5166666666667, 'R'), (-117.166666666667, 32.7333333333333, 'R'), (-122.383333333333, 37.6166666666667, 'R'), (-121.933333333333, 37.3666666666667, 'R'), (-122.016666666667, 36.9833333333333, 'Python'), (-104.716666666667, 38.8166666666667, 'Python'), (-104.866666666667, 39.75, 'Python'), (-72.65, 41.7333333333333, 'R'), (-75.6, 39.6666666666667, 'Python'), (-77.0333333333333, 38.85, 'Python'), (-80.2666666666667, 25.8, 'Java'), (-81.3833333333333, 28.55, 'Java'), (-82.5333333333333, 27.9666666666667, 'Java'), (-84.4333333333333, 33.65, 'Python'), (-116.216666666667, 43.5666666666667, 'Python'), (-87.75, 41.7833333333333, 'Java'), (-86.2833333333333, 39.7333333333333, 'Java'), (-93.65, 41.5333333333333, 'Java'), (-97.4166666666667, 37.65, 'Java'), (-85.7333333333333, 38.1833333333333, 'Python'), (-90.25, 29.9833333333333, 'Java'), (-70.3166666666667, 43.65, 'R'), (-76.6666666666667, 39.1833333333333, 'R'), (-71.0333333333333, 42.3666666666667, 'R'), (-72.5333333333333, 42.2, 'R'), (-83.0166666666667, 42.4166666666667, 'Python'), (-84.6, 42.7833333333333, 
'Python'), (-93.2166666666667, 44.8833333333333, 'Python'), (-90.0833333333333, 32.3166666666667, 'Java'), (-94.5833333333333, 39.1166666666667, 'Java'), (-90.3833333333333, 38.75, 'Python'), (-108.533333333333, 45.8, 'Python'), (-95.9, 41.3, 'Python'), (-115.166666666667, 36.0833333333333, 'Java'), (-71.4333333333333, 42.9333333333333, 'R'), (-74.1666666666667, 40.7, 'R'), (-106.616666666667, 35.05, 'Python'), (-78.7333333333333, 42.9333333333333, 'R'), (-73.9666666666667, 40.7833333333333, 'R'), (-80.9333333333333, 35.2166666666667, 'Python'), (-78.7833333333333, 35.8666666666667, 'Python'), (-100.75, 46.7666666666667, 'Java'), (-84.5166666666667, 39.15, 'Java'), (-81.85, 41.4, 'Java'), (-82.8833333333333, 40, 'Java'), (-97.6, 35.4, 'Python'), (-122.666666666667, 45.5333333333333, 'Python'), (-75.25, 39.8833333333333, 'Python'), (-80.2166666666667, 40.5, 'Python'), (-71.4333333333333, 41.7333333333333, 'R'), (-81.1166666666667, 33.95, 'R'), (-96.7333333333333, 43.5666666666667, 'Python'), (-90, 35.05, 'R'), (-86.6833333333333, 36.1166666666667, 'R'), (-97.7, 30.3, 'Python'), (-96.85, 32.85, 'Java'), (-95.35, 29.9666666666667, 'Java'), (-98.4666666666667, 29.5333333333333, 'Java'), (-111.966666666667, 40.7666666666667, 'Python'), (-73.15, 44.4666666666667, 'R'), (-77.3333333333333, 37.5, 'Python'), (-122.3, 47.5333333333333, 'Python'), (-89.3333333333333, 43.1333333333333, 'R'), (-104.816666666667, 41.15, 'Java')] +cities = [(-86.75, 33.5666666666667, 'Python'), + (-88.25, 30.6833333333333, 'Python'), + (-112.016666666667, 33.4333333333333, 'Java'), + (-110.933333333333, 32.1166666666667, 'Java'), + (-92.2333333333333, 34.7333333333333, 'R'), + (-121.95, 37.7, 'R'), + (-118.15, 33.8166666666667, 'Python'), + (-118.233333333333, 34.05, 'Java'), + (-122.316666666667, 37.8166666666667, 'R'), + (-117.6, 34.05, 'Python'), + (-116.533333333333, 33.8166666666667, 'Python'), + (-121.5, 38.5166666666667, 'R'), + (-117.166666666667, 32.7333333333333, 'R'), + 
(-122.383333333333, 37.6166666666667, 'R'), + (-121.933333333333, 37.3666666666667, 'R'), + (-122.016666666667, 36.9833333333333, 'Python'), + (-104.716666666667, 38.8166666666667, 'Python'), + (-104.866666666667, 39.75, 'Python'), + (-72.65, 41.7333333333333, 'R'), + (-75.6, 39.6666666666667, 'Python'), + (-77.0333333333333, 38.85, 'Python'), + (-80.2666666666667, 25.8, 'Java'), + (-81.3833333333333, 28.55, 'Java'), + (-82.5333333333333, 27.9666666666667, 'Java'), + (-84.4333333333333, 33.65, 'Python'), + (-116.216666666667, 43.5666666666667, 'Python'), + (-87.75, 41.7833333333333, 'Java'), + (-86.2833333333333, 39.7333333333333, 'Java'), + (-93.65, 41.5333333333333, 'Java'), + (-97.4166666666667, 37.65, 'Java'), + (-85.7333333333333, 38.1833333333333, 'Python'), + (-90.25, 29.9833333333333, 'Java'), + (-70.3166666666667, 43.65, 'R'), + (-76.6666666666667, 39.1833333333333, 'R'), + (-71.0333333333333, 42.3666666666667, 'R'), + (-72.5333333333333, 42.2, 'R'), + (-83.0166666666667, 42.4166666666667, 'Python'), + (-84.6, 42.7833333333333, 'Python'), + (-93.2166666666667, 44.8833333333333, 'Python'), + (-90.0833333333333, 32.3166666666667, 'Java'), + (-94.5833333333333, 39.1166666666667, 'Java'), + (-90.3833333333333, 38.75, 'Python'), + (-108.533333333333, 45.8, 'Python'), + (-95.9, 41.3, 'Python'), + (-115.166666666667, 36.0833333333333, 'Java'), + (-71.4333333333333, 42.9333333333333, 'R'), + (-74.1666666666667, 40.7, 'R'), + (-106.616666666667, 35.05, 'Python'), + (-78.7333333333333, 42.9333333333333, 'R'), + (-73.9666666666667, 40.7833333333333, 'R'), + (-80.9333333333333, 35.2166666666667, 'Python'), + (-78.7833333333333, 35.8666666666667, 'Python'), + (-100.75, 46.7666666666667, 'Java'), + (-84.5166666666667, 39.15, 'Java'), + (-81.85, 41.4, 'Java'), + (-82.8833333333333, 40, 'Java'), + (-97.6, 35.4, 'Python'), + (-122.666666666667, 45.5333333333333, 'Python'), + (-75.25, 39.8833333333333, 'Python'), + (-80.2166666666667, 40.5, 'Python'), + (-71.4333333333333, 
41.7333333333333, 'R'), + (-81.1166666666667, 33.95, 'R'), + (-96.7333333333333, 43.5666666666667, 'Python'), + (-90, 35.05, 'R'), + (-86.6833333333333, 36.1166666666667, 'R'), + (-97.7, 30.3, 'Python'), + (-96.85, 32.85, 'Java'), + (-95.35, 29.9666666666667, 'Java'), + (-98.4666666666667, 29.5333333333333, 'Java'), + (-111.966666666667, 40.7666666666667, 'Python'), + (-73.15, 44.4666666666667, 'R'), + (-77.3333333333333, 37.5, 'Python'), + (-122.3, 47.5333333333333, 'Python'), + (-89.3333333333333, 43.1333333333333, 'R'), + (-104.816666666667, 41.15, 'Java')] + cities = [([longitude, latitude], language) for longitude, latitude, language in cities] From 4ada8be0f6b783f18f30e91673819e60a7f3c58f Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Thu, 28 May 2020 00:06:56 +0530 Subject: [PATCH 28/98] Update Understanding the algorithm.md --- k_nearest_neighbors/Understanding the algorithm.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/k_nearest_neighbors/Understanding the algorithm.md b/k_nearest_neighbors/Understanding the algorithm.md index 9e9d432..a4689f4 100644 --- a/k_nearest_neighbors/Understanding the algorithm.md +++ b/k_nearest_neighbors/Understanding the algorithm.md @@ -8,7 +8,7 @@ Given some labelled data points, we have to classify a new data point according **Example used here** -We have the data for large social network company which ran polls for their favroite programming language. The users belong from a group of large cities. Now the VP of Community Engagement want you to `predict the` **favorite programming language** `for the places that were` **not** `part of the survey` +We have the data for a large social networking company which ran polls for their favroite programming language. The users belong from a group of large cities. 
Now the VP of Community Engagement want you to `predict the` **favorite programming language** `for the places that were` **not** `part of the survey` ### Intuition From 897f420d10dc8c8bafdf686eb3fef2e639926782 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Thu, 28 May 2020 00:22:58 +0530 Subject: [PATCH 29/98] Update data.py --- k_means_clustering/data.py | 22 ++++++++++++++++++++-- 1 file changed, 20 insertions(+), 2 deletions(-) diff --git a/k_means_clustering/data.py b/k_means_clustering/data.py index 9c9cabc..095405b 100644 --- a/k_means_clustering/data.py +++ b/k_means_clustering/data.py @@ -1,2 +1,20 @@ -inputs = [[-14, -5], [13, 13], [20, 23], [-19, -11], [-9, -16], [21, 27], [-49, 15], [26, 13], [-46, 5], [-34, -1], - [11, 15], [-49, 0], [-22, -16], [19, 28], [-12, -8], [-13, -19], [-41, 8], [-11, -6], [-25, -9], [-18, -3]] +inputs = [[-14, -5], + [13, 13], + [20, 23], + [-19, -11], + [-9, -16], + [21, 27], + [-49, 15], + [26, 13], + [-46, 5], + [-34, -1], + [11, 15], + [-49, 0], + [-22, -16], + [19, 28], + [-12, -8], + [-13, -19], + [-41, 8], + [-11, -6], + [-25, -9], + [-18, -3]] From 3c28e462331378c5320759568e1d9e7143dca6cc Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Thu, 28 May 2020 00:43:18 +0530 Subject: [PATCH 30/98] Create Understanding the algorithm.md --- .../Understanding the algorithm.md | 46 +++++++++++++++++++ 1 file changed, 46 insertions(+) create mode 100644 k_means_clustering/Understanding the algorithm.md diff --git a/k_means_clustering/Understanding the algorithm.md b/k_means_clustering/Understanding the algorithm.md new file mode 100644 index 0000000..23d2ef3 --- /dev/null +++ b/k_means_clustering/Understanding the algorithm.md @@ -0,0 +1,46 @@ +### Introduction + +* K-means clustering is an unsupervised machine learning algorithm. 
+* K-means algorithm is an iterative algorithm that tries to partition the dataset into `K` pre-defined distinct non-overlapping subgroups(clusters) where each data point belongs to only one group. +* It tries to make the intra-cluster data points as similar as possible while also keeping the clusters as different (far) as possible. +* It assigns data points to a cluster such that the sum of the squared distance between the data points and the cluster’s centroid (arithmetic mean of all the data points that belong to that cluster) is at the minimum. +* The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster + +### Problem Statement + +Given some **unlabelled** data points, we have to identify subgroups such that +1. Points in the same subgroup are similar to each other. +2. Points in different subgroup are dissimilar to each other. + +**Example used here** + +We have the data for a large social networking company which is planning to host meetups for their users. We have the users' location data. Now the VP of Growth want you to `choose the` **meetup locations** `so it becomes convinient for everyone to attend` + +### Intuition + +* In k-means, k is the no. of subgroups you want the data to be segregated into ? +* Value of k is decided by elbow method or can be initialised randomly as well +* Once the value of k is initiliazed, we take the nearest data points from each centroid +* The measure of distance between the data points and centroids can be calculated using either `Euclidean Distance` or `Manhattan Distance` + +### Iteration +* We assign a cluster to the data point of the nearest centroid +* Once all the points are assigned to their nearest centroids, then for each cluster the centroid is calculated again. +* With the new centroids, we repeat the step of cluster assignment. + +* The above three steps are iterated as long as there is no change in cluster assigment of data points. 
+ +### Choosing K value - Elbow method +* Elbow method gives us an idea on what a good k number of clusters. +* This is based on the sum of squared distance (SSE) between data points and their assigned clusters’ centroids. +* We pick `k` at the spot where SSE starts to flatten out and forming an elbow. + +Here I am increasing the k value by 1 from `1 to 10` and printing the sum of squared distance with respected `k` value. +![Code](https://miro.medium.com/max/866/1*9z8erk4kvsnxkfv-QhsHZg.png) + +### Note + +* Kmeans gives more weight to the bigger clusters. +* Kmeans assumes spherical shapes of clusters (with radius equal to the distance between the centroid and the furthest data point) and doesn’t work well when clusters are in different shapes such as elliptical clusters. +* If there is overlapping between clusters, kmeans doesn’t have an intrinsic measure for uncertainty for the examples belong to the overlapping region in order to determine for which cluster to assign each data point. +* Kmeans may still cluster the data even if it can’t be clustered such as data that comes from uniform distributions. From 47d45ca1fb8c4e533b52a3c2c147b1422d0cf07b Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Thu, 28 May 2020 00:45:00 +0530 Subject: [PATCH 31/98] Update Understanding the algorithm.md --- k_means_clustering/Understanding the algorithm.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/k_means_clustering/Understanding the algorithm.md b/k_means_clustering/Understanding the algorithm.md index 23d2ef3..f8f0459 100644 --- a/k_means_clustering/Understanding the algorithm.md +++ b/k_means_clustering/Understanding the algorithm.md @@ -17,13 +17,12 @@ Given some **unlabelled** data points, we have to identify subgroups such that We have the data for a large social networking company which is planning to host meetups for their users. We have the users' location data. 
Now the VP of Growth want you to `choose the` **meetup locations** `so it becomes convinient for everyone to attend` ### Intuition - +**Initialization** * In k-means, k is the no. of subgroups you want the data to be segregated into ? * Value of k is decided by elbow method or can be initialised randomly as well * Once the value of k is initiliazed, we take the nearest data points from each centroid * The measure of distance between the data points and centroids can be calculated using either `Euclidean Distance` or `Manhattan Distance` - -### Iteration +**Iteration** * We assign a cluster to the data point of the nearest centroid * Once all the points are assigned to their nearest centroids, then for each cluster the centroid is calculated again. * With the new centroids, we repeat the step of cluster assignment. From ceadcdc703073c5648edf4622b909e8028ca86ac Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Thu, 28 May 2020 00:45:21 +0530 Subject: [PATCH 32/98] Update Understanding the algorithm.md --- k_means_clustering/Understanding the algorithm.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/k_means_clustering/Understanding the algorithm.md b/k_means_clustering/Understanding the algorithm.md index f8f0459..e442b3e 100644 --- a/k_means_clustering/Understanding the algorithm.md +++ b/k_means_clustering/Understanding the algorithm.md @@ -17,11 +17,13 @@ Given some **unlabelled** data points, we have to identify subgroups such that We have the data for a large social networking company which is planning to host meetups for their users. We have the users' location data. Now the VP of Growth want you to `choose the` **meetup locations** `so it becomes convinient for everyone to attend` ### Intuition + **Initialization** * In k-means, k is the no. of subgroups you want the data to be segregated into ? 
* Value of k is decided by elbow method or can be initialised randomly as well * Once the value of k is initiliazed, we take the nearest data points from each centroid * The measure of distance between the data points and centroids can be calculated using either `Euclidean Distance` or `Manhattan Distance` + **Iteration** * We assign a cluster to the data point of the nearest centroid * Once all the points are assigned to their nearest centroids, then for each cluster the centroid is calculated again. From 74062d2617993ba643123020a63706ede04255ff Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Thu, 28 May 2020 00:46:04 +0530 Subject: [PATCH 33/98] Update Understanding the algorithm.md --- k_means_clustering/Understanding the algorithm.md | 1 - 1 file changed, 1 deletion(-) diff --git a/k_means_clustering/Understanding the algorithm.md b/k_means_clustering/Understanding the algorithm.md index e442b3e..151d521 100644 --- a/k_means_clustering/Understanding the algorithm.md +++ b/k_means_clustering/Understanding the algorithm.md @@ -28,7 +28,6 @@ We have the data for a large social networking company which is planning to host * We assign a cluster to the data point of the nearest centroid * Once all the points are assigned to their nearest centroids, then for each cluster the centroid is calculated again. * With the new centroids, we repeat the step of cluster assignment. - * The above three steps are iterated as long as there is no change in cluster assigment of data points. 
### Choosing K value - Elbow method From 7ab31ebe7d74984c038b4d4ed6f17c5488c2f518 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Thu, 28 May 2020 00:52:09 +0530 Subject: [PATCH 34/98] Update Understanding the algorithm.md --- .../Understanding the algorithm.md | 22 +++++++++---------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/k_means_clustering/Understanding the algorithm.md b/k_means_clustering/Understanding the algorithm.md index 151d521..9c91e87 100644 --- a/k_means_clustering/Understanding the algorithm.md +++ b/k_means_clustering/Understanding the algorithm.md @@ -18,17 +18,17 @@ We have the data for a large social networking company which is planning to host ### Intuition -**Initialization** -* In k-means, k is the no. of subgroups you want the data to be segregated into ? -* Value of k is decided by elbow method or can be initialised randomly as well -* Once the value of k is initiliazed, we take the nearest data points from each centroid +* In K-means, `k` is the `no. of subgroups` you want the data to be segregated into ? +* Optimal value of `k` can be derived by using `elbow method` (discussed below) +**Centroid Initialization** +* We begin by initializing `k` random data points as the centroids (first pass) * The measure of distance between the data points and centroids can be calculated using either `Euclidean Distance` or `Manhattan Distance` **Iteration** -* We assign a cluster to the data point of the nearest centroid -* Once all the points are assigned to their nearest centroids, then for each cluster the centroid is calculated again. +* **Cluster assigment:** We assign a cluster to the data point that is nearest to it. +* Once all the points are assigned to their nearest centroids, then for each cluster the centroid is calculated again using centroid initialization step. * With the new centroids, we repeat the step of cluster assignment. 
-* The above three steps are iterated as long as there is no change in cluster assigment of data points. +* These two steps are iterated as long as `there is no change in cluster assigment of data points` i.e. no data point is moving into a new cluster. ### Choosing K value - Elbow method * Elbow method gives us an idea on what a good k number of clusters. @@ -40,7 +40,7 @@ Here I am increasing the k value by 1 from `1 to 10` and printing the sum of squ ### Note -* Kmeans gives more weight to the bigger clusters. -* Kmeans assumes spherical shapes of clusters (with radius equal to the distance between the centroid and the furthest data point) and doesn’t work well when clusters are in different shapes such as elliptical clusters. -* If there is overlapping between clusters, kmeans doesn’t have an intrinsic measure for uncertainty for the examples belong to the overlapping region in order to determine for which cluster to assign each data point. -* Kmeans may still cluster the data even if it can’t be clustered such as data that comes from uniform distributions. +* K-means gives more weight to the bigger clusters. +* K-means assumes spherical shapes of clusters (with radius equal to the distance between the centroid and the furthest data point) and doesn’t work well when clusters are in different shapes such as elliptical clusters. +* If there is overlapping between clusters, K-means doesn’t have an intrinsic measure for uncertainty for the examples belong to the overlapping region in order to determine for which cluster to assign each data point. +* K-means may still cluster the data even if it can’t be clustered such as data that comes from uniform distributions. 
From ce5c36bb1fdfc3f68cad832043be729cfa31290d Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Wed, 1 Jul 2020 22:08:34 +0530 Subject: [PATCH 35/98] Create Anamoly_Detection_notes.md --- Anamoly_Detection_notes.md | 48 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 48 insertions(+) create mode 100644 Anamoly_Detection_notes.md diff --git a/Anamoly_Detection_notes.md b/Anamoly_Detection_notes.md new file mode 100644 index 0000000..090e60e --- /dev/null +++ b/Anamoly_Detection_notes.md @@ -0,0 +1,48 @@ +Inspired from the following [blog post](https://iwringer.wordpress.com/2015/11/17/anomaly-detection-concepts-and-techniques/): +Kudos to [Srinath Perera](https://www.linkedin.com/in/srinathperera) for writing this 👍 + +## Anomaly Detection + +![Image](https://iwringer.files.wordpress.com/2015/11/anomelydetectionmethods.jpg?w=656) + +Four common classes of machine learning applications: + +a. classification
+b. predicting next value [also known as regression]
+c. anomaly detection
+d. discovering data structure
+ +### Anomaly Detection +As the name suggests, the core focus of anomaly detection is to identify data points that do not align with the rest of the data. In statistics, these data points are also referred to as `outliers`. + +#### Outliers +Outliers have a **significant effect on the mean and the standard deviation** of your data, and hence your results are skewed if they are not dealt with properly. + +#### Applications of Anomaly Detection +Here are some of the examples where anomaly detection is heavily employed: +a. fraud detection
+b. surveillance
+c. diagnosis
+d. data cleanup
+e. monitoring predictive maintenance [IoT devices] + +##### Since data is categorised as anomalous and non-anomalous, can't we solve it using classification ? +This assumption is correct as long as the following three conditions hold good: + +a. Training data present with us is labelled
+b. Anomalous and non-anomalous classes are balanced (at least 1:5 proportion)
+c. Present data point is not dependent on past data points [not suitable for time series] + +#### Reality +a. Hard to obtain labelled training data all the time
+b. Real-life scenarios have heavily imbalanced classes, for e.g. fraud detection in credit cards can have the distribution of 1:10^x where x can go from 3 to 6
+c. One more caveat is that of precision and recall scores for such classifiers. What is the cost of a false positive compared with the cost of a false negative ?
+[**Precision** measures how many of the anomalies flagged by the classifier are truly anomalies]
+[**Recall** measures how many of the actual anomalies the classifier is able to capture] + +### Types of Anomalies +a. **Point Anomalies**: An individual data instance is considered anomalous with respect to the rest of the data (e.g. purchase with a large transaction value)
+b. **Contextual Anomalies**: The instance of data is considered anomalous with respect to its context, but not otherwise (e.g. large spike in a trend in the middle of the night)
+c. **Collective Anomalies**: Unlike the previous two, here we consider a collection of data instances making up for an anomaly with respect to the rest of data + i. Events that are actually ordered but showing a degree of disorder (e.g. rhythm in ECG) + ii. Unexpected value comnbinations (e.g. buying a large number of expensive items) From 213061abe7328e66fb57a6418a56ed78743903e9 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Wed, 1 Jul 2020 22:11:01 +0530 Subject: [PATCH 36/98] Update Anamoly_Detection_notes.md --- Anamoly_Detection_notes.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/Anamoly_Detection_notes.md b/Anamoly_Detection_notes.md index 090e60e..7383458 100644 --- a/Anamoly_Detection_notes.md +++ b/Anamoly_Detection_notes.md @@ -43,6 +43,6 @@ c. One more caveat is that of precision and recall scores for such classifiers ? ### Types of Anomalies a. **Point Anomalies**: Individual instance of data is considered as anomalous with respect to rest of data (e.g. purchase with a large transaction value)
 b. **Contextual Anomalies**: The instance of data is considered anomalous with respect to its context, but not otherwise (e.g. large spike in a trend in the middle of the night)
-c. **Collective Anomalies**: Unlike the previous two, here we consider a collection of data instances making up for an anomaly with respect to the rest of data - i. Events that are actually ordered but showing a degree of disorder (e.g. rhythm in ECG) - ii. Unexpected value comnbinations (e.g. buying a large number of expensive items) +c. **Collective Anomalies**: Unlike the previous two, here we consider a collection of data instances making up for an anomaly with respect to the rest of data
+ i. Events that are actually ordered but showing a degree of disorder (e.g. rhythm in ECG)
+ ii. Unexpected value combinations (e.g. buying a large number of expensive items)
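The precision and recall definitions in the notes above can be made concrete with a small calculation. The sketch below assumes labels are encoded as `1` for anomalous and `0` for normal; the function name `precision_recall` and the sample labels are made up for illustration and do not come from the notes.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the anomalous class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # anomalies caught
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # anomalies missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical labels: 3 anomalies among 10 records.
    y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
    y_pred = [0, 0, 0, 0, 0, 1, 0, 1, 1, 0]
    print(precision_recall(y_true, y_pred))

    # A classifier that never flags anything still scores 70% accuracy here,
    # yet its recall is zero, which is why accuracy alone misleads on imbalanced data.
    print(precision_recall(y_true, [0] * len(y_true)))
```

With heavily imbalanced classes, such as the 1:10^x fraud ratios mentioned above, the second case becomes even more extreme: accuracy approaches 100% while no anomaly is ever detected.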
From 9cc4b8405abcf57747ec893a38cfedf9554233c9 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Sat, 11 Jul 2020 17:06:57 +0530 Subject: [PATCH 37/98] Create Understanding Vanishing Gradient.md --- Understanding Vanishing Gradient.md | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) create mode 100644 Understanding Vanishing Gradient.md diff --git a/Understanding Vanishing Gradient.md b/Understanding Vanishing Gradient.md new file mode 100644 index 0000000..a8ca83d --- /dev/null +++ b/Understanding Vanishing Gradient.md @@ -0,0 +1,20 @@ +# Understanding Vanishing Gradients in Neural Networks + +![Vanishing Gradient](https://i.stack.imgur.com/YUlyb.jpg) + +### Introduction + +We all know that neural networks perform learning through the process of forward pass and backward pass.
+This cycle goes on until we find an optimal value for the cost function that we are trying to minimize.
+The optimization happens with the help of gradient descent.
+ +Graidents are computed in each of the layers of the neural network. When the number of layers are increased, the value of the cost function starts approaching towards zero, since we are trying to minimize the cost. +This impacts the training of the neural network and it takes longer for the network to converge. + +The good news is that this is the case when only certain activation functions are used in the neural networks. + +### Why does it happen ? + +A very commonly used activation function is the sigmoid function. + +![Sigmoid Function and its Derivative](https://miro.medium.com/max/1000/1*6A3A_rt4YmumHusvTvVTxw.png) From f8b716a1e2a1b3919286dc22b65331f552517c15 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Sat, 11 Jul 2020 17:29:31 +0530 Subject: [PATCH 38/98] Update Understanding Vanishing Gradient.md --- Understanding Vanishing Gradient.md | 33 ++++++++++++++++++++++++++--- 1 file changed, 30 insertions(+), 3 deletions(-) diff --git a/Understanding Vanishing Gradient.md b/Understanding Vanishing Gradient.md index a8ca83d..68a0123 100644 --- a/Understanding Vanishing Gradient.md +++ b/Understanding Vanishing Gradient.md @@ -8,13 +8,40 @@ We all know that neural networks perform learning through the process of forward This cycle goes on until we find a optimal value for the cost function that we are trying to minimize.
 The optimization happens with the help of gradient descent.
-Graidents are computed in each of the layers of the neural network. When the number of layers are increased, the value of the cost function starts approaching towards zero, since we are trying to minimize the cost. -This impacts the training of the neural network and it takes longer for the network to converge. +### What are gradients ? +Gradients are the derivative of a function. It determines how much change happens when the input to the function is changed by a very big number -The good news is that this is the case when only certain activation functions are used in the neural networks. +Gradients of neural networks are found using backpropagation(backward pass as mentioned above).
+1. Backpropogation finds the derivatives of the network by moving layer by layer from the final layer to the initial one. +2. By the chain rule, the derivatives of each layer are multiplied down the network (from the final layer to the initial) to compute the derivatives of the initial layers. ### Why does it happen ? A very commonly used activation function is the sigmoid function. +The sigmoid function squashes the input value into a range of 0 to 1. Hence if there is a large change in the value, there is not much change in the output by the sigmoid. Hence the derivative of this function is very small. + +The graph below also shows us the same picture. For very large or small values of x, the derivative of sigmoid is very small (almost closer to zero) + ![Sigmoid Function and its Derivative](https://miro.medium.com/max/1000/1*6A3A_rt4YmumHusvTvVTxw.png) + +### How does it impact ? + +As explained above, we are multiplying gradients with each other in the bacward pass step using chain rule. So when we are multiplying a lot of small numbers (almost near zero quantities). The gradient value is descreased very sharply. + +A small gradient means that the weights and biases of the initial layers will not be updated effectively with each training session. + +**Since these initial layers are often crucial to recognizing the core elements of the input data, it can lead to overall inaccuracy of the whole network.** + +### Solutions to the vanishing gradients + +1. We can use other other activation function like `Relu` +` Relu(x) = max(x,0)` + +2. Using residual networks is also an effective solution where we add the input value X to the next layer before applying the activation. This way the overall derivative is not reduced to a small value. Refer the diagram below. + +![A Residual Block](https://miro.medium.com/max/385/1*mxJ5gBvZnYPVo0ISZE5XkA.png) + +3. Batch normalization is also an effective solution. 
We normalize the input value x ==> |x| so that it does not have extremely large or small values and hence the derivative is not very small. We limit the input function to a small range and hence the output from the sigmoid also remains normal. We can see the same behavior that the green region does not have very small derivatives. Refer the diagram below + +![Sigmoig function with limited values](https://miro.medium.com/max/700/1*XCtAytGsbhRQnu-x7Ynr0Q.png) From 91b25673e7750e134ae6f640d2315c2feaa0ca57 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Sat, 11 Jul 2020 17:34:26 +0530 Subject: [PATCH 39/98] Update Understanding Vanishing Gradient.md --- Understanding Vanishing Gradient.md | 18 ++++++++++++------ 1 file changed, 12 insertions(+), 6 deletions(-) diff --git a/Understanding Vanishing Gradient.md b/Understanding Vanishing Gradient.md index 68a0123..04915ef 100644 --- a/Understanding Vanishing Gradient.md +++ b/Understanding Vanishing Gradient.md @@ -1,5 +1,7 @@ # Understanding Vanishing Gradients in Neural Networks +#### Credits: Thanks to [Chi-Feng Wang](https://towardsdatascience.com/@reina.wang) for writing this [article](https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484) + ![Vanishing Gradient](https://i.stack.imgur.com/YUlyb.jpg) ### Introduction @@ -9,17 +11,18 @@ This cycle goes on until we find a optimal value for the cost function that we a The optmization happens with the help of gradient descent.
### What are gradients ? -Gradients are the derivative of a function. It determines how much change happens when the input to the function is changed by a very big number +A gradient is the derivative of a function. It measures how much the output of the function changes when its input is changed by a very small amount
Gradients of neural networks are found using backpropagation(backward pass as mentioned above).
+1. Backpropagation finds the derivatives of the network by moving layer by layer from the final layer to the initial one.
2. By the chain rule, the derivatives of each layer are multiplied down the network (from the final layer to the initial) to compute the derivatives of the initial layers. ### Why does it happen ? A very commonly used activation function is the sigmoid function. -The sigmoid function squashes the input value into a range of 0 to 1. Hence if there is a large change in the value, there is not much change in the output by the sigmoid. Hence the derivative of this function is very small. +The sigmoid function squashes the input value into a range of 0 to 1.
+Hence, even a large change in the input produces only a small change in the sigmoid's output, so the derivative of this function is small (at most 0.25, reached at x = 0).
The graph below also shows us the same picture. For very large or small values of x, the derivative of sigmoid is very small (almost closer to zero) @@ -27,7 +30,8 @@ The graph below also shows us the same picture. For very large or small values o ### How does it impact ? -As explained above, we are multiplying gradients with each other in the bacward pass step using chain rule. So when we are multiplying a lot of small numbers (almost near zero quantities). The gradient value is descreased very sharply. +As explained above, we are multiplying gradients with each other in the backward pass step using the chain rule.
+So when we multiply a lot of small numbers (near-zero quantities), the gradient value decreases very sharply.
A small gradient means that the weights and biases of the initial layers will not be updated effectively with each training session. @@ -38,10 +42,12 @@ A small gradient means that the weights and biases of the initial layers will no 1. We can use other other activation function like `Relu` ` Relu(x) = max(x,0)` -2. Using residual networks is also an effective solution where we add the input value X to the next layer before applying the activation. This way the overall derivative is not reduced to a small value. Refer the diagram below. +2. Using residual networks is also an effective solution where we add the input value X to the next layer before applying the activation.
+This way the overall derivative is not reduced to a small value. Refer to the diagram below. ![A Residual Block](https://miro.medium.com/max/385/1*mxJ5gBvZnYPVo0ISZE5XkA.png) -3. Batch normalization is also an effective solution. We normalize the input value x ==> |x| so that it does not have extremely large or small values and hence the derivative is not very small. We limit the input function to a small range and hence the output from the sigmoid also remains normal. We can see the same behavior that the green region does not have very small derivatives. Refer the diagram below +3. Batch normalization is also an effective solution. We normalize the inputs to a layer (subtracting the batch mean and dividing by the batch standard deviation) so that they do not take extremely large or small values and hence the derivative is not very small.
+We limit the input function to a small range and hence the output from the sigmoid also remains normal. We can see the same behavior that the green region does not have very small derivatives. Refer the diagram below ![Sigmoig function with limited values](https://miro.medium.com/max/700/1*XCtAytGsbhRQnu-x7Ynr0Q.png) From 0aa97ce2b2bb8ce0fdaf4a21a3304ce5deefadb8 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Sat, 11 Jul 2020 17:39:14 +0530 Subject: [PATCH 40/98] Update Understanding Vanishing Gradient.md --- Understanding Vanishing Gradient.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/Understanding Vanishing Gradient.md b/Understanding Vanishing Gradient.md index 04915ef..a0dcaff 100644 --- a/Understanding Vanishing Gradient.md +++ b/Understanding Vanishing Gradient.md @@ -4,6 +4,12 @@ ![Vanishing Gradient](https://i.stack.imgur.com/YUlyb.jpg) +### TL;DR +The gradient used in backprop is calculated using the derivative chain rule, meaning it is a product of about as many factors as there are layers (in a vanilla feedforward net).
+If all those factors are e.g. between 0 and 1 (e.g. due to the choice of 'squishing' activation functions), and some are very small (typical in the earlier layers and when activations are saturated), then the overall product (gradient) will get very small, near zero.
+The risk of this happening grows with the number of factors (the number of layers).
+The problem is that this may happen for a weight configuration that is nowhere near optimal, yet training will slow down or stop + ### Introduction We all know that neural networks perform learning through the process of forward pass and backward pass.
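The TL;DR above describes the gradient as a product of many factors that can each be small. A rough numerical sketch of that effect follows. It makes a simplifying assumption that is not in the notes: it multiplies only the sigmoid derivatives, one per layer, and ignores the weight terms that also appear in the chain rule. The function names are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, squashing any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x)), never larger than 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)


def chained_gradient_factor(pre_activations):
    """Product of sigmoid derivatives, one factor per layer (weights ignored)."""
    product = 1.0
    for z in pre_activations:
        product *= sigmoid_derivative(z)
    return product


if __name__ == "__main__":
    # Best case for sigmoid: every layer sits at x = 0, where the derivative peaks.
    for depth in (1, 5, 10, 20):
        print(depth, chained_gradient_factor([0.0] * depth))
    # Saturated activations (inputs far from 0) shrink the product even faster.
    print(chained_gradient_factor([4.0] * 5))
```

Even in the best case each layer contributes a factor of at most 0.25, so the product shrinks geometrically with depth. This is the behaviour that ReLU, residual connections and batch normalization try to avoid.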
From 35543408e422f42efa1359a42ca2303810e24dd8 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Sun, 12 Jul 2020 10:30:02 +0530 Subject: [PATCH 41/98] Create Understanding SQL Queries.md --- Understanding SQL Queries.md | 120 +++++++++++++++++++++++++++++++++++ 1 file changed, 120 insertions(+) create mode 100644 Understanding SQL Queries.md diff --git a/Understanding SQL Queries.md b/Understanding SQL Queries.md new file mode 100644 index 0000000..2a09d7b --- /dev/null +++ b/Understanding SQL Queries.md @@ -0,0 +1,120 @@ +### Three SQL Concepts you Must Know to Pass the Data Science Interview + +#### Credits: Thanks to Jay Feng for writing this [article](https://www.interviewquery.com/blog-three-sql-questions-you-must-know-to-pass/) + +#### 1. Getting the first or last value for each user in a `transactions` table. + +`transactions` + +| column_name | data_type | +--- | --- | +| user_id | int | +| created_at | datetime| +| product | varchar | + +##### Question: Given the user transactions table above, write a query to get the first purchase for each user. + +#### Solution: + +We want to take a table that looks like this: + + user_id | created_at | product + --- | --- | --- + 123 | 2019-01-01 | apple + 456 | 2019-01-02 | banana + 123 | 2019-01-05 | pear + 456 | 2019-01-10 | apple + 789 | 2019-01-11 | banana + +and turn it into this + + user_id | created_at | product + --- | --- | --- + 123 | 2019-01-01 | apple + 456 | 2019-01-02 | banana + 789 | 2019-01-11 | banana + + The solution can be broken into two parts: + - First make a table of `user_id` and the first purchase (i.e. minimum create date). We can get this by the following query + +``` +SELECT + user_id, MIN(created_at) AS min_created_at +FROM + transactions +GROUP BY 1 +``` + +- Now all we have to do is join this table back to the original on two columns: `user_id` and `created_at`.
+The self join will effectively filter for the first purchase.
+Then all we have to do is grab all of the columns on the left side table. + +``` +SELECT + t.user_id, t.created_at, t.product +FROM + transactions AS t + INNER JOIN ( + SELECT user_id, MIN(created_at) AS min_created_at + FROM transactions + GROUP BY 1 + ) AS t1 ON (t.user_id = t1.user_id AND t.created_at = t1.min_created_at) +``` + +#### 2. Knowing the difference between a LEFT JOIN and INNER JOIN in practice. + + `users` + + +| column_name | data_type | +--- | --- | +| id | int | +| name | varchar | +| city_id | int | + +`city_id` is `id` in the `cities` table + +`cities` +| column_name | data_type | +--- | --- | +| id | int | +| name | varchar | + + +##### Question: Given the `users` and `cities` tables above, write a query to return the list of cities without any users. + +This question aims to test the candidate's understanding of the LEFT JOIN and INNER JOIN + +##### What is the actual difference between a LEFT JOIN and INNER JOIN? + +**INNER JOIN**: returns rows when there is a match in __both tables__.
+**LEFT JOIN**: returns all rows from the left table, __even if there are no matches in the right table__. + +#### Solution: + +We know that each user in the users table must live in a city given the city_id field.
+However, the `cities` table doesn’t have a `user_id` field.
+So if we run an INNER JOIN between these two tables on `users.city_id = cities.id`, we’ll get all of the cities that have users and __all of the cities without users will be filtered out.__ + +But what if we run a LEFT JOIN between cities and users? + +cities.name | users.id +--- | --- | +seattle | 123 +seattle | 124 +portland | null +san diego | 534 +san diego | 564 + +Here we see that, because a LEFT JOIN keeps every row from the table on the LEFT side (`cities`) and no user in the database matches the city of Portland, the `users.id` value for that city shows up as NULL.
+Therefore now all we have to do is run a __WHERE filter to where any value in the users table is NULL.__ + +``` +SELECT + cities.name, users.id +FROM + cities + LEFT JOIN users ON users.city_id = cities.id +WHERE + users.id IS NULL +``` From 9613bb36ffb9ba99e1fa699ff97c5cae3e1bcdd9 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Sun, 12 Jul 2020 11:48:26 +0530 Subject: [PATCH 42/98] Update Understanding SQL Queries.md --- Understanding SQL Queries.md | 40 ++++++++++++++++++++++++++++++++++++ 1 file changed, 40 insertions(+) diff --git a/Understanding SQL Queries.md b/Understanding SQL Queries.md index 2a09d7b..f4bdd22 100644 --- a/Understanding SQL Queries.md +++ b/Understanding SQL Queries.md @@ -118,3 +118,43 @@ FROM WHERE users.id IS NULL ``` + +#### 3. Aggregations with a conditional statement + +`transactions` +| column_name | data_type | +--- | --- | +| user_id | int | +| created_at | datetime| +| product | varchar | + +##### Question: Given the same user transactions table as before,write a query to get the total purchases made in the morning versus afternoon/evening (AM vs PM) by day. + +We are comparing two groups. Every time we have to compare two groups we must use a GROUP BY + +In this case, we need to create a separate column to actually run our GROUP BY on, which in this case, is the difference between AM or PM in the `created_at` field. + +``` +CASE + WHEN HOUR(created_at) > 11 THEN 'PM' + ELSE 'AM' +END AS time_of_day +``` + +We can cast the created_at column to the hour and set the new column value time_of_day as AM or PM based on this condition. + +Now we just have to run a GROUP BY on the original `created_at` field truncated to the day AND the new column we created that differentiates each row value.
+The last aggregation will then be the output variable we want which is total purchases by running the COUNT function. + +``` +SELECT + DATE_TRUNC('day', created_at) AS date + ,CASE + WHEN HOUR(created_at) > 11 THEN 'PM' + ELSE 'AM' + END AS time_of_day + ,COUNT(*) +FROM + transactions +GROUP BY 1,2 +``` From 7146185a4d2848de23a7a66c57508e2324680d59 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Sun, 12 Jul 2020 12:09:28 +0530 Subject: [PATCH 43/98] Update Understanding SQL Queries.md --- Understanding SQL Queries.md | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/Understanding SQL Queries.md b/Understanding SQL Queries.md index f4bdd22..d235831 100644 --- a/Understanding SQL Queries.md +++ b/Understanding SQL Queries.md @@ -158,3 +158,20 @@ FROM transactions GROUP BY 1,2 ``` +### Bonus Questions + +#### 4.Write an SQL query that makes recommendations using the pages that your friends liked. Assume you have two tables: + +`usersAndFriends` +| column_name | data_type | +--- | --- | +| user_id | int | +| friend | int| + +`usersLikedPages` +| column_name | data_type | +--- | --- | +| user_id | int | +| page | varchar| + +#### It should not recommend pages you already like. From 9c0d04d250ae0930afdfcad0a70df82d18884494 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Sun, 12 Jul 2020 12:10:18 +0530 Subject: [PATCH 44/98] Update Understanding SQL Queries.md --- Understanding SQL Queries.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Understanding SQL Queries.md b/Understanding SQL Queries.md index d235831..69fc22d 100644 --- a/Understanding SQL Queries.md +++ b/Understanding SQL Queries.md @@ -172,6 +172,6 @@ GROUP BY 1,2 | column_name | data_type | --- | --- | | user_id | int | -| page | varchar| +| liked_page | varchar| #### It should not recommend pages you already like. 
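The conditional aggregation from question 3 (a `CASE` column plus a `GROUP BY` on day and AM/PM) can be tried end to end with a small sketch. This is an adaptation, not the query from the notes: it assumes an in-memory SQLite database, made-up sample rows, and SQLite's `date()` and `strftime('%H', ...)` in place of `DATE_TRUNC` and `HOUR`, which SQLite does not provide.

```python
import sqlite3

# In-memory SQLite database with a toy `transactions` table (rows are made up).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (user_id INTEGER, created_at TEXT, product TEXT)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        (123, "2019-01-01 09:15:00", "apple"),
        (456, "2019-01-01 14:30:00", "banana"),
        (123, "2019-01-01 20:05:00", "pear"),
        (789, "2019-01-02 08:00:00", "banana"),
    ],
)

# date(...) truncates to the day; strftime('%H', ...) extracts the hour as text,
# so it is cast to an integer before comparing with 11.
query = """
SELECT
    date(created_at) AS day,
    CASE
        WHEN CAST(strftime('%H', created_at) AS INTEGER) > 11 THEN 'PM'
        ELSE 'AM'
    END AS time_of_day,
    COUNT(*) AS total_purchases
FROM transactions
GROUP BY day, time_of_day
ORDER BY day, time_of_day
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Each output row is one (day, AM/PM) group with its purchase count, which is the shape the question asks for. On engines that support `HOUR` and `DATE_TRUNC`, the query from the notes can be used unchanged.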
From 15b0facced1048b29a64da33e42d17ebf8c00cc7 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Sun, 12 Jul 2020 12:13:10 +0530 Subject: [PATCH 45/98] Update Understanding SQL Queries.md --- Understanding SQL Queries.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Understanding SQL Queries.md b/Understanding SQL Queries.md index 69fc22d..3ab32ba 100644 --- a/Understanding SQL Queries.md +++ b/Understanding SQL Queries.md @@ -172,6 +172,6 @@ GROUP BY 1,2 | column_name | data_type | --- | --- | | user_id | int | -| liked_page | varchar| +| page_id | int| #### It should not recommend pages you already like. From ee46b18dc9608adacb69e3548deab892bfc64cfb Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Thu, 23 Jul 2020 13:39:52 +0530 Subject: [PATCH 46/98] Update Understanding SQL Queries.md --- Understanding SQL Queries.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/Understanding SQL Queries.md b/Understanding SQL Queries.md index 3ab32ba..25da707 100644 --- a/Understanding SQL Queries.md +++ b/Understanding SQL Queries.md @@ -175,3 +175,11 @@ GROUP BY 1,2 | page_id | int| #### It should not recommend pages you already like. + +#### 5.Write an SQL query that shows percentage change month over month in daily active users. Assume you have a table: + +`usersAndFriends` +| column_name | data_type | +--- | --- | +| user_id | int | +| date | date| From fb1d08c7e4499c255c9cce5abcec9b1e18b1a1ec Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Thu, 23 Jul 2020 13:40:35 +0530 Subject: [PATCH 47/98] Update Understanding SQL Queries.md --- Understanding SQL Queries.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Understanding SQL Queries.md b/Understanding SQL Queries.md index 25da707..5e658d7 100644 --- a/Understanding SQL Queries.md +++ b/Understanding SQL Queries.md @@ -178,7 +178,7 @@ GROUP BY 1,2 #### 5.Write an SQL query that shows percentage change month over month in daily active users. 
Assume you have a table: -`usersAndFriends` +`logins` | column_name | data_type | --- | --- | | user_id | int | From 56d1d9377f8da4293881e56dadf68207fcf5595c Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Tue, 27 Oct 2020 21:34:53 +0000 Subject: [PATCH 48/98] Bump cryptography from 2.3.1 to 3.2 Bumps [cryptography](https://github.com/pyca/cryptography) from 2.3.1 to 3.2. - [Release notes](https://github.com/pyca/cryptography/releases) - [Changelog](https://github.com/pyca/cryptography/blob/master/CHANGELOG.rst) - [Commits](https://github.com/pyca/cryptography/compare/2.3.1...3.2) Signed-off-by: dependabot[bot] --- requirements.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements.txt b/requirements.txt index 6b4806b..e26064b 100644 --- a/requirements.txt +++ b/requirements.txt @@ -9,7 +9,7 @@ beautifulsoup4==4.6.3 certifi==2018.8.24 cffi==1.11.5 chardet==3.0.4 -cryptography==2.3.1 +cryptography==3.2 cycler==0.10.0 h5py==2.9.0 idna==2.7 From 648b98c774f174bff729daf573c114eef15a401b Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Mon, 30 Nov 2020 01:06:58 +0530 Subject: [PATCH 49/98] Create interview_prep.md --- interview_prep.md | 4 ++++ 1 file changed, 4 insertions(+) create mode 100644 interview_prep.md diff --git a/interview_prep.md b/interview_prep.md new file mode 100644 index 0000000..7fb89e6 --- /dev/null +++ b/interview_prep.md @@ -0,0 +1,4 @@ +### 1. 
What is multi-collinearity + +When two or more predictors are highly correlated to each other such that one predictor +can be derived using the linear combinations of other predictors, then the predictors are said to be collinear From bafb8c8ac2b68a8cfab28c49f4632e0c7b3141e6 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Sun, 3 Jan 2021 01:15:02 +0530 Subject: [PATCH 50/98] Update interview_prep.md --- interview_prep.md | 33 +++++++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+) diff --git a/interview_prep.md b/interview_prep.md index 7fb89e6..13f11d3 100644 --- a/interview_prep.md +++ b/interview_prep.md @@ -2,3 +2,36 @@ When two or more predictors are highly correlated to each other such that one predictor can be derived using the linear combinations of other predictors, then the predictors are said to be collinear + +### 2. What is the difference between standardisation and normalization ? Why is it useful? +### 3. What is the central limit theorem ? Why is it useful ? +### 4. What is the inter quartile range ? Why is it useful ? +### 5. What is the difference between t-test and z-test ? Why is it useful ? +### 6. Why do we take n-1 when calculating sample variance? Why is it useful ? +Read about Bessel's correction +### 7. What are the assumptions of the normal distribution ? Why is it useful ? +### 8. What are the different approaches to outlier detection ? How will you handle the outliers? Why is it useful ? +### 9. Where is RMSE a bad case ? How do we solve this ? +### 10. What are the loss functions used in logistic regression ? +log loss function +### 11. Explain random forest in layman's terms ? +### 12. How does logistic regression work in layman's terms ? +### 13. Why is logistic regression bad idea for multiclass classification ? +### 14. How do you perform the train test split in a timeseries modelling ? +### 15. What is the impact on timeseries model in case we have large variation in data ? +### 16.
How do you decide the value of K(value of clusters) in K-means clustering ? +### 17. What are the advantages and disadvantages of undersampling and oversampling ? +### 18. Which are some supervised algorithms that are not impacted by imbalanced data ? +### 19. You are a placement coordinator, you have to design a system for resume recommendation aligning to a company's requirement ? +a. K means clustering to make clusters +b. Ranking algorithm to sort for relevance + +_Second Strategy_ + +a. Perform document similarity using Hamming distance (distance based approach) +b. Compute the JD document distance with the resumes +c. Shortlist top K resumes + +### 20. How will you encode a feature like PinCode which has very high number of discrete values? +Target mean encoding +### 21. How do you design the architecture of a neural network? From bf98e117fae13abeecfae4a12f3eb0f0a5150dd2 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Sun, 3 Jan 2021 11:53:41 +0530 Subject: [PATCH 51/98] Update interview_prep.md --- interview_prep.md | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+) diff --git a/interview_prep.md b/interview_prep.md index 13f11d3..3a5b1aa 100644 --- a/interview_prep.md +++ b/interview_prep.md @@ -35,3 +35,24 @@ c. Shortlist top K resumes ### 20. How will you encode a feature like PinCode which has very high number of discrete values? Target mean encoding ### 21. How do you design the architecture of a neural network? 
+ +## Section II + +| Algorithm | Problem Identification | Evaluation Metric | Bias Variance | Impact of outliers | Impact of imbalanced data | | +|-------------------------|------------------------|-------------------------------------------------------------------------------------------------|---------------|--------------------|---------------------------|---| +| Linear Regression | Regression | - Coefficient of determination (R2) - Adjusted R2 - Root Mean Square Error (RMSE) - Mean Absolute Error (MAE) - Root Mean Squared Logarithmic Error (RMSLE)| - High Bias Low Variance | -Impacted by outliers | | | +| Logistic Regression | Classification | - Accuracy - Precision - Recall - F-beta score - Area under ROC curve | - High Bias Low Variance | -Impacted by outliers | | | +| Support Vector Machines | Classification | - Accuracy - Precision - Recall - F-beta score - Area under ROC curve | - Low Bias High Variance | Sensitive to outliers | Sensitive to imbalanced data | | +| K-nearest neighbors | Classification | - Accuracy - Precision - Recall - F-beta score - Area under ROC curve | - Low Bias High Variance | Sensitive to outliers | Sensitive to imbalanced data | | +| Decision Tree | Both | Both | - Low Bias High Variance | - Not impacted by outliers | - Not impacted by imbalanced data | | +| Random Forest | Both | Both | - Low Bias High Variance | - Not impacted by outliers | - Not impacted by imbalanced data | | +| K-means clustering | Clustering | - Elbow method - Silhoutte Analysis | | | | | +| | | | | | | | + +### 22. Why do CNNs perfom better with images ? (What is it that CNN achieve better than ANN when delaing with image data) +### 23. Explain K-means clustering in laymen terms ? +### 24. What is the evaluation metric for K-means clustering ? +### 25. What is the impact of outliers on K-means clustering ? +### 26. What is the impact of outliers on K-nearest neigbors ? +### 25. What is the impact of imbalanced data on K-means clustering ? +### 26. 
What is the impact of imbalanced data on K-nearest neigbors ? From 9f1cf005b9936393c18a42243709041188c788c9 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Sun, 3 Jan 2021 11:56:27 +0530 Subject: [PATCH 52/98] Update interview_prep.md --- interview_prep.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/interview_prep.md b/interview_prep.md index 3a5b1aa..768e81e 100644 --- a/interview_prep.md +++ b/interview_prep.md @@ -46,7 +46,7 @@ Target mean encoding | K-nearest neighbors | Classification | - Accuracy - Precision - Recall - F-beta score - Area under ROC curve | - Low Bias High Variance | Sensitive to outliers | Sensitive to imbalanced data | | | Decision Tree | Both | Both | - Low Bias High Variance | - Not impacted by outliers | - Not impacted by imbalanced data | | | Random Forest | Both | Both | - Low Bias High Variance | - Not impacted by outliers | - Not impacted by imbalanced data | | -| K-means clustering | Clustering | - Elbow method - Silhoutte Analysis | | | | | +| K-means clustering | Clustering | - Elbow method - Silhoutte Analysis | | Senstive to Outliers | | | | | | | | | | | ### 22. Why do CNNs perfom better with images ? (What is it that CNN achieve better than ANN when delaing with image data) From f99e3bc7344614065bd8cbcfecc4461673939510 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Sun, 3 Jan 2021 12:14:42 +0530 Subject: [PATCH 53/98] Update interview_prep.md --- interview_prep.md | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/interview_prep.md b/interview_prep.md index 768e81e..809a5a8 100644 --- a/interview_prep.md +++ b/interview_prep.md @@ -51,8 +51,9 @@ Target mean encoding ### 22. Why do CNNs perfom better with images ? (What is it that CNN achieve better than ANN when delaing with image data) ### 23. Explain K-means clustering in laymen terms ? -### 24. What is the evaluation metric for K-means clustering ? -### 25. What is the impact of outliers on K-means clustering ? -### 26. 
What is the impact of outliers on K-nearest neigbors ? -### 25. What is the impact of imbalanced data on K-means clustering ? -### 26. What is the impact of imbalanced data on K-nearest neigbors ? +### 24. Does a low coefficient of determination always mean that my model is bad or vice versa ? Explain. +* R-squared does not measure goodness of fit. +* R-squared does not measure predictive error. +* R-squared does not allow you to compare models using transformed responses. +* R-squared does not measure how one variable explains another. +Ref:- https://data.library.virginia.edu/is-r-squared-useless/#:~:text=Let's%20recap%3A-,R%2Dsquared%20does%20not%20measure%20goodness%20of%20fit.,how%20one%20variable%20explains%20another. From 2a2d03a1d4385976fe83b3b4649ee4d342893406 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Sun, 3 Jan 2021 12:15:29 +0530 Subject: [PATCH 54/98] Update interview_prep.md --- interview_prep.md | 1 + 1 file changed, 1 insertion(+) diff --git a/interview_prep.md b/interview_prep.md index 809a5a8..0105f8a 100644 --- a/interview_prep.md +++ b/interview_prep.md @@ -56,4 +56,5 @@ Target mean encoding * R-squared does not measure predictive error. * R-squared does not allow you to compare models using transformed responses. * R-squared does not measure how one variable explains another. + Ref:- https://data.library.virginia.edu/is-r-squared-useless/#:~:text=Let's%20recap%3A-,R%2Dsquared%20does%20not%20measure%20goodness%20of%20fit.,how%20one%20variable%20explains%20another. 
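The first bullet (R-squared does not measure goodness of fit) can be illustrated with a small simulation. This is a sketch under stated assumptions, not code from the referenced article: the true relationship `y = 1 + 2x`, the noise levels, the seed and the helper name `r_squared` are all chosen here for illustration. The same correct linear model is fitted twice, and only the noise level changes.

```python
import random


def r_squared(xs, ys):
    """R^2 of the ordinary least-squares line fitted to (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


rng = random.Random(0)  # fixed seed so the run is repeatable
xs = [i / 199 for i in range(200)]  # 200 evenly spaced points in [0, 1]

# Same true relationship y = 1 + 2x in both cases; only the noise level differs.
low_noise = [1 + 2 * x + rng.gauss(0, 0.1) for x in xs]
high_noise = [1 + 2 * x + rng.gauss(0, 10) for x in xs]

r2_low = r_squared(xs, low_noise)
r2_high = r_squared(xs, high_noise)
print(f"R^2 with small noise: {r2_low:.3f}")
print(f"R^2 with large noise: {r2_high:.3f}")
```

In both runs a straight line is the right model, yet the second R-squared comes out close to zero because the noise variance dominates. A low R-squared on its own therefore does not show that the model form is wrong.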
From 2bec118e2b9ba09c3d5f2ba967446f8d99321ef3 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Sun, 3 Jan 2021 12:36:07 +0530 Subject: [PATCH 55/98] Update interview_prep.md --- interview_prep.md | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/interview_prep.md b/interview_prep.md index 0105f8a..8e04c44 100644 --- a/interview_prep.md +++ b/interview_prep.md @@ -58,3 +58,22 @@ Target mean encoding * R-squared does not measure how one variable explains another. Ref:- https://data.library.virginia.edu/is-r-squared-useless/#:~:text=Let's%20recap%3A-,R%2Dsquared%20does%20not%20measure%20goodness%20of%20fit.,how%20one%20variable%20explains%20another. + +### 24. What is the difference between probability and likelihood ? +### 25. What is the difference between generative and discriminative models ? +### 26. How is a decision tree pruned ? +### 27. What do you understand by the bias variance tradeoff ? + +![](https://djsaunde.files.wordpress.com/2017/07/bias-variance-tradeoff.png) + +Bias is the error that comes from a model being too simple to fit the data well. Variance tells us how much the model changes when the training data changes. +a. Very simple models have high bias and low variance, e.g. linear models. +b. Very complex models have low bias and high variance, e.g. tree based models. Hence they are more prone to overfitting. + +How to deal with them ?
+ +| High Bias | High Variance | | +|----------------------------------------------|---------------|---| +| Try getting additional features | Try getting more training examples | | +| Try adding polynomial features | Try smaller set of features | | +| Try to decrease the regularization parameter | Try to increase the regularization parameter | | From 643cce532982290bad9eb603cff9747850dc7340 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Sun, 3 Jan 2021 12:57:26 +0530 Subject: [PATCH 56/98] Update interview_prep.md --- interview_prep.md | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/interview_prep.md b/interview_prep.md index 8e04c44..a16d333 100644 --- a/interview_prep.md +++ b/interview_prep.md @@ -4,6 +4,17 @@ When two or more predictors are highly correlated to each other such that one pr can be derived using the linear combinations of other predictors, then the predictors are said to be collinear ### 2. What is the difference between standardisation and normalization ? Why is it useful? +Standardisation is a scclaing technique in which values are shifted and rescaled so that the mean is 0 and the variance is 1 + +Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling + +* Algorithms which use gradient descent based optimisation (linear regression, logistic regression, neural networks) will require features to be scaled so that optimization will be faster and the convergence will be more accurate. +* **Having features on a similar scale can help the gradient descent converge more quickly towards the minima.** +* Distance algorithms like KNN, K-means, and SVM are most affected by the range of features. This is because behind the scenes they are using distances between data points to determine their similarity. 
+* **Therefore, we scale our data before employing a distance based algorithm so that all the features contribute equally to the result.** + +![](https://i.pinimg.com/originals/1c/16/04/1c160466f8bfd26ca66a44f79514fb5d.jpg) + ### 3. What is the central limit theorem ? Why is it useful ? ### 4. What is the inter quartile range ? Why is it useful ? ### 5. What is the difference between t-test and z-test ? Why is it useful ? From 7f065d665abff80a285763dc5ae08ed1cb945a1d Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Sun, 3 Jan 2021 13:05:25 +0530 Subject: [PATCH 57/98] Update interview_prep.md --- interview_prep.md | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/interview_prep.md b/interview_prep.md index a16d333..ba84d0c 100644 --- a/interview_prep.md +++ b/interview_prep.md @@ -16,6 +16,16 @@ Normalization is a scaling technique in which values are shifted and rescaled so ![](https://i.pinimg.com/originals/1c/16/04/1c160466f8bfd26ca66a44f79514fb5d.jpg) ### 3. What is the central limit theorem ? Why is it useful ? +The Central Limit Theorem is about how the sum of many different independent random variables tends towards a normal distribution (bell curve). + +For example: suppose you're rolling two 6-sided dice. The rolls are all independent because one of the rolls doesn't affect any of the other rolls. For a single die, each of 1, 2, 3, 4, 5 and 6 has the same chance. + +But if you take the sum of the 2 dice, you will notice that you have a 1/36 chance to get a 2, 2/36 chance to get a 3, 3/36 chance to get a 4, ..., up until 6/36 chance of getting a 7, then the chance decreases again until you're back at 1/36 chance of getting a 12. This is because the values in the middle, like 7, can be reached by getting 1+6, 6+1, 2+5, 5+2, 3+4 and 4+3, whereas the edges like 2 require a single very specific result (1+1) where every single die needs to land on 1.
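The two-dice probabilities can be checked by brute-force enumeration. The sketch below is illustrative only; the helper name `sum_distribution` is made up here, and it simply counts how many of the equally likely outcomes produce each total.

```python
from collections import Counter
from itertools import product


def sum_distribution(num_dice, sides=6):
    """Count how many equally likely outcomes produce each total."""
    faces = range(1, sides + 1)
    return Counter(sum(roll) for roll in product(faces, repeat=num_dice))


two_dice = sum_distribution(2)
total_outcomes = 6 ** 2
for value in sorted(two_dice):
    print(f"{value:>2}: {two_dice[value]}/{total_outcomes}")
```

Calling `sum_distribution` with a larger `num_dice` gives the counts for more dice, where the totals in the middle pile up further and the shape moves closer to a bell curve.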
+ +If you further increase the number of dice you roll, the edge cases become less and less likely, because they keep requiring very specific results, whereas the results in the middle become more likely. The more dice you add, the more it will eventually look like a bell curve. + +![](https://prwatech.in/blog/wp-content/uploads/2019/06/CetralLimitThm-1024x512.png) + ### 4. What is the inter quartile range ? Why is it useful ? ### 5. What is the difference between t-test and z-test ? Why is it useful ? ### 6. Why do we take n-1 when calculating sample variance? Why is it useful ? From a59d0b18778b1790e29cf4a70ff4b5ad3dcca781 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Sun, 3 Jan 2021 13:49:52 +0530 Subject: [PATCH 58/98] Update interview_prep.md --- interview_prep.md | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+) diff --git a/interview_prep.md b/interview_prep.md index ba84d0c..c21497d 100644 --- a/interview_prep.md +++ b/interview_prep.md @@ -27,6 +27,28 @@ If you further increase the number of dice you roll, the edge cases become less ![](https://prwatech.in/blog/wp-content/uploads/2019/06/CetralLimitThm-1024x512.png) ### 4. What is the inter quartile range ? Why is it useful ? + +The interquartile range is a measure of where the “middle fifty” is in a data set. Where a range is a measure of where the beginning and end are in a set, an interquartile range is a measure of where the bulk of the values lie. That’s why it’s preferred over many other measures of spread when reporting things like school performance or SAT scores. + +The interquartile range formula is the first quartile subtracted from the third quartile: +IQR = Q3 – Q1. + +![](https://naysan.ca/wp-content/uploads/2020/06/box_plot_ref_needed.png) + +#### IQR as a test of normality in a distribution + +Use the interquartile range formula with the mean and standard deviation to test whether or not a population has a normal distribution. 
The formulas to determine whether or not a population is normally distributed are: +Q1 = (σ z1) + X +Q3 = (σ z3) + X +Where Q1 is the first quartile, Q3 is the third quartile, σ is the standard deviation, z1 and z3 are the standard scores (“z-scores”) of the first and third quartiles of a standard normal distribution (about -0.6745 and +0.6745) and X is the mean. In order to tell whether a population is normally distributed, compute the right-hand side of both equations and compare the results with the observed quartiles. If there is a significant difference between the results and the first or third quartiles, then the population is not normally distributed. + +#### IQR as an instrument to detect outliers and to determine the spread of data +The interquartile range is the difference between the third quartile (Q3) and the first quartile (Q1). It is also called the midspread or middle fifty. The quartile deviation is half of the IQR. + +Some of its applications include determining the spread of data. It is used in the construction of a box plot. It is a good indicator of spread because it is robust, with a breakdown point of 25%. The breakdown point is the fraction of incorrect observations a statistic can tolerate before it starts giving a wrong description of the data set. A 25% breakdown point is robust, as it needs a quarter of the data to be incorrect before it reflects an incorrect spread. + +The IQR is also used to determine outliers in the data set. This is done in conjunction with the box plot (or the box-and-whisker plot). Outliers are defined as values that are below `Q1 - 1.5*IQR` or above `Q3 + 1.5*IQR`. There are other methods that could be used to determine whether outliers can be eliminated from the data set. + ### 5. What is the difference between t-test and z-test ? Why is it useful ? ### 6. Why do we take n-1 when calculating sample variance? Why is it useful ?
Read about Besel correction From 6d93a3350e3e0822a158e3de19243cb7266f9a60 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Sun, 3 Jan 2021 13:57:01 +0530 Subject: [PATCH 59/98] Update interview_prep.md --- interview_prep.md | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/interview_prep.md b/interview_prep.md index c21497d..93dec6c 100644 --- a/interview_prep.md +++ b/interview_prep.md @@ -50,6 +50,18 @@ Some of its applications include determining the spread of data. It is used in t The IQR is also used to determine outliers to the data set. This is in conjuction with the box plot (or the box-and-whisker plot). Outliers are defined as values that are below Q1-1.5*IQR or above Q3+1.5*IQR. There are other methods that could be used to determine whether outliers can be eliminated from the data set. ### 5. What is the difference between t-test and z-test ? Why is it useful ? + +![](https://www.wallstreetmojo.com/wp-content/uploads/2019/01/Z-Test-vs-T-Test.png) + + +| Basis | Z Test | T-Test | +|-------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:| +| Basic Definition | Z-test is a kind of hypothesis test which ascertains if the averages of the 2 datasets are different from each other when standard deviation or variance is given. | The t-test can be referred to as a kind of parametric test that is applied to an identity, how the averages of 2 sets of data differ from each other when the standard deviation or variance is not given. | +| Population Variance | The Population variance or standard deviation is known here. | The Population variance or standard deviation is unknown here. 
| +| Sample Size | The Sample size is large. | Here the Sample Size is small. | +| Key Assumptions | All data points are independent. Normal Distribution for Z, with an average zero and variance = 1. | All data points are not dependent. Sample values are to be recorded and taken accurately. | +| Based upon (a type of distribution) | Based on Normal distribution. | Based on Student-t distribution. | + ### 6. Why do we take n-1 when calculating sample variance? Why is it useful ? Read about Besel correction ### 7. What are the assumptions of the normal distribution ? Why is it useful ? From d1b0a91b415a2ca5cbdf5369d85be887f8c9cf13 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Mon, 4 Jan 2021 23:24:11 +0530 Subject: [PATCH 60/98] Create use_cases_insurnace.md --- use_cases_insurnace.md | 7 +++++++ 1 file changed, 7 insertions(+) create mode 100644 use_cases_insurnace.md diff --git a/use_cases_insurnace.md b/use_cases_insurnace.md new file mode 100644 index 0000000..8beb3e5 --- /dev/null +++ b/use_cases_insurnace.md @@ -0,0 +1,7 @@ +###Lapse management: Identifies policies that are likely to lapse, and how to approach the insured about maintaining the policy. +Recommendation engine: Given similar customers, discovers where individual insureds may have too much, or too little, insurance. Then, proactively help them get the right insurance for their current situation. +Assessor assistant: Once a car has been towed to a body shop, use computer vision to help the assessor identify issues which need to be fixed. This helps accuracy, speeds an assessment, and keeps the customer informed with any repairs. +Property analysis: Given images of a property, identifies structures on the property and any condition issues. Insurers can proactively help customers schedule repairs by identifying issues in their roofs, or suggest other coverage when new structures, like a swimming pool, are installed. +Fraud detection: Identifies claims which are potentially fraudulent. 
+Personalized offers: Improves the customer experience by offering relevant information about the coverage the insured may need based on life events, such as the birth of a child, purchase of a home or car. +Experience studies: Uses unsupervised machine learning to discover predictors in claims activity. This information can help set assumptions and feed into activities such as pricing models, risk analyses, and other actuarial analyses. From aeb0e2983e73dc02184bdcafbfbe34f5f69c5254 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Tue, 5 Jan 2021 00:17:57 +0530 Subject: [PATCH 61/98] Update interview_prep.md --- interview_prep.md | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/interview_prep.md b/interview_prep.md index 93dec6c..5c3437a 100644 --- a/interview_prep.md +++ b/interview_prep.md @@ -132,3 +132,13 @@ How to deal with them ? | Try getting additional features | Try getting more training examples | | | Try adding polynomial features | Try smaller set of features | | | Try to decrease the regularization parameter | Try to increase the regularization parameter | | + +### 28. What is right or left skewness ? +### 29. What is difference between bootstrapping and k-folds cross validation ? +overlapping of sample does not happen for k-folds cross validation +### 30. Which model is better: n_estimators=10 and n_estimators=30 ? +### 31. Why do we use activation functions in neural networks ? +### 32. What is the purpose of the optimizers ? +### 33. How does the stochastic gradient descent optimizer work ? 
+ + From 1ee7330593356edb28b5a7adf9602376d97dbcdf Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Tue, 5 Jan 2021 09:39:20 +0530 Subject: [PATCH 62/98] Update use_cases_insurnace.md --- use_cases_insurnace.md | 54 ++++++++++++++++++++++++++++++++++++------ 1 file changed, 47 insertions(+), 7 deletions(-) diff --git a/use_cases_insurnace.md b/use_cases_insurnace.md index 8beb3e5..49c3e89 100644 --- a/use_cases_insurnace.md +++ b/use_cases_insurnace.md @@ -1,7 +1,47 @@ -###Lapse management: Identifies policies that are likely to lapse, and how to approach the insured about maintaining the policy. -Recommendation engine: Given similar customers, discovers where individual insureds may have too much, or too little, insurance. Then, proactively help them get the right insurance for their current situation. -Assessor assistant: Once a car has been towed to a body shop, use computer vision to help the assessor identify issues which need to be fixed. This helps accuracy, speeds an assessment, and keeps the customer informed with any repairs. -Property analysis: Given images of a property, identifies structures on the property and any condition issues. Insurers can proactively help customers schedule repairs by identifying issues in their roofs, or suggest other coverage when new structures, like a swimming pool, are installed. -Fraud detection: Identifies claims which are potentially fraudulent. -Personalized offers: Improves the customer experience by offering relevant information about the coverage the insured may need based on life events, such as the birth of a child, purchase of a home or car. -Experience studies: Uses unsupervised machine learning to discover predictors in claims activity. This information can help set assumptions and feed into activities such as pricing models, risk analyses, and other actuarial analyses. 
+#### Reference:- https://activewizards.com/blog/top-10-data-science-use-cases-in-insurance/ + +## Other use cases + +### Lapse management: +##### Identifies policies that are likely to lapse, and how to approach the insured about maintaining the policy. Calculate the probability to lapse + +### Recommendation engine: +##### Given similar customers, discovers where individual insureds may have too much, or too little, insurance. Then, proactively help them get the right insurance for their current situation. + +### Assessor assistant: +##### Once a car has been towed to a body shop, use computer vision to help the assessor identify issues which need to be fixed. This helps accuracy, speeds an assessment, and keeps the customer informed with any repairs. Car damage detection + +### Property analysis: +##### Given images of a property, identifies structures on the property and any condition issues. Insurers can proactively help customers schedule repairs by identifying issues in their roofs, or suggest other coverage when new structures, like a swimming pool, are installed. + +### Fraud detection: +##### Identifies claims which are potentially fraudulent. Rare events problem. Class imbalance is a huge challenge here + +### Personalized offers: +##### Improves the customer experience by offering relevant information about the coverage the insured may need based on life events, such as the birth of a child, purchase of a home or car. + +### Claims processing +##### Claims processing includes multiple tasks, including review, investigation, adjustment, remittance, or denial. While performing these tasks, numerous issues might occur: + +* Manual/inconsistent processing: Many claims processing tasks require human interaction that is prone to errors. +* Varying data formats: Customers send data in different formats to make claims. +* Changing regulation: Businesses need to accord in changing regulations promptly. 
Thus, constant staff training and process update are required for these companies. + +### Claims document processing +As customers make claims when they are in an uncomfortable position, customer experience and speed are critical in these processes. Thanks to document capture technologies, businesses can rapidly handle large volumes of documents required for claims processing tasks, detect fraudulent claims, and check if claims fit regulations. + +### Application processing +Application processing requires extracting information from a high volume of documents. While performing this task manually can take too long and prone to errors, document capture technologies enable insurance companies to automatically extract relevant data from application documents and accelerate insurance application processes with fewer errors and improved customer satisfaction. + +### Insurance pricing +AI can assess customers’ risk profiles based on lab testing, biometric data, claims data, patient-generated health data, and identify the optimal prices to quote with the right insurance plan. This would decrease the workflow in business operations and reduce costs while improving customer satisfaction. + +### Document creation +Insurance companies need to generate high volumes of documents, including specific information about the insurer. While creating these documents manually consume time and prone to errors, using AI and automation technologies can generate policy statements without mistakes. + +### Responding to customer queries +Conversational AI technologies can support insurance companies for faster replies to customer queries. For example, a South African insurance company, Hollard, has achieved 98% automation and reduced cost per transaction by 91%, according to its solution providers, LarcAI and UiPath. 
+ + + + From 910faae3d4f5cb77601b355eb5a9fd039b2be1df Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Tue, 5 Jan 2021 10:10:33 +0530 Subject: [PATCH 63/98] Update interview_prep.md --- interview_prep.md | 24 +++++++++++++++++++++++- 1 file changed, 23 insertions(+), 1 deletion(-) diff --git a/interview_prep.md b/interview_prep.md index 5c3437a..3f757d9 100644 --- a/interview_prep.md +++ b/interview_prep.md @@ -63,7 +63,29 @@ The IQR is also used to determine outliers to the data set. This is in conjuctio | Based upon (a type of distribution) | Based on Normal distribution. | Based on Student-t distribution. | ### 6. Why do we take n-1 when calculating sample variance? Why is it useful ? -Read about Besel correction +Read about Besel correction for more technical definition + +##### Intuitive explaination + +If you are giving the standard deviation of an entire population and not a sample you actually do divide by n. However, the denominator is not referencing the number of observations, it's actually referencing degrees of freedom, which is n-1. For you to understand degrees of freedom I would recommend this example using hats. + +Basically you divide by the number of things you need to 'know' before you can fill in the blanks yourself. If you are using an entire population, you need every single example as you can't just fill in the blanks. But if you have a sample, you can know all but the last one before you can fill in the blank. + +##### Example + +![](https://ae01.alicdn.com/kf/HTB1XFW0JXXXXXcKXFXXq6xXFXXX1/225440714/HTB1XFW0JXXXXXcKXFXXq6xXFXXX1.jpg) + +Imagine you have a huge bookshelf. You measure the total thickness of the first 6 books and it turns out to be 158mm. This means that the mean thickness of a book based on first 6 samples is 26.3mm. +Now you take out and measure the first book's thickness (one degree of freedom) and find that it is 22mm. 
This means that the remaining 5 books must have a total thickness of 136mm +Now you measure the second book (second degree of freedom) and find it to be 28mm. So you know that the remaining 4 books should have a total thickness of 108mm . +. +. +In this way, by the time you measure the thickness of the 5th book individually (5th degree of freedom) , you automatically know the thickness of the remaining 1 book. + +This means that you automatically know the thickness of 6th book even though you have measured only 5. Extrapolating this concept, In a sample of size n, you know the value of the n'th observation even though you have only taken (n-1) measurements. i.e, the opportunity to vary has been taken away for the n'th observation. + +This means that if you have measured (n-1) objects then the nth object has no freedom to vary. Therefore, degree of freedom is only (n-1) and not n. + ### 7. What are the assumptions of the normal distribution ? Why is it useful ? ### 8. What are the different approches to outlier detection ? How will you handle the outliers? Why is it useful ? ### 9. Where is RMSE a bad case ? How do we solve this ? From 6b2817883c4fa9bdefdf924e3d0b00668d2fde0f Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Wed, 6 Jan 2021 23:03:53 +0530 Subject: [PATCH 64/98] Update interview_prep.md --- interview_prep.md | 38 ++++++++++++++++++++++++++++++++++++-- 1 file changed, 36 insertions(+), 2 deletions(-) diff --git a/interview_prep.md b/interview_prep.md index 3f757d9..03bfd93 100644 --- a/interview_prep.md +++ b/interview_prep.md @@ -88,9 +88,43 @@ This means that if you have measured (n-1) objects then the nth object has no fr ### 7. What are the assumptions of the normal distribution ? Why is it useful ? ### 8. What are the different approches to outlier detection ? How will you handle the outliers? Why is it useful ? -### 9. Where is RMSE a bad case ? How do we solve this ? +### 9. How you assess OLS regression models ? 
+Three statistics are used in Ordinary Least Squares (OLS) regression to evaluate model fit: +* R-squared, +* the overall F-test, and +* the Root Mean Square Error (RMSE). + +All three are based on two sums of squares: Sum of Squares Total (SST) and Sum of Squares Error (SSE). SST measures how far the data are from the mean, and SSE measures how far the data are from the model’s predicted values. Different combinations of these two values provide different information about how the regression model compares to the mean model. + +##### R-squared and Adjusted R-squared + +The difference between SST and SSE is the improvement in prediction from the regression model, compared to the mean model. Dividing that difference by SST gives R-squared. It is the proportional improvement in prediction from the regression model, compared to the mean model. **It indicates the goodness of fit of the model.** + +R-squared has the useful property that its scale is intuitive: it ranges from zero to one, with zero indicating that the proposed model does not improve prediction over the mean model, and one indicating perfect prediction. Improvement in the regression model results in proportional increases in R-squared. + +One pitfall of R-squared is that it can only increase as predictors are added to the regression model. This increase is artificial when predictors are not actually improving the model’s fit. To remedy this, a related statistic, Adjusted R-squared, incorporates the model’s degrees of freedom. **Adjusted R-squared will decrease as predictors are added if the increase in model fit does not make up for the loss of degrees of freedom. Likewise, it will increase as predictors are added if the increase in model fit is worthwhile.** Adjusted R-squared should always be used with models with more than one predictor variable. It is interpreted as the proportion of total variance that is explained by the model. 
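The relationship between SST, SSE, R-squared and Adjusted R-squared described above can be checked numerically. The following is a minimal sketch, not part of the original notes: the helper name `ols_fit_statistics`, the synthetic data, and the `noise_feature` column are illustrative assumptions, and the adjustment used is the common form `1 - (1 - R²)(n - 1)/(n - p - 1)` for a model with an intercept and `p` predictors.

```python
import numpy as np

def ols_fit_statistics(X, y):
    """Fit OLS with an intercept; return SST, SSE, R-squared and adjusted R-squared.

    Hypothetical helper written for illustration only.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape  # n observations, p predictors (intercept not counted)

    # Add a column of ones so the model includes an intercept.
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    y_hat = design @ coef

    sst = np.sum((y - y.mean()) ** 2)  # how far the data are from the mean model
    sse = np.sum((y - y_hat) ** 2)     # how far the data are from the fitted model
    r2 = (sst - sse) / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return {"sst": sst, "sse": sse, "r2": r2, "adj_r2": adj_r2}

# Synthetic data (assumption): y depends on x1 only; noise_feature is unrelated to y.
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
noise_feature = rng.normal(size=n)
y = 3.0 + 2.0 * x1 + rng.normal(scale=1.0, size=n)

one_predictor = ols_fit_statistics(x1.reshape(-1, 1), y)
two_predictors = ols_fit_statistics(np.column_stack([x1, noise_feature]), y)

print("one predictor :", one_predictor)
print("two predictors:", two_predictors)
```

Adding `noise_feature` can never lower R-squared, because the smaller model is nested in the larger one. Adjusted R-squared is penalised for the extra degree of freedom, so it will typically (though not on every random draw) stay flat or fall when the added predictor carries no signal.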
+ +There are situations in which a high R-squared is not necessary or relevant. When the interest is in the relationship between variables, not in prediction, the R-square is less important. An example is a study on how religiosity affects health outcomes. A good result is a reliable relationship between religiosity and health. No one would expect that religion explains a high percentage of the variation in health, as health is affected by many other factors. Even if the model accounts for other variables known to affect health, such as income and age, an R-squared in the range of 0.10 to 0.15 is reasonable. + +![](https://miro.medium.com/max/1954/1*iFgJVgavYdENdtkssTS6pA.png) + +##### The F-test + +The F-test evaluates the null hypothesis that all regression coefficients are equal to zero versus the alternative that at least one is not. An equivalent null hypothesis is that R-squared equals zero. A significant F-test indicates that the observed R-squared is reliable and is not a spurious result of oddities in the data set. **Thus the F-test determines whether the proposed relationship between the response variable and the set of predictors is statistically reliable and can be useful when the research objective is either prediction or explanation.** + +##### RMSE + +The RMSE is the square root of the variance of the residuals. It indicates the absolute fit of the model to the data–how close the observed data points are to the model’s predicted values. **Whereas R-squared is a relative measure of fit, RMSE is an absolute measure of fit.** As the square root of a variance, RMSE can be interpreted as the standard deviation of the unexplained variance, and has the useful property of being in the same units as the response variable. **Lower values of RMSE indicate better fit. 
RMSE is a good measure of how accurately the model predicts the response, and it is the most important criterion for fit if the main purpose of the model is prediction.** + +##### NOTE: The best measure of model fit depends on the researcher’s objectives, and more than one are often useful. The statistics discussed above are applicable to regression models that use OLS estimation. Many types of regression models, however, such as mixed models, generalized linear models, and event history models, use maximum likelihood estimation. + ### 10. What are the loss functions used in logistic regression ? -log loss function + +![](https://miro.medium.com/max/548/1*rdBw0E-My8Gu3f_BOB6GMA.png) + +where y is the label (1 for event and 0 for non-event) and p(y) is the predicted probability of the event happening for all N observations. +Reading this formula, it tells you that, for each time the event occcurs (y=1), it adds log(p(y)) to the loss, that is, the log probability of event happening. Conversely, it adds log(1-p(y)), that is, the log probability of event not happening, for each non-event (y=0) + ### 11. Explain random forest in laymen terms ? ### 12. How does logisitc regression work in laymen terms ? ### 13. Why is logistic regression bad idea for multiclass classification ? From 31e26ea5cae7abcee59200086095fa242d9c398b Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Wed, 6 Jan 2021 23:22:55 +0530 Subject: [PATCH 65/98] Update interview_prep.md --- interview_prep.md | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/interview_prep.md b/interview_prep.md index 03bfd93..faa489c 100644 --- a/interview_prep.md +++ b/interview_prep.md @@ -126,6 +126,23 @@ where y is the label (1 for event and 0 for non-event) and p(y) is the predicted Reading this formula, it tells you that, for each time the event occcurs (y=1), it adds log(p(y)) to the loss, that is, the log probability of event happening. 
Conversely, it adds log(1-p(y)), that is, the log probability of event not happening, for each non-event (y=0) ### 11. Explain random forest in laymen terms ? + +Say you have three job offers and you wish to decide which is the best among them, you have the following criterion you use to shortlist a job offer like +* tools and technology +* company brand +* health insurance +* support for education +* compensation +* travel time +* joining bonus etc. + +You reach out to 10 of your connections on LinkedIn and ask them which is the best comapny to join based on 3 random criteria (for eg. c2, c3, c5) +You make different combinations of criteria while asking to different connections. At the end, you finally select company which is recommended the most from all the responses. + +##### This is how a random forest also works +![](https://miro.medium.com/max/690/0*Ry4NWdoTXjSjMfrE) + +you reach out to 50 of your different connections and ask them based on different param ### 12. How does logisitc regression work in laymen terms ? ### 13. Why is logistic regression bad idea for multiclass classification ? ### 14. How do you perform the train test split in a timeseries modelling ? From 3a0c8688208bda7dc864fbeeb2ded0fcdc74348c Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Wed, 6 Jan 2021 23:26:13 +0530 Subject: [PATCH 66/98] Update interview_prep.md --- interview_prep.md | 1 - 1 file changed, 1 deletion(-) diff --git a/interview_prep.md b/interview_prep.md index faa489c..990eb8f 100644 --- a/interview_prep.md +++ b/interview_prep.md @@ -142,7 +142,6 @@ You make different combinations of criteria while asking to different connection ##### This is how a random forest also works ![](https://miro.medium.com/max/690/0*Ry4NWdoTXjSjMfrE) -you reach out to 50 of your different connections and ask them based on different param ### 12. How does logisitc regression work in laymen terms ? ### 13. Why is logistic regression bad idea for multiclass classification ? ### 14. 
How do you perform the train test split in a timeseries modelling ? From baad68d2a531be6aecd75de5612bc15301cb2d2c Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Wed, 6 Jan 2021 23:28:28 +0530 Subject: [PATCH 67/98] Update interview_prep.md --- interview_prep.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/interview_prep.md b/interview_prep.md index 990eb8f..141d40a 100644 --- a/interview_prep.md +++ b/interview_prep.md @@ -140,7 +140,7 @@ You reach out to 10 of your connections on LinkedIn and ask them which is the be You make different combinations of criteria while asking to different connections. At the end, you finally select company which is recommended the most from all the responses. ##### This is how a random forest also works -![](https://miro.medium.com/max/690/0*Ry4NWdoTXjSjMfrE) +![](https://cdn.analyticsvidhya.com/wp-content/uploads/2020/02/rfc_vs_dt1.png) ### 12. How does logisitc regression work in laymen terms ? ### 13. Why is logistic regression bad idea for multiclass classification ? From 115f0a8f23b08686576705c85cdfd22ff8228aa4 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Wed, 10 Feb 2021 01:39:22 +0000 Subject: [PATCH 68/98] Bump cryptography from 3.2 to 3.3.2 Bumps [cryptography](https://github.com/pyca/cryptography) from 3.2 to 3.3.2. 
- [Release notes](https://github.com/pyca/cryptography/releases) - [Changelog](https://github.com/pyca/cryptography/blob/master/CHANGELOG.rst) - [Commits](https://github.com/pyca/cryptography/compare/3.2...3.3.2) Signed-off-by: dependabot[bot] --- requirements.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements.txt b/requirements.txt index e26064b..c830d94 100644 --- a/requirements.txt +++ b/requirements.txt @@ -9,7 +9,7 @@ beautifulsoup4==4.6.3 certifi==2018.8.24 cffi==1.11.5 chardet==3.0.4 -cryptography==3.2 +cryptography==3.3.2 cycler==0.10.0 h5py==2.9.0 idna==2.7 From b1b208a79f75be8233261b88d10b33e7de57fbed Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon, 29 Mar 2021 18:43:38 +0000 Subject: [PATCH 69/98] Bump pygments from 2.3.1 to 2.7.4 Bumps [pygments](https://github.com/pygments/pygments) from 2.3.1 to 2.7.4. - [Release notes](https://github.com/pygments/pygments/releases) - [Changelog](https://github.com/pygments/pygments/blob/master/CHANGES) - [Commits](https://github.com/pygments/pygments/compare/2.3.1...2.7.4) Signed-off-by: dependabot[bot] --- requirements.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements.txt b/requirements.txt index c830d94..8db7b8b 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,7 +1,7 @@ Keras==2.2.4 Keras-Preprocessing==1.0.5 PySocks==1.6.8 -Pygments==2.3.1 +Pygments==2.7.4 Quandl==3.4.5 asn1crypto==0.24.0 backcall==0.1.0 From 8dd8a1e3672c94e30687ab8bf1e5dc39dd57d176 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Wed, 9 Jun 2021 17:46:51 +0000 Subject: [PATCH 70/98] Bump pip from 10.0.1 to 19.2 Bumps [pip](https://github.com/pypa/pip) from 10.0.1 to 19.2. 
- [Release notes](https://github.com/pypa/pip/releases) - [Changelog](https://github.com/pypa/pip/blob/main/NEWS.rst) - [Commits](https://github.com/pypa/pip/compare/10.0.1...19.2) --- updated-dependencies: - dependency-name: pip dependency-type: direct:production ... Signed-off-by: dependabot[bot] --- requirements.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements.txt b/requirements.txt index 8db7b8b..b9324d9 100644 --- a/requirements.txt +++ b/requirements.txt @@ -24,7 +24,7 @@ pandas==0.23.4 patsy==0.5.0 pexpect==4.6.0 pickleshare==0.7.5 -pip==10.0.1 +pip==19.2 ptyprocess==0.6.0 pyOpenSSL==18.0.0 pycparser==2.19 From c3317dd57d96967a9934150f58039e02baa516df Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon, 15 Nov 2021 17:48:10 +0000 Subject: [PATCH 71/98] Bump pip from 19.2 to 21.1 Bumps [pip](https://github.com/pypa/pip) from 19.2 to 21.1. - [Release notes](https://github.com/pypa/pip/releases) - [Changelog](https://github.com/pypa/pip/blob/main/NEWS.rst) - [Commits](https://github.com/pypa/pip/compare/19.2...21.1) --- updated-dependencies: - dependency-name: pip dependency-type: direct:production ... Signed-off-by: dependabot[bot] --- requirements.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements.txt b/requirements.txt index b9324d9..ed895c6 100644 --- a/requirements.txt +++ b/requirements.txt @@ -24,7 +24,7 @@ pandas==0.23.4 patsy==0.5.0 pexpect==4.6.0 pickleshare==0.7.5 -pip==19.2 +pip==21.1 ptyprocess==0.6.0 pyOpenSSL==18.0.0 pycparser==2.19 From 2664498ec023784f9ab5d5285fa3c41ad0f28f79 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Fri, 21 Jan 2022 19:40:42 +0000 Subject: [PATCH 72/98] Bump ipython from 7.2.0 to 7.16.3 Bumps [ipython](https://github.com/ipython/ipython) from 7.2.0 to 7.16.3. 
- [Release notes](https://github.com/ipython/ipython/releases) - [Commits](https://github.com/ipython/ipython/compare/7.2.0...7.16.3) --- updated-dependencies: - dependency-name: ipython dependency-type: direct:production ... Signed-off-by: dependabot[bot] --- requirements.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements.txt b/requirements.txt index ed895c6..5a9cb4d 100644 --- a/requirements.txt +++ b/requirements.txt @@ -14,7 +14,7 @@ cycler==0.10.0 h5py==2.9.0 idna==2.7 inflection==0.3.1 -ipython==7.2.0 +ipython==7.16.3 jedi==0.13.2 kiwisolver==1.0.1 matplotlib==3.0.0 From d46d3a77b1aa7828433922328478ded0af4118bd Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Wed, 30 Mar 2022 04:29:43 +0000 Subject: [PATCH 73/98] Bump numpy from 1.15.2 to 1.21.0 Bumps [numpy](https://github.com/numpy/numpy) from 1.15.2 to 1.21.0. - [Release notes](https://github.com/numpy/numpy/releases) - [Changelog](https://github.com/numpy/numpy/blob/main/doc/HOWTO_RELEASE.rst.txt) - [Commits](https://github.com/numpy/numpy/compare/v1.15.2...v1.21.0) --- updated-dependencies: - dependency-name: numpy dependency-type: direct:production ... Signed-off-by: dependabot[bot] --- requirements.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements.txt b/requirements.txt index 5a9cb4d..e3781d9 100644 --- a/requirements.txt +++ b/requirements.txt @@ -19,7 +19,7 @@ jedi==0.13.2 kiwisolver==1.0.1 matplotlib==3.0.0 more-itertools==5.0.0 -numpy==1.15.2 +numpy==1.21.0 pandas==0.23.4 patsy==0.5.0 pexpect==4.6.0 From b3befe6daccae5289b23ce9b4dc700ad9c22437b Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Wed, 22 Jun 2022 05:13:07 +0000 Subject: [PATCH 74/98] Bump numpy from 1.21.0 to 1.22.0 Bumps [numpy](https://github.com/numpy/numpy) from 1.21.0 to 1.22.0. 
- [Release notes](https://github.com/numpy/numpy/releases) - [Changelog](https://github.com/numpy/numpy/blob/main/doc/HOWTO_RELEASE.rst) - [Commits](https://github.com/numpy/numpy/compare/v1.21.0...v1.22.0) --- updated-dependencies: - dependency-name: numpy dependency-type: direct:production ... Signed-off-by: dependabot[bot] --- requirements.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements.txt b/requirements.txt index e3781d9..fb81870 100644 --- a/requirements.txt +++ b/requirements.txt @@ -19,7 +19,7 @@ jedi==0.13.2 kiwisolver==1.0.1 matplotlib==3.0.0 more-itertools==5.0.0 -numpy==1.21.0 +numpy==1.22.0 pandas==0.23.4 patsy==0.5.0 pexpect==4.6.0 From bc1edd25840b7a282de088d6e1c2b3d64f4ce64a Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Tue, 11 Oct 2022 18:51:29 +0530 Subject: [PATCH 75/98] Update interview_prep.md --- interview_prep.md | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/interview_prep.md b/interview_prep.md index 141d40a..b9f25af 100644 --- a/interview_prep.md +++ b/interview_prep.md @@ -86,7 +86,19 @@ This means that you automatically know the thickness of 6th book even though you This means that if you have measured (n-1) objects then the nth object has no freedom to vary. Therefore, degree of freedom is only (n-1) and not n. -### 7. What are the assumptions of the normal distribution ? Why is it useful ? +### 7. What are the assumptions of the linear regression model ? Why is it useful ? +We can divide the basic assumptions of linear regression into two categories based on whether the assumptions are about the explanatory variables (i.e. features) or the residuals. + +#### Assumptions about the explanatory variables (features): +* Linearity +* No multicollinearity + +#### Assumptions about the error terms (residuals): +* Gaussian distribution +* Homoskedasticity +* No autocorrelation +* Zero conditional mean + ### 8. What are the different approches to outlier detection ? 
How will you handle the outliers? Why is it useful ? ### 9. How you assess OLS regression models ? Three statistics are used in Ordinary Least Squares (OLS) regression to evaluate model fit: From c2e8b65528c68da00584a7d0f17138dbbebe5318 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Thu, 8 Dec 2022 03:00:44 +0000 Subject: [PATCH 76/98] Bump certifi from 2018.8.24 to 2022.12.7 Bumps [certifi](https://github.com/certifi/python-certifi) from 2018.8.24 to 2022.12.7. - [Release notes](https://github.com/certifi/python-certifi/releases) - [Commits](https://github.com/certifi/python-certifi/compare/2018.08.24...2022.12.07) --- updated-dependencies: - dependency-name: certifi dependency-type: direct:production ... Signed-off-by: dependabot[bot] --- requirements.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements.txt b/requirements.txt index fb81870..358b9cf 100644 --- a/requirements.txt +++ b/requirements.txt @@ -6,7 +6,7 @@ Quandl==3.4.5 asn1crypto==0.24.0 backcall==0.1.0 beautifulsoup4==4.6.3 -certifi==2018.8.24 +certifi==2022.12.7 cffi==1.11.5 chardet==3.0.4 cryptography==3.3.2 From ac3a9d454e6b2f5337eb8b23d59d7ec1b675308b Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon, 26 Dec 2022 20:44:21 +0000 Subject: [PATCH 77/98] Bump wheel from 0.31.1 to 0.38.1 Bumps [wheel](https://github.com/pypa/wheel) from 0.31.1 to 0.38.1. - [Release notes](https://github.com/pypa/wheel/releases) - [Changelog](https://github.com/pypa/wheel/blob/main/docs/news.rst) - [Commits](https://github.com/pypa/wheel/compare/0.31.1...0.38.1) --- updated-dependencies: - dependency-name: wheel dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] --- requirements.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements.txt b/requirements.txt index fb81870..161cd8a 100644 --- a/requirements.txt +++ b/requirements.txt @@ -41,4 +41,4 @@ statsmodels==0.9.0 tornado==5.1.1 traitlets==4.3.2 wcwidth==0.1.7 -wheel==0.31.1 +wheel==0.38.1 From ff7d8668939a6c9f874bf97d3d5882a5cd4a5863 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Tue, 27 Dec 2022 15:34:35 +0000 Subject: [PATCH 78/98] Bump setuptools from 40.2.0 to 65.5.1 Bumps [setuptools](https://github.com/pypa/setuptools) from 40.2.0 to 65.5.1. - [Release notes](https://github.com/pypa/setuptools/releases) - [Changelog](https://github.com/pypa/setuptools/blob/main/CHANGES.rst) - [Commits](https://github.com/pypa/setuptools/compare/v40.2.0...v65.5.1) --- updated-dependencies: - dependency-name: setuptools dependency-type: direct:production ... Signed-off-by: dependabot[bot] --- requirements.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements.txt b/requirements.txt index fb81870..1fd8a28 100644 --- a/requirements.txt +++ b/requirements.txt @@ -35,7 +35,7 @@ requests>=2.20.0 scikit-learn==0.20.0 scipy==1.1.0 seaborn==0.9.0 -setuptools==40.2.0 +setuptools==65.5.1 six==1.11.0 statsmodels==0.9.0 tornado==5.1.1 From 53bf9acbdaa1320bb4898b9b643e5057c2b382b5 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Fri, 10 Feb 2023 23:09:36 +0000 Subject: [PATCH 79/98] Bump ipython from 7.16.3 to 8.10.0 Bumps [ipython](https://github.com/ipython/ipython) from 7.16.3 to 8.10.0. - [Release notes](https://github.com/ipython/ipython/releases) - [Commits](https://github.com/ipython/ipython/compare/7.16.3...8.10.0) --- updated-dependencies: - dependency-name: ipython dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] --- requirements.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements.txt b/requirements.txt index fb81870..ce374c0 100644 --- a/requirements.txt +++ b/requirements.txt @@ -14,7 +14,7 @@ cycler==0.10.0 h5py==2.9.0 idna==2.7 inflection==0.3.1 -ipython==7.16.3 +ipython==8.10.0 jedi==0.13.2 kiwisolver==1.0.1 matplotlib==3.0.0 From d1b7d1f47a4a6267c3432b9fef42b64d674a0a68 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Sat, 25 Feb 2023 06:07:47 +0000 Subject: [PATCH 80/98] Bump cryptography from 3.3.2 to 39.0.1 Bumps [cryptography](https://github.com/pyca/cryptography) from 3.3.2 to 39.0.1. - [Release notes](https://github.com/pyca/cryptography/releases) - [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst) - [Commits](https://github.com/pyca/cryptography/compare/3.3.2...39.0.1) --- updated-dependencies: - dependency-name: cryptography dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] --- requirements.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements.txt b/requirements.txt index 358b9cf..cefb6fd 100644 --- a/requirements.txt +++ b/requirements.txt @@ -9,7 +9,7 @@ beautifulsoup4==4.6.3 certifi==2022.12.7 cffi==1.11.5 chardet==3.0.4 -cryptography==3.3.2 +cryptography==39.0.1 cycler==0.10.0 h5py==2.9.0 idna==2.7 From 25e8b590423b39402886a12dc0f573824a909ebe Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Sat, 25 Feb 2023 11:42:33 +0530 Subject: [PATCH 81/98] Create dependabot.yml Adding new file --- .github/dependabot.yml | 11 +++++++++++ 1 file changed, 11 insertions(+) create mode 100644 .github/dependabot.yml diff --git a/.github/dependabot.yml b/.github/dependabot.yml new file mode 100644 index 0000000..a78b9c7 --- /dev/null +++ b/.github/dependabot.yml @@ -0,0 +1,11 @@ +# To get started with Dependabot version updates, you'll need to specify whic +# package ecosystems to update and where the package manifests are located. 
+# Please see the documentation for all configuration options: +# https://docs.github.com/github/administering-a-repository/configuration-options-for-dependency-updates + +version: 2 +updates: + - package-ecosystem: "" # See documentation for possible values + directory: "/" # Location of package manifests + schedule: + interval: "weekly" From 4be2cd4109d65140f305fd5936fddf25a4526593 Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Sat, 25 Mar 2023 15:09:56 +0530 Subject: [PATCH 82/98] Add files via upload --- prec_rec_curve.py | 143 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 143 insertions(+) create mode 100644 prec_rec_curve.py diff --git a/prec_rec_curve.py b/prec_rec_curve.py new file mode 100644 index 0000000..7bde221 --- /dev/null +++ b/prec_rec_curve.py @@ -0,0 +1,143 @@ +import numpy as np +from sklearn.metrics import confusion_matrix, precision_score, recall_score +import matplotlib.pyplot as plt +import matplotlib.patches as ptch + +# Appendix A - working with single threshold +pred_scores = [0.7, 0.3, 0.5, 0.6, 0.55, 0.9, 0.4, 0.2, 0.4, 0.3] +y_true = ["positive", "negative", "negative", "positive", "positive", "positive", "negative", "positive", "negative", "positive"] + +# To convert the scores into a class label, a threshold is used. +# When the score is equal to or above the threshold, the sample is classified as one class. +# Otherwise, it is classified as the other class. +# Suppose a sample is Positive if its score is above or equal to the threshold. Otherwise, it is Negative. +# The next block of code converts the scores into class labels with a threshold of 0.5. 
+ +threshold = 0.5 + +y_pred = ["positive" if score >= threshold else "negative" for score in pred_scores] +print(y_pred) + +r = np.flip(confusion_matrix(y_true, y_pred)) +print("\n# Confusion Matrix (From Left to Right & Top to Bottom: \nTrue Positive, False Negative, \nFalse Positive, True Negative)") +print(r) + +# Remember that the higher the precision, the more confident the model is when it classifies a sample as Positive. +# The higher the recall, the more Positive samples the model correctly classifies as Positive. + +precision = precision_score(y_true=y_true, y_pred=y_pred, pos_label="positive") +print("\n# Precision = 4/(4+1)") +print(precision) + +recall = recall_score(y_true=y_true, y_pred=y_pred, pos_label="positive") +print("\n# Recall = 4/(4+2)") +print(recall) + +# Appendix B - working with multiple thresholds +y_true = ["positive", "negative", "negative", "positive", "positive", "positive", "negative", "positive", "negative", "positive", "positive", "positive", "positive", "negative", "negative", "negative"] + +pred_scores = [0.7, 0.3, 0.5, 0.6, 0.55, 0.9, 0.4, 0.2, 0.4, 0.3, 0.7, 0.5, 0.8, 0.2, 0.3, 0.35] + +thresholds = np.arange(start=0.2, stop=0.7, step=0.05) + +# Due to the importance of both precision and recall, there is a precision-recall curve that shows +# the tradeoff between the precision and recall values for different thresholds.
+# This curve helps to select the best threshold to maximize both metrics + +def precision_recall_curve(y_true, pred_scores, thresholds): + precisions = [] + recalls = [] + f1_scores = [] + + for threshold in thresholds: + y_pred = ["positive" if score >= threshold else "negative" for score in pred_scores] + + precision = precision_score(y_true=y_true, y_pred=y_pred, pos_label="positive") + recall = recall_score(y_true=y_true, y_pred=y_pred, pos_label="positive") + f1_score = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0 + + precisions.append(precision) + recalls.append(recall) + f1_scores.append(f1_score) + + return precisions, recalls, f1_scores + +precisions, recalls, f1_scores = precision_recall_curve(y_true=y_true, + pred_scores=pred_scores, + thresholds=thresholds) + +print("\nRecall :: Precision :: F1-Score") +for p, r, f in zip(precisions, recalls, f1_scores): + print(round(r,4),"\t::\t",round(p,4),"\t::\t",round(f,4)) + +# np.max() returns the maximum value in the array +# np.argmax() returns the index of the value found by np.max() + +print('Best F1-Score: ', np.max(f1_scores)) +idx_best_f1 = np.argmax(f1_scores) +print('\nBest threshold: ', thresholds[idx_best_f1]) +print('Index of threshold: ', idx_best_f1) + +# Uncomment the lines below to display the plot + +# plt.plot(recalls, precisions, linewidth=4, color="red") +# plt.scatter(recalls[idx_best_f1], precisions[idx_best_f1], zorder=1, linewidth=6) +# plt.xlabel("Recall", fontsize=12, fontweight='bold') +# plt.ylabel("Precision", fontsize=12, fontweight='bold') +# plt.title("Precision-Recall Curve", fontsize=15, fontweight="bold") +# plt.show() + +# Appendix C - average precision (AP) +precisions, recalls, f1_scores = precision_recall_curve(y_true=y_true, + pred_scores=pred_scores, + thresholds=thresholds) + +precisions.append(1) +recalls.append(0) + +precisions = np.array(precisions) +recalls = np.array(recalls) + +print('\nRecall ::',recalls) +print('Precision ::',precisions) + +AP = 
np.sum((recalls[:-1] - recalls[1:]) * precisions[:-1]) +print("\nAP --", AP) + +# Appendix D - Intersection over Union + +# gt_box -- ground-truth bounding box, as [x, y, width, height] +# pred_box -- prediction bounding box, as [x, y, width, height] +def intersection_over_union(gt_box, pred_box): + + inter_box_top_left = [max(gt_box[0], pred_box[0]), max(gt_box[1], pred_box[1])] + + print("\ninter_box_top_left:", inter_box_top_left) + print("gt_box:", gt_box) + print("pred_box:", pred_box) + inter_box_bottom_right = [min(gt_box[0]+gt_box[2], pred_box[0]+pred_box[2]), min(gt_box[1]+gt_box[3], pred_box[1]+pred_box[3])] + print("inter_box_bottom_right:", inter_box_bottom_right) + + inter_box_w = max(0, inter_box_bottom_right[0] - inter_box_top_left[0]) + print("inter_box_w:", inter_box_w) + inter_box_h = max(0, inter_box_bottom_right[1] - inter_box_top_left[1]) + print("inter_box_h:", inter_box_h) + + intersection = inter_box_w * inter_box_h + union = gt_box[2] * gt_box[3] + pred_box[2] * pred_box[3] - intersection + + iou = intersection / union + + return iou, intersection, union + +gt_box1 = [320, 220, 680, 900] +pred_box1 = [500, 320, 550, 700] + +gt_box2 = [645, 130, 310, 320] +pred_box2 = [500, 60, 310, 320] + +iou1 = intersection_over_union(gt_box1, pred_box1) +print("\nIOU1 ::", iou1) + +iou2 = intersection_over_union(gt_box2, pred_box2) +print("\nIOU2 ::", iou2) \ No newline at end of file From bb15cdf6389dff507763a5ac78e523d528b1e98f Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Tue, 4 Apr 2023 15:25:21 +0530 Subject: [PATCH 83/98] Create SECURITY.md --- SECURITY.md | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+) create mode 100644 SECURITY.md diff --git a/SECURITY.md b/SECURITY.md new file mode 100644 index 0000000..034e848 --- /dev/null +++ b/SECURITY.md @@ -0,0 +1,21 @@ +# Security Policy + +## Supported Versions + +Use this section to tell people about which versions of your project are +currently being supported with security updates.
+ +| Version | Supported | +| ------- | ------------------ | +| 5.1.x | :white_check_mark: | +| 5.0.x | :x: | +| 4.0.x | :white_check_mark: | +| < 4.0 | :x: | + +## Reporting a Vulnerability + +Use this section to tell people how to report a vulnerability. + +Tell them where to go, how often they can expect to get an update on a +reported vulnerability, what to expect if the vulnerability is accepted or +declined, etc. From a6fbb3db2be99581bcddd29c0d95c162e121807b Mon Sep 17 00:00:00 2001 From: Amogh Singhal Date: Mon, 12 Jun 2023 12:07:25 +0530 Subject: [PATCH 84/98] Update README.md --- README.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/README.md b/README.md index 4da9732..9e3482e 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,10 @@ # Machine-Learning-with-Python ![GitHub stars](https://img.shields.io/github/stars/devAmoghS/Machine-Learning-with-Python?style=for-the-badge) ![GitHub forks](https://img.shields.io/github/forks/devAmoghS/Machine-Learning-with-Python?label=Forks&style=for-the-badge) + +## Star History + +[![Star History Chart](https://api.star-history.com/svg?repos=devAmoghS/Machine-Learning-with-Python&type=Date)](https://star-history.com/#devAmoghS/Machine-Learning-with-Python&Date) + + ![alt text](https://media.istockphoto.com/vectors/machine-learning-3-step-infographic-artificial-intelligence-machine-vector-id962219860?k=6&m=962219860&s=612x612&w=0&h=yricYyUqZbILMHp3IvtenS3xbRDhu1w1u5kk2az5tbo=) ## Small scale machine learning projects to understand the core concepts (order: oldest to newest) From 17e89d10c1a055cf7533657f7a7e7100c6755841 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Thu, 6 Jul 2023 21:25:48 +0000 Subject: [PATCH 85/98] Bump scipy from 1.1.0 to 1.10.0 Bumps [scipy](https://github.com/scipy/scipy) from 1.1.0 to 1.10.0. 
- [Release notes](https://github.com/scipy/scipy/releases) - [Commits](https://github.com/scipy/scipy/compare/v1.1.0...v1.10.0) --- updated-dependencies: - dependency-name: scipy dependency-type: direct:production ... Signed-off-by: dependabot[bot] --- requirements.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements.txt b/requirements.txt index a6641fd..8d6a647 100644 --- a/requirements.txt +++ b/requirements.txt @@ -33,7 +33,7 @@ python-dateutil==2.7.3 pytz==2018.5 requests>=2.20.0 scikit-learn==0.20.0 -scipy==1.1.0 +scipy==1.10.0 seaborn==0.9.0 setuptools==65.5.1 six==1.11.0 From f93afe24b8b72b23c6f67919002436f2e43f62ad Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Thu, 20 Jul 2023 13:10:54 +0000 Subject: [PATCH 86/98] Bump pygments from 2.7.4 to 2.15.0 Bumps [pygments](https://github.com/pygments/pygments) from 2.7.4 to 2.15.0. - [Release notes](https://github.com/pygments/pygments/releases) - [Changelog](https://github.com/pygments/pygments/blob/master/CHANGES) - [Commits](https://github.com/pygments/pygments/compare/2.7.4...2.15.0) --- updated-dependencies: - dependency-name: pygments dependency-type: direct:production ... Signed-off-by: dependabot[bot] --- requirements.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements.txt b/requirements.txt index a6641fd..c002b50 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,7 +1,7 @@ Keras==2.2.4 Keras-Preprocessing==1.0.5 PySocks==1.6.8 -Pygments==2.7.4 +Pygments==2.15.0 Quandl==3.4.5 asn1crypto==0.24.0 backcall==0.1.0 From ffaf0c4d0aafa65483982137aef0a67adc394562 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Wed, 2 Aug 2023 01:32:45 +0000 Subject: [PATCH 87/98] Bump cryptography from 39.0.1 to 41.0.3 Bumps [cryptography](https://github.com/pyca/cryptography) from 39.0.1 to 41.0.3. 
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst) - [Commits](https://github.com/pyca/cryptography/compare/39.0.1...41.0.3) --- updated-dependencies: - dependency-name: cryptography dependency-type: direct:production ... Signed-off-by: dependabot[bot] --- requirements.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements.txt b/requirements.txt index a6641fd..a152178 100644 --- a/requirements.txt +++ b/requirements.txt @@ -9,7 +9,7 @@ beautifulsoup4==4.6.3 certifi==2022.12.7 cffi==1.11.5 chardet==3.0.4 -cryptography==39.0.1 +cryptography==41.0.3 cycler==0.10.0 h5py==2.9.0 idna==2.7 From 2e13fda52748ca4dec2c36d016082368e2db3a97 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon, 14 Aug 2023 22:03:29 +0000 Subject: [PATCH 88/98] Bump tornado from 5.1.1 to 6.3.3 Bumps [tornado](https://github.com/tornadoweb/tornado) from 5.1.1 to 6.3.3. - [Changelog](https://github.com/tornadoweb/tornado/blob/master/docs/releases.rst) - [Commits](https://github.com/tornadoweb/tornado/compare/v5.1.1...v6.3.3) --- updated-dependencies: - dependency-name: tornado dependency-type: direct:production ... Signed-off-by: dependabot[bot] --- requirements.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements.txt b/requirements.txt index a6641fd..df8dfee 100644 --- a/requirements.txt +++ b/requirements.txt @@ -38,7 +38,7 @@ seaborn==0.9.0 setuptools==65.5.1 six==1.11.0 statsmodels==0.9.0 -tornado==5.1.1 +tornado==6.3.3 traitlets==4.3.2 wcwidth==0.1.7 wheel==0.38.1 From e8dbf024ed64e1ee240244246700a8a821b74235 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Thu, 17 Aug 2023 01:39:09 +0000 Subject: [PATCH 89/98] Bump certifi from 2022.12.7 to 2023.7.22 Bumps [certifi](https://github.com/certifi/python-certifi) from 2022.12.7 to 2023.7.22. 
- [Commits](https://github.com/certifi/python-certifi/compare/2022.12.07...2023.07.22) --- updated-dependencies: - dependency-name: certifi dependency-type: direct:production ... Signed-off-by: dependabot[bot] --- requirements.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements.txt b/requirements.txt index a6641fd..8f71574 100644 --- a/requirements.txt +++ b/requirements.txt @@ -6,7 +6,7 @@ Quandl==3.4.5 asn1crypto==0.24.0 backcall==0.1.0 beautifulsoup4==4.6.3 -certifi==2022.12.7 +certifi==2023.7.22 cffi==1.11.5 chardet==3.0.4 cryptography==39.0.1 From 1a671db581c679d179ef3585c6a725feacc12805 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Tue, 25 Jun 2024 11:35:42 +0000 Subject: [PATCH 90/98] Bump pip from 21.1 to 23.3 Bumps [pip](https://github.com/pypa/pip) from 21.1 to 23.3. - [Changelog](https://github.com/pypa/pip/blob/main/NEWS.rst) - [Commits](https://github.com/pypa/pip/compare/21.1...23.3) --- updated-dependencies: - dependency-name: pip dependency-type: direct:production ... Signed-off-by: dependabot[bot] --- requirements.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements.txt b/requirements.txt index 6622752..25351e5 100644 --- a/requirements.txt +++ b/requirements.txt @@ -24,7 +24,7 @@ pandas==0.23.4 patsy==0.5.0 pexpect==4.6.0 pickleshare==0.7.5 -pip==21.1 +pip==23.3 ptyprocess==0.6.0 pyOpenSSL==18.0.0 pycparser==2.19 From ece2528697d6b28dee7718e4a127206c274a83b0 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Tue, 25 Jun 2024 11:36:28 +0000 Subject: [PATCH 91/98] Bump cryptography from 39.0.1 to 42.0.4 Bumps [cryptography](https://github.com/pyca/cryptography) from 39.0.1 to 42.0.4. 
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst) - [Commits](https://github.com/pyca/cryptography/compare/39.0.1...42.0.4) --- updated-dependencies: - dependency-name: cryptography dependency-type: direct:production ... Signed-off-by: dependabot[bot] --- requirements.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements.txt b/requirements.txt index fc5d79d..441b518 100644 --- a/requirements.txt +++ b/requirements.txt @@ -9,7 +9,7 @@ beautifulsoup4==4.6.3 certifi==2023.7.22 cffi==1.11.5 chardet==3.0.4 -cryptography==41.0.3 +cryptography==42.0.4 cycler==0.10.0 h5py==2.9.0 idna==2.7 From 75c3a6eb23031e62731551690b2326d73ab40b0e Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Tue, 25 Jun 2024 11:36:37 +0000 Subject: [PATCH 92/98] Bump tornado from 5.1.1 to 6.4.1 Bumps [tornado](https://github.com/tornadoweb/tornado) from 5.1.1 to 6.4.1. - [Changelog](https://github.com/tornadoweb/tornado/blob/master/docs/releases.rst) - [Commits](https://github.com/tornadoweb/tornado/compare/v5.1.1...v6.4.1) --- updated-dependencies: - dependency-name: tornado dependency-type: direct:production ... Signed-off-by: dependabot[bot] --- requirements.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements.txt b/requirements.txt index fc5d79d..a2f9cf2 100644 --- a/requirements.txt +++ b/requirements.txt @@ -38,7 +38,7 @@ seaborn==0.9.0 setuptools==65.5.1 six==1.11.0 statsmodels==0.9.0 -tornado==6.3.3 +tornado==6.4.1 traitlets==4.3.2 wcwidth==0.1.7 wheel==0.38.1 From 0ee256998f26d57e216edbbcb58516f301703fb1 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Tue, 25 Jun 2024 11:36:54 +0000 Subject: [PATCH 93/98] Bump scikit-learn from 0.20.0 to 1.5.0 Bumps [scikit-learn](https://github.com/scikit-learn/scikit-learn) from 0.20.0 to 1.5.0. 
- [Release notes](https://github.com/scikit-learn/scikit-learn/releases) - [Commits](https://github.com/scikit-learn/scikit-learn/compare/0.20.0...1.5.0) --- updated-dependencies: - dependency-name: scikit-learn dependency-type: direct:production ... Signed-off-by: dependabot[bot] --- requirements.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements.txt b/requirements.txt index fc5d79d..fc2d7ab 100644 --- a/requirements.txt +++ b/requirements.txt @@ -32,7 +32,7 @@ pyparsing==2.2.1 python-dateutil==2.7.3 pytz==2018.5 requests>=2.20.0 -scikit-learn==0.20.0 +scikit-learn==1.5.0 scipy==1.10.0 seaborn==0.9.0 setuptools==65.5.1 From d3f3928dcb80cb9db05e0699f5b0527bf607614a Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Tue, 25 Jun 2024 11:40:17 +0000 Subject: [PATCH 94/98] Bump idna from 2.7 to 3.7 Bumps [idna](https://github.com/kjd/idna) from 2.7 to 3.7. - [Release notes](https://github.com/kjd/idna/releases) - [Changelog](https://github.com/kjd/idna/blob/master/HISTORY.rst) - [Commits](https://github.com/kjd/idna/compare/v2.7...v3.7) --- updated-dependencies: - dependency-name: idna dependency-type: direct:production ... Signed-off-by: dependabot[bot] --- requirements.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements.txt b/requirements.txt index ee8d009..c83a46d 100644 --- a/requirements.txt +++ b/requirements.txt @@ -12,7 +12,7 @@ chardet==3.0.4 cryptography==42.0.4 cycler==0.10.0 h5py==2.9.0 -idna==2.7 +idna==3.7 inflection==0.3.1 ipython==8.10.0 jedi==0.13.2 From e9a498caf9fbdf485242e58c5378096a256e594f Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon, 15 Jul 2024 17:29:35 +0000 Subject: [PATCH 95/98] Bump setuptools from 65.5.1 to 70.0.0 Bumps [setuptools](https://github.com/pypa/setuptools) from 65.5.1 to 70.0.0. 
- [Release notes](https://github.com/pypa/setuptools/releases) - [Changelog](https://github.com/pypa/setuptools/blob/main/NEWS.rst) - [Commits](https://github.com/pypa/setuptools/compare/v65.5.1...v70.0.0) --- updated-dependencies: - dependency-name: setuptools dependency-type: direct:production ... Signed-off-by: dependabot[bot] --- requirements.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements.txt b/requirements.txt index c83a46d..caf00e7 100644 --- a/requirements.txt +++ b/requirements.txt @@ -35,7 +35,7 @@ requests>=2.20.0 scikit-learn==1.5.0 scipy==1.10.0 seaborn==0.9.0 -setuptools==65.5.1 +setuptools==70.0.0 six==1.11.0 statsmodels==0.9.0 tornado==6.4.1 From 8ea82d59a2fa94fab100469c37cebab57b1ff259 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Fri, 2 Aug 2024 16:20:13 +0000 Subject: [PATCH 96/98] Bump keras from 2.2.4 to 2.13.1 Bumps [keras](https://github.com/keras-team/keras) from 2.2.4 to 2.13.1. - [Release notes](https://github.com/keras-team/keras/releases) - [Commits](https://github.com/keras-team/keras/compare/2.2.4...v2.13.1) --- updated-dependencies: - dependency-name: keras dependency-type: direct:production ... Signed-off-by: dependabot[bot] --- requirements.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements.txt b/requirements.txt index c83a46d..21914c8 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,4 +1,4 @@ -Keras==2.2.4 +Keras==2.13.1 Keras-Preprocessing==1.0.5 PySocks==1.6.8 Pygments==2.15.0 From 406a19afb38c4338fc9fb9ff73baf8354146c6c7 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Tue, 1 Apr 2025 02:26:44 +0000 Subject: [PATCH 97/98] Bump tornado from 6.4.1 to 6.4.2 Bumps [tornado](https://github.com/tornadoweb/tornado) from 6.4.1 to 6.4.2. 
- [Changelog](https://github.com/tornadoweb/tornado/blob/v6.4.2/docs/releases.rst) - [Commits](https://github.com/tornadoweb/tornado/compare/v6.4.1...v6.4.2) --- updated-dependencies: - dependency-name: tornado dependency-version: 6.4.2 dependency-type: direct:production ... Signed-off-by: dependabot[bot] --- requirements.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements.txt b/requirements.txt index 126de70..16ba325 100644 --- a/requirements.txt +++ b/requirements.txt @@ -38,7 +38,7 @@ seaborn==0.9.0 setuptools==70.0.0 six==1.11.0 statsmodels==0.9.0 -tornado==6.4.1 +tornado==6.4.2 traitlets==4.3.2 wcwidth==0.1.7 wheel==0.38.1 From 15d9c864a779fe43de90d2bd2424e6597bbfdd07 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Tue, 1 Apr 2025 02:26:46 +0000 Subject: [PATCH 98/98] Bump cryptography from 42.0.4 to 44.0.1 Bumps [cryptography](https://github.com/pyca/cryptography) from 42.0.4 to 44.0.1. - [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst) - [Commits](https://github.com/pyca/cryptography/compare/42.0.4...44.0.1) --- updated-dependencies: - dependency-name: cryptography dependency-version: 44.0.1 dependency-type: direct:production ... Signed-off-by: dependabot[bot] --- requirements.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements.txt b/requirements.txt index 126de70..61ddb84 100644 --- a/requirements.txt +++ b/requirements.txt @@ -9,7 +9,7 @@ beautifulsoup4==4.6.3 certifi==2023.7.22 cffi==1.11.5 chardet==3.0.4 -cryptography==42.0.4 +cryptography==44.0.1 cycler==0.10.0 h5py==2.9.0 idna==3.7