diff --git a/.github/dependabot.yml b/.github/dependabot.yml new file mode 100644 index 0000000..a78b9c7 --- /dev/null +++ b/.github/dependabot.yml @@ -0,0 +1,11 @@ +# To get started with Dependabot version updates, you'll need to specify whic +# package ecosystems to update and where the package manifests are located. +# Please see the documentation for all configuration options: +# https://docs.github.com/github/administering-a-repository/configuration-options-for-dependency-updates + +version: 2 +updates: + - package-ecosystem: "" # See documentation for possible values + directory: "/" # Location of package manifests + schedule: + interval: "weekly" diff --git a/README.md b/README.md index 4da9732..9e3482e 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,10 @@ # Machine-Learning-with-Python ![GitHub stars](https://img.shields.io/github/stars/devAmoghS/Machine-Learning-with-Python?style=for-the-badge) ![GitHub forks](https://img.shields.io/github/forks/devAmoghS/Machine-Learning-with-Python?label=Forks&style=for-the-badge) + +## Star History + +[![Star History Chart](https://api.star-history.com/svg?repos=devAmoghS/Machine-Learning-with-Python&type=Date)](https://star-history.com/#devAmoghS/Machine-Learning-with-Python&Date) + + ![alt text](https://media.istockphoto.com/vectors/machine-learning-3-step-infographic-artificial-intelligence-machine-vector-id962219860?k=6&m=962219860&s=612x612&w=0&h=yricYyUqZbILMHp3IvtenS3xbRDhu1w1u5kk2az5tbo=) ## Small scale machine learning projects to understand the core concepts (order: oldest to newest) diff --git a/SECURITY.md b/SECURITY.md new file mode 100644 index 0000000..034e848 --- /dev/null +++ b/SECURITY.md @@ -0,0 +1,21 @@ +# Security Policy + +## Supported Versions + +Use this section to tell people about which versions of your project are +currently being supported with security updates. + +| Version | Supported | +| ------- | ------------------ | +| 5.1.x | :white_check_mark: | +| 5.0.x | :x: | +| 4.0.x | :white_check_mark: | +| < 4.0 | :x: | + +## Reporting a Vulnerability + +Use this section to tell people how to report a vulnerability. + +Tell them where to go, how often they can expect to get an update on a +reported vulnerability, what to expect if the vulnerability is accepted or +declined, etc. diff --git a/Understanding SQL Queries.md b/Understanding SQL Queries.md new file mode 100644 index 0000000..5e658d7 --- /dev/null +++ b/Understanding SQL Queries.md @@ -0,0 +1,185 @@ +### Three SQL Concepts you Must Know to Pass the Data Science Interview + +#### Credits: Thanks to Jay Feng for writing this [article](https://www.interviewquery.com/blog-three-sql-questions-you-must-know-to-pass/) + +#### 1. Getting the first or last value for each user in a `transactions` table. + +`transactions` + +| column_name | data_type | +--- | --- | +| user_id | int | +| created_at | datetime| +| product | varchar | + +##### Question: Given the user transactions table above, write a query to get the first purchase for each user. + +#### Solution: + +We want to take a table that looks like this: + + user_id | created_at | product + --- | --- | --- + 123 | 2019-01-01 | apple + 456 | 2019-01-02 | banana + 123 | 2019-01-05 | pear + 456 | 2019-01-10 | apple + 789 | 2019-01-11 | banana + +and turn it into this + + user_id | created_at | product + --- | --- | --- + 123 | 2019-01-01 | apple + 456 | 2019-01-02 | banana + 789 | 2019-01-11 | banana + + The solution can be broken into two parts: + - First make a table of `user_id` and the first purchase (i.e. minimum create date). We can get this by the following query + +``` +SELECT + user_id, MIN(created_at) AS min_created_at +FROM + transactions +GROUP BY 1 +``` + +- Now all we have to do is join this table back to the original on two columns: `user_id` and `created_at`.
+The self join will effectively filter for the first purchase.
+Then all we have to do is grab all of the columns on the left side table. + +``` +SELECT + t.user_id, t.created_at, t.product +FROM + transactions AS t + INNER JOIN ( + SELECT user_id, MIN(created_at) AS min_created_at + FROM transactions + GROUP BY 1 + ) AS t1 ON (t.user_id = t1.user_id AND t.created_at = t1.min_created_at) +``` + +#### 2. Knowing the difference between a LEFT JOIN and INNER JOIN in practice. + + `users` + + +| column_name | data_type | +--- | --- | +| id | int | +| name | varchar | +| city_id | int | + +`city_id` is `id` in the `cities` table + +`cities` +| column_name | data_type | +--- | --- | +| id | int | +| name | varchar | + + +##### Question: Given the `users` and `cities` tables above, write a query to return the list of cities without any users. + +This question aims to test the candidate's understanding of the LEFT JOIN and INNER JOIN + +##### What is the actual difference between a LEFT JOIN and INNER JOIN? + +**INNER JOIN**: returns rows when there is a match in __both tables__.
+**LEFT JOIN**: returns all rows from the left table, __even if there are no matches in the right table__. + +#### Solution: + +We know that each user in the users table must live in a city given the city_id field.
+However the `cities` table doesn’t have a `user_id` field.
+In which if we run an INNER JOIN between these two tables joined by the city_id in each table, we’ll get all of the cities that have users and __all of the cities without users will be filtered out.__ + +But what if we run a LEFT JOIN between cities and users? + +cities.name | users.id +--- | --- | +seattle | 123 +seattle | 124 +portland | null +san diego | 534 +san diego | 564 + +Here we see that since we are keeping all of the values on the LEFT side of the table, since there’s no match on the city of Portland to any users that exist in the database, the city shows up as NULL.
+Therefore now all we have to do is run a __WHERE filter to where any value in the users table is NULL.__ + +``` +SELECT + cities.name, users.id +FROM + cities + LEFT JOIN users ON users.city_id = cities.id +WHERE + users.id IS NULL +``` + +#### 3. Aggregations with a conditional statement + +`transactions` +| column_name | data_type | +--- | --- | +| user_id | int | +| created_at | datetime| +| product | varchar | + +##### Question: Given the same user transactions table as before,write a query to get the total purchases made in the morning versus afternoon/evening (AM vs PM) by day. + +We are comparing two groups. Every time we have to compare two groups we must use a GROUP BY + +In this case, we need to create a separate column to actually run our GROUP BY on, which in this case, is the difference between AM or PM in the `created_at` field. + +``` +CASE + WHEN HOUR(created_at) > 11 THEN 'PM' + ELSE 'AM' +END AS time_of_day +``` + +We can cast the created_at column to the hour and set the new column value time_of_day as AM or PM based on this condition. + +Now we just have to run a GROUP BY on the original `created_at` field truncated to the day AND the new column we created that differentiates each row value.
+The last aggregation will then be the output variable we want which is total purchases by running the COUNT function. + +``` +SELECT + DATE_TRUNC('day', created_at) AS date + ,CASE + WHEN HOUR(created_at) > 11 THEN 'PM' + ELSE 'AM' + END AS time_of_day + ,COUNT(*) +FROM + transactions +GROUP BY 1,2 +``` +### Bonus Questions + +#### 4.Write an SQL query that makes recommendations using the pages that your friends liked. Assume you have two tables: + +`usersAndFriends` +| column_name | data_type | +--- | --- | +| user_id | int | +| friend | int| + +`usersLikedPages` +| column_name | data_type | +--- | --- | +| user_id | int | +| page_id | int| + +#### It should not recommend pages you already like. + +#### 5.Write an SQL query that shows percentage change month over month in daily active users. Assume you have a table: + +`logins` +| column_name | data_type | +--- | --- | +| user_id | int | +| date | date| diff --git a/Understanding Vanishing Gradient.md b/Understanding Vanishing Gradient.md new file mode 100644 index 0000000..a0dcaff --- /dev/null +++ b/Understanding Vanishing Gradient.md @@ -0,0 +1,59 @@ +# Understanding Vanishing Gradients in Neural Networks + +#### Credits: Thanks to [Chi-Feng Wang](https://towardsdatascience.com/@reina.wang) for writing this [article](https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484) + +![Vanishing Gradient](https://i.stack.imgur.com/YUlyb.jpg) + +### TL;DR +The gradient used in backprop is calculated using the derivative chain rule, meaning it is a product of about as many factors as there are layers (in a vanilla feedforward net).
+If all those factors are e.g. between 0 and 1 (e.g. due to the choice of 'squishing' activation functions), and some are very small (typical in the earlier layers and when activations are saturated), then the overall product (gradient) will get very small, near zero.
+The risk of this happening grows with the number of factors (the number of layers).
+The problem is that this may happen for a weight configuration that is nowhere near optimal, yet training will slow down or stop + +### Introduction + +We all know that neural networks perform learning through the process of forward pass and backward pass.
+This cycle goes on until we find a optimal value for the cost function that we are trying to minimize.
+The optmization happens with the help of gradient descent.
+ +### What are gradients ? +Gradients are the derivative of a function. It determines how much change happens when the input to the function is changed by a very big number
+ +Gradients of neural networks are found using backpropagation(backward pass as mentioned above).
+1. Backpropogation finds the derivatives of the network by moving layer by layer from the final layer to the initial one.
+2. By the chain rule, the derivatives of each layer are multiplied down the network (from the final layer to the initial) to compute the derivatives of the initial layers. + +### Why does it happen ? + +A very commonly used activation function is the sigmoid function. + +The sigmoid function squashes the input value into a range of 0 to 1.
+Hence if there is a large change in the value, there is not much change in the output by the sigmoid. Hence the derivative of this function is very small.
+ +The graph below also shows us the same picture. For very large or small values of x, the derivative of sigmoid is very small (almost closer to zero) + +![Sigmoid Function and its Derivative](https://miro.medium.com/max/1000/1*6A3A_rt4YmumHusvTvVTxw.png) + +### How does it impact ? + +As explained above, we are multiplying gradients with each other in the bacward pass step using chain rule.
+So when we are multiplying a lot of small numbers (almost near zero quantities). The gradient value is descreased very sharply.
+ +A small gradient means that the weights and biases of the initial layers will not be updated effectively with each training session. + +**Since these initial layers are often crucial to recognizing the core elements of the input data, it can lead to overall inaccuracy of the whole network.** + +### Solutions to the vanishing gradients + +1. We can use other other activation function like `Relu` +` Relu(x) = max(x,0)` + +2. Using residual networks is also an effective solution where we add the input value X to the next layer before applying the activation.
+This way the overall derivative is not reduced to a small value. Refer the diagram below. + +![A Residual Block](https://miro.medium.com/max/385/1*mxJ5gBvZnYPVo0ISZE5XkA.png) + +3. Batch normalization is also an effective solution. We normalize the input value x ==> |x| so that it does not have extremely large or small values and hence the derivative is not very small.
+We limit the input function to a small range and hence the output from the sigmoid also remains normal. We can see the same behavior that the green region does not have very small derivatives. Refer the diagram below + +![Sigmoig function with limited values](https://miro.medium.com/max/700/1*XCtAytGsbhRQnu-x7Ynr0Q.png) diff --git a/interview_prep.md b/interview_prep.md new file mode 100644 index 0000000..b9f25af --- /dev/null +++ b/interview_prep.md @@ -0,0 +1,228 @@ +### 1. What is multi-collinearity + +When two or more predictors are highly correlated to each other such that one predictor +can be derived using the linear combinations of other predictors, then the predictors are said to be collinear + +### 2. What is the difference between standardisation and normalization ? Why is it useful? +Standardisation is a scclaing technique in which values are shifted and rescaled so that the mean is 0 and the variance is 1 + +Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling + +* Algorithms which use gradient descent based optimisation (linear regression, logistic regression, neural networks) will require features to be scaled so that optimization will be faster and the convergence will be more accurate. +* **Having features on a similar scale can help the gradient descent converge more quickly towards the minima.** +* Distance algorithms like KNN, K-means, and SVM are most affected by the range of features. This is because behind the scenes they are using distances between data points to determine their similarity. +* **Therefore, we scale our data before employing a distance based algorithm so that all the features contribute equally to the result.** + +![](https://i.pinimg.com/originals/1c/16/04/1c160466f8bfd26ca66a44f79514fb5d.jpg) + +### 3. What is the central limit theorem ? Why is it useful ? +The Central Limit Theorem is about how the sum of many different independent random variables tends towards a normal distribution (bell curve). + +For example: suppose you're rolling 2 6-sided dice. The rolls are all independent because one of the rolls doesn't affect any of the other rolls. For a single die, the distribution is the same chance for 1 2 3 4 5 and 6. + +But of you add the sum of 2 dice, you will notice that you have a 1/36 chance to get a 2, 2/36 chance to get a 3, 3/36 chance to get a 4, ..., up until 6/36 chance of getting a 7, then the chance decreases again until you're back at 1/36 chance of getting a 12. This is because the values in the middle, like 7, can be reached by getting 1+6, 6+1, 2+5, 5+2, 3+4 and 4+3, whereas the edges like 2 require a single very specific result (1+1) where every single die needs to land on 1. + +If you further increase the number of dice you roll, the edge cases become less and less likely, because they keep requiring very specific results, whereas the results in the middle become more likely. The more dice you add, the more it will eventually look like a bell curve. + +![](https://prwatech.in/blog/wp-content/uploads/2019/06/CetralLimitThm-1024x512.png) + +### 4. What is the inter quartile range ? Why is it useful ? + +The interquartile range is a measure of where the “middle fifty” is in a data set. Where a range is a measure of where the beginning and end are in a set, an interquartile range is a measure of where the bulk of the values lie. That’s why it’s preferred over many other measures of spread when reporting things like school performance or SAT scores. + +The interquartile range formula is the first quartile subtracted from the third quartile: +IQR = Q3 – Q1. + +![](https://naysan.ca/wp-content/uploads/2020/06/box_plot_ref_needed.png) + +#### IQR as a test of normality in a distribution + +Use the interquartile range formula with the mean and standard deviation to test whether or not a population has a normal distribution. The formula to determine whether or not a population is normally distributed are: +Q1 – (σ z1) + X +Q3 – (σ z3) + X +Where Q1 is the first quartile, Q3 is the third quartile, σ is the standard deviation, z is the standard score (“z-score“) and X is the mean. In order to tell whether a population is normally distributed, solve both equations and then compare the results. If there is a significant difference between the results and the first or third quartiles, then the population is not normally distributed. + +#### IQR as an instrument to detect outliers and to determine the spread of data +The interquartile range and the quartile deviation refer to the same thing. They both mean the difference between the third quartile (Q3) and the first quartile (Q1). Both are also called midspread or middle fifty. + +Some of its applications include determining the spread of data. It is used in the construction of a box plot. It is a good indicator of spread because it is robust with breakpoint of 25%. A breakpoint percentage indicates the number of incorrect observations, before a parameter starts giving a wrong description of the data set. A 25% breakpoint is robust, as it needs a quarter of the data to be incorrect, before it reflects an incorrect spread. + +The IQR is also used to determine outliers to the data set. This is in conjuction with the box plot (or the box-and-whisker plot). Outliers are defined as values that are below Q1-1.5*IQR or above Q3+1.5*IQR. There are other methods that could be used to determine whether outliers can be eliminated from the data set. + +### 5. What is the difference between t-test and z-test ? Why is it useful ? + +![](https://www.wallstreetmojo.com/wp-content/uploads/2019/01/Z-Test-vs-T-Test.png) + + +| Basis | Z Test | T-Test | +|-------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:| +| Basic Definition | Z-test is a kind of hypothesis test which ascertains if the averages of the 2 datasets are different from each other when standard deviation or variance is given. | The t-test can be referred to as a kind of parametric test that is applied to an identity, how the averages of 2 sets of data differ from each other when the standard deviation or variance is not given. | +| Population Variance | The Population variance or standard deviation is known here. | The Population variance or standard deviation is unknown here. | +| Sample Size | The Sample size is large. | Here the Sample Size is small. | +| Key Assumptions | All data points are independent. Normal Distribution for Z, with an average zero and variance = 1. | All data points are not dependent. Sample values are to be recorded and taken accurately. | +| Based upon (a type of distribution) | Based on Normal distribution. | Based on Student-t distribution. | + +### 6. Why do we take n-1 when calculating sample variance? Why is it useful ? +Read about Besel correction for more technical definition + +##### Intuitive explaination + +If you are giving the standard deviation of an entire population and not a sample you actually do divide by n. However, the denominator is not referencing the number of observations, it's actually referencing degrees of freedom, which is n-1. For you to understand degrees of freedom I would recommend this example using hats. + +Basically you divide by the number of things you need to 'know' before you can fill in the blanks yourself. If you are using an entire population, you need every single example as you can't just fill in the blanks. But if you have a sample, you can know all but the last one before you can fill in the blank. + +##### Example + +![](https://ae01.alicdn.com/kf/HTB1XFW0JXXXXXcKXFXXq6xXFXXX1/225440714/HTB1XFW0JXXXXXcKXFXXq6xXFXXX1.jpg) + +Imagine you have a huge bookshelf. You measure the total thickness of the first 6 books and it turns out to be 158mm. This means that the mean thickness of a book based on first 6 samples is 26.3mm. +Now you take out and measure the first book's thickness (one degree of freedom) and find that it is 22mm. This means that the remaining 5 books must have a total thickness of 136mm +Now you measure the second book (second degree of freedom) and find it to be 28mm. So you know that the remaining 4 books should have a total thickness of 108mm . +. +. +In this way, by the time you measure the thickness of the 5th book individually (5th degree of freedom) , you automatically know the thickness of the remaining 1 book. + +This means that you automatically know the thickness of 6th book even though you have measured only 5. Extrapolating this concept, In a sample of size n, you know the value of the n'th observation even though you have only taken (n-1) measurements. i.e, the opportunity to vary has been taken away for the n'th observation. + +This means that if you have measured (n-1) objects then the nth object has no freedom to vary. Therefore, degree of freedom is only (n-1) and not n. + +### 7. What are the assumptions of the linear regression model ? Why is it useful ? +We can divide the basic assumptions of linear regression into two categories based on whether the assumptions are about the explanatory variables (i.e. features) or the residuals. + +#### Assumptions about the explanatory variables (features): +* Linearity +* No multicollinearity + +#### Assumptions about the error terms (residuals): +* Gaussian distribution +* Homoskedasticity +* No autocorrelation +* Zero conditional mean + +### 8. What are the different approches to outlier detection ? How will you handle the outliers? Why is it useful ? +### 9. How you assess OLS regression models ? +Three statistics are used in Ordinary Least Squares (OLS) regression to evaluate model fit: +* R-squared, +* the overall F-test, and +* the Root Mean Square Error (RMSE). + +All three are based on two sums of squares: Sum of Squares Total (SST) and Sum of Squares Error (SSE). SST measures how far the data are from the mean, and SSE measures how far the data are from the model’s predicted values. Different combinations of these two values provide different information about how the regression model compares to the mean model. + +##### R-squared and Adjusted R-squared + +The difference between SST and SSE is the improvement in prediction from the regression model, compared to the mean model. Dividing that difference by SST gives R-squared. It is the proportional improvement in prediction from the regression model, compared to the mean model. **It indicates the goodness of fit of the model.** + +R-squared has the useful property that its scale is intuitive: it ranges from zero to one, with zero indicating that the proposed model does not improve prediction over the mean model, and one indicating perfect prediction. Improvement in the regression model results in proportional increases in R-squared. + +One pitfall of R-squared is that it can only increase as predictors are added to the regression model. This increase is artificial when predictors are not actually improving the model’s fit. To remedy this, a related statistic, Adjusted R-squared, incorporates the model’s degrees of freedom. **Adjusted R-squared will decrease as predictors are added if the increase in model fit does not make up for the loss of degrees of freedom. Likewise, it will increase as predictors are added if the increase in model fit is worthwhile.** Adjusted R-squared should always be used with models with more than one predictor variable. It is interpreted as the proportion of total variance that is explained by the model. + +There are situations in which a high R-squared is not necessary or relevant. When the interest is in the relationship between variables, not in prediction, the R-square is less important. An example is a study on how religiosity affects health outcomes. A good result is a reliable relationship between religiosity and health. No one would expect that religion explains a high percentage of the variation in health, as health is affected by many other factors. Even if the model accounts for other variables known to affect health, such as income and age, an R-squared in the range of 0.10 to 0.15 is reasonable. + +![](https://miro.medium.com/max/1954/1*iFgJVgavYdENdtkssTS6pA.png) + +##### The F-test + +The F-test evaluates the null hypothesis that all regression coefficients are equal to zero versus the alternative that at least one is not. An equivalent null hypothesis is that R-squared equals zero. A significant F-test indicates that the observed R-squared is reliable and is not a spurious result of oddities in the data set. **Thus the F-test determines whether the proposed relationship between the response variable and the set of predictors is statistically reliable and can be useful when the research objective is either prediction or explanation.** + +##### RMSE + +The RMSE is the square root of the variance of the residuals. It indicates the absolute fit of the model to the data–how close the observed data points are to the model’s predicted values. **Whereas R-squared is a relative measure of fit, RMSE is an absolute measure of fit.** As the square root of a variance, RMSE can be interpreted as the standard deviation of the unexplained variance, and has the useful property of being in the same units as the response variable. **Lower values of RMSE indicate better fit. RMSE is a good measure of how accurately the model predicts the response, and it is the most important criterion for fit if the main purpose of the model is prediction.** + +##### NOTE: The best measure of model fit depends on the researcher’s objectives, and more than one are often useful. The statistics discussed above are applicable to regression models that use OLS estimation. Many types of regression models, however, such as mixed models, generalized linear models, and event history models, use maximum likelihood estimation. + +### 10. What are the loss functions used in logistic regression ? + +![](https://miro.medium.com/max/548/1*rdBw0E-My8Gu3f_BOB6GMA.png) + +where y is the label (1 for event and 0 for non-event) and p(y) is the predicted probability of the event happening for all N observations. +Reading this formula, it tells you that, for each time the event occcurs (y=1), it adds log(p(y)) to the loss, that is, the log probability of event happening. Conversely, it adds log(1-p(y)), that is, the log probability of event not happening, for each non-event (y=0) + +### 11. Explain random forest in laymen terms ? + +Say you have three job offers and you wish to decide which is the best among them, you have the following criterion you use to shortlist a job offer like +* tools and technology +* company brand +* health insurance +* support for education +* compensation +* travel time +* joining bonus etc. + +You reach out to 10 of your connections on LinkedIn and ask them which is the best comapny to join based on 3 random criteria (for eg. c2, c3, c5) +You make different combinations of criteria while asking to different connections. At the end, you finally select company which is recommended the most from all the responses. + +##### This is how a random forest also works +![](https://cdn.analyticsvidhya.com/wp-content/uploads/2020/02/rfc_vs_dt1.png) + +### 12. How does logisitc regression work in laymen terms ? +### 13. Why is logistic regression bad idea for multiclass classification ? +### 14. How do you perform the train test split in a timeseries modelling ? +### 15. What is the impact on timeseries model in case we have latge variation in data ? +### 16. How do you decide the value of K(value of clusters) in K-means clustering ? +### 17. What are the advantages and disadvantages of undersampling and oversampling ? +### 18. Which are some supervised algorithms that are not impacted by imbalanced data ? +### 19. You are a placement coordinator, you have to design a system for resume recommendation aligning to a company's requirement ? +a. K means clustering to make clusters +b. Ranking algorithm to sort for relevance + +_Second Strategy_ + +a. Perform document similarity using Hamming distance (distance based approach) +b. Compute the JD document distance with the resumes +c. Shortlist top K resumes + +### 20. How will you encode a feature like PinCode which has very high number of discrete values? +Target mean encoding +### 21. How do you design the architecture of a neural network? + +## Section II + +| Algorithm | Problem Identification | Evaluation Metric | Bias Variance | Impact of outliers | Impact of imbalanced data | | +|-------------------------|------------------------|-------------------------------------------------------------------------------------------------|---------------|--------------------|---------------------------|---| +| Linear Regression | Regression | - Coefficient of determination (R2) - Adjusted R2 - Root Mean Square Error (RMSE) - Mean Absolute Error (MAE) - Root Mean Squared Logarithmic Error (RMSLE)| - High Bias Low Variance | -Impacted by outliers | | | +| Logistic Regression | Classification | - Accuracy - Precision - Recall - F-beta score - Area under ROC curve | - High Bias Low Variance | -Impacted by outliers | | | +| Support Vector Machines | Classification | - Accuracy - Precision - Recall - F-beta score - Area under ROC curve | - Low Bias High Variance | Sensitive to outliers | Sensitive to imbalanced data | | +| K-nearest neighbors | Classification | - Accuracy - Precision - Recall - F-beta score - Area under ROC curve | - Low Bias High Variance | Sensitive to outliers | Sensitive to imbalanced data | | +| Decision Tree | Both | Both | - Low Bias High Variance | - Not impacted by outliers | - Not impacted by imbalanced data | | +| Random Forest | Both | Both | - Low Bias High Variance | - Not impacted by outliers | - Not impacted by imbalanced data | | +| K-means clustering | Clustering | - Elbow method - Silhoutte Analysis | | Senstive to Outliers | | | +| | | | | | | | + +### 22. Why do CNNs perfom better with images ? (What is it that CNN achieve better than ANN when delaing with image data) +### 23. Explain K-means clustering in laymen terms ? +### 24. Does a low coefficient of determination always mean that my model is bad or vice versa ? Explain. +* R-squared does not measure goodness of fit. +* R-squared does not measure predictive error. +* R-squared does not allow you to compare models using transformed responses. +* R-squared does not measure how one variable explains another. + +Ref:- https://data.library.virginia.edu/is-r-squared-useless/#:~:text=Let's%20recap%3A-,R%2Dsquared%20does%20not%20measure%20goodness%20of%20fit.,how%20one%20variable%20explains%20another. + +### 24. What is the difference between probability and likelihood ? +### 25. What is the difference between generative and discriminative models ? +### 26. How is a decision tree pruned ? +### 27. What do you understand by the bias variance tradeoff ? + +![](https://djsaunde.files.wordpress.com/2017/07/bias-variance-tradeoff.png) + +Bias ia how well the model fits the data. Variance tells us the magnitude of change in the model based on the change in data +a. Very simple models have high bias and low variance eg. linear models +b. Very complex models have low bias and high variance eg. tree based models. Hence they are more prone to overfitting. + +How to deal with them ? + +| High Bias | High Variance | | +|----------------------------------------------|---------------|---| +| Try getting additional features | Try getting more training examples | | +| Try adding polynomial features | Try smaller set of features | | +| Try to decrease the regularization parameter | Try to increase the regularization parameter | | + +### 28. What is right or left skewness ? +### 29. What is difference between bootstrapping and k-folds cross validation ? +overlapping of sample does not happen for k-folds cross validation +### 30. Which model is better: n_estimators=10 and n_estimators=30 ? +### 31. Why do we use activation functions in neural networks ? +### 32. What is the purpose of the optimizers ? +### 33. How does the stochastic gradient descent optimizer work ? + + diff --git a/prec_rec_curve.py b/prec_rec_curve.py new file mode 100644 index 0000000..7bde221 --- /dev/null +++ b/prec_rec_curve.py @@ -0,0 +1,143 @@ +import numpy as np +from sklearn.metrics import confusion_matrix, precision_score, recall_score +import matplotlib.pyplot as plt +import matplotlib.patches as ptch + +# Appendix A - working with single threshold +pred_scores = [0.7, 0.3, 0.5, 0.6, 0.55, 0.9, 0.4, 0.2, 0.4, 0.3] +y_true = ["positive", "negative", "negative", "positive", "positive", "positive", "negative", "positive", "negative", "positive"] + +# To convert the scores into a class label, a threshold is used. +# When the score is equal to or above the threshold, the sample is classified as one class. +# Otherwise, it is classified as the other class. +# Suppose a sample is Positive if its score is above or equal to the threshold. Otherwise, it is Negative. +# The next block of code converts the scores into class labels with a threshold of 0.5. + +threshold = 0.5 + +y_pred = ["positive" if score >= threshold else "negative" for score in pred_scores] +print(y_pred) + +r = np.flip(confusion_matrix(y_true, y_pred)) +print("\n# Confusion Matrix (From Left to Right & Top to Bottom: \nTrue Positive, False Negative, \nFalse Positive, True Negative)") +print(r) + +# Remember that the higher the precision, the more confident the model is when it classifies a sample as Positive. +# Higher the recall, the more positive samples the model correctly classified as Positive. + +precision = precision_score(y_true=y_true, y_pred=y_pred, pos_label="positive") +print("\n# Precision = 4/(4+1)") +print(precision) + +recall = recall_score(y_true=y_true, y_pred=y_pred, pos_label="positive") +print("\n# Recall = 4/(4+2)") +print(recall) + +# Appendix B - working with multiple thresholds +y_true = ["positive", "negative", "negative", "positive", "positive", "positive", "negative", "positive", "negative", "positive", "positive", "positive", "positive", "negative", "negative", "negative"] + +pred_scores = [0.7, 0.3, 0.5, 0.6, 0.55, 0.9, 0.4, 0.2, 0.4, 0.3, 0.7, 0.5, 0.8, 0.2, 0.3, 0.35] + +thresholds = np.arange(start=0.2, stop=0.7, step=0.05) + +# Due to the importance of both precision and recall, there is a precision-recall curve that shows +# the tradeoff between the precision and recall values for different thresholds. +# This curve helps to select the best threshold to maximize both metrics + +def precision_recall_curve(y_true, pred_scores, thresholds): + precisions = [] + recalls = [] + f1_scores = [] + + for threshold in thresholds: + y_pred = ["positive" if score >= threshold else "negative" for score in pred_scores] + + precision = precision_score(y_true=y_true, y_pred=y_pred, pos_label="positive") + recall = recall_score(y_true=y_true, y_pred=y_pred, pos_label="positive") + f1_score = (2 * precision * recall) / (precision + recall) + + precisions.append(precision) + recalls.append(recall) + f1_scores.append(f1_score) + + return precisions, recalls, f1_scores + +precisions, recalls, f1_scores = precision_recall_curve(y_true=y_true, + pred_scores=pred_scores, + thresholds=thresholds) + +print("\nRecall:: Precision :: F1-Score",) +for p, r, f in zip(precisions, recalls, f1_scores): + print(round(r,4),"\t::\t",round(p,4),"\t::\t",round(f,4)) + +# np.max() returns the max. value in the array +# np.argmax() will return the index of the value found by np.max() + +print('Best F1-Score: ', np.max(f1_scores)) +idx_best_f1 = np.argmax(f1_scores) +print('\nBest threshold: ', thresholds[idx_best_f1]) +print('Index of threshold: ', idx_best_f1) + +# Can disable comment to display the plot + +# plt.plot(recalls, precisions, linewidth=4, color="red") +# plt.scatter(recalls[idx_best_f1], precisions[idx_best_f1], zorder=1, linewidth=6) +# plt.xlabel("Recall", fontsize=12, fontweight='bold') +# plt.ylabel("Precision", fontsize=12, fontweight='bold') +# plt.title("Precision-Recall Curve", fontsize=15, fontweight="bold") +# plt.show() + +# Appendix C - average precision (AP) +precisions, recalls, f1_scores = precision_recall_curve(y_true=y_true, + pred_scores=pred_scores, + thresholds=thresholds) + +precisions.append(1) +recalls.append(0) + +precisions = np.array(precisions) +recalls = np.array(recalls) + +print('\nRecall ::',recalls) +print('Precision ::',precisions) + +AP = np.sum((recalls[:-1] - recalls[1:]) * precisions[:-1]) +print("\nAP --", AP) + +# Appendix D - Intersection over Union + +# gt_box -- ground-truth bounding box +# pred_box -- prediction bounding box +def intersection_over_union(gt_box, pred_box): + + inter_box_top_left = [max(gt_box[0], pred_box[0]), max(gt_box[1], pred_box[1])] + + print("\ninter_box_top_left:", inter_box_top_left) + print("gt_box:", gt_box) + print("pred_box:", pred_box) + inter_box_bottom_right = [min(gt_box[0]+gt_box[2], pred_box[0]+pred_box[2]), min(gt_box[1]+gt_box[3], pred_box[1]+pred_box[3])] + print("inter_box_bottom_right:", inter_box_bottom_right) + + inter_box_w = inter_box_bottom_right[0] - inter_box_top_left[0] + print("inter_box_w:", inter_box_w) + inter_box_h = inter_box_bottom_right[1] - inter_box_top_left[1] + print("inter_box_h:", inter_box_h) + + intersection = inter_box_w * inter_box_h + union = gt_box[2] * gt_box[3] + pred_box[2] * pred_box[3] - intersection + + iou = intersection / union + + return iou, intersection, union + +gt_box1 = [320, 220, 680, 900] +pred_box1 = [500, 320, 550, 700] + +gt_box2 = [645, 130, 310, 320] +pred_box2 = [500, 60, 310, 320] + +iou1 = intersection_over_union(gt_box1, pred_box1) +print("\nIOU1 ::", iou1) + +iou2 = intersection_over_union(gt_box2, pred_box2) +print("\nIOU2 ::", iou2) \ No newline at end of file diff --git a/requirements.txt b/requirements.txt index 6b4806b..8d8c8e2 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,30 +1,30 @@ -Keras==2.2.4 +Keras==2.13.1 Keras-Preprocessing==1.0.5 PySocks==1.6.8 -Pygments==2.3.1 +Pygments==2.15.0 Quandl==3.4.5 asn1crypto==0.24.0 backcall==0.1.0 beautifulsoup4==4.6.3 -certifi==2018.8.24 +certifi==2023.7.22 cffi==1.11.5 chardet==3.0.4 -cryptography==2.3.1 +cryptography==44.0.1 cycler==0.10.0 h5py==2.9.0 -idna==2.7 +idna==3.7 inflection==0.3.1 -ipython==7.2.0 +ipython==8.10.0 jedi==0.13.2 kiwisolver==1.0.1 matplotlib==3.0.0 more-itertools==5.0.0 -numpy==1.15.2 +numpy==1.22.0 pandas==0.23.4 patsy==0.5.0 pexpect==4.6.0 pickleshare==0.7.5 -pip==10.0.1 +pip==23.3 ptyprocess==0.6.0 pyOpenSSL==18.0.0 pycparser==2.19 @@ -32,13 +32,13 @@ pyparsing==2.2.1 python-dateutil==2.7.3 pytz==2018.5 requests>=2.20.0 -scikit-learn==0.20.0 -scipy==1.1.0 +scikit-learn==1.5.0 +scipy==1.10.0 seaborn==0.9.0 -setuptools==40.2.0 +setuptools==70.0.0 six==1.11.0 statsmodels==0.9.0 -tornado==5.1.1 +tornado==6.4.2 traitlets==4.3.2 wcwidth==0.1.7 -wheel==0.31.1 +wheel==0.38.1 diff --git a/use_cases_insurnace.md b/use_cases_insurnace.md new file mode 100644 index 0000000..49c3e89 --- /dev/null +++ b/use_cases_insurnace.md @@ -0,0 +1,47 @@ +#### Reference:- https://activewizards.com/blog/top-10-data-science-use-cases-in-insurance/ + +## Other use cases + +### Lapse management: +##### Identifies policies that are likely to lapse, and how to approach the insured about maintaining the policy. Calculate the probability to lapse + +### Recommendation engine: +##### Given similar customers, discovers where individual insureds may have too much, or too little, insurance. Then, proactively help them get the right insurance for their current situation. + +### Assessor assistant: +##### Once a car has been towed to a body shop, use computer vision to help the assessor identify issues which need to be fixed. This helps accuracy, speeds an assessment, and keeps the customer informed with any repairs. Car damage detection + +### Property analysis: +##### Given images of a property, identifies structures on the property and any condition issues. Insurers can proactively help customers schedule repairs by identifying issues in their roofs, or suggest other coverage when new structures, like a swimming pool, are installed. + +### Fraud detection: +##### Identifies claims which are potentially fraudulent. Rare events problem. Class imbalance is a huge challenge here + +### Personalized offers: +##### Improves the customer experience by offering relevant information about the coverage the insured may need based on life events, such as the birth of a child, purchase of a home or car. + +### Claims processing +##### Claims processing includes multiple tasks, including review, investigation, adjustment, remittance, or denial. While performing these tasks, numerous issues might occur: + +* Manual/inconsistent processing: Many claims processing tasks require human interaction that is prone to errors. +* Varying data formats: Customers send data in different formats to make claims. +* Changing regulation: Businesses need to accord in changing regulations promptly. Thus, constant staff training and process update are required for these companies. + +### Claims document processing +As customers make claims when they are in an uncomfortable position, customer experience and speed are critical in these processes. Thanks to document capture technologies, businesses can rapidly handle large volumes of documents required for claims processing tasks, detect fraudulent claims, and check if claims fit regulations. + +### Application processing +Application processing requires extracting information from a high volume of documents. While performing this task manually can take too long and prone to errors, document capture technologies enable insurance companies to automatically extract relevant data from application documents and accelerate insurance application processes with fewer errors and improved customer satisfaction. + +### Insurance pricing +AI can assess customers’ risk profiles based on lab testing, biometric data, claims data, patient-generated health data, and identify the optimal prices to quote with the right insurance plan. This would decrease the workflow in business operations and reduce costs while improving customer satisfaction. + +### Document creation +Insurance companies need to generate high volumes of documents, including specific information about the insurer. While creating these documents manually consume time and prone to errors, using AI and automation technologies can generate policy statements without mistakes. + +### Responding to customer queries +Conversational AI technologies can support insurance companies for faster replies to customer queries. For example, a South African insurance company, Hollard, has achieved 98% automation and reduced cost per transaction by 91%, according to its solution providers, LarcAI and UiPath. + + + +