Diabetes Prediction Application Using Machine Learning Algorithms

In this article we are going to build an application that predicts whether a user is likely to be diabetic, based on a small amount of information provided by the user.

Once we have our data, let's load and analyze it. But first, let's import the required libraries.

Don't be intimidated by all the import statements! We will understand each of them in detail. Now let's load the data and view the first five records.
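A minimal sketch of this step, assuming the Pima Indians Diabetes dataset is saved locally as diabetes.csv (the file name is an assumption):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the dataset and look at the first five records
df = pd.read_csv('diabetes.csv')
df.head()
```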

Thus we have successfully loaded the data. Now let's see how many records and how many features we have.
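For example (a sketch; per the next line, the expected shape is 768 rows and 9 columns):

```python
print(df.shape)             # (number of rows, number of columns)
print(df.columns.tolist())  # the feature names plus the target
```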

So we have 768 rows and 9 features, out of which one will be our target variable.

Now let's understand our features. First of all, what are our features? The answer is…

Pregnancies - Number of times the user has undergone pregnancy.

Glucose - Plasma glucose concentration at 2 hours in an oral glucose tolerance test

Blood Pressure - Diastolic blood pressure (mm Hg)

Skin Thickness - Triceps skin fold thickness (mm)

Insulin - 2-Hour serum insulin (mu U/ml)

BMI - Body mass index (weight in kg/(height in m)²)

Diabetes Pedigree Function - Likelihood of diabetes based on family history

Age - Age of the user.

Given all these features, we have to predict the outcome as 0 or 1, i.e., the user is not diabetic or is diabetic, respectively.

First, let's split our data into train and test sets in a 70:30 ratio.
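A sketch of the split, assuming the target column is named Outcome as in the standard Pima dataset (random_state is my addition, for reproducibility):

```python
X = df.drop('Outcome', axis=1)  # input features
y = df['Outcome']               # target variable (0 = not diabetic, 1 = diabetic)

# 70:30 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
```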

Now let's build our model. Let's try it out with various classification algorithms, which are…

Logistic regression is a very powerful classifier for binary classification. It takes the independent features and maps them into the range 0 to 1 using the sigmoid function; based on these values the classifier makes its prediction, i.e., if the value is closer to 1 the outcome is positive, and if it is closer to 0 the outcome is negative.

The sigmoid function looks like this…
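Written out, the sigmoid is

$$\sigma(z) = \frac{1}{1 + e^{-z}},$$

which squashes any real-valued input $z$ into the interval $(0, 1)$.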

KNN is a very simple classification algorithm that predicts the target variable for a given input based on the input's surrounding neighbours. Let's get a clear understanding: KNN assumes that similar things exist in close proximity. Based on this assumption, it finds the k most similar items to the input and classifies the target as that of the k neighbours (typically by taking the mode of the neighbours' target values). The similarity measure can be a distance.
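A minimal from-scratch sketch of the idea, using Euclidean distance as the similarity measure (all names here are illustrative, not from the original code):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest points
    nearest = np.argsort(distances)[:k]
    # Mode of the neighbours' labels (assumes integer class labels, e.g. 0/1)
    return np.bincount(y_train[nearest]).argmax()
```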

Linear regression is a statistical technique where the target "y" is predicted using a straight line, y = mx + c, where m is the slope of the line and c is the intercept (bias). m and c are continuously updated until we find the best possible prediction of y. But this approach is only for numerical targets and is not suitable for categorical data. I am including this algorithm just to show practically how badly it performs with categorical data (as in our case).

SVM is a supervised classification algorithm that classifies data using a hyperplane (decision boundary), along with two margin lines drawn on both sides of the decision boundary. These create a margin around the decision boundary, which leads to better and safer predictions in a generalized model. Now, how are these margin lines drawn? They are drawn parallel to the hyperplane in such a way that they touch the closest data points (the support vectors) on either side of the decision boundary. Maximizing the margin width gives us a better and more efficient model, and we should keep this in mind while choosing the hyperplane.
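In equation form, the hyperplane is the set of points satisfying

$$w \cdot x + b = 0,$$

the two margin lines are $w \cdot x + b = \pm 1$, and the margin width is $2 / \lVert w \rVert$, so maximizing the margin amounts to minimizing $\lVert w \rVert$.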

This algorithm is based on Bayes' theorem. The basic idea is that it calculates the conditional probabilities of the target variable given the input features, and the classification is done based on these probability values. The naive assumption is that the input variables are conditionally independent; hence it is called the Naive Bayes algorithm. The Gaussian Naive Bayes algorithm is built on the assumption of a normal distribution, whereas the Multinomial Naive Bayes algorithm assumes a multinomial distribution. Multinomial Naive Bayes works well for data that can easily be turned into counts, such as word counts in text; Gaussian Naive Bayes, by contrast, is based on a continuous distribution and is suitable for more generic classification tasks. We will observe the performance of both algorithms.
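For reference, the resulting decision rule under the conditional-independence assumption is

$$\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y),$$

where each $P(x_i \mid y)$ is modeled as a Gaussian or multinomial distribution depending on the variant.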

By now I hope you have a rough idea of all the algorithms. So now, using sklearn, let's individually build a classifier with each of these algorithms and see how they perform.
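A sketch of this step (the handling of Linear Regression's continuous output is my own, to make its accuracy comparable):

```python
models = {
    'Logistic Regression': LogisticRegression(),
    'KNN': KNeighborsClassifier(),
    'Linear Regression': LinearRegression(),   # included only to show how badly it does
    'SVM': SVC(),
    'Gaussian Naive Bayes': GaussianNB(),
    'Multinomial Naive Bayes': MultinomialNB(),
    'Random Forest': RandomForestClassifier(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    if name == 'Linear Regression':
        # Round the continuous output to 0/1 so accuracy can be computed
        preds = np.round(preds).clip(0, 1)
    print(f'{name}: {accuracy_score(y_test, preds):.3f}')
```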

Let's see the performance of these algorithms…

As you can see, linear regression is useless for categorical data. Logistic regression, random forest, and Gaussian Naive Bayes give the best results.

Now let's try to improve the accuracies. We can do this in many ways; the methods we will follow in this project are…

Handling missing data

Even though our data shows no null values, it may contain false zeros that crept in by mistake. This is not true of every feature; for example, the number of pregnancies can legitimately be 0, so we should not touch attributes of that type. We can, however, check Glucose, BloodPressure, SkinThickness, Insulin, and BMI.
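A quick way to check for these false zeros (a sketch):

```python
zero_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
# Count how many zeros each of these columns contains
print((df[zero_cols] == 0).sum())
```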

It is evident that there are many 0s. The way to handle this problem is to replace the zeros with the mean or the median of that column. Let's try both.

Replacing with mean…
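One reasonable reading of this step, as a sketch (here the mean is computed over the non-zero values of each column):

```python
df_mean = df.copy()
for col in zero_cols:
    # Mean of the non-zero entries, used to fill in the false zeros
    col_mean = df_mean.loc[df_mean[col] != 0, col].mean()
    df_mean[col] = df_mean[col].replace(0, col_mean)
```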

Let's see the performance of all the algorithms…

Considering the best-performing algorithm, i.e., Random Forest, we can see a very slight increase. Now let's try the median.
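The median variant is analogous:

```python
df_median = df.copy()
for col in zero_cols:
    # Median of the non-zero entries
    col_median = df_median.loc[df_median[col] != 0, col].median()
    df_median[col] = df_median[col].replace(0, col_median)
```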

Let's see the performances…

There isn't much difference, so we are going to stick with the mean.

Feature Selection

Feature selection is a process in which we train the model on only a subset of the input features rather than all of them. So how do we select these features? We should analyze and understand the relation between the features and the target, keep the features that are strongly related, and drop the ones that do not have a strong relation with the target. To analyze these relations, let's visualize the data.

From the above histograms we can see the relation of each feature to both outcomes. If we observe carefully, Diabetes Pedigree Function has comparatively less correlation than the others. You can also observe this using a heat map of the correlations.
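One way to draw such a heat map with seaborn (a sketch, using the mean-imputed data from above):

```python
plt.figure(figsize=(9, 7))
# Correlation matrix of all columns, annotated cell by cell
sns.heatmap(df_mean.corr(), annot=True, cmap='coolwarm')
plt.show()
```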

So let's drop Diabetes Pedigree Function.
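A sketch, with the column name as it appears in the standard Pima dataset:

```python
X_reduced = df_mean.drop(['Outcome', 'DiabetesPedigreeFunction'], axis=1)
y = df_mean['Outcome']
```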

Now let's check the performance.

The performance degrades. Hence, dropping features is not a good idea on this dataset.

Hyper-Parameter tuning

Almost all algorithms have certain parameters that can be tuned to extract optimal performance from the model. Now we are going to perform such tuning; after checking various combinations, we are going to tune as follows (see the sketch after this list)…

Logistic regression - We are going to set the maximum number of iterations for the solver (the algorithm used in the optimization problem) to converge to 10,000.

KNN - We will set the number of neighbours considered to 10.

SVM - We will set the regularization parameter (C) to 2 and the maximum iterations to 1,000.

Multinomial Naive Bayes - We will set the additive smoothing parameter (which avoids the zero-probability case) to 2.
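In sklearn terms, those choices look roughly like this (a sketch; the untouched models keep their defaults):

```python
tuned_models = {
    'Logistic Regression': LogisticRegression(max_iter=10000),
    'KNN': KNeighborsClassifier(n_neighbors=10),
    'SVM': SVC(C=2, max_iter=1000),
    'Multinomial Naive Bayes': MultinomialNB(alpha=2),
    'Gaussian Naive Bayes': GaussianNB(),
    'Random Forest': RandomForestClassifier(),
}

for name, model in tuned_models.items():
    model.fit(X_train, y_train)
    print(f'{name}: {model.score(X_test, y_test):.3f}')
```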

Let's see the performances now…

We can see an increase in the accuracy of Random Forest, which is the best performer. So we can select the Random Forest algorithm as the final model for our application.

Let's select Random Forest, with 75% accuracy, as our final model.

To run the application on your PC's localhost (Python is required) —

To deploy the application (using the free hosting service Heroku) —

To deploy on Heroku you will require two files: a Procfile, which tells Heroku how to run the app, and a requirements file, which tells Heroku the dependencies. Don't worry! These files are also included in my git repository.
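For reference, a typical Procfile for a Flask app served through gunicorn looks like the one below; the module and variable name app:app are an assumption, not necessarily what this repository uses:

```
web: gunicorn app:app
```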
