

During my master's program in Business Analytics at Suffolk University, I took an introductory course in Python, as it is an integral part of data analytics. We covered the Python development environment, syntax, and foundations, followed by exploratory data analysis and hypothesis testing in Pandas. After the basics, we dove into major statistical topics such as Linear Regression, the Train-Test Split, the Bias-Variance Tradeoff, K-Nearest Neighbors (KNN), and Classification. We then went over Logistic Regression and Decision Trees, and finished with data APIs and an introduction to Time Series.
Many of these topics I knew before; many others I heard of for the first time. Our core book was “Introduction to Machine Learning with Python” by Andreas C. Müller & Sarah Guido.
This article describes the final project I completed at the end of the course. The project is based on two research studies: ‘Building Risk Prediction Models for Type 2 Diabetes Using Machine Learning Techniques’ by Xie Z, Nikolayeva O, Luo J, Li D. and ‘Developing risk prediction models for type 2 diabetes: a systematic review of methodology and reporting’ by Collins GS, Mallett S, Omar O, Yu LM.
The data for the analysis comes from the Behavioral Risk Factor Surveillance System (BRFSS), which collects health-related telephone survey data on US residents. Since the BRFSS survey files come in formats other than CSV, I used Winston Larson's work on GitHub, where he has done extensive work on extracting and cleaning BRFSS data.
The original data has 279 variables and 464,644 records for 2014. Based on the peer-reviewed articles mentioned above, I chose 26 main personal and general health-related characteristics, such as General Health, BMI, Age, and Sleep Time. The target variable is a binary classification of the Yes or No answer to the question “Have you ever been told you have diabetes?” The source code could be viewed at this link.
As a first step, we take out people under 30, who might have type 1 diabetes, as well as people who are pregnant or have prediabetes, since they are not our main focus; a sketch of this filtering step is shown below. Next, we permanently drop NA values, which leaves 143,383 observations for our analysis, and we run the describe function to check summary statistics for the data after the NA values are dropped. In addition, we construct a histogram for each variable to check whether it is normally distributed.
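A minimal sketch of what this filtering could look like, assuming the cleaned data keeps BRFSS-style columns named age and pregnant and that diabete3 still carries the raw BRFSS answer codes (2 = diabetes only during pregnancy, 4 = prediabetes); the exact column names and codes depend on the upstream cleaning:
# Keep only respondents relevant to type 2 diabetes risk
df = df[df.age >= 30]                  # under 30: possible type 1 diabetes
df = df[df.pregnant != 1]              # exclude pregnant respondents
df = df[~df.diabete3.isin([2, 4])]     # exclude gestational diabetes / prediabetes answers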
# Permanently drop NA values and run summary statistics
df.dropna(inplace=True)
df.describe()
df.hist(figsize=(20,20))
The next step is to check whether our data is balanced, which is recommended because our target variable has a much lower relative frequency for one class than for the other (more people will answer No than Yes to our main question, “Have you ever been told you have diabetes?”).
# Is our data balanced? It is!
df.diabete3.value_counts(normalize=True)
The value_counts function from the Pandas library, called with normalize=True, returns the relative frequencies of the unique values. In our case, it shows that 0.83 of people answered No and 0.16 answered Yes. Based on the book “Discovering Knowledge in Data: An Introduction to Data Mining” by Daniel T. Larose, it is well-advised to have the minority class make up at least around 10%, or to use a resampling technique otherwise.
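For reference, if the classes were more skewed, a simple fix would be to downsample the majority class. A minimal sketch with Pandas (not needed here, since 16% is acceptable); it works off whatever coding diabete3 uses:
import pandas as pd

# Downsample the majority class so both answers are equally represented
counts = df.diabete3.value_counts()
majority = df[df.diabete3 == counts.idxmax()]
minority = df[df.diabete3 == counts.idxmin()]
majority_down = majority.sample(n=len(minority), random_state=50)
df_balanced = pd.concat([majority_down, minority]).sample(frac=1, random_state=50)
df_balanced.diabete3.value_counts(normalize=True)  # now roughly 0.5 / 0.5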
After verifying that our data is reasonably balanced, with a minority-class proportion of 16%, we assign our features (personal and health information) to X and our target variable (the Yes or No answer) to y.
feature_cols = ['genhlth', 'age', 'bmi_class', 'checkup1', 'income2', 'race',
                'mscode', 'flushot6', 'employ1', 'sex', 'marital', 'education',
                'sleptim1', 'cvdcrhd4', 'hlthcvr1', 'menthlth', 'chckidny',
                'useequip', 'exercise', 'addepev2', 'renthom1', 'exerany2',
                'blind', 'decide', 'hlthpln1', 'smoker']
X = df[feature_cols]
y = df.diabete3
The next step is dividing the data into two sets: training and testing. For this data, I use a 70/30 approach, which leaves thirty percent of the data for testing purposes. The Scikit-learn library has a train_test_split function which randomly splits the data into train and test sets. The random_state parameter is set to 50 to control the randomness of the split (simply put, every time we run the model we get exactly the same split, rather than a different one with every run).
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=50, test_size=0.3)
A simple Decision Tree model is created first in order to compare it to the Random Forest model at the end. The Scikit-learn library provides the DecisionTreeClassifier class, which is used to create a decision tree classifier and then fit the training data. The default criterion used to measure the quality of a split is the 'gini' index. The maximum depth of the tree is the default, None (certainly not the best approach, but it does not interfere with the goal of the project). The rest of the parameters for this class can be found in the documentation.
# Make a decision tree
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(random_state=50)

# Train the tree
tree.fit(X_train, y_train)
Then we use our fitted model to make predictions with the predict function from the Scikit-learn library. This function takes new data and predicts a label for it. Another function, predict_proba, returns a probability for each class.
# Using the fitted model, make predictions
X_train_tree_predictions = tree.predict(X_train)
X_train_tree_probs = tree.predict_proba(X_train)[:, 1]

tree_predictions = tree.predict(X_test)
tree_probs = tree.predict_proba(X_test)[:, 1]
The next step is calculating the Area Under the Receiver Operating Characteristic Curve (ROC AUC) from our predicted scores with the roc_auc_score function. It summarizes the performance of our classification model across all classification thresholds. In our case, the ROC AUC is 0.59.
# Calculate ROC AUC
from sklearn.metrics import roc_auc_score

roc_value = roc_auc_score(y_test, tree_probs)
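To see that threshold-by-threshold behavior, the ROC curve itself can be plotted. A minimal sketch using Scikit-learn's roc_curve and matplotlib (an extra dependency not otherwise used in this project), assuming the target is encoded as 0/1 as the roc_auc_score call above implies:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# False positive rate and true positive rate at every classification threshold
fpr, tpr, thresholds = roc_curve(y_test, tree_probs)

plt.plot(fpr, tpr, label=f'Decision Tree (AUC = {roc_value:.2f})')
plt.plot([0, 1], [0, 1], linestyle='--', label='Random guess')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()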
Now we need to check which features are the most important in predicting diabetes. The feature_importances_ attribute shows that the most important features are General Health, Income, Sleep Time, and Age.
feature_tree = (pd.DataFrame({'Feature': feature_cols,
                              'Importance': tree.feature_importances_})
                .sort_values('Importance', ascending=False))
The last step is to run an out-of-sample test such as cross-validation (CV) in order to get a final estimate of our model's performance. For this step we use the cross_val_score function, set the number of folds to 10, and compute the mean of the cross-validation scores. In our case, the score is 0.76, which is 76% accuracy.
from sklearn.model_selection import cross_val_score
import numpy as np

scores = cross_val_score(tree, X, y, cv=10, scoring='accuracy')
np.mean(scores)
A Random Forest modeling process involves the same steps as building a Decision Tree model. A Random Forest is a collection of decision trees whose individual predictions are averaged to produce the final class. In our case, we build 100 trees and do not specify a maximum depth for them.
# Create the model with 100 trees
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, max_features='sqrt', oob_score=True,
                               random_state=50, n_jobs=-1, verbose=1)

# Fit the model on the training data
model.fit(X_train, y_train)
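Since oob_score=True is passed above, Scikit-learn also computes an out-of-bag estimate of accuracy during fitting; it can be checked directly once the model is fit:
# Out-of-bag accuracy estimate, based on the samples each tree did not see
print(model.oob_score_)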
Next, we use the fitted model to make predictions and find the probability for each class, as we did earlier for the Decision Tree model.
# Using the fitted model, make predictions
X_train_rf_predictions = model.predict(X_train)
X_train_rf_probs = model.predict_proba(X_train)[:, 1]

rf_predictions = model.predict(X_test)
rf_probs = model.predict_proba(X_test)[:, 1]
The next step is calculating the ROC AUC from our predicted scores with the roc_auc_score function. For this model, the ROC AUC is 0.78, which is higher than what we had with the Decision Tree model.
# Calculate ROC AUC
roc_value = roc_auc_score(y_test, rf_probs)
Now we need to check which features are the most important in predicting diabetes. The feature_importances_ attribute shows that the most important features are Income, Sleep Time, Age, and General Health.
# Compute feature importances
feature_model = (pd.DataFrame({'Feature': feature_cols,
                               'importance': model.feature_importances_})
                 .sort_values('importance', ascending=False))
The last step is to run cross-validation (CV) in order to get a final estimate of the model's performance. For the Random Forest model, the accuracy is 0.84, which is higher than the simple Decision Tree model by 8 percentage points.
scores = cross_val_score(model, X, y, cv=10, scoring='accuracy')
np.mean(scores)
Thank you for either reading or completing the project with me! I would really appreciate feedback in the comments, or your examples of other models used. Any questions are welcome!
[1] Xie Z, Nikolayeva O, Luo J, Li D. Building Risk Prediction Models for Type 2 Diabetes Using Machine Learning Techniques. Prev Chronic Dis 2019;16:190109. DOI: http://dx.doi.org/10.5888/pcd16.190109
[2] Collins GS, Mallett S, Omar O, Yu LM. Developing risk prediction models for type 2 diabetes: a systematic review of methodology and reporting. BMC Med 2011;9(1):103. https://bmcmedicine.biomedcentral.com/articles/10.1186/1741-7015-9-103
[3] Larson W. Insights into health and behavior using data from the CDC. https://github.com/winstonlarson/brfss
[4] Nelson J. Decision Trees. Adapted from Chapter 8 of An Introduction to Statistical Learning. http://faculty.marshall.usc.edu/gareth-james/