Wuraola Oyewusi

Technical Tutorial April 30, 2019

Predicting Diabetes from Glycohemoglobin Data Using Machine Learning

AIM: This article shows how to use Machine Learning on health-related datasets. I will give the rationale for most of my decisions, and I hope it sparks the interest of more people in the health sciences to consider Machine Learning algorithms when thinking of solutions.

REQUIREMENTS: an interactive notebook (I recommend Google's Colab, https://colab.research.google.com), a basic understanding of Python programming, and a basic understanding of Machine Learning.

The model built will be a classification predictive model that learns from selected features and predicts if a client is Normal, Pre-diabetic, or Diabetic.

Understanding Glycohemoglobin (HbA1c)

Glycohemoglobin (HbA1c, A1c) is a measure of how much glucose is bound to hemoglobin in red blood cells, reflecting average blood glucose over the preceding 2–3 months. People with diabetes or other conditions that raise blood glucose have higher A1c values.
Normal: <5.7%
Pre-diabetes: 5.7–6.4%
Diabetes: ≥6.5%

The Dataset

The original dataset has about 6800 rows and 20 columns, obtained from Vanderbilt Biostatistics Datasets.

Only a subset of the columns (9 out of 20) will be used for this analysis. The dataset's features are explained at this link.

Dataset preview

Feature Selection

The features selected were: Seqn (unique identifier), Sex, Age, BMI, Waist circumference, Glycohemoglobin, Albumin, Creatinine (SCr), and Blood urea nitrogen (BUN).

I made these decisions based on domain knowledge. For example, BMI already captures the relationship between height and weight, a high waist circumference is a risk factor for developing diabetes, and Albumin, SCr, and BUN can be used to assess kidney function.
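Loading the file and keeping the nine selected features might look like the sketch below. The file name and column headers here are assumptions based on the feature list; the real file's headers may differ. A small inline CSV stands in for the actual download:

```python
from io import StringIO

import pandas as pd

# Tiny inline stand-in for the downloaded file; in practice this would be
# something like pd.read_csv("nhgh.csv") (file and column names assumed).
csv = StringIO(
    "seqn,sex,age,bmi,waist,gh,albumin,SCr,bun\n"
    "1,female,34,28.2,92.0,5.5,4.3,0.8,12\n"
    "2,male,51,31.7,104.5,6.8,4.1,1.0,15\n"
)

# Keep only the nine selected columns.
cols = ["seqn", "sex", "age", "bmi", "waist", "gh", "albumin", "SCr", "bun"]
df = pd.read_csv(csv)[cols]
print(df.shape)  # → (2, 9)
```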

Feature Engineering

Machine learning algorithms work best when data is in numerical form, so I created a column (gh_int) that assigns the glycohemoglobin values to three integer classes:

Feature engineering code
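A minimal sketch of this step, using the HbA1c cut-offs from the table above (the `gh` column name and the 0/1/2 class encoding are assumptions):

```python
import pandas as pd

def gh_class(gh: float) -> int:
    """Map a glycohemoglobin (%) value to an integer class:
    0 = normal (<5.7), 1 = pre-diabetic (5.7-6.4), 2 = diabetic (>=6.5)."""
    if gh < 5.7:
        return 0
    elif gh < 6.5:
        return 1
    return 2

df = pd.DataFrame({"gh": [5.2, 6.0, 7.1]})
df["gh_int"] = df["gh"].apply(gh_class)
print(df["gh_int"].tolist())  # → [0, 1, 2]
```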

I also created a column (age_range) that bins age into ranges, which makes working with age easier.

Age binning code
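This step can be sketched with pandas' `pd.cut`; the specific bin edges and labels below are assumptions, not necessarily the article's exact choices:

```python
import pandas as pd

df = pd.DataFrame({"age": [23, 37, 45, 61, 78]})

# Bin ages into ranges; edges/labels are illustrative assumptions.
bins = [0, 20, 30, 40, 50, 60, 70, 120]
labels = ["0-20", "21-30", "31-40", "41-50", "51-60", "61-70", "70+"]
df["age_range"] = pd.cut(df["age"], bins=bins, labels=labels)
print(df["age_range"].tolist())  # → ['21-30', '31-40', '41-50', '61-70', '70+']
```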

Exploratory Data Analysis

I did some visualization to learn about the dataset:

Gender distribution

There is a good balance between male and female data.

Age range distribution

This particular dataset has the most records in the 20s–30s age ranges. Most cases of type 2 diabetes are diagnosed in people aged 40 and above; if the age ranges above 40 are summed up, the dataset is still a good representation of what is obtainable.

Glycohemoglobin distribution

Many people in this dataset have normal glycohemoglobin values, so the classes are imbalanced; we will test how well our model can learn and generalize despite this.

Correlation map
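The plots above can be reproduced with a sketch like the one below. The DataFrame here is synthetic stand-in data, and plain matplotlib/pandas plotting is used in place of whatever library the original notebook used:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen (no display needed)
import matplotlib.pyplot as plt

# Synthetic stand-in for the real dataset.
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "sex": rng.choice(["male", "female"], size=200),
    "age": rng.integers(18, 80, size=200),
    "bmi": rng.normal(28, 5, size=200),
    "gh": rng.normal(5.8, 0.9, size=200),
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
df["sex"].value_counts().plot(kind="bar", ax=axes[0], title="Gender")
df["gh"].plot(kind="hist", bins=20, ax=axes[1], title="Glycohemoglobin")

corr = df[["age", "bmi", "gh"]].corr()   # correlation map
im = axes[2].imshow(corr, cmap="coolwarm")
axes[2].set_xticks(range(len(corr)))
axes[2].set_xticklabels(corr.columns)
axes[2].set_yticks(range(len(corr)))
axes[2].set_yticklabels(corr.columns)
fig.colorbar(im, ax=axes[2])
plt.close(fig)
print(corr.shape)  # → (3, 3)
```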

Handling Missing Values

When I tried to compare algorithm performance using cross-validation, I ran into an error indicating that there were missing values in my data. Always check for missing values; on investigation, Python was right and I was wrong!

Missing values handling

I dropped rows with missing values, which fixed the error. There are other methods of dealing with missing values, such as imputing the mean or median.
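The check-and-drop step might look like this (a toy DataFrame is used for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"bmi": [24.1, np.nan, 31.7],
                   "albumin": [4.2, 4.0, np.nan]})

print(df.isnull().sum())  # count of missing values per column
df = df.dropna()          # drop any row containing a missing value
print(df.shape)           # → (1, 2)
```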

Model Training and Comparison

I split the dataset into a 33% test set (which will be hidden from the algorithms) and a 67% train set that the algorithms will train on. I also scaled the data using MinMaxScaler because the features span very different ranges. These utilities are all readily available in the sklearn library.

I used cross-validation to compare the performance of about seven Machine Learning algorithms on the data.

Algorithm comparison
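The split–scale–compare workflow described above might look like the following sketch, using synthetic stand-in data. XGBoost is omitted here since it is a separate dependency; the three sklearn models shown are representative, not the article's exact list:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the selected features and the gh_int target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 7))
y = rng.integers(0, 3, size=300)

# 33% test / 67% train split, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Scale to [0, 1]; fit on the training set only to avoid leakage.
scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(random_state=42),
}
for name, model in models.items():
    # 5-fold cross-validated accuracy on the training set.
    scores = cross_val_score(model, X_train_s, y_train, cv=5)
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```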

The algorithms with the highest accuracy were Random Forest, Decision Tree, and XGBoost; any of them could be used for the final model. It is good practice to compare algorithm performance. The choice of final algorithm can be based on many factors, such as familiarity, sensitivity to scaling, compute power, and ease of interpretation.

Decision Tree Model Results

Decision Tree model code
Decision Tree confusion matrix

The Decision Tree model had a perfect score. (A perfect score is worth double-checking: gh_int is derived directly from the glycohemoglobin column, so if glycohemoglobin is left in as a feature, the tree can simply learn the thresholds.) I wonder how it would perform on a larger dataset.
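A hedged sketch of fitting a Decision Tree and printing its confusion matrix, again on synthetic stand-in data rather than the actual dataset:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 7))
# Give the label real signal so the tree has something to learn.
s = X[:, 0] + X[:, 1]
y = (s > 0).astype(int) + (s > 1).astype(int)  # three classes: 0, 1, 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
y_pred = tree.predict(X_test)

cm = confusion_matrix(y_test, y_pred)  # rows: true class, cols: predicted
acc = accuracy_score(y_test, y_pred)
print(cm)
print(f"accuracy: {acc:.3f}")
```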

Logistic Regression Model

I tried Logistic Regression, whose accuracy on this data is not as high as the Decision Tree's, to show what its metrics look like.

Logistic Regression Confusion Matrix
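A sketch of the Logistic Regression step (a binary toy target is used for brevity; the real model is three-class). `classification_report` shows per-class precision and recall alongside the confusion matrix:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 7))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # binary toy target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
scaler = MinMaxScaler().fit(X_train)  # fit on train only
clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)
y_pred = clf.predict(scaler.transform(X_test))

cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))
```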

Conclusion

This is an example of how to build Machine Learning models for healthcare applications. I hope this was simple and easy to follow.

The code is on GitHub.