Predicting heart disease through machine learning

Suhaib Ali Kamal · Published in Nerd For Tech · Apr 14, 2021 · 4 min read


Classification is one of the most common areas where machine learning algorithms are applied, often with excellent results. The key difference between a regression problem and a classification problem is that in a classification problem the target variable is categorical (often binary).

In this article, we will work through the heart disease dataset published by the UCI Machine Learning Repository, where the target variable indicates the presence of heart disease. We will go over multiple algorithms and also see how boosting can improve model results.

The dataset has multiple categorical and continuous independent variables which can be used to predict heart disease in a patient. After checking for missing values (if any), we will proceed to exploratory data analysis (EDA) to identify patterns in the data.
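The loading and missing-value check are not shown in the article; a minimal sketch, assuming the data has been downloaded to a file named heart.csv (the file name is an assumption):

import pandas as pd

# Hypothetical file name; the UCI data has to be downloaded separately
df = pd.read_csv("heart.csv")

print(df.shape)                      # number of rows and columns
print(df.isnull().sum())             # missing values per column
print(df['target'].value_counts())   # class balance of the target variable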

Left: Distribution of Age. Right: Age vs Target
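The plots above can be reproduced along the following lines; this is only a sketch, since the exact plot types used in the original figure are not specified (a histogram and a box plot are assumed):

import seaborn as sns
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Left: distribution of age across all patients
sns.histplot(df['age'], kde=True, ax=axes[0])
axes[0].set_title('Distribution of Age')

# Right: age broken down by the target variable
sns.boxplot(x='target', y='age', data=df, ax=axes[1])
axes[1].set_title('Age vs Target')

plt.show()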

As can be seen from the above plots, younger people are more prone to heart disease than older people, which is surprising. Let us see whether gender has an impact on heart disease.

Male:0 Female:1

As can be seen from the above diagram, males are more prone to heart disease compared to females. We can also look at how cholesterol affects the rate of heart disease.

Effect of cholesterol on heart disease

What can be seen from the above diagrams is that cholesterol by itself does not have a huge impact on heart disease. EDA is a crucial part of data science, and its importance is often underrated. For this article, we are going to move forward and build the actual model. However, before we do, we can also look at the correlation matrix to get a better sense of the overall data.

Correlation matrix
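A correlation heatmap like the one above can be produced with a short snippet; a sketch, assuming df still holds the numerically encoded dataset and the seaborn/matplotlib imports from the earlier sketch:

plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation matrix')
plt.show()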

Feature Engineering

Before building the model, it is important to make sure that the independent variables are engineered in a way that can improve model performance. The categorical features need to be converted to dummy variables, for which the code is shown below.

# One-hot encode the categorical features, dropping the first level of each to avoid redundancy
df = pd.get_dummies(df, columns=['cp', 'restecg', 'slope', 'ca', 'thal'], drop_first=True)

# Separate the target from the predictors
y = df.target
X = df.drop("target", axis=1)

Furthermore, it is important to standardise the continuous variables to improve model performance. One important thing to note is that the training and test sets should be segregated prior to the standardisation: the StandardScaler from scikit-learn is fitted on the training set only and then used to transform the test set.
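The train/test split itself is not shown in the article; a minimal sketch, assuming an 80/20 split (the test size and random seed are assumptions):

from sklearn.model_selection import train_test_split

# Hold out 20% of the data for testing; stratify so both sets keep a similar class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)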

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
num_columns = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']

# Fit the scaler on the training set only, then reuse the same fitted scaler on the test set
X_train[num_columns] = scaler.fit_transform(X_train[num_columns])
X_test[num_columns] = scaler.transform(X_test[num_columns])

After the transformations are done, we can move to the next step: applying machine learning algorithms.

Algorithms

We will start with logistic regression to see how it performs on the dataset.

import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

lr = LogisticRegression(C=1.0, penalty='l2')
lr.fit(X_train, y_train)
print("The cross validation score mean is ", cross_val_score(lr, X_train, y_train, cv=3).mean())
pred = lr.predict(X_test)
print(classification_report(y_test, pred))
sns.heatmap(confusion_matrix(y_test, pred), annot=True)

This gives us a model accuracy of 56% with a recall of 90%. This is a decent result, but let us see how a decision tree performs on the dataset.

from sklearn.tree import DecisionTreeClassifier

dc = DecisionTreeClassifier()
dc.fit(X_train, y_train)
print("The cross validation score is ", cross_val_score(dc, X_train, y_train).mean())
pred = dc.predict(X_test)
print(classification_report(y_test, pred))
sns.heatmap(confusion_matrix(y_test, pred), annot=True)

The decision tree performs slightly better, with an accuracy of 57%. It has only marginally improved the performance of the model, but perhaps a random forest classifier can increase accuracy further.

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
print("The cross validation score is ", cross_val_score(rf, X_train, y_train).mean())
pred = rf.predict(X_test)
print(classification_report(y_test, pred))
sns.heatmap(confusion_matrix(y_test, pred), annot=True)
Confusion Matrix
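To pull the false-negative and false-positive counts out of this matrix programmatically rather than reading them off the heatmap, the confusion matrix can be unpacked; a small sketch, assuming the positive label 1 denotes the presence of heart disease:

from sklearn.metrics import confusion_matrix, recall_score

# For a binary problem, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
print(f"False negatives: {fn}, False positives: {fp}")
print("Recall (sensitivity):", recall_score(y_test, pred))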

The random forest classifier has improved model performance, as the accuracy is now 71%. The heatmap is the confusion matrix on the test set, and it shows that there are 19 false negatives and 3 false positives. In a heart disease setting, the cost of a false negative is very high. Let us see whether a boosting algorithm can improve this. We are going to use a very popular boosting algorithm called XGBoost, which has been used to win Kaggle competitions. The code for the algorithm is written below.

Boosting

import xgboost as xgb

xg_reg = xgb.XGBClassifier(objective='binary:logistic', colsample_bytree=0.3, learning_rate=0.1,
                           max_depth=5, alpha=10, n_estimators=10)
xg_reg.fit(X_train, y_train)
pred = xg_reg.predict(X_test)
print(classification_report(y_test, pred))
sns.heatmap(confusion_matrix(y_test, pred), annot=True)
Confusion matrix

The above diagram displays the confusion matrix. As can be seen, the performance of the model has improved significantly, with fewer false negatives. The XGBoost model has an accuracy in excess of 80%.

We saw in this example how machine learning can be applied to a dataset to predict heart disease. We started with exploratory data analysis, moved on to feature engineering, and then built the machine learning models. We concluded with the XGBoost model to show how boosting can improve model performance.
