Heart Disease Prediction Model Proposal

Background

Heart disease is the leading cause of death worldwide, accounting for about 33% of all deaths. Data from 2019 attributed 18.5 million deaths to cardiovascular diseases, which is more than 50,000 deaths per day on average.

People with certain risk factors need a way to detect possible heart disease early, and such a model could be useful to doctors as well.

Here are some articles that expand on the seriousness of the issue:

Causes of Death Globally: https://ourworldindata.org/causes-of-death-treemap

Early Detection: https://acsd4u.com/2021/06/30/early-detection-of-heart-disease/

Dataset

Here is a glimpse of the dataset, along with its source.

The dataset contains 918 observations with 12 attributes.

The columns (attributes) are as follows:

Age: age of the patient [years]

Sex: sex of the patient [M: Male, F: Female]

ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]

RestingBP: resting blood pressure [mm Hg]

Cholesterol: serum cholesterol [mg/dl]

FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]

RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]

MaxHR: maximum heart rate achieved [numeric value between 60 and 202]

ExerciseAngina: exercise-induced angina [Y: Yes, N: No]

Oldpeak: ST depression induced by exercise relative to rest [numeric value]

ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]

HeartDisease: output class [1: heart disease, 0: Normal]

Citation: fedesoriano. (September 2021). Heart Failure Prediction Dataset. Retrieved [2/27/2023] from https://www.kaggle.com/fedesoriano/heart-failure-prediction.

In [1]:
import pandas as pd

# Load the heart disease dataset (918 rows, 12 columns)
df = pd.read_csv('hdprediction.csv')

# Display the first and last rows
df
Out[1]:
     Age  Sex  ChestPainType  RestingBP  Cholesterol  FastingBS  RestingECG  MaxHR  ExerciseAngina  Oldpeak  ST_Slope  HeartDisease
0    40   M    ATA            140        289          0          Normal      172    N               0.0      Up        0
1    49   F    NAP            160        180          0          Normal      156    N               1.0      Flat      1
2    37   M    ATA            130        283          0          ST          98     N               0.0      Up        0
3    48   F    ASY            138        214          0          Normal      108    Y               1.5      Flat      1
4    54   M    NAP            150        195          0          Normal      122    N               0.0      Up        0
...  ...  ...  ...            ...        ...          ...        ...         ...    ...             ...      ...       ...
913  45   M    TA             110        264          0          Normal      132    N               1.2      Flat      1
914  68   M    ASY            144        193          1          Normal      141    N               3.4      Flat      1
915  57   M    ASY            130        131          0          Normal      115    Y               1.2      Flat      1
916  57   F    ATA            130        236          0          LVH         174    N               0.0      Flat      1
917  38   M    NAP            138        175          0          Normal      173    N               0.0      Up        0

918 rows × 12 columns
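The allowed codes in the attribute list above lend themselves to a quick sanity check before any modeling. The sketch below is one possible way to do that and is not part of the original analysis: `EXPECTED_CATEGORIES` and `check_schema` are hypothetical names introduced here, and the small `sample` frame simply re-types the first five rows of the output above so the snippet runs without `hdprediction.csv`.

```python
import pandas as pd

# Allowed codes per categorical attribute, copied from the attribute list.
EXPECTED_CATEGORIES = {
    "Sex": {"M", "F"},
    "ChestPainType": {"TA", "ATA", "NAP", "ASY"},
    "RestingECG": {"Normal", "ST", "LVH"},
    "ExerciseAngina": {"Y", "N"},
    "ST_Slope": {"Up", "Flat", "Down"},
    "FastingBS": {0, 1},
    "HeartDisease": {0, 1},
}


def check_schema(df):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for column, allowed in EXPECTED_CATEGORIES.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
            continue
        unexpected = set(df[column].dropna().unique()) - allowed
        if unexpected:
            problems.append(
                f"{column}: unexpected values {sorted(map(str, unexpected))}"
            )
    # Report any column that contains missing values.
    missing = df.isna().sum()
    for column, count in missing[missing > 0].items():
        problems.append(f"{column}: {count} missing values")
    return problems


COLUMNS = ["Age", "Sex", "ChestPainType", "RestingBP", "Cholesterol", "FastingBS",
           "RestingECG", "MaxHR", "ExerciseAngina", "Oldpeak", "ST_Slope", "HeartDisease"]

# Stand-in data: the first five rows shown in the output above.
sample = pd.DataFrame([
    (40, "M", "ATA", 140, 289, 0, "Normal", 172, "N", 0.0, "Up", 0),
    (49, "F", "NAP", 160, 180, 0, "Normal", 156, "N", 1.0, "Flat", 1),
    (37, "M", "ATA", 130, 283, 0, "ST", 98, "N", 0.0, "Up", 0),
    (48, "F", "ASY", 138, 214, 0, "Normal", 108, "Y", 1.5, "Flat", 1),
    (54, "M", "NAP", 150, 195, 0, "Normal", 122, "N", 0.0, "Up", 0),
], columns=COLUMNS)

print(check_schema(sample))
```

On the full data, the same check would be called as `check_schema(df)`. It only covers the coded attributes and missing values; range checks on the numeric attributes would need thresholds that the dataset description does not fully specify.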

First, we can group the patients into two groups, heart disease or no heart disease, and compare the trends in each group. After identifying the attributes that differ most between the groups, we can use k-nearest neighbors (KNN) with k-fold cross-validation to train and test a model that predicts whether a patient has heart disease from their other attributes.
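The plan above can be sketched with pandas and scikit-learn. This is a minimal illustration under several assumptions, not the proposal's final method: the two groups are read as a split on the `HeartDisease` label, numeric attributes are standardized and coded attributes are one-hot encoded because KNN relies on distances, and the defaults `n_neighbors=5` and `n_splits=5` are placeholders. The names `summarize_by_outcome` and `knn_cv_accuracy` are hypothetical, and `sample` re-types the ten rows shown in the output above so the snippet runs without `hdprediction.csv`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["Age", "RestingBP", "Cholesterol", "FastingBS", "MaxHR", "Oldpeak"]
CATEGORICAL = ["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope"]
TARGET = "HeartDisease"


def summarize_by_outcome(df):
    """Mean of each numeric attribute within the HeartDisease = 0 and = 1 groups."""
    return df.groupby(TARGET)[NUMERIC].mean()


def knn_cv_accuracy(df, n_neighbors=5, n_splits=5, seed=0):
    """Per-fold accuracy of a KNN classifier under stratified k-fold cross-validation."""
    preprocess = ColumnTransformer(
        [
            ("num", StandardScaler(), NUMERIC),
            ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
        ],
        sparse_threshold=0,  # always return a dense array
    )
    # Preprocessing sits inside the pipeline so it is refit on each training fold.
    model = make_pipeline(preprocess, KNeighborsClassifier(n_neighbors=n_neighbors))
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return cross_val_score(
        model, df[NUMERIC + CATEGORICAL], df[TARGET], cv=folds, scoring="accuracy"
    )


COLUMNS = ["Age", "Sex", "ChestPainType", "RestingBP", "Cholesterol", "FastingBS",
           "RestingECG", "MaxHR", "ExerciseAngina", "Oldpeak", "ST_Slope", "HeartDisease"]

# Stand-in data: the ten rows displayed in the output above.
sample = pd.DataFrame([
    (40, "M", "ATA", 140, 289, 0, "Normal", 172, "N", 0.0, "Up", 0),
    (49, "F", "NAP", 160, 180, 0, "Normal", 156, "N", 1.0, "Flat", 1),
    (37, "M", "ATA", 130, 283, 0, "ST", 98, "N", 0.0, "Up", 0),
    (48, "F", "ASY", 138, 214, 0, "Normal", 108, "Y", 1.5, "Flat", 1),
    (54, "M", "NAP", 150, 195, 0, "Normal", 122, "N", 0.0, "Up", 0),
    (45, "M", "TA", 110, 264, 0, "Normal", 132, "N", 1.2, "Flat", 1),
    (68, "M", "ASY", 144, 193, 1, "Normal", 141, "N", 3.4, "Flat", 1),
    (57, "M", "ASY", 130, 131, 0, "Normal", 115, "Y", 1.2, "Flat", 1),
    (57, "F", "ATA", 130, 236, 0, "LVH", 174, "N", 0.0, "Flat", 1),
    (38, "M", "NAP", 138, 175, 0, "Normal", 173, "N", 0.0, "Up", 0),
], columns=COLUMNS)

print(summarize_by_outcome(sample))
# Ten rows only support a tiny configuration; the scores are not meaningful.
print(knn_cv_accuracy(sample, n_neighbors=3, n_splits=2))
```

With the full dataset loaded as `df`, the calls would be `summarize_by_outcome(df)` and `knn_cv_accuracy(df)`. Trying several values of `n_neighbors` and comparing the mean fold accuracy is one way to choose k.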