Heart disease is the leading cause of death worldwide, accounting for about 33% of all deaths. Data from 2019 attribute 18.5 million deaths to various cardiovascular diseases, which is roughly 51,000 deaths per day on average.
People who have certain risk factors need a form of early detection for possible heart disease, and a model that predicts heart disease from those risk factors could be useful to doctors as well.
Here are some articles that expand on the seriousness of the issue:
Causes of Death Globally: https://ourworldindata.org/causes-of-death-treemap
Early Detection: https://acsd4u.com/2021/06/30/early-detection-of-heart-disease/
Here is a glimpse of what the dataset looks like, along with its source.
The dataset represents 918 observations with 12 attributes.
The columns (attributes) are as follows:
Age: age of the patient [years]\
Sex: sex of the patient [M: Male, F: Female]\
ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]\
RestingBP: resting blood pressure [mm Hg]\
Cholesterol: serum cholesterol [mg/dl]\
FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]\
RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]\
MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]\
ExerciseAngina: exercise-induced angina [Y: Yes, N: No]\
Oldpeak: oldpeak = ST [Numeric value measured in depression]\
ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]\
HeartDisease: output class [1: heart disease, 0: Normal]\
Citation: fedesoriano. (September 2021). Heart Failure Prediction Dataset. Retrieved [2/27/2023] from https://www.kaggle.com/fedesoriano/heart-failure-prediction.
import pandas as pd
df = pd.read_csv('hdprediction.csv')
df
| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
913 | 45 | M | TA | 110 | 264 | 0 | Normal | 132 | N | 1.2 | Flat | 1 |
914 | 68 | M | ASY | 144 | 193 | 1 | Normal | 141 | N | 3.4 | Flat | 1 |
915 | 57 | M | ASY | 130 | 131 | 0 | Normal | 115 | Y | 1.2 | Flat | 1 |
916 | 57 | F | ATA | 130 | 236 | 0 | LVH | 174 | N | 0.0 | Flat | 1 |
917 | 38 | M | NAP | 138 | 175 | 0 | Normal | 173 | N | 0.0 | Up | 0 |
918 rows × 12 columns
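Before analysing the data, it can help to confirm that a loaded frame matches the attribute descriptions given earlier. The following is a minimal sketch and not part of the original notebook: the helper name `check_schema` and the two `EXPECTED_*` constants are hypothetical, and the small `sample` frame simply re-types the first five rows of the preview so the example runs without the CSV file.

```python
import pandas as pd

# Column names as listed in the attribute description.
EXPECTED_COLUMNS = [
    "Age", "Sex", "ChestPainType", "RestingBP", "Cholesterol", "FastingBS",
    "RestingECG", "MaxHR", "ExerciseAngina", "Oldpeak", "ST_Slope", "HeartDisease",
]

# Allowed values for each categorical or binary column, per the attribute description.
EXPECTED_CATEGORIES = {
    "Sex": {"M", "F"},
    "ChestPainType": {"TA", "ATA", "NAP", "ASY"},
    "FastingBS": {0, 1},
    "RestingECG": {"Normal", "ST", "LVH"},
    "ExerciseAngina": {"Y", "N"},
    "ST_Slope": {"Up", "Flat", "Down"},
    "HeartDisease": {0, 1},
}


def check_schema(frame):
    """Return a list of problems found; an empty list means nothing was flagged."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in frame.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    for column, allowed in EXPECTED_CATEGORIES.items():
        if column in frame.columns:
            unexpected = set(frame[column].dropna().unique()) - allowed
            if unexpected:
                problems.append(
                    f"{column}: unexpected values {sorted(unexpected, key=str)}"
                )
    return problems


# First five rows of the preview, re-typed by hand for illustration only.
sample = pd.DataFrame(
    [
        [40, "M", "ATA", 140, 289, 0, "Normal", 172, "N", 0.0, "Up", 0],
        [49, "F", "NAP", 160, 180, 0, "Normal", 156, "N", 1.0, "Flat", 1],
        [37, "M", "ATA", 130, 283, 0, "ST", 98, "N", 0.0, "Up", 0],
        [48, "F", "ASY", 138, 214, 0, "Normal", 108, "Y", 1.5, "Flat", 1],
        [54, "M", "NAP", 150, 195, 0, "Normal", 122, "N", 0.0, "Up", 0],
    ],
    columns=EXPECTED_COLUMNS,
)

print(check_schema(sample))
```

On the full dataset the same check would be run as `check_schema(df)`; any non-empty result points to a column that needs cleaning before modelling.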
First, we can split the patients into two groups, those with heart disease and those without, and observe the trends within each group. After identifying any attributes that differ noticeably between the groups, we can use k-nearest neighbors (KNN) with k-fold cross-validation to train and test a model that predicts whether a patient has heart disease based on their other attributes.
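The two steps described above can be sketched as follows. This is an illustrative outline under several assumptions, not the notebook's actual analysis: it assumes scikit-learn is available, uses only the numeric columns, standardizes them before KNN, and picks `n_neighbors=3` and two stratified folds purely so that the ten preview rows (re-typed here as `sample`) are enough to run it. Scores obtained on ten rows carry no meaning; on the full dataset one would pass `df` and choose more folds.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

columns = [
    "Age", "Sex", "ChestPainType", "RestingBP", "Cholesterol", "FastingBS",
    "RestingECG", "MaxHR", "ExerciseAngina", "Oldpeak", "ST_Slope", "HeartDisease",
]

# The ten rows visible in the preview, re-typed by hand for illustration only.
rows = [
    [40, "M", "ATA", 140, 289, 0, "Normal", 172, "N", 0.0, "Up", 0],
    [49, "F", "NAP", 160, 180, 0, "Normal", 156, "N", 1.0, "Flat", 1],
    [37, "M", "ATA", 130, 283, 0, "ST", 98, "N", 0.0, "Up", 0],
    [48, "F", "ASY", 138, 214, 0, "Normal", 108, "Y", 1.5, "Flat", 1],
    [54, "M", "NAP", 150, 195, 0, "Normal", 122, "N", 0.0, "Up", 0],
    [45, "M", "TA", 110, 264, 0, "Normal", 132, "N", 1.2, "Flat", 1],
    [68, "M", "ASY", 144, 193, 1, "Normal", 141, "N", 3.4, "Flat", 1],
    [57, "M", "ASY", 130, 131, 0, "Normal", 115, "Y", 1.2, "Flat", 1],
    [57, "F", "ATA", 130, 236, 0, "LVH", 174, "N", 0.0, "Flat", 1],
    [38, "M", "NAP", 138, 175, 0, "Normal", 173, "N", 0.0, "Up", 0],
]
sample = pd.DataFrame(rows, columns=columns)

# Assumption: restrict the sketch to numeric features so no encoding is needed.
numeric = ["Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak"]

# Step 1: compare the two outcome groups by their mean numeric values.
group_means = sample.groupby("HeartDisease")[numeric].mean()
print(group_means)

# Step 2: KNN evaluated with stratified k-fold cross-validation.
# Scaling happens inside the pipeline, so it is fitted on each training fold only.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
folds = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
scores = cross_val_score(model, sample[numeric], sample["HeartDisease"], cv=folds)
print("mean accuracy across folds:", scores.mean())
```

Scaling is included because KNN relies on distances, so a column with a large range such as Cholesterol would otherwise dominate a column with a small range such as Oldpeak.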