Heart Disease Prediction Model Proposal

Background

Heart disease is the leading cause of death worldwide. Data from 2019 attribute about 18.5 million deaths, roughly 33% of all deaths, to various cardiovascular diseases. That is about 51,000 deaths per day on average.

People with certain risk factors need a way to detect possible heart disease early, and such a model could be useful to doctors as well.

Here are some articles that expand on the seriousness of the issue:

Causes of Death Globally: https://ourworldindata.org/causes-of-death-treemap

Early Detection: https://acsd4u.com/2021/06/30/early-detection-of-heart-disease/

Dataset

Here is a glimpse of what the dataset looks like, along with its source.

The dataset represents 918 observations with 12 attributes.

The columns (attributes) are as follows:

Age: age of the patient [years]
Sex: sex of the patient [M: Male, F: Female]
ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
RestingBP: resting blood pressure [mm Hg]
Cholesterol: serum cholesterol [mg/dl]
FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
Oldpeak: oldpeak = ST [Numeric value measured in depression]
ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
HeartDisease: output class [1: heart disease, 0: Normal]

Citation: fedesoriano. (September 2021). Heart Failure Prediction Dataset. Retrieved [2/27/2023] from https://www.kaggle.com/fedesoriano/heart-failure-prediction.

In [1]:

import pandas as pd

df = pd.read_csv('hdprediction.csv')

df

Out[1]:

	Age	Sex	ChestPainType	RestingBP	Cholesterol	FastingBS	RestingECG	MaxHR	ExerciseAngina	Oldpeak	ST_Slope	HeartDisease
0	40	M	ATA	140	289	0	Normal	172	N	0.0	Up	0
1	49	F	NAP	160	180	0	Normal	156	N	1.0	Flat	1
2	37	M	ATA	130	283	0	ST	98	N	0.0	Up	0
3	48	F	ASY	138	214	0	Normal	108	Y	1.5	Flat	1
4	54	M	NAP	150	195	0	Normal	122	N	0.0	Up	0
...	...	...	...	...	...	...	...	...	...	...	...	...
913	45	M	TA	110	264	0	Normal	132	N	1.2	Flat	1
914	68	M	ASY	144	193	1	Normal	141	N	3.4	Flat	1
915	57	M	ASY	130	131	0	Normal	115	Y	1.2	Flat	1
916	57	F	ATA	130	236	0	LVH	174	N	0.0	Flat	1
917	38	M	NAP	138	175	0	Normal	173	N	0.0	Up	0

918 rows × 12 columns

First, we can split the patients into two groups, those with heart disease and those without, and observe the trends within each group. After identifying any significant attributes, we can use k-nearest neighbors (KNN) with k-fold cross-validation to train and test a model that predicts whether a patient has heart disease from their other attributes.
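The two steps described above could be sketched roughly as follows. This is a minimal sketch, not a final design: it assumes scikit-learn is available, and the helper names `summarize_by_label` and `knn_cv_accuracy`, the choice of 5 neighbors and 5 stratified folds, and the scaling and one-hot encoding steps are illustrative assumptions rather than decisions made in this proposal. The column names are taken from the attribute list above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Column names as listed in the dataset description.
TARGET = "HeartDisease"
CATEGORICAL = ["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope"]
NUMERIC = ["Age", "RestingBP", "Cholesterol", "FastingBS", "MaxHR", "Oldpeak"]


def summarize_by_label(df):
    """Mean of each numeric attribute, split by the HeartDisease label (hypothetical helper)."""
    return df.groupby(TARGET)[NUMERIC].mean()


def knn_cv_accuracy(df, n_neighbors=5, n_splits=5, seed=0):
    """Per-fold accuracy of a KNN classifier under stratified k-fold CV (hypothetical helper).

    Assumption: numeric columns are standardized and categorical columns are
    one-hot encoded, because KNN relies on distances between rows.
    """
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), NUMERIC),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ])
    model = Pipeline([
        ("preprocess", preprocess),
        ("knn", KNeighborsClassifier(n_neighbors=n_neighbors)),
    ])
    # Stratified folds keep the class balance similar in every fold.
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    features = df[CATEGORICAL + NUMERIC]
    return cross_val_score(model, features, df[TARGET], cv=folds, scoring="accuracy")
```

With the `df` loaded in the cell above, `summarize_by_label(df)` would show how the numeric attributes differ between the two groups, and `knn_cv_accuracy(df).mean()` would give an average accuracy across folds. Putting the preprocessing inside the pipeline means the scaler and encoder are refit on each training fold, so no information from the held-out fold leaks into training.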