BREAST CANCER FATALITY PREDICTION - MACHINE LEARNING PROJECT

Recently, I came across a news article saying that breast cancer survivors can pause treatment in order to pursue pregnancy. I found this intriguing and decided to look into datasets relating to breast cancer. I came across a promising dataset on Kaggle.

Link to dataset: https://www.kaggle.com/datasets/0248260fceaaaab93ceb231f0deb49f979a9ce4ed30f54260c8a18d9270bbcb0?resource=download

Link to article: https://abcnews.go.com/Health/video/breast-cancer-survivors-pause-treatments-babies-study-95763252

In [1]:
import matplotlib.pyplot as plt
from google.colab import files
import pandas as pd
In [2]:
df = pd.read_csv("BRCA 2.csv")
In [3]:
df
Out[3]:
Patient_ID Age Gender Protein1 Protein2 Protein3 Protein4 Tumour_Stage Histology ER status PR status HER2 status Surgery_type Date_of_Surgery Date_of_Last_Visit Patient_Status
0 TCGA-D8-A1XD 36.0 FEMALE 0.080353 0.42638 0.54715 0.273680 III Infiltrating Ductal Carcinoma Positive Positive Negative Modified Radical Mastectomy 15-Jan-17 19-Jun-17 Alive
1 TCGA-EW-A1OX 43.0 FEMALE -0.420320 0.57807 0.61447 -0.031505 II Mucinous Carcinoma Positive Positive Negative Lumpectomy 26-Apr-17 09-Nov-18 Dead
2 TCGA-A8-A079 69.0 FEMALE 0.213980 1.31140 -0.32747 -0.234260 III Infiltrating Ductal Carcinoma Positive Positive Negative Other 08-Sep-17 09-Jun-18 Alive
3 TCGA-D8-A1XR 56.0 FEMALE 0.345090 -0.21147 -0.19304 0.124270 II Infiltrating Ductal Carcinoma Positive Positive Negative Modified Radical Mastectomy 25-Jan-17 12-Jul-17 Alive
4 TCGA-BH-A0BF 56.0 FEMALE 0.221550 1.90680 0.52045 -0.311990 II Infiltrating Ductal Carcinoma Positive Positive Negative Other 06-May-17 27-Jun-19 Dead
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
336 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
337 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
338 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
339 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
340 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

341 rows × 16 columns
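
The tail of the output above (rows 336 to 340) is entirely NaN, so those rows carry no patient information. Here is a minimal sketch of how such rows could be dropped. It uses a small stand-in frame called `toy` (a hypothetical name) rather than the real file: three columns copied from the first two rows of the table, plus two empty rows added for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for df: two rows taken from the table above plus two empty rows.
toy = pd.DataFrame({
    "Patient_ID": ["TCGA-D8-A1XD", "TCGA-EW-A1OX", np.nan, np.nan],
    "Age": [36.0, 43.0, np.nan, np.nan],
    "Patient_Status": ["Alive", "Dead", np.nan, np.nan],
})

# how="all" removes a row only when every column in it is missing,
# so partially filled rows are kept for later inspection.
cleaned = toy.dropna(how="all").reset_index(drop=True)
print(cleaned.shape)
```

On the real data the same call would be `df.dropna(how="all")`. Whether partially filled rows also need handling is something I would check separately.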

In [7]:
# Named data_dictionary so that Python's built-in dict type is not shadowed.
data_dictionary = {
    "Patient_ID": "Unique identifier of the patient",
    "Age": "Age of the patient at time of record",
    "Gender": "Gender of the patient at time of record",
    "Protein1": "Expression level (undefined units)",
    "Protein2": "Expression level (undefined units)",
    "Protein3": "Expression level (undefined units)",
    "Protein4": "Expression level (undefined units)",
    "Tumour_Stage": "Stage I, II or III",
    "Histology": "Microscopic structure of tissues. Types: Infiltrating Ductal Carcinoma, Infiltrating Lobular Carcinoma, Mucinous Carcinoma",
    "ER status": "Negative/Positive",
    "PR status": "Negative/Positive",
    "HER2 status": "Negative/Positive",
    "Surgery_type": "Lumpectomy, Simple Mastectomy, Modified Radical Mastectomy, Other",
    "Date_of_Surgery": "Date when the surgery was performed (DD-Mon-YY)",
    "Date_of_Last_Visit": "Date of last visit (DD-Mon-YY); null if the patient did not visit again after the surgery",
    "Patient_Status": "Dead/Alive",
}
In [8]:
data_dictionary
Out[8]:
{'Patient_ID': 'Unique identifier of the patient',
 'Age': 'Age of the patient at time of record',
 'Gender': 'Gender of the patient at time of record',
 'Protein1': 'Expression level (undefined units)',
 'Protein2': 'Expression level (undefined units)',
 'Protein3': 'Expression level (undefined units)',
 'Protein4': 'Expression level (undefined units)',
 'Tumour_Stage': 'Stage I, II or III',
 'Histology': 'Microscopic structure of tissues. Types: Infiltrating Ductal Carcinoma, Infiltrating Lobular Carcinoma, Mucinous Carcinoma',
 'ER status': 'Negative/Positive',
 'PR status': 'Negative/Positive',
 'HER2 status': 'Negative/Positive',
 'Surgery_type': 'Lumpectomy, Simple Mastectomy, Modified Radical Mastectomy, Other',
 'Date_of_Surgery': 'Date when the surgery was performed (DD-Mon-YY)',
 'Date_of_Last_Visit': 'Date of last visit (DD-Mon-YY); null if the patient did not visit again after the surgery',
 'Patient_Status': 'Dead/Alive'}
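
The dates in the table appear as day, abbreviated month and two-digit year (for example 15-Jan-17), so they would need to be parsed before any follow-up time could be computed. Here is a rough sketch of that step, using two date pairs copied from the first rows of the table. The `followup_days` column is a hypothetical name I made up for illustration and is not part of the dataset.

```python
import pandas as pd

# Two surgery / last-visit pairs copied from the first rows of the table.
dates = pd.DataFrame({
    "Date_of_Surgery": ["15-Jan-17", "26-Apr-17"],
    "Date_of_Last_Visit": ["19-Jun-17", "09-Nov-18"],
})

# %d-%b-%y matches day, abbreviated English month name, two-digit year.
for col in ["Date_of_Surgery", "Date_of_Last_Visit"]:
    dates[col] = pd.to_datetime(dates[col], format="%d-%b-%y")

# Hypothetical derived column: days between surgery and the last visit.
dates["followup_days"] = (
    dates["Date_of_Last_Visit"] - dates["Date_of_Surgery"]
).dt.days
print(dates)
```

Passing an explicit `format` avoids pandas guessing the day/month order. Rows with a missing last visit would come out as NaT and need their own handling.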

I could possibly use machine learning to predict whether a patient will survive, based on characteristics such as age, protein expression levels, tumour stage, histology, ER/PR/HER2 status, and surgery type. Hopefully, we will also be able to figure out which type of surgery is most successful. I am really excited about the research I will be doing to understand the different types of surgery, histology, and protein, and how they affect patients in general.
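
To make the idea above a bit more concrete, here is one possible shape such a classifier could take. This is only a sketch under several assumptions: it uses scikit-learn, logistic regression is my own placeholder choice rather than a settled method, only four of the feature columns are used, and the `toy` frame holds made-up rows that just mimic the column layout (it is not real patient data).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows that only mimic the column layout; not real patient data.
toy = pd.DataFrame({
    "Age": [36, 43, 69, 56, 56, 61, 48, 72],
    "Protein1": [0.08, -0.42, 0.21, 0.35, 0.22, -0.10, 0.05, 0.30],
    "Tumour_Stage": ["III", "II", "III", "II", "II", "I", "II", "III"],
    "Surgery_type": ["Modified Radical Mastectomy", "Lumpectomy", "Other",
                     "Modified Radical Mastectomy", "Other", "Lumpectomy",
                     "Simple Mastectomy", "Other"],
    "Patient_Status": ["Alive", "Dead", "Alive", "Alive",
                       "Dead", "Alive", "Alive", "Dead"],
})

numeric = ["Age", "Protein1"]
categorical = ["Tumour_Stage", "Surgery_type"]

# Scale the numeric columns and one-hot encode the categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# Chain preprocessing and the classifier so both are fitted together.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X = toy[numeric + categorical]
y = toy["Patient_Status"]
model.fit(X, y)

# Predicting on the training rows only shows the output shape,
# it says nothing about real accuracy.
predictions = model.predict(X)
print(predictions)
```

With the real data, I would hold out a test set (for example with `train_test_split`) before judging any model. A coefficient on surgery type would also only show an association, not that one surgery causes better outcomes.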