Describe and motivate a real-world problem where data science may provide helpful insights:
Catching cancer early can often save a patient's life. However, cancer is extremely hard to catch early, especially when no symptoms show. Cervical cancer in particular is very responsive to treatment if caught early. If it is caught too late, chances of survival plummet to about 17%. Read more about cervical cancer survival statistics here: https://www.cancer.net/cancer-types/cervical-cancer/statistics Data scientists can examine the traits of people previously diagnosed with cervical cancer to predict a patient's risk of developing it. These traits include number of pregnancies, smoking history, STDs, etc. In this project, we will examine each patient's medical history and determine which risk category each patient falls into (low, medium, or high).
"""Because of the way the data is set up, and for the end purpose of the project,
I loaded the data set into a list of dictionaries where each dictionary represents one
patient and their medical history.
"""
# open the file with a context manager so it is closed even if parsing fails
with open("kag_risk_factors_cervical_cancer.csv", "r") as file:
    data = []
    # split the header line into column names to use as dictionary keys
    headers = file.readline().strip().split(",")
    # go through each remaining line and build one dictionary per patient
    for line in file:
        pieces = line.strip().split(",")
        row_dict = {}
        # pair each value on the line with its column name
        for i in range(len(pieces)):
            row_dict[headers[i]] = pieces[i]
        data.append(row_dict)
# print the first 10 patients' info
for i in range(10):
    print(data[i])
{'Age': '18', 'Number of sexual partners': '4.0', 'First sexual intercourse': '15.0', 'Num of pregnancies': '1.0', 'Smokes': '0.0', 'Smokes (years)': '0.0', 'Smokes (packs/year)': '0.0', 'Hormonal Contraceptives': '0.0', 'Hormonal Contraceptives (years)': '0.0', 'IUD': '0.0', 'IUD (years)': '0.0', 'STDs': '0.0', 'STDs (number)': '0.0', 'STDs:condylomatosis': '0.0', 'STDs:cervical condylomatosis': '0.0', 'STDs:vaginal condylomatosis': '0.0', 'STDs:vulvo-perineal condylomatosis': '0.0', 'STDs:syphilis': '0.0', 'STDs:pelvic inflammatory disease': '0.0', 'STDs:genital herpes': '0.0', 'STDs:molluscum contagiosum': '0.0', 'STDs:AIDS': '0.0', 'STDs:HIV': '0.0', 'STDs:Hepatitis B': '0.0', 'STDs:HPV': '0.0', 'STDs: Number of diagnosis': '0', 'STDs: Time since first diagnosis': '?', 'STDs: Time since last diagnosis': '?', 'Dx:Cancer': '0', 'Dx:CIN': '0', 'Dx:HPV': '0', 'Dx': '0', 'Hinselmann': '0', 'Schiller': '0', 'Citology': '0', 'Biopsy': '0'} {'Age': '15', 'Number of sexual partners': '1.0', 'First sexual intercourse': '14.0', 'Num of pregnancies': '1.0', 'Smokes': '0.0', 'Smokes (years)': '0.0', 'Smokes (packs/year)': '0.0', 'Hormonal Contraceptives': '0.0', 'Hormonal Contraceptives (years)': '0.0', 'IUD': '0.0', 'IUD (years)': '0.0', 'STDs': '0.0', 'STDs (number)': '0.0', 'STDs:condylomatosis': '0.0', 'STDs:cervical condylomatosis': '0.0', 'STDs:vaginal condylomatosis': '0.0', 'STDs:vulvo-perineal condylomatosis': '0.0', 'STDs:syphilis': '0.0', 'STDs:pelvic inflammatory disease': '0.0', 'STDs:genital herpes': '0.0', 'STDs:molluscum contagiosum': '0.0', 'STDs:AIDS': '0.0', 'STDs:HIV': '0.0', 'STDs:Hepatitis B': '0.0', 'STDs:HPV': '0.0', 'STDs: Number of diagnosis': '0', 'STDs: Time since first diagnosis': '?', 'STDs: Time since last diagnosis': '?', 'Dx:Cancer': '0', 'Dx:CIN': '0', 'Dx:HPV': '0', 'Dx': '0', 'Hinselmann': '0', 'Schiller': '0', 'Citology': '0', 'Biopsy': '0'} {'Age': '34', 'Number of sexual partners': '1.0', 'First sexual intercourse': '?', 'Num of 
pregnancies': '1.0', 'Smokes': '0.0', 'Smokes (years)': '0.0', 'Smokes (packs/year)': '0.0', 'Hormonal Contraceptives': '0.0', 'Hormonal Contraceptives (years)': '0.0', 'IUD': '0.0', 'IUD (years)': '0.0', 'STDs': '0.0', 'STDs (number)': '0.0', 'STDs:condylomatosis': '0.0', 'STDs:cervical condylomatosis': '0.0', 'STDs:vaginal condylomatosis': '0.0', 'STDs:vulvo-perineal condylomatosis': '0.0', 'STDs:syphilis': '0.0', 'STDs:pelvic inflammatory disease': '0.0', 'STDs:genital herpes': '0.0', 'STDs:molluscum contagiosum': '0.0', 'STDs:AIDS': '0.0', 'STDs:HIV': '0.0', 'STDs:Hepatitis B': '0.0', 'STDs:HPV': '0.0', 'STDs: Number of diagnosis': '0', 'STDs: Time since first diagnosis': '?', 'STDs: Time since last diagnosis': '?', 'Dx:Cancer': '0', 'Dx:CIN': '0', 'Dx:HPV': '0', 'Dx': '0', 'Hinselmann': '0', 'Schiller': '0', 'Citology': '0', 'Biopsy': '0'} {'Age': '52', 'Number of sexual partners': '5.0', 'First sexual intercourse': '16.0', 'Num of pregnancies': '4.0', 'Smokes': '1.0', 'Smokes (years)': '37.0', 'Smokes (packs/year)': '37.0', 'Hormonal Contraceptives': '1.0', 'Hormonal Contraceptives (years)': '3.0', 'IUD': '0.0', 'IUD (years)': '0.0', 'STDs': '0.0', 'STDs (number)': '0.0', 'STDs:condylomatosis': '0.0', 'STDs:cervical condylomatosis': '0.0', 'STDs:vaginal condylomatosis': '0.0', 'STDs:vulvo-perineal condylomatosis': '0.0', 'STDs:syphilis': '0.0', 'STDs:pelvic inflammatory disease': '0.0', 'STDs:genital herpes': '0.0', 'STDs:molluscum contagiosum': '0.0', 'STDs:AIDS': '0.0', 'STDs:HIV': '0.0', 'STDs:Hepatitis B': '0.0', 'STDs:HPV': '0.0', 'STDs: Number of diagnosis': '0', 'STDs: Time since first diagnosis': '?', 'STDs: Time since last diagnosis': '?', 'Dx:Cancer': '1', 'Dx:CIN': '0', 'Dx:HPV': '1', 'Dx': '0', 'Hinselmann': '0', 'Schiller': '0', 'Citology': '0', 'Biopsy': '0'} {'Age': '46', 'Number of sexual partners': '3.0', 'First sexual intercourse': '21.0', 'Num of pregnancies': '4.0', 'Smokes': '0.0', 'Smokes (years)': '0.0', 'Smokes (packs/year)': '0.0', 
'Hormonal Contraceptives': '1.0', 'Hormonal Contraceptives (years)': '15.0', 'IUD': '0.0', 'IUD (years)': '0.0', 'STDs': '0.0', 'STDs (number)': '0.0', 'STDs:condylomatosis': '0.0', 'STDs:cervical condylomatosis': '0.0', 'STDs:vaginal condylomatosis': '0.0', 'STDs:vulvo-perineal condylomatosis': '0.0', 'STDs:syphilis': '0.0', 'STDs:pelvic inflammatory disease': '0.0', 'STDs:genital herpes': '0.0', 'STDs:molluscum contagiosum': '0.0', 'STDs:AIDS': '0.0', 'STDs:HIV': '0.0', 'STDs:Hepatitis B': '0.0', 'STDs:HPV': '0.0', 'STDs: Number of diagnosis': '0', 'STDs: Time since first diagnosis': '?', 'STDs: Time since last diagnosis': '?', 'Dx:Cancer': '0', 'Dx:CIN': '0', 'Dx:HPV': '0', 'Dx': '0', 'Hinselmann': '0', 'Schiller': '0', 'Citology': '0', 'Biopsy': '0'} {'Age': '42', 'Number of sexual partners': '3.0', 'First sexual intercourse': '23.0', 'Num of pregnancies': '2.0', 'Smokes': '0.0', 'Smokes (years)': '0.0', 'Smokes (packs/year)': '0.0', 'Hormonal Contraceptives': '0.0', 'Hormonal Contraceptives (years)': '0.0', 'IUD': '0.0', 'IUD (years)': '0.0', 'STDs': '0.0', 'STDs (number)': '0.0', 'STDs:condylomatosis': '0.0', 'STDs:cervical condylomatosis': '0.0', 'STDs:vaginal condylomatosis': '0.0', 'STDs:vulvo-perineal condylomatosis': '0.0', 'STDs:syphilis': '0.0', 'STDs:pelvic inflammatory disease': '0.0', 'STDs:genital herpes': '0.0', 'STDs:molluscum contagiosum': '0.0', 'STDs:AIDS': '0.0', 'STDs:HIV': '0.0', 'STDs:Hepatitis B': '0.0', 'STDs:HPV': '0.0', 'STDs: Number of diagnosis': '0', 'STDs: Time since first diagnosis': '?', 'STDs: Time since last diagnosis': '?', 'Dx:Cancer': '0', 'Dx:CIN': '0', 'Dx:HPV': '0', 'Dx': '0', 'Hinselmann': '0', 'Schiller': '0', 'Citology': '0', 'Biopsy': '0'} {'Age': '51', 'Number of sexual partners': '3.0', 'First sexual intercourse': '17.0', 'Num of pregnancies': '6.0', 'Smokes': '1.0', 'Smokes (years)': '34.0', 'Smokes (packs/year)': '3.4', 'Hormonal Contraceptives': '0.0', 'Hormonal Contraceptives (years)': '0.0', 'IUD': '1.0', 'IUD 
(years)': '7.0', 'STDs': '0.0', 'STDs (number)': '0.0', 'STDs:condylomatosis': '0.0', 'STDs:cervical condylomatosis': '0.0', 'STDs:vaginal condylomatosis': '0.0', 'STDs:vulvo-perineal condylomatosis': '0.0', 'STDs:syphilis': '0.0', 'STDs:pelvic inflammatory disease': '0.0', 'STDs:genital herpes': '0.0', 'STDs:molluscum contagiosum': '0.0', 'STDs:AIDS': '0.0', 'STDs:HIV': '0.0', 'STDs:Hepatitis B': '0.0', 'STDs:HPV': '0.0', 'STDs: Number of diagnosis': '0', 'STDs: Time since first diagnosis': '?', 'STDs: Time since last diagnosis': '?', 'Dx:Cancer': '0', 'Dx:CIN': '0', 'Dx:HPV': '0', 'Dx': '0', 'Hinselmann': '1', 'Schiller': '1', 'Citology': '0', 'Biopsy': '1'} {'Age': '26', 'Number of sexual partners': '1.0', 'First sexual intercourse': '26.0', 'Num of pregnancies': '3.0', 'Smokes': '0.0', 'Smokes (years)': '0.0', 'Smokes (packs/year)': '0.0', 'Hormonal Contraceptives': '1.0', 'Hormonal Contraceptives (years)': '2.0', 'IUD': '1.0', 'IUD (years)': '7.0', 'STDs': '0.0', 'STDs (number)': '0.0', 'STDs:condylomatosis': '0.0', 'STDs:cervical condylomatosis': '0.0', 'STDs:vaginal condylomatosis': '0.0', 'STDs:vulvo-perineal condylomatosis': '0.0', 'STDs:syphilis': '0.0', 'STDs:pelvic inflammatory disease': '0.0', 'STDs:genital herpes': '0.0', 'STDs:molluscum contagiosum': '0.0', 'STDs:AIDS': '0.0', 'STDs:HIV': '0.0', 'STDs:Hepatitis B': '0.0', 'STDs:HPV': '0.0', 'STDs: Number of diagnosis': '0', 'STDs: Time since first diagnosis': '?', 'STDs: Time since last diagnosis': '?', 'Dx:Cancer': '0', 'Dx:CIN': '0', 'Dx:HPV': '0', 'Dx': '0', 'Hinselmann': '0', 'Schiller': '0', 'Citology': '0', 'Biopsy': '0'} {'Age': '45', 'Number of sexual partners': '1.0', 'First sexual intercourse': '20.0', 'Num of pregnancies': '5.0', 'Smokes': '0.0', 'Smokes (years)': '0.0', 'Smokes (packs/year)': '0.0', 'Hormonal Contraceptives': '0.0', 'Hormonal Contraceptives (years)': '0.0', 'IUD': '0.0', 'IUD (years)': '0.0', 'STDs': '0.0', 'STDs (number)': '0.0', 'STDs:condylomatosis': '0.0', 
'STDs:cervical condylomatosis': '0.0', 'STDs:vaginal condylomatosis': '0.0', 'STDs:vulvo-perineal condylomatosis': '0.0', 'STDs:syphilis': '0.0', 'STDs:pelvic inflammatory disease': '0.0', 'STDs:genital herpes': '0.0', 'STDs:molluscum contagiosum': '0.0', 'STDs:AIDS': '0.0', 'STDs:HIV': '0.0', 'STDs:Hepatitis B': '0.0', 'STDs:HPV': '0.0', 'STDs: Number of diagnosis': '0', 'STDs: Time since first diagnosis': '?', 'STDs: Time since last diagnosis': '?', 'Dx:Cancer': '1', 'Dx:CIN': '0', 'Dx:HPV': '1', 'Dx': '1', 'Hinselmann': '0', 'Schiller': '0', 'Citology': '0', 'Biopsy': '0'} {'Age': '44', 'Number of sexual partners': '3.0', 'First sexual intercourse': '15.0', 'Num of pregnancies': '?', 'Smokes': '1.0', 'Smokes (years)': '1.266972909', 'Smokes (packs/year)': '2.8', 'Hormonal Contraceptives': '0.0', 'Hormonal Contraceptives (years)': '0.0', 'IUD': '?', 'IUD (years)': '?', 'STDs': '0.0', 'STDs (number)': '0.0', 'STDs:condylomatosis': '0.0', 'STDs:cervical condylomatosis': '0.0', 'STDs:vaginal condylomatosis': '0.0', 'STDs:vulvo-perineal condylomatosis': '0.0', 'STDs:syphilis': '0.0', 'STDs:pelvic inflammatory disease': '0.0', 'STDs:genital herpes': '0.0', 'STDs:molluscum contagiosum': '0.0', 'STDs:AIDS': '0.0', 'STDs:HIV': '0.0', 'STDs:Hepatitis B': '0.0', 'STDs:HPV': '0.0', 'STDs: Number of diagnosis': '0', 'STDs: Time since first diagnosis': '?', 'STDs: Time since last diagnosis': '?', 'Dx:Cancer': '0', 'Dx:CIN': '0', 'Dx:HPV': '0', 'Dx': '0', 'Hinselmann': '0', 'Schiller': '0', 'Citology': '0', 'Biopsy': '0'}
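The printed records show that every value is still a string and that missing entries appear as '?'. Before any modeling, these would likely need to be converted. The following is a minimal sketch of one way to do that, not necessarily the approach used later in this project; `clean_value` and `clean_row` are hypothetical helper names, and the sample record is hand-made to mirror the structure above.

```python
def clean_value(raw):
    """Convert a raw CSV string to a float, or None when the value is missing ('?')."""
    if raw == "?":
        return None
    return float(raw)


def clean_row(row):
    """Return a copy of a patient dictionary with numeric values in place of strings."""
    return {key: clean_value(value) for key, value in row.items()}


# hand-made sample that mirrors the structure of one patient record
sample = {"Age": "44", "Num of pregnancies": "?", "Smokes (packs/year)": "2.8"}
cleaned = clean_row(sample)
print(cleaned)
```

Applied to the full data set, this would be `[clean_row(row) for row in data]`. Representing missing values as `None` keeps them distinguishable from a genuine 0.0, so a later step can decide whether to drop or impute them.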
We will use a classification approach from machine learning to predict whether each patient is positive in the 'Dx:Cancer' column, given that patient's medical history. Based on our predictions, we will then draw conclusions about which traits put patients in a higher risk category.
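To make the idea of classifying patients into risk categories concrete, here is a rough sketch using a simple k-nearest-neighbors vote written in plain Python. This is only an illustration under stated assumptions: the text does not name a specific algorithm, the toy feature values and labels are made up, and the cut-offs between low, medium, and high are arbitrary placeholders.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def positive_fraction(train_X, train_y, query, k=3):
    """Fraction of the k nearest training patients labelled positive (1)."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: distance(pair[0], query))
    nearest_labels = [label for _, label in ranked[:k]]
    return sum(nearest_labels) / k


def risk_category(fraction):
    """Map a positive fraction to a category; the cut-offs are illustrative only."""
    if fraction < 1 / 3:
        return "low"
    if fraction < 2 / 3:
        return "medium"
    return "high"


# made-up toy features: [age, smokes (years)]; labels: 1 = positive, 0 = negative
train_X = [[20, 0], [22, 1], [25, 0], [50, 30], [55, 35], [48, 25]]
train_y = [0, 0, 0, 1, 1, 1]

young = risk_category(positive_fraction(train_X, train_y, [21, 0]))
older = risk_category(positive_fraction(train_X, train_y, [52, 30]))
print(young, older)
```

In practice the features would come from the cleaned patient records, the label from the chosen target column, and the model would be evaluated on held-out patients before any conclusions about risk traits are drawn.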