Describe and motivate a real-world problem where data science may provide helpful insights:

Many times, catching cancer early can save a patient's life. However, cancer is extremely hard to catch early, especially when no symptoms show. Cervical cancer, to be specific, is very responsive to treatment if caught early. If it is caught too late, chances of survival plummet to about 17%. Read more about cervical cancer survival statistics here: https://www.cancer.net/cancer-types/cervical-cancer/statistics. Data scientists can examine the traits of people previously diagnosed with cervical cancer to predict a patient's risk of developing it. These traits include number of pregnancies, smoking history, STDs, etc. In this project, we will examine each patient's medical history and determine which risk category each patient falls into (low, medium, or high).

In [8]:
"""Because of the way the data is set up, and for the end purpose of the project,
I loaded the data set into a list of dictionaries, where each dictionary represents
one patient and their medical history.
"""
data = []

# `with` closes the file automatically, even if an error occurs while reading
with open("kag_risk_factors_cervical_cancer.csv", "r") as file:
    # split the header row so each column name can serve as a dictionary key
    headers = file.readline().strip().split(",")

    # split each remaining line and pair every value with its header
    for line in file:
        pieces = line.strip().split(",")
        data.append(dict(zip(headers, pieces)))

# print the first 10 patients' records
for patient in data[:10]:
    print(patient)
    
{'Age': '18', 'Number of sexual partners': '4.0', 'First sexual intercourse': '15.0', 'Num of pregnancies': '1.0', 'Smokes': '0.0', 'Smokes (years)': '0.0', 'Smokes (packs/year)': '0.0', 'Hormonal Contraceptives': '0.0', 'Hormonal Contraceptives (years)': '0.0', 'IUD': '0.0', 'IUD (years)': '0.0', 'STDs': '0.0', 'STDs (number)': '0.0', 'STDs:condylomatosis': '0.0', 'STDs:cervical condylomatosis': '0.0', 'STDs:vaginal condylomatosis': '0.0', 'STDs:vulvo-perineal condylomatosis': '0.0', 'STDs:syphilis': '0.0', 'STDs:pelvic inflammatory disease': '0.0', 'STDs:genital herpes': '0.0', 'STDs:molluscum contagiosum': '0.0', 'STDs:AIDS': '0.0', 'STDs:HIV': '0.0', 'STDs:Hepatitis B': '0.0', 'STDs:HPV': '0.0', 'STDs: Number of diagnosis': '0', 'STDs: Time since first diagnosis': '?', 'STDs: Time since last diagnosis': '?', 'Dx:Cancer': '0', 'Dx:CIN': '0', 'Dx:HPV': '0', 'Dx': '0', 'Hinselmann': '0', 'Schiller': '0', 'Citology': '0', 'Biopsy': '0'}
{'Age': '15', 'Number of sexual partners': '1.0', 'First sexual intercourse': '14.0', 'Num of pregnancies': '1.0', 'Smokes': '0.0', 'Smokes (years)': '0.0', 'Smokes (packs/year)': '0.0', 'Hormonal Contraceptives': '0.0', 'Hormonal Contraceptives (years)': '0.0', 'IUD': '0.0', 'IUD (years)': '0.0', 'STDs': '0.0', 'STDs (number)': '0.0', 'STDs:condylomatosis': '0.0', 'STDs:cervical condylomatosis': '0.0', 'STDs:vaginal condylomatosis': '0.0', 'STDs:vulvo-perineal condylomatosis': '0.0', 'STDs:syphilis': '0.0', 'STDs:pelvic inflammatory disease': '0.0', 'STDs:genital herpes': '0.0', 'STDs:molluscum contagiosum': '0.0', 'STDs:AIDS': '0.0', 'STDs:HIV': '0.0', 'STDs:Hepatitis B': '0.0', 'STDs:HPV': '0.0', 'STDs: Number of diagnosis': '0', 'STDs: Time since first diagnosis': '?', 'STDs: Time since last diagnosis': '?', 'Dx:Cancer': '0', 'Dx:CIN': '0', 'Dx:HPV': '0', 'Dx': '0', 'Hinselmann': '0', 'Schiller': '0', 'Citology': '0', 'Biopsy': '0'}
{'Age': '34', 'Number of sexual partners': '1.0', 'First sexual intercourse': '?', 'Num of pregnancies': '1.0', 'Smokes': '0.0', 'Smokes (years)': '0.0', 'Smokes (packs/year)': '0.0', 'Hormonal Contraceptives': '0.0', 'Hormonal Contraceptives (years)': '0.0', 'IUD': '0.0', 'IUD (years)': '0.0', 'STDs': '0.0', 'STDs (number)': '0.0', 'STDs:condylomatosis': '0.0', 'STDs:cervical condylomatosis': '0.0', 'STDs:vaginal condylomatosis': '0.0', 'STDs:vulvo-perineal condylomatosis': '0.0', 'STDs:syphilis': '0.0', 'STDs:pelvic inflammatory disease': '0.0', 'STDs:genital herpes': '0.0', 'STDs:molluscum contagiosum': '0.0', 'STDs:AIDS': '0.0', 'STDs:HIV': '0.0', 'STDs:Hepatitis B': '0.0', 'STDs:HPV': '0.0', 'STDs: Number of diagnosis': '0', 'STDs: Time since first diagnosis': '?', 'STDs: Time since last diagnosis': '?', 'Dx:Cancer': '0', 'Dx:CIN': '0', 'Dx:HPV': '0', 'Dx': '0', 'Hinselmann': '0', 'Schiller': '0', 'Citology': '0', 'Biopsy': '0'}
{'Age': '52', 'Number of sexual partners': '5.0', 'First sexual intercourse': '16.0', 'Num of pregnancies': '4.0', 'Smokes': '1.0', 'Smokes (years)': '37.0', 'Smokes (packs/year)': '37.0', 'Hormonal Contraceptives': '1.0', 'Hormonal Contraceptives (years)': '3.0', 'IUD': '0.0', 'IUD (years)': '0.0', 'STDs': '0.0', 'STDs (number)': '0.0', 'STDs:condylomatosis': '0.0', 'STDs:cervical condylomatosis': '0.0', 'STDs:vaginal condylomatosis': '0.0', 'STDs:vulvo-perineal condylomatosis': '0.0', 'STDs:syphilis': '0.0', 'STDs:pelvic inflammatory disease': '0.0', 'STDs:genital herpes': '0.0', 'STDs:molluscum contagiosum': '0.0', 'STDs:AIDS': '0.0', 'STDs:HIV': '0.0', 'STDs:Hepatitis B': '0.0', 'STDs:HPV': '0.0', 'STDs: Number of diagnosis': '0', 'STDs: Time since first diagnosis': '?', 'STDs: Time since last diagnosis': '?', 'Dx:Cancer': '1', 'Dx:CIN': '0', 'Dx:HPV': '1', 'Dx': '0', 'Hinselmann': '0', 'Schiller': '0', 'Citology': '0', 'Biopsy': '0'}
{'Age': '46', 'Number of sexual partners': '3.0', 'First sexual intercourse': '21.0', 'Num of pregnancies': '4.0', 'Smokes': '0.0', 'Smokes (years)': '0.0', 'Smokes (packs/year)': '0.0', 'Hormonal Contraceptives': '1.0', 'Hormonal Contraceptives (years)': '15.0', 'IUD': '0.0', 'IUD (years)': '0.0', 'STDs': '0.0', 'STDs (number)': '0.0', 'STDs:condylomatosis': '0.0', 'STDs:cervical condylomatosis': '0.0', 'STDs:vaginal condylomatosis': '0.0', 'STDs:vulvo-perineal condylomatosis': '0.0', 'STDs:syphilis': '0.0', 'STDs:pelvic inflammatory disease': '0.0', 'STDs:genital herpes': '0.0', 'STDs:molluscum contagiosum': '0.0', 'STDs:AIDS': '0.0', 'STDs:HIV': '0.0', 'STDs:Hepatitis B': '0.0', 'STDs:HPV': '0.0', 'STDs: Number of diagnosis': '0', 'STDs: Time since first diagnosis': '?', 'STDs: Time since last diagnosis': '?', 'Dx:Cancer': '0', 'Dx:CIN': '0', 'Dx:HPV': '0', 'Dx': '0', 'Hinselmann': '0', 'Schiller': '0', 'Citology': '0', 'Biopsy': '0'}
{'Age': '42', 'Number of sexual partners': '3.0', 'First sexual intercourse': '23.0', 'Num of pregnancies': '2.0', 'Smokes': '0.0', 'Smokes (years)': '0.0', 'Smokes (packs/year)': '0.0', 'Hormonal Contraceptives': '0.0', 'Hormonal Contraceptives (years)': '0.0', 'IUD': '0.0', 'IUD (years)': '0.0', 'STDs': '0.0', 'STDs (number)': '0.0', 'STDs:condylomatosis': '0.0', 'STDs:cervical condylomatosis': '0.0', 'STDs:vaginal condylomatosis': '0.0', 'STDs:vulvo-perineal condylomatosis': '0.0', 'STDs:syphilis': '0.0', 'STDs:pelvic inflammatory disease': '0.0', 'STDs:genital herpes': '0.0', 'STDs:molluscum contagiosum': '0.0', 'STDs:AIDS': '0.0', 'STDs:HIV': '0.0', 'STDs:Hepatitis B': '0.0', 'STDs:HPV': '0.0', 'STDs: Number of diagnosis': '0', 'STDs: Time since first diagnosis': '?', 'STDs: Time since last diagnosis': '?', 'Dx:Cancer': '0', 'Dx:CIN': '0', 'Dx:HPV': '0', 'Dx': '0', 'Hinselmann': '0', 'Schiller': '0', 'Citology': '0', 'Biopsy': '0'}
{'Age': '51', 'Number of sexual partners': '3.0', 'First sexual intercourse': '17.0', 'Num of pregnancies': '6.0', 'Smokes': '1.0', 'Smokes (years)': '34.0', 'Smokes (packs/year)': '3.4', 'Hormonal Contraceptives': '0.0', 'Hormonal Contraceptives (years)': '0.0', 'IUD': '1.0', 'IUD (years)': '7.0', 'STDs': '0.0', 'STDs (number)': '0.0', 'STDs:condylomatosis': '0.0', 'STDs:cervical condylomatosis': '0.0', 'STDs:vaginal condylomatosis': '0.0', 'STDs:vulvo-perineal condylomatosis': '0.0', 'STDs:syphilis': '0.0', 'STDs:pelvic inflammatory disease': '0.0', 'STDs:genital herpes': '0.0', 'STDs:molluscum contagiosum': '0.0', 'STDs:AIDS': '0.0', 'STDs:HIV': '0.0', 'STDs:Hepatitis B': '0.0', 'STDs:HPV': '0.0', 'STDs: Number of diagnosis': '0', 'STDs: Time since first diagnosis': '?', 'STDs: Time since last diagnosis': '?', 'Dx:Cancer': '0', 'Dx:CIN': '0', 'Dx:HPV': '0', 'Dx': '0', 'Hinselmann': '1', 'Schiller': '1', 'Citology': '0', 'Biopsy': '1'}
{'Age': '26', 'Number of sexual partners': '1.0', 'First sexual intercourse': '26.0', 'Num of pregnancies': '3.0', 'Smokes': '0.0', 'Smokes (years)': '0.0', 'Smokes (packs/year)': '0.0', 'Hormonal Contraceptives': '1.0', 'Hormonal Contraceptives (years)': '2.0', 'IUD': '1.0', 'IUD (years)': '7.0', 'STDs': '0.0', 'STDs (number)': '0.0', 'STDs:condylomatosis': '0.0', 'STDs:cervical condylomatosis': '0.0', 'STDs:vaginal condylomatosis': '0.0', 'STDs:vulvo-perineal condylomatosis': '0.0', 'STDs:syphilis': '0.0', 'STDs:pelvic inflammatory disease': '0.0', 'STDs:genital herpes': '0.0', 'STDs:molluscum contagiosum': '0.0', 'STDs:AIDS': '0.0', 'STDs:HIV': '0.0', 'STDs:Hepatitis B': '0.0', 'STDs:HPV': '0.0', 'STDs: Number of diagnosis': '0', 'STDs: Time since first diagnosis': '?', 'STDs: Time since last diagnosis': '?', 'Dx:Cancer': '0', 'Dx:CIN': '0', 'Dx:HPV': '0', 'Dx': '0', 'Hinselmann': '0', 'Schiller': '0', 'Citology': '0', 'Biopsy': '0'}
{'Age': '45', 'Number of sexual partners': '1.0', 'First sexual intercourse': '20.0', 'Num of pregnancies': '5.0', 'Smokes': '0.0', 'Smokes (years)': '0.0', 'Smokes (packs/year)': '0.0', 'Hormonal Contraceptives': '0.0', 'Hormonal Contraceptives (years)': '0.0', 'IUD': '0.0', 'IUD (years)': '0.0', 'STDs': '0.0', 'STDs (number)': '0.0', 'STDs:condylomatosis': '0.0', 'STDs:cervical condylomatosis': '0.0', 'STDs:vaginal condylomatosis': '0.0', 'STDs:vulvo-perineal condylomatosis': '0.0', 'STDs:syphilis': '0.0', 'STDs:pelvic inflammatory disease': '0.0', 'STDs:genital herpes': '0.0', 'STDs:molluscum contagiosum': '0.0', 'STDs:AIDS': '0.0', 'STDs:HIV': '0.0', 'STDs:Hepatitis B': '0.0', 'STDs:HPV': '0.0', 'STDs: Number of diagnosis': '0', 'STDs: Time since first diagnosis': '?', 'STDs: Time since last diagnosis': '?', 'Dx:Cancer': '1', 'Dx:CIN': '0', 'Dx:HPV': '1', 'Dx': '1', 'Hinselmann': '0', 'Schiller': '0', 'Citology': '0', 'Biopsy': '0'}
{'Age': '44', 'Number of sexual partners': '3.0', 'First sexual intercourse': '15.0', 'Num of pregnancies': '?', 'Smokes': '1.0', 'Smokes (years)': '1.266972909', 'Smokes (packs/year)': '2.8', 'Hormonal Contraceptives': '0.0', 'Hormonal Contraceptives (years)': '0.0', 'IUD': '?', 'IUD (years)': '?', 'STDs': '0.0', 'STDs (number)': '0.0', 'STDs:condylomatosis': '0.0', 'STDs:cervical condylomatosis': '0.0', 'STDs:vaginal condylomatosis': '0.0', 'STDs:vulvo-perineal condylomatosis': '0.0', 'STDs:syphilis': '0.0', 'STDs:pelvic inflammatory disease': '0.0', 'STDs:genital herpes': '0.0', 'STDs:molluscum contagiosum': '0.0', 'STDs:AIDS': '0.0', 'STDs:HIV': '0.0', 'STDs:Hepatitis B': '0.0', 'STDs:HPV': '0.0', 'STDs: Number of diagnosis': '0', 'STDs: Time since first diagnosis': '?', 'STDs: Time since last diagnosis': '?', 'Dx:Cancer': '0', 'Dx:CIN': '0', 'Dx:HPV': '0', 'Dx': '0', 'Hinselmann': '0', 'Schiller': '0', 'Citology': '0', 'Biopsy': '0'}
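
The output above shows that missing values are stored as the string '?'. Before modeling, it can help to see how many records are missing each column. The following is a minimal sketch of that check; the helper name `count_missing` and the abbreviated `sample` records are assumptions made for illustration (the values are copied from the printed rows), and in the notebook the function would be called on `data` instead.

```python
from collections import Counter

def count_missing(records, marker="?"):
    """Count, per column, how many records hold the missing-value marker."""
    missing = Counter()
    for record in records:
        for column, value in record.items():
            if value == marker:
                missing[column] += 1
    return missing

# Two abbreviated records in the same shape as the printed output
# (hypothetical subset of columns, values copied from the rows above).
sample = [
    {"Age": "34", "First sexual intercourse": "?",
     "STDs: Time since first diagnosis": "?"},
    {"Age": "44", "First sexual intercourse": "15.0",
     "STDs: Time since first diagnosis": "?"},
]

# List columns from most to least missing.
for column, n in count_missing(sample).most_common():
    print(column, n)
```

Columns that are missing for most patients, as 'STDs: Time since first diagnosis' appears to be in the first ten rows, may be candidates for dropping rather than filling in.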

Data Dictionary

  • Age: Age of patient
  • Number of sexual partners: Number of sexual partners patient has had
  • First sexual intercourse: age of first sexual intercourse
  • Num of pregnancies: number of pregnancies
  • Smokes: 1 if patient smokes, 0 if patient doesn't smoke
  • Smokes (years): number of years patient has been smoking
  • Smokes (packs/year): number of packs patient smokes per year
  • Hormonal Contraceptives: 1 if patient uses hormonal contraceptives, 0 if they don't
  • Hormonal Contraceptives (years): years patient has been on hormonal contraceptives
  • IUD: 1 if patient has an IUD implant, 0 if they don't
  • IUD (years): years patient has had IUD implant
  • STDs: 1 for if patient has STDs, 0 if patient doesn't have STDs
  • STDs (number): number of STDs present
  • STDs:condylomatosis: 1 if patient has condylomatosis, 0 if they don't
  • STDs:cervical condylomatosis: 1 if patient has cervical condylomatosis, 0 if they don't
  • STDs:vaginal condylomatosis: 1 if patient has vaginal condylomatosis, 0 if they don't
  • STDs:vulvo-perineal condylomatosis: 1 if patient has vulvo-perineal condylomatosis, 0 if they don't
  • STDs:syphilis: 1 if patient has syphilis, 0 if they don't
  • STDs:pelvic inflammatory disease: 1 if patient has PID, 0 if they don't
  • STDs:genital herpes: 1 if patient has genital herpes, 0 if they don't
  • STDs:molluscum contagiosum: 1 if present, 0 if not
  • STDs:AIDS: 1 if present, 0 if not
  • STDs:HIV: 1 if present, 0 if not
  • STDs:Hepatitis B: 1 if present, 0 if not
  • STDs:HPV: 1 if present, 0 if not
  • STDs: Number of diagnosis: total number of STDs diagnosed in patient
  • STDs: Time since first diagnosis: years since first diagnosis
  • STDs: Time since last diagnosis: years since last diagnosis
  • Dx:Cancer: 1 for positive for cancer, 0 if not positive for cancer
  • Dx:CIN: 1 for positive for Cervical intraepithelial neoplasia, 0 if not
  • Dx:HPV: 1 for HPV present, 0 if not
  • Dx: tumor profiling test, 1 for performed, 0 if not
  • Hinselmann: result of Hinselmann's test (colposcopy), 1 for positive, 0 if not
  • Schiller: result of Schiller's iodine test, 1 for positive, 0 if not
  • Citology: result of cytology (Pap smear), 1 for positive, 0 if not
  • Biopsy: result of biopsy, 1 for positive, 0 if not
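
Every value loaded from the CSV is still a string, while the data dictionary describes numeric flags, counts, and durations. A possible cleaning step, sketched below, converts each value to a float and maps the '?' marker to `None`. The helper names `to_number` and `clean_record` are assumptions for illustration, and the `sample` record is an abbreviated copy of the tenth printed patient.

```python
def to_number(value, marker="?"):
    """Convert a CSV string to a float, or None when the value is missing."""
    if value == marker or value == "":
        return None
    return float(value)

def clean_record(record):
    """Return a copy of one patient record with numeric (or None) values."""
    return {column: to_number(value) for column, value in record.items()}

# Abbreviated record copied from the tenth patient printed earlier.
sample = {
    "Age": "44",
    "Num of pregnancies": "?",
    "Smokes": "1.0",
    "Smokes (packs/year)": "2.8",
}

cleaned = clean_record(sample)
print(cleaned)
```

Keeping missing values as `None` rather than 0 avoids confusing "unknown" with "no", which matters for the 0/1 flag columns.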

We will use a classification approach in ML to predict whether each patient is positive in the 'Dx:Cancer' column, given that patient's medical history. Based on our predictions, we will then draw conclusions about which traits put patients in a higher risk category.
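
To make the classification idea concrete, here is a minimal sketch of a k-nearest-neighbors classifier written in plain Python. It is an illustration only and not necessarily the model this project uses. The training rows are the ten patients printed earlier, restricted to three columns (Age, Number of sexual partners, Smokes) with Dx:Cancer as the label. The function name `knn_predict`, the choice of features, and the default `k=3` are all assumptions.

```python
import math

# (features, label) pairs copied from the ten printed patients.
# Features: Age, Number of sexual partners, Smokes. Label: Dx:Cancer.
train = [
    ([18.0, 4.0, 0.0], 0),
    ([15.0, 1.0, 0.0], 0),
    ([34.0, 1.0, 0.0], 0),
    ([52.0, 5.0, 1.0], 1),
    ([46.0, 3.0, 0.0], 0),
    ([42.0, 3.0, 0.0], 0),
    ([51.0, 3.0, 1.0], 0),
    ([26.0, 1.0, 0.0], 0),
    ([45.0, 1.0, 0.0], 1),
    ([44.0, 3.0, 1.0], 0),
]

def knn_predict(train, query, k=3):
    """Return the majority label among the k training rows closest to query."""
    # Sort training rows by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
    votes = [label for _, label in by_distance[:k]]
    # Predict 1 only when strictly more than half of the neighbors are positive.
    return 1 if sum(votes) * 2 > len(votes) else 0

# Hypothetical new patient: age 16, 2 partners, non-smoker.
print(knn_predict(train, [16.0, 2.0, 0.0]))
```

Ten rows are far too few to draw any conclusion, and because the features are not scaled, Age dominates the distance. A real run would use the full cleaned data set, hold out a test set, and scale the features first.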