Breast Cancer Identifier Proposal

Description, Motivation, and Impact

Breast cancer, one of the most common cancers among women, is a disease in which cells in the breast grow and divide uncontrollably, forming a tumor. It is the leading cause of cancer death in women aged 35 to 54, and the second leading cause of cancer death among women of all ages. Like other cancers, breast cancer can also spread into the tissue surrounding the breast and to other parts of the body, where it forms new tumors. Because of its own malignancy and the additional health complications it can cause, breast cancer must be identified and treated as early as possible.

The goal of this project is to identify the features most commonly associated with breast cancer, and to use the relationships among those features to predict whether tumors in future cases are malignant.

If successful, the project will produce a classifier that identifies malignant tumors, which indicate breast cancer, with high accuracy, sensitivity, and specificity. Such a classifier would help determine whether a subject has breast cancer based on physical measurements such as radius, perimeter, and concavity.

The classifier must be accurate, sensitive, and specific, as all three properties are critical in identifying breast cancer. Accuracy, the share of all predictions that are correct, is essential overall. Sensitivity is crucial because it represents how often a tumor was predicted as malignant given that it truly was malignant. A classifier lacking sensitivity could falsely identify a subject as cancer free, leading the subject to forgo treatment that is actually needed. This would lead to severe health problems, as the cancer would go untreated. Specificity is also very important, as it represents how often a tumor was predicted as benign given that it truly was benign. A classifier lacking specificity could falsely identify a subject as having cancer, leading the subject to pursue unneeded treatment. This could lead to unnecessary financial expenses, as well as an unnecessary emotional burden for the cancer-free subject who was wrongly diagnosed.
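As a rough illustration of these three measures, the sketch below computes them from counts of true and false predictions, treating malignant (M) as the positive class. Here sensitivity is TP / (TP + FN) and specificity is TN / (TN + FP). The function names are hypothetical, and the labels in `y_true` and `y_pred` are made up for illustration; they are not results from the dataset.

```python
def confusion_counts(y_true, y_pred, positive="M"):
    """Count true/false positives and negatives, with `positive` as the malignant label."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # malignant, predicted malignant
            else:
                fp += 1  # benign, predicted malignant
        else:
            if truth == positive:
                fn += 1  # malignant, predicted benign
            else:
                tn += 1  # benign, predicted benign
    return tp, fp, tn, fn


def evaluate(y_true, y_pred, positive="M"):
    """Return accuracy, sensitivity, and specificity.

    Assumes both classes appear in y_true, so neither denominator is zero.
    """
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # malignant cases that were caught
        "specificity": tn / (tn + fp),  # benign cases that were cleared
    }


# Made-up labels for illustration only (not taken from the dataset).
y_true = ["M", "M", "M", "B", "B", "B", "B", "B"]
y_pred = ["M", "M", "B", "B", "B", "B", "M", "B"]

metrics = evaluate(y_true, y_pred)
print(metrics)
```

In this toy example one malignant case is missed and one benign case is flagged, so sensitivity is 2/3 and specificity is 4/5 even though accuracy is 0.75. This is why the three measures need to be reported separately.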

Dataset

We will use a Kaggle Breast Cancer Dataset, which records the following columns for each subject:

id
diagnosis
radius_mean
texture_mean
perimeter_mean
area_mean
smoothness_mean
compactness_mean
concavity_mean
concave points_mean
symmetry_mean
fractal_dimension_mean
radius_se
texture_se
perimeter_se
area_se
smoothness_se
compactness_se
concavity_se
concave points_se
symmetry_se
fractal_dimension_se
radius_worst
texture_worst
perimeter_worst
area_worst
smoothness_worst
compactness_worst
concavity_worst
concave points_worst
symmetry_worst
fractal_dimension_worst

Data Description

ID represents the patient ID
Diagnosis represents whether the tumor was malignant or benign (cancerous or non-cancerous)
Radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension are all physical measurements
Each physical measurement has a mean, standard error (SE), and worst (largest) value

In [1]:

# Load the dataset and preview the first five rows
import pandas as pd

cancer_df = pd.read_csv('breast-cancer.csv')

cancer_df.head()

Out[1]:

	id	diagnosis	radius_mean	texture_mean	perimeter_mean	area_mean	smoothness_mean	compactness_mean	concavity_mean	concave points_mean	...	radius_worst	texture_worst	perimeter_worst	area_worst	smoothness_worst	compactness_worst	concavity_worst	concave points_worst	symmetry_worst	fractal_dimension_worst
0	842302	M	17.99	10.38	122.80	1001.0	0.11840	0.27760	0.3001	0.14710	...	25.38	17.33	184.60	2019.0	0.1622	0.6656	0.7119	0.2654	0.4601	0.11890
1	842517	M	20.57	17.77	132.90	1326.0	0.08474	0.07864	0.0869	0.07017	...	24.99	23.41	158.80	1956.0	0.1238	0.1866	0.2416	0.1860	0.2750	0.08902
2	84300903	M	19.69	21.25	130.00	1203.0	0.10960	0.15990	0.1974	0.12790	...	23.57	25.53	152.50	1709.0	0.1444	0.4245	0.4504	0.2430	0.3613	0.08758
3	84348301	M	11.42	20.38	77.58	386.1	0.14250	0.28390	0.2414	0.10520	...	14.91	26.50	98.87	567.7	0.2098	0.8663	0.6869	0.2575	0.6638	0.17300
4	84358402	M	20.29	14.34	135.10	1297.0	0.10030	0.13280	0.1980	0.10430	...	22.54	16.67	152.20	1575.0	0.1374	0.2050	0.4000	0.1625	0.2364	0.07678

5 rows × 32 columns

This project will use the data above to identify whether or not a subject has breast cancer.

Method

This project will use machine learning to build a K-Nearest Neighbors (KNN) classifier that predicts whether a tumor is malignant or benign (cancerous or non-cancerous). Each subject will be represented as a vector of physical measurements, labeled with that subject's known diagnosis. The classifier will be trained on these labeled vectors. To classify a new subject, it will compare that subject's vector with the training vectors, find the k most similar ones (the nearest neighbors), and predict the diagnosis that is most common among them. Based on this predicted diagnosis (malignant or benign), a conclusion can be made about whether or not the subject has breast cancer.
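The K-Nearest Neighbors idea can be sketched with only the Python standard library. This is a minimal illustration under stated assumptions, not the project's implementation: the helper names (`fit_scaler`, `scale`, `knn_predict`), the choice of k = 3, Euclidean distance, and the standardization step are all assumptions. The three M rows reuse the `radius_mean` and `texture_mean` values from the table above; the three B rows and both query points are invented for illustration.

```python
import math
from collections import Counter


def fit_scaler(rows):
    """Return per-column means and standard deviations of the training rows."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0  # avoid dividing by 0
        for col, m in zip(columns, means)
    ]
    return means, stds


def scale(row, means, stds):
    """Standardize one row using the training means and standard deviations."""
    return [(v - m) / s for v, m, s in zip(row, means, stds)]


def knn_predict(train_X, train_y, query, k=3):
    """Predict the most common label among the k training vectors nearest to `query`."""
    by_distance = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# (radius_mean, texture_mean) pairs. M rows come from the table preview;
# B rows are invented for illustration only.
train_rows = [
    (17.99, 10.38), (20.57, 17.77), (19.69, 21.25),  # malignant
    (12.00, 15.00), (11.50, 18.00), (13.00, 14.00),  # benign (hypothetical)
]
train_labels = ["M", "M", "M", "B", "B", "B"]

means, stds = fit_scaler(train_rows)
train_X = [scale(row, means, stds) for row in train_rows]

# Two hypothetical new subjects, scaled with the training statistics.
pred_a = knn_predict(train_X, train_labels, scale((18.5, 19.0), means, stds))
pred_b = knn_predict(train_X, train_labels, scale((12.2, 16.0), means, stds))
print(pred_a, pred_b)
```

With these toy values the first query is predicted as M and the second as B. The standardization step is included because the measurements sit on very different scales (area values in the hundreds to thousands, smoothness values around 0.1), so unscaled distances would be dominated by the largest-valued features.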