Breast Cancer Identifier Proposal

Description, Motivation, and Impact

Breast cancer, one of the most common cancers among women, is a disease in which cells in the breast grow and divide uncontrollably, forming a tumor. It is the leading cause of cancer death in women aged 35 to 54, and the second leading cause of cancer death among women of all ages. Like other cancers, breast cancer can also spread into neighboring tissue and to other parts of the body, producing new tumors. Because of its own malignancy and the additional health complications it can cause, breast cancer must be identified and treated as early as possible.

The goal of this project is to identify the features most common in those with breast cancer, and to use the relationships among those features to predict whether tumors in future cases are malignant.

If successful, the project will result in a classifier that identifies malignant tumors, which indicate breast cancer, with high accuracy, sensitivity, and specificity. Such a predictor will help determine whether a subject has breast cancer based on physical measurements such as radius, perimeter, concavity, and more.

The classifier must be accurate, sensitive, and specific, as all three are critical in identifying breast cancer. Accuracy, the share of all predictions that are correct, is essential. Sensitivity is also crucial, as it represents how often a tumor was predicted as malignant given that it truly was malignant. If the classifier lacked sensitivity, it could falsely identify a subject as cancer-free, leading the subject to forgo treatment that is actually needed. This would lead to severe health problems, as the cancer would go untreated. Specificity is equally important, as it represents how often a tumor was predicted as benign given that it truly was benign. If the classifier lacked specificity, it could falsely identify a subject as having cancer, leading the subject to pursue unneeded treatment. This could lead to unnecessary financial expenses, as well as an unnecessary emotional burden for the cancer-free subject who was wrongly diagnosed.
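The three measures described above can be computed from the counts of correct and incorrect predictions. The following is a minimal sketch in plain Python, not the project's actual evaluation code: the function names and the example labels are hypothetical, and it assumes malignant tumors are labelled "M" and treated as the positive class.

```python
def confusion_counts(y_true, y_pred, positive="M"):
    """Count true/false positives and negatives, with `positive` as malignant."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1  # malignant, predicted malignant
        elif truth == positive:
            fn += 1  # malignant, predicted benign (missed cancer)
        elif pred == positive:
            fp += 1  # benign, predicted malignant (false alarm)
        else:
            tn += 1  # benign, predicted benign
    return tp, fp, tn, fn


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(y_true, y_pred, positive="M"):
    """Return accuracy, sensitivity, and specificity as a dictionary."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": _ratio(tp, tp + fn),  # P(predicted M | truly M)
        "specificity": _ratio(tn, tn + fp),  # P(predicted B | truly B)
    }


# Made-up labels for illustration only, not taken from the dataset
y_true = ["M", "M", "M", "B", "B", "B", "B", "B"]
y_pred = ["M", "M", "B", "B", "B", "B", "M", "B"]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this made-up example one malignant case is missed and one benign case is flagged, so sensitivity is 2/3 and specificity is 4/5 even though accuracy is 6/8. This is why accuracy alone is not enough to judge the classifier.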

Dataset

We will use a Kaggle Breast Cancer Dataset to observe the following features for each subject:

  • id
  • diagnosis
  • radius_mean
  • texture_mean
  • perimeter_mean
  • area_mean
  • smoothness_mean
  • compactness_mean
  • concavity_mean
  • concave points_mean
  • symmetry_mean
  • fractal_dimension_mean
  • radius_se
  • texture_se
  • perimeter_se
  • area_se
  • smoothness_se
  • compactness_se
  • concavity_se
  • concave points_se
  • symmetry_se
  • fractal_dimension_se
  • radius_worst
  • texture_worst
  • perimeter_worst
  • area_worst
  • smoothness_worst
  • compactness_worst
  • concavity_worst
  • concave points_worst
  • symmetry_worst
  • fractal_dimension_worst

Data Description

  • ID represents the patient ID
  • Diagnosis represents whether the tumor was malignant or benign (cancerous or non-cancerous)
  • Radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension are all physical measurements
  • Each physical measurement has a mean, standard error (SE), and worst value
In [1]:
# Load the dataset into a DataFrame
import pandas as pd

cancer_df = pd.read_csv('breast-cancer.csv')

cancer_df.head()
Out[1]:
id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
0 842302 M 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 ... 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
1 842517 M 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 ... 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902
2 84300903 M 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 ... 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758
3 84348301 M 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 ... 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300
4 84358402 M 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 ... 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678

5 rows × 32 columns

This project will use the data above to identify whether or not a subject has breast cancer.
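Before a classifier can use the table, the identifier column and the label need to be separated from the numeric measurements. The sketch below shows one way this might be done with pandas. The helper name `prepare_features` and the toy table are hypothetical, and the encoding assumes that "M" marks malignant and "B" marks benign (only "M" appears in the rows shown above).

```python
import pandas as pd


def prepare_features(df):
    """Split the table into a numeric feature matrix X and binary labels y."""
    # Assumption: 'M' = malignant (1), 'B' = benign (0)
    y = df["diagnosis"].map({"M": 1, "B": 0})
    # The patient id carries no diagnostic information, so it is dropped
    X = df.drop(columns=["id", "diagnosis"])
    return X, y


# Tiny made-up table with the same column layout, for illustration only
toy_df = pd.DataFrame(
    {
        "id": [101, 102, 103],
        "diagnosis": ["M", "B", "M"],
        "radius_mean": [18.0, 12.0, 20.0],
        "texture_mean": [10.0, 15.0, 18.0],
    }
)
X, y = prepare_features(toy_df)
print(X.shape, y.tolist())
```

With the real data, the same call would be `prepare_features(cancer_df)`, leaving the 30 measurement columns in `X`.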

Method

This project will use machine learning, building a K-Nearest Neighbors classifier to estimate whether a tumor is malignant or benign (cancerous or non-cancerous). The classifier will be trained on subjects with known diagnoses and the physical measurements recorded for each of them. Each subject is represented as a vector containing their physical measurements. To classify a new subject, their vector is compared against the training vectors to find the k most similar subjects, and the most common diagnosis among those neighbors becomes the predicted diagnosis (malignant or benign). Based on this predicted diagnosis, a conclusion can be made about whether or not a subject has breast cancer.
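To make the K-Nearest Neighbors idea concrete, here is a minimal from-scratch sketch in plain Python. It is an illustration under stated assumptions rather than the project's implementation: the function names, the choice of Euclidean distance, the value k=3, and the toy vectors are all hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two equal-length measurement vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest neighbors."""
    # Pair every training vector's distance to the query with its label
    distances = [
        (euclidean(row, query), label) for row, label in zip(train_X, train_y)
    ]
    # Keep the k closest training subjects
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most common diagnosis among them is the prediction
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up two-feature vectors for illustration only
train_X = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [3.0, 3.2], [3.1, 2.9], [2.9, 3.0]]
train_y = ["B", "B", "B", "M", "M", "M"]

prediction_near_benign = knn_predict(train_X, train_y, [1.1, 1.0])
prediction_near_malignant = knn_predict(train_X, train_y, [3.0, 3.0])
print(prediction_near_benign, prediction_near_malignant)
```

Because the prediction depends on distances, measurements on large scales (such as area_mean, in the hundreds or thousands) would dominate those on small scales (such as smoothness_mean, around 0.1). Standardizing each feature before computing distances is a common remedy. In practice a library implementation such as scikit-learn's `KNeighborsClassifier` could replace this hand-written version.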