Every day, new information is released about carcinogens present in our everyday lives. Patients with cancer often wonder how their cancer came to be: whether they led an unhealthy lifestyle, whether it was genetic, or whether environmental influences played a role. The sad reality, however, is that there is often no reliable way to know why a given patient developed cancer, and most explanations are simply educated guesses based on health history, family history, lifestyle, and other factors.
Many institutions around the world have collected data on cancer patients' treatments, lifestyles, genetic history, personal history, environment, and more. Researchers continue to search this data for trends that could help healthcare providers treat cancer patients and prevent cancer as effectively as possible. While this project cannot pinpoint why particular patients have cancer, it can analyze trends in the data of 1000 cancer patients. The goal of this project is to identify trends in the severity of patients' cancer in relation to various aspects of their lifestyles.
If successful, this work will describe how various aspects of patients' lifestyles relate to the severity of their cancer. Defining these trends may help in developing public health guidelines and may be a stepping stone toward risk assessment protocols for patients whose lifestyles or other factors put them at increased risk of cancer.
One possible negative outcome of such work, however, is the spread of false information if the results are misused or misrepresented. For example, if a positive correlation is observed between obesity and the severity of cancer, the result could be misreported as "obesity causes cancer," even though a correlation alone does not establish causation.
We will use the Kaggle dataset Cancer Patients Data to examine aspects of cancer patients' lifestyles, such as air pollution exposure, alcohol use, and diet, alongside the severity of their cancers.
In addition to the lifestyle and exposure factors, each patient's cancer severity is rated as 'Low', 'Medium', or 'High' in the Level column. Our project analyzes some of these factors to determine how they correlate with the severity of the patients' cancers.
import pandas as pd
cancer_pt_data = pd.read_csv('cancer patient data sets.csv')
cancer_pt_data.head(25)
| | Patient Id | Age | Gender | Air Pollution | Alcohol use | Dust Allergy | OccuPational Hazards | Genetic Risk | chronic Lung Disease | Balanced Diet | ... | Fatigue | Weight Loss | Shortness of Breath | Wheezing | Swallowing Difficulty | Clubbing of Finger Nails | Frequent Cold | Dry Cough | Snoring | Level |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | P1 | 33 | 1 | 2 | 4 | 5 | 4 | 3 | 2 | 2 | ... | 3 | 4 | 2 | 2 | 3 | 1 | 2 | 3 | 4 | Low |
1 | P10 | 17 | 1 | 3 | 1 | 5 | 3 | 4 | 2 | 2 | ... | 1 | 3 | 7 | 8 | 6 | 2 | 1 | 7 | 2 | Medium |
2 | P100 | 35 | 1 | 4 | 5 | 6 | 5 | 5 | 4 | 6 | ... | 8 | 7 | 9 | 2 | 1 | 4 | 6 | 7 | 2 | High |
3 | P1000 | 37 | 1 | 7 | 7 | 7 | 7 | 6 | 7 | 7 | ... | 4 | 2 | 3 | 1 | 4 | 5 | 6 | 7 | 5 | High |
4 | P101 | 46 | 1 | 6 | 8 | 7 | 7 | 7 | 6 | 7 | ... | 3 | 2 | 4 | 1 | 4 | 2 | 4 | 2 | 3 | High |
5 | P102 | 35 | 1 | 4 | 5 | 6 | 5 | 5 | 4 | 6 | ... | 8 | 7 | 9 | 2 | 1 | 4 | 6 | 7 | 2 | High |
6 | P103 | 52 | 2 | 2 | 4 | 5 | 4 | 3 | 2 | 2 | ... | 3 | 4 | 2 | 2 | 3 | 1 | 2 | 3 | 4 | Low |
7 | P104 | 28 | 2 | 3 | 1 | 4 | 3 | 2 | 3 | 4 | ... | 3 | 2 | 2 | 4 | 2 | 2 | 3 | 4 | 3 | Low |
8 | P105 | 35 | 2 | 4 | 5 | 6 | 5 | 6 | 5 | 5 | ... | 1 | 4 | 3 | 2 | 4 | 6 | 2 | 4 | 1 | Medium |
9 | P106 | 46 | 1 | 2 | 3 | 4 | 2 | 4 | 3 | 3 | ... | 1 | 2 | 4 | 6 | 5 | 4 | 2 | 1 | 5 | Medium |
10 | P107 | 44 | 1 | 6 | 7 | 7 | 7 | 7 | 6 | 7 | ... | 5 | 3 | 2 | 7 | 8 | 2 | 4 | 5 | 3 | High |
11 | P108 | 64 | 2 | 6 | 8 | 7 | 7 | 7 | 6 | 7 | ... | 9 | 6 | 5 | 7 | 2 | 4 | 3 | 1 | 4 | High |
12 | P109 | 39 | 2 | 4 | 5 | 6 | 6 | 5 | 4 | 6 | ... | 5 | 3 | 2 | 4 | 3 | 1 | 7 | 5 | 6 | Medium |
13 | P11 | 34 | 1 | 6 | 7 | 7 | 7 | 6 | 7 | 7 | ... | 4 | 2 | 3 | 1 | 4 | 5 | 6 | 7 | 5 | High |
14 | P110 | 27 | 2 | 3 | 1 | 4 | 2 | 3 | 2 | 3 | ... | 2 | 2 | 3 | 4 | 1 | 5 | 2 | 6 | 2 | Low |
15 | P111 | 73 | 1 | 5 | 6 | 6 | 5 | 6 | 5 | 6 | ... | 4 | 3 | 6 | 2 | 1 | 2 | 1 | 6 | 2 | Medium |
16 | P112 | 17 | 1 | 3 | 1 | 5 | 3 | 4 | 2 | 2 | ... | 1 | 3 | 7 | 8 | 6 | 2 | 1 | 7 | 2 | Medium |
17 | P113 | 34 | 1 | 6 | 7 | 7 | 7 | 6 | 7 | 7 | ... | 4 | 2 | 3 | 1 | 4 | 5 | 6 | 7 | 5 | High |
18 | P114 | 36 | 1 | 6 | 7 | 7 | 7 | 7 | 7 | 6 | ... | 8 | 5 | 7 | 6 | 7 | 8 | 7 | 6 | 2 | High |
19 | P115 | 14 | 1 | 2 | 4 | 5 | 6 | 5 | 5 | 4 | ... | 5 | 3 | 2 | 1 | 4 | 7 | 2 | 1 | 6 | Medium |
20 | P116 | 24 | 1 | 6 | 8 | 7 | 7 | 6 | 7 | 7 | ... | 5 | 2 | 5 | 2 | 3 | 2 | 1 | 7 | 6 | High |
21 | P117 | 53 | 2 | 4 | 5 | 6 | 5 | 5 | 4 | 6 | ... | 8 | 7 | 9 | 2 | 1 | 4 | 6 | 7 | 2 | High |
22 | P118 | 62 | 1 | 6 | 8 | 7 | 7 | 7 | 6 | 7 | ... | 3 | 2 | 4 | 1 | 4 | 2 | 4 | 2 | 3 | High |
23 | P119 | 29 | 2 | 6 | 7 | 7 | 7 | 7 | 6 | 7 | ... | 2 | 7 | 6 | 7 | 6 | 7 | 2 | 3 | 1 | High |
24 | P12 | 36 | 1 | 6 | 7 | 7 | 7 | 7 | 7 | 6 | ... | 8 | 5 | 7 | 6 | 7 | 8 | 7 | 6 | 2 | High |
25 rows × 25 columns
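Because the Level column holds text labels, it needs a numeric form before it can be related to the numeric factor scores. The sketch below is one way to do that, not necessarily the encoding the final analysis will use. The helper `encode_level` and the column name `Level_code` are hypothetical, and `sample` is just a handful of rows copied from the table above so that the block runs without the CSV file.

```python
import pandas as pd

# Severity labels in increasing order, as they appear in the "Level" column.
LEVEL_ORDER = ["Low", "Medium", "High"]

def encode_level(df: pd.DataFrame, column: str = "Level") -> pd.DataFrame:
    """Return a copy of df with an integer 'Level_code' column.

    Low -> 0, Medium -> 1, High -> 2. Labels outside LEVEL_ORDER become -1.
    """
    out = df.copy()
    ordered = pd.Categorical(out[column], categories=LEVEL_ORDER, ordered=True)
    out["Level_code"] = ordered.codes
    return out

# Rows 0, 1, 2, 3, 7, and 8 of the table above (illustration only).
sample = pd.DataFrame({
    "Patient Id": ["P1", "P10", "P100", "P1000", "P104", "P105"],
    "Air Pollution": [2, 3, 4, 7, 3, 4],
    "Alcohol use": [4, 1, 5, 7, 1, 5],
    "Level": ["Low", "Medium", "High", "High", "Low", "Medium"],
})

encoded = encode_level(sample)
level_counts = encoded["Level"].value_counts()  # patients per severity level
```

On the full dataset, the same call would be `encode_level(cancer_pt_data)`, assuming the Level column contains only these three labels.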
While the dataset classifies the severity of each patient's cancer into three levels, in reality there are many different types and severities of cancer, which three broad categories cannot adequately represent. The results of this analysis should therefore be read as broad trends, keeping in mind that they could differ considerably once broken down by type of cancer, stage, progression, and other factors.
Another open question is whether the data should be grouped by type of lifestyle factor (for example, factors within one's control versus factors outside of one's control). To address this, we plan to run the analysis both with and without grouping, adjusting our methods as needed based on the results.
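The two options can be compared side by side with a short sketch. The assignment of columns to groups in `FACTOR_GROUPS` is an assumption made purely for illustration (which factors count as "within one's control" is itself debatable), and the helper names `ungrouped_means` and `grouped_means` are hypothetical. The `sample` frame holds a few rows copied from the table above.

```python
import pandas as pd

# Hypothetical grouping of factors; this assignment is an assumption.
FACTOR_GROUPS = {
    "within_control": ["Alcohol use", "Balanced Diet"],
    "outside_control": ["Air Pollution", "Genetic Risk"],
}

def ungrouped_means(df: pd.DataFrame) -> pd.DataFrame:
    """Mean score of every individual factor at each severity level."""
    factors = [col for cols in FACTOR_GROUPS.values() for col in cols]
    return df.groupby("Level")[factors].mean()

def grouped_means(df: pd.DataFrame) -> pd.DataFrame:
    """Mean of each group's per-patient average score at each severity level."""
    scores = pd.DataFrame(
        {name: df[cols].mean(axis=1) for name, cols in FACTOR_GROUPS.items()}
    )
    scores["Level"] = df["Level"]
    return scores.groupby("Level").mean()

# Rows 0, 1, 2, 3, 7, and 8 of the table above (illustration only).
sample = pd.DataFrame({
    "Air Pollution": [2, 3, 4, 7, 3, 4],
    "Alcohol use": [4, 1, 5, 7, 1, 5],
    "Genetic Risk": [3, 4, 5, 6, 2, 6],
    "Balanced Diet": [2, 2, 6, 7, 4, 5],
    "Level": ["Low", "Medium", "High", "High", "Low", "Medium"],
})

per_factor = ungrouped_means(sample)
per_group = grouped_means(sample)
```

Averaging within a group smooths over individual factors, so a trend visible in `per_factor` can be diluted in `per_group`. That is one reason to look at both views before settling on a method.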
We will pose our problem similarly to a regression problem. Given the lifestyle factors in the dataset, we seek to determine how these factors (each scored numerically) relate to the 'Low', 'Medium', and 'High' levels assigned to the patients' conditions. One advantage of this approach is that it lets us consider (and likely plot) all of the data, so that trends can be seen visually and also quantified with statistical measures.
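As one example of such a statistical measure, the sketch below computes a Spearman rank correlation between each factor and the severity level. Spearman is swapped in here as an assumption, chosen because the severity levels are ordered categories rather than continuous values. It is not necessarily the measure the final analysis will use. The mapping `LEVEL_CODES` and the helper `rank_correlations` are hypothetical names, and `sample` holds a few rows copied from the table above.

```python
import pandas as pd

# Assumed numeric encoding of the ordered severity labels.
LEVEL_CODES = {"Low": 0, "Medium": 1, "High": 2}

def rank_correlations(df: pd.DataFrame, factors: list, level_col: str = "Level") -> pd.Series:
    """Spearman correlation of each factor with severity.

    Computed as the Pearson correlation of the ranked values, with ties
    given their average rank. Labels missing from LEVEL_CODES become NaN
    and are ignored.
    """
    severity_rank = df[level_col].map(LEVEL_CODES).rank()
    factor_ranks = df[factors].rank()
    return factor_ranks.corrwith(severity_rank)

# Rows 0, 1, 2, 3, 7, and 8 of the table above (illustration only).
sample = pd.DataFrame({
    "Air Pollution": [2, 3, 4, 7, 3, 4],
    "Genetic Risk": [3, 4, 5, 6, 2, 6],
    "Level": ["Low", "Medium", "High", "High", "Low", "Medium"],
})

corr = rank_correlations(sample, ["Air Pollution", "Genetic Risk"])
```

A value near +1 means a factor's score tends to rise with severity, and a value near -1 means it tends to fall. With only six rows the numbers carry no statistical weight and serve only to show the mechanics.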