For this project, I would like to use machine learning to predict the risk of a mental health disorder based on education, income level, and symptoms. Mental health issues have become very prevalent in recent years, and there has been a lot of discussion about what we should do to reduce the number of people affected by them. I think it would be really interesting to use machine learning to evaluate whether someone may be at risk for a mental health issue based on education and socioeconomic status.
import pandas as pd
data_df = pd.read_excel('Cleaned Data.xlsx')
data_df
 | I am currently employed at least part-time | I identify as having a mental illness | Education | I have my own computer separate from a smart phone | I have been hospitalized before for my mental illness | How many days were you hospitalized for your mental illness | I am legally disabled | I have my regular access to the internet | I live with my parents | I have a gap in my resume | ... | Obsessive thinking | Mood swings | Panic attacks | Compulsive behavior | Tiredness | Age | Gender | Household Income | Region | Device Type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | High School or GED | 0 | 0 | 0.0 | 0 | 1 | 0 | 1 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 30-44 | Male | $25,000-$49,999 | Mountain | Android Phone / Tablet |
1 | 1 | 1 | Some Phd | 1 | 0 | 0.0 | 0 | 1 | 0 | 0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 18-29 | Male | $50,000-$74,999 | East South Central | MacOS Desktop / Laptop |
2 | 1 | 0 | Completed Undergraduate | 1 | 0 | 0.0 | 0 | 1 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 30-44 | Male | $150,000-$174,999 | Pacific | MacOS Desktop / Laptop |
3 | 0 | 0 | Some Undergraduate | 1 | 0 | NaN | 0 | 1 | 1 | 1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 30-44 | Male | $25,000-$49,999 | New England | Windows Desktop / Laptop |
4 | 1 | 1 | Completed Undergraduate | 1 | 1 | 35.0 | 1 | 1 | 0 | 1 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 30-44 | Male | $25,000-$49,999 | East North Central | iOS Phone / Tablet |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
329 | 0 | 0 | High School or GED | 1 | 0 | NaN | 1 | 1 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 45-60 | Female | Prefer not to answer | Mountain | Android Phone / Tablet |
330 | 1 | 0 | Some Undergraduate | 1 | 0 | 0.0 | 0 | 1 | 1 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 18-29 | Male | $50,000-$74,999 | Pacific | Windows Desktop / Laptop |
331 | 1 | 0 | Some Undergraduate | 1 | 0 | 0.0 | 0 | 1 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | > 60 | Female | $10,000-$24,999 | West North Central | Windows Desktop / Laptop |
332 | 0 | 1 | Some Undergraduate | 0 | 1 | 1.0 | 1 | 1 | 1 | 1 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 18-29 | Female | $0-$9,999 | West South Central | Android Phone / Tablet |
333 | 1 | 1 | Some Undergraduate | 1 | 0 | 0.0 | 1 | 1 | 0 | 0 | ... | NaN | NaN | NaN | NaN | NaN | 18-29 | Female | $10,000-$24,999 | Pacific | Android Phone / Tablet |
334 rows × 31 columns
data_dict = {
    'I am currently employed at least part-time': 'employment status',
    'I identify as having a mental illness': '0=no , 1=yes',
    'Education': 'level of education completed',
    'I have my own computer separate from a smart phone': '0=no, 1=yes',
    'I have been hospitalized before for my mental illness': '0=no, 1=yes',
    'How many days were you hospitalized for your mental illness': 'days spent in hospital',
    'I am legally disabled': '0=no, 1=yes',
    'I have my regular access to the internet': '0=no, 1=yes',
    'I live with my parents': '0=no, 1=yes',
    'I have a gap in my resume': 'gaps in resume due to MH 0=no, 1=yes',
    'Total length of any gaps in my resume in months.': 'length of gap due to MH',
    'Annual income (including any social welfare programs) in USD': 'income range in thousands? not sure',
    'I am unemployed': 'employment status',
    'I read outside of work and school': '0=no, 1=yes',
    'Annual income from social welfare programs': 'in thousands? over what period of time?',
    'I receive food stamps': '0=no, 1=yes',
    'I am on section 8 housing': '0=no, 1=yes',
    'How many times were you hospitalized for your mental illness': 'number of hospitalizations',
    'Lack of concentration': '0=no, 1=yes',
    'Anxiety': '0=no, 1=yes',
    'Depression': '0=no, 1=yes',
    'Obsessive thinking': '0=no, 1=yes',
    'Mood swings': '0=no, 1=yes',
    'Panic attacks': '0=no, 1=yes',
    'Compulsive behavior': '0=no, 1=yes',
    'Tiredness': '0=no, 1=yes',
    'Age': 'age',
    'Gender': 'Male or Female',
    'Household Income': 'income range',
    'Region': 'Which part of US',
    'Device Type': 'Andriod, Windows, Mac',
}
data_dict
{'I am currently employed at least part-time': 'employment status', 'I identify as having a mental illness': '0=no , 1=yes', 'Education': 'level of education completed', 'I have my own computer separate from a smart phone': '0=no, 1=yes', 'I have been hospitalized before for my mental illness': '0=no, 1=yes', 'How many days were you hospitalized for your mental illness': 'days spent in hospital', 'I am legally disabled': '0=no, 1=yes', 'I have my regular access to the internet': '0=no, 1=yes', 'I live with my parents': '0=no, 1=yes', 'I have a gap in my resume': 'gaps in resume due to MH 0=no, 1=yes', 'Total length of any gaps in my resume in\xa0months.': 'length of gap due to MH', 'Annual income (including any social welfare programs) in USD': 'income range in thousands? not sure', 'I am unemployed': 'employment status', 'I read outside of work and school': '0=no, 1=yes', 'Annual income from social welfare programs': 'in thousands? over what period of time?', 'I receive food stamps': '0=no, 1=yes', 'I am on section 8 housing': '0=no, 1=yes', 'How many times were you hospitalized for your mental illness': 'number of hospitalizations', 'Lack of concentration': '0=no, 1=yes', 'Anxiety': '0=no, 1=yes', 'Depression': '0=no, 1=yes', 'Obsessive thinking': '0=no, 1=yes', 'Mood swings': '0=no, 1=yes', 'Panic attacks': '0=no, 1=yes', 'Compulsive behavior': '0=no, 1=yes', 'Tiredness': '0=no, 1=yes', 'Age': 'age', 'Gender': 'Male or Female', 'Household Income': 'income range', 'Region': 'Which part of US', 'Device Type': 'Andriod, Windows, Mac'}
I am planning to group the data by income range, which will allow for analysis of mental health risk by income level. I would also consider using cross-validation to evaluate the model on different segments of the data.
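As a rough sketch of that plan, the cell below groups respondents by income bracket and then scores a simple classifier with stratified cross-validation. Several things here are assumptions rather than decisions made in this notebook: the helper names `prevalence_by_income` and `cross_validated_accuracy`, the choice of logistic regression from scikit-learn, dropping rows with missing values, and the small `toy_df` frame, whose rows are invented only so the cell runs without `Cleaned Data.xlsx`.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

TARGET = "I identify as having a mental illness"
INCOME = "Household Income"


def prevalence_by_income(df):
    """Share of respondents reporting a mental illness in each income bracket."""
    return df.groupby(INCOME)[TARGET].agg(rate="mean", n="size")


def cross_validated_accuracy(df, feature_cols, n_splits=5):
    """Mean accuracy of a logistic regression across stratified folds."""
    # Assumption: rows with missing values are simply dropped.
    data = df[feature_cols + [TARGET]].dropna()
    # One-hot encode text columns such as income bracket and education.
    X = pd.get_dummies(data[feature_cols], dtype=float)
    y = data[TARGET].astype(int)
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean()


# Invented rows for illustration only; they are NOT taken from the survey.
toy_df = pd.DataFrame({
    INCOME: ["$0-$9,999", "$25,000-$49,999", "$50,000-$74,999"] * 4,
    "Education": ["High School or GED", "Some Undergraduate",
                  "Completed Undergraduate", "Some Undergraduate"] * 3,
    "Anxiety": [1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0],
    TARGET: [1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1],
})

print(prevalence_by_income(toy_df))
print(cross_validated_accuracy(toy_df, [INCOME, "Education", "Anxiety"], n_splits=3))
```

On the real data, the same two calls could be made with `data_df` in place of `toy_df`. Stratified folds keep the proportion of respondents who identify as having a mental illness roughly equal in each split, which matters with only 334 rows.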