(1%) Describes and motivates a real-world problem where data science may provide helpful insights. Your description should be easily understood by a casual reader and include citations to motivating sources or relevant information (e.g. news articles, further reading links … Wikipedia makes for a poor reference but the links it cites are usually promising)
Life expectancy, the number of years a person can expect to live, is a metric used to roughly estimate the social development and the general progress of health sciences in a nation. However, because it is an aggregate measure, life expectancy is sensitive to national or global events that cause large losses of life, and such events may not be representative of a nation's social or health development. The opioid epidemic, for example, accounted for a significant decrease in life expectancy in the United States (census.gov). In the context of data science, analyzing life expectancy and its associated variables can help determine which factors influence it and what future life expectancies may look like.
While looking into the growth of life expectancy, it would also be interesting to consider longevity escape velocity (LEV), the point at which life expectancy increases by more than one year for every year that passes, so that it grows faster than people age. However, according to a study conducted by the US Census Bureau, the rate at which life expectancy is increasing has been slowing over time.
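To make the LEV definition concrete, the sketch below computes the year-over-year gain in life expectancy and compares it with the threshold of one year gained per calendar year. The numbers are hypothetical values chosen for illustration and are not taken from any dataset.

```python
import pandas as pd

# Hypothetical life expectancy values (years) for illustration only.
life_exp = pd.Series(
    [76.8, 77.0, 77.1, 77.3, 77.4],
    index=[2000, 2001, 2002, 2003, 2004],
    name="life_expectancy",
)

# Gain in life expectancy per calendar year (the first year has no predecessor).
annual_gain = life_exp.diff().dropna()

# LEV would require gaining more than one year of life expectancy per year lived.
reached_lev = bool((annual_gain > 1.0).any())

print(annual_gain.round(2).to_dict())
print("LEV reached:", reached_lev)
```

With gains of a fraction of a year per year, as in these made-up values, the threshold is far from being met.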
(1%) Explicitly load and show your dataset. Provide a data dictionary which explains the meaning of each feature present. Demonstrate that this data is sufficient to make progress on your real-world problem described above.
import pandas as pd

# Load the life expectancy dataset and index the rows by year.
df_life_exp = pd.read_csv('Life_Expectancy_Data.csv')
df_life_exp = df_life_exp.set_index('Year')
df_life_exp.head()
Year | Country | Continent | Status | Life_expectancy | Adult_Mortality | infant_deaths | Alcohol | percentage_expenditure | Hepatitis_B | Measles | ... | Polio | Total_expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness 1-19 years | thinness 5-9 years | Income_composition_of_resources | Schooling |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2015 | Afghanistan | Asia | Developing | 65.0 | 263 | 62 | 0.01 | 71.279624 | 65.0 | 1154 | ... | 6.0 | 8.16 | 65 | 0.1 | 584.259210 | 33736494 | 17.2 | 17.3 | 0.479 | 10.1 |
2014 | Afghanistan | Asia | Developing | 59.9 | 271 | 64 | 0.01 | 73.523582 | 62.0 | 492 | ... | 58.0 | 8.18 | 62 | 0.1 | 612.696514 | 327582 | 17.5 | 17.5 | 0.476 | 10.0 |
2013 | Afghanistan | Asia | Developing | 59.9 | 268 | 66 | 0.01 | 73.219243 | 64.0 | 430 | ... | 62.0 | 8.13 | 64 | 0.1 | 631.744976 | 31731688 | 17.7 | 17.7 | 0.470 | 9.9 |
2012 | Afghanistan | Asia | Developing | 59.5 | 272 | 69 | 0.01 | 78.184215 | 67.0 | 2787 | ... | 67.0 | 8.52 | 67 | 0.1 | 669.959000 | 3696958 | 17.9 | 18.0 | 0.463 | 9.8 |
2011 | Afghanistan | Asia | Developing | 59.2 | 275 | 71 | 0.01 | 7.097109 | 68.0 | 3013 | ... | 68.0 | 7.87 | 68 | 0.1 | 63.537231 | 2978599 | 18.2 | 18.2 | 0.454 | 9.5 |
5 rows × 22 columns
df_life_exp.describe()
| | Life_expectancy | Adult_Mortality | infant_deaths | Alcohol | percentage_expenditure | Hepatitis_B | Measles | BMI | under_five_deaths | Polio | Total_expenditure | HIV/AIDS | GDP | thinness 1-19 years | thinness 5-9 years | Income_composition_of_resources | Schooling |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 2461.000000 | 2461.000000 | 2461.000000 | 2461.000000 | 2461.000000 | 1997.000000 | 2461.000000 | 2461.000000 | 2461.000000 | 2453.000000 | 2309.000000 | 2461.000000 | 2461.000000 | 2461.000000 | 2461.000000 | 2458.000000 | 2458.000000 |
mean | 69.464567 | 160.961804 | 31.134498 | 4.328952 | 880.115968 | 80.849775 | 2361.811865 | 38.346404 | 43.204388 | 82.682022 | 5.874010 | 1.893661 | 7555.989842 | 4.841040 | 4.883909 | 0.633627 | 12.157933 |
std | 9.639385 | 126.167514 | 127.249666 | 4.056351 | 2143.267664 | 24.975829 | 11148.748920 | 19.908022 | 172.992761 | 23.147657 | 2.395258 | 5.464583 | 14337.844932 | 4.500021 | 4.592501 | 0.212276 | 3.326975 |
min | 36.300000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 1.400000 | 0.000000 | 3.000000 | 0.370000 | 0.100000 | 1.681350 | 0.100000 | 0.100000 | 0.000000 | 0.000000 |
25% | 63.400000 | 69.000000 | 0.000000 | 0.510000 | 24.733286 | 77.000000 | 0.000000 | 19.200000 | 0.000000 | 78.000000 | 4.230000 | 0.100000 | 462.486524 | 1.600000 | 1.600000 | 0.494250 | 10.200000 |
50% | 72.300000 | 137.000000 | 2.000000 | 3.480000 | 122.936535 | 92.000000 | 15.000000 | 43.800000 | 3.000000 | 93.000000 | 5.760000 | 0.100000 | 1792.384500 | 3.300000 | 3.300000 | 0.686000 | 12.450000 |
75% | 76.000000 | 223.000000 | 19.000000 | 7.380000 | 579.738437 | 96.000000 | 341.000000 | 56.100000 | 24.000000 | 97.000000 | 7.530000 | 0.800000 | 6171.262444 | 7.100000 | 7.100000 | 0.788000 | 14.500000 |
max | 89.000000 | 723.000000 | 1800.000000 | 17.870000 | 19479.911610 | 99.000000 | 212183.000000 | 77.600000 | 2500.000000 | 99.000000 | 14.390000 | 50.600000 | 119172.741800 | 27.700000 | 28.600000 | 0.948000 | 20.700000 |
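Two things stand out in this summary. The counts differ between columns (1997 for Hepatitis_B against 2461 for most others), which indicates missing values, and Diphtheria and Population do not appear at all, which suggests pandas did not parse them as numeric. A minimal sketch of how both could be checked is given below. It uses a small hypothetical frame in place of `df_life_exp`; the cell values and the stray `'n/a'` entry are assumptions made for illustration.

```python
import pandas as pd

# Small stand-in for df_life_exp; values are made up for illustration.
sample = pd.DataFrame({
    "Hepatitis_B": [65.0, None, 64.0, 67.0],
    "Diphtheria": ["65", "62", "n/a", "67"],  # read as text, not numbers
    "Population": ["33736494", "327582", "31731688", None],
})

# Missing values per column; describe() only reports the non-missing count.
missing_per_column = sample.isna().sum()

# Coerce text columns to numbers; entries that cannot be parsed become NaN.
for col in ["Diphtheria", "Population"]:
    sample[col] = pd.to_numeric(sample[col], errors="coerce")

print(missing_per_column)
print(sample.dtypes)
```

After the conversion, the two columns would be included in `describe()`, and any values that failed to parse would show up as additional missing entries.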
df_life_exp.columns
Index(['Country', 'Continent', 'Status', 'Life_expectancy ', 'Adult_Mortality', 'infant_deaths', 'Alcohol', 'percentage_expenditure', 'Hepatitis_B', 'Measles ', ' BMI ', 'under_five_deaths ', 'Polio', 'Total_expenditure', 'Diphtheria ', ' HIV/AIDS', 'GDP', 'Population', ' thinness 1-19 years', ' thinness 5-9 years', 'Income_composition_of_resources', 'Schooling'], dtype='object')
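Several of the names above carry stray leading or trailing spaces (for example `'Life_expectancy '` and `' BMI '`), so a lookup such as `df_life_exp['BMI']` would raise a `KeyError`. One way to normalise them is sketched below on an empty frame that reuses a few of the names printed above; applying the same call to `df_life_exp` is an assumption about how the notebook would proceed.

```python
import pandas as pd

# A few of the column names exactly as printed above, spaces included.
raw_columns = ["Life_expectancy ", "Measles ", " BMI ", " HIV/AIDS", "Schooling"]
df = pd.DataFrame(columns=raw_columns)

# Strip surrounding whitespace from every column label.
df.columns = df.columns.str.strip()

print(list(df.columns))
```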
life_exp_dict = {'Year': 'year that data is collected for; index',
                 'Country': 'country viewed in study',
                 'Continent': 'continent that study country resides in',
                 'Status': 'development status of the country (Developed or Developing)',
                 'Life_expectancy': 'number of years an individual is expected to live',
                 'Adult_Mortality': 'probability of dying between 15-60 years per 1000 population',
                 'infant_deaths': 'number of infant deaths per 1000 population',
                 'Alcohol': 'alcohol consumption per capita in litres of pure alcohol',
                 'percentage_expenditure': 'expenditure on health as a percent of domestic GDP',
                 'Hepatitis_B': 'Hep-B immunization coverage of 1-year-olds',
                 'Measles': 'number of reported cases per 1000 population',
                 'BMI': 'Average body-mass index of entire population',
                 'under_five_deaths': 'number of deaths under the age of five per 1000 population',
                 'Polio': 'Polio immunization coverage among 1-year-olds',
                 'Total_expenditure': 'general government expenditure on health as a percent of total government expenditure',
                 'Diphtheria': 'diphtheria, tetanus and pertussis (DTP3) immunization coverage among 1-year-olds',
                 'HIV/AIDS': 'Deaths between ages 0-4 from HIV/AIDS per 1000 live births',
                 'GDP': 'gross domestic product per capita in USD',
                 'Population': 'population of a country',
                 'thinness 1-19 years': 'Prevalence of thinness among children and adolescents for ages 10-19(%)',
                 'thinness 5-9 years': 'Prevalence of thinness among children and adolescents for ages 5-9(%)',
                 'Income_composition_of_resources': 'HDI in terms of income composition of resources from 0 to 1',
                 'Schooling': 'Number of years of schooling'}
print(life_exp_dict)
{'Year': 'year that data is collected for; index', 'Country': 'country viewed in study', 'Continent': 'continent that study country resides in', 'Status': 'development status of the country (Developed or Developing)', 'Life_expectancy': 'number of years an individual is expected to live', 'Adult_Mortality': 'probability of dying between 15-60 years per 1000 population', 'infant_deaths': 'number of infant deaths per 1000 population', 'Alcohol': 'alcohol consumption per capita in litres of pure alcohol', 'percentage_expenditure': 'expenditure on health as a percent of domestic GDP', 'Hepatitis_B': 'Hep-B immunization coverage of 1-year-olds', 'Measles': 'number of reported cases per 1000 population', 'BMI': 'Average body-mass index of entire population', 'under_five_deaths': 'number of deaths under the age of five per 1000 population', 'Polio': 'Polio immunization coverage among 1-year-olds', 'Total_expenditure': 'general government expenditure on health as a percent of total government expenditure', 'Diphtheria': 'diphtheria, tetanus and pertussis (DTP3) immunization coverage among 1-year-olds', 'HIV/AIDS': 'Deaths between ages 0-4 from HIV/AIDS per 1000 live births', 'GDP': 'gross domestic product per capita in USD', 'Population': 'population of a country', 'thinness 1-19 years': 'Prevalence of thinness among children and adolescents for ages 10-19(%)', 'thinness 5-9 years': 'Prevalence of thinness among children and adolescents for ages 5-9(%)', 'Income_composition_of_resources': 'HDI in terms of income composition of resources from 0 to 1', 'Schooling': 'Number of years of schooling'}
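The dataset above contains many factors that may influence life expectancy, including mortality rates, immunization coverage, health expenditure, GDP, and schooling, which should be sufficient to explore which variables are associated with life expectancy.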
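One quick way to check whether the columns carry signal is to rank the numeric features by their correlation with life expectancy. The sketch below uses synthetic data with invented relationships (the coefficients are assumptions), so only the mechanics would carry over to `df_life_exp`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in data; the relationships below are invented for illustration.
schooling = rng.uniform(0, 20, n)
adult_mortality = rng.uniform(1, 700, n)
measles = rng.uniform(0, 1000, n)  # unrelated to the target by construction
life_expectancy = (
    60 + 1.0 * schooling - 0.03 * adult_mortality + rng.normal(0, 2, n)
)

df = pd.DataFrame({
    "Life_expectancy": life_expectancy,
    "Schooling": schooling,
    "Adult_Mortality": adult_mortality,
    "Measles": measles,
})

# Pearson correlation of each feature with the target, strongest first.
corr = df.corr()["Life_expectancy"].drop("Life_expectancy")
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)

print(ranked.round(2))
```

Correlation only captures linear association, so a feature that ranks low here could still matter in a non-linear model.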
(1%) Write one or two sentences about how the data will be used to solve the problem. Earlier in the semester, we won’t have studied the Machine Learning methods just yet but you should have a general idea of what the ML will set out to do. For example:
First, we'll plot life expectancy against individual variables and combinations of variables to identify those most likely to predict it. Then, using these variables, we'll split the data into training and test sets and predict life expectancy from the k nearest neighbors (k-NN regression).
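The plan above can be sketched as follows. This is a plain NumPy illustration of k-nearest-neighbors regression on synthetic data, not the project's final pipeline: the helper `knn_predict`, the two features, and the invented relationship between them and the target are all assumptions. In practice a library implementation such as scikit-learn's `KNeighborsRegressor` would likely be used instead.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=5):
    """Predict each test target as the mean target of its k nearest training points."""
    # Pairwise Euclidean distances, shape (n_test, n_train).
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    # Indices of the k closest training points for every test point.
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

rng = np.random.default_rng(0)
n = 300

# Synthetic features and target; the relationship is invented for illustration.
X = np.column_stack([rng.uniform(0, 20, n), rng.uniform(1, 700, n)])
y = 60 + 1.0 * X[:, 0] - 0.03 * X[:, 1] + rng.normal(0, 2, n)

# Shuffle, then hold out 25% of the rows as a test set.
order = rng.permutation(n)
test_idx, train_idx = order[: n // 4], order[n // 4:]
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# Standardise with training statistics so no feature dominates the distance.
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train_s, X_test_s = (X_train - mean) / std, (X_test - mean) / std

y_pred = knn_predict(X_train_s, y_train, X_test_s, k=5)
rmse = float(np.sqrt(np.mean((y_pred - y_test) ** 2)))
baseline_rmse = float(np.sqrt(np.mean((y_train.mean() - y_test) ** 2)))

print(f"k-NN RMSE: {rmse:.2f}  baseline RMSE: {baseline_rmse:.2f}")
```

Scaling matters here because k-NN relies on distances: without it, a feature measured in the hundreds would swamp one measured in the tens. Comparing against a predict-the-mean baseline gives a simple check that the neighbors carry useful information.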