(1%) Describes and motivates a real-world problem where data science may provide helpful insights. Your description should be easily understood by a casual reader and include citations to motivating sources or relevant information (e.g. news articles, further reading links … Wikipedia makes for a poor reference but the links it cites are usually promising)

Life expectancy, the number of years a person can expect to live, is a metric used to roughly estimate a nation's social development and the general progress of its health sciences. However, as an aggregate measure it is sensitive to nationwide or global events that cause large losses of life, and these events may not reflect a nation's underlying social or health development. The opioid epidemic, for example, accounted for a significant decrease in life expectancy in the United States (census.gov). In the context of data science, analyzing life expectancy alongside associated variables can help determine which factors influence it and what future life expectancies might look like.

While looking into the growth of life expectancy, it is also interesting to consider longevity escape velocity (LEV): the point at which life expectancy increases faster than people age, that is, by more than one year per calendar year. However, according to a study conducted by the US Census Bureau, the rate at which life expectancy is increasing has been slowing over time.
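
To make the longevity escape velocity condition concrete, here is a minimal sketch that computes the gain in life expectancy per calendar year and checks whether any gain exceeds one year per year. The function names `annual_gains` and `reaches_lev` and the sample values are hypothetical, chosen only for illustration; they are not real data.

```python
def annual_gains(life_exp_by_year):
    """Return {year: gain per calendar year since the previous observation}."""
    years = sorted(life_exp_by_year)
    gains = {}
    for prev, curr in zip(years, years[1:]):
        change = life_exp_by_year[curr] - life_exp_by_year[prev]
        gains[curr] = change / (curr - prev)
    return gains


def reaches_lev(life_exp_by_year):
    """True if life expectancy ever rises by more than one year per calendar year."""
    return any(g > 1.0 for g in annual_gains(life_exp_by_year).values())


# Hypothetical values for illustration only (not real data).
sample = {2000: 70.0, 2001: 70.5, 2002: 70.8, 2003: 71.0}
gains = annual_gains(sample)
print(gains)
print(reaches_lev(sample))
```

In this made-up series the yearly gains shrink over time and none exceeds one year, which mirrors the slowing growth described above.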

Life expectancy

(1%) Explicitly load and show your dataset. Provide a data dictionary which explains the meaning of each feature present. Demonstrate that this data is sufficient to make progress on your real-world problem described above.

In [4]:
import pandas as pd
# Load the life expectancy dataset and index the rows by year
df_life_exp = pd.read_csv('Life_Expectancy_Data.csv')
df_life_exp = df_life_exp.set_index('Year')
# Preview the first five rows
df_life_exp.head()
Out[4]:
Country Continent Status Life_expectancy Adult_Mortality infant_deaths Alcohol percentage_expenditure Hepatitis_B Measles ... Polio Total_expenditure Diphtheria HIV/AIDS GDP Population thinness 1-19 years thinness 5-9 years Income_composition_of_resources Schooling
Year
2015 Afghanistan Asia Developing 65.0 263 62 0.01 71.279624 65.0 1154 ... 6.0 8.16 65 0.1 584.259210 33736494 17.2 17.3 0.479 10.1
2014 Afghanistan Asia Developing 59.9 271 64 0.01 73.523582 62.0 492 ... 58.0 8.18 62 0.1 612.696514 327582 17.5 17.5 0.476 10.0
2013 Afghanistan Asia Developing 59.9 268 66 0.01 73.219243 64.0 430 ... 62.0 8.13 64 0.1 631.744976 31731688 17.7 17.7 0.470 9.9
2012 Afghanistan Asia Developing 59.5 272 69 0.01 78.184215 67.0 2787 ... 67.0 8.52 67 0.1 669.959000 3696958 17.9 18.0 0.463 9.8
2011 Afghanistan Asia Developing 59.2 275 71 0.01 7.097109 68.0 3013 ... 68.0 7.87 68 0.1 63.537231 2978599 18.2 18.2 0.454 9.5

5 rows × 22 columns

In [5]:
df_life_exp.describe()
Out[5]:
Life_expectancy Adult_Mortality infant_deaths Alcohol percentage_expenditure Hepatitis_B Measles BMI under_five_deaths Polio Total_expenditure HIV/AIDS GDP thinness 1-19 years thinness 5-9 years Income_composition_of_resources Schooling
count 2461.000000 2461.000000 2461.000000 2461.000000 2461.000000 1997.000000 2461.000000 2461.000000 2461.000000 2453.000000 2309.000000 2461.000000 2461.000000 2461.000000 2461.000000 2458.000000 2458.000000
mean 69.464567 160.961804 31.134498 4.328952 880.115968 80.849775 2361.811865 38.346404 43.204388 82.682022 5.874010 1.893661 7555.989842 4.841040 4.883909 0.633627 12.157933
std 9.639385 126.167514 127.249666 4.056351 2143.267664 24.975829 11148.748920 19.908022 172.992761 23.147657 2.395258 5.464583 14337.844932 4.500021 4.592501 0.212276 3.326975
min 36.300000 1.000000 0.000000 0.000000 0.000000 2.000000 0.000000 1.400000 0.000000 3.000000 0.370000 0.100000 1.681350 0.100000 0.100000 0.000000 0.000000
25% 63.400000 69.000000 0.000000 0.510000 24.733286 77.000000 0.000000 19.200000 0.000000 78.000000 4.230000 0.100000 462.486524 1.600000 1.600000 0.494250 10.200000
50% 72.300000 137.000000 2.000000 3.480000 122.936535 92.000000 15.000000 43.800000 3.000000 93.000000 5.760000 0.100000 1792.384500 3.300000 3.300000 0.686000 12.450000
75% 76.000000 223.000000 19.000000 7.380000 579.738437 96.000000 341.000000 56.100000 24.000000 97.000000 7.530000 0.800000 6171.262444 7.100000 7.100000 0.788000 14.500000
max 89.000000 723.000000 1800.000000 17.870000 19479.911610 99.000000 212183.000000 77.600000 2500.000000 99.000000 14.390000 50.600000 119172.741800 27.700000 28.600000 0.948000 20.700000
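
The count row in the summary above differs across columns (for example, Hepatitis_B has 1997 values and Total_expenditure has 2309, against 2461 for Life_expectancy), which suggests that some columns contain missing values. The sketch below shows one way to quantify this with pandas. It uses a small hypothetical frame, `df_demo`, with made-up values standing in for `df_life_exp`.

```python
import pandas as pd

# Hypothetical miniature frame standing in for df_life_exp (values are made up).
df_demo = pd.DataFrame({
    'Life_expectancy': [65.0, 59.9, 59.9, 59.5],
    'Hepatitis_B': [65.0, None, 64.0, None],
    'GDP': [584.26, 612.70, None, 669.96],
})

# Count missing values per column, largest first.
missing = df_demo.isna().sum().sort_values(ascending=False)
print(missing)

# Share of rows that would survive dropping every incomplete row.
complete_share = len(df_demo.dropna()) / len(df_demo)
print(complete_share)
```

Running the same two steps on `df_life_exp` would show how many rows remain if incomplete rows are dropped before modeling.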
In [7]:
df_life_exp.columns
Out[7]:
Index(['Country', 'Continent', 'Status', 'Life_expectancy ', 'Adult_Mortality',
       'infant_deaths', 'Alcohol', 'percentage_expenditure', 'Hepatitis_B',
       'Measles ', ' BMI ', 'under_five_deaths ', 'Polio', 'Total_expenditure',
       'Diphtheria ', ' HIV/AIDS', 'GDP', 'Population',
       ' thinness  1-19 years', ' thinness 5-9 years',
       'Income_composition_of_resources', 'Schooling'],
      dtype='object')
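
The column index above shows stray whitespace in several names (for example 'Life_expectancy ', ' BMI ', and ' thinness  1-19 years'), so lookups by the clean name would fail. A minimal sketch of one way to normalize the names follows. The frame `df_demo` is hypothetical and reuses only a few of the names printed above.

```python
import pandas as pd

# Hypothetical empty frame that reuses a few of the column names printed above.
df_demo = pd.DataFrame(
    columns=['Life_expectancy ', ' BMI ', ' thinness  1-19 years', 'GDP']
)

# Strip leading/trailing whitespace and collapse internal runs of spaces.
df_demo.columns = (
    df_demo.columns.str.strip().str.replace(r'\s+', ' ', regex=True)
)
print(list(df_demo.columns))
```

Note that this collapses the double space in 'thinness  1-19 years', so any dictionary keyed on column names would need the same single-space form to match.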
In [9]:
# Data dictionary: maps each feature name to a short description
life_exp_dict = {'Year': 'year that data is collected for; index',
                'Country': 'country viewed in study',
                'Continent': 'continent that study country resides in',
                'Status': 'qualitative look on the development of a country',
                'Life_expectancy': 'number of years an individual is expected to live',
                'Adult_Mortality': 'probability of dying between 15-60 years per 1000 population',
                'infant_deaths': 'number of infant deaths per 1000 population',
                'Alcohol': 'alcohol per capita: consumption in litres of pure alcohol',
                'percentage_expenditure': 'expenditure on health as a percent of domestic GDP',
                'Hepatitis_B': 'Hep-B immunization coverage of 1-year-olds',
                'Measles': 'number of reported cases per 1000 population',
                'BMI': 'Average body-mass index of entire population',
                'under_five_deaths': 'number of deaths under the age of five per 1000 population',
                'Polio': 'Polio immunization coverage among 1-year-olds',
                'Total_expenditure': 'general government expenditure on health as a percent of total government expenditure',
                'Diphtheria': 'diphtheria, tetanus and pertussis (DTP3) immunization coverage among 1-year-olds',
                'HIV/AIDS': 'Deaths between ages 0-4 from HIV/AIDS per 1000 live births',
                'GDP': 'gross domestic product per capita in USD',
                'Population': 'population of a country',
                'thinness  1-19 years': 'Prevalence of thinness among children and adolescents for ages 10-19(%)',
                'thinness 5-9 years': 'Prevalence of thinness among children and adolescents for ages 5-9(%)',
                'Income_composition_of_resources': 'HDI in terms of income composition of resources from 0 to 1',
                'Schooling': 'Number of years of schooling'}
print(life_exp_dict)
{'Year': 'year that data is collected for; index', 'Country': 'country viewed in study', 'Continent': 'continent that study country resides in', 'Status': 'qualitative look on the development of a country', 'Life_expectancy': 'number of years an individual is expected to live', 'Adult_Mortality': 'probability of dying between 15-60 years per 1000 population', 'infant_deaths': 'number of infant deaths per 1000 population', 'Alcohol': 'alcohol per capita: consumption in litres of pure alcohol', 'percentage_expenditure': 'expenditure on health as a percent of domestic GDP', 'Hepatitis_B': 'Hep-B immunization coverage of 1-year-olds', 'Measles': 'number of reported cases per 1000 population', 'BMI': 'Average body-mass index of entire population', 'under_five_deaths': 'number of deaths under the age of five per 1000 population', 'Polio': 'Polio immunization coverage among 1-year-olds', 'Total_expenditure': 'general government expenditure on health as a percent of total government expenditure', 'Diphtheria': 'diphtheria, tetanus and pertussis (DTP3) immunization coverage among 1-year-olds', 'HIV/AIDS': 'Deaths between ages 0-4 from HIV/AIDS per 1000 live births', 'GDP': 'gross domestic product per capita in USD', 'Population': 'population of a country', 'thinness  1-19 years': 'Prevalence of thinness among children and adolescents for ages 10-19(%)', 'thinness 5-9 years': 'Prevalence of thinness among children and adolescents for ages 5-9(%)', 'Income_composition_of_resources': 'HDI in terms of income composition of resources from 0 to 1', 'Schooling': 'Number of years of schooling'}

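The dataset above contains many factors that may influence life expectancy, including mortality rates, immunization coverage, health expenditure, GDP, and schooling, recorded per country and year. This gives us enough variables to explore which ones are associated with life expectancy.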

(1%) Write one or two sentences about how the data will be used to solve the problem. Earlier in the semester, we won’t have studied the Machine Learning methods just yet but you should have a general idea of what the ML will set out to do. For example:

  • “We’ll cluster the movies into sets of movies which are often watched by the same users. Doing so allows us to discover if there is a more natural grouping of movies rather than the traditional genres: horror, comedy, romantic-comedy, etc”.

First, we'll generate several plots of individual variables and combinations of variables against life expectancy to determine which ones are likely to predict it. Then, using these variables, we'll split the data into training and test sets and predict life expectancy from the k nearest neighbors of each observation.
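
The k-nearest-neighbors idea can be sketched in a few lines of plain Python: a prediction is the average life expectancy of the k training rows closest to the query. This is an illustrative sketch only. The function name `knn_predict`, the two features, and all values are hypothetical, and the features are assumed to be already scaled to a comparable range. A library implementation would likely be used in the actual project.

```python
import math


def knn_predict(train_X, train_y, query, k=3):
    """Predict a value as the mean target of the k nearest training rows."""
    # Euclidean distance from the query to every training row.
    distances = [
        (math.dist(row, query), target)
        for row, target in zip(train_X, train_y)
    ]
    # Keep the k closest rows and average their targets.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Hypothetical, already-scaled features: (schooling, income composition).
train_X = [(0.2, 0.3), (0.25, 0.35), (0.8, 0.9), (0.85, 0.8), (0.5, 0.5)]
# Hypothetical life expectancies for those rows.
train_y = [58.0, 60.0, 80.0, 82.0, 70.0]

prediction = knn_predict(train_X, train_y, query=(0.82, 0.85), k=2)
print(prediction)
```

The choice of k matters: a small k follows local patterns closely, while a large k smooths predictions toward the overall average, so k is usually chosen by comparing error on held-out test data.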
