import pandas as pd
I am interested in examining the kinds of compensation and benefits that employees in the US receive and which benefits are most common in which sectors, industries, occupations.
The US has a welfare system which can be classified as "liberal". In the book "The Three Worlds of Welfare Capitalism" author Gosta Esping-Andersen organized welfare systems of western countries into three main categories: liberal, conservative, and social democratic. The liberal welfare regime is the least generous, typically limited to means-tested assistance targeted primarily towards low-income. Compared to countries with a universalistic system, access to things like healthcare, childcare, and retirement savings is employer dependent. Therefore, examining the kinds of benefits people have can reveal what kinds of services people are able to access and who is able to access them.
This data comes from The National Compensation Survey run by the Bureau of Labor Statistics.
# import csv and txt files into dataframes
df_employee_benefits = pd.read_csv('employee-benefits-in-the-united-states-dataset.csv')
df_all_data = pd.read_csv ('nb.data.1.AllData.txt', sep='\t')
df_by_series_id = df_data = pd.read_csv ('nb.series.txt', sep='\t')
/var/folders/g6/xbp7f9210h5gpwlxw5mctj940000gn/T/ipykernel_14473/3926048682.py:3: DtypeWarning: Columns (11,22) have mixed types. Specify dtype option on import or set low_memory=False. df_employee_benefits = pd.read_csv('employee-benefits-in-the-united-states-dataset.csv') /var/folders/g6/xbp7f9210h5gpwlxw5mctj940000gn/T/ipykernel_14473/3926048682.py:5: DtypeWarning: Columns (4) have mixed types. Specify dtype option on import or set low_memory=False. df_all_data = pd.read_csv ('nb.data.1.AllData.txt', sep='\t')
# contains 615,797 data points
df_employee_benefits
Series ID | Series title | Year | Estimate | Estimate footnote | Standard error | Standard error footnote | Survey code | Survey text | Ownership code | ... | Occupation text | Subcell code | Job characteristic category | Job characteristic | Estimate code | Estimate text | Datatype code | Datatype text | Provision code | Provision text | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NBU10000000000000028016 | Percent of civilian workers with access to def... | 2010.0 | 1 | NaN | 0.1 | NaN | IP | Incidence and Key Provision Benefits | 1.0 | ... | All occupations | 0.0 | All workers | All workers | 0.0 | Benefit combinations | 28.0 | Percent of workers with access | 16 | Defined benefit retirement and no medical care... |
1 | NBU10000000000000028016 | Percent of civilian workers with access to def... | 2011.0 | 1 | NaN | 0.1 | NaN | IP | Incidence and Key Provision Benefits | 1.0 | ... | All occupations | 0.0 | All workers | All workers | 0.0 | Benefit combinations | 28.0 | Percent of workers with access | 16 | Defined benefit retirement and no medical care... |
2 | NBU10000000000000028016 | Percent of civilian workers with access to def... | 2012.0 | 1 | NaN | 0.1 | NaN | IP | Incidence and Key Provision Benefits | 1.0 | ... | All occupations | 0.0 | All workers | All workers | 0.0 | Benefit combinations | 28.0 | Percent of workers with access | 16 | Defined benefit retirement and no medical care... |
3 | NBU10000000000000028016 | Percent of civilian workers with access to def... | 2013.0 | 1 | NaN | 0.1 | NaN | IP | Incidence and Key Provision Benefits | 1.0 | ... | All occupations | 0.0 | All workers | All workers | 0.0 | Benefit combinations | 28.0 | Percent of workers with access | 16 | Defined benefit retirement and no medical care... |
4 | NBU10000000000000028016 | Percent of civilian workers with access to def... | 2014.0 | 1 | NaN | 0.1 | NaN | IP | Incidence and Key Provision Benefits | 1.0 | ... | All occupations | 0.0 | All workers | All workers | 0.0 | Benefit combinations | 28.0 | Percent of workers with access | 16 | Defined benefit retirement and no medical care... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
615792 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
615793 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
615794 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
615795 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
615796 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
615797 rows × 24 columns
df_all_data
series_id | year | period | value | footnote_codes | |
---|---|---|---|---|---|
0 | NBU10000000000000028007 | 2010 | A01 | 63 | NaN |
1 | NBU10000000000000028007 | 2011 | A01 | 62 | NaN |
2 | NBU10000000000000028007 | 2012 | A01 | 62 | NaN |
3 | NBU10000000000000028007 | 2013 | A01 | 62 | NaN |
4 | NBU10000000000000028007 | 2014 | A01 | 63 | NaN |
... | ... | ... | ... | ... | ... |
615692 | NBU59700000000000033967 | 2021 | A01 | 4 | NaN |
615693 | NBU59700000000000033967 | 2022 | A01 | 4 | NaN |
615694 | NBU59800000000000033968 | 2020 | A01 | 6 | NaN |
615695 | NBU59800000000000033968 | 2021 | A01 | 6 | NaN |
615696 | NBU59800000000000033968 | 2022 | A01 | 6 | NaN |
615697 rows × 5 columns
# each row represents a unique instance of a combination of variables
# each row is represented by a unique series id (95,118 series ids total)
df_by_series_id
series_id | seasonal | ownership_code | estimate_code | industry_code | occupation_code | subcell_code | datatype_code | provision_code | survey_code | series_title | footnote_codes | begin_year | begin_period | end_year | end_period | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NBU10000000000000028007 | U | 1 | 0 | 000000 | 0 | 0 | 28 | 007 | IP | Percent of civilian workers with access to med... | NaN | 2010 | A01 | 2022 | A01 |
1 | NBU10000000000000028008 | U | 1 | 0 | 000000 | 0 | 0 | 28 | 008 | IP | Percent of civilian workers with access to med... | NaN | 2010 | A01 | 2022 | A01 |
2 | NBU10000000000000028009 | U | 1 | 0 | 000000 | 0 | 0 | 28 | 009 | IP | Percent of civilian workers with access to ret... | NaN | 2010 | A01 | 2022 | A01 |
3 | NBU10000000000000028010 | U | 1 | 0 | 000000 | 0 | 0 | 28 | 010 | IP | Percent of civilian workers with access to no ... | NaN | 2010 | A01 | 2022 | A01 |
4 | NBU10000000000000028011 | U | 1 | 0 | 000000 | 0 | 0 | 28 | 011 | IP | Percent of civilian workers with access to med... | NaN | 2010 | A01 | 2022 | A01 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95113 | NBU59400000000000026172 | U | 5 | 94 | 000000 | 0 | 0 | 26 | 172 | IP | Percent of local government workers participat... | NaN | 2010 | A01 | 2022 | A01 |
95114 | NBU59400000000000032173 | U | 5 | 94 | 000000 | 0 | 0 | 32 | 173 | IP | Take-up rate for local government workers with... | NaN | 2010 | A01 | 2022 | A01 |
95115 | NBU59500000000000028365 | U | 5 | 95 | 000000 | 0 | 0 | 28 | 365 | IP | Percent of local government workers with acces... | NaN | 2017 | A01 | 2018 | A01 |
95116 | NBU59700000000000033967 | U | 5 | 97 | 000000 | 0 | 0 | 33 | 967 | IP | Percent of local government workers with acces... | NaN | 2020 | A01 | 2022 | A01 |
95117 | NBU59800000000000033968 | U | 5 | 98 | 000000 | 0 | 0 | 33 | 968 | IP | Percent of local government workers with acces... | NaN | 2020 | A01 | 2022 | A01 |
95118 rows × 16 columns
Because each row in df_employee_benefits
represents multiple categorical variables, to make comparisons I would have to work backwards and group each row so that it is only assoociated with one categorical variable. From there I could compare across groups (for example: Civilian workers vs Private industry workers).
For example, row 1 represents the percent of civilian workers with access to defined benefit retirement and no medical care benefits. Based on the data dictionary, this row falls into the following groups: 'civilian workers', 'access to defined benefit retirement', and 'no medical care benefits'.
After the initial comparison of benefits across industries and occupatitons is done, it would also be interesting to use machine learning to train a k-nearest neighbor classificier to guess the sector, industry, or occupation of someone based on what bennefits they have.