Despite significant advances in racial equality over the past decades, differential treatment by race continues to pervade the U.S. labor market, especially during the hiring process.
This dataset offers an opportunity to observe the impact of race in the labor market: researchers sent thousands of fictitious resumes in response to help-wanted advertisements in Boston and Chicago. Each fictitious applicant was characterized by numerous factors, covering education level, skills, and experience as well as demographic data like race, age, and gender. The goal of this project is to identify whether a relationship exists between race and hirability once the qualifications of the individual are taken into account.
This work may hold wide implications for the hiring process across all industries, and may point to a need for reform. I aim to create a classifier that predicts how likely an individual is to get hired based on their associated characteristics. Such a predictor may expose inconsistencies in the callback process and help estimate how often a qualified candidate is passed over because of race.
We will use the OpenIntro dataset of fictitious job applications to observe the following features for each individual:
(** note: while the data includes many features, those listed above will act as the focus of our study.)
import pandas as pd
# load the resume data (pandas can also read zipped csv files directly)
df_labor = pd.read_csv('labor_market_discrimination.csv')
df_labor.head()
| | education | n_jobs | years_exp | honors | volunteer | military | emp_holes | occup_specific | occup_broad | work_in_school | ... | comp_req | org_req | manuf | trans_com | bank_real | trade | bus_service | oth_service | miss_ind | ownership |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4 | 2 | 6 | 0 | 0 | 0 | 1 | 17 | 1 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | NaN |
1 | 3 | 3 | 6 | 0 | 1 | 1 | 0 | 316 | 6 | 1 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | NaN |
2 | 4 | 1 | 6 | 0 | 0 | 0 | 0 | 19 | 1 | 1 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | NaN |
3 | 3 | 4 | 6 | 0 | 1 | 0 | 1 | 313 | 5 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | NaN |
4 | 3 | 3 | 22 | 0 | 0 | 0 | 0 | 313 | 5 | 1 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | Nonprofit |
5 rows × 63 columns
To assess this problem, I will fit a logistic regression, a common model for binary classification problems. It will be used to predict whether a person with a given set of characteristics is likely to receive a callback. Additionally, I will cluster resumes based on their characteristics to identify patterns or similarities among resumes that received callbacks.
** note: I have not yet conducted a logistic regression in Python. As an alternative, I may use linear regression to predict the number of callbacks an individual might receive based on their characteristics, and then compare the predictions by race.
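As a rough illustration of the two planned steps, the sketch below fits a logistic regression and then clusters the same rows with k-means using scikit-learn. It runs on synthetic stand-in data, not the real CSV: the columns `race_black` and `callback`, the simulated effect sizes, and the choice of three clusters are all assumptions made for the example and may not match the actual dataset.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the resume data.
# 'race_black' and 'callback' are hypothetical column names.
rng = np.random.default_rng(0)
n = 2000
df_demo = pd.DataFrame({
    "years_exp": rng.integers(1, 20, size=n),
    "honors": rng.integers(0, 2, size=n),
    "race_black": rng.integers(0, 2, size=n),
})

# Simulate callbacks with an arbitrary, made-up race penalty
# so that the model has an effect to recover.
logit = (-2.5 + 0.05 * df_demo["years_exp"]
         + 0.5 * df_demo["honors"] - 0.4 * df_demo["race_black"])
prob = 1 / (1 + np.exp(-logit.to_numpy()))
df_demo["callback"] = rng.binomial(1, prob)

features = ["years_exp", "honors", "race_black"]
X = df_demo[features]
y = df_demo["callback"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Logistic regression: predicts the probability of a callback.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
coefs = pd.Series(model.coef_[0], index=features)
pred_prob = model.predict_proba(X_test)[:, 1]
print(coefs)
print("test accuracy:", model.score(X_test, y_test))

# K-means on standardized features, then callback rate per cluster.
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
df_demo["cluster"] = kmeans.fit_predict(X_scaled)
cluster_rates = df_demo.groupby("cluster")["callback"].mean()
print(cluster_rates)
```

On the real data, the sign and size of the coefficient on the race indicator would be the quantity of interest, holding the qualification features fixed. Because callbacks are likely a small share of all resumes, accuracy alone can be misleading, so the predicted probabilities are probably more informative than the hard class labels.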
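The linear regression fallback could look roughly like the sketch below. It assumes each resume has a single 0/1 callback outcome, in which case ordinary least squares acts as a linear probability model and the fitted values can be averaged by race. The data are synthetic, and the variable names `race_black` and `callback` and the simulated effect sizes are hypothetical, chosen only for illustration.

```python
import numpy as np

# Synthetic stand-in data; names and effect sizes are made up.
rng = np.random.default_rng(1)
n = 2000
years_exp = rng.integers(1, 20, size=n)
race_black = rng.integers(0, 2, size=n)
p = np.clip(0.08 + 0.004 * years_exp - 0.03 * race_black, 0, 1)
callback = rng.binomial(1, p).astype(float)

# Design matrix: intercept, experience, race indicator.
X = np.column_stack([np.ones(n), years_exp, race_black])

# Ordinary least squares fit of callback on the features.
beta, *_ = np.linalg.lstsq(X, callback, rcond=None)
fitted = X @ beta

# Compare the average predicted callback rate by race.
gap = fitted[race_black == 1].mean() - fitted[race_black == 0].mean()
print("coefficients (intercept, years_exp, race_black):", beta)
print("difference in mean predicted callback rate:", gap)
```

One known limitation of this approach is that fitted values from a linear model are not constrained to lie between 0 and 1, which is part of why logistic regression is usually preferred for a binary outcome.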