Gender Pay Gap

Even today, there is still a significant pay gap between genders. A recent study found that "Gen Z job seeking women expect to earn a salary that is on average $6,200 lower than what men anticipate making." Machine learning could shed some light on what other factors may influence this gap.

https://www.nasdaq.com/articles/starting-salaries-for-gen-z-women-are-%246200-less-than-male-counterparts

https://www.pewresearch.org/fact-tank/2021/05/25/gender-pay-gap-facts/

Data Set:

Link: https://www.kaggle.com/datasets/muhammadtalharasool/simple-gender-classification/versions/1?resource=download

In [1]:
import pandas as pd

# Import the gender.csv data set
gender_data = pd.read_csv("gender.csv")

# Dictionary mapping each feature to a short description
features = {'Gender': 'Gender of the person',
           'Age': 'Age of the person',
           'Height (cm)': 'Height of the person in cm',
           'Weight (kg)': 'Weight of the person in kg',
           'Occupation': 'What job the person has',
           'Education Level': 'What type of education the person has',
           'Marital Status': 'Whether the person is married or not',
           'Income (USD)': 'Total income for that person',
           'Favorite Color': 'Favorite color of that person'}
In [13]:
gender_data
Out[13]:
     Gender  Age  Height (cm)  Weight (kg)  Occupation            Education Level     Marital Status  Income (USD)  Favorite Color  Unnamed: 9
0    male    32   175          70           Software Engineer     Master's Degree     Married         75000         Blue            NaN
1    male    25   182          85           Sales Representative  Bachelor's Degree   Single          45000         Green           NaN
2    female  41   160          62           Doctor                Doctorate Degree    Married         120000        Purple          NaN
3    male    38   178          79           Lawyer                Bachelor's Degree   Single          90000         Red             NaN
4    female  29   165          58           Graphic Designer      Associate's Degree  Single          35000         Yellow          NaN
...  ...     ...  ...          ...          ...                   ...                 ...             ...           ...             ...
126  female  32   170          64           Nurse                 Associate's Degree  Single          60000         Orange          NaN
127  male    38   176          79           Project Manager       Bachelor's Degree   Married         90000         Black           NaN
128  female  27   162          55           Graphic Designer      Associate's Degree  Single          55000         Green           NaN
129  male    33   175          77           Sales Representative  Bachelor's Degree   Married         80000         Yellow          NaN
130  female  29   164          57           Software Developer    Bachelor's Degree   Single          65000         Blue            NaN

131 rows × 10 columns

In [3]:
features
Out[3]:
{'Gender': 'Gender of the person',
 'Age': 'Age of the person',
 'Height (cm)': 'Height of the person in cm',
 'Weight (kg)': 'Weight of the person in kg',
 'Occupation': 'What job the person has',
 'Education Level': 'What type of education the person has',
 'Marital Status': 'Whether the person is married or not',
 'Income (USD)': 'Total income for that person',
 'Favorite Color': 'Favorite color of that person'}
In [17]:
# Select the income column before averaging so that non-numeric columns are never aggregated
mean_data = gender_data.groupby(' Gender')[' Income (USD)'].mean()
In [18]:
mean_data
Out[18]:
 Gender
 female     63125.000000
 male      135925.925926
female      62820.512821
male       111585.365854
Name:  Income (USD), dtype: float64
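
The output above lists each gender twice, which suggests that some labels carry a leading space (' female' vs. 'female') and are being treated as separate groups. The column names used in the groupby also start with a space. A minimal sketch of one way to normalize this before averaging is shown here. The rows in `raw` are made-up illustrative values, not rows from gender.csv, and the assumption that stray whitespace is the only cause has not been checked against the full file.

```python
import pandas as pd

# Hypothetical rows that mimic the stray whitespace seen in the output;
# these values are invented for illustration and are NOT from gender.csv.
raw = pd.DataFrame({
    " Gender": [" female", "female", " male", "male"],
    " Income (USD)": [60000, 50000, 90000, 70000],
})

# Strip whitespace from the column names, then from the gender labels.
clean = raw.rename(columns=str.strip)
clean["Gender"] = clean["Gender"].str.strip()

# With consistent labels, each gender forms a single group.
mean_income = clean.groupby("Gender")["Income (USD)"].mean()
print(mean_income)
```

Applied to the real data, the same two stripping steps would go right after `pd.read_csv("gender.csv")`, so that every later cell sees a single 'female' group and a single 'male' group.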

I would split the data into training and test sets and determine which factors can predict whether a person is male or female based on those attributes.
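
The split-and-predict idea could be sketched roughly as follows. This assumes scikit-learn is available and uses logistic regression purely as a placeholder model, since no specific model has been chosen here. The helper name `fit_gender_classifier` is hypothetical, and the demo rows are the ten rows visible in the DataFrame preview above, typed in by hand with clean column names. Ten rows are far too few for a meaningful accuracy figure, so they only serve to make the sketch runnable.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def fit_gender_classifier(df, numeric_cols, categorical_cols, target="Gender", seed=0):
    """Hypothetical helper: split df, fit a classifier, return (model, test accuracy)."""
    X = df[numeric_cols + categorical_cols]
    y = df[target]
    # Hold out 25% of the rows, keeping the male/female ratio similar in both parts.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    # Scale numeric columns and one-hot encode the text columns.
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])
    # Logistic regression is an assumed placeholder, not a chosen final model.
    model = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    return model, model.score(X_test, y_test)


# The ten rows shown in the preview above (rows 0-4 and 126-130).
demo = pd.DataFrame({
    "Gender": ["male", "male", "female", "male", "female",
               "female", "male", "female", "male", "female"],
    "Age": [32, 25, 41, 38, 29, 32, 38, 27, 33, 29],
    "Income (USD)": [75000, 45000, 120000, 90000, 35000,
                     60000, 90000, 55000, 80000, 65000],
    "Occupation": ["Software Engineer", "Sales Representative", "Doctor", "Lawyer",
                   "Graphic Designer", "Nurse", "Project Manager", "Graphic Designer",
                   "Sales Representative", "Software Developer"],
    "Education Level": ["Master's Degree", "Bachelor's Degree", "Doctorate Degree",
                        "Bachelor's Degree", "Associate's Degree", "Associate's Degree",
                        "Bachelor's Degree", "Associate's Degree", "Bachelor's Degree",
                        "Bachelor's Degree"],
    "Marital Status": ["Married", "Single", "Married", "Single", "Single",
                       "Single", "Married", "Single", "Married", "Single"],
})

model, accuracy = fit_gender_classifier(
    demo,
    numeric_cols=["Age", "Income (USD)"],
    categorical_cols=["Occupation", "Education Level", "Marital Status"],
)
print(f"Test accuracy on the held-out rows: {accuracy:.2f}")
```

On the full 131-row file the same helper could be called with the cleaned DataFrame in place of `demo`. Fitting the preprocessing inside the pipeline keeps the scaler and encoder from seeing the test rows.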

Using the "Simple Gender Classification" data from Kaggle, machine learning could offer more insight into this ongoing problem. Given a person's age, occupation, salary, education level, and marital status, can a model determine whether they are male or female? How accurate is that prediction? Are some of these attributes correlated while others are not?
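
One rough way to start on the correlation question is to one-hot encode the text columns and compute each column's Pearson correlation with a 0/1 "is male" indicator. The sketch below does this with pandas only. The helper name `correlation_with_gender` is hypothetical, and the demo rows are again the ten rows visible in the preview, so the resulting numbers say nothing reliable about the full data set.

```python
import pandas as pd


def correlation_with_gender(df, columns, target="Gender", positive="male"):
    """Hypothetical helper: Pearson correlation of each (encoded) column with a 0/1 gender flag."""
    # Numeric columns pass through; text columns become one 0/1 column per category.
    encoded = pd.get_dummies(df[columns], dtype=float)
    is_positive = (df[target] == positive).astype(float)
    return encoded.corrwith(is_positive).sort_values()


# Three columns from the ten rows shown in the preview (rows 0-4 and 126-130).
sample = pd.DataFrame({
    "Gender": ["male", "male", "female", "male", "female",
               "female", "male", "female", "male", "female"],
    "Age": [32, 25, 41, 38, 29, 32, 38, 27, 33, 29],
    "Income (USD)": [75000, 45000, 120000, 90000, 35000,
                     60000, 90000, 55000, 80000, 65000],
    "Marital Status": ["Married", "Single", "Married", "Single", "Single",
                       "Single", "Married", "Single", "Married", "Single"],
})

corr = correlation_with_gender(sample, ["Age", "Income (USD)", "Marital Status"])
print(corr)
```

Values near zero would suggest an attribute carries little linear signal about gender, while values near +1 or -1 would suggest a strong one. Correlation only captures linear relationships, so it complements the classifier's accuracy rather than replacing it.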