Gender Pay Gap

Even today, a significant pay gap between genders persists. A recent study found that "Gen Z job seeking women expect to earn a salary that is on average $6,200 lower than what men anticipate making." Machine learning could shed some light on which other factors contribute to this gap.

https://www.nasdaq.com/articles/starting-salaries-for-gen-z-women-are-%246200-less-than-male-counterparts

https://www.pewresearch.org/fact-tank/2021/05/25/gender-pay-gap-facts/

Data Set:

Link: https://www.kaggle.com/datasets/muhammadtalharasool/simple-gender-classification/versions/1?resource=download

In [1]:

import pandas as pd

# Import the gender.csv data set
gender_data = pd.read_csv("gender.csv")

# Map each feature (column) name to a short description
features = {
    'Gender': 'Gender of the person',
    'Age': 'Age of the person',
    'Height (cm)': 'Height of the person in cm',
    'Weight (kg)': 'Weight of the person in kg',
    'Occupation': 'What job the person has',
    'Education Level': 'What type of education the person has',
    'Marital Status': 'Whether the person is married or not',
    'Income (USD)': 'Total income for that person',
    'Favorite Color': 'Favorite color of that person',
}

In [13]:

gender_data

Out[13]:

	Gender	Age	Height (cm)	Weight (kg)	Occupation	Education Level	Marital Status	Income (USD)	Favorite Color	Unnamed: 9
0	male	32	175	70	Software Engineer	Master's Degree	Married	75000	Blue	NaN
1	male	25	182	85	Sales Representative	Bachelor's Degree	Single	45000	Green	NaN
2	female	41	160	62	Doctor	Doctorate Degree	Married	120000	Purple	NaN
3	male	38	178	79	Lawyer	Bachelor's Degree	Single	90000	Red	NaN
4	female	29	165	58	Graphic Designer	Associate's Degree	Single	35000	Yellow	NaN
...	...	...	...	...	...	...	...	...	...	...
126	female	32	170	64	Nurse	Associate's Degree	Single	60000	Orange	NaN
127	male	38	176	79	Project Manager	Bachelor's Degree	Married	90000	Black	NaN
128	female	27	162	55	Graphic Designer	Associate's Degree	Single	55000	Green	NaN
129	male	33	175	77	Sales Representative	Bachelor's Degree	Married	80000	Yellow	NaN
130	female	29	164	57	Software Developer	Bachelor's Degree	Single	65000	Blue	NaN

131 rows × 10 columns

In [3]:

features

Out[3]:

{'Gender': 'Gender of the person',
 'Age': 'Age of the person',
 'Height (cm)': 'Height of the person in cm',
 'Weight (kg)': 'Weight of the person in kg',
 'Occupation': 'What job the person has',
 'Education Level': 'What type of education the person has',
 'Marital Status': 'Whether the person is married or not',
 'Income (USD)': 'Total income for that person',
 'Favorite Color': 'Favorite color of that person'}

In [17]:

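# Mean income per gender label. The income column is selected before aggregating
# so that non-numeric columns (Occupation, Favorite Color, ...) are never averaged.
# Note: the column names in this CSV carry a leading space.
mean_data = gender_data.groupby(' Gender')[' Income (USD)'].mean()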

In [18]:

mean_data

Out[18]:

 Gender
 female     63125.000000
 male      135925.925926
female      62820.512821
male       111585.365854
Name:  Income (USD), dtype: float64
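The output above lists four groups instead of two: the labels `' female'` and `' male'` differ from `'female'` and `'male'` only by a leading space, so `groupby` treats them as separate categories. One possible fix, sketched here on a small made-up frame rather than on `gender.csv` itself, is to strip whitespace from the column names and string cells before grouping. The helper name `strip_whitespace` and the toy values are assumptions for illustration only; the resulting means for the real data are not shown here.

```python
import pandas as pd
from pandas.api.types import is_string_dtype


def strip_whitespace(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy: no all-empty columns, no padded names or labels."""
    # Drop columns that contain only missing values (e.g. "Unnamed: 9").
    cleaned = df.dropna(axis="columns", how="all").copy()
    # Remove leading/trailing spaces from the column names.
    cleaned.columns = cleaned.columns.str.strip()
    # Remove leading/trailing spaces from every string-valued column.
    for col in cleaned.columns:
        if is_string_dtype(cleaned[col]):
            cleaned[col] = cleaned[col].str.strip()
    return cleaned


# Toy frame that mimics the whitespace problem; the values are made up.
toy = pd.DataFrame({
    " Gender": [" female", "female", " male", "male"],
    " Income (USD)": [50000, 70000, 60000, 80000],
    "Unnamed: 9": [float("nan")] * 4,
})

cleaned = strip_whitespace(toy)
# After cleaning, only two groups remain: "female" and "male".
mean_income = cleaned.groupby("Gender")["Income (USD)"].mean()
print(mean_income)
```

Applying the same helper to `gender_data` before the `groupby` call should merge the duplicated labels, assuming whitespace is the only difference between them.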

As a next step, I would split the data into training and test sets and determine which attributes can predict whether a person is male or female.
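A minimal sketch of that train/test workflow follows, assuming scikit-learn is available and that the column names have already been stripped of whitespace. The choice of logistic regression, the 25% test split, the helper name `evaluate_gender_classifier`, and the synthetic toy frame are all assumptions for illustration; the notebook itself does not fix a model, and the toy accuracy says nothing about the real data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["Age", "Income (USD)"]
CATEGORICAL = ["Occupation", "Education Level", "Marital Status"]


def evaluate_gender_classifier(df, test_size=0.25, seed=0):
    """Fit a classifier on a training split and return test-set accuracy."""
    X = df[NUMERIC + CATEGORICAL]
    y = df["Gender"]
    # Stratify so both splits keep the same male/female proportions.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    # Scale numeric columns and one-hot encode categorical ones.
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), NUMERIC),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ])
    model = Pipeline([
        ("prep", preprocess),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# Synthetic stand-in for the cleaned data set; every value here is made up.
toy = pd.DataFrame({
    "Gender": ["female", "male"] * 20,
    "Age": [25 + (i % 15) for i in range(40)],
    "Income (USD)": [40000 + 1500 * i for i in range(40)],
    "Occupation": ["Nurse", "Lawyer", "Doctor", "Teacher"] * 10,
    "Education Level": ["Bachelor's Degree", "Master's Degree"] * 20,
    "Marital Status": ["Single", "Single", "Married", "Married"] * 10,
})

accuracy = evaluate_gender_classifier(toy)
print(f"Test accuracy on the toy data: {accuracy:.2f}")
```

With only 131 rows in the real data set, a single split gives a noisy accuracy estimate, so cross-validation may be a better choice there.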

Using the "Simple Gender Classification" data set from Kaggle, machine learning could offer more insight into this ongoing problem. Given a person's age, occupation, income, education level, and marital status, can a model determine whether that person is male or female? How accurate is that prediction? Are some of these attributes correlated while others are not?
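One rough way to start on the correlation question is to encode gender as a 0/1 indicator and compute pairwise Pearson correlations with the numeric columns. The sketch below uses six of the rows displayed in the table above (with whitespace-free labels); the helper name `numeric_correlations` and the `is_male` indicator column are assumptions introduced for illustration. Categorical columns such as occupation would need their own encoding and are left out here.

```python
import pandas as pd


def numeric_correlations(df, target="Gender", positive_label="male"):
    """Correlation matrix of the numeric columns plus a 0/1 gender indicator."""
    encoded = df.select_dtypes(include="number").copy()
    # 1 where the target equals the positive label, 0 otherwise.
    encoded["is_" + positive_label] = (df[target] == positive_label).astype(int)
    return encoded.corr()


# Rows 0, 2, 3, 4, 128 and 129 from the table above (Age and Income only).
sample = pd.DataFrame({
    "Gender": ["male", "female", "male", "female", "female", "male"],
    "Age": [32, 41, 38, 29, 27, 33],
    "Income (USD)": [75000, 120000, 90000, 35000, 55000, 80000],
})

corr = numeric_correlations(sample)
print(corr)
```

Six rows are far too few to draw conclusions from; the same call on the full cleaned data set would be the meaningful version.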