A significant pay gap between genders persists today. A recent study found that "Gen Z job seeking women expect to earn a salary that is on average $6,200 lower than what men anticipate making." Machine learning could shed some light on which other factors are associated with this gap.
import pandas as pd
# Import the gender.csv data set
gender_data = pd.read_csv("gender.csv")
# Creates a dictionary with the features and their descriptions
features = {'Gender': 'Gender of the person',
            'Age': 'Age of the person',
            'Height (cm)': 'Height of the person in cm',
            'Weight (kg)': 'Weight of the person in kg',
            'Occupation': 'What job the person has',
            'Education Level': 'What type of education the person has',
            'Marital Status': 'Whether the person is married or not',
            'Income (USD)': 'Total income for that person',
            'Favorite Color': 'Favorite color of that person'}
gender_data
| Gender | Age | Height (cm) | Weight (kg) | Occupation | Education Level | Marital Status | Income (USD) | Favorite Color | Unnamed: 9 |
---|---|---|---|---|---|---|---|---|---|---|
0 | male | 32 | 175 | 70 | Software Engineer | Master's Degree | Married | 75000 | Blue | NaN |
1 | male | 25 | 182 | 85 | Sales Representative | Bachelor's Degree | Single | 45000 | Green | NaN |
2 | female | 41 | 160 | 62 | Doctor | Doctorate Degree | Married | 120000 | Purple | NaN |
3 | male | 38 | 178 | 79 | Lawyer | Bachelor's Degree | Single | 90000 | Red | NaN |
4 | female | 29 | 165 | 58 | Graphic Designer | Associate's Degree | Single | 35000 | Yellow | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
126 | female | 32 | 170 | 64 | Nurse | Associate's Degree | Single | 60000 | Orange | NaN |
127 | male | 38 | 176 | 79 | Project Manager | Bachelor's Degree | Married | 90000 | Black | NaN |
128 | female | 27 | 162 | 55 | Graphic Designer | Associate's Degree | Single | 55000 | Green | NaN |
129 | male | 33 | 175 | 77 | Sales Representative | Bachelor's Degree | Married | 80000 | Yellow | NaN |
130 | female | 29 | 164 | 57 | Software Developer | Bachelor's Degree | Single | 65000 | Blue | NaN |
131 rows × 10 columns
features
{'Gender': 'Gender of the person', 'Age': 'Age of the person', 'Height (cm)': 'Height of the person in cm', 'Weight (kg)': 'Weight of the person in kg', 'Occupation': 'What job the person has', 'Education Level': 'What type of education the person has', 'Marital Status': 'Whether the person is married or not', 'Income (USD)': 'Total income for that person', 'Favorite Color': 'Favorite color of that person'}
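# Mean income per gender label. The column names in gender.csv carry a leading space.
# Select the income column before averaging so that text columns are never averaged.
mean_data = gender_data.groupby(' Gender')[' Income (USD)'].mean()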
mean_data
Gender
female     63125.000000
male      135925.925926
female     62820.512821
male      111585.365854
Name: Income (USD), dtype: float64
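The output above lists "female" and "male" twice each, which usually means the same label is stored with inconsistent surrounding whitespace (for example " male" and "male"), so pandas treats them as separate groups. Below is a minimal sketch of one way to normalize the labels before grouping. The helper `strip_whitespace` and the toy rows are hypothetical stand-ins, not values from gender.csv, and I am assuming whitespace is the only difference between the duplicated labels.

```python
import pandas as pd


def strip_whitespace(df, text_columns):
    """Return a copy with whitespace stripped from column names and from the given text columns."""
    out = df.copy()
    # " Gender" -> "Gender", " Income (USD)" -> "Income (USD)"
    out.columns = out.columns.str.strip()
    for col in text_columns:
        # " male" and "male" collapse into a single label
        out[col] = out[col].str.strip()
    return out


# Hypothetical toy rows that reproduce the symptom (NOT the real data)
toy = pd.DataFrame({
    " Gender": [" male", "male", " female", "female"],
    " Income (USD)": [100, 200, 300, 500],
})

cleaned = strip_whitespace(toy, text_columns=["Gender"])
mean_income = cleaned.groupby("Gender")["Income (USD)"].mean()
print(mean_income)
```

With the real data, the same call would be `strip_whitespace(gender_data, text_columns=["Gender"])`, after which the group-by should return one row per gender.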
I would split the data into training and test sets and determine which attributes can predict whether a person is male or female.
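Here is a rough sketch of what that split-and-predict step could look like with scikit-learn. The choice of a random forest, the 25% test size, and the helper name `fit_gender_classifier` are my assumptions for illustration. The data frame is synthetic with random labels, so the accuracy it prints should hover around chance and says nothing about the real data set.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def fit_gender_classifier(df, target="Gender", test_size=0.25, seed=0):
    """Split df into train/test sets, fit a classifier, and return (model, test accuracy)."""
    X = df.drop(columns=[target])
    y = df[target]
    # Text columns (occupation, education, ...) need one-hot encoding
    categorical = [c for c in X.columns if not pd.api.types.is_numeric_dtype(X[c])]
    preprocess = ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
        remainder="passthrough",  # numeric columns pass through unchanged
    )
    model = Pipeline([
        ("preprocess", preprocess),
        ("classifier", RandomForestClassifier(n_estimators=100, random_state=seed)),
    ])
    # Stratify so both genders appear in the same proportion in each split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    return model, accuracy


# Synthetic stand-in for the cleaned data (random values, NOT the real rows)
rng = np.random.default_rng(0)
n = 120
synthetic = pd.DataFrame({
    "Gender": rng.choice(["female", "male"], size=n),
    "Age": rng.integers(22, 60, size=n),
    "Occupation": rng.choice(["Nurse", "Lawyer", "Teacher"], size=n),
    "Education Level": rng.choice(["Bachelor's Degree", "Master's Degree"], size=n),
    "Marital Status": rng.choice(["Single", "Married"], size=n),
    "Income (USD)": rng.integers(30_000, 150_000, size=n),
})

model, accuracy = fit_gender_classifier(synthetic)
print(f"Held-out accuracy: {accuracy:.2f}")
```

With only 131 rows in the real file, a single split gives a noisy estimate, so cross-validation might be a better fit than one train/test split.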
Using the "Simple Gender Classification" data set from Kaggle, ML could offer more insight into this ongoing problem. Given a person's age, occupation, salary, education level, and marital status, can a model determine whether that person is male or female? How accurate is the prediction? Are some of these attributes correlated while others are not?
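One way to start on the correlation question is to one-hot encode the text columns and look at the Pearson correlation of each attribute with a 0/1 gender indicator. This is only one reading of the question. The helper `attribute_correlations` and the toy rows are hypothetical, and correlation on such a small sample should be treated as exploratory.

```python
import pandas as pd


def attribute_correlations(df, target="Gender", positive_label="male"):
    """Pearson correlation of every (one-hot encoded) attribute with a 0/1 gender indicator."""
    indicator = "is_" + positive_label
    # Text columns become 0/1 dummy columns; numeric columns are kept as they are
    encoded = pd.get_dummies(df.drop(columns=[target]), dtype=float)
    encoded[indicator] = (df[target] == positive_label).astype(float)
    return encoded.corr()[indicator].drop(indicator).sort_values()


# Hypothetical toy rows (NOT taken from gender.csv)
toy = pd.DataFrame({
    "Gender": ["male", "female", "male", "female", "male", "female"],
    "Age": [30, 30, 40, 40, 50, 50],
    "Income (USD)": [80_000, 60_000, 90_000, 70_000, 100_000, 80_000],
    "Marital Status": ["Single", "Single", "Married", "Married", "Married", "Single"],
})

correlations = attribute_correlations(toy)
print(correlations)
```

In this toy example age is balanced across genders, so its correlation with the indicator is zero, while income is positively correlated with the "male" indicator by construction.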