College Dropout Prediction

Motivation

Background

According to research.com, US college students experience a 40% dropout rate per year, and only 41% of college students graduate after 4 years without delay. As a result, the US ranks 19th out of 28 countries in graduation rates according to the Organization for Economic Co-operation and Development (OECD). While a college education is not necessary to succeed in modern American life, there are direct links between a bachelor's degree and higher average salary, job level, and financial success.

Problem

Find a way to predict possible college dropouts based on socioeconomic factors as well as other influences.

Solution

We will use this dataset to identify possible relationships between the dropout rate of US college students and socioeconomic and other outside factors. These factors include, but are not limited to, the marital status of the students or their parents, whether the student is a scholarship holder, and the age of the student at enrollment.

Impact

If successful, we may be able to use these indicators to help identify struggling college students and better prepare them for the change they are about to experience before they enter the higher education system.

Dataset

We will be using this Kaggle dataset. Attributes included in the dataset are:

  • Marital status of the student
  • Application mode
  • Application order
  • Course
  • Daytime/evening attendance
  • Previous qualification
  • Nationality
  • Mother's qualification
  • Father's qualification
  • Mother's occupation
  • Father's occupation
  • Displaced
  • Educational special needs
  • Debtor
  • Tuition fees up to date
  • Gender
  • Scholarship holder
  • Age at enrollment
  • International
  • Curricular units 1st sem (credited)
  • Curricular units 1st sem (enrolled)
  • Curricular units 1st sem (evaluations)
  • Curricular units 1st sem (approved)
  • Curricular units 1st sem (grade)
  • Curricular units 1st sem (without evaluations)
  • Curricular units 2nd sem (credited)
  • Curricular units 2nd sem (enrolled)
  • Curricular units 2nd sem (evaluations)
  • Curricular units 2nd sem (approved)
  • Curricular units 2nd sem (grade)
  • Curricular units 2nd sem (without evaluations)
  • Unemployment rate
  • Inflation rate
  • GDP
In [1]:
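import pandas as pd
# Load the Kaggle CSV (expected in the working directory as 'dataset.csv')
df_dropout_pred = pd.read_csv('dataset.csv')
# A bare expression on the last line of a cell displays the frame
df_dropout_pred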
Out[1]:
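| | Marital status | Application mode | Application order | Course | Daytime/evening attendance | Previous qualification | Nacionality | Mother's qualification | Father's qualification | Mother's occupation | ... | Curricular units 2nd sem (credited) | Curricular units 2nd sem (enrolled) | Curricular units 2nd sem (evaluations) | Curricular units 2nd sem (approved) | Curricular units 2nd sem (grade) | Curricular units 2nd sem (without evaluations) | Unemployment rate | Inflation rate | GDP | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 8 | 5 | 2 | 1 | 1 | 1 | 13 | 10 | 6 | ... | 0 | 0 | 0 | 0 | 0.000000 | 0 | 10.8 | 1.4 | 1.74 | Dropout |
| 1 | 1 | 6 | 1 | 11 | 1 | 1 | 1 | 1 | 3 | 4 | ... | 0 | 6 | 6 | 6 | 13.666667 | 0 | 13.9 | -0.3 | 0.79 | Graduate |
| 2 | 1 | 1 | 5 | 5 | 1 | 1 | 1 | 22 | 27 | 10 | ... | 0 | 6 | 0 | 0 | 0.000000 | 0 | 10.8 | 1.4 | 1.74 | Dropout |
| 3 | 1 | 8 | 2 | 15 | 1 | 1 | 1 | 23 | 27 | 6 | ... | 0 | 6 | 10 | 5 | 12.400000 | 0 | 9.4 | -0.8 | -3.12 | Graduate |
| 4 | 2 | 12 | 1 | 3 | 0 | 1 | 1 | 22 | 28 | 10 | ... | 0 | 6 | 6 | 6 | 13.000000 | 0 | 13.9 | -0.3 | 0.79 | Graduate |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4419 | 1 | 1 | 6 | 15 | 1 | 1 | 1 | 1 | 1 | 6 | ... | 0 | 6 | 8 | 5 | 12.666667 | 0 | 15.5 | 2.8 | -4.06 | Graduate |
| 4420 | 1 | 1 | 2 | 15 | 1 | 1 | 19 | 1 | 1 | 10 | ... | 0 | 6 | 6 | 2 | 11.000000 | 0 | 11.1 | 0.6 | 2.02 | Dropout |
| 4421 | 1 | 1 | 1 | 12 | 1 | 1 | 1 | 22 | 27 | 10 | ... | 0 | 8 | 9 | 1 | 13.500000 | 0 | 13.9 | -0.3 | 0.79 | Dropout |
| 4422 | 1 | 1 | 1 | 9 | 1 | 1 | 1 | 22 | 27 | 8 | ... | 0 | 5 | 6 | 5 | 12.000000 | 0 | 9.4 | -0.8 | -3.12 | Graduate |
| 4423 | 1 | 5 | 1 | 15 | 1 | 1 | 9 | 23 | 27 | 6 | ... | 0 | 6 | 6 | 6 | 13.000000 | 0 | 12.7 | 3.7 | -1.70 | Graduate |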

4424 rows × 35 columns
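
A natural first check after loading is how the outcome classes are distributed and whether any values are missing. The sketch below is only illustrative: `sample` is a toy frame holding five rows and four columns copied from the Out[1] display, not the full dataset, and the variable names are our own. With the real file, the same calls would be made on `df_dropout_pred`.

```python
import pandas as pd

# Toy stand-in: five rows and four columns copied from the Out[1] display.
# With the real file, use the df_dropout_pred frame loaded in In [1] instead.
sample = pd.DataFrame(
    {
        "Marital status": [1, 1, 1, 1, 2],
        "Course": [2, 11, 5, 15, 3],
        "Curricular units 2nd sem (grade)": [0.0, 13.666667, 0.0, 12.4, 13.0],
        "Target": ["Dropout", "Graduate", "Dropout", "Graduate", "Graduate"],
    }
)

# Count how many students fall into each outcome class
target_counts = sample["Target"].value_counts()

# Integer dtypes can hide categorical codes, so the dtypes alone are not
# enough to tell which columns are truly numeric
dtypes = sample.dtypes

# Total number of missing cells across the whole frame
n_missing = int(sample.isna().sum().sum())

print(target_counts)
print(dtypes)
print("missing cells:", n_missing)
```

A heavily skewed class balance would affect how any later model should be evaluated, so it is worth knowing before choosing a method.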

Potential Problems

This dataset relies heavily on categorical data and, as such, may need to be paired with another dataset or examined more closely to see how much use can be made of the categorical data provided. It also already includes its own conclusion, in the Target column, on whether each student drops out, based on the data it has gathered. However, the concept and data provided by the dataset still prove to be intriguing and worth exploring.
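
One common way to make integer-coded categorical columns usable by a model is one-hot encoding. The following is a minimal sketch under stated assumptions: `sample` is a toy frame of five rows copied from the Out[1] display, and treating `Marital status` and `Application mode` as category codes (rather than quantities) is our assumption about the dataset. The names `categorical_cols`, `encoded`, and `y` are illustrative.

```python
import pandas as pd

# Toy stand-in: five rows copied from the Out[1] display (not the full dataset)
sample = pd.DataFrame(
    {
        "Marital status": [1, 1, 1, 1, 2],
        "Application mode": [8, 6, 1, 8, 12],
        "Unemployment rate": [10.8, 13.9, 10.8, 9.4, 13.9],
        "Target": ["Dropout", "Graduate", "Dropout", "Graduate", "Graduate"],
    }
)

# Assumption: these integer columns are category codes, not quantities
categorical_cols = ["Marital status", "Application mode"]

# Separate the features from the label, then expand each categorical
# column into one indicator column per observed code
features = sample.drop(columns="Target")
encoded = pd.get_dummies(features, columns=categorical_cols)

# Binary label: 1 for "Dropout", 0 for anything else
y = (sample["Target"] == "Dropout").astype(int)

print(encoded.columns.tolist())
```

Numeric columns such as `Unemployment rate` pass through unchanged, while each categorical column is replaced by indicator columns named after the original column and code.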

Method

If the categorical data is chosen as the basis for analyzing and predicting possible US college dropouts, we propose using k-means clustering as well as a multiple regression analysis to observe the correlation between the given variables and the final result of whether the student is predicted to be a dropout or not. Note that k-means is an unsupervised clustering method rather than a classifier, so the clusters it finds would need to be compared against the Target labels.
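
The proposed method could be sketched roughly as follows, using scikit-learn. Several things here are assumptions rather than part of the proposal: the data is synthetic (generated with a fixed seed, not read from the Kaggle file), only two features are used, and logistic regression is swapped in as the regression step because the outcome is binary. The names `toy`, `is_dropout`, `clusters`, and `coefficients` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the Kaggle file): two numeric features named
# after dataset columns, and a made-up binary label where 1 means "Dropout".
rng = np.random.default_rng(0)
n = 200
is_dropout = rng.integers(0, 2, size=n)
toy = pd.DataFrame(
    {
        "Age at enrollment": rng.normal(20, 2, size=n) + 4 * is_dropout,
        "Curricular units 2nd sem (grade)": rng.normal(13, 1, size=n) - 5 * is_dropout,
    }
)

# Scale features so neither dominates the distance computation in k-means
X = StandardScaler().fit_transform(toy)

# Unsupervised step: group students into two clusters without using labels
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
clusters = kmeans.fit_predict(X)

# Compare clusters with the known labels to see whether they line up
crosstab = pd.crosstab(clusters, is_dropout, rownames=["cluster"], colnames=["dropout"])

# Supervised step: regress the binary outcome on all features at once
# (logistic regression is our substitution for "multiple regression")
model = LogisticRegression()
model.fit(X, is_dropout)
coefficients = pd.Series(model.coef_[0], index=toy.columns)

# Accuracy on the training data itself; a held-out split would be needed
# for an honest estimate on the real dataset
accuracy = model.score(X, is_dropout)

print(crosstab)
print(coefficients)
```

The sign of each coefficient indicates the direction of the association with dropout in this toy data. On the real dataset, the categorical columns would first need encoding, and the model should be scored on data it was not fitted on.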