Describes and motivates a real-world problem where data science may provide helpful insights. Your description should be easily understood by a casual reader and include citations to motivating sources or relevant information (e.g. news articles, further reading links … Wikipedia makes for a poor reference but the links it cites are usually promising).
Problem Definition:
High school education, family status, and living area can all affect whether a student has the access, resources, or desire to get a college education. I have chosen this dataset because it offers many different angles for analysis. The quality of the school a student attends could affect their grades and, as a result, whether they are able to get into a college. School accreditation is based on multiple factors; however, schools with higher accreditation generally have better resources for students to learn [1]. Although college is not necessary to become successful, it should still be an option for anyone who wants to go, and being prepared with the right education can strengthen that desire. College has become increasingly expensive and unaffordable for many families [2]. I would like to know whether a student's level of interest in going to college is influenced by their parents' salary and whether their parents attended college, or whether it is linked to their grades. If grades matter, the quality of the school and the area could also affect a student's interest indirectly, through access to better education and facilities. Overall, college can provide a wider range of career opportunities, networking, and earning potential, and should be an option for students [3]. This dataset will help identify the factors that may lead to a student's desire to attend college.
[2] https://www.cnbc.com/2021/03/14/fewer-kids-going-to-college-because-of-cost.html
[3] https://cew.georgetown.edu/cew-reports/valueofcollegemajors/
Explicitly load and show your dataset. Provide a data dictionary which explains the meaning of each feature present. Demonstrate that this data is sufficient to make progress on your real-world problem described above.
import pandas as pd
data = pd.read_csv('data.csv')
data.head()
| | type_school | school_accreditation | gender | interest | residence | parent_age | parent_salary | house_area | average_grades | parent_was_in_college | will_go_to_college |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Academic | A | Male | Less Interested | Urban | 56 | 6950000 | 83.0 | 84.09 | False | True |
1 | Academic | A | Male | Less Interested | Urban | 57 | 4410000 | 76.8 | 86.91 | False | True |
2 | Academic | B | Female | Very Interested | Urban | 50 | 6500000 | 80.6 | 87.43 | False | True |
3 | Vocational | B | Male | Very Interested | Rural | 49 | 6600000 | 78.2 | 82.12 | True | True |
4 | Academic | A | Female | Very Interested | Urban | 57 | 5250000 | 75.1 | 86.79 | False | False |
Data Measure | Meaning |
---|---|
type_school | The type of school student attends (academic or vocational) |
school_accreditation | Quality of school. A is better than B |
gender | Gender of student |
interest | Student's level of interest in attending college |
residence | Type of area the student lives in (urban or rural) |
parent_age | Age of parent |
parent_salary | Parent salary per month |
house_area | Area of the parent's house in square meters |
average_grades | Average grade of student on a scale 0 - 100 |
parent_was_in_college | Whether the parent ever attended college (True or False) |
will_go_to_college | Whether the student will go to college (True or False) |
This data is sufficient to make progress on the problem because it records each student's level of interest in college alongside information about their family, residence, and school. Comparing these measures shows whether they are correlated with interest.
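The comparison described above can be sketched as a grouped summary. This is a minimal sketch, not a full analysis: the `sample` frame below copies only the five rows shown in the `data.head()` output so the block runs without `data.csv`, and the function name `summarize_by_interest` and the output column names are hypothetical, chosen here for illustration.

```python
import pandas as pd

# Five rows copied from the data.head() output above (a subset of the columns).
sample = pd.DataFrame({
    "interest": ["Less Interested", "Less Interested", "Very Interested",
                 "Very Interested", "Very Interested"],
    "parent_salary": [6950000, 4410000, 6500000, 6600000, 5250000],
    "average_grades": [84.09, 86.91, 87.43, 82.12, 86.79],
    "parent_was_in_college": [False, False, False, True, False],
})


def summarize_by_interest(df):
    """Summarize grades, salary, and parent college attendance per interest level."""
    return df.groupby("interest").agg(
        mean_grades=("average_grades", "mean"),
        mean_parent_salary=("parent_salary", "mean"),
        # The mean of a boolean column is the share of True values.
        share_parent_in_college=("parent_was_in_college", "mean"),
        # Number of students with a recorded grade in each group.
        students=("average_grades", "count"),
    )


summary = summarize_by_interest(sample)
print(summary)
```

On the full dataset the same function could be called as `summarize_by_interest(data)`. Five rows are far too few to support any conclusion; the sample only shows the shape of the comparison.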
Write one or two sentences about how the data will be used to solve the problem. Earlier in the semester, we won’t have studied the Machine Learning methods just yet but you should have a general idea of what the ML will set out to do. For example:
We'll cluster the students into groups that share the same level of interest in going to college. Doing so will allow us to discover whether certain grades, residence areas, school types, or parent salaries group together by interest level.
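One possible way to carry this out is k-means clustering on the numeric features, followed by a cross-tabulation of cluster against interest. The choice of k-means, the value `k=2`, and the helper names `standardize` and `kmeans` are assumptions made for illustration; the project may end up using a different method, and a library implementation such as scikit-learn's `KMeans` would normally replace the hand-written loop. The `sample` frame again copies only the five rows shown in the `data.head()` output.

```python
import numpy as np
import pandas as pd

# Five rows copied from the data.head() output above (a subset of the columns).
sample = pd.DataFrame({
    "interest": ["Less Interested", "Less Interested", "Very Interested",
                 "Very Interested", "Very Interested"],
    "parent_salary": [6950000, 4410000, 6500000, 6600000, 5250000],
    "house_area": [83.0, 76.8, 80.6, 78.2, 75.1],
    "average_grades": [84.09, 86.91, 87.43, 82.12, 86.79],
})

NUMERIC_COLUMNS = ["parent_salary", "house_area", "average_grades"]


def standardize(df, columns):
    """Scale each column to zero mean and unit variance so no feature dominates."""
    values = df[columns].to_numpy(dtype=float)
    std = values.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return (values - values.mean(axis=0)) / std


def kmeans(points, k, n_iter=20, seed=0):
    """Basic k-means: assign each point to its nearest center, then update centers."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, k).
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels


features = standardize(sample, NUMERIC_COLUMNS)
sample["cluster"] = kmeans(features, k=2)  # k=2 is an assumed value

# Count how many students of each interest level fall in each cluster.
cluster_vs_interest = pd.crosstab(sample["cluster"], sample["interest"])
print(cluster_vs_interest)
```

If the clusters line up with interest levels in the cross-tabulation, the features used for clustering carry information about interest; if every cluster contains a similar mix of interest levels, they do not.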