Each individual student will submit a project proposal (3% of final grade) in .ipynb format which:
(1%) Describes and motivates a real-world problem where data science may provide helpful insights. Your description should be easily understood by a casual reader and include citations to motivating sources or relevant information (e.g. news articles, further reading links … Wikipedia makes for a poor reference but the links it cites are usually promising).
(1%) Explicitly load and show your dataset. Provide a data dictionary which explains the meaning of each feature present. Demonstrate that this data is sufficient to make progress on your real-world problem described above.
(1%) Write one or two sentences about how the data will be used to solve the problem. Earlier in the semester, we won’t have studied the Machine Learning methods just yet but you should have a general idea of what the ML will set out to do. For example:
“We’ll cluster the movies into sets of movies which are often watched by the same users. Doing so allows us to discover if there is a more natural grouping of movies rather than the traditional genres: horror, comedy, romantic-comedy, etc”.
In this project, I will use machine learning to predict the value of a property from the available data on property attributes and the prices the properties sold for.
According to an article by realtor.com, understanding attributes of a house such as price per square foot is important in understanding a house's value. Whether you are selling or buying, having information about the house can help you make more educated decisions about its value. Source: https://www.realtor.com/advice/buy/average-price-per-square-foot-for-a-home/
This data is sufficient to build a machine learning model that estimates the value of a home. The attributes range from the area and the number of stories to whether or not the house has certain desirable features. Some attributes are coarse: 'prefarea', for example, tells us whether the house is in a preferred area but not which areas count as preferred. Even so, each column carries useful information, and having many columns lets the model account for several sources of variation in price.
# Download the data
# Link to data:
import pandas as pd
df_housing = pd.read_csv('Housing.csv')
# Show the first 5 rows
df_housing.head()
| | price | area | bedrooms | bathrooms | stories | mainroad | guestroom | basement | hotwaterheating | airconditioning | parking | prefarea | furnishingstatus |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 13300000 | 7420 | 4 | 2 | 3 | yes | no | no | no | yes | 2 | yes | furnished |
1 | 12250000 | 8960 | 4 | 4 | 4 | yes | no | no | no | yes | 3 | no | furnished |
2 | 12250000 | 9960 | 3 | 2 | 2 | yes | no | yes | no | no | 2 | yes | semi-furnished |
3 | 12215000 | 7500 | 4 | 2 | 2 | yes | no | yes | no | yes | 3 | yes | furnished |
4 | 11410000 | 7420 | 4 | 1 | 2 | yes | yes | yes | no | yes | 2 | no | furnished |
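The first five rows alone do not show whether the columns are complete or which ones are numeric. The following is a minimal sketch of a structural check, assuming the five rows displayed above are rebuilt by hand as `sample` (a hypothetical name) so that it runs without `Housing.csv`. In the notebook, the same calls would be applied to `df_housing`.

```python
import pandas as pd

# The five rows displayed above, rebuilt by hand so this sketch runs
# without Housing.csv. `sample` is an illustrative name, not part of the notebook.
columns = ['price', 'area', 'bedrooms', 'bathrooms', 'stories', 'mainroad',
           'guestroom', 'basement', 'hotwaterheating', 'airconditioning',
           'parking', 'prefarea', 'furnishingstatus']
rows = [
    [13300000, 7420, 4, 2, 3, 'yes', 'no', 'no', 'no', 'yes', 2, 'yes', 'furnished'],
    [12250000, 8960, 4, 4, 4, 'yes', 'no', 'no', 'no', 'yes', 3, 'no', 'furnished'],
    [12250000, 9960, 3, 2, 2, 'yes', 'no', 'yes', 'no', 'no', 2, 'yes', 'semi-furnished'],
    [12215000, 7500, 4, 2, 2, 'yes', 'no', 'yes', 'no', 'yes', 3, 'yes', 'furnished'],
    [11410000, 7420, 4, 1, 2, 'yes', 'yes', 'yes', 'no', 'yes', 2, 'no', 'furnished'],
]
sample = pd.DataFrame(rows, columns=columns)

# Number of rows and columns
n_rows, n_cols = sample.shape

# Count of missing values in each column
missing_per_column = sample.isna().sum()

# Separate numeric columns from yes/no and categorical text columns
numeric_cols = sample.select_dtypes(include='number').columns.tolist()
text_cols = sample.select_dtypes(exclude='number').columns.tolist()

print(n_rows, n_cols)
print(missing_per_column)
print('numeric:', numeric_cols)
print('text:', text_cols)
```

The text columns found this way are the ones that would need to be encoded as numbers before most models can use them.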
# Data dictionary explaining the meaning of each feature
housing_dict = {
    'price': 'Price of the houses',
    'area': 'Area of a house',
    'bedrooms': 'Number of bedrooms',
    'bathrooms': 'Number of bathrooms',
    'stories': 'Number of stories',
    'mainroad': 'Connected to a main road?',
    'guestroom': 'Has a guest room?',
    'basement': 'Has a basement?',
    'hotwaterheating': 'Has a hot water heater?',
    'airconditioning': 'Has air conditioning?',
    'parking': 'Number of parking spots',
    'prefarea': 'Preferred area?',
    'furnishingstatus': 'Furnishing status',
}
housing_dict
{'price': 'Price of the houses', 'area': 'Area of a house', 'bedrooms': 'Number of bedrooms', 'bathrooms': 'Number of bathrooms', 'stories': 'Number of stories', 'mainroad': 'Connected to a main road?', 'guestroom': 'Has a guest room?', 'basement': 'Has a basement?', 'hotwaterheating': 'Has a hot water heater?', 'airconditioning': 'Has air conditioning?', 'parking': 'Number of parking spots', 'prefarea': 'Preferred area?', 'furnishingstatus': 'Furnishing status'}
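Since the data dictionary is meant to explain every feature, one way to guard against a forgotten column is to compare the dictionary keys with the column names. This is a small sketch only. The helper name `undocumented_columns` is hypothetical, and the inputs below are tiny stand-ins for `df_housing.columns` and `housing_dict`.

```python
def undocumented_columns(columns, data_dict):
    """Return the column names that have no entry in the data dictionary."""
    return [col for col in columns if col not in data_dict]

# Tiny stand-in inputs; in the notebook these would be
# df_housing.columns and housing_dict.
example_columns = ['price', 'area', 'prefarea']
example_dict = {'price': 'Price of the houses', 'area': 'Area of a house'}

print(undocumented_columns(example_columns, example_dict))  # prints ['prefarea']
```

An empty list means every column has a description.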
To predict how much a house will cost, we need data to base our predictions on. The data will be used to examine the relationship between the attributes of a house and the price at which that house is valued. For example, we would expect a house in a preferred area to be valued at a higher price. Using a machine learning model, I will estimate how much the value increases when a house has a positive attribute such as being in a preferred area.
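The idea of measuring how much a positive attribute adds to the price can be sketched with an ordinary least-squares fit in which yes/no columns become 1/0 indicators. The coefficient on `prefarea` then estimates the price difference associated with a preferred area, with the other included features held fixed. This is only a sketch under assumptions: linear regression is one possible model rather than a committed choice, the helper name `fit_price_model` is hypothetical, and the five rows used are the ones displayed earlier, which are far too few for the resulting numbers to be meaningful.

```python
import numpy as np
import pandas as pd

def fit_price_model(df, features, target='price'):
    """Fit ordinary least squares and return the coefficients as a Series.

    Columns that contain only 'yes'/'no' strings are mapped to 1/0 first.
    """
    X = df[features].copy()
    for col in features:
        if X[col].isin(['yes', 'no']).all():
            X[col] = X[col].map({'yes': 1, 'no': 0})
    X = X.astype(float)
    X.insert(0, 'intercept', 1.0)  # constant term of the linear model
    y = df[target].to_numpy(dtype=float)
    coefs, *_ = np.linalg.lstsq(X.to_numpy(), y, rcond=None)
    return pd.Series(coefs, index=X.columns)

# Three columns from the five rows shown earlier (illustration only)
sample = pd.DataFrame({
    'price': [13300000, 12250000, 12250000, 12215000, 11410000],
    'area': [7420, 8960, 9960, 7500, 7420],
    'prefarea': ['yes', 'no', 'yes', 'yes', 'no'],
})

coefs = fit_price_model(sample, ['area', 'prefarea'])
print(coefs)
```

On the full dataset, the same call could take all thirteen columns once `furnishingstatus`, which has more than two values, is encoded separately.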