PART 1: Describes and motivates a real-world problem where data science may provide helpful insights. Your description should be easily understood by a casual reader and include citations to motivating sources or relevant information (e.g. news articles, further reading links … Wikipedia makes for a poor reference but the links it cites are usually promising).
Formula 1 is one of the most popular and competitive sports in the world, attracting millions of fans and viewers worldwide. The sport is heavily dependent on data and analytics, with teams using data to make strategic decisions during races and to improve car performance. In recent years, machine learning and data science have become increasingly important in Formula 1, as teams look to gain a competitive edge through data-driven insights.
The objective of this project is to build a machine learning model that can accurately predict the results of Formula 1 races. The model will use historical race data, driver and team statistics, track characteristics, and other relevant data to predict the finishing positions (or points) of drivers in upcoming races. The model can be used by fans, analysts, and teams to make informed predictions and strategic decisions. The project will also provide insights into the most important factors that influence the results of Formula 1 races and demonstrate the potential of machine learning in the sport. For constructors, performing well on track leads to a larger team budget and more funding overall, which raises the importance of race wins.
Relevant sources:
PART 2: Explicitly load and show your dataset. Provide a data dictionary which explains the meaning of each feature present. Demonstrate that this data is sufficient to make progress on your real-world problem described above!
The dataset is from Kaggle. It includes multiple CSV files, and more of them could be added if deemed necessary. The dataset covers all races from 1950 to 2023.
The files I chose below are pit stops, races, and results, which combine to reveal driver, result, pit stop, and track characteristics. This should be sufficient to make progress on predicting a future result from historical data.
Data dictionary of the dataset (I combined a few datasets and put them into one dataframe):
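A data dictionary for the merged dataframe could be kept as a plain Python mapping, as sketched here. The descriptions are my assumed reading of the column names and of the sample rows shown in this section, not an official schema, so they should be checked against the Kaggle dataset page. `DATA_DICTIONARY` is a name introduced only for this sketch.

```python
# Assumed meanings of the columns in the merged dataframe (to be verified
# against the dataset's own documentation).
DATA_DICTIONARY = {
    "raceId": "Unique identifier of a race (join key)",
    "driverId": "Unique identifier of a driver (join key)",
    "stop": "Pit stop number for that driver in that race (1 = first stop)",
    "lap": "Lap on which the pit stop was made",
    "time": "Time of day of the pit stop (HH:MM:SS)",
    "duration": "Pit stop duration in seconds",
    "milliseconds": "Pit stop duration in milliseconds",
    "year": "Season in which the race took place",
    "circuitId": "Unique identifier of the circuit",
    "name": "Name of the Grand Prix",
    "constructorId": "Unique identifier of the constructor (team)",
    "number": "Car number",
    "grid": "Starting position on the grid",
    "position": r"Classified finishing position (\N when missing)",
    "positionOrder": "Final ranking order, filled for every driver",
    "points": "Championship points scored in the race",
    "fastestLap": "Lap number of the driver's fastest lap",
    "fastestLapSpeed": "Speed of the driver's fastest lap (assumed km/h)",
}

# Print the dictionary as an aligned two-column listing.
for column, meaning in DATA_DICTIONARY.items():
    print(f"{column:>16}: {meaning}")
```

Keeping the dictionary next to the loading code makes it easy to update if more CSV files are merged in later.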
import pandas as pd
# Pit stop data (one row per pit stop)
df_pitstops = pd.read_csv('pit_stops.csv')
# Race data
df_races = pd.read_csv('races.csv')
# Removing some columns I don't think are necessary (but could be added back)
df_races = df_races.drop(columns=['round', 'time', 'url', 'fp1_date', 'fp1_time', 'fp2_date', 'fp2_time', 'fp3_date', 'fp3_time', 'quali_date', 'quali_time', 'sprint_date', 'sprint_time', 'date'])
# Results data (one row per driver per race)
df_results = pd.read_csv('results.csv')
# Again removing columns
df_results = df_results.drop(columns=['resultId', 'positionText', 'laps', 'time', 'milliseconds', 'rank', 'fastestLapTime', 'statusId'])
# Combining the datasets into one for simplicity, with the join keys spelled out
merged_df = pd.merge(pd.merge(df_pitstops, df_races, on='raceId'), df_results, on=['raceId', 'driverId'])
# Merged on raceId and driverId; each driver's result row repeats once per pit stop, so there may be a better way to do this
merged_df.head()
| | raceId | driverId | stop | lap | time | duration | milliseconds | year | circuitId | name | constructorId | number | grid | position | positionOrder | points | fastestLap | fastestLapSpeed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 841 | 153 | 1 | 1 | 17:05:23 | 26.898 | 26898 | 2011 | 1 | Australian Grand Prix | 5 | 19 | 12 | 11 | 11 | 0.0 | 41 | 211.025 |
| 1 | 841 | 153 | 2 | 17 | 17:31:06 | 24.463 | 24463 | 2011 | 1 | Australian Grand Prix | 5 | 19 | 12 | 11 | 11 | 0.0 | 41 | 211.025 |
| 2 | 841 | 153 | 3 | 35 | 17:59:45 | 26.348 | 26348 | 2011 | 1 | Australian Grand Prix | 5 | 19 | 12 | 11 | 11 | 0.0 | 41 | 211.025 |
| 3 | 841 | 30 | 1 | 1 | 17:05:52 | 25.021 | 25021 | 2011 | 1 | Australian Grand Prix | 131 | 7 | 11 | \N | 19 | 0.0 | 13 | 200.283 |
| 4 | 841 | 30 | 2 | 17 | 17:32:08 | 23.988 | 23988 | 2011 | 1 | Australian Grand Prix | 131 | 7 | 11 | \N | 19 | 0.0 | 13 | 200.283 |
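The repeated result rows come from pit_stops.csv holding one row per stop. One possible way around this, sketched here on the five rows from the table rather than on the full files, is to summarise the stops per driver and race before merging. The column names `n_stops`, `total_pit_ms`, and `mean_pit_ms` are made up for this sketch.

```python
import pandas as pd

# The five pit-stop rows from the head() output above.
pit_stops = pd.DataFrame({
    "raceId": [841, 841, 841, 841, 841],
    "driverId": [153, 153, 153, 30, 30],
    "stop": [1, 2, 3, 1, 2],
    "milliseconds": [26898, 24463, 26348, 25021, 23988],
})

# The matching result columns for the same two drivers.
results = pd.DataFrame({
    "raceId": [841, 841],
    "driverId": [153, 30],
    "grid": [12, 11],
    "positionOrder": [11, 19],
})

# Collapse the stops to one row per (raceId, driverId).
pit_summary = pit_stops.groupby(["raceId", "driverId"], as_index=False).agg(
    n_stops=("stop", "count"),
    total_pit_ms=("milliseconds", "sum"),
    mean_pit_ms=("milliseconds", "mean"),
)

# validate raises an error if either side still has duplicate keys.
one_row_per_driver = pd.merge(
    results, pit_summary, on=["raceId", "driverId"], validate="one_to_one"
)
print(one_row_per_driver)
```

With this shape, each driver appears once per race, so the finishing position is not counted several times when a model is trained on the table.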
PART 3: Write one or two sentences about how the data will be used to solve the problem. Earlier in the semester, we won’t have studied the Machine Learning methods just yet but you should have a general idea of what the ML will set out to do. For example:
The data will be used to predict the finishing position of each driver in a race, so a regression model seems most appropriate. A regression model outputs a continuous numerical value per driver, and those values can be ranked within a race to give a predicted finishing order.
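As a rough illustration of that idea, the sketch below fits a straight line from starting grid slot to finishing position with NumPy's least-squares `polyfit`, then ranks the continuous predictions to get whole-number positions. The six data points are toy values invented for this example, not real race results, and a real model would use many more features than `grid`.

```python
import numpy as np

# Toy values invented purely for illustration (NOT real race results):
# starting grid slot and final classification for six hypothetical drivers.
grid = np.array([1, 2, 3, 4, 5, 6], dtype=float)
finish = np.array([2, 1, 3, 6, 4, 5], dtype=float)

# Fit finish ~ slope * grid + intercept by ordinary least squares.
slope, intercept = np.polyfit(grid, finish, deg=1)

# The raw predictions are continuous, so they are not valid positions yet.
predicted_score = slope * grid + intercept

# Ranking the scores within the race turns them into positions 1..n
# (lowest score = predicted winner).
predicted_position = predicted_score.argsort().argsort() + 1

print("continuous predictions:", predicted_score.round(2))
print("ranked positions:", predicted_position)
```

Ranking is only one way to map a continuous output back to positions; rounding would also work but can give two drivers the same place.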