FORMULA 1

PART 1: Describes and motivates a real-world problem where data science may provide helpful insights. Your description should be easily understood by a casual reader and include citations to motivating sources or relevant information (e.g. news articles, further reading links … Wikipedia makes for a poor reference but the links it cites are usually promising).

Formula 1 is one of the most popular and competitive sports in the world, attracting millions of fans and viewers. The sport depends heavily on data and analytics: teams use data to make strategic decisions during races and to improve car performance. In recent years, machine learning and data science have become increasingly important in Formula 1 as teams look to gain a competitive edge through data-driven insights.

The objective of this project is to build a machine learning model that can accurately predict the results of Formula 1 races. The model will use historical race data, driver and team statistics, track characteristics, and other relevant data to predict the finishing positions (or points) of drivers in upcoming races. Fans, analysts, and teams could use the model to make informed predictions and strategic decisions. The project will also provide insight into the factors that most influence race results and demonstrate the potential of machine learning in the sport. For constructors, strong on-track results translate into a larger team budget and more funding overall, which raises the importance of race wins.

Relevant sources:

PART 2: Explicitly load and show your dataset. Provide a data dictionary which explains the meaning of each feature present. Demonstrate that this data is sufficient to make progress on your real-world problem described above!

The dataset is from Kaggle. It includes multiple CSV files, and more of them could be added if deemed necessary. The dataset covers all races from 1950 to 2023.

The files I chose below are pit stops, races, and results. Combined, they capture driver, result, pit stop, and track characteristics. This is sufficient to make progress on predicting a future result from historical data.

Dictionary of the dataset (I combined several files into one dataframe):

raceId: ID of the race
driverId: ID of the driver
stop: pit stop number for that driver in that race
lap: lap on which the pit stop was made
time: time of day of the pit stop
duration: duration of the pit stop (seconds)
milliseconds: duration of the pit stop (milliseconds)
year: year of the race
circuitId: ID of the circuit
name: name of the race (e.g. Australian Grand Prix)
constructorId: ID of the constructor (can be matched with the constructor name)
number: car number (unique to each car)
grid: starting position on the grid
position: official finishing position (\N if the driver was not classified)
positionOrder: final rank (when the race is completed)
points: points earned in that race
fastestLap: lap number on which the driver set their fastest lap in that race
fastestLapSpeed: average speed over that driver's fastest lap (km/h)

In [19]:

import pandas as pd

# Pit stop data
df_pitstops = pd.read_csv('pit_stops.csv')

# Race data
df_races = pd.read_csv('races.csv')

# Remove columns that do not seem necessary (they can be added back later)
df_races = df_races.drop(['round','time', 'url','fp1_date', 'fp1_time', 'fp2_date', 'fp2_time', 'fp3_date', 'fp3_time', 'quali_date', 'quali_time', 'sprint_date', 'sprint_time', 'date'], axis=1)

# Results data
df_results = pd.read_csv('results.csv')
# Again, remove columns that do not seem necessary
df_results = df_results.drop(['resultId','positionText','laps', 'time', 'milliseconds', 'rank', 'fastestLapTime', 'statusId'], axis=1)
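
The results file marks missing values with the literal string \N (it shows up in the position column for drivers without a classified finish). Left as-is, such a column is read as text rather than numbers. The sketch below shows one possible way to handle this at load time, using a two-row inline sample instead of the real results.csv; the name df_sample and the sample rows are illustrative assumptions, not part of the project code.

```python
import io
import pandas as pd

# Two-row inline stand-in for results.csv (illustrative sample only).
sample_csv = io.StringIO(
    "raceId,driverId,grid,position,positionOrder\n"
    "841,153,12,11,11\n"
    "841,30,11,\\N,19\n"
)

# Treat the literal string \N as a missing value while reading.
df_sample = pd.read_csv(sample_csv, na_values=["\\N"])

# 'position' is now numeric; it becomes float because NaN cannot be
# stored in a plain integer column.
print(df_sample.dtypes)
print("missing positions:", df_sample["position"].isna().sum())
```

The same na_values argument could be passed to the pd.read_csv calls above, at the cost of position being displayed as a float.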

In [31]:

# Combine the datasets into one for simplicity
merged_df = pd.merge(df_pitstops, df_races, on='raceId')
merged_df = pd.merge(merged_df, df_results, on=['raceId', 'driverId'])
# Results are joined on raceId and driverId; each driver's result row repeats once per pit stop, so there may be a better way to do this
merged_df.head()

Out[31]:

	raceId	driverId	stop	lap	time	duration	milliseconds	year	circuitId	name	constructorId	number	grid	position	positionOrder	fastestLap	fastestLapSpeed
0	841	153	1	1	17:05:23	26.898	26898	2011	1	Australian Grand Prix	5	19	12	11	11	41	211.025
1	841	153	2	17	17:31:06	24.463	24463	2011	1	Australian Grand Prix	5	19	12	11	11	41	211.025
2	841	153	3	35	17:59:45	26.348	26348	2011	1	Australian Grand Prix	5	19	12	11	11	41	211.025
3	841	30	1	1	17:05:52	25.021	25021	2011	1	Australian Grand Prix	131	7	11	\N	19	13	200.283
4	841	30	2	17	17:32:08	23.988	23988	2011	1	Australian Grand Prix	131	7	11	\N	19	13	200.283
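
Because the merged dataframe has one row per pit stop, each driver's race result is repeated several times. One possible alternative, sketched here under assumptions, is to summarise the pit stops to one row per driver per race before merging. The demo frame copies the five pit stop rows shown above; the summary column names (n_stops, total_pit_ms, mean_pit_ms, first_stop_lap) are hypothetical choices, not columns from the dataset.

```python
import pandas as pd

# Small stand-in for pit_stops.csv, copied from the rows shown above.
df_pitstops_demo = pd.DataFrame({
    "raceId": [841, 841, 841, 841, 841],
    "driverId": [153, 153, 153, 30, 30],
    "stop": [1, 2, 3, 1, 2],
    "lap": [1, 17, 35, 1, 17],
    "milliseconds": [26898, 24463, 26348, 25021, 23988],
})

# Collapse to one row per (race, driver) with summary pit stop features.
pit_summary = (
    df_pitstops_demo
    .groupby(["raceId", "driverId"], as_index=False)
    .agg(
        n_stops=("stop", "max"),               # number of stops made
        total_pit_ms=("milliseconds", "sum"),  # total time spent in the pits
        mean_pit_ms=("milliseconds", "mean"),  # average stop duration
        first_stop_lap=("lap", "min"),         # lap of the first stop
    )
)
print(pit_summary)
```

A summary like pit_summary could then be merged with the races and results data on raceId and driverId, giving a single row per driver per race for modelling.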

PART 3: Write one or two sentences about how the data will be used to solve the problem. Earlier in the semester, we won’t have studied the Machine Learning methods just yet but you should have a general idea of what the ML will set out to do. For example:

The data will be used to predict the finishing position of each driver in a race, so a regression model is an appropriate starting point. A regression model outputs a continuous numerical value; ranking these values within each race turns them into predicted finishing positions.
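
To make the idea concrete, here is a minimal baseline sketch, not the final model: an ordinary least squares fit of finishing position on grid position using NumPy, followed by ranking the predictions within each race. The toy data is made up for illustration, and using grid as the only feature is an assumption chosen for simplicity.

```python
import numpy as np
import pandas as pd

# Made-up toy data: two races with four drivers each (not real results).
toy = pd.DataFrame({
    "raceId": [1, 1, 1, 1, 2, 2, 2, 2],
    "grid": [1, 2, 3, 4, 1, 2, 3, 4],
    "positionOrder": [1, 3, 2, 4, 2, 1, 4, 3],
})

# Ordinary least squares: positionOrder ~ intercept + slope * grid.
slope, intercept = np.polyfit(
    toy["grid"].to_numpy(), toy["positionOrder"].to_numpy(), deg=1
)

# The regression output is a continuous score per driver.
toy["predicted_score"] = intercept + slope * toy["grid"]

# Rank the scores within each race to obtain a predicted finishing order.
toy["predicted_position"] = (
    toy.groupby("raceId")["predicted_score"].rank(method="first").astype(int)
)
print(toy)
```

With real data, the same pattern would apply to a richer feature set (for example pit stop summaries and constructor information), and the fit would be evaluated on races held out from training.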