Predicting F1 Race Winner/Position

Formula 1 is one of the most watched sports in the world, with millions of fans following races across the globe. Winning a Formula 1 race requires not only a fast and reliable car but also a good strategy team (RIP Ferrari), a sharp pit crew, and, of course, luck. Predicting the winner is challenging because races are inherently chaotic and many variables can affect the outcome. This project aims to find patterns within that chaos and to develop a machine learning model that predicts the winner of a Formula 1 race with decent accuracy. We will also look for correlations between performance in pre-season testing and performance in races.

We'll predict the race winner, or a driver's finishing position, from features such as starting grid position, championship position and points, qualifying time, driver age, the driver's number of wins, track weather/status, and year. We'll train several ML models and compare their performance.

In [1]:
# !pip install fastf1
import fastf1 as ff1
from fastf1 import plotting
import numpy as np
import pandas as pd

Below are 14 datasets covering Formula 1 races, drivers, constructors, qualifying, circuits, lap times, pit stops, and championships from 1950 through the 2023 season. This should be enough data for a meaningful result. Some of these datasets are more useful than others, and we need to clean and combine several of them before modeling.
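
As a rough idea of what "combining" could look like, here is a minimal sketch that joins results with race and driver information on their shared ID columns. The rows are illustrative stand-ins (the grid and finishing values are made up), and `model_df` is a hypothetical name, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Illustrative stand-ins for results.csv, races.csv and drivers.csv.
# The grid/positionOrder values are invented for the example.
results = pd.DataFrame({
    "raceId": [1, 1, 2],
    "driverId": [1, 3, 1],
    "grid": [3, 1, 2],
    "positionOrder": [2, 1, 1],
})
races = pd.DataFrame({
    "raceId": [1, 2],
    "year": [2009, 2009],
    "round": [1, 2],
    "circuitId": [1, 2],
})
drivers = pd.DataFrame({
    "driverId": [1, 3],
    "surname": ["Hamilton", "Rosberg"],
    "dob": ["1985-01-07", "1985-06-27"],
})

# One row per (race, driver); validate= raises if a key is unexpectedly duplicated
model_df = (
    results
    .merge(races, on="raceId", how="left", validate="many_to_one")
    .merge(drivers, on="driverId", how="left", validate="many_to_one")
)
print(model_df)
```

A left join keeps every result row even when a lookup table has no match, which makes missing keys show up as NaN instead of silently dropping rows.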

Features of the dataset:

  • circuitId: ID of a circuit
  • circuitRef: short reference name of the circuit
  • name: the official name of the circuit
  • location: the location of the circuit (usually city or town)
  • country: the country the circuit is in
  • lat: latitude of the circuit
  • lng: longitude of the circuit
  • alt: altitude of the circuit
  • url: the wikipedia page of the circuit
In [2]:
df_0 = pd.read_csv('data/circuits.csv')
print(df_0.shape)
df_0.head()
(77, 9)
Out[2]:
circuitId circuitRef name location country lat lng alt url
0 1 albert_park Albert Park Grand Prix Circuit Melbourne Australia -37.84970 144.96800 10 http://en.wikipedia.org/wiki/Melbourne_Grand_P...
1 2 sepang Sepang International Circuit Kuala Lumpur Malaysia 2.76083 101.73800 18 http://en.wikipedia.org/wiki/Sepang_Internatio...
2 3 bahrain Bahrain International Circuit Sakhir Bahrain 26.03250 50.51060 7 http://en.wikipedia.org/wiki/Bahrain_Internati...
3 4 catalunya Circuit de Barcelona-Catalunya Montmeló Spain 41.57000 2.26111 109 http://en.wikipedia.org/wiki/Circuit_de_Barcel...
4 5 istanbul Istanbul Park Istanbul Turkey 40.95170 29.40500 130 http://en.wikipedia.org/wiki/Istanbul_Park

Features of the dataset:

  • statusId: ID of the status
  • status: finishing status recorded for a result (e.g. Finished, Disqualified, Accident, Collision, Engine)
In [3]:
df_1 = pd.read_csv('data/status.csv')
print(df_1.shape)
df_1.head()
(139, 2)
Out[3]:
statusId status
0 1 Finished
1 2 Disqualified
2 3 Accident
3 4 Collision
4 5 Engine

Features of the dataset:

  • raceId: ID of the race
  • driverId: ID of the driver
  • lap: lap number
  • position: track position of the driver
  • time: lap time
  • milliseconds: lap time in ms
In [4]:
df_2 = pd.read_csv('data/lap_times.csv')
print(df_2.shape)
df_2.head()
(538121, 6)
Out[4]:
raceId driverId lap position time milliseconds
0 841 20 1 1 1:38.109 98109
1 841 20 2 1 1:33.006 93006
2 841 20 3 1 1:32.713 92713
3 841 20 4 1 1:32.803 92803
4 841 20 5 1 1:32.342 92342

Features of the dataset:

  • resultId: ID of the sprint result
  • raceId: ID of the race
  • driverId: ID of the driver
  • constructorId: ID of the constructor
  • number: car number
  • grid: starting position of the car
  • position: finishing position of the car
  • positionText: finishing position of the car in text
  • positionOrder: finishing order of the car, including cars that were disqualified or did not finish
  • points: points awarded to the driver
  • laps: number of laps the driver finished
  • time: finishing time of the sprint race
  • milliseconds: finishing time of the sprint race in ms
  • fastestLap: lap number of the driver's fastest lap
  • fastestLapTime: lap time of the driver's fastest lap
  • statusId: ID of the status of the driver
In [5]:
df_3 = pd.read_csv('data/sprint_results.csv')
print(df_3.shape)
df_3.head()
(120, 16)
Out[5]:
resultId raceId driverId constructorId number grid position positionText positionOrder points laps time milliseconds fastestLap fastestLapTime statusId
0 1 1061 830 9 33 2 1 1 1 3 17 25:38.426 1538426 14 1:30.013 1
1 2 1061 1 131 44 1 2 2 2 2 17 +1.430 1539856 17 1:29.937 1
2 3 1061 822 131 77 3 3 3 3 1 17 +7.502 1545928 17 1:29.958 1
3 4 1061 844 6 16 4 4 4 4 0 17 +11.278 1549704 16 1:30.163 1
4 5 1061 846 1 4 6 5 5 5 0 17 +24.111 1562537 16 1:30.566 1

Features of the dataset:

  • driverId: ID of the driver
  • driverRef: short reference name of the driver
  • number: number of the driver (some drivers don't have a number)
  • code: three-letter code of the driver
  • forename: first name of the driver
  • surname: last name of the driver
  • dob: Date of Birth of the driver
  • nationality: nationality of the driver
  • url: link to the driver's wikipedia page
In [6]:
df_4 = pd.read_csv('data/drivers.csv')
print(df_4.shape)
df_4.head()
(857, 9)
Out[6]:
driverId driverRef number code forename surname dob nationality url
0 1 hamilton 44 HAM Lewis Hamilton 1985-01-07 British http://en.wikipedia.org/wiki/Lewis_Hamilton
1 2 heidfeld \N HEI Nick Heidfeld 1977-05-10 German http://en.wikipedia.org/wiki/Nick_Heidfeld
2 3 rosberg 6 ROS Nico Rosberg 1985-06-27 German http://en.wikipedia.org/wiki/Nico_Rosberg
3 4 alonso 14 ALO Fernando Alonso 1981-07-29 Spanish http://en.wikipedia.org/wiki/Fernando_Alonso
4 5 kovalainen \N KOV Heikki Kovalainen 1981-10-19 Finnish http://en.wikipedia.org/wiki/Heikki_Kovalainen
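
The `\N` entries in the number column above are missing-value markers. A small sketch of one way to handle them, assuming we want pandas to treat them as NaN at load time; the inline CSV text is a two-row stand-in for drivers.csv.

```python
import io
import pandas as pd

# Two rows in the same shape as drivers.csv; "\N" marks a missing value
csv_text = (
    "driverId,driverRef,number,code\n"
    "1,hamilton,44,HAM\n"
    "2,heidfeld,\\N,HEI\n"
)

# Default parsing keeps "\N" as text, so the column cannot be numeric
raw = pd.read_csv(io.StringIO(csv_text))

# na_values tells pandas to read "\N" as NaN instead
clean = pd.read_csv(io.StringIO(csv_text), na_values=["\\N"])

print(raw["number"].tolist())
print(clean["number"].isna().sum())
```

The same `na_values` argument could be passed to every `pd.read_csv` call in this notebook, since several of the tables below use the same marker.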

Features of the dataset:

  • raceId: ID of the race
  • year: year of the race
  • round: round of the race in that year
  • circuitId: ID of the circuit
  • name: name of the race
  • date: date of the race
  • time: time of the race
  • url: link to the wikipedia page of the race
  • fp1_date: date of the Free Practice 1 session
  • fp1_time: time of the Free Practice 1 session
  • fp2_date: date of the Free Practice 2 session
  • fp2_time: time of the Free Practice 2 session
  • fp3_date: date of the Free Practice 3 session
  • fp3_time: time of the Free Practice 3 session
  • quali_date: date of the Qualifying session
  • quali_time: time of the Qualifying session
  • sprint_date: date of the Sprint Race
  • sprint_time: time of the Sprint Race
In [7]:
df_5 = pd.read_csv('data/races.csv')
print(df_5.shape)
df_5.head()
(1102, 18)
Out[7]:
raceId year round circuitId name date time url fp1_date fp1_time fp2_date fp2_time fp3_date fp3_time quali_date quali_time sprint_date sprint_time
0 1 2009 1 1 Australian Grand Prix 2009-03-29 06:00:00 http://en.wikipedia.org/wiki/2009_Australian_G... \N \N \N \N \N \N \N \N \N \N
1 2 2009 2 2 Malaysian Grand Prix 2009-04-05 09:00:00 http://en.wikipedia.org/wiki/2009_Malaysian_Gr... \N \N \N \N \N \N \N \N \N \N
2 3 2009 3 17 Chinese Grand Prix 2009-04-19 07:00:00 http://en.wikipedia.org/wiki/2009_Chinese_Gran... \N \N \N \N \N \N \N \N \N \N
3 4 2009 4 3 Bahrain Grand Prix 2009-04-26 12:00:00 http://en.wikipedia.org/wiki/2009_Bahrain_Gran... \N \N \N \N \N \N \N \N \N \N
4 5 2009 5 4 Spanish Grand Prix 2009-05-10 12:00:00 http://en.wikipedia.org/wiki/2009_Spanish_Gran... \N \N \N \N \N \N \N \N \N \N
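
Driver age, one of the features mentioned in the introduction, is not stored directly, but it can be derived from a race `date` and a driver `dob`. A minimal sketch, assuming a hypothetical `entries` table that lists which driver took part in which race; the column name `age_years` is also made up for this example.

```python
import pandas as pd

# Dates copied from the drivers and races previews; `entries` is hypothetical
drivers = pd.DataFrame({"driverId": [1, 4], "dob": ["1985-01-07", "1981-07-29"]})
races = pd.DataFrame({"raceId": [1], "date": ["2009-03-29"]})
entries = pd.DataFrame({"raceId": [1, 1], "driverId": [1, 4]})

df = entries.merge(races, on="raceId").merge(drivers, on="driverId")

# Parse both columns as dates, then express the difference in years
df["date"] = pd.to_datetime(df["date"])
df["dob"] = pd.to_datetime(df["dob"])
df["age_years"] = (df["date"] - df["dob"]).dt.days / 365.25

print(df[["driverId", "age_years"]])
```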

Features of the dataset:

  • constructorId: ID of the constructor
  • constructorRef: reference of the constructor
  • name: name of the constructor
  • nationality: nationality of the constructor
  • url: link to the wikipedia page of the constructor
In [8]:
df_6 = pd.read_csv('data/constructors.csv')
print(df_6.shape)
df_6.head()
(211, 5)
Out[8]:
constructorId constructorRef name nationality url
0 1 mclaren McLaren British http://en.wikipedia.org/wiki/McLaren
1 2 bmw_sauber BMW Sauber German http://en.wikipedia.org/wiki/BMW_Sauber
2 3 williams Williams British http://en.wikipedia.org/wiki/Williams_Grand_Pr...
3 4 renault Renault French http://en.wikipedia.org/wiki/Renault_in_Formul...
4 5 toro_rosso Toro Rosso Italian http://en.wikipedia.org/wiki/Scuderia_Toro_Rosso

Features of the dataset:

  • constructorStandingsId: ID of the constructor Standing
  • raceId: ID of the race
  • constructorId: ID of the constructor
  • points: points of the constructor
  • position: position in the Constructor's Championship
  • positionText: same as position
  • wins: number of wins the Constructor has in the season
In [9]:
df_7 = pd.read_csv('data/constructor_standings.csv')
print(df_7.shape)
df_7.head()
(12941, 7)
Out[9]:
constructorStandingsId raceId constructorId points position positionText wins
0 1 18 1 14.0 1 1 1
1 2 18 2 8.0 3 3 0
2 3 18 3 9.0 2 2 0
3 4 18 4 5.0 4 4 0
4 5 18 5 2.0 5 5 0

Features of the dataset:

  • qualifyId: ID of the qualifying result
  • raceId: ID of the race
  • driverId: ID of the driver
  • constructorId: ID of the constructor
  • number: driver's number
  • position: driver's qualifying position
  • q1: lap time in Q1 (first qualifying session)
  • q2: lap time in Q2 (second qualifying session)
  • q3: lap time in Q3 (third qualifying session)
In [10]:
df_8 = pd.read_csv('data/qualifying.csv')
print(df_8.shape)
df_8.head()
(9575, 9)
Out[10]:
qualifyId raceId driverId constructorId number position q1 q2 q3
0 1 18 1 1 22 1 1:26.572 1:25.187 1:26.714
1 2 18 9 2 4 2 1:26.103 1:25.315 1:26.869
2 3 18 5 1 23 3 1:25.664 1:25.452 1:27.079
3 4 18 13 6 2 4 1:25.994 1:25.691 1:27.178
4 5 18 2 2 3 5 1:25.960 1:25.518 1:27.236
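
The q1, q2 and q3 columns are strings such as `1:26.572`, so they need converting before they can be used as numeric features. A minimal sketch, assuming every time has the form `M:SS.mmm` and that a missing session is marked with `\N` as in the other tables; `lap_time_to_seconds` is a hypothetical helper, and the `\N` in the sample data is invented to show the missing case.

```python
import pandas as pd

def lap_time_to_seconds(value):
    """Convert an 'M:SS.mmm' string to seconds; return NaN when missing."""
    if pd.isna(value) or value == "\\N":
        return float("nan")
    minutes, seconds = str(value).split(":")
    return int(minutes) * 60 + float(seconds)

# First two rows of the preview; the "\N" is made up for illustration
quali = pd.DataFrame({
    "q1": ["1:26.572", "1:26.103"],
    "q3": ["1:26.714", "\\N"],
})

for col in ["q1", "q3"]:
    quali[col + "_s"] = quali[col].map(lap_time_to_seconds)

print(quali)
```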

Features of the dataset:

  • driverStandingsId: ID of the driver's standing
  • raceId: ID of the race
  • driverId: ID of the driver
  • points: points of the driver
  • position: position of the driver in the Driver's Championship
  • positionText: same as position
  • wins: number of wins of the driver
In [11]:
df_9 = pd.read_csv('data/driver_standings.csv')
print(df_9.shape)
df_9.head()
(33902, 7)
Out[11]:
driverStandingsId raceId driverId points position positionText wins
0 1 18 1 10.0 1 1 1
1 2 18 2 8.0 2 2 0
2 3 18 3 6.0 3 3 0
3 4 18 4 5.0 4 4 0
4 5 18 5 4.0 5 5 0
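
Note that the first row here (raceId 18, driverId 1, 10 points, 1 win) matches the winner's row for raceId 18 in results.csv further down, which suggests the standings are recorded after each race. If so, using them directly to predict that same race would leak the outcome. One possible workaround, sketched on made-up data, is to shift the standings by one round within each driver and season; `year` and `round` are assumed to have been merged in from races.csv, and `points_before_race` is a hypothetical column name.

```python
import pandas as pd

# Made-up standings for two drivers over three rounds of one season
standings = pd.DataFrame({
    "driverId": [1, 1, 1, 2, 2, 2],
    "year": [2008] * 6,
    "round": [1, 2, 3, 1, 2, 3],
    "points": [10.0, 14.0, 14.0, 8.0, 8.0, 13.0],
})

# Order chronologically, then take the previous round's total per driver-season
standings = standings.sort_values(["driverId", "year", "round"])
standings["points_before_race"] = (
    standings.groupby(["driverId", "year"])["points"].shift(1).fillna(0.0)
)

print(standings)
```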

Features of the dataset:

  • constructorResultsId: ID of the constructor Result
  • raceId: ID of the race
  • constructorId: ID of the constructor
  • points: points of the constructor
  • status: status of the constructor
In [12]:
df_10 = pd.read_csv('data/constructor_results.csv')
print(df_10.shape)
df_10.head()
(12170, 5)
Out[12]:
constructorResultsId raceId constructorId points status
0 1 18 1 14.0 \N
1 2 18 2 8.0 \N
2 3 18 3 9.0 \N
3 4 18 4 5.0 \N
4 5 18 5 2.0 \N

Features of the dataset:

  • raceId: ID of the race
  • driverId: ID of the driver
  • stop: pit stop number for the driver in that race (1 for the first stop, 2 for the second, and so on)
  • lap: lap on which the pit stop took place
  • time: time of day of the pitstop
  • duration: duration of the pit stop in seconds
  • milliseconds: duration of the pit stop in ms
In [13]:
df_11 = pd.read_csv('data/pit_stops.csv')
print(df_11.shape)
df_11.head()
(9634, 7)
Out[13]:
raceId driverId stop lap time duration milliseconds
0 841 153 1 1 17:05:23 26.898 26898
1 841 30 1 1 17:05:52 25.021 25021
2 841 17 1 11 17:20:48 23.426 23426
3 841 4 1 12 17:22:34 23.251 23251
4 841 13 1 13 17:24:10 23.842 23842
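
Pit stops are stored one row per stop, while a model predicting race results needs one row per driver per race. A sketch of one way to aggregate them; the second stop for driver 153 is invented for illustration, and `n_stops` and `total_pit_ms` are hypothetical feature names.

```python
import pandas as pd

# First rows of the preview plus one invented second stop for driver 153
pit_stops = pd.DataFrame({
    "raceId": [841, 841, 841, 841],
    "driverId": [153, 153, 30, 17],
    "stop": [1, 2, 1, 1],
    "milliseconds": [26898, 24000, 25021, 23426],
})

# Collapse to one row per (race, driver)
pit_summary = (
    pit_stops.groupby(["raceId", "driverId"], as_index=False)
    .agg(n_stops=("stop", "max"), total_pit_ms=("milliseconds", "sum"))
)

print(pit_summary)
```

Because pit stops happen during the race, these aggregates describe a race after the fact; they may be more useful for analysis than as inputs to a pre-race prediction.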

Features of the dataset:

  • year: year of the F1 season
  • url: link to the wikipedia page of the F1 season
In [14]:
df_12 = pd.read_csv('data/seasons.csv')
print(df_12.shape)
df_12.head()
(74, 2)
Out[14]:
year url
0 2009 http://en.wikipedia.org/wiki/2009_Formula_One_...
1 2008 http://en.wikipedia.org/wiki/2008_Formula_One_...
2 2007 http://en.wikipedia.org/wiki/2007_Formula_One_...
3 2006 http://en.wikipedia.org/wiki/2006_Formula_One_...
4 2005 http://en.wikipedia.org/wiki/2005_Formula_One_...

Features of the dataset:

  • resultId: ID of the result
  • raceId: ID of the race
  • driverId: ID of the driver
  • constructorId: ID of the constructor
  • number: number of the driver
  • grid: starting grid of the driver
  • position: finishing position of the driver
  • positionText: same as position
  • positionOrder: finishing position of the driver (includes DNFs and DNSs)
  • points: points of the driver
  • laps: laps driven by the driver
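  • time: finishing time of the winner, or the gap to the winner for other drivers
  • milliseconds: finishing time of the driver in ms
  • fastestLap: lap number of the driver's fastest lap
  • rank: rank of the driver's fastest lap among all drivers in the race
  • fastestLapTime: lap time of the driver's fastest lap
  • fastestLapSpeed: speed of the driver's fastest lap (km/h)
  • statusId: ID of the status of the driver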
In [15]:
df_13 = pd.read_csv('data/results.csv')
print(df_13.shape)
df_13.head()
(25840, 18)
Out[15]:
resultId raceId driverId constructorId number grid position positionText positionOrder points laps time milliseconds fastestLap rank fastestLapTime fastestLapSpeed statusId
0 1 18 1 1 22 1 1 1 1 10.0 58 1:34:50.616 5690616 39 2 1:27.452 218.300 1
1 2 18 2 2 3 5 2 2 2 8.0 58 +5.478 5696094 41 3 1:27.739 217.586 1
2 3 18 3 3 7 7 3 3 3 6.0 58 +8.163 5698779 41 5 1:28.090 216.719 1
3 4 18 4 4 5 11 4 4 4 5.0 58 +17.181 5707797 58 7 1:28.603 215.464 1
4 5 18 5 1 23 3 5 5 5 4.0 58 +18.014 5708630 43 1 1:27.418 218.385 1
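
Before comparing ML models, it helps to have a naive baseline to judge "decent accuracy" against, for example "the driver starting from grid position 1 wins". A sketch of how that baseline could be computed, using made-up results for three races; `pole_win_rate` is a hypothetical name.

```python
import pandas as pd

# Made-up results: three races, three cars each
results = pd.DataFrame({
    "raceId":        [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "grid":          [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "positionOrder": [1, 2, 3, 2, 1, 3, 1, 3, 2],
})

# Keep only each race's winner, then check how often they started first
winners = results[results["positionOrder"] == 1]
pole_win_rate = (winners["grid"] == 1).mean()

print(f"Pole sitter won {pole_win_rate:.0%} of races")
```

Any trained model would need to beat this number on the real data to be worth its added complexity.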

Below is a demonstration of how to get data for pre-season testing sessions.

In [16]:
# Setup plotting
plotting.setup_mpl()
# Enable the cache
ff1.Cache.enable_cache('cache') 

# Get rid of some pandas warnings that are not relevant for us at the moment
pd.options.mode.chained_assignment = None
In [17]:
# Load the first session of the first 2020 pre-season test
# (arguments: year, test number, session number)
test_session_2020 = ff1.get_testing_session(2020, 1, 1)
test_session_2020.load()
core           INFO 	Loading data for Pre-Season Test 1 - Practice 1 [v2.3.0]
api            INFO 	Using cached data for driver_info
api            INFO 	Using cached data for timing_data
api            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...
api            INFO 	Using cached data for session_status_data
api            INFO 	Using cached data for track_status_data
core        WARNING 	No tyre data for driver 65535
api            INFO 	Using cached data for car_data
api            INFO 	No cached data found for position_data. Loading data...
api            INFO 	Fetching position data...
core        WARNING 	Failed to load telemetry data!
api            INFO 	Using cached data for weather_data
api            INFO 	Using cached data for race_control_messages
core           INFO 	Finished loading data for 16 drivers: ['3', '6', '11', '16', '18', '20', '26', '31', '33', '44', '55', '63', '77', '88', '99', '65535']
In [18]:
test_session_2020.laps
Out[18]:
Time DriverNumber LapTime LapNumber PitOutTime PitInTime Sector1Time Sector2Time Sector3Time Sector1SessionTime ... IsPersonalBest Compound TyreLife FreshTyre Stint LapStartTime Team Driver TrackStatus IsAccurate
0 0 days 06:39:08.368000 3 NaT 1 0 days 06:37:18.857000 NaT NaT 0 days 00:00:37.065000 0 days 00:00:33.388000 NaT ... False MEDIUM 8.0 False 1.0 0 days 06:37:18.857000 Renault RIC 1 False
1 0 days 06:40:28.731000 3 0 days 00:01:20.363000 2 NaT NaT 0 days 00:00:22.744000 0 days 00:00:30.040000 0 days 00:00:27.579000 0 days 06:39:31.112000 ... False MEDIUM 9.0 False 1.0 0 days 06:39:08.368000 Renault RIC 1 True
2 0 days 06:41:47.755000 3 0 days 00:01:19.024000 3 NaT NaT 0 days 00:00:22.386000 0 days 00:00:29.385000 0 days 00:00:27.253000 0 days 06:40:51.117000 ... False MEDIUM 10.0 False 1.0 0 days 06:40:28.731000 Renault RIC 1 True
3 0 days 06:43:08.377000 3 0 days 00:01:20.622000 4 NaT NaT 0 days 00:00:22.949000 0 days 00:00:30.084000 0 days 00:00:27.589000 0 days 06:42:10.704000 ... False MEDIUM 11.0 False 1.0 0 days 06:41:47.755000 Renault RIC 1 True
4 0 days 06:44:29.285000 3 0 days 00:01:20.908000 5 NaT NaT 0 days 00:00:23.229000 0 days 00:00:30.112000 0 days 00:00:27.567000 0 days 06:43:31.606000 ... False MEDIUM 12.0 False 1.0 0 days 06:43:08.377000 Renault RIC 1 True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1356 0 days 09:10:26.838000 99 0 days 00:01:34.876000 78 NaT 0 days 09:10:24.176000 0 days 00:00:23.849000 0 days 00:00:35.867000 0 days 00:00:35.160000 0 days 09:09:15.811000 ... False HARD 19.0 True 8.0 0 days 09:08:51.962000 Alfa Romeo Racing GIO 1 False
1357 0 days 09:12:39.970000 99 0 days 00:02:13.132000 79 0 days 09:10:42.993000 0 days 09:12:39.467000 0 days 00:01:01.735000 0 days 00:00:35.692000 0 days 00:00:35.705000 0 days 09:11:28.573000 ... False HARD 20.0 False 9.0 0 days 09:10:26.838000 Alfa Romeo Racing GIO 1 False
1358 0 days 00:50:55.197000 65535 0 days 00:01:27.453000 1 NaT NaT NaT NaT NaT NaT ... False NaN False 1.0 NaT None GRO 1 False
1359 0 days 00:52:37.389000 65535 0 days 00:01:42.192000 2 0 days 00:51:15.977000 NaT 0 days 00:00:42.956000 0 days 00:00:29.840000 0 days 00:00:29.396000 0 days 00:51:38.153000 ... False NaN False 1.0 0 days 00:51:15.977000 None GRO 1 False
1360 0 days 00:54:06.177000 65535 NaT 3 0 days 05:09:34.459000 0 days 00:54:00.574000 0 days 00:00:22.842000 0 days 00:00:29.690000 0 days 00:00:36.225000 0 days 00:53:00.262000 ... False NaN False 1.0 0 days 05:09:34.459000 None GRO 1 False

1361 rows × 26 columns
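
To compare testing pace with race results later, the laps table can be reduced to one number per driver, such as the fastest lap. A minimal sketch on a hand-built frame that mimics three of the columns above (`Driver`, `LapTime`, `IsAccurate`); it assumes that laps flagged `IsAccurate == False`, such as in-laps and out-laps, should be ignored, which is a choice made here for illustration.

```python
import pandas as pd

# Hand-built stand-in for test_session_2020.laps, using values from the preview
laps = pd.DataFrame({
    "Driver": ["RIC", "RIC", "RIC", "GIO", "GIO"],
    "LapTime": pd.to_timedelta(
        [None, "00:01:20.363", "00:01:19.024", "00:01:34.876", "00:02:13.132"]
    ),
    "IsAccurate": [False, True, True, False, False],
})

# Drop laps without a time or flagged as inaccurate
accurate = laps[laps["IsAccurate"] & laps["LapTime"].notna()]

# Fastest remaining lap per driver, quickest first, also expressed in seconds
best = accurate.groupby("Driver")["LapTime"].min().sort_values()
best_seconds = best.dt.total_seconds()

print(best_seconds)
```

Drivers with no accurate lap (GIO in this toy frame) drop out of the result, so the real data should be checked for how many drivers survive the filter.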
