Formula 1 is one of the most popular and most watched sports in the world, with millions of fans following races across the globe. Winning a Formula 1 race requires not only a fast and reliable car but also a good strategy team (RIP Ferrari), a sharp pit crew, and, of course, luck. Predicting the winner is challenging because races are inherently chaotic and many variables can affect the outcome. This project aims to find insights and patterns within that chaos and to develop a machine learning model that predicts the winner of a Formula 1 race with decent accuracy. We will also look for correlations between performance in pre-season testing and performance in races.
We'll predict the race winner, or a driver's finishing position, from features such as starting grid position, standings position and points, qualifying time, driver age, the driver's number of wins, track weather and status, and year. We'll train several ML models and compare their performance.
# !pip install fastf1
import fastf1 as ff1
from fastf1 import plotting
import numpy as np
import pandas as pd
Below are 14 datasets covering Formula 1 races, drivers, constructors, qualifying, circuits, lap times, pit stops, and championships from 1950 through the 2023 season. This should be enough data to produce meaningful results. Some of these datasets are more useful than others, and several need to be cleaned and combined before modeling.
Features of the dataset:
df_0 = pd.read_csv('data/circuits.csv')
print(df_0.shape)
df_0.head()
(77, 9)
| | circuitId | circuitRef | name | location | country | lat | lng | alt | url |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | albert_park | Albert Park Grand Prix Circuit | Melbourne | Australia | -37.84970 | 144.96800 | 10 | http://en.wikipedia.org/wiki/Melbourne_Grand_P... |
1 | 2 | sepang | Sepang International Circuit | Kuala Lumpur | Malaysia | 2.76083 | 101.73800 | 18 | http://en.wikipedia.org/wiki/Sepang_Internatio... |
2 | 3 | bahrain | Bahrain International Circuit | Sakhir | Bahrain | 26.03250 | 50.51060 | 7 | http://en.wikipedia.org/wiki/Bahrain_Internati... |
3 | 4 | catalunya | Circuit de Barcelona-Catalunya | Montmeló | Spain | 41.57000 | 2.26111 | 109 | http://en.wikipedia.org/wiki/Circuit_de_Barcel... |
4 | 5 | istanbul | Istanbul Park | Istanbul | Turkey | 40.95170 | 29.40500 | 130 | http://en.wikipedia.org/wiki/Istanbul_Park |
Features of the dataset:
df_1 = pd.read_csv('data/status.csv')
print(df_1.shape)
df_1.head()
(139, 2)
| | statusId | status |
---|---|---|
0 | 1 | Finished |
1 | 2 | Disqualified |
2 | 3 | Accident |
3 | 4 | Collision |
4 | 5 | Engine |
Features of the dataset:
df_2 = pd.read_csv('data/lap_times.csv')
print(df_2.shape)
df_2.head()
(538121, 6)
| | raceId | driverId | lap | position | time | milliseconds |
---|---|---|---|---|---|---|
0 | 841 | 20 | 1 | 1 | 1:38.109 | 98109 |
1 | 841 | 20 | 2 | 1 | 1:33.006 | 93006 |
2 | 841 | 20 | 3 | 1 | 1:32.713 | 92713 |
3 | 841 | 20 | 4 | 1 | 1:32.803 | 92803 |
4 | 841 | 20 | 5 | 1 | 1:32.342 | 92342 |
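
Raw lap times become useful for modeling once they are summarized per driver and race. A minimal sketch, assuming the per-driver mean, median, and fastest lap are the summaries of interest (the `lap_summary` name and the choice of statistics are illustrative, not part of the dataset). The frame is rebuilt from the five rows shown above:

```python
import pandas as pd

# The five lap_times.csv rows shown above (raceId 841, driverId 20).
lap_times = pd.DataFrame({
    "raceId": [841] * 5,
    "driverId": [20] * 5,
    "lap": [1, 2, 3, 4, 5],
    "milliseconds": [98109, 93006, 92713, 92803, 92342],
})

# One row per (race, driver) with simple pace statistics.
lap_summary = (
    lap_times.groupby(["raceId", "driverId"])["milliseconds"]
    .agg(mean_ms="mean", median_ms="median", best_ms="min")
    .reset_index()
)
print(lap_summary)
```

The median is usually less sensitive than the mean to slow outliers such as the opening lap.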
Features of the dataset:
df_3 = pd.read_csv('data/sprint_results.csv')
print(df_3.shape)
df_3.head()
(120, 16)
| | resultId | raceId | driverId | constructorId | number | grid | position | positionText | positionOrder | points | laps | time | milliseconds | fastestLap | fastestLapTime | statusId |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1061 | 830 | 9 | 33 | 2 | 1 | 1 | 1 | 3 | 17 | 25:38.426 | 1538426 | 14 | 1:30.013 | 1 |
1 | 2 | 1061 | 1 | 131 | 44 | 1 | 2 | 2 | 2 | 2 | 17 | +1.430 | 1539856 | 17 | 1:29.937 | 1 |
2 | 3 | 1061 | 822 | 131 | 77 | 3 | 3 | 3 | 3 | 1 | 17 | +7.502 | 1545928 | 17 | 1:29.958 | 1 |
3 | 4 | 1061 | 844 | 6 | 16 | 4 | 4 | 4 | 4 | 0 | 17 | +11.278 | 1549704 | 16 | 1:30.163 | 1 |
4 | 5 | 1061 | 846 | 1 | 4 | 6 | 5 | 5 | 5 | 0 | 17 | +24.111 | 1562537 | 16 | 1:30.566 | 1 |
Features of the dataset:
df_4 = pd.read_csv('data/drivers.csv')
print(df_4.shape)
df_4.head()
(857, 9)
| | driverId | driverRef | number | code | forename | surname | dob | nationality | url |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | hamilton | 44 | HAM | Lewis | Hamilton | 1985-01-07 | British | http://en.wikipedia.org/wiki/Lewis_Hamilton |
1 | 2 | heidfeld | \N | HEI | Nick | Heidfeld | 1977-05-10 | German | http://en.wikipedia.org/wiki/Nick_Heidfeld |
2 | 3 | rosberg | 6 | ROS | Nico | Rosberg | 1985-06-27 | German | http://en.wikipedia.org/wiki/Nico_Rosberg |
3 | 4 | alonso | 14 | ALO | Fernando | Alonso | 1981-07-29 | Spanish | http://en.wikipedia.org/wiki/Fernando_Alonso |
4 | 5 | kovalainen | \N | KOV | Heikki | Kovalainen | 1981-10-19 | Finnish | http://en.wikipedia.org/wiki/Heikki_Kovalainen |
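
The `\N` entries in the `number` column are the dataset's marker for missing values, and by default pandas reads them as literal strings. One way to handle this, sketched on an inline two-row excerpt rather than the real file, is the `na_values` argument of `pd.read_csv`:

```python
import io
import pandas as pd

# Inline excerpt standing in for data/drivers.csv (raw string keeps the backslash).
csv_text = r"""driverId,driverRef,number,code
1,hamilton,44,HAM
2,heidfeld,\N,HEI
"""

# Treat the literal two-character token \N as a missing value.
drivers = pd.read_csv(io.StringIO(csv_text), na_values=["\\N"])
print(drivers.dtypes)
print(drivers)
```

With the marker mapped to `NaN`, `number` is parsed as a numeric column instead of as strings.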
Features of the dataset:
df_5 = pd.read_csv('data/races.csv')
print(df_5.shape)
df_5.head()
(1102, 18)
| | raceId | year | round | circuitId | name | date | time | url | fp1_date | fp1_time | fp2_date | fp2_time | fp3_date | fp3_time | quali_date | quali_time | sprint_date | sprint_time |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2009 | 1 | 1 | Australian Grand Prix | 2009-03-29 | 06:00:00 | http://en.wikipedia.org/wiki/2009_Australian_G... | \N | \N | \N | \N | \N | \N | \N | \N | \N | \N |
1 | 2 | 2009 | 2 | 2 | Malaysian Grand Prix | 2009-04-05 | 09:00:00 | http://en.wikipedia.org/wiki/2009_Malaysian_Gr... | \N | \N | \N | \N | \N | \N | \N | \N | \N | \N |
2 | 3 | 2009 | 3 | 17 | Chinese Grand Prix | 2009-04-19 | 07:00:00 | http://en.wikipedia.org/wiki/2009_Chinese_Gran... | \N | \N | \N | \N | \N | \N | \N | \N | \N | \N |
3 | 4 | 2009 | 4 | 3 | Bahrain Grand Prix | 2009-04-26 | 12:00:00 | http://en.wikipedia.org/wiki/2009_Bahrain_Gran... | \N | \N | \N | \N | \N | \N | \N | \N | \N | \N |
4 | 5 | 2009 | 5 | 4 | Spanish Grand Prix | 2009-05-10 | 12:00:00 | http://en.wikipedia.org/wiki/2009_Spanish_Gran... | \N | \N | \N | \N | \N | \N | \N | \N | \N | \N |
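
Driver age, one of the planned features, is not stored directly but follows from `dob` in drivers.csv and `date` in races.csv. A rough sketch, assuming age in years on race day is what is wanted. The `entries` frame pairing driver 1 with race 1 is a hypothetical stand-in for results.csv:

```python
import pandas as pd

drivers = pd.DataFrame({"driverId": [1], "dob": ["1985-01-07"]})
races = pd.DataFrame({"raceId": [1], "date": ["2009-03-29"]})
# Hypothetical stand-in for results.csv: which driver started which race.
entries = pd.DataFrame({"raceId": [1], "driverId": [1]})

df = entries.merge(drivers, on="driverId").merge(races, on="raceId")
df["dob"] = pd.to_datetime(df["dob"])
df["date"] = pd.to_datetime(df["date"])
# Approximate age in years, using 365.25 days per year.
df["age"] = (df["date"] - df["dob"]).dt.days / 365.25
print(df[["driverId", "raceId", "age"]])
```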
Features of the dataset:
df_6 = pd.read_csv('data/constructors.csv')
print(df_6.shape)
df_6.head()
(211, 5)
| | constructorId | constructorRef | name | nationality | url |
---|---|---|---|---|---|
0 | 1 | mclaren | McLaren | British | http://en.wikipedia.org/wiki/McLaren |
1 | 2 | bmw_sauber | BMW Sauber | German | http://en.wikipedia.org/wiki/BMW_Sauber |
2 | 3 | williams | Williams | British | http://en.wikipedia.org/wiki/Williams_Grand_Pr... |
3 | 4 | renault | Renault | French | http://en.wikipedia.org/wiki/Renault_in_Formul... |
4 | 5 | toro_rosso | Toro Rosso | Italian | http://en.wikipedia.org/wiki/Scuderia_Toro_Rosso |
Features of the dataset:
df_7 = pd.read_csv('data/constructor_standings.csv')
print(df_7.shape)
df_7.head()
(12941, 7)
| | constructorStandingsId | raceId | constructorId | points | position | positionText | wins |
---|---|---|---|---|---|---|---|
0 | 1 | 18 | 1 | 14.0 | 1 | 1 | 1 |
1 | 2 | 18 | 2 | 8.0 | 3 | 3 | 0 |
2 | 3 | 18 | 3 | 9.0 | 2 | 2 | 0 |
3 | 4 | 18 | 4 | 5.0 | 4 | 4 | 0 |
4 | 5 | 18 | 5 | 2.0 | 5 | 5 | 0 |
Features of the dataset:
df_8 = pd.read_csv('data/qualifying.csv')
print(df_8.shape)
df_8.head()
(9575, 9)
| | qualifyId | raceId | driverId | constructorId | number | position | q1 | q2 | q3 |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 18 | 1 | 1 | 22 | 1 | 1:26.572 | 1:25.187 | 1:26.714 |
1 | 2 | 18 | 9 | 2 | 4 | 2 | 1:26.103 | 1:25.315 | 1:26.869 |
2 | 3 | 18 | 5 | 1 | 23 | 3 | 1:25.664 | 1:25.452 | 1:27.079 |
3 | 4 | 18 | 13 | 6 | 2 | 4 | 1:25.994 | 1:25.691 | 1:27.178 |
4 | 5 | 18 | 2 | 2 | 3 | 5 | 1:25.960 | 1:25.518 | 1:27.236 |
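
The `q1`, `q2`, and `q3` columns are strings in `m:ss.mmm` form, so they need converting before they can serve as numeric features. A small sketch (the helper name `lap_time_to_ms` is made up for illustration) that handles lap-style times only, not full race times with an hours part, and treats the `\N` missing marker seen in other tables as `None`:

```python
from typing import Optional


def lap_time_to_ms(text: Optional[str]) -> Optional[int]:
    """Convert 'm:ss.mmm' or 'ss.mmm' to integer milliseconds."""
    if text is None:
        return None
    text = text.strip()
    if text in ("", "\\N"):
        return None
    # Split on the last colon; a missing minutes part counts as zero.
    minutes, _, rest = text.rpartition(":")
    seconds, _, frac = rest.partition(".")
    # Pad or truncate the fraction to exactly three digits.
    millis = int((frac + "000")[:3])
    return (int(minutes or 0) * 60 + int(seconds)) * 1000 + millis


print(lap_time_to_ms("1:26.572"))
print(lap_time_to_ms("\\N"))
```

String splitting is used instead of float arithmetic to avoid rounding surprises. As a sanity check, the `time` and `milliseconds` columns of lap_times.csv pair `1:38.109` with `98109`.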
Features of the dataset:
df_9 = pd.read_csv('data/driver_standings.csv')
print(df_9.shape)
df_9.head()
(33902, 7)
| | driverStandingsId | raceId | driverId | points | position | positionText | wins |
---|---|---|---|---|---|---|---|
0 | 1 | 18 | 1 | 10.0 | 1 | 1 | 1 |
1 | 2 | 18 | 2 | 8.0 | 2 | 2 | 0 |
2 | 3 | 18 | 3 | 6.0 | 3 | 3 | 0 |
3 | 4 | 18 | 4 | 5.0 | 4 | 4 | 0 |
4 | 5 | 18 | 5 | 4.0 | 5 | 5 | 0 |
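
One caution when using standings as features: the rows above appear to record the standings after each race (driver 1 already holds 10 points and 1 win at raceId 18, matching that driver's win in the results.csv rows for the same race), so using them directly to predict that same race would leak the outcome. A possible workaround, sketched with partly hypothetical numbers (the raceId 19 rows and the `year`/`round` values are invented for illustration), is to shift each driver's standings by one round within a season:

```python
import pandas as pd

# raceId 18 rows come from the table above; raceId 19 rows are hypothetical.
standings = pd.DataFrame({
    "raceId": [18, 18, 19, 19],
    "driverId": [1, 2, 1, 2],
    "points": [10.0, 8.0, 14.0, 16.0],
    "wins": [1, 0, 1, 1],
})
# Hypothetical season calendar giving the order of the two races.
races = pd.DataFrame({"raceId": [18, 19], "year": [2008, 2008], "round": [1, 2]})

df = standings.merge(races, on="raceId").sort_values(["driverId", "year", "round"])
grouped = df.groupby(["driverId", "year"])
# Standings as they stood before each race; zero at the season opener.
df["points_before"] = grouped["points"].shift(1).fillna(0.0)
df["wins_before"] = grouped["wins"].shift(1).fillna(0).astype(int)
print(df[["raceId", "driverId", "points_before", "wins_before"]])
```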
Features of the dataset:
df_10 = pd.read_csv('data/constructor_results.csv')
print(df_10.shape)
df_10.head()
(12170, 5)
| | constructorResultsId | raceId | constructorId | points | status |
---|---|---|---|---|---|
0 | 1 | 18 | 1 | 14.0 | \N |
1 | 2 | 18 | 2 | 8.0 | \N |
2 | 3 | 18 | 3 | 9.0 | \N |
3 | 4 | 18 | 4 | 5.0 | \N |
4 | 5 | 18 | 5 | 2.0 | \N |
Features of the dataset:
df_11 = pd.read_csv('data/pit_stops.csv')
print(df_11.shape)
df_11.head()
(9634, 7)
| | raceId | driverId | stop | lap | time | duration | milliseconds |
---|---|---|---|---|---|---|---|
0 | 841 | 153 | 1 | 1 | 17:05:23 | 26.898 | 26898 |
1 | 841 | 30 | 1 | 1 | 17:05:52 | 25.021 | 25021 |
2 | 841 | 17 | 1 | 11 | 17:20:48 | 23.426 | 23426 |
3 | 841 | 4 | 1 | 12 | 17:22:34 | 23.251 | 23251 |
4 | 841 | 13 | 1 | 13 | 17:24:10 | 23.842 | 23842 |
Features of the dataset:
df_12 = pd.read_csv('data/seasons.csv')
print(df_12.shape)
df_12.head()
(74, 2)
| | year | url |
---|---|---|
0 | 2009 | http://en.wikipedia.org/wiki/2009_Formula_One_... |
1 | 2008 | http://en.wikipedia.org/wiki/2008_Formula_One_... |
2 | 2007 | http://en.wikipedia.org/wiki/2007_Formula_One_... |
3 | 2006 | http://en.wikipedia.org/wiki/2006_Formula_One_... |
4 | 2005 | http://en.wikipedia.org/wiki/2005_Formula_One_... |
Features of the dataset:
df_13 = pd.read_csv('data/results.csv')
print(df_13.shape)
df_13.head()
(25840, 18)
| | resultId | raceId | driverId | constructorId | number | grid | position | positionText | positionOrder | points | laps | time | milliseconds | fastestLap | rank | fastestLapTime | fastestLapSpeed | statusId |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 18 | 1 | 1 | 22 | 1 | 1 | 1 | 1 | 10.0 | 58 | 1:34:50.616 | 5690616 | 39 | 2 | 1:27.452 | 218.300 | 1 |
1 | 2 | 18 | 2 | 2 | 3 | 5 | 2 | 2 | 2 | 8.0 | 58 | +5.478 | 5696094 | 41 | 3 | 1:27.739 | 217.586 | 1 |
2 | 3 | 18 | 3 | 3 | 7 | 7 | 3 | 3 | 3 | 6.0 | 58 | +8.163 | 5698779 | 41 | 5 | 1:28.090 | 216.719 | 1 |
3 | 4 | 18 | 4 | 4 | 5 | 11 | 4 | 4 | 4 | 5.0 | 58 | +17.181 | 5707797 | 58 | 7 | 1:28.603 | 215.464 | 1 |
4 | 5 | 18 | 5 | 1 | 23 | 3 | 5 | 5 | 5 | 4.0 | 58 | +18.014 | 5708630 | 43 | 1 | 1:27.418 | 218.385 | 1 |
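
Combining these tables for modeling comes down to joins on the shared ID columns. A minimal sketch, using excerpts of the rows shown above and assuming a binary `won` target derived from `positionOrder` (the `model_table` name and that target definition are illustrative choices, not a fixed design):

```python
import pandas as pd

# Excerpts of results.csv, drivers.csv and status.csv as shown above.
results = pd.DataFrame({
    "raceId": [18, 18, 18],
    "driverId": [1, 2, 3],
    "grid": [1, 5, 7],
    "positionOrder": [1, 2, 3],
    "statusId": [1, 1, 1],
})
drivers = pd.DataFrame({
    "driverId": [1, 2, 3],
    "driverRef": ["hamilton", "heidfeld", "rosberg"],
})
status = pd.DataFrame({
    "statusId": [1, 2, 3],
    "status": ["Finished", "Disqualified", "Accident"],
})

# Left joins keep every result row; validate guards against duplicate keys.
model_table = (
    results
    .merge(drivers, on="driverId", how="left", validate="many_to_one")
    .merge(status, on="statusId", how="left", validate="many_to_one")
)
# Binary target: did this driver win the race?
model_table["won"] = model_table["positionOrder"].eq(1)
print(model_table)
```

The `validate` argument makes the merge raise if a lookup table unexpectedly contains duplicate IDs, which would otherwise silently multiply rows.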
Below is a demonstration of how to get data for pre-season testing sessions.
# Setup plotting
plotting.setup_mpl()
# Enable the cache (with the FastF1 v2.3 used here, the 'cache' directory must already exist)
ff1.Cache.enable_cache('cache')
# Get rid of some pandas warnings that are not relevant for us at the moment
pd.options.mode.chained_assignment = None
# Load the first session (day 1) of the first 2020 pre-season test event;
# the arguments are (year, test_number, session_number)
test_session_2020 = ff1.get_testing_session(2020, 1, 1)
test_session_2020.load()
core INFO Loading data for Pre-Season Test 1 - Practice 1 [v2.3.0]
api INFO Using cached data for driver_info
api INFO Using cached data for timing_data
api INFO Using cached data for timing_app_data
core INFO Processing timing data...
api INFO Using cached data for session_status_data
api INFO Using cached data for track_status_data
core WARNING No tyre data for driver 65535
api INFO Using cached data for car_data
api INFO No cached data found for position_data. Loading data...
api INFO Fetching position data...
core WARNING Failed to load telemetry data!
api INFO Using cached data for weather_data
api INFO Using cached data for race_control_messages
core INFO Finished loading data for 16 drivers: ['3', '6', '11', '16', '18', '20', '26', '31', '33', '44', '55', '63', '77', '88', '99', '65535']
test_session_2020.laps
| | Time | DriverNumber | LapTime | LapNumber | PitOutTime | PitInTime | Sector1Time | Sector2Time | Sector3Time | Sector1SessionTime | ... | IsPersonalBest | Compound | TyreLife | FreshTyre | Stint | LapStartTime | Team | Driver | TrackStatus | IsAccurate |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 days 06:39:08.368000 | 3 | NaT | 1 | 0 days 06:37:18.857000 | NaT | NaT | 0 days 00:00:37.065000 | 0 days 00:00:33.388000 | NaT | ... | False | MEDIUM | 8.0 | False | 1.0 | 0 days 06:37:18.857000 | Renault | RIC | 1 | False |
1 | 0 days 06:40:28.731000 | 3 | 0 days 00:01:20.363000 | 2 | NaT | NaT | 0 days 00:00:22.744000 | 0 days 00:00:30.040000 | 0 days 00:00:27.579000 | 0 days 06:39:31.112000 | ... | False | MEDIUM | 9.0 | False | 1.0 | 0 days 06:39:08.368000 | Renault | RIC | 1 | True |
2 | 0 days 06:41:47.755000 | 3 | 0 days 00:01:19.024000 | 3 | NaT | NaT | 0 days 00:00:22.386000 | 0 days 00:00:29.385000 | 0 days 00:00:27.253000 | 0 days 06:40:51.117000 | ... | False | MEDIUM | 10.0 | False | 1.0 | 0 days 06:40:28.731000 | Renault | RIC | 1 | True |
3 | 0 days 06:43:08.377000 | 3 | 0 days 00:01:20.622000 | 4 | NaT | NaT | 0 days 00:00:22.949000 | 0 days 00:00:30.084000 | 0 days 00:00:27.589000 | 0 days 06:42:10.704000 | ... | False | MEDIUM | 11.0 | False | 1.0 | 0 days 06:41:47.755000 | Renault | RIC | 1 | True |
4 | 0 days 06:44:29.285000 | 3 | 0 days 00:01:20.908000 | 5 | NaT | NaT | 0 days 00:00:23.229000 | 0 days 00:00:30.112000 | 0 days 00:00:27.567000 | 0 days 06:43:31.606000 | ... | False | MEDIUM | 12.0 | False | 1.0 | 0 days 06:43:08.377000 | Renault | RIC | 1 | True |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1356 | 0 days 09:10:26.838000 | 99 | 0 days 00:01:34.876000 | 78 | NaT | 0 days 09:10:24.176000 | 0 days 00:00:23.849000 | 0 days 00:00:35.867000 | 0 days 00:00:35.160000 | 0 days 09:09:15.811000 | ... | False | HARD | 19.0 | True | 8.0 | 0 days 09:08:51.962000 | Alfa Romeo Racing | GIO | 1 | False |
1357 | 0 days 09:12:39.970000 | 99 | 0 days 00:02:13.132000 | 79 | 0 days 09:10:42.993000 | 0 days 09:12:39.467000 | 0 days 00:01:01.735000 | 0 days 00:00:35.692000 | 0 days 00:00:35.705000 | 0 days 09:11:28.573000 | ... | False | HARD | 20.0 | False | 9.0 | 0 days 09:10:26.838000 | Alfa Romeo Racing | GIO | 1 | False |
1358 | 0 days 00:50:55.197000 | 65535 | 0 days 00:01:27.453000 | 1 | NaT | NaT | NaT | NaT | NaT | NaT | ... | False | | NaN | False | 1.0 | NaT | None | GRO | 1 | False |
1359 | 0 days 00:52:37.389000 | 65535 | 0 days 00:01:42.192000 | 2 | 0 days 00:51:15.977000 | NaT | 0 days 00:00:42.956000 | 0 days 00:00:29.840000 | 0 days 00:00:29.396000 | 0 days 00:51:38.153000 | ... | False | | NaN | False | 1.0 | 0 days 00:51:15.977000 | None | GRO | 1 | False |
1360 | 0 days 00:54:06.177000 | 65535 | NaT | 3 | 0 days 05:09:34.459000 | 0 days 00:54:00.574000 | 0 days 00:00:22.842000 | 0 days 00:00:29.690000 | 0 days 00:00:36.225000 | 0 days 00:53:00.262000 | ... | False | | NaN | False | 1.0 | 0 days 05:09:34.459000 | None | GRO | 1 | False |
1361 rows × 26 columns
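
For the planned comparison between testing and race pace, each driver's testing session has to be reduced to a comparable number. One simple option, sketched here on a hand-built frame that mirrors a few of the rows above instead of calling FastF1, is each driver's fastest lap among those flagged `IsAccurate`:

```python
import pandas as pd

# Hand-copied subset of the laps shown above (RIC rows 0-4, GIO rows 1356-1357).
laps = pd.DataFrame({
    "Driver": ["RIC", "RIC", "RIC", "RIC", "RIC", "GIO", "GIO"],
    "LapTime": pd.to_timedelta([
        None, "00:01:20.363", "00:01:19.024", "00:01:20.622",
        "00:01:20.908", "00:01:34.876", "00:02:13.132",
    ]),
    "IsAccurate": [False, True, True, True, True, False, False],
})

# Keep only timed laps that are flagged as accurate.
accurate = laps[laps["IsAccurate"] & laps["LapTime"].notna()]
# Fastest accurate lap per driver, quickest first.
best = accurate.groupby("Driver")["LapTime"].min().sort_values()
print(best)
```

In this subset GIO has no accurate laps and so drops out. Testing times also depend on fuel load and run plan, so a single fastest lap is only a rough proxy for pace.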