Flights are cancelled every day, costing travelers around the world thousands of hours and millions of dollars, as airlines try to maximize ticket sales while minimizing the number of flights they fly below full capacity. As a result, almost everyone has had a flight cancelled at one time or another. What's more, some people lose hundreds of hours a year to cancelled flights, which affects their jobs and personal lives. My plan is to develop a machine learning model that estimates which flights are most likely to be cancelled before a flyer purchases a ticket, so that consumers can make better-informed decisions about which flights to buy.
Further reading:
https://askwonder.com/research/us-based-airlines-airlines-pay-every-year-retribution-flight-delays-i76gkz9qd
https://www.usatoday.com/story/travel/airline-news/2022/09/01/flight-delay-canceled-airline-policy-consumer-dashboard/7937630001/
https://www.insurancejournal.com/news/national/2023/02/16/708174.htm
import pandas as pd
flights = pd.read_csv("flights.csv")
flights.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4310030 entries, 0 to 4310029
Data columns (total 31 columns):
 #   Column               Dtype
---  ------               -----
 0   YEAR                 int64
 1   MONTH                int64
 2   DAY                  float64
 3   DAY_OF_WEEK          float64
 4   AIRLINE              object
 5   FLIGHT_NUMBER        float64
 6   TAIL_NUMBER          object
 7   ORIGIN_AIRPORT       object
 8   DESTINATION_AIRPORT  object
 9   SCHEDULED_DEPARTURE  float64
 10  DEPARTURE_TIME       float64
 11  DEPARTURE_DELAY      float64
 12  TAXI_OUT             float64
 13  WHEELS_OFF           float64
 14  SCHEDULED_TIME       float64
 15  ELAPSED_TIME         float64
 16  AIR_TIME             float64
 17  DISTANCE             float64
 18  WHEELS_ON            float64
 19  TAXI_IN              float64
 20  SCHEDULED_ARRIVAL    float64
 21  ARRIVAL_TIME         float64
 22  ARRIVAL_DELAY        float64
 23  DIVERTED             float64
 24  CANCELLED            float64
 25  CANCELLATION_REASON  object
 26  AIR_SYSTEM_DELAY     float64
 27  SECURITY_DELAY       float64
 28  AIRLINE_DELAY        float64
 29  LATE_AIRCRAFT_DELAY  float64
 30  WEATHER_DELAY        float64
dtypes: float64(24), int64(2), object(5)
memory usage: 1019.4+ MB
flights.columns
Index(['YEAR', 'MONTH', 'DAY', 'DAY_OF_WEEK', 'AIRLINE', 'FLIGHT_NUMBER', 'TAIL_NUMBER', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT', 'SCHEDULED_DEPARTURE', 'DEPARTURE_TIME', 'DEPARTURE_DELAY', 'TAXI_OUT', 'WHEELS_OFF', 'SCHEDULED_TIME', 'ELAPSED_TIME', 'AIR_TIME', 'DISTANCE', 'WHEELS_ON', 'TAXI_IN', 'SCHEDULED_ARRIVAL', 'ARRIVAL_TIME', 'ARRIVAL_DELAY', 'DIVERTED', 'CANCELLED', 'CANCELLATION_REASON', 'AIR_SYSTEM_DELAY', 'SECURITY_DELAY', 'AIRLINE_DELAY', 'LATE_AIRCRAFT_DELAY', 'WEATHER_DELAY'], dtype='object')
feature_desc_dict = {"YEAR":"Year of the Flight Trip",
    "MONTH":"Month of the Flight Trip",
    "DAY":"Day of the Flight Trip",
    "DAY_OF_WEEK":"Day of week of the Flight Trip",
    "AIRLINE":"Airline Identifier",
    "FLIGHT_NUMBER":"Flight Identifier",
    "TAIL_NUMBER":"Aircraft Identifier",
    "ORIGIN_AIRPORT":"Starting Airport",
    "DESTINATION_AIRPORT":"Destination Airport",
    "SCHEDULED_DEPARTURE":"Planned Departure Time",
    "DEPARTURE_TIME":"WHEELS_OFF - TAXI_OUT",
    "DEPARTURE_DELAY":"Total Delay on Departure",
    "TAXI_OUT":"The time duration elapsed between departure from the origin airport gate and wheels off",
    "WHEELS_OFF":"The time point that the aircraft's wheels leave the ground",
    "SCHEDULED_TIME":"Planned time amount needed for the flight trip",
    "ELAPSED_TIME":"AIR_TIME+TAXI_IN+TAXI_OUT",
    "AIR_TIME":"The time duration between WHEELS_OFF and WHEELS_ON time",
    "DISTANCE":"Distance between two airports",
    "WHEELS_ON":"The time point that the aircraft's wheels touch the ground",
    "TAXI_IN":"The time duration elapsed between wheels on and arrival at the destination airport gate",
    "SCHEDULED_ARRIVAL":"The scheduled time of arrival of the flight",
    "ARRIVAL_TIME":"The time of arrival of the flight",
    "ARRIVAL_DELAY":"The length of time a flight's arrival was delayed",
    "DIVERTED":"Whether or not a flight was diverted",
    "CANCELLED":"Whether or not a flight was cancelled",
    "CANCELLATION_REASON":"Why a flight was cancelled",
    "AIR_SYSTEM_DELAY":"Delay in minutes caused by the air system (e.g. air traffic control)",
    "SECURITY_DELAY":"Delay in minutes caused by security concerns",
    "AIRLINE_DELAY":"Delay in minutes caused by the airline",
    "LATE_AIRCRAFT_DELAY":"Delay in minutes caused by the late arrival of the aircraft",
    "WEATHER_DELAY":"Delay in minutes caused by weather conditions"
}
The key features for this problem will be AIRLINE, ORIGIN_AIRPORT, DESTINATION_AIRPORT, MONTH, DAY, and DAY_OF_WEEK, all of which are known before a ticket is purchased. Together with more than four million examples, these features should give a model enough signal to estimate when cancellations are most likely. The CANCELLED column will serve as the target variable.
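As a rough sketch of that selection step, the snippet below keeps only columns that are known at booking time and separates out the target. The `sample` rows are made-up placeholders rather than data from flights.csv, and `split_features_target` is a hypothetical helper name. Columns such as DEPARTURE_DELAY are left out on purpose, since they are only observed after the flight and would leak the outcome into the model.

```python
import pandas as pd

# Made-up placeholder rows; in the notebook this would be the `flights` DataFrame.
sample = pd.DataFrame({
    "AIRLINE": ["AA", "DL", "AA", "UA"],
    "ORIGIN_AIRPORT": ["JFK", "ATL", "LAX", "ORD"],
    "DESTINATION_AIRPORT": ["LAX", "JFK", "JFK", "ATL"],
    "MONTH": [1, 1, 2, 2],
    "DAY": [5.0, 6.0, 7.0, 8.0],
    "DAY_OF_WEEK": [1.0, 2.0, 6.0, 7.0],
    "DEPARTURE_DELAY": [3.0, None, 12.0, -2.0],  # only known after the fact
    "CANCELLED": [0.0, 1.0, 0.0, 0.0],
})

feature_cols = ["AIRLINE", "ORIGIN_AIRPORT", "DESTINATION_AIRPORT",
                "MONTH", "DAY", "DAY_OF_WEEK"]
target_col = "CANCELLED"


def split_features_target(df, feature_cols, target_col):
    """Return (X, y) using only the chosen columns, dropping rows with gaps in them."""
    subset = df[feature_cols + [target_col]].dropna()
    X = subset[feature_cols]
    y = subset[target_col].astype(int)
    return X, y


X, y = split_features_target(sample, feature_cols, target_col)
cancel_rate = y.mean()  # share of cancelled flights, useful for spotting class imbalance
print(X.shape, cancel_rate)
```

Checking `cancel_rate` on the real data early on is worthwhile, because a heavily imbalanced target affects both the choice of model settings and the choice of evaluation metric.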
These features will be used to train a classification algorithm, potentially logistic regression or naive Bayes, to identify which flights are most likely to be cancelled in the future.
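One possible shape for the logistic regression option is sketched below with scikit-learn, which is an assumption here since the notebook has not yet chosen a library. The `toy` rows are invented for illustration and say nothing about real cancellation patterns. Treating MONTH and DAY_OF_WEEK as categories and setting `class_weight="balanced"` are also assumptions, the latter on the expectation that cancelled flights are a small minority of the data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Invented toy rows for illustration only; not drawn from flights.csv.
toy = pd.DataFrame({
    "AIRLINE": ["AA", "AA", "DL", "DL", "UA", "UA", "AA", "DL"],
    "ORIGIN_AIRPORT": ["ORD", "ORD", "ATL", "ATL", "SFO", "SFO", "ORD", "ATL"],
    "DESTINATION_AIRPORT": ["JFK", "LAX", "JFK", "LAX", "JFK", "LAX", "JFK", "SFO"],
    "MONTH": [1, 1, 2, 2, 7, 7, 1, 7],
    "DAY_OF_WEEK": [1, 2, 3, 4, 5, 6, 7, 1],
    "CANCELLED": [1, 1, 0, 0, 0, 0, 1, 0],
})

categorical = ["AIRLINE", "ORIGIN_AIRPORT", "DESTINATION_AIRPORT",
               "MONTH", "DAY_OF_WEEK"]

model = Pipeline([
    # One-hot encode every feature; unseen categories at predict time are ignored.
    ("encode", ColumnTransformer([
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])),
    # Reweight classes so the rarer class is not drowned out.
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

model.fit(toy[categorical], toy["CANCELLED"])

# Estimated probability of cancellation for each row (column 1 is the CANCELLED == 1 class).
proba = model.predict_proba(toy[categorical])[:, 1]
print(proba.round(2))
```

On the full dataset the model would be fit on a training split and scored on held-out flights, with a metric such as precision and recall rather than plain accuracy if cancellations turn out to be rare.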