Flight Cancellation Predictor

Describe and Motivate a Real World Problem

Flights are cancelled every day, costing travelers thousands of hours and, in turn, millions of dollars worldwide, as airlines strive to maximize ticket sales while minimizing the number of flights they fly without selling out. As a result, almost everyone has had a flight cancelled at one time or another. What's more, some people lose hundreds of hours a year to cancelled flights, which affects both their jobs and their personal lives. My plan is to develop a machine learning model that identifies which flights are most likely to be cancelled before a flyer purchases a ticket, so that consumers can make better-informed decisions about which flights to book.

Further reading:

https://askwonder.com/research/us-based-airlines-airlines-pay-every-year-retribution-flight-delays-i76gkz9qd
https://www.usatoday.com/story/travel/airline-news/2022/09/01/flight-delay-canceled-airline-policy-consumer-dashboard/7937630001/
https://www.insurancejournal.com/news/national/2023/02/16/708174.htm

In [2]:
import pandas as pd
In [4]:
flights = pd.read_csv("flights.csv")
In [6]:
flights.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4310030 entries, 0 to 4310029
Data columns (total 31 columns):
 #   Column               Dtype  
---  ------               -----  
 0   YEAR                 int64  
 1   MONTH                int64  
 2   DAY                  float64
 3   DAY_OF_WEEK          float64
 4   AIRLINE              object 
 5   FLIGHT_NUMBER        float64
 6   TAIL_NUMBER          object 
 7   ORIGIN_AIRPORT       object 
 8   DESTINATION_AIRPORT  object 
 9   SCHEDULED_DEPARTURE  float64
 10  DEPARTURE_TIME       float64
 11  DEPARTURE_DELAY      float64
 12  TAXI_OUT             float64
 13  WHEELS_OFF           float64
 14  SCHEDULED_TIME       float64
 15  ELAPSED_TIME         float64
 16  AIR_TIME             float64
 17  DISTANCE             float64
 18  WHEELS_ON            float64
 19  TAXI_IN              float64
 20  SCHEDULED_ARRIVAL    float64
 21  ARRIVAL_TIME         float64
 22  ARRIVAL_DELAY        float64
 23  DIVERTED             float64
 24  CANCELLED            float64
 25  CANCELLATION_REASON  object 
 26  AIR_SYSTEM_DELAY     float64
 27  SECURITY_DELAY       float64
 28  AIRLINE_DELAY        float64
 29  LATE_AIRCRAFT_DELAY  float64
 30  WEATHER_DELAY        float64
dtypes: float64(24), int64(2), object(5)
memory usage: 1019.4+ MB
In [8]:
flights.columns
Out[8]:
Index(['YEAR', 'MONTH', 'DAY', 'DAY_OF_WEEK', 'AIRLINE', 'FLIGHT_NUMBER',
       'TAIL_NUMBER', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT',
       'SCHEDULED_DEPARTURE', 'DEPARTURE_TIME', 'DEPARTURE_DELAY', 'TAXI_OUT',
       'WHEELS_OFF', 'SCHEDULED_TIME', 'ELAPSED_TIME', 'AIR_TIME', 'DISTANCE',
       'WHEELS_ON', 'TAXI_IN', 'SCHEDULED_ARRIVAL', 'ARRIVAL_TIME',
       'ARRIVAL_DELAY', 'DIVERTED', 'CANCELLED', 'CANCELLATION_REASON',
       'AIR_SYSTEM_DELAY', 'SECURITY_DELAY', 'AIRLINE_DELAY',
       'LATE_AIRCRAFT_DELAY', 'WEATHER_DELAY'],
      dtype='object')
In [9]:
# Plain-language description of each column in flights.csv
feature_desc_dict = {"YEAR":"Year of the Flight Trip",
                    "MONTH":"Month of the Flight Trip",
                    "DAY":"Day of the Flight Trip",
                    "DAY_OF_WEEK":"Day of week of the Flight Trip",
                    "AIRLINE":"Airline Identifier",
                    "FLIGHT_NUMBER":"Flight Identifier",
                    "TAIL_NUMBER":"Aircraft Identifier",
                    "ORIGIN_AIRPORT":"Starting Airport",
                    "DESTINATION_AIRPORT":"Destination Airport",
                    "SCHEDULED_DEPARTURE":"Planned Departure Time",
                    "DEPARTURE_TIME":"WHEELS_OFF - TAXI_OUT",
                    "DEPARTURE_DELAY":"Total Delay on Departure",
                    "TAXI_OUT":"The time duration elapsed between departure from the origin airport gate and wheels off",
                    "WHEELS_OFF":"The time point that the aircraft's wheels leave the ground",
                    "SCHEDULED_TIME":"Planned time amount needed for the flight trip",
                    "ELAPSED_TIME":"AIR_TIME + TAXI_IN + TAXI_OUT",
                    "AIR_TIME":"The time duration between WHEELS_OFF and WHEELS_ON",
                    "DISTANCE":"Distance between two airports",
                    "WHEELS_ON":"The time point that the aircraft's wheels touch the ground",
                    "TAXI_IN":"The time duration elapsed between wheels on and arrival at the destination airport gate",
                    "SCHEDULED_ARRIVAL":"The scheduled time of arrival of the flight",
                    "ARRIVAL_TIME":"The time of arrival of the flight",
                    "ARRIVAL_DELAY":"The length of time a flight's arrival was delayed",
                    "DIVERTED":"Whether or not a flight was diverted",
                    "CANCELLED":"Whether or not a flight was cancelled",
                    "CANCELLATION_REASON":"Why a flight was cancelled",
                    "AIR_SYSTEM_DELAY":"Delay (in minutes) attributed to the air traffic system",
                    "SECURITY_DELAY":"Delay (in minutes) attributed to security concerns",
                    "AIRLINE_DELAY":"Delay (in minutes) attributed to the airline",
                    "LATE_AIRCRAFT_DELAY":"Delay (in minutes) attributed to the late arrival of the aircraft",
                    "WEATHER_DELAY":"Delay (in minutes) attributed to weather conditions"
                    }

Demonstrate that this data is sufficient to make progress on this real world problem

The key features for this problem are AIRLINE, ORIGIN_AIRPORT, DESTINATION_AIRPORT, MONTH, DAY, and DAY_OF_WEEK, all of which are known before a ticket is purchased. These features alone, coupled with roughly 4.3 million examples, should provide enough signal to estimate when cancellations are most likely. The CANCELLED column will serve as the target variable.
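
One way to check whether these columns carry any signal is to compare cancellation rates across groups. The cell below is a minimal sketch, not the analysis itself: `sample` is a hypothetical six-row frame with invented values that only borrows the column names from flights.csv, and `PRE_PURCHASE_FEATURES`, `TARGET`, and `cancellation_summary` are names introduced here for illustration.

```python
import pandas as pd

# Hypothetical toy rows using the same column names as flights.csv.
# The values are invented for illustration, not taken from the dataset.
sample = pd.DataFrame({
    "MONTH": [1, 1, 2, 2, 7, 7],
    "DAY": [5.0, 6.0, 14.0, 15.0, 4.0, 5.0],
    "DAY_OF_WEEK": [1.0, 2.0, 6.0, 7.0, 6.0, 7.0],
    "AIRLINE": ["AA", "AA", "DL", "DL", "AA", "DL"],
    "ORIGIN_AIRPORT": ["ORD", "ORD", "ATL", "ATL", "DFW", "ATL"],
    "DESTINATION_AIRPORT": ["LGA", "LGA", "BOS", "BOS", "ORD", "LGA"],
    "CANCELLED": [1.0, 0.0, 0.0, 0.0, 1.0, 0.0],
})

# Columns that are known before a ticket is purchased (assumed list).
PRE_PURCHASE_FEATURES = [
    "AIRLINE", "ORIGIN_AIRPORT", "DESTINATION_AIRPORT",
    "MONTH", "DAY", "DAY_OF_WEEK",
]
TARGET = "CANCELLED"


def cancellation_summary(df, by):
    """Count flights and compute the share cancelled for each value of `by`."""
    return (
        df.groupby(by)[TARGET]
        .agg(flights="size", cancel_rate="mean")
        .sort_values("cancel_rate", ascending=False)
    )


# Missing values per column, overall base rate, and a per-airline breakdown.
missing_counts = sample[PRE_PURCHASE_FEATURES + [TARGET]].isna().sum()
overall_rate = sample[TARGET].mean()
by_airline = cancellation_summary(sample, "AIRLINE")

print(f"overall cancellation rate: {overall_rate:.3f}")
print(by_airline)
```

On the real data, the same calls would presumably take `flights` in place of `sample`. Large differences in `cancel_rate` between groups with many flights would suggest that a feature is worth keeping.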

How will the data be used to solve the problem

Features in the dataset will be used as the training features for a classification algorithm, potentially logistic regression or naive Bayes, which will estimate which future flights are most likely to be cancelled.
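
As a rough illustration of the logistic regression option, the cell below one-hot encodes the categorical columns and fits a scikit-learn pipeline. It is a sketch under stated assumptions rather than the final model: `toy` is synthetic data generated from a made-up rule (flights from ORD and winter flights are cancelled more often), and the names `FEATURES`, `model`, and `candidates` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for flights.csv; only the column names match the real data.
rng = np.random.default_rng(0)
n = 2000
toy = pd.DataFrame({
    "AIRLINE": rng.choice(["AA", "DL", "UA"], size=n),
    "ORIGIN_AIRPORT": rng.choice(["ORD", "ATL", "DFW"], size=n),
    "DESTINATION_AIRPORT": rng.choice(["LGA", "BOS", "SFO"], size=n),
    "MONTH": rng.integers(1, 13, size=n),
    "DAY_OF_WEEK": rng.integers(1, 8, size=n),
})

# Made-up rule so the toy target has learnable structure.
p_cancel = (
    0.02
    + 0.15 * (toy["ORIGIN_AIRPORT"] == "ORD")
    + 0.15 * toy["MONTH"].isin([12, 1, 2])
)
toy["CANCELLED"] = (rng.random(n) < p_cancel).astype(int)

FEATURES = ["AIRLINE", "ORIGIN_AIRPORT", "DESTINATION_AIRPORT", "MONTH", "DAY_OF_WEEK"]
X_train, X_test, y_train, y_test = train_test_split(
    toy[FEATURES], toy["CANCELLED"],
    test_size=0.25, stratify=toy["CANCELLED"], random_state=0,
)

# Treat every feature as categorical; unseen categories are ignored at predict time.
model = Pipeline([
    ("encode", OneHotEncoder(handle_unknown="ignore")),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

# Rank-based score on held-out rows; plain accuracy is misleading for rare events.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# Score two hypothetical flights a shopper might be comparing.
candidates = pd.DataFrame({
    "AIRLINE": ["AA", "AA"],
    "ORIGIN_AIRPORT": ["ORD", "ATL"],
    "DESTINATION_AIRPORT": ["LGA", "LGA"],
    "MONTH": [1, 7],
    "DAY_OF_WEEK": [3, 3],
})
cancel_prob = model.predict_proba(candidates)[:, 1]
print(f"held-out AUC: {auc:.3f}")
print(candidates.assign(cancel_prob=cancel_prob))
```

The pipeline outputs a probability per flight rather than a hard label, which fits the goal of letting a consumer compare options before buying. A naive Bayes variant could be tried by swapping the `clf` step, though that is left as an assumption here.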