MTA Ridership Data¶

Motivation¶

Problem¶

The MTA (Metropolitan Transportation Authority) is the New York State authority that runs public transport in the New York City metropolitan area. This includes the subway, Metro-North Railroad, the Long Island Rail Road, and the buses. The efficiency of the MTA is crucial for the city to run smoothly. Like most public transport systems, however, it is not perfect, and delays are common. Recently the problem has felt even worse, with negative news articles, such as "Crew Shortages cause Delays", becoming more common.

Solution¶

Using data on delays, it is possible to see which lines and which months have had the most delays, and outliers should stand out. Ridership data is also useful: it shows how ridership fluctuates between days, months, and even years. Data on both is readily available from the MTA's and New York State's databases. It will also be useful to cross-reference trends and outliers with major outside events and other sources (one example is the COVID-19 pandemic and how it impacted ridership).

Datasets¶

For this project I am using two datasets, both from New York State. The first is daily ridership data for the past 3 years (Daily Ridership Data). The second reports delays for every month over the past 3 years (Delays Dataset).

In [1]:
#load the datasets
import pandas as pd

daily_ridership_df = pd.read_csv('MTA_Daily_Ridership_Data__Beginning_2020.csv')
delays_df = pd.read_csv('MTA_Subway_Service_Delivered__Beginning_2020.csv')

Daily Ridership Data¶

In [2]:
#ridership data
daily_ridership_df.head()
Out[2]:
| | Date | Subways: Total Estimated Ridership | Subways: % of Comparable Pre-Pandemic Day | Buses: Total Estimated Ridership | Buses: % of Comparable Pre-Pandemic Day | LIRR: Total Estimated Ridership | LIRR: % of 2019 Monthly Weekday/Saturday/Sunday Average | Metro-North: Total Estimated Ridership | Metro-North: % of 2019 Monthly Weekday/Saturday/Sunday Average | Access-A-Ride: Total Scheduled Trips | Access-A-Ride: % of Comparable Pre-Pandemic Day | Bridges and Tunnels: Total Traffic | Bridges and Tunnels: % of Comparable Pre-Pandemic Day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 02/23/2023 | 3499940 | 0.64 | 898699 | 0.42 | 188438.0 | 0.62 | 167193.0 | 0.62 | 27336 | 0.93 | 906322 | 1.03 |
| 1 | 02/22/2023 | 3458490 | 0.64 | 1010427 | 0.47 | 193753.0 | 0.64 | 171187.0 | 0.64 | 27895 | 0.95 | 869960 | 0.98 |
| 2 | 02/21/2023 | 3330546 | 0.61 | 1001993 | 0.47 | 194967.0 | 0.64 | 174283.0 | 0.65 | 27224 | 0.93 | 864047 | 0.98 |
| 3 | 02/20/2023 | 2239470 | 1.02 | 713171 | 0.73 | 90371.0 | 1.16 | 79984.0 | 0.88 | 14818 | 0.50 | 809578 | 1.07 |
| 4 | 02/19/2023 | 1824881 | 0.83 | 514773 | 0.53 | 75448.0 | 0.97 | 75064.0 | 0.82 | 16165 | 0.96 | 784650 | 1.04 |
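The `Date` column in the output above is stored as `MM/DD/YYYY` text and the rows run newest first, which is awkward for time-series work. A minimal sketch of one way to handle this, assuming the column names shown above; the small `sample` frame is copied from the first two columns of that output and stands in for `daily_ridership_df`:

```python
import pandas as pd

# Rows copied from the head() output above (first two columns only).
# In the notebook itself, daily_ridership_df would be used instead.
sample = pd.DataFrame({
    "Date": ["02/23/2023", "02/22/2023", "02/21/2023", "02/20/2023", "02/19/2023"],
    "Subways: Total Estimated Ridership": [3499940, 3458490, 3330546, 2239470, 1824881],
})

# Parse the MM/DD/YYYY strings into real timestamps.
sample["Date"] = pd.to_datetime(sample["Date"], format="%m/%d/%Y")

# Sort oldest-first and index by date so the data is ready for plotting or resampling.
ridership_by_date = sample.sort_values("Date").set_index("Date")
print(ridership_by_date)
```

With a datetime index in place, later steps such as grouping by month or weekday become one-liners.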

Ridership Data Dictionary¶

Delays Data¶

In [3]:
delays_df.head()
Out[3]:
| | month | division | line | day_type | num_sched_trains | num_actual_trains | service delivered |
|---|---|---|---|---|---|---|---|
| 0 | 2020-01 | A DIVISION | 1 | 1 | 1826 | 1773 | 0.970975 |
| 1 | 2020-01 | A DIVISION | 1 | 2 | 958 | 952 | 0.993737 |
| 2 | 2020-01 | A DIVISION | 2 | 1 | 2420 | 2322 | 0.959504 |
| 3 | 2020-01 | A DIVISION | 2 | 2 | 1866 | 1836 | 0.983923 |
| 4 | 2020-01 | A DIVISION | 3 | 1 | 2244 | 2174 | 0.968806 |
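In the rows shown above, `service delivered` appears to be the number of trains that actually ran divided by the number scheduled. The sketch below checks that reading against the five sample rows. It is a sanity check on the sample only, not a statement of how the MTA defines the metric; `delays_sample` is a stand-in for `delays_df`, and storing `line` as text is an assumption.

```python
import pandas as pd

# Rows copied from the head() output above.
delays_sample = pd.DataFrame({
    "month": ["2020-01"] * 5,
    "division": ["A DIVISION"] * 5,
    "line": ["1", "1", "2", "2", "3"],  # assumption: line labels are stored as text
    "day_type": [1, 2, 1, 2, 1],
    "num_sched_trains": [1826, 958, 2420, 1866, 2244],
    "num_actual_trains": [1773, 952, 2322, 1836, 2174],
    "service delivered": [0.970975, 0.993737, 0.959504, 0.983923, 0.968806],
})

# Recompute the ratio of actual to scheduled trains for each row.
recomputed = delays_sample["num_actual_trains"] / delays_sample["num_sched_trains"]

# Largest absolute difference between the recomputed and the reported value.
max_gap = (recomputed - delays_sample["service delivered"]).abs().max()
print(f"largest difference: {max_gap:.2e}")
```

The difference stays within rounding error for these rows, so lower values of `service delivered` can be read as more scheduled trains failing to run.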

Delays Data Dictionary¶

Methods¶

Using the ridership data, I can create a line graph of the number of subway riders on each day. From there, I can look for outliers and cross-reference those days with outside sources to find potential reasons for the significant change. A line graph can also be created for the delays data and used to spot outliers, namely months with an extremely low service delivered value (the share of scheduled trains that actually ran, which serves here as a stand-in for delays). Grouping by line and comparing each line's average service delivered could help identify which lines are the worst in terms of delays.
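The grouping and outlier steps described above can be sketched as follows. The `toy` frame holds hypothetical values made up for illustration (they are not real MTA figures) and only mirrors three columns of the delays data. The 1.5 × IQR rule used to flag low months is one common convention chosen for this sketch, not something prescribed by the dataset.

```python
import pandas as pd

# Hypothetical values for illustration only; not real MTA figures.
toy = pd.DataFrame({
    "month": ["2020-01", "2020-02", "2020-03", "2020-01", "2020-02", "2020-03"],
    "line": ["1", "1", "1", "2", "2", "2"],
    "service delivered": [0.97, 0.96, 0.80, 0.95, 0.94, 0.93],
})

# Average service delivered per line, worst line first.
line_ranking = toy.groupby("line")["service delivered"].mean().sort_values()

# Flag rows that fall more than 1.5 * IQR below the first quartile.
q1 = toy["service delivered"].quantile(0.25)
q3 = toy["service delivered"].quantile(0.75)
lower_fence = q1 - 1.5 * (q3 - q1)
low_months = toy[toy["service delivered"] < lower_fence]

print(line_ranking)
print(low_months)
```

On the real data, the same two steps would run on `delays_df`, and `line_ranking.plot(kind="bar")` would give a quick visual comparison (this needs matplotlib installed).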

Potential Problems¶

One potential problem is that the ridership data does not break ridership down by subway line, so another dataset may be needed to look at individual lines in more detail. Also, the COVID-19 pandemic will skew a lot of the data, which may not represent the (mostly) COVID-free environment now. Finally, the delays data is reported only by month, so it will not be possible to determine the reasons for delays using that dataset alone.
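Because the two datasets use different time granularities, one way to compare them is to roll the daily ridership up to months and join on a `YYYY-MM` key, matching the format of the `month` column in the delays data. The sketch below uses small hypothetical frames (`daily` and `delays`, with made-up numbers) in place of the real ones. Summing trains across lines before recomputing the ratio is an assumption about how a system-wide monthly figure might be built.

```python
import pandas as pd

# Hypothetical daily ridership; values are made up for illustration.
daily = pd.DataFrame({
    "Date": pd.to_datetime(["01/30/2020", "01/31/2020", "02/01/2020"], format="%m/%d/%Y"),
    "Subways: Total Estimated Ridership": [100, 200, 400],
})

# Hypothetical per-line monthly delays rows; values are made up for illustration.
delays = pd.DataFrame({
    "month": ["2020-01", "2020-01", "2020-02", "2020-02"],
    "line": ["1", "2", "1", "2"],
    "num_sched_trains": [10, 10, 10, 10],
    "num_actual_trains": [9, 10, 8, 8],
})

# Build a YYYY-MM key and total the daily ridership per month.
daily["month"] = daily["Date"].dt.strftime("%Y-%m")
monthly_ridership = daily.groupby("month", as_index=False)["Subways: Total Estimated Ridership"].sum()

# Total trains across lines per month, then recompute a system-wide ratio.
monthly_service = delays.groupby("month", as_index=False)[["num_sched_trains", "num_actual_trains"]].sum()
monthly_service["service delivered"] = (
    monthly_service["num_actual_trains"] / monthly_service["num_sched_trains"]
)

# One row per month with both ridership and service delivered.
combined = monthly_ridership.merge(monthly_service, on="month", how="inner")
print(combined)
```

This aligns the two sources for trend comparison, but it does not recover the causes of delays, which remains a limitation of the monthly data.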