Uber/Lyft Trips Project

Recently I read a business case study about Uber Pool that highlighted a management trade-off: decrease wait time (and so increase customer satisfaction) or maximize profits. Finding the balance between these two business objectives seemed interesting to me. I have also experienced Uber's and Lyft's surge pricing, where the price changes depending on the day and time. For example, a trip to the airport might cost $20 in the early morning, yet the same trip can cost double at another time of day. Related links: Uber surge pricing; Uber Pool Strategy

Based on this case and my experience, I chose a dataset that contains both Uber and Lyft trip data. I want to explore how Uber and Lyft price their services and whether I can predict the price of a trip to find the best time to travel. Because the two companies provide similar services, comparing and contrasting their trip data should also be interesting.

In [1]:
import pandas as pd

df_trip = pd.read_csv(r'/Users/freyali/Downloads/archive 2/cab_rides.csv')
df_trip.head()
Out[1]:
   distance  cab_type  time_stamp     destination    source            price  surge_multiplier  id                                    product_id    name
0  0.44      Lyft      1544952607890  North Station  Haymarket Square  5.0    1.0               424553bb-7174-41ea-aeb4-fe06d4f4b9d7  lyft_line     Shared
1  0.44      Lyft      1543284023677  North Station  Haymarket Square  11.0   1.0               4bd23055-6827-41c6-b23b-3c491f24e74d  lyft_premier  Lux
2  0.44      Lyft      1543366822198  North Station  Haymarket Square  7.0    1.0               981a3613-77af-4620-a42a-0c0866077d1e  lyft          Lyft
3  0.44      Lyft      1543553582749  North Station  Haymarket Square  26.0   1.0               c2d88af2-d278-4bfd-a8d0-29ca77cc5512  lyft_luxsuv   Lux Black XL
4  0.44      Lyft      1543463360223  North Station  Haymarket Square  9.0    1.0               e0126e1f-8ca9-4f2e-82b3-50505a09db9a  lyft_plus     Lyft XL

df_trip Data Dictionary:

  • distance (float): distance in miles one trip covers
  • cab_type (str): app used to call the ride
  • time_stamp (int): time the trip was requested, as a Unix epoch timestamp in milliseconds
  • destination (str): name of destination
  • source (str): name of pickup point
  • price (float): price of a trip
  • surge_multiplier (float): indicates if there was a price surge and by how much
  • id (str): trip identifier
  • product_id (str): identifier of the service tier that was requested
  • name (str): display name of the requested service tier (for example Shared, Lux, Lyft XL)
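
The time_stamp values in the preview have 13 digits, which suggests milliseconds since the Unix epoch. Below is a minimal sketch, assuming that unit, of converting them to datetimes and pulling out the hour and weekday for later time-of-day analysis. The sample frame and the requested_at, hour, and day_of_week column names are illustrative and not part of the dataset.

```python
import pandas as pd

# Two time_stamp values copied from the df_trip preview
sample = pd.DataFrame({"time_stamp": [1544952607890, 1543284023677]})

# Assumption: 13-digit values are milliseconds since the Unix epoch
sample["requested_at"] = pd.to_datetime(sample["time_stamp"], unit="ms")

# Derived features that a surge-by-time analysis could use
sample["hour"] = sample["requested_at"].dt.hour
sample["day_of_week"] = sample["requested_at"].dt.day_name()

print(sample)
```

The resulting datetimes are in UTC, so a timezone conversion would be needed before reading the hour as Boston local time.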
In [2]:
df_weather = pd.read_csv(r'/Users/freyali/Downloads/archive 2/weather.csv')
df_weather.head()
Out[2]:
   temp   location            clouds  pressure  rain    time_stamp  humidity  wind
0  42.42  Back Bay            1.0     1012.14   0.1228  1545003901  0.77      11.25
1  42.43  Beacon Hill         1.0     1012.15   0.1846  1545003901  0.76      11.32
2  42.50  Boston University   1.0     1012.15   0.1089  1545003901  0.76      11.07
3  42.11  Fenway              1.0     1012.13   0.0969  1545003901  0.77      11.09
4  43.13  Financial District  1.0     1012.14   0.1786  1545003901  0.75      11.49

df_weather Data Dictionary:

  • temp (float): temperature recorded at the location
  • location (str): location of the recorded temperature
  • clouds (float): how cloudy it was
  • pressure (float): recorded air pressure
  • rain (float): recorded inches of rain
  • time_stamp (int): time of the weather reading, as a Unix epoch timestamp in seconds
  • humidity (float): relative humidity as a fraction between 0 and 1 (0.77 means 77%)
  • wind (float): wind speed
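
To use weather as a pricing factor, each trip needs a matching weather reading. The weather time_stamp values have 10 digits (seconds) while the trip values have 13 (milliseconds), so the units must be aligned first. The sketch below shows one possible approach, not necessarily the one used later in this project: it matches each trip to the most recent reading at its pickup location with pd.merge_asof. The trip rows, the ts_s column, and the one-hour tolerance are made-up assumptions for illustration.

```python
import pandas as pd

# Hypothetical trips (time_stamp in milliseconds); values are made up
trips = pd.DataFrame({
    "source": ["Back Bay", "Fenway"],
    "time_stamp": [1545003950000, 1545004000000],
    "price": [10.5, 7.0],
})

# Two weather readings copied from the df_weather preview (time_stamp in seconds)
weather = pd.DataFrame({
    "location": ["Back Bay", "Fenway"],
    "time_stamp": [1545003901, 1545003901],
    "temp": [42.42, 42.11],
    "rain": [0.1228, 0.0969],
})

# Bring both tables to whole seconds and give the join columns shared names
trips["ts_s"] = trips["time_stamp"] // 1000
weather = weather.rename(columns={"time_stamp": "ts_s", "location": "source"})

# For each trip, take the latest reading at the same location within one hour
merged = pd.merge_asof(
    trips.sort_values("ts_s"),
    weather.sort_values("ts_s"),
    on="ts_s",
    by="source",
    direction="backward",
    tolerance=3600,
)

print(merged[["source", "price", "temp", "rain"]])
```

Trips with no reading inside the tolerance window end up with missing weather values, which would need to be handled before modeling.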

I plan to cluster the trips first by company (Uber or Lyft). Then, I will cluster the trips again by surge multiplier to identify the times at which surge pricing raised trip prices. After identifying the surge-priced trips, I will consider other factors such as distance, temperature, and rain. With these factors, I can build a more comprehensive analysis of what influences trip pricing and use machine learning to make price predictions.
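
The first two steps of this plan can be sketched with a plain groupby. This is a rough illustration on made-up rows, assuming that a surge_multiplier above 1.0 marks a surge trip and that time_stamp is in milliseconds. The rides frame and the is_surge and hour columns are hypothetical names.

```python
import pandas as pd

# Made-up rows with the same columns as df_trip (subset)
rides = pd.DataFrame({
    "cab_type": ["Lyft", "Lyft", "Uber", "Uber", "Lyft"],
    "price": [5.0, 11.0, 8.5, 12.0, 16.5],
    "surge_multiplier": [1.0, 1.0, 1.0, 1.0, 1.5],
    "time_stamp": [
        1544952607890, 1543284023677, 1543366822198,
        1543553582749, 1543463360223,
    ],
})

# Assumption: a multiplier above 1.0 means surge pricing was active
rides["is_surge"] = rides["surge_multiplier"] > 1.0

# Step 1 and 2: split by company, then by surge status, and compare prices
summary = rides.groupby(["cab_type", "is_surge"])["price"].agg(["count", "mean"])

# Count surge trips per hour of day (UTC) to see when surges occur
rides["hour"] = pd.to_datetime(rides["time_stamp"], unit="ms").dt.hour
surge_by_hour = rides[rides["is_surge"]].groupby("hour").size()

print(summary)
print(surge_by_hour)
```

On the full dataset, the same summary would show how often each company applies a surge and how much it shifts the average price.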