I recently read a business case study about Uber Pool that highlighted a management trade-off: decrease wait times (to increase customer satisfaction) or maximize profits. Finding the balance between these two business objectives seemed interesting to me. I have also experienced Uber and Lyft's surge pricing strategy, where the price of a ride changes depending on the day and time. For example, a trip to the airport might cost $20 in the early morning, while the same trip can cost twice as much at another time of day.
Links: Uber surge pricing; Uber Pool Strategy
Based on this case and my own experience, I chose a dataset that contains both Uber and Lyft trip data. I want to explore how Uber and Lyft price their services and whether I can predict the price of a trip in order to find the best time to travel. Because the two companies provide similar services, I also think it will be interesting to compare and contrast their trip data.
import pandas as pd
df_trip = pd.read_csv(r'/Users/freyali/Downloads/archive 2/cab_rides.csv')
df_trip.head()
| | distance | cab_type | time_stamp | destination | source | price | surge_multiplier | id | product_id | name |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.44 | Lyft | 1544952607890 | North Station | Haymarket Square | 5.0 | 1.0 | 424553bb-7174-41ea-aeb4-fe06d4f4b9d7 | lyft_line | Shared |
1 | 0.44 | Lyft | 1543284023677 | North Station | Haymarket Square | 11.0 | 1.0 | 4bd23055-6827-41c6-b23b-3c491f24e74d | lyft_premier | Lux |
2 | 0.44 | Lyft | 1543366822198 | North Station | Haymarket Square | 7.0 | 1.0 | 981a3613-77af-4620-a42a-0c0866077d1e | lyft | Lyft |
3 | 0.44 | Lyft | 1543553582749 | North Station | Haymarket Square | 26.0 | 1.0 | c2d88af2-d278-4bfd-a8d0-29ca77cc5512 | lyft_luxsuv | Lux Black XL |
4 | 0.44 | Lyft | 1543463360223 | North Station | Haymarket Square | 9.0 | 1.0 | e0126e1f-8ca9-4f2e-82b3-50505a09db9a | lyft_plus | Lyft XL |
df_trip Data Dictionary:
df_weather = pd.read_csv(r'/Users/freyali/Downloads/archive 2/weather.csv')
df_weather.head()
| | temp | location | clouds | pressure | rain | time_stamp | humidity | wind |
---|---|---|---|---|---|---|---|---|
0 | 42.42 | Back Bay | 1.0 | 1012.14 | 0.1228 | 1545003901 | 0.77 | 11.25 |
1 | 42.43 | Beacon Hill | 1.0 | 1012.15 | 0.1846 | 1545003901 | 0.76 | 11.32 |
2 | 42.50 | Boston University | 1.0 | 1012.15 | 0.1089 | 1545003901 | 0.76 | 11.07 |
3 | 42.11 | Fenway | 1.0 | 1012.13 | 0.0969 | 1545003901 | 0.77 | 11.09 |
4 | 43.13 | Financial District | 1.0 | 1012.14 | 0.1786 | 1545003901 | 0.75 | 11.49 |
df_weather Data Dictionary:
I plan to cluster the trips first by company (Uber or Lyft). Then I will cluster the trips again by surge multiplier to identify the times at which surge pricing made trips more expensive. After identifying the surge-priced trips, I will consider other factors such as distance, temperature, and rain. With these factors, I can build a more comprehensive analysis of what influences trip pricing and use machine learning to make price predictions.
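The first steps of this plan could be sketched roughly as follows. This is only a minimal sketch under several assumptions, not the final analysis: it reads "cluster" as a simple group-by, it assumes the trip `time_stamp` is in epoch milliseconds and the weather `time_stamp` is in epoch seconds (based on the number of digits in the previews), and it assumes weather `location` names match trip `source` names. The helper names (`add_time_features`, `surge_share_by_hour`, `attach_weather`) and the small demo frames are hypothetical and made up for illustration; in the notebook, `df_trip` and `df_weather` would be passed in instead.

```python
import pandas as pd


def add_time_features(trips: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a parsed timestamp, hour of day, and a surge flag."""
    out = trips.copy()
    # Assumption: trip time_stamp is Unix epoch milliseconds (13 digits).
    out["trip_time"] = pd.to_datetime(out["time_stamp"], unit="ms").astype("datetime64[ns]")
    out["hour"] = out["trip_time"].dt.hour  # hour in UTC, not local time
    out["is_surge"] = out["surge_multiplier"] > 1.0
    return out


def surge_share_by_hour(trips: pd.DataFrame) -> pd.DataFrame:
    """Share of surge-priced trips and mean price per company and hour."""
    return (
        trips.groupby(["cab_type", "hour"])
        .agg(surge_share=("is_surge", "mean"), mean_price=("price", "mean"))
        .reset_index()
    )


def attach_weather(trips: pd.DataFrame, weather: pd.DataFrame,
                   tolerance: str = "1h") -> pd.DataFrame:
    """Attach the nearest-in-time weather reading for each trip's source."""
    w = weather.copy()
    # Assumption: weather time_stamp is Unix epoch seconds (10 digits).
    w["weather_time"] = pd.to_datetime(w["time_stamp"], unit="s").astype("datetime64[ns]")
    w = w.drop(columns="time_stamp").sort_values("weather_time")
    t = trips.sort_values("trip_time")
    # Assumption: weather "location" uses the same names as trip "source".
    return pd.merge_asof(
        t, w,
        left_on="trip_time", right_on="weather_time",
        left_by="source", right_by="location",
        direction="nearest",
        tolerance=pd.Timedelta(tolerance),
    )


# Hypothetical demo rows (NOT taken from the dataset), only to exercise the helpers.
demo_trips = pd.DataFrame({
    "cab_type": ["Lyft", "Lyft", "Uber"],
    "source": ["Back Bay", "Back Bay", "Fenway"],
    "time_stamp": [1545000000000, 1545000600000, 1545004800000],
    "distance": [1.0, 1.0, 2.0],
    "price": [10.0, 20.0, 8.0],
    "surge_multiplier": [1.0, 2.0, 1.0],
})
demo_weather = pd.DataFrame({
    "location": ["Back Bay", "Fenway"],
    "time_stamp": [1545000300, 1545004500],
    "temp": [42.0, 41.0],
    "rain": [0.1, 0.0],
})

demo_features = add_time_features(demo_trips)
demo_summary = surge_share_by_hour(demo_features)
demo_merged = attach_weather(demo_features, demo_weather)
print(demo_summary)
print(demo_merged[["cab_type", "source", "price", "is_surge", "temp", "rain"]])
```

The merged frame has one row per trip with `distance`, `temp`, `rain`, and the surge flag side by side, which is the shape a price-prediction model would need as input. Note that `hour` is computed in UTC here; converting to Boston local time before grouping would probably be more meaningful for finding the best time to travel.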