Bluebike Usage¶

Motivation:¶

We want to see how temperature affects Blue Bike usage, measured by the number of trips, their duration, and their distance. We hypothesize that warmer months will see more and longer rides, and that usage will fall as the temperature drops. We also want to use temperature to add context to the number and length of rides. For example, on a warm day some people may prefer to walk, which would lower usage, while others may worry about getting too warm and sweaty on foot, which would raise it. Thinking more broadly in terms of hot vs. cold, a cold winter day may see little Blue Bike usage because few people are out and about; on the other hand, one could argue that usage would rise because people want to reach their (indoor) destination as fast as possible. We want to see how the weather relates to bike rides month by month. There is a lot of context behind each day and each ride, so the patterns should be interesting to observe.

Problem¶

How does temperature affect Blue Bike usage?

  • Independent Variable - mean temperature of each day/month
  • Dependent Variable - Blue Bike usage: number of trips for the month, total (and average) minutes spent on trips, total (and average) distance of trips

Solution¶

We will take each large monthly Blue Bike dataset and aggregate the following values: the number of trips (that is, the length of each dataset), the total minutes of all trips, and the total distance of all trips, along with the averages of the latter two. The averages won’t necessarily be manipulated, but they will be useful numbers to have. All of these values will be put into a new spreadsheet that contains only the numbers we need. For our temperature data, we will use our pre-made file listing the minimum, maximum, and average temperature for each day of the year, and clean it up so we have our daily temperature data. After extracting these values, we will plot them in a line graph and observe how Blue Bike usage changes along with temperature.
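As a rough illustration of the per-month aggregation described above, the sketch below summarizes a single month of trips. The function name `summarize_month` is hypothetical, the sample values are made up, and we assume `tripduration` is recorded in seconds; distance is left out here because the raw files shown in this notebook only carry station coordinates.

```python
import pandas as pd

def summarize_month(df_trips: pd.DataFrame) -> dict:
    """Summarize one month of trips (assumes `tripduration` is in seconds)."""
    minutes = df_trips["tripduration"] / 60  # convert seconds to minutes
    return {
        "num_trips": len(df_trips),        # one row per trip
        "total_minutes": minutes.sum(),
        "avg_minutes": minutes.mean(),
    }

# Tiny made-up frame using the same column name as the monthly files
sample = pd.DataFrame({"tripduration": [600, 1200, 300]})
summary = summarize_month(sample)
print(summary)
```

Calling the same function on each of the 12 monthly DataFrames would give one summary row per month.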

Dataset¶

Detail¶

There are two main datasets that we will be using for this project. The first is Boston’s daily temperature data, taken from a weather website. The site does not export full datasets, so we copied and pasted the columns that we wanted and created our own simple spreadsheet. Above is a screenshot of some of the rows. We chose these four columns because they are the only ones we need for our project; while the weather website provided other data, we only want to look at daily temperature. The second dataset was taken from Blue Bike’s website; we downloaded all trips for 2021, which are organized by month. From each file, we will extract the three values that we’re looking for: the number of trips each month (the length of the file, excluding the header), the total time of trips each month (the sum of the duration of each trip), and the total length of trips each month (the sum of the distance traveled on each trip). From those 12 files, we will build another dataset that is smaller and simpler.

Months maximum minimum average
2021-01 36 29 32.5
2021-02 42 32 37
2021-03 36 29 32.5
2021-04 39 31 35
2021-05 33 30 31.5
2021-06 40 30 35
2021-07 42 29 35.5

Variables:¶

  • Month: 12 months
  • max: monthly maximum temperature
  • min: monthly minimum temperature
  • trip duration: the time of each trip
  • station latitude/longitude: coordinates of the start and end stations
In [20]:
# One of the Blue Bike datasets we need to use (January 2021)
import pandas as pd
df_bikes = pd.read_csv('202101.csv')
df_bikes
Out[20]:
tripduration start station name start station latitude start station longitude end station name end station latitude end station longitude
0 914 One Kendall Square at Hampshire St / Portland St 42.366277 -71.091690 Dartmouth St at Newbury St 42.350961 -71.077828
1 1085 Dartmouth St at Newbury St 42.350961 -71.077828 Edwards Playground - Main St at Eden St 42.378965 -71.068607
2 946 Christian Science Plaza - Massachusetts Ave at... 42.343666 -71.085824 Prudential Center - 101 Huntington Ave 42.346520 -71.080658
3 355 MIT Pacific St at Purrington St 42.359573 -71.101295 Ames St at Main St 42.362500 -71.088220
4 511 Sennott Park Broadway at Norfolk Street 42.368605 -71.099302 Kennedy-Longfellow School 158 Spring St 42.369553 -71.085790
... ... ... ... ... ... ... ...
71800 181 Ames St at Main St 42.362500 -71.088220 Kennedy-Longfellow School 158 Spring St 42.369553 -71.085790
71801 408 MIT Stata Center at Vassar St / Main St 42.362131 -71.091156 MIT Stata Center at Vassar St / Main St 42.362131 -71.091156
71802 535 Harvard Stadium: N. Harvard St at Soldiers Fie... 42.368019 -71.124200 Innovation Lab - 125 Western Ave at Batten Way 42.363145 -71.122986
71803 2552 Sidney Research Campus/Erie Street at Waverly 42.357753 -71.103934 Watertown Sq 42.365260 -71.185733
71804 525 Harvard University Gund Hall at Quincy St / Ki... 42.376369 -71.114025 Harvard University Radcliffe Quadrangle at She... 42.380287 -71.125107

71805 rows × 7 columns
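The table above has station coordinates but no distance column, so trip distance has to be derived. One possible approach, which is our assumption rather than something the dataset provides, is the haversine (great-circle) distance between the start and end stations. The helper name `haversine_km` is hypothetical. This gives a straight-line distance, so it will understate the route actually ridden, and it is zero for round trips that start and end at the same station.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Row 0 of the January file: One Kendall Square -> Dartmouth St at Newbury St
d_first = haversine_km(42.366277, -71.091690, 42.350961, -71.077828)

# Row 71801 starts and ends at the same station, so its straight-line distance is 0
d_round_trip = haversine_km(42.362131, -71.091156, 42.362131, -71.091156)
print(round(d_first, 2), d_round_trip)
```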

In [27]:
bike01 = {'trip_duration': '914, 1085, 946, 355, 511...',
       'start station latitude': '42.366277, 42.350961,42.343666, 42.359573... ',
       'start station longitude': '-71.091690,-71.077828,-71.085824,-71.101295...  ',
       'end station latitude': '42.350961,42.378965,42.346520,42.362500...  ',
        'end station longitude': '-71.077828,-71.068607, -71.080658,-71.088220...  '}
bike01
Out[27]:
{'trip_duration': '914, 1085, 946, 355, 511...',
 'start station latitude': '42.366277, 42.350961,42.343666, 42.359573... ',
 'start station longitude': '-71.091690,-71.077828,-71.085824,-71.101295...  ',
 'end station latitude': '42.350961,42.378965,42.346520,42.362500...  ',
 'end station longitude': '-71.077828,-71.068607, -71.080658,-71.088220...  '}
In [7]:
# One of the weather datasets we need to use (January 2021)
df_weather = pd.read_csv('weather01.csv')
df_weather
Out[7]:
date average
0 2021-01-01 32.5
1 2021-01-02 37.0
2 2021-01-03 32.5
3 2021-01-04 35.0
4 2021-01-05 31.5
5 2021-01-06 35.0
6 2021-01-07 35.5
7 2021-01-08 33.5
8 2021-01-09 30.0
9 2021-01-10 34.0
10 2021-01-11 30.0
11 2021-01-12 34.5
12 2021-01-13 35.5
13 2021-01-14 35.0
14 2021-01-15 37.0
15 2021-01-16 44.5
16 2021-01-17 38.5
17 2021-01-18 38.0
18 2021-01-19 34.0
19 2021-01-20 31.5
20 2021-01-21 26.5
21 2021-01-22 36.5
22 2021-01-23 25.5
23 2021-01-24 23.5
24 2021-01-25 27.5
25 2021-01-26 28.5
26 2021-01-27 32.0
27 2021-01-28 24.5
28 2021-01-29 12.5
29 2021-01-30 14.0
30 2021-01-31 15.0
In [31]:
weather01 = {'2021-01-01': '32.5', '2021-01-02':'37.0', '2021-01-03': '32.5', '2021-01-04':'35.0...'}
weather01
Out[31]:
{'2021-01-01': '32.5',
 '2021-01-02': '37.0',
 '2021-01-03': '32.5',
 '2021-01-04': '35.0...'}
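Since the independent variable is the mean temperature per day or per month, the daily averages need to be rolled up by month at some point. A minimal sketch of that step is below. It assumes the two-column `date` / `average` layout shown above; the February rows are made up for illustration, and the names `df_daily` and `monthly_temp` are ours.

```python
import pandas as pd

# Same layout as weather01.csv; the February values are invented for the example
df_daily = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-02", "2021-02-01", "2021-02-02"],
    "average": [32.5, 37.0, 30.0, 34.0],
})

# Parse the date strings, then label each row with its year-month (e.g. "2021-01")
df_daily["date"] = pd.to_datetime(df_daily["date"])
df_daily["month"] = df_daily["date"].dt.to_period("M").astype(str)

# Mean of the daily averages within each month
monthly_temp = df_daily.groupby("month")["average"].mean()
print(monthly_temp)
```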

Method:¶

For our methodology, we will take each large monthly Blue Bike dataset and aggregate the following values: the number of trips (that is, the length of each dataset), the total minutes of all trips, and the total distance of all trips, along with the averages of the latter two. The averages won’t necessarily be manipulated, but they will be useful numbers to have. All of these values will be put into a new spreadsheet that contains only the numbers we need.
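One way the combined spreadsheet could be assembled is to join the monthly usage summary with the monthly temperatures on a shared month key. The sketch below uses made-up numbers and hypothetical column names (`num_trips`, `total_minutes`, `avg_temp`); it only shows the shape of the join, not real results.

```python
import pandas as pd

# Hypothetical per-month usage summary (values invented for illustration)
usage = pd.DataFrame({
    "month": ["2021-01", "2021-02"],
    "num_trips": [1000, 1500],
    "total_minutes": [15000.0, 24000.0],
})

# Hypothetical per-month mean temperatures (values invented for illustration)
temps = pd.DataFrame({
    "month": ["2021-01", "2021-02"],
    "avg_temp": [30.0, 33.0],
})

# Keep only months present in both tables, then derive the average trip length
combined = usage.merge(temps, on="month", how="inner")
combined["avg_minutes"] = combined["total_minutes"] / combined["num_trips"]
print(combined)
```

A table in this shape can be plotted directly, with `month` on the x-axis and usage and temperature as separate lines.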

ML: We will group the average temperatures into a set of temperature ranges. Doing this allows us to discover the conditions under which people prefer to ride.
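A simple way to group temperatures, sketched below, is fixed binning with `pd.cut`. This is plain bucketing rather than a trained model, and the bin edges, band labels, and daily trip counts are all assumptions made up for the example.

```python
import pandas as pd

# Hypothetical daily table: average temperature (°F) and trips on that day
daily = pd.DataFrame({
    "avg_temp": [14.0, 25.5, 32.5, 37.0, 44.5, 68.0],
    "num_trips": [900, 1500, 2300, 2600, 3100, 7000],
})

# Assumed band edges; each interval is open on the left and closed on the right
bins = [-float("inf"), 32, 50, 70, float("inf")]
labels = ["freezing", "cold", "mild", "warm"]
daily["temp_band"] = pd.cut(daily["avg_temp"], bins=bins, labels=labels)

# Average daily trips within each band that actually occurs in the data
trips_by_band = daily.groupby("temp_band", observed=True)["num_trips"].mean()
print(trips_by_band)
```

Comparing the per-band averages is one way to see which temperature ranges riders favor; a clustering method could replace the hand-picked edges if the bands should be learned from the data.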