Car Crashes (NYC)¶

For my project, I would like to predict the likelihood of a car crash in New York City on any given day, along with its location and the number of casualties it may cause. This would allow drivers, pedestrians, and cyclists to be aware of particular hotspots and the times at which crashes are most likely to occur.

I intend to use this dataset from NYC Open Data, which provides a comprehensive record of the crashes.

In [25]:
import requests
import pandas as pd

# Fetch crash records from the NYC Open Data (Socrata) endpoint.
# Without a $limit parameter, Socrata returns at most 1,000 rows.
response = requests.get('https://data.cityofnewyork.us/resource/h9gi-nx95.json')

assert response.status_code == 200, 'request error'

Below is a quick preview of the dataframe.

In [8]:
crash_api = response.json()
crash_df = pd.DataFrame(crash_api)
crash_df.head()
Out[8]:
crash_date crash_time on_street_name off_street_name number_of_persons_injured number_of_persons_killed number_of_pedestrians_injured number_of_pedestrians_killed number_of_cyclist_injured number_of_cyclist_killed ... latitude longitude location cross_street_name contributing_factor_vehicle_3 vehicle_type_code_3 contributing_factor_vehicle_4 vehicle_type_code_4 contributing_factor_vehicle_5 vehicle_type_code_5
0 2021-09-11T00:00:00.000 2:39 WHITESTONE EXPRESSWAY 20 AVENUE 2 0 0 0 0 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2022-03-26T00:00:00.000 11:45 QUEENSBORO BRIDGE UPPER NaN 1 0 0 0 0 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2022-06-29T00:00:00.000 6:55 THROGS NECK BRIDGE NaN 0 0 0 0 0 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2021-09-11T00:00:00.000 9:35 NaN NaN 0 0 0 0 0 0 ... 40.667202 -73.8665 {'latitude': '40.667202', 'longitude': '-73.86... 1211 LORING AVENUE NaN NaN NaN NaN NaN NaN
4 2021-12-14T00:00:00.000 8:13 SARATOGA AVENUE DECATUR STREET 0 0 0 0 0 0 ... 40.683304 -73.917274 {'latitude': '40.683304', 'longitude': '-73.91... NaN NaN NaN NaN NaN NaN NaN

5 rows × 29 columns

In [20]:
# List of columns
[*crash_df.columns]
Out[20]:
['crash_date',
 'crash_time',
 'on_street_name',
 'off_street_name',
 'number_of_persons_injured',
 'number_of_persons_killed',
 'number_of_pedestrians_injured',
 'number_of_pedestrians_killed',
 'number_of_cyclist_injured',
 'number_of_cyclist_killed',
 'number_of_motorist_injured',
 'number_of_motorist_killed',
 'contributing_factor_vehicle_1',
 'contributing_factor_vehicle_2',
 'collision_id',
 'vehicle_type_code1',
 'vehicle_type_code2',
 'borough',
 'zip_code',
 'latitude',
 'longitude',
 'location',
 'cross_street_name',
 'contributing_factor_vehicle_3',
 'vehicle_type_code_3',
 'contributing_factor_vehicle_4',
 'vehicle_type_code_4',
 'contributing_factor_vehicle_5',
 'vehicle_type_code_5']

For example, index 1 of the DataFrame shows that there was a crash on March 26, 2022 at 11:45 on the upper level of the Queensboro Bridge. Luckily, only one person was injured. Much of the remaining data is filled with NaNs, which will be important to handle when cleaning the data. When querying the API, we need to make sure we only consider rows that contain the features we are looking for.
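One way to start cleaning: combine `crash_date` and `crash_time` into a single datetime column and drop rows without coordinates, since a map can't use them. The small frame below is a stand-in for `crash_df` (the real one comes from the API above) so the sketch is self-contained.

```python
import pandas as pd

# Stand-in for crash_df; the real frame is built from the API response.
crash_df = pd.DataFrame({
    'crash_date': ['2021-09-11T00:00:00.000', '2022-03-26T00:00:00.000'],
    'crash_time': ['2:39', '11:45'],
    'latitude': [None, '40.667202'],
    'longitude': [None, '-73.8665'],
})

# Combine the date part (first 10 characters) with the time string
# into one proper datetime column.
crash_df['crash_datetime'] = pd.to_datetime(
    crash_df['crash_date'].str.slice(0, 10) + ' ' + crash_df['crash_time']
)

# Keep only rows with coordinates, and convert them from strings to floats.
located = crash_df.dropna(subset=['latitude', 'longitude']).copy()
located['latitude'] = located['latitude'].astype(float)
located['longitude'] = located['longitude'].astype(float)
```

Note that the API returns every column as strings, so numeric columns like the injury counts will need the same `astype` treatment before any aggregation.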

In [11]:
crash_df.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Columns: 29 entries, crash_date to vehicle_type_code_5
dtypes: object(29)
memory usage: 226.7+ KB

Right now, I haven't fully figured out how to get the API to cooperate with me. If you take a look at the dates, rows from 2022 are mixed in with 2021; this is likely not a parsing problem but the fact that the endpoint returns rows in no guaranteed order, and Socrata caps an unqualified request at 1,000 rows, which is why I have only been able to pull 1,000 of them. The full dataset has roughly 1.97 million rows, which provides plenty of data for a model to learn where crashes occur most commonly.
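The rest of the rows can be pulled in pages using Socrata's `$limit` and `$offset` query parameters, with `$order` giving a stable sort so pages don't overlap. A sketch of building the paged requests (without actually hitting the network here); `page_params` is a hypothetical helper, and `collision_id` is used for ordering since it appears in the column list above:

```python
API_URL = 'https://data.cityofnewyork.us/resource/h9gi-nx95.json'

def page_params(page_size, n_pages):
    """Build query parameters for each page of a Socrata endpoint.

    $order gives a stable sort so pages don't overlap; $limit/$offset
    select the window. Each dict can be passed to
    requests.get(API_URL, params=...).
    """
    return [
        {'$order': 'collision_id', '$limit': page_size, '$offset': i * page_size}
        for i in range(n_pages)
    ]

# First three pages of 1,000 rows each.
pages = page_params(1000, 3)
```

Pulling all 1.97 million rows this way would take roughly 2,000 requests, so requesting a larger `$limit` per page (Socrata allows much more than 1,000 when asked explicitly) or downloading the CSV export directly may be more practical.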

With this data, I want to make a map that highlights the locations of highest danger. "Danger" will be determined by the number and frequency of crashes in a given location.
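One simple way to score danger: round the coordinates to three decimal places, which bins points into cells of roughly 100 m, and count crashes per cell. The coordinates below are synthetic stand-ins for the cleaned `latitude`/`longitude` columns:

```python
import pandas as pd

# Stand-in coordinates; in practice these come from the cleaned crash data.
located = pd.DataFrame({
    'latitude':  [40.6671, 40.6672, 40.6833, 40.6674],
    'longitude': [-73.8661, -73.8662, -73.9173, -73.8663],
})

# Rounding to 3 decimal places groups nearby points into the same cell.
cells = located.round({'latitude': 3, 'longitude': 3})
danger = (
    cells.groupby(['latitude', 'longitude'])
    .size()
    .sort_values(ascending=False)
    .rename('crash_count')
)
```

The resulting counts can feed directly into a heatmap, and the grid resolution is a tunable choice: coarser cells smooth the map, finer cells pinpoint individual intersections.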

While writing this up, I realized it would also be interesting to relate this to NYC traffic data and compare the times of crashes to the times when traffic is busiest. We could also examine the relationship between high-traffic areas and the areas where crashes are most common.
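As a first step toward that comparison, the `crash_time` column can be reduced to an hour-of-day distribution, which could then be lined up against hourly traffic volumes. The times below are stand-ins for the real column:

```python
import pandas as pd

# Stand-in for crash_df['crash_time'] (strings like '2:39' or '11:45').
crash_time = pd.Series(['2:39', '11:45', '8:13', '17:05', '17:40'])

# Take the hour before the colon, then count crashes per hour of day.
hours = crash_time.str.split(':').str[0].astype(int)
by_hour = hours.value_counts().sort_index()
```

On the full dataset, this distribution would likely show peaks around the commute hours, which is exactly where a comparison against traffic volume data would be most informative.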