Car crash (NY state)

For my project, I would like to predict the likelihood of a car crash in New York on any given day, along with its location and the number of casualties it may cause. This would make civilians and cyclists aware of particular hotspots and the times at which crashes are most likely to occur.

I intend to use this dataset from NYC which provides a comprehensive report on the crashes.

In [25]:

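import requests
import pandas as pd

# Fetch crash records from the NYC Open Data endpoint as JSON
response = requests.get('https://data.cityofnewyork.us/resource/h9gi-nx95.json')

# Stop early if the request did not succeed
assert response.status_code == 200, 'request error'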

Below is a quick preview of the dataframe.

In [8]:

crash_api = response.json()
crash_df = pd.DataFrame(crash_api)
crash_df.head()

Out[8]:

	crash_date	crash_time	on_street_name	off_street_name	number_of_persons_injured	...	latitude	longitude	location	cross_street_name	contributing_factor_vehicle_3	vehicle_type_code_3	contributing_factor_vehicle_4	vehicle_type_code_4	contributing_factor_vehicle_5	vehicle_type_code_5
0	2021-09-11T00:00:00.000	2:39	WHITESTONE EXPRESSWAY	20 AVENUE	2	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	2022-03-26T00:00:00.000	11:45	QUEENSBORO BRIDGE UPPER	NaN	1	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	2022-06-29T00:00:00.000	6:55	THROGS NECK BRIDGE	NaN	0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	2021-09-11T00:00:00.000	9:35	NaN	NaN	0	...	40.667202	-73.8665	{'latitude': '40.667202', 'longitude': '-73.86...	1211 LORING AVENUE	NaN	NaN	NaN	NaN	NaN	NaN
4	2021-12-14T00:00:00.000	8:13	SARATOGA AVENUE	DECATUR STREET	0	...	40.683304	-73.917274	{'latitude': '40.683304', 'longitude': '-73.91...	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 29 columns

In [20]:

# List of columns
[*crash_df.columns]

Out[20]:

['crash_date',
 'crash_time',
 'on_street_name',
 'off_street_name',
 'number_of_persons_injured',
 'number_of_persons_killed',
 'number_of_pedestrians_injured',
 'number_of_pedestrians_killed',
 'number_of_cyclist_injured',
 'number_of_cyclist_killed',
 'number_of_motorist_injured',
 'number_of_motorist_killed',
 'contributing_factor_vehicle_1',
 'contributing_factor_vehicle_2',
 'collision_id',
 'vehicle_type_code1',
 'vehicle_type_code2',
 'borough',
 'zip_code',
 'latitude',
 'longitude',
 'location',
 'cross_street_name',
 'contributing_factor_vehicle_3',
 'vehicle_type_code_3',
 'contributing_factor_vehicle_4',
 'vehicle_type_code_4',
 'contributing_factor_vehicle_5',
 'vehicle_type_code_5']

For example, index 1 of the DataFrame shows a crash on March 26, 2022 at 11:45 on the upper part of the Queensboro Bridge. Luckily, only 1 person was injured. The remaining columns shown for that row are NaN, which will be important to account for when sorting through the data. When pulling from the API, we need to keep only the records that have values for the features we are looking for.
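As a rough illustration of that filtering step, the sketch below converts a few string columns to usable types and keeps only rows that have coordinates. The `sample` frame is a small stand-in built from values in the preview, and `clean_crashes` is a hypothetical helper name, not something the dataset or pandas provides. Which columns count as required is an assumption here.

```python
import pandas as pd

# Stand-in for crash_df: values copied from the preview, stored as strings
# because every column in the real frame has dtype object.
sample = pd.DataFrame({
    "crash_date": ["2021-09-11T00:00:00.000",
                   "2022-03-26T00:00:00.000",
                   "2021-12-14T00:00:00.000"],
    "crash_time": ["2:39", "11:45", "8:13"],
    "number_of_persons_injured": ["2", "1", "0"],
    "latitude": [None, None, "40.683304"],
    "longitude": [None, None, "-73.917274"],
})


def clean_crashes(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: fix dtypes, then drop rows without coordinates."""
    out = df.copy()
    out["crash_date"] = pd.to_datetime(out["crash_date"])
    for col in ["number_of_persons_injured", "latitude", "longitude"]:
        # Unparseable values become NaN instead of raising an error
        out[col] = pd.to_numeric(out[col], errors="coerce")
    # Assumption: a record is only useful for mapping if it has a location
    return out.dropna(subset=["latitude", "longitude"]).reset_index(drop=True)


cleaned = clean_crashes(sample)
```

On the real data, the same function could be called as `clean_crashes(crash_df)`, with the `subset` list adjusted to whichever features the model ends up needing.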

In [11]:

crash_df.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Columns: 29 entries, crash_date to vehicle_type_code_5
dtypes: object(29)
memory usage: 226.7+ KB

Right now, I haven't fully figured out how to get the API to cooperate with me. The dates in the preview mix 2021 and 2022, which suggests either that the rows are not coming back in date order or that my parsing has gone awry somewhere. The full dataset has roughly 1.97 million rows, which should give a machine learning model plenty of data to learn where crashes occur most commonly. So far, I have only been able to pull 1,000 rows (indices 0 to 999).
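One likely explanation for the row count is that Socrata-backed endpoints such as this one return 1,000 rows per request by default, and the row order is unspecified unless the query asks for one. The sketch below pages through results with the `$limit`, `$offset`, and `$order` query parameters. The function names, the page size, and the choice to sort by `crash_date` with `collision_id` as a tie-breaker are all assumptions for illustration, and I have not run this against the full dataset.

```python
from typing import Callable, Iterator

BASE_URL = "https://data.cityofnewyork.us/resource/h9gi-nx95.json"


def page_params(max_rows: int, page_size: int = 1000) -> Iterator[dict]:
    """Yield query parameters for successive pages of at most page_size rows."""
    for offset in range(0, max_rows, page_size):
        yield {
            "$limit": min(page_size, max_rows - offset),
            "$offset": offset,
            # Assumed ordering: newest first, with collision_id breaking ties
            # so that consecutive pages do not overlap
            "$order": "crash_date DESC, collision_id",
        }


def fetch_page(params: dict) -> list:
    """Fetch one page of records from the API as a list of dicts."""
    import requests  # imported here so the helpers above work without it

    response = requests.get(BASE_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json()


def fetch_crashes(max_rows: int, page_size: int = 1000,
                  fetch: Callable[[dict], list] = fetch_page) -> list:
    """Collect up to max_rows records, stopping early on a short page."""
    rows = []
    for params in page_params(max_rows, page_size):
        page = fetch(params)
        rows.extend(page)
        if len(page) < params["$limit"]:
            break  # the server ran out of rows
    return rows
```

A call such as `pd.DataFrame(fetch_crashes(5000))` should then build a larger frame than the single default request does. Passing `fetch` as an argument is just a convenience so the paging logic can be tried with a fake data source before hitting the network.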

With this data, I want to make a map that highlights the locations of highest danger. "Danger" will be determined by how many crashes occur at a given location and how frequently they occur there.
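A minimal sketch of one way to score "danger" is to snap each crash to a small grid cell and count crashes per cell. The coordinates below are made-up points near two locations from the preview, `crash_counts_by_cell` is a hypothetical helper, and rounding to 3 decimal places (cells on the order of 100 m) is an arbitrary choice that would need tuning.

```python
import pandas as pd

# Illustrative points only: invented coordinates clustered near two
# locations that appear in the preview.
points = pd.DataFrame({
    "latitude": [40.6672, 40.6673, 40.6833, 40.6834, 40.6831],
    "longitude": [-73.8662, -73.8664, -73.9173, -73.9172, -73.9174],
})


def crash_counts_by_cell(df: pd.DataFrame, decimals: int = 3) -> pd.DataFrame:
    """Hypothetical helper: count crashes per rounded (latitude, longitude) cell."""
    cells = df[["latitude", "longitude"]].round(decimals)
    return (
        cells.groupby(["latitude", "longitude"])
        .size()
        .reset_index(name="crashes")
        .sort_values("crashes", ascending=False)
        .reset_index(drop=True)
    )


hotspots = crash_counts_by_cell(points)
```

The resulting `hotspots` frame has one row per cell, busiest first, which could feed a heat map. It counts crashes only. Weighting by injuries or normalising by time span would be separate decisions.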

While writing this up, I also thought it would be interesting to relate this to NYC traffic data and compare the times of crashes to the times when traffic is busiest. We could also look at the relationship between areas of high traffic and areas where car crashes are most common.
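On the crash side of that comparison, a first step could be counting crashes per hour of day from the `crash_time` column. The sketch below uses the five times from the preview. `crashes_per_hour` is a hypothetical helper, and it assumes every value has the `H:MM` form seen in the preview.

```python
import pandas as pd

# The five crash_time values shown in the preview
times = pd.Series(["2:39", "11:45", "6:55", "9:35", "8:13"], name="crash_time")


def crashes_per_hour(crash_time: pd.Series) -> pd.Series:
    """Hypothetical helper: count crashes for each hour of the day (0 to 23)."""
    # Take the part before the colon as the hour
    hours = crash_time.str.split(":").str[0].astype(int)
    # Reindex so that hours with no crashes appear with a count of 0
    return hours.value_counts().reindex(range(24), fill_value=0)


per_hour = crashes_per_hour(times)
```

An hourly traffic-volume series indexed the same way could then be lined up against `per_hour`, although which traffic dataset to use is still an open question.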