For my project, I would like to predict the probability of a car crash in New York City on any given day, along with its likely location and the number of casualties that may result. This would make residents and cyclists aware of hotspots and times at which crashes are likely to occur.
I intend to use this dataset from NYC Open Data, which provides a comprehensive record of reported crashes.
import requests
import pandas as pd
# Request the collisions endpoint; fail loudly on any HTTP error status.
response = requests.get('https://data.cityofnewyork.us/resource/h9gi-nx95.json', timeout=30)
response.raise_for_status()
Below is a quick preview of the dataframe.
crash_api = response.json()
crash_df = pd.DataFrame(crash_api)
crash_df.head()
| | crash_date | crash_time | on_street_name | off_street_name | number_of_persons_injured | number_of_persons_killed | number_of_pedestrians_injured | number_of_pedestrians_killed | number_of_cyclist_injured | number_of_cyclist_killed | ... | latitude | longitude | location | cross_street_name | contributing_factor_vehicle_3 | vehicle_type_code_3 | contributing_factor_vehicle_4 | vehicle_type_code_4 | contributing_factor_vehicle_5 | vehicle_type_code_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-09-11T00:00:00.000 | 2:39 | WHITESTONE EXPRESSWAY | 20 AVENUE | 2 | 0 | 0 | 0 | 0 | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2022-03-26T00:00:00.000 | 11:45 | QUEENSBORO BRIDGE UPPER | NaN | 1 | 0 | 0 | 0 | 0 | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2022-06-29T00:00:00.000 | 6:55 | THROGS NECK BRIDGE | NaN | 0 | 0 | 0 | 0 | 0 | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2021-09-11T00:00:00.000 | 9:35 | NaN | NaN | 0 | 0 | 0 | 0 | 0 | 0 | ... | 40.667202 | -73.8665 | {'latitude': '40.667202', 'longitude': '-73.86... | 1211 LORING AVENUE | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 2021-12-14T00:00:00.000 | 8:13 | SARATOGA AVENUE | DECATUR STREET | 0 | 0 | 0 | 0 | 0 | 0 | ... | 40.683304 | -73.917274 | {'latitude': '40.683304', 'longitude': '-73.91... | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 29 columns
# List of columns
[*crash_df.columns]
['crash_date', 'crash_time', 'on_street_name', 'off_street_name', 'number_of_persons_injured', 'number_of_persons_killed', 'number_of_pedestrians_injured', 'number_of_pedestrians_killed', 'number_of_cyclist_injured', 'number_of_cyclist_killed', 'number_of_motorist_injured', 'number_of_motorist_killed', 'contributing_factor_vehicle_1', 'contributing_factor_vehicle_2', 'collision_id', 'vehicle_type_code1', 'vehicle_type_code2', 'borough', 'zip_code', 'latitude', 'longitude', 'location', 'cross_street_name', 'contributing_factor_vehicle_3', 'vehicle_type_code_3', 'contributing_factor_vehicle_4', 'vehicle_type_code_4', 'contributing_factor_vehicle_5', 'vehicle_type_code_5']
For example, index 1 of the DataFrame shows that there was a crash on March 26, 2022 at 11:45 on the upper level of the Queensboro Bridge. Fortunately, only 1 person was injured. Many of the remaining fields in that row are NaN, which is important to keep in mind when sorting through the data. When pulling from the API, we need to keep only the rows that contain the features we are looking for.
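To make that filtering concrete, here is a minimal sketch that measures how much of each column is missing and keeps only the rows with the fields a model would need. It assumes that `latitude` and `longitude` are the required features, which is my assumption and not something settled above. The small `sample_df` is a stand-in for `crash_df`, built from three rows of the preview.

```python
import pandas as pd

# Stand-in for crash_df: three rows copied from the preview above.
sample_df = pd.DataFrame(
    {
        "crash_date": [
            "2021-09-11T00:00:00.000",
            "2022-03-26T00:00:00.000",
            "2021-12-14T00:00:00.000",
        ],
        "on_street_name": [
            "WHITESTONE EXPRESSWAY",
            "QUEENSBORO BRIDGE UPPER",
            "SARATOGA AVENUE",
        ],
        "latitude": [None, None, "40.683304"],
        "longitude": [None, None, "-73.917274"],
    }
)

# Fraction of missing values per column, largest first.
missing_share = sample_df.isna().mean().sort_values(ascending=False)

# Assumption: a row is only usable for mapping if it has coordinates.
required = ["latitude", "longitude"]
usable_df = sample_df.dropna(subset=required)

print(missing_share)
print(f"{len(usable_df)} of {len(sample_df)} rows have coordinates")
```

Running the same two steps on the full `crash_df` would show which columns are too sparse to rely on before any modeling starts.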
crash_df.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Columns: 29 entries, crash_date to vehicle_type_code_5
dtypes: object(29)
memory usage: 226.7+ KB
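The summary reports all 29 columns as `object`, so dates and counts are still strings. A possible conversion step is sketched below. The helper name `convert_types` is hypothetical, and the choice to coerce unparseable values to missing is an assumption on my part. The sample rows come from the preview.

```python
import pandas as pd


def convert_types(df):
    """Return a copy of df with the date and numeric columns parsed."""
    out = df.copy()
    # Unparseable dates become NaT instead of raising.
    out["crash_date"] = pd.to_datetime(out["crash_date"], errors="coerce")
    # All casualty counts share the "number_of_" prefix in this dataset.
    numeric_cols = [c for c in out.columns if c.startswith("number_of_")]
    numeric_cols += [c for c in ("latitude", "longitude") if c in out.columns]
    for col in numeric_cols:
        # Unparseable numbers become NaN instead of raising.
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out


# Two rows from the preview, stored as strings the way the API returns them.
sample_df = pd.DataFrame(
    {
        "crash_date": ["2021-09-11T00:00:00.000", "2021-12-14T00:00:00.000"],
        "crash_time": ["2:39", "8:13"],
        "number_of_persons_injured": ["2", "0"],
        "latitude": [None, "40.683304"],
        "longitude": [None, "-73.917274"],
    }
)

typed_df = convert_types(sample_df)
print(typed_df.dtypes)
```

Calling `convert_types(crash_df)` on the real frame should make sorting by date and summing casualties behave as expected.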
Right now, I haven't fully figured out how to get the API to cooperate with me. If you look at the dates, rows from 2022 are mixed in with rows from 2021. I first suspected my parsing, but the more likely cause is that the request above does not ask for any sort order, so the rows come back in no particular order. The dataset has roughly 1.97 million rows, which is plenty of data for a machine learning model to learn where crashes occur most often. So far I have only been able to pull 1,000 rows (the 1000 entries reported by info()), which appears to be the API's default page size.
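One way to get past a single page is to request the data in ordered chunks. The sketch below uses the `$limit`, `$offset` and `$order` query parameters of the Socrata API that serves this dataset. The function names (`build_page_url`, `fetch_json`, `fetch_crashes`) are hypothetical, and I use the standard library's `urllib` here only to keep the sketch free of dependencies. I have not run it against the full dataset.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://data.cityofnewyork.us/resource/h9gi-nx95.json"


def build_page_url(offset, limit, order="crash_date DESC, collision_id DESC"):
    """Build the URL for one page; a fixed order keeps pages consistent."""
    query = urlencode({"$limit": limit, "$offset": offset, "$order": order})
    return f"{BASE_URL}?{query}"


def fetch_json(url):
    """Download one page and decode it into a list of dicts."""
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


def fetch_crashes(max_rows, page_size=1000, fetch=fetch_json):
    """Collect up to max_rows records by walking through the pages."""
    rows = []
    offset = 0
    while len(rows) < max_rows:
        limit = min(page_size, max_rows - len(rows))
        page = fetch(build_page_url(offset, limit))
        if not page:
            break  # no more data on the server
        rows.extend(page)
        if len(page) < limit:
            break  # a short page means this was the last one
        offset += len(page)
    return rows


# Show the first URL that would be requested (no network call is made here).
print(build_page_url(offset=0, limit=1000))
```

Something like `crash_df = pd.DataFrame(fetch_crashes(max_rows=5000))` would then give a larger, date-ordered frame. The `fetch` argument exists so a different downloader, for example one based on `requests`, can be swapped in.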
With this data, I want to make a map that highlights locations of highest danger. "Danger" will be determined by the number of crashes and how commonly they occur in a given location.
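As a first pass at "danger", crashes can be counted per grid cell by rounding the coordinates. This is only a sketch of the counting step, not the map itself. The helper name `crash_hotspots` and the cell size (3 decimal places, roughly 100 m) are assumptions, and the coordinates in `sample_df` are made up for illustration.

```python
import pandas as pd


def crash_hotspots(df, decimals=3, top_n=10):
    """Count crashes per rounded (latitude, longitude) cell, busiest first."""
    coords = df[["latitude", "longitude"]].apply(pd.to_numeric, errors="coerce")
    # Rows without coordinates cannot be placed on a map, so drop them.
    cells = coords.dropna().round(decimals)
    return (
        cells.groupby(["latitude", "longitude"])
        .size()
        .sort_values(ascending=False)
        .head(top_n)
        .rename("crash_count")
        .reset_index()
    )


# Made-up coordinates: two crashes close together, one elsewhere, one missing.
sample_df = pd.DataFrame(
    {
        "latitude": ["40.6672", "40.6674", "40.6833", None],
        "longitude": ["-73.8664", "-73.8661", "-73.9173", None],
    }
)

hotspots = crash_hotspots(sample_df)
print(hotspots)
```

The resulting table has one row per cell with a `crash_count` column, which could later feed a heatmap layer or be normalized by time to capture how often crashes occur there.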
While writing this up, I also thought it would be interesting to relate this to NYC traffic data and compare the times of crashes to the times when traffic is busiest. We could also examine the relationship between areas of high traffic and areas where car crashes are most common.
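The crash side of that comparison could start with a simple count of crashes per hour of day. The sketch below covers only that half, since no traffic dataset has been chosen yet. The helper name `crashes_per_hour` is hypothetical, and the sample times are the five `crash_time` values from the preview.

```python
import pandas as pd


def crashes_per_hour(df):
    """Return crash counts indexed by hour of day (0 to 23)."""
    # crash_time looks like "2:39" or "11:45"; keep the part before the colon.
    hours = pd.to_numeric(df["crash_time"].str.split(":").str[0], errors="coerce")
    counts = hours.dropna().astype(int).value_counts()
    # Include hours with no crashes so every day has the same 24 bins.
    return counts.reindex(range(24), fill_value=0).rename("crash_count")


sample_df = pd.DataFrame({"crash_time": ["2:39", "11:45", "6:55", "9:35", "8:13"]})

hourly = crashes_per_hour(sample_df)
print(hourly[hourly > 0])
```

Once hourly traffic volumes are available in the same 24-bin shape, the two series could be plotted together or correlated directly.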