Road accidents are a major problem in societies around the world. The World Health Organization (WHO) estimated that 1.25 million deaths were related to road traffic injuries in 2010. In 2016, the USA alone recorded 37,461 motor vehicle crash-related deaths, an average of about 102 people per day. In Europe, the statistics for 2017 indicate about 50 road deaths per million inhabitants. Can machine learning help us understand the causes and the factors that affect car crash severity?

In this article, we will build a complete machine learning pipeline: getting data through an API, performing exploratory data analysis and formulating a real-world problem as a machine learning model. The complete code and Jupyter notebooks are available in this GitHub Gist. The whole process is carried out in Google Colab using its free GPU/TPU environment, so you can open the notebook directly from GitHub and experiment with it in Google Colab.

Get the Data

The Crash Analysis System (CAS) data is available in different formats and through APIs. It is simple to grab the data through the API instead of downloading it to your local machine. This is beneficial, as we will access the latest data every time we run the Jupyter notebook. I find this particular problem, vehicle accidents, to be strongly related to location (geography), so we will grab the GeoJSON file instead of the usual CSV file. That way we can perform geographic data analysis without having to create geometries from latitude and longitude or deal with coordinate reference systems and projections.

We will use the Geopandas library to read the data. If you are familiar with the Pandas library, you should feel at home, as Geopandas is built on top of pandas. Geopandas is a high-level library that makes working with geographic data in Python easier: it extends pandas functionality and data types to allow spatial operations on geometries. It is well integrated with the Python ecosystem and relies on pandas and Matplotlib, and on the shapely library for geometric operations.

import requests
import geopandas as gpd

# Get the data from url and request it as json file
url = ''
geojson = requests.get(url).json()
# Read the data as GeodataFrame in Geopandas
# Coordinate reference system (CRS) for New Zealand
# (older Geopandas releases used {'init': 'epsg:3851'} instead of the string form)
crs = 'EPSG:3851'
gdf = gpd.GeoDataFrame.from_features(geojson['features'], crs=crs)
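
For readers new to GeoJSON, it may help to see what `geojson['features']` holds before Geopandas turns it into a table. The sketch below uses a tiny hand-made FeatureCollection and flattens it with the standard library only. The coordinates and the `crashYear` property are invented for illustration and are not taken from the CAS data; `fatalCount` is the column used later in this article, and `flatten_features` is a hypothetical helper name.

```python
import json

# A hypothetical two-feature FeatureCollection; all values are made up.
raw = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [174.76, -36.85]},
     "properties": {"crashYear": 2017, "fatalCount": 0}},
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [174.78, -41.29]},
     "properties": {"crashYear": 2018, "fatalCount": 1}}
  ]
}
"""


def flatten_features(collection):
    """Turn each GeoJSON point feature into one flat dict: properties plus lon/lat."""
    rows = []
    for feature in collection["features"]:
        row = dict(feature["properties"])
        # GeoJSON stores point coordinates in (longitude, latitude) order.
        lon, lat = feature["geometry"]["coordinates"]
        row["lon"], row["lat"] = lon, lat
        rows.append(row)
    return rows


rows = flatten_features(json.loads(raw))
for row in rows:
    print(row)
```

Each feature carries its attributes under `properties` and its location under `geometry`, which is why no separate step is needed to build geometries from latitude and longitude columns.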

Exploratory Data Analysis

In New Zealand, crashes caused a total of 6,922 fatalities between 2000 and 2018. Over the same period, serious injuries and minor injuries in car accidents reached 45,044 and 205,895 respectively. Although this dataset records all crashes reported to NZ Police, we have to keep in mind that not all crashes are reported, especially non-fatal ones. Most crashes are non-injury crashes, while fatal crashes are the least common. In terms of fatality counts, most crashes have a fatality count of zero.

Left: Crash severity categories. Right: Fatality count in crash accidents
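
Counts like the ones above can be reproduced with a few pandas calls. The following is a minimal sketch on a made-up sample, not the real CAS table. The column name `crashSeverity` and its category labels are assumptions for illustration; `fatalCount` is the target column used later in this article.

```python
import pandas as pd

# Hypothetical sample; the real dataset has many more rows and columns.
sample = pd.DataFrame({
    "crashSeverity": ["Non-injury", "Minor", "Non-injury", "Serious",
                      "Non-injury", "Fatal", "Minor", "Non-injury"],
    "fatalCount": [0, 0, 0, 0, 0, 2, 0, 0],
})

# Number of crashes per severity category, largest first.
severity_counts = sample["crashSeverity"].value_counts()

# Share of crashes with no fatality at all.
zero_fatal_share = (sample["fatalCount"] == 0).mean()

# Total fatalities across all crashes.
total_fatalities = int(sample["fatalCount"].sum())

print(severity_counts)
print(f"Crashes without a fatality: {zero_fatal_share:.1%}")
print(f"Total fatalities: {total_fatalities}")
```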

Over the years, the overall statistics show a decline in crash severity and fatalities, but as you can see from the line chart, fatality counts appear to rise from 2016 onward. In addition, 2017 saw a peak in serious and minor injuries.

Crash casualties from 2000 to 2018.

Roads and other related attributes also relate to crash severity and fatality level, so let us explore the relationship between them. In terms of fatality counts and the number of lanes on a road, two-lane roads seem to account for a higher percentage than any other lane count. Straight roads seem to be less related to fatalities, while most of the fatalities are related to some sort of road curvature (Easy, Moderate and Severe).

Left: Number of lanes and crash fatality. Right: Road curvature and crash fatality
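
One way to compare road attributes is to compute, for each category, the share of crashes that involved at least one fatality. The article does not say exactly how its percentages were calculated, so the sketch below is only one reasonable reading. The column name `numberOfLanes` and the sample values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample of crashes with a lane count and a fatality count.
sample = pd.DataFrame({
    "numberOfLanes": [2, 2, 2, 2, 4, 4, 1, 1],
    "fatalCount":    [1, 0, 0, 2, 0, 0, 0, 0],
})

# Flag crashes with at least one fatality, then average the flag per group.
# The mean of a boolean column is the fraction of True values.
sample["isFatal"] = sample["fatalCount"] > 0
fatal_share = (
    sample.groupby("numberOfLanes")["isFatal"]
    .mean()
    .sort_values(ascending=False)
)

print(fatal_share)
```

The same pattern applies to any categorical attribute, such as road curvature, speed limit or weather, by changing the column passed to `groupby`.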

Let us look at the traffic laws and their relationship with crash severity and fatality. The speed limit is a good measure to explore this relationship. A limit of 90 km/h is the deadliest, followed by 100 km/h.

Speed limit and crash fatality count

Exploring the weather shows that mist and strong wind have the highest percentage in terms of fatality counts. Rain, snow and frost also account for a high percentage.

Impact of weather on crash fatalities

Geographic Data Exploration

The geographic data visualizations clearly indicate where crashes happen. As you might expect, most crashes happen along roads and mostly in cities.

All vehicle crash points

Let us have a look at crashes aggregated in a cluster map of Auckland.

Some clustered crash points in Auckland, New Zealand.
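
The article does not show how the cluster map was produced, and interactive mapping libraries usually do this aggregation for you. To illustrate the underlying idea only, the sketch below groups points by snapping them to a coarse grid and counting points per cell. This is a simple stand-in technique, not the method used for the map above, and the coordinates and the `grid_cell` helper are made up for illustration.

```python
from collections import Counter

# Hypothetical (lon, lat) crash points around Auckland.
points = [
    (174.761, -36.851), (174.764, -36.853), (174.768, -36.858),
    (174.792, -36.871), (174.795, -36.874),
    (174.731, -36.902),
]


def grid_cell(lon, lat, cell_size=0.01):
    """Return the integer index of the grid cell that contains the point."""
    return (int(lon // cell_size), int(lat // cell_size))


# Count how many points fall into each grid cell.
cluster_sizes = Counter(grid_cell(lon, lat) for lon, lat in points)

for cell, count in cluster_sizes.most_common():
    print(cell, count)
```

A smaller `cell_size` gives finer clusters, which is roughly what happens when you zoom in on a cluster map.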

Machine Learning

We can approach the modeling part of this problem in different ways. We could treat it as a regression problem and predict the number of fatalities based on the attributes of the crash dataset. We could also treat it as a classification problem and predict the severity of the crash from the same attributes. In this example, I will approach it as a regression problem. Feel free to build a classification model if you want to give it a try; the approach is basically the same. I will not do any feature engineering in this case. I think the attributes we have are enough to build a baseline, and we can always revisit this and do feature engineering later to boost our model accuracy.

We first need to convert the categorical features into numerical values. We can use the scikit-learn library to do that like this:

# Label encoder
from sklearn.preprocessing import LabelEncoder
lblE = LabelEncoder()
for i in df:
    if df[i].dtype == 'object':
        # Fit the encoder on this column, then replace its values with integer codes
        lblE.fit(df[i])
        df[i] = lblE.transform(df[i])
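
To make the effect of the encoder concrete, here is a small sketch on a hypothetical weather column (the values are invented, not taken from the dataset). `LabelEncoder` sorts the distinct values and maps each one to its position in that sorted list.

```python
from sklearn.preprocessing import LabelEncoder

weather = ["Fine", "Rain", "Fine", "Mist", "Rain"]  # hypothetical column values

encoder = LabelEncoder()
codes = encoder.fit_transform(weather)

# The learned categories, in sorted order.
print(list(encoder.classes_))
# One integer code per input value.
print(list(codes))
# Codes can be mapped back to the original labels.
decoded = list(encoder.inverse_transform(codes))
print(decoded)
```

The integer codes imply an ordering that the categories do not really have. Tree-based models such as Random Forest usually cope with this, but for linear models one-hot encoding is generally the safer choice.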

Then we split the data into dependent and independent variables, and into training and validation sets so that we can evaluate our model results later.

# Let us split our data into training and validation sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop('fatalCount', axis=1), df.fatalCount, test_size=0.33, random_state=42)
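
As a quick check of what `test_size=0.33` and `random_state=42` do, the sketch below applies the same call to synthetic arrays (100 made-up rows, not the crash data) and confirms that the split sizes are as expected and that the same seed reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 2 features.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
print(X_train.shape, X_test.shape)

# Running the split again with the same seed returns the same rows.
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.33, random_state=42)
print(np.array_equal(X_test, X_test_again))
```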

Now we are ready to apply a machine learning model to our data. I usually start with Random Forest, a tree-based algorithm, which performs well on many datasets.

from sklearn.ensemble import RandomForestRegressor
m = RandomForestRegressor(n_estimators=50)
m.fit(X_train, y_train)

RMSE Train: 0.017368616661096157
RMSE Valid: 0.042981327685985046
Accuracy Train: 0.977901052706869
Accuracy Valid: 0.8636075084646185
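
The call that printed these four numbers is not shown in the article. A plausible reading is that RMSE is the root mean squared error and that "accuracy" is the value returned by the regressor's `score` method, which for scikit-learn regressors is the R² coefficient of determination. Under that assumption, the two metrics can be sketched in plain Python as follows. The helper names `rmse` and `r_squared` and the sample values are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up fatality counts and predictions for five crashes.
y_true = [0, 0, 1, 0, 2]
y_pred = [0, 0, 1, 1, 1]

print(f"RMSE: {rmse(y_true, y_pred):.4f}")
print(f"R^2:  {r_squared(y_true, y_pred):.4f}")
```

Because most crashes have a fatality count of zero, a low RMSE on its own says little; comparing training and validation scores, as in the output above, is what reveals overfitting.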

As you can see, the simple Random Forest model gave us an accuracy of 86% on the validation set, and after some initial fine-tuning and feature selection based on feature importance, the model could be boosted to 87%. We could go further and improve the model, create new features or try other algorithms to increase performance, but for now this is enough for the purpose of this article. Here are some of the most important features from our Random Forest model.

Feature Importance
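
A fitted scikit-learn Random Forest exposes its importances through the `feature_importances_` attribute. The sketch below shows one way to rank them, using synthetic data in which only one column actually drives the target. The column names and data are invented for illustration and have nothing to do with the crash dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends on "informative" only.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "informative": rng.normal(size=300),
    "noise_a": rng.normal(size=300),
    "noise_b": rng.normal(size=300),
})
y = 3 * X["informative"] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)

# Pair each importance with its column name and sort, largest first.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(importances)
```

Keeping only the top-ranked columns and refitting is one simple form of the feature selection mentioned above.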


I hope you have enjoyed reading this article. If you want to experiment with the code, it is available as a GitHub Gist, and you can open the notebook directly in Google Colab.

You can reach me on Twitter @shakasom.
