Accelerate Geospatial Data Science With These Tricks

Tips and tricks on how to speed up geospatial data processing with code

Photo by Mathew Schwartz on Unsplash

In this era of big data, large datasets are the rule rather than the exception. Beyond the general acceleration techniques used in data science, geographic data science has its own techniques for speeding up geodata processing. In this article, I share some of my favourite tips and tricks to accelerate geospatial data processing in Python.

Throughout this tutorial, we use the NYC taxi trips dataset, which contains 1.4 million points. We also use the taxi zones dataset for some geoprocessing tasks. We first read the data with Geopandas to establish a benchmark.

Let us import the libraries we use in this tutorial.

import geopandas as gpd
import matplotlib.pyplot as plt
import geofeather
import datashader as ds
import datashader.transfer_functions as tf  # shading functions used in the Datashader example
from colorcet import fire  # "fire" colormap used in the Datashader example
import holoviews as hv

Reading the data takes quite some time, depending on your computation resources. On my machine, it took 1 minute and 46 seconds.
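For reference, the benchmark read is just a plain Geopandas call (a minimal sketch, assuming the taxi points live at tutorialData/taxidata.shp, the same path used in the mask example below):

# Read all 1.4 million taxi points from the Shapefile
gdf = gpd.read_file("tutorialData/taxidata.shp")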


We can speed up reading time by converting the data from Shapefile to the GeoFeather format.

1. Use GeoFeather for Fast Geospatial Data Reading

This will boost your reading speed, and it takes only one line of code to convert your data to GeoFeather, thanks to the Geofeather library. Let us convert the data.

geofeather.to_geofeather(gdf, 'tutorialData/TaxiDataGeoFeather.feather')

Converting to GeoFeather took only 51 seconds in my case, but it is a one-off cost: from then on, you can read the data with Geofeather in much less time than reading from the Shapefile. Let us see that.

gdf = geofeather.from_geofeather('tutorialData/TaxiDataGeoFeather.feather')

Reading with Geofeather took only 26 seconds, compared to 1 minute and 46 seconds with the Shapefile. It is a wonderful technique that lets you experiment with geospatial data without long waits.

Sometimes I also need to read only a subset of the data based on geometry. This was not possible before the recent Geopandas 0.7.0 release. The second tip shows you how to read a geometry-based subset of a large dataset.

2. Read a Subset with a Geometry Mask

Let us first read the taxi zones and plot them together with the taxi points we read in the previous section.

zones = gpd.read_file("tutorialData/taxizones.shp")
fig, ax = plt.subplots(figsize=(12, 10))
zones.plot("borough", ax=ax, legend=True)
gdf.sample(5000).plot(ax=ax, markersize=2, color="black")

The following map shows a sample of taxi points overlaid on the taxi zones (coloured by borough).


Now, imagine you need only the Brooklyn borough; it is unnecessary to read the whole dataset. In that case, you can read only the subset of the data that intersects a geometry by providing a mask when reading with Geopandas.

gdf_subset = gpd.read_file("tutorialData/taxidata.shp", mask=zones[zones["borough"].isin(["Brooklyn"])])

This took only 38 seconds, and you get only what you need: the subset of taxi points that fall within the Brooklyn polygons.
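A related trick worth knowing: if the region you need is rectangular rather than an arbitrary polygon, gpd.read_file also accepts a bbox argument. A minimal sketch, deriving the bounds from the Brooklyn zones:

# Bounding box (minx, miny, maxx, maxy) of the Brooklyn zones
brooklyn_bounds = tuple(zones[zones["borough"] == "Brooklyn"].total_bounds)
# Read only the features intersecting that box
gdf_bbox = gpd.read_file("tutorialData/taxidata.shp", bbox=brooklyn_bounds)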

3. A Spatial Index Might Boost Geoprocessing

Sometimes a spatial index can speed up your geoprocessing tasks. Let us compare the speed of a Point in Polygon (PIP) operation with and without a spatial index. We first dissolve the taxi zones into borough polygons and then do a spatial join without a spatial index.

boroughs = zones.dissolve(by="borough")
sjoined = gpd.sjoin(gdf, boroughs, op="within")

Without a spatial index, the process takes 2 minutes and 50 seconds in my case. Let us see whether a spatial index boosts performance.

Creating a spatial index in Geopandas is easy. Let us do that for both datasets.

# Accessing the .sindex attribute builds the spatial index
gdf_sindexed = gdf
boroughs_sindexed = boroughs
gdf_sindexed.sindex
boroughs_sindexed.sindex

And now we run the PIP operation with the spatial index.

sjoined = gpd.sjoin(gdf_sindexed, boroughs_sindexed, op="within")

With the spatial index, the process takes 2 minutes and 19 seconds, a small improvement over the non-indexed run. Spatial indexes do not always help, especially when the points and the polygons cover the same extent. You can read a detailed explanation of when spatial indexes work, and when they don't, in this blog post.
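You can also query the index directly when you only need candidates for a single polygon. A minimal sketch, assuming the dissolved boroughs from above (the index gives cheap bounding-box candidates, which the exact within test then filters):

# Cheap bounding-box query against the spatial index
brooklyn = boroughs.loc["Brooklyn", "geometry"]
candidate_idx = list(gdf_sindexed.sindex.intersection(brooklyn.bounds))
candidates = gdf_sindexed.iloc[candidate_idx]
# Exact (and more expensive) point-in-polygon test on the candidates only
brooklyn_points = candidates[candidates.within(brooklyn)]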

4. Plot With Datashader

Plotting large datasets with general-purpose libraries like Folium or Plotly often ends in frustration. Datashader is built specifically for plotting large datasets.

Datashader is a graphics pipeline system for creating meaningful representations of large datasets quickly and flexibly.

This lets you plot all of your data on a map without compromising by plotting only a sample. Let us plot all 1.4 million points quickly and easily with Datashader.

agg = ds.Canvas().points(sjoined, "pickup_lon", "pickup_lat")
img = tf.set_background(tf.shade(agg, cmap=fire), "black")
img

And what you get is this beautiful point map of all the data.


You can create interactive maps with beautiful base maps in Datashader. It also works well with PyViz data visualization tools like Geoviews, and can be combined with Panel to create beautiful dashboards in Jupyter notebooks.
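As a minimal sketch of that interactive route (assuming the Bokeh plotting backend is installed), you can wrap the points in Holoviews and let the datashade operation re-render them on every zoom and pan:

from holoviews.operation.datashader import datashade

hv.extension("bokeh")
# Wrap the taxi points; datashade re-aggregates them as you zoom and pan
points = hv.Points(sjoined, ["pickup_lon", "pickup_lat"])
datashade(points, cmap=fire)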

Conclusion

In this article, I have shared some of my favourite techniques for speeding up geospatial data science in Python. You can cut reading time with Geofeather or by reading only a subset of the data. For geoprocessing tasks, it is worth trying a spatial index to reduce processing time. And finally, to plot large datasets, use Datashader for speed and beautiful maps.

The notebook for this tutorial is here on GitHub.
