Accelerate Geospatial Data Science With These Tricks
Tips and tricks on how to speed up geospatial data processing with code
In this era of big data, large datasets are the rule rather than the exception. While many general acceleration techniques from data science still apply, geographic data science has its own techniques for speeding up geodata processing. In this article, I share some of my favourite tips and tricks to accelerate geospatial data processing in Python.
Throughout this tutorial, we use the NYC taxi trips dataset, which contains 1.4 million points, along with the taxi zones dataset for the geoprocessing tasks. We first read the data with GeoPandas for benchmarking.
Let us import the libraries we use in this tutorial.
import geopandas as gpd
import matplotlib.pyplot as plt
import geofeather
import datashader as ds
from datashader import transfer_functions as tf
from colorcet import fire
import holoviews as hv
Reading the data takes quite some time, depending on the computation resources. At my end, this took 1 minute and 46 seconds.
We can speed up reading time by converting the data format (Shapefile) to GeoFeather format.
1. Use GeoFeather for Fast Geospatial Data Reading
Converting your data to GeoFeather takes only one line of code, thanks to the Geofeather library, and it will substantially boost your reading speed. Let us transform the data to GeoFeather.
The conversion takes 51 seconds in my case, but it is a one-off cost: from then on you can read the data with GeoFeather in far less time than reading from the Shapefile. Let us see that.
gdf = geofeather.from_geofeather('tutorialData/TaxiDataGeoFeather.feather')
Reading with GeoFeather took only 26 seconds, compared to 1 minute and 46 seconds for the Shapefile. It is a wonderful technique that lets you experiment with geospatial data without long waits.
Sometimes I also need to read only a subset of the data based on a geometry. This was not possible before the recent GeoPandas 0.7.0 release. The second tip shows you how to read a geometry-based subset of a large dataset.
2. Read a Subset with Geometry
Let us first read the taxi zones and plot them together with the taxi points from the previous section.
zones = gpd.read_file("tutorialData/taxizones.shp")
fig, ax = plt.subplots(figsize=(12, 10))
zones.plot("borough", ax=ax, legend=True)
gdf.sample(5000).plot(ax=ax, markersize=2, color="black")
The following map shows a sample of taxi points overlaid on the taxi zones (coloured by borough).
Now, imagine you need only the Brooklyn borough; reading the whole dataset would be unnecessary. In that case, you can read just the subset that intersects a geometry by providing a mask when reading the data with GeoPandas.
gdf_subset = gpd.read_file("tutorialData/taxidata.shp", mask=zones[zones["borough"].isin(["Brooklyn"])])
This took only 38 seconds, and you get just what you need: the subset of taxi points that fall within the Brooklyn polygon.
3. Spatial Index might boost Geoprocessing
Sometimes, spatial indexes can speed up your geoprocessing tasks. Let us compare the speed of a point-in-polygon (PIP) operation with and without a spatial index. We first dissolve the taxi zones into borough polygons and then do a spatial join without a spatial index.
boroughs = zones.dissolve(by='borough')
sjoined = gpd.sjoin(gdf, boroughs, op="within")
Without a spatial index, the process takes 2 minutes and 50 seconds in my case. Let us see if the spatial index might boost performance speed.
Creating a spatial index in GeoPandas is very easy: accessing the .sindex attribute builds (and caches) it. Let us do that for both datasets.
gdf_sindexed = gdf
gdf_sindexed.sindex
boroughs_sindexed = boroughs
boroughs_sindexed.sindex
And now we do the PIP processing with the spatial index.
sjoined = gpd.sjoin(gdf_sindexed, boroughs_sindexed, op="within")
With the spatial index, the process takes 2 minutes and 19 seconds, a modest improvement over the non-indexed run. Spatial indexes do not always help, especially when the point data and the polygons cover the same extent. You can read a detailed explanation of when spatial indexes work and when they don't in this blog post.
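Under the hood, the spatial join consults the R-tree exposed as `.sindex`, and you can also query it directly. A minimal sketch on synthetic points (the `predicate` keyword shown here is the newer GeoPandas spelling; the geometries are arbitrary examples):

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Three synthetic points standing in for the taxi pickups
pts = gpd.GeoDataFrame(
    geometry=[Point(0, 0), Point(5, 5), Point(9, 9)],
    crs="EPSG:4326",
)

# Accessing .sindex builds an R-tree; query returns integer positions
# of the candidate geometries that satisfy the predicate
poly = box(4, 4, 10, 10)
hits = pts.sindex.query(poly, predicate="intersects")
print(sorted(hits.tolist()))
```

Direct index queries like this are useful when you need candidate filtering without the overhead of a full `sjoin`.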
4. Plot With Datashader
Plotting large datasets often ends in frustration when using plotting libraries such as Folium or Plotly. Datashader is built specifically for plotting large datasets.
Datashader is a graphics pipeline system for creating meaningful representations of large datasets quickly and flexibly.
This allows you to plot all of your data on a map without compromising by taking a sample. Let us plot all 1.4 million points quickly and easily with Datashader.
agg = ds.Canvas().points(sjoined, 'pickup_lon', 'pickup_lat')
img = tf.set_background(tf.shade(agg, cmap=fire), "black")
img
And what you get is this beautiful point map of all the data.
You can create interactive maps with beautiful base maps in Datashader. It also works well with PyViz visualization tools such as GeoViews, and can be combined with Panel to create beautiful dashboards in Jupyter notebooks.
In this article, I have shared some of my favourite techniques for speeding up geospatial data science in Python. You can reduce reading time with GeoFeather or by reading only a subset. For geoprocessing tasks, it is worth trying a spatial index to speed up processing. And finally, to plot large datasets, use Datashader for speed and beautiful maps.
The notebook for this tutorial is here on GitHub.