Geospatial Data Science Learning Resources

The Best Features of the Geopandas 0.8.0 Release

Geospatial operation speed-ups with PyGEOS and IO improvements, with example code


I like using Geopandas for my Geospatial data science projects. Geopandas gets better in every release, thanks to its developers. If you have ever processed Geospatial Vector data, the chances are that you have used Geopandas.

The latest 0.8.0 release brings many improvements and bug fixes. In particular, I am excited about the PyGEOS integration, which speeds up geospatial operations. Although it is still at an experimental stage, I have already seen some improvements in Geopandas using this option.

There are also many IO enhancements in this release, including PostGIS improvements and new IO features such as support for the Feather and Parquet file formats.

In this article, I will go through these two features, PyGEOS integration and IO enhancements, with some code examples.

PyGEOS Option

The default Geopandas backend for spatial operations is still Shapely, but you can optionally use PyGEOS to speed up geospatial processes.

PyGEOS is a C/Python library with vectorized geometry functions. The geometry operations are done in the open-source geometry library GEOS. PyGEOS wraps these operations in NumPy ufuncs providing a performance improvement when operating on arrays of geometries. — PyGEOS
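To make the idea of "vectorized" operations concrete, here is a minimal sketch that uses plain NumPy rather than PyGEOS itself. It compares a per-element Python loop with a single ufunc call over whole arrays. The sample coordinates and the function names are made up for illustration; PyGEOS applies the same principle to arrays of geometries.

```python
import math

import numpy as np

# Hypothetical sample data: x/y coordinates of 10,000 points.
rng = np.random.default_rng(seed=0)
xs = rng.uniform(-74.3, -73.7, size=10_000)
ys = rng.uniform(40.5, 40.9, size=10_000)


def distances_loop(xs, ys):
    """Distance of each point from the origin, one Python-level call per point."""
    return [math.hypot(x, y) for x, y in zip(xs, ys)]


def distances_vectorized(xs, ys):
    """The same computation as a single NumPy ufunc call over whole arrays."""
    return np.hypot(xs, ys)


loop_result = distances_loop(xs, ys)
vectorized_result = distances_vectorized(xs, ys)

# Both approaches give the same numbers; the ufunc version runs its loop in C.
print(np.allclose(loop_result, vectorized_result))
```

The vectorized version avoids the per-element overhead of the Python interpreter, which is where most of the speed-up on large arrays tends to come from.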

According to the official release notes, the PyGEOS integration is experimental and only supports some of the spatial operations available in Geopandas. However, it already speeds up some of the spatial processes I often use. I have tested it with a spatial join, one of the essential operations for spatial data.

I will use a subset of the famous NYC taxi data, which contains 1.6 million rows, together with the taxi zone polygons, to find out which polygon each point is within. The result is a noticeable improvement in run time when PyGEOS is enabled.
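For readers new to spatial joins, the following is a conceptual sketch of what a "within" join computes, written in pure Python with a ray-casting point-in-polygon test. It is not how Geopandas implements sjoin (Geopandas delegates to GEOS and uses a spatial index), and the zone names and coordinates are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Point = Tuple[float, float]
Polygon = Sequence[Point]


def point_in_polygon(point: Point, polygon: Polygon) -> bool:
    """Ray casting: count the polygon edges crossed by a ray going right from the point."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge meets that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def naive_within_join(points: Sequence[Point], zones: Dict[str, Polygon]) -> List[Optional[str]]:
    """For each point, return the name of the first zone that contains it, or None."""
    result: List[Optional[str]] = []
    for point in points:
        match = None
        for name, polygon in zones.items():
            if point_in_polygon(point, polygon):
                match = name
                break
        result.append(match)
    return result


# Hypothetical zones (two adjacent squares) and three sample points.
zones = {
    "zone_a": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "zone_b": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
points = [(1, 1), (3, 1), (5, 5)]
print(naive_within_join(points, zones))  # ['zone_a', 'zone_b', None]
```

This brute-force version checks every point against every polygon, which is exactly the kind of work that becomes slow at 1.6 million rows and that a vectorized backend and a spatial index are meant to speed up.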

Spatial Join Without PyGEOS

import time
import geopandas as gpd
# Set PyGEOS to False (do this before reading the data)
gpd.options.use_pygeos = False
print(gpd.options.use_pygeos)
# Read the data
gdf = gpd.read_file("data/taxidata.shp")
zones = gpd.read_file("data/taxizones.shp")
# Spatial join: match each taxi point to the zone polygon it falls within
start_time = time.time()
sjoined = gpd.sjoin(gdf, zones, op="within")
print("It takes %s seconds" % (time.time() - start_time))

Without the PyGEOS option, the join takes 1.8 seconds. The same code runs much longer in Geopandas 0.7.0: more than two minutes with the same data. So there is an overall improvement even without enabling PyGEOS explicitly. Kudos to the developers.

Spatial Join With PyGEOS

Let us now enable PyGEOS and run the same code with its vectorized geometry operations. To use the optional PyGEOS backend under the hood, you only need to set use_pygeos to True (PyGEOS itself must be installed). Note that you need to do this before reading the data.

import time
import geopandas as gpd
# Set PyGEOS to True (do this before reading the data)
gpd.options.use_pygeos = True
print(gpd.options.use_pygeos)
# Read the data
gdf = gpd.read_file("data/taxidata.shp")
zones = gpd.read_file("data/taxizones.shp")
# Spatial join: same operation as before, now using the PyGEOS backend
start_time = time.time()
sjoined = gpd.sjoin(gdf, zones, op="within")
print("It takes %s seconds" % (time.time() - start_time))

PyGEOS roughly halves the run time. With use_pygeos set to True, the join takes only 0.84 seconds, compared with 1.8 seconds without it. This gain should matter even more on larger datasets, but I am even more impressed by the overall improvement from Geopandas 0.7.0 to Geopandas 0.8.0.
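Single timings like the ones above can vary from run to run, so it can help to repeat the measurement a few times. The helper below is a small sketch of that idea; time_runs is a hypothetical name, and the workload is a stand-in so the snippet runs without the taxi data.

```python
import time
from typing import Callable, List


def time_runs(func: Callable[[], object], repeats: int = 5) -> List[float]:
    """Call func() `repeats` times and return the duration of each run in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in workload. For the benchmark in this article you would pass
# something like: lambda: gpd.sjoin(gdf, zones, op="within")
durations = time_runs(lambda: sum(i * i for i in range(100_000)), repeats=3)
print("best: %.4f s, worst: %.4f s" % (min(durations), max(durations)))
```

Reporting the best of several runs reduces the influence of disk caching and other background activity on the comparison.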

IO Enhancements

There are also many Input/Output enhancements in this release, which make the life of a geospatial data scientist easier when converting and working between different data formats.

PostGIS

Reading PostGIS data was already available in earlier Geopandas releases; however, writing data to PostGIS was not. With this release, it is possible to write your GeoDataFrame into a PostGIS database.

The new GeoDataFrame.to_postgis() method is an excellent feature that allows data to move in both directions between a GeoDataFrame and PostGIS. To use this feature, you need to install SQLAlchemy and GeoAlchemy2, and a PostgreSQL Python driver (e.g. psycopg2).

For example, if we want to write the taxi GeoDataFrame we used above, we only need to create an engine using SQLAlchemy.

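import geopandas as gpd
from sqlalchemy import create_engine
# Placeholder credentials: replace with your own
user = "postgres"
password = "your_password"
engine = create_engine(f"postgresql://{user}:{password}@localhost:5432/database_name?gssencmode=disable")
# Write a random sample of 1,000 rows to a PostGIS table named "taxi"
gdf.sample(1000).to_postgis("taxi", engine)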

The transfer of the data to PostGIS was smooth. This is the option I am most excited about. It opens up new possibilities to combine the power of PostGIS with the familiar processing of data in Python with Pandas and Geopandas.
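One practical detail when building the connection string: a password containing characters such as @, : or / has to be percent-encoded, otherwise the URL is parsed incorrectly. The sketch below uses only the standard library; postgis_url is a hypothetical helper, and the user name, password, and database name are placeholders.

```python
from urllib.parse import quote_plus


def postgis_url(user: str, password: str, host: str = "localhost",
                port: int = 5432, database: str = "database_name") -> str:
    """Build a PostgreSQL connection URL with percent-encoded credentials."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Placeholder credentials with characters that would otherwise break the URL.
url = postgis_url("gis_user", "p@ss:word/1")
print(url)  # postgresql://gis_user:p%40ss%3Aword%2F1@localhost:5432/database_name
```

The resulting string can be passed to create_engine() in the same way as the hand-written URL above.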

GeoFeather

I have already covered GeoFeather and its speed in writing and reading geographic data in another article.

Accelerate Geospatial Data Science With These Tricks: tips and tricks on how to speed up geospatial data processing with code (towardsdatascience.com)

With the Geopandas 0.8.0 release, you do not need to install the GeoFeather Python library to use this format. Feather support is integrated into Geopandas, although it relies on the pyarrow package being installed.

You can use GeoDataFrame.to_feather() to write a GeoDataFrame to the Feather format. You can also read Feather data back with gpd.read_feather().

Others

Conclusion

In this article, we have covered the new features and improvements in the Geopandas 0.8.0 release. The features we have seen include PyGEOS integration, PostGIS writing, and the Feather and Parquet IO additions.

I hope you are as excited about this release as I am. It brings a lot of features that make your life as a geospatial analyst better. I have only covered the few features I am most excited about, and there are others you might be interested in. Feel free to visit the release notes here.