This is the first article of a three-part series of articles in Getting started Geographic Data Science with Python. You will learn about reading, manipulating and analysing Geographic data in Python. The articles in this series are designed to be sequential where the first article lays the foundation and the second one gets into intermediate and advanced level Geographic data science topics. The third part covers a relevant and real-world project wrapping up to cement your learning. Each tutorial also has some simple exercises to help you learn and practice. All code, dataset and Google Colab Jupyter notebooks are available from the link at the end of this article.

I will focus only on Geographic Vector data in this series. In another coming series, We will learn about satellite images and raster data analysis.

The series contains Three parts:

  1. Introduction to Geographic Data science
  2. Geographic data processing
  3. Geographic data science project

This is the first part. In this tutorial, we will learn the basics of loading and processing geographic data using Geopandas. Geopandas, the workhorse of Geographic data science in Python, is built on top of Pandas and Numpy libraries. Like Pandas Dataframe, Geopandas data structure contains GeodataFrame and GeoSeries. Geopandas provides not only the capability to read and manipulate geographic data easily but also can perform many essential geospatial operations including among other geometric operations, projections and geographic analysis. You can also visualize and plot maps with Geopandas- It provides a high-level interface to the Matplotlib library- by using the .plot() method on GeodataFrame/GeoSeries.

These are the learning objectives for this part, Introduction to Geographic Data science:

  1. Read and write geographic data in Geopandas.
  2. Project and set Coordinate reference systems (CRS).
  3. Visualizing maps

1. Reading Geographic data

In this tutorial we will use mainly 3 datasets:

  • Countries
  • Cities
  • Rivers

Geographic (Vector) data comes in different formats (Shapefiles, Geopackage, Geojson …etc). Loading most of Geodata Formats with Geopandas is straightforward. We can use .read_file().

Let see an example of reading the data. In this case, we will read the countries dataset.

# Read the data

# 1. Read Countries --> Geopackage Format
shp = '1-introData/countries.gpkg'
countries = gpd.read_file(shp)

First, we created a variable to hold the file path and then we have used Geopandas, .read_file() method to read the countries dataset. Geopandas takes care of the geometry column which enables us to carry out geoprocessing tasks, for example, plotting maps.

A good way to start your data exploration is to look at the first few rows, the shape of the data, as well as general statistics of the data. This is possible through the following commands.

  • .head() method returns back the first 5 rows. You can adjust the number of rows to get back if you want for example,.head(8) for the first 8 rows in the dataset.
  • .shape() returns the number of rows and columns of the data
  • .describe() can be used to explore some basic statistical details, for example, mean, standard deviation, and percentiles.
# First 5 rows
countries.head()

# Rows and columns of the countries data
countries.shape

# Describtive statistics of the countries data
countries.describe()


This is how the first 5 rows of the data looks like.

Image for post

                                                                     countries Geodataframe — first 5 rows

One more example of reading data in Geopandas and this time we will read the cities dataset. It comes as Geojson file but the same techniques we have used to read the countries dataset apply here.

We can carry out also the same exploration using .head() , .shape() and .describe() to get a feeling of what this dataset is about. Once we do the explorations, we can go ahead and plot maps.

Plotting maps in Geopandas is easy and available through .plot() function. Since we have two datasets countries and cities data, we can overlay them and display it as a map. Here we set up the subplots using Matplotlib and pass the axis to Geopandas .plot() function.

This is the output map. With just two-three lines of code, we are able to produce this nice map.

Image for post
                                           Countries and cities of the world map — Change picture

It is time for small exercise from your part.

Exercise 1.1: Read the rivers data

Exercise 1.2: Read the first 5 rows of the rivers dataset

Exercise 1.3: Visualize rivers dataset.

2. Coordinate systems and Projections

Coordinate reference systems represent how our data as two dimensional (planar) relates to actual places on earth. It is the glue that holds the attributes to their respective locations. Geodataframes has .crs attribute that can give you the original CRS used in the data. It is easy to transform and project these coordinates. However, to perform projections, it is necessary to have the same CRS in order to carry out geographic analysis and get the right values out the analysis. The countries, cities and rivers have the same CRS. Let us check the countries CRS.

This is the output of the above code {‘init’: ‘epsg:4326’}.EPSG stands for European Petroleum Survey Group and is an authority that maintains spatial reference systems. The code 4326 indicates which Geographic Coordinate System is used, in this case (WGS84) The World Geodetic System of 1984.

Different CRS have different measurements. For some, the coordinates are defined in decimal degrees while others are defined in meters. It is a common to process and reproject data from one format to another in Geographic data processing. This source is very useful in visualizing and comparing different Projections:

Mercator vs. Robinson: Compare Map Projections

Compare the map projections Mercator and Robinson

map-projections.net

We will project our data into Mercator. The Mercator projection, latitude-longitude quadrangles are stretched along the x-axis and y-axis as you move away from the equator. But first, let us the geometry column of the countries dataset.

This is the output of the above code. It just prints out the latitude and longitude of the Polygons. These coordinates are in decimal degrees now.

# Look at the geometry column: decimal degrees
countries.geometry[:5]

0 (POLYGON ((117.7036079039552 4.163414542001791…
1 (POLYGON ((117.7036079039552 4.163414542001791…
2 (POLYGON ((-69.51008875199994 -17.506588197999…
3 (POLYGON ((-69.51008875199994 -17.506588197999…
4 (POLYGON ((-69.51008875199994 -17.506588197999…

Let us project this data and see the changes. In this example, we project to EPSG:3395 which is the widely used Mercator projection.

# Project the data into Mercator Projection epsg=3395
countries_projected = countries.to_crs({'init': 'epsg:3395'})

# See the geometry column of the projected countries
countries_projected['geometry'][:5]

Now our geometry column data looks like this:

0 (POLYGON ((13102705.69639943 460777.6522179524…
1 (POLYGON ((13102705.69639943 460777.6522179524…
2 (POLYGON ((-7737827.684867887 -1967028.7849201…
3 (POLYGON ((-7737827.684867887 -1967028.7849201…
4 (POLYGON ((-7737827.684867887 -1967028.7849201…

Due to the projection, the geometry is no longer measured in decimal style points but in a metre unit. It is easier to understand the difference in maps. Let us plot both the original countries and the projected countries.

Image for post

Image for post

Left WGS84 World map. Right World Mercator Projection map

Notice the different scales of x and y in both maps. The Mercator projection distorts the size of the objects as we go further from the equator to the poles. That is why Africa looks small and Greenland appears much larger than its size.

If you try to overlay the projected data with unprojected data, then your data will not align properly. Let us see if we can plot cities on the top of projected countries. Remember we have not projected the cities.

fig, ax = plt.subplots(figsize=(14,12))
countries_projected.plot(ax=ax)
# Cities are still in WGS84
cities.plot(ax=ax, color='red');

Image for post

Overlay Projected countries and unprojected cities

As you can see the cities are not overlayed properly in the projected countries dataset. They fall near Africa and that is not their proper place. In order to align them properly, we also need to project the same projection of the countries dataset, EPSG:3395 and that is an exercise for you.

Exercise 2.1: Convert the cities data into EPSG:3395 projection and plot cities on top of countries_proj.

3. Write Geographic Data

We can easily save any new data created to our local disk. This is helpful when you want to access that file in another time without carrying out the same operations again. Let us save our projected countries. Remember we have projected it. Geopandas has .to_file() method.

# Save projected countries 
shp = '1-introData/countries_epsg3395.shp'
countries_projected.to_file(shp)

And that will save your file. You might want to download this file since we are using collab and did not configure it with Google drive. This will be erased when you close your session in Google Colab.

If you have guessed that we also need to save the projected cities from exercise 2.1, you are right. That is your last exercise for this part.

Exercise 3.1: Save the projected cities file you created in exercise 2.1 into a file

Conclusion

In this tutorial, we have covered the basics of loading and writing Geographic data as well as Geographic coordinate system and projections. In the next tutorial, we will learn the Geoprocessing and manipulation of Geographic data using Geopandas. The code is available in this GitHub repository:

shakasom/GDS

Geographic data science tutorials series. Contribute to shakasom/GDS development by creating an account on GitHub.

github.com

You can also go directly and run Google Collaboraty Jupyter Notebooks directly from this link:

Google Colaboratory

Edit description

colab.research.google.com

Leave a Reply

Your email address will not be published. Required fields are marked *