This section of the tutorial discusses how to use geopandas
and shapely
to manipulate geospatial data in Python. If you've never used these libraries before, or are looking for a refresher on how they work, this page is for you!
I recommend following along with this tutorial interactively using Binder.
In [1]:
import pandas as pd; pd.set_option('max_columns', 6) # Unclutter display.
import geopandas as gpd
import geoplot as gplt
# load the example data
nyc_boroughs = gpd.read_file(gplt.datasets.get_path('nyc_boroughs'))
nyc_boroughs
Out[1]:
In [2]:
nyc_boroughs.geometry
Out[2]:
Whenever you work with novel geospatial data in a GeoDataFrame
, the first thing you should do is check its coordinate reference system.
A coordinate reference system, or CRS, is a system for defining where points in space are. You can extract what CRS your polygons are stored in using the crs
attribute:
In [3]:
nyc_boroughs.crs
Out[3]:
In this case epsg:4326
is the official identifier for what the rest of us more commonly refer to as "longitude and latitude". Most coordinate reference systems have a well-defined EPSG number, which you can look up using the handy spatialreference.org website.
Why do coordinate reference systems besides latitude-longitude even exist? As an example, the United States Geolocial Service, which maintains extremely high-accuracy maps of the United States, maintains 110 coordinate reference systems, refered to as "state plane coordinate systems", for various portions of the United States. Latitude-longitude uses spherical coordinates; state plane coordinate systems use "flat-Earth" Cartesian coordinate. State plane coordinates are therefore much simpler to work with computationally, while remaining accurate enough (within their "zone") for most applications.
For this reason, state plane coordinate systems remain in use throughout government. For example, here's a sample of data taken from the MapPLUTO dataset released by the City of New York:
In [4]:
nyc_map_pluto_sample = gpd.read_file(gplt.datasets.get_path('nyc_map_pluto_sample'))
nyc_map_pluto_sample
Out[4]:
This data is stored in the Long Island State Plane coordinate reference system (EPSG 2263). Unfortunately the CRS on read is set incorrectly to epsg:4326
and we have to set it to the correct coordinate reference system ourselves.
In [5]:
nyc_map_pluto_sample.crs = {'init': 'epsg:2263'}
nyc_map_pluto_sample.crs
Out[5]:
Depending on the dataset, crs
may be set to either epsg:<INT>
or to a raw proj4 projection dictionary. The bottom line is, after reading in a dataset, always verify that the dataset coordinate reference system is set to what its documentation it should be set to.
If you determine that your coordinates are not latitude-longitude, usually the first thing you want to do is covert to it. to_crs
does this:
In [6]:
nyc_map_pluto_sample = nyc_map_pluto_sample.to_crs(epsg=4326)
nyc_map_pluto_sample
Out[6]:
shapely
, the library geopandas
uses to store its geometries, uses "modern" longitude-latitude (x, y)
coordinate order. This differs from the "historical" latitude-longitude (y, x)
coordinate order. Datasets "in the wild" may be in either format.
There is no way for geopandas
to know whether a dataset is in one format or the other at load time. Once you have converted your dataset to the right coordinate system, always always always make sure to next check that the geometries are also in the right coordinate order.
This is an easy mistake to make and people are making it constantly!
The fastest way to ensure that coordinates are in the right order is to know what the right x coordinates and y coordinates for your data should be and eyeball it.
Every element of the geometry
column in a GeoDataFrame
is a shapely
object. Shapely is a geometric operations library which is used for manipulating geometries in space, and it's the Python API of choice for working with shape data.
shapely
defines just a handful of types of geometries:
Point
—a point.MultiPoint
—a set of points.LineString
—a line segment.MultiLineString
—a collection of lines (e.g. a sequence of connected line segments).LinearRing
—a closed collection of lines. Basically a polygon with zero-area.Polygon
—an closed shape along a sequence of points.MultiPolygon
—a collection of polygons.You can check the type
of a geometry using the type
operator:
In [7]:
type(nyc_boroughs.geometry.iloc[0])
Out[7]:
In [8]:
type(nyc_map_pluto_sample.geometry.iloc[0])
Out[8]:
The shapely user manual provides an extensive list of geometric operations that you can perform using the library: from simple things like translations and transformations to more complex operations like polygon buffering.
You can apply transformations to your geometries in an object-by-object way by using the native pandas
map
function on the geometry
column. For example, here is one way of deconstructing a set of Polygon
or MultiPolygon
objects into simplified convex hulls:
In [9]:
%time gplt.polyplot(nyc_boroughs.geometry.map(lambda shp: shp.convex_hull))
Out[9]:
You can perform arbitrarily complex geometric transformations on your shapes this way. However, most common operations are provided in optimized form as part of the geopandas
API. Here's a faster way to create convex hulls, for example:
In [10]:
%time nyc_boroughs.convex_hull
Out[10]:
In this section of the tutorial, we will focus on one particular aspect of shapely
which is likely to come up: defining your own geometries.
In the cases above we read a GeoDataFrame straight out of geospatial files: our borough information was stored in the GeoJSON format, while our building footprints were a Shapefile. What if we have geospatial data embedded in an ordinary CSV
or JSON
file, which read into an ordinary pandas
DataFrame
?
In [11]:
nyc_collisions_sample = pd.read_csv(gplt.datasets.get_path('nyc_collisions_sample'))
nyc_collisions_sample
Out[11]:
In [12]:
from shapely.geometry import Point
collision_points = nyc_collisions_sample.apply(
lambda srs: Point(float(srs['LONGITUDE']), float(srs['LATITUDE'])),
axis='columns'
)
collision_points
Out[12]:
From there we pass this iterable of geometries to the geometry
property of the GeoDataFrame
initializer:
In [13]:
import geopandas as gpd
nyc_collisions_sample_geocoded = gpd.GeoDataFrame(nyc_collisions_sample, geometry=collision_points)
nyc_collisions_sample_geocoded
Out[13]:
In [14]:
obesity = pd.read_csv(gplt.datasets.get_path('obesity_by_state'), sep='\t')
obesity.head()
Out[14]:
In [15]:
contiguous_usa = gpd.read_file(gplt.datasets.get_path('contiguous_usa'))
contiguous_usa.head()
Out[15]:
In [20]:
result = contiguous_usa.set_index('state').join(obesity.set_index('State'))
result.head()
Out[20]:
In [24]:
import geoplot.crs as gcrs
gplt.cartogram(result, scale='Percent', projection=gcrs.AlbersEqualArea())
Out[24]:
You can read data out of a geospatial file format using GeoDataFrame.from_file
. You can write data to a geospatial file format using GeoDataFrame.to_file
. By default, these methods will infer the file format and save to a Shapefile
, respectively. To specify an explicit file format, pass the name of that format to the driver
argument. For example:
nyc_boroughs.to_file('boroughs.geojson', driver='GeoJSON')
The simplest and increasingly most common save format for geospatial data is GeoJSON. A geojson file may have a .geojson
or .json
extension, and stores data in a human-readable format:
{
"type": "Feature",
"geometry": {
"type": "Point",
"coordinates": [125.6, 10.1]
},
"properties": {
"name": "Dinagat Islands"
}
}
Historically speaking, the most common geospatial data format is the Shapefile. Shapefiles are not actually really files, but instead groups of files in a folder or zip
archive that together can encode very complex information about your data. Shapefiles are a binary file format, so they are not human-readable like GeoJSON files are, but can efficiently encode data too complex for easy storage in a GeoJSON.
These are the two best-known file formats, but there are many many others. For a list of geospatial file formats supported by geopandas
refer to the fiona user manual.