Working with Geospatial Data

This section of the tutorial discusses how to use geopandas and shapely to manipulate geospatial data in Python. If you've never used these libraries before, or are looking for a refresher on how they work, this page is for you!

I recommend following along with this tutorial interactively using Binder.

Coordinate reference systems

The GeoDataFrame is an augmented version of a pandas DataFrame with an attached geometry:



In [1]:

    
import pandas as pd; pd.set_option('max_columns', 6)  # Unclutter display.
import geopandas as gpd
import geoplot as gplt

# load the example data
nyc_boroughs = gpd.read_file(gplt.datasets.get_path('nyc_boroughs'))
nyc_boroughs









    Out[1]:







  
    
      
      BoroCode
      BoroName
      Shape_Leng
      Shape_Area
      geometry
    
  
  
    
      0
      5
      Staten Island
      330385.03697
      1.623853e+09
      (POLYGON ((-74.05050806403247 40.5664220341608...
    
    
      1
      4
      Queens
      861038.47930
      3.049947e+09
      (POLYGON ((-73.83668274106708 40.5949466970158...
    
    
      2
      3
      Brooklyn
      726568.94634
      1.959432e+09
      (POLYGON ((-73.8670614947212 40.58208797679338...
    
    
      3
      1
      Manhattan
      358532.95642
      6.364422e+08
      (POLYGON ((-74.01092841268033 40.6844914725429...
    
    
      4
      2
      Bronx
      464517.89055
      1.186804e+09
      (POLYGON ((-73.89680883223775 40.7958084451597...

Most operations that will work on a `pandas` `DataFrame` will work on a `GeoDataFrame`, but the latter adds a few additional methods and features for dealing with geometry not present in the former. The most obvious of these is the addition of a column for storing geometries, accessible using the `geometry` attribute:



In [2]:

    
nyc_boroughs.geometry









    Out[2]:





0    (POLYGON ((-74.05050806403247 40.5664220341608...
1    (POLYGON ((-73.83668274106708 40.5949466970158...
2    (POLYGON ((-73.8670614947212 40.58208797679338...
3    (POLYGON ((-74.01092841268033 40.6844914725429...
4    (POLYGON ((-73.89680883223775 40.7958084451597...
Name: geometry, dtype: object

Whenever you work with novel geospatial data in a GeoDataFrame, the first thing you should do is check its coordinate reference system.

A coordinate reference system, or CRS, is a system for defining where points in space are. You can extract what CRS your polygons are stored in using the crs attribute:



In [3]:

    
nyc_boroughs.crs









    Out[3]:





{'init': 'epsg:4326'}

In this case epsg:4326 is the official identifier for what the rest of us more commonly refer to as "longitude and latitude". Most coordinate reference systems have a well-defined EPSG number, which you can look up using the handy spatialreference.org website.

Why do coordinate reference systems besides latitude-longitude even exist? As an example, the United States Geolocial Service, which maintains extremely high-accuracy maps of the United States, maintains 110 coordinate reference systems, refered to as "state plane coordinate systems", for various portions of the United States. Latitude-longitude uses spherical coordinates; state plane coordinate systems use "flat-Earth" Cartesian coordinate. State plane coordinates are therefore much simpler to work with computationally, while remaining accurate enough (within their "zone") for most applications.

For this reason, state plane coordinate systems remain in use throughout government. For example, here's a sample of data taken from the MapPLUTO dataset released by the City of New York:



In [4]:

    
nyc_map_pluto_sample = gpd.read_file(gplt.datasets.get_path('nyc_map_pluto_sample'))
nyc_map_pluto_sample









    Out[4]:







  
    
      
      Borough
      Block
      Lot
      ...
      Shape_Leng
      Shape_Area
      geometry
    
  
  
    
      0
      MN
      1
      10
      ...
      12277.824113
      7.550340e+06
      POLYGON ((979561.8712409735 191884.2491553128,...
    
    
      1
      MN
      1
      101
      ...
      3940.840373
      5.018974e+05
      POLYGON ((972382.8255597204 190647.2667211443,...
    
    
      2
      MN
      1
      101
      ...
      3940.840373
      5.018974e+05
      POLYGON ((972428.8290766329 190679.1751218885,...
    
    
      3
      MN
      1
      101
      ...
      3940.840373
      5.018974e+05
      POLYGON ((972058.3399882168 190689.2800885588,...
    
    
      4
      MN
      1
      201
      ...
      6306.268341
      1.148539e+06
      POLYGON ((973154.7118112147 194614.3312935531,...
    
    
      5
      MN
      2
      1
      ...
      2721.060649
      1.008250e+05
      POLYGON ((980915.0020648837 194319.1402828991,...
    
    
      6
      MN
      2
      2
      ...
      2411.869687
      8.724423e+04
      POLYGON ((981169.004181549 194678.8213220537, ...
    
  

7 rows × 90 columns

This data is stored in the Long Island State Plane coordinate reference system (EPSG 2263). Unfortunately the CRS on read is set incorrectly to epsg:4326 and we have to set it to the correct coordinate reference system ourselves.



In [5]:

    
nyc_map_pluto_sample.crs = {'init': 'epsg:2263'}
nyc_map_pluto_sample.crs









    Out[5]:





{'init': 'epsg:2263'}

Depending on the dataset, crs may be set to either epsg:<INT> or to a raw proj4 projection dictionary. The bottom line is, after reading in a dataset, always verify that the dataset coordinate reference system is set to what its documentation it should be set to.

If you determine that your coordinates are not latitude-longitude, usually the first thing you want to do is covert to it. to_crs does this:



In [6]:

    
nyc_map_pluto_sample = nyc_map_pluto_sample.to_crs(epsg=4326)
nyc_map_pluto_sample









    Out[6]:







  
    
      
      Borough
      Block
      Lot
      ...
      Shape_Leng
      Shape_Area
      geometry
    
  
  
    
      0
      MN
      1
      10
      ...
      12277.824113
      7.550340e+06
      POLYGON ((-74.0169058260488 40.69335342975063,...
    
    
      1
      MN
      1
      101
      ...
      3940.840373
      5.018974e+05
      POLYGON ((-74.04279194703045 40.68995148413111...
    
    
      2
      MN
      1
      101
      ...
      3940.840373
      5.018974e+05
      POLYGON ((-74.04262611856618 40.69003912689961...
    
    
      3
      MN
      1
      101
      ...
      3940.840373
      5.018974e+05
      POLYGON ((-74.04396208819837 40.69006636010664...
    
    
      4
      MN
      1
      201
      ...
      6306.268341
      1.148539e+06
      POLYGON ((-74.04001513069795 40.7008411559464,...
    
    
      5
      MN
      2
      1
      ...
      2721.060649
      1.008250e+05
      POLYGON ((-74.01202751677701 40.70003725302833...
    
    
      6
      MN
      2
      2
      ...
      2411.869687
      8.724423e+04
      POLYGON ((-74.01111163437271 40.70102458543801...
    
  

7 rows × 90 columns

Coordinate order

shapely, the library geopandas uses to store its geometries, uses "modern" longitude-latitude (x, y) coordinate order. This differs from the "historical" latitude-longitude (y, x) coordinate order. Datasets "in the wild" may be in either format.

There is no way for geopandas to know whether a dataset is in one format or the other at load time. Once you have converted your dataset to the right coordinate system, always always always make sure to next check that the geometries are also in the right coordinate order.

This is an easy mistake to make and people are making it constantly!

The fastest way to ensure that coordinates are in the right order is to know what the right x coordinates and y coordinates for your data should be and eyeball it.

Types of geometries

Every element of the geometry column in a GeoDataFrame is a shapely object. Shapely is a geometric operations library which is used for manipulating geometries in space, and it's the Python API of choice for working with shape data.

shapely defines just a handful of types of geometries:

Point—a point.
MultiPoint—a set of points.
LineString—a line segment.
MultiLineString—a collection of lines (e.g. a sequence of connected line segments).
LinearRing—a closed collection of lines. Basically a polygon with zero-area.
Polygon—an closed shape along a sequence of points.
MultiPolygon—a collection of polygons.

You can check the type of a geometry using the type operator:



In [7]:

    
type(nyc_boroughs.geometry.iloc[0])









    Out[7]:





shapely.geometry.multipolygon.MultiPolygon



In [8]:

    
type(nyc_map_pluto_sample.geometry.iloc[0])









    Out[8]:





shapely.geometry.polygon.Polygon

Performing geometric operations

The shapely user manual provides an extensive list of geometric operations that you can perform using the library: from simple things like translations and transformations to more complex operations like polygon buffering.

You can apply transformations to your geometries in an object-by-object way by using the native pandas map function on the geometry column. For example, here is one way of deconstructing a set of Polygon or MultiPolygon objects into simplified convex hulls:



In [9]:

    
%time gplt.polyplot(nyc_boroughs.geometry.map(lambda shp: shp.convex_hull))









    



CPU times: user 62.7 ms, sys: 2.64 ms, total: 65.3 ms
Wall time: 71.6 ms






    Out[9]:





<matplotlib.axes._subplots.AxesSubplot at 0x11c61d7b8>

You can perform arbitrarily complex geometric transformations on your shapes this way. However, most common operations are provided in optimized form as part of the geopandas API. Here's a faster way to create convex hulls, for example:



In [10]:

    
%time nyc_boroughs.convex_hull









    



CPU times: user 55.6 ms, sys: 6.45 ms, total: 62.1 ms
Wall time: 39.9 ms






    Out[10]:





0    POLYGON ((-74.24712436215984 40.49611539517034...
1    POLYGON ((-73.94073681665428 40.54182008715522...
2    POLYGON ((-73.98336058039274 40.56952999448672...
3    POLYGON ((-74.02305574749596 40.68291694544512...
4    POLYGON ((-73.87830680057651 40.78535662050845...
dtype: object

It is beyond the scope of this short guide to dive too deeply into geospatial data transformations. Suffice to say that there are many of them, and that you can learn some more about them by consulting the geopandas and shapely documentation.

Defining your own geometries

In this section of the tutorial, we will focus on one particular aspect of shapely which is likely to come up: defining your own geometries.

In the cases above we read a GeoDataFrame straight out of geospatial files: our borough information was stored in the GeoJSON format, while our building footprints were a Shapefile. What if we have geospatial data embedded in an ordinary CSV or JSON file, which read into an ordinary pandas DataFrame?



In [11]:

    
nyc_collisions_sample = pd.read_csv(gplt.datasets.get_path('nyc_collisions_sample'))
nyc_collisions_sample

It is extremely common for datasets containing light geospatial data (e.g. points, maybe line segments, but usually not whole polygons) to be saved in a non-geospatial formats. In this case can import `shapely` directly, use it to define our own geometries, then initialize a `GeoDataFrame`. The `pandas` `apply` function is the best to do this:



In [12]:

    
from shapely.geometry import Point

collision_points = nyc_collisions_sample.apply(
    lambda srs: Point(float(srs['LONGITUDE']), float(srs['LATITUDE'])),
    axis='columns'
)
collision_points









    Out[12]:





0           POINT (-73.950057 40.767373)
1    POINT (-73.90903900000001 40.86267)
2           POINT (-73.961275 40.716507)
3           POINT (-73.987768 40.749788)
4    POINT (73.96049599999999 40.702401)
dtype: object

From there we pass this iterable of geometries to the geometry property of the GeoDataFrame initializer:



In [13]:

    
import geopandas as gpd
nyc_collisions_sample_geocoded = gpd.GeoDataFrame(nyc_collisions_sample, geometry=collision_points)
nyc_collisions_sample_geocoded









    Out[13]:







  
    
      
      LATITUDE
      LONGITUDE
      DATE
      TIME
      geometry
    
  
  
    
      0
      40.767373
      -73.950057
      04/16/2016
      4:13
      POINT (-73.950057 40.767373)
    
    
      1
      40.862670
      -73.909039
      04/16/2016
      4:30
      POINT (-73.90903900000001 40.86267)
    
    
      2
      40.716507
      -73.961275
      04/16/2016
      4:30
      POINT (-73.961275 40.716507)
    
    
      3
      40.749788
      -73.987768
      04/16/2016
      4:30
      POINT (-73.987768 40.749788)
    
    
      4
      40.702401
      73.960496
      04/16/2016
      4:50
      POINT (73.96049599999999 40.702401)

In most cases, data with geospatial information provided in a CSV will be point data corresponding with individual coordinates. Sometimes, however, one may wish to define more complex geometry: square areas, for example, and *maybe* even complex polygons. While we won't cover these cases, they're quite similar to the extremely simple point case we've shown here. For further reference on such a task, refer to the `shapely` documentation.

Joining on existing geometries

Sometimes the necessary geospatial data is elsewhere entirely.

Suppose now that we have information on obesity by state.



In [14]:

    
obesity = pd.read_csv(gplt.datasets.get_path('obesity_by_state'), sep='\t')
obesity.head()









    Out[14]:







  
    
      
      State
      Percent
    
  
  
    
      0
      Alabama
      32.4
    
    
      1
      Missouri
      30.4
    
    
      2
      Alaska
      28.4
    
    
      3
      Montana
      24.6
    
    
      4
      Arizona
      26.8

We'd like to put this information on a map. But we don't have any geometry! We will once again have to define a geometry. Except that this time, instead of writing our own, we will need to find data with state shapes, and join that data against this data. In other cases there may be other shapes: police precincts, survey zones, and so on. Here is just such a dataset:



In [15]:

    
contiguous_usa = gpd.read_file(gplt.datasets.get_path('contiguous_usa'))
contiguous_usa.head()









    Out[15]:







  
    
      
      state
      adm1_code
      population
      geometry
    
  
  
    
      0
      Minnesota
      USA-3514
      5303925
      POLYGON ((-89.59940899999999 48.010274, -89.48...
    
    
      1
      Montana
      USA-3515
      989415
      POLYGON ((-111.194189 44.561156, -111.291548 4...
    
    
      2
      North Dakota
      USA-3516
      672591
      POLYGON ((-96.601359 46.351357, -96.5389080000...
    
    
      3
      Idaho
      USA-3518
      1567582
      POLYGON ((-111.049728 44.488163, -111.050245 4...
    
    
      4
      Washington
      USA-3519
      6724540
      POLYGON ((-116.998073 46.33017, -116.906528 46...

A simple `join` solves the problem:



In [20]:

    
result = contiguous_usa.set_index('state').join(obesity.set_index('State'))
result.head()









    Out[20]:







  
    
      
      adm1_code
      population
      geometry
      Percent
    
    
      state
      
      
      
      
    
  
  
    
      Minnesota
      USA-3514
      5303925
      POLYGON ((-89.59940899999999 48.010274, -89.48...
      25.5
    
    
      Montana
      USA-3515
      989415
      POLYGON ((-111.194189 44.561156, -111.291548 4...
      24.6
    
    
      North Dakota
      USA-3516
      672591
      POLYGON ((-96.601359 46.351357, -96.5389080000...
      31.0
    
    
      Idaho
      USA-3518
      1567582
      POLYGON ((-111.049728 44.488163, -111.050245 4...
      29.6
    
    
      Washington
      USA-3519
      6724540
      POLYGON ((-116.998073 46.33017, -116.906528 46...
      27.2

Now we can plot it:



In [24]:

    
import geoplot.crs as gcrs
gplt.cartogram(result, scale='Percent', projection=gcrs.AlbersEqualArea())









    Out[24]:





<cartopy.mpl.geoaxes.GeoAxesSubplot at 0x11c3bda20>

Save formats

You can read data out of a geospatial file format using GeoDataFrame.from_file. You can write data to a geospatial file format using GeoDataFrame.to_file. By default, these methods will infer the file format and save to a Shapefile, respectively. To specify an explicit file format, pass the name of that format to the driver argument. For example:

nyc_boroughs.to_file('boroughs.geojson', driver='GeoJSON')

The simplest and increasingly most common save format for geospatial data is GeoJSON. A geojson file may have a .geojson or .json extension, and stores data in a human-readable format:

{
  "type": "Feature",
  "geometry": {
    "type": "Point",
    "coordinates": [125.6, 10.1]
  },
  "properties": {
    "name": "Dinagat Islands"
  }
}

Historically speaking, the most common geospatial data format is the Shapefile. Shapefiles are not actually really files, but instead groups of files in a folder or zip archive that together can encode very complex information about your data. Shapefiles are a binary file format, so they are not human-readable like GeoJSON files are, but can efficiently encode data too complex for easy storage in a GeoJSON.

These are the two best-known file formats, but there are many many others. For a list of geospatial file formats supported by geopandas refer to the fiona user manual.

	BoroCode	BoroName	Shape_Leng	Shape_Area	geometry
0	5	Staten Island	330385.03697	1.623853e+09	(POLYGON ((-74.05050806403247 40.5664220341608...
1	4	Queens	861038.47930	3.049947e+09	(POLYGON ((-73.83668274106708 40.5949466970158...
2	3	Brooklyn	726568.94634	1.959432e+09	(POLYGON ((-73.8670614947212 40.58208797679338...
3	1	Manhattan	358532.95642	6.364422e+08	(POLYGON ((-74.01092841268033 40.6844914725429...
4	2	Bronx	464517.89055	1.186804e+09	(POLYGON ((-73.89680883223775 40.7958084451597...

	Borough	Block	Lot	...	Shape_Leng	Shape_Area	geometry
0	MN	1	10	...	12277.824113	7.550340e+06	POLYGON ((979561.8712409735 191884.2491553128,...
1	MN	1	101	...	3940.840373	5.018974e+05	POLYGON ((972382.8255597204 190647.2667211443,...
2	MN	1	101	...	3940.840373	5.018974e+05	POLYGON ((972428.8290766329 190679.1751218885,...
3	MN	1	101	...	3940.840373	5.018974e+05	POLYGON ((972058.3399882168 190689.2800885588,...
4	MN	1	201	...	6306.268341	1.148539e+06	POLYGON ((973154.7118112147 194614.3312935531,...
5	MN	2	1	...	2721.060649	1.008250e+05	POLYGON ((980915.0020648837 194319.1402828991,...
6	MN	2	2	...	2411.869687	8.724423e+04	POLYGON ((981169.004181549 194678.8213220537, ...

	Borough	Block	Lot	...	Shape_Leng	Shape_Area	geometry
0	MN	1	10	...	12277.824113	7.550340e+06	POLYGON ((-74.0169058260488 40.69335342975063,...
1	MN	1	101	...	3940.840373	5.018974e+05	POLYGON ((-74.04279194703045 40.68995148413111...
2	MN	1	101	...	3940.840373	5.018974e+05	POLYGON ((-74.04262611856618 40.69003912689961...
3	MN	1	101	...	3940.840373	5.018974e+05	POLYGON ((-74.04396208819837 40.69006636010664...
4	MN	1	201	...	6306.268341	1.148539e+06	POLYGON ((-74.04001513069795 40.7008411559464,...
5	MN	2	1	...	2721.060649	1.008250e+05	POLYGON ((-74.01202751677701 40.70003725302833...
6	MN	2	2	...	2411.869687	8.724423e+04	POLYGON ((-74.01111163437271 40.70102458543801...

	LATITUDE	LONGITUDE	DATE	TIME
0	40.767373	-73.950057	04/16/2016	4:13
1	40.862670	-73.909039	04/16/2016	4:30
2	40.716507	-73.961275	04/16/2016	4:30
3	40.749788	-73.987768	04/16/2016	4:30
4	40.702401	73.960496	04/16/2016	4:50

	State	Percent
0	Alabama	32.4
1	Missouri	30.4
2	Alaska	28.4
3	Montana	24.6
4	Arizona	26.8

	state	adm1_code	population	geometry
0	Minnesota	USA-3514	5303925	POLYGON ((-89.59940899999999 48.010274, -89.48...
1	Montana	USA-3515	989415	POLYGON ((-111.194189 44.561156, -111.291548 4...
2	North Dakota	USA-3516	672591	POLYGON ((-96.601359 46.351357, -96.5389080000...
3	Idaho	USA-3518	1567582	POLYGON ((-111.049728 44.488163, -111.050245 4...
4	Washington	USA-3519	6724540	POLYGON ((-116.998073 46.33017, -116.906528 46...

	adm1_code	population	geometry	Percent
state
Minnesota	USA-3514	5303925	POLYGON ((-89.59940899999999 48.010274, -89.48...	25.5
Montana	USA-3515	989415	POLYGON ((-111.194189 44.561156, -111.291548 4...	24.6
North Dakota	USA-3516	672591	POLYGON ((-96.601359 46.351357, -96.5389080000...	31.0
Idaho	USA-3518	1567582	POLYGON ((-111.049728 44.488163, -111.050245 4...	29.6
Washington	USA-3519	6724540	POLYGON ((-116.998073 46.33017, -116.906528 46...	27.2