Introduction

PostGIS is a free and open source library that adds spatial database functionality to PostgreSQL. It stores spatial data characteristics such as whether data are points (addresses or lat-lon), polygons (areas) or lines (streets). Further, with PostGIS you can query, index, and manipulate spatial data using SQL.

Spatial queries are queries involving the spatial dimension of data, which includes location, area, distance and interaction (Anselin 1990). Although they can be univariate (Where is this address?) they often involve relationships between different locations or geometric features. One of the things that differentiates spatial relationships is that what is closer tends to be more related than what is further away (Tobler 1979). There are many ways to operationalize proximity (Church and Murray 2009): One involves distance. Examples include straight-line distance, street-line distance, nearest neighbors or travel time by foot or transit. Alternative operationalizations focus on topological relationships such as adjacency (e.g. shared boundary), contiguity (travel through connected parts), intersection (two geometries meet or overlap) or connectivity (e.g. movement between two locations, often on network). Further, proximity can also be measured in terms of topological transformations such as buffers that transform a point using a distance threshold.

In other words, with spatial queries you can ask about relationships between geometries stored in your spatial database. Examples of such questions can be about access to jobs: “How many employers are within half a mile of a location?” or “Who are the five employers closest to this location?” or summarize characteristics of neighboring locations like “What is the average wage within half a mile of a location?

You can answer these questions using different kinds of spatial queries, depending on whether your data are stored as addresses, areas or lines. Different combinations are possible, e.g. querying points in relationship to other points, points related to areas, areas related to areas, lines to areas, etc. Questions to illustrate this include: “How many welfare recipients live within a mile of human services?” (point-to-point with buffer) or “How many low-wage earners live within a mile of each employer?” (point to center of area with buffer) or “For each block, what are the average wages in neighboring blocks?” (area to area).

Common spatial query functions that operationalize proximity in postGIS include (geometries can be points, lines or areas):

  • ST_Distance: Distance - distance between two geometries (e.g. two points)
  • ST_Within: Intersection (Overlap) - one geometry is completely inside another (e.g. points in area)
  • ST_Intersects: Intersection (Meet) - one geometry shares a portion with another (e.g. parts of line in area)
  • ST_Touches: Adjacency - geometries share common points but not interior (e.g. neighboring blocks)
  • ST_DWithin: Buffer - geometries within specified distance (e.g. points within radius)

Spatial queries involve choices about areas: What is the shape of the micro areas you are analyzing? If you aggregate individual-level data, what is the level of aggregation? These questions are challenging since the choice of area shape and the level of aggregation often yields different results: Two currently debated examples of this are gerrymandering and the masking of the Flint lead problem when analyzed at the zip code level (Sadler 2016) (see Openshaw 1979 and the MAUP problem for more background). If you start with addresses you have a choice about boundaries and levels of aggregation. With blocks, you can examine the impact of different levels of aggregation, albeit without control over the shape of the aggregated areas. In other cases, you will have less control over either -- e.g. when all you can access for confidentiality reasons is zip code data.

Spatial indexing is used to speed up spatial queries for large datasets by indexing generalizations (bounding boxes) of geometric features rather than the features themselves. (You can add a spatial index with the USING GIST command, which stands for generic index structure.)

The following examples of postGIS spatial queries illustrate some of these possibilities for relating points to points, points to areas, and areas to areas.

Data note: many of the following queries make use of the "il_des_subset_2014q3" table in the "class1" schema, which is a selected set of columns from the QCEW employer data and the establishment addendum data for 2014Q3. See the appendix for the SQL statements used.

Examples of Common Spatial Queries

As you’ve seen in the non-spatial database module, the syntax structure of SQL has these components:

  1. SELECT <column> (required)
  2. FROM <table> (required)
  3. WHERE <condition> (optional)

What is new in postGIS is the addition of spatial queries, as in the following example where _st_within_ is added to the WHERE condition to select points in an area. With this, you can, e.g., ask:

What points are inside a given area? e.g. select all employers in block X:

        /* SELECT query */
        SELECT name_legal
          FROM il_des_subset_2014q3 a, tl_2016_16980_tabblock10 b
            WHERE st_within(a.geom_2163, b.geom_2163)
            AND b.geoid10 = 'X';

In other words, SELECT the name column FROM employer table a and area table b WHERE the points are within areas, AND restrict the query to block X (which, in this case is the block with the largest number of welfare recipients). Then display the list of employers (a.name).

Now let's run the query for a specific Block (with FIPS code: 171419617003069) and only return the establishment account numbers and legal name:

        /* SELECT query */
        SELECT ein, uiacctno, rptunitno, name_legal
          FROM class1.il_des_subset_2014q3 a, tl_2016_16980_tabblock10 b
            WHERE st_within(a.geom_2163, b.geom_2163)
            AND b.geoid10 = '170313201002016';

        /* result columns - if you did not change anything it should return ## rows */
        ein, uiacctno, rptunitno, name_legal

Especially for small areas, it often makes more sense to measure proximity in terms of distance or nearest neighbors rather than shared area. In the next example we use a spatial query to identify the 10 nearest employers from the center of a block. In this case we fix the number of neighbors (10) and let the distance between the block center and the employers vary.

Find the 10 nearest neighbors of a given point e.g., find the 10 nearest employers from the center of block X (note we're using the same '170318236031029' Block referenced above, and got the values of X and Y below with the query: _SELECT ST_AsText(ST_Centroid(geom_2163)) block_centroid FROM tl_2016_16980_tabblock10 WHERE geoid10 = '170313201002016';_)

        /* SELECT query */
        SELECT ein, uiacctno, rptunitno, name_trade
          FROM class1.il_des_subset_2014q3
           ORDER BY geom_2163 <-> 'SRID=2163;POINT(1020027.11961544 -268894.849160242)'::geometry
           LIMIT 10;

        /* result columns */
        ein, uiacctno, rptunitno, name_trade

The key here is the <-> operator, which is a distance operator that uses the spatial index within the constraint of the ORDER BY clause and the number of neighbors identified in the LIMIT clause, which in this case is 10 but you can vary this number. The combination of this constrained search with the spatial index (based on bounding boxes of the geometries) makes this search fast. POINT(X Y) defines the XY coordinates of the center of our block of interest and 2163 is the Spatial Reference ID of the coordinate system used, in this case the US National Atlas projection. See the appendix for more information about projections.

Optional exercise question:

  1. Find out how far each of the returned employer locations is from the Block center point

If you want to instead use a common distance from the center of a block and let the number of employers vary within this distance, you can instead use a spatial query based on a distance threshold as in the following example:

Find all points within 1,000 meters (3,281 ft) of a given point e.g. find the employers within 1,000 meters of the center of block X

        /* SELECT query */
        SELECT ein, uiacctno, rptunitno, ST_Distance(geom_2163, ST_GeomFromText('POINT(1020027.11961544 -268894.849160242)',2163)) dist_meters
          FROM class1.il_des_subset_2014q3
            WHERE ST_DWithin(
                geom_2163,
                ST_GeomFromText('POINT(1020027.11961544 -268894.849160242)',2163),
                1000
            );

        /* result column */
        ein, uiacctno, rptunitno, dist_meters

This query is based on the _ST_DWithin_ function, which selects points (employers) within 1,000 meters of the center of a block. As above, the location of this center is defined by the XY coordinates in 'POINT(X1 Y)' and 2163 identifies the projection.

The previous example identified all the points in a given radius. What if you want to aggregate (sum, count, etc) an attribute instead? You can use the same example as above and just add the function to the column you select, as in this example:

How many of something are within 1,000 meters of a given point? e.g. how many jobs are within 1,000 meters of the center of block X

        /* SELECT query */
        SELECT count(*) as total_establishments
          FROM class1.il_des_subset_2014q3
            WHERE ST_DWithin(
                geom_2163,
                ST_GeomFromText('POINT(1020027.11961544 -268894.849160242)',2163),
                1000
            );

        /* result column */
        total_establishments

Here you are counting the number of establishments and creating a new alias for this variable with as total_establishments. Next we do the same thing with the average wage:

What is the average of something within 1,000 meters of a given point? e.g. what is the average wage within 1,000 meters of the center of block X

        /* SELECT query */
        SELECT avg(total_wages) as average_wage
          FROM class1.il_des_subset_2014q3
            WHERE ST_DWithin(
                geom_2163,
                ST_GeomFromText('POINT(1020027.11961544 -268894.849160242)',2163),
                1000
            );

        /* result column */
        average_wage

So far, we have selected points within areas, nearest points, points within distance thresholds, and summed or averaged point attributes within distance thresholds. Next we identify neighbors in area data by asking:

What are the adjacent areas of a given area? e.g. select blocks sharing a border with block X:

        /* SELECT query */
        SELECT b.geoid10 AS block_fips
            FROM tl_2016_16980_tabblock10 a, tl_2016_16980_tabblock10 b
                WHERE st_touches(a.geom_2163, b.geom_2163)
                AND a.geoid10 = '170313201002016';

        /* result column */
        block_fips

Here you use _st_touches_ to identify the areas that are adjacent to (touch) the selected block X.

Make a map

Using our new found skills let's make a quick map of 100 establishments in our data from 2014Q3 that are within 1km of our study block and overlay it on top of zip codes. NOTE: please do not worry about understanding all the Python code in the next few cells if it is confusing, we can revisit it later.


In [ ]:
# import packages
from sqlalchemy import create_engine
import geopandas as gpd
import matplotlib.pyplot as plt
%matplotlib inline

In [ ]:
# query to get the zipcodes:
zip_query = "SELECT geoid10 AS zipcode, geom_2163 FROM \
tl_2016_us_zcta510 WHERE ST_Dwithin(geom_2163, \
ST_GeomFromText('POINT(1020027.11961544 -268894.849160242)',2163),1000);"

# query to get the estabishments:
est_query = "SELECT ein, uiacctno, rptunitno, total_wages, geom_2163 \
FROM class1.il_des_subset_2014q3 \
WHERE ST_DWithin(geom_2163, \
ST_GeomFromText('POINT(1020027.11961544 -268894.849160242)',2163),1000) LIMIT 100;"

# connection string to the database
conn = create_engine("postgresql://10.10.2.10/appliedda")

# get the data
zipcodes = gpd.read_postgis(zip_query, conn, geom_col='geom_2163', crs="+init=epsg:2163")
establishments = gpd.read_postgis(est_query, conn, geom_col='geom_2163', crs="+init=epsg:2163")

# how many rows and columms did we get?
print('zipcodes has {} rows and {} colummns'.format(zipcodes.shape[0],zipcodes.shape[1]))
print('establishments has {} rows and {} colummns'.format(establishments.shape[0],establishments.shape[1]))

In [ ]:
# make the map #
# make a matplotlib figure and axis objects and set the image size
f, ax = plt.subplots(figsize=(6,6))

# draw zipcodes on our axis object
zipcodes.plot(color='grey', linewidth=0.1, ax=ax)
# and the establishments on top of the zipcodes, we'll color by total_wages
establishments.plot(column='total_wages', scheme='QUANTILES', k=5, cmap='OrRd', 
                   markeredgewidth=0.0, ax=ax, legend=True, markersize=10);

Spatial Queries and Spatial Joins

Spatial joins allow you to combine different datasets based on a spatial relationship instead of a common key. This is useful for relating different geometric types, such as points and areas, or different geometric shapes of the same type, such as blocks and zip codes.

Note that the discussion above about how different area shapes and levels of aggregation impact results is relevant here (see Openshaw and Sadler in the references).

So far, we have restricted the spatial query to a given point. However, often you want to run spatial queries for all of your data at once. In the following example, we have data on welfare recipients, low-wage earners, and housing assistance at the block level and data on employers at the address/point level. If we want to aggregate the employer point data to the block level to have one common unit of analysis, we can do so with a so-called spatial join operation in postGIS. In this example, we count points within areas but you can also use all the regular SQL aggregate functions instead (such as sum, average, and more complex mathematical operations). Or you could apply the spatial queries from the previous section to the whole dataset.

In this example we ask:

How many points are in each area? e.g., how many parolees were in each County of the Chicago metropolitan area in 2014?

Table 1: parolees with last known address Table 2: Counties

    SELECT c.geoid AS county_fips,
        COUNT(*) as parolee_count
    FROM ildoc.il_doc_parole p, tl_2016_us_county c
    WHERE ST_Intersects(p.geom_2163, c.geom_2163)
        AND c.geoid IN ('17031', '17037', '17043', '17063', '17089', '17093',
        '17097', '17111', '17197','18073', '18089', '18111', '18127', '55059')
        AND p.year = 2014
    GROUP BY c.geoid

You can do the same thing with SUM instead of count (or any other operation), as illustrated in this example:

How many point attributes are in each area? e.g., how many jobs are in each county of the Chicago MSA?

table_1: counties (areas) table_2: employers with jobs as attribute (points)

    SELECT c.geoid county_fips, 
        SUM(e.total_jobs) as total_jobs
    FROM class1.il_des_subset_2014q3 e, tl_2016_us_county c
    WHERE ST_Within(e.geom_2163, c.geom_2163)
        AND c.geoid IN ('17031', '17037', '17043', '17063', '17089', '17093',
        '17097', '17111', '17197','18073', '18089', '18111', '18127', '55059')
    GROUP BY county_fips

In this case we are summing the number of jobs associated with all the employers in all Chicago MSA counties. Now you aggregated two new spatial variables from the point to the block level: the number of parolees (count of points in area) in 2014 and total jobs (sum of point attributes in area) in 2014Q3.

Spatial analysis exercise

Let's use some of the queries above to make a new table in your schema. A few things to keep in mind:

  1. Be sure to set your team's schema correctly (eg replace class1 below with a9)
  2. The more complex the queries the longer they will run, so try to keep it relatively simple for this exercise

GOAL: make a map with a new variable you have created

Suggestion: each team take a couple minutes to think of 2 or 3 tables it may be interesting to make, then have one team member do the first two steps of the exercise and a different team member finish the exercise

In the example below we show:

  1. Creating a new table
  2. Table administrative tasks
  3. Adding a new column to the table
  4. Updating the new column

Step 1: create a new table For this example we'll start with creating a table of the zipcodes in Cook County, IL.

SELECT z.geoid10 zipcode, z.pop_est_2010, z.geom_2163
INTO class1.zcta_pop_example
FROM tl_2016_us_zcta510 z JOIN tl_2016_us_county c
ON z.geom_2163 && c.geom_2163
WHERE c.geoid = '17031';

Step 2: adminstrative tasks on new tables For each new table, there are a few things that will make life easier later:

a. Set the ownership of the table to your team's admin group (<schema>_admin, so eg class1_admin) - this will allow anyone on your team to edit/update the table

ALTER TABLE class1.zcta_pop_example OWNER TO class1_admin;


b. Add a unique primary key to the table

ALTER TABLE class1.zcta_pop_example ADD COLUMN uid serial NOT NULL;
ALTER TABLE class1.zcta_pop_example ADD PRIMARY KEY (uid);


c. Create indexes, below we have (a) an index on the zipcode column and (b) a spatial index on the geometry column

CREATE INDEX zcta_pop_example_zipcode_idx ON class1.zcta_pop_example (zipcode);
CREATE INDEX zcta_pop_example_geom_2163_gist ON class1.zcta_pop_example USING gist (geom_2163);


d. Vacuum analyze your new table

VACUUM ANALYZE class1.zcta_pop_example;


Step 3: Adding a new column to the table In order to add new data we first have to add a column in our table. This is the same syntax as used above to create the unique ID, but it will just be an empty column rather than auto-populating with a value. In general form the syntax is ALTER TABLE <table-name> ADD COLUMN <column-name> <column-type>;.

ALTER TABLE class1.zcta_pop_example ADD COLUMN parolees_2010 integer;

Step 4: Updating the new column Now we can add data to our new column. Let's add the count of parolees who's last known address for 2010 was in each zipcode:

UPDATE class1.zcta_pop_example z SET parolees_2010 = (SELECT count(*)
FROM ildoc.il_doc_parole p
WHERE p.geom_2163 && z.geom_2163 AND p.year = 2010);

In [ ]:
# now we'll make a map using very similar code to what is above #
# set the query
query = "SELECT * FROM class1.zcta_pop_example"
# get the data
zcta_parolees = gpd.read_postgis(query, conn, geom_col='geom_2163', crs='+init=epsg:2163')

# make the map #
# make a matplotlib figure and axis objects and set the image size
f, ax = plt.subplots(figsize=(6,6))

# draw zipcodes on our axis object
zcta_parolees.plot(column='parolees_2010', scheme='QUANTILES', k=7, cmap='OrRd', 
                   linewidth=0.1, ax=ax,legend=True);

References and Resources

Anselin, L. (1990). “What is Special About Spatial Data? Alternative Perspectives on Spatial Data Analysis.” In D.A. Griffith (ed.), Spatial Statistics: Past, Present and Future, Institute for Mathematical Geography Monograph Series, Ann Arbor, MI:IMaGe, pp. 63–77.

Boundless. Introduction to PostGIS. http://workshops.boundlessgeo.com/postgis-intro/

Church, Richard L., and Alan T. Murray. (2009). Business Site Selection, Location Analysis and GIS. Hoboken, NJ: Wiley.

Obe, Regina O., and Leo S. Hsu. (2015). PostGIS in Action. Manning Publications Co.

Openshaw, S., and Taylor, P . J. (1979). A Million or So Correlation Coefficients: Three Experiments on the Modifiable Areal Unit Problem, in Wrigley, N. (ed.), Statistical Methods in the Spatial Sciences (London: Pion), pp. 127–144.

Sadler Assistant Professor, Michigan State University, Richard Casey. "How ZIP Codes Nearly Masked the Lead Problem in Flint." The Conversation. The Conversation, 08 Feb. 2017. Web. 12 Mar. 2017. http://theconversation.com/how-zip-codes-nearly-masked-the-lead-problem-in-flint-65626.

Tobler, W. R. "Cellular Geography." Philosophy in Geography. Springer Netherlands, 1979. 379-386.

Appendix

Other reference materials

  1. Intro to spatial analysis notebook - provides examples of map visualization, the Getis-Ord G* hot-spot analysis, and an overview of Coordinate Reference Systems (CRS, also called "projections")
  2. Database connections notebook - examples of how to connect to the postgres database with PgAdmin, psql, and Python
  3. Load CSV into postgres with Pandas

Create a spatial table in PostGIS from existing tables

The _il_des_subset_2014q3_ table was created with the following SQL statments:

Create the table from a query using the "SELECT INTO FROM..." syntax

SELECT a.ein, b.uiacctno, b.rptunitno, a.name_legal, a.name_trade, a.sic, a.naics, a.total_wages, a.empl_month1::int + a.empl_month2::int + a.empl_month3::int AS total_jobs, b.census_id block_fips, b.x as lon, b.y as lat, b.geom, b.geom_2163 
INTO class1.il_des_subset_2014q3 
FROM ides.il_qcew_employers a 
JOIN ides.il_des_establishment b 
ON a.ein=b.ein AND a.seinunit=b.rptunitno AND b.uiacctno = a.empr_no 
    AND a.year=b.year AND a.quarter = b.quarter 
WHERE a.year = 2014 AND a.quarter = 3;

Set the table ownership

ALTER TABLE class1.il_des_subset_2014q3 OWNER TO class1_admin;

Add a primary key

create a unique ID column

ALTER TABLE class1.il_des_subset_2014q3 ADD COLUMN uid serial NOT NULL;

Then set "uid" as the primary key

ALTER TABLE class1.il_des_subset_2014q3 ADD PRIMARY KEY (uid);

Add spatial indexes to the two geometry columns

CREATE INDEX il_des_subset_2014q3_geom_gist ON class1.il_des_subset_2014q3 USING gist (geom);

CREATE INDEX il_des_subset_2014q3_geom_2163_gist ON class1.il_des_subset_2014q3 USING gist (geom_2163);

VACUUM ANLYZE the table

VACUUM ANALYZE class1.il_des_subset_2014q3;

Geocoding

There are several things you need to take care of under the hood before you can run spatial queries. Two important issues are geocoding and projections. Geocoding matches an address to a point on a map. This matching requires a reference map that a given address or address component can be compared to using an address locator. The quality of the input address data, reference map (also referred to as Master Address File) and address locator influence the precision of the match, indicated by a score. Popular and accurate address geocoding APIs are Google Maps’ (https://developers.google.com/maps/documentation/geocoding/start) but this service limits the number of records you can geocode per gmail account. It is also an online service, which does not work for restricted data that has to be stores offline. A popular offline geocoding solution that integrates with postgreSQL/postGIS is MapZen’s Pelias service: https://mapzen.com/blog/pelias-setup-tutorial/. For more information about geocoding, see Geocoding in PostGIS (http://postgis.net/docs/Extras.html) and http://www.directionsmag.com/entry/three-standard-geocoding-methods/123627.

Projections

The geocoded point on a map has an X and Y coordinate. The unit of these coordinates is in degrees. These degrees need to be projected to obtain commonly used metrics such as feet or meters that are needed for spatial queries involving distances. Projections convert a spherical to a flat surface. There are lots of projection choices to do this that involve different trade-offs in accuracy between distance, direction, form, and area. If you are combining multiple spatial datasets (such as points and areas), they need to share the same projection to overlay correctly on the same map. For more information about projections, see http://www.axismaps.com/guide/projections/ and https://www.oreilly.com/ideas/understanding-projections-with-spatial-and-geo-data.

Also see the addendum in the intro to spatial analysis notebook here