Introduction to Spatial Analysis

Table of Contents

  1. Datasets in this Notebook
  2. Choropleth maps
  3. Spatial weights
  4. Exploratory Spatial Data Analysis
    1. Global spatial autocorrelation: Moran's I
    2. Local Measures of Spatial Association

Addendum

  1. Overview of spatial data types
  2. Coordinate Reference Systems (aka projections)
  3. Common vector data formats

Reference material

NOTE: we recommend you refer to the Spatial Queries in PostGIS notebook for explanations of the underlying SQL used below. The Spatial Queries notebook also has the SQL used to create the "il_des_subset_2014q3" table.

Datasets

Datasets used in this Notebook

  1. Chicago neighborhoods - polygons of Chicago's 88 neighborhoods, built from the underlying Census blocks using the Tract to Neighborhood lookup table. Code to create the neighborhood table is in this notebook
  2. Business locations and information for 2014Q3, derived from the IDES employer records and the addendum of employer locations

Considerations for spatial data

  • Data collection - at what spatial scale were data collected?
  • Data aggregation issue: aggregating to different spatial units can give different results due to something known as the "modifiable areal unit problem" (MAUP). MAUP is a statistical bias that arises when point data are summarized to areas, because the value of a given area depends on where its boundaries are drawn (description paraphrased from Wikipedia, which has more info and links to related issues); a small sketch follows this list.
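
To make MAUP concrete, here is a minimal, self-contained sketch (using randomly generated points, not any of the notebook's data) showing that the same points aggregated to different-sized grid cells give different per-area summaries:

In [ ]:
# a small, hypothetical illustration of MAUP using fake, randomly generated points
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # reproducible fake points
pts = pd.DataFrame({'x': rng.uniform(0, 10, 1000),
                    'y': rng.uniform(0, 10, 1000),
                    'value': rng.gamma(2., 2., 1000)})

# aggregate the same points to coarse (5x5) and fine (2x2) square grid cells
for size in (5, 2):
    cell_id = (pts['x'] // size).astype(int).astype(str) + '_' + \
              (pts['y'] // size).astype(int).astype(str)
    cell_means = pts.groupby(cell_id)['value'].mean()
    print('cell size {}: per-cell means range from {:.2f} to {:.2f}'
          .format(size, cell_means.min(), cell_means.max()))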

In [ ]:
## data manipulation libraries ##
# Pandas for generic manipulation
import pandas as pd
# NumPy for numeric helpers (eg the line of best fit later on)
import numpy as np
# GeoPandas for spatial data manipulation
import geopandas as gpd

# PySAL for spatial statistics
import pysal as ps

# shapely for specific spatial data tasks (GeoPandas uses Shapely objects)
# commented out as not used, but may be useful reference
# from shapely.geometry import Point, LineString, Polygon

# SQLAlchemy to get some data from the database
from sqlalchemy import create_engine

# improve control of visualizations
import matplotlib.pyplot as plt
%matplotlib inline

In [ ]:
# set up connection parameters for the 'appliedda' database
host = '10.10.2.10'
db = 'appliedda'

# create DB connection 
# (connection string is "<database-type>://<username>@<host-url>:<port>/<database-name>")
engine = create_engine("postgresql://{host_str}/{db_str}"
                       .format(host_str = host, db_str=db))

In [ ]:
# read in point locations - the full file has ~##k locations (~## of which are in Cook County)
# so here we are going to grab a sample of 500 from within Cook County
query = """
SELECT e.*

FROM class1.il_des_subset_2014q3 e
JOIN tl_2016_us_county c
ON ST_Within(e.geom, c.geom)

WHERE c.geoid = '17031'

LIMIT 500;
"""
points = gpd.read_postgis(query, engine, geom_col='geom',
                          crs={'init': u'epsg:4269'})

In [ ]:
# what info is contained in this file?
points.info()

In [ ]:
# what does the first row look like?
# points.head(1)

Point data alone is not that interesting, so we will also pull the Chicago neighborhoods, which were created using the SQL below - they already contain the number of businesses (tot_biz), total wages paid (wages_tot), and total number of jobs (jobs_tot) for 2014Q3.

SELECT n.neighborhood, count(e.*) tot_biz, 
    sum(e.total_wages) wages_tot, sum(e.total_jobs) jobs_tot, n.geom 
INTO class1.chi_nhood_estab_2014q3

FROM chicago_nhoods AS n
LEFT JOIN class1.il_des_subset_2014q3 e
ON ST_Within(e.geom, n.geom)

GROUP BY n.neighborhood, n.geom
ORDER BY n.neighborhood;

ALTER TABLE class1.chi_nhood_estab_2014q3 ADD PRIMARY KEY (neighborhood);
CREATE INDEX chi_nhood_estab_2014q3_geom_gist ON class1.chi_nhood_estab_2014q3 USING gist (geom);

In [ ]:
# Get data 
qry = "SELECT * FROM class1.chi_nhood_estab_2014q3"

nhoods = gpd.read_postgis(qry, engine, geom_col='geom', 
                          crs='+init=epsg:4269')
nhoods.info()

In [ ]:
# now let's try putting the sample of business locations on top of neighborhoods #

# create a map of Chicago neighborhoods colored grey
ax = nhoods.plot(color='grey', figsize=(6,8))
# note we assign the neighborhood plot to the object "ax" so we can 
# overlay the points on the same "matplotlib axis"

# use the same "ax" object to plot the business locations on top of the neighborhood map, 
# plus resize the markers and remove their outlines
points.plot(ax=ax, markersize=5, markeredgewidth=0); 
plt.annotate('Source: IDES', xy=(0.8,-0.10), xycoords="axes fraction")
plt.title('Business locations');

Choropleths

Choropleths are very useful because they can quickly convey how values compare across the study area. Let's start with a simple example of the number of businesses by neighborhood. (Note: much of the code below comes from this Geovisualization notebook: http://darribas.org/gds_scipy16/ipynb_md/02_geovisualization.html)


In [ ]:
# we'll create our matplotlib figure and axis first so we have more control over it later
f, ax = plt.subplots(figsize=(6,8))

# we'll pass the geopandas plot() function the column, scheme (calculation method), 
# number of groups to calculate (k)
# colormap to use, linewidth (to make the edges less noticeable),
# the axis object created above, and include a legend
nhoods.plot(column='tot_biz', scheme='QUANTILES', k=10, 
               cmap='plasma', linewidth=0.1, ax=ax, legend=True)

# and this time we'll turn off the coordinates
ax.set_axis_off();

As you can see, GeoPandas only allows using the "quantiles" scheme (or any other scheme supported by PySAL: http://pysal.readthedocs.io/en/latest/library/esda/mapclassify.html) with between 2 and 9 categories - if you try something different, it resets to 5.

So here is how you can use more categories in your choropleths: create a new column with the appropriate PySAL function and map that, as follows. By "appropriate PySAL function" we mean whichever classification algorithm you select; above we used Quantiles and below we use the Fisher-Jenks algorithm. In brief, Fisher-Jenks is a classification method that maximizes the difference in means between groups while minimizing the variance within groups.

The PySAL documentation is quite good, so we encourage you to check there (as well as the notebook referenced above).


In [ ]:
# let's try the 'Fisher_Jenks' scheme:
fj10 = ps.Fisher_Jenks(nhoods.tot_biz,k=10)

# the ps.<scheme> classifier gives us two useful attributes - the bins used for the cutoffs:
print('bins:')
print(fj10.bins)
# and the assigned bin number to use:
print('\nbin number:')
print(fj10.yb)

In [ ]:
# now we can use the new categories to color the choropleth of total businesses into 10 buckets
# notice the couple new keywords we include

# again we'll create the matplotlib figure and axis objects first
f, ax = plt.subplots(figsize=(6, 8))

# then create our choropleth, 
# the "assign" function adds our Fisher Jenks buckets as a new column to map
## Note with this formulation 'cl' is a temporary column
# the 'categorical=True' tells geopandas we want a different color for each category
nhoods.assign(cl=fj10.yb).plot(column='cl', categorical=True, \
        cmap='viridis', linewidth=0.1, ax=ax, \
        edgecolor='grey', legend=True)

# turn off the latitude/longitude axes
ax.set_axis_off();

Exercise 1: above is a choropleth of total businesses, which may or may not be a useful visualization. Try creating a choropleth of some other value that does not already exist in the nhoods GeoDataFrame (ie calculate a new value first and then visualize it).

Spatial weights

In the next sections we will demonstrate Moran's I and Getis-Ord G - measures of spatial correlation and association, respectively. In order for PySAL to assess if neighboring values are correlated or associated we first need to tell PySAL what to consider a neighbor for any given polygon. Additionally we need to tell PySAL how connected different neighbors are - this connectedness is generally termed "weight". These two components - neighbors and weights - are together known as "spatial weights".

Side note: this is a very similar concept to what we'll be talking about in the Network Analysis session.

There are a number of different ways to construct weights and we encourage you to look at the PySAL documentation for a description of each. Here we will consider "queen contiguity" - so named because of how the chess piece moves on a chess board (yes, there is also rook and bishop contiguity).

PySAL gives us the ability to create weights directly from other objects, such as a GeoDataFrame.


In [ ]:
# create weights, specificying 'geom' rather than default 'geometry' column
w = ps.weights.Queen.from_dataframe(nhoods, geom_col='geom')

In [ ]:
# the neighbor dictionary defines which areas are neighbors
w.neighbors

In [ ]:
# let's look at the entries in our dataframe associated with one set of neighbors:
nhoods.loc[[77, 1, 36, 62, 46],'neighborhood']

In [ ]:
# it is also generally a good idea to row-standardize the weights
# (so each area's weights sum to 1), which you can do by setting "transform"
w.transform = 'r'
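
To see what row-standardization did, here is a quick check (a minimal sketch using attributes of the PySAL weights object): each area's weights should now be 1 divided by its number of neighbors, so they sum to 1.

In [ ]:
# number of neighbors for the first area
print(w.cardinalities[0])
# its row-standardized weights
print(w.weights[0])
# which should sum to 1.0
print(sum(w.weights[0]))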

Exploratory spatial data analysis

The below code is sourced mostly from a notebook on Spatial Exploratory Data Analysis: http://darribas.org/gds_scipy16/ipynb_md/04_esda.html

Global spatial autocorrelation: Moran's I

The great font of knowledge, Wikipedia, tells us "Moran's I is a measure of spatial autocorrelation ... [which] is characterized by a correlation in a signal among nearby locations in space. Spatial autocorrelation is more complex than one dimensional autocorrelation because spatial correlation is multi-dimensional (ie 2 or 3 dimensions of space) and multi-directional."
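
Formally, for $n$ areas with values $x_i$, mean $\bar{x}$, and spatial weights $w_{ij}$, Moran's I is: $$I = \frac{n}{\sum_i \sum_j w_{ij}} \cdot \frac{\sum_i \sum_j w_{ij}(x_i - \bar{x})(x_j - \bar{x})}{\sum_i (x_i - \bar{x})^2}$$ Values above its expected value of $-1/(n-1)$ indicate that similar values cluster together in space, while values below it indicate that dissimilar values tend to be neighbors.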

Using the spatial weights created above we can derive a "spatial lag" variable to better understand how Moran's I works. For these examples let's consider the average wages per job across neighborhoods:
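
With row-standardized weights, the spatial lag is just the average of the variable over each area's neighbors: $$lag(x)_i = \sum_j w_{ij} x_j$$ so comparing each area's own value to its lagged value tells us whether it resembles its surroundings.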


In [ ]:
# create variable
nhoods['wages_per_job'] = nhoods['wages_tot'] / nhoods['jobs_tot']
# descriptive stats of nhoods:
nhoods.describe()

In [ ]:
# now create the spatially lagged value:
nhoods['AvgWagesLag'] = ps.lag_spatial(w, nhoods['wages_per_job'])

In [ ]:
# and let's see a map of the quantiles for each side by side
f, axes = plt.subplots(1, 2, sharey=True, figsize=(10,6))

# map wages per job on first map
nhoods.plot('wages_per_job', scheme='QUANTILES', k=7, \
    cmap='viridis_r', linewidth=0.1, ax=axes[0], \
    edgecolor='grey', legend=True)
axes[0].set_title('Wages per job')
# and spatially lagged value on second
nhoods.plot('AvgWagesLag', scheme='QUANTILES', k=7,  \
    cmap='viridis_r', linewidth=0.1, ax=axes[1], \
    edgecolor='grey', legend=True)
axes[1].set_title('Spatially lagged wages per job')

f.suptitle('Study variable vs spatially lagged variable');

Exercise 2: identify one reason the above visual is misleading - can you think of better visuals for the question we are trying to answer? (how the two variables compare)

One way to compare the two variables is a more formal statistical measure, Moran's I, and one visualization of it is simply a scatterplot of the study variable vs the spatially lagged variable.


In [ ]:
# calculate slope and intercept with a degree-1 least squares polynomial fit (ie OLS)
b, a = np.polyfit(nhoods['wages_per_job'], nhoods['AvgWagesLag'], 1)

# create scatter plot
plt.plot(nhoods['wages_per_job'], nhoods['AvgWagesLag'], '.b')

# add plot labels
plt.ylabel('Spatially lagged wages per job')
plt.xlabel('Wages per job')
plt.title('Moran scatterplot')

# add mean lines
plt.vlines(nhoods['wages_per_job'].mean(),
           nhoods['AvgWagesLag'].min(), 
           nhoods['AvgWagesLag'].max(),'k', '--')
plt.hlines(nhoods['AvgWagesLag'].mean(), 
           nhoods['wages_per_job'].min(), 
           nhoods['wages_per_job'].max(),'k', '--')

# add red dashed line of best fit
plt.plot(nhoods['wages_per_job'], a + b*nhoods['wages_per_job'], '--r');

The upper right and lower left quadrants represent positive spatial correlation, while the upper left and lower right represent negative spatial correlation.
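
To make the quadrant idea concrete, here is a minimal sketch counting how many neighborhoods fall in each quadrant relative to the mean lines drawn above (PySAL's local statistics encode a similar classification in their q attribute, which we use below):

In [ ]:
# classify each neighborhood by Moran scatterplot quadrant
above_x = nhoods['wages_per_job'] > nhoods['wages_per_job'].mean()
above_y = nhoods['AvgWagesLag'] > nhoods['AvgWagesLag'].mean()

print('High-High (upper right):', (above_x & above_y).sum())
print('Low-High  (upper left): ', (~above_x & above_y).sum())
print('Low-Low   (lower left): ', (~above_x & ~above_y).sum())
print('High-Low  (lower right):', (above_x & ~above_y).sum())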

Doing the same thing a different way, we can compute the global Moran's I in PySAL with:


In [ ]:
# global Moran's I
wagesPerJob_I = ps.Moran(nhoods['wages_per_job'], w)

print("Global Moran's I is {:.4f} with a p-value of {:0.4f}"\
      .format(wagesPerJob_I.I, wagesPerJob_I.p_sim))

There is more you can explore here - for example, where the observed value falls relative to the permutation distribution (see the sketch below) - but overall this tells us that yes, our study variable exhibits spatial autocorrelation, so let's dig into what that means at a local level.
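
A minimal sketch of that comparison, assuming the default permutations were run so the `sim` attribute of the Moran object is populated:

In [ ]:
# histogram of Moran's I under random permutations, with the observed value marked
plt.hist(wagesPerJob_I.sim, bins=30, color='grey')
plt.axvline(wagesPerJob_I.I, color='red', linestyle='--')
plt.xlabel("Moran's I under random permutations")
plt.title('Observed (red line) vs permuted values');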

Local Measures of Spatial Association

Local Moran's I is, unsurprisingly, Moran's I adapted to compute a statistic for each polygon in our dataset rather than a single statistic for the whole dataset.


In [ ]:
# calculate local statistic:
WagesPerJob_LocI = ps.Moran_Local(nhoods['wages_per_job'], w, 
                                  permutations=9999)

# how many of our computed values are statistically significant?
(WagesPerJob_LocI.p_sim < 0.001).sum()

Not very many of the local Moran's I statistics are significant even though our global statistic is - which simply means there are not many strong hotspots for our small dataset.
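
Of course, "how many are significant" depends on the cutoff we choose; here is a quick check at a few common thresholds (a minimal sketch reusing the p_sim values already computed):

In [ ]:
# count of significant local Moran's I values at several thresholds
for alpha in (0.05, 0.01, 0.001):
    n_sig = (WagesPerJob_LocI.p_sim < alpha).sum()
    print('p < {}: {} neighborhoods'.format(alpha, n_sig))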

Exercise 3: compute local Moran's I for a different variable, for an extra challenge try using a different geographic unit than neighborhoods

Let's make a map of our results, but instead of the 99.9% confidence level used above (p < 0.001) let's use 95% (p < 0.05). First we will identify the "hot" and "cold" spots.


In [ ]:
# identify statistically significant results:
sig_idx = WagesPerJob_LocI.p_sim < 0.05

# identify hotspots using the quadrant attribute (quadrant 1 = high-high;
# see the Moran scatterplot above for a brief explanation)
# and only keep those that are statistically significant
hotspots = (WagesPerJob_LocI.q == 1) * sig_idx

# similarly identify coldspots (quadrant 3 = low-low)
coldspots = (WagesPerJob_LocI.q == 3) * sig_idx

In [ ]:
# and plot them #
# first create a single variable of both hot and cold spots:
hotcold = hotspots*1 + coldspots*2

# we'll also make our own colormap with a new matplotlib library
from matplotlib import colors
hcmap = colors.ListedColormap(['grey', 'red', 'blue'])

# initialize fig and ax objects
f, ax = plt.subplots(figsize=(6,8))

# and map the hot and cold spots
nhoods.assign(cl=hotcold).plot('cl', categorical=True, k=2, 
                               cmap=hcmap, linewidth=0.1, 
                               legend=True,
                              ax=ax)
ax.set_ylabel('Latitude')
ax.set_xlabel('Longitude');

Getis-Ord G is a second measure of local spatial association and is similarly easy to calculate:


In [ ]:
# calculate local G
WagesPerJob_LocG = ps.G_Local(nhoods['wages_per_job'], w, 
                              permutations=9999)

# how many are significant?
print('{} observations are statistically significant'.\
      format((WagesPerJob_LocG.p_sim < 0.001).sum()))

In [ ]:
# and similarly let's make a map highlighting significant values, 
# but we'll do it slightly differently
f, ax = plt.subplots(figsize=(6,8))

# create index of significant and insignificant values at 95%
idx_sig = WagesPerJob_LocG.p_sim < 0.05
idx_ins = WagesPerJob_LocG.p_sim >= 0.05

# first we'll add our G values - the '\' just splits things to different lines
nhoods.assign(G=WagesPerJob_LocG.Gs)\
.loc[idx_sig,:]\
.plot('G', cmap='viridis', scheme='QUANTILES', 
      k=3, linewidth=0.1, ax=ax, legend=True)

# then add insignificant values in grey
nhoods.loc[idx_ins,:].plot(color='grey', linewidth=0.1, ax=ax);

And the local G* version looks like this:


In [ ]:
# calculate local G*
WagesPerJob_LocGstar = ps.G_Local(nhoods['wages_per_job'], w, 
                                  permutations=9999, 
                                  star=True)

# how many are significant?
print('{} observations are statistically significant'.format((WagesPerJob_LocGstar.p_sim < 0.05).sum()))

Spatial data types

There are two generic spatial data types:

  1. Vector - discrete data (usually), represented by points, lines, and polygons
  2. Raster - continuous data (usually), generally represented as square pixels where each pixel (or "grid cell") has some value. Examples of raster data (which often tie in to "big data"):
    • Imagery data (satellite, Google StreetView, traffic cameras, Placemeter)
    • Surface data (collected at monitoring stations then interpolated to a 'surface' - eg Array of Things, weather data)

However, raster data comes up in relatively few social science contexts, so the image below (courtesy of Data Science for Social Good) is probably sufficient discussion of rasters for now:

Notice the pesky "usually" next to both vector and raster datatypes? Technically any data could be represented as either vector or raster, but it would be computationally inefficient to create a raster layer of rivers or roads across Illinois because

  1. All the non-road and non-river locations would still have some value and
  2. You would have to pick a cell size which may not well represent the actual course of a given river (as opposed to a vector - line - that follows a path and could have some value for width of the river or road)
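
For a concrete sense of what vector data looks like in code, here is a minimal sketch building the three basic geometry types with Shapely (the same objects GeoPandas stores in its geometry column); the coordinates are made up purely for illustration:

In [ ]:
from shapely.geometry import Point, LineString, Polygon

# hypothetical coordinates (longitude, latitude), just to show the three vector types
pt = Point(-87.63, 41.88)                               # a single location
line = LineString([(-87.63, 41.88), (-87.62, 41.89)])   # eg a road segment
poly = Polygon([(-87.64, 41.87), (-87.62, 41.87),
                (-87.62, 41.89), (-87.64, 41.89)])      # eg a neighborhood boundary

print(pt.geom_type, line.geom_type, poly.geom_type)
print('line length (in degrees):', line.length)
print('polygon area (in square degrees):', poly.area)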

Coordinate Reference Systems

Coordinate Reference Systems (aka projections) are basically math that (1) describes how information in a given dataset relates to the rest of the world and (2) usually creates a 'flat' surface on which data can be analyzed using more common algorithms (eg Euclidean geometry).

Why do we care?

  1. Distance / area measurements
  2. Spatial joins - they won't work correctly if the datasets use different CRSs

As an example of point 1, consider the distance between two points: Euclidean distance (aka the Pythagorean theorem) provides an easy way to calculate distance so long as we know the difference in x (longitude) and y (latitude) between the two points: $$Distance = \sqrt{(x_1-x_2)^2 + (y_1-y_2)^2}$$

This works fine on correctly projected data, but does not work on unprojected data. For one thing the result would be in degrees and degrees are a different distance apart (in terms of meters or miles) at different points on the Earth's surface.
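
As a quick illustration (with made-up coordinates for two points in the Chicago area), the same pair of points yields a "distance" in decimal degrees when left unprojected and a distance in meters after reprojecting to a meters-based CRS such as EPSG 2163 - a minimal sketch, not part of the analysis below:

In [ ]:
from shapely.geometry import Point

# two hypothetical points in NAD83 (longitude, latitude)
pts_deg = gpd.GeoSeries([Point(-87.63, 41.88), Point(-87.90, 41.98)],
                        crs={'init': u'epsg:4269'})

# Euclidean distance on the unprojected data comes back in decimal degrees
print('unprojected "distance":', pts_deg.iloc[0].distance(pts_deg.iloc[1]))

# reproject to a meters-based CRS and repeat - now the answer is in meters
pts_m = pts_deg.to_crs(epsg=2163)
print('projected distance (m):', pts_m.iloc[0].distance(pts_m.iloc[1]))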

All this is to say: if you do a calculation with geographic data and the numbers don't make sense, check the projection. Let's do an example with the IL county areas.


In [ ]:
# create DB connection (connection string is "<database-type>://<username>@<host-url>:<port>/<database-name>")
engine = create_engine("postgresql://10.10.2.10:5432/appliedda")

# create SQL query - limit to just IL by using the state FIPS code of 17
sql = "SELECT * FROM tl_2016_us_county WHERE statefp = '17';"

# get data from the database
il_counties = gpd.read_postgis(sql, engine, geom_col='geom', index_col='gid')

In [ ]:
# print out the CRS of IL counties:
print(il_counties.crs)

So, first, the CRS needs to be set (presumably GeoPandas will set it appropriately from a database in a future release). If we look it up in the database we'll see that it's NAD83 (North American Datum 1983), which has the EPSG (European Petroleum Survey Group) code 4269.


In [ ]:
# set the counties crs to 4269
il_counties.crs = {'init': u'epsg:4269'}

# print it out
print(il_counties.crs)

In [ ]:
# let's check out the area calculated using GeoPandas with NAD83
il_counties['area_nad83'] = il_counties.geom.area

In [ ]:
# view the first 5 records' aland and calculated area with NAD83:
il_counties.loc[:,('aland', 'area_nad83')].head()

Clearly not the same. We can look for other projections at a number of websites, but a good one is epsg.io. Let's use the US National Atlas Equal Area projection (EPSG 2163), which is a meters-based equal area projection.


In [ ]:
# transform aka re-project the data (use the "inplace=True" flag to perform the operation on this Geodataframe)
il_counties.to_crs(epsg=2163, inplace=True)

# print out the CRS to see it worked
print(il_counties.crs)

In [ ]:
# and let's calculate the area with the new CRS
il_counties['area_2163'] = il_counties.geom.area

# and again check the head() of the data, with all 3 area columns:
il_counties.loc[:,('aland', 'area_nad83', 'area_2163')].head()

In [ ]:
# let's check if those small differences are just because we're only looking at land area, not full county area:
il_counties['total_area'] = il_counties.aland + il_counties.awater

# and recheck areas against total:
il_counties.loc[:,('total_area', 'area_nad83', 'area_2163')].head()

There are still some differences between our newly calculated area ('area_2163') and the total area that came with the data ('aland' + 'awater'); however, it is much closer than the NAD83 version. These small differences most likely mean the Census areas were calculated using a different Coordinate Reference System.
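
One quick way to quantify "much closer" is to look at the relative differences (a minimal sketch using the columns created above):

In [ ]:
# percent difference between our equal-area calculation and the Census total area
pct_diff = 100 * (il_counties['area_2163'] - il_counties['total_area']) \
           / il_counties['total_area']
pct_diff.describe()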

Common vector data formats

One common type of vector data is called a "shapefile". Shapefiles are actually a collection of files that share a base name (eg a set of IL_prisons.* files), and they include:

  1. Required files
    • .dbf - contains attributes (ie variables that describe the object, like number of households in a zip code)
    • .shp - the geographic information of the object
    • .shx - an index to facilitate searching within data
  2. Additional / optional files
    • .prj - the projection or coordinate reference system (CRS); technically optional but highly recommended as it tells you how the data relates to real-world locations
    • .sbn / .sbx - other indexes used by specific software
    • .shp.xml - metadata information

The above description is pulled mostly from the Wikipedia article on shapefiles, which also has much more information.

Other common vector formats:

  1. GeoJSON - Geospatial standards in the JavaScript Object Notation (JSON) text format
  2. KML / KMZ - Keyhole Markup Language popularized by Google
  3. GDB - Esri's Geodatabase, a proprietary data format used in ArcGIS software products
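
GeoPandas can read most of these formats through a single function, gpd.read_file(); a minimal sketch is below (the file paths are placeholders, not files that exist in this environment, so the lines are left commented out in the same style as the reference imports above):

In [ ]:
# hypothetical paths - substitute files you actually have access to
# counties = gpd.read_file('tl_2016_us_county.shp')     # shapefile
# nhoods_gj = gpd.read_file('chicago_nhoods.geojson')   # GeoJSON
# layers inside an Esri geodatabase also need the layer name:
# parcels = gpd.read_file('parcels.gdb', layer='parcels')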