We are going to explore crime in London in the first half of 2018 by carrying out data cleaning, exploration and visualization.
The London crime data used for this exercise are monthly data from January to June 2018. Data source: https://data.police.uk/data/
The other datasets are "LSOA 2011" and "Deprivation Index 2015" from https://data.london.gov.uk.
This week's practical focuses on the application of a spatial clustering method to understand the patterns of crime concentration. You will use Python to detect crime hotspots, identify spatial autocorrelation, and practise spatial regression models with different deprivation index configurations.
The main tasks are
By the end of this practical you will learn:
In this practical, we will use the open data on crime and policing in England from https://data.police.uk/data/, which provides street-level crime, outcome, and stop-and-search data in simple CSV format. Normally, you would download the data for the Metropolitan Police Service from January to June 2018 and unzip the downloaded archive to extract the CSV files.
We will also use the London deprivation index data (IMD 2015) at LSOA level, which is included in the "LSOA_IMD.csv" file.
To save you time, I have cleaned the crime data into the 'londoncrime20180106.csv.zip' file and uploaded it together with the "LSOA_IMD.csv" file on the KEATS page. Please go to the KEATS page for this module, download the data into the folder that holds this notebook, and unzip it.
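If you prefer not to unzip by hand, pandas can also read a zip archive directly, provided it contains a single CSV. A minimal sketch (the filename follows the one quoted above, so adjust it to match your download):
import pandas as pd
# Peek at the zipped crime file without unzipping it first
# (only works when the .zip holds exactly one CSV; the filename is an assumption)
preview = pd.read_csv('londoncrime20180106.csv.zip', nrows=5)
preview.head()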
If Internet Explorer is your default browser for opening Jupyter notebooks, please change it to another browser, e.g. Chrome, because Internet Explorer cannot display the heatmap output. To do so, you need to:
(1) Generate the jupyter_notebook_config.py file (you may already have this file; if so, skip this step):
$ jupyter notebook --generate-config
(2) Edit the jupyter_notebook_config.py file in the .jupyter folder of your home directory, typically at ~/.jupyter/jupyter_notebook_config.py (macOS/Linux) or \Users\<your username>\.jupyter\jupyter_notebook_config.py (Windows). Change the line
# c.NotebookApp.browser = '' to
For MacOS: c.NotebookApp.browser = 'open -a /Applications/Google\ Chrome.app %s'
For Windows: c.NotebookApp.browser = 'C:/path/Google/Chrome/Application/chrome.exe %s' or whichever browser you are using.
On Windows 10, Chrome should be located at C:\Users\kxxxxxx\AppData\Local\Google\Chrome\Application\chrome.exe, but check on your system to be sure.
Save the change to your config file, then launch jupyter notebook as usual.
On some Windows setups the above option may still not launch the desired browser, but you will see the following in your terminal when the notebook starts:
# Copy/paste this URL into your browser when you connect for the first time,
to login with a token:
http://localhost:88XX/?token=.....................
Simply copy this URL into Chrome (or Firefox) and the notebook will open there.
Once again, the first thing we need to do is set up our working environment with "gsa2018". Run the cells below to import the relevant libraries.
To get started we're going to work with pandas, geopandas and pysal. You'll see we've got some new libraries here.
Specifying the Kernel. Note: before you go any further, check that you've got the right 'Kernel' specified in Jupyter. At the top right it should say "Python [conda env:gsa2018]" (or something very similar), and that is the environment we want to work in.
There are other kernels configured and these can be accessed by clicking on the 'Kernel' menu item and then 'Change Kernel'. This feature is well beyond the scope of this practical, but it basically allows you to run multiple 'versions' of Python with different libraries or versions of libraries installed at the same time.
Importing the Libraries
In [1]:
import os
import pandas as pd
import geopandas as gpd
from geopandas import GeoDataFrame
import numpy as np
import pysal
from pysal.spreg import ols
from pysal.spreg import ml_error
from pysal.spreg import ml_lag
import fiona
import shapely
from shapely.geometry import Point, Polygon
import seaborn as sns
sns.set_style("darkgrid")
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from matplotlib import gridspec
%matplotlib inline
from scipy import stats
from IPython.display import IFrame
import folium
from folium import plugins
from folium.plugins import HeatMap, MarkerCluster, FastMarkerCluster
The original crime dataset is a large CSV file, so we use pandas' chunked reading (read_csv with chunksize, which returns a TextFileReader iterator) to load the file in chunks of 100,000 rows and then concatenate the chunks back together into a new DataFrame.
After loading the dataset, use pandas' built-in functions to explore the characteristics of the data, as shown below.
In [2]:
imd=pd.read_csv('LSOA_IMD.csv')
# use TextFileReader iterable with chunks of 100,000 rows.
tp = pd.read_csv('loncrime2018_0106.csv', iterator=True, chunksize=100000)
crime_data = pd.concat(tp, ignore_index=True)
# print data shape
crime_data.shape
Out[2]:
You can easily read off the shape of the crime dataset: 501,130 rows and 14 columns.
In [3]:
imd.describe()
Out[3]:
Pandas describe function generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values. It analyzes both numeric and object series, as well as DataFrame column sets of mixed data types.
You can also tell from the output that the LSOA_IMD data has 4,946 rows and 24 columns.
In [4]:
# view the basic stats on columns
crime_data.info()
You will see that there are some missing data, but don't worry: we first need to decide whether it is worthwhile to fill them.
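A quick way to inform that decision is to count the missing values per column; here is a minimal sketch using the crime_data DataFrame loaded above:
# Count missing values per column to judge whether imputation is worthwhile
missing_counts = crime_data.isnull().sum()
print(missing_counts[missing_counts > 0])           # columns that contain NaNs
print((missing_counts / len(crime_data)).round(3))  # proportion of missing values per column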
There are columns we don't need for data processing, so we can delete them with the "drop" operation.
In [5]:
crime_data.drop(['field_1', 'Context','Crime ID', 'Falls with','Last outco','Reported b','Unnamed_ 0', 'Location'], inplace=True, axis=1)
# show dataframe columns
crime_data.columns
Out[5]:
You will find that we keep only the variables on crime type, location and date (actually month) for further interpretation. To save time typing later, we rename the columns with shorter labels.
In [6]:
# Check the field names and rename the columns with shorter, Python-friendly names
colnames = ['type','code','name','lat','lon','Month']
crime_data.columns = colnames
crime_data.head(4)
Out[6]:
In [7]:
# print all crime variables in the "type" column
crimes = crime_data['type'].sort_values().unique()
crimes, len(crimes)
Out[7]:
It would be worth exploring variation in the data we have imported. It would be useful to know how many rows correspond to each category. To do this, we use the .value_counts() function, which returns an object containing counts of unique values. When calling a function against a column, pandas allows us to reference the column name directly as an attribute. This structure requires the DataFrame name (e.g. crime_data), the column name (e.g. type), and the function name (e.g. value_counts()).
This step prepares us for exploring crime patterns by type.
In [8]:
# for example, let's check out the 'type' column
crime_data.type.value_counts()
Out[8]:
The next cell applies the groupby and sort_values functions to sort the crime counts by type.
In [9]:
crime_data_group = crime_data.groupby('type').size().reset_index(name='count')
crime_data_sort = crime_data_group.sort_values(['count'], ascending=False).reset_index(drop=True)
crime_data_sort
Out[9]:
In [10]:
in_type = [crime_data_sort['type'][i] for i in range(3)]  # the three most frequent crime types
fillcolors = ['#ff0000','#0000ff','#00ff00']
nlst = crime_data[crime_data.type.isin(in_type)].copy()
nlst.shape
Out[10]:
In [11]:
nlst['lon'].notnull().sum() # count the rows that have longitude data
# Your code here to replace the ???
nlst['lat'].notnull().sum() # count the rows that have latitude data
# If both counts match the number of rows above, that'll be great!
Out[11]:
In [12]:
new_gdb = gpd.GeoSeries(nlst[['lon', 'lat']].apply(Point, axis=1), crs="+init=epsg:4326")
bbox = new_gdb.total_bounds
titles=["Kernel Density: "+in_type[i] for i in range(3)]
fig, axs = plt.subplots(2, 2, figsize = (12,12))
ax1 = plt.subplot2grid((8,8), (0,0), rowspan=3, colspan=3)
ax2 = plt.subplot2grid((8,8), (4,0), rowspan=3, colspan=3)
ax3 = plt.subplot2grid((8,8), (0,4), rowspan=3, colspan=3)
fig.tight_layout(pad = 0.4, w_pad = 4.0, h_pad = 4.0)
ax1.set_title(titles[0], fontsize =16)
ax2.set_title(titles[1], fontsize =16)
ax3.set_title(titles[2], fontsize =16)
ax1.set_xlim(bbox[0], bbox[2])
ax1.set_ylim(bbox[1], bbox[3])
ax2.set_xlim(bbox[0], bbox[2])
ax2.set_ylim(bbox[1], bbox[3])
ax3.set_xlim(bbox[0], bbox[2])
ax3.set_ylim(bbox[1], bbox[3])
# ^The above code sets the x and y limits for each subplot. Without this, the density maps
# are very small and only take up about 20% of the graph space.
gdfnew1 = nlst[nlst['type']==in_type[0]]
gdfnew2 = nlst[nlst['type']==in_type[1]]
gdfnew3 = nlst[nlst['type']==in_type[2]]
sns.kdeplot(gdfnew1.lon, gdfnew1.lat, shade = True, cmap = "Reds", ax=ax1)
sns.kdeplot(gdfnew2.lon, gdfnew2.lat, shade = True, cmap = "Blues", ax=ax2)
sns.kdeplot(gdfnew3.lon, gdfnew3.lat, shade = True, cmap = "Greens", ax=ax3)
sns.set(style = "whitegrid") # aesthetics
sns.despine(left=True) # aesthetics
sns.set_context("paper") # aesthetics
plt.axis('equal')
plt.show()
In [13]:
# Set up geodataframe, initially with CRS = WGS84 (since that matches the lon and lat co-ordinates)
crs = {'init':'epsg:4326'}
geometry = [shapely.geometry.Point(xy) for xy in zip(crime_data['lon'], crime_data['lat'])]
geo_df = gpd.GeoDataFrame(crime_data,
crs = crs,
geometry = geometry)
# Convert geometry to OSGB36 EPSG: 27700
geo_df_new = geo_df.to_crs(epsg = 27700)
# write the GeoDataFrame out as an ESRI shapefile
geo_df_new.to_file(driver='ESRI Shapefile', filename='crime_data.shp')
The crime point data have now been saved as crime_data.shp in the folder that holds this notebook.
You may also want to double-check its CRS, which should now be British National Grid; you can do this in QGIS or directly in Python, as sketched below. We can then plot the crime data on a London basemap.
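A short sketch for the in-notebook check (assuming the crime_data.shp file written above):
# Double-check the CRS and extent of the exported shapefile from Python
check = gpd.read_file('crime_data.shp')
print(check.crs)           # should correspond to EPSG:27700 (British National Grid)
print(check.total_bounds)  # bounding box in metres, roughly covering Greater London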
In [14]:
# Plot map
fig, ax = plt.subplots(1,
figsize = (12,12),
dpi = 72,
facecolor = 'lightblue')
ax.set_position([0,0,1,1]) # Puts axis to edge of figure
ax.set_axis_off() # Turns axis off so facecolour applies to axis area as well as bit around the outside
ax.get_xaxis().set_visible(False) # Turns the x axis off so that 'invisible' axis labels don't take up space
ax.get_yaxis().set_visible(False)
lims = plt.axis('equal')
geo_df_new.plot(ax=ax)
plt.show()
In [15]:
# replace the ??? to present the top 4 rows of crime_data
crime_data.head(4)
Out[15]:
In [16]:
# draw the LSOA map; its Coordinate Reference System (CRS) is EPSG:27700 (British National Grid)
lsoa = gpd.read_file('LSOA_IMD2015.shp')
lsoa.crs = {'init': 'epsg:27700'}  # declare the CRS; the data are already in British National Grid
f, ax = plt.subplots(1, figsize=(12, 12))
lsoa.plot(ax=ax)
plt.show()
In [17]:
# check the lsoa is a GeoDataFrame
type(lsoa)
Out[17]:
In [18]:
# check the labels for columns in lsoa
lsoa.columns
Out[18]:
In [19]:
# make the columns for LSOA data more readable
# rename the indicator with full title to help you interpret the columns/indicators
lsoa.drop(['objectid', 'lsoa11nmw','st_areasha', 'st_lengths', 'IMD2015_LS', 'IMD2015__1', 'IMD2015_Lo'], inplace=True, axis=1)
colnames2 = ['code', 'name', 'IMDindex', 'Income', 'Employment', 'Education',
             'Health', 'Subcrime', 'BarriersHou', 'LivEnviron', 'Affect_child',
             'Affect_old', 'child_young', 'Adult_skill', 'Geobarrier', 'widerbarrier',
             'indoors', 'outdoors', 'population', 'depend_child', 'peo_middle',
             'peo_over60', 'work_age', 'geometry']
lsoa.columns = colnames2
lsoa.head(4)
Out[19]:
In [20]:
# visualize both the crime point data and the LSOA polygons on one map
f, ax = plt.subplots(1, figsize=(12, 12))
ax.set_axis_off()
plt.axis('equal')
lsoa.plot(ax=ax, cmap='OrRd', linewidth=0.5, edgecolor='white')
geo_df_new.plot(column='type', markersize=3, categorical=True, legend=True, ax=ax)
# the legend labels are the categories of the 'type' column
Out[20]:
In [21]:
# This is an example for you to follow on
# Make a choropleth map of the overall deprivation index (IMDindex).
lsoa.plot(column='IMDindex', cmap='OrRd', scheme='quantiles')
Out[21]:
In [22]:
# Please plot a series of quantile maps for 3 sub-domains of the deprivation index you want to explore
# e.g. Income, Employment, Education, etc.
# Your code here
lsoa.plot(column='Income', cmap='OrRd', scheme='quantiles')
Out[22]:
In [23]:
lsoa.plot(column='Employment', cmap='OrRd', scheme='quantiles')
Out[23]:
In [24]:
lsoa.plot(column='Education', cmap='OrRd', scheme='quantiles')
Out[24]:
We use the HeatMap plugin from Folium to generate a crime heatmap. It does not take DataFrames; you need to give it a list of [lat, lon] pairs, i.e. a list of lists. NaNs will also trip it up.
Because Internet Explorer does not render the heatmap, your output may be blank there, which is why we configured your browser at the beginning. Other browsers, such as Chrome (or Safari on a Mac), work well.
In [25]:
# Ensure you're handing it floats
crime_data['lat'] = crime_data['lat'].astype(float)
crime_data['lon'] = crime_data['lon'].astype(float)
# Filter the DF for rows, then columns, then remove NaNs
heat_df = crime_data[['lat', 'lon']]
heat_df = heat_df.dropna(axis=0, subset=['lat','lon'])
# List comprehension to make our list of lists
heat_data = [[row['lat'],row['lon']] for index, row in heat_df.iterrows()]
heatmap_map = folium.Map([51.50632, -0.1271448], zoom_start=12)
# Plot it on the map
hm=plugins.HeatMap(heat_data)
heatmap_map.add_child(hm)
# You save a map as an html file
heatmap_map.save("heatmap.html")
Please try zooming in and out on the heatmap, and save a screenshot of the crime heatmap around the Strand Campus to your folder as a .jpg or .png.
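The notebook imports IFrame from IPython.display at the top; a minimal sketch to view the saved map inline (assuming heatmap.html sits next to this notebook):
# Render the saved Folium map inline instead of opening the HTML file separately
from IPython.display import IFrame
IFrame('heatmap.html', width=800, height=500)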
To detect hotspots, the first step is to calculate how many crime incidents occurred in each LSOA; this is similar to the "Points in Polygon" operation in QGIS.
In [26]:
# calculating total number of crime incidents per lsoa
crime_data2 = pd.DataFrame(crime_data['code'].value_counts().astype(float))
crime_data2 = crime_data2.reset_index()
crime_data2.columns = ['code','Numbers']
crime_data2.head(4)
Out[26]:
In [27]:
# add a new column with the number of crime incidents in each LSOA
# by joining lsoa and crime_data2 through an attribute join
lsoa = lsoa.merge(crime_data2, on='code')
# save the newly joined GeoDataFrame as a shapefile
lsoa.to_file(driver='ESRI Shapefile', filename='lsoa_numbers.shp')
When you check the columns of the lsoa GeoDataFrame, you will find a new column "Numbers" has been added at the end.
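A quick confirmation of the join, sketched with the merged lsoa GeoDataFrame:
# Confirm that the attribute join added the crime counts
print(lsoa.columns.tolist())           # 'Numbers' should appear at the end
lsoa[['code', 'name', 'Numbers']].head(4)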
In [28]:
# Make a Choropleth map on crime incidents per lsoa.
lsoa.plot(column='Numbers', cmap='coolwarm', scheme='quantiles')
Out[28]:
We implement Local Indicators of Spatial Association (LISA) statistics, local Moran's I and Getis and Ord's G, in PySAL to detect hotspots.
We use the local Moran's I index,

$$I_i = \frac{y_i - \bar{y}}{m_2} \sum_j w_{ij}\,(y_j - \bar{y}), \qquad m_2 = \frac{1}{n}\sum_i (y_i - \bar{y})^2,$$

to test for spatial autocorrelation. It measures the spatial autocorrelation in an attribute y observed over n spatial units. To calculate Moran's I we first need a spatial weights matrix W (which can also be created and read in as a GAL file). To obtain W, we need to work out which polygons neighbour each other (e.g. Queen-style contiguity neighbours, Rook's case neighbours, etc.).
Read more in R. Bivand (2017), "Finding Neighbours".
Normally, we use a Rook or Queen contiguity weights matrix. Rook weights treat observations as neighbours only when they share an edge, while Queen contiguity treats polygons as neighbours if they share either an edge or a vertex. The two matrices may therefore differ, depending on how the polygons are arranged.
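To see how much the two definitions differ for these LSOAs, here is a short comparison sketch; it assumes the lsoa_numbers.shp file written earlier and uses the same PySAL 1.x helpers as the next cell:
# Compare Rook and Queen contiguity built from the same shapefile
w_rook = pysal.rook_from_shapefile("lsoa_numbers.shp")
w_queen = pysal.queen_from_shapefile("lsoa_numbers.shp")
print("Mean neighbours (Rook):  %.2f" % w_rook.mean_neighbors)
print("Mean neighbours (Queen): %.2f" % w_queen.mean_neighbors)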
In [29]:
# Generate a contiguity-based spatial weights matrix for the LSOAs.
# We use the Queen case here as an example.
# You are encouraged to try the Rook case and compare the results.
w = pysal.queen_from_shapefile("lsoa_numbers.shp")
w.transform = 'r'
The y attribute is read from a DataFrame; this is effectively the same data as lsoa, but we rebuild the DataFrame from the shapefile so that the attribute values and the weights matrix come from the same source.
In [30]:
dataframe = pysal.pdio.read_files("lsoa_numbers.shp")
y = np.array(dataframe['Numbers'])
yI=pysal.Moran_Local(y,w,permutations=9999)
yI.p_sim.shape
Out[30]:
We have an array of local I statistics stored in the .Is attribute, and p-values from the simulation stored in p_sim.
In [31]:
yI.Is[0:20], yI.p_sim[0:20]
Out[31]:
To visualize the distribution of local Moran's I, we can present the Moran scatterplot.
To do this, we first need to construct the spatial lag of the variable:
In [32]:
lag_y = pysal.lag_spatial(w,y)
sigs = y[yI.p_sim <= .001]
w_sigs = lag_y[yI.p_sim <= .001]
insigs = y[yI.p_sim > .001]
w_insigs = lag_y[yI.p_sim > .001]
We can now draw the Moran scatterplot for the data:
In [33]:
b,a = np.polyfit(y, lag_y, 1)
# Calculate the global Moran's I index
I_y = pysal.Moran(y, w, two_tailed=False)
In [34]:
plt.plot(sigs, w_sigs, '.', color='firebrick')
plt.plot(insigs, w_insigs, '.k', alpha=.2)
# dashed vertical line at the mean of the crime counts
plt.vlines(y.mean(), lag_y.min(), lag_y.max(), linestyle='--')
# dashed horizontal line at the mean of the spatial lag
plt.hlines(lag_y.mean(), y.min(), y.max(), linestyle='--')
# red line of best fit using global I as slope
plt.plot(y, a + b*y, 'r')
plt.title('Moran Scatterplot')
plt.text(s='$I = %.3f$' % I_y.I, x=5000, y=3500, fontsize=18)
plt.ylabel('Spatial Lag of Numbers')
plt.xlabel('Numbers')
plt.show()
To identify the significant LISA values we can use numpy indexing:
In [35]:
sig = yI.p_sim < 0.05
sig.sum()
Out[35]:
and then use this indexing on the q attribute to find out which quadrant of the Moran scatter plot each of the significant values is contained in:
In [36]:
yI.q[sig]
Out[36]:
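As a short optional sketch using numpy's unique function, you can tally how many significant LSOAs fall in each quadrant (in PySAL's coding, 1 = high-high, 2 = low-high, 3 = low-low, 4 = high-low):
# Tally the significant LSOAs by Moran scatterplot quadrant
quadrants, counts = np.unique(yI.q[sig], return_counts=True)
dict(zip(quadrants, counts))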
Local Getis and Ord's G can be calculated as

$$G_i = \frac{\sum_{j \ne i} w_{ij}\, y_j}{\sum_{j \ne i} y_j}.$$

Here we just calculate G as an example; a sketch for the G* variant, which includes the focal observation itself, follows the G example below.
In [37]:
from pysal.esda.getisord import G_Local
lg = G_Local(y,w)
lg.n
Out[37]:
In [38]:
lg.p_sim
Out[38]:
To identify the significant G index values we can use numpy indexing:
In [39]:
sig = lg.p_sim<0.05
lg.p_sim[sig]
Out[39]:
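As mentioned above, here is a minimal sketch for G*; it assumes that the star argument of G_Local switches to the G* statistic in this version of PySAL:
# G* includes the focal observation itself in the local sum
# (star=True is an assumption about this PySAL version's G_Local signature)
from pysal.esda.getisord import G_Local
lg_star = G_Local(y, w, star=True)
lg_star.p_sim[:10]  # pseudo p-values for the first ten LSOAs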
In [40]:
# Hot spots
# High value with high-valued neighbours.
hotspots = (yI.q == 1) & sig  # quadrant 1 = high values with high-valued neighbours
hotspots.sum()
Out[40]:
In [41]:
# Plot the map of hotspots
tx = gpd.read_file("lsoa_numbers.shp")
hmap = matplotlib.colors.ListedColormap(['grey', 'red'])
f, ax = plt.subplots(1, figsize=(12, 12))
tx.assign(cl=hotspots*1).plot(column='cl', categorical=True, \
k=2, cmap=hmap, linewidth=0.1, ax=ax, \
edgecolor='grey', legend=True)
ax.set_axis_off()
plt.show()
Similarly, we can detect the coldspots, which are low values with low-valued neighbours. Please replace the ??? below to complete the detection of coldspots.
In [42]:
# Cold spots
coldspots = (yI.q == 3) & sig  # quadrant 3 = low values with low-valued neighbours
coldspots.sum()
Out[42]:
In [43]:
# Plot the map of coldspots
tx = gpd.read_file("lsoa_numbers.shp")
hmap = matplotlib.colors.ListedColormap(['grey', 'blue'])
f, ax = plt.subplots(1, figsize=(12, 12))
tx.assign(cl=coldspots*1).plot(column='cl', categorical=True, \
k=2, cmap=hmap, linewidth=0.1, ax=ax, \
edgecolor='grey', legend=True)
ax.set_axis_off()
plt.show()
Regression analysis is a predictive modelling technique that investigates the relationship between a dependent (target) variable and one or more independent (predictor) variables.
It is used for forecasting, time-series modelling and identifying relationships between variables.
This will allow us not only to set up the model and interpret the coefficients, which is the basis for the spatial models, but also to provide a baseline against which the spatial extensions can be evaluated.
In this example, we will be analyzing Numbers with relation to Income, Employment, LivEnviron and peo_middle by lsoa.
In [44]:
# Read in the Numbers (dependent variable) into an array y
# and the value for Income, Employment, LivEnviron and population (independent variables)
# into a two-dimensional array X.
f = pysal.open("lsoa_numbers.dbf",'r')
y = np.array(f.by_col['Numbers'])
y.shape = (len(y),1)
X= []
X.append(f.by_col['Income'])
X.append(f.by_col['Employment'])
X.append(f.by_col['LivEnviron'])
X.append(f.by_col['peo_middle'])
X = np.array(X).T
Now that we have stored the values for analysis, we can perform an ordinary least squares (OLS) regression.
This is done with the pysal.spreg module. We will use lrs.summary to obtain a full summary of the results.
In [45]:
lrs = ols.OLS(y, X, name_y = 'Numbers', name_x = ['Income', 'Employment','LivEnviron', 'peo_middle'], name_ds = 'lsoa_numbers')
print(lrs.summary)
OLS regression assumes that each observation is independent of all the others. However, in a real urban context, the First Law of Geography suggests that places near to each other are unlikely to be independent; they may depend on each other spatially.
We can evaluate spatial autocorrelation with Moran's I, using the spatial weights matrix w that we already created during the hotspot detection procedure.
The next cell checks for spatial autocorrelation in the crime counts; you can also apply the same test to the residuals generated by the OLS model (see the sketch after the next cell) to determine whether a spatial regression model is appropriate for the data.
In [46]:
mi = pysal.Moran(y, w, two_tailed=False)
print("The Statistic Moran's I is: "+str("%.4f"%mi.I),
"\nThe Expected Value for Statistic I is: "+str("%.4f"%mi.EI),
"\nThe Significance Test Value is: "+str("%.4f"%mi.p_norm))
We can also use Geary's C as a test; have a try below.
In [47]:
gc = pysal.Geary(y, w)
print("The Statistic C is: "+str("%.3f"%gc.C),
"\nThe Expected Value for Statistic C is: "+str(gc.EC),
"\nThe Significance Test Value for Normal Frequency Distribution is: "+str("%.3f"%gc.z_norm))
Now, we can use a spatial regression model to account for spatial non-independence. The spreg module has several different functions for creating a spatial regression model.
Firstly, we will try the spatial error model estimation with maximum likelihood.
In [48]:
spat_err = ml_error.ML_Error(y, X, w,
name_y='Numbers', name_x=['Income', 'Employment','LivEnviron', 'peo_middle'],
name_w='w.gal', name_ds='lsoa_numbers')
print(spat_err.summary)
Secondly, we will try the spatial lag model estimation with maximum likelihood.
In [49]:
spat_lag = ml_lag.ML_Lag(y, X, w,
name_y='Numbers', name_x=['Income', 'Employment','LivEnviron', 'peo_middle'],
name_w='w.gal', name_ds='lsoa_numbers')
print(spat_lag.summary)
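Before interpreting the results, a short comparison sketch can help; it assumes the logll and aic attributes that spreg attaches to these fitted models (the values behind the "Log likelihood" and "Akaike info criterion" lines in the summaries):
# Tabulate model fit side by side (logll and aic are assumed spreg attributes)
for name, model in [('OLS', lrs),
                    ('Spatial error (ML)', spat_err),
                    ('Spatial lag (ML)', spat_lag)]:
    print("%-20s log-likelihood = %10.2f   AIC = %10.2f" % (name, model.logll, model.aic))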
Please try to interpret the regression summaries for the different models, and send your results to the module lead.