The Atlanta Police Department provides Part 1 crime data at http://www.atlantapd.org/i-want-to/crime-data-downloads
A recent copy of the data file is stored on the cluster. Please do not copy this data file into your home directory!
In [2]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
In [3]:
# load data set
df = pd.read_csv('/home/data/APD/COBRA-YTD-multiyear.csv.gz')
print("Shape of table:", df.shape)
Let's look at the structure of this table. We're actually creating some text output that can be used to create a data dictionary.
In [4]:
dataDict = pd.DataFrame({'DataType': df.dtypes.values, 'Description': '', }, index=df.columns.values)
We need to enter the descriptions for each entry in our dictionary manually...
In [6]:
# Assign with .loc[row, column] so the change lands in the frame itself,
# not in a temporary copy (chained assignment like .loc[row].Description = ... is unreliable).
dataDict.loc['MI_PRINX', 'Description'] = '' # type: int64
dataDict.loc['offense_id', 'Description'] = 'Unique ID in the format YYDDDNNNN with the year YY, the day of the year DDD and a counter NNNN' # type: int64
dataDict.loc['rpt_date', 'Description'] = 'Date the crime was reported' # type: object
dataDict.loc['occur_date', 'Description'] = 'Estimated date when the crime occurred' # type: object
dataDict.loc['occur_time', 'Description'] = 'Estimated time when the crime occurred' # type: object
dataDict.loc['poss_date', 'Description'] = '' # type: object
dataDict.loc['poss_time', 'Description'] = '' # type: object
dataDict.loc['beat', 'Description'] = '' # type: int64
dataDict.loc['apt_office_prefix', 'Description'] = '' # type: object
dataDict.loc['apt_office_num', 'Description'] = '' # type: object
dataDict.loc['location', 'Description'] = '' # type: object
dataDict.loc['MinOfucr', 'Description'] = '' # type: int64
dataDict.loc['MinOfibr_code', 'Description'] = '' # type: object
dataDict.loc['dispo_code', 'Description'] = '' # type: object
dataDict.loc['MaxOfnum_victims', 'Description'] = '' # type: float64
dataDict.loc['Shift', 'Description'] = 'Zones have 8 or 10 hour shifts' # type: object
dataDict.loc['Avg Day', 'Description'] = '' # type: object
dataDict.loc['loc_type', 'Description'] = '' # type: float64
dataDict.loc['UC2 Literal', 'Description'] = '' # type: object
dataDict.loc['neighborhood', 'Description'] = '' # type: object
dataDict.loc['npu', 'Description'] = '' # type: object
dataDict.loc['x', 'Description'] = '' # type: float64
dataDict.loc['y', 'Description'] = '' # type: float64
dataDict.to_csv("COBRA_Data_Dictionary.csv")
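The YYDDDNNNN layout described for offense_id above can be unpacked by simple string slicing. A minimal sketch, using a made-up ID value:

```python
oid = 153650123  # hypothetical ID: year 15, day-of-year 365, counter 0123
s = str(oid).zfill(9)  # pad to 9 digits in case the leading year digit is 0
yy, ddd, nnnn = s[:2], s[2:5], s[5:]
print(yy, ddd, nnnn)  # -> 15 365 0123
```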
In [7]:
dataDict
Out[7]:
Please refer to the pandas time series documentation for working with time series data.
In [10]:
# function currying: fixdatetime returns a row-fixing function for the given field prefix
def fixdatetime(fld):
    def _fix(s):
        date_col = '%s_date' % fld  # e.g. "rpt_date"
        time_col = '%s_time' % fld  # e.g. "rpt_time"
        if time_col in s.index:
            return str(s[date_col]) + ' ' + str(s[time_col])
        else:
            return str(s[date_col]) + ' 00:00:00'
    return _fix
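As a quick sanity check of the closure, here is a toy row with made-up dates (the function definition is repeated so the snippet stands alone):

```python
import pandas as pd

def fixdatetime(fld):
    def _fix(s):
        date_col = '%s_date' % fld
        time_col = '%s_time' % fld
        if time_col in s.index:
            return str(s[date_col]) + ' ' + str(s[time_col])
        else:
            return str(s[date_col]) + ' 00:00:00'
    return _fix

row = pd.Series({'rpt_date': '01/15/2015',
                 'occur_date': '01/14/2015', 'occur_time': '22:30:00'})
print(fixdatetime('occur')(row))  # -> 01/14/2015 22:30:00
print(fixdatetime('rpt')(row))    # no rpt_time column -> midnight default
```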
In [11]:
for col in ['rpt', 'occur', 'poss']:
    datser = df.apply(fixdatetime(col), axis=1)
    df['%s_dt' % col] = pd.to_datetime(datser, format="%m/%d/%Y %H:%M:%S", errors='coerce')
In [12]:
df[["MI_PRINX", "offense_id", "beat", "UC2 Literal", "neighborhood", "rpt_dt", "occur_dt", "poss_dt"]].head()
Out[12]:
What's the date range of the data?
In [13]:
print(df.occur_dt.min(), '---', df.occur_dt.max())
Number of crimes reported each year:
In [14]:
# resample is like "groupby" for time
df.resample('A-DEC', closed='right', on='occur_dt').offense_id.count()
# df['Year'] = df.occur_dt.map(lambda d: d.year)
# df2 = df[(df.Year>=2010) & (df.Year<=2017)]
# df2.shape, df.shape
Out[14]:
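To see how resample buckets rows into time bins, here is a toy version of the count above with made-up dates:

```python
import pandas as pd

events = pd.DataFrame({
    'dt': pd.to_datetime(['2015-03-01', '2015-07-15', '2016-01-02', '2016-11-30']),
    'offense_id': [1, 2, 3, 4],
})
# annual bins ending in December, as with 'A-DEC' above
counts = events.resample('A-DEC', on='dt')['offense_id'].count()
print(counts)  # one row per year: 2 events in 2015, 2 in 2016
```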
Looks like most of the data is actually from 2009-2017! Let's throw the rest away...
In [15]:
df = df[df.occur_dt>='01/01/2009']
In [39]:
df[["occur_dt", "UC2 Literal", "offense_id"]].head()
Out[39]:
In [115]:
# Pivoting the table:
# index = column that the new table will be indexed by
# columns = column whose unique values will form the new column names
# values = values used to fill the table (default = all columns other than those given in index and columns)
df_ct = df.pivot_table(index="occur_dt", columns="UC2 Literal", values="offense_id")
In [116]:
df_ct.head()
Out[116]:
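A toy illustration of what pivot_table does here, using made-up rows (note the NaN where a crime type has no event on a given date):

```python
import pandas as pd

toy = pd.DataFrame({
    'occur_dt': pd.to_datetime(['2015-01-01', '2015-01-01', '2015-01-02']),
    'UC2 Literal': ['BURGLARY', 'ROBBERY', 'BURGLARY'],
    'offense_id': [101, 102, 103],
})
wide = toy.pivot_table(index='occur_dt', columns='UC2 Literal', values='offense_id')
print(wide)  # dates as rows, crime types as columns, offense ids as values
```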
This gives us a timeline for different types of crime reported in Atlanta.
By itself, this can be useful, but we are more interested in aggregate statistics. Let's get the number of crimes by month...
In [117]:
df_ct = df_ct.resample("1M", closed="right").count()
In [113]:
df_ct.head()
Out[113]:
Distribution of the number of crimes per month, for each crime type:
In [56]:
ax = df_ct.plot.box(figsize=(13,4), rot=45)
plt.ylabel("Total Reported Crimes by Month")
Out[56]:
Explanation of boxplot:
From the matplotlib documentation (http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.boxplot):
The box extends from the lower to upper quartile values of the data, with a line at the median. The whiskers extend from the box to show the range of the data. Flier points are those past the end of the whiskers.
Whiskers: IQR is the interquartile range (Q3 - Q1). The upper whisker will extend to the last datum less than Q3 + whis*IQR (where the default value for whis is 1.5). Similarly, the lower whisker will extend to the first datum greater than Q1 - whis*IQR. Beyond the whiskers, data are considered outliers and are plotted as individual points.
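The whisker bounds can be computed directly; a minimal sketch with whis = 1.5 on a made-up sample:

```python
import numpy as np

data = np.array([1, 2, 4, 5, 6, 7, 9, 40])  # 40 is a deliberate outlier
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                  # interquartile range
whis = 1.5                     # matplotlib's default
upper_fence = q3 + whis * iqr
lower_fence = q1 - whis * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]
print(q1, q3, iqr, outliers)   # -> 3.5 7.5 4.0 [40]
```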
In [70]:
## In-class exercise:
# Make a boxplot of the number of reported crimes, aggregating by week.
In [199]:
df_wk = df.pivot_table(index="occur_dt", columns="UC2 Literal", values="offense_id")
df_wk = df_wk.resample("W-SUN", closed='right').count()
df_wk.plot.box(figsize=(13,4), rot=45)
plt.ylabel("Total Reported Crimes by Week")
Out[199]:
Let's take a look at a time series plot of the number of crimes over time...
In [69]:
ax = df_ct.plot(figsize=(10,5), style='-o')
ax.get_legend().set_bbox_to_anchor((1, 1))
plt.ylabel("Total Reported Crimes by Month")
ax.vlines(pd.date_range("12/31/2009", "12/31/2017", freq="A-JAN"), 0, 900)
Out[69]:
Can you pick out the seasonal variation in number of crimes per year?
Suppose we are not interested in seasonal trends, but want to see if the number of reported crimes is changing year over year. We could simply add the number of crimes together to get number of crimes reported each year.
In [61]:
ann_cr = df_ct.resample("A-DEC", closed="right").sum()
In [63]:
ax = ann_cr[ann_cr.index<"01/01/2017"].plot(figsize=(10,5), style='-o')
ax.get_legend().set_bbox_to_anchor((1, 1))
You can use the "corr" method in Pandas to find the correlation between columns of a dataframe.
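A quick illustration of corr on a toy frame (made-up columns): each entry of the result is the Pearson correlation between a pair of columns.

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 2, 3, 4],
                    'b': [2, 4, 6, 8],     # moves with a
                    'c': [4, 3, 2, 1]})    # moves against a
cm = toy.corr()
print(cm)  # 3x3 symmetric matrix with 1.0 on the diagonal
```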
In [72]:
crime_corr = df_ct.corr()
In [73]:
crime_corr
Out[73]:
Visualizing the correlation...
In [83]:
plt.matshow(crime_corr);
plt.yticks(range(len(crime_corr.columns)), crime_corr.columns);
plt.xticks(range(len(crime_corr.columns)), crime_corr.columns, rotation=90);
plt.colorbar();
In [132]:
df['Zone'] = df['beat']//100
df['Year'] = df.occur_dt.apply(lambda x: x.year)
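Integer division strips the last two digits of the beat code, leaving the zone; a sketch with hypothetical beat codes:

```python
import pandas as pd

beats = pd.Series([101, 213, 305, 612])  # made-up beat codes
zones = beats // 100                     # leading digit = zone
print(zones.tolist())  # -> [1, 2, 3, 6]
```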
In [202]:
df_cp = df.pivot_table(index="Zone",
                       columns="UC2 Literal",
                       values="offense_id",
                       aggfunc=lambda x: np.count_nonzero(~np.isnan(x)))
In [203]:
df_cp
Out[203]:
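The aggfunc lambda above counts the non-missing values in each group (equivalent to the built-in "count" aggregation); in isolation:

```python
import numpy as np

vals = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
n_valid = np.count_nonzero(~np.isnan(vals))  # True for each non-NaN entry
print(n_valid)  # -> 3
```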
In [218]:
df_cp = df.pivot_table(index=["Year","Zone"],
                       columns="UC2 Literal",
                       values="offense_id",
                       aggfunc=lambda x: np.count_nonzero(~np.isnan(x)))
In [219]:
df_cp
Out[219]:
In [160]:
# zip(...) must be wrapped in list() to be indexable in Python 3
df_cp = df_cp[np.logical_and([x >= 1 for x in list(zip(*df_cp.index.values))[1]],
                             [x <= 6 for x in list(zip(*df_cp.index.values))[1]])].fillna(0)
df_cp.head(20)
Out[160]:
In [224]:
# A MUCH PRETTIER way to do the same thing:
df_cp = df_cp.loc[(slice(None), slice(1,6)),:].fillna(0)
df_cp.head(20)
Out[224]:
In [223]:
## slicing on a multi-index
# get data for 2009-2010, for zones 1, 3, and 5 (the slice step of 2 skips every other zone)
df_cp.loc[(slice(2009,2010), slice(1,5,2)),:]
Out[223]:
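The (slice(...), slice(...)) idiom works on any sorted MultiIndex; a self-contained toy example with made-up counts:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([[2009, 2010, 2011], [1, 2, 3]],
                                 names=['Year', 'Zone'])
toy = pd.DataFrame({'count': range(9)}, index=idx)
# rows for years 2009-2010 and zones 1-2 (both slices are label-inclusive)
sub = toy.loc[(slice(2009, 2010), slice(1, 2)), :]
print(sub)  # 4 rows: (2009,1), (2009,2), (2010,1), (2010,2)
```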
In [ ]:
## In-class exercise:
# Show all robbery data for 2011, 2013, and 2015, for zones 4-6
In [226]:
df_cp.loc[(slice(2011,2015,2), slice(4,6)), "ROBBERY-COMMERCIAL":"ROBBERY-RESIDENCE"]
Out[226]:
In [240]:
df_cp.filter(like='ROBBERY').loc[(slice(2011,2015,2), slice(4,6)), :]
Out[240]:
In [242]:
## In-class exercise:
# Count the total number of crimes in each zone
df_cp.groupby(level=1).sum()
Out[242]:
In [251]:
help(df_cp.plot)
In [250]:
## In-class exercise:
# Plot the number of pedestrian robberies in each zone in 2016
df_cp.loc[(slice(2016,2016), slice(None)), "ROBBERY-PEDESTRIAN"].plot.bar()
plt.xticks(range(6), range(1,7), rotation=0);
plt.xlabel("Zone");
plt.ylabel("Ped. Robberies in 2016");
In [ ]:
## In-class exercise:
# What is the average annual number of crimes in each zone (for each type of crime)?
# Hint: use "groupby" with a "level" argument.
In [254]:
df_cp.groupby(level=1).mean()
Out[254]:
In [ ]:
##### Shapefile stuff ########
In [196]:
import sys
try:
    from osgeo import ogr, osr, gdal
except ImportError:
    sys.exit('ERROR: cannot find GDAL/OGR modules')
In [ ]: