The Atlanta Police Department provides Part 1 crime data at http://www.atlantapd.org/i-want-to/crime-data-downloads
A recent copy of the data file is stored in the cluster. Please, do not copy this data file into your home directory!
In [1]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
In [2]:
# load data set
df = pd.read_csv('/home/data/APD/COBRA-YTD2017.csv.gz')
print "Shape of table: ", df.shape
Let's look at the structure of this table. We're actually creating some text output that can be used to create a data dictionary.
In [3]:
dataDict = pd.DataFrame({'DataType': df.dtypes.values, 'Description': '', }, index=df.columns.values)
In [4]:
dataDict
Out[4]:
We need to enter the descriptions for each entry in our dictionary manually...
In [5]:
dataDict.loc['MI_PRINX'].Description = '' # type: int64
dataDict.loc['offense_id'].Description = 'Unique ID in the format YYDDDNNNN with the year YY, the day of the year DDD and a counter NNNN' # type: int64
dataDict.loc['rpt_date'].Description = 'Date the crime was reported' # type: object
dataDict.loc['occur_date'].Description = 'Estimated date when the crime occured' # type: object
dataDict.loc['occur_time'].Description = 'Estimated time when the crime occured' # type: object
dataDict.loc['poss_date'].Description = '' # type: object
dataDict.loc['poss_time'].Description = '' # type: object
dataDict.loc['beat'].Description = '' # type: int64
dataDict.loc['apt_office_prefix'].Description = '' # type: object
dataDict.loc['apt_office_num'].Description = '' # type: object
dataDict.loc['location'].Description = '' # type: object
dataDict.loc['MinOfucr'].Description = '' # type: int64
dataDict.loc['MinOfibr_code'].Description = '' # type: object
dataDict.loc['dispo_code'].Description = '' # type: object
dataDict.loc['MaxOfnum_victims'].Description = '' # type: float64
dataDict.loc['Shift'].Description = 'Zones have 8 or 10 hour shifts' # type: object
dataDict.loc['Avg Day'].Description = '' # type: object
dataDict.loc['loc_type'].Description = '' # type: float64
dataDict.loc['UC2 Literal'].Description = '' # type: object
dataDict.loc['neighborhood'].Description = '' # type: object
dataDict.loc['npu'].Description = '' # type: object
dataDict.loc['x'].Description = '' # type: float64
dataDict.loc['y'].Description = '' # type: float64
dataDict.to_csv("COBRA_Data_Dictionary.csv")
In [6]:
dataDict
Out[6]:
Let's take a look at what we've got...
In [ ]:
dataDict
What is "object?"
In [7]:
print df.groupby("Shift").count().index
This shows that "shift" is a unicode string...
datetime
to deal with converting parsing and formatting dates and times. See https://docs.python.org/2/library/datetime.htmlpandas
package provides functionality to convert text fields into date/time fields...given the values adhere to a given format. See http://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.to_datetime.html
In [8]:
# function currying
def fixdatetime(fld):
def _fix(s):
date_col = '%s_date' % fld # "rpt_date"
time_col = '%s_time' % fld # "rpt_time"
if time_col in s.index:
return str(s[date_col])+' '+str(s[time_col])
else:
return str(s[date_col])+' 00:00:00'
return _fix
df.apply(fixdatetime('rpt'), axis=1)[:10]
# df.apply(_fix, axis=1)
Out[8]:
Please refer to the following resources for working with time series data in pandas:
In [9]:
for col in ['rpt', 'occur', 'poss']:
datser = df.apply(fixdatetime(col), axis=1)
df['%s_dt'%col] = pd.to_datetime(datser, format="%m/%d/%Y %H:%M:%S", errors='coerce')
In [10]:
df.head()
Out[10]:
In [11]:
def fun1(x):
def _inner1(y):
print x, " + ", y, " = ", x+y
return _inner1
In [12]:
showname = fun1('name:')
In [13]:
type(showname)
Out[13]:
In [14]:
showname('Bob')
In [15]:
addthree = fun1(3)
In [16]:
addthree(5)
What's the date range of the data?
In [17]:
print df.occur_dt.min(), '---', df.occur_dt.max()
Actual years from which we have data...
In [18]:
sorted(df.occur_dt.map(lambda d: d.year).unique())
Out[18]:
This is an odd assortment of years... let's look at number of crimes reported each year
In [21]:
# resample is like "groupby" for time
df.resample('A-DEC', closed='right', on='occur_dt').offense_id.count()
# df['Year'] = df.occur_dt.map(lambda d: d.year)
# df2 = df[(df.Year>=2010) & (df.Year<=2017)]
# df2.shape, df.shape
Out[21]:
Looks like most of the data is actually from 2017! Let's throw the rest away...
In [20]:
df = df[df.occur_dt>='01/01/2017']
Types of crimes in the database:
In [22]:
print "\n".join(sorted(df['UC2 Literal'].unique()))
In [23]:
df['oct'] = pd.to_timedelta(df.occur_time) # estimated time of ocurence in timedelta format
In [24]:
# Let's take a closer look at a few selected examples of crime...
vehicle_larceny = df[df['UC2 Literal']=='LARCENY-FROM VEHICLE'].resample('1H', closed='right', on='oct').offense_id.count()
pedestrian_robbery = df[df['UC2 Literal']=='ROBBERY-PEDESTRIAN'].resample('1H', closed='right', on='oct').offense_id.count()
residence_burglary = df[df['UC2 Literal']=='BURGLARY-RESIDENCE'].resample('1H', closed='right', on='oct').offense_id.count()
assault = df[df['UC2 Literal']=='AGG ASSAULT'].resample('1H', closed='right', on='oct').offense_id.count()
In [25]:
fig = plt.figure(figsize=(8,4));
plt.plot_date(vehicle_larceny.index, vehicle_larceny.values);
plt.title('Vehilce Larceny by Time of Day');
plt.xlabel('Time (24h format)');
plt.ylabel('Number of incidents');
In [26]:
fig = plt.figure(figsize=(8,4));
plt.plot_date(pedestrian_robbery.index, pedestrian_robbery.values);
plt.title('Pedestrian Robbery by Time of Day');
plt.xlabel('Time (24h format)');
plt.ylabel('Number of incidents');
In [27]:
fig = plt.figure(figsize=(8,4));
plt.plot_date(residence_burglary.index, residence_burglary.values);
plt.title('Residence Burglary by Time of Day');
plt.xlabel('Time (24h format)');
plt.ylabel('Number of incidents');
In [28]:
fig = plt.figure(figsize=(8,4));
plt.plot_date(assault.index, assault.values);
plt.title('Aggrivated Assault by Time of Day');
plt.xlabel('Time (24h format)');
plt.ylabel('Number of incidents');
In [29]:
# Let's take a closer look at a few selected examples of crime...
seasonal_vehicle_larceny = df[df['UC2 Literal']=='LARCENY-FROM VEHICLE'].resample('M', closed='right', on='occur_dt').offense_id.count()
seasonal_pedestrian_robbery = df[df['UC2 Literal']=='ROBBERY-PEDESTRIAN'].resample('M', closed='right', on='occur_dt').offense_id.count()
seasonal_residence_burglary = df[df['UC2 Literal']=='BURGLARY-RESIDENCE'].resample('M', closed='right', on='occur_dt').offense_id.count()
seasonal_assault = df[df['UC2 Literal']=='AGG ASSAULT'].resample('M', closed='right', on='occur_dt').offense_id.count()
seasonal_rape = df[df['UC2 Literal']=='RAPE'].resample('M', closed='right', on='occur_dt').offense_id.count()
In [30]:
fig = plt.figure(figsize=(8,4));
plt.plot_date(seasonal_vehicle_larceny.index, seasonal_vehicle_larceny.values);
plt.title('Vehilce Larceny by Time of Year');
plt.xlabel('Time (24h format)');
plt.ylabel('Number of incidents');
In [31]:
fig = plt.figure(figsize=(8,4));
plt.plot_date(seasonal_pedestrian_robbery.index, seasonal_pedestrian_robbery.values);
plt.title('Pedestrian Robbery by Time of Year');
plt.xlabel('Time (24h format)');
plt.ylabel('Number of incidents');
In [32]:
fig = plt.figure(figsize=(8,4));
plt.plot_date(seasonal_residence_burglary.index, seasonal_residence_burglary.values);
plt.title('Residence Burglary by Time of Year');
plt.xlabel('Time (24h format)');
plt.ylabel('Number of incidents');
In [33]:
fig = plt.figure(figsize=(8,4));
plt.plot_date(seasonal_assault.index, seasonal_assault.values);
plt.title('Assault by Time of Year');
plt.xlabel('Time (24h format)');
plt.ylabel('Number of incidents');
In [34]:
fig = plt.figure(figsize=(8,4));
plt.plot_date(seasonal_rape.index, seasonal_rape.values);
plt.title('Rape by Time of Year');
plt.xlabel('Time (24h format)');
plt.ylabel('Number of incidents');
https://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics
In [35]:
# let's create a colormap for the different colors...
crime_col = dict(zip(sorted(df['UC2 Literal'].unique()),
['red','olive','blue','black','black','royalblue','slategrey','purple','purple','darkblue','coral']))
In [36]:
crime_col
Out[36]:
In [37]:
violent = df[df['UC2 Literal'].apply(lambda x: x in set(['AGG ASSAULT','HOMICIDE','RAPE']))]
plt.scatter(violent.x, violent.y, s=7, c=violent['UC2 Literal'].apply(lambda x: crime_col[x]), edgecolors='none')
Out[37]:
In [ ]:
df[df['UC2 Literal']=='ROBBERY-RESIDENCE']
In [43]:
violent = df[df['UC2 Literal'].apply(lambda x: x in ['ROBBERY-COMMERCIAL', 'ROBBERY-PEDESTRIAN', 'ROBBERY-RESIDENCE'])]
plt.scatter(violent.x, violent.y, s=10, c=violent['UC2 Literal'].apply(lambda x: crime_col[x]), edgecolors='none', marker='^')
Out[43]:
In [ ]:
In the beat ID, the hundreds place indicates the zone number...
Let's create a separate column for zone.
In [45]:
df['Zone'] = df['beat']//100
In [46]:
pd.DataFrame(df.groupby(['Zone','UC2 Literal']).offense_id.count())
Out[46]: