Do crimes have patterns? Is a specific type of crime more likely to occur on certain days of the week? What is the distribution of a specific crime in a certain district?
To answer these questions, I will analyze the NYC crime reports for 2015. More specifically, I will focus on six types of felonies: murder, rape, robbery, assault, larceny, and burglary. The outputs of this project include graphs illustrating the frequency of each crime and its distribution across districts, a regression analysis of temperature and crime counts, and so on.
I downloaded the incident-level crime report from the NYPD's website.
It is a spreadsheet that is already relatively well organized; I just need to extract the information I want and reformat it using pandas.
In [1]:
# import packages
import pandas as pd # data management
import matplotlib.pyplot as plt # graphics
In [2]:
# Load the spreadsheet; read_excel already returns a DataFrame
file = pd.read_excel('Felony.xlsx')
file.head(5)
Out[2]:
In [3]:
file.shape
Out[3]:
In [4]:
# I only need the occurrence dates, offense types, and boroughs of the crimes
felony = pd.read_excel('Felony.xlsx', sheetname='Felony.csv', usecols=[2,3,4,5,6,11,15])
# Reset the column names
col_names = ['Date','Dayofweek','Month','Day','Year','Offense','Borough']
felony.columns = col_names
# I also only need crimes that took place in 2015
felony = felony[felony.Year == 2015]
'''Note: sometimes dates of the crimes could be wrong due to human errors,
but the corresponding month/day/year should be correct'''
# Check
felony.head(5)
Out[4]:
In [5]:
felony.dtypes
Out[5]:
In [6]:
felony.shape
Out[6]:
I also need to reformat the date for further analysis.
In [7]:
date_list = []
for d in felony['Date']:
    d = str(d)             # convert the timestamp to a string
    d = d.rsplit(' ')[0]   # keep only the date, drop the time portion
    date_list.append(d)
felony = felony.drop('Date', 1)
In [8]:
felony['Date'] = date_list
felony.head()
Out[8]:
In [9]:
# Define a function to count the number of crimes of a given type
def count_crimes(tp):
    count = 0
    for c in felony.Offense:
        if c == tp:
            count += 1
    return count
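As an aside, pandas can produce the same counts in a single line with value_counts; a minimal sketch of that alternative (the cells below keep using count_crimes):

# Count every offense type at once; equivalent to calling count_crimes per type
felony['Offense'].value_counts()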
In [10]:
# Create two dictionaries to store the statistics
violent_crimes = {'Murder':count_crimes('MURDER'),
'Rape':count_crimes('RAPE'),
'Assault':count_crimes('FELONY ASSAULT')}
property_crimes = {'Robbery':count_crimes('ROBBERY'),
'Burglary':count_crimes('BURGLARY'),
'Larceny':count_crimes('GRAND LARCENY'),
'Larceny MV':count_crimes('GRAND LARCENY OF MOTOR VEHICLE')}
In [11]:
# Turn the dictionaries into pandas.DataFrame
violent_stats = pd.DataFrame.from_dict(violent_crimes, orient='index')
violent_stats.columns=['Count']
property_stats = pd.DataFrame.from_dict(property_crimes, orient='index')
property_stats.columns=['Count']
In [12]:
# Check
violent_stats
Out[12]:
In [13]:
# Check
property_stats
Out[13]:
In [14]:
%matplotlib inline
plt.style.use('ggplot')
fig, ax = plt.subplots()
violent_stats.plot(ax=ax, kind='barh', color='b')
Out[14]:
In [15]:
%matplotlib inline
plt.style.use('ggplot')
fig, ax = plt.subplots()
property_stats.plot(ax=ax, kind='barh', color='g')
Out[15]:
We have counted the number of each type of crime, but these graphs are still too general.
In [16]:
# Split different types of crimes
Murder = pd.DataFrame(felony[felony.Offense == 'MURDER'])
Rape = pd.DataFrame(felony[felony.Offense == 'RAPE'])
Assault = pd.DataFrame(felony[felony.Offense == 'FELONY ASSAULT'])
Robbery = pd.DataFrame(felony[felony.Offense == 'ROBBERY'])
Burglary = pd.DataFrame(felony[felony.Offense == 'BURGLARY'])
Larceny = pd.DataFrame(felony[felony.Offense == 'GRAND LARCENY'])
Larceny_M = pd.DataFrame(felony[felony.Offense == 'GRAND LARCENY OF MOTOR VEHICLE'])
In [17]:
# Define a function to count one type of crime in a given borough
def borough_crimes(tp, borough):
    count = 0
    for b in tp.Borough:
        if b == borough:
            count += 1
    return count
In [18]:
# Group the statistics by boroughs
Queens = {'Murder':borough_crimes(Murder, 'Queens'),
'Robbery':borough_crimes(Robbery, 'Queens'),
'Rape':borough_crimes(Rape, 'Queens'),
'Assault':borough_crimes(Assault, 'Queens'),
'Larceny':borough_crimes(Larceny, 'Queens'),
'Larceny MV':borough_crimes(Larceny_M, 'Queens'),
'Burglary':borough_crimes(Burglary, 'Queens')}
Manhattan = {'Murder':borough_crimes(Murder, 'Manhattan'),
'Robbery':borough_crimes(Robbery, 'Manhattan'),
'Rape':borough_crimes(Rape, 'Manhattan'),
'Assault':borough_crimes(Assault, 'Manhattan'),
'Larceny':borough_crimes(Larceny, 'Manhattan'),
'Larceny MV':borough_crimes(Larceny_M, 'Manhattan'),
'Burglary':borough_crimes(Burglary, 'Manhattan')}
Bronx = {'Murder':borough_crimes(Murder, 'Bronx'),
'Robbery':borough_crimes(Robbery, 'Bronx'),
'Rape':borough_crimes(Rape, 'Bronx'),
'Assault':borough_crimes(Assault, 'Bronx'),
'Larceny':borough_crimes(Larceny, 'Bronx'),
'Larceny MV':borough_crimes(Larceny_M, 'Bronx'),
'Burglary':borough_crimes(Burglary, 'Bronx')}
Brooklyn = {'Murder':borough_crimes(Murder, 'Brooklyn'),
'Robbery':borough_crimes(Robbery, 'Brooklyn'),
'Rape':borough_crimes(Rape, 'Brooklyn'),
'Assault':borough_crimes(Assault, 'Brooklyn'),
'Larceny':borough_crimes(Larceny, 'Brooklyn'),
'Larceny MV':borough_crimes(Larceny_M, 'Brooklyn'),
'Burglary':borough_crimes(Burglary, 'Brooklyn')}
SI = {'Murder':borough_crimes(Murder, 'Staten Island'),
'Robbery':borough_crimes(Robbery, 'Staten Island'),
'Rape':borough_crimes(Rape, 'Staten Island'),
'Assault':borough_crimes(Assault, 'Staten Island'),
'Larceny':borough_crimes(Larceny, 'Staten Island'),
'Larceny MV':borough_crimes(Larceny_M, 'Staten Island'),
'Burglary':borough_crimes(Burglary, 'Staten Island')}
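The same borough-by-offense table can also be built in one step with pd.crosstab; here is a short sketch of that alternative (the notebook continues with the dictionaries defined above):

# Cross-tabulate offense types against boroughs in a single call
pd.crosstab(felony['Offense'], felony['Borough'])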
In [19]:
# Convert dictionaries to DataFrames
s1 = pd.DataFrame.from_dict(Queens, orient='index')
s2 = pd.DataFrame.from_dict(Manhattan, orient='index')
s3 = pd.DataFrame.from_dict(Bronx, orient='index')
s4 = pd.DataFrame.from_dict(Brooklyn, orient='index')
s5 = pd.DataFrame.from_dict(SI, orient='index')
In [20]:
# Merge the DataFrames
stats = pd.concat([s1,s2,s3,s4,s5], axis=1)
stats.columns = ['Queens','Manhattan','Bronx','Brooklyn','StatenIsland']
stats
Out[20]:
In [21]:
%matplotlib inline
fig, ax = plt.subplots()
stats.plot(ax=ax, kind='bar')
ax.set_title('A BAD EXAMPLE')
Out[21]:
There is too much information packed into a single graph.
In particular, murders and rapes occur so much less often than the other felonies that they are basically invisible in the graph above.
We want to break down the graph by crime and/or by borough.
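Before the pie charts below, here is a quick sketch of one way to split the combined bar chart into one panel per borough, using pandas' built-in subplot support (illustrative only, not the approach taken next):

# One bar chart per borough, so each panel gets a readable scale of its own
stats.plot(kind='bar', subplots=True, layout=(5, 1), figsize=(6, 14),
           legend=False, sharex=True)
plt.tight_layout()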
In [22]:
# Import advanced graphing packages
import numpy as np # foundation for Pandas
import seaborn.apionly as sns # fancy matplotlib graphics (no styling)
from plotly.offline import iplot, iplot_mpl # plotting functions
import plotly.graph_objs as go # ditto
import plotly # just to print version and init notebook
import cufflinks as cf # gives us df.iplot that feels like df.plot
cf.set_config_file(offline=True, offline_show_link=False)
In [23]:
%matplotlib inline
plt.style.use('fivethirtyeight')
fig, ax = plt.subplots()
stats.T.Murder.plot(ax=ax, kind='pie', autopct='%.2f')
ax.set_title('Distribution of murders by boroughs (%)')
ax.set_axis_off()
In [24]:
%matplotlib inline
plt.style.use('fivethirtyeight')
fig, ax = plt.subplots()
stats.T.Rape.plot(ax=ax, kind='pie', autopct='%.2f')
ax.set_title('Distribution of rapes by boroughs (%)')
ax.set_axis_off()
In [25]:
%matplotlib inline
plt.style.use('fivethirtyeight')
fig, ax = plt.subplots()
stats.T.Assault.plot(ax=ax, kind='pie', autopct='%.2f')
ax.set_title('Distribution of assaults by boroughs (%)')
ax.set_axis_off()
Looking at the three pie charts shown above, we can conclude that violent crimes take place most frequently in Brooklyn, followed by the Bronx, Queens, and Manhattan. There are relatively few violent crimes in Staten Island, which is probably due to its small population.
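Since the boroughs differ a lot in population, a per-capita view would make the comparison fairer. Below is a rough sketch using approximate 2015 population figures; the numbers are ballpark estimates for illustration only, not something taken from the crime data:

# Approximate 2015 borough populations in millions (rough estimates; use
# Census figures for a real analysis)
population = pd.Series({'Queens': 2.3, 'Manhattan': 1.6, 'Bronx': 1.5,
                        'Brooklyn': 2.6, 'StatenIsland': 0.5})
# Crimes per million residents, aligned on the borough columns of `stats`
stats.div(population, axis=1).round(0)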
In [26]:
# interactive graphs
layout = dict(width=500, height=500, # plot width/height
yaxis={"title": "Number of crimes"}, # yaxis label
title="Distribution of larcenies by boroughs", # title
xaxis={"title": "Boroughs"} # xaxis label
)
stats.T.Larceny.iplot(kind='bar', layout=layout)
In [27]:
layout = dict(width=500, height=500, # plot width/height
yaxis={"title": "Number of crimes"}, # yaxis label
title="Distribution of robberies by boroughs", # title
xaxis={"title": "Boroughs"} # xaxis label
)
stats.T.Robbery.iplot(kind='bar', layout=layout)
In [28]:
layout = dict(width=500, height=500, # plot width/height
yaxis={"title": "Number of crimes"}, # yaxis label
title="Distribution of burglaries by boroughs", # title
xaxis={"title": "Boroughs"} # xaxis label
)
stats.T.Burglary.iplot(kind='bar', layout=layout)
Regarding property crimes, Manhattan leads in grand larcenies. Intuitively this makes sense: it is easier to steal something in a crowded place, and potential victims in Manhattan are relatively wealthier.
That said, Brooklyn is still the place where robberies and burglaries take place most frequently.
In [29]:
# Define a function to count the total number of crimes in a given month
def count_crimes_month(m):
    count = 0
    for c in felony.Month:
        if c == m:
            count += 1
    return count
In [30]:
# The value of each element in the dictionary is a list composing of two elements
# The second element is going to help me get the rows in DataFrame sorted
monthly_stats = {'Jan':[count_crimes_month('Jan'),1],
'Feb':[count_crimes_month('Feb'),2],
'Mar':[count_crimes_month('Mar'),3],
'Apr':[count_crimes_month('Apr'),4],
'May':[count_crimes_month('May'),5],
'Jun':[count_crimes_month('Jun'),6],
'Jul':[count_crimes_month('Jul'),7],
'Aug':[count_crimes_month('Aug'),8],
'Sep':[count_crimes_month('Sep'),9]}
m_stats = pd.DataFrame.from_dict(monthly_stats, orient='index')
m_stats.columns = ['Count','Month']
# Check m_stats; Note that they are not sorted
m_stats
Out[30]:
In [31]:
# Sort m_stats by month
m_stats = m_stats.sort_values(by='Month', ascending=1)
# ... and then drop the month column
m_stats = m_stats.drop('Month', 1)
# Check again
m_stats
Out[31]:
In [32]:
layout = dict(width=500, height=500, # plot width/height
yaxis={"title": "Number of crimes"}, # yaxis label
title="Frequency", # title
xaxis={"title": "Month"} # xaxis label
)
m_stats.iplot(kind='bar', layout=layout)
The number of crimes peaks around July and August (the data set runs through September), while the fewest crimes took place in February.
The data seem to confirm a correlation found in many studies: heat is positively associated with higher crime rates. See the New York Times article: Weather and Violence.
In the next section, I will explore the correlation between temperature and the number of crimes.
The NYC weather data monitored at Central Park location was downloaded from National Centers for Environmental Information.
In [33]:
# Read the data
weather = pd.read_excel('Weather_NY2015.xls', sheetname='927869', usecols=[2,3,5,7,8])
weather.head()
Out[33]:
In [34]:
# Create a new column to store the average temperature
weather['TAVG'] = (weather['TMAX'] + weather['TMIN']) / 2
weather['Date'] = pd.to_datetime(weather['DATE'],format='%Y%m%d', errors='ignore')
weather.head()
Out[34]:
I would like to graph both the weather data and the crime data.
However, they cannot share the same x-axis, because the crime counts are monthly while the temperature readings are daily.
For now, I will look at the two graphs separately.
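One way to put the two series on a common axis would be to average the daily temperatures by month; here is a minimal pandas sketch of that idea (not part of the workflow below, which instead switches to daily crime counts):

# Average temperature per calendar month; assumes the Date column was parsed
# to proper datetimes in the cell above
weather.set_index('Date')['TAVG'].resample('M').mean()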
In [35]:
# Crimes trace
crimes = dict(type="bar",
name="Number of crimes",
x=m_stats.index,
y=m_stats['Count'],
marker={"color": "Grey"}
)
# Weather trace
tavg = dict(type="scatter",
name="Average temperature",
x=weather['Date'],
y=weather['TAVG'],
marker={"color": "Blue"}
)
# Plot the 2 graphs separately
layout = dict(width=600, height=500)
iplot(go.Figure(data=[tavg], layout=layout))
iplot(go.Figure(data=[crimes], layout=layout))
Next, I attempted to count daily crimes and plot weather and crime data on the same graph.
In [36]:
# Create a daily crimes stats DataFrame
daily_crimes = pd.DataFrame()
daily_crimes['Count'] = felony['Date'].value_counts()
The daily crime DataFrame is not sorted, but it should not matter when I plot it.
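For reference, the frame can be viewed in chronological order if needed; the ISO-style date strings sort correctly even before being converted to datetimes:

# Peek at the daily counts in date order (does not modify daily_crimes)
daily_crimes.sort_index().head()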
In [37]:
# Daily crimes trace
d_crimes = dict(type="bar",
name="Number of crimes",
x=daily_crimes.index,
y=daily_crimes['Count'],
marker={"color": "Pink"},
)
# Temperature trace (adjusted)
tavg = dict(type="scatter",
name="Average temperature (adjusted)",
x=weather['Date'],
y=weather['TAVG']*3,
marker={"color": "Grey"}
)
# Plot on the same graph
layout = dict(width=950, height=800)
iplot(go.Figure(data=[tavg, d_crimes], layout=layout))
The shapes of the two curves are quite similar: both dip in February and March and then keep increasing until they peak in July and August.
It would be even more interesting to run a regression analysis on them.
To create a jointplot with seaborn, I need to put the weather data and the crime data into one DataFrame.
In [38]:
# I need a separate column for Date in the format of pd.datetime
# This part is tedious
daily_crimes['Date'] = daily_crimes.index.tolist()
daily_crimes = daily_crimes.drop(daily_crimes.index[-1])
daily_crimes['Date'] = pd.to_datetime(daily_crimes['Date'],format='%Y-%m-%d')
daily_crimes.head()
Out[38]:
In [39]:
# Reset index
daily_crimes = daily_crimes.reset_index()
daily_crimes.head()
Out[39]:
In [40]:
# Create a new sub DataFrame of weather
w1 = weather['TAVG']
w2 = weather['Date']
w = pd.concat([w1,w2], axis=1)
w = w.set_index('Date')
In [41]:
# Create a new sub DataFrame of daily crimes
c1 = daily_crimes['Date']
c2 = daily_crimes['Count']
c = pd.concat([c1,c2],axis=1)
c = c.set_index('Date')
In [42]:
# Check the shape of each sub data set to make sure they can be put into one DataFrame
# If the output is True, we are good to go
w.shape[0] == c.shape[0]
Out[42]:
In [43]:
# Merge the two DataFrames
merged = pd.concat([w, c], axis=1)
merged.head()
Out[43]:
In [44]:
import numpy as np
import seaborn as sns
In [45]:
sns.set(style="dark", color_codes=True)
g = sns.jointplot('TAVG', 'Count', data=merged, kind="reg",
                  xlim=(0, 100), ylim=(100, 400), color="purple", size=8)
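To attach a number to the relationship the jointplot displays, a simple linear fit can be run on the merged DataFrame; this is a minimal sketch using scipy.stats.linregress (seaborn computes its own fit internally, and the notebook itself stops at the plot):

from scipy.stats import linregress

# Fit Count ~ TAVG on the merged daily data, dropping rows with missing values
clean = merged.dropna()
slope, intercept, r, p, se = linregress(clean['TAVG'], clean['Count'])
print('slope: {:.2f} additional crimes per degree'.format(slope))
print('r-squared: {:.3f}, p-value: {:.3g}'.format(r ** 2, p))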
Again, the correlation is confirmed.
The correlation is interesting, but correlation does NOT mean causation. At least in this case, I don't think high temperature directly causes a higher volume of crime.
High temperature might affect people's moods or other factors, which in turn lead to violent behavior. There are many possible explanations out there.
That said, the correlation is useful for economists, as temperature can be used as an instrumental variable to study how crime rates fluctuate. Here's my favorite paper on this topic, written by Brian Jacob, Lars Lefgren, and Enrico Moretti: The Dynamics of Criminal Behavior: Evidence from Weather Shocks.