Final Project: Analysis on 2015 NYC Crime Report

Sophie Li - ywl242

Project summary:

Do crimes have patterns? Is a specific type of crime more likely to occur on certain days of the week? What is the distribution of a specific crime in a certain borough?

To answer these questions, I will analyze the 2015 NYC crime report. More specifically, I will focus on six types of felonies: murder, rape, robbery, assault, larceny, and burglary. The potential results of this project include graphs illustrating the frequency of each crime and the distribution of crimes across boroughs, a regression analysis of temperature and crime counts, etc.

Data

I downloaded the incident-level crime report from NYPD's website.

It is a spreadsheet that is already relatively well organized; I just need to extract the information I want and reformat it using pandas.

Import packages and read files


In [1]:
# import packages 
import pandas as pd                   # data management
import matplotlib.pyplot as plt       # graphics

In [2]:
file = pd.read_excel('Felony.xlsx')
file = pd.DataFrame(file)
file.head(5)


Out[2]:
OBJECTID Identifier OccurrenceDate DayofWeek OccurrenceMonth OccurrenceDay OccurrenceYear OccurrenceHour CompStatMonth CompStatDay ... Offense OffenseClassification Sector Precinct Borough Jurisdiction XCoordinate YCoordinate Latitude Longitude
0 1 987726459 1982-09-21 23:20:00 Tuesday Sep 21 1982 2300 Apr 9 ... MURDER FELONY NaN 79 Brooklyn NYPD 999634 190253 40.688872 -73.944529
1 2 987726401 1991-08-08 22:15:00 Thursday Aug 8 1991 2200 Mar 10 ... MURDER FELONY A 30 Manhattan NYPD 996322 238267 40.820663 -73.956385
2 3 987726875 2014-03-17 12:00:00 Monday Mar 17 2014 1200 Jun 26 ... MURDER FELONY NaN 67 Brooklyn NYPD 999132 175384 40.648061 -73.946372
3 4 987725414 2014-07-30 21:41:00 Wednesday Jul 30 2014 2100 Jan 28 ... MURDER FELONY H 120 Staten Island NYPD 959784 171202 40.636561 -74.088153
4 5 987726422 2014-08-08 23:46:00 Friday Aug 8 2014 2300 Mar 13 ... MURDER FELONY D 41 Bronx NYPD 1013232 236725 40.816392 -73.895296

5 rows × 21 columns


In [3]:
file.shape


Out[3]:
(76755, 21)

Cleaning

The original file contains more information than I need, and some of it is not useful.


In [4]:
# I only need the occurrence times, offense types, and boroughs of the crimes
felony = pd.read_excel('Felony.xlsx', sheetname='Felony.csv', usecols=[2,3,4,5,6,11,15])

# Reset the column names
col_names = ['Date','Dayofweek','Month','Day','Year','Offense','Borough']
felony.columns = col_names

# I also only need crimes that took place in 2015
felony = felony[felony.Year == 2015]
'''Note: the occurrence dates can sometimes be wrong due to human error,
but the corresponding month/day/year columns should be correct'''

# Check
felony.head(5)


Out[4]:
Date Dayofweek Month Day Year Offense Borough
10 2015-01-01 02:51:00 Thursday Jan 1 2015 MURDER Bronx
11 2015-01-01 07:23:00 Thursday Jan 1 2015 MURDER Bronx
12 2015-01-02 02:33:00 Friday Jan 2 2015 MURDER Queens
13 2015-01-02 19:28:00 Friday Jan 2 2015 MURDER Manhattan
14 2015-01-03 19:38:00 Saturday Jan 3 2015 MURDER Brooklyn

In [5]:
felony.dtypes


Out[5]:
Date         object
Dayofweek    object
Month        object
Day           int64
Year          int64
Offense      object
Borough      object
dtype: object

In [6]:
felony.shape


Out[6]:
(74098, 7)

I also need to reformat the date for further analysis.


In [7]:
# Keep only the date part (YYYY-MM-DD) of each timestamp, dropping the time
date_list = []
for d in felony['Date']:
    d = str(d)
    d = d.rsplit(' ')
    d = d[0]
    date_list.append(d)
felony = felony.drop('Date',1)

In [8]:
felony['Date'] = date_list
felony.head()


Out[8]:
Dayofweek Month Day Year Offense Borough Date
10 Thursday Jan 1 2015 MURDER Bronx 2015-01-01
11 Thursday Jan 1 2015 MURDER Bronx 2015-01-01
12 Friday Jan 2 2015 MURDER Queens 2015-01-02
13 Friday Jan 2 2015 MURDER Manhattan 2015-01-02
14 Saturday Jan 3 2015 MURDER Brooklyn 2015-01-03
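The loop above works, but the same reformatting can be done in a vectorized way. A minimal sketch (not run here), assuming the Date column still holds the original occurrence timestamps:

# Hedged, vectorized alternative to the loop above:
# keep only the calendar date of each timestamp in one step
felony['Date'] = pd.to_datetime(felony['Date']).dt.strftime('%Y-%m-%d')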

Summarize the data part I: violent and property crimes

The original file is an incident-level record of crimes.

However, I want summary statistics on violent and property crimes, so I need to create new DataFrames.


In [9]:
# Define a function to count the number of crimes of a given type
def count_crimes(tp):
    count = 0
    for c in felony.Offense:
        if c == tp:
            count += 1
    return count
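As an aside, pandas can produce the same counts without an explicit loop. A minimal sketch using value_counts (an alternative, not the approach used in the rest of this notebook):

offense_counts = felony['Offense'].value_counts()   # number of incidents per offense type
offense_counts['MURDER']                            # e.g. the number of murders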

In [10]:
# Create two dictionaries to store the statistics
violent_crimes = {'Murder':count_crimes('MURDER'),
                  'Rape':count_crimes('RAPE'),
                  'Assault':count_crimes('FELONY ASSAULT')}
property_crimes = {'Robbery':count_crimes('ROBBERY'),
        'Burglary':count_crimes('BURGLARY'),
        'Larceny':count_crimes('GRAND LARCENY'),
        'Larceny MV':count_crimes('GRAND LARCENY OF MOTOR VEHICLE')}

In [11]:
# Turn the dictionaries into pandas.DataFrame
violent_stats = pd.DataFrame.from_dict(violent_crimes, orient='index')
violent_stats.columns=['Count']
property_stats = pd.DataFrame.from_dict(property_crimes, orient='index')
property_stats.columns=['Count']

In [12]:
# Check
violent_stats


Out[12]:
Count
Rape 845
Murder 247
Assault 14947

In [13]:
# Check
property_stats


Out[13]:
Count
Burglary 10776
Robbery 12192
Larceny 29654
Larceny MV 5437

In [14]:
%matplotlib inline
plt.style.use('ggplot')
fig, ax = plt.subplots()
violent_stats.plot(ax=ax, kind='barh', color='b')


Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x119d3c940>

In [15]:
%matplotlib inline
plt.style.use('ggplot')
fig, ax = plt.subplots()
property_stats.plot(ax=ax, kind='barh', color='g')


Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a983400>

We have counted the number of each type of crime, but the graphs are still too general.

Summarize the data part II: breaking down by types and locations of crimes


In [16]:
# Split different types of crimes
Murder = pd.DataFrame(felony[felony.Offense == 'MURDER'])
Rape = pd.DataFrame(felony[felony.Offense == 'RAPE'])
Assault = pd.DataFrame(felony[felony.Offense == 'FELONY ASSAULT'])
Robbery = pd.DataFrame(felony[felony.Offense == 'ROBBERY'])
Burglary = pd.DataFrame(felony[felony.Offense == 'BURGLARY'])
Larceny = pd.DataFrame(felony[felony.Offense == 'GRAND LARCENY'])
Larceny_M = pd.DataFrame(felony[felony.Offense == 'GRAND LARCENY OF MOTOR VEHICLE'])

In [17]:
# Define a function to count a type of crime in one borough
def borough_crimes(tp, borough):
    count = 0
    for b in tp.Borough:
        if b == borough:
            count += 1
    return count

In [18]:
# Group the statistics by boroughs
Queens = {'Murder':borough_crimes(Murder, 'Queens'),
         'Robbery':borough_crimes(Robbery, 'Queens'),
         'Rape':borough_crimes(Rape, 'Queens'),
         'Assault':borough_crimes(Assault, 'Queens'),
         'Larceny':borough_crimes(Larceny, 'Queens'),
         'Larceny MV':borough_crimes(Larceny_M, 'Queens'),
         'Burglary':borough_crimes(Burglary, 'Queens')}
Manhattan = {'Murder':borough_crimes(Murder, 'Manhattan'),
         'Robbery':borough_crimes(Robbery, 'Manhattan'),
         'Rape':borough_crimes(Rape, 'Manhattan'),
         'Assault':borough_crimes(Assault, 'Manhattan'),
         'Larceny':borough_crimes(Larceny, 'Manhattan'),
         'Larceny MV':borough_crimes(Larceny_M, 'Manhattan'),
         'Burglary':borough_crimes(Burglary, 'Manhattan')}
Bronx = {'Murder':borough_crimes(Murder, 'Bronx'),
         'Robbery':borough_crimes(Robbery, 'Bronx'),
         'Rape':borough_crimes(Rape, 'Bronx'),
         'Assault':borough_crimes(Assault, 'Bronx'),
         'Larceny':borough_crimes(Larceny, 'Bronx'),
         'Larceny MV':borough_crimes(Larceny_M, 'Bronx'),
         'Burglary':borough_crimes(Burglary, 'Bronx')}
Brooklyn = {'Murder':borough_crimes(Murder, 'Brooklyn'),
         'Robbery':borough_crimes(Robbery, 'Brooklyn'),
         'Rape':borough_crimes(Rape, 'Brooklyn'),
         'Assault':borough_crimes(Assault, 'Brooklyn'),
         'Larceny':borough_crimes(Larceny, 'Brooklyn'),
         'Larceny MV':borough_crimes(Larceny_M, 'Brooklyn'),
         'Burglary':borough_crimes(Burglary, 'Brooklyn')}
SI = {'Murder':borough_crimes(Murder, 'Staten Island'),
         'Robbery':borough_crimes(Robbery, 'Staten Island'),
         'Rape':borough_crimes(Rape, 'Staten Island'),
         'Assault':borough_crimes(Assault, 'Staten Island'),
         'Larceny':borough_crimes(Larceny, 'Staten Island'),
         'Larceny MV':borough_crimes(Larceny_M, 'Staten Island'),
         'Burglary':borough_crimes(Burglary, 'Staten Island')}
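Writing out these dictionaries by hand is verbose. A more concise alternative (a sketch, not used below) is pd.crosstab, which cross-tabulates offenses against boroughs in a single call:

stats_alt = pd.crosstab(felony['Offense'], felony['Borough'])   # rows: offense types, columns: boroughs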

In [19]:
# Convert dictionaries to DataFrames
s1 = pd.DataFrame.from_dict(Queens, orient='index')
s2 = pd.DataFrame.from_dict(Manhattan, orient='index')
s3 = pd.DataFrame.from_dict(Bronx, orient='index')
s4 = pd.DataFrame.from_dict(Brooklyn, orient='index')
s5 = pd.DataFrame.from_dict(SI, orient='index')

In [20]:
# Merge the DataFrames
stats = pd.concat([s1,s2,s3,s4,s5], axis=1)
stats.columns = ['Queens','Manhattan','Bronx','Brooklyn','StatenIsland']
stats


Out[20]:
Queens Manhattan Bronx Brooklyn StatenIsland
Burglary 2583 1918 1921 3949 405
Larceny MV 1630 624 1209 1820 154
Murder 47 30 66 97 7
Larceny 5441 11033 4230 8083 866
Robbery 2398 2270 3107 4073 343
Rape 166 184 201 267 27
Assault 2968 2570 3940 4740 512

Visualization: distribution of crimes across boroughs

Now that we have an organized DataFrame, we can create graphs to visualize the distributions of crimes.

Note that it would be very messy if we plotted everything in one graph; see the bad example below:


In [21]:
%matplotlib inline
fig, ax = plt.subplots()
stats.plot(ax=ax, kind='bar')
ax.set_title('A BAD EXAMPLE')


Out[21]:
<matplotlib.text.Text at 0x11c69c0f0>

There is too much information in just one graph.

In this case we would underestimate the frequencies of murders and rapes, which are basically invisible in the graph above.

We want to break down the graph by crime and/or by borough.


In [22]:
# Import advanced graphing packages
import numpy as np                     # foundation for Pandas
import seaborn.apionly as sns          # fancy matplotlib graphics (no styling)
from plotly.offline import iplot, iplot_mpl  # plotting functions
import plotly.graph_objs as go               # ditto
import plotly                                # just to print version and init notebook
import cufflinks as cf                       # gives us df.iplot that feels like df.plot
cf.set_config_file(offline=True, offline_show_link=False)



In [23]:
%matplotlib inline
plt.style.use('fivethirtyeight')
fig, ax = plt.subplots()
stats.T.Murder.plot(ax=ax, kind='pie', autopct='%.2f')
ax.set_title('Distribution of murders by boroughs (%)')
ax.set_axis_off()



In [24]:
%matplotlib inline
plt.style.use('fivethirtyeight')
fig, ax = plt.subplots()
stats.T.Rape.plot(ax=ax, kind='pie', autopct='%.2f')
ax.set_title('Distribution of rapes by boroughs (%)')
ax.set_axis_off()



In [25]:
%matplotlib inline
plt.style.use('fivethirtyeight')
fig, ax = plt.subplots()
stats.T.Assault.plot(ax=ax, kind='pie', autopct='%.2f')
ax.set_title('Distribution of assaults by boroughs (%)')
ax.set_axis_off()
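The three pie-chart cells above could also be collapsed into a single loop over the violent-crime rows of stats. A minimal sketch:

# Hedged sketch: draw all three pies in one figure instead of one cell per crime
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, crime in zip(axes, ['Murder', 'Rape', 'Assault']):
    stats.loc[crime].plot(ax=ax, kind='pie', autopct='%.2f')   # one pie per crime type
    ax.set_title('Distribution of {} by boroughs (%)'.format(crime.lower()))
    ax.set_axis_off()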


Looking at the three pie charts above, we can conclude that violent crimes occur most frequently in Brooklyn, followed by the Bronx, Queens, and Manhattan. Relatively few violent crimes occur in Staten Island, which is probably due to its smaller population.


In [26]:
# interactive graphs
layout = dict(width=500, height=500,                        # plot width/height
              yaxis={"title": "Number of crimes"},                    # yaxis label
              title="Distribution of larcenies by boroughs",            # title
              xaxis={"title": "Boroughs"}  # xaxis label
             )
stats.T.Larceny.iplot(kind='bar', layout=layout)



In [27]:
layout = dict(width=500, height=500,                        # plot width/height
              yaxis={"title": "Number of crimes"},                    # yaxis label
              title="Distribution of robberies by boroughs",            # title
              xaxis={"title": "Boroughs"}  # xaxis label
             )
stats.T.Robbery.iplot(kind='bar', layout=layout)



In [28]:
layout = dict(width=500, height=500,                        # plot width/height
              yaxis={"title": "Number of crimes"},                    # yaxis label
              title="Distribution of burglaries by boroughs",            # title
              xaxis={"title": "Boroughs"}  # xaxis label
             )
stats.T.Burglary.iplot(kind='bar', layout=layout)


Regarding property crimes, grand larcenies are concentrated in Manhattan. Intuitively this makes sense: it is easier to steal something in a crowded place, and the potential victims in Manhattan are relatively wealthy.

That said, Brooklyn is still the place where robberies and burglaries take place most frequently.

Summarize the data part III: timing of the crimes

First, we want to compare the number of crimes in each month (the data set covers January through September 2015).


In [29]:
# Define a function to count the total number of crimes in a given month
def count_crimes_month(m):
    count = 0
    for c in felony.Month:
        if c == m:
            count += 1
    return count
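A hedged shortcut: value_counts plus reindex yields the same monthly counts, already in calendar order, without the helper lists used below:

month_order = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep']
m_stats_alt = felony['Month'].value_counts().reindex(month_order).to_frame('Count')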

In [30]:
# Each dictionary value is a list of two elements;
# the second element will help me sort the rows of the DataFrame
monthly_stats = {'Jan':[count_crimes_month('Jan'),1],
                'Feb':[count_crimes_month('Feb'),2],
                'Mar':[count_crimes_month('Mar'),3],
                'Apr':[count_crimes_month('Apr'),4],
                'May':[count_crimes_month('May'),5],
                'Jun':[count_crimes_month('Jun'),6],
                'Jul':[count_crimes_month('Jul'),7],
                'Aug':[count_crimes_month('Aug'),8],
                'Sep':[count_crimes_month('Sep'),9]}
m_stats = pd.DataFrame.from_dict(monthly_stats, orient='index')
m_stats.columns = ['Count','Month']

# Check m_stats; Note that they are not sorted
m_stats


Out[30]:
Count Month
Jul 9473 7
Jun 8436 6
Apr 7891 4
May 8826 5
Sep 8264 9
Aug 9677 8
Mar 7426 3
Feb 6408 2
Jan 7697 1

In [31]:
# Sort m_stats by month
m_stats = m_stats.sort_values(by='Month', ascending=1)

# ... and then drop the month column
m_stats = m_stats.drop('Month', 1)

# Check again
m_stats


Out[31]:
Count
Jan 7697
Feb 6408
Mar 7426
Apr 7891
May 8826
Jun 8436
Jul 9473
Aug 9677
Sep 8264

In [32]:
layout = dict(width=500, height=500,                        # plot width/height
              yaxis={"title": "Number of crimes"},                    # yaxis label
              title="Frequency",            # title
              xaxis={"title": "Month"}  # xaxis label
             )
m_stats.iplot(kind='bar', layout=layout)


The number of crimes peaks in July and August, while the fewest crimes took place in February.

The data seem to confirm a striking correlation documented in many studies: heat is associated with higher crime rates. See this New York Times article: Weather and Violence

In the next section, I will explore the correlation between temperature and number of crimes.

Incorporating the weather data

The NYC weather data, recorded at the Central Park station, was downloaded from the National Centers for Environmental Information.

Cleaning


In [33]:
# Read the data
weather = pd.read_excel('Weather_NY2015.xls', sheetname='927869', usecols=[2,3,5,7,8])
weather.head()


Out[33]:
DATE PRCP SNOW TMAX TMIN
0 20150101 0.00 0.0 39 27
1 20150102 0.00 0.0 42 35
2 20150103 0.71 0.0 42 33
3 20150104 0.30 0.0 56 41
4 20150105 0.00 0.0 49 21

In [34]:
# Create a new column to store the average temperature
weather['TAVG'] = (weather['TMAX'] + weather['TMIN']) / 2
weather['Date'] = pd.to_datetime(weather['DATE'],format='%Y%m%d', errors='ignore')
weather.head()


Out[34]:
DATE PRCP SNOW TMAX TMIN TAVG Date
0 20150101 0.00 0.0 39 27 33.0 2015-01-01
1 20150102 0.00 0.0 42 35 38.5 2015-01-02
2 20150103 0.71 0.0 42 33 37.5 2015-01-03
3 20150104 0.30 0.0 56 41 48.5 2015-01-04
4 20150105 0.00 0.0 49 21 35.0 2015-01-05

Visualization: how do crime counts fluctuate as temperature rises and falls?

I would like to graph both the weather and crime data.

However, they cannot share the same x-axis, since the crime counts are monthly while the temperatures are daily.

That said, I still would like to see the two graphs, even if they are plotted separately.
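One hedged workaround (sketched here but not used below) is to average the daily temperatures by month, so both series become monthly and can share an x-axis. This assumes the weather file covers the same nine months (Jan-Sep 2015) as m_stats:

# Aggregate daily temperatures to monthly means (assumes weather spans Jan-Sep 2015)
monthly_tavg = weather.set_index('Date')['TAVG'].resample('M').mean()
monthly_tavg.index = m_stats.index                      # relabel month-ends as 'Jan'...'Sep'
monthly = pd.concat([m_stats['Count'], monthly_tavg.rename('TAVG')], axis=1)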


In [35]:
# Crimes trace
crimes = dict(type="bar",                                 
           name="Number of crimes",
           x=m_stats.index,
           y=m_stats['Count'],
           marker={"color": "Grey"}
          )

# Weather trace
tavg = dict(type="scatter",                                 
           name="Average temperature",
           x=weather['Date'],
           y=weather['TAVG'],
           marker={"color": "Blue"}
          )

# Plot the two graphs separately
layout = dict(width=600, height=500)
iplot(go.Figure(data=[tavg], layout=layout))
iplot(go.Figure(data=[crimes], layout=layout))


Next, I attempted to count daily crimes and plot weather and crime data on the same graph.


In [36]:
# Create a daily crimes stats DataFrame
daily_crimes = pd.DataFrame()
daily_crimes['Count'] = felony['Date'].value_counts()

The daily crime DataFrame is not sorted by date, but that should not matter when I plot it.


In [37]:
# Daily crimes trace
d_crimes = dict(type="bar",                                 
           name="Number of crimes",
           x=daily_crimes.index,
           y=daily_crimes['Count'],
           marker={"color": "Pink"},
          )

# Temperature trace (scaled up by 3 so it lies in the same range as the daily crime counts)
tavg = dict(type="scatter",                                 
           name="Average temperature (adjusted)",
           x=weather['Date'],
           y=weather['TAVG']*3,
           marker={"color": "Grey"}
          )

# Plot on the same graph
layout = dict(width=950, height=800)
iplot(go.Figure(data=[tavg, d_crimes], layout=layout))
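Multiplying the temperature by three is a quick hack to put the two series on a comparable scale. A hedged alternative would be a matplotlib figure with twin y-axes, so neither series needs rescaling; a sketch assuming daily_crimes and weather as built above:

# Twin y-axes: crimes on the left axis, unscaled temperature on the right
dc = daily_crimes.copy()
dc.index = pd.to_datetime(dc.index, errors='coerce')      # the value_counts index holds date strings
dc = dc[dc.index.notnull()].sort_index()

fig, ax1 = plt.subplots(figsize=(12, 6))
ax1.bar(dc.index, dc['Count'], color='pink', width=1.0)   # daily crime counts (left axis)
ax1.set_ylabel('Number of crimes')

ax2 = ax1.twinx()                                         # second y-axis sharing the same x-axis
ax2.plot(weather['Date'], weather['TAVG'], color='grey')  # daily average temperature (right axis)
ax2.set_ylabel('Average temperature')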


The shapes of the two series are quite similar: both dip in February and March, then rise steadily until they peak in July and August.

It would be more interesting to run a regression analysis on them.

Regression analysis

To create a jointplot using seaborn, I need to put weather data and crime data into one DataFrame.


In [38]:
# I need a separate Date column in pandas datetime format
# This part is tedious
daily_crimes['Date'] = daily_crimes.index.tolist()
daily_crimes = daily_crimes.drop(daily_crimes.index[-1])   # drop one extra row so the length matches the weather data (checked below)
daily_crimes['Date'] = pd.to_datetime(daily_crimes['Date'],format='%Y-%m-%d')
daily_crimes.head()


Out[38]:
Count Date
2015-07-31 364 2015-07-31
2015-08-14 364 2015-08-14
2015-01-01 358 2015-01-01
2015-08-01 353 2015-08-01
2015-09-19 352 2015-09-19

In [39]:
# Reset index
daily_crimes = daily_crimes.reset_index()
daily_crimes.head()


Out[39]:
index Count Date
0 2015-07-31 364 2015-07-31
1 2015-08-14 364 2015-08-14
2 2015-01-01 358 2015-01-01
3 2015-08-01 353 2015-08-01
4 2015-09-19 352 2015-09-19

In [40]:
# Create a new sub DataFrame of weather
w1 = weather['TAVG']
w2 = weather['Date']
w = pd.concat([w1,w2], axis=1)
w = w.set_index('Date')

In [41]:
# Create a new sub DataFrame of daily crimes
c1 = daily_crimes['Date']
c2 = daily_crimes['Count']
c = pd.concat([c1,c2],axis=1)
c = c.set_index('Date')

In [42]:
# Check the shape of each sub data set to make sure they can be put into one DataFrame
# If the output is True, we are good to go
w.shape[0] == c.shape[0]


Out[42]:
True

In [43]:
# Merge the two DataFrames
merged = pd.concat([w, c], axis=1)
merged.head()


Out[43]:
TAVG Count
Date
2015-01-01 33.0 358
2015-01-02 38.5 280
2015-01-03 37.5 261
2015-01-04 48.5 202
2015-01-05 35.0 266
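For reference, the same merged table could be built in one step by merging on the Date columns directly; a hedged sketch:

merged_alt = pd.merge(weather[['Date', 'TAVG']],
                      daily_crimes[['Date', 'Count']],
                      on='Date', how='inner').set_index('Date')   # same result as the concat above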

In [44]:
import numpy as np
import seaborn as sns

In [45]:
sns.set(style="dark", color_codes=True)
g = sns.jointplot('TAVG','Count', data=merged, kind="reg", 
                  xlim=(0,100), ylim=(100,400),color="purple", size=8)


Again, the positive correlation between temperature and crime counts shows up.
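To put a rough number on it, one could compute the correlation coefficient and a simple linear fit on the merged table (a hedged sketch, not part of the original analysis):

r = merged['TAVG'].corr(merged['Count'])                            # Pearson correlation
slope, intercept = np.polyfit(merged['TAVG'], merged['Count'], 1)   # simple OLS line
print('Pearson r: {:.2f}, OLS slope: {:.2f} crimes per degree'.format(r, slope))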

Conclusion

The correlation is interesting, but correlation does NOT mean causation. At least in this case I don't think high temperatures directly cause a higher volume of crime.

High temperatures might affect people's moods or other factors that in turn lead to violent behavior; there are many possible explanations.

That said, the correlation is useful for economists, since temperature can be used as an instrumental variable to study how crime rates fluctuate. Here is my favorite paper on this topic, by Brian Jacob, Lars Lefgren, and Enrico Moretti: The Dynamics of Criminal Behavior: Evidence from Weather Shocks.