Crime In Detroit (2009 to 2015)

Course: Data Bootcamp
Name: Gillian Berberich
Project Date: May 4th, 2016
New York University
Leonard N. Stern School of Business

This project examines crime data in Detroit, Michigan from January 1, 2009 to December 31, 2015 using Detroit Open Data. The Detroit Open Data website hosts several datasets of interest. The main crime file has over 1 million records and is updated frequently. Other datasets used contain locational data for schools, police stations, and fire stations within Detroit city limits.

Research Questions:

  • Has crime increased, decreased, or stayed the same over the last seven years?
  • Is the level of crime consistent across different times of day? Does this vary by specific offense?
  • Is the crime pattern over the last seven years consistent across all neighborhoods?
  • How are crime locations distributed relative to the locations of institutions (police stations, fire stations, schools)?

1.0 Importing Packages

For this project, I use several packages available in Python. For reproducibility, it is useful to record which versions I am using as well as the date these scripts were run.


In [1]:
import pandas as pd             # data package
import matplotlib.pyplot as plt # graphics
import numpy as np              # numerical computing
import sys                      # system module, used to get Python version
import geopy as geo             # geographic tools, used for distance calculations
import seaborn as sns           # statistical graphics
import datetime as dt           # date tools, used to note current date

%matplotlib inline 

print('\nPython version: ', sys.version) 
print('Pandas version: ', pd.__version__)
print('Numpy version: ', np.__version__)
print('Geopy version: ', geo.__version__)
print('Seaborn version: ', sns.__version__)
print("Today's date:", dt.date.today())


Python version:  3.5.1 |Anaconda 4.0.0 (64-bit)| (default, Feb 16 2016, 09:49:46) [MSC v.1900 64 bit (AMD64)]
Pandas version:  0.18.0
Numpy version:  1.10.4
Geopy version:  1.11.0
Seaborn version:  0.7.0
Today's date: 2016-05-06

Reading Data

The datasets can be accessed live from Detroit Open Data in JSON format. Each dataset provides a link for accessing the data (shown as url, url1, url2, and url3 below). For one of the datasets, a link was not published, but I was able to construct one by finding the dataset's resource number.

Follow these instructions to access the data:

  • Open the web address https://data.detroitmi.gov/
  • Click “Public Safety”
  • Locate “DPD: All Crime Incidents, 2009 – Present (Provisional)” on the list.
  • Click the “API Docs” link at the bottom right of that block.
  • Scroll down to “Getting Started” and you will see a link that ends in .json
  • Copy the link; this is the address IPython will pull the data from.
  • Repeat the process for “DPD Stations” and “DFD Stations”.

Follow these instructions to access the Detroit Schools data:

  • Click the “Education” link on the left hand bar under “Categories”.
  • Locate “Detroit Schools” on the list.
  • Click on the dataset.
  • On the bar at the top right, click “Export”
  • Under “Download as”, right click on “JSON” and click “Copy Link Location”
  • Paste the link into a text document and copy the resource code portion (four characters, a hyphen, and four more characters).
  • Add this resource code to the end of https://data.detroitmi.gov/resource/
  • Finally, add .json to the end of it. This is the link that IPython will pull data from.

We can use Socrata's query parameters to pull only the data we need. This makes the requests much faster.
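For example, a request restricted to a single year can filter on the server side with a $where clause (a sketch; $where is standard SoQL, and the column names are those used throughout this notebook):

url_2015 = ('https://data.detroitmi.gov/resource/i9ph-uyrp.json?'
            '&$select=caseid,category,incidentdate'
            "&$where=incidentdate%20between%20'2015-01-01T00:00:00'%20and%20'2015-12-31T23:59:59'"
            '&$limit=200000')
# crime_2015 = pd.read_json(url_2015)    # uncomment to run; downloads one year of records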

1.1 Reported Crimes from Detroit Open Data

Downloading the data directly from Detroit Open Data.

Note: This code takes a very long time to run and may time out depending on your system's available memory. To run it, uncomment the lines by removing the leading # marks.


In [2]:
# url = ('https://data.detroitmi.gov/resource/i9ph-uyrp.json?'
#       '&$select=caseid,address,hour,incidentdate,lat,lon,neighborhood,category,offensedescription'
#       '&$limit=1200000')

# crime = pd.read_json(url)                           # reads in file from url in json format
# crime = crime.rename(columns={'caseid':'Case ID',   # renaming the columns
#                              'address':'Address',
#                              'hour':'Hour',
#                              'incidentdate':'Incident Date',
#                              'lat':'Latitude',
#                              'lon':'Longitude',
#                              'neighborhood':'Neighborhood',
#                              'category':'Category',
#                              'offensedescription':'Offense Description'})

# crime.head(2)


Out[2]:
Address Case ID Category Hour Incident Date Latitude Longitude Neighborhood Offense Description
0 11400 WHITCOMB 2062789 FRAUD 0 2016-01-01T00:00:00.000 42.3716 -83.1948 PLYMOUTH-HUBBELL FRAUD (OTHER)
1 00 E 7 MILE CARDONI 2037998 ROBBERY 0 2016-01-01T00:00:00.000 42.4326 -83.0916 STATE FAIR-NOLAN ROBBERY - STREET - GUN
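If the single bulk request times out, one workaround is to page through the records with Socrata's standard $limit and $offset parameters and concatenate the chunks. A sketch, with an assumed page size (the $order clause keeps the pages consistent between requests):

base = ('https://data.detroitmi.gov/resource/i9ph-uyrp.json?'
        '&$select=caseid,address,hour,incidentdate,lat,lon,neighborhood,category,offensedescription'
        '&$order=caseid')

pages = []
page_size = 50000                                     # assumed page size; tune to your connection
for offset in range(0, 1200000, page_size):
    chunk = pd.read_json(base + '&$limit=%d&$offset=%d' % (page_size, offset))
    if len(chunk) == 0:                               # stop once the server runs out of rows
        break
    pages.append(chunk)

# crime = pd.concat(pages, ignore_index=True)         # then rename the columns as above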

1.2 Reported Crimes from File

Instead, I will pull the file from my hard drive, which takes much less time and is more reliable.


In [3]:
crime = pd.read_csv('DPD__All_Crime_Incidents__2009_-_Present__Provisional.csv') # Read the file from my hard drive
print('Original Dimensions:', crime.shape)                                       # Provides dataset dimensions

crime = crime.rename(columns={'CASEID':'Case ID',                                # Renaming the columns
                              'ADDRESS':'Address',
                              'HOUR':'Hour',
                              'INCIDENTDATE':'Incident Date',
                              'LAT':'Latitude',
                              'LON':'Longitude',
                              'NEIGHBORHOOD':'Neighborhood',
                              'CATEGORY':'Category',
                              'OFFENSEDESCRIPTION':'Offense Description'})

# Uses only the columns we need to examine this dataset.
crime = crime[['Case ID','Longitude','Latitude','Address','Incident Date','Hour','Neighborhood','Category','Offense Description']].set_index('Case ID')

print('New Dimensions:', crime.shape)                                           # Dimensions for new dataset (fewer columns)
crime.head(2)


C:\Users\Jill\Anaconda\lib\site-packages\IPython\core\interactiveshell.py:2723: DtypeWarning: Columns (3) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
Original Dimensions: (1048575, 18)
New Dimensions: (1048575, 8)
Out[3]:
Longitude Latitude Address Incident Date Hour Neighborhood Category Offense Description
Case ID
1099487 -83.0649 42.4261 18000 WEXFORD 1/1/2009 0 CONANT GARDENS MISCELLANEOUS MISCELLANEOUS - GENERAL NON-CRIMINAL
1117507 999999.0001 999999.0000 00 UNKNOWN 1/1/2009 0 NaN MISCELLANEOUS MISCELLANEOUS - GENERAL NON-CRIMINAL
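Note the placeholder coordinates (999999.0001) in the second row above. Before doing any locational work, it is worth keeping only rows whose coordinates fall inside a plausible bounding box for Detroit. A sketch, with a rough box I chose for illustration:

in_detroit = (crime['Latitude'].between(42.2, 42.5) &          # rough latitude range for Detroit
              crime['Longitude'].between(-83.3, -82.9))        # rough longitude range for Detroit
crime_geo = crime[in_detroit]                                  # geocoded subset for locational analysis
print('Rows with usable coordinates:', len(crime_geo))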

1.3 Police Station Locations

The police station file is small enough that it can be downloaded directly from the website. This file contains locational information for all police stations in Detroit.


In [4]:
url1 = 'https://data.detroitmi.gov/resource/3n6r-g9kp.json'
police = pd.read_json(url1)
police = police.rename(columns={'address_1':'Address',         # Rename columns
                                'zip_code':'Zip Code',
                                'id':'ID'})

police.insert(1, 'Longitude', 0.0)                # Inserts a column in the dataframe for longitude
police.insert(2, 'Latitude', 0.0)                 # Inserts a column in the dataframe for latitude

for (i, ps) in police.iterrows():                 # Iteration for one row at a time of the dataframe
    curr_dict = ps['location']                    # Pulls out dictionary from cell containing coordinates
    coord = curr_dict['coordinates']              # Pulls out coordinates from dictionary

    police.set_value(i, 'Longitude', coord[0])    # Places the first value in the dictionary in Longitude
    police.set_value(i, 'Latitude', coord[1])     # Places the second value in the dictionary in Latitude

# Sets the index to ID and uses only columns we need
police = police[['ID','Longitude','Latitude','Address','Zip Code']].set_index('ID')

police.head(2)


Out[4]:
Longitude Latitude Address Zip Code
ID
1 -83.045241 42.326325 20 Atwater 48226
2 -83.179933 42.385553 13530 Lesure 48227
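The row-by-row loop above works, but the same extraction can be done without an explicit loop. A sketch, where raw stands for the frame as returned by pd.read_json(url1), before the column selection:

# Each 'location' cell is a GeoJSON-style dict whose 'coordinates' entry
# is a [longitude, latitude] pair
coords = raw['location'].apply(lambda d: pd.Series(d['coordinates'],
                                                   index=['Longitude', 'Latitude']))
raw['Longitude'] = coords['Longitude']
raw['Latitude'] = coords['Latitude']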

1.4 Fire Station Locations

The fire station file is also small enough that it can be downloaded directly from the website. This file contains locational information for all fire stations in Detroit.


In [5]:
url2 = 'https://data.detroitmi.gov/resource/hz79-58xh.json'
fire = pd.read_json(url2)
fire = fire.rename(columns={'station':'Station',               # Rename columns
                            'full_address_address':'Address',
                            'full_address_zip':'Zip Code'})

fire.insert(1, 'Longitude', 0.0)                 # Inserts a column in the dataframe for longitude
fire.insert(2, 'Latitude', 0.0)                  # Inserts a column in the dataframe for latitude

for (i, fs) in fire.iterrows():                  # Iteration for one row at a time of the dataframe
    curr_dict = fs['full_address']               # Pulls out dictionary from cell containing coordinates
    coord = curr_dict['coordinates']             # Pulls out coordinates from dictionary

    fire.set_value(i, 'Longitude', coord[0])     # Places the first value in the dictionary in Longitude
    fire.set_value(i, 'Latitude', coord[1])      # Places the second value in the dictionary in Latitude

# Sets the index to ID and uses only columns we need
fire = fire[['Station','Longitude','Latitude','Address','Zip Code']].set_index('Station')

fire.head(2)


Out[5]:
Longitude Latitude Address Zip Code
Station
E50 -82.985474 42.420406 12985 Houston Whittier St 48205
E42 -83.138740 42.366575 6324 W Chicago 48204

1.5 School Locations

The school file is also small enough that it can be downloaded directly from the website. This file contains locational information for all schools in Detroit.


In [6]:
url3 = 'https://data.detroitmi.gov/resource/8xpr-6ij9.json'
school = pd.read_json(url3)
school = school.rename(columns={'entityoffi':'School',        # Rename columns
                                'the_geom':'Location',
                                'entityphys':'Address',
                                'entityph_4':'Zip Code'})

school.insert(1, 'Longitude', 0.0)               # Inserts a column in the dataframe for longitude
school.insert(2, 'Latitude', 0.0)                # Inserts a column in the dataframe for latitude

for (i, s) in school.iterrows():                 # Iteration for one row at a time of the dataframe
    curr_dict = s['Location']                    # Pulls out dictionary from cell containing coordinates
    coord = curr_dict['coordinates']             # Pulls out coordinates from dictionary

    school.set_value(i, 'Longitude', coord[0])   # Places the first value in the dictionary in Longitude
    school.set_value(i, 'Latitude', coord[1])    # Places the second value in the dictionary in Latitude

# Sets the index to ID and uses only columns we need
school = school[['School', 'Longitude', 'Latitude', 'Address', 'Zip Code']].set_index('School')

school.head(2)


Out[6]:
Longitude Latitude Address Zip Code
School
Pulaski Elementary-Middle School -82.999392 42.441115 19725 STRASBURG ST 482051633
Sampson Academy -83.118454 42.353458 4700 TIREMAN ST 482044243
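Note that the school file stores ZIP+4 codes without a hyphen (e.g. 482051633), while the station files use 5-digit codes. If the ZIP codes are ever to be compared across files, the school codes can be truncated to the 5-digit prefix (a one-line sketch):

school['Zip Code'] = school['Zip Code'].astype(str).str[:5]    # keep the 5-digit prefix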

2.0 Overall Crime Examination

We will begin the analysis by examining the crime dataset as a whole. It is useful to examine the data by checking what the top offenses are (called 'category' in this dataset). It is also useful to first examine the trend of crime over time to see if it has increased, decreased, or stayed relatively constant.

It is not surprising that the largest category of crimes listed is the unclassified, miscellaneous one. It is also interesting to note that the number of reported crimes has gone down over time.


In [7]:
crime['Year'] = pd.DatetimeIndex(crime['Incident Date']).year     # Adding a column for the year of the crime

# Creating a plot of the top offenses
fig, ax = plt.subplots()                                          # Creating a figure
crime['Category'].value_counts().head(12).plot(ax=ax,             # Plotting the top 12 offenses by category (bar chart)
                                               kind='barh', 
                                               legend=False, 
                                               color='blue', 
                                               alpha=0.5,
                                               figsize=(12,6))

ax.set_xlabel('Number of Offenses', fontsize=14)                  # Label the x-axis
ax.set_title('Top 12 Offenses in Detroit',                        # Title the graph
             fontsize=16, fontweight='bold')
plt.show()

# Creating a plot of the crime trend throughout the years
fig, ax = plt.subplots()                                          # Creating a figure
crime['Year'].value_counts().sort_index().plot(ax=ax,             # Plotting the number of crimes per year (line graph);
                                  legend=False,                    # sort_index puts the years in chronological order
                                  color='blue',                    # so the line is drawn left to right
                                  alpha=0.5,
                                  figsize=(12,6))
ax.set_ylabel('Number of Offenses', fontsize=14)                  # Label the y-axis
ax.set_title('Number of Crimes By Year',                          # Title the graph
             fontsize=16, 
             fontweight='bold')
ax.ticklabel_format(useOffset=False)                              # Make sure the years show on the x-axis
plt.show()


We can also examine the number of reported crimes in each of the largest categories over time. To do this, we must first get a count of the number of crimes in each year. We notice immediately that the number of miscellaneous crimes skyrockets in 2012, while the number of murder/information crimes drops precipitously. As the two series seem to mirror each other, my guess is that in 2012 murder/information crimes were reclassified under miscellaneous.

Since these general categories are not of particular interest, we will omit them and graph the top 10 offenses without them. Again we see that the number of crimes for the most common offenses has been going down (with the exception of fraud).


In [8]:
# Creating the pivot chart
crime.insert(0, 'Count', 1)                                # Adds a column containing a 1 for each row

crimepivot = pd.pivot_table(crime,                         # Creates a pivot table
                            index=['Year','Category'],     # for the year and category of crime
                            aggfunc=np.sum)                # by the sum of the count column created above

crimepivot = crimepivot.unstack('Category')                # Unstacks category and moves it to the top
crimepivot = crimepivot['Count'].fillna(value=0)           # Selects only the count columns & fills in NaN values with 0
top12 = crimepivot[['MISCELLANEOUS','ASSAULT','LARCENY',   # Creates a 'top12', a dataFrame containing the top 12 offenses
                   'BURGLARY','DAMAGE TO PROPERTY',        # This dataframe is now a yearly count of each offense
                   'STOLEN VEHICLE','MURDER/INFORMATION',
                   'AGGRAVATED ASSAULT','TRAFFIC',
                   'ROBBERY','FRAUD','DANGEROUS DRUGS']]

# Creates the Top 12 graph
fig, ax = plt.subplots()                                   # Creates a figure
top12.plot(ax=ax,figsize=(12,6))                           # Plots the dataframe 'top12'
ax.legend(bbox_to_anchor=(1.3, 1))                         # Moves the legend off the graph
ax.set_ylabel('Number of Offenses', fontsize=14)           # Label the y-axis
ax.set_title('Top 12 Offenses Over Time',                  # Title the graph
             fontsize=16, fontweight='bold')
ax.ticklabel_format(useOffset=False)                       # Make sure the years show on the x-axis
plt.show()

# Creates the new Top 10 graph
top10 = top12[['ASSAULT','LARCENY','BURGLARY',             # Creates a new 'top10' dataFrame
               'DAMAGE TO PROPERTY','STOLEN VEHICLE',
               'AGGRAVATED ASSAULT','TRAFFIC',
               'ROBBERY','FRAUD','DANGEROUS DRUGS']]

fig, ax = plt.subplots()                                   # Creates a figure
top10.plot(ax=ax,figsize=(12,6))                           # Plots the dataframe
ax.legend(bbox_to_anchor=(1.3, 1))                         # Moves the legend off the graph
ax.set_ylabel('Number of Offenses', fontsize=14)           # Label the y-axis
ax.set_title('Top 10 Offenses Over Time',                  # Title the graph 
             fontsize=16, fontweight='bold')
ax.ticklabel_format(useOffset=False)                       # Make sure the years show on the x-axis
plt.show()
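As an aside, the helper Count column and pivot table are one way to get these counts; pandas can also produce the same year-by-category table in a single call. A sketch using pd.crosstab (which fills missing combinations with 0 by default):

counts = pd.crosstab(crime['Year'], crime['Category'])    # yearly count of each category
top12_alt = counts[top12.columns]                         # reuse the same 12 category names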


3.0 Crime Examination By Time of Day

We will continue the analysis by examining the crime dataset by time of day. The obvious question with this analysis is how the number of crimes changes with the time of day.

It appears that many crimes are documented as having happened around 12pm (there is a large spike at that hour). My guess is that these crimes were entered into the system with a default time of 12pm (perhaps when victims were unable to pinpoint the exact time of the crime), as suggested by I Quant NY. It also appears that crime dips to its lowest levels around 5-7am, and dips a bit again around dinner time (6-7pm).

It is also interesting to examine how certain offenses relate to time of day. For this analysis, I have chosen the crime of arson. In the plot, we see that the distributions for arsons involving private property, residences, and businesses center on the early morning, while 'Arson-Public Building' centers on the hours that a public building is open. This is perhaps because the time of a minor arson in a public building is often unknown, so the incident may not be reported until the building opens.


In [9]:
# Kernel Density Plot of Crime by Time of Day
sns.set()

fig, ax = plt.subplots()                               # Create a figure
fig.set_size_inches(12, 6)                             # Set figure size
sns.kdeplot(crime['Hour'], ax=ax)                      # Create a kernel density plot
ax.hist(crime['Hour'], bins=24, alpha=0.25,            # Create a histogram by hour with 24 bins
        normed=True, label='Hour')
fig.suptitle('Kernel Density with Histogram',          # Title of graph
             fontsize=16)
ax.set_xlabel('Hour of Day', fontsize = 14)            # Label x-axis
ax.legend()                                            # Add a legend
plt.show()

arson = crime['Category'] == 'ARSON'                   # Boolean mask selecting the arson cases

# Violin plot of arsons by hour of day
fig, ax = plt.subplots()                               # Create a figure
fig.set_size_inches(12, 6)                             # Set figure size
sns.violinplot(x='Offense Description', y='Hour',      # Creates a violin plot of arson cases by hour of day
               data=crime[arson])
plt.xticks(rotation=90)                                # Rotate the axis labels so they are legible
plt.show()
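To put a number on the midday spike rather than judging it from the plot, we can tabulate the share of records logged in each hour (a quick check, not part of the original analysis):

hour_share = crime['Hour'].value_counts(normalize=True).sort_index()
print(hour_share.loc[[0, 12]])                         # shares logged at midnight and at noon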


4.0 Crime Examination By Neighborhood

We will continue the analysis by examining differences in crime by neighborhood. Detroit is currently the 18th largest city in the United States and has a large geographical footprint of 138.8 square miles. I have chosen to compare two very different neighborhoods. The first is Green Acres, which borders the city of Ferndale to the north; Ferndale is considered trendy for its nightlife. The second is Denby, which mLive rates as the most dangerous because it contains the intersection of Kelly Rd. and Morang Ave. I was interested to see whether crime has lessened in all areas of Detroit or only in those situated near other popular areas of southeast Michigan.

We do see that instances of the top 10 crimes have remained flat over this period in Denby (dangerous), while many of the top 10 offenses have decreased in frequency in Green Acres (safe).


In [11]:
# Import data as before from two specific neighborhoods
url = ('https://data.detroitmi.gov/resource/i9ph-uyrp.json?'
       '&$select=caseid,address,hour,incidentdate,lat,lon,neighborhood,category,offensedescription'
       '&neighborhood=GREEN%20ACRES'
       '&$limit=100000')

greenacres = pd.read_json(url)                           # Reads in file from url in json format
greenacres = greenacres.rename(columns={'caseid':'Case ID',   # Renames the columns
                              'address':'Address',
                              'hour':'Hour',
                              'incidentdate':'Incident Date',
                              'lat':'Latitude',
                              'lon':'Longitude',
                              'neighborhood':'Neighborhood',
                              'category':'Category',
                              'offensedescription':'Offense Description'})

url = ('https://data.detroitmi.gov/resource/i9ph-uyrp.json?'
       '&$select=caseid,address,hour,incidentdate,lat,lon,neighborhood,category,offensedescription'
       '&neighborhood=DENBY'
       '&$limit=100000')

denby = pd.read_json(url)                           # Reads in file from url in json format
denby = denby.rename(columns={'caseid':'Case ID',   # Renames the columns
                              'address':'Address',
                              'hour':'Hour',
                              'incidentdate':'Incident Date',
                              'lat':'Latitude',
                              'lon':'Longitude',
                              'neighborhood':'Neighborhood',
                              'category':'Category',
                              'offensedescription':'Offense Description'})

# Add year information
greenacres['Year'] = pd.DatetimeIndex(greenacres['Incident Date']).year     # Adding a column for the year of the crime
denby['Year'] = pd.DatetimeIndex(denby['Incident Date']).year               # Adding a column for the year of the crime
greenacres.insert(0, 'Count', 1)                                            # Adds a column containing a 1 for each row
denby.insert(0, 'Count', 1)                                                 # Adds a column containing a 1 for each row
greenacres = greenacres[greenacres.Year != 2016]                            # Excludes data from 2016 (not a full year)
denby = denby[denby.Year != 2016]                                           # Excludes data from 2016 (not a full year)

# Create pivot tables for each dataset
greenacrespivot = pd.pivot_table(greenacres,                             # Creates a pivot table
                                index=['Year','Category'],               # for the year and category of crime
                                aggfunc=np.sum)                          # by the sum of the count column created above


greenacrespivot = greenacrespivot.unstack('Category')           # Unstacks category and moves it to the top
greenacrespivot = greenacrespivot['Count'].fillna(value=0)      # Selects only the count columns & fills in NaN values with 0
greenacrestop10 = greenacrespivot[['ASSAULT','LARCENY',         # Selects the top 10 Detroit crimes
                                   'BURGLARY',
                                   'DAMAGE TO PROPERTY',
                                   'STOLEN VEHICLE',
                                   'AGGRAVATED ASSAULT',
                                   'TRAFFIC',
                                   'ROBBERY','FRAUD',
                                   'DANGEROUS DRUGS']]


denbypivot = pd.pivot_table(denby,                                      # Creates a pivot table
                            index=['Year','Category'],                  # for the year and category of crime
                            aggfunc=np.sum)                             # by the sum of the count column created above

denbypivot = denbypivot.unstack('Category')                      # Unstacks category and moves it to the top
denbypivot = denbypivot['Count'].fillna(value=0)                 # Selects only the count columns & fills in NaN values with 0
denbytop10 = denbypivot[['ASSAULT','LARCENY','BURGLARY',         # Selects the top 10 Detroit crimes
                         'DAMAGE TO PROPERTY','STOLEN VEHICLE',
                         'AGGRAVATED ASSAULT','TRAFFIC',
                         'ROBBERY','FRAUD','DANGEROUS DRUGS']]

# Creates a Top 10 graph for Green Acres
fig, ax = plt.subplots()                                   # Creates a figure
greenacrestop10.plot(ax=ax,figsize=(12,6))                   # Plots the dataframe
ax.legend(bbox_to_anchor=(1.3, 1))                         # Moves the legend off the graph
ax.set_ylabel('Number of Offenses', fontsize=14)           # Label the y-axis
ax.set_title('Top 10 Offenses Over Time (Green Acres)',       # Title the graph 
             fontsize=16, fontweight='bold')
ax.ticklabel_format(useOffset=False)                       # Make sure the years show on the x-axis
plt.show()

# Creates a Top 10 graph for Denby
fig, ax = plt.subplots()                                   # Creates a figure
denbytop10.plot(ax=ax,figsize=(12,6))                      # Plots the dataframe
ax.legend(bbox_to_anchor=(1.3, 1))                         # Moves the legend off the graph
ax.set_ylabel('Number of Offenses', fontsize=14)           # Label the y-axis
ax.set_title('Top 10 Offenses Over Time (Denby)',          # Title the graph 
             fontsize=16, fontweight='bold')
ax.ticklabel_format(useOffset=False)                       # Make sure the years show on the x-axis
plt.show()
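The two downloads above differ only in the neighborhood filter. A small helper (a sketch reusing the same endpoint and column renames) would remove the duplication:

RENAME = {'caseid': 'Case ID', 'address': 'Address', 'hour': 'Hour',
          'incidentdate': 'Incident Date', 'lat': 'Latitude', 'lon': 'Longitude',
          'neighborhood': 'Neighborhood', 'category': 'Category',
          'offensedescription': 'Offense Description'}

def fetch_neighborhood(name, limit=100000):
    """Download one neighborhood's incidents and apply the standard renames."""
    url = ('https://data.detroitmi.gov/resource/i9ph-uyrp.json?'
           '&$select=caseid,address,hour,incidentdate,lat,lon,neighborhood,category,offensedescription'
           '&neighborhood=' + name.replace(' ', '%20') +
           '&$limit=' + str(limit))
    return pd.read_json(url).rename(columns=RENAME)

greenacres = fetch_neighborhood('GREEN ACRES')
denby = fetch_neighborhood('DENBY')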


5.0 Crime Examination By Proximity to Institutions

We will conclude the analysis with a locational analysis of the arson subset, examining how far each type of arson occurs from different institutions (police stations, fire stations, schools). I have chosen arson because it typically involves buildings.

My hypothesis was that the distribution of crime locations relative to institutions such as police stations, fire stations, and schools would not be normal and would depend on the type of crime. We see that the proximity distributions for police stations and fire stations have long right tails, so they are not normal. For schools, the distribution favors neither a right nor a left tail. One reason may be that there are many more schools than other institutions, which shortens the distance to the nearest school. Crime locations may not be affected by closeness to a school in the way they appear to be affected by closeness to a police station or fire station.


In [12]:
# Violin plots of the closeness of different types of arson offenses
# to police stations, fire stations, and schools

url = ('https://data.detroitmi.gov/resource/i9ph-uyrp.json?'
       '&$select=caseid,address,hour,incidentdate,lat,lon,neighborhood,category,offensedescription'
       '&category=ARSON'
       '&$limit=10000')

arson = pd.read_json(url)                           # Reads in file from url in json format
arson = arson.rename(columns={'caseid':'Case ID',   # Renames the columns
                              'address':'Address',
                              'hour':'Hour',
                              'incidentdate':'Incident Date',
                              'lat':'Latitude',
                              'lon':'Longitude',
                              'neighborhood':'Neighborhood',
                              'category':'Category',
                              'offensedescription':'Offense Description'})

arson.insert(0, 'ClosestPolice', 0.0)                        # Inserts a column with float values for closest police station
arson.insert(1, 'ClosestFire', 0.0)                          # Inserts a column with float values for closest fire station
arson.insert(2, 'ClosestSchool', 0.0)                        # Inserts a column with float values for closest school

from geopy.distance import vincenty                          # Imports a package for calculating distance from coordinates

# This code goes through each row and compares the location of the crime to
# each of the police stations and finds the smallest distance between the two.
for (ic, c) in arson.iterrows():                             # Iterates through the rows (crimes) of the arson dataFrame
    dist = 100000000.0                                       # Starts with a distance of a large number (float)
    coord_crime = (c['Latitude'], c['Longitude'])            # Crime coordinates as (latitude, longitude), the order geopy expects
    for (ips, ps) in police.iterrows():                      # Iterates through the rows (locations) of the police dataFrame
        coord_ps = (ps['Latitude'], ps['Longitude'])         # Police station coordinates as (latitude, longitude)
        currdist = vincenty(coord_crime, coord_ps).miles     # Calculates the distance between a crime and a police station
        if currdist < dist:                                  # If the distance between the two is smaller than previous
            dist = currdist                                  # stations, the new distance will be used
    arson.set_value(ic, 'ClosestPolice', dist)               # Once done with the police stations, the cell is filled with
                                                             # the distance (in miles) of the closest police station

# This code goes through each row and compares the location of the crime to
# each of the fire stations and finds the smallest distance between the two.        
for (ic, c) in arson.iterrows():                             # Iterates through the rows (crimes) of the arson dataFrame
    dist = 100000000.0                                       # Starts with a distance of a large number (float)
    coord_crime = (c['Latitude'], c['Longitude'])            # Crime coordinates as (latitude, longitude), the order geopy expects
    for (ifs, fs) in fire.iterrows():                        # Iterates through the rows (locations) of the fire dataFrame
        coord_fs = (fs['Latitude'], fs['Longitude'])         # Fire station coordinates as (latitude, longitude)
        currdist = vincenty(coord_crime, coord_fs).miles     # Calculates the distance between a crime and a fire station
        if currdist < dist:                                  # If the distance between the two is smaller than previous
            dist = currdist                                  # stations, the new distance will be used
    arson.set_value(ic, 'ClosestFire', dist)                 # Once done with the fire stations, the cell is filled with
                                                             # the distance (in miles) of the closest fire station
    
# This code goes through each row and compares the location of the crime to
# each of the schools and finds the smallest distance between the two. 
for (ic, c) in arson.iterrows():                             # Iterates through the rows (crimes) of the arson dataFrame
    dist = 100000000.0                                       # Starts with a distance of a large number (float)
    coord_crime = (c['Latitude'], c['Longitude'])            # Crime coordinates as (latitude, longitude), the order geopy expects
    for (isch, sch) in school.iterrows():                    # Iterates through the rows (locations) of the school dataFrame
        coord_sch = (sch['Latitude'], sch['Longitude'])      # School coordinates as (latitude, longitude)
        currdist = vincenty(coord_crime, coord_sch).miles    # Calculates the distance between a crime and a school
        if currdist < dist:                                  # If the distance between the two is smaller than previous
            dist = currdist                                  # stations, the new distance will be used
    arson.set_value(ic, 'ClosestSchool', dist)               # Once done with the schools, the cell is filled with
                                                             # the distance (in miles) of the closest school

# Filters out bad data    
arson = arson[arson.ClosestPolice < 5]                       # Filters out distances more than 5 miles (errors)
arson = arson[arson.ClosestPolice > 0]                       # Filters out distances of zero (errors)
arson = arson[arson.ClosestFire < 5]                         # Filters out distances more than 5 miles (errors)
arson = arson[arson.ClosestFire > 0]                         # Filters out distances of zero (errors)
arson = arson[arson.ClosestSchool < 5]                       # Filters out distances more than 5 miles (errors)
arson = arson[arson.ClosestSchool > 0]                       # Filters out distances of zero (errors)

# Creates a violin plot of type of arson vs. closest police station
fig, ax = plt.subplots()                                     # Creates a figure
fig.set_size_inches(12, 6)                                   # Sizes the figure
sns.violinplot(x='Offense Description', y='ClosestPolice',   # Creates a violinplot of type of offense versus distance
               data=arson)                                   # between the crime and the closest police station
plt.xticks(rotation=90)                                      # Rotates the labels so they are readable
plt.show()

# Creates a violin plot of type of arson vs. closest fire station
fig, ax = plt.subplots()                                     # Creates a figure
fig.set_size_inches(12, 6)                                   # Sizes the figure
sns.violinplot(x='Offense Description', y='ClosestFire',     # Creates a violinplot of type of offense versus distance
               data=arson)                                   # between the crime and the closest fire station
plt.xticks(rotation=90)                                      # Rotates the labels so they are readable
plt.show()

# Creates a violin plot of type of arson vs. closest school
fig, ax = plt.subplots()                                     # Creates a figure
fig.set_size_inches(12, 6)                                   # Sizes the figure
sns.violinplot(x='Offense Description', y='ClosestSchool',   # Creates a violinplot of type of offense versus distance
               data=arson)                                   # between the crime and the closest school
plt.xticks(rotation=90)                                      # Rotates the labels so they are readable
plt.show()
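The three search loops above are identical except for the institution table; a single helper would avoid the repetition. A sketch (brute force is acceptable here because each institution table has at most a few hundred rows):

def nearest_miles(row, places):
    """Miles from one crime to the closest row of places."""
    crime_pt = (row['Latitude'], row['Longitude'])            # geopy expects (lat, lon)
    return min(vincenty(crime_pt, (p['Latitude'], p['Longitude'])).miles
               for _, p in places.iterrows())

arson['ClosestPolice'] = arson.apply(nearest_miles, axis=1, args=(police,))
arson['ClosestFire'] = arson.apply(nearest_miles, axis=1, args=(fire,))
arson['ClosestSchool'] = arson.apply(nearest_miles, axis=1, args=(school,))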


One other possible examination is the number of crimes versus proximity to an institution, taking the hour of day into account. This analysis does not reveal any interesting distributions, though the plots look nice.


In [13]:
# Jointplots of Hour of Day vs. Closest Institution (number of crimes is density)
sns.jointplot(x='Hour', y='ClosestPolice', data=arson, kind='kde')
sns.jointplot(x='Hour', y='ClosestFire', data=arson, kind='kde')
sns.jointplot(x='Hour', y='ClosestSchool', data=arson, kind='kde')


Out[13]:
<seaborn.axisgrid.JointGrid at 0xdc5e839160>

6.0 Conclusion

We were able to show that the overall number of crimes in Detroit has been decreasing in recent years, though the reduction may favor the "nicer" neighborhoods over those that have traditionally been more dangerous. Crimes are much more likely to happen in the middle of the day and at night than in the early morning. For arson specifically, the time-of-day distribution depends on the specific offense. These crimes may occur farther from police and fire stations, though proximity to schools does not appear to affect crime location.

7.0 Further Study

This code is freely available for anyone to adapt as they see fit. My recommendations for further study include an examination of pre-Great Recession Detroit crime data, which I was unable to locate, as well as a similar analysis of crime data for other large cities in the United States. It might also be useful to compare crime against demographics such as median household income, as well as demographics of the alleged offender, if such data is available.

One interesting topic of study might be the unemployment rate versus crime levels over time. The period of available data overlaps with the end of the Great Recession, which hit Detroit very hard. According to the Bureau of Labor Statistics, the unemployment rate (not seasonally adjusted) in Detroit peaked at 28.4% in June 2009. There has been significant study of crime reduction across the United States during the Great Recession, so it would be interesting to find out whether the Detroit crime data reflects this. One question is whether there is a lag of some kind between the unemployment rate and crime levels, perhaps following the period when new unemployment claims expire.
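A minimal sketch of that lag analysis, assuming a hypothetical CSV of monthly Detroit unemployment rates built from BLS data with columns 'month' and 'rate' (both the file and its layout are assumptions for illustration):

unemp = pd.read_csv('detroit_unemployment.csv',                  # hypothetical file built from BLS data
                    parse_dates=['month']).set_index('month').to_period('M')

dates = pd.DatetimeIndex(crime['Incident Date'])
monthly_crime = pd.Series(1, index=dates).sort_index().resample('M').sum().to_period('M')   # crimes per month

for lag in [0, 6, 12, 18, 24]:                                   # candidate lags, in months
    print(lag, monthly_crime.corr(unemp['rate'].shift(lag)))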

8.0 Data Sources & Acknowledgement

The datasets are accessed live from Detroit Open Data in JSON format.

I would like to thank Professor David Backus, and PhD Candidate Chase Coleman, both of the New York University Leonard N. Stern School of Business Department of Economics, for their assistance with this project.