The data set is 1.32 GB, so downloading it directly from an ipynb takes a considerable amount of time. We used GeoJSON data for Chicago, which we had to clean to work with the folium mapping package. Please install this non-standard package and obtain the boundary areas here (in GeoJSON format):
CSV data: https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
Export as 'GeoJSON': https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Community-Areas-current-/cauq-8yn6
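If you prefer to script the download, a sketch like the one below could precede the cells that follow. The `/api/views/<id>/rows.csv` export URL pattern is the Socrata portal's convention, not something stated in this notebook, and the helper name is ours; Python 3 is assumed.

```python
import urllib.request

def socrata_csv_url(dataset_id):
    """Build the CSV export URL for a data.cityofchicago.org dataset.
    The /api/views/<id>/rows.csv pattern is the Socrata portal convention."""
    return ('https://data.cityofchicago.org/api/views/%s/rows.csv'
            '?accessType=DOWNLOAD' % dataset_id)

# 'ijzp-q8t2' is the dataset id visible in the URL above. The file is ~1.3 GB,
# so fetching it outside the notebook (e.g. with curl) may be faster:
# urllib.request.urlretrieve(socrata_csv_url('ijzp-q8t2'),
#                            'Crimes_-_2001_to_present.csv')
```

The folium package itself can be installed with `pip install folium`.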
In [1]:
import numpy as np
import pandas as pd
import datetime
import urllib
import matplotlib as mpl # graphics package
import matplotlib.pyplot as plt # graphics module
import folium
import json
%matplotlib inline
from bokeh.plotting import *
from bokeh.models import HoverTool
from collections import OrderedDict
In [2]:
df = pd.read_csv('Crimes_-_2001_to_present.csv', sep=',')
In [3]:
df = df[['Year','Primary Type','Location Description','Community Area','Latitude','Longitude','Arrest','Description']]
In [4]:
df.head()
Out[4]:
In [5]:
len(df.columns)
Out[5]:
In [6]:
df.shape
Out[6]:
In [7]:
len(df['Primary Type'].unique())
Out[7]:
In [8]:
len(df)
Out[8]:
In [9]:
dfrobbery = df[['Primary Type','Location Description']] ## create frame with just 2 columns
dfrobbery = dfrobbery[dfrobbery['Primary Type'].str.contains('ROBBERY')] ## keep only rows that were ROBBERY
dfrobbery = dfrobbery.groupby(['Location Description']) ## group by the location where the robbery took place
dfrobbery = dfrobbery.count().sort_values(by='Primary Type', ascending=False) ## count the robberies in each location
dfrobbery = dfrobbery.head(10) ## extract the top 10 categories
dfrobbery.plot(kind='bar', legend=False) ## bar chart without a legend
plt.title('Top 10 Robbery Locations', fontsize=14, loc='left') ## add title
plt.ylabel('Number of Incidents') ## add y label
Out[9]:
In [10]:
dfburglary = df[['Primary Type','Location Description']] ## create frame with just 2 columns
dfburglary = dfburglary[dfburglary['Primary Type'].str.contains('BURGLARY')] ## keep only rows that were BURGLARY
dfburglary = dfburglary.groupby(['Location Description']) ## group by the location where the burglary took place
dfburglary = dfburglary.count().sort_values(by='Primary Type', ascending=False) ## count the burglaries in each location
dfburglary = dfburglary.head(10) ## extract the top 10 categories
dfburglary.plot(kind='bar', legend=False) ## bar chart without a legend
plt.title('Top 10 Burglary Locations', fontsize=14, loc='left') ## add title
plt.ylabel('Number of Incidents') ## add y label
Out[10]:
In [11]:
dftheft = df[['Primary Type','Location Description']] ## create frame with just 2 columns
dftheft = dftheft[dftheft['Primary Type'].str.contains('THEFT')] ## keep only rows that were THEFT
dftheft = dftheft.groupby(['Location Description']) ## group by the location where the theft took place
dftheft = dftheft.count().sort_values(by='Primary Type', ascending=False) ## count the thefts in each location
dftheft = dftheft.head(10) ## extract the top 10 categories
dftheft.plot(kind='bar', legend=False) ## bar chart without a legend
plt.title('Top 10 Theft Locations', fontsize=14, loc='left') ## add title
plt.ylabel('Number of Incidents') ## add y label
Out[11]:
As a way to start the analysis of the crime data, we wanted to observe where the most "stealing crimes" took place. It is useful to note the technical differences between several similar types of crime, which many people consider synonymous but which law enforcement categorizes differently: robbery, burglary, and theft. Robbery is the act of forcibly stealing from a person; a mugging is a robbery, as is someone snatching your cell phone out of your hands while you are texting. Burglary is the act of breaking and entering and then stealing something or committing another crime. Theft is simply stealing, such as taking a watch out of a gym locker room or stealing jewelry from a store.
It is not surprising that street and sidewalk were the most common areas for robberies, and that residences were the most frequent targets of burglaries. However, the fact that the street was the most common area for thefts was surprising to us. This left us wondering about the exact nature of these crimes. What types of things could be stolen on the street that don't involve mugging or some kind of forcible theft from a person (which would classify the crime as a robbery)?
Our best guess is that many of these street "thefts" were actually more like robberies when they were committed, but were reclassified. Generally, theft is a misdemeanor offense (less than $500) and robbery is always a felony. The reason behind this reclassification is unclear, but it could have something to do with additional paperwork for police, mitigating circumstances, plea deals, or if we are being cynical, an attempt to make crime trends seem less serious than they actually are in the city of Chicago.
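The three cells above repeat the same select/filter/group/plot pattern. If this notebook were refactored, a shared helper might look like the sketch below; the function names are ours, not part of the original code.

```python
import pandas as pd

def top_locations(df, crime, n=10):
    """Top n Location Description values for rows whose Primary Type
    contains `crime`, as a descending Series of counts."""
    sub = df[df['Primary Type'].str.contains(crime, na=False)]
    return sub['Location Description'].value_counts().head(n)

def plot_top_locations(df, crime, n=10):
    """Bar chart of top_locations with the same styling as the cells above."""
    ax = top_locations(df, crime, n).plot(kind='bar', legend=False)
    ax.set_title('Top %d %s Locations' % (n, crime.title()), fontsize=14, loc='left')
    ax.set_ylabel('Number of Incidents')
    return ax
```

With this in place, each chart becomes a one-liner, e.g. `plot_top_locations(df, 'ROBBERY')`.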
In [12]:
dfweapon = df[['Primary Type','Location Description']] ## create frame with just 2 columns
dfweapon = dfweapon[dfweapon['Primary Type'].str.contains('WEAPONS VIOLATION')] ## keep only rows that were WEAPONS VIOLATION
dfweapon = dfweapon.groupby(['Location Description']) ## group by the location where the violation took place
dfweapon = dfweapon.count().sort_values(by='Primary Type', ascending=False) ## count the violations in each location
dfweapon = dfweapon.head(10) ## extract the top 10 categories
dfweapon.plot(kind='bar', legend=False) ## bar chart without a legend
plt.title('Top 10 Weapons Violation Locations', fontsize=14, loc='left') ## add title
plt.ylabel('Number of Incidents') ## add y label
Out[12]:
We also wanted to investigate where the most weapons violations took place. We applied the same approach as with the robbery data to extract the top 10 locations where weapons violations are most common, and created a sorted bar graph to display the data.
Reviewing the chart, it is not surprising that street and sidewalk were the most common places for weapons violations: police would arrest individuals on the street and sidewalk and frisk them for guns. The interesting factor was that school/public building was number four on the list. That is a large number of weapons violations for public buildings, so we wanted to dig into that data a bit more to see what was causing it.
In [13]:
dfschool = df[['Primary Type','Location Description','Description']]
## create frame with just 3 columns
dfschool = dfschool[dfschool['Primary Type'].str.contains('WEAPONS VIOLATION')]
## keep only rows that were weapons violations
dfschool = dfschool.loc[dfschool['Location Description'] == 'SCHOOL, PUBLIC, BUILDING']
## keep rows where the violation took place at a school or public building
dfschool = dfschool.drop(columns=['Primary Type'])
## drop Primary Type because it is now redundant
dfschool = dfschool.groupby(['Description'])
## group by the type of violation
dfschool = dfschool.count().sort_values(by='Location Description', ascending=False)
## count the violations and sort in descending order
dfschool = dfschool.head(5)
## extract the top 5 violations
dfschool.plot(kind='bar', legend=False) ## create bar chart without a legend
plt.title('Top 5 Violations at Schools/Public Buildings', fontsize=14, loc='left') ## add title
plt.ylabel('Number of Incidents') ## add y label
Out[13]:
In order to dive slightly deeper into the violations that occurred in public areas and schools, we extracted all the violations that occurred at schools and public buildings and grouped them by description. This gave us the types of violations that occurred at these locations. The largest of these was the unlawful use of a dangerous weapon, while the sale of firearms at school was low at only 111 instances.
Chicago has some of the strictest gun control laws in the nation that provide misdemeanor and felony charges for most people possessing firearms. One question that remains puzzling is what constitutes the unlawful use of "other" dangerous weapons. According to the Illinois Criminal Code, this has a very broad definition, covering anything from brass knuckles to tasers to pistols. How does this differ from unlawful possession of a handgun, which intuitively would seem to be a much more common offense than the above graph indicates? There are several reasons. First, unlawful use of other dangerous weapon is a Class 1 misdemeanor whereas unlawful possession of a handgun is a felony. Also, unlawful possession of a handgun is limited to people who are underage, have mental disabilities, or have a criminal record. Unlawful use of other dangerous weapons is broader and does not require any special circumstances in order to take effect.
In [14]:
### This code is used to extract the percentage of arrest made for different types of crimes
dfH = df[['Primary Type','Arrest']] ## extract 2 columns from larger data set
dfH = dfH[dfH['Primary Type'].str.contains('HOMICIDE')] ## use only crimes that are 'Homicide'
del dfH['Primary Type'] ## delete Primary Type because this column is no longer needed and interferes with the calculation
PerH = dfH[dfH['Arrest']==True].sum() / dfH['Arrest'].count() ## compare the number of arrests to the total number of homicides
PerH = float(PerH)*100 ## create a new variable 'PerH': the percentage of solved homicides
###recreate the same procedures as completed under 'Homicide' for Assault
dfA = df[['Primary Type','Arrest']]
dfA = dfA[dfA['Primary Type'].str.contains('ASSAULT')]
del dfA['Primary Type']
PerA = dfA[dfA['Arrest']==True].sum() / dfA['Arrest'].count()
PerA = float(PerA)*100
###recreate the same procedures as completed under 'Homicide' for Sex Offense
dfSO = df[['Primary Type','Arrest']]
dfSO = dfSO[dfSO['Primary Type'].str.contains('SEX OFFENSE')]
del dfSO['Primary Type']
PerSO = dfSO[dfSO['Arrest']==True].sum() / dfSO['Arrest'].count()
PerSO = float(PerSO)*100
###recreate the same procedures as completed under 'Homicide' for Criminal Sexual Assault
dfCSA = df[['Primary Type','Arrest']]
dfCSA = dfCSA[dfCSA['Primary Type'].str.contains('CRIM SEXUAL ASSAULT')]
del dfCSA['Primary Type']
PerCSA = dfCSA[dfCSA['Arrest']==True].sum() / dfCSA['Arrest'].count()
PerCSA = float(PerCSA)*100
###recreate the same procedures as completed under 'Homicide' for Robbery
dfRob = df[['Primary Type','Arrest']]
dfRob = dfRob[dfRob['Primary Type'].str.contains('ROBBERY')]
del dfRob['Primary Type']
PerRob = dfRob[dfRob['Arrest']==True].sum() / dfRob['Arrest'].count()
PerRob = float(PerRob)*100
###create a new data frame with the proper names and the variables that were created for each group
dfnew = pd.DataFrame({'Type': ['Homicide','Assault','Sex Offense','Sexual Assault','Robbery'],
'Percentage': [PerH,PerA,PerSO,PerCSA,PerRob]})
## Create new data set with the variables we just created
dfnew = dfnew.sort_values(['Percentage'])
## sort by the value of the percentage
dfnew = dfnew.set_index(['Type'])
## set the index to type for chart purposes
fig, ax = plt.subplots()
dfnew.plot(ax=ax, kind='barh', legend=False) ## create a horizontal bar chart without a legend
ax.set_title('Chance of Arrest', fontsize=14, loc='center') ## center the title
ax.set_xlabel('Percentage') ## add x label
ax.set_ylabel('') ## remove y label
Out[14]:
We chose 5 types of crimes that represent the FBI's violent crime categories to analyze for arrest rates. There are numerous other types of crimes in the database, but these are the highest profile and have the highest chance of investigation.
In order to extract the data, we pulled into the dataset the type of crime and whether an arrest had been made for that crime. We then created a variable for each crime that measured the percentage of incidents that resulted in an arrest.
After creating this chart, it was interesting to us that the chance of arrest for each type of crime was much lower than we anticipated. If you are murdered in Chicago, there is only a 50% chance that an arrest will be made related to that murder. Furthermore, it is interesting that only about 10% of robberies result in an arrest. So if you are robbed in Chicago, there is a pretty low chance you will ever find out who did it.
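Because `Arrest` is a boolean column, the five near-identical blocks above could also be collapsed into a single loop over the crime keywords. A minimal sketch (the function name and structure are ours):

```python
import pandas as pd

def arrest_rates(df, crimes):
    """Percentage of incidents whose Arrest flag is True, per crime keyword.
    Assumes a 'Primary Type' string column and a boolean 'Arrest' column."""
    rates = {}
    for crime in crimes:
        mask = df['Primary Type'].str.contains(crime, na=False)
        rates[crime] = df.loc[mask, 'Arrest'].mean() * 100  # mean of booleans = fraction True
    return pd.Series(rates).sort_values()
```

Calling `arrest_rates(df, ['HOMICIDE','ASSAULT','SEX OFFENSE','CRIM SEXUAL ASSAULT','ROBBERY']).plot(kind='barh', legend=False)` should then reproduce the chart above in a few lines.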
We were interested to know how crime trends may have shifted in the Chicago area over time. We felt that heat maps would be a useful way to conduct this analysis and decided to compare 2002 data to 2015 data. We found that the overall number of crimes decreased, but that the same neighborhoods that suffered high crime rates in 2002 also suffered high crime rates in 2015. In fact, the geographic trends are almost identical.
In [15]:
Chicago_Coords = [41.88, -87.62]
In [16]:
with open('Boundaries - Community Areas (current).geojson') as f:
    comm_area = json.load(f)
In [17]:
geometry, geo_id = [], []
for areas in comm_area['features']:
    geometry.append(areas['geometry']['coordinates'][0])
    geo_id.append(areas['properties']['area_num_1'])
CA = pd.DataFrame(np.c_[geo_id, geometry], columns=['Comm_area', 'geometry'])
In [18]:
def df_to_geojson(df):
    geojson = {'type':'FeatureCollection', 'features':[]}
    for _, row in df.iterrows():
        feature = {'type':'Feature',
                   'Comm_area_id':row['Comm_area'],
                   'geometry':{'type':'Polygon',
                               'coordinates':row['geometry']}}
        geojson['features'].append(feature)
    return geojson
In [19]:
new_geojson = df_to_geojson(CA)
with open('NewChicagoboundaries.txt', 'w') as outfile:
    json.dump(new_geojson, outfile)
In [20]:
def trim_fraction(text):
    if '.0' in text:
        return text[:text.rfind('.0')]
    return text
In [21]:
crime02 = df[df.Year ==2002]
crime02 = crime02.groupby(['Community Area'], as_index=False)['Year'].count()
crime02['Community Area'] = crime02['Community Area'].astype(str)
crime02['Community Area'] = crime02['Community Area'].apply(trim_fraction)
crime02 = crime02[crime02['Community Area']!='0']
In [22]:
crime15 = df[df.Year ==2015]
crime15 = crime15.groupby(['Community Area'], as_index=False)['Year'].count()
crime15['Community Area'] = crime15['Community Area'].astype(str)
crime15['Community Area'] = crime15['Community Area'].apply(trim_fraction)
crime15 = crime15[crime15['Community Area']!='0']
In [23]:
districtmap02 = folium.Map(location=Chicago_Coords, zoom_start=10)
districtmap02.choropleth(geo_path='NewChicagoboundaries.txt', data=crime02,
                         columns=['Community Area', 'Year'],
                         key_on='feature.Comm_area_id', fill_color='YlGn',
                         fill_opacity=0.7, line_opacity=0.2,
                         legend_name='Crime Rate')
districtmap02
Out[23]:
In [24]:
districtmap15 = folium.Map(location=Chicago_Coords, zoom_start=10)
districtmap15.choropleth(geo_path='NewChicagoboundaries.txt', data=crime15,
                         columns=['Community Area', 'Year'],
                         key_on='feature.Comm_area_id', fill_color='YlGn',
                         fill_opacity=0.7, line_opacity=0.2,
                         legend_name='Crime Rate')
districtmap15
Out[24]:
In [25]:
df_count = df[['Year','Community Area']]
df_count_pivot = df_count.pivot_table(index='Community Area',columns='Year', aggfunc=len)
In [26]:
toptenoverall = df_count_pivot.iloc[:, -1].sort_values(ascending=False).head(10).index.values
highest = df_count_pivot[df_count_pivot.index.isin(toptenoverall)]
highest
Out[26]:
In [27]:
lowtenoverall = df_count_pivot.iloc[:, -1].sort_values(ascending=False).tail(10).index.values
lowest = df_count_pivot[df_count_pivot.index.isin(lowtenoverall)]
lowest
Out[27]: