I recently discovered that Baltimore publishes its data on all sorts of aspects of the city, from finances to safety to transportation. The most interesting starting point (to me) is public safety data, which BPD shares plenty of. There are datasets for reported crimes, arrests, officer involved injuries, and other things. This notebook is a dig through the data set on violent crimes in Baltimore.
Let's start with the typical imports. The data set can be accessed via a JSON API, but the API won't return everything together so I can't use pandas.read_json()
, hence the use of requests
.
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests as req
from IPython.display import display
plt.style.use('fivethirtyeight')
%matplotlib inline
# this warning got annoying. yes, i know it is a copy. no, i dont care
pd.options.mode.chained_assignment = None
Grabbing all the data is a little more involved than if I just download everything, but the dataset is regularly updated online. Loading the data in this way allows this notebook to stay current with the latest published data.
In [2]:
url = 'https://data.baltimorecity.gov/resource/wsfq-mvij.json'
size = 50000 # grab a lot at once because i'm impatient
limit = '$limit={}'.format(size)
offset = '$offset={}'
idx = 0
df = pd.DataFrame()
while True:
off = offset.format(idx)
u = '{}?{}&{}'.format(url, limit, off)
r = req.get(u)
if not len(r.content) > 3:
break
df1 = pd.DataFrame(r.json())
df = df.append(df1, ignore_index=True)
idx += size
In [3]:
df['crimedate'] = pd.to_datetime(df['crimedate'])
df.head()
Out[3]:
They provide a decent amount of data regarding each incident. Unfortunately the case number is not included so we can't easily cross-reference these incidents against the reported arrests data set.
Let's look at Medfield. Medfield is fairly small and mostly families so it will be interesting to see how much crime takes place and what the most common crime is in the neighborhood. Based on chatter in the neighborhood, my guess is that theft of property from cars is most common.
In [4]:
medfield = df[df['neighborhood'] == 'Medfield']
medfield.index = medfield['crimedate']
medfield.groupby(medfield.index.year)['crimecode'].count().plot(kind='bar', rot=45, title='Medfield crime totals by year')
Out[4]:
2013 was oddly high whereas the rest of the years stayed between 65 and 75 total crimes. However at time of writing the current data set goes through August 14th, 2015, and 2015 is creeping up on 2014's numbers.
Now what about the types of crimes in Medfield? Note that the data set is only crimes that affect the person (violent crimes), so things like littering are not included.
In [5]:
display(medfield['description'].value_counts().sum())
medfield['description'].value_counts().plot(kind='barh', title='Medfield crimes by type')
Out[5]:
Yikes! I was not expecting a homicide or rape in Medfield. Larceny from auto was guess for most common crime, so it looks like the neighborhood chatter is accurate.
This is nearly six years worth of crimes, so a total of 437 crimes is not too shabby. I'm curious about the rape and homicide incidents, so let's investigate those before continuing.
In [6]:
display(medfield[medfield['description'] == 'HOMICIDE'])
display(medfield[medfield['description'] == 'RAPE'])
Homicides with a knife usually occur when the victim and killer know each other, so a random act of murder can be ruled out. As for the rapes, five years between incidents is pretty good (but it would be better if there weren't any incidents at all!).
It would be interesting to know if there are certain times of the year with an increase in crime. Let's investigate by plotting crimes by month in each year.
In [7]:
years = np.unique(medfield.index.year)
months = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
df1 = medfield['description']
for year in years:
d = df1[df1.index.year == year]
title = 'Medfield crimes by month for {}'.format(year)
cnt = d.groupby(d.index.month).count()
if len(cnt.index) < len(months):
cnt.index = months[:len(cnt.index)] # special case for current year
else:
cnt.index = months
cnt.plot(legend=None, yticks=range(0,25), title=title, grid=True, figsize=(8,6), label='Month')
plt.show()
There really does not seem to be a correlation between time of year and number of crimes. There usually are dips in June and November, but that does not hold true for every year. February is the only month that consistently is lower than the prior and latter months. February was pretty cold this year so that makes sense for most of the crimes. It seems that these incidents follow no particular trend except that larceny from auto is consistently the most common.
Baltimore is a big city, and focusing just on quiet, little Medfield does not accurately represent the state of things in the city. Let's break down the neighborhoods by number of crimes. The data set contains a significant number of neighborhoods, so representing it all on one chart would be difficult to parse through. Instead, let's look in chunks of thirty neighborhoods so as to keep the readability manageable while also minimizing the total number of charts needed to represent the data.
In [8]:
by_neighborhood = df['neighborhood'].value_counts()
for i in range(0, len(by_neighborhood), 30):
title = 'Neighborhoods crime numbers, {}-{}'.format(years.min(), years.max())
by_neighborhood[i:i+30].plot(kind='barh', title=title, figsize=(6,6))
plt.show()
Moral of the story: do not live in the Frankford/Belair-Edison area and do your best to get a place in Blythewood. If you are from Baltimore, or know someone from Baltimore, you can try and parse through all of that and find the neighborhoods you are curious about. Or, we could make everything a little more accessible with a dictionary.
In [9]:
# group neighborhoods into count of crime descriptions. allows for easy querying of which crimes are most prevelant in a neighborhood
type_nums = {}
hoods = df['neighborhood'].unique()
for hood in hoods:
hood_df = df[df['neighborhood'] == hood]
type_nums[hood] = hood_df['description'].value_counts()
Hampden and Mondawmin are places I regularly go to or pass through. I wonder what they are like? Let's use the handy dictionary we just created to check.
In [10]:
display(type_nums['Hampden'].sum())
type_nums['Hampden'].plot(kind='barh')
Out[10]:
In [11]:
display(type_nums['Mondawmin'].sum())
type_nums['Mondawmin'].plot(kind='barh')
Out[11]:
Hampden looks about the same as Medfield with regard to types of crimes that are most common, but the total number of crimes is significantly higher (but Hampden is much larger so that is to be expected). Mondawmin is where the riots started back in April so the high number of larceny incidents is not surprising. However, the rest of the numbers for the area are much lower than I would have expected. I guess it isn't so bad there after all.
If I do not want to die or get raped in Baltimore, where should I avoid?
In [12]:
# most homicides?
most = 0
most_hood = ''
for hood in type_nums:
if 'HOMICIDE' in type_nums[hood] and type_nums[hood]['HOMICIDE'] > most:
most = type_nums[hood]['HOMICIDE']
most_hood = hood
print(most)
print(most_hood)
In [13]:
# most rapes?
most = 0
most_hood = ''
for hood in type_nums:
if 'RAPE' in type_nums[hood] and type_nums[hood]['RAPE'] > most:
most = type_nums[hood]['RAPE']
most_hood = hood
print(most)
print(most_hood)
I guess I won't go golfing at the Clifton Park golf course then.
I wonder if there is a correlation between district and number of crimes? A pie chart should handle that for us nicely.
In [14]:
df['district'].value_counts().plot(kind='pie', autopct='%.2f', title='Crimes by district', figsize=(8,8))
Out[14]:
It seems like most of the districts are fairly similar to each other. If you really want to be in the clear then live somewhere in the Eastern or Western districts.
Finally, what weapons are used most commonly in crimes and how many homicides have there been each year? With how hard the city is on talking about gun crimes, I will venture a guess that guns will be used most often. Also, remember that the weapons count is for all crimes, not just homicides.
In [15]:
df['weapon'].dropna().value_counts().plot(kind='barh')
Out[15]:
In [16]:
df.index = df['crimedate']
for year in years:
d = df[df.index.year == year]
d = d['description']
d = d.value_counts()
if 'HOMICIDE' in d.index:
display('Year {} - Homicide count is {}'.format(year, d['HOMICIDE']))
else:
display('Year {} - Homicide count is {}'.format(year, 0))
For some reason, I doubt that there were no homicides in all of 2010, so I will assume that the data set is not complete for 2010. I am fairly certain the data is updated every month, so 206 for 2015 is actually slightly behind the current number (as of August 25th) of 215. It is only the end of August and we have already passed last years numbers! To put this in perspective, despite having fewer residents, by end of July this year Baltimore boasted 26 more homicides than Detroit(see this for more). Charm City certainly has its problems as evidenced by the data, but it sure is a great place to live.
We could dig through more by neighborhood or plot the incidents on a map (maps are a pain for me at the moment, otherwise I would include one), but I have gotten a picture of the areas I care about. Since this loads data from the internet, you are more than welcome to play with the notebook and investigate areas you care about. You can access the notebook here. If you have feedback or suggestions please let me know as I am new to this data science space.
For convenience, below I have added a function to generate a report for a specified neighborhood. As an example, I used it on Hampden.
In [17]:
def report_for_neighborhood(df, hood):
# assumes good input
df = df[df['neighborhood'] == hood]
df.index = range(len(df))
df['description'] = df['description'].astype('category')
df.groupby(df['crimedate'].dt.year)['crimecode'].count().plot(kind='bar', rot=45, title='{} crime totals by year'.format(hood))
plt.show()
df['description'].value_counts().plot(kind='barh', title='{} crimes by type (all years combined)'.format(hood))
plt.show()
s = df.groupby(df['crimedate'].dt.year)['description'].value_counts()
s.unstack().plot(kind='bar', figsize=(15,15), title='{} crimes by type and year'.format(hood))
In [18]:
report_for_neighborhood(df, 'Hampden')
In [19]:
riots = df[(df['crimedate'].dt.year == 2015) & (df['crimedate'].dt.month == 4) & (df['crimedate'].dt.day > 22) & (df['crimedate'].dt.day < 28)]
display('Total crimes during riots: {}'.format(len(riots)))
display(riots['description'].value_counts())
Compare that to the prior year.
In [20]:
rnd = df[(df['crimedate'].dt.year == 2014) & (df['crimedate'].dt.month == 4) & (df['crimedate'].dt.day > 22) & (df['crimedate'].dt.day < 28)]
display('Total crimes during April 23-27, 2014: {}'.format(len(rnd)))
display(rnd['description'].value_counts())
129 more crimes during that time period in 2015 than 2014 for the whole city. Most of the problems during the riots were theft and property destruction. You can see the larceny and burglary amounts are much higher during the riots than during the prior year which holds true to what actually happened.