Crime in Baltimore

I recently discovered that Baltimore publishes its data on all sorts of aspects of the city, from finances to safety to transportation. The most interesting starting point (to me) is public safety data, which BPD shares plenty of. There are datasets for reported crimes, arrests, officer involved injuries, and other things. This notebook is a dig through the data set on violent crimes in Baltimore.

Let's start with the typical imports. The data set can be accessed via a JSON API, but the API won't return everything together so I can't use pandas.read_json(), hence the use of requests.



In [1]:

    
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests as req
from IPython.display import display
plt.style.use('fivethirtyeight')
%matplotlib inline
# this warning got annoying. yes, i know it is a copy. no, i dont care
pd.options.mode.chained_assignment = None

Grabbing all the data is a little more involved than if I just download everything, but the dataset is regularly updated online. Loading the data in this way allows this notebook to stay current with the latest published data.



In [2]:

    
url = 'https://data.baltimorecity.gov/resource/wsfq-mvij.json'
size = 50000 # grab a lot at once because i'm impatient
limit = '$limit={}'.format(size)
offset = '$offset={}'
idx = 0
df = pd.DataFrame()
while True:
    off = offset.format(idx)
    u = '{}?{}&{}'.format(url, limit, off)
    r = req.get(u)
    if not len(r.content) > 3:
        break
    df1 = pd.DataFrame(r.json())
    df = df.append(df1, ignore_index=True)
    idx += size



In [3]:

    
df['crimedate'] = pd.to_datetime(df['crimedate'])
df.head()









    Out[3]:






  
    
      
      crimecode
      crimedate
      crimetime
      description
      district
      location
      location_1
      neighborhood
      post
      total_incidents
      weapon
    
  
  
    
      0
      5A
      2015-08-15
      00:00:00
      BURGLARY
      SOUTHEASTERN
      0 S CURLEY ST
      {'longitude': '-76.57541', 'needs_recoding': F...
      Patterson Park Neighborhood
      224
      1
      NaN
    
    
      1
      6D
      2015-08-15
      00:00:00
      LARCENY FROM AUTO
      SOUTHEASTERN
      2500 E FAYETTE ST
      {'longitude': '-76.58127', 'needs_recoding': F...
      McElderry Park
      221
      1
      NaN
    
    
      2
      5D
      2015-08-15
      00:00:00
      BURGLARY
      NORTHERN
      5300 TILBURY WAY
      {'longitude': '-76.61499', 'needs_recoding': F...
      Homeland
      523
      1
      NaN
    
    
      3
      4E
      2015-08-15
      00:02:00
      COMMON ASSAULT
      CENTRAL
      300 GUILFORD AVE
      {'longitude': '-76.6113', 'needs_recoding': Fa...
      Downtown
      114
      1
      HANDS
    
    
      4
      6B
      2015-08-15
      00:10:00
      LARCENY
      CENTRAL
      400 TUBMAN CT
      {'longitude': '-76.62451', 'needs_recoding': F...
      Upton
      123
      1
      NaN

They provide a decent amount of data regarding each incident. Unfortunately the case number is not included so we can't easily cross-reference these incidents against the reported arrests data set.

Let's look at Medfield. Medfield is fairly small and mostly families so it will be interesting to see how much crime takes place and what the most common crime is in the neighborhood. Based on chatter in the neighborhood, my guess is that theft of property from cars is most common.



In [4]:

    
medfield = df[df['neighborhood'] == 'Medfield']
medfield.index = medfield['crimedate']
medfield.groupby(medfield.index.year)['crimecode'].count().plot(kind='bar', rot=45, title='Medfield crime totals by year')









    Out[4]:





<matplotlib.axes._subplots.AxesSubplot at 0xb4bb568c>

2013 was oddly high whereas the rest of the years stayed between 65 and 75 total crimes. However at time of writing the current data set goes through August 14th, 2015, and 2015 is creeping up on 2014's numbers.

Now what about the types of crimes in Medfield? Note that the data set is only crimes that affect the person (violent crimes), so things like littering are not included.



In [5]:

    
display(medfield['description'].value_counts().sum())
medfield['description'].value_counts().plot(kind='barh', title='Medfield crimes by type')









    





437






    Out[5]:





<matplotlib.axes._subplots.AxesSubplot at 0xaca64b6c>

Yikes! I was not expecting a homicide or rape in Medfield. Larceny from auto was guess for most common crime, so it looks like the neighborhood chatter is accurate.

This is nearly six years worth of crimes, so a total of 437 crimes is not too shabby. I'm curious about the rape and homicide incidents, so let's investigate those before continuing.



In [6]:

    
display(medfield[medfield['description'] == 'HOMICIDE'])
display(medfield[medfield['description'] == 'RAPE'])









    






  
    
      
      crimecode
      crimedate
      crimetime
      description
      district
      location
      location_1
      neighborhood
      post
      total_incidents
      weapon
    
    
      crimedate
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      2015-03-10
      1K
      2015-03-10
      2254
      HOMICIDE
      NORTHERN
      4400 LAPLATA AV
      {'longitude': '-76.64514', 'needs_recoding': F...
      Medfield
      536
      1
      KNIFE
    
  








    






  
    
      
      crimecode
      crimedate
      crimetime
      description
      district
      location
      location_1
      neighborhood
      post
      total_incidents
      weapon
    
    
      crimedate
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      2015-01-29
      2B
      2015-01-29
      21:00:34
      RAPE
      NORTHERN
      1400 N WELDON PL
      {'longitude': '-76.6419', 'needs_recoding': Fa...
      Medfield
      536
      1
      NaN
    
    
      2010-02-15
      2B
      2010-02-15
      21:11:58
      RAPE
      NORTHERN
      4400 FALLS BRIDGE DR
      {'longitude': '-76.64564', 'needs_recoding': F...
      Medfield
      536
      1
      NaN

Homicides with a knife usually occur when the victim and killer know each other, so a random act of murder can be ruled out. As for the rapes, five years between incidents is pretty good (but it would be better if there weren't any incidents at all!).

It would be interesting to know if there are certain times of the year with an increase in crime. Let's investigate by plotting crimes by month in each year.



In [7]:

    
years = np.unique(medfield.index.year)
months = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
df1 = medfield['description']
for year in years:
    d = df1[df1.index.year == year]
    title = 'Medfield crimes by month for {}'.format(year)
    cnt = d.groupby(d.index.month).count()
    if len(cnt.index) < len(months):
        cnt.index = months[:len(cnt.index)] # special case for current year
    else:
        cnt.index = months
    cnt.plot(legend=None, yticks=range(0,25), title=title, grid=True, figsize=(8,6), label='Month')
    plt.show()

There really does not seem to be a correlation between time of year and number of crimes. There usually are dips in June and November, but that does not hold true for every year. February is the only month that consistently is lower than the prior and latter months. February was pretty cold this year so that makes sense for most of the crimes. It seems that these incidents follow no particular trend except that larceny from auto is consistently the most common.

Baltimore is a big city, and focusing just on quiet, little Medfield does not accurately represent the state of things in the city. Let's break down the neighborhoods by number of crimes. The data set contains a significant number of neighborhoods, so representing it all on one chart would be difficult to parse through. Instead, let's look in chunks of thirty neighborhoods so as to keep the readability manageable while also minimizing the total number of charts needed to represent the data.



In [8]:

    
by_neighborhood = df['neighborhood'].value_counts()
for i in range(0, len(by_neighborhood), 30):
    title = 'Neighborhoods crime numbers, {}-{}'.format(years.min(), years.max())
    by_neighborhood[i:i+30].plot(kind='barh', title=title, figsize=(6,6))
    plt.show()

Moral of the story: do not live in the Frankford/Belair-Edison area and do your best to get a place in Blythewood. If you are from Baltimore, or know someone from Baltimore, you can try and parse through all of that and find the neighborhoods you are curious about. Or, we could make everything a little more accessible with a dictionary.



In [9]:

    
# group neighborhoods into count of crime descriptions. allows for easy querying of which crimes are most prevelant in a neighborhood
type_nums = {}
hoods = df['neighborhood'].unique()
for hood in hoods:
    hood_df = df[df['neighborhood'] == hood]
    type_nums[hood] = hood_df['description'].value_counts()

Hampden and Mondawmin are places I regularly go to or pass through. I wonder what they are like? Let's use the handy dictionary we just created to check.



In [10]:

    
display(type_nums['Hampden'].sum())
type_nums['Hampden'].plot(kind='barh')









    





2517






    Out[10]:





<matplotlib.axes._subplots.AxesSubplot at 0xac44e82c>



In [11]:

    
display(type_nums['Mondawmin'].sum())
type_nums['Mondawmin'].plot(kind='barh')









    





3585






    Out[11]:





<matplotlib.axes._subplots.AxesSubplot at 0xac299d0c>

Hampden looks about the same as Medfield with regard to types of crimes that are most common, but the total number of crimes is significantly higher (but Hampden is much larger so that is to be expected). Mondawmin is where the riots started back in April so the high number of larceny incidents is not surprising. However, the rest of the numbers for the area are much lower than I would have expected. I guess it isn't so bad there after all.

If I do not want to die or get raped in Baltimore, where should I avoid?



In [12]:

    
# most homicides?
most = 0
most_hood = ''
for hood in type_nums:
    if 'HOMICIDE' in type_nums[hood] and type_nums[hood]['HOMICIDE'] > most:
        most = type_nums[hood]['HOMICIDE']
        most_hood = hood
print(most)
print(most_hood)









    



34
Coldstream Homestead Montebello



In [13]:

    
# most rapes?
most = 0
most_hood = ''
for hood in type_nums:
    if 'RAPE' in type_nums[hood] and type_nums[hood]['RAPE'] > most:
        most = type_nums[hood]['RAPE']
        most_hood = hood
print(most)
print(most_hood)









    



82
Downtown

I guess I won't go golfing at the Clifton Park golf course then.

I wonder if there is a correlation between district and number of crimes? A pie chart should handle that for us nicely.



In [14]:

    
df['district'].value_counts().plot(kind='pie', autopct='%.2f', title='Crimes by district', figsize=(8,8))









    Out[14]:





<matplotlib.axes._subplots.AxesSubplot at 0xac80918c>

It seems like most of the districts are fairly similar to each other. If you really want to be in the clear then live somewhere in the Eastern or Western districts.

Finally, what weapons are used most commonly in crimes and how many homicides have there been each year? With how hard the city is on talking about gun crimes, I will venture a guess that guns will be used most often. Also, remember that the weapons count is for all crimes, not just homicides.



In [15]:

    
df['weapon'].dropna().value_counts().plot(kind='barh')









    Out[15]:





<matplotlib.axes._subplots.AxesSubplot at 0xac82f50c>



In [16]:

    
df.index = df['crimedate']
for year in years:
    d = df[df.index.year == year]
    d = d['description']
    d = d.value_counts()
    if 'HOMICIDE' in d.index:
        display('Year {} - Homicide count is {}'.format(year, d['HOMICIDE']))
    else:
        display('Year {} - Homicide count is {}'.format(year, 0))









    





'Year 2010 - Homicide count is 0'






    





'Year 2011 - Homicide count is 197'






    





'Year 2012 - Homicide count is 217'






    





'Year 2013 - Homicide count is 235'






    





'Year 2014 - Homicide count is 211'






    





'Year 2015 - Homicide count is 206'

For some reason, I doubt that there were no homicides in all of 2010, so I will assume that the data set is not complete for 2010. I am fairly certain the data is updated every month, so 206 for 2015 is actually slightly behind the current number (as of August 25th) of 215. It is only the end of August and we have already passed last years numbers! To put this in perspective, despite having fewer residents, by end of July this year Baltimore boasted 26 more homicides than Detroit(see this for more). Charm City certainly has its problems as evidenced by the data, but it sure is a great place to live.

We could dig through more by neighborhood or plot the incidents on a map (maps are a pain for me at the moment, otherwise I would include one), but I have gotten a picture of the areas I care about. Since this loads data from the internet, you are more than welcome to play with the notebook and investigate areas you care about. You can access the notebook here. If you have feedback or suggestions please let me know as I am new to this data science space.

For convenience, below I have added a function to generate a report for a specified neighborhood. As an example, I used it on Hampden.



In [17]:

    
def report_for_neighborhood(df, hood):
    # assumes good input
    df = df[df['neighborhood'] == hood]
    df.index = range(len(df))
    df['description'] = df['description'].astype('category')
    df.groupby(df['crimedate'].dt.year)['crimecode'].count().plot(kind='bar', rot=45, title='{} crime totals by year'.format(hood))
    plt.show()
    df['description'].value_counts().plot(kind='barh', title='{} crimes by type (all years combined)'.format(hood))
    plt.show()
    s = df.groupby(df['crimedate'].dt.year)['description'].value_counts()
    s.unstack().plot(kind='bar', figsize=(15,15), title='{} crimes by type and year'.format(hood))



In [18]:

    
report_for_neighborhood(df, 'Hampden')

Bonus

How many crimes were committed during the riots? Protests began April 18th, but none of the violence really started until the 23rd. Let's look from April 23rd, 2015 - April 27th, 2015 (clean up of the streets by citizens started on the 28th).



In [19]:

    
riots = df[(df['crimedate'].dt.year == 2015) & (df['crimedate'].dt.month == 4) & (df['crimedate'].dt.day > 22) & (df['crimedate'].dt.day < 28)]
display('Total crimes during riots: {}'.format(len(riots)))
display(riots['description'].value_counts())









    





'Total crimes during riots: 820'






    





BURGLARY                302
LARCENY                 110
COMMON ASSAULT           81
AUTO THEFT               81
LARCENY FROM AUTO        79
AGG. ASSAULT             70
ROBBERY - STREET         35
ROBBERY - COMMERCIAL     19
ARSON                    10
ROBBERY - CARJACKING     10
ASSAULT BY THREAT         6
SHOOTING                  5
HOMICIDE                  5
RAPE                      4
ROBBERY - RESIDENCE       3
dtype: int64

Compare that to the prior year.



In [20]:

    
rnd = df[(df['crimedate'].dt.year == 2014) & (df['crimedate'].dt.month == 4) & (df['crimedate'].dt.day > 22) & (df['crimedate'].dt.day < 28)]
display('Total crimes during April 23-27, 2014: {}'.format(len(rnd)))
display(rnd['description'].value_counts())









    





'Total crimes during April 23-27, 2014: 691'






    





LARCENY                 167
COMMON ASSAULT          142
BURGLARY                 95
AGG. ASSAULT             79
LARCENY FROM AUTO        77
AUTO THEFT               63
ROBBERY - STREET         26
ASSAULT BY THREAT        12
ROBBERY - RESIDENCE      11
SHOOTING                  6
ROBBERY - COMMERCIAL      4
ARSON                     3
HOMICIDE                  2
ROBBERY - CARJACKING      2
RAPE                      2
dtype: int64

129 more crimes during that time period in 2015 than 2014 for the whole city. Most of the problems during the riots were theft and property destruction. You can see the larceny and burglary amounts are much higher during the riots than during the prior year which holds true to what actually happened.

	crimecode	crimedate	crimetime	description	district	location	location_1	neighborhood	post	total_incidents	weapon
0	5A	2015-08-15	00:00:00	BURGLARY	SOUTHEASTERN	0 S CURLEY ST	{'longitude': '-76.57541', 'needs_recoding': F...	Patterson Park Neighborhood	224	1	NaN
1	6D	2015-08-15	00:00:00	LARCENY FROM AUTO	SOUTHEASTERN	2500 E FAYETTE ST	{'longitude': '-76.58127', 'needs_recoding': F...	McElderry Park	221	1	NaN
2	5D	2015-08-15	00:00:00	BURGLARY	NORTHERN	5300 TILBURY WAY	{'longitude': '-76.61499', 'needs_recoding': F...	Homeland	523	1	NaN
3	4E	2015-08-15	00:02:00	COMMON ASSAULT	CENTRAL	300 GUILFORD AVE	{'longitude': '-76.6113', 'needs_recoding': Fa...	Downtown	114	1	HANDS
4	6B	2015-08-15	00:10:00	LARCENY	CENTRAL	400 TUBMAN CT	{'longitude': '-76.62451', 'needs_recoding': F...	Upton	123	1	NaN

	crimecode	crimedate	crimetime	description	district	location	location_1	neighborhood	post	total_incidents	weapon
crimedate
2015-03-10	1K	2015-03-10	2254	HOMICIDE	NORTHERN	4400 LAPLATA AV	{'longitude': '-76.64514', 'needs_recoding': F...	Medfield	536	1	KNIFE

	crimecode	crimedate	crimetime	description	district	location	location_1	neighborhood	post	total_incidents	weapon
crimedate
2015-01-29	2B	2015-01-29	21:00:34	RAPE	NORTHERN	1400 N WELDON PL	{'longitude': '-76.6419', 'needs_recoding': Fa...	Medfield	536	1	NaN
2010-02-15	2B	2010-02-15	21:11:58	RAPE	NORTHERN	4400 FALLS BRIDGE DR	{'longitude': '-76.64564', 'needs_recoding': F...	Medfield	536	1	NaN