DC Crimebusters - Georgetown Data Analytics Certificate, Spring 2015

Rana, David, Mike, Kathleen and Andrew

DC Crimebusters set out to help users make informed decisions about their personal safety when taking Washington Metropolitan Area Transit Authority (WMATA) trains to their destinations. Riders have developed perceptions about Metro stops within the District based on anecdotal evidence, and this analysis aims to support or disprove those notions. The final product will be a mobile application in which users enter their destination in Washington, DC and their time of travel, and the app reports the relative safety of the destination neighborhood at that time. It will also tell them which crimes to be most aware of and recommend taking a taxi or private car hire if the probability of being victimized exceeds a specified threshold.



Our overall hypothesis is that crime increases with proximity to a Metro station, so a Metro rider has a higher chance of witnessing or being the victim of a crime the closer he or she is to a station. In addition to this high-level hypothesis, we formed the following hypotheses for each variable in our data:

Distance from Metro Station: The closer to a Metro station, the more likely crime will occur.

Percent of Vacant Homes: The more vacant homes, the more likely crime will occur.

Percent of Occupied Homes: The more occupied homes, the less likely crime will occur.

Percent below the Poverty Line: The higher the percentage of the population below the poverty line, the more likely crime will occur.

Median Home Value: The higher the median home value, the less likely crime will occur.


Data files:

Washington, DC Crime information from the District of Columbia’s Open Data Initiative: 2014 crime data that included the type of crime, the time it occurred (broken out by day, evening and night), the date and the location.

Location of Metro Stations: includes addresses and latitude/longitude information that would allow us to map the Metro stations.

Demographic/Census Information: broken out by census block group, this data included the percent of vacant homes, the percent of occupied homes, the percent of the population below the poverty line, the median home value and the median household income.
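Because both the crime records and the Metro stations carry latitude/longitude information, a distance-from-station attribute can be derived for each crime. The following is a minimal sketch of one way to do that, using the haversine great-circle formula. The function names and the sample coordinates are illustrative assumptions, not the project's actual code or data.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_station(lat, lon, stations):
    """Return (name, distance_km) for the station closest to (lat, lon).

    `stations` maps a station name to a (lat, lon) tuple.
    """
    candidates = (
        (name, haversine_km(lat, lon, s_lat, s_lon))
        for name, (s_lat, s_lon) in stations.items()
    )
    return min(candidates, key=lambda pair: pair[1])


# Approximate, illustrative coordinates only (not taken from the project data).
stations = {
    "Columbia Heights": (38.9284, -77.0325),
    "Metro Center": (38.8983, -77.0281),
}

# A hypothetical crime location a short walk from the first station.
name, dist_km = nearest_station(38.9300, -77.0330, stations)
print(name, round(dist_km, 3))
```

Over distances as short as those within the District, a projected coordinate system would give nearly identical results. The haversine form is used here only because it works directly on latitude/longitude pairs.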

Data Exploration:

Before conducting deeper analysis, such as regression or classification, we wanted to become familiar with the main features of the data set. To do this, we started by answering three questions:

  • What is the most common type of crime?
  • When did crimes most frequently occur?
  • Which Metro station experienced the most crime?

What is the most common type of crime?

Out of our data set of about 38,000 events, roughly 14,500 are of the type "THEFT/OTHER", making that the leading kind of crime in DC in 2014. Another 11,000 events are of the type "THEFT F/AUTO", which is theft from motor vehicles. This means that around 67.5% of the crimes that occurred in 2014 were some form of theft.
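Counts like these can be produced with a frequency table. The sketch below uses a small made-up table in place of the real 2014 data, and the `OFFENSE` column name is an assumption about how the crime file is laid out.

```python
import pandas as pd

# Toy stand-in for the crime table; values and column name are illustrative.
crimes = pd.DataFrame({
    "OFFENSE": [
        "THEFT/OTHER", "THEFT F/AUTO", "THEFT/OTHER", "ROBBERY",
        "THEFT/OTHER", "BURGLARY", "THEFT F/AUTO", "ASSAULT W/DANGEROUS WEAPON",
    ],
})

# Number of events per offense type, sorted from most to least frequent.
counts = crimes["OFFENSE"].value_counts()
most_common = counts.idxmax()

# Fraction of events whose offense label starts with "THEFT".
theft_share = crimes["OFFENSE"].str.startswith("THEFT").mean()

print(counts)
print(most_common, f"{theft_share:.1%}")
```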

When did crime most frequently occur?

This graph shows the breakdown of each offense type by shift: Day, Evening and Midnight. We can see that the leading type of crime, THEFT/OTHER, occurred most often during the Evening shift, while theft from motor vehicles occurred mainly during the Day shift. Another interesting observation is that HOMICIDE crimes occurred almost exclusively during the Midnight shift.
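The table behind a breakdown like this can be built as a cross-tabulation of offense type against shift. This is a minimal sketch on made-up rows. The `OFFENSE` and `SHIFT` column names are assumptions, and the toy values are chosen only to mirror the pattern described above.

```python
import pandas as pd

# Toy rows; not drawn from the real data set.
crimes = pd.DataFrame({
    "OFFENSE": ["THEFT/OTHER", "THEFT/OTHER", "THEFT F/AUTO",
                "THEFT F/AUTO", "HOMICIDE", "THEFT/OTHER"],
    "SHIFT": ["EVENING", "EVENING", "DAY", "DAY", "MIDNIGHT", "DAY"],
})

# Rows are offense types, columns are shifts, cells are event counts.
by_shift = pd.crosstab(crimes["OFFENSE"], crimes["SHIFT"])

# For each offense type, the shift with the highest count.
peak_shift = by_shift.idxmax(axis=1)

print(by_shift)
print(peak_shift)
```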

Which metro station experienced the most crime?

Finally, we can see that the Metro station that experienced the most crime in 2014 was Columbia Heights, with 2,593 crime incidents. At that station, THEFT/OTHER is the leading type of crime.
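A per-station ranking of this kind could be computed by grouping crimes on the station they are attributed to. The sketch below assumes each crime has already been assigned to a station in a hypothetical `nearest_station` column. The rows are invented for illustration and do not reproduce the real counts.

```python
import pandas as pd

# Toy rows; station assignments and offenses are made up.
crimes = pd.DataFrame({
    "nearest_station": ["Columbia Heights", "Columbia Heights", "Gallery Place",
                        "Columbia Heights", "Gallery Place", "Anacostia"],
    "OFFENSE": ["THEFT/OTHER", "ROBBERY", "THEFT/OTHER",
                "THEFT/OTHER", "THEFT F/AUTO", "BURGLARY"],
})

# Crime count per station, highest first.
per_station = crimes.groupby("nearest_station").size().sort_values(ascending=False)
top_station = per_station.index[0]

# Leading offense type among crimes attributed to the top station.
at_top = crimes.loc[crimes["nearest_station"] == top_station, "OFFENSE"]
top_offense = at_top.value_counts().idxmax()

print(per_station)
print(top_station, top_offense)
```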

Mapping individual variables against the dependent variable begins to reveal relationships. Through our investigation of the two graphs below, we were able to identify two potential explanatory variables - distance from Metro stations and percentage of households below the poverty line.

In [4]:
import numpy as np
import pandas as pd
from bokeh import plotting
from bokeh.layouts import column
from bokeh.models import HoverTool

df = pd.read_csv('CrimeEvents_CalculatedAttr.csv')

TOOLS = "save,hover"

poverty_percentage = 'Per Below Poverty Line'
distance = 'Distance from metro KM'

# Coerce non-numeric entries to NaN, drop infinite, zero and missing values,
# then scale by 100 to express the result as a percentage.
pov_frame = (pd.to_numeric(df[poverty_percentage], errors='coerce')
             .replace([np.inf, -np.inf, 0], np.nan).dropna() * 100)

# Histogram of crime events by poverty percentage (20 equal-width bins).
hist1, edges1 = np.histogram(pov_frame, bins=20)
source1 = plotting.ColumnDataSource(
    data=dict(count=hist1, left=edges1[:-1], right=edges1[1:]))

fig1 = plotting.figure(title="Poverty Percentage and Crime Count",
                       tools=TOOLS, background_fill_color="#E8DDCB")
fig1.quad(top="count", bottom=0, left="left", right="right",
          fill_color="#036564", line_color="#033649", source=source1)

# Show the bin count when hovering over a bar.
hover1 = fig1.select(dict(type=HoverTool))
hover1.tooltips = [("count", "@count")]

fig1.xaxis.axis_label = 'Percentage Poverty Status'
fig1.yaxis.axis_label = 'Crime Count'

# Histogram of crime events by distance from the nearest Metro station.
hist2, edges2 = np.histogram(df[distance].dropna(), bins=20)
source2 = plotting.ColumnDataSource(
    data=dict(count=hist2, left=edges2[:-1], right=edges2[1:]))

fig2 = plotting.figure(title="Distance From Metro and Crime Count",
                       tools=TOOLS, background_fill_color="#E8DDCB")
fig2.quad(top="count", bottom=0, left="left", right="right",
          fill_color="#036564", line_color="#033649", source=source2)

hover2 = fig2.select(dict(type=HoverTool))
hover2.tooltips = [("count", "@count")]

fig2.xaxis.axis_label = 'Distance From Metro (KM)'
fig2.yaxis.axis_label = 'Crime Count'

# Render inline in the notebook, with the two figures stacked vertically.
plotting.output_notebook()
plotting.show(column(fig1, fig2))