COGS 108 - Final Project: Vehicle Stops in San Diego

Group Members:

  • Louise Xu (A12650425) - introduction, loading datasets, time visualizations, conclusion
  • David Tea (A10914210) - data cleaning, region visualization, ranges, conclusion
  • Jeffrey Lu (A12177953) - miscellaneous parameter visualizations, OLS analysis, conclusion

Part 1: Introduction and Background

Goal of Project

The goal of our project is to determine whether there is a correlation between the frequency of vehicle stops in San Diego and the time and locations at which they take place. Additionally, we want to see if there are correlations between the location of a vehicle stop and the time at which it took place. If any correlations do exist, we would like to examine possible logistical, psychological, or social reasons for these correlations.

Background

The trend in the frequency of vehicle stops is of interest, especially given recent commotion over police abusing their power and violating the rights of civilians. Previous reports on vehicle stops found on the San Diego County's government website focus on apparent racial bias among the San Diego Police Department. Unlike these reports, we would like to take our analyses in a different direction and see if there are other external factors that influence the frequency of vehicle stops.

One such factor that we'd like to examine is the time of day at which vehicle stops happen. There have been studies within other professions that show how the time of day affects job performance. For instance, judges tend to rule more harshly right before lunchtime. We'd like to see if similar correlations can be found with the SDPD. Additionally, we'd like to see if different regions in San Diego have different frequencies of traffic stops.

References:

Research Question

Are there more vehicle stops at certain times and places than others? Does the frequency of vehicle stops vary significantly between different times and places?

Hypothesis

There will most likely be more traffic violations during periods of time with high traffic density (e.g. rush hour), during weekdays (when there are more people on the road who are driving to work), and during holiday seasons (since there would be more tourists and people visting family). On a broader scale, given that the population of San Diego is increasing, it reasonable to say that the number of traffic violations is also increasing at a similar rate.

We also predict that the frequency of stops at certain times of the day differs between different areas. For instance, we might expect to see more activity in the Sorrento Valley area during rush hour but not during night time, and we may see more activity in Pacific Beach, Downtown, and Hillcrest during the nighttime compared to other regions due to the number of bars and clubs in these areas.

Part 2: Data Description

Vehicle Stops Data

Link to the dataset: https://data.sandiego.gov/datasets/police-vehicle-stops/

We will be using six of the datasets from the Police Vehicle Stops data provided by San Diego County. The datasets that we will be using are as follows:

  • Dataset Name: Vehicle Stops (year-to-date)
    • Number of observations: 28362
    • This dataset contains all vehicle stops from the San Diego Police Department during 2017.
  • Dataset Name: Vehicle Stops (2016)
    • Number of observations: 103052
    • This dataset contains all vehicle stops from the San Diego Police Department during 2016.
  • Dataset Name: Vehicle Stops (2015)
    • Number of observations: 115423
    • This dataset contains all vehicle stops from the San Diego Police Department during 2015.
  • Dataset Name: Vehicle Stops (2014)
    • Number of observations: 144164
    • This dataset contains all vehicle stops from the San Diego Police Department during 2014.
  • Dataset Name: Vehicle Stops Race Codes
    • Number of observations: 18
    • This dataset contains the race codes used in the vehicle stops datasets and their corresponding race.
  • Dataset Name: Vehicle Stops Dictionary
    • Number of observations: 15
    • This dataset contains desciptions of each of the parameters in the vehicle stops datasets.

Police Beats Data

Link to the dataset: https://data.sandiego.gov/datasets/police-beats/

  • Dataset Name: Police Beats (csv)
    • Number of observations: 125
    • This dataset contains the police service area codes and their corresponding regions.

Zillow Region Data

Link to the dataset: https://www.zillowstatic.com/static/shp/ZillowNeighborhoods-CA.zip

  • Dataset Name: ZillowNeighborhoods-CA (shp)
    • Number of Observations: 2051
    • This dataset contains the shape files of neighborhoods in California.
  • Dataset Name: ZillowNeighborhoods-CA (dbf)
    • Number of Observations: 2051
    • This dataset contains the records of neighborhood in California. The records include the city, county, and neighborhood name.

Part 3: Data Cleaning / Preprocessing

Let's get started with loading and cleaning our data


In [1]:
# Import packages
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import patsy
import statsmodels.api as sm
from scipy.stats import ttest_ind

In [2]:
# Import each vehicle stops data file into its own DataFrame
vs_14 = pd.read_csv('vehicle_stops_2014_datasd.csv')
vs_15 = pd.read_csv('vehicle_stops_2015_datasd.csv')
vs_16 = pd.read_csv('vehicle_stops_2016_datasd.csv')
vs_17 = pd.read_csv('vehicle_stops_2017_datasd.csv')

# Join the data for each year together into one DataFrame
# Note: The file for vs_17 contained two more columns (both of them empty and unnamed) than the datasets for the other years 
# These two columns were deleted on Microsoft Excel in order to successfully merge the four datasets together
vs_all = pd.concat([vs_14, vs_15, vs_16, vs_17], ignore_index=True)

# Read in race details for the race codes
pd_race_codes = pd.read_csv('vehicle_stops_race_codes.csv')

In [3]:
# Let's check to see how many observations are in each dataset
# Note: vs_17 is year to date data from 2017, so there is data up to May of 2017 (the period in which this analysis is conducted)
print(vs_14.shape)
print(vs_15.shape)
print(vs_16.shape)
print(vs_17.shape)
print(vs_all.shape)


(144164, 15)
(115422, 15)
(103051, 15)
(28362, 15)
(390999, 15)

We can see here that the number of vehicle stops is actually decreasing each year between 2014 and 2016. Also, while we can not take the 2017 data into direct consideration because we only have data up to March, we can predict the total number of vehicle stops at the end of this year based on our current data. If we take the 28362 stops and then divide that by the three months that have already passed and then multiply it by the twelve months of the year, we end up with (28362)(12)/3 = 113448 stops. While this number is larger than the number of stops in 2016, it is still smaller than the number of stops in 2014 and 2015, which follows the trend of decreasing vehicle stops in San Diego.

This already disproves our hypothesis that vehicle stops are increasing in San Diego. We will try to find a reason for this later on in this notebook.


In [4]:
# Let's start cleaning up the data

# We won't be using these columns
vs_all.drop(['arrested', 'searched', 'obtained_consent', 'contraband_found', 'property_seized', 'stop_time', 'stop_date', 'stop_id', 'sd_resident'], 
           axis=1, inplace=True, errors='ignore')

# Get rid of rows with invalid data
vs_all.dropna(inplace=True)
vs_all = vs_all.loc[(vs_all['service_area'] != 'Unknown')]

# 630, 130 service areas don't exist, 840 is rolando park but it doesn't show up in Zillow
vs_all = vs_all.loc[(vs_all['service_area'] != '630') & (vs_all['service_area'] != '130') & (vs_all['service_area'] != '840')]

# Cleaning the ages of the subjects, these characters occur in the dataset and we don't want these
try:
    vs_all = vs_all.loc[~(vs_all['subject_age'].str.contains("_|N|`|Q|F|Y|H|J|X", case=False))]
    vs_all[['subject_age']] = vs_all[['subject_age']].apply(pd.to_numeric)
except:
    pass

# Some stops have an X as the gender
vs_all = vs_all.loc[(vs_all['subject_sex'].str.contains("M|F", case=False))]

# Use Safe Harbor practices on the subject_ages and remove subjects who aren't of legal driving age
vs_all = vs_all.loc[(vs_all['subject_age'] <= 90) & (vs_all['subject_age'] >= 16)]

# Format all dates in datetime format
vs_all['timestamp'] = pd.to_datetime(vs_all.timestamp)

# Find the hour each stop took place
vs_all['hour'] = vs_all['timestamp'].dt.hour

# Find the day of the week each stop took place
vs_all['weekday'] = vs_all['timestamp'].dt.dayofweek

# Find the month each stop took place
vs_all['month'] = vs_all['timestamp'].dt.month

# Clean subject_race and replace letters with race description. Replace X with O, since unknown and other are similar
race_codes = pd_race_codes['Race Code'].tolist()
race_des = pd_race_codes['Description'].tolist()
vs_all['subject_race'].replace('X', 'O', inplace=True)
vs_all['subject_race'].replace(race_codes, race_des, inplace=True)

# Clean Stop Causes
vs_all = vs_all.loc[~(vs_all['stop_cause'].str.contains("NOT|none|Other|no cause", case=False))]
vs_all['stop_cause'].replace('&Moving Violation', 'Moving Violation', inplace=True)
vs_all['stop_cause'].replace(['UNI, &County, H&&S Code', 'MUNI, County, H&S Code'] , 'Muni, County, H&S Code', inplace=True)
vs_all['stop_cause'].replace('&Radio Call/Citizen Contact'  , 'Radio Call/Citizen Contact', inplace=True)
vs_all['stop_cause'].replace('&Equipment Violation'  , 'Equipment Violation', inplace=True)
vs_all['stop_cause'].replace('Suspect Info'  , 'Suspect Info (I.S., Bulletin, Log)', inplace=True)
vs_all['stop_cause'].replace('Personal Observ/Knowledge'  , 'Personal Knowledge/Informant', inplace=True)

# Look at shape of data so far
vs_all


Out[4]:
stop_cause service_area subject_race subject_sex subject_age timestamp hour weekday month
0 Moving Violation 110 WHITE M 24.0 2014-01-01 01:25:00 1 2 1
1 Moving Violation 320 WHITE M 42.0 2014-01-01 05:47:00 5 2 1
2 Moving Violation 320 LAOTIAN M 29.0 2014-01-01 07:46:00 7 2 1
3 Moving Violation 610 WHITE M 23.0 2014-01-01 08:10:00 8 2 1
4 Equipment Violation 930 HISPANIC M 35.0 2014-01-01 08:35:00 8 2 1
5 Equipment Violation 820 HISPANIC M 30.0 2014-01-01 08:39:00 8 2 1
6 Moving Violation 710 HISPANIC F 19.0 2014-01-01 09:13:00 9 2 1
7 Moving Violation 120 WHITE M 32.0 2014-01-01 09:50:00 9 2 1
8 Moving Violation 120 WHITE M 36.0 2014-01-01 10:00:00 10 2 1
9 Moving Violation 120 HISPANIC M 27.0 2014-01-01 10:40:00 10 2 1
10 Moving Violation 120 WHITE M 16.0 2014-01-01 12:05:00 12 2 1
11 Muni, County, H&S Code 710 HISPANIC M 32.0 2014-01-01 13:13:00 13 2 1
12 Equipment Violation 230 VIETNAMESE M 23.0 2014-01-01 13:16:00 13 2 1
13 Moving Violation 110 WHITE M 50.0 2014-01-01 13:40:00 13 2 1
14 Equipment Violation 240 WHITE F 47.0 2014-01-01 14:05:00 14 2 1
15 Moving Violation 720 BLACK M 25.0 2014-01-01 14:38:00 14 2 1
16 Moving Violation 230 WHITE F 53.0 2014-01-01 14:50:00 14 2 1
17 Equipment Violation 320 WHITE M 29.0 2014-01-01 14:50:00 14 2 1
18 Moving Violation 610 WHITE M 31.0 2014-01-01 15:03:00 15 2 1
19 Equipment Violation 820 BLACK F 23.0 2014-01-01 15:20:00 15 2 1
20 Equipment Violation 820 BLACK F 23.0 2014-01-01 15:20:00 15 2 1
21 Equipment Violation 430 HISPANIC F 21.0 2014-01-01 15:23:00 15 2 1
22 Moving Violation 230 WHITE M 25.0 2014-01-01 15:32:00 15 2 1
23 Equipment Violation 430 BLACK F 75.0 2014-01-01 15:50:00 15 2 1
24 Moving Violation 610 WHITE M 41.0 2014-01-01 16:25:00 16 2 1
25 Equipment Violation 230 WHITE F 31.0 2014-01-01 16:31:00 16 2 1
26 Moving Violation 610 WHITE F 42.0 2014-01-01 16:41:00 16 2 1
27 Moving Violation 240 WHITE F 37.0 2014-01-01 16:50:00 16 2 1
28 Moving Violation 720 HISPANIC M 25.0 2014-01-01 17:00:00 17 2 1
29 Moving Violation 320 HISPANIC M 25.0 2014-01-01 17:05:00 17 2 1
... ... ... ... ... ... ... ... ... ...
390967 Moving Violation 710 HISPANIC M 21.0 2017-03-31 22:20:00 22 4 3
390968 Moving Violation 120 WHITE M 54.0 2017-03-31 22:20:00 22 4 3
390969 Moving Violation 310 OTHER M 33.0 2017-03-31 22:25:00 22 4 3
390970 Equipment Violation 310 WHITE M 24.0 2017-03-31 22:27:00 22 4 3
390971 Equipment Violation 440 HISPANIC M 30.0 2017-03-31 22:29:00 22 4 3
390972 Moving Violation 440 BLACK M 47.0 2017-03-31 22:29:00 22 4 3
390973 Moving Violation 120 WHITE M 21.0 2017-03-31 22:30:00 22 4 3
390974 Moving Violation 520 OTHER ASIAN M 30.0 2017-03-31 22:30:00 22 4 3
390975 Moving Violation 710 HISPANIC M 55.0 2017-03-31 22:40:00 22 4 3
390976 Moving Violation 310 BLACK F 33.0 2017-03-31 22:42:00 22 4 3
390977 Moving Violation 820 OTHER M 27.0 2017-03-31 22:45:00 22 4 3
390978 Moving Violation 120 WHITE M 25.0 2017-03-31 22:45:00 22 4 3
390979 Moving Violation 110 WHITE M 49.0 2017-03-31 22:46:00 22 4 3
390980 Moving Violation 320 HISPANIC F 20.0 2017-03-31 22:47:00 22 4 3
390981 Moving Violation 710 WHITE M 24.0 2017-03-31 22:47:00 22 4 3
390982 Equipment Violation 810 BLACK M 36.0 2017-03-31 23:00:00 23 4 3
390983 Moving Violation 320 BLACK F 18.0 2017-03-31 23:03:00 23 4 3
390984 Moving Violation 320 WHITE F 30.0 2017-03-31 23:05:00 23 4 3
390985 Moving Violation 710 HISPANIC M 21.0 2017-03-31 23:05:00 23 4 3
390986 Moving Violation 320 WHITE M 18.0 2017-03-31 23:18:00 23 4 3
390987 Moving Violation 710 HISPANIC F 20.0 2017-03-31 23:23:00 23 4 3
390988 Moving Violation 110 OTHER ASIAN M 26.0 2017-03-31 23:25:00 23 4 3
390989 Moving Violation 710 HISPANIC F 60.0 2017-03-31 23:30:00 23 4 3
390990 Moving Violation 320 WHITE M 27.0 2017-03-31 23:35:00 23 4 3
390992 Moving Violation 510 BLACK M 36.0 2017-03-31 23:40:00 23 4 3
390993 Equipment Violation 810 HISPANIC F 26.0 2017-03-31 23:42:00 23 4 3
390994 Moving Violation 830 BLACK M 20.0 2017-03-31 23:45:00 23 4 3
390995 Moving Violation 710 HISPANIC M 65.0 2017-03-31 23:45:00 23 4 3
390996 Radio Call/Citizen Contact 620 HISPANIC M 23.0 2017-03-31 23:49:00 23 4 3
390998 Equipment Violation 310 WHITE F 26.0 2017-03-31 23:58:00 23 4 3

365723 rows × 9 columns


In [5]:
# Make additional datasets

# Create separate cleaned datasets for each year individually
vs_14 = vs_all[(vs_all['timestamp'].dt.year == 2014)]
vs_15 = vs_all[(vs_all['timestamp'].dt.year == 2015)]
vs_16 = vs_all[(vs_all['timestamp'].dt.year == 2016)]
vs_17 = vs_all[(vs_all['timestamp'].dt.year == 2017)]

# Create cummulative dataset without data from 2017, since the 2017 data is incomplete and will skew categorical distributions
vs_complete = vs_all[(vs_all['timestamp'].dt.year != 2017)]

Part 4: Data Vizualization

This will be divided into three parts:

  • histogram of time (categorically: by time of day, by day of the week, by month; temporally: by day, by month, by year)
  • histograms for fun:
    • histogram of age
    • histogram of gender
    • histogram of race
    • histogram of type of vehicle violation
  • heatmaps of region (categorically: by police beats)

Histograms Based on Time


In [6]:
# Labels for plots
days_of_week = ['Sun', 'Mon', 'Tues', 'Wed', 'Thu', 'Fri', 'Sat']
months_of_year = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Histograms of Vehicle Stops in terms of categorical time (on vs_all and vs_complete)

fig1 = plt.figure(figsize=(10, 9), dpi=100)

# Cummulative (all)
plt.subplot(3, 2, 1)
plt.xlabel('Time of Day (24-hour clock)')
plt.ylabel('# of Vehicle Stops')
plt.title("Vehicle Stops by time of day (Cummulative)")
plt.hist(vs_all['timestamp'].dt.hour, range(25), width = .5)

plt.subplot(3, 2, 3)
plt.xlabel('Day of the Week')
plt.ylabel('# of Vehicle Stops')
plt.title("Vehicle Stops by day of week (Cummulative)")
plt.hist(vs_all['weekday'], bins='auto')
plt.xticks(range(8),days_of_week)

plt.subplot(3, 2, 5)
plt.xlabel('Month')
plt.ylabel('# of Vehicle Stops')
plt.title("Vehicle Stops by month (Cummulative)")
plt.hist(vs_all['month'], bins='auto')
plt.xticks(range(1,14),months_of_year)

# Cummunlative (excluding 2017)
plt.subplot(3, 2, 2)
plt.xlabel('Time of Day (24-hour clock)')
plt.title("Vehicle Stops by time of day (Cummulative, excludes 2017)")
plt.hist(vs_complete['timestamp'].dt.hour, range(25), width = .5)

plt.subplot(3, 2, 4)
plt.xlabel('Day of the Week')
plt.title("Vehicle Stops by day of week (Cummulative, excludes 2017)")
plt.hist(vs_complete['weekday'], bins='auto')
plt.xticks(range(8), days_of_week)

plt.subplot(3, 2, 6)
plt.xlabel('Month')
plt.title("Vehicle Stops by month (Cummulative, excludes 2017)")
plt.hist(vs_complete['month'], bins='auto')
plt.xticks(range(1,14), months_of_year)

plt.tight_layout()
plt.show()


We can see that the plots for time of day and day of week are nearly identical between each of the two cummulative plots. As expected, the plot by month for the cummulative dataset excluding 2017 also has less of a gap between the number of stops during the first three months and the number of stops during the rest of the year. However, we can see that there is still a noticeable downward trend.


In [7]:
# Histograms of Vehicle Stops in terms of categorical time (for each year)

# Time of Day (by 2014, 2015, 2016, 2017)
fig2 = plt.figure(figsize=(10, 6), dpi=100)

plt.subplot(2, 2, 1)
plt.xlabel('Time of Day (24-hour clock)')
plt.title("Vehicle Stops by time of day (2014 Only)")
plt.hist(vs_14['timestamp'].dt.hour, range(25), width = .5)

plt.subplot(2, 2, 2)
plt.xlabel('Time of Day (24-hour clock)')
plt.title("Vehicle Stops by time of day (2015 Only)")
plt.hist(vs_15['timestamp'].dt.hour, range(25), width = .5)

plt.subplot(2, 2, 3)
plt.xlabel('Time of Day (24-hour clock)')
plt.title("Vehicle Stops by time of day (2016 Only)")
plt.hist(vs_16['timestamp'].dt.hour, range(25), width = .5)

plt.subplot(2, 2, 4)
plt.xlabel('Time of Day (24-hour clock)')
plt.title("Vehicle Stops by time of day (2017 Only)")
plt.hist(vs_17['timestamp'].dt.hour, range(25), width = .5)

plt.tight_layout()
plt.show()

# Day of Week (by 2014, 2015, 2016, 2017)
fig3 = plt.figure(figsize=(10, 6), dpi=100)

plt.subplot(2, 2, 1)
plt.xlabel('Day of the Week')
plt.ylabel('# of Vehicle Stops')
plt.title("Vehicle Stops by day of week (2014 Only)")
plt.hist(vs_14['weekday'], bins='auto', width = .25)
plt.xticks(range(8),days_of_week)

plt.subplot(2, 2, 2)
plt.xlabel('Day of the Week')
plt.ylabel('# of Vehicle Stops')
plt.title("Vehicle Stops by day of week (2015 Only)")
plt.hist(vs_15['weekday'], bins='auto', width = .25)
plt.xticks(range(8),days_of_week)

plt.subplot(2, 2, 3)
plt.xlabel('Day of the Week')
plt.ylabel('# of Vehicle Stops')
plt.title("Vehicle Stops by day of week (2016 Only)")
plt.hist(vs_16['weekday'], bins='auto', width = .25)
plt.xticks(range(8),days_of_week)

plt.subplot(2, 2, 4)
plt.xlabel('Day of the Week')
plt.ylabel('# of Vehicle Stops')
plt.title("Vehicle Stops by day of week (2017 Only)")
plt.hist(vs_17['weekday'], bins='auto', width = .25)
plt.xticks(range(8),days_of_week)

plt.tight_layout()
plt.show()

# Month (by 2014, 2015, 2016, 2017)
# These are a bit redundant, since we will see the plots combined together in the temporal monthly plot
fig4 = plt.figure(figsize=(10, 6), dpi=100)

plt.subplot(2, 2, 1)
plt.xlabel('Month')
plt.title("Vehicle Stops by month (2014 Only)")
plt.hist(vs_14['month'], bins='auto', width = .25)
plt.xticks(range(1,14),months_of_year)

plt.subplot(2, 2, 2)
plt.xlabel('Month')
plt.title("Vehicle Stops by month (2015 Only)")
plt.hist(vs_15['month'], bins='auto', width = .25)
plt.xticks(range(1,14),months_of_year)

plt.subplot(2, 2, 3)
plt.xlabel('Month')
plt.title("Vehicle Stops by month (2016 Only)")
plt.hist(vs_16['month'], bins='auto', width = .25)
plt.xticks(range(1,13),months_of_year)

plt.subplot(2, 2, 4)
plt.xlabel('Month')
plt.title("Vehicle Stops by month (2017 Only)")
plt.hist(vs_17['month'], bins='auto', width = .1)
temp = ['', '']
temp[1:1] = months_of_year[0:3]
plt.xticks(range(0,5), temp)

plt.tight_layout()
plt.show()


From these plots, we can see that each year has the same general trends for hourly, weekly, and monthly vehicle stops. For each year, the number of vehicle stops peaks around rush hour in the morning (8-10AM). There are also slightly lower peaks around rush hour in the evening, but these vary slightly between each year. For 2014, the peak appears prominently at 3PM, but for subsequent years, the peak is distributed across 1-5PM. In all of the plots, we can see that there is a very noticeable dip around 12PM, which would correspond to typical lunch hours. We can also see there is more activity around 10PM and 12PM, with another dip in activity in between these two peak hours at 11PM. Expectedly, during the nighttime hours of 1-6PM the frequency of vehicle stops decreases dramatically.

We can also see that the trend for stops across the days of the week is nearly identical between each year. Each plot starts out with a peak number of stops on Monday. We then see a downward trend as the week goes by, until we hit Sunday, when the number of stops increases to a level comparable to those of Thursday and Friday.

As for the months, we can see that the number of stops generally peaks towards the beginning of the year (noticeably February or April) and then decreases as the year goes by. For the 2015 and 2016 plots, we can also see that there is a secondary peak around October and November


In [8]:
# Histograms of Vehicle Stops in terms of temporal time

year = vs_all["timestamp"].dt.year
month = vs_all["timestamp"].dt.month
day = vs_all["timestamp"].dt.day

fig5 = plt.figure(figsize=(8, 12), dpi=100)

plt.subplot(311)
plt.title("Vehicle Stops by Day")
plt.ylabel('# of Vehicle Stops')
vs_all["timestamp"].groupby([year, month, day]).count().plot(kind="bar", edgecolor = "none")
days_list = vs_all["timestamp"].groupby([year, month, day]).count()
plt.xlabel('Date (Year, Month, Day)')
plt.xticks(range(0,len(days_list),31), [x for x in days_list.keys() if x[2] == 1], rotation=90)

plt.subplot(312)
plt.ylabel('# of Vehicle Stops')
plt.title("Vehicle Stops by Month")
vs_all["timestamp"].groupby([year, month]).count().plot(kind="bar")
plt.xlabel('Month (Year, Month)')

plt.subplot(313)
plt.ylabel('# of Vehicle Stops')
plt.title("Vehicle Stops by Year")
vs_all["timestamp"].groupby([year]).count().plot(kind="bar")
plt.xlabel('Year')

plt.tight_layout()
plt.show()


Though it is hard to see from the "Vehicle Stops by Day" graph, there are a few days that have an unusually high number of traffic stops. We'll have to replot it so that we can see these days more clearly. Additionally, in the next section we will tabulate the days with the most number of traffic stops using value_counts().


In [9]:
# The graph of the vehicle stops by day is a bit difficult to see when we plot it out all at once
# Let's make an interactive plot that lets us scroll through time to view a range of dates

import ipywidgets as widgets
from ipywidgets import interact

day_data = vs_all["timestamp"].groupby([year, month, day]).count()

def pltsin(f):
    fig6 = plt.figure(figsize=(12, 4), dpi=100)
    plt.title("Vehicle Stops by Day")
    plt.gca().set_ylim([0, 1000])
    # Total number of days = 1186, so to make the graph nice and even, we set the range of points shown to be 36
    day_data[0+f:35+f].plot(kind="bar")
    plt.xlabel('Date (Year, Month, Day)')
    plt.show()
    
interact(pltsin, f=widgets.IntSlider(min=0,max=1150,step=1,continuous_update=False,readout=False))


Out[9]:
<function __main__.pltsin>

Histograms for Other Parameters

Now let's go ahead and visualize some of the other parameters included in our data. We may be able to see some interesting patterns.


In [10]:
#Histograms of age from 2014 to 2017
fig_age = plt.figure(figsize=(10,8), dpi=100)

#Cummulative(2014-2017)
plt.subplot(321)
plt.hist(vs_all['subject_age'], bins='auto')
plt.axis('auto')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Distribution (Cummulative)')

#Cummulative excluding 2017 (2014-2016)
plt.subplot(322)
plt.hist(vs_complete['subject_age'], bins='auto')
plt.axis('auto')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Distribution (2014-2016)')

#For individual years (2014-2017)
plt.subplot(323)
plt.hist(vs_14['subject_age'], bins='auto')
plt.axis('auto')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Distribution (2014)')

plt.subplot(324)
plt.hist(vs_15['subject_age'], bins='auto')
plt.axis('auto')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Distribution (2015)')

plt.subplot(325)
plt.hist(vs_16['subject_age'],bins='auto')
plt.axis('auto')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Distribution (2016)')

plt.subplot(326)
plt.hist(vs_17['subject_age'],bins='auto')
plt.axis('auto')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Distribution (2017)')

plt.tight_layout()


There is a very clear trend in the age distribution across all of the years. The number of violations starts out at low levels, then peaks between ages 20 and 30, and then gradually declines until we see very few violations in the 80 to 90-year-old range.

Somewhat troubling is the frequency of stops for ages that are a multiple of 5, when compared to ages in the same range. A plot of age would typically be fairly smooth, but the spikes at these specific age intervals suggests that there may be some bias when logging the age data for each vehicle stop.


In [11]:
#Histograms of gender from 2014 to 2017
fig_gender = plt.figure(figsize=(10,8), dpi=100)

#Cummulative
plt.subplot(321)
all_gen = vs_all['subject_sex'].value_counts()
all_gen.plot('bar')
plt.xlabel('Gender')
plt.ylabel('Frequency')
plt.title('Gender Distribution(Cummulative)')

#Cummulative excluding 2017 (2014-2016)
plt.subplot(322)
excl_gen = vs_complete['subject_sex'].value_counts()
excl_gen.plot('bar')
plt.xlabel('Gender')
plt.ylabel('Frequency')
plt.title('Gender Distribution(2014 to 2016)')

#For individual years (2014 to 2017)
plt.subplot(323)
four_gen = vs_14['subject_sex'].value_counts()
four_gen.plot('bar')
plt.xlabel('Gender')
plt.ylabel('Frequency')
plt.title('Gender Distribution(2014)')

plt.subplot(324)
five_gen = vs_15['subject_sex'].value_counts()
five_gen.plot('bar')
plt.xlabel('Gender')
plt.ylabel('Frequency')
plt.title('Gender Distribution(2015)')

plt.subplot(325)
six_gen = vs_16['subject_sex'].value_counts()
six_gen.plot('bar')
plt.xlabel('Gender')
plt.ylabel('Frequency')
plt.title('Gender Distribution(2016)')

plt.subplot(326)
seven_gen = vs_17['subject_sex'].value_counts()
seven_gen.plot('bar')
plt.xlabel('Gender')
plt.ylabel('Frequency')
plt.title('Gender Distribution(2017)')

plt.tight_layout()


From these plots, we can see there are clearly far more vehicle stops for males than for females.


In [12]:
#Histograms of race from 2014 to 2017

fig_race = plt.figure(figsize=(10,12), dpi=100)

#Cummulative
plt.subplot(321)
all_race = vs_all['subject_race'].value_counts()
all_race.plot('bar')
plt.xlabel('Race')
plt.ylabel('Frequency')
plt.title('Race Distribution(Cummulative)')

#Cummulative excluding 2017 (2014-2016)
plt.subplot(322)
excl_race = vs_complete['subject_race'].value_counts()
excl_race.plot('bar')
plt.xlabel('Race')
plt.ylabel('Frequency')
plt.title('Race Distribution(2014 to 2016)')

#For individual years (2014 to 2017)
plt.subplot(323)
four_race = vs_14['subject_race'].value_counts()
four_race.plot('bar')
plt.xlabel('Race')
plt.ylabel('Frequency')
plt.title('Race Distribution(2014)')

plt.subplot(324)
five_race = vs_15['subject_race'].value_counts()
five_race.plot('bar')
plt.xlabel('Race')
plt.ylabel('Frequency')
plt.title('Race Distribution(2015)')

plt.subplot(325)
six_race = vs_16['subject_race'].value_counts()
six_race.plot('bar')
plt.xlabel('Race')
plt.ylabel('Frequency')
plt.title('Race Distribution(2016)')

plt.subplot(326)
seven_race = vs_17['subject_race'].value_counts()
seven_race.plot('bar')
plt.xlabel('Race')
plt.ylabel('Frequency')
plt.title('Race Distribution(2017)')

plt.tight_layout()


The distribution for the top six most frequently stopped races are fairly consistent. White drivers are consistently stopped the most often, followed by Hispanic, Black, Other, Other Asian, and Filipino drivers. For the other races, the distribution changes slightly between year to year. For instance, Indian drivers had more stops than Chinese drivers in 2014, but in 2015 Indian drivers had less stops than Chinese drivers.

When we take a look at this distribution compared to the racial demographic data in San Diego county (Source: https://datausa.io/profile/geo/san-diego-county-ca/#ethnicity), we can actually see that there does in fact seem to be some racial bias against Black drivers. They consist of nearly half the population of Asian drivers, but in our vehicle stop histogram, they are stopped nearly twice as much. One factor to take into consideration is that the DataUSA graph shows Asians in one collective catagory while the San Diego county data separates them into more specific catagories. However, even if combine the number of stops for all the Asian races, there would still be less stops for Asians combined than for Black drivers.


In [13]:
vs_m = vs_all[vs_all['subject_sex'] == 'M']
vs_m = vs_m['subject_race'].value_counts()
vs_f = vs_all[vs_all['subject_sex'] == 'F']
vs_f = vs_f['subject_race'].value_counts()
vs_race = pd.DataFrame(vs_m)
vs_race['f'] = vs_f
vs_race.reset_index(level=0, inplace=True)
vs_race.columns = ['race', 'm', 'f']

f, ax1 = plt.subplots(1, figsize=(15,5))
bar_width = 0.75
bar_l = [i+1 for i in range(len(vs_m))]
tick_pos = [i+(bar_width/2) for i in bar_l]

ax1.bar(bar_l, vs_race['m'], width=bar_width, label='Male', alpha=0.5, color='blue')
ax1.bar(bar_l, vs_race['f'], width=bar_width, bottom=vs_race['m'], label='Female',alpha=0.5, color='red')

plt.xticks(tick_pos, vs_race['race'])
ax1.set_ylabel("# of Vehicle Stops")
ax1.set_xlabel("Race")
plt.legend(loc='upper right')
plt.xticks(rotation=45)
plt.xlim([min(tick_pos)-bar_width, max(tick_pos)+bar_width])


Out[13]:
(0.625, 19.125)

In [14]:
#Histograms of stop causes from 2014 to 2017

fig_cause = plt.figure(figsize=(10,16), dpi=100)

#Cummulative
plt.subplot(321)
all_cause = vs_all['stop_cause'].value_counts()
all_cause.plot('bar')
plt.xlabel('Stop Cause')
plt.ylabel('Frequency')
plt.title('Stop Cause Distribution(Cummulative)')

#Cummulative excluding 2017 (2014-2016)
plt.subplot(322)
excl_cause = vs_complete['stop_cause'].value_counts()
excl_cause.plot('bar')
plt.xlabel('Stop Cause')
plt.ylabel('Frequency')
plt.title('Stop Cause Distribution(2014 to 2016)')

#For individual years (2014 to 2017)
plt.subplot(323)
four_cause = vs_14['stop_cause'].value_counts()
four_cause.plot('bar')
plt.xlabel('Stop Cause')
plt.ylabel('Frequency')
plt.title('Stop Cause Distribution(2014)')

plt.subplot(324)
five_cause = vs_15['stop_cause'].value_counts()
five_cause.plot('bar')
plt.xlabel('Stop Cause')
plt.ylabel('Frequency')
plt.title('Stop Cause Distribution(2015)')

plt.subplot(325)
six_cause = vs_16['stop_cause'].value_counts()
six_cause.plot('bar')
plt.xlabel('Stop Cause')
plt.ylabel('Frequency')
plt.title('Stop Cause Distribution(2016)')

plt.subplot(326)
seven_cause = vs_17['stop_cause'].value_counts()
seven_cause.plot('bar')
plt.xlabel('Stop Cause')
plt.ylabel('Frequency')
plt.title('Stop Cause Distribution(2017)')

plt.tight_layout()


Across the board, there are far more stops resulting from Moving Violations and Equipment Violations than all other causes. Additionally, there are nearly three times as many Moving Violations than there are Equipment Violations.


In [15]:
vs_m = vs_all[vs_all['subject_sex'] == 'M']
vs_m = vs_m['stop_cause'].value_counts()
vs_f = vs_all[vs_all['subject_sex'] == 'F']
vs_f = vs_f['stop_cause'].value_counts()
vs_sc = pd.DataFrame(vs_m)
vs_sc['f'] = vs_f
vs_sc.reset_index(level=0, inplace=True)
vs_sc.columns = ['stop_cause', 'm', 'f']

f2, ax2 = plt.subplots(1, figsize=(15,5))
bar_width = 0.75
bar_2 = [i+1 for i in range(len(vs_m))]
tick_pos = [i+(bar_width/2) for i in bar_2]

ax2.bar(bar_2, vs_sc['m'], width=bar_width, label='Male', alpha=0.5, color='blue')
ax2.bar(bar_2, vs_sc['f'], width=bar_width, bottom=vs_sc['m'], label='Female', alpha=0.5, color='red')

plt.xticks(tick_pos, vs_sc['stop_cause'])
ax2.set_ylabel("# of Vehicle Stops")
ax2.set_xlabel("Stop Cause")
plt.legend(loc='upper right')
plt.xticks(rotation=45)
plt.xlim([min(tick_pos)-bar_width, max(tick_pos)+bar_width])


Out[15]:
(0.625, 7.125)
First Install these packages before running the Interactive Heatmap:
- pip install pyshp
- pip install bokeh

Heatmap of Traffic Stops in San Diego

To create the heatmap, we need the shapefiles of each police beat in San Diego and the data from our police traffic stop datasets.

  • First, we take the police beat data from 'pd_beat_neighborhoods_datasd.csv' to get all the beats and their service areas.
  • Then, we take the neighborhood data from the Zillow dataset and match the neighborhoods from Zillow with the police beats neighborhoods.
  • We try to match as many neighborhoods as we can so we can get an accurate representation of San Diego.
  • Next, we take the shapefiles of the neighborhoods matched from Zillow and create a mapof San Diego.
  • Now, we take our traffic stops data for each service area, then match the neighborhoods with their service area, and create the heatmap.
  • Finally, we add interactive controls to the heatmap so we can select certain data conditions and view the effects they have on the heatmap.

In [16]:
# Sort police beats into service areas.
police_beats = pd.read_csv('pd_beat_neighborhoods_datasd.csv')
# Remove 'Oak Park' because data for service area 450 doesn't exist
police_beats = police_beats.loc[(police_beats['Neighborhood'] != 'Oak Park')]
police_beats['Beat'] = police_beats['Beat'].astype(str)
beats_from_data = police_beats['Neighborhood'].tolist()

# working with shapefiles
# neighborhood shapefile data from Zillow:
# https://www.zillow.com/howto/api/neighborhood-boundaries.htm
#
# shapefile package: pyshp
# https://pypi.python.org/pypi/pyshp
#   Approach taken from Lecture Materials: https://github.com/COGS108/LectureMaterials/blob/master/Geospatial/GeospatialAnalytics.ipynb

import shapefile

# read the in the shapefile and list the methods associated with the object
sf = shapefile.Reader("ZillowNeighborhoods-CA.shp")

# read in the dbf (metadata) file and list all the methods associated with it
sfdbf = shapefile.Reader("ZillowNeighborhoods-CA.dbf")

# find indices of all San Diego neighborhoods
metadata = sfdbf.shapeRecords()

sd_index = []
beats_from_zillow = []
final_beats_zillow = []

# Get the names of all the neighborhoods in San Diego County from Zillow dataset
for i in range(len(metadata)):
    if metadata[i].record[1] == 'San Diego':
        beats_from_zillow.append(metadata[i].record[3])
        if metadata[i].record[3] in beats_from_data:
            final_beats_zillow.append(metadata[i].record[3])

# need to keep track of name in data to properly create service areas
# final_beats_data contains all neighborhoods from the Zillow dataset that match with the police beats dataset
final_beats_data = final_beats_zillow.copy()

# Get the remaining neighborhoods with names that don't match up because of spelling differences
for data in (set(beats_from_data) - set(beats_from_zillow)):
    for zillow in (set(beats_from_zillow) - set(beats_from_data)):
        # Some beats have the same starting substrings but are not the same city so we need to check the ending substring as well.
        if data[0:4] == zillow[0:4] and data[-3:] == zillow[-3:]:
            # Keep track of all neighborhoods from Zillow dataset and Police Beats dataset
            final_beats_zillow.append(zillow)
            final_beats_data.append(data)

# Manual addition, can't find a good way to programmically add these without adding unnecessary neighborhoods 
final_beats_zillow.append('Cortez Hill')
final_beats_zillow.append('Gaslamp Quarter')
final_beats_zillow.append('Jomacha-Lomita')
final_beats_zillow.append('Rancho Encantado')
# Add in names that correspond to the data
final_beats_data.append('Cortez')
final_beats_data.append('Gaslamp')
final_beats_data.append('Jamacha/Lomita')
final_beats_data.append('Rancho Encantada')

#  'Broadway Heights', "O'Farrell", 'Petco Park', 'Qualcomm', and 'Rolando Park' are not in zillow so we will not
#  be using these neighborhoods.

# Need indices in order they were added so final beats to line up correctly with their indices in the Zillow metadata
for beat in final_beats_zillow:
    for i in range(len(metadata)):
        if metadata[i].record[1] == 'San Diego' and metadata[i].record[3] == beat:
            sd_index.append(i)
            continue
            
# Get shapes of each beat
shapes = sf.shapes()
sd_shapes = []
for i in range(len(sd_index)):
    sd_shapes.append(shapes[sd_index[i]].points)

# Get the latitude and longitude of each neighborhood, in order of indices 
beats = []
beats_lons = []
beats_lats = []
for i in range(len(sd_shapes)):
    beats.append(sd_shapes[i])
    beats_lons.append([xy[0] for xy in beats[i]])
    beats_lats.append([xy[1] for xy in beats[i]])
    
# Group beats(value) by service area(key) indices(value) by service area(key)
service_areas = vs_complete['service_area'].value_counts().index.tolist()
beats_by_service_area = {}
# indices(value) by service area(key)
index_by_service_area = {}

# Sort beats by service area
for area in sorted(service_areas):
    temp = police_beats.loc[(police_beats['Beat'].str[0