Due in part to recent high-profile shootings of civilians by police in the US, the media and public have been scrutinizing police killings heavily. If you browse social media and news sites, you may get the sense that there's been a large uptick in civilian shootings by police in the US.
As a data scientist, you may want to investigate police killings more and get to the facts. Luckily, there's a dataset that will help you do this. The team at FiveThirtyEight assembled a dataset using crowdsourced data and census data. It contains information on each police killing in the US, and can be found here.
Each of the 467 rows in the dataset contains information on a police killing of a civilian in the US from January 2015 through June 2015. There are many interesting columns in the dataset, but here are some of the more relevant ones:
In [1]:
# Import the packages we need
import pandas as pd
In [2]:
# Read police_killings.csv into a Pandas DataFrame
police_killings = pd.read_csv('../data/police_killings.csv', encoding='ISO-8859-1')
# Print out and look at the columns in the data
print(police_killings.columns)
In [3]:
# Print out the first few rows of the data
police_killings.head()
Out[3]:
In [4]:
# See how many times each race occurs
police_killings['raceethnicity'].value_counts()
Out[4]:
In [5]:
# import matplotlib for graphing
import matplotlib.pyplot as plt
%matplotlib inline
In [6]:
# Make a bar graph of the results
police_killings['raceethnicity'].value_counts().plot.bar()
Out[6]:
So there are more police killings of white people than of any other race. At first this seems a little surprising.
But if we were to look at the percentages, how would they compare to the percentage of each race in the US population as a whole?
In [7]:
# Create a Pandas Series of US racial percentages and plot a bar graph
us_race_population = pd.Series({'White': 0.63, 'Hispanic': 0.17,
'Black': 0.123, 'Asian': 0.05,
'Multiracial/Other': 0.024})
us_race_population.plot.bar()
Out[7]:
So when comparing the percentages of each race among those killed by police to the overall US racial percentages, it looks like black people were disproportionately killed by police.
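To make that comparison concrete, we can divide each group's share of killings by its share of the US population; a ratio above 1 means the group is over-represented among those killed. The counts below are illustrative stand-ins for `police_killings['raceethnicity'].value_counts()` (approximate, not the exact dataset values), and the population shares are rough 2015 census estimates:

```python
import pandas as pd

# Approximate counts standing in for the dataset's value_counts() output
kill_counts = pd.Series({'White': 236, 'Black': 135, 'Hispanic/Latino': 67,
                         'Asian/Pacific Islander': 10, 'Native American': 4})

# Share of killings per group; value_counts(normalize=True) on the raw
# column gives this directly
kill_share = kill_counts / kill_counts.sum()

# Rough US population shares (2015 census estimates)
us_share = pd.Series({'White': 0.62, 'Black': 0.13, 'Hispanic/Latino': 0.17,
                      'Asian/Pacific Islander': 0.05, 'Native American': 0.01})

# Ratio > 1 means the group is over-represented among those killed
ratio = (kill_share / us_share).round(2)
print(ratio.sort_values(ascending=False))
```

With these numbers, the ratio for black people comes out well above 1 while the ratio for white people comes out below 1, which is the disproportion described above.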
In [8]:
# Need to clean up "-" p_income values
income = police_killings['p_income'][police_killings['p_income'] != '-']
income.head()
Out[8]:
In [9]:
# Convert income to an integer type
income = income.astype(int)
# Use the hist() method on income to generate a histogram
income.hist(bins=20)
Out[9]:
The histogram shows a peak around $20,000. For comparison, the per-capita income for the overall US population in 2008 was $26,964, according to the census.
In [10]:
median_kill_income = income.median()
median_kill_income
Out[10]:
In [11]:
# Note: the census figure is per-capita income, not strictly a median
median_us_income = 26964
median_us_income
Out[11]:
These incomes don't seem that different, which is somewhat of a surprise. Were the incomes from the two sources calculated in the same way? This may be worth looking into.
Since we have geographic data, we can look at breakdowns of shootings by region. Since we don't have a ton of data, it might be best to stick to large regions, such as states.
There is one caveat to be aware of when looking at population-level data. Finding that more people were shot in Texas than in Georgia doesn't automatically mean that cops are more likely to shoot people in Texas, because Texas has a higher population than Georgia. So you need to adjust for each state's population.
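A toy example illustrates the adjustment. The shooting counts below are hypothetical (the populations are rough 2015 figures): Texas has well over twice Georgia's population, so comparing raw counts overstates the difference, while killings per million residents puts the two states on the same scale.

```python
import pandas as pd

# Hypothetical raw counts; populations are rough 2015 estimates
toy = pd.DataFrame({
    'state': ['TX', 'GA'],
    'shootings': [46, 13],
    'population': [27_469_000, 10_215_000],
})

# Killings per million residents adjusts for state population
toy['rate_per_million'] = toy['shootings'] / (toy['population'] / 1_000_000)
print(toy)
```

On raw counts Texas looks about 3.5x worse than Georgia, but the per-million rates are much closer, which is exactly why the analysis below computes a `rate` column rather than ranking states by raw shooting counts.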
In [12]:
# Read in state_population.csv as a DataFrame
state_pop = pd.read_csv('../data/state_population.csv')
state_pop.head()
Out[12]:
In [13]:
# Get the number of police killings in each state by state code
counts = police_killings['state_fp'].value_counts()
In [14]:
# Make a new DataFrame to hold counts (for eventual merging)
states = pd.DataFrame({'STATE': counts.index, "shootings": counts})
states.head()
Out[14]:
In [15]:
# Merge the state_pop and states DataFrames
states = states.merge(state_pop, on='STATE')
states.head()
Out[15]:
In [16]:
# Create a new column in states called pop_millions
states['pop_millions'] = states['POPEST18PLUS2015'] / 1000000
states.head()
Out[16]:
In [17]:
# Create a new column called rate
states['rate'] = states['shootings'] / states['pop_millions']
states.head()
Out[17]:
In [18]:
columns_we_care_about = ['NAME', 'rate', 'shootings', 'pop_millions']
In [19]:
# Find states with most killings
most_killings = states.sort_values(by='rate', ascending=False)
most_killings[columns_we_care_about].head(10)
Out[19]:
In [20]:
# Find states with least killings
least_killings = states.sort_values(by='rate', ascending=True)
least_killings[columns_we_care_about].head(10)
Out[20]:
So the states with the most killings are:
And the states with the least killings are:
Why? What separates these groups of states?
The states with the most killings all have large Native American populations and very low per-capita incomes, at least in their more rural regions.
The states with the least killings are all wealthier states with high per-capita incomes and very small Native American populations.
Since there are very few police killings of Native Americans overall, that doesn't seem to be the key factor. Economic conditions seem likely to matter much more.
If we were to explore the data further, potential things to look for include:
Why do some states have a much higher rate of police killings than others? Is it due to random chance, or is there an underlying factor that could explain it?
Let's dive more into the data ...
To better look at the differences, let's split off the data for the 10 states with the lowest shooting rates and the data for the 10 states with the highest shooting rates, and see if we can spot any stark contrasts.
In [21]:
# Create a new DataFrame called pk with all rows containing "-" values
# removed (the type conversions happen in the next cell)
pk = police_killings[police_killings['share_white'] != '-']
pk = pk[pk['share_black'] != '-']
pk = pk[pk['share_hispanic'] != '-']
pk = pk[pk['p_income'] != '-']
pk = pk.copy()  # avoid SettingWithCopyWarning when converting types later
pk.head()
Out[21]:
In [22]:
# Convert the share columns to float
pk['share_white'] = pk['share_white'].astype(float)
pk['share_black'] = pk['share_black'].astype(float)
pk['share_hispanic'] = pk['share_hispanic'].astype(float)
pk['p_income'] = pk['p_income'].astype(int)
pk['pov'] = pk['pov'].astype(float)
pk.dtypes
Out[22]:
In [23]:
# Create a DataFrame containing only rows from pk that took place in
# one of the 10 states with the lowest shooting rates.
low_st = ['CT', 'PA', 'NY', 'IA', 'MA', 'ME', 'NH', 'IL', 'OH', 'WI']
pk_lowest = pk[pk['state'].isin(low_st)]
pk_lowest.head()
Out[23]:
In [24]:
# Create a DataFrame containing only rows from pk that took place in
# one of the 10 states with the highest shooting rates.
high_st = ['OK', 'AZ', 'NE', 'AK', 'HI', 'ID', 'NM', 'LA', 'CO', 'KS']
pk_highest = pk[pk['state'].isin(high_st)]
pk_highest.head()
Out[24]:
In [25]:
# Create a list of columns which may be important
col_might_matter = ['share_white', 'share_black', 'share_hispanic',
'p_income', 'h_income', 'county_income',
'comp_income', 'pov', 'urate', 'college']
In [26]:
# Look at summary statistics for states with lowest killings
summary_low = pk_lowest[col_might_matter].describe()
summary_low
Out[26]:
In [27]:
# Look at summary statistics for states with highest killings
summary_high = pk_highest[col_might_matter].describe()
summary_high
Out[27]:
In [28]:
ratio_high_to_low = summary_high / summary_low
ratio_high_to_low
Out[28]:
So what interesting differences jump out? States with a high rate of police shootings have:
Wow. Not what I would have expected!
What are some potential problems with this analysis?
Here are some potential next steps: