The Toxic Armories investigation produced a comprehensive dataset over 1000 armories across the US, as well as thousands of pages of documents requested through Freedom of Information Act requests.
The Oregonian has made both the inspection reports and the dataset publicly available. The purpose of this notebook is to demonstrate ways you can make use of both.
In [1]:
import json
import urllib.parse
import numpy as np
import pandas as pd
import requests
%matplotlib inline
All of the inspection reports we have as of publication and all that we receive later will be free to the public through the excellent open source journalism resource DocumentCloud.
The ID corresponding to the armories project is 30072. You can either use the DocumentCloud API or use the web interface to find more, but the former allows you to access many documents at once.
In [2]:
# It might be overkill, but I figured it best
# for legibility to separate query arguments as a dict
params_dict = {
"q":"projectid:30072",
"per_page":"1000"
}
Here I convert the relatively human-readable dict to the right format for querying DocumentCloud:
In [3]:
params_encoded = urllib.parse.urlencode(params_dict)
In [4]:
r = requests.get('https://www.documentcloud.org/api/search.json', params=params_encoded)
In [24]:
r.json()['documents'][0]
Out[24]:
In [6]:
documentcloud_df = pd.read_json(json.dumps(r.json()['documents']))
In [7]:
documentcloud_df.head()
Out[7]:
Over 18 months, reporter Rob Davis took notes on all of the armories he was investigating, and also tracked them in a spreadsheet. Through collaboration with other data team members, these notes were structured into a compromise between what is human-readable and what is structured for visualization. The dataset that informed his article and underlies the interactive at publication time follows.
In [8]:
df = pd.read_csv('armories_data - 20161201.csv',dtype={'Oregonian ID':'str','Inspection year':'str'})
Each row in df_data corresponds to one armory. Below I have selected the row with data on a Portland, OR armory about two miles from the Portland International Airport.
In [9]:
df.ix[975]
Out[9]:
Below I change "Unknown" to "NaN" so that, when I try to calculate counts for different variables, it is evident that this information is missing:
Note: NaN (not a number) is the standard missing data marker used in pandas
Source: http://pandas-docs.github.io/pandas-docs-travis/dsintro.html#series
In [10]:
df.head()
Out[10]:
If you aren't already familiar with programming: below I change the character values from "Yes" and "No," called Strings in Python and many other programming languages, to numbers (1,0) so that I can calculate rates of inspection or rates of lead presence later.
In [11]:
df_modified = df.replace(to_replace=['Unknown','Yes','No'],value=[np.nan,1,0])
In [12]:
df_modified.head()
Out[12]:
Python's groupby behaves much like Excel's pivot tables, and is also a process in SQL.
In [13]:
df_states = df_modified[['State','Oregonian ID','Lead present?','Had firing range?']].groupby('State').sum().reset_index()
Below I've made a new dataframe to examine what proportion of armories in a given state had a firing range, and what proportion had lead.
In [14]:
df_state_count = df_modified[['State','Oregonian ID']].groupby('State').count().reset_index()
I counted the number of unique IDs that The Oregonian had (Oregonian ID) for armories within each state, so I'll rename that column to Armory Count for the sake of clarity.
In [15]:
df_state_count.rename(columns={'Oregonian ID':'Armory Count'},inplace=True)
In [16]:
df_state_values = df_modified[['State','Inspection conducted?','Inspection report available?','Lead present?','Lead present outside range?','Had firing range?']].groupby('State').sum().reset_index()
Here I recombine the two dataframes because one had to be a calculation that counted the number of armories, while df_state_values allowed me to sum the 1s that I had generated for Yes for a given variable, e.g. Inspection conducted?
In [17]:
df_states = pd.merge(df_state_count,df_state_values)
I'm going to repurpose df_state_count to more easily show the results of some calculations:
In [18]:
df_state_values['Rate of lead in state'] = round(df_states['Lead present?']/df_states['Armory Count'],2)
In [19]:
df_state_values['Rate of inspection'] = round(df_states['Inspection conducted?']/df_states['Armory Count'],2)
In [20]:
df_state_values.sort_values(by=['Rate of inspection'],ascending=False)
Out[20]:
It's verbose and it isn't pretty, but plotting is also built into pandas!
In [21]:
df_state_values[['Rate of inspection']].sort_values(
by='Rate of inspection', ascending=False).plot(
kind='bar',
title='Rate of armory inspection by state',
legend=False,
x=df_state_values['State'])
Out[21]:
Please feel free to open issues with your questions and also to share your own exploration and analyses with me! I'll be updating with analyses and plotting with Seaborn soon.