ReproduceIt is a series of articles that reproduce the results of data analysis articles, with a focus on open data and open code. All the code and data are available on GitHub: reproduceit-538-baltimore-black-income. This post is a more verbose version of that content and will probably become outdated, while the GitHub version may be updated with fixes.
For this second article I (again) picked a piece from one of my favorite websites, FiveThirtyEight. In this case it is Ben Casselman's "How Baltimore’s Young Black Men Are Boxed In", which I found interesting given the recent events in the US, especially in Baltimore, Maryland.
The article analyzes the income gap between white and black people in cities all around the US. The data source is the "American Community Survey", which is available through the "American Fact Finder".
Unlike in the previous ReproduceIt article, it is not possible to crawl this app, since it requires user input to select the desired data. The data used for this analysis is instead available with the code in the GitHub repo.
As usual we print some versions of the software and libraries used for future reproducibility.
In [1]:
import sys
sys.version_info
Out[1]:
In [2]:
import numpy as np
np.__version__
Out[2]:
In [3]:
import pandas as pd
pd.__version__
Out[3]:
In [4]:
%matplotlib inline
In [5]:
import matplotlib
import matplotlib.pyplot as plt
plt.style.use('ggplot')
matplotlib.__version__
Out[5]:
In [6]:
import bokeh.plotting as plt
from bokeh.models import HoverTool
plt.output_notebook()
import bokeh
bokeh.__version__
Out[6]:
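The version checks above can also be collected in one place. The following is a minimal sketch, not part of the original notebook: `report_versions` is a hypothetical helper name, and it assumes Python 3.8 or newer, where `importlib.metadata` is part of the standard library.

```python
import sys
from importlib import metadata


def report_versions(packages):
    """Return a dict mapping 'python' and each package name to a version string."""
    versions = {"python": ".".join(str(part) for part in sys.version_info[:3])}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            versions[name] = "not installed"
    return versions


versions = report_versions(["numpy", "pandas", "matplotlib", "bokeh"])
for name, version in versions.items():
    print(f"{name}: {version}")
```

Reading the versions from the installed package metadata avoids importing each library just to print its `__version__`.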
Most of the work in this article went into cleaning the data a little, which was not really hard since it only required removing some columns and computing a weighted average. For all those operations I am using pandas.
In [7]:
black = pd.read_csv('BlackIncome/ACS_13_5YR_B19001B_with_ann.csv', encoding='cp1252', skiprows=[0])
In [8]:
black.head()
Out[8]:
In [9]:
black.set_index('Geography', inplace=True)
In [10]:
black.drop(['Id', 'Id2', 'Estimate; Total:'], axis=1, inplace=True)
In [11]:
margin_cols = [col for col in black.columns if col.startswith('Margin of Error')]
In [12]:
black.drop(margin_cols, axis=1, inplace=True)
In [13]:
black.head()
Out[13]:
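The cleaning steps above could be wrapped in a small function so that the same steps can be reused for other tables. This is only a sketch: `clean_income_table` is a hypothetical name, and the toy table below is made up, with bracket labels that merely imitate the "Estimate; Total: - ..." naming pattern of the real files.

```python
import pandas as pd


def clean_income_table(df):
    """Index by geography and keep only the per-bracket estimate columns."""
    cleaned = df.set_index("Geography")
    # Drop the identifiers and the grand total, which are not income brackets.
    cleaned = cleaned.drop(columns=["Id", "Id2", "Estimate; Total:"])
    # Drop every margin-of-error column.
    margin_cols = [col for col in cleaned.columns if col.startswith("Margin of Error")]
    return cleaned.drop(columns=margin_cols)


# Made-up data with two brackets; the real tables have many more columns.
toy = pd.DataFrame({
    "Id": ["a", "b"],
    "Id2": [1, 2],
    "Geography": ["City A", "City B"],
    "Estimate; Total:": [30, 50],
    "Margin of Error; Total:": [5, 6],
    "Estimate; Total: - Less than $10,000": [10, 20],
    "Margin of Error; Total: - Less than $10,000": [3, 4],
    "Estimate; Total: - $10,000 to $14,999": [20, 30],
    "Margin of Error; Total: - $10,000 to $14,999": [2, 2],
})
cleaned = clean_income_table(toy)
print(cleaned)
```

The function returns a new DataFrame instead of using `inplace=True`, so the raw table stays available for inspection.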
Since the data is given in income intervals, I used a simple weighted average to get a single number per city.
In [14]:
weights = [10000, 12500, 17500, 22500, 27500, 32500, 37500, 42500, 47500, 55000, 67500, 87500, 112500, 137500, 187500, 200000]
In [15]:
weights = pd.Series(weights, index=black.columns)
In [16]:
def weight_average(x):
    return (x * weights).sum() / x.sum()

black.head().apply(weight_average, axis=1)
Out[16]:
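To make the arithmetic concrete, here is a small worked example in plain Python. It uses the first three bracket values from the weights list, while the household counts are made up for illustration. Returning NaN for a city with no households is an assumption of this sketch.

```python
def weighted_average(counts, values):
    """Average of `values`, weighting each one by its household count."""
    total = sum(counts)
    if total == 0:
        # No households at all: the average is undefined.
        return float("nan")
    return sum(count * value for count, value in zip(counts, values)) / total


bracket_values = [10000, 12500, 17500]   # representative income per bracket
household_counts = [10, 20, 10]          # made-up number of households per bracket

# (10 * 10000 + 20 * 12500 + 10 * 17500) / 40
average = weighted_average(household_counts, bracket_values)
print(average)  # 13125.0
```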
In [17]:
estimate_cols = [col for col in black.columns if col.startswith('Estimate; Total:')]
In [18]:
black['average'] = black[estimate_cols].apply(weight_average, axis=1)
In [19]:
black.head(2)
Out[19]:
In [20]:
black.average.hist(bins=25, color='black')
Out[20]:
In [21]:
white = pd.read_csv('WhiteIncome/ACS_13_5YR_B19001A_with_ann.csv', encoding='cp1252', skiprows=[0])
In [22]:
white.head(2)
Out[22]:
In [23]:
white.set_index('Geography', inplace=True)
white.drop(['Id', 'Id2', 'Estimate; Total:'], axis=1, inplace=True)
margin_cols = [col for col in white.columns if col.startswith('Margin of Error')]
white.drop(margin_cols, axis=1, inplace=True)
In [24]:
estimate_cols = [col for col in white.columns if col.startswith('Estimate; Total:')]
white['average'] = white[estimate_cols].apply(weight_average, axis=1)
In [25]:
white.average.hist(bins=25, color='black')
Out[25]:
These two histograms were not in the original article, but they are a nice extra for seeing how white and black incomes compare in general.
In [26]:
black_and_white = black[['average']].join(white[['average']], lsuffix='_black', rsuffix='_white')
In [27]:
black_and_white.head()
Out[27]:
In [28]:
black_and_white['gap'] = black_and_white.average_white - black_and_white.average_black
In [29]:
ax = black_and_white.dropna().plot(kind='scatter', x='average_black', y='gap', color='black', alpha=0.1)
black_and_white.loc[["Baltimore city, Maryland"]].plot(kind='scatter', x='average_black', y='gap', color='red', ax=ax, figsize=(8, 8))
Out[29]:
This scatter plot shows the distribution of average black income versus the white-black income gap, as in the original article.
The only difference is that this plot shows all cities in the US. The vertical bars that appear in it are a consequence of the simple weighted average explained before.
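One plausible reading of those vertical bars, sketched here with made-up numbers: when all the sampled households of a small place fall into a single bracket, the weighted average is exactly that bracket's value, so many places end up sharing the same x coordinate.

```python
def weighted_average(counts, values):
    """Average of `values`, weighting each one by its household count."""
    return sum(count * value for count, value in zip(counts, values)) / sum(counts)


bracket_values = [10000, 12500, 17500, 22500]

# Hypothetical small towns whose households all sit in the second bracket.
small_towns = {
    "Town A": [0, 4, 0, 0],
    "Town B": [0, 9, 0, 0],
    "Town C": [0, 25, 0, 0],
}

averages = {
    town: weighted_average(counts, bracket_values)
    for town, counts in small_towns.items()
}
print(averages)  # every town gets the same average, 12500.0
```

With the same average on the x axis and different gaps on the y axis, such towns would stack into a vertical line.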
The original article plotted only the cities where more than 10% of the population is black, which removes some noise and results in a cleaner scatter plot.
In [33]:
races = pd.read_csv('races/ACS_13_5YR_B02001_with_ann.csv', encoding='cp1252', skiprows=[0])
In [34]:
races = races[['Geography', 'Estimate; Total:', 'Estimate; Total: - Black or African American alone']]
In [35]:
races = races.set_index('Geography')
In [36]:
black_percentage = races['Estimate; Total: - Black or African American alone'] / races['Estimate; Total:']
In [37]:
subset = black_and_white[black_percentage.reindex(black_and_white.index) > 0.1]
In [38]:
ax = subset.dropna().plot(kind='scatter', x='average_black', y='gap', color='black', alpha=0.1)
subset.loc[["Baltimore city, Maryland"]].plot(kind='scatter', x='average_black', y='gap', color='red', ax=ax, figsize=(8, 8))
Out[38]:
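The population filter can be tried on a toy example. All names and numbers below are made up. The sketch aligns the population share with the income table by city name before comparing it with the 10% threshold, so cities missing from the population table are simply excluded.

```python
import pandas as pd

# Made-up income table, indexed by city.
incomes = pd.DataFrame(
    {"average_black": [30000.0, 40000.0, 35000.0],
     "gap": [20000.0, 10000.0, 15000.0]},
    index=["City A", "City B", "City C"],
)

# Made-up population table, in a different order and with an extra city.
population = pd.DataFrame(
    {"total": [1000, 2000, 500, 800],
     "black": [300, 100, 60, 400]},
    index=["City B", "City A", "City C", "City D"],
)

share = population["black"] / population["total"]

# Reorder the share to match the income table; unknown cities become NaN,
# and NaN > 0.1 evaluates to False.
mask = share.reindex(incomes.index) > 0.1
subset = incomes[mask]
print(subset)
```

Here City A has a 5% share and is dropped, while City B (30%) and City C (12%) are kept.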
With Bokeh it is possible to turn the static matplotlib image into an interactive JavaScript visualization.
In [39]:
source = plt.ColumnDataSource(
    data=dict(
        black_income=subset.average_black,
        gap=subset.gap,
        city=subset.index,
    )
)
p = plt.figure(tools='hover,reset,save',
               title='', width=530, height=530,
               x_axis_label="Average Black Income",
               y_axis_label="Black-white income gap")
# Refer to the columns of `source` by name so the hover tooltips can look up each point
p.scatter('black_income', 'gap', size=5, color="black", alpha=0.05, source=source)
hover = p.select(dict(type=HoverTool))
hover.tooltips = [
    ("City", "@city"),
    ("Average Black Income", "@black_income"),
    ("B-W income gap", "@gap"),
]
plt.show(p)
The big cloud in the middle can be a bit messy to inspect, since there are too many points close together and the tooltip tries to show all of them. It is more interesting to look at the outliers, such as Memphis, Alabama in the top left corner or Hensley, Arkansas as the lowest point.
This was a very simple article reproducing the results of "How Baltimore’s Young Black Men Are Boxed In". I highly recommend reading the original article for the conclusions its author presents, since the objective of this article is merely to reproduce the results.
Beyond the simple code, I learned how much data the US government makes available, in this case through the "American Fact Finder". The website might not be as friendly as other data sources, and you might have to spend a little time getting the data you want, but there is a lot of it, and it can lead to simple but interesting analyses like the one in the original article.