ReproduceIt is a series of articles that reproduce the results of data analysis articles, with a focus on open data and open code. All the code and data are available on GitHub: reproduceit-538-baltimore-black-income. This post contains a more verbose version of the content, which will probably become outdated, while the GitHub version may be updated with fixes.

For this second article I (again) took an article from one of my favorite websites, FiveThirtyEight. In this case it is an article by Ben Casselman, "How Baltimore’s Young Black Men Are Boxed In", which I found interesting given the recent events in the US, specifically in Baltimore, Maryland.

The article analyses the income gap between white and black people in cities all around the US. The data source is the "American Community Survey", which is available through the "American Fact Finder".

It is not possible to crawl this app as in the previous ReproduceIt article, since it requires user input to select the desired data, but the data used for this analysis is available with the code in the GitHub repo.

As usual, we print the versions of the software and libraries used, for future reproducibility.


In [1]:
import sys
sys.version_info


Out[1]:
sys.version_info(major=3, minor=4, micro=3, releaselevel='final', serial=0)

In [2]:
import numpy as np
np.__version__


Out[2]:
'1.9.2'

In [3]:
import pandas as pd
pd.__version__


Out[3]:
'0.16.0'

In [4]:
%matplotlib inline

In [5]:
import matplotlib
import matplotlib.pyplot as plt
plt.style.use('ggplot')

matplotlib.__version__


Out[5]:
'1.4.3'

In [6]:
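# Note: this rebinds `plt` from matplotlib.pyplot to bokeh.plotting;
# the matplotlib plots below go through pandas, so they are unaffected.
import bokeh.plotting as plt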
from bokeh.models import HoverTool
plt.output_notebook()

import bokeh
bokeh.__version__


BokehJS successfully loaded.
Out[6]:
'0.8.2'
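
The version checks above could also be gathered in a single cell. This is a minimal sketch using only the standard library; the helper name `collect_versions` is hypothetical and not part of the original notebook.

```python
import importlib
import sys


def collect_versions(names):
    """Return a dict mapping 'python' and each module name to a version string."""
    versions = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # The library is missing in this environment.
            versions[name] = "not installed"
        else:
            # Most libraries expose __version__, but not all of them do.
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


versions = collect_versions(["numpy", "pandas", "matplotlib", "bokeh"])
for name, version in sorted(versions.items()):
    print(name, version)
```

Keeping the result in a dict makes it easy to save next to the data, so a later run can be compared against the versions used here.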

Cleaning data

Most of the work in this article was cleaning the data a little, which was not really hard since it only required removing some columns and computing a weighted average. For all those operations I am using pandas.

Black


In [7]:
black = pd.read_csv('BlackIncome/ACS_13_5YR_B19001B_with_ann.csv', encoding='cp1252', skiprows=[0])

In [8]:
black.head()


Out[8]:
Id Id2 Geography Estimate; Total: Margin of Error; Total: Estimate; Total: - Less than $10,000 Margin of Error; Total: - Less than $10,000 Estimate; Total: - $10,000 to $14,999 Margin of Error; Total: - $10,000 to $14,999 Estimate; Total: - $15,000 to $19,999 ... Estimate; Total: - $75,000 to $99,999 Margin of Error; Total: - $75,000 to $99,999 Estimate; Total: - $100,000 to $124,999 Margin of Error; Total: - $100,000 to $124,999 Estimate; Total: - $125,000 to $149,999 Margin of Error; Total: - $125,000 to $149,999 Estimate; Total: - $150,000 to $199,999 Margin of Error; Total: - $150,000 to $199,999 Estimate; Total: - $200,000 or more Margin of Error; Total: - $200,000 or more
0 1600000US0100100 100100 Abanda CDP, Alabama 0 11 0 11 0 11 0 ... 0 11 0 11 0 11 0 11 0 11
1 1600000US0100124 100124 Abbeville city, Alabama 410 90 81 42 47 35 120 ... 4 8 0 11 12 15 0 11 0 11
2 1600000US0100460 100460 Adamsville city, Alabama 585 116 40 31 11 18 9 ... 112 62 38 41 33 47 5 8 0 11
3 1600000US0100484 100484 Addison town, Alabama 0 11 0 11 0 11 0 ... 0 11 0 11 0 11 0 11 0 11
4 1600000US0100676 100676 Akron town, Alabama 118 37 26 17 18 17 6 ... 0 11 0 11 8 12 0 11 0 11

5 rows × 37 columns


In [9]:
black.set_index('Geography', inplace=True)

In [10]:
black.drop(['Id', 'Id2', 'Estimate; Total:'], axis=1, inplace=True)

In [11]:
margin_cols = [col for col in black.columns if col.startswith('Margin of Error')]

In [12]:
black.drop(margin_cols, axis=1, inplace=True)

In [13]:
black.head()


Out[13]:
Estimate; Total: - Less than $10,000 Estimate; Total: - $10,000 to $14,999 Estimate; Total: - $15,000 to $19,999 Estimate; Total: - $20,000 to $24,999 Estimate; Total: - $25,000 to $29,999 Estimate; Total: - $30,000 to $34,999 Estimate; Total: - $35,000 to $39,999 Estimate; Total: - $40,000 to $44,999 Estimate; Total: - $45,000 to $49,999 Estimate; Total: - $50,000 to $59,999 Estimate; Total: - $60,000 to $74,999 Estimate; Total: - $75,000 to $99,999 Estimate; Total: - $100,000 to $124,999 Estimate; Total: - $125,000 to $149,999 Estimate; Total: - $150,000 to $199,999 Estimate; Total: - $200,000 or more
Geography
Abanda CDP, Alabama 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Abbeville city, Alabama 81 47 120 7 30 5 47 16 15 6 20 4 0 12 0 0
Adamsville city, Alabama 40 11 9 11 28 109 17 61 21 52 38 112 38 33 5 0
Addison town, Alabama 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Akron town, Alabama 26 18 6 16 14 21 2 7 0 0 0 0 0 8 0 0

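Since the data is given in income brackets, I used a simple weighted average to get a single number per city: each bracket is represented by its midpoint (the two open-ended brackets use their boundary value) and weighted by the number of households in it.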


In [14]:
weights = [10000, 12500, 17500, 22500, 27500, 32500, 37500, 42500, 47500, 55000, 67500, 87500, 112500, 137500, 187500, 200000]

In [15]:
weights = pd.Series(weights, index=black.columns)

In [16]:
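def weight_average(x):
    # x holds the household counts per bracket for one city;
    # a city with no households gives 0 / 0, hence the NaN values below
    return (x * weights).sum() / x.sum()
black.head().apply(weight_average, axis=1)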


Out[16]:
Geography
Abanda CDP, Alabama                  NaN
Abbeville city, Alabama     27993.902439
Adamsville city, Alabama    58901.709402
Addison town, Alabama                NaN
Akron town, Alabama         29576.271186
dtype: float64
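
As a sanity check, the value for Abbeville can be recomputed by hand from the counts shown in `black.head()`. This is a minimal sketch in plain Python; the names `bracket_values` and `bracket_average` are hypothetical. Plain Python raises an error on 0 / 0 instead of returning NaN as pandas does, so the empty case is handled explicitly.

```python
# Bracket values copied from the `weights` list above.
bracket_values = [10000, 12500, 17500, 22500, 27500, 32500, 37500, 42500,
                  47500, 55000, 67500, 87500, 112500, 137500, 187500, 200000]

# Household counts per bracket for Abbeville city, Alabama (from black.head()).
abbeville = [81, 47, 120, 7, 30, 5, 47, 16, 15, 6, 20, 4, 0, 12, 0, 0]


def bracket_average(counts, values):
    """Average of bracket values weighted by counts; NaN when there are no households."""
    total = sum(counts)
    if total == 0:
        return float("nan")
    return sum(c * v for c, v in zip(counts, values)) / total


print(round(bracket_average(abbeville, bracket_values), 6))  # 27993.902439
print(bracket_average([0] * 16, bracket_values))             # nan
```

The counts add up to 410, which matches the "Estimate; Total:" column for Abbeville in the raw file.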

In [17]:
estimate_cols = [col for col in black.columns if col.startswith('Estimate; Total:')]

In [18]:
black['average'] = black[estimate_cols].apply(weight_average, axis=1)

In [19]:
black.head(2)


Out[19]:
Estimate; Total: - Less than $10,000 Estimate; Total: - $10,000 to $14,999 Estimate; Total: - $15,000 to $19,999 Estimate; Total: - $20,000 to $24,999 Estimate; Total: - $25,000 to $29,999 Estimate; Total: - $30,000 to $34,999 Estimate; Total: - $35,000 to $39,999 Estimate; Total: - $40,000 to $44,999 Estimate; Total: - $45,000 to $49,999 Estimate; Total: - $50,000 to $59,999 Estimate; Total: - $60,000 to $74,999 Estimate; Total: - $75,000 to $99,999 Estimate; Total: - $100,000 to $124,999 Estimate; Total: - $125,000 to $149,999 Estimate; Total: - $150,000 to $199,999 Estimate; Total: - $200,000 or more average
Geography
Abanda CDP, Alabama 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN
Abbeville city, Alabama 81 47 120 7 30 5 47 16 15 6 20 4 0 12 0 0 27993.902439

In [20]:
black.average.hist(bins=25, color='black')


Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x10cf75080>

White


In [21]:
white = pd.read_csv('WhiteIncome/ACS_13_5YR_B19001A_with_ann.csv', encoding='cp1252', skiprows=[0])

In [22]:
white.head(2)


Out[22]:
Id Id2 Geography Estimate; Total: Margin of Error; Total: Estimate; Total: - Less than $10,000 Margin of Error; Total: - Less than $10,000 Estimate; Total: - $10,000 to $14,999 Margin of Error; Total: - $10,000 to $14,999 Estimate; Total: - $15,000 to $19,999 ... Estimate; Total: - $75,000 to $99,999 Margin of Error; Total: - $75,000 to $99,999 Estimate; Total: - $100,000 to $124,999 Margin of Error; Total: - $100,000 to $124,999 Estimate; Total: - $125,000 to $149,999 Margin of Error; Total: - $125,000 to $149,999 Estimate; Total: - $150,000 to $199,999 Margin of Error; Total: - $150,000 to $199,999 Estimate; Total: - $200,000 or more Margin of Error; Total: - $200,000 or more
0 1600000US0100100 100100 Abanda CDP, Alabama 23 25 0 11 11 17 0 ... 0 11 0 11 0 11 0 11 0 11
1 1600000US0100124 100124 Abbeville city, Alabama 580 107 48 37 37 21 66 ... 59 37 50 28 13 13 8 12 3 6

2 rows × 37 columns


In [23]:
white.set_index('Geography', inplace=True)
white.drop(['Id', 'Id2', 'Estimate; Total:'], axis=1, inplace=True)
margin_cols = [col for col in white.columns if col.startswith('Margin of Error')]
white.drop(margin_cols, axis=1, inplace=True)

In [24]:
estimate_cols = [col for col in white.columns if col.startswith('Estimate; Total:')]
white['average'] = white[estimate_cols].apply(weight_average, axis=1)

In [25]:
white.average.hist(bins=25, color='black')


Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x109eeb940>

These two histograms were not in the original article, but they are a nice extra for seeing how white and black incomes compare in general.

Combined


In [26]:
black_and_white = black[['average']].join(white[['average']], lsuffix='_black', rsuffix='_white')

In [27]:
black_and_white.head()


Out[27]:
                          average_black  average_white
Geography
Abanda CDP, Alabama                 NaN   22934.782609
Abbeville city, Alabama    27993.902439   49422.413793
Adamsville city, Alabama   58901.709402   53250.750751
Addison town, Alabama               NaN   44384.328358
Akron town, Alabama        29576.271186   29398.148148

In [28]:
black_and_white['gap'] = black_and_white.average_white - black_and_white.average_black
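
To make the sign of the gap and the effect of the missing values concrete, here is a small sketch in plain Python using only the five rows shown in `black_and_white.head()`. The names `sample`, `gaps` and `plottable` are hypothetical.

```python
nan = float("nan")

# (average_black, average_white) copied from black_and_white.head() above.
sample = {
    "Abanda CDP, Alabama": (nan, 22934.782609),
    "Abbeville city, Alabama": (27993.902439, 49422.413793),
    "Adamsville city, Alabama": (58901.709402, 53250.750751),
    "Addison town, Alabama": (nan, 44384.328358),
    "Akron town, Alabama": (29576.271186, 29398.148148),
}

# gap = white average minus black average; NaN propagates through subtraction.
gaps = {city: white - black for city, (black, white) in sample.items()}

# Equivalent of dropna(): keep only cities whose gap is a real number
# (NaN is the only float that is not equal to itself).
plottable = {city: gap for city, gap in gaps.items() if gap == gap}

for city, gap in sorted(plottable.items()):
    print(city, round(gap, 2))
```

A positive gap means the white average is higher. In this small sample Adamsville and Akron come out negative, and the two places without a black average drop out before plotting.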

In [29]:
ax = black_and_white.dropna().plot(kind='scatter', x='average_black', y='gap', color='black', alpha=0.1)
black_and_white.loc[["Baltimore city, Maryland"]].plot(kind='scatter', x='average_black', y='gap', color='red', ax=ax, figsize=(8, 8))


Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a518dd8>

This scatter plot shows the distribution of average black income against the white-black income gap, as in the original article.

The only difference is that this plot shows all cities in the US. The vertical bars that appear in it are a consequence of the simple weighted average explained before.
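
One plausible reading of those vertical bars, stated here as an assumption rather than something checked against the full data set: in places with very few black households, all of them can fall into a single bracket, so the weighted average is exactly that bracket's value and many small places share the same x coordinate. A minimal sketch with made-up household counts (the place names, counts and helper names are hypothetical):

```python
# Bracket values copied from the `weights` list in the notebook.
BRACKET_VALUES = [10000, 12500, 17500, 22500, 27500, 32500, 37500, 42500,
                  47500, 55000, 67500, 87500, 112500, 137500, 187500, 200000]


def bracket_average(counts):
    """Average of the bracket values weighted by household counts."""
    return sum(c * v for c, v in zip(counts, BRACKET_VALUES)) / sum(counts)


def single_bracket(position, households):
    """Counts for a hypothetical place whose households all share one bracket."""
    counts = [0] * len(BRACKET_VALUES)
    counts[position] = households
    return counts


# Three made-up places of different sizes, all in the $25,000 to $29,999 bracket.
places = {
    "Place A": single_bracket(4, 3),
    "Place B": single_bracket(4, 11),
    "Place C": single_bracket(4, 40),
}

averages = {name: bracket_average(counts) for name, counts in places.items()}
print(averages)  # every place gets exactly 27500.0
```

Whatever their white averages are, these places would line up vertically at 27,500 on the x axis.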

Subset: 10% of black population

The article plotted only the cities where more than 10% of the population is black, which removes some noise and results in a cleaner scatter plot.


In [33]:
races = pd.read_csv('races/ACS_13_5YR_B02001_with_ann.csv', encoding='cp1252', skiprows=[0])

In [34]:
races = races[['Geography', 'Estimate; Total:', 'Estimate; Total: - Black or African American alone']]

In [35]:
races = races.set_index('Geography')

In [36]:
black_percentage = races['Estimate; Total: - Black or African American alone'] / races['Estimate; Total:']

In [37]:
subset = black_and_white[black_percentage > 0.1]
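
The filter above relies on pandas matching `black_percentage` to the rows of `black_and_white` by place name. The same idea in plain Python, as a small sketch: the population figures are made up for illustration, and the names `populations`, `incomes` and `THRESHOLD` are hypothetical.

```python
# Hypothetical (total population, black population) per place.
populations = {
    "Place A": (1000, 250),
    "Place B": (5000, 400),
    "Place C": (800, 80),
}

# Hypothetical income rows keyed by the same place names.
incomes = {
    "Place A": {"average_black": 30000.0, "gap": 12000.0},
    "Place B": {"average_black": 41000.0, "gap": 9000.0},
    "Place C": {"average_black": 28000.0, "gap": -500.0},
    "Place D": {"average_black": 35000.0, "gap": 4000.0},  # no population row
}

THRESHOLD = 0.1

# Share of black residents per place.
share = {place: black / total for place, (total, black) in populations.items()}

# Keep a place only if it has a population row and its share is strictly above 10%.
subset = {place: row for place, row in incomes.items()
          if share.get(place, 0.0) > THRESHOLD}

print(sorted(subset))  # ['Place A']
```

Dropping places that have no population row is a choice of this sketch. The comparison is strict, as in the notebook, so a place at exactly 10% is excluded.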

In [38]:
ax = subset.dropna().plot(kind='scatter', x='average_black', y='gap', color='black', alpha=0.1)
subset.loc[["Baltimore city, Maryland"]].plot(kind='scatter', x='average_black', y='gap', color='red', ax=ax, figsize=(8, 8))


Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x10f551eb8>

Interactive

Using Bokeh it is possible to convert the static matplotlib image into an interactive JavaScript visualization.


In [39]:
source = plt.ColumnDataSource(
    data=dict(
        black_income=subset.average_black,
        gap=subset.gap,
        city=subset.index,
    )
)

p = plt.figure(tools='hover,reset,save',
               title='', width=530, height=530,
               x_axis_label="Average Black Income",
               y_axis_label="Black-white income gap")

p.scatter(subset.average_black, subset.gap, size=5, color="black", alpha=0.05, source=source)

hover = p.select(dict(type=HoverTool))

hover.tooltips = [
    ("City", "@city"),
    ("Average Black Income ", "@black_income"),
    ("B-W income gap", "@gap"),
]

plt.show(p)


The big cloud in the middle might be a little messy to explore, since there are too many points close together and the tooltip will try to show all of them. It is more interesting to look at the outliers, such as Memphis, Alabama in the top left corner or Hensley, Arkansas as the lowest point.

Conclusion

This was a very simple article reproducing the results of "How Baltimore’s Young Black Men Are Boxed In". I highly recommend reading the original article for the conclusions the author presents there, since the objective of this article is merely to reproduce the results.

Beyond the simple code, I learned how much data is made available by the US government, in this case through the "American Fact Finder". The website might not be as friendly as other data sources, and you might have to spend a little time getting the data you want, but there is a lot of data that can lead to simple but interesting analyses like the original article.