ReproduceIt is a series of articles that reproduce the results of data analysis articles, with a focus on open data and open code. All the code and data are available on GitHub: reproduceit-538-baltimore-black-income. This post contains a more verbose version of the content, which will probably become outdated, while the GitHub version may be updated with fixes.

For this second article I (again) took an article from one of my favorite websites, FiveThirtyEight. In this case it is an article by Ben Casselman, "How Baltimore’s Young Black Men Are Boxed In", which I found interesting given the recent events in the US, especially in Baltimore, Maryland.

The article analyzes the income gap between white and black people in cities across the US. The data source is the "American Community Survey", which is available through the "American Fact Finder".

This app cannot be crawled as in the previous ReproduceIt article, since it requires user input to select the desired data, but the data used for this analysis is available with the code in the GitHub repo.

As usual we print some versions of the software and libraries used for future reproducibility.



In [1]:

    
import sys
sys.version_info









    Out[1]:





sys.version_info(major=3, minor=4, micro=3, releaselevel='final', serial=0)



In [2]:

    
import numpy as np
np.__version__









    Out[2]:





'1.9.2'



In [3]:

    
import pandas as pd
pd.__version__









    Out[3]:





'0.16.0'



In [4]:

    
%matplotlib inline



In [5]:

    
import matplotlib
import matplotlib.pyplot as plt
plt.style.use('ggplot')

matplotlib.__version__









    Out[5]:





'1.4.3'



In [6]:

    
import bokeh.plotting as plt
from bokeh.models import HoverTool
plt.output_notebook()

import bokeh
bokeh.__version__









    




    
        
        
        
    
        
        BokehJS successfully loaded.
    






    Out[6]:





'0.8.2'
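The version checks above could also be gathered in a single cell. The following is only a sketch: `collect_versions` is a hypothetical helper (not part of the original notebook) that reports a placeholder instead of failing when a library is missing.

```python
import importlib
import sys


def collect_versions(module_names):
    """Return a dict mapping 'python' and each module name to a version string."""
    versions = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # The library is not available in this environment.
            versions[name] = "not installed"
            continue
        # Not every module exposes __version__, so fall back to a placeholder.
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


versions = collect_versions(["numpy", "pandas", "matplotlib", "bokeh"])
for name, version in sorted(versions.items()):
    print(name, version)
```

Keeping the result in a dict makes it easy to store the versions next to the data for later comparison.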

Cleaning data

Most of the work for this article went into cleaning the data, which was not hard since it only required removing some columns and computing a weighted average. I am using pandas for all of these operations.

Black



In [7]:

    
black = pd.read_csv('BlackIncome/ACS_13_5YR_B19001B_with_ann.csv', encoding='cp1252', skiprows=[0])
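A likely reason for `skiprows=[0]` is that the downloaded file carries two header lines: one with machine-readable column codes and one with the descriptive labels seen in the outputs. That layout is an assumption here, and the codes in the first line of the miniature file below are illustrative only. A minimal sketch of the effect:

```python
import io

import pandas as pd

# Hypothetical miniature of the downloaded file: the first line holds
# column codes (assumed names), the second the human-readable labels.
raw_text = (
    "GEO.id,GEO.id2,GEO.display-label,HD01_VD01\n"
    "Id,Id2,Geography,Estimate; Total:\n"
    '1600000US0100124,100124,"Abbeville city, Alabama",410\n'
)

# Skipping line 0 makes pandas take the descriptive labels as the header.
sample = pd.read_csv(io.StringIO(raw_text), skiprows=[0])
print(list(sample.columns))
```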



In [8]:

    
black.head()









    Out[8]:






  
    
      
	Id	Id2	Geography	Estimate; Total:	Margin of Error; Total:	Estimate; Total: - Less than $10,000	Margin of Error; Total: - Less than $10,000	Estimate; Total: - $10,000 to $14,999	Margin of Error; Total: - $10,000 to $14,999	Estimate; Total: - $15,000 to $19,999	...	Estimate; Total: - $75,000 to $99,999	Margin of Error; Total: - $75,000 to $99,999	Estimate; Total: - $100,000 to $124,999	Margin of Error; Total: - $100,000 to $124,999	Estimate; Total: - $125,000 to $149,999	Margin of Error; Total: - $125,000 to $149,999	Estimate; Total: - $150,000 to $199,999	Margin of Error; Total: - $150,000 to $199,999	Estimate; Total: - $200,000 or more	Margin of Error; Total: - $200,000 or more
0	1600000US0100100	100100	Abanda CDP, Alabama	0	11	0	11	0	11	0	...	0	11	0	11	0	11	0	11	0	11
1	1600000US0100124	100124	Abbeville city, Alabama	410	90	81	42	47	35	120	...	4	8	0	11	12	15	0	11	0	11
2	1600000US0100460	100460	Adamsville city, Alabama	585	116	40	31	11	18	9	...	112	62	38	41	33	47	5	8	0	11
3	1600000US0100484	100484	Addison town, Alabama	0	11	0	11	0	11	0	...	0	11	0	11	0	11	0	11	0	11
4	1600000US0100676	100676	Akron town, Alabama	118	37	26	17	18	17	6	...	0	11	0	11	8	12	0	11	0	11
    
  

5 rows × 37 columns



In [9]:

    
black.set_index('Geography', inplace=True)



In [10]:

    
black.drop(['Id', 'Id2', 'Estimate; Total:'], axis=1, inplace=True)



In [11]:

    
margin_cols = [col for col in black.columns if col.startswith('Margin of Error')]



In [12]:

    
black.drop(margin_cols, axis=1, inplace=True)



In [13]:

    
black.head()









    Out[13]:






  
    
      
	Estimate; Total: - Less than $10,000	Estimate; Total: - $10,000 to $14,999	Estimate; Total: - $15,000 to $19,999	Estimate; Total: - $20,000 to $24,999	Estimate; Total: - $25,000 to $29,999	Estimate; Total: - $30,000 to $34,999	Estimate; Total: - $35,000 to $39,999	Estimate; Total: - $40,000 to $44,999	Estimate; Total: - $45,000 to $49,999	Estimate; Total: - $50,000 to $59,999	Estimate; Total: - $60,000 to $74,999	Estimate; Total: - $75,000 to $99,999	Estimate; Total: - $100,000 to $124,999	Estimate; Total: - $125,000 to $149,999	Estimate; Total: - $150,000 to $199,999	Estimate; Total: - $200,000 or more
Geography
Abanda CDP, Alabama	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
Abbeville city, Alabama	81	47	120	7	30	5	47	16	15	6	20	4	0	12	0	0
Adamsville city, Alabama	40	11	9	11	28	109	17	61	21	52	38	112	38	33	5	0
Addison town, Alabama	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
Akron town, Alabama	26	18	6	16	14	21	2	7	0	0	0	0	0	8	0	0

Since the available data is given in income intervals, I used a simple weighted average to get a single number per city.



In [14]:

    
weights = [10000, 12500, 17500, 22500, 27500, 32500, 37500, 42500, 47500, 55000, 67500, 87500, 112500, 137500, 187500, 200000]



In [15]:

    
weights = pd.Series(weights, index=black.columns)



In [16]:

    
def weight_average(x):
    return (x * weights).sum() / x.sum()
black.head().apply(weight_average, axis=1)









    Out[16]:





Geography
Abanda CDP, Alabama                  NaN
Abbeville city, Alabama     27993.902439
Adamsville city, Alabama    58901.709402
Addison town, Alabama                NaN
Akron town, Alabama         29576.271186
dtype: float64
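To make the arithmetic concrete, the same computation can be written out in plain Python for two of the rows shown earlier. This is a sketch, not the notebook's code: the weights and counts are copied from the cells above, while `weighted_average` is a stand-in name that returns NaN explicitly when a place reports no counts at all.

```python
import math

# Bracket weights copied from the notebook, in dollars.
WEIGHTS = [10000, 12500, 17500, 22500, 27500, 32500, 37500, 42500,
           47500, 55000, 67500, 87500, 112500, 137500, 187500, 200000]


def weighted_average(counts, weights=WEIGHTS):
    """Average of the bracket weights, weighted by the count in each bracket."""
    total = sum(counts)
    if total == 0:
        return math.nan  # nothing reported for this place, so no average
    return sum(c * w for c, w in zip(counts, weights)) / total


# Per-bracket counts taken from the cleaned table above.
abbeville = [81, 47, 120, 7, 30, 5, 47, 16, 15, 6, 20, 4, 0, 12, 0, 0]
abanda = [0] * 16

print(round(weighted_average(abbeville), 6))  # 27993.902439
print(weighted_average(abanda))               # nan
```

For Abbeville the weighted sum is 11,477,500 over 410 counts, which gives the 27993.902439 shown in the output above.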



In [17]:

    
estimate_cols = [col for col in black.columns if col.startswith('Estimate; Total:')]



In [18]:

    
black['average'] = black[estimate_cols].apply(weight_average, axis=1)



In [19]:

    
black.head(2)









    Out[19]:






  
    
      
	Estimate; Total: - Less than $10,000	Estimate; Total: - $10,000 to $14,999	Estimate; Total: - $15,000 to $19,999	Estimate; Total: - $20,000 to $24,999	Estimate; Total: - $25,000 to $29,999	Estimate; Total: - $30,000 to $34,999	Estimate; Total: - $35,000 to $39,999	Estimate; Total: - $40,000 to $44,999	Estimate; Total: - $45,000 to $49,999	Estimate; Total: - $50,000 to $59,999	Estimate; Total: - $60,000 to $74,999	Estimate; Total: - $75,000 to $99,999	Estimate; Total: - $100,000 to $124,999	Estimate; Total: - $125,000 to $149,999	Estimate; Total: - $150,000 to $199,999	Estimate; Total: - $200,000 or more	average
Geography
Abanda CDP, Alabama	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	NaN
Abbeville city, Alabama	81	47	120	7	30	5	47	16	15	6	20	4	0	12	0	0	27993.902439



In [20]:

    
black.average.hist(bins=25, color='black')









    Out[20]:





<matplotlib.axes._subplots.AxesSubplot at 0x10cf75080>

White



In [21]:

    
white = pd.read_csv('WhiteIncome/ACS_13_5YR_B19001A_with_ann.csv', encoding='cp1252', skiprows=[0])



In [22]:

    
white.head(2)









    Out[22]:






  
    
      
	Id	Id2	Geography	Estimate; Total:	Margin of Error; Total:	Estimate; Total: - Less than $10,000	Margin of Error; Total: - Less than $10,000	Estimate; Total: - $10,000 to $14,999	Margin of Error; Total: - $10,000 to $14,999	Estimate; Total: - $15,000 to $19,999	...	Estimate; Total: - $75,000 to $99,999	Margin of Error; Total: - $75,000 to $99,999	Estimate; Total: - $100,000 to $124,999	Margin of Error; Total: - $100,000 to $124,999	Estimate; Total: - $125,000 to $149,999	Margin of Error; Total: - $125,000 to $149,999	Estimate; Total: - $150,000 to $199,999	Margin of Error; Total: - $150,000 to $199,999	Estimate; Total: - $200,000 or more	Margin of Error; Total: - $200,000 or more
0	1600000US0100100	100100	Abanda CDP, Alabama	23	25	0	11	11	17	0	...	0	11	0	11	0	11	0	11	0	11
1	1600000US0100124	100124	Abbeville city, Alabama	580	107	48	37	37	21	66	...	59	37	50	28	13	13	8	12	3	6
    
  

2 rows × 37 columns



In [23]:

    
white.set_index('Geography', inplace=True)
white.drop(['Id', 'Id2', 'Estimate; Total:'], axis=1, inplace=True)
margin_cols = [col for col in white.columns if col.startswith('Margin of Error')]
white.drop(margin_cols, axis=1, inplace=True)
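Since the same cleaning steps are applied to both tables, they could be wrapped in one function. The sketch below assumes both files share the column layout shown above; `clean_income_table` is a hypothetical helper name, and it returns a new frame instead of modifying its input in place.

```python
import pandas as pd


def clean_income_table(raw):
    """Index by 'Geography' and keep only the per-bracket estimate columns."""
    cleaned = raw.set_index("Geography")
    cleaned = cleaned.drop(["Id", "Id2", "Estimate; Total:"], axis=1)
    margin_cols = [c for c in cleaned.columns if c.startswith("Margin of Error")]
    return cleaned.drop(margin_cols, axis=1)


# Tiny stand-in for one of the CSV files, reduced to a single income bracket.
raw = pd.DataFrame({
    "Id": ["1600000US0100124"],
    "Id2": [100124],
    "Geography": ["Abbeville city, Alabama"],
    "Estimate; Total:": [410],
    "Margin of Error; Total:": [90],
    "Estimate; Total: - Less than $10,000": [81],
    "Margin of Error; Total: - Less than $10,000": [42],
})

print(clean_income_table(raw))
```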



In [24]:

    
estimate_cols = [col for col in white.columns if col.startswith('Estimate; Total:')]
white['average'] = white[estimate_cols].apply(weight_average, axis=1)



In [25]:

    
white.average.hist(bins=25, color='black')









    Out[25]:





<matplotlib.axes._subplots.AxesSubplot at 0x109eeb940>

These two histograms were not in the original article, but they are a nice extra to see how white and black incomes compare in general.

Combined



In [26]:

    
black_and_white = black[['average']].join(white[['average']], lsuffix='_black', rsuffix='_white')



In [27]:

    
black_and_white.head()









    Out[27]:






  
    
      
	average_black	average_white
Geography
Abanda CDP, Alabama	NaN	22934.782609
Abbeville city, Alabama	27993.902439	49422.413793
Adamsville city, Alabama	58901.709402	53250.750751
Addison town, Alabama	NaN	44384.328358
Akron town, Alabama	29576.271186	29398.148148



In [28]:

    
black_and_white['gap'] = black_and_white.average_white - black_and_white.average_black
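As a small check of the join and the gap, the same steps can be run on two of the cities shown above. The frames `black_avg` and `white_avg` are stand-ins built by hand from the printed averages, not the full tables.

```python
import pandas as pd

cities = ["Abbeville city, Alabama", "Adamsville city, Alabama"]

# Averages copied from the output above.
black_avg = pd.DataFrame({"average": [27993.902439, 58901.709402]}, index=cities)
white_avg = pd.DataFrame({"average": [49422.413793, 53250.750751]}, index=cities)

# Both frames have a column called 'average', so suffixes keep them apart.
combined = black_avg.join(white_avg, lsuffix="_black", rsuffix="_white")
combined["gap"] = combined["average_white"] - combined["average_black"]
print(combined)
```

A positive gap means the white average is higher (Abbeville), while a negative gap means the black average is higher (Adamsville).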



In [29]:

    
ax = black_and_white.dropna().plot(kind='scatter', x='average_black', y='gap', color='black', alpha=0.1)
black_and_white.loc[["Baltimore city, Maryland"]].plot(kind='scatter', x='average_black', y='gap', color='red', ax=ax, figsize=(8, 8))









    Out[29]:





<matplotlib.axes._subplots.AxesSubplot at 0x10a518dd8>

This scatter plot shows the distribution of average black income vs. the white-black income gap, as in the original article.

The only difference is that this plot shows all cities in the US. The vertical bars that appear in it are a consequence of the simple weighted average explained before.
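One plausible reading of those vertical bars (an interpretation, not something verified against the full data set) is that in small places all counts can fall into a single income bracket, so the weighted average equals that bracket's weight exactly and many places share the same x value. The towns below are hypothetical:

```python
# Bracket weights copied from the notebook, in dollars.
WEIGHTS = [10000, 12500, 17500, 22500, 27500, 32500, 37500, 42500,
           47500, 55000, 67500, 87500, 112500, 137500, 187500, 200000]


def weighted_average(counts):
    """Average of the bracket weights, weighted by the count in each bracket."""
    return sum(c * w for c, w in zip(counts, WEIGHTS)) / sum(counts)


def single_bracket(position, households):
    """Counts for a place where everything falls into one bracket."""
    counts = [0] * len(WEIGHTS)
    counts[position] = households
    return counts


# Hypothetical small places, each with all counts in one bracket.
places = {
    "Town A": single_bracket(1, 7),
    "Town B": single_bracket(1, 23),
    "Town C": single_bracket(4, 12),
}

averages = {name: weighted_average(counts) for name, counts in places.items()}
print(averages)
```

Towns A and B differ in size but land on exactly the same average, which would place them on the same vertical line of the scatter plot.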

Subset: 10% of black population

The article plotted only the cities with a black population of more than 10%, which removes some noise and results in a cleaner scatter plot.



In [33]:

    
races = pd.read_csv('races/ACS_13_5YR_B02001_with_ann.csv', encoding='cp1252', skiprows=[0])



In [34]:

    
races = races[['Geography', 'Estimate; Total:', 'Estimate; Total: - Black or African American alone']]



In [35]:

    
races = races.set_index('Geography')



In [36]:

    
black_percentage = races['Estimate; Total: - Black or African American alone'] / races['Estimate; Total:']



In [37]:

    
subset = black_and_white[black_percentage > 0.1]
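The mask above comes from a different table than the frame it filters, so the two indexes are not guaranteed to match. A more defensive variant aligns the share to the income frame first. This is a sketch with made-up places and numbers, and it assumes that places without population data should simply be left out of the subset.

```python
import pandas as pd

# Hypothetical income frame and population share with mismatched indexes.
incomes = pd.DataFrame(
    {"average_black": [30000.0, 45000.0, 52000.0],
     "gap": [15000.0, 8000.0, -2000.0]},
    index=["A city", "B town", "C CDP"],
)
share = pd.Series([0.25, 0.05], index=["A city", "B town"])  # "C CDP" is missing

# Align the share to the income frame; places without data become NaN.
aligned_share = share.reindex(incomes.index)

# NaN compares as False, so places without population data are dropped.
mask = aligned_share > 0.1
subset = incomes[mask]
print(subset)
```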



In [38]:

    
ax = subset.dropna().plot(kind='scatter', x='average_black', y='gap', color='black', alpha=0.1)
subset.loc[["Baltimore city, Maryland"]].plot(kind='scatter', x='average_black', y='gap', color='red', ax=ax, figsize=(8, 8))









    Out[38]:





<matplotlib.axes._subplots.AxesSubplot at 0x10f551eb8>

Interactive

With Bokeh it is possible to convert the static matplotlib image into an interactive JavaScript visualization.



In [39]:

    
source = plt.ColumnDataSource(
    data=dict(
        black_income=subset.average_black,
        gap=subset.gap,
        city=subset.index,
    )
)

p = plt.figure(tools='hover,reset,save',
               title='', width=530, height=530,
               x_axis_label="Average Black Income",
               y_axis_label="Black-white income gap")

p.scatter(subset.average_black, subset.gap, size=5, color="black", alpha=0.05, source=source)

hover = p.select(dict(type=HoverTool))

hover.tooltips = [
    ("City", "@city"),
    ("Average Black Income ", "@black_income"),
    ("B-W income gap", "@gap"),
]

plt.show(p)

The big cloud in the middle can get a little messy, since there are too many points close together and the tooltip will try to show all of them. It is more interesting to look at the outliers, such as Memphis, Alabama in the top left corner or Hensley, Arkansas as the lowest point.

Conclusion

This was a very simple article reproducing the results of "How Baltimore’s Young Black Men Are Boxed In". I highly recommend reading the original article for the conclusions the author presents there, since the objective of this article is merely to reproduce the results.

Beyond the simple code, I learned how much data is made available by the US government, in this case through the "American Fact Finder". The website might not be as friendly as other data sources, and you might have to spend a little time getting the data you want, but there is a lot of data that can lead to simple but interesting analyses like the original article.
