Copyright 2019 Google LLC. SPDX-License-Identifier: Apache-2.0

Notebook Version - 1.0.0


In [4]:
# Install datacommons
!pip install --upgrade --quiet git+https://github.com/datacommonsorg/api-python.git@stable-1.x


  Building wheel for datacommons (setup.py) ... done

Analyzing Income Distribution

The American Community Survey (published by the US Census) annually reports the number of individuals in a given income bracket at the State level. We can use this information, stored in Data Commons, to visualize disparity in income for each State in the US. Our goal for this tutorial will be to generate a plot that visualizes the total number of individuals across a given set of income brackets for a given state.

Before we begin, we'll setup our notebook


In [0]:
# Import the Data Commons library
import datacommons as dc

# Import other libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import json

from google.colab import drive

We will also need to provide the API with an API key. See the to see how to set this up for a Colab Notebook.


In [6]:
# Mount the Drive
drive.mount('/content/drive', force_remount=True)

# REPLACE THIS with the path to your key.
key_path = '/content/drive/My Drive/DataCommons/secret.json'

# Read the key in and provide it to the Data Commons API
with open(key_path, 'r') as f:
  secrets = json.load(f)
  dc.set_api_key(secrets['dc_api_key'])


Mounted at /content/drive

Preparing the Data

We'll begin by creating a dataframe with states and their total population. We can use get_places_in to get all States within the United States. We can then call get_populations and get_observations to get the population of all persons in each State.


In [7]:
# Initialize a DataFrame holding the USA.
data = pd.DataFrame({'country': ['country/USA']})

# Add a column for states and get their names
data['state'] = dc.get_places_in(data['country'], 'State')
data = dc.flatten_frame(data)

# Get all state names and store it in a column "name"
data['name'] = dc.get_property_values(data['state'], 'name')
data = dc.flatten_frame(data)

# Get StatisticalPopulations representing all persons in each state. 
data['all_pop'] = dc.get_populations(data['state'], 'Person')

# Get the total count of all persons in each population
data['all'] = dc.get_observations(data['all_pop'], 
    'count',
    'measuredValue',
    '2017', 
    measurement_method='CenusACS5yrSurvey')

# Display the first five rows of the table.
data.head(5)


Out[7]:
country state name all_pop all
0 country/USA geoId/01 Alabama dc/p/dft95nqhpkkgd 4850771.0
1 country/USA geoId/02 Alaska dc/p/th9yyf8ej6m64 738565.0
2 country/USA geoId/04 Arizona dc/p/8hyhwze6s5tn6 6809946.0
3 country/USA geoId/05 Arkansas dc/p/rp61n0st0vtr 2977944.0
4 country/USA geoId/06 California dc/p/f1jbe2m7w6f58 38982847.0

Querying for Income Brackets

Next, let's get the population level for each income bracket. The datacommons graph identifies 16 different income brackets. For each bracket and state, we can get the population level. Remember that we first get the StatisticalPopulation, and then a corresponding observation. We'll filter observations to between published in 2017 by the American Community Survey.


In [8]:
# A list of income brackets
income_brackets = [
  "USDollarUpto10000",
  "USDollar10000To14999",
  "USDollar15000To19999",
  "USDollar20000To24999",
  "USDollar25000To29999",
  "USDollar30000To34999",
  "USDollar35000To39999",
  "USDollar40000To44999",
  "USDollar45000To49999",
  "USDollar50000To59999",
  "USDollar60000To74999",
  "USDollar75000To99999",
  "USDollar100000To124999",
  "USDollar125000To149999",
  "USDollar150000To199999",
  "USDollar200000Onwards",
]

# Add a column containin the population count for each income bracket
for bracket in income_brackets:
  # Get the new column names
  pop_col = '{}_pop'.format(bracket)
  obs_col = bracket
 
  # Create the constraining properties map
  pvs = {'income': bracket}

  # Get the StatisticalPopulation and Observation
  data[pop_col] = dc.get_populations(data['state'], 'Household', 
    constraining_properties=pvs)
  data[obs_col] = dc.get_observations(data[pop_col], 
    'count', 
    'measuredValue', 
    '2017', 
    measurement_method='CenusACS5yrSurvey')

# Display the table
data.head(5)


Out[8]:
country state name all_pop all USDollarUpto10000_pop USDollarUpto10000 USDollar10000To14999_pop USDollar10000To14999 USDollar15000To19999_pop USDollar15000To19999 USDollar20000To24999_pop USDollar20000To24999 USDollar25000To29999_pop USDollar25000To29999 USDollar30000To34999_pop USDollar30000To34999 USDollar35000To39999_pop USDollar35000To39999 USDollar40000To44999_pop USDollar40000To44999 USDollar45000To49999_pop USDollar45000To49999 USDollar50000To59999_pop USDollar50000To59999 USDollar60000To74999_pop USDollar60000To74999 USDollar75000To99999_pop USDollar75000To99999 USDollar100000To124999_pop USDollar100000To124999 USDollar125000To149999_pop USDollar125000To149999 USDollar150000To199999_pop USDollar150000To199999 USDollar200000Onwards_pop USDollar200000Onwards
0 country/USA geoId/01 Alabama dc/p/dft95nqhpkkgd 4850771.0 dc/p/k82f7vl1g4xwd 170650.0 dc/p/hrw93dxs7y4y 121688.0 dc/p/q3m0heyyphs14 116273.0 dc/p/9c6gwfklz5pdd 112406.0 dc/p/xlm581r7zpx51 102590.0 dc/p/4fgqdmf6e4stc 100223.0 dc/p/8j3sfby51kqtc 92139.0 dc/p/46tg31n5kyvj9 88987.0 dc/p/lq36bnvmybekc 76070.0 dc/p/tn6m8wemxs8p8 149492.0 dc/p/mgbnwm5rld8wg 176589.0 dc/p/w2b2s0967krx5 206436.0 dc/p/x2ylx680g94e 133618.0 dc/p/wphpwpkew18c1 76939.0 dc/p/p1zgxd7fdcqs 71748.0 dc/p/3hr2p57hjdrpg 60847.0
1 country/USA geoId/02 Alaska dc/p/th9yyf8ej6m64 738565.0 dc/p/glg0lwn58bgc6 9650.0 dc/p/hw4lqlcj11fgf 7838.0 dc/p/w4gre87nr3p47 8013.0 dc/p/4hhydjfw16c43 8854.0 dc/p/ceecx2c5xbgn9 8383.0 dc/p/426jehzy44g3f 8914.0 dc/p/pefxk2krk2qp1 8998.0 dc/p/prskbr4e1sfyb 10163.0 dc/p/d5er7t9wwc66d 8536.0 dc/p/ktvrdr1gknnhf 18810.0 dc/p/plzhkpjz6wr78 26224.0 dc/p/fdc874r6enj2c 36745.0 dc/p/15sg657df7j5 29925.0 dc/p/7e3m1jl4r8052 19752.0 dc/p/3kv7sn1ldgwm2 23596.0 dc/p/jb4z82wlf0126 18135.0
2 country/USA geoId/04 Arizona dc/p/8hyhwze6s5tn6 6809946.0 dc/p/6rf6dsjxvbr5f 178384.0 dc/p/hll4exlmk8jtc 118676.0 dc/p/b6tpkyz295xm8 123075.0 dc/p/lx28sr11tdqq7 131903.0 dc/p/efkmmwsclfnfh 127741.0 dc/p/33edcg1c0100h 127570.0 dc/p/17q967xck0c01 119431.0 dc/p/6m7ch4wy4b7xb 125103.0 dc/p/385p6jvm97wxg 107245.0 dc/p/3f837b5vqk8h3 208318.0 dc/p/696jjk7e08hh1 254294.0 dc/p/pwcgh30nt42r1 303811.0 dc/p/nghgrgdm0tpv7 200818.0 dc/p/6blwss6nsxz09 119347.0 dc/p/fltk974mm3kj4 119778.0 dc/p/2k3tkfp68b293 116817.0
3 country/USA geoId/05 Arkansas dc/p/rp61n0st0vtr 2977944.0 dc/p/vgcvlmnj54ydh 97226.0 dc/p/7ww6w7t4xvlk7 78535.0 dc/p/efynsmzwc5823 74451.0 dc/p/jp3jptvfwn9nd 75194.0 dc/p/66tt8rr0fm3bb 70211.0 dc/p/23qm6tpqyt2b5 67958.0 dc/p/z932k4dhz67dc 61588.0 dc/p/fct1f5rb8ev42 60814.0 dc/p/q3zges6zxwgmh 50978.0 dc/p/rd41ltemvb8z8 95836.0 dc/p/dfbfs6lbx10vb 111618.0 dc/p/hzy4wfxdpvzfh 120034.0 dc/p/yc2terhkh6d78 72752.0 dc/p/w4fte21fpn4wd 40780.0 dc/p/bl9k2gysnpb69 35900.0 dc/p/bjjw2gv79ev25 33416.0
4 country/USA geoId/06 California dc/p/f1jbe2m7w6f58 38982847.0 dc/p/dgbhzzzkwdj28 694945.0 dc/p/xddppkel4sgr9 604666.0 dc/p/qgyd003dbv7p6 536203.0 dc/p/xzf8djy5dt8bd 568994.0 dc/p/7x9127m9stem7 523563.0 dc/p/v6wcvfjc3bvfd 539988.0 dc/p/qn2fz1xggyhkc 500571.0 dc/p/pkbcstte9lq28 509113.0 dc/p/mfy8k6ejt1k0d 456152.0 dc/p/1gy8f20jdlwng 895758.0 dc/p/g9ly5ee1c7bf3 1199773.0 dc/p/kc59r4x1tedq4 1568843.0 dc/p/3c9483vks6ev7 1206637.0 dc/p/bc518fz7gl1f6 818690.0 dc/p/cx6fr5lfvhtx5 1008388.0 dc/p/2vdzt089wp0c1 1255844.0

Let's limit the size of this DataFrame by selecting columns with only the State name and Observations.


In [9]:
# Select columns that will be used for plotting
data = data[['name', 'all'] + income_brackets]

# Display the table
data.head(5)


Out[9]:
name all USDollarUpto10000 USDollar10000To14999 USDollar15000To19999 USDollar20000To24999 USDollar25000To29999 USDollar30000To34999 USDollar35000To39999 USDollar40000To44999 USDollar45000To49999 USDollar50000To59999 USDollar60000To74999 USDollar75000To99999 USDollar100000To124999 USDollar125000To149999 USDollar150000To199999 USDollar200000Onwards
0 Alabama 4850771.0 170650.0 121688.0 116273.0 112406.0 102590.0 100223.0 92139.0 88987.0 76070.0 149492.0 176589.0 206436.0 133618.0 76939.0 71748.0 60847.0
1 Alaska 738565.0 9650.0 7838.0 8013.0 8854.0 8383.0 8914.0 8998.0 10163.0 8536.0 18810.0 26224.0 36745.0 29925.0 19752.0 23596.0 18135.0
2 Arizona 6809946.0 178384.0 118676.0 123075.0 131903.0 127741.0 127570.0 119431.0 125103.0 107245.0 208318.0 254294.0 303811.0 200818.0 119347.0 119778.0 116817.0
3 Arkansas 2977944.0 97226.0 78535.0 74451.0 75194.0 70211.0 67958.0 61588.0 60814.0 50978.0 95836.0 111618.0 120034.0 72752.0 40780.0 35900.0 33416.0
4 California 38982847.0 694945.0 604666.0 536203.0 568994.0 523563.0 539988.0 500571.0 509113.0 456152.0 895758.0 1199773.0 1568843.0 1206637.0 818690.0 1008388.0 1255844.0

Analyzing the Data

Let's plot our data as a histogram. Notice that the income ranges as tabulated by the US Census are not equal. At the low end, the range is 0-9999, whereas, towards the top, the range 150,000-199,999 is five times as broad! We will make the width of each of the columns correspond to their range, and will give us an idea of the total earnings, not just the number of people in that group.

First we provide code for generating the plot.


In [0]:
# Histogram bins
label_to_range = {
  "USDollarUpto10000": [0, 9999],
  "USDollar10000To14999": [10000, 14999],
  "USDollar15000To19999": [15000, 19999],
  "USDollar20000To24999": [20000, 24999],
  "USDollar25000To29999": [25000, 29999],
  "USDollar30000To34999": [30000, 34999],
  "USDollar35000To39999": [35000, 39999],
  "USDollar40000To44999": [40000, 44999],
  "USDollar45000To49999": [45000, 49999],
  "USDollar50000To59999": [50000, 59999],
  "USDollar60000To74999": [60000, 74999],
  "USDollar75000To99999": [75000, 99999],
  "USDollar100000To124999": [100000, 124999],
  "USDollar125000To149999": [125000, 149999],
  "USDollar150000To199999": [150000, 199999],
  "USDollar200000Onwards": [250000, 300000],
}
bins = [
  0, 10000, 15000, 20000, 25000, 30000, 35000, 40000, 45000, 50000, 60000, 
  75000, 100000, 125000, 150000, 250000
]

def plot_income(data, state_name):
  # Assert that "state_name" is a valid state name
  frame_search = data.loc[data['name'] == state_name].squeeze()
  if frame_search.shape[0] == 0:
    print('{} does not have sufficient income data to generate the plot!'.format(state_name))
    return
  
  # Print the resulting series
  data = frame_search[2:]
  
  # Calculate the bar lengths
  lengths = []
  for bracket in income_brackets:
    r = label_to_range[bracket] 
    lengths.append(int((r[1] - r[0]) / 18))
  
  # Calculate the x-axis positions
  pos, total = [], 0
  for l in lengths:
    pos.append(total + (l // 2))
    total += l
  
  # Plot the histogram
  plt.figure(figsize=(12, 10))
  plt.xticks(pos, income_brackets, rotation=90)
  plt.grid(True)
  plt.bar(pos, data.values, lengths, color='b', alpha=0.3)
  
  # Return the resulting frame.
  return frame_search

We can then call this code with a state to plot the income bracket sizes.


In [11]:
#@title Enter State to plot { run: "auto" }
state_name = "Tennessee" #@param ["Missouri", "Arkansas", "Arizona", "Ohio", "Connecticut", "Vermont", "Illinois", "South Dakota", "Iowa", "Oklahoma", "Kansas", "Washington", "Oregon", "Hawaii", "Minnesota", "Idaho", "Alaska", "Colorado", "Delaware", "Alabama", "North Dakota", "Michigan", "California", "Indiana", "Kentucky", "Nebraska", "Louisiana", "New Jersey", "Rhode Island", "Utah", "Nevada", "South Carolina", "Wisconsin", "New York", "North Carolina", "New Hampshire", "Georgia", "Pennsylvania", "West Virginia", "Maine", "Mississippi", "Montana", "Tennessee", "New Mexico", "Massachusetts", "Wyoming", "Maryland", "Florida", "Texas", "Virginia"]
result = plot_income(data, state_name)

# Show the plot
plt.show()


and we can display the raw table of values.


In [12]:
# Additionally print the table of income bracket sizes 
result


Out[12]:
name                        Tennessee
all                       6.59738e+06
USDollarUpto10000              200034
USDollar10000To14999           146404
USDollar15000To19999           147660
USDollar20000To24999           152867
USDollar25000To29999           142625
USDollar30000To34999           140211
USDollar35000To39999           132737
USDollar40000To44999           127434
USDollar45000To49999           110586
USDollar50000To59999           213266
USDollar60000To74999           257530
USDollar75000To99999           292764
USDollar100000To124999         181732
USDollar125000To149999         102356
USDollar150000To199999          99880
USDollar200000Onwards           99108
Name: 42, dtype: object

This is only the beginning! What else can you analyze? For example, you could try computing a measure of income disparity in each state (see Gini Coefficient).

You could then expand the dataframe to include more information and analyze how attributes like education level, crime, or even weather effect income disparity.