Copyright 2019 Google LLC. SPDX-License-Identifier: Apache-2.0
Notebook Version - 1.0.0
In [4]:
# Install datacommons
!pip install --upgrade --quiet git+https://github.com/datacommonsorg/api-python.git@stable-1.x
The American Community Survey (published by the US Census) annually reports the number of individuals in a given income bracket at the State level. We can use this information, stored in Data Commons, to visualize disparity in income for each State in the US. Our goal for this tutorial will be to generate a plot that visualizes the total number of individuals across a given set of income brackets for a given state.
Before we begin, we'll set up our notebook.
In [0]:
# Import the Data Commons library
import datacommons as dc
# Import other libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import json
from google.colab import drive
We will also need to provide the API with an API key. See the to see how to set this up for a Colab Notebook.
In [6]:
# Mount the Drive
drive.mount('/content/drive', force_remount=True)
# REPLACE THIS with the path to your key.
key_path = '/content/drive/My Drive/DataCommons/secret.json'
# Read the key in and provide it to the Data Commons API
with open(key_path, 'r') as f:
    secrets = json.load(f)
dc.set_api_key(secrets['dc_api_key'])
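The cell above reads a `dc_api_key` field from the JSON file, so `secret.json` is presumably a JSON object with that field. The following offline sketch shows that assumed layout. The helper name `load_dc_api_key` and the placeholder key value are illustrative only and are not part of the Data Commons library.

```python
import json
import os
import tempfile


def load_dc_api_key(path):
    """Return the 'dc_api_key' entry from a JSON secrets file (hypothetical helper)."""
    with open(path, 'r') as f:
        secrets = json.load(f)
    if 'dc_api_key' not in secrets:
        raise KeyError("expected a 'dc_api_key' field in {}".format(path))
    return secrets['dc_api_key']


# Write a throwaway file with the assumed layout, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, 'secret.json')
    with open(example_path, 'w') as f:
        json.dump({'dc_api_key': 'YOUR_API_KEY'}, f)
    example_key = load_dc_api_key(example_path)
```

Checking for the field up front gives a clearer error than a bare `KeyError` if the file exists but uses a different field name.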
In [7]:
# Initialize a DataFrame holding the USA.
data = pd.DataFrame({'country': ['country/USA']})
# Add a column for states and get their names
data['state'] = dc.get_places_in(data['country'], 'State')
data = dc.flatten_frame(data)
# Get all state names and store them in a column "name"
data['name'] = dc.get_property_values(data['state'], 'name')
data = dc.flatten_frame(data)
# Get StatisticalPopulations representing all persons in each state.
data['all_pop'] = dc.get_populations(data['state'], 'Person')
# Get the total count of all persons in each population
data['all'] = dc.get_observations(data['all_pop'],
'count',
'measuredValue',
'2017',
measurement_method='CenusACS5yrSurvey')
# Display the first five rows of the table.
data.head(5)
Out[7]:
Next, let's get the population level for each income bracket. The Data Commons graph identifies 16 different income brackets. For each bracket and state, we can get the population level. Remember that we first get the StatisticalPopulation, and then a corresponding Observation. We'll filter observations to those published in 2017 by the American Community Survey.
In [8]:
# A list of income brackets
income_brackets = [
"USDollarUpto10000",
"USDollar10000To14999",
"USDollar15000To19999",
"USDollar20000To24999",
"USDollar25000To29999",
"USDollar30000To34999",
"USDollar35000To39999",
"USDollar40000To44999",
"USDollar45000To49999",
"USDollar50000To59999",
"USDollar60000To74999",
"USDollar75000To99999",
"USDollar100000To124999",
"USDollar125000To149999",
"USDollar150000To199999",
"USDollar200000Onwards",
]
# Add a column containing the population count for each income bracket
for bracket in income_brackets:
    # Get the new column names
    pop_col = '{}_pop'.format(bracket)
    obs_col = bracket
    # Create the constraining properties map
    pvs = {'income': bracket}
    # Get the StatisticalPopulation and Observation
    data[pop_col] = dc.get_populations(data['state'], 'Household',
                                       constraining_properties=pvs)
    data[obs_col] = dc.get_observations(data[pop_col],
                                        'count',
                                        'measuredValue',
                                        '2017',
                                        measurement_method='CenusACS5yrSurvey')
# Display the table
data.head(5)
Out[8]:
Let's limit the size of this DataFrame by selecting columns with only the State name and Observations.
In [9]:
# Select columns that will be used for plotting
data = data[['name', 'all'] + income_brackets]
# Display the table
data.head(5)
Out[9]:
Let's plot our data as a histogram. Notice that the income ranges tabulated by the US Census are not equal in width. At the low end, the range is 0-9,999, whereas towards the top, the range 150,000-199,999 is five times as broad! We will make the width of each column correspond to its range, which gives us an idea of the total earnings, not just the number of people in that group.
First we provide code for generating the plot.
In [0]:
# Histogram bins
label_to_range = {
"USDollarUpto10000": [0, 9999],
"USDollar10000To14999": [10000, 14999],
"USDollar15000To19999": [15000, 19999],
"USDollar20000To24999": [20000, 24999],
"USDollar25000To29999": [25000, 29999],
"USDollar30000To34999": [30000, 34999],
"USDollar35000To39999": [35000, 39999],
"USDollar40000To44999": [40000, 44999],
"USDollar45000To49999": [45000, 49999],
"USDollar50000To59999": [50000, 59999],
"USDollar60000To74999": [60000, 74999],
"USDollar75000To99999": [75000, 99999],
"USDollar100000To124999": [100000, 124999],
"USDollar125000To149999": [125000, 149999],
"USDollar150000To199999": [150000, 199999],
"USDollar200000Onwards": [250000, 300000],
}
bins = [
0, 10000, 15000, 20000, 25000, 30000, 35000, 40000, 45000, 50000, 60000,
75000, 100000, 125000, 150000, 250000
]
def plot_income(data, state_name):
    # Check that "state_name" matches a row in the table
    frame_search = data.loc[data['name'] == state_name].squeeze()
    if frame_search.shape[0] == 0:
        print('{} does not have sufficient income data to generate the plot!'.format(state_name))
        return
    # Keep only the income bracket counts (skip the 'name' and 'all' columns)
    data = frame_search[2:]
    # Calculate the bar widths, proportional to each bracket's range
    lengths = []
    for bracket in income_brackets:
        r = label_to_range[bracket]
        lengths.append(int((r[1] - r[0]) / 18))
    # Calculate the x-axis position (center) of each bar
    pos, total = [], 0
    for l in lengths:
        pos.append(total + (l // 2))
        total += l
    # Plot the histogram
    plt.figure(figsize=(12, 10))
    plt.xticks(pos, income_brackets, rotation=90)
    plt.grid(True)
    plt.bar(pos, data.values, lengths, color='b', alpha=0.3)
    # Return the selected row.
    return frame_search
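To see how the bar widths and positions in `plot_income` work out, here is a small offline sketch of the same arithmetic on three of the brackets. The helper name `bar_geometry` is hypothetical and introduced only for illustration. The divisor of 18 is copied from the function above.

```python
def bar_geometry(ranges, scale=18):
    """Return (centers, widths) for bars laid side by side.

    `ranges` is a list of [low, high] pairs; `scale` mirrors the divisor
    used in plot_income. Hypothetical helper for illustration only.
    """
    widths = [int((high - low) / scale) for low, high in ranges]
    centers, left_edge = [], 0
    for width in widths:
        # Each bar is centered half a width past the running left edge.
        centers.append(left_edge + width // 2)
        left_edge += width
    return centers, widths


# Three brackets taken from label_to_range above.
example_ranges = [[0, 9999], [10000, 14999], [50000, 59999]]
centers, widths = bar_geometry(example_ranges)
print(widths)   # [555, 277, 555]
print(centers)  # [277, 693, 1109]
```

The first and third bars come out about twice as wide as the second, matching the 10,000 versus 5,000 dollar ranges, and each bar starts where the previous one ends.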
We can then call this code with a state to plot the income bracket sizes.
In [11]:
#@title Enter State to plot { run: "auto" }
state_name = "Tennessee" #@param ["Missouri", "Arkansas", "Arizona", "Ohio", "Connecticut", "Vermont", "Illinois", "South Dakota", "Iowa", "Oklahoma", "Kansas", "Washington", "Oregon", "Hawaii", "Minnesota", "Idaho", "Alaska", "Colorado", "Delaware", "Alabama", "North Dakota", "Michigan", "California", "Indiana", "Kentucky", "Nebraska", "Louisiana", "New Jersey", "Rhode Island", "Utah", "Nevada", "South Carolina", "Wisconsin", "New York", "North Carolina", "New Hampshire", "Georgia", "Pennsylvania", "West Virginia", "Maine", "Mississippi", "Montana", "Tennessee", "New Mexico", "Massachusetts", "Wyoming", "Maryland", "Florida", "Texas", "Virginia"]
result = plot_income(data, state_name)
# Show the plot
plt.show()
We can also display the raw table of values.
In [12]:
# Additionally print the table of income bracket sizes
result
Out[12]:
This is only the beginning! What else can you analyze? For example, you could try computing a measure of income disparity in each state (see Gini Coefficient).
You could then expand the DataFrame to include more information and analyze how attributes like education level, crime, or even weather affect income disparity.
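As a starting point for the Gini coefficient idea above, here is a rough sketch that estimates it from bracketed counts. It assumes every household in a bracket earns the bracket midpoint, which is a simplification, and the function name `gini_from_brackets` and the example counts are made up for illustration rather than taken from Data Commons.

```python
import numpy as np


def gini_from_brackets(midpoints, counts):
    """Approximate Gini coefficient from grouped data.

    Treats every household in a bracket as earning the bracket midpoint
    (a simplifying assumption), then integrates the Lorenz curve with
    the trapezoid rule.
    """
    midpoints = np.asarray(midpoints, dtype=float)
    counts = np.asarray(counts, dtype=float)
    order = np.argsort(midpoints)
    midpoints, counts = midpoints[order], counts[order]

    # Cumulative shares of households (x) and of income (y), starting at 0.
    pop_share = np.concatenate([[0.0], np.cumsum(counts) / counts.sum()])
    income = midpoints * counts
    income_share = np.concatenate([[0.0], np.cumsum(income) / income.sum()])

    # Area under the Lorenz curve via trapezoids; Gini = 1 - 2 * area.
    area = np.sum((pop_share[1:] - pop_share[:-1]) *
                  (income_share[1:] + income_share[:-1]) / 2.0)
    return 1.0 - 2.0 * area


# Made-up counts for three brackets, for illustration only.
example_gini = gini_from_brackets([5000, 12500, 17500], [10, 20, 30])
```

To apply this to the table, one could take midpoints from `label_to_range` and counts from the row returned by `plot_income`. The result will depend on the value chosen for the open-ended top bracket.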