This notebook provides working examples of many of the concepts introduced earlier:
Using the 2010 surname data from the US Census, we will develop a workflow to accomplish the following:
Decennial Census Surname Files (2010)
https://www.census.gov/data/developers/data-sets/surnames.html
https://api.census.gov/data/2010/surname.html
US Census Bureau (2016). Decennial Census Surname Files (2010). Retrieved from https://api.census.gov/data/2010/surname.json
The modules used in this exercise are popular and under active development. Follow the links for more information about methods, syntax, etc.
Requests: http://docs.python-requests.org/en/master/
JSON: https://docs.python.org/3/library/json.html
Pandas: http://pandas.pydata.org/
Matplotlib: https://matplotlib.org/
Look for information about or links to the API, developer's documentation, etc. Helpful examples are often included.
Note that we are providing aliases for Pandas (pd) and matplotlib.pyplot (plt). Whenever we need to call a method from one of those modules, we can use the alias.
In [2]:
# http://api.census.gov/data/2010/surname
import requests
import json
import pandas as pd
import matplotlib.pyplot as plt
The dataset in our example is not excessively large, so we can explore different approaches to interacting with it:
Both have pros and cons. Both are used in the following examples.
In order to make a human readable table we need to:
In [4]:
# First, get the basic info about the dataset.
# References: Dataset API (https://api.census.gov/data/2010/surname.html)
# Requests API (http://docs.python-requests.org/en/master/)
# Python 3 JSON API (https://docs.python.org/3/library/json.html)
api_base_url = "http://api.census.gov/data/2010/surname"
api_info = requests.get(api_base_url)
api_json = api_info.json()
# Uncomment the next line(s) to see the response content.
# NOTE: JSON and TEXT don't look much different to us. They can look very different to a machine!
#print(api_info.text)
print(json.dumps(api_json, indent=4))
# The output is a dictionary - data are stored as key:value pairs and can be nested.
In [5]:
# Request and store a local copy of the dataset variables.
# Note that the URL could be hard coded just from referencing the API, but
# we are navigating the JSON data.
var_link = api_json['dataset'][0]['c_variablesLink']
print(var_link)
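Chained key lookups like the one above raise a KeyError or IndexError if the response is shaped differently than expected. The following is a minimal sketch of a more defensive lookup. Here `sample_json` is a hypothetical, trimmed stand-in for the API response (only the `dataset` / `c_variablesLink` path is taken from the cell above, and the link value is a placeholder), and `get_variables_link` is an illustrative helper, not part of any library.

```python
# Hypothetical, trimmed stand-in for the API response; the link is a placeholder.
sample_json = {
    "dataset": [
        {"c_variablesLink": "https://example.org/variables.json"}
    ]
}


def get_variables_link(payload):
    """Return the variables link of the first dataset entry, or None if absent."""
    datasets = payload.get("dataset") or []
    if not datasets:
        return None
    return datasets[0].get("c_variablesLink")


print(get_variables_link(sample_json))  # the placeholder link
print(get_variables_link({}))           # None, instead of raising KeyError
```

Returning None leaves it to the caller to decide whether a missing link is an error or something to skip.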
In [8]:
# Use the variable info link to make a new request
variables = requests.get(var_link)
jsonData = variables.json()
variable_data = jsonData['variables']
# Note that this is a dictionary of dictionaries.
print(json.dumps(variable_data, indent=4))
In [9]:
print(variable_data.keys())
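Because the variable data is a dictionary of dictionaries, a flat code-to-label lookup can be built in one pass. This is a sketch under assumptions: `sample_variables` is a hypothetical excerpt shaped like the variables dictionary, and its labels are placeholders rather than values copied from the API.

```python
# Hypothetical excerpt shaped like the variables dictionary.
sample_variables = {
    "NAME": {"label": "Surname"},
    "RANK": {"label": "National Rank"},
    "COUNT": {},  # entry without a label, to show the fallback
}

# Map each variable code to its label, falling back to the code itself.
code_to_label = {
    code: info.get("label", code) for code, info in sample_variables.items()
}

for code in sorted(code_to_label):
    print(code, ":", code_to_label[code])
```

The fallback to the code itself means a variable without a label still gets a usable column header.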
In [11]:
# References: Pandas (http://pandas.pydata.org/)
# Default vars: 'RANK,COUNT,PCTWHITE,PCTAPI,PCT2PRACE,PCTAIAN,PCTBLACK,PCTHISPANIC'
desired_vars = 'NAME,COUNT,PCTWHITE,PCTAPI,PCT2PRACE,PCTAIAN,PCTBLACK,PCTHISPANIC&RANK=1:10' # Top ten names
base_url = 'http://api.census.gov/data/2010/surname?get='
query_url = base_url + desired_vars
name_stats = requests.get(query_url)
surname_data = name_stats.json()
# The response data are not very human readable.
# Note that this is a list of lists. Data within lists are typically accessed by position number. (There are no keys.)
print('Raw response data:\n')
print(json.dumps(surname_data, indent=4))
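The query string above is assembled by hand. As an alternative sketch, the same URL can be built from a dictionary with the standard library's `urllib.parse.urlencode`. The shortened variable list and the `params` name are choices made for this illustration only.

```python
from urllib.parse import urlencode

base_url = "http://api.census.gov/data/2010/surname"
params = {
    "get": "NAME,COUNT,PCTWHITE",  # variables to return (shortened for the example)
    "RANK": "1:10",                # predicate used in the cell above
}

# Keep commas and colons unescaped so the result matches the hand-built form.
query_url = base_url + "?" + urlencode(params, safe=",:")
print(query_url)
# http://api.census.gov/data/2010/surname?get=NAME,COUNT,PCTWHITE&RANK=1:10
```

Keeping the parameters in a dictionary makes it easier to change one of them without editing a long string.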
Laying out the API response like a table helps illustrate what we're doing here. For easier reading, the "surname_data" variable has been replaced with "d" in the image below.
The variable codes in d[0] will be replaced with human-readable descriptions from the variable list (v).
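The header replacement described above can be sketched without Pandas. The names `d` and `v` follow the text, but their contents here are hypothetical placeholders, not real census values.

```python
# Hypothetical response: the first row holds variable codes, the rest hold values.
d = [
    ["NAME", "COUNT", "RANK"],
    ["ALPHA", "300", "1"],
    ["BETA", "200", "2"],
]
# Hypothetical variable list keyed by variable code.
v = {
    "NAME": {"label": "Surname"},
    "COUNT": {"label": "Frequency"},
    "RANK": {"label": "National Rank"},
}

header = [v[code]["label"] for code in d[0]]         # replace codes with labels
records = [dict(zip(header, row)) for row in d[1:]]  # one dictionary per data row

for record in records:
    print(record)
```

Each record pairs a readable label with its value, which is essentially what the dataframe columns provide.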
In [16]:
# Pass the data to a Pandas dataframe.
# In addition to being easier to read, dataframes simplify further analysis.
# The simplest dataframe would use the variable names returned with the data. Example: PCTWHITE
# It's easier to read the descriptive labels provided via the variables API.
# The code block below replaces variable names with labels as it builds the dataframe.
column_list = []
for each in surname_data[0]:  # For each variable in the response data (stored as surname_data[0])
    label = variable_data[each]['label']  # look up that variable's label in the variable dictionary
    column_list.append(label)  # add the variable's label to the list of column headers
    print(each, ":", label)
print('\n', column_list)
In [31]:
# Create a dataframe using the column names created above. Data for the
# dataframe come from every row after the header row (surname_data[1:]).
df = pd.DataFrame(surname_data[1:], columns=column_list)
# Convert the rank to a number so the sort is numeric even if the values arrive as text.
df["National Rank"] = pd.to_numeric(df["National Rank"])
print('\n\nPandas dataframe:')
df.sort_values(by=["National Rank"])
Out[31]:
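One thing to watch when sorting: if the rank values are stored as text, as the '(S)' comparisons later in this notebook suggest for the percentages, they sort character by character rather than numerically. A small sketch with made-up values:

```python
ranks = ["1", "2", "10", "3"]  # made-up values, stored as strings

print(sorted(ranks))           # text order: ['1', '10', '2', '3']
print(sorted(ranks, key=int))  # numeric order: ['1', '2', '3', '10']
```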
In [23]:
# Try 'STEUBEN' in order to break the first pie chart example.
# Update 2020-02-26: Surnames should be all caps!
name = 'WHEELER'
name_query = '&NAME=' + name
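Since surnames should be all caps, the name can be normalized before the query fragment is built. This is a minimal sketch; `build_name_query` is a hypothetical helper written for this example, and the URL quoting is a general precaution rather than a documented API requirement.

```python
from urllib.parse import quote


def build_name_query(raw_name):
    """Return a '&NAME=...' fragment with the surname trimmed, upper-cased, and URL-quoted."""
    cleaned = raw_name.strip().upper()
    return "&NAME=" + quote(cleaned)


print(build_name_query(" wheeler "))  # &NAME=WHEELER
```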
Referring to the variables API, decide which variables are of interest and edit accordingly.
In [24]:
# Default vars: 'RANK,COUNT,PCTWHITE,PCTAPI,PCT2PRACE,PCTAIAN,PCTBLACK,PCTHISPANIC'
desired_vars = 'RANK,COUNT,PCTWHITE,PCTAPI,PCT2PRACE,PCTAIAN,PCTBLACK,PCTHISPANIC'
Build the query URL and send the request. Pass the response data into a Pandas dataframe for viewing.
In [25]:
# References: Pandas (http://pandas.pydata.org/)
base_url = 'http://api.census.gov/data/2010/surname?get='
query_url = base_url + desired_vars + name_query
name_stats = requests.get(query_url)
d = name_stats.json()
# The response data are not very human readable.
print('Raw response data:\n')
print(d)
# Pass the data to a Pandas dataframe.
# In addition to being easier to read, dataframes simplify further analysis.
# The simplest dataframe would use the variable names returned with the data. Example: PCTWHITE
# It's easier to read the descriptive labels provided via the variables API.
# The code block below replaces variable names with labels as it builds the dataframe.
column_list = []
for each in d[0]:  # For each variable in the response data (stored as d[0])
    label = variable_data[each]['label']  # look up that variable's label in the variable dictionary
    column_list.append(label)  # add the variable's label to the list of column headers
# Create a dataframe using the column names created above.
# Data for the dataframe comes from d[1].
df = pd.DataFrame([d[1]], columns=column_list)
print('\n\nPandas dataframe:')
df
Out[25]:
In [26]:
# Using index positions is good for doing something quick, but in this case makes code easy to break.
# Selecting different surname dataset variables or re-ordering variables will result in errors.
print(d)
pcts = d[1][2:8]
print('\n\n',pcts)
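One way to avoid depending on index positions is to pair the header row with the data row and select values by variable name. This is a sketch with hypothetical placeholder values; the `wanted` list is an assumption made for the example.

```python
# Hypothetical response for a single surname; the values are placeholders.
d = [
    ["RANK", "COUNT", "PCTWHITE", "PCTBLACK"],
    ["100", "5000", "60.0", "40.0"],
]
wanted = ["PCTWHITE", "PCTBLACK"]

row = dict(zip(d[0], d[1]))            # map each variable code to its value
pcts = [row[code] for code in wanted]  # select by name, independent of column order
print(pcts)
```

Re-ordering the requested variables no longer changes which values are selected, and a missing variable raises a KeyError that names it.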
In [27]:
# Create the labels and get the data for the pie chart.
# Note that we are using the downloaded source data, not the dataframe
# used for the table above.
labels = ['White', 'Asian', '2+ Races', 'Native American', 'Black', 'Hispanic']
pcts = d[1][2:8]
#print(pcts)
# Create a pie chart (https://matplotlib.org/2.0.2/examples/pie_and_polar_charts/pie_demo_features.html)
plt.pie(
    # using data percentages
    pcts,
    # Use labels defined above
    labels=labels,
    # with no shadows
    shadow=False,
    # with the start angle at 90 degrees
    startangle=90,
    # with each percentage shown to one decimal place
    autopct='%1.1f%%',
)
# Use equal axis scaling so the pie is drawn as a circle
plt.axis('equal')
# View the plot
plt.tight_layout()
plt.show()
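The chart above fails when a percentage is reported as the string '(S)' rather than a number, which is the value the following cells test for. Below is a minimal sketch of detecting such values before plotting; `to_number` is an illustrative helper and the sample list is made up.

```python
def to_number(value):
    """Convert a value to float, returning None when it is not numeric (for example '(S)')."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


sample = ["75.5", "(S)", "24.5"]  # made-up values for illustration
print([to_number(item) for item in sample])  # [75.5, None, 24.5]
```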
In [28]:
# First try - just replace the string with a zero.
# Here, the for loop iterates through items in a list.
pcts2 = []
for p in pcts:
    if p != '(S)':
        pcts2.append(float(p))  # convert the value to a number
    else:
        pcts2.append(0)
# Create a pie chart (https://matplotlib.org/2.0.2/examples/pie_and_polar_charts/pie_demo_features.html)
plt.pie(
    # using data percentages
    pcts2,
    # Use labels defined above
    labels=labels,
    # with no shadows
    shadow=False,
    # with the start angle at 90 degrees
    startangle=90,
    # with each percentage shown to one decimal place
    autopct='%1.1f%%',
)
# Use equal axis scaling so the pie is drawn as a circle
plt.axis('equal')
# View the plot
plt.tight_layout()
plt.show()
In [29]:
# Second try - exclude the value and its corresponding label if the source data for a given demographic == (S)
# This requires the list index of the data and the label.
# The for loop in this case iterates across a range of integers equal to the length of the list.
pcts3 = []
edit_labels = []
for i in range(len(pcts)):
    print(pcts[i])
    if pcts[i] != '(S)':
        pcts3.append(float(pcts[i]))  # convert the value to a number
        edit_labels.append(labels[i])
# Create a pie chart (https://matplotlib.org/2.0.2/examples/pie_and_polar_charts/pie_demo_features.html)
plt.pie(
    # using data percentages
    pcts3,
    # Use the filtered labels built above
    labels=edit_labels,
    # with no shadows
    shadow=False,
    # with the start angle at 90 degrees
    startangle=90,
    # with each percentage shown to one decimal place
    autopct='%1.1f%%',
)
# Use equal axis scaling so the pie is drawn as a circle
plt.axis('equal')
# View the plot
plt.tight_layout()
plt.show()
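The index-based loop above can also be written by pairing labels and values with `zip`, which avoids tracking positions by hand. This is an alternative sketch, not a replacement for the approach shown; the values are made up for illustration.

```python
labels = ["White", "Asian", "2+ Races"]
pcts = ["75.5", "(S)", "24.5"]  # made-up values for illustration

# Keep only the (label, value) pairs whose value is not the '(S)' marker.
kept = [(label, float(p)) for label, p in zip(labels, pcts) if p != "(S)"]
edit_labels = [label for label, _ in kept]
pcts3 = [value for _, value in kept]

print(edit_labels, pcts3)  # ['White', '2+ Races'] [75.5, 24.5]
```

Because labels and values are filtered together, the two lists cannot drift out of alignment.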