An Introductory Python Workflow: US Census Surname Data

This notebook provides working examples of many of the concepts introduced earlier:

Importing modules or libraries to extend basic Python functionality
Declaring and using variables
Python data types and data structures
Flow control

Using the 2010 surname data from the US Census, we will develop a workflow to accomplish the following:

Retrieve information about the dataset API
Retrieve data about a single surname
Output surname data in tabular form
Visualize surname data using a pie chart

Sample dataset

Decennial Census Surname Files (2010)

https://www.census.gov/data/developers/data-sets/surnames.html

https://api.census.gov/data/2010/surname.html

Citation

US Census Bureau (2016) Decennial Census Surname Files (2010) Retrieved from https://api.census.gov/data/2010/surname.json

1. Import modules

The modules used in this exercise are popular and under active development. Follow the links for more information about methods, syntax, etc.

Requests: http://docs.python-requests.org/en/master/

JSON: https://docs.python.org/3/library/json.html

Pandas: http://pandas.pydata.org/

Matplotlib: https://matplotlib.org/

Look for information about or links to the API, developer's documentation, etc. Helpful examples are often included.

Note that we are providing an alias for Pandas and matplotlib. Whenever we need to call a method from those module, we can use the alias.



In [2]:

    
# http://api.census.gov/data/2010/surname
import requests
import json
import pandas as pd
import matplotlib.pyplot as plt

2. Basic interactions with the Census dataset API

Combining data from two API endpoints into a human-readable table

The dataset in our example is not excessively large, so we can explore different approaches to interacting with it:

Download some or all data to the computer we're using ('local'). Keep the data in memory and do stuff.
Only download what we need when we need it. Doing stuff may require additional calls to the API.

Both have pros and cons. Both are used in the following examples.

Some points of interest:

The data are not provided in tabular form
Human-readable variable names and definitions are stored separately from the data

In order to make a human readable table we need to:

Download variable definitions
Download data
Replace shorthand variable codes with human readable names
Reformat the data into a table

Get API and variable information:



In [4]:

    
# First, get the basic info about the dataset.
# References: Dataset API (https://api.census.gov/data/2010/surname.html) 
#             Requests API (http://docs.python-requests.org/en/master/)
#             Python 3 JSON API (https://docs.python.org/3/library/json.html)

api_base_url = "http://api.census.gov/data/2010/surname"
api_info = requests.get(api_base_url) 
api_json = api_info.json() 

# Uncomment the next line(s) to see the response content.
# NOTE: JSON and TEXT don't look much different to us. They can look very different to a machine!
#print(api_info.text)
print(json.dumps(api_json, indent=4))

# The output is a dictionary - data are stored as key:value pairs and can be nested.









    



{
    "@context": "https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld",
    "@id": "https://api.census.gov/data/2010/surname.json",
    "@type": "dcat:Catalog",
    "conformsTo": "https://project-open-data.cio.gov/v1.1/schema",
    "describedBy": "https://project-open-data.cio.gov/v1.1/schema/catalog.json",
    "dataset": [
        {
            "c_vintage": 2010,
            "c_dataset": [
                "surname"
            ],
            "c_geographyLink": "https://api.census.gov/data/2010/surname/geography.json",
            "c_variablesLink": "https://api.census.gov/data/2010/surname/variables.json",
            "c_examplesLink": "https://api.census.gov/data/2010/surname/examples.json",
            "c_groupsLink": "https://api.census.gov/data/2010/surname/groups.json",
            "c_valuesLink": "https://api.census.gov/data/2010/surname/values.json",
            "c_documentationLink": "http://www.census.gov/developer/",
            "c_isAggregate": true,
            "c_isAvailable": true,
            "@type": "dcat:Dataset",
            "title": "2010 Decennial Census of Population and Housing: Surnames",
            "accessLevel": "public",
            "bureauCode": [
                "006:07"
            ],
            "description": "The Census Bureau's Census surnames product is a data release based on names recorded in the decennial census.  The product contains rank and frequency data on surnames reported 100 or more times in the decennial census, along with Hispanic origin and race category percentages.  The latter are suppressed where necessary for confidentiality.  The data focus on summarized aggregates of counts and characteristics associated with surnames, and the data do not in any way identify any specific individuals.",
            "distribution": [
                {
                    "@type": "dcat:Distribution",
                    "accessURL": "https://api.census.gov/data/2010/surname",
                    "description": "API endpoint",
                    "format": "API",
                    "mediaType": "application/json",
                    "title": "API endpoint"
                }
            ],
            "contactPoint": {
                "fn": "Joshua Comenetz",
                "hasEmail": "Joshua.Comenetz@census.gov"
            },
            "identifier": "http://api.census.gov/data/id/DecennialSurname2010",
            "keyword": [],
            "license": "http://creativecommons.org/publicdomain/zero/1.0/Public Domain",
            "modified": "2016-12-15",
            "programCode": [
                "006:004"
            ],
            "references": [
                "http://www.census.gov/developers/"
            ],
            "spatial": "United States",
            "temporal": "2010/2010",
            "publisher": {
                "@type": "org:Organization",
                "name": "U.S. Census Bureau",
                "subOrganizationOf": {
                    "@type": "org:Organization",
                    "name": "U.S. Department Of Commerce",
                    "subOrganizationOf": {
                        "@type": "org:Organization",
                        "name": "U.S. Government"
                    }
                }
            }
        }
    ]
}



In [5]:

    
# Request and store a local copy of the dataset variables. 
# Note that the URL could be hard coded just from referencing the API, but
#   we are navigating the JSON data.
var_link = api_json['dataset'][0]['c_variablesLink']
print(var_link)









    



https://api.census.gov/data/2010/surname/variables.json



In [8]:

    
# Use the variable info link to make a new request

variables = requests.get(var_link)
jsonData = variables.json()
variable_data = jsonData['variables']

# Note that this is a dictionary of dictionaries.
print(json.dumps(variable_data, indent=4))









    



{
    "PCTAPI": {
        "label": "Percent Non-Hispanic Asian and Native Hawaiian and Other Pacific Islander Alone",
        "concept": "Surnames Variables",
        "predicateType": "int",
        "group": "N/A",
        "limit": 0
    },
    "PCTBLACK": {
        "label": "Percent Non-Hispanic Black or African American Alone",
        "concept": "Surnames Variables",
        "predicateType": "int",
        "group": "N/A",
        "limit": 0
    },
    "PCTAIAN": {
        "label": "Percent Non-Hispanic American Indian and Alaska Native Alone",
        "concept": "Surnames Variables",
        "predicateType": "int",
        "group": "N/A",
        "limit": 0
    },
    "CUM_PROP100K": {
        "label": "Cumulative proportion per 100,000 population",
        "concept": "Surnames Variables",
        "predicateType": "int",
        "group": "N/A",
        "limit": 0
    },
    "PROP100K": {
        "label": "Proportion per 100,000 population for name",
        "concept": "Surnames Variables",
        "predicateType": "int",
        "group": "N/A",
        "limit": 0
    },
    "PCTWHITE": {
        "label": "Percent Non-Hispanic White Alone",
        "concept": "Surnames Variables",
        "predicateType": "int",
        "group": "N/A",
        "limit": 0
    },
    "RANK": {
        "label": "National Rank",
        "concept": "Surnames Variables",
        "predicateType": "int",
        "group": "N/A",
        "limit": 0
    },
    "COUNT": {
        "label": "Frequency: number of occurrences nationally",
        "concept": "Surnames Variables",
        "predicateType": "int",
        "group": "N/A",
        "limit": 0
    },
    "PCT2PRACE": {
        "label": "Percent Non-Hispanic Two or More Races",
        "concept": "Surnames Variables",
        "predicateType": "int",
        "group": "N/A",
        "limit": 0
    },
    "PCTHISPANIC": {
        "label": "Percent Hispanic or Latino origin",
        "concept": "Surnames Variables",
        "predicateType": "int",
        "group": "N/A",
        "limit": 0
    },
    "NAME": {
        "label": "Surname",
        "concept": "Surnames Variables",
        "predicateType": "string",
        "group": "N/A",
        "limit": 0
    }
}



In [9]:

    
print(variable_data.keys())









    



dict_keys(['PCTAPI', 'PCTBLACK', 'PCTAIAN', 'CUM_PROP100K', 'PROP100K', 'PCTWHITE', 'RANK', 'COUNT', 'PCT2PRACE', 'PCTHISPANIC', 'NAME'])

Get surname data:



In [11]:

    
# References: Pandas (http://pandas.pydata.org/)

# Default vars: 'RANK,COUNT,PCTWHITE,PCTAPI,PCT2PRACE,PCTAIAN,PCTBLACK,PCTHISPANIC'
desired_vars = 'NAME,COUNT,PCTWHITE,PCTAPI,PCT2PRACE,PCTAIAN,PCTBLACK,PCTHISPANIC&RANK=1:10'     # Top ten names

base_url = 'http://api.census.gov/data/2010/surname?get='
query_url = base_url + desired_vars
name_stats = requests.get(query_url)
surname_data = name_stats.json()

# The response data are not very human readable.
# Note that this is a list of lists. Data within lists are typically accessed by position number. (There are no keys.)
print('Raw response data:\n')
print(json.dumps(surname_data, indent=4))









    



Raw response data:

[
    [
        "NAME",
        "COUNT",
        "PCTWHITE",
        "PCTAPI",
        "PCT2PRACE",
        "PCTAIAN",
        "PCTBLACK",
        "PCTHISPANIC",
        "RANK"
    ],
    [
        "BROWN",
        "1437026",
        "57.95",
        "0.51",
        "2.55",
        "0.87",
        "35.6",
        "2.52",
        "4"
    ],
    [
        "DAVIS",
        "1116357",
        "62.2",
        "0.49",
        "2.45",
        "0.82",
        "31.6",
        "2.44",
        "8"
    ],
    [
        "GARCIA",
        "1166120",
        "5.38",
        "1.41",
        "0.26",
        "0.47",
        "0.45",
        "92.03",
        "6"
    ],
    [
        "JOHNSON",
        "1932812",
        "58.97",
        "0.54",
        "2.56",
        "0.94",
        "34.63",
        "2.36",
        "2"
    ],
    [
        "JONES",
        "1425470",
        "55.19",
        "0.44",
        "2.61",
        "1",
        "38.48",
        "2.29",
        "5"
    ],
    [
        "MARTINEZ",
        "1060159",
        "5.28",
        "0.6",
        "0.22",
        "0.51",
        "0.49",
        "92.91",
        "10"
    ],
    [
        "MILLER",
        "1161437",
        "84.11",
        "0.54",
        "1.77",
        "0.66",
        "10.76",
        "2.17",
        "7"
    ],
    [
        "RODRIGUEZ",
        "1094924",
        "4.75",
        "0.57",
        "0.18",
        "0.18",
        "0.54",
        "93.77",
        "9"
    ],
    [
        "SMITH",
        "2442977",
        "70.9",
        "0.5",
        "2.19",
        "0.89",
        "23.11",
        "2.4",
        "1"
    ],
    [
        "WILLIAMS",
        "1625252",
        "45.75",
        "0.46",
        "2.81",
        "0.82",
        "47.68",
        "2.49",
        "3"
    ]
]

Laying out the API response like a table helps illustrate what we're doing here. For easier reading the "surname_data" variable has been replace with "d" in the image below.

The variable codes in d[0] will be replaced with human readable descriptions from the variable list (v).

Replace variable codes with human readable labels



In [16]:

    
# Pass the data to a Pandas dataframe. 
# In addition to being easier to read, dataframes simplify further analysis.

# The simplest dataframe would use the variable names returned with the data. Example: PCTWHITE
# It's easier to read the descriptive labels provide via the variables API.
# The code block below replaces variable names with labels as it builds the dataframe.
column_list = []

for each in surname_data[0]:                        # For each variable in the response data (stored as surname_data[0])
    label = variable_data[each]['label']            #     look up that variable's label in the variable dictionary
    column_list.append(label)                       #     add the variable's label to the list of column headers
    print(each, ":", label)

print('\n', column_list)









    



NAME : Surname
COUNT : Frequency: number of occurrences nationally
PCTWHITE : Percent Non-Hispanic White Alone
PCTAPI : Percent Non-Hispanic Asian and Native Hawaiian and Other Pacific Islander Alone
PCT2PRACE : Percent Non-Hispanic Two or More Races
PCTAIAN : Percent Non-Hispanic American Indian and Alaska Native Alone
PCTBLACK : Percent Non-Hispanic Black or African American Alone
PCTHISPANIC : Percent Hispanic or Latino origin
RANK : National Rank

 ['Surname', 'Frequency: number of occurrences nationally', 'Percent Non-Hispanic White Alone', 'Percent Non-Hispanic Asian and Native Hawaiian and Other Pacific Islander Alone', 'Percent Non-Hispanic Two or More Races', 'Percent Non-Hispanic American Indian and Alaska Native Alone', 'Percent Non-Hispanic Black or African American Alone', 'Percent Hispanic or Latino origin', 'National Rank']

Create a dataframe (table) with variable labels as column names and append data



In [31]:

    
df = pd.DataFrame([surname_data[1]], columns=column_list)  # Create a dataframe using the column names created above. Data
                                                           # for the dataframe comes from rows 2-10 (positions 1-9) 
                                                           # of surname_data.

# The table we just created is empty. Here we add the surname data:
for surname in d[2:]:
    tdf = pd.DataFrame([surname], columns=column_list)
    df = df.append(tdf)

print('\n\nPandas dataframe:')
df.sort_values(by=["National Rank"])









    




Pandas dataframe:






    Out[31]:







  
    
      
      Surname
      Frequency: number of occurrences nationally
      Percent Non-Hispanic White Alone
      Percent Non-Hispanic Asian and Native Hawaiian and Other Pacific Islander Alone
      Percent Non-Hispanic Two or More Races
      Percent Non-Hispanic American Indian and Alaska Native Alone
      Percent Non-Hispanic Black or African American Alone
      Percent Hispanic or Latino origin
      National Rank
    
  
  
    
      0
      SMITH
      2442977
      70.9
      0.5
      2.19
      0.89
      23.11
      2.4
      1
    
    
      0
      MARTINEZ
      1060159
      5.28
      0.6
      0.22
      0.51
      0.49
      92.91
      10
    
    
      0
      JOHNSON
      1932812
      58.97
      0.54
      2.56
      0.94
      34.63
      2.36
      2
    
    
      0
      WILLIAMS
      1625252
      45.75
      0.46
      2.81
      0.82
      47.68
      2.49
      3
    
    
      0
      BROWN
      1437026
      57.95
      0.51
      2.55
      0.87
      35.6
      2.52
      4
    
    
      0
      JONES
      1425470
      55.19
      0.44
      2.61
      1
      38.48
      2.29
      5
    
    
      0
      GARCIA
      1166120
      5.38
      1.41
      0.26
      0.47
      0.45
      92.03
      6
    
    
      0
      MILLER
      1161437
      84.11
      0.54
      1.77
      0.66
      10.76
      2.17
      7
    
    
      0
      DAVIS
      1116357
      62.2
      0.49
      2.45
      0.82
      31.6
      2.44
      8
    
    
      0
      RODRIGUEZ
      1094924
      4.75
      0.57
      0.18
      0.18
      0.54
      93.77
      9

Exercises

Change the table to include the 50 most common surnames. Alternatively, create a table for the 11th - 20th most common surnames.
Sort the output table by surname or a demographic.
Change the request/table to include only surname, rank, and count.
Correct the sort order in the example table.

3. Download and tabulate statistics for a given surname

Now let's find out the rank and demographic breakdown of a particular name.

To make it easy to change the name we're looking up, assign it to a variable.



In [23]:

    
# Try 'STEUBEN' in order to break the first pie chart example.
# Update 2020-02-26: Surnames should be all caps!
name = 'WHEELER'
name_query = '&NAME=' + name

Referring to the variables API, decide which variables are of interest and edit accordingly.



In [24]:

    
# Default vars: 'RANK,COUNT,PCTWHITE,PCTAPI,PCT2PRACE,PCTAIAN,PCTBLACK,PCTHISPANIC'
desired_vars = 'RANK,COUNT,PCTWHITE,PCTAPI,PCT2PRACE,PCTAIAN,PCTBLACK,PCTHISPANIC'

Build the query URL and send the request. Pass the response data into a Pandas dataframe for viewing.



In [25]:

    
# References: Pandas (http://pandas.pydata.org/)

base_url = 'http://api.census.gov/data/2010/surname?get='
query_url = base_url + desired_vars + name_query
name_stats = requests.get(query_url)
d = name_stats.json()

# The response data are not very human readable.
print('Raw response data:\n')
print(d)

# Pass the data to a Pandas dataframe. 
# In addition to being easier to read, dataframes simplify further analysis.

# The simplest dataframe would use the variable names returned with the data. Example: PCTWHITE
# It's easier to read the descriptive labels provide via the variables API.
# The code block below replaces variable names with labels as it builds the dataframe.
column_list = []

for each in d[0]:                                   # For each variable in the response data (stored as d[0])
    label = v[each]['label']                        #     look up that variable's label in the variable dictionary
    column_list.append(label)                       #     add the variable's label to the list of column headers

df = pd.DataFrame([d[1]], columns=column_list)      # Create a dataframe using the column names created above. Data
                                                    # for the dataframe comes from d[1]

print('\n\nPandas dataframe:')
df









    



Raw response data:

[['RANK', 'COUNT', 'PCTWHITE', 'PCTAPI', 'PCT2PRACE', 'PCTAIAN', 'PCTBLACK', 'PCTHISPANIC', 'NAME'], ['243', '125058', '80.96', '0.5', '1.97', '0.93', '13.29', '2.35', 'WHEELER']]


Pandas dataframe:






    Out[25]:







  
    
      
      National Rank
      Frequency: number of occurrences nationally
      Percent Non-Hispanic White Alone
      Percent Non-Hispanic Asian and Native Hawaiian and Other Pacific Islander Alone
      Percent Non-Hispanic Two or More Races
      Percent Non-Hispanic American Indian and Alaska Native Alone
      Percent Non-Hispanic Black or African American Alone
      Percent Hispanic or Latino origin
      Surname
    
  
  
    
      0
      243
      125058
      80.96
      0.5
      1.97
      0.93
      13.29
      2.35
      WHEELER

4. Create a name demographic pie-chart



In [26]:

    
# Using index positions is good for doing something quick, but in this case makes code easy to break.
# Selecting different surname dataset variables or re-ordering variables will result in errors.
print(d)
pcts = d[1][2:8]
print('\n\n',pcts)









    



[['RANK', 'COUNT', 'PCTWHITE', 'PCTAPI', 'PCT2PRACE', 'PCTAIAN', 'PCTBLACK', 'PCTHISPANIC', 'NAME'], ['243', '125058', '80.96', '0.5', '1.97', '0.93', '13.29', '2.35', 'WHEELER']]


 ['80.96', '0.5', '1.97', '0.93', '13.29', '2.35']



In [27]:

    
# Create the labels and get the data for the pie chart.
# Note that we are using the downloaded source data, not the dataframe
#     used for the table above.
labels = ['White', 'Asian', '2+ Races', 'Native American', 'Black', 'Hispanic']
pcts = d[1][2:8]
#print(pcts)

# Create a pie chart (https://matplotlib.org/2.0.2/examples/pie_and_polar_charts/pie_demo_features.html)
plt.pie(
    # using data percentages
    pcts,
    # Use labels defined above
    labels=labels,
    # with no shadows
    shadow=False,
    # with the start angle at 90%
    startangle=90,
    # with the percent listed as a fraction
    autopct='%1.1f%%',
    )

# View the plot drop above
plt.axis('equal')

# View the plot
plt.tight_layout()
plt.show()

5. Fix the data type error



In [28]:

    
# First try - just replace string with a zero.
# Here, the for loop iterates through items in a list.
pcts2 = []
for p in pcts:
    if p != '(S)':
        pcts2.append(p)
    else:
        pcts2.append(0)

# Create a pie chart (https://matplotlib.org/2.0.2/examples/pie_and_polar_charts/pie_demo_features.html)
plt.pie(
    # using data percentages
    pcts2,
    # Use labels defined above
    labels=labels,
    # with no shadows
    shadow=False,
    # with the start angle at 90%
    startangle=90,
    # with the percent listed as a fraction
    autopct='%1.1f%%',
    )

# View the plot drop above
plt.axis('equal')

# View the plot
plt.tight_layout()
plt.show()



In [29]:

    
# Second try - exclude and corresponding label if source data for a given demographic == (S)
# This requires the list index of the data and the label.
# The for loop in this case iterates across a range of integers equal to the length of the list.
pcts3 = []
edit_labels = []
for i in range(len(pcts)):
    print(pcts[i])
    if pcts[i] != '(S)':
        pcts3.append(pcts[i])
        edit_labels.append(labels[i])
    else:
        pass

# Create a pie chart (https://matplotlib.org/2.0.2/examples/pie_and_polar_charts/pie_demo_features.html)
plt.pie(
    # using data percentages
    pcts3,
    # Use labels defined above
    labels=edit_labels,
    # with no shadows
    shadow=False,
    # with the start angle at 90%
    startangle=90,
    # with the percent listed as a fraction
    autopct='%1.1f%%',
    )

# View the plot drop above
plt.axis('equal')

# View the plot
plt.tight_layout()
plt.show()

Exercises

Look up and produce the demographic pie charts for your name or other names of interest.
Change the variables displayed in the dataframe.
Change the order of variables displayed in the dataframe.
Referring to the documentation, edit the pie chart parameters.



In [ ]:

Surname	Frequency: number of occurrences nationally	Percent Non-Hispanic White Alone	Percent Non-Hispanic Asian and Native Hawaiian and Other Pacific Islander Alone	Percent Non-Hispanic Two or More Races	Percent Non-Hispanic American Indian and Alaska Native Alone	Percent Non-Hispanic Black or African American Alone	Percent Hispanic or Latino origin	National Rank
SMITH	2442977	70.9	0.5	2.19	0.89	23.11	2.4	1
MARTINEZ	1060159	5.28	0.6	0.22	0.51	0.49	92.91	10
JOHNSON	1932812	58.97	0.54	2.56	0.94	34.63	2.36	2
WILLIAMS	1625252	45.75	0.46	2.81	0.82	47.68	2.49	3
BROWN	1437026	57.95	0.51	2.55	0.87	35.6	2.52	4
JONES	1425470	55.19	0.44	2.61	1	38.48	2.29	5
GARCIA	1166120	5.38	1.41	0.26	0.47	0.45	92.03	6
MILLER	1161437	84.11	0.54	1.77	0.66	10.76	2.17	7
DAVIS	1116357	62.2	0.49	2.45	0.82	31.6	2.44	8
RODRIGUEZ	1094924	4.75	0.57	0.18	0.18	0.54	93.77	9