Copyright 2019 Google LLC. SPDX-License-Identifier: Apache-2.0
Notebook Version - 1.0.0
Data Commons is intended for various data science tasks. This tutorial introduces the Data Commons knowledge graph and discusses two tools to help integrate its data into your data science projects: (1) the Data Commons browser and (2) the Python API. Before getting started, we will need to install the Python API package.
In [0]:
# Install datacommons
!pip install --upgrade --quiet git+https://github.com/datacommonsorg/api-python.git@stable-1.x
Data Commons is an open knowledge graph of structured data. It contains statements about real-world objects, such as "Santa Clara County is contained in the State of California".
In the graph, entities like Santa Clara County are represented by nodes. Every node has a type corresponding to what the node represents. For example, California is a State. Relations between entities are represented by edges between these nodes. For example, the statement "Santa Clara County is contained in the State of California" is represented in the graph as two nodes: "Santa Clara County" and "California" with an edge labeled "containedInPlace" pointing from Santa Clara to California. Data Commons closely follows the Schema.org data model and leverages schema.org schema to provide a common set of types and properties.
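To make the node/edge picture concrete, here is a tiny in-memory sketch of the graph fragment described above. The dcids shown ("geoId/06085" for Santa Clara County and "geoId/06" for California) are real Data Commons identifiers, but the dict-of-tuples representation is purely illustrative; it is not how the API stores or returns data.

```python
# A tiny illustrative sketch of the graph fragment described above.
nodes = {
    'geoId/06085': {'name': 'Santa Clara County', 'typeOf': 'County'},
    'geoId/06': {'name': 'California', 'typeOf': 'State'},
}
# Edges are labeled and directed: Santa Clara County -> California.
edges = [('geoId/06085', 'containedInPlace', 'geoId/06')]

for src, label, dst in edges:
  print(f"{nodes[src]['name']} --[{label}]--> {nodes[dst]['name']}")
```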
The Data Commons browser provides a way to explore the data in a human-readable format. It is the best way to explore what is in Data Commons. Searching in the browser for an entity like Mountain View takes you to a page about that entity, including properties like containedInPlace and timezone.
An important property for all entities is the dcid. The dcid (Data Commons identifier) is a unique identifier assigned to each entity in the knowledge graph. With this identifier, you will be able to search for and query information on the given entity in ways that we will discuss later. The dcid is listed at the top of the page next to "About:" and also in the list of properties.
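For reference, here are a few example dcids for places that come up in this tutorial. These identifiers can each be confirmed by visiting the corresponding page in the Data Commons browser.

```python
# Example dcids for a few places. You can verify any of these by searching
# for the place in the Data Commons browser.
example_dcids = {
    'United States': 'country/USA',
    'California': 'geoId/06',
    'Santa Clara County': 'geoId/06085',
    'Mountain View': 'geoId/0649670',
}
for name, dcid in example_dcids.items():
  print(f'{name}: {dcid}')
```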
The Python API provides functions for users to extract structured information from Data Commons programmatically and view it in different formats such as Python dicts and Pandas DataFrames. DataFrames give access to all the data processing, analytical, and visualization tools provided by packages such as Pandas, NumPy, SciPy, and Matplotlib.
Every notebook begins by loading the Data Commons client as follows:
In [0]:
# Import Data Commons
import datacommons as dc
# Import other required libraries
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import pandas as pd
import json
We will also need to provide an API key to access the Data Commons Python API. This notebook is set up to read from a JSON file stored at key_path in your Google Drive. This file should contain:
{
"dc_api_key": "YOUR-API-KEY"
}
If you want to make a copy of this notebook, make sure to replace key_path with the path to a file containing your API key. For more detail, visit the Creating an API Key page of the Python API documentation.
In [0]:
# The key is stored in a secret file. To use this notebook, you will need to
# make a copy and point this line to the file containing the key in your drive!
from google.colab import drive
# Mount the Drive
drive.mount('/content/drive', force_remount=True)
# REPLACE THIS with the path to your key.
key_path = '/content/drive/My Drive/DataCommons/secret.json'
# Read the key in and provide it to the Data Commons API
with open(key_path, 'r') as f:
  secrets = json.load(f)
  dc.set_api_key(secrets['dc_api_key'])
For this exercise, we will compare the median age and population count for US states, counties, and cities. First, let's look up the dcid for the United States.
Note that "Country" defines the Data Commons type, and "country" is the name we assign to this column.
Using get_places_in to Query Administrative Areas

The client API defines a number of convenience functions for building Pandas DataFrames with information in the Data Commons graph. We will be using get_places_in, which requires two arguments:

- dcids - A list or pandas.Series of dcids identifying the administrative areas whose contained places we wish to get.
- place_type - The type of administrative area that we wish to query for.

In the Data Commons knowledge graph, the containedInPlace property relates an administrative area to its containing administrative area. Concretely, every State node has a directed edge to a Country node, and the label of this edge is containedInPlace. To confirm this, you can check the browser page for the United States. The same goes for County to State nodes and City to County nodes.
When we provide a pandas.Series to get_places_in, we get back a pandas.Series with all the places contained in the administrative areas identified by the given series of dcids.
In [0]:
# Create three DataFrames, each with the dcid of the USA, storing state,
# county, and city data respectively.
state = pd.DataFrame({'country': ['country/USA']})
county = pd.DataFrame({'country': ['country/USA']})
city = pd.DataFrame({'country': ['country/USA']})
# Get all states, counties, and cities within the United States
state['state'] = dc.get_places_in(state['country'], 'State')
county['county'] = dc.get_places_in(county['country'], 'County')
city['city'] = dc.get_places_in(city['country'], 'City')
Let's see what each table has.
In [0]:
# Display the state data
state.head(5)
Out[0]:
In [0]:
# Display the county data
county.head(5)
Out[0]:
In [0]:
# Display the city data
city.head(5)
Out[0]:
Notice that each table only has one row, where the column we just added holds a list of items. This is because there are many states, counties, and cities in the United States! We can expand this list out by calling flatten_frame, or pandas.DataFrame.explode if you are using Pandas 0.25 or later.
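As a sketch of the pandas.explode alternative (assuming Pandas 0.25 or newer; the state dcids here are illustrative stand-ins for the values get_places_in returns):

```python
import pandas as pd

# A one-row frame whose 'state' cell holds a list, mimicking the shape of
# the frame before unrolling.
frame = pd.DataFrame({
    'country': ['country/USA'],
    'state': [['geoId/01', 'geoId/02', 'geoId/04']],
})

# explode() expands each element of the list into its own row, repeating
# the other column values.
flat = frame.explode('state').reset_index(drop=True)
print(flat)
```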
In [0]:
# Unroll the frames
state = dc.flatten_frame(state)
county = dc.flatten_frame(county)
city = dc.flatten_frame(city)
# Display the first 5 rows of this table.
state.head(5)
Out[0]:
Unfortunately, dcids aren't very readable. Let's call get_property_values to include a column with the name of each place. This function returns a column of names associated with each item in the given column of dcids. Here 'name' specifies the property/edge of interest, and each of state['state'], county['county'], and city['city'] contains dcids identifying the source nodes for this relation.
Note - This query may take a minute!
In [0]:
# Get all state, county, and city names
state['state_name'] = dc.get_property_values(state['state'], 'name')
county['county_name'] = dc.get_property_values(county['county'], 'name')
city['city_name'] = dc.get_property_values(city['city'], 'name')
# Unroll the returned results
state = dc.flatten_frame(state)
county = dc.flatten_frame(county)
city = dc.flatten_frame(city)
Let's view the result in each frame.
In [0]:
# View the first 5 rows of the state table.
state.head(5)
Out[0]:
In [0]:
# View the first 5 rows of the county table.
county.head(5)
Out[0]:
In [0]:
# View the first 5 rows of the city table.
city.head(5)
Out[0]:
Great! Now we can begin to fill our DataFrames with the population count and median age for each place. To do that, we'll need to understand a little bit about querying statistical data.
Data Commons has a large corpus of statistical data, which can be queried and joined with other statistics. For example, we can query the median income of women living in Berkeley, California or the number of individuals who are insured in Maryland.
Before we explore how to do this, we need to understand how Data Commons stores statistical data. In particular, there are two types of entities: StatisticalPopulations and Observations.
A StatisticalPopulation defines a collection of things of a certain type. One example of a population is the set of all Persons in Pittsburgh. For a particular population, we can have different Observations; for example, an Observation of the count of all Persons in Pittsburgh in 2017.
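To illustrate how the two entity types relate, here is a rough sketch of a StatisticalPopulation and one of its Observations as plain dicts. The field names are simplified stand-ins rather than the exact Data Commons schema, and the population dcid is a hypothetical placeholder (real population dcids are assigned by Data Commons).

```python
# Hypothetical sketch of the two entity types; field names are simplified.
population = {
    'dcid': 'dc/p/example',           # hypothetical placeholder dcid
    'typeOf': 'StatisticalPopulation',
    'populationType': 'Person',
    'location': 'geoId/4261000',      # Pittsburgh
}
observation = {
    'typeOf': 'Observation',
    'observedNode': population['dcid'],  # an Observation points at a population
    'measuredProperty': 'count',
    'observationDate': '2017',
}
print(observation['observedNode'])
```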
The API defines functions allowing us to fetch data over these two types. To begin with, we can use the get_populations function to get the population of type 'Person' for each place in our DataFrames. Again, county['county'] contains the column of dcids identifying where the populations we want are located, and 'Person' is the population type.
Note - This query may take a minute!
In [0]:
# Get StatisticalPopulations representing all persons in a given state, county,
# and city.
state['all_persons_pop'] = dc.get_populations(state['state'], 'Person')
county['all_persons_pop'] = dc.get_populations(county['county'], 'Person')
city['all_persons_pop'] = dc.get_populations(city['city'], 'Person')
# Notice that we don't need to unroll the results, because the parameters
# provided to get_populations always identify a unique population if it
# exists.
state.head(5)
Out[0]:
Now that we have StatisticalPopulations, let's get some Observations! For this example, we're interested in the median age and total population count. One of the nice things about Data Commons is that many statistical measures (e.g. the median) have already been calculated and stored as properties in the graph.
We use the get_observations function to get the total count and the median age for each population referred to in our all_persons_pop column. For our purposes, we filter the data to only include statistics from 2013-2017. As an example, you can compare the function calls below to the browser page for the population of 'Persons' in 'Ohio'. If you scroll down to the bottom, you can find the Observation nodes for 'median_age' and 'count'.
In [0]:
# Add 'count' and 'med_age' columns representing the total count and
# median age of the populations in the all_persons_pop columns we created
# earlier.
state['count'] = dc.get_observations(state['all_persons_pop'],
                                     'count',
                                     'measuredValue',
                                     '2017',
                                     measurement_method='CensusACS5yrSurvey')
state['med_age'] = dc.get_observations(state['all_persons_pop'],
                                       'age',
                                       'medianValue',
                                       '2017',
                                       measurement_method='CensusACS5yrSurvey')
# Get observations for counties.
county['count'] = dc.get_observations(county['all_persons_pop'],
                                      'count',
                                      'measuredValue',
                                      '2017',
                                      measurement_method='CensusACS5yrSurvey')
county['med_age'] = dc.get_observations(county['all_persons_pop'],
                                        'age',
                                        'medianValue',
                                        '2017',
                                        measurement_method='CensusACS5yrSurvey')
# Get observations for cities.
city['count'] = dc.get_observations(city['all_persons_pop'],
                                    'count',
                                    'measuredValue',
                                    '2017',
                                    measurement_method='CensusACS5yrSurvey')
city['med_age'] = dc.get_observations(city['all_persons_pop'],
                                      'age',
                                      'medianValue',
                                      '2017',
                                      measurement_method='CensusACS5yrSurvey')
Finally, we view the data we've queried for.
In [0]:
# View the first 5 rows of the state table.
state.head(5)
Out[0]:
In [0]:
# View the first 5 rows of the county table.
county.head(5)
Out[0]:
In [0]:
# View the first 5 rows of the city table.
city.head(5)
Out[0]:
In [0]:
# Clean the dataframes
state_clean = dc.clean_frame(state)
county_clean = dc.clean_frame(county)
city_clean = dc.clean_frame(city)
# Filter for all cities that have at least one person
city_clean = city_clean[city_clean['count'] >= 1]
And finally, let's visualize our results.
In [0]:
def plot_data(title, pd_table):
  """Generate a scatter plot comparing median age and population count."""
  plt.figure(figsize=(12, 8))
  plt.title(title)
  plt.xlabel('Median Age in Years')
  plt.ylabel('Population Count (log scale)')
  # Scatter plot the information
  ax = plt.gca()
  ax.set_yscale('log')
  ax.scatter(pd_table['med_age'], pd_table['count'], alpha=0.7)
In [0]:
# Generate the plot for state data
plot_data('Median Age vs. Population Count for States', state_clean)
In [0]:
# Generate the plot for county data
plot_data('Median Age vs. Population Count for Counties', county_clean)
In [0]:
# Generate the plot for city data
plot_data('Median Age vs. Population Count for Cities', city_clean)
We can also plot each administrative area granularity on the same plot to see how they relate.
In [0]:
def plot_all_data(state_table, county_table, city_table):
  plt.figure(figsize=(12, 8))
  plt.title('Median Age vs. Population Count')
  plt.xlabel('Median Age in Years')
  plt.ylabel('Population Count (log scale)')
  # Make things pretty
  state_color = "#ffa600"
  county_color = "#bc5090"
  city_color = "#003f5c"
  # Scatter plot the information
  ax = plt.gca()
  ax.set_yscale('log')
  ax.scatter(state_table['med_age'], state_table['count'], color=state_color, alpha=0.75)
  ax.scatter(county_table['med_age'], county_table['count'], color=county_color, alpha=0.5)
  ax.scatter(city_table['med_age'], city_table['count'], color=city_color, alpha=0.4)
  # Create the legend
  state_patch = mpatches.Patch(color=state_color, label='States')
  county_patch = mpatches.Patch(color=county_color, label='Counties')
  city_patch = mpatches.Patch(color=city_color, label='Cities')
  plt.legend(handles=[state_patch, county_patch, city_patch])

# Plot all the data together.
plot_all_data(state_clean, county_clean, city_clean)