Copyright 2019 Google LLC. SPDX-License-Identifier: Apache-2.0
Notebook Version - 1.0.0
In [0]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from google.colab import drive
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import json
Obesity is well known to correlate with health factors such as high blood pressure, but is also known to correlate with economic factors such as low-income, unemployment, etc [1][2]. The Center for Disease Control (CDC) provides prevalence percentages on health conditions such as obesity, high blood pressure, and high cholesterol for approximately 500 major cities in the US (e.g. San Francisco, New York, and Austin). Meanwhile, the US Bureau of Labor Statistics provides unemployment rates while the US Census provides poverty rates for most cities across the United States.
Even though these statistics come from different datasets across different government agencies with different storage formats, dataCommons surfaces each of these in a single, uniform knowledge graph. In fact, you can see this in the browser by looking at the provenance column. Let's use the data in dataCommons to create a linear regression model that incorporates variables:
to predict the prevalence of obesity in the 500 cities that the CDC provides data for. One thing you may note is that the US Census also provides employment statistics (you can see this by navigating to the "employment" and "employmentStatus" sections for San Francisco and observing the different provenances). Our choice of using statistics from the Bureau of Labor Statistics is purely demonstrative, but it would be interesting to see if similar results can be reproduced using US Census employment statistics.
In [5]:
!pip install --upgrade --quiet git+https://github.com/datacommonsorg/api-python.git@stable-1.x
import datacommons as dc
We will also need to provide the API with an API key. See the to see how to set this up for a Colab Notebook.
In [6]:
# Mount the Drive
drive.mount('/content/drive', force_remount=True)
# REPLACE THIS with the path to your key.
key_path = '/content/drive/My Drive/DataCommons/secret.json'
# Read the key in and provide it to the Data Commons API
with open(key_path, 'r') as f:
secrets = json.load(f)
dc.set_api_key(secrets['dc_api_key'])
To get started, we'll need the cities that CDC provides data for. We've listed the DCIDs of all cities with CDC obesity prevalence statistics and stored it as a variable city_dcids
. For now we will hide this cell!
In [0]:
#@title
city_dcids = [
"geoId/5370000", "geoId/0524550", "geoId/0820000",
"geoId/3502000", "geoId/0665000", "geoId/1239550",
"geoId/2148006", "geoId/1232000", "geoId/1757225",
"geoId/0622804", "geoId/0807850", "geoId/4848072",
"geoId/3921000", "geoId/0680000", "geoId/3650617",
"geoId/4813024", "geoId/0644000", "geoId/0918430",
"geoId/5147672", "geoId/3345140", "geoId/4856000",
"geoId/1714351", "geoId/1982425", "geoId/2674900",
"geoId/1258050", "geoId/4943660", "geoId/0541000",
"geoId/0659451", "geoId/1269700", "geoId/0616350",
"geoId/0653980", "geoId/4459000", "geoId/0650258",
"geoId/0622230", "geoId/4845744", "geoId/0662938",
"geoId/0150000", "geoId/0673080", "geoId/3725580",
"geoId/0603526", "geoId/0846465", "geoId/1656260",
"geoId/2523000", "geoId/2241155", "geoId/0427820",
"geoId/2970000", "geoId/0622020", "geoId/3728000",
"geoId/5101000", "geoId/0681204", "geoId/1851876",
"geoId/1768003", "geoId/4224000", "geoId/4815976",
"geoId/2688940", "geoId/0606000", "geoId/0454050",
"geoId/3006550", "geoId/1751622", "geoId/4748000",
"geoId/5164000", "geoId/1077580", "geoId/0681666",
"geoId/5548000", "geoId/5584250", "geoId/0681344",
"geoId/0623182", "geoId/2915670", "geoId/1368516",
"geoId/4260000", "geoId/0615044", "geoId/1765000",
"geoId/4819000", "geoId/1921000", "geoId/1212875",
"geoId/3413690", "geoId/1253000", "geoId/0827425",
"geoId/0603666", "geoId/0616000", "geoId/2255000",
"geoId/5539225", "geoId/3710740", "geoId/3663000",
"geoId/4843888", "geoId/2567000", "geoId/2646000",
"geoId/2954074", "geoId/1150000", "geoId/0668084",
"geoId/3665508", "geoId/3436000", "geoId/1270600",
"geoId/0980000", "geoId/2706616", "geoId/4548535",
"geoId/1840788", "geoId/0677000", "geoId/1779293",
"geoId/2836000", "geoId/4983470", "geoId/0636770",
"geoId/2680700", "geoId/4803000", "geoId/1239425",
"geoId/2509000", "geoId/2036000", "geoId/0446000",
"geoId/2582000", "geoId/0816000", "geoId/0477000",
"geoId/0616532", "geoId/0658072", "geoId/4856348",
"geoId/2511000", "geoId/5303180", "geoId/0877290",
"geoId/4847892", "geoId/5305280", "geoId/1257425",
"geoId/0653000", "geoId/3912000", "geoId/2649000",
"geoId/1912000", "geoId/0639892", "geoId/4957300",
"geoId/4962470", "geoId/0680812", "geoId/0640130",
"geoId/1822000", "geoId/4147000", "geoId/0660466",
"geoId/1823278", "geoId/2758000", "geoId/0681554",
"geoId/1207875", "geoId/0135896", "geoId/3702140",
"geoId/4261000", "geoId/1207300", "geoId/2669035",
"geoId/4810768", "geoId/0680854", "geoId/5135000",
"geoId/0535710", "geoId/1712385", "geoId/0107000",
"geoId/0655156", "geoId/1772000", "geoId/4419180",
"geoId/4751560", "geoId/0630000", "geoId/3916000",
"geoId/5566000", "geoId/0614218", "geoId/5367000",
"geoId/2038900", "geoId/2545000", "geoId/0659920",
"geoId/1938595", "geoId/0648256", "geoId/3825700",
"geoId/0845970", "geoId/4752006", "geoId/4841464",
"geoId/0678582", "geoId/0613392", "geoId/3260600",
"geoId/4041850", "geoId/4819972", "geoId/4827000",
"geoId/3240000", "geoId/0608786", "geoId/0636000",
"geoId/4714000", "geoId/5357745", "geoId/4105800",
"geoId/0455000", "geoId/1703012", "geoId/1367284",
"geoId/4848804", "geoId/0665042", "geoId/0684200",
"geoId/1827000", "geoId/1714000", "geoId/1342425",
"geoId/5010675", "geoId/4513330", "geoId/0611194",
"geoId/0621712", "geoId/0650076", "geoId/3712000",
"geoId/0660018", "geoId/1831000", "geoId/4982950",
"geoId/4715160", "geoId/1243975", "geoId/4822660",
"geoId/0843000", "geoId/5531000", "geoId/5167000",
"geoId/4105350", "geoId/4123850", "geoId/1652120",
"geoId/0602000", "geoId/4810912", "geoId/0628000",
"geoId/0804000", "geoId/5176432", "geoId/2621000",
"geoId/2754880", "geoId/4561405", "geoId/4829000",
"geoId/1973335", "geoId/5380010", "geoId/0646898",
"geoId/4740000", "geoId/2717000", "geoId/0668000",
"geoId/1230000", "geoId/5157000", "geoId/0624680",
"geoId/1216725", "geoId/0651182", "geoId/0636546",
"geoId/0523290", "geoId/0669070", "geoId/0137000",
"geoId/0659962", "geoId/3128000", "geoId/0613210",
"geoId/4801924", "geoId/0812815", "geoId/0485540",
"geoId/0613214", "geoId/4009050", "geoId/0600884",
"geoId/0662000", "geoId/1245025", "geoId/3961000",
"geoId/2629000", "geoId/4804000", "geoId/3570500",
"geoId/5182000", "geoId/4841980", "geoId/0611530",
"geoId/3722920", "geoId/2146027", "geoId/2052575",
"geoId/1759000", "geoId/3474000", "geoId/1224000",
"geoId/0648354", "geoId/4861796", "geoId/1254200",
"geoId/15003", "geoId/5305210", "geoId/4269000",
"geoId/4807000", "geoId/4879000", "geoId/0629000",
"geoId/0656000", "geoId/2935000", "geoId/2071000",
"geoId/4835000", "geoId/1235000", "geoId/4075000",
"geoId/4853388", "geoId/0617918", "geoId/4055000",
"geoId/1871000", "geoId/5116000", "geoId/0668196",
"geoId/5374060", "geoId/4830464", "geoId/2537490",
"geoId/1238250", "geoId/2360545", "geoId/0832155",
"geoId/0664000", "geoId/4801000", "geoId/0647766",
"geoId/0672016", "geoId/2684000", "geoId/4858016",
"geoId/0465000", "geoId/3563460", "geoId/0646842",
"geoId/3775000", "geoId/0682996", "geoId/1707133",
"geoId/2627440", "geoId/2239475", "geoId/2603000",
"geoId/5335415", "geoId/1723074", "geoId/0632548",
"geoId/0670000", "geoId/1836003", "geoId/0669000",
"geoId/5168000", "geoId/3268400", "geoId/3774440",
"geoId/3457000", "geoId/2964550", "geoId/4876000",
"geoId/5414600", "geoId/0656700", "geoId/3539380",
"geoId/0657792", "geoId/0670098", "geoId/0952000",
"geoId/0666000", "geoId/0973000", "geoId/0639496",
"geoId/0566080", "geoId/0600562", "geoId/3673000",
"geoId/1245060", "geoId/0669196", "geoId/3350260",
"geoId/0862000", "geoId/2676460", "geoId/0955990",
"geoId/1810342", "geoId/0626000", "geoId/1738570",
"geoId/3137000", "geoId/4052500", "geoId/4857200",
"geoId/3601000", "geoId/0608954", "geoId/0619766",
"geoId/3456550", "geoId/3474630", "geoId/4131250",
"geoId/4164900", "geoId/1319000", "geoId/2555745",
"geoId/3714100", "geoId/3231900", "geoId/0602364",
"geoId/4550875", "geoId/0427400", "geoId/1369000",
"geoId/1225175", "geoId/0654652", "geoId/1304204",
"geoId/4202000", "geoId/3918000", "geoId/2829700",
"geoId/2270000", "geoId/5367167", "geoId/0937000",
"geoId/4263624", "geoId/4845384", "geoId/5613900",
"geoId/2941348", "geoId/0682590", "geoId/1258715",
"geoId/4874144", "geoId/0682954", "geoId/1825000",
"geoId/0908000", "geoId/0633434", "geoId/0653896",
"geoId/4955980", "geoId/0151000", "geoId/0645484",
"geoId/1216475", "geoId/3728080", "geoId/2545560",
"geoId/0883835", "geoId/3410000", "geoId/2642160",
"geoId/2537000", "geoId/4848768", "geoId/1724582",
"geoId/4965330", "geoId/0627000", "geoId/0652526",
"geoId/0649670", "geoId/4870808", "geoId/4206088",
"geoId/1919000", "geoId/3755000", "geoId/0636448",
"geoId/0641992", "geoId/5553000", "geoId/0684550",
"geoId/0675000", "geoId/5323515", "geoId/3684000",
"geoId/2404000", "geoId/0613588", "geoId/4652980",
"geoId/3651000", "geoId/2562535", "geoId/1718823",
"geoId/1245000", "geoId/2507000", "geoId/0604982",
"geoId/0664224", "geoId/2240735", "geoId/0678120",
"geoId/3451000", "geoId/2751730", "geoId/0667000",
"geoId/4659020", "geoId/5363000", "geoId/2622000",
"geoId/1255775", "geoId/4863500", "geoId/0685292",
"geoId/0950370", "geoId/1217200", "geoId/0643000",
"geoId/3734200", "geoId/4837000", "geoId/0412000",
"geoId/4842508", "geoId/1303440", "geoId/4967000",
"geoId/0669088", "geoId/0660620", "geoId/0203000",
"geoId/1702154", "geoId/2686000", "geoId/4806128",
"geoId/0669084", "geoId/1304000", "geoId/1301052",
"geoId/0649270", "geoId/1210275", "geoId/0613014",
"geoId/4474300", "geoId/5322640", "geoId/2743000",
"geoId/0803455", "geoId/1271000", "geoId/1380508",
"geoId/0602252", "geoId/4134100", "geoId/4159000",
"geoId/3901000", "geoId/1263000", "geoId/3421000",
"geoId/4817000", "geoId/1245975", "geoId/0404720",
"geoId/4454640", "geoId/0633000", "geoId/3611000",
"geoId/5156000", "geoId/3977000", "geoId/4845000",
"geoId/0633182", "geoId/0668252", "geoId/3731400",
"geoId/1706613", "geoId/2634000", "geoId/4839148",
"geoId/4516000", "geoId/2707966", "geoId/1349000",
"geoId/2053775", "geoId/1214400", "geoId/4864472",
"geoId/2938000", "geoId/3251800", "geoId/2965000",
"geoId/4865000", "geoId/3649121", "geoId/4827684",
"geoId/0471510", "geoId/1254000", "geoId/4023200",
"geoId/3050200", "geoId/1276600", "geoId/0660102",
"geoId/0644574", "geoId/3719000", "geoId/1608830",
"geoId/0653322", "geoId/5502375", "geoId/2534550",
"geoId/4967440", "geoId/0473000", "geoId/5335275",
"geoId/3915000", "geoId/1805860", "geoId/4824000",
"geoId/4805000", "geoId/2079000", "geoId/2205000",
"geoId/0624638", "geoId/3988000", "geoId/0668378",
"geoId/0177256", "geoId/0680238",
]
We then initialize a pandas.DataFrame
with the dcids in this list.
In [8]:
# Create the Pandas DataFrame
data = pd.DataFrame({'City': city_dcids})
# Get the name of each city
data['CityName'] = dc.get_property_values(data['City'], 'name')
data = dc.flatten_frame(data)
# Show the top five rows
data.head(5)
Out[8]:
In [9]:
# Prevalence of obesity
data['ObesityPop'] = dc.get_populations(
data['City'],
'Person',
constraining_properties={
'unhealthyBehavior': 'CDC_Obesity'
}
)
data['Obesity'] = dc.get_observations(
data['ObesityPop'],
'prevalence',
'measuredValue',
'2015',
measurement_method='CDC_CrudePrevalence'
)
# Prevalence of high blood pressure
data['HighBpPop'] = dc.get_populations(
data['City'],
'Person',
constraining_properties={
'healthOutcome': 'CDC_BPHigh'
}
)
data['HighBp'] = dc.get_observations(
data['HighBpPop'],
'prevalence',
'measuredValue',
'2015',
measurement_method='CDC_CrudePrevalence'
)
# Population of all persons
data['AllPersonsPop'] = dc.get_populations(data['City'], 'Person')
# Total persons count
data['PersonCount'] = dc.get_observations(
data['AllPersonsPop'],
'count',
'measuredValue',
'2015',
measurement_method='CenusACS5yrSurvey'
)
# Unemployment rate
data['Unemployment'] = dc.get_observations(
data['AllPersonsPop'],
'unemploymentRate',
'measuredValue',
'2015',
observation_period='P1Y',
measurement_method='BLSSeasonallyUnadjusted'
)
# Count of persons living below the poverty line
data['PovertyPersonsPop'] = dc.get_populations(
data['City'],
'Person',
constraining_properties={
'povertyStatus': 'BelowPovertyLevelInThePast12Months'
}
)
data['PovertyCount'] = dc.get_observations(
data['PovertyPersonsPop'],
'count',
'measuredValue',
'2015',
measurement_method='CenusACS5yrSurvey'
)
# Display the first five rows.
data.head(5)
Out[9]:
Great! We have all the statistics we'll need for this case study queried from three different data sources without having to write additional code to perform the joins. However, we'll still need to perform three data cleanup tasks to prepare the data for analysis:
Let's generate the final dataframe.
In [10]:
# Get the cleaned table
cleaned = dc.clean_frame(data)
# Remove population columns and compute the poverty rate
cleaned = cleaned[['CityName', 'Obesity', 'HighBp', 'Unemployment', 'PersonCount', 'PovertyCount']]
cleaned['PovertyRate'] = cleaned['PovertyCount'] / cleaned['PersonCount'] * 100
# Display the table
cleaned.head(5)
Out[10]:
Now's a good time to explore how each variable (high blood pressure prevalence, unemployment rate, and poverty rate) correlate with obesity prevalence. As stated before, previous research has shown that each of these variables tend to correlate positively with the prevalence of Obesity. Does our data show this?
In [11]:
# Plot Bphigh vs. Obesity
plt.figure(figsize=(6,4))
cleaned.plot.scatter(x='HighBp', y='Obesity')
plt.title('High Blood Pressure Prevalence v. Obesity Prevalence')
# Show the plot
plt.show()
In [12]:
# Plot Unemployment with Obesity
plt.figure(figsize=(6,4))
cleaned.plot.scatter(x='Unemployment', y='Obesity')
plt.title('Unemployment Rate v. Obesity Prevalence')
# Show the plot
plt.show()
In [13]:
# Plot PovertyRate with Obesity
plt.figure(figsize=(6,4))
cleaned.plot.scatter(x='PovertyRate', y='Obesity')
plt.title('Poverty Rate v. Obesity Prevalence')
# Show the plot
plt.show()
Looks like each of variable does correlate positively with obesity prevalence.
We'll be predicting the prevalence of obesity with the following linear model.
$$f_\theta(x) = \theta_0 + \theta_1 (\text{high blood pressure}) + \theta_2 (\text{unemployment}) + \theta_3(\text{poverty rate})$$
Let's start by creating our training and test sets. We'll then train a linear regression model using Scikit learn's LinearRegression
In [14]:
X = cleaned[['HighBp', 'Unemployment', 'PovertyRate']]
Y = cleaned[['Obesity']]
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
model = LinearRegression(fit_intercept=True)
model.fit(x_train, y_train)
print('Model Intercept: {}'.format(model.intercept_))
print('Model Coefficients: {}'.format(model.coef_))
In [15]:
def mse(y_pred, y_true):
""" Compute the mean squared error of 'y_pred' and 'y_true'. """
return float(np.sum((y_pred - y_true) ** 2)) / len(y_pred)
train_pred = model.predict(x_train)
test_pred = model.predict(x_test)
print('Training Error: {}'.format(mse(train_pred, y_train)))
print('Test Error: {}'.format(mse(test_pred, y_test)))
We can also display a plot of the residuals
In [16]:
y_res = (test_pred - y_test) / y_test
# Plot the results
plt.title("Residual Proportions Plot")
plt.xlabel("Test Data Row Index")
plt.ylabel("resid. / actual obesity prevalence")
plt.scatter(y_res.index, y_res['Obesity'])
plt.show()
How well does your model perform? We were able to achieve an MSE for the test set of approximately 10% points from the observed obesity prevalence. Our model was also able to fit the data with the max residual bounded by +/- 30%, which for a simple model considering only three explanatory variables isn't so bad.
Congratulations! Rather than spending a couple of hours just searching for and joining datasets from the CDC, Bureau of Labor Statistics, and US Census Bureau, you've spent one lab session querying dataCommons for the same data and, on top of that, using it to build a linear model. Hopefully dataCommons has made the time-to-analysis much shorter!
You've also created a basic model to predict the prevalence of obesity in approximately 500 states using intuition from a long line of research. The model you've created only uses three explanatory variables, so even though it's not the most accurate at predicting obesity prevalence, it's possible that it can be improved by adding more variables. Obesity is also known to correlate with factors such as high cholesterol and diabetes, but with the data in dataCommons, we can even consider asking if obesity correlates with stranger factors such as
Does adding these variables into your model improve accuracy? Can you think of other variables that correlate with obesity?