Example 3: Downloading State Data

For this example, we will be running a simple linear regression model, so we need an additional import:


In [1]:
import pandas as pd
import censusdata
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.precision', 2)
import statsmodels.formula.api as sm

We begin by downloading data on some basic socioeconomic characteristics for all U.S. states:


In [2]:
statedata = censusdata.download('acs5', 2015, censusdata.censusgeo([('state', '*')]),
                                ['B01001_001E', 'B19013_001E', 'B19083_001E',
                                 'C17002_001E', 'C17002_002E', 'C17002_003E', 'C17002_004E',
                                 'B03002_001E', 'B03002_003E', 'B03002_004E', 'B03002_012E',])

We then link data on the percent of voters in each state voting Democratic in the 2016 U.S. presidential election:


In [3]:
voting2016 = {
    censusdata.censusgeo((('state', '01'),)): 34.6,
    censusdata.censusgeo((('state', '02'),)): 37.7,
    censusdata.censusgeo((('state', '04'),)): 45.4,
    censusdata.censusgeo((('state', '05'),)): 33.8,
    censusdata.censusgeo((('state', '06'),)): 61.6,
    censusdata.censusgeo((('state', '08'),)): 47.2,
    censusdata.censusgeo((('state', '09'),)): 54.5,
    censusdata.censusgeo((('state', '10'),)): 53.4,
    censusdata.censusgeo((('state', '11'),)): 92.8,
    censusdata.censusgeo((('state', '12'),)): 47.8,
    censusdata.censusgeo((('state', '13'),)): 45.6,
    censusdata.censusgeo((('state', '15'),)): 62.3,
    censusdata.censusgeo((('state', '16'),)): 27.6,
    censusdata.censusgeo((('state', '17'),)): 55.4,
    censusdata.censusgeo((('state', '18'),)): 37.9,
    censusdata.censusgeo((('state', '19'),)): 42.2,
    censusdata.censusgeo((('state', '20'),)): 36.2,
    censusdata.censusgeo((('state', '21'),)): 32.7,
    censusdata.censusgeo((('state', '22'),)): 38.4,
    censusdata.censusgeo((('state', '23'),)): 47.9,
    censusdata.censusgeo((('state', '24'),)): 60.5,
    censusdata.censusgeo((('state', '25'),)): 60.8,
    censusdata.censusgeo((('state', '26'),)): 47.3,
    censusdata.censusgeo((('state', '27'),)): 46.9,
    censusdata.censusgeo((('state', '28'),)): 39.7,
    censusdata.censusgeo((('state', '29'),)): 38,
    censusdata.censusgeo((('state', '30'),)): 36,
    censusdata.censusgeo((('state', '31'),)): 34,
    censusdata.censusgeo((('state', '32'),)): 47.9,
    censusdata.censusgeo((('state', '33'),)): 47.6,
    censusdata.censusgeo((('state', '34'),)): 55,
    censusdata.censusgeo((('state', '35'),)): 48.3,
    censusdata.censusgeo((('state', '36'),)): 58.8,
    censusdata.censusgeo((('state', '37'),)): 46.7,
    censusdata.censusgeo((('state', '38'),)): 27.8,
    censusdata.censusgeo((('state', '39'),)): 43.5,
    censusdata.censusgeo((('state', '40'),)): 28.9,
    censusdata.censusgeo((('state', '41'),)): 51.7,
    censusdata.censusgeo((('state', '42'),)): 47.6,
    censusdata.censusgeo((('state', '44'),)): 55.4,
    censusdata.censusgeo((('state', '45'),)): 40.8,
    censusdata.censusgeo((('state', '46'),)): 31.7,
    censusdata.censusgeo((('state', '47'),)): 34.9,
    censusdata.censusgeo((('state', '48'),)): 43.4,
    censusdata.censusgeo((('state', '49'),)): 27.8,
    censusdata.censusgeo((('state', '50'),)): 61.1,
    censusdata.censusgeo((('state', '51'),)): 49.9,
    censusdata.censusgeo((('state', '53'),)): 54.4,
    censusdata.censusgeo((('state', '54'),)): 26.5,
    censusdata.censusgeo((('state', '55'),)): 46.9,
    censusdata.censusgeo((('state', '56'),)): 22.5,
}
voting2016 = pd.DataFrame.from_dict(voting2016, orient='index')
statedata['percent_democratic_pres_2016'] = voting2016

We then rename columns, compute some additional variables, and rescale some variables to make regression coefficients more easily interpretable:


In [4]:
statedata = statedata.rename(columns={'B01001_001E': 'population_size'})
statedata.population_size = statedata.population_size / 100000
statedata = statedata.rename(columns={'B19013_001E': 'median_HH_income'})
statedata['median_HH_income'] = statedata['median_HH_income'] / 1000
statedata = statedata.rename(columns={'B19083_001E': 'gini_index'})
statedata.gini_index = statedata.gini_index * 100
statedata['percent_below_125_poverty'] = (statedata['C17002_002E'] + statedata['C17002_003E'] + statedata['C17002_004E']) / statedata['C17002_001E'] * 100
statedata['percent_nonhisp_white'] = statedata['B03002_003E'] / statedata['B03002_001E'] * 100
statedata['percent_nonhisp_black'] = statedata['B03002_004E'] / statedata['B03002_001E'] * 100
statedata['percent_hispanic'] = statedata['B03002_012E'] / statedata['B03002_001E'] * 100

We run a quick check on the data and then delete variables we no longer need:


In [5]:
assert (statedata['population_size'] == statedata['B03002_001E'] / 100000).all()
for column in ['C17002_001E', 'C17002_002E', 'C17002_003E', 'C17002_004E',
               'B03002_001E', 'B03002_003E', 'B03002_004E', 'B03002_012E',]:
    del statedata[column]

We are only interested in the 50 states + DC, so we drop Puerto Rico:


In [6]:
statedata = statedata.drop([censusdata.censusgeo([('state', '72')])])

Finally, we reorder the variables and run simple descriptives:


In [7]:
statedata = statedata.reindex(columns=['percent_democratic_pres_2016', 'population_size', 'median_HH_income', 'percent_below_125_poverty', 'gini_index', 'percent_nonhisp_white', 'percent_nonhisp_black', 'percent_hispanic'])
statedata.describe()


Out[7]:
percent_democratic_pres_2016 population_size median_HH_income percent_below_125_poverty gini_index percent_nonhisp_white percent_nonhisp_black percent_hispanic
count 51.00 51.00 51.00 51.00 51.00 51.00 51.00 51.00
mean 45.05 62.06 54.64 19.44 46.22 69.53 10.91 11.20
std 12.41 70.53 9.16 3.94 2.14 16.12 10.77 10.06
min 22.50 5.80 39.66 11.84 41.81 22.89 0.44 1.37
25% 36.10 17.34 47.55 16.25 44.81 58.43 3.17 4.72
50% 46.70 43.97 53.00 20.08 46.26 73.60 7.12 8.84
75% 52.55 68.46 60.68 22.45 47.59 81.23 14.92 12.88
max 92.80 384.21 74.55 28.96 53.17 93.88 47.98 47.36

Then we examine bivariate correlations prior to running a linear regression model:


In [8]:
statedata.corr()


Out[8]:
percent_democratic_pres_2016 population_size median_HH_income percent_below_125_poverty gini_index percent_nonhisp_white percent_nonhisp_black percent_hispanic
percent_democratic_pres_2016 1.00 0.24 0.57 -0.21 0.47 -0.53 0.34 0.26
population_size 0.24 1.00 0.03 0.18 0.43 -0.40 0.11 0.53
median_HH_income 0.57 0.03 1.00 -0.81 -0.09 -0.27 -0.06 0.11
percent_below_125_poverty -0.21 0.18 -0.81 1.00 0.48 -0.23 0.39 0.19
gini_index 0.47 0.43 -0.09 0.48 1.00 -0.45 0.61 0.28
percent_nonhisp_white -0.53 -0.40 -0.27 -0.23 -0.45 1.00 -0.46 -0.63
percent_nonhisp_black 0.34 0.11 -0.06 0.39 0.61 -0.46 1.00 -0.13
percent_hispanic 0.26 0.53 0.11 0.19 0.28 -0.63 -0.13 1.00

In [9]:
result = sm.ols(formula=("percent_democratic_pres_2016 ~ population_size + median_HH_income"
                         "+ percent_nonhisp_black + percent_hispanic"), data=statedata).fit()
result.summary()


Out[9]:
OLS Regression Results
Dep. Variable: percent_democratic_pres_2016 R-squared: 0.532
Model: OLS Adj. R-squared: 0.492
Method: Least Squares F-statistic: 13.08
Date: Sat, 15 Sep 2018 Prob (F-statistic): 3.42e-07
Time: 18:48:26 Log-Likelihood: -180.94
No. Observations: 51 AIC: 371.9
Df Residuals: 46 BIC: 381.5
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -5.7076 7.801 -0.732 0.468 -21.409 9.994
population_size 0.0121 0.021 0.563 0.576 -0.031 0.055
median_HH_income 0.7715 0.138 5.603 0.000 0.494 1.049
percent_nonhisp_black 0.4551 0.120 3.790 0.000 0.213 0.697
percent_hispanic 0.2578 0.151 1.704 0.095 -0.047 0.562
Omnibus: 2.104 Durbin-Watson: 2.420
Prob(Omnibus): 0.349 Jarque-Bera (JB): 1.237
Skew: 0.208 Prob(JB): 0.539
Kurtosis: 3.640 Cond. No. 647.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In this simple model, the percentage voting Democratic is not significantly associated with population size or % Hispanic, at the p<.05 level. It is significantly associated with median household income and the % non-Hispanic black. Every $1,000 increase in median household income is associated with an increase of just under 1 percentage point in the Democratic vote. Every one percentage point increase in the % non-Hispanic black is associated with about a half a percentage point increase in the Democratic vote. Of course,

  1. The outcome variable is not continuous, due to its bounded range, and this model does not account for this (it is essentially a linear probability model);
  2. The choice of covariates is simplistic and just designed to demonstrate fitting a model;
  3. We might consider robust standard errors for this model.