The California Health Interview Survey (CHIS) is a large-scale, long-running health survey that provides detailed, respondent-level data about health behaviors, outcomes, and demographics. The dataset includes 80 replicate weights for calculating population estimates and variances. This notebook demonstrates how to use these weights, comparing the results to AskCHIS, a data access website that aggregates CHIS responses.
This notebook uses a Metapack package to access the CHIS datasets, which bypasses the dataset's terms and restrictions. These terms are also reproduced in the data package documentation, shown below. You must accept these terms and restrictions before using this dataset.
We'll start by loading the common imports.
In [1]:
import seaborn as sns
import metapack as mp
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display
from publicdata.chis import *
%matplotlib inline
sns.set_context('notebook')
In [2]:
# Opening a source package presumes you are working with the notebook in the source package,
# https://github.com/sandiegodata-projects/chis.git
pkg = mp.jupyter.open_source_package()
pkg
Out[2]:
In [3]:
df = pkg.reference('adult_2017').dataframe()
First, we'll replicate the results for the question "Ever diagnosed with diabetes" for all of California, for 2017, from AskCHIS.
Total population: 29,456,000
Getting estimates is easy. Each value of rakedw0
is the number of people that the associated respondent represents in the total California population. So, all of the values of rakedw0
sum to the controlled California adult population, and grouping the dataset by the responses to a variable and summing the values of rakedw0
within each response gives the estimated number of people who would have given that response.
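Equivalently (a minimal sketch using a plain pandas groupby rather than the pivot_table used in the next cell), the same estimates can be computed as:
In [ ]:
# Same estimates: sum rakedw0 within each response category of the diabetes question
df.groupby('diabetes')['rakedw0'].sum().round(-3)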
In [4]:
t = df.pivot_table(values='rakedw0', columns='diabetes', index=df.index)
t2 = t.sum().round(-3)
t2
Out[4]:
Summing across responses yields the total population, which we can use to calculate percentages.
In [5]:
t2.sum()
Out[5]:
In [6]:
(t2/t2.sum()*100).round(1)
Out[6]:
In [7]:
t = df[['diabetes','rakedw0']].set_index('diabetes',append=True).unstack()
t2 = t.sum().round(-3)
diabetes_yes = t2.unstack().loc['rakedw0','YES']
diabetes_no = t2.unstack().loc['rakedw0','NO']
diabetes_yes, diabetes_no
Out[7]:
The basic formula for calculating the variance is in Section 9.2, Methods for Variance Estimation, of CHIS Report 5, Weighting and Variance Estimation. Basically, the other 80 raked weights, rakedw1
through rakedw80,
give alternate estimates. It's like running the survey an additional 80 times, which allows you to calculate the sample variance from the set of alternate estimates.
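Concretely, writing $\hat\theta_0$ for the estimate computed from rakedw0 and $\hat\theta_r$ for the estimate computed from replicate weight $r$, the next cell computes

$$ SE(\hat\theta) = \sqrt{\sum_{r=1}^{80} \left(\hat\theta_r - \hat\theta_0\right)^2}, \qquad CI_{95} = \hat\theta_0 \pm 1.96\, SE(\hat\theta) $$

(This is a restatement of the computation below; see the CHIS report for the formal derivation.)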
In the code below, we'll expand the operation with temporary variables, to document each step.
In [8]:
weight_cols = [c for c in df.columns if 'raked' in c]
t = df[['diabetes']+weight_cols] # Get the column of interest, and all of the raked weights
t = t.set_index('diabetes',append=True) # Move the column of interest into the index
t = t.unstack() # Unstack the column of interest, so both values are now in multi-level columns
t = t.sum() # Sum all of the weights for each raked weight set and "YES"/"NO"
t = t.unstack() # Now we have sums for each of the replicates, for each of the variable values.
t = t.sub(t.loc['rakedw0']).iloc[1:] # Subtract the rakedw0 point estimate from each of the replicates
t = (t**2).sum() # Sum of squares
ci_95 = np.sqrt(t)*1.96 # sqrt to get the standard error, and 1.96 to get the 95% CI
The final percentage ranges match those from AskCHIS.
In [9]:
((diabetes_yes-ci_95.loc['YES'])/29_456_000*100).round(1), ((diabetes_yes+ci_95.loc['YES'])/29_456_000*100).round(1)
Out[9]:
In [10]:
((diabetes_no-ci_95.loc['NO'])/29_456_000*100).round(1), ((diabetes_no+ci_95.loc['NO'])/29_456_000*100).round(1)
Out[10]:
Here is a function for calculating the estimate, percentages, standard error, and relative standard error from a dataset. This function also works with a subset of the dataset, but note that the percentages will be relative to the total of the input dataset, not the whole California population.
In [11]:
def chis_estimate(df, column, ci=True, pct=True, rse=False):
    """Calculate estimates for CHIS variables, with variances, as 95% CI, from the replicate weights"""

    weight_cols = [c for c in df.columns if 'raked' in c]

    t = df[[column] + weight_cols]        # Get the column of interest, and all of the raked weights
    t = t.set_index(column, append=True)  # Move the column of interest into the index
    t = t.unstack()                       # Unstack the column of interest, so both values are now in multi-level columns
    t = t.sum()                           # Sum all of the weights for each raked weight set and "YES"/"NO"
    t = t.unstack()                       # Now we have sums for each of the replicates, for each of the variable values.

    est = t.iloc[0].to_frame()            # Replicate weight 0 is the estimate
    est.columns = [column]

    total = est.sum()[column]

    t = t.sub(t.loc['rakedw0']).iloc[1:]  # Subtract the rakedw0 point estimate from each of the replicates
    t = (t**2).sum()                      # Sum of squares
    se = np.sqrt(t)                       # sqrt to get the standard error,
    ci_95 = se * 1.96                     # and 1.96 to get the 95% CI

    if ci:
        est[column + '_95_l'] = est[column] - ci_95
        est[column + '_95_h'] = est[column] + ci_95
    else:
        est[column + '_se'] = se

    if pct:
        est[column + '_pct'] = (est[column] / total * 100).round(1)
        if ci:
            est[column + '_pct_l'] = (est[column + '_95_l'] / total * 100).round(1)
            est[column + '_pct_h'] = (est[column + '_95_h'] / total * 100).round(1)

    if rse:
        est[column + '_rse'] = (se / est[column] * 100).round(1)

    est.rename(columns={column: column + '_count'}, inplace=True)

    return est

chis_estimate(df, 'diabetes', ci=False, pct=False)
Out[11]:
In [12]:
# This validates against the whole population for 2017, from the AskCHIS web application
chis_estimate(df, 'ag1')
Out[12]:
In [13]:
# This validates against the Latino subset for 2017, from the AskCHIS web application
chis_estimate(df[df.racedf_p1=='LATINO'], 'ag1')
Out[13]:
This function allows segmenting on another column, for instance, breaking out responses by race. Note that in the examples below we keep only estimates whose relative standard error (such as diabetes_rse
) is less than 30%. CHIS uses 30% as the limit for unstable values, and won't publish estimates with higher RSEs.
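As a minimal illustration of that rule, using the chis_estimate() function defined above, unstable rows can be dropped like this:
In [ ]:
# Compute estimates with the relative standard error, then keep only stable rows (RSE < 30%)
est = chis_estimate(df, 'diabetes', ci=False, pct=True, rse=True)
est[est['diabetes_rse'] < 30]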
In [ ]:
def chis_segment_estimate(df, column, segment_columns):
    """Return aggregated CHIS data, segmented on one or more other variables."""

    if not isinstance(segment_columns, (list, tuple)):
        segment_columns = [segment_columns]

    odf = None

    for index, row in df[segment_columns].drop_duplicates().iterrows():

        query = ' and '.join(["{} == '{}'".format(c, v) for c, v in zip(segment_columns, list(row))])

        x = chis_estimate(df.query(query), column, ci=True, pct=True, rse=True)
        x.columns.names = ['measure']
        x = x.unstack()

        for col, val in zip(segment_columns, list(row)):
            x = pd.concat([x], keys=[val], names=[col])

        if odf is None:
            odf = x
        else:
            odf = pd.concat([odf, x])

    odf = odf.to_frame()
    odf.columns = ['value']

    return odf
The dataframe returned by this function has a multi-level index, which includes all of the unique values from the segmentation columns, a level for the measures, and the values from the target column. For instance:
In [204]:
chis_segment_estimate(df, 'diabetes', ['racedf_p1', 'ur_ihs']).head(20)
Out[204]:
You can "pivot" a level out of the row into the columns with unstack()
. Here we move the measures out of the row index into columns.
In [207]:
t = chis_segment_estimate(df, 'diabetes', ['racedf_p1', 'ur_ihs'])
t.unstack('measure').head()
Out[207]:
Complex selections can be made with .loc.
In [213]:
t = chis_segment_estimate(df, 'diabetes', ['racedf_p1', 'ur_ihs'])
idx = pd.IndexSlice # Convenience redefinition.
# The IndexSlices should have one term (separated by ',') for each of the levels in the index.
# We have one `IndexSlice` for rows, and one for columns. Note that the ``row_indexer`` has 4 terms.
row_indexer = idx[:,:,('diabetes_pct','diabetes_rse'),'YES']
col_indexer = idx[:]
# Now we can select with the two indexers.
t = t.loc[row_indexer,col_indexer]
# Rotate the measures out of rows into columns
t = t.unstack('measure')
# The columns are multi-level, but there is only one value for the first level,
# so it is useless.
t.columns = t.columns.droplevel()
# Only use estimates with RSE < 30%
t = t[t.diabetes_rse < 30]
# We don't need the RSE column any more.
t = t.drop(columns='diabetes_rse')
# Move the Rural/Urban into columns
t = t.unstack(0)
t
Out[213]:
In [202]:
x = chis_segment_estimate(df, 'diabetes', ['racedf_p1', 'am3'])
row_indexer = idx[('YES','NO'),:,('diabetes_pct','diabetes_rse'),'YES']
col_indexer = idx[:]
t = x.loc[row_indexer,col_indexer].unstack('measure')
t.columns = t.columns.droplevel()
t = t[t.diabetes_rse < 30].drop(columns='diabetes_rse')
t
Out[202]:
In [214]:
x = chis_segment_estimate(df, 'diabetes', ['racedf_p1', 'am3'])
row_indexer = idx[:,:,('diabetes_pct','diabetes_rse'),'YES']
col_indexer = idx[:]
t = x.loc[row_indexer,col_indexer].unstack('measure')
#t.index = t.index.droplevel('diabetes')
t.columns = t.columns.droplevel()
t = t[t.diabetes_rse < 30].drop(columns='diabetes_rse')
t.unstack(0)
Out[214]:
In [186]:
chis_segment_estimate(df, 'diabetes', 'am3')
Out[186]: