CHIS Estimates and Variances

The California Health Interview Survey (CHIS) is a large-scale, long-running health survey that provides detailed, respondent-level data about health behaviors, outcomes and demographics. The dataset includes 80 replicate weights for calculating population estimates and variances. This notebook demonstrates how to use these weights and compares the results to AskCHIS, a data access website that aggregates CHIS responses.

This notebook uses a Metapack package to access the CHIS datasets, which bypasses the step where you would accept the dataset terms and restrictions. These terms are reproduced in the data package documentation, shown below. You must accept these terms and restrictions before using this dataset.

First, we load the common imports.


In [1]:
import seaborn as sns
import metapack as mp
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display 

from publicdata.chis import *

%matplotlib inline
sns.set_context('notebook')

In [2]:
# Opening a source package presumes you are working with the notebook in the source package, 
# https://github.com/sandiegodata-projects/chis.git
pkg = mp.jupyter.open_source_package()
pkg


Out[2]:

CHIS California Health Interview Survey, Adults

healthpolicy.ucla.edu-chis-adult-1 Last Update: 2018-11-21T20:44:48

Documentation and Reference Links to CHIS files.

CHIS Data packages

Using these files requires accepting the terms and restrictions provided by the UCLA Center for Health Policy Research. These terms are available online, and reproduced here:

Restrictions on the Use of California Health Interview Survey Data Before you
download this file, you must first agree to these Restrictions on the Use of
CHIS Data by clicking the button below.

The California Health Interview Survey (CHIS) is bound by promises made to
respondents, by California law, and by University and government human subject
protection committees to assure that no personal information is released in a
form that identifies an individual without the consent of the person who
supplied the information. The California Information Practices Act (section
1798.24) provides that the data collected by CHIS may be released only for
statistical research and reporting purposes. Any intentional identification or
disclosure of personal information violates this law, and violates the privacy
rights of the people who provided data to CHIS. Unauthorized disclosure of
personal information is subject to civil action and penalties for invasion of
privacy under California Civil Code, Section 1798.53.

Documentation Links

Contacts

References


In [3]:
df = pkg.reference('adult_2017').dataframe()

Estimates Using Pivot

First, we'll replicate the results for the question "Ever diagnosed with diabetes" for all of California, for 2017, from AskCHIS.

  • Diagnosed with diabetes: 10.7%, ( 9.6% - 11.8% ) 3,145,000
  • Never diagnosed with diabetes 89.3% ( 88.2% - 90.4% ) 26,311,000

Total population: 29,456,000

Getting estimates is straightforward. Each value of rakedw0 is the number of people in the total California population that the associated respondent represents. So the values of rakedw0 sum to the controlled California adult population, and grouping the dataset by the responses to a variable and summing rakedw0 within each group gives the estimated number of people who would have given each response.
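
As a minimal sketch of the idea, the same sum-of-weights logic can be tried on a tiny made-up table. The five rows and their weights are invented for illustration and are not CHIS data, and groupby is used here as an assumed equivalent of the pivot in the next cell.

```python
import pandas as pd

# Synthetic stand-in for the CHIS frame: five respondents, each with a
# made-up base weight (the number of people that respondent represents).
toy = pd.DataFrame({
    'diabetes': ['YES', 'NO', 'NO', 'YES', 'NO'],
    'rakedw0':  [100.0, 250.0, 300.0, 50.0, 300.0],
})

# Sum the weights within each response to get estimated population counts.
toy_counts = toy.groupby('diabetes')['rakedw0'].sum()

# Dividing by the total weight gives the estimated percentages.
toy_pct = (toy_counts / toy_counts.sum() * 100).round(1)

print(toy_counts)
print(toy_pct)
```

The names toy, toy_counts and toy_pct are chosen so that they do not overwrite the notebook's own df and t variables.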


In [4]:
t = df.pivot_table(values='rakedw0', columns='diabetes', index=df.index)
t2 = t.sum().round(-3)
t2


Out[4]:
diabetes
YES     3145000.0
NO     26311000.0
dtype: float64

Summing across responses yields the total population, which we can use to calculate percentages.


In [5]:
t2.sum()


Out[5]:
29456000.0

In [6]:
(t2/t2.sum()*100).round(1)


Out[6]:
diabetes
YES    10.7
NO     89.3
dtype: float64

Estimates Using Unstack

You can also calculate the same values using set_index and unstack.


In [7]:
t = df[['diabetes','rakedw0']].set_index('diabetes',append=True).unstack()
t2 = t.sum().round(-3)
diabetes_yes = t2.unstack().loc['rakedw0','YES']
diabetes_no = t2.unstack().loc['rakedw0','NO']
diabetes_yes, diabetes_no


Out[7]:
(3145000.0, 26311000.0)

Calculating Variance

The basic formula for calculating the variance is in section 9.2, Methods for Variance Estimation, of CHIS Report 5, Weighting and Variance Estimation. The other 80 raked weights, rakedw1 through rakedw80, give alternate estimates. It is like running the survey an additional 80 times, which allows you to calculate the variance from the spread of the alternate estimates around the rakedw0 estimate.
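
Written out, the calculation that this notebook's code performs is the following. The notation is ours and may differ from the report: \hat{\theta}_0 is the estimate from rakedw0 and \hat{\theta}_k is the estimate from replicate weight rakedwk.

```latex
v(\hat{\theta}) = \sum_{k=1}^{80} \left( \hat{\theta}_k - \hat{\theta}_0 \right)^2,
\qquad
\mathrm{SE}(\hat{\theta}) = \sqrt{v(\hat{\theta})},
\qquad
\mathrm{CI}_{95\%} = \hat{\theta}_0 \pm 1.96 \, \mathrm{SE}(\hat{\theta})
```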

In the code below, we'll expand the operation with temporary variables, to document each step.


In [8]:
weight_cols = [c for c in df.columns if 'raked' in c]

t = df[['diabetes']+weight_cols] # Get the column of interest, and all of the raked weights
t = t.set_index('diabetes',append=True) # Move the column of interest into the index
t = t.unstack() # Unstack the column of interest, so both values are now in multi-level columns
t = t.sum() # Sum all of the weights for each raked weight set and each of "YES"/"NO"
t = t.unstack() # Now we have sums for each of the replicates, for each of the variable values.

t = t.sub(t.loc['rakedw0']).iloc[1:] # Subtract the rakedw0 estimate from each of the replicates
t = (t**2).sum() # sum of squares
ci_95 = np.sqrt(t)*1.96 # sqrt to get the standard error, and 1.96 to get the 95% CI half-width

The final percentage ranges match those from AskCHIS.


In [9]:
((diabetes_yes-ci_95.loc['YES'])/29_456_000*100).round(1), ((diabetes_yes+ci_95.loc['YES'])/29_456_000*100).round(1)


Out[9]:
(9.6, 11.8)

In [10]:
((diabetes_no-ci_95.loc['NO'])/29_456_000*100).round(1), ((diabetes_no+ci_95.loc['NO'])/29_456_000*100).round(1)


Out[10]:
(88.2, 90.4)
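
The steps above can also be sketched end to end on synthetic data, which is handy for checking the logic without access to the CHIS files. Everything in this block is an assumption for illustration: the answers and weights are generated, the replicates are just the base weight with random noise (not how CHIS constructs its replicate weights), and groupby stands in for the set_index/unstack sequence.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_resp, n_rep = 200, 80

# Synthetic respondents: every fifth one answers "YES", the rest "NO".
answers = np.where(np.arange(n_resp) % 5 == 0, 'YES', 'NO')

# A made-up base weight plus 80 noisy copies standing in for replicate weights.
base = rng.uniform(50, 150, size=n_resp)
reps = base[:, None] * rng.normal(1.0, 0.05, size=(n_resp, n_rep))

weight_names = [f'rakedw{k}' for k in range(n_rep + 1)]
sim = pd.DataFrame(np.column_stack([base, reps]), columns=weight_names)
sim.insert(0, 'answer', answers)

# One row per answer, one column per weight set.
sums = sim.groupby('answer')[weight_names].sum()

# rakedw0 is the estimate; the replicates give the spread around it.
sim_est = sums['rakedw0']
sim_dev = sums.drop(columns='rakedw0').sub(sim_est, axis=0)
sim_se = np.sqrt((sim_dev ** 2).sum(axis=1))

sim_result = pd.DataFrame({
    'estimate': sim_est,
    'ci_low': sim_est - 1.96 * sim_se,
    'ci_high': sim_est + 1.96 * sim_se,
})
print(sim_result.round(0))
```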

Functions

Here is a function for calculating the estimate, percentages, standard error and relative standard error from a dataset. The function also works on a subset of the dataset, but note that the percentages are then relative to the total of the subset passed in, not the whole California population.


In [11]:
def chis_estimate(df, column, ci=True, pct=True, rse=False):
    """Calculate estimates for CHIS variables, with variances, as 95% CI,  from the replicate weights"""
    
    weight_cols = [c for c in df.columns if 'raked' in c]
    
    t = df[[column]+weight_cols] # Get the column of interest, and all of the raked weights
    t = t.set_index(column,append=True) # Move the column of interest into the index
    t = t.unstack() # Unstack the column of interest, so both values are now in multi-level columns
    t = t.sum() # Sum all of the weights for each raked weight set and each response value
    t = t.unstack() # Now we have sums for each of the replicates, for each of the variable values.

    est = t.loc['rakedw0'].to_frame() # The base weight, rakedw0, gives the estimate
    
    est.columns = [column]
    
    total = est.sum()[column]
    
    t = t.sub(t.loc['rakedw0']).iloc[1:] # Subtract the rakedw0 estimate from each of the replicates
    t = (t**2).sum() # sum of squares
    
    se = np.sqrt(t) # sqrt to get the standard error,
    ci_95 = se*1.96 #  and 1.96 to get the 95% CI half-width

    if ci:
        est[column+'_95_l'] = est[column] - ci_95
        est[column+'_95_h'] = est[column] + ci_95  
    else:
        est[column+'_se'] = se
    
    if pct:
        est[column+'_pct'] = (est[column]/total*100).round(1)
        if ci:
            est[column+'_pct_l'] = (est[column+'_95_l']/total*100).round(1)
            est[column+'_pct_h'] = (est[column+'_95_h']/total*100).round(1)
    if rse:
        est[column+'_rse'] = (se/est[column]*100).round(1)
        
        
    est.rename(columns={column:column+'_count'}, inplace=True)
    
    return est
    
chis_estimate(df, 'diabetes', ci=False, pct=False)


Out[11]:
          diabetes_count    diabetes_se
diabetes
YES         3.144752e+06  162066.482981
NO          2.631094e+07  162066.482979

In [12]:
# This validates with the whole population for 2017, from the AskCHIS web application
chis_estimate(df, 'ag1')


Out[12]:
                                        ag1_count      ag1_95_l      ag1_95_h  ag1_pct  ag1_pct_l  ag1_pct_h
ag1
HAVE NEVER VISIT                     7.240603e+05  5.590919e+05  8.890286e+05      2.5        1.9        3.0
6 MONTHS AGO OR LESS                 1.697365e+07  1.613956e+07  1.780774e+07     57.6       54.8       60.5
MORE THAN 6 MONTHS UP TO 1 YEAR AGO  4.462391e+06  4.079391e+06  4.845391e+06     15.1       13.8       16.4
MORE THAN 1 YEAR UP TO 2 YEARS AGO   2.997435e+06  2.649192e+06  3.345679e+06     10.2        9.0       11.4
MORE THAN 2 YEARS UP TO 5 YEARS AGO  2.291014e+06  1.788671e+06  2.793358e+06      7.8        6.1        9.5
MORE THAN 5 YEARS AGO                2.007144e+06  1.765346e+06  2.248943e+06      6.8        6.0        7.6

In [13]:
# This validates with the Latino subset for 2017, from the AskCHIS web application
chis_estimate(df[df.racedf_p1=='LATINO'], 'ag1')


Out[13]:
                                        ag1_count      ag1_95_l      ag1_95_h  ag1_pct  ag1_pct_l  ag1_pct_h
ag1
HAVE NEVER VISIT                     4.663458e+05  3.094320e+05  6.232595e+05      4.4        2.9        5.9
6 MONTHS AGO OR LESS                 4.988002e+06  4.702488e+06  5.273516e+06     47.3       44.6       50.1
MORE THAN 6 MONTHS UP TO 1 YEAR AGO  1.870843e+06  1.495776e+06  2.245911e+06     17.8       14.2       21.3
MORE THAN 1 YEAR UP TO 2 YEARS AGO   1.244184e+06  1.084788e+06  1.403581e+06     11.8       10.3       13.3
MORE THAN 2 YEARS UP TO 5 YEARS AGO  1.009716e+06  7.895267e+05  1.229906e+06      9.6        7.5       11.7
MORE THAN 5 YEARS AGO                9.570417e+05  6.690951e+05  1.244988e+06      9.1        6.4       11.8

Segmenting Results

This function allows segmenting on another column, for instance, breaking out responses by race. Note that the examples below keep only estimates with a relative standard error (such as diabetes_rse) under 30%. CHIS uses 30% as the limit for unstable values and won't publish estimates with higher RSEs.
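
A minimal sketch of the RSE check: the relative standard error is the standard error divided by the estimate, expressed as a percentage. The first row uses the diabetes "YES" count and standard error reported earlier in this notebook. The second row is a hypothetical small cell invented to show an unstable estimate.

```python
import pandas as pd

rse_demo = pd.DataFrame(
    {
        'count': [3.144752e+06, 5000.0],   # second value is hypothetical
        'se':    [162066.482981, 2500.0],  # second value is hypothetical
    },
    index=['diabetes YES', 'hypothetical small cell'],
)

# RSE is the standard error as a percentage of the estimate.
rse_demo['rse'] = (rse_demo['se'] / rse_demo['count'] * 100).round(1)

# Flag estimates above the 30% limit described in the text.
rse_demo['unstable'] = rse_demo['rse'] > 30

print(rse_demo)
```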


In [ ]:
def chis_segment_estimate(df, column, segment_columns):
    """Return aggregated CHIS data, segmented on one or more other variables."""

    # Accept a single column name as well as a list or tuple of names
    if not isinstance(segment_columns, (list,tuple)):
        segment_columns = [segment_columns]
    segment_columns = list(segment_columns)

    odf = None

    # Iterate over each distinct combination of segment values
    for index,row in df[segment_columns].drop_duplicates().iterrows():
        # Build a query string that selects the rows for this combination
        query = ' and '.join([ "{} == '{}'".format(c,v) for c,v in zip(segment_columns, list(row))])

        x = chis_estimate(df.query(query), column, ci=True, pct=True, rse=True)
        x.columns.names = ['measure']
        x = x.unstack() # Series indexed by (measure, response value)

        # Prepend one index level per segment column, holding this segment's value
        for col,val in zip(segment_columns, list(row)):
            x = pd.concat([x], keys=[val], names=[col])

        if odf is None:
            odf = x
        else:
            odf = pd.concat([odf, x])

    odf = odf.to_frame()
    odf.columns = ['value']

    return odf

The dataframe returned by this function has a multi-level index, which includes all of the unique values from the segmentation columns, a level for measures, and the values from the target column. For instance:


In [204]:
chis_segment_estimate(df, 'diabetes', ['racedf_p1', 'ur_ihs']).head(20)


Out[204]:
value
ur_ihs racedf_p1 measure diabetes
RURAL NON-LATINO WHITE diabetes_count YES 4.258255e+05
NO 3.650399e+06
diabetes_95_l YES 3.334384e+05
NO 3.415331e+06
diabetes_95_h YES 5.182125e+05
NO 3.885467e+06
diabetes_pct YES 1.040000e+01
NO 8.960000e+01
diabetes_pct_l YES 8.200000e+00
NO 8.380000e+01
diabetes_pct_h YES 1.270000e+01
NO 9.530000e+01
diabetes_rse YES 1.110000e+01
NO 3.300000e+00
URBAN NON-LATINO WHITE diabetes_count YES 6.618401e+05
NO 7.385090e+06
diabetes_95_l YES 5.500003e+05
NO 7.138565e+06
diabetes_95_h YES 7.736798e+05
NO 7.631615e+06

You can "pivot" a level out of the row into the columns with unstack(). Here we move the measures out of the row index into columns.


In [207]:
t = chis_segment_estimate(df, 'diabetes', ['racedf_p1', 'ur_ihs'])
t.unstack('measure').head()


Out[207]:
value
measure diabetes_95_h diabetes_95_l diabetes_count diabetes_pct diabetes_pct_h diabetes_pct_l diabetes_rse
ur_ihs racedf_p1 diabetes
RURAL LATINO YES 5.070159e+05 3.374837e+05 4.222498e+05 13.3 15.9 10.6 10.2
NO 2.985110e+06 2.530615e+06 2.757862e+06 86.7 93.9 79.6 4.2
NON-LATINO AFR. AMER. YES 8.132200e+04 1.811403e+04 4.971802e+04 16.5 26.9 6.0 32.4
NO 4.037548e+05 1.010851e+05 2.524199e+05 83.5 133.6 33.5 30.6
NON-LATINO AMERICAN INDIAN/ALASKAN NATIVE YES 1.129030e+04 -6.225377e+02 5.333882e+03 7.9 16.8 -0.9 57.0
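
The two pandas idioms at work here, prepending an index level with pd.concat(..., keys=..., names=...) and rotating a level into columns with unstack(), can be tried on a tiny series. The numbers below are made up for illustration and are not CHIS estimates.

```python
import pandas as pd

resp = pd.Index(['YES', 'NO'], name='diabetes')
s_rural = pd.Series([10.0, 90.0], index=resp)  # made-up percentages
s_urban = pd.Series([8.0, 92.0], index=resp)   # made-up percentages

# Wrapping a series in a one-element list and passing keys/names
# prepends a new outer index level that holds the segment value.
rural = pd.concat([s_rural], keys=['RURAL'], names=['ur_ihs'])
urban = pd.concat([s_urban], keys=['URBAN'], names=['ur_ihs'])

# Stack the segments, then move the segment level out of the rows
# and into the columns.
stacked = pd.concat([rural, urban])
wide = stacked.unstack('ur_ihs')

print(stacked)
print(wide)
```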

Complex selections can be made with .loc.


In [213]:
t = chis_segment_estimate(df, 'diabetes', ['racedf_p1', 'ur_ihs'])

idx = pd.IndexSlice # Convenience redefinition. 

# The IndexSlices should have one term (separated by ',') for each of the levels in the index.
# We have one `IndexSlice` for rows, and one for columns. Note that the `row_indexer` has 4 terms.
row_indexer = idx[:,:,('diabetes_pct','diabetes_rse'),'YES']
col_indexer = idx[:]

# Now we can select with the two indexers. 
t = t.loc[row_indexer,col_indexer]

# Rotate the measures out of rows into columns
t = t.unstack('measure')

# The columns are multi-level, but there is only one value for the first level, 
# so it is useless. 
t.columns = t.columns.droplevel()

# Only use estimates with RSE < 30%
t = t[t.diabetes_rse < 30]

# We don't need the RSE column any more.
t = t.drop(columns='diabetes_rse')

# Move the Rural/Urban into columns
t = t.unstack(0)

t


Out[213]:
measure                          diabetes_pct
ur_ihs                                  RURAL  URBAN
racedf_p1              diabetes
LATINO                 YES               13.3   12.3
NON-LATINO WHITE       YES               10.4    8.2
NON-LATINO AFR. AMER.  YES                NaN   16.4
NON-LATINO ASIAN       YES                NaN    8.8

In [202]:
x = chis_segment_estimate(df, 'diabetes', ['racedf_p1', 'am3'])
row_indexer = idx[('YES','NO'),:,('diabetes_pct','diabetes_rse'),'YES']
col_indexer = idx[:]

t = x.loc[row_indexer,col_indexer].unstack('measure')
t.columns = t.columns.droplevel()
t = t[t.diabetes_rse < 30].drop(columns='diabetes_rse')
t


Out[202]:
measure                         diabetes_pct
am3  racedf_p1         diabetes
NO   LATINO            YES              16.7
     NON-LATINO WHITE  YES              14.5
YES  LATINO            YES              11.7

In [214]:
x = chis_segment_estimate(df, 'diabetes', ['racedf_p1', 'am3'])
row_indexer = idx[:,:,('diabetes_pct','diabetes_rse'),'YES']
col_indexer = idx[:]

t = x.loc[row_indexer,col_indexer].unstack('measure')
#t.index = t.index.droplevel('diabetes')
t.columns = t.columns.droplevel()
t = t[t.diabetes_rse < 30].drop(columns='diabetes_rse')
t.unstack(0)


Out[214]:
measure                           diabetes_pct
am3                               INAPPLICABLE    NO   YES
racedf_p1               diabetes
LATINO                  YES                9.0  16.7  11.7
NON-LATINO AFR. AMER.   YES               15.0   NaN   NaN
NON-LATINO ASIAN        YES                7.3   NaN   NaN
NON-LATINO WHITE        YES                7.8  14.5   NaN
NON-LATINO, TWO+ RACES  YES                6.3   NaN   NaN

In [186]:
chis_segment_estimate(df, 'diabetes',  'am3')


Out[186]:
value
am3 measure diabetes
INAPPLICABLE diabetes_count YES 1.649592e+06
NO 1.792121e+07
diabetes_95_l YES 1.428566e+06
NO 1.718414e+07
diabetes_95_h YES 1.870619e+06
NO 1.865827e+07
diabetes_pct YES 8.400000e+00
NO 9.160000e+01
diabetes_pct_l YES 7.300000e+00
NO 8.780000e+01
diabetes_pct_h YES 9.600000e+00
NO 9.530000e+01
diabetes_rse YES 6.800000e+00
NO 2.100000e+00
NO diabetes_count YES 1.204269e+06
NO 6.559025e+06
diabetes_95_l YES 1.022911e+06
NO 5.904377e+06
diabetes_95_h YES 1.385627e+06
NO 7.213672e+06
diabetes_pct YES 1.550000e+01
NO 8.450000e+01
diabetes_pct_l YES 1.320000e+01
NO 7.610000e+01
diabetes_pct_h YES 1.780000e+01
NO 9.290000e+01
diabetes_rse YES 7.700000e+00
NO 5.100000e+00
YES diabetes_count YES 2.908910e+05
NO 1.830713e+06
diabetes_95_l YES 2.151534e+05
NO 1.448414e+06
diabetes_95_h YES 3.666286e+05
NO 2.213011e+06
diabetes_pct YES 1.370000e+01
NO 8.630000e+01
diabetes_pct_l YES 1.010000e+01
NO 6.830000e+01
diabetes_pct_h YES 1.730000e+01
NO 1.043000e+02
diabetes_rse YES 1.330000e+01
NO 1.070000e+01

In [ ]: