The California Health Interview Survey (CHIS) is a large-scale, long-running health survey that provides detailed, respondent-level data about health behaviors, outcomes, and demographics. The dataset includes 80 replicate weights for calculating population estimates and variances. This notebook demonstrates how to use these weights, comparing the results to AskCHIS, a data access website that aggregates CHIS responses.
This notebook uses a Metapack package to access the CHIS datasets, which bypasses the dataset's terms and restrictions. These terms are also reproduced in the data package documentation, shown below. You must accept these terms and restrictions before using this dataset.
We'll start by loading the common imports.
In [1]:
import seaborn as sns
import metapack as mp
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display
from publicdata.chis import *
%matplotlib inline
sns.set_context('notebook')
In [2]:
# Opening a source package presumes you are working with the notebook in the source package,
# https://github.com/sandiegodata-projects/chis.git
pkg = mp.jupyter.open_source_package()
pkg
Out[2]:
In [3]:
df = pkg.reference('adult_2017').dataframe()
First, we'll replicate the results for the question "Ever diagnosed with diabetes" for all of California, for 2017, from AskCHIS.
Total population: 29,456,000
Getting estimates is easy. Each value of rakedw0
is the number of people that the associated respondent represents in the total California population. So, all of the values of rakedw0
sum to the controlled California adult population, and grouping the dataset by the responses to a variable and summing the values of rakedw0
within each response gives the estimated number of people who would have given that response.
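Equivalently (a minimal sketch using a plain pandas groupby rather than the pivot_table used in the next cell), the same estimates can be computed as:
In [ ]:
# Same estimates: sum rakedw0 within each response category of the diabetes question
df.groupby('diabetes')['rakedw0'].sum().round(-3)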
In [4]:
t = df.pivot_table(values='rakedw0', columns='diabetes', index=df.index)
t2 = t.sum().round(-3)
t2
Out[4]:
Summing across responses yields the total population, which we can use to calculate percentages.
In [5]:
t2.sum()
Out[5]:
In [6]:
(t2/t2.sum()*100).round(1)
Out[6]:
In [7]:
t = df[['diabetes','rakedw0']].set_index('diabetes',append=True).unstack()
t2 = t.sum().round(-3)
diabetes_yes = t2.unstack().loc['rakedw0','YES']
diabetes_no = t2.unstack().loc['rakedw0','NO']
diabetes_yes, diabetes_no
Out[7]:
The basic formula for calculating the variance is in Section 9.2, Methods for Variance Estimation, of CHIS Report 5, Weighting and Variance Estimation. Basically, the other 80 raked weights, rakedw1
through rakedw80,
give alternate estimates. It's like running the survey an additional 80 times, which allows you to calculate the sample variance from the set of alternate estimates.
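Concretely, writing $\hat\theta_0$ for the estimate computed from rakedw0 and $\hat\theta_r$ for the estimate computed from replicate weight $r$, the next cell computes

$$ SE(\hat\theta) = \sqrt{\sum_{r=1}^{80} \left(\hat\theta_r - \hat\theta_0\right)^2}, \qquad CI_{95} = \hat\theta_0 \pm 1.96\, SE(\hat\theta) $$

(This is a restatement of the computation below; see the CHIS report for the formal derivation.)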
In the code below, we'll expand the operation with temporary variables, to document each step.
In [8]:
weight_cols = [c for c in df.columns if 'raked' in c]
t = df[['diabetes']+weight_cols] # Get the column of interest, and all of the raked weights
t = t.set_index('diabetes',append=True) # Move the column of interest into the index
t = t.unstack() # Unstack the column of interest, so both values are now in multi-level columns
t = t.sum() # Sum all of the weights for each raked weight set and "YES"/"NO"
t = t.unstack() # Now we have sums for each of the replicates, for each of the variable values.
t = t.sub(t.loc['rakedw0']).iloc[1:] # Subtract the rakedw0 point estimate from each of the replicates
t = (t**2).sum() # Sum of squares
ci_95 = np.sqrt(t)*1.96 # sqrt to get the standard error, and 1.96 to get the 95% CI
The final percentage ranges match those from AskCHIS.
In [9]:
((diabetes_yes-ci_95.loc['YES'])/29_456_000*100).round(1), ((diabetes_yes+ci_95.loc['YES'])/29_456_000*100).round(1)
Out[9]:
In [10]:
((diabetes_no-ci_95.loc['NO'])/29_456_000*100).round(1), ((diabetes_no+ci_95.loc['NO'])/29_456_000*100).round(1)
Out[10]:
Here is a function for calculating the estimate, percentages, standard error, and relative standard error from a dataset. This function also works with a subset of the dataset, but note that the percentages will be relative to the total of the input dataset, not the whole California population.
In [11]:
def chis_estimate(df, column, ci=True, pct=True, rse=False):
    """Calculate estimates for CHIS variables, with variances, as 95% CI, from the replicate weights"""

    weight_cols = [c for c in df.columns if 'raked' in c]

    t = df[[column] + weight_cols]        # Get the column of interest, and all of the raked weights
    t = t.set_index(column, append=True)  # Move the column of interest into the index
    t = t.unstack()                       # Unstack the column of interest, so both values are now in multi-level columns
    t = t.sum()                           # Sum all of the weights for each raked weight set and "YES"/"NO"
    t = t.unstack()                       # Now we have sums for each of the replicates, for each of the variable values.

    est = t.iloc[0].to_frame()            # Replicate weight 0 is the estimate
    est.columns = [column]

    total = est.sum()[column]

    t = t.sub(t.loc['rakedw0']).iloc[1:]  # Subtract the rakedw0 point estimate from each of the replicates
    t = (t**2).sum()                      # Sum of squares
    se = np.sqrt(t)                       # sqrt to get the standard error,
    ci_95 = se * 1.96                     # and 1.96 to get the 95% CI

    if ci:
        est[column + '_95_l'] = est[column] - ci_95
        est[column + '_95_h'] = est[column] + ci_95
    else:
        est[column + '_se'] = se

    if pct:
        est[column + '_pct'] = (est[column] / total * 100).round(1)
        if ci:
            est[column + '_pct_l'] = (est[column + '_95_l'] / total * 100).round(1)
            est[column + '_pct_h'] = (est[column + '_95_h'] / total * 100).round(1)

    if rse:
        est[column + '_rse'] = (se / est[column] * 100).round(1)

    est.rename(columns={column: column + '_count'}, inplace=True)

    return est

chis_estimate(df, 'diabetes', ci=False, pct=False)
Out[11]:
In [12]:
# This validates against the whole population for 2017, from the AskCHIS web application
chis_estimate(df, 'ag1')
Out[12]:
In [13]:
# This validates against the Latino subset for 2017, from the AskCHIS web application
chis_estimate(df[df.racedf_p1=='LATINO'], 'ag1')
Out[13]:
This function allows segmenting on another column, for instance, breaking out responses by race. Note that in the examples below we keep only estimates whose relative standard error (such as diabetes_rse
) is less than 30%. CHIS uses 30% as the limit for unstable values, and won't publish estimates with higher RSEs.
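As a minimal illustration of that rule, using the chis_estimate() function defined above, unstable rows can be dropped like this:
In [ ]:
# Compute estimates with the relative standard error, then keep only stable rows (RSE < 30%)
est = chis_estimate(df, 'diabetes', ci=False, pct=True, rse=True)
est[est['diabetes_rse'] < 30]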
In [ ]:
def chis_segment_estimate(df, column, segment_columns):
    """Return aggregated CHIS data, segmented on one or more other variables."""

    if not isinstance(segment_columns, (list, tuple)):
        segment_columns = [segment_columns]

    odf = None

    for index, row in df[segment_columns].drop_duplicates().iterrows():

        query = ' and '.join(["{} == '{}'".format(c, v) for c, v in zip(segment_columns, list(row))])

        x = chis_estimate(df.query(query), column, ci=True, pct=True, rse=True)
        x.columns.names = ['measure']
        x = x.unstack()

        for col, val in zip(segment_columns, list(row)):
            x = pd.concat([x], keys=[val], names=[col])

        if odf is None:
            odf = x
        else:
            odf = pd.concat([odf, x])

    odf = odf.to_frame()
    odf.columns = ['value']

    return odf
The dataframe returned by this function has a multi-level index, which includes all of the unique values from the segmentation columns, a level for the measures, and the values from the target column. For instance:
In [204]:
chis_segment_estimate(df, 'diabetes', ['racedf_p1', 'ur_ihs']).head(20)
Out[204]:
You can "pivot" a level out of the row into the columns with unstack()
. Here we move the measures out of the row index into columns.
In [207]:
t = chis_segment_estimate(df, 'diabetes', ['racedf_p1', 'ur_ihs'])
t.unstack('measure').head()
Out[207]:
Complex selections can be made with .loc.
In [213]:
t = chis_segment_estimate(df, 'diabetes', ['racedf_p1', 'ur_ihs'])
idx = pd.IndexSlice # Convenience redefinition.
# The IndexSlices should have one term (separated by ',') for each of the levels in the index.
# We have one `IndexSlice` for rows, and one for columns. Note that the ``row_indexer`` has 4 terms.
row_indexer = idx[:,:,('diabetes_pct','diabetes_rse'),'YES']
col_indexer = idx[:]
# Now we can select with the two indexers.
t = t.loc[row_indexer,col_indexer]
# Rotate the measures out of rows into columns
t = t.unstack('measure')
# The columns are multi-level, but there is only one value for the first level,
# so it is useless.
t.columns = t.columns.droplevel()
# Only use estimates with RSE < 30%
t = t[t.diabetes_rse < 30]
# We don't need the RSE column any more.
t = t.drop(columns='diabetes_rse')
# Move the Rural/Urban into columns
t = t.unstack(0)
t
Out[213]:
In [202]:
x = chis_segment_estimate(df, 'diabetes', ['racedf_p1', 'am3'])
row_indexer = idx[('YES','NO'),:,('diabetes_pct','diabetes_rse'),'YES']
col_indexer = idx[:]
t = x.loc[row_indexer,col_indexer].unstack('measure')
t.columns = t.columns.droplevel()
t = t[t.diabetes_rse < 30].drop(columns='diabetes_rse')
t
Out[202]:
In [214]:
x = chis_segment_estimate(df, 'diabetes', ['racedf_p1', 'am3'])
row_indexer = idx[:,:,('diabetes_pct','diabetes_rse'),'YES']
col_indexer = idx[:]
t = x.loc[row_indexer,col_indexer].unstack('measure')
#t.index = t.index.droplevel('diabetes')
t.columns = t.columns.droplevel()
t = t[t.diabetes_rse < 30].drop(columns='diabetes_rse')
t.unstack(0)
Out[214]:
In [186]:
chis_segment_estimate(df, 'diabetes', 'am3')
Out[186]: