A standard approach for testing the independence/dependence of a pair of categorical variables is to use a $\chi^2$ (Chi-square) test of independence.
The null and alternative hypotheses for the $\chi^2$ test are as follows:

* $H_0$: the two categorical variables are independent
* $H_A$: the two categorical variables are not independent (i.e. they are associated)
We typically depict the relationship between categorical variables using a "contingency table".
| | B1 | B2 | Total |
|---|---|---|---|
| A1 | $O_{11}$ | $O_{12}$ | $O_{11}+O_{12}$ |
| A2 | $O_{21}$ | $O_{22}$ | $O_{21}+O_{22}$ |
| Total | $O_{11}+O_{21}$ | $O_{12}+O_{22}$ | $O_{11}+O_{12}+O_{21}+O_{22}$ |
The rows and columns indicate the different categories for variables A and B respectively, and the cells, $O_{ij}$, give the counts of the number of observations for the corresponding combination of A and B. For example, the cell $O_{11}$ gives the number of observations that belong to both category A1 and category B1, while $O_{12}$ gives the number that are both A1 and B2, etc.
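To make the table concrete, here is a minimal toy sketch (the labels and values below are made up purely for illustration, not taken from the data analyzed later) showing how `pandas.crosstab` produces the $O_{ij}$ counts from two categorical variables:

In [ ]:
import pandas as pd

# two made-up categorical variables (hypothetical labels, for illustration only)
A = ["A1", "A1", "A2", "A1", "A2", "A2", "A1"]
B = ["B1", "B2", "B1", "B1", "B2", "B2", "B2"]

# crosstab counts how many observations fall in each (A, B) combination;
# margins=True appends the row, column, and overall totals
toy_table = pd.crosstab(pd.Series(A, name="A"), pd.Series(B, name="B"), margins=True)
print(toy_table)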
In [1]:
%matplotlib inline
import numpy as np
import scipy.stats as stats
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sbn
In [9]:
# load the bumpus data set
dataurl = "https://github.com/Bio204-class/bio204-datasets/raw/master/bumpus-data.txt"
bumpus = pd.read_table(dataurl)

# for clarity of presentation, replace the sex designations "m", "f" with "male", "female"
bumpus.sex.replace(dict(m="male", f="female"), inplace=True)

# similarly, replace the survived values "F", "T" with False and True
bumpus.survived.replace(dict(F=False, T=True), inplace=True)
In [22]:
# Construct a contingency table for the sex and survival categorical variables.
# The pandas.crosstab function creates the contingency table:
# if margins=True, it includes the row, column, and overall sums in the table;
# if margins=False, it includes only the joint counts
observed = pd.crosstab(bumpus.survived, bumpus.sex, margins=True)
print("Table of observed counts")
observed
Out[22]:
In [12]:
# Under the null hypothesis of independence, P(A_i and B_j) = P(A_i) * P(B_j),
# so the expected count for each cell is E_ij = nobs * P(A_i) * P(B_j)
nobs = observed.loc['All', 'All']
prob_female = observed.loc['All', 'female'] / nobs
prob_male = observed.loc['All', 'male'] / nobs
prob_surv = observed.loc[True, 'All'] / nobs
prob_died = observed.loc[False, 'All'] / nobs

expected_counts = []
for i in (prob_died, prob_surv):
    row = []
    for j in (prob_female, prob_male):
        row.append(i * j * nobs)
    expected_counts.append(row + [np.sum(row)])

# append the column sums as the final "All" row
expected_counts.append(np.sum(expected_counts, axis=0).tolist())

expected = pd.DataFrame(expected_counts, index=observed.index, columns=observed.columns)
print("Table of Expected Counts")
expected
Out[12]:
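As a quick cross-check (a sketch, not part of the original walkthrough): since $E_{ij} = n\,P(A_i)\,P(B_j) = (\text{row}_i\text{ total} \times \text{col}_j\text{ total})/n$, the table of expected counts is just the outer product of the marginal totals divided by $n$:

In [ ]:
# expected counts as an outer product of the margins (cross-check only)
row_totals = observed['All'].drop('All').to_numpy()      # totals for died/survived
col_totals = observed.loc['All'].drop('All').to_numpy()  # totals for female/male
print(np.outer(row_totals, col_totals) / nobs)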
In [18]:
Observed = observed.iloc[:-1, :-1]  # take all but the last ("All") row and column
Expected = expected.iloc[:-1, :-1]  # as above
In [51]:
# This code cell just shows how you can do some fancy HTML output
# from Jupyter code cells. Here I wanted to output both the observed and expected
# count tables in the same cell with some nicely formatted headers.
# This is purely for presentation; we're not doing an analysis here.
from IPython.display import display, HTML

def fmt(x):
    return "{:04.2f}".format(x)

Obs_as_html = Observed.to_html(float_format=fmt)
Exp_as_html = Expected.to_html(float_format=fmt)

HTML("<h3>Observed Counts</h3>\n\n" + Obs_as_html + "\n\n" +
     "<h3>Expected Counts</h3>\n\n" + Exp_as_html)
Out[51]:
We can quantify the difference between the observed and expected counts on a per-cell basis using what is essentially a squared deviation (but for counts rather than values of a random variate), normalized by the expected count.
In [57]:
# per-cell contribution: squared difference normalized by the expected count
diff_squared = (Observed - Expected)**2 / Expected
diff_squared
Out[57]:
We can calculate a new test statistic, which we'll call $\chi^2$ (chi-squared), as the sum of these normalized squared differences over all the cells:

\begin{align}
\chi^2 &= \frac{(O_{11} - E_{11})^2}{E_{11}} + \frac{(O_{12} - E_{12})^2}{E_{12}} + \cdots + \frac{(O_{mn} - E_{mn})^2}{E_{mn}} \\ \\
&= \sum_{i=1}^{m}\sum_{j=1}^{n} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
\end{align}

where $m$ and $n$ are the number of categories of the two variables under consideration.
In [60]:
# convert to a NumPy array so we can calculate the sum across all cells
chi2 = np.sum(diff_squared.to_numpy())
print("Chi-squared", chi2)
We can think of the $\chi^2$-statistic as the sum of the normalized squared differences between observed and expected counts (where each cell is normalized by the expected counts).
Under the null hypothesis that the categorical variables are independent, the $\chi^2$-statistic has a sampling distribution called the $\chi^2$-distribution. The $\chi^2$-distribution takes a single parameter, the degrees of freedom (df), which for the $\chi^2$ test of independence is equal to $(m-1)(n-1)$, where $m$ and $n$ are the number of categories in the two variables.
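As a small sketch (using the `Observed` table defined above), the degrees of freedom can be computed directly from the shape of the table:

In [ ]:
# degrees of freedom for the test of independence: (m - 1)(n - 1)
m, n = Observed.shape  # m row categories (survived), n column categories (sex)
print("df =", (m - 1) * (n - 1))  # 1 for a 2x2 table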
In [72]:
x = np.linspace(0, 8, 300)

# plot the chi-squared density for several degrees of freedom
for df in range(1, 6):
    plt.plot(x, stats.chi2.pdf(x, df=df), label="df = {}".format(df))

plt.xlabel(r'$\chi^2$')
plt.ylabel('Density')
plt.ylim(0, 1)  # note this will cut off the curve for df = 1
plt.legend()
pass
For our bumpus example the degrees of freedom is df = 1. We can calculate the corresponding p-value for our observed $\chi^2$ value of ~3.13 using the survival function (the upper-tail probability) of the $\chi^2$-distribution.
In [75]:
# p-value: upper-tail probability of the chi-squared distribution with df = 1
stats.chi2.sf(chi2, df=1)
Out[75]:
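As a sanity check (a sketch using scipy's built-in routine, not part of the original walkthrough), `scipy.stats.chi2_contingency` carries out the entire test in one call. Note that by default it applies the Yates continuity correction for 2x2 tables, so we pass `correction=False` to match the hand calculation above:

In [ ]:
# chi2_contingency returns the test statistic, p-value, degrees of freedom,
# and the table of expected counts; correction=False disables the Yates
# continuity correction so the result matches our calculation above
chi2_stat, p_value, dof, expected_table = stats.chi2_contingency(Observed, correction=False)
print("chi2 =", chi2_stat)
print("p-value =", p_value)
print("df =", dof)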