A standard approach for testing the independence/dependence of a pair of categorical variables is to use a $\chi^2$ (Chi-square) test of independence.
The null and alternative hypotheses for the $\chi^2$ test are as follows:

* $H_0$: the two categorical variables are independent
* $H_A$: the two categorical variables are not independent (i.e. they are associated)
We typically depict the relationship between categorical variables using a "contingency table".
| | B1 | B2 | Total |
|---|---|---|---|
| A1 | $O_{11}$ | $O_{12}$ | $O_{11}+O_{12}$ |
| A2 | $O_{21}$ | $O_{22}$ | $O_{21}+O_{22}$ |
| Total | $O_{11}+O_{21}$ | $O_{12}+O_{22}$ | $O_{11}+O_{12}+O_{21}+O_{22}$ |
The rows and columns indicate the different categories for variables A and B respectively, and the cells, $O_{ij}$, give the counts of the number of observations for the corresponding combination of A and B. For example, the cell $O_{11}$ gives the number of observations that belong to both category A1 and category B1, while $O_{12}$ gives the number that are both A1 and B2, etc.
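To make the table concrete, here is a minimal toy sketch (the labels and values below are made up purely for illustration, not taken from the data analyzed later) showing how `pandas.crosstab` produces the $O_{ij}$ counts from two categorical variables:

In [ ]:
import pandas as pd

# two made-up categorical variables (hypothetical labels, for illustration only)
A = ["A1", "A1", "A2", "A1", "A2", "A2", "A1"]
B = ["B1", "B2", "B1", "B1", "B2", "B2", "B2"]

# crosstab counts how many observations fall in each (A, B) combination;
# margins=True appends the row, column, and overall totals
toy_table = pd.crosstab(pd.Series(A, name="A"), pd.Series(B, name="B"), margins=True)
print(toy_table)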
In [1]:
%matplotlib inline
import numpy as np
import scipy.stats as stats
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sbn
In [9]:
# load the bumpus data set
dataurl = "https://github.com/Bio204-class/bio204-datasets/raw/master/bumpus-data.txt"
bumpus = pd.read_table(dataurl)

# for clarity of presentation, replace the sex designations "m", "f" with "male", "female"
bumpus.sex.replace(dict(m="male", f="female"), inplace=True)

# similarly, replace the survived values "F", "T" with False and True
bumpus.survived.replace(dict(F=False, T=True), inplace=True)
In [22]:
# Construct a contingency table for the sex and survival categorical variables.
# The pandas.crosstab function creates the contingency table:
# if margins=True, it includes the row, column, and overall sums in the table;
# if margins=False, it includes only the joint counts
observed = pd.crosstab(bumpus.survived, bumpus.sex, margins=True)
print("Table of observed counts")
observed
Out[22]:
In [12]:
# Under the null hypothesis of independence, P(A_i and B_j) = P(A_i) * P(B_j),
# so the expected count for each cell is E_ij = nobs * P(A_i) * P(B_j)
nobs = observed.loc['All', 'All']
prob_female = observed.loc['All', 'female'] / nobs
prob_male = observed.loc['All', 'male'] / nobs
prob_surv = observed.loc[True, 'All'] / nobs
prob_died = observed.loc[False, 'All'] / nobs

expected_counts = []
for i in (prob_died, prob_surv):
    row = []
    for j in (prob_female, prob_male):
        row.append(i * j * nobs)
    expected_counts.append(row + [np.sum(row)])

# append the column sums as the final "All" row
expected_counts.append(np.sum(expected_counts, axis=0).tolist())

expected = pd.DataFrame(expected_counts, index=observed.index, columns=observed.columns)
print("Table of Expected Counts")
expected
Out[12]:
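As a quick cross-check (a sketch, not part of the original walkthrough): since $E_{ij} = n\,P(A_i)\,P(B_j) = (\text{row}_i\text{ total} \times \text{col}_j\text{ total})/n$, the table of expected counts is just the outer product of the marginal totals divided by $n$:

In [ ]:
# expected counts as an outer product of the margins (cross-check only)
row_totals = observed['All'].drop('All').to_numpy()      # totals for died/survived
col_totals = observed.loc['All'].drop('All').to_numpy()  # totals for female/male
print(np.outer(row_totals, col_totals) / nobs)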
In [18]:
Observed = observed.iloc[:-1, :-1]  # take all but the last ("All") row and column
Expected = expected.iloc[:-1, :-1]  # as above
In [51]:
# This code cell just shows how you can do some fancy HTML output
# from Jupyter code cells. Here I wanted to output both the observed and expected
# count tables in the same cell with some nicely formatted headers.
# This is purely for presentation; we're not doing an analysis here.
from IPython.display import display, HTML

def fmt(x):
    return "{:04.2f}".format(x)

Obs_as_html = Observed.to_html(float_format=fmt)
Exp_as_html = Expected.to_html(float_format=fmt)

HTML("<h3>Observed Counts</h3>\n\n" + Obs_as_html + "\n\n" +
     "<h3>Expected Counts</h3>\n\n" + Exp_as_html)
Out[51]:
We can quantify the difference between the observed and expected counts on a per-cell basis using what is essentially a squared deviation (but for counts rather than values of a random variate), normalized by the expected count.
In [57]:
# per-cell contribution: squared difference normalized by the expected count
diff_squared = (Observed - Expected)**2 / Expected
diff_squared
Out[57]:
We can calculate a new test statistic, which we'll call $\chi^2$ (chi-squared), as the sum of these normalized squared differences over all the cells:

\begin{align}
\chi^2 &= \frac{(O_{11} - E_{11})^2}{E_{11}} + \frac{(O_{12} - E_{12})^2}{E_{12}} + \cdots + \frac{(O_{mn} - E_{mn})^2}{E_{mn}} \\ \\
&= \sum_{i=1}^{m}\sum_{j=1}^{n} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
\end{align}

where $m$ and $n$ are the number of categories of the two variables under consideration.
In [60]:
# convert to a NumPy array so we can calculate the sum across all cells
chi2 = np.sum(diff_squared.to_numpy())
print("Chi-squared", chi2)
We can think of the $\chi^2$-statistic as the sum of the normalized squared differences between observed and expected counts (where each cell is normalized by the expected counts).
Under the null hypothesis that the categorical variables are independent, the $\chi^2$-statistic has a sampling distribution called the $\chi^2$-distribution. The $\chi^2$-distribution takes a single parameter, the degrees of freedom (df), which for the $\chi^2$ test of independence is equal to $(m-1)(n-1)$, where $m$ and $n$ are the number of categories in the two variables.
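As a small sketch (using the `Observed` table defined above), the degrees of freedom can be computed directly from the shape of the table:

In [ ]:
# degrees of freedom for the test of independence: (m - 1)(n - 1)
m, n = Observed.shape  # m row categories (survived), n column categories (sex)
print("df =", (m - 1) * (n - 1))  # 1 for a 2x2 table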
In [72]:
x = np.linspace(0, 8, 300)

# plot the chi-squared density for several degrees of freedom
for df in range(1, 6):
    plt.plot(x, stats.chi2.pdf(x, df=df), label="df = {}".format(df))

plt.xlabel(r'$\chi^2$')
plt.ylabel('Density')
plt.ylim(0, 1)  # note this will cut off the curve for df = 1
plt.legend()
pass
For our bumpus example the degrees of freedom is df = 1. We can calculate the corresponding p-value for our observed $\chi^2$ value of ~3.13 using the survival function (the upper-tail probability) of the $\chi^2$-distribution.
In [75]:
# p-value: upper-tail probability of the chi-squared distribution with df = 1
stats.chi2.sf(chi2, df=1)
Out[75]:
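As a sanity check (a sketch using scipy's built-in routine, not part of the original walkthrough), `scipy.stats.chi2_contingency` carries out the entire test in one call. Note that by default it applies the Yates continuity correction for 2x2 tables, so we pass `correction=False` to match the hand calculation above:

In [ ]:
# chi2_contingency returns the test statistic, p-value, degrees of freedom,
# and the table of expected counts; correction=False disables the Yates
# continuity correction so the result matches our calculation above
chi2_stat, p_value, dof, expected_table = stats.chi2_contingency(Observed, correction=False)
print("chi2 =", chi2_stat)
print("p-value =", p_value)
print("df =", dof)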