Chi-Square Test of Independence

The chi-square test of independence tests whether two categorical variables are independent.

Recall that we can summarize two categorical variables in a two-way table, also called an r × c contingency table, where r is the number of rows and c is the number of columns. Our question of interest is "Are the two variables independent?" This question is set up using the following hypothesis statements:

Null Hypothesis: The two categorical variables are independent.

Alternative Hypothesis: The two categorical variables are dependent.

The chi-square test statistic is calculated by using the formula:

$$\chi^2 = \sum \dfrac {(O-E)^2} {E}$$

where $O$ is the observed frequency in a cell and the sum runs over all cells of the table. $E$ is the expected frequency of that cell under the null hypothesis, computed as:

$$E = \dfrac{\text{row total} \times \text{column total}}{\text{sample size}}$$
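The two formulas above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the 2 × 2 table is made up (it is not the survey data used later), and the helper names `expected_counts` and `chi_square_statistic` are hypothetical.

```python
import numpy as np

def expected_counts(observed):
    """Expected cell counts under independence: row total x column total / n."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1)      # one total per row
    column_totals = observed.sum(axis=0)   # one total per column
    return np.outer(row_totals, column_totals) / observed.sum()

def chi_square_statistic(observed):
    """Sum of (O - E)^2 / E over every cell of the table."""
    observed = np.asarray(observed, dtype=float)
    expected = expected_counts(observed)
    return ((observed - expected) ** 2 / expected).sum()

# Hypothetical 2 x 2 table of observed counts.
toy_table = [[10, 20],
             [30, 40]]

toy_expected = expected_counts(toy_table)
toy_statistic = chi_square_statistic(toy_table)
print(toy_expected)
print(toy_statistic)
```

For this made-up table the row totals are 30 and 70, the column totals are 40 and 60, and the sample size is 100, so the expected count of the top-left cell is 30 × 40 / 100 = 12.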

We calculate the P-value of the chi-square test statistic from a $\chi^2$ distribution with $(r - 1)(c - 1)$ degrees of freedom. We fail to reject the null hypothesis if the P-value is greater than the specified significance level; otherwise we reject it.
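The decision rule can be sketched with `scipy.stats.chi2`. The function name `chi_square_decision` and the input values (a statistic of 10.0 from a 2 × 3 table) are assumptions chosen for illustration only.

```python
from scipy.stats import chi2

def chi_square_decision(statistic, n_rows, n_cols, alpha=0.05):
    """Return (dof, p_value, critical_value, reject_null) for a chi-square statistic."""
    dof = (n_rows - 1) * (n_cols - 1)
    # Upper-tail probability of the chi-square distribution at the statistic.
    p_value = chi2.sf(statistic, dof)
    # Value of the statistic at which the P-value equals alpha.
    critical_value = chi2.ppf(1 - alpha, dof)
    return dof, p_value, critical_value, p_value < alpha

# Hypothetical statistic of 10.0 from a 2 x 3 table.
toy_dof, toy_p_value, toy_critical, toy_reject = chi_square_decision(10.0, 2, 3)
print(toy_dof, toy_p_value, toy_critical, toy_reject)
```

Comparing the P-value with the significance level and comparing the statistic with the critical value are two views of the same rule, so either can be used to make the decision.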

Example

Is gender independent of education level? A random sample of 395 people was surveyed, and each person was asked to report the highest education level they had obtained. The resulting data are summarized in the following table:



In [15]:

    
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Render plots inline in the notebook and use the ggplot style.
%matplotlib inline
plt.style.use('ggplot')



In [16]:

    
gender_data = pd.DataFrame(data=[[60, 54, 46, 41], 
                                 [40, 44, 53, 57]],
                          index=['female', 'male'],
                          columns=['High School', 'Bachelors', 'Masters', 'Ph.d.'])



In [17]:

    
gender_data









    Out[17]:

        High School  Bachelors  Masters  Ph.d.
female           60         54       46     41
male             40         44       53     57

Visualizing the Data



In [18]:

    
gender_long = pd.melt(gender_data.reset_index(), 
                      id_vars='index', 
                      var_name='education', 
                      value_name='n_samples').rename(columns={'index': 'gender'})



In [19]:

    
gender_long









    Out[19]:

   gender    education  n_samples
0  female  High School         60
1    male  High School         40
2  female    Bachelors         54
3    male    Bachelors         44
4  female      Masters         46
5    male      Masters         53
6  female        Ph.d.         41
7    male        Ph.d.         57



In [20]:

    
# factorplot was renamed to catplot in seaborn 0.9 and later removed.
g = sns.catplot(kind='bar', x='education', y='n_samples', hue="gender", data=gender_long)

Calculate Totals



In [21]:

    
row_totals = gender_data.sum(axis=1)
row_totals









    Out[21]:





female    201
male      194
dtype: int64



In [22]:

    
column_totals = gender_data.sum(axis=0)
column_totals.to_frame().T









    Out[22]:

   High School  Bachelors  Masters  Ph.d.
0          100         98       99     98

Calculate Expected Frequencies



In [23]:

    
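# Expected count for each cell: (row total x column total) / sample size.
# Repeat the row totals under every column label, scale each column by its
# column total (aligned on column labels), then divide by the grand total.
expected_frequencies = pd.concat(
    {c: row_totals for c in gender_data.columns},
    axis=1).mul(column_totals).div(row_totals.sum())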



In [24]:

    
expected_frequencies









    Out[24]:

        Bachelors  High School    Masters      Ph.d.
female  49.868354    50.886076  50.377215  49.868354
male    48.131646    49.113924  48.622785  48.131646
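As a check on one cell, the expected count for the female / High School cell follows directly from the totals computed earlier (201 for female, 100 for High School, 395 overall):

$$E_{\text{female, High School}} = \dfrac{201 \times 100}{395} \approx 50.886$$

This agrees with the corresponding entry in the expected frequencies.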

Calculate Chi-Square Test Statistic



In [25]:

    
# Chi-square statistic: sum of (O - E)^2 / E over all cells.
t_stat = gender_data.sub(
    expected_frequencies).pow(2.0).div(
    expected_frequencies).sum(axis=1).sum()



In [26]:

    
t_stat









    Out[26]:





8.006066246262538



In [27]:

    
df = (gender_data.shape[0]-1) * (gender_data.shape[1]-1)
df









    Out[27]:





3



In [28]:

    
from scipy.stats import chi2
chi2_dist = chi2(df=df)
# P-value = upper-tail probability P(X >= t_stat), via the survival function.
p_val = chi2_dist.sf(t_stat)

sig_level = 0.05
print('P-val of chi^2 test at significance level {:.2f} is {:.4f}'.format(sig_level, p_val))









    



P-val of chi^2 test at significance level 0.05 is 0.0459

Since the P-value (0.0459) is less than the significance level (0.05), we reject the null hypothesis and conclude that the two categorical variables are dependent: at the 5% significance level, gender and education level are associated.
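The manual steps can be cross-checked against SciPy's built-in `scipy.stats.chi2_contingency`. This sketch is an optional check rather than part of the original calculation; the variable names (`observed_counts`, `check_stat`, and so on) are chosen here for illustration, and the table is re-entered so that the block runs on its own.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Same observed counts as the gender / education table in this example.
observed_counts = pd.DataFrame([[60, 54, 46, 41],
                                [40, 44, 53, 57]],
                               index=['female', 'male'],
                               columns=['High School', 'Bachelors', 'Masters', 'Ph.d.'])

# correction=False turns off the Yates continuity correction, which SciPy
# would otherwise apply only to tables with one degree of freedom.
check_stat, check_p, check_dof, check_expected = chi2_contingency(
    observed_counts, correction=False)

print('chi^2 = {:.4f}, dof = {}, P-value = {:.4f}'.format(
    check_stat, check_dof, check_p))
```

The statistic, degrees of freedom, and P-value should agree with the values computed step by step above, and `check_expected` holds the same expected frequencies as a NumPy array in the original column order.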
