In [1]:
import numpy as np
import pandas as pd
from scipy import stats
Suppose you are studying the number of flight cancellations at a small regional airport and observe the following pattern:
January 23
February 12
March 10
April 10
May 13
June 7
July 12
August 11
September 22
October 6
November 6
December 14
These were the actual 2001 flight cancellations at the Willard airport (CMI), which you should be able to check easily from 2001.csv. It seems like there are a lot of cancellations in January and September, and fewer than average cancellations in June, October, and December.
In this problem, we will compute the chi-squared statistic to test the hypothesis that the number of flight cancellations depends on the month of the year. In other words, we will accept or reject the null hypothesis (the hypothesis that the fluctuations in flight cancellations is entirely due to random fluctuations) using the chi-squared test. The chi-squared statistic is defined as
\begin{equation} \chi^2 = \sum_i \frac{\left(O_i - E_i\right)^2}{E_i}. \end{equation}We will also compute the corresponding p-value of this distribution.
For example, if the p-value is, say, 50%, then it means that there's a 50% chance and that the apparent pattern is due to chance. On the other hand, if the p-value is something small like 0.1%, then this pattern happens only one time in 1000 when the number of cancellations does not depend on the month of the year.
In this problem, we will compute the $\chi^2$ statistic defined as
and the corresponding $p$-value of this distribution.
I summarized the number of flight cancellations in the following array:
In [2]:
cancelled = np.array([23, 12, 10, 10, 13, 7, 12, 11, 22, 6, 6, 14])
I will break down each step in separate functions to make sure you know
how to calculate $\chi^2$. But note that you would normally combine
all the steps in a single function.
Also note that there's a SciPy function
that does this for you:
scipy.stats.chisquare()
.
You may not use this function to write your code, but you can use it to check your answer.
The $\chi^2$ test is covered in Section 7.7 of Think Stats.
First, we define the cells to be the twelve months of the year.
Second, write a function named get_expected()
that takes a Numpy array
and returns a Numpy array with the number of flight cancellations
expected under the null hypothesis.
Under the null hypothesis, we expect a uniform distribution,
i.e. the same number of flight cancellations in each month.
However, we also want our function to handle different types of data,
so do not simply add up all the numbers and divide by 12.
Instead, you should use the length of the array.
For example, if the input array is np.array([1.0, 2.0, 3.0, 4.0, 5.0]
, the get_expected()
should return np.array([3.0, 3.0, 3.0, 3.0, 3.0])
.
In [ ]:
def get_expected(x):
'''
Takes a Numpy array and returns a Numpy array of the same length
that represents the expected values (uniform distribution).
Parameters
----------
x: A Numpy array.
Returns
-------
A Numpy array of the same length as x.
Example
-------
>>> import numpy as np
>>> x = get_expected([1., 2.])
>>> x = get_expected(np.array([2., 1., 3., 4.]))
>>> x
array([ 2.5, 2.5, 2.5, 2.5])
'''
# your code goes here
return expected
Next, write a function named get_diff()
that takes two Numpy arrays,
an array of the observed values and an array of the expected values.
For each cell, the funtion calculates the difference between the observed value $O_i$ and the expected value $E_i$.
Thus, it should return a Numpy array of $O_i$ - $E_i$.
In [ ]:
def get_diff(observed, expected):
'''
Takes two Numpy arrays, the observed values and the expected values,
and returns a Numpy array of the differences.
Parameters
----------
observed: A Numpy array.
expected: A Numpy array.
Returns
-------
A Numpy array.
Example
-------
>>> import numpy as np
>>> x = np.array([2., 4., 6.])
>>> y = np.array([3., 6., 9.])
>>> z = get_diff(x, y)
>>> z
array([-1., -2., -3.])
'''
# your code goes here
return diff
In [ ]:
def get_chi2(diff, expected):
'''
Takes O - E and computes chi-squared.
Parameters
----------
diff: A Numpy array. O - E
expected: A numpy array. E
Returns
-------
A float. The chi-squared value.
Example
-------
>>> import numpy as np
>>> x = np.array([1., 2., 3.])
>>> y = np.array([1., 2., 3.])
>>> z = get_chi2(x, y)
>>> z
6.0
'''
# your code goes here
return chi2
Finally, write a function named get_p_value()
that takes two arguments,
the $\chi^2$ value (a float) and a degree of freedom (an int).
The $p$-value can easily be calculated by using
scipy.stats.chi2.cdf().
The $p$-value is given simply by 1 minus the CDF of the $\chi^2$ distribution
(See Chi-squared distribution).
When using scipy.stats.chi2.cdf()
, you should specify the df
option,
which corresponds to the degrees of freedom.
The degrees of freedom
is the number of cells minus 1, which in our case is 11
(the twelve months of the year minus 1).
In [ ]:
def get_p_value(chi2, dof):
'''
Takes chi-squared and degrees of freedom
and computes the p-value.
Parameters
----------
chi2: A float.
dof: An int. Degrees of freedom.
Returns
-------
>>> import numpy as np
>>> x = 1.0
>>> y = 2
>>> z = get_p_value(x, y)
0.60653065971263342
'''
# your code goes here
return p
When you run the following cell, you should get
The chi square value is 27.3 and the p-value is 4.2e-03.
That means there's a 0.4% chance that the observed pattern is due to chance. I think we can safely reject the null hypothesis.
In [ ]:
expected = get_expected(cancelled)
diff = get_diff(cancelled, expected)
chi2 = get_chi2(diff, expected)
p = get_p_value(chi2, len(expected) - 1)
print('The chi square value is {0:.1f} and the p-value is {1:.1e}.'.format(chi2, p))
In [ ]: