In [1]:
import numpy as np
import scipy.stats as stats
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sbn
sbn.set_style("white") # get rid of seaborn grid lines

In [2]:
%matplotlib inline
np.random.seed(20160410)

Sampling Distribution of Pearson Correlation

As with all the other statistics we've looked at over the course of the semester, the sample correlation coefficient may differ substantially from the underlying population correlation coefficient, depending on the vagaries of sampling.

First we'll generate a single sample, drawn from an underlying bivariate normal distribution with uncorrelated variables.


In [23]:
# generate bivariate normal data for uncorrelated variables
# See the docs on scipy.stats.multivariate_normal

# bivariate mean
mean = [0,0]  

# covariance matrix
cov = np.array([[1,0],
                [0,1]])

sample = stats.multivariate_normal.rvs(mean=mean, cov=cov, size=30)

In [24]:
sbn.jointplot(sample[:,0], sample[:,1])
print("Correlation matrix:\n", np.corrcoef(sample, rowvar=False, ddof=1))


Correlation matrix:
 [[ 1.          0.17131038]
 [ 0.17131038  1.        ]]

Simulating the sampling distribution of the correlation coefficient for uncorrelated variables

First we'll simulate the sampling distribution of $r_{xy}$ when the true population correlation $\rho_{xy} = 0$ and the sample size $n = 30$.


In [5]:
mean = [0,0]
cov = [[1,0],
       [0,1]]
ssize = 30
nsims = 2500

cors = []
for n in range(nsims):
    sample = stats.multivariate_normal.rvs(mean=mean, cov=cov, size=ssize)
    r = np.corrcoef(sample, rowvar=False, ddof=1)[0,1]
    cors.append(r)
sbn.distplot(cors)
pass


The above plot looks fairly symmetrical and approximately normal. Now let's look at the sampling distribution for $\rho_{xy} = 0.9$ and $n=30$.


In [6]:
mean = [0,0]
cov = [[1,0.9],
       [0.9,1]]
ssize = 30
nsims = 2500

cors090 = []
for n in range(nsims):
    sample = stats.multivariate_normal.rvs(mean=mean, cov=cov, size=ssize)
    r = np.corrcoef(sample, rowvar=False, ddof=1)[0,1]
    cors090.append(r)
    
sbn.distplot(cors090)
pass


We see that the sampling distribution of the correlation coefficient is strongly skewed for larger values of $\rho_{xy}$.

Fisher's transformation approximately normalizes the sampling distribution of the Pearson product-moment correlation, and is commonly used to calculate confidence intervals and carry out hypothesis tests involving correlations.

$$ F(r) = \frac{1}{2}\ln \frac{1+r}{1-r} = \mbox{arctanh}(r) $$

where $\mbox{arctanh}(r)$ is the inverse hyperbolic tangent of the correlation coefficient.

If the underlying true correlation is $\rho$, the sampling distribution of the transformed sample correlation $F(r)$ is approximately normal: $$ F(r) \sim N\left(\mu = \mbox{arctanh}(\rho),\ \sigma = \frac{1}{\sqrt{n-3}}\right) $$


In [7]:
# plot the sampling distribution when rho = 0.9,
# after applying Fisher's transformation
sbn.distplot(np.arctanh(cors090))
pass
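
As a quick sanity check (not part of the original notebook), we can compare the empirical mean and standard deviation of the transformed correlations to the theoretical values $\mbox{arctanh}(\rho)$ and $1/\sqrt{n-3}$:

In [ ]:
# a minimal sketch: empirical vs. theoretical moments of arctanh(r) for rho = 0.9
z = np.arctanh(cors090)
print("Empirical mean: {:.3f}, theoretical: {:.3f}".format(np.mean(z), np.arctanh(0.9)))
print("Empirical SD:   {:.3f}, theoretical: {:.3f}".format(np.std(z, ddof=1), 1/np.sqrt(ssize - 3)))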



Confidence intervals in the space of the transformed variable, with $SE = \frac{1}{\sqrt{n-3}}$, are:

$$ 100(1 - \alpha)\%\text{CI}: \operatorname{arctanh}(\rho) \in [\operatorname{arctanh}(r) \pm z_{\alpha/2}SE] $$

To put this back in terms of untransformed correlations:

$$ 100(1 - \alpha)\%\text{CI}: \rho \in [\operatorname{tanh}(\operatorname{arctanh}(r) - z_{\alpha/2}SE), \operatorname{tanh}(\operatorname{arctanh}(r) + z_{\alpha/2}SE)] $$

Let's write a Python function to calculate the confidence intervals for correlations:


In [8]:
def correlationCI(r, n, alpha=0.05):
    """Confidence interval for a correlation coefficient via Fisher's transformation."""
    mu = np.arctanh(r)
    sigma = 1.0/np.sqrt(n - 3)
    z = stats.norm.ppf(1 - alpha/2)  # positive critical value z_{alpha/2}
    left = np.tanh(mu - z*sigma)     # back-transform the whole lower bound
    right = np.tanh(mu + z*sigma)    # back-transform the whole upper bound
    return (left, right)

In [9]:
correlationCI(0, 30)


Out[9]:
(-0.3602692..., 0.3602692...)

In [10]:
ssizes = np.arange(10,250,step=10)
cis = []
for i in ssizes:
    # upper limit = CI half-width, since the CI is symmetric about rho = 0
    cis.append(correlationCI(0, i)[1])

plt.plot(ssizes, cis, '-o')
plt.xlabel("Sample size")
plt.ylabel("Half-width of CI")
plt.title(r"""Half-width of CIs for $\rho=0$
as as function of sample size""")


pass



In [26]:
stats.pearsonr(sample[:,0], sample[:,1])


Out[26]:
(0.1713103818610458, 0.36538468042995798)

In [27]:
?stats.pearsonr
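
stats.pearsonr returns the correlation coefficient along with a two-sided p-value for the null hypothesis of zero correlation. As an aside (not part of the original notebook), here is a sketch that attaches a Fisher-transform confidence interval to the same estimate, using the correlationCI function defined above:

In [ ]:
# a sketch: 95% CI for the sample correlation, via Fisher's transformation
r, pval = stats.pearsonr(sample[:,0], sample[:,1])
ci_lo, ci_hi = correlationCI(r, len(sample))
print("r = {:.3f}, p = {:.3f}, 95% CI = ({:.3f}, {:.3f})".format(r, pval, ci_lo, ci_hi))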

Rank Correlation

There are two popular "robust" estimators of correlation, based on a consideration of correlations of ranks. These are known as Spearman's Rho and Kendall's Tau.

Spearman's Rho

Spearman's rank correlation, or Spearman's Rho, for variables $X$ and $Y$ is simply the Pearson correlation of the ranks of $X$ and $Y$.

Let $R_X$ and $R_Y$ be the rank values of the variables $X$ and $Y$.

$$ \rho_S = \frac{\operatorname{cov}(R_X, R_Y)}{\sigma_{R_X}\sigma_{R_Y}} $$
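
As a quick illustration (not part of the original notebook), here is a sketch computing Spearman's Rho "by hand" as the Pearson correlation of ranks and checking it against scipy's spearmanr; the variables u and v are made up for the example:

In [ ]:
# a minimal sketch: Spearman's rho as the Pearson correlation of ranks
u = stats.norm.rvs(size=20)
v = u + stats.norm.rvs(size=20)
ru, rv = stats.rankdata(u), stats.rankdata(v)
print("Pearson r of ranks:", stats.pearsonr(ru, rv)[0])
print("stats.spearmanr:   ", stats.spearmanr(u, v)[0])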

Kendall's Tau

Kendall's rank correlation, or Kendall's Tau, is defined in terms of concordant and discordant pairs of observations, in the rank scale.

$$ \tau = \frac{(\text{number of concordant pairs}) - (\text{number of discordant pairs})}{n (n-1) /2} $$

The definition of concordant vs. discordant pairs considers all pairs of observations and asks whether they are in the same relative rank order for $X$ and $Y$ (if so, the pair is concordant) or in different relative rank order (if so, it is discordant).
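
To make the definition concrete (not part of the original notebook), here is a sketch that counts concordant and discordant pairs by brute force, using made-up variables u and v, and checks the result against scipy's kendalltau; with continuous data there are no ties, so the two should agree:

In [ ]:
# a minimal sketch: Kendall's tau by brute-force counting of pairs
import itertools

u = stats.norm.rvs(size=20)
v = u + stats.norm.rvs(size=20)

concordant = discordant = 0
for i, j in itertools.combinations(range(len(u)), 2):
    s = np.sign(u[i] - u[j]) * np.sign(v[i] - v[j])
    if s > 0:
        concordant += 1
    elif s < 0:
        discordant += 1

npts = len(u)
tau_by_hand = (concordant - discordant) / (npts * (npts - 1) / 2)
print("tau by counting pairs:", tau_by_hand)
print("stats.kendalltau:     ", stats.kendalltau(u, v)[0])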

Rank correlation measures are more robust to outliers at the extremes of the distribution.


In [11]:
n = 50
x = np.linspace(1,10,n) + stats.norm.rvs(size=n)
y = x + stats.norm.rvs(loc=1, scale=1.5, size=n)
plt.scatter(x,y)

print("Pearson r: ", stats.pearsonr(x, y)[0])
print("Spearman's rho: ", stats.spearmanr(x, y)[0])
print("Kendall's tau: ", stats.kendalltau(x, y)[0])

pass


Pearson r:  0.911862600258
Spearman's rho:  0.903577430972
Kendall's tau:  0.74693877551

In [12]:
pollute_X = np.concatenate([x, stats.norm.rvs(loc=14, size=1), stats.norm.rvs(loc=-1, size=1)])
pollute_Y = np.concatenate([y, stats.norm.rvs(loc=6, size=1), stats.norm.rvs(loc=8, size=1)])

plt.scatter(pollute_X, pollute_Y)

print("Pearson r: ", stats.pearsonr(pollute_X, pollute_Y)[0])
print("Spearman's rho: ", stats.spearmanr(pollute_X,pollute_Y)[0])
print("Kendall's tau: ", stats.kendalltau(pollute_X,pollute_Y)[0])


Pearson r:  0.785484667133
Spearman's rho:  0.820626654145
Kendall's tau:  0.67269984917

Association Between Categorical Variables

A standard approach for testing the independence/dependence of a pair of categorical variables is to use a $\chi^2$ (Chi-square) test of independence.

The null and alternative hypotheses for the $\chi^2$ test are as follows:

  • $H_0$: the two categorical variables are independent
  • $H_A$: the two categorical variables are dependent

Contingency tables

We typically depict the relationship between categorical variables using a "contingency table".

        B1                 B2                 Total
A1      $c_{11}$           $c_{12}$           $c_{11}+c_{12}$
A2      $c_{21}$           $c_{22}$           $c_{21}+c_{22}$
Total   $c_{11}+c_{21}$    $c_{12}+c_{22}$    $c_{11}+c_{12}+c_{21}+c_{22}$

The rows and columns indicate the different categories for variables A and B respectively, and the cells $c_{ij}$ give the counts of the number of observations for the corresponding combination of A and B. For example, the cell $c_{11}$ gives the number of observations that belong to both category A1 and category B1, while $c_{12}$ gives the number that are both A1 and B2, etc.


In [13]:
# construct a contingency table for the sex and survival categorical variables
# from the bumpus data set
dataurl = "https://github.com/Bio204-class/bio204-datasets/raw/master/bumpus-data.txt"
bumpus = pd.read_table(dataurl)

In [14]:
observed = pd.crosstab(bumpus.survived, bumpus.sex, margins=True)
observed


Out[14]:
sex        f   m  All
survived
F         28  36   64
T         21  51   72
All       49  87  136

Expected counts

If the two categorical variables were independent, then we would expect the count in each cell of the contingency table to equal the product of the marginal probabilities times the total number of observations.


In [15]:
nobs = observed.loc['All','All']

prob_female = observed.loc['All','f']/nobs
prob_male = observed.loc['All', 'm']/nobs

prob_surv = observed.loc['T', 'All']/nobs
prob_died = observed.loc['F', 'All']/nobs

expected_counts = []
for i in (prob_died, prob_surv):
    row = []
    for j in (prob_female, prob_male):
        row.append(i * j * nobs)
    expected_counts.append(row + [np.sum(row)])
expected_counts.append(np.sum(expected_counts,axis=0).tolist())

expected = pd.DataFrame(expected_counts, index=observed.index, columns=observed.columns)

print("Table of Expected Counts")
expected


Table of Expected Counts
Out[15]:
sex               f          m  All
survived
F         23.058824  40.941176   64
T         25.941176  46.058824   72
All       49.000000  87.000000  136
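
The same table of expected counts can be computed more compactly (a sketch, not part of the original notebook) as the outer product of the row and column totals divided by the grand total:

In [ ]:
# a sketch: expected counts = outer(row totals, column totals) / grand total
counts = pd.crosstab(bumpus.survived, bumpus.sex).values  # table without margins
row_totals = counts.sum(axis=1)
col_totals = counts.sum(axis=0)
np.outer(row_totals, col_totals) / counts.sum()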

In [16]:
Z2 = (observed-expected)**2/expected
Z2


Out[16]:
sex              f         m  All
survived
F         1.058824  0.596349    0
T         0.941176  0.530088    0
All       0.000000  0.000000    0

In [17]:
chi2 = np.sum(Z2.values)
chi2


Out[17]:
3.1264367816091938

In [18]:
df = (len(bumpus.survived.unique()) - 1) * (len(bumpus.sex.unique()) - 1)
print("chi2 = {}, df = {}, pval = {}".format(chi2, df, stats.chi2.sf(chi2, df=df)))


chi2 = 3.126436781609194, df = 1, pval = 0.07703193811756107

$\chi^2$ test of independence using scipy.stats


In [19]:
# Same analysis with scipy.stats fxns
observed_nomargin = pd.crosstab(bumpus.survived, bumpus.sex, margins=False)
Chi2, Pval, Dof, Expected = stats.chi2_contingency(observed_nomargin.values, 
                                                   correction=False)

In [20]:
Chi2, Pval, Dof


Out[20]:
(3.1264367816091938, 0.077031938117561066, 1)

In [21]:
Expected


Out[21]:
array([[ 23.05882353,  40.94117647],
       [ 25.94117647,  46.05882353]])

In [22]:
bumpus.survived.unique()


Out[22]:
array(['T', 'F'], dtype=object)