Correlation

Correlation values range between -1 and 1.

There are two key components of a correlation value:

  • magnitude – The larger the magnitude (closer to 1 or -1), the stronger the correlation
  • sign – If negative, there is an inverse correlation: as one variable increases, the other decreases. If positive, there is a direct correlation: the variables increase and decrease together.
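
To make magnitude and sign concrete, here is a minimal sketch of Pearson's r computed by hand, as the covariance of the two variables divided by the product of their standard deviations (the data is made up purely for illustration):

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# Pearson's r: covariance divided by the product of the standard deviations
# (ddof=1 matches the sample covariance returned by np.cov)
r = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(r)  # same value as np.corrcoef(x, y)[0, 1]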

Positive Correlation

Let’s take a look at a positive correlation. NumPy implements a corrcoef() function that returns a matrix of correlations: x with x, x with y, y with x, and y with y. We’re interested in the correlation of x with y, which sits at position (0, 1) or (1, 0) of the matrix.


In [1]:
import numpy as np

np.random.seed(1)

# 1000 random integers between 0 and 50
x = np.random.randint(0, 50, 1000)

# Positive Correlation with some noise
y = x + np.random.normal(0, 10, 1000)

np.corrcoef(x, y)


Out[1]:
array([[ 1.        ,  0.81543901],
       [ 0.81543901,  1.        ]])

This correlation of 0.815 is a strong positive correlation. Let’s take a look at a scatter chart.


In [2]:
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.style.use('ggplot')

plt.scatter(x, y)
plt.show()


[Scatter plot of x against y, showing a clear upward trend]

Negative Correlation

What happens to our correlation figure if we invert the correlation such that an increase in x results in a decrease in y?


In [3]:
# 1000 random integers between 0 and 50
x = np.random.randint(0, 50, 1000)

# Negative Correlation with some noise
y = 100 - x + np.random.normal(0, 5, 1000)

np.corrcoef(x, y)


Out[3]:
array([[ 1.        , -0.94957116],
       [-0.94957116,  1.        ]])

In [4]:
plt.scatter(x, y)
plt.show()


[Scatter plot of x against y, showing a clear downward trend]

No/Weak Correlation

What if there is no correlation between x and y?


In [5]:
x = np.random.randint(0, 50, 1000)
y = np.random.randint(0, 50, 1000)

np.corrcoef(x, y)


Out[5]:
array([[ 1.        , -0.00554681],
       [-0.00554681,  1.        ]])

In [6]:
plt.scatter(x, y)
plt.show()


[Scatter plot of x against y, showing no visible pattern]

Correlation Matrix

If we’re using pandas, we can create a correlation matrix to view the correlations between the different variables in a DataFrame:


In [7]:
import pandas as pd

df = pd.DataFrame({'a': np.random.randint(0, 50, 1000)})
df['b'] = df['a'] + np.random.normal(0, 10, 1000) # positively correlated with 'a'
df['c'] = 100 - df['a'] + np.random.normal(0, 5, 1000) # negatively correlated with 'a'
df['d'] = np.random.randint(0, 50, 1000) # not correlated with 'a'

df.corr()


Out[7]:
          a         b         c         d
a  1.000000  0.825361 -0.948845  0.009802
b  0.825361  1.000000 -0.789391  0.011852
c -0.948845 -0.789391  1.000000 -0.003228
d  0.009802  0.011852 -0.003228  1.000000

We can also view these correlations graphically as a scatter matrix:


In [9]:
_ = pd.plotting.scatter_matrix(df, figsize=(6, 6))
plt.show()


Or we can plot the correlation matrix directly:


In [11]:
_ = plt.matshow(df.corr())
_ = plt.xticks(range(len(df.columns)), df.columns)
_ = plt.yticks(range(len(df.columns)), df.columns)
_ = plt.colorbar()
plt.show()



Autocorrelation

To explore correlation in a time series, let’s load a daily minimum temperatures dataset:


In [2]:
#fn = '/data/daily-rainfall-in-melbourne-aust.csv'
fn = '/data/daily-minimum-temperatures-in-me.csv'
df = pd.read_csv(fn, header=0, sep=';', decimal=',')
#df.info()
df.plot();


[Line plot of the daily minimum temperatures time series]

We can calculate the correlation between time series observations and observations at previous time steps, called lags.

A plot of the autocorrelation of a time series by lag is called the AutoCorrelation Function, or ACF.
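
Before plotting the full ACF, a minimal sketch of what a single lag value means: the lag-k autocorrelation is simply the correlation of the series with a copy of itself shifted by k steps (this assumes the temperature column loaded above is numeric):

series = df.iloc[:, 1].values

# lag-1 autocorrelation: correlate the series with itself shifted by one step
lag = 1
print(np.corrcoef(series[:-lag], series[lag:])[0, 1])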

Running the example below creates a 2D plot showing the lag value along the x-axis and the correlation on the y-axis, ranging between -1 and 1.

Confidence intervals are drawn as a cone. By default, this is set to a 95% confidence interval, suggesting that correlation values outside of this cone are very likely a real correlation and not a statistical fluke.
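
Under the null hypothesis that the series is white noise, a common approximation for the 95% band is +/- 1.96 / sqrt(N); statsmodels widens this per lag using Bartlett's formula, which is what produces the cone shape. A minimal sketch of the flat approximation:

# approximate 95% confidence band for the autocorrelation of white noise
band = 1.96 / np.sqrt(len(df))
print('+/- %.3f' % band)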


In [51]:
from statsmodels.graphics.tsaplots import plot_acf
from pandas.plotting import autocorrelation_plot

fig, (ax0, ax1) = plt.subplots(nrows=2, ncols=1, sharex=True, sharey=True, facecolor='white', figsize=(20, 10))
autocorrelation_plot(df.iloc[:, 1], ax=ax0)
ax0.set(title='pandas', xlabel='Lag', ylabel='Autocorrelation')
plot_acf(df.iloc[:, 1], ax=ax1, lags=len(df) - 1)
ax1.set(title='statsmodels', xlabel='Lag', ylabel='Autocorrelation');


The pandas confidence interval looks different because autocorrelation_plot draws flat horizontal bands for the 95% and 99% levels under a white-noise assumption, while statsmodels widens the interval at each lag.

By default, all lag values are plotted, which makes the chart noisy.

We can limit the number of lags on the x-axis to 50 to make the plot easier to read.


In [52]:
plot_acf(df.iloc[:, 1], lags=50);


Pandas also provides lag_plot(), which scatters each observation against the observation at lag 1.


In [55]:
from pandas.plotting import lag_plot
lag_plot(df.iloc[:, 1]);


Testing for Stationarity

The observations in a stationary time series are not dependent on time.

Time series are stationary if they do not have trend or seasonal effects. Summary statistics calculated on the time series are consistent over time, like the mean or the variance of the observations.

When a time series is stationary, it can be easier to model. Statistical modeling methods assume or require the time series to be stationary to be effective.
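
A rough first check, before any formal test, is to split the series in two and compare summary statistics; roughly equal means and variances are consistent with stationarity. A minimal sketch on the temperature series loaded above:

series = df.iloc[:, 1].values
half = len(series) // 2
first, second = series[:half], series[half:]

# a stationary series should have similar statistics in both halves
print('mean:     %.2f vs %.2f' % (first.mean(), second.mean()))
print('variance: %.2f vs %.2f' % (first.var(), second.var()))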

The Augmented Dickey-Fuller test can be used to test for a unit root in a univariate process in the presence of serial correlation.

To reject the null hypothesis that the series is non-stationary, the ADF statistic must be smaller (more negative) than the critical values.


In [56]:
from statsmodels.tsa.stattools import adfuller
result = adfuller(df.iloc[:, 1])
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))


ADF Statistic: -4.446730
p-value: 0.000245
Critical Values:
	1%: -3.432
	10%: -2.567
	5%: -2.862

The more negative this statistic, the more likely we are to reject the null hypothesis (i.e. the more likely the dataset is stationary).

As part of the output, we get a look-up table of critical values to help interpret the ADF statistic. We can see that our statistic value of -4.447 is less than the value of -3.432 at the 1% level.

This suggests that we can reject the null hypothesis with a significance level of less than 1% (i.e. a low probability that the result is a statistical fluke).

Rejecting the null hypothesis means that the process has no unit root, and in turn that the time series is stationary or does not have time-dependent structure.
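
For contrast, here is a minimal sketch of the same test on a series that is not stationary: a synthetic random walk, which has a unit root by construction. The statistic should sit well above the critical values, so the null hypothesis cannot be rejected:

np.random.seed(1)

# a random walk: each value is the previous value plus noise,
# so the series has a unit root and is non-stationary
walk = np.cumsum(np.random.normal(0, 1, 1000))

result = adfuller(walk)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])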

Partial Autocorrelation Function

A partial autocorrelation is a summary of the relationship between an observation in a time series and observations at prior time steps, with the relationships of intervening observations removed.

The partial autocorrelation at lag k is the correlation that results after removing the effect of any correlations due to the terms at shorter lags.

The autocorrelation between an observation and an observation at a prior time step comprises both the direct correlation and indirect correlations. These indirect correlations are a linear function of the correlation of the observation with observations at intervening time steps.

It is these indirect correlations that the partial autocorrelation function seeks to remove. Without going into the math, this is the intuition for the partial autocorrelation.
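
A minimal numeric sketch of that intuition, on a synthetic AR(1) series (purely illustrative): the partial autocorrelation at lag 2 is the correlation between the residuals of x_t and x_{t-2} after each has been regressed on the intervening value x_{t-1}:

from statsmodels.tsa.stattools import pacf

np.random.seed(1)

# synthetic AR(1) series, purely for illustration
x = np.zeros(500)
for t in range(1, 500):
    x[t] = 0.8 * x[t - 1] + np.random.normal()

# regress x_t and x_{t-2} on the intervening value x_{t-1},
# then correlate the residuals
xt, xm1, xm2 = x[2:], x[1:-1], x[:-2]
res_t = xt - np.polyval(np.polyfit(xm1, xt, 1), xm1)
res_2 = xm2 - np.polyval(np.polyfit(xm1, xm2, 1), xm1)

print(np.corrcoef(res_t, res_2)[0, 1])    # manual PACF at lag 2
print(pacf(x, nlags=2, method='ols')[2])  # statsmodels value (should agree closely)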

The example below calculates and plots the partial autocorrelation function for the first 50 lags of the Minimum Daily Temperatures dataset using the plot_pacf() function from the statsmodels library, alongside the ACF for comparison.


In [59]:
from statsmodels.graphics.tsaplots import plot_pacf

fig, (ax0, ax1) = plt.subplots(2, 1, sharex=True, figsize=(20,10))
plot_acf(df.iloc[:, 1], lags=50, ax=ax0);
plot_pacf(df.iloc[:, 1], lags=50, ax=ax1);


