Correlation values range between -1 and 1.
There are two key components of a correlation value:
Let’s take a look at a positive correlation. Numpy implements a corrcoef() function that returns a matrix of correlations of x with x, x with y, y with x and y with y. We’re interested in the values of correlation of x with y (so position (1, 0) or (0, 1)).
In [1]:
import numpy as np
np.random.seed(1)
# 1000 random integers between 0 and 50
x = np.random.randint(0, 50, 1000)
# Positive Correlation with some noise
y = x + np.random.normal(0, 10, 1000)
np.corrcoef(x, y)
Out[1]:
This correlation is 0.815, a strong positive correlation, let’s take a look at a scatter chart.
In [2]:
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.style.use('ggplot')
plt.scatter(x, y)
plt.show()
Out[2]:
In [3]:
# 1000 random integers between 0 and 50
x = np.random.randint(0, 50, 1000)
# Negative Correlation with some noise
y = 100 - x + np.random.normal(0, 5, 1000)
np.corrcoef(x, y)
Out[3]:
In [4]:
plt.scatter(x, y)
plt.show()
Out[4]:
In [5]:
x = np.random.randint(0, 50, 1000)
y = np.random.randint(0, 50, 1000)
np.corrcoef(x, y)
Out[5]:
In [6]:
plt.scatter(x, y)
plt.show()
Out[6]:
In [7]:
import pandas as pd
df = pd.DataFrame({'a': np.random.randint(0, 50, 1000)})
df['b'] = df['a'] + np.random.normal(0, 10, 1000) # positively correlated with 'a'
df['c'] = 100 - df['a'] + np.random.normal(0, 5, 1000) # negatively correlated with 'a'
df['d'] = np.random.randint(0, 50, 1000) # not correlated with 'a'
df.corr()
Out[7]:
We can also view these correlations graphically as a scatter matrix:
In [9]:
_ = pd.scatter_matrix(df, figsize=(6, 6))
plt.show()
Or we can directly plot a correlation matrix plot:
In [11]:
_ = plt.matshow(df.corr())
_ = plt.xticks(range(len(df.columns)), df.columns)
_ = plt.yticks(range(len(df.columns)), df.columns)
_ = plt.colorbar()
plt.show()
In [2]:
#fn = '/data/daily-rainfall-in-melbourne-aust.csv'
fn = '/data/daily-minimum-temperatures-in-me.csv'
df = pd.read_csv(fn, header=0, sep=';', decimal=',')
#df.info()
df.plot();
Out[2]:
We can calculate the correlation for time series observations with observations with previous time steps, called lags.
A plot of the autocorrelation of a time series by lag is called the AutoCorrelation Function, or the acronym ACF.
Running the example creates a 2D plot showing the lag value along the x-axis and the correlation on the y-axis between -1 and 1.
Confidence intervals are drawn as a cone. By default, this is set to a 95% confidence interval, suggesting that correlation values outside of this code are very likely a correlation and not a statistical fluke.
In [51]:
from statsmodels.graphics.tsaplots import plot_acf
from pandas.tools.plotting import autocorrelation_plot
fig, (ax0, ax1) = plt.subplots(nrows=2, ncols=1, sharex=True, sharey=True, facecolor='white', figsize=(20,10))
autocorrelation_plot(df.ix[:,1], ax=ax0)
ax0.set(title='pandas', xlabel='Lag', ylabel='Autocorrelation')
plot_acf(df.ix[:,1], ax=ax1, lags=len(df)-1)
ax1.set(title='statsmodels', xlabel='Lag', ylabel='Autocorrelation');
Pandas confidence interval seems strange....
By default, all lag values are printed, which makes the plot noisy.
We can limit the number of lags on the x-axis to 50 to make the plot easier to read.
In [52]:
plot_acf(df.ix[:,1], lags=50);
Pandas autocorrelation plot with lag=1.
In [55]:
from pandas.tools.plotting import lag_plot
lag_plot(df.ix[:,1]);
The observations in a stationary time series are not dependent on time.
Time series are stationary if they do not have trend or seasonal effects. Summary statistics calculated on the time series are consistent over time, like the mean or the variance of the observations.
When a time series is stationary, it can be easier to model. Statistical modeling methods assume or require the time series to be stationary to be effective.
The Augmented Dickey-Fuller test can be used to test for a unit root in a univariate process in the presence of serial correlation.
ADF Statistic must be smaller than the critical values to dismiss the hypothesis of being not stationary.
In [56]:
from statsmodels.tsa.stattools import adfuller
result = adfuller(df.ix[:,1])
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
print('\t%s: %.3f' % (key, value))
the more negative this statistic, the more likely we are to reject the null hypothesis (we have a stationary dataset).
As part of the output, we get a look-up table to help determine the ADF statistic. We can see that our statistic value of -4.447 is less than the value of -3.432 at 1%.
This suggests that we can reject the null hypothesis with a significance level of less than 1% (i.e. a low probability that the result is a statistical fluke).
Rejecting the null hypothesis means that the process has no unit root, and in turn that the time series is stationary or does not have time-dependent structure.
A partial autocorrelation is a summary of the relationship between an observation in a time series with observations at prior time steps with the relationships of intervening observations removed.
The partial autocorrelation at lag k is the correlation that results after removing the effect of any correlations due to the terms at shorter lags.
The autocorrelation for an observation and an observation at a prior time step is comprised of both the direct correlation and indirect correlations. These indirect correlations are a linear function of the correlation of the observation, with observations at intervening time steps.
It is these indirect correlations that the partial autocorrelation function seeks to remove. Without going into the math, this is the intuition for the partial autocorrelation.
The example below calculates and plots a partial autocorrelation function for the first 50 lags in the Minimum Daily Temperatures dataset using the plot_pacf() from the statsmodels library.
In [59]:
from statsmodels.graphics.tsaplots import plot_pacf
fig, (ax0, ax1) = plt.subplots(2, 1, sharex=True, figsize=(20,10))
plot_acf(df.ix[:,1], lags=50, ax=ax0);
plot_pacf(df.ix[:,1], lags=50, ax=ax1);
In [ ]: