In [1]:
%load_ext load_style
%load_style talk.css
from IPython.display import Image
from talktools import website


Statistical modelling with statsmodels

statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An (incomplete) list of what you can do with statsmodels:

  • Linear regression models
  • Generalized linear models
  • A wide range of statistical tests
  • ...
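As a quick illustration of the statistical-test side of the library (not part of the original notebook), here is a minimal sketch of a two-sample t-test on synthetic data, using statsmodels.stats.weightstats.ttest_ind; the sample sizes and distributions are made up for illustration:

```python
import numpy as np
from statsmodels.stats.weightstats import ttest_ind

rng = np.random.RandomState(42)
a = rng.normal(loc=0.0, scale=1.0, size=200)  # sample from N(0, 1)
b = rng.normal(loc=0.5, scale=1.0, size=200)  # sample from N(0.5, 1)

# two-sample t-test; returns the t statistic, the p-value,
# and the degrees of freedom of the test
tstat, pvalue, df = ttest_ind(a, b)
print(tstat, pvalue, df)
```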

I will also briefly mention the seaborn package, for high-level, beautiful statistical visualisations.

Statsmodels


In [2]:
#website('http://statsmodels.sourceforge.net/', width=1000)

In [3]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from IPython.display import Image, HTML
%matplotlib inline

In [4]:
data = pd.read_csv('../data/soi_nino.csv', index_col=0, parse_dates=True)

In [5]:
data.tail()


Out[5]:
SOI nino
2013-08-01 -0.222680 -0.37
2013-09-01 0.317375 -0.27
2013-10-01 -0.309098 -0.14
2013-11-01 0.836244 -0.19
2013-12-01 -0.117755 -0.43

In [6]:
data.corr()


Out[6]:
SOI nino
SOI 1.000000 -0.710077
nino -0.710077 1.000000

Using statsmodels with numpy arrays and the classic API


In [7]:
import statsmodels.api as sm

Get the data into numpy arrays


In [8]:
y = data['SOI'].values
X = data['nino'].values

Add an axis to X (making it a 2-D column vector) and prepend a constant column with the sm.add_constant function


In [9]:
X = X[...,np.newaxis]

In [10]:
X = sm.add_constant(X)

In [11]:
X


Out[11]:
array([[ 1.  , -1.84],
       [ 1.  , -1.63],
       [ 1.  , -1.3 ],
       ..., 
       [ 1.  , -0.14],
       [ 1.  , -0.19],
       [ 1.  , -0.43]])

Instantiate the model; the arguments are y (the dependent variable) and X (the independent variable)


In [12]:
lm1_mod = sm.OLS(y, X)

fit the model


In [13]:
lm1_fit = lm1_mod.fit()

In [14]:
lm1_fit.summary()


Out[14]:
OLS Regression Results
Dep. Variable: y R-squared: 0.504
Model: OLS Adj. R-squared: 0.504
Method: Least Squares F-statistic: 779.0
Date: Wed, 25 Feb 2015 Prob (F-statistic): 8.08e-119
Time: 09:15:55 Log-Likelihood: -905.59
No. Observations: 768 AIC: 1815.
Df Residuals: 766 BIC: 1824.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
const -0.2307 0.029 -7.977 0.000 -0.287 -0.174
x1 -0.9506 0.034 -27.911 0.000 -1.017 -0.884
Omnibus: 2.974 Durbin-Watson: 1.505
Prob(Omnibus): 0.226 Jarque-Bera (JB): 2.809
Skew: -0.138 Prob(JB): 0.246
Kurtosis: 3.106 Cond. No. 1.28

Out-of-sample prediction for ONE value of NINO3.4

You need to prepend a constant (1. is the default) so that X_prime is aligned with the fitted design matrix


In [15]:
X_prime = np.array([1., 3.]).reshape((1,2))

In [16]:
X_prime.shape


Out[16]:
(1, 2)

In [17]:
X.shape


Out[17]:
(768, 2)

In [18]:
y_hat = lm1_fit.predict(X_prime)

In [19]:
y_hat


Out[19]:
array([-3.08240935])

Using statsmodels with Pandas DataFrames and the formula API

The statsmodels library also has a formula API that allows you to enter your model using a syntax which should be familiar to R users. The variable names are simply the names of the corresponding columns in the pandas DataFrame, which is given as the second argument.

The formula API is based on the patsy project.
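To see the design matrices that patsy builds from a formula, here is a minimal sketch using patsy.dmatrices directly on a toy DataFrame (the values are made up for illustration):

```python
import pandas as pd
from patsy import dmatrices

df = pd.DataFrame({'SOI': [0.3, -0.1, 0.8],
                   'nino': [-0.2, 0.1, -0.5]})

# returns the response matrix y and the design matrix X
y, X = dmatrices('SOI ~ nino', df)
print(X)  # first column is the automatically added Intercept
```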


In [20]:
from statsmodels.formula.api import ols as OLS

Instantiate the model (OLS), specifying the columns of the pandas DataFrame in the formula


In [21]:
lm2_mod = OLS('SOI ~ nino', data)

fit the model


In [22]:
lm2_fit = lm2_mod.fit()

In [23]:
lm2_fit.summary()


Out[23]:
OLS Regression Results
Dep. Variable: SOI R-squared: 0.504
Model: OLS Adj. R-squared: 0.504
Method: Least Squares F-statistic: 779.0
Date: Wed, 25 Feb 2015 Prob (F-statistic): 8.08e-119
Time: 09:15:59 Log-Likelihood: -905.59
No. Observations: 768 AIC: 1815.
Df Residuals: 766 BIC: 1824.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -0.2307 0.029 -7.977 0.000 -0.287 -0.174
nino -0.9506 0.034 -27.911 0.000 -1.017 -0.884
Omnibus: 2.974 Durbin-Watson: 1.505
Prob(Omnibus): 0.226 Jarque-Bera (JB): 2.809
Skew: -0.138 Prob(JB): 0.246
Kurtosis: 3.106 Cond. No. 1.28

Statistical models and statistical visualisations with Seaborn

Seaborn is a library for making attractive and informative statistical graphics in Python. It is built on top of matplotlib and tightly integrated with the PyData stack, including support for numpy and pandas data structures and statistical routines from scipy and statsmodels.


In [24]:
import seaborn as sb

In [25]:
sb.regplot(x="nino", y="SOI", data=data)


Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a275f10>

In [26]:
f, ax = plt.subplots()
sb.kdeplot(x=data['nino'], y=data['SOI'], fill=True, ax=ax)


Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a355b50>
