In [1]:
%load_ext load_style
%load_style talk.css
from IPython.display import Image
from talktools import website


Statistical modelling with statsmodels

statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An (incomplete) list of what you can do with statsmodels:

  • Linear regression models
  • Generalized linear models
  • A wide range of statistical tests
  • ...
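As a quick illustration of the statistical-test side of the library (not part of the original notebook), here is a minimal sketch of a two-sample t-test on synthetic data, using statsmodels.stats.weightstats.ttest_ind; the sample sizes and distributions are made up for illustration:

```python
import numpy as np
from statsmodels.stats.weightstats import ttest_ind

rng = np.random.RandomState(42)
a = rng.normal(loc=0.0, scale=1.0, size=200)  # sample from N(0, 1)
b = rng.normal(loc=0.5, scale=1.0, size=200)  # sample from N(0.5, 1)

# two-sample t-test; returns the t statistic, the p-value,
# and the degrees of freedom of the test
tstat, pvalue, df = ttest_ind(a, b)
print(tstat, pvalue, df)
```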

I will also briefly mention the seaborn package, for high-level, beautiful statistical visualisations.

Statsmodels


In [2]:
#website('http://statsmodels.sourceforge.net/', width=1000)

In [3]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from IPython.display import Image, HTML
%matplotlib inline

In [4]:
data = pd.read_csv('../data/soi_nino.csv', index_col=0, parse_dates=True)

In [5]:
data.tail()


Out[5]:
SOI nino
2013-08-01 -0.222680 -0.37
2013-09-01 0.317375 -0.27
2013-10-01 -0.309098 -0.14
2013-11-01 0.836244 -0.19
2013-12-01 -0.117755 -0.43

In [6]:
data.corr()


Out[6]:
SOI nino
SOI 1.000000 -0.710077
nino -0.710077 1.000000

Using statsmodels with numpy arrays and the classic API


In [7]:
import statsmodels.api as sm

Get the data into numpy arrays


In [8]:
y = data['SOI'].values
X = data['nino'].values

Add an axis to X (making it a 2-D column vector) and prepend a constant column with the sm.add_constant function


In [9]:
X = X[...,np.newaxis]

In [10]:
X = sm.add_constant(X)

In [11]:
X


Out[11]:
array([[ 1.  , -1.84],
       [ 1.  , -1.63],
       [ 1.  , -1.3 ],
       ..., 
       [ 1.  , -0.14],
       [ 1.  , -0.19],
       [ 1.  , -0.43]])

Instantiate the model; the arguments are y (the dependent variable) and X (the independent variable)


In [12]:
lm1_mod = sm.OLS(y, X)

fit the model


In [13]:
lm1_fit = lm1_mod.fit()

In [14]:
lm1_fit.summary()


Out[14]:
OLS Regression Results
Dep. Variable: y R-squared: 0.504
Model: OLS Adj. R-squared: 0.504
Method: Least Squares F-statistic: 779.0
Date: Wed, 25 Feb 2015 Prob (F-statistic): 8.08e-119
Time: 09:15:55 Log-Likelihood: -905.59
No. Observations: 768 AIC: 1815.
Df Residuals: 766 BIC: 1824.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
const -0.2307 0.029 -7.977 0.000 -0.287 -0.174
x1 -0.9506 0.034 -27.911 0.000 -1.017 -0.884
Omnibus: 2.974 Durbin-Watson: 1.505
Prob(Omnibus): 0.226 Jarque-Bera (JB): 2.809
Skew: -0.138 Prob(JB): 0.246
Kurtosis: 3.106 Cond. No. 1.28

Out-of-sample prediction for ONE value of NINO3.4

You need to prepend a constant (1. is the default) so that X_prime is aligned with the fitted design matrix


In [15]:
X_prime = np.array([1., 3.]).reshape((1,2))

In [16]:
X_prime.shape


Out[16]:
(1, 2)

In [17]:
X.shape


Out[17]:
(768, 2)

In [18]:
y_hat = lm1_fit.predict(X_prime)

In [19]:
y_hat


Out[19]:
array([-3.08240935])

Using statsmodels with Pandas DataFrames and the formula API

The statsmodels library also has a formula API that allows you to enter your model using a syntax which should be familiar to R users. The variable names are simply the names of the corresponding columns in the pandas DataFrame, which is given as the second argument.

The formula API is based on the patsy project.
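To see the design matrices that patsy builds from a formula, here is a minimal sketch using patsy.dmatrices directly on a toy DataFrame (the values are made up for illustration):

```python
import pandas as pd
from patsy import dmatrices

df = pd.DataFrame({'SOI': [0.3, -0.1, 0.8],
                   'nino': [-0.2, 0.1, -0.5]})

# returns the response matrix y and the design matrix X
y, X = dmatrices('SOI ~ nino', df)
print(X)  # first column is the automatically added Intercept
```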


In [20]:
from statsmodels.formula.api import ols as OLS

Instantiate the model (OLS), specifying the columns of the pandas DataFrame in the formula


In [21]:
lm2_mod = OLS('SOI ~ nino', data)

fit the model


In [22]:
lm2_fit = lm2_mod.fit()

In [23]:
lm2_fit.summary()


Out[23]:
OLS Regression Results
Dep. Variable: SOI R-squared: 0.504
Model: OLS Adj. R-squared: 0.504
Method: Least Squares F-statistic: 779.0
Date: Wed, 25 Feb 2015 Prob (F-statistic): 8.08e-119
Time: 09:15:59 Log-Likelihood: -905.59
No. Observations: 768 AIC: 1815.
Df Residuals: 766 BIC: 1824.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -0.2307 0.029 -7.977 0.000 -0.287 -0.174
nino -0.9506 0.034 -27.911 0.000 -1.017 -0.884
Omnibus: 2.974 Durbin-Watson: 1.505
Prob(Omnibus): 0.226 Jarque-Bera (JB): 2.809
Skew: -0.138 Prob(JB): 0.246
Kurtosis: 3.106 Cond. No. 1.28

Statistical models and statistical visualisations with Seaborn

Seaborn is a library for making attractive and informative statistical graphics in Python. It is built on top of matplotlib and tightly integrated with the PyData stack, including support for numpy and pandas data structures and statistical routines from scipy and statsmodels.


In [24]:
import seaborn as sb

In [25]:
sb.regplot(x="nino", y="SOI", data=data)


Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a275f10>

In [26]:
f, ax = plt.subplots()
sb.kdeplot(x=data['nino'], y=data['SOI'], fill=True, ax=ax)


Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a355b50>
