In [1]:
%load_ext load_style
%load_style talk.css
from IPython.display import Image
from talktools import website
statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An (incomplete) list of what you can do with statsmodels:
I will also briefly mention the seaborn package, which provides high-level, attractive statistical visualisations.
In [2]:
#website('http://statsmodels.sourceforge.net/', width=1000)
In [3]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from IPython.display import Image, HTML
%matplotlib inline
In [4]:
data = pd.read_csv('../data/soi_nino.csv', index_col=0, parse_dates=True)
In [5]:
data.tail()
Out[5]:
In [6]:
data.corr()
Out[6]:
In [7]:
import statsmodels.api as sm
Get the data into numpy arrays
In [8]:
y = data['SOI'].values
X = data['nino'].values
Add an axis to X so that it is two-dimensional, then add a constant column with the sm.add_constant function
In [9]:
X = X[...,np.newaxis]
In [10]:
X = sm.add_constant(X)
In [11]:
X
Out[11]:
Instantiate the model; the arguments are y (dependent variable) and X (independent variables)
In [12]:
lm1_mod = sm.OLS(y, X)
fit the model
In [13]:
lm1_fit = lm1_mod.fit()
In [14]:
lm1_fit.summary()
Out[14]:
Out-of-sample prediction for ONE value of NINO3.4
You need to prepend a constant (1., the value that sm.add_constant inserts by default) so that the new row has the same two columns as X and the matrices are aligned
In [15]:
X_prime = np.array([1., 3.]).reshape((1,2))
In [16]:
X_prime.shape
Out[16]:
In [17]:
X.shape
Out[17]:
In [18]:
y_hat = lm1_fit.predict(X_prime)
In [19]:
y_hat
Out[19]:
The statsmodels library also has a formula API that lets you specify your model using a syntax that should be familiar to R users. The variable names are simply the names of the corresponding columns in the pandas DataFrame, which is given as the second argument.
The formula API is based on the patsy project.
In [20]:
from statsmodels.formula.api import ols as OLS
Instantiate the model (OLS), specifying the columns of the pandas DataFrame in the formula
In [21]:
lm2_mod = OLS('SOI ~ nino', data)
fit the model
In [22]:
lm2_fit = lm2_mod.fit()
In [23]:
lm2_fit.summary()
Out[23]:
Seaborn
Seaborn is a library for making attractive and informative statistical graphics in Python. It is built on top of matplotlib and tightly integrated with the PyData stack, including support for numpy and pandas data structures and statistical routines from scipy and statsmodels.
In [24]:
import seaborn as sb
In [25]:
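# pass x, y and data as keywords; recent seaborn releases no longer accept them positionally
sb.regplot(x="nino", y="SOI", data=data)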
Out[25]:
In [26]:
f, ax = plt.subplots()
# bivariate KDE; fill= replaces the deprecated shade= (assumes seaborn >= 0.11)
sb.kdeplot(data=data, x='nino', y='SOI', fill=True, ax=ax)
Out[26]: