Ruixue Gong, NYU
This notebook helps economists better understand the differences and similarities between the machine learning package Scikit-learn and the traditional statistical package Statsmodels.
First, we provide an overview of Scikit-learn and Statsmodels. Second, we briefly contrast the two packages. Finally, we apply both packages to several topics, such as linear regression models, logistic models, and time series analysis.
In [1]:
from IPython.display import Image
Image(filename='scikit-learn-flow-chart.jpg') #source: web
Out[1]:
Contrast
In [2]:
%matplotlib inline
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.sandbox.regression.predstd import wls_prediction_std
from scipy import stats
import pandas as pd
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from statsmodels.base.model import GenericLikelihoodModel
Linear regression is, in its basic form, the same in Statsmodels and in Scikit-learn. However, Statsmodels can generate a summary of the OLS results much like Stata and R. Lecture 11 already gave an introduction to regression analysis in Scikit-learn, so here we only present the features specific to Statsmodels.
Now consider the true model: $$y = \beta_1 x + \beta_2 \sin(x) + \beta_3 (x-3)^2 + \beta_0 + \epsilon, \quad \text{where } \epsilon \sim N(0, \sigma^2), \ \sigma = 0.3$$
In [3]:
n = 20
x = np.linspace(0, 5, n)
sigma = 0.3
beta = np.array([1, 0.5, -0.02, 5])   # true coefficients (beta_1, beta_2, beta_3, beta_0)
e = sigma * np.random.normal(size=n)  # noise with standard deviation sigma
X = np.column_stack((x, np.sin(x), (x-3)**2, np.ones(n)))
y_true = np.dot(X, beta)
y = y_true + e
In [4]:
#do regression
model = sm.OLS(y, X) #Pick a class. GLS, WLS...
results = model.fit()
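For comparison, here is a minimal sketch of the equivalent fit in Scikit-learn (fit_intercept=False because X already contains a column of ones):
from sklearn.linear_model import LinearRegression

lr = LinearRegression(fit_intercept=False).fit(X, y)
print(lr.coef_)  # point estimates match results.params, but no standard errors or summary table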
We can get a summary of the regression results, just like in R, with results.summary(). We can also extract the individual values we are interested in: results.params, results.rsquared, results.fittedvalues, results.predict(), and results.bse.
In [5]:
print(results.summary())
In [6]:
print('Coefficients: ', results.params)
print('R2: ', results.rsquared)
print('Standard errors: ', results.bse)
print('Fitted values: ', results.fittedvalues)
print('Predicted values: ', results.predict())
In [7]:
prstd, iv_l, iv_u = wls_prediction_std(results)
# wls_prediction_std returns the standard deviation and the
# lower/upper prediction interval bounds for the fitted model
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(x, y, 'o', label="data")
ax.plot(x, y_true, 'b-', label="Real")
ax.plot(x, results.fittedvalues, 'r--.', label="OLS")
ax.plot(x, iv_u, 'r--')
ax.plot(x, iv_l, 'r--')
ax.legend(loc='best')
Out[7]:
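Note that wls_prediction_std lives in the sandbox module; in more recent Statsmodels versions the same intervals can be obtained from the results object directly. A sketch, assuming the fitted results above:
pred = results.get_prediction()
frame = pred.summary_frame(alpha=0.05)  # fitted mean, its CI, and the observation prediction interval
print(frame[['mean', 'obs_ci_lower', 'obs_ci_upper']].head())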
Joint Test
We can test a general linear restriction $$R\beta = 0$$ or use a formula-like string such as $$\beta_2 = \beta_3 = 0$$
In [8]:
#test beta_2 = beta_3 = 0
print(results.f_test("x2 = x3 = 0"))
#test R beta = 0
R = [[0, 1, 0, 0], [0, 0, 1, 0]]
print(results.f_test(R))
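A single restriction can be tested the same way with t_test (column names default to x1, x2, ... for array input, as used above):
print(results.t_test("x1 = 0"))  # t-test of one linear restriction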
In [9]:
data = sm.datasets.spector.load_pandas() #use the dataset spector
exog = data.exog
endog = data.endog
print(sm.datasets.spector.NOTE)
print(data.exog.head())
In [10]:
exog1 = sm.add_constant(exog, prepend=True)  # prepend a constant column to the regressors
# plug in the log-likelihood function of my own probit model
class MyProbit(GenericLikelihoodModel):
    def loglike(self, params):
        exog = self.exog
        endog = self.endog
        q = 2 * endog - 1   # map the {0,1} outcomes to {-1,+1}
        return stats.norm.logcdf(q * np.dot(exog, params)).sum()
sm_probit_manual = MyProbit(endog, exog1).fit()
print(sm_probit_manual.summary())
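As a sanity check, the built-in sm.Probit estimator should reproduce the hand-written likelihood almost exactly:
sm_probit_builtin = sm.Probit(endog, exog1).fit()
print(sm_probit_builtin.params)  # should match sm_probit_manual.params up to optimizer tolerance
print(sm_probit_manual.params)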
I use the spector dataset, which is included in the Statsmodels package. To visualize the results, I keep only two exogenous variables, x1 and x2.
We are all familiar with the mathematical formulation of logistic regression, so I instead include a brief introductory graph of the support vector machine algorithm below.
In [11]:
Image(filename='svm.jpg' ) #source : web
Out[11]:
In [12]:
X = np.delete(data.exog.values, 2, 1)  # drop the third regressor, keeping x1 and x2
y = data.endog.values
# classifier
clf = svm.SVC(kernel='linear', C=1)
# fitting by Scikit-learn
model_svm = clf.fit(X, y)
# fitting by Statsmodels
XX = np.insert(X, 0, 1, axis=1)  # add a column of constant 1s
model_logit_stats = sm.Logit(y, XX)
res = model_logit_stats.fit()
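As with OLS, the Statsmodels side again comes with a full regression table, which Scikit-learn's SVC does not provide:
print(res.summary())  # coefficient table for the logit fit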
In [13]:
w = clf.coef_[0]
a = -w[0] / w[1]                    # slope of the SVM decision boundary
c = -res.params[1] / res.params[2]  # slope of the logit decision boundary
x1 = np.linspace(2, 4)
x2 = a * x1 - clf.intercept_[0] / w[1]       # SVM boundary
x3 = c * x1 - res.params[0] / res.params[2]  # logistic boundary
h1 = plt.plot(x1, x2, '-', color='green', label='SVM')
h2 = plt.plot(x1, x3, '-', color='blue', label='Logit Statsmodels')
plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.4)
plt.legend()
plt.title('Comparison: Logistic Regression and SVM')
plt.show()
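Beyond eyeballing the decision boundaries, we can compare in-sample classification accuracy. A minimal sketch, thresholding the logit's fitted probabilities at 0.5:
acc_svm = clf.score(X, y)                          # mean accuracy of the SVM
acc_logit = ((res.predict(XX) > 0.5) == y).mean()  # accuracy of the thresholded logit
print('SVM accuracy:  ', acc_svm)
print('Logit accuracy:', acc_logit)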
In [14]:
dta = sm.datasets.sunspots.load_pandas().data #use the dataset sunspot
print(sm.datasets.sunspots.NOTE)
In [15]:
dta.index = pd.Index(sm.tsa.datetools.dates_from_range('1700', '2008')) #set time series
del dta["YEAR"]
In [16]:
dta.plot(figsize=(12,8))
Out[16]:
In [17]:
fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(dta.values.squeeze(), lags=40, ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(dta.values.squeeze(), lags=40, ax=ax2)
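The ACF tails off while the PACF cuts off after a couple of lags, which suggests a low-order autoregressive model. As an illustration (the ARIMA class location assumes Statsmodels 0.12 or later):
from statsmodels.tsa.arima.model import ARIMA

arma_mod = ARIMA(dta, order=(2, 0, 0)).fit()  # AR(2), i.e., ARMA(2,0)
print(arma_mod.params)
print(arma_mod.aic)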