In [1]:
import numpy as np
import pandas as pd

import statsmodels.api as sm
import statsmodels.formula.api as smf

import matplotlib.pyplot as plt
# import seaborn as sns
%matplotlib inline

Benchmarking results for the basic document and the efficient document


In [2]:
data = pd.read_csv('../benchMarkingResult.txt',
                   header=None,
                   sep='\t',
                   names=('iteration',
                          'basic_result',
                          'efficient_result'))

Plot the basic and efficient results first


Basic


In [3]:
x = data.iteration
y1 = data.basic_result
y2 = data.efficient_result

In [4]:
fig, ax = plt.subplots(figsize=(10,10))
ax.scatter(x, y1, s=15)
ax.scatter(x, y2, s=15, c='g')


Out[4]:
<matplotlib.collections.PathCollection at 0x11a1beb50>

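Fit a linear regression to each of the two series and evaluate the R-squared value. The `- 1` in each formula drops the intercept, so each fit is a line through the origin.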


In [5]:
d1 = {'x':x, 'y_basic':y1}
d2 = {'x':x, 'y_efficient':y2}

In [6]:
mod1 = smf.ols(formula='y_basic ~ x - 1',
               data = d1).fit()

In [7]:
print(mod1.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                y_basic   R-squared:                       0.993
Model:                            OLS   Adj. R-squared:                  0.992
Method:                 Least Squares   F-statistic:                     2628.
Date:                Mon, 27 Feb 2017   Prob (F-statistic):           7.75e-22
Time:                        19:57:29   Log-Likelihood:                -48.575
No. Observations:                  20   AIC:                             99.15
Df Residuals:                      19   BIC:                             100.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x              0.0001   2.37e-06     51.269      0.000         0.000     0.000
==============================================================================
Omnibus:                        3.436   Durbin-Watson:                   1.028
Prob(Omnibus):                  0.179   Jarque-Bera (JB):                2.013
Skew:                           0.542   Prob(JB):                        0.366
Kurtosis:                       1.887   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [8]:
mod2 = smf.ols(formula='y_efficient ~ x - 1', data=d2).fit()
mod2.summary()


Out[8]:
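                            OLS Regression Results
==============================================================================
Dep. Variable:            y_efficient   R-squared:                       0.991
Model:                            OLS   Adj. R-squared:                  0.991
Method:                 Least Squares   F-statistic:                     2163.
Date:                Mon, 27 Feb 2017   Prob (F-statistic):           4.87e-21
Time:                        19:57:29   Log-Likelihood:                -34.300
No. Observations:                  20   AIC:                             70.60
Df Residuals:                      19   BIC:                             71.59
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x           5.388e-05   1.16e-06     46.510      0.000      5.15e-05  5.63e-05
==============================================================================
Omnibus:                       21.719   Durbin-Watson:                   2.000
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               28.717
Skew:                           1.952   Prob(JB):                     5.81e-07
Kurtosis:                       7.384   Cond. No.                         1.00
==============================================================================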

Both R-squared values are high: 0.993 for the basic fit and 0.991 for the efficient fit.
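
For a fit with no intercept (the `- 1` in the formulas above), the least-squares slope has a closed form, sum(x*y) / sum(x*x). When the constant is omitted, statsmodels reports the uncentered R-squared, which compares the residuals against sum(y*y) rather than against the variation around the mean of y, so it tends to run higher than the usual centered value. The sketch below works through both quantities in plain Python. The helper name `fit_through_origin` and the demo numbers are made up for illustration and are not the benchmark data.

```python
def fit_through_origin(x, y):
    """Return (slope, uncentered R-squared) for the model y = slope * x."""
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    slope = sxy / sxx

    # Residual sum of squares around the fitted line
    ss_res = sum((yi - slope * xi) ** 2 for xi, yi in zip(x, y))
    # Uncentered total sum of squares: y is NOT demeaned
    ss_tot_uncentered = sum(yi * yi for yi in y)

    r2_uncentered = 1.0 - ss_res / ss_tot_uncentered
    return slope, r2_uncentered


# Illustrative data only (assumption, not taken from benchMarkingResult.txt)
x_demo = [1.0, 2.0, 3.0, 4.0]
y_demo = [2.1, 3.9, 6.2, 7.8]

slope_demo, r2_demo = fit_through_origin(x_demo, y_demo)
print(slope_demo)
print(r2_demo)
```

On the real data, the slope from this formula should agree with `mod1.params['x']` and `mod2.params['x']`.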

Plot the fitted lines over the data. Cross-validation is skipped here.


In [9]:
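# Pass the predictor under the name used in the formula ('x')
y_est_basic = mod1.predict(pd.DataFrame({'x': x}))
y_est_effi = mod2.predict(pd.DataFrame({'x': x}))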

In [12]:
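fig, ax = plt.subplots(figsize=(10,10))
# Labels are required, otherwise ax.legend() has nothing to show
ax.scatter(x, y1, s=15, label='basic')
ax.scatter(x, y2, s=15, c='g', label='efficient')
ax.plot(x, y_est_basic, label='basic fit')
ax.plot(x, y_est_effi, label='efficient fit')
ax.legend(scatterpoints=1)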


Out[12]:
<matplotlib.legend.Legend at 0x1029ac610>
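
The cross-validation skipped above could be done in several ways. With only 20 observations, leave-one-out is one reasonable option: refit the no-intercept line with each point held out in turn and score the prediction for that point. The sketch below is one possible version in plain Python. The function name `loo_rmse_through_origin` and the demo numbers are assumptions for illustration, not part of the original analysis.

```python
def loo_rmse_through_origin(x, y):
    """Leave-one-out RMSE for the no-intercept model y = slope * x.

    x and y are plain Python lists of equal length (at least 2 points).
    """
    squared_errors = []
    for i in range(len(x)):
        # Training set: every point except the i-th
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]

        # Closed-form least-squares slope through the origin
        slope = (sum(a * b for a, b in zip(x_train, y_train))
                 / sum(a * a for a in x_train))

        # Prediction error on the held-out point
        squared_errors.append((y[i] - slope * x[i]) ** 2)

    return (sum(squared_errors) / len(squared_errors)) ** 0.5


# Illustrative data only (assumption, not the benchmark measurements)
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
y_demo = [2.1, 3.9, 6.2, 7.8, 10.1]
print(loo_rmse_through_origin(x_demo, y_demo))
```

For the notebook's data, the inputs would be `list(x)` with `list(y1)` or `list(y2)`. Comparing the resulting error with the in-sample residuals gives a rough sense of whether the straight-line fit generalizes.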
