Regression functions demo notebook

If you have not already done so, run the following command to install the statsmodels package:

pip install -U statsmodels

Run the following commands to install scipy and scikit-learn:

conda install scipy

conda install scikit-learn

Use the data cleaning package to import a data set:


In [1]:
from data_cleaning_utils import import_data
dat = import_data('../Data/Test/pool82014-10-02cleaned_Subset.csv')


Index(['FID', 'time', 'XCO2Dpp', 'XCH4Dpp', 'TempC', 'ChlAugL', 'TurbFNU',
       'fDOMQSU', 'ODOsat', 'ODOmgL', 'pH', 'CH4uM', 'CH4Sat', 'CO2uM',
       'CO2Sat'],
      dtype='object')
datetime column name? time

The following function fits a random model, with a random dependent variable y and four random covariates, using both the statsmodels and scikit-learn packages, so that you can compare the output of the two tools.


In [ ]:
from regression import compare_OLS
compare_OLS(dat)

The two models produce the same results.

scikit-learn has no built-in equivalent of a standard regression summary table. However, it offers more extensive tools for prediction through its machine learning functionality. For that reason, we will likely want to use both packages, for different purposes.

The user_model function prompts the user to input a model formula for an OLS regression, then runs the model in statsmodels, and outputs the model results and a plot of y data vs. model fitted values.

At the prompt, you may either input your own model formula, or copy and paste the following formula as an example:

CO2uM ~ pH + TempC + ChlAugL


In [ ]:
%matplotlib inline
from regression import user_model
user_model(data=dat)
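
For readers who want to see the steps without the interactive prompt, here is a minimal sketch of fitting a formula-based OLS model in statsmodels and plotting observed y against fitted values. It is not the user_model implementation, which is not shown here. The helper name fit_and_plot and the synthetic demo data frame are hypothetical; only the column names are borrowed from the example formula above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt


def fit_and_plot(formula, data):
    """Fit an OLS model from a formula string and plot observed vs. fitted y.

    Hypothetical helper for illustration; not part of the regression module.
    """
    results = smf.ols(formula, data=data).fit()
    observed = results.model.endog  # y values actually used in the fit
    fig, ax = plt.subplots()
    ax.scatter(results.fittedvalues, observed, s=10)
    # 1:1 reference line: points on it are reproduced exactly by the model.
    lims = [observed.min(), observed.max()]
    ax.plot(lims, lims, color="gray", linestyle="--")
    ax.set_xlabel("fitted values")
    ax.set_ylabel(results.model.endog_names)
    return results, ax


# Synthetic stand-in data (made-up values, not the pool measurements).
rng = np.random.default_rng(1)
demo = pd.DataFrame(
    {
        "pH": rng.normal(8.0, 0.3, 150),
        "TempC": rng.normal(20.0, 2.0, 150),
        "ChlAugL": rng.normal(5.0, 1.0, 150),
    }
)
demo["CO2uM"] = 300 - 30 * demo["pH"] + 2 * demo["TempC"] + rng.normal(0, 1, 150)

results, ax = fit_and_plot("CO2uM ~ pH + TempC + ChlAugL", demo)
print(results.summary())
```

With the formula interface, the intercept is included automatically, so the formula only needs to list the response and the covariates.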

In [6]:
%matplotlib inline
import pandas as pd
from regression import plot_pairs
plot_pairs(data=dat[['XCO2Dpp', 'XCH4Dpp', 'TempC', 'ChlAugL', 'TurbFNU',
       'fDOMQSU', 'ODOmgL', 'pH', 'CH4uM', 'CO2uM']], minCorr=0.1, maxCorr=0.95)
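
The notebook does not describe how plot_pairs uses minCorr and maxCorr. One plausible reading, stated here purely as an assumption, is that only variable pairs whose absolute correlation lies between the two bounds are kept, which drops both unrelated pairs and near-duplicates. The sketch below shows that kind of filter with pandas; the function name pairs_in_corr_range and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd


def pairs_in_corr_range(data, min_corr, max_corr):
    """Return (col_a, col_b, r) for column pairs with min_corr <= |r| <= max_corr.

    Hypothetical helper; an assumption about what a correlation filter
    like the one in plot_pairs might do, not its actual implementation.
    """
    corr = data.corr()  # Pearson correlation matrix
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # upper triangle only, so each pair appears once
            r = corr.loc[a, b]
            if min_corr <= abs(r) <= max_corr:
                pairs.append((a, b, float(r)))
    return pairs


# Synthetic data (made-up values) with strong, moderate, and weak relationships.
rng = np.random.default_rng(2)
x = rng.normal(size=500)
demo = pd.DataFrame(
    {
        "x": x,
        "near_copy": x + rng.normal(scale=0.01, size=500),  # almost identical to x
        "related": x + rng.normal(scale=1.0, size=500),     # moderately correlated with x
        "noise": rng.normal(size=500),                      # unrelated to x
    }
)

selected = pairs_in_corr_range(demo, min_corr=0.1, max_corr=0.95)
for a, b, r in selected:
    print(f"{a} vs. {b}: r = {r:.2f}")
```

Under this reading, an upper bound such as 0.95 excludes pairs that are close to being the same measurement, while the lower bound removes pairs with little linear relationship.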



In [6]:
dat.columns


Out[6]:
Index(['FID', 'time', 'XCO2Dpp', 'XCH4Dpp', 'TempC', 'ChlAugL', 'TurbFNU',
       'fDOMQSU', 'ODOsat', 'ODOmgL', 'pH', 'CH4uM', 'CH4Sat', 'CO2uM',
       'CO2Sat'],
      dtype='object')

In [ ]:
dat.shape[1]