If you have not already done so, run the following command to install the statsmodels package:
pip install -U statsmodels
Run the following command to install scipy and scikit-learn:
conda install scipy
conda install scikit-learn
Use the data cleaning package to import a data set:
In [1]:
from data_cleaning_utils import import_data
dat = import_data('../Data/Test/pool82014-10-02cleaned_Subset.csv')
The following function fits an OLS model with a random dependent variable y and four random covariates, using both the statsmodels and scikit-learn packages, so that the user can compare the output of the two tools.
In [ ]:
from regression import compare_OLS
compare_OLS(dat)
The two models produce the same results.
sklearn does not produce a standard regression-table summary. However, it offers more features for prediction through its machine learning functionality. For that reason, we will likely want to use both packages, each for a different purpose: statsmodels for inference and reporting, sklearn for prediction.
The user_model function prompts the user to input a model formula for an OLS regression, then runs the model in statsmodels and outputs the model results and a plot of the y data vs. the model fitted values.
At the prompt, you may either input your own model formula, or copy and paste the following formula as an example:
CO2uM ~ pH + TempC + ChlAugL
In [ ]:
%matplotlib inline
from regression import user_model
user_model(data=dat)
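For readers who want to see what a formula-based fit looks like without the interactive prompt, here is a minimal sketch using the statsmodels formula interface. It is not the implementation of user_model, which is not shown here. The DataFrame `demo` is a synthetic stand-in for `dat` that reuses the column names from the example formula; its values and the generating coefficients are assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for `dat` (assumption): same column names as the example formula.
rng = np.random.default_rng(1)
n = 150
demo = pd.DataFrame({
    "pH": rng.normal(8.0, 0.3, n),
    "TempC": rng.normal(20.0, 2.0, n),
    "ChlAugL": rng.normal(5.0, 1.0, n),
})
# Made-up relationship used only to give the model something to recover.
demo["CO2uM"] = 100 - 8 * demo["pH"] + 0.5 * demo["TempC"] + rng.normal(0, 1, n)

# Fit the OLS model described by the formula string.
formula = "CO2uM ~ pH + TempC + ChlAugL"
fit = smf.ols(formula, data=demo).fit()
print(fit.summary())

# Observed y values next to the model's fitted values.
comparison = pd.DataFrame({"observed": demo["CO2uM"], "fitted": fit.fittedvalues})
print(comparison.head())
```

The `comparison` frame holds the two series that a y vs. fitted-values plot would display, for example with a matplotlib scatter plot of `observed` against `fitted`.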
In [6]:
%matplotlib inline
import pandas as pd
from regression import plot_pairs
plot_pairs(data=dat[['XCO2Dpp', 'XCH4Dpp', 'TempC', 'ChlAugL', 'TurbFNU',
'fDOMQSU', 'ODOmgL', 'pH', 'CH4uM', 'CO2uM']], minCorr=0.1, maxCorr=0.95)
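The minCorr and maxCorr arguments suggest that plot_pairs restricts its output to variable pairs whose correlation falls within a given range, although that behavior is an assumption here because the function's source is not shown. The sketch below illustrates one way such a filter could work. The helper `pairs_in_corr_range` and the synthetic frame `demo` are hypothetical names introduced only for this example.

```python
import numpy as np
import pandas as pd

def pairs_in_corr_range(df, min_corr, max_corr):
    """Return (col_a, col_b, r) for column pairs with min_corr <= |r| <= max_corr.

    Hypothetical helper: an assumed reading of what minCorr/maxCorr mean.
    """
    corr = df.corr()  # Pearson correlation matrix
    cols = list(corr.columns)
    selected = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # visit each unordered pair once
            r = corr.loc[a, b]
            if min_corr <= abs(r) <= max_corr:
                selected.append((a, b, float(r)))
    return selected

# Synthetic data (assumption) with known relationships between columns.
rng = np.random.default_rng(2)
x = rng.normal(size=500)
demo = pd.DataFrame({
    "a": x,
    "b": x + rng.normal(scale=1.0, size=500),  # moderately correlated with a
    "c": 2 * x,                                # exact multiple of a
    "d": rng.normal(size=500),                 # generated independently of a
})

pairs = pairs_in_corr_range(demo, 0.1, 0.95)
for a, b, r in pairs:
    print(f"{a} vs {b}: r = {r:.2f}")
```

Under this reading, the upper bound drops near-duplicate variables such as `a` and `c`, and the lower bound drops pairs with little linear relationship, which keeps a pairs plot to the combinations worth inspecting.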
In [6]:
dat.columns
Out[6]:
In [ ]:
dat.shape[1]