eRIVERS Example iPython Notebook

The purpose of this notebook is to demonstrate how to import and run the various portions of our eRivers project.

Necessary Packages to Install:

matplotlib
pandas
Bokeh
scikit-learn
numpy
statsmodels

Here are two ways to install packages, shown here for Bokeh (substitute the other package names as needed):

1. If you are already an Anaconda user, you can simply run the command:

conda install bokeh

This will install the most recent published Bokeh release from the Continuum Analytics Anaconda repository, along with all dependencies.

2. Alternatively, it is possible to install from PyPI using pip:

pip install bokeh

First, we import the Python modules needed to run our code in an iPython notebook.



In [3]:

    
%cd Analysis/
from regression import *
from data_cleaning_utils import *
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import Image
%matplotlib inline









    



/Users/Will/Documents/GITHUB/class_project/class_project/Analysis

For this walkthrough, we will use a .csv file that contains in-situ chemical data collected from the Mississippi River. See the image below for details on how the samples were collected.

Data Import and Cleaning

The first portion of our code imports the data, identifies the time variable, and provides tools that let the user clean the data.



In [4]:

    
%cd ../Data/mississippi
%ls









    



/Users/Will/Documents/GITHUB/class_project/class_project/Data/mississippi
UMR_Day42015-08-04.csv                callback.html
UMR_Day42015-08-04_Nulls_Removed.csv  shapefiles/

The import_data function attempts to recognize the date format automatically, but it also asks the user to name the datetime column. It returns a pandas DataFrame of the raw imported data, indexed by the datetime column.
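To illustrate the idea, here is a minimal sketch of reading a CSV and indexing it by a parsed datetime column using plain pandas. This is not the project's import_data implementation: the helper name import_with_datetime_index and the in-memory sample values are assumptions made for illustration only.

```python
import io

import pandas as pd


def import_with_datetime_index(source, datetime_col):
    """Read a CSV and index it by a parsed datetime column (hypothetical helper)."""
    df = pd.read_csv(source)
    # Let pandas infer the date format; values it cannot parse raise an error.
    df[datetime_col] = pd.to_datetime(df[datetime_col])
    # drop=False keeps the datetime column in the frame as well as in the index.
    return df.set_index(datetime_col, drop=False)


# Tiny in-memory stand-in for a sensor file (values are made up).
csv_text = (
    "ltime,Temp,pH\n"
    "2015-08-04 08:56:00,22.64,8.70\n"
    "2015-08-04 08:56:01,22.63,8.69\n"
)
demo = import_with_datetime_index(io.StringIO(csv_text), "ltime")
print(demo.index)
```

Indexing by time is what makes the later smoothing and resampling steps straightforward, since pandas can then operate on time-based windows and frequencies.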



In [42]:

    
raw_data_miss = import_data('UMR_Day42015-08-04_Nulls_Removed.csv')
%cd ../Amazon/
raw_data_amz = import_data('TROCAS2_clean.csv')









    



Index(['ltime', 'Latitude', 'Longitude', 'Temp', 'fDOMQSU', 'ODOsat', 'ODOmgL',
       'pH', 'CH4uM', 'CO2uM', 'Cond', 'indicator'],
      dtype='object')
What is your datetime column?ltime
/Users/Will/Documents/GITHUB/class_project/class_project/Data/Amazon
Index(['datetime', 'Latitude', 'Longitude', 'fDOMQSU', 'ODOsat', 'ODOmgL',
       'Temp', 'Cond', 'pH', 'CO2uM', 'CH4uM', 'indicator'],
      dtype='object')
What is your datetime column?datetime



In [43]:

    
raw_data_miss = nullRemover(raw_data_miss)
raw_data_amz = nullRemover(raw_data_amz)









    



(26334, 12)
(26334, 12)
(16417, 12)
(16417, 12)

Data Smoothing

We created a smoothing function that applies a running mean to our time series data. This tool lets the user choose the window for the running mean. It also plots a simple time series of the data before and after smoothing so the user can determine the optimum window size.
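As a rough sketch of what a running mean does, the example below smooths a synthetic pH record with pandas. It is not the project's test_smooth_data function: the helper name running_mean, the trailing (non-centered) window, the assumed 1 Hz sampling rate, and the synthetic values are all assumptions for illustration.

```python
import numpy as np
import pandas as pd


def running_mean(df, column, window):
    """Return a copy of df with `column` replaced by its trailing running mean."""
    smoothed = df.copy()
    # Each value becomes the mean of the current sample and the window-1 before it.
    # The first window-1 rows have no full window and come out as NaN.
    smoothed[column] = df[column].rolling(window=window).mean()
    return smoothed


# Synthetic 1 Hz pH record: a constant level plus random sensor noise (made-up values).
rng = np.random.default_rng(0)
index = pd.date_range("2015-08-04 08:56:00", periods=200, freq=pd.Timedelta(seconds=1))
noisy = pd.DataFrame({"pH": 8.7 + rng.normal(scale=0.02, size=200)}, index=index)

smoothed = running_mean(noisy, "pH", window=45)
print("rows without a full window:", int(smoothed["pH"].isna().sum()))
print("std before:", noisy["pH"].std(), "std after:", smoothed["pH"].std())
```

A window of 45 samples leaves 44 rows undefined, which is why nulls need to be removed again after smoothing. This is consistent with the row counts printed by nullRemover in this notebook before and after smoothing (26334 and 26290).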



In [44]:

    
cleaned_miss = test_smooth_data("pH", raw_data_miss)









    



What size windows do you want for the moving average? 45

A 45-second window removes a significant amount of noise without removing the rapidly occurring trends that we see in the data.



In [46]:

    
cleaned_miss = nullRemover(cleaned_miss)









    



(26290, 12)
(26290, 12)

Data Reducing

We have also written a function that lets the user resample the data. Depending on the original time step, the user can either up-sample (more coverage) or down-sample (sparser data).
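The two directions can be sketched with pandas resampling on a synthetic 1 Hz temperature record. This is an illustration under stated assumptions, not the project's reducer function: the data are made up, down-sampling is assumed to average each bin, and up-sampling is assumed to interpolate linearly. Note that newer pandas releases prefer the alias "min" for minutes, while "T" (as in "4T") is the older spelling.

```python
import numpy as np
import pandas as pd

# Synthetic 1 Hz temperature record covering ten minutes (made-up values).
index = pd.date_range("2015-08-04 08:56:00", periods=600, freq=pd.Timedelta(seconds=1))
readings = pd.DataFrame({"Temp": np.linspace(22.6, 22.0, 600)}, index=index)

# Down-sample: average all readings that fall in each 4-minute bin.
downsampled = readings.resample("4min").mean()

# Up-sample: move to a 0.5-second step and fill the new rows by linear interpolation.
upsampled = readings.resample("500ms").interpolate()

print("original rows:", len(readings))
print("after down-sampling:", len(downsampled))
print("after up-sampling:", len(upsampled))
```

Down-sampling trades temporal detail for a smaller, less noisy table, while up-sampling adds rows whose values are estimates rather than measurements.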



In [47]:

    
resample = reducer(cleaned_miss)









    



What frequency would you like to resample to? Format = XS(seconds), XT(minutes)4T
Mississippi
Value Error
Num Samples Before: 315480
Num Samples After: 1300

Time Series Visualization

Now that the data have been imported and cleaned, we can use a tool we created to visualize these time series interactively for simple data exploration.



In [17]:

    
%cd ../../Analysis/timeseries
import TimeSeries as ts
%cd ../../Data/mississippi/
data = pd.read_csv('UMR_Day42015-08-04_Nulls_Removed.csv')
data.head()









    



/Users/Will/Documents/GITHUB/class_project/class_project/Analysis/timeseries
/Users/Will/Documents/GITHUB/class_project/class_project/Data/mississippi






    Out[17]:






  
    
      
         ltime   Latitude   Longitude     Temp  fDOMQSU  ODOsat   ODOmgL    pH     CH4uM      CO2uM    Cond    indicator
0  8/4/15 8:56  43.664773  -91.238209  22.6392   80.448   97.81  8.35761  8.70  0.239870  11.897216  408.02  Mississippi
1  8/4/15 8:56  43.664777  -91.238210  22.6382   80.440   97.68  8.35128  8.69  0.250415  11.900610  408.22  Mississippi
2  8/4/15 8:56  43.664780  -91.238211  22.6354   80.480   97.65  8.34495  8.69  0.228742  11.775067  407.76  Mississippi
3  8/4/15 8:56  43.664783  -91.238211  22.6334   80.488   97.52  8.36229  8.69  0.219063  11.757889  407.38  Mississippi
4  8/4/15 8:56  43.664787  -91.238209  22.6324   80.508   97.49  8.36596  8.69  0.244385  11.918703  407.84  Mississippi



In [49]:

    
ts.timeplot(data)









    





    
        

Statistics

We also developed several tools to analyze correlations between the parameters and to help visualize underlying trends in the data.



In [56]:

    
regression_miss_data = cleaned_miss.drop(['Latitude', 'Longitude', 'indicator'], axis=1)

regression_miss_data.head()









    Out[56]:






  
    
      
                                   ltime       Temp    fDOMQSU     ODOsat    ODOmgL        pH     CH4uM      CO2uM        Cond
ltime
2015-08-04 08:57:00  2015-08-04 08:57:00  22.617387  80.712533  96.760667  8.321806  8.686556  0.287032  12.030600  408.943111
2015-08-04 08:57:00  2015-08-04 08:57:00  22.616973  80.721111  96.696889  8.317733  8.686111  0.288896  12.036452  408.986667
2015-08-04 08:57:00  2015-08-04 08:57:00  22.616644  80.730178  96.635333  8.313578  8.685889  0.290252  12.041881  409.019778
2015-08-04 08:57:00  2015-08-04 08:57:00  22.616360  80.738267  96.576000  8.309342  8.685333  0.292405  12.054681  409.056889
2015-08-04 08:57:00  2015-08-04 08:57:00  22.616142  80.745867  96.515778  8.304802  8.684444  0.295090  12.065228  409.108667

The following function fits a model with a randomly selected dependent variable y and four randomly selected covariates, using both the statsmodels and scikit-learn packages. The user can compare the output from the two tools.



In [57]:

    
from regression import compare_OLS
compare_OLS(regression_miss_data)









    



statsmodels OLS regression on 
 ODOsat ~ CH4uM + ODOmgL + fDOMQSU + Temp

                             OLS Regression Results                            
==============================================================================
Dep. Variable:                 ODOsat   R-squared:                       0.999
Model:                            OLS   Adj. R-squared:                  0.999
Method:                 Least Squares   F-statistic:                 8.194e+06
Date:                Mon, 21 Mar 2016   Prob (F-statistic):               0.00
Time:                        13:25:40   Log-Likelihood:                -31055.
No. Observations:               26290   AIC:                         6.212e+04
Df Residuals:                   26285   BIC:                         6.216e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept    -62.5909      0.179   -349.881      0.000       -62.942   -62.240
CH4uM         -0.3864      0.026    -14.909      0.000        -0.437    -0.336
ODOmgL        11.8347      0.003   3966.695      0.000        11.829    11.841
fDOMQSU       -0.0179      0.000    -60.494      0.000        -0.018    -0.017
Temp           2.7021      0.009    295.947      0.000         2.684     2.720
==============================================================================
Omnibus:                     4906.342   Durbin-Watson:                   0.006
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           112284.685
Skew:                           0.269   Prob(JB):                         0.00
Kurtosis:                      13.110   Cond. No.                     2.89e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.89e+03. This might indicate that there are
strong multicollinearity or other numerical problems. 


 scikit-learn OLS regression on 
 ODOsat ~ CH4uM + ODOmgL + fDOMQSU + Temp 

-62.5909204589
[ -0.38644641  11.83473936  -0.01785002   2.70212092]
0.999198680708

The two packages produce the same results: the intercept, the coefficients, and R-squared agree to the displayed precision.
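This agreement is expected, because both packages solve the same ordinary least-squares problem. The sketch below shows that shared computation using only numpy on synthetic data. It is not the project's compare_OLS function, and the covariates, coefficients, and noise level are made up for illustration.

```python
import numpy as np

# Synthetic data: four covariates and a response built from known coefficients.
rng = np.random.default_rng(42)
n = 500
X = rng.normal(size=(n, 4))
true_intercept = -62.6
true_coef = np.array([-0.4, 11.8, -0.02, 2.7])
y = true_intercept + X @ true_coef + rng.normal(scale=0.5, size=n)

# Prepend a column of ones so the first fitted parameter is the intercept.
design = np.column_stack([np.ones(n), X])

# Least-squares solution: minimizes the sum of squared residuals.
beta, *_ = np.linalg.lstsq(design, y, rcond=None)

# R-squared: fraction of the variance in y explained by the fitted values.
fitted = design @ beta
ss_res = np.sum((y - fitted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print("intercept:", beta[0])
print("coefficients:", beta[1:])
print("R-squared:", r_squared)
```

Any correct OLS implementation fitted to the same design matrix should recover these same numbers up to floating-point error, which is what the comparison above checks.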

The user_model function prompts the user to input a model formula for an OLS regression, runs the model in statsmodels, and outputs the model results along with a plot of the y data versus the model's fitted values.

At the prompt, you may either input your own model formula, or copy and paste the following formula as an example:

CO2uM ~ pH + Temp + CH4uM



In [58]:

    
from regression import user_model
user_model(data=regression_miss_data)









    



The data set contains the following covariates: 

['ltime', 'Temp', 'fDOMQSU', 'ODOsat', 'ODOmgL', 'pH', 'CH4uM', 'CO2uM', 'Cond'] 

Enter your regression model formula, using syntax as shown: 
 
 dependent_variable ~ covariate1 + covariate 2 + ... 
 
CO2uM ~ pH + Temp + CH4uM

                             OLS Regression Results                            
==============================================================================
Dep. Variable:                  CO2uM   R-squared:                       0.882
Model:                            OLS   Adj. R-squared:                  0.882
Method:                 Least Squares   F-statistic:                 6.573e+04
Date:                Mon, 21 Mar 2016   Prob (F-statistic):               0.00
Time:                        13:25:53   Log-Likelihood:                -96525.
No. Observations:               26290   AIC:                         1.931e+05
Df Residuals:                   26286   BIC:                         1.931e+05
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept    450.8517      2.306    195.530      0.000       446.332   455.371
pH           -37.4897      0.330   -113.575      0.000       -38.137   -36.843
Temp          -5.3320      0.090    -59.134      0.000        -5.509    -5.155
CH4uM         62.5849      0.375    167.057      0.000        61.851    63.319
==============================================================================
Omnibus:                    12318.309   Durbin-Watson:                   0.000
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           200831.791
Skew:                           1.844   Prob(JB):                         0.00
Kurtosis:                      16.028   Cond. No.                         999.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We have also developed a visualization tool that automatically plots all data columns versus each other so that a user can view trends in the data. The function takes a minimum and a maximum R^2 value as arguments, which lets the user set cutoffs that exclude uncorrelated columns as well as nearly identical ones.
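The filtering step can be sketched as follows, without the plotting. This is not the project's plot_pairs function: the helper name correlated_pairs and the synthetic columns are hypothetical, and the sketch assumes the cutoffs apply to the squared Pearson correlation of each column pair.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def correlated_pairs(df, min_r2, max_r2):
    """List (col_a, col_b, r2) for numeric column pairs with min_r2 <= r2 <= max_r2."""
    numeric = df.select_dtypes(include="number")
    # Squared Pearson correlation equals the R^2 of a simple linear fit of one column on the other.
    r2 = numeric.corr() ** 2
    pairs = []
    for a, b in combinations(numeric.columns, 2):
        value = float(r2.loc[a, b])
        if min_r2 <= value <= max_r2:
            pairs.append((a, b, value))
    return pairs


# Synthetic columns (made-up data): one near-duplicate, one related, one unrelated.
rng = np.random.default_rng(1)
x = rng.normal(size=1000)
frame = pd.DataFrame({
    "x": x,
    "nearly_dup": 2.0 * x + rng.normal(scale=0.01, size=1000),
    "related": x + rng.normal(size=1000),
    "unrelated": rng.normal(size=1000),
})

selected = correlated_pairs(frame, min_r2=0.2, max_r2=0.98)
for a, b, value in selected:
    print(a, "vs", b, "R^2 =", round(value, 3))
```

The lower cutoff drops pairs with no visible relationship, and the upper cutoff drops pairs that are effectively the same measurement, so only the informative pairs remain to be plotted.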



In [59]:

    
from regression import plot_pairs
plot_pairs(data=regression_miss_data, minCorr=0.2, maxCorr=.98)
