This notebook serves as a showcase for the functions written in the wwdata package, more specifically its OnlineSensorBased subclass. For additional information on the functions, the user is encouraged to consult the provided docstrings. These can be accessed by typing a function name and hitting Shift+Tab between the function brackets.

All information and documentation on the wwdata package, including how to install it, can also be found online at https://ugentbiomath.github.io/wwdata-docs/.

An elaborate explanation of the functionalities of the package is published in Environmental Modelling & Software and is available on ResearchGate.

Loading the necessary packages


In [ ]:
import sys
import os
from os import listdir
import pandas as pd
import scipy as sp
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
%matplotlib inline
# seaborn is not a required package, it just prettifies the figures
import seaborn as sns

And now for the actual package...


In [ ]:
import wwdata as ww

Check what version you have installed


In [ ]:
ww.__version__
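
Besides Shift+Tab, docstrings can also be consulted with Python's built-in help function; for example (any of the package's functions used further below can be substituted):


In [ ]:
# print the docstring of one of the filtering functions used later in this notebook
help(ww.OnlineSensorBased.tag_nan)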

Read the data (pd.read_csv is used here; pd.read_excel works analogously for Excel files)


In [ ]:
measurements = pd.read_csv('./data/201301.txt',sep='\t',skiprows=0)
measurements.columns

Create class object and format data


In [ ]:
dataset = ww.OnlineSensorBased(data=measurements,
                               timedata_column='Time',
                               data_type='WWTP')
dataset.set_tag('January 2013')
# replace the sensor's 'Bad' flags with 'NaN' values
dataset.replace('Bad','NaN',inplace=True)

Convert the values in the column containing time data to the pandas datetime format.


In [ ]:
dataset.to_datetime(time_column=dataset.timename,time_format='%d-%m-%y %H:%M')

Use the time column as the index


In [ ]:
dataset.set_index('Time',key_is_time=True,drop=True,inplace=True)

Convert the absolute timestamps to relative values. This can be important when the data is to be used for modeling purposes later on and needs to be written to text files.


In [ ]:
#dataset.absolute_to_relative(time_data='index',unit='d')
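
For reference, a minimal plain-pandas sketch of what such a conversion amounts to (elapsed days since the first timestamp; this is not the package implementation):


In [ ]:
# elapsed time in days, relative to the first timestamp of the (datetime) index
relative_days = (dataset.data.index - dataset.data.index[0]) / pd.Timedelta('1d')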

Drop any duplicates that might be present in the index


In [ ]:
dataset.drop_index_duplicates()
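
In plain pandas, this corresponds roughly to keeping only the first occurrence of each index value:


In [ ]:
# plain-pandas equivalent (illustrative): keep the first of each set of duplicated index entries
deduplicated = dataset.data[~dataset.data.index.duplicated(keep='first')]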

Convert all (or only selected) columns to float type.


In [ ]:
dataset.to_float(columns='all')

In [ ]:
fig, ax = plt.subplots(figsize=(18,4))
ax.plot(dataset.data['CODtot_line2'],'.g')
ax.set_ylabel('Total COD [mg/L]',fontsize=18);ax.set_xlabel('')
ax.tick_params(labelsize=14)

Filter data

Selecting data happens through tagging, so no original data is lost. When filter algorithms such as tag_doubles or moving_slope_filter are applied, a new pandas dataframe (dataset.meta_valid, see also the figure below) is created that contains these tags. This dataframe is also the basis for plotting accepted and rejected data points in different colours.

The written output of the filter functions tells the user how many data points were tagged by that specific function. When the plotting argument is set to True, the plot shows the aggregated result of all filter functions applied up to that point.

Maxima

Tag the data points that are higher than a certain percentile.


In [ ]:
dataset.get_highs('Flow_total',0.95,arange=['2013/1/1','2013/1/31'],method='percentile',plot=True)

NaN values

Tag all NaN (Not a Number) values as 'filtered'.


In [ ]:
dataset.tag_nan('CODtot_line2')

Sensor failure

Tag all data points that are part of a signal that stays constant (within a given bound).


In [ ]:
dataset.tag_doubles('CODtot_line2',bound=0.05,plot=False)

Noise

Tag all data points for which the slope as compared with the previous point is too high to be realistic (i.e. the data point is noisy).


In [ ]:
dataset.moving_slope_filter('index','CODtot_line2',72000,arange=['2013/1/1','2013/1/31'],
                            time_unit='d',inplace=False,plot=False)

Tag all data points that are more than a specified percentage away from the calculated moving average. This function makes use of the simple_moving_average function, also written as part of this package.


In [ ]:
dataset.moving_average_filter(data_name='CODtot_line2',window=12,cutoff_frac=0.20,
                              arange=['2013/1/1','2013/1/31'],plot=False)
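
Conceptually, this filter amounts to comparing each point with a centred moving average and tagging the points that deviate too much; a minimal plain-pandas sketch (illustrative, not the package implementation):


In [ ]:
# centred 12-point moving average and a 20% deviation criterion (illustrative)
series = dataset.data['CODtot_line2']
sma = series.rolling(window=12, center=True).mean()
too_far = (series - sma).abs() > 0.20 * sma.abs()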

In [ ]:
fig, ax = dataset.plot_analysed('CODtot_line2')
ax.legend(bbox_to_anchor=(1.15,1.0),fontsize=18)
ax.set_ylabel('Total COD [mg/L]',fontsize=18);ax.set_xlabel('')
ax.tick_params(labelsize=14)

In [ ]:
dataset.columns
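
The tags written by the filter functions can also be inspected directly in dataset.meta_valid; counting them gives a quick overview (a minimal sketch, assuming meta_valid holds one tag column per measured variable; the exact tag labels depend on the package version):


In [ ]:
# count how many points of this series carry each tag in the meta_valid dataframe
dataset.meta_valid['CODtot_line2'].value_counts()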

Instead of package-specific filtering, data points can also be filtered and replaced by other filtering algorithms, such as the Savitzky-Golay filter illustrated below. The disadvantage of this is that no tags are added to the meta_valid dataframe and that the original data are replaced (when the inplace option is set to True).


In [ ]:
dataset.savgol('TSS_line3',plot=True)
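
For comparison, the same kind of smoothing can be applied directly with scipy's savgol_filter (a minimal sketch; the window length and polynomial order are illustrative choices, and NaNs must be handled first since scipy does not accept them):


In [ ]:
from scipy.signal import savgol_filter

# interpolate/drop NaNs first, then smooth with an (odd) 51-point window and a 2nd-order polynomial
tss = dataset.data['TSS_line3'].interpolate().dropna()
tss_smooth = pd.Series(savgol_filter(tss.values, window_length=51, polyorder=2),
                       index=tss.index)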

Check the reliability of the filling algorithms

In order to choose and apply the best method to fill gaps in the data, the wwdata package provides the option to check the reliability of each filling algorithm. This is represented in the figure below.

In words, the workflow of the check_filling_error function is as follows:

  • Randomly (!) create large or small artificial gaps in the data within the given test_data_range.
  • Fill the created gaps with a chosen filling function (see further in this notebook for illustrations of those).
  • Compare the original data points with the filled data points and calculate the deviation between them.
  • Iterate for a given number of times, to average out the random creation of the gaps.

Before applying this, it is wise to check the total number of points within test_data_range and choose the number of gaps to create accordingly. Take into account that the length of each gap is sampled from a uniform distribution between 0 and the maximum gap length given as an argument.
For example: creating two large gaps with a maximum length of 50 data points in a dataset containing 100 data points means that, on average, 50% of the data is removed (2*(50/2) = 50 data points are left out of the 100; the two gaps can however still overlap).


In [ ]:
len(dataset.data['2013/1/1':'2013/1/17'])
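
With the gap settings used below (70 small gaps of at most 12 points, 3 large gaps of at most 800 points), a quick back-of-the-envelope estimate of the removed fraction can be made; gap lengths are sampled uniformly between 0 and their maximum, and overlap is ignored here:


In [ ]:
# each gap removes, on average, half its maximum length
expected_removed = 70 * 12 / 2 + 3 * 800 / 2
expected_removed / len(dataset.data['2013/1/1':'2013/1/17'])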

In [ ]:
# internal flag indicating whether a warning about the filling procedure has already been issued
dataset._filling_warning_issued

In [ ]:
dataset.check_filling_error(100,'CODtot_line2','fill_missing_standard',[dt.datetime(2013,1,1,0,5),dt.datetime(2013,1,17)],
                            nr_small_gaps=70,max_size_small_gaps=12,
                            nr_large_gaps=3,max_size_large_gaps=800,
                            to_fill='CODtot_line2',arange=[dt.datetime(2013,1,1,0,5),dt.datetime(2013,1,17)],
                            only_checked=True)

In [ ]:
dataset.check_filling_error(100,'CODtot_line2','fill_missing_daybefore',[dt.datetime(2013,1,1,0,5),dt.datetime(2013,1,17)],
                            nr_small_gaps=70,max_size_small_gaps=12,
                            nr_large_gaps=3,max_size_large_gaps=800,
                            to_fill='CODtot_line2',arange=[dt.datetime(2013,1,1,0,5),dt.datetime(2013,1,17)],
                            range_to_replace=[0,10],only_checked=True)

Fill data

Filling data can be done using a range of functions implemented in the package. Again, a new pandas dataframe is created (dataset.meta_filled, see also the figure below), starting from the dataset.meta_valid dataframe and updated with tags indicating which filling method was used to obtain each data point.

Using the only_checked argument, implemented in most filling functions, the user can choose whether only data points tagged as filtered will be filled, or all data points within a certain range.

When using the plotting argument to plot the analysed data, the user will see a plot based on the latest function that was used: if this was a filter function, the data will be plotted based on the dataset.meta_valid dataframe; if it was a filling function, the tags in dataset.meta_filled will be used.

Interpolation

Fill missing data points by interpolation, if the number of consecutive missing points is lower than a specified maximum.


In [ ]:
dataset.fill_missing_interpolation('CODtot_line2',12,[dt.datetime(2013,1,1),dt.datetime(2013,1,8)],
                                   plot=True)
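
A rough plain-pandas counterpart is shown below; note that pandas' limit argument partially fills gaps longer than the limit, whereas the wwdata function leaves such gaps untouched:


In [ ]:
# time-based linear interpolation, filling at most 12 consecutive NaNs (illustrative)
dataset.data['CODtot_line2'].interpolate(method='time', limit=12)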

Average daily profile

Fill missing data points by using an average daily profile. The fill_missing_standard function requires running the calc_daily_profile function, also developed for this package, first. This creates a dataframe (dataset.daily_profile) containing the average daily profile calculated within a defined time period (e.g. selecting only non-peak days).


In [ ]:
dataset.calc_daily_profile('CODtot_line2',[dt.datetime(2013,1,1),dt.datetime(2013,1,8)],
                           quantile=0.9,clear=True)
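
The resulting dataset.daily_profile dataframe can be inspected directly (its exact layout depends on the package version):


In [ ]:
# first rows of the calculated average daily profile
dataset.daily_profile.head()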

In [ ]:
dataset.fill_missing_standard('CODtot_line2',[dt.datetime(2013,1,14),dt.datetime(2013,1,17)],
                              only_checked=True,clear=False,plot=True)

Model output

Fill gaps using a model output. This assumes that the user has good reason to trust that the model predictions are sound and can indeed be used to replace missing data where needed.


In [ ]:
model_output_ontv_1 = pd.read_csv('./data/model_output.txt', sep='\t')
# the first row contains the units; keep it separately and drop it from the data
units_model = model_output_ontv_1.iloc[0]
model_output_ontv_1 = model_output_ontv_1.drop(0).reset_index(drop=True)
# convert to floats and use the model time column as index
model_output_ontv_1 = model_output_ontv_1.astype(float)
model_output_ontv_1.set_index('#.t',drop=True,inplace=True)
model_output_ontv_1.columns

In [ ]:
dataset.fill_missing_model('CODtot_line2',model_output_ontv_1['.sewer_1.COD'],
                           [dt.datetime(2013,1,18),dt.datetime(2013,1,22)],
                           only_checked=True,plot=True)

Ratio or correlation

Constant ratios or correlations between data series can be used to fill missing points. The user can calculate and compare ratios and correlations (currently only linear) between selected measurements, and fill data using these.

NB: in the examples below, data filling based on ratios or correlations is obviously not a very good choice. Both methods are included here for completeness of the method showcase.


In [ ]:
dataset.calc_ratio('CODtot_line2','CODsol_line2',
                   [dt.datetime(2013,1,1,0,5,0),dt.datetime(2013,1,31)])

To find the 'best' ratio (i.e. the one with the lowest relative standard deviation, $\sigma/\mu$), the ratios obtained in different periods can be compared and the best one used for possible further replacements.


In [ ]:
avg,std = dataset.compare_ratio('CODtot_line2','CODsol_line2',2)
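
The same quantity can be checked quickly in plain pandas (a sketch; zero values in the denominator are masked to avoid infinities distorting the statistics):


In [ ]:
# relative standard deviation (sigma/mu) of the point-wise ratio over one week
ratio = (dataset.data['CODtot_line2'] / dataset.data['CODsol_line2'])['2013/1/1':'2013/1/8']
ratio = ratio.replace([np.inf, -np.inf], np.nan)
ratio.std() / ratio.mean()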

Use the average obtained from the compare_ratio function to fill in missing values. (In this case, as mentioned before, this clearly does not work, since zero values are replaced with zero values; it only showcases the function and its arguments.)


In [ ]:
dataset.fill_missing_ratio('CODtot_line2',
                           'CODsol_line2',avg,
                           [dt.datetime(2013,1,22),dt.datetime(2013,1,23)],
                           only_checked=True,plot=True)

Instead of a ratio, a correlation can be sought. In case of a zero intercept, this of course gives results in the same range if the same data are used. To give a good impression of how useful the calculated correlation is, a prediction interval is plotted as well when plot is set to True.


In [ ]:
dataset.get_correlation('CODtot_line2',
                        'CODsol_line2',
                        [dt.datetime(2013,1,1,0,5,0),dt.datetime(2013,1,31)],
                        zero_intercept=True,plot=True)
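
For reference, a zero-intercept linear fit can be reproduced with a plain numpy least-squares solve (a minimal sketch, not the package implementation):


In [ ]:
# least-squares slope through the origin: y = a * x
sub = dataset.data[['CODsol_line2', 'CODtot_line2']].dropna()
slope, residuals, rank, sv = np.linalg.lstsq(sub[['CODsol_line2']].values,
                                             sub['CODtot_line2'].values, rcond=None)
slope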

After the previously made assessment, use the correlation function to fill gaps in the dataset.


In [ ]:
dataset.fill_missing_correlation('CODtot_line2',
                                 'CODsol_line2',
                                 [dt.datetime(2013,1,23),dt.datetime(2013,1,25)],
                                 [dt.datetime(2013,1,1,0,5,0),dt.datetime(2013,1,31)],
                                 only_checked=True,clear=False,plot=True)

Data from previous day

Under the assumption that "the best prediction for tomorrow's weather is today's weather", one can also replace missing data by making use of (one of) the previous days.


In [ ]:
dataset.fill_missing_daybefore('CODtot_line2',
                               [dt.datetime(2013,1,25),dt.datetime(2013,1,27)],
                               range_to_replace=[0,10],plot=True,
                               only_checked=False)
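
In plain pandas, the underlying idea can be sketched as taking yesterday's value at the same time of day (this only works on a regular time grid and ignores the range_to_replace logic):


In [ ]:
# fill gaps with the value measured exactly one day earlier (illustrative)
cod = dataset.data['CODtot_line2']
cod_filled = cod.fillna(cod.shift(freq='1d'))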

In [ ]:
fig, ax = dataset.plot_analysed('CODtot_line2')
ax.legend(bbox_to_anchor=(1.3,1.0),fontsize=18)
ax.set_ylabel('Total COD [mg/L]',fontsize=18);ax.set_xlabel('')
ax.tick_params(labelsize=14)
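
To summarise which filling method was applied where, the tags in dataset.meta_filled can be counted (a sketch, assuming one tag column per measured variable; the exact tag labels depend on the package version):


In [ ]:
# overview of how many data points each filling method provided
dataset.meta_filled['CODtot_line2'].value_counts()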

Calculations

Calculate the daily average of a certain data series


In [ ]:
dataset.calc_daily_average('CODtot_line2',arange=[dt.datetime(2013,1,1),dt.datetime(2013,2,1)],plot=True)

Calculate the flow-proportional concentration when different flows come together.


In [ ]:
dataset.calc_total_proportional('Flow_total',
                                ['Flow_line1','Flow_line2','Flow_line3'],
                                ['TSS_line1','TSS_line2','TSS_line3'],
                               'TSS_prop')
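
A plain-pandas sketch of the underlying calculation, assuming the standard flow-weighted mixing formula $C_{prop} = \sum_i Q_i C_i / Q_{tot}$ (illustrative, not the package implementation):


In [ ]:
# flow-weighted mixed concentration from the three lines
d = dataset.data
tss_prop = (d['Flow_line1'] * d['TSS_line1'] +
            d['Flow_line2'] * d['TSS_line2'] +
            d['Flow_line3'] * d['TSS_line3']) / d['Flow_total']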
