This notebook serves as a showcase for the functions written in the wwdata package, more specifically the OnlineSensorBased subclass. For additional information on the functions, the user is encouraged to use the provided docstrings. They can be accessed by entering a function name and hitting shift+tab between the function brackets.
All information and documentation on the wwdata package, including how to install it, can also be found online at https://ugentbiomath.github.io/wwdata-docs/.
A detailed explanation of the functionalities of the package is published in Environmental Modelling & Software and is available on ResearchGate.
In [ ]:
import sys
import os
from os import listdir
import pandas as pd
import scipy as sp
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
%matplotlib inline
# seaborn is not a required package, it just prettifies the figures
import seaborn as sns
And now for the actual package...
In [ ]:
import wwdata as ww
Check what version you have installed
In [ ]:
ww.__version__
Read in the raw data with pandas (for Excel files, pd.read_excel works analogously).
In [ ]:
measurements = pd.read_csv('./data/201301.txt',sep='\t',skiprows=0)
measurements.columns
In [ ]:
dataset = ww.OnlineSensorBased(data=measurements,
timedata_column='Time',
data_type='WWTP')
dataset.set_tag('January 2013')
dataset.replace('Bad','NaN',inplace=True)
Convert the values in the column containing time data to the pandas datetime format.
In [ ]:
dataset.to_datetime(time_column=dataset.timename,time_format='%d-%m-%y %H:%M')
Use the time column as index.
In [ ]:
dataset.set_index('Time',key_is_time=True,drop=True,inplace=True)
Convert the absolute timestamps to relative values. This can be important when data is to be used for modeling purposes later on, and needs to be written to text files.
In [ ]:
#dataset.absolute_to_relative(time_data='index',unit='d')
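As a plain-pandas sketch of what such a conversion amounts to (the names below are illustrative, not the wwdata API):

```python
import pandas as pd

# Synthetic datetime index, five points at 6-hour intervals
idx = pd.date_range('2013-01-01', periods=5, freq='6h')

# Relative time expressed in days since the first timestamp
rel_days = (idx - idx[0]) / pd.Timedelta('1D')
print(rel_days.tolist())  # [0.0, 0.25, 0.5, 0.75, 1.0]
```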
Drop any duplicates that might be present in the index.
In [ ]:
dataset.drop_index_duplicates()
Convert all or the selected columns to float type.
In [ ]:
dataset.to_float(columns='all')
In [ ]:
fig, ax = plt.subplots(figsize=(18,4))
ax.plot(dataset.data['CODtot_line2'],'.g')
ax.set_ylabel('Total COD [mg/L]',fontsize=18);ax.set_xlabel('')
ax.tick_params(labelsize=14)
Selecting data happens through tagging, so no original data is lost. When filter algorithms such as tag_doubles or moving_slope_filter are applied, a new pandas DataFrame is created (dataset.meta_valid, see also the figure below) that contains these tags. The plotting of selected and rejected data points in different colours is also based on this DataFrame.
The written output of the filter functions tells the user how many data points were tagged by that specific function. When the plotting argument is set to True, the plot shows the aggregated results of all filter functions used up to that point.
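As a toy illustration of this tagging idea (plain pandas, not the wwdata API itself): the measurements stay in one DataFrame, a parallel DataFrame of the same shape holds a tag per point, and selection is just a mask on those tags.

```python
import pandas as pd
import numpy as np

# Miniature version of the concept: data is never deleted,
# a parallel frame holds one tag per data point
data = pd.DataFrame({'CODtot': [410.0, 425.0, np.nan, 3000.0, 418.0]})
meta_valid = pd.DataFrame({'CODtot': ['original', 'original', 'filtered',
                                      'filtered', 'original']},
                          index=data.index)

# Selecting data means masking on the tags, not dropping rows
valid_points = data['CODtot'][meta_valid['CODtot'] == 'original']
print(valid_points.tolist())  # the three values tagged 'original'
```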
In [ ]:
dataset.get_highs('Flow_total',0.95,arange=['2013/1/1','2013/1/31'],method='percentile',plot=True)
In [ ]:
dataset.tag_nan('CODtot_line2')
In [ ]:
dataset.tag_doubles('CODtot_line2',bound=0.05,plot=False)
In [ ]:
dataset.moving_slope_filter('index','CODtot_line2',72000,arange=['2013/1/1','2013/1/31'],
time_unit='d',inplace=False,plot=False)
Tag all data points that are more than a specified percentage away from the calculated moving average. This function makes use of the simple_moving_average function, also written as part of this package.
In [ ]:
dataset.moving_average_filter(data_name='CODtot_line2',window=12,cutoff_frac=0.20,
arange=['2013/1/1','2013/1/31'],plot=False)
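The idea behind such a moving-average filter can be sketched in plain pandas (window and cutoff values below are illustrative, not the package internals):

```python
import pandas as pd

# Toy series with one clear outlier
s = pd.Series([10.0, 10.5, 9.8, 30.0, 10.2, 9.9, 10.1])
window, cutoff_frac = 3, 0.20

# Centred simple moving average
sma = s.rolling(window, center=True, min_periods=1).mean()

# Tag points deviating more than cutoff_frac from the moving average;
# note that neighbours of a large outlier can get tagged as well,
# because the outlier inflates their local average
tagged = (s - sma).abs() / sma > cutoff_frac
print(s[tagged].tolist())
```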
In [ ]:
fig, ax = dataset.plot_analysed('CODtot_line2')
ax.legend(bbox_to_anchor=(1.15,1.0),fontsize=18)
ax.set_ylabel('Total COD [mg/L]',fontsize=18);ax.set_xlabel('')
ax.tick_params(labelsize=14)
In [ ]:
dataset.columns
Instead of package-specific filtering, data points can also be filtered and replaced by other filtering algorithms, such as the Savitzky-Golay filter illustrated below. The disadvantage is that no tags are added to the meta_valid DataFrame and that the original data are replaced (when the inplace option is set to True).
In [ ]:
dataset.savgol('TSS_line3',plot=True)
In order to be able to make a choice and apply the best method to fill gaps in the data, the wwdata package provides the option to check for the reliability of each filling algorithm. This is represented in the below figure.
In words, the workflow of check_filling_error is as follows: artificial gaps are created in the data within test_data_range, these gaps are filled with the chosen filling function, and the filled values are compared with the original ones to quantify the filling error. Before applying this, it is wise to check the total number of points within test_data_range and then determine the number of gaps to create. Take into account that the length of each gap is sampled from a uniform distribution between 0 and the maximum gap length given as an argument.
For example: creating two large gaps of at most 50 data points in a dataset containing 100 data points means a theoretical average of 50% data recovery (2*(50/2) = 50 data points are left out of the 100; the two gaps can, however, still overlap).
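The expected data loss from this sampling scheme can be checked with a quick simulation (a sketch for intuition, not part of the package):

```python
import numpy as np

rng = np.random.default_rng(1)
n_points, nr_gaps, max_gap = 100, 2, 50

# Gap lengths drawn uniformly between 0 and max_gap, so each gap removes
# max_gap/2 points on average; overlapping gaps can make the real loss smaller
gap_lengths = rng.integers(0, max_gap + 1, size=(10000, nr_gaps))
avg_removed = gap_lengths.sum(axis=1).mean()
print(avg_removed / n_points)  # close to 0.5: on average half the data removed
```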
In [ ]:
len(dataset.data['2013/1/1':'2013/1/17'])
In [ ]:
dataset._filling_warning_issued
In [ ]:
dataset.check_filling_error(100,'CODtot_line2','fill_missing_standard',[dt.datetime(2013,1,1,0,5),dt.datetime(2013,1,17)],
nr_small_gaps=70,max_size_small_gaps=12,
nr_large_gaps=3,max_size_large_gaps=800,
to_fill='CODtot_line2',arange=[dt.datetime(2013,1,1,0,5),dt.datetime(2013,1,17)],
only_checked=True)
In [ ]:
dataset.check_filling_error(100,'CODtot_line2','fill_missing_daybefore',[dt.datetime(2013,1,1,0,5),dt.datetime(2013,1,17)],
nr_small_gaps=70,max_size_small_gaps=12,
nr_large_gaps=3,max_size_large_gaps=800,
to_fill='CODtot_line2',arange=[dt.datetime(2013,1,1,0,5),dt.datetime(2013,1,17)],
range_to_replace=[0,10],only_checked=True)
Filling data can be done using a range of functions implemented in the package. Again, a new pandas dataframe is created (dataset.meta_filled, see also below figure), starting from the dataset.meta_valid dataframe, and updated with tags indicating what filling method was used to obtain a certain point.
Using the only_checked argument, implemented in most filling functions, the user can always choose whether only data points tagged as filtered will be filled, or all data points within a certain range.
When using the plotting argument to plot the analysed data, the plot is based on the latest function that was used: after a filter function, the data are plotted based on the dataset.meta_valid DataFrame; after a filling function, the tags in dataset.meta_filled are used.
In [ ]:
dataset.fill_missing_interpolation('CODtot_line2',12,[dt.datetime(2013,1,1),dt.datetime(2013,1,8)],
plot=True)
Fill missing data points by using an average daily profile. The fill_missing_standard function requires running the calc_daily_profile function, also developed for this package, first. This creates a DataFrame (dataset.daily_profile) containing the average daily profile calculated within a defined time period (e.g. selecting only non-peak days).
In [ ]:
dataset.calc_daily_profile('CODtot_line2',[dt.datetime(2013,1,1),dt.datetime(2013,1,8)],
quantile=0.9,clear=True)
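The underlying idea of an average daily profile can be sketched with a plain pandas groupby (the data below are synthetic, not from the showcase dataset):

```python
import pandas as pd
import numpy as np

# Two days of hourly data with a repeating daily pattern plus noise
idx = pd.date_range('2013-01-01', periods=48, freq='h')
rng = np.random.default_rng(0)
values = 400 + 50 * np.sin(2 * np.pi * idx.hour / 24) + rng.normal(0, 5, 48)
series = pd.Series(values, index=idx)

# Average daily profile: group by time of day and average over the days
daily_profile = series.groupby(series.index.hour).mean()
print(len(daily_profile))  # one average value per hour of the day
```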
In [ ]:
dataset.fill_missing_standard('CODtot_line2',[dt.datetime(2013,1,14),dt.datetime(2013,1,17)],
only_checked=True,clear=False,plot=True)
In [ ]:
model_output_ontv_1 = pd.read_csv('./data/model_output.txt',
sep='\t')
units_model = model_output_ontv_1.iloc[0]
model_output_ontv_1 = model_output_ontv_1.drop(0,inplace=False).reset_index(drop=True)
model_output_ontv_1 = model_output_ontv_1.astype(float)
model_output_ontv_1.set_index('#.t',drop=True,inplace=True)
model_output_ontv_1.columns
In [ ]:
dataset.fill_missing_model('CODtot_line2',model_output_ontv_1['.sewer_1.COD'],
[dt.datetime(2013,1,18),dt.datetime(2013,1,22)],
only_checked=True,plot=True)
Constant ratios or correlations between data series can be used to fill missing points. The user can calculate and compare ratios and correlations (currently only linear) between selected measurements, and fill data using these.
NB: in the examples below, data filling based on ratios or correlations is obviously not a very good choice. Both methods are included here for completeness of the method showcase.
In [ ]:
dataset.calc_ratio('CODtot_line2','CODsol_line2',
[dt.datetime(2013,1,1,0,5,0),dt.datetime(2013,1,31)])
To find the 'best' ratio (i.e. the one with the lowest relative standard deviation ($\sigma/\mu$)), the ratio obtained in different periods can be compared and the best one used during possible further replacements.
In [ ]:
avg,std = dataset.compare_ratio('CODtot_line2','CODsol_line2',2)
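The selection criterion behind comparing ratios can be sketched in a few lines of NumPy (the ratio values below are made up for illustration):

```python
import numpy as np

# Ratios of two signals, computed over three different periods
ratios_per_period = [np.array([2.1, 2.0, 2.2]),
                     np.array([1.8, 2.4, 2.0]),
                     np.array([2.05, 2.1, 1.95])]

# Relative standard deviation (sigma/mu) per period; the lowest one
# indicates the most stable, hence 'best', ratio
rel_std = [r.std() / r.mean() for r in ratios_per_period]
best = int(np.argmin(rel_std))
print(best)  # 2: the third period has the most stable ratio
```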
Use the average obtained from the compare_ratio function to fill in missing values. (In this case, as mentioned before, this clearly does not work, since zero values are replaced with zero values; it only showcases the function and its arguments.)
In [ ]:
dataset.fill_missing_ratio('CODtot_line2',
'CODsol_line2',avg,
[dt.datetime(2013,1,22),dt.datetime(2013,1,23)],
only_checked=True,plot=True)
Instead of a ratio, a correlation can be sought. In the case of a zero intercept, this of course gives a result in the same range if the same data are used. To get a good impression of how useful the calculated correlation is, a prediction interval is plotted as well when plot is set to True.
In [ ]:
dataset.get_correlation('CODtot_line2',
'CODsol_line2',
[dt.datetime(2013,1,1,0,5,0),dt.datetime(2013,1,31)],
zero_intercept=True,plot=True)
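For a zero intercept, the least-squares slope has a simple closed form; a minimal NumPy sketch on synthetic data (not the package's internal implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(1, 10, 50)
y = 1.8 * x + rng.normal(0, 0.5, 50)  # true slope 1.8 plus noise

# Least-squares slope for a line through the origin:
# minimising sum((y - a*x)^2) gives a = sum(x*y) / sum(x*x)
slope = (x * y).sum() / (x * x).sum()
print(round(slope, 2))  # close to the true slope of 1.8
```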
After the previously made assessment, use the correlation function to fill gaps in the dataset.
In [ ]:
dataset.fill_missing_correlation('CODtot_line2',
'CODsol_line2',
[dt.datetime(2013,1,23),dt.datetime(2013,1,25)],
[dt.datetime(2013,1,1,0,5,0),dt.datetime(2013,1,31)],
only_checked=True,clear=False,plot=True)
In [ ]:
dataset.fill_missing_daybefore('CODtot_line2',
[dt.datetime(2013,1,25),dt.datetime(2013,1,27)],
range_to_replace=[0,10],plot=True,
only_checked=False)
In [ ]:
fig, ax = dataset.plot_analysed('CODtot_line2')
ax.legend(bbox_to_anchor=(1.3,1.0),fontsize=18)
ax.set_ylabel('Total COD [mg/L]',fontsize=18);ax.set_xlabel('')
ax.tick_params(labelsize=14)
Calculate the daily average of a certain data series
In [ ]:
dataset.calc_daily_average('CODtot_line2',arange=[dt.datetime(2013,1,1),dt.datetime(2013,2,1)],plot=True)
Calculate the proportional concentration of different flows coming together.
In [ ]:
dataset.calc_total_proportional('Flow_total',
['Flow_line1','Flow_line2','Flow_line3'],
['TSS_line1','TSS_line2','TSS_line3'],
'TSS_prop')
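The proportional concentration is simply a flow-weighted average of the individual line concentrations; a minimal NumPy sketch with made-up numbers:

```python
import numpy as np

# Flows of three lines and their matching TSS concentrations [mg/L]
flows = np.array([100.0, 250.0, 150.0])
conc = np.array([200.0, 180.0, 220.0])

# Flow-proportional concentration of the combined stream
total_flow = flows.sum()
prop_conc = (flows * conc).sum() / total_flow
print(prop_conc)  # 196.0
```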