Pastas TimeSeries

Developed by Raoul Collenteur

In this Jupyter Notebook, the concept of the Pastas TimeSeries class is explained in full detail.

Objective of the Pastas TimeSeries class:

"To create one class that deals with all user-provided time series and the manipulations of the series while maintaining the original series."

Desired capabilities: The central idea behind the TimeSeries class is to handle all data manipulations in a single class while keeping the original time series intact. While the TimeSeries is manipulated when working with your Pastas model, the original data are maintained, so that only the settings and the original series need to be stored. The class should be able to:

  • Validate user-provided time series
  • Extend before and after
  • Fill nan-values
  • Change frequency
    • Upsample
    • Downsample
  • Normalize values

Resources: The definition of the class can be found on GitHub. Documentation on the Pandas Series can be found in the pandas documentation.

In [1]:
# Import some packages
import pastas as ps
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

# Print the versions of Pastas and its dependencies
ps.show_versions()


Python version: 3.7.6 | packaged by conda-forge | (default, Jan  7 2020, 22:05:27) 
[Clang 9.0.1 ]
Numpy version: 1.17.5
Scipy version: 1.4.1
Pandas version: 0.25.0
Pastas version: 0.14.0b

1. Importing groundwater time series

Let's first import some time series so we have some data to play around with. We use the Pandas read_csv method and obtain a Pandas Series object, a pandas data structure for dealing efficiently with 1D time series data. By default, Pandas adds a wealth of functionality to a Series object, such as descriptive statistics (e.g. series.describe()) and plotting functionality.

In [2]:
gwdata = pd.read_csv('../data/head_nb1.csv', parse_dates=['date'],
                     index_col='date', squeeze=True)

<matplotlib.axes._subplots.AxesSubplot at 0x1158d5f50>
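Note that the squeeze keyword of read_csv was deprecated in pandas 1.4 and removed in pandas 2.0, so the cell above only runs on older versions such as the 0.25.0 used here. A minimal sketch of the same import for newer pandas versions follows; the in-memory CSV and its values are made up for illustration and only stand in for the notebook's data file.

```python
import io

import pandas as pd

# Made-up sample standing in for the notebook's CSV file (values are illustrative)
csv_text = """date,head
1985-11-14,27.61
1985-11-28,27.73
1985-12-14,27.91
"""

# Read the CSV into a single-column DataFrame with a datetime index
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], index_col="date")

# Squeeze the single column into a Series (replacement for squeeze=True)
gw_sample = df.squeeze("columns")

print(type(gw_sample).__name__)
print(gw_sample.describe())
```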

2. Creating a Pastas TimeSeries object

The user provides time series data when creating a model instance or one of the stress models found in Pastas. Pastas expects Pandas Series as the standard format in which time series are provided, but internally transforms these into Pastas TimeSeries objects to add the necessary functionality. It is therefore also possible to provide a TimeSeries object directly instead of a Pandas Series object.

We will now create a TimeSeries object for the groundwater level (gwdata). When creating a TimeSeries object the time series that are provided are validated, such that Pastas can use the provided time series for simulation without errors. The time series are checked for:

  1. Being an actual Pandas Series object;
  2. Having indices that are all Timestamps;
  3. Having indices that are ordered in time;
  4. Nan-values before the first and after the final valid value, which are dropped;
  5. The frequency of the series, which is inferred unless the user provides a value for "freq";
  6. Nan-values within the series, which are handled according to the value of the "fill_nan" argument;
  7. Duplicate indices, which are dropped from the series.

If all of the above is OK, a TimeSeries object is returned. When valid time series are provided, all of the above checks pass and no settings are required. However, all too often this is not the case, and at least "fill_nan" and "freq" are required. The fill_nan argument tells the TimeSeries object how to handle nan-values, and the freq argument provides the frequency of the original time series (by default, freq="D" and fill_nan="interpolate").
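To make the checks above more concrete, here is a rough pandas-only imitation. It is a sketch under stated assumptions, not the Pastas implementation: the function name validate_series is hypothetical, duplicates are removed earlier than in the list to keep the later steps simple, and only "drop", "interpolate" and numeric values are handled for fill_nan.

```python
import numpy as np
import pandas as pd


def validate_series(series, fill_nan="interpolate"):
    """Hypothetical imitation of the listed checks (not Pastas code)."""
    # 1. The input must be a pandas Series
    if not isinstance(series, pd.Series):
        raise TypeError("expected a pandas Series")
    # 2. The index must consist of timestamps
    if not isinstance(series.index, pd.DatetimeIndex):
        raise TypeError("expected a DatetimeIndex")
    # 3. Order the index in time
    series = series.sort_index()
    # 7. Drop duplicate indices (done early here, keeping the first occurrence)
    series = series[~series.index.duplicated(keep="first")]
    # 4. Trim nan-values before the first and after the last valid value
    series = series.loc[series.first_valid_index():series.last_valid_index()]
    # 5. Infer the frequency (needs at least three timestamps; None if irregular)
    freq = pd.infer_freq(series.index) if len(series) >= 3 else None
    # 6. Handle the remaining nan-values inside the series
    if fill_nan == "drop":
        series = series.dropna()
    elif fill_nan == "interpolate":
        series = series.interpolate(method="time")
    else:
        series = series.fillna(float(fill_nan))
    return series, freq


# Unordered daily data with nan-values at the start, in the middle and at the end
idx = pd.to_datetime(["2000-01-03", "2000-01-01", "2000-01-02",
                      "2000-01-04", "2000-01-05", "2000-01-06"])
raw = pd.Series([3.0, np.nan, 2.0, np.nan, 5.0, np.nan], index=idx)

clean, freq = validate_series(raw, fill_nan="interpolate")
dropped, _ = validate_series(raw, fill_nan="drop")
print(clean)
print(freq)
```

With fill_nan="drop" the gap on 2000-01-04 is simply removed instead of interpolated, which is the behaviour discussed for observed groundwater levels in the next section.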

In [3]:
oseries = ps.TimeSeries(gwdata, name="Groundwater Level")

# Plot the new time series and the original
oseries.plot(label="pastas timeseries")
gwdata.plot(label="original")
plt.legend()

INFO: Cannot determine frequency of series Groundwater Level
<matplotlib.legend.Legend at 0x117a857d0>

3. Configuring a TimeSeries object

So let's see how we can configure a TimeSeries object. In the case of the observed groundwater levels (oseries) in the example above, interpolating between observations might not be the preferred method to deal with gaps in your data. In fact, the time steps do not have to be constant for simulation, which is one of the benefits of the method of impulse response functions. The nan-values can simply be dropped. To configure a TimeSeries object the user has three options:

  1. Use a predefined configuration by providing a string to the settings argument
  2. Manually set all or some of the settings by providing a dictionary to the settings argument
  3. Provide the settings as keyword arguments to the TimeSeries object (not recommended)

For example, when creating a TimeSeries object for the groundwater levels consider the three following examples for setting the fill_nan option:

In [4]:
# Option 1: use a predefined configuration
oseries = ps.TimeSeries(gwdata, name="Groundwater Level", settings="oseries")
oseries.settings

INFO: Cannot determine frequency of series Groundwater Level
{'to_daily_unit': None, 'freq': None, 'sample_up': None, 'sample_down': 'drop', 'fill_nan': 'drop', 'fill_before': None, 'fill_after': None, 'tmin': Timestamp('1985-11-14 00:00:00'), 'tmax': Timestamp('2015-06-28 00:00:00'), 'norm': None, 'time_offset': Timedelta('0 days 00:00:00')}

In [5]:
# Option 2: provide a dictionary with settings
oseries = ps.TimeSeries(gwdata, name="Groundwater Level", settings=dict(fill_nan="drop"))
oseries.settings

INFO: Cannot determine frequency of series Groundwater Level
{'to_daily_unit': None, 'freq': None, 'sample_up': None, 'sample_down': None, 'fill_nan': 'drop', 'fill_before': None, 'fill_after': None, 'tmin': Timestamp('1985-11-14 00:00:00'), 'tmax': Timestamp('2015-06-28 00:00:00'), 'norm': None, 'time_offset': Timedelta('0 days 00:00:00')}

In [6]:
# Option 3: provide the setting as a keyword argument
oseries = ps.TimeSeries(gwdata, name="Groundwater Level", fill_nan="drop")
oseries.settings

INFO: Cannot determine frequency of series Groundwater Level
{'to_daily_unit': None, 'freq': None, 'sample_up': None, 'sample_down': None, 'fill_nan': 'drop', 'fill_before': None, 'fill_after': None, 'tmin': Timestamp('1985-11-14 00:00:00'), 'tmax': Timestamp('2015-06-28 00:00:00'), 'norm': None, 'time_offset': Timedelta('0 days 00:00:00')}

Predefined settings

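All of the above methods yield the same fill_nan setting. Note that the predefined "oseries" configuration additionally sets sample_down to "drop", as the printed settings show. It is up to the user which method is preferred.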

A question that may arise with option 1 is which strings are possible for settings and which configuration each of them applies. The TimeSeries class contains a dictionary with predefined settings that are used often. You can ask the TimeSeries class this question:

In [7]:

            fill_nan     sample_down  sample_up    fill_before  fill_after  to_daily_unit
oseries     drop         drop         NaN          NaN          NaN         NaN
prec        0            mean         bfill        mean         mean        NaN
evap        interpolate  mean         bfill        mean         mean        NaN
well        0            mean         bfill        0            0           divide
waterlevel  interpolate  mean         interpolate  mean         mean        NaN
level       interpolate  mean         interpolate  mean         mean        NaN
flux        0            mean         bfill        mean         mean        NaN
quantity    0            sum          divide       mean         mean        NaN
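The three configuration routes and the predefined settings listed above can be mimicked with plain dictionaries. The following is only a sketch of the idea: the names PREDEFINED, DEFAULTS and resolve_settings are hypothetical, the predefined values are copied from the "oseries" and "prec" rows above, and the precedence used here (keyword arguments override the settings argument, which overrides the defaults) is an assumption rather than documented Pastas behaviour.

```python
# Hypothetical subset of predefined settings, copied from the table above
PREDEFINED = {
    "oseries": {"fill_nan": "drop", "sample_down": "drop"},
    "prec": {"fill_nan": 0.0, "sample_down": "mean", "sample_up": "bfill",
             "fill_before": "mean", "fill_after": "mean"},
}

# Assumed defaults: everything unset except fill_nan
DEFAULTS = {"freq": None, "sample_up": None, "sample_down": None,
            "fill_nan": "interpolate", "fill_before": None, "fill_after": None}


def resolve_settings(settings=None, **kwargs):
    """Merge defaults, a predefined name or dictionary, and keyword overrides."""
    resolved = dict(DEFAULTS)
    if isinstance(settings, str):
        resolved.update(PREDEFINED[settings])  # option 1: predefined string
    elif isinstance(settings, dict):
        resolved.update(settings)              # option 2: dictionary
    resolved.update(kwargs)                    # option 3: keyword arguments
    return resolved


opt1 = resolve_settings("oseries")
opt2 = resolve_settings(dict(fill_nan="drop"))
opt3 = resolve_settings(fill_nan="drop")

print(opt1["fill_nan"], opt2["fill_nan"], opt3["fill_nan"])  # drop drop drop
print(opt1["sample_down"], opt2["sample_down"])              # drop None
```

All three routes end up with fill_nan set to "drop", while only the predefined string also changes sample_down, which matches the settings printed in the cells above.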

4. Let's explore the possibilities

As said, Pastas TimeSeries are capable of handling time series in a way that is convenient for Pastas.

  • Changing the frequency of the time series (sample_up, sample_down)
  • Extending the time series (fill_before and fill_after)
  • Normalizing the time series (norm; not fully supported yet)

We will now import a precipitation series measured at a daily frequency and show how the above methods work.

In [8]:
# Import observed precipitation series
precip = pd.read_csv('../data/rain_nb1.csv', parse_dates=['date'],
                     index_col='date', squeeze=True)
prec = ps.TimeSeries(precip, name="Precipitation", settings="prec")

INFO: Inferred frequency from time series Precipitation: freq=D 

In [9]:
# fig, ax = plt.subplots(2, 1, figsize=(10,8))
# prec.update_series(freq="D")
# prec.update_series(freq="7D")

# import matplotlib.dates as mdates
# ax[1].fmt_xdata = mdates.DateFormatter('%m')
# fig.autofmt_xdate()

Wait, what?

We just changed the frequency of the TimeSeries. When reducing the frequency, the values were aggregated into the new bins according to the sample_down setting (the mean for the predefined "prec" settings). Conveniently, all Pandas methods are still available and functional, such as the great plotting functionality of Pandas.

All this happened in place, meaning the same object just took another shape based on the new settings. Moreover, it applied those new settings (freq="7D", i.e. weekly values) to the original series. This means that going back and forth between frequencies does not lead to any information loss.
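The idea of always resampling from the stored original can be sketched with plain pandas. The class below is a toy stand-in with hypothetical names (ResamplableSeries, update), not the Pastas TimeSeries class; it only illustrates why switching between frequencies loses no information when the original series is kept.

```python
import numpy as np
import pandas as pd


class ResamplableSeries:
    """Toy stand-in (not Pastas code) that always resamples from the original."""

    def __init__(self, series):
        self.original = series.copy()  # never modified after creation
        self.series = series.copy()    # current "shape" of the series

    def update(self, freq, how="mean"):
        # Start from the untouched original, never from the current state
        self.series = self.original.resample(freq).agg(how)
        return self.series


idx = pd.date_range("2000-01-01", periods=14, freq="D")
toy = ResamplableSeries(pd.Series(np.arange(14, dtype=float), index=idx))

weekly = toy.update("7D", how="mean")  # two weekly bins
daily = toy.update("D", how="mean")    # back to daily values
print(weekly)
print(daily)
```

Had the daily series been rebuilt from the weekly means instead, the individual daily values would have been lost; keeping the original avoids that.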

Why is this so important? Because when solving or simulating a model, the Model will ask every member of the TimeSeries family to prepare itself with the necessary settings (e.g. new freq) and perform that operation only once. When asked for a time series, the TimeSeries object will "be" in that new shape.

Some more action

Let's say, we want to simulate the groundwater series for a period where no data is available for the time series, but we need some kind of value for the warmup period to prevent things from getting messy. The TimeSeries object can easily extend itself, as the following example shows.

In [10]:

{'to_daily_unit': None,
 'freq': None,
 'sample_up': 'bfill',
 'sample_down': 'mean',
 'fill_nan': 0.0,
 'fill_before': 'mean',
 'fill_after': 'mean',
 'tmin': Timestamp('2011-01-01 00:00:00'),
 'tmax': Timestamp('2016-10-31 00:00:00'),
 'norm': None,
 'time_offset': Timedelta('0 days 00:00:00')}
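What extending a series with fill_before="mean" and fill_after="mean" amounts to can be sketched with plain pandas. This is an illustration under assumptions, not the Pastas implementation: the function extend_series is hypothetical, the data are made up, and only "mean" or a numeric fill value on a regular series is handled.

```python
import pandas as pd


def extend_series(series, tmin, tmax, freq="D",
                  fill_before="mean", fill_after="mean"):
    """Pad a regular series out to [tmin, tmax] (hypothetical helper)."""
    full_index = pd.date_range(tmin, tmax, freq=freq)
    extended = series.reindex(full_index)
    mean = series.mean()
    # Boolean masks for the periods before and after the available data
    before = extended.index < series.index[0]
    after = extended.index > series.index[-1]
    extended.loc[before] = mean if fill_before == "mean" else float(fill_before)
    extended.loc[after] = mean if fill_after == "mean" else float(fill_after)
    return extended


# Three days of made-up data, extended by two days on either side
short = pd.Series([1.0, 2.0, 3.0],
                  index=pd.date_range("2011-01-03", periods=3, freq="D"))
padded = extend_series(short, "2011-01-01", "2011-01-07")
print(padded)
```

The padded values equal the mean of the available data (2.0 here), which gives the model a neutral value for the warmup period.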

5. Exporting the TimeSeries

When done, we might want to store the TimeSeries object for later use. A built-in to_dict method exports the original time series to a dictionary that can be stored in JSON format, along with its current settings and name. This way the original data are maintained and the TimeSeries can easily be recreated from a JSON file.

In [11]:
data = prec.to_dict()

dict_keys(['series', 'name', 'settings', 'metadata', 'freq_original'])

In [12]:
# Tadaa, we have our extended time series in weekly frequency back!
ts = ps.TimeSeries(**data)

<matplotlib.axes._subplots.AxesSubplot at 0x102347810>
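The principle behind the export, storing only the original series, the name and the settings, can be sketched without Pastas. The layout of the JSON payload below is a simplified assumption and not the exact structure returned by to_dict; the data and settings values are made up.

```python
import json

import pandas as pd

# Made-up original series and settings
idx = pd.date_range("2000-01-01", periods=3, freq="D")
original = pd.Series([0.0, 1.5, 0.2], index=idx, name="Precipitation")
settings = {"freq": "7D", "fill_before": "mean", "fill_after": "mean"}

# Export: only the original data, the name and the settings are stored
payload = json.dumps({
    "name": original.name,
    "settings": settings,
    "series": {
        "index": [t.isoformat() for t in original.index],
        "values": original.tolist(),
    },
})

# Import: rebuild the original series; the settings can then be reapplied
loaded = json.loads(payload)
restored = pd.Series(
    loaded["series"]["values"],
    index=pd.to_datetime(loaded["series"]["index"]),
    name=loaded["name"],
)
print(restored)
print(loaded["settings"])
```

Because the manipulated series is never stored, the file stays small and the manipulations remain reproducible from the settings alone.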