In this notebook, we'll explore some functionality of the models of this package. We'll work with the coupled CemaneigeGR4J model that is implemented in the rrmpg.models module. The data we'll use comes from the CAMELS [1] data set. For some basins, the data is provided within this Python library and can easily be imported using the CAMELSLoader class implemented in the rrmpg.data module.
In summary, we'll look at: how to create a model and inspect its parameters, how to calibrate it with the .fit() method, how to calibrate it via Monte-Carlo simulation, and how to validate the resulting parameter sets on unseen data.
[1] Addor, N., A.J. Newman, N. Mizukami, and M.P. Clark, 2017: The CAMELS data set: catchment attributes and meteorology for large-sample studies. version 2.0. Boulder, CO: UCAR/NCAR. doi:10.5065/D6G73C3Q
In [1]:
# Imports and Notebook setup
from timeit import timeit
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from rrmpg.models import CemaneigeGR4J
from rrmpg.data import CAMELSLoader
from rrmpg.tools.monte_carlo import monte_carlo
from rrmpg.utils.metrics import calc_nse
As a first step, let us have a look at how we can create one of the models implemented in rrmpg.models. To find out which parameters a model has, we basically have two different options: the documentation provides a list of all model parameters, or alternatively we can call help() on the model class (e.g. help(CemaneigeGR4J)).
If no specific model parameters are provided upon initialization, random parameters are generated that lie within the default parameter bounds. We can inspect these bounds by calling the .get_param_bounds() method on the model object and check the current parameter values by calling the .get_params() method.
For now we don't know any specific parameter values, so we'll create a model with random parameters.
In [2]:
model = CemaneigeGR4J()
model.get_params()
Out[2]:
Here we can see the six model parameters of CemaneigeGR4J model and their current values.
To have data to start with, we can use the CAMELSLoader class to load data of the provided basins from the CAMELS dataset. To get a list of all available basins that are provided within this library, we can use the .get_basin_numbers() method. For now we will use the provided basin number 01031500.
In [3]:
df = CAMELSLoader().load_basin('01031500')
df.head()
Out[3]:
Next we will split the data into a calibration period, which we will use to find a good set of model parameters, and a validation period, which we will use to see how well the model works on unseen data. As in the CAMELS data set publication, we will use the first 15 hydrological years for calibration; the rest of the data will be used for validation.
Because the index of the dataframe is in pandas datetime format, we can easily split the dataframe into two parts:
In [4]:
# calculate the end date of the calibration period
end_cal = pd.to_datetime(f"{df.index[0].year + 15}/09/30", yearfirst=True)
# validation period starts one day later
start_val = end_cal + pd.DateOffset(days=1)
# split the data into two parts
cal = df[:end_cal].copy()
val = df[start_val:].copy()
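The split works because label-based slicing on a pandas DatetimeIndex is inclusive of the end label, so cal ends exactly at end_cal and val starts one day later without overlap. A minimal sketch with synthetic data (hypothetical values; only the index matters here):

```python
import numpy as np
import pandas as pd

# Synthetic daily data standing in for the CAMELS dataframe
index = pd.date_range("1980-10-01", "2000-09-30", freq="D")
df = pd.DataFrame({"QObs(mm/d)": np.ones(len(index))}, index=index)

# End of the calibration period: 15 hydrological years after the start
end_cal = pd.to_datetime(f"{df.index[0].year + 15}/09/30", yearfirst=True)
start_val = end_cal + pd.DateOffset(days=1)

# Label-based slicing on a DatetimeIndex includes the end label
cal = df[:end_cal].copy()
val = df[start_val:].copy()

print(cal.index[-1].date(), val.index[0].date())
# → 1995-09-30 1995-10-01
```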
As already mentioned above, we'll look at two different calibration methods implemented in this library: the .fit() method of each model and the monte_carlo function.
Each model has a .fit() method. This function uses the global optimizer differential evolution from the scipy package to find the set of model parameters that produces the best simulation with respect to the provided array of observed discharge.
The inputs for this function can be found in the documentation or via help().
In [5]:
help(model.fit)
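To illustrate what the optimizer inside .fit() does, here is a minimal sketch of scipy's differential_evolution on a toy objective (this is not the rrmpg code, just the bare optimizer; the quadratic objective and bounds are made up for illustration):

```python
import numpy as np
from scipy.optimize import differential_evolution

# Toy objective: squared distance to the point (2, -1). The optimizer
# searches within the given bounds, analogous to the default parameter
# bounds used when calibrating a model.
def objective(x):
    return (x[0] - 2.0) ** 2 + (x[1] + 1.0) ** 2

res = differential_evolution(objective, bounds=[(-5, 5), (-5, 5)], seed=42)
print(res.x, res.fun)
```

The returned object has the same fields (x, fun, nfev, success, message) that we will inspect below for the fitted model.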
We don't know any values for the initial states of the storages, so we will ignore them for now. For the missing mean temperature, we calculate a proxy from the minimum and maximum daily temperature. The station height can be retrieved from the CAMELSLoader class via the .get_station_height() method.
In [6]:
# calculate mean temp for calibration and validation period
cal['tmean'] = (cal['tmin(C)'] + cal['tmax(C)']) / 2
val['tmean'] = (val['tmin(C)'] + val['tmax(C)']) / 2
# load the gauge station height
height = CAMELSLoader().get_station_height('01031500')
Now we are ready to fit the model and retrieve a good set of model parameters from the optimizer. Again, this will be done with the calibration data. Because the model methods also accept pandas Series, we can call the function as follows.
In [7]:
# We don't have an initial value for the snow storage, so we omit this input
result = model.fit(cal['QObs(mm/d)'], cal['prcp(mm/day)'], cal['tmean'],
cal['tmin(C)'], cal['tmax(C)'], cal['PET'], height)
result is an object defined by the scipy library that contains the optimized model parameters, as well as some more information on the optimization process. Let us have a look at this object:
In [8]:
result
Out[8]:
The relevant information here is:
- fun is the final value of our optimization criterion (the mean squared error in this case)
- message describes the cause of the optimization termination
- nfev is the number of model simulations
- success is a flag whether or not the optimization was successful
- x are the optimized model parameters
Next, let us set the model parameters to the optimized ones found by the search. For this we need to create a dictionary containing one key for each model parameter, with the optimized parameter as the corresponding value. As mentioned before, the list of model parameter names can be retrieved by the model.get_parameter_names() function. We can then create the needed dictionary with the following lines of code:
In [9]:
params = {}
param_names = model.get_parameter_names()
for i, param in enumerate(param_names):
params[param] = result.x[i]
# This line sets the model parameters to the ones specified in the dict
model.set_params(params)
# To be sure, let's look at the current model parameters
model.get_params()
Out[9]:
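As a side note, the loop in the previous cell can be written more compactly with dict(zip(...)), since zip pairs each parameter name with the value at the same position. A minimal sketch with hypothetical parameter names and values standing in for model.get_parameter_names() and result.x:

```python
import numpy as np

# Hypothetical parameter names and optimizer output (illustration only)
param_names = ['CTG', 'Kf', 'x1', 'x2', 'x3', 'x4']
x = np.array([0.25, 3.5, 350.0, 1.2, 90.0, 1.7])

# zip pairs each name with its value in order; dict collects the pairs
params = dict(zip(param_names, x))
print(params['x1'])  # → 350.0
```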
Although it might not be obvious at first glance, these are the same parameters as the ones specified in result.x. In result.x they are ordered according to the ordering of the _param_list specified in each model class, whereas the dictionary output here is sorted alphabetically by parameter name.
Next, let us look at the second calibration method, the Monte-Carlo simulation implemented in the monte_carlo function. As before, we can check the help() for information on its inputs:
In [10]:
help(monte_carlo)
As specified in the help text, all model inputs needed for a simulation must be provided as keyword arguments. The keywords need to match the names specified in the model.simulate()
function. Let us create a new model instance and see how this works for the CemaneigeGR4J model.
In [11]:
model2 = CemaneigeGR4J()
# Let us run the Monte-Carlo simulation for 10,000 runs, which is in the
# same range as the number of simulations of the optimizer above
result_mc = monte_carlo(model2, num=10000, qobs=cal['QObs(mm/d)'],
prec=cal['prcp(mm/day)'], mean_temp=cal['tmean'],
min_temp=cal['tmin(C)'], max_temp=cal['tmax(C)'],
etp=cal['PET'], met_station_height=height)
# Get the index of the best fit (smallest mean squared error).
# np.nanargmin ignores NaN values while keeping the original index, so
# idx can be used directly on the unfiltered params array below.
idx = np.nanargmin(result_mc['mse'])
# Get the optimal parameters and set them as model parameters
optim_params = result_mc['params'][idx]
params = {}
for i, param in enumerate(param_names):
params[param] = optim_params[i]
# This line sets the model parameters to the ones specified in the dict
model2.set_params(params)
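Conceptually, a Monte-Carlo calibration just draws random parameter sets within the bounds, simulates each, and keeps the set with the lowest error. A toy sketch on a made-up one-parameter model (this is an illustration of the idea, not the rrmpg implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": discharge is precipitation scaled by one parameter k
def simulate(prec, k):
    return k * prec

prec = rng.random(100)
qobs = simulate(prec, 0.7)  # synthetic observations, true k = 0.7

# Sample candidates uniformly within the parameter bounds, score by MSE
candidates = rng.uniform(0.0, 2.0, size=1000)
mse = np.array([np.mean((qobs - simulate(prec, k)) ** 2)
                for k in candidates])

# nanargmin keeps the original index even if some runs produced NaNs
best_k = candidates[np.nanargmin(mse)]
print(best_k)  # close to the true value 0.7
```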
In [12]:
# simulated discharge of the model optimized by the .fit() function
val['qsim_fit'] = model.simulate(val['prcp(mm/day)'], val['tmean'],
val['tmin(C)'], val['tmax(C)'],
val['PET'], height)
# simulated discharge of the model optimized by monte-carlo-sim
val['qsim_mc'] = model2.simulate(val['prcp(mm/day)'], val['tmean'],
val['tmin(C)'], val['tmax(C)'],
val['PET'], height)
# Calculate and print the Nash-Sutcliff-Efficiency for both simulations
nse_fit = calc_nse(val['QObs(mm/d)'], val['qsim_fit'])
nse_mc = calc_nse(val['QObs(mm/d)'], val['qsim_mc'])
print("NSE of the .fit() optimization: {:.4f}".format(nse_fit))
print("NSE of the Monte-Carlo-Simulation: {:.4f}".format(nse_mc))
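The Nash-Sutcliffe efficiency computed by calc_nse compares the simulation against the mean of the observations as a baseline: 1 is a perfect fit, 0 means the simulation is no better than predicting the observed mean. A minimal sketch of the standard formula (not the library's implementation):

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency of a simulation vs. observations."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

obs = np.array([1.0, 2.0, 3.0, 4.0])
print(nse(obs, obs))                     # perfect simulation → 1.0
print(nse(obs, np.full_like(obs, 2.5)))  # mean predictor → 0.0
```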
What do these numbers mean? Let us have a look at a window of the simulated time series and compare them to the observed discharge:
In [13]:
# Plot last full hydrological year of the simulation
%matplotlib notebook
start_date = pd.to_datetime("2013/10/01", yearfirst=True)
end_date = pd.to_datetime("2014/09/30", yearfirst=True)
plt.plot(val.loc[start_date:end_date, 'QObs(mm/d)'], label='Qobs')
plt.plot(val.loc[start_date:end_date, 'qsim_fit'], label='Qsim .fit()')
plt.plot(val.loc[start_date:end_date, 'qsim_mc'], label='Qsim mc')
plt.legend()
Out[13]:
The result is not perfect, but it is not bad either! And since this package is also about speed, let us check how long it takes to simulate the discharge for the entire validation period (19 years of data).
In [14]:
%%timeit
model.simulate(val['prcp(mm/day)'], val['tmean'],
val['tmin(C)'], val['tmax(C)'],
val['PET'], height)