Starting with the Shyft api

Introduction

At its core, Shyft exposes its functionality through an API (Application Programming Interface); everything the framework can do is available through this API.

We begin the tutorials by introducing the API, as it provides the building blocks for the framework. Once you have a good understanding of the API, you can move toward configured runs that make use of orchestration. To make use of configured runs, you need to understand how we 'serialize' configurations and input data through repositories.

In a separate simulation tutorial, we cover conducting a very simple simulation of an example catchment using configuration files. This is a typical use case, but it assumes that you have a model well configured and ready for simulation. In practice, one is interested in working with the model, testing different configurations, and evaluating different data sources.

This is in fact a key idea of Shyft -- to make it simple to evaluate the impact of the selection of model routine on the performance of the simulation. In this notebook we walk through this lower level paradigm of working with the toolbox and using the Shyft api directly to conduct the simulations.

This notebook guides you through the simulation of a catchment. The following steps are covered:

  1. Loading required python modules and setting path to SHyFT installation
  2. Building a Shyft model
  3. Running a Shyft simulation with updated parameters
  4. Activating the simulation only for selected catchments
  5. Setting up different input datasets
  6. Changing state collection settings
  7. Post processing and extracting results

1. Loading required python modules and setting path to SHyFT installation

For the notebook tutorials we require several imports. In addition, be sure your shyft environment is correctly configured. This is required before importing shyft itself. Lastly, import the shyft classes and modules.


In [1]:
# Pure python modules and jupyter notebook functionality
# first you should import the third-party python modules which you'll use later on
# the first line enables that figures are shown inline, directly in the notebook
%matplotlib inline
import os
import datetime as dt
import numpy as np
from os import path
import sys
from matplotlib import pyplot as plt
from netCDF4 import Dataset

In [2]:
# try to auto-configure the path. This will work in the case
# that you have checked out the doc and data repositories
# at same level. Make sure this is done **before** importing shyft
shyft_data_path = path.abspath("../../../shyft-data")
if path.exists(shyft_data_path) and 'SHYFT_DATA' not in os.environ:
    os.environ['SHYFT_DATA']=shyft_data_path

# shyft should be available either by its installation in python
# or by PYTHONPATH set by user prior to starting notebook.
# If you have cloned the repositories according to guidelines:
shyft_path=path.abspath('../../../shyft')
sys.path.insert(0,shyft_path)

In [3]:
from shyft import api
import shyft
import shyft.api.pt_gs_k
import shyft.repository
import shyft.repository.netcdf

print(shyft.__path__)
for env in os.environ:
    if 'SHYFT' in env:
        print('{0}:\n{1}'.format(env, os.environ[env]))
from shyft import shyftdata_dir
print(shyftdata_dir)


['/Data/johnbur/Dropbox/home/Programming/workspace/shyft_workspace/shyft/shyft']
SHYFT_DEPENDENCIES_DIR:
/Data/johnbur/Programming/workspace/shyft_workspace/shyft-dependencies
SHYFT_DATA:
/Data/johnbur/Dropbox/home/Programming/workspace/shyft_workspace/shyft-data
/Data/johnbur/Dropbox/home/Programming/workspace/shyft_workspace/shyft-data

2. Build a Shyft model

The first step of a simulation is to define the model that you will create. In this example, we will use Shyft's pure api approach to create a model from scratch.

The simulation domain

What is required to set up a simulation? At the most basic level of Shyft, we need to define the simulation domain / geometry. Shyft does not care about the specific shape of the cells. Shyft just needs a 'geocentroid location' and an area. We will create a container of this information as a first step to provide to one of Shyft's model types later.

We are going to work with data from the Nea-Nidelva catchment example dataset, which is available in the shyft-data repository. Above, you should have set your SHYFT_DATA environment variable to point to this directory so that we can easily read the data.

The first thing to do is to take a look at the geography of our cells.


In [4]:
# load the data from the example datasets
cell_data = Dataset( os.path.join(shyftdata_dir, 'netcdf/orchestration-testdata/cell_data.nc'))

# plot the coordinates of the cell data provided
# fetch the x- and y-location of the cells
x = cell_data.variables['x'][:]
y = cell_data.variables['y'][:]
z = cell_data.variables['z'][:]
cid = cell_data.variables['catchment_id'][:]

# and make a quick catchment map...
# using a scatter plot of the cells
fig, ax = plt.subplots(figsize=(15,5))
cm = plt.cm.get_cmap('rainbow')
elv_col = ax.scatter(x, y, c=z, marker='.', s=40, lw=0, cmap=cm)
# cm = plt.cm.get_cmap('gist_gray')
# cid_col = ax.scatter(x, y, c=cid, marker='.', s=40, lw=0, alpha=0.4, cmap=cm)
plt.colorbar(elv_col).set_label('catchment elevation [m]')
# plt.colorbar(cid_col).set_label('catchment indices [id]')
plt.title('Nea Nidelva Catchment')
# print(set(cid))


Out[4]:
Text(0.5,1,'Nea Nidelva Catchment')

Create a collection of simulation cells

In Shyft we work with 'cells', which are the basic simulation units. In the example netcdf file, we provide the attributes for the cells we are going to plot. But you may need to extract this information from your own GIS or other data sources. The essential variables that are minimally required include:

  • x, generally an easting coordinate in UTM space, [meters]
  • y, generally a northing coordinate in UTM space, [meters]
  • z, elevation [meters]
  • area, the area of the cell, [square meters]
  • land cover type fractions (these are float values that sum to 1):
    • glacier
    • lake
    • reservoir
    • forest
    • unspecified
  • catchment_id, an integer to associate the cell with a catchment
  • a radiation factor (set to 0.9 by default)

If you look at the netcdf file, you'll see these are included:


In [5]:
print(cell_data.variables.keys())


odict_keys(['x', 'y', 'z', 'crs', 'area', 'forest-fraction', 'reservoir-fraction', 'lake-fraction', 'glacier-fraction', 'catchment_id'])

So the first step is to extract these from the netcdf file, and get them into the model.


In [6]:
# Let's first create a 'container' that will hold all of our model domains cells:
cell_data_vector = api.GeoCellDataVector()
# help(cell_data_vector)
#help(api.GeoPoint)

In [7]:
num_cells = cell_data.dimensions['cell'].size

for i in range(num_cells):

    gp = api.GeoPoint(x[i], y[i], z[i]) # recall, we extracted x,y,z above
    cid = cell_data.variables['catchment_id'][i]
    cell_area = cell_data.variables['area'][i]

    # land fractions:
    glac = cell_data.variables['glacier-fraction'][i]
    lake = cell_data.variables['lake-fraction'][i]
    rsvr = cell_data.variables['reservoir-fraction'][i]
    frst = cell_data.variables['forest-fraction'][i]
    unsp = 1 - (glac + lake + rsvr + frst)
    
    land_cover_frac = api.LandTypeFractions(glac, lake, rsvr, frst, unsp)
    
    rad_fx = 0.9
    # note, for now we need to make sure we cast some types to pure python, not numpy
    geo_cell_data = api.GeoCellData(gp, float(cell_area), int(cid), rad_fx, land_cover_frac)

    cell_data_vector.append(geo_cell_data)

In [ ]:
# now get the forcing data ready.
# first create a region_environment object, the 'container' that will hold all
# the forcing data sources.
# (the grouped file 'netcdf/orchestration-testdata/stations_met.nc' holds the same
#  observations organized per station and could be read in a similar fashion)
re2 = api.ARegionEnvironment()

# map the variable names in the netcdf files to the file name, the source type,
# and the source vector of the region environment they should be appended to
source_map = {'precipitation' : ('precipitation.nc', api.PrecipitationSource, re2.precipitation),
             'global_radiation' : ('radiation.nc', api.RadiationSource, re2.radiation),
             'temperature' : ('temperature.nc', api.TemperatureSource, re2.temperature),
             'wind_speed' : ('wind_speed.nc', api.WindSpeedSource, re2.wind_speed), 
             'relative_humidity' : ('relative_humidity.nc', api.RelHumSource, re2.rel_hum) }


for var, (file_name, source, source_vec) in source_map.items():
    nci = Dataset( os.path.join(shyftdata_dir, 'netcdf/orchestration-testdata/' + file_name))

    time = api.UtcTimeVector([int(t) for t in nci.variables['time'][:]])
    # time step in seconds; avoid the name 'dt', which would shadow the datetime import
    tstep = time[1] - time[0] if len(time) > 1 else api.deltahours(1)
    for i in range(nci.dimensions['station'].size):
        x = nci.variables['x'][i]
        y = nci.variables['y'][i]
        z = nci.variables['z'][i]
        gp = api.GeoPoint(x, y, z)
        data = nci.variables[var][:, i]
        # the TimeAxis takes the step directly in seconds (cf. the TimeAxis section below)
        time_axis = api.TimeAxis(int(time[0]), int(tstep), len(time))
        cts = api.TsFactory().create_time_point_ts(time_axis.total_period(), time, data, api.POINT_AVERAGE_VALUE)
        # add the station series to the matching source vector
        source_vec.append(source(gp, cts))
    nci.close()

In [21]:
# sample the temperature sources at one of the time points read above
re2.temperature.values_at_time(int(time[100]))


Out[21]:
array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])

In [ ]:
# an alternative, mostly useful for testing: build a region environment with
# constant-value sources located at a single mid point
# (not used further in this notebook)
def create_dummy_region_environment(time_axis, mid_point):
    re = api.ARegionEnvironment()
    re.precipitation.append(_create_constant_geo_ts(api.PrecipitationSource, mid_point, time_axis.total_period(), 5.0))
    re.temperature.append(_create_constant_geo_ts(api.TemperatureSource, mid_point, time_axis.total_period(), 10.0))
    re.wind_speed.append(_create_constant_geo_ts(api.WindSpeedSource, mid_point, time_axis.total_period(), 2.0))
    re.rel_hum.append(_create_constant_geo_ts(api.RelHumSource, mid_point, time_axis.total_period(), 0.7))
    re.radiation = api.RadiationSourceVector()  # just for testing BW compat
    re.radiation.append(_create_constant_geo_ts(api.RadiationSource, mid_point, time_axis.total_period(), 300.0))
    return re


def _create_constant_geo_ts(geo_ts_type, geo_point, utc_period, value):
    """Create a time point ts, with one value at the start
    of the supplied utc_period."""
    tv = api.UtcTimeVector()
    tv.push_back(utc_period.start)
    vv = api.DoubleVector()
    vv.push_back(value)
    cts = api.TsFactory().create_time_point_ts(utc_period, tv, vv, api.POINT_AVERAGE_VALUE)
    return geo_ts_type(geo_point, cts)

In [31]:
# next, create the region (domain) dictionary
RegionDict = {'region_model_id': 'demo', #a unique name identifier of the simulation
              'domain': {'EPSG': 32633,
                        'nx': 400,
                        'ny': 80,
                        'step_x': 1000,
                        'step_y': 1000,
                        'lower_left_x': 100000,
                        'lower_left_y': 6960000},
              'repository': {'class': shyft.repository.netcdf.cf_region_model_repository.CFRegionModelRepository,
                             'params': {'data_file': 'netcdf/orchestration-testdata/cell_data.nc'}},
          }

The first key is probably quite clear:

  • region_model_id: a string naming the simulation, here simply 'demo'

(The simulation period itself -- start time, time step, and number of steps -- is configured later, when we build the TimeAxis.)

We also need to know where the simulation is taking place. This information is contained in the domain:

  • EPSG: an EPSG code identifying the coordinate system
  • nx: number of 'cells' in the x direction
  • ny: number of 'cells' in the y direction
  • step_x: size of a cell in the x direction (m)
  • step_y: size of a cell in the y direction (m)
  • lower_left_x: where (x) in the EPSG system the cells begin
  • lower_left_y: where (y) in the EPSG system the cells begin

Lastly, repository points to a class that can read the file containing data for the cells (in this case it will read a netcdf file), together with the params that class needs. The short sketch below makes the extent of this domain concrete.
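To see what those domain values imply, here is a small sketch (plain Python arithmetic on the RegionDict defined above, no Shyft calls). Note that not every grid position necessarily becomes a simulation cell; only the positions for which the cell data file provides cells do.

In [ ]:
# compute the extent implied by the domain definition (illustration only)
dom = RegionDict['domain']
x0, y0 = dom['lower_left_x'], dom['lower_left_y']
x1 = x0 + dom['nx'] * dom['step_x']
y1 = y0 + dom['ny'] * dom['step_y']
print('x extent: {} - {} m ({:.0f} km)'.format(x0, x1, (x1 - x0) / 1000))
print('y extent: {} - {} m ({:.0f} km)'.format(y0, y1, (y1 - y0) / 1000))
print('grid positions:', dom['nx'] * dom['ny'])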

Model specification

The next dictionary provides information about the model that we would like to use in Shyft, or the 'Model Stack' as it is generally referred to. In this case, we are going to use the PTGSK model, and the rest of the dictionary provides the parameter values.


In [32]:
from shyft.api.pt_gs_k import PTGSKModel
ModelDict = {'model_t': PTGSKModel,  # model to construct
            'model_parameters': {
                'actual_evapotranspiration':{
                    'ae_scale_factor': 1.5},
                'gamma_snow':{
                    'calculate_iso_pot_energy': False,
                    'fast_albedo_decay_rate': 6.752787747748934,
                    'glacier_albedo': 0.4,
                    'initial_bare_ground_fraction': 0.04,
                    'max_albedo': 0.9,
                    'max_water': 0.1,
                    'min_albedo': 0.6,
                    'slow_albedo_decay_rate': 37.17325702015658,
                    'snow_cv': 0.4,
                    'tx': -0.5752881492890207,
                    'snowfall_reset_depth': 5.0,
                    'surface_magnitude': 30.0,
                    'wind_const': 1.0,
                    'wind_scale': 1.8959672005350063,
                    'winter_end_day_of_year': 100},
                'kirchner':{ 
                    'c1': -3.336197322290274,
                    'c2': 0.33433661533385695,
                    'c3': -0.12503959620315988},
                'precipitation_correction': {
                    'scale_factor': 1.0},
                'priestley_taylor':{'albedo': 0.2,
                    'alpha': 1.26},
                    }
            }

In this dictionary we define two variables:

  • model_t: a Shyft 'model stack' class (here PTGSKModel, imported above)
  • model_parameters: a dictionary containing specific parameter values for a particular model class

Specifics of the model_parameters dictionary will vary based on which class is used.

Okay, so far we have two dictionaries: one that provides information regarding our simulation domain, and a second that provides information on the model we wish to run over the domain (i.e. in each of the cells). The next step, then, is to map these together and create a region_repo object.

This is achieved by using a repository, in this case, the CFRegionModelRepository we imported above.


In [33]:
from shyft.repository.netcdf.cf_region_model_repository import CFRegionModelRepository
region_repo = CFRegionModelRepository(RegionDict, ModelDict)

The region_model

**TODO:** a notebook documenting the CFRegionModelRepository

The first step in conducting a hydrologic simulation is to define the domain of the simulation and the model type which we would like to simulate. To do this we create a region_model object. Above we created dictionaries that contain this information, and we instantiated a class called the region_repo. In this next step, we put it together so that we have a single object which we can work with "at our fingertips". You'll note that we pointed to a 'data_file' earlier when we defined the RegionDict. This data file contains all the required elements to fill the cells of our domain. The information is contained in a single netcdf file.

Before we go further, let's look briefly at the contents of this file:


In [34]:
cell_data_file = os.path.join(shyftdata_dir, 'netcdf/orchestration-testdata/cell_data.nc')
print(cell_data_file)
cell_data = Dataset(cell_data_file)
print(cell_data)


/Data/johnbur/Dropbox/home/Programming/workspace/shyft_workspace/shyft-data/netcdf/orchestration-testdata/cell_data.nc
<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4 data model, file format HDF5):
    dimensions(sizes): cell(4650)
    variables(dimensions): float64 x(cell), float64 y(cell), float64 z(cell), int32 crs(), float64 area(cell), float64 forest-fraction(cell), float64 reservoir-fraction(cell), float64 lake-fraction(cell), float64 glacier-fraction(cell), int32 catchment_id(cell)
    groups: 

You might be surprised to see that the only dimension is 'cell', but recall that in Shyft everything is vectorized. Each 'cell' is an element within the domain, and each cell has associated variables:

  • location: x, y, z
  • characteristics: area, forest-fraction, reservoir-fraction, lake-fraction, glacier-fraction, catchment_id

We'll bring this data into our workspace via the region_model. Note that we have instantiated a region_repo class using one of the existing Shyft repositories, in this case one that was built for reading in the data as it is contained in the example shyft-data netcdf files: CFRegionModelRepository.

Next, we'll use the region_repo.get_region_model method to get the region_model. Note that the name 'demo' is, in this case, arbitrary. However, depending on how you create your repository, you can use this string to specify which region model to return.

**note:** *you are strongly encouraged to learn how to create repositories. This particular repository is just for demonstration purposes. In practice, one may use a repository that connects directly to a GIS service, a database, or some other data sets that contain the data required for simulations.*
**warning**: *also, please note that below we call the 'get_region_model' method as we instantiate the class. This behavior may change in the future.*

In [35]:
region_model = region_repo.get_region_model('demo')

Exploring the region_model

So we now have created a region_model, but what is it actually? This is a very fundamental class in Shyft. It is actually one of the "model stacks", such as 'PTGSK', or 'PTHSK'. Essentially, the region_model contains all the information regarding the simulation type and domain. There are many methods associated with the region_model and it will take time to understand all of them. For now, let's just explore a few key methods:

  • bounding_region: provides information regarding the domain of interest for the simulation
  • catchment_id_map: indices of the various catchments within the domain
  • cells: an instance of PTGSKCellAllVector that holds the individual cells for the simulation (note that this is type-specific to the model type)
  • ncore: an integer that sets the number of cores to use during simulation (Shyft is very greedy if you let it!)
  • time_axis: a shyft.api.TimeAxisFixedDeltaT class (basically contains information regarding the timing of the simulation)

Keep in mind that many of these methods are more 'C++'-like than 'Pythonic'. This means, that in some cases, you'll have to 'call' the method. For example: region_model.bounding_region.epsg() returns a string. You can use tab-completion to explore the region_model further:


In [36]:
region_model.bounding_region.epsg()


Out[36]:
'32633'
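A few of the other attributes listed above can be inspected the same way. A small exploration sketch (attribute and method names are the ones listed above, nothing new):

In [ ]:
# explore a few more region_model attributes
print(region_model.catchment_id_map)   # catchment ids present in the domain
print(region_model.ncore)              # number of cores Shyft will use when running the cells
print(region_model.size())             # number of cells in the model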

You'll likely note that there are a number of intriguing functions, e.g. initialize_cell_environment or interpolate. But before we can go further, we need some more information. Perhaps you are wondering about forcing data. So far, we haven't said anything about model input or the time of the simulation; we've only set up a container that holds all the domain and model type information about our simulation.

Still, we have made some progress. Let's look for instance at the cells:


In [37]:
cell_0 = region_model.cells[0]
print(cell_0.geo)


GeoCellData(mid_point=GeoPoint(204843.73715373065,6994695.209048475,978.344970703125),catchment_id=2305,area=209211.92418108872,ltf=LandTypeFractions(glacier=0.0,lake=0.0,reservoir=0.0,forest=0.0,unspecified=1.0))

So you can see that so far, each of the cells in the region_model contains information regarding its LandTypeFractions, geolocation, catchment_id, and area.
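Because the cells are held in a plain vector, we can already do simple bookkeeping on them with ordinary Python. A small sketch (using only the geo methods shown in this notebook, catchment_id() and mid_point()) counting cells and averaging elevation per catchment:

In [ ]:
# count cells and compute mean elevation per catchment
from collections import defaultdict
cells_per_cid = defaultdict(list)
for c in region_model.cells:
    cells_per_cid[c.geo.catchment_id()].append(c.geo.mid_point().z)
for cid_, elevations in sorted(cells_per_cid.items()):
    print('catchment {}: {} cells, mean elevation {:.0f} m'.format(cid_, len(elevations), sum(elevations) / len(elevations)))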

A particularly important attribute is region_model.region_env. This is a container for each cell that holds the "environmental timeseries", or forcing data, for the simulation. By "tabbing" from cell_0. you can see that each cell also has an env_ts attribute. These are containers customized to provide the timeseries required by the model type we selected, but there is no data in them yet. In this case we used the PTGSKModel (see the ModelDict). So for every cell in your simulation, there is a container prepared to accept the forcing data, as the next cell shows.


In [38]:
#just so we don't see 'private' attributes
print([d for d  in dir(cell_0.env_ts) if  '_'  not in d[0]]) 
region_model.size()


['init', 'precipitation', 'radiation', 'rel_hum', 'temperature', 'wind_speed']
Out[38]:
4650

Adding forcing data to the region_model

Clearly the next step is to add forcing data to our region_model object. Let's start by thinking about what kind of data we need. From above, where we looked at the env_ts attribute, it's clear that this particular model stack, PTGSKModel, requires:

  • precipitation
  • radiation
  • relative humidity (rel_hum)
  • temperature
  • wind speed

We have stored each of these variables in a separate netcdf file, each containing the observational series for a number of different stations.

Again, these files **do not represent the recommended practice**, but are *only for demonstration purposes*. The idea here is just to demonstrate with an example repository, but *you should create your own to match **your** data*.

Our goal now is to populate the region_env.

"Sources"

We use the term sources to refer to locations data may come from. You may also come across destinations. In both cases, it just means a file, database, service, or similar that is capable of providing (or receiving) data. Repositories are written to connect to sources. Following our earlier approach, we'll create another dictionary to define our data sources, but first we need to import another repository:


In [39]:
from shyft.repository.netcdf.cf_geo_ts_repository import CFDataRepository

In [40]:
ForcingData = {'sources': [
        
    {'repository': shyft.repository.netcdf.cf_geo_ts_repository.CFDataRepository,
     'params': {'epsg': 32633,
            'selection_criteria': None,
            'stations_met': 'netcdf/orchestration-testdata/precipitation.nc'},
     'types': ['precipitation']},
       
    {'repository': shyft.repository.netcdf.cf_geo_ts_repository.CFDataRepository,
     'params': {'epsg': 32633,
            'selection_criteria': None,
            'stations_met': 'netcdf/orchestration-testdata/temperature.nc'},
    'types': ['temperature']},
        
    {'params': {'epsg': 32633,
            'selection_criteria': None,
            'stations_met': 'netcdf/orchestration-testdata/wind_speed.nc'},
     'repository': shyft.repository.netcdf.cf_geo_ts_repository.CFDataRepository,
     'types': ['wind_speed']},
    
    {'repository': shyft.repository.netcdf.cf_geo_ts_repository.CFDataRepository,
     'params': {'epsg': 32633,
            'selection_criteria': None,
            'stations_met': 'netcdf/orchestration-testdata/relative_humidity.nc'},
     'types': ['relative_humidity']},
    
    {'repository': shyft.repository.netcdf.cf_geo_ts_repository.CFDataRepository,
     'params': {'epsg': 32633,
            'selection_criteria': None,
            'stations_met': 'netcdf/orchestration-testdata/radiation.nc'},
     'types': ['radiation']}]
      }

Data Repositories

In another notebook, further information will be provided regarding the repositories. For the time being, let's look at the configuration dictionary we just created. It essentially just contains a list, keyed by the name "sources". This key is recognized by some of the tools built into the Shyft orchestration, so it is recommended to use it.

Each item in the list is a dictionary for one of the source types; the keys in each dictionary are: repository, params, and types. The general idea is that, in orchestration, the object keyed by repository is a class that is instantiated by passing it the objects contained in params.

Let's repeat that. From our ForcingData dictionary, we get a list of "sources". Each of these sources names a class (a repository) that is capable of getting the source data into Shyft. Whatever parameters are required for the class to work are included in the "sources" dictionary. In our case, the params are quite simple, just a path to a netcdf file. But suppose our repository required credentials or other information for a database? This information could also be included in the params stanza of the dictionary.

You should explore the above referenced netcdf files that are available at the shyft-data git repository. These files contain the forcing data that will be used in the example simulation. Each one contains observational data from some stations in our catchment. Depending on how you write your repository, this data may be provided to Shyft in many different formats.
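For a quick peek we can open one of these forcing files directly with netCDF4, following the same pattern we used for cell_data.nc above (illustration only; during the simulation the repository does this reading for us):

In [ ]:
# inspect one of the forcing data files referenced in the ForcingData dictionary
tmp_file = os.path.join(shyftdata_dir, 'netcdf/orchestration-testdata/temperature.nc')
with Dataset(tmp_file) as nci:
    print(nci.variables.keys())
    print('number of stations:', nci.dimensions['station'].size)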

Let's explore this concept further by getting the 'temperature' data:


In [41]:
# get the temperature sources:
tmp_sources = [source for source in ForcingData['sources'] if 'temperature' in source['types']]

# in this example there is only one
t0 = tmp_sources[0]

# We will now instantiate the repository with the parameters that are provided
# in the dictionary. 
# Note the 'call' structure expects params to contain keyword arguments, and these
# can be anything you want depending on how you create your repository
tmp_repo = t0['repository'](**t0['params'])

tmp_repo is now an instance of the Shyft CFDataRepository. When Shyft sets up a simulation, this repository will provide the data by reading it directly out of the file referenced in the 'source'. But that is just one repository; in fact we defined several. Furthermore, you may have a heterogeneous collection of data sources -- if, for example, you want to get your temperature from station data but radiation from model output, you could define different repositories in the ForcingData dictionary.

Ultimately, we bundle all these repositories up into a new class called a GeoTsRepositoryCollection that we can use to populate the region_model.region_env with data.


In [42]:
# we'll actually create a collection of repositories, as we have different input types.
from shyft.repository.geo_ts_repository_collection import GeoTsRepositoryCollection

def construct_geots_repo(datasets_config, epsg=None):
    """ iterates over the different sources that are provided 
    and prepares the repository to read the data for each type"""
    geo_ts_repos = []
    src_types_to_extract = []
    for source in datasets_config['sources']:
        if epsg is not None:
            source['params'].update({'epsg': epsg})
        # note that here we are instantiating the different source repositories
        # to place in the geo_ts list 
        geo_ts_repos.append(source['repository'](**source['params']))
        src_types_to_extract.append(source['types'])
    
    return GeoTsRepositoryCollection(geo_ts_repos, src_types_per_repo=src_types_to_extract)

# instantiate the repository
geots_repo = construct_geots_repo(ForcingData)

geots_repo is now a "geographic timeseries repository", meaning that the timeseries it holds are spatially aware of their x,y,z coordinates (see CFDataRepository for details). It also has several methods. One in particular we are interested in is the get_timeseries method. However, before we can proceed, we need to define the period for the simulation.

Shyft TimeAxis

Time in Shyft is handled with specialized C++ types for computational efficiency. These are custom-built objects that are 'calendar' aware. But since most Python users like to work with datetime objects, we create a small helper function:


In [43]:
# next, create the simulation period dictionary
TimeDict = {'start_datetime': "2013-09-01T00:00:00",
           'run_time_step': 86400, # seconds, daily
           'number_of_steps': 365 # one year
           }

def time_axis_from_dict(t_dict)->api.TimeAxis:
    utc = api.Calendar()
    
    sim_start = dt.datetime.strptime(t_dict['start_datetime'], "%Y-%m-%dT%H:%M:%S")
    utc_start = utc.time(sim_start.year, sim_start.month, sim_start.day,\
                         sim_start.hour, sim_start.minute, sim_start.second)
    tstep = t_dict['run_time_step']
    nstep = t_dict['number_of_steps']
    time_axis = api.TimeAxis(utc_start, tstep, nstep)
    
    return time_axis

ta_1 = time_axis_from_dict(TimeDict)
print(f'1. {ta_1} \n  {ta_1.total_period()}')
# or shyft-wise, ready tested, precise and less effort, two lines
utc = api.Calendar() # 'Europe/Oslo' can be passed to calendar for time-zone
ta_2 = api.TimeAxis(utc.time(2013, 9, 1), api.deltahours(24), 365)
print(f'2. {ta_2} \n {ta_2.total_period()}')


1. TimeAxisFixedDeltaT(2013-09-01T00:00:00Z,86400,365) 
  [2013-09-01T00:00:00Z,2014-09-01T00:00:00Z>
2. TimeAxisFixedDeltaT(2013-09-01T00:00:00Z,86400,365) 
 [2013-09-01T00:00:00Z,2014-09-01T00:00:00Z>
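The comment above notes that a time-zone string such as 'Europe/Oslo' can be passed to the Calendar. A small sketch of the difference this makes (the tutorial data and simulation stay in UTC, so this is only an aside):

In [ ]:
# the same wall-clock start interpreted by a time-zone aware calendar vs. UTC
osl = api.Calendar('Europe/Oslo')
t_utc = utc.time(2013, 9, 1)
t_osl = osl.time(2013, 9, 1)
print('offset between the two interpretations [s]:', t_utc - t_osl)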

We now have an object that defines the time dimension for the simulation, and we will use it to initialize the region_model with the "environmental timeseries", or env_ts, data. These containers will be given data from the appropriate repositories using the get_timeseries function. Following the templates in the shyft.repository.interfaces module, you'll see that repositories should provide the capability to "screen" data based on time criteria and *optionally* geo_location criteria.


In [44]:
# we can extract our "bounding box" based on the `region_model` we set up
bbox = region_model.bounding_region.bounding_box(region_model.bounding_region.epsg())

period = ta_1.total_period() #just defined above

# required forcing data sets we want to retrieve
geo_ts_names = ("temperature", "wind_speed", "precipitation",
                              "relative_humidity", "radiation")

sources = geots_repo.get_timeseries( geo_ts_names, period, geo_location_criteria=bbox )


/Data/johnbur/Dropbox/home/Programming/workspace/shyft_workspace/shyft/shyft/repository/netcdf/cf_geo_ts_repository.py:280: RuntimeWarning: invalid value encountered in greater
  pure_arr = data[data_slice]

Now we have a new dictionary, called 'sources' that contains specialized Shyft api types specific to each forcing data type. You can look at one for example:


In [45]:
prec = sources['precipitation']
print(len(prec))


10

We can explore further and see each element is in itself an api.PrecipitationSource, which has a timeseries (ts). Recall from the first tutorial that we can easily convert the timeseries.time_axis into datetime values for plotting.

Let's plot the precip of each of the sources:


In [46]:
fig, ax = plt.subplots(figsize=(15,10))

for pr in prec:
    t,p = [dt.datetime.utcfromtimestamp(t_.start) for t_ in pr.ts.time_axis], pr.ts.values
    ax.plot(t,p, label=pr.mid_point().x) #uid is empty now, but we reserve for later use
fig.autofmt_xdate()
ax.legend(title="Precipitation Input Sources")
ax.set_ylabel("precip[mm/hr]")


Out[46]:
Text(0,0.5,'precip[mm/hr]')

Finally, the next step will take the data from the sources and connect it to our region_model.region_env class:


In [47]:
def get_region_environment(sources):
    region_env = api.ARegionEnvironment()
    region_env.temperature = sources["temperature"]
    region_env.precipitation = sources["precipitation"]
    region_env.radiation = sources["radiation"]
    region_env.wind_speed = sources["wind_speed"]
    region_env.rel_hum = sources["relative_humidity"]
    return region_env

region_model.region_env = get_region_environment(sources)

And now our forcing data is connected to the region_model. We are almost ready to run a simulation. There is just one more step. We've connected the sources to the model, but remember that Shyft is a distributed modeling framework, and we've connected point data sources (in this case). So we need to get the data from the observed points to each cell. This is done through interpolation.

Shyft Interpolation

In Shyft there are predefined routines for interpolation. In the interp_config class below one quickly recognizes the same input source type keywords that are used as keys to the params dictionary. params is simply a dictionary of dictionaries which contains the parameters used by the interpolation model that is specific for each source type.


In [48]:
from shyft.repository.interpolation_parameter_repository import InterpolationParameterRepository

class interp_config(object):
    """ a simple class to provide the interpolation parameters """

    def __init__(self):
        
        self.interp_params = {'precipitation': {'method': 'idw',
          'params': {'distance_measure_factor': 1.0,
           'max_distance': 600000.0,
           'max_members': 10,
           'scale_factor': 1.02}},
         'radiation': {'method': 'idw',
          'params': {'distance_measure_factor': 1.0,
           'max_distance': 600000.0,
           'max_members': 10}},
         'relative_humidity': {'method': 'idw',
          'params': {'distance_measure_factor': 1.0,
           'max_distance': 600000.0,
           'max_members': 10}},
         'temperature': {'method': 'btk',
          'params': {'nug': 0.5,
           'range': 200000.0,
           'sill': 25.0,
           'temperature_gradient': -0.6,
           'temperature_gradient_sd': 0.25,
           'zscale': 20.0}},
         'wind_speed': {'method': 'idw',
          'params': {'distance_measure_factor': 1.0,
           'max_distance': 600000.0,
           'max_members': 10}}}

    def interpolation_parameters(self):
        return self.interp_params

ip_conf = interp_config()
ip_repo = InterpolationParameterRepository(ip_conf)

region_model.interpolation_parameter = ip_repo.get_parameters(0) #just a '0' for now

The next step is to set the initial states of the model using our last repository. This one, the GeneratedStateRepository, will provide default values.

Now we are nearly ready to conduct a simulation. We just need to run a few methods to prepare the model and cells for the simulation. The region_model has a method called initialize_cell_environment that takes a time_axis type as input. We defined the time_axis above, so now we'll use it to initialize the model. At the same time, we'll set the initial_state. Then we can actually run a simulation!


In [49]:
from shyft.repository.generated_state_repository import GeneratedStateRepository

init_values = {'gs': {'acc_melt': 0.0,
   'albedo': 0.65,
   'alpha': 6.25,
   'iso_pot_energy': 0.0,
   'lwc': 0.1,
   'sdc_melt_mean': 0.0,
   'surface_heat': 30000.0,
   'temp_swe': 0.0},
  'kirchner': {'q': 0.01}}

   
state_generator = GeneratedStateRepository(region_model)#, init_values=init_values)

# we need the state_repository to have the same size as the model
#state_repo.n = region_model.size()
# there is only 1 state (indexed '0')
s0 = state_generator.get_state(0)
not_applied_list=region_model.state.apply_state(  # apply state set the current state according to arguments
    cell_id_state_vector=s0,  # ok, easy to get
    cids=[] # empty means apply all, if we wanted to only apply state for specific catchment-ids, this is where to put them
)
assert len(not_applied_list)==0, 'Ensure all states was matched and applied to the model'
region_model.initial_state=region_model.current_state  # now we stash the current state to the initial state

In [50]:
state_generator.find_state?

Conduct the simulation

We now have a region_model that is ready for simulation. As we discussed before, we still need to get the data from our point observations interpolated to the cells, and we need to get the env_ts of each cell populated. But all the machinery is now in place to make this happen.

To summarize, we've created:

  • region_repo, a region repository that contains information related to the region of the simulation and the model to be used in the simulation. From this we get a region_model
  • geots_repo, a geo-timeseries repository that provides a mechanism to pull the data we require from our 'sources'.
  • time_axis, created from the TimeAxisFixedDeltaT class of shyft to provide the period of simulation.
  • ip_repo, an interpolation repository which provides all the required parameters for interpolating our data to the distributed cells -- following variable specific protocols/models.
  • state_generator, a GeneratedStateRepository used to provide our simulation with an initial state.

The next step is simply to initialize the cell environment and run the interpolation. As a practice, before simulation we reset to the initial state (we're there already, but it is something you have to do before a new simulation), and then run the cells. First we'll initialize the cell environment:


In [51]:
region_model.initialize_cell_environment(ta_1)

As a habit, we use a quick "sanity check" function to see if the model is runnable. It is recommended to include such a function when you create 'run scripts'.


In [52]:
def runnable(reg_mod):
    """ returns True if model is properly configured 
    **note** this is specific depending on your model's input data requirements """
    return all((reg_mod.initial_state.size() > 0, reg_mod.time_axis.size() > 0,
                   all([len(getattr(reg_mod.region_env, attr)) > 0 for attr in
                    ("temperature", "wind_speed", "precipitation", "rel_hum", "radiation")])))

In [53]:
# run the model, e.g. as you may configure it in a script:
if runnable(region_model):
    
    region_model.interpolate(region_model.interpolation_parameter, region_model.region_env)
    region_model.revert_to_initial_state()
    region_model.run_cells()
else:
    print('Something wrong with model configuration.')

Okay, so the simulation was run. Now we may be interested in looking at some of the output. We'll take a brief summary glance in the next section, and save a deeper dive into the simulation results for another notebook.

3. Simulation results

The first step will be simply to look at the discharge results for each subcatchment within our simulation domain. For simplicity, we can use a pandas.DataFrame to collect the data from each catchment.


In [54]:
# Here we are going to extract data from the simulation.
# We start by creating a list to hold discharge for each of the subcatchments.
# Then we'll get the data from the region_model object
import pandas as pd
# mapping of internal catch ID to catchment
catchment_id_map = region_model.catchment_id_map 

# First get the time-axis which we'll use as the index for the data frame
ta = region_model.time_axis
# and convert it to datetimes
index = [dt.datetime.utcfromtimestamp(p.start) for p in ta]

# Now we'll add all the discharge series for each catchment 
data = {}
for cid in catchment_id_map:
    # get the discharge time series for the subcatchment
    q_ts = region_model.statistics.discharge([int(cid)])
    data[cid] = q_ts.values.to_numpy()

df = pd.DataFrame(data, index=index)
# we can simply use:
ax = df.plot(figsize=(20,15))
ax.legend(title="Catch. ID")
ax.set_ylabel("discharge [m3 s-1]")


Out[54]:
Text(0,0.5,'discharge [m3 s-1]')


Okay, that was simple. Let's look at the timeseries in some individual cells. The following is a bit of a contrived example, but it shows some aspects of the api. We'll plot the temperature series of all the cells in one sub-catchment, and color them by elevation. This doesn't necessarily show anything about the simulation, per se, but rather results from the interpolation step.


In [55]:
from matplotlib.cm import jet as jet
from matplotlib.colors import Normalize

# get all the cells for one sub-catchment with 'id' == 1228
c1228 = [c for c in region_model.cells if c.geo.catchment_id() == 1228]

# for plotting, create an mpl normalizer based on min,max elevation
elv = [c.geo.mid_point().z for c in c1228]
norm = Normalize(min(elv), max(elv))

#plot with line color a function of elevation
fig, ax = plt.subplots(figsize=(15,10))

# here we are cycling through each of the cells in c1228
for dat,elv in zip([c.env_ts.temperature.values for c in c1228], [c.mid_point().z for c in c1228]):
    ax.plot(dat, color=jet(norm(elv)), label=int(elv))
    
    
# the following is just to plot the legend entries and not related to Shyft
handles, labels = ax.get_legend_handles_labels()

# sort by labels
import operator
hl = sorted(zip(handles, labels),
            key=operator.itemgetter(1))
handles2, labels2 = zip(*hl)

# show legend, but only every fifth entry
ax.legend(handles2[::5], labels2[::5], title='Elevation [m]')


Out[55]:
<matplotlib.legend.Legend at 0x7f07ab1cde10>

As we would expect from the temperature kriging method, we should find higher elevations have colder temperatures. As an exercise you could explore this relationship using a scatter plot.
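A sketch of that exercise, using the same c1228 cells and the .values.to_numpy() accessor used elsewhere in this notebook, might look like this:

In [ ]:
# suggested exercise: mean interpolated cell temperature vs. cell elevation for sub-catchment 1228
mean_temp = [c.env_ts.temperature.values.to_numpy().mean() for c in c1228]
elev = [c.geo.mid_point().z for c in c1228]

fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(elev, mean_temp, marker='.')
ax.set_xlabel('cell elevation [m]')
ax.set_ylabel('mean temperature [C]')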

Now we're going to create a function that will read initial states from the initial_state_repo. In practice, this is already done by the ConfigSimulator, but to demonstrate lower level functions, we'll reset the states of our region_model:


In [56]:
# create a function to read the states from the state repository
def get_init_state_from_repo(initial_state_repo_, region_model_id_=None, timestamp=None):
    state_id = 0
    if hasattr(initial_state_repo_, 'n'):  # No stored state, generated on-the-fly
        initial_state_repo_.n = region_model.size()
    else:
        states = initial_state_repo_.find_state(
            region_model_id_criteria=region_model_id_,
            utc_period_criteria=timestamp)
        if len(states) > 0:
            state_id = states[0].state_id  # most_recent_state i.e. <= start time
        else:
            raise Exception('No initial state matching criteria.')
    return initial_state_repo_.get_state(state_id)
 
init_state = get_init_state_from_repo(state_generator, timestamp=region_model.time_axis.start)

Don't worry too much about the function for now, but do take note of the init_state object that we created. This is another container, this time it is a class that contains PTGSKStateWithId objects, which are specific to the model stack implemented in the simulation (in this case PTGSK). If we explore an individual state object, we'll see init_state contains, for each cell in our simulation, the state variables for each 'method' of the method stack.

Let's look more closely:


In [57]:
def print_pub_attr(obj):
    #only public attributes
    print(f'{obj.__class__.__name__}:\t', [attr for attr in dir(obj) if attr[0] != '_'])
    
print(len(init_state))
init_state_cell0 = init_state[0] 
# the identifier
print_pub_attr(init_state_cell0.id)
# gam snow states
print_pub_attr(init_state_cell0.state.gs)

#init_state_cell0.kirchner states
print_pub_attr(init_state_cell0.state.kirchner)


4650
CellStateId:	 ['area', 'cid', 'x', 'y']
GammaSnowState:	 ['acc_melt', 'albedo', 'alpha', 'iso_pot_energy', 'lwc', 'sdc_melt_mean', 'surface_heat', 'temp_swe']
KirchnerState:	 ['q']

Summary

We have now explored the region_model and looked at how to instantiate a region_model by using a api.ARegionEnvironment, containing a collection of timeseries sources, and passing an api.InterpolationParameter class containing the parameters to use for the data interpolation algorithms. The interpolation step "populated" our cells with data from the point sources.

The cells each contain all the information related to the simulation (their own timeseries, env_ts; their own model parameters, parameter; and other attributes and methods). In future tutorials we'll work with the cells' individual "resource collector" (.rc) and "state collector" (.sc) attributes.