In [1]:
%run talktools
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_style('whitegrid')
from IPython.display import Image, display


Open Source can improve current scientific practice

IPython notebooks are a great tool to support this

Sophie Balemans and Stijn Van Hoey

EGU 2015 - PICO session on Open Source Computing in Hydrology

The ideals of (hydrological) science

  • Provide verifiable answers about water and solutions to water-related problems.
  • The validation of these results by reproduction.
  • An altruistic, collective enterprise for humanity's benefit.

F Perez

The reality of (hydrological) science

  • Provide verifiable answers about water and solutions to water-related problems.
    • The pursuit of highly cited papers for your CV.
  • The validation of our results by reproduction.
    • Validation by convincing journal referees who didn't see your code or data.
  • An altruistic, collective enterprise for humanity's benefit.
    • A deadly race to outrun your colleagues in front of the bear of funding.

F Perez

Free and Open Source Software (FOSS) in this context

  • Open, collaborative by definition.

    • Industrial competition can coexist...
  • Continuous, public process.

    • Distributed credit.
    • Open peer review.
  • Reproducible by necessity.

  • Public bug tracking.

  • The use of licenses is essential (CC, BSD, GPL,...)

F Perez

FOSS $\neq$ free work

All waiting for the developer...

...or all developers?

Graveyard of good intentions

Towards continuous and collaborative

What do we need:

  • Training of students, PhDs,...
    Creating a future generation of scientists with reproducibility as default
    Providing version control, script-based development, database management... in the curricula
  • Continuous funding of open source development
    Paid positions to maintain and develop open source projects
  • Tools that facilitate a reproducible workflow
    knitr, IPython Notebook, git, RunMyCode, VisTrails, Authorea,...

IPython Notebook

Reproducible science

  • Reproducibility at publication time? That is TOO late!

We need to embed the entire lifecycle of a scientific idea:

  1. exploratory stuff
  2. (collaborative) development
  3. production (simulations on HPC, data visualisation,...)
  4. publication (with reproducible results)
  5. teach about it
  6. Go back to 1.

IPython (Jupyter!) Notebook can support each of these stages

IPython (Jupyter!) Notebook... (this is a notebook!)

Minimize effort between analysis and sharing

  • Interactive shell for data-analysis and exploration
  • Interaction between languages (R, Julia,...)
  • Parallel computing
  • Conversion of ipynb to LaTeX, PDF, HTML, slides, publications, books,...
  • Loading of images, websites, widgets,...
  • ...

Check it out on https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks

Scripts, so it can be version controlled!
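
A notebook file (.ipynb) is plain-text JSON, which is what makes it trackable with git. The snippet below is a minimal sketch, assuming the version 4 notebook layout; the cell contents are made up for illustration and are not taken from this talk.

```python
import json

# A minimal notebook in the (assumed) nbformat 4 layout: cells plus metadata.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": "# Rainfall-runoff demo"},
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": "print('hello hydrology')",
        },
    ],
}

# Serialise with a stable key order and indentation so that diffs stay small.
text = json.dumps(notebook, indent=1, sort_keys=True)

# Parsing the text again gives back the same structure.
restored = json.loads(text)
```

Because the whole document is text, a change to one cell shows up as a change to a few lines in a diff.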

Recap 5. teach about it

  • The same file can be used to do analysis, create course notes and retrieve slides using nbconvert
  • Students can interactively work on their notebook
  • Several useful features, e.g.

    interactive widgets

Conceptual rainfall-runoff model



In [2]:
# %load PDM_HPC.py

In [3]:
pars = pd.read_csv('data/example2_PDM_parameters.txt', header=0, sep=',', index_col=0)
measured = pd.read_csv('data/example_PDM_measured.txt', header=0, sep='\t', decimal='.', index_col=0)
modelled = pd.read_csv('data/example2_PDM_outputs.txt', header=0, sep=',', index_col=0).T

In [4]:
modeloutput1 = pd.DataFrame(modelled.iloc[:,0].values, index=measured.index)

Measured vs modelled discharge:


In [5]:
fig, ax = plt.subplots(figsize=(10, 6))
p1 = measured.plot(ax=ax, label='measured')
p2 = modeloutput1.plot(ax=ax, label='modelled')
t = ax.set_ylabel(r'Q m$^3$s$^{-1}$')
plt.legend(['measured', 'modelled'])


Out[5]:
<matplotlib.legend.Legend at 0x7f8080e98190>

In [6]:
from scatter_hist2 import create_scatterhist, create_seasonOF

names = pars.columns
time = np.array(measured.index)
modelled.index = time

# Map each parameter name to its column position
pars_name = {name: i for i, name in enumerate(names)}

Exploring the parameter space

  • Simulating 20000 Monte-Carlo runs
  • Sampling from uniform distribution

  • Parallel calculation within IPython notebook
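
The sampling step above can be sketched as follows. This is an illustration only: the parameter names, their ranges, the rainfall series and the `toy_model` function are all hypothetical stand-ins (the PDM code itself is not shown here), and the thread pool is merely a placeholder for whatever parallel backend is actually used.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Hypothetical parameter ranges (lower, upper); not the PDM ranges of the talk.
BOUNDS = {"k": (0.05, 0.5), "s_max": (50.0, 300.0)}
RAIN = [0.0, 12.0, 5.0, 0.0, 0.0, 20.0, 3.0, 0.0]  # made-up rainfall series (mm)


def sample_parameters(n_runs, bounds, seed=42):
    """Draw n_runs parameter sets, each parameter uniform within its bounds."""
    rng = random.Random(seed)
    return [
        {name: rng.uniform(low, high) for name, (low, high) in bounds.items()}
        for _ in range(n_runs)
    ]


def toy_model(params, rain=RAIN):
    """Stand-in model: a single linear reservoir with a capped storage."""
    storage, flow = 0.0, []
    for p in rain:
        storage = min(storage + p, params["s_max"])
        q = params["k"] * storage
        storage -= q
        flow.append(q)
    return flow


parameter_sets = sample_parameters(200, BOUNDS)

# Each run is independent, so the runs can be mapped over a pool of workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    simulations = list(pool.map(toy_model, parameter_sets))
```

For a CPU-bound model a process pool or a cluster backend would be the more realistic choice; the structure (sample once, map the model over the samples) stays the same.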


In [7]:
objective_functions = create_seasonOF(modelled, measured)
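
The internals of `create_seasonOF` are not shown in this notebook. One way such per-season objective functions could be computed is sketched below, under two assumptions: seasons are the meteorological seasons of the northern hemisphere, and RRMSE is the RMSE divided by the mean observed value. All function names and data are hypothetical.

```python
import math
from datetime import date, timedelta

# Assumed meteorological seasons (northern hemisphere), keyed by month number.
SEASON_OF_MONTH = {12: "Winter", 1: "Winter", 2: "Winter",
                   3: "Spring", 4: "Spring", 5: "Spring",
                   6: "Summer", 7: "Summer", 8: "Summer",
                   9: "Autumn", 10: "Autumn", 11: "Autumn"}


def objective_functions_for(observed, simulated):
    """Return SSE, RMSE and RRMSE for two equally long series."""
    sse = sum((s - o) ** 2 for o, s in zip(observed, simulated))
    rmse = math.sqrt(sse / len(observed))
    rrmse = rmse / (sum(observed) / len(observed))  # assumed: RMSE / mean observed
    return {"SSE": sse, "RMSE": rmse, "RRMSE": rrmse}


def seasonal_objective_functions(dates, observed, simulated):
    """Group the time steps by season and score each group separately."""
    groups = {}
    for day, obs, sim in zip(dates, observed, simulated):
        obs_list, sim_list = groups.setdefault(SEASON_OF_MONTH[day.month], ([], []))
        obs_list.append(obs)
        sim_list.append(sim)
    scores = {season: objective_functions_for(o, s) for season, (o, s) in groups.items()}
    scores["All season"] = objective_functions_for(observed, simulated)
    return scores


# Made-up data: one value per day for a year, simulation biased by +1 everywhere.
demo_dates = [date(2014, 1, 1) + timedelta(days=i) for i in range(365)]
demo_observed = [10.0 + (i % 7) for i in range(365)]
demo_simulated = [value + 1.0 for value in demo_observed]
demo_scores = seasonal_objective_functions(demo_dates, demo_observed, demo_simulated)
```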

Visualisation in 2D scatter plot


In [8]:
from scatter_hist_season import create_scatterhist

In [9]:
scatter = create_scatterhist(pars, 2, 1, objective_functions, names, 
                                            objective_function='SSE', 
                                            threshold=0.4,  
                                            season = 'Winter')


Current threshold = 674.854206156
Number of behavioural parametersets = 7910 out of 20000

In [10]:
scatter = create_scatterhist(pars, 2, 1, objective_functions, names, 
                                            objective_function='SSE', 
                                            threshold=0.4,
                                            season = 'Spring')


Current threshold = 105.663890284
Number of behavioural parametersets = 9592 out of 20000

What about the model?

  • Parameter boundaries correct?
  • Optimal parameter sets change periodically... Correct model structure?
  • NOT an optimization tool!

The function

  • Select parameter sets based on:
    1. Objective function (SSE, RMSE, RRMSE)
    2. Time period of interest (whole year or specific season)
    3. Relative threshold (scaled between 0 and 1)
  • Visualisation of a 2D parameter response surface of the selected parameter sets, together with histograms
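
How the relative threshold is turned into an absolute cutoff is not shown here. A minimal sketch, assuming the threshold is mapped linearly onto the range of objective function values (lower is better); the function name and the numbers are hypothetical:

```python
def select_behavioural(scores, threshold):
    """Keep the indices of runs whose score is at or below the cutoff.

    Assumption: the relative threshold (0..1) is mapped linearly onto the
    range of scores, so 0 keeps only the best run(s) and 1 keeps every run.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie between 0 and 1")
    best, worst = min(scores), max(scores)
    cutoff = best + threshold * (worst - best)
    behavioural = [i for i, score in enumerate(scores) if score <= cutoff]
    return cutoff, behavioural


# Made-up SSE values for six model runs.
sse_values = [120.0, 80.0, 300.0, 95.0, 500.0, 210.0]
cutoff, behavioural = select_behavioural(sse_values, threshold=0.4)
print("Current threshold =", cutoff)
print("Number of behavioural parameter sets =", len(behavioural), "out of", len(sse_values))
```

The selected indices can then be used to subset the parameter table before plotting the scatter and the histograms.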

More interactive?

command:

interact(...)

In [12]:
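# Loading interact functionality
# (from IPython 4 / Jupyter onwards: from ipywidgets import interact, fixed)
from IPython.html.widgets import interact, fixed
# Schematic call: the function to wrap, plus one keyword argument per widget
interact(create_scatterhist, *args, **kwargs)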

  • input list => dropdown

      objective_function = ['SSE', 'RMSE', 'RRMSE']
      season = ['Winter', 'Summer', 'Spring', 'Autumn', 'All season']
  • input tuple (min, max, step) => slider

      threshold = (0, 1, 0.005)
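
Put together, the widget abbreviations could be used as in the sketch below. `describe_selection` is a hypothetical stand-in for `create_scatterhist` (whose argument names are not shown here), and the example assumes the `ipywidgets` package, the current home of `interact`, when run inside a notebook.

```python
def describe_selection(objective_function="SSE", season="Winter", threshold=0.4):
    """Hypothetical stand-in for create_scatterhist: report the chosen settings."""
    return "{} / {} / threshold={:.3f}".format(objective_function, season, threshold)


# Widget abbreviations: a list becomes a dropdown, a (min, max, step) tuple a slider.
widget_spec = {
    "objective_function": ["SSE", "RMSE", "RRMSE"],
    "season": ["Winter", "Summer", "Spring", "Autumn", "All season"],
    "threshold": (0, 1, 0.005),
}

try:
    from IPython import get_ipython
    in_ipython = get_ipython() is not None
except ImportError:
    in_ipython = False

if in_ipython:
    # Assumes ipywidgets is installed alongside the notebook.
    from ipywidgets import interact
    interact(describe_selection, **widget_spec)
else:
    # Outside a notebook, just call the function once with its defaults.
    print(describe_selection())
```

Arguments that should not become widgets, such as the parameter table, can be wrapped in `fixed(...)` so that they are passed through unchanged.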

IPython notebook...

