Conservative Estimation using a Grid Seach Minimization

This notebook illustrates the different steps for a conservative estimation using a grid search minimization.

Classic Libraries

In [1]:
import openturns as ot
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline
%load_ext autoreload
%autoreload 2

random_state = 123
np.random.seed(random_state)

Additive model

The first example of conservative estimation consider an additive model $\eta : \mathbb R^d \rightarrow \mathbb R$ with Gaussian margins. The objectives are to estimate a quantity of interest $\mathcal C(Y)$ of the model output distribution. Unfortunately, the dependence structure is unknown. In order to be conservative we aim to give bounds to $\mathcal C(Y)$.

The model

This example consider the simple additive example.


In [2]:
from depimpact.tests import func_sum
help(func_sum)


Help on function func_sum in module depimpact.tests.test_functions:

func_sum(x, a=None)
    Additive weighted model function.
    
    Parameters
    ----------
    x : np.ndarray
        The input values.
    a : np.ndarray
        The input coefficients.
        
    Returns
    -------
        y : a.x^t

Dimension 2

We consider the problem in dimension $d=2$ and a number of pairs $p=1$ for gaussian margins.


In [3]:
dim = 2
margins = [ot.Normal()]*dim

Copula families

We consider a gaussian copula for this first example


In [4]:
families = np.zeros((dim, dim), dtype=int)
families[1, 0] = 1

Estimations

We create an instance of the main class for a conservative estimate.


In [6]:
from depimpact import ConservativeEstimate

quant_estimate = ConservativeEstimate(model_func=func_sum, margins=margins, families=families)

First, we compute the quantile at independence


In [6]:
n = 1000
indep_result = quant_estimate.independence(n_input_sample=n, random_state=random_state)

We aim to minimize the output quantile. To do that, we create a q_func object from the function quantile_func to associate a probability $\alpha$ to a function that computes the empirical quantile from a given sample.


In [7]:
from depimpact import quantile_func
alpha = 0.05
q_func = quantile_func(alpha)
indep_result.q_func = q_func

The computation returns a DependenceResult instance. This object gather the informations of the computation. It also computes the output quantity of interest (which can also be changed).


In [8]:
sns.jointplot(indep_result.input_sample[:, 0], indep_result.input_sample[:, 1]);



In [9]:
h = sns.distplot(indep_result.output_sample_id, axlabel='Model output', label="Output Distribution")
plt.plot([indep_result.quantity]*2, h.get_ylim(), label='Quantile at %d%%' % (alpha*100))
plt.legend(loc=0)
print('Output quantile :', indep_result.quantity)


Output quantile : -2.3035580191761107

A boostrap can be done on the output quantity


In [10]:
indep_result.compute_bootstrap(n_bootstrap=5000)

And we can plot it


In [11]:
sns.distplot(indep_result.bootstrap_sample, axlabel='Output quantile');



In [12]:
ci = [0.025, 0.975]
quantity_ci = indep_result.compute_quantity_bootstrap_ci(ci)

h = sns.distplot(indep_result.output_sample_id, axlabel='Model output', label="Output Distribution")
plt.plot([indep_result.quantity]*2, h.get_ylim(), 'g-', label='Quantile at %d%%' % (alpha*100))
plt.plot([quantity_ci[0]]*2, h.get_ylim(), 'g--', label='%d%% confidence intervals' % ((1. - (ci[0] + 1. - ci[1]))*100))
plt.plot([quantity_ci[1]]*2, h.get_ylim(), 'g--')
plt.legend(loc=0)
print('Quantile at independence: %.2f with a C.O.V at %.1f %%' % (indep_result.boot_mean, indep_result.boot_cov))


Quantile at independence: -2.29 with a C.O.V at 0.0 %

Grid Search Approach

Firstly, we consider a grid search approach in order to compare the perfomance with the iterative algorithm. The discretization can be made on the parameter space or on other concordance measure such as the kendall's Tau. This below example shows a grid-search on the parameter space.


In [27]:
K = 20
n = 10000
grid_type = 'lhs'
dep_measure = 'parameter'
grid_result = quant_estimate.gridsearch(n_dep_param=K, n_input_sample=n, grid_type=grid_type, dep_measure=dep_measure, 
                                        random_state=random_state)

The computation returns a ListDependenceResult which is a list of DependenceResult instances and some bonuses.


In [28]:
print('The computation did %d model evaluations.' % (grid_result.n_evals))


The computation did 200000 model evaluations.

Lets set the quantity function and search for the minimum among the grid results.


In [29]:
grid_result.q_func = q_func
min_result = grid_result.min_result
print('Minimum quantile: {} at param: {}'.format(min_result.quantity, min_result.dep_param))


Minimum quantile: -3.270045262294147 at param: [0.9498828008575777]

We can plot the result in grid results. The below figure shows the output quantiles in function of the dependence parameters.


In [30]:
plt.plot(grid_result.dep_params, grid_result.quantities, '.', label='Quantiles')
plt.plot(min_result.dep_param[0], min_result.quantity, 'ro', label='minimum')
plt.xlabel('Dependence parameter')
plt.ylabel('Quantile value')
plt.legend(loc=0);


As for the individual problem, we can do a boostrap also, for each parameters. Because we have $K$ parameters, we can do a bootstrap for the $K$ samples, compute the $K$ quantiles for all the bootstrap and get the minimum quantile for each bootstrap.


In [32]:
grid_result.compute_bootstraps(n_bootstrap=500)
boot_min_quantiles = grid_result.bootstrap_samples.min(axis=0)
boot_argmin_quantiles = grid_result.bootstrap_samples.argmin(axis=0).ravel().tolist()
boot_min_params = [grid_result.dep_params[idx][0] for idx in boot_argmin_quantiles]

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
sns.distplot(boot_min_quantiles, axlabel="Minimum quantiles", ax=axes[0])
sns.distplot(boot_min_params, axlabel="Parameters of the minimum", ax=axes[1])


C:\Users\naz-probook\Anaconda3\lib\site-packages\statsmodels\nonparametric\kde.py:494: RuntimeWarning: invalid value encountered in true_divide
  binned = fast_linbin(X,a,b,gridsize)/(delta*nobs)
C:\Users\naz-probook\Anaconda3\lib\site-packages\statsmodels\nonparametric\kdetools.py:34: RuntimeWarning: invalid value encountered in double_scalars
  FAC1 = 2*(np.pi*bw/RANGE)**2
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x23c800d4a20>

For the parameter that have the most occurence for the minimum, we compute its bootstrap mean.


In [33]:
# The parameter with most occurence
boot_id_min = max(set(boot_argmin_quantiles), key=boot_argmin_quantiles.count)
boot_min_result = grid_result[boot_id_min]

boot_mean = boot_min_result.bootstrap_sample.mean()
boot_std = boot_min_result.bootstrap_sample.std()
print('Worst Quantile: {} at {} with a C.O.V of {} %'.format(boot_min_result.boot_mean, min_result.dep_param, boot_min_result.boot_cov*100.))


Worst Quantile: -3.2667239514489337 at [0.9498828008575777] with a C.O.V of 1.1714362901728483 %
Kendall's Tau

An interesting feature is to convert the dependence parameters to Kendall's Tau values.


In [34]:
plt.plot(grid_result.kendalls, grid_result.quantities, '.', label='Quantiles')
plt.plot(min_result.kendall_tau, min_result.quantity, 'ro', label='Minimum quantile')
plt.xlabel("Kendall's tau")
plt.ylabel('Quantile')
plt.legend(loc=0);


As we can see, the bounds

With bounds on the dependencies

An interesting option in the ConservativeEstimate class is to bound the dependencies, due to some prior informations.


In [35]:
bounds_tau = np.asarray([[0., 0.7], [0.1, 0.]])
quant_estimate.bounds_tau = bounds_tau
K = 20
n = 10000
grid_type = 'lhs'
grid_result = quant_estimate.gridsearch(n_dep_param=K, n_input_sample=n, grid_type=grid_type, random_state=random_state)

In [36]:
grid_result.q_func = q_func
min_result = grid_result.min_result
print('Minimum quantile: {} at param: {}'.format(min_result.quantity, min_result.dep_param))


Minimum quantile: -3.3076218755118094 at param: [ 0.98005602]

In [37]:
plt.plot(grid_result.dep_params, grid_result.quantities, '.', label='Quantiles')
plt.plot(min_result.dep_param[0], min_result.quantity, 'ro', label='minimum')
plt.xlabel('Dependence parameter')
plt.ylabel('Quantile value')
plt.legend(loc=0);


Saving the results

It is usefull to save the result in a file to load it later and compute other quantities or anything you need!


In [38]:
filename = './result.hdf'
grid_result.to_hdf(filename)


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-38-473176a44a41> in <module>()
      1 filename = './result.hdf'
----> 2 grid_result.to_hdf(filename)

c:\users\naz-probook\onedrive\git-repo\impact-of-dependence\depimpact\conservative.py in to_hdf(self, path_or_buf, input_names, output_names, verbose, with_input_sample)
   1230                         hdf_store.attrs['Fixed Parameters'] = self.fixed_params
   1231                         hdf_store.attrs['Run Type'] = self.run_type
-> 1232                         hdf_store.attrs['Input Names'] = input_names
   1233                         hdf_store.attrs['Output Names'] = output_names
   1234 

h5py\_objects.pyx in h5py._objects.with_phil.wrapper()

h5py\_objects.pyx in h5py._objects.with_phil.wrapper()

~\Anaconda3\lib\site-packages\h5py\_hl\attrs.py in __setitem__(self, name, value)
     93         use the methods create() and modify().
     94         """
---> 95         self.create(name, data=value, dtype=base.guess_dtype(value))
     96 
     97     @with_phil

~\Anaconda3\lib\site-packages\h5py\_hl\attrs.py in create(self, name, data, shape, dtype)
    169             # Make HDF5 datatype and dataspace for the H5A calls
    170             if use_htype is None:
--> 171                 htype = h5t.py_create(original_dtype, logical=True)
    172                 htype2 = h5t.py_create(original_dtype)  # Must be bit-for-bit representation rather than logical
    173             else:

h5py\h5t.pyx in h5py.h5t.py_create()

h5py\h5t.pyx in h5py.h5t.py_create()

h5py\h5t.pyx in h5py.h5t.py_create()

TypeError: No conversion path for dtype: dtype('<U3')

In [26]:
from dependence import ListDependenceResult
load_grid_result = ListDependenceResult.from_hdf(filename, q_func=q_func, with_input_sample=False)


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-26-f4ce0ceb49d1> in <module>()
      1 from dependence import ListDependenceResult
----> 2 load_grid_result = ListDependenceResult.from_hdf(filename, q_func=q_func, with_input_sample=False)

c:\users\naz-probook\onedrive\git-repo\dep-impact\dependence\dependence.py in from_hdf(cls, filepath_or_buffer, id_of_experiment, output_id, with_input_sample, q_func)
   1253                 # All groups of experiments are loaded and concatenated
   1254                 list_index = hdf_store.keys()
-> 1255                 list_index.remove('dependence_params')
   1256             else:
   1257                 # Only the specified experiment is loaded

AttributeError: 'KeysView' object has no attribute 'remove'

In [27]:
np.testing.assert_array_equal(grid_result.output_samples, load_grid_result.output_samples)


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-27-1644477fd2bb> in <module>()
----> 1 np.testing.assert_array_equal(grid_result.output_samples, load_grid_result.output_samples)

NameError: name 'load_grid_result' is not defined

In [28]:
import os
os.remove(filename)

Taking the extreme values of the dependence parameter

If the output quantity of interest seems to have a monotonicity with the dependence parameter, it is better to directly take the bounds of the dependence problem. Obviously, the minimum should be at the edges of the design space


In [43]:
K = None
n = 1000
grid_type = 'vertices'
grid_result = quant_estimate.gridsearch(n_dep_param=K, n_input_sample=n, grid_type=grid_type, random_state=random_state)

In [49]:



Kendall's Tau : [ 0.15643447  0.89100652], Quantile: [-2.68011394 -3.2314678 ]

In [50]:
grid_result.q_func = q_func
print("Kendall's Tau : {}, Quantile: {}".format(grid_result.kendalls.ravel(), grid_result.quantities))


Kendall's Tau : [ 0.15643447  0.89100652], Quantile: [-2.68011394 -3.2314678 ]

In [53]:
from depimpact.plots import matrix_plot_input
matrix_plot_input(grid_result.min_result);


Higher Dimension

We consider the problem in dimension $d=5$.


In [54]:
dim = 5
quant_estimate.margins = [ot.Normal()]*dim


Don't forget to change the family matrix.
Don't forget to change the bounds matrix

Copula families with one dependent pair

We consider a gaussian copula for this first example, but for the moment only one pair is dependent.


In [55]:
families = np.zeros((dim, dim), dtype=int)
families[2, 0] = 1
quant_estimate.families = families
families


Dont't foget to update the bounds matrix
Out[55]:
array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]])

In [56]:
quant_estimate.bounds_tau = None
quant_estimate.bounds_tau


Out[56]:
array([[ 0.  ,  0.  ,  0.99,  0.  ,  0.  ],
       [ 0.  ,  0.  ,  0.  ,  0.  ,  0.  ],
       [-0.99,  0.  ,  0.  ,  0.  ,  0.  ],
       [ 0.  ,  0.  ,  0.  ,  0.  ,  0.  ],
       [ 0.  ,  0.  ,  0.  ,  0.  ,  0.  ]])

We reset the families and bounds for the current instance. (I don't want to create a new instance, just to check if the setters are good).


In [57]:
quant_estimate.vine_structure


Out[57]:
array([[3, 0, 0, 0, 0],
       [5, 2, 0, 0, 0],
       [4, 5, 4, 0, 0],
       [1, 4, 5, 1, 0],
       [2, 1, 1, 5, 5]])

Let's do the grid search to see


In [59]:
K = 20
n = 10000
grid_type = 'vertices'
grid_result = quant_estimate.gridsearch(n_dep_param=K, n_input_sample=n, grid_type=grid_type, random_state=random_state)

The quantile is lower compare to the problem of dimension 1. Indeed, there is more variables, more uncertainty, so a larger deviation of the output.


In [64]:
grid_result.q_func = q_func
min_result = grid_result.min_result
print('Worst Quantile: {} at {}'.format(min_result.quantity, min_result.dep_param))
matrix_plot_input(min_result)


Worst Quantile: -4.273422229577796 at [ 0.99999998]
Out[64]:
<seaborn.axisgrid.PairGrid at 0x23380a91f28>

In [67]:
plt.plot(grid_result.dep_params, grid_result.quantities, '.', label='Quantiles')
plt.plot(min_result.dep_param[0], min_result.quantity, 'ro', label='Minimum')
plt.xlabel('Dependence parameter')
plt.ylabel('Quantile value')
plt.legend(loc=0);


Copula families with all dependent pairs

We consider a gaussian copula for this first example, but for the moment only one pair is dependent.


In [68]:
families = np.zeros((dim, dim), dtype=int)
for i in range(1, dim):
    for j in range(i):
        families[i, j] = 1

In [69]:
quant_estimate.margins = margins
quant_estimate.families = families
quant_estimate.vine_structure = None
quant_estimate.bounds_tau = None
quant_estimate.bounds_tau


Don't forget to change the family matrix.
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-69-dc57b31daec4> in <module>()
----> 1 quant_estimate.margins = margins
      2 quant_estimate.families = families
      3 quant_estimate.vine_structure = None
      4 quant_estimate.bounds_tau = None
      5 quant_estimate.bounds_tau

c:\users\naz-probook\onedrive\git-repo\dep-impact\dependence\dependence.py in margins(self, margins)
    457                     print("Don't forget to change the R-vine array")
    458                 else:
--> 459                     self.vine_structure = None
    460         if hasattr(self, '_bounds_tau'):
    461             if self._bounds_tau.shape[0] != self._input_dim:

c:\users\naz-probook\onedrive\git-repo\dep-impact\dependence\dependence.py in vine_structure(self, structure)
    604             if len(listed_pairs) > 0:
    605             #if False:
--> 606                 pairs_iter_id = [get_pair_id(dim, pair, with_plus=False) for pair in listed_pairs]
    607                 pairs_by_levels = get_pairs_by_levels(dim, pairs_iter_id)
    608                 structure = get_possible_structures(dim, pairs_by_levels)[1]

c:\users\naz-probook\onedrive\git-repo\dep-impact\dependence\dependence.py in <listcomp>(.0)
    604             if len(listed_pairs) > 0:
    605             #if False:
--> 606                 pairs_iter_id = [get_pair_id(dim, pair, with_plus=False) for pair in listed_pairs]
    607                 pairs_by_levels = get_pairs_by_levels(dim, pairs_iter_id)
    608                 structure = get_possible_structures(dim, pairs_by_levels)[1]

c:\users\naz-probook\onedrive\git-repo\dep-impact\dependence\utils.py in get_pair_id(dim, pair, with_plus)
    972         pair[1] += 1
    973 
--> 974     pair_id = np.where((pairs == pair).all(axis=1))[0][0]
    975     return pair_id
    976 

IndexError: index 0 is out of bounds for axis 0 with size 0

In [71]:
K = 100
n = 1000
grid_type = 'lhs'
grid_result = quant_estimate.gridsearch(n_dep_param=K, n_input_sample=n, grid_type=grid_type, random_state=random_state)


---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-71-90f292a31187> in <module>()
      2 n = 1000
      3 grid_type = 'lhs'
----> 4 grid_result = quant_estimate.gridsearch(n_dep_param=K, n_input_sample=n, grid_type=grid_type, random_state=random_state)

c:\users\naz-probook\onedrive\git-repo\dep-impact\dependence\dependence.py in gridsearch(self, n_dep_param, n_input_sample, grid_type, dep_measure, lhs_grid_criterion, keep_input_samples, load_grid, save_grid, use_sto_func, random_state)
    212             if keep_input_samples:
    213                 output_samples, input_samples = self.run_stochastic_models(
--> 214                         params, n_input_sample, return_input_samples=keep_input_samples)
    215             else:
    216                 output_samples = self.run_stochastic_models(

c:\users\naz-probook\onedrive\git-repo\dep-impact\dependence\dependence.py in run_stochastic_models(self, params, n_input_sample, return_input_samples, random_state)
    253         for param in params:
    254             full_param = np.zeros((self._corr_dim, ))
--> 255             full_param[self._pair_ids] = param
    256             full_param[self._fixed_pairs_ids] = self._fixed_params_list
    257             intput_sample = self._get_sample(full_param, n_input_sample)

IndexError: index 1 is out of bounds for axis 0 with size 1

In [153]:
min_result = grid_result.min_result
print('Worst Quantile: {0} at {1}'.format(min_result.quantity, min_result.dep_param))


Worst Quantile: -10.8510465381 at [-0.97636018  0.59525763 -0.71252831  0.99805621  0.99805621  0.43584383
 -0.94199303  0.99225191  0.93110039  0.59525763]

With one fixed pair


In [72]:
families[3, 2] = 0
quant_estimate = ConservativeEstimate(model_func=func_sum, margins=margins, families=families)


Don't forget to change the margins.
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-72-a6d6fc4e3e3a> in <module>()
      1 families[3, 2] = 0
----> 2 quant_estimate = ConservativeEstimate(model_func=func_sum, margins=margins, families=families)

c:\users\naz-probook\onedrive\git-repo\dep-impact\dependence\dependence.py in __init__(self, model_func, margins, families, fixed_params, bounds_tau, vine_structure, copula_type)
     92         self.margins = margins
     93         self.families = families
---> 94         self.bounds_tau = bounds_tau
     95         self.fixed_params = fixed_params
     96         self.vine_structure = vine_structure

c:\users\naz-probook\onedrive\git-repo\dep-impact\dependence\dependence.py in bounds_tau(self, value)
    653             bounds_tau = np.zeros((dim, dim))
    654             for i, j in self._pairs:
--> 655                 bounds_tau[i, j], bounds_tau[j, i] = get_tau_interval(self._families[i, j])
    656             self._custom_bounds_tau = False
    657         elif isinstance(value, str):

IndexError: index 2 is out of bounds for axis 0 with size 2

In [73]:
K = 100
n = 10000
grid_type = 'lhs'
grid_result = quant_estimate.gridsearch(n_dep_param=K, n_input_sample=n, grid_type=grid_type, 
                                                 q_func=q_func, random_state=random_state)


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-73-7ac333112d45> in <module>()
      3 grid_type = 'lhs'
      4 grid_result = quant_estimate.gridsearch(n_dep_param=K, n_input_sample=n, grid_type=grid_type, 
----> 5                                                  q_func=q_func, random_state=random_state)

TypeError: gridsearch() got an unexpected keyword argument 'q_func'

In [74]:
min_result = grid_result.min_result
print('Worst Quantile: {0} at {1}'.format(min_result.quantity, min_result.dep_param))


Worst Quantile: -4.273422229577796 at [ 0.99999998]

In [75]:
grid_result.vine_structure


Out[75]:
array([[3, 0, 0, 0, 0],
       [5, 2, 0, 0, 0],
       [4, 5, 4, 0, 0],
       [1, 4, 5, 1, 0],
       [2, 1, 1, 5, 5]])

In [77]:
from depimpact.plots import matrix_plot_input

matrix_plot_input(min_result)


Out[77]:
<seaborn.axisgrid.PairGrid at 0x23384e28470>

Save the used grid and load it again


In [78]:
K = 100
n = 1000
grid_type = 'lhs'
grid_result_1 = quant_estimate.gridsearch(n_dep_param=K, n_input_sample=n, grid_type=grid_type, save_grid=True, grid_path='./output')


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-78-7e2f2a79afd0> in <module>()
      3 grid_type = 'lhs'
      4 grid_result_1 = quant_estimate.gridsearch(n_dep_param=K, n_input_sample=n, grid_type=grid_type, 
----> 5                                                  q_func=q_func, save_grid=True, grid_path='./output')

TypeError: gridsearch() got an unexpected keyword argument 'q_func'

In [79]:
grid_result_2 = quant_estimate.gridsearch(n_dep_param=K, n_input_sample=n, grid_type=grid_type, 
                                                 q_func=q_func, use_grid=0, grid_path='./output')


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-79-14131258b620> in <module>()
      1 grid_result_2 = quant_estimate.gridsearch(n_dep_param=K, n_input_sample=n, grid_type=grid_type, 
----> 2                                                  q_func=q_func, use_grid=0, grid_path='./output')

TypeError: gridsearch() got an unexpected keyword argument 'q_func'

Then gather the results from the same grid with the same configurations


In [43]:
grid_result_1.n_input_sample, grid_result_2.n_input_sample


Out[43]:
(1000, 1000)

In [44]:
grid_result = grid_result_1 + grid_result_2

Because the configurations are the same, we can gather the results from two different runs


In [45]:
grid_result.n_input_sample


Out[45]:
2000