Mixture Density Networks (MDN) for distribution and uncertainty estimation

This material is copyright Axel Brando and made available under the Creative Commons Attribution-ShareAlike 4.0 International License (http://creativecommons.org/licenses/by-sa/4.0/). Code is also made available under the Apache Version 2.0 License (https://www.apache.org/licenses/LICENSE-2.0).

To use this material and code, please follow the instructions explained in the main repository: https://github.com/axelbrando/Mixture-Density-Networks-for-distribution-and-uncertainty-estimation

Introduction

As Bishop [1] explains, if we assume that the conditional distribution of the target data is Gaussian, then maximum likelihood leads to the familiar least-squares formalism. This motivates the idea of replacing the Gaussian distribution in the conditional density of the complete target vector with a mixture model (McLachlan et al. [2]), which has the flexibility to model completely general distribution functions. The probability density of the target data is then represented as a linear combination of kernel functions of the form

$$ p(\boldsymbol{y} | \boldsymbol{x} ) = \sum_{i=1}^m \alpha_i (\boldsymbol{x}) \phi_i (\boldsymbol{y} | \boldsymbol{x} )$$

where $m$ is the number of components in the mixture and the $\alpha_i(\boldsymbol{x})$ are called the mixing coefficients.

In his paper, Bishop [1] chose Gaussian kernel functions of the form:

$$ \phi_i(\boldsymbol{y} | \boldsymbol{x}) = \frac{1}{(2 \pi )^{c/2} \sigma_i (\boldsymbol{x})^c} \exp\left\{ - \frac{ \parallel \boldsymbol{y} - \boldsymbol{\mu}_i (\boldsymbol{x})\parallel^2}{2 \sigma_i (\boldsymbol{x})^2} \right\}$$

where $\boldsymbol{\mu}_i(\boldsymbol{x})$ represents the centre of the $i^{th}$ kernel. The author assumed that the components of the output vector are statistically independent within each component of the distribution and can be described by a common variance $\sigma_i^2(\boldsymbol{x})$. As Bishop [1] explains, this independence assumption can be relaxed by introducing a full covariance matrix for each Gaussian kernel. However, according to McLachlan et al. [2] and Bishop [1], a Gaussian mixture model with this simplified kernel can approximate any given density function to arbitrary accuracy, provided the mixing coefficients and the Gaussian parameters (means and variances) are correctly chosen. Note that this assumption also simplifies the handling of the covariance matrix $\boldsymbol{\Sigma}_i$, which becomes diagonal with the same variance $\sigma_i^2$ in every dimension

$$\boldsymbol{\Sigma}_i = \begin{bmatrix}\sigma_{i}^2 & 0 & \cdots & 0\\ 0 & \sigma_{i}^2 & & 0\\ \vdots & & \ddots & \vdots\\ 0 & \cdots & 0 & \sigma_{i}^2 \end{bmatrix} $$

so that the normalisation factor $\mid \boldsymbol{\Sigma}_i \mid^{-1/2}$ of the general multivariate Gaussian reduces to the $\sigma_i^{-c}$ appearing in the kernel equation above.
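
To make the formulas above concrete, here is a minimal NumPy sketch (not part of the original code; the array shapes and values are assumptions for illustration) that evaluates the isotropic Gaussian kernel and the resulting mixture density for already-computed parameters:

In [ ]:
#Sketch: isotropic Gaussian kernel and mixture density in NumPy
import numpy as np

def gaussian_kernel(y, mu_i, sigma_i):
    """Isotropic Gaussian phi_i(y|x) with common variance sigma_i**2.
    y, mu_i: arrays of shape (c,); sigma_i: positive scalar."""
    c = y.shape[0]
    norm = (2 * np.pi) ** (c / 2.0) * sigma_i ** c
    return np.exp(-np.sum((y - mu_i) ** 2) / (2 * sigma_i ** 2)) / norm

def mixture_density(y, alpha, mu, sigma):
    """p(y|x) = sum_i alpha_i * phi_i(y|x).
    alpha, sigma: shape (m,); mu: shape (m, c)."""
    return sum(a * gaussian_kernel(y, mu_i, s)
               for a, mu_i, s in zip(alpha, mu, sigma))

#Toy check with m = 2 components and c = 1 output dimension
alpha = np.array([0.3, 0.7])
mu = np.array([[0.0], [2.0]])
sigma = np.array([1.0, 0.5])
print(mixture_density(np.array([1.0]), alpha, mu, sigma))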

As shown in the diagram below, given an input vector $\boldsymbol{x}$, the Mixture Density Network model provides a general formalism for modelling an arbitrary conditional density function $p(\boldsymbol{y} \mid \boldsymbol{x})$. The union between the traditional neural network and the mixture model is achieved by using the negative log-likelihood of the linear combination of kernel functions as the loss function of the neural network. According to Bishop [1], by choosing a mixture model with a sufficient number of kernel functions, and a neural network with a sufficient number of hidden units, the Mixture Density Network can approximate any conditional density $p(\boldsymbol{y} \mid \boldsymbol{x})$ as closely as desired.

The representation graph of the Mixture Density Network model is as follows:

The outputs of the feed-forward neural network determine the parameters of a mixture density model. The mixture density model therefore represents the conditional probability density function of the target variables, conditioned on the input vector of the neural network.

Building this Mixture Density Network increases the number of network outputs from $c$ to $(c+2)\times m$, where $c$ is still the dimension of the output and $m$ is the number of mixture components used in the model. For example, with $c = 1$ output variable and $m = 3$ components, the network has $(1+2)\times 3 = 9$ outputs: three means, three scale parameters and three mixing coefficients.

Bishop [1] proposes in his article the following constraints that the different parameters must satisfy:

  1. As required for probabilities, the mixing coefficients $\alpha_i$ must satisfy the constraint $\sum^m_{i=1}\alpha_{i}=1$. In principle, it is enough to apply a softmax activation function to the nodes corresponding to the $\alpha_i$.
  2. Since the variances $\sigma_i$ represent scale parameters, Bishop [1] recommends representing them in terms of the exponential of the corresponding network outputs $z_i^{\sigma}$

    $$\sigma_i = \exp(z_{i}^{\sigma})$$

    In a Bayesian framework, this corresponds to the choice of an uninformative prior, assuming that the corresponding network outputs $z_i^{\sigma}$ have a uniform probability distribution (Nowlan et al. [3], Jacobs et al. [4] and Bishop [1]). According to Bishop [1], this representation also avoids pathological configurations in which one or more of the variances goes to zero.
  3. The centres $\mu_i$ represent location parameters, and the notion of an uninformative prior suggests that they should be represented directly by the network outputs, i.e. $$\mu_{i,k} = z_{i,k}^{\mu}$$ A sketch mapping raw network outputs to valid mixture parameters in this way is given after this list.
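
These three constraints can be summarised in a small sketch (NumPy, illustration only; the raw outputs $z^{\alpha}$, $z^{\sigma}$, $z^{\mu}$ and their values are assumptions) that maps unconstrained network outputs to valid mixture parameters:

In [ ]:
#Sketch (not part of the original code): mapping raw network outputs
# to valid mixture parameters for m = 3 components and c = 1 output.
import numpy as np

def softmax(z):
    z = z - np.max(z)                     # subtract max for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

z_alpha = np.array([0.2, -1.0, 0.5])      # unconstrained outputs for the alphas
z_sigma = np.array([-0.3, 0.1, 1.2])      # unconstrained outputs for the sigmas
z_mu = np.array([[0.0], [1.5], [-2.0]])   # unconstrained outputs for the mus

alpha = softmax(z_alpha)                  # constraint 1: non-negative, sums to 1
sigma = np.exp(z_sigma)                   # constraint 2: strictly positive scales
mu = z_mu                                 # constraint 3: used directly as locations

print(alpha.sum(), sigma.min())           # 1.0 and a positive number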

To define an error function to use as the loss, the standard approach is the maximum likelihood method, which requires maximisation of the log-likelihood function or, equivalently, minimisation of the negative logarithm of the likelihood. Therefore, the error function for the Mixture Density Network is:

$$ \mathcal{L}(\boldsymbol{y} \mid \boldsymbol{x}) = - \log \left( p(\boldsymbol{y} \mid \boldsymbol{x}) \right) = - \log \left(\overset{m}{\underset{i=1}{\sum}} \alpha_i (\boldsymbol{x}) \phi_i (\boldsymbol{y} | \boldsymbol{x} ) \right) $$

where $\phi_i (\boldsymbol{y} | \boldsymbol{x} )$ is the Gaussian kernel defined above. As explained in Bishop [1], the term involving the unconditional density $p(\boldsymbol{x})$ has been dropped, since it is independent of the parameters of the mixture model and hence of the network weights. Thus, the aim of Mixture Density Networks is to model the complete conditional probability density of the output variables. From this density function, any desired statistic involving the output variables can, in principle, be computed.
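
To connect this expression with the Keras loss functions used later, here is a minimal NumPy sketch (an illustration, not the original code) of the negative log-likelihood for a single pattern, written with the log-sum-exp trick for numerical stability; the parameter arrays follow the shapes used above:

In [ ]:
#Sketch: per-pattern negative log-likelihood of an isotropic Gaussian mixture
import numpy as np

def mdn_nll(y, alpha, mu, sigma):
    """alpha, sigma: shape (m,); mu: shape (m, c); y: shape (c,)."""
    c = y.shape[0]
    #log of alpha_i * phi_i(y|x) for every component
    log_terms = (np.log(alpha)
                 - 0.5 * c * np.log(2 * np.pi)
                 - c * np.log(sigma)
                 - np.sum((y - mu) ** 2, axis=1) / (2 * sigma ** 2))
    #log-sum-exp trick: subtract the maximum before exponentiating
    m_max = np.max(log_terms)
    return -(m_max + np.log(np.sum(np.exp(log_terms - m_max))))

print(mdn_nll(np.array([1.0]),
              np.array([0.3, 0.7]),
              np.array([[0.0], [2.0]]),
              np.array([1.0, 0.5])))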

How to minimise the error function with respect to the weights in the neural network

Once our neural network architecture is decided, we need a way to minimise the error function by modifying the weights. To do this we need the derivatives of the loss function with respect to the weights of the neural network. According to Bishop [1], one way to obtain them is the standard *back-propagation* procedure, provided we have suitable expressions for the derivatives of the error with respect to the activations of the output units of the network. Since the loss function is a sum of terms, one for each pattern, we can consider the derivatives $\delta_k = \frac{\partial \mathcal{L}(\boldsymbol{y} \mid \boldsymbol{x})}{\partial z_k}$ for a particular pattern and then obtain the derivatives of $\mathcal{L}$ by summing over all patterns. The derivatives $\delta_k$ act as *errors* which can be back-propagated through the network to find the derivatives with respect to the network weights. This optimisation process is covered extensively in the literature, for example in Nielsen [5], Goodfellow et al. [6] and Bishop [7]. As Bishop [1] notes, standard optimisation algorithms, such as conjugate gradients or quasi-Newton methods, can then be used to find a minimum of $\mathcal{L}$. Alternatively, if an optimisation algorithm such as stochastic gradient descent is used, the weight updates can be applied after the presentation of each pattern (or mini-batch) separately. In recent years, many new gradient descent optimisation algorithms have been developed, such as Nesterov accelerated gradient, Adagrad, Adadelta, RMSprop and Adam.

Nowadays, this differentiation process is implemented in the most relevant Deep Learning libraries, which can automatically differentiate native code. Like the Autograd library, most of these libraries use reverse-mode differentiation (also called reverse accumulation; the process is well explained on the Automatic_differentiation Wikipedia page and on page $7$ of Ilya Sutskever's PhD thesis), which means we can efficiently take gradients of scalar-valued functions with respect to array-valued arguments. Using these libraries therefore removes the burden of deriving gradients by hand and lets us focus on other problems.
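
As a small illustration of reverse-mode differentiation, here is a sketch using the Autograd library mentioned above (the function `loss` is just an arbitrary example, not part of the MDN code):

In [ ]:
#Sketch: reverse-mode gradient of a scalar-valued function with Autograd
import autograd.numpy as np
from autograd import grad

def loss(w):
    """An arbitrary scalar-valued function of an array-valued argument."""
    return np.sum(np.tanh(w) ** 2)

grad_loss = grad(loss)   # builds the reverse-mode gradient of 'loss' w.r.t. w
print(grad_loss(np.array([0.1, -0.5, 2.0])))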

REFERENCES

[1]: Bishop, C. M. (1994). Mixture density networks.

[2]: McLachlan, G. J., & Basford, K. E. (1988). Mixture models: Inference and applications to clustering (Vol. 84). Marcel Dekker.

[3]: Nowlan, S. J., & Hinton, G. E. (1992). Simplifying neural networks by soft weight-sharing. Neural computation, 4(4), 473-493.

[4]: Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural computation, 3(1), 79-87.

[5]: Nielsen, M. A. (2015). Neural networks and deep learning. Determination Press.

[6]: Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.

[7]: Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.

Implementation

Below we show a generic implementation of the MDN with the following features:
  • It supports as many mixture components as specified by the variable $m$.
  • It supports as many outputs as specified by the variable $c$.
  • It can use the desired likelihood (Gaussian or Laplace) function.
  • It can optionally use adversarial training (with a variant that modifies the weights twice) during MDN training.
  • It includes the other tricks described on the main page of the repository: https://github.com/axelbrando/Mixture-Density-Networks-for-distribution-and-uncertainty-estimation

In [ ]:
#Import of the TensorFlow and definition of the control_flow_ops variable
import tensorflow as tf
tf.python.control_flow_ops = tf

#GPU Memory allocation on demand (Remove comments if necessary)
##config = tf.ConfigProto()
##config.gpu_options.allow_growth=True
##sess = tf.Session(config=config)

#Import of TensorFlow backend of Keras
from keras import backend as K

#Some other imports
import os
import numpy as np
from pandas.io.parsers import read_csv
from sklearn.utils import shuffle
import random
from datetime import datetime

In [ ]:
#Imports of the Keras library parts we will need
from keras.models import Sequential,Graph
from keras.layers.core import Dense, Dropout
from keras.callbacks import History
from keras.layers.recurrent import LSTM
from keras.models import model_from_json
from keras.regularizers import l2, activity_l2

from keras.objectives import mean_absolute_error

#Definition of the ELU+1 function
#With some margin to avoid problems of instability
from keras.layers.advanced_activations import ELU

def elu_modif(x, a=1.):
    e=1e-15
    return ELU(alpha=a)(x)+1.+e


c = 1 #The number of outputs we want to predict
m = 1 #The number of distributions we want to use in the mixture

#Note: The output size will be (c + 2) * m

def log_sum_exp(x, axis=None):
    """Log-sum-exp trick implementation"""
    x_max = K.max(x, axis=axis, keepdims=True)
    return K.log(K.sum(K.exp(x - x_max), 
                       axis=axis, keepdims=True))+x_max


def mean_log_Gaussian_like(y_true, parameters):
    """Mean Log Gaussian Likelihood distribution
    Note: The 'c' variable is obtained as global variable
    """
    components = K.reshape(parameters,[-1, c + 2, m])
    mu = components[:, :c, :]
    sigma = components[:, c, :]
    alpha = components[:, c + 1, :]
    alpha = K.softmax(K.clip(alpha,1e-8,1.))
    
    exponent = K.log(alpha) - .5 * float(c) * K.log(2 * np.pi) \
    - float(c) * K.log(sigma) \
    - K.sum((K.expand_dims(y_true,2) - mu)**2, axis=1)/(2*(sigma)**2)
    
    log_gauss = log_sum_exp(exponent, axis=1)
    res = - K.mean(log_gauss)
    return res


def mean_log_LaPlace_like(y_true, parameters):
    """Mean Log Laplace Likelihood distribution
    Note: The 'c' variable is obtained as global variable
    """
    components = K.reshape(parameters,[-1, c + 2, m])
    mu = components[:, :c, :]
    sigma = components[:, c, :]
    alpha = components[:, c + 1, :]
    alpha = K.softmax(K.clip(alpha,1e-2,1.))
    
    exponent = K.log(alpha) - float(c) * K.log(2 * sigma) \
    - K.sum(K.abs(K.expand_dims(y_true,2) - mu), axis=1)/(sigma)
    
    log_gauss = log_sum_exp(exponent, axis=1)
    res = - K.mean(log_gauss)
    return res


def scoring_rule_adv(y_true, y_pred):
    """Fast Gradient Sign Method (FSGM) to implement Adversarial Training
    Note: The 'graphADV' pointer is obtained as global variable
    """
    
    # Compute loss 
    #Note: Replace with 'mean_log_Gaussian_like' if you want a Gaussian kernel.
    error = mean_log_LaPlace_like(y_true, y_pred)
    
    # Craft adversarial examples using Fast Gradient Sign Method (FGSM)
    # Define gradient of loss wrt input
    grad_error = K.gradients(error,graphADV.input) #Minus is on error function
    # Take sign of gradient, Multiply by constant epsilon, Add perturbation to original example to obtain adversarial example
    #Sign add a new dimension we need to obviate
    
    epsilon = 0.08
    
    adversarial_X = K.stop_gradient(graphADV.input + epsilon * K.sign(grad_error)[0])
    
    # Note: If you want to test the variation of adversarial training 
    #  proposed by XX, eliminate the following comment character 
    #  and comment the previous one.
    
    ##adversarial_X = graphADV.input + epsilon * K.sign(grad_error)[0]
    
    adv_output = graphADV(adversarial_X)
    
    #Note: Replace with 'mean_log_Gaussian_like' if you want a Gaussian kernel.
    adv_error = mean_log_LaPlace_like(y_true, adv_output)
    return 0.3 * error + 0.7 * adv_error

#Definition of 3 stacked dense layers followed by the Mixture Density block.
# This initial feed-forward part of the network can be any architecture you want.
graph = Graph()
graph.add_input(name='input', input_shape=(12,))
graph.add_node(Dense(500, activation='relu'), name='dense1_1', input='input')
graph.add_node(Dropout(0.25), name='drop1_1', input='dense1_1')

graph.add_node(Dense(500, activation='relu'), name='dense2_1', input='drop1_1')
graph.add_node(Dropout(0.25), name='drop2_1', input='dense2_1')

graph.add_node(Dense(500, activation='relu'), name='dense3_1', input='drop2_1')
graph.add_node(Dropout(0.25), name='drop3_1', input='dense3_1')


graph.add_node(Dense(output_dim=500, activation="relu"), name='FC1', input='drop3_1')
graph.add_node(Dense(output_dim=c*m), name='FC_mus', input='FC1')
graph.add_node(Dense(output_dim=m, activation=elu_modif), name='FC_sigmas', input='FC1') #K.exp, W_regularizer=l2(1e-3)
graph.add_node(Dense(output_dim=m, activation='softmax'), name='FC_alphas', input='FC1')
graph.add_output(name='output', inputs=['FC_mus','FC_sigmas', 'FC_alphas'], merge_mode='concat',concat_axis=1)
graphADV = graph

#Note 1: Replace 'scoring_rule_adv' with 'mean_log_Gaussian_like' or
# 'mean_log_LaPlace_like' depending on your needs.
#Note 2: Replace 'rmsprop' by 'adam' depending on your needs.
graph.compile('rmsprop', {'output':scoring_rule_adv})
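
The model above uses the old Keras Graph container, which only exists in early Keras versions. On newer Keras releases, a roughly equivalent model could be sketched with the functional API (an untested sketch, not the original code; it reuses `c`, `m`, `elu_modif` and `mean_log_Gaussian_like` defined above):

In [ ]:
#Sketch: approximate functional-API equivalent for newer Keras versions
from keras.layers import Input, Dense, Dropout, concatenate
from keras.models import Model

inputs = Input(shape=(12,))
h = Dense(500, activation='relu')(inputs)
h = Dropout(0.25)(h)
h = Dense(500, activation='relu')(h)
h = Dropout(0.25)(h)
h = Dense(500, activation='relu')(h)
h = Dropout(0.25)(h)
h = Dense(500, activation='relu')(h)

mus = Dense(c * m)(h)
sigmas = Dense(m, activation=elu_modif)(h)
alphas = Dense(m, activation='softmax')(h)
outputs = concatenate([mus, sigmas, alphas], axis=1)

model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer='rmsprop', loss=mean_log_Gaussian_like)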

Training of the neural network


In [ ]:
from keras.callbacks import Callback, ModelCheckpoint
class LossHistory(Callback):
    def on_train_begin(self, logs={}):
        self.losses = []

    def on_batch_end(self, batch, logs={}):
        self.losses.append(logs.get('loss'))
lossHistory = LossHistory()

# checkpoint
filepath="MDN--{epoch:02d}-{val_loss:.2f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='min')

In [ ]:
from datetime import datetime
start_time = datetime.now()
epoch=300
graph.fit(data={'input':X,'output':y}, batch_size=40000, nb_epoch=epoch, 
          validation_split=0.1,callbacks=[lossHistory, checkpoint])
end_time = datetime.now()
print("")
print("*********************************  End  *********************************")
print("")
print('Duration: {}'.format(end_time - start_time))

In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(np.arange(len(lossHistory.losses)), lossHistory.losses)

Prediction using the neural network


In [ ]:
#Depending on your needs you will load the weights
##graph.load_weights('MDN-Weights.hdf5')

In [ ]:
#y_pred = model.predict(X_val)['output']
from datetime import datetime
start_time = datetime.now()
y_pred = graph.predict(data={'input':X_val})['output']
end_time = datetime.now()
print("")
print("*********************************  Prediction ends  *********************************")
print("")
print('Duration: {}'.format(end_time - start_time))

How to obtain the parameters


In [ ]:
comp = np.reshape(y_pred,[-1, c + 2, m])
mu_pred = comp[:, :c, :]
sigma_pred = comp[:, c, :]
alpha_pred = comp[:, c + 1, :]
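
From these parameters, summary statistics of the predicted conditional distribution can be computed; for example (a sketch, assuming the arrays obtained in the previous cell), the mean of the mixture or the centre of the most probable component:

In [ ]:
#Sketch: two simple summaries of the predicted conditional distribution
#1) the mean of the mixture, sum_i alpha_i * mu_i, shape (N, c)
mixture_mean = np.sum(alpha_pred[:, np.newaxis, :] * mu_pred, axis=2)
#2) the centre of the component with the largest mixing coefficient, shape (N, c)
best_component = np.argmax(alpha_pred, axis=1)
most_likely_mu = mu_pred[np.arange(len(mu_pred)), :, best_component]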