Tutorial in Bayesian Optimization

Javier Gonzalez (j.h.gonzalez@sheffield.ac.uk) University of Sheffield.

The basics

Bayesian optimization (BO) is an strategy for global optimization of black-box functions. For instance, consider a Lipschitz continuous function $f(x)$ defined on a domain $\mathcal{X}$. BO aims to obtain

$$x^* = \arg \max_{\mathcal{X}} f(x)$$

There are two crucial bits in any Bayesian Optimization (BO) procedure approach.

Define a prior probability measure on $f$: this function will capture the our prior beliefs on $f$. The prior will be updated to a 'posterior' using the available data.
Define an acquisition function $acqu(x)$: this is a criteria to decide where to sample next in order to gain the maximum information about the location of the global maximum of $f$.

Given a prior over the function $f$ and an acquisition function a BO procedure will converge to the optimum of $f$ under some conditions.

Use of Bayesian Optimization in real applications

BO has been applied to solve a wide range of problems such us: Interactive animation, Sensor networks, Automatic algorithm configuration, Automatic machine learning toolboxes, Reinforcement learning, Organization planning, Deep learning, Engineering and a long etc!

1D-Toy illustration

We illustrate the idea behind BO using a one-dimensional example. We start by importing the required libraries for the analysis. Note that we use our library GPy for Gaussian Processes! The on-line documentation of GPy is available from the SheffieldML github page.



In [1]:

    
%pylab inline  
import GPy     
import pylab as pb
import numpy as np
import matplotlib.pyplot as plt  # plots
import scipy.stats
from scipy.stats import norm









    



Populating the interactive namespace from numpy and matplotlib

Let's start by considering the function $f(x) = − \cos(2πx ) + \sin(4\pi x )$ defined on the interval $[0.08, 0.92]$. The maximum of this function is located at $x_{max}=0.6010$. Obviously, to obtain it in this example is trivial. But, what if $f$ is not explicitly available and we only have access to a small number of noise evaluations? We see how the BO acts in this case for illustrative purpose but, of course, BO can be used in more complex scenarios. We first generate 3 noisy observations sampled from $f$ and we proceed.



In [2]:

    
## Function f(x)
X_star = np.linspace(0,1,1000)[:,None]
Y_star = -np.cos(2*np.pi*X_star) + np.sin(4*np.pi*X_star)
X_eval = X_star

# Sampled values
seed([1])
n = 3
X = np.array([[0.09],[0.2],[0.8]])
Y = -np.cos(2*np.pi*X) + np.sin(4*np.pi*X) + np.random.randn(n,1)*0.1

# Plot of f(x) and the generated sample
pylab.rcParams['figure.figsize'] = 8, 5 
plt.figure();
plt.plot(X_star,Y_star,c='grey',lw=2,ls='--',mew=1.5)
plt.plot(X,Y,'kx',mew=1.5)
plt.xlabel('x')
plt.ylabel('f(x)')
savefig('data.pdf')

3.1 Gaussian Process Prior

Now we define a Gaussian Process (GP) prior on $f$. A GP is an extension of the multivariate Gaussian distribution to an infinite dimension stochastic process for which any finite combination of dimensions is a Gaussian distribution. Therefore a GP is a distribution over functions, which is totally specified by its mean function $m(x)$ and its covariance function $k(x,x')$:

$$f(x) \sim \mathcal{GP}(m(x),k(x,x')) $$

For convenience, the mean is often fixed as $m(x)=0$. We use as covariance function the exponentiated quadratic kernel

$$ k(x,x') = l \cdot exp{ \left(\frac{\|x-x'\|}{2\sigma^2}\right)} $$

where $\sigma^2$ and and $l$ are positive parameters. Next, we fit this model in our dataset. We start by a kernel object.



In [3]:

    
# Choose the kernel
k = GPy.kern.rbf(input_dim=1, variance=1, lengthscale=0.1)

Now we create a Gaussian Process model using as covariance function the the previous kernel and we optimize the parameters by maximizing the log-likelihood. Ir order to avoid local solutions we use 10 different initial points in the optimization process.



In [4]:

    
# We create the GP model
m = GPy.models.GPRegression(X, Y, k)
m.constrain_positive('')
m.optimize()
m.optimize_restarts(num_restarts = 10)









    



Warning: re-constraining these parameters
rbf_variance
rbf_lengthscale
noise_variance
Optimization restart 1/10, f = 3.28419934479
Optimization restart 2/10, f = 3.28426467375
Optimization restart 3/10, f = 2.98866679975
Optimization restart 4/10, f = 2.98853927632
Optimization restart 5/10, f = 2.98875738545
Optimization restart 6/10, f = 2.98858964232
Optimization restart 7/10, f = 3.28247984535
Optimization restart 8/10, f = 2.9880072671
Optimization restart 9/10, f = 3.283203537
Optimization restart 10/10, f = 2.98858699364

Now, it is time to have a look to the fitted model. We show the parameters and the fitted function and a plot to see how it fit the data.



In [5]:

    
print m
#m.plot()
fest = m.predict(X_star)
plt.plot(X_star,fest[0],c='blue',lw=2,ls='-',mew=1.5)
plt.plot(X_star,fest[0]+1.96*sqrt(fest[1]),c='blue',lw=1,ls='-',mew=1)
plt.plot(X_star,fest[0]-1.96*sqrt(fest[1]),c='blue',lw=1,ls='-',mew=1)
plt.plot(X,Y,'kx',mew=2.5)
plt.plot(X_star,Y_star,c='grey',lw=2,ls='--',mew=1.5)
plt.title('GP model')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.xlim(0,1)
savefig('datamodel.pdf')









    



Log-likelihood: -2.988e+00

       Name        |  Value   |  Constraints  |  Ties  |  prior  
-----------------------------------------------------------------
   rbf_variance    |  0.6527  |     (+ve)     |        |         
  rbf_lengthscale  |  0.1922  |     (+ve)     |        |         
  noise_variance   |  0.0015  |     (+ve)     |        |

Given this model, where do you think the maximum of the function should be? Around 0.3 the posterior mean is maximum but around 0.45 the variance is large. If you could collect a new data point, where would you do it? This is the job of the second main element of any Bayesian Optimization procedure: the acquisition function.

3.2 Aquisition functions

Next lines of code define the three acquisition functions we are going to use in our example. They are the functions that represents our beliefs over the maximum of $f(x)$. Denote by $\theta$ the parameters of the GP model and by $\{x_i,y_i\}$ the available sample. Three of the most common acquisition functions are:

Maximum probability of improvement (MPI):

$$acqu_{MPI}(x;\{x_n,y_n\},\theta) = \Phi(\gamma(x)), \mbox{where}\ \gamma(x)=\frac{\mu(x;\{x_n,y_n\},\theta)-f(x_{best})-\psi}{\sigma(x;\{x_n,y_n\},\theta)}.$$

Expected improvement (EI):

$$acqu_{EI}(x;\{x_n,y_n\},\theta) = \sigma(x;\{x_n,y_n\},\theta) (\gamma(x) \Phi(\gamma(x))) + N(\gamma(x);0,1).$$

Upper confidence bound (UCB):

$$acqu_{UCB}(x;\{x_n,y_n\},\theta) = \mu(x;\{x_n,y_n\},\theta)+\eta\sigma(x;\{x_n,y_n\},\theta).$$

Both, $\psi$ and $\eta$, are tunable parameters that help to make the acquisition functions more flexible. Also, in the case of the UBC, the parameter $\eta$ is useful to define the balance between the importance we give to the mean and the variance of the model. This is know as the exploration/exploitation trade off.



In [6]:

    
def MPI_max(x,model,par = 0.01):
 fest = model.predict(x)
 acqu = norm.cdf((fest[0]-max(fest[0])-par) / fest[1])
 return acqu

def EI_max(x,model,par = 0.01):
 fest = model.predict(x)
 Z = (fest[0]-max(fest[0])-par) / fest[1]
 acqu = (fest[0]-max(fest[0])-par)*norm.cdf(Z)+fest[1]*norm.pdf(Z)
 return acqu

def UBC_max(x,model,z_mui=1):
 fest = model.predict(x)
 acqu = fest[0]+z_mui*np.sqrt(fest[1])
 return acqu

We evaluate the functions on our interval of interest. Here, the maximum is found using grid search but in higher dimensional problems and the maximum can be systematically obtained with a Conjugate Gradient method.



In [7]:

    
## Evaluate and get the maximum of the acquisition function (grid seach for-plotting purposes)
# MPI
acqu_MPI1 = MPI_max(X_eval,m,0.01) 
acqu_MPI2 = MPI_max(X_eval,m,0.1) 
acqu_MPI3 = MPI_max(X_eval,m,0.5) 
max_MPI1 = X_eval[np.argmax(acqu_MPI1)]
max_MPI2 = X_eval[np.argmax(acqu_MPI2)]
max_MPI3 = X_eval[np.argmax(acqu_MPI3)]

# EI
acqu_EI1  = EI_max(X_eval,m,0.01)
acqu_EI2  = EI_max(X_eval,m,0.1)
acqu_EI3  = EI_max(X_eval,m,0.5)
max_EI1 = X_eval[np.argmax(acqu_EI1)]
max_EI2 = X_eval[np.argmax(acqu_EI2)]
max_EI3 = X_eval[np.argmax(acqu_EI3)]
res_max_EI3 = max_EI3

# UBC
acqu_UBC1 = UBC_max(X_eval,m,0.5)   
acqu_UBC2 = UBC_max(X_eval,m,1)
acqu_UBC3 = UBC_max(X_eval,m,4)
max_UBC1 = X_eval[np.argmax(acqu_UBC1)]
max_UBC2 = X_eval[np.argmax(acqu_UBC2)]
max_UBC3 = X_eval[np.argmax(acqu_UBC3)]
res_max_UBC3 = max_UBC3



In [9]:

    
# Plot GP posterior, collected data and the acquisition function
m.plot()
plt.ylim(-2,3)
plt.plot(X_star,Y_star,c='grey',lw=2,ls='--',mew=1.5)
plt.title('GP model')
savefig('datamodel.pdf')



In [8]:

    
plt.figure(figsize=(20,5))
plt.subplot(1, 3, 1)
plt.title('Acquisition functions for MPI')
plt.xlim(0.08,0.92)
p1, = plt.plot(X_eval, acqu_MPI1, 'r-',lw=2.5)
p2, = plt.plot(X_eval, acqu_MPI2, 'b-',lw=2.5)
p3, = plt.plot(X_eval, acqu_MPI3, 'g-',lw=2.5)
plt.title('Acquisition functions for MPI')
plt.xlim(0.08,0.92)
plt.xlabel('x')
plt.ylabel('Acquisition value')
plt.legend([p1, p2, p3], ["0.01", "0.1", "0.5"])
axvline(x=max_MPI1,ls='-',c='red')
axvline(x=max_MPI2,ls='-',c='blue')
axvline(x=max_MPI3,ls='-',c='green')

plt.subplot(1, 3, 2)
plt.plot(X_eval, acqu_EI1, 'r-',lw=2.5)
plt.plot(X_eval, acqu_EI2, 'b-',lw=2.5)
plt.plot(X_eval, acqu_EI3, 'g-',lw=2.5)
plt.title('Acquisition functions for EI')
plt.xlim(0.08,0.92)
plt.xlabel('x')
plt.ylabel('Acquisition value')
plt.legend([p1, p2, p3], ["0.01", "0.1", "0.5"])
axvline(x=max_EI1,ls='-',c='red')
axvline(x=max_EI2,ls='-',c='blue')
axvline(x=max_EI3,ls='-',c='green')

plt.subplot(1, 3, 3)
p1, = plt.plot(X_eval, acqu_UBC1, 'r-',lw=2.5)
p2, = plt.plot(X_eval, acqu_UBC2, 'b-',lw=2.5)
p3, = plt.plot(X_eval, acqu_UBC3, 'g-',lw=2.5)
plt.title('Acquisition functions for UBC')
plt.xlim(0.08,0.92)
plt.xlabel('x')
plt.ylabel('Acquisition value')
plt.legend([p1, p2, p3], ["0.5", "1", "4"])
axvline(x=max_UBC1,ls='-',c='red')
axvline(x=max_UBC2,ls='-',c='blue')
axvline(x=max_UBC3,ls='-',c='green')









    Out[8]:





<matplotlib.lines.Line2D at 0xb7ff6cc>

Next, we show the how the three functions represents our beliefs about the maximum of $f(x)$. Note that all of them use the mean and the variance of the Gaussian process we have fitted to the data. In this example we simply select some values of the parameters.

You can see how the thee acquisition functions represent their beliefs about the maximum of $f(x)$ in a different way. It is up to the user to select the most appropriate depending on the problem. Typically, if we can collect new data we will do it in the maximum of the acquisition function.

3.3 Iterative sampling/sequential design

Next, to see how BO works iteratively, we use the Expected improvement with $\psi=0.5$. In each iteration we use the same generative model we considered for our first three data points in the point where $acqu_{EI}(x)$ is maximum. See what happens by running several times the cell below!!



In [14]:

    
# 1.- Collect an new sample where the MPI indicates and attach to the previous dataset
x_new = max_EI3
y_new = -np.cos(2*np.pi*x_new) + np.sin(4*np.pi*x_new) + np.random.randn(1,1)*0.1
X = np.vstack([X,x_new])
Y = np.vstack([Y,y_new])

# 2.- Run and optimize the new GP model
k = GPy.kern.rbf(input_dim=1, variance=.1, lengthscale=.1)
m_augmented = GPy.models.GPRegression(X, Y, k)
m_augmented.constrain_positive('')
m_augmented.constrain_fixed('.*noise_variance', 0.01)
m_augmented.optimize_restarts(num_restarts = 10)

# 3.- Optimize aquisition function MPI
acqu_EI3 = EI_max(X_eval,m_augmented,0.5) 
max_EI3 = X_eval[np.argmax(acqu_EI3)]
res_max_EI3 = np.vstack([res_max_EI3,max_EI3])
x_res = np.linspace(1,res_max_EI3.shape[0],res_max_EI3.shape[0])

# GP plot
pylab.rcParams['figure.figsize'] = 10, 3 
# GP plot
fest = m_augmented.predict(X_star)
plt.plot(X_star,fest[0],c='blue',lw=2,ls='-',mew=4)
plt.plot(X_star,fest[0]+1.96*sqrt(fest[1]),c='blue',lw=1,ls='-',mew=1)
plt.plot(X_star,fest[0]-1.96*sqrt(fest[1]),c='blue',lw=1,ls='-',mew=1)
plt.plot(X,Y,'kx',mew=2.5)
plt.title('GP model')
plt.xlabel('x')
plt.plot(X_star,Y_star,c='grey',lw=2,ls='--',mew=1.5)
plt.ylabel('f(x)')
plt.xlim(0,1)
savefig('datamodel7.pdf')
# EI plot
pylab.rcParams['figure.figsize'] = 10, 3 
plt.figure(figsize=(10,3))
p1, = plt.plot(X_eval,(acqu_EI3-min(acqu_EI3))/(max(acqu_EI3-min(acqu_EI3))), 'g-',lw=2.5)
plt.title('Acquisition function')
plt.xlim(0,1.01)
plt.ylim(-0.1,1.1)
plt.xlabel('x')
plt.ylabel('Value')
plt.legend([p1], ["Expected improvement"])
savefig('aq7.pdf')

print m_augmented
# Convergence plot
#plt.subplot(1, 2, 2)
#plt.plot(x_res,res_max_EI3,'kx',mew=4.5)
#plt.title('Convergence to the maximum')
#plt.xlabel('iteration')
#plt.ylabel('Value')
#plt.ylim(-0.25,1.5)
#plt.plot(x_res,res_max_EI3,'g-',lw=2.5)
#axhline(y=0.6010,ls='--',c='red'#)









    



Warning: re-constraining these parameters
rbf_variance
rbf_lengthscale
noise_variance
Warning: re-constraining these parameters
noise_variance
Optimization restart 1/10, f = 10.3714704286
Optimization restart 2/10, f = 10.3714704471
Optimization restart 3/10, f = 10.3714706512
Optimization restart 4/10, f = 10.3714705039
Optimization restart 5/10, f = 10.3714713365
Optimization restart 6/10, f = 10.3714705163
Optimization restart 7/10, f = 10.3714709173
Optimization restart 8/10, f = 12.4782822081
Optimization restart 9/10, f = 12.4782821865
Optimization restart 10/10, f = 10.3714704671

Log-likelihood: -1.037e+01

       Name        |  Value   |  Constraints  |  Ties  |  prior  
-----------------------------------------------------------------
   rbf_variance    |  1.3908  |     (+ve)     |        |         
  rbf_lengthscale  |  0.1384  |     (+ve)     |        |         
  noise_variance   |  0.0100  |     Fixed     |        |         







    



/usr/local/lib/python2.7/dist-packages/GPy/kern/parts/rbf.py:256: RuntimeWarning: invalid value encountered in divide
  X = X / self.lengthscale

As you can see, after the first iterations, the aquisition function explores the domain of $f$. In some cases it is either flat or it concentrates all the mass in a point, which normally conincides with the exploration of the limit of the function domain. After some iterations, however, it becomes more narrow and the convergence to the maximum of $f$, 0.6010, is achieved. We only needed 7 data points!!