Morten Hjorth-Jensen, Email hjensen@msu.edu, Department of Physics and Astronomy and National Superconducting Cyclotron Laboratory, Michigan State University, East Lansing, MI 48824, USA
Notebook 1: Variational Monte Carlo
These notebooks aim to link traditional variational Monte Carlo (VMC) methods with recent progress on solving many-particle problems using Machine Learning algorithms.
Furthermore, when linking with Machine Learning algorithms, in particular so-called Boltzmann Machines, there are interesting connections between these algorithms and so-called Shadow Wave Functions (SWFs). The implications of the latter have been explored in various Monte Carlo calculations.
In total there are three notebooks:
the one you are reading now on Variational Monte Carlo methods,
notebook 2 on Machine Learning and quantum mechanical problems and in particular on Boltzmann Machines,
and finally notebook 3 on the link between Boltzmann machines and SWFs.
In this notebook the aim is to give you an introduction to, as well as an understanding of, the basic elements that are needed in order to develop a professional variational Monte Carlo code. We will focus on a simple system of two particles in an oscillator trap (or alternatively two fermions moving in a Coulombic potential), which can interact via a repulsive or an attractive force.
The advantage of these systems is that for two particles (bosons or fermions) we have analytical solutions for the eigenpairs of the non-interacting case. Furthermore, for a two- or three-dimensional system of two electrons moving in a harmonic oscillator trap, we have analytical solutions for the interacting case as well.
Having analytical eigenpairs is an invaluable feature that allows us to assess the physical relevance of the trial wave functions, be they from a standard VMC procedure, from Boltzmann Machines or from Shadow Wave Functions.
In this notebook we start with the basics of a VMC calculation and introduce concepts like Markov Chain Monte Carlo methods and the Metropolis algorithm, importance sampling and the Metropolis-Hastings algorithm, resampling methods to obtain better estimates of the statistical errors, and minimization of the expectation values of the energy and the variance. The latter is done in order to obtain the best possible variational parameters. Furthermore, we will define the so-called cost function, a commonly encountered quantity in Machine Learning algorithms. Minimizing the cost function is what determines the optimal parameters in basically all Machine Learning algorithms. For our purposes, it will serve as the first link between VMC methods and Machine Learning methods.
Topics like Markov Chain Monte Carlo and various resampling techniques are also central to Machine Learning methods. Presenting them in the context of VMC approaches leads hopefully to an easier starting point for the understanding of these methods.
Finally, the reader may ask what we actually want to achieve by complicating life with Machine Learning methods when we can easily study interacting systems with standard Monte Carlo approaches. Our hope is that by adding additional degrees of freedom via Machine Learning algorithms, we can let the algorithms we employ learn the parameters of the model via a given optimization algorithm. In standard Monte Carlo calculations the practitioners end up fine-tuning the trial wave function using all possible insights about the system under study. This may not always lead to the best possible ansatz and can in the long run be rather time-consuming. In fields like nuclear many-body physics, with complicated interaction terms, guessing an analytical form for the trial wave function can be difficult. Letting the machine learn the form of the trial function, or find the optimal parameters, may lead to insights about the problem which cannot be obtained by selecting various trial wave functions.
The emerging and rapidly expanding fields of Machine Learning and Quantum Computing hold also great promise in tackling the dimensionality problems (the so-called dimensionality curse in many-body problems) we encounter when studying complicated many-body problems. The approach to Machine Learning we will focus on is inspired by the idea of representing the wave function with a restricted Boltzmann machine (RBM), presented recently by G. Carleo and M. Troyer, Science 355, Issue 6325, pp. 602-606 (2017). They named such a wave function/network a neural network quantum state (NQS). In their article they apply it to the quantum mechanical spin lattice systems of the Ising model and Heisenberg model, with encouraging results.
Machine learning (ML) is an extremely rich field, in spite of its young age. The increases we have seen during the last three decades in computational capabilities have been followed by developments of methods and techniques for analyzing and handling large data sets, relying heavily on statistics, computer science and mathematics. The field is rather new and developing rapidly. Machine learning is the science of giving computers the ability to learn without being explicitly programmed. The idea is that there exist generic algorithms which can be used to find patterns in a broad class of data sets without having to write code specifically for each problem. The algorithm will build its own logic based on the data.
Machine learning is a subfield of computer science, and is closely related to computational statistics. It evolved from the study of pattern recognition in artificial intelligence (AI) research, and has made contributions to AI tasks like computer vision, natural language processing and speech recognition. It has also, especially in recent years, found applications in a wide variety of other areas, including bioinformatics, economics, physics, finance and marketing. An excellent reference we will come back to is Mehta et al., arXiv:1803.08823.
Our focus will first be on the basics of VMC calculations.
We start with the variational principle. Given a hamiltonian $H$ and a trial wave function $\Psi_T(\boldsymbol{R};\boldsymbol{\alpha})$, the variational principle states that the expectation value of $H$, defined through

$$ \overline{E}[\boldsymbol{\alpha}] = \frac{\int d\boldsymbol{R}\,\Psi^{\ast}_T(\boldsymbol{R};\boldsymbol{\alpha})H(\boldsymbol{R})\Psi_T(\boldsymbol{R};\boldsymbol{\alpha})}{\int d\boldsymbol{R}\,\Psi^{\ast}_T(\boldsymbol{R};\boldsymbol{\alpha})\Psi_T(\boldsymbol{R};\boldsymbol{\alpha})}, $$

is an upper bound to the ground state energy $E_0$ of the hamiltonian $H$, that is

$$ E_0 \leq \overline{E}[\boldsymbol{\alpha}]. $$
In general, the integrals involved in the calculation of various expectation values are multi-dimensional ones. Traditional integration methods such as Gauss-Legendre quadrature will not be adequate for, say, the computation of the energy of a many-body system.
Here we have defined the vector $\boldsymbol{R} =[\boldsymbol{r}_1,\boldsymbol{r}_2,\dots,\boldsymbol{r}_n]$ as an array that contains the positions of all $n$ particles, while the vector $\boldsymbol{\alpha} = [\alpha_1,\alpha_2,\dots,\alpha_m]$ contains the variational parameters of the model, $m$ in total.
The trial wave function can be expanded in the eigenstates $\Psi_n(\boldsymbol{R})$ of the hamiltonian since they form a complete set, viz.,

$$ \Psi_T(\boldsymbol{R};\boldsymbol{\alpha})=\sum_n c_n\Psi_n(\boldsymbol{R}), $$

and assuming the set of eigenfunctions to be normalized one obtains

$$ \frac{\sum_{nm} c^{\ast}_m c_n \int d\boldsymbol{R}\,\Psi^{\ast}_m(\boldsymbol{R})H(\boldsymbol{R})\Psi_n(\boldsymbol{R})}{\sum_{nm} c^{\ast}_m c_n \int d\boldsymbol{R}\,\Psi^{\ast}_m(\boldsymbol{R})\Psi_n(\boldsymbol{R})} = \frac{\sum_n |c_n|^2 E_n}{\sum_n |c_n|^2} \geq E_0, $$
where we used that $H(\boldsymbol{R})\Psi_n(\boldsymbol{R})=E_n\Psi_n(\boldsymbol{R})$. In general, the integrals involved in the calculation of various expectation values are multi-dimensional ones. The variational principle yields the lowest state of a given symmetry.
In most cases, a wave function has only small values in large parts of configuration space, and a straightforward procedure which uses homogeneously distributed random points in configuration space will most likely lead to poor results. This suggests that some kind of importance sampling, combined with e.g. the Metropolis algorithm, may be a more efficient way of obtaining the ground state energy. The hope is then that those regions of configuration space where the wave function assumes appreciable values are sampled more efficiently.
The tedious part in a VMC calculation is the search for the variational minimum. A good knowledge of the system is required in order to carry out reasonable VMC calculations. This is not always the case, and often VMC calculations serve rather as the starting point for so-called diffusion Monte Carlo (DMC) calculations. DMC is a way of solving the many-body Schroedinger equation exactly by means of a stochastic procedure. A good guess of the binding energy and the wave function is however necessary. A carefully performed VMC calculation can aid in this context.
The basic procedure of a variational Monte Carlo calculation thus consists of the following steps:
Construct first a trial wave function $\psi_T(\boldsymbol{R};\boldsymbol{\alpha})$ for a many-body system consisting of $n$ particles located at positions $\boldsymbol{R}=(\boldsymbol{R}_1,\dots ,\boldsymbol{R}_n)$. The trial wave function depends on the $M$ variational parameters $\boldsymbol{\alpha}=(\alpha_1,\dots ,\alpha_M)$.
Then we evaluate the expectation value of the hamiltonian $H$,

$$ \overline{E}[\boldsymbol{\alpha}] = \frac{\int d\boldsymbol{R}\,\Psi^{\ast}_T(\boldsymbol{R};\boldsymbol{\alpha})H(\boldsymbol{R})\Psi_T(\boldsymbol{R};\boldsymbol{\alpha})}{\int d\boldsymbol{R}\,\Psi^{\ast}_T(\boldsymbol{R};\boldsymbol{\alpha})\Psi_T(\boldsymbol{R};\boldsymbol{\alpha})}. $$

Here we have used the notation $\overline{E}$ to label the expectation value of the energy.
In order to bring in the Monte Carlo machinery, we first define a probability density function (PDF) using our ansatz for the trial wave function $\psi_T(\boldsymbol{R};\boldsymbol{\alpha})$,

$$ P(\boldsymbol{R};\boldsymbol{\alpha})= \frac{\left|\psi_T(\boldsymbol{R};\boldsymbol{\alpha})\right|^2}{\int \left|\psi_T(\boldsymbol{R};\boldsymbol{\alpha})\right|^2d\boldsymbol{R}}. $$

This is our new probability distribution function (PDF). We then define a new quantity,

$$ E_L(\boldsymbol{R};\boldsymbol{\alpha}) = \frac{1}{\psi_T(\boldsymbol{R};\boldsymbol{\alpha})}H\psi_T(\boldsymbol{R};\boldsymbol{\alpha}), $$

called the local energy, which, together with our trial PDF, yields the approximation to the expectation value of the hamiltonian,

$$ \overline{E}[\boldsymbol{\alpha}] = \int P(\boldsymbol{R};\boldsymbol{\alpha})E_L(\boldsymbol{R};\boldsymbol{\alpha})\, d\boldsymbol{R} \approx \frac{1}{N}\sum_{i=1}^{N} E_L(\boldsymbol{R}_i;\boldsymbol{\alpha}), $$

with $N$ being the number of Monte Carlo samples.
The algorithm for performing a variational Monte Carlo calculation runs as follows:
Initialisation: Fix the number of Monte Carlo steps. Choose an initial $\boldsymbol{R}$ and variational parameters $\alpha$ and calculate $\left|\psi_T^{\alpha}(\boldsymbol{R})\right|^2$.
Initialise the energy and the variance and start the Monte Carlo calculation.
Calculate a trial position $\boldsymbol{R}_p=\boldsymbol{R}+r\times step$, where $r$ is a random number $r \in [0,1]$ and $step$ is a chosen step length.
Use the Metropolis algorithm to accept or reject this move by computing the ratio $w = P(\boldsymbol{R}_p)/P(\boldsymbol{R})$: the move is accepted if $w$ is larger than or equal to a uniform random number in $[0,1]$.
If the step is accepted, then we set $\boldsymbol{R}=\boldsymbol{R}_p$.
Update averages
Observe that the jumping in space is governed by the variable $step$. This is called brute-force sampling. We need importance sampling to obtain more relevant sampling; see the discussion further below.
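The steps above can be sketched in compact Python. The example below is our own minimal illustration (not one of the accompanying codes): it assumes a single particle in a one-dimensional harmonic oscillator with trial function $\psi_T(x)=e^{-\alpha x^2/2}$, for which the local energy is $E_L = \alpha/2 + x^2(1-\alpha^2)/2$ in natural units.

```python
import numpy as np

def psi2(x, alpha):
    """Squared trial wave function |psi_T|^2 for psi_T = exp(-alpha x^2/2)."""
    return np.exp(-alpha * x * x)

def local_energy(x, alpha):
    """Analytical local energy: E_L = alpha/2 + x^2 (1 - alpha^2)/2."""
    return 0.5 * alpha + 0.5 * x * x * (1.0 - alpha * alpha)

def vmc_brute_force(alpha, n_steps=50000, step=1.0, seed=7):
    """Brute-force Metropolis sampling of |psi_T|^2; returns energy and variance."""
    rng = np.random.default_rng(seed)
    x = 0.1                                             # initial position
    energies = []
    for n in range(n_steps):
        x_trial = x + step * (rng.uniform() - 0.5)      # trial position
        w = psi2(x_trial, alpha) / psi2(x, alpha)       # acceptance ratio
        if rng.uniform() <= w:                          # Metropolis test
            x = x_trial
        if n >= n_steps // 10:                          # discard burn-in steps
            energies.append(local_energy(x, alpha))
    energies = np.asarray(energies)
    return energies.mean(), energies.var()

energy, variance = vmc_brute_force(alpha=1.0)
print(energy, variance)
```

At $\alpha=1$ the trial function is exact, so every sample gives $E_L=1/2$ and the variance vanishes; for other values of $\alpha$ the variance is finite.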
The radial Schroedinger equation for the hydrogen atom can be written as (when we have gotten rid of the first derivative term in the kinetic energy via the substitution $rR(r)=u(r)$)

$$ -\frac{\hbar^2}{2m}\frac{d^2 u(r)}{dr^2} - \left(\frac{ke^2}{r} - \frac{\hbar^2l(l+1)}{2mr^2}\right)u(r) = E u(r), $$

or with dimensionless variables

$$ -\frac{1}{2}\frac{d^2 u(\rho)}{d\rho^2} - \frac{u(\rho)}{\rho} + \frac{l(l+1)}{2\rho^2}u(\rho) - \lambda u(\rho) = 0, $$

with the hamiltonian

$$ H = -\frac{1}{2}\frac{d^2}{d\rho^2} - \frac{1}{\rho} + \frac{l(l+1)}{2\rho^2}. $$
We use the variational parameter $\alpha$ in the trial wave function

$$ u_T^{\alpha}(\rho) = \alpha\rho e^{-\alpha\rho}. $$

Inserting this wave function into the expression for the local energy $E_L$ gives

$$ E_L(\rho) = -\frac{1}{\rho} - \frac{\alpha}{2}\left(\alpha - \frac{2}{\rho}\right). $$
We note that at $\alpha=1$ we obtain the exact result, and the variance is zero, as it should be. The reason is that we then have the exact wave function, and the action of the hamiltonian on the wave function,

$$ H\psi = \mathrm{constant}\times\psi, $$

yields just a constant. The integral which defines various expectation values involving moments of the hamiltonian then becomes

$$ \langle H^n \rangle = \frac{\int d\boldsymbol{R}\,\Psi^{\ast}_T(\boldsymbol{R})H^n(\boldsymbol{R})\Psi_T(\boldsymbol{R})}{\int d\boldsymbol{R}\,\Psi^{\ast}_T(\boldsymbol{R})\Psi_T(\boldsymbol{R})} = \mathrm{constant}^n. $$

This gives an important piece of information: the exact wave function leads to zero variance! The variation is then performed by minimizing both the energy and the variance.
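A quick numerical check of this zero-variance property (our own sketch), using the local energy for the hydrogenic trial function $u_T^{\alpha}(\rho)=\alpha\rho e^{-\alpha\rho}$, namely $E_L(\rho)=-1/\rho-\frac{\alpha}{2}(\alpha-2/\rho)$:

```python
import numpy as np

def local_energy_hydrogen(rho, alpha):
    # E_L(rho) = -1/rho - (alpha/2)(alpha - 2/rho) for u_T = alpha*rho*exp(-alpha*rho)
    return -1.0 / rho - 0.5 * alpha * (alpha - 2.0 / rho)

rho = np.linspace(0.5, 5.0, 10)
print(local_energy_hydrogen(rho, 1.0))   # constant -0.5 everywhere: zero variance
print(local_energy_hydrogen(rho, 0.8))   # varies with rho: finite variance
```

For $\alpha\neq 1$ the local energy fluctuates from point to point, which is precisely what produces a finite variance in the Monte Carlo estimate.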
The Metropolis algorithm, invented by Metropolis and coworkers (see the original article), is a method to sample a normalized probability distribution by a stochastic process. We define $\mathbf{P}_i^{(n)}$ to be the probability of finding the system in the state $i$ at step $n$. The algorithm is then:
Sample a possible new state $j$ with some probability $T_{i\rightarrow j}$.
Accept the new state $j$ with probability $A_{i \rightarrow j}$ and use it as the next sample. With probability $1-A_{i\rightarrow j}$ the move is rejected and the original state $i$ is used again as a sample.
We wish to derive the required properties of $T$ and $A$ such that $\mathbf{P}_i^{(n\rightarrow \infty)} \rightarrow p_i$ so that starting from any distribution, the method converges to the correct distribution. Note that the description here is for a discrete probability distribution. Replacing probabilities $p_i$ with expressions like $p(x_i)dx_i$ will take all of these over to the corresponding continuum expressions.
The dynamical equation for $\mathbf{P}_i^{(n)}$ can be written directly from the description above. The probability of being in the state $i$ at step $n$ is given by the probability of being in any state $j$ at the previous step and making an accepted transition to $i$, added to the probability of being in the state $i$, making a transition to any state $j$ and rejecting the move:

$$ \mathbf{P}^{(n)}_i = \sum_j \left[\mathbf{P}^{(n-1)}_jT_{j\rightarrow i} A_{j\rightarrow i} + \mathbf{P}^{(n-1)}_iT_{i\rightarrow j}\left(1- A_{i\rightarrow j}\right)\right]. $$
Since the probability of making some transition must be 1, $\sum_j T_{i\rightarrow j} = 1$, and the above equation becomes

$$ \mathbf{P}^{(n)}_i = \mathbf{P}^{(n-1)}_i + \sum_j \left[\mathbf{P}^{(n-1)}_jT_{j\rightarrow i} A_{j\rightarrow i} - \mathbf{P}^{(n-1)}_iT_{i\rightarrow j}A_{i\rightarrow j}\right]. $$
For large $n$ we require that $\mathbf{P}^{(n\rightarrow \infty)}_i = p_i$, the desired probability distribution. Taking this limit gives the balance requirement

$$ \sum_j \left[p_jT_{j\rightarrow i} A_{j\rightarrow i} - p_iT_{i\rightarrow j}A_{i\rightarrow j}\right] = 0. $$
The balance requirement is very weak. Typically the much stronger detailed balance requirement is enforced: rather than setting the sum to zero, we set each term separately to zero and use this to determine the acceptance probabilities. Rearranging, the result is

$$ \frac{A_{j\rightarrow i}}{A_{i\rightarrow j}} = \frac{p_iT_{i\rightarrow j}}{p_jT_{j\rightarrow i}}. $$
The Metropolis choice is to maximize the $A$ values, that is

$$ A_{j \rightarrow i} = \min\left(1, \frac{p_iT_{i\rightarrow j}}{p_jT_{j\rightarrow i}}\right). $$
Other choices are possible, but they all correspond to multiplying $A_{i\rightarrow j}$ and $A_{j\rightarrow i}$ by the same constant smaller than unity. (The penalty function method uses just such a factor to compensate for $p_i$ that are evaluated stochastically and are therefore noisy.)
Having chosen the acceptance probabilities, we have guaranteed that if the $\mathbf{P}_i^{(n)}$ has equilibrated, that is if it is equal to $p_i$, it will remain equilibrated. Next we need to find the circumstances for convergence to equilibrium.
The dynamical equation can be written as

$$ \mathbf{P}^{(n)}_i = \sum_j M_{ij}\mathbf{P}^{(n-1)}_j, $$

with the matrix $M$ given by

$$ M_{ij} = \delta_{ij}\left[1 -\sum_k T_{i\rightarrow k} A_{i \rightarrow k}\right] + T_{j\rightarrow i} A_{j\rightarrow i}. $$
Summing over $i$ shows that $\sum_i M_{ij} = 1$, and since $\sum_k T_{i\rightarrow k} = 1$, and $A_{i \rightarrow k} \leq 1$, the elements of the matrix satisfy $M_{ij} \geq 0$. The matrix $M$ is therefore a stochastic matrix.
The Metropolis method is simply the power method for computing the right eigenvector of $M$ with the largest magnitude eigenvalue. By construction, the correct probability distribution is a right eigenvector with eigenvalue 1. Therefore, for the Metropolis method to converge to this result, we must show that $M$ has only one eigenvalue with this magnitude, and all other eigenvalues are smaller.
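These statements are easy to verify numerically on a small toy chain. The sketch below is our own illustration (the three-state target distribution and the uniform symmetric proposal $T$ are arbitrary choices): it builds $M$, checks that it is a stochastic matrix with $p$ as a right eigenvector of eigenvalue one, and shows that repeated application of $M$ (the power method) converges to $p$ from an arbitrary start.

```python
import numpy as np

p = np.array([0.2, 0.3, 0.5])        # target distribution (arbitrary choice)
n = len(p)
T = np.full((n, n), 1.0 / n)         # symmetric, uniform proposal T_{j -> i}

# Build M_{ij}: probability of ending in state i when starting from state j
M = np.zeros((n, n))
for j in range(n):
    for i in range(n):
        if i != j:
            M[i, j] = T[j, i] * min(1.0, p[i] / p[j])   # Metropolis acceptance
    M[j, j] = 1.0 - M[:, j].sum()    # rejected moves keep the walker in state j

print(np.allclose(M.sum(axis=0), 1.0))   # columns sum to 1: stochastic matrix
print(np.allclose(M @ p, p))             # p is an eigenvector with eigenvalue 1

# Power iteration from a far-from-equilibrium start converges to p
dist = np.array([1.0, 0.0, 0.0])
for _ in range(200):
    dist = M @ dist
print(dist)
```

Because the subdominant eigenvalues of $M$ have magnitude smaller than one, repeated multiplication damps out every component of the initial distribution except the equilibrium one.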
The Hamiltonian of the quantum dot is given by

$$ \hat{H} = \hat{H}_0 + \hat{V}, $$

where $\hat{H}_0$ is the many-body HO Hamiltonian, and $\hat{V}$ is the inter-electron Coulomb interaction. In dimensionless units,

$$ \hat{V} = \sum_{i < j} \frac{1}{r_{ij}}, $$

with $r_{ij}=\left|\mathbf{r}_i - \mathbf{r}_j\right|$ the distance between electrons $i$ and $j$.
This leads to the separable Hamiltonian, with the relative motion part given by ($r_{ij}=r$)
plus a standard Harmonic Oscillator problem for the center-of-mass motion. This system has analytical solutions in two and three dimensions (M. Taut 1993 and 1994).
We want to perform a variational Monte Carlo calculation of the ground state of two electrons in a quantum dot well with different oscillator energies, assuming total spin $S=0$. Our trial wave function has the following form,

$$ \psi_{T}(\boldsymbol{r}_1,\boldsymbol{r}_2) = C\exp\left(-\alpha_1\omega(r_1^2+r_2^2)/2\right)\exp\left(\frac{r_{12}}{1+\alpha_2 r_{12}}\right), $$

where the $\alpha$s represent our variational parameters, two in this case.
Why does the trial function look like this? How did we get there? This will be our main motivation for switching to Machine Learning.
To find an ansatz for the correlated part of the wave function, it is useful to rewrite the two-particle local energy in terms of the relative and center-of-mass motion. Let us denote the distance between the two electrons as $r_{12}$. We omit the center-of-mass motion since we are only interested in the case when $r_{12} \rightarrow 0$. The contribution from the center-of-mass (CoM) variable $\boldsymbol{R}_{\mathrm{CoM}}$ gives only a finite contribution. We focus only on the terms that are relevant for $r_{12}$ and for three dimensions. The relevant local energy becomes then
Set $l=0$ and we have the so-called cusp condition
The above results in

$$ \frac{d\Psi}{dr_{12}}\bigg|_{r_{12}=0} = \frac{1}{2}\Psi(r_{12}=0) $$

for anti-parallel spins ($l=0$), and

$$ \frac{d\Psi}{dr_{12}}\bigg|_{r_{12}=0} = \frac{1}{4}\Psi(r_{12}=0) $$

for parallel spins ($l=1$). This is the so-called cusp condition for the relative motion, resulting in a minimal requirement for the correlation part of the wave function. For general systems containing more than two electrons, we have this condition for each electron pair $ij$.
In [1]:
python code here
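A minimal brute-force VMC code for this system could look as follows. This is our own illustrative sketch (assuming $\omega=1$, a generalized Jastrow exponent $a\,r_{12}/(1+\beta r_{12})$, and a finite-difference Laplacian in the local energy instead of the analytical expression); a production code would use closed-form local energies, importance sampling and blocking.

```python
import numpy as np

def trial_wf(r, alpha, beta, a):
    """Trial wave function: Gaussian part times a Pade-Jastrow factor.
    r has shape (2, 2): two electrons in two dimensions, omega = 1."""
    r12 = np.linalg.norm(r[0] - r[1])
    return np.exp(-0.5 * alpha * np.sum(r**2) + a * r12 / (1.0 + beta * r12))

def local_energy(r, alpha, beta, a, interaction=True, h=1e-4):
    """E_L = (H psi)/psi, with the Laplacian from central finite differences."""
    psi = trial_wf(r, alpha, beta, a)
    lap = 0.0
    for i in range(2):            # particles
        for d in range(2):        # spatial dimensions
            rp, rm = r.copy(), r.copy()
            rp[i, d] += h
            rm[i, d] -= h
            lap += trial_wf(rp, alpha, beta, a) - 2.0 * psi + trial_wf(rm, alpha, beta, a)
    e_l = -0.5 * lap / (h * h * psi) + 0.5 * np.sum(r**2)    # kinetic + trap
    if interaction:
        e_l += 1.0 / np.linalg.norm(r[0] - r[1])             # Coulomb repulsion
    return e_l

def vmc(alpha, beta, a, n_steps=10000, step=1.0, interaction=True, seed=1):
    """Brute-force Metropolis sampling of |psi_T|^2."""
    rng = np.random.default_rng(seed)
    r = rng.normal(size=(2, 2))
    psi2 = trial_wf(r, alpha, beta, a) ** 2
    energies = []
    for n in range(n_steps):
        r_new = r + step * rng.uniform(-0.5, 0.5, size=(2, 2))
        psi2_new = trial_wf(r_new, alpha, beta, a) ** 2
        if rng.uniform() < psi2_new / psi2:        # Metropolis test
            r, psi2 = r_new, psi2_new
        if n >= n_steps // 10:                     # discard burn-in steps
            energies.append(local_energy(r, alpha, beta, a, interaction))
    energies = np.asarray(energies)
    return energies.mean(), energies.var()

# Sanity check: without the Coulomb term and with alpha = 1, a = 0, the trial
# function is the exact ground state, so E = 2 (in units of hbar*omega) and the
# variance vanishes up to the finite-difference error.
e_mean, e_var = vmc(alpha=1.0, beta=1.0, a=0.0, interaction=False)
print(e_mean, e_var)
```

The non-interacting check is exactly the zero-variance property discussed above; switching `interaction=True` and varying $(\alpha_1,\alpha_2)$ gives the variational estimates for the interacting dot.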
We need to replace the brute-force Metropolis algorithm with a walk in coordinate space biased by the trial wave function. This approach is based on the Fokker-Planck equation and the Langevin equation for generating a trajectory in coordinate space. The link between the Fokker-Planck equation and the Langevin equation is explained, albeit only partly, below. An excellent reference on topics like Brownian motion, Markov chains, the Fokker-Planck equation and the Langevin equation is the text by Van Kampen. Here we will focus first on the implementation part.
For a diffusion process characterized by a time-dependent probability density $P(x,t)$ in one dimension, the Fokker-Planck equation reads (for one particle/walker)

$$ \frac{\partial P}{\partial t} = D\frac{\partial }{\partial x}\left(\frac{\partial }{\partial x} -F\right)P(x,t), $$

where $F$ is a drift term and $D$ is the diffusion coefficient.
The new positions in coordinate space are given as the solutions of the Langevin equation using Euler's method. Namely, we go from the Langevin equation

$$ \frac{\partial x(t)}{\partial t} = DF(x(t)) + \eta, $$

with $\eta$ a random variable, yielding a new position

$$ y = x + DF(x)\Delta t + \xi\sqrt{\Delta t}, $$

where $\xi$ is a Gaussian random variable and $\Delta t$ is a chosen time step. The quantity $D$ is, in atomic units, equal to $1/2$ and comes from the factor $1/2$ in the kinetic energy operator. Note that $\Delta t$ is to be viewed as a parameter. Values of $\Delta t \in [0.001,0.01]$ yield in general rather stable values of the ground state energy.
The process of isotropic diffusion characterized by a time-dependent probability density $P(\mathbf{x},t)$ obeys (as an approximation) the so-called Fokker-Planck equation

$$ \frac{\partial P}{\partial t} = \sum_i D\frac{\partial }{\partial \mathbf{x_i}}\left(\frac{\partial }{\partial \mathbf{x_i}} -\mathbf{F_i}\right)P(\mathbf{x},t), $$

where $\mathbf{F_i}$ is the $i$-th component of the drift term (drift velocity) caused by an external potential, and $D$ is the diffusion coefficient. Convergence to a stationary probability density can be obtained by setting the left-hand side to zero. The resulting equation is satisfied if and only if all the terms of the sum are equal to zero,

$$ \frac{\partial^2 P}{\partial {\mathbf{x_i}^2}} = P\frac{\partial}{\partial {\mathbf{x_i}}}\mathbf{F_i} + \mathbf{F_i}\frac{\partial P}{\partial {\mathbf{x_i}}}. $$
The drift vector should be of the form $\mathbf{F} = g(\mathbf{x}) \frac{\partial P}{\partial \mathbf{x}}$. Then,

$$ \frac{\partial^2 P}{\partial {\mathbf{x_i}^2}} = P\frac{\partial g}{\partial P}\left( \frac{\partial P}{\partial {\mathbf{x}_i}} \right)^2 + Pg\frac{\partial ^2 P}{\partial {\mathbf{x}_i^2}} + g\left( \frac{\partial P}{\partial {\mathbf{x}_i}} \right)^2. $$
The condition of stationary density means that the left-hand side equals zero. In other words, the terms containing first and second derivatives have to cancel each other. This is possible only if $g = \frac{1}{P}$, which yields

$$ \mathbf{F} = 2\frac{1}{\Psi_T}\nabla\Psi_T, $$

which is known as the so-called quantum force. This term is responsible for pushing the walker towards regions of configuration space where the trial wave function is large, increasing the efficiency of the simulation in contrast to the Metropolis algorithm, where the walker has the same probability of moving in every direction.
The Fokker-Planck equation yields a transition probability given by the Green's function

$$ G(y,x,\Delta t) = \frac{1}{(4\pi D\Delta t)^{3N/2}} \exp\left(-\frac{\left(y-x-D\Delta t F(x)\right)^2}{4D\Delta t}\right), $$

which in turn means that our brute-force Metropolis algorithm,

$$ A(y,x) = \min(1,q(y,x)), $$

with $q(y,x) = |\Psi_T(y)|^2/|\Psi_T(x)|^2$, is now replaced by the Metropolis-Hastings algorithm (see also Hastings' original article), with

$$ q(y,x) = \frac{G(x,y,\Delta t)|\Psi_T(y)|^2}{G(y,x,\Delta t)|\Psi_T(x)|^2}. $$
In [2]:
add python code here
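Again as a sketch (our own toy example rather than the full quantum-dot code): for a single particle in a one-dimensional harmonic oscillator with $\psi_T(x)=e^{-\alpha x^2/2}$, the quantum force is $F(x)=2\psi_T'(x)/\psi_T(x)=-2\alpha x$, and the Metropolis-Hastings test uses the ratio of Green's functions for the forward and backward moves.

```python
import numpy as np

def psi(x, alpha):
    """Trial wave function psi_T = exp(-alpha x^2/2), 1D harmonic oscillator."""
    return np.exp(-0.5 * alpha * x * x)

def local_energy(x, alpha):
    """Analytical local energy: E_L = alpha/2 + x^2 (1 - alpha^2)/2."""
    return 0.5 * alpha + 0.5 * x * x * (1.0 - alpha * alpha)

def quantum_force(x, alpha):
    """F = 2 (d psi/dx)/psi = -2 alpha x."""
    return -2.0 * alpha * x

def greens(y, x, alpha, dt, D=0.5):
    """Unnormalized Green's function of the Fokker-Planck equation;
    the normalization cancels in the acceptance ratio."""
    return np.exp(-(y - x - D * dt * quantum_force(x, alpha)) ** 2 / (4.0 * D * dt))

def vmc_importance(alpha, n_steps=20000, dt=0.01, D=0.5, seed=3):
    rng = np.random.default_rng(seed)
    x = rng.normal()
    energies = []
    for n in range(n_steps):
        # Langevin/Euler step: drift along the quantum force plus diffusion
        y = x + D * dt * quantum_force(x, alpha) + rng.normal() * np.sqrt(2.0 * D * dt)
        # Metropolis-Hastings acceptance for the asymmetric proposal
        q = (greens(x, y, alpha, dt) * psi(y, alpha) ** 2) / \
            (greens(y, x, alpha, dt) * psi(x, alpha) ** 2)
        if rng.uniform() < q:
            x = y
        if n >= n_steps // 10:                 # discard burn-in steps
            energies.append(local_energy(x, alpha))
    energies = np.asarray(energies)
    return energies.mean(), energies.var()

# At alpha = 1 the trial function is exact: energy 1/2 and zero variance
e_mean, e_var = vmc_importance(alpha=1.0)
print(e_mean, e_var)
```

Compared with the brute-force walk, the drift term pushes the walker toward regions where $|\psi_T|^2$ is large, so small time steps $\Delta t \in [0.001,0.01]$ already give high acceptance rates.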
The above procedure is not the smartest one. Looping over all variational parameters becomes expensive. Also, we did not use importance sampling or optimized estimates of the statistical error (blocking, bootstrap, jackknife). Such codes are included in the above Github address.
We can also be smarter and use minimization methods to find the optimal variational parameters with fewer Monte Carlo cycles and then fire up our heavy artillery.
One way to achieve this is to minimize the energy as a function of the variational parameters.
To find the derivatives of the local energy expectation value as functions of the variational parameters, we can use the chain rule and the hermiticity of the Hamiltonian.
Let us define

$$ \bar{E}_{\alpha_i} = \frac{d\langle E_L\rangle}{d\alpha_i} $$

as the derivative of the energy with respect to the variational parameter $\alpha_i$. We also define the derivative of the trial function (skipping the subindex $T$) as

$$ \bar{\Psi}_{i} = \frac{d\Psi}{d\alpha_i}. $$

The elements of the gradient of the expectation value of the local energy are then (using the chain rule and the hermiticity of the Hamiltonian)

$$ \bar{E}_{i} = 2\left(\left\langle \frac{\bar{\Psi}_{i}}{\Psi}E_L\right\rangle - \left\langle \frac{\bar{\Psi}_{i}}{\Psi}\right\rangle\langle E_L \rangle\right). $$

From a computational point of view it means that you need to compute the expectation values of

$$ \left\langle \frac{\bar{\Psi}_{i}}{\Psi}E_L\right\rangle $$

and

$$ \left\langle \frac{\bar{\Psi}_{i}}{\Psi}\right\rangle\langle E_L\rangle. $$
These integrals are evaluated using MC integration (with all its possible error sources). We can then use methods like stochastic gradient descent or other minimization methods to find the optimal variational parameters (we do not discuss this topic here, but these methods are very important in ML).
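As an illustration (again our own sketch, on a one-dimensional harmonic-oscillator toy model where $|\psi_T|^2\propto e^{-\alpha x^2}$ is Gaussian and can be sampled directly), we use $\bar{\Psi}_{\alpha}/\Psi=-x^2/2$ and the gradient expression above to drive a plain gradient-descent loop toward the optimal $\alpha=1$:

```python
import numpy as np

def energy_and_gradient(alpha, n_samples=100000, rng=None):
    """MC estimate of <E_L> and d<E_L>/dalpha for psi_T = exp(-alpha x^2/2);
    here |psi_T|^2 is a Gaussian, so we can sample it exactly."""
    if rng is None:
        rng = np.random.default_rng()
    x = rng.normal(0.0, np.sqrt(0.5 / alpha), size=n_samples)
    e_l = 0.5 * alpha + 0.5 * x**2 * (1.0 - alpha**2)   # local energy
    g = -0.5 * x**2                                      # (d psi/dalpha)/psi
    grad = 2.0 * (np.mean(e_l * g) - np.mean(e_l) * np.mean(g))
    return np.mean(e_l), grad

rng = np.random.default_rng(11)
alpha, lr = 0.6, 0.5
for it in range(60):                     # plain gradient descent on <E_L>
    energy, grad = energy_and_gradient(alpha, rng=rng)
    alpha -= lr * grad
print(alpha, energy)
```

Note that the statistical noise in the gradient shrinks as $\alpha$ approaches the exact value, since the local energy then becomes constant; this self-sharpening is another face of the zero-variance property.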
We have a model, our likelihood function.
How should we define the cost function?
Suppose the trial function (our model) is the exact wave function. The action of the hamiltonian on the wave function,

$$ H\Psi = \mathrm{constant}\times \Psi, $$

yields just a constant. The integral which defines various expectation values involving moments of the hamiltonian then becomes

$$ \langle H^n \rangle = \frac{\int d\boldsymbol{R}\,\Psi^{\ast}_T(\boldsymbol{R})H^n(\boldsymbol{R})\Psi_T(\boldsymbol{R})}{\int d\boldsymbol{R}\,\Psi^{\ast}_T(\boldsymbol{R})\Psi_T(\boldsymbol{R})} = \mathrm{constant}^n. $$

This gives an important piece of information: the exact wave function leads to zero variance! The variance is defined as

$$ \sigma_E^2 = \langle E_L^2\rangle - \langle E_L\rangle^2. $$
Variation is then performed by minimizing both the energy and the variance.
We can then take the derivatives of

$$ \sigma_E^2 = \langle E_L^2\rangle - \langle E_L\rangle^2 $$

with respect to the variational parameters. The derivatives of the variance can then be used to define the so-called Hessian matrix, which in turn allows us to use minimization methods like Newton's method or standard gradient methods.
This leads, however, to a more complicated expression, with obvious risks of additional errors when evaluating the integrals by Monte Carlo integration. This approach is less used; see, however, the work of Filippi and Umrigar. The resulting expression is complicated, and evaluating this cost function means having to evaluate the above second derivative of the energy.