**Morten Hjorth-Jensen Email morten.hjorth-jensen@fys.uio.no**, Department of Physics, University of Oslo and Department of Physics and Astronomy and National Superconducting Cyclotron Laboratory, Michigan State University

Date: **Feb 6, 2019**

Copyright 1999-2019, Morten Hjorth-Jensen Email morten.hjorth-jensen@fys.uio.no. Released under CC Attribution-NonCommercial 4.0 license

**Statistical analysis.**

```
* Monte Carlo simulations can be treated as *computer experiments*
* The results can be analysed with the same statistical tools as we would use analysing experimental data.
* As in all experiments, we are looking for expectation values and an estimate of how accurate they are, i.e., possible sources for errors.
```

A very good article which explains blocking is H. Flyvbjerg and H. G. Petersen, *Error estimates on averages of correlated data*, Journal of Chemical Physics 91, 461-466 (1989).

**Statistical analysis.**

```
* As in other experiments, Monte Carlo experiments have two classes of errors:
* Statistical errors
* Systematical errors
* Statistical errors can be estimated using standard tools from statistics
* Systematical errors are method specific and must be treated differently from case to case. (In VMC a common source is the step length or time step in importance sampling)
```

Resampling methods are an indispensable tool in modern statistics. They involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model. For example, in order to estimate the variability of a linear regression fit, we can repeatedly draw different samples from the training data, fit a linear regression to each new sample, and then examine the extent to which the resulting fits differ. Such an approach may allow us to obtain information that would not be available from fitting the model only once using the original training sample.

Resampling approaches can be computationally expensive, because they involve fitting the same statistical method multiple times using different subsets of the training data. However, due to recent advances in computing power, the computational requirements of resampling methods generally are not prohibitive. In this chapter, we discuss two of the most commonly used resampling methods, cross-validation and the bootstrap. Both methods are important tools in the practical application of many statistical learning procedures. For example, cross-validation can be used to estimate the test error associated with a given statistical learning method in order to evaluate its performance, or to select the appropriate level of flexibility. The process of evaluating a model’s performance is known as model assessment, whereas the process of selecting the proper level of flexibility for a model is known as model selection. The bootstrap is widely used.

**Statistical analysis.**

```
* Our simulations can be treated as *computer experiments*. This is particularly the case for Monte Carlo methods
* The results can be analysed with the same statistical tools as we would use analysing experimental data.
* As in all experiments, we are looking for expectation values and an estimate of how accurate they are, i.e., possible sources for errors.
```

```
* As in other experiments, many numerical experiments have two classes of errors:
* Statistical errors
* Systematical errors
* Statistical errors can be estimated using standard tools from statistics
* Systematical errors are method specific and must be treated differently from case to case.
```

The *probability distribution function (PDF)* is a function
$p(x)$ on the domain which, in the discrete case, gives us the
probability or relative frequency with which these values of $X$ occur:

$$
p(x) = \mathrm{prob}(X=x)
$$

*density* of the probability rather than the probability
itself. The probability for a stochastic variable to assume any value
on a non-infinitesimal interval $[a,\,b]$ is then just the integral:

$$
\mathrm{prob}(a\leq X\leq b) = \int_a^b p(x)dx
$$

Qualitatively speaking, a stochastic variable represents the values of numbers chosen as if by chance from some specified PDF so that the selection of a large set of these numbers reproduces this PDF.

A particularly useful class of special expectation values are the
*moments*. The $n$-th moment of the PDF $p$ is defined as
follows:

$$
\langle x^n\rangle \equiv \int\! x^n p(x)\,dx
$$

*mean* of $p$
and often denoted by the letter $\mu$:

$$
\langle x\rangle = \mu \equiv \int\! x p(x)\,dx
$$

$$
\langle (x-\langle x \rangle )^n\rangle \equiv \int\! (x-\langle x\rangle)^n p(x)\,dx
$$

*variance* of $p$, is of particular interest. For the stochastic
variable $X$, the variance is denoted as $\sigma^2_X$ or $\mathrm{var}(X)$:

$$
\begin{equation}
\sigma^2_X\ \ =\ \ \mathrm{var}(X) = \langle (x-\langle x\rangle)^2\rangle =
\int\! (x-\langle x\rangle)^2 p(x)\,dx
\label{_auto1} \tag{1}
\end{equation}
$$

$$
\begin{equation}
= \int\! \left(x^2 - 2 x \langle x\rangle^{2} +
\langle x\rangle^2\right)p(x)\,dx
\label{_auto2} \tag{2}
\end{equation}
$$

$$
\begin{equation}
= \langle x^2\rangle - 2 \langle x\rangle\langle x\rangle + \langle x\rangle^2
\label{_auto3} \tag{3}
\end{equation}
$$

$$
\begin{equation}
= \langle x^2\rangle - \langle x\rangle^2
\label{_auto4} \tag{4}
\end{equation}
$$

The square root of the variance, $\sigma =\sqrt{\langle (x-\langle x\rangle)^2\rangle}$ is called the *standard deviation* of $p$. It is clearly just the RMS (root-mean-square)
value of the deviation of the PDF from its mean value, interpreted
qualitatively as the *spread* of $p$ around its mean.

Another important quantity is the so called covariance, a variant of
the above defined variance. Consider again the set $\{X_i\}$ of $n$
stochastic variables (not necessarily uncorrelated) with the
multivariate PDF $P(x_1,\dots,x_n)$. The *covariance* of two
of the stochastic variables, $X_i$ and $X_j$, is defined as follows:

$$
\mathrm{cov}(X_i,\,X_j) \equiv \langle (x_i-\langle x_i\rangle)(x_j-\langle x_j\rangle)\rangle
\nonumber
$$

$$
\begin{equation}
=
\int\!\cdots\!\int\!(x_i-\langle x_i \rangle)(x_j-\langle x_j \rangle)\,
P(x_1,\dots,x_n)\,dx_1\dots dx_n
\label{eq:def_covariance} \tag{5}
\end{equation}
$$

with

$$
\langle x_i\rangle =
\int\!\cdots\!\int\!x_i\,P(x_1,\dots,x_n)\,dx_1\dots dx_n
$$

If we consider the above covariance as a matrix $C_{ij}=\mathrm{cov}(X_i,\,X_j)$, then the diagonal elements are just the familiar variances, $C_{ii} = \mathrm{cov}(X_i,\,X_i) = \mathrm{var}(X_i)$. It turns out that all the off-diagonal elements are zero if the stochastic variables are uncorrelated. This is easy to show, keeping in mind the linearity of the expectation value. Consider the stochastic variables $X_i$ and $X_j$, ($i\neq j$):

$$
\begin{equation}
\mathrm{cov}(X_i,\,X_j) = \langle(x_i-\langle x_i\rangle)(x_j-\langle x_j\rangle)\rangle
\label{_auto5} \tag{6}
\end{equation}
$$

$$
\begin{equation}
=\langle x_i x_j - x_i\langle x_j\rangle - \langle x_i\rangle x_j + \langle x_i\rangle\langle x_j\rangle\rangle
\label{_auto6} \tag{7}
\end{equation}
$$

$$
\begin{equation}
=\langle x_i x_j\rangle - \langle x_i\langle x_j\rangle\rangle - \langle \langle x_i\rangle x_j\rangle +
\langle \langle x_i\rangle\langle x_j\rangle\rangle
\label{_auto7} \tag{8}
\end{equation}
$$

$$
\begin{equation}
=\langle x_i x_j\rangle - \langle x_i\rangle\langle x_j\rangle - \langle x_i\rangle\langle x_j\rangle +
\langle x_i\rangle\langle x_j\rangle
\label{_auto8} \tag{9}
\end{equation}
$$

$$
\begin{equation}
=\langle x_i x_j\rangle - \langle x_i\rangle\langle x_j\rangle
\label{_auto9} \tag{10}
\end{equation}
$$

If $X_i$ and $X_j$ are independent, we get $\langle x_i x_j\rangle =\langle x_i\rangle\langle x_j\rangle$, resulting in $\mathrm{cov}(X_i, X_j) = 0\ \ (i\neq j)$.

Also useful for us is the covariance of linear combinations of stochastic variables. Let $\{X_i\}$ and $\{Y_i\}$ be two sets of stochastic variables. Let also $\{a_i\}$ and $\{b_i\}$ be two sets of scalars. Consider the linear combination:

$$
U = \sum_i a_i X_i \qquad V = \sum_j b_j Y_j
$$

By the linearity of the expectation value

$$
\mathrm{cov}(U, V) = \sum_{i,j}a_i b_j \mathrm{cov}(X_i, Y_j)
$$

$$
\begin{equation}
\mathrm{var}(U) = \sum_{i,j}a_i a_j \mathrm{cov}(X_i, X_j)
\label{eq:variance_linear_combination} \tag{11}
\end{equation}
$$

2 0

< < < ! ! M A T H _ B L O C K

$$
\mathrm{var}(\sum_i a_i X_i) = \sum_i a_i^2 \mathrm{var}(X_i)
$$

$$
\{x_1, x_2,\dots\,x_k,\dots\}.
$$

We will call these
values our *measurements* and the entire set as our measured
*sample*. The action of measuring all the elements of a sample
we will call a stochastic *experiment* since, operationally,
they are often associated with results of empirical observation of
some physical or mathematical phenomena; precisely an experiment. We
assume that these values are distributed according to some
PDF $p_X^{\phantom X}(x)$, where $X$ is just the formal symbol for the
stochastic variable whose PDF is $p_X^{\phantom X}(x)$. Instead of
trying to determine the full distribution $p$ we are often only
interested in finding the few lowest moments, like the mean
$\mu_X^{\phantom X}$ and the variance $\sigma_X^{\phantom X}$.

In practical situations a sample is always of finite size. Let that
size be $n$. The expectation value of a sample, the *sample mean*, is then defined as follows:

$$
\bar{x}_n \equiv \frac{1}{n}\sum_{k=1}^n x_k
$$

The *sample variance* is:

$$
\mathrm{var}(x) \equiv \frac{1}{n}\sum_{k=1}^n (x_k - \bar{x}_n)^2
$$

its square root being the *standard deviation of the sample*. The
*sample covariance* is:

$$
\mathrm{cov}(x)\equiv\frac{1}{n}\sum_{kl}(x_k - \bar{x}_n)(x_l - \bar{x}_n)
$$

Note that the sample variance is the sample covariance without the cross terms. In a similar manner as the covariance in Eq. (5) is a measure of the correlation between two stochastic variables, the above defined sample covariance is a measure of the sequential correlation between succeeding measurements of a sample.

These quantities, being known experimental values, differ significantly from and must not be confused with the similarly named quantities for stochastic variables, mean $\mu_X$, variance $\mathrm{var}(X)$ and covariance $\mathrm{cov}(X,Y)$.

The law of large numbers states that as the size of our sample grows to infinity, the sample mean approaches the true mean $\mu_X^{\phantom X}$ of the chosen PDF:

$$
\lim_{n\to\infty}\bar{x}_n = \mu_X^{\phantom X}
$$

The sample mean $\bar{x}_n$ works therefore as an estimate of the true mean $\mu_X^{\phantom X}$.

What we need to find out is how good an approximation $\bar{x}_n$ is to
$\mu_X^{\phantom X}$. In any stochastic measurement, an estimated
mean is of no use to us without a measure of its error. A quantity
that tells us how well we can reproduce it in another experiment. We
are therefore interested in the PDF of the sample mean itself. Its
standard deviation will be a measure of the spread of sample means,
and we will simply call it the *error* of the sample mean, or
just sample error, and denote it by $\mathrm{err}_X^{\phantom X}$. In
practice, we will only be able to produce an *estimate* of the
sample error since the exact value would require the knowledge of the
true PDFs behind, which we usually do not have.

Let us first take a look at what happens to the sample error as the size of the sample grows. In a sample, each of the measurements $x_i$ can be associated with its own stochastic variable $X_i$. The stochastic variable $\overline X_n$ for the sample mean $\bar{x}_n$ is then just a linear combination, already familiar to us:

$$
\overline X_n = \frac{1}{n}\sum_{i=1}^n X_i
$$

All the coefficients are just equal $1/n$. The PDF of $\overline X_n$, denoted by $p_{\overline X_n}(x)$ is the desired PDF of the sample means.

The probability density of obtaining a sample mean $\bar x_n$ is the product of probabilities of obtaining arbitrary values $x_1, x_2,\dots,x_n$ with the constraint that the mean of the set $\{x_i\}$ is $\bar x_n$:

$$
p_{\overline X_n}(x) = \int p_X^{\phantom X}(x_1)\cdots
\int p_X^{\phantom X}(x_n)\
\delta\!\left(x - \frac{x_1+x_2+\dots+x_n}{n}\right)dx_n \cdots dx_1
$$

And in particular we are interested in its variance $\mathrm{var}(\overline X_n)$.

It is generally not possible to express $p_{\overline X_n}(x)$ in a
closed form given an arbitrary PDF $p_X^{\phantom X}$ and a number
$n$. But for the limit $n\to\infty$ it is possible to make an
approximation. The very important result is called *the central limit theorem*. It tells us that as $n$ goes to infinity,
$p_{\overline X_n}(x)$ approaches a Gaussian distribution whose mean
and variance equal the true mean and variance, $\mu_{X}^{\phantom X}$
and $\sigma_{X}^{2}$, respectively:

$$
\begin{equation}
\lim_{n\to\infty} p_{\overline X_n}(x) =
\left(\frac{n}{2\pi\mathrm{var}(X)}\right)^{1/2}
e^{-\frac{n(x-\bar x_n)^2}{2\mathrm{var}(X)}}
\label{eq:central_limit_gaussian} \tag{12}
\end{equation}
$$

$$
\begin{equation}
\mathrm{err}_X^2 = \mathrm{var}(\overline X_n) = \frac{1}{n^2}
\sum_{ij} \mathrm{cov}(X_i, X_j)
\label{eq:error_exact} \tag{13}
\end{equation}
$$

We see now that in order to calculate the exact error of the sample with the above expression, we would need the true means $\mu_{X_i}^{\phantom X}$ of the stochastic variables $X_i$. To calculate these requires that we know the true multivariate PDF of all the $X_i$. But this PDF is unknown to us, we have only got the measurements of one sample. The best we can do is to let the sample itself be an estimate of the PDF of each of the $X_i$, estimating all properties of $X_i$ through the measurements of the sample.

Our estimate of $\mu_{X_i}^{\phantom X}$ is then the sample mean $\bar x$ itself, in accordance with the the central limit theorem:

$$
\mu_{X_i}^{\phantom X} = \langle x_i\rangle \approx \frac{1}{n}\sum_{k=1}^n x_k = \bar x
$$

*estimate* of the covariance in Eq. (13)

$$
\mathrm{cov}(X_i, X_j) = \langle (x_i-\langle x_i\rangle)(x_j-\langle x_j\rangle)\rangle
\approx\langle (x_i - \bar x)(x_j - \bar{x})\rangle,
$$

resulting in

$$
\frac{1}{n} \sum_{l}^n \left(\frac{1}{n}\sum_{k}^n (x_k -\bar x_n)(x_l - \bar x_n)\right)=\frac{1}{n}\frac{1}{n} \sum_{kl} (x_k -\bar x_n)(x_l - \bar x_n)=\frac{1}{n}\mathrm{cov}(x)
$$

$$
\mathrm{var}(X_i)=\langle x_i - \langle x_i\rangle\rangle \approx \langle x_i - \bar x_n\rangle\nonumber,
$$

which is approximated as

$$
\begin{equation}
\mathrm{var}(X_i)\approx \frac{1}{n}\sum_{k=1}^n (x_k - \bar x_n)=\mathrm{var}(x)
\label{eq:var_estimate_i_think} \tag{14}
\end{equation}
$$

$$
\mathrm{err}_X^2
=\frac{1}{n^2}\sum_{ij} \mathrm{cov}(X_i, X_j) \nonumber
$$

$$
\approx\frac{1}{n^2}\sum_{ij}\frac{1}{n}\mathrm{cov}(x) =\frac{1}{n^2}n^2\frac{1}{n}\mathrm{cov}(x)\nonumber
$$

$$
\begin{equation}
=\frac{1}{n}\mathrm{cov}(x)
\label{eq:error_estimate} \tag{15}
\end{equation}
$$

which is nothing but the sample covariance divided by the number of measurements in the sample.

In the special case that the measurements of the sample are uncorrelated (equivalently the stochastic variables $X_i$ are uncorrelated) we have that the off-diagonal elements of the covariance are zero. This gives the following estimate of the sample error:

$$
\mathrm{err}_X^2=\frac{1}{n^2}\sum_{ij} \mathrm{cov}(X_i, X_j) =
\frac{1}{n^2} \sum_i \mathrm{var}(X_i),
$$

resulting in

$$
\begin{equation}
\mathrm{err}_X^2\approx \frac{1}{n^2} \sum_i \mathrm{var}(x)= \frac{1}{n}\mathrm{var}(x)
\label{eq:error_estimate_uncorrel} \tag{16}
\end{equation}
$$

where in the second step we have used Eq. (14). The error of the sample is then just its standard deviation divided by the square root of the number of measurements the sample contains. This is a very useful formula which is easy to compute. It acts as a first approximation to the error, but in numerical experiments, we cannot overlook the always present correlations.

For computational purposes one usually splits up the estimate of $\mathrm{err}_X^2$, given by Eq. (15), into two parts

$$
\mathrm{err}_X^2 = \frac{1}{n}\mathrm{var}(x) + \frac{1}{n}(\mathrm{cov}(x)-\mathrm{var}(x)),
$$

which equals

$$
\begin{equation}
\frac{1}{n^2}\sum_{k=1}^n (x_k - \bar x_n)^2 +\frac{2}{n^2}\sum_{k<l} (x_k - \bar x_n)(x_l - \bar x_n)
\label{eq:error_estimate_split_up} \tag{17}
\end{equation}
$$

The first term is the same as the error in the uncorrelated case, Eq. (16). This means that the second term accounts for the error correction due to correlation between the measurements. For uncorrelated measurements this second term is zero.

Computationally the uncorrelated first term is much easier to treat efficiently than the second.

$$
\mathrm{var}(x) = \frac{1}{n}\sum_{k=1}^n (x_k - \bar x_n)^2 =
\left(\frac{1}{n}\sum_{k=1}^n x_k^2\right) - \bar x_n^2
$$

We just accumulate separately the values $x^2$ and $x$ for every measurement $x$ we receive. The correlation term, though, has to be calculated at the end of the experiment since we need all the measurements to calculate the cross terms. Therefore, all measurements have to be stored throughout the experiment.

Let us analyze the problem by splitting up the correlation term into partial sums of the form:

$$
f_d = \frac{1}{n-d}\sum_{k=1}^{n-d}(x_k - \bar x_n)(x_{k+d} - \bar x_n)
$$

The correlation term of the error can now be rewritten in terms of $f_d$

$$
\frac{2}{n}\sum_{k<l} (x_k - \bar x_n)(x_l - \bar x_n) =
2\sum_{d=1}^{n-1} f_d
$$

*autocorrelation function*

$$
\kappa_d = \frac{f_d}{\mathrm{var}(x)}
$$

which gives us a useful measure of pairwise correlations starting always at $1$ for $d=0$.

The sample error (see eq. (17)) can now be written in terms of the autocorrelation function:

$$
\mathrm{err}_X^2 =
\frac{1}{n}\mathrm{var}(x)+\frac{2}{n}\cdot\mathrm{var}(x)\sum_{d=1}^{n-1}
\frac{f_d}{\mathrm{var}(x)}\nonumber
$$

$$
=
\left(1+2\sum_{d=1}^{n-1}\kappa_d\right)\frac{1}{n}\mathrm{var}(x)\nonumber
$$

$$
\begin{equation}
=\frac{\tau}{n}\cdot\mathrm{var}(x)
\label{eq:error_estimate_corr_time} \tag{18}
\end{equation}
$$

*autocorrelation time*:

$$
\begin{equation}
\tau = 1+2\sum_{d=1}^{n-1}\kappa_d
\label{eq:autocorrelation_time} \tag{19}
\end{equation}
$$

For a correlation free experiment, $\tau$ equals 1. From the point of view of eq. (18) we can interpret a sequential correlation as an effective reduction of the number of measurements by a factor $\tau$. The effective number of measurements becomes:

$$
n_\mathrm{eff} = \frac{n}{\tau}
$$

To neglect the autocorrelation time $\tau$ will always cause our
simple uncorrelated estimate of $\mathrm{err}_X^2\approx \mathrm{var}(x)/n$ to
be less than the true sample error. The estimate of the error will be
too *good*. On the other hand, the calculation of the full
autocorrelation time poses an efficiency problem if the set of
measurements is very large.

The so-called time-displacement autocorrelation $\phi(t)$ for a quantity $\mathbf{M}$ is given by

$$
\phi(t) = \int dt' \left[\mathbf{M}(t')-\langle \mathbf{M} \rangle\right]\left[\mathbf{M}(t'+t)-\langle \mathbf{M} \rangle\right],
$$

which can be rewritten as

$$
\phi(t) = \int dt' \left[\mathbf{M}(t')\mathbf{M}(t'+t)-\langle \mathbf{M} \rangle^2\right],
$$

$$
\phi(t) = \frac{1}{t_{\mathrm{max}}-t}\sum_{t'=0}^{t_{\mathrm{max}}-t}\mathbf{M}(t')\mathbf{M}(t'+t)
-\frac{1}{t_{\mathrm{max}}-t}\sum_{t'=0}^{t_{\mathrm{max}}-t}\mathbf{M}(t')\times
\frac{1}{t_{\mathrm{max}}-t}\sum_{t'=0}^{t_{\mathrm{max}}-t}\mathbf{M}(t'+t).
\label{eq:phitf} \tag{20}
$$

One should be careful with times close to $t_{\mathrm{max}}$, the upper limit of the sums becomes small and we end up integrating over a rather small time interval. This means that the statistical error in $\phi(t)$ due to the random nature of the fluctuations in $\mathbf{M}(t)$ can become large.

One should therefore choose $t \ll t_{\mathrm{max}}$.

Note that the variable $\mathbf{M}$ can be any expectation values of interest.

The time-correlation function gives a measure of the correlation between the various values of the variable at a time $t'$ and a time $t'+t$. If we multiply the values of $\mathbf{M}$ at these two different times, we will get a positive contribution if they are fluctuating in the same direction, or a negative value if they fluctuate in the opposite direction. If we then integrate over time, or use the discretized version of, the time correlation function $\phi(t)$ should take a non-zero value if the fluctuations are correlated, else it should gradually go to zero. For times a long way apart the different values of $\mathbf{M}$ are most likely uncorrelated and $\phi(t)$ should be zero.

We can derive the correlation time by observing that our Metropolis algorithm is based on a random walk in the space of all possible spin configurations. Our probability distribution function $\mathbf{\hat{w}}(t)$ after a given number of time steps $t$ could be written as

$$
\mathbf{\hat{w}}(t) = \mathbf{\hat{W}^t\hat{w}}(0),
$$

$$
\mathbf{\hat{w}}(0) = \sum_i\alpha_i\mathbf{\hat{v}}_i,
$$

resulting in

$$
\mathbf{\hat{w}}(t) = \mathbf{\hat{W}}^t\mathbf{\hat{w}}(0)=\mathbf{\hat{W}}^t\sum_i\alpha_i\mathbf{\hat{v}}_i=
\sum_i\lambda_i^t\alpha_i\mathbf{\hat{v}}_i,
$$

with $\lambda_i$ the $i^{\mathrm{th}}$ eigenvalue corresponding to

the eigenvector $\mathbf{\hat{v}}_i$.

If we assume that $\lambda_0$ is the largest eigenvector we see that in the limit $t\rightarrow \infty$, $\mathbf{\hat{w}}(t)$ becomes proportional to the corresponding eigenvector $\mathbf{\hat{v}}_0$. This is our steady state or final distribution.

We can relate this property to an observable like the mean energy. With the probabilty $\mathbf{\hat{w}}(t)$ (which in our case is the squared trial wave function) we can write the expectation values as

$$
\langle \mathbf{M}(t) \rangle = \sum_{\mu} \mathbf{\hat{w}}(t)_{\mu}\mathbf{M}_{\mu},
$$

or as the scalar of a vector product

$$
\langle \mathbf{M}(t) \rangle = \mathbf{\hat{w}}(t)\mathbf{m},
$$

$$
\langle \mathbf{M}(t) \rangle = \mathbf{\hat{w}}(t)\mathbf{m}=\sum_i\lambda_i^t\alpha_i\mathbf{\hat{v}}_i\mathbf{m}_i.
$$

$$
\langle \mathbf{M}(t) \rangle = \sum_i\lambda_i^t\alpha_im_i.
$$

$$
\langle \mathbf{M}(t) \rangle = \langle \mathbf{M}(\infty) \rangle+\sum_{i\ne 0}\lambda_i^t\alpha_im_i.
$$

We define the quantity

$$
\tau_i=-\frac{1}{log\lambda_i},
$$

and rewrite the last expectation value as

$$
\langle \mathbf{M}(t) \rangle = \langle \mathbf{M}(\infty) \rangle+\sum_{i\ne 0}\alpha_im_ie^{-t/\tau_i}.
\label{eq:finalmeanm} \tag{21}
$$

The quantities $\tau_i$ are the correlation times for the system. They control also the auto-correlation function discussed above. The longest correlation time is obviously given by the second largest eigenvalue $\tau_1$, which normally defines the correlation time discussed above. For large times, this is the only correlation time that survives. If higher eigenvalues of the transition matrix are well separated from $\lambda_1$ and we simulate long enough, $\tau_1$ may well define the correlation time. In other cases we may not be able to extract a reliable result for $\tau_1$. Coming back to the time correlation function $\phi(t)$ we can present a more general definition in terms of the mean magnetizations $ \langle \mathbf{M}(t) \rangle$. Recalling that the mean value is equal to $ \langle \mathbf{M}(\infty) \rangle$ we arrive at the expectation values

$$
\phi(t) =\langle \mathbf{M}(0)-\mathbf{M}(\infty)\rangle \langle \mathbf{M}(t)-\mathbf{M}(\infty)\rangle,
$$

resulting in

$$
\phi(t) =\sum_{i,j\ne 0}m_i\alpha_im_j\alpha_je^{-t/\tau_i},
$$

$$
\phi (t) \sim \exp{(-t/\tau)}
$$

then the exponential correlation time can be computed as the average

$$
\tau_{\mathrm{exp}} = -\langle \frac{t}{log|\frac{\phi(t)}{\phi(0)}|} \rangle.
$$

If the decay is exponential, then

$$
\int_0^{\infty} dt \phi(t) = \int_0^{\infty} dt \phi(0)\exp{(-t/\tau)} = \tau \phi(0),
$$

which suggests another measure of correlation

$$
\tau_{\mathrm{int}} = \sum_k \frac{\phi(k)}{\phi(0)},
$$

called the integrated correlation time.

Two famous
resampling methods are the **independent bootstrap** and **the jackknife**.

The jackknife is a special case of the independent bootstrap. Still, the jackknife was made
popular prior to the independent bootstrap. And as the popularity of
the independent bootstrap soared, new variants, such as **the dependent bootstrap**.

The Jackknife and independent bootstrap work for independent, identically distributed random variables. If these conditions are not satisfied, the methods will fail. Yet, it should be said that if the data are independent, identically distributed, and we only want to estimate the variance of $\overline{X}$ (which often is the case), then there is no need for bootstrapping.

The Jackknife works by making many replicas of the estimator $\widehat{\theta}$. The jackknife is a resampling method, we explained that this happens by scrambling the data in some way. When using the jackknife, this is done by systematically leaving out one observation from the vector of observed values $\hat{x} = (x_1,x_2,\cdots,X_n)$. Let $\hat{x}_i$ denote the vector

$$
\hat{x}_i = (x_1,x_2,\cdots,x_{i-1},x_{i+1},\cdots,x_n),
$$

which equals the vector $\hat{x}$ with the exception that observation number $i$ is left out. Using this notation, define $\widehat{\theta}_i$ to be the estimator $\widehat{\theta}$ computed using $\vec{X}_i$.

To get an estimate for the bias and standard error of $\widehat{\theta}$, use the following estimators for each component of $\widehat{\theta}$

$$
\widehat{\mathrm{Bias}}(\widehat \theta,\theta) = (n-1)\left( - \widehat{\theta} + \frac{1}{n}\sum_{i=1}^{n} \widehat \theta_i \right) \qquad \text{and} \qquad \widehat{\sigma}^2_{\widehat{\theta} } = \frac{n-1}{n}\sum_{i=1}^{n}( \widehat{\theta}_i - \frac{1}{n}\sum_{j=1}^{n}\widehat \theta_j )^2.
$$

```
In [1]:
```from numpy import *
from numpy.random import randint, randn
from time import time
def jackknife(data, stat):
n = len(data);t = zeros(n); inds = arange(n); t0 = time()
## 'jackknifing' by leaving out an observation for each i
for i in range(n):
t[i] = stat(delete(data,i) )
# analysis
print("Runtime: %g sec" % (time()-t0)); print("Jackknife Statistics :")
print("original bias std. error")
print("%8g %14g %15g" % (stat(data),(n-1)*mean(t)/n, (n*var(t))**.5))
return t
# Returns mean of data samples
def stat(data):
return mean(data)
mu, sigma = 100, 15
datapoints = 10000
x = mu + sigma*random.randn(datapoints)
# jackknife returns the data sample
t = jackknife(x, stat)

Bootstrapping is a nonparametric approach to statistical inference that substitutes computation for more traditional distributional assumptions and asymptotic results. Bootstrapping offers a number of advantages:

The bootstrap is quite general, although there are some cases in which it fails.

Because it does not require distributional assumptions (such as normally distributed errors), the bootstrap can provide more accurate inferences when the data are not well behaved or when the sample size is small.

It is possible to apply the bootstrap to statistics with sampling distributions that are difficult to derive, even asymptotically.

It is relatively simple to apply the bootstrap to complex data-collection plans (such as stratified and clustered samples).

Since $\widehat{\theta} = \widehat{\theta}(\hat{X})$ is a function of random variables, $\widehat{\theta}$ itself must be a random variable. Thus it has a pdf, call this function $p(\hat{t})$. The aim of the bootstrap is to estimate $p(\hat{t})$ by the relative frequency of $\widehat{\theta}$. You can think of this as using a histogram in the place of $p(\hat{t})$. If the relative frequency closely resembles $p(\vec{t})$, then using numerics, it is straight forward to estimate all the interesting parameters of $p(\hat{t})$ using point estimators.

In the case that $\widehat{\theta}$ has more than one component, and the components are independent, we use the same estimator on each component separately. If the probability density function of $X_i$, $p(x)$, had been known, then it would have been straight forward to do this by:

Drawing lots of numbers from $p(x)$, suppose we call one such set of numbers $(X_1^*, X_2^*, \cdots, X_n^*)$.

Then using these numbers, we could compute a replica of $\widehat{\theta}$ called $\widehat{\theta}^*$.

By repeated use of (1) and (2), many estimates of $\widehat{\theta}$ could have been obtained. The idea is to use the relative frequency of $\widehat{\theta}^*$ (think of a histogram) as an estimate of $p(\hat{t})$.

But unless there is enough information available about the process that generated $X_1,X_2,\cdots,X_n$, $p(x)$ is in general unknown. Therefore, Efron in 1979 asked the question: What if we replace $p(x)$ by the relative frequency of the observation $X_i$; if we draw observations in accordance with the relative frequency of the observations, will we obtain the same result in some asymptotic sense? The answer is yes.

Instead of generating the histogram for the relative frequency of the observation $X_i$, just draw the values $(X_1^*,X_2^*,\cdots,X_n^*)$ with replacement from the vector $\hat{X}$.

The independent bootstrap works like this:

Draw with replacement $n$ numbers for the observed variables $\hat{x} = (x_1,x_2,\cdots,x_n)$.

Define a vector $\hat{x}^*$ containing the values which were drawn from $\hat{x}$.

Using the vector $\hat{x}^*$ compute $\widehat{\theta}^*$ by evaluating $\widehat \theta$ under the observations $\hat{x}^*$.

Repeat this process $k$ times.

When you are done, you can draw a histogram of the relative frequency of $\widehat \theta^*$. This is your estimate of the probability distribution $p(t)$. Using this probability distribution you can estimate any statistics thereof. In principle you never draw the histogram of the relative frequency of $\widehat{\theta}^*$. Instead you use the estimators corresponding to the statistic of interest. For example, if you are interested in estimating the variance of $\widehat \theta$, apply the etsimator $\widehat \sigma^2$ to the values $\widehat \theta ^*$.

The following code starts with a Gaussian distribution with mean value $\mu =100$ and variance $\sigma=15$. We use this to generate the data used in the bootstrap analysis. The bootstrap analysis returns a data set after a given number of bootstrap operations (as many as we have data points). This data set consists of estimated mean values for each bootstrap operation. The histogram generated by the bootstrap method shows that the distribution for these mean values is also a Gaussian, centered around the mean value $\mu=100$ but with standard deviation $\sigma/\sqrt{n}$, where $n$ is the number of bootstrap samples (in this case the same as the number of original data points). The value of the standard deviation is what we expect from the central limit theorem.

```
In [2]:
```%matplotlib inline
from numpy import *
from numpy.random import randint, randn
from time import time
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
# Returns mean of bootstrap samples
def stat(data):
return mean(data)
# Bootstrap algorithm
def bootstrap(data, statistic, R):
t = zeros(R); n = len(data); inds = arange(n); t0 = time()
# non-parametric bootstrap
for i in range(R):
t[i] = statistic(data[randint(0,n,n)])
# analysis
print("Runtime: %g sec" % (time()-t0)); print("Bootstrap Statistics :")
print("original bias std. error")
print("%8g %8g %14g %15g" % (statistic(data), std(data),\
mean(t), \
std(t)))
return t
mu, sigma = 100, 15
datapoints = 10000
x = mu + sigma*random.randn(datapoints)
# bootstrap returns the data sample t = bootstrap(x, stat, datapoints)
# the histogram of the bootstrapped data n, binsboot, patches = plt.hist(t, 50, normed=1, facecolor='red', alpha=0.75)
# add a 'best fit' line
y = mlab.normpdf( binsboot, mean(t), std(t))
lt = plt.plot(binsboot, y, 'r--', linewidth=1)
plt.xlabel('Smarts')
plt.ylabel('Probability')
plt.axis([99.5, 100.6, 0, 3.0])
plt.grid(True)
plt.show()

The blocking method was made popular by Flyvbjerg and Pedersen (1989) and has become one of the standard ways to estimate $V(\widehat{\theta})$ for exactly one $\widehat{\theta}$, namely $\widehat{\theta} = \overline{X}$.

Assume $n = 2^d$ for some integer $d>1$ and $X_1,X_2,\cdots, X_n$ is a stationary time series to begin with. Moreover, assume that the time series is asymptotically uncorrelated. We switch to vector notation by arranging $X_1,X_2,\cdots,X_n$ in an $n$-tuple. Define:

$$
\begin{align*}
\hat{X} = (X_1,X_2,\cdots,X_n).
\end{align*}
$$

The strength of the blocking method is when the number of observations, $n$ is large. For large $n$, the complexity of dependent bootstrapping scales poorly, but the blocking method does not, moreover, it becomes more accurate the larger $n$ is.

We now define blocking transformations. The idea is to take the mean of subsequent pair of elements from $\vec{X}$ and form a new vector $\vec{X}_1$. Continuing in the same way by taking the mean of subsequent pairs of elements of $\vec{X}_1$ we obtain $\vec{X}_2$, and so on. Define $\vec{X}_i$ recursively by:

$$
(\vec{X}_0)_k \equiv (\vec{X})_k \nonumber
$$

$$
\begin{equation}
(\vec{X}_{i+1})_k \equiv \frac{1}{2}\Big( (\vec{X}_i)_{2k-1} +
(\vec{X}_i)_{2k} \Big) \qquad \text{for all} \qquad 1 \leq i \leq d-1
\label{_auto10} \tag{22}
\end{equation}
$$

The quantity $\vec{X}_k$ is
subject to $k$ **blocking transformations**. We now have $d$ vectors
$\vec{X}_0, \vec{X}_1,\cdots,\vec X_{d-1}$ containing the subsequent
averages of observations. It turns out that if the components of
$\vec{X}$ is a stationary time series, then the components of
$\vec{X}_i$ is a stationary time series for all $0 \leq i \leq d-1$

We can then compute the autocovariance, the variance, sample mean, and number of observations for each $i$. Let $\gamma_i, \sigma_i^2, \overline{X}_i$ denote the autocovariance, variance and average of the elements of $\vec{X}_i$ and let $n_i$ be the number of elements of $\vec{X}_i$. It follows by induction that $n_i = n/2^i$.

Using the definition of the blocking transformation and the distributive property of the covariance, it is clear that since $h =|i-j|$ we can define

$$
\gamma_{k+1}(h) = cov\left( ({X}_{k+1})_{i}, ({X}_{k+1})_{j} \right) \nonumber
$$

$$
= \frac{1}{4}cov\left( ({X}_{k})_{2i-1} + ({X}_{k})_{2i}, ({X}_{k})_{2j-1} + ({X}_{k})_{2j} \right) \nonumber
$$

$$
\begin{equation}
= \frac{1}{2}\gamma_{k}(2h) + \frac{1}{2}\gamma_k(2h+1) \hspace{0.1cm} \mathrm{h = 0}
\label{_auto11} \tag{23}
\end{equation}
$$

$$
\begin{equation}
=\frac{1}{4}\gamma_k(2h-1) + \frac{1}{2}\gamma_k(2h) + \frac{1}{4}\gamma_k(2h+1) \quad \mathrm{else}
\label{_auto12} \tag{24}
\end{equation}
$$

$$
\begin{equation}
V(\overline{X}_k) = \frac{\sigma_k^2}{n_k} + \underbrace{\frac{2}{n_k} \sum_{h=1}^{n_k-1}\left( 1 - \frac{h}{n_k} \right)\gamma_k(h)}_{\equiv e_k} = \frac{\sigma^2_k}{n_k} + e_k \quad \text{if} \quad \gamma_k(0) = \sigma_k^2.
\label{_auto13} \tag{25}
\end{equation}
$$

The term $e_k$ is called the **truncation error**:

$$
\begin{equation}
e_k = \frac{2}{n_k} \sum_{h=1}^{n_k-1}\left( 1 - \frac{h}{n_k} \right)\gamma_k(h).
\label{_auto14} \tag{26}
\end{equation}
$$

$$
n_{j+1} \overline{X}_{j+1} = \sum_{i=1}^{n_{j+1}} (\hat{X}_{j+1})_i = \frac{1}{2}\sum_{i=1}^{n_{j}/2} (\hat{X}_{j})_{2i-1} + (\hat{X}_{j})_{2i} \nonumber
$$

$$
\begin{equation}
= \frac{1}{2}\left[ (\hat{X}_j)_1 + (\hat{X}_j)_2 + \cdots + (\hat{X}_j)_{n_j} \right] = \underbrace{\frac{n_j}{2}}_{=n_{j+1}} \overline{X}_j = n_{j+1}\overline{X}_j.
\label{_auto15} \tag{27}
\end{equation}
$$

$$
\begin{equation}
V(\overline{X}) = \frac{\sigma_k^2}{n_k} + e_k \qquad \text{for all} \qquad 0 \leq k \leq d-1. \label{eq:convergence} \tag{28}
\end{equation}
$$

Fyvbjerg and Petersen demonstrated that the sequence $\{e_k\}_{k=0}^{d-1}$ is decreasing, and conjecture that the term $e_k$ can be made as small as we would like by making $k$ (and hence $d$) sufficiently large. The sequence is decreasing (Master of Science thesis by Marius Jonsson, UiO 2018). It means we can apply blocking transformations until $e_k$ is sufficiently small, and then estimate $V(\overline{X})$ by $\widehat{\sigma}^2_k/n_k$.

```
In [3]:
```from sys import argv
from os import mkdir, path
import time
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import FormatStrFormatter
from matplotlib.font_manager import FontProperties
# Timing Decorator
def timeFunction(f):
def wrap(*args):
time1 = time.time()
ret = f(*args)
time2 = time.time()
print '%s Function Took: \t %0.3f s' % (f.func_name.title(), (time2-time1))
return ret
return wrap
class dataAnalysisClass:
# General Init functions
def __init__(self, fileName, size=0):
self.inputFileName = fileName
self.loadData(size)
self.createOutputFolder()
self.avg = np.average(self.data)
self.var = np.var(self.data)
self.std = np.std(self.data)
def loadData(self, size=0):
if size != 0:
with open(self.inputFileName) as inputFile:
self.data = np.zeros(size)
for x in xrange(size):
self.data[x] = float(next(inputFile))
else:
self.data = np.loadtxt(self.inputFileName)
# Statistical Analysis with Multiple Methods
def runAllAnalyses(self):
if len(self.data) <= 100000:
print "Autocorrelation..."
self.autocorrelation()
print "Bootstrap..."
self.bootstrap()
print "Jackknife..."
self.jackknife()
print "Blocking..."
self.blocking()
# Standard Autocorrelation
@timeFunction
def autocorrelation(self):
self.acf = np.zeros(len(self.data)/2)
for k in range(0, len(self.data)/2):
self.acf[k] = np.corrcoef(np.array([self.data[0:len(self.data)-k], \
self.data[k:len(self.data)]]))[0,1]
# Bootstrap
@timeFunction
def bootstrap(self, nBoots = 1000):
bootVec = np.zeros(nBoots)
for k in range(0,nBoots):
bootVec[k] = np.average(np.random.choice(self.data, len(self.data)))
self.bootAvg = np.average(bootVec)
self.bootVar = np.var(bootVec)
self.bootStd = np.std(bootVec)
# Jackknife
@timeFunction
def jackknife(self):
jackknVec = np.zeros(len(self.data))
for k in range(0,len(self.data)):
jackknVec[k] = np.average(np.delete(self.data, k))
self.jackknAvg = self.avg - (len(self.data) - 1) * (np.average(jackknVec) - self.avg)
self.jackknVar = float(len(self.data) - 1) * np.var(jackknVec)
self.jackknStd = np.sqrt(self.jackknVar)
# Blocking
@timeFunction
def blocking(self, blockSizeMax = 500):
blockSizeMin = 1
self.blockSizes = []
self.meanVec = []
self.varVec = []
for i in range(blockSizeMin, blockSizeMax):
if(len(self.data) % i != 0):
pass#continue
blockSize = i
meanTempVec = []
varTempVec = []
startPoint = 0
endPoint = blockSize
while endPoint <= len(self.data):
meanTempVec.append(np.average(self.data[startPoint:endPoint]))
startPoint = endPoint
endPoint += blockSize
mean, var = np.average(meanTempVec), np.var(meanTempVec)/len(meanTempVec)
self.meanVec.append(mean)
self.varVec.append(var)
self.blockSizes.append(blockSize)
self.blockingAvg = np.average(self.meanVec[-200:])
self.blockingVar = (np.average(self.varVec[-200:]))
self.blockingStd = np.sqrt(self.blockingVar)
# Plot of Data, Autocorrelation Function and Histogram
def plotAll(self):
self.createOutputFolder()
if len(self.data) <= 100000:
self.plotAutocorrelation()
self.plotData()
self.plotHistogram()
self.plotBlocking()
# Create Output Plots Folder
def createOutputFolder(self):
self.outName = self.inputFileName[:-4]
if not path.exists(self.outName):
mkdir(self.outName)
# Plot the Dataset, Mean and Std
def plotData(self):
# Far away plot
font = {'fontname':'serif'}
plt.plot(range(0, len(self.data)), self.data, 'r-', linewidth=1)
plt.plot([0, len(self.data)], [self.avg, self.avg], 'b-', linewidth=1)
plt.plot([0, len(self.data)], [self.avg + self.std, self.avg + self.std], 'g--', linewidth=1)
plt.plot([0, len(self.data)], [self.avg - self.std, self.avg - self.std], 'g--', linewidth=1)
plt.ylim(self.avg - 5*self.std, self.avg + 5*self.std)
plt.gca().yaxis.set_major_formatter(FormatStrFormatter('%.4f'))
plt.xlim(0, len(self.data))
plt.ylabel(self.outName.title() + ' Monte Carlo Evolution', **font)
plt.xlabel('MonteCarlo History', **font)
plt.title(self.outName.title(), **font)
plt.savefig(self.outName + "/data.eps")
plt.savefig(self.outName + "/data.png")
plt.clf()
# Plot Histogram of Dataset and Gaussian around it
def plotHistogram(self):
binNumber = 50
font = {'fontname':'serif'}
count, bins, ignore = plt.hist(self.data, bins=np.linspace(self.avg - 5*self.std, self.avg + 5*self.std, binNumber))
plt.plot([self.avg, self.avg], [0,np.max(count)+10], 'b-', linewidth=1)
plt.ylim(0,np.max(count)+10)
plt.ylabel(self.outName.title() + ' Histogram', **font)
plt.xlabel(self.outName.title() , **font)
plt.title('Counts', **font)
#gaussian
norm = 0
for i in range(0,len(bins)-1):
norm += (bins[i+1]-bins[i])*count[i]
plt.plot(bins, norm/(self.std * np.sqrt(2 * np.pi)) * np.exp( - (bins - self.avg)**2 / (2 * self.std**2) ), linewidth=1, color='r')
plt.savefig(self.outName + "/hist.eps")
plt.savefig(self.outName + "/hist.png")
plt.clf()
# Plot the Autocorrelation Function
def plotAutocorrelation(self):
font = {'fontname':'serif'}
plt.plot(range(1, len(self.data)/2), self.acf[1:], 'r-')
plt.ylim(-1, 1)
plt.xlim(0, len(self.data)/2)
plt.ylabel('Autocorrelation Function', **font)
plt.xlabel('Lag', **font)
plt.title('Autocorrelation', **font)
plt.savefig(self.outName + "/autocorrelation.eps")
plt.savefig(self.outName + "/autocorrelation.png")
plt.clf()
def plotBlocking(self):
font = {'fontname':'serif'}
plt.plot(self.blockSizes, self.varVec, 'r-')
plt.ylabel('Variance', **font)
plt.xlabel('Block Size', **font)
plt.title('Blocking', **font)
plt.savefig(self.outName + "/blocking.eps")
plt.savefig(self.outName + "/blocking.png")
plt.clf()
# Print Stuff to the Terminal
def printOutput(self):
print "\nSample Size: \t", len(self.data)
print "\n=========================================\n"
print "Sample Average: \t", self.avg
print "Sample Variance:\t", self.var
print "Sample Std: \t", self.std
print "\n=========================================\n"
print "Bootstrap Average: \t", self.bootAvg
print "Bootstrap Variance:\t", self.bootVar
print "Bootstrap Error: \t", self.bootStd
print "\n=========================================\n"
print "Jackknife Average: \t", self.jackknAvg
print "Jackknife Variance:\t", self.jackknVar
print "Jackknife Error: \t", self.jackknStd
print "\n=========================================\n"
print "Blocking Average: \t", self.blockingAvg
print "Blocking Variance:\t", self.blockingVar
print "Blocking Error: \t", self.blockingStd, "\n"
# Initialize the class
if len(argv) > 2:
dataAnalysis = dataAnalysisClass(argv[1], int(argv[2]))
else:
dataAnalysis = dataAnalysisClass(argv[1])
# Run Analyses
dataAnalysis.runAllAnalyses()
# Plot the data
dataAnalysis.plotAll()
# Print Some Output
dataAnalysis.printOutput()