Data Analysis and Machine Learning


Morten Hjorth-Jensen, Department of Physics, University of Oslo and Department of Physics and Astronomy and National Superconducting Cyclotron Laboratory, Michigan State University

Date: Jan 27, 2018

Copyright 1999-2018, Morten Hjorth-Jensen. Released under CC Attribution-NonCommercial 4.0 license

What is Machine Learning?

Machine learning is the science of giving computers the ability to learn without being explicitly programmed. The idea is that there exist generic algorithms which can be used to find patterns in a broad class of data sets without having to write code specifically for each problem. The algorithm will build its own logic based on the data.

Machine learning is a subfield of computer science, and is closely related to computational statistics. It evolved from the study of pattern recognition in artificial intelligence (AI) research, and has made contributions to AI tasks like computer vision, natural language processing and speech recognition. It has also, especially in later years, found applications in a wide variety of other areas, including bioinformatics, economy, physics, finance and marketing.

Types of Machine Learning

The approaches to machine learning are many, but are often split into two main categories. In supervised learning we know the answer to a problem, and let the computer deduce the logic behind it. On the other hand, unsupervised learning is a method for finding patterns and relationship in data sets without any prior knowledge of the system. Some authours also operate with a third category, namely reinforcement learning. This is a paradigm of learning inspired by behavioural psychology, where learning is achieved by trial-and-error, solely from rewards and punishment.

Another way to categorize machine learning tasks is to consider the desired output of a system. Some of the most common tasks are:

  • Classification: Outputs are divided into two or more classes. The goal is to produce a model that assigns inputs into one of these classes. An example is to identify digits based on pictures of hand-written ones. Classification is typically supervised learning.

  • Regression: Finding a functional relationship between an input data set and a reference data set. The goal is to construct a function that maps input data to continuous output values.

  • Clustering: Data are divided into groups with certain common traits, without knowing the different groups beforehand. It is thus a form of unsupervised learning.

Different algorithms

In this course we will build our machine learning approach on a statistical foundation, with elements from data analysis, stochastic processes etc before we proceed with the following machine learning algorithms

  1. Linear regression and its variants

  2. Decision tree algorithms, from simpler to more complex ones

  3. Nearest neighbors models

  4. Bayesian statistics

  5. Support vector machines and finally various variants of

  6. Artifical neural networks

Before we proceed however, there are several practicalities with data analysis and software tools we would like to present. These tools will help us in our understanding of various machine learning algorithms.

Our emphasis here is on understanding the mathematical aspects of different algorithms, however, where possible we will emphasize the importance of using available software.

Software and needed installations

We will make intensive use of python as programming language and the myriad of available libraries. Furthermore, you will find IPython/Jupyter notebooks invaluable in your work. You can run R codes in the Jupyter/IPython notebooks, with the immediate benefit of visualizing your data.

If you have Python installed (we recommend Python3) and you feel pretty familiar with installing different packages, we recommend that you install the following Python packages via pip as

  1. pip install numpy scipy matplotlib ipython scikit-learn mglearn sympy pandas pillow

For Python3, replace pip with pip3.

For OSX users we recommend also, after having installed Xcode, to install brew. Brew allows for a seamless installation of additional software via for example

  1. brew install python3

For Linux users, with its variety of distributions like for example the widely popular Ubuntu distribution you can use pip as well and simply install Python as

  1. sudo apt-get install python3 (or python for pyhton2.7)

etc etc.

Python installers

If you don't want to perform these operations separately, we recommend two widely used distrubutions which set up all relevant dependencies for Python, namely

  1. Anaconda Anaconda is an open source distribution of the Python and R programming languages for large-scale data processing, predictive analytics, and scientific computing, that aims to simplify package management and deployment. Package versions are managed by the package management system conda

  2. Enthought canopy is a Python distribution for scientific and analytic computing distribution and analysis environment, available for free and under a commercial license.

Installing R, C++, cython or Julia

You will also find it convenient to utilize R. Jupyter/Ipython notebook allows you run R code interactively in your browser. The software library R is tuned to statistically analysis and allows for an easy usage of the tools we will discuss in these texts.

To install R with Jupyter notebook following the link here

Installing R, C++, cython or Julia

For the C++ affecianodas, Jupyter/IPython notebook allows you also to install C++ and run codes written in this language interactively in the browser. Since we will emphasize writing many of the algorithms yourself, you can thus opt for either Python or C++ as programming languages.

To add more entropy, cython can also be used when running your notebooks. It means that Python with the Jupyter/IPython notebook setup allows you to integrate widely popular softwares and tools for scientific computing. With its versatility, including symbolic operations, Python offers a unique computational environment. Your Jupyter/IPython notebook can easily be converted into a nicely rendered PDF file or a Latex file for further processing. For example, convert to latex as


In [1]:
jupyter nbconvert filename.ipynb --to latex

If you use the light mark-up language doconce you can convert a standard ascii text file into various HTML formats, ipython notebooks, latex files, pdf files etc.

Introduction to Jupyter notebook and available tools


In [2]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
from scipy import sparse
import pandas as pd
from IPython.display import display
eye = np.eye(4)
print(eye)
sparse_mtx = sparse.csr_matrix(eye)
print(sparse_mtx)
x = np.linspace(-10,10,100)
y = np.sin(x)
plt.plot(x,y,marker='x')
plt.show()
data = {'Name': ["John", "Anna", "Peter", "Linda"], 'Location': ["Nairobi", "Napoli", "London", "Buenos Aires"], 'Age':[51, 21, 34, 45]}
data_pandas = pd.DataFrame(data)
display(data_pandas)

Representing data, more examples


In [3]:
import numpy as np
import matplotlib.pyplot as plt
from scipy import sparse
import pandas as pd
from IPython.display import display
import mglearn
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
x, y = mglearn.datasets.make_wave(n_samples=100)
line = np.linspace(-3,3,1000,endpoint=False).reshape(-1,1)
reg = DecisionTreeRegressor(min_samples_split=3).fit(x,y)
plt.plot(line, reg.predict(line), label="decision tree")
regline = LinearRegression().fit(x,y)
plt.plot(line, regline.predict(line), label= "Linear Rgression")
plt.show()

Predator-Prey model from ecology

The population dynamics of a simple predator-prey system is a classical example shown in many biology textbooks when ecological systems are discussed. The system contains all elements of the scientific method:

  • The set up of a specific hypothesis combined with

  • the experimental methods needed (one can study existing data or perform experiments)

  • analyzing and interpreting the data and performing further experiments if needed

  • trying to extract general behaviors and extract eventual laws or patterns

  • develop mathematical relations for the uncovered regularities/laws and test these by per forming new experiments

Case study from Hudson bay

Lots of data about populations of hares and lynx collected from furs in Hudson Bay, Canada, are available. It is known that the populations oscillate. Why? Here we start by

  1. plotting the data

  2. derive a simple model for the population dynamics

  3. (fitting parameters in the model to the data)

  4. using the model predict the evolution other predator-pray systems

Hudson bay data

Most mammalian predators rely on a variety of prey, which complicates mathematical modeling; however, a few predators have become highly specialized and seek almost exclusively a single prey species. An example of this simplified predator-prey interaction is seen in Canadian northern forests, where the populations of the lynx and the snowshoe hare are intertwined in a life and death struggle.

One reason that this particular system has been so extensively studied is that the Hudson Bay company kept careful records of all furs from the early 1800s into the 1900s. The records for the furs collected by the Hudson Bay company showed distinct oscillations (approximately 12 year periods), suggesting that these species caused almost periodic fluctuations of each other's populations. The table here shows data from 1900 to 1920.

Year Hares (x1000) Lynx (x1000)
1900 30.0 4.0
1901 47.2 6.1
1902 70.2 9.8
1903 77.4 35.2
1904 36.3 59.4
1905 20.6 41.7
1906 18.1 19.0
1907 21.4 13.0
1908 22.0 8.3
1909 25.4 9.1
1910 27.1 7.4
1911 40.3 8.0
1912 57 12.3
1913 76.6 19.5
1914 52.3 45.7
1915 19.5 51.1
1916 11.2 29.7
1917 7.6 15.8
1918 14.6 9.7
1919 16.2 10.1
1920 24.7 8.6

Plotting the data


In [4]:
import numpy as np
from  matplotlib import pyplot as plt

# Load in data file
data = np.loadtxt('src/Hudson_Bay.csv', delimiter=',', skiprows=1)
# Make arrays containing x-axis and hares and lynx populations
year = data[:,0]
hares = data[:,1]
lynx = data[:,2]

plt.plot(year, hares ,'b-+', year, lynx, 'r-o')
plt.axis([1900,1920,0, 100.0])
plt.xlabel(r'Year')
plt.ylabel(r'Numbers of hares and lynx ')
plt.legend(('Hares','Lynx'), loc='upper right')
plt.title(r'Population of hares and lynx from 1900-1920 (x1000)}')
plt.savefig('Hudson_Bay_data.pdf')
plt.savefig('Hudson_Bay_data.png')
plt.show()

Hares and lynx in Hudson bay from 1900 to 1920

Why now create a computer model for the hare and lynx populations?

We see from the plot that there are indeed fluctuations. We would like to create a mathematical model that explains these population fluctuations. Ecologists have predicted that in a simple predator-prey system that a rise in prey population is followed (with a lag) by a rise in the predator population. When the predator population is sufficiently high, then the prey population begins dropping. After the prey population falls, then the predator population falls, which allows the prey population to recover and complete one cycle of this interaction. Thus, we see that qualitatively oscillations occur. Can a mathematical model predict this? What causes cycles to slow or speed up? What affects the amplitude of the oscillation or do you expect to see the oscillations damp to a stable equilibrium? The models tend to ignore factors like climate and other complicating factors. How significant are these?

  • We see oscillations in the data

  • What causes cycles to slow or speed up?

  • What affects the amplitude of the oscillation or do you expect to see the oscillations damp to a stable equilibrium?

  • With a model we can better understand the data

  • More important: we can understand the ecology dynamics of predator-pray populations

The traditional (top-down) approach

The classical way (in all books) is to present the Lotka-Volterra equations:

$$ \begin{align*} \frac{dH}{dt} &= H(a - b L)\\ \frac{dL}{dt} &= - L(d - c H) \end{align*} $$

Here,

  • $H$ is the number of preys

  • $L$ the number of predators

  • $a$, $b$, $d$, $c$ are parameters

Most books quickly establish the model and then use considerable space on discussing the qualitative properties of this nonlinear system of ODEs (which cannot be solved)

Basic mathematics notation

  • Time points: $t_0,t_1,\ldots,t_m$

  • Uniform distribution of time points: $t_n=n\Delta t$

  • $H^n$: population of hares at time $t_n$

  • $L^n$: population of lynx at time $t_n$

  • We want to model the changes in populations, $\Delta H=H^{n+1}-H^n$ and $\Delta L=L^{n+1}-L^n$ during a general time interval $[t_{n+1},t_n]$ of length $\Delta t=t_{n+1}-t_n$

Basic dynamics of the population of hares

The population of hares evolves due to births and deaths exactly as a bacteria population:

$$ \Delta H = a \Delta t H^n $$

However, hares have an additional loss in the population because they are eaten by lynx. All the hares and lynx can form $H\cdot L$ pairs in total. When such pairs meet during a time interval $\Delta t$, there is some small probablity that the lynx will eat the hare. So in fraction $b\Delta t HL$, the lynx eat hares. This loss of hares must be accounted for. Subtracted in the equation for hares:

$$ \Delta H = a\Delta t H^n - b \Delta t H^nL^n $$

Basic dynamics of the population of lynx

We assume that the primary growth for the lynx population depends on sufficient food for raising lynx kittens, which implies an adequate source of nutrients from predation on hares. Thus, the growth of the lynx population does not only depend of how many lynx there are, but on how many hares they can eat. In a time interval $\Delta t HL$ hares and lynx can meet, and in a fraction $b\Delta t HL$ the lynx eats the hare. All of this does not contribute to the growth of lynx, again just a fraction of $b\Delta t HL$ that we write as $d\Delta t HL$. In addition, lynx die just as in the population dynamics with one isolated animal population, leading to a loss $-c\Delta t L$.

The accounting of lynx then looks like

$$ \Delta L = d\Delta t H^nL^n - c\Delta t L^n $$

Evolution equations

By writing up the definition of $\Delta H$ and $\Delta L$, and putting all assumed known terms $H^n$ and $L^n$ on the right-hand side, we have

$$ H^{n+1} = H^n + a\Delta t H^n - b\Delta t H^n L^n $$
$$ L^{n+1} = L^n + d\Delta t H^nL^n - c\Delta t L^n $$

Note:

  • These equations are ready to be implemented!

  • But to start, we need $H^0$ and $L^0$ (which we can get from the data)

  • We also need values for $a$, $b$, $d$, $c$

Adapt the model to the Hudson Bay case

  • As always, models tend to be general - as here, applicable to "all" predator-pray systems

  • The critical issue is whether the interaction between hares and lynx is sufficiently well modeled by $\hbox{const}HL$

  • The parameters $a$, $b$, $d$, and $c$ must be estimated from data

  • Measure time in years

  • $t_0=1900$, $t_m=1920$

The program


In [5]:
import numpy as np
import matplotlib.pyplot as plt

def solver(m, H0, L0, dt, a, b, c, d, t0):
    """Solve the difference equations for H and L over m years
    with time step dt (measured in years."""

    num_intervals = int(m/float(dt))
    t = np.linspace(t0, t0 + m, num_intervals+1)
    H = np.zeros(t.size)
    L = np.zeros(t.size)

    print('Init:', H0, L0, dt)
    H[0] = H0
    L[0] = L0

    for n in range(0, len(t)-1):
        H[n+1] = H[n] + a*dt*H[n] - b*dt*H[n]*L[n]
        L[n+1] = L[n] + d*dt*H[n]*L[n] - c*dt*L[n]
    return H, L, t

# Load in data file
data = np.loadtxt('src/Hudson_Bay.csv', delimiter=',', skiprows=1)
# Make arrays containing x-axis and hares and lynx populations
t_e = data[:,0]
H_e = data[:,1]
L_e = data[:,2]

# Simulate using the model
H, L, t = solver(m=20, H0=34.91, L0=3.857, dt=0.1,
                 a=0.4807, b=0.02482, c=0.9272, d=0.02756,
                 t0=1900)

# Visualize simulations and data
plt.plot(t_e, H_e, 'b-+', t_e, L_e, 'r-o', t, H, 'm--', t, L, 'k--')
plt.xlabel('Year')
plt.ylabel('Numbers of hares and lynx')
plt.axis([1900, 1920, 0, 140])
plt.title(r'Population of hares and lynx 1900-1920 (x1000)')
plt.legend(('H_e', 'L_e', 'H', 'L'), loc='upper left')
plt.savefig('Hudson_Bay_sim.pdf')
plt.savefig('Hudson_Bay_sim.png')
plt.show()

The plot

If we perform a least-square fitting, we can find optimal values for the parameters $a$, $b$, $d$, $c$. The optimal parameters are $a=0.4807$, $b=0.02482$, $d=0.9272$ and $c=0.02756$. These parameters result in a slightly modified initial conditions, namely $H(0) = 34.91$ and $L(0)=3.857$. With these parameters we are now ready to solve the equations and plot these data together with the experimental values.

Linear regression in Python


In [6]:
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor


data = np.loadtxt('src/Hudson_Bay.csv', delimiter=',', skiprows=1)
x = data[:,0]
y = data[:,1]
line = np.linspace(1900,1920,1000,endpoint=False).reshape(-1,1)
reg = DecisionTreeRegressor(min_samples_split=3).fit(x.reshape(-1,1),y.reshape(-1,1))
plt.plot(line, reg.predict(line), label="decision tree")
regline = LinearRegression().fit(x.reshape(-1,1),y.reshape(-1,1))
plt.plot(line, regline.predict(line), label= "Linear Regression")
plt.plot(x, y, label= "Linear Regression")
plt.show()

Linear Least squares in R

    HudsonBay = read.csv("src/Hudson_Bay.csv",header=T)
    fix(HudsonBay)
    dim(HudsonBay)
    names(HudsonBay)
    plot(HudsonBay$Year, HudsonBay$Hares..x1000.)
    attach(HudsonBay)
    plot(Year, Hares..x1000.)
    plot(Year, Hares..x1000., col="red", varwidth=T, xlab="Years", ylab="Haresx 1000")
    summary(HudsonBay)
    summary(Hares..x1000.)
    library(MASS)
    library(ISLR)
    scatter.smooth(x=Year, y = Hares..x1000.)
    linearMod = lm(Hares..x1000. ~ Year)
    print(linearMod)
    summary(linearMod)
    plot(linearMod)
    confint(linearMod)
    predict(linearMod,data.frame(Year=c(1910,1914,1920)),interval="confidence")

Non-Linear Least squares in R

    set.seed(1485)
    len = 24
    x = runif(len)
    y = x^3+rnorm(len, 0,0.06)
    ds = data.frame(x = x, y = y)
    str(ds)
    plot( y ~ x, main ="Known cubic with noise")
    s  = seq(0,1,length =100)
    lines(s, s^3, lty =2, col ="green")
    m = nls(y ~ I(x^power), data = ds, start = list(power=1), trace = T)
    class(m)
    summary(m)
    power = round(summary(m)$coefficients[1], 3)
    power.se = round(summary(m)$coefficients[2], 3)
    plot(y ~ x, main = "Fitted power model", sub = "Blue: fit; green: known")
    s = seq(0, 1, length = 100)
    lines(s, s^3, lty = 2, col = "green")
    lines(s, predict(m, list(x = s)), lty = 1, col = "blue")
    text(0, 0.5, paste("y =x^ (", power, " +/- ", power.se, ")", sep = ""), pos = 4)

Important Matrix and vector handling packages

The Numerical Recipes codes have been rewritten in Fortran 90/95 and C/C++ by us. The original source codes are taken from the widely used software package LAPACK, which follows two other popular packages developed in the 1970s, namely EISPACK and LINPACK.

  • LINPACK: package for linear equations and least square problems.

  • LAPACK:package for solving symmetric, unsymmetric and generalized eigenvalue problems. From LAPACK's website http://www.netlib.org it is possible to download for free all source codes from this library. Both C/C++ and Fortran versions are available.

  • BLAS (I, II and III): (Basic Linear Algebra Subprograms) are routines that provide standard building blocks for performing basic vector and matrix operations. Blas I is vector operations, II vector-matrix operations and III matrix-matrix operations. Highly parallelized and efficient codes, all available for download from http://www.netlib.org.

Add python material on linear algebra and array handling, text on numpy etc

Basic Matrix Features

Matrix properties reminder

$$ \mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ a_{41} & a_{42} & a_{43} & a_{44} \end{bmatrix}\qquad \mathbf{I} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} $$

Basic Matrix Features

The inverse of a matrix is defined by

$$ \mathbf{A}^{-1} \cdot \mathbf{A} = I $$

Basic Matrix Features

Matrix Properties Reminder

Relations Name matrix elements
$A = A^{T}$ symmetric $a_{ij} = a_{ji}$
$A = \left (A^{T} \right )^{-1}$ real orthogonal $\sum_k a_{ik} a_{jk} = \sum_k a_{ki} a_{kj} = \delta_{ij}$
$A = A^{ * }$ real matrix $a_{ij} = a_{ij}^{ * }$
$A = A^{\dagger}$ hermitian $a_{ij} = a_{ji}^{ * }$
$A = \left (A^{\dagger} \right )^{-1}$ unitary $\sum_k a_{ik} a_{jk}^{ * } = \sum_k a_{ki}^{ * } a_{kj} = \delta_{ij}$

Some famous Matrices

  • Diagonal if $a_{ij}=0$ for $i\ne j$

  • Upper triangular if $a_{ij}=0$ for $i > j$

  • Lower triangular if $a_{ij}=0$ for $i < j$

  • Upper Hessenberg if $a_{ij}=0$ for $i > j+1$

  • Lower Hessenberg if $a_{ij}=0$ for $i < j+1$

  • Tridiagonal if $a_{ij}=0$ for $|i -j| > 1$

  • Lower banded with bandwidth $p$: $a_{ij}=0$ for $i > j+p$

  • Upper banded with bandwidth $p$: $a_{ij}=0$ for $i < j+p$

  • Banded, block upper triangular, block lower triangular....

Basic Matrix Features

Some Equivalent Statements For an $N\times N$ matrix $\mathbf{A}$ the following properties are all equivalent

  • If the inverse of $\mathbf{A}$ exists, $\mathbf{A}$ is nonsingular.

  • The equation $\mathbf{Ax}=0$ implies $\mathbf{x}=0$.

  • The rows of $\mathbf{A}$ form a basis of $R^N$.

  • The columns of $\mathbf{A}$ form a basis of $R^N$.

  • $\mathbf{A}$ is a product of elementary matrices.

  • $0$ is not eigenvalue of $\mathbf{A}$.

Matrix Handling in C/C++, Static and Dynamical allocation

Static We have an $N\times N$ matrix A with $N=100$ In C/C++ this would be defined as

       int N = 100;
       double A[100][100];
       //   initialize all elements to zero
       for(i=0 ; i < N ; i++) {
          for(j=0 ; j < N ; j++) {
             A[i][j] = 0.0;

Note the way the matrix is organized, row-major order.

Matrix Handling in C/C++

Row Major Order, Addition We have $N\times N$ matrices A, B and C and we wish to evaluate $A=B+C$.

$$ \mathbf{A}= \mathbf{B}\pm\mathbf{C} \Longrightarrow a_{ij} = b_{ij}\pm c_{ij}, $$

In C/C++ this would be coded like

       for(i=0 ; i < N ; i++) {
          for(j=0 ; j < N ; j++) {
             a[i][j] = b[i][j]+c[i][j]

Matrix Handling in C/C++

Row Major Order, Multiplication We have $N\times N$ matrices A, B and C and we wish to evaluate $A=BC$.

$$ \mathbf{A}=\mathbf{BC} \Longrightarrow a_{ij} = \sum_{k=1}^{n} b_{ik}c_{kj}, $$

In C/C++ this would be coded like

       for(i=0 ; i < N ; i++) {
          for(j=0 ; j < N ; j++) {
             for(k=0 ; k < N ; k++) {
                a[i][j]+=b[i][k]*c[k][j];

Dynamic memory allocation in C/C++

At least three possibilities in this course

  • Do it yourself

  • Use the functions provided in the library package lib.cpp

  • Use Armadillo http://arma.sourceforgenet (a C++ linear algebra library, discussion both here and at lab).

Matrix Handling in C/C++, Dynamic Allocation

Do it yourself

    int N;
    double **  A;
    A = new double*[N]
    for ( i = 0; i < N; i++)
        A[i] = new double[N];

Always free space when you don't need an array anymore.

    for ( i = 0; i < N; i++)
        delete[] A[i];
    delete[] A;

Armadillo, recommended!!

  • Armadillo is a C++ linear algebra library (matrix maths) aiming towards a good balance between speed and ease of use. The syntax is deliberately similar to Matlab.

  • Integer, floating point and complex numbers are supported, as well as a subset of trigonometric and statistics functions. Various matrix decompositions are provided through optional integration with LAPACK, or one of its high performance drop-in replacements (such as the multi-threaded MKL or ACML libraries).

  • A delayed evaluation approach is employed (at compile-time) to combine several operations into one and reduce (or eliminate) the need for temporaries. This is accomplished through recursive templates and template meta-programming.

  • Useful for conversion of research code into production environments, or if C++ has been decided as the language of choice, due to speed and/or integration capabilities.

  • The library is open-source software, and is distributed under a license that is useful in both open-source and commercial/proprietary contexts.

Armadillo, simple examples

    #include <iostream>
    #include <armadillo>

    using namespace std;
    using namespace arma;

    int main(int argc, char** argv)
      {
      mat A = randu<mat>(5,5);
      mat B = randu<mat>(5,5);

      cout << A*B << endl;

      return 0;

Armadillo, how to compile and install

For people using Ubuntu, Debian, Linux Mint, simply go to the synaptic package manager and install armadillo from there. You may have to install Lapack as well. For Mac and Windows users, follow the instructions from the webpage http://arma.sourceforge.net. To compile, use for example (linux/ubuntu)

    c++ -O2 -o program.x program.cpp  -larmadillo -llapack -lblas

where the -l option indicates the library you wish to link to.

For OS X users you may have to declare the paths to the include files and the libraries as

    c++ -O2 -o program.x program.cpp  -L/usr/local/lib -I/usr/local/include -larmadillo -llapack -lblas

Armadillo, simple examples

    #include <iostream>
    #include "armadillo"
    using namespace arma;
    using namespace std;

    int main(int argc, char** argv)
      {
      // directly specify the matrix size (elements are uninitialised)
      mat A(2,3);
      // .n_rows = number of rows    (read only)
      // .n_cols = number of columns (read only)
      cout << "A.n_rows = " << A.n_rows << endl;
      cout << "A.n_cols = " << A.n_cols << endl;
      // directly access an element (indexing starts at 0)
      A(1,2) = 456.0;
      A.print("A:");
      // scalars are treated as a 1x1 matrix,
      // hence the code below will set A to have a size of 1x1
      A = 5.0;
      A.print("A:");
      // if you want a matrix with all elements set to a particular value
      // the .fill() member function can be used
      A.set_size(3,3);
      A.fill(5.0);  A.print("A:");

Armadillo, simple examples

      mat B;

      // endr indicates "end of row"
      B << 0.555950 << 0.274690 << 0.540605 << 0.798938 << endr
        << 0.108929 << 0.830123 << 0.891726 << 0.895283 << endr
        << 0.948014 << 0.973234 << 0.216504 << 0.883152 << endr
        << 0.023787 << 0.675382 << 0.231751 << 0.450332 << endr;

      // print to the cout stream
      // with an optional string before the contents of the matrix
      B.print("B:");

      // the << operator can also be used to print the matrix
      // to an arbitrary stream (cout in this case)
      cout << "B:" << endl << B << endl;
      // save to disk
      B.save("B.txt", raw_ascii);
      // load from disk
      mat C;
      C.load("B.txt");
      C += 2.0 * B;
      C.print("C:");

Armadillo, simple examples

      // submatrix types:
      //
      // .submat(first_row, first_column, last_row, last_column)
      // .row(row_number)
      // .col(column_number)
      // .cols(first_column, last_column)
      // .rows(first_row, last_row)

      cout << "C.submat(0,0,3,1) =" << endl;
      cout << C.submat(0,0,3,1) << endl;

      // generate the identity matrix
      mat D = eye<mat>(4,4);

      D.submat(0,0,3,1) = C.cols(1,2);
      D.print("D:");

      // transpose
      cout << "trans(B) =" << endl;
      cout << trans(B) << endl;

      // maximum from each column (traverse along rows)
      cout << "max(B) =" << endl;
      cout << max(B) << endl;

Armadillo, simple examples

      // maximum from each row (traverse along columns)
      cout << "max(B,1) =" << endl;
      cout << max(B,1) << endl;
      // maximum value in B
      cout << "max(max(B)) = " << max(max(B)) << endl;
      // sum of each column (traverse along rows)
      cout << "sum(B) =" << endl;
      cout << sum(B) << endl;
      // sum of each row (traverse along columns)
      cout << "sum(B,1) =" << endl;
      cout << sum(B,1) << endl;
      // sum of all elements
      cout << "sum(sum(B)) = " << sum(sum(B)) << endl;
      cout << "accu(B)     = " << accu(B) << endl;
      // trace = sum along diagonal
      cout << "trace(B)    = " << trace(B) << endl;
      // random matrix -- values are uniformly distributed in the [0,1] interval
      mat E = randu<mat>(4,4);
      E.print("E:");

Armadillo, simple examples

      // row vectors are treated like a matrix with one row
      rowvec r;
      r << 0.59499 << 0.88807 << 0.88532 << 0.19968;
      r.print("r:");

      // column vectors are treated like a matrix with one column
      colvec q;
      q << 0.81114 << 0.06256 << 0.95989 << 0.73628;
      q.print("q:");

      // dot or inner product
      cout << "as_scalar(r*q) = " << as_scalar(r*q) << endl;

        // outer product
      cout << "q*r =" << endl;
      cout << q*r << endl;


      // sum of three matrices (no temporary matrices are created)
      mat F = B + C + D;
      F.print("F:");

        return 0;

Armadillo, simple examples

    #include <iostream>
    #include "armadillo"
    using namespace arma;
    using namespace std;

    int main(int argc, char** argv)
      {
      cout << "Armadillo version: " << arma_version::as_string() << endl;

      mat A;

      A << 0.165300 << 0.454037 << 0.995795 << 0.124098 << 0.047084 << endr
        << 0.688782 << 0.036549 << 0.552848 << 0.937664 << 0.866401 << endr
        << 0.348740 << 0.479388 << 0.506228 << 0.145673 << 0.491547 << endr
        << 0.148678 << 0.682258 << 0.571154 << 0.874724 << 0.444632 << endr
        << 0.245726 << 0.595218 << 0.409327 << 0.367827 << 0.385736 << endr;

      A.print("A =");

      // determinant
      cout << "det(A) = " << det(A) << endl;

Armadillo, simple examples

      // inverse
      cout << "inv(A) = " << endl << inv(A) << endl;
      double k = 1.23;

      mat    B = randu<mat>(5,5);
      mat    C = randu<mat>(5,5);

      rowvec r = randu<rowvec>(5);
      colvec q = randu<colvec>(5);


      // examples of some expressions
      // for which optimised implementations exist
      // optimised implementation of a trinary expression
      // that results in a scalar
      cout << "as_scalar( r*inv(diagmat(B))*q ) = ";
      cout << as_scalar( r*inv(diagmat(B))*q ) << endl;

      // example of an expression which is optimised
      // as a call to the dgemm() function in BLAS:
      cout << "k*trans(B)*C = " << endl << k*trans(B)*C;

        return 0;

Gaussian Elimination

We start with the linear set of equations

$$ \mathbf{A}\mathbf{x} = \mathbf{w}. $$

We assume also that the matrix $\mathbf{A}$ is non-singular and that the matrix elements along the diagonal satisfy $a_{ii} \ne 0$. Simple $4\times 4 $ example

$$ \begin{bmatrix} a_{11}& a_{12} &a_{13}& a_{14}\\ a_{21}& a_{22} &a_{23}& a_{24}\\ a_{31}& a_{32} &a_{33}& a_{34}\\ a_{41}& a_{42} &a_{43}& a_{44}\\ \end{bmatrix} \begin{bmatrix} x_1\\ x_2\\ x_3 \\ x_4 \\ \end{bmatrix} =\begin{bmatrix} w_1\\ w_2\\ w_3 \\ w_4\\ \end{bmatrix}. $$

Gaussian Elimination

or

$$ a_{11}x_1 +a_{12}x_2 +a_{13}x_3 + a_{14}x_4=w_1 \nonumber $$
$$ a_{21}x_1 + a_{22}x_2 + a_{23}x_3 + a_{24}x_4=w_2 \nonumber $$
$$ a_{31}x_1 + a_{32}x_2 + a_{33}x_3 + a_{34}x_4=w_3 \nonumber $$
$$ a_{41}x_1 + a_{42}x_2 + a_{43}x_3 + a_{44}x_4=w_4. \nonumber $$

Gaussian Elimination

The basic idea of Gaussian elimination is to use the first equation to eliminate the first unknown $x_1$ from the remaining $n-1$ equations. Then we use the new second equation to eliminate the second unknown $x_2$ from the remaining $n-2$ equations. With $n-1$ such eliminations we obtain a so-called upper triangular set of equations of the form

$$ b_{11}x_1 +b_{12}x_2 +b_{13}x_3 + b_{14}x_4=y_1 \nonumber $$
$$ b_{22}x_2 + b_{23}x_3 + b_{24}x_4=y_2 \nonumber $$
$$ b_{33}x_3 + b_{34}x_4=y_3 \nonumber $$

$$ b_{44}x_4=y_4. \nonumber \label{eq:gaussbacksub} \tag{1} $$

We can solve this system of equations recursively starting from $x_n$ (in our case $x_4$) and proceed with what is called a backward substitution.

Gaussian Elimination

This process can be expressed mathematically as

$$ \begin{equation} x_m = \frac{1}{b_{mm}}\left(y_m-\sum_{k=m+1}^nb_{mk}x_k\right)\quad m=n-1,n-2,\dots,1. \label{_auto1} \tag{2} \end{equation} $$

To arrive at such an upper triangular system of equations, we start by eliminating the unknown $x_1$ for $j=2,n$. We achieve this by multiplying the first equation by $a_{j1}/a_{11}$ and then subtract the result from the $j$th equation. We assume obviously that $a_{11}\ne 0$ and that $\mathbf{A}$ is not singular.

Gaussian Elimination

Our actual $4\times 4$ example reads after the first operation

$$ \begin{bmatrix} a_{11}& a_{12} &a_{13}& a_{14}\\ 0& (a_{22}-\frac{a_{21}a_{12}}{a_{11}}) &(a_{23}-\frac{a_{21}a_{13}}{a_{11}}) & (a_{24}-\frac{a_{21}a_{14}}{a_{11}})\\ 0& (a_{32}-\frac{a_{31}a_{12}}{a_{11}})& (a_{33}-\frac{a_{31}a_{13}}{a_{11}})& (a_{34}-\frac{a_{31}a_{14}}{a_{11}})\\ 0&(a_{42}-\frac{a_{41}a_{12}}{a_{11}}) &(a_{43}-\frac{a_{41}a_{13}}{a_{11}}) & (a_{44}-\frac{a_{41}a_{14}}{a_{11}}) \\ \end{bmatrix} \begin{bmatrix} x_1\\ x_2\\ x_3 \\ x_4 \\ \end{bmatrix} =\begin{bmatrix} y_1\\ w_2^{(2)}\\ w_3^{(2)} \\ w_4^{(2)}\\ \end{bmatrix}, $$

or

$$ b_{11}x_1 +b_{12}x_2 +b_{13}x_3 + b_{14}x_4=y_1 \nonumber $$
$$ a^{(2)}_{22}x_2 + a^{(2)}_{23}x_3 + a^{(2)}_{24}x_4=w^{(2)}_2 \nonumber $$
$$ a^{(2)}_{32}x_2 + a^{(2)}_{33}x_3 + a^{(2)}_{34}x_4=w^{(2)}_3 \nonumber $$
$$ a^{(2)}_{42}x_2 + a^{(2)}_{43}x_3 + a^{(2)}_{44}x_4=w^{(2)}_4, \nonumber $$

$$ \begin{equation} \label{_auto2} \tag{3} \end{equation} $$

Gaussian Elimination

The new coefficients are

$$ \begin{equation} b_{1k} = a_{1k}^{(1)} \quad k=1,\dots,n, \label{_auto3} \tag{4} \end{equation} $$

where each $a_{1k}^{(1)}$ is equal to the original $a_{1k}$ element. The other coefficients are

$$ \begin{equation} a_{jk}^{(2)} = a_{jk}^{(1)}-\frac{a_{j1}^{(1)}a_{1k}^{(1)}}{a_{11}^{(1)}} \quad j,k=2,\dots,n, \label{_auto4} \tag{5} \end{equation} $$

with a new right-hand side given by

$$ \begin{equation} y_{1}=w_1^{(1)}, \quad w_j^{(2)} =w_j^{(1)}-\frac{a_{j1}^{(1)}w_1^{(1)}}{a_{11}^{(1)}} \quad j=2,\dots,n. \label{_auto5} \tag{6} \end{equation} $$

We have also set $w_1^{(1)}=w_1$, the original vector element. We see that the system of unknowns $x_1,\dots,x_n$ is transformed into an $(n-1)\times (n-1)$ problem.

Gaussian Elimination

This step is called forward substitution. Proceeding with these substitutions, we obtain the general expressions for the new coefficients

$$ \begin{equation} a_{jk}^{(m+1)} = a_{jk}^{(m)}-\frac{a_{jm}^{(m)}a_{mk}^{(m)}}{a_{mm}^{(m)}} \quad j,k=m+1,\dots,n, \label{_auto6} \tag{7} \end{equation} $$

with $m=1,\dots,n-1$ and a right-hand side given by

$$ \begin{equation} w_j^{(m+1)} =w_j^{(m)}-\frac{a_{jm}^{(m)}w_m^{(m)}}{a_{mm}^{(m)}}\quad j=m+1,\dots,n. \label{_auto7} \tag{8} \end{equation} $$

This set of $n-1$ elimations leads us to an equations which is solved by back substitution. If the arithmetics is exact and the matrix $\mathbf{A}$ is not singular, then the computed answer will be exact.

Even though the matrix elements along the diagonal are not zero, numerically small numbers may appear and subsequent divisions may lead to large numbers, which, if added to a small number may yield losses of precision. Suppose for example that our first division in $(a_{22}-a_{21}a_{12}/a_{11})$ results in $-10^{-7}$ and that $a_{22}$ is one. one. We are then adding $10^7+1$. With single precision this results in $10^7$.

Linear Algebra Methods

  • Gaussian elimination, $O(2/3n^3)$ flops, general matrix

  • LU decomposition, upper triangular and lower tridiagonal matrices, $O(2/3n^3)$ flops, general matrix. Get easily the inverse, determinant and can solve linear equations with back-substitution only, $O(n^2)$ flops

  • Cholesky decomposition. Real symmetric or hermitian positive definite matrix, $O(1/3n^3)$ flops.

  • Tridiagonal linear systems, important for differential equations. Normally positive definite and non-singular. $O(8n)$ flops for symmetric. Special case of banded matrices.

  • Singular value decomposition

  • the QR method will be discussed in chapter 7 in connection with eigenvalue systems. $O(4/3n^3)$ flops.

LU Decomposition

The LU decomposition method means that we can rewrite this matrix as the product of two matrices $\mathbf{L}$ and $\mathbf{U}$ where

$$ \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ a_{41} & a_{42} & a_{43} & a_{44} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ l_{21} & 1 & 0 & 0 \\ l_{31} & l_{32} & 1 & 0 \\ l_{41} & l_{42} & l_{43} & 1 \end{bmatrix} \begin{bmatrix} u_{11} & u_{12} & u_{13} & u_{14} \\ 0 & u_{22} & u_{23} & u_{24} \\ 0 & 0 & u_{33} & u_{34} \\ 0 & 0 & 0 & u_{44} \end{bmatrix}. $$

LU Decomposition

LU decomposition forms the backbone of other algorithms in linear algebra, such as the solution of linear equations given by

$$ a_{11}x_1 +a_{12}x_2 +a_{13}x_3 + a_{14}x_4=w_1 \nonumber $$
$$ a_{21}x_1 + a_{22}x_2 + a_{23}x_3 + a_{24}x_4=w_2 \nonumber $$
$$ a_{31}x_1 + a_{32}x_2 + a_{33}x_3 + a_{34}x_4=w_3 \nonumber $$
$$ a_{41}x_1 + a_{42}x_2 + a_{43}x_3 + a_{44}x_4=w_4. \nonumber $$

The above set of equations is conveniently solved by using LU decomposition as an intermediate step.

The matrix $\mathbf{A}\in \mathbb{R}^{n\times n}$ has an LU factorization if the determinant is different from zero. If the LU factorization exists and $\mathbf{A}$ is non-singular, then the LU factorization is unique and the determinant is given by

$$ det\{\mathbf{A}\}=det\{\mathbf{LU}\}= det\{\mathbf{L}\}det\{\mathbf{U}\}=u_{11}u_{22}\dots u_{nn}. $$

LU Decomposition, why?

There are at least three main advantages with LU decomposition compared with standard Gaussian elimination:

  • It is straightforward to compute the determinant of a matrix

  • If we have to solve sets of linear equations with the same matrix but with different vectors $\mathbf{y}$, the number of FLOPS is of the order $n^3$.

  • The inverse is such an operation

LU Decomposition, linear equations

With the LU decomposition it is rather simple to solve a system of linear equations

$$ a_{11}x_1 +a_{12}x_2 +a_{13}x_3 + a_{14}x_4=w_1 \nonumber $$
$$ a_{21}x_1 + a_{22}x_2 + a_{23}x_3 + a_{24}x_4=w_2 \nonumber $$
$$ a_{31}x_1 + a_{32}x_2 + a_{33}x_3 + a_{34}x_4=w_3 \nonumber $$
$$ a_{41}x_1 + a_{42}x_2 + a_{43}x_3 + a_{44}x_4=w_4. \nonumber $$

This can be written in matrix form as

$$ \mathbf{Ax}=\mathbf{w}. $$

where $\mathbf{A}$ and $\mathbf{w}$ are known and we have to solve for $\mathbf{x}$. Using the LU dcomposition we write

$$ \mathbf{A} \mathbf{x} \equiv \mathbf{L} \mathbf{U} \mathbf{x} =\mathbf{w}. $$

LU Decomposition, linear equations

The previous equation can be calculated in two steps

$$ \mathbf{L} \mathbf{y} = \mathbf{w};\qquad \mathbf{Ux}=\mathbf{y}. $$

To show that this is correct we use to the LU decomposition to rewrite our system of linear equations as

$$ \mathbf{LUx}=\mathbf{w}, $$

and since the determinat of $\mathbf{L}$ is equal to 1 (by construction since the diagonals of $\mathbf{L}$ equal 1) we can use the inverse of $\mathbf{L}$ to obtain

$$ \mathbf{Ux}=\mathbf{L^{-1}w}=\mathbf{y}, $$

which yields the intermediate step

$$ \mathbf{L^{-1}w}=\mathbf{y} $$

and as soon as we have $\mathbf{y}$ we can obtain $\mathbf{x}$ through $\mathbf{Ux}=\mathbf{y}$.

LU Decomposition, why?

For our four-dimentional example this takes the form

$$ y_1=w_1 \nonumber $$
$$ l_{21}y_1 + y_2=w_2\nonumber $$
$$ l_{31}y_1 + l_{32}y_2 + y_3 =w_3\nonumber $$
$$ l_{41}y_1 + l_{42}y_2 + l_{43}y_3 + y_4=w_4. \nonumber $$

and

$$ u_{11}x_1 +u_{12}x_2 +u_{13}x_3 + u_{14}x_4=y_1 \nonumber $$
$$ u_{22}x_2 + u_{23}x_3 + u_{24}x_4=y_2\nonumber $$
$$ u_{33}x_3 + u_{34}x_4=y_3\nonumber $$
$$ u_{44}x_4=y_4 \nonumber $$

This example shows the basis for the algorithm needed to solve the set of $n$ linear equations.

LU Decomposition, linear equations

The algorithm goes as follows

  • Set up the matrix $\bf A$ and the vector $\bf w$ with their correct dimensions. This determines the dimensionality of the unknown vector $\bf x$.

  • Then LU decompose the matrix $\bf A$ through a call to the function ludcmp(double a, int n, int indx, double &d). This functions returns the LU decomposed matrix $\bf A$, its determinant and the vector indx which keeps track of the number of interchanges of rows. If the determinant is zero, the solution is malconditioned.

  • Thereafter you call the function lubksb(double a, int n, int indx, double w) which uses the LU decomposed matrix $\bf A$ and the vector $\bf w$ and returns $\bf x$ in the same place as $\bf w$. Upon exit the original content in $\bf w$ is destroyed. If you wish to keep this information, you should make a backup of it in your calling function.

LU Decomposition, the inverse of a matrix

If the inverse exists then

$$ \mathbf{A}^{-1}\mathbf{A}=\mathbf{I}, $$

the identity matrix. With an LU decomposed matrix we can rewrite the last equation as

$$ \mathbf{LU}\mathbf{A}^{-1}=\mathbf{I}. $$

LU Decomposition, the inverse of a matrix

If we assume that the first column (that is column 1) of the inverse matrix can be written as a vector with unknown entries

$$ \mathbf{A}_1^{-1}= \begin{bmatrix} a_{11}^{-1} \\ a_{21}^{-1} \\ \dots \\ a_{n1}^{-1} \\ \end{bmatrix}, $$

then we have a linear set of equations

$$ \mathbf{LU}\begin{bmatrix} a_{11}^{-1} \\ a_{21}^{-1} \\ \dots \\ a_{n1}^{-1} \\ \end{bmatrix} =\begin{bmatrix} 1 \\ 0 \\ \dots \\ 0 \\ \end{bmatrix}. $$

LU Decomposition, the inverse

In a similar way we can compute the unknow entries of the second column,

$$ \mathbf{LU}\begin{bmatrix} a_{12}^{-1} \\ a_{22}^{-1} \\ \dots \\ a_{n2}^{-1} \\ \end{bmatrix}=\begin{bmatrix} 0 \\ 1 \\ \dots \\ 0 \\ \end{bmatrix}, $$

and continue till we have solved all $n$ sets of linear equations.

Using Armadillo to perform an LU decomposition

    #include <iostream>
    #include "armadillo"
    using namespace arma;
    using namespace std;

    int main()
      {
       mat A = randu<mat>(5,5);
       vec b = randu<vec>(5);

      A.print("A =");
      b.print("b=");
      // solve Ax = b
      vec x = solve(A,b);
      // print x
      x.print("x=");
      // find LU decomp of A, if needed, P is the permutation matrix
      mat L, U;
      lu(L,U,A);
      // print l
      L.print(" L= ");
      // print U
      U.print(" U= ");
      //Check that A = LU
      (A-L*U).print("Test of LU decomposition");
        return 0;
      }

Iterative methods, Chapter 6

  • Direct solvers such as Gauss elimination and LU decomposition discussed in connection with project 1.

  • Iterative solvers such as Basic iterative solvers, Jacobi, Gauss-Seidel, Successive over-relaxation. These methods are easy to parallelize, as we will se later. Much used in solutions of partial differential equations.

  • Other iterative methods such as Krylov subspace methods with Generalized minimum residual (GMRES) and Conjugate gradient etc will not be discussed.

Iterative methods, Jacobi's method

It is a simple method for solving

$$ \mathbf{A}\mathbf{x}=\mathbf{b}, $$

where $\mathbf{A}$ is a matrix and $\mathbf{x}$ and $\mathbf{b}$ are vectors. The vector $\mathbf{x}$ is the unknown.

It is an iterative scheme where we start with a guess for the unknown, and after $k+1$ iterations we have

$$ \mathbf{x}^{(k+1)}= \mathbf{D}^{-1}(\mathbf{b}-(\mathbf{L}+\mathbf{U})\mathbf{x}^{(k)}), $$

with $\mathbf{A}=\mathbf{D}+\mathbf{U}+\mathbf{L}$ and $\mathbf{D}$ being a diagonal matrix, $\mathbf{U}$ an upper triangular matrix and $\mathbf{L}$ a lower triangular matrix.

If the matrix $\mathbf{A}$ is positive definite or diagonally dominant, one can show that this method will always converge to the exact solution.

Iterative methods, Jacobi's method

We can demonstrate Jacobi's method by this $4\times 4$ matrix problem. We assume a guess for the vector elements $x_i^{(0)}$, a guess which represents our first iteration. The new values are obtained by substitution

$$ x_1^{(1)} =(b_1-a_{12}x_2^{(0)} -a_{13}x_3^{(0)} - a_{14}x_4^{(0)})/a_{11} \nonumber $$
$$ x_2^{(1)} =(b_2-a_{21}x_1^{(0)} - a_{23}x_3^{(0)} - a_{24}x_4^{(0)})/a_{22} \nonumber $$
$$ x_3^{(1)} =(b_3- a_{31}x_1^{(0)} -a_{32}x_2^{(0)} -a_{34}x_4^{(0)})/a_{33} \nonumber $$
$$ x_4^{(1)}=(b_4-a_{41}x_1^{(0)} -a_{42}x_2^{(0)} - a_{43}x_3^{(0)})/a_{44}, \nonumber $$

which after $k+1$ iterations reads

$$ x_1^{(k+1)} =(b_1-a_{12}x_2^{(k)} -a_{13}x_3^{(k)} - a_{14}x_4^{(k)})/a_{11} \nonumber $$
$$ x_2^{(k+1)} =(b_2-a_{21}x_1^{(k)} - a_{23}x_3^{(k)} - a_{24}x_4^{(k)})/a_{22} \nonumber $$
$$ x_3^{(k+1)} =(b_3- a_{31}x_1^{(k)} -a_{32}x_2^{(k)} -a_{34}x_4^{(k)})/a_{33} \nonumber $$
$$ x_4^{(k+1)}=(b_4-a_{41}x_1^{(k)} -a_{42}x_2^{(k)} - a_{43}x_3^{(k)})/a_{44}, \nonumber $$

Iterative methods, Jacobi's method

We can generalize the above equations to

$$ x_i^{(k+1)}=(b_i-\sum_{j=1, j\ne i}^{n}a_{ij}x_j^{(k)})/a_{ii} $$

or in an even more compact form as

$$ \mathbf{x}^{(k+1)}= \mathbf{D}^{-1}(\mathbf{b}-(\mathbf{L}+\mathbf{U})\mathbf{x}^{(k)}), $$

with $\mathbf{A}=\mathbf{D}+\mathbf{U}+\mathbf{L}$ and $\mathbf{D}$ being a diagonal matrix, $\mathbf{U}$ an upper triangular matrix and $\mathbf{L}$ a lower triangular matrix.

Iterative methods, Gauss-Seidel's method

Our $4\times 4$ matrix problem

$$ x_1^{(k+1)} =(b_1-a_{12}x_2^{(k)} -a_{13}x_3^{(k)} - a_{14}x_4^{(k)})/a_{11} \nonumber $$
$$ x_2^{(k+1)} =(b_2-a_{21}x_1^{(k)} - a_{23}x_3^{(k)} - a_{24}x_4^{(k)})/a_{22} \nonumber $$
$$ x_3^{(k+1)} =(b_3- a_{31}x_1^{(k)} -a_{32}x_2^{(k)} -a_{34}x_4^{(k)})/a_{33} \nonumber $$
$$ x_4^{(k+1)}=(b_4-a_{41}x_1^{(k)} -a_{42}x_2^{(k)} - a_{43}x_3^{(k)})/a_{44}, \nonumber $$

can be rewritten as

$$ x_1^{(k+1)} =(b_1-a_{12}x_2^{(k)} -a_{13}x_3^{(k)} - a_{14}x_4^{(k)})/a_{11} \nonumber $$
$$ x_2^{(k+1)} =(b_2-a_{21}x_1^{(k+1)} - a_{23}x_3^{(k)} - a_{24}x_4^{(k)})/a_{22} \nonumber $$
$$ x_3^{(k+1)} =(b_3- a_{31}x_1^{(k+1)} -a_{32}x_2^{(k+1)} -a_{34}x_4^{(k)})/a_{33} \nonumber $$
$$ x_4^{(k+1)}=(b_4-a_{41}x_1^{(k+1)} -a_{42}x_2^{(k+1)} - a_{43}x_3^{(k+1)})/a_{44}, \nonumber $$

which allows us to utilize the preceding solution (forward substitution). This improves normally the convergence behavior and leads to the Gauss-Seidel method!

Iterative methods, Gauss-Seidel's method

We can generalize

$$ x_1^{(k+1)} =(b_1-a_{12}x_2^{(k)} -a_{13}x_3^{(k)} - a_{14}x_4^{(k)})/a_{11} \nonumber $$
$$ x_2^{(k+1)} =(b_2-a_{21}x_1^{(k+1)} - a_{23}x_3^{(k)} - a_{24}x_4^{(k)})/a_{22} \nonumber $$
$$ x_3^{(k+1)} =(b_3- a_{31}x_1^{(k+1)} -a_{32}x_2^{(k+1)} -a_{34}x_4^{(k)})/a_{33} \nonumber $$
$$ x_4^{(k+1)}=(b_4-a_{41}x_1^{(k+1)} -a_{42}x_2^{(k+1)} - a_{43}x_3^{(k+1)})/a_{44}, \nonumber $$

to the following form

$$ x^{(k+1)}_i = \frac{1}{a_{ii}} \left(b_i - \sum_{j > i}a_{ij}x^{(k)}_j - \sum_{j < i}a_{ij}x^{(k+1)}_j \right),\quad i=1,2,\ldots,n. $$

The procedure is generally continued until the changes made by an iteration are below some tolerance.

The convergence properties of the Jacobi method and the Gauss-Seidel method are dependent on the matrix $\mathbf{A}$. These methods converge when the matrix is symmetric positive-definite, or is strictly or irreducibly diagonally dominant. Both methods sometimes converge even if these conditions are not satisfied.

Iterative methods, Successive over-relaxation

Given a square system of n linear equations with unknown $\mathbf x$:

$$ \mathbf{A}\mathbf x = \mathbf b $$

where

$$ \mathbf{A}=\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix}, \qquad \mathbf{x} = \begin{bmatrix} x_{1} \\ x_2 \\ \vdots \\ x_n \end{bmatrix} , \qquad \mathbf{b} = \begin{bmatrix} b_{1} \\ b_2 \\ \vdots \\ b_n \end{bmatrix}. $$

Iterative methods, Successive over-relaxation

Then A can be decomposed into a diagonal component D, and strictly lower and upper triangular components L and U:

$$ \mathbf{A} =\mathbf{D} + \mathbf{L} + \mathbf{U}, $$

where

$$ D = \begin{bmatrix} a_{11} & 0 & \cdots & 0 \\ 0 & a_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\0 & 0 & \cdots & a_{nn} \end{bmatrix}, \quad L = \begin{bmatrix} 0 & 0 & \cdots & 0 \\ a_{21} & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\a_{n1} & a_{n2} & \cdots & 0 \end{bmatrix}, \quad U = \begin{bmatrix} 0 & a_{12} & \cdots & a_{1n} \\ 0 & 0 & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\0 & 0 & \cdots & 0 \end{bmatrix}. $$

The system of linear equations may be rewritten as:

$$ (D+\omega L) \mathbf{x} = \omega \mathbf{b} - [\omega U + (\omega-1) D ] \mathbf{x} $$

for a constant $\omega > 1$.

Iterative methods, Successive over-relaxation

The method of successive over-relaxation is an iterative technique that solves the left hand side of this expression for $x$, using previous value for $x$ on the right hand side. Analytically, this may be written as:

$$ \mathbf{x}^{(k+1)} = (D+\omega L)^{-1} \big(\omega \mathbf{b} - [\omega U + (\omega-1) D ] \mathbf{x}^{(k)}\big). $$

However, by taking advantage of the triangular form of $(D+\omega L)$, the elements of $x^{(k+1)}$ can be computed sequentially using forward substitution:

$$ x^{(k+1)}_i = (1-\omega)x^{(k)}_i + \frac{\omega}{a_{ii}} \left(b_i - \sum_{j > i} a_{ij}x^{(k)}_j - \sum_{j < i} a_{ij}x^{(k+1)}_j \right),\quad i=1,2,\ldots,n. $$

The choice of relaxation factor is not necessarily easy, and depends upon the properties of the coefficient matrix. For symmetric, positive-definite matrices it can be proven that $0 < \omega < 2$ will lead to convergence, but we are generally interested in faster convergence rather than just convergence.

Cubic Splines

Cubic spline interpolation is among one of the most used methods for interpolating between data points where the arguments are organized as ascending series. In the library program we supply such a function, based on the so-called cubic spline method to be described below.

A spline function consists of polynomial pieces defined on subintervals. The different subintervals are connected via various continuity relations.

Assume we have at our disposal $n+1$ points $x_0, x_1, \dots x_n$ arranged so that $x_0 < x_1 < x_2 < \dots x_{n-1} < x_n$ (such points are called knots). A spline function $s$ of degree $k$ with $n+1$ knots is defined as follows

  • On every subinterval $[x_{i-1},x_i)$ s is a polynomial of degree $\le k$.

  • $s$ has $k-1$ continuous derivatives in the whole interval $[x_0,x_n]$.

Splines

As an example, consider a spline function of degree $k=1$ defined as follows

$$ s(x)=\begin{bmatrix} s_0(x)=a_0x+b_0 & x\in [x_0, x_1) \\ s_1(x)=a_1x+b_1 & x\in [x_1, x_2) \\ \dots & \dots \\ s_{n-1}(x)=a_{n-1}x+b_{n-1} & x\in [x_{n-1}, x_n] \end{bmatrix}. $$

In this case the polynomial consists of series of straight lines connected to each other at every endpoint. The number of continuous derivatives is then $k-1=0$, as expected when we deal with straight lines. Such a polynomial is quite easy to construct given $n+1$ points $x_0, x_1, \dots x_n$ and their corresponding function values.

Splines

The most commonly used spline function is the one with $k=3$, the so-called cubic spline function. Assume that we have in adddition to the $n+1$ knots a series of functions values $y_0=f(x_0), y_1=f(x_1), \dots y_n=f(x_n)$. By definition, the polynomials $s_{i-1}$ and $s_i$ are thence supposed to interpolate the same point $i$, that is

$$ s_{i-1}(x_i)= y_i = s_i(x_i), $$

with $1 \le i \le n-1$. In total we have $n$ polynomials of the type

$$ s_i(x)=a_{i0}+a_{i1}x+a_{i2}x^2+a_{i2}x^3, $$

yielding $4n$ coefficients to determine.

Splines

Every subinterval provides in addition the $2n$ conditions

$$ y_i = s(x_i), $$

and

$$ s(x_{i+1})= y_{i+1}, $$

to be fulfilled. If we also assume that $s'$ and $s''$ are continuous, then

$$ s'_{i-1}(x_i)= s'_i(x_i), $$

yields $n-1$ conditions. Similarly,

$$ s''_{i-1}(x_i)= s''_i(x_i), $$

results in additional $n-1$ conditions. In total we have $4n$ coefficients and $4n-2$ equations to determine them, leaving us with $2$ degrees of freedom to be determined.

Splines

Using the last equation we define two values for the second derivative, namely

$$ s''_{i}(x_i)= f_i, $$

and

$$ s''_{i}(x_{i+1})= f_{i+1}, $$

and setting up a straight line between $f_i$ and $f_{i+1}$ we have

$$ s_i''(x) = \frac{f_i}{x_{i+1}-x_i}(x_{i+1}-x)+ \frac{f_{i+1}}{x_{i+1}-x_i}(x-x_i), $$

and integrating twice one obtains

$$ s_i(x) = \frac{f_i}{6(x_{i+1}-x_i)}(x_{i+1}-x)^3+ \frac{f_{i+1}}{6(x_{i+1}-x_i)}(x-x_i)^3 +c(x-x_i)+d(x_{i+1}-x). $$

Splines

Using the conditions $s_i(x_i)=y_i$ and $s_i(x_{i+1})=y_{i+1}$ we can in turn determine the constants $c$ and $d$ resulting in

$$ s_i(x) =\frac{f_i}{6(x_{i+1}-x_i)}(x_{i+1}-x)^3+ \frac{f_{i+1}}{6(x_{i+1}-x_i)}(x-x_i)^3 \nonumber $$

$$ \begin{equation} +(\frac{y_{i+1}}{x_{i+1}-x_i}-\frac{f_{i+1}(x_{i+1}-x_i)}{6}) (x-x_i)+ (\frac{y_{i}}{x_{i+1}-x_i}-\frac{f_{i}(x_{i+1}-x_i)}{6}) (x_{i+1}-x). \label{_auto8} \tag{9} \end{equation} $$

Splines

How to determine the values of the second derivatives $f_{i}$ and $f_{i+1}$? We use the continuity assumption of the first derivatives

$$ s'_{i-1}(x_i)= s'_i(x_i), $$

and set $x=x_i$. Defining $h_i=x_{i+1}-x_i$ we obtain finally the following expression

$$ h_{i-1}f_{i-1}+2(h_{i}+h_{i-1})f_i+h_if_{i+1}= \frac{6}{h_i}(y_{i+1}-y_i)-\frac{6}{h_{i-1}}(y_{i}-y_{i-1}), $$

and introducing the shorthands $u_i=2(h_{i}+h_{i-1})$, $v_i=\frac{6}{h_i}(y_{i+1}-y_i)-\frac{6}{h_{i-1}}(y_{i}-y_{i-1})$, we can reformulate the problem as a set of linear equations to be solved through e.g., Gaussian elemination

Splines

Gaussian elimination

$$ \begin{bmatrix} u_1 & h_1 &0 &\dots & & & & \\ h_1 & u_2 & h_2 &0 &\dots & & & \\ 0 & h_2 & u_3 & h_3 &0 &\dots & & \\ \dots& & \dots &\dots &\dots &\dots &\dots & \\ &\dots & & &0 &h_{n-3} &u_{n-2} &h_{n-2} \\ & && & &0 &h_{n-2} &u_{n-1} \end{bmatrix} \begin{bmatrix} f_1 \\ f_2 \\ f_3\\ \dots \\ f_{n-2} \\ f_{n-1} \end{bmatrix} = \begin{bmatrix} v_1 \\ v_2 \\ v_3\\ \dots \\ v_{n-2}\\ v_{n-1} \end{bmatrix}. $$

Note that this is a set of tridiagonal equations and can be solved through only $O(n)$ operations.

Splines

The functions supplied in the program library are spline and splint. In order to use cubic spline interpolation you need first to call

    spline(double x[], double y[], int n, double yp1,  double yp2, double y2[])

This function takes as input $x[0,..,n - 1]$ and $y[0,..,n - 1]$ containing a tabulation $y_i = f(x_i)$ with $x_0 < x_1 < .. < x_{n - 1}$ together with the first derivatives of $f(x)$ at $x_0$ and $x_{n-1}$, respectively. Then the function returns $y2[0,..,n-1]$ which contains the second derivatives of $f(x_i)$ at each point $x_i$. $n$ is the number of points. This function provides the cubic spline interpolation for all subintervals and is called only once.

Splines

Thereafter, if you wish to make various interpolations, you need to call the function

    splint(double x[], double y[], double y2a[], int n, double x, double *y)

which takes as input the tabulated values $x[0,..,n - 1]$ and $y[0,..,n - 1]$ and the output y2a[0,..,n - 1] from spline. It returns the value $y$ corresponding to the point $x$.

Conjugate gradient (CG) method

The success of the CG method for finding solutions of non-linear problems is based on the theory of conjugate gradients for linear systems of equations. It belongs to the class of iterative methods for solving problems from linear algebra of the type

$$ \hat{A}\hat{x} = \hat{b}. $$

In the iterative process we end up with a problem like

$$ \hat{r}= \hat{b}-\hat{A}\hat{x}, $$

where $\hat{r}$ is the so-called residual or error in the iterative process.

When we have found the exact solution, $\hat{r}=0$.

Conjugate gradient method

The residual is zero when we reach the minimum of the quadratic equation

$$ P(\hat{x})=\frac{1}{2}\hat{x}^T\hat{A}\hat{x} - \hat{x}^T\hat{b}, $$

with the constraint that the matrix $\hat{A}$ is positive definite and symmetric. If we search for a minimum of the quantum mechanical variance, then the matrix $\hat{A}$, which is called the Hessian, is given by the second-derivative of the function we want to minimize. This quantity is always positive definite. In our case this corresponds normally to the second derivative of the energy.

Conjugate gradient method, Newton's method first

We seek the minimum of the energy or the variance as function of various variational parameters. In our case we have thus a function $f$ whose minimum we are seeking. In Newton's method we set $\nabla f = 0$ and we can thus compute the next iteration point

$$ \hat{x}-\hat{x}_i=\hat{A}^{-1}\nabla f(\hat{x}_i). $$

Subtracting this equation from that of $\hat{x}_{i+1}$ we have

$$ \hat{x}_{i+1}-\hat{x}_i=\hat{A}^{-1}(\nabla f(\hat{x}_{i+1})-\nabla f(\hat{x}_i)). $$

Simple example and demonstration

The function $f$ can be either the energy or the variance. If we choose the energy then we have

$$ \hat{\alpha}_{i+1}-\hat{\alpha}_i=\hat{A}^{-1}(\nabla E(\hat{\alpha}_{i+1})-\nabla E(\hat{\alpha}_i)). $$

In the simple harmonic oscillator model, the gradient and the Hessian $\hat{A}$ are

$$ \frac{d\langle E_L[\alpha]\rangle}{d\alpha} = \alpha-\frac{1}{4\alpha^3} $$

and a second derivative which is always positive (meaning that we find a minimum)

$$ \hat{A}= \frac{d^2\langle E_L[\alpha]\rangle}{d\alpha^2} = 1+\frac{3}{4\alpha^4} $$

Simple example and demonstration

We get then

$$ \alpha_{i+1}=\frac{4}{3}\alpha_i-\frac{\alpha_i^4}{3\alpha_{i+1}^3}, $$

which can be rewritten as

$$ \alpha_{i+1}^4-\frac{4}{3}\alpha_i\alpha_{i+1}^4+\frac{1}{3}\alpha_i^4. $$

Conjugate gradient method

In the CG method we define so-called conjugate directions and two vectors $\hat{s}$ and $\hat{t}$ are said to be conjugate if

$$ \hat{s}^T\hat{A}\hat{t}= 0. $$

The philosophy of the CG method is to perform searches in various conjugate directions of our vectors $\hat{x}_i$ obeying the above criterion, namely

$$ \hat{x}_i^T\hat{A}\hat{x}_j= 0. $$

Two vectors are conjugate if they are orthogonal with respect to this inner product. Being conjugate is a symmetric relation: if $\hat{s}$ is conjugate to $\hat{t}$, then $\hat{t}$ is conjugate to $\hat{s}$.

Conjugate gradient method

An example is given by the eigenvectors of the matrix

$$ \hat{v}_i^T\hat{A}\hat{v}_j= \lambda\hat{v}_i^T\hat{v}_j, $$

which is zero unless $i=j$.

Conjugate gradient method

Assume now that we have a symmetric positive-definite matrix $\hat{A}$ of size $n\times n$. At each iteration $i+1$ we obtain the conjugate direction of a vector

$$ \hat{x}_{i+1}=\hat{x}_{i}+\alpha_i\hat{p}_{i}. $$

We assume that $\hat{p}_{i}$ is a sequence of $n$ mutually conjugate directions. Then the $\hat{p}_{i}$ form a basis of $R^n$ and we can expand the solution $ \hat{A}\hat{x} = \hat{b}$ in this basis, namely

$$ \hat{x} = \sum^{n}_{i=1} \alpha_i \hat{p}_i. $$

Conjugate gradient method

The coefficients are given by

$$ \mathbf{A}\mathbf{x} = \sum^{n}_{i=1} \alpha_i \mathbf{A} \mathbf{p}_i = \mathbf{b}. $$

Multiplying with $\hat{p}_k^T$ from the left gives

$$ \hat{p}_k^T \hat{A}\hat{x} = \sum^{n}_{i=1} \alpha_i\hat{p}_k^T \hat{A}\hat{p}_i= \hat{p}_k^T \hat{b}, $$

and we can define the coefficients $\alpha_k$ as

$$ \alpha_k = \frac{\hat{p}_k^T \hat{b}}{\hat{p}_k^T \hat{A} \hat{p}_k} $$

Conjugate gradient method and iterations

If we choose the conjugate vectors $\hat{p}_k$ carefully, then we may not need all of them to obtain a good approximation to the solution $\hat{x}$. We want to regard the conjugate gradient method as an iterative method. This will us to solve systems where $n$ is so large that the direct method would take too much time.

We denote the initial guess for $\hat{x}$ as $\hat{x}_0$. We can assume without loss of generality that

$$ \hat{x}_0=0, $$

or consider the system

$$ \hat{A}\hat{z} = \hat{b}-\hat{A}\hat{x}_0, $$

instead.

Conjugate gradient method

One can show that the solution $\hat{x}$ is also the unique minimizer of the quadratic form

$$ f(\hat{x}) = \frac{1}{2}\hat{x}^T\hat{A}\hat{x} - \hat{x}^T \hat{x} , \quad \hat{x}\in\mathbf{R}^n. $$

This suggests taking the first basis vector $\hat{p}_1$ to be the gradient of $f$ at $\hat{x}=\hat{x}_0$, which equals

$$ \hat{A}\hat{x}_0-\hat{b}, $$

and $\hat{x}_0=0$ it is equal $-\hat{b}$. The other vectors in the basis will be conjugate to the gradient, hence the name conjugate gradient method.

Conjugate gradient method

Let $\hat{r}_k$ be the residual at the $k$-th step:

$$ \hat{r}_k=\hat{b}-\hat{A}\hat{x}_k. $$

Note that $\hat{r}_k$ is the negative gradient of $f$ at $\hat{x}=\hat{x}_k$, so the gradient descent method would be to move in the direction $\hat{r}_k$. Here, we insist that the directions $\hat{p}_k$ are conjugate to each other, so we take the direction closest to the gradient $\hat{r}_k$
under the conjugacy constraint. This gives the following expression

$$ \hat{p}_{k+1}=\hat{r}_k-\frac{\hat{p}_k^T \hat{A}\hat{r}_k}{\hat{p}_k^T\hat{A}\hat{p}_k} \hat{p}_k. $$

Conjugate gradient method

We can also compute the residual iteratively as

$$ \hat{r}_{k+1}=\hat{b}-\hat{A}\hat{x}_{k+1}, $$

which equals

$$ \hat{b}-\hat{A}(\hat{x}_k+\alpha_k\hat{p}_k), $$

or

$$ (\hat{b}-\hat{A}\hat{x}_k)-\alpha_k\hat{A}\hat{p}_k, $$

which gives

$$ \hat{r}_{k+1}=\hat{r}_k-\hat{A}\hat{p}_{k}, $$

Review of probability theory

Domains and probabilities

Consider the following simple example, namely the tossing of a dice, resulting in the following possible values

$$ \{2,3,4,5,6,7,8,9,10,11,12\}. $$

These values are called the domain. To this domain we have the corresponding probabilities

$$ \{1/36,2/36/3/36,4/36,5/36,6/36,5/36,4/36,3/36,2/36,1/36\}. $$

Tossing a dice

The numbers in the domain are the outcomes of the physical process tossing the dice. We cannot tell beforehand whether the outcome is 3 or 5 or any other number in this domain. This defines the randomness of the outcome, or unexpectedness or any other synonimous word which encompasses the uncertitude of the final outcome.

The only thing we can tell beforehand is that say the outcome 2 has a certain probability.
If our favorite hobby is to spend an hour every evening throwing dice and registering the sequence of outcomes, we will note that the numbers in the above domain

$$ \{2,3,4,5,6,7,8,9,10,11,12\}, $$

appear in a random order. After 11 throws the results may look like

$$ \{10,8,6,3,6,9,11,8,12,4,5\}. $$

Stochastic variables

Random variables are characterized by a domain which contains all possible values that the random value may take. This domain has a corresponding PDF.

Stochastic variables and the main concepts, the discrete case

There are two main concepts associated with a stochastic variable. The domain is the set $\mathbb D = \{x\}$ of all accessible values the variable can assume, so that $X \in \mathbb D$. An example of a discrete domain is the set of six different numbers that we may get by throwing of a dice, $x\in\{1,\,2,\,3,\,4,\,5,\,6\}$.

The probability distribution function (PDF) is a function $p(x)$ on the domain which, in the discrete case, gives us the probability or relative frequency with which these values of $X$ occur

$$ p(x) = \mathrm{Prob}(X=x). $$

Stochastic variables and the main concepts, the continuous case

In the continuous case, the PDF does not directly depict the actual probability. Instead we define the probability for the stochastic variable to assume any value on an infinitesimal interval around $x$ to be $p(x)dx$. The continuous function $p(x)$ then gives us the density of the probability rather than the probability itself. The probability for a stochastic variable to assume any value on a non-infinitesimal interval $[a,\,b]$ is then just the integral

$$ \mathrm{Prob}(a\leq X\leq b) = \int_a^b p(x)dx. $$

Qualitatively speaking, a stochastic variable represents the values of numbers chosen as if by chance from some specified PDF so that the selection of a large set of these numbers reproduces this PDF.

The cumulative probability

Of interest to us is the cumulative probability distribution function (CDF), $P(x)$, which is just the probability for a stochastic variable $X$ to assume any value less than $x$

$$ P(x)=\mathrm{Prob(}X\leq x\mathrm{)} = \int_{-\infty}^x p(x^{\prime})dx^{\prime}. $$

The relation between a CDF and its corresponding PDF is then

$$ p(x) = \frac{d}{dx}P(x). $$

Properties of PDFs

There are two properties that all PDFs must satisfy. The first one is positivity (assuming that the PDF is normalized)

$$ 0 \leq p(x) \leq 1. $$

Naturally, it would be nonsensical for any of the values of the domain to occur with a probability greater than $1$ or less than $0$. Also, the PDF must be normalized. That is, all the probabilities must add up to unity. The probability of "anything" to happen is always unity. For both discrete and continuous PDFs, this condition is

$$ \begin{align*} \sum_{x_i\in\mathbb D} p(x_i) & = 1,\\ \int_{x\in\mathbb D} p(x)\,dx & = 1. \end{align*} $$

Important distributions, the uniform distribution

The first one is the most basic PDF; namely the uniform distribution

$$ \begin{equation} p(x) = \frac{1}{b-a}\theta(x-a)\theta(b-x), \label{eq:unifromPDF} \tag{10} \end{equation} $$

with

$$ \begin{array}{ll} \theta(x)=0 & x<0 \\ \theta(x)=\frac{1}{b-a} & \in [a,b]. \end{array} $$

The normal distribution with $b=1$ and $a=0$ is used to generate random numbers.

Gaussian distribution

The second one is the Gaussian Distribution

$$ p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp{(-\frac{(x-\mu)^2}{2\sigma^2})}, $$

with mean value $\mu$ and standard deviation $\sigma$. If $\mu=0$ and $\sigma=1$, it is normally called the standard normal distribution

$$ p(x) = \frac{1}{\sqrt{2\pi}} \exp{(-\frac{x^2}{2})}, $$

The following simple Python code plots the above distribution for different values of $\mu$ and $\sigma$.


In [1]:
import numpy as np
from math import acos, exp, sqrt
from  matplotlib import pyplot as plt
from matplotlib import rc, rcParams
import matplotlib.units as units
import matplotlib.ticker as ticker
rc('text',usetex=True)
rc('font',**{'family':'serif','serif':['Gaussian distribution']})
font = {'family' : 'serif',
        'color'  : 'darkred',
        'weight' : 'normal',
        'size'   : 16,
        }
pi = acos(-1.0)
mu0 = 0.0
sigma0 = 1.0
mu1= 1.0
sigma1 = 2.0
mu2 = 2.0
sigma2 = 4.0

x = np.linspace(-20.0, 20.0)
v0 = np.exp(-(x*x-2*x*mu0+mu0*mu0)/(2*sigma0*sigma0))/sqrt(2*pi*sigma0*sigma0)
v1 = np.exp(-(x*x-2*x*mu1+mu1*mu1)/(2*sigma1*sigma1))/sqrt(2*pi*sigma1*sigma1)
v2 = np.exp(-(x*x-2*x*mu2+mu2*mu2)/(2*sigma2*sigma2))/sqrt(2*pi*sigma2*sigma2)
plt.plot(x, v0, 'b-', x, v1, 'r-', x, v2, 'g-')
plt.title(r'{\bf Gaussian distributions}', fontsize=20)
plt.text(-19, 0.3, r'Parameters: $\mu = 0$, $\sigma = 1$', fontdict=font)
plt.text(-19, 0.18, r'Parameters: $\mu = 1$, $\sigma = 2$', fontdict=font)
plt.text(-19, 0.08, r'Parameters: $\mu = 2$, $\sigma = 4$', fontdict=font)
plt.xlabel(r'$x$',fontsize=20)
plt.ylabel(r'$p(x)$ [MeV]',fontsize=20)

# Tweak spacing to prevent clipping of ylabel                                                                       
plt.subplots_adjust(left=0.15)
plt.savefig('gaussian.pdf', format='pdf')
plt.show()


Exponential distribution

Another important distribution in science is the exponential distribution

$$ p(x) = \alpha\exp{-(\alpha x)}. $$

Expectation values

Let $h(x)$ be an arbitrary continuous function on the domain of the stochastic variable $X$ whose PDF is $p(x)$. We define the expectation value of $h$ with respect to $p$ as follows

$$ \begin{equation} \langle h \rangle_X \equiv \int\! h(x)p(x)\,dx \label{eq:expectation_value_of_h_wrt_p} \tag{11} \end{equation} $$

Whenever the PDF is known implicitly, like in this case, we will drop the index $X$ for clarity.
A particularly useful class of special expectation values are the moments. The $n$-th moment of the PDF $p$ is defined as follows

$$ \langle x^n \rangle \equiv \int\! x^n p(x)\,dx $$

Stochastic variables and the main concepts, mean values

The zero-th moment $\langle 1\rangle$ is just the normalization condition of $p$. The first moment, $\langle x\rangle$, is called the mean of $p$ and often denoted by the letter $\mu$

$$ \langle x\rangle = \mu \equiv \int x p(x)dx, $$

for a continuous distribution and

$$ \langle x\rangle = \mu \equiv \frac{1}{N}\sum_{i=1}^N x_i p(x_i), $$

for a discrete distribution. Qualitatively it represents the centroid or the average value of the PDF and is therefore simply called the expectation value of $p(x)$.

Stochastic variables and the main concepts, central moments, the variance

A special version of the moments is the set of central moments, the n-th central moment defined as

$$ \langle (x-\langle x\rangle )^n\rangle \equiv \int\! (x-\langle x\rangle)^n p(x)\,dx $$

The zero-th and first central moments are both trivial, equal $1$ and $0$, respectively. But the second central moment, known as the variance of $p$, is of particular interest. For the stochastic variable $X$, the variance is denoted as $\sigma^2_X$ or $\mathrm{Var}(X)$

$$ \begin{align*} \sigma^2_X &=\mathrm{Var}(X) = \langle (x-\langle x\rangle)^2\rangle = \int (x-\langle x\rangle)^2 p(x)dx\\ & = \int\left(x^2 - 2 x \langle x\rangle^{2} +\langle x\rangle^2\right)p(x)dx\\ & = \langle x^2\rangle\rangle - 2 \langle x\rangle\langle x\rangle + \langle x\rangle^2\\ & = \langle x^2 \rangle - \langle x\rangle^2 \end{align*} $$

The square root of the variance, $\sigma =\sqrt{\langle (x-\langle x\rangle)^2\rangle}$ is called the standard deviation of $p$. It is the RMS (root-mean-square) value of the deviation of the PDF from its mean value, interpreted qualitatively as the "spread" of $p$ around its mean.

Probability Distribution Functions

The following table collects properties of probability distribution functions. In our notation we reserve the label $p(x)$ for the probability of a certain event, while $P(x)$ is the cumulative probability.

Discrete PDF Continuous PDF
Domain $\left\{x_1, x_2, x_3, \dots, x_N\right\}$ $[a,b]$
Probability $p(x_i)$ $p(x)dx$
Cumulative $P_i=\sum_{l=1}^ip(x_l)$ $P(x)=\int_a^xp(t)dt$
Positivity $ 0\le p(x_i)\le 1$ $ p(x) \ge 0$
Positivity $ 0\le P_i\le 1$ $ 0\le P(x)\le 1$
Monotonic $P_i\ge P_j$ if $x_i\ge x_j$ $P(x_i)\ge P(x_j)$ if $x_i\ge x_j$
Normalization $P_N=1$ $P(b)=1$

Probability Distribution Functions

With a PDF we can compute expectation values of selected quantities such as

$$ \langle x^k\rangle=\frac{1}{N}\sum_{i=1}^{N}x_i^kp(x_i), $$

if we have a discrete PDF or

$$ \langle x^k\rangle=\int_a^b x^kp(x)dx, $$

in the case of a continuous PDF. We have already defined the mean value $\mu$ and the variance $\sigma^2$.

The three famous Probability Distribution Functions

There are at least three PDFs which one may encounter. These are the

Uniform distribution

$$ p(x)=\frac{1}{b-a}\Theta(x-a)\Theta(b-x), $$

yielding probabilities different from zero in the interval $[a,b]$.

The exponential distribution

$$ p(x)=\alpha \exp{(-\alpha x)}, $$

yielding probabilities different from zero in the interval $[0,\infty)$ and with mean value

$$ \mu = \int_0^{\infty}xp(x)dx=\int_0^{\infty}x\alpha \exp{(-\alpha x)}dx=\frac{1}{\alpha}, $$

with variance

$$ \sigma^2=\int_0^{\infty}x^2p(x)dx-\mu^2 = \frac{1}{\alpha^2}. $$

Probability Distribution Functions, the normal distribution

Finally, we have the so-called univariate normal distribution, or just the normal distribution

$$ p(x)=\frac{1}{b\sqrt{2\pi}}\exp{\left(-\frac{(x-a)^2}{2b^2}\right)} $$

with probabilities different from zero in the interval $(-\infty,\infty)$. The integral $\int_{-\infty}^{\infty}\exp{\left(-(x^2\right)}dx$ appears in many calculations, its value is $\sqrt{\pi}$, a result we will need when we compute the mean value and the variance. The mean value is

$$ \mu = \int_0^{\infty}xp(x)dx=\frac{1}{b\sqrt{2\pi}}\int_{-\infty}^{\infty}x \exp{\left(-\frac{(x-a)^2}{2b^2}\right)}dx, $$

which becomes with a suitable change of variables

$$ \mu =\frac{1}{b\sqrt{2\pi}}\int_{-\infty}^{\infty}b\sqrt{2}(a+b\sqrt{2}y)\exp{-y^2}dy=a. $$

Probability Distribution Functions, the normal distribution

Similarly, the variance becomes

$$ \sigma^2 = \frac{1}{b\sqrt{2\pi}}\int_{-\infty}^{\infty}(x-\mu)^2 \exp{\left(-\frac{(x-a)^2}{2b^2}\right)}dx, $$

and inserting the mean value and performing a variable change we obtain

$$ \sigma^2 = \frac{1}{b\sqrt{2\pi}}\int_{-\infty}^{\infty}b\sqrt{2}(b\sqrt{2}y)^2\exp{\left(-y^2\right)}dy= \frac{2b^2}{\sqrt{\pi}}\int_{-\infty}^{\infty}y^2\exp{\left(-y^2\right)}dy, $$

and performing a final integration by parts we obtain the well-known result $\sigma^2=b^2$. It is useful to introduce the standard normal distribution as well, defined by $\mu=a=0$, viz. a distribution centered around zero and with a variance $\sigma^2=1$, leading to

$$ \begin{equation} p(x)=\frac{1}{\sqrt{2\pi}}\exp{\left(-\frac{x^2}{2}\right)}. \label{_auto9} \tag{12} \end{equation} $$

Probability Distribution Functions, the cumulative distribution

The exponential and uniform distributions have simple cumulative functions, whereas the normal distribution does not, being proportional to the so-called error function $erf(x)$, given by

$$ P(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^x\exp{\left(-\frac{t^2}{2}\right)}dt, $$

which is difficult to evaluate in a quick way.

Probability Distribution Functions, other important distribution

Some other PDFs which one encounters often in the natural sciences are the binomial distribution

$$ p(x) = \left(\begin{array}{c} n \\ x\end{array}\right)y^x(1-y)^{n-x} \hspace{0.5cm}x=0,1,\dots,n, $$

where $y$ is the probability for a specific event, such as the tossing of a coin or moving left or right in case of a random walker. Note that $x$ is a discrete stochastic variable.

The sequence of binomial trials is characterized by the following definitions

  • Every experiment is thought to consist of $N$ independent trials.

  • In every independent trial one registers if a specific situation happens or not, such as the jump to the left or right of a random walker.

  • The probability for every outcome in a single trial has the same value, for example the outcome of tossing (either heads or tails) a coin is always $1/2$.

Probability Distribution Functions, the binomial distribution

In order to compute the mean and variance we need to recall Newton's binomial formula

$$ (a+b)^m=\sum_{n=0}^m \left(\begin{array}{c} m \\ n\end{array}\right)a^nb^{m-n}, $$

which can be used to show that

$$ \sum_{x=0}^n\left(\begin{array}{c} n \\ x\end{array}\right)y^x(1-y)^{n-x} = (y+1-y)^n = 1, $$

the PDF is normalized to one. The mean value is

$$ \mu = \sum_{x=0}^n x\left(\begin{array}{c} n \\ x\end{array}\right)y^x(1-y)^{n-x} = \sum_{x=0}^n x\frac{n!}{x!(n-x)!}y^x(1-y)^{n-x}, $$

resulting in

$$ \mu = \sum_{x=0}^n x\frac{(n-1)!}{(x-1)!(n-1-(x-1))!}y^{x-1}(1-y)^{n-1-(x-1)}, $$

which we rewrite as

$$ \mu=ny\sum_{\nu=0}^n\left(\begin{array}{c} n-1 \\ \nu\end{array}\right)y^{\nu}(1-y)^{n-1-\nu} =ny(y+1-y)^{n-1}=ny. $$

The variance is slightly trickier to get. It reads $\sigma^2=ny(1-y)$.

Probability Distribution Functions, Poisson's distribution

Another important distribution with discrete stochastic variables $x$ is
the Poisson model, which resembles the exponential distribution and reads

$$ p(x) = \frac{\lambda^x}{x!} e^{-\lambda} \hspace{0.5cm}x=0,1,\dots,;\lambda > 0. $$

In this case both the mean value and the variance are easier to calculate,

$$ \mu = \sum_{x=0}^{\infty} x \frac{\lambda^x}{x!} e^{-\lambda} = \lambda e^{-\lambda}\sum_{x=1}^{\infty} \frac{\lambda^{x-1}}{(x-1)!}=\lambda, $$

and the variance is $\sigma^2=\lambda$.

Probability Distribution Functions, Poisson's distribution

An example of applications of the Poisson distribution could be the counting of the number of $\alpha$-particles emitted from a radioactive source in a given time interval. In the limit of $n\rightarrow \infty$ and for small probabilities $y$, the binomial distribution approaches the Poisson distribution. Setting $\lambda = ny$, with $y$ the probability for an event in the binomial distribution we can show that

$$ \lim_{n\rightarrow \infty}\left(\begin{array}{c} n \\ x\end{array}\right)y^x(1-y)^{n-x} e^{-\lambda}=\sum_{x=1}^{\infty}\frac{\lambda^x}{x!} e^{-\lambda}. $$

Meet the covariance!

An important quantity in a statistical analysis is the so-called covariance.

Consider the set $\{X_i\}$ of $n$ stochastic variables (not necessarily uncorrelated) with the multivariate PDF $P(x_1,\dots,x_n)$. The covariance of two of the stochastic variables, $X_i$ and $X_j$, is defined as follows

$$ \begin{equation} \mathrm{Cov}(X_i,\,X_j) = \langle (x_i-\langle x_i\rangle)(x_j-\langle x_j\rangle)\rangle \label{_auto10} \tag{13} \end{equation} $$

$$ \begin{equation} =\int\cdots\int (x_i-\langle x_i\rangle)(x_j-\langle x_j\rangle)P(x_1,\dots,x_n)\,dx_1\dots dx_n, \label{eq:def_covariance} \tag{14} \end{equation} $$

with

$$ \langle x_i\rangle = \int\cdots\int x_i P(x_1,\dots,x_n)\,dx_1\dots dx_n. $$

Meet the covariance in matrix disguise

If we consider the above covariance as a matrix

$$ C_{ij} =\mathrm{Cov}(X_i,\,X_j), $$

then the diagonal elements are just the familiar variances, $C_{ii} = \mathrm{Cov}(X_i,\,X_i) = \mathrm{Var}(X_i)$. It turns out that all the off-diagonal elements are zero if the stochastic variables are uncorrelated.

Meet the covariance, uncorrelated events

This is easy to show, keeping in mind the linearity of the expectation value. Consider the stochastic variables $X_i$ and $X_j$, ($i\neq j$)

$$ \begin{align*} \mathrm{Cov}(X_i,\,X_j) &= \langle (x_i-\langle x_i\rangle)(x_j-\langle x_j\rangle)\rangle\\ &=\langle x_i x_j - x_i\langle x_j\rangle - \langle x_i\rangle x_j + \langle x_i\rangle\langle x_j\rangle\rangle\\ &=\langle x_i x_j\rangle - \langle x_i\langle x_j\rangle\rangle - \langle \langle x_i\rangle x_j \rangle + \langle \langle x_i\rangle\langle x_j\rangle\rangle\\ &=\langle x_i x_j\rangle - \langle x_i\rangle\langle x_j\rangle - \langle x_i\rangle\langle x_j\rangle + \langle x_i\rangle\langle x_j\rangle\\ &=\langle x_i x_j\rangle - \langle x_i\rangle\langle x_j\rangle \end{align*} $$

If $X_i$ and $X_j$ are independent, we get

$$ \langle x_i x_j\rangle = \langle x_i\rangle\langle x_j\rangle=\mathrm{Cov}(X_i, X_j) = 0\ \ (i\neq j). $$

Numerical experiments and the covariance

Now that we have constructed an idealized mathematical framework, let us try to apply it to empirical observations. Examples of relevant physical phenomena may be spontaneous decays of nuclei, or a purely mathematical set of numbers produced by some deterministic mechanism. It is the latter we will deal with, using so-called pseudo-random number generators. In general our observations will contain only a limited set of observables. We remind the reader that a stochastic process is a process that produces sequentially a chain of values

$$ \{x_1, x_2,\dots\,x_k,\dots\}. $$

Numerical experiments and the covariance

We will call these values our measurements and the entire set as our measured sample. The action of measuring all the elements of a sample we will call a stochastic experiment (since, operationally, they are often associated with results of empirical observation of some physical or mathematical phenomena; precisely an experiment). We assume that these values are distributed according to some PDF $p_X^{\phantom X}(x)$, where $X$ is just the formal symbol for the stochastic variable whose PDF is $p_X^{\phantom X}(x)$. Instead of trying to determine the full distribution $p$ we are often only interested in finding the few lowest moments, like the mean $\mu_X^{\phantom X}$ and the variance $\sigma_X^{\phantom X}$.

Numerical experiments and the covariance, actual situations

In practical situations however, a sample is always of finite size. Let that size be $n$. The expectation value of a sample $\alpha$, the sample mean, is then defined as follows

$$ \langle x_{\alpha} \rangle \equiv \frac{1}{n}\sum_{k=1}^n x_{\alpha,k}. $$

The sample variance is:

$$ \mathrm{Var}(x) \equiv \frac{1}{n}\sum_{k=1}^n (x_{\alpha,k} - \langle x_{\alpha} \rangle)^2, $$

with its square root being the standard deviation of the sample.

Numerical experiments and the covariance, our observables

You can think of the above observables as a set of quantities which define a given experiment. This experiment is then repeated several times, say $m$ times. The total average is then

$$ \begin{equation} \langle X_m \rangle= \frac{1}{m}\sum_{\alpha=1}^mx_{\alpha}=\frac{1}{mn}\sum_{\alpha, k} x_{\alpha,k}, \label{eq:exptmean} \tag{15} \end{equation} $$

where the last sums end at $m$ and $n$. The total variance is

$$ \sigma^2_m= \frac{1}{mn^2}\sum_{\alpha=1}^m(\langle x_{\alpha} \rangle-\langle X_m \rangle)^2, $$

which we rewrite as

$$ \begin{equation} \sigma^2_m=\frac{1}{m}\sum_{\alpha=1}^m\sum_{kl=1}^n (x_{\alpha,k}-\langle X_m \rangle)(x_{\alpha,l}-\langle X_m \rangle). \label{eq:exptvariance} \tag{16} \end{equation} $$

Numerical experiments and the covariance, the sample variance

We define also the sample variance $\sigma^2$ of all $mn$ individual experiments as

$$ \begin{equation} \sigma^2=\frac{1}{mn}\sum_{\alpha=1}^m\sum_{k=1}^n (x_{\alpha,k}-\langle X_m \rangle)^2. \label{eq:sampleexptvariance} \tag{17} \end{equation} $$

These quantities, being known experimental values or the results from our calculations, may differ, in some cases significantly, from the similarly named exact values for the mean value $\mu_X$, the variance $\mathrm{Var}(X)$ and the covariance $\mathrm{Cov}(X,Y)$.

Numerical experiments and the covariance, central limit theorem

The central limit theorem states that the PDF $\tilde{p}(z)$ of the average of $m$ random values corresponding to a PDF $p(x)$ is a normal distribution whose mean is the mean value of the PDF $p(x)$ and whose variance is the variance of the PDF $p(x)$ divided by $m$, the number of values used to compute $z$.

The central limit theorem leads then to the well-known expression for the standard deviation, given by

$$ \sigma_m= \frac{\sigma}{\sqrt{m}}. $$

In many cases the above estimate for the standard deviation, in particular if correlations are strong, may be too simplistic. We need therefore a more precise defintion of the error and the variance in our results.

Definition of Correlation Functions and Standard Deviation

Our estimate of the true average $\mu_{X}$ is the sample mean $\langle X_m \rangle$

$$ \mu_{X}^{\phantom X} \approx X_m=\frac{1}{mn}\sum_{\alpha=1}^m\sum_{k=1}^n x_{\alpha,k}. $$

We can then use Eq. (eq:exptvariance)

$$ \sigma^2_m=\frac{1}{mn^2}\sum_{\alpha=1}^m\sum_{kl=1}^n (x_{\alpha,k}-\langle X_m \rangle)(x_{\alpha,l}-\langle X_m \rangle), $$

and rewrite it as

$$ \sigma^2_m=\frac{\sigma^2}{n}+\frac{2}{mn^2}\sum_{\alpha=1}^m\sum_{k<l}^n (x_{\alpha,k}-\langle X_m \rangle)(x_{\alpha,l}-\langle X_m \rangle), $$

where the first term is the sample variance of all $mn$ experiments divided by $n$ and the last term is nothing but the covariance which arises when $k\ne l$.

Definition of Correlation Functions and Standard Deviation

Our estimate of the true average $\mu_{X}$ is the sample mean $\langle X_m \rangle$

If the observables are uncorrelated, then the covariance is zero and we obtain a total variance which agrees with the central limit theorem. Correlations may often be present in our data set, resulting in a non-zero covariance. The first term is normally called the uncorrelated contribution. Computationally the uncorrelated first term is much easier to treat efficiently than the second. We just accumulate separately the values $x^2$ and $x$ for every measurement $x$ we receive. The correlation term, though, has to be calculated at the end of the experiment since we need all the measurements to calculate the cross terms. Therefore, all measurements have to be stored throughout the experiment.

Definition of Correlation Functions and Standard Deviation

Let us analyze the problem by splitting up the correlation term into partial sums of the form

$$ f_d = \frac{1}{nm}\sum_{\alpha=1}^m\sum_{k=1}^{n-d}(x_{\alpha,k}-\langle X_m \rangle)(x_{\alpha,k+d}-\langle X_m \rangle), $$

The correlation term of the total variance can now be rewritten in terms of $f_d$

$$ \frac{2}{mn^2}\sum_{\alpha=1}^m\sum_{k<l}^n (x_{\alpha,k}-\langle X_m \rangle)(x_{\alpha,l}-\langle X_m \rangle)= \frac{2}{n}\sum_{d=1}^{n-1} f_d $$

Definition of Correlation Functions and Standard Deviation

The value of $f_d$ reflects the correlation between measurements separated by the distance $d$ in the samples. Notice that for $d=0$, $f$ is just the sample variance, $\sigma^2$. If we divide $f_d$ by $\sigma^2$, we arrive at the so called autocorrelation function

$$ \begin{equation} \kappa_d = \frac{f_d}{\sigma^2} \label{eq:autocorrelformal} \tag{18} \end{equation} $$

which gives us a useful measure of the correlation pair correlation starting always at $1$ for $d=0$.

Definition of Correlation Functions and Standard Deviation, sample variance

The sample variance of the $mn$ experiments can now be written in terms of the autocorrelation function

$$ \begin{equation} \sigma_m^2=\frac{\sigma^2}{n}+\frac{2}{n}\cdot\sigma^2\sum_{d=1}^{n-1} \frac{f_d}{\sigma^2}=\left(1+2\sum_{d=1}^{n-1}\kappa_d\right)\frac{1}{n}\sigma^2=\frac{\tau}{n}\cdot\sigma^2 \label{eq:error_estimate_corr_time} \tag{19} \end{equation} $$

and we see that $\sigma_m$ can be expressed in terms of the uncorrelated sample variance times a correction factor $\tau$ which accounts for the correlation between measurements. We call this correction factor the autocorrelation time

$$ \begin{equation} \tau = 1+2\sum_{d=1}^{n-1}\kappa_d \label{eq:autocorrelation_time} \tag{20} \end{equation} $$

For a correlation free experiment, $\tau$ equals 1.

Definition of Correlation Functions and Standard Deviation

From the point of view of Eq. (eq:error_estimate_corr_time) we can interpret a sequential correlation as an effective reduction of the number of measurements by a factor $\tau$. The effective number of measurements becomes

$$ n_\mathrm{eff} = \frac{n}{\tau} $$

To neglect the autocorrelation time $\tau$ will always cause our simple uncorrelated estimate of $\sigma_m^2\approx \sigma^2/n$ to be less than the true sample error. The estimate of the error will be too "good". On the other hand, the calculation of the full autocorrelation time poses an efficiency problem if the set of measurements is very large. The solution to this problem is given by more practically oriented methods like the blocking technique.

Random Numbers

Uniform deviates are just random numbers that lie within a specified range (typically 0 to 1), with any one number in the range just as likely as any other. They are, in other words, what you probably think random numbers are. However, we want to distinguish uniform deviates from other sorts of random numbers, for example numbers drawn from a normal (Gaussian) distribution of specified mean and standard deviation. These other sorts of deviates are almost always generated by performing appropriate operations on one or more uniform deviates, as we will see in subsequent sections. So, a reliable source of random uniform deviates, the subject of this section, is an essential building block for any sort of stochastic modeling or Monte Carlo computer work.

Random Numbers, better name: pseudo random numbers

A disclaimer is however appropriate. It should be fairly obvious that something as deterministic as a computer cannot generate purely random numbers.

Numbers generated by any of the standard algorithms are in reality pseudo random numbers, hopefully abiding to the following criteria:

  • they produce a uniform distribution in the interval [0,1].

  • correlations between random numbers are negligible

  • the period before the same sequence of random numbers is repeated is as large as possible and finally

  • the algorithm should be fast.

Random number generator RNG

The most common random number generators are based on so-called Linear congruential relations of the type

$$ N_i=(aN_{i-1}+c) \mathrm{MOD} (M), $$

which yield a number in the interval [0,1] through

$$ x_i=N_i/M $$

The number $M$ is called the period and it should be as large as possible and $N_0$ is the starting value, or seed. The function $\mathrm{MOD}$ means the remainder, that is if we were to evaluate $(13)\mathrm{MOD}(9)$, the outcome is the remainder of the division $13/9$, namely $4$.

Random number generator RNG and periodic outputs

The problem with such generators is that their outputs are periodic; they will start to repeat themselves with a period that is at most $M$. If however the parameters $a$ and $c$ are badly chosen, the period may be even shorter.

Consider the following example

$$ N_i=(6N_{i-1}+7) \mathrm{MOD} (5), $$

with a seed $N_0=2$. This generator produces the sequence $4,1,3,0,2,4,1,3,0,2,...\dots$, i.e., a sequence with period $5$. However, increasing $M$ may not guarantee a larger period as the following example shows

$$ N_i=(27N_{i-1}+11) \mathrm{MOD} (54), $$

which still, with $N_0=2$, results in $11,38,11,38,11,38,\dots$, a period of just $2$.

Random number generator RNG and its period

Typical periods for the random generators provided in the program library are of the order of $\sim 10^9$ or larger. Other random number generators which have become increasingly popular are so-called shift-register generators. In these generators each successive number depends on many preceding values (rather than the last values as in the linear congruential generator). For example, you could make a shift register generator whose $l$th number is the sum of the $l-i$th and $l-j$th values with modulo $M$,

$$ N_l=(aN_{l-i}+cN_{l-j})\mathrm{MOD}(M). $$

Random number generator RNG, other examples

Such a generator again produces a sequence of pseudorandom numbers but this time with a period much larger than $M$. It is also possible to construct more elaborate algorithms by including more than two past terms in the sum of each iteration. One example is the generator of Marsaglia and Zaman which consists of two congruential relations

$$ \begin{equation} N_l=(N_{l-3}-N_{l-1})\mathrm{MOD}(2^{31}-69), \label{eq:mz1} \tag{21} \end{equation} $$

followed by

$$ \begin{equation} N_l=(69069N_{l-1}+1013904243)\mathrm{MOD}(2^{32}), \label{eq:mz2} \tag{22} \end{equation} $$

which according to the authors has a period larger than $2^{94}$.

Random number generator RNG, other examples

Instead of using modular addition, we could use the bitwise exclusive-OR ($\oplus$) operation so that

$$ N_l=(N_{l-i})\oplus (N_{l-j}) $$

where the bitwise action of $\oplus$ means that if $N_{l-i}=N_{l-j}$ the result is $0$ whereas if $N_{l-i}\ne N_{l-j}$ the result is $1$. As an example, consider the case where $N_{l-i}=6$ and $N_{l-j}=11$. The first one has a bit representation (using 4 bits only) which reads $0110$ whereas the second number is $1011$. Employing the $\oplus$ operator yields $1101$, or $2^3+2^2+2^0=13$.

In Fortran90, the bitwise $\oplus$ operation is coded through the intrinsic function $\mathrm{IEOR}(m,n)$ where $m$ and $n$ are the input numbers, while in $C$ it is given by $m\wedge n$.

Random number generator RNG, RAN0

We show here how the linear congruential algorithm can be implemented, namely

$$ N_i=(aN_{i-1}) \mathrm{MOD} (M). $$

However, since $a$ and $N_{i-1}$ are integers and their multiplication could become greater than the standard 32 bit integer, there is a trick via Schrage's algorithm which approximates the multiplication of large integers through the factorization

$$ M=aq+r, $$

where we have defined

$$ q=[M/a], $$

and

$$ r = M\hspace{0.1cm}\mathrm{MOD} \hspace{0.1cm}a. $$

where the brackets denote integer division. In the code below the numbers $q$ and $r$ are chosen so that $r < q$.

Random number generator RNG, RAN0

To see how this works we note first that

$$ \begin{equation} (aN_{i-1}) \mathrm{MOD} (M)= (aN_{i-1}-[N_{i-1}/q]M)\mathrm{MOD} (M), \label{eq:rntrick1} \tag{23} \end{equation} $$

since we can add or subtract any integer multiple of $M$ from $aN_{i-1}$. The last term $[N_{i-1}/q]M\mathrm{MOD}(M)$ is zero since the integer division $[N_{i-1}/q]$ just yields a constant which is multiplied with $M$.

Random number generator RNG, RAN0

We can now rewrite Eq. (eq:rntrick1) as

$$ \begin{equation} (aN_{i-1}) \mathrm{MOD} (M)= (aN_{i-1}-[N_{i-1}/q](aq+r))\mathrm{MOD} (M), \label{eq:rntrick2} \tag{24} \end{equation} $$

which results in

$$ \begin{equation} (aN_{i-1}) \mathrm{MOD} (M)= \left(a(N_{i-1}-[N_{i-1}/q]q)-[N_{i-1}/q]r)\right)\mathrm{MOD} (M), \label{eq:rntrick3} \tag{25} \end{equation} $$

yielding

$$ \begin{equation} (aN_{i-1}) \mathrm{MOD} (M)= \left(a(N_{i-1}\mathrm{MOD} (q)) -[N_{i-1}/q]r)\right)\mathrm{MOD} (M). \label{eq:rntrick4} \tag{26} \end{equation} $$

Random number generator RNG, RAN0

The term $[N_{i-1}/q]r$ is always smaller or equal $N_{i-1}(r/q)$ and with $r < q$ we obtain always a number smaller than $N_{i-1}$, which is smaller than $M$. And since the number $N_{i-1}\mathrm{MOD} (q)$ is between zero and $q-1$ then $a(N_{i-1}\mathrm{MOD} (q))< aq$. Combined with our definition of $q=[M/a]$ ensures that this term is also smaller than $M$ meaning that both terms fit into a 32-bit signed integer. None of these two terms can be negative, but their difference could. The algorithm below adds $M$ if their difference is negative. Note that the program uses the bitwise $\oplus$ operator to generate the starting point for each generation of a random number. The period of $ran0$ is $\sim 2.1\times 10^{9}$. A special feature of this algorithm is that is should never be called with the initial seed set to $0$.

Random number generator RNG, RAN0 code

        /*
         ** The function
         **           ran0()
         ** is an "Minimal" random number generator of Park and Miller
         ** Set or reset the input value
         ** idum to any integer value (except the unlikely value MASK)
         ** to initialize the sequence; idum must not be altered between
         ** calls for sucessive deviates in a sequence.
         ** The function returns a uniform deviate between 0.0 and 1.0.
         */
    double ran0(long &idum)
    {
       const int a = 16807, m = 2147483647, q = 127773;
       const int r = 2836, MASK = 123459876;
       const double am = 1./m;
       long     k;
       double   ans;
       idum ^= MASK;
       k = (*idum)/q;
       idum = a*(idum - k*q) - r*k;
       // add m if negative difference
       if(idum < 0) idum += m;
       ans=am*(idum);
       idum ^= MASK;
       return ans;
    } // End: function ran0() 

Properties of Selected Random Number Generators

As mentioned previously, the underlying PDF for the generation of random numbers is the uniform distribution, meaning that the probability for finding a number $x$ in the interval [0,1] is $p(x)=1$.

A random number generator should produce numbers which are uniformly distributed in this interval. The table shows the distribution of $N=10000$ random numbers generated by the functions in the program library. We note in this table that the number of points in the various intervals $0.0-0.1$, $0.1-0.2$ etc are fairly close to $1000$, with some minor deviations.

Two additional measures are the standard deviation $\sigma$ and the mean $\mu=\langle x\rangle$.

Properties of Selected Random Number Generators

For the uniform distribution, the mean value $\mu$ is then

$$ \mu=\langle x\rangle=\frac{1}{2} $$

while the standard deviation is

$$ \sigma=\sqrt{\langle x^2\rangle-\mu^2}=\frac{1}{\sqrt{12}}=0.2886. $$

Properties of Selected Random Number Generators

The various random number generators produce results which agree rather well with these limiting values.

$x$-bin ran0 ran1 ran2 ran3
0.0-0.1 1013 991 938 1047
0.1-0.2 1002 1009 1040 1030
0.2-0.3 989 999 1030 993
0.3-0.4 939 960 1023 937
0.4-0.5 1038 1001 1002 992
0.5-0.6 1037 1047 1009 1009
0.6-0.7 1005 989 1003 989
0.7-0.8 986 962 985 954
0.8-0.9 1000 1027 1009 1023
0.9-1.0 991 1015 961 1026
$\mu$ 0.4997 0.5018 0.4992 0.4990
$\sigma$ 0.2882 0.2892 0.2861 0.2915

Simple demonstration of RNGs using python

The following simple Python code plots the distribution of the produced random numbers using the linear congruential RNG employed by Python. The trend displayed in the previous table is seen rather clearly.


In [8]:
#!/usr/bin/env python
import numpy as np
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import random

# initialize the rng with a seed
random.seed() 
counts = 10000
values = np.zeros(counts)   
for i in range (1, counts, 1):
    values[i] = random.random()

# the histogram of the data
n, bins, patches = plt.hist(values, 10, facecolor='green')

plt.xlabel('$x$')
plt.ylabel('Number of counts')
plt.title(r'Test of uniform distribution')
plt.axis([0, 1, 0, 1100])
plt.grid(True)
plt.show()

Properties of Selected Random Number Generators

Since our random numbers, which are typically generated via a linear congruential algorithm, are never fully independent, we can then define an important test which measures the degree of correlation, namely the so-called
auto-correlation function defined previously, see again Eq. (eq:autocorrelformal). We rewrite it here as

$$ C_k=\frac{f_d} {\sigma^2}, $$

with $C_0=1$. Recall that $\sigma^2=\langle x_i^2\rangle-\langle x_i\rangle^2$ and that

$$ f_d = \frac{1}{nm}\sum_{\alpha=1}^m\sum_{k=1}^{n-d}(x_{\alpha,k}-\langle X_m \rangle)(x_{\alpha,k+d}-\langle X_m \rangle), $$

The non-vanishing of $C_k$ for $k\ne 0$ means that the random numbers are not independent. The independence of the random numbers is crucial in the evaluation of other expectation values. If they are not independent, our assumption for approximating $\sigma_N$ is no longer valid.

Correlation function and which random number generators should I use

The program here computes the correlation function for one of the standard functions included with the c++ compiler.

    //  This function computes the autocorrelation function for 
    //  the standard c++ random number generator

    #include <fstream>
    #include <iomanip>
    #include <iostream>
    #include <cmath>
    using namespace std;
    // output file as global variable
    ofstream ofile;  

    //     Main function begins here     
    int main(int argc, char* argv[])
    {
         int n;
         char *outfilename;

         cin >> n;
         double MCint = 0.;      double MCintsqr2=0.;
         double invers_period = 1./RAND_MAX; // initialise the random number generator
         srand(time(NULL));  // This produces the so-called seed in MC jargon
         // Compute the variance and the mean value of the uniform distribution
         // Compute also the specific values x for each cycle in order to be able to
         // the covariance and the correlation function  
         // Read in output file, abort if there are too few command-line arguments
         if( argc <= 2 ){
           cout << "Bad Usage: " << argv[0] << 
         " read also output file and number of cycles on same line" << endl;
           exit(1);
         }
         else{
           outfilename=argv[1];
         }
         ofile.open(outfilename); 
         // Get  the number of Monte-Carlo samples
         n = atoi(argv[2]);
         double *X;  
         X = new double[n];
         for (int i = 0;  i < n; i++){
               double x = double(rand())*invers_period; 
               X[i] = x;
               MCint += x;
               MCintsqr2 += x*x;
         }
         double Mean = MCint/((double) n );
         MCintsqr2 = MCintsqr2/((double) n );
         double STDev = sqrt(MCintsqr2-Mean*Mean);
         double Variance = MCintsqr2-Mean*Mean;
    //   Write mean value and standard deviation 
         cout << " Standard deviation= " << STDev << " Integral = " << Mean << endl;

         // Now we compute the autocorrelation function
         double *autocor;  autocor = new double[n];
         for (int j = 0; j < n; j++){
           double sum = 0.0;
           for (int k = 0; k < (n-j); k++){
         sum  += (X[k]-Mean)*(X[k+j]-Mean); 
           }
           autocor[j] = sum/Variance/((double) n );
           ofile << setiosflags(ios::showpoint | ios::uppercase);
           ofile << setw(15) << setprecision(8) << j;
           ofile << setw(15) << setprecision(8) << autocor[j] << endl;
         }
         ofile.close();  // close output file
         return 0;
    }  // end of main program 

Correlation function and which random number generators should I use

The following Python code plots the results for the correlation function from the above program.


In [9]:
import numpy as np
from  matplotlib import pyplot as plt
# Load in data file
data = np.loadtxt("datafiles/autocor.dat")
# Make arrays containing x-axis and binding energies as function of A
x = data[:,0]
corr = data[:,1]
plt.plot(x, corr ,'ro')
plt.axis([0,1000,-0.2, 1.1])
plt.xlabel(r'$d$')
plt.ylabel(r'$C_d$')
plt.title(r'autocorrelation function for RNG')
plt.savefig('autocorr.pdf')
plt.show()

Which RNG should I use?

  • In the library files lib.cpp and lib.h we have included four popular RNGs taken from the widely used textbook Numerical Recipes. These are called ran0, ran1, ran2 and ran3.

  • C++ has a class called random. The random class contains a large selection of RNGs and is highly recommended. Some of these RNGs have very large periods making it thereby very safe to use these RNGs in case one is performing large calculations. In particular, the Mersenne twister random number engine has a period of $2^{19937}$.

How to use the Mersenne generator

The following part of a c++ code (from project 4) sets up the uniform distribution for $x\in [0,1]$.

    /*

    //  You need this 
    #include <random>

    // Initialize the seed and call the Mersienne algo
    std::random_device rd;
    std::mt19937_64 gen(rd());
    // Set up the uniform distribution for x \in [[0, 1]
    std::uniform_real_distribution<double> RandomNumberGenerator(0.0,1.0);

    // Now use the RNG
    int ix = (int) (RandomNumberGenerator(gen)*NSpins);

Why blocking?

Statistical analysis

* Monte Carlo simulations can be treated as *computer experiments*

* The results can be analysed with the same statistical tools as we would use analysing experimental data.

* As in all experiments, we are looking for expectation values and an estimate of how accurate they are, i.e., possible sources for errors.

A very good article which explains blocking is H. Flyvbjerg and H. G. Petersen, Error estimates on averages of correlated data, Journal of Chemical Physics 91, 461-466 (1989).

Why blocking?

Statistical analysis

* As in other experiments, Monte Carlo experiments have two classes of errors:

  * Statistical errors

  * Systematical errors


* Statistical errors can be estimated using standard tools from statistics

* Systematical errors are method specific and must be treated differently from case to case. (In VMC a common source is the step length or time step in importance sampling)

Code to demonstrate the calculation of the autocorrelation function

The following code computes the autocorrelation function, the covariance and the standard deviation for standard RNG. The following file gives the code.

    //  This function computes the autocorrelation function for 
    //  the Mersenne random number generator with a uniform distribution
    #include <iostream>
    #include <fstream>
    #include <iomanip>
    #include <cstdlib>
    #include <random>
    #include <armadillo>
    #include <string>
    #include <cmath>
    using namespace  std;
    using namespace arma;
    // output file
    ofstream ofile;

    //     Main function begins here     
    int main(int argc, char* argv[])
    {
      int MonteCarloCycles;
      string filename;
      if (argc > 1) {
        filename=argv[1];
        MonteCarloCycles = atoi(argv[2]);
        string fileout = filename;
        string argument = to_string(MonteCarloCycles);
        fileout.append(argument);
        ofile.open(fileout);
      }

      // Compute the variance and the mean value of the uniform distribution
      // Compute also the specific values x for each cycle in order to be able to
      // compute the covariance and the correlation function  

      vec X  = zeros<vec>(MonteCarloCycles);
      double MCint = 0.;      double MCintsqr2=0.;
      std::random_device rd;
      std::mt19937_64 gen(rd());
      // Set up the uniform distribution for x \in [[0, 1]
      std::uniform_real_distribution<double> RandomNumberGenerator(0.0,1.0);
      for (int i = 0;  i < MonteCarloCycles; i++){
        double x =   RandomNumberGenerator(gen); 
        X(i) = x;
        MCint += x;
        MCintsqr2 += x*x;
      }
      double Mean = MCint/((double) MonteCarloCycles );
      MCintsqr2 = MCintsqr2/((double) MonteCarloCycles );
      double STDev = sqrt(MCintsqr2-Mean*Mean);
      double Variance = MCintsqr2-Mean*Mean;
      //   Write mean value and variance
      cout << " Sample variance= " << Variance  << " Mean value = " << Mean << endl;
      // Now we compute the autocorrelation function
      vec autocorrelation = zeros<vec>(MonteCarloCycles);
      for (int j = 0; j < MonteCarloCycles; j++){
        double sum = 0.0;
        for (int k = 0; k < (MonteCarloCycles-j); k++){
          sum  += (X(k)-Mean)*(X(k+j)-Mean); 
        }
        autocorrelation(j) = sum/Variance/((double) MonteCarloCycles );
        ofile << setiosflags(ios::showpoint | ios::uppercase);
        ofile << setw(15) << setprecision(8) << j;
        ofile << setw(15) << setprecision(8) << autocorrelation(j) << endl;
      }
      // Now compute the exact covariance using the autocorrelation function
      double Covariance = 0.0;
      for (int j = 0; j < MonteCarloCycles; j++){
        Covariance  += autocorrelation(j);
      }
      Covariance *=  2.0/((double) MonteCarloCycles);
      // Compute now the total variance, including the covariance, and obtain the standard deviation
      double TotalVariance = (Variance/((double) MonteCarloCycles ))+Covariance;
      cout << "Covariance =" << Covariance << "Totalvariance= " << TotalVariance << "Sample Variance/n= " << (Variance/((double) MonteCarloCycles )) << endl;
      cout << " STD from sample variance= " << sqrt(Variance/((double) MonteCarloCycles )) << " STD with covariance = " << sqrt(TotalVariance) << endl;

      ofile.close();  // close output file
      return 0;
    }  // end of main program 

What is blocking?

Blocking

* Say that we have a set of samples from a Monte Carlo experiment

* Assuming (wrongly) that our samples are uncorrelated our best estimate of the standard deviation of the mean $\langle \mathbf{M}\rangle$ is given by
$$ \sigma=\sqrt{\frac{1}{n}\left(\langle \mathbf{M}^2\rangle-\langle \mathbf{M}\rangle^2\right)} $$
  • If the samples are correlated we can rewrite our results to show that
$$ \sigma=\sqrt{\frac{1+2\tau/\Delta t}{n}\left(\langle \mathbf{M}^2\rangle-\langle \mathbf{M}\rangle^2\right)} $$

where $\tau$ is the correlation time (the time between a sample and the next uncorrelated sample) and $\Delta t$ is time between each sample

What is blocking?

Blocking

* If $\Delta t\gg\tau$ our first estimate of $\sigma$ still holds

* Much more common that $\Delta t<\tau$

* In the method of data blocking we divide the sequence of samples into blocks

* We then take the mean $\langle \mathbf{M}_i\rangle$ of block $i=1\ldots n_{blocks}$ to calculate the total mean and variance

* The size of each block must be so large that sample $j$ of block $i$ is not correlated with sample $j$ of block $i+1$

* The correlation time $\tau$ would be a good choice

What is blocking?

Blocking

* Problem: We don't know $\tau$ or it is too expensive to compute

* Solution: Make a plot of std. dev. as a function of blocksize

* The estimate of std. dev. of correlated data is too low $\to$ the error will increase with increasing block size until the blocks are uncorrelated, where we reach a plateau

* When the std. dev. stops increasing the blocks are uncorrelated

Implementation

* Do a Monte Carlo simulation, storing all samples to file

* Do the statistical analysis on this file, independently of your Monte Carlo program

* Read the file into an array

* Loop over various block sizes

* For each block size $n_b$, loop over the array in steps of $n_b$ taking the mean of elements $i n_b,\ldots,(i+1) n_b$

* Take the mean and variance of the resulting array

* Write the results for each block size to file for later
  analysis

Actual implementation with code, main function

When the file gets large, it can be useful to write your data in binary mode instead of ascii characters. The following python file reads data from file with the output from every Monte Carlo cycle.


In [10]:
# Blocking
    @timeFunction
    def blocking(self, blockSizeMax = 500):
        blockSizeMin = 1

        self.blockSizes = []
        self.meanVec = []
        self.varVec = []

        for i in range(blockSizeMin, blockSizeMax):
            if(len(self.data) % i != 0):
                pass#continue
            blockSize = i
            meanTempVec = []
            varTempVec = []
            startPoint = 0
            endPoint = blockSize

            while endPoint <= len(self.data):
                meanTempVec.append(np.average(self.data[startPoint:endPoint]))
                startPoint = endPoint
                endPoint += blockSize
            mean, var = np.average(meanTempVec), np.var(meanTempVec)/len(meanTempVec)
            self.meanVec.append(mean)
            self.varVec.append(var)
            self.blockSizes.append(blockSize)

        self.blockingAvg = np.average(self.meanVec[-200:])
        self.blockingVar = (np.average(self.varVec[-200:]))
        self.blockingStd = np.sqrt(self.blockingVar)

The Bootstrap method

The Bootstrap resampling method is also very popular. It is very simple:

  1. Start with your sample of measurements and compute the sample variance and the mean values

  2. Then start again but pick in a random way the numbers in the sample and recalculate the mean and the sample variance.

  3. Repeat this $K$ times.

It can be shown, see the article by Efron that it produces the correct standard deviation.

This method is very useful for small ensembles of data points.

Bootstrapping

Given a set of $N$ data, assume that we are interested in some observable $\theta$ which may be estimated from that set. This observable can also be for example the result of a fit based on all $N$ raw data. Let us call the value of the observable obtained from the original data set $\hat{\theta}$. One recreates from the sample repeatedly other samples by choosing randomly $N$ data out of the original set. This costs essentially nothing, since we just recycle the original data set for the building of new sets.

Bootstrapping, recipe

Let us assume we have done this $K$ times and thus have $K$ sets of $N$ data values each. Of course some values will enter more than once in the new sets. For each of these sets one computes the observable $\theta$ resulting in values $\theta_k$ with $k = 1,...,K$. Then one determines

$$ \tilde{\theta} = \frac{1}{K} \sum_{k=1}^K \theta_k, $$

and

$$ sigma^2_{\tilde{\theta}} = \frac{1}{K} \sum_{k=1}^K \left(\theta_k-\tilde{\theta}\right)^2. $$

These are estimators for $\angle\theta\rangle$ and its variance. They are not unbiased and therefore $\tilde{\theta}\neq\hat{\theta}$ for finite K.

The difference is called bias and gives an idea on how far away the result may be from the true $\angle\theta\rangle$. As final result for the observable one quotes $\angle\theta\rangle = \tilde{\theta} \pm \sigma_{\tilde{\theta}}$ .

Bootstrapping, code

    # Bootstrap
        @timeFunction
        def bootstrap(self, nBoots = 1000):
            bootVec = np.zeros(nBoots)
            for k in range(0,nBoots):
                bootVec[k] = np.average(np.random.choice(self.data, len(self.data)))
            self.bootAvg = np.average(bootVec)
            self.bootVar = np.var(bootVec)
            self.bootStd = np.std(bootVec)

Jackknife, code

    # Jackknife
        @timeFunction
        def jackknife(self):
            jackknVec = np.zeros(len(self.data))
            for k in range(0,len(self.data)):
                jackknVec[k] = np.average(np.delete(self.data, k))
            self.jackknAvg = self.avg - (len(self.data) - 1) * (np.average(jackknVec) - self.avg)
            self.jackknVar = float(len(self.data) - 1) * np.var(jackknVec)
            self.jackknStd = np.sqrt(self.jackknVar)

Regression analysis, overarching aims

Regression modeling deals with the description of the sampling distribution of a given random variable $y$ varies as function of another variable or a set of such variables $\hat{x} =[x_0, x_1,\dots, x_p]^T$. The first variable is called the dependent, the outcome or the response variable while the set of variables $\hat{x}$ is called the independent variable, or the predictor variable or the explanatory variable.

A regression model aims at finding a likelihood function $p(y\vert \hat{x})$, that is the conditional distribution for $y$ with a given $\hat{x}$. The estimation of $p(y\vert \hat{x})$ is made using a data set with

  • $n$ cases $i = 0, 1, 2, \dots, n-1$

  • Response (dependent or outcome) variable $y_i$ with $i = 0, 1, 2, \dots, n-1$

  • $p$ Explanatory (independent or predictor) variables $\hat{x}_i=[x_{i0}, x_{i1}, \dots, x_{ip}]$ with $i = 0, 1, 2, \dots, n-1$

    The goal of the regression analysis is to extract/exploit relationship between $y_i$ and $\hat{x}_i$ in or to infer causal dependencies, approximations to the likelihood functions, functional relationships and to make predictions .

General linear models

Before we proceed let us study a case from linear algebra where we aim at fitting a set of data $\hat{y}=[y_0,y_1,\dots,y_{n-1}]$. We could think of these data as a result of an experiment or a complicated numerical experiment. These data are functions of a series of variables $\hat{x}=[x_0,x_1,\dots,x_{n-1}]$, that is $y_i = y(x_i)$ with $i=0,1,2,\dots,n-1$. The variables $x_i$ could represent physical quantities like time, temperature, position etc. We assume that $y(x)$ is a smooth function.

Since obtaining these data points may not be trivial, we want to use these data to fit a function which can allow us to make predictions for values of $y$ which are not in the present set. The perhaps simplest approach is to assume we can parametrize our function in terms of a polynomial of degree $n-1$ with $n$ points, that is

$$ y=y(x) \rightarrow y(x_i)=\tilde{y}_i+\epsilon_i=\sum_{j=0}^{n-1} \beta_i x_i^j+\epsilon_i, $$

where $\epsilon_i$ is the error in our approximation.

Rewriting the fitting procedure as a linear algebra problem

For every set of values $y_i,x_i$ we have thus the corresponding set of equations

$$ \begin{align*} y_0&=\beta_0+\beta_1x_0^1+\beta_2x_0^2+\dots+\beta_{n-1}x_0^{n-1}+\epsilon_0\\ y_1&=\beta_0+\beta_1x_1^1+\beta_2x_1^2+\dots+\beta_{n-1}x_1^{n-1}+\epsilon_1\\ y_2&=\beta_0+\beta_1x_2^1+\beta_2x_2^2+\dots+\beta_{n-1}x_2^{n-1}+\epsilon_2\\ \dots & \dots \\ y_{n-1}&=\beta_0+\beta_1x_{n-1}^1+\beta_2x_{n-1}^2+\dots+\beta_1x_{n-1}^{n-1}+\epsilon_{n-1}.\\ \end{align*} $$

Rewriting the fitting procedure as a linear algebra problem, follows

Defining the vectors

2 2 9

< < < ! ! M A T H _ B L O C K

2 3 0

< < < ! ! M A T H _ B L O C K

$$ \hat{\epsilon} = [\epsilon_0,\epsilon_1, \epsilon_2,\dots, \epsilon_{n-1}]^T, $$

and the matrix

$$ \hat{X}= \begin{bmatrix} 1& x_{0}^1 &x_{0}^2& \dots & \dots &x_{0}^{n-1}\\ 1& x_{1}^1 &x_{1}^2& \dots & \dots &x_{1}^{n-1}\\ 1& x_{2}^1 &x_{2}^2& \dots & \dots &x_{2}^{n-1}\\ \dots& \dots &\dots& \dots & \dots &\dots\\ 1& x_{n-1}^1 &x_{n-1}^2& \dots & \dots &x_{n-1}^{n-1}\\ \end{bmatrix} $$

we can rewrite our equations as

$$ \hat{y} = \hat{X}\hat{\beta}+\hat{\epsilon}. $$

Generalizing the fitting procedure as a linear algebra problem

We are obviously not limited to the above polynomial. We could replace the various powers of $x$ with elements of Fourier series, that is, instead of $x_i^j$ we could have $\cos{(j x_i)}$ or $\sin{(j x_i)}$, or time series or other orthogonal functions. For every set of values $y_i,x_i$ we can then generalize the equations to

$$ \begin{align*} y_0&=\beta_0x_{00}+\beta_1x_{01}+\beta_2x_{02}+\dots+\beta_{n-1}x_{0n-1}+\epsilon_0\\ y_1&=\beta_0x_{10}+\beta_1x_{11}+\beta_2x_{12}+\dots+\beta_{n-1}x_{1n-1}+\epsilon_1\\ y_2&=\beta_0x_{20}+\beta_1x_{21}+\beta_2x_{22}+\dots+\beta_{n-1}x_{2n-1}+\epsilon_2\\ \dots & \dots \\ y_{i}&=\beta_0x_{i0}+\beta_1x_{i1}+\beta_2x_{i2}+\dots+\beta_{n-1}x_{in-1}+\epsilon_i\\ \dots & \dots \\ y_{n-1}&=\beta_0x_{n-1,0}+\beta_1x_{n-1,2}+\beta_2x_{n-1,2}+\dots+\beta_1x_{n-1,n-1}+\epsilon_{n-1}.\\ \end{align*} $$

Generalizing the fitting procedure as a linear algebra problem

We redefine in turn the matrix $\hat{X}$ as

$$ \hat{X}= \begin{bmatrix} x_{00}& x_{01} &x_{02}& \dots & \dots &x_{0,n-1}\\ x_{10}& x_{11} &x_{12}& \dots & \dots &x_{1,n-1}\\ x_{20}& x_{21} &x_{22}& \dots & \dots &x_{2,n-1}\\ \dots& \dots &\dots& \dots & \dots &\dots\\ x_{n-1,0}& x_{n-1,1} &x_{n-1,2}& \dots & \dots &x_{n-1,n-1}\\ \end{bmatrix} $$

and without loss of generality we rewrite again our equations as

$$ \hat{y} = \hat{X}\hat{\beta}+\hat{\epsilon}. $$

The left-hand side of this equation forms know. Our error vector $\hat{\epsilon}$ and the parameter vector $\hat{\beta}$ are our unknow quantities. How can we obtain the optimal set of $\beta_i$ values?

Optimizing our parameters

We have defined the matrix $\hat{X}$

$$ \begin{align*} y_0&=\beta_0x_{00}+\beta_1x_{01}+\beta_2x_{02}+\dots+\beta_{n-1}x_{0n-1}+\epsilon_0\\ y_1&=\beta_0x_{10}+\beta_1x_{11}+\beta_2x_{12}+\dots+\beta_{n-1}x_{1n-1}+\epsilon_1\\ y_2&=\beta_0x_{20}+\beta_1x_{21}+\beta_2x_{22}+\dots+\beta_{n-1}x_{2n-1}+\epsilon_1\\ \dots & \dots \\ y_{i}&=\beta_0x_{i0}+\beta_1x_{i1}+\beta_2x_{i2}+\dots+\beta_{n-1}x_{in-1}+\epsilon_1\\ \dots & \dots \\ y_{n-1}&=\beta_0x_{n-1,0}+\beta_1x_{n-1,2}+\beta_2x_{n-1,2}+\dots+\beta_1x_{n-1,n-1}+\epsilon_{n-1}.\\ \end{align*} $$

Optimizing our parameters, more details

We well use this matrix to define the approximation $\hat{\tilde{y}}$ via the unknown quantity $\hat{\beta}$ as

$$ \hat{\tilde{y}}= \hat{X}\hat{\beta}, $$

and in order to find the optimal parameters $\beta_i$ instead of solving the above linear algebra problem, we define a function which gives a measure of the spread between the values $y_i$ (which represent hopefully the exact values) and the parametrized values $\tilde{y}_i$, namely

$$ Q(\hat{\beta})=\sum_{i=0}^{n-1}\left(y_i-\tilde{y}_i\right)^2=\left(\hat{y}-\hat{\tilde{y}}\right)^T\left(\hat{y}-\hat{\tilde{y}}\right), $$

or using the matrix $\hat{X}$ as

$$ Q(\hat{\beta})=\left(\hat{y}-\hat{X}\hat{\beta}\right)^T\left(\hat{y}-\hat{X}\hat{\beta}\right). $$

Interpretations and optimizing our parameters

The function

$$ Q(\hat{\beta})=\left(\hat{y}-\hat{X}\hat{\beta}\right)^T\left(\hat{y}-\hat{X}\hat{\beta}\right), $$

can be linked to the variance of the quantity $y_i$ if we interpret the latter as the mean value of for example a numerical experiment. When linking below with the maximum likelihood approach below, we will indeed interpret $y_i$ as a mean value

$$ y_{i}=\langle y_i \rangle = \beta_0x_{i,0}+\beta_1x_{i,1}+\beta_2x_{i,2}+\dots+\beta_{n-1}x_{i,n-1}+\epsilon_i, $$

where $\langle y_i \rangle$ is the mean value. Keep in mind also that till now we have treated $y_i$ as the exact value. Normally, the response (dependent or outcome) variable $y_i$ the outcome of a numerical experiment or another type of experiment and is thus only an approximation to the true value. It is then always accompanied by an error estimate, often limited to a statistical error estimate given by the standard deviation discussed earlier. In the discussion here we will treat $y_i$ as our exact value for the response variable.

In order to find the parameters $\beta_i$ we will then minimize the spread of $Q(\hat{\beta})$ by requiring

$$ \frac{\partial Q(\hat{\beta})}{\partial \beta_j} = \frac{\partial }{\partial \beta_j}\left[ \sum_{i=0}^{n-1}\left(y_i-\beta_0x_{i,0}-\beta_1x_{i,1}-\beta_2x_{i,2}-\dots-\beta_{n-1}x_{i,n-1}\right)^2\right]=0, $$

which results in

$$ \frac{\partial Q(\hat{\beta})}{\partial \beta_j} = -2\left[ \sum_{i=0}^{n-1}x_{ij}\left(y_i-\beta_0x_{i,0}-\beta_1x_{i,1}-\beta_2x_{i,2}-\dots-\beta_{n-1}x_{i,n-1}\right)\right]=0, $$

or in a matrix-vector form as

$$ \frac{\partial Q(\hat{\beta})}{\partial \hat{\beta}} = 0 = \hat{X}^T\left( \hat{y}-\hat{X}\hat{\beta}\right). $$

Interpretations and optimizing our parameters

We can rewrite

$$ \frac{\partial Q(\hat{\beta})}{\partial \hat{\beta}} = 0 = \hat{X}^T\left( \hat{y}-\hat{X}\hat{\beta}\right), $$

as

$$ \hat{X}^T\hat{y} = \hat{X}^T\hat{X}\hat{\beta}, $$

and if the matrix $\hat{X}^T\hat{X}$ is invertible we have the solution

$$ \hat{\beta} =\left(\hat{X}^T\hat{X}\right)^{-1}\hat{X}^T\hat{y}. $$

Interpretations and optimizing our parameters

The residuals $\hat{\epsilon}$ are in turn given by

$$ \hat{\epsilon} = \hat{y}-\hat{\tilde{y}} = \hat{y}-\hat{X}\hat{\beta}, $$

and with

$$ \hat{X}^T\left( \hat{y}-\hat{X}\hat{\beta}\right)= 0, $$

we have

$$ \hat{X}^T\hat{\epsilon}=\hat{X}^T\left( \hat{y}-\hat{X}\hat{\beta}\right)= 0, $$

meaning that the solution for $\hat{\beta}$ is the one which minimizes the residuals. Later we will link this with the maximum likelihood approach.

The $\chi^2$ function

Normally, the response (dependent or outcome) variable $y_i$ the outcome of a numerical experiment or another type of experiment and is thus only an approximation to the true value. It is then always accompanied by an error estimate, often limited to a statistical error estimate given by the standard deviation discussed earlier. In the discussion here we will treat $y_i$ as our exact value for the response variable.

Introducing the standard deviation $\sigma_i$ for each measurement $y_i$, we define now the $\chi^2$ function as

$$ \chi^2(\hat{\beta})=\sum_{i=0}^{n-1}\frac{\left(y_i-\tilde{y}_i\right)^2}{\sigma_i^2}=\left(\hat{y}-\hat{\tilde{y}}\right)^T\frac{1}{\hat{\Sigma^2}}\left(\hat{y}-\hat{\tilde{y}}\right), $$

where the matrix $\hat{\Sigma}$ is a diagonal matrix with $\sigma_i$ as matrix elements.

The $\chi^2$ function

In order to find the parameters $\beta_i$ we will then minimize the spread of $\chi^2(\hat{\beta})$ by requiring

$$ \frac{\partial \chi^2(\hat{\beta})}{\partial \beta_j} = \frac{\partial }{\partial \beta_j}\left[ \sum_{i=0}^{n-1}\left(\frac{y_i-\beta_0x_{i,0}-\beta_1x_{i,1}-\beta_2x_{i,2}-\dots-\beta_{n-1}x_{i,n-1}}{\sigma_i}\right)^2\right]=0, $$

which results in

$$ \frac{\partial \chi^2(\hat{\beta})}{\partial \beta_j} = -2\left[ \sum_{i=0}^{n-1}\frac{x_{ij}}{\sigma_i}\left(\frac{y_i-\beta_0x_{i,0}-\beta_1x_{i,1}-\beta_2x_{i,2}-\dots-\beta_{n-1}x_{i,n-1}}{\sigma_i}\right)\right]=0, $$

or in a matrix-vector form as

$$ \frac{\partial \chi^2(\hat{\beta})}{\partial \hat{\beta}} = 0 = \hat{A}^T\left( \hat{b}-\hat{A}\hat{\beta}\right). $$

where we have defined the matrix $\hat{A} =\hat{X}/\hat{\Sigma}$ with matrix elements $a_{ij} = x_{ij}/\sigma_i$ and the vector $\hat{b}$ with elements $b_i = y_i/\sigma_i$.

The $\chi^2$ function

We can rewrite

$$ \frac{\partial \chi^2(\hat{\beta})}{\partial \hat{\beta}} = 0 = \hat{A}^T\left( \hat{b}-\hat{A}\hat{\beta}\right), $$

as

$$ \hat{A}^T\hat{b} = \hat{A}^T\hat{A}\hat{\beta}, $$

and if the matrix $\hat{A}^T\hat{A}$ is invertible we have the solution

$$ \hat{\beta} =\left(\hat{A}^T\hat{A}\right)^{-1}\hat{A}^T\hat{b}. $$

The $\chi^2$ function

If we then introduce the matrix

$$ \hat{H} = \hat{A}^T\hat{A}, $$

we have then the following expression for the parameters $\beta_j$ (the matrix elements of $\hat{H}$ are $h_{ij}$)

$$ \beta_j = \sum_{k=0}^{p-1}h_{jk}\sum_{i=0}^{n-1}\frac{y_i}{\sigma_i}\frac{x_{ik}}{\sigma_i} = \sum_{k=0}^{p-1}h_{jk}\sum_{i=0}^{n-1}b_ia_{ik} $$

We state without proof the expression for the uncertainty in the parameters $\beta_j$ as

$$ \sigma^2(\beta_j) = \sum_{i=0}^{n-1}\sigma_i^2\left( \frac{\partial \beta_j}{\partial y_i}\right)^2, $$

resulting in

$$ \sigma^2(\beta_j) = \left(\sum_{k=0}^{p-1}h_{jk}\sum_{i=0}^{n-1}a_{ik}\right)\left(\sum_{l=0}^{p-1}h_{jl}\sum_{m=0}^{n-1}a_{ml}\right) = h_{jj}! $$

The $\chi^2$ function

The first step here is to approximate the function $y$ with a first-order polynomial, that is we write

$$ y=y(x) \rightarrow y(x_i) \approx \beta_0+\beta_1 x_i. $$

By computing the derivatives of $\chi^2$ with respect to $\beta_0$ and $\beta_1$ show that these are given by

$$ \frac{\partial \chi^2(\hat{\beta})}{\partial \beta_0} = -2\left[ \sum_{i=0}^{1}\left(\frac{y_i-\beta_0-\beta_1x_{i}}{\sigma_i^2}\right)\right]=0, $$

and

$$ \frac{\partial \chi^2(\hat{\beta})}{\partial \beta_0} = -2\left[ \sum_{i=0}^{1}x_i\left(\frac{y_i-\beta_0-\beta_1x_{i}}{\sigma_i^2}\right)\right]=0. $$

The $\chi^2$ function

We define then

$$ \gamma = \sum_{i=0}^{1}\frac{1}{\sigma_i^2}, $$

2 6 7

< < < ! ! M A T H _ B L O C K

2 6 8

< < < ! ! M A T H _ B L O C K

2 6 9

< < < ! ! M A T H _ B L O C K

$$ \gamma_{xy} = \sum_{i=0}^{1}\frac{y_ix_{i}}{\sigma_i^2}, $$

and show that

2 7 1

< < < ! ! M A T H _ B L O C K

$$ \beta_1 = \frac{\gamma_{xy}\gamma-\gamma_x\gamma_y}{\gamma\gamma_{xx}-\gamma_x^2}. $$

The LSM suffers often from both being underdetermined and overdetermined in the unknown coefficients $\beta_i$. A better approach is to use the Singular Value Decomposition (SVD) method discussed below.

The singular value decompostion

How can we use the singular value decomposition to find the parameters $\beta_j$? More details will come. We first note that a general $m\times n$ matrix $\hat{A}$ can be written in terms of a diagonal matrix $\hat{\Sigma}$ of dimensionality $n\times n$ and two orthognal matrices $\hat{U}$ and $\hat{V}$, where the first has dimensionality $m \times n$ and the last dimensionality $n\times n$. We have then

$$ \hat{A} = \hat{U}\hat{\Sigma}\hat{V} $$

Neural Networks

Artificial neurons

The field of artificial neural networks has a long history of development, and is closely connected with the advancement of computer science and computers in general. A model of artificial neurons was first developed by McCulloch and Pitts in 1943 to study signal processing in the brain and has later been refined by others. The general idea is to mimic neural networks in the human brain, which is composed of billions of neurons that communicate with each other by sending electrical signals. Each neuron accumulates its incoming signals, which must exceed an activation threshold to yield an output. If the threshold is not overcome, the neuron remains inactive, i.e. has zero output.

This behaviour has inspired a simple mathematical model for an artificial neuron.

$$ \begin{equation} y = f\left(\sum_{i=1}^n w_ix_i\right) = f(u) \label{artificialNeuron} \tag{27} \end{equation} $$

Here, the output $y$ of the neuron is the value of its activation function, which have as input a weighted sum of signals $x_i, \dots ,x_n$ received by $n$ other neurons.

Neural network types

An artificial neural network (NN), is a computational model that consists of layers of connected neurons, or nodes. It is supposed to mimic a biological nervous system by letting each neuron interact with other neurons by sending signals in the form of mathematical functions between layers. A wide variety of different NNs have been developed, but most of them consist of an input layer, an output layer and eventual layers in-between, called hidden layers. All layers can contain an arbitrary number of nodes, and each connection between two nodes is associated with a weight variable.

Feed-forward neural networks

The feed-forward neural network (FFNN) was the first and simplest type of NN devised. In this network, the information moves in only one direction: forward through the layers.

Nodes are represented by circles, while the arrows display the connections between the nodes, including the direction of information flow. Additionally, each arrow corresponds to a weight variable, not displayed here. We observe that each node in a layer is connected to all nodes in the subsequent layer, making this a so-called fully-connected FFNN.

A different variant of FFNNs are convolutional neural networks (CNNs), which have a connectivity pattern inspired by the animal visual cortex. Individual neurons in the visual cortex only respond to stimuli from small sub-regions of the visual field, called a receptive field. This makes the neurons well-suited to exploit the strong spatially local correlation present in natural images. The response of each neuron can be approximated mathematically as a convolution operation.

CNNs emulate the behaviour of neurons in the visual cortex by enforcing a local connectivity pattern between nodes of adjacent layers: Each node in a convolutional layer is connected only to a subset of the nodes in the previous layer, in contrast to the fully-connected FFNN. Often, CNNs consist of several convolutional layers that learn local features of the input, with a fully-connected layer at the end, which gathers all the local data and produces the outputs. They have wide applications in image and video recognition

Recurrent neural networks

So far we have only mentioned NNs where information flows in one direction: forward. Recurrent neural networks on the other hand, have connections between nodes that form directed cycles. This creates a form of internal memory which are able to capture information on what has been calculated before; the output is dependent on the previous computations. Recurrent NNs make use of sequential information by performing the same task for every element in a sequence, where each element depends on previous elements. An example of such information is sentences, making recurrent NNs especially well-suited for handwriting and speech recognition.

Other types of networks

There are many other kinds of NNs that have been developed. One type that is specifically designed for interpolation in multidimensional space is the radial basis function (RBF) network. RBFs are typically made up of three layers: an input layer, a hidden layer with non-linear radial symmetric activation functions and a linear output layer (''linear'' here means that each node in the output layer has a linear activation function). The layers are normally fully-connected and there are no cycles, thus RBFs can be viewed as a type of fully-connected FFNN. They are however usually treated as a separate type of NN due the unusual activation functions.

Other types of NNs could also be mentioned, but are outside the scope of this work. We will now move on to a detailed description of how a fully-connected FFNN works, and how it can be used to interpolate data sets.

Multilayer perceptrons

One use often so-called fully-connected feed-forward neural networks with three or more layers (an input layer, one or more hidden layers and an output layer) consisting of neurons that have non-linear activation functions.

Such networks are often called multilayer perceptrons (MLPs)

Why multilayer perceptrons?

According to the Universal approximation theorem, a feed-forward neural network with just a single hidden layer containing a finite number of neurons can approximate a continuous multidimensional function to arbitrary accuracy, assuming the activation function for the hidden layer is a non-constant, bounded and monotonically-increasing continuous function. Note that the requirements on the activation function only applies to the hidden layer, the output nodes are always assumed to be linear, so as to not restrict the range of output values.

We note that this theorem is only applicable to a NN with one hidden layer. Therefore, we can easily construct an NN that employs activation functions which do not satisfy the above requirements, as long as we have at least one layer with activation functions that do. Furthermore, although the universal approximation theorem lays the theoretical foundation for regression with neural networks, it does not say anything about how things work in practice: A neural network can still be able to approximate a given function reasonably well without having the flexibility to fit all other functions.

Mathematical model

$$ \begin{equation} y = f\left(\sum_{i=1}^n w_ix_i + b_i\right) = f(u) \label{artificialNeuron2} \tag{28} \end{equation} $$

In an FFNN of such neurons, the inputs $x_i$ are the outputs of the neurons in the preceding layer. Furthermore, an MLP is fully-connected, which means that each neuron receives a weighted sum of the outputs of all neurons in the previous layer.

Mathematical model

First, for each node $i$ in the first hidden layer, we calculate a weighted sum $u_i^1$ of the input coordinates $x_j$,

$$ \begin{equation} u_i^1 = \sum_{j=1}^2 w_{ij}^1 x_j + b_i^1 \label{_auto11} \tag{29} \end{equation} $$

This value is the argument to the activation function $f_1$ of each neuron $i$, producing the output $y_i^1$ of all neurons in layer 1,

$$ \begin{equation} y_i^1 = f_1(u_i^1) = f_1\left(\sum_{j=1}^2 w_{ij}^1 x_j + b_i^1\right) \label{outputLayer1} \tag{30} \end{equation} $$

where we assume that all nodes in the same layer have identical activation functions, hence the notation $f_l$

$$ \begin{equation} y_i^l = f_l(u_i^l) = f_l\left(\sum_{j=1}^{N_{l-1}} w_{ij}^l y_j^{l-1} + b_i^l\right) \label{generalLayer} \tag{31} \end{equation} $$

where $N_l$ is the number of nodes in layer $l$. When the output of all the nodes in the first hidden layer are computed, the values of the subsequent layer can be calculated and so forth until the output is obtained.

Mathematical model

The output of neuron $i$ in layer 2 is thus,

$$ \begin{equation} y_i^2 = f_2\left(\sum_{j=1}^3 w_{ij}^2 y_j^1 + b_i^2\right) \label{_auto12} \tag{32} \end{equation} $$

$$ \begin{equation} = f_2\left[\sum_{j=1}^3 w_{ij}^2f_1\left(\sum_{k=1}^2 w_{jk}^1 x_k + b_j^1\right) + b_i^2\right] \label{outputLayer2} \tag{33} \end{equation} $$

where we have substituted $y_m^1$ with. Finally, the NN output yields,

$$ \begin{equation} y_1^3 = f_3\left(\sum_{j=1}^3 w_{1m}^3 y_j^2 + b_1^3\right) \label{_auto13} \tag{34} \end{equation} $$

$$ \begin{equation} = f_3\left[\sum_{j=1}^3 w_{1j}^3 f_2\left(\sum_{k=1}^3 w_{jk}^2 f_1\left(\sum_{m=1}^2 w_{km}^1 x_m + b_k^1\right) + b_j^2\right) + b_1^3\right] \label{_auto14} \tag{35} \end{equation} $$

Mathematical model

We can generalize this expression to an MLP with $l$ hidden layers. The complete functional form is,

$$ \begin{equation} y^{l+1}_1\! = \!f_{l+1}\!\left[\!\sum_{j=1}^{N_l}\! w_{1j}^3 f_l\!\left(\!\sum_{k=1}^{N_{l-1}}\! w_{jk}^2 f_{l-1}\!\left(\! \dots \!f_1\!\left(\!\sum_{n=1}^{N_0} \!w_{mn}^1 x_n\! + \!b_m^1\!\right) \!\dots \!\right) \!+ \!b_k^2\!\right) \!+ \!b_1^3\!\right] \label{completeNN} \tag{36} \end{equation} $$

which illustrates a basic property of MLPs: The only independent variables are the input values $x_n$.

Mathematical model

This confirms that an MLP, despite its quite convoluted mathematical form, is nothing more than an analytic function, specifically a mapping of real-valued vectors $\vec{x} \in \mathbb{R}^n \rightarrow \vec{y} \in \mathbb{R}^m$. In our example, $n=2$ and $m=1$. Consequentially, the number of input and output values of the function we want to fit must be equal to the number of inputs and outputs of our MLP.

Furthermore, the flexibility and universality of a MLP can be illustrated by realizing that the expression is essentially a nested sum of scaled activation functions of the form

$$ \begin{equation} h(x) = c_1 f(c_2 x + c_3) + c_4 \label{_auto15} \tag{37} \end{equation} $$

where the parameters $c_i$ are weights and biases. By adjusting these parameters, the activation functions can be shifted up and down or left and right, change slope or be rescaled which is the key to the flexibility of a neural network.

Matrix-vector notation

We can introduce a more convenient notation for the activations in a NN.

Additionally, we can represent the biases and activations as layer-wise column vectors $\vec{b}_l$ and $\vec{y}_l$, so that the $i$-th element of each vector is the bias $b_i^l$ and activation $y_i^l$ of node $i$ in layer $l$ respectively.

We have that $\mathrm{W}_l$ is a $N_{l-1} \times N_l$ matrix, while $\vec{b}_l$ and $\vec{y}_l$ are $N_l \times 1$ column vectors. With this notation, the sum in becomes a matrix-vector multiplication, and we can write the equation for the activations of hidden layer 2 in

$$ \begin{equation} \vec{y}_2 = f_2(\mathrm{W}_2 \vec{y}_{1} + \vec{b}_{2}) = f_2\left(\left[\begin{array}{ccc} w^2_{11} &w^2_{12} &w^2_{13} \\ w^2_{21} &w^2_{22} &w^2_{23} \\ w^2_{31} &w^2_{32} &w^2_{33} \\ \end{array} \right] \cdot \left[\begin{array}{c} y^1_1 \\ y^1_2 \\ y^1_3 \\ \end{array}\right] + \left[\begin{array}{c} b^2_1 \\ b^2_2 \\ b^2_3 \\ \end{array}\right]\right). \label{_auto16} \tag{38} \end{equation} $$

Matrix-vector notation and activation

The activation of node $i$ in layer 2 is

$$ \begin{equation} y^2_i = f_2\Bigr(w^2_{i1}y^1_1 + w^2_{i2}y^1_2 + w^2_{i3}y^1_3 + b^2_i\Bigr) = f_2\left(\sum_{j=1}^3 w^2_{ij} y_j^1 + b^2_i\right). \label{_auto17} \tag{39} \end{equation} $$

This is not just a convenient and compact notation, but also a useful and intuitive way to think about MLPs: The output is calculated by a series of matrix-vector multiplications and vector additions that are used as input to the activation functions. For each operation $\mathrm{W}_l \vec{y}_{l-1}$ we move forward one layer.

Activation functions

A property that characterizes a neural network, other than its connectivity, is the choice of activation function(s). As described in, the following restrictions are imposed on an activation function for a FFNN to fulfill the universal approximation theorem

  • Non-constant

  • Bounded

  • Monotonically-increasing

  • Continuous

Activation functions, Logistic and Hyperbolic ones

The second requirement excludes all linear functions. Furthermore, in a MLP with only linear activation functions, each layer simply performs a linear transformation of its inputs.

Regardless of the number of layers, the output of the NN will be nothing but a linear function of the inputs. Thus we need to introduce some kind of non-linearity to the NN to be able to fit non-linear functions Typical examples are the logistic Sigmoid

$$ \begin{equation} f(x) = \frac{1}{1 + e^{-x}}, \label{sigmoidActivationFunction} \tag{40} \end{equation} $$

and the hyperbolic tangent function

$$ \begin{equation} f(x) = \tanh(x) \label{tanhActivationFunction} \tag{41} \end{equation} $$

Relevance

The sigmoid function are more biologically plausible because the output of inactive neurons are zero. Such activation function are called one-sided. However, it has been shown that the hyperbolic tangent performs better than the sigmoid for training MLPs. has become the most popular for deep neural networks