Morten Hjorth-Jensen, Department of Physics, University of Oslo and Department of Physics and Astronomy and National Superconducting Cyclotron Laboratory, Michigan State University
Date: Jan 27, 2018
Copyright 1999-2018, Morten Hjorth-Jensen. Released under CC Attribution-NonCommercial 4.0 license
Machine learning is the science of giving computers the ability to learn without being explicitly programmed. The idea is that there exist generic algorithms which can be used to find patterns in a broad class of data sets without having to write code specifically for each problem. The algorithm will build its own logic based on the data.
Machine learning is a subfield of computer science, and is closely related to computational statistics. It evolved from the study of pattern recognition in artificial intelligence (AI) research, and has made contributions to AI tasks like computer vision, natural language processing and speech recognition. In recent years it has also found applications in a wide variety of other areas, including bioinformatics, economics, physics, finance and marketing.
The approaches to machine learning are many, but are often split into two main categories. In supervised learning we know the answer to a problem, and let the computer deduce the logic behind it. On the other hand, unsupervised learning is a method for finding patterns and relationships in data sets without any prior knowledge of the system. Some authors also operate with a third category, namely reinforcement learning. This is a paradigm of learning inspired by behavioural psychology, where learning is achieved by trial and error, solely from rewards and punishments.
Another way to categorize machine learning tasks is to consider the desired output of a system. Some of the most common tasks are:
Classification: Outputs are divided into two or more classes. The goal is to produce a model that assigns inputs into one of these classes. An example is to identify digits based on pictures of hand-written ones. Classification is typically supervised learning.
Regression: Finding a functional relationship between an input data set and a reference data set. The goal is to construct a function that maps input data to continuous output values.
Clustering: Data are divided into groups with certain common traits, without knowing the different groups beforehand. It is thus a form of unsupervised learning.
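The three task types can be illustrated with a short sketch using scikit-learn, which we rely on later in these notes; the toy data and parameter choices below are ours, not part of the original text:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(50, 1)

# regression (supervised): learn the continuous mapping y = 3x
reg = LinearRegression().fit(X, 3.0 * X.ravel())

# classification (supervised): the labels are known beforehand
labels = (X.ravel() > 0.5).astype(int)
clf = KNeighborsClassifier(n_neighbors=3).fit(X, labels)

# clustering (unsupervised): groups are found without any labels
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(reg.coef_[0])          # recovers the slope 3
print(clf.predict([[0.9]]))
print(len(set(km.labels_)))
```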
In this course we will build our machine learning approach on a statistical foundation, with elements from data analysis, stochastic processes and related topics, before we proceed with the following machine learning algorithms:
Linear regression and its variants
Decision tree algorithms, from simpler to more complex ones
Nearest neighbors models
Bayesian statistics
Support vector machines and finally various variants of
Artificial neural networks
Before we proceed however, there are several practicalities with data analysis and software tools we would like to present. These tools will help us in our understanding of various machine learning algorithms.
Our emphasis here is on understanding the mathematical aspects of the different algorithms; however, where possible, we will also emphasize the importance of using available software.
We will make intensive use of Python as programming language and its myriad of available libraries. Furthermore, you will find IPython/Jupyter notebooks invaluable in your work. You can also run R code in Jupyter/IPython notebooks, with the immediate benefit of visualizing your data.
If you have Python installed (we recommend Python3) and you feel pretty familiar with installing different packages, we recommend that you install the following Python packages via pip as
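The concrete package list is not reproduced here; as a sketch, the set below is our assumption, chosen to match the libraries actually imported throughout these notes:

```shell
# assumed package set -- matches the imports used throughout these notes
pip install numpy scipy matplotlib pandas scikit-learn jupyter
```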
For Python3, replace pip with pip3.
For OSX users we also recommend, after having installed Xcode, to install brew. Brew allows for a seamless installation of additional software via, for example,
For Linux users, with the variety of distributions, like for example the widely popular Ubuntu distribution, you can use pip as well, or simply install Python and the relevant packages through the distribution's package manager.
If you don't want to perform these operations separately, we recommend two widely used distributions which set up all relevant dependencies for Python, namely
Anaconda is an open source distribution of the Python and R programming languages for large-scale data processing, predictive analytics, and scientific computing, which aims to simplify package management and deployment. Package versions are managed by the package management system conda.
Enthought Canopy is a Python distribution for scientific and analytic computing, available for free and under a commercial license.
You will also find it convenient to utilize R. The Jupyter/IPython notebook allows you to run R code interactively in your browser. The software library R is tuned to statistical analysis and allows for an easy usage of the tools we will discuss in these texts.
To install R with Jupyter notebook, follow the link here
For the C++ aficionados, the Jupyter/IPython notebook allows you also to install C++ kernels and run code written in this language interactively in the browser. Since we will emphasize writing many of the algorithms yourself, you can thus opt for either Python or C++ as programming language.
To add more entropy, Cython can also be used when running your notebooks. This means that Python with the Jupyter/IPython notebook setup allows you to integrate widely popular software and tools for scientific computing. With its versatility, including symbolic operations, Python offers a unique computational environment. Your Jupyter/IPython notebook can easily be converted into a nicely rendered PDF file or a LaTeX file for further processing. For example, convert to LaTeX as
In [1]:
jupyter nbconvert filename.ipynb --to latex
In [2]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import sparse
import pandas as pd
from IPython.display import display
eye = np.eye(4)
print(eye)
sparse_mtx = sparse.csr_matrix(eye)
print(sparse_mtx)
x = np.linspace(-10,10,100)
y = np.sin(x)
plt.plot(x,y,marker='x')
plt.show()
data = {'Name': ["John", "Anna", "Peter", "Linda"], 'Location': ["Nairobi", "Napoli", "London", "Buenos Aires"], 'Age':[51, 21, 34, 45]}
data_pandas = pd.DataFrame(data)
display(data_pandas)
In [3]:
import numpy as np
import matplotlib.pyplot as plt
from scipy import sparse
import pandas as pd
from IPython.display import display
import mglearn
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
x, y = mglearn.datasets.make_wave(n_samples=100)
line = np.linspace(-3,3,1000,endpoint=False).reshape(-1,1)
reg = DecisionTreeRegressor(min_samples_split=3).fit(x,y)
plt.plot(line, reg.predict(line), label="decision tree")
regline = LinearRegression().fit(x,y)
plt.plot(line, regline.predict(line), label="Linear Regression")
plt.legend()
plt.show()
The population dynamics of a simple predator-prey system is a classical example shown in many biology textbooks when ecological systems are discussed. The system contains all elements of the scientific method:
the setup of a specific hypothesis combined with
the experimental methods needed (one can study existing data or perform experiments)
analyzing and interpreting the data and performing further experiments if needed
trying to extract general behaviors and eventual laws or patterns
develop mathematical relations for the uncovered regularities/laws and test these by performing new experiments
Lots of data about populations of hares and lynx collected from furs in Hudson Bay, Canada, are available. It is known that the populations oscillate. Why? Here we start by
plotting the data
derive a simple model for the population dynamics
(fitting parameters in the model to the data)
using the model to predict the evolution of other predator-prey systems
Most mammalian predators rely on a variety of prey, which complicates mathematical modeling; however, a few predators have become highly specialized and seek almost exclusively a single prey species. An example of this simplified predator-prey interaction is seen in Canadian northern forests, where the populations of the lynx and the snowshoe hare are intertwined in a life and death struggle.
One reason that this particular system has been so extensively studied is that the Hudson Bay company kept careful records of all furs from the early 1800s into the 1900s. The records for the furs collected by the Hudson Bay company showed distinct oscillations (approximately 12 year periods), suggesting that these species caused almost periodic fluctuations of each other's populations. The table here shows data from 1900 to 1920.
Year | Hares (x1000) | Lynx (x1000) |
---|---|---|
1900 | 30.0 | 4.0 |
1901 | 47.2 | 6.1 |
1902 | 70.2 | 9.8 |
1903 | 77.4 | 35.2 |
1904 | 36.3 | 59.4 |
1905 | 20.6 | 41.7 |
1906 | 18.1 | 19.0 |
1907 | 21.4 | 13.0 |
1908 | 22.0 | 8.3 |
1909 | 25.4 | 9.1 |
1910 | 27.1 | 7.4 |
1911 | 40.3 | 8.0 |
1912 | 57 | 12.3 |
1913 | 76.6 | 19.5 |
1914 | 52.3 | 45.7 |
1915 | 19.5 | 51.1 |
1916 | 11.2 | 29.7 |
1917 | 7.6 | 15.8 |
1918 | 14.6 | 9.7 |
1919 | 16.2 | 10.1 |
1920 | 24.7 | 8.6 |
In [4]:
import numpy as np
from matplotlib import pyplot as plt
# Load in data file
data = np.loadtxt('src/Hudson_Bay.csv', delimiter=',', skiprows=1)
# Make arrays containing x-axis and hares and lynx populations
year = data[:,0]
hares = data[:,1]
lynx = data[:,2]
plt.plot(year, hares ,'b-+', year, lynx, 'r-o')
plt.axis([1900,1920,0, 100.0])
plt.xlabel(r'Year')
plt.ylabel(r'Numbers of hares and lynx ')
plt.legend(('Hares','Lynx'), loc='upper right')
plt.title(r'Population of hares and lynx from 1900-1920 (x1000)')
plt.savefig('Hudson_Bay_data.pdf')
plt.savefig('Hudson_Bay_data.png')
plt.show()
We see from the plot that there are indeed fluctuations. We would like to create a mathematical model that explains these population fluctuations. Ecologists have predicted that in a simple predator-prey system a rise in the prey population is followed (with a lag) by a rise in the predator population. When the predator population is sufficiently high, the prey population begins dropping. After the prey population falls, the predator population falls as well, which allows the prey population to recover and complete one cycle of the interaction. Thus, we see that oscillations occur qualitatively. Can a mathematical model predict this? What causes cycles to slow or speed up? What affects the amplitude of the oscillations, and do we expect the oscillations to damp towards a stable equilibrium? The models tend to ignore factors like climate and other complicating factors. How significant are these?
We see oscillations in the data
What causes cycles to slow or speed up?
What affects the amplitude of the oscillation or do you expect to see the oscillations damp to a stable equilibrium?
With a model we can better understand the data
More important: we can understand the ecological dynamics of predator-prey populations
The classical way (in all books) is to present the Lotka-Volterra equations:

$$
\frac{dH}{dt} = aH - bHL, \qquad \frac{dL}{dt} = dHL - cL.
$$
Here,
$H$ is the number of preys
$L$ the number of predators
$a$, $b$, $d$, $c$ are parameters
Most books quickly establish the model and then use considerable space on discussing the qualitative properties of this nonlinear system of ODEs (which cannot be solved analytically)
Time points: $t_0,t_1,\ldots,t_m$
Uniform distribution of time points: $t_n=n\Delta t$
$H^n$: population of hares at time $t_n$
$L^n$: population of lynx at time $t_n$
We want to model the changes in populations, $\Delta H=H^{n+1}-H^n$ and $\Delta L=L^{n+1}-L^n$ during a general time interval $[t_n,t_{n+1}]$ of length $\Delta t=t_{n+1}-t_n$
The population of hares evolves due to births and deaths exactly as a bacteria population, growing in proportion to its own size:

$$
\Delta H = a\Delta t\, H^n.
$$
However, hares have an additional loss in the population because they are eaten by lynx. All the hares and lynx can form $H\cdot L$ pairs in total. When such pairs meet during a time interval $\Delta t$, there is some small probability that the lynx eats the hare, so a fraction $b\Delta t\, HL$ of the hares is lost to predation. This loss of hares must be accounted for and subtracted in the equation for hares:

$$
H^{n+1} = H^n + a\Delta t\, H^n - b\Delta t\, H^n L^n.
$$
We assume that the primary growth of the lynx population depends on sufficient food for raising lynx kittens, which implies an adequate source of nutrients from predation on hares. Thus, the growth of the lynx population does not only depend on how many lynx there are, but on how many hares they can eat. In a time interval $\Delta t$, $HL$ pairs of hares and lynx can meet, and in a fraction $b\Delta t\, HL$ of these encounters the lynx eats the hare. Not all of this contributes to the growth of the lynx population; only a fraction of $b\Delta t\, HL$, which we write as $d\Delta t\, HL$, does. In addition, lynx die just as in the population dynamics of one isolated animal population, leading to a loss $-c\Delta t\, L$.
The accounting of lynx then looks like

$$
L^{n+1} = L^n + d\Delta t\, H^n L^n - c\Delta t\, L^n.
$$
Note:
These equations are ready to be implemented!
But to start, we need $H^0$ and $L^0$ (which we can get from the data)
We also need values for $a$, $b$, $d$, $c$
As always, models tend to be general - as here, applicable to "all" predator-prey systems
The critical issue is whether the interaction between hares and lynx is sufficiently well modeled by $\hbox{const}HL$
The parameters $a$, $b$, $d$, and $c$ must be estimated from data
Measure time in years
$t_0=1900$, $t_m=1920$
In [5]:
import numpy as np
import matplotlib.pyplot as plt
def solver(m, H0, L0, dt, a, b, c, d, t0):
"""Solve the difference equations for H and L over m years
with time step dt (measured in years."""
num_intervals = int(m/float(dt))
t = np.linspace(t0, t0 + m, num_intervals+1)
H = np.zeros(t.size)
L = np.zeros(t.size)
print('Init:', H0, L0, dt)
H[0] = H0
L[0] = L0
for n in range(0, len(t)-1):
H[n+1] = H[n] + a*dt*H[n] - b*dt*H[n]*L[n]
L[n+1] = L[n] + d*dt*H[n]*L[n] - c*dt*L[n]
return H, L, t
# Load in data file
data = np.loadtxt('src/Hudson_Bay.csv', delimiter=',', skiprows=1)
# Make arrays containing x-axis and hares and lynx populations
t_e = data[:,0]
H_e = data[:,1]
L_e = data[:,2]
# Simulate using the model
H, L, t = solver(m=20, H0=34.91, L0=3.857, dt=0.1,
a=0.4807, b=0.02482, c=0.9272, d=0.02756,
t0=1900)
# Visualize simulations and data
plt.plot(t_e, H_e, 'b-+', t_e, L_e, 'r-o', t, H, 'm--', t, L, 'k--')
plt.xlabel('Year')
plt.ylabel('Numbers of hares and lynx')
plt.axis([1900, 1920, 0, 140])
plt.title(r'Population of hares and lynx 1900-1920 (x1000)')
plt.legend(('H_e', 'L_e', 'H', 'L'), loc='upper left')
plt.savefig('Hudson_Bay_sim.pdf')
plt.savefig('Hudson_Bay_sim.png')
plt.show()
If we perform a least-squares fit, we can find optimal values for the parameters $a$, $b$, $c$ and $d$. The optimal parameters are $a=0.4807$, $b=0.02482$, $c=0.9272$ and $d=0.02756$. These parameters come with slightly modified initial conditions, namely $H(0) = 34.91$ and $L(0)=3.857$. With these parameters we are now ready to solve the equations and plot the results together with the experimental values.
In [6]:
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
data = np.loadtxt('src/Hudson_Bay.csv', delimiter=',', skiprows=1)
x = data[:,0]
y = data[:,1]
line = np.linspace(1900,1920,1000,endpoint=False).reshape(-1,1)
reg = DecisionTreeRegressor(min_samples_split=3).fit(x.reshape(-1,1),y.reshape(-1,1))
plt.plot(line, reg.predict(line), label="decision tree")
regline = LinearRegression().fit(x.reshape(-1,1),y.reshape(-1,1))
plt.plot(line, regline.predict(line), label= "Linear Regression")
plt.plot(x, y, label="Data")
plt.legend()
plt.show()
HudsonBay = read.csv("src/Hudson_Bay.csv",header=T)
fix(HudsonBay)
dim(HudsonBay)
names(HudsonBay)
plot(HudsonBay$Year, HudsonBay$Hares..x1000.)
attach(HudsonBay)
plot(Year, Hares..x1000.)
plot(Year, Hares..x1000., col="red", varwidth=T, xlab="Years", ylab="Hares x 1000")
summary(HudsonBay)
summary(Hares..x1000.)
library(MASS)
library(ISLR)
scatter.smooth(x=Year, y = Hares..x1000.)
linearMod = lm(Hares..x1000. ~ Year)
print(linearMod)
summary(linearMod)
plot(linearMod)
confint(linearMod)
predict(linearMod,data.frame(Year=c(1910,1914,1920)),interval="confidence")
set.seed(1485)
len = 24
x = runif(len)
y = x^3+rnorm(len, 0,0.06)
ds = data.frame(x = x, y = y)
str(ds)
plot( y ~ x, main ="Known cubic with noise")
s = seq(0,1,length =100)
lines(s, s^3, lty =2, col ="green")
m = nls(y ~ I(x^power), data = ds, start = list(power=1), trace = T)
class(m)
summary(m)
power = round(summary(m)$coefficients[1], 3)
power.se = round(summary(m)$coefficients[2], 3)
plot(y ~ x, main = "Fitted power model", sub = "Blue: fit; green: known")
s = seq(0, 1, length = 100)
lines(s, s^3, lty = 2, col = "green")
lines(s, predict(m, list(x = s)), lty = 1, col = "blue")
text(0, 0.5, paste("y =x^ (", power, " +/- ", power.se, ")", sep = ""), pos = 4)
The Numerical Recipes codes have been rewritten in Fortran 90/95 and C/C++ by us. The original source codes are taken from the widely used software package LAPACK, which follows two other popular packages developed in the 1970s, namely EISPACK and LINPACK.
LINPACK: package for linear equations and least square problems.
LAPACK: package for solving symmetric, unsymmetric and generalized eigenvalue problems. From LAPACK's website http://www.netlib.org it is possible to download for free all source codes from this library. Both C/C++ and Fortran versions are available.
BLAS (I, II and III): Basic Linear Algebra Subprograms, routines that provide standard building blocks for performing basic vector and matrix operations. BLAS level I covers vector operations, level II vector-matrix operations and level III matrix-matrix operations. Highly parallelized and efficient codes, all available for download from http://www.netlib.org.
Matrix properties reminder
Relations | Name | matrix elements |
---|---|---|
$A = A^{T}$ | symmetric | $a_{ij} = a_{ji}$ |
$A = \left (A^{T} \right )^{-1}$ | real orthogonal | $\sum_k a_{ik} a_{jk} = \sum_k a_{ki} a_{kj} = \delta_{ij}$ |
$A = A^{ * }$ | real matrix | $a_{ij} = a_{ij}^{ * }$ |
$A = A^{\dagger}$ | hermitian | $a_{ij} = a_{ji}^{ * }$ |
$A = \left (A^{\dagger} \right )^{-1}$ | unitary | $\sum_k a_{ik} a_{jk}^{ * } = \sum_k a_{ki}^{ * } a_{kj} = \delta_{ij}$ |
Diagonal if $a_{ij}=0$ for $i\ne j$
Upper triangular if $a_{ij}=0$ for $i > j$
Lower triangular if $a_{ij}=0$ for $i < j$
Upper Hessenberg if $a_{ij}=0$ for $i > j+1$
Lower Hessenberg if $a_{ij}=0$ for $i < j+1$
Tridiagonal if $a_{ij}=0$ for $|i -j| > 1$
Lower banded with bandwidth $p$: $a_{ij}=0$ for $i > j+p$
Upper banded with bandwidth $p$: $a_{ij}=0$ for $i < j+p$
Banded, block upper triangular, block lower triangular....
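As a small illustration of these definitions, the sketch below builds a tridiagonal matrix in NumPy and checks the band condition $a_{ij}=0$ for $|i-j|>1$; the size and values are arbitrary choices of ours:

```python
import numpy as np

n = 5
# a tridiagonal matrix: nonzero entries only on the three central diagonals
T = (np.diag(2.0 * np.ones(n))
     + np.diag(-1.0 * np.ones(n - 1), k=1)
     + np.diag(-1.0 * np.ones(n - 1), k=-1))

# verify a_ij = 0 for |i - j| > 1
i, j = np.indices((n, n))
print(np.all(T[np.abs(i - j) > 1] == 0.0))
```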
Some equivalent statements. For an $N\times N$ matrix $\mathbf{A}$ the following properties are all equivalent:
If the inverse of $\mathbf{A}$ exists, $\mathbf{A}$ is nonsingular.
The equation $\mathbf{Ax}=0$ implies $\mathbf{x}=0$.
The rows of $\mathbf{A}$ form a basis of $R^N$.
The columns of $\mathbf{A}$ form a basis of $R^N$.
$\mathbf{A}$ is a product of elementary matrices.
$0$ is not an eigenvalue of $\mathbf{A}$.
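These equivalences are easy to check numerically; a minimal sketch (the toy matrix is our own choice) verifies several of them with NumPy:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# det(A) != 0, so A is nonsingular and the inverse exists
print(np.isclose(np.linalg.det(A), 5.0))
print(np.allclose(A @ np.linalg.inv(A), np.eye(2)))

# 0 is not an eigenvalue of A
print(np.all(np.abs(np.linalg.eigvals(A)) > 1e-12))

# Ax = 0 implies x = 0: the homogeneous system has only the trivial solution
print(np.allclose(np.linalg.solve(A, np.zeros(2)), np.zeros(2)))
```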
Static. We have an $N\times N$ matrix A with $N=100$. In C/C++ this would be defined as

int N = 100;
double A[100][100];
// initialize all elements to zero
for(int i=0 ; i < N ; i++) {
    for(int j=0 ; j < N ; j++) {
        A[i][j] = 0.0;
    }
}
In C/C++ the matrix addition $a = b + c$ would be coded like

for(int i=0 ; i < N ; i++) {
    for(int j=0 ; j < N ; j++) {
        a[i][j] = b[i][j] + c[i][j];
    }
}
In C/C++ the matrix multiplication $a = bc$ would be coded like

for(int i=0 ; i < N ; i++) {
    for(int j=0 ; j < N ; j++) {
        for(int k=0 ; k < N ; k++) {
            a[i][j] += b[i][k]*c[k][j];
        }
    }
}
At least three possibilities in this course
Do it yourself
Use the functions provided in the library package lib.cpp
Use Armadillo http://arma.sourceforge.net (a C++ linear algebra library, discussed both here and at the lab sessions).
Do it yourself
int N;
double ** A;
A = new double*[N];
for (int i = 0; i < N; i++)
    A[i] = new double[N];
Always free space when you don't need an array anymore.
for (int i = 0; i < N; i++)
    delete[] A[i];
delete[] A;
Armadillo is a C++ linear algebra library (matrix maths) aiming towards a good balance between speed and ease of use. The syntax is deliberately similar to Matlab.
Integer, floating point and complex numbers are supported, as well as a subset of trigonometric and statistics functions. Various matrix decompositions are provided through optional integration with LAPACK, or one of its high performance drop-in replacements (such as the multi-threaded MKL or ACML libraries).
A delayed evaluation approach is employed (at compile-time) to combine several operations into one and reduce (or eliminate) the need for temporaries. This is accomplished through recursive templates and template meta-programming.
Useful for conversion of research code into production environments, or if C++ has been decided as the language of choice, due to speed and/or integration capabilities.
The library is open-source software, and is distributed under a license that is useful in both open-source and commercial/proprietary contexts.
#include <iostream>
#include <armadillo>
using namespace std;
using namespace arma;
int main(int argc, char** argv)
{
mat A = randu<mat>(5,5);
mat B = randu<mat>(5,5);
cout << A*B << endl;
return 0;
}
For people using Ubuntu, Debian, Linux Mint, simply go to the synaptic package manager and install armadillo from there. You may have to install Lapack as well. For Mac and Windows users, follow the instructions from the webpage http://arma.sourceforge.net. To compile, use for example (linux/ubuntu)
c++ -O2 -o program.x program.cpp -larmadillo -llapack -lblas
where the -l option indicates the library you wish to link to.
For OS X users you may have to declare the paths to the include files and the libraries as
c++ -O2 -o program.x program.cpp -L/usr/local/lib -I/usr/local/include -larmadillo -llapack -lblas
#include <iostream>
#include "armadillo"
using namespace arma;
using namespace std;
int main(int argc, char** argv)
{
// directly specify the matrix size (elements are uninitialised)
mat A(2,3);
// .n_rows = number of rows (read only)
// .n_cols = number of columns (read only)
cout << "A.n_rows = " << A.n_rows << endl;
cout << "A.n_cols = " << A.n_cols << endl;
// directly access an element (indexing starts at 0)
A(1,2) = 456.0;
A.print("A:");
// scalars are treated as a 1x1 matrix,
// hence the code below will set A to have a size of 1x1
A = 5.0;
A.print("A:");
// if you want a matrix with all elements set to a particular value
// the .fill() member function can be used
A.set_size(3,3);
A.fill(5.0); A.print("A:");
mat B;
// endr indicates "end of row"
B << 0.555950 << 0.274690 << 0.540605 << 0.798938 << endr
<< 0.108929 << 0.830123 << 0.891726 << 0.895283 << endr
<< 0.948014 << 0.973234 << 0.216504 << 0.883152 << endr
<< 0.023787 << 0.675382 << 0.231751 << 0.450332 << endr;
// print to the cout stream
// with an optional string before the contents of the matrix
B.print("B:");
// the << operator can also be used to print the matrix
// to an arbitrary stream (cout in this case)
cout << "B:" << endl << B << endl;
// save to disk
B.save("B.txt", raw_ascii);
// load from disk
mat C;
C.load("B.txt");
C += 2.0 * B;
C.print("C:");
// submatrix types:
//
// .submat(first_row, first_column, last_row, last_column)
// .row(row_number)
// .col(column_number)
// .cols(first_column, last_column)
// .rows(first_row, last_row)
cout << "C.submat(0,0,3,1) =" << endl;
cout << C.submat(0,0,3,1) << endl;
// generate the identity matrix
mat D = eye<mat>(4,4);
D.submat(0,0,3,1) = C.cols(1,2);
D.print("D:");
// transpose
cout << "trans(B) =" << endl;
cout << trans(B) << endl;
// maximum from each column (traverse along rows)
cout << "max(B) =" << endl;
cout << max(B) << endl;
// maximum from each row (traverse along columns)
cout << "max(B,1) =" << endl;
cout << max(B,1) << endl;
// maximum value in B
cout << "max(max(B)) = " << max(max(B)) << endl;
// sum of each column (traverse along rows)
cout << "sum(B) =" << endl;
cout << sum(B) << endl;
// sum of each row (traverse along columns)
cout << "sum(B,1) =" << endl;
cout << sum(B,1) << endl;
// sum of all elements
cout << "sum(sum(B)) = " << sum(sum(B)) << endl;
cout << "accu(B) = " << accu(B) << endl;
// trace = sum along diagonal
cout << "trace(B) = " << trace(B) << endl;
// random matrix -- values are uniformly distributed in the [0,1] interval
mat E = randu<mat>(4,4);
E.print("E:");
// row vectors are treated like a matrix with one row
rowvec r;
r << 0.59499 << 0.88807 << 0.88532 << 0.19968;
r.print("r:");
// column vectors are treated like a matrix with one column
colvec q;
q << 0.81114 << 0.06256 << 0.95989 << 0.73628;
q.print("q:");
// dot or inner product
cout << "as_scalar(r*q) = " << as_scalar(r*q) << endl;
// outer product
cout << "q*r =" << endl;
cout << q*r << endl;
// sum of three matrices (no temporary matrices are created)
mat F = B + C + D;
F.print("F:");
return 0;
}
#include <iostream>
#include "armadillo"
using namespace arma;
using namespace std;
int main(int argc, char** argv)
{
cout << "Armadillo version: " << arma_version::as_string() << endl;
mat A;
A << 0.165300 << 0.454037 << 0.995795 << 0.124098 << 0.047084 << endr
<< 0.688782 << 0.036549 << 0.552848 << 0.937664 << 0.866401 << endr
<< 0.348740 << 0.479388 << 0.506228 << 0.145673 << 0.491547 << endr
<< 0.148678 << 0.682258 << 0.571154 << 0.874724 << 0.444632 << endr
<< 0.245726 << 0.595218 << 0.409327 << 0.367827 << 0.385736 << endr;
A.print("A =");
// determinant
cout << "det(A) = " << det(A) << endl;
// inverse
cout << "inv(A) = " << endl << inv(A) << endl;
double k = 1.23;
mat B = randu<mat>(5,5);
mat C = randu<mat>(5,5);
rowvec r = randu<rowvec>(5);
colvec q = randu<colvec>(5);
// examples of some expressions
// for which optimised implementations exist
// optimised implementation of a trinary expression
// that results in a scalar
cout << "as_scalar( r*inv(diagmat(B))*q ) = ";
cout << as_scalar( r*inv(diagmat(B))*q ) << endl;
// example of an expression which is optimised
// as a call to the dgemm() function in BLAS:
cout << "k*trans(B)*C = " << endl << k*trans(B)*C;
return 0;
}
We assume also that the matrix $\mathbf{A}$ is non-singular and that the matrix elements along the diagonal satisfy $a_{ii} \ne 0$. A simple $4\times 4$ example is

$$
\begin{aligned}
a_{11}x_1 + a_{12}x_2 + a_{13}x_3 + a_{14}x_4 &= w_1,\\
a_{21}x_1 + a_{22}x_2 + a_{23}x_3 + a_{24}x_4 &= w_2,\\
a_{31}x_1 + a_{32}x_2 + a_{33}x_3 + a_{34}x_4 &= w_3,\\
a_{41}x_1 + a_{42}x_2 + a_{43}x_3 + a_{44}x_4 &= w_4.
\end{aligned}
$$
The basic idea of Gaussian elimination is to use the first equation to eliminate the first unknown $x_1$ from the remaining $n-1$ equations. Then we use the new second equation to eliminate the second unknown $x_2$ from the remaining $n-2$ equations. With $n-1$ such eliminations we obtain a so-called upper triangular set of equations of the form

$$
\begin{aligned}
a_{11}x_1 + a_{12}x_2 + \dots + a_{1n}x_n &= w_1,\\
a_{22}^{(2)}x_2 + \dots + a_{2n}^{(2)}x_n &= w_2^{(2)},\\
&\;\;\vdots\\
a_{nn}^{(n)}x_n &= w_n^{(n)}.
\end{aligned}
$$
To arrive at such an upper triangular system of equations, we start by eliminating the unknown $x_1$ for $j=2,n$. We achieve this by multiplying the first equation by $a_{j1}/a_{11}$ and then subtract the result from the $j$th equation. We assume obviously that $a_{11}\ne 0$ and that $\mathbf{A}$ is not singular.
Our actual $4\times 4$ example reads after the first operation

$$
\begin{aligned}
a_{11}^{(1)}x_1 + a_{12}^{(1)}x_2 + a_{13}^{(1)}x_3 + a_{14}^{(1)}x_4 &= w_1^{(1)},\\
a_{22}^{(2)}x_2 + a_{23}^{(2)}x_3 + a_{24}^{(2)}x_4 &= w_2^{(2)},\\
a_{32}^{(2)}x_2 + a_{33}^{(2)}x_3 + a_{34}^{(2)}x_4 &= w_3^{(2)},\\
a_{42}^{(2)}x_2 + a_{43}^{(2)}x_3 + a_{44}^{(2)}x_4 &= w_4^{(2)},
\end{aligned}
$$
where each $a_{1k}^{(1)}$ is equal to the original $a_{1k}$ element. The other coefficients are

$$
a_{jk}^{(2)} = a_{jk}^{(1)} - \frac{a_{j1}^{(1)}a_{1k}^{(1)}}{a_{11}^{(1)}}, \qquad j,k = 2,\dots,n,
$$

with a new right-hand side given by

$$
w_j^{(2)} = w_j^{(1)} - \frac{a_{j1}^{(1)}w_1^{(1)}}{a_{11}^{(1)}}, \qquad j = 2,\dots,n.
$$
We have also set $w_1^{(1)}=w_1$, the original vector element. We see that the system of unknowns $x_1,\dots,x_n$ is transformed into an $(n-1)\times (n-1)$ problem.
This step is called forward elimination. Proceeding with these substitutions, we obtain the general expressions for the new coefficients

$$
a_{jk}^{(m+1)} = a_{jk}^{(m)} - \frac{a_{jm}^{(m)}a_{mk}^{(m)}}{a_{mm}^{(m)}}, \qquad j,k = m+1,\dots,n,
$$

with $m=1,\dots,n-1$ and a right-hand side given by

$$
w_j^{(m+1)} = w_j^{(m)} - \frac{a_{jm}^{(m)}w_m^{(m)}}{a_{mm}^{(m)}}, \qquad j = m+1,\dots,n.
$$
This set of $n-1$ eliminations leads us to an upper triangular system of equations which is solved by back substitution. If the arithmetic is exact and the matrix $\mathbf{A}$ is not singular, then the computed answer will be exact.
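The elimination and back-substitution steps described above can be sketched in a few lines of Python; this is our own minimal implementation (no pivoting, so it assumes nonzero pivots throughout), not code from the course library:

```python
import numpy as np

def gauss_solve(A, w):
    """Gaussian elimination with back substitution (no pivoting)."""
    A = A.astype(float).copy()
    w = w.astype(float).copy()
    n = len(w)
    # forward elimination: zero out the entries below each pivot
    for m in range(n - 1):
        for j in range(m + 1, n):
            factor = A[j, m] / A[m, m]
            A[j, m:] -= factor * A[m, m:]
            w[j] -= factor * w[m]
    # back substitution on the resulting upper triangular system
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (w[i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
    return x

A = np.array([[2.0, 1.0, 1.0],
              [1.0, 3.0, 2.0],
              [1.0, 0.0, 2.0]])
w = np.array([4.0, 5.0, 6.0])
x = gauss_solve(A, w)
print(np.allclose(A @ x, w))
```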
Even though the matrix elements along the diagonal are not zero, numerically small numbers may appear, and subsequent divisions may lead to large numbers which, when added to a small number, may yield loss of precision. Suppose, for example, that our first division in $(a_{22}-a_{21}a_{12}/a_{11})$ results in $-10^{-7}$, and that $a_{22}$ is one. We are then adding $10^7+1$. With single precision, roughly seven significant digits, this results in $10^7$.
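The loss of precision can be demonstrated directly with NumPy's single-precision type. We use $10^8$ rather than $10^7$ in this sketch of ours, since at $10^8$ the spacing between adjacent float32 numbers (8) safely exceeds 1:

```python
import numpy as np

big = np.float32(1.0e8)
one = np.float32(1.0)
# in single precision the added 1 falls below the spacing and vanishes
print(big + one == big)
# in double precision the sum is represented exactly and differs from 1e8
print(np.float64(1.0e8) + 1.0 == np.float64(1.0e8))
```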
Gaussian elimination, $O(2/3n^3)$ flops, general matrix
LU decomposition into upper and lower triangular matrices, $O(2/3n^3)$ flops, general matrix. From it we easily get the inverse and the determinant, and we can solve linear equations with back substitution only, $O(n^2)$ flops
Cholesky decomposition. Real symmetric or hermitian positive definite matrix, $O(1/3n^3)$ flops.
Tridiagonal linear systems, important for differential equations. Normally positive definite and non-singular. $O(8n)$ flops for symmetric. Special case of banded matrices.
Singular value decomposition
The QR method, which will be discussed in chapter 7 in connection with eigenvalue systems, $O(4/3n^3)$ flops.
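For the positive-definite case, the Cholesky-based solver is available directly in SciPy; a minimal sketch with a small SPD matrix of our own choosing:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# a symmetric positive-definite matrix
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

c, low = cho_factor(A)      # Cholesky factorization, O(n^3/3) flops
x = cho_solve((c, low), b)  # two triangular solves, O(n^2) flops
print(np.allclose(A @ x, b))
```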
The LU decomposition method means that we can rewrite this matrix as the product of two matrices $\mathbf{L}$ and $\mathbf{U}$, where

$$
\mathbf{A} = \mathbf{L}\mathbf{U} =
\begin{pmatrix}
1 & 0 & 0 & 0\\
l_{21} & 1 & 0 & 0\\
l_{31} & l_{32} & 1 & 0\\
l_{41} & l_{42} & l_{43} & 1
\end{pmatrix}
\begin{pmatrix}
u_{11} & u_{12} & u_{13} & u_{14}\\
0 & u_{22} & u_{23} & u_{24}\\
0 & 0 & u_{33} & u_{34}\\
0 & 0 & 0 & u_{44}
\end{pmatrix}.
$$
The above set of equations is conveniently solved by using LU decomposition as an intermediate step.
The matrix $\mathbf{A}\in \mathbb{R}^{n\times n}$ has an LU factorization if the determinant is different from zero. If the LU factorization exists and $\mathbf{A}$ is non-singular, then the LU factorization is unique and the determinant is given by

$$
\det(\mathbf{A}) = \det(\mathbf{L})\det(\mathbf{U}) = \prod_{i=1}^{n} u_{ii}.
$$
There are at least three main advantages with LU decomposition compared with standard Gaussian elimination:
It is straightforward to compute the determinant of a matrix
If we have to solve sets of linear equations with the same matrix but with different vectors $\mathbf{y}$, the decomposition, of order $n^3$ FLOPS, is performed only once; each subsequent solve requires only of the order $n^2$ FLOPS
Computing the inverse is precisely such an operation
With the LU decomposition it is rather simple to solve a system of linear equations

$$
\begin{aligned}
a_{11}x_1 + a_{12}x_2 + a_{13}x_3 + a_{14}x_4 &= w_1,\\
a_{21}x_1 + a_{22}x_2 + a_{23}x_3 + a_{24}x_4 &= w_2,\\
a_{31}x_1 + a_{32}x_2 + a_{33}x_3 + a_{34}x_4 &= w_3,\\
a_{41}x_1 + a_{42}x_2 + a_{43}x_3 + a_{44}x_4 &= w_4.
\end{aligned}
$$

This can be written in matrix form as

$$
\mathbf{A}\mathbf{x} = \mathbf{w},
$$

where $\mathbf{A}$ and $\mathbf{w}$ are known and we have to solve for $\mathbf{x}$. Using the LU decomposition we write

$$
\mathbf{A}\mathbf{x} \equiv \mathbf{L}\mathbf{U}\mathbf{x} = \mathbf{w}.
$$

To show that this is correct we use the LU decomposition to rewrite our system of linear equations as

$$
\mathbf{L}\mathbf{U}\mathbf{x} = \mathbf{w},
$$

and since the determinant of $\mathbf{L}$ is equal to 1 (by construction, since the diagonal elements of $\mathbf{L}$ equal 1) we can use the inverse of $\mathbf{L}$ to obtain

$$
\mathbf{U}\mathbf{x} = \mathbf{L}^{-1}\mathbf{w} \equiv \mathbf{y},
$$

which yields the intermediate step

$$
\mathbf{L}\mathbf{y} = \mathbf{w}
$$

and

$$
\mathbf{U}\mathbf{x} = \mathbf{y}.
$$
This example shows the basis for the algorithm needed to solve the set of $n$ linear equations.
The algorithm goes as follows
Set up the matrix $\bf A$ and the vector $\bf w$ with their correct dimensions. This determines the dimensionality of the unknown vector $\bf x$.
Then LU decompose the matrix $\bf A$ through a call to the function ludcmp(double **a, int n, int *indx, double *d). This function returns the LU decomposed matrix $\bf A$, its determinant and the vector indx which keeps track of the number of interchanges of rows. If the determinant is zero, the matrix is singular and the problem is ill-conditioned.
Thereafter you call the function lubksb(double **a, int n, int *indx, double *w), which uses the LU decomposed matrix $\bf A$ and the vector $\bf w$ and returns $\bf x$ in the same place as $\bf w$. Upon exit the original content of $\bf w$ is destroyed. If you wish to keep this information, you should make a backup copy of it in your calling function.
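The same decompose-once, substitute-after workflow is available in SciPy; the sketch below mirrors the ludcmp/lubksb pair with lu_factor and lu_solve (the matrix and right-hand side are arbitrary examples of ours):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[2.0, 1.0, 1.0],
              [1.0, 3.0, 2.0],
              [1.0, 0.0, 2.0]])
w = np.array([4.0, 5.0, 6.0])

lu, piv = lu_factor(A)      # LU decompose once; piv records the row interchanges
x = lu_solve((lu, piv), w)  # forward and back substitution
print(np.allclose(A @ x, w))
```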
If the inverse exists then

$$
\mathbf{A}\mathbf{A}^{-1} = \mathbf{I},
$$

the identity matrix. With an LU decomposed matrix we can rewrite the last equation as

$$
\mathbf{L}\mathbf{U}\mathbf{A}^{-1} = \mathbf{I}.
$$

If we let $\mathbf{x}_j$ be the $j$th column of $\mathbf{A}^{-1}$ and $\mathbf{e}_j$ the $j$th unit vector, then we have a linear set of equations

$$
\mathbf{L}\mathbf{U}\mathbf{x}_j = \mathbf{e}_j, \qquad j=1,\dots,n,
$$

and we continue till we have solved all $n$ sets of linear equations.
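Column by column, this procedure can be sketched with SciPy's LU routines (toy matrix of ours):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
lu, piv = lu_factor(A)  # decompose once

# solve LU x_j = e_j for each unit vector e_j;
# the solutions x_j are the columns of the inverse
Ainv = np.column_stack([lu_solve((lu, piv), e) for e in np.eye(2)])
print(np.allclose(A @ Ainv, np.eye(2)))
```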
#include <iostream>
#include "armadillo"
using namespace arma;
using namespace std;
int main()
{
mat A = randu<mat>(5,5);
vec b = randu<vec>(5);
A.print("A =");
b.print("b=");
// solve Ax = b
vec x = solve(A,b);
// print x
x.print("x=");
// find LU decomp of A, if needed, P is the permutation matrix
mat L, U;
lu(L,U,A);
// print l
L.print(" L= ");
// print U
U.print(" U= ");
//Check that A = LU
(A-L*U).print("Test of LU decomposition");
return 0;
}
Direct solvers such as Gauss elimination and LU decomposition discussed in connection with project 1.
Iterative solvers such as basic iterative solvers: Jacobi, Gauss-Seidel and successive over-relaxation. These methods are easy to parallelize, as we will see later, and are much used in the solution of partial differential equations.
Other iterative methods such as Krylov subspace methods with Generalized minimum residual (GMRES) and Conjugate gradient etc will not be discussed.
It is a simple method for solving
where $\mathbf{A}$ is a matrix and $\mathbf{x}$ and $\mathbf{b}$ are vectors. The vector $\mathbf{x}$ is the unknown.
It is an iterative scheme where we start with a guess for the unknown, and after $k+1$ iterations we have
with $\mathbf{A}=\mathbf{D}+\mathbf{U}+\mathbf{L}$ and $\mathbf{D}$ being a diagonal matrix, $\mathbf{U}$ an upper triangular matrix and $\mathbf{L}$ a lower triangular matrix.
If the matrix $\mathbf{A}$ is strictly diagonally dominant, one can show that this method will always converge to the exact solution.
We can demonstrate Jacobi's method by this $4\times 4$ matrix problem. We assume a guess for the vector elements $x_i^{(0)}$, a guess which represents our first iteration. The new values are obtained by substitution
which after $k+1$ iterations reads
or in an even more compact form as
can be rewritten as
to the following form
The procedure is generally continued until the changes made by an iteration are below some tolerance.
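The Jacobi sweep above can be sketched in a few lines of Python. This is a minimal illustration, not a library routine; the strictly diagonally dominant $4\times 4$ matrix and right-hand side are invented for the example:

```python
import numpy as np

def jacobi(A, b, x0, tol=1e-10, maxiter=1000):
    """Jacobi iteration: x^(k+1) = D^{-1} (b - (L+U) x^(k)),
    with D the diagonal of A and L+U its off-diagonal part."""
    D = np.diag(A)
    R = A - np.diagflat(D)          # the off-diagonal part L + U
    x = x0.astype(float)
    for _ in range(maxiter):
        x_new = (b - R @ x) / D
        if np.max(np.abs(x_new - x)) < tol:   # stop when changes are small
            return x_new
        x = x_new
    return x

# Strictly diagonally dominant matrix, so the iteration converges
A = np.array([[10.0, -1.0, 2.0, 0.0],
              [-1.0, 11.0, -1.0, 3.0],
              [2.0, -1.0, 10.0, -1.0],
              [0.0, 3.0, -1.0, 8.0]])
b = np.array([6.0, 25.0, -11.0, 15.0])
x = jacobi(A, b, np.zeros(4))       # converges to [1, 2, -1, 1]
```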
The convergence properties of the Jacobi method and the Gauss-Seidel method are dependent on the matrix $\mathbf{A}$. These methods converge when the matrix is symmetric positive-definite, or is strictly or irreducibly diagonally dominant. Both methods sometimes converge even if these conditions are not satisfied.
Given a square system of $n$ linear equations with unknown $\mathbf x$:
where
where
The system of linear equations may be rewritten as:
However, by taking advantage of the triangular form of $(D+\omega L)$, the elements of $x^{(k+1)}$ can be computed sequentially using forward substitution:
The choice of relaxation factor is not necessarily easy, and depends upon the properties of the coefficient matrix. For symmetric, positive-definite matrices it can be proven that $0 < \omega < 2$ will lead to convergence, but we are generally interested in faster convergence rather than just convergence.
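The forward-substitution sweep of SOR can be sketched as follows. This is a minimal illustration with an invented symmetric positive-definite system; $\omega=1.2$ is an arbitrary choice inside the interval $(0,2)$:

```python
import numpy as np

def sor(A, b, omega, x0, tol=1e-10, maxiter=10000):
    """Successive over-relaxation: sweep through the unknowns,
    updating each x_i in place (forward substitution on D + omega*L)."""
    n = len(b)
    x = x0.astype(float)
    for _ in range(maxiter):
        x_old = x.copy()
        for i in range(n):
            # x[:i] holds already-updated values, x_old[i+1:] the old ones
            sigma = A[i, :i] @ x[:i] + A[i, i+1:] @ x_old[i+1:]
            x[i] = (1 - omega) * x_old[i] + omega * (b[i] - sigma) / A[i, i]
        if np.max(np.abs(x - x_old)) < tol:
            break
    return x

# Symmetric positive-definite example, so 0 < omega < 2 guarantees convergence
A = np.array([[4.0, -1.0, 0.0],
              [-1.0, 4.0, -1.0],
              [0.0, -1.0, 4.0]])
b = np.array([2.0, 4.0, 10.0])
x = sor(A, b, omega=1.2, x0=np.zeros(3))   # converges to [1, 2, 3]
```

Setting $\omega=1$ recovers the Gauss-Seidel method.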
Cubic spline interpolation is among the most used methods for interpolating between data points where the arguments are organized as an ascending series. In the library program we supply such a function, based on the so-called cubic spline method described below.
A spline function consists of polynomial pieces defined on subintervals. The different subintervals are connected via various continuity relations.
Assume we have at our disposal $n+1$ points $x_0, x_1, \dots x_n$ arranged so that $x_0 < x_1 < x_2 < \dots < x_{n-1} < x_n$ (such points are called knots). A spline function $s$ of degree $k$ with $n+1$ knots is defined as follows
On every subinterval $[x_{i-1},x_i)$ s is a polynomial of degree $\le k$.
$s$ has $k-1$ continuous derivatives in the whole interval $[x_0,x_n]$.
As an example, consider a spline function of degree $k=1$ defined as follows
In this case the polynomial consists of series of straight lines connected to each other at every endpoint. The number of continuous derivatives is then $k-1=0$, as expected when we deal with straight lines. Such a polynomial is quite easy to construct given $n+1$ points $x_0, x_1, \dots x_n$ and their corresponding function values.
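The $k=1$ spline is exactly piecewise linear interpolation, which can be written down in a few lines. The knots and function values below are invented for the illustration:

```python
import numpy as np

# Knots (ascending) and their function values, illustrative only
xk = np.array([0.0, 1.0, 2.0, 4.0])
yk = np.array([1.0, 3.0, 2.0, 6.0])

def linear_spline(x, xk, yk):
    """Evaluate the degree-1 spline at x: locate the subinterval
    [x_i, x_{i+1}) and draw the straight line between its endpoints."""
    i = np.searchsorted(xk, x, side='right') - 1
    i = min(max(i, 0), len(xk) - 2)
    t = (x - xk[i]) / (xk[i+1] - xk[i])
    return (1 - t) * yk[i] + t * yk[i+1]

print(linear_spline(0.5, xk, yk))   # 2.0, midway between y=1 and y=3
```

The result is continuous but has $k-1=0$ continuous derivatives: the slope jumps at every knot.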
The most commonly used spline function is the one with $k=3$, the so-called cubic spline function. Assume that we have in addition to the $n+1$ knots a series of function values $y_0=f(x_0), y_1=f(x_1), \dots y_n=f(x_n)$. By definition, the polynomials $s_{i-1}$ and $s_i$ are supposed to interpolate the same point $i$, that is
with $1 \le i \le n-1$. In total we have $n$ polynomials of the type
and
to be fulfilled. If we also assume that $s'$ and $s''$ are continuous, then
yields $n-1$ conditions. Similarly,
and
and setting up a straight line between $f_i$ and $f_{i+1}$ we have
and integrating twice one obtains
and set $x=x_i$. Defining $h_i=x_{i+1}-x_i$ we obtain finally the following expression
spline(double x[], double y[], int n, double yp1, double yp2, double y2[])
This function takes as input $x[0,..,n - 1]$ and $y[0,..,n - 1]$ containing a tabulation $y_i = f(x_i)$ with $x_0 < x_1 < .. < x_{n - 1}$ together with the first derivatives of $f(x)$ at $x_0$ and $x_{n-1}$, respectively. Then the function returns $y2[0,..,n-1]$ which contains the second derivatives of $f(x_i)$ at each point $x_i$. $n$ is the number of points. This function provides the cubic spline interpolation for all subintervals and is called only once.
Thereafter, if you wish to make various interpolations, you need to call the function
splint(double xa[], double ya[], double y2a[], int n, double x, double *y)
which takes as input the tabulated values $x[0,..,n - 1]$ and $y[0,..,n - 1]$ and the output y2a[0,..,n - 1] from spline. It returns the value $y$ corresponding to the point $x$.
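The spline/splint pair can be mimicked in Python. The sketch below uses natural boundary conditions ($s''=0$ at the endpoints) rather than the prescribed first derivatives of the library routine, and solves the tridiagonal system for the second derivatives with a dense solver for brevity; the test function $\sin x$ is an arbitrary choice:

```python
import numpy as np

def spline(x, y):
    """Role of the library's spline(): solve for the second derivatives
    y2 at the knots, here with natural boundary conditions y2[0]=y2[n-1]=0."""
    n = len(x)
    h = np.diff(x)
    a = np.zeros((n, n))
    rhs = np.zeros(n)
    a[0, 0] = a[-1, -1] = 1.0          # natural boundary conditions
    for i in range(1, n - 1):          # continuity of s' and s'' at knots
        a[i, i-1] = h[i-1]
        a[i, i]   = 2.0 * (h[i-1] + h[i])
        a[i, i+1] = h[i]
        rhs[i] = 6.0 * ((y[i+1]-y[i])/h[i] - (y[i]-y[i-1])/h[i-1])
    return np.linalg.solve(a, rhs)

def splint(xk, yk, y2, x):
    """Role of the library's splint(): evaluate the spline at x."""
    i = min(max(np.searchsorted(xk, x) - 1, 0), len(xk) - 2)
    h = xk[i+1] - xk[i]
    A = (xk[i+1] - x) / h
    B = (x - xk[i]) / h
    return (A*yk[i] + B*yk[i+1]
            + ((A**3 - A)*y2[i] + (B**3 - B)*y2[i+1]) * h*h/6.0)

xk = np.linspace(0.0, np.pi, 8)
yk = np.sin(xk)
y2 = spline(xk, yk)                    # called only once
err = abs(splint(xk, yk, y2, 0.5) - np.sin(0.5))   # small interpolation error
```

As with the library routines, spline is called once to set up the second derivatives, after which splint can be called for as many interpolation points as desired.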
The success of the CG method for finding solutions of non-linear problems is based on the theory of conjugate gradients for linear systems of equations. It belongs to the class of iterative methods for solving problems from linear algebra of the type
In the iterative process we end up with a problem like
with the constraint that the matrix $\hat{A}$ is positive definite and symmetric. If we search for a minimum of the quantum mechanical variance, then the matrix $\hat{A}$, which is called the Hessian, is given by the matrix of second derivatives of the function we want to minimize. This quantity is always positive definite. In our case this corresponds normally to the second derivative of the energy.
We seek the minimum of the energy or the variance as function of various variational parameters. In our case we have thus a function $f$ whose minimum we are seeking. In Newton's method we set $\nabla f = 0$ and we can thus compute the next iteration point
Subtracting this equation from that of $\hat{x}_{i+1}$ we have
In the simple harmonic oscillator model, the gradient and the Hessian $\hat{A}$ are
and a second derivative which is always positive (meaning that we find a minimum)
which can be rewritten as
The philosophy of the CG method is to perform searches in various conjugate directions of our vectors $\hat{x}_i$ obeying the above criterion, namely
We assume that $\hat{p}_{i}$ is a sequence of $n$ mutually conjugate directions. Then the $\hat{p}_{i}$ form a basis of $R^n$ and we can expand the solution $ \hat{A}\hat{x} = \hat{b}$ in this basis, namely
Multiplying with $\hat{p}_k^T$ from the left gives
and we can define the coefficients $\alpha_k$ as
If we choose the conjugate vectors $\hat{p}_k$ carefully, then we may not need all of them to obtain a good approximation to the solution $\hat{x}$. We want to regard the conjugate gradient method as an iterative method. This allows us to solve systems where $n$ is so large that a direct method would take too much time.
We denote the initial guess for $\hat{x}$ as $\hat{x}_0$. We can assume without loss of generality that
or consider the system
This suggests taking the first basis vector $\hat{p}_1$ to be the gradient of $f$ at $\hat{x}=\hat{x}_0$, which equals
Note that $\hat{r}_k$ is the negative gradient of $f$ at
$\hat{x}=\hat{x}_k$,
so the gradient descent method would be to move in the direction $\hat{r}_k$.
Here, we insist that the directions $\hat{p}_k$ are conjugate to each other,
so we take the direction closest to the gradient $\hat{r}_k$
under the conjugacy constraint.
This gives the following expression
which equals
or
which gives
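The full conjugate gradient iteration can be sketched compactly in Python. This is a minimal illustration of the algorithm described above, not a library routine; the $2\times 2$ symmetric positive-definite system is invented for the example:

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10):
    """CG for symmetric positive-definite A. The residuals r_k are the
    negative gradients of f(x) = x^T A x / 2 - b^T x, and the search
    directions p_k are kept mutually A-conjugate."""
    x = x0.astype(float)
    r = b - A @ x                      # initial residual (negative gradient)
    p = r.copy()                       # first search direction
    rs_old = r @ r
    for _ in range(len(b)):            # at most n steps in exact arithmetic
        Ap = A @ p
        alpha = rs_old / (p @ Ap)      # step length along p
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p  # new direction, conjugate to the old
        rs_old = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b, np.zeros(2))   # solution is [1/11, 7/11]
```

In exact arithmetic CG terminates in at most $n$ iterations, but its value as an iterative method is that far fewer steps often suffice for a good approximation.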
These values are called the domain. To this domain we have the corresponding probabilities
The numbers in the domain are the outcomes of the physical process of tossing the die. We cannot tell beforehand whether the outcome is 3 or 5 or any other number in this domain. This defines the randomness of the outcome, or the unexpectedness or any other synonymous word which encompasses the uncertainty of the final outcome.
The only thing we can tell beforehand
is that say the outcome 2 has a certain probability.
If our favorite hobby is to spend an hour every evening throwing dice and
registering the sequence of outcomes, we will note that the numbers in the above domain
appear in a random order. After 11 throws the results may look like
Random variables are characterized by a domain which contains all possible values that the random value may take. This domain has a corresponding PDF.
There are two main concepts associated with a stochastic variable. The domain is the set $\mathbb D = \{x\}$ of all accessible values the variable can assume, so that $X \in \mathbb D$. An example of a discrete domain is the set of six different numbers that we may get by throwing a die, $x\in\{1,\,2,\,3,\,4,\,5,\,6\}$.
The probability distribution function (PDF) is a function $p(x)$ on the domain which, in the discrete case, gives us the probability or relative frequency with which these values of $X$ occur
In the continuous case, the PDF does not directly depict the actual probability. Instead we define the probability for the stochastic variable to assume any value on an infinitesimal interval around $x$ to be $p(x)dx$. The continuous function $p(x)$ then gives us the density of the probability rather than the probability itself. The probability for a stochastic variable to assume any value on a non-infinitesimal interval $[a,\,b]$ is then just the integral
Qualitatively speaking, a stochastic variable represents the values of numbers chosen as if by chance from some specified PDF so that the selection of a large set of these numbers reproduces this PDF.
Of interest to us is the cumulative probability distribution function (CDF), $P(x)$, which is just the probability for a stochastic variable $X$ to assume any value less than $x$
The relation between a CDF and its corresponding PDF is then
Naturally, it would be nonsensical for any of the values of the domain to occur with a probability greater than $1$ or less than $0$. Also, the PDF must be normalized. That is, all the probabilities must add up to unity. The probability of "anything" to happen is always unity. For both discrete and continuous PDFs, this condition is
with
with mean value $\mu$ and standard deviation $\sigma$. If $\mu=0$ and $\sigma=1$, it is normally called the standard normal distribution
The following simple Python code plots the above distribution for different values of $\mu$ and $\sigma$.
import numpy as np
from math import acos, exp, sqrt
from matplotlib import pyplot as plt
from matplotlib import rc, rcParams
import matplotlib.units as units
import matplotlib.ticker as ticker
rc('text',usetex=True)
rc('font',**{'family':'serif'})
font = {'family' : 'serif',
'color' : 'darkred',
'weight' : 'normal',
'size' : 16,
}
pi = acos(-1.0)
mu0 = 0.0
sigma0 = 1.0
mu1= 1.0
sigma1 = 2.0
mu2 = 2.0
sigma2 = 4.0
x = np.linspace(-20.0, 20.0)
v0 = np.exp(-(x*x-2*x*mu0+mu0*mu0)/(2*sigma0*sigma0))/sqrt(2*pi*sigma0*sigma0)
v1 = np.exp(-(x*x-2*x*mu1+mu1*mu1)/(2*sigma1*sigma1))/sqrt(2*pi*sigma1*sigma1)
v2 = np.exp(-(x*x-2*x*mu2+mu2*mu2)/(2*sigma2*sigma2))/sqrt(2*pi*sigma2*sigma2)
plt.plot(x, v0, 'b-', x, v1, 'r-', x, v2, 'g-')
plt.title(r'{\bf Gaussian distributions}', fontsize=20)
plt.text(-19, 0.3, r'Parameters: $\mu = 0$, $\sigma = 1$', fontdict=font)
plt.text(-19, 0.18, r'Parameters: $\mu = 1$, $\sigma = 2$', fontdict=font)
plt.text(-19, 0.08, r'Parameters: $\mu = 2$, $\sigma = 4$', fontdict=font)
plt.xlabel(r'$x$',fontsize=20)
plt.ylabel(r'$p(x)$',fontsize=20)
# Tweak spacing to prevent clipping of ylabel
plt.subplots_adjust(left=0.15)
plt.savefig('gaussian.pdf', format='pdf')
plt.show()
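The normalization condition and the relation between PDF and CDF discussed above can also be checked numerically. The following is a minimal sketch for the standard normal distribution; the grid limits and spacing are arbitrary choices:

```python
import numpy as np

# Standard normal PDF p(x) = exp(-x^2/2)/sqrt(2*pi) on a fine grid
dx = 1.0e-3
x = np.arange(-10.0, 10.0, dx)
p = np.exp(-x**2 / 2.0) / np.sqrt(2.0 * np.pi)

# Normalization: the integral of p over the whole domain should be unity
norm = np.sum(p) * dx

# Cumulative distribution P(x) as the running integral of the PDF;
# P(1) - P(-1) gives the familiar probability of roughly 0.68
P = np.cumsum(p) * dx
prob_ab = P[x.searchsorted(1.0)] - P[x.searchsorted(-1.0)]
```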
Whenever the PDF is known implicitly, like in this case, we will drop
the index $X$ for clarity.
A particularly useful class of special expectation values are the
moments. The $n$-th moment of the PDF $p$ is defined as
follows
for a continuous distribution and
for a discrete distribution. Qualitatively it represents the centroid or the average value of the PDF and is therefore simply called the expectation value of $p(x)$.
A special version of the moments is the set of central moments, the n-th central moment defined as
The zero-th and first central moments are both trivial, equal $1$ and $0$, respectively. But the second central moment, known as the variance of $p$, is of particular interest. For the stochastic variable $X$, the variance is denoted as $\sigma^2_X$ or $\mathrm{Var}(X)$
The square root of the variance, $\sigma =\sqrt{\langle (x-\langle x\rangle)^2\rangle}$ is called the standard deviation of $p$. It is the RMS (root-mean-square) value of the deviation of the PDF from its mean value, interpreted qualitatively as the "spread" of $p$ around its mean.
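For a discrete PDF these moments are simple sums. As a sketch, the fair die from the earlier example has mean $7/2$ and variance $35/12$:

```python
import numpy as np

# Discrete PDF of a fair die: domain {1,...,6}, each value with probability 1/6
x = np.arange(1, 7)
p = np.ones(6) / 6.0

norm = np.sum(p)                    # zeroth moment: normalization, 1
mean = np.sum(x * p)                # first moment <x> = 3.5
var = np.sum((x - mean)**2 * p)     # second central moment = 35/12
sigma = np.sqrt(var)                # standard deviation, the RMS spread
```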
The following table collects properties of probability distribution functions. In our notation we reserve the label $p(x)$ for the probability of a certain event, while $P(x)$ is the cumulative probability.
Discrete PDF | Continuous PDF | |
---|---|---|
Domain | $\left\{x_1, x_2, x_3, \dots, x_N\right\}$ | $[a,b]$ |
Probability | $p(x_i)$ | $p(x)dx$ |
Cumulative | $P_i=\sum_{l=1}^ip(x_l)$ | $P(x)=\int_a^xp(t)dt$ |
Positivity | $ 0\le p(x_i)\le 1$ | $ p(x) \ge 0$ |
Positivity | $ 0\le P_i\le 1$ | $ 0\le P(x)\le 1$ |
Monotonic | $P_i\ge P_j$ if $x_i\ge x_j$ | $P(x_i)\ge P(x_j)$ if $x_i\ge x_j$ |
Normalization | $P_N=1$ | $P(b)=1$ |
With a PDF we can compute expectation values of selected quantities such as
if we have a discrete PDF or
yielding probabilities different from zero in the interval $[a,b]$.
The exponential distribution
yielding probabilities different from zero in the interval $[0,\infty)$ and with mean value
with variance
with probabilities different from zero in the interval $(-\infty,\infty)$. The integral $\int_{-\infty}^{\infty}\exp{\left(-x^2\right)}dx$ appears in many calculations; its value is $\sqrt{\pi}$, a result we will need when we compute the mean value and the variance. The mean value is
which becomes with a suitable change of variables
and inserting the mean value and performing a variable change we obtain
and performing a final integration by parts we obtain the well-known result $\sigma^2=b^2$. It is useful to introduce the standard normal distribution as well, defined by $\mu=a=0$, viz. a distribution centered around zero and with a variance $\sigma^2=1$, leading to
where $y$ is the probability for a specific event, such as the tossing of a coin or moving left or right in case of a random walker. Note that $x$ is a discrete stochastic variable.
The sequence of binomial trials is characterized by the following definitions
Every experiment is thought to consist of $N$ independent trials.
In every independent trial one registers if a specific situation happens or not, such as the jump to the left or right of a random walker.
The probability for every outcome in a single trial has the same value; for example, the probability for each outcome (heads or tails) when tossing a coin is always $1/2$.
In order to compute the mean and variance we need to recall Newton's binomial formula
which can be used to show that
the PDF is normalized to one. The mean value is
resulting in
which we rewrite as
In this case both the mean value and the variance are easier to calculate,
and the variance is $\sigma^2=\lambda$.
An example of applications of the Poisson distribution could be the counting of the number of $\alpha$-particles emitted from a radioactive source in a given time interval. In the limit of $n\rightarrow \infty$ and for small probabilities $y$, the binomial distribution approaches the Poisson distribution. Setting $\lambda = ny$, with $y$ the probability for an event in the binomial distribution we can show that
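The limiting behaviour can be verified numerically by comparing the two distributions directly. The sketch below is an illustration with arbitrarily chosen $\lambda=2$ and $n=1000$:

```python
from math import comb, exp, factorial

lam = 2.0          # lambda = n*y, fixed while n grows
n = 1000
y = lam / n        # small single-trial probability

def binomial_pmf(k, n, y):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * y**k * (1.0 - y)**(n - k)

def poisson_pmf(k, lam):
    """Poisson probability with mean lambda."""
    return lam**k * exp(-lam) / factorial(k)

# For large n the two PDFs agree closely, term by term
diffs = [abs(binomial_pmf(k, n, y) - poisson_pmf(k, lam)) for k in range(10)]
```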
An important quantity in a statistical analysis is the so-called covariance.
Consider the set $\{X_i\}$ of $n$ stochastic variables (not necessarily uncorrelated) with the multivariate PDF $P(x_1,\dots,x_n)$. The covariance of two of the stochastic variables, $X_i$ and $X_j$, is defined as follows
with
then the diagonal elements are just the familiar variances, $C_{ii} = \mathrm{Cov}(X_i,\,X_i) = \mathrm{Var}(X_i)$. It turns out that all the off-diagonal elements are zero if the stochastic variables are uncorrelated.
This is easy to show, keeping in mind the linearity of the expectation value. Consider the stochastic variables $X_i$ and $X_j$, ($i\neq j$)
If $X_i$ and $X_j$ are independent, we get
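The vanishing covariance of independent variables can be seen in a quick numerical experiment. This sketch uses two independent uniform samples; the seed and sample size are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100000

# Two independent uniform samples on [0,1]: their covariance should vanish,
# while the diagonal entries of the covariance matrix approach Var = 1/12
xi = rng.random(n)
xj = rng.random(n)
C = np.cov(xi, xj)      # 2x2 sample covariance matrix
```

The off-diagonal elements are close to zero (up to statistical noise of order $1/\sqrt{n}$), while the diagonal elements reproduce the familiar variances.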
Now that we have constructed an idealized mathematical framework, let us try to apply it to empirical observations. Examples of relevant physical phenomena may be spontaneous decays of nuclei, or a purely mathematical set of numbers produced by some deterministic mechanism. It is the latter we will deal with, using so-called pseudo-random number generators. In general our observations will contain only a limited set of observables. We remind the reader that a stochastic process is a process that produces sequentially a chain of values
We will call these values our measurements and the entire set as our measured sample. The action of measuring all the elements of a sample we will call a stochastic experiment (since, operationally, they are often associated with results of empirical observation of some physical or mathematical phenomena; precisely an experiment). We assume that these values are distributed according to some PDF $p_X^{\phantom X}(x)$, where $X$ is just the formal symbol for the stochastic variable whose PDF is $p_X^{\phantom X}(x)$. Instead of trying to determine the full distribution $p$ we are often only interested in finding the few lowest moments, like the mean $\mu_X^{\phantom X}$ and the variance $\sigma_X^{\phantom X}$.
In practical situations however, a sample is always of finite size. Let that size be $n$. The expectation value of a sample $\alpha$, the sample mean, is then defined as follows
The sample variance is:
with its square root being the standard deviation of the sample.
You can think of the above observables as a set of quantities which define a given experiment. This experiment is then repeated several times, say $m$ times. The total average is then
where the last sums end at $m$ and $n$. The total variance is
which we rewrite as
These quantities, being known experimental values or the results from our calculations, may differ, in some cases significantly, from the similarly named exact values for the mean value $\mu_X$, the variance $\mathrm{Var}(X)$ and the covariance $\mathrm{Cov}(X,Y)$.
The central limit theorem states that the PDF $\tilde{p}(z)$ of the average of $m$ random values corresponding to a PDF $p(x)$ is a normal distribution whose mean is the mean value of the PDF $p(x)$ and whose variance is the variance of the PDF $p(x)$ divided by $m$, the number of values used to compute $z$.
The central limit theorem leads then to the well-known expression for the standard deviation, given by
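The theorem can be illustrated numerically: averaging $m$ uniform deviates many times produces a spread of averages close to $\sigma/\sqrt{m}$. The seed, $m$ and the number of experiments below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2018)

# Average m values drawn from the uniform PDF; repeat many times
m = 100
experiments = 20000
averages = rng.random((experiments, m)).mean(axis=1)

sigma_uniform = 1.0 / np.sqrt(12.0)     # std of the uniform PDF on [0,1]
predicted = sigma_uniform / np.sqrt(m)  # central limit theorem prediction
measured = averages.std()               # spread of the computed averages
```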
In many cases the above estimate for the standard deviation, in particular if correlations are strong, may be too simplistic. We need therefore a more precise definition of the error and the variance in our results.
Our estimate of the true average $\mu_{X}$ is the sample mean $\langle X_m \rangle$
We can then use Eq. (eq:exptvariance)
and rewrite it as
where the first term is the sample variance of all $mn$ experiments divided by $n$ and the last term is nothing but the covariance which arises when $k\ne l$.
If the observables are uncorrelated, then the covariance is zero and we obtain a total variance which agrees with the central limit theorem. Correlations may often be present in our data set, resulting in a non-zero covariance. The first term is normally called the uncorrelated contribution. Computationally the uncorrelated first term is much easier to treat efficiently than the second. We just accumulate separately the values $x^2$ and $x$ for every measurement $x$ we receive. The correlation term, though, has to be calculated at the end of the experiment since we need all the measurements to calculate the cross terms. Therefore, all measurements have to be stored throughout the experiment.
Let us analyze the problem by splitting up the correlation term into partial sums of the form
The correlation term of the total variance can now be rewritten in terms of $f_d$
The value of $f_d$ reflects the correlation between measurements separated by the distance $d$ in the samples. Notice that for $d=0$, $f$ is just the sample variance, $\sigma^2$. If we divide $f_d$ by $\sigma^2$, we arrive at the so called autocorrelation function
and we see that $\sigma_m$ can be expressed in terms of the uncorrelated sample variance times a correction factor $\tau$ which accounts for the correlation between measurements. We call this correction factor the autocorrelation time
For a correlation free experiment, $\tau$ equals 1.
From the point of view of Eq. (eq:error_estimate_corr_time) we can interpret a sequential correlation as an effective reduction of the number of measurements by a factor $\tau$. The effective number of measurements becomes
To neglect the autocorrelation time $\tau$ will always cause our simple uncorrelated estimate of $\sigma_m^2\approx \sigma^2/n$ to be less than the true sample error. The estimate of the error will be too "good". On the other hand, the calculation of the full autocorrelation time poses an efficiency problem if the set of measurements is very large. The solution to this problem is given by more practically oriented methods like the blocking technique.
Uniform deviates are just random numbers that lie within a specified range (typically 0 to 1), with any one number in the range just as likely as any other. They are, in other words, what you probably think random numbers are. However, we want to distinguish uniform deviates from other sorts of random numbers, for example numbers drawn from a normal (Gaussian) distribution of specified mean and standard deviation. These other sorts of deviates are almost always generated by performing appropriate operations on one or more uniform deviates, as we will see in subsequent sections. So, a reliable source of random uniform deviates, the subject of this section, is an essential building block for any sort of stochastic modeling or Monte Carlo computer work.
A disclaimer is however appropriate. It should be fairly obvious that something as deterministic as a computer cannot generate purely random numbers.
Numbers generated by any of the standard algorithms are in reality pseudo random numbers, hopefully abiding to the following criteria:
they produce a uniform distribution in the interval [0,1].
correlations between random numbers are negligible
the period before the same sequence of random numbers is repeated is as large as possible and finally
the algorithm should be fast.
The most common random number generators are based on so-called Linear congruential relations of the type
which yield a number in the interval [0,1] through
The number $M$ is called the modulus, and the period of the generator is at most $M$; it should be as large as possible. $N_0$ is the starting value, or seed. The function $\mathrm{MOD}$ means the remainder: if we evaluate $(13)\mathrm{MOD}(9)$, the outcome is the remainder of the division $13/9$, namely $4$.
The problem with such generators is that their outputs are periodic; they will start to repeat themselves with a period that is at most $M$. If however the parameters $a$ and $c$ are badly chosen, the period may be even shorter.
Consider the following example
with a seed $N_0=2$. This generator produces the sequence $4,1,3,0,2,4,1,3,0,2,\dots$, i.e., a sequence with period $5$. However, increasing $M$ may not guarantee a larger period, as the following example shows
which still, with $N_0=2$, results in $11,38,11,38,11,38,\dots$, a period of just $2$.
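The two short-period examples can be reproduced in a few lines. The parameters below ($a=6$, $c=7$, $M=5$ and $a=27$, $c=11$, $M=54$) are assumptions chosen to match the quoted sequences, since they recover $4,1,3,0,2,\dots$ and $11,38,11,38,\dots$ from the seed $N_0=2$:

```python
def lcg(a, c, M, seed, steps):
    """Linear congruential generator N_i = (a*N_{i-1} + c) mod M."""
    seq, N = [], seed
    for _ in range(steps):
        N = (a * N + c) % M
        seq.append(N)
    return seq

# Period 5 with M = 5 (a=6, c=7 assumed to match the text's sequence)
print(lcg(6, 7, 5, 2, 10))    # [4, 1, 3, 0, 2, 4, 1, 3, 0, 2]

# Larger modulus M = 54 but badly chosen a and c give period 2
print(lcg(27, 11, 54, 2, 6))  # [11, 38, 11, 38, 11, 38]
```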
Typical periods for the random generators provided in the program library are of the order of $\sim 10^9$ or larger. Other random number generators which have become increasingly popular are so-called shift-register generators. In these generators each successive number depends on many preceding values (rather than the last value as in the linear congruential generator). For example, you could make a shift register generator whose $l$th number is the sum of the $l-i$th and $l-j$th values modulo $M$,
Such a generator again produces a sequence of pseudorandom numbers but this time with a period much larger than $M$. It is also possible to construct more elaborate algorithms by including more than two past terms in the sum of each iteration. One example is the generator of Marsaglia and Zaman which consists of two congruential relations
followed by
where the bitwise action of $\oplus$ means that if $N_{l-i}=N_{l-j}$ the result is $0$ whereas if $N_{l-i}\ne N_{l-j}$ the result is $1$. As an example, consider the case where $N_{l-i}=6$ and $N_{l-j}=11$. The first one has a bit representation (using 4 bits only) which reads $0110$ whereas the second number is $1011$. Employing the $\oplus$ operator yields $1101$, or $2^3+2^2+2^0=13$.
In Fortran90, the bitwise $\oplus$ operation is coded through the intrinsic function $\mathrm{IEOR}(m,n)$ where $m$ and $n$ are the input numbers, while in C/C++ it is given by the operator m ^ n.
We show here how the linear congruential algorithm can be implemented, namely
However, since $a$ and $N_{i-1}$ are integers and their multiplication could become greater than the standard 32 bit integer, there is a trick via Schrage's algorithm which approximates the multiplication of large integers through the factorization
where we have defined
and
since we can add or subtract any integer multiple of $M$ from $aN_{i-1}$. The last term $[N_{i-1}/q]M\mathrm{MOD}(M)$ is zero since the integer division $[N_{i-1}/q]$ just yields a constant which is multiplied with $M$.
We can now rewrite Eq. (eq:rntrick1) as
which results in
yielding
The term $[N_{i-1}/q]r$ is always smaller than or equal to $N_{i-1}(r/q)$, and with $r < q$ we always obtain a number smaller than $N_{i-1}$, which is smaller than $M$. And since the number $N_{i-1}\mathrm{MOD} (q)$ is between zero and $q-1$, we have $a(N_{i-1}\mathrm{MOD} (q))< aq$. Combined with our definition of $q=[M/a]$, this ensures that this term is also smaller than $M$, meaning that both terms fit into a 32-bit signed integer. Neither of the two terms can be negative, but their difference could be. The algorithm below adds $M$ if their difference is negative. Note that the program uses the bitwise $\oplus$ operator to generate the starting point for each generation of a random number. The period of ran0 is $\sim 2.1\times 10^{9}$. A special feature of this algorithm is that it should never be called with the initial seed set to $0$.
/*
** The function
** ran0()
** is an "Minimal" random number generator of Park and Miller
** Set or reset the input value
** idum to any integer value (except the unlikely value MASK)
** to initialize the sequence; idum must not be altered between
** calls for successive deviates in a sequence.
** The function returns a uniform deviate between 0.0 and 1.0.
*/
double ran0(long &idum)
{
const int a = 16807, m = 2147483647, q = 127773;
const int r = 2836, MASK = 123459876;
const double am = 1./m;
long k;
double ans;
idum ^= MASK;
k = idum/q;
idum = a*(idum - k*q) - r*k;
// add m if negative difference
if(idum < 0) idum += m;
ans=am*(idum);
idum ^= MASK;
return ans;
} // End: function ran0()
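Since Python integers never overflow, we can verify Schrage's factorization directly against the exact product. This is a sketch using the Park-Miller constants that appear in the C code above:

```python
# Schrage's trick: compute (a*N) mod M without overflowing 32-bit integers,
# using the factorization M = a*q + r with q = [M/a] and r = M mod a
a, M = 16807, 2147483647       # Park-Miller constants from ran0
q, r = M // a, M % a           # q = 127773, r = 2836 as in the C code

def schrage(N):
    """Evaluate (a*N) mod M via a*(N mod q) - r*[N/q], adding M if negative."""
    res = a * (N % q) - r * (N // q)
    return res + M if res < 0 else res

# Check against the direct modular product for a few seeds
for N in (1, 7, 123459876, 2147483646):
    assert schrage(N) == (a * N) % M
```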
As mentioned previously, the underlying PDF for the generation of random numbers is the uniform distribution, meaning that the probability for finding a number $x$ in the interval [0,1] is $p(x)=1$.
A random number generator should produce numbers which are uniformly distributed in this interval. The table shows the distribution of $N=10000$ random numbers generated by the functions in the program library. We note in this table that the number of points in the various intervals $0.0-0.1$, $0.1-0.2$ etc are fairly close to $1000$, with some minor deviations.
Two additional measures are the standard deviation $\sigma$ and the mean $\mu=\langle x\rangle$.
For the uniform distribution, the mean value $\mu$ is then
while the standard deviation is
The various random number generators produce results which agree rather well with these limiting values.
$x$-bin | ran0 | ran1 | ran2 | ran3 |
---|---|---|---|---|
0.0-0.1 | 1013 | 991 | 938 | 1047 |
0.1-0.2 | 1002 | 1009 | 1040 | 1030 |
0.2-0.3 | 989 | 999 | 1030 | 993 |
0.3-0.4 | 939 | 960 | 1023 | 937 |
0.4-0.5 | 1038 | 1001 | 1002 | 992 |
0.5-0.6 | 1037 | 1047 | 1009 | 1009 |
0.6-0.7 | 1005 | 989 | 1003 | 989 |
0.7-0.8 | 986 | 962 | 985 | 954 |
0.8-0.9 | 1000 | 1027 | 1009 | 1023 |
0.9-1.0 | 991 | 1015 | 961 | 1026 |
$\mu$ | 0.4997 | 0.5018 | 0.4992 | 0.4990 |
$\sigma$ | 0.2882 | 0.2892 | 0.2861 | 0.2915 |
The following simple Python code plots the distribution of the produced random numbers using the linear congruential RNG employed by Python. The trend displayed in the previous table is seen rather clearly.
#!/usr/bin/env python
import numpy as np
import matplotlib.pyplot as plt
import random
# initialize the rng with a seed
random.seed()
counts = 10000
values = np.zeros(counts)
for i in range(counts):
    values[i] = random.random()
# the histogram of the data
n, bins, patches = plt.hist(values, 10, facecolor='green')
plt.xlabel('$x$')
plt.ylabel('Number of counts')
plt.title(r'Test of uniform distribution')
plt.axis([0, 1, 0, 1100])
plt.grid(True)
plt.show()
Since our random numbers, which are typically generated via a linear congruential algorithm,
are never fully independent, we can then define
an important test which measures the degree of correlation, namely the so-called
auto-correlation function defined previously, see again Eq. (eq:autocorrelformal).
We rewrite it here as
with $C_0=1$. Recall that $\sigma^2=\langle x_i^2\rangle-\langle x_i\rangle^2$ and that
The non-vanishing of $C_k$ for $k\ne 0$ means that the random numbers are not independent. The independence of the random numbers is crucial in the evaluation of other expectation values. If they are not independent, our assumption for approximating $\sigma_N$ is no longer valid.
The program here computes the autocorrelation function for the standard rand() function included with the C++ compiler.
// This function computes the autocorrelation function for
// the standard c++ random number generator
#include <fstream>
#include <iomanip>
#include <iostream>
#include <cmath>
#include <cstdlib>
#include <ctime>
using namespace std;
// output file as global variable
ofstream ofile;
// Main function begins here
int main(int argc, char* argv[])
{
int n;
char *outfilename;
double MCint = 0.; double MCintsqr2=0.;
double invers_period = 1./RAND_MAX; // scale factor to map rand() onto [0,1]
srand(time(NULL)); // This produces the so-called seed in MC jargon
// Compute the variance and the mean value of the uniform distribution
// Compute also the specific values x for each cycle in order to be able to
// the covariance and the correlation function
// Read in output file, abort if there are too few command-line arguments
if( argc <= 2 ){
cout << "Bad Usage: " << argv[0] <<
" read also output file and number of cycles on same line" << endl;
exit(1);
}
else{
outfilename=argv[1];
}
ofile.open(outfilename);
// Get the number of Monte-Carlo samples
n = atoi(argv[2]);
double *X;
X = new double[n];
for (int i = 0; i < n; i++){
double x = double(rand())*invers_period;
X[i] = x;
MCint += x;
MCintsqr2 += x*x;
}
double Mean = MCint/((double) n );
MCintsqr2 = MCintsqr2/((double) n );
double STDev = sqrt(MCintsqr2-Mean*Mean);
double Variance = MCintsqr2-Mean*Mean;
// Write mean value and standard deviation
cout << " Standard deviation= " << STDev << " Integral = " << Mean << endl;
// Now we compute the autocorrelation function
double *autocor; autocor = new double[n];
for (int j = 0; j < n; j++){
double sum = 0.0;
for (int k = 0; k < (n-j); k++){
sum += (X[k]-Mean)*(X[k+j]-Mean);
}
autocor[j] = sum/Variance/((double) n );
ofile << setiosflags(ios::showpoint | ios::uppercase);
ofile << setw(15) << setprecision(8) << j;
ofile << setw(15) << setprecision(8) << autocor[j] << endl;
}
delete [] X; // free the sample array
ofile.close(); // close output file
return 0;
} // end of main program
import numpy as np
from matplotlib import pyplot as plt
# Load in data file
data = np.loadtxt("datafiles/autocor.dat")
# Make arrays containing the lag (x-axis) and the autocorrelation function
x = data[:,0]
corr = data[:,1]
plt.plot(x, corr ,'ro')
plt.axis([0,1000,-0.2, 1.1])
plt.xlabel(r'$d$')
plt.ylabel(r'$C_d$')
plt.title(r'autocorrelation function for RNG')
plt.savefig('autocorr.pdf')
plt.show()
In the library files lib.cpp and lib.h we have included four popular RNGs taken from the widely used textbook Numerical Recipes. These are called ran0, ran1, ran2 and ran3.
The C++ standard library has a header called random. It contains a large selection of RNGs and is highly recommended. Some of these RNGs have very long periods, making it safe to use them in large calculations. In particular, the Mersenne twister random number engine has a period of $2^{19937}-1$.
The following part of a c++ code (from project 4) sets up the uniform distribution for $x\in [0,1]$.
// You need this
#include <random>
// Initialize the seed and call the Mersenne twister engine
std::random_device rd;
std::mt19937_64 gen(rd());
// Set up the uniform distribution for x \in [0, 1]
std::uniform_real_distribution<double> RandomNumberGenerator(0.0,1.0);
// Now use the RNG
int ix = (int) (RandomNumberGenerator(gen)*NSpins);
Statistical analysis
* Monte Carlo simulations can be treated as *computer experiments*
* The results can be analysed with the same statistical tools as we would use when analysing experimental data.
* As in all experiments, we are looking for expectation values and an estimate of how accurate they are, i.e., possible sources for errors.
A very good article which explains blocking is H. Flyvbjerg and H. G. Petersen, Error estimates on averages of correlated data, Journal of Chemical Physics 91, 461-466 (1989).
Statistical analysis
* As in other experiments, Monte Carlo experiments have two classes of errors:
* Statistical errors
* Systematic errors
* Statistical errors can be estimated using standard tools from statistics
* Systematic errors are method specific and must be treated differently from case to case. (In VMC a common source is the step length or the time step in importance sampling)
The following code computes the autocorrelation function, the covariance and the standard deviation for the Mersenne twister RNG with a uniform distribution.
// This function computes the autocorrelation function for
// the Mersenne random number generator with a uniform distribution
#include <iostream>
#include <fstream>
#include <iomanip>
#include <cstdlib>
#include <random>
#include <armadillo>
#include <string>
#include <cmath>
using namespace std;
using namespace arma;
// output file
ofstream ofile;
// Main function begins here
int main(int argc, char* argv[])
{
int MonteCarloCycles;
string filename;
if (argc > 2) {
filename=argv[1];
MonteCarloCycles = atoi(argv[2]);
string fileout = filename;
string argument = to_string(MonteCarloCycles);
fileout.append(argument);
ofile.open(fileout);
}
else{
cout << "Bad Usage: " << argv[0] <<
" read also output file and number of cycles on same line" << endl;
exit(1);
}
// Compute the variance and the mean value of the uniform distribution
// Compute also the specific values x for each cycle in order to be able to
// compute the covariance and the correlation function
vec X = zeros<vec>(MonteCarloCycles);
double MCint = 0.; double MCintsqr2=0.;
std::random_device rd;
std::mt19937_64 gen(rd());
// Set up the uniform distribution for x \in [[0, 1]
std::uniform_real_distribution<double> RandomNumberGenerator(0.0,1.0);
for (int i = 0; i < MonteCarloCycles; i++){
double x = RandomNumberGenerator(gen);
X(i) = x;
MCint += x;
MCintsqr2 += x*x;
}
double Mean = MCint/((double) MonteCarloCycles );
MCintsqr2 = MCintsqr2/((double) MonteCarloCycles );
double STDev = sqrt(MCintsqr2-Mean*Mean);
double Variance = MCintsqr2-Mean*Mean;
// Write mean value and variance
cout << " Sample variance= " << Variance << " Mean value = " << Mean << endl;
// Now we compute the autocorrelation function
vec autocorrelation = zeros<vec>(MonteCarloCycles);
for (int j = 0; j < MonteCarloCycles; j++){
double sum = 0.0;
for (int k = 0; k < (MonteCarloCycles-j); k++){
sum += (X(k)-Mean)*(X(k+j)-Mean);
}
autocorrelation(j) = sum/Variance/((double) MonteCarloCycles );
ofile << setiosflags(ios::showpoint | ios::uppercase);
ofile << setw(15) << setprecision(8) << j;
ofile << setw(15) << setprecision(8) << autocorrelation(j) << endl;
}
// Now compute the exact covariance using the autocorrelation function
double Covariance = 0.0;
for (int j = 0; j < MonteCarloCycles; j++){
Covariance += autocorrelation(j);
}
Covariance *= 2.0/((double) MonteCarloCycles);
// Compute now the total variance, including the covariance, and obtain the standard deviation
double TotalVariance = (Variance/((double) MonteCarloCycles ))+Covariance;
cout << " Covariance = " << Covariance << " Total variance = " << TotalVariance << " Sample Variance/n = " << (Variance/((double) MonteCarloCycles )) << endl;
cout << " STD from sample variance= " << sqrt(Variance/((double) MonteCarloCycles )) << " STD with covariance = " << sqrt(TotalVariance) << endl;
ofile.close(); // close output file
return 0;
} // end of main program
With auto-correlated samples the naive error estimate must be corrected as

$$
\sigma_N^2 \approx \frac{\sigma^2}{N}\left(1+2\frac{\tau}{\Delta t}\right),
$$

where $\tau$ is the correlation time (the time between a sample and the next uncorrelated sample) and $\Delta t$ is the time between each sample.
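The size of this correction can be estimated directly from the samples. A small sketch (in Python with numpy, assuming the samples sit in a one-dimensional array; the function name is illustrative only) which compares the naive error of the mean with the autocorrelation-corrected one:

```python
import numpy as np

def corrected_error(x, kmax=100):
    """Error of the mean of the samples x, naive and with the
    autocorrelation correction obtained by summing C_k up to lag kmax."""
    n = len(x)
    mean = np.mean(x)
    var = np.var(x)
    # autocorrelation function C_k for lags k = 1, ..., kmax-1
    corr_sum = 0.0
    for k in range(1, kmax):
        c_k = np.mean((x[:n-k] - mean)*(x[k:] - mean))/var
        corr_sum += c_k
    naive = np.sqrt(var/n)
    corrected = np.sqrt(var/n*(1.0 + 2.0*corr_sum))
    return naive, corrected

# for uncorrelated uniform numbers the two estimates should almost coincide
rng = np.random.default_rng(2018)
naive, corrected = corrected_error(rng.uniform(size=10000))
print(naive, corrected)
```

For samples produced by a Metropolis walk, `corrected` would instead come out noticeably larger than `naive`.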
Blocking
* If $\Delta t\gg\tau$ our first estimate of $\sigma$ still holds
* Much more common that $\Delta t<\tau$
* In the method of data blocking we divide the sequence of samples into blocks
* We then take the mean $\langle \mathbf{M}_i\rangle$ of block $i=1\ldots n_{blocks}$ to calculate the total mean and variance
* The size of each block must be so large that sample $j$ of block $i$ is not correlated with sample $j$ of block $i+1$
* The correlation time $\tau$ would be a good choice
Blocking
* Problem: We don't know $\tau$ or it is too expensive to compute
* Solution: Make a plot of std. dev. as a function of blocksize
* The estimate of std. dev. of correlated data is too low $\to$ the error will increase with increasing block size until the blocks are uncorrelated, where we reach a plateau
* When the std. dev. stops increasing the blocks are uncorrelated
* Do a Monte Carlo simulation, storing all samples to file
* Do the statistical analysis on this file, independently of your Monte Carlo program
* Read the file into an array
* Loop over various block sizes
* For each block size $n_b$, loop over the array in steps of $n_b$ taking the mean of elements $i n_b,\ldots,(i+1) n_b - 1$
* Take the mean and variance of the resulting array
* Write the results for each block size to file for later
analysis
When the file gets large, it can be useful to write your data in binary mode instead of ASCII characters. The following Python snippet, a method from a larger analysis class, performs the blocking analysis on the data from every Monte Carlo cycle.
# Blocking
@timeFunction
def blocking(self, blockSizeMax = 500):
    blockSizeMin = 1
    self.blockSizes = []
    self.meanVec = []
    self.varVec = []
    for i in range(blockSizeMin, blockSizeMax):
        # skip block sizes that do not divide the data set evenly
        if(len(self.data) % i != 0):
            continue
        blockSize = i
        meanTempVec = []
        startPoint = 0
        endPoint = blockSize
        # compute the mean of each block of size blockSize
        while endPoint <= len(self.data):
            meanTempVec.append(np.average(self.data[startPoint:endPoint]))
            startPoint = endPoint
            endPoint += blockSize
        mean, var = np.average(meanTempVec), np.var(meanTempVec)/len(meanTempVec)
        self.meanVec.append(mean)
        self.varVec.append(var)
        self.blockSizes.append(blockSize)
    # average over the largest block sizes, where the blocks are uncorrelated
    self.blockingAvg = np.average(self.meanVec[-200:])
    self.blockingVar = np.average(self.varVec[-200:])
    self.blockingStd = np.sqrt(self.blockingVar)
The Bootstrap resampling method is also very popular. It is very simple:
Start with your sample of measurements and compute the sample variance and the mean value.
Then start again, but now pick the numbers in the sample at random and with replacement, and recalculate the mean and the sample variance.
Repeat this $K$ times.
It can be shown, see the article by Efron, that this procedure produces the correct standard deviation.
This method is very useful for small ensembles of data points.
Given a set of $N$ data, assume that we are interested in some observable $\theta$ which may be estimated from that set. This observable can also be for example the result of a fit based on all $N$ raw data. Let us call the value of the observable obtained from the original data set $\hat{\theta}$. One recreates from the sample repeatedly other samples by choosing randomly $N$ data out of the original set. This costs essentially nothing, since we just recycle the original data set for the building of new sets.
Let us assume we have done this $K$ times and thus have $K$ sets of $N$ data values each. Of course some values will enter more than once in the new sets. For each of these sets one computes the observable $\theta$, resulting in values $\theta_k$ with $k = 1,\dots,K$. Then one determines

$$
\tilde{\theta} = \frac{1}{K}\sum_{k=1}^{K}\theta_k,
$$

and

$$
\sigma^2_{\tilde{\theta}} = \frac{1}{K}\sum_{k=1}^{K}\left(\theta_k-\tilde{\theta}\right)^2.
$$

These are estimators for $\langle\theta\rangle$ and its variance. They are not unbiased and therefore $\tilde{\theta}\neq\hat{\theta}$ for finite $K$.
The difference is called the bias and gives an idea of how far away the result may be from the true $\langle\theta\rangle$. As the final result for the observable one quotes $\langle\theta\rangle = \tilde{\theta} \pm \sigma_{\tilde{\theta}}$.
# Bootstrap
@timeFunction
def bootstrap(self, nBoots = 1000):
    bootVec = np.zeros(nBoots)
    # resample the data with replacement nBoots times and store the means
    for k in range(0,nBoots):
        bootVec[k] = np.average(np.random.choice(self.data, len(self.data)))
    self.bootAvg = np.average(bootVec)
    self.bootVar = np.var(bootVec)
    self.bootStd = np.std(bootVec)

# Jackknife
@timeFunction
def jackknife(self):
    jackknVec = np.zeros(len(self.data))
    # leave out one data point at a time
    for k in range(0,len(self.data)):
        jackknVec[k] = np.average(np.delete(self.data, k))
    # bias-corrected average and variance
    self.jackknAvg = self.avg - (len(self.data) - 1) * (np.average(jackknVec) - self.avg)
    self.jackknVar = float(len(self.data) - 1) * np.var(jackknVec)
    self.jackknStd = np.sqrt(self.jackknVar)
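Since both snippets above are methods of a larger analysis class (note the `self` references and the `@timeFunction` decorator, which belong to that class), a self-contained sketch of the same two resampling schemes on synthetic data may be useful:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=1.0, scale=0.5, size=1000)
n = len(data)

# Bootstrap: resample with replacement nBoots times and collect the means
nBoots = 1000
bootVec = np.array([np.mean(rng.choice(data, n)) for _ in range(nBoots)])
bootAvg, bootStd = np.mean(bootVec), np.std(bootVec)

# Jackknife: leave out one data point at a time
jackVec = np.array([np.mean(np.delete(data, k)) for k in range(n)])
jackAvg = np.mean(data) - (n - 1)*(np.mean(jackVec) - np.mean(data))
jackStd = np.sqrt((n - 1)*np.var(jackVec))

print(bootAvg, bootStd)
print(jackAvg, jackStd)
```

Both error estimates should agree with the analytical error of the mean, $\sigma/\sqrt{n}=0.5/\sqrt{1000}\approx 0.016$.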
Regression modeling deals with describing how the sampling distribution of a given random variable $y$ varies as a function of another variable, or a set of such variables, $\hat{x} =[x_0, x_1,\dots, x_p]^T$. The first variable is called the dependent, the outcome or the response variable, while the set of variables $\hat{x}$ is called the independent variable, or the predictor variable or the explanatory variable.
A regression model aims at finding a likelihood function $p(y\vert \hat{x})$, that is the conditional distribution for $y$ with a given $\hat{x}$. The estimation of $p(y\vert \hat{x})$ is made using a data set with
$n$ cases $i = 0, 1, 2, \dots, n-1$
Response (dependent or outcome) variable $y_i$ with $i = 0, 1, 2, \dots, n-1$
$p$ Explanatory (independent or predictor) variables $\hat{x}_i=[x_{i0}, x_{i1}, \dots, x_{ip}]$ with $i = 0, 1, 2, \dots, n-1$
The goal of the regression analysis is to extract/exploit the relationship between $y_i$ and $\hat{x}_i$ in order to infer causal dependencies, approximations to the likelihood functions and functional relationships, and to make predictions.
Before we proceed let us study a case from linear algebra where we aim at fitting a set of data $\hat{y}=[y_0,y_1,\dots,y_{n-1}]$. We could think of these data as a result of an experiment or a complicated numerical experiment. These data are functions of a series of variables $\hat{x}=[x_0,x_1,\dots,x_{n-1}]$, that is $y_i = y(x_i)$ with $i=0,1,2,\dots,n-1$. The variables $x_i$ could represent physical quantities like time, temperature, position etc. We assume that $y(x)$ is a smooth function.
Since obtaining these data points may not be trivial, we want to use these data to fit a function which can allow us to make predictions for values of $y$ which are not in the present set. The perhaps simplest approach is to assume we can parametrize our function in terms of a polynomial of degree $n-1$ with $n$ points, that is
$$
y(x_i) = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \dots + \beta_{n-1} x_i^{n-1}, \qquad i=0,1,\dots,n-1.
$$

Defining the vectors $\hat{y}=[y_0,y_1,\dots,y_{n-1}]^T$ and $\hat{\beta}=[\beta_0,\beta_1,\dots,\beta_{n-1}]^T$ and the matrix

$$
\hat{X}=\begin{bmatrix}
1 & x_0 & x_0^2 & \dots & x_0^{n-1} \\
1 & x_1 & x_1^2 & \dots & x_1^{n-1} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n-1} & x_{n-1}^2 & \dots & x_{n-1}^{n-1} \\
\end{bmatrix},
$$

we can rewrite our equations as

$$
\hat{y} = \hat{X}\hat{\beta}.
$$
We are obviously not limited to the above polynomial. We could replace the various powers of $x$ with elements of Fourier series, that is, instead of $x_i^j$ we could have $\cos{(j x_i)}$ or $\sin{(j x_i)}$, or time series or other orthogonal functions. For every set of values $y_i,x_i$ we can then generalize the equations to

$$
y_i = \sum_{j=0}^{n-1} \beta_j f_j(x_i),
$$

where the $f_j$ are arbitrary basis functions, and without loss of generality we rewrite again our equations as

$$
\hat{y} = \hat{X}\hat{\beta}, \qquad x_{ij} = f_j(x_i).
$$
and in order to find the optimal parameters $\beta_i$, instead of solving the above linear algebra problem we define a function which gives a measure of the spread between the values $y_i$ (which represent hopefully the exact values) and the parametrized values $\tilde{y}_i$, namely

$$
Q(\hat{\beta}) = \sum_{i=0}^{n-1}\left(y_i-\tilde{y}_i\right)^2,
$$

or, using the matrix $\hat{X}$,

$$
Q(\hat{\beta}) = \left(\hat{y}-\hat{X}\hat{\beta}\right)^T\left(\hat{y}-\hat{X}\hat{\beta}\right).
$$
The function $Q(\hat{\beta})$ can be linked to the variance of the quantity $y_i$ if we interpret the latter as the mean value of, for example, a numerical experiment. When linking with the maximum likelihood approach below, we will indeed interpret $y_i$ as a mean value

$$
y_i = \langle y_i \rangle + \epsilon_i,
$$

where $\langle y_i \rangle$ is the mean value and $\epsilon_i$ the error. Keep in mind also that till now we have treated $y_i$ as the exact value. Normally, the response (dependent or outcome) variable $y_i$ is the outcome of a numerical experiment or another type of experiment and is thus only an approximation to the true value. It is then always accompanied by an error estimate, often limited to a statistical error estimate given by the standard deviation discussed earlier. In the discussion here we will treat $y_i$ as our exact value for the response variable.
In order to find the parameters $\beta_i$ we will then minimize the spread of $Q(\hat{\beta})$ by requiring

$$
\frac{\partial Q(\hat{\beta})}{\partial \beta_j} = 0,
$$

which results, in matrix-vector form, in the so-called normal equations

$$
\hat{X}^T\hat{y} = \hat{X}^T\hat{X}\hat{\beta},
$$

and if the matrix $\hat{X}^T\hat{X}$ is invertible we have the solution

$$
\hat{\beta} = \left(\hat{X}^T\hat{X}\right)^{-1}\hat{X}^T\hat{y},
$$

meaning that the solution for $\hat{\beta}$ is the one which minimizes the residuals. Later we will link this with the maximum likelihood approach.
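As a numerical illustration (a minimal sketch, not part of the original text), the normal equations can be solved directly with numpy; here we fit a quadratic to noisy synthetic data:

```python
import numpy as np

# synthetic data: a quadratic polynomial with Gaussian noise
rng = np.random.default_rng(0)
n = 100
x = np.linspace(0, 1, n)
y = 2.0 + 1.0*x - 3.0*x**2 + 0.05*rng.standard_normal(n)

# design matrix with columns 1, x, x^2
X = np.column_stack([np.ones(n), x, x**2])

# solve the normal equations X^T X beta = X^T y
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)  # close to [2, 1, -3]
```

In practice one prefers `np.linalg.lstsq` or an SVD over forming $\hat{X}^T\hat{X}$ explicitly, since the latter squares the condition number of the problem.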
Normally, the response (dependent or outcome) variable $y_i$ is the outcome of a numerical experiment or another type of experiment and is thus only an approximation to the true value. It is then always accompanied by an error estimate, often limited to a statistical error estimate given by the standard deviation discussed earlier. In the discussion here we will treat $y_i$ as our exact value for the response variable.
Introducing the standard deviation $\sigma_i$ for each measurement $y_i$, we define now the $\chi^2$ function as

$$
\chi^2(\hat{\beta}) = \sum_{i=0}^{n-1}\frac{\left(y_i-\tilde{y}_i\right)^2}{\sigma_i^2} = \left(\hat{y}-\hat{A}\hat{\beta}\right)^T\left(\hat{y}-\hat{A}\hat{\beta}\right),
$$

where the matrix $\hat{A}$ has elements $a_{ij}=x_i^j/\sigma_i$ and the vector $\hat{y}$ now has elements $y_i/\sigma_i$. Requiring again that the derivatives with respect to the $\beta_j$ vanish results, in matrix-vector form, in

$$
\hat{A}^T\hat{y} = \hat{A}^T\hat{A}\hat{\beta},
$$

and if the matrix $\hat{A}^T\hat{A}$ is invertible we have the solution

$$
\hat{\beta} = \left(\hat{A}^T\hat{A}\right)^{-1}\hat{A}^T\hat{y}.
$$

Defining $\hat{H}=\left(\hat{A}^T\hat{A}\right)^{-1}$, we have then the following expression for the parameters $\beta_j$ (the matrix elements of $\hat{H}$ are $h_{ij}$)

$$
\beta_j = \sum_{k} h_{jk}\sum_{i=0}^{n-1}\frac{y_i x_i^k}{\sigma_i^2}.
$$

We state without proof the expression for the uncertainty in the parameters $\beta_j$,

$$
\sigma^2(\beta_j) = \sum_{i=0}^{n-1}\sigma_i^2\left(\frac{\partial\beta_j}{\partial y_i}\right)^2,
$$

resulting in

$$
\sigma^2(\beta_j) = h_{jj}.
$$
By computing the derivatives of $\chi^2$ with respect to $\beta_0$ and $\beta_1$ show that these are given by

$$
\frac{\partial \chi^2}{\partial \beta_0} = -2\sum_{i=0}^{n-1}\frac{y_i-\beta_0-\beta_1 x_i}{\sigma_i^2}=0,
$$

and

$$
\frac{\partial \chi^2}{\partial \beta_1} = -2\sum_{i=0}^{n-1}\frac{x_i\left(y_i-\beta_0-\beta_1 x_i\right)}{\sigma_i^2}=0.
$$

Introducing the quantities

$$
S=\sum_{i=0}^{n-1}\frac{1}{\sigma_i^2}, \quad S_x=\sum_{i=0}^{n-1}\frac{x_i}{\sigma_i^2}, \quad S_y=\sum_{i=0}^{n-1}\frac{y_i}{\sigma_i^2},
$$

$$
S_{xx}=\sum_{i=0}^{n-1}\frac{x_i^2}{\sigma_i^2}, \quad S_{xy}=\sum_{i=0}^{n-1}\frac{x_i y_i}{\sigma_i^2},
$$

and show that

$$
\beta_0 = \frac{S_{xx}S_y-S_x S_{xy}}{S S_{xx}-S_x^2}, \qquad \beta_1 = \frac{S S_{xy}-S_x S_y}{S S_{xx}-S_x^2}.
$$
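This straight-line fit can be checked numerically. A sketch with synthetic data (the sums below are the standard weighted least-squares quantities for a fit $y=\beta_0+\beta_1 x$ with errors $\sigma_i$):

```python
import numpy as np

# synthetic straight-line data with known errors sigma_i
rng = np.random.default_rng(1)
n = 50
x = np.linspace(0, 10, n)
sigma = 0.3*np.ones(n)
y = 0.5 + 2.0*x + sigma*rng.standard_normal(n)

# the chi^2 sums for a weighted straight-line fit
w = 1.0/sigma**2
S, Sx, Sy = np.sum(w), np.sum(w*x), np.sum(w*y)
Sxx, Sxy = np.sum(w*x*x), np.sum(w*x*y)
Delta = S*Sxx - Sx**2
beta0 = (Sxx*Sy - Sx*Sxy)/Delta
beta1 = (S*Sxy - Sx*Sy)/Delta
print(beta0, beta1)  # close to 0.5 and 2.0
```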
The LSM often suffers from the problem of being either underdetermined or overdetermined in the unknown coefficients $\beta_i$. A better approach is to use the Singular Value Decomposition (SVD) method discussed below.
How can we use the singular value decomposition to find the parameters $\beta_j$? More details will come. We first note that a general $m\times n$ matrix $\hat{A}$ can be written in terms of a diagonal matrix $\hat{\Sigma}$ of dimensionality $n\times n$ and two orthogonal matrices $\hat{U}$ and $\hat{V}$, where the first has dimensionality $m \times n$ and the last dimensionality $n\times n$. We have then

$$
\hat{A} = \hat{U}\hat{\Sigma}\hat{V}^T.
$$
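A minimal numpy sketch (illustrative, with made-up data) of how the SVD yields the least-squares parameters, also when $\hat{A}^T\hat{A}$ is ill-conditioned:

```python
import numpy as np

# overdetermined system: fit a straight line, m equations, n = 2 unknowns
rng = np.random.default_rng(3)
m = 30
x = np.linspace(0, 1, m)
A = np.column_stack([np.ones(m), x])        # m x n design matrix
y = 1.0 + 4.0*x + 0.01*rng.standard_normal(m)

# thin SVD: A = U Sigma V^T with U (m x n), Sigma (n x n) diagonal, V (n x n)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
# least-squares solution beta = V Sigma^{-1} U^T y
beta = Vt.T @ ((U.T @ y)/s)
print(beta)  # close to [1, 4]
```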
The field of artificial neural networks has a long history of development, and is closely connected with the advancement of computer science and computers in general. A model of artificial neurons was first developed by McCulloch and Pitts in 1943 to study signal processing in the brain and has later been refined by others. The general idea is to mimic neural networks in the human brain, which is composed of billions of neurons that communicate with each other by sending electrical signals. Each neuron accumulates its incoming signals, which must exceed an activation threshold to yield an output. If the threshold is not overcome, the neuron remains inactive, i.e. has zero output.
This behaviour has inspired a simple mathematical model for an artificial neuron.
Here, the output $y$ of the neuron is the value of its activation function, which has as input a weighted sum of the signals $x_1, \dots ,x_n$ received from $n$ other neurons:

$$
y = f\left(\sum_{i=1}^n w_i x_i\right).
$$
An artificial neural network (NN), is a computational model that consists of layers of connected neurons, or nodes. It is supposed to mimic a biological nervous system by letting each neuron interact with other neurons by sending signals in the form of mathematical functions between layers. A wide variety of different NNs have been developed, but most of them consist of an input layer, an output layer and eventual layers in-between, called hidden layers. All layers can contain an arbitrary number of nodes, and each connection between two nodes is associated with a weight variable.
The feed-forward neural network (FFNN) was the first and simplest type of NN devised. In this network, the information moves in only one direction: forward through the layers.
Nodes are represented by circles, while the arrows display the connections between the nodes, including the direction of information flow. Additionally, each arrow corresponds to a weight variable, not displayed here. We observe that each node in a layer is connected to all nodes in the subsequent layer, making this a so-called fully-connected FFNN.
A different variant of FFNNs are convolutional neural networks (CNNs), which have a connectivity pattern inspired by the animal visual cortex. Individual neurons in the visual cortex only respond to stimuli from small sub-regions of the visual field, called a receptive field. This makes the neurons well-suited to exploit the strong spatially local correlation present in natural images. The response of each neuron can be approximated mathematically as a convolution operation.
CNNs emulate the behaviour of neurons in the visual cortex by enforcing a local connectivity pattern between nodes of adjacent layers: each node in a convolutional layer is connected only to a subset of the nodes in the previous layer, in contrast to the fully-connected FFNN. Often, CNNs consist of several convolutional layers that learn local features of the input, with a fully-connected layer at the end, which gathers all the local data and produces the outputs. They have wide applications in image and video recognition.
So far we have only mentioned NNs where information flows in one direction: forward. Recurrent neural networks, on the other hand, have connections between nodes that form directed cycles. This creates a form of internal memory which is able to capture information on what has been calculated before; the output is dependent on the previous computations. Recurrent NNs make use of sequential information by performing the same task for every element in a sequence, where each element depends on previous elements. An example of such information is sentences, making recurrent NNs especially well-suited for handwriting and speech recognition.
There are many other kinds of NNs that have been developed. One type that is specifically designed for interpolation in multidimensional space is the radial basis function (RBF) network. RBFs are typically made up of three layers: an input layer, a hidden layer with non-linear radial symmetric activation functions and a linear output layer (''linear'' here means that each node in the output layer has a linear activation function). The layers are normally fully-connected and there are no cycles, thus RBFs can be viewed as a type of fully-connected FFNN. They are however usually treated as a separate type of NN due to the unusual activation functions.
Other types of NNs could also be mentioned, but are outside the scope of this work. We will now move on to a detailed description of how a fully-connected FFNN works, and how it can be used to interpolate data sets.
One often uses so-called fully-connected feed-forward neural networks with three or more layers (an input layer, one or more hidden layers and an output layer) consisting of neurons that have non-linear activation functions.
Such networks are often called multilayer perceptrons (MLPs)
According to the Universal approximation theorem, a feed-forward neural network with just a single hidden layer containing a finite number of neurons can approximate a continuous multidimensional function to arbitrary accuracy, assuming the activation function for the hidden layer is a non-constant, bounded and monotonically-increasing continuous function. Note that the requirements on the activation function only apply to the hidden layer; the output nodes are always assumed to be linear, so as to not restrict the range of output values.
We note that this theorem is only applicable to a NN with one hidden layer. Therefore, we can easily construct an NN that employs activation functions which do not satisfy the above requirements, as long as we have at least one layer with activation functions that do. Furthermore, although the universal approximation theorem lays the theoretical foundation for regression with neural networks, it does not say anything about how things work in practice: A neural network can still be able to approximate a given function reasonably well without having the flexibility to fit all other functions.
In an FFNN of such neurons, the inputs $x_i$ are the outputs of the neurons in the preceding layer. Furthermore, an MLP is fully-connected, which means that each neuron receives a weighted sum of the outputs of all neurons in the previous layer.
First, for each node $i$ in the first hidden layer, we calculate a weighted sum $u_i^1$ of the input coordinates $x_j$,

$$
u_i^1 = \sum_{j=1}^{n} w_{ij}^1 x_j + b_i^1.
$$

This value is the argument of the activation function $f_1$ of each neuron $i$, producing the output $y_i^1$ of all neurons in layer 1,

$$
y_i^1 = f_1(u_i^1) = f_1\left(\sum_{j=1}^{n} w_{ij}^1 x_j + b_i^1\right),
$$

where we assume that all nodes in the same layer have identical activation functions, hence the notation $f_l$.
where we have substituted $y_m^1$ with its expression in terms of the inputs $x_j$. Finally, the NN output yields,
which illustrates a basic property of MLPs: The only independent variables are the input values $x_n$.
This confirms that an MLP, despite its quite convoluted mathematical form, is nothing more than an analytic function, specifically a mapping of real-valued vectors $\vec{x} \in \mathbb{R}^n \rightarrow \vec{y} \in \mathbb{R}^m$. In our example, $n=2$ and $m=1$. Consequently, the number of input and output values of the function we want to fit must be equal to the number of inputs and outputs of our MLP.
Furthermore, the flexibility and universality of an MLP can be illustrated by realizing that the expression is essentially a nested sum of scaled activation functions of the form

$$
h(x) = c_2 f(c_1 x + c_0) + c_3,
$$

where the parameters $c_i$ are weights and biases. By adjusting these parameters, the activation functions can be shifted up and down or left and right, change slope or be rescaled, which is the key to the flexibility of a neural network.
We can introduce a more convenient notation for the activations in a NN.
Additionally, we can represent the biases and activations as layer-wise column vectors $\vec{b}_l$ and $\vec{y}_l$, so that the $i$-th element of each vector is the bias $b_i^l$ and activation $y_i^l$ of node $i$ in layer $l$ respectively.
We have that $\mathrm{W}_l$ is an $N_{l-1} \times N_l$ matrix, while $\vec{b}_l$ and $\vec{y}_l$ are $N_l \times 1$ column vectors. With this notation, the sum becomes a matrix-vector multiplication, and we can write the activations of hidden layer 2 as

$$
\vec{y}_2 = f_2\left(\mathrm{W}_2 \vec{y}_1 + \vec{b}_2\right).
$$
This is not just a convenient and compact notation, but also a useful and intuitive way to think about MLPs: The output is calculated by a series of matrix-vector multiplications and vector additions that are used as input to the activation functions. For each operation $\mathrm{W}_l \vec{y}_{l-1}$ we move forward one layer.
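The series of matrix-vector multiplications can be sketched as follows (a minimal numpy example with made-up layer sizes and random, untrained weights, only to show the mechanics of a forward pass; each weight matrix is stored so that a layer update is a single matrix-vector product):

```python
import numpy as np

def sigmoid(u):
    return 1.0/(1.0 + np.exp(-u))

rng = np.random.default_rng(0)
# layer sizes: 2 inputs, hidden layers of 4 and 3 nodes, 1 output
sizes = [2, 4, 3, 1]
# weight matrix W_l and bias vector b_l for each layer l = 1, 2, 3
weights = [rng.standard_normal((sizes[l], sizes[l-1]))
           for l in range(1, len(sizes))]
biases = [rng.standard_normal(sizes[l]) for l in range(1, len(sizes))]

def feed_forward(x):
    y = x
    # sigmoid activation in the hidden layers
    for W, b in zip(weights[:-1], biases[:-1]):
        y = sigmoid(W @ y + b)
    # linear activation in the output layer
    return weights[-1] @ y + biases[-1]

output = feed_forward(np.array([0.5, -1.0]))
print(output)
```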
A property that characterizes a neural network, other than its connectivity, is the choice of activation function(s). As described above, the following restrictions are imposed on an activation function for an FFNN to fulfill the universal approximation theorem:
Non-constant
Bounded
Monotonically-increasing
Continuous
The second requirement excludes all linear functions. Furthermore, in an MLP with only linear activation functions, each layer simply performs a linear transformation of its inputs.
Regardless of the number of layers, the output of the NN will be nothing but a linear function of the inputs. Thus we need to introduce some kind of non-linearity to the NN to be able to fit non-linear functions. Typical examples are the logistic Sigmoid

$$
f(x) = \frac{1}{1+e^{-x}},
$$

and the hyperbolic tangent function

$$
f(x) = \tanh(x).
$$
The Sigmoid function is the more biologically plausible of the two, because the output of inactive neurons is zero. Such activation functions are called one-sided. However, it has been shown that the hyperbolic tangent performs better than the Sigmoid for training MLPs, and in recent years the rectified linear unit (ReLU) has become the most popular for deep neural networks.
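The required properties can be checked numerically for the two functions (a small sketch; both are bounded, monotonically increasing and continuous, as the universal approximation theorem demands):

```python
import numpy as np

def sigmoid(x):
    return 1.0/(1.0 + np.exp(-x))

x = np.linspace(-6, 6, 1001)
s, t = sigmoid(x), np.tanh(x)

# the Sigmoid is bounded in (0,1): inactive neurons give output near zero
print(s.min(), s.max())
# tanh is bounded in (-1,1) and antisymmetric about the origin
print(t.min(), t.max())
# both are monotonically increasing
print(np.all(np.diff(s) > 0), np.all(np.diff(t) > 0))
```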