Morten Hjorth-Jensen, Department of Physics, University of Oslo and Department of Physics and Astronomy and National Superconducting Cyclotron Laboratory, Michigan State University
Date: Nov 21, 2017
Copyright 1999-2017, Morten Hjorth-Jensen. Released under CC Attribution-NonCommercial 4.0 license
Regression modeling deals with the description of the sampling distribution of a given random variable $y$ and how it varies as a function of another variable or a set of such variables $\hat{x} =[x_0, x_1,\dots, x_p]^T$. The first variable is called the dependent, the outcome or the response variable while the set of variables $\hat{x}$ is called the independent variable, or the predictor variable or the explanatory variable.
A regression model aims at finding a likelihood function $p(y\vert \hat{x})$, that is the conditional distribution for $y$ with a given $\hat{x}$. The estimation of $p(y\vert \hat{x})$ is made using a data set with
$n$ cases $i = 0, 1, 2, \dots, n-1$
Response (dependent or outcome) variable $y_i$ with $i = 0, 1, 2, \dots, n-1$
$p$ Explanatory (independent or predictor) variables $\hat{x}_i=[x_{i0}, x_{i1}, \dots, x_{ip}]$ with $i = 0, 1, 2, \dots, n-1$
The goal of the regression analysis is to extract/exploit the relationship between $y_i$ and $\hat{x}_i$ in order to infer causal dependencies, approximations to the likelihood functions, functional relationships and to make predictions.
Before we proceed let us study a case from linear algebra where we aim at fitting a set of data $\hat{y}=[y_0,y_1,\dots,y_{n-1}]$. We could think of these data as a result of an experiment or a complicated numerical experiment. These data are functions of a series of variables $\hat{x}=[x_0,x_1,\dots,x_{n-1}]$, that is $y_i = y(x_i)$ with $i=0,1,2,\dots,n-1$. The variables $x_i$ could represent physical quantities like time, temperature, position etc. We assume that $y(x)$ is a smooth function.
Since obtaining these data points may not be trivial, we want to use these data to fit a function which can allow us to make predictions for values of $y$ which are not in the present set. Perhaps the simplest approach is to assume we can parametrize our function in terms of a polynomial of degree $n-1$ with $n$ points, that is
$$
y = y(x) \rightarrow y(x_i) = \tilde{y}_i + \epsilon_i = \sum_{j=0}^{n-1} \beta_j x_i^j + \epsilon_i,
$$

where $\epsilon_i$ is the error in our approximation. For every point $x_i$ we have thus the set of equations

$$
\begin{align*}
y_0 &= \beta_0 + \beta_1 x_0^1 + \beta_2 x_0^2 + \dots + \beta_{n-1} x_0^{n-1} + \epsilon_0, \\
y_1 &= \beta_0 + \beta_1 x_1^1 + \beta_2 x_1^2 + \dots + \beta_{n-1} x_1^{n-1} + \epsilon_1, \\
&\vdots \\
y_{n-1} &= \beta_0 + \beta_1 x_{n-1}^1 + \beta_2 x_{n-1}^2 + \dots + \beta_{n-1} x_{n-1}^{n-1} + \epsilon_{n-1}.
\end{align*}
$$

Defining the vectors

$$
\hat{y} = [y_0, y_1, \dots, y_{n-1}]^T, \quad \hat{\beta} = [\beta_0, \beta_1, \dots, \beta_{n-1}]^T, \quad \hat{\epsilon} = [\epsilon_0, \epsilon_1, \dots, \epsilon_{n-1}]^T,
$$

and the matrix

$$
\hat{X} = \begin{bmatrix}
1 & x_0^1 & x_0^2 & \dots & x_0^{n-1} \\
1 & x_1^1 & x_1^2 & \dots & x_1^{n-1} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n-1}^1 & x_{n-1}^2 & \dots & x_{n-1}^{n-1}
\end{bmatrix},
$$

we can rewrite our equations as

$$
\hat{y} = \hat{X}\hat{\beta} + \hat{\epsilon}.
$$
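To make the structure of $\hat{X}$ concrete, the following minimal sketch (assuming NumPy is available; the sample points are invented for illustration) builds the square polynomial design matrix for a handful of points:

```python
import numpy as np

# A few invented sample points x_i; with n points the polynomial has degree n-1.
x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])

# Square design matrix with element X[i, j] = x_i**j, i.e. columns 1, x, x^2, ..., x^(n-1).
X = np.vander(x, increasing=True)

print(X)
```

Each row $i$ of the printed matrix reads $[1, x_i, x_i^2, \dots, x_i^{n-1}]$, which is exactly the $i$-th equation above.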
We are obviously not limited to the above polynomial. We could replace the various powers of $x$ with elements of Fourier series, that is, instead of $x_i^j$ we could have $\cos{(j x_i)}$ or $\sin{(j x_i)}$, or time series or other orthogonal functions. For every set of values $y_i,x_i$ we can then generalize the equations to

$$
y_i = \beta_0 x_{i0} + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_{n-1} x_{i,n-1} + \epsilon_i = \sum_{j=0}^{n-1} \beta_j x_{ij} + \epsilon_i,
$$

where the elements $x_{ij}$ are now the chosen basis functions evaluated at the point $x_i$, and without loss of generality we rewrite again our equations as

$$
\hat{y} = \hat{X}\hat{\beta} + \hat{\epsilon},
$$
and in order to find the optimal parameters $\beta_i$, instead of solving the above linear algebra problem, we define a function which gives a measure of the spread between the values $y_i$ (which hopefully represent the exact values) and the parametrized values $\tilde{y}_i$, namely

$$
Q(\hat{\beta}) = \sum_{i=0}^{n-1} \left(y_i - \tilde{y}_i\right)^2,
$$

or using the matrix $\hat{X}$ as

$$
Q(\hat{\beta}) = \left(\hat{y} - \hat{X}\hat{\beta}\right)^T \left(\hat{y} - \hat{X}\hat{\beta}\right).
$$
The function $Q(\hat{\beta})$ can be linked to the variance of the quantity $y_i$ if we interpret the latter as the mean value of, for example, a numerical experiment. When linking with the maximum likelihood approach below, we will indeed interpret $y_i$ as a mean value

$$
y_i = \langle y_i \rangle + \epsilon_i,
$$

where $\langle y_i \rangle$ is the mean value and $\epsilon_i$ is a random error. Keep in mind also that till now we have treated $y_i$ as the exact value. Normally, the response (dependent or outcome) variable $y_i$ is the outcome of a numerical experiment or another type of experiment and is thus only an approximation to the true value. It is then always accompanied by an error estimate, often limited to a statistical error estimate given by the standard deviation discussed earlier. In the discussion here we will treat $y_i$ as our exact value for the response variable.
In order to find the parameters $\beta_i$ we will then minimize the spread of $Q(\hat{\beta})$ by requiring

$$
\frac{\partial Q(\hat{\beta})}{\partial \beta_j} = \frac{\partial}{\partial \beta_j}\left[\sum_{i=0}^{n-1}\left(y_i - \beta_0 x_{i0} - \beta_1 x_{i1} - \dots - \beta_{n-1} x_{i,n-1}\right)^2\right] = 0,
$$

which results in

$$
\frac{\partial Q(\hat{\beta})}{\partial \beta_j} = -2\sum_{i=0}^{n-1} x_{ij}\left(y_i - \beta_0 x_{i0} - \beta_1 x_{i1} - \dots - \beta_{n-1} x_{i,n-1}\right) = 0,
$$

or in a matrix-vector form as

$$
\frac{\partial Q(\hat{\beta})}{\partial \hat{\beta}} = 0 = \hat{X}^T\left(\hat{y} - \hat{X}\hat{\beta}\right).
$$

We can rewrite the latter equation

$$
\hat{X}^T\left(\hat{y} - \hat{X}\hat{\beta}\right) = 0
$$

as

$$
\hat{X}^T\hat{y} = \hat{X}^T\hat{X}\hat{\beta},
$$

and if the matrix $\hat{X}^T\hat{X}$ is invertible we have the solution

$$
\hat{\beta} = \left(\hat{X}^T\hat{X}\right)^{-1}\hat{X}^T\hat{y}.
$$

The residuals $\hat{\epsilon}$ are in turn given by

$$
\hat{\epsilon} = \hat{y} - \hat{X}\hat{\beta},
$$

and with

$$
\hat{X}^T\left(\hat{y} - \hat{X}\hat{\beta}\right) = 0
$$

we have

$$
\hat{X}^T\hat{\epsilon} = \hat{X}^T\left(\hat{y} - \hat{X}\hat{\beta}\right) = 0,
$$

meaning that the solution for $\hat{\beta}$ is the one which minimizes the residuals. Later we will link this with the maximum likelihood approach.
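To illustrate, the following minimal sketch (assuming NumPy; the generated data, polynomial degree and variable names are purely illustrative) builds a design matrix and solves the normal equations $\hat{X}^T\hat{X}\hat{\beta} = \hat{X}^T\hat{y}$ for a simple polynomial fit:

```python
import numpy as np

# Illustrative data set: n measurements y_i at points x_i (here a noisy quadratic).
np.random.seed(0)
n = 100
x = np.linspace(0, 1, n)
y = 2.0 + 3.0 * x - 1.5 * x**2 + 0.1 * np.random.randn(n)

# Design matrix with columns 1, x, x^2, ..., x^(p-1), here with fewer parameters than data points.
p = 3
X = np.vander(x, p, increasing=True)   # shape (n, p), element X[i, j] = x_i**j

# Solve the normal equations  X^T X beta = X^T y  for the parameters beta.
beta = np.linalg.solve(X.T @ X, X.T @ y)

# Predicted values and residuals; X^T @ residuals should be (numerically) zero.
ytilde = X @ beta
residuals = y - ytilde
print("beta =", beta)
print("X^T residuals =", X.T @ residuals)
```

Note that the printed $\hat{X}^T\hat{\epsilon}$ is zero to machine precision, in line with the property derived above.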
As noted above, the response (dependent or outcome) variable $y_i$ is normally the outcome of a numerical experiment or another type of experiment and is thus only an approximation to the true value, accompanied by an error estimate, often limited to a statistical error estimate given by the standard deviation discussed earlier.
Introducing the standard deviation $\sigma_i$ for each measurement $y_i$, we define now the $\chi^2$ function as

$$
\chi^2(\hat{\beta}) = \sum_{i=0}^{n-1} \frac{\left(y_i - \tilde{y}_i\right)^2}{\sigma_i^2} = \left(\hat{b} - \hat{A}\hat{\beta}\right)^T\left(\hat{b} - \hat{A}\hat{\beta}\right),
$$

where we have defined the matrix $\hat{A}$ with elements $a_{ij} = x_{ij}/\sigma_i$ and the vector $\hat{b}$ with elements $b_i = y_i/\sigma_i$. Minimizing with respect to the parameters $\beta_j$, we require as before that $\partial \chi^2(\hat{\beta})/\partial \beta_j = 0$, which results in

$$
\frac{\partial \chi^2(\hat{\beta})}{\partial \beta_j} = -2\sum_{i=0}^{n-1} \frac{x_{ij}}{\sigma_i}\left(\frac{y_i - \beta_0 x_{i0} - \beta_1 x_{i1} - \dots - \beta_{n-1} x_{i,n-1}}{\sigma_i}\right) = 0,
$$

or in a matrix-vector form as

$$
\frac{\partial \chi^2(\hat{\beta})}{\partial \hat{\beta}} = 0 = \hat{A}^T\left(\hat{b} - \hat{A}\hat{\beta}\right).
$$

We can rewrite

$$
\hat{A}^T\left(\hat{b} - \hat{A}\hat{\beta}\right) = 0
$$

as

$$
\hat{A}^T\hat{b} = \hat{A}^T\hat{A}\hat{\beta},
$$

and if the matrix $\hat{A}^T\hat{A}$ is invertible we have the solution

$$
\hat{\beta} = \left(\hat{A}^T\hat{A}\right)^{-1}\hat{A}^T\hat{b}.
$$

If we then introduce the matrix

$$
\hat{H} = \left(\hat{A}^T\hat{A}\right)^{-1},
$$

we have then the following expression for the parameters $\beta_j$ (the matrix elements of $\hat{H}$ are $h_{ij}$)

$$
\beta_j = \sum_{k} h_{jk} \sum_{i} \frac{y_i}{\sigma_i}\frac{x_{ik}}{\sigma_i} = \sum_{k} h_{jk} \sum_{i} b_i a_{ik}.
$$

We state without proof the expression for the uncertainty in the parameters $\beta_j$ as

$$
\sigma^2(\beta_j) = \sum_{i=0}^{n-1} \sigma_i^2\left(\frac{\partial \beta_j}{\partial y_i}\right)^2,
$$

resulting in

$$
\sigma^2(\beta_j) = \sum_{k}\sum_{l} h_{jk} h_{jl} \sum_{i} a_{ik} a_{il} = \sum_{k}\sum_{l} h_{jk} h_{jl} \left(\hat{A}^T\hat{A}\right)_{kl} = h_{jj},
$$

that is, the variance of $\beta_j$ is given by the diagonal matrix element $h_{jj}$ of $\hat{H} = (\hat{A}^T\hat{A})^{-1}$.
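The construction above translates directly into a few lines of linear algebra. The sketch below (a minimal illustration assuming NumPy; the data, error bars and variable names are invented for the example) builds the matrix $\hat{A}$ and vector $\hat{b}$, solves for $\hat{\beta}$ and reads off the parameter uncertainties from the diagonal of $\hat{H} = (\hat{A}^T\hat{A})^{-1}$:

```python
import numpy as np

# Illustrative data: a straight line with point-dependent error bars sigma_i.
np.random.seed(1)
n = 50
x = np.linspace(0, 2, n)
sigma = 0.05 + 0.1 * np.random.rand(n)          # standard deviation of each y_i
y = 1.0 + 2.5 * x + sigma * np.random.randn(n)  # noisy measurements

# Design matrix for a straight line, columns [1, x].
X = np.column_stack((np.ones(n), x))

# Weighted quantities: a_ij = x_ij / sigma_i and b_i = y_i / sigma_i.
A = X / sigma[:, None]
b = y / sigma

# Solve A^T A beta = A^T b and form H = (A^T A)^{-1} for the uncertainties.
H = np.linalg.inv(A.T @ A)
beta = H @ (A.T @ b)
beta_err = np.sqrt(np.diag(H))   # sigma(beta_j) = sqrt(h_jj)

print("beta        =", beta)
print("sigma(beta) =", beta_err)
```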
As a simple example, assume we fit a straight line $\tilde{y}(x) = \beta_0 + \beta_1 x$ to measurements $y_i$ with errors $\sigma_i$, that is we minimize

$$
\chi^2(\beta_0,\beta_1) = \sum_{i=0}^{n-1} \frac{\left(y_i - \beta_0 - \beta_1 x_i\right)^2}{\sigma_i^2}.
$$

By computing the derivatives of $\chi^2$ with respect to $\beta_0$ and $\beta_1$ show that these are given by

$$
\frac{\partial \chi^2}{\partial \beta_0} = -2\sum_{i=0}^{n-1} \frac{y_i - \beta_0 - \beta_1 x_i}{\sigma_i^2} = 0,
$$

and

$$
\frac{\partial \chi^2}{\partial \beta_1} = -2\sum_{i=0}^{n-1} \frac{x_i\left(y_i - \beta_0 - \beta_1 x_i\right)}{\sigma_i^2} = 0.
$$

Defining the weighted sums

$$
S = \sum_{i=0}^{n-1} \frac{1}{\sigma_i^2}, \quad S_x = \sum_{i=0}^{n-1} \frac{x_i}{\sigma_i^2}, \quad S_y = \sum_{i=0}^{n-1} \frac{y_i}{\sigma_i^2},
$$

$$
S_{xx} = \sum_{i=0}^{n-1} \frac{x_i^2}{\sigma_i^2}, \quad S_{xy} = \sum_{i=0}^{n-1} \frac{x_i y_i}{\sigma_i^2},
$$

solve the resulting two linear equations and show that

$$
\beta_0 = \frac{S_{xx}S_y - S_x S_{xy}}{S S_{xx} - S_x^2}, \qquad \beta_1 = \frac{S S_{xy} - S_x S_y}{S S_{xx} - S_x^2}.
$$
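As a quick numerical sanity check of these closed-form expressions (purely illustrative; the data and names are invented), one can compare them with the matrix solution from the previous section:

```python
import numpy as np

# Invented data for a straight line with errors sigma_i.
np.random.seed(2)
n = 40
x = np.linspace(0, 1, n)
sigma = 0.1 * np.ones(n)
y = 0.5 + 1.7 * x + sigma * np.random.randn(n)

# Weighted sums entering the closed-form solution.
w = 1.0 / sigma**2
S, Sx, Sy = np.sum(w), np.sum(w * x), np.sum(w * y)
Sxx, Sxy = np.sum(w * x**2), np.sum(w * x * y)
D = S * Sxx - Sx**2

beta0 = (Sxx * Sy - Sx * Sxy) / D
beta1 = (S * Sxy - Sx * Sy) / D

# Same fit via the weighted normal equations A^T A beta = A^T b.
A = np.column_stack((np.ones(n), x)) / sigma[:, None]
b = y / sigma
beta = np.linalg.solve(A.T @ A, A.T @ b)

print(beta0, beta1)   # closed-form result
print(beta)           # matrix solution, should agree
```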
The LSM often suffers from being either under- or overdetermined in the unknown coefficients $\beta_i$. A better approach is to use the Singular Value Decomposition (SVD) method discussed below.
How can we use the singular value decomposition to find the parameters $\beta_j$? More details will come. We first note that a general $m\times n$ matrix $\hat{A}$ can be written in terms of a diagonal matrix $\hat{\Sigma}$ of dimensionality $n\times n$ and two orthogonal matrices $\hat{U}$ and $\hat{V}$, where the first has dimensionality $m \times n$ and the last has dimensionality $n\times n$. We have then

$$
\hat{A} = \hat{U}\hat{\Sigma}\hat{V}^T.
$$
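A minimal sketch of how the SVD yields the least-squares parameters (assuming NumPy; the data and the truncation tolerance are invented for illustration): since $\hat{A} = \hat{U}\hat{\Sigma}\hat{V}^T$, the solution of the normal equations can be written $\hat{\beta} = \hat{V}\hat{\Sigma}^{-1}\hat{U}^T\hat{y}$, and small singular values can be dropped to stabilize an ill-conditioned fit.

```python
import numpy as np

# Invented example: a polynomial design matrix that can be nearly singular.
np.random.seed(3)
m, n = 100, 5
x = np.linspace(0, 1, m)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.05 * np.random.randn(m)
A = np.vander(x, n, increasing=True)            # m x n design matrix

# Thin SVD: A = U @ diag(s) @ Vt with U (m x n), s (n,), Vt (n x n).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# beta = V Sigma^{-1} U^T y, dropping singular values below a small tolerance.
tol = 1e-12 * s.max()
s_inv = np.where(s > tol, 1.0 / s, 0.0)
beta = Vt.T @ (s_inv * (U.T @ y))

print("singular values:", s)
print("beta =", beta)
```

Keeping only the singular values above the tolerance is what makes the SVD route more robust than inverting $\hat{A}^T\hat{A}$ directly when the problem is under- or overdetermined.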