In [ ]:
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm
# Plot normal distribution areas

k = 3      # plot the areas below -k, above k, and between -k and k
mean = 0   # plotting will assume mean = 0
std = 1

plt.rcParams["figure.figsize"] = (35, 35)

# x grids for the left tail, right tail, and central region
x_left = np.arange(-4*std + mean, -k*std + mean, 0.01)
x_right = np.arange(k*std + mean, 4*std + mean, 0.01)
x_mid = np.arange(-k*std + mean, k*std + mean, 0.01)

# Shade the two tails in red and the central region in blue
plt.fill_between(x=x_left, y1=norm.pdf(x_left, mean, std), facecolor='red', alpha=0.35)
plt.fill_between(x=x_right, y1=norm.pdf(x_right, mean, std), facecolor='red', alpha=0.35)
plt.fill_between(x=x_mid, y1=norm.pdf(x_mid, mean, std), facecolor='blue', alpha=0.35)

# Tail and central probabilities for a standard normal (k is in units of std)
prob_under_minusk = norm.cdf(x=-k, loc=0, scale=1)
prob_over_k = 1 - norm.cdf(x=k, loc=0, scale=1)
between_prob = 1 - (prob_under_minusk + prob_over_k)

# Label each shaded region with its probability
plt.text(x=-1.8, y=0.03, s=round(prob_under_minusk, 3))
plt.text(x=-0.2, y=0.1, s=round(between_prob, 3))
plt.text(x=1.4, y=0.03, s=round(prob_over_k, 3))
plt.show()

In [ ]:

# <markdowncell>

# <h1>Readings</h1>
# <ul>
#     <li>Bishop: 3.1.0-3.1.4</li>
#     <li>Ng: Lecture 2 pdf, page 4, LMS algorithm</li>
#     <li>Ng: Lecture 2 pdf, page 13, Locally weighted linear regression</li>
#     <li>Bishop: 3.3.0-3.3.2</li>
# </ul>
# <p><font color="blue"><em><b>Regression</b></em></font>: Given the value of a D-dimensional input vector $\mathbf{x}$, predict the value of one or more <em>target</em> variables</p>
# <p><font color="blue"><b><em>Linear</em></b></font>: The models discussed in this section are <em>linear</em> with respect to the adjustable parameters, <em>not</em> 
#     necessisarily with respect to the input variables. </p>

# <markdowncell>

# <h1>Creating A Model</h1>
# In this notebook, our objective is to construct models that can predict the value of some target variable, $t$, given some 
# input vector, $\mathbf{x}$, where the target value can occupy any value in some space - though here we'll only consider the space of 
# real valued vectors. We want the models to allow for uncertainty in the accuracy of the model and/or noise on the observed data. 
# We also want the model to provide some information on our confidence in a given prediction. 
# 
# The first step is to construct a mathematical model that adequately represents the observations we wish to predict. 
# The model we will use is described in the next two subsections. It is **important to note** that the model itself is independent 
# of the use of a frequentist or Bayesian viewpoint. It is *how we obtain the free parameters* of the model that is affected by using
# frequentist or Bayesian approaches. However, if the model is a poor choice for a particular observation, then its predictive 
# capability is likely to be poor whether we use a frequentist or Bayesian approach to obtain the parameters.

# <markdowncell>

# <h2><font size="4">Gaussian Noise: Model Assumption 1</font></h2>
# We will *assume* throughout this notebook that the target variable is described by <br/><br/>
#     $t = y(\mathbf{x},\mathbf{w}) + \epsilon$
#     <br/><br/>
# where $y(\mathbf{x},\mathbf{w})$ is an as of yet undefined function of $\mathbf{x}$ and $\mathbf{w}$ and $\epsilon$ is a <font color="red"><em>Gaussian</em></font> distributed noise component. 
# 
# **Gaussian Noise?** The derivations provided below all assume Gaussian noise on the target data. Is this a good assumption? In many cases, yes. The argument hinges
# on the [Central Limit Theorem](http://en.wikipedia.org/wiki/Central_limit_theorem), which basically says that the **sum** of many independent random
# variables behaves like a Gaussian distributed random variable. The _noise_ term in this model, $\epsilon$, can be thought of as the sum of features
# not included in the model function, $y(\mathbf{x},\mathbf{w})$. Assuming these features are themselves independent random variables, the Central Limit Theorem suggests a Gaussian model 
# is appropriate, provided there are many such unaccounted-for features. It is possible that there are only a small number of unaccounted-for features,
# or that there is genuinely _non-Gaussian_ noise in our observation measurements, e.g. sensor shot noise, which often has a Poisson distribution. In such cases, the assumption is no longer valid.
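
# <markdowncell>

# The next cell is a small sanity-check sketch (not from the text) of the Central Limit Theorem argument above: it sums many independent, decidedly non-Gaussian (uniform) random variables and compares a histogram of the sums with a Gaussian fitted to their mean and standard deviation. The number of summands and the sample size are arbitrary illustrative choices.

# <codecell>

# Sketch: the sum of many independent non-Gaussian variables is approximately Gaussian.
# The constants (50 summands, 10000 samples) are arbitrary illustrative choices.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

np.random.seed(0)
n_summands, n_samples = 50, 10000

# Each "noise" sample is the sum of 50 independent uniform(-0.5, 0.5) variables
eps = np.random.uniform(-0.5, 0.5, size=(n_samples, n_summands)).sum(axis=1)

plt.hist(eps, bins=60, density=True, alpha=0.5, label="sum of 50 uniforms")
grid = np.linspace(eps.min(), eps.max(), 200)
plt.plot(grid, norm.pdf(grid, loc=eps.mean(), scale=eps.std()), 'r', label="fitted Gaussian")
plt.legend()
plt.show()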

# <markdowncell>

# <h2><font size="4">General Linear Model: Model Assumption 2</font></h2>
# In order to proceed, we need to define a model for $y(\mathbf{x},\mathbf{w})$. We will use the *general linear regression* model defined as follows <br/><br/>
#     $y(\mathbf{x},\mathbf{w}) = \sum_{j=0}^{M-1} w_j\phi_j(\mathbf{x}) = \mathbf{w}^T\mathbf{\phi}(\mathbf{x})$ <br/><br/>
#     where $\mathbf{x}$ is a $D$ dimensional input vector, $M$ is the number of free parameters in the model, $\mathbf{w}$ is a column 
# vector of the free parameters, and 
# $\phi(\mathbf{x}) = \\{\phi_0(\mathbf{x}),\phi_1(\mathbf{x}), \ldots,\phi_{M-1}(\mathbf{x})\\}$ with $\phi_0(\mathbf{x})=1$ is a set of basis functions, where 
#     each $\phi_i$ is a real-valued function mapping $\mathbf{R}^D \rightarrow \mathbf{R}$. It is important to note that the basis functions, $\phi$, <font color="red">need
#     not be linear</font> with respect to $\mathbf{x}$. Further, note that this model defines an entire class of models. In order to 
#     construct an actual predictive model for some observable quantity, we will have to make a further assumption on the choice of the
#     set of basis functions, $\phi$. However, for the purposes of deriving general results, we can delay this choice.
# 
# Note that $\mathbf{w}^T$ is a $1 \times M$ vector and that $\mathbf{\phi}(\mathbf{x})$ is an $M \times 1$ vector, so that the target, $y$, 
#     is a scalar. This will be extended to $K$-dimensional target variables below.
# 
#     
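
# <markdowncell>

# As a concrete illustration of the notation above, the next cell builds a design matrix $\mathbf{\Phi}$ from a simple polynomial basis $\phi_j(x)=x^j$ and evaluates $y(x,\mathbf{w}) = \mathbf{w}^T\mathbf{\phi}(x)$ for an arbitrary weight vector. The polynomial basis and the particular weights are assumptions made purely for illustration; any other set of basis functions would be handled the same way.

# <codecell>

# Sketch: evaluate the general linear model y(x, w) = w^T phi(x) with a polynomial basis.
# The basis choice and the weight values are illustrative assumptions.
import numpy as np

def poly_design_matrix(x, M):
    """Return the N x M design matrix with Phi[n, j] = phi_j(x_n) = x_n**j."""
    x = np.asarray(x, dtype=float)
    return np.vstack([x**j for j in range(M)]).T

M = 3
w = np.array([0.5, -1.0, 2.0])     # arbitrary weight vector of length M
x_vals = np.linspace(-1, 1, 5)     # a few sample (1-dimensional) inputs

Phi = poly_design_matrix(x_vals, M)
y_vals = Phi.dot(w)                # y(x_n, w) = w^T phi(x_n) for every input
print(y_vals)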

# <markdowncell>

# <h1>Frequentist View: Maximum Likelihood</h1>
# Let's now embark on the path of obtaining the free parameters, $\mathbf{w}$, of our model. We will begin using a *frequentist*, or 
# *maximum likelihood*, approach. This approach assumes that we first obtain observed training data, $\mathbf{t}$, and that the *best* 
# value of $\mathbf{w}$ is the one that maximizes the likelihood function, $p(\mathbf{t}|\mathbf{w})$.
# 
# <p>Under the Gaussian noise assumption it can be shown that the likelihood function for the training data is <br/><br/>
#     
#     $p(\mathbf{t}|\mathbf{X},\mathbf{w},\sigma^2) = \prod_{n=1}^N ND(t_n|\mathbf{w}^T\phi(\mathbf{x}_n),\sigma^2)$ <br/><br/>
#     
#     and that the corresponding log likelihood is <br/><br/>
#     
#     $\ln p(\mathbf{t}|\mathbf{X},\mathbf{w},\sigma^2) = \frac{N}{2}\ln\frac{1}{\sigma^2} -\frac{N}{2}\ln(2\pi) - \frac{1}{2\sigma^2}\sum_{n=1}^N
#     \{t_n -\mathbf{w}^T\phi(\mathbf{x}_n)\}^2$ <br/><br/>
#     
#     where $\mathbf{X}=\{\mathbf{x}_1,\ldots,\mathbf{x}_N\}$ is the set of input values corresponding to the $N$ observed output values contained in the vector 
#     $\mathbf{t}$, and $ND(\mu,\sigma^2)$ is the Normal (Gaussian) distribution. (ND is used instead of the standard N to avoid confusion 
#     with the product limit.)
#     
#     Setting the derivative of the log likelihood with respect to $\mathbf{w}$ equal to zero, one can obtain 
#     the maximum likelihood parameters given by the <em>normal equations</em>: <br/><br/>
#     $\mathbf{w}_{ML} = \left(\mathbf{\Phi}^T\mathbf{\Phi}\right)^{-1}\mathbf{\Phi}^T\mathbf{t}$ <br/><br/>
#     where $\Phi$ is the $N \times M$ <em>design matrix</em> with elements $\Phi_{n,j}=\phi_j(\mathbf{x}_n)$, and $\mathbf{t}$ is the $N \times K$
#     matrix of training set target values (for $K=1$, it is simply a column vector). Note that $\mathbf{\Phi}^T$ is an $M \times N$ matrix, so that $\mathbf{w}_{ML}=\left(\mathbf{\Phi}^T \mathbf{\Phi}\right)^{-1}\mathbf{\Phi}^T\mathbf{t}$ has dimensions 
# $\left[(M \times N)\times(N \times M)\right]^{-1}\times(M\times N)\times(N \times K) = M \times K$, where $M$ is the number of free parameters and $K$ is the number of predicted 
# target values for a given input. <br/>
# </p>
# 
# Note that the only term in the log likelihood that depends on $\mathbf{w}$ is the last term. <font color="red">Thus, maximizing the likelihood
# function with respect to $\mathbf{w}$ __under the assumption of Gaussian noise__ is equivalent to minimizing a 
# sum-of-squares error function. </font>
# 
# <p>
#     The quantity $\mathbf{\Phi}^\dagger=\left(\mathbf{\Phi}^T\mathbf{\Phi}\right)^{-1}\mathbf{\Phi}^T$ is known as the 
#     <em>Moore-Penrose pseudo-inverse</em> of $\Phi$. It generalizes the notion of a matrix inverse to non-square matrices: when $\Phi$ is square and invertible, the pseudo-inverse reduces to $\Phi^{-1}$.
#     When $\Phi^T\Phi$ is singular or nearly singular, the pseudo-inverse can still be computed with techniques such as <em>singular value decomposition</em>.
# </p>
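
# <markdowncell>

# Below is a minimal sketch of the maximum likelihood solution above, assuming synthetic data and a quadratic polynomial basis (both illustrative choices, not taken from the text): build the design matrix $\mathbf{\Phi}$ and compute $\mathbf{w}_{ML} = \left(\mathbf{\Phi}^T\mathbf{\Phi}\right)^{-1}\mathbf{\Phi}^T\mathbf{t}$, here via NumPy's SVD-based pseudo-inverse.

# <codecell>

# Sketch: maximum likelihood weights via the normal equations / Moore-Penrose pseudo-inverse.
# The basis (1, x, x^2), the "true" weights, and the noise level are illustrative assumptions.
import numpy as np

np.random.seed(1)
N = 100
x = np.random.uniform(-1, 1, N)
true_w = np.array([1.0, 2.0, -3.0])

Phi = np.vstack([np.ones(N), x, x**2]).T                 # N x M design matrix
t = Phi.dot(true_w) + np.random.normal(0, 0.1, N)        # targets with Gaussian noise

w_ml = np.linalg.pinv(Phi).dot(t)                        # (Phi^T Phi)^{-1} Phi^T t, via SVD
print(w_ml)  # should be close to true_w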

# <markdowncell>

# <h3>Example 1</h3>
# <h4>(a) Linear Data</h4>
# <p>Let's generate data of the form $y = mx + b + \epsilon$, where $\epsilon$ is a random Gaussian component with zero mean. Given this data, let's apply the maximum likelihood 
#     solution to find values for the parameters $m$ and $b$. Since we know our data is linear, we choose basis functions $\phi_0(x)=1$ and $\phi_1(x)=x$. Thus, 
#     our model is $y=\theta_0\phi_0(x) + \theta_1\phi_1(x)$, where presumably the solution should yield $\theta_0 \approx b$ and $\theta_1 \approx m$.
# </p>
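
# <markdowncell>

# One possible way to set up this experiment is sketched below (the slope, intercept, noise level, and sample size are arbitrary illustrative choices): generate noisy linear data, build the two-column design matrix from $\phi_0(x)=1$ and $\phi_1(x)=x$, and solve for $\theta_0$ and $\theta_1$ with the normal equations.

# <codecell>

# Sketch: fit y = m*x + b + eps by maximum likelihood with basis {1, x}.
# m, b, the noise standard deviation, and N are illustrative assumptions.
import numpy as np

np.random.seed(2)
m, b = 2.5, -1.0                             # "true" slope and intercept
N = 50
x = np.linspace(0, 10, N)
y = m*x + b + np.random.normal(0, 1.0, N)    # linear data with Gaussian noise

Phi = np.vstack([np.ones(N), x]).T           # columns: phi_0(x) = 1, phi_1(x) = x
theta = np.linalg.pinv(Phi).dot(y)           # [theta_0, theta_1] should approximate [b, m]
print(theta)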

In [ ]: