A common parameter-estimation strategy is to minimize the sum of squared errors,
\begin{equation*} \sum_{i}(D_i - y_i(\theta, x))^2 \end{equation*}
For example, create some phony straight-line data with $\theta=(a, b)$ and normally distributed noise in $y$:
In [1]:
# assumes the pylab names (linspace, randn, scatter, plot, ...) are in scope, e.g. via %pylab inline
n_points = 10
x = linspace(0, 10, n_points)
def y(x, theta=(2, 4), noise=0):
    return theta[0] * x + theta[1] + noise * randn(n_points)
D = y(x, noise=1)
scatter(x,D)
Out[1]:
We are trying to discover the parameters $\theta$, so write the error in terms of $\theta$ only,
$SSE(\theta; y(x), D)$
In [2]:
# cheat, do this with global scope of y, x and D
def ssq_err(theta):
    residuals = D - y(x, theta)
    return np.sum(residuals**2)

from scipy.optimize import minimize
fit = minimize(ssq_err, [1, 1])
print(fit)
scatter(x, D)
plot(x, y(x, fit.x))
Not amazing so far $\ldots$
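As a cross-check, the straight-line case also has a closed-form least-squares solution; here is a minimal sketch using np.polyfit, assuming the x, D and fit from the cells above are still in scope:
In [ ]:
# closed-form least-squares fit of a degree-1 polynomial, for comparison
# with the numerical minimization above (the true theta was (2, 4))
a_fit, b_fit = np.polyfit(x, D, 1)
print(a_fit, b_fit)
print(fit.x)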
Leading question: Is normally distributed noise "unusually compatible" with least-squares error minimization?
Maybe we should do something that takes into account the expected distribution of the model errors. What if the errors are Poisson distributed, for example; does sum-squared error still make sense?
In [3]:
from scipy.stats import norm, poisson
rvn, rvp = norm(5, 1), poisson(5)              # a normal and a Poisson, both centered near 5
x = np.linspace(-2, 10, 140)
plot(x, rvn.pdf(x))                            # continuous pdf
xd = np.arange(0, 15)                          # discrete support, shown up to 14
vlines(xd, 0, rvp.pmf(xd), lw=2, color="red")  # Poisson pmf
Out[3]:
New Idea: The parameter fitting problem can be framed as "Can we find the mostly likely value of the parameters $\theta$ of the model given the data?"
i.e. maximize $P(\theta|D)$
In frequentist hypothesis testing, significance is specified by the probability that the data were drawn from the "wrong" model (the null hypothesis): the p-value.
That is, from a frequentist perspective, the PDF for normally distributed noise is written in terms of $P(D|\theta)$, e.g.
\begin{equation*} P(D|\theta) \propto \exp\left(-\sum_i \frac{(D_i - y_i(\theta))^2}{2\sigma^2}\right) \end{equation*}
We need to flip this around.
Some statistics rules
\begin{equation*} P(A,B) = P(A|B)P(B) = P(B|A)P(A) \end{equation*}
\begin{equation*} P(A) = \int_\Omega P(A,B)\,dB = \int_\Omega P(A|B)P(B)\,dB \end{equation*}
Bayesian statistics is the application of these rules.
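These rules are easy to verify numerically; here is a minimal sketch on a made-up 2x2 joint distribution (the numbers are arbitrary):
In [ ]:
# a made-up joint distribution P(A, B); rows index A, columns index B
P_AB = np.array([[0.10, 0.30],
                 [0.20, 0.40]])
P_B = P_AB.sum(axis=0)                       # marginal P(B)
P_A_given_B = P_AB / P_B                     # P(A|B) = P(A,B) / P(B)
print(np.allclose(P_A_given_B * P_B, P_AB))  # product rule
P_A = (P_A_given_B * P_B).sum(axis=1)        # sum rule: P(A) = sum_B P(A|B) P(B)
print(np.allclose(P_A, P_AB.sum(axis=1)))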
Back to data and parameters:
\begin{equation*} P(\theta|D) = \frac{P(D|\theta)P(\theta)}{P(D)} \end{equation*}
\begin{equation*} P(D|\theta) : \text{Likelihood} \qquad P(\theta) : \text{Prior} \qquad P(D) : \text{Evidence} \qquad P(\theta|D) : \text{Posterior} \end{equation*}
\begin{equation*} P(D) = \int_\Omega P(D|\theta)P(\theta)\,d\theta \end{equation*}
The evidence is an integration over the parameter space of the model.
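The evidence integral is often the awkward part. Here is a minimal sketch of the whole formula approximated on a grid for a one-parameter model, assuming a flat prior and using the coin-flip example introduced just below (7 heads in 10 tosses):
In [ ]:
# grid approximation of P(w|D) for a binomial model with a flat prior
from scipy.stats import binom
w_grid = np.linspace(0, 1, 1001)
prior = np.ones_like(w_grid)                      # flat prior, P(w) = 1 on [0, 1]
likelihood = binom.pmf(7, 10, w_grid)             # P(D|w)
evidence = np.trapz(likelihood * prior, w_grid)   # P(D) = integral of P(D|w) P(w) dw
posterior = likelihood * prior / evidence         # P(w|D)
print(w_grid[np.argmax(posterior)])               # posterior mode, ~0.7 with a flat prior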
In 10 trials, we get 7 heads: What's our best estimate of the probability of flipping heads with this coin?
The distribution of coin toss outcomes is the Binomial Distribution.
$\mathrm{pmf}(k \mid n, w) = \binom{n}{k} w^k (1-w)^{n-k}$
Let's not forget how obvious this is and guess 0.7.
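As a quick sanity check, evaluate the pmf at our data with $w = 0.7$:
\begin{equation*} \mathrm{pmf}(7 \mid 10, 0.7) = \binom{10}{7}(0.7)^7(0.3)^3 = 120 \times 0.0823543 \times 0.027 \approx 0.267 \end{equation*}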
In [4]:
from scipy.stats import binom
rv = binom(10, 0.7)                      # 10 tosses, P(heads) = 0.7
x = np.arange(0, 11)                     # support: 0, 1, ..., 10 heads
h = plt.vlines(x, 0, rv.pmf(x), lw=2, color="orange")
"Specifically, the PDF in Fig. 1 is a function of the data given a particular set of parameter values, defined on the data scale. On the other hand, the likelihood function is a function of the parameter given a particular set of observed data, defined on the parameter scale. In short, Fig. 1 tells us the probability of a particular data value for a fixed parameter, whereas Fig. 2 tells us the likelihood (‘‘unnormalized probability’’) of a particular parameter value for a fixed data set." (.J. Myung / Journal of Mathematical Psychology 47 (2003) 90–100)
In [5]:
def f(w):
    rv = binom(10, w)
    return rv.pmf(7)

# Hold the data constant, vary the parameter
x = np.linspace(0, 1, 100)
plot(x, f(x))
Out[5]:
It looks like the likelihood, viewed as a function of the parameter for this fixed data, peaks around 0.7.
Our data could just as well have been 6 of 10 or 8 of 10 heads with a 0.7 coin. But we started by asking about the coin; to make any headway, we need to flip the question around.
In [6]:
dat = [10, 7]                 # [number of tosses, number of heads]
def L_binom(w):
    rv = binom(dat[0], w)
    k = dat[1]
    # negate to turn the maximization into a minimization
    return -rv.pmf(k)
In [7]:
x = np.linspace(0.1, 0.9, 9)
scatter(x, L_binom(x))
from scipy.optimize import minimize_scalar
minimize_scalar(L_binom, bounds=[0,1], method="bounded")
Out[7]:
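For reference, the same answer falls out analytically by setting the derivative of the log-likelihood to zero:
\begin{equation*} \log L(w) = \log\binom{n}{k} + k\log w + (n-k)\log(1-w), \qquad \frac{d\log L}{dw} = \frac{k}{w} - \frac{n-k}{1-w} = 0 \implies \hat{w} = \frac{k}{n} = 0.7 \end{equation*}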
In [9]:
# a new model: exponential decay with additive normal noise
n_points = 10
x = linspace(0, 100, n_points)
def y(x, theta=(10, 0.05), noise=0):
    return theta[0] * np.exp(-x * theta[1]) + noise * randn(n_points)
D = y(x, noise=1)
scatter(x, D)
Out[9]:
Likelihood:
\begin{equation} P(D|\theta) \propto \prod_i \exp\left(-\frac{(D_i - y_i(\theta))^2}{2\sigma^2}\right) \propto \prod_i \exp\left(-\frac{(D_i - 10\exp(-w x_i))^2}{2}\right) \end{equation}
In [10]:
# use x, D from global scope
def L_cost(w):
    # reciprocal of the likelihood (positive exponent), so smaller is better
    r = D - y(x, (10, w))
    return np.prod(np.exp(r**2 / 2.))

w = np.linspace(0, 0.1, 10)
L_c = [L_cost(i) for i in w]
scatter(w, L_c)
Out[10]:
Numerically troubling! Turn the product into a sum with log!
In [11]:
# use x, D from global scope
def L_cost(w):
    # negative log-likelihood (up to an additive constant)
    r = D - y(x, (10, w))
    return np.sum(r**2 / 2.)

w = np.linspace(0, 0.1, 10)
L_c = [L_cost(i) for i in w]
scatter(w, L_c)

from scipy.optimize import minimize_scalar
minimize_scalar(L_cost, bounds=[0, 0.1], method="bounded")
Out[11]:
So it might be useful to work with the $\log$ likelihood:
\begin{equation} \log P(D|\theta) \propto \sum_i \log\left(\exp\left(-\frac{(D_i - 10\exp(-w x_i))^2}{2}\right)\right) = -\sum_i \frac{(D_i - 10\exp(-w x_i))^2}{2} \end{equation}
Normally distributed errors + minimizing the negative log-likelihood gives the same solution as least squares.
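A quick numerical check of that claim: including the $1/(2\sigma^2)$ factor only rescales the objective, so the minimizer does not move. Here $\sigma$ is an arbitrary assumed noise scale, and x, D, y are the exponential example from above:
In [ ]:
# check that scaling by 1/(2*sigma^2) does not move the minimizer,
# so the ML fit coincides with the least-squares fit
def neg_log_like(w, sigma=3.0):       # sigma = 3.0 is an arbitrary assumed value
    r = D - y(x, (10, w))
    return np.sum(r**2) / (2. * sigma**2)

def sse_w(w):
    r = D - y(x, (10, w))
    return np.sum(r**2)

print(minimize_scalar(neg_log_like, bounds=[0, 0.1], method="bounded").x)
print(minimize_scalar(sse_w, bounds=[0, 0.1], method="bounded").x)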
What about Poisson-distributed data, where each observation $D_i$ is a count with mean $y_i(\theta)$? Taking the log of the Poisson pmf, we can handle the factorial with Stirling $\ldots$ or realize that $\Gamma(D_i + 1) = D_i!$ is a constant w.r.t. MLE estimation of $\theta$.
\begin{equation} \log P(D|\theta) = \sum_i \log\left(\frac{y_i(\theta)^{D_i}e^{-y_i(\theta)}}{\Gamma(D_i + 1)}\right) = \sum_i \left(D_i \log y_i(\theta) - y_i(\theta)\right) + \text{const.} \end{equation}
In [12]:
from numpy.random import poisson   # note: shadows scipy.stats.poisson imported earlier
n_points = 20
x = linspace(0, 300, n_points)
def y(x, theta=(110, 0.03), noise=0):
    ty = theta[0] * np.exp(-x * theta[1])
    if noise != 0:
        # any nonzero `noise` switches on Poisson counting noise with mean ty
        ty = poisson(ty)
    return ty
D = y(x, noise=5)
scatter(x, D)
plot(x, y(x))
Out[12]:
In [13]:
# use x, D from global scope
def P_cost(theta):
    # negative Poisson log-likelihood (constant term dropped)
    m = y(x, theta)
    return -np.sum(D * np.log(m) - m)

fit_ml = minimize(P_cost, (105, 0.05))
print(fit_ml)
In [14]:
# cheat, do this with global scope of y, x and D
def ssq_err(theta):
    r = D - y(x, theta)
    return np.sum(r**2)

fit_ssq = minimize(ssq_err, [105, 0.05])
print(fit_ssq)
scatter(x, D)
plot(x, y(x, fit_ml.x), color="green")    # maximum-likelihood fit
plot(x, y(x, fit_ssq.x), color="red")     # least-squares fit
Out[14]:
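To put numbers on the comparison, print both sets of fitted parameters next to the values used to generate the data:
In [ ]:
# compare both fits with the generating parameters theta = (110, 0.03)
print("true   :", (110, 0.03))
print("max-lik:", fit_ml.x)
print("ssq    :", fit_ssq.x)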