Consider the case of a linear dependence between measurements $x_i \in \mathbb{R}^n$ and $y_i \in \mathbb{R}$, $i = 1,\ldots, m$.
Then
$$ f(x, w) = x^{\top}w $$
or
$$ f(X, w) = Xw $$
The linear least squares problem is stated in the following form:
$$ L(w|X, y) = \frac{1}{2}\sum\limits_{i=1}^m (x^{\top}_i w - y_i)^2 = \frac{1}{2}\|Xw - y \|^2_2 \to \min_w $$
Remark. In what follows we assume $m \geq n$ and $\mathrm{rank}(X) = n$, except in special cases.
From the first-order necessary optimality condition and the convexity of the objective it follows that
$$ L'(w^* | X, y) = X^{\top}(Xw^* - y) = 0 \Rightarrow (X^{\top}X)w^* = X^{\top}y $$
or
$$ w^* = (X^{\top}X)^{-1}X^{\top}y = X^+y = X^{\dagger}y, $$
where $X^{\dagger} = X^+ = (X^{\top}X)^{-1}X^{\top}$ is the (Moore–Penrose) pseudo-inverse of $X$.
Remark: check that you can derive the expression for $w^*$ yourself!
Q: to what kind of problem has the optimization problem been reduced?
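As a quick numerical check of the closed-form solution (a minimal sketch; the problem sizes here are arbitrary), the pseudo-inverse formula can be compared with NumPy's built-in least squares routine:
import numpy as np

# small synthetic problem; the sizes are chosen only for illustration
m, n = 50, 5
X = np.random.randn(m, n)
y = np.random.randn(m)

w_pinv = np.linalg.pinv(X).dot(y)                 # w* = X^+ y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # reference solution

print(np.linalg.norm(w_pinv - w_lstsq))           # expected to be close to machine precision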
Definition. Any matrix $A \in \mathbb{S}^n_{++}$ has a unique Cholesky decomposition:
$$ A = LL^{\top}, $$
where $L$ is a lower-triangular matrix with positive diagonal entries.
The method: compute the Cholesky factorization $X^{\top}X = LL^{\top}$ and solve the normal equation with two triangular solves, $Lz = X^{\top}y$ (forward substitution) followed by $L^{\top}w^* = z$ (back substitution), as sketched below.
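A minimal sketch of these two triangular solves (the helper name chol_normal_eq is made up here; the CholSolve function later in this notebook uses SciPy's cho_factor/cho_solve wrapper for the same steps):
import numpy as np
import scipy.linalg as sclin

def chol_normal_eq(X, y):
    # factor the normal-equation matrix: X^T X = L L^T
    L = np.linalg.cholesky(X.T.dot(X))
    # forward substitution: L z = X^T y
    z = sclin.solve_triangular(L, X.T.dot(y), lower=True)
    # back substitution: L^T w = z
    return sclin.solve_triangular(L.T, z, lower=False)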
Pro
Contra
Using the (thin) QR decomposition $X = Q_1R_1$, the normal equation turns into
$$ R_1w^* = Q_1^{\top}y $$
Definition. Any matrix $A \in \mathbb{R}^{m \times n}$ can be represented in the form
$$ A = U\widehat{\Sigma} V^* = [U_1, U_2] \begin{bmatrix} \Sigma\\ 0 \end{bmatrix} V^*, $$
where $U \in \mathbb{R}^{m \times m}$ is unitary, $U_1 \in \mathbb{R}^{m \times n}$, $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n) \in \mathbb{R}^{n \times n}$ is diagonal with the singular values $\sigma_i$ on the diagonal, and $V \in \mathbb{R}^{n \times n}$ is unitary.
Substituting the SVD into the normal equation reduces it to a system with a square diagonal matrix, which gives
$$ w^* = V\Sigma^{-1}U_1^{\top}y = \sum\limits_{i=1}^n \frac{u_i^{\top}y}{\sigma_i} v_i, $$
where $v_i$ and $u_i$ are the columns of $V$ and $U_1$, respectively.
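A small sketch checking that the sum form of the solution agrees with the matrix form (the array names and sizes here are arbitrary):
import numpy as np
import scipy.linalg as sclin

m, n = 30, 4
X = np.random.randn(m, n)
y = np.random.randn(m)

# thin SVD: X = U1 diag(s) Vh, where Vh = V^T
U1, s, Vh = sclin.svd(X, full_matrices=False)

# sum form: w* = sum_i (u_i^T y / sigma_i) * v_i
w_sum = sum((U1[:, i].dot(y) / s[i]) * Vh[i, :] for i in range(n))

# matrix form: w* = V Sigma^{-1} U1^T y
w_mat = Vh.T.dot(U1.T.dot(y) / s)

print(np.linalg.norm(w_sum - w_mat))  # expected to be close to machine precision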
Pro
Contra
In [3]:
import numpy as np

# generate a random overdetermined problem with m = 2n observations
n = 1000
m = 2 * n
X = np.random.randn(m, n)
w = np.random.randn(n)
# noisy measurements y = Xw + noise
y = X.dot(w) + 1e-5 * np.random.randn(m)
# solve the normal equation X^T X w = X^T y directly
w_est = np.linalg.solve(X.T.dot(X), X.T.dot(y))
print(np.linalg.norm(w - w_est))
In [2]:
import scipy.linalg as sclin
import scipy.sparse.linalg as scsplin

def CholSolve(X, y):
    # Cholesky factorization of X^T X, then two triangular solves
    res = sclin.cho_factor(X.T.dot(X), lower=True)
    return sclin.cho_solve(res, X.T.dot(y))

def QRSolve(X, y):
    # full QR of X; R_1 is the leading n x n block of R, Q_1 the first n columns of Q
    Q, R = sclin.qr(X)
    return sclin.solve_triangular(R[:R.shape[1], :], Q[:, :R.shape[1]].T.dot(y))

def SVDSolve(X, y):
    # thin SVD: X = U diag(s) Vh, so w = Vh^T diag(1/s) U^T y
    U, s, Vh = sclin.svd(X, full_matrices=False)
    return Vh.T.dot(np.diagflat(1.0 / s).dot(U.T.dot(y)))

def CGSolve(X, y):
    # conjugate gradient on the normal equation; X^T X is applied implicitly
    def mv(x):
        return X.T.dot(X.dot(x))
    LA = scsplin.LinearOperator((X.shape[1], X.shape[1]), matvec=mv)
    w, _ = scsplin.cg(LA, X.T.dot(y), tol=1e-10)
    return w

def NPSolve(X, y):
    # direct dense solve of the normal equation
    return np.linalg.solve(X.T.dot(X), X.T.dot(y))

def LSQRSolve(X, y):
    # LSQR works with X directly, without forming X^T X
    res = scsplin.lsqr(X, y)
    return res[0]
In [3]:
w_chol = CholSolve(X, y)
print(np.linalg.norm(w - w_chol))
w_qr = QRSolve(X, y)
print(np.linalg.norm(w - w_qr))
w_svd = SVDSolve(X, y)
print(np.linalg.norm(w - w_svd))
w_cg = CGSolve(X, y)
print(np.linalg.norm(w - w_cg))
w_np = NPSolve(X, y)
print(np.linalg.norm(w - w_np))
w_lsqr = LSQRSolve(X, y)
print(np.linalg.norm(w - w_lsqr))
In [4]:
%timeit w_chol = CholSolve(X, y)
%timeit w_qr = QRSolve(X, y)
%timeit w_svd = SVDSolve(X, y)
%timeit w_cg = CGSolve(X, y)
%timeit w_np = NPSolve(X, y)
%timeit w_lsqr = LSQRSolve(X, y)
In [2]:
%matplotlib inline
import time
import matplotlib.pyplot as plt
In [5]:
n = [10, 100, 1000, 2000, 5000]
chol_time = []
qr_time = []
svd_time = []
cg_time = []
np_time = []
lsqr_time = []
for dim in n:
    m = int(1.5 * dim)
    X = np.random.randn(m, dim)
    w = np.random.randn(dim)
    y = X.dot(w) + 1e-5 * np.random.randn(m)
    st = time.time()
    w_chol = CholSolve(X, y)
    chol_time.append(time.time() - st)
    st = time.time()
    w_qr = QRSolve(X, y)
    qr_time.append(time.time() - st)
    st = time.time()
    w_svd = SVDSolve(X, y)
    svd_time.append(time.time() - st)
    st = time.time()
    w_cg = CGSolve(X, y)
    cg_time.append(time.time() - st)
    st = time.time()
    w_np = NPSolve(X, y)
    np_time.append(time.time() - st)
    st = time.time()
    w_lsqr = LSQRSolve(X, y)
    lsqr_time.append(time.time() - st)
In [6]:
plt.figure(figsize=(10,8))
plt.plot(n, chol_time, linewidth=5, label="Cholesky")
plt.plot(n, qr_time, linewidth=5, label="QR")
plt.plot(n, svd_time, linewidth=5, label="SVD")
plt.plot(n, cg_time, linewidth=5, label="CG")
plt.plot(n, np_time, linewidth=5, label="Numpy")
plt.plot(n, lsqr_time, linewidth=5, label="LSQR")
plt.legend(loc="best", fontsize=20)
plt.xscale("log")
plt.yscale("log")
plt.xlabel(r"Dimension", fontsize=20)
plt.ylabel(r"Time, sec.", fontsize=20)
plt.xticks(fontsize = 20)
_ = plt.yticks(fontsize = 20)
where $J$ is the Jacobian of the residuals $r(w)$.
\begin{equation*} \begin{split} S''(w) = & \sum_{i=1}^m r_i'(w)r_i'(w)^{\top} + \sum_{i=1}^m r_i(w)r_i''(w) \\ = & J^{\top}(w)J(w) + \sum_{i=1}^m r_i(w)r_i''(w) \end{split} \end{equation*}
Pro
Contra
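For context (a standard step, not spelled out above): dropping the second-order term $\sum_{i=1}^m r_i(w)r_i''(w)$ yields the Gauss–Newton approximation $S''(w) \approx J^{\top}(w)J(w)$ and the step
$$ J^{\top}(w_k)J(w_k)h_k = -J^{\top}(w_k)r(w_k). $$
Since $J^{\top}J$ may be singular or ill-conditioned, the step can be undefined or very large, which motivates the regularization below.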
Idea: shift the spectrum of the Hessian away from zero with an additional term $\lambda I$.
Levenberg–Marquardt method:
$$ (f''(x_k) + \lambda_k I)h_k = -f'(x_k), \qquad \lambda_k > 0 $$
There are many strategies for selecting $\lambda_k$; the general idea is similar to backtracking:
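for example, increase $\lambda_k$ until the step decreases the objective, and decrease it after every successful step. A minimal sketch of one such rule (the constants inc and dec and the acceptance test are illustrative choices, not the only ones):
import numpy as np

def lm_step(fun, grad, hess, x, lam, inc=10.0, dec=0.1):
    # one Levenberg-Marquardt step with a simple backtracking-style
    # adaptation of lambda; fun, grad, hess evaluate f, f', f''
    n = x.shape[0]
    while True:
        # solve (f''(x) + lam * I) h = -f'(x)
        h = np.linalg.solve(hess(x) + lam * np.eye(n), -grad(x))
        if fun(x + h) < fun(x):
            # successful step: accept it and relax the regularization
            return x + h, dec * lam
        # unsuccessful step: strengthen the regularization and retry
        lam *= inc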
In [10]:
# ground-truth parameters of the model w0 * exp(w1 * x) * cos(w2 * x + w3)
w = np.array([1, -0.5, 10, 0])
def f(x, w=w):
    return w[0] * np.exp(x * w[1]) * np.cos(w[2] * x + w[3])
num_points = 100
x_range = np.linspace(0, 5, num=num_points)
plt.plot(x_range, f(x_range))
Out[10]:
In [21]:
num_samples = num_points // 2
x_samples = np.random.choice(x_range, size=num_samples)
y_samples = f(x_samples) + 0.05 * np.random.randn(num_samples)
plt.plot(x_range, f(x_range))
plt.scatter(x_samples, y_samples, c="red")
Out[21]:
In [24]:
import scipy.optimize as scopt
# residuals r(w) = f(x_samples, w) - y_samples
res = lambda w: f(x_samples, w) - y_samples
def jac(w):
    # analytic Jacobian of the residuals with respect to the parameters w
    J = np.zeros((num_samples, 4))
    J[:, 0] = np.exp(x_samples * w[1]) * np.cos(x_samples * w[2] + w[3])
    J[:, 1] = w[0] * x_samples * np.exp(x_samples * w[1]) * np.cos(x_samples * w[2] + w[3])
    J[:, 2] = -w[0] * x_samples * np.exp(x_samples * w[1]) * np.sin(x_samples * w[2] + w[3])
    J[:, 3] = -w[0] * np.exp(x_samples * w[1]) * np.sin(x_samples * w[2] + w[3])
    return J
x0 = np.random.rand(4)
result = {}
result["LM"] = scopt.least_squares(fun=res, method="lm", x0=x0, jac=jac)
# result["TR"] = scopt.least_squares(fun=res, method="trf", x0=x0, jac=jac)
# result["Dogleg"] = scopt.least_squares(fun=res, method="dogbox", x0=x0, jac=jac)
In [25]:
plt.figure(figsize=(8, 6))
fontsize = 16
plt.plot(x_range, f(x_range), label="Exact")
for method in result:
    plt.plot(x_range, f(x_range, result[method].x), label=method)
plt.legend(fontsize=fontsize)
plt.xticks(fontsize=fontsize)
plt.yticks(fontsize=fontsize)
print("Exact parameters = {}".format(w))
for method in result:
    print("{} method parameters = {}".format(method, result[method].x))
Pro
Contra
Definition. A problem is called ill-posed if at least one of the following well-posedness conditions fails: a solution exists, the solution is unique, and the solution depends continuously on the data.
Definition. Regularization is the process of introducing additional information into the model in order to solve an ill-posed problem.
Examples:
Exercise: derive the analogue of the normal equation for this problem. What interpretation does the modified normal equation have, and why does this regularization work?
The method for solving this problem is similar to the linear case without regularization, as the sketch below illustrates.
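For illustration, assuming the quadratic (Tikhonov) regularizer $\frac{\lambda}{2}\|w\|_2^2$, the modified normal equation is $(X^{\top}X + \lambda I)w^* = X^{\top}y$, and it can be solved with the same Cholesky approach as before (a minimal sketch; the function name and the value of $\lambda$ are arbitrary):
import numpy as np
import scipy.linalg as sclin

def ridge_solve(X, y, lam=1e-3):
    # regularized normal equation: (X^T X + lam * I) w = X^T y
    A = X.T.dot(X) + lam * np.eye(X.shape[1])
    factor = sclin.cho_factor(A, lower=True)
    return sclin.cho_solve(factor, X.T.dot(y))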