%%latex

Before turning to the specific methods, here is some background material.

1. Gradient of a real-valued function with respect to a real vector

  1. If $f(\mathbf{x})=c$ is a constant, then $$\frac{\partial f(\mathbf{x})}{\partial{\mathbf{x}}}=\frac{\partial{c}}{\partial{\mathbf{x}}}=\mathbf{0}$$
  2. Linearity: if $f(\mathbf{x})$ and $g(\mathbf{x})$ are real-valued functions of the vector $\mathbf{x}$, and $c_1$ and $c_2$ are real constants, then $$\frac{\partial[c_1 f(\mathbf{x}) + c_2 g(\mathbf{x})]}{\partial{\mathbf{x}}}=c_1\frac{\partial f(\mathbf{x})}{\partial{\mathbf{x}}}+c_2\frac{\partial g(\mathbf{x})}{\partial{\mathbf{x}}}$$
  3. Product rule:
    • If $f(\mathbf{x})$ and $g(\mathbf{x})$ are real-valued functions of the vector $\mathbf{x}$, then $$\frac{\partial[f(\mathbf{x})g(\mathbf{x})]}{\partial{\mathbf{x}}}=g(\mathbf{x})\frac{\partial f(\mathbf{x})}{\partial{\mathbf{x}}}+f(\mathbf{x})\frac{\partial g(\mathbf{x})}{\partial{\mathbf{x}}}$$
    • If $f(\mathbf{x})$, $g(\mathbf{x})$, and $h(\mathbf{x})$ are real-valued functions of the vector $\mathbf{x}$, then $$\frac{\partial[f(\mathbf{x})g(\mathbf{x})h(\mathbf{x})]}{\partial{\mathbf{x}}}=g(\mathbf{x})h(\mathbf{x})\frac{\partial f(\mathbf{x})}{\partial{\mathbf{x}}}+f(\mathbf{x})h(\mathbf{x})\frac{\partial g(\mathbf{x})}{\partial{\mathbf{x}}}+f(\mathbf{x})g(\mathbf{x})\frac{\partial h(\mathbf{x})}{\partial{\mathbf{x}}}$$
  4. Quotient rule: if $f(\mathbf{x})$ and $g(\mathbf{x})$ are real-valued functions of the vector $\mathbf{x}$, and $g(\mathbf{x})\neq 0$, then $$\frac{\partial[f(\mathbf{x})/g(\mathbf{x})]}{\partial{\mathbf{x}}}=\frac{1}{g^2(\mathbf{x})}\bigg[g(\mathbf{x})\frac{\partial f(\mathbf{x})}{\partial{\mathbf{x}}}-f(\mathbf{x})\frac{\partial g(\mathbf{x})}{\partial{\mathbf{x}}}\bigg]$$
  5. Chain rule: if $\mathbf{y}(\mathbf{x})$ is an $n\times 1$ vector function of the $n\times 1$ vector $\mathbf{x}$, and $f(\mathbf{y})$ is a real-valued function, then $$\frac{\partial f(\mathbf{y}(\mathbf{x}))}{\partial{\mathbf{x}}}=\frac{\partial \mathbf{y}^T(\mathbf{x})}{\partial{\mathbf{x}}}\frac{\partial f(\mathbf{y})}{\partial{\mathbf{y}}}$$where $$\frac{\partial \mathbf{y}^T(\mathbf{x})}{\partial{\mathbf{x}}}$$is an $n\times n$ matrix
  6. If $\mathbf{a}$ is an $n \times 1$ constant vector that does not depend on $\mathbf{x}$, then $$ \frac{\partial \mathbf{a}^T\mathbf{y}(\mathbf{x})}{\partial \mathbf{x}}=\frac{\partial \mathbf{y}^T(\mathbf{x})\mathbf{a}}{\partial \mathbf{x}}=\frac{\partial \mathbf{y}^T(\mathbf{x})}{\partial \mathbf{x}}\mathbf{a}$$Letting $\mathbf{y}(\mathbf{x})=\mathbf{x}$ gives $$\frac{\partial \mathbf{a}^T\mathbf{x}}{\partial{\mathbf{x}}}=\frac{\partial \mathbf{x}^T\mathbf{a}}{\partial{\mathbf{x}}}=\mathbf{a}$$Letting $\mathbf{a}={\mathbf{Ay}}$ gives $$\frac{\partial \mathbf{x}^T\mathbf{a}}{\partial \mathbf{x}}=\frac{\partial \mathbf{x}^T\mathbf{Ay}}{\partial \mathbf{x}}=\mathbf{Ay}$$and letting $\mathbf{a}=\mathbf{A}^T\mathbf{y}$ gives $$\frac{\partial \mathbf{a}^T \mathbf{x}}{\partial \mathbf{x}}=\frac{\partial \mathbf{y}^T \mathbf{Ax}}{\partial \mathbf{x}}=\mathbf{A}^T\mathbf{y}$$
  7. If $\mathbf{y}(\mathbf{x})$ and $\mathbf{z}(\mathbf{x})$ are vector functions of $\mathbf{x}$, then $$\frac{\partial [\mathbf{y}^T(\mathbf{x})\mathbf{Az}(\mathbf{x})]}{\partial \mathbf{x}}=\frac{\partial \mathbf{y}^T(\mathbf{x})}{\partial \mathbf{x}}\mathbf{Az}(\mathbf{x})+\frac{\partial \mathbf{z}^T(\mathbf{x})}{\partial {\mathbf{x}}}\mathbf{A}^T\mathbf{y}(\mathbf{x})$$Letting $\mathbf{z}(\mathbf{x})=\mathbf{y}(\mathbf{x})$ gives $$\frac{\partial [\mathbf{y}^T(\mathbf{x})\mathbf{Ay}(\mathbf{x})]}{\partial \mathbf{x}}=\frac{\partial \mathbf{y}^T(\mathbf{x})}{\partial \mathbf{x}}\mathbf{Ay}(\mathbf{x})+\frac{\partial \mathbf{y}^T(\mathbf{x})}{\partial {\mathbf{x}}}\mathbf{A}^T\mathbf{y}(\mathbf{x})=\frac{\partial \mathbf{y}^T(\mathbf{x})}{\partial {\mathbf{x}}}(\mathbf{A}+\mathbf{A}^T)\mathbf{y}(\mathbf{x})$$Letting $\mathbf{y}=\mathbf{a}-\mathbf{Bx}$ and assuming $\mathbf{A}$ is symmetric, $$\frac{\partial [(\mathbf{a}-\mathbf{Bx})^T\mathbf{A}(\mathbf{a}-\mathbf{Bx})]}{\partial \mathbf{x}}=\frac{\partial (\mathbf{a}-\mathbf{Bx})^T}{\partial {\mathbf{x}}}(\mathbf{A}+\mathbf{A}^T)(\mathbf{a}-\mathbf{Bx})=-2\mathbf{B}^T\mathbf{A}(\mathbf{a}-\mathbf{Bx})$$Finally, letting $\mathbf{y}(\mathbf{x})=\mathbf{x}$, $$\frac{\partial [\mathbf{x}^T\mathbf{Ax}]}{\partial \mathbf{x}}=\frac{\partial \mathbf{x}^T}{\partial \mathbf{x}}\mathbf{Ax}+\frac{\partial \mathbf{x}^T}{\partial {\mathbf{x}}}\mathbf{A}^T\mathbf{x}=(\mathbf{A}+\mathbf{A}^T)\mathbf{x}$$(rules 6 and 7 are checked numerically in the sketch after this list)
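
As a quick sanity check, the sketch below verifies $\frac{\partial \mathbf{a}^T\mathbf{x}}{\partial \mathbf{x}}=\mathbf{a}$ and $\frac{\partial \mathbf{x}^T\mathbf{Ax}}{\partial \mathbf{x}}=(\mathbf{A}+\mathbf{A}^T)\mathbf{x}$ with central finite differences. The helper `numerical_gradient` and the random test data are ad hoc choices for illustration, not part of any library:

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of scalar f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
a = rng.standard_normal(n)
x = rng.standard_normal(n)

# Rule 6: d(a^T x)/dx = a
assert np.allclose(numerical_gradient(lambda v: a @ v, x), a, atol=1e-5)
# Rule 7 (last identity): d(x^T A x)/dx = (A + A^T) x
assert np.allclose(numerical_gradient(lambda v: v @ A @ v, x), (A + A.T) @ x, atol=1e-4)
print("gradient identities verified")
```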

2. The Hessian matrix

  1. The second-order partial derivatives of a real-valued function $f(\mathbf{x})$ with respect to the $n\times 1$ real vector $\mathbf{x}$ form a matrix of $n^2$ entries, defined as $$\frac{\partial ^2 f (\mathbf{x})}{\partial \mathbf{x} \partial \mathbf{x}^T}=\frac{\partial}{\partial \mathbf{x}^T}\bigg[\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}\bigg]$$or, in shorthand, $$\nabla_x^2 f(\mathbf{x})=\nabla_x (\nabla_x f(\mathbf{x}))$$Its $(i,j)$ entry is $$\bigg[\frac{\partial ^2 f (\mathbf{x})}{\partial \mathbf{x} \partial \mathbf{x}^T}\bigg]_{ij}=\frac{\partial ^2 f (\mathbf{x})}{\partial x_i \partial x_j}$$
  2. The Hessian matrix can be computed in two steps:
    1. Differentiate the real-valued function $f(\mathbf{x})$ with respect to the column vector $\mathbf{x}$ to obtain the gradient $\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}$
    
    2. Differentiate the gradient $\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}$ with respect to the $1 \times n$ row vector $\mathbf{x}^T$; the gradient of the gradient is the Hessian matrix
  3. Several Hessian formulas (the third is verified numerically in the sketch after this list):
    1. For an $n \times 1$ constant vector $\mathbf{a}$, $$\frac{\partial ^2 \mathbf{a}^T \mathbf{x}}{\partial \mathbf{x} \partial \mathbf{x}^T}=\mathbf{O}_{n \times n}$$
    2. For an $n \times n$ matrix $\mathbf{A}$, $$\frac{\partial ^2 \mathbf{x}^T \mathbf{A} \mathbf{x}}{\partial \mathbf{x} \partial \mathbf{x}^T}=\mathbf{A}+\mathbf{A}^T$$Note that the second differentiation is with respect to $\mathbf{x}^T$
    3. Let $\mathbf{x}$ be an $n \times 1$ column vector, $\mathbf{a}$ an $m \times 1$ column vector, $\mathbf{A}$ a symmetric $m \times m$ matrix, and $\mathbf{B}$ an $m \times n$ matrix; then $$\frac{\partial ^2 [(\mathbf{a}-\mathbf{Bx})^T\mathbf{A}(\mathbf{a}-\mathbf{Bx})]}{\partial \mathbf{x} \partial \mathbf{x}^T}=2\mathbf{B}^T\mathbf{A}\mathbf{B}$$
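
The third formula can be checked the same way as the gradient identities. The sketch below compares a finite-difference Hessian of $(\mathbf{a}-\mathbf{Bx})^T\mathbf{A}(\mathbf{a}-\mathbf{Bx})$ against $2\mathbf{B}^T\mathbf{A}\mathbf{B}$; the helper `numerical_hessian`, the dimensions, and the random data are illustrative assumptions:

```python
import numpy as np

def numerical_hessian(f, x, eps=1e-4):
    """Second-order central differences for the Hessian of scalar f at x."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps**2)
    return H

rng = np.random.default_rng(1)
m, n = 5, 3
B = rng.standard_normal((m, n))
a = rng.standard_normal(m)
M = rng.standard_normal((m, m))
A = M + M.T                      # make A symmetric, as formula 3 requires

f = lambda x: (a - B @ x) @ A @ (a - B @ x)
x0 = rng.standard_normal(n)
assert np.allclose(numerical_hessian(f, x0), 2 * B.T @ A @ B, atol=1e-3)
print("Hessian formula verified")
```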

3. Taylor's theorem

  1. Taylor's theorem with remainder in one variable: $$ f(b)=f(a)+(b-a)f'(a) + \frac{(b-a)^2}{2!}f''(a)+\frac{(b-a)^3}{3!}f'''(a)+ \dots +\frac{(b-a)^n}{n!}f^{(n)}(a)+\frac{(b-a)^{n+1}}{(n+1)!}f^{(n+1)}(\xi)$$where $\xi$ lies between $a$ and $b$
  2. The multivariate version with remainder is more involved; what is commonly used is the expansion up to the first three terms (see the numerical check below): $$f(\mathbf{b})=f(\mathbf{a})+(\mathbf{b}-\mathbf{a})^T \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}\bigg\rvert_{\mathbf{x}=\mathbf{a}}+\frac{1}{2}{(\mathbf{b}-\mathbf{a})^T}\frac{\partial^2 f(\mathbf{x})}{\partial \mathbf{x} \partial \mathbf{x}^T}\bigg\rvert_{\mathbf{x}=\mathbf{a}}{(\mathbf{b}-\mathbf{a})}+ \dots$$where $$\mathbf{J}_f(\mathbf{x})=\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}$$is called the Jacobian (for a scalar $f$ this is just the gradient) and $$\mathbf{H}_f(\mathbf{x})=\frac{\partial^2 f(\mathbf{x})}{\partial \mathbf{x} \partial \mathbf{x}^T}$$is called the Hessian
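
To make the expansion concrete, the sketch below compares the first- and second-order Taylor models of a hand-picked test function $f(\mathbf{x})=e^{x_1}\sin x_2$, chosen here only because its gradient and Hessian have simple closed forms; the errors should shrink roughly like $O(h^2)$ and $O(h^3)$ respectively:

```python
import numpy as np

def f(x):
    return np.exp(x[0]) * np.sin(x[1])

def grad_f(x):
    return np.array([np.exp(x[0]) * np.sin(x[1]),
                     np.exp(x[0]) * np.cos(x[1])])

def hess_f(x):
    e, s, c = np.exp(x[0]), np.sin(x[1]), np.cos(x[1])
    return np.array([[e * s,  e * c],
                     [e * c, -e * s]])

a = np.array([0.3, 0.7])
for h in [1e-1, 1e-2, 1e-3]:
    b = a + h * np.array([1.0, -0.5])
    d = b - a
    first  = f(a) + d @ grad_f(a)              # first-order Taylor model
    second = first + 0.5 * d @ hess_f(a) @ d   # second-order Taylor model
    # first-order error ~ O(h^2), second-order error ~ O(h^3)
    print(f"h={h:.0e}  |f-first|={abs(f(b)-first):.2e}  |f-second|={abs(f(b)-second):.2e}")
```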

4. Gradient descent

  1. Gradient descent builds on the fact that a function decreases fastest along the negative gradient direction.
  2. Suppose the objective function to be minimized decomposes over the $n$ training samples as $$J(\pmb{\omega})=\sum_{i=1}^{n}{J_{i}(\pmb{\omega})}$$
  3. BGD (Batch Gradient Descent) uses all the data for each gradient update: $$\pmb{\omega}_{k+1}=\pmb{\omega}_{k}-\eta \cdot \nabla_{\pmb{\omega}} J(\pmb{\omega})$$
  4. SGD (Stochastic Gradient Descent) uses a single, randomly chosen sample per update: $$\pmb{\omega}_{k+1}=\pmb{\omega}_{k}-\eta \cdot \nabla_{\pmb{\omega}} J_{i}(\pmb{\omega})$$
  5. mini-BGD uses a subset of the data per update: $$\pmb{\omega}_{k+1}=\pmb{\omega}_{k}-\eta \cdot \nabla_{\pmb{\omega}} \sum_{i=1}^{p}{J_{i}(\pmb{\omega})}$$where $p \lt n$ is the batch size (a runnable sketch follows this list)
  6. More gradient descent variants are surveyed in the arXiv paper An overview of gradient descent optimization algorithms
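
Below is a minimal mini-batch gradient descent sketch for a least-squares objective $J(\pmb{\omega})=\sum_i(\mathbf{x}_i^T\pmb{\omega}-y_i)^2$; setting the batch size $p=n$ recovers BGD and $p=1$ recovers SGD. The synthetic data, step size $\eta$, batch size, and the per-batch averaging are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(42)
n, d = 200, 3
X = rng.standard_normal((n, d))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.standard_normal(n)

def minibatch_gd(X, y, p=16, eta=0.05, epochs=100):
    """Mini-batch gradient descent on the squared-error loss."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(y))          # reshuffle samples each epoch
        for start in range(0, len(y), p):
            batch = idx[start:start + p]
            residual = X[batch] @ w - y[batch]
            # gradient of the batch loss, averaged over the batch
            grad = 2 * X[batch].T @ residual / len(batch)
            w -= eta * grad
    return w

print("estimate:", minibatch_gd(X, y))
print("truth:   ", w_true)
```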

5. Newton's method / quasi-Newton methods

  1. Newton's method for finding a zero of $f$. Expand $f(\pmb{x})=f(\pmb{a})+ J_{f}^{T}(\pmb{a})(\pmb{x}-\pmb{a})+\frac{1}{2}{(\pmb{x}-\pmb{a})^T}H_{f}(\pmb{\xi}){(\pmb{x}-\pmb{a})}$. Since we seek a zero, set $f(\pmb{x})=0$: $$f(\pmb{a})+ J_{f}^{T}(\pmb{a})(\pmb{x}-\pmb{a})+\frac{1}{2}{(\pmb{x}-\pmb{a})^T}H_{f}(\pmb{\xi}){(\pmb{x}-\pmb{a})}=0$$Ignoring the second-order term gives $$\pmb{x}=\pmb{a}-J_{f}^{-T}(\pmb{a})f(\pmb{a})$$Letting $\pmb{x}=\pmb{x}_{k+1}$, $\pmb{a}=\pmb{x}_{k}$ yields the iteration $$\pmb{x}_{k+1}=\pmb{x}_{k}-J_{f}^{-T}(\pmb{x}_{k})f(\pmb{x}_{k})$$(in one dimension this reduces to the familiar $x_{k+1}=x_{k}-f(x_{k})/f'(x_{k})$)
  2. Newton's method for finding an extremum of $f$. Use the second-order expansion $f(\pmb{x})=f(\pmb{a})+ J_{f}^{T}(\pmb{a})(\pmb{x}-\pmb{a})+\frac{1}{2}{(\pmb{x}-\pmb{a})^T}H_{f}(\pmb{a}){(\pmb{x}-\pmb{a})}+O(\lVert\pmb{x}-\pmb{a}\rVert^{3})$. Ignoring the third-order remainder and taking the gradient of both sides with respect to $\pmb{x}$ gives $$\nabla_{x} f(\pmb{x})=J_{f}(\pmb{a})+H_{f}(\pmb{a})(\pmb{x}-\pmb{a})$$At an extremum the gradient vanishes, $\nabla_{x} f(\pmb{x})=\pmb{0}$, so $$\pmb{x}=\pmb{a}-H_{f}^{-1}(\pmb{a})J_f(\pmb{a})$$Letting $\pmb{x}=\pmb{x}_{k+1}$, $\pmb{a}=\pmb{x}_{k}$ yields the iteration (implemented in the sketch after this list) $$\pmb{x}_{k+1}=\pmb{x}_{k}-H_{f}^{-1}(\pmb{x}_{k})J_f(\pmb{x}_{k})$$
  3. A weakness of Newton's method is that computing the inverse of the Hessian is very expensive, so quasi-Newton methods attempt to replace the exact inverse Hessian with cheaper approximations. A PhD thesis that may be worth consulting is Analytical study of the Least Squares Quasi-Newton method for interaction problems
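
Below is a minimal sketch of the Newton iteration $\pmb{x}_{k+1}=\pmb{x}_{k}-H_{f}^{-1}(\pmb{x}_{k})J_f(\pmb{x}_{k})$ on the Rosenbrock function, a standard test problem chosen here only for illustration. Note that the code solves the linear system $H_f\,\mathbf{d}=J_f$ rather than forming the inverse explicitly; avoiding even that cost is what quasi-Newton methods aim for:

```python
import numpy as np

# Rosenbrock function, its gradient, and its Hessian (all in closed form).
def f(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def grad(x):
    return np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                     200 * (x[1] - x[0]**2)])

def hess(x):
    return np.array([[2 - 400 * (x[1] - 3 * x[0]**2), -400 * x[0]],
                     [-400 * x[0], 200.0]])

x = np.array([-1.2, 1.0])
for k in range(50):
    step = np.linalg.solve(hess(x), grad(x))  # solve H d = J instead of inverting H
    x = x - step
    if np.linalg.norm(step) < 1e-10:
        break
print(f"stopped after {k + 1} iterations at x = {x}")  # expect x close to (1, 1)
```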
