In [1]:
# %load /Users/facai/Study/book_notes/preconfig.py
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
sns.set(font='SimHei')
plt.rcParams['axes.grid'] = False
from IPython.display import SVG
def show_image(filename, figsize=None):
    if figsize:
        plt.figure(figsize=figsize)
    plt.imshow(plt.imread(filename))
sklearn code version:
~/W/s/sklearn ❯❯❯ git log -n 1
commit d161bfaa1a42da75f4940464f7f1c524ef53484f
Author: John B Nelson <jnelso11@gmu.edu>
Date: Thu May 26 18:36:37 2016 -0400
Add missing double quote (#6831)
The GBDT module lives in the file sklearn/ensemble/gradient_boosting.py; the class relationships are fairly simple.
In [2]:
SVG("./res/Main.svg")
Out[2]:
The classic Gradient Boost algorithm has four main steps per iteration, which can be written mathematically as [1]:
$F_0(x) = \operatorname{arg \, min}_\rho \displaystyle \sum_{i=1}^N L(y_i, \rho)$
For $m=1$ to $M$ do:
$\tilde{y}_i = - \left [ \frac{\partial L (y_i, F(x_i))}{\partial F(x_i)} \right ]_{F(x) = F_{m-1}(x)}, \quad i = 1, 2, \dotsc, N$
$\mathbf{a}_m = \operatorname{arg \, min}_{\mathbf{a}, \beta} \displaystyle \sum_{i=1}^N \left [ \tilde{y}_i - \beta h(x_i; \mathbf{a}) \right ]^2$
$\rho_m = \operatorname{arg \, min}_\rho \displaystyle \sum_{i=1}^N L \left ( y_i, F_{m-1}(x_i) + \rho h(x_i; \mathbf{a}_m) \right )$
$F_m(x) = F_{m-1}(x) + l_r \cdot \rho_m h(x; \mathbf{a}_m)$
Here $L$ is the loss function, $l_r$ is the learning rate, $\rho$ is the model weight, $h(x_i; \mathbf{a})$ is a decision-tree model with tree parameters $\mathbf{a}$, and $F$ is the final additive model.
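To make these four steps concrete, here is a minimal sketch of the loop in Python (illustrative only, not sklearn's implementation; it assumes the squared-error loss, for which the negative gradient is just the residual, uses sklearn's DecisionTreeRegressor as $h$, and folds the weight $\rho_m$ into the tree so that only the learning rate $l_r$ appears in the update):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, M=100, lr=0.1, max_depth=3):
    """Minimal sketch of the four steps above for squared-error loss."""
    # Step 0: F_0(x) = arg min_rho sum L(y_i, rho) -> the mean for squared error
    f0 = y.mean()
    F = np.full(len(y), f0)
    trees = []
    for m in range(M):
        # Step 1: pseudo-residuals (negative gradient at F_{m-1})
        residual = y - F
        # Step 2: fit a regression tree h(x; a_m) to the residuals
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        # Steps 3-4: the line search is trivial for squared error (rho_m = 1 here),
        # so simply accumulate F_m = F_{m-1} + lr * h(x; a_m)
        F = F + lr * tree.predict(X)
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(X, f0, trees, lr=0.1):
    return f0 + lr * sum(tree.predict(X) for tree in trees)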
By exploiting some mathematics and properties of the loss function, the weight search in step 3 can be simplified. The per-stage training code in sklearn is:
748 def _fit_stage(self, i, X, y, y_pred, sample_weight, sample_mask,
749 random_state, X_idx_sorted, X_csc=None, X_csr=None):
750 """Fit another stage of ``n_classes_`` trees to the boosting model. """
751 #+-- 8 lines: assert sample_mask.dtype == np.bool----------
759
760 residual = loss.negative_gradient(y, y_pred, k=k,
761 sample_weight=sample_weight)
762
763 # induce regression tree on residuals
764 tree = DecisionTreeRegressor(
765 criterion='friedman_mse',
766 splitter='best',
767 #+--- 7 lines: max_depth=self.max_depth,------------------
774 presort=self.presort)
775
776 #+--- 8 lines: if self.subsample < 1.0:------------------
784 tree.fit(X, residual, sample_weight=sample_weight,
785 check_input=False, X_idx_sorted=X_idx_sorted)
786
787 # update tree leaves
788 #+-- 5 lines: if X_csr is not None:---------------------
793 loss.update_terminal_regions(tree.tree_, X, y, residual, y_pred,
794 sample_weight, sample_mask,
795 self.learning_rate, k=k)
796
797 # add tree to ensemble
798 self.estimators_[i, k] = tree
799
800 return y_pred
In this code, negative_gradient computes the residual of step 1, tree.fit fits the regression tree of step 2, and update_terminal_regions applies the step-3 weights scaled by the learning rate (the step-4 accumulation happens inside it, on y_pred).
Next we work through the step-3 optimization for two loss functions, squared error and absolute error, to pave the way for TreeBoost.
[1]: Friedman - Greedy function approximation: A gradient boosting machine
This loss function is defined as $L(y, F) = \frac{1}{2} (y - F)^2$, so its derivative is $\frac{\partial L}{\partial F} = -(y - F)$. Substituting this partial derivative into step 1 gives:
\begin{align} \tilde{y}_i &= - \left [ \frac{\partial L (y_i, F(x_i))}{\partial F(x_i)} \right ]_{F(x) = F_{m-1}(x)} \\ &= y_i - F_{m-1}(x_i), \quad i = 1, 2, \dotsc, N \\ \end{align}Substituting $L$ and $\tilde{y}$ into step 3 and simplifying, we get:
\begin{align} \rho_m &= \operatorname{arg \, min}_\rho \displaystyle \sum_{i=1}^N L \left ( y_i, F_{m-1}(x_i) + \rho h(x_i; \mathbf{a}_m) \right) \\ &= \operatorname{arg \, min}_\rho \displaystyle \sum_{i=1}^N \frac{1}{2} \left ( y_i - F_{m-1}(x_i) - \rho h(x_i; \mathbf{a}_m) \right)^2 \\ &= \operatorname{arg \, min}_\rho \displaystyle \sum_{i=1}^N \frac{1}{2} \left ( \tilde{y}_i - \rho h(x_i; \mathbf{a}_m) \right)^2 \\ &= \operatorname{arg \, min}_\rho \displaystyle \sum_{i=1}^N \left ( \tilde{y}_i - \rho h(x_i; \mathbf{a}_m) \right)^2 \\ &= \operatorname{arg \, min}_\beta \displaystyle \sum_{i=1}^N \left ( \tilde{y}_i - \beta h(x_i; \mathbf{a}_m) \right)^2 \quad \text{rename $\rho$ as $\beta$}\\ &= \beta_m \end{align}In other words, for the squared-error loss the optimal weight is exactly the $\beta$ value from step 2 that guided the fitting of the tree.
This algorithm is called LS_Boost; the full procedure is:
$F_0(x) = \bar{y}$
For $m=1$ to $M$ do:
$\tilde{y}_i = y_i - F_{m-1}(x_i), \quad i = 1, 2, \dotsc, N$
$(\rho_m, \mathbf{a}_m) = \operatorname{arg \, min}_{\mathbf{a}, \rho} \displaystyle \sum_{i=1}^N \left [ \tilde{y}_i - \rho h(x_i; \mathbf{a}) \right ]^2$
$F_m(x) = F_{m-1}(x) + l_r \cdot \rho_m h(x; \mathbf{a}_m)$
The corresponding sklearn code:
274 class LeastSquaresError(RegressionLossFunction):
275 """Loss function for least squares (LS) estimation.
276 Terminal regions need not to be updated for least squares. """
277 def init_estimator(self):
278 return MeanEstimator()
279
280 def __call__(self, y, pred, sample_weight=None):
281 if sample_weight is None:
282 return np.mean((y - pred.ravel()) ** 2.0)
283 else:
284 return (1.0 / sample_weight.sum() *
285 np.sum(sample_weight * ((y - pred.ravel()) ** 2.0)))
286
287 def negative_gradient(self, y, pred, **kargs):
288 return y - pred.ravel()
289
290 def update_terminal_regions(self, tree, X, y, residual, y_pred,
291 sample_weight, sample_mask,
292 learning_rate=1.0, k=0):
293 """Least squares does not need to update terminal regions.
294
295 But it has to update the predictions.
296 """
297 # update predictions
298 y_pred[:, k] += learning_rate * tree.predict(X).ravel()
299
300 def _update_terminal_region(self, tree, terminal_regions, leaf, X, y,
301 residual, pred, sample_weight):
302 pass
As noted earlier, sklearn takes $\beta = 1$, so $l_r \times \beta = l_r$; hence update_terminal_regions simply multiplies the learning rate by the tree's predictions.
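As a quick usage sketch (hedged: in this 2016-era version of sklearn the squared-error loss is selected with loss='ls'; the data below is made up for illustration):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.1, size=200)

# LS_Boost: squared-error loss, leaf values are plain residual means,
# scaled only by the learning rate (beta = 1 as discussed above).
est = GradientBoostingRegressor(loss='ls', learning_rate=0.1,
                                n_estimators=100, max_depth=3)
est.fit(X, y)
print(est.predict(X[:5]))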
The loss function is now $L(y, F) = | y - F |$; as before, the residual follows from the derivative:
\begin{align} \tilde{y}_i &= - \left [ \frac{\partial L (y_i, F(x_i))}{\partial F(x_i)} \right ]_{F(x) = F_{m-1}(x)} \\ &= \operatorname{sign}(y_i - F_{m-1}(x_i)), \quad i = 1, 2, \dotsc, N \\ \end{align}As a function of $y - F$, the loss $L(y, F)$ is piecewise linear: its slope is $+1$ on the positive side and $-1$ on the negative side, and at the kink $y = F$ it is not differentiable, so by convention we set the value there to 0. The sign function captures this:
\begin{equation} \operatorname{sign}(x) := { \begin{cases} -1 & {\text{if }} x<0, \\ 0 & {\text{if }} x=0, \\ 1 & {\text{if }} x>0. \end{cases} } \end{equation}As before, substituting $L$ and $\tilde{y}$ into step 3 and simplifying gives:
\begin{align} \rho_m &= \operatorname{arg \, min}_\rho \displaystyle \sum_{i=1}^N L \left ( y_i, F_{m-1}(x_i) + \rho h(x_i; \mathbf{a}_m) \right) \\ &= \operatorname{arg \, min}_\rho \sum_{i=1}^N \big | y_i - F_{m-1}(x_i) - \rho h(x_i; \mathbf{a}_m) \big | \\ &= \operatorname{arg \, min}_\rho \sum_{i=1}^N \big | h(x_i; \mathbf{a}_m) \big | \cdot \left | \frac{y_i - F_{m-1}(x_i)}{h(x_i; \mathbf{a}_m)} - \rho \right | \\ &= \operatorname{median}_W \left \{ \frac{y_i - F_{m-1}(x_i)}{h(x_i; \mathbf{a}_m)} \right \}_1^N, \quad w_i = \big | h(x_i; \mathbf{a}_m) \big | \end{align}Here $\operatorname{median}_W \{ \cdot \}$ denotes the weighted median with weights $w_i$.
Note what the weighted median $\operatorname{median}_W (x)$ means: the weighting should be read as "$x_i$ occurs $w_i$ times", not as the value $x_i \cdot w_i$. You can check the Wikipedia definition, or follow the derivation details below; it is a little awkward to state but actually quite simple.
In sklearn, LAD is implemented via TreeBoost, so there is no code corresponding to this plain gradient-boost form.
The last step above is a bit of a leap, so let us spell out the derivation.
Before that, we simplify the notation to make things easier to follow:
\begin{align} \rho_m &= \operatorname{arg \, min}_\rho \sum_{i=1}^N \big | h(x_i; \mathbf{a}_m) \big | \cdot \left | \frac{y_i - F_{m-1}(x_i)}{h(x_i; \mathbf{a}_m)} - \rho \right | \\ &= \operatorname{arg \, min}_\rho \sum_{i=1}^N w_i \cdot | t_i - \rho | \\ \end{align}where $t_i = \frac{y_i - F_{m-1}(x_i)}{h(x_i; \mathbf{a}_m)}$ and $w_i = | h(x_i; \mathbf{a}_m) |$. First consider the unweighted case, $w_i = 1$. Then $\rho_m = \operatorname{arg \, min}_\rho \sum_{i=1}^N | t_i - \rho |$.
There are two ways to see why $\rho_m = \operatorname{median}(t_i)$:
The first is geometric: $| t_i - \rho |$ is the distance between two points on the line, so the $\rho_m$ we seek is simply the point with the smallest total distance to all the points $t_i$, which a picture makes intuitive.
In [3]:
show_image("./res/rho_m.png")
In the figure above, the green segments are the distances from the median point to every point, and the yellow segments are the distances from some other point; as that point moves outward, the total distance keeps growing. In short: with an odd number of points, $\rho_m$ is the middle point; with an even number, $\rho_m$ can be anywhere on the closed interval between the two middle points.
The second method is algebraic.
First, a quick intuition for the result. For $\rho_m = \operatorname{arg \, min}_\rho \sum_{i=1}^N | t_i - \rho | = \operatorname{arg \, min} f(\rho)$, the derivative is $f'(\rho) = -\sum_{i=1}^N \operatorname{sign}(t_i - \rho)$. At an extremum $f'(\rho) = 0$, and since $\operatorname{sign}(x)$ only takes the values $1, -1, 0$, the sum can vanish only when the numbers of $+1$s and $-1$s match, which is exactly what the median point satisfies [2].
There is also a tighter argument I saw on Mathematics Stack Exchange [2]. It starts from the far left, then advances interval by interval to the right, showing that the total distance first decreases monotonically and then increases monotonically, so the minimum sits at the median point. In detail:
Let the set $T$ contain $N$ elements $t_1 < t_2 < \dots < t_N$.
For the far-left region $\rho < t_1$, we have $f(\rho) = \sum_{i=1}^N | t_i - \rho | = \sum_{i=1}^N (t_i - \rho)$. Clearly every term satisfies $t_i - \rho > 0$, and as $\rho \to t_1$ each term $(t_i - \rho)$ decreases. That is, as $\rho$ moves right toward $t_1$, all terms shrink, so the sum $f(\rho)$ is decreasing.
Pick any interval $t_k \leq \rho \leq \rho + d \leq t_{k+1}$. Moving from $\rho$ to $\rho + d$ increases the distance to each of the $k$ points on the left by $d$ and decreases the distance to each of the $N - k$ points on the right by $d$, so $f(\rho + d) = f(\rho) + k d - (N - k) d = f(\rho) + (2k - N) d$. Hence:
\begin{equation} \begin{cases} f(\rho + d) < f(\rho), \quad & \text{when } k < \frac{N}{2} \\ f(\rho + d) = f(\rho), \quad & \text{when } k = \frac{N}{2} \\ f(\rho + d) > f(\rho), \quad & \text{when } k > \frac{N}{2} \\ \end{cases} \end{equation}That is, $f(\rho)$ is decreasing while $k < \frac{N}{2}$ and increasing once $k > \frac{N}{2}$, so it attains its minimum at the median position $\frac{N}{2}$. Therefore $\rho_m = \operatorname{median}(t_i)$.
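A small numerical sanity check of this conclusion (a sketch, not part of the original notebook): brute-force the minimizer of $f(\rho)$ over a grid and compare it with np.median.

import numpy as np

rng = np.random.RandomState(0)
t = rng.normal(size=11)                      # odd number of points

grid = np.linspace(t.min(), t.max(), 10001)  # candidate rho values
f = np.abs(t[None, :] - grid[:, None]).sum(axis=1)

rho_best = grid[f.argmin()]
print(rho_best, np.median(t))                # should be (almost) identical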
The weighted case looks harder to attack directly, but a change of perspective yields a neat idea.
Suppose $w_i$ is a positive integer; then $w_i \cdot | t_i - \rho | = \sum_{k=1}^{w_i} | t_i - \rho |$, which amounts to duplicating the point $t_i$ in the set $T$ $w_i$ times. We can therefore expand:
\begin{align} \rho_m &= \operatorname{arg \, min}_\rho \sum_{i=1}^N w_i \cdot | t_i - \rho | \\ &= \operatorname{arg \, min}_\rho \sum_{i=1}^N \sum_{k=1}^{w_i} | t_i - \rho | \\ &= \operatorname{arg \, min}_\rho \sum_{t_i \in T'} | t_i - \rho | \quad \text{$T'$ contains each $t_i$ duplicated $w_i$ times} \end{align}The weighted problem thus reduces to the unweighted one, and the solution can be stated as: sort the $t_i$, accumulate their weights, and take the first point at which the cumulative weight reaches half the total weight.
The corresponding sklearn code:
51 def _weighted_percentile(array, sample_weight, percentile=50):
52 """Compute the weighted ``percentile`` of ``array`` with ``sample_weight``. """
53 sorted_idx = np.argsort(array)
54
55 # Find index of median prediction for each sample
56 weight_cdf = sample_weight[sorted_idx].cumsum()
57 percentile_idx = np.searchsorted(
58 weight_cdf, (percentile / 100.) * weight_cdf[-1])
59 return array[sorted_idx[percentile_idx]]
As for what to do when the $w_i$ are fractional, I do not have a clear picture yet; see the Wikipedia definition of Weighted median. I will dig into it when I have time.
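To see the "each $t_i$ occurs $w_i$ times" reading in action, here is a small check (a sketch assuming integer weights) that the cumulative-weight rule used by _weighted_percentile above agrees with the plain median of the expanded set:

import numpy as np

def weighted_median(t, w):
    """Same idea as sklearn's _weighted_percentile with percentile=50."""
    order = np.argsort(t)
    cdf = np.cumsum(w[order])
    idx = np.searchsorted(cdf, 0.5 * cdf[-1])
    return t[order[idx]]

t = np.array([1.0, 2.0, 3.0, 10.0])
w = np.array([1,   3,   1,   1])      # integer weights

expanded = np.repeat(t, w)            # the "t_i occurs w_i times" view
print(weighted_median(t, w), np.median(expanded))  # both give 2.0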
Let us first recap the Gradient Boost procedure; its training steps are:
For $m=1$ to $M$ do:
$\tilde{y}_i = - \left [ \frac{\partial L (y_i, F(x_i))}{\partial F(x_i)} \right ]_{F(x) = F_{m-1}(x)}, \quad i = 1, 2, \dotsc, N$
$\mathbf{a}_m = \operatorname{arg \, min}_{\mathbf{a}, \beta} \displaystyle \sum_{i=1}^N \left [ \tilde{y}_i - \beta h(x_i; \mathbf{a}) \right ]^2$
$\rho_m = \operatorname{arg \, min}_\rho \displaystyle \sum_{i=1}^N L \left ( y_i, F_{m-1}(x_i) + \rho h(x_i; \mathbf{a}_m) \right )$
$F_m(x) = F_{m-1}(x) + l_r \cdot \rho_m h(x; \mathbf{a}_m)$
TreeBoost internalizes the external weight search into the decision tree itself, which both refines the model and speeds it up. To push the external parameter into the decision tree $h(x_i; \mathbf{a}_m)$, the first problem is to "open up" this function, i.e. describe it with an analytic expression. In other words, we need a mathematical model of the decision tree.
A regression tree with $J$ leaves can be written as a sum:
\begin{equation} h(x; \{b_j, R_j\}_1^J) = \displaystyle \sum_{j=1}^J b_j \, \mathbf{1}(x \in R_j) \end{equation}where the $R_j$ are the leaf regions and each leaf value is the mean of the samples falling in it, $b_j = \operatorname{ave}_{x_i \in R_j} y_i$. Since the leaves are disjoint, the formula simply means: if $x \in R_j$, then $h(x) = b_j$.
Plugging this tree expression into step 4 of Gradient Boost gives:
\begin{align} F_m(x) &= F_{m-1}(x) + \rho_m h(x; \mathbf{a}_m) \\ &= F_{m-1}(x) + \rho_m \displaystyle \sum_{j=1}^J b_{jm} \, \mathbf{1}(x \in R_{jm}) \\ &= F_{m-1}(x) + \sum_{j=1}^J \color{red}{\rho_m b_{jm}} \, \mathbf{1}(x \in R_{jm}) \\ &= F_{m-1}(x) + \sum_{j=1}^J \color{red}{\gamma_{jm}} \, \mathbf{1}(x \in R_{jm}) \end{align}Note that the last step defines $\gamma_{jm} = \rho_m b_{jm}$, a mere change of symbols, but its substance is that the weight moves from a single global solution for the whole tree to a local solution for each leaf. The weight is now finer-grained: instead of optimizing one weight per tree, we optimize one per leaf. Substituting this definition into step 3 of Gradient Boost gives the optimal local weights:
\begin{equation} \{\gamma_{jm}\}_1^J = \displaystyle \operatorname{arg \, min}_{\{\gamma_j\}_1^J} \sum_{i=1}^N L \left ( y_i, F_{m-1}(x_i) + \sum_{j=1}^J \gamma_j \mathbf{1}(x \in R_{jm}) \right ) \end{equation}Since the leaf regions $R_{jm}$ are disjoint and independent of each other, the joint optimum above is just the collection of per-leaf optima, so it simplifies to:
\begin{equation} \gamma_{jm} = \displaystyle \operatorname{arg \, min}_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma) \end{equation}As discussed earlier, for the loss $L(y, F) = |y - F|$ the optimum is a median.
Hence, for LAD regression:
\begin{equation} \gamma_{jm} = \displaystyle \operatorname{median}_{x_i \in R_{jm}} \{ y_i - F_{m-1}(x_i) \} \end{equation}Putting it together, the LAD_TreeBoost algorithm is:
$F_0(x) = \operatorname{median}\{y_i\}_1^N$
For $m=1$ to $M$ do:
$\tilde{y}_i = \operatorname{sign}(y_i - F_{m-1}(x_i)), \quad i = 1, 2, \dotsc, N$
$\{R_{jm}\}_1^J = \text{$J$-terminal-node tree}(\{\tilde{y}_i, x_i\}_1^N)$
$\gamma_{jm} = \operatorname{median}_{x_i \in R_{jm}} \{ y_i - F_{m-1}(x_i) \}$
$F_m(x) = F_{m-1}(x) + l_r \displaystyle \sum_{j=1}^J \gamma_{jm} \mathbf{1}(x \in R_{jm})$
The corresponding sklearn code:
305 class LeastAbsoluteError(RegressionLossFunction):
306 """Loss function for least absolute deviation (LAD) regression. """
307 def init_estimator(self):
308 return QuantileEstimator(alpha=0.5)
309
310 def __call__(self, y, pred, sample_weight=None):
311 if sample_weight is None:
312 return np.abs(y - pred.ravel()).mean()
313 else:
314 return (1.0 / sample_weight.sum() *
315 np.sum(sample_weight * np.abs(y - pred.ravel())))
316
317 def negative_gradient(self, y, pred, **kargs):
318 """1.0 if y - pred > 0.0 else -1.0"""
319 pred = pred.ravel()
320 return 2.0 * (y - pred > 0.0) - 1.0
321
322 def _update_terminal_region(self, tree, terminal_regions, leaf, X, y,
323 residual, pred, sample_weight):
324 """LAD updates terminal regions to median estimates. """
325 terminal_region = np.where(terminal_regions == leaf)[0]
326 sample_weight = sample_weight.take(terminal_region, axis=0)
327 diff = y.take(terminal_region, axis=0) - pred.take(terminal_region, axis=0)
328 tree.value[leaf, 0, 0] = _weighted_percentile(diff, sample_weight, percentile=50)
Here line 320 is the step-1 derivative (the sign residual), and line 328 is step 3: replacing each tree leaf value with the weighted median (percentile=50).
Because $\tilde{y}_i$ takes only the two values $\tilde{y} \in \{-1, 1\}$ and the leaf values are medians, this loss is very robust. However, the median lacks the fast update formulas the mean enjoys, so performance takes a hit.
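A usage sketch (again hedged: in this version of sklearn the loss is selected with loss='lad'; the outlier injection is just for illustration):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.1, size=200)
y[::20] += 10 * rng.normal(size=len(y[::20]))   # inject some outliers

# LAD_TreeBoost: sign residuals + weighted-median leaf values,
# which should be disturbed less by the injected outliers than loss='ls'.
lad = GradientBoostingRegressor(loss='lad', n_estimators=100, max_depth=3)
lad.fit(X, y)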
LS_Boost is fast and LAD_TreeBoost is robust; can we combine the two? That is exactly the motivation behind M-Regression: it uses a threshold $\delta$ (say, three standard deviations) to split the error $|y - F|$ into two parts. Errors within the threshold are treated as normal and constrained as in LS_Boost; errors beyond it are treated as long tails and outliers and resisted as in LAD_TreeBoost. Adjusting $\delta$ controls the balance and captures the strengths of both.
The specific loss function here is the Huber loss, defined as:
\begin{equation} L(y, F) = \begin{cases} \frac{1}{2} (y - F)^2 \quad & |y - F| \leq \delta \\ \delta \cdot \left (\big |y - F \big | - \frac{\delta}{2} \right ) \quad & |y - F| > \delta \end{cases} \end{equation}For the detailed derivation of the solution, consult the paper; I will fill this in later if time permits.
In sklearn this corresponds to the HuberLossFunction class.
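A direct, vectorized transcription of the definition above (a sketch; sklearn's HuberLossFunction additionally takes care of estimating $\delta$ from a quantile of the residuals):

import numpy as np

def huber_loss(y, F, delta):
    """L(y, F) per the piecewise definition above."""
    diff = y - F
    quadratic = 0.5 * diff ** 2                      # |y - F| <= delta
    linear = delta * (np.abs(diff) - 0.5 * delta)    # |y - F| >  delta
    return np.where(np.abs(diff) <= delta, quadratic, linear)

y = np.array([0.0, 0.0, 0.0])
F = np.array([0.5, 2.0, -4.0])
print(huber_loss(y, F, delta=1.0))   # [0.125, 1.5, 3.5]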
For binary classification the loss used is the negative binomial log-likelihood [3], defined as:
\begin{equation} L(y, F) = \log (1 + e^{-2 y F}), \quad y \in \{-1, 1\} \end{equation}where $F(x) = \frac{1}{2} \log \left [ \frac{\operatorname{Pr}(y = 1 | x)}{\operatorname{Pr}(y = -1 | x)} \right ]$.
The step-1 residual then expands to:
\begin{align} \tilde{y}_i &= - \left [ \frac{\partial L (y_i, F(x_i))}{\partial F(x_i)} \right ]_{F(x) = F_{m-1}(x)} \\ &= - \frac{1}{1 + e^{-2 y F}} \cdot e^{-2 y F} \cdot (-2y) \\ &= 2y \, \frac{e^{-2 y F}}{1 + e^{-2 y F}} \\ &= \frac{2y}{1 + e^{2 y F}} \\ &= \frac{2 y_i}{1 + e^{2 y_i F_{m-1}(x_i)}} \\ \end{align}Likewise, substituting the loss into the step-3 search gives:
\begin{equation} \rho_m = \displaystyle \operatorname{arg \, min}_\rho \sum_{i=1}^{N} \log \left ( 1 + e^{-2 y_i \big ( F_{m-1}(x_i) + \rho h(x_i; \mathbf{a}_m) \big )} \right ) \end{equation}Applying the same per-leaf trick as before, we readily get:
\begin{equation} \gamma_{jm} = \displaystyle \operatorname{arg \, min}_\gamma \sum_{x_i \in R_{jm}} \log(1 + e^{-2 y_i ( F_{m-1}(x_i) + \gamma ) }) \end{equation}There is no closed-form solution; following [3], a single Newton-Raphson step gives a numerical approximation:
\begin{equation} \gamma_{jm} = \displaystyle \sum_{x_i \in R_{jm}} \tilde{y}_i \Big / \sum_{x_i \in R_{jm}} |\tilde{y}_i| (2 - |\tilde{y}_i|) \end{equation}I have not worked through this step in detail; I will revisit it when I have the energy.
This algorithm is called L2_TreeBoost and is summarized as:
$F_0(x) = \frac{1}{2} \log \frac{1 + \bar{y}}{1 - \bar{y}}$
For $m=1$ to $M$ do:
$\tilde{y}_i = \dfrac{2 y_i}{1 + e^{2 y_i F_{m-1}(x_i)}}, \quad i = 1, 2, \dotsc, N$
$\{R_{jm}\}_1^J = \text{$J$-terminal-node tree}(\{\tilde{y}_i, x_i\}_1^N)$
$\gamma_{jm} = \displaystyle \sum_{x_i \in R_{jm}} \tilde{y}_i \Big / \sum_{x_i \in R_{jm}} |\tilde{y}_i| (2 - |\tilde{y}_i|)$
$F_m(x) = F_{m-1}(x) + l_r \displaystyle \sum_{j=1}^J \gamma_{jm} \mathbf{1}(x \in R_{jm})$
The sklearn code:
106 class LogOddsEstimator(BaseEstimator):
107 """An estimator predicting the log odds ratio."""
108 scale = 1.0
109
110 def fit(self, X, y, sample_weight=None):
111 # pre-cond: pos, neg are encoded as 1, 0
112 if sample_weight is None:
113 pos = np.sum(y)
114 neg = y.shape[0] - pos
115 else:
116 pos = np.sum(sample_weight * y)
117 neg = np.sum(sample_weight * (1 - y))
118
119 if neg == 0 or pos == 0:
120 raise ValueError('y contains non binary labels.')
121 self.prior = self.scale * np.log(pos / neg)
122
123 def predict(self, X):
124 check_is_fitted(self, 'prior')
125
126 y = np.empty((X.shape[0], 1), dtype=np.float64)
127 y.fill(self.prior)
128 return y
Note that this initialization reworks the $F_0$ formula a little: with sklearn's $\{0, 1\}$ label encoding the prior is $\log \frac{\text{pos}}{\text{neg}}$ (scale = 1.0), rather than the paper's $\frac{1}{2} \log \frac{1 + \bar{y}}{1 - \bar{y}}$.
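A quick numerical check of the relation behind that reworking (a sketch; the toy labels use $\{-1, 1\}$ so that $\bar{y}$ matches the paper's formula):

import numpy as np

y = np.array([1, 1, 1, -1, -1])        # labels in {-1, 1}
pos, neg = np.sum(y == 1), np.sum(y == -1)
y_bar = y.mean()

paper_f0 = 0.5 * np.log((1 + y_bar) / (1 - y_bar))   # F_0 from the paper
half_log_odds = 0.5 * np.log(pos / neg)              # same value
print(paper_f0, half_log_odds)

# LogOddsEstimator uses scale = 1.0, i.e. log(pos / neg) without the 1/2,
# consistent with _score_to_proba below applying expit directly to the score.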
466 class BinomialDeviance(ClassificationLossFunction):
467 """Binomial deviance loss function for binary classification.
468 #+-- 10 lines: Binary classification is a special case; here, we only need to-------------------------
478
479 def init_estimator(self):
480 return LogOddsEstimator()
481
482 #+-- 9 lines: def __call__(self, y, pred, sample_weight=None):---------------------------------------
491
492 def negative_gradient(self, y, pred, **kargs):
493 """Compute the residual (= negative gradient). """
494 return y - expit(pred.ravel())
495
496 def _update_terminal_region(self, tree, terminal_regions, leaf, X, y,
497 residual, pred, sample_weight):
498 """Make a single Newton-Raphson step.
499
500 our node estimate is given by:
501
502 sum(w * (y - prob)) / sum(w * prob * (1 - prob))
503
504 we take advantage that: y - prob = residual
505 """
506 terminal_region = np.where(terminal_regions == leaf)[0]
507 residual = residual.take(terminal_region, axis=0)
508 y = y.take(terminal_region, axis=0)
509 sample_weight = sample_weight.take(terminal_region, axis=0)
510
511 numerator = np.sum(sample_weight * residual)
512 denominator = np.sum(sample_weight * (y - residual) * (1 - y + residual))
513
514 if denominator == 0.0:
515 tree.value[leaf, 0, 0] = 0.0
516 else:
517 tree.value[leaf, 0, 0] = numerator / denominator
518
519 def _score_to_proba(self, score):
520 proba = np.ones((score.shape[0], 2), dtype=np.float64)
521 proba[:, 1] = expit(score.ravel())
522 proba[:, 0] -= proba[:, 1]
523 return proba
[3]: Friedman, Hastie, Tibshirani - Additive logistic regression: a statistical view of boosting
The score output by the model $F_M(x)$ must be converted into per-class probabilities; the reasoning behind _score_to_proba at line 519 is as follows.
Solving the following pair of equations simultaneously,
\begin{align} & F(x) = \frac{1}{2} \log \left [ \frac{\operatorname{Pr}(y = 1 | x)}{\operatorname{Pr}(y = -1 | x)} \right ] \\ & \operatorname{Pr}(y = 1 | x) + \operatorname{Pr}(y = -1 | x) = 1 \end{align}we easily obtain:
\begin{align} & \operatorname{Pr}(y = 1 | x) = \frac{1}{1 + e^{-2 F(x)}} \\ & \operatorname{Pr}(y = -1 | x) = \frac{1}{1 + e^{2 F(x)}} \\ \end{align}Since $F_M(x)$ is our estimate of $F(x)$, we can borrow these formulas to convert scores into (approximate) probabilities:
\begin{align} p_{+}(x) &= \operatorname{\widehat{Pr}}(y = 1 | x) = \frac{1}{1 + e^{-2 F_M(x)}} \\ p_{-}(x) &= \operatorname{\widehat{Pr}}(y = -1 | x) = \frac{1}{1 + e^{2 F_M(x)}} \\ \end{align}For binary classification the decision rule is:
\begin{equation} \hat{y}(x) = 2 \cdot \mathbf{1}[c(-1,1) p_{+}(x) > c(1,-1) p_{-}(x)] - 1 \end{equation}where $c(\hat{y}, y)$ is the cost of predicting $\hat{y}$ when the truth is $y$.
Looking back at _score_to_proba, it is a simplified version of this rule without the costs $c(\cdot)$; and although $p_{+}(x)$ and $p_{-}(x)$ are not true probabilities, they sum to 1, so only $p_{+}(x)$ needs to be computed.
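As a sketch of the conversion under the paper's parameterization (the factor of 2 follows the definition of $F(x)$ above; sklearn's _score_to_proba applies expit directly to its score, consistent with the scale = 1.0 parameterization noted earlier):

import numpy as np
from scipy.special import expit

def score_to_proba_paper(F_M):
    """p_minus, p_plus from F_M(x) under the paper's half-log-odds definition."""
    p_plus = expit(2.0 * F_M)        # 1 / (1 + exp(-2 F_M))
    return np.column_stack([1.0 - p_plus, p_plus])

def predict_label(F_M, c_fp=1.0, c_fn=1.0):
    """2 * 1[c(-1,1) p_plus > c(1,-1) p_minus] - 1, with unit costs by default."""
    proba = score_to_proba_paper(F_M)
    return 2 * (c_fn * proba[:, 1] > c_fp * proba[:, 0]).astype(int) - 1

scores = np.array([-1.2, 0.0, 0.7])
print(score_to_proba_paper(scores))
print(predict_label(scores))         # [-1, -1, 1]; the tie at score 0 goes to -1 here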
There is further room for optimization when searching for $\gamma_{jm}$:
\begin{align} \gamma_{jm} &= \displaystyle \operatorname{arg \, min}_\gamma \sum_{x_i \in R_{jm}} \log(1 + e^{-2 y_i ( F_{m-1}(x_i) + \gamma ) }) \\ &= \displaystyle \operatorname{arg \, min}_\gamma \sum_{x_i \in R_{jm}} \log(1 + e^{-2 \color{blue}{y_i F_{m-1}(x_i)}} \cdot e^{-2 y_i \gamma} ) \\ &= \displaystyle \operatorname{arg \, min}_\gamma \sum_{x_i \in R_{jm}} \phi(x_i) \end{align}Note that if $y_i F_{m-1}(x_i)$ is very large, the corresponding $\phi(x_i) \to 0$; such samples no longer contribute to the search. We can therefore use $w_i = e^{-2 y_i F_{m-1}(x_i)}$ as an influence measure: samples whose value falls below some threshold are simply dropped from the computation.
Alternatively, if for some $x_i$ the loss $L(y, F(x_i))$ has reached its extremum, it behaves like a constant and no longer contributes to the overall search $\operatorname{arg \, min} L(y, F(x))$. Nearness to an extremum can be gauged by the second derivative, so the second derivative is another valid influence measure. For the logistic regression tree here, it becomes:
\begin{equation} w_i = \frac{\partial^2 L}{\partial F^2} = |\tilde{y}_i| (2 - |\tilde{y}_i|) \end{equation}The removal threshold can be constrained by a fraction: set the threshold to $w_{l(\alpha)}$, where $l(\alpha)$ satisfies:
\begin{equation} \displaystyle \sum_{i=1}^{l(\alpha)} w_{(i)} = \alpha \sum_{i=1}^{N} w_i \end{equation}where the $w_{(i)}$ are the weights sorted in ascending order. Typical values are $\alpha \in [0.05, 0.2]$, which can reduce the computation by a factor of roughly 10 to 20.
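A sketch of how such influence trimming could be implemented with numpy (illustrative only; this mechanism comes from the paper and is not part of the sklearn code shown here):

import numpy as np

def influence_trimming_mask(w, alpha=0.1):
    """Keep samples whose influence w_i is above the threshold w_{l(alpha)},
    i.e. drop the smallest weights whose sum is alpha * sum(w)."""
    order = np.argsort(w)                       # ascending w_{(i)}
    cum = np.cumsum(w[order])
    l_alpha = np.searchsorted(cum, alpha * w.sum())
    keep = np.ones_like(w, dtype=bool)
    keep[order[:l_alpha]] = False               # drop the low-influence tail
    return keep

residual = np.array([0.01, 1.2, -0.9, 0.002, 1.9, -0.03])   # toy residuals
w = np.abs(residual) * (2 - np.abs(residual))                # second-derivative measure
mask = influence_trimming_mask(w, alpha=0.1)
print(mask)   # the smallest-|residual| samples are the ones dropped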
Lack of fit (LOF):
Two shrinkage strategies:
The relative influence $I_j$ of an individual input $x_j$ on the variation of $\hat{F}(x)$ over the joint input-variable distribution is defined as:
\begin{equation} I_j = \left ( E_x \left [ \frac{\partial \hat{F}(x)}{\partial x_j} \right ]^2 \cdot \operatorname{var}_x[x_j] \right )^{1/2} \end{equation}Based on this, Breiman proposed a formula suited to decision trees:
\begin{equation} \hat{I}_j^2(T) = \displaystyle \sum_{t=1}^{J-1} \hat{i}_t^2 \mathbf{1}(v_t = j) \end{equation}where the sum runs over the $J-1$ internal nodes $t$, $v_t$ is the feature used to split node $t$, and $\hat{i}_t^2$ is that node's Friedman MSE improvement $i^2(R_l, R_r) = \frac{w_l w_r}{w_l + w_r}(\bar{y}_l - \bar{y}_r)^2$.
For multiple trees, sum and average:
\begin{equation} \hat{I}^2_j = \frac{1}{M} \displaystyle \sum_{m=1}^M \hat{I}^2_j (T_m) \end{equation}The idea is quite plain: measure how much each feature contributes to improving the tree's splitting criterion. The sklearn implementation:
1201 def feature_importances_(self):
1202 """Return the feature importances (the higher, the more important the
1203 feature).
1204
1205 Returns
1206 -------
1207 feature_importances_ : array, shape = [n_features]
1208 """
1209 self._check_initialized()
1210
1211 total_sum = np.zeros((self.n_features, ), dtype=np.float64)
1212 for stage in self.estimators_:
1213 stage_sum = sum(tree.feature_importances_
1214 for tree in stage) / len(stage)
1215 total_sum += stage_sum
1216
1217 importances = total_sum / len(self.estimators_)
1218 return importances
Now we can also see why line 765 builds the decision trees with the friedman_mse criterion.
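A brief usage sketch (synthetic data; property access as in this version of sklearn):

from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_friedman1(n_samples=500, random_state=0)
clf = GradientBoostingRegressor(n_estimators=100, max_depth=3).fit(X, y)

# Averaged, friedman_mse-based importances as computed in feature_importances_ above
print(clf.feature_importances_)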
Partial dependence is mainly used to visualize how one or two variables affect the overall prediction. The idea is very simple: split the features as $X_s \cup X_c = X$; then $f(X_s = x_s) = \operatorname{Avg}_i \, f(X_s = x_s, X_c = x_{ci})$, i.e. for a given value $x_s$, average the model's response over the observed values of the remaining features.
The figure below is an example from sklearn: Partial Dependence Plots.
For the one-dimensional plots: house price rises with MedInc (median income), falls with AveOccup (average occupancy), and shows no clear relationship with HouseAge (house age) or AveRooms (average number of rooms). In the two-dimensional plot (HouseAge vs. AveOccup), when average occupancy is above 2, house age has little effect on price; but when it drops below 2, a dependence appears.
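To make the averaging in the definition above concrete, here is a brute-force sketch (a hypothetical helper, not sklearn's partial dependence utilities): for each grid value of the chosen feature, overwrite that column in the data and average the model's predictions.

import numpy as np

def partial_dependence_1d(model, X, feature_idx, grid):
    """f(X_s = x_s) = Avg_i f(X_s = x_s, X_c = x_ci), computed by brute force."""
    pd_values = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value      # clamp the target feature
        pd_values.append(model.predict(X_mod).mean())
    return np.array(pd_values)

# usage: grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 50)
#        pd = partial_dependence_1d(est, X, feature_idx=0, grid=grid)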
In [ ]: