In [1]:

    
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
sns.set(font='SimHei')
#plt.rcParams['axes.grid'] = False

import numpy as np

import pandas as pd
pd.options.display.max_rows = 10

from IPython.display import Image

决策树原理与实现简介

前言

为什么讲决策树?

原理简单，直白易懂。
可解释性好。
变种在工业上应用多：随机森林、GBDT。

深化拓展

理论，考古：ID3, C4.5, CART
工程，实现细节：
- demo
- scikit-learn
- spark
- xgboost
应用，调参分析
演示

理论

算法：

ID3
C4.5
C5.0
CART
CHAID
MARS

行业黑话

分类问题 vs 回归问题
样本 = （特征$x$，真实值$y$）
目的：找到模型$h(\cdot)$，使得预测值$\hat{y} = h(x)$ $\to$ 真实值$y$



In [2]:

    
from sklearn.datasets import load_iris
data = load_iris()

# 准备特征数据
X = pd.DataFrame(data.data, 
                 columns=["sepal_length", "sepal_width", "petal_length", "petal_width"])

# 准备标签数据
y = pd.DataFrame(data.target, columns=['target'])
y.replace(to_replace=range(3), value=data.target_names, inplace=True)

# 组建样本 [特征，标签]
samples = pd.concat([X, y], axis=1, keys=["x", "y"])
samples.head(5)









    Out[2]:






  
    
      
      x
      y
    
    
      
      sepal_length
      sepal_width
      petal_length
      petal_width
      target
    
  
  
    
      0
      5.1
      3.5
      1.4
      0.2
      setosa
    
    
      1
      4.9
      3.0
      1.4
      0.2
      setosa
    
    
      2
      4.7
      3.2
      1.3
      0.2
      setosa
    
    
      3
      4.6
      3.1
      1.5
      0.2
      setosa
    
    
      4
      5.0
      3.6
      1.4
      0.2
      setosa



In [3]:

    
samples["y", "target"].value_counts()









    Out[3]:





versicolor    50
virginica     50
setosa        50
Name: (y, target), dtype: int64



In [4]:

    
samples["x"].describe()









    Out[4]:






  
    
      
      sepal_length
      sepal_width
      petal_length
      petal_width
    
  
  
    
      count
      150.000000
      150.000000
      150.000000
      150.000000
    
    
      mean
      5.843333
      3.054000
      3.758667
      1.198667
    
    
      std
      0.828066
      0.433594
      1.764420
      0.763161
    
    
      min
      4.300000
      2.000000
      1.000000
      0.100000
    
    
      25%
      5.100000
      2.800000
      1.600000
      0.300000
    
    
      50%
      5.800000
      3.000000
      4.350000
      1.300000
    
    
      75%
      6.400000
      3.300000
      5.100000
      1.800000
    
    
      max
      7.900000
      4.400000
      6.900000
      2.500000

三分钟明白决策树



In [5]:

    
Image(url="https://upload.wikimedia.org/wikipedia/commons/f/f3/CART_tree_titanic_survivors.png")









    Out[5]:



In [6]:

    
Image(url="http://scikit-learn.org/stable/_images/iris.svg")









    Out[6]:



In [7]:

    
Image(url="http://scikit-learn.org/stable/_images/sphx_glr_plot_iris_0011.png")









    Out[7]:



In [8]:

    
samples = pd.concat([X, y], axis=1)
samples.head(3)









    Out[8]:






  
    
      
      sepal_length
      sepal_width
      petal_length
      petal_width
      target
    
  
  
    
      0
      5.1
      3.5
      1.4
      0.2
      setosa
    
    
      1
      4.9
      3.0
      1.4
      0.2
      setosa
    
    
      2
      4.7
      3.2
      1.3
      0.2
      setosa

工程

Demo实现

其主要问题是在每次决策时找到一个分割点，让生成的子集尽可能地纯净。这里涉及到四个问题:

如何分割样本？
如何评价子集的纯净度？
如何找到单个最佳的分割点，其子集最为纯净？
如何找到最佳的分割点序列，其最终分割子集总体最为纯净？



In [9]:

    
Image(url="https://upload.wikimedia.org/wikipedia/commons/f/f3/CART_tree_titanic_survivors.png")









    Out[9]:

1.0 如何分割样本

决策树的分割方法是取一个特征 $f$ 和阈值 $t$，以此为界将样本 $X$ 拆分为两个子集 $X_l, X_r$。其数学表达形同：

\begin{align} X = \begin{cases} X_l, \ \text{if } X[f] < t \\ X_r, \ \text{if } X[f] \geq t \end{cases} \end{align}



In [10]:

    
def splitter(samples, feature, threshold):
    # 按特征 f 和阈值 t 分割样本
    
    left_nodes = samples.query("{f} < {t}".format(f=feature, t=threshold))
    right_nodes = samples.query("{f} >= {t}".format(f=feature, t=threshold))
    
    return {"left_nodes": left_nodes, "right_nodes": right_nodes}



In [11]:

    
split = splitter(samples, "sepal_length", 5)

# 左子集
x_l = split["left_nodes"].loc[:, "target"].value_counts()
x_l









    Out[11]:





setosa        20
versicolor     1
virginica      1
Name: target, dtype: int64



In [12]:

    
# 右子集
x_r = split["right_nodes"].loc[:, "target"].value_counts()
x_r









    Out[12]:





versicolor    49
virginica     49
setosa        30
Name: target, dtype: int64

2. 如何评价子集的纯净度？

常用的评价函数正是计算各标签 $c_k$ 在子集中的占比 $p_k = c_k / \sum (c_k)$，并通过组合 $p_k$ 来描述占比集中或分散。



In [13]:

    
def calc_class_proportion(node):
    # 计算各标签在集合中的占比
    
    y = node["target"]
    return y.value_counts() / y.count()



In [14]:

    
calc_class_proportion(split["left_nodes"])









    Out[14]:





setosa        0.909091
versicolor    0.045455
virginica     0.045455
Name: target, dtype: float64



In [15]:

    
calc_class_proportion(split["right_nodes"])









    Out[15]:





versicolor    0.382812
virginica     0.382812
setosa        0.234375
Name: target, dtype: float64

主要的评价函数有三种，它们评价的是集合的不纯度（值越大，集合越混杂）。

先做些数学定义以便于描述：
假设对于集合 $m$ 有 $N_m$ 个样本，可分割成 $R_m$ 子集。
若总的标签类别有 $K$ 种，则标签 $k$ 在此集合中的占比为：

\begin{equation} \hat{p}_{m k} = \frac{1}{N_m} \displaystyle \sum_{x_i \in R_m} I(y_i = k) \end{equation}

且令标签 $k$ 是占比最大的标签，即 $k(m) = \operatorname{arg max}_k \hat{p}_{m k}$.

1. Misclassification error

我们一般把集合的分类结果定义为占比最大的标签，那么落在此集合中的其它标签就是误分类。其比率是 $1 - \hat{p}_{m k}(m)$.

scikit-learn实现简介

应用

评价：过拟合

如何模型解释

参数的含义

演示

使用sklearn的决策树，进行一个小样本的分析



In [ ]:

	x				y
	sepal_length	sepal_width	petal_length	petal_width	target
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa

	sepal_length	sepal_width	petal_length	petal_width
count	150.000000	150.000000	150.000000	150.000000
mean	5.843333	3.054000	3.758667	1.198667
std	0.828066	0.433594	1.764420	0.763161
min	4.300000	2.000000	1.000000	0.100000
25%	5.100000	2.800000	1.600000	0.300000
50%	5.800000	3.000000	4.350000	1.300000
75%	6.400000	3.300000	5.100000	1.800000
max	7.900000	4.400000	6.900000	2.500000