In [ ]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

梯度提升树(Gradient Boosted Trees):模型理解

Note: 我们的 TensorFlow 社区翻译了这些文档。因为社区翻译是尽力而为, 所以无法保证它们是最准确的,并且反映了最新的 官方英文文档。如果您有改进此翻译的建议, 请提交 pull request 到 tensorflow/docs GitHub 仓库。要志愿地撰写或者审核译文,请加入 docs-zh-cn@tensorflow.org Google Group

对于梯度提升模型(Gradient Boosting model)的端到端演示(end-to-end walkthrough),请查阅在 Tensorflow 中训练提升树(Boosted Trees)模型。在本教程中,您将:

  • 学习到如何对提升树模型(Boosted Trees model)进行局部解释全局解释
  • 了解到提升树模型在数据集上的表现。

如何对提升树模型(Boosted Trees model)进行局部解释和全局解释

局部可解释性指模型的预测在单一样例层面上的理解程度,而全局可解释性指模型作为一个整体的理解能力。这种技术可以帮助使用机器学习的人在模型开发阶段检测偏差(bias)和bug。

对于局部可解释性,您将了解到如何创造并可视化每个实例(per-instance)的贡献度。区别于特征重要性,这种贡献被称为 DFCs(定向特征贡献,directional feature contributions)。

对于全局可解释性,您将学习并可视化基于增益的特征重要性(gain-based feature importances),排列特征重要性(permutation feature importances)和总DFCs。

加载泰坦尼克数据集(titanic)

本教程使用泰坦尼克数据集,旨在已知乘客的性别,年龄和客舱等级等特征的情况下预测的存活率。


In [ ]:
import numpy as np
import pandas as pd
from IPython.display import clear_output

# 加载数据集。
dftrain = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/train.csv')
dfeval = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/eval.csv')
y_train = dftrain.pop('survived')
y_eval = dfeval.pop('survived')

In [3]:
import tensorflow as tf
tf.random.set_seed(123)


TensorFlow 2.x selected.

有关特征的描述,请参阅之前的教程。

创建特征列, 输入函数并训练 estimator

数据预处理

特征处理,使用原始的数值特征和独热编码(one-hot-encoding)处理过的非数值特征(如性别,舱位)别建立数据集。


In [ ]:
fc = tf.feature_column
CATEGORICAL_COLUMNS = ['sex', 'n_siblings_spouses', 'parch', 'class', 'deck',
                       'embark_town', 'alone']
NUMERIC_COLUMNS = ['age', 'fare']

def one_hot_cat_column(feature_name, vocab):
  return fc.indicator_column(
      fc.categorical_column_with_vocabulary_list(feature_name,
                                                 vocab))
feature_columns = []
for feature_name in CATEGORICAL_COLUMNS:
  # 需要使用独热编码(one-hot-encoding)处理非数值特征。
  vocabulary = dftrain[feature_name].unique()
  feature_columns.append(one_hot_cat_column(feature_name, vocabulary))

for feature_name in NUMERIC_COLUMNS:
  feature_columns.append(fc.numeric_column(feature_name,
                                           dtype=tf.float32))

构建输入 pipeline

使用 API tf.data 中的 from_tensor_slices 方法建立输入方程来从 Pandas 中直接读取数据。


In [ ]:
# 当数据集小的时候,将整个数据集作为一个 batch。
NUM_EXAMPLES = len(y_train)

def make_input_fn(X, y, n_epochs=None, shuffle=True):
  def input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((X.to_dict(orient='list'), y))
    if shuffle:
      dataset = dataset.shuffle(NUM_EXAMPLES)
    # 训练时让数据迭代尽可能多次 (n_epochs=None)。
    dataset = (dataset
      .repeat(n_epochs)
      .batch(NUM_EXAMPLES))
    return dataset
  return input_fn

# 训练并评估输入函数。
train_input_fn = make_input_fn(dftrain, y_train)
eval_input_fn = make_input_fn(dfeval, y_eval, shuffle=False, n_epochs=1)

训练模型


In [6]:
params = {
  'n_trees': 50,
  'max_depth': 3,
  'n_batches_per_layer': 1,
  # 为了得到 DFCs,请设置 center_bias = True。这将强制
  # 模型在使用特征(例如:回归中训练集标签的均值,分类中使
  # 用交叉熵损失函数时的对数几率)前做一个初始预测。
  'center_bias': True
}

est = tf.estimator.BoostedTreesClassifier(feature_columns, **params)
# 训练模型。
est.train(train_input_fn, max_steps=100)

# 评估。
results = est.evaluate(eval_input_fn)
clear_output()
pd.Series(results).to_frame()


Out[6]:
0
accuracy 0.806818
accuracy_baseline 0.625000
auc 0.866606
auc_precision_recall 0.849128
average_loss 0.421549
label/mean 0.375000
loss 0.421549
precision 0.755319
prediction/mean 0.384944
recall 0.717172
global_step 100.000000

出于性能方面的原因,当您的数据是内存数据集时,我们推荐您使用 boosted_trees_classifier_train_in_memory 函数。此外,如果您对训练时间没有要求抑或是您的数据集很大且不愿做分布式训练,请使用上面显示的 tf.estimator.BoostedTrees API。

当您使用此方法时,请不要对数据分批(batch),而是对整个数据集进行操作。


In [7]:
in_memory_params = dict(params)
in_memory_params['n_batches_per_layer'] = 1
# 在内存中的输入方程请不要对数据分批。
def make_inmemory_train_input_fn(X, y):
  y = np.expand_dims(y, axis=1)
  def input_fn():
    return dict(X), y
  return input_fn
train_input_fn = make_inmemory_train_input_fn(dftrain, y_train)

# 训练模型。
est = tf.estimator.BoostedTreesClassifier(
    feature_columns, 
    train_in_memory=True, 
    **in_memory_params)

est.train(train_input_fn)
print(est.evaluate(eval_input_fn))


INFO:tensorflow:Using default config.
WARNING:tensorflow:Using temporary folder as model directory: /tmp/tmpec8e696f
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmpec8e696f', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
WARNING:tensorflow:Issue encountered when serializing resources.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
WARNING:tensorflow:Issue encountered when serializing resources.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Saving checkpoints for 0 into /tmp/tmpec8e696f/model.ckpt.
WARNING:tensorflow:Issue encountered when serializing resources.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'_Resource' object has no attribute 'name'
INFO:tensorflow:loss = 0.6931472, step = 0
WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 0 vs previous value: 0. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
INFO:tensorflow:global_step/sec: 80.2732
INFO:tensorflow:loss = 0.34654337, step = 99 (1.249 sec)
INFO:tensorflow:Saving checkpoints for 153 into /tmp/tmpec8e696f/model.ckpt.
WARNING:tensorflow:Issue encountered when serializing resources.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Loss for final step: 0.31796658.
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
WARNING:tensorflow:Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-03-09T21:21:14Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpec8e696f/model.ckpt-153
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Inference Time : 0.55945s
INFO:tensorflow:Finished evaluation at 2020-03-09-21:21:15
INFO:tensorflow:Saving dict for global step 153: accuracy = 0.8030303, accuracy_baseline = 0.625, auc = 0.8679216, auc_precision_recall = 0.8527449, average_loss = 0.4203342, global_step = 153, label/mean = 0.375, loss = 0.4203342, precision = 0.7473684, prediction/mean = 0.38673538, recall = 0.7171717
WARNING:tensorflow:Issue encountered when serializing resources.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 153: /tmp/tmpec8e696f/model.ckpt-153
{'accuracy': 0.8030303, 'accuracy_baseline': 0.625, 'auc': 0.8679216, 'auc_precision_recall': 0.8527449, 'average_loss': 0.4203342, 'label/mean': 0.375, 'loss': 0.4203342, 'precision': 0.7473684, 'prediction/mean': 0.38673538, 'recall': 0.7171717, 'global_step': 153}

模型说明与绘制


In [ ]:
import matplotlib.pyplot as plt
import seaborn as sns
sns_colors = sns.color_palette('colorblind')

局部可解释性(Local interpretability)

接下来,您将输出定向特征贡献(DFCs)来解释单个预测。输出依据 Palczewska et al 和 Saabas 在 解释随机森林(Interpreting Random Forests) 中提出的方法产生(scikit-learn 中随机森林相关的包 treeinterpreter 使用原理相同的远离). 使用以下语句输出 DFCs:

pred_dicts = list(est.experimental_predict_with_explanations(pred_input_fn))

(注意:带 “experimental” 前缀为实验版本(开发中),在正式版发布前可能对其修改。)


In [9]:
pred_dicts = list(est.experimental_predict_with_explanations(eval_input_fn))


INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmpec8e696f', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpec8e696f/model.ckpt-153
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.

In [10]:
# 创建 DFCs 的 DataFrame。
labels = y_eval.values
probs = pd.Series([pred['probabilities'][1] for pred in pred_dicts])
df_dfc = pd.DataFrame([pred['dfc'] for pred in pred_dicts])
df_dfc.describe().T


Out[10]:
count mean std min 25% 50% 75% max
age 264.0 -0.023851 0.095522 -0.154286 -0.076392 -0.057281 0.009408 0.541436
sex 264.0 0.007242 0.110388 -0.123509 -0.074560 -0.073093 0.139799 0.182828
class 264.0 0.016631 0.093480 -0.065638 -0.046510 -0.045238 0.038365 0.236259
deck 264.0 -0.016809 0.031755 -0.086020 -0.041980 -0.029283 0.003319 0.187737
embark_town 264.0 -0.006789 0.028007 -0.053620 -0.014970 -0.013391 -0.003095 0.070267
fare 264.0 0.020249 0.088010 -0.308154 -0.028543 -0.008702 0.057093 0.228394
n_siblings_spouses 264.0 0.002551 0.028782 -0.177096 0.002736 0.003991 0.006956 0.135561
parch 264.0 -0.000049 0.009230 -0.075935 0.000389 0.000483 0.000871 0.050035
alone 264.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000

DFCs 有个不错的特性:贡献和 + 偏差(bias) = 给出样例的预测值。


In [ ]:
# DFCs的和 + 偏差(bias) == 可能性
bias = pred_dicts[0]['bias']
dfc_prob = df_dfc.sum(axis=1) + bias
np.testing.assert_almost_equal(dfc_prob.values,
                               probs.values)

为单个乘客绘制 DFCs,绘图时按贡献的方向性对其进行涂色并添加特征的值。


In [ ]:
# 绘制模版 :)
def _get_color(value):
    """正的 DFCs 标为绿色,负的为红色。"""
    green, red = sns.color_palette()[2:4]
    if value >= 0: return green
    return red

def _add_feature_values(feature_values, ax):
    """在图的左侧显示特征的值"""
    x_coord = ax.get_xlim()[0]
    OFFSET = 0.15
    for y_coord, (feat_name, feat_val) in enumerate(feature_values.items()):
        t = plt.text(x_coord, y_coord - OFFSET, '{}'.format(feat_val), size=12)
        t.set_bbox(dict(facecolor='white', alpha=0.5))
    from matplotlib.font_manager import FontProperties
    font = FontProperties()
    font.set_weight('bold')
    t = plt.text(x_coord, y_coord + 1 - OFFSET, 'feature\nvalue',
    fontproperties=font, size=12)

def plot_example(example):
  TOP_N = 8 # 显示前8个特征。
  sorted_ix = example.abs().sort_values()[-TOP_N:].index  # 按值排序。
  example = example[sorted_ix]
  colors = example.map(_get_color).tolist()
  ax = example.to_frame().plot(kind='barh',
                          color=[colors],
                          legend=None,
                          alpha=0.75,
                          figsize=(10,6))
  ax.grid(False, axis='y')
  ax.set_yticklabels(ax.get_yticklabels(), size=14)

  # 添加特征的值。
  _add_feature_values(dfeval.iloc[ID][sorted_ix], ax)
  return ax

In [13]:
# 绘制结果。
ID = 182
example = df_dfc.iloc[ID]  # 从评估集中选择第 i 个样例。
TOP_N = 8  # 显示前8个特征。
sorted_ix = example.abs().sort_values()[-TOP_N:].index
ax = plot_example(example)
ax.set_title('Feature contributions for example {}\n pred: {:1.2f}; label: {}'.format(ID, probs[ID], labels[ID]))
ax.set_xlabel('Contribution to predicted probability', size=14)
plt.show()


更大的贡献值意味着对模型的预测有更大的影响。负的贡献表示此样例该特征的值减小了减小了模型的预测,正贡献值表示增加了模型的预测。

您也可以使用小提琴图(violin plot)来绘制该样例的 DFCs 并与整体分布比较。


In [ ]:
# 绘制代码模版。
def dist_violin_plot(df_dfc, ID):
  # 初始化画布。
  fig, ax = plt.subplots(1, 1, figsize=(10, 6))

  # 创建样例 DataFrame。
  TOP_N = 8  # 显示前8个特征。
  example = df_dfc.iloc[ID]
  ix = example.abs().sort_values()[-TOP_N:].index
  example = example[ix]
  example_df = example.to_frame(name='dfc')

  # 添加整个分布的贡献。
  parts=ax.violinplot([df_dfc[w] for w in ix],
                 vert=False,
                 showextrema=False,
                 widths=0.7,
                 positions=np.arange(len(ix)))
  face_color = sns_colors[0]
  alpha = 0.15
  for pc in parts['bodies']:
      pc.set_facecolor(face_color)
      pc.set_alpha(alpha)

  # 添加特征的值。
  _add_feature_values(dfeval.iloc[ID][sorted_ix], ax)

  # 添加局部贡献。
  ax.scatter(example,
              np.arange(example.shape[0]),
              color=sns.color_palette()[2],
              s=100,
              marker="s",
              label='contributions for example')

  # 图例。
  # 生成小提琴图的详细图例。
  ax.plot([0,0], [1,1], label='eval set contributions\ndistributions',
          color=face_color, alpha=alpha, linewidth=10)
  legend = ax.legend(loc='lower right', shadow=True, fontsize='x-large',
                     frameon=True)
  legend.get_frame().set_facecolor('white')

  # 调整格式。
  ax.set_yticks(np.arange(example.shape[0]))
  ax.set_yticklabels(example.index)
  ax.grid(False, axis='y')
  ax.set_xlabel('Contribution to predicted probability', size=14)

绘制此样例。


In [15]:
dist_violin_plot(df_dfc, ID)
plt.title('Feature contributions for example {}\n pred: {:1.2f}; label: {}'.format(ID, probs[ID], labels[ID]))
plt.show()


最后,第三方的工具,如:LIMEshap 也可以帮助理解模型的各个预测。

全局特征重要性(Global feature importances)

此外,您或许想了解模型这个整体而不是单个预测。接下来,您将计算并使用:

  • 通过 est.experimental_feature_importances 得到基于增益的特征重要性(Gain-based feature importances)
  • 排列特征重要性(Permutation feature importances)
  • 使用 est.experimental_predict_with_explanations 得到总 DFCs。

基于增益的特征重要性在分离特定特征时测量损失的变化。而排列特征重要性是在评估集上通过每次打乱一个特征后观察模型性能的变化计算而出。

一般来说,排列特征重要性要优于基于增益的特征重要性,尽管这两种方法在潜在预测变量的测量范围或类别数量不确定时和特征相关联时不可信(来源)。 对不同种类特征重要性的更透彻概括和更翔实讨论请参考 这篇文章

基于增益的特征重要性(Gain-based feature importances)

TensorFlow 的提升树估算器(estimator)内置了函数 est.experimental_feature_importances 用于计算基于增益的特征重要性。


In [16]:
importances = est.experimental_feature_importances(normalize=True)
df_imp = pd.Series(importances)

# 可视化重要性。
N = 8
ax = (df_imp.iloc[0:N][::-1]
    .plot(kind='barh',
          color=sns_colors[0],
          title='Gain feature importances',
          figsize=(10, 6)))
ax.grid(False, axis='y')


平均绝对 DFCs

您还可以得到绝对DFCs的平均值来从全局的角度分析影响。


In [17]:
# 绘图。
dfc_mean = df_dfc.abs().mean()
N = 8
sorted_ix = dfc_mean.abs().sort_values()[-N:].index  # 求平均并按绝对值排序。
ax = dfc_mean[sorted_ix].plot(kind='barh',
                       color=sns_colors[1],
                       title='Mean |directional feature contributions|',
                       figsize=(10, 6))
ax.grid(False, axis='y')


您可以看到 DFCs 如何随特征的值变化而变化。


In [18]:
FEATURE = 'fare'
feature = pd.Series(df_dfc[FEATURE].values, index=dfeval[FEATURE].values).sort_index()
ax = sns.regplot(feature.index.values, feature.values, lowess=True)
ax.set_ylabel('contribution')
ax.set_xlabel(FEATURE)
ax.set_xlim(0, 100)
plt.show()


排列特征重要性(Permutation feature importances)


In [19]:
def permutation_importances(est, X_eval, y_eval, metric, features):
    """
    分别对每列,打散列中的值并观察其对评估集的影响。
    
    在训练过程中,有一种类似的方法,请参阅文章(来源:http://explained.ai/rf-importance/index.html)
    中有关 “Drop-column importance” 的部分。
    """
    baseline = metric(est, X_eval, y_eval)
    imp = []
    for col in features:
        save = X_eval[col].copy()
        X_eval[col] = np.random.permutation(X_eval[col])
        m = metric(est, X_eval, y_eval)
        X_eval[col] = save
        imp.append(baseline - m)
    return np.array(imp)

def accuracy_metric(est, X, y):
    """TensorFlow 估算器精度"""
    eval_input_fn = make_input_fn(X,
                                  y=y,
                                  shuffle=False,
                                  n_epochs=1)
    return est.evaluate(input_fn=eval_input_fn)['accuracy']
features = CATEGORICAL_COLUMNS + NUMERIC_COLUMNS
importances = permutation_importances(est, dfeval, y_eval, accuracy_metric,
                                      features)
df_imp = pd.Series(importances, index=features)

sorted_ix = df_imp.abs().sort_values().index
ax = df_imp[sorted_ix][-5:].plot(kind='barh', color=sns_colors[2], figsize=(10, 6))
ax.grid(False, axis='y')
ax.set_title('Permutation feature importance')
plt.show()


INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
WARNING:tensorflow:Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-03-09T21:21:18Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpec8e696f/model.ckpt-153
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Inference Time : 0.56113s
INFO:tensorflow:Finished evaluation at 2020-03-09-21:21:18
INFO:tensorflow:Saving dict for global step 153: accuracy = 0.8030303, accuracy_baseline = 0.625, auc = 0.8679216, auc_precision_recall = 0.8527449, average_loss = 0.4203342, global_step = 153, label/mean = 0.375, loss = 0.4203342, precision = 0.7473684, prediction/mean = 0.38673538, recall = 0.7171717
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 153: /tmp/tmpec8e696f/model.ckpt-153
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
WARNING:tensorflow:Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-03-09T21:21:19Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpec8e696f/model.ckpt-153
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Inference Time : 0.57949s
INFO:tensorflow:Finished evaluation at 2020-03-09-21:21:19
INFO:tensorflow:Saving dict for global step 153: accuracy = 0.6060606, accuracy_baseline = 0.625, auc = 0.64355683, auc_precision_recall = 0.5400543, average_loss = 0.74337494, global_step = 153, label/mean = 0.375, loss = 0.74337494, precision = 0.47524753, prediction/mean = 0.39103043, recall = 0.4848485
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 153: /tmp/tmpec8e696f/model.ckpt-153
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
WARNING:tensorflow:Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-03-09T21:21:20Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpec8e696f/model.ckpt-153
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Inference Time : 0.58528s
INFO:tensorflow:Finished evaluation at 2020-03-09-21:21:21
INFO:tensorflow:Saving dict for global step 153: accuracy = 0.7916667, accuracy_baseline = 0.625, auc = 0.8624732, auc_precision_recall = 0.8392693, average_loss = 0.43363357, global_step = 153, label/mean = 0.375, loss = 0.43363357, precision = 0.7244898, prediction/mean = 0.38975066, recall = 0.7171717
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 153: /tmp/tmpec8e696f/model.ckpt-153
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
WARNING:tensorflow:Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-03-09T21:21:21Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpec8e696f/model.ckpt-153
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Inference Time : 0.55600s
INFO:tensorflow:Finished evaluation at 2020-03-09-21:21:22
INFO:tensorflow:Saving dict for global step 153: accuracy = 0.8068182, accuracy_baseline = 0.625, auc = 0.8674931, auc_precision_recall = 0.85280114, average_loss = 0.4206087, global_step = 153, label/mean = 0.375, loss = 0.4206087, precision = 0.75, prediction/mean = 0.38792592, recall = 0.72727275
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 153: /tmp/tmpec8e696f/model.ckpt-153
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
WARNING:tensorflow:Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-03-09T21:21:22Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpec8e696f/model.ckpt-153
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Inference Time : 0.54454s
INFO:tensorflow:Finished evaluation at 2020-03-09-21:21:23
INFO:tensorflow:Saving dict for global step 153: accuracy = 0.72727275, accuracy_baseline = 0.625, auc = 0.76737064, auc_precision_recall = 0.62659556, average_loss = 0.6019534, global_step = 153, label/mean = 0.375, loss = 0.6019534, precision = 0.6626506, prediction/mean = 0.3688063, recall = 0.5555556
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 153: /tmp/tmpec8e696f/model.ckpt-153
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
WARNING:tensorflow:Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-03-09T21:21:24Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpec8e696f/model.ckpt-153
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Inference Time : 0.53149s
INFO:tensorflow:Finished evaluation at 2020-03-09-21:21:24
INFO:tensorflow:Saving dict for global step 153: accuracy = 0.7878788, accuracy_baseline = 0.625, auc = 0.8389348, auc_precision_recall = 0.8278463, average_loss = 0.45054114, global_step = 153, label/mean = 0.375, loss = 0.45054114, precision = 0.7263158, prediction/mean = 0.3912348, recall = 0.6969697
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 153: /tmp/tmpec8e696f/model.ckpt-153
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
WARNING:tensorflow:Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-03-09T21:21:25Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpec8e696f/model.ckpt-153
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Inference Time : 0.54399s
INFO:tensorflow:Finished evaluation at 2020-03-09-21:21:25
INFO:tensorflow:Saving dict for global step 153: accuracy = 0.8030303, accuracy_baseline = 0.625, auc = 0.862565, auc_precision_recall = 0.84412414, average_loss = 0.42553493, global_step = 153, label/mean = 0.375, loss = 0.42553493, precision = 0.75268817, prediction/mean = 0.37500647, recall = 0.7070707
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 153: /tmp/tmpec8e696f/model.ckpt-153
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
WARNING:tensorflow:Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-03-09T21:21:26Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpec8e696f/model.ckpt-153
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Inference Time : 0.56776s
INFO:tensorflow:Finished evaluation at 2020-03-09-21:21:26
INFO:tensorflow:Saving dict for global step 153: accuracy = 0.8030303, accuracy_baseline = 0.625, auc = 0.8679216, auc_precision_recall = 0.8527449, average_loss = 0.4203342, global_step = 153, label/mean = 0.375, loss = 0.4203342, precision = 0.7473684, prediction/mean = 0.38673538, recall = 0.7171717
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 153: /tmp/tmpec8e696f/model.ckpt-153
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
WARNING:tensorflow:Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-03-09T21:21:27Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpec8e696f/model.ckpt-153
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Inference Time : 0.56329s
INFO:tensorflow:Finished evaluation at 2020-03-09-21:21:28
INFO:tensorflow:Saving dict for global step 153: accuracy = 0.79924244, accuracy_baseline = 0.625, auc = 0.8132232, auc_precision_recall = 0.7860318, average_loss = 0.4787808, global_step = 153, label/mean = 0.375, loss = 0.4787808, precision = 0.7613636, prediction/mean = 0.37704408, recall = 0.67676765
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 153: /tmp/tmpec8e696f/model.ckpt-153
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
WARNING:tensorflow:Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-03-09T21:21:28Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpec8e696f/model.ckpt-153
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Inference Time : 0.60489s
INFO:tensorflow:Finished evaluation at 2020-03-09-21:21:29
INFO:tensorflow:Saving dict for global step 153: accuracy = 0.8030303, accuracy_baseline = 0.625, auc = 0.8360882, auc_precision_recall = 0.7940172, average_loss = 0.45960733, global_step = 153, label/mean = 0.375, loss = 0.45960733, precision = 0.7473684, prediction/mean = 0.38010252, recall = 0.7171717
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 153: /tmp/tmpec8e696f/model.ckpt-153

可视化模型拟合过程

首先,使用以下公式构建训练数据:

$$z=x* e^{-x^2 - y^2}$$

其中, (z) 是您要试着预测的值(因变量),(x) 和 (y) 是特征。


In [ ]:
from numpy.random import uniform, seed
from scipy.interpolate import griddata

# 生成数据。
seed(0)
npts = 5000
x = uniform(-2, 2, npts)
y = uniform(-2, 2, npts)
z = x*np.exp(-x**2 - y**2)
xy = np.zeros((2,np.size(x)))
xy[0] = x
xy[1] = y
xy = xy.T

In [ ]:
# 准备用于训练的数据。
df = pd.DataFrame({'x': x, 'y': y, 'z': z})

xi = np.linspace(-2.0, 2.0, 200),
yi = np.linspace(-2.1, 2.1, 210),
xi,yi = np.meshgrid(xi, yi)

df_predict = pd.DataFrame({
    'x' : xi.flatten(),
    'y' : yi.flatten(),
})
predict_shape = xi.shape

In [ ]:
def plot_contour(x, y, z, **kwargs):
  # 准备用于训练的数据。
  plt.figure(figsize=(10, 8))
  # 绘制等值线图,标出非均匀数据点。
  CS = plt.contour(x, y, z, 15, linewidths=0.5, colors='k')
  CS = plt.contourf(x, y, z, 15,
                    vmax=abs(zi).max(), vmin=-abs(zi).max(), cmap='RdBu_r')
  plt.colorbar()  # 绘制颜色图例。
  # 绘制数据点。
  plt.xlim(-2, 2)
  plt.ylim(-2, 2)

您可以可视化这个方程,红色代表较大的值。


In [23]:
zi = griddata(xy, z, (xi, yi), method='linear', fill_value='0')
plot_contour(xi, yi, zi)
plt.scatter(df.x, df.y, marker='.')
plt.title('Contour on training data')
plt.show()



In [ ]:
fc = [tf.feature_column.numeric_column('x'),
      tf.feature_column.numeric_column('y')]

In [ ]:
def predict(est):
  """已有估算器给出的预测"""
  predict_input_fn = lambda: tf.data.Dataset.from_tensors(dict(df_predict))
  preds = np.array([p['predictions'][0] for p in est.predict(predict_input_fn)])
  return preds.reshape(predict_shape)

首先,我们尝试用线性模型拟合数据。


In [26]:
train_input_fn = make_input_fn(df, df.z)
est = tf.estimator.LinearRegressor(fc)
est.train(train_input_fn, max_steps=500);


INFO:tensorflow:Using default config.
WARNING:tensorflow:Using temporary folder as model directory: /tmp/tmpd4fqobc9
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmpd4fqobc9', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:From /tensorflow-2.1.0/python3.6/tensorflow_core/python/feature_column/feature_column_v2.py:518: Layer.add_variable (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.add_weight` method instead.
WARNING:tensorflow:From /tensorflow-2.1.0/python3.6/tensorflow_core/python/keras/optimizer_v2/ftrl.py:143: calling Constant.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/tmpd4fqobc9/model.ckpt.
INFO:tensorflow:loss = 0.023290718, step = 0
INFO:tensorflow:global_step/sec: 267.329
INFO:tensorflow:loss = 0.017512696, step = 100 (0.377 sec)
INFO:tensorflow:global_step/sec: 312.355
INFO:tensorflow:loss = 0.018098738, step = 200 (0.321 sec)
INFO:tensorflow:global_step/sec: 341.77
INFO:tensorflow:loss = 0.019927984, step = 300 (0.291 sec)
INFO:tensorflow:global_step/sec: 307.825
INFO:tensorflow:loss = 0.01797011, step = 400 (0.327 sec)
INFO:tensorflow:Saving checkpoints for 500 into /tmp/tmpd4fqobc9/model.ckpt.
INFO:tensorflow:Loss for final step: 0.019703189.

In [27]:
plot_contour(xi, yi, predict(est))


INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:Layer linear/linear_model is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2.  The layer has dtype float32 because it's dtype defaults to floatx.

If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.

To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpd4fqobc9/model.ckpt-500
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.

可见,拟合效果并不好。接下来,我们试着用 GBDT 模型拟合并了解模型是如何拟合方程的。


In [28]:
n_trees = 37 #@param {type: "slider", min: 1, max: 80, step: 1}

est = tf.estimator.BoostedTreesRegressor(fc, n_batches_per_layer=1, n_trees=n_trees)
est.train(train_input_fn, max_steps=500)
clear_output()
plot_contour(xi, yi, predict(est))
plt.text(-1.8, 2.1, '# trees: {}'.format(n_trees), color='w', backgroundcolor='black', size=20)
plt.show()


INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmp3jae7fgc/model.ckpt-222
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.

随着树的数量增加,模型的预测越来越接近真实方程。

总结

本文介绍了如何使用定向特征贡献(DFCs)及几种特征重要性来解释提升树模型。这些方法可以帮助您了解特征是如何影响模型的预测。 最后,您还可以通过观察其他模型的超平面(decision surface)并结合本文内容来学习提升树模型是如何拟合方程的。