In [1]:
### Requirements: PyDotPlus, Matplotlib, Scikit-Learn, Pandas, Numpy, IPython (and possibly GraphViz)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer  # sklearn.preprocessing.Imputer was removed in scikit-learn 0.22
from sklearn import preprocessing
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import BaggingClassifier
from sklearn import tree
import sklearn
import sklearn.metrics as skm
from scipy import misc
from io import StringIO  # sklearn.externals.six was removed in scikit-learn 0.23
import pydotplus
from IPython.display import Image, YouTubeVideo

def visualize_tree(tree, feature_names, class_names):
    # Export a fitted decision tree to GraphViz dot format and render it as a PNG
    dot_data = StringIO()
    sklearn.tree.export_graphviz(tree, out_file=dot_data,
                                 filled=True, rounded=True,
                                 feature_names=feature_names,
                                 class_names=class_names,
                                 special_characters=True)
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
    return graph.create_png()
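For later reference, visualize_tree can be called like this (a hypothetical usage sketch, not a cell from this notebook; the iris data and fitted clf are stand-ins, and rendering requires GraphViz per the requirements note above):

from sklearn.datasets import load_iris

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2).fit(iris.data, iris.target)
# Display the rendered tree inline in the notebook
Image(visualize_tree(clf, iris.feature_names, list(iris.target_names)))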
Assume the true binary labels $\{y_i : i = 1, \ldots, m\}$ are distributed according to $P(y)$. When we observe the value of a decision stump $A \in \{T, F\}$, we obtain two new distributions, $P(y \mid A = T)$ and $P(y \mid A = F)$. We use information gain to measure how much the distribution of $y$ changes when we observe $A$.
\begin{align*} \text{Information Gain} = IG(P, A) & = \text{Entropy(Parent)} - \text{Weighted Sum of Entropy(Children)} \\ & = H(P) - H(P \mid A) \\ & = H(P) - \sum_{A = T, F} P(A)\, H(P(\cdot \mid A)) \end{align*}
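As a quick numerical illustration of these definitions (an added sketch, not from the original notebook; the toy counts are invented), the following computes the entropy of a parent node and the information gain of a binary stump:

import numpy as np

def entropy(p):
    # Entropy (in bits) of a binary label distribution with P(y = 1) = p
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def information_gain(n_pos, n_neg, n_pos_T, n_neg_T):
    # Parent holds (n_pos, n_neg) examples; the A = T child holds (n_pos_T, n_neg_T)
    n, n_T = n_pos + n_neg, n_pos_T + n_neg_T
    n_F = n - n_T
    h_parent = entropy(n_pos / n)
    h_T = entropy(n_pos_T / n_T) if n_T else 0.0
    h_F = entropy((n_pos - n_pos_T) / n_F) if n_F else 0.0
    # H(Parent) - [P(A=T) H(Child_T) + P(A=F) H(Child_F)]
    return h_parent - (n_T / n) * h_T - (n_F / n) * h_F

# A stump that routes 8 of 10 positives and 2 of 10 negatives to A = T
print(information_gain(10, 10, 8, 2))  # ~0.278 bits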
If we think of a node $A$ as simply "guessing" $A(x)$ for the label $y$ of $x$, then the Misclassification Error (ME) is essentially the probability $P(A(x) \neq y)$:
$$\text{ME}(A) = \frac{\sum_{i=1}^N \mathbb{I}[\, A(x_i) \neq y_i \,]}{N}$$
Decision trees in general perform well with lots of data, are robust to violations of assumptions, and, probably most strikingly, are easy to understand and interpret. However:
What is the policy by which a particular decision tree algorithm generalizes from observed training examples to classify unseen instances?
Definition: The set of assumptions that, together with the training data, deductively justify the classifications assigned by the learner to future instances.
We can also think of this bias as an algorithm's "preference" over possible hypotheses.
Note: Bagging can also be applied to regression, but instead of a majority vote, the average of the base learners' predictions is used.
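A minimal check of that note (an added sketch with invented toy data, not a cell from the original notebook): for regression, scikit-learn's BaggingRegressor prediction coincides with the plain average of its base trees' predictions.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

rng = np.random.RandomState(0)
X_toy = rng.rand(100, 1) * 10 - 5
y_toy = np.sin(X_toy).ravel() + rng.normal(0, 0.1, 100)

bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=10,
                       random_state=0).fit(X_toy, y_toy)

# predict() averages the base estimators' predictions rather than voting
manual_avg = np.mean([t.predict(X_toy) for t in bag.estimators_], axis=0)
print(np.allclose(bag.predict(X_toy), manual_avg))  # True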
In [2]:
%matplotlib inline
# Author: Gilles Louppe <g.louppe@gmail.com>
# License: BSD 3 clause

# Settings
n_repeat = 50  # Number of iterations for computing expectations
n_train = 50   # Size of the training set
n_test = 1000  # Size of the test set
noise = 0.1    # Standard deviation of the noise
np.random.seed(0)

# Change this for exploring the bias-variance decomposition of other
# estimators. This should work well for estimators with high variance (e.g.,
# decision trees or KNN), but poorly for estimators with low variance (e.g.,
# linear models).
estimators = [("Tree", DecisionTreeRegressor()),
              ("Bagging(Tree)", BaggingRegressor(DecisionTreeRegressor()))]
n_estimators = len(estimators)

# Generate data
def f(x):
    x = x.ravel()
    return np.exp(-x ** 2) + 1.5 * np.exp(-(x - 2) ** 2)

def generate(n_samples, noise, n_repeat=1):
    X = np.random.rand(n_samples) * 10 - 5
    X = np.sort(X)
    if n_repeat == 1:
        y = f(X) + np.random.normal(0.0, noise, n_samples)
    else:
        y = np.zeros((n_samples, n_repeat))
        for i in range(n_repeat):
            y[:, i] = f(X) + np.random.normal(0.0, noise, n_samples)
    X = X.reshape((n_samples, 1))
    return X, y

def bias_variance_example():
    # Draw n_repeat independent training sets and one large test set
    X_train = []
    y_train = []
    for i in range(n_repeat):
        X, y = generate(n_samples=n_train, noise=noise)
        X_train.append(X)
        y_train.append(y)
    X_test, y_test = generate(n_samples=n_test, noise=noise, n_repeat=n_repeat)

    plt.rcParams["figure.figsize"] = (9, 9)

    # Loop over estimators to compare
    for n, (name, estimator) in enumerate(estimators):
        # Compute predictions
        y_predict = np.zeros((n_test, n_repeat))
        for i in range(n_repeat):
            estimator.fit(X_train[i], y_train[i])
            y_predict[:, i] = estimator.predict(X_test)

        # Bias^2 + Variance + Noise decomposition of the mean squared error
        y_error = np.zeros(n_test)
        for i in range(n_repeat):
            for j in range(n_repeat):
                y_error += (y_test[:, j] - y_predict[:, i]) ** 2
        y_error /= (n_repeat * n_repeat)

        y_noise = np.var(y_test, axis=1)
        y_bias = (f(X_test) - np.mean(y_predict, axis=1)) ** 2
        y_var = np.var(y_predict, axis=1)

        print("{0}: {1:.4f} (error) = {2:.4f} (bias^2) "
              "+ {3:.4f} (var) + {4:.4f} (noise)".format(name,
                                                         np.mean(y_error),
                                                         np.mean(y_bias),
                                                         np.mean(y_var),
                                                         np.mean(y_noise)))

        # Plot figures
        plt.subplot(2, n_estimators, n + 1)
        plt.plot(X_test, f(X_test), "b", label="$f(x)$")
        plt.plot(X_train[0], y_train[0], ".b", label="LS ~ $y = f(x)+noise$")
        for i in range(n_repeat):
            if i == 0:
                plt.plot(X_test, y_predict[:, i], "r", label=r"$\hat{y}(x)$")
            else:
                plt.plot(X_test, y_predict[:, i], "r", alpha=0.05)
        plt.plot(X_test, np.mean(y_predict, axis=1), "c",
                 label=r"$\mathbb{E}_{LS} \hat{y}(x)$")
        plt.xlim([-5, 5])
        plt.title(name)
        if n == 0:
            plt.legend(loc="upper left", prop={"size": 11})

        plt.subplot(2, n_estimators, n_estimators + n + 1)
        plt.plot(X_test, y_error, "r", label="$error(x)$")
        plt.plot(X_test, y_bias, "b", label="$bias^2(x)$")
        plt.plot(X_test, y_var, "g", label="$variance(x)$")
        plt.plot(X_test, y_noise, "c", label="$noise(x)$")
        plt.xlim([-5, 5])
        plt.ylim([0, 0.1])
        if n == 0:
            plt.legend(loc="upper left", prop={"size": 11})
    plt.show()
In [3]:
# Bias-Variance of Bagging with
# Decision Tree Regressors Illustration (adapted from ESL II, 2009)
# (Note: LS refers to a randomly drawn learning sample, i.e., a training set)
bias_variance_example()
Boosting: an iterative algorithm for "ensembling" base learners.
In [4]:
YouTubeVideo('k4G2VCuOMMg')
Out[4]:
In the context of face detection, what makes a good set of "weak learners"? A good choice turns out to be Haar-like features: the feature value is the sum of the pixel values in the white patches minus the sum of the pixel values in the black patches.
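A minimal sketch of that computation (an added example; the synthetic image and the patch coordinates are invented): a two-rectangle Haar-like feature evaluated with an integral image, so each rectangle sum costs only four lookups.

import numpy as np

rng = np.random.RandomState(0)
img = rng.randint(0, 256, size=(24, 24)).astype(np.float64)

# Integral image, zero-padded on the top/left: ii[r, c] = sum of img[:r, :c]
ii = np.zeros((25, 25))
ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(r, c, h, w):
    # Sum of the h-by-w rectangle with top-left corner (r, c), via 4 lookups
    return ii[r + h, c + w] - ii[r, c + w] - ii[r + h, c] + ii[r, c]

# Two-rectangle feature: white patch on top, black patch directly below
white = rect_sum(4, 4, 4, 8)
black = rect_sum(8, 4, 4, 8)
print(white - black)  # the feature response for this window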
(Proof in Mohri et al., Foundations of Machine Learning, 2012.)
We can instead use a surrogate loss function $\phi$, giving the following optimization problem: $$\min_{F \in \text{span}(\mathscr{F})} \frac{1}{n} \sum \limits_{i = 1}^n \phi(y_i F(\mathbf{x}_i))$$
Examples of surrogate losses include the exponential loss $\phi(u) = e^{-u}$ (used by AdaBoost) and the logistic loss $\phi(u) = \log_2(1 + e^{-u})$.
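As a quick sanity check (an added snippet, not in the original), both of these surrogates upper-bound the 0-1 loss $\mathbb{1}_{\{u \le 0\}}$ as a function of the margin $u = y F(\mathbf{x})$:

import numpy as np

u = np.linspace(-2, 2, 9)           # margins u = y * F(x)
zero_one = (u <= 0).astype(float)   # 0-1 loss
exp_loss = np.exp(-u)               # exponential (AdaBoost) surrogate
log_loss = np.log2(1 + np.exp(-u))  # logistic surrogate
print(np.all(exp_loss >= zero_one), np.all(log_loss >= zero_one))  # True True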
For the stagewise update $F_t = F_{t-1} + \alpha f$, let $B(\alpha, f) = \frac{1}{n}\sum \limits_{i = 1}^n \phi\big(y_i (F_{t-1}(\mathbf{x}_i) + \alpha f(\mathbf{x}_i))\big)$. Then
$$\left.\frac{\partial B(\alpha, f)}{\partial \alpha}\right\vert_{\alpha = 0} = \frac{1}{n}\sum \limits_{i = 1}^n y_i f(\mathbf{x}_i)\,\phi'(y_i F_{t - 1}(\mathbf{x}_i))$$
Minimizing the above with respect to $f$ is equivalent to minimizing $-\sum \limits_{i = 1}^n y_i f(\mathbf{x}_i)\,\frac{\phi'(y_i F_{t - 1}(\mathbf{x}_i))}{\sum \limits_{j = 1}^n \phi'(y_j F_{t - 1}(\mathbf{x}_j))}$. (The minus sign is needed because $\phi' < 0$: the normalizing sum in the denominator is negative, so dividing by it reverses the direction of optimization.)
Setting $w_i^t = \frac{\phi'(y_i F_{t - 1}(\mathbf{x}_i))}{\sum \limits_{j = 1}^n \phi'(y_j F_{t - 1}(\mathbf{x}_j))}$ and using $y_i f(\mathbf{x}_i) = \mathbb{1}_{\{f(\mathbf{x}_i) = y_i\}} - \mathbb{1}_{\{f(\mathbf{x}_i) \neq y_i\}}$ (both labels and predictions lie in $\{-1, +1\}$), the problem reduces to minimizing $$\sum \limits_{i = 1}^n w_i^t\,\mathbb{1}_{\{f(\mathbf{x}_i) \neq y_i\}} - \sum \limits_{i = 1}^n w_i^t\,\mathbb{1}_{\{f(\mathbf{x}_i) = y_i\}}$$
Since the weights $w_i^t$ sum to one, this objective equals $2\left(\sum \limits_{i = 1}^n w_i^t\,\mathbb{1}_{\{f(\mathbf{x}_i) \neq y_i\}}\right) - 1$: choosing $f$ amounts to minimizing the weighted misclassification error.
Thus, to solve the first step (choosing $f_t$), we simply apply the base learner to the training data reweighted by the $w_i^t$.
The second step, choosing the step size $\alpha_t = \arg\min_{\alpha} B(\alpha, f_t)$, is just a scalar minimization problem that can be solved numerically, e.g., via Newton's method, if no closed-form solution is available.
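A minimal sketch of that scalar search (an added example; the arrays here are invented stand-ins for quantities computed during boosting): Newton's method applied to $B(\alpha) = \frac{1}{n}\sum_i \phi\big(y_i(F_{t-1}(\mathbf{x}_i) + \alpha f_t(\mathbf{x}_i))\big)$ with the natural-log logistic surrogate (the constant factor versus $\log_2$ does not change the minimizer).

import numpy as np

def newton_step_size(dphi, d2phi, margins, updates, n_iter=20):
    # Minimize B(alpha) = mean(phi(margins + alpha * updates)) over the scalar
    # alpha; margins stands for y_i * F_{t-1}(x_i), updates for y_i * f_t(x_i)
    alpha = 0.0
    for _ in range(n_iter):
        z = margins + alpha * updates
        grad = np.mean(dphi(z) * updates)        # B'(alpha)
        hess = np.mean(d2phi(z) * updates ** 2)  # B''(alpha) > 0 for convex phi
        alpha -= grad / hess
    return alpha

# Logistic surrogate phi(u) = log(1 + exp(-u)) and its first two derivatives
phi = lambda u: np.log1p(np.exp(-u))
dphi = lambda u: -1.0 / (1.0 + np.exp(u))
d2phi = lambda u: np.exp(u) / (1.0 + np.exp(u)) ** 2

rng = np.random.RandomState(0)
margins = rng.normal(0.5, 1.0, size=100)     # stand-in for y_i * F_{t-1}(x_i)
updates = rng.choice([-1.0, 1.0], size=100)  # stand-in for y_i * f_t(x_i)
alpha_t = newton_step_size(dphi, d2phi, margins, updates)
print(alpha_t, np.mean(phi(margins + alpha_t * updates)))  # step size, B(alpha_t)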