Single decision trees generally overfit, leading to poor predictive performance. Tree ensembles (RF, GBM) perform well, but are black-box models. In this notebook, we investigate whether smoothing the predictions in decision trees can produce a traceable, white-box model with improved predictive accuracy.
In a smoothed regression tree, the node values, $s_n$, will be as follows:
$$ s_n = \begin{cases} y_n & \text{$n$ is the root}\\ \frac{w_n y_n + v_{ss}\, s_p}{w_n + v_{ss}} & \text{otherwise} \end{cases} $$

where $y_n$ is the mean of the targets in node $n$, $s_n$ is the smoothed value of node $n$, $s_p$ is the smoothed value of the parent of node $n$, $v_{ss}$ is the virtual sample size (a free parameter of the model), and $w_n$ is the total weight of data in node $n$, or the count if the tree is unweighted.
Smoothed classification trees are similar, but operate on class probabilities instead of on the mean of the targets.
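To make the recursion concrete, here is a minimal sketch of the smoothing formula. It uses a hypothetical Node record for illustration only, not arboretum's internal representation: each node's mean is shrunk towards its parent's smoothed value along a root-to-leaf path, and the leaf's smoothed value is the prediction.

from collections import namedtuple

# Hypothetical node record (illustration only): y is the mean target in the
# node, w is the total weight (or count) of training data reaching the node.
Node = namedtuple('Node', ['y', 'w'])

def smoothed_value(node, parent_smoothed, vss):
    # The root keeps its raw mean; every other node is shrunk towards its parent.
    if parent_smoothed is None:
        return node.y
    return (node.w * node.y + vss * parent_smoothed) / (node.w + vss)

# Smooth the values along one root-to-leaf path with a virtual sample size of 5.
path = [Node(y=150.0, w=300), Node(y=180.0, w=40), Node(y=220.0, w=6)]
s = None
for node in path:
    s = smoothed_value(node, s, vss=5)
print(s)  # the leaf's smoothed value, i.e. the prediction for that leaf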
In [3]:
from arboretum.datasets import load_diabetes
xtr, ytr, xte, yte = load_diabetes()
xtr.shape, xte.shape
Out[3]:
We will compare a smoothed regression tree from arboretum to a regression tree and a random forest from scikit-learn. First, we'll just run the models once; then we will investigate their performance in more detail.
In [4]:
from sklearn.metrics import mean_squared_error as mse
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from arboretum import SmoothRegressionTree
from sklearn.model_selection import GridSearchCV
dtr = DecisionTreeRegressor(min_samples_leaf=5)
rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=5)
mytree = SmoothRegressionTree(vss=5, min_leaf=5)
In [5]:
dtr.fit(xtr, ytr)
pred = dtr.predict(xte)
mse(yte, pred)
Out[5]:
In [6]:
mytree.fit(xtr, ytr)
pred = mytree.predict(xte)
mse(yte, pred)
Out[6]:
In [7]:
rf.fit(xtr, ytr)
pred = rf.predict(xte)
mse(yte, pred)
Out[7]:
So, off-hand, it looks like the smoothed regression tree is in between a single tree and a random forest in terms of accuracy, but closer to the single tree. However, it has more free parameters, so we need to investigate further. We'll give the regular tree one more control parameter so that both models have two.
In [8]:
params = {'min_samples_leaf': [5, 10, 20, 50, 100], 'max_depth': [2, 4, 8, 16, None]}
gcv = GridSearchCV(dtr, params, scoring='neg_mean_squared_error')
gcv.fit(xtr, ytr)
gcv.best_score_, gcv.best_params_
Out[8]:
That is better than the naively configured smoothing tree. Note that scikit-learn uses negative mean squared error as the scoring function because GridSearchCV maximizes the score. On the test set, that estimator gets:
In [9]:
pred = gcv.predict(xte)
mse(yte, pred)
Out[9]:
So how does the smoothing tree do? We'll try it two ways: once with vss and min_leaf set, and then with vss and max_depth.
In [10]:
myparams = {'min_leaf': [5, 10, 20, 50, 100], 'vss': [5, 10, 20, 50, 100]}
gcv = GridSearchCV(mytree, myparams, scoring='neg_mean_squared_error')
gcv.fit(xtr, ytr)
mypred = gcv.predict(xte)
mse(yte, mypred), gcv.best_score_, gcv.best_params_
Out[10]:
That's about the same. Next we'll try it with the other parameter set:
In [11]:
myparams = {'max_depth': [2, 4, 8, 16, None], 'vss': [5, 10, 20, 50, 100]}
gcv = GridSearchCV(mytree, myparams, scoring='neg_mean_squared_error')
gcv.fit(xtr, ytr)
mypred = gcv.predict(xte)
mse(yte, mypred), gcv.best_score_, gcv.best_params_
Out[11]:
And that's worse. So our initial impression that the smoothing tree was moderately better was due to better control of model capacity from the extra parameter; the advantage disappeared when we searched over equal numbers of parameters.
In [14]:
from arboretum.datasets import load_als
xtr, ytr, xte, yte = load_als()
xtr.shape, xte.shape
Out[14]:
The constant model gets an MSE of about 0.32, which is better than both the tree and the smoothed tree (at these parameters), but worse than the RF model. The RF model's 0.26 is a good score on this data.
In [15]:
mse(yte, 0 * yte + ytr.mean())
Out[15]:
In [16]:
dtr.fit(xtr, ytr)
pred = dtr.predict(xte)
mse(yte, pred)
Out[16]:
In [17]:
mytree.fit(xtr, ytr)
pred = mytree.predict(xte)
mse(yte, pred)
Out[17]:
In [18]:
rf.n_estimators = 100
rf.fit(xtr, ytr)
pred = rf.predict(xte)
mse(yte, pred)
Out[18]:
The $v_{ss}$ parameter of an arboretum.SmoothRegressionTree can be changed without refitting. For this noisy data, much higher smoothing values are better.
In [19]:
mytree.vss = 100
pred = mytree.predict(xte)
mse(yte, pred)
Out[19]:
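Since changing vss requires no refit, a whole range of values can be swept on the already-fitted tree. A quick sketch, reusing the mytree, xte, and yte objects from above:

# Sweep the virtual sample size on the already-fitted tree; no refit needed.
for v in (5, 10, 20, 50, 100, 200):
    mytree.vss = v
    print(v, mse(yte, mytree.predict(xte)))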
Once again, it looks like smoothing trees are in between the results for a single tree and an RF. Like before, we'll compare the smoothing tree, with vss and one other control parameter, to a regular tree with two control parameters.
In [24]:
params = {'min_samples_leaf': [5, 10, 20, 50, 100, 200, 400], 'max_depth': [2, 4, 8, 16, None]}
gcv = GridSearchCV(dtr, params, scoring='neg_mean_squared_error')
gcv.fit(xtr, ytr)
pred = gcv.predict(xte)
mse(yte, pred), gcv.best_score_, gcv.best_params_
Out[24]:
In [26]:
myparams = {'min_leaf': [5, 10, 20, 50, 100], 'vss': [5, 10, 20, 50, 100, 200, 400]}
gcv = GridSearchCV(mytree, myparams, scoring='neg_mean_squared_error')
gcv.fit(xtr, ytr)
mypred = gcv.predict(xte)
mse(yte, mypred), gcv.best_score_, gcv.best_params_
Out[26]:
In [27]:
myparams = {'max_depth': [2, 4, 8, 16, None], 'vss': [5, 10, 20, 50, 100, 200, 400]}
gcv = GridSearchCV(mytree, myparams, scoring='neg_mean_squared_error')
gcv.fit(xtr, ytr)
mypred = gcv.predict(xte)
mse(yte, mypred), gcv.best_score_, gcv.best_params_
Out[27]:
So once again, on closer examination, the smoothing tree is not better than a well-tuned regular tree.
In [29]:
from arboretum import RFRegressor
myrf = RFRegressor()
myrf.base_estimator = mytree
myrf.fit(xtr[:10], ytr[:10])
Out[29]:
In [30]:
myrf.n_trees = 100
myrf.fit(xtr, ytr)
pred = myrf.predict(xte)
mse(yte, pred)
Out[30]:
From before we got:
In [31]:
rf.fit(xtr, ytr)
pred = rf.predict(xte)
mse(yte, pred)
Out[31]:
So it looks like smoothing doesn't help at all in an RF model.
Smoothing trees initially looked promising; however, closer investigation indicates that the initial positive results were due to better control of overfitting from the extra control parameter. When comparing two-parameter models after a grid search, smoothing trees were not better than regular trees. In summary, well-tuned decision trees perform as well as smoothing trees, given the same number of control parameters.