Single decision trees generally overfit, leading to poor predictive performance. Tree ensembles (RF, GBM) perform well, but are black-box models. In this notebook, we investigate whether smoothing the predictions in decision trees can produce a traceable, white-box model with improved predictive accuracy.
In a smoothed regression tree, the node values, $s_n$, will be as follows:
$$ s_n = \begin{cases} y_n & \text{$n$ is the root}\\ \frac{w_n y_n + v_{ss}\, s_p}{w_n + v_{ss}} & \text{otherwise} \end{cases} $$

where $y_n$ is the mean of the targets in node $n$, $s_n$ is the smoothed value of node $n$, $s_p$ is the smoothed value of the parent of node $n$, $v_{ss}$ is the virtual sample size (a free parameter of the model), and $w_n$ is the total weight of data in node $n$, or the count if the tree is unweighted.
Smoothed classification trees are similar, but operate on class probabilities instead of on the mean of the targets.
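To make the recursion concrete, here is a minimal sketch of the smoothing formula. It uses a hypothetical Node record for illustration only, not arboretum's internal representation: each node's mean is shrunk towards its parent's smoothed value along a root-to-leaf path, and the leaf's smoothed value is the prediction.

from collections import namedtuple

# Hypothetical node record (illustration only): y is the mean target in the
# node, w is the total weight (or count) of training data reaching the node.
Node = namedtuple('Node', ['y', 'w'])

def smoothed_value(node, parent_smoothed, vss):
    # The root keeps its raw mean; every other node is shrunk towards its parent.
    if parent_smoothed is None:
        return node.y
    return (node.w * node.y + vss * parent_smoothed) / (node.w + vss)

# Smooth the values along one root-to-leaf path with a virtual sample size of 5.
path = [Node(y=150.0, w=300), Node(y=180.0, w=40), Node(y=220.0, w=6)]
s = None
for node in path:
    s = smoothed_value(node, s, vss=5)
print(s)  # the leaf's smoothed value, i.e. the prediction for that leaf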
In [3]:
from arboretum.datasets import load_diabetes
xtr, ytr, xte, yte = load_diabetes()
xtr.shape, xte.shape
Out[3]:
We will compare a smoothed regression tree from arboretum to a regression tree and a random forest from scikit-learn. First, we'll just run the models once; then we will investigate their performance in more detail.
In [4]:
from sklearn.metrics import mean_squared_error as mse
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from arboretum import SmoothRegressionTree
from sklearn.model_selection import GridSearchCV
dtr = DecisionTreeRegressor(min_samples_leaf=5)
rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=5)
mytree = SmoothRegressionTree(vss=5, min_leaf=5)
In [5]:
dtr.fit(xtr, ytr)
pred = dtr.predict(xte)
mse(yte, pred)
Out[5]:
In [6]:
mytree.fit(xtr, ytr)
pred = mytree.predict(xte)
mse(yte, pred)
Out[6]:
In [7]:
rf.fit(xtr, ytr)
pred = rf.predict(xte)
mse(yte, pred)
Out[7]:
So, off-hand, it looks like the smoothed regression tree is in between a single tree and a random forest in terms of accuracy, but closer to the single tree. However, it has more free parameters, so we need to investigate further. We'll give the regular tree one more control parameter so that both models have two.
In [8]:
params = {'min_samples_leaf': [5, 10, 20, 50, 100], 'max_depth': [2, 4, 8, 16, None]}
gcv = GridSearchCV(dtr, params, scoring='neg_mean_squared_error')
gcv.fit(xtr, ytr)
gcv.best_score_, gcv.best_params_
Out[8]:
That is better than the naively configured smoothing tree. Note that scikit-learn uses negative mean squared error as the scoring function because GridSearchCV maximizes the score. On the test set, that estimator gets:
In [9]:
pred = gcv.predict(xte)
mse(yte, pred)
Out[9]:
So how does the smoothing tree do? We'll try it two ways: once with vss and min_leaf set, and then with vss and max_depth.
In [10]:
myparams = {'min_leaf': [5, 10, 20, 50, 100], 'vss': [5, 10, 20, 50, 100]}
gcv = GridSearchCV(mytree, myparams, scoring='neg_mean_squared_error')
gcv.fit(xtr, ytr)
mypred = gcv.predict(xte)
mse(yte, mypred), gcv.best_score_, gcv.best_params_
Out[10]:
That's about the same. Next we'll try it with the other parameter set:
In [11]:
myparams = {'max_depth': [2, 4, 8, 16, None], 'vss': [5, 10, 20, 50, 100]}
gcv = GridSearchCV(mytree, myparams, scoring='neg_mean_squared_error')
gcv.fit(xtr, ytr)
mypred = gcv.predict(xte)
mse(yte, mypred), gcv.best_score_, gcv.best_params_
Out[11]:
And that's worse. So our initial impression that the smoothing tree was moderately better was due to better control of model capacity from the extra parameter; the advantage disappeared when we searched over equal numbers of parameters.
In [14]:
from arboretum.datasets import load_als
xtr, ytr, xte, yte = load_als()
xtr.shape, xte.shape
Out[14]:
The constant model gets an MSE of about 0.32, which is better than both the tree and the smoothed tree (at these parameters), but worse than the RF model. The RF model's 0.26 is a good score on this data.
In [15]:
mse(yte, 0 * yte + ytr.mean())
Out[15]:
In [16]:
dtr.fit(xtr, ytr)
pred = dtr.predict(xte)
mse(yte, pred)
Out[16]:
In [17]:
mytree.fit(xtr, ytr)
pred = mytree.predict(xte)
mse(yte, pred)
Out[17]:
In [18]:
rf.n_estimators = 100
rf.fit(xtr, ytr)
pred = rf.predict(xte)
mse(yte, pred)
Out[18]:
The $v_{ss}$ parameter of an arboretum.SmoothRegressionTree can be changed without refitting. For this noisy data, much higher smoothing values are better.
In [19]:
mytree.vss = 100
pred = mytree.predict(xte)
mse(yte, pred)
Out[19]:
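Since changing vss requires no refit, a whole range of values can be swept on the already-fitted tree. A quick sketch, reusing the mytree, xte, and yte objects from above:

# Sweep the virtual sample size on the already-fitted tree; no refit needed.
for v in (5, 10, 20, 50, 100, 200):
    mytree.vss = v
    print(v, mse(yte, mytree.predict(xte)))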
Once again, it looks like smoothing trees are in between the results for a single tree and an RF. Like before, we'll compare the smoothing tree, with vss and one other control parameter, to a regular tree with two control parameters.
In [24]:
params = {'min_samples_leaf': [5, 10, 20, 50, 100, 200, 400], 'max_depth': [2, 4, 8, 16, None]}
gcv = GridSearchCV(dtr, params, scoring='neg_mean_squared_error')
gcv.fit(xtr, ytr)
pred = gcv.predict(xte)
mse(yte, pred), gcv.best_score_, gcv.best_params_
Out[24]:
In [26]:
myparams = {'min_leaf': [5, 10, 20, 50, 100], 'vss': [5, 10, 20, 50, 100, 200, 400]}
gcv = GridSearchCV(mytree, myparams, scoring='neg_mean_squared_error')
gcv.fit(xtr, ytr)
mypred = gcv.predict(xte)
mse(yte, mypred), gcv.best_score_, gcv.best_params_
Out[26]:
In [27]:
myparams = {'max_depth': [2, 4, 8, 16, None], 'vss': [5, 10, 20, 50, 100, 200, 400]}
gcv = GridSearchCV(mytree, myparams, scoring='neg_mean_squared_error')
gcv.fit(xtr, ytr)
mypred = gcv.predict(xte)
mse(yte, mypred), gcv.best_score_, gcv.best_params_
Out[27]:
So once again, on closer examination, the smoothing tree is not better than a well-tuned regular tree.
In [29]:
from arboretum import RFRegressor
myrf = RFRegressor()
myrf.base_estimator = mytree
myrf.fit(xtr[:10], ytr[:10])
Out[29]:
In [30]:
myrf.n_trees = 100
myrf.fit(xtr, ytr)
pred = myrf.predict(xte)
mse(yte, pred)
Out[30]:
From before we got:
In [31]:
rf.fit(xtr, ytr)
pred = rf.predict(xte)
mse(yte, pred)
Out[31]:
So it looks like smoothing doesn't help at all in an RF model.
Smoothing trees initially looked promising; however, closer investigation indicates that the initial positive results were due to better control of overfitting from the extra control parameter. When comparing two-parameter models after a grid search, smoothing trees were not better than regular trees. In summary, well-tuned decision trees perform as well as smoothing trees, given the same number of control parameters.