Regression trees

How do regression trees work?

1- Consider a regression problem with a continuous response z and two predictors x and y

2- We begin by splitting the space into two regions using a rule of the form $x \leq s$ or $y \leq s$, and model the response by the mean of z within each region

3- The optimal split (in terms of reducing the residual sum of squares) is found over all predictors (x and y) and all possible split points s

4- The process is then repeated in a recursive fashion for each of the two sub-regions

5- This process continues until some stopping rule is applied

6- For example, letting $\{R_m\}$ denote the collection of rectangular regions forming the partition, we might continue splitting until the number of regions $|\{R_m\}|$ reaches 10

7- The end result is a piecewise-constant model over the partition $\{R_m\}$ of the form:

$$f(x,y)=\sum_{m} c_m \, I\big((x,y) \in R_m\big)$$

where $c_m$ is the constant for the $m^{th}$ region (i.e., the mean of $z_i$ over the observations with $(x_i, y_i) \in R_m$)
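
To make the split search in steps 2-3 concrete, here is a minimal sketch (not part of the original notebook): it scans every candidate split point on a single toy predictor and keeps the one that minimizes the residual sum of squares. The arrays x and z and the helper best_split are made up purely for illustration.

import numpy as np

# Toy data: one predictor x and a continuous response z (made-up values)
x = np.array([1., 2., 3., 4., 5., 6., 7., 8.])
z = np.array([5., 6., 5., 7., 20., 22., 19., 21.])

def best_split(x, z):
    """Split point s minimizing the RSS when z is modeled by its mean
    on the two regions {x <= s} and {x > s}."""
    best_s, best_rss = None, np.inf
    xs = np.sort(x)
    for s in (xs[:-1] + xs[1:]) / 2.0:   # midpoints between consecutive values
        left, right = z[x <= s], z[x > s]
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if rss < best_rss:
            best_s, best_rss = s, rss
    return best_s, best_rss

print(best_split(x, z))   # the best split falls between x = 4 and x = 5, where z jumps

A regression tree applies this search to every predictor, keeps the best split overall, and then recurses within each of the two resulting regions.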

Numerical example:

Regression trees with Python

Simple example: one feature


In [1]:
# Import the plotting library and set a custom color cycle

import matplotlib as mpl
from cycler import cycler
mpl.rcParams['axes.prop_cycle'] = cycler(color=['#7FB5C7', '#E63E65', '#5B5BC9', '#55D957'])
%matplotlib inline



In [2]:
# Import the data manipulation library and read the training set into a dataframe

import pandas as pd
bikes = pd.read_csv('data/bikes.csv')
bikes.head()


Out[2]:
         date  temperature   humidity  windspeed  count
0  2011-01-03     2.716070  45.715346  21.414957    120
1  2011-01-04     2.896673  54.267219  15.136882    108
2  2011-01-05     4.235654  45.697702  17.034578     82
3  2011-01-06     3.112643  50.237349  10.091568     88
4  2011-01-07     2.723918  49.144928  15.738204    148

In [3]:
# Plot the data

from matplotlib import pyplot as plt

plt.figure(figsize=(8,6))
plt.plot(bikes['temperature'], bikes['count'], 'o')
plt.xlabel('temperature')
plt.ylabel('bikes')
plt.show()



In [4]:
# Import the DecisionTreeRegressor and train it!

from sklearn.tree import DecisionTreeRegressor
import numpy as np

regressor = DecisionTreeRegressor(max_depth=2)
regressor.fit(np.array([bikes['temperature']]).T, bikes['count'])  # reshape the single feature into an (n_samples, 1) array


Out[4]:
DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')

In [5]:
regressor.predict([[5.]])  # predict expects a 2D array of samples


Out[5]:
array([ 189.23183761])

In [6]:
regressor.predict([[20.]])


Out[6]:
array([ 769.08756039])

In [7]:
# plot the fit

xx = np.array([np.linspace(-5, 40, 100)]).T

plt.figure(figsize=(8,6))
plt.plot(bikes['temperature'], bikes['count'], 'o', label='observation')
plt.plot(xx, regressor.predict(xx), linewidth=4, alpha=.7, label='prediction')
plt.xlabel('temperature')
plt.ylabel('bikes')
plt.legend()
plt.show()


Simple example: two features


In [8]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
import numpy as np

from matplotlib import pyplot as plt
%matplotlib inline

bikes = pd.read_csv('data/bikes.csv')
regressor = DecisionTreeRegressor(max_depth=2)
regressor.fit(bikes[['temperature', 'humidity']], bikes['count'])


Out[8]:
DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')

In [9]:
nx = 30
ny = 30
# creating a grid of points
x_temperature = np.linspace(-5, 40, nx)
y_humidity = np.linspace(20, 80, ny)
xx, yy = np.meshgrid(x_temperature, y_humidity)
# evaluating the regressor at every grid point
z_bikes = regressor.predict(np.array([xx.flatten(), yy.flatten()]).T)
zz = np.reshape(z_bikes, (ny, nx))  # meshgrid arrays have shape (ny, nx)


fig = plt.figure(figsize=(8, 8))
# plotting the predictions
plt.pcolormesh(x_temperature, y_humidity, zz, cmap=plt.cm.YlOrRd)
plt.colorbar(label='bikes predicted') # add a colorbar on the right
# plotting also the observations
plt.scatter(bikes['temperature'], bikes['humidity'], s=bikes['count']/25.0, c='g')
# setting the limit for each axis
plt.xlim(np.min(x_temperature), np.max(x_temperature))
plt.ylim(np.min(y_humidity), np.max(y_humidity))
plt.xlabel('temperature')
plt.ylabel('humidity')
plt.show()




In [10]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(bikes['count'], regressor.predict(bikes[['temperature','humidity']]))


Out[10]:
181.28165652686295

In [11]:
from sklearn.model_selection import cross_val_score
scores = -cross_val_score(regressor, bikes[['temperature', 'humidity']],
                          bikes['count'], scoring='neg_mean_absolute_error', cv=10)



In [12]:
scores.mean()


Out[12]:
224.66188344455881

In [13]:
regressor2 = DecisionTreeRegressor(max_depth=100)
regressor2.fit(bikes[['temperature', 'humidity']], bikes['count'])


Out[13]:
DecisionTreeRegressor(criterion='mse', max_depth=100, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')

In [14]:
mean_absolute_error(bikes['count'], regressor2.predict(bikes[['temperature','humidity']]))


Out[14]:
0.0

In [15]:
scores = -cross_val_score(regressor2, bikes[['temperature', 'humidity']],
                          bikes['count'], scoring='neg_mean_absolute_error', cv=10)
print(scores.mean())


240.125328196

Random forest regressors

Random forest regressors work much like random forest classifiers, except that we average the outputs of the individual trees instead of taking a majority vote.

1- Draw $n_{trees}$ bootstrap samples from the original data.

2- For each of the bootstrap samples, grow an unpruned classification or regression tree, with the following modification: at each node, rather than choosing the best split among all predictors, randomly sample $m_{try}$ of the predictors and choose the best split from among those variables.

3- Predict new data by aggregating the predictions of the $n_{trees}$ trees (i.e., majority vote for classification, average for regression); see the sketch below.
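
The three steps above can be spelled out with scikit-learn's DecisionTreeRegressor as the base learner. This is only an illustrative sketch (the function forest_fit_predict and its defaults are made up here); the RandomForestRegressor used in the next cell implements the same idea, with max_features playing the role of $m_{try}$.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def forest_fit_predict(X, y, X_new, n_trees=100, m_try=1, seed=0):
    """Minimal random-forest regressor: bootstrap + feature subsampling + averaging."""
    rng = np.random.RandomState(seed)
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    X_new = np.asarray(X_new, dtype=float)
    predictions = np.zeros((n_trees, len(X_new)))
    for t in range(n_trees):
        # 1- draw a bootstrap sample of the training rows
        idx = rng.randint(0, len(X), size=len(X))
        # 2- grow an unpruned tree, considering only m_try predictors at each split
        tree = DecisionTreeRegressor(max_features=m_try, random_state=rng)
        tree.fit(X[idx], y[idx])
        predictions[t] = tree.predict(X_new)
    # 3- aggregate by averaging the trees' predictions
    return predictions.mean(axis=0)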


In [29]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
import numpy as np

from matplotlib import pyplot as plt
%matplotlib inline

bikes = pd.read_csv('data/bikes.csv')
regressor = RandomForestRegressor(max_depth=100, n_estimators=1000)
regressor.fit(bikes[['temperature', 'humidity']], bikes['count'])


Out[29]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=100,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=1000, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

In [30]:
nx = 30
ny = 30
# creating a grid of points
x_temperature = np.linspace(-5, 40, nx)
y_humidity = np.linspace(20, 80, ny)
xx, yy = np.meshgrid(x_temperature, y_humidity)
# evaluating the regressor at every grid point
z_bikes = regressor.predict(np.array([xx.flatten(), yy.flatten()]).T)
zz = np.reshape(z_bikes, (ny, nx))  # meshgrid arrays have shape (ny, nx)


fig = plt.figure(figsize=(8, 8))
# plotting the predictions
plt.pcolormesh(x_temperature, y_humidity, zz, cmap=plt.cm.YlOrRd)
plt.colorbar(label='bikes predicted') # add a colorbar on the right
# plotting also the observations
plt.scatter(bikes['temperature'], bikes['humidity'], s=bikes['count']/25.0, c='g')
# setting the limit for each axis
plt.xlim(np.min(x_temperature), np.max(x_temperature))
plt.ylim(np.min(y_humidity), np.max(y_humidity))
plt.xlabel('temperature')
plt.ylabel('humidity')
plt.show()




In [31]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(bikes['count'], regressor.predict(bikes[['temperature','humidity']]))


Out[31]:
62.988285436671205

In [32]:
from sklearn.model_selection import cross_val_score
scores = -cross_val_score(regressor, bikes[['temperature', 'humidity']],
                          bikes['count'], scoring='neg_mean_absolute_error', cv=10)

In [33]:
scores.mean()


Out[33]:
207.706182619863

Gradient boosters

Gradient Boosting:

– Use a simple regression model to start

– Subsequent models predict the error residual of the previous predictions

– The overall prediction is a weighted sum of the collection of models (see the sketch below)
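
The bullets above can be turned into a short sketch for squared-error loss, where "predict the residual" and "follow the negative gradient" coincide. The function boost_fit_predict below is made up for illustration and is not the exact algorithm inside scikit-learn's GradientBoostingRegressor, which generalizes this to other loss functions.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_fit_predict(X, y, X_new, n_estimators=100, learning_rate=0.1, max_depth=2):
    """Least-squares boosting: each tree is fit to the residuals of the running model."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    X_new = np.asarray(X_new, dtype=float)
    current = np.full(len(y), y.mean())          # start from a simple model: the mean
    prediction = np.full(len(X_new), y.mean())
    for _ in range(n_estimators):
        residual = y - current                   # what the model still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)                    # the next model predicts the residual
        current += learning_rate * tree.predict(X)          # weighted (shrunken) sum ...
        prediction += learning_rate * tree.predict(X_new)   # ... of the collection
    return prediction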


In [21]:
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor  #GBM algorithm
import numpy as np

from matplotlib import pyplot as plt
%matplotlib inline

bikes = pd.read_csv('data/bikes.csv')
regressor = GradientBoostingRegressor(max_depth=2, n_estimators=1000, learning_rate=0.05)
regressor.fit(bikes[['temperature', 'humidity']], bikes['count'])


Out[21]:
GradientBoostingRegressor(alpha=0.9, init=None, learning_rate=0.05, loss='ls',
             max_depth=2, max_features=None, max_leaf_nodes=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=1000,
             presort='auto', random_state=None, subsample=1.0, verbose=0,
             warm_start=False)

In [22]:
nx = 30
ny = 30
# creating a grid of points
x_temperature = np.linspace(-5, 40, nx)
y_humidity = np.linspace(20, 80, ny)
xx, yy = np.meshgrid(x_temperature, y_humidity)
# evaluating the regressor at every grid point
z_bikes = regressor.predict(np.array([xx.flatten(), yy.flatten()]).T)
zz = np.reshape(z_bikes, (ny, nx))  # meshgrid arrays have shape (ny, nx)


fig = plt.figure(figsize=(8, 8))
# plotting the predictions
plt.pcolormesh(x_temperature, y_humidity, zz, cmap=plt.cm.YlOrRd)
plt.colorbar(label='bikes predicted') # add a colorbar on the right
# plotting also the observations
plt.scatter(bikes['temperature'], bikes['humidity'], s=bikes['count']/25.0, c='g')
# setting the limit for each axis
plt.xlim(np.min(x_temperature), np.max(x_temperature))
plt.ylim(np.min(y_humidity), np.max(y_humidity))
plt.xlabel('temperature')
plt.ylabel('humidity')
plt.show()




In [23]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(bikes['count'], regressor.predict(bikes[['temperature','humidity']]))


Out[23]:
104.58756410230531

In [24]:
from sklearn.model_selection import cross_val_score
scores = -cross_val_score(regressor, bikes[['temperature', 'humidity']],
                          bikes['count'], scoring='neg_mean_absolute_error', cv=10)


scores.mean()


Out[24]:
203.68906484711539

SVM regression


In [25]:
import pandas as pd
from sklearn.svm import SVR
import numpy as np

from matplotlib import pyplot as plt
%matplotlib inline

bikes = pd.read_csv('data/bikes.csv')
regressor = SVR()
regressor.fit(bikes[['temperature', 'humidity']], bikes['count'])


Out[25]:
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [26]:
nx = 30
ny = 30
# creating a grid of points
x_temperature = np.linspace(-5, 40, nx)
y_humidity = np.linspace(20, 80, ny)
xx, yy = np.meshgrid(x_temperature, y_humidity)
# evaluating the regressor at every grid point
z_bikes = regressor.predict(np.array([xx.flatten(), yy.flatten()]).T)
zz = np.reshape(z_bikes, (ny, nx))  # meshgrid arrays have shape (ny, nx)


fig = plt.figure(figsize=(8, 8))
# plotting the predictions
plt.pcolormesh(x_temperature, y_humidity, zz, cmap=plt.cm.YlOrRd)
plt.colorbar(label='bikes predicted') # add a colorbar on the right
# plotting also the observations
plt.scatter(bikes['temperature'], bikes['humidity'], s=bikes['count']/25.0, c='g')
# setting the limit for each axis
plt.xlim(np.min(x_temperature), np.max(x_temperature))
plt.ylim(np.min(y_humidity), np.max(y_humidity))
plt.xlabel('temperature')
plt.ylabel('humidity')
plt.show()




In [27]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(bikes['count'], regressor.predict(bikes[['temperature','humidity']]))


Out[27]:
321.1636207694396

In [28]:
from sklearn.model_selection import cross_val_score
scores = -cross_val_score(regressor, bikes[['temperature', 'humidity']],
                          bikes['count'], scoring='neg_mean_absolute_error', cv=10)


scores.mean()


Out[28]:
363.74462717482004
