Regression trees

How do regression trees work?

1- Consider a regression problem with a continuous response z and two predictors x and y

2- We begin by splitting the space into two regions using a rule of the form $x \leq s$ or $y \leq s$, and model the response by the mean of z within each region

3- The optimal split (in terms of reducing the residual sum of squares) is found over all predictors (x and y) and all possible split points s

4- The process is then repeated in a recursive fashion for each of the two sub-regions

5- This process continues until some stopping rule is applied

6- For example, letting $\{R_m\}$ denote the collection of rectangular regions forming the partition, we might continue splitting until the number of regions $|\{R_m\}|$ reaches 10

7- The end result is a piecewise-constant model over the partition $\{R_m\}$ of the form:

$$f(x,y)=\sum_{m} c_m \, I\big((x,y) \in R_m\big)$$

where $c_m$ is the constant for the $m^{th}$ region (i.e., the mean of $z_i$ over the observations with $(x_i, y_i) \in R_m$)
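
To make the split search in steps 2-3 concrete, here is a minimal sketch (not part of the original notebook): it scans every candidate split point on a single toy predictor and keeps the one that minimizes the residual sum of squares. The arrays x and z and the helper best_split are made up purely for illustration.

import numpy as np

# Toy data: one predictor x and a continuous response z (made-up values)
x = np.array([1., 2., 3., 4., 5., 6., 7., 8.])
z = np.array([5., 6., 5., 7., 20., 22., 19., 21.])

def best_split(x, z):
    """Split point s minimizing the RSS when z is modeled by its mean
    on the two regions {x <= s} and {x > s}."""
    best_s, best_rss = None, np.inf
    xs = np.sort(x)
    for s in (xs[:-1] + xs[1:]) / 2.0:   # midpoints between consecutive values
        left, right = z[x <= s], z[x > s]
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if rss < best_rss:
            best_s, best_rss = s, rss
    return best_s, best_rss

print(best_split(x, z))   # the best split falls between x = 4 and x = 5, where z jumps

A regression tree applies this search to every predictor, keeps the best split overall, and then recurses within each of the two resulting regions.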

Numerical example:

Regression trees with Python

Simple example: one feature


In [1]:
# Import the plotting library and set a custom color cycle

import matplotlib as mpl
from cycler import cycler
mpl.rcParams['axes.prop_cycle'] = cycler(color=['#7FB5C7', '#E63E65', '#5B5BC9', '#55D957'])
%matplotlib inline



In [2]:
# Import the data manipulation library and read the training set into a dataframe

import pandas as pd
bikes = pd.read_csv('data/bikes.csv')
bikes.head()


Out[2]:
         date  temperature   humidity  windspeed  count
0  2011-01-03     2.716070  45.715346  21.414957    120
1  2011-01-04     2.896673  54.267219  15.136882    108
2  2011-01-05     4.235654  45.697702  17.034578     82
3  2011-01-06     3.112643  50.237349  10.091568     88
4  2011-01-07     2.723918  49.144928  15.738204    148

In [3]:
# Plot the data

from matplotlib import pyplot as plt

plt.figure(figsize=(8,6))
plt.plot(bikes['temperature'], bikes['count'], 'o')
plt.xlabel('temperature')
plt.ylabel('bikes')
plt.show()



In [4]:
# Import the DecisionTreeRegressor and train it!

from sklearn.tree import DecisionTreeRegressor
import numpy as np

regressor = DecisionTreeRegressor(max_depth=2)
regressor.fit(np.array([bikes['temperature']]).T, bikes['count'])  # reshape the single feature into an (n_samples, 1) array


Out[4]:
DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')

In [5]:
regressor.predict([[5.]])  # predict expects a 2D array of samples


Out[5]:
array([ 189.23183761])

In [6]:
regressor.predict([[20.]])


Out[6]:
array([ 769.08756039])

In [7]:
# plot the fit

xx = np.array([np.linspace(-5, 40, 100)]).T

plt.figure(figsize=(8,6))
plt.plot(bikes['temperature'], bikes['count'], 'o', label='observation')
plt.plot(xx, regressor.predict(xx), linewidth=4, alpha=.7, label='prediction')
plt.xlabel('temperature')
plt.ylabel('bikes')
plt.legend()
plt.show()


Simple example: two features


In [8]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
import numpy as np

from matplotlib import pyplot as plt
%matplotlib inline

bikes = pd.read_csv('data/bikes.csv')
regressor = DecisionTreeRegressor(max_depth=2)
regressor.fit(bikes[['temperature', 'humidity']], bikes['count'])


Out[8]:
DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')

In [9]:
nx = 30
ny = 30
# creating a grid of points
x_temperature = np.linspace(-5, 40, nx)
y_humidity = np.linspace(20, 80, ny)
xx, yy = np.meshgrid(x_temperature, y_humidity)
# evaluating the regressor at every grid point
z_bikes = regressor.predict(np.array([xx.flatten(), yy.flatten()]).T)
zz = np.reshape(z_bikes, (ny, nx))  # meshgrid arrays have shape (ny, nx)


fig = plt.figure(figsize=(8, 8))
# plotting the predictions
plt.pcolormesh(x_temperature, y_humidity, zz, cmap=plt.cm.YlOrRd)
plt.colorbar(label='bikes predicted') # add a colorbar on the right
# plotting also the observations
plt.scatter(bikes['temperature'], bikes['humidity'], s=bikes['count']/25.0, c='g')
# setting the limit for each axis
plt.xlim(np.min(x_temperature), np.max(x_temperature))
plt.ylim(np.min(y_humidity), np.max(y_humidity))
plt.xlabel('temperature')
plt.ylabel('humidity')
plt.show()




In [10]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(bikes['count'], regressor.predict(bikes[['temperature','humidity']]))


Out[10]:
181.28165652686295

In [11]:
from sklearn.model_selection import cross_val_score
scores = -cross_val_score(regressor, bikes[['temperature', 'humidity']],
                          bikes['count'], scoring='neg_mean_absolute_error', cv=10)



In [12]:
scores.mean()


Out[12]:
224.66188344455881

In [13]:
regressor2 = DecisionTreeRegressor(max_depth=100)
regressor2.fit(bikes[['temperature', 'humidity']], bikes['count'])


Out[13]:
DecisionTreeRegressor(criterion='mse', max_depth=100, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')

In [14]:
mean_absolute_error(bikes['count'], regressor2.predict(bikes[['temperature','humidity']]))


Out[14]:
0.0

In [15]:
scores = -cross_val_score(regressor2, bikes[['temperature', 'humidity']],
                          bikes['count'], scoring='neg_mean_absolute_error', cv=10)
print(scores.mean())


240.125328196

Random forest regressors

Random forest regressors work much like random forest classifiers, except that we average the outputs of the individual trees instead of taking a majority vote.

1- Draw $n_{trees}$ bootstrap samples from the original data.

2- For each of the bootstrap samples, grow an unpruned classification or regression tree, with the following modification: at each node, rather than choosing the best split among all predictors, randomly sample $m_{try}$ of the predictors and choose the best split from among those variables.

3- Predict new data by aggregating the predictions of the $n_{trees}$ trees (i.e., majority vote for classification, average for regression); see the sketch below.
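
The three steps above can be spelled out with scikit-learn's DecisionTreeRegressor as the base learner. This is only an illustrative sketch (the function forest_fit_predict and its defaults are made up here); the RandomForestRegressor used in the next cell implements the same idea, with max_features playing the role of $m_{try}$.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def forest_fit_predict(X, y, X_new, n_trees=100, m_try=1, seed=0):
    """Minimal random-forest regressor: bootstrap + feature subsampling + averaging."""
    rng = np.random.RandomState(seed)
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    X_new = np.asarray(X_new, dtype=float)
    predictions = np.zeros((n_trees, len(X_new)))
    for t in range(n_trees):
        # 1- draw a bootstrap sample of the training rows
        idx = rng.randint(0, len(X), size=len(X))
        # 2- grow an unpruned tree, considering only m_try predictors at each split
        tree = DecisionTreeRegressor(max_features=m_try, random_state=rng)
        tree.fit(X[idx], y[idx])
        predictions[t] = tree.predict(X_new)
    # 3- aggregate by averaging the trees' predictions
    return predictions.mean(axis=0)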


In [29]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
import numpy as np

from matplotlib import pyplot as plt
%matplotlib inline

bikes = pd.read_csv('data/bikes.csv')
regressor = RandomForestRegressor(max_depth=100, n_estimators=1000)
regressor.fit(bikes[['temperature', 'humidity']], bikes['count'])


Out[29]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=100,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=1000, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

In [30]:
nx = 30
ny = 30
# creating a grid of points
x_temperature = np.linspace(-5, 40, nx)
y_humidity = np.linspace(20, 80, ny)
xx, yy = np.meshgrid(x_temperature, y_humidity)
# evaluating the regressor at every grid point
z_bikes = regressor.predict(np.array([xx.flatten(), yy.flatten()]).T)
zz = np.reshape(z_bikes, (ny, nx))  # meshgrid arrays have shape (ny, nx)


fig = plt.figure(figsize=(8, 8))
# plotting the predictions
plt.pcolormesh(x_temperature, y_humidity, zz, cmap=plt.cm.YlOrRd)
plt.colorbar(label='bikes predicted') # add a colorbar on the right
# plotting also the observations
plt.scatter(bikes['temperature'], bikes['humidity'], s=bikes['count']/25.0, c='g')
# setting the limit for each axis
plt.xlim(np.min(x_temperature), np.max(x_temperature))
plt.ylim(np.min(y_humidity), np.max(y_humidity))
plt.xlabel('temperature')
plt.ylabel('humidity')
plt.show()




In [31]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(bikes['count'], regressor.predict(bikes[['temperature','humidity']]))


Out[31]:
62.988285436671205

In [32]:
from sklearn.model_selection import cross_val_score
scores = -cross_val_score(regressor, bikes[['temperature', 'humidity']],
                          bikes['count'], scoring='neg_mean_absolute_error', cv=10)

In [33]:
scores.mean()


Out[33]:
207.706182619863

Gradient boosters

Gradient Boosting:

– Use a simple regression model to start

– Subsequent models predict the error residual of the previous predictions

– The overall prediction is a weighted sum of the collection of models (see the sketch below)
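
The bullets above can be turned into a short sketch for squared-error loss, where "predict the residual" and "follow the negative gradient" coincide. The function boost_fit_predict below is made up for illustration and is not the exact algorithm inside scikit-learn's GradientBoostingRegressor, which generalizes this to other loss functions.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_fit_predict(X, y, X_new, n_estimators=100, learning_rate=0.1, max_depth=2):
    """Least-squares boosting: each tree is fit to the residuals of the running model."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    X_new = np.asarray(X_new, dtype=float)
    current = np.full(len(y), y.mean())          # start from a simple model: the mean
    prediction = np.full(len(X_new), y.mean())
    for _ in range(n_estimators):
        residual = y - current                   # what the model still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)                    # the next model predicts the residual
        current += learning_rate * tree.predict(X)          # weighted (shrunken) sum ...
        prediction += learning_rate * tree.predict(X_new)   # ... of the collection
    return prediction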


In [21]:
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor  #GBM algorithm
import numpy as np

from matplotlib import pyplot as plt
%matplotlib inline

bikes = pd.read_csv('data/bikes.csv')
regressor = GradientBoostingRegressor(max_depth=2, n_estimators=1000, learning_rate=0.05)
regressor.fit(bikes[['temperature', 'humidity']], bikes['count'])


Out[21]:
GradientBoostingRegressor(alpha=0.9, init=None, learning_rate=0.05, loss='ls',
             max_depth=2, max_features=None, max_leaf_nodes=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=1000,
             presort='auto', random_state=None, subsample=1.0, verbose=0,
             warm_start=False)

In [22]:
nx = 30
ny = 30
# creating a grid of points
x_temperature = np.linspace(-5, 40, nx)
y_humidity = np.linspace(20, 80, ny)
xx, yy = np.meshgrid(x_temperature, y_humidity)
# evaluating the regressor at every grid point
z_bikes = regressor.predict(np.array([xx.flatten(), yy.flatten()]).T)
zz = np.reshape(z_bikes, (ny, nx))  # meshgrid arrays have shape (ny, nx)


fig = plt.figure(figsize=(8, 8))
# plotting the predictions
plt.pcolormesh(x_temperature, y_humidity, zz, cmap=plt.cm.YlOrRd)
plt.colorbar(label='bikes predicted') # add a colorbar on the right
# plotting also the observations
plt.scatter(bikes['temperature'], bikes['humidity'], s=bikes['count']/25.0, c='g')
# setting the limit for each axis
plt.xlim(np.min(x_temperature), np.max(x_temperature))
plt.ylim(np.min(y_humidity), np.max(y_humidity))
plt.xlabel('temperature')
plt.ylabel('humidity')
plt.show()




In [23]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(bikes['count'], regressor.predict(bikes[['temperature','humidity']]))


Out[23]:
104.58756410230531

In [24]:
from sklearn.model_selection import cross_val_score
scores = -cross_val_score(regressor, bikes[['temperature', 'humidity']],
                          bikes['count'], scoring='neg_mean_absolute_error', cv=10)


scores.mean()


Out[24]:
203.68906484711539

SVM regression


In [25]:
import pandas as pd
from sklearn.svm import SVR
import numpy as np

from matplotlib import pyplot as plt
%matplotlib inline

bikes = pd.read_csv('data/bikes.csv')
regressor = SVR()
regressor.fit(bikes[['temperature', 'humidity']], bikes['count'])


Out[25]:
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [26]:
nx = 30
ny = 30
# creating a grid of points
x_temperature = np.linspace(-5, 40, nx)
y_humidity = np.linspace(20, 80, ny)
xx, yy = np.meshgrid(x_temperature, y_humidity)
# evaluating the regressor at every grid point
z_bikes = regressor.predict(np.array([xx.flatten(), yy.flatten()]).T)
zz = np.reshape(z_bikes, (ny, nx))  # meshgrid arrays have shape (ny, nx)


fig = plt.figure(figsize=(8, 8))
# plotting the predictions
plt.pcolormesh(x_temperature, y_humidity, zz, cmap=plt.cm.YlOrRd)
plt.colorbar(label='bikes predicted') # add a colorbar on the right
# plotting also the observations
plt.scatter(bikes['temperature'], bikes['humidity'], s=bikes['count']/25.0, c='g')
# setting the limit for each axis
plt.xlim(np.min(x_temperature), np.max(x_temperature))
plt.ylim(np.min(y_humidity), np.max(y_humidity))
plt.xlabel('temperature')
plt.ylabel('humidity')
plt.show()




In [27]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(bikes['count'], regressor.predict(bikes[['temperature','humidity']]))


Out[27]:
321.1636207694396

In [28]:
from sklearn.model_selection import cross_val_score
scores = -cross_val_score(regressor, bikes[['temperature', 'humidity']],
                          bikes['count'], scoring='neg_mean_absolute_error', cv=10)


scores.mean()


Out[28]:
363.74462717482004
