Compiled by Mohamad Ali-Dib
Based on:
http://web.as.uky.edu/statistics/users/pbreheny/764-F11/notes/11-3.pdf
https://blog.cambridgecoding.com/2016/01/03/getting-started-with-regression-and-decision-trees/
How do regression trees work?
1- Consider a regression problem with a continuous response z and two predictors x and y
2- We begin by splitting the space into two regions using a rule of the form $x \leq s$ or $y \leq s$, and modeling the response by the mean of z within each region
3- The optimal split (in terms of reducing the residual sum of squares) is found by searching over all predictors (x and y) and all possible split points s (a small sketch of this search follows the list)
4- The process is then repeated in a recursive fashion for each of the two sub-regions
5- This process continues until some stopping rule is applied
6- For example, letting $\{R_m\}$ denote the collection of rectangular partitions, we might continue partitioning until $|\{R_m\}| = 10$
7- The end result is a piecewise constant model over the partition $\{R_m\}$ of the form:
$$f(x,y)=\sum_{m} c_m \, I\big((x,y) \in R_m\big)$$
where $I$ is the indicator function and $c_m$ is the constant for the $m^{th}$ region (i.e., the mean of $z_i$ over the observations with $(x,y) \in R_m$)
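To make the split search in steps 2–4 concrete, here is a minimal NumPy sketch (not part of the original notes; the toy data are invented for illustration) that finds the single best split point s for one predictor by minimizing the residual sum of squares:

import numpy as np

def best_split(x, z):
    # exhaustive search over candidate split points, minimizing the RSS of the two region means
    best_s, best_rss = None, np.inf
    for s in np.unique(x)[:-1]:  # splitting at the maximum would leave one region empty
        left, right = z[x <= s], z[x > s]
        rss = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if rss < best_rss:
            best_s, best_rss = s, rss
    return best_s, best_rss

# toy data whose mean jumps at x = 0
rng = np.random.RandomState(0)
x = rng.uniform(-5, 5, 200)
z = np.where(x <= 0, 10.0, 30.0) + rng.normal(0, 2, 200)
print(best_split(x, z))  # the recovered split point should land near 0

In the full algorithm this search runs over every predictor, the winning split is applied, and the procedure recurses into each sub-region.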
Numerical example:
In [1]:
# Import the plotting library
import matplotlib as mpl
# note: 'axes.color_cycle' was removed from matplotlib; 'axes.prop_cycle' is the modern equivalent
mpl.rcParams['axes.prop_cycle'] = mpl.cycler(color=['#7FB5C7', '#E63E65', '#5B5BC9', '#55D957'])
%matplotlib inline
In [2]:
# Import the data manipulation library and read the training set into a dataframe
import pandas as pd
bikes = pd.read_csv('data/bikes.csv')
bikes.head()
Out[2]:
In [3]:
# Plot the data
from matplotlib import pyplot as plt
plt.figure(figsize=(8,6))
plt.plot(bikes['temperature'], bikes['count'], 'o')
plt.xlabel('temperature')
plt.ylabel('bikes')
plt.show()
In [4]:
# Import the DecisionTreeRegressor and train it!
from sklearn.tree import DecisionTreeRegressor
import numpy as np
regressor = DecisionTreeRegressor(max_depth=2)
regressor.fit(bikes[['temperature']], bikes['count'])  # a single-column dataframe keeps the input 2-D
Out[4]:
In [5]:
regressor.predict([[5.0]])  # predict expects a 2-D array of samples
Out[5]:
In [6]:
regressor.predict([[20.0]])
Out[6]:
In [7]:
# plot the fit
xx = np.array([np.linspace(-5, 40, 100)]).T
plt.figure(figsize=(8,6))
plt.plot(bikes['temperature'], bikes['count'], 'o', label='observation')
plt.plot(xx, regressor.predict(xx), linewidth=4, alpha=.7, label='prediction')
plt.xlabel('temperature')
plt.ylabel('bikes')
plt.legend()
plt.show()
In [8]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
bikes = pd.read_csv('data/bikes.csv')
regressor = DecisionTreeRegressor(max_depth=2)
regressor.fit(bikes[['temperature', 'humidity']], bikes['count'])
Out[8]:
In [9]:
nx = 30
ny = 30
# creating a grid of points
x_temperature = np.linspace(-5, 40, nx)
y_humidity = np.linspace(20, 80, ny)
xx, yy = np.meshgrid(x_temperature, y_humidity)
# evaluate the regressor on every grid point
z_bikes = regressor.predict(np.array([xx.flatten(), yy.flatten()]).T)
zz = np.reshape(z_bikes, (ny, nx))  # meshgrid arrays have shape (ny, nx)
fig = plt.figure(figsize=(8, 8))
# plotting the predictions
plt.pcolormesh(x_temperature, y_humidity, zz, cmap=plt.cm.YlOrRd)
plt.colorbar(label='bikes predicted') # add a colorbar on the right
# plotting also the observations
plt.scatter(bikes['temperature'], bikes['humidity'], s=bikes['count']/25.0, c='g')
# setting the limit for each axis
plt.xlim(np.min(x_temperature), np.max(x_temperature))
plt.ylim(np.min(y_humidity), np.max(y_humidity))
plt.xlabel('temperature')
plt.ylabel('humidity')
plt.show()
In [10]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(bikes['count'], regressor.predict(bikes[['temperature','humidity']]))
Out[10]:
In [11]:
from sklearn.model_selection import cross_val_score  # 'sklearn.cross_validation' was removed in later scikit-learn releases
scores = -cross_val_score(regressor, bikes[['temperature', 'humidity']],
                          bikes['count'], scoring='neg_mean_absolute_error', cv=10)
In [12]:
scores.mean()
Out[12]:
In [13]:
regressor2 = DecisionTreeRegressor(max_depth=100)
regressor2.fit(bikes[['temperature', 'humidity']], bikes['count'])
Out[13]:
In [14]:
mean_absolute_error(bikes['count'], regressor2.predict(bikes[['temperature','humidity']]))
Out[14]:
In [15]:
scores = -cross_val_score(regressor2, bikes[['temperature', 'humidity']],
bikes['count'], scoring='mean_absolute_error', cv=10)
print(scores.mean())
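Note the comparison: the depth-100 tree drives the in-sample MAE close to zero, yet its cross-validated error is typically no better, and often worse, than the shallow tree's. That gap is the hallmark of overfitting.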
Random Forest Regressors:
Random Forest Regressors (RFRs) work like Random Forest Classifiers, except that we average the outputs of the individual trees instead of taking a majority vote. The procedure (illustrated by a toy sketch after this list) is:
1- Draw $n_{trees}$ bootstrap samples from the original data.
2- For each of the bootstrap samples, grow an unpruned classification or regression tree, with the following modification: at each node, rather than choosing the best split among all predictors, randomly sample $m_{try}$ of the predictors and choose the best split from among those variables.
3- Predict new data by aggregating the predictions of $n_{trees}$ trees (i.e., majority votes for classification, average for regression).
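As a rough illustration of steps 1–3, here is a minimal sketch (not a reproduction of scikit-learn's RandomForestRegressor internals; it assumes the per-node $m_{try}$ sampling can be delegated to the tree's max_features parameter, which performs exactly that per-split subsampling):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_forest_fit(X, y, n_trees=50, m_try=1, seed=0):
    # step 1: bootstrap samples; step 2: unpruned trees splitting on m_try features per node
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    rng = np.random.RandomState(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.randint(0, len(X), len(X))              # sample rows with replacement
        tree = DecisionTreeRegressor(max_features=m_try)  # no max_depth, so the tree is unpruned
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def toy_forest_predict(trees, X):
    # step 3: aggregate by averaging the trees' predictions (regression)
    return np.mean([tree.predict(X) for tree in trees], axis=0)

In practice we simply use RandomForestRegressor, as in the cells below.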
In [29]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
bikes = pd.read_csv('data/bikes.csv')
regressor = RandomForestRegressor(max_depth=100, n_estimators=1000)
regressor.fit(bikes[['temperature', 'humidity']], bikes['count'])
Out[29]:
In [30]:
nx = 30
ny = 30
# creating a grid of points
x_temperature = np.linspace(-5, 40, nx)
y_humidity = np.linspace(20, 80, ny)
xx, yy = np.meshgrid(x_temperature, y_humidity)
# evaluate the regressor on every grid point
z_bikes = regressor.predict(np.array([xx.flatten(), yy.flatten()]).T)
zz = np.reshape(z_bikes, (ny, nx))
fig = plt.figure(figsize=(8, 8))
# plotting the predictions
plt.pcolormesh(x_temperature, y_humidity, zz, cmap=plt.cm.YlOrRd)
plt.colorbar(label='bikes predicted') # add a colorbar on the right
# plotting also the observations
plt.scatter(bikes['temperature'], bikes['humidity'], s=bikes['count']/25.0, c='g')
# setting the limit for each axis
plt.xlim(np.min(x_temperature), np.max(x_temperature))
plt.ylim(np.min(y_humidity), np.max(y_humidity))
plt.xlabel('temperature')
plt.ylabel('humidity')
plt.show()
In [31]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(bikes['count'], regressor.predict(bikes[['temperature','humidity']]))
Out[31]:
In [32]:
from sklearn.model_selection import cross_val_score
scores = -cross_val_score(regressor, bikes[['temperature', 'humidity']],
                          bikes['count'], scoring='neg_mean_absolute_error', cv=10)
In [33]:
scores.mean()
Out[33]:
Gradient Boosting:
– Start with a simple regression model
– Each subsequent model is fit to the residual errors of the predictions made so far
– The overall prediction is a weighted sum of the whole collection (a toy sketch follows this list)
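To make the residual-fitting idea explicit, here is a minimal sketch for squared loss (invented for illustration, not GradientBoostingRegressor's actual implementation); a constant learning_rate plays the role of the weights in the weighted sum:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_boost_fit(X, y, n_rounds=100, learning_rate=0.05, max_depth=2):
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    init = y.mean()                 # the simple starting model: a constant
    pred = np.full(len(y), init)
    trees = []
    for _ in range(n_rounds):
        residual = y - pred         # for squared loss this is also the negative gradient
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)       # the next model predicts the current residuals
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    return init, trees

def toy_boost_predict(init, trees, X, learning_rate=0.05):
    # overall prediction: the constant start plus the weighted sum of the trees
    return init + learning_rate * np.sum([tree.predict(X) for tree in trees], axis=0)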
In [21]:
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor #GBM algorithm
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
bikes = pd.read_csv('data/bikes.csv')
regressor = GradientBoostingRegressor(max_depth=2, n_estimators=1000, learning_rate=0.05)
regressor.fit(bikes[['temperature', 'humidity']], bikes['count'])
Out[21]:
In [22]:
nx = 30
ny = 30
# creating a grid of points
x_temperature = np.linspace(-5, 40, nx)
y_humidity = np.linspace(20, 80, ny)
xx, yy = np.meshgrid(x_temperature, y_humidity)
# evaluate the regressor on every grid point
z_bikes = regressor.predict(np.array([xx.flatten(), yy.flatten()]).T)
zz = np.reshape(z_bikes, (ny, nx))
fig = plt.figure(figsize=(8, 8))
# plotting the predictions
plt.pcolormesh(x_temperature, y_humidity, zz, cmap=plt.cm.YlOrRd)
plt.colorbar(label='bikes predicted') # add a colorbar on the right
# plotting also the observations
plt.scatter(bikes['temperature'], bikes['humidity'], s=bikes['count']/25.0, c='g')
# setting the limit for each axis
plt.xlim(np.min(x_temperature), np.max(x_temperature))
plt.ylim(np.min(y_humidity), np.max(y_humidity))
plt.xlabel('temperature')
plt.ylabel('humidity')
plt.show()
In [23]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(bikes['count'], regressor.predict(bikes[['temperature','humidity']]))
Out[23]:
In [24]:
from sklearn.model_selection import cross_val_score
scores = -cross_val_score(regressor, bikes[['temperature', 'humidity']],
                          bikes['count'], scoring='neg_mean_absolute_error', cv=10)
scores.mean()
Out[24]:
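Support Vector Regression:
– For comparison, we also fit a Support Vector Regressor (SVR) with its default settings (an RBF kernel). Kernel SVR is sensitive to feature scaling, so this out-of-the-box fit should be read as a rough baseline.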
In [25]:
import pandas as pd
from sklearn.svm import SVR
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
bikes = pd.read_csv('data/bikes.csv')
regressor = SVR()
regressor.fit(bikes[['temperature', 'humidity']], bikes['count'])
Out[25]:
In [26]:
nx = 30
ny = 30
# creating a grid of points
x_temperature = np.linspace(-5, 40, nx)
y_humidity = np.linspace(20, 80, ny)
xx, yy = np.meshgrid(x_temperature, y_humidity)
# evaluate the regressor on every grid point
z_bikes = regressor.predict(np.array([xx.flatten(), yy.flatten()]).T)
zz = np.reshape(z_bikes, (ny, nx))
fig = plt.figure(figsize=(8, 8))
# plotting the predictions
plt.pcolormesh(x_temperature, y_humidity, zz, cmap=plt.cm.YlOrRd)
plt.colorbar(label='bikes predicted') # add a colorbar on the right
# plotting also the observations
plt.scatter(bikes['temperature'], bikes['humidity'], s=bikes['count']/25.0, c='g')
# setting the limit for each axis
plt.xlim(np.min(x_temperature), np.max(x_temperature))
plt.ylim(np.min(y_humidity), np.max(y_humidity))
plt.xlabel('temperature')
plt.ylabel('humidity')
plt.show()
In [27]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(bikes['count'], regressor.predict(bikes[['temperature','humidity']]))
Out[27]:
In [28]:
from sklearn.model_selection import cross_val_score
scores = -cross_val_score(regressor, bikes[['temperature', 'humidity']],
                          bikes['count'], scoring='neg_mean_absolute_error', cv=10)
scores.mean()
Out[28]: