This lab focuses on data modelling using decision tree and random forest regression. It's a direct counterpart to the linear regression modelling in Lab 06. At the end of the lab, you should be able to use scikit-learn to:
- Create decision tree and random forest regression models.
- Use a grid search with cross validation to select model hyperparameters.
- Measure the accuracy of the selected model using cross validation.
Let's start by importing the packages we'll need. As usual, we'll import pandas for exploratory analysis, but this week we're also going to use the tree subpackage from scikit-learn to create decision tree models and the ensemble subpackage to create random forest models.
In [ ]:
%matplotlib inline
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, KFold, cross_val_predict
Next, let's load the data. This week, we're going to load the Auto MPG data set, which is available online at the UC Irvine Machine Learning Repository. The dataset is in fixed width format, but fortunately this is supported out of the box by pandas' read_fwf function:
In [ ]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
df = pd.read_fwf(url, header=None, names=['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
                                          'acceleration', 'model year', 'origin', 'car name'])
According to its documentation, the Auto MPG dataset consists of eight explanatory variables (i.e. features), each describing a single car model, which are related to the given target variable: the number of miles per gallon (MPG) of fuel of the given car. The following attribute information is given:
1. mpg: continuous (the target variable)
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)
Let's start by taking a quick peek at the data:
In [ ]:
df.head()
As the car name is unique for each instance (according to the dataset documentation), it cannot be used to predict the MPG by itself, so let's drop it as a feature and use it as the index instead:
Note: It seems plausible that MPG efficiency might vary from manufacturer to manufacturer, so we could generate a new feature by converting the car names into manufacturer names (see the sketch after the next cell), but for simplicity let's just drop them here.
In [ ]:
df = df.set_index('car name')
df.head()
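As an aside, here's a minimal sketch of how such a manufacturer feature could be derived, assuming the first word of each car name is the manufacturer name (an assumption; the raw names also contain spelling variants that would need cleaning before real use):
In [ ]:
# Hypothetical sketch: derive a manufacturer column from the car name index.
# Assumes the first word of each name is the manufacturer, which isn't always clean.
manufacturer = df.index.to_series().str.split().str[0]
manufacturer.value_counts().head()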
According to the documentation, the horsepower column contains a small number of missing values, each of which is denoted by the string '?'. Again, for simplicity, let's just drop these from the data set:
In [ ]:
df = df[df['horsepower'] != '?']
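If you want to confirm that the drop worked, a quick sanity check (not part of the original lab) shows how many rows remain and that no '?' placeholders are left:
In [ ]:
# Sanity check: no '?' placeholders should remain in the horsepower column
print(df.shape)
print((df['horsepower'] == '?').sum())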
Usually, pandas is smart enough to recognise that a column is numeric and will convert it to the appropriate data type automatically. However, in this case, because there were strings present initially, the data type of the horsepower column isn't numeric:
In [ ]:
df.dtypes
We can correct this by converting the column values to numbers manually, using pandas' to_numeric function:
In [ ]:
df['horsepower'] = pd.to_numeric(df['horsepower'])
# Check the data types again
df.dtypes
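For reference, an alternative to filtering out the '?' rows beforehand is to let to_numeric turn any non-numeric entries into missing values (NaN) and then drop those rows. The sketch below illustrates the idea; it's a no-op here because we've already cleaned the column:
In [ ]:
# Reference sketch: coerce non-numeric entries (such as '?') to NaN, then count them.
# A no-op at this point, since the '?' rows were already removed above.
cleaned = pd.to_numeric(df['horsepower'], errors='coerce')
print('Rows that would be dropped:', cleaned.isna().sum())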
As can be seen, the data type of the horsepower column is now float64, i.e. a 64-bit floating point value.
According to the documentation, the origin variable is categorical (i.e. origin = 1 is not "less than" origin = 2), and so we should encode it via one hot encoding so that our model can make sense of it. This is easy with pandas: all we need to do is use the get_dummies function, as follows:
In [ ]:
df = pd.get_dummies(df, columns=['origin'])
df.head()
As can be seen, one hot encoding converts the origin column into separate binary columns, each representing the presence or absence of the given category. Because we're going to use a decision tree regression model, we don't need to worry about the effects of multicollinearity, and so there's no need to drop one of the encoded variable columns as we did in the case of linear regression (a sketch of what that would look like is shown below, purely for reference).
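For reference only, if we did need to drop the redundant column (e.g. for a linear model), pandas supports this via the drop_first argument of get_dummies. The sketch below is commented out because it assumes a DataFrame whose origin column has not yet been encoded:
In [ ]:
# Reference sketch only: drop the first encoded column to avoid redundancy.
# Do not run on the current df, whose 'origin' column is already encoded.
# df = pd.get_dummies(df, columns=['origin'], drop_first=True)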
Next, let's take a look at the distribution of the variables in the data frame. We can start by computing some descriptive statistics:
In [ ]:
df.describe()
Next, let's print a matrix of pairwise Pearson correlation values:
In [ ]:
df.corr()
Let's also create a scatter plot matrix:
In [ ]:
pd.plotting.scatter_matrix(df, s=50, hist_kwds={'bins': 10}, figsize=(16, 16));
Based on the above information, we can conclude the following:
- The mpg target is strongly (negatively) correlated with several of the features, in particular cylinders, displacement, horsepower and weight.
- Several of those features are also strongly correlated with one another, i.e. there is a degree of collinearity in the data.
- Some of the relationships between mpg and the features appear to be non-linear.
For now, we'll just note this information, but we'll come back to it later when improving our model.
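To make the correlation pattern noted above easier to see, we can sort the correlations of each feature with the target (a small convenience sketch, not part of the original lab):
In [ ]:
# Correlation of each feature with the target, sorted for readability
df.corr()['mpg'].drop('mpg').sort_values()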
Let's build a decision tree regression model to predict the MPG of a car based on its other attributes. scikit-learn supports decision tree functionality via the tree subpackage. This subpackage supports both decision tree regression and classification. We can use the DecisionTreeRegressor class to build our model.
DecisionTreeRegressor accepts a number of different hyperparameters and the model we build may be more or less accurate depending on their values. We can get a list of these modelling parameters using the get_params method of the estimator (this works on any scikit-learn estimator), like this:
In [ ]:
DecisionTreeRegressor().get_params()
You can find a more detailed description of each parameter in the scikit-learn documentation.
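Hyperparameters are set by passing them to the estimator's constructor (or via set_params). For example, a sketch with illustrative, untuned values:
In [ ]:
# Illustrative only: construct a tree with specific hyperparameter values
DecisionTreeRegressor(max_depth=3, min_samples_leaf=5).get_params()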
Let's use a grid search to select the optimal decision tree regression model from a set of candidates. First, we define the parameter grid. Then, we can use a grid search to select the best model via an inner cross validation and an outer cross validation to measure the accuracy of the selected model.
In [ ]:
X = df.drop('mpg', axis='columns') # X = features
y = df['mpg'] # y = prediction target
algorithm = DecisionTreeRegressor(random_state=0)
# Build models for different values of min_samples_leaf and min_samples_split
parameters = {
    'min_samples_leaf': [1, 10, 20],
    'min_samples_split': [2, 10, 20]  # Min value is 2
}
# Use inner CV to select the best model
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0) # K = 5
clf = GridSearchCV(algorithm, parameters, cv=inner_cv, n_jobs=-1) # n_jobs=-1 uses all available CPUs = faster
clf.fit(X, y)
# Use outer CV to evaluate the error of the best model
outer_cv = KFold(n_splits=10, shuffle=True, random_state=0) # K = 10, doesn't have to be the same
y_pred = cross_val_predict(clf, X, y, cv=outer_cv)
# Print the results
print('Mean absolute error: %f' % mean_absolute_error(y, y_pred))
print('Standard deviation of the error: %f' % (y - y_pred).std())
ax = (y - y_pred).hist()
ax.set(
    title='Distribution of errors for the decision tree regression model',
    xlabel='Error'
);
Our decision tree regression model predicts the MPG with an average error of approximately ±2.32 with a standard deviation of 3.16, which is similar to our final linear regression model from Lab 06. It's also worth noting that we were able to achieve this level of accuracy with very little feature engineering effort. This is because decision tree regression does not rely on the same set of assumptions (e.g. linearity) as linear regression, and so is able to learn from data with less manual tuning.
We can check the parameters that led to the best model via the best_params_ attribute of the output of our grid search, as follows:
In [ ]:
clf.best_params_
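We can also dig a little deeper into the fitted search object. The sketch below (using the standard cv_results_ and best_estimator_ attributes of GridSearchCV, shown here as an optional extra) lists the mean inner CV score of each candidate and the relative feature importances of the selected tree:
In [ ]:
# Mean inner CV score for each hyperparameter combination tried by the grid search
print(pd.DataFrame(clf.cv_results_)[['params', 'mean_test_score', 'rank_test_score']])
# Relative feature importances of the selected (refitted) tree
pd.Series(clf.best_estimator_.feature_importances_, index=X.columns).sort_values(ascending=False)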
Next, let's build a random forest regression model to predict the car MPGs to see if we can improve on our decision tree model. Random forests are ensemble models, i.e. they are a collection of different decision trees, each of which is trained on a random subset of the data. By combining trees with different characteristics, it's possible to form an overall model that can utilise the benefits of each, which often produces better results than using a single tree to model all the data. scikit-learn supports ensemble model functionality via the ensemble subpackage. This subpackage supports both random forest regression and classification. We can use the RandomForestRegressor class to build our model.
RandomForestRegressor accepts a number of different hyperparameters and the model we build may be more or less accurate depending on their values. We can get a list of these modelling parameters using the get_params method of the estimator (this works on any scikit-learn estimator), like this:
In [ ]:
RandomForestRegressor().get_params()
As before, you can find a more detailed description of each parameter in the scikit-learn documentation.
Let's use a grid search to select the optimal random forest regression model from a set of candidates. First, we define the parameter grid. Then, we can use a grid search to select the best model via an inner cross validation and an outer cross validation to measure the accuracy of the selected model.
In [ ]:
X = df.drop('mpg', axis='columns') # X = features
y = df['mpg'] # y = prediction target
algorithm = RandomForestRegressor(random_state=0)
# Build models for different values of n_estimators, min_samples_leaf and min_samples_split
parameters = {
    'n_estimators': [2, 5, 10],
    'min_samples_leaf': [1, 10, 20],
    'min_samples_split': [2, 10, 20]  # Min value is 2
}
# Use inner CV to select the best model
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0) # K = 5
clf = GridSearchCV(algorithm, parameters, cv=inner_cv, n_jobs=-1) # n_jobs=-1 uses all available CPUs = faster
clf.fit(X, y)
# Use outer CV to evaluate the error of the best model
outer_cv = KFold(n_splits=10, shuffle=True, random_state=0) # K = 10, doesn't have to be the same
y_pred = cross_val_predict(clf, X, y, cv=outer_cv)
# Print the results
print('Mean absolute error: %f' % mean_absolute_error(y, y_pred))
print('Standard deviation of the error: %f' % (y - y_pred).std())
ax = (y - y_pred).hist()
ax.set(
    title='Distribution of errors for the random forest regression model',
    xlabel='Error'
);
As can be seen, our random forest regression model significantly outperforms our previous decision tree model as well as our linear regression model from Lab 06. Further improvements can be made by expanding the ranges of the parameter grid values or introducing further hyperparameters (e.g. impurity measures, stopping criteria); a sketch of one possible expanded grid is shown below.
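As with the decision tree, we can check which combination of hyperparameters the grid search selected:
In [ ]:
clf.best_params_
For example, one possible expanded grid might look like the following. The values here are illustrative rather than tuned recommendations, and a larger grid will take correspondingly longer to search:
In [ ]:
# Illustrative expanded grid: more trees, plus depth and feature-subset controls.
# This would be passed to GridSearchCV exactly as before (expect a longer run time).
parameters = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 5, 10],
    'max_features': [1.0, 'sqrt'],
    'min_samples_leaf': [1, 5, 10],
    'min_samples_split': [2, 5, 10]
}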