Model ensembling is a class of techniques for aggregating multiple predictive algorithms into a sort of mega-algorithm, which can often improve accuracy and reduce overfitting. Ensembling approaches often work surprisingly well, and many winners of competitive data science competitions use model ensembling in one form or another. In this tutorial, we will take you through the steps of building your own ensemble of a random forest, support vector machine, and neural network for a classification problem. We'll be working on the famous spam dataset, trying to predict whether a certain email is spam or not, using the standard Python machine learning stack (scikit-learn/numpy/pandas).
You have probably already encountered several uses of model ensembling. Random forests are a type of ensemble algorithm that aggregates together many individual tree base learners. If you’re interested in deep learning, one common technique for improving classification accuracies is training different networks and getting them to vote on classifications for test instances (look at dropout for a related but wacky take on ensembling). If you’re familiar with bagging or boosting algorithms, these are very explicit examples of ensembling.
Regardless of the specifics, the general idea behind ensembling is this: different classes of algorithms (or differently parameterized versions of the same type of algorithm) might be good at picking up on different signals in the dataset. Combining them means that you can model the data better, leading to better predictions. Furthermore, different algorithms might be overfitting to the data in various ways, but by combining them, you can effectively average away some of this overfitting.
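To get a feel for why voting helps, here's a toy calculation (hypothetical numbers, and it assumes the models make independent mistakes, which real models never quite manage): if three classifiers are each right 70% of the time, a majority vote is right whenever at least two of them are, which works out to better accuracy than any single model.
In [ ]:
# Toy illustration: accuracy of a majority vote over three independent
# classifiers that are each correct with probability 0.7
p = 0.7
# the majority is correct if all three are right, or exactly two are
majority_acc = p**3 + 3 * p**2 * (1 - p)
print("Single model accuracy:", p)
print("Majority vote accuracy:", round(majority_acc, 3))
In practice the gains are smaller because the models' errors are correlated, which is exactly why diverse models make better ensemble members (more on this at the end).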
We won’t do fancy visualizations of the dataset here. Check out this tutorial or our bootcamp to learn Plotly and matplotlib. Here, we are focused on optimizing different algorithms and combining them to boost performance.
Let's get started!
In [1]:
import pandas as pd
import numpy as np
# Import the dataset
dataset_path = "spam_dataset.csv"
dataset = pd.read_csv(dataset_path, sep=",")
# Take a peek at the data
dataset.head()
Out[1]:
In [2]:
# Reorder the data columns and drop email_id
cols = dataset.columns.tolist()
cols = cols[2:] + [cols[1]]
dataset = dataset[cols]
# Examine shape of dataset and some column names
print(dataset.shape)
print(dataset.columns.values[0:10])
# Summarise feature values
dataset.describe()
Out[2]:
In [3]:
# Convert dataframe to numpy array and split
# data into input matrix X and class label vector y
npArray = np.array(dataset)
X = npArray[:,:-1].astype(float)
y = npArray[:,-1]
In [4]:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
# Scale and split dataset
X_scaled = preprocessing.scale(X)
# Split into training and test sets
XTrain, XTest, yTrain, yTest = train_test_split(X_scaled, y, random_state=1)
Now it's time to train our algorithms. We are doing binary classification; we could equally have used logistic regression, k-nearest neighbours, or various other classifiers here.
Let’s build a random forest. A great explanation of random forests can be found here. Briefly, random forests build a collection of classification trees, each of which tries to predict classes by recursively splitting the data on the features that separate the classes best. Each tree is trained on bootstrapped data, and each split is only allowed to consider a random subset of the features. So an element of randomness is introduced, a variety of different trees are built, and the 'random forest' ensembles these base learners together.
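To make the bagging part of that concrete, here is a minimal hand-rolled sketch (RandomForestClassifier does all of this for you, so this is purely illustrative): we train a few decision trees on bootstrap samples of the training data, restrict the features available at each split, and let the trees vote.
In [ ]:
import collections
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
# A mini hand-rolled 'forest'
n_trees = 10
all_tree_predictions = []
for i in range(n_trees):
    # draw a bootstrap sample (rows sampled with replacement)
    rows = np.random.choice(XTrain.shape[0], size=XTrain.shape[0], replace=True)
    # 'sqrt' limits each split to a random subset of the features
    tree = DecisionTreeClassifier(max_features='sqrt')
    tree.fit(XTrain[rows], yTrain[rows])
    all_tree_predictions.append(tree.predict(XTest))
# majority vote across the trees for each test point
all_tree_predictions = np.array(all_tree_predictions)
forest_vote = [collections.Counter(all_tree_predictions[:, i]).most_common(1)[0][0]
               for i in range(all_tree_predictions.shape[1])]
print("Hand-rolled forest accuracy:", round(metrics.accuracy_score(yTest, forest_vote), 2))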
A hyperparameter is something that influences the performance of your model but isn't directly tuned during model training. The main hyperparameters to adjust for random forests are n_estimators and max_features. n_estimators controls the number of trees in the forest: the more the better, but more trees come at the expense of longer training time. max_features controls the size of the random selection of features the algorithm is allowed to consider when splitting a node.
We could also choose to tune various other hyperparameters, like max_depth (the maximum depth of a tree, which controls how tall we grow our trees and influences overfitting) and the choice of purity criterion (the specific formula for calculating how good or 'pure' our splits make the terminal nodes).
We are doing a grid search to find good hyperparameter values, which tries out every given value for each hyperparameter of interest and sees how well it performs using (in this case) 10-fold cross-validation (CV). As a reminder, cross-validation estimates the test-set performance of a model; in k-fold CV, we repeatedly partition the dataset into k parts, train on k-1 of them, and 'test' on the remaining part, rotating through the folds. We could also have tuned our hyperparameters using randomized search, which samples values from a distribution rather than trying out everything on a grid. Either is probably fine.
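As a quick sketch of what that CV estimate looks like in code (using scikit's cross_val_score with an untuned random forest, just to show the mechanics):
In [ ]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
# 10-fold CV: train on nine tenths of the training data, 'test' on the
# held-out tenth, and rotate so every fold takes a turn as the test part
scores = cross_val_score(RandomForestClassifier(), XTrain, yTrain, cv=10)
print("Per-fold accuracy estimates:", np.round(scores, 2))
print("Mean CV accuracy:", round(scores.mean(), 2))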
The following code block takes about a minute to run.
In [10]:
from sklearn import metrics
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
# Search for good hyperparameter values
# Specify values to grid search over
n_estimators = np.arange(1, 30, 5)
max_features = np.arange(1, X.shape[1], 10)
max_depth = np.arange(1, 100, 10)
hyperparameters = {'n_estimators': n_estimators,
                   'max_features': max_features,
                   'max_depth': max_depth}
# Grid search using cross-validation
gridCV = GridSearchCV(RandomForestClassifier(), param_grid=hyperparameters, cv=10, n_jobs=4)
gridCV.fit(XTrain, yTrain)
best_n_estim = gridCV.best_params_['n_estimators']
best_max_features = gridCV.best_params_['max_features']
best_max_depth = gridCV.best_params_['max_depth']
# Train classifier using optimal hyperparameter values
# We could have also gotten this model out from gridCV.best_estimator_
clfRDF = RandomForestClassifier(n_estimators=best_n_estim,
                                max_features=best_max_features,
                                max_depth=best_max_depth)
clfRDF.fit(XTrain, yTrain)
RF_predictions = clfRDF.predict(XTest)
print(metrics.classification_report(yTest, RF_predictions))
print("Overall Accuracy:", round(metrics.accuracy_score(yTest, RF_predictions), 2))
93-95% accuracy, not too shabby! Have a look and see how random forests with suboptimal hyperparameters fare. We got around 91-92% accuracy from the out-of-the-box (untuned) random forests, which actually isn't terrible.
Let's train our second algorithm, a support vector machine (SVM), to do the exact same prediction task. A great introduction to the theory behind SVMs can be read here. Briefly, SVMs search for the hyperplane in feature space that best divides the different classes in your dataset. Crucially, SVMs can find non-linear decision boundaries between classes using a process called kernelling, which projects the data into a higher-dimensional space. This sounds a bit abstract, but if you've ever fit a linear regression to power-transformed variables (e.g. using x^2 and x^3 as features), you're already familiar with the concept.
SVMs can use different types of kernels, like polynomial or Gaussian (also called radial basis function, or RBF) kernels, to throw the data into a different space. The main hyperparameters we must tune for SVMs are gamma (a kernel parameter controlling how far we 'throw' the data into the new feature space) and C (which controls the bias-variance tradeoff of the model).
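If the feature-expansion analogy helps, here's a tiny sketch of it on a made-up toy dataset (not the spam data): a linear classifier can't separate a class defined by distance from the origin, but the same linear classifier does fine once we add squared features by hand, which is the kind of transformation the RBF kernel performs implicitly.
In [ ]:
from sklearn.svm import LinearSVC
# Toy data: the class depends on distance from the origin, so no
# straight line through the original 2D space separates it well
rng = np.random.RandomState(0)
X_toy = rng.uniform(-1, 1, size=(200, 2))
y_toy = (X_toy[:, 0]**2 + X_toy[:, 1]**2 > 0.5).astype(int)
linear = LinearSVC().fit(X_toy, y_toy)
print("Accuracy, original features:", round(linear.score(X_toy, y_toy), 2))
# add the squared features: a hyperplane in this expanded space
# corresponds to a circle back in the original space
X_expanded = np.hstack([X_toy, X_toy**2])
expanded = LinearSVC().fit(X_expanded, y_toy)
print("Accuracy, expanded features:", round(expanded.score(X_expanded, y_toy), 2))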
In [11]:
from sklearn.svm import SVC
# Search for good hyperparameter values
# Specify values to grid search over
g_range = 2. ** np.arange(-15, 5, step=2)
C_range = 2. ** np.arange(-5, 15, step=2)
hyperparameters = [{'gamma': g_range,
                    'C': C_range}]
# Grid search using cross-validation
grid = GridSearchCV(SVC(), param_grid=hyperparameters, cv=10)
grid.fit(XTrain, yTrain)
bestG = grid.best_params_['gamma']
bestC = grid.best_params_['C']
# Train SVM and output predictions
rbfSVM = SVC(kernel='rbf', C=bestC, gamma=bestG)
rbfSVM.fit(XTrain, yTrain)
SVM_predictions = rbfSVM.predict(XTest)
print(metrics.classification_report(yTest, SVM_predictions))
print("Overall Accuracy:", round(metrics.accuracy_score(yTest, SVM_predictions), 2))
Looks good! This is similar performance to what we saw with the random forests.
Finally, let's jump on the hype wagon and throw neural networks at our problem.
Neural networks (NNs) represent a different way of thinking about machine learning algorithms. A great place to start learning about neural networks and deep learning is this resource. Briefly, NNs are composed of multiple layers of artificial neurons, which individually are simple processing units that weigh up input data. Layers of neurons can work together to compute very complex functions of the data, which in turn can make excellent predictions. You may be aware of some of the crazy results that NN research has recently achieved.
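For a feel of what one of those units actually computes, here's a minimal sketch with made-up numbers: a neuron takes a weighted sum of its inputs, adds a bias, and squashes the result through a non-linear activation function (a sigmoid here). Layers of these feed into one another, and training adjusts all the weights and biases at once.
In [ ]:
# A single artificial neuron: weighted sum of inputs -> non-linearity
inputs = np.array([0.5, -1.2, 3.0])   # toy input values
weights = np.array([0.4, 0.1, -0.6])  # these are what training learns
bias = 0.2
output = 1.0 / (1.0 + np.exp(-(np.dot(weights, inputs) + bias)))  # sigmoid
print("Neuron output:", round(output, 3))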
Here, we train a shallow, fully-connected, feedforward neural network on the spam dataset. Other types of neural network implementations in scikit are available here. The hyperparameters we optimize here are the overall architecture (number of neurons in each layer and the number of layers) and the learning rate (which controls how quickly the parameters in our network change during the training phase; see gradient descent and backpropagation).
In [12]:
from sklearn.neural_network import MLPClassifier  # scikit-learn's built-in multilayer perceptron
# Search for good hyperparameter values
# Specify values to grid search over
layer_size_range = [(3, 2), (10, 10), (2, 2, 2), (10,), (5,)]  # different network shapes
learning_rate_range = np.linspace(0.1, 1, 3)
hyperparameters = [{'hidden_layer_sizes': layer_size_range,
                    'learning_rate_init': learning_rate_range}]
# Grid search using cross-validation
grid = GridSearchCV(MLPClassifier(), param_grid=hyperparameters, cv=10)
grid.fit(XTrain, yTrain)
# Output best hyperparameter values
best_size = grid.best_params_['hidden_layer_sizes']
best_lr = grid.best_params_['learning_rate_init']
# Train neural network and output predictions
nnet = MLPClassifier(hidden_layer_sizes=best_size, learning_rate_init=best_lr)
nnet.fit(XTrain, yTrain)
NN_predictions = nnet.predict(XTest)
print(metrics.classification_report(yTest, NN_predictions))
print("Overall Accuracy:", round(metrics.accuracy_score(yTest, NN_predictions), 2))
Looks like this neural network (given this dataset, architecture, and hyperparameterisation) is doing slightly worse on the spam dataset. That's okay, it could still be picking up on a signal that the random forest and SVM weren't.
Machine learning algorithms... ensemble!
In [153]:
# here's a rough solution
import collections
# stick all predictions into a dataframe
predictions = pd.DataFrame(np.array([RF_predictions, SVM_predictions, NN_predictions])).T
predictions.columns = ['RF', 'SVM', 'NN']
# convert the 'yes'/'no' labels to 1/0 so we can vote numerically
predictions = pd.DataFrame(np.where(predictions == 'yes', 1, 0),
                           columns=predictions.columns,
                           index=predictions.index)
# initialise empty array for holding predictions
ensembled_predictions = np.zeros(shape=yTest.shape)
# majority vote and output final predictions
for test_point in range(predictions.shape[0]):
    counts = collections.Counter(predictions.iloc[test_point, :])
    majority_vote = counts.most_common(1)[0][0]
    ensembled_predictions[test_point] = majority_vote
    print("The majority vote for test point", test_point, "is:", majority_vote)
In [178]:
# Get final accuracy of ensembled model
yTest[yTest == "yes"] = 1
yTest[yTest == "no"] = 0
print(metrics.classification_report(yTest.astype(int), ensembled_predictions.astype(int)))
print("Ensemble Accuracy:", round(metrics.accuracy_score(yTest.astype(int), ensembled_predictions.astype(int)), 2))
There are plenty of ways to do model ensembling. We've just used simple majority voting. We could also do weighted majority voting, where models with higher accuracy get more of a vote, and if your output is numerical, you could average the predictions. These relatively simple techniques do a great job, but there is more! Stacking (also called blending) is when the predictions from different algorithms are used as input to another algorithm (often good old linear or logistic regression), which then outputs your final predictions. For example, you might train a logistic regression model on the base models' predictions.
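Here's a rough sketch of the stacking mechanics using the 0/1 predictions dataframe from above (note: to do this properly you'd fit the meta-model on out-of-fold predictions from the training data, not on test-set predictions as we lazily do here, so treat the resulting score as optimistic):
In [ ]:
from sklearn.linear_model import LogisticRegression
# use the base models' 0/1 predictions as input features
meta_X = predictions[['RF', 'SVM', 'NN']].values
meta_y = yTest.astype(int)
# fitting and scoring on the same predictions leaks information;
# a proper stack would use out-of-fold predictions instead
stacker = LogisticRegression()
stacker.fit(meta_X, meta_y)
print("Stacked accuracy (in-sample, optimistic):", round(stacker.score(meta_X, meta_y), 2))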
It is best to ensemble together models whose predictions are less correlated (see an excellent explanation of this here).
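A quick way to eyeball this with the dataframe we already have:
In [ ]:
# pairwise correlation between the models' test-set predictions;
# less correlated models tend to make better ensemble members
print(predictions.corr())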
What happens when your dataset isn’t as nice as this? What if there are many more instances of one class versus the other, or if you have a lot of missing values, or a mixture of categorical and numerical variables? Stay tuned for the next blog post where we write up guidance on tackling these types of sticky situations.