Model ensembling is a class of techniques for aggregating multiple different predictive algorithms into a meta-algorithm, which tends to increase accuracy and reduce overfitting. Ensembling approaches often work surprisingly well, and many winners of competitive data science competitions use model ensembling in one form or another. In previous tutorials, we discussed how tuning models allows you to get the best performance from individual machine learning algorithms. Here, we will take you through the steps of building your own ensemble for a classification problem, consisting of an individually optimized random forest, support vector machine (SVM) and logistic regression classifier.
These different models have quite different structures, which suggests they might capture different aspects of the dataset and could work well in an ensemble. We’ll continue working on the popular wine dataset, which captures chemical properties of wines and associated wine quality rankings. The goal is to predict wine quality from the chemical properties. In this post, you'll use the following techniques to build model ensembles: simple majority voting, weighted majority voting, and model stacking/blending.
There are also more fundamental reasons why ensembling different algorithms often improves accuracy, which are extremely well explained in this Kaggle ensembling guide. Briefly, majority voting between models can correct errors in the predictions of individual models.
The general idea behind ensembling is this: different classes of algorithms (or differently parameterized versions of the same type of algorithm) might be good at picking up on different signals in the dataset, so combining them lets you model the data better and make better predictions. Different algorithms might also be overfitting to the data in different ways, and combining them lets you effectively average away some of this overfitting. Finally, if you're trying to squeeze out extra accuracy points, ensembling is often a more computationally effective way to do so than tuning a single model ever more finely in search of optimal hyperparameters.
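As a toy illustration of why voting helps, imagine three classifiers that are each right 70% of the time and make their mistakes independently of one another (real models only approximate this, which is why low correlation between models matters). The majority vote is correct whenever at least two of the three are correct:

In [ ]:
# Toy example: three independent classifiers, each correct with probability 0.7.
# The majority vote is right when at least two of the three are right.
p = 0.7
p_majority = 3 * p**2 * (1 - p) + p**3
print(round(p_majority, 3))   # 0.784 -- better than any single model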
It is best to ensemble together models which are less correlated with one another, because then each model can capture different aspects of the dataset (see an excellent explanation of ensembling here).
You have probably already encountered several uses of model ensembling. Random forests are a type of ensemble algorithm that aggregates many individual classification tree base learners. They are a good system for intuitively understanding what ensembling is (see an explanation here).
So, a random forest is already an ensemble in its own right; here, a random forest will be just one model in the larger ensemble we build. 'Ensembling' is a broad term and a recurring concept throughout machine learning, but the general idea is that an ensemble can correct the mistakes of its individual parts and allow different models to capture different signals in the dataset, thereby improving overall performance.
If you’re interested in deep learning, one common technique for improving classification accuracy is training several different neural networks and getting them to vote on the classification of test instances. An ensemble-like technique for training individual neural networks is dropout, which effectively trains many different subnetworks during the same training phase. Combining different models is a recurring theme in machine learning, appearing in different incarnations; if you’re familiar with bagging or boosting algorithms, these are very explicit examples of ensembling.
We will be working on ensembling different algorithms, using both majority voting and stacking, in order to get improved classification accuracy on the wine dataset. We won’t do fancy visualizations of the dataset here, but check out a previous tutorial or our bootcamp if you’d like to learn Plotly and matplotlib. In this post, the focus is on combining different algorithms to boost performance.
Let's get started!
In [2]:
import wget
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
# Import the dataset
data_url = 'https://raw.githubusercontent.com/nslatysheva/data_science_blogging/master/datasets/wine/winequality-red.csv'
dataset = wget.download(data_url)
dataset = pd.read_csv(dataset, sep=";")
# Using a lambda function to bin quality scores
dataset['quality_is_high'] = dataset.quality.apply(lambda x: 1 if x >= 6 else 0)
# Convert the dataframe to a numpy array and split the
# data into an input matrix X and class label vector y
npArray = np.array(dataset)
X = npArray[:,:-2].astype(float)
y = npArray[:,-1]
# Split into training and test sets
XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1)
In [ ]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.linear_model import LogisticRegression
# Build rf model
best_n_estimators, best_max_features = 73, 5
rf = RandomForestClassifier(n_estimators=best_n_estimators, max_features=best_max_features)
rf.fit(XTrain, yTrain)
rf_predictions = rf.predict(XTest)
# Build SVM model
best_C_svm, best_gamma = 1.07, 0.01
rbf_svm = svm.SVC(kernel='rbf', C=best_C_svm, gamma=best_gamma)
rbf_svm.fit(XTrain, yTrain)
svm_predictions = rbf_svm.predict(XTest)
# Build LR model
best_penalty, best_C_lr = "l2", 0.52
lr = LogisticRegression(penalty=best_penalty, C=best_C_lr)
lr.fit(XTrain, yTrain)
lr_predictions = lr.predict(XTest)
In [ ]:
# Assess the SVM's solo performance on the test set
from sklearn.metrics import classification_report, accuracy_score

print(classification_report(yTest, svm_predictions))
print("Overall Accuracy:", round(accuracy_score(yTest, svm_predictions), 4))
In [ ]:
import collections

# Stick all predictions into a dataframe
predictions = pd.DataFrame(np.array([rf_predictions, svm_predictions, lr_predictions])).T
predictions.columns = ['RF', 'SVM', 'LR']

# Initialise empty array for holding the ensembled predictions
ensembled_predictions = np.zeros(shape=yTest.shape)

# Majority vote and output final predictions
for test_point in range(predictions.shape[0]):
    row = predictions.iloc[test_point, :]
    counts = collections.Counter(row)
    majority_vote = counts.most_common(1)[0][0]

    # Record the winning vote for this test point
    ensembled_predictions[test_point] = int(majority_vote)
    # print("The majority vote for test point", test_point, "is:", majority_vote)

print(ensembled_predictions)
And we could assess the performance of the majority voted predictions like so:
In [ ]:
# Get final accuracy of the ensembled model
from sklearn.metrics import classification_report, accuracy_score

for individual_predictions in [rf_predictions, svm_predictions, lr_predictions]:
    # print(classification_report(yTest.astype(int), individual_predictions.astype(int)))
    print("Accuracy:", round(accuracy_score(yTest.astype(int), individual_predictions.astype(int)), 2))

print(classification_report(yTest.astype(int), ensembled_predictions.astype(int)))
print("Ensemble Accuracy:", round(accuracy_score(yTest.astype(int), ensembled_predictions.astype(int)), 2))
Luckily, we do not have to do all of this manually; we can use scikit-learn's VotingClassifier class:
In [5]:
from sklearn.ensemble import VotingClassifier
from sklearn import metrics

# Build and fit majority vote classifier
ensemble_1 = VotingClassifier(estimators=[('rf', rf), ('svm', rbf_svm), ('lr', lr)], voting='hard')
ensemble_1.fit(XTrain, yTrain)
simple_ensemble_predictions = ensemble_1.predict(XTest)

print(metrics.classification_report(yTest, simple_ensemble_predictions))
print("Ensemble_1 Overall Accuracy:", round(metrics.accuracy_score(yTest, simple_ensemble_predictions), 2))
We can also do a weighted majority vote, where each base learner is associated with a weight (often reflecting the accuracy of that model, i.e. more accurate models should get a higher weight). The weights scale the occurrence counts of the predicted class labels, which gives certain algorithms more of a say in the majority vote.
In [ ]:
# Build a weighted majority vote classifier
# (equal weights here as a starting point; in practice, set them to reflect each model's accuracy)
ensemble_2 = VotingClassifier(estimators=[('rf', rf), ('svm', rbf_svm), ('lr', lr)], weights=[1, 1, 1], voting='hard')
ensemble_2.fit(XTrain, yTrain)
weighted_ensemble_predictions = ensemble_2.predict(XTest)

print(metrics.classification_report(yTest, weighted_ensemble_predictions))
print("Ensemble_2 Overall Accuracy:", round(metrics.accuracy_score(yTest, weighted_ensemble_predictions), 2))
You may have noticed the voting='hard' argument we passed to the VotingClassifier. Setting voting='soft' instead predicts class labels based on how certain each algorithm in the ensemble is about its individual predictions, by averaging the predicted class probabilities from each classifier. This means every classifier in the ensemble must be able to output probabilities; for scikit-learn's SVC, that requires setting probability=True. Note that scikit-learn only recommends soft voting when the classifiers are already well calibrated, which should be the case here since we tuned them.
In [ ]:
# For soft voting, every model must output class probabilities, so rebuild the SVM with probability=True
rbf_svm_prob = svm.SVC(kernel='rbf', C=best_C_svm, gamma=best_gamma, probability=True)
ensemble_3 = VotingClassifier(estimators=[('rf', rf), ('svm', rbf_svm_prob), ('lr', lr)], weights=[1, 1, 1], voting='soft')
ensemble_3.fit(XTrain, yTrain)
soft_ensemble_predictions = ensemble_3.predict(XTest)

print(metrics.classification_report(yTest, soft_ensemble_predictions))
print("Ensemble_3 Overall Accuracy:", round(metrics.accuracy_score(yTest, soft_ensemble_predictions), 2))
## Model stacking
There are plenty of ways to do model ensembling. The simplest is majority voting; we can also do weighted majority voting, where models with higher accuracy get more of a vote, and if your output is numerical, you could simply average the predictions. These relatively simple techniques do a great job, but there is more. Stacking (also called blending) is when the predictions from different algorithms are used as input into another algorithm (often good old linear or logistic regression), which then outputs your final predictions. For example, you might train a logistic regression on top of the base models' predictions, as sketched below.
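Here is a minimal sketch of stacking the three base models we trained above. It uses out-of-fold predictions (via cross_val_predict) as the meta-learner's training features, so the meta-learner never sees predictions made on data a base model was fit to; the variable names (stacked_train, meta_learner, etc.) are just illustrative.

In [ ]:
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression

# Out-of-fold predictions from each base model on the training set
rf_oof = cross_val_predict(rf, XTrain, yTrain, cv=5)
svm_oof = cross_val_predict(rbf_svm, XTrain, yTrain, cv=5)
lr_oof = cross_val_predict(lr, XTrain, yTrain, cv=5)

# The base model predictions become the features for the meta-learner
stacked_train = np.column_stack((rf_oof, svm_oof, lr_oof))
stacked_test = np.column_stack((rf_predictions, svm_predictions, lr_predictions))

# Train a simple logistic regression meta-learner on top of the base predictions
meta_learner = LogisticRegression()
meta_learner.fit(stacked_train, yTrain)
stacked_predictions = meta_learner.predict(stacked_test)

print("Stacked Ensemble Accuracy:", round(accuracy_score(yTest.astype(int), stacked_predictions.astype(int)), 2))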
What happens when your dataset isn’t as nice as this? What if there are many more instances of one class versus the other, or if you have a lot of missing values, or a mixture of categorical and numerical variables? Stay tuned for the next blog post where we write up guidance on tackling these types of sticky situations.
Stacking is one way of combining different techniques that has proven particularly useful in competitive machine learning, where the smallest improvements can be crucial to winning. A related trick is to add the predictions of different classifiers as additional features, train a variety of models on this new feature set (your old features plus the base model predictions), and then average the predictions of those models, as sketched below. Competitors also experiment with exactly how predictions are combined, for example taking the harmonic mean of predictions rather than a standard average, or adding the logit of a model’s predictions as a feature. Certain types of algorithms tend to do very well in prediction competitions, notably gradient boosted trees; XGBoost, a popular gradient boosting implementation with Python and R interfaces, is typically faster than scikit-learn's version.
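As a rough sketch of that feature-augmentation idea (reusing the out-of-fold predictions from the stacking example above; again, the variable names are just illustrative):

In [ ]:
# Append the base model predictions to the original features
XTrain_aug = np.column_stack((XTrain, rf_oof, svm_oof, lr_oof))
XTest_aug = np.column_stack((XTest, rf_predictions, svm_predictions, lr_predictions))

# Train a second-level model on the augmented feature set
blender = LogisticRegression()
blender.fit(XTrain_aug, yTrain)
blended_predictions = blender.predict(XTest_aug)

print("Blended Accuracy:", round(accuracy_score(yTest.astype(int), blended_predictions.astype(int)), 2))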
Properly parameterised neural networks can also be extremely powerful, though for competitions constrained by time and computational resources they are sometimes not practical.
Scikit-learn’s BaggingClassifier meta-estimator is another option: it runs the same algorithm multiple times on random subsets of the observations (and, optionally, of the features) and aggregates the outputs, as sketched below.
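For instance, a minimal sketch of bagging decision trees (the n_estimators, max_samples and max_features values here are arbitrary choices for illustration):

In [ ]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# 50 decision trees, each fit on a random 80% of the rows and 80% of the features
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                                 max_samples=0.8, max_features=0.8)
bagged_trees.fit(XTrain, yTrain)
print("Bagging Accuracy:", round(accuracy_score(yTest, bagged_trees.predict(XTest)), 2))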
Probability calibration (e.g. with scikit-learn’s CalibratedClassifierCV) can also help, since well-calibrated probabilities combine more sensibly in soft voting and stacking.
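A quick sketch of wrapping the SVM in a calibrator (the method and cv values are illustrative):

In [ ]:
from sklearn.calibration import CalibratedClassifierCV

# Map the SVM's decision scores to calibrated class probabilities
calibrated_svm = CalibratedClassifierCV(svm.SVC(kernel='rbf', C=best_C_svm, gamma=best_gamma),
                                        method='sigmoid', cv=5)
calibrated_svm.fit(XTrain, yTrain)
print(calibrated_svm.predict_proba(XTest)[:5])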
We could also go a step further with random forests and use extremely randomized trees (scikit-learn's ExtraTreesClassifier), which inject even more randomness into how splits are chosen, as sketched below.
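A minimal sketch, reusing the random forest's tuned values purely for illustration:

In [ ]:
from sklearn.ensemble import ExtraTreesClassifier

xt = ExtraTreesClassifier(n_estimators=best_n_estimators, max_features=best_max_features)
xt.fit(XTrain, yTrain)
print("Extra Trees Accuracy:", round(accuracy_score(yTest, xt.predict(XTest)), 2))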
Another nice tutorial on doing ensembling in Python is here.
In [ ]: