Combining different machine learning algorithms into an ensemble model

Model ensembling is a class of techniques for aggregating multiple predictive algorithms into a sort of mega-algorithm, which can often increase accuracy and reduce overfitting. Ensembling approaches frequently work surprisingly well, and many winners of competitive data science competitions use model ensembling in one form or another. In this tutorial, we will take you through the steps of building an ensemble of a random forest, a support vector machine, and a neural network for a classification problem. We'll be working on the famous spam dataset, trying to predict whether a given email is spam or not, using the standard Python machine learning stack (scikit-learn/numpy/pandas).

You have probably already encountered several uses of model ensembling. Random forests are a type of ensemble algorithm that aggregates many individual decision tree base learners. If you're interested in deep learning, one common technique for improving classification accuracy is training several networks and having them vote on the classification of each test instance (look at dropout for a related but wacky take on ensembling). If you're familiar with bagging or boosting algorithms, these are very explicit examples of ensembling.

Regardless of the specifics, the general idea behind ensembling is this: different classes of algorithms (or differently parameterized versions of the same type of algorithm) might be good at picking up on different signals in the dataset. Combining them means that you can model the data better, leading to better predictions. Furthermore, different algorithms might be overfitting to the data in various ways, but by combining them, you can effectively average away some of this overfitting.
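For intuition, here is a toy sketch (not tied to any real dataset) of the simplest version of this idea, a straight majority vote over the predictions of three hypothetical classifiers:

import numpy as np

# Three hypothetical classifiers each predict a 0/1 label for five instances
toy_predictions = np.array([[1, 0, 1, 1, 0],   # classifier A
                            [1, 1, 0, 1, 0],   # classifier B
                            [0, 0, 1, 1, 1]])  # classifier C

# An instance is labelled 1 if at least two of the three classifiers vote 1
majority = (toy_predictions.sum(axis=0) >= 2).astype(int)
print majority   # [1 0 1 1 0]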

We won’t do fancy visualizations of the dataset here. Check out this tutorial or our bootcamp to learn Plotly and matplotlib. Here, we are focused on optimizing different algorithms and combining them to boost performance.

Let's get started!

1. Loading up the data

First, load the dataset. We generally want the input data to be a matrix (X) and the instance labels to be a separate vector (y).


In [1]:
import pandas as pd
import numpy as np

# Import the dataset
dataset_path = "spam_dataset.csv"
dataset = pd.read_csv(dataset_path, sep=",")

# Take a peek at the data
dataset.head()


Out[1]:
email_id is_spam word_freq_will word_freq_original word_freq_415 word_freq_mail char_freq_# char_freq_$ word_freq_internet word_freq_edu ... word_freq_receive word_freq_000 capital_run_length_average word_freq_address word_freq_george word_freq_cs word_freq_random word_freq_conference word_freq_technology char_freq_(
0 3628 no 0.00 0 0 0.00 0 0 0.0 0 ... 0.00 0 2.000 0.00 0.00 0 0 0 0 0.000
1 63 no 0.00 0 0 0.49 0 0 0.0 0 ... 0.00 0 2.824 0.00 0.99 0 0 0 0 0.062
2 1540 no 1.31 0 0 0.00 0 0 0.0 0 ... 0.00 0 2.176 0.00 0.00 0 0 0 0 0.431
3 4460 yes 0.75 0 0 0.50 0 0 0.5 0 ... 0.25 0 1.023 0.75 0.00 0 0 0 0 0.180
4 2771 no 0.00 0 0 0.00 0 0 0.0 0 ... 0.00 0 1.500 0.00 1.56 0 0 0 0 0.180

5 rows × 63 columns

2. Cleaning up and summarizing the data

Lookin' good! Let's get the data into a nicer format: we rearrange some columns, then check out what the columns are.


In [2]:
# Reorder the data columns and drop email_id
cols = dataset.columns.tolist()
cols = cols[2:] + [cols[1]]
dataset = dataset[cols]

# Examine shape of dataset and some column names
print dataset.shape
print dataset.columns.values[0:10]

# Summarise feature values
dataset.describe()


(1000, 62)
['word_freq_will' 'word_freq_original' 'word_freq_415' 'word_freq_mail'
 'char_freq_#' 'char_freq_$' 'word_freq_internet' 'word_freq_edu'
 'word_freq_hp' 'word_freq_lab']
Out[2]:
word_freq_will word_freq_original word_freq_415 word_freq_mail char_freq_# char_freq_$ word_freq_internet word_freq_edu word_freq_hp word_freq_lab ... word_freq_receive word_freq_000 capital_run_length_average word_freq_address word_freq_george word_freq_cs word_freq_random word_freq_conference word_freq_technology char_freq_(
count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.00000 1000.000000 1000.000000 ... 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000 1000 1000.000000 1000.000000 1000.000000
mean 0.537950 0.038370 0.054690 0.189840 0.022792 0.066014 0.073210 0.18100 0.611970 0.118610 ... 0.051040 0.081300 4.857610 0.149980 0.775740 0 0 0.036690 0.125580 0.144783
std 0.831747 0.173041 0.365678 0.496022 0.109007 0.248239 0.270431 0.86285 1.734907 0.746169 ... 0.192314 0.358906 30.226395 0.955315 3.509211 0 0 0.268434 0.449092 0.232423
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 ... 0.000000 0.000000 1.000000 0.000000 0.000000 0 0 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 ... 0.000000 0.000000 1.541000 0.000000 0.000000 0 0 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 ... 0.000000 0.000000 2.219500 0.000000 0.000000 0 0 0.000000 0.000000 0.072000
75% 0.820000 0.000000 0.000000 0.000000 0.000000 0.016000 0.000000 0.00000 0.315000 0.000000 ... 0.000000 0.000000 3.396500 0.000000 0.000000 0 0 0.000000 0.000000 0.195000
max 6.250000 2.220000 4.760000 5.260000 1.410000 4.017000 3.570000 10.00000 20.830000 14.280000 ... 2.000000 5.450000 667.000000 14.280000 33.330000 0 0 5.000000 4.760000 2.941000

8 rows × 61 columns


In [3]:
# Convert dataframe to numpy array and split
# data into input matrix X and class label vector y
npArray = np.array(dataset)
X = npArray[:,:-1].astype(float)
y = npArray[:,-1]

3. Splitting the data into training and test sets

Our data is now nice and squeaky clean! This definitely always happens in real life.

Next up, let's scale the data and split it into a training and test set. (Strictly speaking, you would fit the scaler on the training data only, so that no information from the test set leaks into preprocessing, but we keep things simple here.)


In [4]:
from sklearn import preprocessing
from sklearn.cross_validation import train_test_split

# Scale and split dataset
X_scaled = preprocessing.scale(X)

# Split into training and test sets
XTrain, XTest, yTrain, yTest = train_test_split(X_scaled, y, random_state=1)

4. Running algorithms on the data

Now it's time to train some algorithms. We are doing binary classification; we could also have used logistic regression, k-nearest neighbours, and so on.

4.1 Random forests

Let’s build a random forest. A great explanation of random forests can be found here. Briefly, random forests build a collection of classification trees, which each try to predict classes by recursively splitting the data on features that split classes best. Each tree is trained on bootstrapped data, and each split is only allowed to use certain variables. So, an element of randomness is introduced, a variety of different trees are built, and the 'random forest' ensembles together these base learners.

A hyperparameter is something that influences the performance of your model but isn't directly tuned during model training. The main hyperparameters to adjust for random forests are n_estimators and max_features. n_estimators controls the number of trees in the forest: the more the better, but more trees come at the expense of longer training time. max_features controls the size of the random subset of features the algorithm is allowed to consider when splitting a node.

We could also choose to tune various other hyperparameters, like max_depth (the maximum depth of a tree, which controls how tall we grow our trees and influences overfitting) and the choice of purity criterion (the formula used to calculate how good or 'pure' a split leaves the terminal nodes).
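As a rough, hedged sketch of the effect of n_estimators on this dataset (the exact accuracies will vary from run to run), you could try something like:

from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier

# More trees usually helps, with diminishing returns and longer training time
for n_trees in [1, 5, 25]:
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=1)
    forest.fit(XTrain, yTrain)
    print n_trees, round(metrics.accuracy_score(yTest, forest.predict(XTest)), 2)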

We use grid search to find good hyperparameter values: it tries out every given value for each hyperparameter of interest and evaluates the resulting models using (in this case) 10-fold cross-validation (CV). As a reminder, cross-validation estimates a model's test-set performance; in k-fold CV, the estimate comes from repeatedly partitioning the dataset into k parts and 'testing' on each held-out 1/kth in turn. We could also have tuned our hyperparameters using randomized search, which samples values from a distribution rather than trying out all given values. Either is probably fine.
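For completeness, here is a minimal sketch of the randomized-search alternative, assuming we draw hyperparameter values from simple integer distributions (the ranges and n_iter below are illustrative):

from scipy.stats import randint
from sklearn.grid_search import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

# Sample 20 random hyperparameter settings instead of trying every combination
param_distributions = {'n_estimators': randint(1, 30),
                       'max_features': randint(1, X.shape[1]),
                       'max_depth':    randint(1, 100)}

randomCV = RandomizedSearchCV(RandomForestClassifier(),
                              param_distributions=param_distributions,
                              n_iter=20, cv=10, n_jobs=4)
randomCV.fit(XTrain, yTrain)
print randomCV.best_params_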

The grid search cell below (In [10]) takes about a minute to run.


In [10]:
from sklearn import metrics
from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Search for good hyperparameter values
# Specify values to grid search over
n_estimators = np.arange(1, 30, 5)
max_features  = np.arange(1, X.shape[1], 10)
max_depth    = np.arange(1, 100, 10)

hyperparameters   = {'n_estimators': n_estimators, 
                     'max_features': max_features, 
                     'max_depth': max_depth}

# Grid search using cross-validation
gridCV = GridSearchCV(RandomForestClassifier(), param_grid=hyperparameters, cv=10, n_jobs=4)
gridCV.fit(XTrain, yTrain)

best_n_estim      = gridCV.best_params_['n_estimators']
best_max_features = gridCV.best_params_['max_features']               
best_max_depth    = gridCV.best_params_['max_depth']

# Train classifier using optimal hyperparameter values
# We could have also gotten this model out from gridCV.best_estimator_
clfRDF = RandomForestClassifier(n_estimators=best_n_estim, max_features=best_max_features, max_depth=best_max_depth)
clfRDF.fit(XTrain, yTrain)
RF_predictions = clfRDF.predict(XTest)

print metrics.classification_report(yTest, RF_predictions)
print "Overall Accuracy:", round(metrics.accuracy_score(yTest, RF_predictions),2)


             precision    recall  f1-score   support

         no       0.94      0.96      0.95       170
        yes       0.91      0.86      0.88        80

avg / total       0.93      0.93      0.93       250

Overall Accuracy: 0.93

93-95% accuracy, not too shabby! Have a look at how random forests with suboptimal hyperparameters fare; a quick sketch of the untuned baseline follows. We got around 91-92% accuracy with the out-of-the-box (untuned) random forests, which actually isn't terrible.
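Here is a minimal sketch of that untuned baseline, in case you want to reproduce it (your exact number will differ slightly between runs):

from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier

# Out-of-the-box random forest with all default hyperparameters
untunedRF = RandomForestClassifier()
untunedRF.fit(XTrain, yTrain)
print round(metrics.accuracy_score(yTest, untunedRF.predict(XTest)), 2)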

4.2 Support vector machines

Let's train our second algorithm, a support vector machine (SVM), on the same prediction task. A great introduction to the theory behind SVMs can be read here. Briefly, SVMs search for hyperplanes in the feature space that best divide the different classes in your dataset. Crucially, SVMs can find non-linear decision boundaries between classes using a process called kernelling, which projects the data into a higher-dimensional space. This sounds a bit abstract, but if you've ever fit a linear regression to power-transformed variables (e.g. maybe you used x^2, x^3 as features), you're already familiar with the concept.
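To make the analogy concrete, here is a toy sketch (separate from the spam pipeline) in which a linear classifier struggles on the raw feature x but separates the classes easily once we hand-craft an x**2 feature; this is the spirit of what kernelling does for us automatically:

import numpy as np
from sklearn.svm import LinearSVC

# Class 1 points sit far from zero, class 0 points sit near zero,
# so no single threshold on x alone can separate the classes
toy_x = np.array([-2.0, -1.5, 1.5, 2.0, -0.3, 0.0, 0.4]).reshape(-1, 1)
toy_y = np.array([1, 1, 1, 1, 0, 0, 0])

linear_only = LinearSVC().fit(toy_x, toy_y)
with_square = LinearSVC().fit(np.hstack([toy_x, toy_x ** 2]), toy_y)

print linear_only.score(toy_x, toy_y)                           # imperfect
print with_square.score(np.hstack([toy_x, toy_x ** 2]), toy_y)  # typically 1.0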

SVMs can use different types of kernels, such as polynomial or Gaussian (radial basis function) kernels, to throw the data into a different space. The main hyperparameters we must tune for SVMs are gamma (a kernel parameter, controlling how far we 'throw' the data into the new feature space) and C (which controls the bias-variance tradeoff of the model).


In [11]:
from sklearn.svm import SVC

# Search for good hyperparameter values
# Specify values to grid search over
g_range = 2. ** np.arange(-15, 5, step=2)
C_range = 2. ** np.arange(-5, 15, step=2)

hyperparameters = [{'gamma': g_range, 
                    'C': C_range}] 

# Grid search using cross-validation
grid = GridSearchCV(SVC(), param_grid=hyperparameters, cv= 10)  
grid.fit(XTrain, yTrain)

bestG = grid.best_params_['gamma']
bestC = grid.best_params_['C']

# Train SVM and output predictions
rbfSVM = SVC(kernel='rbf', C=bestC, gamma=bestG)
rbfSVM.fit(XTrain, yTrain)
SVM_predictions = rbfSVM.predict(XTest)

print metrics.classification_report(yTest, SVM_predictions)
print "Overall Accuracy:", round(metrics.accuracy_score(yTest, SVM_predictions),2)


             precision    recall  f1-score   support

         no       0.94      0.95      0.94       170
        yes       0.88      0.86      0.87        80

avg / total       0.92      0.92      0.92       250

Overall Accuracy: 0.92

Looks good! This is similar performance to what we saw in the random forests.

4.3 Neural networks

Finally, let's jump on the hype wagon and throw neural networks at our problem.

Neural networks (NNs) represent a different way of thinking about machine learning algorithms. A great place to start learning about neural networks and deep learning is this resource. Briefly, NNs are composed of multiple layers of artificial neurons, each of which is a simple processing unit that weighs up its input data. Layers of neurons can work together to compute very complex functions of the data, which in turn can make excellent predictions. You may be aware of some of the crazy results that NN research has recently achieved.

Here, we train a shallow, fully-connected, feedforward neural network on the spam dataset. Other types of neural network implementations in scikit are available here. The hyperparameters we optimize here are the overall architecture (number of neurons in each layer and the number of layers) and the learning rate (which controls how quickly the parameters in our network change during the training phase; see gradient descent and backpropagation).
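The cell below uses a standalone multilayer_perceptron module. If your scikit-learn version ships sklearn.neural_network (0.18 and later), a roughly equivalent sketch with the built-in MLPClassifier would look like this (the architecture and learning rate here are illustrative):

from sklearn import metrics
from sklearn.neural_network import MLPClassifier

# Two hidden layers of 10 neurons each; learning rate chosen arbitrarily
nnet_sketch = MLPClassifier(hidden_layer_sizes=(10, 10), learning_rate_init=0.1)
nnet_sketch.fit(XTrain, yTrain)
print round(metrics.accuracy_score(yTest, nnet_sketch.predict(XTest)), 2)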


In [12]:
from multilayer_perceptron import multilayer_perceptron

# Search for good hyperparameter values
# Specify values to grid search over
layer_size_range = [(3,2),(10,10),(2,2,2),10,5] # different network shapes
learning_rate_range = np.linspace(.1,1,3)
hyperparameters = [{'hidden_layer_sizes': layer_size_range, 'learning_rate_init': learning_rate_range}]

# Grid search using cross-validation
grid = GridSearchCV(multilayer_perceptron.MultilayerPerceptronClassifier(), param_grid=hyperparameters, cv=10)
grid.fit(XTrain, yTrain)

# Output best hyperparameter values
best_size = grid.best_params_['hidden_layer_sizes']
best_lr   = grid.best_params_['learning_rate_init']

# Train neural network and output predictions
nnet = multilayer_perceptron.MultilayerPerceptronClassifier(hidden_layer_sizes=best_size, learning_rate_init=best_lr)
nnet.fit(XTrain, yTrain)
NN_predictions = nnet.predict(XTest)

print metrics.classification_report(yTest, NN_predictions)
print "Overall Accuracy:", round(metrics.accuracy_score(yTest, NN_predictions),2)


             precision    recall  f1-score   support

         no       0.95      0.92      0.93       170
        yes       0.84      0.89      0.86        80

avg / total       0.91      0.91      0.91       250

Overall Accuracy: 0.91

Looks like this neural network (given this dataset, architecture, and hyperparameterisation) is doing slightly worse on the spam dataset. That's okay, it could still be picking up on a signal that the random forest and SVM weren't.

Machine learning algorithms... ensemble!

5. Majority vote on classifications


In [153]:
# here's a rough solution

import collections

# stick all predictions into a dataframe
predictions = pd.DataFrame(np.array([RF_predictions, SVM_predictions, NN_predictions])).T
predictions.columns = ['RF', 'SVM', 'NN']
predictions = pd.DataFrame(np.where(predictions=='yes', 1, 0), 
                           columns=predictions.columns, 
                           index=predictions.index)

# initialise an array to hold the ensembled predictions
ensembled_predictions = np.zeros(shape=yTest.shape)

# take the majority vote for each test point and store the final prediction
for test_point in range(predictions.shape[0]):
    counts = collections.Counter(predictions.iloc[test_point,:])
    majority_vote = counts.most_common(1)[0][0]

    # output votes
    ensembled_predictions[test_point] = majority_vote
    print "The majority vote for test point", test_point, "is: ", majority_vote


The majority vote for test point 0 is:  1
The majority vote for test point 1 is:  0
The majority vote for test point 2 is:  0
The majority vote for test point 3 is:  0
The majority vote for test point 4 is:  0
...
The majority vote for test point 249 is:  0

In [178]:
# Get final accuracy of ensembled model
yTest[yTest == "yes"] = 1
yTest[yTest == "no"] = 0

print metrics.classification_report(yTest.astype(int), ensembled_predictions.astype(int))
print "Ensemble Accuracy:", round(metrics.accuracy_score(yTest.astype(int), ensembled_predictions.astype(int)),2)


             precision    recall  f1-score   support

          0       0.95      0.96      0.95       170
          1       0.91      0.89      0.90        80

avg / total       0.94      0.94      0.94       250

Ensemble Accuracy: 0.94

6. Conclusion

There are plenty of ways to do model ensembling. We used simple majority voting here, but we could also do weighted majority voting, where models with higher accuracy get more of a vote; if your output is numerical, you could average the predictions instead. These relatively simple techniques do a great job, but there is more! Stacking (also called blending) uses the predictions from different algorithms as input to another algorithm (often good old linear or logistic regression), which then outputs your final predictions. For example, you might train a logistic regression on top of the base models' predicted classes.
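Here is a hedged sketch of both ideas, reusing the predictions dataframe of 0/1 votes built in the majority-vote step (the weights are simply the test accuracies reported above and are purely illustrative):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Weighted majority vote: each model's vote counts in proportion to its accuracy
weights = np.array([0.93, 0.92, 0.91])            # RF, SVM, NN (illustrative)
weighted_votes = predictions.values.dot(weights)  # weighted sum of the 0/1 votes
weighted_predictions = (weighted_votes > weights.sum() / 2).astype(int)

# Stacking, in spirit: train a simple meta-model on the base models' predictions.
# In practice the meta-features should come from cross-validated predictions on
# the training set, not the test set, to avoid leakage; this is only a sketch.
stacker = LogisticRegression()
stacker.fit(predictions.values, yTest.astype(int))  # yTest was converted to 0/1 above
stacked_predictions = stacker.predict(predictions.values)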

It is best to ensemble together models which are less correlated with one another (see an excellent explanation here).

What happens when your dataset isn’t as nice as this? What if there are many more instances of one class versus the other, or if you have a lot of missing values, or a mixture of categorical and numerical variables? Stay tuned for the next blog post where we write up guidance on tackling these types of sticky situations.

Notes

  • We could also have tried something like gradient boosting, another powerful ensembling technique.


Another nice tutorial on doing ensembling in Python is here.

