Date created: Jan 11, 2017
Last modified: Jan 12, 2017
Tags: GBM, hyperopt, imbalanced data set, semiconductor data
About: Classify imbalanced semicondutor manufacturing data set using the Gradient Boosting Machine. Weight imbalanced classes using sample_weight
. Tune hyperparameters manually and algorithmically using hyperopt
.
The SECOM dataset in the UCI Machine Learning Repository is semicondutor manufacturing data. There are 1567 records, 590 anonymized features and 104 fails. This makes it an imbalanced dataset with a 14:1 ratio of pass to fails. The process yield has a simple pass/fail response (encoded -1/1).
In previous exercises we looked at the one-class SVM and SVM+oversampling using SMOTE. (The full list of methods we have looked can be found here). In this exercise we consider the Gradient Boosting (GBM) classifier which is an ensemble method that has been shown to perform very well in machine learning competitions. To implement a cost-sensitive GBM, the GBM fit
method with the sample_weight
option is used to weigh individual observations.
sample_weight
mask to assign a vector of weights to corresponding observations. To balance the two classes, we will weigh them in inverse proportion to their respective class frequencies.
At each stage, the boosting algorithm will fit the residuals from the previous step. We need to determine if the data needs to be reweighted in order to get optimal results. We will look at:
In addition, since hyperparameter tuning is an important part of GBM performance, we will look at:
hyperopt
We will see if any of these three parameter selection methods suggest the use of default versus sample weights.
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.preprocessing import Imputer
from sklearn.model_selection import train_test_split as tts
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import confusion_matrix, classification_report,\
roc_auc_score, accuracy_score
from sklearn.metrics import make_scorer, matthews_corrcoef
from time import time
from __future__ import division
import warnings
warnings.filterwarnings("ignore")
SEED = 7 # random state
In [2]:
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/secom/secom.data"
secom = pd.read_table(url, header=None, delim_whitespace=True)
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/secom/secom_labels.data"
y = pd.read_table(url, header=None, usecols=[0], squeeze=True, delim_whitespace=True)
print 'The dataset has {} observations/rows and {} variables/columns.' \
.format(secom.shape[0], secom.shape[1])
print 'The ratio of majority class to minority class is {}:1.' \
.format(int(y[y == -1].size/y[y == 1].size))
We process the missing values first, dropping columns which have a large number of missing values and imputing values for those that have only a few missing values. The one-class SVM exercise has a more detailed version of these steps.
In [3]:
# dropping columns which have large number of missing entries
m = map(lambda x: sum(secom[x].isnull()), xrange(secom.shape[1]))
m_200thresh = filter(lambda i: (m[i] > 200), xrange(secom.shape[1]))
secom_drop_200thresh = secom.dropna(subset=[m_200thresh], axis=1)
dropthese = [x for x in secom_drop_200thresh.columns.values if \
secom_drop_200thresh[x].std() == 0]
secom_drop_200thresh.drop(dropthese, axis=1, inplace=True)
print 'The SECOM data set now has {} variables.'\
.format(secom_drop_200thresh.shape[1])
In [4]:
# imputing missing values for the random forest
imp = Imputer(missing_values='NaN', strategy='median', axis=0)
secom_imp = imp.fit_transform(secom_drop_200thresh)
We will first compare baseline results with the performance of a model where the sample_weight
is used. As discussed in previous exercises, the Matthews correlation coefficient (MCC) is used instead of the Accuracy to compute the score.
In [5]:
# split data into train and holdout sets
# stratify the sample used for modeling to preserve the class proportions
X_train, X_test, y_train, y_test = tts(secom_imp, y, \
test_size=0.2, stratify=y, random_state=5)
In [6]:
# function to test GBC parameters
def GBC(params, weight):
clf = GradientBoostingClassifier(**params)
if weight:
sample_weight = np.array([14 if i == -1 else 1 for i in y_train])
clf.fit(X_train, y_train, sample_weight)
else:
clf.fit(X_train, y_train)
print_results(clf, X_train, X_test)
# function to print results
def print_results(clf, X_train, X_test):
# training data results
print 'Training set results:'
y_pred = clf.predict(X_train)
print 'The train set MCC: {0:4.3f}'\
.format(matthews_corrcoef(y_train, y_pred))
# test set results
print '\nTest set results:'
acc = clf.score(X_test, y_test)
print("Accuracy: {:.4f}".format(acc))
y_pred = clf.predict(X_test)
print '\nThe confusion matrix: '
cm = confusion_matrix(y_test, y_pred)
print cm
print '\nThe test set MCC: {0:4.3f}'\
.format(matthews_corrcoef(y_test, y_pred))
In [7]:
params = {'n_estimators': 800, 'max_depth': 3, 'subsample': 0.8, 'max_features' : 'sqrt',
'learning_rate': 0.019, 'min_samples_split': 2, 'random_state': SEED}
GBC(params, 0)
In [8]:
# RUN 1 Using the same parameters as the baseline
params = {'n_estimators': 800, 'max_depth': 3, 'subsample': 0.8, 'max_features' : 'sqrt',
'learning_rate': 0.019, 'min_samples_split': 2, 'random_state': SEED}
GBC(params, 1)
In [9]:
# RUN 2 Manually selecting parameters to optimize the train/test MCC with sample weights
params = {'n_estimators': 800, 'max_depth': 3, 'subsample': 0.7, 'max_features' : 'log2',
'learning_rate': 0.018, 'min_samples_split': 2, 'random_state': SEED}
GBC(params, 1)
In the baseline case (where we do not adjust the weights), we get a high MCC score for the training set (0.97). The test MCC is 0.197 so there is a large gap between the train and test MCC. When sample weights are used, we get a test set MCC of 0.242 after tuning the parameters.
The tuning parameters play a big role in the performance of the GBM. In Section III (D) we will look at the MCC trend over the number of estimators. This will allow us to adjust the complexity of the model. In Sections IV and V we will use the hyperopt
and gridsearchCV
modules to select the hyperparameters.
Here we compute the feature importance (using the baseline parameters) for the GBM. We also compare the results with the feature importance for the Random Forest.
In [10]:
params = {'n_estimators': 800, 'max_depth': 3, 'subsample': 0.8, 'max_features' : 'sqrt',
'learning_rate': 0.019, 'min_samples_split': 2, 'random_state': SEED}
# GBM
clf = GradientBoostingClassifier(**params)
clf.fit(X_train, y_train)
gbm_importance = clf.feature_importances_
gbm_ranked_indices = np.argsort(clf.feature_importances_)[::-1]
# Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=7)
rf.fit(X_train, y_train)
rf_importance = rf.feature_importances_
rf_ranked_indices = np.argsort(rf.feature_importances_)[::-1]
# printing results in a table
importance_results = pd.DataFrame(index=range(1,16),
columns=pd.MultiIndex.from_product([['GBM','RF'],['Feature #','Importance']]))
importance_results.index.name = 'Rank'
importance_results.loc[:,'GBM'] = list(zip(gbm_ranked_indices[:15],
gbm_importance[gbm_ranked_indices[:15]]))
importance_results.loc[:,'RF'] = list(zip(rf_ranked_indices[:15],
rf_importance[rf_ranked_indices[:15]]))
print importance_results
Roughly half the top fifteen most important features for the GBM were also the top fifteen computed for the Random Forest classifier. There are complex interactions between the parameters so we do not expect the two classifiers to give the same results. In Section IV where we optimize the hyperparameters, the nvar
(number of variables) parameter will select variables according to their ranked importance.
The GBM sequentially fits the function and the number of steps (the number of estimators) is specified by the ntree
parameter. At some stage, the model complexity increases to the point where we start overfitting the data. Here we will plot the MCC as a function of the number of estimators for the train set and test set to determine a good value for ntree
.
When Accuracy is used as score, a plot of Classification Error (= 1 - Accuracy) versus the Number of Trees can be used as a diagnostic to determine bias and overfitting. The term (1 - MCC), however, is hard to interpret since MCC is calculated with all four terms of the confusion matrix. (I did make those plots out of curiosity and the trends are similar to those seen in an "Error vs no. of estimators" plot an example of which is presented on Slide 3 of this Hastie/Tibshirani lecture).
In [11]:
# function to compute MCC vs number of trees
def GBC_trend(weight):
base_params = {'max_depth': 3, 'subsample': 0.8, 'max_features' : 'sqrt',
'learning_rate': 0.019, 'min_samples_split': 2, 'random_state': SEED}
mcc_train = []
mcc_test = []
for i in range(500, 1600, 100):
params = dict(base_params)
ntrees = {'n_estimators': i}
params.update(ntrees)
clf = GradientBoostingClassifier(**params)
if weight:
sample_weight = np.array([14 if i == -1 else 1 for i in y_train])
clf.fit(X_train, y_train, sample_weight)
else:
clf.fit(X_train, y_train)
y_pred_train = clf.predict(X_train)
mcc_train.append(matthews_corrcoef(y_train, y_pred_train))
y_pred_test = clf.predict(X_test)
mcc_test.append(matthews_corrcoef(y_test, y_pred_test))
return mcc_train, mcc_test
In [12]:
# plot
fig, ax = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(7,4))
plt.style.use('ggplot')
mcc_train, mcc_test = GBC_trend(0)
ax[0].plot(range(500, 1600, 100), mcc_train, color='magenta', label='train MCC' )
ax[0].plot(range(500, 1600, 100), mcc_test, color='olive', label='test MCC' )
ax[0].set_title('For default weights')
mcc_train, mcc_test = GBC_trend(1)
ax[1].plot(range(500, 1600, 100), mcc_train, color='magenta', label='train MCC' )
ax[1].plot(range(500, 1600, 100), mcc_test, color='olive', label='test MCC' )
ax[1].set_title('For sample weights')
ax[0].set_ylabel('MCC')
fig.text(0.5, 0.04, 'Boosting Iterations', ha='center', va='center')
plt.xlim(500,1500)
plt.ylim(-0.1, 1.1);
plt.legend(bbox_to_anchor=(1.05, 0), loc='lower left');
Default weight option (left): After about 900 iterations, the GBM models all the training data perfectly. At the same time, there is a large gap in the holdout data classification results. This is a classic case of overfitting.
Sample weight option (right): This plot was constructed using the same parameters at the default weight plot. This may account for some of the bias (lower values for MCC) we see for the training data line. From III B, we can see that even when we tune the parameters for the sample weight model, the MCC for the training set is lower that for the default weight model. This gives us some indication that using sample weights adds bias.
The python hyperopt
module by James Bergstra uses Bayesian optimization (based on a tree-structured Parzen density estimate--TPE) [1] to automatically select the best hyperparameters. This blog post provides examples of how this can be used with sklearn classifiers. We will use the MCC instead of the Accuracy for the cross-validation score. Note that since the hyperopt
function fmin
will minimze the MCC, we need to negate the MCC to maximize its value.
In [13]:
# defining the MCC metric to assess cross-validation
def mcc_score(y_true, y_pred):
mcc = matthews_corrcoef(y_true, y_pred)
return mcc
mcc_scorer = make_scorer(mcc_score, greater_is_better=True)
In [14]:
# convert to DataFrame for easy indexing of number of variables (nvar)
X_train = pd.DataFrame(X_train)
X_test = pd.DataFrame(X_test)
In [15]:
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
def hyperopt_train_test(params):
weight = params['weight']
X_ = X_train.loc[:, gbm_ranked_indices[:params['nvar']]]
del params['nvar'], params['weight']
clf = GradientBoostingClassifier(**params)
if weight:
sample_weight = np.array([14 if i == -1 else 1 for i in y_train])
return cross_val_score(clf, X_, y_train, scoring=mcc_scorer,\
fit_params={'sample_weight': sample_weight}).mean()
else:
return cross_val_score(clf, X_, y_train, scoring=mcc_scorer).mean()
space = {
'n_estimators': hp.choice('n_estimators', range(700,1300, 100)),
'max_features': hp.choice('max_features', ['sqrt', 'log2']),
'max_depth': hp.choice('max_depth', range(2,5)),
'subsample': hp.choice('subsample', [0.6, 0.7, 0.8, 0.9]),
'min_samples_split': hp.choice('min_samples_split', [2, 3]),
'learning_rate': hp.choice('learning_rate', [0.01, 0.015, 0.018, 0.02, 0.025]),
'nvar': hp.choice('nvar', [200, 300, 400]),
'weight': hp.choice('weight', [0, 1]), # select sample_weight or default
'random_state': SEED
}
def f(params):
mcc = hyperopt_train_test(params)
# with a negative sign, we maximize the MCC
return {'loss': -mcc, 'status': STATUS_OK}
In [22]:
start = time()
trials = Trials()
best = fmin(f, space, algo=tpe.suggest, max_evals=100, trials=trials)
print("HyperoptCV took %.2f seconds."% (time() - start))
print '\nBest parameters (by index):'
print best
We will apply the optimal hyperparameters selected via hyperopt
above to the GBM classifier. The optimal parameters include nvar
= 200 and the use of sample_weight
.
In [23]:
params = {'n_estimators': 1200, 'max_depth': 3, 'subsample': 0.7, 'max_features' : 'log2',
'learning_rate': 0.018, 'min_samples_split': 3, 'random_state': SEED}
train_ = X_train.loc[:, gbm_ranked_indices[:200]]
test_ = X_test.loc[:, gbm_ranked_indices[:200]]
clf = GradientBoostingClassifier(**params)
sample_weight = np.array([14 if i == -1 else 1 for i in y_train])
clf.fit(train_, y_train, sample_weight)
print_results(clf, train_, test_)
hyperopt
a few times and in each case the optimal parameters include nvar
= 200 and the use of sample_weight
. There is a great deal of variability among the remaining parameters selected across the runs. This is an example of a second run.
In [24]:
trials = Trials()
best = fmin(f, space, algo=tpe.suggest, max_evals=100, trials=trials)
print '\nBest parameters (by index):'
print best
In [25]:
params = {'n_estimators': 700, 'max_depth': 4, 'subsample': 0.8, 'max_features' : 'log2',
'learning_rate': 0.018, 'min_samples_split': 2, 'random_state': SEED}
train_ = X_train.loc[:, gbm_ranked_indices[:200]]
test_ = X_test.loc[:, gbm_ranked_indices[:200]]
clf = GradientBoostingClassifier(**params)
sample_weight = np.array([14 if i == -1 else 1 for i in y_train])
clf.fit(train_, y_train, sample_weight)
print_results(clf, train_, test_)
hyperopt
selected the sample_weight
option. This was the case for most of the runs though there were a few instances in which the default option was selected. This is an example:
In [31]:
start = time()
trials = Trials()
best = fmin(f, space, algo=tpe.suggest, max_evals=100, trials=trials)
print("HyperoptCV took %.2f seconds."% (time() - start))
print '\nBest parameters (by index):'
print best
In [32]:
params = {'n_estimators': 1000, 'max_depth': 2, 'subsample': 0.9, 'max_features' : 'log2',
'learning_rate': 0.025, 'min_samples_split': 3, 'random_state': SEED}
train_ = X_train.loc[:, gbm_ranked_indices[:200]]
test_ = X_test.loc[:, gbm_ranked_indices[:200]]
clf = GradientBoostingClassifier(**params)
clf.fit(train_, y_train)
print_results(clf, train_, test_)
The results from hyperopt
(tested over ten or more runs) were quite variable and no conclusions can be made. By seeding the random_state
parameter, we should be able to get reproducible results but since this was not the case here, we will need to investigate this further.
In [11]:
# cv function
def GBMCV(weight):
clf = GradientBoostingClassifier(random_state=SEED)
param_grid = {"n_estimators": [800, 900, 1000, 1200],
"max_depth": [2, 3],
"subsample": [0.6, 0.7, 0.8],
"min_samples_split": [2, 3],
"max_features": ["sqrt", "log2"],
"learning_rate": [0.01, 0.015, 0.018, 0.02, 0.025]}
# run grid search
if weight:
sample_weight = np.array([14 if i == -1 else 1 for i in y_train])
grid_search = GridSearchCV(clf, param_grid=param_grid, scoring=mcc_scorer, \
fit_params={'sample_weight': sample_weight})
else:
grid_search = GridSearchCV(clf, param_grid=param_grid, scoring=mcc_scorer)
start = time()
grid_search.fit(X_train, y_train)
# print results
print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
% (time() - start, len(grid_search.grid_scores_)))
print 'The best parameters:'
print '{}\n'. format(grid_search.best_params_)
print 'Results for model fitted with the best parameter:'
y_true, y_pred = y_test, grid_search.predict(X_test)
print(classification_report(y_true, y_pred))
print 'The confusion matrix: '
cm = confusion_matrix(y_true, y_pred)
print cm
print '\nThe Matthews correlation coefficient: {0:4.3f}'\
.format(matthews_corrcoef(y_test, y_pred))
In [14]:
# CV with default weights
print "CV with default weights:"
GBMCV(0)
In [10]:
# CV with sample weights
GBMCV(1)
In this exercise the GBM was used to classify imbalanced data. In looking at the GBM classifier we need to consider the following: (1) whether the GBM is superior to other methods we have looked at and (2) whether reweighting with sample weights will improve classification results
On the issue of the GBM classifier performance (classification results/time/ease of use) it should be noted that tuning the GBM is no simple matter as there are many complex interactions between the parameters. We use the three methods -- manual selection, automatic hyperparameter optimization using a Bayesian strategy and grid search cross validation -- available but they give variable results. At the same time the best classification results obtained (an MCC of 0.24 for the holdout data) show that there is no obvious advantage to using the GBM over a simpler sampling strategy (see SVM+undersampling).
On the question of rebalancing with sample_weight
, once again it is not clear that rebalancing gives better results. When the parameters were manually selected, both the default and weighted GBM showed results comparable to other classifiers such as the Random Forest. The hyperopt
selected the sample_weight over the default in at least 9/10 runs (not all runs are shown in Section IV) but the holdout MCC was not good (e.g. 0.004). For the gridsearch crossvalidation, results for default vs sample_weight were comparable.
In previous exercises we had experimented with SVM+oversampling using SMOTE and SVM+undersampling. Our experiments thus far indicate that SVM combined with a sampling strategy holds promise and we will next look at a combination of oversampling (the minority class) and undersampling (the majority class).
[1] Bergstra, James S., et al. “Algorithms for hyper-parameter optimization.” Advances in Neural Information Processing Systems. (2011)