Date created: Nov 10, 2016
Last modified: Jan 13, 2017
Tags: Random Forest, imbalanced data set, semiconductor data
About: Classify using the Random Forest Classifier. Weight imbalanced semicondutor manufacturing dataset using class_weights.
The SECOM dataset in the UCI Machine Learning Repository is semicondutor manufacturing data. There are 1567 records, 590 anonymized features and 104 fails. This makes it an imbalanced dataset with a 14:1 ratio of pass to fails. The process yield has a simple pass/fail response (encoded -1/1).
The RF classifier is an ensemble method that has been shown to perform very well. A 2014 paper compared 179 classifiers on 121 data sets in the UCI repository and found that the RF was the most accurate classifier overall [1]. The algorithm has been designed with the assumption that the classes are equally balanced. With imbalanced data, however, the overall error rate and prediction accuracy is skewed to the majority class. To correct for this, two strategies are available for highly imbalanced classes:
We will look at the first of these by reweighting the observations. The two classes will be weighted in inverse proportion to their respective class frequencies so that a heavier penalty is assigned to the misclassification of the minority class. The class weights are incorporated into the RF algorithm in two places. The first is to weight the gini criterion when splitting the data; the second is to weight the majority vote in the terminal nodes which is then aggregated for the final prediction.
class_weight
and
sample_weight
. The RF is an ensemble of decision trees, so these options have been derived from the implementions for the decision tree.
We will look at the following:
For this data set, the measurements come from a large number of processes or sensors and many of the records are missing. In addition some measurements are identical/constant and so not useful for prediction. We will remove the columns that have high missing counts or constant values and estimate values for the rest of the missing data. These are the same steps used for the one-class SVM and a more detailed explanation can be seen there.
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.preprocessing import Imputer
from sklearn.model_selection import train_test_split as tts # sklearn 0.18.1
from sklearn.model_selection import GridSearchCV # sklearn 0.18.1
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import confusion_matrix, \
classification_report, accuracy_score
from sklearn.metrics import make_scorer, matthews_corrcoef
from time import time
from __future__ import division
import warnings
warnings.filterwarnings("ignore")
In [2]:
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/secom/secom.data"
secom = pd.read_table(url, header=None, delim_whitespace=True)
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/secom/secom_labels.data"
y = pd.read_table(url, header=None, usecols=[0], squeeze=True, delim_whitespace=True)
print 'The dataset has {} observations/rows and {} variables/columns.' \
.format(secom.shape[0], secom.shape[1])
print 'The ratio of majority class to minority class is {}:1.' \
.format(int(y[y == -1].size/y[y == 1].size))
We process the missing values first, dropping columns which have a large number of missing values and imputing values for those that have only a few missing values. The one-class SVM exercise has a more detailed version of these steps.
In [3]:
# dropping columns which have large number of missing entries
m = map(lambda x: sum(secom[x].isnull()), xrange(secom.shape[1]))
m_200thresh = filter(lambda i: (m[i] > 200), xrange(secom.shape[1]))
secom_drop_200thresh = secom.dropna(subset=[m_200thresh], axis=1)
dropthese = [x for x in secom_drop_200thresh.columns.values if \
secom_drop_200thresh[x].std() == 0]
secom_drop_200thresh.drop(dropthese, axis=1, inplace=True)
print 'The SECOM data set now has {} variables.'\
.format(secom_drop_200thresh.shape[1])
In [4]:
# imputing missing values for the random forest
imp = Imputer(missing_values='NaN', strategy='median', axis=0)
secom_imp = imp.fit_transform(secom_drop_200thresh)
In [5]:
# split data into train and holdout sets
# stratify the sample used for modeling to preserve the class proportions
X_train, X_test, y_train, y_test = tts(secom_imp, y, \
test_size=0.2, stratify=y, random_state=5)
We will vary the class_weight
parameter and evaluate the performance using the Matthews correlation coefficient (MCC). When there is a large skew in the class balance, the overall classification accuracy is not a good performance measure since the True Negatives (the majority class elements correctly classified) dominate the score in these cases. The MCC is an alternative score which uses all four values of the confusion matrix and is therefore considered more balanced.
Both the classes are equally weighted with a weight of one. The other parameters, with the exception of the number of trees (n_estimators) and the out-of-bag score (oob_score), are set to default values.
In [6]:
rfc_default = RandomForestClassifier(n_estimators=100, random_state=7,
oob_score=True)
rfc_default.fit(X_train, y_train)
y_pred = rfc_default.predict(X_test)
print 'The confusion matrix: '
cm = confusion_matrix(y_test, y_pred)
print cm
print '\nThe True Positive rate is: {0:4.2f}' \
.format(float(cm[1][1])/np.sum(cm[1]))
print '\nThe Matthews correlation coefficient: {0:3.2f}'\
.format(matthews_corrcoef(y_test, y_pred))
print '\nThe accuracy is: {0:4.2f}' \
.format(accuracy_score(y_test, y_pred))
print '\nThe OOB score: {0:4.2f}' \
.format(rfc_default.oob_score_)
We can see from the confusion matrix (CM) the True Positive Rate --the accuracy for the minority class -- is zero (see bottom right of the CM). The MCC score, which is 0, reflects this whereas scores such as accuracy (0.93) or the OOB score (0.93) do not.
We will see how by changing the class weights to the ratio of the class frequencies, the MCC score increases compared to the baseline.
Applying additional parameters such as the n_estimators
and min_samples_split
options along with the class_weight
can improve the MCC score further.
Finally, since it has been suggested that the OOB estimate of the accuracy can be used to select the the best class_weight
parameter [2. See Sec. 2.3], we will look at the OOB estimate for different class_weight
inputs.
In [7]:
# function to evaluate the RFC model by varying
# the class weights and some other parameters
def rfc_classweights(class_weight, n_estimators, min_samples_split, verbose):
rfc_cw = RandomForestClassifier(n_estimators=n_estimators,
class_weight=class_weight,
min_samples_split=min_samples_split,
random_state=7,
oob_score=True)
rfc_cw.fit(X_train, y_train)
y_pred = rfc_cw.predict(X_test)
if verbose: # print
print 'The confusion matrix: '
cm = confusion_matrix(y_test, y_pred)
print cm
print '\nThe True Positive rate is: {0:4.2}' \
.format(float(cm[1][1])/np.sum(cm[1]))
print '\nThe Matthews correlation coefficient: {0:4.3f}'\
.format(matthews_corrcoef(y_test, y_pred))
print '\nThe OOB score: {0:4.3f}' .format(rfc_cw.oob_score_)
else: # don't print, compute probs instead
y_prob = rfc_cw.predict_proba(X_test)
return y_prob
In [8]:
# changing the class_weight ratio to 0.8/0.2
rfc_classweights({-1:.8, 1:.2}, 100, 3, 1)
In [9]:
# adjusting class weigths in the proportion of class frequencies
rfc_classweights({-1:.93, 1:.07}, 100, 3, 1)
In [10]:
# optimizing by selecting n_estimators=500 (number of trees)
# min_samples_split=5 (min samples for splitting node)
rfc_classweights({-1:.93, 1:.07}, 500, 5, 1)
With the class weight ratio in proportion to the class frequencies (0.93 and 0.07), the MCC changes from the baseline 0 to 0.197. The score further improves to 0.236 when the parameters are set to n_estimators
=500 and min_samples_split
=5.
In all these cases (the baseline, the class_weight
at 0.8/0.2 and 0.93/0.07), the OOB score --which calculates generalization accuracy-- remains unchanged around 0.92. Since the OOB estimate only reflects the accuracy, we will not use it to select the optimal class weighting.
The sample_weight
which is an option provided with the fit
method is another way to weight the classes. It has the same effect as the class weight so we should expect similar results. As explained here, the two options allow the weighting routine to be accessed in different ways.
In [11]:
# the sample weight uses a 1D-array of weights as input
sample_weight = np.array([14 if i == -1 else 1 for i in y_train])
rfc_sw = RandomForestClassifier(n_estimators=100, random_state=7,
min_samples_split=3, oob_score=True)
rfc_sw.fit(X_train, y_train, sample_weight)
y_pred = rfc_sw.predict(X_test)
print 'The confusion matrix: '
cm = confusion_matrix(y_test, y_pred)
print cm
print '\nThe Matthews correlation coefficient: {0:4.3f}'\
.format(matthews_corrcoef(y_test, y_pred))
print '\nThe OOB score: {0:4.3f}'.format(rfc_sw.oob_score_)
When the weights assigned are in proportion to the class frequencies (and n_estimators
=100, min_samples_split
=3), both the sample_weight
and class_weight
options give similar results. The MCC score is 0.197 in both cases.
The usual way to select parameters is via grid-search and cross-validation (CV). The default CV scoring is based on the accuracy so we will replace this with an MCC scorer.
In [7]:
# defining the MCC metric to assess cross-validation
def mcc_score(y_true, y_pred):
mcc = matthews_corrcoef(y_true, y_pred)
return mcc
mcc_scorer = make_scorer(mcc_score, greater_is_better=True)
In [13]:
# cross-validation
#n_estimators = 500 gives better results
clf = RandomForestClassifier(n_estimators=500, random_state=7, n_jobs=-1)
param_grid = {"max_depth": [3, 8, None],
"class_weight": [{-1:.90, 1:.10}, {-1:.93, 1:.07}, {-1:.99, 1:.01}],
"min_samples_split": [2, 3, 5],
"max_features": ["sqrt", "log2", "auto", None],
"criterion": ["gini", "entropy"]}
# run grid search
grid_search = GridSearchCV(clf, param_grid=param_grid, scoring=mcc_scorer)
start = time()
grid_search.fit(X_train, y_train)
print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
% (time() - start, len(grid_search.grid_scores_)))
In [14]:
print 'The best parameters:'
print '{}\n'. format(grid_search.best_params_)
print 'Results for model fitted with the best parameter:'
y_true, y_pred = y_test, grid_search.predict(X_test)
print(classification_report(y_true, y_pred))
print 'The confusion matrix: '
cm = confusion_matrix(y_true, y_pred)
print cm
print '\nThe True Positive rate is: {0:4.2}'\
.format(float(cm[1][1])/np.sum(cm[1]))
print '\nThe Matthews correlation coefficient: {0:4.3f}'\
.format(matthews_corrcoef(y_test, y_pred))
The CV was based on the MCC score and the MCC for the best model is 0.236. Some of the parameters (max_features
= sqrt, criterion
= gini, max_depth
= None) for the best model are default options. The class_weight
is 0.99/.01 and min_samples_split
is 5. The number of trees was set at 500 because we saw this gave the best_results in Section III (3) above.
A model we ran (Section III (3)) with the same parameters and a class_weight
of .93/.07 also had an MCC of 0.236 and this appears to be the best we can do with the RF classifier.
In [15]:
#plot for baseline
y_prob = rfc_default.predict_proba(X_test)
fpr, tpr, _ = roc_curve(y_test, y_prob[:,1])
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(8,6.5))
plt.plot(fpr, tpr, color='red', lw=2, linestyle=':',
label='baseline (area = {0:0.2f})'
''.format(roc_auc))
# plot for weighted classes
# class_weights .93/.07, n_estimators=500, min_samples_split=5
y_prob = rfc_classweights({-1:.93, 1:.07}, 500, 5, 0)
fpr, tpr, _ = roc_curve(y_test, y_prob[:,1])
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, color='magenta', lw=2, linestyle='-.',
label='weighted (area = {0:0.2f})'
''.format(roc_auc))
# plot for optimal CV hyperparameters
y_prob = grid_search.predict_proba(X_test)
fpr, tpr, _ = roc_curve(y_test, y_prob[:,1])
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, color='turquoise', lw=2, linestyle='-',
label='best CV (area = {0:0.2f})'
''.format(roc_auc))
plt.plot([0, 1], [0, 1], 'k--', lw=1)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curves for Baseline, Weighted and CV', fontsize=14)
plt.legend(title='AUC', loc="lower right")
plt.show()
The AUC values are the same for the baseline, the weighted and the optimal model via cross-validation. [To complete later]
In the experiments above, after preprocessing for the missing data, all the 409 features available were used. The RF variable importance can rank features in order of importance and in the One-class SVM exercise, we saw that some of the RF features had a variable importance score of zero. In this section, we will compute the MCC for RF models where for each subsequent model, the number of features (included in order of rank) is incremented.
In [8]:
# we use a DataFrame to acess ranked columns
X_train = pd.DataFrame(X_train)
X_test = pd.DataFrame(X_test)
# the RF model is weighted
rfc_p1 = RandomForestClassifier(n_estimators=500,
class_weight={-1:.93, 1:.07},
min_samples_split=5,
random_state=7)
rfc_p1.fit(X_train, y_train)
importance = rfc_p1.feature_importances_
ranked_indices = np.argsort(importance)[::-1]
# print out the features, their rank
print "Feature Rank:"
for i in range(8):
print "{0:3d} column {1:3d} {2:6.4f}"\
.format(i+1, ranked_indices[i], importance[ranked_indices[i]])
print "\n"
for i in xrange(len(importance)-3,len(importance)):
print "{0:3d} column {1:3d} {2:6.4f}"\
.format(i+1, ranked_indices[i], importance[ranked_indices[i]])
In [12]:
# Compute MCC for different RF models.
# The number of ranked features is increased for each subsequent model
mcc_scores = []
nfeatures = np.arange(40, 401, 20) # no. of features used
rfc_p2 = RandomForestClassifier(n_estimators=500,
class_weight={-1:.93, 1:.07},
min_samples_split=5, random_state=7)
for i in nfeatures:
rfc_p2.fit(X_train[ranked_indices[:i]], y_train)
y_pred = rfc_p2.predict(X_test[ranked_indices[:i]])
mcc_scores = np.append(mcc_scores, matthews_corrcoef(y_test, y_pred))
In [18]:
# plot the MCC scores
plt.figure(figsize=(6,4.5))
plt.plot(nfeatures, mcc_scores, 'm')
plt.axvline(180, color='gray', linestyle='dotted')
plt.text(184, 0.14, 'n = 180', rotation='vertical', fontsize=12)
plt.axhline(y=.23, color='gray', linestyle='dotted')
plt.text(30, 0.235, 'MCC = 0.23', fontsize=12)
#plt.xlabel('No of ranked features')
#plt.ylabel('MCC')
plt.ylim([0.0, 0.30])
plt.title('MCC score vs No. of ranked features used', fontsize=14)
plt.show()
From the plot we see that the MCC score of 0.236 stabilizes after n = 180 (n is the number of features used). The variation we see beyond that can be attributed to the inherent variability in the RF runs. One of the characteristics of the random forest algorithm is that it performs well even with a large number of features and we see that using a larger number of variables does not impair the RF performance in our experiments either.
Here we were interested in the performance improvements we would get with a weighted Random Forest applied to a heavily skewed data set. The Matthews correlation coefficient was 0.236 for the best models.
In a previous exercise we had experimented with SVM+oversampling using SMOTE and we had obtained similar performance for the MCC but the AUC score were marginally better. The SVM combined with a sampling strategy holds promise and we will next look at a combination of oversampling (the minority class) and undersampling (the majority class).
[1] Fernández-Delgado, Manuel, et al. "Do we need hundreds of classifiers to solve real world classification problems." J. Mach. Learn. Res 15.1 (2014): 3133-3181.
[2] Chao Chen, Andy Liaw, Leo Breiman. "Using Random Forest to Learn Imbalanced Data." Report ID 666. Retrieved from Univ. of California Berkeley Mathematics Statistics Library. (2004)