Date created: Nov 16, 2016
Last modified: Nov 23, 2016
Tags: SVM, undersampling, data-cleaning, imbalanced data set, semiconductor data
About: Improve the classification of an imbalanced semiconductor manufacturing data set using a combination of undersampling (of the majority class) and data cleaning.
The SECOM dataset in the UCI Machine Learning Repository is semiconductor manufacturing data. There are 1567 records, 590 anonymized features and 104 fails, making it an imbalanced data set with a roughly 14:1 ratio of passes to fails. The process yield has a simple pass/fail response (encoded -1/1).
Each undersampling or data-cleaning step is followed by classification using an SVM. The imblearn toolbox has a pipeline method which will be used to chain all the steps. The cross-validation is evaluated with the Matthews correlation coefficient as well as other measures based on the confusion matrix.
In [2]:
from __future__ import division   # future imports belong at the top of the cell
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split as tts
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from imblearn.under_sampling import RandomUnderSampler
from imblearn.under_sampling import NeighbourhoodCleaningRule
from imblearn.under_sampling import TomekLinks
from imblearn.pipeline import Pipeline as ImbPipe
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import confusion_matrix, matthews_corrcoef,\
                            accuracy_score, classification_report
from collections import Counter
from time import time
import warnings
warnings.filterwarnings("ignore")
In [3]:
# load the data
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/secom/secom.data"
secom = pd.read_table(url, header=None, delim_whitespace=True)
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/secom/secom_labels.data"
y = pd.read_table(url, header=None, usecols=[0], squeeze=True, delim_whitespace=True)
print 'The dataset has {} observations/rows and {} variables/columns.' \
.format(secom.shape[0], secom.shape[1])
print 'The ratio of majority class to minority class is {}:1.' \
.format(int(y[y == -1].size/y[y == 1].size))
We process the missing values first, dropping columns which have a large number of missing values and imputing values for those that have only a few. The Random Forest variable importance is then used to rank the features. The one-class SVM exercise has a more detailed version of these steps.
In [4]:
# dropping columns which have a large number (> 200) of missing entries
m = map(lambda x: sum(secom[x].isnull()), xrange(secom.shape[1]))
m_200thresh = filter(lambda i: (m[i] > 200), xrange(secom.shape[1]))
secom_drop_200thresh = secom.drop(m_200thresh, axis=1)
# also dropping constant columns (zero variance carries no information)
dropthese = [x for x in secom_drop_200thresh.columns.values if \
             secom_drop_200thresh[x].std() == 0]
secom_drop_200thresh = secom_drop_200thresh.drop(dropthese, axis=1)
print 'The SECOM data set now has {} variables.'\
      .format(secom_drop_200thresh.shape[1])
In [5]:
# imputing missing values for the random forest
imp = Imputer(missing_values='NaN', strategy='median', axis=0)
secom_imp = pd.DataFrame(imp.fit_transform(secom_drop_200thresh))
# use Random Forest to assess variable importance
rf = RandomForestClassifier(n_estimators=100, random_state=7)
rf.fit(secom_imp, y)
# sorting features according to their rank
importance = rf.feature_importances_
ranked_indices = np.argsort(importance)[::-1]
In [6]:
# split data into train and holdout sets
# stratify the sample used for modeling to preserve the class proportions
X_train, X_test, y_train, y_test = tts(secom_imp[ranked_indices], y, \
test_size=0.2, stratify=y, random_state=5)
print 'Train data for each class: {} '\
.format(Counter(y_train))
print 'The maj/min class ratio is: {0:2.0f}' \
.format(round(y_train[y_train == -1].size/y_train[y_train == 1].size))
print 'Test data for each class: {} '\
.format(Counter(y_test))
print 'The maj/min class ratio for the holdout set is: {0:2.0f}' \
.format(round(y_test[y_test == -1].size/y_test[y_test == 1].size))
The SVM is sensitive to feature scale, so the first step is to center and scale the data. The train and holdout sets are scaled separately, using the mean and variance computed from the training data, in order to estimate the ability of the model to generalize.
In [7]:
# scaling the split data. The holdout data uses scaling parameters
# computed from the training data
standard_scaler = StandardScaler()
X_train_scaled = pd.DataFrame(standard_scaler.fit_transform(X_train), \
index=X_train.index)
X_test_scaled = pd.DataFrame(standard_scaler.transform(X_test), \
                             index=X_test.index)
# Note: we convert to a DataFrame because the plot functions
# we will use need DataFrame inputs.
We will use the imblearn RandomUnderSampler to perform random undersampling, a simple method where data points are randomly selected and removed from the majority class. While it is an easy way to address class imbalance, the classifier is forced to learn from a smaller data set and may miss characteristics of the discarded data.
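The idea is simple enough to sketch directly. The helper below is hypothetical (imblearn's RandomUnderSampler, used in the next cell, is the real implementation); it assumes numpy arrays and the same ratio = minority/majority convention as the imblearn version used here:
In [ ]:
# a minimal sketch of random undersampling (hypothetical helper;
# imblearn's RandomUnderSampler below is the real implementation)
import numpy as np

def random_undersample(X, y, ratio, majority_label=-1, seed=7):
    """Randomly keep just enough majority rows so that
    n_minority / n_majority == ratio. X and y are numpy arrays."""
    rng = np.random.RandomState(seed)
    maj_idx = np.where(y == majority_label)[0]
    min_idx = np.where(y != majority_label)[0]
    n_keep = int(len(min_idx) / ratio)      # majority rows to keep
    keep = rng.choice(maj_idx, size=n_keep, replace=False)
    sel = np.concatenate([keep, min_idx])
    return X[sel], y[sel]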
In this section we will look at the class distribution before and after undersampling, and at how the sampling ratio changes the majority-class distribution.
In [8]:
# undersampling numbers before/after
print 'Original dataset distribution: {}'.format(Counter(y_train))
ratio = 0.8   # target minority/majority ratio after resampling
rus = RandomUnderSampler(ratio=ratio, random_state=7)
X_res, y_res = rus.fit_sample(X_train_scaled, y_train)
print 'Resampled dataset distribution: {}'.format(Counter(y_res))
In [9]:
# plotting majority class distribution after undersampling
# displaying column 4
from ipywidgets import interact
@interact(ratio=(0.1, 1.0, 0.1))
def plot_dist(ratio):
    sns.set(style="white", font_scale=1.3)
    fig, ax = plt.subplots(figsize=(7,5))
    rus = RandomUnderSampler(ratio=ratio, random_state=7)
    X_res, y_res = rus.fit_sample(X_train_scaled, y_train)
    X_res_df = pd.DataFrame(X_res)
    ax = sns.distplot(X_train_scaled[4][y_train == -1], color='darkorange', \
                      kde=False, label='before')
    ax = sns.distplot(X_res_df[4][y_res == -1], color='b', \
                      kde=False, label='after')
    ax.set_ylim([0, 180])
    ax.set(xlabel='')
    ax.legend(title='Ratio = {}'.format(ratio))
    plt.title('Majority class distribution before and after undersampling')
    plt.show()
The usual way to select parameters is via grid search and cross-validation (CV). We will do a grid search over the undersampling ratio and the SVM RBF-kernel tuning parameters. The five-fold CV is stratified, so the class proportions are preserved in each fold.
The imblearn package includes a pipeline module which allows one to chain transformers, resamplers and estimators. We use the pipeline to chain the undersampling method with the SVM classifier.
We will set up a CV function to test the different undersampling methods. We will also vary the number of features.
The default CV scoring is based on the accuracy. When the classes are imbalanced, the true negative term (the accuracy of the majority class) dominates. Often, there is a high cost associated with the misclassification of the minority class, and in those cases alternative scoring measures such as the F1 and $F_{\beta}$ scores or the Matthews Correlation Coefficient (MCC) (which uses all four values of the confusion matrix) are used. We will score the cross-validation using the MCC. We will also look at the True Positive Rate (accuracy of the minority class alone), the Accuracy and the confusion matrix when interpreting results.
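For reference, the MCC is computed from all four cells of the confusion matrix:

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)\,(TP+FN)\,(TN+FP)\,(TN+FN)}}$$

On this 14:1 data, a trivial classifier that predicts pass for every record scores about 93% accuracy (1463/1567) yet has an MCC of 0 (scikit-learn returns 0 when the denominator vanishes), which is why we prefer the MCC for model selection.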
In [10]:
# defining the MCC and TPR metrics to score the cross-validation
def tpr_score(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)   # confusion matrix for the fold
    tprate = float(cm[1][1])/np.sum(cm[1])
    return tprate
def mcc_score(y_true, y_pred):
    mcc = matthews_corrcoef(y_true, y_pred)
    return mcc
mcc_scorer = make_scorer(mcc_score, greater_is_better=True)
tpr_scorer = make_scorer(tpr_score, greater_is_better=True)
In [11]:
# print classification results
def test_results(y_test, y_predicted):
    print '\nThe accuracy is: {0:4.2} ' \
          .format(accuracy_score(y_test, y_predicted))
    print '\nThe confusion matrix: '
    cm = confusion_matrix(y_test, y_predicted)
    print cm
    print '\nThe True Positive rate is: {0:4.2}' \
          .format(float(cm[1][1])/np.sum(cm[1]))
    print '\nThe Matthews correlation coefficient: {0:4.2f} \n' \
          .format(matthews_corrcoef(y_test, y_predicted))
    print(classification_report(y_test, y_predicted))
In [17]:
# grid search cross-validation function
def sampling_gridcv(samp_method, nfeatures):
    X_train_ = X_train_scaled.iloc[:,:nfeatures]
    X_test_ = X_test_scaled.iloc[:,:nfeatures]
    add_parameters = dict()
    if samp_method == 'rus':
        sampling = RandomUnderSampler(random_state=7)
        # search over the ratios 0.1, 0.2, ..., 1.0
        add_parameters = dict(samp__ratio=np.arange(1, 11)*0.1)
    elif samp_method == 'tl':
        sampling = TomekLinks(random_state=7)
    elif samp_method == 'ncr':
        sampling = NeighbourhoodCleaningRule(random_state=7)
    estimators = [('samp', sampling),
                  ('clf', SVC(probability=True, random_state=7))]
    parameters = dict(clf__C=[1, 10, 50, 100, 200],
                      clf__gamma=[.04, .05, .06, .07])
    parameters.update(add_parameters)
    pipe = ImbPipe(estimators)
    print pipe
    # stratified K-fold cross-validation
    cv = GridSearchCV(pipe, param_grid=parameters, cv=5, scoring=mcc_scorer)
    start = time()
    cv.fit(X_train_, y_train)
    print '\nGridSearchCV took {0:.2f} seconds for {1} candidate parameter settings.'\
          .format(time() - start, len(cv.cv_results_['params']))
    y_predicted = cv.predict(X_test_)
    print '\nThe best CV parameters are: {}'.format(cv.best_params_)
    # print test results using best parameters
    test_results(y_test, y_predicted)
In [18]:
# random undersampling with 40 features
sampling_gridcv('rus', 40)
In [19]:
sampling_gridcv('rus', 100)
In [20]:
sampling_gridcv('rus', 140)
Based on initial experiments and the CV results above, higher undersampling ratios such as 0.8, 0.9 and 1.0 give higher MCC and True Positive Rate values than lower ratios.
We will now run a simpler grid search over C and gamma using the following parameters:
(When the ratio is included as a grid-search parameter, the MCC and TPR scores are consistently lower. This requires some investigation, since the size of the CV folds should not be affected by the complexity of the grid.)
In [22]:
# Second CV (using best CV parameters from previous run)
ratio = 1.0
X_train_ = X_train_scaled.iloc[:,:40]
X_test_ = X_test_scaled.iloc[:,:40]
rus = RandomUnderSampler(ratio=ratio, random_state=7)
X_res, y_res = rus.fit_sample(X_train_, y_train)
clf = SVC(random_state=7)
param_grid = {"C": [1, 5, 7, 10],
"gamma": [0.04, 0.05, 0.06]}
# run grid search
grid_search = GridSearchCV(clf, param_grid=param_grid, cv=5, scoring=mcc_scorer)
start = time()
grid_search.fit(X_res, y_res)
print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
% (time() - start, len(grid_search.grid_scores_)))
print '\nThe best CV parameters are: {}' .format(grid_search.best_params_)
# using model with best parameters on test set
y_predicted = grid_search.predict(X_test_)
test_results(y_test, y_predicted)
If an overall accuracy of 0.7 is acceptable, random undersampling gives a TPR around 0.71 and an MCC around 0.22. These are the highest values obtained so far from experiments with sampling and cost-sensitive learning.
The next two methods, Tomek links and the Neighbourhood Cleaning Rule, emphasize data cleaning over data reduction.
If a pair of nearest neighbors are of opposite classes, they form a Tomek link [1][3]: the two points are each other's nearest neighbor yet carry different labels. Tomek links (TL) appear both at class boundaries and where there is noise within a class, so removing them cleans up the data set and establishes better-defined class clusters and boundaries. In this section we perform a grid-search CV using the imblearn TomekLinks method along with the SVM classifier.
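To make the definition concrete, here is a minimal sketch (a hypothetical helper, not the imblearn implementation) that finds Tomek links as mutual nearest neighbors with opposite labels:
In [ ]:
# a minimal sketch of Tomek link detection (hypothetical helper;
# imblearn's TomekLinks below is the real implementation)
import numpy as np
from sklearn.neighbors import NearestNeighbors

def tomek_link_pairs(X, y):
    """Return index pairs (i, j) of mutual nearest neighbors
    with opposite labels, i.e. Tomek links. X, y are numpy arrays."""
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    # column 0 is each point itself, column 1 its nearest neighbor
    nearest = nn.kneighbors(X, return_distance=False)[:, 1]
    return [(i, j) for i, j in enumerate(nearest)
            if nearest[j] == i and y[i] != y[j] and i < j]
With imblearn's default settings, only the majority-class member of each link is removed, so the minority class is left untouched.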
In [24]:
# Number of elements before/after TL
print 'Original dataset distribution: {}'.format(Counter(y_train))
tl = TomekLinks(random_state=7)
X_res, y_res = tl.fit_sample(X_train_scaled, y_train)
print 'Resampled dataset distribution: {}'.format(Counter(y_res))
In [25]:
sampling_gridcv('tl', 40)
The Neighborhood Cleaning Rule (NCL) is based on a 2001 paper by Laurikkala [4]. It combines ENN (the edited nearest-neighbor rule, which removes a data point whose class differs from that of at least two of its three nearest neighbors) with Tomek links. NCL emphasizes data cleaning over data reduction, taking the view that the quality of classification does not necessarily depend on the size of the class.
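To make the ENN step concrete, here is a minimal sketch (a hypothetical helper, not the imblearn implementation) of the editing rule just described:
In [ ]:
# a minimal sketch of the ENN editing rule inside NCL (hypothetical
# helper; imblearn's NeighbourhoodCleaningRule below is the real thing)
import numpy as np
from sklearn.neighbors import NearestNeighbors

def enn_keep_mask(X, y, k=3):
    """ENN keeps a point only if a majority (here, at least 2 of 3)
    of its k nearest neighbors share its label. X, y are numpy arrays."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    idx = nn.kneighbors(X, return_distance=False)[:, 1:]  # drop self
    agree = (y[idx] == y[:, None]).sum(axis=1)
    return agree >= (k // 2 + 1)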
We will perform a grid-search CV using the imblearn NeighbourhoodCleaningRule method along with the SVM classifier.
In [26]:
# Number of elements before/after NCL
print 'Original dataset distribution: {}'.format(Counter(y_train))
ncr = NeighbourhoodCleaningRule(random_state=7)
X_res, y_res = ncr.fit_sample(X_train_scaled, y_train)
print 'Resampled dataset distribution: {}'.format(Counter(y_res))
In [28]:
sampling_gridcv('ncr', 40)
In our experiments, simple random undersampling gave much better results (with respect to the TPR of the minority class) than the TL and NCL methods, which emphasize data cleaning. Undersampling also gave better results than oversampling with SMOTE. The NCL method, which incorporates TL, gave marginally better results than TL alone.
We expect a combination of oversampling, undersampling, data cleaning and boosting to give the best classification results for the minority class. (writeup to be completed)
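As a possible starting point for that follow-up, imblearn already provides a combined method, SMOTETomek, which chains SMOTE oversampling with Tomek-link cleaning. A sketch of how it would slot into the same pipeline pattern used above (combo_pipe is just an illustrative name):
In [ ]:
# a sketch of one combined method: SMOTE oversampling followed by
# Tomek-link cleaning, reusing the pipeline pattern from sampling_gridcv
from imblearn.combine import SMOTETomek

combo_pipe = ImbPipe([('samp', SMOTETomek(random_state=7)),
                      ('clf', SVC(probability=True, random_state=7))])
# combo_pipe can be passed to GridSearchCV exactly as in sampling_gridcv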
[1] H. He and E. A. Garcia, "Learning from Imbalanced Data," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263-1284, 2009.
[2] Lecture notes based on "Learning from Imbalanced Data" by Haibo He.
[3] I. Tomek, "Two Modifications of CNN," IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-6, pp. 769-772, 1976.
[4] J. Laurikkala, "Improving Identification of Difficult Small Classes by Balancing Class Distribution," Conference on Artificial Intelligence in Medicine in Europe, Springer Berlin Heidelberg, 2001.