Date created: Oct 14, 2016
Last modified: Nov 16, 2016
Tags: SVM, SMOTE, ROC/AUC, oversampling, imbalanced data set, semiconductor data
About: Rebalance an imbalanced semiconductor manufacturing dataset by oversampling the minority class using SMOTE. Classify using an SVM. Assess the value of oversampling using ROC/AUC.
The SECOM dataset in the UCI Machine Learning Repository is semiconductor manufacturing data. There are 1567 records, 590 anonymized features and 104 fails. This makes it an imbalanced dataset with a 14:1 ratio of passes to fails. The process yield has a simple pass/fail response (encoded as -1/1).
The rebalanced data is classified using an SVM. The imblearn toolbox provides a pipeline class which will be used to chain all the steps. The SMOTE+SVM method is evaluated by the area under the Receiver Operating Characteristic curve (AUC).
In [1]:
from __future__ import division
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import train_test_split as tts
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.metrics import roc_curve, auc
import warnings
warnings.filterwarnings("ignore")
In [2]:
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/secom/secom.data"
secom = pd.read_table(url, header=None, delim_whitespace=True)
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/secom/secom_labels.data"
y = pd.read_table(url, header=None, usecols=[0], squeeze=True, delim_whitespace=True)
print 'The dataset has {} observations/rows and {} variables/columns.' \
.format(secom.shape[0], secom.shape[1])
print 'The ratio of majority class to minority class is {}:1.' \
.format(int(y[y == -1].size/y[y == 1].size))
We process the missing values first, dropping columns which have a large number of missing values and imputing values for those that have only a few. Random Forest variable importance is then used to rank the variables. The one-class SVM exercise has a more detailed version of these steps.
In [3]:
# count missing entries per column and drop columns with more than 200
m = map(lambda x: sum(secom[x].isnull()), xrange(secom.shape[1]))
m_200thresh = filter(lambda i: (m[i] > 200), xrange(secom.shape[1]))
secom_drop_200thresh = secom.drop(m_200thresh, axis=1)
# also drop constant columns; zero variance carries no information
dropthese = [x for x in secom_drop_200thresh.columns.values if
             secom_drop_200thresh[x].std() == 0]
secom_drop_200thresh.drop(dropthese, axis=1, inplace=True)
print 'The SECOM data set now has {} variables.' \
    .format(secom_drop_200thresh.shape[1])
In [4]:
# imputing missing values for the random forest
imp = Imputer(missing_values='NaN', strategy='median', axis=0)
secom_imp = pd.DataFrame(imp.fit_transform(secom_drop_200thresh))
# use Random Forest to assess variable importance
rf = RandomForestClassifier(n_estimators=100, random_state=7)
rf.fit(secom_imp, y)
# sorting features according to their rank
importance = rf.feature_importances_
ranked_indices = np.argsort(importance)[::-1]
The SVM is sensitive to feature scale, so the first step is to center and scale the data. Both the train and holdout sets are scaled with the mean and variance computed from the training data alone; this is done to estimate the ability of the model to generalize.
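That is, each feature is transformed as $z = (x - \mu_{\mathrm{train}})/\sigma_{\mathrm{train}}$, with $\mu_{\mathrm{train}}$ and $\sigma_{\mathrm{train}}$ estimated from the training split only.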
In [5]:
# split data into train and holdout sets
# stratify the sample used for modeling to preserve the class proportions
X_train, X_holdout, y_train, y_holdout = tts(secom_imp[ranked_indices[:40]], y, \
test_size=0.2, stratify=y, random_state=5)
print 'Train data: The majority/minority class have {} and {} elements respectively.'\
.format(y_train[y_train == -1].size, y_train[y_train == 1].size)
print 'The maj/min class ratio is: {0:2.0f}' \
.format(round(y_train[y_train == -1].size/y_train[y_train == 1].size))
print 'Holdout data: The majority/minority class have {} and {} elements respectively.'\
.format(y_holdout[y_holdout == -1].size, y_holdout[y_holdout == 1].size)
print 'The maj/min class ratio for the holdout set is: {0:2.0f}' \
.format(round(y_holdout[y_holdout == -1].size/y_holdout[y_holdout == 1].size))
In [6]:
# scaling the split data. The holdout data uses scaling parameters
# computed from the training data
standard_scaler = StandardScaler()
X_train_scaled = pd.DataFrame(standard_scaler.fit_transform(X_train), \
index=X_train.index)
X_holdout_scaled = pd.DataFrame(standard_scaler.transform(X_holdout))
# Note: we convert to a DataFrame because the plot functions
# we will use need DataFrame inputs.
The usual way to select parameters is via grid search and cross-validation (CV), with scoring based on accuracy by default. When the classes are imbalanced, correct predictions on the majority class dominate the accuracy. Often there is a high cost associated with misclassifying the minority class, and in those cases alternative scoring measures such as the F1 and $F_{\beta}$ scores or the Matthews Correlation Coefficient (which uses all four values of the confusion matrix) are used.
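For reference, the Matthews Correlation Coefficient is
$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
and is taken to be 0 (by convention) when a classifier assigns every observation to a single class, which makes it a useful sanity check on imbalanced data.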
In CV experiments on this data, the majority class still dominates: for the best CV F1-scores, the True Negative Rate (TNR, the rate at which the minority class is correctly classified) is zero.
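For concreteness, the automated route would look something like the following sketch (the parameter grid here is an illustrative assumption, not the grid from my experiments):
# hedged sketch: grid search with F1 scoring
# (sklearn.grid_search became sklearn.model_selection in later versions)
from sklearn.grid_search import GridSearchCV
param_grid = {'C': [1, 2, 3, 5, 10],
              'gamma': [.0005, .001, .005, .01]}
grid = GridSearchCV(SVC(), param_grid, scoring='f1', cv=5)
grid.fit(X_train_scaled, y_train)
print grid.best_params_, grid.best_score_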
Instead of automating the selection of hyperparameters, I have manually selected C and $\gamma$ values for which the precision/recall/F1 values as well as the TNR are high.
An example is shown below.
In [7]:
# oversampling with SMOTE
# ratio = desired minority:majority size ratio after resampling
ratio = 0.5
smote = SMOTE(ratio=ratio, kind='regular')
smox, smoy = smote.fit_sample(X_train_scaled, y_train)
print 'Before resampling: \n\
The majority/minority class have {} and {} elements respectively.'\
.format(y_train[y_train == -1].size, y_train[y_train == 1].size)
print 'After oversampling at a ratio of {}: \n\
The majority/minority class have {} and {} elements respectively.'\
.format(ratio, smoy[smoy == -1].size, smoy[smoy == 1].size)
In [19]:
# plotting the minority class distribution before and after SMOTE
# (column 4 displayed)
from ipywidgets import interact

@interact(ratio=(0.1, 1.0, 0.1))  # slider over oversampling ratios
def plot_dist(ratio):
    sns.set(style="white", font_scale=1.3)
    fig, ax = plt.subplots(figsize=(7, 5))
    smote = SMOTE(ratio=ratio, kind='regular')
    smox, smoy = smote.fit_sample(X_train_scaled, y_train)
    smox_df = pd.DataFrame(smox)
    # minority class distribution after oversampling
    ax = sns.distplot(smox_df[4][smoy == 1], color='b',
                      kde=False, label='after')
    # original minority class distribution
    ax = sns.distplot(X_train_scaled[4][y_train == 1], color='r',
                      kde=False, label='before')
    ax.set_ylim([0, 130])
    ax.set(xlabel='')
    ax.legend(title='Ratio = {}'.format(ratio))
    plt.title('Minority class distribution before and after oversampling')
    plt.show()
In [9]:
# classification results
from sklearn.metrics import confusion_matrix, matthews_corrcoef,\
classification_report, roc_auc_score, accuracy_score
# manually selected parameters
clf = SVC(C = 2, gamma = .0008)
clf.fit(smox, smoy)
y_predicted = clf.predict(X_holdout_scaled)
print 'The accuracy is: {0:4.2} \n' \
.format(accuracy_score(y_holdout, y_predicted))
print 'The confusion matrix: '
cm = confusion_matrix(y_holdout, y_predicted)  # rows/cols ordered [-1, 1]
print cm
# rate at which the minority class (fails, label 1) is correctly classified
print '\nThe True Negative rate is: {0:4.2}' \
.format(float(cm[1][1])/np.sum(cm[1]))
print '\nThe Matthews correlation coefficient: {0:4.2f} \n' \
.format(matthews_corrcoef(y_holdout, y_predicted))
print(classification_report(y_holdout, y_predicted))
print 'The AUC is: {0:4.2}'\
.format(roc_auc_score(y_holdout, y_predicted))
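Note that roc_auc_score computed on the hard -1/1 predictions reduces the ROC curve to a single operating point. Passing continuous scores instead, for example
# AUC from the SVM's continuous decision values rather than hard labels
print 'The AUC from decision values is: {0:4.2}' \
.format(roc_auc_score(y_holdout, clf.decision_function(X_holdout_scaled)))
gives the usual ranking-based AUC.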
For these manually selected parameters, the TNR is 0.38, the Matthews correlation coefficient is 0.21 and the precision/recall/F1 values are in the 0.86 - 0.90 range. Selecting the best CV score (usually in the 0.90 range), on the other hand, would have given a TNR of 0 for all the scoring metrics I looked at.
The imblearn package includes a pipeline module which allows one to chain transformers, resamplers and estimators. We use such a pipeline to oversample with SMOTE and classify with the SVM, computing an ROC curve for each oversampling ratio and its corresponding hyperparameters C and gamma.
In [31]:
# oversampling, classification and computing ROC values
fpr = dict()
tpr = dict()
roc_auc = dict()
ratios = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
Cs = [3, 3, 3, 2, 2, 2, 2]
gammas = [.02, .009, .009, .005, .0008, .0009, .0007]
estimators = [('smt', SMOTE(random_state=42)),
              ('clf', SVC(probability=True, random_state=42))]
pipe = Pipeline(estimators)
print pipe
for i, (ratio, C, gamma) in enumerate(zip(ratios, Cs, gammas)):
    pipe.set_params(smt__ratio=ratio, clf__C=C, clf__gamma=gamma)
    probas_ = pipe.fit(X_train_scaled, y_train).predict_proba(X_holdout_scaled)
    fpr[i], tpr[i], _ = roc_curve(y_holdout, probas_[:, 1])
    roc_auc[i] = auc(fpr[i], tpr[i])
In [32]:
# plotting the ROC curves
def plot_roc(fpr, tpr, roc_auc):
    colors = ['darkorange', 'deeppink', 'red', 'aqua', 'cornflowerblue', 'navy', 'blue']
    plt.figure(figsize=(10, 8.5))
    for i, color in zip(range(7), colors):
        plt.plot(fpr[i], tpr[i], color=color, lw=2, linestyle=':',
                 label='{0} (area = {1:0.2f})'.format((i+1)/10, roc_auc[i]))
    plt.plot([0, 1], [0, 1], 'k--', lw=1)  # chance line
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC curves: SMOTE oversampled minority class', fontsize=14)
    plt.legend(title='Class ratio after oversampling', loc="lower right")
    # save before show(); show() flushes the figure, so saving
    # afterwards would write an empty image
    plt.savefig('ROC_oversampling.png')
    plt.show()
In [33]:
plot_roc(fpr, tpr, roc_auc)
There is a trend in the ROC curves (the ROC convex hull) in the figure above: the higher oversampling ratios (0.7 vs 0.1) have a higher AUC. An obvious question, then, is whether increasing the oversampling ratio to get a balanced data set would give the best results.
In this experiment, no significant improvement was seen in the 0.5 - 0.8 regime (0.8 not plotted). Oversampling with SMOTE broadens the decision region around the minority points (so we would expect better results), but the coverage may exceed the decision surface**. The level of oversampling therefore needs to be experimentally determined.
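For intuition, the core SMOTE step (per Chawla et al. [1]) places each synthetic sample on the segment between a minority point and one of its k nearest minority-class neighbours; a minimal sketch of that interpolation:
# minimal sketch of SMOTE's interpolation step; x_i and x_neighbor are
# a minority sample and one of its k nearest minority-class neighbours
def smote_point(x_i, x_neighbor, rng=np.random):
    u = rng.uniform(0, 1)  # uniform position along the segment
    return x_i + u * (x_neighbor - x_i)
This is what broadens the minority decision region: the synthetic points fill in the space between existing minority neighbours.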
Another strategy to balance the classes is to combine oversampling (of the minority class) with undersampling (of the majority class). Chawla et al. reported [1] that a combination of oversampling and undersampling gave the best results. We will experiment with this combination in a future exercise; a preview sketch is given below.
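A minimal sketch of such a combination, chaining imblearn's RandomUnderSampler after SMOTE in the same pipeline (the ratio and SVM parameters here are illustrative assumptions, not tuned values):
# hedged sketch: oversample the minority class, then undersample the
# majority class, before fitting the SVM
from imblearn.under_sampling import RandomUnderSampler
combo = Pipeline([('smt', SMOTE(ratio=0.5, random_state=42)),
                  ('rus', RandomUnderSampler(random_state=42)),
                  ('clf', SVC(C=2, gamma=.0008))])
combo.fit(X_train_scaled, y_train)  # resampling happens only during fit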
**It should also be noted that oversampling produces a large increase in (and biases the distribution of) the minority class. For instance, at a 0.5 ratio the minority class is increased seven-fold (from 83 to 585). A completely balanced data set would involve a fourteen-fold increase in the minority class, and this would alter the decision surface.
[1] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. "SMOTE: Synthetic Minority Over-sampling Technique." Journal of Artificial Intelligence Research 16 (2002), 321-357.
[2] N. V. Chawla. "Data Mining for Imbalanced Datasets: An Overview." In: O. Maimon and L. Rokach (eds.), Data Mining and Knowledge Discovery Handbook, Springer (2010), 875-886.
[3] M. Altini. "Dealing with Imbalanced Data: Undersampling, Oversampling and Proper Cross-validation." Blog post, 17 Aug. 2015.