Date created: Oct 10, 2016
Last modified: Oct 18, 2016
Tags: one-class SVM, Random Forest variable importance, imbalanced data set, anomaly detection, feature selection, semiconductor manufacturing data
About: For an imbalanced semiconductor manufacturing dataset, find explanatory variables with predictive power and build a classifier to detect failures
The SECOM dataset in the UCI Machine Learning Repository contains semiconductor manufacturing data. There are 1567 records, 590 anonymized features and 104 fails. This makes it an imbalanced data set with a roughly 14:1 ratio of passes to fails. The process yield has a simple pass/fail response (encoded -1/1).
In [1]:
from __future__ import division

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import Imputer
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import train_test_split as tts
from sklearn.grid_search import ParameterGrid
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm

import warnings
warnings.filterwarnings("ignore")
In [2]:
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/secom/secom.data"
secom = pd.read_table(url, header=None, delim_whitespace=True)
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/secom/secom_labels.data"
y = pd.read_table(url, header=None, usecols=[0], squeeze=True, delim_whitespace=True)
print 'The dataset has {} observations/rows and {} variables/columns.'\
.format(secom.shape[0], secom.shape[1])
print 'The majority class has {} observations, minority class {}.'\
.format(y[y == -1].size, y[y == 1].size)
print 'The dataset is imbalanced. \
The ratio of majority class to minority class is {}:1.'\
.format(int(y[y == -1].size/y[y == 1].size))
We will handle the missing values first, dropping columns that have a large number of them and imputing values for columns that have only a few. We will use pandas for this.
In [3]:
# what if all the columns/rows with missing values were removed
nmv = secom.dropna(axis=1)
print 'No. of columns after removing columns with missing data: {}'\
.format(nmv.shape[1])
nmv = secom.dropna(axis=0)
print 'No. of rows after removing rows with missing data: {}'\
.format(nmv.shape[0])
In [4]:
# num of missing entries per column
m = map(lambda x: sum(secom[x].isnull()), xrange(secom.shape[1]))
# distribution of columns with missing entries
plt.hist(m, color='turquoise')
plt.title("Distribution of missing values")
plt.xlabel("No. of missing values in a column")
plt.ylabel("Columns")
plt.show()
In [5]:
m_700thresh = filter(lambda i: (m[i] > 700), xrange(secom.shape[1]))
print 'The number of columns with more than 700 missing values: {}'\
.format(len(m_700thresh))
m_200thresh = filter(lambda i: (m[i] > 200), xrange(secom.shape[1]))
print 'The number of columns with more than 200 missing values: {}'\
.format(len(m_200thresh))
In [6]:
# remove columns with more than 200 missing entries
secom_drop_200thresh = secom.drop(m_200thresh, axis=1)
print 'No. of columns after dropping columns with more than 200 missing entries: {}'\
.format(secom_drop_200thresh.shape[1])
In [7]:
# remove columns where every entry is identical (the std. dev = 0)
dropthese = [x for x in secom_drop_200thresh.columns.values \
if secom_drop_200thresh[x].std() == 0]
print 'There are {} columns which have identical values recorded. \
We will drop these.' .format(len(dropthese))
secom_drop_200thresh.drop(dropthese, axis=1, inplace=True)
print 'The data set now has {} columns.'\
.format(secom_drop_200thresh.shape[1])
In [8]:
# checking whether there is a mix of categorical (non-numeric) variables
print 'The number of non-numeric columns is: {}'\
.format(sum(secom_drop_200thresh.dtypes == 'object'))
secom_drop_200thresh.head(2)
Out[8]:
In [9]:
# imputing missing values for the random forest
imp = Imputer(missing_values='NaN', strategy='median', axis=0)
secom_imp = pd.DataFrame(imp.fit_transform(secom_drop_200thresh))
In addition to prediction, the Random Forest can also be used to assess variable importance. In the sklearn RandomForestClassifier this is computed as the total decrease in node impurity from splits on the variable, averaged over all trees. There are known issues with this measure (it is biased towards variables with more categories, and among correlated variables one is arbitrarily favored), but we can tolerate these since we will be passing many variables to the OCSVM classifier. A bigger concern is the imbalance in the data set, as this might affect the variable importance ranking.
The SECOM matrix at this point has 409 variables. We will use the Random Forest to rank the variables in terms of their importance.
In [10]:
rf = RandomForestClassifier(n_estimators=100, random_state=7)
rf.fit(secom_imp, y)
Out[10]:
In [11]:
# displaying features and their rank
importance = rf.feature_importances_
ranked_indices = np.argsort(importance)[::-1]

print "Feature Rank:"
for i in range(15):
    print "{0:3d} column {1:3d} {2:6.4f}"\
        .format(i+1, ranked_indices[i], importance[ranked_indices[i]])
print "\n"
for i in xrange(len(importance)-5, len(importance)):
    print "{0:3d} column {1:3d} {2:6.4f}"\
        .format(i+1, ranked_indices[i], importance[ranked_indices[i]])

# counting features with above-average importance
navg = 0
for i in range(len(importance)):
    if importance[ranked_indices[i]] > np.average(importance):
        navg = navg + 1
print 'The number of features better than average is: {}'.format(navg)
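Since the class imbalance might bias this ranking (as noted above), a quick sanity check, not part of the original analysis, is to refit the forest with balanced class weights and measure how much the top of the ranking moves. This is a hedged sketch; `class_weight='balanced'` reweights the classes inversely to their frequencies.
In [ ]:
# sanity check (assumption: a stable ranking should largely survive reweighting):
# refit with balanced class weights and compare the top-40 columns
rf_bal = RandomForestClassifier(n_estimators=100, class_weight='balanced',
                                random_state=7)
rf_bal.fit(secom_imp, y)
top_plain = set(np.argsort(rf.feature_importances_)[::-1][:40])
top_bal = set(np.argsort(rf_bal.feature_importances_)[::-1][:40])
print 'Top-40 overlap between unweighted and balanced forests: {}'\
.format(len(top_plain & top_bal))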
In [12]:
# plot of importance vs the number of features
plt.figure(figsize=(8,6))
plt.plot(range(len(importance)), importance[ranked_indices[:]])
plt.axvline(15, color='magenta', linestyle='dashdot', label='n=15')
plt.axvline(40, color='orange', linestyle='dashed', label='n=40')
plt.axvline(65, color='turquoise', linestyle='dashdot', label='n=65')
plt.axvline(100, color='red', linestyle='dotted', label='n=100')
plt.text(15, 0.002, 'n=15', rotation='vertical')
plt.text(40, 0.008, 'n=40', rotation='vertical')
plt.text(65, 0.011, 'n=65', rotation='vertical')
plt.text(100, 0.014, 'n=100', rotation='vertical')
plt.title('Importance vs feature rank')
plt.xlabel('feature rank')
plt.ylabel('importance')
plt.show()
From this plot, we see points of inflection around the 15, 40, 65 and 100 marks. We will use these to generate 4-5 sets of features to test on the one-class SVM. The 50th-percentile mark is at 148, so these are reduced feature sets, much smaller than the 409 features we had after cleaning the data. In some of the literature [1] associated with this data set, 40 features, selected by correlation, were used in the analysis.
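For comparison, a correlation-based ranking in the spirit of [1] could be sketched as below; the names `corr` and `corr_top40` are illustrative and are not used later.
In [ ]:
# rank features by absolute Pearson correlation with the pass/fail label
corr = secom_imp.apply(lambda col: abs(np.corrcoef(col, y)[0, 1]))
corr_top40 = corr.sort_values(ascending=False)[:40].index
print 'Top 5 columns by |correlation|: {}'.format(list(corr_top40[:5]))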
The OCSVM proposed by Schölkopf et al. [2], [3] can be used to detect negative examples in imbalanced data sets. The OCSVM is trained on examples from the majority class only: these are mapped into a feature space and separated from the origin by a maximum-margin hyperplane (with the rbf kernel this is equivalent to enclosing the data in a minimal hypersphere), and points falling outside the resulting soft boundary are treated as outliers.
The data is first divided into a majority-class train and test set and a minority-class test-only set. The OCSVM is sensitive to feature scale, so the next step is to center and scale the data. The train and test sets are scaled separately, using the mean and variance computed from the training data, so that we can estimate how well the model generalizes.
The main parameters are the choice of kernel and, for the rbf kernel, the hyper-parameters nu ($\nu$) and gamma ($\gamma$). Some preliminary evaluation showed that the linear and polynomial kernels give poor results, which leaves the rbf kernel for further experimentation. $\nu$ is an upper bound on the fraction of training errors (and a lower bound on the fraction of support vectors), while $\gamma$ controls the width of the rbf kernel.
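Before tuning on SECOM, a standalone toy run (with made-up Gaussian data, not part of the pipeline) illustrates the $\nu$ property: the fraction of training points flagged as outliers stays at or below $\nu$, approximately.
In [ ]:
# toy illustration of nu as an (approximate) upper bound on the
# fraction of training points predicted as outliers
rng = np.random.RandomState(0)
X_toy = rng.randn(500, 2)
for nu in [0.05, 0.10, 0.20]:
    toy_clf = svm.OneClassSVM(nu=nu, kernel='rbf', gamma=0.5)
    toy_clf.fit(X_toy)
    frac = (toy_clf.predict(X_toy) == -1).mean()
    print 'nu = {:.2f}, fraction flagged as outliers: {:.3f}'.format(nu, frac)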
In [28]:
# function for preprocessing, classification, parameter grid search
def ocsvm_classify(nfeatures, param_grid, printflag=0, printheader=0):

    # selecting the top-ranked features and generating a data set
    X_ocsvm = secom_imp.loc[y == -1, ranked_indices[:nfeatures]]
    X_outlier = secom_imp.loc[y == 1, ranked_indices[:nfeatures]]
    if printflag:
        print "The majority/minority classes have {}/{} observations. \n"\
            .format(X_ocsvm.shape[0], X_outlier.shape[0])

    # By convention the majority class has a +1 label
    # and the train and test sets belong to the majority class.
    # This is not to be confused with the secom dataset where
    # the majority class has a -1 label.
    y_ocsvm = np.ones(len(X_ocsvm))
    X_train, X_test, y_train, y_test = tts(
        X_ocsvm, y_ocsvm, test_size=0.3, random_state=5)

    # scaling the split data; the test/outlier data uses scaling parameters
    # computed from the training data
    standard_scaler = StandardScaler()
    X_train_scaled = standard_scaler.fit_transform(X_train)
    X_test_scaled = standard_scaler.transform(X_test)
    X_outlier_scaled = standard_scaler.transform(X_outlier)

    # classify for each set of parameters in the grid
    for i, params in enumerate(ParameterGrid(param_grid)):
        nu = params['nu']
        gamma = params['gamma']
        clf = svm.OneClassSVM(nu=nu, kernel='rbf', gamma=gamma)
        clf.fit(X_train_scaled)

        y_pred_train = clf.predict(X_train_scaled)
        y_pred_test = clf.predict(X_test_scaled)
        y_pred_outlier = clf.predict(X_outlier_scaled)

        # counting errors: majority-class points predicted as outliers (-1)
        # and minority-class points predicted as inliers (+1)
        n_error_train = y_pred_train[y_pred_train == -1].size
        n_error_test = y_pred_test[y_pred_test == -1].size
        n_error_outlier = y_pred_outlier[y_pred_outlier == 1].size

        # printing results
        if i == 0:
            if printheader:
                print "nfeatures\tnu\tgamma\t train error\t test error\t outlier error"
            print "{0:3d}\t\t{1:3.2f}\t{2:3.2f}\t {3:3d}({4:4.2f}%)\t {5:3d}({6:4.2f}%)\t{7:3d}({8:4.2f}%)"\
                .format(nfeatures, nu, gamma,
                        n_error_train, round(n_error_train/len(y_train)*100, 2),
                        n_error_test, round(n_error_test/len(y_test)*100, 2),
                        n_error_outlier, round(n_error_outlier/len(X_outlier)*100, 2))
        else:
            print "\t\t{0:3.2f}\t{1:3.2f}\t {2:3d}({3:4.2f}%) \t {4:3d}({5:4.2f}%)\t{6:3d}({7:4.2f}%)"\
                .format(nu, gamma,
                        n_error_train, round(n_error_train/len(y_train)*100, 2),
                        n_error_test, round(n_error_test/len(y_test)*100, 2),
                        n_error_outlier, round(n_error_outlier/len(X_outlier)*100, 2))
In [29]:
# running a second pass on the tuning matrix (after a coarse first pass)
param_grid = {'nu': [0.03,0.04,0.05], 'gamma': [0.07, 0.08, 0.09, 0.10, 0.15, 0.20]}
ocsvm_classify(40, param_grid, printflag=1, printheader=1)
print "\n"
param_grid = {'nu': [0.1,0.2], 'gamma': [0.04, 0.05, 0.06]}
ocsvm_classify(60, param_grid)
param_grid = {'nu': [0.1], 'gamma': [0.04, 0.05]}
ocsvm_classify(65, param_grid)
param_grid = {'nu': [0.04,0.05], 'gamma': [0.03, 0.04]}
ocsvm_classify(100, param_grid)
print "\n"
param_grid = {'nu': [0.02,0.03, 0.04], 'gamma': [0.10, 0.15, 0.18, 0.20]}
ocsvm_classify(15, param_grid)
As gamma increases: the decision boundary becomes more flexible (higher variance), so the outlier error falls while the test error rises.
As nu increases: more training points are allowed to fall outside the boundary, so the train error rises (recall that $\nu$ is an upper bound on the fraction of training errors).
As n, the number of variables, increases: comparable results are obtained at smaller values of gamma (compare the grids above).
This parameter grid search was carried out on a single split of the data (rather than with cross-validation), but there is an indication that the models have high bias: the train and test errors are both high, and the outlier error can only be reduced by increasing the variance via gamma.
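For reference, a cross-validated estimate of the test error for a single (nu, gamma) pair could be obtained with the sketch below (not run here), reusing the scaling discipline of `ocsvm_classify`; `ocsvm_cv_test_error` is a hypothetical helper, not part of the analysis above.
In [ ]:
from sklearn.cross_validation import KFold

def ocsvm_cv_test_error(X, nu, gamma, n_folds=5):
    # k-fold estimate of the majority-class test error; the scaler is
    # fit on each training fold only, as in ocsvm_classify
    errors = []
    for train_idx, test_idx in KFold(len(X), n_folds=n_folds,
                                     shuffle=True, random_state=5):
        scaler = StandardScaler().fit(X[train_idx])
        clf = svm.OneClassSVM(nu=nu, kernel='rbf', gamma=gamma)
        clf.fit(scaler.transform(X[train_idx]))
        pred = clf.predict(scaler.transform(X[test_idx]))
        errors.append((pred == -1).mean())
    return np.mean(errors)

# example: the n=40 model from the table below
X40 = secom_imp.loc[y == -1, ranked_indices[:40]].values
print 'CV test error: {:.2f}%'\
.format(ocsvm_cv_test_error(X40, nu=0.03, gamma=0.09) * 100)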
The best** results were obtained when the test error and outlier error were in the low 30% range, for n = 40-65.
nfeatures | nu | gamma | train error | test error | outlier error |
---|---|---|---|---|---|
40 | 0.03 | 0.09 | 169(16.50%) | 148(33.71%) | 32(30.77%) |
60 | 0.10 | 0.05 | 183(17.87%) | 133(30.30%) | 34(32.69%) |
65 | 0.10 | 0.05 | 162(15.82%) | 154(35.08%) | 29(27.88%) |
This, however, is still too high. The OCSVM on this data set also appears to be difficult to tune, so rather than search for optimal hyper-parameters through cross-validation we will look at other classification methods.
**Depending on the point in the production flow at which the data was recorded, one may wish to optimize for false positives or false negatives. Early in the flow, optimizing for false positives saves resources that would otherwise be spent on a device destined to die in the backend. At the end of the line, detecting false negatives is more important, since the chip increases in value with each process step and is most valuable at that stage.
[1] M. McCann, Y. Li, L. Maguire, and A. Johnston. "Causality Challenge: Benchmarking relevant signal components for effective monitoring and process control." JMLR Workshop and Conference Proceedings: NIPS 2008 Workshop on Causality.
[2] B. Schölkopf et al. "Support Vector Method for Novelty Detection." NIPS, Vol. 12, 1999.
[3] B. Schölkopf, A. Smola, R. Williamson, and P. L. Bartlett. "New Support Vector Algorithms." Neural Computation, 12, 2000, pp. 1207-1245.
[4] A. Statnikov et al. "A Gentle Introduction to Support Vector Machines in Biomedicine."
[5] SVM Parameters