In [1]:
%load_ext watermark
%watermark -a 'Sebastian Raschka' -d -v
In [2]:
import sys
sys.path = ['/Users/sebastian/github/mlxtend/'] + sys.path
In order to avoid the Curse of Dimensionality, pattern classification is often accompanied by dimensionality reduction, which also has the nice side effect of improving computational performance. Common techniques are projection-based, such as Principal Component Analysis (PCA, unsupervised) or Linear Discriminant Analysis (LDA, supervised). It should be noted, though, that regularization in classification models such as Logistic Regression, Support Vector Machines, or Neural Networks is to be preferred over dimensionality reduction for avoiding overfitting. However, dimensionality reduction is still a useful data compression technique that increases computational efficiency and reduces data storage requirements.
An alternative to a projection-based dimensionality reduction approach is so-called feature selection, and here, we will take a look at one of the established algorithms for tackling this combinatorial search problem: Sequential Backward Selection (SBS).
Let's summarize its mechanics in words: SBS starts with the original $d$-dimensional feature set and sequentially removes features from this set until the subset reaches a desired (user-specified) size $k$ where $k < d$. In every iteration $i$, the resulting $(d-i)$-dimensional subset is evaluated using a criterion function to determine the least informative feature to be removed.
The criterion function is typically the performance of the classifier measured via cross validation.
Let's consider the following example where we have a dataset that consists of 3 features, $x_1$, $x_2$, and $x_3$.
In order to determine the least informative feature, we create all 2-dimensional feature subsets and measure the performance (e.g., accuracy) of the classifier on each of those subsets, as sketched below:
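The following is a minimal sketch of this single step, not the mlxtend implementation used later in this notebook. Restricting the Iris data to its first 3 features and using a KNN classifier with 5-fold cross-validated accuracy as the criterion are assumptions made purely for illustration; the sklearn.cross_validation import matches the older scikit-learn API used elsewhere in this notebook.
from itertools import combinations
from sklearn.cross_validation import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
# toy setup: restrict the Iris data to its first 3 features (illustrative assumption)
iris = load_iris()
X3 = iris.data[:, :3]
y = iris.target
knn = KNeighborsClassifier(n_neighbors=4)
# evaluate every 2-dimensional subset of the 3 features
for subset in combinations(range(3), r=2):
    score = cross_val_score(knn, X3[:, list(subset)], y, cv=5, scoring='accuracy').mean()
    print(subset, score)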
Based on the accuracy measures, we would then eliminate feature $x_3$ and repeat this procedure until we reach the number of features to select. For example, if we wanted to select 2 features, we would stop at this point and select the feature subset $\{x_1, x_2\}$.
Note that this algorithm is considered "suboptimal" in contrast to an exhaustive search, which, however, is often computationally infeasible.
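For readers who prefer code over prose, here is a simplified, from-scratch sketch of the greedy removal loop described above, assuming cross-validated accuracy as the criterion function. The name sbs_sketch and its signature are made up for this illustration; this is not the mlxtend implementation used below.
from itertools import combinations
from sklearn.cross_validation import cross_val_score

def sbs_sketch(estimator, X, y, k_features, cv=5, scoring='accuracy'):
    # start with the full d-dimensional feature set
    indices = tuple(range(X.shape[1]))
    while len(indices) > k_features:
        scores, subsets = [], []
        # score every subset that has exactly one feature removed
        for subset in combinations(indices, r=len(indices) - 1):
            score = cross_val_score(estimator, X[:, list(subset)], y,
                                    cv=cv, scoring=scoring).mean()
            scores.append(score)
            subsets.append(subset)
        # keep the best-scoring subset, i.e., drop the least informative feature
        indices = subsets[scores.index(max(scores))]
    return indices
The SBS class from mlxtend used in the following cells follows this idea but additionally records the feature subset and cross-validation score of every iteration (sbs.subsets_ and sbs.scores_), which we will use for plotting further below.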
F. Ferri, P. Pudil, M. Hatef, and J. Kittler investigated the performance of different sequential selection algorithms for feature selection on different scales and reported their results in "Comparative Study of Techniques for Large-Scale Feature Selection" (1994).
Choosing an "appropriate" algorithm really depends on the problem: the dataset size, the desired recognition rate, and the computational performance. Thus, I want to encourage you to take (at least) a brief look at their paper and the results they obtained from experimenting with different problems and feature space dimensions.
In [3]:
from mlxtend.sklearn import SBS
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
knn = KNeighborsClassifier(n_neighbors=4)
sbs = SBS(knn, k_features=2, scoring='accuracy', cv=5)
sbs.fit(X, y)
print('Indices of selected features:', sbs.indices_)
print('CV score of selected subset:', sbs.k_score_)
print('New feature subset:')
sbs.transform(X)[0:5]
Out[3]:
In [4]:
import pandas as pd
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None)
X = df.iloc[:, 1:].values  # columns 1-13 contain the 13 features
y = df.iloc[:, 0].values   # column 0 contains the class labels
df.head()
Out[4]:
In [5]:
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
scr = StandardScaler()
X_std = scr.fit_transform(X)
knn = KNeighborsClassifier(n_neighbors=4)
# selecting features
sbs = SBS(knn, k_features=1, scoring='accuracy', cv=5)
sbs.fit(X_std, y)
# plotting performance of feature subsets
k_feat = [len(k) for k in sbs.subsets_]
plt.plot(k_feat, sbs.scores_, marker='o')
plt.ylabel('Accuracy')
plt.xlabel('Number of features')
plt.show()
Selecting the number of features in a pipeline.
In [6]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from mlxtend.sklearn import SBS
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
##########################
### Loading data
##########################
iris = load_iris()
X = iris.data
y = iris.target
##########################
### Setting up pipeline
##########################
knn = KNeighborsClassifier(n_neighbors=4)
sbs = SBS(estimator=knn, k_features=2, scoring='accuracy', cv=5)
pipeline = Pipeline([
    ('scr', StandardScaler()),
    ('sel', sbs),
    ('clf', knn)])
parameters = {'sel__k_features': [1, 2, 3, 4]}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=1, verbose=1)
##########################
### Running GridSearch
##########################
grid_search.fit(X, y)
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
Tuning the estimator used for feature selection. Note that the current implementation requires searching over the hyperparameters of the classifier and of the estimator inside the SBS transformer separately.
In [7]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from mlxtend.sklearn import SBS
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
##########################
### Loading data
##########################
iris = load_iris()
X = iris.data
y = iris.target
##########################
### Setting up pipeline
##########################
knn = KNeighborsClassifier(n_neighbors=4)
sbs = SBS(estimator=knn, k_features=2, scoring='accuracy', cv=5)
pipeline = Pipeline([
    ('scr', StandardScaler()),
    ('sel', sbs),
    ('clf', knn)])
parameters = {'sel__k_features': [1, 2, 3, 4],
              'sel__estimator__n_neighbors': [4, 5, 6],
              'clf__n_neighbors': [4, 5, 6]}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=1, verbose=1)
##########################
### Running GridSearch
##########################
grid_search.fit(X, y)
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
The final feature subset can then be obtained as follows:
In [8]:
print('Best feature subset:')
grid_search.best_estimator_.steps[1][1].indices_
Out[8]: