In [ ]:
%pylab inline
import numpy as np
import pylab as pl
In this section we study how different estimators may be chained.
For some types of data, for instance text data, a feature extraction step must be applied to convert it to numerical features.
In [ ]:
from sklearn import datasets, feature_selection
from sklearn.feature_extraction.text import TfidfVectorizer
news = datasets.fetch_20newsgroups()
X, y = news.data, news.target
vectorizer = TfidfVectorizer()
vectorizer.fit(X)
vector_X = vectorizer.transform(X)
print vector_X.shape
The vectorizer is a "transformer": it has a "fit" method and a "transform" method.
Importantly, the "fit" method of the transformer is applied on the training set, but the transform method can be applied on any data, including the test set.
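As a quick illustration of this contract (a minimal sketch: the split of the documents below is arbitrary and only serves the demonstration), we can fit a vectorizer on part of the documents and then transform documents it has never seen:
In [ ]:
# Sketch only: split the documents in two halves by simple slicing, for illustration
n_half = len(news.data) // 2
docs_train, docs_test = news.data[:n_half], news.data[n_half:]
demo_vectorizer = TfidfVectorizer()
demo_vectorizer.fit(docs_train)                   # the vocabulary is learned on the training documents only
X_unseen = demo_vectorizer.transform(docs_test)   # the same vocabulary is applied to unseen documents
print(X_unseen.shape)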
We can see that the vectorized data has a very large number of features, as these correspond to the words of the documents. Many of them are not relevant for the classification problem.
Supervised feature selection can select features that seem relevant for a learning task based on a simple test. It is often a computationally cheap way of reducing the dimensionality.
Scikit-learn has a variety of feature selection strategies. The univariate feature selection strategies (FDR, FPR, FWER, k-best, percentile) apply a simple function to compute a test statistic on each feature. The choice of this function (the score_func parameter) is important:
In [ ]:
from sklearn import feature_selection
selector = feature_selection.SelectPercentile(percentile=5, score_func=feature_selection.chi2)
X_red = selector.fit_transform(vector_X, y)
print "Original data shape %s, reduced data shape %s" % (vector_X.shape, X_red.shape)
A transformer and a predictor can be combined to form a new predictor using the pipeline object.
The constructor of the pipeline object takes a list of (name, estimator) pairs, which are applied to the data in the order of the list. The pipeline object exposes fit, transform, predict and score methods that apply the transforms (fitting them first, in the case of fit) one after the other to the data, and then call the corresponding method of the last estimator.
Using a pipeline we can combine our feature extraction, selection and final SVC in one step. This is convenient, as it enables clean cross-validation.
In [ ]:
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.cross_validation import cross_val_score
svc = LinearSVC()
pipeline = Pipeline([('vectorize', vectorizer), ('select', selector), ('svc', svc)])
cross_val_score(pipeline, X, y, verbose=3)
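Outside of cross_val_score, the pipeline behaves like any other estimator: fit runs the fit_transform of each transformer and then fits the SVC, while predict pushes new documents through the same chain of transforms. A small usage sketch (predicting on the first few training documents, purely for illustration):
In [ ]:
pipeline.fit(X, y)                    # fit_transform the vectorizer and selector, then fit the SVC
predicted = pipeline.predict(X[:5])   # the fitted transforms are re-applied before the SVC predicts
print(predicted)
print(news.target_names[predicted[0]])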
The resulting pipelined predictor object implicitly has many parameters. How do we set them in a principled way?
As a reminder, the GridSearchCV object can be used to set the parameters of an estimator. We just need to know the name of the parameters to set.
The pipeline object exposes the parameters of the estimators it wraps with the following convention: first the name of the estimator, as given in the constructor list, then the name of the parameter, separated by a double underscore. For instance, to set the SVC's 'C' parameter:
In [ ]:
pipeline.set_params(svc__C=10)
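To see which parameter names a given pipeline accepts, get_params lists them all, including the nested step__parameter names built from the step names defined above:
In [ ]:
# The settable parameters, including nested ones such as 'svc__C' or 'select__percentile'
print(sorted(pipeline.get_params().keys()))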
We can then use the grid search to choose the best C among 3 values.
Performance tip: choosing parameters by cross-validation may imply running the transformers many times on the same data with the same parameters. One way to avoid part of this overhead is to use memoization. In particular, we can use the version of joblib that is embedded in scikit-learn:
In [ ]:
from sklearn.externals import joblib
memory = joblib.Memory(cachedir='.')
memory.clear()
selector.score_func = memory.cache(selector.score_func)
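To illustrate what memoization buys us, here is a toy sketch (the function below is made up purely for the demonstration): the second call with the same argument is answered from the cache instead of being recomputed.
In [ ]:
def costly_computation(a):
    print('computing...')
    return a ** 2

cached_computation = memory.cache(costly_computation)
print(cached_computation(3.))   # computes, prints 'computing...', and stores the result
print(cached_computation(3.))   # same argument: the result is loaded from the cache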
Now we can proceed to run the grid search:
In [ ]:
from sklearn.grid_search import GridSearchCV
grid = GridSearchCV(estimator=pipeline, param_grid=dict(svc__C=[1e-2, 1, 1e2]))
grid.fit(X, y)
print grid.best_estimator_.named_steps['svc']
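Besides the best estimator itself, the fitted grid search object also records the best cross-validated score and the winning parameter values:
In [ ]:
print(grid.best_score_)
print(grid.best_params_)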
Exercise: on the 'Labeled Faces in the Wild' dataset (datasets.fetch_lfw_people), chain a randomized PCA with an SVC for prediction.
In [ ]:
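One possible solution sketch (the dataset filtering, the number of PCA components and the SVC settings below are arbitrary choices, not a reference solution):
In [ ]:
from sklearn.decomposition import RandomizedPCA
from sklearn.svm import SVC

# Load the faces dataset, keeping only people with enough pictures (arbitrary threshold)
lfw = datasets.fetch_lfw_people(min_faces_per_person=70)
X_faces, y_faces = lfw.data, lfw.target

# Chain a randomized PCA (dimensionality reduction) with an RBF SVC (classification)
face_pipeline = Pipeline([
    ('pca', RandomizedPCA(n_components=150, whiten=True)),
    ('svc', SVC(kernel='rbf', C=1.)),
])
print(cross_val_score(face_pipeline, X_faces, y_faces, verbose=3))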