Feature Extraction

Here we will talk about an important piece of machine learning: the extraction of quantitative features from data. By the end of this section you will

  • Know how features are extracted from real-world data.
  • See an example of extracting numerical features from textual data.

In addition, we will go over several basic tools within scikit-learn which can be used to accomplish the above tasks.

What Are Features?

Numerical Features

Recall that data in scikit-learn is expected to be in two-dimensional arrays, of size n_samples $\times$ n_features.

Previously, we looked at the iris dataset, which has 150 samples and 4 features:


In [ ]:
from sklearn.datasets import load_iris
iris = load_iris()
print(iris.data.shape)  # (n_samples, n_features) == (150, 4)

These features are:

  • sepal length in cm
  • sepal width in cm
  • petal length in cm
  • petal width in cm

Numerical features such as these are pretty straightforward: each sample contains a list of floating-point numbers corresponding to its features.

Categorical Features

What if you have categorical features? For example, imagine there is data on the color of each iris:

color in [red, blue, purple]

You might be tempted to assign numbers to these categories, e.g. red=1, blue=2, purple=3, but in general this is a bad idea. Estimators tend to assume that numerical features lie on a continuous scale, so, for example, 1 and 2 would be treated as more alike than 1 and 3, which is usually not the case for categorical features.

A better strategy is to give each category its own dimension (often called one-hot encoding).
The enriched iris feature set would then be:

  • sepal length in cm
  • sepal width in cm
  • petal length in cm
  • petal width in cm
  • color=purple (1.0 or 0.0)
  • color=blue (1.0 or 0.0)
  • color=red (1.0 or 0.0)

Note that using many of these categorical features may result in data which is better represented as a sparse matrix, as we'll see with the text classification example below.
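As a quick sketch of this idea, scikit-learn's OneHotEncoder can expand a string-valued column into one 0/1 column per category. Note that the color data here is hypothetical (the real iris dataset has no color measurement), and a reasonably recent scikit-learn version that accepts string categories is assumed:

In [ ]:
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# hypothetical color labels for four iris samples
colors = np.array([['purple'], ['blue'], ['red'], ['purple']])

enc = OneHotEncoder()                          # returns a sparse matrix by default
one_hot = enc.fit_transform(colors).toarray()
print(enc.categories_)                         # the categories found in the data
print(one_hot)                                 # one 0/1 column per color

The DictVectorizer described next achieves the same expansion when the data arrives as a list of dicts.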

Using the DictVectorizer to encode categorical features

When the source data is encoded as a list of dicts where the values are either string names for categories or numerical values, you can use the DictVectorizer class to compute the boolean expansion of the categorical features while leaving the numerical features untouched:


In [ ]:
measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Francisco', 'temperature': 18.},
]

In [ ]:
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer()
vec

In [ ]:
vec.fit_transform(measurements).toarray()

In [ ]:
vec.get_feature_names_out()  # scikit-learn >= 1.0; older versions used get_feature_names()

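In the resulting array, the categorical 'city' feature has been expanded into one boolean column per city, while 'temperature' remains a single numerical column; the feature names returned above make this mapping explicit.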
Derived Features

Another common type of feature is the derived feature, where some pre-processing step is applied to the data to generate features that are somehow more informative. Derived features may be based on dimensionality reduction (such as PCA or manifold learning), may be linear or nonlinear combinations of features (such as in polynomial regression), or may be some more sophisticated transform of the features; the latter is often used in image processing.

For example, scikit-image provides a variety of feature extractors designed for image data: see the skimage.feature submodule. We will see some dimensionality-based feature extraction routines later in the tutorial.
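As a minimal sketch of one kind of derived feature, polynomial expansion, scikit-learn's PolynomialFeatures transformer augments the original columns with their powers and pairwise products (the tiny array below is made up purely for illustration):

In [ ]:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1., 2.],
              [3., 4.]])

poly = PolynomialFeatures(degree=2)
# resulting columns: 1, x1, x2, x1^2, x1*x2, x2^2
print(poly.fit_transform(X))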

Text Feature Extraction

Unstructured content such as text documents requires its own feature extraction step. In general, we treat each word in a text document as an individual categorical feature. A full text mining example will be introduced later, at the end of this session; a minimal preview is sketched below.
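As a small preview (the two documents below are made up, and scikit-learn >= 1.0 is assumed for get_feature_names_out), CountVectorizer turns a collection of documents into a sparse matrix of word counts, with one column per word in the learned vocabulary:

In [ ]:
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat",
        "the dog ate my homework"]

vec = CountVectorizer()
X = vec.fit_transform(docs)            # sparse matrix: one row per document
print(vec.get_feature_names_out())     # the learned vocabulary
print(X.toarray())                     # word counts per document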