This notebook contains an excerpt from the book Machine Learning for OpenCV by Michael Beyeler. The code is released under the MIT license, and is available on GitHub.

Note that this excerpt contains only the raw code - the book is rich with additional explanations and illustrations. If you find this content useful, please consider supporting the work by buying the book!

Representing Text Features

As with categorical features, scikit-learn offers an easy way to encode another common feature type: text features. When working with text features, it is often convenient to encode individual words or phrases as numerical values.

Let's consider a dataset that contains a small corpus of text phrases:


In [1]:
sample = [
    'feature engineering',
    'feature selection',
    'feature extraction'
]

One of the simplest methods of encoding such data is by word count: for each phrase, we count the occurrences of each word within it. In scikit-learn, this is easily done using CountVectorizer, which works similarly to DictVectorizer:


In [2]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
X = vec.fit_transform(sample)
X


Out[2]:
<3x4 sparse matrix of type '<class 'numpy.int64'>'
	with 6 stored elements in Compressed Sparse Row format>

By default, this will store our feature matrix X as a sparse matrix. If we want to manually inspect it, we need to convert it to a regular array:


In [3]:
X.toarray()


Out[3]:
array([[1, 0, 1, 0],
       [0, 0, 1, 1],
       [0, 1, 1, 0]], dtype=int64)

...with corresponding feature names:


In [4]:
vec.get_feature_names()


Out[4]:
['engineering', 'extraction', 'feature', 'selection']
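
To make the mapping between columns and words explicit, we can label the dense count matrix with the feature names. A minimal sketch, assuming pandas is installed (note that scikit-learn 1.0 and later rename get_feature_names() to get_feature_names_out()):

import pandas as pd

# Columns appear in the same order as the vocabulary returned above.
# On scikit-learn >= 1.0, use vec.get_feature_names_out() instead.
pd.DataFrame(X.toarray(), columns=vec.get_feature_names())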

One possible shortcoming of this approach is that we might put too much weight on words that appear very frequently. One way to fix this is known as term frequency-inverse document frequency (TF-IDF). What TF-IDF does is easier to understand than its name suggests: it weighs the word counts by a measure of how often the words appear in the entire dataset.

The syntax for TF-IDF is almost identical to the previous command:


In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
X = vec.fit_transform(sample)
X.toarray()


Out[5]:
array([[ 0.861037  ,  0.        ,  0.50854232,  0.        ],
       [ 0.        ,  0.        ,  0.50854232,  0.861037  ],
       [ 0.        ,  0.861037  ,  0.50854232,  0.        ]])

We note that the numbers are now smaller than before, with the third column taking the biggest hit. This makes sense, as the third column corresponds to the most frequent word across all three phrases, 'feature':


In [6]:
vec.get_feature_names()


Out[6]:
['engineering', 'extraction', 'feature', 'selection']
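
To see where these numbers come from, we can reproduce them by hand. With its default settings, TfidfVectorizer multiplies each raw count by a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and then scales every row to unit Euclidean length. The following sketch reproduces the output above under those defaults:

import numpy as np

counts = np.array([[1, 0, 1, 0],    # 'feature engineering'
                   [0, 0, 1, 1],    # 'feature selection'
                   [0, 1, 1, 0]])   # 'feature extraction'

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)              # document frequency of each word
idf = np.log((1 + n_docs) / (1 + df)) + 1  # smoothed idf (scikit-learn's default)

tfidf = counts * idf                                   # weigh counts by idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalize each row
print(tfidf)

Since 'feature' occurs in every phrase, its idf is exactly 1, while the rarer words get a weight of ln(2) + 1, which is why the third column ends up with the smallest values after normalization.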

Representing text features will become important in Chapter 7, Implementing a Spam Filter with Bayesian Learning.