This notebook contains an excerpt from the book Machine Learning for OpenCV by Michael Beyeler. The code is released under the MIT license, and is available on GitHub.

Note that this excerpt contains only the raw code - the book is rich with additional explanations and illustrations. If you find this content useful, please consider supporting the work by buying the book!

Representing Categorical Variables

One of the most common data types we might encounter while building a machine learning system is the categorical feature (also known as the discrete feature), such as the color of a fruit or the name of a company.

The challenge with categorical features is that they don't change in a continuous way, which makes it hard to represent them with numbers. For example, a banana is either green or yellow, but not both. A product belongs either in the clothing department or in the books department, but rarely in both, and so on.

How would you go about representing such features?

Consider the following data containing a list of some of the forefathers of machine learning and artificial intelligence:


In [1]:
data = [
    {'name': 'Alan Turing', 'born': 1912, 'died': 1954},
    {'name': 'Herbert A. Simon', 'born': 1916, 'died': 2001},
    {'name': 'Jacek Karpinski', 'born': 1927, 'died': 2010},
    {'name': 'J.C.R. Licklider', 'born': 1915, 'died': 1990},
    {'name': 'Marvin Minsky', 'born': 1927, 'died': 2016},
]

While the 'born' and 'died' features are already in a numeric format, the 'name' feature is a bit trickier to encode. We might be tempted to encode it in the following way:


In [2]:
{'Alan Turing': 1,
 'Herbert A. Simon': 2,
 'Jacek Karpinski': 3,
 'J.C.R. Licklider': 4,
 'Marvin Minsky': 5};

Although this seems like a good idea, it does not make much sense from a machine learning perspective. Why not?

Refer to the book for the answer (p. 97).

A better way is to one-hot encode the 'name' feature, which scikit-learn provides through the DictVectorizer class. It works by feeding the list of dictionaries containing the data to the fit_transform method, which automatically determines which features need to be encoded:


In [3]:
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False, dtype=int)
vec.fit_transform(data)


Out[3]:
array([[1912, 1954,    1,    0,    0,    0,    0],
       [1916, 2001,    0,    1,    0,    0,    0],
       [1927, 2010,    0,    0,    0,    1,    0],
       [1915, 1990,    0,    0,    1,    0,    0],
       [1927, 2016,    0,    0,    0,    0,    1]], dtype=int64)

What happened here? The two year columns are still intact, but the 'name' entries have been replaced by columns of ones and zeros. We can call get_feature_names (get_feature_names_out in scikit-learn 1.0 and later) to find out the order in which the features are listed:


In [4]:
vec.get_feature_names()


Out[4]:
['born',
 'died',
 'name=Alan Turing',
 'name=Herbert A. Simon',
 'name=J.C.R. Licklider',
 'name=Jacek Karpinski',
 'name=Marvin Minsky']

The first row of our data matrix, which stands for Alan Turing, is now encoded as 'born'=1912, 'died'=1954, 'name=Alan Turing'=1, 'name=Herbert A. Simon'=0, 'name=J.C.R. Licklider'=0, 'name=Jacek Karpinski'=0, and 'name=Marvin Minsky'=0.
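
As a quick sanity check, DictVectorizer can also map an encoded row back to a feature dictionary via its inverse_transform method. The following sketch is not part of the book's code; it assumes the vec and data objects from the cells above, and the commented result is only what we would expect (typically only the non-zero entries are reported):

encoded = vec.transform(data)       # vec was already fitted in In [3]
vec.inverse_transform(encoded[:1])  # map the first row back to a feature dict
# expected to look something like:
# [{'born': 1912, 'died': 1954, 'name=Alan Turing': 1}]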

If the categorical feature has many possible values, the one-hot encoded matrix consists mostly of zeros, so it is more memory-efficient to use a sparse matrix:


In [5]:
vec = DictVectorizer(sparse=True, dtype=int)
vec.fit_transform(data)


Out[5]:
<5x7 sparse matrix of type '<class 'numpy.int64'>'
	with 15 stored elements in Compressed Sparse Row format>
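
If downstream code needs a dense array after all, the sparse result can be converted explicitly. This is a minimal sketch, not part of the book's code, assuming the sparse vec from In [5]:

X_sparse = vec.fit_transform(data)  # scipy.sparse matrix in CSR format
X_sparse.toarray()                  # densify on demand; same values as Out[3]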

We will come back to this technique when we talk about neural networks in Chapter 9, Using Deep Learning to Classify Handwritten Digits.