Just as it does for categorical features, scikit-learn offers an easy way to encode another common feature type: text features. When working with text features, it is often convenient to encode individual words or phrases as numerical values.
Let's consider a dataset that contains a small corpus of text phrases:
In [1]:
sample = [
    'feature engineering',
    'feature selection',
    'feature extraction'
]
One of the simplest methods of encoding such data is by word count: for each phrase, we count the occurrences of each word within it. In scikit-learn, this is easily done using CountVectorizer, which functions akin to DictVectorizer:
In [2]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
X = vec.fit_transform(sample)
X
Out[2]:
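The excerpt omits the notebook outputs; with the three-phrase corpus above, the result should look roughly like the following (the exact repr depends on your scipy version):
<3x4 sparse matrix of type '<class 'numpy.int64'>'
    with 6 stored elements in Compressed Sparse Row format>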
By default, this will store our feature matrix X as a sparse matrix. If we want to inspect it manually, we need to convert it to a regular (dense) array:
In [3]:
X.toarray()
Out[3]:
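For the corpus above, the dense array should be the following, where rows correspond to phrases and columns to vocabulary words in alphabetical order:
array([[1, 0, 1, 0],
       [0, 0, 1, 1],
       [0, 1, 1, 0]])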
...with corresponding feature names:
In [4]:
vec.get_feature_names_out()  # formerly get_feature_names(), removed in scikit-learn 1.2
Out[4]:
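This should return the four vocabulary words in alphabetical order; with recent scikit-learn versions, roughly:
array(['engineering', 'extraction', 'feature', 'selection'], dtype=object)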
One possible shortcoming of this approach is that we might put too much weight on words that appear very frequently. One approach to fix this is known as term frequency-inverse document frequency (TF-IDF). What TF-IDF does is easier to understand than its name suggests: it weighs each word count by how rarely the word appears across the entire dataset, so that words occurring in almost every document contribute less than words specific to a few.
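To make the weighting concrete, here is a minimal NumPy sketch of what scikit-learn's TfidfVectorizer computes with its default settings (a smoothed inverse document frequency followed by L2 normalization of each row); the variable names are ours, chosen for illustration:
import numpy as np
# Word counts from Out[3]; columns are
# ['engineering', 'extraction', 'feature', 'selection']
counts = np.array([[1, 0, 1, 0],
                   [0, 0, 1, 1],
                   [0, 1, 1, 0]])
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)              # number of phrases containing each word
idf = np.log((1 + n_docs) / (1 + df)) + 1  # smoothed inverse document frequency
tfidf = counts * idf                       # weigh each count by its idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalize each phrase
A word that occurs in every phrase ('feature') gets an idf of 1, while rarer words receive a higher weight, which is exactly the effect we are after.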
The syntax for TF-IDF is nearly identical to the previous command:
In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
X = vec.fit_transform(sample)
X.toarray()
Out[5]:
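Given the computation sketched earlier, the output should be approximately:
array([[0.861037  , 0.        , 0.50854232, 0.        ],
       [0.        , 0.        , 0.50854232, 0.861037  ],
       [0.        , 0.861037  , 0.50854232, 0.        ]])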
We note that the numbers are now smaller than before, with the third column taking the biggest hit. This makes sense, as the third column corresponds to the most frequent word across all three phrases, 'feature':
In [6]:
vec.get_feature_names_out()  # formerly get_feature_names(), removed in scikit-learn 1.2
Out[6]:
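The vocabulary is unchanged, so this returns the same four words as before:
array(['engineering', 'extraction', 'feature', 'selection'], dtype=object)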
Representing text features will become important in Chapter 7, Implementing a Spam Filter with Bayesian Learning.