In [1]:
from sklearn.feature_extraction.text import CountVectorizer

Basic vectorization

Vectorizing text is a fundamental step in applying both supervised and unsupervised learning to documents. In short, it turns the words in a given text document into numeric features.

Rather than explicitly defining our features, as we did for the donor classification problem, we can instead take advantage of tools, called vectorizers, that turn each word into a feature best described as "The number of times Word X appears in this document".

Here's an example with one bill title:

In [14]:
bill_titles = ['An act to amend Section 44277 of the Education Code, relating to teachers.']

In [16]:
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(bill_titles).toarray()

In [17]:
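# Parenthesized print works on both Python 2 and Python 3.
print(features)
# Newer scikit-learn releases (1.0+) provide get_feature_names_out() instead.
print(vectorizer.get_feature_names())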

[[1 1 1 1 1 1 1 1 1 1 1 2]]
[u'44277', u'act', u'amend', u'an', u'code', u'education', u'of', u'relating', u'section', u'teachers', u'the', u'to']

Think of this vector as a matrix with one row and 12 columns. The row corresponds to our document above. Each column corresponds to a word in that document (the first is "44277", the second is "act", and so on), in sorted order. Notice that the vectorizer lowercased every word and dropped the punctuation. Each number is the count of that word in the document. You'll see that all words appear once, except the last one, "to", which appears twice.
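To make the counting concrete, here is a rough plain-Python sketch of what the vectorizer does for a single document. It is an approximation, not scikit-learn's actual implementation: it assumes the default settings (lowercasing, tokens of two or more word characters, columns in sorted order), and the variable names are only for illustration.

```python
import re
from collections import Counter

title = 'An act to amend Section 44277 of the Education Code, relating to teachers.'

# Lowercase the text, then keep runs of two or more word characters.
# (Assumption: this mirrors CountVectorizer's default tokenization.)
tokens = re.findall(r'\b\w\w+\b', title.lower())

counts = Counter(tokens)              # word -> number of occurrences
vocabulary = sorted(counts)           # one column per distinct word
row = [counts[word] for word in vocabulary]

print(vocabulary)
print(row)
```

The resulting row has 12 entries, with a 2 in the last position for "to", which matches the vector printed above.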

Now what happens if we add another bill and run it again?

In [19]:
bill_titles = ['An act to amend Section 44277 of the Education Code, relating to teachers.',
               'An act relative to health care coverage']
# Refit on both titles so the vocabulary covers the two documents.
features = vectorizer.fit_transform(bill_titles).toarray()

print(features)
print(vectorizer.get_feature_names())

[[1 1 1 1 0 1 0 1 0 1 1 0 1 1 1 2]
 [0 1 0 1 1 0 1 0 1 0 0 1 0 0 0 1]]
[u'44277', u'act', u'amend', u'an', u'care', u'code', u'coverage', u'education', u'health', u'of', u'relating', u'relative', u'section', u'teachers', u'the', u'to']

Now we've got two rows, each corresponding to a document. The columns cover every word that appears in EITHER document (the combined vocabulary), and each cell holds that word's count in that row's document. For example, the first column, "44277", has a 1 in the first row and a 0 in the second, because the word appears once in the first document and not at all in the second. This, basically, is the concept of vectorization.
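If you want to find the column for a particular word without counting positions by hand, one option is the fitted vectorizer's vocabulary_ attribute, which maps each word to its column index. The snippet below is a small sketch of that lookup; the variable name column is just for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

bill_titles = ['An act to amend Section 44277 of the Education Code, relating to teachers.',
               'An act relative to health care coverage']

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(bill_titles).toarray()

# vocabulary_ is a dict of word -> column index in the feature matrix.
column = vectorizer.vocabulary_['44277']

print(column)                # index of the "44277" column
print(features[:, column])   # that word's count in each document
```

Slicing with features[:, column] returns one count per row, so here it gives the count for the first title followed by the count for the second.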

Cleaning up our vectors

As you might imagine, a document set with a relatively large vocabulary can result in vectors that are thousands and thousands of dimensions wide. This isn't necessarily bad, but in the interest of keeping our feature space as low-dimensional as possible, there are a few things we can do to clean them up.

First is removing so-called "stop words" -- words like "and", "or", "the", etc. that appear in almost every document and therefore aren't especially useful. Scikit-learn's vectorizer objects make this easy:

In [21]:
new_vectorizer = CountVectorizer(stop_words='english')
features = new_vectorizer.fit_transform(bill_titles).toarray()

print(features)
print(new_vectorizer.get_feature_names())

[[1 1 1 0 1 0 1 0 1 0 1 1]
 [0 1 0 1 0 1 0 1 0 1 0 0]]
[u'44277', u'act', u'amend', u'care', u'code', u'coverage', u'education', u'health', u'relating', u'relative', u'section', u'teachers']
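To check which words the built-in English list actually removed, you can intersect the titles' words with ENGLISH_STOP_WORDS, the set scikit-learn exposes in the same module. This is only a sketch: the regular expression is an assumption meant to mimic the vectorizer's default tokenization, and the contents of the stop list may differ between scikit-learn versions.

```python
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

bill_titles = ['An act to amend Section 44277 of the Education Code, relating to teachers.',
               'An act relative to health care coverage']

# Collect the distinct lowercased words across both titles.
words = set()
for title in bill_titles:
    words.update(re.findall(r'\b\w\w+\b', title.lower()))

# Words that are both in our titles and in the built-in stop list.
removed = sorted(words & ENGLISH_STOP_WORDS)
print(removed)
```

For these two titles the dropped words are "an", "of", "the" and "to", which accounts for the four columns that disappeared.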

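Notice that our feature space is now a little smaller. We can use a similar trick, the min_df parameter, to eliminate words that appear in only a small number of documents, which becomes useful when document sets get very large. With min_df=2, a word is kept only if it appears in at least two documents: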

In [24]:
new_vectorizer = CountVectorizer(stop_words='english', min_df=2)
features = new_vectorizer.fit_transform(bill_titles).toarray()

print(features)
print(new_vectorizer.get_feature_names())

[[1]
 [1]]
[u'act']
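The "df" in min_df stands for document frequency: the number of documents a word appears in, not its total count. A plain-Python sketch of that calculation is below. It is an illustration under the same tokenization assumption as before, not scikit-learn's own code, and it skips the stop-word step.

```python
import re
from collections import Counter

bill_titles = ['An act to amend Section 44277 of the Education Code, relating to teachers.',
               'An act relative to health care coverage']

doc_freq = Counter()
for title in bill_titles:
    tokens = re.findall(r'\b\w\w+\b', title.lower())
    # set() ensures a word counts once per document, however often it repeats.
    doc_freq.update(set(tokens))

# Keep only words found in at least two documents (the min_df=2 rule).
shared = sorted(word for word, df in doc_freq.items() if df >= 2)
print(shared)
```

Three words clear the threshold here: "act", "an" and "to". Since "an" and "to" are stop words, only "act" is left once both filters are applied, which matches the single column above.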

This is a bad example for this document set, since it leaves us with a single feature, but it will help later -- I promise.

Finally, we can also create features that comprise more than one word. These are known as N-grams, with the N being the number of consecutive words contained in the feature. Here is how you could create a feature vector of all 1-grams and 2-grams:

In [ ]:
new_vectorizer = CountVectorizer(stop_words='english', ngram_range=(1,2))
features = new_vectorizer.fit_transform(bill_titles).toarray()

print(features)
print(new_vectorizer.get_feature_names())
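Since the cell above has no saved output, here is a self-contained sketch that inspects the result through vocabulary_ instead of printing the whole matrix. The names ngram_vectorizer and bigrams are only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

bill_titles = ['An act to amend Section 44277 of the Education Code, relating to teachers.',
               'An act relative to health care coverage']

ngram_vectorizer = CountVectorizer(stop_words='english', ngram_range=(1, 2))
features = ngram_vectorizer.fit_transform(bill_titles).toarray()

# Rows are documents; columns are all 1-grams plus all 2-grams.
print(features.shape)

# 2-gram features are stored as two words joined by a single space.
bigrams = sorted(term for term in ngram_vectorizer.vocabulary_ if ' ' in term)
print(bigrams)
```

One detail worth noticing: "44277 education" shows up as a 2-gram even though "of the" sits between those words in the title. With these settings the stop words are removed first, and the remaining neighbors are then paired up.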