In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
# Some nice default configuration for plots
plt.rcParams['figure.figsize'] = 10, 7.5
plt.rcParams['axes.grid'] = True
plt.gray()
Outline of this section:
Let's start by implementing a canonical text classification example:
In [2]:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
# Load the text data
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]
twenty_train_small = load_files('../datasets/20news-bydate-train/',
                                categories=categories, encoding='latin-1')
twenty_test_small = load_files('../datasets/20news-bydate-test/',
                               categories=categories, encoding='latin-1')
# Turn the text documents into vectors of word frequencies
vectorizer = TfidfVectorizer(min_df=2)
X_train = vectorizer.fit_transform(twenty_train_small.data)
y_train = twenty_train_small.target
# Fit a classifier on the training set
classifier = MultinomialNB().fit(X_train, y_train)
print("Training score: {0:.1f}%".format(
    classifier.score(X_train, y_train) * 100))
# Evaluate the classifier on the testing set
X_test = vectorizer.transform(twenty_test_small.data)
y_test = twenty_test_small.target
print("Testing score: {0:.1f}%".format(
    classifier.score(X_test, y_test) * 100))
Here is a workflow diagram summary of what happened previously:
Let's now decompose what we just did to understand and customize each step.
Let's explore the dataset loading utility without passing a list of categories: in this case we load the full 20 newsgroups dataset in memory. The source website for the 20 newsgroups already provides a date-based train / test split, which is available here as two separate folders:
In [3]:
ls -l ../datasets/
In [4]:
ls -lh ../datasets/20news-bydate-train
In [5]:
ls -lh ../datasets/20news-bydate-train/alt.atheism/
The load_files function can load text files from a two-level folder structure, assuming the folder names represent categories:
In [ ]:
#print(load_files.__doc__)
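To make the expected folder layout concrete, here is a minimal pure-Python sketch of the idea: one sub-folder per category, one text file per document. The helper name load_text_folders and the file names are hypothetical, and this is a simplified illustration rather than scikit-learn's actual implementation (for instance, it does not shuffle the documents).

```python
import os
import tempfile


def load_text_folders(root, encoding="latin-1"):
    """Hypothetical helper: read <root>/<category>/<file> into parallel lists."""
    # Sorted sub-folder names play the role of target_names.
    target_names = sorted(
        name for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )
    data, target = [], []
    for label, category in enumerate(target_names):
        folder = os.path.join(root, category)
        for filename in sorted(os.listdir(folder)):
            with open(os.path.join(folder, filename), encoding=encoding) as f:
                data.append(f.read())
            target.append(label)  # integer id of the category
    return data, target, target_names


# Build a tiny throwaway corpus with the expected two-level layout.
with tempfile.TemporaryDirectory() as root:
    for category, text in [("sci.space", "orbit"), ("alt.atheism", "belief")]:
        os.makedirs(os.path.join(root, category))
        path = os.path.join(root, category, "0001.txt")
        with open(path, "w", encoding="latin-1") as f:
            f.write(text)
    data, target, target_names = load_text_folders(root)

print(target_names)
print(list(zip(target, data)))
```

Each document ends up with an integer label that indexes into the sorted list of folder names, which is the same relationship that holds between the target and target_names attributes used in this section.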
In [6]:
all_twenty_train = load_files('../datasets/20news-bydate-train/',
                              encoding='latin-1', random_state=42)
all_twenty_test = load_files('../datasets/20news-bydate-test/',
                             encoding='latin-1', random_state=42)
In [7]:
all_target_names = all_twenty_train.target_names
all_target_names
Out[7]:
In [8]:
all_twenty_train.target
Out[8]:
In [9]:
all_twenty_train.target.shape
Out[9]:
In [10]:
all_twenty_test.target.shape
Out[10]:
In [11]:
len(all_twenty_train.data)
Out[11]:
In [12]:
type(all_twenty_train.data[0])
Out[12]:
In [13]:
def display_sample(i, dataset):
    print("Class name: " + dataset.target_names[dataset.target[i]])
    print("Text content:\n")
    print(dataset.data[i])
In [14]:
display_sample(0, all_twenty_train)
In [15]:
display_sample(1, all_twenty_train)
Let's compute the (uncompressed, in-memory) size of the training and test sets in MB assuming an 8-bit encoding (in this case, all chars can be encoded using the latin-1 charset).
In [16]:
def text_size(text, charset='iso-8859-1'):
    # An 8-bit charset uses one byte per character; 1e-6 converts bytes to MB
    return len(text.encode(charset)) * 1e-6
train_size_mb = sum(text_size(text) for text in all_twenty_train.data)
test_size_mb = sum(text_size(text) for text in all_twenty_test.data)
print("Training set size: {0} MB".format(int(train_size_mb)))
print("Testing set size: {0} MB".format(int(test_size_mb)))
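As a small illustration of why the choice of charset matters for such a size estimate, the sketch below compares the encoded length of a short string under latin-1 (always one byte per character) and UTF-8 (one or more bytes per character). The sample string is an arbitrary assumption chosen to contain accented characters.

```python
# "café naïve" written with escapes to avoid any ambiguity in the source file
text = "caf\u00e9 na\u00efve"

n_chars = len(text)
latin1_bytes = len(text.encode("latin-1"))  # one byte per character
utf8_bytes = len(text.encode("utf-8"))      # accented chars take two bytes

# Converting a byte count to megabytes (10**6 bytes)
latin1_mb = latin1_bytes * 1e-6

print(n_chars, latin1_bytes, utf8_bytes)
```

With an 8-bit charset the number of bytes equals the number of characters, so the in-memory size in bytes is simply the total character count.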
If we only consider a small subset of the 4 categories selected from the initial example:
In [17]:
train_small_size_mb = sum(text_size(text) for text in twenty_train_small.data)
test_small_size_mb = sum(text_size(text) for text in twenty_test_small.data)
print("Training set size: {0} MB".format(int(train_small_size_mb)))
print("Testing set size: {0} MB".format(int(test_small_size_mb)))
In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer
TfidfVectorizer()
Out[18]:
In [19]:
vectorizer = TfidfVectorizer(min_df=1)
%time X_train_small = vectorizer.fit_transform(twenty_train_small.data)
The result is not a numpy.array but a scipy.sparse matrix. This data structure is quite similar to a 2D numpy array, but it does not store the zeros.
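To give an idea of how a matrix can avoid storing its zeros, here is a minimal pure-Python sketch of one common layout, compressed sparse row (CSR). The function name to_csr and the toy matrix are assumptions made for illustration; scipy.sparse provides optimized implementations of this idea.

```python
def to_csr(dense):
    """Convert a list of rows into CSR-style (data, indices, indptr) lists."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for column, value in enumerate(row):
            if value != 0:
                data.append(value)      # the non-zero value itself
                indices.append(column)  # the column it lives in
        indptr.append(len(data))        # where the next row starts in data
    return data, indices, indptr


dense = [
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.3, 0.0, 0.0, 0.7],
]
data, indices, indptr = to_csr(dense)
print(data, indices, indptr)
```

Only 3 of the 12 entries are stored here. For text data the saving is much larger, since each document uses only a tiny fraction of the whole vocabulary.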
In [20]:
X_train_small
Out[20]:
scipy.sparse matrices also have a shape attribute to access the dimensions:
In [21]:
n_samples, n_features = X_train_small.shape
This dataset has around 2000 samples (the rows of the data matrix):
In [22]:
n_samples
Out[22]:
This is the same value as the number of strings in the original list of text documents:
In [23]:
len(twenty_train_small.data)
Out[23]:
The columns represent the individual token occurrences:
In [24]:
n_features
Out[24]:
This number is the size of the vocabulary that the model extracted during fit and stored in a Python dictionary:
In [25]:
type(vectorizer.vocabulary_)
Out[25]:
In [26]:
len(vectorizer.vocabulary_)
Out[26]:
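The relationship between the vocabulary and the columns of the data matrix can be sketched in pure Python. The two toy documents are assumptions, and the snippet only counts tokens (no TF-IDF weighting), so it is a simplified illustration rather than what the vectorizer computes exactly.

```python
import re

docs = ["the rocket left orbit", "the image renders in orbit"]

# Tokens of two or more word characters, lowercased
token_pattern = re.compile(r"(?u)\b\w\w+\b")
tokenized = [token_pattern.findall(doc.lower()) for doc in docs]

# Map each distinct term to a column index, in lexicographical order
terms = sorted({token for tokens in tokenized for token in tokens})
vocabulary = {term: index for index, term in enumerate(terms)}

# One row of counts per document, one column per vocabulary term
counts = [[tokens.count(term) for term in terms] for tokens in tokenized]

print(vocabulary)
print(counts)
```

The number of columns equals the number of entries in the vocabulary, and sorting the terms recovers the feature name of each column.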
The keys of the vocabulary_ attribute are also called feature names and can be accessed as a sequence of strings (recent scikit-learn versions expose them through get_feature_names_out):
In [27]:
len(vectorizer.get_feature_names_out())
Out[27]:
Here are the first 10 elements (sorted in lexicographical order):
In [28]:
vectorizer.get_feature_names_out()[:10]
Out[28]:
Let's have a look at the features from the middle:
In [29]:
vectorizer.get_feature_names_out()[n_features // 2:n_features // 2 + 10]
Out[29]:
Now that we have extracted a vector representation of the data, it's a good idea to project the data onto the first two components of a Principal Component Analysis to get a feel for it. Note that the TruncatedSVD class can accept scipy.sparse matrices as input (as an alternative to numpy arrays):
In [30]:
from sklearn.decomposition import TruncatedSVD
%time X_train_small_pca = TruncatedSVD(n_components=2).fit_transform(X_train_small)
In [31]:
from itertools import cycle
colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k']
for i, c in zip(np.unique(y_train), cycle(colors)):
    plt.scatter(X_train_small_pca[y_train == i, 0],
                X_train_small_pca[y_train == i, 1],
                c=c, label=twenty_train_small.target_names[i], alpha=0.5)
_ = plt.legend(loc='best')
We can observe that there is a large overlap of the samples from different categories. This is to be expected, as the PCA linear projection maps data from a 34118-dimensional space down to 2 dimensions: data that is linearly separable in 34118D is often no longer linearly separable in 2D.
Still, we can notice an interesting pattern: the newsgroups on religion and atheism occupy much the same region, and computer graphics and space science overlap more with each other than they do with the religion or atheism newsgroups.
We have previously extracted a vector representation of the training corpus and put it into a variable named X_train_small. To train a supervised model, in this case a classifier, we also need the corresponding target labels:
In [32]:
y_train_small = twenty_train_small.target
In [33]:
y_train_small.shape
Out[33]:
In [34]:
y_train_small
Out[34]:
We can check that we have the same number of samples for the input data and the labels:
In [35]:
X_train_small.shape[0] == y_train_small.shape[0]
Out[35]:
We can now train a classifier, for instance a Multinomial Naive Bayes classifier:
In [36]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB(alpha=0.1)
clf
Out[36]:
In [37]:
clf.fit(X_train_small, y_train_small)
Out[37]:
We can now evaluate the classifier on the testing set. Let's first use the built-in score method, which returns the rate of correct classification on the test set:
In [38]:
X_test_small = vectorizer.transform(twenty_test_small.data)
y_test_small = twenty_test_small.target
In [39]:
X_test_small.shape
Out[39]:
In [40]:
y_test_small.shape
Out[40]:
In [41]:
clf.score(X_test_small, y_test_small)
Out[41]:
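The rate of correct classification is simply the fraction of samples whose predicted label equals the true label. A minimal sketch, assuming hypothetical label lists and a helper named accuracy:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels for five documents
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 2, 2, 2, 1]

result = accuracy(y_true, y_pred)
print(result)
```

Here four of the five predictions are correct, so the score is 0.8.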
We can also compute the score on the training set and observe that the model is both overfitting and underfitting a bit at the same time:
In [42]:
clf.score(X_train_small, y_train_small)
Out[42]:
The text vectorizer has many parameters to customize its behavior, in particular how it extracts tokens:
In [43]:
TfidfVectorizer()
Out[43]:
In [44]:
print(TfidfVectorizer.__doc__)
The easiest way to introspect what the vectorizer is actually doing for a given set of parameters is to call vectorizer.build_analyzer() to get an instance of the text analyzer it uses to process the text:
In [45]:
analyzer = TfidfVectorizer().build_analyzer()
analyzer("I love scikit-learn: this is a cool Python lib!")
Out[45]:
Notice that all the tokens are lowercase, that the single-letter word "I" was dropped, and that the hyphen was treated as a token separator. Let's change some of that default behavior:
In [46]:
analyzer = TfidfVectorizer(
    preprocessor=lambda text: text,  # disable lowercasing
    token_pattern=r'(?u)\b[\w-]+\b',  # treat hyphen as a letter
                                      # do not exclude single letter tokens
).build_analyzer()
analyzer("I love scikit-learn: this is a cool Python lib!")
Out[46]:
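The effect of the token pattern can be reproduced with the standard re module alone. The sketch below applies the default pattern of TfidfVectorizer, (?u)\b\w\w+\b, to the lowercased sentence, and then the hyphen-friendly pattern to the original sentence. It only mimics the tokenization step and ignores the other stages of the analyzer, such as n-gram generation and stop-word filtering.

```python
import re

sentence = "I love scikit-learn: this is a cool Python lib!"

# Default behavior: lowercase, then keep tokens of two or more word characters
default_pattern = re.compile(r"(?u)\b\w\w+\b")
default_tokens = default_pattern.findall(sentence.lower())

# Custom behavior: no lowercasing, hyphens allowed inside tokens,
# single-character tokens kept
custom_pattern = re.compile(r"(?u)\b[\w-]+\b")
custom_tokens = custom_pattern.findall(sentence)

print(default_tokens)
print(custom_tokens)
```

The first list loses "I" and "a" and splits "scikit-learn" in two, while the second keeps the case, the single-letter words and the hyphenated name.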
The analyzer name comes from the Lucene parlance: it wraps the sequential application of:
The analyzer system of scikit-learn is much more basic than Lucene's, though.
Exercise:
Hint: the TfidfVectorizer class can accept Python functions to customize the preprocessor, tokenizer or analyzer stages of the vectorizer.
type TfidfVectorizer() alone in a cell to see the default values of the parameters
type print(TfidfVectorizer.__doc__) to print the documentation of the constructor parameters, or use the ? suffix operator on any Python class or method to read the docstring, or even the ?? operator to read the source code.
In [ ]:
The MultinomialNB class is a good baseline classifier for text as it's fast and has few parameters to tweak:
In [ ]:
MultinomialNB()
In [ ]:
print(MultinomialNB.__doc__)
By reading the doc we can see that the alpha parameter is a good candidate to adjust the model for the bias (underfitting) vs variance (overfitting) trade-off.
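The role of alpha can be illustrated with the standard additive (Laplace / Lidstone) smoothing formula, where the estimated probability of a term in a class is (count + alpha) / (total + alpha * n_features). The sketch below assumes hypothetical term counts for a single class and a helper named smoothed_probabilities.

```python
def smoothed_probabilities(counts, alpha):
    """Additive smoothing of term counts observed for a single class."""
    total = sum(counts)
    n_features = len(counts)
    return [(count + alpha) / (total + alpha * n_features) for count in counts]


# Hypothetical counts of three terms in one class; the last term was never seen
counts = [3, 1, 0]

strong = smoothed_probabilities(counts, alpha=1.0)
weak = smoothed_probabilities(counts, alpha=0.1)

print(strong)
print(weak)
```

A larger alpha pulls the estimates towards a uniform distribution (more bias, less variance), while a smaller alpha trusts the observed counts more. In both cases the unseen term keeps a non-zero probability.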
Exercise:
Use sklearn.grid_search.GridSearchCV or the model_selection.RandomizedGridSearch utility function from the previous chapters to find a good value for the parameter alpha.
Hints:
RandomizedGridSearch also has a launch_for_arrays method as an alternative to launch_for_splits in case the CV splits have not been precomputed in advance.
In [ ]:
The feature extraction class has many options to customize its behavior:
In [ ]:
print(TfidfVectorizer.__doc__)
In order to evaluate the impact of the parameters of the feature extraction, one can chain a configured feature extractor and a linear classifier in a pipeline (as an alternative to the naive Bayes model):
In [ ]:
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('vec', TfidfVectorizer(min_df=1, max_df=0.8, use_idf=True)),
    ('clf', PassiveAggressiveClassifier(C=1)),
])
Such a pipeline can then be cross validated or even grid searched:
In [ ]:
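# cross_val_score lives in sklearn.model_selection in recent scikit-learn
# versions (the old sklearn.cross_validation module has been removed)
from sklearn.model_selection import cross_val_score
from scipy.stats import sem
scores = cross_val_score(pipeline, twenty_train_small.data,
                         twenty_train_small.target, cv=3, n_jobs=-1)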
scores.mean(), sem(scores)
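To clarify what the mean and the standard error of the mean summarize, here is a pure-Python sketch of a 3-fold split followed by the two statistics. The fold scores are made-up numbers, the helper kfold_test_indices is hypothetical, and the split ignores the class stratification that scikit-learn applies to classifiers.

```python
import math
import statistics


def kfold_test_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds contiguous test folds."""
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for k in range(n_folds):
        size = base + (1 if k < extra else 0)  # spread the remainder
        folds.append(list(range(start, start + size)))
        start += size
    return folds


folds = kfold_test_indices(10, 3)

# Hypothetical accuracy obtained on each of the three test folds
fold_scores = [0.90, 0.92, 0.94]
mean_score = statistics.mean(fold_scores)
# Standard error of the mean: sample standard deviation / sqrt(n)
sem_score = statistics.stdev(fold_scores) / math.sqrt(len(fold_scores))

print(folds)
print(mean_score, sem_score)
```

Each sample is used exactly once for testing, and the standard error gives a rough idea of how much the mean score would vary with a different split.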
For the grid search, the parameter names are prefixed with the name of the pipeline step, using "__" as a separator:
In [ ]:
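# GridSearchCV lives in sklearn.model_selection in recent scikit-learn
# versions (the old sklearn.grid_search module has been removed)
from sklearn.model_selection import GridSearchCV
parameters = {
    #'vec__min_df': [1, 2],
    'vec__max_df': [0.8, 1.0],
    'vec__ngram_range': [(1, 1), (1, 2)],
    'vec__use_idf': [True, False],
}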
gs = GridSearchCV(pipeline, parameters, verbose=2, refit=False)
_ = gs.fit(twenty_train_small.data, twenty_train_small.target)
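The size of such a grid and the meaning of the "__" separator can be sketched with the standard library. This is an illustration of the naming convention and of the exhaustive enumeration, not the actual GridSearchCV implementation; the helper split_param is hypothetical.

```python
from itertools import product

parameters = {
    'vec__max_df': [0.8, 1.0],
    'vec__ngram_range': [(1, 1), (1, 2)],
    'vec__use_idf': [True, False],
}


def split_param(name):
    """Split 'step__param' into the step name and the parameter name."""
    step, _, param = name.partition("__")
    return step, param


# Enumerate every combination of values: 2 * 2 * 2 candidates
names = sorted(parameters)
candidates = [dict(zip(names, values))
              for values in product(*(parameters[name] for name in names))]

print(len(candidates))
print(split_param('vec__ngram_range'))
print(candidates[0])
```

Each candidate is evaluated with cross-validation, so the total number of model fits is the number of candidates multiplied by the number of folds.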
In [ ]:
gs.best_score_
In [ ]:
gs.best_params_
Let's fit a model on the small dataset and collect info on the fitted components:
In [ ]:
_ = pipeline.fit(twenty_train_small.data, twenty_train_small.target)
In [ ]:
vec_name, vec = pipeline.steps[0]
clf_name, clf = pipeline.steps[1]
feature_names = vec.get_feature_names_out()
target_names = twenty_train_small.target_names
feature_weights = clf.coef_
feature_weights.shape
By sorting the feature weights of the linear model and asking the vectorizer for their names, one can get a clue about what the model actually learned from the data:
In [ ]:
def display_important_features(feature_names, target_names, weights, n_top=30):
    for i, target_name in enumerate(target_names):
        print("Class: " + target_name)
        print("")
        sorted_features_indices = weights[i].argsort()[::-1]
        most_important = sorted_features_indices[:n_top]
        print(", ".join("{0}: {1:.4f}".format(feature_names[j], weights[i, j])
                        for j in most_important))
        print("...")
        least_important = sorted_features_indices[-n_top:]
        print(", ".join("{0}: {1:.4f}".format(feature_names[j], weights[i, j])
                        for j in least_important))
        print("")
display_important_features(feature_names, target_names, feature_weights)
In [ ]:
from sklearn.metrics import classification_report
predicted = pipeline.predict(twenty_test_small.data)
In [ ]:
print(classification_report(twenty_test_small.target, predicted,
                            target_names=twenty_test_small.target_names))
The confusion matrix summarizes which classes were confused with which others, as revealed by the off-diagonal entries: here we can see, for instance, that articles about atheism have been wrongly classified as being about religion 57 times:
In [ ]:
from sklearn.metrics import confusion_matrix
confusion_matrix(twenty_test_small.target, predicted)
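To make the layout of the matrix explicit, here is a minimal pure-Python sketch that follows the same convention as scikit-learn: rows correspond to the true classes and columns to the predicted classes. The labels and the helper name confusion_matrix_sketch are assumptions for illustration.

```python
def confusion_matrix_sketch(y_true, y_pred, n_classes):
    """Entry [i][j] counts samples of true class i predicted as class j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Hypothetical labels for six documents and three classes
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 1, 2]

matrix = confusion_matrix_sketch(y_true, y_pred, n_classes=3)
for row in matrix:
    print(row)
```

The diagonal holds the correct predictions. The off-diagonal entry in row 0, column 1 shows that two documents of class 0 were mistaken for class 1.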
In [ ]:
twenty_test_small.target_names
In [ ]: