2) Model Selection and Assessment (Wednesday)
3) Distributed Model Selection and Assessment (Weekend)
4) Text Feature Extraction for Classification and Clustering
5) Large Scale Text Classification for Sentiment Analysis
But first, recall Gheorghe's lecture on Search:
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
# Some nice default configuration for plots
plt.rcParams['figure.figsize'] = 10, 7.5
plt.rcParams['axes.grid'] = True
plt.gray()
Outline of this section:
Let's start by implementing a canonical text classification example:
In [2]:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
# Load the text data
categories = [
'alt.atheism',
'talk.religion.misc',
'comp.graphics',
'sci.space',
]
twenty_train_small = load_files('../datasets/20news-bydate-train/',
categories=categories, encoding='latin-1')
twenty_test_small = load_files('../datasets/20news-bydate-test/',
categories=categories, encoding='latin-1')
# Turn the text documents into vectors of word frequencies
vectorizer = TfidfVectorizer(min_df=2)
X_train = vectorizer.fit_transform(twenty_train_small.data)
y_train = twenty_train_small.target
# Fit a classifier on the training set
classifier = MultinomialNB().fit(X_train, y_train)
print("Training score: {0:.1f}%".format(
classifier.score(X_train, y_train) * 100))
# Evaluate the classifier on the testing set
X_test = vectorizer.transform(twenty_test_small.data)
y_test = twenty_test_small.target
print("Testing score: {0:.1f}%".format(
classifier.score(X_test, y_test) * 100))
MultinomialNB
implements the naive Bayes algorithm for multinomially distributed data, and is one of the two classic naive Bayes variants used in text classification (where the data are typically represented as word vector counts, although tf-idf vectors are also known to work well in practice). The distribution is parametrized by vectors $\theta_y = (\theta_{y1},\ldots,\theta_{yn})$ for each class $y$, where $n$ is the number of features (in text classification, the size of the vocabulary) and $\theta_{yi}$ is the probability $P(x_i \mid y)$ of feature $i$ appearing in a sample belonging to class $y$.
The parameters $\theta_y$ are estimated by a smoothed version of maximum likelihood, i.e. relative frequency counting:
$$ \hat{\theta}_{yi} = \frac{ N_{yi} + \alpha}{N_y + \alpha n} $$ where $N_{yi} = \sum_{x \in T} x_i$ is the number of times feature $i$ appears in a sample of class $y$ in the training set $T$, and $N_{y} = \sum_{i=1}^{n} N_{yi}$ is the total count of all features for class $y$.
The smoothing prior $\alpha \ge 0$ accounts for features not present in the learning samples and prevents zero probabilities in further computations. Setting $\alpha = 1$ is called Laplace smoothing, while $\alpha < 1$ is called Lidstone smoothing.
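To make the estimate concrete, here is a small sketch (using a made-up toy count matrix rather than the newsgroups data; all names below are illustrative) that recomputes the smoothed estimate for one class by hand and checks it against the feature_log_prob_ attribute fitted by MultinomialNB:
In [ ]:
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix (illustration only): 3 documents, 4 features, 2 classes
X_toy = np.array([[2, 1, 0, 0],
                  [3, 0, 1, 0],
                  [0, 0, 2, 4]])
y_toy = np.array([0, 0, 1])
alpha = 1.0  # Laplace smoothing

nb_toy = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)

# Recompute theta_hat_{0i} = (N_0i + alpha) / (N_0 + alpha * n) for class 0
N_0i = X_toy[y_toy == 0].sum(axis=0)
theta_0 = (N_0i + alpha) / (N_0i.sum() + alpha * X_toy.shape[1])
print(np.allclose(np.log(theta_0), nb_toy.feature_log_prob_[0]))  # True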
Here is a workflow diagram summary of what happened previously:
Let's now decompose what we just did to understand and customize each step.
Let's explore the dataset loading utility without passing a list of categories: in this case we load the full 20 newsgroups dataset in memory. The source website for the 20 newsgroups already provides a date-based train / test split that is made available using the subset
keyword argument:
In [3]:
ls ../datasets/
In [4]:
ls -lh ../datasets/20news-bydate-train
In [5]:
ls -lh ../datasets/20news-bydate-train/alt.atheism/ | head -n27
The load_files
function can load text files from a two-level folder structure, assuming the folder names represent the categories:
In [6]:
print(load_files.__doc__)
In [7]:
all_twenty_train = load_files('../datasets/20news-bydate-train/',
encoding='latin-1', random_state=42)
all_twenty_test = load_files('../datasets/20news-bydate-test/',
encoding='latin-1', random_state=42)
In [8]:
all_target_names = all_twenty_train.target_names
all_target_names
Out[8]:
In [9]:
all_twenty_train.target
Out[9]:
In [10]:
all_twenty_train.target.shape
Out[10]:
In [11]:
all_twenty_test.target.shape
Out[11]:
In [12]:
len(all_twenty_train.data)
Out[12]:
In [13]:
type(all_twenty_train.data[0])
Out[13]:
In [14]:
def display_sample(i, dataset):
print("Class name: " + dataset.target_names[dataset.target[i]])
print("Text content:\n")
print(dataset.data[i])
In [15]:
display_sample(0, all_twenty_train)
In [16]:
display_sample(1, all_twenty_train)
Let's compute the (uncompressed, in-memory) size of the training and test sets in MB assuming an 8-bit encoding (in this case, all chars can be encoded using the latin-1 charset).
In [17]:
def text_size(text, charset='iso-8859-1'):
    # latin-1 uses one byte per character: convert the byte count to MB
    return len(text.encode(charset)) * 1e-6
train_size_mb = sum(text_size(text) for text in all_twenty_train.data)
test_size_mb = sum(text_size(text) for text in all_twenty_test.data)
print("Training set size: {0} MB".format(int(train_size_mb)))
print("Testing set size: {0} MB".format(int(test_size_mb)))
If we only consider the small subset of 4 categories selected in the initial example:
In [18]:
train_small_size_mb = sum(text_size(text) for text in twenty_train_small.data)
test_small_size_mb = sum(text_size(text) for text in twenty_test_small.data)
print("Training set size: {0} MB".format(int(train_small_size_mb)))
print("Testing set size: {0} MB".format(int(test_small_size_mb)))
In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer
TfidfVectorizer()
Out[19]:
In [20]:
vectorizer = TfidfVectorizer(min_df=1)
%time X_train_small = vectorizer.fit_transform(twenty_train_small.data)
The result is not a numpy.array
but a scipy.sparse
matrix (similar to the DocumentTermMatrix in R's tm
library). This data structure is quite similar to a 2D numpy array, but it does not store the zeros.
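As a quick sketch of what "not storing the zeros" means in practice, we can compare the number of explicitly stored values with the number of entries the same matrix would have as a dense array:
In [ ]:
# Only the non-zero entries are stored explicitly in the sparse matrix
n_rows, n_cols = X_train_small.shape
print("stored (non-zero) values: {0}".format(X_train_small.nnz))
print("entries of the equivalent dense array: {0}".format(n_rows * n_cols))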
In [21]:
X_train_small
Out[21]:
scipy.sparse matrices also have a shape attribute to access the dimensions:
In [22]:
n_samples, n_features = X_train_small.shape
This dataset has around 2000 samples (the rows of the data matrix):
In [23]:
n_samples
Out[23]:
This is the same value as the number of strings in the original list of text documents:
In [24]:
len(twenty_train_small.data)
Out[24]:
The columns represent the individual token occurrences:
In [25]:
n_features
Out[25]:
This number is the size of the vocabulary extracted by the model during fit, which is stored in a Python dictionary:
In [26]:
type(vectorizer.vocabulary_)
Out[26]:
In [27]:
len(vectorizer.vocabulary_)
Out[27]:
The keys of the vocabulary_
attribute are also called feature names and can be accessed as a list of strings.
In [28]:
len(vectorizer.get_feature_names())
Out[28]:
Here are the first 10 elements (sorted in lexicographical order):
In [29]:
vectorizer.get_feature_names()[:10]
Out[29]:
Let's have a look at the features from the middle:
In [30]:
vectorizer.get_feature_names()[n_features // 2:n_features // 2 + 10]
Out[30]:
Now that we have extracted a vector representation of the data, it's a good idea to project the data onto the first two components of a Singular Value Decomposition (i.e. a form of Principal Component Analysis) to get a feel for the data. Note that the TruncatedSVD
class can accept scipy.sparse
matrices as input (as an alternative to numpy arrays):
In [31]:
from sklearn.decomposition import TruncatedSVD
%time X_train_small_pca = TruncatedSVD(n_components=2).fit_transform(X_train_small)
In [32]:
from itertools import cycle
colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k']
for i, c in zip(np.unique(y_train), cycle(colors)):
plt.scatter(X_train_small_pca[y_train == i, 0],
X_train_small_pca[y_train == i, 1],
c=c, label=twenty_train_small.target_names[i], alpha=0.5)
_ = plt.legend(loc='best')
We can observe that there is a large overlap of the samples from different categories. This is to be expected, since this linear projection maps the data from a 34118-dimensional space down to 2 dimensions: data that is linearly separable in 34118 dimensions is often no longer linearly separable in 2D.
Still we can notice an interesting pattern: the newsgroups on religion and atheism occupy much the same region, and the computer graphics and space science newsgroups overlap more with each other than they do with the religion or atheism newsgroups.
We have previously extracted a vector representation of the training corpus and put it into a variable named X_train_small. To train a supervised model, in this case a classifier, we also need the corresponding target labels:
In [33]:
y_train_small = twenty_train_small.target
In [34]:
y_train_small.shape
Out[34]:
In [35]:
y_train_small
Out[35]:
We can check that we have the same number of samples for the input data and the labels:
In [36]:
X_train_small.shape[0] == y_train_small.shape[0]
Out[36]:
We can now train a classifier, for instance a Multinomial Naive Bayes classifier:
In [37]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB(alpha=0.1)
clf
Out[37]:
In [38]:
clf.fit(X_train_small, y_train_small)
Out[38]:
We can now evaluate the classifier on the testing set. Let's first use the builtin score function, which is the rate of correct classification in the test set:
In [39]:
X_test_small = vectorizer.transform(twenty_test_small.data)
y_test_small = twenty_test_small.target
In [40]:
X_test_small.shape
Out[40]:
In [41]:
y_test_small.shape
Out[41]:
In [42]:
clf.score(X_test_small, y_test_small)
Out[42]:
We can also compute the score on the train set and observe that the model is both overfitting and underfitting a bit at the same time:
In [43]:
clf.score(X_train_small, y_train_small)
Out[43]:
The text vectorizer has many parameters to customize its behavior, in particular how it extracts tokens:
In [44]:
TfidfVectorizer()
Out[44]:
In [45]:
print(TfidfVectorizer.__doc__)
The easiest way to introspect what the vectorizer is actually doing for a given set of parameters is to call vectorizer.build_analyzer()
to get an instance of the text analyzer it uses to process the text:
In [46]:
analyzer = TfidfVectorizer().build_analyzer()
analyzer("I love scikit-learn: this is a cool Python lib!")
Out[46]:
You can notice that all the tokens are lowercased, that the single letter word "I" was dropped, and that the hyphen in "scikit-learn" was treated as a token separator. Let's change some of that default behavior:
In [47]:
analyzer = TfidfVectorizer(
    preprocessor=lambda text: text,   # disable the default lowercasing
    token_pattern=r'(?u)\b[\w-]+\b',  # treat the hyphen as part of a word
    # this custom pattern also keeps single-letter tokens
).build_analyzer()
analyzer("I love scikit-learn: this is a cool Python lib!")
Out[47]:
The analyzer name comes from the Lucene parlance: it wraps the sequential application of the preprocessor (e.g. lowercasing), the tokenizer, and the n-gram extraction / stop word filtering steps.
The analyzer system of scikit-learn is much more basic than Lucene's though.
Exercise: write a preprocessor function that strips the message headers from each post and pass it to the vectorizer (a sketch of one possible solution follows the hints below).
Hint 1: As when we used Naïve Bayes in R, these messages have headers that are separated from the message by a blank line. In other words, find the first blank line ('\n\n'
) and take everything after that. (The find
or index
function may be of help here.)
Hint 2: the TfidfVectorizer
class can accept python functions to customize the preprocessor
, tokenizer
or analyzer
stages of the vectorizer.
Type TfidfVectorizer() alone in a cell to see the default values of the parameters.
Type TfidfVectorizer.__doc__ to print the constructor parameters documentation, use the ? suffix operator on any Python class or method to read its docstring, or even the ?? operator to read the source code.
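One possible sketch of a solution to the exercise above (assuming, as in Hint 1, that the headers end at the first blank line; strip_headers and the other names below are just illustrative choices):
In [ ]:
def strip_headers(text):
    # Keep only what follows the first blank line (the message body);
    # if no blank line is found, keep the text unchanged.
    _, _, body = text.partition('\n\n')
    return body if body else text

# Overriding the preprocessor also disables the default lowercasing,
# so we reapply it explicitly here.
vectorizer_no_headers = TfidfVectorizer(
    min_df=2, preprocessor=lambda text: strip_headers(text).lower())
X_train_no_headers = vectorizer_no_headers.fit_transform(twenty_train_small.data)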
In [47]:
The MultinomialNB
class is a good baseline classifier for text as it's fast and has few parameters to tweak:
In [48]:
MultinomialNB()
Out[48]:
In [49]:
print(MultinomialNB.__doc__)
By reading the doc we can see that the alpha
parameter is a good candidate to adjust the model for the bias (underfitting) vs variance (overfitting) trade-off.
Exercise:
Use sklearn.grid_search.GridSearchCV or the model_selection.RandomizedGridSearch utility function from the previous chapters to find a good value for the parameter alpha (a minimal sketch follows the hints below).
Hints:
RandomizedGridSearch
also has a launch_for_arrays
method as an alternative to launch_for_splits
in case the CV splits have not been precomputed in advance.
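A minimal sketch using GridSearchCV (one of the two suggested utilities), run directly on the already-vectorized small training set; the grid of alpha values is an arbitrary illustrative choice:
In [ ]:
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in recent versions

# Search for a good smoothing value on the vectorized 4-category data
param_grid = {'alpha': [0.001, 0.01, 0.1, 1.0]}
gs_nb = GridSearchCV(MultinomialNB(), param_grid, cv=3)
gs_nb.fit(X_train_small, y_train_small)
print(gs_nb.best_params_, gs_nb.best_score_)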
In [49]:
The feature extraction class has many options to customize its behavior:
In [50]:
print(TfidfVectorizer.__doc__)
In order to evaluate the impact of the feature extraction parameters, one can chain a configured feature extractor and a linear classifier (as an alternative to the naive Bayes model) in a pipeline:
In [51]:
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import Pipeline
pipeline = Pipeline((
('vec', TfidfVectorizer(min_df=1, max_df=0.8, use_idf=True)),
('clf', PassiveAggressiveClassifier(C=1)),
))
Such a pipeline can then be cross validated or even grid searched:
In [52]:
from sklearn.cross_validation import cross_val_score
from scipy.stats import sem
scores = cross_val_score(pipeline, twenty_train_small.data,
twenty_train_small.target, cv=3, n_jobs=-1)
scores.mean(), sem(scores)
Out[52]:
For the grid search, the parameter names are prefixed with the name of the pipeline step, using "__" as a separator:
In [53]:
from sklearn.grid_search import GridSearchCV
parameters = {
#'vec__min_df': [1, 2],
'vec__max_df': [0.8, 1.0],
'vec__ngram_range': [(1, 1), (1, 2)],
'vec__use_idf': [True, False],
}
gs = GridSearchCV(pipeline, parameters, verbose=2, refit=False)
_ = gs.fit(twenty_train_small.data, twenty_train_small.target)
In [54]:
gs.best_score_
Out[54]:
In [55]:
gs.best_params_
Out[55]:
Let's fit a model on the small dataset and collect info on the fitted components:
In [56]:
_ = pipeline.fit(twenty_train_small.data, twenty_train_small.target)
In [57]:
vec_name, vec = pipeline.steps[0]
clf_name, clf = pipeline.steps[1]
feature_names = vec.get_feature_names()
target_names = twenty_train_small.target_names
feature_weights = clf.coef_
feature_weights.shape
Out[57]:
By sorting the feature weights of the linear model and asking the vectorizer for their names, one can get a clue about what the model actually learned from the data:
In [58]:
def display_important_features(feature_names, target_names, weights, n_top=30):
for i, target_name in enumerate(target_names):
print("Class: " + target_name)
print("")
sorted_features_indices = weights[i].argsort()[::-1]
most_important = sorted_features_indices[:n_top]
print(", ".join("{0}: {1:.4f}".format(feature_names[j], weights[i, j])
for j in most_important))
print("...")
least_important = sorted_features_indices[-n_top:]
print(", ".join("{0}: {1:.4f}".format(feature_names[j], weights[i, j])
for j in least_important))
print("")
display_important_features(feature_names, target_names, feature_weights)
In [59]:
from sklearn.metrics import classification_report
predicted = pipeline.predict(twenty_test_small.data)
In [60]:
print(classification_report(twenty_test_small.target, predicted,
target_names=twenty_test_small.target_names))
The confusion matrix summarizes which classes get confused with one another; looking at the off-diagonal entries, we can see for instance that articles about atheism have been wrongly classified as being about religion 57 times:
In [61]:
from sklearn.metrics import confusion_matrix
confusion_matrix(twenty_test_small.target, predicted)
Out[61]:
In [62]:
twenty_test_small.target_names
Out[62]:
The sklearn.feature_extraction.text.CountVectorizer
and sklearn.feature_extraction.text.TfidfVectorizer
classes suffer from a number of scalability issues that all stem from the internal usage of the vocabulary_
attribute (a Python dictionary) used to map the unicode string feature names to the integer feature indices.
The main scalability issues are:
- the vocabulary_ would be a shared state: complex synchronization and overhead;
- the vocabulary_ needs to be learned from the data: its size cannot be known before making one pass over the full dataset.
To better understand the issue, let's have a look at how the vocabulary_ attribute works. At fit time, the tokens of the corpus are uniquely identified by an integer index and this mapping is stored in the vocabulary:
In [63]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1)
vectorizer.fit([
"The cat sat on the mat.",
])
vectorizer.vocabulary_
Out[63]:
The vocabulary is used at transform time to build the occurrence matrix:
In [64]:
X = vectorizer.transform([
"The cat sat on the mat.",
"This cat is a nice cat.",
]).toarray()
print(len(vectorizer.vocabulary_))
print(vectorizer.get_feature_names())
print(X)
Let's refit with a slightly larger corpus:
In [65]:
vectorizer = CountVectorizer(min_df=1)
vectorizer.fit([
"The cat sat on the mat.",
"The quick brown fox jumps over the lazy dog.",
])
vectorizer.vocabulary_
Out[65]:
The vocabulary_
grows (roughly logarithmically) with the size of the training corpus. Note that we could not have built the vocabularies in parallel on the 2 text documents as they share some words: doing so would require some kind of shared data structure or synchronization barrier, which is complicated to set up, especially if we want to distribute the processing on a cluster.
With this new vocabulary, the dimensionality of the output space is now larger:
In [66]:
X = vectorizer.transform([
"The cat sat on the mat.",
"This cat is a nice cat.",
]).toarray()
print(len(vectorizer.vocabulary_))
print(vectorizer.get_feature_names())
print(X)
To illustrate the scalability issues of the vocabulary-based vectorizers, let's load a more realistic dataset for a classical text classification task: sentiment analysis on tweets. The goal is to tell apart negative from positive tweets on a variety of topics.
If ../fetch_data.py sentiment140
didn't work, go to https://docs.google.com/file/d/0B04GJPshIjmPRnZManQwWEdTZjg/edit and download trainingandtestdata.zip there. Unzip and copy the contents to ../datasets/sentiment140/
In [67]:
import os
sentiment140_folder = os.path.join('..', 'datasets', 'sentiment140')
training_csv_file = os.path.join(sentiment140_folder, 'training.1600000.processed.noemoticon.csv')
testing_csv_file = os.path.join(sentiment140_folder, 'testdata.manual.2009.06.14.csv')
Those files were downloaded from the research archive of the Sentiment140 project. The first file was gathered using the twitter streaming API by running stream queries for the positive ":)" and negative ":(" emoticons to collect tweets that are explicitly positive or negative. To make the classification problem non-trivial, the emoticons were stripped out of the text in the CSV files:
In [68]:
!ls -lh ../datasets/sentiment140/training.1600000.processed.noemoticon.csv
Let's parse the CSV files and load everything in memory. As loading everything can take up to 2GB, let's limit the collection to 100K tweets of each (positive and negative) out of the total of 1.6M tweets.
In [69]:
FIELDNAMES = ('polarity', 'id', 'date', 'query', 'author', 'text')
def read_csv(csv_file, fieldnames=FIELDNAMES, max_count=None,
n_partitions=1, partition_id=0):
import csv # put the import inside for use in IPython.parallel
texts = []
targets = []
with open(csv_file, 'rb') as f:
reader = csv.DictReader(f, fieldnames=fieldnames,
delimiter=',', quotechar='"')
pos_count, neg_count = 0, 0
for i, d in enumerate(reader):
if i % n_partitions != partition_id:
# Skip entry if not in the requested partition
continue
if d['polarity'] == '4':
if max_count and pos_count >= max_count / 2:
continue
pos_count += 1
texts.append(d['text'])
targets.append(1)
elif d['polarity'] == '0':
if max_count and neg_count >= max_count / 2:
continue
neg_count += 1
texts.append(d['text'])
targets.append(-1)
return texts, targets
In [70]:
%time text_train_all, target_train_all = read_csv(training_csv_file, max_count=200000)
In [71]:
len(text_train_all), len(target_train_all)
Out[71]:
Let's display the first samples:
In [72]:
for text in text_train_all[:3]:
print(text + "\n")
In [73]:
print(target_train_all[:3])
A polarity of "0" means negative while a polarity of "4" means positive. The "0"s were converted to "-1"s and the "4"s to "1"s when parsing. All the positive tweets are at the end of the file:
In [74]:
for text in text_train_all[-3:]:
print(text + "\n")
In [75]:
print(target_train_all[-3:])
Let's split the training CSV file into a smaller training set and a validation set with 100k random tweets each:
In [76]:
from sklearn.cross_validation import train_test_split
text_train_small, text_validation, target_train_small, target_validation = train_test_split(
text_train_all, target_train_all, test_size=.5, random_state=42)
Let's open the manually annotated tweet files. The evaluation set also has neutral tweets with a polarity of "2" which we ignore. We can build the final evaluation set with only the positive and negative tweets of the evaluation CSV file:
In [77]:
text_test_all, target_test_all = read_csv(testing_csv_file)
In [78]:
len(text_test_all), len(target_test_all)
Out[78]:
To work around the limitations of the vocabulary-based vectorizers, one can use the hashing trick. Instead of building and storing an explicit mapping from the feature names to the feature indices in a Python dict, we can just use a hash function and a modulus operation:
In [79]:
from sklearn.utils.murmurhash import murmurhash3_bytes_u32
for word in "the cat sat on the mat".split():
print("{0} => {1}".format(
word, murmurhash3_bytes_u32(word, 0) % 2 ** 20))
This mapping is completely stateless and the dimensionality of the output space is explicitly fixed in advance (here we use a modulo 2 ** 20
, which means roughly 1M dimensions). This makes it possible to work around the limitations of the vocabulary-based vectorizers, both for parallelization and for online / out-of-core learning.
The HashingVectorizer
class is an alternative to the TfidfVectorizer
class with use_idf=False
that internally uses the murmurhash hash function:
In [80]:
from sklearn.feature_extraction.text import HashingVectorizer
h_vectorizer = HashingVectorizer(encoding='latin-1')
h_vectorizer
Out[80]:
It shares the same "preprocessor", "tokenizer" and "analyzer" infrastructure:
In [81]:
analyzer = h_vectorizer.build_analyzer()
analyzer('This is a test sentence.')
Out[81]:
We can vectorize our datasets into a scipy sparse matrix exactly as we would have done with the CountVectorizer
or TfidfVectorizer
, except that we can directly call the transform
method: there is no need to fit
as HashingVectorizer
is a stateless transformer:
In [82]:
%time X_train_small = h_vectorizer.transform(text_train_small)
The dimension of the output is fixed ahead of time to n_features=2 ** 20
by default (nearly 1M features) to minimize the rate of collisions on most classification problems while keeping the linear models reasonably sized (1M weights in the coef_
attribute):
In [83]:
X_train_small
Out[83]:
As only the non-zero elements are stored, n_features
has little impact on the actual size of the data in memory. We can combine the hashing vectorizer with a Passive-Aggressive linear model in a pipeline:
In [84]:
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import Pipeline
h_pipeline = Pipeline((
('vec', HashingVectorizer(encoding='latin-1')),
('clf', PassiveAggressiveClassifier(C=1, n_iter=1)),
))
%time h_pipeline.fit(text_train_small, target_train_small).score(text_validation, target_validation)
Out[84]:
Let's check that the score on the validation set is reasonably in line with the score on the set of manually annotated tweets:
In [85]:
h_pipeline.score(text_test_all, target_test_all)
Out[85]:
As the text_train_small
dataset is not that big, we can still use a vocabulary-based vectorizer to check that the hashing collisions are not causing any significant performance drop on the validation set (WARNING: this is twice as slow as the hashing vectorizer version; skip this cell if your computer is too slow):
In [86]:
from sklearn.feature_extraction.text import TfidfVectorizer
vocabulary_vec = TfidfVectorizer(encoding='latin-1', use_idf=False)
vocabulary_pipeline = Pipeline((
('vec', vocabulary_vec),
('clf', PassiveAggressiveClassifier(C=1, n_iter=1)),
))
%time vocabulary_pipeline.fit(text_train_small, target_train_small).score(text_validation, target_validation)
Out[86]:
We get almost the same score, but almost twice as slowly, and with a big, slow-to-(un)pickle data structure held in memory:
In [87]:
len(vocabulary_vec.vocabulary_)
Out[87]:
More info and references to the original papers on the hashing trick can be found in the answers to this http://metaoptimize.com/qa question: "What is the Hashing Trick?".
Out-of-core learning is the task of training a machine learning model on a dataset that does not fit in main memory. This requires the following conditions:
- a feature extraction layer with a fixed output dimensionality (such as the stateless HashingVectorizer);
- a machine learning algorithm that supports incremental learning (the partial_fit
method in scikit-learn).
Let us simulate an infinite Twitter stream that can generate batches of annotated tweet texts and their polarity. We can do this by randomly recombining pairs of positive or negative tweets from our fixed dataset:
In [88]:
from random import Random
class InfiniteStreamGenerator(object):
"""Simulate random polarity queries on the twitter streaming API"""
def __init__(self, texts, targets, seed=0, batchsize=100):
self.texts_pos = [text for text, target in zip(texts, targets)
if target > 0]
self.texts_neg = [text for text, target in zip(texts, targets)
if target <= 0]
self.rng = Random(seed)
self.batchsize = batchsize
def next_batch(self, batchsize=None):
batchsize = self.batchsize if batchsize is None else batchsize
texts, targets = [], []
for i in range(batchsize):
# Select the polarity randomly
target = self.rng.choice((-1, 1))
targets.append(target)
# Combine 2 random texts of the right polarity
pool = self.texts_pos if target > 0 else self.texts_neg
text = self.rng.choice(pool) + " " + self.rng.choice(pool)
texts.append(text)
return texts, targets
infinite_stream = InfiniteStreamGenerator(text_train_small, target_train_small)
In [89]:
texts_in_batch, targets_in_batch = infinite_stream.next_batch(batchsize=3)
In [90]:
for t in texts_in_batch:
print(t + "\n")
In [91]:
targets_in_batch
Out[91]:
We can now use our infinite tweet source to train an online machine learning algorithm using the hashing vectorizer. Note the use of the partial_fit method of the PassiveAggressiveClassifier instance in place of the traditional call to the fit method, which would need access to the full training set.
From time to time, we evaluate the current predictive performance of the model on our validation set that is guaranteed to not overlap with the infinite training set source:
In [92]:
n_batches = 1000
validation_scores = []
training_set_size = []
# Build the vectorizer and the classifier
h_vectorizer = HashingVectorizer(encoding='latin-1')
clf = PassiveAggressiveClassifier(C=1)
# Extract the features for the validation once and for all
X_validation = h_vectorizer.transform(text_validation)
classes = np.array([-1, 1])
n_samples = 0
for i in range(n_batches):
texts_in_batch, targets_in_batch = infinite_stream.next_batch()
n_samples += len(texts_in_batch)
# Vectorize the text documents in the batch
X_batch = h_vectorizer.transform(texts_in_batch)
# Incrementally train the model on the new batch
clf.partial_fit(X_batch, targets_in_batch, classes=classes)
if n_samples % 100 == 0:
# Compute the validation score of the current state of the model
score = clf.score(X_validation, target_validation)
validation_scores.append(score)
training_set_size.append(n_samples)
if i % 100 == 0:
print("n_samples: {0}, score: {1:.4f}".format(n_samples, score))
We can now plot the collected validation scores versus the number of samples generated by the infinite source and fed to the model:
In [93]:
plt.plot(training_set_size, validation_scores)
plt.xlabel("Number of samples")
plt.ylabel("Validation score")
Out[93]:
As the HashingVectorizer
is stateless, one can use a separate instance (with the same parameters) in parallel or distributed processes to extract the features on independent partitions of a big text dataset. Each partition of extracted features can then be fed to an independent instance of a linear classifier model on each computing node:
Once all the nodes are ready we can average the linear models:
Let's use IPython parallel to read partitions of the train CSV in different Python processes using the interactive IPython.parallel interface:
In [94]:
from IPython.parallel import Client
client = Client()
len(client)
Out[94]:
Let's tell each engine which partition of the data it will have to handle:
In [95]:
dv = client.direct_view()
In [106]:
dv.scatter('partition_id', range(len(client)), flatten=True, block=True)
In [107]:
%px print(partition_id)
Let's send everything we need to the engines:
In [100]:
from sklearn.feature_extraction.text import HashingVectorizer
h_vectorizer = HashingVectorizer(encoding='latin-1')
dv['h_vectorizer'] = h_vectorizer
dv['read_csv'] = read_csv
dv['training_csv_file'] = training_csv_file
dv['n_partitions'] = len(client)
In [101]:
%px print(training_csv_file)
%px print(n_partitions)
In [102]:
%%px
max_count = 50000
print("Parsing %d items for partition %d..." % (max_count, partition_id))
texts, targets = read_csv(training_csv_file, n_partitions=n_partitions,
partition_id=partition_id, max_count=50000)
print("Shuffling the positive and negative examples...")
from sklearn.utils import shuffle
texts, targets = shuffle(texts, targets, random_state=1)
print("Vectorizing text data...")
vectors = h_vectorizer.transform(texts)
print("Fitting a linear model...")
from sklearn.linear_model import Perceptron
clf = Perceptron(n_iter=1).fit(vectors, targets)
print("Done!")
In [103]:
classifiers = dv.gather('clf', block=True)
classifiers
Out[103]:
We can now compute the average linear model:
In [104]:
from copy import copy
def average_linear_model(models):
"""Compute a linear model that is the average of the others"""
avg = copy(models[0])
avg.coef_ = np.sum([m.coef_ for m in models], axis=0)
avg.coef_ /= len(models)
avg.intercept_ = np.sum([m.intercept_ for m in models], axis=0)
avg.intercept_ /= len(models)
return avg
clf = average_linear_model(classifiers)
Let's compare the score of the averaged model with the scores of the individual classifiers. The averaged model can generalize better than the individual models being averaged:
In [108]:
clf.score(h_vectorizer.transform(text_test_all), target_test_all)
Out[108]:
In [109]:
for c in classifiers:
print(c.score(h_vectorizer.transform(text_test_all), target_test_all))
Averaging linear models learned on different datasets that follow the same distribution is a form of ensemble method. Other ensemble methods include bagging and boosting, for instance.
Using the hashing vectorizer makes it possible to implement streaming and parallel text classification, but it can also introduce some issues:
- collisions: distinct tokens can be mapped to the same feature index;
- there is no way to compute the inverse mapping from a feature index back to a feature name;
- HashingVectorizer does not provide "Inverse Document Frequency" reweighting (lack of a use_idf=True option).
The collision issues can be controlled by increasing the n_features parameter.
The IDF weighting might be reintroduced by appending a TfidfTransformer instance to the output of the vectorizer. However, computing the idf_ statistic used for the feature reweighting requires at least one additional pass over the training set before the classifier training can start: this breaks the online learning scheme.
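For reference, a hedged sketch of that workaround, chaining a TfidfTransformer after the hashing vectorizer (the pipeline name is an illustrative choice; note that fitting it already requires a full pass over the training text):
In [ ]:
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Reintroduce IDF reweighting on top of the hashed features; the idf_
# statistics are learned during fit, which breaks pure online learning.
hashing_tfidf = Pipeline((
    ('vec', HashingVectorizer(encoding='latin-1')),
    ('tfidf', TfidfTransformer(use_idf=True)),
))
X_train_hashed_tfidf = hashing_tfidf.fit_transform(text_train_small)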
The lack of an inverse mapping (the get_feature_names() method of TfidfVectorizer) is even harder to work around. That would require extending the HashingVectorizer class to add a "trace" mode that records the mapping of the most important features, so as to provide statistical debugging information.
In the meantime, to debug feature extraction issues, it is recommended to use TfidfVectorizer(use_idf=False) on a smallish subset of the dataset to simulate a HashingVectorizer() instance that has the get_feature_names() method and no collision issues.
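For instance, a small sketch of that debugging workflow (the subset size of 5000 documents is an arbitrary choice):
In [ ]:
from sklearn.feature_extraction.text import TfidfVectorizer

# A vocabulary-based vectorizer on a subset behaves like a collision-free
# HashingVectorizer without IDF, but keeps the feature names for inspection.
debug_vectorizer = TfidfVectorizer(encoding='latin-1', use_idf=False)
X_debug = debug_vectorizer.fit_transform(text_train_small[:5000])
print(len(debug_vectorizer.get_feature_names()))
print(debug_vectorizer.get_feature_names()[:10])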