Text Mining in Python

Instructor: Alessandro Gagliardi
TA: Kevin Perko

Midterm Feedback

Last Week:

  • Decision Trees
    • Optimization Functions
    • Preventing Overfit
    • Implementing Decision Trees with Scikit-Learn
  • Support Vector Machines
    • Maximum Margin Hyperplanes
    • Slack Variables
    • Nonlinear Classification with Kernels

Olivier's Tutorial Thus Far....

1) An Introduction to Predictive Modeling in Python (Monday)

  • Training a predictive model on numerical features
  • Model evaluation and interpretation
  • Cross-validation
  • More feature engineering and richer models

2) Model Selection and Assessment (Wednesday)

  • Model performance evaluation and detection of overfitting with Cross-Validation
  • Hyper parameter tuning and model selection with Grid Search
  • Error analysis with learning curves and the Bias-Variance trade-off
  • Overfitting via Model Selection and the Development / Evaluation set split

3) Distributed Model Selection and Assessment (Weekend)

  • Introduction to IPython.parallel
  • Sharing Data Between Processes with Memory Mapping
  • Parallel Grid Search and Model Selection
  • Distributed Computation on EC2 Spot Instances with StarCluster

Today: Text Mining in Python

4) Text Feature Extraction for Classification and Clustering

  • Turn a corpus of text documents into feature vectors using a Bag of Words representation,
  • Train a simple text classifier on the feature vectors,
  • Wrap the vectorizer and the classifier with a pipeline,
  • Cross-validation and model selection on the pipeline.

5) Large Scale Text Classification for Sentiment Analysis

  • Limitations of the Vocabulary-Based Vectorizer
  • The Hashing Trick
  • Online / Streaming Text Feature Extraction and Classification
  • Parallel Text Feature Extraction and Classification

But first, recall Gheorghe's lecture on Search:

TF-IDF

  • Terms that occur in only a few documents are often more valuable than ones that occur in many – inverse document frequency ($IDF_j$)
  • The more often a term occurs in a document, the more likely it is to be important for that document – term frequency ($TF_{ij}$)
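To make these two quantities concrete, here is a minimal sketch (an addition, not part of the original notebook) that computes raw term frequencies and smoothed inverse document frequencies for a toy corpus with plain numpy. TfidfVectorizer, used below, applies essentially the same weighting followed by an l2 length normalization of each document vector.

import numpy as np

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]
vocab = sorted({w for d in docs for w in d.split()})

# term frequency TF_ij: how often term j occurs in document i
tf = np.array([[d.split().count(w) for w in vocab] for d in docs], dtype=float)

# document frequency: in how many documents term j occurs at all
df = (tf > 0).sum(axis=0)

# smoothed inverse document frequency IDF_j, roughly what
# TfidfVectorizer(smooth_idf=True) uses before normalization
n_docs = len(docs)
idf = np.log((1.0 + n_docs) / (1.0 + df)) + 1.0

# terms that are frequent in a document but rare in the corpus get the
# largest weights
tfidf = tf * idf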

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# Some nice default configuration for plots
plt.rcParams['figure.figsize'] = 10, 7.5
plt.rcParams['axes.grid'] = True
plt.gray()


<matplotlib.figure.Figure at 0x104b4be10>

Text Feature Extraction for Classification and Clustering

Outline of this section:

  • Turn a corpus of text documents into feature vectors using a Bag of Words representation,
  • Train a simple text classifier on the feature vectors,
  • Wrap the vectorizer and the classifier with a pipeline,
  • Cross-validation and model selection on the pipeline.

Text Classification in 20 lines of Python

Let's start by implementing a canonical text classification example:

  • The 20 newsgroups dataset: around 18000 text posts from 20 newsgroups forums
  • Bag of Words features extraction with TF-IDF weighting
  • Naive Bayes classifier or Linear Support Vector Machine for the classifier itself

In [2]:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Load the text data
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]
twenty_train_small = load_files('../datasets/20news-bydate-train/',
    categories=categories, encoding='latin-1')
twenty_test_small = load_files('../datasets/20news-bydate-test/',
    categories=categories, encoding='latin-1')

# Turn the text documents into vectors of word frequencies
vectorizer = TfidfVectorizer(min_df=2)
X_train = vectorizer.fit_transform(twenty_train_small.data)
y_train = twenty_train_small.target

# Fit a classifier on the training set
classifier = MultinomialNB().fit(X_train, y_train)
print("Training score: {0:.1f}%".format(
    classifier.score(X_train, y_train) * 100))

# Evaluate the classifier on the testing set
X_test = vectorizer.transform(twenty_test_small.data)
y_test = twenty_test_small.target
print("Testing score: {0:.1f}%".format(
    classifier.score(X_test, y_test) * 100))


Training score: 95.1%
Testing score: 85.1%

Multinomial Naive Bayes

MultinomialNB implements the naive Bayes algorithm for multinomially distributed data, and is one of the two classic naive Bayes variants used in text classification (where the data are typically represented as word vector counts, although tf-idf vectors are also known to work well in practice). The distribution is parametrized by vectors $\theta_y = (\theta_{y1},\ldots,\theta_{yn})$ for each class $y$, where $n$ is the number of features (in text classification, the size of the vocabulary) and $\theta_{yi}$ is the probability $P(x_i \mid y)$ of feature $i$ appearing in a sample belonging to class $y$.

The parameters $\theta_y$ are estimated by a smoothed version of maximum likelihood, i.e. relative frequency counting:

$$ \hat{\theta}_{yi} = \frac{ N_{yi} + \alpha}{N_y + \alpha n} $$

where $N_{yi} = \sum_{x \in T} x_i$ is the number of times feature $i$ appears in a sample of class $y$ in the training set $T$, and $N_{y} = \sum_{i=1}^{n} N_{yi}$ is the total count of all features for class $y$.

The smoothing prior $\alpha \ge 0$ accounts for features not present in the learning samples and prevents zero probabilities in further computations. Setting $\alpha = 1$ is called Laplace smoothing, while $\alpha < 1$ is called Lidstone smoothing.
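As a sanity check of the estimate above, here is a small sketch (an illustrative addition) that computes $\hat{\theta}_{yi}$ directly from a toy count matrix and compares it with the feature_log_prob_ attribute fitted by MultinomialNB:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# toy word-count matrix: 4 documents, 3 vocabulary terms, 2 classes
X = np.array([[2, 1, 0],
              [3, 0, 0],
              [0, 1, 3],
              [0, 0, 4]])
y = np.array([0, 0, 1, 1])
alpha = 1.0  # Laplace smoothing

theta = np.empty((2, X.shape[1]))
for label in (0, 1):
    N_yi = X[y == label].sum(axis=0)   # count of each feature in class y
    N_y = N_yi.sum()                   # total count of all features in class y
    theta[label] = (N_yi + alpha) / (N_y + alpha * X.shape[1])

clf = MultinomialNB(alpha=alpha).fit(X, y)
print(np.allclose(np.log(theta), clf.feature_log_prob_))  # True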

Here is a workflow diagram summary of what happened previously:

Let's now decompose what we just did to understand and customize each step.

Loading the Dataset

Let's explore the dataset loading utility without passing a list of categories: in this case we load the full 20 newsgroups dataset in memory. The source website for the 20 newsgroups already provides a date-based train / test split, which is mirrored here as the separate 20news-bydate-train and 20news-bydate-test folders:


In [3]:
ls ../datasets/


20news-bydate-test/      20news-bydate.tar.gz     titanic_train.csv
20news-bydate-train/     sentiment140/            trainingandtestdata.zip

In [4]:
ls -lh ../datasets/20news-bydate-train


total 0
drwxr-xr-x  482 alessandro.gagliardi  1260538795    16K Mar 18  2003 alt.atheism/
drwxr-xr-x  586 alessandro.gagliardi  1260538795    19K Mar 18  2003 comp.graphics/
drwxr-xr-x  593 alessandro.gagliardi  1260538795    20K Mar 18  2003 comp.os.ms-windows.misc/
drwxr-xr-x  592 alessandro.gagliardi  1260538795    20K Mar 18  2003 comp.sys.ibm.pc.hardware/
drwxr-xr-x  580 alessandro.gagliardi  1260538795    19K Mar 18  2003 comp.sys.mac.hardware/
drwxr-xr-x  595 alessandro.gagliardi  1260538795    20K Mar 18  2003 comp.windows.x/
drwxr-xr-x  587 alessandro.gagliardi  1260538795    19K Mar 18  2003 misc.forsale/
drwxr-xr-x  596 alessandro.gagliardi  1260538795    20K Mar 18  2003 rec.autos/
drwxr-xr-x  600 alessandro.gagliardi  1260538795    20K Mar 18  2003 rec.motorcycles/
drwxr-xr-x  599 alessandro.gagliardi  1260538795    20K Mar 18  2003 rec.sport.baseball/
drwxr-xr-x  602 alessandro.gagliardi  1260538795    20K Mar 18  2003 rec.sport.hockey/
drwxr-xr-x  597 alessandro.gagliardi  1260538795    20K Mar 18  2003 sci.crypt/
drwxr-xr-x  593 alessandro.gagliardi  1260538795    20K Mar 18  2003 sci.electronics/
drwxr-xr-x  596 alessandro.gagliardi  1260538795    20K Mar 18  2003 sci.med/
drwxr-xr-x  595 alessandro.gagliardi  1260538795    20K Mar 18  2003 sci.space/
drwxr-xr-x  601 alessandro.gagliardi  1260538795    20K Mar 18  2003 soc.religion.christian/
drwxr-xr-x  548 alessandro.gagliardi  1260538795    18K Mar 18  2003 talk.politics.guns/
drwxr-xr-x  566 alessandro.gagliardi  1260538795    19K Mar 18  2003 talk.politics.mideast/
drwxr-xr-x  467 alessandro.gagliardi  1260538795    16K Mar 18  2003 talk.politics.misc/
drwxr-xr-x  379 alessandro.gagliardi  1260538795    13K Mar 18  2003 talk.religion.misc/

In [5]:
ls -lh ../datasets/20news-bydate-train/alt.atheism/ | head -n27


total 4480
-rw-r--r--  1 alessandro.gagliardi  1260538795    12K Mar 18  2003 49960
-rw-r--r--  1 alessandro.gagliardi  1260538795    31K Mar 18  2003 51060
-rw-r--r--  1 alessandro.gagliardi  1260538795   4.0K Mar 18  2003 51119
-rw-r--r--  1 alessandro.gagliardi  1260538795   1.6K Mar 18  2003 51120
-rw-r--r--  1 alessandro.gagliardi  1260538795   773B Mar 18  2003 51121
-rw-r--r--  1 alessandro.gagliardi  1260538795   4.8K Mar 18  2003 51122
-rw-r--r--  1 alessandro.gagliardi  1260538795   618B Mar 18  2003 51123
-rw-r--r--  1 alessandro.gagliardi  1260538795   1.4K Mar 18  2003 51124
-rw-r--r--  1 alessandro.gagliardi  1260538795   2.7K Mar 18  2003 51125
-rw-r--r--  1 alessandro.gagliardi  1260538795   427B Mar 18  2003 51126
-rw-r--r--  1 alessandro.gagliardi  1260538795   742B Mar 18  2003 51127
-rw-r--r--  1 alessandro.gagliardi  1260538795   650B Mar 18  2003 51128
-rw-r--r--  1 alessandro.gagliardi  1260538795   1.3K Mar 18  2003 51130
-rw-r--r--  1 alessandro.gagliardi  1260538795   2.3K Mar 18  2003 51131
-rw-r--r--  1 alessandro.gagliardi  1260538795   2.6K Mar 18  2003 51132
-rw-r--r--  1 alessandro.gagliardi  1260538795   1.5K Mar 18  2003 51133
-rw-r--r--  1 alessandro.gagliardi  1260538795   1.2K Mar 18  2003 51134
-rw-r--r--  1 alessandro.gagliardi  1260538795   1.6K Mar 18  2003 51135
-rw-r--r--  1 alessandro.gagliardi  1260538795   2.1K Mar 18  2003 51136
-rw-r--r--  1 alessandro.gagliardi  1260538795   1.3K Mar 18  2003 51139
-rw-r--r--  1 alessandro.gagliardi  1260538795   409B Mar 18  2003 51140
-rw-r--r--  1 alessandro.gagliardi  1260538795   940B Mar 18  2003 51141
-rw-r--r--  1 alessandro.gagliardi  1260538795   9.0K Mar 18  2003 51142
-rw-r--r--  1 alessandro.gagliardi  1260538795   632B Mar 18  2003 51143
-rw-r--r--  1 alessandro.gagliardi  1260538795   1.2K Mar 18  2003 51144
-rw-r--r--  1 alessandro.gagliardi  1260538795   609B Mar 18  2003 51145

The load_files function can load text files from a two-level folder structure, assuming the folder names represent categories:


In [6]:
print(load_files.__doc__)


Load text files with categories as subfolder names.

    Individual samples are assumed to be files stored a two levels folder
    structure such as the following:

        container_folder/
            category_1_folder/
                file_1.txt
                file_2.txt
                ...
                file_42.txt
            category_2_folder/
                file_43.txt
                file_44.txt
                ...

    The folder names are used has supervised signal label names. The
    individual file names are not important.

    This function does not try to extract features into a numpy array or
    scipy sparse matrix. In addition, if load_content is false it
    does not try to load the files in memory.

    To use text files in a scikit-learn classification or clustering
    algorithm, you will need to use the `sklearn.feature_extraction.text`
    module to build a feature extraction transformer that suits your
    problem.

    If you set load_content=True, you should also specify the encoding of
    the text using the 'encoding' parameter. For many modern text files,
    'utf-8' will be the correct encoding. If you leave encoding equal to None,
    then the content will be made of bytes instead of Unicode, and you will
    not be able to use most functions in `sklearn.feature_extraction.text`.

    Similar feature extractors should be built for other kind of unstructured
    data input such as images, audio, video, ...

    Parameters
    ----------
    container_path : string or unicode
        Path to the main folder holding one subfolder per category

    description: string or unicode, optional (default=None)
        A paragraph describing the characteristic of the dataset: its source,
        reference, etc.

    categories : A collection of strings or None, optional (default=None)
        If None (default), load all the categories.
        If not None, list of category names to load (other categories ignored).

    load_content : boolean, optional (default=True)
        Whether to load or not the content of the different files. If
        true a 'data' attribute containing the text information is present
        in the data structure returned. If not, a filenames attribute
        gives the path to the files.

    encoding : string or None (default is None)
        If None, do not try to decode the content of the files (e.g. for
        images or other non-text content).
        If not None, encoding to use to decode text files to Unicode if
        load_content is True.

    decode_error: {'strict', 'ignore', 'replace'}, optional
        Instruction on what to do if a byte sequence is given to analyze that
        contains characters not of the given `encoding`. Passed as keyword
        argument 'errors' to bytes.decode.

    shuffle : bool, optional (default=True)
        Whether or not to shuffle the data: might be important for models that
        make the assumption that the samples are independent and identically
        distributed (i.i.d.), such as stochastic gradient descent.

    random_state : int, RandomState instance or None, optional (default=0)
        If int, random_state is the seed used by the random number generator;
        If RandomState instance, random_state is the random number generator;
        If None, the random number generator is the RandomState instance used
        by `np.random`.

    Returns
    -------
    data : Bunch
        Dictionary-like object, the interesting attributes are: either
        data, the raw text data to learn, or 'filenames', the files
        holding it, 'target', the classification labels (integer index),
        'target_names', the meaning of the labels, and 'DESCR', the full
        description of the dataset.
    

In [7]:
all_twenty_train = load_files('../datasets/20news-bydate-train/',
  encoding='latin-1', random_state=42)
all_twenty_test = load_files('../datasets/20news-bydate-test/',
    encoding='latin-1', random_state=42)

In [8]:
all_target_names = all_twenty_train.target_names
all_target_names


Out[8]:
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [9]:
all_twenty_train.target


Out[9]:
array([12,  6,  9, ...,  9,  1, 12])

In [10]:
all_twenty_train.target.shape


Out[10]:
(11314,)

In [11]:
all_twenty_test.target.shape


Out[11]:
(7532,)

In [12]:
len(all_twenty_train.data)


Out[12]:
11314

In [13]:
type(all_twenty_train.data[0])


Out[13]:
unicode

In [14]:
def display_sample(i, dataset):
    print("Class name: " + dataset.target_names[dataset.target[i]])
    print("Text content:\n")
    print(dataset.data[i])

In [15]:
display_sample(0, all_twenty_train)


Class name: sci.electronics
Text content:

From: wtm@uhura.neoucom.edu (Bill Mayhew)
Subject: Re: How to the disks copy protected.
Organization: Northeastern Ohio Universities College of Medicine
Lines: 23

Write a good manual to go with the software.  The hassle of
photocopying the manual is offset by simplicity of purchasing
the package for only $15.  Also, consider offering an inexpensive
but attractive perc for registered users.  For instance, a coffee
mug.  You could produce and mail the incentive for a couple of
dollars, so consider pricing the product at $17.95.

You're lucky if only 20% of the instances of your program in use
are non-licensed users.

The best approach is to estimate your loss and accomodate that into
your price structure.  Sure it hurts legitimate users, but too bad.
Retailers have to charge off loss to shoplifters onto paying
customers; the software industry is the same.

Unless your product is exceptionally unique, using an ostensibly
copy-proof disk will just send your customers to the competetion.


-- 
Bill Mayhew      NEOUCOM Computer Services Department
Rootstown, OH  44272-9995  USA    phone: 216-325-2511
wtm@uhura.neoucom.edu (140.220.1.1)    146.580: N8WED


In [16]:
display_sample(1, all_twenty_train)


Class name: misc.forsale
Text content:

From: andy@SAIL.Stanford.EDU (Andy Freeman)
Subject: Re: Catalog of Hard-to-Find PC Enhancements (Repost)
Organization: Computer Science Department,  Stanford University.
Lines: 33

>andy@SAIL.Stanford.EDU (Andy Freeman) writes:
>> >In article <C5ELME.4z4@unix.portal.com> jdoll@shell.portal.com (Joe Doll) wr
>> >>   "The Catalog of Personal Computing Tools for Engineers and Scien-
>> >>   tists" lists hardware cards and application software packages for 
>> >>   PC/XT/AT/PS/2 class machines.  Focus is on engineering and scien-
>> >>   tific applications of PCs, such as data acquisition/control, 
>> >>   design automation, and data analysis and presentation.  
>> >
>> >>   If you would like a free copy, reply with your (U. S. Postal) 
>> >>   mailing address.
>> 
>> Don't bother - it never comes.  It's a cheap trick for building a
>> mailing list to sell if my junk mail flow is any indication.
>> 
>> -andy sent his address months ago
>
>Perhaps we can get Portal to nuke this weasal.  I never received a 
>catalog either.  If that person doesn't respond to a growing flame, then 
>we can assume that we'yall look forward to lotsa junk mail.

I don't want him nuked, I want him to be honest.  The junk mail has
been much more interesting than the promised catalog.  If I'd known
what I was going to get, I wouldn't have hesitated.  I wouldn't be
surprised if there were other folks who looked at the ad and said
"nope" but who would be very interested in the junk mail that results.
Similarly, there are people who wanted the advertised catalog who
aren't happy with the junk they got instead.

The folks buying the mailing lists would prefer an honest ad, and
so would the people reading it.

-andy
--

Let's compute the (uncompressed, in-memory) size of the training and test sets in MB assuming an 8-bit encoding (in this case, all chars can be encoded using the latin-1 charset).


In [17]:
def text_size(text, charset='iso-8859-1'):
    # latin-1 encodes each character as a single byte
    return len(text.encode(charset)) * 1e-6

train_size_mb = sum(text_size(text) for text in all_twenty_train.data) 
test_size_mb = sum(text_size(text) for text in all_twenty_test.data)

print("Training set size: {0} MB".format(int(train_size_mb)))
print("Testing set size: {0} MB".format(int(test_size_mb)))


Training set size: 22 MB
Testing set size: 13 MB

If we only consider a small subset of the 4 categories selected from the initial example:


In [18]:
train_small_size_mb = sum(text_size(text) for text in twenty_train_small.data) 
test_small_size_mb = sum(text_size(text) for text in twenty_test_small.data)

print("Training set size: {0} MB".format(int(train_small_size_mb)))
print("Testing set size: {0} MB".format(int(test_small_size_mb)))


Training set size: 3 MB
Testing set size: 2 MB

Extracting Text Features


In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

TfidfVectorizer()


Out[19]:
TfidfVectorizer(analyzer=u'word', binary=False, charset=None,
        charset_error=None, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=u'l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [20]:
vectorizer = TfidfVectorizer(min_df=1)

%time X_train_small = vectorizer.fit_transform(twenty_train_small.data)


CPU times: user 870 ms, sys: 46.9 ms, total: 917 ms
Wall time: 897 ms

The result is not a numpy.array but a scipy.sparse matrix (similar to the DocumentTermMatrix in R's tm library). This data structure behaves much like a 2D numpy array but does not store the zeros.
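As a rough illustration (an addition to the notebook), the sketch below compares the memory footprint of the sparse representation with the dense array it would expand to; the exact numbers depend on the vectorizer settings used above.

# CSR only stores the non-zero values plus their column indices and row
# pointers, while a dense float64 array would store every cell.
sparse_bytes = (X_train_small.data.nbytes
                + X_train_small.indices.nbytes
                + X_train_small.indptr.nbytes)
dense_bytes = X_train_small.shape[0] * X_train_small.shape[1] * 8  # float64

print("sparse: {0:.1f} MB, dense: {1:.1f} MB".format(
    sparse_bytes * 1e-6, dense_bytes * 1e-6))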


In [21]:
X_train_small


Out[21]:
<2034x34118 sparse matrix of type '<type 'numpy.float64'>'
	with 323433 stored elements in Compressed Sparse Row format>

scipy.sparse matrices also have a shape attribute to access the dimensions:


In [22]:
n_samples, n_features = X_train_small.shape

This dataset has around 2000 samples (the rows of the data matrix):


In [23]:
n_samples


Out[23]:
2034

This is the same value as the number of strings in the original list of text documents:


In [24]:
len(twenty_train_small.data)


Out[24]:
2034

The columns represent the individual token occurrences:


In [25]:
n_features


Out[25]:
34118

This number is the size of the vocabulary extracted by the model during fit and stored in a Python dictionary:


In [26]:
type(vectorizer.vocabulary_)


Out[26]:
dict

In [27]:
len(vectorizer.vocabulary_)


Out[27]:
34118

The keys of the vocabulary_ attribute are also called feature names and can be accessed as a list of strings.


In [28]:
len(vectorizer.get_feature_names())


Out[28]:
34118

Here are the first 10 elements (sorted in lexicographical order):


In [29]:
vectorizer.get_feature_names()[:10]


Out[29]:
[u'00',
 u'000',
 u'0000',
 u'00000',
 u'000000',
 u'000005102000',
 u'000021',
 u'000062david42',
 u'0000vec',
 u'0001']

Let's have a look at the features from the middle:


In [30]:
vectorizer.get_feature_names()[n_features / 2:n_features / 2 + 10]


Out[30]:
[u'inadequate',
 u'inala',
 u'inalienable',
 u'inane',
 u'inanimate',
 u'inapplicable',
 u'inappropriate',
 u'inappropriately',
 u'inaudible',
 u'inbreeding']

Now that we have extracted a vector representation of the data, it's a good idea to project the data onto the first two components of a Singular Value Decomposition (i.e. Principal Component Analysis) to get a feel for the data. Note that the TruncatedSVD class can accept scipy.sparse matrices as input (as an alternative to numpy arrays):


In [31]:
from sklearn.decomposition import TruncatedSVD

%time X_train_small_pca = TruncatedSVD(n_components=2).fit_transform(X_train_small)


CPU times: user 136 ms, sys: 13.8 ms, total: 150 ms
Wall time: 143 ms

In [32]:
from itertools import cycle

colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k']
for i, c in zip(np.unique(y_train), cycle(colors)):
    plt.scatter(X_train_small_pca[y_train == i, 0],
               X_train_small_pca[y_train == i, 1],
               c=c, label=twenty_train_small.target_names[i], alpha=0.5)
    
_ = plt.legend(loc='best')


We can observe that there is a large overlap of the samples from different categories. This is to be expected as the PCA linear projection maps the data from a 34118-dimensional space down to 2 dimensions: data that is linearly separable in 34118 dimensions is often no longer linearly separable in 2D.

Still, we can notice an interesting pattern: the religion and atheism newsgroups occupy much the same region, and the computer graphics and space science newsgroups overlap more with each other than they do with the religion or atheism newsgroups.

Training a Classifier on Text Features

We have previously extracted a vector representation of the training corpus and stored it in a variable named X_train_small. To train a supervised model, in this case a classifier, we also need the target labels:


In [33]:
y_train_small = twenty_train_small.target

In [34]:
y_train_small.shape


Out[34]:
(2034,)

In [35]:
y_train_small


Out[35]:
array([1, 2, 2, ..., 2, 1, 1])

We can check that we have the same number of samples for the input data and the labels:


In [36]:
X_train_small.shape[0] == y_train_small.shape[0]


Out[36]:
True

We can now train a classifier, for instance a Multinomial Naive Bayes classifier:


In [37]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB(alpha=0.1)
clf


Out[37]:
MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)

In [38]:
clf.fit(X_train_small, y_train_small)


Out[38]:
MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)

We can now evaluate the classifier on the testing set. Let's first use the built-in score method, which computes the rate of correct classification on the test set:


In [39]:
X_test_small = vectorizer.transform(twenty_test_small.data)
y_test_small = twenty_test_small.target

In [40]:
X_test_small.shape


Out[40]:
(1353, 34118)

In [41]:
y_test_small.shape


Out[41]:
(1353,)

In [42]:
clf.score(X_test_small, y_test_small)


Out[42]:
0.89652623798965259

We can also compute the score on the training set and observe that the model is both overfitting and underfitting a bit at the same time:


In [43]:
clf.score(X_train_small, y_train_small)


Out[43]:
0.99262536873156337

Introspecting the Behavior of the Text Vectorizer

The text vectorizer has many parameters to customize its behavior, in particular how it extracts tokens:


In [44]:
TfidfVectorizer()


Out[44]:
TfidfVectorizer(analyzer=u'word', binary=False, charset=None,
        charset_error=None, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=u'l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [45]:
print(TfidfVectorizer.__doc__)


Convert a collection of raw documents to a matrix of TF-IDF features.

    Equivalent to CountVectorizer followed by TfidfTransformer.

    Parameters
    ----------
    input : string {'filename', 'file', 'content'}
        If filename, the sequence passed as an argument to fit is
        expected to be a list of filenames that need reading to fetch
        the raw content to analyze.

        If 'file', the sequence items must have 'read' method (file-like
        object) it is called to fetch the bytes in memory.

        Otherwise the input is expected to be the sequence strings or
        bytes items are expected to be analyzed directly.

    encoding : string, 'utf-8' by default.
        If bytes or files are given to analyze, this encoding is used to
        decode.

    decode_error : {'strict', 'ignore', 'replace'}
        Instruction on what to do if a byte sequence is given to analyze that
        contains characters not of the given `encoding`. By default, it is
        'strict', meaning that a UnicodeDecodeError will be raised. Other
        values are 'ignore' and 'replace'.

    strip_accents : {'ascii', 'unicode', None}
        Remove accents during the preprocessing step.
        'ascii' is a fast method that only works on characters that have
        an direct ASCII mapping.
        'unicode' is a slightly slower method that works on any characters.
        None (default) does nothing.

    analyzer : string, {'word', 'char'} or callable
        Whether the feature should be made of word or character n-grams.

        If a callable is passed it is used to extract the sequence of features
        out of the raw, unprocessed input.

    preprocessor : callable or None (default)
        Override the preprocessing (string transformation) stage while
        preserving the tokenizing and n-grams generation steps.

    tokenizer : callable or None (default)
        Override the string tokenization step while preserving the
        preprocessing and n-grams generation steps.

    ngram_range : tuple (min_n, max_n)
        The lower and upper boundary of the range of n-values for different
        n-grams to be extracted. All values of n such that min_n <= n <= max_n
        will be used.

    stop_words : string {'english'}, list, or None (default)
        If a string, it is passed to _check_stop_list and the appropriate stop
        list is returned. 'english' is currently the only supported string
        value.

        If a list, that list is assumed to contain stop words, all of which
        will be removed from the resulting tokens.

        If None, no stop words will be used. max_df can be set to a value
        in the range [0.7, 1.0) to automatically detect and filter stop
        words based on intra corpus document frequency of terms.

    lowercase : boolean, default True
        Convert all characters to lowercase befor tokenizing.

    token_pattern : string
        Regular expression denoting what constitutes a "token", only used
        if `tokenize == 'word'`. The default regexp select tokens of 2
        or more letters characters (punctuation is completely ignored
        and always treated as a token separator).

    max_df : float in range [0.0, 1.0] or int, optional, 1.0 by default
        When building the vocabulary ignore terms that have a term frequency
        strictly higher than the given threshold (corpus specific stop words).
        If float, the parameter represents a proportion of documents, integer
        absolute counts.
        This parameter is ignored if vocabulary is not None.

    min_df : float in range [0.0, 1.0] or int, optional, 1 by default
        When building the vocabulary ignore terms that have a term frequency
        strictly lower than the given threshold.
        This value is also called cut-off in the literature.
        If float, the parameter represents a proportion of documents, integer
        absolute counts.
        This parameter is ignored if vocabulary is not None.

    max_features : optional, None by default
        If not None, build a vocabulary that only consider the top
        max_features ordered by term frequency across the corpus.

        This parameter is ignored if vocabulary is not None.

    vocabulary : Mapping or iterable, optional
        Either a Mapping (e.g., a dict) where keys are terms and values are
        indices in the feature matrix, or an iterable over terms. If not
        given, a vocabulary is determined from the input documents.

    binary : boolean, False by default.
        If True, all non zero counts are set to 1. This is useful for discrete
        probabilistic models that model binary events rather than integer
        counts.

    dtype : type, optional
        Type of the matrix returned by fit_transform() or transform().

    norm : 'l1', 'l2' or None, optional
        Norm used to normalize term vectors. None for no normalization.

    use_idf : boolean, optional
        Enable inverse-document-frequency reweighting.

    smooth_idf : boolean, optional
        Smooth idf weights by adding one to document frequencies, as if an
        extra document was seen containing every term in the collection
        exactly once. Prevents zero divisions.

    sublinear_tf : boolean, optional
        Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).

    See also
    --------
    CountVectorizer
        Tokenize the documents and count the occurrences of token and return
        them as a sparse matrix

    TfidfTransformer
        Apply Term Frequency Inverse Document Frequency normalization to a
        sparse matrix of occurrence counts.

    

The easiest way to introspect what the vectorizer is actually doing for a given set of parameters is to call vectorizer.build_analyzer() to get an instance of the text analyzer it uses to process the text:


In [46]:
analyzer = TfidfVectorizer().build_analyzer()
analyzer("I love scikit-learn: this is a cool Python lib!")


Out[46]:
[u'love', u'scikit', u'learn', u'this', u'is', u'cool', u'python', u'lib']

You can notice that all the tokens are lowercase, that the single-letter word "I" was dropped, and that the hyphen was treated as a token separator (splitting "scikit-learn" into "scikit" and "learn"). Let's change some of that default behavior:


In [47]:
analyzer = TfidfVectorizer(
    preprocessor=lambda text: text,  # disable lowercasing
    token_pattern=ur'(?u)\b[\w-]+\b', # treat hyphen as a letter
                                      # do not exclude single letter tokens
).build_analyzer()

analyzer("I love scikit-learn: this is a cool Python lib!")


Out[47]:
[u'I',
 u'love',
 u'scikit-learn',
 u'this',
 u'is',
 u'a',
 u'cool',
 u'Python',
 u'lib']

The analyzer name comes from Lucene parlance: it wraps the sequential application of:

  • text preprocessing (processing the text documents as a whole, e.g. lowercasing)
  • text tokenization (splitting the document into a sequence of tokens)
  • token filtering and recombination (e.g. n-grams extraction, see later)

The analyzer system of scikit-learn is much more basic than Lucene's, though.
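The sketch below (an added illustration using the vectorizer's public build_* helpers) shows the first two stages in isolation next to the combined analyzer:

vec = TfidfVectorizer()

preprocess = vec.build_preprocessor()  # document-level transformation (lowercasing here)
tokenize = vec.build_tokenizer()       # regexp-based splitting into tokens
analyze = vec.build_analyzer()         # preprocessing + tokenization (+ n-gram extraction)

text = "I love scikit-learn: this is a cool Python lib!"
print(preprocess(text))  # lowercased string, punctuation untouched
print(tokenize(text))    # tokens with original case (single letters dropped)
print(analyze(text))     # lowercased tokens, as produced by the default analyzer earlier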

Exercise:

  • Write a pre-processor callable (e.g. a Python function) to remove the headers from the text of a newsgroup post.
  • Vectorize the data again and measure the impact on performance of removing the header info from the dataset.
  • Do you expect the performance of the model to improve or decrease? What is the score of a uniform random classifier on the same dataset?

Hint 1: As when we used Naïve Bayes in R, these messages have headers that are separated from the message by a blank line. In other words, find the first blank line ('\n\n') and take everything after that. (The find or index function may be of help here.)

Hint 2: the TfidfVectorizer class can accept python functions to customize the preprocessor, tokenizer or analyzer stages of the vectorizer.

  • type TfidfVectorizer() alone in a cell to see the default value of the parameters

  • type TfidfVectorizer.__doc__ to print the constructor parameters documentation, or use the ? suffix operator on any Python class or method to read the docstring, or even the ?? operator to read the source code.


In [47]:
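A minimal sketch of one possible solution following Hint 1 (the strip_headers helper is an illustrative name, not part of any library):

def strip_headers(post):
    """Return the body of a newsgroup post: everything after the first
    blank line (posts without a blank line are returned unchanged)."""
    index = post.find('\n\n')
    return post[index + 2:] if index >= 0 else post

# Passing a custom preprocessor replaces the default one (which lowercases),
# so lowercase explicitly to keep the rest of the behavior unchanged.
vectorizer_no_headers = TfidfVectorizer(
    min_df=2, preprocessor=lambda text: strip_headers(text).lower())

X_train_nh = vectorizer_no_headers.fit_transform(twenty_train_small.data)
X_test_nh = vectorizer_no_headers.transform(twenty_test_small.data)

clf_nh = MultinomialNB(alpha=0.1).fit(X_train_nh, y_train_small)
print("Testing score without headers: {0:.1f}%".format(
    clf_nh.score(X_test_nh, y_test_small) * 100))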

Model Selection of the Naive Bayes Classifier Parameter Alone

The MultinomialNB class is a good baseline classifier for text as it's fast and has few parameters to tweak:


In [48]:
MultinomialNB()


Out[48]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [49]:
print(MultinomialNB.__doc__)


    Naive Bayes classifier for multinomial models

    The multinomial Naive Bayes classifier is suitable for classification with
    discrete features (e.g., word counts for text classification). The
    multinomial distribution normally requires integer feature counts. However,
    in practice, fractional counts such as tf-idf may also work.

    Parameters
    ----------
    alpha : float, optional (default=1.0)
        Additive (Laplace/Lidstone) smoothing parameter
        (0 for no smoothing).

    fit_prior : boolean
        Whether to learn class prior probabilities or not.
        If false, a uniform prior will be used.

    class_prior : array-like, size (n_classes,)
        Prior probabilities of the classes. If specified the priors are not
        adjusted according to the data.

    Attributes
    ----------
    `class_log_prior_` : array, shape (n_classes, )
        Smoothed empirical log probability for each class.

    `intercept_` : property
        Mirrors ``class_log_prior_`` for interpreting MultinomialNB
        as a linear model.

    `feature_log_prob_`: array, shape (n_classes, n_features)
        Empirical log probability of features
        given a class, ``P(x_i|y)``.

    `coef_` : property
        Mirrors ``feature_log_prob_`` for interpreting MultinomialNB
        as a linear model.

    `class_count_` : array, shape (n_classes,)
        Number of samples encountered for each class during fitting. This
        value is weighted by the sample weight when provided.

    `feature_count_` : array, shape (n_classes, n_features)
        Number of samples encountered for each (class, feature)
        during fitting. This value is weighted by the sample weight when
        provided.

    Examples
    --------
    >>> import numpy as np
    >>> X = np.random.randint(5, size=(6, 100))
    >>> y = np.array([1, 2, 3, 4, 5, 6])
    >>> from sklearn.naive_bayes import MultinomialNB
    >>> clf = MultinomialNB()
    >>> clf.fit(X, y)
    MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
    >>> print(clf.predict(X[2]))
    [3]

    Notes
    -----
    For the rationale behind the names `coef_` and `intercept_`, i.e.
    naive Bayes as a linear classifier, see J. Rennie et al. (2003),
    Tackling the poor assumptions of naive Bayes text classifiers, ICML.

    References
    ----------
    C.D. Manning, P. Raghavan and H. Schütze (2008). Introduction to
    Information Retrieval. Cambridge University Press, pp. 234–265.
    http://nlp.stanford.edu/IR-book/html/htmledition/
        naive-bayes-text-classification-1.html
    

By reading the doc we can see that the alpha parameter is a good candidate to adjust the model for the bias (underfitting) vs variance (overfitting) trade-off.

Exercise:

  • use the sklearn.grid_search.GridSearchCV or the model_selection.RandomizedGridSearch utility function from the previous chapters to find a good value for the parameter alpha
  • plot the validation scores (and optionally the training scores) for each value of alpha and identify the regions where the model overfits or underfits.

Hints:

  • you can search for values of alpha in the range $[10^{-5} - 10^{0}]$ using a logarithmic scale
  • RandomizedGridSearch also has a launch_for_arrays method as an alternative to launch_for_splits in case the CV splits have not been precomputed in advance.

In [49]:
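One possible sketch (an added illustration): instead of GridSearchCV, this version simply loops over candidate values of alpha with cross_val_score, which makes the validation curve easy to plot:

import numpy as np
from sklearn.cross_validation import cross_val_score

alphas = np.logspace(-5, 0, 6)
mean_scores = []
for alpha in alphas:
    scores = cross_val_score(MultinomialNB(alpha=alpha),
                             X_train_small, y_train_small, cv=3)
    mean_scores.append(scores.mean())

plt.semilogx(alphas, mean_scores)
plt.xlabel('alpha (log scale)')
plt.ylabel('mean cross-validated accuracy')

print("best alpha: {0}".format(alphas[np.argmax(mean_scores)]))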

Setting Up a Pipeline for Cross Validation and Model Selection of the Feature Extraction parameters

The feature extraction class has many options to customize its behavior:


In [50]:
print(TfidfVectorizer.__doc__)


Convert a collection of raw documents to a matrix of TF-IDF features.

    Equivalent to CountVectorizer followed by TfidfTransformer.

    Parameters
    ----------
    input : string {'filename', 'file', 'content'}
        If filename, the sequence passed as an argument to fit is
        expected to be a list of filenames that need reading to fetch
        the raw content to analyze.

        If 'file', the sequence items must have 'read' method (file-like
        object) it is called to fetch the bytes in memory.

        Otherwise the input is expected to be the sequence strings or
        bytes items are expected to be analyzed directly.

    encoding : string, 'utf-8' by default.
        If bytes or files are given to analyze, this encoding is used to
        decode.

    decode_error : {'strict', 'ignore', 'replace'}
        Instruction on what to do if a byte sequence is given to analyze that
        contains characters not of the given `encoding`. By default, it is
        'strict', meaning that a UnicodeDecodeError will be raised. Other
        values are 'ignore' and 'replace'.

    strip_accents : {'ascii', 'unicode', None}
        Remove accents during the preprocessing step.
        'ascii' is a fast method that only works on characters that have
        an direct ASCII mapping.
        'unicode' is a slightly slower method that works on any characters.
        None (default) does nothing.

    analyzer : string, {'word', 'char'} or callable
        Whether the feature should be made of word or character n-grams.

        If a callable is passed it is used to extract the sequence of features
        out of the raw, unprocessed input.

    preprocessor : callable or None (default)
        Override the preprocessing (string transformation) stage while
        preserving the tokenizing and n-grams generation steps.

    tokenizer : callable or None (default)
        Override the string tokenization step while preserving the
        preprocessing and n-grams generation steps.

    ngram_range : tuple (min_n, max_n)
        The lower and upper boundary of the range of n-values for different
        n-grams to be extracted. All values of n such that min_n <= n <= max_n
        will be used.

    stop_words : string {'english'}, list, or None (default)
        If a string, it is passed to _check_stop_list and the appropriate stop
        list is returned. 'english' is currently the only supported string
        value.

        If a list, that list is assumed to contain stop words, all of which
        will be removed from the resulting tokens.

        If None, no stop words will be used. max_df can be set to a value
        in the range [0.7, 1.0) to automatically detect and filter stop
        words based on intra corpus document frequency of terms.

    lowercase : boolean, default True
        Convert all characters to lowercase befor tokenizing.

    token_pattern : string
        Regular expression denoting what constitutes a "token", only used
        if `tokenize == 'word'`. The default regexp select tokens of 2
        or more letters characters (punctuation is completely ignored
        and always treated as a token separator).

    max_df : float in range [0.0, 1.0] or int, optional, 1.0 by default
        When building the vocabulary ignore terms that have a term frequency
        strictly higher than the given threshold (corpus specific stop words).
        If float, the parameter represents a proportion of documents, integer
        absolute counts.
        This parameter is ignored if vocabulary is not None.

    min_df : float in range [0.0, 1.0] or int, optional, 1 by default
        When building the vocabulary ignore terms that have a term frequency
        strictly lower than the given threshold.
        This value is also called cut-off in the literature.
        If float, the parameter represents a proportion of documents, integer
        absolute counts.
        This parameter is ignored if vocabulary is not None.

    max_features : optional, None by default
        If not None, build a vocabulary that only consider the top
        max_features ordered by term frequency across the corpus.

        This parameter is ignored if vocabulary is not None.

    vocabulary : Mapping or iterable, optional
        Either a Mapping (e.g., a dict) where keys are terms and values are
        indices in the feature matrix, or an iterable over terms. If not
        given, a vocabulary is determined from the input documents.

    binary : boolean, False by default.
        If True, all non zero counts are set to 1. This is useful for discrete
        probabilistic models that model binary events rather than integer
        counts.

    dtype : type, optional
        Type of the matrix returned by fit_transform() or transform().

    norm : 'l1', 'l2' or None, optional
        Norm used to normalize term vectors. None for no normalization.

    use_idf : boolean, optional
        Enable inverse-document-frequency reweighting.

    smooth_idf : boolean, optional
        Smooth idf weights by adding one to document frequencies, as if an
        extra document was seen containing every term in the collection
        exactly once. Prevents zero divisions.

    sublinear_tf : boolean, optional
        Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).

    See also
    --------
    CountVectorizer
        Tokenize the documents and count the occurrences of token and return
        them as a sparse matrix

    TfidfTransformer
        Apply Term Frequency Inverse Document Frequency normalization to a
        sparse matrix of occurrence counts.

    

In order to evaluate the impact of the feature extraction parameters, one can chain a configured feature extractor and a linear classifier (as an alternative to the naive Bayes model):


In [51]:
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline((
    ('vec', TfidfVectorizer(min_df=1, max_df=0.8, use_idf=True)),
    ('clf', PassiveAggressiveClassifier(C=1)),
))

Such a pipeline can then be cross validated or even grid searched:


In [52]:
from sklearn.cross_validation import cross_val_score
from scipy.stats import sem

scores = cross_val_score(pipeline, twenty_train_small.data,
                         twenty_train_small.target, cv=3, n_jobs=-1)
scores.mean(), sem(scores)


Out[52]:
(0.96116027531956727, 0.0026015253796112022)

For the grid search, the parameter names are prefixed with the name of the pipeline step, using "__" as a separator:


In [53]:
from sklearn.grid_search import GridSearchCV

parameters = {
    #'vec__min_df': [1, 2],
    'vec__max_df': [0.8, 1.0],
    'vec__ngram_range': [(1, 1), (1, 2)],
    'vec__use_idf': [True, False],
}

gs = GridSearchCV(pipeline, parameters, verbose=2, refit=False)
_ = gs.fit(twenty_train_small.data, twenty_train_small.target)


Fitting 3 folds for each of 8 candidates, totalling 24 fits
[GridSearchCV] vec__max_df=0.8, vec__ngram_range=(1, 1), vec__use_idf=True .....
[GridSearchCV]  vec__max_df=0.8, vec__ngram_range=(1, 1), vec__use_idf=True -   1.2s
[GridSearchCV] vec__max_df=0.8, vec__ngram_range=(1, 1), vec__use_idf=True .....
[GridSearchCV]  vec__max_df=0.8, vec__ngram_range=(1, 1), vec__use_idf=True -   1.1s
[GridSearchCV] vec__max_df=0.8, vec__ngram_range=(1, 1), vec__use_idf=True .....
[GridSearchCV]  vec__max_df=0.8, vec__ngram_range=(1, 1), vec__use_idf=True -   1.2s
[GridSearchCV] vec__max_df=0.8, vec__ngram_range=(1, 1), vec__use_idf=False ....
[GridSearchCV]  vec__max_df=0.8, vec__ngram_range=(1, 1), vec__use_idf=False -   1.2s
[GridSearchCV] vec__max_df=0.8, vec__ngram_range=(1, 1), vec__use_idf=False ....
[GridSearchCV]  vec__max_df=0.8, vec__ngram_range=(1, 1), vec__use_idf=False -   1.2s
[GridSearchCV] vec__max_df=0.8, vec__ngram_range=(1, 1), vec__use_idf=False ....
[GridSearchCV]  vec__max_df=0.8, vec__ngram_range=(1, 1), vec__use_idf=False -   1.2s
[GridSearchCV] vec__max_df=0.8, vec__ngram_range=(1, 2), vec__use_idf=True .....
[GridSearchCV]  vec__max_df=0.8, vec__ngram_range=(1, 2), vec__use_idf=True -   3.6s
[GridSearchCV] vec__max_df=0.8, vec__ngram_range=(1, 2), vec__use_idf=True .....
[GridSearchCV]  vec__max_df=0.8, vec__ngram_range=(1, 2), vec__use_idf=True -   3.5s
[GridSearchCV] vec__max_df=0.8, vec__ngram_range=(1, 2), vec__use_idf=True .....
[GridSearchCV]  vec__max_df=0.8, vec__ngram_range=(1, 2), vec__use_idf=True -   3.5s
[GridSearchCV] vec__max_df=0.8, vec__ngram_range=(1, 2), vec__use_idf=False ....
[GridSearchCV]  vec__max_df=0.8, vec__ngram_range=(1, 2), vec__use_idf=False -   3.4s
[GridSearchCV] vec__max_df=0.8, vec__ngram_range=(1, 2), vec__use_idf=False ....
[GridSearchCV]  vec__max_df=0.8, vec__ngram_range=(1, 2), vec__use_idf=False -   3.4s
[GridSearchCV] vec__max_df=0.8, vec__ngram_range=(1, 2), vec__use_idf=False ....
[GridSearchCV]  vec__max_df=0.8, vec__ngram_range=(1, 2), vec__use_idf=False -   3.4s
[GridSearchCV] vec__max_df=1.0, vec__ngram_range=(1, 1), vec__use_idf=True .....
[GridSearchCV]  vec__max_df=1.0, vec__ngram_range=(1, 1), vec__use_idf=True -   0.9s
[GridSearchCV] vec__max_df=1.0, vec__ngram_range=(1, 1), vec__use_idf=True .....
[GridSearchCV]  vec__max_df=1.0, vec__ngram_range=(1, 1), vec__use_idf=True -   1.0s
[GridSearchCV] vec__max_df=1.0, vec__ngram_range=(1, 1), vec__use_idf=True .....
[GridSearchCV]  vec__max_df=1.0, vec__ngram_range=(1, 1), vec__use_idf=True -   1.1s
[GridSearchCV] vec__max_df=1.0, vec__ngram_range=(1, 1), vec__use_idf=False ....
[GridSearchCV]  vec__max_df=1.0, vec__ngram_range=(1, 1), vec__use_idf=False -   1.2s
[GridSearchCV] vec__max_df=1.0, vec__ngram_range=(1, 1), vec__use_idf=False ....
[GridSearchCV]  vec__max_df=1.0, vec__ngram_range=(1, 1), vec__use_idf=False -   1.2s
[GridSearchCV] vec__max_df=1.0, vec__ngram_range=(1, 1), vec__use_idf=False ....
[GridSearchCV]  vec__max_df=1.0, vec__ngram_range=(1, 1), vec__use_idf=False -   1.1s
[GridSearchCV] vec__max_df=1.0, vec__ngram_range=(1, 2), vec__use_idf=True .....
[GridSearchCV]  vec__max_df=1.0, vec__ngram_range=(1, 2), vec__use_idf=True -   4.1s
[GridSearchCV] vec__max_df=1.0, vec__ngram_range=(1, 2), vec__use_idf=True .....
[GridSearchCV]  vec__max_df=1.0, vec__ngram_range=(1, 2), vec__use_idf=True -   4.2s
[GridSearchCV] vec__max_df=1.0, vec__ngram_range=(1, 2), vec__use_idf=True .....
[GridSearchCV]  vec__max_df=1.0, vec__ngram_range=(1, 2), vec__use_idf=True -   3.7s
[GridSearchCV] vec__max_df=1.0, vec__ngram_range=(1, 2), vec__use_idf=False ....
[GridSearchCV]  vec__max_df=1.0, vec__ngram_range=(1, 2), vec__use_idf=False -   3.2s
[GridSearchCV] vec__max_df=1.0, vec__ngram_range=(1, 2), vec__use_idf=False ....
[GridSearchCV]  vec__max_df=1.0, vec__ngram_range=(1, 2), vec__use_idf=False -   3.8s
[GridSearchCV] vec__max_df=1.0, vec__ngram_range=(1, 2), vec__use_idf=False ....
[GridSearchCV]  vec__max_df=1.0, vec__ngram_range=(1, 2), vec__use_idf=False -   4.0s
[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    1.3s
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:   58.5s finished


In [54]:
gs.best_score_


Out[54]:
0.96116027531956738

In [55]:
gs.best_params_


Out[55]:
{'vec__max_df': 0.8, 'vec__ngram_range': (1, 1), 'vec__use_idf': True}

Introspecting Model Performance

Displaying the Most Discriminative Features

Let's fit a model on the small dataset and collect info on the fitted components:


In [56]:
_ = pipeline.fit(twenty_train_small.data, twenty_train_small.target)

In [57]:
vec_name, vec = pipeline.steps[0]
clf_name, clf = pipeline.steps[1]

feature_names = vec.get_feature_names()
target_names = twenty_train_small.target_names

feature_weights = clf.coef_

feature_weights.shape


Out[57]:
(4, 34109)

By sorting the feature weights of the linear model and asking the vectorizer for their names, one can get a clue about what the model actually learned from the data:


In [58]:
def display_important_features(feature_names, target_names, weights, n_top=30):
    for i, target_name in enumerate(target_names):
        print("Class: " + target_name)
        print("")
        
        sorted_features_indices = weights[i].argsort()[::-1]
        
        most_important = sorted_features_indices[:n_top]
        print(", ".join("{0}: {1:.4f}".format(feature_names[j], weights[i, j])
                        for j in most_important))
        print("...")
        
        least_important = sorted_features_indices[-n_top:]
        print(", ".join("{0}: {1:.4f}".format(feature_names[j], weights[i, j])
                        for j in least_important))
        print("")
        
display_important_features(feature_names, target_names, feature_weights)


Class: alt.atheism

atheism: 2.8369, atheists: 2.7697, keith: 2.6781, cobb: 2.1986, islamic: 1.7952, okcforum: 1.6646, caltech: 1.5838, rice: 1.5769, bobby: 1.5187, peace: 1.5151, freedom: 1.4775, wingate: 1.4733, tammy: 1.4702, enviroleague: 1.4619, atheist: 1.4277, psilink: 1.3985, rushdie: 1.3846, tek: 1.3809, jaeger: 1.3783, osrhe: 1.3591, bible: 1.3543, wwc: 1.3375, mangoe: 1.3324, perry: 1.3082, religion: 1.2733, benedikt: 1.2581, liar: 1.2288, lunatic: 1.2110, free: 1.2060, charley: 1.2006
...
good: -0.8709, dm: -0.8764, 10: -0.8786, brian: -0.8900, objective: -0.8986, deal: -0.9098, thanks: -0.9174, order: -0.9174, image: -0.9258, scic: -0.9314, force: -0.9314, useful: -0.9377, com: -0.9414, weiss: -0.9428, interested: -0.9465, use: -0.9525, buffalo: -0.9580, fbi: -0.9660, 2000: -0.9810, they: -1.0051, muhammad: -1.0165, out: -1.0520, kevin: -1.0545, org: -1.0908, morality: -1.1773, mail: -1.1945, graphics: -1.5805, christian: -1.6466, hudson: -1.6503, space: -1.8655

Class: comp.graphics

graphics: 4.3650, image: 2.5319, tiff: 1.9232, file: 1.8831, animation: 1.8733, 3d: 1.7270, card: 1.7127, files: 1.6637, 42: 1.6542, 3do: 1.6326, points: 1.6154, code: 1.5795, computer: 1.5767, video: 1.5549, color: 1.5069, polygon: 1.5057, windows: 1.4597, comp: 1.4421, package: 1.3865, format: 1.3183, pc: 1.2518, email: 1.2262, cview: 1.2155, hi: 1.2004, 24: 1.1909, postscript: 1.1827, virtual: 1.1706, sphere: 1.1691, looking: 1.1613, images: 1.1561
...
astronomy: -0.9077, are: -0.9133, who: -0.9217, bill: -0.9354, atheism: -0.9397, org: -0.9404, christian: -0.9489, funding: -0.9494, that: -0.9597, by: -0.9654, solar: -0.9708, access: -0.9722, us: -0.9907, planets: -0.9992, cmu: -1.0507, moon: -1.0730, you: -1.0802, nasa: -1.0859, dgi: -1.1009, jennise: -1.1009, writes: -1.1152, was: -1.1369, beast: -1.1597, dc: -1.2858, he: -1.3806, orbit: -1.3853, edu: -1.4121, re: -1.4396, god: -1.6422, space: -3.5582

Class: sci.space

space: 5.7627, orbit: 2.3450, dc: 2.0973, nasa: 2.0815, moon: 1.9315, launch: 1.8711, sci: 1.7931, alaska: 1.7344, solar: 1.6946, henry: 1.6384, pat: 1.5734, ether: 1.5178, nick: 1.4982, planets: 1.4155, dietz: 1.3681, cmu: 1.3530, aurora: 1.3106, nicho: 1.2958, funding: 1.2768, lunar: 1.2757, astronomy: 1.2595, flight: 1.2418, rockets: 1.2048, jennise: 1.1963, dgi: 1.1963, shuttle: 1.1652, spacecraft: 1.1631, sky: 1.1593, digex: 1.1247, rochester: 1.1080
...
any: -0.8163, computer: -0.8183, gaspra: -0.8261, bible: -0.8342, video: -0.8485, religion: -0.8640, format: -0.8682, fbi: -0.8720, com: -0.8725, card: -0.8737, cc: -0.8828, code: -0.8875, 24: -0.8883, library: -0.8904, sgi: -0.9208, halat: -0.9531, 3d: -0.9607, ___: -0.9630, points: -1.0150, tiff: -1.0278, color: -1.0560, keith: -1.0664, koresh: -1.1302, file: -1.1529, files: -1.1679, image: -1.3169, christian: -1.3767, animation: -1.4241, god: -1.7873, graphics: -2.5640

Class: talk.religion.misc

christian: 3.0979, hudson: 1.8959, who: 1.8842, beast: 1.8652, fbi: 1.6698, mr: 1.6386, buffalo: 1.6148, 2000: 1.5694, abortion: 1.5172, church: 1.5061, koresh: 1.4853, weiss: 1.4829, morality: 1.4750, brian: 1.4736, order: 1.4545, frank: 1.4508, biblical: 1.4123, 666: 1.3742, thyagi: 1.3520, terrorist: 1.3306, christians: 1.3202, mormons: 1.2810, amdahl: 1.2641, blood: 1.2380, freenet: 1.2299, rosicrucian: 1.2122, mitre: 1.2032, christ: 1.1982, objective: 1.1635, love: 1.1519
...
file: -0.9489, saturn: -0.9516, university: -0.9569, on: -0.9592, ac: -0.9685, lunatic: -0.9820, for: -0.9882, orbit: -0.9893, some: -1.0031, anyone: -1.0355, uk: -1.0703, liar: -1.0715, ibm: -1.0965, wwc: -1.1029, thanks: -1.1200, freedom: -1.1455, nasa: -1.1951, free: -1.2008, thing: -1.2337, atheist: -1.2573, princeton: -1.2966, cobb: -1.3150, keith: -1.4660, caltech: -1.4869, graphics: -1.5331, edu: -1.5969, atheism: -1.7381, it: -1.7571, atheists: -1.9418, space: -2.2211

Displaying the per-class Classification Reports


In [59]:
from sklearn.metrics import classification_report

predicted = pipeline.predict(twenty_test_small.data)

In [60]:
print(classification_report(twenty_test_small.target, predicted,
                            target_names=twenty_test_small.target_names))


                    precision    recall  f1-score   support

       alt.atheism       0.87      0.78      0.83       319
     comp.graphics       0.93      0.96      0.95       389
         sci.space       0.95      0.95      0.95       394
talk.religion.misc       0.76      0.80      0.78       251

       avg / total       0.89      0.89      0.89      1353

Printing the Confusion Matrix

The confusion matrix summarizes which classes get confused with one another; looking at the off-diagonal entries, we can see for instance that articles about atheism have been wrongly classified as being about religion 57 times:


In [61]:
from sklearn.metrics import confusion_matrix

confusion_matrix(twenty_test_small.target, predicted)


Out[61]:
array([[250,   5,   7,  57],
       [  2, 375,   6,   6],
       [  2,  15, 375,   2],
       [ 32,   9,   8, 202]])

In [62]:
twenty_test_small.target_names


Out[62]:
['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']
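As an added visual aid (not in the original notebook), the same matrix can be plotted with matplotlib, using the target names as tick labels:

cm = confusion_matrix(twenty_test_small.target, predicted)

plt.matshow(cm, cmap=plt.cm.Blues)
plt.colorbar()
ticks = np.arange(len(twenty_test_small.target_names))
plt.xticks(ticks, twenty_test_small.target_names, rotation=90)
plt.yticks(ticks, twenty_test_small.target_names)
plt.xlabel('predicted label')
plt.ylabel('true label')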

Class Break

Large Scale Text Classification for Sentiment Analysis

Outline of the Session

  • Limitations of the Vocabulary-Based Vectorizer
  • The Hashing Trick
  • Online / Streaming Text Feature Extraction and Classification
  • Parallel Text Feature Extraction and Classification

Scalability Issues

The sklearn.feature_extraction.text.CountVectorizer and sklearn.feature_extraction.text.TfidfVectorizer classes suffer from a number of scalability issues that all stem from the internal usage of the vocabulary_ attribute (a Python dictionary) used to map the unicode string feature names to the integer feature indices.

The main scalability issues are:

  • Memory usage of the text vectorizer: all the string representations of the features are loaded in memory
  • Parallelization problems for text feature extraction: the vocabulary_ would be a shared state: complex synchronization and overhead
  • Impossibility to do online or out-of-core / streaming learning: the vocabulary_ needs to be learned from the data: its size cannot be known before making one pass over the full dataset

To better understand the issue, let's have a look at how the vocabulary_ attribute works. At fit time each token of the corpus is assigned a unique integer index, and this mapping is stored in the vocabulary:


In [63]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=1)

vectorizer.fit([
    "The cat sat on the mat.",
])
vectorizer.vocabulary_


Out[63]:
{u'cat': 0, u'mat': 1, u'on': 2, u'sat': 3, u'the': 4}

The vocabulary is used at transform time to build the occurrence matrix:


In [64]:
X = vectorizer.transform([
    "The cat sat on the mat.",
    "This cat is a nice cat.",
]).toarray()

print(len(vectorizer.vocabulary_))
print(vectorizer.get_feature_names())
print(X)


5
[u'cat', u'mat', u'on', u'sat', u'the']
[[1 1 1 1 2]
 [2 0 0 0 0]]

Let's refit with a slightly larger corpus:


In [65]:
vectorizer = CountVectorizer(min_df=1)

vectorizer.fit([
    "The cat sat on the mat.",
    "The quick brown fox jumps over the lazy dog.",
])
vectorizer.vocabulary_


Out[65]:
{u'brown': 0,
 u'cat': 1,
 u'dog': 2,
 u'fox': 3,
 u'jumps': 4,
 u'lazy': 5,
 u'mat': 6,
 u'on': 7,
 u'over': 8,
 u'quick': 9,
 u'sat': 10,
 u'the': 11}

The vocabulary_ grows (roughly logarithmically) with the size of the training corpus. Note that we could not have built the vocabularies in parallel on the 2 text documents as they share some words: this would require some kind of shared data structure or synchronization barrier, which is complicated to set up, especially if we want to distribute the processing on a cluster.
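
A small illustration (my own, not from the lecture) of why parallel vocabulary building is awkward: fitting separate CountVectorizer instances on the two documents assigns conflicting integer indices to the shared word "the", so the per-worker mappings cannot simply be concatenated.

from sklearn.feature_extraction.text import CountVectorizer

vec_a = CountVectorizer(min_df=1).fit(["The cat sat on the mat."])
vec_b = CountVectorizer(min_df=1).fit(["The quick brown fox jumps over the lazy dog."])

# The shared word "the" receives a different feature index in each vocabulary,
# so merging the two mappings would require renumbering features and
# synchronizing on the shared words across workers.
print(vec_a.vocabulary_['the'])
print(vec_b.vocabulary_['the'])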

With this new vocabulary, the dimensionality of the output space is now larger:


In [66]:
X = vectorizer.transform([
    "The cat sat on the mat.",
    "This cat is a nice cat.",
]).toarray()

print(len(vectorizer.vocabulary_))
print(vectorizer.get_feature_names())
print(X)


12
[u'brown', u'cat', u'dog', u'fox', u'jumps', u'lazy', u'mat', u'on', u'over', u'quick', u'sat', u'the']
[[0 1 0 0 0 0 1 1 0 0 1 2]
 [0 2 0 0 0 0 0 0 0 0 0 0]]

The Sentiment 140 Dataset

To illustrate the scalability issues of the vocabulary-based vectorizers, let's load a more realistic dataset for a classical text classification task: sentiment analysis on tweets. The goal is to tell apart negative from positive tweets on a variety of topics.

If ../fetch_data.py sentiment140 didn't work, go to https://docs.google.com/file/d/0B04GJPshIjmPRnZManQwWEdTZjg/edit and download trainingandtestdata.zip there. Unzip and copy the contents to ../datasets/sentiment140/


In [67]:
import os

sentiment140_folder = os.path.join('..', 'datasets', 'sentiment140')
training_csv_file = os.path.join(sentiment140_folder, 'training.1600000.processed.noemoticon.csv')
testing_csv_file = os.path.join(sentiment140_folder, 'testdata.manual.2009.06.14.csv')

Those files were downloaded from the research archive of the Sentiment140 project. The first file was gathered using the twitter streaming API by running stream queries for the positive ":)" and negative ":(" emoticons to collect tweets that are explicitly positive or negative. To make the classification problem non-trivial, the emoticons were stripped out of the text in the CSV files:


In [68]:
!ls -lh ../datasets/sentiment140/training.1600000.processed.noemoticon.csv


-rwxr-xr-x@ 1 alessandro.gagliardi  1260538795   228M Mar 15 21:42 ../datasets/sentiment140/training.1600000.processed.noemoticon.csv

Let's parse the CSV files and load everything in memory. As loading everything can take up to 2GB, let's limit the collection to 100K tweets of each (positive and negative) out of the total of 1.6M tweets.


In [69]:
FIELDNAMES = ('polarity', 'id', 'date', 'query', 'author', 'text')

def read_csv(csv_file, fieldnames=FIELDNAMES, max_count=None,
             n_partitions=1, partition_id=0):
    
    import csv  # put the import inside for use in IPython.parallel
    
    texts = []
    targets = []
    with open(csv_file, 'rb') as f:
        reader = csv.DictReader(f, fieldnames=fieldnames,
                                delimiter=',', quotechar='"')
        pos_count, neg_count = 0, 0
        for i, d in enumerate(reader):
            if i % n_partitions != partition_id:
                # Skip entry if not in the requested partition
                continue

            if d['polarity'] == '4':
                if max_count and pos_count >= max_count / 2:
                    continue
                pos_count += 1
                texts.append(d['text'])
                targets.append(1)

            elif d['polarity'] == '0':
                if max_count and neg_count >= max_count / 2:
                    continue
                neg_count += 1
                texts.append(d['text'])
                targets.append(-1)

    return texts, targets

In [70]:
%time text_train_all, target_train_all = read_csv(training_csv_file, max_count=200000)


CPU times: user 12 s, sys: 224 ms, total: 12.2 s
Wall time: 12.3 s

In [71]:
len(text_train_all), len(target_train_all)


Out[71]:
(200000, 200000)

Let's display the first samples:


In [72]:
for text in text_train_all[:3]:
    print(text + "\n")


@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D

is upset that he can't update his Facebook by texting it... and might cry as a result  School today also. Blah!

@Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds


In [73]:
print(target_train_all[:3])


[-1, -1, -1]

A polarity of "0" means negative while a polarity of "4" means positive. "0"s were converted to "-1"s and "4"s to "1"s. All the positive tweets are at the end of the file:


In [74]:
for text in text_train_all[-3:]:
    print(text + "\n")


Okie doke!! Time for me to escape for the North while Massa's back is turned. Be on when I get home folks 

finished the lessons, hooray! 

Some ppl are just fucking KP0. Cb ! Stop asking me laa.. I love my boyfriend and thats it. 


In [75]:
print(target_train_all[-3:])


[1, 1, 1]

Let's split the training CSV file into a smaller training set and a validation set with 100k random tweets each:


In [76]:
from sklearn.cross_validation import train_test_split

text_train_small, text_validation, target_train_small, target_validation = train_test_split(
    text_train_all, target_train_all, test_size=.5, random_state=42)

Let's open the manually annotated tweet file. The evaluation set also has neutral tweets with a polarity of "2", which we ignore. We can build the final evaluation set with only the positive and negative tweets of the evaluation CSV file:


In [77]:
text_test_all, target_test_all = read_csv(testing_csv_file)

In [78]:
len(text_test_all), len(target_test_all)


Out[78]:
(359, 359)

The Hashing Trick

To work around the limitations of the vocabulary-based vectorizers, one can use the hashing trick. Instead of building and storing an explicit mapping from the feature names to the feature indices in a Python dict, we can just use a hash function and a modulus operation:


In [79]:
from sklearn.utils.murmurhash import murmurhash3_bytes_u32

for word in "the cat sat on the mat".split():
    print("{0} => {1}".format(
        word, murmurhash3_bytes_u32(word, 0) % 2 ** 20))


the => 761698
cat => 300839
sat => 122804
on => 735689
the => 761698
mat => 122997

This mapping is completely stateless and the dimensionality of the output space is explicitly fixed in advance (here we use a modulus of 2 ** 20, which means roughly 1M dimensions). This makes it possible to work around the limitations of the vocabulary-based vectorizer, both for parallelizability and for online / out-of-core learning.
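
As a minimal sketch (simplified: no normalization and no signed hashing, which the real HashingVectorizer applies by default), hashed token counts can be accumulated by hand as follows:

import numpy as np
from sklearn.utils.murmurhash import murmurhash3_bytes_u32

n_features = 2 ** 20
counts = np.zeros(n_features)

for word in "the cat sat on the mat".split():
    # Hash each token directly to a column index: no vocabulary is stored
    counts[murmurhash3_bytes_u32(word, 0) % n_features] += 1

print(counts.sum())            # 6 tokens counted in total
print(np.flatnonzero(counts))  # the hashed feature indices that were hit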

The HashingVectorizer class is an alternative to TfidfVectorizer with use_idf=False that internally uses the MurmurHash3 hash function instead of a vocabulary:


In [80]:
from sklearn.feature_extraction.text import HashingVectorizer

h_vectorizer = HashingVectorizer(encoding='latin-1')
h_vectorizer


Out[80]:
HashingVectorizer(analyzer=u'word', binary=False, charset=None,
         charset_error=None, decode_error=u'strict',
         dtype=<type 'numpy.float64'>, encoding='latin-1',
         input=u'content', lowercase=True, n_features=1048576,
         ngram_range=(1, 1), non_negative=False, norm=u'l2',
         preprocessor=None, stop_words=None, strip_accents=None,
         token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None)

It shares the same "preprocessor", "tokenizer" and "analyzer" infrastructure:


In [81]:
analyzer = h_vectorizer.build_analyzer()
analyzer('This is a test sentence.')


Out[81]:
[u'this', u'is', u'test', u'sentence']

We can vectorize our datasets into a scipy sparse matrix exactly as we would have done with the CountVectorizer or TfidfVectorizer, except that we can directly call the transform method: there is no need to fit as HashingVectorizer is a stateless transformer:


In [82]:
%time X_train_small = h_vectorizer.transform(text_train_small)


CPU times: user 2.9 s, sys: 82.1 ms, total: 2.99 s
Wall time: 3 s

The dimension of the output is fixed ahead of time to n_features=2 ** 20 by default (nearly 1M features) to minimize the rate of collision on most classification problems while keeping the linear models reasonably sized (1M weights in the coef_ attribute):


In [83]:
X_train_small


Out[83]:
<100000x1048576 sparse matrix of type '<type 'numpy.float64'>'
	with 1188916 stored elements in Compressed Sparse Row format>

As only the non-zero elements are stored, n_features has little impact on the actual size of the data in memory. We can combine the hashing vectorizer with a Passive-Aggressive linear model in a pipeline:


In [84]:
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import Pipeline

h_pipeline = Pipeline((
    ('vec', HashingVectorizer(encoding='latin-1')),
    ('clf', PassiveAggressiveClassifier(C=1, n_iter=1)),
))

%time h_pipeline.fit(text_train_small, target_train_small).score(text_validation, target_validation)


CPU times: user 5.12 s, sys: 141 ms, total: 5.26 s
Wall time: 5.22 s
Out[84]:
0.74375999999999998

Let's check that the score on the validation set is reasonably in line with the score on the set of manually annotated tweets:


In [85]:
h_pipeline.score(text_test_all, target_test_all)


Out[85]:
0.74373259052924789

As the text_train_small dataset is not that big, we can still use a vocabulary-based vectorizer to check that the hashing collisions are not causing any significant performance drop on the validation set (WARNING: this is about twice as slow as the hashing vectorizer version; skip this cell if your computer is too slow):


In [86]:
from sklearn.feature_extraction.text import TfidfVectorizer

vocabulary_vec = TfidfVectorizer(encoding='latin-1', use_idf=False)
vocabulary_pipeline = Pipeline((
    ('vec', vocabulary_vec),
    ('clf', PassiveAggressiveClassifier(C=1, n_iter=1)),
))

%time vocabulary_pipeline.fit(text_train_small, target_train_small).score(text_validation, target_validation)


CPU times: user 4.74 s, sys: 272 ms, total: 5.01 s
Wall time: 4.93 s
Out[86]:
0.74487000000000003

We get almost the same score, but almost twice as slowly, and with a big data structure in memory that is slow to (un)pickle:


In [87]:
len(vocabulary_vec.vocabulary_)


Out[87]:
91301

More information and references to the original papers on the Hashing Trick can be found in the answers to the http://metaoptimize.com/qa question "What is the Hashing Trick?".

Out-of-Core learning

Out-of-Core learning is the task of training a machine learning model on a dataset that does not fit in main memory. This requires the following conditions (a schematic sketch of the resulting training pattern follows the list):

  • a feature extraction layer with fixed output dimensionality
  • knowing the list of all classes in advance (in this case we only have positive and negative tweets)
  • a machine learning algorithm that supports incremental learning (the partial_fit method in scikit-learn).
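
A schematic sketch of that pattern, under the same setup as the worked example on the simulated tweet stream below (the helper name train_out_of_core is mine, not from the lecture):

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

vectorizer = HashingVectorizer(encoding='latin-1')  # fixed output dimensionality
classifier = PassiveAggressiveClassifier(C=1)
all_classes = np.array([-1, 1])                     # classes known in advance

def train_out_of_core(batch_iterator):
    """Consume (texts, targets) mini-batches that never all fit in RAM."""
    for texts, targets in batch_iterator:
        X_batch = vectorizer.transform(texts)       # stateless: no fit needed
        classifier.partial_fit(X_batch, targets, classes=all_classes)
    return classifier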

Let us simulate an infinite twitter stream that can generate batches of annotated tweet texts and their polarity. We can do this by randomly recombining pairs of positive or negative tweets from our fixed dataset:


In [88]:
from random import Random


class InfiniteStreamGenerator(object):
    """Simulate random polarity queries on the twitter streaming API"""
    
    def __init__(self, texts, targets, seed=0, batchsize=100):
        self.texts_pos = [text for text, target in zip(texts, targets)
                               if target > 0]
        self.texts_neg = [text for text, target in zip(texts, targets)
                               if target <= 0]
        self.rng = Random(seed)
        self.batchsize = batchsize

    def next_batch(self, batchsize=None):
        batchsize = self.batchsize if batchsize is None else batchsize
        texts, targets = [], []
        for i in range(batchsize):
            # Select the polarity randomly
            target = self.rng.choice((-1, 1))
            targets.append(target)
            
            # Combine 2 random texts of the right polarity
            pool = self.texts_pos if target > 0 else self.texts_neg
            text = self.rng.choice(pool) + " " + self.rng.choice(pool)
            texts.append(text)
        return texts, targets

infinite_stream = InfiniteStreamGenerator(text_train_small, target_train_small)

In [89]:
texts_in_batch, targets_in_batch = infinite_stream.next_batch(batchsize=3)

In [90]:
for t in texts_in_batch:
    print(t + "\n")


@abelteh hey! you're reached!  Come back and bring the worship team to another level bro!  Send my regards to Krys! At Elephant Bar with the familiaaa celebratin my 19th Birthdayyyyyy! 

quitting my piano lesson this month. i was never excellent  seeing and hearing all the motorcycles out this morning.  Its killing me that mine is in the shop. 

@emmaluxton  YES! another one jess  this will be fun is yous all read it, beleve See how we built our successful website, DIY eLearning, StepXStep, puts U in total control of Ur Web Biz! See...http://snipr.com/g8y52 


In [91]:
targets_in_batch


Out[91]:
[1, -1, 1]

We can now use our infinite tweet source to train an online machine learning algorithm using the hashing vectorizer. Note the use of the partial_fit method of the PassiveAggressiveClassifier instance in place of the traditional call to the fit method, which needs access to the full training set.

From time to time, we evaluate the current predictive performance of the model on our validation set, which is guaranteed not to overlap with the infinite training set source:


In [92]:
n_batches = 1000
validation_scores = []
training_set_size = []

# Build the vectorizer and the classifier
h_vectorizer = HashingVectorizer(encoding='latin-1')
clf = PassiveAggressiveClassifier(C=1)

# Extract the features for the validation once and for all
X_validation = h_vectorizer.transform(text_validation)
classes = np.array([-1, 1])

n_samples = 0
for i in range(n_batches):
    
    texts_in_batch, targets_in_batch = infinite_stream.next_batch()    
    n_samples += len(texts_in_batch)

    # Vectorize the text documents in the batch
    X_batch = h_vectorizer.transform(texts_in_batch)
    
    # Incrementally train the model on the new batch
    clf.partial_fit(X_batch, targets_in_batch, classes=classes)
    
    if n_samples % 100 == 0:
        # Compute the validation score of the current state of the model
        score = clf.score(X_validation, target_validation)
        validation_scores.append(score)
        training_set_size.append(n_samples)

    if i % 100 == 0:
        print("n_samples: {0}, score: {1:.4f}".format(n_samples, score))


n_samples: 100, score: 0.5754
n_samples: 10100, score: 0.7333
n_samples: 20100, score: 0.7409
n_samples: 30100, score: 0.7485
n_samples: 40100, score: 0.7374
n_samples: 50100, score: 0.7247
n_samples: 60100, score: 0.7541
n_samples: 70100, score: 0.7590
n_samples: 80100, score: 0.7559
n_samples: 90100, score: 0.7574

We can now plot the collected validation scores versus the number of samples generated by the infinite source and fed to the model:


In [93]:
plt.plot(training_set_size, validation_scores)
plt.xlabel("Number of samples")
plt.ylabel("Validation score")


Out[93]:
<matplotlib.text.Text at 0x115324750>

Parallelizing Text Classification

Partitioning the Data and Training in Parallel

As the HashingVectorizer is stateless, one can use a separate instance (with the same parameters) in parallel or distributed processes to extract the features on independent partitions of a big text dataset. Each partition of extracted features can then be fed to an independent instance of a linear classifier model on each computing node.

Final Linear Model Averaging

Once all the nodes are ready, we can average the linear models:
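
Concretely, for $K$ linear models with weight vectors $w_k$ and intercepts $b_k$ trained on disjoint partitions, the averaged model is simply

$$\bar{w} = \frac{1}{K} \sum_{k=1}^{K} w_k, \qquad \bar{b} = \frac{1}{K} \sum_{k=1}^{K} b_k$$

which is what the average_linear_model helper below computes on the coef_ and intercept_ attributes.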

Sample Implementation on the Tweet Data

Let's use the interactive IPython.parallel interface to read partitions of the training CSV file in different Python processes:


In [94]:
from IPython.parallel import Client

client = Client()
len(client)


Out[94]:
2

Let's tell each engine which partition of the data it will have to handle:


In [95]:
dv = client.direct_view()

In [106]:
dv.scatter('partition_id', range(len(client)), flatten=True, block=True)

In [107]:
%px print(partition_id)


[stdout:0] 0
[stdout:1] 1

Let's send everything we need to the engines:


In [100]:
from sklearn.feature_extraction.text import HashingVectorizer

h_vectorizer = HashingVectorizer(encoding='latin-1')
dv['h_vectorizer'] = h_vectorizer
dv['read_csv'] = read_csv
dv['training_csv_file'] = training_csv_file
dv['n_partitions'] = len(client)

In [101]:
%px print(training_csv_file)
%px print(n_partitions)


[stdout:0] ../datasets/sentiment140/training.1600000.processed.noemoticon.csv
[stdout:1] ../datasets/sentiment140/training.1600000.processed.noemoticon.csv
[stdout:0] 2
[stdout:1] 2

In [102]:
%%px

max_count = 50000
print("Parsing %d items for partition %d..." % (max_count, partition_id))

texts, targets = read_csv(training_csv_file, n_partitions=n_partitions,
                         partition_id=partition_id, max_count=50000)

print("Shuffling the positive and negative examples...")

from sklearn.utils import shuffle
texts, targets = shuffle(texts, targets, random_state=1)

print("Vectorizing text data...")

vectors = h_vectorizer.transform(texts)

print("Fitting a linear model...")

from sklearn.linear_model import Perceptron
clf = Perceptron(n_iter=1).fit(vectors, targets)

print("Done!")


[stdout:0] 
Parsing 50000 items for partition 0...
Shuffling the positive and negative examples...
Vectorizing text data...
Fitting a linear model...
Done!
[stdout:1] 
Parsing 50000 items for partition 1...
Shuffling the positive and negative examples...
Vectorizing text data...
Fitting a linear model...
Done!

In [103]:
classifiers = dv.gather('clf', block=True)
classifiers


Out[103]:
[Perceptron(alpha=0.0001, class_weight=None, eta0=1.0, fit_intercept=True,
       n_iter=1, n_jobs=1, penalty=None, random_state=0, shuffle=False,
       verbose=0, warm_start=False),
 Perceptron(alpha=0.0001, class_weight=None, eta0=1.0, fit_intercept=True,
       n_iter=1, n_jobs=1, penalty=None, random_state=0, shuffle=False,
       verbose=0, warm_start=False)]

We can now compute the average linear model:


In [104]:
from copy import copy

def average_linear_model(models):
    """Compute a linear model that is the average of the others"""
    avg = copy(models[0])

    avg.coef_ = np.sum([m.coef_ for m in models], axis=0)
    avg.coef_ /= len(models)
    
    avg.intercept_ = np.sum([m.intercept_ for m in models], axis=0)
    avg.intercept_ /= len(models)

    return avg
    

clf = average_linear_model(classifiers)

Let's compare the score of the averaged model with the scores of the individual classifiers. The averaged model can generalize better than the individual models being averaged:


In [108]:
clf.score(h_vectorizer.transform(text_test_all), target_test_all)


Out[108]:
0.7325905292479109

In [109]:
for c in classifiers:
    print(c.score(h_vectorizer.transform(text_test_all), target_test_all))


0.710306406685
0.701949860724

Averaging linear models learned on different datasets that follow the same distribution is a form of Ensemble method. Other Ensemble methods include:

  • Boosted models (see Gradient Tree Boosting available in 0.13 and AdaBoost in master),
  • Bagging (Bootstrap Aggregating) as done in Random Forests, with Decision Trees as the base model,
  • Other non-bootstrapped, randomized aggregate of Decision Trees such as Extremely Randomized Trees.
  • Averaging the probabilistic estimates of a library of randomized and / or heterogeneous linear or non-linear models (a minimal sketch follows this list),
  • Stacking, for instance: training a final Random Forest on the probabilistic class assignment output of a library of randomized and / or heterogeneous linear or non-linear models.
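
As a minimal, hypothetical illustration of the probability-averaging idea above (none of these names or datasets come from the lecture), soft voting can be done by averaging the predict_proba outputs of heterogeneous models:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Toy data, purely for illustration
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

log_reg = LogisticRegression().fit(X, y)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Average the per-class probability estimates of the two heterogeneous models
avg_proba = (log_reg.predict_proba(X) + forest.predict_proba(X)) / 2.0
ensemble_pred = avg_proba.argmax(axis=1)
print((ensemble_pred == y).mean())  # training accuracy of the soft-vote ensemble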

Limitations of the Hashing Vectorizer

Using the Hashing Vectorizer makes it possible to implement streaming and parallel text classification but can also introduce some issues:

  • The collisions can introduce too much noise in the data and degrade prediction quality,
  • The HashingVectorizer does not provide "Inverse Document Frequency" reweighting (lack of a use_idf=True option).
  • There is no easy way to invert the mapping and find the feature names from the feature indices.

The collision issues can be controlled by increasing the n_features parameter.
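
For instance (an illustrative value, not a recommendation from the lecture), quadrupling n_features trades a larger coef_ attribute for a lower collision rate:

from sklearn.feature_extraction.text import HashingVectorizer

big_vectorizer = HashingVectorizer(encoding='latin-1', n_features=2 ** 22)
X_big = big_vectorizer.transform(text_train_small)  # output is still a sparse matrix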

The IDF weighting might be reintroduced by appending a TfidfTransformer instance to the output of the vectorizer. However, computing the idf_ statistic used for the feature reweighting requires at least one additional pass over the training set before the classifier can be trained: this breaks the online learning scheme.
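
A hedged sketch of what such a pipeline could look like (non_negative=True keeps the hashed counts usable as term frequencies; fitting the TfidfTransformer is the extra pass that breaks the online scheme):

from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import Pipeline

hashing_tfidf_pipeline = Pipeline((
    ('vec', HashingVectorizer(encoding='latin-1', non_negative=True)),
    ('tfidf', TfidfTransformer()),   # needs a full pass over the data to learn idf_
    ('clf', PassiveAggressiveClassifier(C=1, n_iter=1)),
))
hashing_tfidf_pipeline.fit(text_train_small, target_train_small)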

The lack of an inverse mapping (the get_feature_names() method of TfidfVectorizer) is even harder to work around. It would require extending the HashingVectorizer class to add a "trace" mode that records the mapping of the most important features in order to provide statistical debugging information.

In the meantime, to debug feature extraction issues, it is recommended to use TfidfVectorizer(use_idf=False) on a small-ish subset of the dataset to simulate a HashingVectorizer() instance that has a get_feature_names() method and no collision issues.
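
For example, a minimal debugging sketch along these lines (the subset size is arbitrary):

from sklearn.feature_extraction.text import TfidfVectorizer

debug_vec = TfidfVectorizer(encoding='latin-1', use_idf=False)
debug_vec.fit(text_train_small[:1000])      # small-ish subset only
print(debug_vec.get_feature_names()[:20])   # inspect the extracted tokens by name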

Next Time:

K-Means Clustering

LectureTools.com