In [0]:
from __future__ import division
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# Some nice default configuration for plots
plt.rcParams['figure.figsize'] = 10, 7.5
plt.rcParams['axes.grid'] = True
plt.gray()



Text Feature Extraction for Classification and Clustering

Outline of this section:

  • Turn a corpus of text documents into feature vectors using a Bag of Words representation,
  • Train a simple text classifier on the feature vectors,
  • Wrap the vectorizer and the classifier with a pipeline,
  • Cross-validation and model selection on the pipeline.

Text Classification in 20 lines of Python

Let's start by implementing a canonical text classification example:

  • The 20 newsgroups dataset: around 18000 text posts from 20 newsgroups forums
  • Bag of Words features extraction with TF-IDF weighting
  • Naive Bayes classifier or Linear Support Vector Machine for the classifier itself (a LinearSVC sketch is shown right after the first example below)

In [1]:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Load the text data
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]
twenty_train_small = load_files('../datasets/20news-bydate-train/',
    categories=categories, encoding='latin-1')
twenty_test_small = load_files('../datasets/20news-bydate-test/',
    categories=categories, encoding='latin-1')

# Turn the text documents into vectors of word frequencies
vectorizer = TfidfVectorizer(min_df=2)
X_train = vectorizer.fit_transform(twenty_train_small.data)
y_train = twenty_train_small.target

# Fit a classifier on the training set
classifier = MultinomialNB().fit(X_train, y_train)
print("Training score: {0:.1f}%".format(
    classifier.score(X_train, y_train) * 100))

# Evaluate the classifier on the testing set
X_test = vectorizer.transform(twenty_test_small.data)
y_test = twenty_test_small.target
print("Testing score: {0:.1f}%".format(
    classifier.score(X_test, y_test) * 100))


Training score: 95.1%
Testing score: 85.1%
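
The outline above also mentions a Linear Support Vector Machine as an alternative to Naive Bayes. A minimal sketch of swapping it in, reusing the X_train / X_test matrices from the cell above (LinearSVC is scikit-learn's linear SVM classifier):

from sklearn.svm import LinearSVC

# Fit a linear SVM on the same TF-IDF features and evaluate it
svm_classifier = LinearSVC().fit(X_train, y_train)
print("SVM testing score: {0:.1f}%".format(
    svm_classifier.score(X_test, y_test) * 100))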

Here is a workflow summary of what happened previously: the raw text documents are turned into sparse TF-IDF matrices by the vectorizer (fit_transform on the training set, transform on the test set), and those matrices are then fed to the MultinomialNB classifier (fit on the training data, score on both the training and testing sets).

Let's now decompose what we just did to understand and customize each step.

Loading the Dataset

Let's explore the dataset loading utility without passing a list of categories: in this case we load the full 20 newsgroups dataset in memory. The source website for the 20 newsgroups already provides a date-based train / test split, which is materialized here as two separate folders on disk (20news-bydate-train and 20news-bydate-test):


In [2]:
ls -l ../datasets/


total 187296
drwxr-xr-x  22 ogrisel  staff       748 Mar 18  2003 20news-bydate-test/
drwxr-xr-x  22 ogrisel  staff       748 Mar 18  2003 20news-bydate-train/
-rw-r--r--   1 ogrisel  staff  14464277 May 23  2013 20news-bydate.tar.gz
drwxr-xr-x   4 ogrisel  staff       136 Sep 14  2014 sentiment140/
-rw-r--r--   1 ogrisel  staff     61194 Feb 10  2014 titanic_train.csv
-rw-r--r--   1 ogrisel  staff  81363704 Sep 14  2014 trainingandtestdata.zip

In [3]:
ls -lh ../datasets/20news-bydate-train


total 0
drwxr-xr-x  482 ogrisel  staff    16K Mar 18  2003 alt.atheism/
drwxr-xr-x  586 ogrisel  staff    19K Mar 18  2003 comp.graphics/
drwxr-xr-x  593 ogrisel  staff    20K Mar 18  2003 comp.os.ms-windows.misc/
drwxr-xr-x  592 ogrisel  staff    20K Mar 18  2003 comp.sys.ibm.pc.hardware/
drwxr-xr-x  580 ogrisel  staff    19K Mar 18  2003 comp.sys.mac.hardware/
drwxr-xr-x  595 ogrisel  staff    20K Mar 18  2003 comp.windows.x/
drwxr-xr-x  587 ogrisel  staff    19K Mar 18  2003 misc.forsale/
drwxr-xr-x  596 ogrisel  staff    20K Mar 18  2003 rec.autos/
drwxr-xr-x  600 ogrisel  staff    20K Mar 18  2003 rec.motorcycles/
drwxr-xr-x  599 ogrisel  staff    20K Mar 18  2003 rec.sport.baseball/
drwxr-xr-x  602 ogrisel  staff    20K Mar 18  2003 rec.sport.hockey/
drwxr-xr-x  597 ogrisel  staff    20K Mar 18  2003 sci.crypt/
drwxr-xr-x  593 ogrisel  staff    20K Mar 18  2003 sci.electronics/
drwxr-xr-x  596 ogrisel  staff    20K Mar 18  2003 sci.med/
drwxr-xr-x  595 ogrisel  staff    20K Mar 18  2003 sci.space/
drwxr-xr-x  601 ogrisel  staff    20K Mar 18  2003 soc.religion.christian/
drwxr-xr-x  548 ogrisel  staff    18K Mar 18  2003 talk.politics.guns/
drwxr-xr-x  566 ogrisel  staff    19K Mar 18  2003 talk.politics.mideast/
drwxr-xr-x  467 ogrisel  staff    16K Mar 18  2003 talk.politics.misc/
drwxr-xr-x  379 ogrisel  staff    13K Mar 18  2003 talk.religion.misc/

In [4]:
ls -lh ../datasets/20news-bydate-train/alt.atheism/


total 4480
-rw-r--r--  1 ogrisel  staff    12K Mar 18  2003 49960
-rw-r--r--  1 ogrisel  staff    31K Mar 18  2003 51060
-rw-r--r--  1 ogrisel  staff   4.0K Mar 18  2003 51119
-rw-r--r--  1 ogrisel  staff   1.6K Mar 18  2003 51120
-rw-r--r--  1 ogrisel  staff   773B Mar 18  2003 51121
-rw-r--r--  1 ogrisel  staff   4.8K Mar 18  2003 51122
-rw-r--r--  1 ogrisel  staff   618B Mar 18  2003 51123
-rw-r--r--  1 ogrisel  staff   1.4K Mar 18  2003 51124
-rw-r--r--  1 ogrisel  staff   2.7K Mar 18  2003 51125
-rw-r--r--  1 ogrisel  staff   427B Mar 18  2003 51126
-rw-r--r--  1 ogrisel  staff   742B Mar 18  2003 51127
-rw-r--r--  1 ogrisel  staff   650B Mar 18  2003 51128
-rw-r--r--  1 ogrisel  staff   1.3K Mar 18  2003 51130
-rw-r--r--  1 ogrisel  staff   2.3K Mar 18  2003 51131
-rw-r--r--  1 ogrisel  staff   2.6K Mar 18  2003 51132
-rw-r--r--  1 ogrisel  staff   1.5K Mar 18  2003 51133
-rw-r--r--  1 ogrisel  staff   1.2K Mar 18  2003 51134
-rw-r--r--  1 ogrisel  staff   1.6K Mar 18  2003 51135
-rw-r--r--  1 ogrisel  staff   2.1K Mar 18  2003 51136
-rw-r--r--  1 ogrisel  staff   1.3K Mar 18  2003 51139
-rw-r--r--  1 ogrisel  staff   409B Mar 18  2003 51140
-rw-r--r--  1 ogrisel  staff   940B Mar 18  2003 51141
-rw-r--r--  1 ogrisel  staff   9.0K Mar 18  2003 51142
-rw-r--r--  1 ogrisel  staff   632B Mar 18  2003 51143
-rw-r--r--  1 ogrisel  staff   1.2K Mar 18  2003 51144
-rw-r--r--  1 ogrisel  staff   609B Mar 18  2003 51145
-rw-r--r--  1 ogrisel  staff   631B Mar 18  2003 51146
-rw-r--r--  1 ogrisel  staff   1.0K Mar 18  2003 51147
-rw-r--r--  1 ogrisel  staff   1.8K Mar 18  2003 51148
-rw-r--r--  1 ogrisel  staff   405B Mar 18  2003 51149
-rw-r--r--  1 ogrisel  staff   696B Mar 18  2003 51150
-rw-r--r--  1 ogrisel  staff   5.5K Mar 18  2003 51151
-rw-r--r--  1 ogrisel  staff   1.4K Mar 18  2003 51152
-rw-r--r--  1 ogrisel  staff   5.0K Mar 18  2003 51153
-rw-r--r--  1 ogrisel  staff   1.3K Mar 18  2003 51154
-rw-r--r--  1 ogrisel  staff   1.6K Mar 18  2003 51155
-rw-r--r--  1 ogrisel  staff   5.0K Mar 18  2003 51156
-rw-r--r--  1 ogrisel  staff   1.8K Mar 18  2003 51157
-rw-r--r--  1 ogrisel  staff   604B Mar 18  2003 51158
-rw-r--r--  1 ogrisel  staff   1.4K Mar 18  2003 51159
-rw-r--r--  1 ogrisel  staff   1.0K Mar 18  2003 51160
-rw-r--r--  1 ogrisel  staff   1.4K Mar 18  2003 51161
-rw-r--r--  1 ogrisel  staff   2.9K Mar 18  2003 51162
-rw-r--r--  1 ogrisel  staff   1.1K Mar 18  2003 51163
-rw-r--r--  1 ogrisel  staff   2.3K Mar 18  2003 51164
-rw-r--r--  1 ogrisel  staff   4.8K Mar 18  2003 51165
-rw-r--r--  1 ogrisel  staff   1.2K Mar 18  2003 51169
-rw-r--r--  1 ogrisel  staff   868B Mar 18  2003 51170
-rw-r--r--  1 ogrisel  staff   721B Mar 18  2003 51171
-rw-r--r--  1 ogrisel  staff   3.0K Mar 18  2003 51172
-rw-r--r--  1 ogrisel  staff   1.9K Mar 18  2003 51173
-rw-r--r--  1 ogrisel  staff   645B Mar 18  2003 51174
-rw-r--r--  1 ogrisel  staff   2.4K Mar 18  2003 51175
-rw-r--r--  1 ogrisel  staff   2.9K Mar 18  2003 51176
-rw-r--r--  1 ogrisel  staff   1.3K Mar 18  2003 51177
-rw-r--r--  1 ogrisel  staff   879B Mar 18  2003 51178
-rw-r--r--  1 ogrisel  staff   1.3K Mar 18  2003 51179
-rw-r--r--  1 ogrisel  staff   994B Mar 18  2003 51180
-rw-r--r--  1 ogrisel  staff   1.2K Mar 18  2003 51181
-rw-r--r--  1 ogrisel  staff   2.2K Mar 18  2003 51182
-rw-r--r--  1 ogrisel  staff   1.7K Mar 18  2003 51183
-rw-r--r--  1 ogrisel  staff   1.0K Mar 18  2003 51184
-rw-r--r--  1 ogrisel  staff   1.2K Mar 18  2003 51185
-rw-r--r--  1 ogrisel  staff   949B Mar 18  2003 51186
-rw-r--r--  1 ogrisel  staff   1.9K Mar 18  2003 51187
-rw-r--r--  1 ogrisel  staff   1.1K Mar 18  2003 51188
-rw-r--r--  1 ogrisel  staff   834B Mar 18  2003 51189
-rw-r--r--  1 ogrisel  staff   895B Mar 18  2003 51190
-rw-r--r--  1 ogrisel  staff   776B Mar 18  2003 51191
-rw-r--r--  1 ogrisel  staff   1.6K Mar 18  2003 51192
-rw-r--r--  1 ogrisel  staff   1.8K Mar 18  2003 51193
-rw-r--r--  1 ogrisel  staff   1.4K Mar 18  2003 51194
-rw-r--r--  1 ogrisel  staff   964B Mar 18  2003 51195
-rw-r--r--  1 ogrisel  staff   2.4K Mar 18  2003 51196
-rw-r--r--  1 ogrisel  staff   759B Mar 18  2003 51197
-rw-r--r--  1 ogrisel  staff   1.5K Mar 18  2003 51198
-rw-r--r--  1 ogrisel  staff   1.0K Mar 18  2003 51199
-rw-r--r--  1 ogrisel  staff   1.9K Mar 18  2003 51200
-rw-r--r--  1 ogrisel  staff   916B Mar 18  2003 51201
-rw-r--r--  1 ogrisel  staff   1.9K Mar 18  2003 51202
-rw-r--r--  1 ogrisel  staff   1.5K Mar 18  2003 51203
-rw-r--r--  1 ogrisel  staff   846B Mar 18  2003 51204
-rw-r--r--  1 ogrisel  staff   1.4K Mar 18  2003 51205
-rw-r--r--  1 ogrisel  staff   881B Mar 18  2003 51206
-rw-r--r--  1 ogrisel  staff   6.2K Mar 18  2003 51208
-rw-r--r--  1 ogrisel  staff   1.7K Mar 18  2003 51209
-rw-r--r--  1 ogrisel  staff   1.7K Mar 18  2003 51210
-rw-r--r--  1 ogrisel  staff    10K Mar 18  2003 51211
-rw-r--r--  1 ogrisel  staff   2.5K Mar 18  2003 51212
-rw-r--r--  1 ogrisel  staff   1.6K Mar 18  2003 51213
-rw-r--r--  1 ogrisel  staff   636B Mar 18  2003 51214
-rw-r--r--  1 ogrisel  staff   989B Mar 18  2003 51215
-rw-r--r--  1 ogrisel  staff   668B Mar 18  2003 51216
-rw-r--r--  1 ogrisel  staff   2.8K Mar 18  2003 51217
-rw-r--r--  1 ogrisel  staff   1.7K Mar 18  2003 51218
-rw-r--r--  1 ogrisel  staff   905B Mar 18  2003 51219
-rw-r--r--  1 ogrisel  staff   2.4K Mar 18  2003 51220
-rw-r--r--  1 ogrisel  staff   1.0K Mar 18  2003 51221
-rw-r--r--  1 ogrisel  staff   1.7K Mar 18  2003 51222
-rw-r--r--  1 ogrisel  staff   1.0K Mar 18  2003 51223
-rw-r--r--  1 ogrisel  staff   2.1K Mar 18  2003 51224
-rw-r--r--  1 ogrisel  staff   1.5K Mar 18  2003 51225
-rw-r--r--  1 ogrisel  staff   3.4K Mar 18  2003 51226
-rw-r--r--  1 ogrisel  staff   704B Mar 18  2003 51227
-rw-r--r--  1 ogrisel  staff   949B Mar 18  2003 51228
-rw-r--r--  1 ogrisel  staff   714B Mar 18  2003 51229
-rw-r--r--  1 ogrisel  staff   966B Mar 18  2003 51230
-rw-r--r--  1 ogrisel  staff   2.9K Mar 18  2003 51231
-rw-r--r--  1 ogrisel  staff   871B Mar 18  2003 51232
-rw-r--r--  1 ogrisel  staff   1.0K Mar 18  2003 51233
-rw-r--r--  1 ogrisel  staff   1.5K Mar 18  2003 51234
-rw-r--r--  1 ogrisel  staff   2.4K Mar 18  2003 51235
-rw-r--r--  1 ogrisel  staff   1.2K Mar 18  2003 51236
-rw-r--r--  1 ogrisel  staff   564B Mar 18  2003 51237
-rw-r--r--  1 ogrisel  staff    11K Mar 18  2003 51238
-rw-r--r--  1 ogrisel  staff   1.3K Mar 18  2003 51239
-rw-r--r--  1 ogrisel  staff   749B Mar 18  2003 51240
-rw-r--r--  1 ogrisel  staff   932B Mar 18  2003 51241
-rw-r--r--  1 ogrisel  staff   1.2K Mar 18  2003 51242
-rw-r--r--  1 ogrisel  staff   2.2K Mar 18  2003 51243
-rw-r--r--  1 ogrisel  staff   554B Mar 18  2003 51244
-rw-r--r--  1 ogrisel  staff   1.3K Mar 18  2003 51245
-rw-r--r--  1 ogrisel  staff   1.7K Mar 18  2003 51246
-rw-r--r--  1 ogrisel  staff   1.7K Mar 18  2003 51247
-rw-r--r--  1 ogrisel  staff   1.6K Mar 18  2003 51249
-rw-r--r--  1 ogrisel  staff   2.8K Mar 18  2003 51250
-rw-r--r--  1 ogrisel  staff   570B Mar 18  2003 51251
-rw-r--r--  1 ogrisel  staff   1.8K Mar 18  2003 51252
-rw-r--r--  1 ogrisel  staff   3.1K Mar 18  2003 51253
-rw-r--r--  1 ogrisel  staff   2.9K Mar 18  2003 51254
-rw-r--r--  1 ogrisel  staff   748B Mar 18  2003 51255
-rw-r--r--  1 ogrisel  staff   2.3K Mar 18  2003 51256
-rw-r--r--  1 ogrisel  staff   1.2K Mar 18  2003 51258
-rw-r--r--  1 ogrisel  staff   1.7K Mar 18  2003 51259
-rw-r--r--  1 ogrisel  staff   6.2K Mar 18  2003 51260
-rw-r--r--  1 ogrisel  staff   1.6K Mar 18  2003 51261
-rw-r--r--  1 ogrisel  staff   1.3K Mar 18  2003 51262
-rw-r--r--  1 ogrisel  staff   1.2K Mar 18  2003 51265
-rw-r--r--  1 ogrisel  staff   456B Mar 18  2003 51266
-rw-r--r--  1 ogrisel  staff   816B Mar 18  2003 51267
-rw-r--r--  1 ogrisel  staff   2.4K Mar 18  2003 51268
-rw-r--r--  1 ogrisel  staff   1.0K Mar 18  2003 51269
-rw-r--r--  1 ogrisel  staff   3.4K Mar 18  2003 51270
-rw-r--r--  1 ogrisel  staff   1.3K Mar 18  2003 51271
-rw-r--r--  1 ogrisel  staff   2.0K Mar 18  2003 51272
-rw-r--r--  1 ogrisel  staff   790B Mar 18  2003 51273
-rw-r--r--  1 ogrisel  staff   1.6K Mar 18  2003 51274
-rw-r--r--  1 ogrisel  staff   2.5K Mar 18  2003 51275
-rw-r--r--  1 ogrisel  staff   4.4K Mar 18  2003 51276
-rw-r--r--  1 ogrisel  staff   1.5K Mar 18  2003 51277
-rw-r--r--  1 ogrisel  staff   6.2K Mar 18  2003 51278
-rw-r--r--  1 ogrisel  staff   963B Mar 18  2003 51279
-rw-r--r--  1 ogrisel  staff   2.0K Mar 18  2003 51280
-rw-r--r--  1 ogrisel  staff   1.1K Mar 18  2003 51281
-rw-r--r--  1 ogrisel  staff   618B Mar 18  2003 51282
-rw-r--r--  1 ogrisel  staff   2.7K Mar 18  2003 51283
-rw-r--r--  1 ogrisel  staff   1.0K Mar 18  2003 51284
-rw-r--r--  1 ogrisel  staff   1.3K Mar 18  2003 51285
-rw-r--r--  1 ogrisel  staff   601B Mar 18  2003 51286
-rw-r--r--  1 ogrisel  staff   751B Mar 18  2003 51287
-rw-r--r--  1 ogrisel  staff   1.3K Mar 18  2003 51288
-rw-r--r--  1 ogrisel  staff   8.0K Mar 18  2003 51290
-rw-r--r--  1 ogrisel  staff   1.0K Mar 18  2003 51291
-rw-r--r--  1 ogrisel  staff   2.9K Mar 18  2003 51292
-rw-r--r--  1 ogrisel  staff   1.2K Mar 18  2003 51293
-rw-r--r--  1 ogrisel  staff   1.8K Mar 18  2003 51294
-rw-r--r--  1 ogrisel  staff   1.9K Mar 18  2003 51295
-rw-r--r--  1 ogrisel  staff   1.7K Mar 18  2003 51296
-rw-r--r--  1 ogrisel  staff   4.2K Mar 18  2003 51297
-rw-r--r--  1 ogrisel  staff   2.6K Mar 18  2003 51298
-rw-r--r--  1 ogrisel  staff   2.2K Mar 18  2003 51299
-rw-r--r--  1 ogrisel  staff   2.3K Mar 18  2003 51300
-rw-r--r--  1 ogrisel  staff   6.3K Mar 18  2003 51301
-rw-r--r--  1 ogrisel  staff   1.3K Mar 18  2003 51302
-rw-r--r--  1 ogrisel  staff   1.9K Mar 18  2003 51303
-rw-r--r--  1 ogrisel  staff    10K Mar 18  2003 51304
-rw-r--r--  1 ogrisel  staff   1.5K Mar 18  2003 51305
-rw-r--r--  1 ogrisel  staff   1.4K Mar 18  2003 51306
-rw-r--r--  1 ogrisel  staff   4.1K Mar 18  2003 51307
-rw-r--r--  1 ogrisel  staff   6.2K Mar 18  2003 51308
-rw-r--r--  1 ogrisel  staff   1.0K Mar 18  2003 51309
-rw-r--r--  1 ogrisel  staff   768B Mar 18  2003 51310
-rw-r--r--  1 ogrisel  staff   1.0K Mar 18  2003 51311
-rw-r--r--  1 ogrisel  staff   930B Mar 18  2003 51312
-rw-r--r--  1 ogrisel  staff   771B Mar 18  2003 51313
-rw-r--r--  1 ogrisel  staff   670B Mar 18  2003 51314
-rw-r--r--  1 ogrisel  staff   1.1K Mar 18  2003 51315
-rw-r--r--  1 ogrisel  staff   3.7K Mar 18  2003 51316
-rw-r--r--  1 ogrisel  staff   406B Mar 18  2003 51317
-rw-r--r--  1 ogrisel  staff   5.4K Mar 18  2003 51318
-rw-r--r--  1 ogrisel  staff   9.6K Mar 18  2003 51319
-rw-r--r--  1 ogrisel  staff   2.1K Mar 18  2003 51320
-rw-r--r--  1 ogrisel  staff    29K Mar 18  2003 52499
-rw-r--r--  1 ogrisel  staff    25K Mar 18  2003 52909
-rw-r--r--  1 ogrisel  staff   5.8K Mar 18  2003 52910
-rw-r--r--  1 ogrisel  staff   819B Mar 18  2003 53055
-rw-r--r--  1 ogrisel  staff   857B Mar 18  2003 53056
-rw-r--r--  1 ogrisel  staff   755B Mar 18  2003 53057
-rw-r--r--  1 ogrisel  staff   4.4K Mar 18  2003 53058
-rw-r--r--  1 ogrisel  staff   2.1K Mar 18  2003 53059
-rw-r--r--  1 ogrisel  staff   1.1K Mar 18  2003 53062
-rw-r--r--  1 ogrisel  staff   1.6K Mar 18  2003 53064
-rw-r--r--  1 ogrisel  staff   515B Mar 18  2003 53065
-rw-r--r--  1 ogrisel  staff   9.2K Mar 18  2003 53066
-rw-r--r--  1 ogrisel  staff   2.4K Mar 18  2003 53067
-rw-r--r--  1 ogrisel  staff   610B Mar 18  2003 53069
-rw-r--r--  1 ogrisel  staff   759B Mar 18  2003 53070
-rw-r--r--  1 ogrisel  staff   2.3K Mar 18  2003 53071
-rw-r--r--  1 ogrisel  staff   1.5K Mar 18  2003 53072
-rw-r--r--  1 ogrisel  staff   1.9K Mar 18  2003 53073
-rw-r--r--  1 ogrisel  staff   2.1K Mar 18  2003 53075
-rw-r--r--  1 ogrisel  staff   411B Mar 18  2003 53078
-rw-r--r--  1 ogrisel  staff   1.3K Mar 18  2003 53081
-rw-r--r--  1 ogrisel  staff   962B Mar 18  2003 53082
-rw-r--r--  1 ogrisel  staff   1.3K Mar 18  2003 53083
-rw-r--r--  1 ogrisel  staff   2.0K Mar 18  2003 53085
-rw-r--r--  1 ogrisel  staff   1.1K Mar 18  2003 53086
-rw-r--r--  1 ogrisel  staff   247B Mar 18  2003 53087
-rw-r--r--  1 ogrisel  staff   3.8K Mar 18  2003 53090
-rw-r--r--  1 ogrisel  staff   1.1K Mar 18  2003 53093
-rw-r--r--  1 ogrisel  staff   1.1K Mar 18  2003 53094
-rw-r--r--  1 ogrisel  staff   1.0K Mar 18  2003 53095
-rw-r--r--  1 ogrisel  staff   863B Mar 18  2003 53096
-rw-r--r--  1 ogrisel  staff   1.1K Mar 18  2003 53097
-rw-r--r--  1 ogrisel  staff   1.2K Mar 18  2003 53098
-rw-r--r--  1 ogrisel  staff   1.1K Mar 18  2003 53099
-rw-r--r--  1 ogrisel  staff   2.0K Mar 18  2003 53106
-rw-r--r--  1 ogrisel  staff   784B Mar 18  2003 53108
-rw-r--r--  1 ogrisel  staff   2.3K Mar 18  2003 53110
-rw-r--r--  1 ogrisel  staff   712B Mar 18  2003 53111
-rw-r--r--  1 ogrisel  staff   2.4K Mar 18  2003 53112
-rw-r--r--  1 ogrisel  staff   2.6K Mar 18  2003 53113
-rw-r--r--  1 ogrisel  staff   1.7K Mar 18  2003 53114
-rw-r--r--  1 ogrisel  staff   1.5K Mar 18  2003 53117
-rw-r--r--  1 ogrisel  staff   2.8K Mar 18  2003 53118
-rw-r--r--  1 ogrisel  staff   4.1K Mar 18  2003 53120
-rw-r--r--  1 ogrisel  staff   1.8K Mar 18  2003 53121
-rw-r--r--  1 ogrisel  staff   2.4K Mar 18  2003 53122
-rw-r--r--  1 ogrisel  staff   1.2K Mar 18  2003 53123
-rw-r--r--  1 ogrisel  staff   3.4K Mar 18  2003 53124
-rw-r--r--  1 ogrisel  staff   1.8K Mar 18  2003 53125
-rw-r--r--  1 ogrisel  staff   1.2K Mar 18  2003 53126
-rw-r--r--  1 ogrisel  staff   826B Mar 18  2003 53127
-rw-r--r--  1 ogrisel  staff   958B Mar 18  2003 53130
-rw-r--r--  1 ogrisel  staff   1.5K Mar 18  2003 53131
-rw-r--r--  1 ogrisel  staff   1.0K Mar 18  2003 53132
-rw-r--r--  1 ogrisel  staff   640B Mar 18  2003 53133
-rw-r--r--  1 ogrisel  staff   1.3K Mar 18  2003 53134
-rw-r--r--  1 ogrisel  staff   2.1K Mar 18  2003 53135
-rw-r--r--  1 ogrisel  staff   4.2K Mar 18  2003 53136
-rw-r--r--  1 ogrisel  staff   4.8K Mar 18  2003 53137
-rw-r--r--  1 ogrisel  staff   1.3K Mar 18  2003 53139
-rw-r--r--  1 ogrisel  staff   3.0K Mar 18  2003 53140
-rw-r--r--  1 ogrisel  staff   2.1K Mar 18  2003 53141
-rw-r--r--  1 ogrisel  staff   456B Mar 18  2003 53142
-rw-r--r--  1 ogrisel  staff   760B Mar 18  2003 53143
-rw-r--r--  1 ogrisel  staff   768B Mar 18  2003 53144
-rw-r--r--  1 ogrisel  staff   1.0K Mar 18  2003 53145
-rw-r--r--  1 ogrisel  staff   1.2K Mar 18  2003 53149
-rw-r--r--  1 ogrisel  staff   2.1K Mar 18  2003 53150
-rw-r--r--  1 ogrisel  staff   1.3K Mar 18  2003 53151
-rw-r--r--  1 ogrisel  staff   1.9K Mar 18  2003 53153
-rw-r--r--  1 ogrisel  staff   1.2K Mar 18  2003 53154
-rw-r--r--  1 ogrisel  staff   1.2K Mar 18  2003 53157
-rw-r--r--  1 ogrisel  staff   2.0K Mar 18  2003 53158
-rw-r--r--  1 ogrisel  staff   819B Mar 18  2003 53159
-rw-r--r--  1 ogrisel  staff   1.9K Mar 18  2003 53160
-rw-r--r--  1 ogrisel  staff   3.5K Mar 18  2003 53161
-rw-r--r--  1 ogrisel  staff   1.5K Mar 18  2003 53162
-rw-r--r--  1 ogrisel  staff   1.9K Mar 18  2003 53163
-rw-r--r--  1 ogrisel  staff   2.2K Mar 18  2003 53164
-rw-r--r--  1 ogrisel  staff   1.1K Mar 18  2003 53165
-rw-r--r--  1 ogrisel  staff   684B Mar 18  2003 53166
-rw-r--r--  1 ogrisel  staff   443B Mar 18  2003 53167
-rw-r--r--  1 ogrisel  staff   1.2K Mar 18  2003 53168
-rw-r--r--  1 ogrisel  staff   1.4K Mar 18  2003 53170
-rw-r--r--  1 ogrisel  staff   2.5K Mar 18  2003 53171
-rw-r--r--  1 ogrisel  staff   785B Mar 18  2003 53172
-rw-r--r--  1 ogrisel  staff   1.1K Mar 18  2003 53173
-rw-r--r--  1 ogrisel  staff   1.3K Mar 18  2003 53174
-rw-r--r--  1 ogrisel  staff   737B Mar 18  2003 53175
-rw-r--r--  1 ogrisel  staff   1.3K Mar 18  2003 53176
-rw-r--r--  1 ogrisel  staff   1.8K Mar 18  2003 53177
-rw-r--r--  1 ogrisel  staff   2.2K Mar 18  2003 53178
-rw-r--r--  1 ogrisel  staff   1.6K Mar 18  2003 53179
-rw-r--r--  1 ogrisel  staff   2.1K Mar 18  2003 53180
-rw-r--r--  1 ogrisel  staff   3.2K Mar 18  2003 53181
-rw-r--r--  1 ogrisel  staff   1.2K Mar 18  2003 53182
-rw-r--r--  1 ogrisel  staff   1.4K Mar 18  2003 53183
-rw-r--r--  1 ogrisel  staff   1.7K Mar 18  2003 53184
-rw-r--r--  1 ogrisel  staff   2.6K Mar 18  2003 53185
-rw-r--r--  1 ogrisel  staff   3.0K Mar 18  2003 53186
-rw-r--r--  1 ogrisel  staff   665B Mar 18  2003 53187
-rw-r--r--  1 ogrisel  staff   2.0K Mar 18  2003 53188
-rw-r--r--  1 ogrisel  staff   1.3K Mar 18  2003 53190
-rw-r--r--  1 ogrisel  staff   1.9K Mar 18  2003 53191
-rw-r--r--  1 ogrisel  staff   1.8K Mar 18  2003 53192
-rw-r--r--  1 ogrisel  staff   1.4K Mar 18  2003 53193
-rw-r--r--  1 ogrisel  staff   792B Mar 18  2003 53194
-rw-r--r--  1 ogrisel  staff   2.0K Mar 18  2003 53195
-rw-r--r--  1 ogrisel  staff   1.6K Mar 18  2003 53196
-rw-r--r--  1 ogrisel  staff   2.6K Mar 18  2003 53197
-rw-r--r--  1 ogrisel  staff   1.1K Mar 18  2003 53198
-rw-r--r--  1 ogrisel  staff   1.4K Mar 18  2003 53199
-rw-r--r--  1 ogrisel  staff   1.3K Mar 18  2003 53201
-rw-r--r--  1 ogrisel  staff   1.3K Mar 18  2003 53203
-rw-r--r--  1 ogrisel  staff   3.7K Mar 18  2003 53208
-rw-r--r--  1 ogrisel  staff   1.1K Mar 18  2003 53209
-rw-r--r--  1 ogrisel  staff   1.5K Mar 18  2003 53210
-rw-r--r--  1 ogrisel  staff   2.7K Mar 18  2003 53211
-rw-r--r--  1 ogrisel  staff   1.4K Mar 18  2003 53212
-rw-r--r--  1 ogrisel  staff   2.3K Mar 18  2003 53213
-rw-r--r--  1 ogrisel  staff   1.9K Mar 18  2003 53214
-rw-r--r--  1 ogrisel  staff   919B Mar 18  2003 53215
-rw-r--r--  1 ogrisel  staff   868B Mar 18  2003 53216
-rw-r--r--  1 ogrisel  staff   2.3K Mar 18  2003 53217
-rw-r--r--  1 ogrisel  staff   1.0K Mar 18  2003 53218
-rw-r--r--  1 ogrisel  staff   1.1K Mar 18  2003 53219
-rw-r--r--  1 ogrisel  staff   640B Mar 18  2003 53220
-rw-r--r--  1 ogrisel  staff   1.1K Mar 18  2003 53221
-rw-r--r--  1 ogrisel  staff   2.0K Mar 18  2003 53222
-rw-r--r--  1 ogrisel  staff   2.0K Mar 18  2003 53223
-rw-r--r--  1 ogrisel  staff   3.4K Mar 18  2003 53224
-rw-r--r--  1 ogrisel  staff   808B Mar 18  2003 53225
-rw-r--r--  1 ogrisel  staff   1.0K Mar 18  2003 53226
-rw-r--r--  1 ogrisel  staff   640B Mar 18  2003 53228
-rw-r--r--  1 ogrisel  staff   856B Mar 18  2003 53229
-rw-r--r--  1 ogrisel  staff   967B Mar 18  2003 53230
-rw-r--r--  1 ogrisel  staff   781B Mar 18  2003 53231
-rw-r--r--  1 ogrisel  staff   1.2K Mar 18  2003 53232
-rw-r--r--  1 ogrisel  staff   2.2K Mar 18  2003 53235
-rw-r--r--  1 ogrisel  staff   1.7K Mar 18  2003 53237
-rw-r--r--  1 ogrisel  staff   2.2K Mar 18  2003 53238
-rw-r--r--  1 ogrisel  staff   2.4K Mar 18  2003 53239
-rw-r--r--  1 ogrisel  staff   1.2K Mar 18  2003 53240
-rw-r--r--  1 ogrisel  staff   1.3K Mar 18  2003 53243
-rw-r--r--  1 ogrisel  staff   1.3K Mar 18  2003 53248
-rw-r--r--  1 ogrisel  staff   1.4K Mar 18  2003 53249
-rw-r--r--  1 ogrisel  staff   1.8K Mar 18  2003 53250
-rw-r--r--  1 ogrisel  staff   1.5K Mar 18  2003 53251
-rw-r--r--  1 ogrisel  staff   1.4K Mar 18  2003 53252
-rw-r--r--  1 ogrisel  staff   1.2K Mar 18  2003 53256
-rw-r--r--  1 ogrisel  staff   806B Mar 18  2003 53258
-rw-r--r--  1 ogrisel  staff   4.2K Mar 18  2003 53266
-rw-r--r--  1 ogrisel  staff   3.5K Mar 18  2003 53267
-rw-r--r--  1 ogrisel  staff   1.8K Mar 18  2003 53269
-rw-r--r--  1 ogrisel  staff   3.2K Mar 18  2003 53271
-rw-r--r--  1 ogrisel  staff   1.3K Mar 18  2003 53274
-rw-r--r--  1 ogrisel  staff   2.1K Mar 18  2003 53275
-rw-r--r--  1 ogrisel  staff   2.0K Mar 18  2003 53281
-rw-r--r--  1 ogrisel  staff   958B Mar 18  2003 53282
-rw-r--r--  1 ogrisel  staff   3.2K Mar 18  2003 53283
-rw-r--r--  1 ogrisel  staff   872B Mar 18  2003 53284
-rw-r--r--  1 ogrisel  staff   387B Mar 18  2003 53285
-rw-r--r--  1 ogrisel  staff   3.1K Mar 18  2003 53286
-rw-r--r--  1 ogrisel  staff   3.5K Mar 18  2003 53287
-rw-r--r--  1 ogrisel  staff   2.6K Mar 18  2003 53288
-rw-r--r--  1 ogrisel  staff   956B Mar 18  2003 53289
-rw-r--r--  1 ogrisel  staff   1.6K Mar 18  2003 53290
-rw-r--r--  1 ogrisel  staff    10K Mar 18  2003 53292
-rw-r--r--  1 ogrisel  staff   5.4K Mar 18  2003 53298
-rw-r--r--  1 ogrisel  staff   945B Mar 18  2003 53303
-rw-r--r--  1 ogrisel  staff   1.2K Mar 18  2003 53304
-rw-r--r--  1 ogrisel  staff   1.5K Mar 18  2003 53305
-rw-r--r--  1 ogrisel  staff   1.4K Mar 18  2003 53306
-rw-r--r--  1 ogrisel  staff   590B Mar 18  2003 53307
-rw-r--r--  1 ogrisel  staff   663B Mar 18  2003 53308
-rw-r--r--  1 ogrisel  staff   907B Mar 18  2003 53309
-rw-r--r--  1 ogrisel  staff   1.0K Mar 18  2003 53311
-rw-r--r--  1 ogrisel  staff   1.5K Mar 18  2003 53312
-rw-r--r--  1 ogrisel  staff   576B Mar 18  2003 53314
-rw-r--r--  1 ogrisel  staff    15K Mar 18  2003 53323
-rw-r--r--  1 ogrisel  staff   1.0K Mar 18  2003 53334
-rw-r--r--  1 ogrisel  staff   783B Mar 18  2003 53347
-rw-r--r--  1 ogrisel  staff   5.8K Mar 18  2003 53351
-rw-r--r--  1 ogrisel  staff   1.6K Mar 18  2003 53366
-rw-r--r--  1 ogrisel  staff   698B Mar 18  2003 53370
-rw-r--r--  1 ogrisel  staff   600B Mar 18  2003 53371
-rw-r--r--  1 ogrisel  staff   5.6K Mar 18  2003 53373
-rw-r--r--  1 ogrisel  staff   1.8K Mar 18  2003 53374
-rw-r--r--  1 ogrisel  staff   1.1K Mar 18  2003 53375
-rw-r--r--  1 ogrisel  staff   849B Mar 18  2003 53376
-rw-r--r--  1 ogrisel  staff   621B Mar 18  2003 53377
-rw-r--r--  1 ogrisel  staff   270B Mar 18  2003 53380
-rw-r--r--  1 ogrisel  staff   1.1K Mar 18  2003 53381
-rw-r--r--  1 ogrisel  staff   2.2K Mar 18  2003 53382
-rw-r--r--  1 ogrisel  staff   1.6K Mar 18  2003 53383
-rw-r--r--  1 ogrisel  staff   1.6K Mar 18  2003 53387
-rw-r--r--  1 ogrisel  staff   759B Mar 18  2003 53389
-rw-r--r--  1 ogrisel  staff   396B Mar 18  2003 53390
-rw-r--r--  1 ogrisel  staff   669B Mar 18  2003 53391
-rw-r--r--  1 ogrisel  staff   1.8K Mar 18  2003 53434
-rw-r--r--  1 ogrisel  staff   1.6K Mar 18  2003 53435
-rw-r--r--  1 ogrisel  staff   708B Mar 18  2003 53436
-rw-r--r--  1 ogrisel  staff   887B Mar 18  2003 53437
-rw-r--r--  1 ogrisel  staff   838B Mar 18  2003 53438
-rw-r--r--  1 ogrisel  staff   1.4K Mar 18  2003 53439
-rw-r--r--  1 ogrisel  staff   1.3K Mar 18  2003 53440
-rw-r--r--  1 ogrisel  staff   384B Mar 18  2003 53441
-rw-r--r--  1 ogrisel  staff   857B Mar 18  2003 53442
-rw-r--r--  1 ogrisel  staff   1.6K Mar 18  2003 53443
-rw-r--r--  1 ogrisel  staff   1.4K Mar 18  2003 53445
-rw-r--r--  1 ogrisel  staff   1.3K Mar 18  2003 53449
-rw-r--r--  1 ogrisel  staff   2.4K Mar 18  2003 53459
-rw-r--r--  1 ogrisel  staff   1.4K Mar 18  2003 53460
-rw-r--r--  1 ogrisel  staff   1.0K Mar 18  2003 53465
-rw-r--r--  1 ogrisel  staff   1.3K Mar 18  2003 53466
-rw-r--r--  1 ogrisel  staff   1.0K Mar 18  2003 53467
-rw-r--r--  1 ogrisel  staff   1.4K Mar 18  2003 53468
-rw-r--r--  1 ogrisel  staff   1.1K Mar 18  2003 53471
-rw-r--r--  1 ogrisel  staff   1.9K Mar 18  2003 53477
-rw-r--r--  1 ogrisel  staff   718B Mar 18  2003 53478
-rw-r--r--  1 ogrisel  staff   781B Mar 18  2003 53483
-rw-r--r--  1 ogrisel  staff   1.6K Mar 18  2003 53509
-rw-r--r--  1 ogrisel  staff   910B Mar 18  2003 53510
-rw-r--r--  1 ogrisel  staff   781B Mar 18  2003 53512
-rw-r--r--  1 ogrisel  staff   1.8K Mar 18  2003 53515
-rw-r--r--  1 ogrisel  staff   2.1K Mar 18  2003 53518
-rw-r--r--  1 ogrisel  staff    50K Mar 18  2003 53519
-rw-r--r--  1 ogrisel  staff   6.0K Mar 18  2003 53521
-rw-r--r--  1 ogrisel  staff   1.0K Mar 18  2003 53522
-rw-r--r--  1 ogrisel  staff   2.8K Mar 18  2003 53523
-rw-r--r--  1 ogrisel  staff   338B Mar 18  2003 53524
-rw-r--r--  1 ogrisel  staff   1.4K Mar 18  2003 53525
-rw-r--r--  1 ogrisel  staff   489B Mar 18  2003 53526
-rw-r--r--  1 ogrisel  staff   2.6K Mar 18  2003 53527
-rw-r--r--  1 ogrisel  staff   2.4K Mar 18  2003 53528
-rw-r--r--  1 ogrisel  staff   228B Mar 18  2003 53529
-rw-r--r--  1 ogrisel  staff   1.1K Mar 18  2003 53531
-rw-r--r--  1 ogrisel  staff   1.3K Mar 18  2003 53532
-rw-r--r--  1 ogrisel  staff   1.2K Mar 18  2003 53533
-rw-r--r--  1 ogrisel  staff   356B Mar 18  2003 53534
-rw-r--r--  1 ogrisel  staff   614B Mar 18  2003 53535
-rw-r--r--  1 ogrisel  staff   895B Mar 18  2003 53571
-rw-r--r--  1 ogrisel  staff   1.0K Mar 18  2003 53572
-rw-r--r--  1 ogrisel  staff   697B Mar 18  2003 53573
-rw-r--r--  1 ogrisel  staff   1.0K Mar 18  2003 53574
-rw-r--r--  1 ogrisel  staff   1.8K Mar 18  2003 53654
-rw-r--r--  1 ogrisel  staff   2.3K Mar 18  2003 53655
-rw-r--r--  1 ogrisel  staff   2.5K Mar 18  2003 53656
-rw-r--r--  1 ogrisel  staff   2.1K Mar 18  2003 53660
-rw-r--r--  1 ogrisel  staff   6.8K Mar 18  2003 53661
-rw-r--r--  1 ogrisel  staff   1.8K Mar 18  2003 53753
-rw-r--r--  1 ogrisel  staff   698B Mar 18  2003 53754
-rw-r--r--  1 ogrisel  staff   779B Mar 18  2003 53755
-rw-r--r--  1 ogrisel  staff   3.9K Mar 18  2003 53756
-rw-r--r--  1 ogrisel  staff   1.3K Mar 18  2003 53757
-rw-r--r--  1 ogrisel  staff   2.2K Mar 18  2003 53758
-rw-r--r--  1 ogrisel  staff   745B Mar 18  2003 53759
-rw-r--r--  1 ogrisel  staff   1.9K Mar 18  2003 53760
-rw-r--r--  1 ogrisel  staff   592B Mar 18  2003 53761
-rw-r--r--  1 ogrisel  staff   658B Mar 18  2003 53762
-rw-r--r--  1 ogrisel  staff   756B Mar 18  2003 53763
-rw-r--r--  1 ogrisel  staff   2.7K Mar 18  2003 53764
-rw-r--r--  1 ogrisel  staff   1.1K Mar 18  2003 53765
-rw-r--r--  1 ogrisel  staff   906B Mar 18  2003 53766
-rw-r--r--  1 ogrisel  staff   535B Mar 18  2003 53780
-rw-r--r--  1 ogrisel  staff   1.3K Mar 18  2003 53785
-rw-r--r--  1 ogrisel  staff   2.3K Mar 18  2003 54165
-rw-r--r--  1 ogrisel  staff   2.8K Mar 18  2003 54166
-rw-r--r--  1 ogrisel  staff   547B Mar 18  2003 54167
-rw-r--r--  1 ogrisel  staff   2.4K Mar 18  2003 54168
-rw-r--r--  1 ogrisel  staff   4.7K Mar 18  2003 54178
-rw-r--r--  1 ogrisel  staff   1.8K Mar 18  2003 54179
-rw-r--r--  1 ogrisel  staff   4.4K Mar 18  2003 54180
-rw-r--r--  1 ogrisel  staff   1.3K Mar 18  2003 54181
-rw-r--r--  1 ogrisel  staff   3.0K Mar 18  2003 54182
-rw-r--r--  1 ogrisel  staff   1.4K Mar 18  2003 54198
-rw-r--r--  1 ogrisel  staff   1.8K Mar 18  2003 54199
-rw-r--r--  1 ogrisel  staff   2.5K Mar 18  2003 54200
-rw-r--r--  1 ogrisel  staff   1.7K Mar 18  2003 54201
-rw-r--r--  1 ogrisel  staff   1.0K Mar 18  2003 54202
-rw-r--r--  1 ogrisel  staff   1.2K Mar 18  2003 54203
-rw-r--r--  1 ogrisel  staff   565B Mar 18  2003 54204
-rw-r--r--  1 ogrisel  staff   641B Mar 18  2003 54227
-rw-r--r--  1 ogrisel  staff   1.0K Mar 18  2003 54228
-rw-r--r--  1 ogrisel  staff   877B Mar 18  2003 54470
-rw-r--r--  1 ogrisel  staff   1.0K Mar 18  2003 54471
-rw-r--r--  1 ogrisel  staff   993B Mar 18  2003 54472
-rw-r--r--  1 ogrisel  staff   434B Mar 18  2003 54473

The load_files function can load text files from a two-level folder structure, assuming the folder names represent the categories:
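
Schematically, load_files expects a layout like the following (the folder and file names are just placeholders):

container_folder/
    category_1/
        doc_1.txt
        doc_2.txt
        ...
    category_2/
        doc_3.txt
        ...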


In [5]:
#print(load_files.__doc__)

In [6]:
all_twenty_train = load_files('../datasets/20news-bydate-train/',
    encoding='latin-1', random_state=42)
all_twenty_test = load_files('../datasets/20news-bydate-test/',
    encoding='latin-1', random_state=42)

In [7]:
all_target_names = all_twenty_train.target_names
all_target_names


Out[7]:
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [8]:
all_twenty_train.target


Out[8]:
array([12,  6,  9, ...,  9,  1, 12])

In [9]:
all_twenty_train.target.shape


Out[9]:
(11314,)

In [10]:
all_twenty_test.target.shape


Out[10]:
(7532,)

In [11]:
len(all_twenty_train.data)


Out[11]:
11314

In [12]:
type(all_twenty_train.data[0])


Out[12]:
str

In [13]:
def display_sample(i, dataset):
    print("Class name: " + dataset.target_names[dataset.target[i]])
    print("Text content:\n")
    print(dataset.data[i])

In [14]:
display_sample(0, all_twenty_train)


Class name: sci.electronics
Text content:

From: wtm@uhura.neoucom.edu (Bill Mayhew)
Subject: Re: How to the disks copy protected.
Organization: Northeastern Ohio Universities College of Medicine
Lines: 23

Write a good manual to go with the software.  The hassle of
photocopying the manual is offset by simplicity of purchasing
the package for only $15.  Also, consider offering an inexpensive
but attractive perc for registered users.  For instance, a coffee
mug.  You could produce and mail the incentive for a couple of
dollars, so consider pricing the product at $17.95.

You're lucky if only 20% of the instances of your program in use
are non-licensed users.

The best approach is to estimate your loss and accomodate that into
your price structure.  Sure it hurts legitimate users, but too bad.
Retailers have to charge off loss to shoplifters onto paying
customers; the software industry is the same.

Unless your product is exceptionally unique, using an ostensibly
copy-proof disk will just send your customers to the competetion.


-- 
Bill Mayhew      NEOUCOM Computer Services Department
Rootstown, OH  44272-9995  USA    phone: 216-325-2511
wtm@uhura.neoucom.edu (140.220.1.1)    146.580: N8WED


In [15]:
display_sample(1, all_twenty_train)


Class name: misc.forsale
Text content:

From: andy@SAIL.Stanford.EDU (Andy Freeman)
Subject: Re: Catalog of Hard-to-Find PC Enhancements (Repost)
Organization: Computer Science Department,  Stanford University.
Lines: 33

>andy@SAIL.Stanford.EDU (Andy Freeman) writes:
>> >In article <C5ELME.4z4@unix.portal.com> jdoll@shell.portal.com (Joe Doll) wr
>> >>   "The Catalog of Personal Computing Tools for Engineers and Scien-
>> >>   tists" lists hardware cards and application software packages for 
>> >>   PC/XT/AT/PS/2 class machines.  Focus is on engineering and scien-
>> >>   tific applications of PCs, such as data acquisition/control, 
>> >>   design automation, and data analysis and presentation.  
>> >
>> >>   If you would like a free copy, reply with your (U. S. Postal) 
>> >>   mailing address.
>> 
>> Don't bother - it never comes.  It's a cheap trick for building a
>> mailing list to sell if my junk mail flow is any indication.
>> 
>> -andy sent his address months ago
>
>Perhaps we can get Portal to nuke this weasal.  I never received a 
>catalog either.  If that person doesn't respond to a growing flame, then 
>we can assume that we'yall look forward to lotsa junk mail.

I don't want him nuked, I want him to be honest.  The junk mail has
been much more interesting than the promised catalog.  If I'd known
what I was going to get, I wouldn't have hesitated.  I wouldn't be
surprised if there were other folks who looked at the ad and said
"nope" but who would be very interested in the junk mail that results.
Similarly, there are people who wanted the advertised catalog who
aren't happy with the junk they got instead.

The folks buying the mailing lists would prefer an honest ad, and
so would the people reading it.

-andy
--

Let's compute the (uncompressed, in-memory) size of the training and test sets in MB assuming an 8-bit encoding (in this case, all chars can be encoded using the latin-1 charset).


In [16]:
def text_size(text, charset='iso-8859-1'):
    return len(text.encode(charset)) * 8 * 1e-6

train_size_mb = sum(text_size(text) for text in all_twenty_train.data) 
test_size_mb = sum(text_size(text) for text in all_twenty_test.data)

print("Training set size: {0} MB".format(int(train_size_mb)))
print("Testing set size: {0} MB".format(int(test_size_mb)))


Training set size: 176 MB
Testing set size: 110 MB

If we only consider the small subset restricted to the 4 categories selected in the initial example:


In [17]:
train_small_size_mb = sum(text_size(text) for text in twenty_train_small.data) 
test_small_size_mb = sum(text_size(text) for text in twenty_test_small.data)

print("Training set size: {0} MB".format(int(train_small_size_mb)))
print("Testing set size: {0} MB".format(int(test_small_size_mb)))


Training set size: 31 MB
Testing set size: 22 MB

Extracting Text Features


In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer

TfidfVectorizer()


Out[18]:
TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [19]:
vectorizer = TfidfVectorizer(min_df=1)

%time X_train_small = vectorizer.fit_transform(twenty_train_small.data)


CPU times: user 736 ms, sys: 23.2 ms, total: 759 ms
Wall time: 771 ms

The result is not a numpy array but a scipy.sparse matrix. This data structure is quite similar to a 2D numpy array, but it does not store the zero values.


In [20]:
X_train_small


Out[20]:
<2034x34118 sparse matrix of type '<class 'numpy.float64'>'
	with 323433 stored elements in Compressed Sparse Row format>
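
A minimal sketch of inspecting that sparsity, assuming the X_train_small matrix from the cell above: the nnz attribute gives the number of explicitly stored (non-zero) entries, and toarray() densifies a small slice for display.

# Fraction of entries that are actually stored (non-zero)
print(X_train_small.nnz / (X_train_small.shape[0] * X_train_small.shape[1]))

# Dense view of a tiny corner of the matrix
print(X_train_small[:2, :10].toarray())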

scipy.sparse matrices also have a shape attribute to access the dimensions:


In [21]:
n_samples, n_features = X_train_small.shape

This dataset has around 2000 samples (the rows of the data matrix):


In [22]:
n_samples


Out[22]:
2034

This is the same value as the number of strings in the original list of text documents:


In [23]:
len(twenty_train_small.data)


Out[23]:
2034

Each column corresponds to the occurrences of one individual token of the extracted vocabulary:


In [24]:
n_features


Out[24]:
34118

This number is the size of the vocabulary extracted during fit; it is stored as a Python dictionary that maps each token to its column index:


In [25]:
type(vectorizer.vocabulary_)


Out[25]:
dict

In [26]:
len(vectorizer.vocabulary_)


Out[26]:
34118
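
Since vocabulary_ maps each token string to its column index in the feature matrix, we can look up where a given word ends up. A minimal sketch (the token 'space' is just an example and is assumed to occur in this corpus; get() returns None for absent tokens):

print(vectorizer.vocabulary_.get('space'))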

The keys of the vocabulary_ attribute are also called feature names and can be accessed as a list of strings.


In [27]:
len(vectorizer.get_feature_names())


Out[27]:
34118

Here are the first 10 elements (sorted in lexicographical order):


In [28]:
vectorizer.get_feature_names()[:10]


Out[28]:
['00',
 '000',
 '0000',
 '00000',
 '000000',
 '000005102000',
 '000021',
 '000062david42',
 '0000vec',
 '0001']

Let's have a look at the features from the middle:


In [29]:
vectorizer.get_feature_names()[n_features // 2:n_features // 2 + 10]


Out[29]:
['inadequate',
 'inala',
 'inalienable',
 'inane',
 'inanimate',
 'inapplicable',
 'inappropriate',
 'inappropriately',
 'inaudible',
 'inbreeding']

Now that we have extracted a vector representation of the data, it's a good idea to project it onto its first two principal components to get a feel for the data. Note that the TruncatedSVD class can accept scipy.sparse matrices as input (as an alternative to numpy arrays):


In [30]:
from sklearn.decomposition import TruncatedSVD

%time X_train_small_pca = TruncatedSVD(n_components=2).fit_transform(X_train_small)


CPU times: user 97.8 ms, sys: 18.5 ms, total: 116 ms
Wall time: 109 ms

In [31]:
from itertools import cycle

colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k']
for i, c in zip(np.unique(y_train), cycle(colors)):
    plt.scatter(X_train_small_pca[y_train == i, 0],
               X_train_small_pca[y_train == i, 1],
               c=c, label=twenty_train_small.target_names[i], alpha=0.5)
    
_ = plt.legend(loc='best')


We can observe that there is a large overlap of the samples from different categories. This is to be expected as the PCA linear projection projects data from a 34118 dimensional space down to 2 dimensions: data that is linearly separable in 34118D is often no longer linearly separable in 2D.

Still we can notice an interesting pattern: the newsgroups on religion and atheism occupy much the same region, and the computer graphics and space science newsgroups overlap more with each other than they do with the religion or atheism newsgroups.

Training a Classifier on Text Features

We have previously extracted a vector representation of the training corpus and stored it in a variable named X_train_small. To train a supervised model, in this case a classifier, we also need the corresponding target labels:


In [32]:
y_train_small = twenty_train_small.target

In [33]:
y_train_small.shape


Out[33]:
(2034,)

In [34]:
y_train_small


Out[34]:
array([1, 2, 2, ..., 2, 1, 1])

We can check that we have the same number of samples for the input data and the labels:


In [35]:
X_train_small.shape[0] == y_train_small.shape[0]


Out[35]:
True

We can now train a classifier, for instance a Multinomial Naive Bayes classifier:


In [36]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB(alpha=0.1)
clf


Out[36]:
MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)

In [37]:
clf.fit(X_train_small, y_train_small)


Out[37]:
MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)

We can now evaluate the classifier on the testing set. Let's first use the built-in score method, which computes the rate of correct classification (the accuracy) on the test set:


In [38]:
X_test_small = vectorizer.transform(twenty_test_small.data)
y_test_small = twenty_test_small.target

In [39]:
X_test_small.shape


Out[39]:
(1353, 34118)

In [40]:
y_test_small.shape


Out[40]:
(1353,)

In [41]:
clf.score(X_test_small, y_test_small)


Out[41]:
0.89652623798965259
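
A minimal sketch of what score computes under the hood: predict the labels of the test documents and take the fraction that exactly match the true labels (the accuracy).

predicted = clf.predict(X_test_small)
print(np.mean(predicted == y_test_small))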

We can also compute the score on the training set and observe that the model is both overfitting and underfitting a bit at the same time:


In [42]:
clf.score(X_train_small, y_train_small)


Out[42]:
0.99262536873156337
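
To get an estimate that is less optimistic than the training score, we can also cross-validate the classifier on the training set. A minimal sketch using cross_val_score from the sklearn.cross_validation module (the module path of that era; it later moved to sklearn.model_selection):

from sklearn.cross_validation import cross_val_score

cv_scores = cross_val_score(clf, X_train_small, y_train_small, cv=5)
print("CV accuracy: {0:.3f} +/- {1:.3f}".format(cv_scores.mean(), cv_scores.std()))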

Introspecting the Behavior of the Text Vectorizer

The text vectorizer has many parameters to customize its behavior, in particular how it extracts tokens:


In [43]:
TfidfVectorizer()


Out[43]:
TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [44]:
print(TfidfVectorizer.__doc__)


Convert a collection of raw documents to a matrix of TF-IDF features.

    Equivalent to CountVectorizer followed by TfidfTransformer.

    Parameters
    ----------
    input : string {'filename', 'file', 'content'}
        If 'filename', the sequence passed as an argument to fit is
        expected to be a list of filenames that need reading to fetch
        the raw content to analyze.

        If 'file', the sequence items must have a 'read' method (file-like
        object) that is called to fetch the bytes in memory.

        Otherwise the input is expected to be the sequence strings or
        bytes items are expected to be analyzed directly.

    encoding : string, 'utf-8' by default.
        If bytes or files are given to analyze, this encoding is used to
        decode.

    decode_error : {'strict', 'ignore', 'replace'}
        Instruction on what to do if a byte sequence is given to analyze that
        contains characters not of the given `encoding`. By default, it is
        'strict', meaning that a UnicodeDecodeError will be raised. Other
        values are 'ignore' and 'replace'.

    strip_accents : {'ascii', 'unicode', None}
        Remove accents during the preprocessing step.
        'ascii' is a fast method that only works on characters that have
        an direct ASCII mapping.
        'unicode' is a slightly slower method that works on any characters.
        None (default) does nothing.

    analyzer : string, {'word', 'char'} or callable
        Whether the feature should be made of word or character n-grams.

        If a callable is passed it is used to extract the sequence of features
        out of the raw, unprocessed input.

    preprocessor : callable or None (default)
        Override the preprocessing (string transformation) stage while
        preserving the tokenizing and n-grams generation steps.

    tokenizer : callable or None (default)
        Override the string tokenization step while preserving the
        preprocessing and n-grams generation steps.

    ngram_range : tuple (min_n, max_n)
        The lower and upper boundary of the range of n-values for different
        n-grams to be extracted. All values of n such that min_n <= n <= max_n
        will be used.

    stop_words : string {'english'}, list, or None (default)
        If a string, it is passed to _check_stop_list and the appropriate stop
        list is returned. 'english' is currently the only supported string
        value.

        If a list, that list is assumed to contain stop words, all of which
        will be removed from the resulting tokens.

        If None, no stop words will be used. max_df can be set to a value
        in the range [0.7, 1.0) to automatically detect and filter stop
        words based on intra corpus document frequency of terms.

    lowercase : boolean, default True
        Convert all characters to lowercase before tokenizing.

    token_pattern : string
        Regular expression denoting what constitutes a "token", only used
        if `analyzer == 'word'`. The default regexp selects tokens of 2
        or more alphanumeric characters (punctuation is completely ignored
        and always treated as a token separator).

    max_df : float in range [0.0, 1.0] or int, default=1.0
        When building the vocabulary ignore terms that have a document frequency
        strictly higher than the given threshold (corpus specific stop words).
        If float, the parameter represents a proportion of documents, integer
        absolute counts.
        This parameter is ignored if vocabulary is not None.

    min_df : float in range [0.0, 1.0] or int, default=1
        When building the vocabulary ignore terms that have a document frequency
        strictly lower than the given threshold.
        This value is also called cut-off in the literature.
        If float, the parameter represents a proportion of documents, integer
        absolute counts.
        This parameter is ignored if vocabulary is not None.

    max_features : int or None, default=None
        If not None, build a vocabulary that only consider the top
        max_features ordered by term frequency across the corpus.

        This parameter is ignored if vocabulary is not None.

    vocabulary : Mapping or iterable, optional
        Either a Mapping (e.g., a dict) where keys are terms and values are
        indices in the feature matrix, or an iterable over terms. If not
        given, a vocabulary is determined from the input documents.

    binary : boolean, default=False
        If True, all non-zero term counts are set to 1. This does not mean
        outputs will have only 0/1 values, only that the tf term in tf-idf
        is binary. (Set idf and normalization to False to get 0/1 outputs.)

    dtype : type, optional
        Type of the matrix returned by fit_transform() or transform().

    norm : 'l1', 'l2' or None, optional
        Norm used to normalize term vectors. None for no normalization.

    use_idf : boolean, default=True
        Enable inverse-document-frequency reweighting.

    smooth_idf : boolean, default=True
        Smooth idf weights by adding one to document frequencies, as if an
        extra document was seen containing every term in the collection
        exactly once. Prevents zero divisions.

    sublinear_tf : boolean, default=False
        Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).

    Attributes
    ----------
    idf_ : array, shape = [n_features], or None
        The learned idf vector (global term weights)
        when ``use_idf`` is set to True, None otherwise.

    stop_words_ : set
        Terms that were ignored because they either:

          - occurred in too many documents (`max_df`)
          - occurred in too few documents (`min_df`)
          - were cut off by feature selection (`max_features`).

        This is only available if no vocabulary was given.

    See also
    --------
    CountVectorizer
        Tokenize the documents and count the occurrences of token and return
        them as a sparse matrix

    TfidfTransformer
        Apply Term Frequency Inverse Document Frequency normalization to a
        sparse matrix of occurrence counts.

    Notes
    -----
    The ``stop_words_`` attribute can get large and increase the model size
    when pickling. This attribute is provided only for introspection and can
    be safely removed using delattr or set to None before pickling.
    

The easiest way to introspect what the vectorizer is actually doing for a given set of parameters is to call vectorizer.build_analyzer() to get an instance of the text analyzer it uses to process the text:


In [45]:
analyzer = TfidfVectorizer().build_analyzer()
analyzer("I love scikit-learn: this is a cool Python lib!")


Out[45]:
['love', 'scikit', 'learn', 'this', 'is', 'cool', 'python', 'lib']

You can notice that all the tokens are lowercased, that the single letter word "I" was dropped, and that the hyphen was treated as a token separator (splitting "scikit-learn" into "scikit" and "learn"). Let's change some of that default behavior:


In [46]:
analyzer = TfidfVectorizer(
    preprocessor=lambda text: text,  # disable lowercasing
    token_pattern=r'(?u)\b[\w-]+\b', # treat hyphen as a letter
                                      # do not exclude single letter tokens
).build_analyzer()

analyzer("I love scikit-learn: this is a cool Python lib!")


Out[46]:
['I', 'love', 'scikit-learn', 'this', 'is', 'a', 'cool', 'Python', 'lib']

The analyzer name comes from the Lucene parlance: it wraps the sequential application of:

  • text preprocessing (processing the text documents as a whole, e.g. lowercasing)
  • text tokenization (splitting the document into a sequence of tokens)
  • token filtering and recombination (e.g. n-grams extraction, see later)

The analyzer system of scikit-learn is much more basic than Lucene's, though; a minimal sketch of these stages taken individually is shown below.
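
Here build_preprocessor and build_tokenizer are the standard scikit-learn counterparts of build_analyzer for the first two stages (the analyzer itself additionally applies the token filtering / n-grams step):

vec = TfidfVectorizer()
preprocess = vec.build_preprocessor()  # whole-document transformation, e.g. lowercasing
tokenize = vec.build_tokenizer()       # regexp-based splitting into candidate tokens

print(preprocess("I love scikit-learn!"))  # 'i love scikit-learn!'
print(tokenize("i love scikit-learn!"))    # ['love', 'scikit', 'learn']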

Exercise:

  • Write a pre-processor callable (e.g. a Python function) to remove the headers from the text of a newsgroup post (a minimal sketch is given after the hints below).
  • Vectorize the data again and measure the impact on performance of removing the header info from the dataset.
  • Do you expect the performance of the model to improve or decrease? What is the score of a uniform random classifier on the same dataset?

Hint: the TfidfVectorizer class can accept Python functions to customize the preprocessor, tokenizer or analyzer stages of the vectorizer.

  • type TfidfVectorizer() alone in a cell to see the default value of the parameters

  • type TfidfVectorizer.__doc__ to print the constructor parameters doc, or use the ? suffix operator on any Python class or method to read its docstring, or even the ?? operator to read its source code.
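
A minimal sketch of a header-stripping preprocessor (this is an illustration, not the content of the solution files below): newsgroup headers are separated from the body by the first blank line, and lowercasing is re-applied manually because passing a custom preprocessor replaces the default one.

def strip_headers(post):
    """Return the lowercased body of a newsgroup post, dropping the header block."""
    _, _, body = post.partition('\n\n')
    return body.lower()

headerless_vectorizer = TfidfVectorizer(preprocessor=strip_headers, min_df=2)
X_train_no_headers = headerless_vectorizer.fit_transform(twenty_train_small.data)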


In [47]:


In [48]:
%load solutions/07A_1_strip_headers.py

In [49]:


In [50]:
%load solutions/07A_2_evaluate_model.py

Model Selection of the Naive Bayes Classifier Parameter Alone

The MultinomialNB class is a good baseline classifier for text as it's fast and has few parameters to tweak:


In [51]:
MultinomialNB()


Out[51]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [52]:
print(MultinomialNB.__doc__)


    Naive Bayes classifier for multinomial models

    The multinomial Naive Bayes classifier is suitable for classification with
    discrete features (e.g., word counts for text classification). The
    multinomial distribution normally requires integer feature counts. However,
    in practice, fractional counts such as tf-idf may also work.

    Parameters
    ----------
    alpha : float, optional (default=1.0)
        Additive (Laplace/Lidstone) smoothing parameter
        (0 for no smoothing).

    fit_prior : boolean
        Whether to learn class prior probabilities or not.
        If false, a uniform prior will be used.

    class_prior : array-like, size (n_classes,)
        Prior probabilities of the classes. If specified the priors are not
        adjusted according to the data.

    Attributes
    ----------
    class_log_prior_ : array, shape (n_classes, )
        Smoothed empirical log probability for each class.

    intercept_ : property
        Mirrors ``class_log_prior_`` for interpreting MultinomialNB
        as a linear model.

    feature_log_prob_ : array, shape (n_classes, n_features)
        Empirical log probability of features
        given a class, ``P(x_i|y)``.

    coef_ : property
        Mirrors ``feature_log_prob_`` for interpreting MultinomialNB
        as a linear model.

    class_count_ : array, shape (n_classes,)
        Number of samples encountered for each class during fitting. This
        value is weighted by the sample weight when provided.

    feature_count_ : array, shape (n_classes, n_features)
        Number of samples encountered for each (class, feature)
        during fitting. This value is weighted by the sample weight when
        provided.

    Examples
    --------
    >>> import numpy as np
    >>> X = np.random.randint(5, size=(6, 100))
    >>> y = np.array([1, 2, 3, 4, 5, 6])
    >>> from sklearn.naive_bayes import MultinomialNB
    >>> clf = MultinomialNB()
    >>> clf.fit(X, y)
    MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
    >>> print(clf.predict(X[2]))
    [3]

    Notes
    -----
    For the rationale behind the names `coef_` and `intercept_`, i.e.
    naive Bayes as a linear classifier, see J. Rennie et al. (2003),
    Tackling the poor assumptions of naive Bayes text classifiers, ICML.

    References
    ----------
    C.D. Manning, P. Raghavan and H. Schuetze (2008). Introduction to
    Information Retrieval. Cambridge University Press, pp. 234-265.
    http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html
    

From the docstring we can see that the alpha parameter is a good candidate for adjusting the model along the bias (underfitting) vs variance (overfitting) trade-off.

Exercise:

  • use the sklearn.grid_search.GridSearchCV class or the model_selection.RandomizedGridSeach utility from the previous chapters to find a good value for the parameter alpha (a minimal GridSearchCV sketch is given after the hints below),
  • plot the validation scores (and optionally the training scores) for each value of alpha and identify the areas where the model overfits or underfits.

Hints:

  • you can search for values of alpha in the range [0.00001, 1] using a logarithmic scale,
  • RandomizedGridSearch also has a launch_for_arrays method as an alternative to launch_for_splits in case the CV splits have not been precomputed in advance.
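
A minimal sketch of the GridSearchCV route (the alpha range follows the hint above; variable names are just illustrative; sklearn.grid_search is the module path used at the time of this notebook):

from sklearn.grid_search import GridSearchCV

nb_grid = GridSearchCV(MultinomialNB(), {'alpha': np.logspace(-5, 0, 6)}, cv=3)
nb_grid.fit(X_train_small, y_train_small)
print(nb_grid.best_params_, nb_grid.best_score_)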

In [53]:


In [54]:
%load solutions/07B_grid_search_alpha_nb.py

In [55]:


In [56]:
%load solutions/07C_validation_curves_alpha.py

Setting Up a Pipeline for Cross-Validation and Model Selection of the Feature Extraction Parameters
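
A minimal sketch of such a pipeline, chaining the vectorizer and the classifier from the previous sections so that both can be cross-validated and tuned together (the step names 'vec' and 'clf' are arbitrary):

from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vec', TfidfVectorizer(min_df=2)),
    ('clf', MultinomialNB(alpha=0.1)),
])
pipeline.fit(twenty_train_small.data, twenty_train_small.target)
print(pipeline.score(twenty_test_small.data, twenty_test_small.target))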

The feature extraction class has many options to customize its behavior:


In [57]:
print(TfidfVectorizer.__doc__)


Convert a collection of raw documents to a matrix of TF-IDF features.

    Equivalent to CountVectorizer followed by TfidfTransformer.

    Parameters
    ----------
    input : string {'filename', 'file', 'content'}
        If 'filename', the sequence passed as an argument to fit is
        expected to be a list of filenames that need reading to fetch
        the raw content to analyze.

        If 'file', the sequence items must have a 'read' method (file-like
        object) that is called to fetch the bytes in memory.

        Otherwise the input is expected to be the sequence strings or
        bytes items are expected to be analyzed directly.

    encoding : string, 'utf-8' by default.
        If bytes or files are given to analyze, this encoding is used to
        decode.

    decode_error : {'strict', 'ignore', 'replace'}
        Instruction on what to do if a byte sequence is given to analyze that
        contains characters not of the given `encoding`. By default, it is
        'strict', meaning that a UnicodeDecodeError will be raised. Other
        values are 'ignore' and 'replace'.

    strip_accents : {'ascii', 'unicode', None}
        Remove accents during the preprocessing step.
        'ascii' is a fast method that only works on characters that have
        an direct ASCII mapping.
        'unicode' is a slightly slower method that works on any characters.
        None (default) does nothing.

    analyzer : string, {'word', 'char'} or callable
        Whether the feature should be made of word or character n-grams.

        If a callable is passed it is used to extract the sequence of features
        out of the raw, unprocessed input.

    preprocessor : callable or None (default)
        Override the preprocessing (string transformation) stage while
        preserving the tokenizing and n-grams generation steps.

    tokenizer : callable or None (default)
        Override the string tokenization step while preserving the
        preprocessing and n-grams generation steps.

    ngram_range : tuple (min_n, max_n)
        The lower and upper boundary of the range of n-values for different
        n-grams to be extracted. All values of n such that min_n <= n <= max_n
        will be used.

    stop_words : string {'english'}, list, or None (default)
        If a string, it is passed to _check_stop_list and the appropriate stop
        list is returned. 'english' is currently the only supported string
        value.

        If a list, that list is assumed to contain stop words, all of which
        will be removed from the resulting tokens.

        If None, no stop words will be used. max_df can be set to a value
        in the range [0.7, 1.0) to automatically detect and filter stop
        words based on intra corpus document frequency of terms.

    lowercase : boolean, default True
        Convert all characters to lowercase before tokenizing.

    token_pattern : string
        Regular expression denoting what constitutes a "token", only used
        if `analyzer == 'word'`. The default regexp selects tokens of 2
        or more alphanumeric characters (punctuation is completely ignored
        and always treated as a token separator).

    max_df : float in range [0.0, 1.0] or int, default=1.0
        When building the vocabulary ignore terms that have a document frequency
        strictly higher than the given threshold (corpus specific stop words).
        If float, the parameter represents a proportion of documents, integer
        absolute counts.
        This parameter is ignored if vocabulary is not None.

    min_df : float in range [0.0, 1.0] or int, default=1
        When building the vocabulary ignore terms that have a document frequency
        strictly lower than the given threshold.
        This value is also called cut-off in the literature.
        If float, the parameter represents a proportion of documents, integer
        absolute counts.
        This parameter is ignored if vocabulary is not None.

    max_features : int or None, default=None
        If not None, build a vocabulary that only consider the top
        max_features ordered by term frequency across the corpus.

        This parameter is ignored if vocabulary is not None.

    vocabulary : Mapping or iterable, optional
        Either a Mapping (e.g., a dict) where keys are terms and values are
        indices in the feature matrix, or an iterable over terms. If not
        given, a vocabulary is determined from the input documents.

    binary : boolean, default=False
        If True, all non-zero term counts are set to 1. This does not mean
        outputs will have only 0/1 values, only that the tf term in tf-idf
        is binary. (Set idf and normalization to False to get 0/1 outputs.)

    dtype : type, optional
        Type of the matrix returned by fit_transform() or transform().

    norm : 'l1', 'l2' or None, optional
        Norm used to normalize term vectors. None for no normalization.

    use_idf : boolean, default=True
        Enable inverse-document-frequency reweighting.

    smooth_idf : boolean, default=True
        Smooth idf weights by adding one to document frequencies, as if an
        extra document was seen containing every term in the collection
        exactly once. Prevents zero divisions.

    sublinear_tf : boolean, default=False
        Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).

    Attributes
    ----------
    idf_ : array, shape = [n_features], or None
        The learned idf vector (global term weights)
        when ``use_idf`` is set to True, None otherwise.

    stop_words_ : set
        Terms that were ignored because they either:

          - occurred in too many documents (`max_df`)
          - occurred in too few documents (`min_df`)
          - were cut off by feature selection (`max_features`).

        This is only available if no vocabulary was given.

    See also
    --------
    CountVectorizer
        Tokenize the documents and count the occurrences of token and return
        them as a sparse matrix

    TfidfTransformer
        Apply Term Frequency Inverse Document Frequency normalization to a
        sparse matrix of occurrence counts.

    Notes
    -----
    The ``stop_words_`` attribute can get large and increase the model size
    when pickling. This attribute is provided only for introspection and can
    be safely removed using delattr or set to None before pickling.
    

To evaluate the impact of the feature extraction parameters, we can chain a configured vectorizer and a linear classifier (here a Passive Aggressive model, as an alternative to the naive Bayes model) in a pipeline:


In [58]:
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline((
    ('vec', TfidfVectorizer(min_df=1, max_df=0.8, use_idf=True)),
    ('clf', PassiveAggressiveClassifier(C=1)),
))

Such a pipeline can then be cross-validated or even grid-searched:


In [59]:
from sklearn.cross_validation import cross_val_score
from scipy.stats import sem

scores = cross_val_score(pipeline, twenty_train_small.data,
                         twenty_train_small.target, cv=3, n_jobs=-1)
scores.mean(), sem(scores)


Out[59]:
(0.96362849415981844, 0.0048041106263237311)

For the grid search, the parameter names are prefixed with the name of the pipeline step, using "__" as a separator:


In [60]:
from sklearn.grid_search import GridSearchCV

parameters = {
    #'vec__min_df': [1, 2],
    'vec__max_df': [0.8, 1.0],
    'vec__ngram_range': [(1, 1), (1, 2)],
    'vec__use_idf': [True, False],
}

gs = GridSearchCV(pipeline, parameters, verbose=2, refit=False)
_ = gs.fit(twenty_train_small.data, twenty_train_small.target)


Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV] vec__use_idf=True, vec__ngram_range=(1, 1), vec__max_df=0.8 .....
[CV]  vec__use_idf=True, vec__ngram_range=(1, 1), vec__max_df=0.8 -   0.7s
[CV] vec__use_idf=True, vec__ngram_range=(1, 1), vec__max_df=0.8 .....
[CV]  vec__use_idf=True, vec__ngram_range=(1, 1), vec__max_df=0.8 -   0.7s
[CV] vec__use_idf=True, vec__ngram_range=(1, 1), vec__max_df=0.8 .....
[CV]  vec__use_idf=True, vec__ngram_range=(1, 1), vec__max_df=0.8 -   0.7s
[CV] vec__use_idf=False, vec__ngram_range=(1, 1), vec__max_df=0.8 ....
[CV]  vec__use_idf=False, vec__ngram_range=(1, 1), vec__max_df=0.8 -   0.7s
[CV] vec__use_idf=False, vec__ngram_range=(1, 1), vec__max_df=0.8 ....
[CV]  vec__use_idf=False, vec__ngram_range=(1, 1), vec__max_df=0.8 -   0.7s
[CV] vec__use_idf=False, vec__ngram_range=(1, 1), vec__max_df=0.8 ....
[CV]  vec__use_idf=False, vec__ngram_range=(1, 1), vec__max_df=0.8 -   0.7s
[CV] vec__use_idf=True, vec__ngram_range=(1, 2), vec__max_df=0.8 .....
[CV]  vec__use_idf=True, vec__ngram_range=(1, 2), vec__max_df=0.8 -   2.9s
[CV] vec__use_idf=True, vec__ngram_range=(1, 2), vec__max_df=0.8 .....
[CV]  vec__use_idf=True, vec__ngram_range=(1, 2), vec__max_df=0.8 -   3.0s
[CV] vec__use_idf=True, vec__ngram_range=(1, 2), vec__max_df=0.8 .....
[CV]  vec__use_idf=True, vec__ngram_range=(1, 2), vec__max_df=0.8 -   3.0s
[CV] vec__use_idf=False, vec__ngram_range=(1, 2), vec__max_df=0.8 ....
[CV]  vec__use_idf=False, vec__ngram_range=(1, 2), vec__max_df=0.8 -   2.8s
[CV] vec__use_idf=False, vec__ngram_range=(1, 2), vec__max_df=0.8 ....
[CV]  vec__use_idf=False, vec__ngram_range=(1, 2), vec__max_df=0.8 -   2.9s
[CV] vec__use_idf=False, vec__ngram_range=(1, 2), vec__max_df=0.8 ....
[CV]  vec__use_idf=False, vec__ngram_range=(1, 2), vec__max_df=0.8 -   2.9s
[CV] vec__use_idf=True, vec__ngram_range=(1, 1), vec__max_df=1.0 .....
[CV]  vec__use_idf=True, vec__ngram_range=(1, 1), vec__max_df=1.0 -   0.7s
[CV] vec__use_idf=True, vec__ngram_range=(1, 1), vec__max_df=1.0 .....
[CV]  vec__use_idf=True, vec__ngram_range=(1, 1), vec__max_df=1.0 -   0.7s
[CV] vec__use_idf=True, vec__ngram_range=(1, 1), vec__max_df=1.0 .....
[CV]  vec__use_idf=True, vec__ngram_range=(1, 1), vec__max_df=1.0 -   0.7s
[CV] vec__use_idf=False, vec__ngram_range=(1, 1), vec__max_df=1.0 ....
[CV]  vec__use_idf=False, vec__ngram_range=(1, 1), vec__max_df=1.0 -   0.7s
[CV] vec__use_idf=False, vec__ngram_range=(1, 1), vec__max_df=1.0 ....
[CV]  vec__use_idf=False, vec__ngram_range=(1, 1), vec__max_df=1.0 -   0.7s
[CV] vec__use_idf=False, vec__ngram_range=(1, 1), vec__max_df=1.0 ....
[CV]  vec__use_idf=False, vec__ngram_range=(1, 1), vec__max_df=1.0 -   0.7s
[CV] vec__use_idf=True, vec__ngram_range=(1, 2), vec__max_df=1.0 .....
[CV]  vec__use_idf=True, vec__ngram_range=(1, 2), vec__max_df=1.0 -   2.8s
[CV] vec__use_idf=True, vec__ngram_range=(1, 2), vec__max_df=1.0 .....
[CV]  vec__use_idf=True, vec__ngram_range=(1, 2), vec__max_df=1.0 -   2.9s
[CV] vec__use_idf=True, vec__ngram_range=(1, 2), vec__max_df=1.0 .....
[CV]  vec__use_idf=True, vec__ngram_range=(1, 2), vec__max_df=1.0 -   3.0s
[CV] vec__use_idf=False, vec__ngram_range=(1, 2), vec__max_df=1.0 ....
[CV]  vec__use_idf=False, vec__ngram_range=(1, 2), vec__max_df=1.0 -   2.8s
[CV] vec__use_idf=False, vec__ngram_range=(1, 2), vec__max_df=1.0 ....
[CV]  vec__use_idf=False, vec__ngram_range=(1, 2), vec__max_df=1.0 -   2.8s
[CV] vec__use_idf=False, vec__ngram_range=(1, 2), vec__max_df=1.0 ....
[CV]  vec__use_idf=False, vec__ngram_range=(1, 2), vec__max_df=1.0 -   2.9s
[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    0.8s
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:   43.8s finished


In [61]:
gs.best_score_


Out[61]:
0.96411012782694194

In [62]:
gs.best_params_


Out[62]:
{'vec__max_df': 0.8, 'vec__ngram_range': (1, 1), 'vec__use_idf': True}
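
Since the grid search above was run with refit=False, no final model was refit on the full training set. As a minimal sketch (reusing the pipeline and datasets defined earlier), the best parameters can be pushed back into the pipeline with the same double-underscore naming convention before refitting and scoring on the held-out test set:

# Inject the best parameters found by the grid search into the pipeline
pipeline.set_params(**gs.best_params_)
pipeline.fit(twenty_train_small.data, twenty_train_small.target)
print("Testing score: {0:.1f}%".format(
    pipeline.score(twenty_test_small.data, twenty_test_small.target) * 100))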

Introspecting Model Performance

Displaying the Most Discriminative Features

Let's fit a model on the small dataset and collect info on the fitted components:


In [63]:
_ = pipeline.fit(twenty_train_small.data, twenty_train_small.target)

In [64]:
vec_name, vec = pipeline.steps[0]
clf_name, clf = pipeline.steps[1]

feature_names = vec.get_feature_names()
target_names = twenty_train_small.target_names

feature_weights = clf.coef_

feature_weights.shape


Out[64]:
(4, 34109)

By sorting the feature weights of the linear model and asking the vectorizer for the corresponding feature names, we can get a clue about what the model actually learned from the data:


In [65]:
def display_important_features(feature_names, target_names, weights, n_top=30):
    for i, target_name in enumerate(target_names):
        print("Class: " + target_name)
        print("")
        
        sorted_features_indices = weights[i].argsort()[::-1]
        
        most_important = sorted_features_indices[:n_top]
        print(", ".join("{0}: {1:.4f}".format(feature_names[j], weights[i, j])
                        for j in most_important))
        print("...")
        
        least_important = sorted_features_indices[-n_top:]
        print(", ".join("{0}: {1:.4f}".format(feature_names[j], weights[i, j])
                        for j in least_important))
        print("")
        
display_important_features(feature_names, target_names, feature_weights)


Class: alt.atheism

keith: 2.8194, atheism: 2.7564, atheists: 2.7424, cobb: 2.2468, okcforum: 1.8102, caltech: 1.7408, islamic: 1.6697, enviroleague: 1.5718, wingate: 1.5251, freedom: 1.5048, osrhe: 1.5004, rice: 1.4750, tek: 1.4727, mangoe: 1.4727, bobby: 1.4498, religion: 1.4118, peace: 1.4078, wwc: 1.4041, atheist: 1.3920, rushdie: 1.3670, bible: 1.3650, jaeger: 1.3511, liar: 1.3140, charley: 1.2767, perry: 1.2447, tammy: 1.2407, ico: 1.2387, genocide: 1.2220, vice: 1.2021, lunatic: 1.2016
...
alaska: -0.8896, 10: -0.8958, christians: -0.9091, paul: -0.9139, use: -0.9206, christ: -0.9230, microsoft: -0.9442, order: -0.9500, objective: -0.9578, brian: -0.9652, just: -0.9675, fbi: -0.9695, access: -0.9731, org: -0.9865, with: -0.9894, am: -0.9990, image: -1.0032, 2000: -1.0056, interested: -1.0358, thanks: -1.0359, moon: -1.0573, com: -1.0773, out: -1.1073, morality: -1.1142, muhammad: -1.1251, mail: -1.1978, graphics: -1.3494, hudson: -1.3525, christian: -1.5203, space: -1.6827

Class: comp.graphics

graphics: 3.6426, image: 2.6292, 42: 1.9402, 3d: 1.9254, color: 1.8483, file: 1.7020, 3do: 1.6883, polygon: 1.6722, computer: 1.6710, card: 1.6696, files: 1.6683, animation: 1.6485, points: 1.6387, tiff: 1.6103, cview: 1.4854, code: 1.4841, package: 1.4833, video: 1.3957, windows: 1.3737, hi: 1.3732, format: 1.3038, fractal: 1.2566, version: 1.1942, images: 1.1784, need: 1.1642, advance: 1.1561, help: 1.1379, looking: 1.1375, nl: 1.1366, comp: 1.1238
...
atheism: -0.9332, not: -0.9393, access: -0.9463, koresh: -0.9473, article: -0.9760, sci: -0.9869, cmu: -0.9920, you: -0.9965, funding: -1.0374, planets: -1.0380, by: -1.0393, nasa: -1.0420, moon: -1.0425, people: -1.0477, that: -1.0510, pat: -1.0599, was: -1.0670, he: -1.1034, dgi: -1.1106, jennise: -1.1106, shuttle: -1.1310, beast: -1.1441, who: -1.1590, writes: -1.1900, dc: -1.2186, re: -1.4984, edu: -1.5214, orbit: -1.5226, god: -1.5656, space: -3.5997

Class: sci.space

space: 5.9217, orbit: 2.2741, moon: 2.1519, sci: 1.8638, dc: 1.8301, alaska: 1.8176, nasa: 1.7456, launch: 1.5911, pat: 1.5664, henry: 1.5581, mars: 1.4650, nick: 1.4093, solar: 1.3964, aurora: 1.3953, flight: 1.3799, shuttle: 1.3356, ether: 1.3336, sunrise: 1.3294, spacecraft: 1.3180, rockets: 1.3099, sunset: 1.2997, astronomy: 1.2719, planets: 1.2427, cmu: 1.2313, dseg: 1.2266, nicho: 1.2217, fred: 1.2062, dgi: 1.1914, jennise: 1.1914, lunar: 1.1761
...
files: -0.8118, polygon: -0.8120, vga: -0.8211, 3do: -0.8353, hi: -0.8429, format: -0.8589, cc: -0.8636, morality: -0.8724, gaspra: -0.8818, beast: -0.8888, sandvik: -0.8905, sphere: -0.8954, package: -0.9099, video: -0.9404, nl: -0.9629, koresh: -0.9686, any: -0.9694, com: -0.9705, color: -0.9742, sgi: -1.0022, keith: -1.0082, religion: -1.0835, christian: -1.0957, points: -1.1166, file: -1.2138, 3d: -1.2298, image: -1.2387, animation: -1.3019, god: -1.8429, graphics: -2.3409

Class: talk.religion.misc

christian: 2.8059, beast: 2.0125, who: 1.8315, hudson: 1.8281, christians: 1.6947, mr: 1.6917, brian: 1.6664, fbi: 1.6358, koresh: 1.5972, biblical: 1.5780, frank: 1.5190, buffalo: 1.5031, morality: 1.4865, terrorist: 1.4707, abortion: 1.4671, 2000: 1.4547, weiss: 1.4019, mormons: 1.3975, thyagi: 1.3824, convenient: 1.3615, 666: 1.3609, church: 1.3498, order: 1.3242, mitre: 1.3234, christ: 1.2872, blood: 1.2610, freenet: 1.2556, rosicrucian: 1.1821, muhammad: 1.1793, greek: 1.1773
...
dan: -0.9760, tammy: -0.9874, liar: -0.9889, mangoe: -0.9930, need: -0.9931, peace: -1.0059, on: -1.0077, for: -1.0173, caltech: -1.0353, wingate: -1.0389, ac: -1.0947, wwc: -1.1128, uk: -1.1263, atheist: -1.1403, ibm: -1.1575, thing: -1.1694, nasa: -1.1823, file: -1.1920, free: -1.2025, thanks: -1.3018, princeton: -1.3028, freedom: -1.3068, cobb: -1.3426, graphics: -1.5477, edu: -1.6370, keith: -1.6592, atheism: -1.7010, it: -1.9919, atheists: -2.0447, space: -2.3754

Displaying the Per-Class Classification Report


In [66]:
from sklearn.metrics import classification_report

predicted = pipeline.predict(twenty_test_small.data)

In [67]:
print(classification_report(twenty_test_small.target, predicted,
                            target_names=twenty_test_small.target_names))


                    precision    recall  f1-score   support

       alt.atheism       0.86      0.83      0.84       319
     comp.graphics       0.93      0.96      0.95       389
         sci.space       0.94      0.95      0.95       394
talk.religion.misc       0.79      0.78      0.79       251

       avg / total       0.89      0.89      0.89      1353

Printing the Confusion Matrix

The confusion matrix summarizes which classes get confused with one another by looking at the off-diagonal entries: here we can see, for instance, that articles about atheism have been wrongly classified as being about religion 43 times:


In [68]:
from sklearn.metrics import confusion_matrix

confusion_matrix(twenty_test_small.target, predicted)


Out[68]:
array([[264,   5,   7,  43],
       [  1, 374,   6,   8],
       [  3,  16, 374,   1],
       [ 39,   7,   9, 196]])

In [69]:
twenty_test_small.target_names


Out[69]:
['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']
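
As a complementary view, here is a minimal sketch (reusing the plt and np imports from the notebook setup) that displays the same confusion matrix as a heat map with the class names on the axes, which makes the off-diagonal confusions easier to spot:

# Plot the confusion matrix with the class names as tick labels
cm = confusion_matrix(twenty_test_small.target, predicted)
plt.matshow(cm, cmap=plt.cm.Blues)
plt.colorbar()
tick_marks = np.arange(len(twenty_test_small.target_names))
plt.xticks(tick_marks, twenty_test_small.target_names, rotation=90)
plt.yticks(tick_marks, twenty_test_small.target_names)
plt.xlabel('Predicted label')
plt.ylabel('True label')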

In [70]: