In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# Some nice default configuration for plots
plt.rcParams['figure.figsize'] = 10, 7.5
plt.rcParams['axes.grid'] = True
plt.gray()


<matplotlib.figure.Figure at 0x1081a4ed0>

Text Feature Extraction for Classification and Clustering

Outline of this section:

  • Turn a corpus of text documents into feature vectors using a Bag of Words representation,
  • Train a simple text classifier on the feature vectors,
  • Wrap the vectorizer and the classifier with a pipeline,
  • Cross-validation and model selection on the pipeline.

Text Classification in 20 lines of Python

Let's start by implementing a canonical text classification example:

  • The 20 newsgroups dataset: around 18,000 text posts from 20 newsgroup forums
  • Bag of Words feature extraction with TF-IDF weighting
  • Naive Bayes classifier or Linear Support Vector Machine for the classifier itself

In [2]:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Load the text data
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]
twenty_train_small = load_files('../datasets/20news-bydate-train/',
    categories=categories, encoding='latin-1')
twenty_test_small = load_files('../datasets/20news-bydate-test/',
    categories=categories, encoding='latin-1')

# Turn the text documents into vectors of word frequencies
vectorizer = TfidfVectorizer(min_df=2)
X_train = vectorizer.fit_transform(twenty_train_small.data)
y_train = twenty_train_small.target

# Fit a classifier on the training set
classifier = MultinomialNB().fit(X_train, y_train)
print("Training score: {0:.1f}%".format(
    classifier.score(X_train, y_train) * 100))

# Evaluate the classifier on the testing set
X_test = vectorizer.transform(twenty_test_small.data)
y_test = twenty_test_small.target
print("Testing score: {0:.1f}%".format(
    classifier.score(X_test, y_test) * 100))


Training score: 95.1%
Testing score: 85.1%

Here is a summary of the workflow so far: the raw training documents are turned into TF-IDF feature vectors by the vectorizer, a classifier is fitted on the resulting matrix, and the test documents are passed through the same (already fitted) vectorizer before being scored.

Let's now decompose what we just did to understand and customize each step.

Loading the Dataset

Let's explore the dataset loading utility without passing a list of categories: in this case we load the full 20 newsgroups dataset in memory. The source website for the 20 newsgroups already provides a date-based train / test split; here it corresponds to the two folders 20news-bydate-train and 20news-bydate-test on disk (the built-in fetch_20newsgroups loader exposes the same split through its subset keyword argument).

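As an aside, here is a minimal sketch of the download-based alternative (it fetches and caches the archive on first use, so it needs network access the first time; the twenty_train_remote name is just for this illustration):

from sklearn.datasets import fetch_20newsgroups

# Same 4 categories and the same date-based train split, fetched from the web
twenty_train_remote = fetch_20newsgroups(subset='train', categories=categories)
print(len(twenty_train_remote.data))

The rest of this section works with the local copy extracted under ../datasets/: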

In [3]:
ls -l ../datasets/


total 28376
drwxr-xr-x  22 awalz  staff       748 Mar 18  2003 20news-bydate-test/
drwxr-xr-x  22 awalz  staff       748 Mar 18  2003 20news-bydate-train/
-rw-r--r--   1 awalz  staff  14464277 Jun 12 12:55 20news-bydate.tar.gz
-rw-r--r--   1 awalz  staff     61194 Jun 12 12:56 titanic_train.csv

In [4]:
ls -lh ../datasets/20news-bydate-train


total 0
drwxr-xr-x  482 awalz  staff    16K Mar 18  2003 alt.atheism/
drwxr-xr-x  586 awalz  staff    19K Mar 18  2003 comp.graphics/
drwxr-xr-x  593 awalz  staff    20K Mar 18  2003 comp.os.ms-windows.misc/
drwxr-xr-x  592 awalz  staff    20K Mar 18  2003 comp.sys.ibm.pc.hardware/
drwxr-xr-x  580 awalz  staff    19K Mar 18  2003 comp.sys.mac.hardware/
drwxr-xr-x  595 awalz  staff    20K Mar 18  2003 comp.windows.x/
drwxr-xr-x  587 awalz  staff    19K Mar 18  2003 misc.forsale/
drwxr-xr-x  596 awalz  staff    20K Mar 18  2003 rec.autos/
drwxr-xr-x  600 awalz  staff    20K Mar 18  2003 rec.motorcycles/
drwxr-xr-x  599 awalz  staff    20K Mar 18  2003 rec.sport.baseball/
drwxr-xr-x  602 awalz  staff    20K Mar 18  2003 rec.sport.hockey/
drwxr-xr-x  597 awalz  staff    20K Mar 18  2003 sci.crypt/
drwxr-xr-x  593 awalz  staff    20K Mar 18  2003 sci.electronics/
drwxr-xr-x  596 awalz  staff    20K Mar 18  2003 sci.med/
drwxr-xr-x  595 awalz  staff    20K Mar 18  2003 sci.space/
drwxr-xr-x  601 awalz  staff    20K Mar 18  2003 soc.religion.christian/
drwxr-xr-x  548 awalz  staff    18K Mar 18  2003 talk.politics.guns/
drwxr-xr-x  566 awalz  staff    19K Mar 18  2003 talk.politics.mideast/
drwxr-xr-x  467 awalz  staff    16K Mar 18  2003 talk.politics.misc/
drwxr-xr-x  379 awalz  staff    13K Mar 18  2003 talk.religion.misc/

In [5]:
ls -lh ../datasets/20news-bydate-train/alt.atheism/


total 4480
-rw-r--r--  1 awalz  staff    12K Mar 18  2003 49960
-rw-r--r--  1 awalz  staff    31K Mar 18  2003 51060
-rw-r--r--  1 awalz  staff   4.0K Mar 18  2003 51119
-rw-r--r--  1 awalz  staff   1.6K Mar 18  2003 51120
-rw-r--r--  1 awalz  staff   773B Mar 18  2003 51121
-rw-r--r--  1 awalz  staff   4.8K Mar 18  2003 51122
-rw-r--r--  1 awalz  staff   618B Mar 18  2003 51123
-rw-r--r--  1 awalz  staff   1.4K Mar 18  2003 51124
-rw-r--r--  1 awalz  staff   2.7K Mar 18  2003 51125
-rw-r--r--  1 awalz  staff   427B Mar 18  2003 51126
-rw-r--r--  1 awalz  staff   742B Mar 18  2003 51127
-rw-r--r--  1 awalz  staff   650B Mar 18  2003 51128
-rw-r--r--  1 awalz  staff   1.3K Mar 18  2003 51130
-rw-r--r--  1 awalz  staff   2.3K Mar 18  2003 51131
-rw-r--r--  1 awalz  staff   2.6K Mar 18  2003 51132
-rw-r--r--  1 awalz  staff   1.5K Mar 18  2003 51133
-rw-r--r--  1 awalz  staff   1.2K Mar 18  2003 51134
-rw-r--r--  1 awalz  staff   1.6K Mar 18  2003 51135
-rw-r--r--  1 awalz  staff   2.1K Mar 18  2003 51136
-rw-r--r--  1 awalz  staff   1.3K Mar 18  2003 51139
-rw-r--r--  1 awalz  staff   409B Mar 18  2003 51140
-rw-r--r--  1 awalz  staff   940B Mar 18  2003 51141
-rw-r--r--  1 awalz  staff   9.0K Mar 18  2003 51142
-rw-r--r--  1 awalz  staff   632B Mar 18  2003 51143
-rw-r--r--  1 awalz  staff   1.2K Mar 18  2003 51144
-rw-r--r--  1 awalz  staff   609B Mar 18  2003 51145
-rw-r--r--  1 awalz  staff   631B Mar 18  2003 51146
-rw-r--r--  1 awalz  staff   1.0K Mar 18  2003 51147
-rw-r--r--  1 awalz  staff   1.8K Mar 18  2003 51148
-rw-r--r--  1 awalz  staff   405B Mar 18  2003 51149
-rw-r--r--  1 awalz  staff   696B Mar 18  2003 51150
-rw-r--r--  1 awalz  staff   5.5K Mar 18  2003 51151
-rw-r--r--  1 awalz  staff   1.4K Mar 18  2003 51152
-rw-r--r--  1 awalz  staff   5.0K Mar 18  2003 51153
-rw-r--r--  1 awalz  staff   1.3K Mar 18  2003 51154
-rw-r--r--  1 awalz  staff   1.6K Mar 18  2003 51155
-rw-r--r--  1 awalz  staff   5.0K Mar 18  2003 51156
-rw-r--r--  1 awalz  staff   1.8K Mar 18  2003 51157
-rw-r--r--  1 awalz  staff   604B Mar 18  2003 51158
-rw-r--r--  1 awalz  staff   1.4K Mar 18  2003 51159
-rw-r--r--  1 awalz  staff   1.0K Mar 18  2003 51160
-rw-r--r--  1 awalz  staff   1.4K Mar 18  2003 51161
-rw-r--r--  1 awalz  staff   2.9K Mar 18  2003 51162
-rw-r--r--  1 awalz  staff   1.1K Mar 18  2003 51163
-rw-r--r--  1 awalz  staff   2.3K Mar 18  2003 51164
-rw-r--r--  1 awalz  staff   4.8K Mar 18  2003 51165
-rw-r--r--  1 awalz  staff   1.2K Mar 18  2003 51169
-rw-r--r--  1 awalz  staff   868B Mar 18  2003 51170
-rw-r--r--  1 awalz  staff   721B Mar 18  2003 51171
-rw-r--r--  1 awalz  staff   3.0K Mar 18  2003 51172
-rw-r--r--  1 awalz  staff   1.9K Mar 18  2003 51173
-rw-r--r--  1 awalz  staff   645B Mar 18  2003 51174
-rw-r--r--  1 awalz  staff   2.4K Mar 18  2003 51175
-rw-r--r--  1 awalz  staff   2.9K Mar 18  2003 51176
-rw-r--r--  1 awalz  staff   1.3K Mar 18  2003 51177
-rw-r--r--  1 awalz  staff   879B Mar 18  2003 51178
-rw-r--r--  1 awalz  staff   1.3K Mar 18  2003 51179
-rw-r--r--  1 awalz  staff   994B Mar 18  2003 51180
-rw-r--r--  1 awalz  staff   1.2K Mar 18  2003 51181
-rw-r--r--  1 awalz  staff   2.2K Mar 18  2003 51182
-rw-r--r--  1 awalz  staff   1.7K Mar 18  2003 51183
-rw-r--r--  1 awalz  staff   1.0K Mar 18  2003 51184
-rw-r--r--  1 awalz  staff   1.2K Mar 18  2003 51185
-rw-r--r--  1 awalz  staff   949B Mar 18  2003 51186
-rw-r--r--  1 awalz  staff   1.9K Mar 18  2003 51187
-rw-r--r--  1 awalz  staff   1.1K Mar 18  2003 51188
-rw-r--r--  1 awalz  staff   834B Mar 18  2003 51189
-rw-r--r--  1 awalz  staff   895B Mar 18  2003 51190
-rw-r--r--  1 awalz  staff   776B Mar 18  2003 51191
-rw-r--r--  1 awalz  staff   1.6K Mar 18  2003 51192
-rw-r--r--  1 awalz  staff   1.8K Mar 18  2003 51193
-rw-r--r--  1 awalz  staff   1.4K Mar 18  2003 51194
-rw-r--r--  1 awalz  staff   964B Mar 18  2003 51195
-rw-r--r--  1 awalz  staff   2.4K Mar 18  2003 51196
-rw-r--r--  1 awalz  staff   759B Mar 18  2003 51197
-rw-r--r--  1 awalz  staff   1.5K Mar 18  2003 51198
-rw-r--r--  1 awalz  staff   1.0K Mar 18  2003 51199
-rw-r--r--  1 awalz  staff   1.9K Mar 18  2003 51200
-rw-r--r--  1 awalz  staff   916B Mar 18  2003 51201
-rw-r--r--  1 awalz  staff   1.9K Mar 18  2003 51202
-rw-r--r--  1 awalz  staff   1.5K Mar 18  2003 51203
-rw-r--r--  1 awalz  staff   846B Mar 18  2003 51204
-rw-r--r--  1 awalz  staff   1.4K Mar 18  2003 51205
-rw-r--r--  1 awalz  staff   881B Mar 18  2003 51206
-rw-r--r--  1 awalz  staff   6.2K Mar 18  2003 51208
-rw-r--r--  1 awalz  staff   1.7K Mar 18  2003 51209
-rw-r--r--  1 awalz  staff   1.7K Mar 18  2003 51210
-rw-r--r--  1 awalz  staff    10K Mar 18  2003 51211
-rw-r--r--  1 awalz  staff   2.5K Mar 18  2003 51212
-rw-r--r--  1 awalz  staff   1.6K Mar 18  2003 51213
-rw-r--r--  1 awalz  staff   636B Mar 18  2003 51214
-rw-r--r--  1 awalz  staff   989B Mar 18  2003 51215
-rw-r--r--  1 awalz  staff   668B Mar 18  2003 51216
-rw-r--r--  1 awalz  staff   2.8K Mar 18  2003 51217
-rw-r--r--  1 awalz  staff   1.7K Mar 18  2003 51218
-rw-r--r--  1 awalz  staff   905B Mar 18  2003 51219
-rw-r--r--  1 awalz  staff   2.4K Mar 18  2003 51220
-rw-r--r--  1 awalz  staff   1.0K Mar 18  2003 51221
-rw-r--r--  1 awalz  staff   1.7K Mar 18  2003 51222
-rw-r--r--  1 awalz  staff   1.0K Mar 18  2003 51223
-rw-r--r--  1 awalz  staff   2.1K Mar 18  2003 51224
-rw-r--r--  1 awalz  staff   1.5K Mar 18  2003 51225
-rw-r--r--  1 awalz  staff   3.4K Mar 18  2003 51226
-rw-r--r--  1 awalz  staff   704B Mar 18  2003 51227
-rw-r--r--  1 awalz  staff   949B Mar 18  2003 51228
-rw-r--r--  1 awalz  staff   714B Mar 18  2003 51229
-rw-r--r--  1 awalz  staff   966B Mar 18  2003 51230
-rw-r--r--  1 awalz  staff   2.9K Mar 18  2003 51231
-rw-r--r--  1 awalz  staff   871B Mar 18  2003 51232
-rw-r--r--  1 awalz  staff   1.0K Mar 18  2003 51233
-rw-r--r--  1 awalz  staff   1.5K Mar 18  2003 51234
-rw-r--r--  1 awalz  staff   2.4K Mar 18  2003 51235
-rw-r--r--  1 awalz  staff   1.2K Mar 18  2003 51236
-rw-r--r--  1 awalz  staff   564B Mar 18  2003 51237
-rw-r--r--  1 awalz  staff    11K Mar 18  2003 51238
-rw-r--r--  1 awalz  staff   1.3K Mar 18  2003 51239
-rw-r--r--  1 awalz  staff   749B Mar 18  2003 51240
-rw-r--r--  1 awalz  staff   932B Mar 18  2003 51241
-rw-r--r--  1 awalz  staff   1.2K Mar 18  2003 51242
-rw-r--r--  1 awalz  staff   2.2K Mar 18  2003 51243
-rw-r--r--  1 awalz  staff   554B Mar 18  2003 51244
-rw-r--r--  1 awalz  staff   1.3K Mar 18  2003 51245
-rw-r--r--  1 awalz  staff   1.7K Mar 18  2003 51246
-rw-r--r--  1 awalz  staff   1.7K Mar 18  2003 51247
-rw-r--r--  1 awalz  staff   1.6K Mar 18  2003 51249
-rw-r--r--  1 awalz  staff   2.8K Mar 18  2003 51250
-rw-r--r--  1 awalz  staff   570B Mar 18  2003 51251
-rw-r--r--  1 awalz  staff   1.8K Mar 18  2003 51252
-rw-r--r--  1 awalz  staff   3.1K Mar 18  2003 51253
-rw-r--r--  1 awalz  staff   2.9K Mar 18  2003 51254
-rw-r--r--  1 awalz  staff   748B Mar 18  2003 51255
-rw-r--r--  1 awalz  staff   2.3K Mar 18  2003 51256
-rw-r--r--  1 awalz  staff   1.2K Mar 18  2003 51258
-rw-r--r--  1 awalz  staff   1.7K Mar 18  2003 51259
-rw-r--r--  1 awalz  staff   6.2K Mar 18  2003 51260
-rw-r--r--  1 awalz  staff   1.6K Mar 18  2003 51261
-rw-r--r--  1 awalz  staff   1.3K Mar 18  2003 51262
-rw-r--r--  1 awalz  staff   1.2K Mar 18  2003 51265
-rw-r--r--  1 awalz  staff   456B Mar 18  2003 51266
-rw-r--r--  1 awalz  staff   816B Mar 18  2003 51267
-rw-r--r--  1 awalz  staff   2.4K Mar 18  2003 51268
-rw-r--r--  1 awalz  staff   1.0K Mar 18  2003 51269
-rw-r--r--  1 awalz  staff   3.4K Mar 18  2003 51270
-rw-r--r--  1 awalz  staff   1.3K Mar 18  2003 51271
-rw-r--r--  1 awalz  staff   2.0K Mar 18  2003 51272
-rw-r--r--  1 awalz  staff   790B Mar 18  2003 51273
-rw-r--r--  1 awalz  staff   1.6K Mar 18  2003 51274
-rw-r--r--  1 awalz  staff   2.5K Mar 18  2003 51275
-rw-r--r--  1 awalz  staff   4.4K Mar 18  2003 51276
-rw-r--r--  1 awalz  staff   1.5K Mar 18  2003 51277
-rw-r--r--  1 awalz  staff   6.2K Mar 18  2003 51278
-rw-r--r--  1 awalz  staff   963B Mar 18  2003 51279
-rw-r--r--  1 awalz  staff   2.0K Mar 18  2003 51280
-rw-r--r--  1 awalz  staff   1.1K Mar 18  2003 51281
-rw-r--r--  1 awalz  staff   618B Mar 18  2003 51282
-rw-r--r--  1 awalz  staff   2.7K Mar 18  2003 51283
-rw-r--r--  1 awalz  staff   1.0K Mar 18  2003 51284
-rw-r--r--  1 awalz  staff   1.3K Mar 18  2003 51285
-rw-r--r--  1 awalz  staff   601B Mar 18  2003 51286
-rw-r--r--  1 awalz  staff   751B Mar 18  2003 51287
-rw-r--r--  1 awalz  staff   1.3K Mar 18  2003 51288
-rw-r--r--  1 awalz  staff   8.0K Mar 18  2003 51290
-rw-r--r--  1 awalz  staff   1.0K Mar 18  2003 51291
-rw-r--r--  1 awalz  staff   2.9K Mar 18  2003 51292
-rw-r--r--  1 awalz  staff   1.2K Mar 18  2003 51293
-rw-r--r--  1 awalz  staff   1.8K Mar 18  2003 51294
-rw-r--r--  1 awalz  staff   1.9K Mar 18  2003 51295
-rw-r--r--  1 awalz  staff   1.7K Mar 18  2003 51296
-rw-r--r--  1 awalz  staff   4.2K Mar 18  2003 51297
-rw-r--r--  1 awalz  staff   2.6K Mar 18  2003 51298
-rw-r--r--  1 awalz  staff   2.2K Mar 18  2003 51299
-rw-r--r--  1 awalz  staff   2.3K Mar 18  2003 51300
-rw-r--r--  1 awalz  staff   6.3K Mar 18  2003 51301
-rw-r--r--  1 awalz  staff   1.3K Mar 18  2003 51302
-rw-r--r--  1 awalz  staff   1.9K Mar 18  2003 51303
-rw-r--r--  1 awalz  staff    10K Mar 18  2003 51304
-rw-r--r--  1 awalz  staff   1.5K Mar 18  2003 51305
-rw-r--r--  1 awalz  staff   1.4K Mar 18  2003 51306
-rw-r--r--  1 awalz  staff   4.1K Mar 18  2003 51307
-rw-r--r--  1 awalz  staff   6.2K Mar 18  2003 51308
-rw-r--r--  1 awalz  staff   1.0K Mar 18  2003 51309
-rw-r--r--  1 awalz  staff   768B Mar 18  2003 51310
-rw-r--r--  1 awalz  staff   1.0K Mar 18  2003 51311
-rw-r--r--  1 awalz  staff   930B Mar 18  2003 51312
-rw-r--r--  1 awalz  staff   771B Mar 18  2003 51313
-rw-r--r--  1 awalz  staff   670B Mar 18  2003 51314
-rw-r--r--  1 awalz  staff   1.1K Mar 18  2003 51315
-rw-r--r--  1 awalz  staff   3.7K Mar 18  2003 51316
-rw-r--r--  1 awalz  staff   406B Mar 18  2003 51317
-rw-r--r--  1 awalz  staff   5.4K Mar 18  2003 51318
-rw-r--r--  1 awalz  staff   9.6K Mar 18  2003 51319
-rw-r--r--  1 awalz  staff   2.1K Mar 18  2003 51320
-rw-r--r--  1 awalz  staff    29K Mar 18  2003 52499
-rw-r--r--  1 awalz  staff    25K Mar 18  2003 52909
-rw-r--r--  1 awalz  staff   5.8K Mar 18  2003 52910
-rw-r--r--  1 awalz  staff   819B Mar 18  2003 53055
-rw-r--r--  1 awalz  staff   857B Mar 18  2003 53056
-rw-r--r--  1 awalz  staff   755B Mar 18  2003 53057
-rw-r--r--  1 awalz  staff   4.4K Mar 18  2003 53058
-rw-r--r--  1 awalz  staff   2.1K Mar 18  2003 53059
-rw-r--r--  1 awalz  staff   1.1K Mar 18  2003 53062
-rw-r--r--  1 awalz  staff   1.6K Mar 18  2003 53064
-rw-r--r--  1 awalz  staff   515B Mar 18  2003 53065
-rw-r--r--  1 awalz  staff   9.2K Mar 18  2003 53066
-rw-r--r--  1 awalz  staff   2.4K Mar 18  2003 53067
-rw-r--r--  1 awalz  staff   610B Mar 18  2003 53069
-rw-r--r--  1 awalz  staff   759B Mar 18  2003 53070
-rw-r--r--  1 awalz  staff   2.3K Mar 18  2003 53071
-rw-r--r--  1 awalz  staff   1.5K Mar 18  2003 53072
-rw-r--r--  1 awalz  staff   1.9K Mar 18  2003 53073
-rw-r--r--  1 awalz  staff   2.1K Mar 18  2003 53075
-rw-r--r--  1 awalz  staff   411B Mar 18  2003 53078
-rw-r--r--  1 awalz  staff   1.3K Mar 18  2003 53081
-rw-r--r--  1 awalz  staff   962B Mar 18  2003 53082
-rw-r--r--  1 awalz  staff   1.3K Mar 18  2003 53083
-rw-r--r--  1 awalz  staff   2.0K Mar 18  2003 53085
-rw-r--r--  1 awalz  staff   1.1K Mar 18  2003 53086
-rw-r--r--  1 awalz  staff   247B Mar 18  2003 53087
-rw-r--r--  1 awalz  staff   3.8K Mar 18  2003 53090
-rw-r--r--  1 awalz  staff   1.1K Mar 18  2003 53093
-rw-r--r--  1 awalz  staff   1.1K Mar 18  2003 53094
-rw-r--r--  1 awalz  staff   1.0K Mar 18  2003 53095
-rw-r--r--  1 awalz  staff   863B Mar 18  2003 53096
-rw-r--r--  1 awalz  staff   1.1K Mar 18  2003 53097
-rw-r--r--  1 awalz  staff   1.2K Mar 18  2003 53098
-rw-r--r--  1 awalz  staff   1.1K Mar 18  2003 53099
-rw-r--r--  1 awalz  staff   2.0K Mar 18  2003 53106
-rw-r--r--  1 awalz  staff   784B Mar 18  2003 53108
-rw-r--r--  1 awalz  staff   2.3K Mar 18  2003 53110
-rw-r--r--  1 awalz  staff   712B Mar 18  2003 53111
-rw-r--r--  1 awalz  staff   2.4K Mar 18  2003 53112
-rw-r--r--  1 awalz  staff   2.6K Mar 18  2003 53113
-rw-r--r--  1 awalz  staff   1.7K Mar 18  2003 53114
-rw-r--r--  1 awalz  staff   1.5K Mar 18  2003 53117
-rw-r--r--  1 awalz  staff   2.8K Mar 18  2003 53118
-rw-r--r--  1 awalz  staff   4.1K Mar 18  2003 53120
-rw-r--r--  1 awalz  staff   1.8K Mar 18  2003 53121
-rw-r--r--  1 awalz  staff   2.4K Mar 18  2003 53122
-rw-r--r--  1 awalz  staff   1.2K Mar 18  2003 53123
-rw-r--r--  1 awalz  staff   3.4K Mar 18  2003 53124
-rw-r--r--  1 awalz  staff   1.8K Mar 18  2003 53125
-rw-r--r--  1 awalz  staff   1.2K Mar 18  2003 53126
-rw-r--r--  1 awalz  staff   826B Mar 18  2003 53127
-rw-r--r--  1 awalz  staff   958B Mar 18  2003 53130
-rw-r--r--  1 awalz  staff   1.5K Mar 18  2003 53131
-rw-r--r--  1 awalz  staff   1.0K Mar 18  2003 53132
-rw-r--r--  1 awalz  staff   640B Mar 18  2003 53133
-rw-r--r--  1 awalz  staff   1.3K Mar 18  2003 53134
-rw-r--r--  1 awalz  staff   2.1K Mar 18  2003 53135
-rw-r--r--  1 awalz  staff   4.2K Mar 18  2003 53136
-rw-r--r--  1 awalz  staff   4.8K Mar 18  2003 53137
-rw-r--r--  1 awalz  staff   1.3K Mar 18  2003 53139
-rw-r--r--  1 awalz  staff   3.0K Mar 18  2003 53140
-rw-r--r--  1 awalz  staff   2.1K Mar 18  2003 53141
-rw-r--r--  1 awalz  staff   456B Mar 18  2003 53142
-rw-r--r--  1 awalz  staff   760B Mar 18  2003 53143
-rw-r--r--  1 awalz  staff   768B Mar 18  2003 53144
-rw-r--r--  1 awalz  staff   1.0K Mar 18  2003 53145
-rw-r--r--  1 awalz  staff   1.2K Mar 18  2003 53149
-rw-r--r--  1 awalz  staff   2.1K Mar 18  2003 53150
-rw-r--r--  1 awalz  staff   1.3K Mar 18  2003 53151
-rw-r--r--  1 awalz  staff   1.9K Mar 18  2003 53153
-rw-r--r--  1 awalz  staff   1.2K Mar 18  2003 53154
-rw-r--r--  1 awalz  staff   1.2K Mar 18  2003 53157
-rw-r--r--  1 awalz  staff   2.0K Mar 18  2003 53158
-rw-r--r--  1 awalz  staff   819B Mar 18  2003 53159
-rw-r--r--  1 awalz  staff   1.9K Mar 18  2003 53160
-rw-r--r--  1 awalz  staff   3.5K Mar 18  2003 53161
-rw-r--r--  1 awalz  staff   1.5K Mar 18  2003 53162
-rw-r--r--  1 awalz  staff   1.9K Mar 18  2003 53163
-rw-r--r--  1 awalz  staff   2.2K Mar 18  2003 53164
-rw-r--r--  1 awalz  staff   1.1K Mar 18  2003 53165
-rw-r--r--  1 awalz  staff   684B Mar 18  2003 53166
-rw-r--r--  1 awalz  staff   443B Mar 18  2003 53167
-rw-r--r--  1 awalz  staff   1.2K Mar 18  2003 53168
-rw-r--r--  1 awalz  staff   1.4K Mar 18  2003 53170
-rw-r--r--  1 awalz  staff   2.5K Mar 18  2003 53171
-rw-r--r--  1 awalz  staff   785B Mar 18  2003 53172
-rw-r--r--  1 awalz  staff   1.1K Mar 18  2003 53173
-rw-r--r--  1 awalz  staff   1.3K Mar 18  2003 53174
-rw-r--r--  1 awalz  staff   737B Mar 18  2003 53175
-rw-r--r--  1 awalz  staff   1.3K Mar 18  2003 53176
-rw-r--r--  1 awalz  staff   1.8K Mar 18  2003 53177
-rw-r--r--  1 awalz  staff   2.2K Mar 18  2003 53178
-rw-r--r--  1 awalz  staff   1.6K Mar 18  2003 53179
-rw-r--r--  1 awalz  staff   2.1K Mar 18  2003 53180
-rw-r--r--  1 awalz  staff   3.2K Mar 18  2003 53181
-rw-r--r--  1 awalz  staff   1.2K Mar 18  2003 53182
-rw-r--r--  1 awalz  staff   1.4K Mar 18  2003 53183
-rw-r--r--  1 awalz  staff   1.7K Mar 18  2003 53184
-rw-r--r--  1 awalz  staff   2.6K Mar 18  2003 53185
-rw-r--r--  1 awalz  staff   3.0K Mar 18  2003 53186
-rw-r--r--  1 awalz  staff   665B Mar 18  2003 53187
-rw-r--r--  1 awalz  staff   2.0K Mar 18  2003 53188
-rw-r--r--  1 awalz  staff   1.3K Mar 18  2003 53190
-rw-r--r--  1 awalz  staff   1.9K Mar 18  2003 53191
-rw-r--r--  1 awalz  staff   1.8K Mar 18  2003 53192
-rw-r--r--  1 awalz  staff   1.4K Mar 18  2003 53193
-rw-r--r--  1 awalz  staff   792B Mar 18  2003 53194
-rw-r--r--  1 awalz  staff   2.0K Mar 18  2003 53195
-rw-r--r--  1 awalz  staff   1.6K Mar 18  2003 53196
-rw-r--r--  1 awalz  staff   2.6K Mar 18  2003 53197
-rw-r--r--  1 awalz  staff   1.1K Mar 18  2003 53198
-rw-r--r--  1 awalz  staff   1.4K Mar 18  2003 53199
-rw-r--r--  1 awalz  staff   1.3K Mar 18  2003 53201
-rw-r--r--  1 awalz  staff   1.3K Mar 18  2003 53203
-rw-r--r--  1 awalz  staff   3.7K Mar 18  2003 53208
-rw-r--r--  1 awalz  staff   1.1K Mar 18  2003 53209
-rw-r--r--  1 awalz  staff   1.5K Mar 18  2003 53210
-rw-r--r--  1 awalz  staff   2.7K Mar 18  2003 53211
-rw-r--r--  1 awalz  staff   1.4K Mar 18  2003 53212
-rw-r--r--  1 awalz  staff   2.3K Mar 18  2003 53213
-rw-r--r--  1 awalz  staff   1.9K Mar 18  2003 53214
-rw-r--r--  1 awalz  staff   919B Mar 18  2003 53215
-rw-r--r--  1 awalz  staff   868B Mar 18  2003 53216
-rw-r--r--  1 awalz  staff   2.3K Mar 18  2003 53217
-rw-r--r--  1 awalz  staff   1.0K Mar 18  2003 53218
-rw-r--r--  1 awalz  staff   1.1K Mar 18  2003 53219
-rw-r--r--  1 awalz  staff   640B Mar 18  2003 53220
-rw-r--r--  1 awalz  staff   1.1K Mar 18  2003 53221
-rw-r--r--  1 awalz  staff   2.0K Mar 18  2003 53222
-rw-r--r--  1 awalz  staff   2.0K Mar 18  2003 53223
-rw-r--r--  1 awalz  staff   3.4K Mar 18  2003 53224
-rw-r--r--  1 awalz  staff   808B Mar 18  2003 53225
-rw-r--r--  1 awalz  staff   1.0K Mar 18  2003 53226
-rw-r--r--  1 awalz  staff   640B Mar 18  2003 53228
-rw-r--r--  1 awalz  staff   856B Mar 18  2003 53229
-rw-r--r--  1 awalz  staff   967B Mar 18  2003 53230
-rw-r--r--  1 awalz  staff   781B Mar 18  2003 53231
-rw-r--r--  1 awalz  staff   1.2K Mar 18  2003 53232
-rw-r--r--  1 awalz  staff   2.2K Mar 18  2003 53235
-rw-r--r--  1 awalz  staff   1.7K Mar 18  2003 53237
-rw-r--r--  1 awalz  staff   2.2K Mar 18  2003 53238
-rw-r--r--  1 awalz  staff   2.4K Mar 18  2003 53239
-rw-r--r--  1 awalz  staff   1.2K Mar 18  2003 53240
-rw-r--r--  1 awalz  staff   1.3K Mar 18  2003 53243
-rw-r--r--  1 awalz  staff   1.3K Mar 18  2003 53248
-rw-r--r--  1 awalz  staff   1.4K Mar 18  2003 53249
-rw-r--r--  1 awalz  staff   1.8K Mar 18  2003 53250
-rw-r--r--  1 awalz  staff   1.5K Mar 18  2003 53251
-rw-r--r--  1 awalz  staff   1.4K Mar 18  2003 53252
-rw-r--r--  1 awalz  staff   1.2K Mar 18  2003 53256
-rw-r--r--  1 awalz  staff   806B Mar 18  2003 53258
-rw-r--r--  1 awalz  staff   4.2K Mar 18  2003 53266
-rw-r--r--  1 awalz  staff   3.5K Mar 18  2003 53267
-rw-r--r--  1 awalz  staff   1.8K Mar 18  2003 53269
-rw-r--r--  1 awalz  staff   3.2K Mar 18  2003 53271
-rw-r--r--  1 awalz  staff   1.3K Mar 18  2003 53274
-rw-r--r--  1 awalz  staff   2.1K Mar 18  2003 53275
-rw-r--r--  1 awalz  staff   2.0K Mar 18  2003 53281
-rw-r--r--  1 awalz  staff   958B Mar 18  2003 53282
-rw-r--r--  1 awalz  staff   3.2K Mar 18  2003 53283
-rw-r--r--  1 awalz  staff   872B Mar 18  2003 53284
-rw-r--r--  1 awalz  staff   387B Mar 18  2003 53285
-rw-r--r--  1 awalz  staff   3.1K Mar 18  2003 53286
-rw-r--r--  1 awalz  staff   3.5K Mar 18  2003 53287
-rw-r--r--  1 awalz  staff   2.6K Mar 18  2003 53288
-rw-r--r--  1 awalz  staff   956B Mar 18  2003 53289
-rw-r--r--  1 awalz  staff   1.6K Mar 18  2003 53290
-rw-r--r--  1 awalz  staff    10K Mar 18  2003 53292
-rw-r--r--  1 awalz  staff   5.4K Mar 18  2003 53298
-rw-r--r--  1 awalz  staff   945B Mar 18  2003 53303
-rw-r--r--  1 awalz  staff   1.2K Mar 18  2003 53304
-rw-r--r--  1 awalz  staff   1.5K Mar 18  2003 53305
-rw-r--r--  1 awalz  staff   1.4K Mar 18  2003 53306
-rw-r--r--  1 awalz  staff   590B Mar 18  2003 53307
-rw-r--r--  1 awalz  staff   663B Mar 18  2003 53308
-rw-r--r--  1 awalz  staff   907B Mar 18  2003 53309
-rw-r--r--  1 awalz  staff   1.0K Mar 18  2003 53311
-rw-r--r--  1 awalz  staff   1.5K Mar 18  2003 53312
-rw-r--r--  1 awalz  staff   576B Mar 18  2003 53314
-rw-r--r--  1 awalz  staff    15K Mar 18  2003 53323
-rw-r--r--  1 awalz  staff   1.0K Mar 18  2003 53334
-rw-r--r--  1 awalz  staff   783B Mar 18  2003 53347
-rw-r--r--  1 awalz  staff   5.8K Mar 18  2003 53351
-rw-r--r--  1 awalz  staff   1.6K Mar 18  2003 53366
-rw-r--r--  1 awalz  staff   698B Mar 18  2003 53370
-rw-r--r--  1 awalz  staff   600B Mar 18  2003 53371
-rw-r--r--  1 awalz  staff   5.6K Mar 18  2003 53373
-rw-r--r--  1 awalz  staff   1.8K Mar 18  2003 53374
-rw-r--r--  1 awalz  staff   1.1K Mar 18  2003 53375
-rw-r--r--  1 awalz  staff   849B Mar 18  2003 53376
-rw-r--r--  1 awalz  staff   621B Mar 18  2003 53377
-rw-r--r--  1 awalz  staff   270B Mar 18  2003 53380
-rw-r--r--  1 awalz  staff   1.1K Mar 18  2003 53381
-rw-r--r--  1 awalz  staff   2.2K Mar 18  2003 53382
-rw-r--r--  1 awalz  staff   1.6K Mar 18  2003 53383
-rw-r--r--  1 awalz  staff   1.6K Mar 18  2003 53387
-rw-r--r--  1 awalz  staff   759B Mar 18  2003 53389
-rw-r--r--  1 awalz  staff   396B Mar 18  2003 53390
-rw-r--r--  1 awalz  staff   669B Mar 18  2003 53391
-rw-r--r--  1 awalz  staff   1.8K Mar 18  2003 53434
-rw-r--r--  1 awalz  staff   1.6K Mar 18  2003 53435
-rw-r--r--  1 awalz  staff   708B Mar 18  2003 53436
-rw-r--r--  1 awalz  staff   887B Mar 18  2003 53437
-rw-r--r--  1 awalz  staff   838B Mar 18  2003 53438
-rw-r--r--  1 awalz  staff   1.4K Mar 18  2003 53439
-rw-r--r--  1 awalz  staff   1.3K Mar 18  2003 53440
-rw-r--r--  1 awalz  staff   384B Mar 18  2003 53441
-rw-r--r--  1 awalz  staff   857B Mar 18  2003 53442
-rw-r--r--  1 awalz  staff   1.6K Mar 18  2003 53443
-rw-r--r--  1 awalz  staff   1.4K Mar 18  2003 53445
-rw-r--r--  1 awalz  staff   1.3K Mar 18  2003 53449
-rw-r--r--  1 awalz  staff   2.4K Mar 18  2003 53459
-rw-r--r--  1 awalz  staff   1.4K Mar 18  2003 53460
-rw-r--r--  1 awalz  staff   1.0K Mar 18  2003 53465
-rw-r--r--  1 awalz  staff   1.3K Mar 18  2003 53466
-rw-r--r--  1 awalz  staff   1.0K Mar 18  2003 53467
-rw-r--r--  1 awalz  staff   1.4K Mar 18  2003 53468
-rw-r--r--  1 awalz  staff   1.1K Mar 18  2003 53471
-rw-r--r--  1 awalz  staff   1.9K Mar 18  2003 53477
-rw-r--r--  1 awalz  staff   718B Mar 18  2003 53478
-rw-r--r--  1 awalz  staff   781B Mar 18  2003 53483
-rw-r--r--  1 awalz  staff   1.6K Mar 18  2003 53509
-rw-r--r--  1 awalz  staff   910B Mar 18  2003 53510
-rw-r--r--  1 awalz  staff   781B Mar 18  2003 53512
-rw-r--r--  1 awalz  staff   1.8K Mar 18  2003 53515
-rw-r--r--  1 awalz  staff   2.1K Mar 18  2003 53518
-rw-r--r--  1 awalz  staff    50K Mar 18  2003 53519
-rw-r--r--  1 awalz  staff   6.0K Mar 18  2003 53521
-rw-r--r--  1 awalz  staff   1.0K Mar 18  2003 53522
-rw-r--r--  1 awalz  staff   2.8K Mar 18  2003 53523
-rw-r--r--  1 awalz  staff   338B Mar 18  2003 53524
-rw-r--r--  1 awalz  staff   1.4K Mar 18  2003 53525
-rw-r--r--  1 awalz  staff   489B Mar 18  2003 53526
-rw-r--r--  1 awalz  staff   2.6K Mar 18  2003 53527
-rw-r--r--  1 awalz  staff   2.4K Mar 18  2003 53528
-rw-r--r--  1 awalz  staff   228B Mar 18  2003 53529
-rw-r--r--  1 awalz  staff   1.1K Mar 18  2003 53531
-rw-r--r--  1 awalz  staff   1.3K Mar 18  2003 53532
-rw-r--r--  1 awalz  staff   1.2K Mar 18  2003 53533
-rw-r--r--  1 awalz  staff   356B Mar 18  2003 53534
-rw-r--r--  1 awalz  staff   614B Mar 18  2003 53535
-rw-r--r--  1 awalz  staff   895B Mar 18  2003 53571
-rw-r--r--  1 awalz  staff   1.0K Mar 18  2003 53572
-rw-r--r--  1 awalz  staff   697B Mar 18  2003 53573
-rw-r--r--  1 awalz  staff   1.0K Mar 18  2003 53574
-rw-r--r--  1 awalz  staff   1.8K Mar 18  2003 53654
-rw-r--r--  1 awalz  staff   2.3K Mar 18  2003 53655
-rw-r--r--  1 awalz  staff   2.5K Mar 18  2003 53656
-rw-r--r--  1 awalz  staff   2.1K Mar 18  2003 53660
-rw-r--r--  1 awalz  staff   6.8K Mar 18  2003 53661
-rw-r--r--  1 awalz  staff   1.8K Mar 18  2003 53753
-rw-r--r--  1 awalz  staff   698B Mar 18  2003 53754
-rw-r--r--  1 awalz  staff   779B Mar 18  2003 53755
-rw-r--r--  1 awalz  staff   3.9K Mar 18  2003 53756
-rw-r--r--  1 awalz  staff   1.3K Mar 18  2003 53757
-rw-r--r--  1 awalz  staff   2.2K Mar 18  2003 53758
-rw-r--r--  1 awalz  staff   745B Mar 18  2003 53759
-rw-r--r--  1 awalz  staff   1.9K Mar 18  2003 53760
-rw-r--r--  1 awalz  staff   592B Mar 18  2003 53761
-rw-r--r--  1 awalz  staff   658B Mar 18  2003 53762
-rw-r--r--  1 awalz  staff   756B Mar 18  2003 53763
-rw-r--r--  1 awalz  staff   2.7K Mar 18  2003 53764
-rw-r--r--  1 awalz  staff   1.1K Mar 18  2003 53765
-rw-r--r--  1 awalz  staff   906B Mar 18  2003 53766
-rw-r--r--  1 awalz  staff   535B Mar 18  2003 53780
-rw-r--r--  1 awalz  staff   1.3K Mar 18  2003 53785
-rw-r--r--  1 awalz  staff   2.3K Mar 18  2003 54165
-rw-r--r--  1 awalz  staff   2.8K Mar 18  2003 54166
-rw-r--r--  1 awalz  staff   547B Mar 18  2003 54167
-rw-r--r--  1 awalz  staff   2.4K Mar 18  2003 54168
-rw-r--r--  1 awalz  staff   4.7K Mar 18  2003 54178
-rw-r--r--  1 awalz  staff   1.8K Mar 18  2003 54179
-rw-r--r--  1 awalz  staff   4.4K Mar 18  2003 54180
-rw-r--r--  1 awalz  staff   1.3K Mar 18  2003 54181
-rw-r--r--  1 awalz  staff   3.0K Mar 18  2003 54182
-rw-r--r--  1 awalz  staff   1.4K Mar 18  2003 54198
-rw-r--r--  1 awalz  staff   1.8K Mar 18  2003 54199
-rw-r--r--  1 awalz  staff   2.5K Mar 18  2003 54200
-rw-r--r--  1 awalz  staff   1.7K Mar 18  2003 54201
-rw-r--r--  1 awalz  staff   1.0K Mar 18  2003 54202
-rw-r--r--  1 awalz  staff   1.2K Mar 18  2003 54203
-rw-r--r--  1 awalz  staff   565B Mar 18  2003 54204
-rw-r--r--  1 awalz  staff   641B Mar 18  2003 54227
-rw-r--r--  1 awalz  staff   1.0K Mar 18  2003 54228
-rw-r--r--  1 awalz  staff   877B Mar 18  2003 54470
-rw-r--r--  1 awalz  staff   1.0K Mar 18  2003 54471
-rw-r--r--  1 awalz  staff   993B Mar 18  2003 54472
-rw-r--r--  1 awalz  staff   434B Mar 18  2003 54473

The load_files function can load text files from a two-level folder structure, assuming the folder names represent the categories:

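As a schematic sketch, the expected layout looks like this (the container folder name is arbitrary; each sub-folder name becomes an entry of target_names and every file inside it becomes one sample):

container_folder/
    category_1_folder/
        file_1.txt
        file_2.txt
        ...
    category_2_folder/
        file_3.txt
        ...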

In [ ]:
#print(load_files.__doc__)

In [6]:
all_twenty_train = load_files('../datasets/20news-bydate-train/',
    encoding='latin-1', random_state=42)
all_twenty_test = load_files('../datasets/20news-bydate-test/',
    encoding='latin-1', random_state=42)

In [7]:
all_target_names = all_twenty_train.target_names
all_target_names


Out[7]:
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [8]:
all_twenty_train.target


Out[8]:
array([12,  6,  9, ...,  9,  1, 12])

In [9]:
all_twenty_train.target.shape


Out[9]:
(11314,)

In [10]:
all_twenty_test.target.shape


Out[10]:
(7532,)

In [11]:
len(all_twenty_train.data)


Out[11]:
11314

In [12]:
type(all_twenty_train.data[0])


Out[12]:
unicode

In [13]:
def display_sample(i, dataset):
    print("Class name: " + dataset.target_names[dataset.target[i]])
    print("Text content:\n")
    print(dataset.data[i])

In [14]:
display_sample(0, all_twenty_train)


Class name: sci.electronics
Text content:

From: wtm@uhura.neoucom.edu (Bill Mayhew)
Subject: Re: How to the disks copy protected.
Organization: Northeastern Ohio Universities College of Medicine
Lines: 23

Write a good manual to go with the software.  The hassle of
photocopying the manual is offset by simplicity of purchasing
the package for only $15.  Also, consider offering an inexpensive
but attractive perc for registered users.  For instance, a coffee
mug.  You could produce and mail the incentive for a couple of
dollars, so consider pricing the product at $17.95.

You're lucky if only 20% of the instances of your program in use
are non-licensed users.

The best approach is to estimate your loss and accomodate that into
your price structure.  Sure it hurts legitimate users, but too bad.
Retailers have to charge off loss to shoplifters onto paying
customers; the software industry is the same.

Unless your product is exceptionally unique, using an ostensibly
copy-proof disk will just send your customers to the competetion.


-- 
Bill Mayhew      NEOUCOM Computer Services Department
Rootstown, OH  44272-9995  USA    phone: 216-325-2511
wtm@uhura.neoucom.edu (140.220.1.1)    146.580: N8WED


In [15]:
display_sample(1, all_twenty_train)


Class name: misc.forsale
Text content:

From: andy@SAIL.Stanford.EDU (Andy Freeman)
Subject: Re: Catalog of Hard-to-Find PC Enhancements (Repost)
Organization: Computer Science Department,  Stanford University.
Lines: 33

>andy@SAIL.Stanford.EDU (Andy Freeman) writes:
>> >In article <C5ELME.4z4@unix.portal.com> jdoll@shell.portal.com (Joe Doll) wr
>> >>   "The Catalog of Personal Computing Tools for Engineers and Scien-
>> >>   tists" lists hardware cards and application software packages for 
>> >>   PC/XT/AT/PS/2 class machines.  Focus is on engineering and scien-
>> >>   tific applications of PCs, such as data acquisition/control, 
>> >>   design automation, and data analysis and presentation.  
>> >
>> >>   If you would like a free copy, reply with your (U. S. Postal) 
>> >>   mailing address.
>> 
>> Don't bother - it never comes.  It's a cheap trick for building a
>> mailing list to sell if my junk mail flow is any indication.
>> 
>> -andy sent his address months ago
>
>Perhaps we can get Portal to nuke this weasal.  I never received a 
>catalog either.  If that person doesn't respond to a growing flame, then 
>we can assume that we'yall look forward to lotsa junk mail.

I don't want him nuked, I want him to be honest.  The junk mail has
been much more interesting than the promised catalog.  If I'd known
what I was going to get, I wouldn't have hesitated.  I wouldn't be
surprised if there were other folks who looked at the ad and said
"nope" but who would be very interested in the junk mail that results.
Similarly, there are people who wanted the advertised catalog who
aren't happy with the junk they got instead.

The folks buying the mailing lists would prefer an honest ad, and
so would the people reading it.

-andy
--

Let's compute the (uncompressed, in-memory) size of the training and test sets in megabits, counting 8 bits per character (in this case, all chars can be encoded using the latin-1 charset).


In [16]:
def text_size(text, charset='iso-8859-1'):
    # 1 byte per char with an 8-bit charset; * 8 converts bytes to bits
    # and * 1e-6 scales the total to millions of bits (megabits)
    return len(text.encode(charset)) * 8 * 1e-6

train_size_mb = sum(text_size(text) for text in all_twenty_train.data)
test_size_mb = sum(text_size(text) for text in all_twenty_test.data)

print("Training set size: {0} Mb".format(int(train_size_mb)))
print("Testing set size: {0} Mb".format(int(test_size_mb)))


Training set size: 176 Mb
Testing set size: 110 Mb

If we only consider the small subset made of the 4 categories selected in the initial example:


In [17]:
train_small_size_mb = sum(text_size(text) for text in twenty_train_small.data) 
test_small_size_mb = sum(text_size(text) for text in twenty_test_small.data)

print("Training set size: {0} Mb".format(int(train_small_size_mb)))
print("Testing set size: {0} Mb".format(int(test_small_size_mb)))


Training set size: 31 Mb
Testing set size: 22 Mb

Extracting Text Features


In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer

TfidfVectorizer()


Out[18]:
TfidfVectorizer(analyzer=u'word', binary=False, charset=None,
        charset_error=None, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=u'l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [19]:
vectorizer = TfidfVectorizer(min_df=1)

%time X_train_small = vectorizer.fit_transform(twenty_train_small.data)


CPU times: user 635 ms, sys: 29.5 ms, total: 665 ms
Wall time: 658 ms

The result is not a numpy array but a scipy.sparse matrix. This data structure is quite similar to a 2D numpy array, but it does not store the zeros.


In [20]:
X_train_small


Out[20]:
<2034x34118 sparse matrix of type '<type 'numpy.float64'>'
	with 323433 stored elements in Compressed Sparse Row format>
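
Out of the 2034 × 34118 possible entries, only 323433 are stored: less than 0.5% of the matrix is non-zero. A quick sketch to compute that density explicitly:

# Fraction of cells that hold a non-zero value
density = X_train_small.nnz / float(np.prod(X_train_small.shape))
print("density: {0:.4f}".format(density))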

scipy.sparse matrices also have a shape attribute to access the dimensions:


In [21]:
n_samples, n_features = X_train_small.shape

This dataset has around 2000 samples (the rows of the data matrix):


In [22]:
n_samples


Out[22]:
2034

This is the same value as the number of strings in the original list of text documents:


In [23]:
len(twenty_train_small.data)


Out[23]:
2034

The columns represent the individual token occurrences:


In [24]:
n_features


Out[24]:
34118

This number is the size of the vocabulary extracted by the model during fit and stored in a Python dictionary:


In [25]:
type(vectorizer.vocabulary_)


Out[25]:
dict

In [26]:
len(vectorizer.vocabulary_)


Out[26]:
34118
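
Each key of the dictionary is a token string and each value is its (fixed but arbitrary) column index in the feature matrix. A quick sketch, assuming the token "space" occurs somewhere in this corpus:

# Column index assigned to a token; .get returns None for out-of-vocabulary tokens
print(vectorizer.vocabulary_.get(u'space'))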

The keys of the vocabulary_ attribute are also called feature names and can be accessed as a list of strings.


In [27]:
len(vectorizer.get_feature_names())


Out[27]:
34118

Here are the first 10 elements (sorted in lexicographical order):


In [28]:
vectorizer.get_feature_names()[:10]


Out[28]:
[u'00',
 u'000',
 u'0000',
 u'00000',
 u'000000',
 u'000005102000',
 u'000021',
 u'000062david42',
 u'0000vec',
 u'0001']

Let's have a look at the features from the middle:


In [29]:
vectorizer.get_feature_names()[n_features / 2:n_features / 2 + 10]


Out[29]:
[u'inadequate',
 u'inala',
 u'inalienable',
 u'inane',
 u'inanimate',
 u'inapplicable',
 u'inappropriate',
 u'inappropriately',
 u'inaudible',
 u'inbreeding']

Now that we have extracted a vector representation of the data, it's a good idea to project it onto the first 2 dimensions of a Principal Component Analysis-style decomposition to get a feel for the data. Note that the TruncatedSVD class used here can accept scipy.sparse matrices as input (as an alternative to numpy arrays):


In [30]:
from sklearn.decomposition import TruncatedSVD

%time X_train_small_pca = TruncatedSVD(n_components=2).fit_transform(X_train_small)


CPU times: user 101 ms, sys: 16 ms, total: 117 ms
Wall time: 112 ms

In [31]:
from itertools import cycle

colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k']
for i, c in zip(np.unique(y_train), cycle(colors)):
    plt.scatter(X_train_small_pca[y_train == i, 0],
               X_train_small_pca[y_train == i, 1],
               c=c, label=twenty_train_small.target_names[i], alpha=0.5)
    
_ = plt.legend(loc='best')


We can observe that there is a large overlap of the samples from different categories. This is to be expected, as the linear projection maps the data from a 34118-dimensional space down to only 2 dimensions: data that is linearly separable in 34118 dimensions is often no longer linearly separable in 2D.

Still, we can notice an interesting pattern: the newsgroups on religion and atheism occupy much the same region, and the computer graphics and space science newsgroups overlap more with each other than they do with the religion or atheism newsgroups.

Training a Classifier on Text Features

We have previously extracted a vector representation of the training corpus and stored it in a variable named X_train_small. To train a supervised model, in this case a classifier, we also need the matching array of target labels (the category index of each post):


In [32]:
y_train_small = twenty_train_small.target

In [33]:
y_train_small.shape


Out[33]:
(2034,)

In [34]:
y_train_small


Out[34]:
array([1, 2, 2, ..., 2, 1, 1])

We can check that we have the same number of samples for the input data and the labels:


In [35]:
X_train_small.shape[0] == y_train_small.shape[0]


Out[35]:
True

We can now train a classifier, for instance a Multinomial Naive Bayes classifier:


In [36]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB(alpha=0.1)
clf


Out[36]:
MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)

In [37]:
clf.fit(X_train_small, y_train_small)


Out[37]:
MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)

We can now evaluate the classifier on the testing set. Let's first use the built-in score method, which computes the rate of correct classification (the accuracy) on the test set:


In [38]:
X_test_small = vectorizer.transform(twenty_test_small.data)
y_test_small = twenty_test_small.target

In [39]:
X_test_small.shape


Out[39]:
(1353, 34118)

In [40]:
y_test_small.shape


Out[40]:
(1353,)

In [41]:
clf.score(X_test_small, y_test_small)


Out[41]:
0.89652623798965259
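
The score method simply computes the accuracy; here is a minimal sketch checking it against an explicit computation (the predicted_small variable name is just for this check):

# Accuracy = fraction of test documents assigned their true category
predicted_small = clf.predict(X_test_small)
print(np.mean(predicted_small == y_test_small))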

We can also compute the score on the training set and observe that the model is both overfitting and underfitting a bit at the same time:


In [42]:
clf.score(X_train_small, y_train_small)


Out[42]:
0.99262536873156337

Introspecting the Behavior of the Text Vectorizer

The text vectorizer has many parameters to customize its behavior, in particular how it extracts tokens:


In [43]:
TfidfVectorizer()


Out[43]:
TfidfVectorizer(analyzer=u'word', binary=False, charset=None,
        charset_error=None, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=u'l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [44]:
print(TfidfVectorizer.__doc__)


Convert a collection of raw documents to a matrix of TF-IDF features.

    Equivalent to CountVectorizer followed by TfidfTransformer.

    Parameters
    ----------
    input : string {'filename', 'file', 'content'}
        If filename, the sequence passed as an argument to fit is
        expected to be a list of filenames that need reading to fetch
        the raw content to analyze.

        If 'file', the sequence items must have 'read' method (file-like
        object) it is called to fetch the bytes in memory.

        Otherwise the input is expected to be the sequence strings or
        bytes items are expected to be analyzed directly.

    encoding : string, 'utf-8' by default.
        If bytes or files are given to analyze, this encoding is used to
        decode.

    decode_error : {'strict', 'ignore', 'replace'}
        Instruction on what to do if a byte sequence is given to analyze that
        contains characters not of the given `encoding`. By default, it is
        'strict', meaning that a UnicodeDecodeError will be raised. Other
        values are 'ignore' and 'replace'.

    strip_accents : {'ascii', 'unicode', None}
        Remove accents during the preprocessing step.
        'ascii' is a fast method that only works on characters that have
        an direct ASCII mapping.
        'unicode' is a slightly slower method that works on any characters.
        None (default) does nothing.

    analyzer : string, {'word', 'char'} or callable
        Whether the feature should be made of word or character n-grams.

        If a callable is passed it is used to extract the sequence of features
        out of the raw, unprocessed input.

    preprocessor : callable or None (default)
        Override the preprocessing (string transformation) stage while
        preserving the tokenizing and n-grams generation steps.

    tokenizer : callable or None (default)
        Override the string tokenization step while preserving the
        preprocessing and n-grams generation steps.

    ngram_range : tuple (min_n, max_n)
        The lower and upper boundary of the range of n-values for different
        n-grams to be extracted. All values of n such that min_n <= n <= max_n
        will be used.

    stop_words : string {'english'}, list, or None (default)
        If a string, it is passed to _check_stop_list and the appropriate stop
        list is returned. 'english' is currently the only supported string
        value.

        If a list, that list is assumed to contain stop words, all of which
        will be removed from the resulting tokens.

        If None, no stop words will be used. max_df can be set to a value
        in the range [0.7, 1.0) to automatically detect and filter stop
        words based on intra corpus document frequency of terms.

    lowercase : boolean, default True
        Convert all characters to lowercase befor tokenizing.

    token_pattern : string
        Regular expression denoting what constitutes a "token", only used
        if `tokenize == 'word'`. The default regexp select tokens of 2
        or more letters characters (punctuation is completely ignored
        and always treated as a token separator).

    max_df : float in range [0.0, 1.0] or int, optional, 1.0 by default
        When building the vocabulary ignore terms that have a term frequency
        strictly higher than the given threshold (corpus specific stop words).
        If float, the parameter represents a proportion of documents, integer
        absolute counts.
        This parameter is ignored if vocabulary is not None.

    min_df : float in range [0.0, 1.0] or int, optional, 1 by default
        When building the vocabulary ignore terms that have a term frequency
        strictly lower than the given threshold.
        This value is also called cut-off in the literature.
        If float, the parameter represents a proportion of documents, integer
        absolute counts.
        This parameter is ignored if vocabulary is not None.

    max_features : optional, None by default
        If not None, build a vocabulary that only consider the top
        max_features ordered by term frequency across the corpus.

        This parameter is ignored if vocabulary is not None.

    vocabulary : Mapping or iterable, optional
        Either a Mapping (e.g., a dict) where keys are terms and values are
        indices in the feature matrix, or an iterable over terms. If not
        given, a vocabulary is determined from the input documents.

    binary : boolean, False by default.
        If True, all non zero counts are set to 1. This is useful for discrete
        probabilistic models that model binary events rather than integer
        counts.

    dtype : type, optional
        Type of the matrix returned by fit_transform() or transform().

    norm : 'l1', 'l2' or None, optional
        Norm used to normalize term vectors. None for no normalization.

    use_idf : boolean, optional
        Enable inverse-document-frequency reweighting.

    smooth_idf : boolean, optional
        Smooth idf weights by adding one to document frequencies, as if an
        extra document was seen containing every term in the collection
        exactly once. Prevents zero divisions.

    sublinear_tf : boolean, optional
        Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).

    See also
    --------
    CountVectorizer
        Tokenize the documents and count the occurrences of token and return
        them as a sparse matrix

    TfidfTransformer
        Apply Term Frequency Inverse Document Frequency normalization to a
        sparse matrix of occurrence counts.

    

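For instance, here is a quick sketch (the parameter choices are purely illustrative) of how two of these parameters change the size of the extracted vocabulary on the small training corpus:

for params in [{}, {'stop_words': 'english'}, {'ngram_range': (1, 2)}]:
    # Refit a vectorizer with the extra parameters and compare vocabulary sizes
    vec = TfidfVectorizer(min_df=2, **params)
    vec.fit(twenty_train_small.data)
    print("{0}: {1} features".format(params, len(vec.vocabulary_)))
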
The easiest way to introspect what the vectorizer is actually doing for a given set of parameters is to call the vectorizer.build_analyzer() method to get an instance of the text analyzer it uses to process the text:


In [45]:
analyzer = TfidfVectorizer().build_analyzer()
analyzer("I love scikit-learn: this is a cool Python lib!")


Out[45]:
[u'love', u'scikit', u'learn', u'this', u'is', u'cool', u'python', u'lib']

You can notice that all the tokens are lowercase, that the single-letter word "I" was dropped, and that the hyphen in "scikit-learn" was treated as a token separator. Let's change some of that default behavior:


In [46]:
analyzer = TfidfVectorizer(
    preprocessor=lambda text: text,  # disable lowercasing
    token_pattern=ur'(?u)\b[\w-]+\b', # treat hyphen as a letter
                                      # do not exclude single letter tokens
).build_analyzer()

analyzer("I love scikit-learn: this is a cool Python lib!")


Out[46]:
[u'I',
 u'love',
 u'scikit-learn',
 u'this',
 u'is',
 u'a',
 u'cool',
 u'Python',
 u'lib']

The analyzer name comes from the Lucene parlance: it wraps the sequential application of:

  • text preprocessing (processing the text documents as a whole, e.g. lowercasing)
  • text tokenization (splitting the document into a sequence of tokens)
  • token filtering and recombination (e.g. n-grams extraction, see later)

The analyzer system of scikit-learn is much more basic than Lucene's, though.

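A quick sketch of the first two stages taken separately, using the build_preprocessor and build_tokenizer methods of the vectorizer (default settings assumed):

vec = TfidfVectorizer()
preprocess = vec.build_preprocessor()  # lowercasing (and optional accent stripping)
tokenize = vec.build_tokenizer()       # regexp-based splitting into tokens

text = "I love scikit-learn: this is a cool Python lib!"
print(preprocess(text))
print(tokenize(preprocess(text)))
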
Exercise:

  • Write a pre-processor callable (e.g. a python function) to remove the headers of a newsgroup post (one possible starting point is sketched after the hints below).
  • Vectorize the data again and measure the impact on performance of removing the header info from the dataset.
  • Do you expect the performance of the model to improve or decrease? What is the score of a uniform random classifier on the same dataset?

Hint: the TfidfVectorizer class can accept python functions to customize the preprocessor, tokenizer or analyzer stages of the vectorizer.

  • type TfidfVectorizer() alone in a cell to see the default value of the parameters

  • type TfidfVectorizer.__doc__ to print the constructor parameters doc or ? suffix operator on a any Python class or method to read the docstring or even the ?? operator to read the source code.
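
One possible starting point for the first item (a sketch, not the only solution): in the 20 newsgroups files the header block is separated from the body by the first blank line, so a custom preprocessor can drop everything before it. Note that passing preprocessor= replaces the default preprocessing, so the lowercasing has to be redone by hand:

def strip_headers(text):
    # Everything before the first blank line is header material
    head, sep, body = text.partition('\n\n')
    # A custom preprocessor replaces the default one, so keep the lowercasing here
    return (body if body else text).lower()

vectorizer_no_headers = TfidfVectorizer(min_df=2, preprocessor=strip_headers)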


In [ ]:

Model Selection of the Naive Bayes Classifier Parameter Alone

The MultinomialNB class is a good baseline classifier for text as it's fast and has few parameters to tweak:


In [ ]:
MultinomialNB()

In [ ]:
print(MultinomialNB.__doc__)

By reading the doc we can see that the alpha parameter is a good candidate to adjust the model for the bias (underfitting) vs variance (overfitting) trade-off.

Exercise:

  • use the sklearn.grid_search.GridSearchCV or the model_selection.RandomizedGridSeach utility function from the previous chapters to find a good value for the parameter alpha (a minimal sketch is given after the hints below)
  • plot the validation scores (and optionally the training scores) for each value of alpha and identify the areas where the model overfits or underfits.

Hints:

  • you can search for values of alpha in the range [0.00001 - 1] using a logarithmic scale
  • RandomizedGridSearch also has a launch_for_arrays method as an alternative to launch_for_splits in case the CV splits have not been precomputed in advance.

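A minimal sketch of the GridSearchCV route (the nb_gs name, the 3-fold CV and the exact grid are illustrative choices):

from sklearn.grid_search import GridSearchCV

# 6 values of alpha spaced logarithmically between 1e-5 and 1
nb_gs = GridSearchCV(MultinomialNB(),
                     {'alpha': np.logspace(-5, 0, 6)},
                     cv=3)
nb_gs.fit(X_train_small, y_train_small)
print(nb_gs.best_params_, nb_gs.best_score_)
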
In [ ]:

Setting Up a Pipeline for Cross Validation and Model Selection of the Feature Extraction parameters

The feature extraction class has many options to customize its behavior:


In [ ]:
print(TfidfVectorizer.__doc__)

In order to evaluate the impact of the feature extraction parameters, one can chain a configured feature extractor and a linear classifier (used here as an alternative to the naive Bayes model):


In [ ]:
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline((
    ('vec', TfidfVectorizer(min_df=1, max_df=0.8, use_idf=True)),
    ('clf', PassiveAggressiveClassifier(C=1)),
))

Such a pipeline can then be cross validated or even grid searched:


In [ ]:
from sklearn.cross_validation import cross_val_score
from scipy.stats import sem

scores = cross_val_score(pipeline, twenty_train_small.data,
                         twenty_train_small.target, cv=3, n_jobs=-1)
scores.mean(), sem(scores)

For the grid search, the parameter names are prefixed with the name of the pipeline step, using "__" as a separator:


In [ ]:
from sklearn.grid_search import GridSearchCV

parameters = {
    #'vec__min_df': [1, 2],
    'vec__max_df': [0.8, 1.0],
    'vec__ngram_range': [(1, 1), (1, 2)],
    'vec__use_idf': [True, False],
}

gs = GridSearchCV(pipeline, parameters, verbose=2, refit=False)
_ = gs.fit(twenty_train_small.data, twenty_train_small.target)

In [ ]:
gs.best_score_

In [ ]:
gs.best_params_

Introspecting Model Performance

Displaying the Most Discriminative Features

Let's fit a model on the small dataset and collect info on the fitted components:


In [ ]:
_ = pipeline.fit(twenty_train_small.data, twenty_train_small.target)

In [ ]:
vec_name, vec = pipeline.steps[0]
clf_name, clf = pipeline.steps[1]

feature_names = vec.get_feature_names()
target_names = twenty_train_small.target_names

feature_weights = clf.coef_

feature_weights.shape

By sorting the feature weights of the linear model and asking the vectorizer for the corresponding feature names, one can get a clue about what the model actually learned from the data:


In [ ]:
def display_important_features(feature_names, target_names, weights, n_top=30):
    for i, target_name in enumerate(target_names):
        print("Class: " + target_name)
        print("")
        
        sorted_features_indices = weights[i].argsort()[::-1]
        
        most_important = sorted_features_indices[:n_top]
        print(", ".join("{0}: {1:.4f}".format(feature_names[j], weights[i, j])
                        for j in most_important))
        print("...")
        
        least_important = sorted_features_indices[-n_top:]
        print(", ".join("{0}: {1:.4f}".format(feature_names[j], weights[i, j])
                        for j in least_important))
        print("")
        
display_important_features(feature_names, target_names, feature_weights)

Displaying the per-class Classification Reports


In [ ]:
from sklearn.metrics import classification_report

predicted = pipeline.predict(twenty_test_small.data)

In [ ]:
print(classification_report(twenty_test_small.target, predicted,
                            target_names=twenty_test_small.target_names))

Printing the Confusion Matrix

The confusion matrix shows which classes get confused with one another by looking at the off-diagonal entries: for instance, here we can see that articles about atheism have been wrongly classified as being about religion 57 times:


In [ ]:
from sklearn.metrics import confusion_matrix

confusion_matrix(twenty_test_small.target, predicted)

In [ ]:
twenty_test_small.target_names
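
Since matplotlib is already set up, here is a quick sketch to visualize the same matrix (rows are true classes, columns are predicted classes; the styling choices are arbitrary):

cm = confusion_matrix(twenty_test_small.target, predicted)

plt.matshow(cm)
plt.colorbar()
plt.xlabel("Predicted class")
plt.ylabel("True class")
_ = plt.xticks(range(len(twenty_test_small.target_names)),
               twenty_test_small.target_names, rotation=90)
_ = plt.yticks(range(len(twenty_test_small.target_names)),
               twenty_test_small.target_names)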

In [ ]: