Classification in scikit-learn

This code predicts which of 20 possible newsgroups a post belongs to. It is trained on the commonly used 20-newsgroups dataset, which is an "unusual" classification dataset in that each newsgroup is very distinctive. That tends to favor models that do well on this kind of data.

The code does the following:

  1. counts words
  2. weights word count features with TFIDF weighting
  3. predicts the newsgroup from the weighted features

Models are optimized through:

  1. Varying tokenization methods including character and word n-grams
  2. Several model types
  3. Randomized hyperparameter and tokenization option search
  4. Ranking of several models so best models are visible
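
The search-and-rank idea in the list above can be sketched without any libraries. This is only an illustration: SEARCH_SPACE, sample_configs, rank_configs and toy_score are made-up names, and toy_score is a placeholder for "fit a pipeline with these options and return its F1".

```python
import random

# Hypothetical search space: tokenization options plus one model hyperparameter.
SEARCH_SPACE = {
    "analyzer": ["word", "char"],
    "ngram_range": [(1, 1), (1, 2), (1, 3)],
    "alpha": [1e-6, 1e-5, 1e-4, 1e-3],
}

def sample_configs(space, n_iter, seed=0):
    """Draw n_iter random configurations, one value per option."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in space.items()}
            for _ in range(n_iter)]

def rank_configs(configs, score_fn):
    """Score every configuration and return (score, config) pairs, best first."""
    scored = [(score_fn(cfg), cfg) for cfg in configs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

def toy_score(cfg):
    """Placeholder score (an assumption); a real run would train and evaluate a model."""
    bonus = 0.5 if cfg["analyzer"] == "word" else 0.0
    return cfg["ngram_range"][1] + bonus - 1000 * cfg["alpha"]

ranking = rank_configs(sample_configs(SEARCH_SPACE, n_iter=5), toy_score)
for score, cfg in ranking:
    print(round(score, 3), cfg)
```

Sampling a fixed number of random configurations keeps the cost bounded even when the full grid of options is large.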

Code came from examples at:

  1. http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
  2. http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html

20 newsgroups dataset info is at http://scikit-learn.org/stable/datasets/index.html#the-20-newsgroups-text-dataset

Be sure to install the following (pip3 is the Python 3 pip; the plain pip command will also work if it belongs to Python 3):

  1. pip3 install scikit-learn
  2. pip3 install pandas
  3. pip3 install scipy

If I missed an install and you get an import error, try doing a pip3 install <import name> . Note that the Jupyter kernel needs to use the same version/installation of Python that you ran pip3 install in (Python 3).
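
A quick way to check both points from inside the notebook is sketched below. missing_modules is a hypothetical helper name; it only relies on the standard library. Note that the import name is sklearn even though the package is installed as scikit-learn.

```python
import importlib.util
import sys

def missing_modules(names):
    """Return the import names that this interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# This path should belong to the same Python that pip3 installed into.
print("interpreter:", sys.executable)
print("missing:", missing_modules(["sklearn", "pandas", "scipy", "numpy", "matplotlib"]))
```

If the printed interpreter is not the one pip3 belongs to, running `python -m pip install <name>` with that interpreter is one way to keep the two in sync.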


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 
from IPython.display import display, HTML
from IPython.display import Audio
import os

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier

import time
display(HTML("<style>.container { width:97% !important; }</style>")) #Set width of iPython cells


Load Data


In [3]:
from sklearn.datasets import fetch_20newsgroups
# You can restrict the categories to simulate fewer classes
#categories = ['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med']
#categories = ['comp.graphics', 'sci.med']
#categories = ['alt.atheism', 'talk.religion.misc']
categories=None
twenty_train = fetch_20newsgroups(subset='train',
    categories=categories, shuffle=True, random_state=42)

twenty_test = fetch_20newsgroups(subset='test',
    categories=categories, shuffle=True, random_state=42)

Investigate Training Set


In [4]:
twenty_train.target_names


Out[4]:
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [5]:
len(twenty_train.data)


Out[5]:
11314

In [6]:
len(twenty_train.filenames)


Out[6]:
11314

In [7]:
print(twenty_train.data[0])


From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----






In [8]:
twenty_train.target_names[twenty_train.target[0]]


Out[8]:
'rec.autos'

In [9]:
twenty_train.target


Out[9]:
array([7, 4, 4, ..., 3, 1, 8])

In [10]:
len(twenty_train.target)


Out[10]:
11314

In [11]:
twenty_train.target_names


Out[11]:
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

Test Set


In [12]:
len(twenty_test.data)


Out[12]:
7532

In [13]:
len(twenty_test.data) / len(twenty_train.data)


Out[13]:
0.66572388191621

In [14]:
print(twenty_test.data[10])


From: Greg.Reinacker@FtCollins.NCR.COM
Subject: Windows On-Line Review uploaded
Reply-To: Greg.Reinacker@FtCollinsCO.NCR.COM
Organization: NCR Microelectronics, Ft. Collins, CO
Lines: 12

I have uploaded the Windows On-Line Review shareware edition to
ftp.cica.indiana.edu as /pub/pc/win3/uploads/wolrs7.zip.

It is an on-line magazine which contains reviews of some shareware
products...I grabbed it from the Windows On-Line BBS.

--
--------------------------------------------------------------------------
Greg Reinacker                          (303) 223-5100 x9289
NCR Microelectronic Products Division   VoicePlus 464-9289
2001 Danfield Court                     Greg.Reinacker@FtCollinsCO.NCR.COM
Fort Collins, CO  80525


In [15]:
twenty_test.target_names[twenty_test.target[10]]


Out[15]:
'comp.os.ms-windows.misc'

Test Tokenization

CountVectorizer


In [16]:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
print('training examples = ' + str(len(twenty_train.data)))
print('vocabulary length = ' + str(len(count_vect.vocabulary_)))
print('transformed training text matrix shape = ' + str(X_train_counts.shape))


training examples = 11314
vocabulary length = 130107
transformed training text matrix shape = (11314, 130107)
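
What CountVectorizer does with its default settings can be approximated in a few lines. This is a rough sketch, not the library's code: it assumes the default lowercase=True and the default token pattern (two or more word characters), and tokenize, build_vocabulary and count_row are made-up helper names.

```python
import re
from collections import Counter

# Same pattern as CountVectorizer's default: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and keep tokens of at least two characters."""
    return TOKEN_PATTERN.findall(text.lower())

def build_vocabulary(docs):
    """Map each distinct token to a column index, in sorted token order."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def count_row(text, vocabulary):
    """Return {column index: count}, ignoring tokens outside the vocabulary."""
    counts = Counter(tok for tok in tokenize(text) if tok in vocabulary)
    return {vocabulary[tok]: n for tok, n in counts.items()}

docs = ["The The rain in spain.", "The brown brown fox."]
vocab = build_vocabulary(docs)
print(vocab)
print(count_row(docs[0], vocab))
print(tokenize("I saw a 2-door car"))  # one-character tokens are dropped
```

With 11314 real posts the same process yields the 130107-column vocabulary reported above, one column per distinct token.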

In [17]:
# vocabulary_ is dict of word string -> word index
list(count_vect.vocabulary_.items())[:50]


Out[17]:
[('from', 56979),
 ('lerxst', 75358),
 ('wam', 123162),
 ('umd', 118280),
 ('edu', 50527),
 ('where', 124031),
 ('my', 85354),
 ('thing', 114688),
 ('subject', 111322),
 ('what', 123984),
 ('car', 37780),
 ('is', 68532),
 ('this', 114731),
 ('nntp', 87620),
 ('posting', 95162),
 ('host', 64095),
 ('rac3', 98949),
 ('organization', 90379),
 ('university', 118983),
 ('of', 89362),
 ('maryland', 79666),
 ('college', 40998),
 ('park', 92081),
 ('lines', 76032),
 ('15', 4605),
 ('was', 123292),
 ('wondering', 124931),
 ('if', 65798),
 ('anyone', 28615),
 ('out', 90774),
 ('there', 114579),
 ('could', 42876),
 ('enlighten', 51793),
 ('me', 80638),
 ('on', 89860),
 ('saw', 104813),
 ('the', 114455),
 ('other', 90686),
 ('day', 45295),
 ('it', 68766),
 ('door', 48618),
 ('sports', 109581),
 ('looked', 76718),
 ('to', 115475),
 ('be', 32311),
 ('late', 74693),
 ('60s', 16574),
 ('early', 50111),
 ('70s', 18299),
 ('called', 37433)]

Test transform on some short text


In [18]:
text = ['The The rain in spain.', 'The brown brown fox.']
counts_matrix = count_vect.transform(text)
type(counts_matrix)


Out[18]:
scipy.sparse.csr.csr_matrix

In [19]:
counts_matrix.data


Out[19]:
array([1, 1, 1, 2, 2, 1, 1])

In [20]:
counts_matrix.indptr


Out[20]:
array([0, 4, 7], dtype=int32)

In [21]:
counts_matrix.indices


Out[21]:
array([ 66608,  99121, 109111, 114455,  35194,  56573, 114455], dtype=int32)
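
The three arrays above are the CSR (compressed sparse row) layout: data holds the non-zero values, indices holds their column numbers, and indptr marks where each row starts and ends. A small sketch of how they fit together, using the values printed above (csr_to_triples is a made-up helper name):

```python
def csr_to_triples(data, indices, indptr):
    """Expand CSR arrays into (row, column, value) triples."""
    triples = []
    for row in range(len(indptr) - 1):
        # Entries for this row live in positions indptr[row] .. indptr[row + 1] - 1.
        for k in range(indptr[row], indptr[row + 1]):
            triples.append((row, indices[k], data[k]))
    return triples

# Values copied from the cell outputs above.
data = [1, 1, 1, 2, 2, 1, 1]
indptr = [0, 4, 7]
indices = [66608, 99121, 109111, 114455, 35194, 56573, 114455]

for triple in csr_to_triples(data, indices, indptr):
    print(triple)
```

Row 0 owns the first four entries and row 1 the remaining three, which matches the two input sentences.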

Convert to a COO sparse matrix for easier display


In [22]:
from scipy.sparse import coo_matrix
coo = coo_matrix(counts_matrix)
#print(np.stack((coo.row, coo.col, coo.data)))

df = pd.DataFrame({'row':coo.row, 'column':coo.col, 'count':coo.data}, 
                  columns=['row','column', 'count'])

df


Out[22]:
   row  column  count
0    0   66608      1
1    0   99121      1
2    0  109111      1
3    0  114455      2
4    1   35194      2
5    1   56573      1
6    1  114455      1

Build inverse vocabulary


In [23]:
inverse_vocabulary=np.empty(len(count_vect.vocabulary_), dtype=object)
for key,value in count_vect.vocabulary_.items():
    inverse_vocabulary[value] = key
    
for i in coo.col:
    print(i, inverse_vocabulary[i])


66608 in
99121 rain
109111 spain
114455 the
35194 brown
56573 fox
114455 the

In [24]:
words = [inverse_vocabulary[i] for i in coo.col]
df = pd.DataFrame({'row':coo.row, 'column':coo.col, 'count':coo.data, 'word':words})
df = df[ ['row','column', 'count', 'word'] ]
df


Out[24]:
   row  column  count   word
0    0   66608      1     in
1    0   99121      1   rain
2    0  109111      1  spain
3    0  114455      2    the
4    1   35194      2  brown
5    1   56573      1    fox
6    1  114455      1    the

In [25]:
tfidf = TfidfTransformer()
tfidf.fit(X_train_counts) # compute weights on whole training set
tfidf_matrix = tfidf.transform(counts_matrix) # transform examples
print( 'tfidf_matrix type = ' + str(type(tfidf_matrix)) )
print( 'tfidf_matrix shape = ' + str(tfidf_matrix.shape) )

coo_tfidf = coo_matrix(tfidf_matrix)
words_tfidf = [inverse_vocabulary[i] for i in coo_tfidf.col]

df = pd.DataFrame({'row':coo_tfidf.row, 'column':coo_tfidf.col, 
                   'value':coo_tfidf.data, 'word':words_tfidf})
df = df[ ['row','column', 'value', 'word'] ]
df


tfidf_matrix type = <class 'scipy.sparse.csr.csr_matrix'>
tfidf_matrix shape = (2, 130107)
Out[25]:
   row  column     value   word
0    0  114455  0.214719    the
1    0  109111  0.759340  spain
2    0   99121  0.602864   rain
3    0   66608  0.117698     in
4    1  114455  0.085361    the
5    1   56573  0.524809    fox
6    1   35194  0.846929  brown

In [26]:
import scipy.sparse.linalg  # a bare "import scipy" does not guarantee this submodule is loaded
scipy.sparse.linalg.norm(tfidf_matrix, axis=1)


Out[26]:
array([ 1.,  1.])

Notice the following in the above values:

  1. Frequent words like 'the' and 'in' are down-weighted
  2. Each matrix row has a Euclidean norm of 1.0
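
Both effects can be reproduced by hand. The sketch below assumes TfidfTransformer's default settings (smoothed idf, raw term counts, L2 normalization) and uses a tiny made-up corpus rather than the newsgroup data; smoothed_idf and tfidf_row are hypothetical helper names.

```python
import math
from collections import Counter

def smoothed_idf(docs):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, where df counts documents containing t."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + d)) + 1 for tok, d in df.items()}

def tfidf_row(doc, idf):
    """Weight raw counts by idf, then scale the row to unit Euclidean length."""
    counts = Counter(tok for tok in doc if tok in idf)
    weights = {tok: c * idf[tok] for tok, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return {}
    return {tok: w / norm for tok, w in weights.items()}

# Toy corpus (already tokenized); not the newsgroup data.
corpus = [
    ["the", "rain", "in", "spain"],
    ["the", "brown", "fox"],
    ["the", "quick", "fox", "in", "the", "rain"],
]
idf = smoothed_idf(corpus)
row = tfidf_row(["the", "the", "rain", "in", "spain"], idf)
print(idf)
print(row)
print(math.sqrt(sum(w * w for w in row.values())))
```

A word that appears in every document gets the minimum idf of 1.0, and rarer words get progressively larger values, which is the pattern the idf_ values in the next section show for 'the' versus 'vector'.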

Tfidf Weights


In [27]:
tfidf.idf_.shape


Out[27]:
(130107,)

In [28]:
words = ['the', 'very', 'car', 'vector', 'africa']
for word in words:
    word_index = count_vect.vocabulary_[word]
    print(word + ' = ' + str(tfidf.idf_[word_index]))


the = 1.06905600081
very = 2.72055957853
car = 3.98125516175
vector = 7.1150087332
africa = 6.48373695636

Pipelines

Pipelines pass the output of one transform to the input of the next.
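
To make the chaining concrete, here is a toy stand-in written from scratch. MiniPipeline and the three step classes are hypothetical and far simpler than scikit-learn's Pipeline; the point is only to show how each step's transform output becomes the next step's input.

```python
from collections import Counter, defaultdict

class Lowercase:
    """Transformer: lowercase each document."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return [doc.lower() for doc in X]

class SplitWords:
    """Transformer: split each document into a list of tokens."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return [doc.split() for doc in X]

class TokenVoteClassifier:
    """Estimator: each known token votes for the labels it was seen with."""
    def fit(self, X, y):
        self.votes_ = defaultdict(Counter)
        for tokens, label in zip(X, y):
            for tok in tokens:
                self.votes_[tok][label] += 1
        self.default_ = Counter(y).most_common(1)[0][0]
        return self
    def predict(self, X):
        out = []
        for tokens in X:
            tally = Counter()
            for tok in tokens:
                tally.update(self.votes_.get(tok, {}))
            out.append(tally.most_common(1)[0][0] if tally else self.default_)
        return out

class MiniPipeline:
    """Illustration only: all steps but the last transform, the last one predicts."""
    def __init__(self, steps):
        self.steps = steps
    def fit(self, X, y):
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)  # output feeds the next step
        self.steps[-1][1].fit(X, y)
        return self
    def predict(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)

pipe = MiniPipeline([("lower", Lowercase()), ("split", SplitWords()),
                     ("clf", TokenVoteClassifier())])
pipe.fit(["Fast car engine", "Hockey game tonight"], ["rec.autos", "rec.sport.hockey"])
print(pipe.predict(["CAR for sale", "the GAME"]))
```

The same fit/transform/predict contract is what lets CountVectorizer, TfidfTransformer and a classifier be swapped in and out of a real pipeline.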

Pipeline


In [29]:
text_clf = Pipeline([('cvect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('sgdc', MultinomialNB()),
                    ])

In [30]:
text_clf.fit(twenty_train.data, twenty_train.target)


Out[30]:
Pipeline(memory=None,
     steps=[('cvect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        stri...near_tf=False, use_idf=True)), ('sgdc', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [31]:
predicted = text_clf.predict(twenty_test.data)
np.mean(predicted == twenty_test.target)


Out[31]:
0.7738980350504514

In [32]:
from sklearn import metrics
print(metrics.classification_report(twenty_test.target, predicted,
     target_names=twenty_test.target_names))


                          precision    recall  f1-score   support

             alt.atheism       0.80      0.52      0.63       319
           comp.graphics       0.81      0.65      0.72       389
 comp.os.ms-windows.misc       0.82      0.65      0.73       394
comp.sys.ibm.pc.hardware       0.67      0.78      0.72       392
   comp.sys.mac.hardware       0.86      0.77      0.81       385
          comp.windows.x       0.89      0.75      0.82       395
            misc.forsale       0.93      0.69      0.80       390
               rec.autos       0.85      0.92      0.88       396
         rec.motorcycles       0.94      0.93      0.93       398
      rec.sport.baseball       0.92      0.90      0.91       397
        rec.sport.hockey       0.89      0.97      0.93       399
               sci.crypt       0.59      0.97      0.74       396
         sci.electronics       0.84      0.60      0.70       393
                 sci.med       0.92      0.74      0.82       396
               sci.space       0.84      0.89      0.87       394
  soc.religion.christian       0.44      0.98      0.61       398
      talk.politics.guns       0.64      0.94      0.76       364
   talk.politics.mideast       0.93      0.91      0.92       376
      talk.politics.misc       0.96      0.42      0.58       310
      talk.religion.misc       0.97      0.14      0.24       251

             avg / total       0.82      0.77      0.77      7532


In [33]:
df = pd.DataFrame(metrics.confusion_matrix(twenty_test.target, predicted))
df


Out[33]:
     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
 0 166   0   0   1   0   1   0   0   1   1   1   3   0   6   3 123   4   8   0   1
 1   1 252  15  12   9  18   1   2   1   5   2  41   4   0   6  15   4   1   0   0
 2   0  14 258  45   3   9   0   2   1   3   2  25   1   0   6  23   2   0   0   0
 3   0   5  11 305  17   1   3   6   1   0   2  19  13   0   5   3   1   0   0   0
 4   0   3   8  23 298   0   3   8   1   3   1  16   8   0   2   8   3   0   0   0
 5   1  21  17  13   2 298   1   0   1   1   0  23   0   1   4  10   2   0   0   0
 6   0   1   3  31  12   1 271  19   4   4   6   5  12   6   3   9   3   0   0   0
 7   0   1   0   3   0   0   4 364   3   2   2   4   1   1   3   3   4   0   1   0
 8   0   0   0   1   0   0   2  10 371   0   0   4   0   0   0   8   2   0   0   0
 9   0   0   0   0   1   0   0   4   0 357  22   0   0   0   2   9   1   1   0   0
10   0   0   0   0   0   0   0   1   0   4 387   1   0   0   1   5   0   0   0   0
11   0   2   1   0   0   1   1   3   0   0   0 383   1   0   0   3   1   0   0   0
12   0   4   2  17   5   0   2   8   7   1   2  78 235   3  11  15   2   1   0   0
13   2   3   0   1   1   3   1   0   2   3   4  11   5 292   6  52   6   4   0   0
14   0   2   0   1   0   3   0   2   1   0   1   6   1   2 351  19   4   0   1   0
15   2   0   0   0   0   0   0   0   1   0   0   0   0   1   2 392   0   0   0   0
16   0   0   0   1   0   0   2   0   1   1   0  10   0   0   1   6 341   1   0   0
17   0   1   0   0   0   0   0   0   0   1   0   2   0   0   0  24   3 344   1   0
18   2   0   0   0   0   0   0   1   0   0   1  11   0   1   7  35 118   5 129   0
19  33   2   0   0   0   0   0   0   0   1   1   3   0   4   4 131  29   5   3  35

Function to Test a Pipeline


In [34]:
class QAResults:
    def init(self, Y_expected, Y_predicted, X, class_labels):
        self.Y_expected = Y_expected
        self.Y_predicted = Y_predicted
        self.X = X
        self.class_labels = class_labels
        self.next_error_index = 0
        self.errors = np.nonzero(Y_expected - Y_predicted) # returns indices of non-zero elements, i.e. the misclassified examples
        print(self.errors)
        
    def display_next(self):
        if(self.next_error_index >= self.errors[0].shape[0]):
            self.next_error_index = 0 # cycle back around
        X_index = self.errors[0][self.next_error_index]
        print('index = ', X_index )
        print('Expected = ' + self.class_labels[self.Y_expected[X_index]])
        print('Predicted = ' + self.class_labels[self.Y_predicted[X_index]])
        print('\nX['+ str(X_index) +']')
        print( self.X[X_index] )
        self.next_error_index +=1

In [35]:
def header(text):  # parameter renamed so it no longer shadows the built-in str
    display(HTML('<h3>'+text+'</h3>'))

tests = {}

def test_pipeline(pipeline, name=None, verbose=True, qa_test = None):
    start=time.time()
    pipeline.fit(twenty_train.data, twenty_train.target)  
    predicted = pipeline.predict(twenty_test.data)
    elapsed_time = (time.time() - start)
    accuracy = np.mean(predicted == twenty_test.target)
    f1 = metrics.f1_score(twenty_test.target, predicted, average='macro') 
    print( 'F1 = %.3f \nAccuracy = %.3f\ntime = %.3f sec.' % (f1, accuracy, elapsed_time))
    if(verbose):
        header('Classification Report')
        print(metrics.classification_report(twenty_test.target, predicted,
                 target_names=twenty_test.target_names, digits=3))
        header('Confusion Matrix (row=expected, col=predicted)')
        df = pd.DataFrame(metrics.confusion_matrix(twenty_test.target, predicted))
        df.columns = twenty_test.target_names
        df['Expected']=twenty_test.target_names
        df.set_index('Expected',inplace=True)
        display(df)
   
    if name is not None:
        tests[name]={'Name':name, 'Accuracy':accuracy, 'F1':f1, 'Time':elapsed_time, 
                     'Details':pipeline.get_params(deep=True)}
    
    if qa_test is not None:
        qa_test.init( twenty_test.target, predicted, twenty_test.data, twenty_test.target_names) 
    
qa_test=QAResults()
test_pipeline(text_clf, qa_test=qa_test)


F1 = 0.756 
Accuracy = 0.774
time = 4.562 sec.

Classification Report

                          precision    recall  f1-score   support

             alt.atheism      0.802     0.520     0.631       319
           comp.graphics      0.810     0.648     0.720       389
 comp.os.ms-windows.misc      0.819     0.655     0.728       394
comp.sys.ibm.pc.hardware      0.672     0.778     0.721       392
   comp.sys.mac.hardware      0.856     0.774     0.813       385
          comp.windows.x      0.890     0.754     0.816       395
            misc.forsale      0.931     0.695     0.796       390
               rec.autos      0.847     0.919     0.881       396
         rec.motorcycles      0.937     0.932     0.935       398
      rec.sport.baseball      0.922     0.899     0.911       397
        rec.sport.hockey      0.892     0.970     0.929       399
               sci.crypt      0.594     0.967     0.736       396
         sci.electronics      0.836     0.598     0.697       393
                 sci.med      0.921     0.737     0.819       396
               sci.space      0.842     0.891     0.866       394
  soc.religion.christian      0.439     0.985     0.607       398
      talk.politics.guns      0.643     0.937     0.763       364
   talk.politics.mideast      0.930     0.915     0.922       376
      talk.politics.misc      0.956     0.416     0.580       310
      talk.religion.misc      0.972     0.139     0.244       251

             avg / total      0.822     0.774     0.768      7532

Confusion Matrix (row=expected, col=predicted)

alt.atheism comp.graphics comp.os.ms-windows.misc comp.sys.ibm.pc.hardware comp.sys.mac.hardware comp.windows.x misc.forsale rec.autos rec.motorcycles rec.sport.baseball rec.sport.hockey sci.crypt sci.electronics sci.med sci.space soc.religion.christian talk.politics.guns talk.politics.mideast talk.politics.misc talk.religion.misc
Expected
alt.atheism 166 0 0 1 0 1 0 0 1 1 1 3 0 6 3 123 4 8 0 1
comp.graphics 1 252 15 12 9 18 1 2 1 5 2 41 4 0 6 15 4 1 0 0
comp.os.ms-windows.misc 0 14 258 45 3 9 0 2 1 3 2 25 1 0 6 23 2 0 0 0
comp.sys.ibm.pc.hardware 0 5 11 305 17 1 3 6 1 0 2 19 13 0 5 3 1 0 0 0
comp.sys.mac.hardware 0 3 8 23 298 0 3 8 1 3 1 16 8 0 2 8 3 0 0 0
comp.windows.x 1 21 17 13 2 298 1 0 1 1 0 23 0 1 4 10 2 0 0 0
misc.forsale 0 1 3 31 12 1 271 19 4 4 6 5 12 6 3 9 3 0 0 0
rec.autos 0 1 0 3 0 0 4 364 3 2 2 4 1 1 3 3 4 0 1 0
rec.motorcycles 0 0 0 1 0 0 2 10 371 0 0 4 0 0 0 8 2 0 0 0
rec.sport.baseball 0 0 0 0 1 0 0 4 0 357 22 0 0 0 2 9 1 1 0 0
rec.sport.hockey 0 0 0 0 0 0 0 1 0 4 387 1 0 0 1 5 0 0 0 0
sci.crypt 0 2 1 0 0 1 1 3 0 0 0 383 1 0 0 3 1 0 0 0
sci.electronics 0 4 2 17 5 0 2 8 7 1 2 78 235 3 11 15 2 1 0 0
sci.med 2 3 0 1 1 3 1 0 2 3 4 11 5 292 6 52 6 4 0 0
sci.space 0 2 0 1 0 3 0 2 1 0 1 6 1 2 351 19 4 0 1 0
soc.religion.christian 2 0 0 0 0 0 0 0 1 0 0 0 0 1 2 392 0 0 0 0
talk.politics.guns 0 0 0 1 0 0 2 0 1 1 0 10 0 0 1 6 341 1 0 0
talk.politics.mideast 0 1 0 0 0 0 0 0 0 1 0 2 0 0 0 24 3 344 1 0
talk.politics.misc 2 0 0 0 0 0 0 1 0 0 1 11 0 1 7 35 118 5 129 0
talk.religion.misc 33 2 0 0 0 0 0 0 0 1 1 3 0 4 4 131 29 5 3 35
(array([   1,    4,   14, ..., 7525, 7528, 7530]),)
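
The two summary numbers printed by test_pipeline can be computed by hand. The sketch below is a plain-Python illustration on made-up labels, not the library's implementation: accuracy is the fraction of exact matches, and macro F1 is the unweighted mean of the per-class F1 scores, so small classes count as much as large ones.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class: 2*TP / (2*TP + FP + FN), or 0.0 when that is undefined."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(per_class_f1(y_true, y_pred, c) for c in labels) / len(labels)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the expected label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up labels for three classes.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]
print("accuracy =", round(accuracy(y_true, y_pred), 3))
print("macro F1 =", round(macro_f1(y_true, y_pred), 3))
```

This is why the macro F1 above (0.756) sits below the accuracy (0.774): the poorly predicted small classes such as talk.religion.misc pull the unweighted mean down.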

In [36]:
qa_test.display_next()  # re-run this cell to see next error


index =  1
Expected = comp.windows.x
Predicted = sci.crypt

X[1]
From: Rick Miller <rick@ee.uwm.edu>
Subject: X-Face?
Organization: Just me.
Lines: 17
Distribution: world
NNTP-Posting-Host: 129.89.2.33
Summary: Go ahead... swamp me.  <EEP!>

I'm not familiar at all with the format of these "X-Face:" thingies, but
after seeing them in some folks' headers, I've *got* to *see* them (and
maybe make one of my own)!

I've got "dpg-view" on my Linux box (which displays "uncompressed X-Faces")
and I've managed to compile [un]compface too... but now that I'm *looking*
for them, I can't seem to find any X-Face:'s in anyones news headers!  :-(

Could you, would you, please send me your "X-Face:" header?

I *know* I'll probably get a little swamped, but I can handle it.

	...I hope.

Rick Miller  <rick@ee.uwm.edu> | <ricxjo@discus.mil.wi.us>   Ricxjo Muelisto
Send a postcard, get one back! | Enposxtigu bildkarton kaj vi ricevos alion!
          RICK MILLER // 16203 WOODS // MUSKEGO, WIS. 53150 // USA

Importance of TFIDF Weighting


In [37]:
test_pipeline(Pipeline([('cvect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),  # <-- with weighting
                      ('sgdc', SGDClassifier(loss='hinge', penalty='l2',
                                            alpha=1e-5, random_state=42,
                                            max_iter=40)),
                         ]), verbose=False)


F1 = 0.841 
Accuracy = 0.848
time = 8.696 sec.

In [38]:
test_pipeline(Pipeline([('cvect', CountVectorizer()),  # <-- no weighting
                      ('sgdc', SGDClassifier(loss='hinge', penalty='l2',
                                            alpha=1e-5, random_state=42,
                                            max_iter=40)),
                         ]), verbose=False)


F1 = 0.760 
Accuracy = 0.769
time = 8.705 sec.

TfidfVectorizer combines CountVectorizer and TfidfTransformer


In [39]:
test_pipeline(Pipeline([('tfidf_v', TfidfVectorizer()),
                      ('sgdc', SGDClassifier(loss='hinge', penalty='l2',
                                            alpha=1e-4, random_state=42,
                                            max_iter=40 )),
                         ]), verbose=False)


F1 = 0.846 
Accuracy = 0.854
time = 8.722 sec.

In [40]:
test_pipeline(Pipeline([('cvect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('sgdc', SGDClassifier(loss='hinge', penalty='l2',
                                            alpha=1e-4, random_state=42,
                                            max_iter=40 )),
                         ]), verbose=False, name='hinge loss')


F1 = 0.846 
Accuracy = 0.854
time = 8.760 sec.

Hyper-parameter tests on SGDClassifier


In [41]:
# hinge loss is a linear SVM
test_pipeline(Pipeline([('cvect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('sgdc', SGDClassifier(loss='hinge', penalty='l2',
                                            alpha=1e-4, random_state=42,
                                            max_iter=40 )),
                         ]), verbose=False, name='hinge loss')


F1 = 0.846 
Accuracy = 0.854
time = 8.373 sec.

In [42]:
# log loss is logistic regression (newer scikit-learn releases spell this loss='log_loss')
test_pipeline(Pipeline([('cvect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('sgdc', SGDClassifier(loss='log', penalty='l2',
                                            alpha=1e-6, random_state=42,
                                            max_iter=10 )),
                         ]), verbose=False, name='log loss')


F1 = 0.840 
Accuracy = 0.847
time = 6.093 sec.

In [43]:
test_pipeline(Pipeline([('cvect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('sgdc', SGDClassifier(loss='log', penalty='none',
                                            alpha=1e-6, random_state=42,
                                            max_iter=10 )),
                         ]), verbose=False, name='log loss no regularization')


F1 = 0.819 
Accuracy = 0.825
time = 6.112 sec.
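
The 'hinge' and 'log' settings tested above differ in the per-example loss that SGD minimizes. A minimal sketch of the two textbook loss functions, written in terms of the margin y * f(x) with y in {-1, +1} (the function names here are made up for illustration):

```python
import math

def hinge_loss(margin):
    """Linear-SVM loss: exactly zero once the example is beyond the margin."""
    return max(0.0, 1.0 - margin)

def log_loss(margin):
    """Logistic-regression loss: always positive, decays smoothly."""
    return math.log1p(math.exp(-margin))

# A positive margin means the example is on the correct side of the boundary.
for margin in (-1.0, 0.0, 1.0, 3.0):
    print(margin, round(hinge_loss(margin), 3), round(log_loss(margin), 3))
```

Hinge loss ignores examples that are already classified with enough margin, while log loss keeps nudging every example, which is one reason the two can prefer different alpha values.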

Test Naive Bayes model

MultinomialNB


In [44]:
test_pipeline(Pipeline([('cvect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('sgdc', MultinomialNB()),
                         ]), verbose=False, name='MultinomialNB')


F1 = 0.756 
Accuracy = 0.774
time = 4.628 sec.

K-nearest neighbors model

KNeighborsClassifier


In [45]:
from sklearn.neighbors import KNeighborsClassifier

test_pipeline(Pipeline([('cvect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('knn', KNeighborsClassifier(n_neighbors=5)),
                         ]), verbose=False, name='KNN n=5')


F1 = 0.655 
Accuracy = 0.659
time = 13.326 sec.

In [46]:
from sklearn.neighbors import KNeighborsClassifier

for n in range(1,7):
    print( '\nn = ' + str(n))
    test_pipeline(Pipeline([('cvect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('knn', KNeighborsClassifier(n_neighbors=n)),
                         ]), verbose=False, name='KNN n=' + str(n))


n = 1
F1 = 0.667 
Accuracy = 0.672
time = 11.949 sec.

n = 2
F1 = 0.639 
Accuracy = 0.641
time = 11.942 sec.

n = 3
F1 = 0.656 
Accuracy = 0.658
time = 11.676 sec.

n = 4
F1 = 0.654 
Accuracy = 0.656
time = 11.963 sec.

n = 5
F1 = 0.655 
Accuracy = 0.659
time = 11.997 sec.

n = 6
F1 = 0.655 
Accuracy = 0.661
time = 11.863 sec.

In [47]:
from sklearn.neighbors import KNeighborsClassifier

for n in range(1,7):
    print( '\nn = ' + str(n))
    test_pipeline(Pipeline([('cvect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('knn', KNeighborsClassifier(n_neighbors=n, weights='distance')),
                         ]), verbose=False, name='KNN n=' + str(n) + ' distance weights')


n = 1
F1 = 0.667 
Accuracy = 0.672
time = 11.402 sec.

n = 2
F1 = 0.667 
Accuracy = 0.672
time = 11.514 sec.

n = 3
F1 = 0.675 
Accuracy = 0.680
time = 11.714 sec.

n = 4
F1 = 0.678 
Accuracy = 0.684
time = 12.719 sec.

n = 5
F1 = 0.672 
Accuracy = 0.678
time = 12.581 sec.

n = 6
F1 = 0.674 
Accuracy = 0.681
time = 12.041 sec.
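
The difference between the two KNN runs is how neighbor votes are counted. With weights='distance' each neighbor's vote is weighted by the inverse of its distance. The sketch below illustrates the idea on made-up distances; knn_vote is a hypothetical helper, and its handling of a zero distance is a simplification.

```python
from collections import defaultdict

def knn_vote(neighbors, weights="uniform"):
    """Pick a label from (distance, label) pairs of the k nearest neighbors."""
    tally = defaultdict(float)
    for distance, label in neighbors:
        if weights == "distance":
            # Closer neighbors count more; the floor avoids division by zero.
            tally[label] += 1.0 / max(distance, 1e-12)
        else:
            tally[label] += 1.0
    return max(tally, key=tally.get)

# Made-up neighbors: one very close point and two farther ones.
neighbors = [(0.1, "sci.space"), (0.9, "sci.med"), (1.0, "sci.med")]
print(knn_vote(neighbors, "uniform"))
print(knn_vote(neighbors, "distance"))
```

With only two neighbors and inverse-distance weights the nearer one always wins (barring exact ties), which may be why n = 1 and n = 2 give identical scores in the distance-weighted run above.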

Nearest Centroid Model

NearestCentroid


In [48]:
from sklearn.neighbors import NearestCentroid

test_pipeline(Pipeline([('cvect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('sgdc', NearestCentroid(metric='euclidean')),
                         ]), verbose=False, name='NearestCentroid')


F1 = 0.694 
Accuracy = 0.692
time = 4.217 sec.
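
Nearest centroid is simple enough to write out by hand: average the feature vectors of each class, then assign a new vector to the class with the closest average. The following is a dense, plain-Python sketch on made-up two-dimensional data; fit_centroids, euclidean and predict_nearest are hypothetical names, and the real model works on the sparse TF-IDF rows.

```python
import math

def fit_centroids(X, y):
    """Average the feature vectors of each class."""
    sums, counts = {}, {}
    for vec, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_nearest(centroids, vec):
    """Return the label whose centroid is closest to vec."""
    return min(centroids, key=lambda label: euclidean(centroids[label], vec))

# Made-up training vectors for two classes.
X = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
y = ["autos", "autos", "hockey", "hockey"]
centroids = fit_centroids(X, y)
print(centroids)
print(predict_nearest(centroids, [0.9, 0.3]))
```

Because only one vector per class is stored, training and prediction are fast, which fits the short run time reported above.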

Logistic Regression

This is the same model as SGDClassifier with log loss, but it uses different code and a different solver.


In [49]:
from sklearn.linear_model import LogisticRegression

test_pipeline(Pipeline([('cvect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('sgdc', LogisticRegression(solver='sag', multi_class='multinomial', n_jobs=-1)),
                         ]), verbose=False, name='LogisticRegression multinomial')


F1 = 0.819 
Accuracy = 0.827
time = 8.674 sec.

In [50]:
test_pipeline(Pipeline([('cvect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('sgdc', LogisticRegression(solver='sag', multi_class='ovr',n_jobs=-1)),
                         ]), verbose=False, name='LogisticRegression ovr')


F1 = 0.819 
Accuracy = 0.828
time = 47.392 sec.

In [51]:
test_pipeline(Pipeline([('cvect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('sgdc', LogisticRegression(C=10, solver='sag', multi_class='multinomial', n_jobs=-1, max_iter=200)),
                         ]), verbose=False, name='LogisticRegression multinomial C=10')


F1 = 0.838 
Accuracy = 0.845
time = 15.442 sec.

In [52]:
test_pipeline(Pipeline([('cvect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('sgdc', LogisticRegression(C=100, solver='sag', multi_class='multinomial', n_jobs=-1, max_iter=200)),
                         ]), verbose=False, name='LogisticRegression multinomial C=100')


F1 = 0.841 
Accuracy = 0.847
time = 49.159 sec.

In [53]:
test_pipeline(Pipeline([('cvect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('sgdc', LogisticRegression(C=1000, solver='sag', multi_class='multinomial', n_jobs=-1, max_iter=200)),
                         ]), verbose=False, name='LogisticRegression multinomial C=1000')


/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/sag.py:326: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
  "the coef_ did not converge", ConvergenceWarning)
F1 = 0.840 
Accuracy = 0.846
time = 50.857 sec.

Most Influential Features


In [54]:
p = Pipeline([('cvect', CountVectorizer(stop_words='english', ngram_range=(1,2),
                                                max_df = 0.88, min_df=1)),
                      ('tfidf', TfidfTransformer(sublinear_tf=True)),
                      ('sgdc', SGDClassifier(loss='hinge', penalty='l2',
                                            alpha=4e-4, random_state=42,
                                            max_iter=40 )),
                         ])     
test_pipeline(p, verbose=False)


F1 = 0.843 
Accuracy = 0.852
time = 30.012 sec.

In [55]:
# Adapted from https://stackoverflow.com/questions/11116697/how-to-get-most-informative-features-for-scikit-learn-classifiers
def show_most_informative_features(vectorizer, clf, class_labels, n=50):
    # scikit-learn 1.0 and later provide get_feature_names_out() as the replacement for this call
    feature_names = vectorizer.get_feature_names()
    for row in range(clf.coef_.shape[0]):
        coefs_with_fns = sorted(zip(clf.coef_[row], feature_names))
        top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
        print( '\nclass = ' + class_labels[row])
        l = [[fn_1, coef_1,fn_2,coef_2]  for (coef_1, fn_1), (coef_2, fn_2) in top]
        df = pd.DataFrame(l, columns=['Smallest Word', 'Smallest Weight', 'Largest Word', 'Largest Weight'])
        display(df)
        
show_most_informative_features(p.named_steps['cvect'], p.named_steps['sgdc'], twenty_train.target_names)


class = alt.atheism
Smallest Word Smallest Weight Largest Word Largest Weight
0 christians -0.262594 keith 1.439411
1 rutgers edu -0.232721 atheists 1.271824
2 sandman caltech -0.230508 edu keith 1.146518
3 host sandman -0.229238 atheism 1.139742
4 rutgers -0.227094 livesey 0.918769
5 sandman -0.221514 schneider 0.881965
6 usa -0.219238 keith cco 0.858405
7 ca -0.216125 jaeger 0.855797
8 christ -0.207407 islamic 0.845568
9 mail -0.204733 islam 0.809818
10 thanks -0.194144 keith allan 0.799969
11 interested -0.182378 allan schneider 0.799969
12 usa lines -0.181814 benedikt 0.762978
13 help -0.172463 rushdie 0.752313
14 arc cco -0.165953 solntze wpd 0.731002
15 organization university -0.165137 solntze 0.731002
16 arc -0.157305 allan 0.722777
17 heaven -0.156941 political atheists 0.716436
18 athos -0.154498 okcforum 0.710658
19 athos rutgers -0.154498 wpd sgi 0.649272
20 christian -0.148664 wpd 0.649272
21 use -0.146382 osrhe 0.647602
22 distribution usa -0.142912 vice ico 0.628458
23 new -0.141217 ico tek 0.628458
24 ritvax -0.140880 god 0.627499
25 ritvax isc -0.140880 ico 0.616609
26 acs -0.137707 mozumder 0.614927
27 government -0.137602 kmr4 0.614037
28 article apr -0.137103 atheists organization 0.612605
29 arrogance -0.136612 gregg 0.612241
30 church -0.135456 jon livesey 0.605124
31 subject 2000 -0.135289 osrhe edu 0.588700
32 rochester -0.134716 okcforum osrhe 0.588700
33 lines nntp -0.134381 schneider subject 0.573327
34 year -0.134233 bobbe vice 0.571574
35 white -0.130156 bobbe 0.571574
36 gun -0.129868 caltech edu 0.564274
37 clh -0.129793 mangoe 0.563092
38 reply -0.127696 beauchaine 0.562624
39 graphics -0.127420 livesey solntze 0.562374
40 pro -0.127393 caltech 0.547609
41 told -0.126946 jaeger buphy 0.546331
42 data -0.126880 buphy bu 0.546331
43 truth -0.126806 buphy 0.546331
44 aaron -0.124190 tu bs 0.545316
45 buy -0.123875 i3150101 0.545316
46 weapons -0.123823 dbstu1 rz 0.545316
47 scripture -0.123695 dbstu1 0.545316
48 hell -0.123164 gregg jaeger 0.537279
49 want -0.122580 i3150101 dbstu1 0.537143
class = comp.graphics
Smallest Word Smallest Weight Largest Word Largest Weight
0 sale -0.248925 graphics 1.735588
1 windows -0.248667 3d 1.079136
2 monitor -0.217991 image 0.861527
3 window -0.197619 polygon 0.819904
4 mit -0.190565 tiff 0.793013
5 widget -0.176296 images 0.663772
6 win -0.174629 cview 0.656759
7 people -0.163595 pov 0.631182
8 list -0.158626 animation 0.602805
9 drive -0.158394 format 0.519557
10 mit edu -0.156234 comp graphics 0.512187
11 cica -0.150670 algorithm 0.497796
12 distribution -0.140442 files 0.492763
13 video card -0.134308 points 0.491094
14 school -0.134172 gif 0.489808
15 application -0.133454 package 0.477010
16 motif -0.133025 sphere 0.473042
17 server -0.132383 3do 0.457766
18 usa -0.131238 library 0.454194
19 card -0.130418 graphics library 0.453582
20 right -0.130028 vga 0.417271
21 key -0.128958 surface 0.405159
22 font -0.128804 routine 0.396083
23 motherboard -0.128709 subject tiff 0.395853
24 distribution usa -0.126410 newsgroup split 0.395765
25 monitors -0.123936 3d graphics 0.386984
26 nec -0.122411 viewer 0.380951
27 make -0.122140 24 bit 0.374144
28 message -0.120238 vesa 0.358538
29 really -0.119586 philosophical significance 0.352424
30 hp -0.118778 studio 0.350484
31 scsi -0.118023 tdawson 0.347543
32 modem -0.118012 42 0.335618
33 xterm -0.117902 jpeg 0.332758
34 memory -0.115604 algorithms 0.331362
35 new -0.114796 code 0.329442
36 x11r5 -0.114525 program 0.328883
37 questions -0.114125 computer graphics 0.316519
38 lcs mit -0.112023 split 0.314760
39 lcs -0.111929 subject cview 0.312441
40 widgets -0.111297 philosophical 0.312432
41 error -0.110961 tiff philosophical 0.312197
42 circuit -0.109990 significance 42 0.312197
43 space -0.108308 polygon routine 0.311376
44 controller -0.107864 aspects graphics 0.310915
45 car -0.107545 xv 0.307551
46 long -0.106565 3d studio 0.303794
47 norton -0.105883 caspian usc 0.303565
48 com -0.105569 polygons 0.301520
49 price -0.104208 24 0.297507
class = comp.os.ms-windows.misc
Smallest Word Smallest Weight Largest Word Largest Weight
0 sale -0.315937 windows 2.982981
1 motif -0.284700 file 0.848644
2 graphics -0.277000 ini 0.793745
3 mac -0.250074 cica 0.781741
4 board -0.214636 win 0.751385
5 scsi -0.207636 driver 0.734663
6 monitor -0.206667 drivers 0.662261
7 color -0.191345 win3 0.656588
8 image -0.190970 ms 0.631716
9 bus -0.186814 files 0.626170
10 unix -0.186793 dos 0.595812
11 floppy -0.180367 ms windows 0.568824
12 apple -0.168479 nt 0.511702
13 x11r5 -0.165597 subject windows 0.498173
14 ibm -0.164128 microsoft 0.477887
15 macintosh -0.162409 exe 0.457013
16 power -0.161449 windows organization 0.431226
17 following -0.160944 font 0.423543
18 xlib -0.160862 swap file 0.412667
19 comp windows -0.158583 printer 0.386935
20 time -0.158197 ftp 0.386923
21 ms dos -0.157416 bmp 0.383132
22 drive -0.157296 win ini 0.372668
23 interested -0.155825 diamond 0.364714
24 cview -0.154014 w4wg 0.359173
25 xterm -0.149959 desktop 0.358825
26 bit -0.148351 fonts 0.354241
27 isa -0.147073 program manager 0.353322
28 window manager -0.145817 ftp cica 0.352917
29 mpeg -0.143865 edu tw 0.346036
30 car -0.143617 apps 0.345799
31 viewer -0.142155 cica indiana 0.345217
32 x11 -0.142057 tw 0.343264
33 pcx -0.141835 using 0.341016
34 vesa -0.137888 bj 0.340315
35 sdsu -0.136676 ini files 0.329657
36 mit -0.135104 truetype 0.327812
37 10 -0.134868 manager 0.325475
38 gif -0.134016 latest 0.323598
39 formats -0.133913 subject win 0.323259
40 shipping -0.132435 swap 0.317746
41 openwindows -0.131865 win nt 0.317550
42 runs -0.131365 louray 0.317443
43 question -0.129966 utility 0.316380
44 lines nntp -0.129664 access 0.315626
45 sdsu edu -0.127962 norton 0.312211
46 long -0.126243 download 0.310784
47 se -0.123020 use 0.310535
48 software -0.122193 zip 0.306288
49 keyboard -0.122145 seas gwu 0.305211
class = comp.sys.ibm.pc.hardware
Smallest Word Smallest Weight Largest Word Largest Weight
0 mac -0.379717 ide 1.409218
1 sale -0.345942 controller 1.206842
2 apple -0.307799 bus 1.148549
3 windows -0.217368 scsi 0.970953
4 shipping -0.212027 isa 0.933777
5 internal -0.199717 vlb 0.780324
6 file -0.181048 bios 0.637390
7 brand new -0.178187 486 0.596254
8 iisi -0.171694 gateway 0.569682
9 items -0.167544 eisa 0.557024
10 latest -0.167369 motherboard 0.556187
11 graphics -0.162316 drive 0.554489
12 sun -0.153188 drives 0.551369
13 macintosh -0.152913 pc 0.522637
14 quadra -0.148348 isa bus 0.520204
15 window -0.143562 irq 0.498586
16 win3 -0.143011 card 0.495699
17 code -0.142029 local bus 0.491729
18 includes -0.140734 floppy 0.487182
19 space -0.138845 subject ide 0.478090
20 centris -0.138325 dma 0.454041
21 sale organization -0.137216 os 0.438606
22 color -0.136745 vs scsi 0.437615
23 car -0.135345 ide vs 0.437615
24 interested -0.135291 adaptec 0.432051
25 150 -0.134436 port 0.420597
26 driver -0.133105 boot 0.410292
27 best offer -0.132752 cmos 0.398034
28 lc -0.132335 settings 0.371923
29 offer -0.131659 monitors 0.359509
30 lciii -0.125270 board 0.358851
31 sell -0.122500 esdi 0.349988
32 edu -0.120479 jumpers 0.349597
33 files -0.118800 scsi controller 0.347016
34 power -0.118238 monitor 0.343010
35 subject windows -0.116013 dos 0.327310
36 iifx -0.114658 hd 0.326227
37 virginia -0.111283 dx2 0.320946
38 utilities -0.109607 harddisk 0.319506
39 internal drive -0.109555 dante nmsu 0.317269
40 ms windows -0.109310 ide controller 0.316091
41 display -0.108820 seagate 0.313257
42 000 -0.107160 scsi organization 0.311869
43 information -0.105917 vs 0.310886
44 se 30 -0.105826 disk 0.308125
45 looked -0.104528 valve heart 0.307642
46 mit -0.103172 rri uwo 0.307642
47 duo -0.103134 rri 0.307642
48 cs -0.103112 heart rri 0.307642
49 graphics card -0.103059 17 monitors 0.305329
class = comp.sys.mac.hardware
Smallest Word Smallest Weight Largest Word Largest Weight
0 windows -0.441898 mac 2.041394
1 sale -0.342317 apple 1.686945
2 ide -0.324485 quadra 1.263753
3 pc -0.290010 centris 1.200487
4 controller -0.287402 lc 0.951701
5 dos -0.277375 duo 0.898442
6 offer -0.207702 powerbook 0.837541
7 car -0.206371 lciii 0.788603
8 com -0.173308 iisi 0.779973
9 files -0.163073 c650 0.738654
10 bios -0.160484 simms 0.672638
11 condition -0.158353 610 0.659771
12 gateway -0.146756 vram 0.650616
13 graphics -0.145430 se 0.640326
14 includes -0.144365 centris 610 0.603946
15 old 256k -0.140535 nubus 0.591657
16 forsale -0.140051 fpu 0.564404
17 isa -0.138786 simm 0.563169
18 3d -0.138507 pds 0.561671
19 program -0.137868 adb 0.557583
20 mike -0.137478 macs 0.532054
21 apple com -0.136940 hades 0.514634
22 vlb -0.136382 040 0.457711
23 cs -0.135985 upgrade 0.435468
24 sandvik -0.135625 bmug 0.427254
25 type -0.133522 macintosh 0.425225
26 amiga -0.132956 se 30 0.422475
27 kent -0.130648 centris 650 0.418256
28 00 -0.130307 lc iii 0.414442
29 uucp -0.125630 monitor 0.405484
30 frame -0.125046 internal 0.395408
31 file -0.124133 650 0.389275
32 end -0.123662 scsi 0.381324
33 john -0.120111 powerpc 0.375989
34 486 -0.119271 subject quadra 0.360956
35 eisa -0.118764 clock 0.353483
36 asking -0.118329 nada kth 0.348820
37 adaptec -0.118231 quadra 800 0.341816
38 good -0.117954 slot 0.338758
39 os -0.116723 nada 0.338661
40 copy -0.112375 drive 0.336297
41 compaq -0.112335 ethernet 0.332064
42 god -0.112011 68040 0.328638
43 bike -0.111046 pb 0.323738
44 unix -0.110769 ram 0.313958
45 apple laserwriter -0.110738 machines 0.313670
46 ericsson -0.110566 pds slot 0.312209
47 ibm -0.110536 syquest 0.310015
48 sun -0.110302 mac ii 0.309108
49 local bus -0.110174 cable 0.301666
class = comp.windows.x
Smallest Word Smallest Weight Largest Word Largest Weight
0 dos -0.362704 motif 1.775618
1 mac -0.259959 window 1.740462
2 university -0.224300 x11r5 1.347464
3 card -0.216287 widget 1.279595
4 driver -0.206881 server 1.180182
5 good -0.203282 lcs mit 1.050715
6 athena mit -0.190738 lcs 1.048969
7 pc -0.168147 xterm 1.037401
8 printer -0.163036 expo lcs 1.025669
9 algorithm -0.159614 expo 1.008819
10 vga -0.158319 xpert 0.946528
11 pov -0.158016 mit 0.899334
12 usa -0.155522 window manager 0.877270
13 data -0.155363 xlib 0.867122
14 sale -0.154562 organization internet 0.861733
15 bbs -0.154494 xpert expo 0.847135
16 drive -0.154115 internet lines 0.835634
17 power -0.153397 application 0.826600
18 car -0.149635 enterpoop 0.672402
19 quadra -0.149372 enterpoop mit 0.662596
20 disk -0.148759 x11 0.653357
21 post -0.148603 widgets 0.642999
22 organization massachusetts -0.148396 host enterpoop 0.635055
23 monitor -0.148223 edu xpert 0.619388
24 distribution usa -0.147370 mit edu 0.604277
25 ai mit -0.147212 openwindows 0.583481
26 ai -0.145360 client 0.579953
27 massachusetts institute -0.145081 display 0.562440
28 years -0.143402 xt 0.547640
29 drivers -0.143216 r5 0.544012
30 chip -0.143111 pixmap 0.525120
31 small -0.141455 manager 0.512121
32 edu organization -0.141074 colormap 0.503765
33 apple -0.140136 clients 0.501120
34 images -0.138349 tu dresden 0.493931
35 files -0.137146 running 0.483813
36 massachusetts -0.136488 expose 0.481673
37 program manager -0.135871 code 0.475165
38 ca -0.135722 sun 0.470845
39 technology -0.134941 dresden 0.467343
40 tiff -0.133385 inf tu 0.460746
41 print -0.133167 mwm 0.449451
42 truetype -0.132521 xview 0.448274
43 word -0.129411 xdm 0.446943
44 scsi -0.128986 sunos 0.446736
45 board -0.128633 event 0.436659
46 3d -0.127373 internet 0.418717
47 picture -0.126694 lib 0.408139
48 technology lines -0.126683 beck 0.407041
49 james -0.125749 olwm 0.406992
class = misc.forsale
Smallest Word Smallest Weight Largest Word Largest Weight
0 writes -0.317536 sale 3.436651
1 help -0.304476 shipping 1.344154
2 does -0.289535 offer 1.311094
3 know -0.280317 forsale 1.087234
4 think -0.268504 condition 0.979972
5 thanks -0.256351 sale organization 0.898205
6 info -0.254473 best offer 0.873974
7 question -0.232185 asking 0.848756
8 just -0.212882 sell 0.734246
9 com -0.211879 make offer 0.695124
10 appreciated -0.200603 00 0.632339
11 bike -0.195627 interested 0.613650
12 article -0.192214 brand new 0.612971
13 problem -0.186661 obo 0.610971
14 using -0.185959 includes 0.526228
15 mac -0.180333 price 0.522129
16 time -0.179811 brand 0.493351
17 advance -0.173216 trade 0.470088
18 thanks advance -0.171076 excellent 0.469081
19 read -0.170269 offers 0.467000
20 ftp -0.160806 excellent condition 0.458888
21 run -0.159450 manuals 0.455352
22 ve -0.158736 email 0.434294
23 information -0.158183 hiram 0.427270
24 uk -0.157411 new 0.410821
25 heard -0.156603 stereo 0.406574
26 anybody -0.155429 original 0.390596
27 post -0.154718 items 0.389908
28 better -0.154131 games 0.370641
29 remember -0.150524 included 0.366704
30 se -0.149907 25 0.355985
31 recommend -0.148636 genesis 0.350733
32 news -0.146307 hiram edu 0.347349
33 ca -0.146250 manual 0.346961
34 say -0.145148 sega 0.339552
35 hp com -0.144636 distribution 0.334254
36 try -0.138278 forsale organization 0.333367
37 probably -0.136636 sale trade 0.327502
38 program -0.136584 cd 0.325240
39 hi -0.135890 wanted 0.317359
40 sure -0.134926 selling 0.314517
41 1993 -0.134482 contact 0.310048
42 don -0.132754 mail 0.309086
43 going -0.132739 best 0.306935
44 tell -0.132410 plus shipping 0.306826
45 data -0.131891 kou 0.305577
46 cnn -0.131432 douglas kou 0.305577
47 people -0.129769 used 0.301618
48 did -0.128862 camera 0.297479
49 dealer -0.128568 subject sale 0.295516
class = rec.autos
Smallest Word Smallest Weight Largest Word Largest Weight
0 bike -0.566461 car 2.704451
1 sale -0.285270 cars 1.858921
2 bikes -0.249484 engine 0.906552
3 dod -0.236707 dealer 0.842979
4 david -0.185771 automotive 0.720391
5 gun -0.178030 ford 0.684498
6 god -0.177929 callison 0.657060
7 guns -0.177296 oil 0.632864
8 motorcycle -0.175182 toyota 0.629774
9 card -0.169799 boyle 0.622813
10 team -0.164670 warning read 0.611300
11 mac -0.163937 subject warning 0.605882
12 pc -0.162375 dumbest 0.548768
13 game -0.161408 autos 0.543519
14 apple -0.155769 auto 0.517957
15 government -0.150223 dumbest automotive 0.515150
16 rider -0.141518 automotive concepts 0.507330
17 public -0.140775 sho 0.496273
18 radio -0.140703 eliot 0.487751
19 house -0.138416 frost 0.469956
20 ride -0.133703 unisql 0.457907
21 shaft -0.133044 warning 0.457647
22 yamaha -0.132802 centerline 0.457581
23 baseball -0.130083 rec autos 0.457334
24 asking -0.129630 wagon 0.451413
25 monitor -0.127358 subject dumbest 0.448474
26 ed -0.126569 centerline com 0.448218
27 machine -0.125961 dodge 0.445804
28 midway ecn -0.125909 concepts time 0.445543
29 condition -0.124424 new car 0.445538
30 uk -0.123403 saturn 0.428828
31 advance -0.120888 concepts 0.418998
32 video -0.119778 nissan 0.417425
33 hockey -0.117756 taurus 0.413592
34 dog -0.117534 james callison 0.408614
35 la -0.117507 jim frost 0.403838
36 use -0.117077 convertible 0.403720
37 order -0.117014 driving 0.403482
38 mode -0.116576 callison uokmax 0.395484
39 games -0.116479 jimf centerline 0.394862
40 windows -0.114812 jimf 0.394862
41 memory -0.114748 uokmax ecn 0.394628
42 sony -0.113444 uokmax 0.394628
43 book -0.113401 honda 0.391601
44 scsi -0.112871 vw 0.374545
45 control -0.112714 boyle cactus 0.370921
46 computing -0.111639 models 0.368860
47 nasa -0.111585 trunk 0.366876
48 utexas edu -0.110489 engr washington 0.365249
49 board -0.110376 chevy 0.361945
class = rec.motorcycles
Smallest Word Smallest Weight Largest Word Largest Weight
0 car -0.286296 bike 3.322988
1 windows -0.225016 dod 2.566536
2 does -0.189510 bikes 1.417203
3 cars -0.173699 motorcycle 1.392350
4 use -0.172006 ride 1.334104
5 gun -0.162740 riding 1.221409
6 believe -0.156443 bmw 0.991547
7 card -0.144651 rider 0.977306
8 space -0.142959 motorcycles 0.826216
9 auto -0.140948 helmet 0.811539
10 using -0.139521 ama 0.731285
11 mac -0.136020 dog 0.640129
12 information -0.133225 harley 0.627291
13 game -0.131732 honda 0.624566
14 team -0.131579 yamaha 0.595195
15 government -0.130072 behanna 0.569610
16 baseball -0.128850 egreen east 0.563789
17 radio -0.126334 egreen 0.563789
18 problem -0.122624 east sun 0.563203
19 american -0.122275 infante 0.554910
20 fan -0.121492 ranck 0.546244
21 scott -0.120340 moa 0.545869
22 boyle -0.120062 zx 0.535977
23 hockey -0.118793 biker 0.512128
24 microsystems lines -0.117895 hydro 0.510585
25 toyota -0.117440 ed green 0.502827
26 fax -0.115135 riders 0.496290
27 christian -0.115006 hydro ca 0.481503
28 price -0.113900 dogs 0.465928
29 jesus -0.113470 nj nec 0.465131
30 power -0.113262 drinking 0.457907
31 chip -0.113260 eskimo com 0.453305
32 year -0.113209 speedy 0.453300
33 season -0.112759 countersteering 0.450573
34 god -0.110527 levine 0.439081
35 na -0.110237 eskimo 0.434208
36 support -0.109762 stafford 0.430404
37 does know -0.108036 sun com 0.424932
38 conditioning -0.107715 jody levine 0.424040
39 graphics -0.107148 shaft 0.418007
40 nissan -0.106932 jody 0.405544
41 distribution na -0.106755 insurance 0.403480
42 games -0.106543 nec com 0.402403
43 reply -0.106294 winona 0.401970
44 control -0.104614 pettefar 0.400739
45 pc -0.104598 ed 0.397985
46 data -0.104284 chris behanna 0.397672
47 box -0.103805 npet bnr 0.392363
48 memory -0.103360 npet 0.392363
49 boston -0.103121 nick pettefar 0.392363
class = rec.sport.baseball
Smallest Word Smallest Weight Largest Word Largest Weight
0 hockey -0.599129 baseball 1.982992
1 nhl -0.501949 pitching 1.215024
2 playoffs -0.313111 braves 1.127552
3 cup -0.291172 phillies 1.073558
4 sale -0.265599 runs 1.035096
5 goal -0.261501 cubs 0.882127
6 playoff -0.260685 mets 0.856105
7 leafs -0.260215 sox 0.848817
8 pens -0.256636 players 0.822357
9 penguins -0.247154 hit 0.815258
10 goals -0.246488 year 0.815014
11 ice -0.242504 hitter 0.805832
12 wings -0.242452 team 0.718800
13 windows -0.230931 alomar 0.716358
14 car -0.224364 jays 0.693386
15 flyers -0.203229 ball 0.693194
16 people -0.199597 jewish baseball 0.682621
17 devils -0.194810 yankees 0.674261
18 dod -0.184558 baseball players 0.638860
19 want -0.180850 games 0.635803
20 bruins -0.179437 pitcher 0.616015
21 space -0.171028 season 0.605661
22 ca -0.165528 rbi 0.590167
23 puck -0.163944 jewish 0.587952
24 contact -0.163914 rockies 0.582167
25 problem -0.162610 subject jewish 0.581473
26 karr -0.160960 stadium 0.570149
27 gun -0.160467 dodgers 0.550324
28 goalie -0.160116 win 0.549610
29 god -0.159886 game 0.539143
30 email -0.159301 red sox 0.508905
31 stanley -0.156141 pitchers 0.508403
32 buy -0.155195 yankee 0.499444
33 lemieux -0.154946 batting 0.499180
34 penalty -0.152949 tigers 0.492961
35 program -0.151645 nl 0.491502
36 state -0.151555 ted 0.483163
37 bike -0.146000 pitch 0.482258
38 file -0.143852 hr 0.481884
39 interested -0.143689 alleg 0.472645
40 using -0.142661 baerga 0.472381
41 stanley cup -0.142555 league 0.469714
42 christian -0.141428 morris 0.464239
43 computer -0.141296 cs cornell 0.462765
44 quebec -0.141273 journalism indiana 0.462645
45 espn -0.139897 orioles 0.461480
46 uk -0.138402 tedward 0.456461
47 round -0.138221 tedward cs 0.454666
48 called -0.137844 ted fischer 0.450943
49 sharks -0.137289 edward ted 0.450943
class = rec.sport.hockey
Smallest Word Smallest Weight Largest Word Largest Weight
0 pitching -0.503812 hockey 2.745056
1 runs -0.397354 nhl 2.031628
2 com -0.297169 team 1.614400
3 phillies -0.297097 game 1.222137
4 run -0.281868 playoff 1.198517
5 mets -0.273202 playoffs 1.165680
6 use -0.264522 leafs 1.160847
7 sox -0.259872 cup 1.074502
8 sale -0.259185 play 1.056705
9 braves -0.251448 devils 1.054775
10 thanks -0.233531 wings 1.007565
11 nl -0.229535 espn 0.915872
12 windows -0.222824 detroit 0.913727
13 tigers -0.221776 rangers 0.911138
14 work -0.212930 season 0.901455
15 baseball -0.210562 players 0.882939
16 edu -0.208619 pens 0.871250
17 rockies -0.208497 penguins 0.859462
18 jays -0.201134 islanders 0.839436
19 innings -0.195829 coach 0.810762
20 hit -0.190635 stanley 0.809702
21 health -0.189913 toronto 0.781234
22 pitcher -0.189833 stanley cup 0.776259
23 drive -0.189027 ice 0.776044
24 orioles -0.188800 teams 0.775791
25 using -0.185575 lemieux 0.761445
26 cubs -0.184513 goals 0.753313
27 ball -0.183012 gerald 0.742641
28 reds -0.182822 bruins 0.736018
29 yankees -0.181461 goal 0.698247
30 god -0.181432 pittsburgh 0.690573
31 used -0.178879 ca 0.670671
32 ibm -0.178403 maynard 0.669196
33 price -0.173524 abc 0.657951
34 rbi -0.172838 flyers 0.650510
35 help -0.172739 montreal 0.626113
36 hitter -0.171543 player 0.622925
37 distribution -0.170394 goalie 0.605528
38 dodgers -0.169857 coverage 0.604294
39 card -0.169183 puck 0.585689
40 red sox -0.168037 ramsey 0.583423
41 tax -0.166739 league 0.582566
42 base -0.165461 finals 0.566472
43 stadium -0.165115 win 0.552868
44 manager -0.164440 traded 0.552262
45 looking -0.162689 chem utoronto 0.551491
46 pitched -0.162649 hawks 0.550538
47 bullpen -0.162617 alchemy 0.550032
48 does -0.158034 alchemy chem 0.543425
49 world series -0.156910 utoronto ca 0.543121
class = sci.crypt
Smallest Word Smallest Weight Largest Word Largest Weight
0 thanks -0.255692 clipper 2.996278
1 help -0.248558 encryption 2.512530
2 keyboard -0.240424 key 2.268067
3 windows -0.234923 chip 1.725607
4 gun -0.219906 clipper chip 1.559344
5 problem -0.192000 keys 1.368082
6 guns -0.191063 nsa 1.353249
7 problems -0.186570 escrow 1.301817
8 window -0.183966 crypto 1.258073
9 god -0.181335 secret 1.023378
10 apple -0.177214 pgp 1.017468
11 distribution usa -0.176762 gtoal 1.015017
12 email -0.176095 security 0.995122
13 car -0.174092 secure 0.982704
14 waco -0.171591 encrypted 0.982152
15 info -0.169386 des 0.955641
16 motif -0.167728 tapped 0.897075
17 batf -0.160759 tapped code 0.868792
18 hp -0.160049 code good 0.864205
19 scsi -0.159458 wiretap 0.851041
20 image -0.154544 algorithm 0.843944
21 cb -0.153680 subject tapped 0.843709
22 memory -0.153456 cryptography 0.831226
23 cb att -0.153022 government 0.826672
24 card -0.146347 privacy 0.800161
25 home -0.145396 public key 0.799333
26 power -0.142330 subject clipper 0.774841
27 address -0.139914 eff 0.707552
28 sale -0.138669 announcement 0.700055
29 ctrl -0.137723 rsa 0.663111
30 graphics -0.137238 toal 0.624819
31 mac -0.135983 graham toal 0.624819
32 pc -0.133167 code 0.623301
33 looking -0.131588 public 0.610487
34 book -0.131475 gtoal gtoal 0.610321
35 scott -0.130827 gtoal com 0.610321
36 driver -0.129492 key escrow 0.596081
37 said -0.129159 denning 0.591515
38 dc -0.128865 phone 0.589490
39 israel -0.128401 classified 0.588888
40 chris -0.128212 qualcomm 0.581876
41 win -0.125422 sternlight 0.580894
42 intel -0.123691 scheme 0.572859
43 mouse -0.123439 david sternlight 0.570591
44 frank -0.123178 phones 0.563682
45 area -0.122326 white house 0.554053
46 display -0.120349 bits 0.530820
47 sun -0.119753 brad 0.521718
48 year -0.119710 crypt 0.512859
49 xterm -0.119239 bontchev 0.505843
class = sci.electronics
Smallest Word Smallest Weight Largest Word Largest Weight
0 windows -0.311488 circuit 1.056166
1 sale -0.273852 voltage 0.733122
2 clipper -0.203362 electronics 0.652888
3 mac -0.188981 amp 0.581651
4 space -0.175729 circuits 0.561164
5 graphics -0.168297 power 0.505019
6 shipping -0.166350 audio 0.474846
7 did -0.148983 current 0.463145
8 access -0.148443 cooling 0.446665
9 year -0.142876 radar 0.437066
10 government -0.141965 detector 0.429100
11 engine -0.141007 ground 0.420204
12 encryption -0.136890 wire 0.409583
13 bike -0.136380 cooling towers 0.402634
14 motherboard -0.135493 towers 0.400315
15 card -0.133813 number phone 0.398327
16 drive -0.133782 use 0.388118
17 file -0.132637 line 0.387387
18 cs -0.130776 electrical 0.378410
19 asking -0.130558 babb 0.369220
20 months -0.126755 motorola 0.368604
21 forsale -0.126702 output 0.366988
22 driver -0.126119 outlet 0.362812
23 window -0.124397 device 0.362767
24 orbit -0.122202 detector detectors 0.360113
25 motif -0.119381 receiver 0.359438
26 monitor -0.119290 wiring 0.354431
27 scsi -0.118519 radio 0.350346
28 color -0.117391 detectors 0.346530
29 dos -0.117299 tv 0.345803
30 modem -0.116601 rf 0.341801
31 server -0.115994 need number 0.336502
32 think -0.115592 8051 0.335419
33 news -0.115398 neoucom 0.329285
34 memory -0.114647 number line 0.327816
35 486 -0.114190 ee 0.327052
36 god -0.112164 grissom larc 0.324003
37 new -0.111509 grissom 0.321098
38 algorithm -0.111352 low 0.319040
39 steve -0.111241 signal 0.316684
40 dod -0.107977 radar detector 0.316674
41 send -0.107632 signals 0.315270
42 format -0.107159 build 0.314854
43 person -0.106951 kolstad 0.314739
44 floppy -0.106858 phone 0.310411
45 clipper chip -0.106828 old 256k 0.309202
46 team -0.105403 256k simms 0.307771
47 james -0.105359 design 0.304236
48 programs -0.104793 resistor 0.302913
49 program -0.104208 input 0.301473
class = sci.med
Smallest Word Smallest Weight Largest Word Largest Weight
0 god -0.308393 msg 1.391552
1 space -0.219429 gordon banks 1.134642
2 car -0.208842 geb 1.087958
3 government -0.202934 disease 1.053067
4 power -0.190390 doctor 1.033087
5 drive -0.184275 cs pitt 1.025371
6 christians -0.174080 banks 1.011849
7 windows -0.166031 gordon 0.951553
8 gun -0.164528 geb cs 0.914965
9 cis -0.162413 pitt 0.887279
10 graphics -0.155481 edu gordon 0.886129
11 org -0.151248 dyer 0.852144
12 christian -0.151072 pitt edu 0.819541
13 team -0.148499 patients 0.806629
14 game -0.146758 medical 0.771834
15 bible -0.144245 food 0.763348
16 earth -0.142148 medicine 0.736737
17 card -0.141680 diet 0.731859
18 jesus -0.141456 treatment 0.727953
19 bike -0.139986 foods 0.711181
20 code -0.138682 sensitivity 0.689626
21 religion -0.133725 syndrome 0.652705
22 software -0.133340 msg sensitivity 0.647112
23 tv -0.132018 sensitivity superstition 0.635233
24 church -0.131194 superstition 0.626830
25 list -0.130082 symptoms 0.622026
26 sale -0.129283 spdcc 0.608653
27 man -0.127012 subject msg 0.589249
28 dod -0.126707 health 0.555455
29 clinton -0.126026 cancer 0.529851
30 cis pitt -0.124664 spdcc com 0.528743
31 win -0.120584 chinese 0.528719
32 ac -0.117899 steve dyer 0.520260
33 police -0.116742 patient 0.518024
34 math -0.115484 yeast 0.514193
35 usa -0.115059 pain 0.513593
36 coverage -0.114358 effects 0.489690
37 law -0.114015 seizures 0.478036
38 pc -0.113967 candida 0.461372
39 engineering -0.113634 physician 0.448191
40 shipping -0.113519 univ pittsburgh 0.432932
41 video -0.112646 pittsburgh computer 0.432932
42 cars -0.112641 homeopathy 0.431116
43 gas -0.111112 skepticism chastity 0.430920
44 didn -0.110731 n3jxp skepticism 0.430920
45 games -0.110437 n3jxp 0.430920
46 parts -0.110218 chastity intellect 0.430920
47 rutgers -0.109387 banks n3jxp 0.430920
48 created -0.109305 chastity 0.430039
49 chip -0.109268 skepticism 0.429682
class = sci.space
Smallest Word Smallest Weight Largest Word Largest Weight
0 windows -0.329683 space 2.737442
1 steve -0.290013 orbit 1.507993
2 greenbelt md -0.274327 moon 1.435014
3 md usa -0.261338 launch 1.175230
4 communications greenbelt -0.252275 shuttle 1.054835
5 distribution usa -0.244397 henry 1.049330
6 chip -0.242633 nasa 1.026781
7 greenbelt -0.242195 pat 1.016211
8 sale -0.234942 alaska 0.997669
9 car -0.232567 prb access 0.961597
10 god -0.218735 prb 0.961597
11 good -0.217626 com pat 0.934431
12 code -0.206001 lunar 0.930392
13 md -0.201529 alaska edu 0.918956
14 file -0.196365 spacecraft 0.884275
15 cc -0.189775 aurora 0.820603
16 best -0.189436 zoo toronto 0.819396
17 clinton -0.187100 nsmca 0.804579
18 hp -0.165996 zoo 0.798657
19 usa -0.165993 earth 0.764267
20 ve -0.165896 sci 0.748000
21 card -0.161573 henry spencer 0.691288
22 thanks -0.161558 baalke 0.685424
23 graphics -0.159680 henry zoo 0.681170
24 brinich -0.156961 space station 0.670813
25 steve access -0.156961 aurora alaska 0.668944
26 steve brinich -0.156961 access digex 0.653567
27 com steve -0.153574 flight 0.644492
28 andrew -0.153570 digex 0.638069
29 run -0.152757 solar 0.635337
30 law -0.152483 subject space 0.632624
31 jason -0.144077 kelvin jpl 0.620839
32 want -0.143859 spencer 0.614056
33 brian -0.143520 toronto edu 0.598772
34 opinions -0.143468 funding 0.598246
35 brinich subject -0.142943 distribution sci 0.585037
36 game -0.142251 mars 0.583079
37 mail -0.141279 edu henry 0.582750
38 season -0.141062 kelvin 0.580679
39 number -0.140707 mccall 0.576526
40 color -0.139889 sci space 0.573207
41 encryption -0.139528 pat subject 0.570772
42 cars -0.138169 nsmca aurora 0.560148
43 disk -0.137070 orbital 0.552450
44 state -0.135425 station 0.533392
45 ms -0.134902 acad3 alaska 0.529022
46 does -0.134021 acad3 0.529022
47 summary -0.133050 comet 0.521670
48 netcom com -0.131623 billion 0.519471
49 andrew cmu -0.131307 satellite 0.507340
class = soc.religion.christian
Smallest Word Smallest Weight Largest Word Largest Weight
0 nntp -0.478942 god 1.805293
1 nntp posting -0.477601 rutgers edu 1.625885
2 posting host -0.477601 article apr 1.518949
3 host -0.469325 christians 1.448327
4 posting -0.440066 rutgers 1.446445
5 morality -0.360335 athos rutgers 1.429535
6 distribution -0.351899 athos 1.429535
7 kaldis -0.345950 christ 1.266615
8 atheism -0.281448 church 1.200873
9 brian -0.250321 jesus 1.052535
10 romulus rutgers -0.250041 clh 0.942558
11 romulus -0.240821 geneva rutgers 0.896292
12 com -0.232132 faith 0.887455
13 keywords -0.214753 geneva 0.881115
14 theodore kaldis -0.211939 christian 0.862696
15 islam -0.211866 bible 0.839950
16 sandvik -0.211291 1993 0.832347
17 apr 1993 -0.208587 christianity 0.795422
18 american -0.202822 apr 0.773233
19 distribution usa -0.196376 scripture 0.703437
20 theodore -0.192093 heaven 0.690204
21 malcolm -0.191463 sin 0.678748
22 using -0.191385 catholic 0.677958
23 lds -0.190771 resurrection 0.661965
24 host aisun3 -0.190653 petch 0.644785
25 apple -0.190322 arrogance 0.622870
26 malcolm lee -0.190060 hell 0.587978
27 keith -0.189473 03 0.559302
28 newsreader -0.188690 jayne 0.538981
29 robert -0.187946 romans 0.527868
30 news -0.185091 fisher 0.526401
31 usa -0.185061 lord 0.475500
32 uucp -0.183387 kulikauskas 0.474620
33 edu kaldis -0.182369 married 0.472520
34 weiss -0.180023 truth 0.468538
35 robert weiss -0.179193 easter 0.468126
36 distribution world -0.178996 verse 0.462723
37 buffalo -0.178983 arrogance christians 0.443106
38 gmt -0.178545 believe 0.427412
39 benedikt -0.175737 revelation 0.418598
40 koresh -0.175436 petch gvg47 0.416763
41 apple com -0.175034 gvg47 gvg 0.416763
42 government -0.174643 gvg47 0.416763
43 version -0.170642 accepting 0.408514
44 majority -0.166478 subject daily 0.407767
45 remus rutgers -0.165933 daily verse 0.407767
46 remus -0.164602 subject arrogance 0.403420
47 royalroads -0.164564 question 0.402095
48 royalroads ca -0.164564 mmalt guild 0.401333
49 tin -0.161246 mmalt 0.401333
class = talk.politics.guns
Smallest Word Smallest Weight Largest Word Largest Weight
0 clipper -0.222678 gun 2.579640
1 encryption -0.215469 guns 1.750272
2 mail -0.188852 firearms 1.229257
3 chip -0.185588 batf 1.070268
4 israel -0.170434 waco 1.054974
5 key -0.160673 atf 0.987423
6 radar -0.157711 weapons 0.987191
7 new -0.155324 gun control 0.936558
8 israeli -0.153292 fbi 0.913826
9 program -0.145553 handgun 0.762626
10 turkish -0.145435 subject gun 0.717268
11 speed -0.143815 cdt 0.709990
12 space -0.138044 weapon 0.672084
13 reason -0.136117 firearm 0.656523
14 death penalty -0.133639 ranch 0.649816
15 president know -0.131840 survivors 0.642186
16 data -0.131838 dividian 0.608275
17 jesus -0.131204 handheld com 0.606538
18 car -0.130628 dividian ranch 0.603888
19 warning -0.129894 ifas ufl 0.599772
20 mit -0.128916 ifas 0.599772
21 message mr -0.128690 gnv ifas 0.599772
22 god -0.128398 gnv 0.595359
23 play -0.127494 subject atf 0.582316
24 subject message -0.126755 burns dividian 0.576486
25 game -0.126469 atf burns 0.576486
26 cryptography -0.126197 control 0.573298
27 sandvik -0.124592 handheld 0.572013
28 happened organization -0.124249 nra 0.566508
29 virginia -0.123857 compound 0.564035
30 windows -0.121348 kratz 0.563989
31 mit edu -0.120890 feustel 0.549113
32 14 -0.120116 sw stratus 0.536962
33 warning read -0.119787 tavares 0.534152
34 org -0.119494 arms 0.531348
35 subject warning -0.119405 stratus com 0.517930
36 se -0.118506 criminals 0.513595
37 corporation -0.118429 rkba 0.512320
38 tax -0.118302 ranch survivors 0.506095
39 atheists -0.118159 com tavares 0.505799
40 team -0.117957 cdt sw 0.505799
41 code -0.117929 stratus 0.503989
42 christian -0.117311 batf fbi 0.503198
43 games -0.117150 gun like 0.497802
44 president -0.117068 handguns 0.497027
45 health care -0.116405 crime 0.494372
46 keys -0.116015 sw 0.494356
47 message -0.115645 frank crary 0.492476
48 cs colorado -0.114984 crary 0.492476
49 ma -0.114734 burns 0.490712
class = talk.politics.mideast
Smallest Word Smallest Weight Largest Word Largest Weight
0 god -0.350832 israel 2.761935
1 jesus -0.274825 israeli 2.527672
2 good -0.251022 turkish 1.493652
3 thanks -0.232954 arab 1.340593
4 bible -0.209399 armenians 1.246517
5 christian -0.201946 jews 1.231433
6 christ -0.200742 armenian 1.226403
7 gun -0.196413 armenia 1.169814
8 uk -0.192723 arabs 1.072722
9 ve -0.178288 turks 0.972496
10 distribution -0.172574 serdar 0.970465
11 space -0.170639 turkey 0.937920
12 game -0.165605 argic 0.932473
13 ll -0.161409 israelis 0.927347
14 jim -0.160812 soldiers 0.883230
15 fbi -0.159531 cpr 0.818071
16 little -0.155570 policy 0.810203
17 waco -0.152306 center policy 0.776000
18 lord -0.151701 serdar argic 0.763582
19 data -0.151387 policy research 0.762256
20 thing -0.149069 jewish 0.732222
21 distribution usa -0.148496 subject israeli 0.728123
22 new -0.145157 lebanese 0.728118
23 pretty -0.144219 occupied 0.687469
24 guns -0.143545 lebanon 0.666627
25 mouse -0.142456 palestinian 0.663780
26 law -0.141472 palestinians 0.652173
27 best -0.140601 holocaust 0.649212
28 james -0.140366 igc 0.641495
29 windows -0.139020 palestine 0.627737
30 doesn -0.137142 peace 0.622758
31 available -0.135942 sdpa 0.622098
32 mormons -0.132159 igc apc 0.620060
33 christianity -0.131594 apc org 0.620060
34 version -0.131466 apc 0.618195
35 michael -0.131115 nysernet 0.608043
36 looking -0.130784 jake 0.582760
37 canada -0.129731 urartu 0.579178
38 net -0.129338 zuma 0.575633
39 batf -0.127972 villages 0.567328
40 faith -0.127897 zuma uucp 0.565462
41 sex -0.126974 sdpa org 0.562028
42 apple -0.124563 urartu sdpa 0.554725
43 sun -0.124421 cyprus 0.540212
44 need -0.124299 hezbollah 0.537373
45 win -0.123517 cosmo 0.536988
46 edu david -0.122965 uucp serdar 0.535759
47 probably -0.122663 sera zuma 0.535759
48 use -0.122423 sera 0.535759
49 matthew -0.121841 occupation 0.530713
class = talk.politics.misc
Smallest Word Smallest Weight Largest Word Largest Weight
0 gun -0.240749 cramer 1.279590
1 clipper -0.210502 optilink 1.075916
2 car -0.208287 clayton 0.925913
3 team -0.199598 kaldis 0.910310
4 guns -0.193330 clayton cramer 0.872607
5 chip -0.187127 optilink com 0.799627
6 god -0.172919 gay 0.784267
7 christian -0.169253 clinton 0.779250
8 israeli -0.159803 com clayton 0.653406
9 israel -0.158112 cramer optilink 0.631128
10 christians -0.153255 tax 0.627427
11 encryption -0.151469 isc br 0.593709
12 technology -0.144965 br 0.537165
13 thanks -0.144391 theodore kaldis 0.530644
14 waco -0.141359 br com 0.508279
15 religion -0.140141 steveh 0.506406
16 control -0.139926 thor isc 0.504580
17 killed -0.132962 new study 0.468353
18 jesus -0.132649 romulus rutgers 0.466554
19 phone -0.130550 steveh thor 0.464326
20 batf -0.129394 steve hendricks 0.464326
21 john -0.124590 hendricks 0.463630
22 jim -0.123446 health care 0.463058
23 space -0.120510 sexual 0.455044
24 stated -0.120082 homosexual 0.451504
25 clipper chip -0.119716 romulus 0.443966
26 info -0.117185 drugs 0.441151
27 feustel -0.117025 cramer writes 0.440658
28 science -0.116493 theodore 0.431780
29 brad -0.115651 isc 0.431360
30 fbi -0.115404 jobs 0.429552
31 key -0.115277 homosexuals 0.428784
32 jews -0.111757 president 0.412206
33 buy -0.111501 thor 0.411450
34 research -0.110619 health 0.403253
35 christianity -0.109087 ipser 0.400499
36 baseball -0.107713 deficit 0.394536
37 matter -0.106788 atlantaga ncr 0.387261
38 home -0.106209 atlantaga 0.387261
39 host magnus -0.105333 gay percentage 0.385092
40 firearms -0.104333 kaldis romulus 0.384204
41 robert -0.102893 consent 0.383365
42 end -0.101495 government 0.382534
43 interested -0.101170 concentrate 0.377172
44 arms -0.101056 study gay 0.375971
45 jewish -0.100878 ncratl atlantaga 0.366063
46 drive -0.099840 ncratl 0.366063
47 lines article -0.098429 men 0.356346
48 koresh -0.098307 free 0.351063
49 reno -0.098265 taxes 0.349464
class = talk.religion.misc
Smallest Word Smallest Weight Largest Word Largest Weight
0 atheists -0.195706 sandvik 0.585719
1 rutgers edu -0.191471 christian 0.545276
2 rutgers -0.190287 robert weiss 0.475192
3 thanks -0.163467 weiss 0.463496
4 clh -0.162012 ch981 cleveland 0.459477
5 free -0.152926 ch981 0.459477
6 need -0.152154 kent 0.459302
7 problem -0.139410 koresh 0.450443
8 article apr -0.137967 royalroads ca 0.446104
9 1993 -0.137923 royalroads 0.446104
10 atheist -0.135408 tony alicea 0.433298
11 genocide -0.123902 rosicrucian 0.432065
12 hell -0.119854 morality 0.429734
13 caltech edu -0.115136 god promise 0.419108
14 caltech -0.114485 alicea 0.408533
15 geneva rutgers -0.112430 malcolm lee 0.407051
16 geneva -0.111645 jesus 0.396638
17 cobb -0.105518 malcolm 0.395671
18 schneider -0.104465 93 god 0.391050
19 atheism -0.104245 psyrobtw 0.371880
20 new -0.100533 order 0.365398
21 idea -0.100353 edu tony 0.357811
22 assumption -0.098130 biblical 0.354625
23 keith -0.096336 sandvik kent 0.347946
24 year -0.095172 kent apple 0.347946
25 islamic -0.093796 kendig 0.340709
26 athos -0.093756 brian kendig 0.340709
27 athos rutgers -0.093756 rosicrucian order 0.339644
28 clinton -0.093643 post royalroads 0.338062
29 keith cco -0.093065 mlee post 0.338062
30 wwc -0.092543 quack kfu 0.329175
31 general -0.091737 kfu com 0.329175
32 saturn wwc -0.091722 kfu 0.327360
33 wwc edu -0.091722 god 0.327133
34 information -0.091504 mlee 0.323950
35 bu edu -0.090913 promise 0.319546
36 uk -0.090687 joslin 0.316207
37 allan schneider -0.090648 apple com 0.314847
38 keith allan -0.090648 subject 2000 0.313779
39 perfect -0.090471 sandvik newton 0.312501
40 gov -0.090247 newton apple 0.312501
41 university -0.089358 ca malcolm 0.308892
42 belief -0.089247 kent sandvik 0.303535
43 bu -0.088978 article sandvik 0.302410
44 work -0.088664 biblical backing 0.299272
45 03 -0.088466 meritt 0.298930
46 valuable -0.088403 christian morality 0.298559
47 mike -0.088218 brian 0.296168
48 allan -0.088208 com sandvik 0.293057
49 free moral -0.087457 bskendig netcom 0.292565
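
The tables above pair, for each class, the features with the most negative weights and the features with the most positive weights. Below is a minimal sketch of how such a table could be assembled from a fitted linear classifier. It is not the notebook's show_most_informative_features function: the helper name weight_table, the toy documents, and the labels are hypothetical and only illustrate the idea of sorting one row of coef_.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier


def weight_table(vectorizer, clf, class_index, n=3):
    """Hypothetical helper: n smallest and n largest weights for one class."""
    # Order the vocabulary by column index so names line up with coef_ columns.
    vocab = vectorizer.vocabulary_
    names = np.array(sorted(vocab, key=vocab.get))
    coefs = clf.coef_[class_index]
    order = np.argsort(coefs)      # ascending by weight
    smallest = order[:n]           # most negative first
    largest = order[::-1][:n]      # most positive first
    return pd.DataFrame({
        "Smallest Word": names[smallest],
        "Smallest Weight": coefs[smallest],
        "Largest Word": names[largest],
        "Largest Weight": coefs[largest],
    })


# Toy data (an assumption for illustration): three classes, two documents each.
docs = [
    "the bike ride was fun", "my bike has a new helmet",
    "the car engine needs oil", "a car dealer sold the car",
    "the pitcher threw the ball", "baseball pitcher hit a home run",
]
labels = [0, 0, 1, 1, 2, 2]

vect = CountVectorizer()
X = vect.fit_transform(docs)
clf = SGDClassifier(loss="hinge", penalty="l2", random_state=42, max_iter=40)
clf.fit(X, labels)

table = weight_table(vect, clf, class_index=0)
print(table)
```

With three or more classes, coef_ holds one row of weights per class, which is why each class gets its own table. A large positive weight pushes a document toward the class and a large negative weight pushes it away.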

In [56]:
# Same linear SVM pipeline as before, but the features are character 5-grams
# instead of word tokens.
p = Pipeline([
    ('cvect', CountVectorizer(analyzer='char', ngram_range=(5, 5),
                              max_df=0.88, min_df=1)),
    ('tfidf', TfidfTransformer(sublinear_tf=True)),
    ('sgdc', SGDClassifier(loss='hinge', penalty='l2',
                           alpha=4e-4, random_state=42,
                           max_iter=40)),
])
test_pipeline(p, verbose=False)
# Show the most negative and most positive 5-gram weights for each newsgroup.
show_most_informative_features(p.named_steps['cvect'], p.named_steps['sgdc'], twenty_train.target_names)


F1 = 0.841 
Accuracy = 0.851
time = 98.776 sec.
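
The "words" in the tables below are 5-character windows, not whole words, because the vectorizer uses analyzer='char' with ngram_range=(5,5). The following is a small sketch of how those windows are produced. The sample string is an assumption chosen for illustration, and the output is printed with repr so that spaces stay visible.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same tokenization settings as the pipeline above.
vect = CountVectorizer(analyzer='char', ngram_range=(5, 5))
analyze = vect.build_analyzer()

# Hypothetical sample text; the analyzer lowercases it before slicing.
grams = analyze("The Christian")
for g in grams:
    print(repr(g))
```

Every feature is exactly five characters long, and spaces and newlines count as characters. Entries that look shorter in the tables, such as "chri" or "usa\n", most likely carry a leading or trailing space that the table display does not show. This is also why the same visible text can appear twice in one column.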

class = alt.atheism
Smallest Word Smallest Weight Largest Word Largest Weight
0 hrist -0.194710 theis 1.106866
1 ians -0.178771 athei 1.072342
2 stian -0.153789 athe 0.864802
3 istia -0.151108 keith 0.824156
4 s.edu -0.150853 heist 0.808652
5 risti -0.149580 islam 0.724374
6 gers. -0.149477 eists 0.701778
7 .rutg -0.143008 heism 0.632027
8 ers.e -0.143003 isla 0.591572
9 rs.ed -0.143003 keit 0.554473
10 rutge -0.140693 u (ke 0.524833
11 tgers -0.140693 (keit 0.521565
12 utger -0.140693 (kei 0.513267
13 chri -0.138795 vesey 0.511749
14 ===== -0.137284 ivese 0.511749
15 tians -0.136190 bobb 0.510663
16 n.cal -0.133942 alla 0.509364
17 chris -0.132989 eith 0.507646
18 tian -0.132534 jaege 0.493277
19 dman. -0.130011 h@cco 0.483662
20 of c -0.129564 th@cc 0.481903
21 sandm -0.129428 aeger 0.480952
22 man.c -0.127061 hneid 0.479919
23 1993. -0.125907 eider 0.479919
24 on: u -0.125689 du (k 0.478513
25 an.ca -0.124556 neide 0.476685
26 ndman -0.123557 schne 0.474441
27 hat c -0.123154 chnei 0.471840
28 t: sa -0.123140 slami 0.468545
29 .1993 -0.121659 jaeg 0.467804
30 f chr -0.121652 ith@c 0.466925
31 andma -0.117787 lamic 0.465649
32 of ch -0.117505 eith@ 0.455297
33 hell -0.117165 nedik 0.446001
34 want -0.114513 enedi 0.446001
35 du (a -0.111676 edikt 0.446001
36 in h -0.109096 bened 0.446001
37 s pro -0.105030 l ath 0.445724
38 s.rut -0.104740 schn 0.437340
39 usa\n -0.104656 ushdi 0.431082
40 trea -0.104591 shdie 0.431082
41 ruth -0.104466 rushd 0.429657
42 ture -0.104398 ider) 0.421163
43 use t -0.104200 eism 0.418629
44 cult -0.104098 ze.wp 0.408721
45 hell -0.103903 tze.w 0.408721
46 can -0.103502 solnt 0.408721
47 c@cco -0.102430 olntz 0.408721
48 n we -0.102389 ntze. 0.408721
49 l.com -0.102057 lntze 0.408721
class = comp.graphics
Smallest Word Smallest Weight Largest Word Largest Weight
0 wind -0.210219 phics 0.889178
1 windo -0.199793 aphic 0.826278
2 indow -0.199399 raphi 0.792678
3 crypt -0.147027 hics 0.700232
4 idget -0.137149 graph 0.697049
5 widge -0.134665 grap 0.688807
6 sale -0.131028 image 0.640641
7 dows -0.130920 imag 0.635767
8 ndows -0.128106 olygo 0.556094
9 r sal -0.125818 lygon 0.549179
10 it.ed -0.124636 polyg 0.548527
11 the x -0.121964 file 0.451353
12 widg -0.118938 tiff 0.444364
13 list -0.113557 poly 0.438104
14 st of -0.112570 nimat 0.411684
15 monit -0.111921 mage 0.406374
16 onito -0.111012 cview 0.387756
17 pric -0.110942 ygon 0.370210
18 nitor -0.110795 imati 0.369748
19 price -0.109920 rithm 0.366211
20 ith a -0.109114 mages 0.366034
21 mit.e -0.108937 orith 0.358966
22 moni -0.108279 cvie 0.358923
23 .mit. -0.107361 algo 0.358834
24 or sa -0.106952 lgori 0.353428
25 list -0.106588 gorit 0.353396
26 e car -0.105770 algor 0.353228
27 t.edu -0.104144 anima 0.315670
28 an x -0.103585 oints 0.301595
29 y inf -0.103238 files 0.296488
30 icati -0.102855 tiff 0.293500
31 prot -0.102126 anim 0.283251
32 the l -0.100827 ibrar 0.282348
33 itor -0.099944 form 0.278534
34 elect -0.099637 progr 0.278199
35 i tr -0.099346 ckage 0.275891
36 confi -0.098909 point 0.275882
37 lanet -0.098654 libra 0.273130
38 contr -0.098025 urfac 0.271032
39 ight -0.097715 surfa 0.271032
40 hat t -0.096963 p.gra 0.269583
41 catio -0.096583 omp.g 0.269583
42 he ca -0.096374 mp.gr 0.269583
43 a wi -0.094973 .grap 0.268418
44 a win -0.094596 ogram 0.266209
45 acce -0.094402 ackag 0.264678
46 , ple -0.094348 cs li 0.262966
47 appli -0.093803 3d g 0.262910
48 vice -0.093504 libr 0.260609
49 ndow -0.093407 hics. 0.259998
class = comp.os.ms-windows.misc
Smallest Word Smallest Weight Largest Word Largest Weight
0 ndow -0.380450 ndows 1.544164
1 board -0.236146 dows 1.421803
2 dow m -0.225387 wind 1.159674
3 motif -0.202330 windo 1.154040
4 for s -0.187504 indow 1.153866
5 color -0.186237 r win 0.692845
6 image -0.184436 file 0.689033
7 enwin -0.182456 s 3.1 0.668462
8 openw -0.182456 ows 3 0.592240
9 penwi -0.182456 ws 3. 0.589709
10 sale -0.176028 for w 0.558763
11 nwind -0.171307 river 0.558037
12 the x -0.171290 or wi 0.546765
13 or sa -0.168077 3.1 0.525055
14 r sal -0.165673 n win 0.512513
15 w man -0.161914 : win 0.473913
16 oard -0.159727 dos 0.446539
17 otif -0.156950 p fil 0.385430
18 ow ma -0.155975 in wi 0.381601
19 phics -0.150670 s-win 0.380655
20 boar -0.149208 ms-wi 0.380655
21 imag -0.148457 file 0.376563
22 he po -0.148251 files 0.372198
23 mac -0.144419 win3 0.366861
24 port -0.143563 font 0.362544
25 x11r -0.143242 .ini 0.357624
26 x11r5 -0.139668 win 0.329636
27 scsi -0.139619 drive 0.327014
28 nitor -0.138714 e: wi 0.324357
29 onito -0.138588 cica 0.308772
30 monit -0.137784 dows. 0.308543
31 x-win -0.137459 in 3. 0.304620
32 an x -0.137220 win 3 0.303961
33 splay -0.135698 in w 0.297449
34 displ -0.135211 driv 0.294832
35 aphic -0.134526 h win 0.291729
36 open -0.134197 n 3.1 0.274426
37 power -0.133617 wnloa 0.274328
38 x win -0.133271 win. 0.273890
39 scsi -0.132955 ownlo 0.273710
40 x-wi -0.131787 desk 0.273228
41 moni -0.128444 rosof 0.272599
42 apple -0.128037 ms-w 0.272545
43 hics -0.127602 iver 0.271827
44 rive -0.124650 downl 0.271321
45 for a -0.123150 croso 0.268511
46 stem -0.122265 re: w 0.262431
47 ing x -0.121175 dows\n 0.258328
48 colo -0.120276 ith w 0.257445
49 olor -0.119339 cica. 0.254940
class = comp.sys.ibm.pc.hardware
Smallest Word Smallest Weight Largest Word Largest Weight
0 mac -0.225206 ide 0.724057
1 apple -0.205988 oller 0.640627
2 dows -0.182041 scsi 0.550627
3 r sal -0.173268 troll 0.543668
4 sale -0.169135 rolle 0.529195
5 pple -0.168663 isa 0.471442
6 or sa -0.168652 bus 0.431050
7 appl -0.164108 scsi 0.397718
8 indow -0.154122 card 0.378798
9 ernal -0.151867 ller 0.370022
10 windo -0.146486 board 0.351313
11 for s -0.138725 scsi- 0.346129
12 file -0.138207 vlb 0.339812
13 quadr -0.128540 bios 0.331904
14 terna -0.128167 rives 0.329731
15 power -0.127971 aster 0.317262
16 iver -0.127483 herbo 0.316772
17 rnal -0.121413 ntrol 0.313237
18 uadra -0.121338 rive 0.311371
19 e mac -0.120393 therb 0.309691
20 ac ii -0.119918 ontro 0.307562
21 rd an -0.116821 s scs 0.306790
22 iisi -0.115712 teway 0.303595
23 ippin -0.114795 gatew 0.303595
24 powe -0.114598 atewa 0.303595
25 hippi -0.114077 erboa 0.303019
26 mac i -0.113544 loppy 0.297843
27 the r -0.111339 irq 0.297385
28 shipp -0.110979 rboar 0.297033
29 river -0.110168 jumpe 0.296655
30 trol -0.107486 isa b 0.295718
31 atest -0.105985 boot 0.293753
32 offer -0.105914 eisa 0.289116
33 olor -0.105693 maste 0.288282
34 adra -0.105026 486dx 0.287544
35 e kno -0.104982 drive 0.284757
36 tris -0.104622 umper 0.283022
37 iisi -0.104128 mothe 0.282869
38 quad -0.102243 gate 0.277538
39 ntris -0.102111 486d 0.276730
40 mail -0.101645 moth 0.276552
41 ndows -0.101178 scsi\n 0.269204
42 entri -0.099466 17" 0.265744
43 pping -0.099150 sa bu 0.260033
44 ntern -0.097730 aptec 0.259465
45 ship -0.097348 flopp 0.257935
46 ne kn -0.097247 7" mo 0.256589
47 offe -0.097217 a 486 0.254615
48 rmina -0.096218 eway 0.254564
49 he ii -0.096031 disk 0.254408
class = comp.sys.mac.hardware
Smallest Word Smallest Weight Largest Word Largest Weight
0 ndows -0.249282 mac 1.181907
1 windo -0.243462 apple 1.113316
2 indow -0.241740 pple 0.938156
3 wind -0.216890 uadra 0.765887
4 troll -0.181238 quadr 0.756157
5 r sal -0.181068 appl 0.699827
6 or sa -0.177942 ntris 0.691111
7 dows -0.169359 quad 0.689525
8 ide -0.164676 tris 0.662605
9 was -0.162394 adra 0.645392
10 sale -0.160638 entri 0.606050
11 dos -0.159867 e mac 0.581103
12 tion -0.159049 simm 0.534742
13 oller -0.155241 ris 6 0.528727
14 rolle -0.153744 rbook 0.524118
15 car -0.140105 erboo 0.524118
16 sale\n -0.129215 werbo 0.522094
17 file -0.123062 owerb 0.513684
18 offer -0.119675 mac i 0.476253
19 486dx -0.118243 duo 0.463300
20 contr -0.117083 lciii 0.462353
21 prog -0.114890 iisi 0.449844
22 offe -0.114142 lcii 0.447892
23 ller -0.113738 ac ii 0.401948
24 was w -0.113079 a mac 0.401265
25 gate -0.112909 power 0.398869
26 r win -0.111513 simms 0.398192
27 the w -0.109609 centr 0.390573
28 an o -0.108568 vram 0.380350
29 bios -0.108146 powe 0.376295
30 ontro -0.106682 c650 0.374078
31 486d -0.106052 iisi 0.356028
32 a 486 -0.106031 s 610 0.352749
33 files -0.104793 is 61 0.350748
34 a 48 -0.103365 macs 0.349994
35 ntrol -0.102389 he lc 0.349674
36 progr -0.101212 nubus 0.349553
37 he bi -0.100341 nubu 0.336223
38 john -0.099871 he ma 0.318364
39 isa -0.095963 ubus 0.317760
40 ogram -0.095621 c650 0.317715
41 eway -0.094665 ciii 0.316954
42 rogra -0.094409 vram 0.304517
43 le ii -0.092187 lc ii 0.300362
44 atewa -0.091715 pgrad 0.296581
45 gatew -0.091715 upgra 0.292764
46 teway -0.091715 lc i 0.289957
47 as wo -0.091698 adb 0.284755
48 for s -0.090928 imms 0.282688
49 aptec -0.090298 pds 0.279150
class = comp.windows.x
Smallest Word Smallest Weight Largest Word Largest Weight
0 dos -0.285421 motif 1.038948
1 dows -0.232684 widge 1.033707
2 for w -0.215935 idget 1.029967
3 s 3.1 -0.210200 ndow 0.949009
4 drive -0.205488 widg 0.870868
5 r win -0.199687 the x 0.855321
6 driv -0.195715 x11r 0.814708
7 card -0.195496 otif 0.810574
8 mode -0.183356 x11r5 0.791648
9 river -0.170276 moti 0.714497
10 ndows -0.151904 erver 0.703306
11 data -0.151166 indow 0.677439
12 edu ( -0.149197 xter 0.673960
13 ivers -0.147227 windo 0.672274
14 or wi -0.144325 .lcs. 0.640502
15 3.1 -0.140729 an x 0.639145
16 files -0.135844 lcs.m 0.632856
17 file -0.134367 xterm 0.619203
18 data -0.134365 cs.mi 0.612331
19 disk -0.132212 xpo.l 0.598478
20 apple -0.130597 po.lc 0.598478
21 on: u -0.130275 o.lcs 0.598478
22 micro -0.130189 expo. 0.598478
23 ontro -0.128230 s.mit 0.596266
24 ntrol -0.126785 rver 0.592994
25 print -0.126529 \nto: 0.554958
26 s for -0.126485 11r5 0.554392
27 was -0.126044 @expo 0.547827
28 rinte -0.125220 lient 0.542968
29 compu -0.122309 clien 0.538482
30 u.edu -0.121555 .mit. 0.536659
31 micr -0.120605 pert@ 0.534854
32 prin -0.119756 xper 0.530423
33 ccess -0.119584 ing x 0.522248
34 univ -0.118422 mit.e 0.511993
35 lity -0.116723 t@exp 0.503491
36 out -0.114841 rt@ex 0.502618
37 mac -0.114330 ert@e 0.502618
38 ows 3 -0.112654 ct: x 0.496638
39 acces -0.111811 dow m 0.485591
40 niver -0.111376 clie 0.481668
41 n: us -0.110026 displ 0.478946
42 : win -0.107857 xpert 0.478170
43 unive -0.107461 to: x 0.477017
44 ersit -0.107052 splay 0.472727
45 word -0.106937 serve 0.468464
46 iver -0.106772 rnet\n 0.463489
47 count -0.106322 : xpe 0.461695
48 rsity -0.105651 dget 0.460244
49 chip -0.105576 ispla 0.455907
class = misc.forsale
Smallest Word Smallest Weight Largest Word Largest Weight
0 t: re -0.284520 r sal 1.946512
1 ct: r -0.279003 sale 1.888745
2 re: -0.275372 or sa 1.864354
3 : re: -0.274179 for s 1.219069
4 anyon -0.233947 sale\n 0.950867
5 nyone -0.232929 offer 0.879329
6 that -0.232821 offe 0.794954
7 yone -0.229462 hippi 0.782960
8 anyo -0.227751 shipp 0.757651
9 that -0.226408 sale 0.716921
10 help -0.219395 sale: 0.702044
11 t the -0.207557 ippin 0.700228
12 ?\norg -0.203893 ship 0.657347
13 the s -0.202806 ale: 0.649299
14 ould -0.196347 rsale 0.647213
15 n the -0.195043 orsal 0.644706
16 how -0.186268 forsa 0.624387
17 hanks -0.179725 sale. 0.602332
18 info -0.176671 pping 0.561603
19 here -0.168495 sell 0.558384
20 anks -0.164983 ale\no 0.501618
21 thank -0.161385 nditi 0.497111
22 ? tha -0.157931 fors 0.495101
23 what -0.157100 askin 0.492689
24 there -0.155025 ondit 0.491716
25 on t -0.152140 condi 0.490672
26 in th -0.151960 for $ 0.488348
27 can -0.151402 sking 0.480182
28 ommen -0.149528 le\nor 0.458098
29 does -0.149149 for 0.457451
30 know -0.148880 aski 0.450945
31 is t -0.148213 ale. 0.439920
32 what -0.146651 inclu 0.436977
33 estio -0.146280 nclud 0.431275
34 s any -0.146008 cond 0.430081
35 does -0.145651 clude 0.422192
36 stion -0.145645 ditio 0.416524
37 bike -0.145437 ing $ 0.408327
38 o the -0.145081 price 0.380063
39 some -0.143722 incl 0.376264
40 any -0.142687 manua 0.374705
41 s\norg -0.142457 anual 0.374434
42 ng a -0.142366 if in 0.364452
43 d be -0.141671 sell 0.362964
44 recom -0.141251 mail 0.362954
45 ther -0.139936 ffer 0.360446
46 info -0.135513 reste 0.356020
47 in a -0.134944 est o 0.353581
48 on th -0.133711 t off 0.349485
49 ecomm -0.133620 game 0.332203
class = rec.autos
Smallest Word Smallest Weight Largest Word Largest Weight
0 bike -0.447476 car 1.332066
1 bike -0.295431 cars 1.022231
2 card -0.226607 cars 0.797087
3 e bik -0.204163 auto 0.754415
4 game -0.148118 e car 0.632151
5 or sa -0.145579 deale 0.521365
6 bikes -0.141259 ealer 0.515262
7 appl -0.138371 utomo 0.500544
8 apple -0.132100 autom 0.499204
9 r sal -0.131201 ford 0.449743
10 ride -0.130011 otive 0.427707
11 card -0.129637 tomot 0.419487
12 board -0.129264 deal 0.416377
13 s to -0.128221 aler 0.405098
14 rider -0.124475 car i 0.401449
15 team -0.124013 car. 0.399626
16 pple -0.117924 omoti 0.398442
17 guns -0.112414 car, 0.383130
18 play -0.112218 car, 0.376840
19 gun -0.111121 oyota 0.375485
20 david -0.110370 toyot 0.373812
21 cage -0.109768 oil 0.370888
22 avid -0.108519 motiv 0.359810
23 compu -0.107837 toyo 0.355105
24 a bik -0.105580 boyle 0.354520
25 ///// -0.104768 ad).. 0.349386
26 ard d -0.103217 ...(p 0.349386
27 otorc -0.102594 ..(pl 0.349386
28 nitor -0.102540 .(ple 0.349386
29 file -0.102145 car. 0.346348
30 chine -0.101747 a car 0.341654
31 order -0.099018 d)... 0.341043
32 now -0.098314 ead). 0.340055
33 sale -0.097799 read) 0.337229
34 onito -0.097496 ngine 0.332878
35 dod -0.096685 wagon 0.331770
36 of h -0.096505 engin 0.330210
37 monit -0.096342 yota 0.329858
38 omput -0.096124 engi 0.327341
39 port -0.095993 )...\n 0.321081
40 imag -0.095873 he ca 0.321047
41 y bik -0.095451 ....( 0.319974
42 one -0.095380 umbes 0.318903
43 guns -0.095023 mbest 0.318903
44 cont -0.094883 dumbe 0.316993
45 east -0.094793 autos 0.315683
46 n of -0.094480 g.... 0.315607
47 rial -0.093952 r car 0.314386
48 torcy -0.093727 : war 0.313933
49 one t -0.093667 lliso 0.312228
class = rec.motorcycles
Smallest Word Smallest Weight Largest Word Largest Weight
0 car -0.235848 bike 2.463835
1 auto -0.169623 bike 1.382759
2 cars -0.135226 otorc 1.214320
3 the p -0.130284 torcy 1.201406
4 game -0.129371 orcyc 1.189365
5 play -0.127945 rcycl 1.188078
6 has -0.126328 ride 1.076691
7 cars -0.125134 cycle 0.954165
8 torol -0.123827 e bik 0.945903
9 orola -0.123756 motor 0.899671
10 otoro -0.122984 moto 0.898792
11 e pro -0.122867 bikes 0.855027
12 uld b -0.121739 dod # 0.847455
13 ical -0.120472 dod 0.809856
14 does -0.120437 rider 0.759128
15 ndows -0.118623 ridin 0.707794
16 on: s -0.118125 ycle 0.616406
17 windo -0.114932 ride 0.595201
18 d be -0.114779 ridi 0.585603
19 space -0.114765 bmw 0.558758
20 does -0.114518 a bik 0.548926
21 indow -0.113472 helme 0.548264
22 ..... -0.108079 helm 0.544714
23 cons -0.107385 iding 0.543775
24 powe -0.106793 ikes 0.533816
25 autom -0.105640 y bik 0.528643
26 belie -0.104890 elmet 0.527710
27 ld be -0.104641 od #0 0.492050
28 oes a -0.104255 biker 0.441115
29 usin -0.103999 he bi 0.436256
30 eliev -0.103526 cage 0.435488
31 s, in -0.102693 dod# 0.414394
32 rola -0.101492 ycles 0.405564
33 supp -0.100667 bike. 0.402476
34 e con -0.100062 harl 0.401603
35 hrist -0.099078 t bik 0.400582
36 beli -0.098856 r bik 0.396247
37 power -0.098034 arley 0.388181
38 lieve -0.096643 ehann 0.371777
39 gun -0.096405 behan 0.367423
40 hat t -0.095960 hanna 0.357456
41 ason -0.095853 my bi 0.354486
42 inte -0.095787 lmet 0.354323
43 hey a -0.095760 e rid 0.350504
44 du (b -0.095696 a mot 0.350030
45 c.edu -0.094634 harle 0.346507
46 id th -0.094570 amaha 0.346121
47 edwar -0.094005 terst 0.344235
48 car i -0.093673 n@eas 0.342178
49 ander -0.093251 en@ea 0.342178
class = rec.sport.baseball
Smallest Word Smallest Weight Largest Word Largest Weight
0 hocke -0.387633 pitch 1.317979
1 ockey -0.357096 pitc 1.251327
2 hock -0.355135 sebal 1.125030
3 goal -0.350442 aseba 1.123968
4 layof -0.296948 baseb 1.123140
5 ayoff -0.295727 eball 1.121030
6 playo -0.294216 ball 0.928144
7 nhl -0.216489 base 0.905820
8 e nhl -0.215044 tcher 0.704080
9 yoffs -0.207485 brave 0.627126
10 ckey -0.204440 itchi 0.626820
11 he nh -0.202300 runs 0.622253
12 wing -0.182870 yank 0.612611
13 pens -0.167918 hitt 0.610819
14 leaf -0.166590 raves 0.606225
15 goals -0.163845 yanke 0.592199
16 leafs -0.163608 hitte 0.590829
17 uins -0.162823 itche 0.587270
18 wings -0.156440 ankee 0.583680
19 you -0.153992 illie 0.577014
20 nguin -0.151521 llies 0.571660
21 r sal -0.151243 brav 0.563277
22 engui -0.150934 game 0.560745
23 pengu -0.150934 phill 0.557445
24 bike -0.149070 tchin 0.546160
25 the i -0.146757 field 0.514674
26 goal -0.146418 itter 0.499724
27 guins -0.146222 mets 0.499050
28 chris -0.145174 hilli 0.497567
29 lyers -0.140835 cubs 0.475976
30 eafs -0.140708 layer 0.472641
31 flyer -0.140657 playe 0.468003
32 ical -0.139467 team 0.454292
33 the c -0.138016 ball 0.442790
34 yoff -0.137966 aves 0.436709
35 the f -0.137776 cubs 0.433247
36 peng -0.136959 sox 0.427468
37 sale -0.135846 jays 0.426236
38 espn -0.135397 phil 0.422527
39 space -0.134654 ching 0.420616
40 e ice -0.134463 e run 0.417006
41 cup -0.133467 ayers 0.415740
42 there -0.132243 lomar 0.413823
43 flye -0.131955 ball. 0.413712
44 penal -0.131549 hit 0.409910
45 windo -0.131421 aloma 0.405286
46 oals -0.130694 play 0.400369
47 indow -0.129893 home 0.397470
48 ===== -0.127582 seaso 0.392525
49 bruin -0.127174 alom 0.392132
class = rec.sport.hockey
Smallest Word Smallest Weight Largest Word Largest Weight
0 pitch -0.461276 hocke 1.535583
1 pitc -0.425422 ockey 1.502924
2 itchi -0.261211 hock 1.439787
3 runs -0.246273 play 1.381557
4 itche -0.227483 ckey 1.127159
5 base -0.204821 team 1.124085
6 mets -0.176732 playo 1.114949
7 ball -0.176623 layof 1.113788
8 sebal -0.176404 ayoff 1.105152
9 aseba -0.176274 goal 0.990104
10 baseb -0.176144 nhl 0.768611
11 sale -0.175885 game 0.720266
12 brave -0.174616 game 0.673643
13 eball -0.172910 team 0.668960
14 tion -0.172551 leaf 0.662521
15 llies -0.170228 leafs 0.648150
16 hitt -0.168332 yoffs 0.642385
17 tcher -0.164154 e nhl 0.634809
18 inni -0.163471 yoff 0.613883
19 card -0.159966 playe 0.593323
20 tiger -0.157453 wings 0.588920
21 tige -0.156389 he nh 0.587949
22 brav -0.153643 evils 0.575008
23 run -0.149741 wing 0.544409
24 hitte -0.148633 eafs 0.544361
25 ball. -0.148576 layer 0.526256
26 igers -0.147544 coach 0.526096
27 work -0.145700 devil 0.522240
28 runs -0.142803 uins 0.518214
29 .edu -0.141035 espn 0.517386
30 illie -0.140901 toron 0.517131
31 field -0.140841 oront 0.517131
32 he ph -0.140491 ronto 0.511178
33 hit -0.139046 e pen 0.500194
34 chers -0.137209 seaso 0.492490
35 dodge -0.135668 pengu 0.492271
36 e run -0.133673 engui 0.492271
37 dodg -0.133049 troit 0.488038
38 aves -0.132833 detro 0.487313
39 driv -0.131367 etroi 0.483845
40 catc -0.129916 nguin 0.483678
41 rocki -0.129907 guins 0.474558
42 itter -0.129159 coac 0.473427
43 ockie -0.128602 y cup 0.472747
44 edu ( -0.128316 cup 0.453882
45 use -0.127157 ley c 0.453598
46 ckies -0.126803 seas 0.451769
47 riole -0.126644 ey cu 0.451525
48 tchin -0.126375 olcho 0.451118
49 oriol -0.126294 lchow 0.451118
class = sci.crypt
Smallest Word Smallest Weight Largest Word Largest Weight
0 ----- -0.284012 crypt 2.397442
1 indow -0.152546 encry 1.571634
2 windo -0.152464 ncryp 1.567428
3 n: us -0.145351 encr 1.465511
4 oblem -0.140890 clipp 1.431540
5 probl -0.140256 ipper 1.419516
6 roble -0.140175 lippe 1.412055
7 wind -0.135490 rypti 1.277410
8 chris -0.132707 yptio 1.240426
9 guns -0.131371 clip 1.176980
10 ere i -0.130495 cryp 1.154029
11 image -0.129232 pper 1.140149
12 : usa -0.128974 rypto 1.131629
13 on: u -0.126589 chip 0.999422
14 ower -0.126415 key 0.957370
15 the m -0.125031 chip 0.798083
16 s in -0.122532 keys 0.791703
17 grap -0.121502 per c 0.751491
18 thank -0.121042 ption 0.744498
19 . you -0.120594 escro 0.716339
20 keybo -0.119617 secur 0.700087
21 phics -0.119188 scrow 0.698501
22 ndows -0.119161 e key 0.694333
23 the b -0.118458 r chi 0.626615
24 appl -0.118342 er ch 0.609113
25 eyboa -0.118320 secu 0.605918
26 yboar -0.118320 yptog 0.596017
27 kill -0.118164 ptogr 0.596017
28 israe -0.117987 nsa 0.590867
29 srael -0.117960 escr 0.566792
30 or a -0.117930 rypte 0.555481
31 he ar -0.116731 ypted 0.552844
32 apple -0.116487 e cli 0.550312
33 keyb -0.114318 ypto 0.549488
34 ervic -0.114096 re: o 0.546840
35 rvice -0.113298 gtoal 0.531762
36 ave a -0.113001 e nsa 0.529573
37 ians -0.112395 secre 0.528408
38 visio -0.112270 tapp 0.498630
39 spac -0.112192 crow 0.495821
40 e.com -0.111206 togra 0.489905
41 pace -0.110359 ecret 0.488486
42 chri -0.108957 tappe 0.484258
43 the h -0.108616 e tap 0.476374
44 ters -0.108438 he ns 0.475263
45 hanks -0.108427 wiret 0.467759
46 servi -0.107839 retap 0.467759
47 e for -0.107542 ireta 0.467759
48 om: d -0.107013 keys 0.467381
49 compo -0.106812 t key 0.467130
class = sci.electronics
Smallest Word Smallest Weight Largest Word Largest Weight
0 windo -0.179607 rcuit 0.766207
1 indow -0.178272 ircui 0.761114
2 ndows -0.155433 circu 0.606135
3 mac -0.153168 circ 0.582962
4 space -0.152030 volt 0.548929
5 crypt -0.148931 cuit 0.525815
6 wind -0.145591 oltag 0.463753
7 drive -0.139402 ltage 0.463753
8 spac -0.135972 volta 0.460535
9 sale -0.133337 lectr 0.444991
10 or sa -0.129630 elec 0.373032
11 r sal -0.123852 sisto 0.355999
12 raphi -0.121161 elect 0.346360
13 aphic -0.118990 etect 0.311119
14 auto -0.118714 detec 0.311119
15 bike -0.116512 cuits 0.304165
16 graph -0.114089 ectro 0.301768
17 lippe -0.112767 line 0.298895
18 acce -0.109226 ower 0.297016
19 t on -0.108852 esist 0.296768
20 file -0.108632 eiver 0.292461
21 ncryp -0.107185 ignal 0.292125
22 encry -0.107105 resis 0.291774
23 e. th -0.106955 ctron 0.287576
24 driv -0.106869 tecto 0.284756
25 think -0.106180 troni 0.275736
26 cent -0.106174 a pho 0.275226
27 for m -0.105795 e amp 0.273497
28 t for -0.105434 audio 0.268876
29 grap -0.104316 onics 0.268459
30 dows -0.104234 amp 0.267126
31 card -0.103895 g tow 0.266887
32 cause -0.103558 tage 0.265596
33 clipp -0.103260 cooli 0.263897
34 all -0.099243 rada 0.262090
35 perso -0.098532 oolin 0.261349
36 ipper -0.098228 adio 0.259043
37 rypti -0.096445 power 0.258887
38 ship -0.096191 ectri 0.258411
39 orbit -0.094506 ctric 0.258411
40 ently -0.094468 use 0.257458
41 chang -0.094430 radar 0.253589
42 yptio -0.094349 powe 0.252295
43 clip -0.093778 cool 0.251513
44 ccess -0.093601 chip 0.251445
45 encr -0.093538 utlet 0.250426
46 ntly -0.092964 outle 0.250426
47 me a -0.092877 radio 0.248765
48 becau -0.092413 scope 0.247174
49 ecaus -0.092413 signa 0.245484
class = sci.med
Smallest Word Smallest Weight Largest Word Largest Weight
0 drive -0.174184 medic 0.753762
1 stian -0.173023 msg 0.742886
2 istia -0.172149 docto 0.699396
3 hrist -0.167166 octor 0.698220
4 space -0.166703 sease 0.676768
5 risti -0.165866 iseas 0.673404
6 chris -0.164079 disea 0.673404
7 spac -0.161962 food 0.646730
8 driv -0.159636 dise 0.635357
9 chri -0.156317 treat 0.626115
10 power -0.155868 geb@c 0.590658
11 powe -0.146908 doct 0.589793
12 god -0.133700 don b 0.583144
13 car -0.133568 n ban 0.573876
14 he de -0.133291 pitt. 0.570318
15 game -0.128654 trea 0.567426
16 nment -0.126921 cs.pi 0.562157
17 theis -0.126644 tient 0.561302
18 pace -0.125246 patie 0.554959
19 ling -0.123670 atien 0.554959
20 bibl -0.121707 @cs.p 0.554225
21 he bi -0.121467 medi 0.550727
22 windo -0.120702 banks 0.536373
23 ower -0.119597 .pitt 0.515359
24 indow -0.118981 gordo 0.515172
25 ligio -0.118284 ordon 0.514018
26 relig -0.118019 edica 0.513667
27 gover -0.117337 bank 0.504548
28 rive -0.114718 eb@cs 0.496635
29 overn -0.114714 rdon 0.495473
30 play -0.113358 pati 0.494558
31 vernm -0.112855 geb@ 0.492000
32 ernme -0.112818 anks) 0.491771
33 rnmen -0.112802 b@cs. 0.487279
34 the g -0.111907 tt.ed 0.481260
35 eligi -0.111141 itt.e 0.480813
36 u.edu -0.109548 u (go 0.473997
37 gove -0.109279 on ba 0.472747
38 nning -0.109024 reatm 0.466067
39 ware -0.107916 eatme 0.466067
40 athei -0.107584 atmen 0.465534
41 en th -0.106238 s.pit 0.463466
42 bike -0.105980 ients 0.429590
43 the w -0.105862 dyer 0.426110
44 tatio -0.105854 (gord 0.420624
45 engin -0.105831 (gor 0.420236
46 bilit -0.105432 iagno 0.412790
47 on th -0.103459 diagn 0.412790
48 a.edu -0.103111 foods 0.412778
49 disk -0.103106 itivi 0.406920
class = sci.space
Smallest Word Smallest Weight Largest Word Largest Weight
0 windo -0.201688 space 1.760493
1 lt, m -0.200772 spac 1.692403
2 indow -0.199935 pace 1.343471
3 elt, -0.199304 orbit 1.121241
4 belt, -0.199116 orbi 1.024504
5 t, md -0.194512 moon 0.850933
6 , md -0.182686 launc 0.783823
7 ndows -0.182051 aunch 0.766848
8 chip -0.180454 laun 0.748340
9 ***** -0.179084 nasa 0.646137
10 steve -0.177157 huttl 0.625697
11 d usa -0.174550 moon 0.620814
12 md u -0.166240 uttle 0.608911
13 md us -0.166240 rbit 0.606169
14 : usa -0.160080 ace s 0.599939
15 your -0.156201 e moo 0.596560
16 ns, g -0.150602 henry 0.593839
17 driv -0.150455 shutt 0.589686
18 , gre -0.149570 rb@ac 0.577582
19 file -0.148455 prb@a 0.577582
20 n: us -0.148328 prb@ 0.574039
21 play -0.144103 (pat) 0.571262
22 s, gr -0.143880 nasa 0.549102
23 drive -0.142216 laska 0.547842
24 wind -0.139548 alask 0.547842
25 file -0.139332 lunar 0.519805
26 nbelt -0.137405 ska.e 0.505964
27 enbel -0.137243 ka.ed 0.505964
28 eenbe -0.134041 aska. 0.505964
29 dows -0.133773 .alas 0.505964
30 reenb -0.133693 shut 0.504948
31 t.edu -0.133632 unar 0.503734
32 good -0.133034 luna 0.500711
33 any -0.132673 craft 0.500077
34 your -0.131375 cecra 0.477195
35 h)\nsu -0.129856 acecr 0.477195
36 chris -0.128344 pacec 0.473351
37 teve -0.127526 unch 0.471966
38 sale -0.127338 b@acc 0.468673
39 r sal -0.126375 e spa 0.462074
40 i ha -0.123979 astro 0.461108
41 clin -0.123898 fligh 0.460700
42 game -0.123274 ecraf 0.458215
43 does -0.121958 (pat 0.457396
44 crypt -0.121588 zoo.t 0.448436
45 linto -0.120826 oo.to 0.448436
46 inton -0.120503 o.tor 0.448436
47 chip -0.119908 @zoo. 0.448436
48 i hav -0.119364 urora 0.448343
49 nning -0.119182 auror 0.448343
class = soc.religion.christian
Smallest Word Smallest Weight Largest Word Largest Weight
0 ting- -0.286087 hrist 1.157187
1 ng-ho -0.283180 chris 1.037373
2 ralit -0.282904 chri 1.007450
3 nntp- -0.282361 1993. 0.926848
4 -host -0.282084 .rutg 0.912009
5 g-hos -0.282073 rs.ed 0.909241
6 host: -0.282073 ers.e 0.909241
7 ntp-p -0.282073 istia 0.907408
8 p-pos -0.282073 stian 0.892207
9 tp-po -0.282073 <apr. 0.878809
10 ing-h -0.282041 risti 0.878778
11 ost: -0.281688 gers. 0.870621
12 -post -0.280582 <apr 0.848478
13 \nnntp -0.269971 utger 0.826029
14 le <1 -0.262265 tgers 0.826029
15 orali -0.246522 rutge 0.826029
16 ostin -0.240871 .1993 0.819469
17 posti -0.239520 e <ap 0.814446
18 ution -0.233377 os.ru 0.796072
19 .com> -0.227263 hos.r 0.796072
20 93apr -0.227262 athos 0.796072
21 moral -0.225977 @atho 0.796072
22 mora -0.213659 thos. 0.787158
23 1993a -0.210682 le <a 0.784424
24 993ap -0.210682 god 0.782882
25 sting -0.207029 tians 0.744131
26 <1993 -0.203505 s.edu 0.678769
27 us.ru -0.202609 churc 0.667827
28 <199 -0.202568 hurch 0.666893
29 distr -0.201586 chur 0.623747
30 le <c -0.200039 s.rut 0.622173
31 istri -0.196983 jesus 0.590921
32 e <c5 -0.193733 jesu 0.588939
33 ibuti -0.191413 apr.1 0.574902
34 aldis -0.191340 -clh] 0.540801
35 ribut -0.191021 --clh 0.540801
36 islam -0.190591 f chr 0.538473
37 kaldi -0.189607 faith 0.518511
38 strib -0.189587 --cl 0.515254
39 butio -0.188530 va.ru 0.508999
40 tribu -0.187318 eva.r 0.508999
41 \ndist -0.187065 a.rut 0.507753
42 com> -0.182778 neva. 0.505187
43 erica -0.177180 genev 0.500854
44 kald -0.176562 god's 0.499207
45 andvi -0.175633 eneva 0.494737
46 ndvik -0.175633 993.1 0.491023
47 sandv -0.175633 esus 0.486630
48 e <19 -0.172140 god' 0.483537
49 meric -0.170508 od's 0.478711
class = talk.politics.guns
Smallest Word Smallest Weight Largest Word Largest Weight
0 crypt -0.231857 gun 1.266008
1 ipper -0.154711 fire 1.046236
2 encry -0.151651 firea 0.921690
3 ncryp -0.151651 irear 0.918977
4 spee -0.143482 rearm 0.914630
5 lippe -0.140055 guns 0.911108
6 israe -0.139948 guns 0.869453
7 srael -0.139948 weapo 0.711248
8 and d -0.138797 eapon 0.711248
9 ?\norg -0.137767 earms 0.706334
10 clipp -0.137020 weap 0.644490
11 encr -0.131148 waco 0.626129
12 clip -0.130617 batf 0.615037
13 game -0.129599 gun c 0.605255
14 isra -0.129538 handg 0.591689
15 rypto -0.128183 ndgun 0.588299
16 estin -0.128073 andgu 0.588299
17 he so -0.125262 arms 0.545820
18 rypti -0.125222 e gun 0.545762
19 eason -0.124204 un co 0.545057
20 he re -0.120309 apons 0.521431
21 chip -0.120229 atf 0.484039
22 play -0.118937 waco 0.463063
23 yptio -0.117902 crim 0.462801
24 menia -0.117776 n wac 0.439493
25 rmeni -0.117772 y gun 0.439133
26 armen -0.117333 cdt@ 0.428831
27 speed -0.114494 gun i 0.399300
28 pper -0.113902 batf 0.392067
29 new -0.113807 a gun 0.382797
30 ealth -0.113110 crary 0.382576
31 bike -0.113081 n gun 0.381007
32 mess -0.111776 in wa 0.374819
33 prog -0.111312 ranc 0.372492
34 are t -0.111286 pons 0.366446
35 cryp -0.109880 vivor 0.358042
36 y is -0.109409 rvivo 0.356586
37 ystem -0.109316 vidia 0.354413
38 syste -0.109277 ividi 0.351073
39 chri -0.108416 ivors 0.351069
40 stev -0.108380 fire 0.350243
41 ograp -0.108087 idian 0.350098
42 ites -0.107754 t gun 0.347556
43 r) wr -0.107164 u2803 0.347533
44 togra -0.105629 8037@ 0.347533
45 team -0.105547 7@uic 0.347533
46 fact -0.105446 37@ui 0.347533
47 ogram -0.105143 28037 0.347533
48 stian -0.104777 037@u 0.347533
49 motor -0.104668 fbi 0.347489
class = talk.politics.mideast
Smallest Word Smallest Weight Largest Word Largest Weight
0 chris -0.197178 israe 2.039616
1 hrist -0.193177 srael 2.039260
2 jesus -0.178078 isra 1.893064
3 ?\norg -0.173866 raeli 1.341251
4 jesu -0.166723 aeli 1.163265
5 chri -0.162195 rmeni 1.038226
6 stian -0.161559 menia 1.022559
7 istia -0.156717 armen 1.015505
8 god -0.155762 turk 0.979189
9 f god -0.151733 rael 0.911527
10 risti -0.148715 arab 0.879315
11 bibl -0.141630 enian 0.817754
12 esus -0.141518 nians 0.799968
13 the b -0.138227 arme 0.751435
14 gun -0.135298 nian 0.735639
15 the c -0.135137 pales 0.715789
16 good -0.128128 alest 0.714717
17 he pr -0.125567 turki 0.697670
18 of go -0.124924 urkis 0.690479
19 good -0.124592 rkish 0.689627
20 bible -0.123040 lesti 0.678826
21 play -0.122911 kish 0.647414
22 he bi -0.121984 pale 0.601010
23 s)\nsu -0.121421 jews 0.586000
24 rophe -0.120127 jews 0.567469
25 proph -0.119750 arab 0.564121
26 , but -0.118328 occup 0.551971
27 e)\nsu -0.118053 arabs 0.534219
28 he co -0.117934 he ar 0.526690
29 lord -0.117003 inian 0.508972
30 e com -0.116628 stini 0.498846
31 than -0.116155 n isr 0.498296
32 e bib -0.114943 tinia 0.493683
33 anyo -0.114032 turks 0.469413
34 in c -0.113425 : isr 0.467455
35 anyon -0.112181 urkey 0.466318
36 nyone -0.111922 turke 0.466318
37 , and -0.111854 erdar 0.463276
38 roduc -0.111097 serda 0.462826
39 than -0.109835 e isr 0.458450
40 canad -0.109623 soldi 0.441401
41 game -0.109467 ldier 0.441401
42 cient -0.109081 oldie 0.440794
43 n is -0.108821 argic 0.438893
44 king -0.108138 t isr 0.438478
45 the d -0.106649 aelis 0.436242
46 resu -0.105719 enia 0.431850
47 r) wr -0.105287 terr 0.430864
48 it's -0.104790 e ara 0.428938
49 e fed -0.104297 leban 0.423841
class = talk.politics.misc
Smallest Word Smallest Weight Largest Word Largest Weight
0 hrist -0.171628 cram 0.702884
1 risti -0.166115 crame 0.653219
2 chris -0.161354 ramer 0.613786
3 crypt -0.154118 ptili 0.577782
4 chri -0.154054 optil 0.577782
5 stian -0.152474 tilin 0.568303
6 istia -0.151307 ilink 0.522609
7 gun -0.149785 aldis 0.494942
8 ***** -0.141936 clayt 0.492113
9 ipper -0.139771 kaldi 0.489572
10 lippe -0.138582 layto 0.485312
11 israe -0.136118 kald 0.452913
12 srael -0.136118 yton 0.450479
13 the\n -0.131490 @opti 0.431328
14 clipp -0.128860 ayton 0.424674
15 isra -0.124200 linto 0.420665
16 guns -0.124179 clint 0.417481
17 guns -0.123240 inton 0.415637
18 team -0.122991 sexua 0.408172
19 cont -0.117785 exual 0.408172
20 chip -0.116148 gay 0.383518
21 pper -0.114901 sc-br 0.365720
22 fire -0.109583 isc-b 0.363961
23 contr -0.108357 osexu 0.357775
24 scien -0.108235 clin 0.356485
25 lled -0.107989 r.isc 0.351769
26 is th -0.104385 c-br. 0.350971
27 encr -0.101968 n cra 0.343639
28 encry -0.101949 link. 0.343020
29 ncryp -0.101949 r@opt 0.341056
30 rearm -0.100924 mer@o 0.341056
31 ntrol -0.100233 us.ru 0.340791
32 ontro -0.098770 er@op 0.340054
33 buil -0.098509 m (cl 0.336406
34 car -0.098180 ton c 0.332366
35 firea -0.097797 p ten 0.329213
36 irear -0.097797 mulus 0.326969
37 icati -0.097326 (clay 0.323269
38 cent -0.096986 nk.co 0.323134
39 waco -0.096764 ink.c 0.321097
40 chip -0.096505 op te 0.320966
41 space -0.095967 on cr 0.318403
42 driv -0.095406 tax 0.313726
43 scie -0.095240 br.co 0.313460
44 raeli -0.094066 .isc- 0.313460
45 clip -0.093871 -br.c 0.313460
46 illed -0.093657 amer 0.312036
47 h.edu -0.093423 amer@ 0.307402
48 kille -0.093269 omose 0.306136
49 team -0.093066 mosex 0.306136
class = talk.religion.misc
Smallest Word Smallest Weight Largest Word Largest Weight
0 theis -0.167289 ndvik 0.336977
1 athei -0.159897 sandv 0.335371
2 heist -0.148800 andvi 0.335371
3 athe -0.144354 dvik- 0.311110
4 h.edu -0.123736 endig 0.302324
5 .rutg -0.114729 kendi 0.300536
6 ers.e -0.113476 weiss 0.263257
7 rs.ed -0.113476 istia 0.259342
8 eists -0.113026 rt we 0.258908
9 rutge -0.112598 eritt 0.257621
10 tgers -0.112598 stian 0.256729
11 utger -0.112598 h981@ 0.256256
12 ----- -0.108509 ch981 0.256256
13 gers. -0.107762 ch98 0.256256
14 s on -0.104184 981@c 0.254621
15 need -0.101806 cruci 0.254562
16 1993. -0.100520 yalro 0.253985
17 .1993 -0.096622 oyalr 0.253985
18 ch.ed -0.092860 ds.ca 0.253985
19 le <a -0.091901 alroa 0.253985
20 thank -0.091515 .roya 0.253985
21 du (m -0.090084 weis 0.251545
22 s for -0.090061 sicru 0.250585
23 --clh -0.089927 rosic 0.250585
24 -clh] -0.089927 osicr 0.250585
25 ectio -0.089803 icruc 0.250585
26 <apr. -0.089658 ads.c 0.248127
27 _____ -0.089291 risti 0.247781
28 <apr -0.089104 tian 0.247239
29 free -0.088943 lroad 0.244298
30 hanks -0.088223 81@cl 0.242664
31 e <ap -0.086701 icea) 0.238063
32 uman -0.086578 ucian 0.237407
33 re: t -0.084443 ert w 0.237292
34 --cl -0.083790 ralit 0.236477
35 free -0.083725 t wei 0.233681
36 ce of -0.083299 .appl 0.233057
37 . how -0.081699 rosi 0.231263
38 keit -0.081260 lm le 0.231150
39 nati -0.081109 licea 0.229946
40 than -0.080232 1@cle 0.229496
41 ech.e -0.078315 hrist 0.228836
42 . --c -0.077761 oads. 0.226113
43 dom i -0.075727 orali 0.225484
44 need -0.075382 m lee 0.222942
45 work -0.075058 3 god 0.222288
46 (was -0.074601 93 g 0.222288
47 our -0.074392 re: a 0.222246
48 lief -0.072877 93 go 0.220974
49 genoc -0.072577 jesus 0.219796

Tokenization Tests

Docs:

CountVectorizer
TfidfTransformer


In [57]:
test_pipeline(Pipeline([('cvect', CountVectorizer()),
                      ('tfidf', TfidfTransformer(use_idf=False)),
                      ('sgdc', SGDClassifier(loss='hinge', penalty='l2',
                                            alpha=1e-4, random_state=42,
                                            max_iter=40 )),
                         ]), verbose=False, name='use_idf=False')


F1 = 0.804 
Accuracy = 0.815
time = 8.694 sec.
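
With use_idf=False the pipeline keeps only normalized term frequencies, so very common words are not down-weighted. As a rough illustration of what the idf factor contributes, the sketch below computes the smoothed idf that the scikit-learn documentation describes for the default smooth_idf=True, idf(t) = ln((1 + n) / (1 + df(t))) + 1. The helper name smoothed_idf and the toy documents are made up for this example and are not part of the notebook.

```python
import math

def smoothed_idf(docs):
    """Return {term: idf} using idf = ln((1 + n_docs) / (1 + df)) + 1."""
    n_docs = len(docs)
    df = {}
    for doc in docs:
        # Count each term once per document (document frequency)
        for term in set(doc.split()):
            df[term] = df.get(term, 0) + 1
    return {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

# Toy corpus (hypothetical): "the" is in every document, "law" in only one
docs = ["the gun law", "the space shuttle", "the gun show"]
idf = smoothed_idf(docs)
for term in sorted(idf, key=idf.get):
    print(f"{term:8s} {idf[term]:.3f}")
```

A term that appears in every document gets the minimum weight of 1, while rarer terms get larger weights.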

In [58]:
test_pipeline(Pipeline([('cvect', CountVectorizer(stop_words='english')),
                      ('tfidf', TfidfTransformer()),
                      ('sgdc', SGDClassifier(loss='hinge', penalty='l2',
                                            alpha=1e-4, random_state=42,
                                            max_iter=40 )),
                         ]), verbose=False, name='stopwords')


F1 = 0.844 
Accuracy = 0.851
time = 7.025 sec.

In [59]:
test_pipeline(Pipeline([('cvect', CountVectorizer(stop_words='english', ngram_range=(1,2),
                                                max_df = 0.8, min_df=2)),
                      ('tfidf', TfidfTransformer()),
                      ('sgdc', SGDClassifier(loss='hinge', penalty='l2',
                                            alpha=1e-4, random_state=42,
                                            max_iter=40 )),
                         ]), verbose=False, name='ngram_range=(1,2)')


F1 = 0.850 
Accuracy = 0.856
time = 17.337 sec.
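
The run above adds word bigrams and prunes the vocabulary with min_df and max_df. The sketch below imitates that idea in plain Python, assuming whitespace tokenization and no stop-word removal, which is simpler than what CountVectorizer does. The function names word_ngrams and filter_by_df and the toy documents are hypothetical.

```python
from collections import Counter

def word_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams with n_min..n_max words (lowercased, whitespace split)."""
    words = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams

def filter_by_df(docs, min_df=2, max_df=0.8):
    """Keep n-grams found in at least min_df documents and in at most
    max_df * n_docs documents (min_df is a count, max_df a proportion)."""
    df = Counter()
    for doc in docs:
        df.update(set(word_ngrams(doc)))  # each n-gram counted once per document
    max_count = max_df * len(docs)
    return sorted(t for t, c in df.items() if min_df <= c <= max_count)

docs = ["the gun control debate",
        "the space shuttle launch",
        "the gun lobby",
        "space shuttle"]
vocab = filter_by_df(docs, min_df=2, max_df=0.7)
print(vocab)
```

Here "the" is dropped for being too common, and n-grams seen in a single document are dropped for being too rare.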

In [60]:
test_pipeline(Pipeline([('cvect', CountVectorizer(stop_words='english')),
                      ('tfidf', TfidfTransformer(norm=None)),
                      ('sgdc', SGDClassifier(loss='hinge', penalty='l2',
                                            alpha=1e-4, random_state=42,
                                            max_iter=40 )),
                         ]), verbose=False, name='norm = None')


F1 = 0.746 
Accuracy = 0.753
time = 7.385 sec.

In [61]:
test_pipeline(Pipeline([('cvect', CountVectorizer(stop_words='english')),
                      ('tfidf', TfidfTransformer(sublinear_tf=True)),
                      ('sgdc', SGDClassifier(loss='hinge', penalty='l2',
                                            alpha=1e-4, random_state=42,
                                            max_iter=40 )),
                         ]), verbose=False, name='sublinear_tf=True')


F1 = 0.852 
Accuracy = 0.859
time = 6.955 sec.
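
According to the scikit-learn documentation, sublinear_tf=True replaces each raw term count tf with 1 + log(tf), which dampens the effect of a word that is repeated many times in one post. A minimal sketch of that transformation (the function name sublinear_tf is only for illustration):

```python
import math

def sublinear_tf(count):
    """Replace a raw term count with 1 + ln(count); zero counts stay zero."""
    return 1.0 + math.log(count) if count > 0 else 0.0

# A word used 100 times ends up weighted only about 5.6 times
# as much as a word used once.
for count in (1, 2, 10, 100):
    print(count, round(sublinear_tf(count), 3))
```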

In [62]:
test_pipeline(Pipeline([('cvect', CountVectorizer(stop_words='english')),
                      ('tfidf', TfidfTransformer(norm='l1')),
                      ('sgdc', SGDClassifier(loss='hinge', penalty='l2',
                                            alpha=1e-4, random_state=42,
                                            max_iter=40 )),
                         ]), verbose=False, name='norm=l1')


F1 = 0.810 
Accuracy = 0.823
time = 7.500 sec.
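
The norm option rescales each document's feature vector so that long and short posts become comparable. In the runs above the default norm='l2' gave a higher F1 than norm='l1' or norm=None. The sketch below shows the two normalizations on plain Python lists; the function name normalize and the example vectors are assumptions made for illustration.

```python
import math

def normalize(vec, norm="l2"):
    """Scale a vector to unit l1 or l2 length; norm=None returns it unchanged."""
    if norm is None:
        return list(vec)
    if norm == "l1":
        scale = sum(abs(v) for v in vec)
    elif norm == "l2":
        scale = math.sqrt(sum(v * v for v in vec))
    else:
        raise ValueError("norm must be 'l1', 'l2' or None")
    return [v / scale for v in vec] if scale else list(vec)

# Same word proportions, but the second post is ten times longer
short_doc = [1.0, 2.0, 0.0]
long_doc = [10.0, 20.0, 0.0]

print(normalize(short_doc, "l2"), normalize(long_doc, "l2"))
print(normalize(short_doc, "l1"), normalize(long_doc, "l1"))
print(normalize(short_doc, None), normalize(long_doc, None))
```

With either norm the two posts map to the same vector; without normalization the longer post has ten times larger feature values.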

In [63]:
test_pipeline(Pipeline([('cvect', CountVectorizer(stop_words='english', ngram_range=(1,3),
                                                max_df = 0.8, min_df=2)),
                      ('tfidf', TfidfTransformer()),
                      ('sgdc', SGDClassifier(loss='hinge', penalty='l2',
                                            alpha=1e-4, random_state=42,
                                            max_iter=40 )),
                         ]), verbose=False, name='ngram_range=(1,3)')


F1 = 0.848 
Accuracy = 0.853
time = 31.060 sec.

In [64]:
test_pipeline(Pipeline([('cvect', CountVectorizer(stop_words='english', ngram_range=(1,2), 
                                                max_df = 0.8, min_df=2)),
                      ('tfidf', TfidfTransformer()),
                      ('sgdc', SGDClassifier(loss='hinge', penalty='l2',
                                            alpha=1e-4, random_state=42,
                                            max_iter=40 )),
                         ]), verbose=False, name = 'ngram_range=(1,2), max_df = 0.8, min_df=2')


F1 = 0.850 
Accuracy = 0.856
time = 17.937 sec.

In [65]:
test_pipeline(Pipeline([('cvect', CountVectorizer(stop_words='english', ngram_range=(1,2), 
                                                  max_df = 0.8, min_df=2)),
                      ('tfidf', TfidfTransformer()),
                      ('sgdc', SGDClassifier(loss='hinge', penalty='l2',
                                            alpha=1e-4, random_state=42,
                                            max_iter=40 , n_jobs=-1)),
                         ]), verbose=False, name='ngram_range=(1,2), max_df = 0.8, min_df=2')


F1 = 0.850 
Accuracy = 0.856
time = 17.200 sec.

In [66]:
test_pipeline(Pipeline([('cvect', CountVectorizer(stop_words='english', token_pattern="[a-zA-Z]{3,}")),
                      ('tfidf', TfidfTransformer()),
                      ('sgdc', SGDClassifier(loss='hinge', penalty='l2',
                                            alpha=1e-4, random_state=42,
                                            max_iter=40 , n_jobs=-1)),
                         ]), verbose=False, name='no numbers')


F1 = 0.839 
Accuracy = 0.847
time = 3.978 sec.

In [67]:
test_pipeline(Pipeline([('cvect', CountVectorizer(stop_words='english', token_pattern="[a-zA-Z0-9.-]{1,}")),
                      ('tfidf', TfidfTransformer()),
                      ('sgdc', SGDClassifier(loss='hinge', penalty='l2',
                                            alpha=1e-4, random_state=42,
                                            max_iter=40 , n_jobs=-1)),
                         ]), verbose=False, name='dots in words')


F1 = 0.847 
Accuracy = 0.854
time = 5.745 sec.
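
To see what the two token_pattern values in the last two tests actually match, the patterns can be tried directly with Python's re module. This is only an approximation of CountVectorizer, which by default also lowercases the text and removes stop words; the sample sentence is made up.

```python
import re

# Lowercase first, as CountVectorizer does by default
text = "Mail me at henry@zoo.toronto.edu about the 3.5 inch drive".lower()

# 'no numbers': runs of at least three letters
letters_only = re.findall(r"[a-zA-Z]{3,}", text)

# 'dots in words': letters, digits, dots and hyphens stay together
with_dots = re.findall(r"[a-zA-Z0-9.-]{1,}", text)

print(letters_only)
print(with_dots)
```

The second pattern keeps host names such as zoo.toronto.edu and numbers such as 3.5 as single tokens, while the first splits host names into parts and drops digits and short words.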

In [68]:
test_pipeline(Pipeline([('cvect', CountVectorizer(stop_words='english', 
                                                 ngram_range=(1,2), min_df=3,max_df=0.8)),
                      ('tfidf', TfidfTransformer(sublinear_tf=True)),
                      ('sgdc', SGDClassifier(loss='hinge', penalty='l2',
                                            alpha=1e-4, random_state=42,
                                            max_iter=40 , n_jobs=-1)),
                         ]), verbose=False, name='ngram_range=(1,2), min_df=3,max_df=0.8, sublinear_tf')


F1 = 0.859 
Accuracy = 0.865
time = 14.211 sec.

Char N-grams


In [69]:
for n in range(3,8,1):
    print('\nN-grams = '+ str(n))
    test_pipeline(Pipeline([('cvect', CountVectorizer(analyzer='char', ngram_range=(n,n), 
                                                     min_df=2, max_df=0.9)),
                      ('tfidf', TfidfTransformer(sublinear_tf=True)),
                      ('sgdc', SGDClassifier(loss='hinge', penalty='l2',
                                            alpha=1e-4, random_state=42,
                                            max_iter=40 , n_jobs=-1)),
                         ]), verbose=False, name='char ngram ' + str(n) + ' + sublinear_tf')


N-grams = 3
F1 = 0.842 
Accuracy = 0.850
time = 24.555 sec.

N-grams = 4
F1 = 0.860 
Accuracy = 0.867
time = 42.013 sec.

N-grams = 5
F1 = 0.863 
Accuracy = 0.870
time = 76.980 sec.

N-grams = 6
F1 = 0.856 
Accuracy = 0.863
time = 101.537 sec.

N-grams = 7
F1 = 0.851 
Accuracy = 0.858
time = 113.803 sec.
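
Character n-grams slide a fixed-width window over the raw text, which is why the feature tables earlier contain fragments such as "gun c" and "un co". A simplified sketch of the extraction (the function name char_ngrams is hypothetical, and any whitespace handling done by CountVectorizer is ignored here):

```python
def char_ngrams(text, n):
    """Return every window of n consecutive characters from the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

grams = char_ngrams("Gun control", 5)
print(grams)
```

A text of length L yields L - n + 1 windows, so the number of features, and with it the run time seen above, grows as longer n-grams make more of the windows distinct.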

SVM Model

SVC
LinearSVC


In [70]:
from sklearn.svm import SVC
from sklearn.decomposition import TruncatedSVD
test_pipeline(Pipeline([('cvect', CountVectorizer(stop_words='english')),
                      ('tfidf', TfidfTransformer()),
                      ('svd', TruncatedSVD(n_components=300)),
                      ('svc', SVC(kernel='linear', C=10)),
                         ]), verbose=False, name='SVC + TruncatedSVD')


F1 = 0.783 
Accuracy = 0.788
time = 52.128 sec.

In [71]:
from sklearn.svm import SVC
from sklearn.decomposition import TruncatedSVD
test_pipeline(Pipeline([('cvect', CountVectorizer(stop_words='english')),
                      ('tfidf', TfidfTransformer()),
                      ('svc', SVC(kernel='linear')),
                         ]), verbose=False, name='SVC')


F1 = 0.830 
Accuracy = 0.835
time = 154.554 sec.

In [72]:
from sklearn.svm import LinearSVC
from sklearn.decomposition import TruncatedSVD
test_pipeline(Pipeline([('cvect', CountVectorizer(stop_words='english')),
                      ('tfidf', TfidfTransformer()),
                      ('sgdc', LinearSVC(C=10)),
                         ]), verbose=False, name='LinearSVC, C=10')


F1 = 0.841 
Accuracy = 0.847
time = 9.021 sec.

In [73]:
from sklearn.svm import LinearSVC
from sklearn.decomposition import TruncatedSVD
test_pipeline(Pipeline([('cvect', CountVectorizer(stop_words='english')),
                      ('tfidf', TfidfTransformer()),
                      ('sgdc', LinearSVC(C=1)),
                         ]), verbose=False, name='LinearSVC, C=1')


F1 = 0.845 
Accuracy = 0.851
time = 5.456 sec.

In [74]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler
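# These helpers moved to a private module in newer scikit-learn releases
try:
    from sklearn.datasets._twenty_newsgroups import strip_newsgroup_footer
    from sklearn.datasets._twenty_newsgroups import strip_newsgroup_quoting
except ImportError:  # older releases, as used when this notebook was run
    from sklearn.datasets.twenty_newsgroups import strip_newsgroup_footer
    from sklearn.datasets.twenty_newsgroups import strip_newsgroup_quoting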

In [75]:
class ItemSelector(BaseEstimator, TransformerMixin):
    """For data grouped by feature, select subset of data at a provided key.

    The data is expected to be stored in a 2D data structure, where the first
    index is over features and the second is over samples.  i.e.

    >> len(data[key]) == n_samples

    Please note that this is the opposite convention to scikit-learn feature
    matrixes (where the first index corresponds to sample).

    ItemSelector only requires that the collection implement getitem
    (data[key]).  Examples include: a dict of lists, 2D numpy array, Pandas
    DataFrame, numpy record array, etc.

    >> data = {'a': [1, 5, 2, 5, 2, 8],
               'b': [9, 4, 1, 4, 1, 3]}
    >> ds = ItemSelector(key='a')
    >> data['a'] == ds.transform(data)

    ItemSelector is not designed to handle data grouped by sample.  (e.g. a
    list of dicts).  If your data is structured this way, consider a
    transformer along the lines of `sklearn.feature_extraction.DictVectorizer`.

    Parameters
    ----------
    key : hashable, required
        The key corresponding to the desired value in a mappable.
    """
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]


class TextStats(BaseEstimator, TransformerMixin):
    """Extract features from each document for DictVectorizer"""

    def fit(self, x, y=None):
        return self

    def transform(self, posts):
        return [{'length': len(text),
                 'num_sentences': text.count('.'),
                 'num_questions': text.count('?') ,
                 'num_dollars': text.count('$'), 
                 'num_percent': text.count('%'), 
                 'num_exclamations': text.count('!'), 
                }
                for text in posts]


class SubjectBodyExtractor(BaseEstimator, TransformerMixin):
    """Extract the subject & body from a usenet post in a single pass.

    Takes a sequence of strings and produces a dict of sequences.  Keys are
    `subject` and `body`.
    """
    def fit(self, x, y=None):
        return self

    def transform(self, posts):
        features = np.recarray(shape=(len(posts),),
                               dtype=[('subject', object), ('body', object)])
        for i, text in enumerate(posts):
            headers, _, bod = text.partition('\n\n')
            bod = strip_newsgroup_footer(bod)
            bod = strip_newsgroup_quoting(bod)
            features['body'][i] = bod

            prefix = 'Subject:'
            sub = ''
            for line in headers.split('\n'):
                if line.startswith(prefix):
                    sub = line[len(prefix):]
                    break
            features['subject'][i] = sub

        return features


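class Printer(BaseEstimator, TransformerMixin):
    """Debugging aid: print the first item of the input for the first `count` calls."""

    def __init__(self, count):
        self.count = count

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        # Pass the data through unchanged; only print while count is positive
        if self.count > 0:
            self.count -= 1
            print(x[0])
        return x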
    
    
pipeline = Pipeline([
    # Extract the subject & body
    ('subjectbody', SubjectBodyExtractor()),

    # Use FeatureUnion to combine the features from subject and body
    ('union', FeatureUnion(n_jobs=-1,
        transformer_list=[

            # Pipeline for pulling features from the post's subject line
            ('subject', Pipeline([
                ('selector', ItemSelector(key='subject')),
                ('tfidf', TfidfVectorizer(min_df=1)),
            ])),

            # Pipeline for standard bag-of-words model for body
            ('body_bow', Pipeline([
                ('selector', ItemSelector(key='body')),
                ('tfidf', TfidfVectorizer()),
            ])),
            
            # Pipeline for pulling ad hoc features from post's body
            ('body_stats', Pipeline([
                ('selector', ItemSelector(key='body')),
                ('stats', TextStats()),  # returns a list of dicts
                ('cvect', DictVectorizer()),  # list of dicts -> feature matrix
                #('print',Printer(1)),
                # scaling is needed so SGD model will have balanced feature gradients
                ('scale', StandardScaler(copy=False, with_mean=False, with_std=True) ),
                #('print2',Printer(1)),
            ])),
        ],

        # weight components in FeatureUnion
        transformer_weights={
            'subject': 1,
            'body_bow': 1,
            'body_stats': .1,
        },
        
    )),
    #('print',Printer(1)),
    # Use a SVC classifier on the combined features
    #('svc', SVC(kernel='linear')),
    ('sgdc', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=5 )),
])


test_pipeline(pipeline, verbose=False, name='metadata')


F1 = 0.772 
Accuracy = 0.779
time = 8.225 sec.
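
The metadata pipeline above depends on splitting each post into a subject line and a body, then deriving a few count features from the body. The sketch below reproduces just that part with the standard library, assuming the same header layout. It skips the footer and quote stripping, and unlike the class above it trims whitespace around the subject. The names split_post and text_stats and the sample post are hypothetical.

```python
def split_post(text):
    """Split a usenet post into (subject, body) at the first blank line."""
    headers, _, body = text.partition("\n\n")
    subject = ""
    for line in headers.split("\n"):
        if line.startswith("Subject:"):
            subject = line[len("Subject:"):].strip()
            break
    return subject, body

def text_stats(body):
    """A few of the ad hoc count features used by TextStats."""
    return {"length": len(body),
            "num_sentences": body.count("."),
            "num_questions": body.count("?")}

post = ("From: someone@example.com\n"
        "Subject: Re: gun control\n"
        "Lines: 2\n"
        "\n"
        "Is this right? I think so.")
subject, body = split_post(post)
stats = text_stats(body)
print(subject)
print(stats)
```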

Randomized Parameter Search

RandomizedSearchCV


In [76]:
from scipy.stats import expon as sp_expon
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform

Random variable of 10^x with x uniformly distributed


In [77]:
r = sp_uniform(loc=5,scale=2).rvs(size=1000*1000)

fig, ax = plt.subplots(1, 1, figsize=(12, 5))
ax.hist(r,  bins=100)
plt.show()



In [78]:
def geometric_sample(power_min, power_max, sample_size):
    dist = sp_uniform(loc=power_min, scale=power_max-power_min)
    return np.power(10, dist.rvs(size=sample_size))

geometric_sample(1,6,50)


Out[78]:
array([  1.92896544e+01,   6.83289355e+05,   1.95385196e+04,
         1.20947891e+05,   8.33795910e+04,   3.05388861e+02,
         9.66487850e+01,   2.84030906e+03,   2.20135512e+02,
         1.18118731e+05,   5.54893953e+05,   7.73359579e+02,
         2.31922296e+02,   2.93040871e+03,   3.41772071e+03,
         3.08408903e+02,   1.51417542e+04,   4.44420516e+02,
         3.52444813e+03,   9.52551850e+01,   3.56450292e+02,
         7.61391478e+05,   6.10989850e+05,   1.28923699e+04,
         4.14388670e+03,   1.37949930e+02,   3.03808359e+05,
         2.80071916e+03,   2.36883674e+03,   2.28782530e+02,
         1.69987435e+03,   6.16399317e+01,   3.47855278e+04,
         2.58338539e+05,   1.86128121e+03,   9.21084971e+05,
         1.80319635e+03,   2.17500489e+03,   5.06344048e+04,
         5.92700245e+02,   7.90825307e+03,   1.29513406e+05,
         1.17371924e+02,   3.33168152e+01,   4.88239782e+03,
         1.54286837e+05,   5.54862492e+05,   7.08711990e+04,
         4.91055287e+04,   3.06866519e+03])
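
Sampling the exponent uniformly and then taking 10^x spreads the samples evenly across orders of magnitude, which suits parameters such as alpha that are searched over many decades. The sketch below checks this with the standard library only; the function name log_uniform and the fixed seed are assumptions for the example. SciPy 1.4 and later also provide scipy.stats.loguniform, which may be a convenient alternative.

```python
import math
import random

def log_uniform(power_min, power_max, size, seed=0):
    """Draw `size` values 10**x with x uniform in [power_min, power_max]."""
    rng = random.Random(seed)
    return [10 ** rng.uniform(power_min, power_max) for _ in range(size)]

samples = log_uniform(-8, -3, 10000)

# Count how many samples fall in each decade [10^d, 10^(d+1))
decades = {}
for s in samples:
    d = math.floor(math.log10(s))
    decades[d] = decades.get(d, 0) + 1

for d in sorted(decades):
    print(f"1e{d} to 1e{d + 1}: {decades[d]}")
```

Each decade receives roughly the same number of samples, whereas sampling alpha uniformly between 1e-8 and 1e-3 would put about 90% of the values in the top decade.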

Random Parameter Search

RandomizedSearchCV

Example


In [79]:
from sklearn.model_selection import RandomizedSearchCV

pipeline = Pipeline([('cvect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('sgdc', SGDClassifier( random_state=42 )),
                         ])

#ngram_range=(1,2), max_df = 0.8, min_df=2

param_dist = {"cvect__stop_words": [None,'english'],
              "cvect__ngram_range": [(1,1),(1,2)],
              "cvect__min_df": sp_randint(1, 6),
              "cvect__max_df": sp_uniform(loc=0.5, scale=0.5), # range is (loc, loc+scale)
              "tfidf__sublinear_tf": [True,False],
              "tfidf__norm": [None, 'l1', 'l2'],
              "sgdc__max_iter": sp_randint(5, 40),
              "sgdc__loss": ['hinge','log'],
              "sgdc__alpha": geometric_sample(-8,-3,10000),
              }

# n_iter - number of random models to evaluate
# n_jobs = -1 to run in parallel on all cores
# cv = 3 , 3-fold cross validation
# scoring='f1_macro' , averages the F1 for each target class

rs = RandomizedSearchCV(pipeline, param_distributions=param_dist,
                        n_iter=5, n_jobs=-1, cv=3, return_train_score=False, 
                        verbose=1, scoring='f1_macro', random_state=42)

test_pipeline(rs, verbose=False, name='Random Parameter Search')

Audio(url='./Beep 2.wav', autoplay=True)


Fitting 3 folds for each of 5 candidates, totalling 15 fits
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   23.9s finished
F1 = 0.830 
Accuracy = 0.838
time = 29.103 sec.
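
Conceptually, the randomized search draws n_iter independent candidates from param_dist: values given as lists are picked uniformly and values given as distributions are sampled. The sketch below imitates only that sampling step with the standard library; the real RandomizedSearchCV also cross-validates each candidate and refits the best one. The function sample_params and the lambda-based distributions are stand-ins invented for this example.

```python
import random

def sample_params(param_dist, rng):
    """Draw one candidate: lists are sampled uniformly, callables are asked for a value."""
    params = {}
    for name, spec in sorted(param_dist.items()):
        if callable(spec):
            params[name] = spec(rng)
        else:
            params[name] = rng.choice(spec)
    return params

param_dist = {
    "cvect__ngram_range": [(1, 1), (1, 2)],
    "cvect__min_df": lambda rng: rng.randint(1, 5),        # like sp_randint(1, 6)
    "cvect__max_df": lambda rng: rng.uniform(0.5, 1.0),    # like sp_uniform(loc=0.5, scale=0.5)
    "sgdc__loss": ["hinge", "log"],
    "sgdc__alpha": lambda rng: 10 ** rng.uniform(-8, -3),  # log-uniform, as in geometric_sample
}

rng = random.Random(42)
candidates = [sample_params(param_dist, rng) for _ in range(5)]
for c in candidates:
    print(c)
```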

In [80]:
#pd.get_option("display.max_columns")
pd.set_option("display.max_columns", 40)

header('Best')
display( pd.DataFrame.from_dict(rs.best_params_, orient= 'index') )

header('All Results')
df = pd.DataFrame(rs.cv_results_)
df = df.sort_values(['rank_test_score'])
display(df)


Best

0
cvect__max_df 0.577997
cvect__min_df 3
cvect__ngram_range (1, 1)
cvect__stop_words english
sgdc__alpha 8.0991e-07
sgdc__loss log
sgdc__max_iter 6
tfidf__norm l1
tfidf__sublinear_tf False

All Results

mean_fit_time mean_score_time mean_test_score param_cvect__max_df param_cvect__min_df param_cvect__ngram_range param_cvect__stop_words param_sgdc__alpha param_sgdc__loss param_sgdc__max_iter param_tfidf__norm param_tfidf__sublinear_tf params rank_test_score split0_test_score split1_test_score split2_test_score std_fit_time std_score_time std_test_score
1 4.677818 1.673881 0.886112 0.577997 3 (1, 1) english 8.0991e-07 log 6 l1 False {'cvect__max_df': 0.577997260168, 'cvect__min_... 1 0.884335 0.886725 0.887280 0.106748 0.073919 0.001278
4 12.823025 1.979624 0.880997 0.545303 3 (1, 2) None 3.51484e-06 log 7 None True {'cvect__max_df': 0.545303217266, 'cvect__min_... 2 0.877519 0.882615 0.882863 0.123957 0.070086 0.002463
0 6.218041 1.776505 0.869679 0.68727 5 (1, 1) None 3.20011e-08 hinge 25 l2 False {'cvect__max_df': 0.687270059424, 'cvect__min_... 3 0.865980 0.872571 0.870490 0.143012 0.044473 0.002752
3 8.028108 1.310651 0.835594 0.511531 3 (1, 1) None 0.000592011 log 32 l2 False {'cvect__max_df': 0.511531212521, 'cvect__min_... 4 0.831645 0.843733 0.831397 0.007954 0.154820 0.005758
2 6.339887 1.760869 0.825414 0.500389 4 (1, 1) None 9.94494e-06 log 26 None False {'cvect__max_df': 0.500389382921, 'cvect__min_... 5 0.833237 0.834292 0.808678 0.283259 0.057872 0.011829
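
The mean_test_score column is the macro-averaged F1 requested with scoring='f1_macro': F1 is computed separately for each class and the per-class values are averaged with equal weight, so small newsgroups count as much as large ones. A minimal pure-Python sketch of that metric follows (the function name macro_f1 and the toy labels are made up for illustration):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, with F1 = 2*TP / (2*TP + FP + FN)."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

y_true = ["a", "a", "b", "b", "c"]
y_pred = ["a", "b", "b", "b", "c"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

In this toy case the per-class F1 values are 2/3, 4/5 and 1, so the macro average is 37/45.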

Plot Parameter Performance


In [81]:
df = df.apply(pd.to_numeric, errors='ignore')
prefix = 'param_'
param_col = [col for col in df.columns if col.startswith(prefix) ]
for col in param_col:
    name = col[len(prefix):]
    header(name)
    if(df[col].dtype == np.float64 or df[col].dtype == np.int64):
        print( 'scatter')
        df.plot(kind='scatter', x=col, y='mean_test_score', figsize=(15,10))
        plt.show()
    else:
        mean = df[[col,'mean_test_score']].fillna(value='None').groupby(col).mean()
        mean.plot(kind='bar', figsize=(10,10))
        plt.show()


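[Plots omitted: scatter plots of mean_test_score against cvect__max_df, cvect__min_df, sgdc__alpha and sgdc__max_iter; bar charts of the average mean_test_score for each value of cvect__ngram_range, cvect__stop_words, sgdc__loss, tfidf__norm and tfidf__sublinear_tf]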

All Tests


In [82]:
tests_df = pd.DataFrame.from_dict(tests, orient='index')
tests_df = tests_df.drop(['Name'], axis=1)
# The stored values are accuracy first, then F1 (compare with the per-test output above)
tests_df.columns = ['Accuracy', 'F1', 'Time (sec.)', 'Details']
tests_df = tests_df.sort_values(by=['Accuracy'], ascending=False)
display(tests_df)

header('Best Model')
display(tests_df.head(1))
print(tests_df['Details'].values[0])


Accuracy F1 Time (sec.) Details
char ngram 5 + sublinear_tf 0.869888 0.863004 76.979677 {'memory': None, 'steps': [('cvect', CountVect...
char ngram 4 + sublinear_tf 0.866702 0.859743 42.012700 {'memory': None, 'steps': [('cvect', CountVect...
ngram_range=(1,2), min_df=3,max_df=0.8, sublinear_tf 0.865242 0.859321 14.210797 {'memory': None, 'steps': [('cvect', CountVect...
char ngram 6 + sublinear_tf 0.862985 0.856477 101.537202 {'memory': None, 'steps': [('cvect', CountVect...
sublinear_tf=True 0.859400 0.852128 6.955444 {'memory': None, 'steps': [('cvect', CountVect...
char ngram 7 + sublinear_tf 0.857541 0.851403 113.802729 {'memory': None, 'steps': [('cvect', CountVect...
ngram_range=(1,2), max_df = 0.8, min_df=2 0.856213 0.850155 17.200412 {'memory': None, 'steps': [('cvect', CountVect...
ngram_range=(1,2) 0.856213 0.850155 17.337156 {'memory': None, 'steps': [('cvect', CountVect...
dots in words 0.854488 0.847328 5.744998 {'memory': None, 'steps': [('cvect', CountVect...
hinge loss 0.853691 0.846033 8.372657 {'memory': None, 'steps': [('cvect', CountVect...
ngram_range=(1,3) 0.853425 0.847773 31.060322 {'memory': None, 'steps': [('cvect', CountVect...
stopwords 0.851434 0.844401 7.025221 {'memory': None, 'steps': [('cvect', CountVect...
LinearSVC, C=1 0.851036 0.844743 5.456266 {'memory': None, 'steps': [('cvect', CountVect...
char ngram 3 + sublinear_tf 0.850239 0.842296 24.555441 {'memory': None, 'steps': [('cvect', CountVect...
LogisticRegression multinomial C=100 0.847053 0.840856 49.159303 {'memory': None, 'steps': [('cvect', CountVect...
LinearSVC, C=10 0.846920 0.841163 9.021482 {'memory': None, 'steps': [('cvect', CountVect...
no numbers 0.846787 0.839302 3.978191 {'memory': None, 'steps': [('cvect', CountVect...
log loss 0.846654 0.840403 6.092606 {'memory': None, 'steps': [('cvect', CountVect...
LogisticRegression multinomial C=1000 0.845858 0.839522 50.856773 {'memory': None, 'steps': [('cvect', CountVect...
LogisticRegression multinomial C=10 0.844530 0.837837 15.442162 {'memory': None, 'steps': [('cvect', CountVect...
Random Parameter Search 0.837759 0.829541 29.103136 {'cv': 3, 'error_score': 'raise', 'estimator__...
SVC 0.834971 0.829670 154.554208 {'memory': None, 'steps': [('cvect', CountVect...
LogisticRegression ovr 0.827801 0.818612 47.392265 {'memory': None, 'steps': [('cvect', CountVect...
LogisticRegression multinomial 0.827403 0.819329 8.674405 {'memory': None, 'steps': [('cvect', CountVect...
log loss no regularization 0.825412 0.819152 6.112183 {'memory': None, 'steps': [('cvect', CountVect...
norm=l1 0.823155 0.809600 7.499916 {'memory': None, 'steps': [('cvect', CountVect...
use_idf=False 0.815321 0.803527 8.694123 {'memory': None, 'steps': [('cvect', CountVect...
SVC + TruncatedSVD 0.787971 0.782752 52.127526 {'memory': None, 'steps': [('cvect', CountVect...
metadata 0.779076 0.771931 8.225383 {'memory': None, 'steps': [('subjectbody', Sub...
MultinomialNB 0.773898 0.755754 4.628393 {'memory': None, 'steps': [('cvect', CountVect...
norm = None 0.753186 0.745847 7.384666 {'memory': None, 'steps': [('cvect', CountVect...
NearestCentroid 0.692114 0.693867 4.217473 {'memory': None, 'steps': [('cvect', CountVect...
KNN n=4 distance weights 0.684015 0.678298 12.718538 {'memory': None, 'steps': [('cvect', CountVect...
KNN n=6 distance weights 0.680696 0.674070 12.040544 {'memory': None, 'steps': [('cvect', CountVect...
KNN n=3 distance weights 0.680165 0.674523 11.714124 {'memory': None, 'steps': [('cvect', CountVect...
KNN n=5 distance weights 0.678306 0.672070 12.580661 {'memory': None, 'steps': [('cvect', CountVect...
KNN n=2 distance weights 0.672464 0.667350 11.513792 {'memory': None, 'steps': [('cvect', CountVect...
KNN n=1 0.672464 0.667350 11.949279 {'memory': None, 'steps': [('cvect', CountVect...
KNN n=1 distance weights 0.672464 0.667350 11.402045 {'memory': None, 'steps': [('cvect', CountVect...
KNN n=6 0.660515 0.654841 11.862630 {'memory': None, 'steps': [('cvect', CountVect...
KNN n=5 0.659187 0.654903 11.996723 {'memory': None, 'steps': [('cvect', CountVect...
KNN n=3 0.657860 0.656407 11.676285 {'memory': None, 'steps': [('cvect', CountVect...
KNN n=4 0.656399 0.653836 11.962758 {'memory': None, 'steps': [('cvect', CountVect...
KNN n=2 0.640733 0.638887 11.941925 {'memory': None, 'steps': [('cvect', CountVect...

Best Model

                                   F1  Accuracy  Time (sec.)                                            Details
char ngram 5 + sublinear_tf  0.869888  0.863004    76.979677  {'memory': None, 'steps': [('cvect', CountVect...
{'memory': None, 'steps': [('cvect', CountVectorizer(analyzer='char', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.9, max_features=None, min_df=2,
        ngram_range=(5, 5), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)), ('tfidf', TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=True, use_idf=True)), ('sgdc', SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=40, n_iter=None,
       n_jobs=-1, penalty='l2', power_t=0.5, random_state=42, shuffle=True,
       tol=None, verbose=0, warm_start=False))], 'cvect': CountVectorizer(analyzer='char', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.9, max_features=None, min_df=2,
        ngram_range=(5, 5), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None), 'tfidf': TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=True, use_idf=True), 'sgdc': SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=40, n_iter=None,
       n_jobs=-1, penalty='l2', power_t=0.5, random_state=42, shuffle=True,
       tol=None, verbose=0, warm_start=False), 'cvect__analyzer': 'char', 'cvect__binary': False, 'cvect__decode_error': 'strict', 'cvect__dtype': <class 'numpy.int64'>, 'cvect__encoding': 'utf-8', 'cvect__input': 'content', 'cvect__lowercase': True, 'cvect__max_df': 0.9, 'cvect__max_features': None, 'cvect__min_df': 2, 'cvect__ngram_range': (5, 5), 'cvect__preprocessor': None, 'cvect__stop_words': None, 'cvect__strip_accents': None, 'cvect__token_pattern': '(?u)\\b\\w\\w+\\b', 'cvect__tokenizer': None, 'cvect__vocabulary': None, 'tfidf__norm': 'l2', 'tfidf__smooth_idf': True, 'tfidf__sublinear_tf': True, 'tfidf__use_idf': True, 'sgdc__alpha': 0.0001, 'sgdc__average': False, 'sgdc__class_weight': None, 'sgdc__epsilon': 0.1, 'sgdc__eta0': 0.0, 'sgdc__fit_intercept': True, 'sgdc__l1_ratio': 0.15, 'sgdc__learning_rate': 'optimal', 'sgdc__loss': 'hinge', 'sgdc__max_iter': 40, 'sgdc__n_iter': None, 'sgdc__n_jobs': -1, 'sgdc__penalty': 'l2', 'sgdc__power_t': 0.5, 'sgdc__random_state': 42, 'sgdc__shuffle': True, 'sgdc__tol': None, 'sgdc__verbose': 0, 'sgdc__warm_start': False}
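
The winning configuration above counts character 5-grams (analyzer='char', ngram_range=(5, 5)) and applies sublinear_tf=True, which replaces each raw count tf with 1 + ln(tf). The following is a minimal standard-library sketch of just those two steps, to show what the features look like. It is a simplification, not the pipeline used above: it leaves out the IDF weighting, the min_df/max_df filtering and the whitespace normalization that scikit-learn applies, and the helper names (char_ngrams, sublinear_tf, l2_normalize) are made up for this illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=5):
    """Return all overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def sublinear_tf(counts):
    """Replace each raw count tf with 1 + ln(tf) to dampen repeated n-grams."""
    return {gram: 1.0 + math.log(tf) for gram, tf in counts.items()}


def l2_normalize(weights):
    """Scale the weights so the vector has unit Euclidean length (norm='l2')."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {gram: w / norm for gram, w in weights.items()}


# Hypothetical toy document, used only for illustration
doc = "banana banana"
counts = Counter(char_ngrams(doc))   # raw 5-gram counts
weights = sublinear_tf(counts)       # dampened counts
features = l2_normalize(weights)     # unit-length feature vector

for gram, weight in sorted(features.items()):
    print(repr(gram), counts[gram], round(weight, 4))
```

With sublinear scaling, an n-gram that occurs twice gets a weight of about 1.69 instead of 2, so long posts that repeat the same phrase many times dominate the feature vector less.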

In [84]:
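# Bar chart of the F1 score per model; the y-axis starts at 0.6 so the differences are visible
plt.figure(figsize=(13,5))
tests_df['F1'].plot(kind='bar', ylim=(0.6,None))

# Play a beep to signal that the run has finished
Audio(url='./Beep 2.wav', autoplay=True)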

