Working through scikit-learn's "Working with Text Data" tutorial from the text analytics section, found at http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#tutorial-setup.


In [12]:
from sklearn.datasets import fetch_20newsgroups
import sklearn.datasets
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn import metrics
import numpy as np

datadir = "data/twenty_newsgroups/"

I already downloaded the data, so I can simply give scikit-learn the path. load_files expects a directory with one subfolder per category.


In [5]:
# Sanity check that the files load from disk; the training set itself
# is built with fetch_20newsgroups below.
sklearn.datasets.load_files("%s/20news-bydate-train" % datadir)
print("Loaded data")


Loaded data

In [6]:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med', 'sci.crypt']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

In [7]:
print("First 10 of %d documents are of category:" % len(twenty_train.data))
for cat in twenty_train.target[:10]:
    print(twenty_train.target_names[cat])


First 10 of 2852 documents are of category:
soc.religion.christian
alt.atheism
sci.crypt
sci.med
comp.graphics
soc.religion.christian
comp.graphics
alt.atheism
alt.atheism
alt.atheism

In [8]:
cv = CountVectorizer()
X_train_counts = cv.fit_transform(twenty_train.data)
X_train_counts.shape


Out[8]:
(2852, 42731)

In [9]:
print("There are %d distinct words in all of the documents." % len(cv.vocabulary_))


There are 42731 distinct words in all of the documents.
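For reference, cv.vocabulary_ maps each token to its column index in the count matrix, so a word's total count across the corpus can be looked up directly. A minimal sketch (the word chosen is just an illustrative guess):

# vocabulary_ maps token -> column index in X_train_counts
idx = cv.vocabulary_.get('encryption')  # None if the token never appeared
if idx is not None:
    print("'encryption' is column %d and occurs %d times overall."
          % (idx, X_train_counts[:, idx].sum()))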

TfidfTransformer does two things (a toy example follows the list):

  1. It divides each word count by the total number of words in that document, giving a term frequency (tf) rather than a raw count.
  2. It scales down the weight of words that occur in many documents across the corpus (the inverse document frequency, idf), since very common words carry little information.
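A minimal self-contained sketch of both effects on a made-up three-document corpus: every word occurs once per document, but 'the' occurs in all three documents, so its idf shrinks its weight relative to the rarer words.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

toy = ['the cat sat', 'the dog sat', 'the cat ran']
counts = CountVectorizer().fit_transform(toy)
weights = TfidfTransformer().fit_transform(counts)
# Each row is a document; 'the' gets the lowest weight in every row
# because it appears in all documents.
print(weights.toarray().round(2))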

In [10]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape


Out[10]:
(2852, 42731)

The tutorial suggests a multinomial naive Bayes classifier, which is a natural fit for word-count features.


In [13]:
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
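For reference, the pipeline just chains the manual steps above, so this fit is equivalent to training the classifier on the tf-idf matrix built earlier:

# Equivalent to the pipeline's final step, using the matrices from above
nb = MultinomialNB().fit(X_train_tfidf, twenty_train.target)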

Test the classifier on some new strings.


In [15]:
docs_new = ['God is love', 'OpenGL on the GPU is fast', 'AES Encryption is the best. Way better than DES!',
            'Jesus, I love encryption. AES + RSA is the way to go.',
            'There is grandeur in this view of life, with its several powers, having been originally breathed into a few forms or into one; and that, whilst this planet has gone cycling on according to the fixed law of gravity, from so simple a beginning endless forms most beautiful and most wonderful have been, and are being, evolved.'
]
predictions = text_clf.predict(docs_new)

for doc, category in zip(docs_new, predictions):
    print('%r => %s' % (doc, twenty_train.target_names[category]))


'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics
'AES Encryption is the best. Way better than DES!' => sci.crypt
'Jesus, I love encryption. AES + RSA is the way to go.' => sci.crypt
'There is grandeur in this view of life, with its several powers, having been originally breathed into a few forms or into one; and that, whilst this planet has gone cycling on according to the fixed law of gravity, from so simple a beginning endless forms most beautiful and most wonderful have been, and are being, evolved.' => soc.religion.christian
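MultinomialNB also exposes class probabilities, so one way to inspect how confident these predictions are (not part of the tutorial) is predict_proba:

# Per-class probabilities; columns follow text_clf.classes_, which here
# indexes straight into twenty_train.target_names.
probs = text_clf.predict_proba(docs_new)
for doc, row in zip(docs_new, probs):
    best = row.argmax()
    print('%.2f  %s  %r' % (row[best], twenty_train.target_names[best], doc[:40]))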

Evaluate predictions on the held-out test set.


In [16]:
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)
docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)


Out[16]:
0.84246575342465757
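Taking the mean of the boolean match vector is just accuracy; metrics.accuracy_score computes the same number and makes the intent explicit:

# Identical to np.mean(predicted == twenty_test.target)
print(metrics.accuracy_score(twenty_test.target, predicted))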

Now try a linear SVM, trained with stochastic gradient descent.


In [17]:
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     # hinge loss + l2 penalty gives a linear SVM; n_iter
                     # was renamed max_iter in later scikit-learn releases
                     ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                           alpha=1e-3, n_iter=5)),
])
_ = text_clf.fit(twenty_train.data, twenty_train.target)
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)


Out[17]:
0.90358271865121176

In [18]:
print(metrics.classification_report(twenty_test.target, predicted,
    target_names=twenty_test.target_names))
metrics.confusion_matrix(twenty_test.target, predicted)


                        precision    recall  f1-score   support

           alt.atheism       0.94      0.79      0.86       319
         comp.graphics       0.82      0.97      0.88       389
             sci.crypt       0.97      0.93      0.95       396
               sci.med       0.95      0.86      0.90       396
soc.religion.christian       0.88      0.95      0.91       398

           avg / total       0.91      0.90      0.90      1898

Out[18]:
array([[251,  10,   2,  13,  43],
       [  2, 376,   7,   1,   3],
       [  1,  23, 370,   1,   1],
       [  6,  40,   3, 341,   6],
       [  6,  12,   0,   3, 377]])
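Rows of the confusion matrix are the true categories and columns the predicted ones, both in target_names order. A small sketch to attach the labels:

# Print each row next to its true category name
cm = metrics.confusion_matrix(twenty_test.target, predicted)
for name, row in zip(twenty_test.target_names, cm):
    print('%22s %s' % (name, row))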

How does the SVM classify docs_new?


In [19]:
predicted = text_clf.predict(docs_new)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))


'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics
'AES Encryption is the best. Way better than DES!' => sci.crypt
'Jesus, I love encryption. AES + RSA is the way to go.' => sci.crypt
'There is grandeur in this view of life, with its several powers, having been originally breathed into a few forms or into one; and that, whilst this planet has gone cycling on according to the fixed law of gravity, from so simple a beginning endless forms most beautiful and most wonderful have been, and are being, evolved.' => comp.graphics