Working through scikit-learn's "Working with Text Data" tutorial from the text analytics section, found at http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#tutorial-setup.


In [12]:
from sklearn.datasets import fetch_20newsgroups
import sklearn.datasets
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn import metrics
import numpy as np

datadir = "data/twenty_newsgroups/"

I already downloaded the data, so I can simply give scikit-learn the path. load_files expects a directory with one subfolder per category.


In [5]:
# Sanity check that the files load from disk; the training set itself
# is built with fetch_20newsgroups below.
sklearn.datasets.load_files("%s/20news-bydate-train" % datadir)
print("Loaded data")


Loaded data

In [6]:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med', 'sci.crypt']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

In [7]:
print("First 10 of %d documents are of category:" % len(twenty_train.data))
for cat in twenty_train.target[:10]:
    print(twenty_train.target_names[cat])


First 10 of 2852 documents are of category:
soc.religion.christian
alt.atheism
sci.crypt
sci.med
comp.graphics
soc.religion.christian
comp.graphics
alt.atheism
alt.atheism
alt.atheism

In [8]:
cv = CountVectorizer()
X_train_counts = cv.fit_transform(twenty_train.data)
X_train_counts.shape


Out[8]:
(2852, 42731)

In [9]:
print("There are %d distinct words in all of the documents." % len(cv.vocabulary_))


There are 42731 distinct words in all of the documents.
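For reference, cv.vocabulary_ maps each token to its column index in the count matrix, so a word's total count across the corpus can be looked up directly. A minimal sketch (the word chosen is just an illustrative guess):

# vocabulary_ maps token -> column index in X_train_counts
idx = cv.vocabulary_.get('encryption')  # None if the token never appeared
if idx is not None:
    print("'encryption' is column %d and occurs %d times overall."
          % (idx, X_train_counts[:, idx].sum()))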

TfidfTransformer does two things (a toy example follows the list):

  1. It divides each word count by the total number of words in that document, giving a term frequency (tf) rather than a raw count.
  2. It scales down the weight of words that occur in many documents across the corpus (the inverse document frequency, idf), since very common words carry little information.
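A minimal self-contained sketch of both effects on a made-up three-document corpus: every word occurs once per document, but 'the' occurs in all three documents, so its idf shrinks its weight relative to the rarer words.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

toy = ['the cat sat', 'the dog sat', 'the cat ran']
counts = CountVectorizer().fit_transform(toy)
weights = TfidfTransformer().fit_transform(counts)
# Each row is a document; 'the' gets the lowest weight in every row
# because it appears in all documents.
print(weights.toarray().round(2))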

In [10]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape


Out[10]:
(2852, 42731)

The tutorial suggests a multinomial naive Bayes classifier, which is a natural fit for word-count features.


In [13]:
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
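For reference, the pipeline just chains the manual steps above, so this fit is equivalent to training the classifier on the tf-idf matrix built earlier:

# Equivalent to the pipeline's final step, using the matrices from above
nb = MultinomialNB().fit(X_train_tfidf, twenty_train.target)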

Test the classifier on some new strings.


In [15]:
docs_new = ['God is love', 'OpenGL on the GPU is fast', 'AES Encryption is the best. Way better than DES!',
            'Jesus, I love encryption. AES + RSA is the way to go.',
            'There is grandeur in this view of life, with its several powers, having been originally breathed into a few forms or into one; and that, whilst this planet has gone cycling on according to the fixed law of gravity, from so simple a beginning endless forms most beautiful and most wonderful have been, and are being, evolved.'
]
predictions = text_clf.predict(docs_new)

for doc, category in zip(docs_new, predictions):
    print('%r => %s' % (doc, twenty_train.target_names[category]))


'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics
'AES Encryption is the best. Way better than DES!' => sci.crypt
'Jesus, I love encryption. AES + RSA is the way to go.' => sci.crypt
'There is grandeur in this view of life, with its several powers, having been originally breathed into a few forms or into one; and that, whilst this planet has gone cycling on according to the fixed law of gravity, from so simple a beginning endless forms most beautiful and most wonderful have been, and are being, evolved.' => soc.religion.christian
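MultinomialNB also exposes class probabilities, so one way to inspect how confident these predictions are (not part of the tutorial) is predict_proba:

# Per-class probabilities; columns follow text_clf.classes_, which here
# indexes straight into twenty_train.target_names.
probs = text_clf.predict_proba(docs_new)
for doc, row in zip(docs_new, probs):
    best = row.argmax()
    print('%.2f  %s  %r' % (row[best], twenty_train.target_names[best], doc[:40]))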

Evaluate predictions on the held-out test set.


In [16]:
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)
docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)


Out[16]:
0.84246575342465757
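Taking the mean of the boolean match vector is just accuracy; metrics.accuracy_score computes the same number and makes the intent explicit:

# Identical to np.mean(predicted == twenty_test.target)
print(metrics.accuracy_score(twenty_test.target, predicted))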

Now try a linear SVM, trained with stochastic gradient descent.


In [17]:
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     # hinge loss + l2 penalty gives a linear SVM; n_iter
                     # was renamed max_iter in later scikit-learn releases
                     ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                           alpha=1e-3, n_iter=5)),
])
_ = text_clf.fit(twenty_train.data, twenty_train.target)
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)


Out[17]:
0.90358271865121176

In [18]:
print(metrics.classification_report(twenty_test.target, predicted,
    target_names=twenty_test.target_names))
metrics.confusion_matrix(twenty_test.target, predicted)


                        precision    recall  f1-score   support

           alt.atheism       0.94      0.79      0.86       319
         comp.graphics       0.82      0.97      0.88       389
             sci.crypt       0.97      0.93      0.95       396
               sci.med       0.95      0.86      0.90       396
soc.religion.christian       0.88      0.95      0.91       398

           avg / total       0.91      0.90      0.90      1898

Out[18]:
array([[251,  10,   2,  13,  43],
       [  2, 376,   7,   1,   3],
       [  1,  23, 370,   1,   1],
       [  6,  40,   3, 341,   6],
       [  6,  12,   0,   3, 377]])
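Rows of the confusion matrix are the true categories and columns the predicted ones, both in target_names order. A small sketch to attach the labels:

# Print each row next to its true category name
cm = metrics.confusion_matrix(twenty_test.target, predicted)
for name, row in zip(twenty_test.target_names, cm):
    print('%22s %s' % (name, row))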

How does the SVM classify docs_new?


In [19]:
predicted = text_clf.predict(docs_new)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))


'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics
'AES Encryption is the best. Way better than DES!' => sci.crypt
'Jesus, I love encryption. AES + RSA is the way to go.' => sci.crypt
'There is grandeur in this view of life, with its several powers, having been originally breathed into a few forms or into one; and that, whilst this planet has gone cycling on according to the fixed law of gravity, from so simple a beginning endless forms most beautiful and most wonderful have been, and are being, evolved.' => comp.graphics