Working through scikit-learn's "Working with Text Data" tutorial from the "Text Analytics" section, found at http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#tutorial-setup.
In [12]:
from sklearn.datasets import fetch_20newsgroups
import sklearn.datasets
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn import metrics
import numpy as np
datadir = "data/twenty_newsgroups/"
I have already downloaded the data, so I just give scikit-learn the path.
In [5]:
twenty_train_all = sklearn.datasets.load_files("%s/20news-bydate-train" % datadir)
print("Loaded data")
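As a minimal sketch of what load_files returns: a Bunch whose .data holds the raw texts and whose .target_names come from the subdirectory names. The toy directory layout below is made up so the example is self-contained.

```python
import os
import tempfile
import sklearn.datasets

# Build a tiny on-disk layout: one subdirectory per category,
# mirroring the 20news-bydate-train folder structure.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sci.med"))
    os.makedirs(os.path.join(root, "comp.graphics"))
    with open(os.path.join(root, "sci.med", "doc1.txt"), "w") as f:
        f.write("patient treatment")
    with open(os.path.join(root, "comp.graphics", "doc2.txt"), "w") as f:
        f.write("opengl rendering")

    bunch = sklearn.datasets.load_files(root, encoding="utf-8")
    # Category names are taken from the subdirectory names
    print(sorted(bunch.target_names))  # ['comp.graphics', 'sci.med']
    print(len(bunch.data))             # 2 documents loaded
```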
In [6]:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med', 'sci.crypt']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
In [7]:
print("First 10 of %d documents are of category:" % len(twenty_train.data))
for cat in twenty_train.target[:10]:
    print(twenty_train.target_names[cat])
In [8]:
cv = CountVectorizer()
X_train_counts = cv.fit_transform(twenty_train.data)
X_train_counts.shape
In [9]:
print("There are %d distinct words in all of the documents." % len(cv.vocabulary_))
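A minimal sketch of what CountVectorizer is doing under the hood, on a made-up toy corpus: vocabulary_ maps each distinct word to a column index, and each row of the resulting matrix holds one document's word counts.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: three tiny "documents"
toy_docs = ["the cat sat", "the dog sat", "the cat and the dog"]
toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_docs)

# vocabulary_ maps each distinct word to a column index
print(sorted(toy_cv.vocabulary_))  # ['and', 'cat', 'dog', 'sat', 'the']
# Each row is a document; each column is that word's count in the document
print(toy_counts.toarray())
```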
TfidfTransformer does two things: it normalizes raw counts into term frequencies (tf), and it downweights words that occur in many documents via inverse document frequency (idf).
In [10]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
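A small sketch of the tf-idf scaling on a toy count matrix (the counts are made up; the behavior assumes scikit-learn's defaults, smoothed idf and L2 row normalization):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy counts: 2 documents, 2 words.
# Word 0 appears in both documents, word 1 in only one.
counts = np.array([[1, 1],
                   [1, 0]])
tfidf = TfidfTransformer().fit_transform(counts)

# Each row is L2-normalized, and the word shared by both
# documents gets a lower weight than the rarer word.
print(tfidf.toarray())
```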
The tutorial suggests a naive Bayes classifier.
In [13]:
text_clf = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', MultinomialNB()),
])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
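The Pipeline simply chains the manual steps above (vectorize, tf-idf, classify). A sketch checking that equivalence on a made-up toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus and labels (made up): 0 = crypto, 1 = graphics
docs = ["free crypto keys", "graphics card rendering",
        "crypto cipher keys", "3d graphics rendering"]
labels = [0, 1, 0, 1]

pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('clf', MultinomialNB())])
pipe.fit(docs, labels)

# The same steps applied by hand
cv = CountVectorizer().fit(docs)
tf = TfidfTransformer().fit(cv.transform(docs))
clf = MultinomialNB().fit(tf.transform(cv.transform(docs)), labels)

# The pipeline and the manual chain make identical predictions
assert (pipe.predict(docs) == clf.predict(tf.transform(cv.transform(docs)))).all()
```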
Try the classifier on some new strings.
In [15]:
docs_new = ['God is love', 'OpenGL on the GPU is fast', 'AES Encryption is the best. Way better than DES!',
'Jesus, I love encryption. AES + RSA is the way to go.',
'There is grandeur in this view of life, with its several powers, having been originally breathed into a few forms or into one; and that, whilst this planet has gone cycling on according to the fixed law of gravity, from so simple a beginning endless forms most beautiful and most wonderful have been, and are being, evolved.'
]
predictions = text_clf.predict(docs_new)
for doc, category in zip(docs_new, predictions):
    print('%r => %s' % (doc, twenty_train.target_names[category]))
Evaluate Predictions
In [16]:
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)
docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)
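The mean of that boolean comparison is just accuracy; metrics.accuracy_score computes the same thing. A tiny sketch with made-up labels:

```python
import numpy as np
from sklearn import metrics

y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 1, 1, 2, 1])

# Fraction of positions where prediction matches the true label
manual = np.mean(y_pred == y_true)  # 4 of 5 correct -> 0.8
assert manual == metrics.accuracy_score(y_true, y_pred) == 0.8
```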
Now try a linear SVM.
In [17]:
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     # n_iter was renamed max_iter in newer scikit-learn;
                     # tol=None keeps the old fixed-iteration behavior
                     ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                           alpha=1e-3, max_iter=5, tol=None)),
                     ])
_ = text_clf.fit(twenty_train.data, twenty_train.target)
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)
In [18]:
print(metrics.classification_report(twenty_test.target, predicted,
target_names=twenty_test.target_names))
metrics.confusion_matrix(twenty_test.target, predicted)
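To read the confusion matrix: rows are true classes, columns are predicted classes, so the diagonal counts the correct predictions. A sketch with made-up labels:

```python
from sklearn import metrics

# Toy labels: two classes
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

cm = metrics.confusion_matrix(y_true, y_pred)
print(cm)  # rows = true class, columns = predicted class

# Diagonal sum equals the number of correct predictions
assert cm.trace() == sum(t == p for t, p in zip(y_true, y_pred))
```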
What about docs_new?
In [19]:
predicted = text_clf.predict(docs_new)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))