Sample from http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
The data set is the '20 newsgroups dataset', a collection of newsgroup posts commonly used for benchmarking text classification, described on the 20 newsgroups dataset website.
We will be using this data to demonstrate scikit-learn.
To make the examples run more quickly we will limit the data set to just 4 categories.
In [1]:
categories = ['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med']
Load the training data
In [2]:
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train',categories=categories, shuffle=True, random_state=42)
In [3]:
twenty_train.target_names
Out[3]:
Note that the target names are not in the same order as in the categories list above; fetch_20newsgroups returns them sorted alphabetically.
Count of documents
In [4]:
len(twenty_train.data)
Out[4]:
Show the first 8 lines of text from the first document, formatted with line breaks
In [5]:
print("\n".join(twenty_train.data[0].split("\n")[:8]))
Path to file on your machine
In [6]:
twenty_train.filenames[0]
Out[6]:
Show the target categories of the first 10 documents, first as a list of integers and then by name.
In [7]:
print(twenty_train.target[:10])
for t in twenty_train.target[:10]:
    print(twenty_train.target_names[t])
Let's look at a full document in the training data.
In [8]:
print("\n".join(twenty_train.data[0].split("\n")))
In [9]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape
Out[9]:
In [10]:
X_train_counts.__class__
Out[10]:
Using the vectorizer's vocabulary_ mapping we can get the integer (column) index of a word.
In [11]:
count_vect.vocabulary_.get(u'application')
Out[11]:
With this index we can look up the count of the word in a given document.
In [12]:
print("Word count for application in first document: {0} and last document: {1} ").format(
X_train_counts[0, 5285], X_train_counts[2256, 5285])
In [13]:
count_vect.vocabulary_.get(u'subject')
Out[13]:
In [14]:
print("Word count for email in first document: {0} and last document: {1} ").format(
X_train_counts[0, 31077], X_train_counts[2256, 31077])
In [15]:
count_vect.vocabulary_.get(u'to')
Out[15]:
In [16]:
print("Word count for email in first document: {0} and last document: {1} ").format(
X_train_counts[0, 32493], X_train_counts[2256, 32493])
What are two problems with using raw word counts as features?
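One possible answer (an illustrative sketch, not part of the original tutorial): longer documents get larger counts simply because they contain more words, and very common words such as 'to' or 'subject' get large counts everywhere while saying little about the topic. Scaling each count by the document's length addresses the first problem (this is the tf part of tf-idf); the idf weighting introduced below addresses the second.
In [ ]:
# Minimal sketch: scale raw counts by document length to get term frequencies.
# Roughly what TfidfTransformer(use_idf=False, norm='l1') computes.
import numpy as np
from scipy.sparse import diags

doc_lengths = np.asarray(X_train_counts.sum(axis=1)).ravel().astype(float)
tf_manual = diags(1.0 / doc_lengths).dot(X_train_counts).tocsr()
print(tf_manual[0, 32493])  # frequency of 'to' in the first document, rather than its raw count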
In [17]:
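# Two-stage approach: fit() learns the scaling from the training counts,
# then transform() applies it. use_idf=False keeps plain term frequencies
# (counts scaled for document length) without the idf downweighting.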
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tfidf_2stage = tf_transformer.transform(X_train_counts)
X_train_tfidf_2stage.shape
Out[17]:
In [18]:
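# fit_transform() does both steps in one call; the default use_idf=True also
# downweights words that appear in many documents (inverse document frequency).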
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
Out[18]:
In [19]:
print("In first document tf-idf for application: {0} subject: {1} to: {2}").format(
X_train_tfidf[0, 5285], X_train_tfidf[0, 31077], X_train_tfidf[0, 32493])
So we now have features. We can train a classifier to try to predict the category of a post. First we will try the naïve Bayes classifier.
In [20]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
To classify new documents we transform them with the already-fitted count_vect and tfidf_transformer (using transform, not fit_transform) and then call predict on the classifier.
In [21]:
docs_new = ['God is love', 'Heart attacks are common', 'Disbelief in a proposition', 'Disbelief in a proposition means that one does not believe it to be true', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))
In [22]:
from sklearn.pipeline import Pipeline
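# Chain the vectoriser, tf-idf transformer and classifier into one estimator,
# so raw text can be passed straight to fit() and predict().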
text_clf_bayes = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', MultinomialNB()),
])
In [23]:
text_clf_bayes_fit = text_clf_bayes.fit(twenty_train.data, twenty_train.target)
In [24]:
import numpy as np
twenty_test = fetch_20newsgroups(subset='test',
categories=categories, shuffle=True, random_state=42)
docs_test = twenty_test.data
predicted_bayes = text_clf_bayes_fit.predict(docs_test)
np.mean(predicted_bayes == twenty_test.target)
Out[24]:
Try a support vector machine instead. SGDClassifier with hinge loss and an l2 penalty trains a linear SVM using stochastic gradient descent.
In [25]:
from sklearn.linear_model import SGDClassifier
text_clf_svm = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', SGDClassifier(loss='hinge', penalty='l2',
alpha=1e-3, n_iter=5, random_state=42)),])
text_clf_svm_fit = text_clf_svm.fit(twenty_train.data, twenty_train.target)
predicted_svm = text_clf_svm_fit.predict(docs_test)
np.mean(predicted_svm == twenty_test.target)
Out[25]:
We can see the support vector machine achieved a higher accuracy score than naïve Bayes. What does that score actually tell us? We move on to metrics.
In [26]:
from sklearn import metrics
y_true = ["cat", "ant", "cat", "cat", "ant", "bird", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat", "bird"]
print(metrics.classification_report(y_true, y_pred,
target_names=["ant", "bird", "cat"]))
Here we can see the per-class precision, recall, F1-score and support for the predictions.
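As a quick sanity check (a hand computation, not part of the original tutorial), here is the precision and recall for the 'cat' class worked out directly from the toy labels above; the numbers should match the report.
In [ ]:
# Precision = of the documents predicted 'cat', how many really are 'cat'?
# Recall    = of the documents that really are 'cat', how many did we predict?
tp = sum(1 for t, p in zip(y_true, y_pred) if t == "cat" and p == "cat")
fp = sum(1 for t, p in zip(y_true, y_pred) if t != "cat" and p == "cat")
fn = sum(1 for t, p in zip(y_true, y_pred) if t == "cat" and p != "cat")
print("precision: %.2f" % (float(tp) / (tp + fp)))  # 2 / 3
print("recall: %.2f" % (float(tp) / (tp + fn)))     # 2 / 3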
Confusion matrix
In [ ]:
metrics.confusion_matrix(y_true, y_pred, labels=["ant", "bird", "cat"])
In the confusion matrix the labels argument gives the order of the rows and columns: row i holds the documents whose true label is labels[i], and column j counts how many of them were predicted as labels[j].
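To make that orientation explicit, here is a small illustrative print-out that labels each row of the toy matrix with its true class.
In [ ]:
# Row i corresponds to the true label labels[i]; each column counts predictions.
labels = ["ant", "bird", "cat"]
cm = metrics.confusion_matrix(y_true, y_pred, labels=labels)
for label, row in zip(labels, cm):
    print("true %s: %s" % (label, row))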
In [ ]:
metrics.accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)
Back to the '20 newsgroups dataset'.
In [ ]:
print(metrics.classification_report(twenty_test.target, predicted_svm,
target_names=twenty_test.target_names))
We can see where the 91% score came from.
In [ ]:
# We got the evaluation score this way before:
print(np.mean(predicted_svm == twenty_test.target))
# We get the same results using metrics.accuracy_score
print(metrics.accuracy_score(twenty_test.target, predicted_svm, normalize=True, sample_weight=None))
Now let's see the confusion matrix.
In [ ]:
print(twenty_train.target_names)
metrics.confusion_matrix(twenty_test.target, predicted_bayes)
So we can see the naïve Bayes classifier gets most documents right, but it funnels a disproportionate number of its misclassifications into the last category (soc.religion.christian).
In [ ]:
metrics.confusion_matrix(twenty_test.target, predicted_svm)
With the support vector machine we can see that alt.atheism is most often miscategorised as soc.religion.christian, and sci.med as comp.graphics.
Transformers and classifiers have various parameters. Rather than manually tweaking each parameter in the pipeline, it is possible to use a grid search instead.
Here we try a couple of options for each stage; the more options we add, the longer the grid search will take.
In [ ]:
from sklearn.grid_search import GridSearchCV
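# Each key names a pipeline step, then a double underscore, then one of that
# step's parameters, e.g. 'vect__ngram_range' sets ngram_range on CountVectorizer.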
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
'tfidf__use_idf': (True, False),
'clf__alpha': (1e-3, 1e-4),
}
In [ ]:
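# n_jobs=-1 runs the candidate fits in parallel on all available CPU cores.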
gs_clf = GridSearchCV(text_clf_svm_fit, parameters, n_jobs=-1)
Running the search on all the data will take a little while (10-30 seconds on a newish desktop with 8 cores). If you don't want to wait that long, uncomment the line with :400 and comment out the other.
In [ ]:
#gs_clf_fit = gs_clf.fit(twenty_train.data[:400], twenty_train.target[:400])
gs_clf_fit = gs_clf.fit(twenty_train.data, twenty_train.target)
In [ ]:
best_parameters, score, _ = max(gs_clf_fit.grid_scores_, key=lambda x: x[1])
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, best_parameters[param_name]))
score
Well, that is a significant improvement. Let's use these new parameters.
In [ ]:
text_clf_svm_tuned = Pipeline([('vect', CountVectorizer(ngram_range=(1, 2))),
('tfidf', TfidfTransformer(use_idf=True)),
('clf', SGDClassifier(loss='hinge', penalty='l2',
alpha=0.0001, n_iter=5, random_state=42)),
])
text_clf_svm_tuned_fit = text_clf_svm_tuned.fit(twenty_train.data, twenty_train.target)
predicted_tuned = text_clf_svm_tuned_fit.predict(docs_test)
metrics.accuracy_score(twenty_test.target, predicted_tuned, normalize=True, sample_weight=None)
Why has this only given 0.93 instead of the 0.97 reported by the grid search?
http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html
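One likely explanation (a note added here; see the GridSearchCV documentation linked above): the 0.97 reported by the grid search is the mean cross-validated accuracy on the training data, while the 0.93 is accuracy on the held-out test set, so the two numbers are not directly comparable.
In [ ]:
# best_score_ is the mean cross-validated score on the training data (the ~0.97 above);
# the ~0.93 is accuracy on the separate test set.
print(gs_clf_fit.best_score_)
print(gs_clf_fit.best_params_)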
In [ ]:
for x in gs_clf_fit.grid_scores_:
    print("%s %s %s" % (x[0], x[1], x[2]))
Moving on from that, let's see where the improvements were made.
In [ ]:
print(metrics.classification_report(twenty_test.target, predicted_svm,
target_names=twenty_test.target_names))
metrics.confusion_matrix(twenty_test.target, predicted_svm)
In [ ]:
print(metrics.classification_report(twenty_test.target, predicted_tuned,
target_names=twenty_test.target_names))
metrics.confusion_matrix(twenty_test.target, predicted_tuned)
We see comp.graphics is the only category whose scores dropped; the others have improved.
We can see that scikit-learn does a good job of classification with the amount of training and test data in this simple example.