20Newsgroup Text Data


In [1]:
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

print(type(twenty_train), "\n")
print(twenty_train.target[:5], twenty_train.target_names, "\n")
print(twenty_train.data[:2])


<class 'sklearn.datasets.base.Bunch'> 

[1 1 3 3 3] ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian'] 

[u'From: sd345@city.ac.uk (Michael Collier)\nSubject: Converting images to HP LaserJet III?\nNntp-Posting-Host: hampton\nOrganization: The City University\nLines: 14\n\nDoes anyone know of a good way (standard PC application/PD utility) to\nconvert tif/img/tga files into LaserJet III format.  We would also like to\ndo the same, converting to HPGL (HP plotter) files.\n\nPlease email any response.\n\nIs this the correct group?\n\nThanks in advance.  Michael.\n-- \nMichael Collier (Programmer)                 The Computer Unit,\nEmail: M.P.Collier@uk.ac.city                The City University,\nTel: 071 477-8000 x3769                      London,\nFax: 071 477-8565                            EC1V 0HB.\n', u"From: ani@ms.uky.edu (Aniruddha B. Deglurkar)\nSubject: help: Splitting a trimming region along a mesh \nOrganization: University Of Kentucky, Dept. of Math Sciences\nLines: 28\n\n\n\n\tHi,\n\n\tI have a problem, I hope some of the 'gurus' can help me solve.\n\n\tBackground of the problem:\n\tI have a rectangular mesh in the uv domain, i.e  the mesh is a \n\tmapping of a 3d Bezier patch into 2d. The area in this domain\n\twhich is inside a trimming loop had to be rendered. The trimming\n\tloop is a set of 2d Bezier curve segments.\n\tFor the sake of notation: the mesh is made up of cells.\n\n\tMy problem is this :\n\tThe trimming area has to be split up into individual smaller\n\tcells bounded by the trimming curve segments. If a cell\n\tis wholly inside the area...then it is output as a whole ,\n\telse it is trivially rejected. \n\n\tDoes any body know how thiss can be done, or is there any algo. \n\tsomewhere for doing this.\n\n\tAny help would be appreciated.\n\n\tThanks, \n\tAni.\n-- \nTo get irritated is human, to stay cool, divine.\n"]

Extracting features from text files

Tokenizing text with scikit-learn


In [2]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape


Out[2]:
(2257, 35788)

In [3]:
X_train_counts


Out[3]:
<2257x35788 sparse matrix of type '<type 'numpy.int64'>'
	with 365886 stored elements in Compressed Sparse Row format>

In [4]:
count_vect.vocabulary_.get(u'algorithmic')


Out[4]:
4691

From occurrences to frequencies

  • tf
  • tfidf

In [5]:
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape


Out[5]:
(2257, 35788)

In [6]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape


Out[6]:
(2257, 35788)

Training a classifier


In [7]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('{doc} => {category}'.format(
            doc=doc, category=twenty_train.target_names[category]))


God is love => soc.religion.christian
OpenGL on the GPU is fast => comp.graphics

Building a pipeline for Grid Search & Evaluation


In [8]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', MultinomialNB()),        
    ])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

Evaluation of the performance on the test set

clf = MultinomialNB


In [9]:
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)
docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)


Out[9]:
0.83488681757656458

In [10]:
from sklearn import metrics
print(metrics.classification_report(
        twenty_test.target,
        predicted,
        target_names=twenty_test.target_names))


                        precision    recall  f1-score   support

           alt.atheism       0.97      0.60      0.74       319
         comp.graphics       0.96      0.89      0.92       389
               sci.med       0.97      0.81      0.88       396
soc.religion.christian       0.65      0.99      0.78       398

           avg / total       0.88      0.83      0.84      1502


In [11]:
metrics.confusion_matrix(twenty_test.target, predicted)


Out[11]:
array([[192,   2,   6, 119],
       [  2, 347,   4,  36],
       [  2,  11, 322,  61],
       [  2,   2,   1, 393]])

clf = SGD


In [12]:
from sklearn.linear_model import SGDClassifier
text_clf = Pipeline([
        ('vect', CountVectorizer()), 
        ('tfidf', TfidfTransformer()),
        ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, n_iter=5, random_state=42)),
    ])
_ = text_clf.fit(twenty_train.data, twenty_train.target)
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)


Out[12]:
0.9127829560585885

clf = SVM(kernel = 'linear')


In [13]:
from sklearn.svm import SVC
text_clf1 = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', SVC(kernel='linear', random_state=42)),
    ])
_ = text_clf1.fit(twenty_train.data, twenty_train.target)
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)


Out[13]:
0.9127829560585885

clf = SVM(kernel = 'rbf')


In [14]:
from sklearn.svm import SVC
text_clf2 = Pipeline([
        ('vect', CountVectorizer()), 
        ('tfidf', TfidfTransformer()),
        ('clf', SVC(kernel='rbf', random_state=42, gamma=0.10, C=10.0)),
    ])
_ = text_clf2.fit(twenty_train.data, twenty_train.target)
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)


Out[14]:
0.9127829560585885

Parameter tuning using GridSearch


In [21]:
from sklearn.grid_search import GridSearchCV
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
}
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
# 데이터량이 너무많아 소량으로만 해봄
gs_clf = gs_clf.fit(twenty_train.data[:400], twenty_train.target[:400])

In [22]:
twenty_train.target_names[gs_clf.predict(['God is love'])]


C:\Anaconda2\lib\site-packages\ipykernel\__main__.py:1: DeprecationWarning: converting an array with ndim > 0 to an index will result in an error in the future
  if __name__ == '__main__':
Out[22]:
'soc.religion.christian'

In [23]:
best_parameters, score, _ = max(gs_clf.grid_scores_, key=lambda x: x[1])
for param_name in sorted(parameters.keys()):
    print("{name}: {best}".format
         (name=param_name, best=best_parameters[param_name]))

score


clf__alpha: 0.001
tfidf__use_idf: True
vect__ngram_range: (1, 1)
Out[23]:
0.90000000000000002