In [16]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [17]:
from sklearn.datasets import fetch_20newsgroups

In [18]:
news = fetch_20newsgroups(subset='all')

In [19]:
news.description


Out[19]:
'the 20 newsgroups by date dataset'

In [20]:
print(len(news.data))


18846

In [21]:
print(news.data[0])


From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!



In [22]:
print(news.target)


[10  3 17 ...,  3  1  7]

In [23]:
from sklearn.model_selection import train_test_split

In [24]:
X_train, X_test, Y_train, Y_test = train_test_split(news.data, news.target, test_size=0.25, random_state=33)

In [26]:
# Convert the text into feature vectors, then estimate the parameters of a Bayesian model from the training data
from sklearn.feature_extraction.text import CountVectorizer

In [27]:
vec = CountVectorizer()

In [28]:
X_train = vec.fit_transform(X_train)

In [29]:
X_test = vec.transform(X_test)
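To make the two calls above concrete: `fit_transform` learns the vocabulary from the training text and returns sparse term counts, while `transform` reuses that same vocabulary on the test text. A minimal sketch on a made-up two-document corpus (the variable names `toy_vec`, `toy_X` are illustrative, not from the notebook):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two tiny documents; the default tokenizer keeps words of 2+ characters
# and lowercases them before counting.
corpus = ["the pens beat the devils", "the devils lost again"]

toy_vec = CountVectorizer()
toy_X = toy_vec.fit_transform(corpus)   # learn vocabulary, return sparse counts

print(sorted(toy_vec.vocabulary_))      # ['again', 'beat', 'devils', 'lost', 'pens', 'the']
print(toy_X.toarray())                  # one row of term counts per document
```

Each row of the resulting matrix is a bag-of-words vector: for the first document, "the" appears twice and "pens", "beat", "devils" once each, so its row is `[0, 1, 1, 0, 1, 2]` against the alphabetical vocabulary.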

In [31]:
# Import the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB

In [32]:
mnb = MultinomialNB()

In [33]:
mnb


Out[33]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [34]:
mnb.fit(X_train, Y_train)


Out[34]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [35]:
y_predict = mnb.predict(X_test)

In [36]:
from sklearn.metrics import classification_report

In [37]:
print('The accuracy of the Naive Bayes classifier is:', mnb.score(X_test, Y_test))


The accuracy of the Naive Bayes classifier is: 0.839770797963
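`score()` for a classifier reports plain accuracy, i.e. the fraction of test documents whose predicted label matches the true one. A quick sketch with made-up label arrays (not the notebook's data) showing the equivalence:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels; one of the six predictions is wrong.
y_true = np.array([10, 3, 17, 3, 1, 7])
y_pred = np.array([10, 3, 17, 2, 1, 7])

print(accuracy_score(y_true, y_pred))   # 5 of 6 labels match
print(np.mean(y_pred == y_true))        # the same fraction, computed by hand
```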

In [38]:
print(classification_report(Y_test, y_predict, target_names=news.target_names))


                          precision    recall  f1-score   support

             alt.atheism       0.86      0.86      0.86       201
           comp.graphics       0.59      0.86      0.70       250
 comp.os.ms-windows.misc       0.89      0.10      0.17       248
comp.sys.ibm.pc.hardware       0.60      0.88      0.72       240
   comp.sys.mac.hardware       0.93      0.78      0.85       242
          comp.windows.x       0.82      0.84      0.83       263
            misc.forsale       0.91      0.70      0.79       257
               rec.autos       0.89      0.89      0.89       238
         rec.motorcycles       0.98      0.92      0.95       276
      rec.sport.baseball       0.98      0.91      0.95       251
        rec.sport.hockey       0.93      0.99      0.96       233
               sci.crypt       0.86      0.98      0.91       238
         sci.electronics       0.85      0.88      0.86       249
                 sci.med       0.92      0.94      0.93       245
               sci.space       0.89      0.96      0.92       221
  soc.religion.christian       0.78      0.96      0.86       232
      talk.politics.guns       0.88      0.96      0.92       251
   talk.politics.mideast       0.90      0.98      0.94       231
      talk.politics.misc       0.79      0.89      0.84       188
      talk.religion.misc       0.93      0.44      0.60       158

             avg / total       0.86      0.84      0.82      4712
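The report shows uneven per-class performance (e.g. very low recall on comp.os.ms-windows.misc), which raw term counts can aggravate because frequent, uninformative words dominate. One common follow-up is to swap in TF-IDF weighting and wire the vectorizer and classifier together with a `Pipeline`, which keeps the fit/transform steps consistent between training and prediction. A minimal sketch on a made-up four-document corpus with hypothetical labels (0 = hockey, 1 = computers):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data; labels are invented for illustration only.
docs = ["pens win the cup",
        "jagr scores again",
        "kernel compiles fine",
        "new gpu drivers released"]
labels = [0, 0, 1, 1]

# The pipeline vectorizes and classifies in one fit/predict interface.
clf = Pipeline([("tfidf", TfidfVectorizer()),
                ("nb", MultinomialNB())])
clf.fit(docs, labels)

print(clf.predict(["jagr wins the cup", "gpu kernel drivers"]))
```

Whether TF-IDF actually improves accuracy on 20 newsgroups has to be checked on the held-out split; it often helps, but that is an empirical question, not a guarantee.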


In [ ]: