In [16]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [17]:
from sklearn.datasets import fetch_20newsgroups

In [18]:
news = fetch_20newsgroups(subset='all')

In [19]:
news.description


Out[19]:
'the 20 newsgroups by date dataset'

In [20]:
print(len(news.data))


18846

In [21]:
print(news.data[0])


From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!



In [22]:
print(news.target)


[10  3 17 ...,  3  1  7]

In [23]:
from sklearn.model_selection import train_test_split

In [24]:
X_train, X_test, Y_train, Y_test = train_test_split(news.data, news.target, test_size=0.25, random_state=33)

In [26]:
# Convert the text into feature vectors, then estimate the parameters of a Bayesian model from the training data
from sklearn.feature_extraction.text import CountVectorizer

In [27]:
vec = CountVectorizer()

In [28]:
X_train = vec.fit_transform(X_train)

In [29]:
X_test = vec.transform(X_test)
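To make the two calls above concrete: `fit_transform` learns the vocabulary from the training text and returns sparse term counts, while `transform` reuses that same vocabulary on the test text. A minimal sketch on a made-up two-document corpus (the variable names `toy_vec`, `toy_X` are illustrative, not from the notebook):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two tiny documents; the default tokenizer keeps words of 2+ characters
# and lowercases them before counting.
corpus = ["the pens beat the devils", "the devils lost again"]

toy_vec = CountVectorizer()
toy_X = toy_vec.fit_transform(corpus)   # learn vocabulary, return sparse counts

print(sorted(toy_vec.vocabulary_))      # ['again', 'beat', 'devils', 'lost', 'pens', 'the']
print(toy_X.toarray())                  # one row of term counts per document
```

Each row of the resulting matrix is a bag-of-words vector: for the first document, "the" appears twice and "pens", "beat", "devils" once each, so its row is `[0, 1, 1, 0, 1, 2]` against the alphabetical vocabulary.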

In [31]:
# Import the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB

In [32]:
mnb = MultinomialNB()

In [33]:
mnb


Out[33]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [34]:
mnb.fit(X_train, Y_train)


Out[34]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [35]:
y_predict = mnb.predict(X_test)

In [36]:
from sklearn.metrics import classification_report

In [37]:
print('The accuracy of the Naive Bayes classifier is:', mnb.score(X_test, Y_test))


The accuracy of the Naive Bayes classifier is: 0.839770797963
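`score()` for a classifier reports plain accuracy, i.e. the fraction of test documents whose predicted label matches the true one. A quick sketch with made-up label arrays (not the notebook's data) showing the equivalence:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels; one of the six predictions is wrong.
y_true = np.array([10, 3, 17, 3, 1, 7])
y_pred = np.array([10, 3, 17, 2, 1, 7])

print(accuracy_score(y_true, y_pred))   # 5 of 6 labels match
print(np.mean(y_pred == y_true))        # the same fraction, computed by hand
```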

In [38]:
print(classification_report(Y_test, y_predict, target_names=news.target_names))


                          precision    recall  f1-score   support

             alt.atheism       0.86      0.86      0.86       201
           comp.graphics       0.59      0.86      0.70       250
 comp.os.ms-windows.misc       0.89      0.10      0.17       248
comp.sys.ibm.pc.hardware       0.60      0.88      0.72       240
   comp.sys.mac.hardware       0.93      0.78      0.85       242
          comp.windows.x       0.82      0.84      0.83       263
            misc.forsale       0.91      0.70      0.79       257
               rec.autos       0.89      0.89      0.89       238
         rec.motorcycles       0.98      0.92      0.95       276
      rec.sport.baseball       0.98      0.91      0.95       251
        rec.sport.hockey       0.93      0.99      0.96       233
               sci.crypt       0.86      0.98      0.91       238
         sci.electronics       0.85      0.88      0.86       249
                 sci.med       0.92      0.94      0.93       245
               sci.space       0.89      0.96      0.92       221
  soc.religion.christian       0.78      0.96      0.86       232
      talk.politics.guns       0.88      0.96      0.92       251
   talk.politics.mideast       0.90      0.98      0.94       231
      talk.politics.misc       0.79      0.89      0.84       188
      talk.religion.misc       0.93      0.44      0.60       158

             avg / total       0.86      0.84      0.82      4712
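The report shows uneven per-class performance (e.g. very low recall on comp.os.ms-windows.misc), which raw term counts can aggravate because frequent, uninformative words dominate. One common follow-up is to swap in TF-IDF weighting and wire the vectorizer and classifier together with a `Pipeline`, which keeps the fit/transform steps consistent between training and prediction. A minimal sketch on a made-up four-document corpus with hypothetical labels (0 = hockey, 1 = computers):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data; labels are invented for illustration only.
docs = ["pens win the cup",
        "jagr scores again",
        "kernel compiles fine",
        "new gpu drivers released"]
labels = [0, 0, 1, 1]

# The pipeline vectorizes and classifies in one fit/predict interface.
clf = Pipeline([("tfidf", TfidfVectorizer()),
                ("nb", MultinomialNB())])
clf.fit(docs, labels)

print(clf.predict(["jagr wins the cup", "gpu kernel drivers"]))
```

Whether TF-IDF actually improves accuracy on 20 newsgroups has to be checked on the held-out split; it often helps, but that is an empirical question, not a guarantee.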


In [ ]: