Naive Bayes classifier for multinomial models

The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.

http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
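
To make the description above concrete, here is a minimal from-scratch sketch of the multinomial model with additive smoothing. It is not scikit-learn's implementation; the function names, the toy data, and the three-word vocabulary are assumptions made for illustration. It shows that the same arithmetic accepts both integer counts and fractional weights such as tf-idf values.

```python
import math

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log feature probabilities per class."""
    n_features = len(X[0])
    log_prior, log_prob = {}, {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Sum each feature's (possibly fractional) count over the class.
        totals = [sum(r[j] for r in rows) for j in range(n_features)]
        denom = sum(totals) + alpha * n_features
        log_prob[c] = [math.log((t + alpha) / denom) for t in totals]
    return log_prior, log_prob

def predict(x, log_prior, log_prob):
    """Return the class with the highest joint log-likelihood for x."""
    scores = {
        c: log_prior[c] + sum(v * lp for v, lp in zip(x, log_prob[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy data (hypothetical); columns are counts for the vocabulary
# ["god", "orbit", "image"].
X = [[3, 0, 0], [2, 0, 1], [0, 4, 1], [0, 2, 0]]
y = ["religion", "religion", "space", "space"]
log_prior, log_prob = fit_multinomial_nb(X, y)

print(predict([1, 0, 0], log_prior, log_prob))        # integer counts → religion
print(predict([0.0, 0.7, 0.1], log_prior, log_prob))  # fractional weights → space
```

The smoothing parameter alpha plays the same role as the alpha argument passed to MultinomialNB in the cell below: it keeps words unseen in a class from receiving zero probability.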


In [2]:
import numpy as np
from pprint import pprint
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
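def classifier(clf, X_train, y_train, X_test, feature_names, categories):
    """Fit clf, print the top 10 keywords per class, and return the
    class probabilities predicted for X_test."""
    # Adapted from below
    # Author: Peter Prettenhofer <peter.prettenhofer@gmail.com>
    #         Olivier Grisel <olivier.grisel@ensta.org>
    #         Mathieu Blondel <mathieu@mblondel.org>
    #         Lars Buitinck <L.J.Buitinck@uva.nl>
    # License: BSD 3 clause

    clf.fit(X_train, y_train)
    pred_prob = clf.predict_proba(X_test)

    print("top 10 keywords per class:")
    for i, category in enumerate(categories):
        # feature_log_prob_[i] holds log P(word | class i). argsort is
        # ascending, so the last 10 indices are the most probable words.
        # Older scikit-learn releases exposed the same values as clf.coef_
        # in this multiclass case; newer releases no longer provide it.
        top10 = np.argsort(clf.feature_log_prob_[i])[-10:]
        print("{0}: {1}".format(category, " ".join(feature_names[top10])))
    print()

    return pred_prob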

from sklearn.datasets import fetch_20newsgroups
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]
remove = ('headers', 'footers', 'quotes')
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=remove)

data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=remove)


print('data loaded')
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
categories = data_train.target_names    
print(categories)
print()

pprint(data_train.data[:2])
pprint(data_train.target[:2])
print()
y_train = data_train.target

X_train = vectorizer.fit_transform(data_train.data)
X_test = vectorizer.transform(data_test.data)
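# get_feature_names() was removed in scikit-learn 1.2; on releases older
# than 1.0, call vectorizer.get_feature_names() here instead.
feature_names = np.asarray(vectorizer.get_feature_names_out())

pred_prob = classifier(MultinomialNB(alpha=0.01), X_train, y_train, X_test,
                       feature_names, categories)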


data loaded
['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']

['Hi,\n'
 '\n'
 "I've noticed that if you only save a model (with all your mapping planes\n"
 'positioned carefully) to a .3DS file that when you reload it after '
 'restarting\n'
 '3DS, they are given a default position and orientation.  But if you save\n'
 'to a .PRJ file their positions/orientation are preserved.  Does anyone\n'
 'know why this information is not stored in the .3DS file?  Nothing is\n'
 'explicitly said in the manual about saving texture rules in the .PRJ file. \n'
 "I'd like to be able to read the texture rule information, does anyone have \n"
 'the format for the .PRJ file?\n'
 '\n'
 'Is the .CEL file format available from somewhere?\n'
 '\n'
 'Rych',
 '\n'
 '\n'
 'Seems to be, barring evidence to the contrary, that Koresh was simply\n'
 'another deranged fanatic who thought it neccessary to take a whole bunch of\n'
 'folks with him, children and all, to satisfy his delusional mania. Jim\n'
 'Jones, circa 1993.\n'
 '\n'
 '\n'
 'Nope - fruitcakes like Koresh have been demonstrating such evil corruption\n'
 'for centuries.']
array([1, 3])

top 10 keywords per class:
alt.atheism: know atheism religion does say just think people don god
comp.graphics: hi does looking program image know file files thanks graphics
sci.space: don earth think moon launch orbit just like nasa space
talk.religion.misc: bible know just think christians don christian people jesus god
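
The keyword lists above are selected with an ascending argsort followed by a slice of the last 10 indices, so the highest-weighted word in each list is the last one (for example "space" for sci.space). A small pure-Python sketch of the same selection, using a hypothetical five-word vocabulary and made-up log probabilities:

```python
def top_k_features(weights, names, k):
    """Return the k names with the largest weights, in ascending weight order."""
    order = sorted(range(len(weights)), key=weights.__getitem__)
    return [names[i] for i in order[-k:]]

# Hypothetical log probabilities for a five-word vocabulary.
names = ["moon", "the", "orbit", "nasa", "space"]
weights = [-4.2, -9.0, -3.9, -3.1, -2.5]

print(" ".join(top_k_features(weights, names, 3)))  # → orbit nasa space
```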


In [3]:
# Some options for nice printing
np.set_printoptions(precision=3, suppress=True)
# Column order of pred_prob follows the sorted category names
print(categories)
pred_prob


['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']
Out[3]:
array([[ 0.016,  0.177,  0.805,  0.002],
       [ 0.058,  0.704,  0.232,  0.006],
       [ 0.001,  0.875,  0.124,  0.   ],
       ..., 
       [ 0.09 ,  0.003,  0.336,  0.571],
       [ 0.   ,  0.972,  0.028,  0.   ],
       [ 0.   ,  1.   ,  0.   ,  0.   ]])
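
Each row of pred_prob is a probability distribution over the four categories, in the column order printed above. As a rough illustration, the sketch below copies the first three rows of that output into a plain list and picks the most likely category for each; the helper name most_likely is an assumption, not part of scikit-learn.

```python
categories = ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']

# First three rows of the pred_prob output shown above.
rows = [
    [0.016, 0.177, 0.805, 0.002],
    [0.058, 0.704, 0.232, 0.006],
    [0.001, 0.875, 0.124, 0.0],
]

def most_likely(row, labels):
    """Return (label, probability) for the largest entry in row."""
    best = max(range(len(row)), key=row.__getitem__)
    return labels[best], row[best]

predicted = [most_likely(row, categories) for row in rows]
for label, prob in predicted:
    print(f"{label}: {prob:.3f}")
# sci.space: 0.805
# comp.graphics: 0.704
# comp.graphics: 0.875
```

On the NumPy array itself, pred_prob.argmax(axis=1) gives the same column indices for every row at once.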