In [1]:
# Naive Bayes is similar to k-NN in the sense that it makes
# some assumptions that might oversimplify reality but still
# perform well in many cases.

In [11]:
from sklearn.datasets import fetch_20newsgroups
import numpy as np

In [3]:
categories = ['rec.autos', 'rec.motorcycles']

In [4]:
newsgroups = fetch_20newsgroups(categories=categories)
print '\n'.join(newsgroups.data[:1])


WARNING:sklearn.datasets.twenty_newsgroups:Downloading dataset from http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz (14 MB)
From: gregl@zimmer.CSUFresno.EDU (Greg Lewis)
Subject: Re: WARNING.....(please read)...
Keywords: BRICK, TRUCK, DANGER
Nntp-Posting-Host: zimmer.csufresno.edu
Organization: CSU Fresno
Lines: 33

In article <1qh336INNfl5@CS.UTK.EDU> larose@austin.cs.utk.edu (Brian LaRose) writes:
>This just a warning to EVERYBODY on the net.  Watch out for
>folks standing NEXT to the road or on overpasses.   They can
>cause SERIOUS HARM to you and your car.  
>
>(just a cliff-notes version of my story follows)
>
>10pm last night, I was travelling on the interstate here in
>knoxville,  I was taking an offramp exit to another interstate
>and my wife suddenly screamed and something LARGE hit the side
>of my truck.  We slowed down, but after looking back to see the
>vandals standing there, we drove on to the police station.
>
>She did get a good look at the guy and saw him "cock his arm" with
>something the size of a cinderblock, BUT I never saw him. We are 
>VERY lucky the truck sits up high on the road; if it would have hit
>her window, it would have killed her. 
>
>The police are looking for the guy, but in all likelyhood he is gone. 
Stuff deleted...

I am sorry to report that in Southern California it was a sick sport
for a while to drop concrete blocks from the overpasses onto the
freeway.  Several persons were killed when said blocks came through
their windshields.  Many overpass bridges are now fenced, and they
have made it illegal to loiter on such bridges (as if that would stop
such people).  Yet many bridges are NOT fenced.  I always look up at a
bridge while I still have time to take evasive action even though this
*sport* has not reached us here in Fresno.
___________________________________________________________________
Greg_Lewis@csufresno.edu
Photojournalism sequence, Department of Journalism
CSU Fresno, Fresno, CA 93740


In [5]:
newsgroups.target_names[newsgroups.target[:1]]


Out[5]:
'rec.autos'

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

In [8]:
count_vec = CountVectorizer()
bow = count_vec.fit_transform(newsgroups.data)

In [9]:
bow


Out[9]:
<1192x19177 sparse matrix of type '<type 'numpy.int64'>'
	with 164296 stored elements in Compressed Sparse Row format>

In [12]:
bow = np.array(bow.todense())

In [13]:
words = np.array(count_vec.get_feature_names())
words[bow[0] > 0][:5]


Out[13]:
array([u'10pm', u'1qh336innfl5', u'33', u'93740',
       u'___________________________________________________________________'], 
      dtype='<U79')

In [14]:
# verify that these words are in the first document.
'10pm' in newsgroups.data[0].lower()


Out[14]:
True

In [15]:
from sklearn import naive_bayes
clf = naive_bayes.GaussianNB()

In [16]:
mask = np.random.choice([True, False], len(bow))
clf.fit(bow[mask], newsgroups.target[mask])
predictions = clf.predict(bow[~mask])

In [17]:
np.mean(predictions == newsgroups.target[~mask])


Out[17]:
0.92465753424657537

In [18]:
# How it works:
# The fundamental idea of how Naive Bayes works is that we can
# estimate the probability of some data point being a class,
# given the feature vector.
# This can be rearranged via the Bayes formula to give the MAP
# estimate for the feature vector. This MAP estimate chooses the
# class for which the feature vector's probability is maximized.

In [19]:
# Naive Bayes can also do multiclass work. Instead of assuming
# a Gaussian likelihood, we'll use a multinomal likelihood.

In [20]:
mn_categories = ['rec.autos', 'rec.motorcycles',
                 'talk.politics.guns']

In [21]:
mn_newsgroups = fetch_20newsgroups(categories=mn_categories)

In [22]:
mn_bow = count_vec.fit_transform(mn_newsgroups.data)

In [23]:
mn_bow = np.array(mn_bow.todense())

In [24]:
mn_mask = np.random.choice([True, False], len(mn_newsgroups.data))

In [26]:
multinom = naive_bayes.MultinomialNB()
multinom.fit(mn_bow[mn_mask], mn_newsgroups.target[mn_mask])


Out[26]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [27]:
mn_predict = multinom.predict(mn_bow[~mask])

In [28]:
np.mean(mn_predict == mn_newsgroups.target[~mask])


Out[28]:
0.98116438356164382

In [ ]: