Example for Naive Bayes using the Iris Data

First, we need the standard imports.



In [1]:

    
%pylab inline
from classy import *









    



Populating the interactive namespace from numpy and matplotlib
Version:  0.0.15

Load the Data



In [2]:

    
data=load_excel('data/iris.xls',verbose=True)









    



iris.data 151 5
150 vectors of length 4
Feature names: 'petal length in cm', 'petal width in cm', 'sepal length in cm', 'sepal width in cm'
Target values given.
Target names: 'Iris-setosa', 'Iris-versicolor', 'Iris-virginica'
Mean:  [ 3.75866667  1.19866667  5.84333333  3.054     ]
Median:  [ 4.35  1.3   5.8   3.  ]
Stddev:  [ 1.75852918  0.76061262  0.82530129  0.43214658]
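The Mean, Median and Stddev lines of that printout can be reproduced by hand. Here is a minimal sketch using only the standard library, on four made-up rows (not the real file). It assumes the population form of the standard deviation, which is what the printed Stddev values are consistent with; how classy computes them internally is not shown here.

```python
import statistics

# Four illustrative rows: petal length, petal width, sepal length, sepal width.
# These numbers are made up for the sketch; they are not read from iris.xls.
rows = [
    [1.4, 0.2, 5.1, 3.5],
    [4.7, 1.4, 7.0, 3.2],
    [6.0, 2.5, 6.3, 3.3],
    [5.1, 1.9, 5.8, 2.7],
]

# Transpose so that each tuple holds all the values of one feature.
columns = list(zip(*rows))

means = [statistics.mean(c) for c in columns]
medians = [statistics.median(c) for c in columns]
# pstdev divides by N (population form), not N-1.
stddevs = [statistics.pstdev(c) for c in columns]

print("Mean:  ", means)
print("Median:", medians)
print("Stddev:", stddevs)
```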

Look at the data

It's a good idea to look at the data a little bit first: check the shape of the vectors, the targets, and the target and feature names.



In [3]:

    
print(data.vectors.shape)
print(data.targets)
print(data.target_names)
print(data.feature_names)









    



(150, 4)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
['petal length in cm', 'petal width in cm', 'sepal length in cm', 'sepal width in cm']

Since you can't plot 4 dimensions, try plotting some 2D subsets.

I don't like the automatic placement of the legend, so let's set it manually.



In [4]:

    
subset=extract_features(data,[0,2])
plot2D(subset,legend_location='upper left')

I don't want to do the classification on this 2D subset, so make sure to use the entire data set (data, not subset) in what follows.
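To make the idea of a feature subset concrete, here is a rough sketch of picking columns 0 and 2 out of a list of vectors. The helper select_columns is hypothetical and written for this example; it is not classy's extract_features, it only imitates the idea on plain lists.

```python
def select_columns(rows, indices):
    """Return new rows that keep only the given column indices, in that order."""
    return [[row[i] for i in indices] for row in rows]


# Two made-up rows: petal length, petal width, sepal length, sepal width.
rows = [
    [1.4, 0.2, 5.1, 3.5],
    [4.7, 1.4, 7.0, 3.2],
]

# Columns 0 and 2 are petal length and sepal length.
subset = select_columns(rows, [0, 2])
print(subset)     # [[1.4, 5.1], [4.7, 7.0]]
print(rows[0])    # the original rows still have all four features
```

The sketch builds new rows rather than editing the originals, so the full four-feature data stays available for the classification.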

Classification

First, we choose a classifier



In [5]:

    
C=NaiveBayes()

Split the data into test and train subsets...



In [6]:

    
data_train,data_test=split(data,test_size=0.2)









    



Original vector shape:  (150, 4)
Train vector shape:  (120, 4)
Test vector shape:  (30, 4)
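The shapes above are what you would expect from holding out 20% of 150 vectors. A minimal sketch of such a split, assuming a simple shuffle of the indices, looks like this. The function split_rows and its seed parameter are made up for the example; classy's split may work differently.

```python
import random


def split_rows(vectors, targets, test_size=0.2, seed=0):
    """Shuffle the indices, then hold out a fraction test_size for testing."""
    indices = list(range(len(vectors)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable

    n_test = round(len(vectors) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(vectors, train_idx), pick(targets, train_idx),
            pick(vectors, test_idx), pick(targets, test_idx))


# Placeholder data with the same size as the iris set: 150 vectors, 3 classes.
vectors = [[float(i), float(i) * 2] for i in range(150)]
targets = [i // 50 for i in range(150)]

train_x, train_y, test_x, test_y = split_rows(vectors, targets)
print(len(train_x), len(test_x))  # 120 30
```

Shuffling before cutting matters here because the targets are stored in class order, so taking the last 30 rows unshuffled would put only one class in the test set.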

...and then train...



In [7]:

    
timeit(reset=True)
C.fit(data_train.vectors,data_train.targets)
print("Training time: ",timeit())









    



Time Reset
Training time:  0.002332925796508789 seconds
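If you want to time a step without the timeit helper used above, the same reset-then-read pattern can be sketched with the standard library. The loop below is just placeholder work standing in for the training call.

```python
import time

start = time.perf_counter()          # plays the role of "Time Reset"

# Placeholder work standing in for the fit call.
total = sum(i * i for i in range(100_000))

elapsed = time.perf_counter() - start
print(f"Training time: {elapsed:.6f} seconds")
```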



In [8]:

    
print("On Training Set:",C.percent_correct(data_train.vectors,data_train.targets))
print("On Test Set:",C.percent_correct(data_test.vectors,data_test.targets))









    



On Training Set: 95.0
On Test Set: 93.3333333333
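These numbers are just the share of vectors whose predicted class matches the target: 95.0 on 120 training vectors is 114 correct, and 93.33 on 30 test vectors is 28 correct. Here is a small sketch of that calculation; this percent_correct is a standalone function written for the example, not the classifier method used above.

```python
def percent_correct(predicted, actual):
    """Percentage of positions where the predicted label equals the actual one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * matches / len(actual)


# Three of the four made-up predictions match the targets.
print(percent_correct([0, 1, 2, 2], [0, 1, 1, 2]))  # 75.0
```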

Some classifiers have properties that are useful to look at. Naive Bayes has means and stddevs, one row per class and one column per feature...



In [9]:

    
C.means









    Out[9]:





array([[ 1.4575    ,  0.245     ,  5.0125    ,  3.4175    ],
       [ 4.18717949,  1.28974359,  5.87435897,  2.75384615],
       [ 5.55121951,  2.02926829,  6.57317073,  2.93414634]])



In [10]:

    
C.stddevs









    Out[10]:





array([[ 0.02944375,  0.011475  ,  0.12309375,  0.09394375],
       [ 0.21496384,  0.03733071,  0.2757528 ,  0.09069034],
       [ 0.32152291,  0.07865557,  0.40830458,  0.10029745]])
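To see how per-class means and spreads like these turn into a prediction, here is a minimal sketch of Gaussian Naive Bayes on plain lists. It is my own toy version, not the code behind NaiveBayes() above: the function names, the small 1e-9 term added to each variance, and the toy data are all assumptions for the example. The sketch stores variances (the square of the standard deviation). The magnitudes printed for C.stddevs above look closer to typical per-class variances of the iris features than to standard deviations, so it may be worth checking which of the two the library actually keeps.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(vectors, targets):
    """Estimate a prior, per-feature means and per-feature variances for each class."""
    by_class = defaultdict(list)
    for vector, target in zip(vectors, targets):
        by_class[target].append(vector)

    model = {}
    for label, rows in by_class.items():
        columns = list(zip(*rows))
        means = [sum(c) / len(c) for c in columns]
        # Population variance per feature; the tiny constant avoids division by zero.
        variances = [sum((x - m) ** 2 for x in c) / len(c) + 1e-9
                     for c, m in zip(columns, means)]
        model[label] = (len(rows) / len(vectors), means, variances)
    return model


def predict(model, vector):
    """Return the class with the largest log posterior for one vector."""
    def log_posterior(prior, means, variances):
        total = math.log(prior)
        # "Naive" part: features are treated as independent, so log densities add up.
        for x, m, var in zip(vector, means, variances):
            total += -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
        return total

    return max(model, key=lambda label: log_posterior(*model[label]))


# Toy data: two well-separated classes with two features each.
vectors = [[1.0, 0.2], [1.2, 0.3], [1.1, 0.1],
           [4.0, 1.3], [4.2, 1.5], [4.1, 1.4]]
targets = [0, 0, 0, 1, 1, 1]

model = fit_gaussian_nb(vectors, targets)
print(predict(model, [1.05, 0.2]))  # 0
print(predict(model, [4.1, 1.4]))   # 1
```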


