Naive Bayesian classifier

Bayes' theorem

We will compute the probability that a cell belongs to a given strain, $P(Y=y)$, given a feature vector $X$ (i.e. a vector containing the input resistance, sag ratio, etc.). We will use the "naive" assumption of independence between every pair of features.

Given a class variable $Y$ and a dependent feature vector $X_1$ through $X_n$, Bayes' theorem states the following relationship:

$$P(Y | X_1, \dots, X_n) = \frac{P(Y)\, P(X_1, \dots, X_n | Y)}{P(X_1, \dots, X_n)}$$
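Under the "naive" assumption that the features are conditionally independent given the class, the likelihood factorizes as

$$P(X_1, \dots, X_n | Y) = \prod_{i=1}^{n} P(X_i | Y)$$

and, since $P(X_1, \dots, X_n)$ is the same for every class, the classifier simply picks the class that maximizes the numerator:

$$\hat{y} = \arg\max_{y} P(Y=y) \prod_{i=1}^{n} P(X_i | Y=y)$$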

In [1]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [2]:
import pandas as pd

In [3]:
# first row contains units
df = pd.read_excel(io='../data/Cell_types.xlsx', sheet_name='PFC', skiprows=1)
del df['CellID'] # remove column with cell IDs
df.head() # show first elements


Out[3]:
Vrest InputR Sag Tau_mb MaxAPfreq Temp Strain Weight Age Gender AP_peak AP_thr AP_maxrise AP_t50 rheobase
0 -64.8039 123.611 1.31594 21.766200 5 24.6 CB57BL 23.1 56 male 69.1071 -42.8009 170.2880 1.53888 150.0
1 -74.9081 152.837 1.17876 23.242900 17 22.4 CB57BL 24.6 57 male 79.8796 -46.7987 180.0540 2.03427 150.0
2 -74.4443 221.895 1.17550 27.983400 7 22.4 CB57BL 24.6 58 male 89.9048 -42.7856 292.9690 1.37136 100.0
3 -75.8549 294.484 1.10705 31.245583 29 22.5 CB57BL 26.0 58 female 74.4971 -48.0499 116.5570 1.64910 50.0
4 -66.3612 271.674 1.06450 34.679290 25 22.7 CB57BL 26.0 58 female 51.8799 -44.3420 79.9561 3.24711 50.0

We use pandas to split up the matrix into the feature vectors we're interested in. We will also convert the textual category data (Strain, Gender) into ordinal numbers that we can work with.


In [4]:
pd.Categorical(df.Strain).codes # CB57BL is zero, GAD67 is one


Out[4]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1], dtype=int8)
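The mapping between codes and labels can be recovered at any time from the `categories` attribute. A minimal sketch with a toy series (not the full dataset):

```python
import pandas as pd

# Toy example: pd.Categorical assigns integer codes following the sorted
# order of the unique labels, so the strain names can be recovered from
# the .categories attribute.
strains = pd.Series(['CB57BL', 'CB57BL', 'GAD67'])
cat = pd.Categorical(strains)
print(dict(enumerate(cat.categories)))  # {0: 'CB57BL', 1: 'GAD67'}
print(list(cat.codes))                  # [0, 0, 1]
```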

In [5]:
df['Gender'] = pd.Categorical(df.Gender).codes
df['Strain'] = pd.Categorical(df.Strain).codes
df.head()


Out[5]:
Vrest InputR Sag Tau_mb MaxAPfreq Temp Strain Weight Age Gender AP_peak AP_thr AP_maxrise AP_t50 rheobase
0 -64.8039 123.611 1.31594 21.766200 5 24.6 0 23.1 56 1 69.1071 -42.8009 170.2880 1.53888 150.0
1 -74.9081 152.837 1.17876 23.242900 17 22.4 0 24.6 57 1 79.8796 -46.7987 180.0540 2.03427 150.0
2 -74.4443 221.895 1.17550 27.983400 7 22.4 0 24.6 58 1 89.9048 -42.7856 292.9690 1.37136 100.0
3 -75.8549 294.484 1.10705 31.245583 29 22.5 0 26.0 58 0 74.4971 -48.0499 116.5570 1.64910 50.0
4 -66.3612 271.674 1.06450 34.679290 25 22.7 0 26.0 58 0 51.8799 -44.3420 79.9561 3.24711 50.0

In [6]:
df.shape # as with NumPy the number of rows first


Out[6]:
(11, 15)

In [7]:
df.iloc[[0]].values[0] # get a row as NumPy array


Out[7]:
array([ -64.8039 ,  123.611  ,    1.31594,   21.7662 ,    5.     ,
         24.6    ,    0.     ,   23.1    ,   56.     ,    1.     ,
         69.1071 ,  -42.8009 ,  170.288  ,    1.53888,  150.     ])

In [8]:
# create X and Y
Y = df['Strain'].values

del df['Strain'] # remove Strain
X = [ df.iloc[[i]].values[0] for i in range(df.shape[0]) ]
len(X)==len(Y)


Out[8]:
True
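The row-by-row list comprehension above can also be replaced by a single call to `df.values`, which yields the same rows as one 2-D NumPy array. A sketch with a toy frame standing in for the electrophysiology table:

```python
import numpy as np
import pandas as pd

# Toy frame (not the real data): extracting rows one by one is
# equivalent to taking the whole 2-D array via df.values in one step.
df = pd.DataFrame({'InputR': [123.6, 152.8], 'Sag': [1.32, 1.18]})
X_rows = [df.iloc[[i]].values[0] for i in range(df.shape[0])]
X = df.values
print(all(np.array_equal(a, b) for a, b in zip(X_rows, X)))  # True
```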

In [9]:
X[0] # data from CB57BL


Out[9]:
array([ -64.8039 ,  123.611  ,    1.31594,   21.7662 ,    5.     ,
         24.6    ,   23.1    ,   56.     ,    1.     ,   69.1071 ,
        -42.8009 ,  170.288  ,    1.53888,  150.     ])

Gaussian naive Bayesian classifier


In [10]:
from sklearn.naive_bayes import GaussianNB

In [11]:
myclassifier = GaussianNB()
myclassifier.fit(X,Y)


Out[11]:
GaussianNB()
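After fitting, `GaussianNB` exposes the estimated class priors and a `score` method for accuracy. A self-contained sketch on synthetic two-class data (not the real cell recordings):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy two-class problem: class 0 clusters near 0 and class 1 near 5,
# so the per-feature Gaussians separate them cleanly.
rng = np.random.RandomState(0)
Xt = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])
Yt = np.array([0] * 10 + [1] * 10)

clf = GaussianNB().fit(Xt, Yt)
print(clf.class_prior_)   # P(Y=y) estimated from class frequencies
print(clf.score(Xt, Yt))  # training accuracy
```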

In [12]:
df.iloc[[-2]].values # this is a GAD67 mouse


Out[12]:
array([[ -78.7819 ,   97.8774 ,    1.1365 ,   28.8005 ,   10.     ,
          22.3    ,   19.1    ,   58.     ,    1.     ,   86.9954 ,
         -48.2144 ,  233.765  ,    1.87993,  334.086  ]])

In [13]:
df.iloc[[-1]].values # this is a GAD67 mouse


Out[13]:
array([[ -67.8163  ,   98.3324  ,    1.04459 ,   19.2278  ,   63.      ,
          22.2     ,   22.3     ,   60.      ,    0.      ,   84.692   ,
         -44.611   ,  429.993   ,    0.537511,  344.928   ]])

We now test the classifier with the training data.


In [14]:
def predict(idx):
    # predict expects a 2-D array, so wrap the single sample in a list
    if myclassifier.predict([X[idx]])[0]:
        print('Cell %2d is a GAD67  mouse' % idx)
    else:
        print('Cell %2d is a CB57BL mouse' % idx)

# test with training data (similar to myclassifier.score(X, Y))
for i in range(df.shape[0]):
    predict(i)


Cell  0 is a CB57BL mouse
Cell  1 is a CB57BL mouse
Cell  2 is a CB57BL mouse
Cell  3 is a CB57BL mouse
Cell  4 is a CB57BL mouse
Cell  5 is a CB57BL mouse
Cell  6 is a CB57BL mouse
Cell  7 is a CB57BL mouse
Cell  8 is a CB57BL mouse
Cell  9 is a GAD67  mouse
Cell 10 is a GAD67  mouse
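Scoring on the training data itself is optimistic: with so few cells, a leave-one-out cross-validation gives a more honest estimate. A sketch on toy data (not the real recordings):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Toy data: each fold trains on all samples but one and tests on the
# held-out sample, mimicking the small-n situation of the notebook.
rng = np.random.RandomState(1)
Xt = np.vstack([rng.normal(0, 1, (6, 3)), rng.normal(4, 1, (5, 3))])
Yt = np.array([0] * 6 + [1] * 5)

scores = cross_val_score(GaussianNB(), Xt, Yt, cv=LeaveOneOut())
print(scores.mean())  # fraction of held-out samples classified correctly
```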

We test with some fictitious data


In [15]:
d = np.array([[ -75.50, 98.25, 1.49, 24.75, 90. ,21.5, 24.5 ,60, 1, 85.95, -48.6,
              430.95, 0.5, 385.55]])

test_df = pd.DataFrame(d, columns=df.columns)
test_df


Out[15]:
Vrest InputR Sag Tau_mb MaxAPfreq Temp Weight Age Gender AP_peak AP_thr AP_maxrise AP_t50 rheobase
0 -75.5 98.25 1.49 24.75 90.0 21.5 24.5 60.0 1.0 85.95 -48.6 430.95 0.5 385.55

In [16]:
if myclassifier.predict(test_df.iloc[[0]].values)[0]: # .values is already 2-D
    print('Test is a GAD67  mouse')
else:
    print('Test is a CB57BL mouse')


Test is a CB57BL mouse
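Besides a hard label, `GaussianNB` can also report posterior probabilities via `predict_proba`, which show how confident the call is. A self-contained sketch on toy 1-D data (not the fictitious cell above):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy 1-D data: two well-separated classes, so the posteriors
# P(Y=y | X) are close to 0 or 1 for points near either cluster.
Xt = np.array([[0.0], [0.2], [5.0], [5.2]])
Yt = np.array([0, 0, 1, 1])
clf = GaussianNB().fit(Xt, Yt)

probs = clf.predict_proba([[0.1], [5.1]])
print(probs.round(3))  # rows sum to 1, columns ordered as clf.classes_
```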