Naive Bayesian classifier

Bayes' theorem

We will compute the probability that a cell belongs to a given strain, $P(Y=y)$, given a feature vector $X$ (i.e. a vector containing the input resistance, sag ratio, etc.). We will use the "naive" assumption of independence between every pair of features.

Given a class variable $Y$ and a dependent feature vector $X_1$ through $X_n$, Bayes' theorem states the following relationship:

$$P(Y | X_1, \dots, X_n) = \frac{P(Y)\, P(X_1, \dots, X_n | Y)}{P(X_1, \dots, X_n)}$$
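Under the "naive" assumption that the features are conditionally independent given the class, the likelihood factorizes as

$$P(X_1, \dots, X_n | Y) = \prod_{i=1}^{n} P(X_i | Y)$$

and, since $P(X_1, \dots, X_n)$ is the same for every class, the classifier simply picks the class that maximizes the numerator:

$$\hat{y} = \arg\max_{y} P(Y=y) \prod_{i=1}^{n} P(X_i | Y=y)$$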

In [1]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [2]:
import pandas as pd

In [3]:
# first row contains units
df = pd.read_excel(io='../data/Cell_types.xlsx', sheet_name='PFC', skiprows=1)
del df['CellID'] # remove column with cell IDs
df.head() # show first elements


Out[3]:
Vrest InputR Sag Tau_mb MaxAPfreq Temp Strain Weight Age Gender AP_peak AP_thr AP_maxrise AP_t50 rheobase
0 -64.8039 123.611 1.31594 21.766200 5 24.6 CB57BL 23.1 56 male 69.1071 -42.8009 170.2880 1.53888 150.0
1 -74.9081 152.837 1.17876 23.242900 17 22.4 CB57BL 24.6 57 male 79.8796 -46.7987 180.0540 2.03427 150.0
2 -74.4443 221.895 1.17550 27.983400 7 22.4 CB57BL 24.6 58 male 89.9048 -42.7856 292.9690 1.37136 100.0
3 -75.8549 294.484 1.10705 31.245583 29 22.5 CB57BL 26.0 58 female 74.4971 -48.0499 116.5570 1.64910 50.0
4 -66.3612 271.674 1.06450 34.679290 25 22.7 CB57BL 26.0 58 female 51.8799 -44.3420 79.9561 3.24711 50.0

We use pandas to split up the matrix into the feature vectors we're interested in. We will also convert the textual category data (Strain, Gender) into ordinal numbers that we can work with.


In [4]:
pd.Categorical(df.Strain).codes # CB57BL is zero, GAD67 is one


Out[4]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1], dtype=int8)
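The mapping between codes and labels can be recovered at any time from the `categories` attribute. A minimal sketch with a toy series (not the full dataset):

```python
import pandas as pd

# Toy example: pd.Categorical assigns integer codes following the sorted
# order of the unique labels, so the strain names can be recovered from
# the .categories attribute.
strains = pd.Series(['CB57BL', 'CB57BL', 'GAD67'])
cat = pd.Categorical(strains)
print(dict(enumerate(cat.categories)))  # {0: 'CB57BL', 1: 'GAD67'}
print(list(cat.codes))                  # [0, 0, 1]
```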

In [5]:
df['Gender'] = pd.Categorical(df.Gender).codes
df['Strain'] = pd.Categorical(df.Strain).codes
df.head()


Out[5]:
Vrest InputR Sag Tau_mb MaxAPfreq Temp Strain Weight Age Gender AP_peak AP_thr AP_maxrise AP_t50 rheobase
0 -64.8039 123.611 1.31594 21.766200 5 24.6 0 23.1 56 1 69.1071 -42.8009 170.2880 1.53888 150.0
1 -74.9081 152.837 1.17876 23.242900 17 22.4 0 24.6 57 1 79.8796 -46.7987 180.0540 2.03427 150.0
2 -74.4443 221.895 1.17550 27.983400 7 22.4 0 24.6 58 1 89.9048 -42.7856 292.9690 1.37136 100.0
3 -75.8549 294.484 1.10705 31.245583 29 22.5 0 26.0 58 0 74.4971 -48.0499 116.5570 1.64910 50.0
4 -66.3612 271.674 1.06450 34.679290 25 22.7 0 26.0 58 0 51.8799 -44.3420 79.9561 3.24711 50.0

In [6]:
df.shape # as with NumPy the number of rows first


Out[6]:
(11, 15)

In [7]:
df.iloc[[0]].values[0] # get a row as NumPy array


Out[7]:
array([ -64.8039 ,  123.611  ,    1.31594,   21.7662 ,    5.     ,
         24.6    ,    0.     ,   23.1    ,   56.     ,    1.     ,
         69.1071 ,  -42.8009 ,  170.288  ,    1.53888,  150.     ])

In [8]:
# create X and Y
Y = df['Strain'].values

del df['Strain'] # remove Strain
X = [ df.iloc[[i]].values[0] for i in range(df.shape[0]) ]
len(X)==len(Y)


Out[8]:
True
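The row-by-row list comprehension above can also be replaced by a single call to `df.values`, which yields the same rows as one 2-D NumPy array. A sketch with a toy frame standing in for the electrophysiology table:

```python
import numpy as np
import pandas as pd

# Toy frame (not the real data): extracting rows one by one is
# equivalent to taking the whole 2-D array via df.values in one step.
df = pd.DataFrame({'InputR': [123.6, 152.8], 'Sag': [1.32, 1.18]})
X_rows = [df.iloc[[i]].values[0] for i in range(df.shape[0])]
X = df.values
print(all(np.array_equal(a, b) for a, b in zip(X_rows, X)))  # True
```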

In [9]:
X[0] # data from CB57BL


Out[9]:
array([ -64.8039 ,  123.611  ,    1.31594,   21.7662 ,    5.     ,
         24.6    ,   23.1    ,   56.     ,    1.     ,   69.1071 ,
        -42.8009 ,  170.288  ,    1.53888,  150.     ])

Gaussian naive Bayesian classifier


In [10]:
from sklearn.naive_bayes import GaussianNB

In [11]:
myclassifier = GaussianNB()
myclassifier.fit(X,Y)


Out[11]:
GaussianNB()
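After fitting, `GaussianNB` exposes the estimated class priors and a `score` method for accuracy. A self-contained sketch on synthetic two-class data (not the real cell recordings):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy two-class problem: class 0 clusters near 0 and class 1 near 5,
# so the per-feature Gaussians separate them cleanly.
rng = np.random.RandomState(0)
Xt = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])
Yt = np.array([0] * 10 + [1] * 10)

clf = GaussianNB().fit(Xt, Yt)
print(clf.class_prior_)   # P(Y=y) estimated from class frequencies
print(clf.score(Xt, Yt))  # training accuracy
```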

In [12]:
df.iloc[[-2]].values # this is a GAD67 mouse


Out[12]:
array([[ -78.7819 ,   97.8774 ,    1.1365 ,   28.8005 ,   10.     ,
          22.3    ,   19.1    ,   58.     ,    1.     ,   86.9954 ,
         -48.2144 ,  233.765  ,    1.87993,  334.086  ]])

In [13]:
df.iloc[[-1]].values # this is a GAD67 mouse


Out[13]:
array([[ -67.8163  ,   98.3324  ,    1.04459 ,   19.2278  ,   63.      ,
          22.2     ,   22.3     ,   60.      ,    0.      ,   84.692   ,
         -44.611   ,  429.993   ,    0.537511,  344.928   ]])

We now test the classifier with the training data.


In [14]:
def predict(idx):
    # predict expects a 2-D array, so wrap the single sample in a list
    if myclassifier.predict([X[idx]])[0]:
        print('Cell %2d is a GAD67  mouse' % idx)
    else:
        print('Cell %2d is a CB57BL mouse' % idx)

# test with training data (similar to myclassifier.score(X, Y))
for i in range(df.shape[0]):
    predict(i)


Cell  0 is a CB57BL mouse
Cell  1 is a CB57BL mouse
Cell  2 is a CB57BL mouse
Cell  3 is a CB57BL mouse
Cell  4 is a CB57BL mouse
Cell  5 is a CB57BL mouse
Cell  6 is a CB57BL mouse
Cell  7 is a CB57BL mouse
Cell  8 is a CB57BL mouse
Cell  9 is a GAD67  mouse
Cell 10 is a GAD67  mouse
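Scoring on the training data itself is optimistic: with so few cells, a leave-one-out cross-validation gives a more honest estimate. A sketch on toy data (not the real recordings):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Toy data: each fold trains on all samples but one and tests on the
# held-out sample, mimicking the small-n situation of the notebook.
rng = np.random.RandomState(1)
Xt = np.vstack([rng.normal(0, 1, (6, 3)), rng.normal(4, 1, (5, 3))])
Yt = np.array([0] * 6 + [1] * 5)

scores = cross_val_score(GaussianNB(), Xt, Yt, cv=LeaveOneOut())
print(scores.mean())  # fraction of held-out samples classified correctly
```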

We test with some fictitious data


In [15]:
d = np.array([[ -75.50, 98.25, 1.49, 24.75, 90. ,21.5, 24.5 ,60, 1, 85.95, -48.6,
              430.95, 0.5, 385.55]])

test_df = pd.DataFrame(d, columns=df.columns)
test_df


Out[15]:
Vrest InputR Sag Tau_mb MaxAPfreq Temp Weight Age Gender AP_peak AP_thr AP_maxrise AP_t50 rheobase
0 -75.5 98.25 1.49 24.75 90.0 21.5 24.5 60.0 1.0 85.95 -48.6 430.95 0.5 385.55

In [16]:
if myclassifier.predict(test_df.iloc[[0]].values)[0]: # .values is already 2-D
    print('Test is a GAD67  mouse')
else:
    print('Test is a CB57BL mouse')


Test is a CB57BL mouse
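Besides a hard label, `GaussianNB` can also report posterior probabilities via `predict_proba`, which show how confident the call is. A self-contained sketch on toy 1-D data (not the fictitious cell above):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy 1-D data: two well-separated classes, so the posteriors
# P(Y=y | X) are close to 0 or 1 for points near either cluster.
Xt = np.array([[0.0], [0.2], [5.0], [5.2]])
Yt = np.array([0, 0, 1, 1])
clf = GaussianNB().fit(Xt, Yt)

probs = clf.predict_proba([[0.1], [5.1]])
print(probs.round(3))  # rows sum to 1, columns ordered as clf.classes_
```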