We will compute the probability that a cell belongs to a given Strain, $P(Y=y \mid X)$, given a feature vector $X$ (i.e. a vector containing the input resistance, sag ratio, etc.). We will use the “naive” assumption of conditional independence between every pair of features, given the class.
Given a class variable $Y$ and a dependent feature vector $X_1$ through $X_n$, Bayes’ theorem states the following relationship:
$$P(Y \mid X_1, \dots, X_n) = \frac{P(Y)\, P(X_1, \dots, X_n \mid Y)}{P(X_1, \dots, X_n)}$$
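To make the role of the naive assumption concrete: under conditional independence the likelihood factorizes into a product of per-feature terms, so the posterior is proportional to $P(Y) \prod_i P(X_i \mid Y)$. The following is a minimal sketch in plain Python, assuming each feature is normally distributed within a class (the Gaussian variant used later in this notebook). The helper names and all numeric parameters are made up for illustration and are not taken from the data set.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def naive_bayes_posterior(x, priors, means, variances):
    """Return P(Y=y | x) for every class y under the naive assumption."""
    joint = {}
    for y in priors:
        # Naive assumption: P(x | y) is the product of the per-feature P(x_i | y)
        likelihood = 1.0
        for x_i, m, v in zip(x, means[y], variances[y]):
            likelihood *= gaussian_pdf(x_i, m, v)
        joint[y] = priors[y] * likelihood
    evidence = sum(joint.values())  # P(x), the normalizing constant
    return {y: p / evidence for y, p in joint.items()}

# Hypothetical parameters for two classes and two features (illustration only)
priors = {0: 0.5, 1: 0.5}
means = {0: [-70.0, 100.0], 1: [-60.0, 200.0]}
variances = {0: [25.0, 400.0], 1: [25.0, 400.0]}

posterior = naive_bayes_posterior([-69.0, 110.0], priors, means, variances)
print(posterior)
```

The sample lies much closer to the class 0 means than to the class 1 means, so nearly all of the posterior mass goes to class 0.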
In [1]:
%pylab inline
In [2]:
import pandas as pd
In [3]:
# first row contains units
# note: recent pandas versions name this argument sheet_name (formerly sheetname)
df = pd.read_excel(io='../data/Cell_types.xlsx', sheet_name='PFC', skiprows=1)
del df['CellID'] # remove column with cell IDs
df.head() # show first elements
Out[3]:
We use pandas to split the table into the feature vectors we're interested in. We will also convert textual category data (Strain, Gender) into integer codes that we can work with.
In [4]:
pd.Categorical(df.Strain).codes # CB57BL is zero, GAD67 is one
Out[4]:
In [5]:
df['Gender'] = pd.Categorical(df.Gender).codes
df['Strain'] = pd.Categorical(df.Strain).codes
df.head()
Out[5]:
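As a small illustration of the conversion above, the sketch below shows how `pd.Categorical` assigns integer codes: the distinct labels are sorted, and each label's position in `categories` becomes its code. The toy column is made up; only the two strain names are taken from this notebook.

```python
import pandas as pd

# Toy column with made-up entries (illustration only)
strain = pd.Series(['GAD67', 'CB57BL', 'GAD67', 'CB57BL'])

cat = pd.Categorical(strain)
codes = cat.codes               # one integer code per row
labels = list(cat.categories)   # position in this list == code

# Mapping from code back to the original label
code_to_label = dict(enumerate(cat.categories))

print(codes)
print(code_to_label)
```

Keeping `code_to_label` around makes it possible to translate classifier output back into strain names. The codes are arbitrary labels; their numeric order carries no meaning.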
In [6]:
df.shape # as in NumPy, the number of rows comes first
Out[6]:
In [7]:
df.iloc[[0]].values[0] # get a row as NumPy array
Out[7]:
In [8]:
# create X and Y
Y = df['Strain'].values
del df['Strain'] # remove Strain
X = df.values # one row per cell, shape (n_cells, n_features)
len(X)==len(Y)
Out[8]:
In [9]:
X[0] # data from CB57BL
Out[9]:
In [10]:
from sklearn.naive_bayes import GaussianNB
In [11]:
myclassifier = GaussianNB()
myclassifier.fit(X,Y)
Out[11]:
In [12]:
df.iloc[[-2]].values # this is a GAD67 mouse
Out[12]:
In [13]:
df.iloc[[-1]].values # this is a GAD67 mouse
Out[13]:
We now test the classifier with the training data
In [14]:
def predict(idx):
    # predict expects a 2D array (n_samples, n_features), so wrap the row in a list
    if myclassifier.predict([X[idx]])[0]:
        print('Cell %2d is a GAD67 mouse' % idx)
    else:
        print('Cell %2d is a CB57BL mouse' % idx)

# test with training data (similar to myclassifier.score(X,Y) )
for i in range(df.shape[0]):
    predict(i)
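The loop above prints one prediction per cell; the same check can be summarized as a single number with `score`, which returns the fraction of correctly classified samples. The following is a minimal sketch on made-up, well-separated toy data (not the cells from the spreadsheet), assuming scikit-learn and NumPy are installed.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical data: 2 features, 2 classes (illustration only)
X_toy = np.array([[-75.0, 100.0], [-74.0, 110.0], [-76.0, 95.0],
                  [-60.0, 200.0], [-61.0, 210.0], [-59.0, 190.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = GaussianNB().fit(X_toy, y_toy)

# predict expects a 2D array of shape (n_samples, n_features)
predictions = clf.predict(X_toy)

# score is the mean accuracy, i.e. the fraction of correct predictions
accuracy = clf.score(X_toy, y_toy)
manual_accuracy = np.mean(predictions == y_toy)

# Per-class probabilities for a single sample; reshape(1, -1) makes it 2D
proba = clf.predict_proba(X_toy[0].reshape(1, -1))

print(accuracy, manual_accuracy)
print(proba)
```

Accuracy measured on the training data tends to be optimistic, so it is best read as a sanity check rather than as an estimate of performance on new cells.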
We test with some fictitious data
In [15]:
d = np.array([[ -75.50, 98.25, 1.49, 24.75, 90. ,21.5, 24.5 ,60, 1, 85.95, -48.6,
430.95, 0.5, 385.55]])
test_df = pd.DataFrame(d, columns=df.columns)
test_df
Out[15]:
In [16]:
# pass the single test row as a 2D array of shape (1, n_features)
if myclassifier.predict(test_df.values)[0]:
    print('Test is a GAD67 mouse')
else:
    print('Test is a CB57BL mouse')