Naive Bayes and Logistic Regression

In this tutorial, we'll explore training and evaluation of Naive Bayes and Logitistic Regression Classifiers.

To start, we import the standard BIDMach class definitions.



In [1]:

    
import BIDMat.{CMat,CSMat,DMat,Dict,IDict,FMat,GMat,GIMat,GSMat,HMat,IMat,Mat,SMat,SBMat,SDMat}
import BIDMat.MatFunctions._
import BIDMat.SciFunctions._
import BIDMat.Solvers._
import BIDMat.Plotting._
import BIDMach.Learner
import BIDMach.models.{FM,GLM,KMeans,LDA,LDAgibbs,NMF,SFA}
import BIDMach.datasources.{MatDS,FilesDS,SFilesDS}
import BIDMach.mixins.{CosineSim,Perplexity,Top,L1Regularizer,L2Regularizer}
import BIDMach.updaters.{ADAGrad,Batch,BatchNorm,IncMult,IncNorm,Telescoping}
import BIDMach.causal.{IPTW}

Mat.checkMKL
Mat.checkCUDA
if (Mat.hasCUDA > 0) GPUmem









    



Cant find native HDF5 library
Couldnt load JCuda






    Out[1]:





()

Now we load some training and test data, and some category labels. The data come from a news collection from Reuters, and is a "classic" test set for classification. Each article belongs to one or more of 103 categories. The articles are represented as Bag-of-Words (BoW) column vectors. For a data matrix A, element A(i,j) holds the count of word i in document j.

The category matrices have 103 rows, and a category matrix C has a one in position C(i,j) if document j is tagged with category i, or zero otherwise.

To reduce the computing time and memory footprint, the training data have been sampled. The full collection has about 700k documents. Our training set has 60k.

Since the document matrices contain counts of words, we use a min function to limit the count to "1", i.e. because we need binary features for naive Bayes.



In [10]:

    
val dict = "../data/"
val traindata = loadFMat(dict+"wildlifeall.txt")
val traincats = loadIMat(dict+"wildlifecatall.txt")
val testdata = loadFMat(dict+"wltestall.txt")
val testcats = loadIMat(dict+"wltestcatall.txt")
//min(traindata, 1, traindata)                       // the first "traindata" argument is the input, the other is output
//min(testdata, 1, testdata)









    









    Out[10]:





   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1...
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0...
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0...
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0...
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0...
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0...
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0...
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0...
  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..

Get the word and document counts from the data. This turns out to be equivalent to a matrix multiply. For a data matrix A and category matrix C, we want all (cat, word) pairs (i,j) such that C(i,k) and A(j,k) are both 1 - this means that document k contains word j, and is also tagged with category i. Summing over all documents gives us

$${\rm wordcatCounts(i,j)} = \sum_{k=1}^N C(i,k) A(j,k) = C * A^T$$

Because we are doing independent binary classifiers for each class, we need to construct the counts for words not in the class (negwcounts).

Finally, we add a smoothing count 0.5 to counts that could be zero.



In [4]:

    
val truecounts = traincats *^ traindata
val wcounts = truecounts + 0.5
val negwcounts = sum(truecounts) - truecounts + 0.5
val dcounts = sum(traincats,2)

Now compute the probabilities

pwordcat = probability that a word is in a cat, given the cat.
pwordncat = probability of a word, given the complement of the cat.
pcat = probability that doc is in a given cat.
spcat = sum of pcat probabilities (> 1 because docs can be in multiple cats)



In [5]:

    
val pwordcat = wcounts / sum(wcounts,2)                 // Normalize the rows to sum to 1.
val pwordncat = negwcounts / sum(negwcounts,2)          // Each row represents word probabilities conditioned on one cat. 
val pcat = dcounts / traindata.ncols
val spcat = sum(pcat)









    









    



java.lang.RuntimeException: operator / not implemented for IMat and Float
    BIDMat.Mat.notImplemented2(Mat.scala:27)
    BIDMat.Mat.$div(Mat.scala:220)

Now take the logs of those probabilities. Here we're using the formula presented here to match Naive Bayes to Logistic Regression for independent data.

For each word, we compute the log of the ratio of the complementary word probability over the in-class word probability.

For each category, we compute the log of the ratio of the complementary category probability over the current category probability.

lpwordcat(j,i) represents $\log\left(\frac{{\rm Pr}(X_i|\neg c_j)}{{\rm Pr}(X_i|c_j)}\right)$

while lpcat(j) represents $\log\left(\frac{{\rm Pr}(\neg c)}{{\rm Pr}(c)}\right)$



In [ ]:

    
val lpwordcat = ln(pwordncat/pwordcat)   // ln is log to the base e (natural log)
val lpcat = ln((spcat-pcat)/pcat)

Here's where we apply Naive Bayes. The formula we're using is borrowed from here.

$${\rm Pr}(c|X_1,\ldots,X_k) = \frac{1}{1 + \frac{{\rm Pr}(\neg c)}{{\rm Pr}(c)}\prod_{i-1}^k\frac{{\rm Pr}(X_i|\neg c)}{{\rm Pr}(X_i|c)}}$$

and we can rewrite

$$\frac{{\rm Pr}(\neg c)}{{\rm Pr}(c)}\prod_{i-1}^k\frac{{\rm Pr}(X_i|\neg c)}{{\rm Pr}(X_i|c)}$$

$$\exp\left(\log\left(\frac{{\rm Pr}(\neg c)}{{\rm Pr}(c)}\right) + \sum_{i=1}^k\log\left(\frac{{\rm Pr}(X_i|\neg c)}{{\rm Pr}(X_i|c)}\right)\right) = \exp({\rm lpcat(j)} + {\rm lpwordcat(j,?)} * X)$$

for class number j and an input column $X$. This follows because an input column $X$ is a sparse vector with ones in the positions of the input features. The product ${\rm lpwordcat(i,?)} * X$ picks out the features occuring in the input document and adds the corresponding logs from lpwordcat.

Finally, we take the exponential above and fold it into the formula $P(c_j|X_1,\ldots,X_k) = 1/(1+\exp(\cdots))$. This gives us a matrix of predictions. preds(i,j) = prediction of membership in category i for test document j.



In [ ]:

    
val logodds = lpwordcat * testdata + lpcat
val preds = 1 / (1 + exp(logodds))

To measure the accuracy of the predictions above, we can compute the probability that the classifier outputs the right label. We used this formula in class for the expected accuracy for logistic regression. The "dot arrow" operator takes dot product along rows:



In [ ]:

    
val acc = ((preds ∙→ testcats) + ((1-preds) ∙→ (1-testcats)))/preds.ncols
acc.t

Raw accuracy is not a good measure in most cases. When there are few positives (instances in the class vs. its complement), accuracy simply drives down false-positive rate at the expense of false-negative rate. In the worst case, the learner may always predict "no" and still achieve high accuracy.

ROC curves and ROC Area Under the Curve (AUC) are much better. Here we compute the ROC curves from the predictions above. We need:

scores - the predicted quality from the formula above.
good - 1 for positive instances, 0 for negative instances.
bad - complement of good.
npoints (100) - specifies the number of X-axis points for the ROC plot.

itest specifies which of the categories to plot for. We chose itest=6 because that category has one of the highest positive rates, and gives the most stable accuracy plots.



In [ ]:

    
val itest = 6
val scores = preds(itest,?)
val good = testcats(itest,?)
val bad = 1-testcats(itest,?)
val rr =roc(scores,good,bad,100)

TODO 1: In the cell below, write an expression to derive the ROC Area under the curve (AUC) given the curve rr. rr gives the ROC curve y-coordinates at 100 evenly-spaced X-values from 0 to 1.0.



In [ ]:

    
// auc =

TODO 2: In the cell below, write the value of AUC returned by the expression above.



In [ ]:

Logistic Regression

Now lets train a logistic classifier on the same data. BIDMach has an umbrella classifier called GLM for Generalized Linear Model. GLM includes linear regression, logistic regression (with log accuracy or direct accuracy optimization), and SVM.

The learner function accepts these arguments:

traindata: the training data in the same format as for Naive Bayes
traincats: the training category labels
testdata: the test input data
predcats: a container for the predictions generated by the model
modeltype (GLM.logistic here): an integer that specifies the type of model (0=linear, 1=logistic log accuracy, 2=logistic accuracy, 3=SVM).

We'll construct the learner and then look at its options:



In [11]:

    
val predcats = zeros(testcats.nrows, testcats.ncols)
val (mm,mopts,nn,nopts) = GLM.learner(traindata, traincats, testdata, predcats, GLM.maxp)
mopts.what









    



Option Name       Type          Value
===========       ====          =====
addConstFeat      boolean       false
autoReset         boolean       false
batchSize         int           12
dim               int           256
doubleScore       boolean       false
epsilon           float         1.0E-5
evalStep          int           11
featThreshold     Mat           null
featType          int           1
hashFeatures      boolean       false
initsumsq         float         1.0E-5
iweight           FMat          null
lim               float         0.0
links             IMat             2
   2
   2
   2
...
lrate             FMat          1
mask              FMat          null
npasses           int           2
nzPerColumn       int           0
pstep             float         0.01
putBack           int           -1
r1nmats           int           1
reg1weight        FMat          1.0000e-07
resFile           String        null
rmask             FMat          null
sample            float         1.0
sizeMargin        float         3.0
startBlock        int           8000
targets           FMat          null
targmap           FMat          null
texp              FMat          0.50000
updateAll         boolean       false
useCache          boolean       true
useDouble         boolean       false
useGPU            boolean       true
vexp              FMat          0.50000
waitsteps         int           2

The most important options are:

lrate: the learning rate
batchSize: the minibatch size
npasses: the number of passes over the dataset

We'll use the following parameters for this training run.



In [13]:

    
mopts.lrate=1.0
mopts.batchSize=1000
mopts.npasses=2
mopts.autoReset = false
mm.train









    



corpus perplexity=2886.492192
pass= 0






    



scala.MatchError: (  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000...
  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000...
  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000...
  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000...
  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000...
  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000...
  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000...
  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000...
       ..       ..       ..       ..       ..       ..       ..       ..
,   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1...
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0...
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0...
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0...
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0...
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0...
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0...
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0...
  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..
,  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000...
  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000...
  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000...
  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000...
  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000...
  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000...
  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000...
  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000  0.50000...
       ..       ..       ..       ..       ..       ..       ..       ..
,   2
   2
   2
   2
   2
   2
   2
   2
  ..
) (of class scala.Tuple4)
    BIDMach.models.GLM$.derivs(GLM.scala:383)
    BIDMach.models.GLM.mupdate3(GLM.scala:101)
    BIDMach.models.GLM.mupdate2(GLM.scala:90)
    BIDMach.models.RegressionModel.doblock(Regression.scala:68)
    BIDMach.models.Model.doblockg(Model.scala:74)
    BIDMach.Learner.retrain(Learner.scala:83)
    BIDMach.Learner.train(Learner.scala:49)



In [8]:

    
nn.predict









    



corpus perplexity=4009.564775
Predicting
 3.00%, ll=-0.50000, gf=0.375, secs=0.0, GB=0.00, MB/s=33.46
 7.00%, ll=-0.50000, gf=0.624, secs=0.0, GB=0.00, MB/s=53.62
10.00%, ll=-0.50000, gf=0.702, secs=0.0, GB=0.00, MB/s=62.69
14.00%, ll=-0.50000, gf=0.833, secs=0.0, GB=0.00, MB/s=72.50
17.00%, ll=-0.50000, gf=0.937, secs=0.0, GB=0.00, MB/s=81.61
21.00%, ll=-0.50000, gf=1.022, secs=0.0, GB=0.00, MB/s=88.87
24.00%, ll=-0.50000, gf=1.093, secs=0.0, GB=0.00, MB/s=95.41
28.00%, ll=-0.50000, gf=1.070, secs=0.0, GB=0.00, MB/s=92.53
31.00%, ll=-0.50000, gf=1.124, secs=0.0, GB=0.00, MB/s=96.78
35.00%, ll=-0.50000, gf=1.171, secs=0.0, GB=0.00, MB/s=101.33
38.00%, ll=-0.50000, gf=1.212, secs=0.0, GB=0.00, MB/s=104.98
42.00%, ll=-0.50000, gf=1.249, secs=0.0, GB=0.00, MB/s=108.83
45.00%, ll=-0.50000, gf=1.282, secs=0.0, GB=0.00, MB/s=111.81
49.00%, ll=-0.50000, gf=1.311, secs=0.0, GB=0.00, MB/s=114.80
52.00%, ll=-0.50000, gf=1.338, secs=0.0, GB=0.00, MB/s=117.68
56.00%, ll=-0.50000, gf=1.362, secs=0.0, GB=0.00, MB/s=119.57
60.00%, ll=-0.50000, gf=1.385, secs=0.0, GB=0.00, MB/s=121.62
63.00%, ll=-0.50000, gf=1.466, secs=0.0, GB=0.00, MB/s=129.93
67.00%, ll=-0.50000, gf=1.483, secs=0.0, GB=0.00, MB/s=131.26
70.00%, ll=-0.50000, gf=1.499, secs=0.0, GB=0.00, MB/s=132.49
74.00%, ll=-0.50000, gf=1.513, secs=0.0, GB=0.00, MB/s=134.00
77.00%, ll=-0.50000, gf=1.526, secs=0.0, GB=0.00, MB/s=134.74
81.00%, ll=-0.50000, gf=1.539, secs=0.0, GB=0.00, MB/s=134.89
84.00%, ll=-0.50000, gf=1.606, secs=0.0, GB=0.00, MB/s=140.70
88.00%, ll=-0.50000, gf=1.615, secs=0.0, GB=0.00, MB/s=141.65
91.00%, ll=-0.50000, gf=1.624, secs=0.0, GB=0.00, MB/s=142.19
95.00%, ll=-0.50000, gf=1.632, secs=0.0, GB=0.00, MB/s=142.85
98.00%, ll=-0.50000, gf=1.639, secs=0.0, GB=0.00, MB/s=143.64
100.00%, ll=-0.50000, gf=1.659, secs=0.0, GB=0.00, MB/s=145.25
Time=0.0320 secs, gflops=1.66



In [9]:

    
val lacc = (predcats ∙→ testcats + (1-predcats) ∙→ (1-testcats))/preds.ncols
lacc.t
mean(lacc)









    



<console>:29: error: not found: value preds
              val lacc = (predcats ∙→ testcats + (1-predcats) ∙→ (1-testcats))/preds.ncols
                                                                               ^

Since we have the accuracy scores for both Naive Bayes and Logistic regression, we can plot both of them on the same axes. Naive Bayes is red, Logistic regression is blue. The x-axis is the category number from 0 to 102. The y-axis is the absolute accuracy of the predictor for that category.



In [ ]:

    
val axaxis = row(0 until 103)
plot(axaxis, acc, axaxis, lacc)

TODO 3: With the full training set (700k training documents), Logistic Regression is noticeably more accurate than Naive Bayes in every category. What do you observe in the plot above? Why do you think this is?

Next we'll compute the ROC plot and ROC area (AUC) for Logistic regression for category itest.



In [ ]:

    
val lscores = predcats(itest,?)
val lrr =roc(lscores,good,bad,100)
val auc = mean(lrr)                           // Fill in using the formula you used before

We computed the ROC curve for Naive Bayes earlier, so now we can plot them on the same axes. Naive Bayes is once again in red, Logistic regression in blue.



In [ ]:

    
val rocxaxis = row(0 until 101)
plot(rocxaxis, rr, rocxaxis, lrr)

TODO 4: In the cell below, compute and plot lift curves from the ROC curves for Naive Bayes and Logistic regression. The lift curves should show the ratio of ROC y-values over a unit slope diagonal line (Y=X). The X-values should be the same as for the ROC plots, except that X=0 will be omitted since the lift will be undefined.



In [ ]:

TODO 5: Experiment with different values for learning rate and batchSize to get the best performance for absolute accuracy and ROC area on category 6. Write your optimal values below:



In [ ]: