In this tutorial, we'll explore training and evaluation of Naive Bayes and Logistic Regression classifiers.

To start, we import the standard BIDMach class definitions.

```scala
import BIDMat.{CMat,CSMat,DMat,Dict,IDict,FMat,GMat,GIMat,GSMat,HMat,IMat,Mat,SMat,SBMat,SDMat}
import BIDMat.MatFunctions._
import BIDMat.SciFunctions._
import BIDMat.Solvers._
import BIDMat.Plotting._
import BIDMach.Learner
import BIDMach.models.{FM,GLM,KMeans,LDA,LDAgibbs,NMF,SFA}
import BIDMach.datasources.{MatDS,FilesDS,SFilesDS}
import BIDMach.mixins.{CosineSim,Perplexity,Top,L1Regularizer,L2Regularizer}
import BIDMach.updaters.{ADAGrad,Batch,BatchNorm,IncMult,IncNorm,Telescoping}
import BIDMach.causal.{IPTW}
// Check for native MKL and CUDA support, and show GPU memory if a GPU is present
Mat.checkMKL
Mat.checkCUDA
if (Mat.hasCUDA > 0) GPUmem
```

Now we load some training and test data, and some category labels. The data come from a Reuters news collection and are a "classic" test set for classification. Each article belongs to one or more of 103 categories. The articles are represented as Bag-of-Words (BoW) column vectors. For a data matrix A, element A(i,j) holds the count of word i in document j.

The category matrices have 103 rows, and a category matrix C has a one in position C(i,j) if document j is tagged with category i, or zero otherwise.

To reduce the computing time and memory footprint, the training data have been sampled. The full collection has about 700k documents. Our training set has 60k.

Since the document matrices contain word counts, we use a min function to cap each count at 1, because we need binary features for naive Bayes.
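
The capping step can be illustrated without BIDMat. The sketch below is plain Scala on a small dense array; the name `binarize` and the toy matrix are made up for illustration, and it returns a new array rather than writing into an output argument as the three-argument `min` does.

```scala
object BinarizeSketch {
  // Clip every count to at most 1, so each entry records presence (1) or absence (0).
  def binarize(counts: Array[Array[Int]]): Array[Array[Int]] =
    counts.map(row => row.map(c => math.min(c, 1)))

  def main(args: Array[String]): Unit = {
    // Rows are words, columns are documents, as in the data matrix A.
    val a = Array(
      Array(3, 0, 1),
      Array(0, 2, 0))
    binarize(a).foreach(row => println(row.mkString(" ")))
  }
}
```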

```scala
val dict = "../data/rcv1/"
val traindata = loadSMat(dict+"docs.smat.lz4")
val traincats = loadFMat(dict+"cats.fmat.lz4")
val testdata = loadSMat(dict+"testdocs.smat.lz4")
val testcats = loadFMat(dict+"testcats.fmat.lz4")
min(traindata, 1, traindata) // the first "traindata" argument is the input, the other is output
min(testdata, 1, testdata)
```

Get the word and document counts from the data. This turns out to be equivalent to a matrix multiply. For a data matrix A and category matrix C, we want all (cat, word) pairs (i,j) such that C(i,k) and A(j,k) are both 1 - this means that document k contains word j, and is also tagged with category i. Summing over all documents gives us

$${\rm wordcatCounts}(i,j) = \sum_{k=1}^N C(i,k) A(j,k) = (C * A^T)(i,j)$$

Because we are training an independent binary classifier for each class, we also need to construct the counts for words not in the class (negwcounts).

Finally, we add a smoothing count 0.5 to counts that could be zero.
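
To make the counting concrete, here is a minimal plain-Scala sketch on dense toy arrays. The helper names (`timesTranspose`, `colSums`, `negCounts`) and the data are hypothetical, and the sketch assumes that the complement counts are formed by subtracting each category's counts from the per-word totals over all categories.

```scala
object CountsSketch {
  type Mat = Array[Array[Double]]

  // C * A^T: entry (i, j) sums C(i,k) * A(j,k) over documents k.
  def timesTranspose(c: Mat, a: Mat): Mat =
    c.map(ci => a.map(aj => ci.zip(aj).map { case (x, y) => x * y }.sum))

  // Column sums: total count of each word across all categories.
  def colSums(m: Mat): Array[Double] = m.transpose.map(_.sum)

  // Smoothed counts of each word outside each category.
  def negCounts(trueCounts: Mat, smooth: Double): Mat = {
    val totals = colSums(trueCounts)
    trueCounts.map(r => r.zip(totals).map { case (x, t) => t - x + smooth })
  }

  def main(args: Array[String]): Unit = {
    val c: Mat = Array(Array(1.0, 1.0, 0.0), Array(0.0, 1.0, 1.0)) // 2 categories x 3 documents
    val a: Mat = Array(Array(1.0, 0.0, 1.0), Array(1.0, 1.0, 0.0)) // 2 words x 3 documents
    val trueCounts = timesTranspose(c, a)                          // 2 categories x 2 words
    val wcounts = trueCounts.map(_.map(_ + 0.5))                   // smoothed in-class counts
    val negwcounts = negCounts(trueCounts, 0.5)
    println(wcounts.map(_.mkString(" ")).mkString("\n"))
    println(negwcounts.map(_.mkString(" ")).mkString("\n"))
  }
}
```

With these toy inputs, category 0 tags documents 0 and 1, and word 1 occurs in both of them, so the unsmoothed count for (category 0, word 1) is 2.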

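```scala
val truecounts = traincats *^ traindata              // *^ multiplies by the transpose of the right operand
val wcounts = truecounts + 0.5                       // smoothed in-class word counts
val negwcounts = sum(truecounts) - truecounts + 0.5  // smoothed counts for words not in the class
val dcounts = sum(traincats,2)                       // number of documents in each category
```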

Now compute the probabilities:

- pwordcat = probability that a word is in a cat, given the cat.
- pwordncat = probability of a word, given the complement of the cat.
- pcat = probability that doc is in a given cat.
- spcat = sum of pcat probabilities (> 1 because docs can be in multiple cats)
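
As a small illustration of these definitions, the plain-Scala sketch below normalizes rows of a count matrix and computes category fractions. The function names and toy numbers are hypothetical; it only mirrors the arithmetic, not BIDMat's matrix types.

```scala
object ProbSketch {
  type Mat = Array[Array[Double]]

  // Divide each row by its own sum so the row is a probability distribution over words.
  def normalizeRows(m: Mat): Mat = m.map { r =>
    val s = r.sum
    r.map(_ / s)
  }

  // Fraction of documents tagged with each category (rows of c are categories).
  def categoryPriors(c: Mat): Array[Double] = c.map(r => r.sum / r.length)

  def main(args: Array[String]): Unit = {
    val wcounts: Mat = Array(Array(1.5, 2.5), Array(1.5, 1.5))
    val cats: Mat = Array(Array(1.0, 1.0, 0.0), Array(0.0, 1.0, 1.0))
    val pwordcat = normalizeRows(wcounts)
    val pcat = categoryPriors(cats)
    println(pwordcat.map(_.mkString(" ")).mkString("\n"))
    // The priors can sum to more than 1 because a document may carry several tags.
    println(pcat.sum)
  }
}
```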

```scala
val pwordcat = wcounts / sum(wcounts,2) // Normalize the rows to sum to 1.
val pwordncat = negwcounts / sum(negwcounts,2) // Each row represents word probabilities conditioned on one cat.
val pcat = dcounts / traindata.ncols
val spcat = sum(pcat)
```

Now take the logs of those probabilities. Here we're using the formula presented here to match Naive Bayes to Logistic Regression for independent data.

For each word, we compute the log of the ratio of the complementary word probability over the in-class word probability.

For each category, we compute the log of the ratio of the complementary category probability over the current category probability.

lpwordcat(j,i) represents $\log\left(\frac{{\rm Pr}(X_i|\neg c_j)}{{\rm Pr}(X_i|c_j)}\right)$

while lpcat(j) represents $\log\left(\frac{{\rm Pr}(\neg c_j)}{{\rm Pr}(c_j)}\right)$
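
The sign convention is easy to lose track of, so here is a minimal plain-Scala sketch with made-up probabilities. The name `logRatio` is hypothetical. Because the complement probability is in the numerator, a word that is more likely inside the class gets a negative value, and a word that is more likely outside the class gets a positive one.

```scala
object LogRatioSketch {
  // Elementwise log(Pr(X_i | not c_j) / Pr(X_i | c_j)).
  def logRatio(pNeg: Array[Array[Double]], pPos: Array[Array[Double]]): Array[Array[Double]] =
    pNeg.zip(pPos).map { case (n, p) =>
      n.zip(p).map { case (x, y) => math.log(x / y) }
    }

  def main(args: Array[String]): Unit = {
    val pwordncat = Array(Array(0.1, 0.5, 0.4)) // word probabilities outside the class
    val pwordcat  = Array(Array(0.2, 0.5, 0.3)) // word probabilities inside the class
    // First word favors the class (negative), second is neutral (zero), third opposes it (positive).
    println(logRatio(pwordncat, pwordcat)(0).mkString(" "))
  }
}
```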

```scala
val lpwordcat = ln(pwordncat/pwordcat) // ln is log to the base e (natural log)
val lpcat = ln((spcat-pcat)/pcat)
```

Here's where we apply Naive Bayes. The formula we're using is borrowed from here.

$${\rm Pr}(c|X_1,\ldots,X_k) = \frac{1}{1 + \frac{{\rm Pr}(\neg c)}{{\rm Pr}(c)}\prod_{i=1}^k\frac{{\rm Pr}(X_i|\neg c)}{{\rm Pr}(X_i|c)}}$$

and we can rewrite

$$\frac{{\rm Pr}(\neg c)}{{\rm Pr}(c)}\prod_{i=1}^k\frac{{\rm Pr}(X_i|\neg c)}{{\rm Pr}(X_i|c)}$$

as

$$\exp\left(\log\left(\frac{{\rm Pr}(\neg c)}{{\rm Pr}(c)}\right) + \sum_{i=1}^k\log\left(\frac{{\rm Pr}(X_i|\neg c)}{{\rm Pr}(X_i|c)}\right)\right) = \exp({\rm lpcat(j)} + {\rm lpwordcat(j,?)} * X)$$

for class number j and an input column $X$. This follows because an input column $X$ is a sparse vector with ones in the positions of the input features. The product ${\rm lpwordcat(j,?)} * X$ picks out the features occurring in the input document and adds the corresponding logs from lpwordcat.

Finally, we take the exponential above and fold it into the formula $P(c_j|X_1,\ldots,X_k) = 1/(1+\exp(\cdots))$. This gives us a matrix of predictions. preds(i,j) = prediction of membership in category i for test document j.
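
The equivalence between the product form and the log form can be checked numerically. The following is a minimal plain-Scala sketch for one class and one document; the function names and the numbers are illustrative assumptions, not part of BIDMach.

```scala
object PredictSketch {
  // Log form: lpcat = log(Pr(not c)/Pr(c)), lpword(i) = log ratio for word i, x = binary features.
  def predict(lpcat: Double, lpword: Array[Double], x: Array[Int]): Double = {
    val logOdds = lpcat + lpword.zip(x).map { case (w, xi) => w * xi }.sum
    1.0 / (1.0 + math.exp(logOdds))
  }

  // Product form: multiply the ratios of the words that are present in the document.
  def predictProduct(priorRatio: Double, ratios: Array[Double], x: Array[Int]): Double = {
    val prod = ratios.zip(x).collect { case (r, 1) => r }.product
    1.0 / (1.0 + priorRatio * prod)
  }

  def main(args: Array[String]): Unit = {
    val priorRatio = 3.0                 // Pr(not c) / Pr(c)
    val ratios = Array(0.5, 2.0, 0.25)   // Pr(X_i | not c) / Pr(X_i | c) for three words
    val x = Array(1, 0, 1)               // words 0 and 2 occur in the document
    println(predictProduct(priorRatio, ratios, x))
    println(predict(math.log(priorRatio), ratios.map(math.log), x))
  }
}
```

Both calls evaluate $1/(1 + 3 \cdot 0.5 \cdot 0.25)$, so they should agree up to floating-point rounding.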

```scala
val logodds = lpwordcat * testdata + lpcat
val preds = 1 / (1 + exp(logodds))
```

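```scala
// Per-category accuracy: ∙→ takes row-wise dot products, so each row sums the
// probability given to the correct label over all test documents
val acc = ((preds ∙→ testcats) + ((1-preds) ∙→ (1-testcats)))/preds.ncols
acc.t
```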

Raw accuracy is not a good measure in most cases. When there are few positives (instances in the class vs. its complement), maximizing accuracy simply drives down the false-positive rate at the expense of the false-negative rate. In the worst case, the learner may always predict "no" and still achieve high accuracy.
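
A small numeric example shows the problem. This plain-Scala sketch uses made-up labels (5 positives out of 100) and hypothetical helper names; a classifier that always answers "no" scores high accuracy while finding none of the positives.

```scala
object AccuracySketch {
  // Fraction of predictions that match the true labels.
  def accuracy(pred: Seq[Int], truth: Seq[Int]): Double =
    pred.zip(truth).count { case (p, t) => p == t }.toDouble / truth.length

  // Fraction of true positives that the classifier actually finds.
  def recall(pred: Seq[Int], truth: Seq[Int]): Double = {
    val positives = truth.count(_ == 1)
    val truePos = pred.zip(truth).count { case (p, t) => p == 1 && t == 1 }
    truePos.toDouble / positives
  }

  def main(args: Array[String]): Unit = {
    val truth = Seq.fill(5)(1) ++ Seq.fill(95)(0) // 5 positives, 95 negatives
    val alwaysNo = Seq.fill(100)(0)               // predicts "no" for every document
    println(accuracy(alwaysNo, truth))            // high, although nothing was learned
    println(recall(alwaysNo, truth))              // no positives found
  }
}
```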

ROC curves and ROC Area Under the Curve (AUC) are much better. Here we compute the ROC curves from the predictions above. We need:

- scores - the predicted quality from the formula above.
- good - 1 for positive instances, 0 for negative instances.
- bad - complement of good.
- npoints (100) - specifies the number of X-axis points for the ROC plot.

itest specifies which of the categories to plot for. We chose itest=6 because that category has one of the highest positive rates, and gives the most stable accuracy plots.
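
For intuition about what a ROC curve and its area measure, here is a minimal plain-Scala sketch. It is not BIDMat's `roc` function and does not reproduce its fixed number of X-axis points: it sweeps a threshold over the sorted scores and integrates with the trapezoid rule. The function names and data are hypothetical, and the sketch assumes all scores are distinct.

```scala
object RocSketch {
  // ROC points (false-positive rate, true-positive rate) as the threshold drops past each score.
  def rocPoints(scores: Seq[Double], good: Seq[Int]): Seq[(Double, Double)] = {
    val nPos = good.sum.toDouble
    val nNeg = good.length - nPos
    val sorted = scores.zip(good).sortWith(_._1 > _._1) // highest score first
    // Running counts of false positives and true positives.
    val cum = sorted.scanLeft((0.0, 0.0)) { case ((fp, tp), (_, g)) =>
      if (g == 1) (fp, tp + 1) else (fp + 1, tp)
    }
    cum.map { case (fp, tp) => (fp / nNeg, tp / nPos) }
  }

  // Area under the piecewise-linear ROC curve (trapezoid rule).
  def auc(points: Seq[(Double, Double)]): Double =
    points.sliding(2).map { case Seq((x0, y0), (x1, y1)) =>
      (x1 - x0) * (y0 + y1) / 2
    }.sum

  def main(args: Array[String]): Unit = {
    val scores = Seq(0.9, 0.8, 0.7, 0.6, 0.2)
    val good = Seq(1, 1, 0, 1, 0)
    println(auc(rocPoints(scores, good)))
  }
}
```

In this toy example, five of the six (positive, negative) pairs are ranked correctly, so the area is 5/6. An area of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.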

```scala
val itest = 6
val scores = preds(itest,?)
val good = testcats(itest,?)
val bad = 1-testcats(itest,?)
val rr = roc(scores,good,bad,100)
```

```scala
// auc =
```

TODO 2: In the cell below, write the value of AUC returned by the expression above.

```scala

```

Now let's train a logistic classifier on the same data. BIDMach has an umbrella classifier called GLM, for Generalized Linear Model. GLM includes linear regression, logistic regression (with log accuracy or direct accuracy optimization), and SVM.

The learner function accepts these arguments:

- traindata: the training data in the same format as for Naive Bayes
- traincats: the training category labels
- testdata: the test input data
- predcats: a container for the predictions generated by the model
- modeltype (GLM.maxp in the call below): an integer that specifies the type of model (0=linear, 1=logistic log accuracy, 2=logistic accuracy, 3=SVM).

We'll construct the learner and then look at its options:

```scala
val predcats = zeros(testcats.nrows, testcats.ncols)
val (mm,mopts,nn,nopts) = GLM.learner(traindata, traincats, testdata, predcats, GLM.maxp)
mopts.what
```

The most important options are:

- lrate: the learning rate
- batchSize: the minibatch size
- npasses: the number of passes over the dataset
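
To show how these three options interact, here is a minimal plain-Scala sketch of minibatch gradient ascent for a single logistic classifier. It is an illustration only: it uses a plain fixed-step update and made-up helper names, and it is not BIDMach's learner or updater.

```scala
object SgdSketch {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  def dot(w: Array[Double], x: Array[Double]): Double =
    w.zip(x).map { case (a, b) => a * b }.sum

  // One pass is one sweep over the data in consecutive minibatches of size batchSize.
  def train(xs: Array[Array[Double]], ys: Array[Double],
            lrate: Double, batchSize: Int, npasses: Int): Array[Double] = {
    val w = Array.fill(xs.head.length)(0.0)
    for (_ <- 0 until npasses; batch <- xs.indices.grouped(batchSize)) {
      // Gradient of the log-likelihood, summed over the minibatch.
      val grad = Array.fill(w.length)(0.0)
      for (i <- batch) {
        val err = ys(i) - sigmoid(dot(w, xs(i)))
        for (j <- w.indices) grad(j) += err * xs(i)(j)
      }
      // Step along the minibatch-averaged gradient, scaled by the learning rate.
      for (j <- w.indices) w(j) += lrate * grad(j) / batch.length
    }
    w
  }

  def main(args: Array[String]): Unit = {
    // Column 0 is a constant bias feature; column 1 is the input value.
    val xs = Array(Array(1.0, 2.0), Array(1.0, 1.0), Array(1.0, -1.0), Array(1.0, -2.0))
    val ys = Array(1.0, 1.0, 0.0, 0.0)
    val w = train(xs, ys, lrate = 1.0, batchSize = 2, npasses = 50)
    println(w.mkString(" "))
  }
}
```

A larger `batchSize` gives smoother gradient estimates but fewer updates per pass, while `lrate` scales every update and `npasses` sets how many times the data are revisited.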

We'll use the following parameters for this training run.

```scala
mopts.lrate=1.0
mopts.batchSize=1000
mopts.npasses=2
mopts.autoReset = false
mm.train
```

```scala
nn.predict
```

```scala
// Same accuracy measure as for Naive Bayes, now on the GLM predictions
val lacc = ((predcats ∙→ testcats) + ((1-predcats) ∙→ (1-testcats)))/predcats.ncols
lacc.t
mean(lacc)
```

```scala
val axaxis = row(0 until 103)
plot(axaxis, acc, axaxis, lacc)
```

Next we'll compute the ROC plot and ROC area (AUC) for logistic regression for category itest.

```scala
val lscores = predcats(itest,?)
val lrr = roc(lscores,good,bad,100)
val auc = mean(lrr) // Fill in using the formula you used before
```

```scala
val rocxaxis = row(0 until 101)
plot(rocxaxis, rr, rocxaxis, lrr)
```
