In this tutorial, we'll explore training and evaluation of L-BFGS method based Logitistic Regression Classifiers.
To start, we import the standard BIDMach class definitions.
In [1]:
import BIDMat.{CMat,CSMat,DMat,Dict,IDict,FMat,GMat,GIMat,GSMat,HMat,IMat,Mat,SMat,SBMat,SDMat}
import BIDMat.MatFunctions._
import BIDMat.SciFunctions._
import BIDMat.Solvers._
import BIDMat.Plotting._
import BIDMach.Learner
import BIDMach.models.{FM,GLM,KMeans,LDA,LDAgibbs,NMF,SFA}
import BIDMach.datasources.{MatDS,FilesDS,SFilesDS}
import BIDMach.mixins.{CosineSim,Perplexity,Top,L1Regularizer,L2Regularizer}
import BIDMach.updaters.{ADAGrad,Batch,BatchNorm,IncMult,IncNorm,Telescoping}
import BIDMach.causal.{IPTW}
Mat.checkMKL
Mat.checkCUDA
if (Mat.hasCUDA > 0) GPUmem
Out[1]:
Now we load some training and test data, and some category labels. The data come from a news collection from Reuters, and is a "classic" test set for classification. Each article belongs to one or more of 103 categories. The articles are represented as Bag-of-Words (BoW) column vectors. For a data matrix A, element A(i,j) holds the count of word i in document j.
The category matrices have 103 rows, and a category matrix C has a one in position C(i,j) if document j is tagged with category i, or zero otherwise.
To reduce the computing time and memory footprint, the training data have been sampled. The full collection has about 700k documents. Our training set has 60k.
Since the document matrices contain counts of words, we use a min function to limit the count to "1", i.e. because we need binary features for naive Bayes.
In [6]:
val dict = "/Users/Anna/workspace/BIDMach/data/"
val rpath = "/Users/Anna/workspace/BIDMach/data/"
var index = 79;
var index1 = 110;
var tnum = 1;
var tnum1 = 1;
//for(index <- 79 to 79){
var aa = loadFMat(dict+index+"_train"+".txt")
// var c = loadFMat(dict+index+"_test"+".txt")
var b = loadFMat(dict+index1+"_train.txt")
//var a = aa.t
var atrain = aa(?,(239->aa.ncols))
//var atest = c(?,(239->c.ncols))
var atrain1 = b(?,(239->b.ncols))
tnum = (aa.nrows+1)/2
var ttrain = aa(0->tnum,(239->aa.ncols) )
var ttest = aa(tnum->aa.nrows,(239->aa.ncols) )
tnum1 = (b.nrows+1)/2
var ttrain1 = b(0->tnum1,(239->b.ncols) )
var ttest1 = b(tnum1->b.nrows,(239->b.ncols) )
var trainx = zeros( (ttrain.nrows+ttrain1.nrows),4096)
trainx(0->(ttrain.nrows),?)=ttrain
trainx( (ttrain.nrows)->(ttrain.nrows+ttrain1.nrows),?)=ttrain1
var testx= zeros( (ttest.nrows+ttest1.nrows),4096)
testx(0->(ttest.nrows),?)=ttest
testx( (ttest.nrows)->(ttest.nrows+ttest1.nrows),?)=ttest1
var ctrain1=zeros(2,(ttrain.nrows+ttrain1.nrows) )
ctrain1(0,(0->(ttrain.nrows) ) )= 1
ctrain1(1,(ttrain.nrows)->(ttrain.nrows+ttrain1.nrows)) = 1
ctrain1(1,(0->(ttrain.nrows) ) )= -1
ctrain1(0,(ttrain.nrows)->(ttrain.nrows+ttrain1.nrows)) = -1
var ctest=zeros(2,(ttest.nrows+ttest1.nrows) )
ctest(0,(0->(ttest.nrows) ) )= 1
ctest(1,(ttest.nrows)->(ttest.nrows+ttest1.nrows)) = 1
//ctest(1,(0->(ttest.nrows) ) )= -1
//ctest(0,(ttest.nrows)->(ttest.nrows+ttest1.nrows)) = -1
//max(atrain, 0.001, atrain) // the first "traindata" argument is the input, the other is out
tnum=testx.nrows
val cx=zeros(2,testx.nrows)
var maxx = maxi(maxi(trainx,1),2);
var minx = mini(mini(trainx,1),2);
trainx = (trainx-minx)/(maxx-minx);
testx = (testx-minx)/(maxx-minx);
val (mm,mopts,nn,nopts)=GLM.LBFGSlearner(trainx.t,ctrain1,testx.t,cx)
mopts.autoReset=false
mopts.useGPU=false
mopts.lrate=0.1
mopts.batchSize=50
mopts.dim=256
mopts.startBlock=0
mopts.npasses=20
mopts.updateAll=false
//problem:memory leak
mm.train;
nn.predict;
saveFMat(rpath+"r"+index+".txt",cx/tnum);
//}
min(cx, 1, cx) // the first "traindata" argument is the input, the other is output
max(cx, 0, cx)
val p=ctest *@cx +(1-ctest) *@(1-cx)
var meanp=mean(p,2)
//saveFMat(rpath+"meanp"+index+".txt",meanp);
meanp
Out[6]:
Get the word and document counts from the data. This turns out to be equivalent to a matrix multiply. For a data matrix A and category matrix C, we want all (cat, word) pairs (i,j) such that C(i,k) and A(j,k) are both 1 - this means that document k contains word j, and is also tagged with category i. Summing over all documents gives us
$${\rm wordcatCounts(i,j)} = \sum_{k=1}^N C(i,k) A(j,k) = C * A^T$$Because we are doing independent binary classifiers for each class, we need to construct the counts for words not in the class (negwcounts).
Finally, we add a smoothing count 0.5 to counts that could be zero.
In [45]:
val cx=zeros(ctest.nrows,ctest.ncols)
val (mm,mopts,nn,nopts)=GLM.learner(atrain,ctrain,atest,cx,0)
mopts.autoReset=false
mopts.useGPU=false
mopts.lrate=0.1
mopts.batchSize=2
mopts.dim=256
mopts.startBlock=0
mopts.npasses=10
mopts.updateAll=false
//mopts.what
Out[45]:
Now compute the probabilities
In [11]:
//mm.train
//nn.predict
tnum1
Out[11]:
Now take the logs of those probabilities. Here we're using the formula presented here to match Naive Bayes to Logistic Regression for independent data.
For each word, we compute the log of the ratio of the complementary word probability over the in-class word probability.
For each category, we compute the log of the ratio of the complementary category probability over the current category probability.
lpwordcat(j,i) represents $\log\left(\frac{{\rm Pr}(X_i|\neg c_j)}{{\rm Pr}(X_i|c_j)}\right)$
while lpcat(j) represents $\log\left(\frac{{\rm Pr}(\neg c)}{{\rm Pr}(c)}\right)$
In [22]:
val cx1=cx
min(cx1, 1, cx1) // the first "traindata" argument is the input, the other is output
max(cx1, 0, cx1)
val p=ctest *@cx1 +(1-ctest) *@(1-cx1)
mean(p,2)
//saveFMat(rpath+index+".txt",cx)
cx
Here's where we apply Naive Bayes. The formula we're using is borrowed from here.
$${\rm Pr}(c|X_1,\ldots,X_k) = \frac{1}{1 + \frac{{\rm Pr}(\neg c)}{{\rm Pr}(c)}\prod_{i-1}^k\frac{{\rm Pr}(X_i|\neg c)}{{\rm Pr}(X_i|c)}}$$and we can rewrite
$$\frac{{\rm Pr}(\neg c)}{{\rm Pr}(c)}\prod_{i-1}^k\frac{{\rm Pr}(X_i|\neg c)}{{\rm Pr}(X_i|c)}$$as
$$\exp\left(\log\left(\frac{{\rm Pr}(\neg c)}{{\rm Pr}(c)}\right) + \sum_{i=1}^k\log\left(\frac{{\rm Pr}(X_i|\neg c)}{{\rm Pr}(X_i|c)}\right)\right) = \exp({\rm lpcat(j)} + {\rm lpwordcat(j,?)} * X)$$for class number j and an input column $X$. This follows because an input column $X$ is a sparse vector with ones in the positions of the input features. The product ${\rm lpwordcat(i,?)} * X$ picks out the features occuring in the input document and adds the corresponding logs from lpwordcat.
Finally, we take the exponential above and fold it into the formula $P(c_j|X_1,\ldots,X_k) = 1/(1+\exp(\cdots))$. This gives us a matrix of predictions. preds(i,j) = prediction of membership in category i for test document j.
In [6]:
val lacc = (cx1 ∙→ ctest + (1-cx1) ∙→ (1-ctest))/cx1.ncols
lacc.t
mean(lacc)
Out[6]:
To measure the accuracy of the predictions above, we can compute the probability that the classifier outputs the right label. We used this formula in class for the expected accuracy for logistic regression. The "dot arrow" operator takes dot product along rows:
In [7]:
val model = mm.model
val (nn1, nopts1) = GLM.LBFGSpredictor(model, atest, cx)
Out[7]:
Raw accuracy is not a good measure in most cases. When there are few positives (instances in the class vs. its complement), accuracy simply drives down false-positive rate at the expense of false-negative rate. In the worst case, the learner may always predict "no" and still achieve high accuracy.
ROC curves and ROC Area Under the Curve (AUC) are much better. Here we compute the ROC curves from the predictions above. We need:
itest specifies which of the categories to plot for. We chose itest=6 because that category has one of the highest positive rates, and gives the most stable accuracy plots.
In [8]:
nn1.predict
TODO 1: In the cell below, write an expression to derive the ROC Area under the curve (AUC) given the curve rr. rr gives the ROC curve y-coordinates at 100 evenly-spaced X-values from 0 to 1.0.
In [9]:
val cx1=cx*10
min(cx1, 1, cx1) // the first "traindata" argument is the input, the other is output
max(cx1, 0, cx1)
val p=ctest *@cx1 +(1-ctest) *@(1-cx1)
mean(p,2)
val lacc = (cx1 ∙→ ctest + (1-cx1) ∙→ (1-ctest))/cx1.ncols
lacc.t
mean(lacc)
Out[9]:
TODO 2: In the cell below, write the value of AUC returned by the expression above.
In [10]:
cx1
Out[10]:
In [11]:
saveFMat(dict+"moonresult.fmat.txt",cx)
Now lets train a logistic classifier on the same data. BIDMach has an umbrella classifier called GLM for Generalized Linear Model. GLM includes linear regression, logistic regression (with log accuracy or direct accuracy optimization), and SVM.
The learner function accepts these arguments:
We'll construct the learner and then look at its options:
In [2]:
val dict = "/Users/Anna/workspace/BIDMach/data/wildlife/"
val aa = loadFMat(dict+"1.txt")
val c = loadFMat(dict+"c1.txt")
val b = loadFMat("/Users/Anna/workspace/BIDMach/data/wltestall.txt")
val d = loadFMat("/Users/Anna/workspace/BIDMach/data/wltestcatall.txt")
//val dict = "/Users/Anna/workspace/BIDMach_1.0.0-full-linux-x86_64/data/uci/"
//val aa = loadFMat(dict+"arabic.fmat.lz4")
//val c = loadFMat(dict+"arabic_cats.fmat.lz4")
val a = aa *10
val atrain = a //a(?,(100->a.ncols))
val atest = a //a(?,(0->100))
val ctrain = c //c(?,(100->a.ncols))
val ctest = c //c(?,(0->100))
//max(atrain, 0.001, atrain) // the first "traindata" argument is the input, the other is output
//max(atest, 0.001, atest)
atrain
Out[2]:
The most important options are:
We'll use the following parameters for this training run.
In [7]:
val cx=zeros(2,ctest.ncols)
val (mm,mopts,nn,nopts)=GLM.SVMlearner(atrain,ctrain,atest,cx)
mopts.autoReset=false
mopts.useGPU=false
mopts.lrate=1
mopts.batchSize=2
mopts.dim=256
mopts.startBlock=0
mopts.npasses=10
mopts.updateAll=false
mopts.what
In [8]:
mm.train
nn.predict
val cx1=cx
//min(cx1, 1, cx1) // the first "traindata" argument is the input, the other is output
//max(cx1, 0, cx1)
//val p=ctest *@cx1 +(1-ctest) *@(1-cx1)
//mean(p,2)
cx
Out[8]:
Since we have the accuracy scores for both Naive Bayes and Logistic regression, we can plot both of them on the same axes. Naive Bayes is red, Logistic regression is blue. The x-axis is the category number from 0 to 102. The y-axis is the absolute accuracy of the predictor for that category.
In [7]:
val lacc = (cx1 ∙→ ctest + (1-cx1) ∙→ (1-ctest))/cx1.ncols
lacc.t
mean(lacc)
TODO 3: With the full training set (700k training documents), Logistic Regression is noticeably more accurate than Naive Bayes in every category. What do you observe in the plot above? Why do you think this is?
Next we'll compute the ROC plot and ROC area (AUC) for Logistic regression for category itest.
In [40]:
saveFMat(dict+"wildliferesults.fmat.txt",cx)
We computed the ROC curve for Naive Bayes earlier, so now we can plot them on the same axes. Naive Bayes is once again in red, Logistic regression in blue.
In [51]:
cx1
Out[51]:
TODO 4: In the cell below, compute and plot lift curves from the ROC curves for Naive Bayes and Logistic regression. The lift curves should show the ratio of ROC y-values over a unit slope diagonal line (Y=X). The X-values should be the same as for the ROC plots, except that X=0 will be omitted since the lift will be undefined.
In [2]:
val dict = "/Users/Anna/workspace/BIDMach_1.0.0-full-linux-x86_64/data/"
//val a = loadFMat(dict+"arabic.fmat.lz4")
//val c = loadFMat(dict+"arabic_cats.fmat.lz4")
val aa = loadFMat(dict+"a.txt")
val c = loadFMat(dict+"alabel.txt")
val a = aa + 0.5
val atrain =a(?,(100->a.ncols))
val atest =a(?,(0->100))
val ctrain =c(?,(100->a.ncols))
val ctest =c(?,(0->100))
max(atrain, 0.001, atrain) // the first "traindata" argument is the input, the other is output
max(atest, 0.001, atest)
atest
Out[2]:
TODO 5: Experiment with different values for learning rate and batchSize to get the best performance for absolute accuracy and ROC area on category 6. Write your optimal values below:
In [41]:
val cx=zeros(ctest.nrows,ctest.ncols)
val (mm,mopts,nn,nopts)=GLM.SVMlearner(atrain,ctrain,atest,cx)
mopts.autoReset=false
mopts.useGPU=false
mopts.lrate=0.1
mopts.batchSize=2
mopts.dim=256
mopts.startBlock=0
mopts.npasses=10
mopts.updateAll=false
mopts.what
In [42]:
mm.train
nn.predict
In [ ]:
val cx1=cx*5
min(cx1, 1, cx1) // the first "traindata" argument is the input, the other is output
max(cx1, 0, cx1)
val p=ctest *@cx1 +(1-ctest) *@(1-cx1)
mean(p,2)
In [6]:
cx1
Out[6]:
In [7]:
val lacc = (cx1 ∙→ ctest + (1-cx1) ∙→ (1-ctest))/cx1.ncols
lacc.t
mean(lacc)
Out[7]:
In [8]:
val model = mm.model
val (nn1, nopts1) = GLM.LBFGSpredictor(model, atest, cx)
Out[8]:
In [9]:
nn1.predict
In [10]:
val cx1=cx*10
min(cx1, 1, cx1) // the first "traindata" argument is the input, the other is output
max(cx1, 0, cx1)
val p=ctest *@cx1 +(1-ctest) *@(1-cx1)
mean(p,2)
val lacc = (cx1 ∙→ ctest + (1-cx1) ∙→ (1-ctest))/cx1.ncols
lacc.t
mean(lacc)
Out[10]:
In [11]:
cx1
Out[11]:
In [23]:
saveFMat(dict+"moonresult.fmat.txt",cx)
In [ ]:
In [19]: