This notebook demonstrates Document classification in Shogun. After providing a semi-formal introduction to the Bag of Words model and its limitations, we illustrate the hashing trick. This is consolidated by performing experiments on the large-scale webspam data set.
The most common and most widely used representation of document collections is the Bag of Words (BoW) model. The BoW representation considers each document as a collection of tokens. A token is defined as the minimal sequence of characters that is treated as atomic; depending on the user's choice, tokens can be whole words, n-grams, etc. A simple approach is to consider every distinct token in the document collection as a separate dimension in a feature space and then vectorize each document by assigning 1 to every dimension whose token appears in that document and 0 to every other dimension. This simpler approach is known as the Boolean Vector Space Model. Another approach is, instead of using {1, 0}, to use a weight for every term, such as its frequency, but that is beyond our scope here.
Let's now consider a document collection consisting of the following documents:
Document 1 = "This is the first document"
Document 2 = "Document classification: Introduction"
Document 3 = "A third document about classification"
and suppose that we consider as tokens every word that appears in the collection.
Then we would arrive at the following representation:
| | Document 1 | Document 2 | Document 3 |
|---|---|---|---|
| [0] this | 1 | 0 | 0 |
| [1] is | 1 | 0 | 0 |
| [2] the | 1 | 0 | 0 |
| [3] first | 1 | 0 | 0 |
| [4] document | 1 | 1 | 1 |
| [5] classification | 0 | 1 | 1 |
| [6] introduction | 0 | 1 | 0 |
| [7] a | 0 | 0 | 1 |
| [8] third | 0 | 0 | 1 |
| [9] about | 0 | 0 | 1 |
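The Boolean representation above can be reproduced in a few lines of plain Python (a minimal sketch of the Boolean Vector Space Model, without Shogun):

```python
documents = [
    "this is the first document",
    "document classification introduction",
    "a third document about classification",
]

# Build the vocabulary: every distinct word becomes one dimension.
vocabulary = sorted({word for doc in documents for word in doc.split()})

def boolean_vector(doc):
    """Return a 0/1 vector: 1 if the token occurs in the document."""
    tokens = set(doc.split())
    return [1 if word in tokens else 0 for word in vocabulary]

vectors = [boolean_vector(doc) for doc in documents]
```

Note that the dimensionality equals the vocabulary size, which is why this explicit representation becomes unwieldy for large collections.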
Instead of maintaining an explicit vocabulary, the hashing trick maps each token directly to a dimension via a hash function. This results in a hashed document-term matrix, similar in form to the one shown above, but of a pre-defined dimension (number of rows).
Each token is now assigned to a dimension that looks random but is always the same for the same token, as determined by the hash function.
This will probably incur some collisions, but if the hash function is good enough and our choice of dimension is sufficiently large, this is not a problem, especially when we are considering large collections.
After the creation of the hashed document-term matrix, the idea is to pass it as a dataset to a linear classifier that will take care of the rest!
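The hashing trick can be sketched in plain Python; here hashlib's md5 stands in for a generic hash function (Shogun uses its own internal hash, so the dimensions it assigns will differ):

```python
import hashlib

def hashed_dimension(token, bits=8):
    """Map a token to one of 2**bits dimensions, deterministically."""
    digest = hashlib.md5(token.encode()).hexdigest()
    return int(digest, 16) % (2 ** bits)

def hashed_vector(doc, bits=8):
    """Hashed bag-of-words: no vocabulary needed, fixed dimension."""
    vec = [0] * (2 ** bits)
    for token in doc.split():
        vec[hashed_dimension(token, bits)] += 1  # collisions simply add up
    return vec

v = hashed_vector("this is the first document")
```

The vector length is fixed at 2^bits regardless of how many distinct tokens the collection contains, which is what makes the approach attractive for large-scale data.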
In [ ]:
%matplotlib inline
import os
SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')
from shogun import StringCharFeatures, RAWBYTE, HashedDocDotFeatures, NGramTokenizer
The StringCharFeatures are nothing more than a collection of char vectors, where each one represents a single document.
RAWBYTE is simply an enum that specifies that every possible character can be found in the collection.
The HashedDocDotFeatures is where all the magic happens! This class is responsible for encapsulating the document collection and providing the calculation of the hashed representation of each document whenever it is required.
The NGramTokenizer is the tokenizer we will use on our collection. Its job is to parse a document and create the tokens. As its name suggests, it creates a sequence of n-grams. Another option would be the DelimiterTokenizer, which allows the user to specify the delimiter used to split the text into tokens.
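As a rough illustration (this mimics what a character n-gram tokenizer produces, not Shogun's actual implementation), a 5-gram tokenizer slides a window of width 5 over the text:

```python
def char_ngrams(text, n):
    """All contiguous character n-grams of the given text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

tokens = char_ngrams("document", 5)
# ['docum', 'ocume', 'cumen', 'ument']
```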
Suppose now that we have the following documents:
In [ ]:
doc_1 = "this is the first document"
doc_2 = "document classification introduction"
doc_3 = "a third document about classification"
document_collection = [doc_1, doc_2, doc_3]
We will now assign each document to a category, to ease our work later on. Since the last two documents refer to classification, we will label them "Relevant" and the first document "Not relevant". Since we only have two categories, this is a binary problem, and we will represent "Relevant" as 1 and "Not relevant" as -1.
In [ ]:
from shogun import BinaryLabels
from numpy import array
labels = BinaryLabels(array([-1, 1, 1]))
We now create our document collection:
In [ ]:
string_features = StringCharFeatures(document_collection, RAWBYTE)
Now we will create the object responsible for the hashed BoW representation. We are going to specify a hash size of 8 bits, which translates to a dimension of size 2^8 = 256 (powers of 2 are used to speed up computations), and a tokenizer that creates 5-grams. We will also specify that we want to normalize each hashed vector by the square root of the document length, which is a common approach:
In [ ]:
hash_size = 8
tokenizer = NGramTokenizer(5)
normalize = True
hashed_feats = HashedDocDotFeatures(hash_size, string_features, tokenizer, normalize)
And that was it!
The hashed_feats object now has all the required parameters and knows how to communicate and provide the hashed representation only when it is needed by the various algorithms of Shogun.
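The normalization mentioned above can be sketched as follows: each hashed vector is scaled by one over the square root of the document length. This illustrates the general sqrt-length scheme; Shogun's exact internal convention may differ in detail:

```python
import math

def sqrt_length_normalize(vec, num_tokens):
    """Scale a hashed count vector by 1/sqrt(document length)."""
    scale = 1.0 / math.sqrt(num_tokens)
    return [v * scale for v in vec]

# A document with 4 tokens hashed into a tiny 4-dimensional space:
normalized = sqrt_length_normalize([2, 0, 1, 1], num_tokens=4)
# Each entry is halved, since 1/sqrt(4) = 0.5
```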
So how do we proceed to actually learn a model that can classify documents?
We said before that the idea after we have taken care of the hashing is to use a linear classifier.
Shogun has many linear classifiers available, but for the moment we will consider SVMOcas and we will see how to use it:
In [ ]:
from shogun import SVMOcas
C = 0.1
epsilon = 0.01
svm = SVMOcas(C, hashed_feats, labels)
svm.put('epsilon', epsilon)
We have now created our svm. The parameter C specifies the regularization constant; its best value is usually chosen through a model selection process such as cross-validation.
Now, let's train our svm:
In [ ]:
_=svm.train()
When the execution finishes, we will have learned our desired linear model! Note that for large collections the above call can take hours.
Now that we have a model ready, we would like to see how well it does. Let's see what it predicts for our training data:
In [ ]:
predicted_labels = svm.apply()
print (predicted_labels.get_labels())
We can see that it misclassified the first document. This has to do with the nature of our overly-simplified toy dataset which doesn't provide enough information. However, another option of the HashedDocDotFeatures class will allow us to extract some more information from the same dataset!
Suppose a document x contains the tokens A, B and C with the following counts:

| | A | B | C |
|---|---|---|---|
| x: | 2 | 3 | 1 |

By also considering every pairwise combination of tokens (quadratic features), we obtain:

| | A | B | C | AA | AB | AC | BB | BC | CC |
|---|---|---|---|---|---|---|---|---|---|
| x: | 2 | 3 | 1 | 2*2=4 | 2*3=6 | 2*1=2 | 3*3=9 | 3*1=3 | 1*1=1 |
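The quadratic expansion illustrated above can be sketched in a few lines: keep the linear terms and append all unordered pairwise products of the original features.

```python
from itertools import combinations_with_replacement

def quadratic_features(values):
    """Linear terms plus all unordered pairwise products."""
    quads = [a * b for a, b in combinations_with_replacement(values, 2)]
    return list(values) + quads

quadratic_features([2, 3, 1])
# [2, 3, 1, 4, 6, 2, 9, 3, 1]
```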
In order to use the k-skip n-gram approach explained above, one only has to pass their choice of k and n to the HashedDocDotFeatures object being created.
So, by keeping our previous choice of parameters intact and introducing k and n, we obtain the following piece of code:
In [ ]:
k = 3 # number of tokens, up to which we allow it to skip
n = 3 # number of tokens, up to which we allow it to combine
hashed_feats_quad = HashedDocDotFeatures(hash_size, string_features, tokenizer, normalize, n, k)
If we do not specify these numbers, as we did not do before, then (you may have guessed it!) they default to n=1, k=0.
Let's see if there was any improvement in our predicting capabilities.
By keeping the same svm settings, we will train on the new HashedDocDotFeatures object we have just created. Although we have the same document collection underneath, we are now considering a different representation model because of the quadratic features.
In [ ]:
svm.set_features(hashed_feats_quad)
svm.train()
predicted_labels = svm.apply()
print(predicted_labels.get_labels())
Better!
Attention! Some clarifications on the quadratic features follow, to clear up any misunderstandings:
The example demonstrated above was very small and simple, and was manipulated for demonstration purposes to show the capabilities of the HashedDocDotFeatures class.
In [ ]:
from pylab import *
# HashedDocDotFeatures results
hashed_training_examples = [5000, 10000, 15000, 20000, 25000, 30000, 50000, 100000]
# For C=1
hashed_C_1_sec = [2682.750000,5202.690000,8120.460000,10846.410000,13944.200000,17016.840000,30496.720000,66302.950000]
hashed_C_1_roc = [0.980730,0.986382,0.988894,0.990666,0.991602,0.991957,0.993680,0.995184]
# For C=0.1
hashed_C_01_sec = [1074.130000,2142.390000,3434.710000,4641.380000,5984.530000,7206.040000,12864.270000,28393.540000]
hashed_C_01_roc = [0.976560,0.982660,0.985251,0.987380,0.988368,0.989022,0.990950,0.993197]
# Spectrum kernel results
kernel_training_examples = [5000, 10000, 15000, 20000, 25000]
# For C=1
kernel_C_1_sec = [2912.410000,6543.220000,10840.550000,16108.360000,19899.610000]
kernel_C_1_roc = [0.971284,0.976628,0.979715,0.982084,0.984355]
# For C=0.1
kernel_C_01_sec = [1441.380000,3261.870000,5071.040000,7568.130000,10436.430000]
kernel_C_01_roc = [0.946308,0.955245,0.961576,0.965204,0.968264]
figure(figsize=(12,6))
subplot(1,2,1)
plot(hashed_training_examples, hashed_C_1_sec, 'b')
plot(kernel_training_examples, kernel_C_1_sec, 'r')
title("Time comparison for C=1")
xlabel("Number of examples")
ylabel("Time in seconds")
legend(["HashedDocDotFeatures", "Spectrum Kernel"], loc=2)
subplot(1,2,2)
plot(hashed_training_examples, hashed_C_1_roc, 'b')
plot(kernel_training_examples, kernel_C_1_roc, 'r')
title("Area under ROC comparison for C=1")
xlabel("Number of examples")
ylabel("auROC")
_=legend(["HashedDocDotFeatures", "Spectrum Kernel"], loc=4)
In [ ]:
clf()
figure(figsize=(12,6))
subplot(1,2,1)
plot(hashed_training_examples, hashed_C_01_sec, 'b')
plot(kernel_training_examples, kernel_C_01_sec, 'r')
title("Time comparison for C=0.1")
xlabel("Number of examples")
ylabel("Time in seconds")
ylim((0,70000))
legend(["HashedDocDotFeatures", "Spectrum Kernel"], loc=2)
subplot(1,2,2)
plot(hashed_training_examples, hashed_C_01_roc, 'b')
plot(kernel_training_examples, kernel_C_01_roc, 'r')
title("Area under ROC comparison for C=0.1")
xlabel("Number of examples")
ylabel("auROC")
_=legend(["HashedDocDotFeatures", "Spectrum Kernel"], loc=4)
The hashing trick has also been implemented in Shogun for the DenseFeatures and SparseFeatures classes, as HashedDenseFeatures and HashedSparseFeatures classes. These classes also support quadratic features and provide the option to maintain or drop the linear terms.
For online algorithms, or for cases when the data do not fit in memory, there exist similar classes with the prefix "Streaming" that support reading examples from disk and providing them to the algorithms one at a time. Specifically, these classes are StreamingHashedDocDotFeatures, StreamingHashedDenseFeatures and StreamingHashedSparseFeatures. If one has mixed features, that are not just numerical or not just text, then the CombinedDotFeatures class can be used to combine objects of the aforementioned classes!
Another option is to use the Vw* algorithms and the VwExample class, which require the input to be in VW format and are a bit trickier to use.
In addition, a HashedDocConverter class exists that allows the user to get the hashed BoW representation of a document collection as a SparseFeatures object.