Vector Space Model

We are interested in using this data to build statistical models, so we now need to vectorize it: the goal is to find a way to represent the data numerically so that the computer can work with it.

Bag of words

A bag of words represents each document in a corpus as a vector of features. Most commonly, the features are the unique words in the vocabulary of the entire corpus, and the value of each feature is the number of times that word appears in the document, i.e. its term frequency.

A document $d$ is represented by a weight vector $v_d=[w_{1,d}, w_{2,d},\ldots, w_{N,d}]$, where $w_{t,d} = tf_{t,d}$, the term frequency of word $t$ in document $d$, and $N$ is the size of the vocabulary.

A corpus is then represented as a matrix with one row per document and one column per unique word.
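
To make this concrete, here is a minimal sketch (plain Python, no scikit-learn yet, and not part of the exercises below) that builds such a count matrix for a toy two-document corpus:


In [ ]:
from collections import Counter

docs = ["the dog sat on the mat", "the cat sat on the mat"]

# Vocabulary: all unique words across the corpus, in a fixed order
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per word, values are term frequencies
count_matrix = [[Counter(doc.split())[word] for word in vocabulary] for doc in docs]

print(vocabulary)          # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
for row in count_matrix:
    print(row)             # [0, 1, 1, 1, 1, 2] and [1, 0, 1, 1, 1, 2]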

Scikit-Learn

Scikit-learn is a machine learning library for the Python programming language. It features a wide range of machine learning algorithms for classification, regression and clustering. It also provides various supporting utilities such as cross-validation and text vectorizers. Scikit-learn is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

Simple to use: import the required class, create an instance, and call its methods.
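
For example, here is a minimal, self-contained sketch (using StandardScaler, an estimator unrelated to the exercises below) just to show the import/instantiate/call pattern:


In [ ]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()                               # create an instance of the estimator
scaled = scaler.fit_transform([[1.0], [2.0], [3.0]])    # fit it to data and transform the data
print(scaled.ravel())                                   # standardized values, roughly [-1.22, 0, 1.22]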

Vectorizer

To build our initial bag-of-words count matrix, we will use scikit-learn's CountVectorizer class to transform our corpus into a bag-of-words representation. CountVectorizer expects as input a list of raw strings containing the documents in the corpus. It takes care of tokenization, lowercasing, filtering stop words, building the vocabulary, etc. It also tabulates occurrence counts per document for each feature.


In [ ]:
import numpy as np

# Write code to import CountVectorizer
from ...

raw_docs_sample = ["The dog sat on the mat.", 
                   "The cat sat on the mat!",
                   "We have a mat in our house."]

# Write code to create a CountVectorizer
# Hint: use "stop_word" argument to specify English stop words
vectorizer = ...

# Write code to vectorize the sample text
X_sample = ...

X_sample

Sparse vs. Dense Matrices

Dense matrices store every entry in the matrix, while sparse matrices store only the nonzero entries. Sparse matrices do not offer many extra features, and some algorithms may not work with them. You use them when you need to work with matrices that would be too big for the computer to handle in dense form, but that are mostly zeros and so compress well. Be aware of issues that may arise with:

  • dot product
  • slicing (row, column)

In Python these are taken care of almost automatically, by using the sparse dot product and the CSR and CSC matrix implementations (scipy.sparse.csr_matrix, scipy.sparse.csc_matrix, etc.).
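
Here is a minimal sketch of the idea (assuming SciPy is installed; the matrix and its values are made up for illustration):


In [ ]:
import numpy as np
from scipy.sparse import csr_matrix

# A small, mostly-zero matrix stored densely...
dense = np.array([[0, 0, 3],
                  [4, 0, 0],
                  [0, 0, 0]])

# ...and the same matrix in compressed sparse row (CSR) format
sparse = csr_matrix(dense)
print(sparse.nnz, "nonzero entries stored out of", dense.size)

# The dot product stays sparse-aware and returns a sparse result
product = sparse.dot(sparse.T)
print(product.todense())

# Row slicing is cheap in CSR format; for cheap column slicing use CSC
print(sparse[0].todense())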


In [ ]:
print("Count Matrix:")
print(X_sample.todense())
print("\nWords in vocabulary:")
print(vectorizer.get_feature_names())

TF-IDF Weighting Scheme

The tf-idf weighting scheme is an improvement over the simple term count or term frequency scheme we just saw. It is frequently used in text mining applications and has been shown to be effective. It combines two term statistics:

  1. Local component: the term count or term frequency (tf) reflects how important a word is to a document locally.
  2. Global component: the inverse document frequency (idf) of a word reflects how important the word is to the entire corpus or collection of documents. The document frequency (df) of a word is the number of documents in the corpus in which the word appears. A term with a higher $df$ is a common term and thus carries less importance. $idf$ is an inverse function of $df$, so a higher $idf$ means the term is more important globally.

The weight vector for document $d$ under the tf-idf scheme is $v_d=[w_{1,d}, w_{2,d},\ldots, w_{N,d}]$ where $w_{t,d}=tf_{t,d}\times\log\frac{|D|}{|\{d'\in D : t\in d'\}| + 1}$, with $|D|$ the total number of documents in the corpus and $|\{d'\in D : t\in d'\}|$ the document frequency of term $t$. We add 1 in the denominator to avoid division by zero; this is called smoothing.
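
To make the weighting concrete, here is a minimal sketch that implements this exact formula by hand for a toy corpus similar to the sample above (illustration only; below we let scikit-learn do the work, and its formula differs slightly):


In [ ]:
import math
from collections import Counter

docs = [doc.split() for doc in ["the dog sat on the mat",
                                "the cat sat on the mat",
                                "we have a mat in our house"]]

# Document frequency: in how many documents does each term appear?
vocabulary = sorted({word for doc in docs for word in doc})
df = {word: sum(word in doc for doc in docs) for word in vocabulary}

# tf-idf weight of each term in each document, using the formula above
n_docs = len(docs)
tfidf_matrix = []
for doc in docs:
    tf = Counter(doc)
    tfidf_matrix.append([tf[word] * math.log(n_docs / (df[word] + 1))
                         for word in vocabulary])

print(vocabulary)
for row in tfidf_matrix:
    print([round(w, 2) for w in row])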

Scikit-learn has your back: it already provides the TfidfVectorizer class to compute the TF-IDF matrix.

Note: Scikit-learn uses a slightly different formula from the one shown above. You can refer to the corresponding documentation to know more.
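
For reference, at the time of writing TfidfVectorizer with its default settings (smooth_idf=True, norm='l2') uses $idf_t = \ln\frac{1 + |D|}{1 + df_t} + 1$ and then normalizes each document vector to unit Euclidean length, so its numbers will not match a hand computation with the formula above.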


In [ ]:
# Write code to import TfidfVectorizer


# Write code to create a TfidfVectorizer
# Hint: use "stop_word" argument to specify English stop words
tfidf = ...

# Write code to vectorize the sample text
X_tfidf_sample = ...

print("TF-IDF Matrix:\n")
print(X_tfidf_sample.todense())

A Bigger Collection

We will use the DBpedia Ontology Classification Dataset. It includes the first paragraphs of Wikipedia articles. Each paragraph is assigned one of 14 categories. Here is an example of an abstract under the Written Work category:

The Regime: Evil Advances/Before They Were Left Behind is the second prequel novel in the Left Behind series written by Tim LaHaye and Jerry B. Jenkins. It was released on Tuesday November 15 2005. This book covers more events leading up to the first novel Left Behind. It takes place from 9 years to 14 months before the Rapture.

In this hands-on we will use 15,000 documents belonging to three categories, namely Album, Film and Written Work.

The file corpus.txt supplied here contains 15,000 documents; each line of the file is a document.

Now we will:

  1. Load the documents as a list
  2. Create TF-IDF vectors

Note: Each line of the file corpus.txt is a document.


In [ ]:
# Write code to load documents as a list
# Hint: recall the strategy you used in the previous notebook
raw_docs = ...

print("Loaded " + str(len(raw_docs)) + " documents.")

In [ ]:
# Write code to convert raw documents into TF-IDF matrix.
"""
Hint: - create a TfidfVectorizer, and do not forget to remove stopwords
      - use fit_transform to vectorize raw_docs
"""
tfidf = ...
X_tfidf = ...

Text Classifier

Machine learning algorithms need a training set. In our text classification scenario, we need category or class labels for all 15,000 documents in the collection.

In our collection we have documents from three categories: "Album" (category 12), "Film" (category 13) and "Written Work" (category 14). For each document we know the label. The labels are stored in the labels.txt file; each line of corpus.txt corresponds to the label on the same line of labels.txt.

Let's load the labels.

Note: Each line of the file labels.txt is a label.

Verify that the number of loaded labels and the number of loaded documents are the same.


In [1]:
# Write code to load labels list from file 'labels.txt'
# Hint: use the same strategy you used to load documents
labels = ...

print("Loaded " + str(len(labels)) + " labels.")


Loaded 15000 labels.

Note: When labels are read from a text file, Python by default interprets each line as a string. For computations we require integer labels, so these strings are converted to integers after they are read from the file.


In [ ]:
# Replace string labels with numerical ones
y = np.array([int(label) for label in labels])

Training and Testing

We wish to first train a model and then see how well it performs. So the norm is to divide the data into two parts:

  1. Training set: Documents along with their class labels are used to train the model.
  2. Test set: Documents are used for predicting class labels using the trained classifier. However, the class labels of this set are kept hidden and are only revealed during evaluation of the trained model, not before that.

Note: Here we are splitting the data ourselves. For many datasets, training and test sets are provided separately.


In [ ]:
# package to split training and testing data
from sklearn.model_selection import train_test_split

# split the data into training (80%) and testing (20%)
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=256)

print("Training set: " + str(X_train.shape[0]) + " documents.")
print("Test set: " + str(X_test.shape[0]) + " documents.")

Training the Classifier

First we will use the Multinomial Naive Bayes classifier. Scikit-learn provides it as the MultinomialNB class.


In [ ]:
# Write code to import MultinomialNB
from ...

# Write code to create a MultinomialNB classifier
classifierNB = ...

# Write code to train the classifier using "fit" function.
# Hint: you need to provide training data and labels for training

Testing

Now we will test the trained classifier on our held-out test data to see how well it does.

Here we will look at the accuracy of the model, one of the simplest evaluation measures used in machine learning: $$accuracy = \frac{\text{number of correctly classified examples}}{\text{total number of examples}}$$

There are more informative and complex evaluation measures, e.g. precision, recall, F-measure, etc.

Note: It is customary to report accuracy in percentage. So we convert the ratio into percentage.
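
For example, if the classifier labels 2,850 of 3,000 test documents correctly (hypothetical numbers for illustration), then $accuracy = \frac{2850}{3000} = 0.95$, which we report as 95%.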

Again, Scikit-learn already provides the accuracy_score function for calculating accuracy.


In [ ]:
# Write code to import accuracy_score method
from ...

# Write code to predict labels for the test set using the classifier you trained
# Hint: use the "predict" method of the classifier
predictionsNB = ...

# Write code to calculate accuracy using "accuracy_score" method
# Hint: you have to provide test labels and predicted labels to measure accuracy
accuracyNB = ...

print("Test accuracy: " + str(accuracyNB * 100) + "%")

Other Classifiers

You have a virtually endless choice of classifiers. Let's try some more.

The method is simple (a generic sketch illustrating these steps follows the list):

  1. import relevant packages
  2. create an instance of the classifier
  3. fit with the training data and training labels
  4. predict with the test data
  5. evaluate by comparing predicted labels and test labels
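
As a sketch of this recipe (using LogisticRegression, a classifier not covered by the exercises below, and assuming X_train, y_train, X_test and y_test from the cells above are in scope):


In [ ]:
# 1. import the relevant packages
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 2. create an instance of the classifier
classifierLR = LogisticRegression(max_iter=1000)

# 3. fit with the training data and training labels
classifierLR.fit(X_train, y_train)

# 4. predict with the test data
predictionsLR = classifierLR.predict(X_test)

# 5. evaluate by comparing predicted labels and test labels
accuracyLR = accuracy_score(y_test, predictionsLR)
print("Test accuracy: " + str(accuracyLR * 100) + "%")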

Perceptron


In [ ]:
# 1. Write code to import Perceptron
from ...

# 2. Write code to create Perceptron classifier
classifierPer = ...

# 3. Write code to "fit" the classifier with training data and labels
...

# 4. Write code to "predict" labels for the test set
predictionsPer = ...

# 5. Write code to calculate accuracy
accuracyPer = ...

print("Test accuracy: " + str(accuracyPer * 100) + "%")

Random Forest Classifier


In [ ]:
# 1. Write code to import RandomForestClassifier
from ...

# 2. Write code to create Random Forest classifier
classifierRF = ...

# 3. Write code to "fit" the classifier with training data and labels


# 4. Write code to "predict" labels for the test set
predictionsRF = ...

# 5. Write code to report accuracy
accuracyRF = ...

print("Test accuracy: " + str(accuracyRF * 100) + "%")

In [ ]: