Document Classification Tutorial 1

(C) 2019 by Damir Cavar

Amazon Reviews

We will use the data provided at this site. This is a collection of 3.6 million Amazon text reviews and labels. The data is formatted in the fastText corpus format, that is, each line consists of a label followed by the review text.

__label__2 Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^

We load the data set and split each line into its label and text:


In [1]:
data = open('data/corpus').read()
labels, texts = [], []

for line in data.split("\n"):
    if not line.strip():
        continue  # skip empty lines, e.g. a trailing newline at the end of the file
    content = line.split(' ', 1)  # split off the label prefix from the review text
    labels.append(content[0])
    texts.append(content[1])
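
As a quick sanity check, the two lists should end up with the same length:

print(len(labels), len(texts))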

In [2]:
print(texts[:3])


['Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^', "The best soundtrack ever to anything.: I'm reading a lot of reviews saying that this is the best 'game soundtrack' and I figured that I'd write a review to disagree a bit. This in my opinino is Yasunori Mitsuda's ultimate masterpiece. The music is timeless and I'm been listening to it for years now and its beauty simply refuses to fade.The price tag on this is pretty staggering I must say, but if you are going to buy any cd for this much money, this is the only one that I feel would be worth every penny.", 'Amazing!: This soundtrack is my favorite music of all time, hands down. The intense sadness of "Prisoners of Fate" (which means all the more if you\'ve played the game) and the hope in "A Distant Promise" and "Girl who Stole the Star" have been an important inspiration to me personally throughout my teen years. The higher energy tracks like "Chrono Cross ~ Time\'s Scar~", "Time of the Dreamwatch", and "Chronomantique" (indefinably remeniscent of Chrono Trigger) are all absolutely superb as well.This soundtrack is amazing music, probably the best of this composer\'s work (I haven\'t heard the Xenogears soundtrack, so I can\'t say for sure), and even if you\'ve never played the game, it would be worth twice the price to buy it.I wish I could give it 6 stars.']

We will use Pandas to store the labels and texts in a DataFrame. We import Pandas:


In [3]:
import pandas

Packing the data into a Pandas DataFrame:


In [4]:
corpus = pandas.DataFrame()
corpus['text'] = texts
corpus['label'] = labels

From scikit-learn we will import model_selection. This module contains the function train_test_split, which splits arrays or matrices into random train and test subsets. See the documentation page for more details.


In [5]:
from sklearn import model_selection

We will hold out a third of the data set for testing. Since we do not specify a random_state, the function falls back on np.random, so the split will differ from run to run.


In [6]:
train_text, test_text, train_label, test_label = model_selection.train_test_split(corpus['text'],
                                                                                  corpus['label'],
                                                                                  test_size=0.33)
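
For a reproducible split, one could instead pass a fixed seed via random_state; the value 42 below is an arbitrary example:

train_text, test_text, train_label, test_label = model_selection.train_test_split(
    corpus['text'], corpus['label'], test_size=0.33, random_state=42)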

In [7]:
print(train_text[:2])
print(test_text[:2])


7985    Brilliant album but not the best of Celine: Ex...
7760    Stargate Continuum: So glad I purchased the mo...
Name: text, dtype: object
9898    THIS SUCKS: THIS IS THE STUPIDIEST MOVIE EVER ...
2748    Great produt easy to install: great product to...
Name: text, dtype: object

We use the preprocessing module from scikit-learn. Its LabelEncoder normalizes the labels so that they contain only values between 0 and n_classes-1. See the documentation page for more details.


In [8]:
from sklearn import preprocessing

encoder = preprocessing.LabelEncoder()

We encode the labels for the training and test set. First, let us look at the raw labels of the test set:


In [9]:
print(test_label[:10])


9898    __label__1
2748    __label__2
5025    __label__2
5453    __label__1
3782    __label__1
8702    __label__2
8124    __label__2
7988    __label__1
9393    __label__2
6324    __label__2
Name: label, dtype: object

In [10]:
train_label = encoder.fit_transform(train_label)
test_label = encoder.transform(test_label)  # reuse the mapping learned from the training labels

In [11]:
print(test_label[:10])


[0 1 1 0 0 1 1 0 1 1]
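
The mapping between original and encoded labels is stored in the encoder's classes_ attribute; LabelEncoder sorts the classes, so __label__1 is encoded as 0 and __label__2 as 1, which matches the raw labels printed above:

print(encoder.classes_)  # ['__label__1' '__label__2'] — indices 0 and 1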

Feature Engineering

To engineer a classifier, we will select different types of features. We start with count vectors: each row represents a document from the corpus, each column represents a word from the corpus vocabulary, and each cell holds the frequency of a particular token (column) in a particular document (row). We import the CountVectorizer class from scikit-learn's feature_extraction.text module:


In [12]:
from sklearn.feature_extraction.text import CountVectorizer

The CountVectorizer builds features from word n-grams, as specified by analyzer='word'. The token_pattern parameter is a regular expression defining what constitutes a token; it is only used if analyzer == 'word'. The expression here treats every sequence of one or more word characters as a token. See the documentation page for more details.
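
As a small illustration, we can apply the same pattern with Python's re module to see which tokens it extracts; note that \w also matches the underscore:

import re

print(re.findall(r'\w{1,}', "stuning even for the non-gamer ^_^"))
# ['stuning', 'even', 'for', 'the', 'non', 'gamer', '_']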


In [13]:
vectorizer = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')

The fit method applied to the vectorizer object learns a vocabulary dictionary of all tokens in the raw texts.


In [14]:
vectorizer.fit(corpus['text'])


Out[14]:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='\\w{1,}', tokenizer=None,
        vocabulary=None)
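
To make the document-term structure concrete, here is a small sketch on a hypothetical two-document corpus; each row of the resulting matrix is a document, each column a vocabulary entry:

toy = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
toy_matrix = toy.fit_transform(["the cat sat", "the cat sat on the mat"])
print(sorted(toy.vocabulary_, key=toy.vocabulary_.get))  # column order
print(toy_matrix.toarray())
# ['cat', 'mat', 'on', 'sat', 'the']
# [[1 0 0 1 1]
#  [1 1 1 1 2]]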

We will now transform the training and test data using the vectorizer object:


In [15]:
train_text_count = vectorizer.transform(train_text)
test_text_count = vectorizer.transform(test_text)
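
The result of transform is a sparse matrix with one row per document and one column per vocabulary entry, which we can verify via its shape attribute:

print(train_text_count.shape)
print(test_text_count.shape)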

We will use the linear_model module from scikit-learn:


In [16]:
from sklearn import linear_model

We can now apply logistic regression to the transformed data and print the resulting accuracy. We create an instance of a LogisticRegression classifier using the liblinear algorithm as the optimization solver, train the model, and generate predictions for the test data. See the documentation page for more details.


In [17]:
classifier = linear_model.LogisticRegression(solver='liblinear')
classifier.fit(train_text_count, train_label)
predictions = classifier.predict(test_text_count)
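
The trained model can also classify unseen text; the review below is a made-up example. It has to be vectorized with the same fitted vectorizer before prediction:

new_review = ["This album is fantastic, I would recommend it to anyone!"]
new_counts = vectorizer.transform(new_review)
print(classifier.predict(new_counts))        # encoded label: 0 or 1
print(classifier.predict_proba(new_counts))  # class probabilities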

We will use the metrics module in scikit-learn to compute the accuracy score:


In [18]:
from sklearn import metrics

To compute the accuracy score, we pass the true labels of the test set and the predicted labels to the accuracy_score function in the metrics module.


In [19]:
accuracy = metrics.accuracy_score(test_label, predictions)  # (y_true, y_pred)
print("LR, Count Vectors: ", accuracy)


LR, Count Vectors:  0.8512121212121212

In this case, logistic regression as a classifier on the word count vectors reaches roughly 85% accuracy.
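
Accuracy alone can hide per-class differences; as an optional extra step, scikit-learn's classification_report gives precision, recall, and F1 for each class:

print(metrics.classification_report(test_label, predictions))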

