(C) 2019 by Damir Cavar
See the source of this tutorial for more details: https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/
We will use the data provided at this site. This is a collection of 3.6 million Amazon text reviews and labels. The data is formatted using the FastText corpus format, that is, each file contains lines with a label followed by the text.
__label__2 Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^
We load the data set:
In [1]:
data = open('data/corpus', encoding='utf-8').read()
labels, texts = [], []
for line in data.split("\n"):
    content = line.split(' ', 1)
    if len(content) < 2:
        continue  # skip empty or malformed lines, e.g. after the trailing newline
    labels.append(content[0])
    texts.append(content[1])
In [2]:
print(texts[:3])
We will use Pandas to store the labels and texts in a DataFrame. We import Pandas:
In [3]:
import pandas
Packing the data into a Pandas DataFrame:
In [4]:
corpus = pandas.DataFrame()
corpus['text'] = texts
corpus['label'] = labels
From scikit-learn we will import model_selection. This module contains the function train_test_split, which splits arrays or matrices into random train and test subsets. See the documentation page for more details.
In [5]:
from sklearn import model_selection
We will select a third of the data set for testing. Since random_state is left at its default of None, this call uses np.random for shuffling, so the split differs between runs.
In [6]:
train_text, test_text, train_label, test_label = model_selection.train_test_split(corpus['text'],
corpus['label'],
test_size=0.33)
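The behavior of random_state can be sketched on toy data (the lists below are illustrative, not from the corpus): fixing random_state makes the split reproducible, whereas the call above, which omits it, produces a different split on every run.

```python
from sklearn.model_selection import train_test_split

# Six toy documents and their labels; random_state=42 is an arbitrary
# fixed seed chosen for illustration.
docs = ["d0", "d1", "d2", "d3", "d4", "d5"]
tags = ["a", "b", "a", "b", "a", "b"]

tr_x, te_x, tr_y, te_y = train_test_split(docs, tags,
                                          test_size=0.33,
                                          random_state=42)
print(len(tr_x), len(te_x))  # 4 training items, 2 test items
```

Running the call twice with the same random_state yields exactly the same partition.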
In [7]:
print(train_text[:2])
print(test_text[:2])
We use the scikit-learn module for preprocessing. We will use the LabelEncoder in the preprocessing module to normalize the labels so that they contain only values between 0 and n_classes-1. See the documentation page for more details.
In [8]:
from sklearn import preprocessing
encoder = preprocessing.LabelEncoder()
We encode the labels for the training and test set:
In [9]:
print(test_label[:10])
In [10]:
train_label = encoder.fit_transform(train_label)
# use transform (not fit_transform) so the test labels share the mapping learned on the training labels
test_label = encoder.transform(test_label)
In [11]:
print(test_label[:10])
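The encoder's mapping can be seen on a toy example (the label strings below mimic the FastText format of this corpus):

```python
from sklearn.preprocessing import LabelEncoder

# LabelEncoder sorts the distinct labels and assigns each an integer
# between 0 and n_classes-1; inverse_transform recovers the strings.
enc = LabelEncoder()
codes = enc.fit_transform(["__label__2", "__label__1", "__label__2"])

print(codes)          # [1 0 1]: classes are sorted, so __label__1 -> 0
print(enc.classes_)   # ['__label__1' '__label__2']
print(enc.inverse_transform(codes))
```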
To engineer a classifier, we will select different types of features. We will start with count vectors as features. In count vectors, each row represents a document from the corpus and each column represents a word from the corpus vocabulary. Each cell contains the frequency of a particular token (column) in the document (row). We will import the CountVectorizer from scikit-learn's feature_extraction.text collection:
In [12]:
from sklearn.feature_extraction.text import CountVectorizer
With analyzer='word', the CountVectorizer builds features from word tokens. The token_pattern parameter is a regular expression denoting what constitutes a token; it is only used if analyzer == 'word'. The regular expression r'\w{1,}' here treats any sequence of one or more word characters as a token, so single-character words are kept as well. See the documentation page for more details.
In [13]:
vectorizer = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
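The effect of this token_pattern can be sketched on two illustrative documents (not from the corpus); note the single-character token "a", which scikit-learn's default pattern r'(?u)\b\w\w+\b' would discard:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["a cat sat", "the cat sat on the cat"]

# Same settings as the tutorial's vectorizer, fitted on the toy documents.
v = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
matrix = v.fit_transform(docs)

print(sorted(v.vocabulary_))  # ['a', 'cat', 'on', 'sat', 'the']
print(matrix.toarray())       # rows = documents, columns = token counts
```

The second row counts "cat" twice and "the" twice, exactly the document-term layout described above.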
The fit method applied to the vectorizer object learns a vocabulary dictionary of all tokens in the raw texts.
In [14]:
vectorizer.fit(corpus['text'])
We will now transform the training and test data using the vectorizer object:
In [15]:
train_text_count = vectorizer.transform(train_text)
test_text_count = vectorizer.transform(test_text)
We will use the scikit-learn module for linear models:
In [16]:
from sklearn import linear_model
We can now apply logistic regression to the transformed data and print the resulting accuracy. We create an instance of a logistic regression classifier using liblinear as the optimization solver. We train the model and generate predictions for the test data. See the documentation page for more details.
In [17]:
classifier = linear_model.LogisticRegression(solver='liblinear')
classifier.fit(train_text_count, train_label)
predictions = classifier.predict(test_text_count)
We will use the metrics module in scikit-learn to compute the accuracy score:
In [18]:
from sklearn import metrics
To compute the accuracy score, we pass the true labels of the test data set and the predicted labels to the accuracy_score function in the metrics module (the documented argument order is y_true, y_pred; for accuracy the result is the same either way).
In [19]:
accuracy = metrics.accuracy_score(test_label, predictions)
print("LR, Count Vectors: ", accuracy)
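What accuracy_score computes can be seen on a minimal example with made-up label sequences: it is simply the fraction of positions where the two sequences agree.

```python
from sklearn.metrics import accuracy_score

# Four toy labels; three of the four positions match.
score = accuracy_score([0, 1, 1, 0], [0, 1, 0, 0])
print(score)  # 3 of 4 labels match -> 0.75
```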
In this case, logistic regression as a classifier on the word count vectors achieves more than 84% accuracy.
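Since count vectors are only the first of the feature types mentioned above, a natural next step is TF-IDF features with the same classifier. The sketch below runs on made-up toy reviews, not on the corpus; for the real data, substitute train_text/test_text and the encoded labels from the cells above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy reviews with sentiment labels (1 = positive, 0 = negative).
train_docs = ["good great film", "bad awful film", "great acting", "awful plot"]
train_y = [1, 0, 1, 0]

# Same tokenization settings as the CountVectorizer above, but cells hold
# TF-IDF weights instead of raw counts.
tfidf = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}')
X_train = tfidf.fit_transform(train_docs)

clf = LogisticRegression(solver='liblinear')
clf.fit(X_train, train_y)

# Transform (not fit_transform) the unseen documents with the fitted vectorizer.
X_test = tfidf.transform(["great film", "awful film"])
predictions_toy = clf.predict(X_test)
print(predictions_toy)
```

TF-IDF down-weights tokens that occur in many documents, which often helps on review corpora where frequent function words carry little signal.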