scikit-learn for NLP -- Part 1: Introductory Tutorial

scikit-learn is an open-source Python library that provides simple and efficient tools for data mining, data analysis and machine learning. It is built on NumPy, SciPy and matplotlib. There are built-in classification, regression, and clustering models, as well as useful features like dimensionality reduction, model evaluation and preprocessing.

These tutorials are specifically tailored for NLP, i.e. working with text data. Part one covers the following topics: loading data, simple preprocessing, training (supervised, semi-supervised, and unsupervised), and evaluation. Part two focuses on feature engineering and covers more advanced topics such as feature extraction, building a pipeline, creating custom transformers, feature unions, and dimensionality reduction.

For this tutorial, we will use the 20 Newsgroups data set and perform topic classification. For the sake of time, I converted all the data into a CSV file.

Note: apparently Jupyter disables the spell-checker (or I just don't know how to enable it), so I'm only partially responsible for any typos in this tutorial.

1. Loading the Dataset

For the sake of convenience, we will use pandas to read the CSV file. (You may do so with numpy as well; there is a loadtxt() function, but you might encounter encoding issues when using it.) Before you begin, note that the dataset in the repo is a tarball, so run tar xfz 20news-18828.csv.tar.gz in the terminal to uncompress it.


In [ ]:
import pandas as pd

dataset = pd.read_csv('20news-18828.csv', header=None, delimiter=',', names=['label', 'text'])

Sanity check on the dataset.


In [ ]:
print("There are 20 categories: %s" % (len(dataset.label.unique()) == 20))
print("There are 18828 records: %s" % (len(dataset) == 18828))

Now we need to split it into a train set and a test set. To do so, we can use the train_test_split() function. In scikit-learn's convention, X denotes the data (yeah, uppercase X), and y denotes the ground truths (and yeah, lowercase y). It doesn't really matter, as long as you remember which is which.


In [ ]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(dataset.text, dataset.label, train_size=0.8)

2. A Simple Example

Before going too deep into preprocessing, feature extraction and other more complicated tasks, we will work through a relatively simple but complete example. In this example, we will use bag-of-words features and a Naive Bayes classifier to establish our baseline.

There are some built-in vectorizers, CountVectorizer and TfidfVectorizer, that we can use to vectorize our raw data, performing preprocessing and feature extraction along the way. First, we will experiment with CountVectorizer, which basically turns each token/n-gram into a feature and stores its count in the corresponding feature dimension. The fit_transform() function combines fit() and transform() in a more efficient implementation: fit() indexes the vocabulary/features, and transform() turns the dataset into a matrix.


In [ ]:
from sklearn.feature_extraction.text import CountVectorizer
# initialize a CountVectorizer
cv = CountVectorizer()
# fit the vectorizer on the raw train data and transform it into a sparse count matrix
X_train_counts = cv.fit_transform(X_train)
X_train_counts.shape

The same thing needs to be done for the test set, but here we only call transform(), which maps the test data onto the vocabulary learned from the train set (tokens not seen during training are simply ignored).


In [ ]:
X_test_counts = cv.transform(X_test)
X_test_counts.shape
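
To see what fit() actually indexed, you can peek at the vectorizer's learned vocabulary, a dict mapping each token to its column index in the matrix:


In [ ]:
# number of features, and a few (token, column index) pairs
print(len(cv.vocabulary_))
print(list(cv.vocabulary_.items())[:5])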

Then, we fit a Naive Bayes classifier on our features and labels, which trains a model (if you fit the same classifier again, it overwrites the parameters the model learned previously). After training, we can use it to make predictions.


In [ ]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_counts, y_train)
predicted = clf.predict(X_test_counts)

# sample some of the predictions against the ground truths 
for prediction, truth in zip(predicted[:10], y_test[:10]):
    print(prediction, truth)

Let's do some legit evaluation. The classification_report() function gives you precision, recall and F1 scores for each label, plus their averages. If you want to calculate overall macro-averaged, micro-averaged or weighted performance, you can use precision_recall_fscore_support(). Finally, confusion_matrix() shows you which labels the model confuses, but unfortunately, its output does not include the label names. I think that's why they call it a confusion matrix.


In [ ]:
from sklearn import metrics

print(metrics.classification_report(y_test, predicted, labels=dataset.label.unique()))

p, r, f1, _ = metrics.precision_recall_fscore_support(y_test, predicted, labels=dataset.label.unique(), average='micro')

print("Micro-averaged Performance:\nPrecision: {0}, Recall: {1}, F1: {2}".format(p, r, f1))

print(metrics.confusion_matrix(y_test, predicted, labels=dataset.label.unique()))
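
If you do want label names attached to the confusion matrix, one small workaround (a sketch using pandas, which we already imported) is to wrap it in a DataFrame:


In [ ]:
cm_labels = dataset.label.unique()
cm = metrics.confusion_matrix(y_test, predicted, labels=cm_labels)
# attach the label names as row/column headers
print(pd.DataFrame(cm, index=cm_labels, columns=cm_labels))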

3. Preprocessing & Feature Extraction

One may ask, "how do I remove stop words, tokenize the texts differently, or use bigrams/trigrams as features?" The answer is that you can do all of that with a CountVectorizer object by passing various arguments to its constructor.

Here are some of them. ngram_range takes a tuple (n_min, n_max); for example, (2, 2) means use only bigrams, and (1, 3) means use unigrams, bigrams, and trigrams. stop_words takes a list of stopwords that you'd like removed; to use scikit-learn's built-in English stopword list, pass the string 'english'. tokenizer takes a function that receives a string and returns a list of tokens, inside which you can define how to tokenize your text; by default, scikit-learn extracts tokens with the pattern r'(?u)\b\w\w+\b'. Finally, preprocessor takes a function whose argument is a string and whose output is a string, which you can use for more customized preprocessing. For more detail, please check out the documentation for CountVectorizer or TfidfVectorizer.
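
For example, here is a hypothetical whitespace tokenizer (just for illustration) plugged into a CountVectorizer:


In [ ]:
def whitespace_tokenize(s):
    # a deliberately naive tokenizer: lowercase the text and split on whitespace
    return s.lower().split()

# when a tokenizer is supplied, the default token_pattern is ignored
cv_ws = CountVectorizer(tokenizer=whitespace_tokenize)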

Let's start by defining a preprocessor that normalizes all numeric values, i.e. replaces numbers with the string NUM. Then, we construct a new CountVectorizer that uses unigrams, bigrams, and trigrams as features and removes stop words.


In [ ]:
import re
def normalize_numbers(s):
    return re.sub(r'\b\d+\b', 'NUM', s)

cv = CountVectorizer(preprocessor=normalize_numbers, ngram_range=(1,3), stop_words='english')
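
A quick sanity check (on made-up strings) that the preprocessor behaves as expected:


In [ ]:
print(normalize_numbers("I scored 42 out of 100"))  # -> I scored NUM out of NUM
print(normalize_numbers("version 2.0"))             # -> version NUM.NUM (decimals split on '.')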

Let's fit and transform the train data and transform the test data. Keep in mind that the speed of preprocessing and feature extraction depends on the running time of each step. For example, a naive implementation of stopword removal runs in O(N * M), where N is the number of tokens in the document and M is the size of the stopword list (a hash-set lookup brings this down to O(N)).


In [ ]:
# fit the vectorizer on the raw train data, then transform both train and test sets
X_train_counts = cv.fit_transform(X_train)
X_test_counts = cv.transform(X_test)

Let's use the Naive Bayes classifier to train a new model and see if it works better. In the last section, without preprocessing or feature engineering, our precision, recall and F1 were in the mid 80s; now each score is around 90.


In [ ]:
clf = MultinomialNB().fit(X_train_counts, y_train)
predicted = clf.predict(X_test_counts)
print(metrics.classification_report(y_test, predicted, labels=dataset.label.unique()))

Do you remember that there are other vectorizers you can use? Voilà, one of them is TfidfVectorizer. What is tf-idf? It's on Wikipedia. It basically reflects how important a word/phrase is to a document in a corpus. The constructor of TfidfVectorizer takes the same parameters as that of CountVectorizer, so you can perform the same preprocessing/feature extraction. Try running the following block of code and see whether using tf-idf improves the performance. There are some other parameters in the constructor that you can tweak when initializing the object, and they can affect the performance as well.
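
For reference, with scikit-learn's default settings (smooth_idf=True), the score for a term t in a document d is tf-idf(t, d) = tf(t, d) * idf(t), where tf(t, d) is the count of t in d and idf(t) = ln((1 + n) / (1 + df(t))) + 1, with n the total number of documents and df(t) the number of documents containing t. The resulting document vectors are then L2-normalized, so terms that appear everywhere are discounted and rarer, more distinctive terms are boosted.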


In [ ]:
from sklearn.feature_extraction.text import TfidfVectorizer

tv = TfidfVectorizer(preprocessor=normalize_numbers, ngram_range=(1,3), stop_words='english')
X_train_tf = tv.fit_transform(X_train)
X_test_tf = tv.transform(X_test)
clf2 = MultinomialNB().fit(X_train_tf, y_train)
predicted = clf2.predict(X_test_tf)
print(metrics.classification_report(y_test, predicted, labels=dataset.label.unique()))

Alternatively, if you like typing a longer block of code, you can use the TfidfTransformer to transform a word count matrix created by CountVectorizer into a tf-idf matrix.


In [ ]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
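
From here, the tf-idf matrices plug into the classifier exactly as before; since X_train_counts above was built with the same preprocessing arguments as the TfidfVectorizer, the scores should be comparable to the previous route:


In [ ]:
clf3 = MultinomialNB().fit(X_train_tfidf, y_train)
predicted = clf3.predict(X_test_tfidf)
print(metrics.classification_report(y_test, predicted, labels=dataset.label.unique()))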

4. Model Selection

Depending on the size of your data and the nature of your task, some classifiers may perform better than others. These days, Maximum Entropy is a very popular classifier for many machine learning tasks, and very often it is used to establish the baseline for a task. So let's try the logistic regression classifier in scikit-learn (Maxent and logistic regression are virtually the same thing). Please note that this classifier is a lot slower than Naive Bayes, because of math and whatever (JK: Maxent iteratively updates the weight of each feature over a number of iterations, whereas Naive Bayes estimates the probability of each feature in a single pass).


In [ ]:
from sklearn.linear_model import LogisticRegression

# for the sake of speed, we will just use a default CountVectorizer
cv = CountVectorizer()
X_train_counts = cv.fit_transform(X_train)
X_test_counts = cv.transform(X_test)

# note: the liblinear solver ignores n_jobs, so we leave it out
clf = LogisticRegression(solver='liblinear', max_iter=500)
clf.fit(X_train_counts, y_train)
predicted = clf.predict(X_test_counts)
print(metrics.classification_report(y_test, predicted, labels=dataset.label.unique()))

There are many other supervised, semi-supervised and unsupervised algorithms, and many of them work almost the same: (1) initialize a classifier, (2) fit the feature matrix and labels of the train set, and (3) pass in the feature matrix of test set to perform prediction.
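
For instance, swapping in a linear SVM (LinearSVC here, just one of many options; it may warn about convergence on raw counts) changes only the first step:


In [ ]:
from sklearn.svm import LinearSVC

svm = LinearSVC()                        # (1) initialize a classifier
svm.fit(X_train_counts, y_train)         # (2) fit on the train features and labels
predicted = svm.predict(X_test_counts)   # (3) predict on the test features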

The following is an example of a semi-supervised algorithm, label spreading, which uses both labeled and unlabeled data to train a classifier. It is especially nice when you don't have much annotated data.


In [ ]:
from sklearn.semi_supervised import LabelSpreading
from sklearn import preprocessing
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics

label_prop_model = LabelSpreading()
# randomly select some of the training data and use them as unlabeled data
random_unlabeled_points = np.where(np.random.randint(0, 2, size=len(y_train)))
labels = np.copy(y_train)
# convert all labels into indices, since we'll assign -1 to the unlabeled instances
le = preprocessing.LabelEncoder()
le.fit(labels)
# transforms labels to numeric values
labels = le.transform(labels)
labels[random_unlabeled_points] = -1
labels_test = le.transform(y_test)

# here we need to limit the feature space, because we are going to convert the
# sparse matrices into the dense ones required by this implementation of
# LabelSpreading. Dense matrices generally slow down training and take up a lot
# of memory, so in most cases this is not recommended.
cv = CountVectorizer(max_features=100)
X_train_counts = cv.fit_transform(X_train)
X_test_counts = cv.transform(X_test)
# toarray() here is to convert a sparse matrix into a dense matrix, as the latter is required
# by the algorithm. 
label_prop_model.fit(X_train_counts.toarray(), labels)
predicted = label_prop_model.predict(X_test_counts.toarray())

# For this experiment, we are only using the 100 most frequent words in the train
# set as features, so the test results will be lousy.
p, r, f1, _ = metrics.precision_recall_fscore_support(labels_test, predicted, average='macro')
print("Macro-averaged Performance:\nPrecision: {0}, Recall: {1}, F1: {2}".format(p, r, f1))

Finally, we are going to experiment with a clustering (unsupervised) algorithm, K-means. Generally speaking, it's one of the fastest and most popular clustering algorithms.


In [ ]:
from sklearn.cluster import KMeans
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics
from sklearn import preprocessing

X_train, X_test, y_train, y_test = train_test_split(dataset.text, dataset.label, train_size=0.8)
# I reduced the number of features here just to make training a little faster
cv = CountVectorizer(max_features=200)
X_train_counts = cv.fit_transform(X_train)
X_test_counts = cv.transform(X_test)

kmeans = KMeans(n_clusters=20, random_state=0)
kmeans.fit(X_train_counts)
# see the cluster label for each instance
print(kmeans.labels_)
print(kmeans.predict(X_test_counts))

# KMeans is unsupervised: its fit() accepts a y argument only for API consistency
# and ignores it, and the cluster IDs it assigns are arbitrary, so they do not
# line up with encoded class labels. Precision/recall against the true labels is
# therefore not meaningful; use clustering metrics that are invariant to cluster
# numbering instead.
predicted = kmeans.predict(X_test_counts)
print("Adjusted Rand Index: %s" % metrics.adjusted_rand_score(y_test, predicted))
print("Homogeneity: %s" % metrics.homogeneity_score(y_test, predicted))

5. More on Evaluation

In some of the examples above, we talked about ways to calculate precision, recall and F1, so we are not going to repeat that in this section. Instead, we will conclude this introductory tutorial with cross-validation. There are many strategies for cross-validation, and scikit-learn offers a rich selection of them; for the purpose of demonstration, we will use K-fold cross-validation (which is widely used in NLP).


In [ ]:
from sklearn.model_selection import KFold
from sklearn.naive_bayes import MultinomialNB

dataset = pd.read_csv('20news-18828.csv', header=None, delimiter=',', names=['label', 'text'])
X = dataset.text
y = dataset.label

# the train_test_split() function shuffles the dataset under the hood, but the KFold
# object does not; therefore, if your dataset is sorted, make sure to shuffle it.
# For the sake of time, we are only doing 5 folds
kf = KFold(n_splits=5, shuffle=True)
for train_index, test_index in kf.split(X):
    # KFold yields positional indices, so use .iloc for robust row selection
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    cv = CountVectorizer()
    X_train_counts = cv.fit_transform(X_train)
    X_test_counts = cv.transform(X_test)
    clf = MultinomialNB().fit(X_train_counts, y_train)
    predicted = clf.predict(X_test_counts)
    p, r, f1, _ = metrics.precision_recall_fscore_support(y_test, predicted, average='macro')
    print("Macro-averaged Performance:\nPrecision: {0}, Recall: {1}, F1: {2}".format(p, r, f1))

6. Conclusion

In this tutorial, we went through some basic features of scikit-learn that let us perform straightforward NLP/ML tasks. So far we have only used text, i.e. bags of words, as features; what if we want to incorporate other features, such as document length or overall document sentiment? In the next tutorial, we will address those questions and focus on using scikit-learn for feature engineering.