Text Classification (scikit-learn) with Naive Bayes

In this Machine Learning Snippet we use scikit-learn (http://scikit-learn.org/) and ebooks from Project Gutenberg (https://www.gutenberg.org/) to build a text classifier that can classify German, French, Dutch and English documents.

We need one document per language and split each document into smaller chunks to train the classifier.

For our snippet we use the following ebooks:

Note: The ebooks are for the use of anyone anywhere at no cost and with almost no restrictions whatsoever. You may copy them, give them away or re-use them under the terms of the Project Gutenberg License included with each ebook or online at www.gutenberg.org

Data Preparation

We have to prepare the English, French, Dutch and German texts before we can begin. The goal is to cut off the header and footer from the ebooks and then do some text cleaning.

First let's extract the text between the header and the footer of each ebook, convert it to lowercase and split it by whitespace.


In [1]:
import re

txt_german = open('data/pg22465.txt', 'r').read()
txt_english = open('data/pg46.txt', 'r').read()
txt_french = open('data/pg16021.txt', 'r').read()
txt_dutch = open('data/pg28560.txt', 'r').read()



def get_markers(txt, pattern=r'\*\*\*'):
    # positions of all '***' marker occurrences in the text
    matches = re.finditer(pattern, txt)
    indices = [m.start(0) for m in matches]
    return indices

def extract_text_tokens(txt):
    # the Project Gutenberg header ends and the footer starts at '***' markers;
    # keep only the text in between, lowercased and split by whitespace
    indices = get_markers(txt)
    header = indices[1]
    footer = indices[2]

    return txt[header: footer].lower().strip().split()


feat_german = extract_text_tokens(txt_german)
feat_english = extract_text_tokens(txt_english)
feat_french = extract_text_tokens(txt_french)
feat_dutch = extract_text_tokens(txt_dutch)

Next we clean the tokens by removing special characters and numbers.


In [2]:
import re

def remove_special_chars(x):
    # remove punctuation and other special characters
    chars = ['_', '(', ')', '*', '"', '[', ']', '?', '!', ',', '.', '»', '«', ':', ';']
    for c in chars:
        x = x.replace(c, '')

    # remove numbers
    x = re.sub(r'\d', '', x)

    return x

tokens_english = [remove_special_chars(x) for x in feat_english]
tokens_german = [remove_special_chars(x) for x in feat_german]
tokens_french = [remove_special_chars(x) for x in feat_french]
tokens_dutch = [remove_special_chars(x) for x in feat_dutch]


print('tokens (german)', len(tokens_german))
print('tokens (french)', len(tokens_french))
print('tokens (dutch)', len(tokens_dutch))
print('tokens (english)', len(tokens_english))


tokens (german) 27216
tokens (french) 32755
tokens (dutch) 31502
tokens (english) 28559

Create samples

Now we create text samples of 20 tokens (words) each. The samples will later be used to train the classifier. We will only use 1300 samples from every language.


In [3]:
from sklearn.utils import resample

max_samples = 1300

def create_text_sample(x):
    # group the tokens into samples of max_tokens words each
    max_tokens = 20
    data = []
    text = []
    for i, f in enumerate(x):
        text.append(f)
        if i % max_tokens == 0 and i != 0:
            data.append(' '.join(text))
            text = []
    return data
    

sample_german = resample(create_text_sample(tokens_german), replace=False, n_samples=max_samples)
sample_french = resample(create_text_sample(tokens_french), replace=False, n_samples=max_samples)
sample_dutch = resample(create_text_sample(tokens_dutch), replace=False, n_samples=max_samples)
sample_english = resample(create_text_sample(tokens_english), replace=False, n_samples=max_samples)

print('samples (german)', len(sample_german))
print('samples (french)', len(sample_french))
print('samples (dutch)', len(sample_dutch))
print('samples (english)', len(sample_english))


samples (german) 1300
samples (french) 1300
samples (dutch) 1300
samples (english) 1300

A text sample looks like this.


In [4]:
print('English sample:\n------------------')
print(sample_english[100])
print('------------------')


English sample:
------------------
themselves rather than be parties to a lie of such enormous magnitude spirit are they yours scrooge could say no
------------------

Modeling

As classifier we use MultinomialNB together with the TfidfVectorizer.

First we create the data structure which we will use to train the model. It holds lists with the data (X), the targets (y) and the target names.

{data: [], target: [], target_names: [] }

In [5]:
import argparse as ap

def create_data_structure(**kwargs):
    # build a simple container with lists for data, numeric targets and
    # target names; every keyword argument (language) gets its own label
    samples = {'data': [], 'target': [], 'target_names': []}
    label = 0
    for name, value in kwargs.items():
        samples['target_names'].append(name)
        for i in value:
            samples['data'].append(i)
            samples['target'].append(label)
        label += 1

    return ap.Namespace(**samples)

data = create_data_structure(de = sample_german, en = sample_english, 
                             fr = sample_french, nl = sample_dutch)



print('target names: ', data.target_names)
print('number of observations: ', len(data.data))


target names:  ['fr', 'en', 'nl', 'de']
number of observations:  5200

It's important that we shuffle the data and split it into a training set (80%) and a test set (20%).


In [6]:
from sklearn.model_selection import train_test_split
import numpy as np

x_train, x_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.20)

print('train size (x, y): ', len(x_train),  len(y_train))
print('test size (x, y): ', len(x_test), len(y_test))


train size (x, y):  4160 4160
test size (x, y):  1040 1040

We connect all our parts (vectorizer, classifier) into a machine learning Pipeline, which makes it easier and faster to go through all processing steps to build a model.

The TfidfVectorizer will use the word analyzer, a minimum document frequency of 10 and convert the text to lowercase (we already lowercased the text in an earlier step, so this is redundant but harmless). We also provide some stop words which should be ignored by our model.

The MultinomialNB classifier will use the default alpha value of 1.0.

Here you can play around with the settings; one alternative is sketched after the next cell. In the next section you will see how to evaluate your model.


In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn import model_selection
from sklearn import metrics

# character names from the English ebook that should not become features
stopwords = ['scrooge', 'scrooges', 'bob']

pipeline = Pipeline([('vect', TfidfVectorizer(analyzer='word', 
                            min_df=10, lowercase=True, stop_words=stopwords)),
                      ('clf', MultinomialNB(alpha=1.0))])
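
If you want to experiment with the settings, a character n-gram variant of the vectorizer is worth trying for language identification (a sketch only, reusing the imports from the cell above; it is not used for the results below):

# sketch: character n-grams (2 to 4 characters, within word boundaries) also
# work well for language identification and generalize to unseen words
pipeline_char = Pipeline([('vect', TfidfVectorizer(analyzer='char_wb',
                                ngram_range=(2, 4), lowercase=True)),
                          ('clf', MultinomialNB(alpha=1.0))])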

Evaluation

In this step we want to evaluate the performance of our classifier. We do the following:

  • Evaluate the model with k-fold on the training set
  • Evaluate the final model with the test set

Let's evaluate our model with k-fold cross-validation. Based on the scores from this evaluation we can tune the model and its settings; a grid-search sketch follows after the cross-validation results below.


In [8]:
from sklearn.model_selection import KFold

folds = 4
kf = KFold(n_splits=folds)

'''
scores = model_selection.cross_val_score(pipeline, X=x_train, y=y_train, 
                                         cv=folds, scoring='f1_weighted')
'''

scores = model_selection.cross_val_score(pipeline, X=x_train, y=y_train, 
                                         cv=kf, scoring='accuracy')

print('scores: %s' % scores )

print('accuracy: %0.6f (+/- %0.4f)' % (scores.mean(), scores.std() * 2))


scores: [ 1.          0.99903846  0.99903846  0.99903846]
accuracy: 0.999279 (+/- 0.0008)

In [9]:
predicted = model_selection.cross_val_predict(pipeline, X=x_train, y=y_train, cv=folds)
print(metrics.classification_report(y_train, predicted, 
                                    target_names=data.target_names, digits=4))


             precision    recall  f1-score   support

         fr     1.0000    0.9990    0.9995      1032
         en     0.9971    1.0000    0.9986      1042
         nl     1.0000    0.9990    0.9995      1044
         de     1.0000    0.9990    0.9995      1042

avg / total     0.9993    0.9993    0.9993      4160
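
If you want to tune the settings more systematically, a small grid search over the pipeline parameters is one option (a sketch with an illustrative parameter grid; the results in this snippet do not use it):

from sklearn.model_selection import GridSearchCV

# hypothetical grid: the smoothing of the classifier and the minimum document
# frequency of the vectorizer, addressed with the '<step>__<parameter>' syntax
param_grid = {'clf__alpha': [0.1, 0.5, 1.0],
              'vect__min_df': [5, 10, 20]}

grid = GridSearchCV(pipeline, param_grid, cv=kf, scoring='accuracy')
grid.fit(x_train, y_train)

print('best score: %0.6f' % grid.best_score_)
print('best parameters: %s' % grid.best_params_)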

We build the final model with the fold that had the best score. We do not use the whole training set because that might overfit the model.


In [10]:
def select_best_kfold(x, y, kf, scores):
    # rebuild the fold splits and take the training indices of the fold
    # with the highest cross-validation score
    splits = list(kf.split(x))

    score_index = np.argmax(scores)
    train_index = splits[score_index][0]

    return np.array(x)[train_index], np.array(y)[train_index]

x_final, y_final = select_best_kfold(x_train, y_train, kf, scores)

Next we build the model and evaluate the result against our test set.


In [11]:
text_clf = pipeline.fit(x_final, y_final)

predicted = text_clf.predict(x_test)


print(metrics.classification_report(y_test, predicted, target_names=data.target_names, digits=4))


             precision    recall  f1-score   support

         fr     1.0000    1.0000    1.0000       268
         en     1.0000    1.0000    1.0000       258
         nl     1.0000    1.0000    1.0000       256
         de     1.0000    1.0000    1.0000       258

avg / total     1.0000    1.0000    1.0000      1040
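
A confusion matrix is a quick way to check whether any of the languages get mixed up (a small sketch that reuses y_test and predicted from the cell above):

# rows are the true labels, columns the predicted labels,
# in the order of data.target_names
print(data.target_names)
print(metrics.confusion_matrix(y_test, predicted))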

Examine the features of the model

Let's see what the most informative features are.


In [12]:
# show most informative features
def show_top10(classifier, vectorizer, categories):
    # for every class, print the 10 features with the highest weights
    # (in newer scikit-learn versions use feature_log_prob_ and
    # get_feature_names_out() instead of coef_ and get_feature_names())
    feature_names = np.asarray(vectorizer.get_feature_names())
    for i, category in enumerate(categories):
        top10 = np.argsort(classifier.coef_[i])[-10:]
        print("%s: %s" % (category, " ".join(feature_names[top10])))


show_top10(text_clf.named_steps['clf'], text_clf.named_steps['vect'], data.target_names)


fr: pas une que les un la il et le de
en: was in that his he it of to and the
nl: op te dat zijn van hij de een het en
de: das ein ich es zu sie er die der und

Let's see which and how many features our model has.


In [13]:
feature_names = np.asarray(text_clf.named_steps['vect'].get_feature_names())

print('number of features: %d' % len(feature_names))
print('first features: %s'% feature_names[0:10])
print('last features: %s' % feature_names[-10:])


number of features: 804
first features: ['aan' 'aber' 'about' 'after' 'again' 'ah' 'ai' 'air' 'al' 'all']
last features: ['zur' 'zurück' 'zwei' 'écria' 'étaient' 'était' 'été' 'één' 'être' 'über']

New data

Let's try out the classifier with new data.


In [14]:
new_data = ['Hallo mein Name ist Hugo.', 
            'Hi my name is Hugo.', 
            'Bonjour mon nom est Hugo.',
            'Hallo mijn naam is Hugo.',
            'Eins, zwei und drei.',
            'One, two and three.',
            'Un, deux et trois.',
            'Een, twee en drie.'
           ]

predicted = text_clf.predict(new_data)
probs = text_clf.predict_proba(new_data)
for i, p in enumerate(predicted):
    print(new_data[i], ' --> ', data.target_names[p], ', prob:' , max(probs[i]))


Hallo mein Name ist Hugo.  -->  de , prob: 0.866497382158
Hi my name is Hugo.  -->  en , prob: 0.822075269211
Bonjour mon nom est Hugo.  -->  fr , prob: 0.955918310243
Hallo mijn naam is Hugo.  -->  nl , prob: 0.835304531876
Eins, zwei und drei.  -->  de , prob: 0.90988343868
One, two and three.  -->  en , prob: 0.974259445821
Un, deux et trois.  -->  fr , prob: 0.990403731538
Een, twee en drie.  -->  nl , prob: 0.962814629426