In this Machine Learning Snippet we use scikit-learn (http://scikit-learn.org/) and ebooks from Project Gutenberg (https://www.gutenberg.org/) to create a text classifier that can classify German, French, Dutch and English documents.
We need one document per language and split each document into smaller chunks to train the classifier.
For our snippet we use four ebooks, one per language; the Project Gutenberg file names appear in the code below.
Note: The ebooks are for the use of anyone anywhere at no cost and with almost no restrictions whatsoever. You may copy them, give them away or re-use them under the terms of the Project Gutenberg License included with the eBooks or online at www.gutenberg.org
First, let's extract the text from the ebooks without the Project Gutenberg header and footer, convert it to lowercase and split it on whitespace.
In [1]:
import re

txt_german = open('data/pg22465.txt', 'r').read()
txt_english = open('data/pg46.txt', 'r').read()
txt_french = open('data/pg16021.txt', 'r').read()
txt_dutch = open('data/pg28560.txt', 'r').read()

def get_markers(txt, pattern=r'\*\*\*'):
    # return the start positions of all '***' markers in the text
    iterator = re.finditer(pattern, txt)
    indices = [m.start(0) for m in iterator]
    return indices

def extract_text_tokens(txt):
    # keep only the text between the Project Gutenberg header and footer,
    # then lowercase it and split it on whitespace
    indices = get_markers(txt)
    header = indices[1]
    footer = indices[2]
    return txt[header:footer].lower().strip().split()

feat_german = extract_text_tokens(txt_german)
feat_english = extract_text_tokens(txt_english)
feat_french = extract_text_tokens(txt_french)
feat_dutch = extract_text_tokens(txt_dutch)
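Before cleaning the tokens, here is an optional sanity check (just a sketch): it prints how many '***' markers each file contains and how many tokens were extracted. The extraction above keeps the text between the second and third marker.
In [ ]:
# Optional sanity check: marker count and token count per ebook.
for name, txt, tokens in [('german', txt_german, feat_german),
                          ('english', txt_english, feat_english),
                          ('french', txt_french, feat_french),
                          ('dutch', txt_dutch, feat_dutch)]:
    print(name, len(get_markers(txt)), 'markers,', len(tokens), 'tokens')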
Next we clean the tokens by removing special characters and numbers.
In [2]:
import re

def remove_special_chars(x):
    # strip punctuation and other special characters
    chars = ['_', '(', ')', '*', '"', '[', ']', '?', '!', ',', '.', '»', '«', ':', ';']
    for c in chars:
        x = x.replace(c, '')
    # remove numbers
    x = re.sub(r'\d', '', x)
    return x

tokens_english = [remove_special_chars(x) for x in feat_english]
tokens_german = [remove_special_chars(x) for x in feat_german]
tokens_french = [remove_special_chars(x) for x in feat_french]
tokens_dutch = [remove_special_chars(x) for x in feat_dutch]
print('tokens (german)', len(tokens_german))
print('tokens (french)', len(tokens_french))
print('tokens (dutch)', len(tokens_dutch))
print('tokens (english)', len(tokens_english))
In [3]:
from sklearn.utils import resample

max_samples = 1300

def create_text_sample(x):
    # group the tokens into small text chunks of roughly max_tokens words
    max_tokens = 20
    data = []
    text = []
    for i, f in enumerate(x):
        text.append(f)
        if i % max_tokens == 0 and i != 0:
            data.append(' '.join(text))
            text = []
    return data

# draw the same number of chunks per language so the classes are balanced
sample_german = resample(create_text_sample(tokens_german), replace=False, n_samples=max_samples)
sample_french = resample(create_text_sample(tokens_french), replace=False, n_samples=max_samples)
sample_dutch = resample(create_text_sample(tokens_dutch), replace=False, n_samples=max_samples)
sample_english = resample(create_text_sample(tokens_english), replace=False, n_samples=max_samples)
print('samples (german)', len(sample_german))
print('samples (french)', len(sample_french))
print('samples (dutch)', len(sample_dutch))
print('samples (english)', len(sample_english))
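If you want to verify the chunking, here is an optional check that the English samples contain roughly 20 words each:
In [ ]:
# Quick check: every chunk should contain roughly max_tokens words.
print([len(s.split()) for s in sample_english[:5]])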
A text sample looks like this.
In [4]:
print('English sample:\n------------------')
print(sample_english[100])
print('------------------')
In [5]:
import argparse as ap

def create_data_structure(**kwargs):
    # build a small scikit-learn-like dataset object with data, target and target_names
    samples = {'data': [], 'target': [], 'target_names': []}
    label = 0
    for name, value in kwargs.items():
        samples['target_names'].append(name)
        for i in value:
            samples['data'].append(i)
            samples['target'].append(label)
        label += 1
    return ap.Namespace(**samples)

data = create_data_structure(de=sample_german, en=sample_english,
                             fr=sample_french, nl=sample_dutch)
print('target names: ', data.target_names)
print('number of observations: ', len(data.data))
It's important that we shuffle the data and split it into a training set (80%) and a test set (20%).
In [6]:
from sklearn.model_selection import train_test_split
import numpy as np
x_train, x_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.20)
print('train size (x, y): ', len(x_train), len(y_train))
print('test size (x, y): ', len(x_test), len(y_test))
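Because we drew exactly 1300 chunks per language, the classes are already balanced. If you change the sample sizes, you can make train_test_split preserve the class proportions explicitly; an optional variant using the stratify and random_state arguments:
In [ ]:
# Optional variant: keep each language's proportion in both splits
# and fix the random seed for reproducibility.
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.20,
    stratify=data.target, random_state=42)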
We connect all our parts (vectorizer, classifier) into a scikit-learn Pipeline, which makes it easier and faster to run through all processing steps and build a model.
The TfidfVectorizer will use the word analyzer, a minimum document frequency of 10, and convert the text to lowercase (we already did a lowercase conversion in an earlier step, so this setting is mostly a safeguard). We also provide some stop words that should be ignored by our model: character names from the English ebook, which would otherwise act as giveaway features for English.
The MultinomialNB classifier will use the default alpha value of 1.0.
Here you can play around with the settings; the next section shows how to evaluate your model. A small toy example after the pipeline cell illustrates what the vectorizer produces.
In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn import model_selection
from sklearn import metrics
stopwords = ['scrooge', 'scrooges', 'bob']
pipeline = Pipeline([('vect', TfidfVectorizer(analyzer='word', min_df=10,
                                              lowercase=True, stop_words=stopwords)),
                     ('clf', MultinomialNB(alpha=1.0))])
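To get a feeling for what the vectorizer step produces, here is a small standalone sketch on a toy corpus. The toy sentences and min_df=1 are just for illustration; the pipeline above keeps min_df=10.
In [ ]:
# Toy illustration: the vectorizer maps raw text to a sparse tf-idf matrix.
# min_df=1 here because the toy corpus is tiny.
toy_corpus = ['the cat sat on the mat', 'the dog sat on the mat', 'le chat dort']
toy_vect = TfidfVectorizer(analyzer='word', lowercase=True, min_df=1)
toy_matrix = toy_vect.fit_transform(toy_corpus)
print(toy_vect.get_feature_names())   # vocabulary learned from the toy corpus
print(toy_matrix.shape)               # (3 documents, number of features)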
In this step we want to evaluate the performance of our classifier.
Let's evaluate the model with k-fold cross validation. Based on the output of this evaluation we can tune the model and its settings.
In [8]:
from sklearn.model_selection import KFold

folds = 4
kf = KFold(n_splits=folds)

# alternative: score each fold with a weighted F1 instead of accuracy
# scores = model_selection.cross_val_score(pipeline, X=x_train, y=y_train,
#                                          cv=folds, scoring='f1_weighted')

scores = model_selection.cross_val_score(pipeline, X=x_train, y=y_train,
                                         cv=kf, scoring='accuracy')
print('scores: %s' % scores)
print('accuracy: %0.6f (+/- %0.4f)' % (scores.mean(), scores.std() * 2))
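As a side note, kf.split only yields pairs of index arrays; with 4 folds, each split holds out a quarter of the training data. A small sketch (the same kind of split is used again below when we pick the best fold):
In [ ]:
# KFold.split yields (train_indices, test_indices) pairs over the training data.
for fold, (train_idx, test_idx) in enumerate(kf.split(x_train)):
    print('fold %d: train=%d, test=%d' % (fold, len(train_idx), len(test_idx)))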
In [9]:
predicted = model_selection.cross_val_predict(pipeline, X=x_train, y=y_train, cv=folds)
print(metrics.classification_report(y_train, predicted,
target_names=data.target_names, digits=4))
We build the final model with the fold that had the best score. We don't use the whole training set because we might overfit the model.
In [10]:
def select_best_kfold(x, y, kf, scores):
    # return the training portion of the fold that produced the best CV score
    splits = list(kf.split(x))
    score_index = np.argmax(scores)
    train_index = splits[score_index][0]
    return np.array(x)[train_index], np.array(y)[train_index]

x_final, y_final = select_best_kfold(x_train, y_train, kf, scores)
Next we build the model and evaluate the result against our test set.
In [11]:
text_clf = pipeline.fit(x_final, y_final)
predicted = text_clf.predict(x_test)
print(metrics.classification_report(y_test, predicted, target_names=data.target_names, digits=4))
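To see which languages get confused with which, a confusion matrix complements the report above (optional; it only uses the metrics module we already imported):
In [ ]:
# Rows: true languages, columns: predicted languages,
# both in the order of data.target_names.
print(metrics.confusion_matrix(y_test, predicted))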
Let's see what the most informative features are.
In [12]:
# show the most informative features per language
def show_top10(classifier, vectorizer, categories):
    feature_names = np.asarray(vectorizer.get_feature_names())
    for i, category in enumerate(categories):
        # indices of the 10 features with the highest weight for this class
        top10 = np.argsort(classifier.coef_[i])[-10:]
        print("%s: %s" % (category, " ".join(feature_names[top10])))

show_top10(text_clf.named_steps['clf'], text_clf.named_steps['vect'], data.target_names)
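Note: on newer scikit-learn releases, MultinomialNB.coef_ and get_feature_names() have been deprecated or removed. If the cell above fails for you, the following variant (an assumption about your installed version, not part of the original snippet) uses feature_log_prob_ and get_feature_names_out() instead:
In [ ]:
# Variant for newer scikit-learn releases (assumed >= 1.0):
# feature_log_prob_ plays the role of coef_ for MultinomialNB.
def show_top10_new(classifier, vectorizer, categories):
    feature_names = np.asarray(vectorizer.get_feature_names_out())
    for i, category in enumerate(categories):
        top10 = np.argsort(classifier.feature_log_prob_[i])[-10:]
        print("%s: %s" % (category, " ".join(feature_names[top10])))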
Let's see which features our model has and how many there are.
In [13]:
feature_names = np.asarray(text_clf.named_steps['vect'].get_feature_names())
print('number of features: %d' % len(feature_names))
print('first features: %s'% feature_names[0:10])
print('last features: %s' % feature_names[-10:])
In [14]:
new_data = ['Hallo mein Name ist Hugo.',
'Hi my name is Hugo.',
'Bonjour mon nom est Hugo.',
'Hallo mijn naam is Hugo.',
'Eins, zwei und drei.',
'One, two and three.',
'Un, deux et trois.',
'Een, twee en drie.'
]
predicted = text_clf.predict(new_data)
probs = text_clf.predict_proba(new_data)
for i, p in enumerate(predicted):
    print(new_data[i], ' --> ', data.target_names[p], ', prob:', max(probs[i]))
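If you want to reuse the trained pipeline outside this notebook, you can persist it, for example with joblib (a minimal sketch; 'language_clf.joblib' is just a placeholder filename):
In [ ]:
# Persist the fitted pipeline and load it again later (placeholder filename).
import joblib
joblib.dump(text_clf, 'language_clf.joblib')
loaded_clf = joblib.load('language_clf.joblib')
print(loaded_clf.predict(['Guten Morgen, wie geht es dir?']))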