In this tutorial, I will show you how to use local n-gram (LNG) methods for authorship attribution. LNG-based methods are highly effective for authorship attribution, and represent one of the significant differences in technique between authorship studies and other document-based machine learning, such as topic modelling. If you have any requests, or spot any bugs, please send me a message via Github. I've set up a project just for tutorials like this.
By the way, this tool is called the IPython Notebook. It's free, it's easy to use and it allows python to be run in-line. I won't introduce it here, but for more information, have a look here. You can download this notebook and run it from your own computer, allowing you to change things in-line. Alternatively, you can copy and paste the code segments into a new file.
You can run this code by either downloading this workbook, using the python interactive interpreter, or by copying the lines into a file and running them with python. I'll be assuming knowledge of how to run python scripts, and therefore won't be covering beginner knowledge here.
Please also note that this code is slow. I have focused on writing readable code, rather than fast code, and have left out optimisations that could be confusing.
If you find any mistakes or corrections, please contact me at r.layton@icsl.com.au
In [135]:
# Just a configuration, ignore me
%pylab inline
I'll be using the python programming language, with numpy, scipy and scikit-learn as the 'stack'. You can install scikit-learn by following this page, which will in turn install numpy and scipy. If you can run the following code, you are ready to run this tutorial. I've used scikit-learn version 0.14 for this tutorial, but it should work with version 0.12 and above.
In [136]:
import sklearn
print(sklearn.__version__)
You will also need the matplotlib library, which provides the pyplot interface. This generates the graphs in this tutorial. You can test that it's installed by running the next line:
In [137]:
import matplotlib.pyplot as plt
The first utility function we need is a function for counting the frequency of character n-grams in a string.
A character n-gram is a sequence of n characters in a row, present in a string.
For these examples, we assume n=3, with these n-grams also called tri-grams.
For example:
In [138]:
n = 3 # Length of n-gram
suess = "Sam! If you will let me be, I will try them. You will see."
for i in range(5): # Print the first five n-grams
    print("n-gram #{}: {}".format(i, suess[i:i+n]))
We won't do any preprocessing in this tutorial, so capital letters, numbers and punctuation will all remain as they are in the original documents.
Now, we will use the following function to extract and count all n-grams from a given string, which I will refer to as a document. The counts will be optionally normalised so that the figures are the frequencies, summing to 1.0.
In [139]:
# A defaultdict works the same as a normal dict, except
# that if the key is not found, it is given a default value.
from collections import defaultdict
def count_ngrams(document, n, normalise=False):
    counts = defaultdict(float)  # Default to 0.0 if the key is not found
    # Iterate through all n-grams in the document
    for i in range(len(document) - n + 1):
        ngram = document[i:i+n]
        # Update the count of this n-gram.
        counts[ngram] = counts[ngram] + 1
    if normalise:
        # Normalise so that the values sum to 1
        normalise_factor = float(len(document) - n + 1)
        for ngram in counts:
            counts[ngram] /= normalise_factor
    return counts
In [140]:
counts = count_ngrams(suess, 3)
# Print some examples
for ngram in [" I ", "Sam", "ill", "abc"]:
print("{0} occurs {1:.0f} time(s)".format(ngram, counts[ngram]))
In [141]:
# Let's graph the most frequent
from operator import itemgetter
top_ngrams = sorted(counts.items(), key=itemgetter(1), reverse=True)[:10]
ngrams, ng_counts = zip(*top_ngrams)
y_pos = np.arange(len(ngrams))
plt.figure()
plt.barh(y_pos, ng_counts, align='center', alpha=0.4)
plt.yticks(y_pos, ngrams)
plt.xlabel('Count')
plt.title('Most frequent n-grams')
plt.xticks(np.arange(0, max(ng_counts)+1, 1.0))
plt.show()
In [142]:
frequencies = count_ngrams(suess, 3, normalise=True)
# Print some examples
for ngram in [" I ", "Sam", "ill", "abc"]:
print("{0} has frequency {1:.3f}".format(ngram, frequencies[ngram]))
We now need a corpus to do the classification on. The format I'll use for this tutorial is to use one list of strings for the "documents" and one array of integers as the "classes", or "targets". Each document has a corresponding class value, such that all documents with the same class were written by the same person.
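To make the format concrete, here is a tiny made-up example of a documents/classes pair (not the dataset we will actually use):
In [ ]:
import numpy as np
# Two authors, three made-up documents: documents 0 and 1 share an author, document 2 has another
example_documents = ["I would not like them here or there.",
                     "I do not like green eggs and ham.",
                     "It was the best of times, it was the worst of times."]
example_classes = np.array([0, 0, 1])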
First, grab this zip file containing the dataset and unzip it into a directory somewhere.
Then update the default_folder value below, changing the value from None to the path of the data.
In [143]:
import os
from os.path import expanduser
import codecs
import numpy as np
default_folder = None # Change this to the folder name containing the dataset
def get_single_corpus(folder=None):
    if folder is None:
        folder = os.path.join(expanduser("~"), "Data", "books")
    documents = []
    authors = []
    authornum = 0
    # Each subfolder contains the documents of a single author
    subfolders = [name for name in os.listdir(folder)
                  if os.path.isdir(os.path.join(folder, name))]
    for subfolder in subfolders:
        sf = os.path.join(folder, subfolder)
        #print("Author {} is {}".format(authornum, subfolder))
        for document_name in os.listdir(sf):
            with codecs.open(os.path.join(sf, document_name), encoding='utf-8') as input_f:
                documents.append(cleanFile(input_f.read()))
                authors.append(authornum)
        authornum += 1
    # There is an author that appears only once. To simplify this tutorial, we remove
    # any author with fewer than min_docs documents now.
    min_docs = 10
    counts = np.bincount(authors)
    validauthors = [authornum for authornum, count in enumerate(counts) if count >= min_docs]
    documents = [d for d, a in zip(documents, authors) if a in validauthors]
    authors = [a for a in authors if a in validauthors]
    assert len(documents) == len(authors)
    return documents, np.array(authors, dtype='int')

def cleanFile(document):
    # Removes the header and footer text that Project Gutenberg puts in its ebooks
    lines = document.split("\n")
    start = 0
    end = len(lines)
    for i in range(len(lines)):
        line = lines[i]
        if line.startswith("*** START OF THIS PROJECT GUTENBERG"):
            start = i + 1
        elif line.startswith("*** END OF THIS PROJECT GUTENBERG"):
            end = i - 1
    return "\n".join(lines[start:end])

documents, classes = get_single_corpus(default_folder)
documents = np.array(documents, dtype=object)  # For fancy indexing using numpy
In [144]:
# Example document, first 20 lines
print(documents[0].split("\n")[:20])
In [145]:
# Let's graph the most frequent n-grams of the first document
counts = count_ngrams(documents[0], 3, normalise=True)
top_ngrams = sorted(counts.items(), key=itemgetter(1), reverse=True)[:30]
ngrams, ng_counts = zip(*top_ngrams)
x_pos = np.arange(len(ngrams))  # Positions of the bars along the x axis
plt.figure()
plt.bar(x_pos, ng_counts, alpha=0.4)
plt.xlabel('n-gram (by rank)')
plt.ylabel('Normalised Frequency')
plt.title('Frequency of the most common n-grams')
plt.xlim(0, len(ngrams))
plt.show()
In [146]:
import numpy as np
# Print stats about the authors
print("\n".join("Author {} appears {} times".format(*r) for r in enumerate(np.bincount(classes))))
That's all the setup that is needed for this tutorial.
SCAP was not the first LNG method to be created, but it is the easiest to explain.
In SCAP, documents are profiled by taking the L most frequent n-grams that occur in the document.
An author is profiled in a similar way, except the L most frequent n-grams from all known documents from that author are used.
Two profiles are compared by finding the number of n-grams each profile has in common, which can be normalised by dividing by L.
The more n-grams they have in common, the more similar the profiles are.
When trying to determine which author wrote a particular document, the author whose profile is most similar to the document's profile is chosen.
To make this calculation consistent with other methods that will be introduced later, we use a distance instead of similarity.
The distance is just 1 - s, where s is the similarity.
In this format, the lower the distance, the more likely the author, and the author with the lowest distance to a document is the most likely author of that document.
For two profiles $P_1$ and $P_2$ (which can be either author or document profiles) the distance is given below. An n-gram $x$ is in $P_i$ if it is in the profile, and the intersection of two profiles is the set of n-grams that are in both.
As a similarity: $$s_{SCAP}(P_1, P_2) = \frac{1}{L}|P_1 \cap P_2|$$
As a distance: $$d_{SCAP}(P_1, P_2) = \max(0, 1 - s_{SCAP}(P_1, P_2))$$
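Before the full implementation, here is a small worked example of the similarity and distance (just a sketch using two tiny made-up profiles and L = 4):
In [ ]:
# A small sketch of the SCAP similarity/distance with two tiny made-up profiles and L = 4
L_example = 4
profile_a = set(["the ", " and", "ing ", "tion"])   # Top L n-grams of one profile
profile_b = set(["the ", " and", "ion ", "ers "])   # Top L n-grams of another profile
similarity = len(profile_a & profile_b) / float(L_example)
dist = max(0.0, 1 - similarity)
print("Similarity = {:.2f}, distance = {:.2f}".format(similarity, dist))
# Two of the four n-grams are shared, so the similarity is 0.5 and the distance is 0.5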
The class below implements SCAP using the base classes from scikit-learn (imported as sklearn). This will be important later, but for now, the important things to note are the fit and predict functions. The fit function takes a set of training data and creates a profile for each author in that data. The predict function predicts which of those authors is most likely to have written each document. The class also has utility functions for creating profiles and comparing profiles.
The reference for this paper is below:
In [147]:
from operator import itemgetter
from sklearn.base import BaseEstimator, ClassifierMixin # For using scikit-learn's test framework. Ignore for now
class SCAP(BaseEstimator, ClassifierMixin):
    def __init__(self, n, L):
        self.n = n
        self.L = L
        self.author_profiles = None  # Will be trained by fit()

    def create_profile(self, documents):
        # Creates a profile of a document or list of documents.
        if isinstance(documents, basestring):
            # documents can be either a list of documents, or a single document.
            # If it's a single document, convert it to a list.
            documents = [documents,]
        # Profile each document independently
        profiles = (count_ngrams(document, self.n, normalise=False)
                    for document in documents)
        # Merge the profiles
        main_profile = defaultdict(float)
        for profile in profiles:
            for ngram in profile:
                main_profile[ngram] += profile[ngram]
        # Normalise the profile
        num_ngrams = sum(main_profile.values())
        for ngram in main_profile:
            main_profile[ngram] /= num_ngrams
        # Return the profile with only the top L n-grams
        return self.top_L(main_profile)

    def top_L(self, profile):
        # Returns the profile with only the top L most frequent n-grams
        if self.L >= len(profile):
            return profile
        threshold = sorted(profile.values())[-self.L]
        return {ngram: profile[ngram]
                for ngram in profile
                if profile[ngram] >= threshold}

    def compare_profiles(self, profile1, profile2):
        # Number of n-grams in both profiles, divided by L
        similarity = len(set(profile1.keys()) & set(profile2.keys())) / float(self.L)
        # Slight edge case here: the similarity could be higher than 1.
        # Just make it equal to 1 if it is over.
        similarity = min(similarity, 1.0)
        distance = 1 - similarity
        return distance

    def fit(self, documents, classes):
        # Fits the current model to the training data provided
        # Separate documents into the sets written by each author
        author_documents = ((author, [documents[i]
                                      for i in range(len(documents))
                                      if classes[i] == author])
                            for author in set(classes))
        # Profile each of the authors independently
        self.author_profiles = {author: self.create_profile(cur_docs)
                                for author, cur_docs in author_documents}
        return self  # scikit-learn convention: fit() returns the estimator

    def predict(self, documents):
        # Predict which of the authors wrote each of the documents
        predictions = np.array([self.predict_single(document) for document in documents])
        return predictions

    def predict_single(self, document):
        # Predicts the author of a single document
        # Profile of the current document
        profile = self.create_profile(document)
        # Distance from the document to each author
        distances = [(author, self.compare_profiles(profile, self.author_profiles[author]))
                     for author in self.author_profiles]
        # Get the nearest (author, distance) pair, and take the author from that pair
        prediction = sorted(distances, key=itemgetter(1))[0][0]
        return prediction
OK, let's see it in action. I'll point out here that this method of testing should not be used in the real world. The fundamental mistake I'll be making is to use training data to test the method, which should never be done in practice. Anyway, let's do it now:
In [148]:
model = SCAP(n=4, L=2000)
# Training the model
model.fit(documents, classes)
# Predict the author of each of the documents
y_pred = model.predict(documents) # Notice how I am predicting the data I used for training? Don't do that.
# Compute the accuracy of these predictions.
print("Accuracy is {:.1f}%".format(100. * np.mean(classes == y_pred)))
Not a bad result, but as I said, don't use training data for testing. Here is some code that uses scikit-learn to perform the testing, which automatically splits the data into training and testing sets before computing the accuracy, through a process called k-fold cross validation. More information on this is available in the scikit-learn documentation, but I won't go into the details here. In a nutshell, cross validation splits the data into two sets, training and testing, and does this multiple times with different splits of the data (depending on the exact method used).
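If cross validation is new to you, this sketch shows how KFold splits a toy dataset of ten items into training and testing indices (the numbers are just indices, not our documents):
In [ ]:
# A sketch of how KFold splits ten items into five folds of train/test indices
from sklearn import cross_validation
toy_cv = cross_validation.KFold(10, n_folds=5)
for fold, (train_idx, test_idx) in enumerate(toy_cv):
    print("Fold {}: train on {}, test on {}".format(fold, list(train_idx), list(test_idx)))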
In [149]:
# First, setup the cross fold validation object
from sklearn import cross_validation
cv = cross_validation.KFold(len(documents), n_folds=5, random_state=0)
# Now, compute the score using it
scores = cross_validation.cross_val_score(model, documents, classes, cv=cv)
print("Accuracy is {:.1f}%".format(100. * np.mean(scores)))
A fair bit lower, but not too bad (the chance rate is around 28%). This is a better representation of what we can expect the algorithm to do if applied to real data with these parameters.
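As a rough check on that chance rate, one simple baseline (my assumption of what 'chance rate' means here) is the accuracy you would get by always guessing the most common author:
In [ ]:
# Accuracy from always predicting the most frequent author (a simple chance baseline)
class_counts = np.bincount(classes)
print("Majority-class baseline: {:.1f}%".format(100. * np.max(class_counts) / float(len(classes))))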
CNG was the first LNG method proposed in the literature. The method is almost entirely the same as SCAP (or rather, SCAP is almost entirely the same as CNG, since CNG came first), except for the way the distance is computed. This distance computation uses the frequencies of the n-grams, whereas SCAP simply used whether or not they appeared in the top L n-grams.
$$ d_{CNG}(P_1, P_2) = \displaystyle\sum_{x \in X_{P_1} \cup X_{P_2}} \Biggl(\frac{2\cdot(P_1(x) - P_2(x))}{P_1(x) + P_2(x)}\Biggr)^2 $$
Note that $P_i(x)$ is the frequency of n-gram $x$ in profile $P_i$, which is zero if the n-gram is not in the profile. (This is important, as the value is zero if the n-gram is not in the top L, even if it does have a frequency in the original document.)
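To make the formula concrete, here is a small sketch that computes $d_{CNG}$ directly from the definition, using two tiny made-up profiles (the class below uses the mean of these terms rather than the raw sum, which normalises by the number of n-grams in the union):
In [ ]:
# A sketch: compute d_CNG directly from the definition, using two tiny made-up profiles
p1 = {"the": 0.05, "he ": 0.04, "and": 0.03}
p2 = {"the": 0.06, "and": 0.01, "ing": 0.02}
total = 0.0
for x in set(p1.keys()) | set(p2.keys()):
    f1 = p1.get(x, 0.0)  # Zero if the n-gram is not in the profile
    f2 = p2.get(x, 0.0)
    total += (2 * (f1 - f2) / (f1 + f2)) ** 2
print("d_CNG = {:.4f}".format(total))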
To simplify the coding, we simply derive our new class from the SCAP class defined above. The reference for this method is below.
In [150]:
class CNG(SCAP):
    def compare_profiles(self, profile1, profile2):
        # All n-grams that appear in either profile
        ngrams = set(profile1.keys() + profile2.keys())
        d1 = np.array([profile1.get(ng, 0.) for ng in ngrams])
        d2 = np.array([profile2.get(ng, 0.) for ng in ngrams])
        # Mean of the CNG terms (the small epsilon avoids division by zero)
        return np.mean(4 * np.square((d1 - d2) / (d1 + d2 + 1e-16)))
Thanks to the fact we inherited from the SCAP class above, testing this method uses almost exactly the same code. The only difference in the code below is the line that creates the model.
In [151]:
# Setup a new model, this is the only difference
model = CNG(n=4, L=2000)
# First, setup the cross fold validation object
from sklearn import cross_validation
cv = cross_validation.KFold(len(documents), n_folds=5, random_state=0)
# Now, compute the score using it
scores = cross_validation.cross_val_score(model, documents, classes, cv=cv)
print("Accuracy is {:.1f}%".format(100. * np.mean(scores)))
Slightly better than before. One problem here is that the values of the parameters (n and L in this case) make a huge difference to the accuracy of the algorithm. To find good parameter values, we can perform a grid search over different combinations. Thanks to the format of the classes, we can use scikit-learn's grid search functionality for this purpose. I've greatly restricted the parameter space, as this can take a long time to compute.
In [152]:
# This code snippet is a shortened version of the one in the scikit-learn docs, available here:
# http://scikit-learn.org/stable/auto_examples/grid_search_digits.html#example-grid-search-digits-py
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report
parameters = [{'n': [3, 4], 'L':[500, 1000]},] # These are the parameters to search through.
# The n and L values here don't matter, they will be updated for training.
clf = GridSearchCV(CNG(n=1, L=1), parameters, cv=5, scoring='accuracy')
clf.fit(documents, classes)
print("The best model found has n={} and L={}".format(clf.best_estimator_.n, clf.best_estimator_.L))
print("The scores for the different parameter sets were:")
for params, mean_score, scores in clf.grid_scores_:
    print("{}: {:.3f} (+/-{:.3f})".format(params, mean_score, scores.std() / 2))
From this, we can increase the parameter options and do larger searches. I recommend moving to a more distributed system, and having a look at the n_jobs parameter, which can run the search across multiple processes (see the docs for more details). The returned model is a valid scikit-learn estimator, so you can call the code below, which will use this model. Additionally, you can retrieve the best estimator instance itself via clf.best_estimator_.
One final word of caution about the next snippet: I have used the documents list, which was used in training the model, for the evaluation. You shouldn't do that. Instead, you should split the data into a training set and an evaluation set, where the evaluation set is used only at the end and never in any training or cross validation (a quick sketch of such a split follows the next snippet).
In [153]:
y_true, y_pred = classes, clf.predict(documents)
print(classification_report(y_true, y_pred))
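Here is a quick sketch of that held-out evaluation approach, using scikit-learn's train_test_split (the split proportion and parameters are just examples):
In [ ]:
from sklearn.cross_validation import train_test_split
# Hold out 25% of the documents as an evaluation set that is never touched during training
docs_train, docs_eval, classes_train, classes_eval = train_test_split(
    documents, classes, test_size=0.25, random_state=0)
held_out_model = CNG(n=4, L=2000)
held_out_model.fit(docs_train, classes_train)
print("Held-out accuracy is {:.1f}%".format(
    100. * np.mean(held_out_model.predict(docs_eval) == classes_eval)))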
We can have a look at the effect that the L parameter has on the accuracy of CNG by plotting it for different n-values. The code below only does n=3, but can be increased to do more values.
This code will take a while to run, depending on your computer, maybe half an hour or more.
In [162]:
from sklearn.cross_validation import KFold
n_values = [3, ]  # Add more values here to test different n-values
L_values = np.arange(500, 3001, 500)  # From 500 to 3000 inclusive, in increments of 500.
# Running on the full set of documents takes a long time.
# Set sample_size to a number to run on a smaller sample, or None to run on the full dataset.
sample_size = None
if sample_size is None:
    sample_documents = documents
    sample_classes = classes
else:
    sample_documents = documents[:sample_size]
    sample_classes = classes[:sample_size]
assert len(sample_documents) == len(sample_classes)
# The number of folds is the size of the smallest class
class_counts = np.bincount(sample_classes)
# class_counts could be zero if a class doesn't exist in the sample, so remove those values
class_counts = class_counts[np.nonzero(class_counts)]
n_folds = min(class_counts)
plt.figure()
for n in n_values:
    scores = []
    for L in L_values:
        cv = KFold(len(sample_documents), n_folds=n_folds, random_state=0)
        cur_model = CNG(n=n, L=L)
        cur_score = cross_validation.cross_val_score(cur_model, sample_documents, sample_classes, cv=cv)
        cur_score = np.mean(cur_score)
        scores.append(cur_score)
        print("CNG(n={}, L={}) = {:.4f}".format(n, L, cur_score))
    plt.plot(L_values, scores, label='n={}'.format(n))
plt.ylim(0, 1.0)
plt.xlabel('L')
plt.ylabel('Accuracy')
plt.xticks(L_values)
plt.legend()
plt.show()
RLP is a method I created, and it has a number of differences from the first two algorithms presented. At its core, though, it is based on the same principles as CNG (and therefore SCAP).
The original derivation of CNG referenced the work of Bennet (1976), which used a similar model with a language default value. This language default is effectively a profile of all documents in that language. Then, rather than comparing two profiles directly, we first normalise each profile against what is the normal value for each n-gram. Intuitively, while the n-gram "the" is very common for most authors, it is expected to be, and therefore the presence of this n-gram is hardly surprising. By comparing to the language default, we see that this n-gram is expected, and we are therefore less surprised by it.
A profile is then computed by first computing the CNG-esque profile of the author, and then subtracting the corpus default values from it (recentring). These recentred profiles are what RLP compares. RLP does a number of other things differently as well. Rather than setting an n-gram's frequency to zero if it is not present in the profile (as in CNG), the actual frequency value is used for comparisons, even if that n-gram is not in the top L.
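To make the recentring step concrete, here is a small sketch using a couple of made-up frequencies (not the real corpus values):
In [ ]:
# A small sketch of recentring: subtract the language default frequency from the author's frequency.
# The numbers here are made up for illustration.
language_default = {"the ": 0.0150, "ever": 0.0005}
author_raw = {"the ": 0.0148, "ever": 0.0011}
recentred = {}
for ngram in author_raw:
    recentred[ngram] = author_raw[ngram] - language_default.get(ngram, 0.0)
for ngram in recentred:
    print("{}: {:+.4f}".format(repr(ngram), recentred[ngram]))
# A negative value means this author uses the n-gram less often than the language default.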
Profiles are compared using the cosine distance, given below, where $||P_i||_2$ is the Euclidean norm of the profile (i.e. the square root of the sum of the squares of each frequency value), and $P_1 \cdot P_2$ is the dot product of the n-gram values in each profile, over all n-grams in the union of the two profiles.
Another major change is that the top L n-grams are ranked by absolute value, rather than by raw value. This means an n-gram with a value of -0.15 ranks higher than an n-gram with a value of 0.14.
The distance between two profiles is then:
$$ d_{RLP}(P_1, P_2) = 1 - \frac{P_1 \cdot P_2}{||P_1||_2 \, ||P_2||_2} $$
References:
In [155]:
from scipy.spatial import distance
class RLP(SCAP):
    def __init__(self, n, L):
        super(RLP, self).__init__(n, L)
        self.topLonly = False  # Don't limit profiles on creation
        self.language_profile = None

    def compare_profiles(self, profile1, profile2):
        # All n-grams in the top L of either profile.
        ngrams = set(self.top_L(profile1).keys() + self.top_L(profile2).keys())
        # Profile vector for profile 1
        d1 = np.array([profile1.get(ng, 0.) for ng in ngrams])
        # Profile vector for profile 2
        d2 = np.array([profile2.get(ng, 0.) for ng in ngrams])
        return distance.cosine(d1, d2)

    def top_L(self, profile):
        # Returns the profile with only the top L n-grams, ranked by absolute value
        if self.L >= len(profile):
            return profile
        threshold = sorted(map(abs, profile.values()))[-self.L]
        copy = defaultdict(float)
        for key in profile:
            if abs(profile[key]) >= threshold:
                copy[key] = profile[key]
        return copy

    def fit(self, documents, classes):
        # Create the language default profile from all training documents,
        # then fit the author profiles as before (they are recentred in create_profile).
        self.language_profile = self.create_profile(documents)
        return super(RLP, self).fit(documents, classes)

    def create_profile(self, documents):
        # Creates a profile of a document or list of documents.
        if isinstance(documents, basestring):
            # documents can be either a list of documents, or a single document.
            # If it's a single document, convert it to a list.
            documents = [documents,]
        # Profile each document independently
        profiles = (count_ngrams(document, self.n, normalise=False)
                    for document in documents)
        # Merge the profiles
        main_profile = defaultdict(float)
        for profile in profiles:
            for ngram in profile:
                main_profile[ngram] += profile[ngram]
        # Normalise the profile
        num_ngrams = float(sum(main_profile.values()))
        for ngram in main_profile:
            main_profile[ngram] /= num_ngrams
        if self.language_profile is not None:
            # Recentre the profile against the language default.
            for key in main_profile:
                main_profile[key] = main_profile[key] - self.language_profile.get(key, 0)
        # Note that the profile is returned in full, as exact frequencies are used
        # in comparing profiles (rather than being chopped off at the top L)
        return main_profile
Now we test with our overfitting code from earlier.
In [156]:
model = RLP(n=4, L=2000)
# Training the model
model.fit(documents, classes)
# Predict the author of each of the documents
y_pred = model.predict(documents) # Notice how I am predicting the data I used for training? Don't do that.
# Compute the accuracy of these predictions.
print("Accuracy is {:.1f}%".format(100. * np.mean(classes == y_pred)))
In [157]:
# Have a look at some of the values in the language profile
for ngram in ['the ', "'ell", " and", "est ", "ever"]:
    print("{} = {:.6f}".format(repr(ngram), model.language_profile[ngram]))
In [158]:
# Have a look at one of the author profiles
ap = model.author_profiles[1]
print("Length of profile is {}".format(len(ap))) # Notice how this is much higher than L (which is currently 2000)
for ngram in ['the ', "'ell", " and", "est ", "ever"]:
    print("{} = {:.6f}".format(repr(ngram), ap[ngram]))
Notice how the n-gram 'est ' has a negative frequency.
This means the frequency for this author was 0.000089 lower than the frequency for the language default.
RLP uses this information, where it may be lost in other LNG methods.
The cost, as you may have discovered if you are running the code as we go, is that RLP is significantly slower than CNG or SCAP. Experiments I've conducted show that RLP typically achieves higher accuracy than CNG (which in turn is typically higher than SCAP) for most corpora.
Next, we do a more appropriate test using cross validation.
In [159]:
# Setup a new model
model = RLP(n=4, L=2000)
# First, setup the cross fold validation object
from sklearn import cross_validation
cv = cross_validation.KFold(len(documents), n_folds=5, random_state=0)
# Now, compute the score using it
scores = cross_validation.cross_val_score(model, documents, classes, cv=cv)
print("Accuracy is {:.1f}%".format(100. * np.mean(scores)))
That's lower than before, but with different models come different parameter values for improving performance. A grid search through some parameter values will give a better score (but will take a while to run).
In [160]:
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report
parameters = [{'n': [3, 4], 'L':[500, 1000]},] # These are the parameters to search through.
# The n and L values here don't matter, they will be updated for training.
clf = GridSearchCV(RLP(n=1, L=1), parameters, cv=5, scoring='accuracy')
clf.fit(documents, classes)
print("The best model found has n={} and L={}".format(clf.best_estimator_.n, clf.best_estimator_.L))
print("The scores for the different parameter sets were:")
for params, mean_score, scores in clf.grid_scores_:
    print("{}: {:.3f} (+/-{:.3f})".format(params, mean_score, scores.std() / 2))