A tutorial on Local n-grams for Authorship Attribution

By Dr. Robert Layton

In this tutorial, I will show you how to use local n-grams (LNG) methods for authorship attribution. LNG based methods are highly effective for authorship attribution, and represent one of the significant differences in techniques between authorship studies and other document based machine learning, such as topic modelling. If you have any requests, or spot any bugs, please send me a message via Github. I've setup a project just for tutorials like this.

By the way, this tool is called the ipython notebook. It's free, it's easy to use and allows for using python in-line. I won't introduce it here, but for more information, have a look here. You can download this notebook, and run it from your own computer, allowing you to change things in-line. Alternatively, you can copy and paste the code segments into a new file.

You can run this code by either downloading this workbook, using the python interactive interpreter, or by copying the lines into a file and running them with python. I'll be assuming knowledge of how to run python scripts, and therefore won't be covering beginner knowledge here.

Please also note that this code is slow. I have really focused on making readable code, rather than fast code, and left optimisations out that could confused.

If you find any mistakes or corrections, please contact me at r.layton@icsl.com.au


In [135]:
# Just a configuration, ignore me
%pylab inline


Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.

Installation Dependencies

I'll be using the python programming language, using numpy, scipy and scikit-learn as the 'stack'. You can install scikit-learn by following this page, which will in turn install numpy and scipy. If you can run the following code, you are ready to run this tutorial. I've used version scikit-learn version 0.14 to do this tutorial, but it should work with version 0.12 and above.


In [136]:
import sklearn
print(sklearn.__version__)


0.15-git

You will also need the matplotlib library, which has the pyplot interface. This generates the graphs in this tutorial. You can test it's installed by running the next line:


In [137]:
import matplotlib.pyplot as plt

0) Utility Functions

The first utility function we need is a function for counting the frequency of character n-grams in a string. A character n-gram is a sequence of n characters in a row, present in a string. For these examples, we assume n=3, with these n-grams also called tri-grams. For example:


In [138]:
n = 3  # Length of n-gram
suess = "Sam! If you will let me be, I will try them. You will see."
for i in range(5):  # Print the first five n-grams
    print("n-gram #{}: {}".format(i, suess[i:i+n]))


n-gram #0: Sam
n-gram #1: am!
n-gram #2: m! 
n-gram #3: ! I
n-gram #4:  If

We won't do any preprocessing in this tutorial, so capital letters, numbers and punctuation will all remain as they are in the original documents.

Now, we will use the following function to extract and count all n-grams from a given string, which I will refer to as a document. The counts will be optionally normalised so that the figures are the frequencies, summing to 1.0.


In [139]:
# A defaultdict works the same as a normal dict, except
# that if the key is not found, it is given a default value.
from collections import defaultdict


def count_ngrams(document, n, normalise=False):    
    counts = defaultdict(float)  # Default to 0.0 if the key not found
    # Iterate through all n-grams in the document
    for i in range(len(document) - n + 1):
        ngram = document[i:i+n]
        # Update the count of this n-gram.
        counts[ngram] = counts[ngram] + 1
    if normalise:
        # Normalise so that sums equal 1
        normalise_factor = float(len(document) - n + 1)
        for ngram in counts:
            counts[ngram] /= normalise_factor
    return counts

In [140]:
counts = count_ngrams(suess, 3)
# Print some examples
for ngram in [" I ", "Sam", "ill", "abc"]:
    print("{0} occurs {1:.0f} time(s)".format(ngram, counts[ngram]))


 I  occurs 1 time(s)
Sam occurs 1 time(s)
ill occurs 3 time(s)
abc occurs 0 time(s)

In [141]:
# Let's graph the most frequent
from operator import itemgetter
top_ngrams = sorted(counts.items(), key=itemgetter(1), reverse=True)[:10]
ngrams, ng_counts = zip(*top_ngrams)
y_pos = np.arange(len(ngrams))

plt.figure()
plt.barh(y_pos, ng_counts, align='center', alpha=0.4)
plt.yticks(y_pos, ngrams)
plt.xlabel('Count')
plt.title('Most frequent n-grams')
plt.xticks(np.arange(0, max(ng_counts)+1, 1.0))

plt.show()



In [142]:
frequencies = count_ngrams(suess, 3, normalise=True)
# Print some examples
for ngram in [" I ", "Sam", "ill", "abc"]:
    print("{0} has frequency {1:.3f}".format(ngram, frequencies[ngram]))


 I  has frequency 0.018
Sam has frequency 0.018
ill has frequency 0.054
abc has frequency 0.000

We now need a corpus to do the classification on. The format I'll use for this tutorial is to use one list of strings for the "documents" and one array of integers as the "classes", or "targets". Each document has a corresponding class value, such that all documents with the same class were written by the same person.

First, grab this zip file containing the dataset and unzip it into a directory somewhere. Then update default_folder value below, changing the value from None to the path of the data.


In [143]:
import os
from os.path import expanduser
import codecs
import numpy as np

default_folder = None  # Change this to the folder name containing the dataset

def get_single_corpus(folder=None):
    if folder is None:
        folder = os.path.join(expanduser("~"), "Data", "books")
    documents = []
    authors = []
    training_mask = []
    authornum = 0
    i = 0
    subfolders = [name for name in os.listdir(folder)
                  if os.path.isdir(os.path.join(folder, name))]
    for subfolder in subfolders:
        sf = os.path.join(folder, subfolder)
        #print "Author %d is %s" % (authornum, subfolder)
        for document_name in os.listdir(sf):
            i += 1
            with codecs.open(os.path.join(sf, document_name), encoding='utf=8') as input_f:
                documents.append(cleanFile(input_f.read()))
                authors.append(authornum)
                training_mask.append(True)
        authornum += 1
    # There is an author that appears only once. To simplify this tutorial, we remove it now.
    min_docs = 10
    c = np.bincount(authors)
    validauthors = [authornum for authornum, count in enumerate(c) if count >= min_docs]
    documents = [d for d, c in zip(documents, authors) if c in validauthors]
    authors = [c for c in authors if c in validauthors]
    assert len(documents) == len(authors)
    return documents, np.array(authors, dtype='int')


def cleanFile(document):
    # Removes the stuff pg put in the ebooks
    lines = document.split("\n")
    start = 0
    end = len(lines)
    for i in range(len(lines)):
        line = lines[i]
        if line.startswith("*** START OF THIS PROJECT GUTENBERG"):
            start = i + 1
        elif line.startswith("*** END OF THIS PROJECT GUTENBERG"):
            end = i - 1
    return "\n".join(lines[start:end])




documents, classes = get_single_corpus()
documents = np.array(documents, dtype=object)  # For fancy indexing using numpy

In [144]:
# Example document, first 20 lines
print(documents[0].split("\n")[:20])


[u'\r', u'\r', u'\r', u'\r', u'Produced by David Brannan.  HTML version by Al Haines.\r', u'\r', u'\r', u'\r', u'\r', u'\r', u'\r', u'\r', u'\r', u'\r', u'The Adventure of the Cardboard Box\r', u'\r', u'\r', u'By\r', u'\r', u'Sir Arthur Conan Doyle\r']

In [145]:
# Let's graph the most frequent n-grams of the first document
counts = count_ngrams(documents[0], 3, normalise=True)
top_ngrams = sorted(counts.items(), key=itemgetter(1), reverse=True)[:30]
ngrams, ng_counts = zip(*top_ngrams)

plt.figure()
plt.bar(x_pos, ng_counts, alpha=0.4)
plt.xlabel('n-gram (by rank)')
plt.ylabel('Normalised Frequency')
plt.title('Frequency of the most common n-grams')
plt.xlim(0, len(ngrams))
plt.show()



In [146]:
import numpy as np
# Print stats about the authors
print("\n".join("Author {} appears {} times".format(*r) for r in enumerate(np.bincount(classes))))


Author 0 appears 51 times
Author 1 appears 44 times
Author 2 appears 22 times
Author 3 appears 10 times
Author 4 appears 11 times
Author 5 appears 29 times
Author 6 appears 10 times

That's all the setup that is needed for this tutorial.

1) Source Code Author Profiles (SCAP)

SCAP was not the first LNG method to be created, but it is the easiest to explain. In SCAP, documents are profiled by taking the L most frequent n-grams that occur in the document. An author is profiled in a similar way, except the L most frequent n-grams from all known documents from that author are used. Two profiles are compared by finding the number of n-grams each profile has in common, which can be normalised by dividing by L. The more n-grams they have in common, the more similar the profiles are. When trying to determine which author wrote a particular document, the most similar profile. To make this calculation consistent with other methods that will be introduced later, we use a distance instead of similarity. The distance is just 1 - s, where s is the similarity. In this format, the lower the distance, the more likely the author, and the author with the lowest distance to a document is the most likely author of that document.

For two profiles $P_1$ and $P_2$ (which can be either author or document profiles) the distance is given below. An n-gram $x$ is in $P_i$ if it is in the profile, and the intersection of two profiles is the set of n-grams that are in both.

As a similarity: $$s_{SCAP}(P_1, P_2) = \frac{1}{L}|P_1 \cap P_2|$$

As a distance: $$d_{SCAP}(P_1, P_2) = min(0, 1 - s_{SCAP})$$

The class below implements SCAP using the base classes from scikit-learn (imported as sklearn). This will be important later, but for now, the important thing to acknoweldge is the fit and predict functions. The fit function takes a set of training data and creates profiles for each author in that data. The predict function predicts which of those authors is most likely to have written each document. The class also has utility functions for creating profiles and comparing profiles.

The reference for this paper is below:

@inproceedings{frantzeskou2006scap, title = {Effective identification of source code authors using byte-level information}, booktitle = {Proceedings of the 28th international conference on Software engineering}, author = {Frantzeskou, Georgia and Stamatatos, Efstathios and Gritzalis, Stefanos and Katsikas, Sokratis}, year = {2006}, pages = {893–896}, }

In [147]:
from operator import itemgetter
from sklearn.base import BaseEstimator, ClassifierMixin  # For using scikit-learn's test framework. Ignore for now

class SCAP(BaseEstimator, ClassifierMixin):
    def __init__(self, n, L):
        self.n = n
        self.L = L
        self.author_profiles = None  # Will be trained by fit()

    def create_profile(self, documents):
        # Creates a profile of a document or list of documents.
        if isinstance(documents, basestring):
            # documents can be either a list of documents, or a single document.
            # if it's a single document, convert to a list
            documents = [documents,]
        # profile each document independently
        profiles = (count_ngrams(document, self.n, normalise=False)
                    for document in documents)
        # Merge the profiles
        main_profile = defaultdict(float)
        for profile in profiles:
            for ngram in profile:
                main_profile[ngram] += profile[ngram]
        # Normalise the profile
        num_ngrams = sum(main_profile.values())
        for ngram in main_profile:
            main_profile[ngram] /= num_ngrams
        # Return the profile with only the top L n-grams
        return self.top_L(main_profile)

    def top_L(self, profile):
        # Returns the profile with only the top L most frequent n-grams
        if self.L >= len(profile):
            return profile
        threshold = sorted(profile.values())[-self.L]
        return {ngram: profile[ngram]
                for ngram in profile
                if profile[ngram] >= threshold}

    def compare_profiles(self, profile1, profile2):
        # Number of n-grams in both profiles, divided by L
        similarity =  len(set(profile1.keys()) & set(profile2.keys())) / float(self.L)
        # Slight edge case here, similarity could be higher than 1.
        # Just make it equal to 1 if it is over.
        similarity = min(similarity, 1.0)
        distance = 1 - similarity
        return distance

    def fit(self, documents, classes):
        # Fits the current model to the training data provided
        # Separate documents into the sets written by each author
        author_documents = ((author, [documents[i] 
                                      for i in range(len(documents))
                                      if classes[i] == author])
                            for author in set(classes))
        # Profile each of the authors independently
        self.author_profiles = {author:self.create_profile(cur_docs)
                                for author, cur_docs in author_documents}
    
    def predict(self, documents):
        # Predict which of the authors wrote each of the documents
        predictions = np.array([self.predict_single(document) for document in documents])
        return predictions

    def predict_single(self, document):
        # Predicts the author of a single document
        # Profile of current document
        profile = self.create_profile(document)
        # Distance from document to each author
        distances = [(author, self.compare_profiles(profile, self.author_profiles[author]))
                      for author in self.author_profiles]
        # Get the nearest pair, and the author from that pair
        prediction = sorted(distances, key=itemgetter(1))[0][0]
        return prediction

OK, let's see it in action. I'll point out here that this method of testing should not be used in the real world. The fundamental mistake I'll be making is to use training data to test the method, which should never be done in practice. Anyway, let's do it now:


In [148]:
model = SCAP(n=4, L=2000)
# Training the model
model.fit(documents, classes)
# Predict the author of each of the documents
y_pred = model.predict(documents)  # Notice how I am predicting the data I used for training? Don't do that.
# Compute the accuracy of these predictions.
print("Accuracy is {:.1f}%".format(100. * np.mean(classes == y_pred)))


Accuracy is 80.8%

Not a bad result, but as I said, don't use training data for testing. Here is a function that uses scikit-learn to perform the testing, which automatically splits the data into training and testing data before computing the accuracy, through a process called cross fold validation. More information on this is available in the scikit-learn documentation, but I won't go into the details here. In a nutshell, what cross validation does is split the data into two sets, training and testing. It does this multiple times with different splits of the data (depending on the exact method used).


In [149]:
# First, setup the cross fold validation object
from sklearn import cross_validation
cv = cross_validation.KFold(len(documents), n_folds=5, random_state=0)
# Now, compute the score using it
scores = cross_validation.cross_val_score(model, documents, classes, cv=cv)
print("Accuracy is {:.1f}%".format(100. * np.mean(scores)))


Accuracy is 54.2%

A fair bit lower, but not too bad (the chance rate is around 28%). This is a better representation of what we can expect the algorithm to do if applied to real data with these parameters.

2) Common n-grams (CNG)

CNG was the first LNG method proposed in the literature. The method is almost entirely the same as SCAP (should that be the other way around?), except for the method used to compute the distance. This distance computation uses the frequencies of the n-grams, where SCAP simply used the fact they were in the top L n-grams.

$$ d_{CNG}(P_1, P_2) = \displaystyle\sum_{x \in X_{P_1} \cup X_{P_2}} \Biggl(\frac{2\cdot(P_1(x) - P_2(x))}{P_1(x) + P_2(x)}\Biggr)^2 $$

Note that $P_i(x)$ is the frequency of n-gram $x$ in profile $P_i$, which is zero if the n-gram is not in the profile. (This is important, as it is zero if it is not in the top L, even if it does have a frequency in the original document.)

To simplify the coding, we simply derive our new class from the SCAP class defined above. The reference for this method is below.

@inproceedings{kevsel2003cng, title = {N-gram-based author profiles for authorship attribution}, volume = {3}, booktitle = {Proceedings of the Conference Pacific Association for Computational Linguistics, {PACLING}}, author = {Ke{\textbackslash}vselj, Vlado and Peng, Fuchun and Cercone, Nick and Thomas, Calvin}, year = {2003}, pages = {255–264} }

In [150]:
class CNG(SCAP):
    def compare_profiles(self, profile1, profile2):
        ngrams = set(profile1.keys() + profile2.keys())
        d1 = np.array([profile1.get(ng, 0.) for ng in ngrams])
        d2 = np.array([profile2.get(ng, 0.) for ng in ngrams])
        return np.mean(4 * np.square((d1 - d2) / (d1 + d2 + 1e-16)))

Thanks to the fact we inherited from the above SCAP class, testing this method uses almost exactly the same code. The only difference in the code below has been highlighted.


In [151]:
# Setup a new model, this is the only difference
model = CNG(n=4, L=2000)
# First, setup the cross fold validation object
from sklearn import cross_validation
cv = cross_validation.KFold(len(documents), n_folds=5, random_state=0)
# Now, compute the score using it
scores = cross_validation.cross_val_score(model, documents, classes, cv=cv)
print("Accuracy is {:.1f}%".format(100. * np.mean(scores)))


Accuracy is 54.2%

Slightly better than before. One problem here is that the values of the parameters (n and L in this case) make a huge difference in the accuracy of the algorithm. To find good algorithms, we can perform a grid search of different parameters.

Thanks to the format of the classes, we can use scikit-learn's grid search parameter for this purpose. I've greatly restricted the parameter spaces, as this can take a long time to compute.


In [152]:
# This code snippet is a shortened version of the one in the scikit-learn docs, available here:
# http://scikit-learn.org/stable/auto_examples/grid_search_digits.html#example-grid-search-digits-py
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report
parameters = [{'n': [3, 4], 'L':[500, 1000]},]  # These are the parameters to search through.

# The n and L values here don't matter, they will be updated for training.
clf = GridSearchCV(CNG(n=1, L=1), parameters, cv=5, scoring='accuracy')
clf.fit(documents, classes)
print("The best model found has n={} and L={}".format(clf.best_estimator_.n, clf.best_estimator_.L))
print("The scores for the different parameter sets were:")
for params, mean_score, scores in clf.grid_scores_:
    print("{}: {:.3f} (+/-{:.3f})".format(params, mean_score, scores.std() / 2))


The best model found has n=4 and L=500
The scores for the different parameter sets were:
{'L': 500, 'n': 3}: 0.751 (+/-0.054)
{'L': 500, 'n': 4}: 0.814 (+/-0.050)
{'L': 1000, 'n': 3}: 0.768 (+/-0.046)
{'L': 1000, 'n': 4}: 0.808 (+/-0.037)

From this, we can increase the parameter options and do larger searches. I recommend moving to a more distributed system, and having a look at the n_jobs parameter, which can run the search across multiple processes (see the docs for more details).

The returned model is a valid scikit-learn estimator, so you can call the below code, which will use this model. Additionally, you can retrieve the actual best estimator instance by calling clf.best_estimator_. One final word of caution here about the next snippet: I have used the documents list, which was used in training the model, for the evaluation. You shouldn't do that, instead, you should split the data up into a training and evaluation set, where the evaluation set is used only at the end and never in any training or cross validation.


In [153]:
y_true, y_pred = classes, clf.predict(documents)
print(classification_report(y_true, y_pred))


             precision    recall  f1-score   support

          0       0.92      0.92      0.92        51
          1       1.00      0.61      0.76        44
          2       0.86      0.86      0.86        22
          3       0.75      0.90      0.82        10
          4       0.85      1.00      0.92        11
          5       0.63      0.90      0.74        29
          6       0.91      1.00      0.95        10

avg / total       0.87      0.84      0.84       177

We can have a look at the effect that the L parameter has on the accuracy of CNG by plotting it for different n-values. The code below only does n=3, but can be increased to do more values.

This code will take a while to run, depending on your computer, maybe half an hour or more.


In [162]:
from sklearn.cross_validation import KFold

n_values = [3, ]  # Add more values here to test different n-values
L_values = np.arange(500, 3001, 500)  # From 500 to 3000 inclusive, in increments of 500.

# Takes too long on the full set of documents.
# Set sample_size = None if you want to run on the full dataset.
sample_size = None

if sample_size is None:
    sample_documents = documents
    sample_classes = classes
else:
    sample_documents = documents[:sample_size]
    sample_classes = classes[:sample_size]
assert len(sample_documents) == len(sample_classes)
# The number of folds is the size of the smallest class
class_counts = np.bincount(sample_classes)
# class_counts could be zero if class doesn't exist, so remove those values
class_counts = class_counts[np.nonzero(class_counts)]
n_folds = min(class_counts)

plt.figure()
for n in n_values:
    scores = []
    for L in L_values:
        cv = KFold(len(sample_documents), n_folds=n_folds, random_state=0)
        cur_model = CNG(n=n, L=L)
        cur_score = cross_validation.cross_val_score(cur_model, sample_documents, sample_classes, cv=cv)
        cur_score = np.mean(cur_score)
        scores.append(cur_score)
        print("CNG(n={}, L={}) = {:.4f}".format(n, L, cur_score))
    plt.plot(L_values, scores, label='n={}'.format(n))
plt.ylim(0, 1.0)
plt.xticks(L_values)
plt.show()


CNG(n=3, L=500) = 0.6137
CNG(n=3, L=1000) = 0.6359
CNG(n=3, L=1500) = 0.6193
CNG(n=3, L=2000) = 0.6190
CNG(n=3, L=2500) = 0.6252
CNG(n=3, L=3000) = 0.6088

3) Recentred Local Profiles (RLP)

RLP was created by myself, and has a number of differences to the first two algorithms presented. At its core though, it is based on the principles as CNG (and therefore SCAP).

In the original derivation of CNG, the work of Bennet (1976), which used a similar model, with a language default value. This language default is effectively a profile of all documents in that language. Then, rather than compare two profiles, we first normalise the profiles against what is the normal value for that n-gram. Intuitively, while the the n-gram is very common for most authors, it is expected to be, and therefore the presence of this n-gram is hardly surprising. By comparing to the language default, we see that this n-gram is expected and therefore are less surprised. A profile is then computed by first computing the CNG-esque profile of the author, then subtracting (recentering) the values from the corpus default profile. These profiles are then compared in RLP.

RLP does a number of other things differently as well. Rather then setting n-gram frequencies to zero if they are not present in the profile (as in CNG), the actual frequency value is used for comparisons, even if that n-gram is not present. Profiles are compared using the following equation, where $||P_i||_2$ is the Euclidean distance of the profile (i.e. the sum of the squares of each frequency value, and $P_i \cdot P_j$ is the dot product of the n-gram values in each profile, for all n-grams in the profiles' union. Another major change is that the top L n-grams are ranked by absolute, rather than actual highest. This means an n-gram with a value of -0.15 occurs higher than an n-gram with value 0.14.

Two profiles are then compared using the following equation, which is also known as the cosine distance.

$$ d_{RLP}(P_1, P_2) = 1 - \frac{P_1 \cdot P_2}{||P_1||_2 ||P_2||_2} $$

References:

@book{bennett1976scientific, title={Scientific and engineering problem-solving with the computer}, author={Bennett, William Ralph}, year={1976}, publisher={Prentice Hall PTR} } @article{layton2012recentred, title={Recentred local profiles for authorship attribution.}, author={Layton, Robert and Watters, Paul Andrew and Dazeley, Richard}, journal={Natural Language Engineering}, volume={18}, number={3}, pages={293--312}, year={2012}, publisher={Cambridge Univ Press} }

In [155]:
from scipy.spatial import distance


class RLP(SCAP):
    
    def __init__(self, n, L):
        super(RLP, self).__init__(n, L)
        self.topLonly = False  # Don't limit profiles on creation
        self.language_profile = None
    
    def compare_profiles(self, profile1, profile2):
        # All n-grams in the top L of either profile.
        ngrams = set(self.top_L(profile1).keys() + self.top_L(profile1).keys())
        # Profile vector for profile 1
        d1 = np.array([profile1.get(ng, 0.) for ng in ngrams])
        # Profile vector for profile 2
        d2 = np.array([profile2.get(ng, 0.) for ng in ngrams])
        return distance.cosine(d1, d2)
    
    def top_L(self, profile):
        threshold = sorted(map(abs, profile.values()))[-self.L]
        copy = defaultdict(float)
        for key in profile:
            if abs(profile[key]) >= threshold:
                copy[key] = profile[key]
        return copy
    
    def fit(self, documents, classes):
        self.language_profile = self.create_profile(documents)
        super(RLP, self).fit(documents, classes)
        
    def create_profile(self, documents):
        # Creates a profile of a document or list of documents.
        if isinstance(documents, basestring):
            # documents can be either a list of documents, or a single document.
            # if it's a single document, convert to a list
            documents = [documents,]
        # profile each document independently
        profiles = (count_ngrams(document, self.n, normalise=False)
                    for document in documents)
        # Merge the profiles
        main_profile = defaultdict(float)
        for profile in profiles:
            for ngram in profile:
                main_profile[ngram] += profile[ngram]
        # Normalise the profile
        num_ngrams = float(sum(main_profile.values()))
        for ngram in main_profile:
            main_profile[ngram] /= num_ngrams
        if self.language_profile is not None:
            # Recentre profile.
            for key in main_profile:
                main_profile[key] = main_profile.get(key, 0) - self.language_profile.get(key, 0)
        # Note that the profile is returned in full, as exact frequencies are used
        # in comparing profiles (rather than chopped off)
        return main_profile

Now we test with out overfitting code from earlier.


In [156]:
model = RLP(n=4, L=2000)
# Training the model
model.fit(documents, classes)
# Predict the author of each of the documents
y_pred = model.predict(documents)  # Notice how I am predicting the data I used for training? Don't do that.
# Compute the accuracy of these predictions.
print("Accuracy is {:.1f}%".format(100. * np.mean(classes == y_pred)))


Accuracy is 75.1%

In [157]:
# Have a look at some of the values in the language profile
for ngram in ['the ', "'ell", " and", "est ", "ever"]:
    print("{} = {:.6f}".format(repr(ngram), model.language_profile[ngram]))


'the ' = 0.007964
"'ell" = 0.000011
' and' = 0.005158
'est ' = 0.000345
'ever' = 0.000779

In [158]:
# Have a look at one of the author profiles
ap = model.author_profiles[1]
print("Length of profile is {}".format(len(ap))) # Notice how this is much higher than L (which is currently 2000)
for ngram in ['the ', "'ell", " and", "est ", "ever"]:
    print("{} = {:.6f}".format(repr(ngram), ap[ngram]))


Length of profile is 158363
'the ' = -0.001139
"'ell" = 0.000021
' and' = -0.000118
'est ' = 0.000039
'ever' = -0.000027

Notice how the n-gram est has a negative frequency. This means the frequency for this author was 0.000089 lower than the frequency for the language default. RLP uses this information, where it may be lost in other LNG methods. The cost, as you may be discovered if you are running the code as we go, is that RLP is significantly slower than CNG or SCAP. Experiments I've conducted show that RLP is typically higher than CNG (which is in turn typically higher than SCAP) for most corpora.

Next, we do a more appropriate test using cross validation.


In [159]:
# Setup a new model
model = RLP(n=4, L=2000)
# First, setup the cross fold validation object
from sklearn import cross_validation
cv = cross_validation.KFold(len(documents), n_folds=5, random_state=0)
# Now, compute the score using it
scores = cross_validation.cross_val_score(model, documents, classes, cv=cv)
print("Accuracy is {:.1f}%".format(100. * np.mean(scores)))


Accuracy is 50.7%

That's lower than before, but with different models comes different parameters for improving performance. A grid search through some parameter values will give a better score (but will take a while to run).


In [160]:
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report
parameters = [{'n': [3, 4], 'L':[500, 1000]},]  # These are the parameters to search through.

# The n and L values here don't matter, they will be updated for training.
clf = GridSearchCV(RLP(n=1, L=1), parameters, cv=5, scoring='accuracy')
clf.fit(documents, classes)
print("The best model found has n={} and L={}".format(clf.best_estimator_.n, clf.best_estimator_.L))
print("The scores for the different parameter sets were:")
for params, mean_score, scores in clf.grid_scores_:
    print("{}: {:.3f} (+/-{:.3f})".format(params, mean_score, scores.std() / 2))


The best model found has n=3 and L=1000
The scores for the different parameter sets were:
{'L': 500, 'n': 3}: 0.723 (+/-0.038)
{'L': 500, 'n': 4}: 0.695 (+/-0.048)
{'L': 1000, 'n': 3}: 0.734 (+/-0.037)
{'L': 1000, 'n': 4}: 0.701 (+/-0.047)

In [160]:


In [160]: