Chapter 4 - Burrows's Delta in sklearn

Arguably, Burrows's Delta is currently the best known attribution algorithm in stylometry. In this chapter, we will show how it can be easily implemented in sklearn, which supports many interesting tweaks to the original algorithm.

Let us load our dummy corpus again:


In [1]:
import pickle
titles, authors, words, X = pickle.load(open("dummy.p", "rb"))
print(X.shape)


(9, 100)

While Burrows originally did not situate his technique in the field of Machine Learning, Argamon later showed that the Delta measure can be described as a 'nearest neighbour' algorithm. Nearest neighbour classification (or 'kNN') refers to a very simple and intuitive family of classification procedures. The training stage is extremely light: in many cases, the classifier - which could for instance be an attribution algorithm - simply stores the training examples in memory, without significantly intervening in this data. This is why this family of algorithms is also known as 'memory-based learning': at prediction time, the classifier relies predominantly on the examples it has stored in memory. In sklearn, such a training phase is extremely easy to set up:


In [2]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X, authors)


Out[2]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_neighbors=5, p=2, weights='uniform')

Note that we are dealing here with supervised classification algorithms, so fit() also takes a vector of labels as input, which knn will store in its memory alongside the actual vectors. After fitting the classifier, we can apply it to new data, such as anonymous texts. The predict() method, which serves this purpose, only takes the new, unseen text vectors as input - and not the labels, since it is exactly its task to predict these. Let us apply this predict method to our training data:


In [3]:
predictions = knn.predict(X)
print(predictions)


['Austen' 'Austen' 'Austen' 'Dickens' 'Dickens' 'Dickens' 'Dickens'
 'Thackeray' 'Thackeray']

The output above shows that predict() returns a label for each vector it was fed. In this case, we tested our classifier on the same data it was trained on, so it should come as no surprise that it gets most of the attributions right. Using the metrics module, we can easily calculate the attribution accuracy which the algorithm obtains here:


In [4]:
from sklearn.metrics import accuracy_score
accuracy_score(predictions, authors)


Out[4]:
0.88888888888888884

As we can see, the algorithm gets one of the nine attributions wrong. But how exactly does the classification stage work in kNN? When confronted with a new test vector, the classifier scans its memory for a 'nearest neighbour': the training text that looks most like the new test item. It determines this nearest neighbour using a distance calculation, like the one we implemented ourselves with SciPy in a previous chapter. Next, it extrapolates from this nearest neighbour and attaches its authorship label to the unseen test instance. The distance measure is therefore crucial to the workings of a robust kNN classifier; it can be set in the constructor:


In [5]:
knn = KNeighborsClassifier(metric='cityblock',
                           n_neighbors=1,
                           algorithm='brute')

Here, we also set two other parameters, to make our implementation exactly parallel to Burrows's original Delta: we only look at an item's single nearest neighbour (instead of a larger 'neighbourhood'), and we 'brutely' calculate the distance of each test item to all training items in memory. This brute-force approach is less feasible for larger data sets, but is fine for the smaller corpora we work with.
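
To see what this brute-force, single-neighbour classification does under the hood, here is a minimal sketch that mimics it with SciPy's distance functions (assuming, as in our dummy corpus, that X is a dense numpy array; nn_predict is just an illustrative helper, not part of sklearn):

from scipy.spatial.distance import cdist

def nn_predict(X_train, y_train, X_test):
    # compute the cityblock distance from every test vector to every training vector
    dists = cdist(X_test, X_train, metric='cityblock')  # shape: (n_test, n_train)
    # for each test vector, find the closest training text ...
    nearest = dists.argmin(axis=1)
    # ... and copy over its author label
    return [y_train[i] for i in nearest]

print(nn_predict(X, authors, X))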

Burrows originally used the Manhattan or 'cityblock' distance, which we already came across a couple of times in this course. Importantly, however, Argamon showed that Burrows effectively scaled the relative frequencies of words in texts by the standard deviation of each word's frequency in the entire corpus. Thus, each simple relative frequency in our table needs to be replaced by the ratio of that frequency over the word's standard deviation in the corpus. This scaling captures the following intuition: in authorship attribution, we are especially interested in high-frequency function words - as you will all know. Delta therefore tunes down the weight of words that show a relatively large standard deviation across the corpus - words which are probably not function words. Note that this operation is fully unsupervised: we do not need access to the author labels in our training data to scale our matrices in this way. Again, this is fairly easy to achieve in sklearn, via the StandardScaler. We first import this scaler and fit it on our data:


In [6]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(with_mean=False)
scaler.fit(X)


Out[6]:
StandardScaler(copy=True, with_mean=False, with_std=True)

During the fitting stage, the StandardScaler has calculated the per-column (i.e. per-word) standard deviations, which we can access as follows (mind the trailing underscore; in more recent versions of sklearn this attribute is called scale_):


In [7]:
stds = scaler.std_
print(stds.shape)


(100,)

Indeed, we have obtained the standard deviation of the frequencies for each item in our 100-word vocabulary. Scaling the matrix, as Burrows proposed, is now a simple matter of applying the scaler:


In [8]:
X_scaled = scaler.transform(X)
print(X_scaled.shape)


(9, 100)
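
Before moving on, we can verify by hand what the scaler has done: a minimal sketch with plain numpy (assuming X is a dense array and that none of the 100 word columns has a zero standard deviation) should reproduce the transformed matrix exactly:

import numpy as np

# divide each word column by its (population) standard deviation;
# this is what StandardScaler(with_mean=False) does under the hood
X_manual = X / X.std(axis=0)
print(np.allclose(X_manual, X_scaled))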

Note that the shape of our data is not affected by this operation. We can now re-run our attribution algorithm:


In [9]:
knn = KNeighborsClassifier(metric='cityblock',
                           n_neighbors=1,
                           algorithm='brute')
knn.fit(X_scaled, authors)
preds = knn.predict(X_scaled)
print(preds)
print('Accuracy: ', accuracy_score(preds, authors))


['Austen' 'Austen' 'Austen' 'Dickens' 'Dickens' 'Dickens' 'Thackeray'
 'Thackeray' 'Thackeray']
Accuracy:  1.0

Interestingly, the scaling seems to have helped, since now all test instances are correctly predicted, including the Thackeray book which was previously misattributed.

Evaluation

Until now, we have been making one major error: we have been testing our classifier on the very data it was trained on. This is a bad mistake: to evaluate the performance of a classifier, we should really test it on out-of-training data and see how well it performs on such previously unseen material. A number of evaluation strategies are available in this respect. One interesting technique is 'leave-one-out' validation: in this setup, we loop over the available training data and each time set aside one of the original training samples, pretending we have never seen it. We then fit our classifier on the remaining instances and apply it to the held-out item. After doing this for all instances, we get an idea of how well our classifier would do if it were applied to new, unseen documents. Sklearn ships with a lot of useful functionality for such evaluation procedures, and leave-one-out validation is one of them:


In [11]:
from sklearn.cross_validation import LeaveOneOut

In [12]:
nb = X_scaled.shape[0]
loo = LeaveOneOut(nb)

As we loop over the newly created loo object, we see that it returns a different train/test split in each iteration, each time holding out a single item as the test set:


In [12]:
for train, test in loo:
    print("%s %s" % (train, test))


[1 2 3 4 5 6 7 8] [0]
[0 2 3 4 5 6 7 8] [1]
[0 1 3 4 5 6 7 8] [2]
[0 1 2 4 5 6 7 8] [3]
[0 1 2 3 5 6 7 8] [4]
[0 1 2 3 4 6 7 8] [5]
[0 1 2 3 4 5 7 8] [6]
[0 1 2 3 4 5 6 8] [7]
[0 1 2 3 4 5 6 7] [8]

In [13]:
knn = KNeighborsClassifier(metric='cityblock',
                           n_neighbors=1,
                           algorithm='brute')
preds = []
for train, test in loo:
    # fit on all texts except the held-out one, then predict its author label
    X_train, X_test = X[train], X[test]   # note: the unscaled frequency matrix
    y_train = [authors[i] for i in train]
    knn.fit(X_train, y_train)
    pred = knn.predict(X_test)
    preds.append(pred[0])

print('Accuracy after LOO:', accuracy_score(preds, authors))


Accuracy after LOO: 1.0
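
Note that the loop above uses the unscaled frequency matrix X. If we want to include Burrows's scaling in the evaluation, the scaler should be fitted on the training partition of each fold only, so that no information about the held-out text leaks into the scaling. In more recent versions of sklearn (where LeaveOneOut lives in sklearn.model_selection and no longer takes the number of items as an argument), this is conveniently handled by a Pipeline; the following is a sketch under that assumption:

from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# chain the scaler and the classifier: in each leave-one-out fold, the scaler
# is fitted on the training texts only and then applied to the held-out text
delta = make_pipeline(StandardScaler(with_mean=False),
                      KNeighborsClassifier(metric='cityblock',
                                           n_neighbors=1,
                                           algorithm='brute'))
scores = cross_val_score(delta, X, authors, cv=LeaveOneOut())
print('Accuracy after LOO:', scores.mean())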

Exercise

Check out the online documentation for Support Vector Machine classification (SVC) and Naive Bayes, two other highly popular classification methods in authorship attribution. Can you also run them on our dummy data set in a leave-one-out setup? Do they perform as well?
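
As a starting point, the relevant classes can be imported as follows (a sketch only; both classifiers expose the same fit()/predict() interface as KNeighborsClassifier, so they can be dropped into the leave-one-out loop above without further changes):

from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB

# a linear-kernel support vector machine and a multinomial Naive Bayes classifier
svm_clf = SVC(kernel='linear')
nb_clf = MultinomialNB()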