In this notebook, we offer a quick tutorial on how you can use the code in this repository. While the package is very much geared towards our own work in authorship verification, you might find some of the more general functions useful. All feedback and comments are welcome. This code assumes Python 2.7+ (Python 3 has not been tested). You do not need to install the library to run the code below, but please note that it depends on a number of well-known third-party Python libraries, including:
and preferably (for GPU acceleration and/or JIT-compilation):
We recommend installing Continuum's excellent Anaconda Python distribution, which comes bundled with most of these dependencies.
By default, we assume that your data sets are stored in a directory following the format of the PAN 2014 track on authorship verification: a directory should minimally include one folder per verification problem (containing an unknown.txt and at least one known01.txt), as well as a truth.txt. E.g. for the corpus of Dutch essays (../data/2014/du_essays/train), truth.txt contains a tab-separated line with the ground truth for each problem (a minimal parsing sketch follows the excerpt below):
DE001 Y
DE002 Y
DE003 N
DE004 N
DE005 N
DE006 N
DE007 N
DE008 Y
...
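Such a truth.txt file is easy to parse with plain Python, even without the package's utilities. The sketch below is purely illustrative; later in this notebook we rely on ruzicka's own load_ground_truth() instead:
# Illustrative sketch only: parse a PAN-style truth.txt into a {problem: answer} dict.
# The notebook itself uses ruzicka's load_ground_truth() for this.
def parse_truth(filepath):
    truth = {}
    with open(filepath) as f:
        for line in f:
            line = line.strip()
            if line:
                problem_id, answer = line.split()
                truth[problem_id] = answer  # 'Y' or 'N'
    return truth
print(parse_truth('../data/2014/du_essays/train/truth.txt'))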
To inspect the problems:
In [1]:
ls ../data/2014/du_essays/train
Let us now load the set of development problems for the Dutch essays:
In [2]:
import os  # used further down to build the path to truth.txt
from ruzicka.utilities import *
D = '../data/2014/du_essays/'
dev_train_data, dev_test_data = load_pan_dataset(D+'train')
This function loads all documents and splits the development data into a training part (the known documents) and a test part (the unknown documents). We can unpack these as follows:
In [3]:
dev_train_labels, dev_train_documents = zip(*dev_train_data)
dev_test_labels, dev_test_documents = zip(*dev_test_data)
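For orientation, here is a quick peek at what we just unpacked; we assume, as the unpacking above suggests, that each entry in the loaded data is an (author_label, document_text) pair:
# Quick sanity check on the assumed (author_label, document_text) structure:
for label, text in dev_train_data[:3]:
    print(label + ' -> ' + text[:50])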
Let us have a look at the actual test texts:
In [4]:
from __future__ import print_function
for doc in dev_test_documents[:10]:
    print('+ ', doc[:70])
For each of these documents, we need to decide whether or not it was in fact written by the proposed target author:
In [5]:
for doc in dev_test_labels[:10]:
    print('+ ', doc[:70])
The first and crucial step is to vectorize the documents using a vector space model. Below, we use a generic example, based on the 10,000 most frequent word unigrams and a plain tf model:
In [6]:
from ruzicka.vectorization import Vectorizer
vectorizer = Vectorizer(mfi = 10000,
vector_space = 'tf',
ngram_type = 'word',
ngram_size = 1)
dev_train_X = vectorizer.fit_transform(dev_train_documents).toarray()
dev_test_X = vectorizer.transform(dev_test_documents).toarray()
Note that we use sklearn conventions here: we fit the vectorizer only on the vocabulary of the known documents and apply it later to the unknown documents (since in real life, too, we will not necessarily have access to the unknown documents in advance). This gives us two compatible corpus matrices:
In [7]:
print(dev_train_X.shape)
print(dev_test_X.shape)
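Since we follow sklearn conventions anyway, a roughly comparable representation could also be built with scikit-learn's own CountVectorizer. The sketch below is only a point of reference: ruzicka's Vectorizer may differ in tokenization and preprocessing details.
from sklearn.feature_extraction.text import CountVectorizer
# Rough scikit-learn analogue (illustrative only): plain term frequencies over
# the 10,000 most frequent word unigrams.
sk_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1), max_features=10000)
sk_train_X = sk_vectorizer.fit_transform(dev_train_documents).toarray()
sk_test_X = sk_vectorizer.transform(dev_test_documents).toarray()
print(sk_train_X.shape)
print(sk_test_X.shape)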
We now encode the author labels in the development problem sets as integers, using sklearn's convenient LabelEncoder:
In [8]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
label_encoder.fit(dev_train_labels + dev_test_labels)
dev_train_y = label_encoder.transform(dev_train_labels)
dev_test_y = label_encoder.transform(dev_test_labels)
print(dev_test_y)
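As a toy illustration of what LabelEncoder does (to our understanding, it sorts the distinct labels and maps them to the integers 0 through n-1):
# Toy example (not part of the verification pipeline itself):
toy_encoder = LabelEncoder()
print(toy_encoder.fit_transform(['DE003', 'DE001', 'DE003', 'DE002']))
# expected output: [2 0 2 1]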
We now construct and fit an 'O2' verifier: this extrinsic verification technique is based on the General Imposters (GI) framework. We apply it with the minmax metric and a profile base, meaning that the known documents for each author will be represented as a mean centroid:
In [9]:
from ruzicka.Order2Verifier import Order2Verifier
dev_verifier = Order2Verifier(metric = 'minmax',
base = 'profile',
nb_bootstrap_iter=100,
rnd_prop = 0.5)
dev_verifier.fit(dev_train_X, dev_train_y)
We can now obtain the probability which this O2 verifier would assign to each combination of an unknown document and the target author suggested in the problem:
In [10]:
dev_test_scores = dev_verifier.predict_proba(test_X = dev_test_X,
test_y = dev_test_y,
nb_imposters = 30)
This gives us an array of probability scores, one for each problem, corresponding to the fraction of bootstrap iterations in which the target author's profile was closer to the anonymous document than any of the imposters:
In [11]:
print(dev_test_scores)
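To make the intuition behind these scores more concrete, here is a deliberately simplified sketch of a General Imposters-style bootstrap in plain numpy. It is not Order2Verifier's actual implementation and glosses over many details, but it illustrates the core idea: on random feature subsets, count how often the target author's profile beats the imposters.
import numpy as np

def minmax_sim(a, b):
    # Ruzicka ('minmax') similarity between two non-negative vectors.
    return np.sum(np.minimum(a, b)) / np.sum(np.maximum(a, b))

def gi_score(unknown, target_centroid, imposters, nb_iter=100, rnd_prop=0.5, seed=1):
    # Simplified sketch: the fraction of bootstrap iterations in which the
    # target author's centroid is more similar to the unknown document than
    # every imposter, each time using a random subset of the features.
    rnd = np.random.RandomState(seed)
    nb_feats = unknown.shape[0]
    hits = 0
    for _ in range(nb_iter):
        feats = rnd.choice(nb_feats, int(rnd_prop * nb_feats), replace=False)
        target_sim = minmax_sim(unknown[feats], target_centroid[feats])
        imposter_sims = [minmax_sim(unknown[feats], imp[feats]) for imp in imposters]
        if target_sim >= max(imposter_sims):
            hits += 1
    return hits / float(nb_iter)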
Let us now load the ground truth to check how well we did:
In [12]:
dev_gt_scores = load_ground_truth(
filepath=os.sep.join((D, 'train', 'truth.txt')),
labels=dev_test_labels)
print(dev_gt_scores)
There is one final step needed: the PAN evaluation measures allow systems to leave a number of difficult problems unanswered, by setting the probability exactly at 0.5. To exploit this option, we fit a score shifter, which will attempt to rectify mid-range scores to exactly 0.5. We can tune its parameters as follows:
In [13]:
from ruzicka.score_shifting import ScoreShifter
shifter = ScoreShifter()
shifter.fit(predicted_scores=dev_test_scores,
ground_truth_scores=dev_gt_scores)
dev_test_scores = shifter.transform(dev_test_scores)
As you can see, this shifter optimizes two parameters (p1 and p2) using a grid search: all values in between p1 and p2 will be rectified to 0.5:
In [14]:
print(dev_test_scores)
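A hypothetical sketch of the rectification step described above (illustrative only; the p1/p2 names follow the prose, and the actual ScoreShifter also rescales scores rather than merely clamping the mid-range, as noted later for the O1 verifier):
def rectify(scores, p1, p2):
    # Hypothetical illustration of the rectification idea: any score falling
    # between p1 and p2 is set to exactly 0.5, i.e. the problem is left unanswered.
    return [0.5 if p1 <= s <= p2 else s for s in scores]

print(rectify([0.12, 0.47, 0.53, 0.91], p1=0.4, p2=0.6))
# expected output: [0.12, 0.5, 0.5, 0.91]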
We can later apply this optimized score shifter to the test problems. Now the main question: how well would our O2 verifier perform on the development problems, given the optimal p1 and p2 found? We answer this question using the three evaluation measures used in the PAN competition.
In [15]:
from ruzicka.evaluation import pan_metrics
dev_acc_score, dev_auc_score, dev_c_at_1_score = \
pan_metrics(prediction_scores=dev_test_scores,
ground_truth_scores=dev_gt_scores)
print('Accuracy: ', dev_acc_score)
print('AUC: ', dev_auc_score)
print('c@1: ', dev_c_at_1_score)
print('AUC x c@1: ', dev_auc_score * dev_c_at_1_score)
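For reference, here is a compact sketch of how c@1 can be computed, following Peñas and Rodrigo's definition as used at PAN, where problems scored at exactly 0.5 count as unanswered. This is our own illustration, not necessarily the exact code behind ruzicka.evaluation:
def c_at_1(prediction_scores, ground_truth_scores):
    # c@1 = (nc + nu * nc / n) / n, where nc is the number of correctly
    # answered problems and nu the number left unanswered (score == 0.5).
    n = len(prediction_scores)
    nc = sum(1 for p, t in zip(prediction_scores, ground_truth_scores)
             if p != 0.5 and (p > 0.5) == (t > 0.5))
    nu = sum(1 for p in prediction_scores if p == 0.5)
    return (nc + nu * nc / float(n)) / float(n)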
Our score shifting approach clearly pays off, since we are able to leave difficult problems unanswered, yielding a higher c@1 than the plain accuracy. We can now proceed to the test problems. The following code block runs entirely parallel to the approach above; the only difference is that the score shifter is not retrained:
In [16]:
train_data, test_data = load_pan_dataset(D+'test')
train_labels, train_documents = zip(*train_data)
test_labels, test_documents = zip(*test_data)
# vectorize:
vectorizer = Vectorizer(mfi = 10000,
vector_space = 'tf',
ngram_type = 'word',
ngram_size = 1)
train_X = vectorizer.fit_transform(train_documents).toarray()
test_X = vectorizer.transform(test_documents).toarray()
# encode author labels:
label_encoder = LabelEncoder()
label_encoder.fit(train_labels+test_labels)
train_y = label_encoder.transform(train_labels)
test_y = label_encoder.transform(test_labels)
# fit and predict a verifier on the test data:
test_verifier = Order2Verifier(metric = 'minmax',
base = 'profile',
nb_bootstrap_iter=100,
rnd_prop = 0.5)
test_verifier.fit(train_X, train_y)
test_scores = test_verifier.predict_proba(test_X=test_X,
test_y=test_y,
nb_imposters=30)
# load the ground truth:
test_gt_scores = load_ground_truth(
filepath=os.sep.join((D, 'test', 'truth.txt')),
labels=test_labels)
# apply the optimized score shifter:
test_scores = shifter.transform(test_scores)
test_acc_score, test_auc_score, test_c_at_1_score = \
pan_metrics(prediction_scores=test_scores,
ground_truth_scores=test_gt_scores)
print('Accuracy: ', test_acc_score)
print('AUC: ', test_auc_score)
print('c@1: ', test_c_at_1_score)
print('AUC x c@1: ', test_auc_score * test_c_at_1_score)
While our final test results are a bit lower, the verifier seems to scale reasonably well to the unseen verification problems in the test set.
It is interesting to compare the GI approach to a first-order verification system, which often yields very competitive results too. Our implementation closely resembles the system proposed by Potha and Stamatatos in 2014 ('A Profile-based Method for Authorship Verification'). We import and fit this O1 verifier:
In [18]:
from ruzicka.Order1Verifier import Order1Verifier
dev_verifier = Order1Verifier(metric = 'minmax',
base = 'profile')
dev_verifier.fit(dev_train_X, dev_train_y)
dev_test_scores = dev_verifier.predict_proba(test_X = dev_test_X,
test_y = dev_test_y)
print(dev_test_scores)
Note that in this case, the 'probabilities' returned are only distance-based pseudo-probabilities and do not necessarily lie in the 0-1 range. Applying the score shifter is therefore essential for O1, since it will scale the distances to a more useful range:
In [87]:
shifter = ScoreShifter()
shifter.fit(predicted_scores=dev_test_scores,
ground_truth_scores=dev_gt_scores)
dev_test_scores = shifter.transform(dev_test_scores)
print(dev_test_scores)
And again, we are now ready to test the performance of O1 on the test problems.
In [20]:
train_data, test_data = load_pan_dataset(D+'test')
train_labels, train_documents = zip(*train_data)
test_labels, test_documents = zip(*test_data)
# vectorize:
vectorizer = Vectorizer(mfi = 10000,
vector_space = 'tf',
ngram_type = 'word',
ngram_size = 1)
train_X = vectorizer.fit_transform(train_documents).toarray()
test_X = vectorizer.transform(test_documents).toarray()
# encode author labels:
label_encoder = LabelEncoder()
label_encoder.fit(train_labels+test_labels)
train_y = label_encoder.transform(train_labels)
test_y = label_encoder.transform(test_labels)
# fit and predict a verifier on the test data:
test_verifier = Order1Verifier(metric = 'minmax',
base = 'profile')
test_verifier.fit(train_X, train_y)
test_scores = test_verifier.predict_proba(test_X=test_X,
test_y=test_y)
# load the ground truth:
test_gt_scores = load_ground_truth(
filepath=os.sep.join((D, 'test', 'truth.txt')),
labels=test_labels)
# apply the optimized score shifter:
test_scores = shifter.transform(test_scores)
test_acc_score, test_auc_score, test_c_at_1_score = \
pan_metrics(prediction_scores=test_scores,
ground_truth_scores=test_gt_scores)
print('Accuracy: ', test_acc_score)
print('AUC: ', test_auc_score)
print('c@1: ', test_c_at_1_score)
print('AUC x c@1: ', test_auc_score * test_c_at_1_score)
Interestingly, O1 maintains a healthy AUC, but its accuracy and c@1 are disappointing here. This is, by the way, certainly not true for all data sets: as we show in the paper, O1 produces relatively high scores on other corpora.