Ružička: Authorship Verification in Python

In this notebook, we offer a quick tutorial on how you could use the code in this repository. While the package is very much geared towards our own work in authorship verification, you might find some of the more general functions useful. All feedback and comments are welcome. This code assumes Python 2.7+ (Python 3 has not been tested). You do not need to install the library to run the code below, but please note that it depends on a number of well-known third-party Python libraries, including:

  • numpy
  • scipy
  • scikit-learn
  • matplotlib
  • seaborn
  • numba

and preferably (for GPU acceleration and/or JIT-compilation):

  • theano
  • numbapro

We recommend installing Continuum's excellent Anaconda Python distribution, which comes bundled with most of these dependencies.
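
If you want to verify that these dependencies are available in your environment before proceeding, a quick, optional import check might look as follows; this is just a convenience sketch and not part of the package:

import importlib

# try to import the core third-party dependencies one by one
for name in ('numpy', 'scipy', 'sklearn', 'matplotlib', 'seaborn', 'numba'):
    try:
        importlib.import_module(name)
        print('+ found ' + name)
    except ImportError:
        print('- missing ' + name)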

Walkthrough

By default, we assume that your data sets are stored in a directory following the format of the PAN 2014 track on authorship attribution: the directory should minimally include one folder per verification problem (each containing an unknown.txt and at least one known document, e.g. known01.txt) and a truth.txt. E.g. for the corpus of Dutch essays (../data/2014/du_essays/train), truth.txt contains a tab-separated line with the ground truth for each problem:

DE001 Y
DE002 Y
DE003 N
DE004 N
DE005 N
DE006 N
DE007 N
DE008 Y
...
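
Such a truth file is straightforward to parse yourself if needed. Below is a minimal sketch (not the package's own loader, which we use later via load_ground_truth) that maps each problem id to 1.0 (same author) or 0.0 (different author):

def read_truth(path):
    truth = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            problem_id, answer = line.split()  # tab-separated, e.g. 'DE001<TAB>Y'
            truth[problem_id] = 1.0 if answer == 'Y' else 0.0
    return truth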

To inspect the problems:


In [1]:
ls ../data/2014/du_essays/train


DE001/         DE014/         DE027/         DE040/         DE053/         DE066/         DE079/         DE092/
DE002/         DE015/         DE028/         DE041/         DE054/         DE067/         DE080/         DE093/
DE003/         DE016/         DE029/         DE042/         DE055/         DE068/         DE081/         DE094/
DE004/         DE017/         DE030/         DE043/         DE056/         DE069/         DE082/         DE095/
DE005/         DE018/         DE031/         DE044/         DE057/         DE070/         DE083/         DE096/
DE006/         DE019/         DE032/         DE045/         DE058/         DE071/         DE084/         contents.json*
DE007/         DE020/         DE033/         DE046/         DE059/         DE072/         DE085/         truth.json*
DE008/         DE021/         DE034/         DE047/         DE060/         DE073/         DE086/         truth.txt*
DE009/         DE022/         DE035/         DE048/         DE061/         DE074/         DE087/
DE010/         DE023/         DE036/         DE049/         DE062/         DE075/         DE088/
DE011/         DE024/         DE037/         DE050/         DE063/         DE076/         DE089/
DE012/         DE025/         DE038/         DE051/         DE064/         DE077/         DE090/
DE013/         DE026/         DE039/         DE052/         DE065/         DE078/         DE091/

Let us now load the set of development problems for the Dutch essays:


In [2]:
from ruzicka.utilities import *
import os  # used below when building paths to the truth files
D = '../data/2014/du_essays/'
dev_train_data, dev_test_data = load_pan_dataset(D+'train')

This function loads all documents and splits the development data into a training part (the known documents) and a test part (the unknown documents). We can unpack these as follows:


In [3]:
dev_train_labels, dev_train_documents = zip(*dev_train_data)
dev_test_labels, dev_test_documents = zip(*dev_test_data)
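
To make the expected structure explicit, here is a rough, hypothetical sketch of how the PAN directory layout described above could be turned into such (label, document) pairs; the actual load_pan_dataset may differ in its details:

import codecs, glob, os

def sketch_load(directory):
    known, unknown = [], []
    for problem_dir in sorted(glob.glob(os.path.join(directory, '*'))):
        if not os.path.isdir(problem_dir):
            continue  # skip truth.txt, contents.json, ...
        label = os.path.basename(problem_dir)  # e.g. 'DE001'
        for fn in sorted(glob.glob(os.path.join(problem_dir, '*.txt'))):
            text = codecs.open(fn, encoding='utf-8').read()
            if os.path.basename(fn).startswith('unknown'):
                unknown.append((label, text))
            else:
                known.append((label, text))  # known01.txt, known02.txt, ...
    return known, unknown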

Let us have a look at the actual test texts:


In [4]:
from __future__ import print_function
for doc in dev_test_documents[:10]:
    print('+ ', doc[:70])


+  Dankzij het internet zijn we een grote bron aan informatie rijker .
+  Het is dus begrijpelijk dat de commerciële zenders meer reclame mo
+  " Hey , vuile nicht ! Hangt er nog stront aan je lul ? " . Dergelij
+  Gelijkheid tussen man en vrouw is iets dat ons al eeuwen in de ban 
+  Gisteren was er opnieuw een protest tegen homofilie in de grootstad
+  Voetbal is vandaag de dag zonder twijfel de populairste sport in Be
+  Door de ongekende groei van nieuwsbronnen en de opkomst van het int
+  Woordenboekgebruik uit interesse De categorie woordenboekgebruikers
+  Ze bouwden een tegencultuur op die alles verwierp waar hun ouders a
+  Als we hier in België op straat rondlopen , merken we dat er zeer 

For each of these documents, we need to decide whether or not it was in fact written by the proposed target author:


In [5]:
for doc in dev_test_labels[:10]:
    print('+ ', doc[:70])


+  DE001
+  DE002
+  DE003
+  DE004
+  DE005
+  DE006
+  DE007
+  DE008
+  DE009
+  DE010

The first and crucial step is to vectorize the documents using a vector space model. Below, we use a generic example, based on the 10,000 most frequent word unigrams and a plain tf (term frequency) model:


In [6]:
from ruzicka.vectorization import Vectorizer
vectorizer = Vectorizer(mfi = 10000,
                        vector_space = 'tf',
                        ngram_type = 'word',
                        ngram_size = 1)

dev_train_X = vectorizer.fit_transform(dev_train_documents).toarray()
dev_test_X = vectorizer.transform(dev_test_documents).toarray()

Note that we follow sklearn conventions here: we fit the vectorizer only on the vocabulary of the known documents and apply it later to the unknown documents (since in real life, we will not necessarily know the unknown documents in advance). This gives us two compatible corpus matrices:


In [7]:
print(dev_train_X.shape)
print(dev_test_X.shape)


(172, 9977)
(96, 9977)
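
For readers more familiar with sklearn's own text tools: the setup above is roughly comparable to a CountVectorizer with max_features=10000 on word unigrams. The sketch below (with the hypothetical names cv, alt_train_X, alt_test_X) is for orientation only; the ruzicka Vectorizer differs in tokenization and feature-selection details, and its 'tf' space need not be raw counts:

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=10000,  # roughly mfi
                     analyzer='word',     # roughly ngram_type
                     ngram_range=(1, 1))  # roughly ngram_size
alt_train_X = cv.fit_transform(dev_train_documents).toarray()
alt_test_X = cv.transform(dev_test_documents).toarray()
print(alt_train_X.shape)
print(alt_test_X.shape)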

We now encode the author labels in the development problem sets as integers, using sklearn's convenient LabelEncoder:


In [8]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
label_encoder.fit(dev_train_labels + dev_test_labels)
dev_train_y = label_encoder.transform(dev_train_labels)
dev_test_y = label_encoder.transform(dev_test_labels)
print(dev_test_y)


[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95]

We now construct and fit an 'O2' verifier: this extrinsic verification technique is based on the General Imposters framework. We apply it with the minmax metric and a profile base, meaning that the known documents for each author will be represented as a mean centroid:


In [9]:
from ruzicka.Order2Verifier import Order2Verifier
dev_verifier = Order2Verifier(metric = 'minmax',
                              base = 'profile',
                              nb_bootstrap_iter=100,
                              rnd_prop = 0.5)
dev_verifier.fit(dev_train_X, dev_train_y)
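
The minmax metric mentioned above is the Ružička similarity that gives the package its name: the sum of component-wise minima divided by the sum of component-wise maxima. Below is a minimal numpy sketch of the metric and of a 'profile' (mean centroid) comparison; this is for illustration, not the package's optimized implementation:

import numpy as np

def minmax_sim(a, b):
    # Ruzicka / minmax similarity between two non-negative vectors
    a = np.asarray(a, dtype='float64')
    b = np.asarray(b, dtype='float64')
    return np.minimum(a, b).sum() / np.maximum(a, b).sum()

# e.g. the first unknown document vs. the profile of its target author:
profile = dev_train_X[dev_train_y == dev_test_y[0]].mean(axis=0)
print(minmax_sim(dev_test_X[0], profile))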

We can now obtain the probability which this O2 verifier assigns to each combination of an unknown document and the target author suggested in the problem:


In [10]:
dev_test_scores = dev_verifier.predict_proba(test_X = dev_test_X,
                                             test_y = dev_test_y,
                                             nb_imposters = 30)


ruzicka/Order2Verifier.py:191: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if rnd_feature_idxs == 'all': # use entire feature space
ruzicka/Order2Verifier.py:252: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if rnd_feature_idxs == 'all':
	 - # test documents processed: 10 out of 96
	 - # test documents processed: 20 out of 96
	 - # test documents processed: 30 out of 96
	 - # test documents processed: 40 out of 96
	 - # test documents processed: 50 out of 96
	 - # test documents processed: 60 out of 96
	 - # test documents processed: 70 out of 96
	 - # test documents processed: 80 out of 96
	 - # test documents processed: 90 out of 96

This gives us an array of probability scores, one for each problem, corresponding to the fraction of bootstrap iterations in which the target author's profile was closer to the anonymous document than any of the imposters:


In [11]:
print(dev_test_scores)


[ 0.69        0.61000001  0.          0.          0.08        0.07        0.
  1.          1.          0.75999999  0.49000001  0.31        0.94        0.94
  0.01        0.38999999  0.54000002  0.          0.03        0.36000001
  0.          0.          0.56        0.38999999  0.          0.81999999
  0.          0.52999997  0.04        0.          0.          0.01
  0.25999999  0.          0.02        0.18000001  0.          0.07        0.09
  0.          0.23        0.70999998  0.02        0.77999997  1.          0.
  0.38        0.01        0.          0.23999999  0.01        0.40000001
  0.03        0.38        0.72000003  0.          0.02        0.76999998
  0.02        0.83999997  0.98000002  0.64999998  0.97000003  0.50999999
  0.68000001  0.89999998  0.41999999  0.16        0.56        0.87
  0.34999999  0.01        0.02        0.50999999  0.07        0.12
  0.20999999  0.          0.99000001  0.          0.88        0.38        0.
  0.          1.          0.          1.          0.76999998  0.01        0.
  0.          0.63        0.          0.          0.46000001  0.56      ]
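
Conceptually, each of these scores comes out of a bootstrap loop along the following lines. This is a simplified sketch of the General Imposters idea with a hypothetical helper (reusing minmax_sim from the sketch above); the actual Order2Verifier differs in its sampling and bookkeeping details:

import numpy as np

def gi_score(unknown_vec, target_profile, imposter_profiles,
             nb_iter=100, rnd_prop=0.5):
    # fraction of iterations in which the target profile beats all imposters
    # on a random subset of the features
    rng = np.random.RandomState(1066)
    nb_feats = unknown_vec.shape[0]
    hits = 0
    for _ in range(nb_iter):
        idx = rng.choice(nb_feats, int(rnd_prop * nb_feats), replace=False)
        target_sim = minmax_sim(unknown_vec[idx], target_profile[idx])
        imposter_sims = [minmax_sim(unknown_vec[idx], imp[idx])
                         for imp in imposter_profiles]
        if target_sim > max(imposter_sims):
            hits += 1
    return hits / float(nb_iter)

Here, imposter_profiles would be the centroids of other (distractor) authors; the nb_imposters argument used above caps how many of them are considered per problem.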

Let us now load the ground truth to check how well we did:


In [12]:
dev_gt_scores = load_ground_truth(
                    filepath=os.sep.join((D, 'train', 'truth.txt')),
                    labels=dev_test_labels)
print(dev_gt_scores)


[1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]

There is one final step needed: the PAN evaluation measures allow systems to leave a number of difficult problems unanswered, by setting the probability at exactly 0.5. To exploit this option, we fit a score shifter, which attempts to rectify mid-range scores to 0.5. We can tune its parameters as follows:


In [13]:
from ruzicka.score_shifting import ScoreShifter
shifter = ScoreShifter()
shifter.fit(predicted_scores=dev_test_scores,
            ground_truth_scores=dev_gt_scores)
dev_test_scores = shifter.transform(dev_test_scores)


p1 for optimal combo: 0.08
p2 for optimal combo: 0.35
AUC for optimal combo: 0.955729166667
c@1 for optimal combo: 0.943142361111

As you can see, this shifter optimizes two parameters using a grid search: all values between p1 and p2 will be rectified to 0.5:


In [14]:
print(dev_test_scores)


[0.79849999845027919, 0.7465000092983245, 0.0, 0.0, 0.0063999998569488539, 0.005600000023841859, 0.0, 1.0, 1.0, 0.8439999938011169, 0.66850000619888306, 0.5, 0.96099999845027928, 0.96099999845027928, 0.00079999998211860673, 0.60349999070167537, 0.70100001394748679, 0.0, 0.00239999994635582, 0.58400000929832463, 0.0, 0.0, 0.71400000154972076, 0.60349999070167537, 0.0, 0.88299999535083762, 0.0, 0.69449998140335079, 0.0031999999284744269, 0.0, 0.0, 0.00079999998211860673, 0.5, 0.0, 0.0015999999642372135, 0.5, 0.0, 0.005600000023841859, 0.5, 0.0, 0.5, 0.81149998605251317, 0.0015999999642372135, 0.85699998140335087, 1.0, 0.0, 0.59699999690055849, 0.00079999998211860673, 0.0, 0.5, 0.00079999998211860673, 0.61000000387430187, 0.00239999994635582, 0.59699999690055849, 0.81800001859664917, 0.0, 0.0015999999642372135, 0.85049998760223389, 0.0015999999642372135, 0.89599998295307159, 0.98700001239776602, 0.77249998450279234, 0.98050001859664904, 0.68149999380111692, 0.7920000046491622, 0.93499998450279231, 0.62299999147653584, 0.5, 0.71400000154972076, 0.91550000309944157, 0.5, 0.00079999998211860673, 0.0015999999642372135, 0.68149999380111692, 0.005600000023841859, 0.5, 0.5, 0.0, 0.99350000619888301, 0.0, 0.92199999690055834, 0.59699999690055849, 0.0, 0.0, 1.0, 0.0, 1.0, 0.85049998760223389, 0.00079999998211860673, 0.0, 0.0, 0.75949999690055847, 0.0, 0.0, 0.64900000542402259, 0.71400000154972076]
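
The rectification itself is easy to mimic in spirit: every score that falls strictly between p1 and p2 is pushed to exactly 0.5. The hypothetical helper below only illustrates that middle band; the actual ScoreShifter additionally rescales the scores outside the interval, as the transformed values above show:

def naive_rectify(scores, p1, p2):
    # push every 'uncertain' mid-range score into the unanswered zone
    return [0.5 if p1 < s < p2 else s for s in scores]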

We can later apply this optimized score shifter to the test problems. Now the main question: how well would our O2 verifier perform on the development problems, given the optimal p1 and p2 found? We answer this question using the three evaluation measures used in the PAN competition.
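
For reference, accuracy and AUC are standard, while c@1 is the measure that rewards leaving problems unanswered: correct answers count fully, and each unanswered problem (score exactly 0.5) adds a fractional credit equal to the proportion of correctly answered problems. A minimal sketch of that computation follows; the pan_metrics call below uses the package's own implementation, which may handle edge cases differently:

def c_at_1(pred_scores, gt_scores):
    n = len(pred_scores)
    n_correct = sum(1 for p, t in zip(pred_scores, gt_scores)
                    if (p > 0.5 and t == 1.0) or (p < 0.5 and t == 0.0))
    n_unanswered = sum(1 for p in pred_scores if p == 0.5)
    return (n_correct + n_unanswered * n_correct / float(n)) / float(n)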


In [15]:
from ruzicka.evaluation import pan_metrics
dev_acc_score, dev_auc_score, dev_c_at_1_score = \
    pan_metrics(prediction_scores=dev_test_scores,
    ground_truth_scores=dev_gt_scores)
print('Accuracy: ', dev_acc_score)
print('AUC: ', dev_auc_score)
print('c@1: ', dev_c_at_1_score)
print('AUC x c@1: ', dev_auc_score * dev_c_at_1_score)


Accuracy:  0.885416666667
AUC:  0.955729166667
c@1:  0.943142361111
AUC x c@1:  0.901388662833

Our score shifting approach clearly pays off: since we are able to leave difficult problems unanswered, we obtain a higher c@1 than plain accuracy. We can now proceed to the test problems. The following code block runs entirely parallel to the approach above; only the score shifter is not retrained:


In [16]:
train_data, test_data = load_pan_dataset(D+'test')
train_labels, train_documents = zip(*train_data)
test_labels, test_documents = zip(*test_data)
                
# vectorize:
vectorizer = Vectorizer(mfi = 10000,
                        vector_space = 'tf',
                        ngram_type = 'word',
                        ngram_size = 1)
train_X = vectorizer.fit_transform(train_documents).toarray()
test_X = vectorizer.transform(test_documents).toarray()
                
# encode author labels:
label_encoder = LabelEncoder()
label_encoder.fit(train_labels+test_labels)
train_y = label_encoder.transform(train_labels)
test_y = label_encoder.transform(test_labels)
                
# fit and predict a verifier on the test data:
test_verifier = Order2Verifier(metric = 'minmax',
                               base = 'profile',
                               nb_bootstrap_iter=100,
                               rnd_prop = 0.5)
test_verifier.fit(train_X, train_y)
test_scores = test_verifier.predict_proba(test_X=test_X,
                                          test_y=test_y,
                                          nb_imposters=30)
                
# load the ground truth:
test_gt_scores = load_ground_truth(
                    filepath=os.sep.join((D, 'test', 'truth.txt')),
                    labels=test_labels)
                
# apply the optimized score shifter:
test_scores = shifter.transform(test_scores)
                
test_acc_score, test_auc_score, test_c_at_1_score = \
    pan_metrics(prediction_scores=test_scores,
                ground_truth_scores=test_gt_scores)

print('Accuracy: ', test_acc_score)
print('AUC: ', test_auc_score)
print('c@1: ', test_c_at_1_score)
print('AUC x c@1: ', test_auc_score * test_c_at_1_score)


	 - # test documents processed: 10 out of 96
	 - # test documents processed: 20 out of 96
	 - # test documents processed: 30 out of 96
	 - # test documents processed: 40 out of 96
	 - # test documents processed: 50 out of 96
	 - # test documents processed: 60 out of 96
	 - # test documents processed: 70 out of 96
	 - # test documents processed: 80 out of 96
	 - # test documents processed: 90 out of 96
Accuracy:  0.864583333333
AUC:  0.9609375
c@1:  0.911458333333
AUC x c@1:  0.875854492187

While our final test results are a bit lower, the verifier seems to scale reasonably well to the unseen verification problems in the test set.

First Order Verification

It is now interesting to compare the GI approach to a first-order verification system, which often yields very competitive results too. Our implementation closely resembles the system proposed by Potha and Stamatatos in 2014 (A Profile-based Method for Authorship Verification). We import and fit this O1 verifier:


In [18]:
from ruzicka.Order1Verifier import Order1Verifier
dev_verifier = Order1Verifier(metric = 'minmax',
                              base = 'profile')
dev_verifier.fit(dev_train_X, dev_train_y)
dev_test_scores = dev_verifier.predict_proba(test_X = dev_test_X,
                                             test_y = dev_test_y)
print(dev_test_scores)


[ 0.0508821   0.05295295 -0.05339944 -0.07909369 -0.02331865 -0.04220104
 -0.06020927  0.11833715  0.11711633  0.03420103  0.01194018 -0.00176835
  0.09044588  0.05795223 -0.10883117 -0.00071907 -0.08573282 -0.13027966
 -0.05026388 -0.01643515 -0.05558467 -0.12349176  0.0027076  -0.04140735
 -0.06439781 -0.01183951 -0.09243321 -0.03753805 -0.06817973 -0.10692203
 -0.08212757 -0.09001279 -0.06661606 -0.10339952 -0.09174156 -0.03461802
 -0.1220206  -0.05210984 -0.12378168 -0.08442163 -0.02438498  0.03309178
 -0.07402968  0.02882493  0.12914622 -0.14603448 -0.03053057 -0.05629373
 -0.10035634 -0.10980856 -0.07716274 -0.07025313 -0.0667429  -0.11839318
  0.02641141 -0.13112211 -0.03812957  0.05383098 -0.05459356  0.03681302
 -0.03131771  0.03050268  0.0914582   0.02064216  0.01521158  0.0497179
  0.00120807 -0.06035507  0.01666337  0.07360435 -0.15455794 -0.19472182
 -0.18665552 -0.02599692 -0.11922693 -0.1706109  -0.08144045 -0.09309399
  0.09763068 -0.08678317 -0.03580868 -0.03423667 -0.09028387 -0.10228109
  0.12156731 -0.10104704  0.15736157  0.02625966 -0.10609066 -0.14817739
 -0.08555293 -0.0347091  -0.08178961 -0.13069367 -0.01512218  0.00522423]
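
Conceptually, each of these first-order scores involves just a single, direct comparison between the unknown document and the target author's mean centroid, without any imposters. Below is a minimal sketch reusing minmax_sim from the earlier sketch; the actual Order1Verifier clearly rescales its raw similarities further, which is why the values printed here can be negative:

raw_o1_scores = []
for vec, author in zip(dev_test_X, dev_test_y):
    profile = dev_train_X[dev_train_y == author].mean(axis=0)  # profile base
    raw_o1_scores.append(minmax_sim(vec, profile))
print(raw_o1_scores[:5])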

Note that in this case, the 'probabilities' returned are only distance-based pseudo-probabilities and do not lie in the range 0-1. Applying the score shifter is therefore essential with O1, since it rescales the distances to a more useful range:


In [87]:
shifter = ScoreShifter()
shifter.fit(predicted_scores=dev_test_scores,
            ground_truth_scores=dev_gt_scores)
dev_test_scores = shifter.transform(dev_test_scores)
print(dev_test_scores)


p1 for optimal combo: 0.4
p2 for optimal combo: 0.48
AUC for optimal combo: 0.900607638889
c@1 for optimal combo: 0.875217013889
[0.84273804257641682, 0.84579651967613234, 0.16055557333113935, 0.13136447581750335, 0.5, 0.5, 0.15281896211244364, 0.94236395287675823, 0.94056089246296626, 0.81810138143372435, 0.78522383099719839, 0.76497738691808459, 0.90117068640332054, 0.85318007055721723, 0.097579896593504079, 0.76652709278296061, 0.12382180468222422, 0.073212381489759837, 0.1641178680337276, 0.74331566101724755, 0.1580729506571803, 0.080924073032932753, 0.77158802155890516, 0.5, 0.14806038755174178, 0.75010306283465722, 0.11620952097510422, 0.5, 0.14376377501934579, 0.099748856395121779, 0.12791770548024634, 0.11895935299583767, 0.14554024993147938, 0.10375076667785683, 0.11699530335218733, 0.5, 0.082595451922209323, 0.16202068773225708, 0.080594699930370509, 0.1253114324598073, 0.5, 0.81646311591762899, 0.13711767047081092, 0.81016129564706429, 0.95832810646525068, 0.055313418246450516, 0.5, 0.15726739505930193, 0.10720810079060721, 0.096469481269528826, 0.13355821986162472, 0.14140818851734535, 0.1453961491991084, 0.086716543261792553, 0.80659672566975438, 0.072255276343457478, 0.5, 0.84709331114940478, 0.15919894077835728, 0.82195909618097507, 0.5, 0.8126392052519571, 0.90266581276148095, 0.79807598435680915, 0.79005543781347454, 0.84101861205194739, 0.76937332602672193, 0.1526533275300285, 0.79219962014423961, 0.87629704456372703, 0.045629957377535973, 0.0, 0.0091640752404909525, 0.5, 0.085769324725887816, 0.027392276153343366, 0.12869834140260192, 0.11545881574997982, 0.91178208691786444, 0.12262850435053184, 0.5, 0.5, 0.11865137831030044, 0.10502139926348139, 0.94713464192102259, 0.10642340187767725, 1.0, 0.80637259756637025, 0.1006933662706436, 0.05287887429428817, 0.12402617310811503, 0.5, 0.12830165808952429, 0.072742022614266974, 0.74525481807197902, 0.77530488596624436]

And again, we are now ready to test the performance of O1 on the test problems.


In [20]:
train_data, test_data = load_pan_dataset(D+'test')
train_labels, train_documents = zip(*train_data)
test_labels, test_documents = zip(*test_data)
                
# vectorize:
vectorizer = Vectorizer(mfi = 10000,
                        vector_space = 'tf',
                        ngram_type = 'word',
                        ngram_size = 1)
train_X = vectorizer.fit_transform(train_documents).toarray()
test_X = vectorizer.transform(test_documents).toarray()
                
# encode author labels:
label_encoder = LabelEncoder()
label_encoder.fit(train_labels+test_labels)
train_y = label_encoder.transform(train_labels)
test_y = label_encoder.transform(test_labels)
                
# fit and predict a verifier on the test data:
test_verifier = Order1Verifier(metric = 'minmax',
                               base = 'profile')
test_verifier.fit(train_X, train_y)
test_scores = test_verifier.predict_proba(test_X=test_X,
                                          test_y=test_y)
                
# load the ground truth:
test_gt_scores = load_ground_truth(
                    filepath=os.sep.join((D, 'test', 'truth.txt')),
                    labels=test_labels)
                
# apply the optimized score shifter:
test_scores = shifter.transform(test_scores)
                
test_acc_score, test_auc_score, test_c_at_1_score = \
    pan_metrics(prediction_scores=test_scores,
                ground_truth_scores=test_gt_scores)

print('Accuracy: ', test_acc_score)
print('AUC: ', test_auc_score)
print('c@1: ', test_c_at_1_score)
print('AUC x c@1: ', test_auc_score * test_c_at_1_score)


Accuracy:  0.5
AUC:  0.881944444444
c@1:  0.637478298611
AUC x c@1:  0.562220443914

Interestingly, O1 maintains a healthy AUC, but its accuracy and c@1 are disappointing here. This is, however, certainly not the case for all data sets: as we show in the paper, O1 produces relatively high scores on other corpora.