This notebook demonstrates how to reproduce the results of our TPAMI paper on NLP tasks.
Caveat: the results may differ from the published version: the paper reports results obtained with our Matlab code, while this notebook uses a rewritten Python version. We distribute only the Python version, as it is much cleaner and simpler to run than the Matlab code.
The data splits (folds) are specified in the data/datasplit_* files. The random seeds are all 0, as can be seen in the command lines. The experiments were run on our lab's SGE computing cluster, named Fear. The SGE and Python command lines are generated by a Python script (src/fear.py), so the experimental configuration shown here is fairly self-contained.
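For reference, each data/datasplit.n_data=&lt;N&gt;.txt file contains one row per fold; a row is a 1-based random permutation of the N example indices, whose first half indexes the training set and whose second half indexes the test set (see the conversion cell at the bottom of this notebook). A minimal sketch of loading one fold by hand, assuming the base NP task (N=300, 150 training examples) and a path relative to the repository root:
In [ ]:
import numpy as np

# Illustration only: load fold 0 of the base NP split.
# Each row is a 1-based permutation, hence the -1 to make indices 0-based.
splits = np.loadtxt('data/datasplit.n_data=300.txt', dtype=np.int16) - 1
train_idx, test_idx = splits[0, :150], splits[0, 150:]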
In [3]:
pygpstruct_location = '/home/sb358/pygpstruct'
pygpstruct_fear_location = '/home/mlg/sb358/pygpstruct'
result_location = '/bigscratch/sb358/pygpstruct/results'
%load_ext autoreload
%autoreload 2
import sys
sys.path.append(pygpstruct_location + '/src/')  # replace with your own path to the .py files
import numpy as np  # used by np.set_printoptions and the cells below
np.set_printoptions(precision=3)
import fear
In [9]:
for task in ['basenp', 'chunking', 'segmentation', 'japanesene']:
    n_data = {'basenp': 300, 'chunking': 100, 'segmentation': 36, 'japanesene': 100}[task]
    n_data_train = {'basenp': 150, 'chunking': 50, 'segmentation': 20, 'japanesene': 50}[task]
    files_prefix = result_location + '/2014-08-22_%s/' % task
    data_indices = np.loadtxt(pygpstruct_location + '/data/datasplit.n_data=%s.txt' % n_data,
                              dtype=np.int16) - 1  # need -1 because prepare_data_chain does +1
    for fold in range(5):
        fear.launch_qsub_job({'n_samples': '250000',
                              'prediction_thinning': '1000',
                              'lhp_update': "{'binary' : np.log(1)}",
                              'data_indices_train': 'np.array(%s)' % str(data_indices[fold, :n_data_train].tolist()),
                              'data_indices_test': 'np.array(%s)' % str(data_indices[fold, n_data_train:].tolist()),
                              'data_folder': "'" + pygpstruct_fear_location + "/data/%s'" % task,
                              'task': "'%s'" % task},
                             job_hash='qsub_' + str(fold),
                             files_prefix=files_prefix,
                             repeat_runs=8)
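To check that everything was submitted (5 folds × 8 repeated runs per task), one can count the per-job log files under each files_prefix; a small sketch, assuming the result_location defined above and the qsub_*.log naming visible in the cells below:
In [ ]:
import glob
for task in ['basenp', 'chunking', 'segmentation', 'japanesene']:
    # count the log files written by the launched jobs for this task
    logs = glob.glob(result_location + '/2014-08-22_%s/qsub_*.log' % task)
    print('%s: %d log files' % (task, len(logs)))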
In [5]:
# monitor the cluster queue (uncomment the first line to delete all of our jobs)
#!ssh fear qdel -u sb358
!ssh fear qstat
!date
In [1]:
# show the last line of every job log
!tail -n 1 /bigscratch/sb358/pygpstruct/results/2014-08-22_*/qsub_*.log
#!ls -l /bigscratch/sb358/pygpstruct/results/2014-08-22_basenp/*
In [35]:
# check the state of a single job by unpickling its saved state
import pickle
with open("/bigscratch/sb358/pygpstruct/results/2014-08-22_japanesene/qsub_3.lhp_update=binary:np.log1++n_samples=250000++prediction_thinning=1000++task=japanesene.state.pickle", 'rb') as f:
    a = pickle.load(f, encoding='latin1')
print(a)
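The same check can be run over all saved states at once; a hedged sketch (what exactly pygpstruct pickles into the state file is not spelled out here, so this only reports the type of each object):
In [ ]:
import glob, pickle
for path in sorted(glob.glob(result_location + '/2014-08-22_*/*.state.pickle')):
    with open(path, 'rb') as f:
        state = pickle.load(f, encoding='latin1')
    print(path, type(state))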
In [8]:
import util
# build the results figure for all four tasks from the saved *.results.bin files
util.make_figure([3],
                 [('segmentation', '/bigscratch/sb358/pygpstruct/results/2014-08-22_segmentation/*.results.bin'),
                  ('chunking', '/bigscratch/sb358/pygpstruct/results/2014-08-22_chunking/*.results.bin'),
                  ('japanesene', '/bigscratch/sb358/pygpstruct/results/2014-08-22_japanesene/*.results.bin'),
                  ('basenp', '/bigscratch/sb358/pygpstruct/results/2014-08-22_basenp/*.results.bin')],
                 top=0.15, bottom=0.04)
In [104]:
# Matlab code used to generate the data splits:
#   n_data=150; fold=1; rand('state', fold); r=randperm(n_data*2);
#   save(sprintf('~/n_data=%s.fold=%s.mat', int2str(n_data), int2str(fold)), 'r')
import scipy.io
print(scipy.io.loadmat('/home/sb358/n_data=150.fold=1.mat'))

# convert the .mat splits to the txt format used above
for n_data in [18, 50, 150]:
    a = np.empty((5, n_data*2), dtype=np.int16)
    for fold in range(1, 6):
        a[fold-1, :] = scipy.io.loadmat('/home/sb358/n_data=%s.fold=%s.mat'
                                        % (str(n_data), str(fold)))['r']
    np.savetxt('/home/sb358/pygpstruct/data/datasplit.n_data=%s.txt' % str(n_data*2), a, fmt="%g")
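As a sanity check (not part of the original pipeline), each converted row should be a valid 1-based permutation of 1..n_data*2:
In [ ]:
for n_data in [18, 50, 150]:
    b = np.loadtxt('/home/sb358/pygpstruct/data/datasplit.n_data=%s.txt' % str(n_data*2),
                   dtype=np.int16)
    assert b.shape == (5, n_data*2)
    for row in b:
        # every row must contain each index 1..n_data*2 exactly once
        assert (np.sort(row) == np.arange(1, n_data*2 + 1)).all()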