The data format is identical to what we used last year, but we have made slight changes to some of the file names in the package to avoid confusion with last year's release. The package name indicates the language (en or zh), the date of creation (MM-DD-YY), and the data split (train, dev, trial, etc.). Once you unpack the package, you can expect the following files and folders:
parses.json
: the input file for the main task and the supplementary task (pdtb-parses.json in 2015)
relations-no-senses.json
: the input file for the supplementary task (new this year)
relations.json
: the gold standard discourse relations (pdtb-data.json in 2015)
raw/DocID
: plain text files, one per document, with no extension. The file name matches the DocID field in relations.json and the keys in parses.json.
conll_format/DocID.conll
: the CoNLL format for the training data (one .conll file per document)
We will show you how to work with each of these files in order to train your systems for the main task and the supplementary task in the language of your choice.
In [3]:
ls -l conll16st-en-01-12-16-trial
relations.json
: gold standard discourse relation annotation
This file comes from the Penn Discourse Treebank (PDTB) for English and the Chinese Discourse Treebank (CDTB) for Chinese. These are the gold standard annotations for both the main task and the supplementary task. Each line in the file is a JSON-encoded relation: in Python you can turn each line into a dictionary, and in Java into a HashMap. Please do not use regex to parse the JSON; your system will most likely break during evaluation.
The dictionary describes the following components of a relation:
Arg1
: the text span of Arg1 of the relation
Arg2
: the text span of Arg2 of the relation
Connective
: the text span of the connective of the relation
DocID
: the ID of the document the relation is in
ID
: the relation ID, which is unique across the training, dev, and test sets
Sense
: the sense of the relation
Type
: the type of the relation (Explicit, Implicit, EntRel, AltLex, or NoRel)
The text span format is identical for Arg1, Arg2, and Connective. A text span has the following fields:
CharacterSpanList
: the list of character offsets (beginning, end) in the raw untokenized data file
RawText
: the raw untokenized text of the span
TokenList
: the list of token addresses, each in the form (character offset begin, character offset end, token offset within the document, sentence offset, token offset within the sentence)
For example,
In [4]:
import json
import codecs
pdtb_file = codecs.open('conll16st-en-01-12-16-trial/relations.json', encoding='utf8')
relations = [json.loads(x) for x in pdtb_file]
example_relation = relations[10]
example_relation
Out[4]:
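To make the token addresses concrete, here is a small sketch that unpacks one TokenList entry; the address values are illustrative, not taken from the trial data.

```python
# An illustrative TokenList entry; each address has the five components
# described above. (Values are made up, not from the trial data.)
token_address = [9, 14, 2, 0, 2]

char_begin, char_end, doc_token_offset, sent_offset, sent_token_offset = token_address

# The sentence and token offsets can be used to look the token up in
# parses.json, e.g. parse_dict[doc_id]['sentences'][sent_offset]['words'][sent_token_offset].
print('chars %d-%d, token #%d in the document, token #%d of sentence #%d'
      % (char_begin, char_end, doc_token_offset, sent_token_offset, sent_offset))
```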
The Chinese and English data are identical except that the Chinese data have one extra field, Punctuation
. Punctuation in Chinese can carry discourse functions, so it is annotated as well, but you are not required to detect it as part of the task. Discourse annotation in Chinese differs quite a bit from English from a linguistic perspective; please refer to the original Chinese Discourse Treebank paper.
In [5]:
data = codecs.open('conll16st-zh-01-08-2016-trial/relations.json', encoding='utf8')
chinese_relations = [json.loads(x) for x in data]
chinese_relations[13]
Out[5]:
In [6]:
print 'Arg1 : %s\nArg2 : %s' % (chinese_relations[13]['Arg1']['RawText'], chinese_relations[13]['Arg2']['RawText'])
parses.json
: input for the main task and the supplementary task
This is the file that your system will have to process during evaluation.
The automatic parses and part-of-speech tags are provided in this file.
Note that, unlike the discourse relation file, this file contains a single JSON object on one line rather than one JSON object per line.
Suppose we want the parse for the sentence in the relation above, which is sentence #15 as shown in its TokenList.
In [7]:
parse_file = codecs.open('conll16st-en-01-12-16-trial/parses.json', encoding='utf8')
en_parse_dict = json.load(parse_file)
en_example_relation = relations[10]
en_doc_id = en_example_relation['DocID']
print en_parse_dict[en_doc_id]['sentences'][15]['parsetree']
In [8]:
parse_file = codecs.open('conll16st-zh-01-08-2016-trial/parses.json', encoding='utf8')
zh_parse_dict = json.load(parse_file)
zh_example_relation = chinese_relations[13]
zh_doc_id = zh_example_relation['DocID']
print zh_parse_dict[zh_doc_id]['sentences'][5]['parsetree']
In [9]:
en_parse_dict[en_doc_id]['sentences'][15]['dependencies']
Out[9]:
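The dependency entries pair a relation label with governor and dependent nodes written as word-index strings (Stanford-style dependencies). Here is a sketch for pulling these apart, assuming that node format; the example triple is made up rather than copied from the trial data.

```python
def split_dep_node(node):
    # 'reported-2' -> ('reported', 2); rpartition on the last hyphen
    # guards against hyphens inside the word itself.
    word, _, index = node.rpartition('-')
    return word, int(index)

# Illustrative dependency triple in the [label, governor, dependent] shape.
dep = ['nsubj', 'reported-2', 'He-1']
label = dep[0]
gov_word, gov_index = split_dep_node(dep[1])
dep_word, dep_index = split_dep_node(dep[2])
print('%s(%s-%d, %s-%d)' % (label, gov_word, gov_index, dep_word, dep_index))
```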
The tokens can be iterated over via the words
field within each sentence. The Linkers
field indicates whether a token is part of an argument of some relation. Its format is arg1_ID
(or arg2_ID for Arg2), where ID corresponds to the ID field in the relation JSON.
In [10]:
en_parse_dict[en_doc_id]['sentences'][15]['words'][0]
Out[10]:
In [11]:
en_parse_dict[en_doc_id]['sentences'][15]['words'][1]
Out[11]:
relations-no-senses.json
: input for the supplementary task
Systems participating in the supplementary task (sense classification) take this file as input. The file is the same as relations.json
, but the Type
and Sense
fields are left empty. This holds for both Chinese and English, except for the extra Punctuation
field in Chinese.
In [12]:
supp_data = open('conll16st-en-01-12-16-trial/relations-no-senses.json')
relations_no_senses = [json.loads(x) for x in supp_data]
relations_no_senses[10]
Out[12]:
In [13]:
all_tokens = [token for sentence in en_parse_dict[en_doc_id]['sentences'] for token in sentence['words']]
for token in all_tokens[0:20]:
for linker in token[1]['Linkers']:
role, relation_id = linker.split('_')
print '%s \t is part of %s in relation id %s' % (token[0], role, relation_id)
In [14]:
print 'Relation ID is %s' % relations[13]['ID']
print 'Arg 1 : %s' % relations[13]['Arg1']['RawText']
We also provide the CoNLL format for those who prefer it, though it is not very pretty. These files can also be used for training. The CoNLL format will not be provided during evaluation.
In [15]:
for x in open('conll16st-en-01-12-16-trial/conll_format/wsj_1000.conll').readlines()[0:5]:
print x[0:40]
Here's the explanation of each field if a document has n relations.
The relation information field can take several forms:
arg1
: part of Arg1 of the relation
arg2
: part of Arg2 of the relation
conn|Comparison.Concession
: part of the discourse connective, and the sense of the relation is Comparison.Concession (Explicit relations only)
arg2|EntRel
: part of Arg2 of the relation, and the sense of the relation is EntRel (EntRel and NoRel relations only)
arg2|because|Contingency.Pragmatic cause
: part of Arg2 of the relation; the inferred connective is because and the sense is Contingency.Pragmatic cause (Implicit relations only)
The system output must be in JSON format. It is very similar to the training set except for the TokenList
field.
The TokenList
field is now a list of document-level token indices.
If the relation is not Explicit, the Connective
field must still be present, and its TokenList
must be an empty list.
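Putting these requirements together, a minimal system-output relation could be built as in the sketch below; the DocID, token indices, and sense are illustrative placeholders, not real system output.

```python
import json

# A minimal output relation: each TokenList holds document-level token
# indices, and the Connective of a non-Explicit relation keeps an empty
# TokenList. All values here are illustrative.
output_relation = {
    'DocID': 'wsj_1000',
    'Arg1': {'TokenList': [10, 11, 12, 13]},
    'Arg2': {'TokenList': [15, 16, 17]},
    'Connective': {'TokenList': []},  # empty: this relation is Implicit
    'Type': 'Implicit',
    'Sense': ['Contingency.Cause.Reason'],
}

# One JSON-encoded relation per line, mirroring relations.json.
print(json.dumps(output_relation))
```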
You may, however, add extra fields to the JSON to help you debug or develop your system.
Below is an example of a relation produced by a system.
You can also run the sample parser:
python sample_parser.py conll16st-en-01-12-16-trial inputrun tutorial
In [24]:
output_relations = [json.loads(x) for x in codecs.open('output.json', encoding='utf8')]
output_relations[10]
Out[24]:
Suppose you already have a system and you want to evaluate it.
We provide validator.py
and scorer.py
to help you validate the format of your system output and score the system, respectively.
These utilities can be downloaded from the CoNLL Shared Task GitHub repository.
Usage instructions are included in the scripts.
If you find any errors or have suggestions, please post to the forum or email the organizing committee at conll16st@gmail.com.
We hope you enjoy solving this challenging task of shallow discourse parsing.
Together, we can make progress in understanding discourse phenomena.