parse RST-DT documents in LISP/S-Expression format

  • RST-DT *.rs3 files are broken, cf. my notebook on RST-DT/PTB merging
  • only use the *.dis files, the *.lisp.name and *.step.name may be broken, too
  • the RST-DT people probably used Marcu's tools
    to convert their annotations into *.dis format

The RST Discourse Treebank contains 385 WSJ articles from PTB with Rhetorical Structure Theory (RST) annotations.

The following information was taken from the RST-DT documentation:

RSTtrees-WSJ-main-1.0

This directory contains 385 Wall Street Journal articles, broken into TRAINING (347 documents) and TEST (38 documents) sub-directories.

Filenames are in one of two forms:

  • wsj_####.ext (380 documents)
  • file#.ext (5 documents)

The 5 files named file# were identified as the following filenames in Treebank-2:

  • file1 - 07/wsj_0764
  • file2 - 04/wsj_0430
  • file3 - 07/wsj_0766
  • file4 - 07/wsj_0778
  • file5 - 21/wsj_2172

(More information is available in a compressed file via ftp, which provides the relationship between the 2,499 PTB filenames and the corresponding WSJ DOCNO strings in TIPSTER.)
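
The file-name conventions above can be turned into a small lookup that maps an RST-DT file name to the corresponding PTB file. This is only a sketch: the function name `ptb_relpath` is made up here, and the `<subdir>/wsj_####.mrg` layout is an assumption based on the PTB paths used further down in this notebook.

```python
import os
import re

# file# -> PTB path (without extension), as listed in the RST-DT documentation
FILE_TO_PTB = {
    'file1': '07/wsj_0764',
    'file2': '04/wsj_0430',
    'file3': '07/wsj_0766',
    'file4': '07/wsj_0778',
    'file5': '21/wsj_2172',
}

# the first two digits of a WSJ document ID name the PTB subdirectory
WSJ_DOCID = re.compile(r'wsj_(\d{2})\d{2}$')


def ptb_relpath(rst_filepath, ext='.mrg'):
    """hypothetical helper: RST-DT file name -> relative PTB file path"""
    doc_id = os.path.basename(rst_filepath).split('.')[0].lower()
    if doc_id in FILE_TO_PTB:
        return FILE_TO_PTB[doc_id] + ext
    match = WSJ_DOCID.match(doc_id)
    if match is None:
        raise ValueError('Unexpected RST-DT file name: {}'.format(rst_filepath))
    return '{}/{}{}'.format(match.group(1), doc_id, ext)
```

With such a lookup, the five file# documents would not have to be skipped when RST-DT and PTB files are paired.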

.rst/

A directory with three files:

  • <docno>.lisp.name - discourse structure created by a human judge for a text.
  • <docno>.step.name - list of all human actions taken
    during the creation of the discourse structure
  • ## -- a file with an integer as its name - temp file;
    contains last human action during creation of the discourse structure

All annotations were produced using a discourse annotation tool that can be downloaded from http://www.isi.edu/~marcu/discourse.

The files in the .rst directories are provided only to enable interested users to visualize and print in a convenient format the discourse annotations in the corpus.

  • <docno>.dis - contains the manually annotated discourse structure
    of the file <docno>
    The .dis files were generated automatically from the .step and .lisp
    files using a mapping program.
    More information about this program is available at http://www.isi.edu/~marcu/discourse.

IMPORTANT NOTE: The .lisp files may contain errors introduced by the discourse annotation tool. Please use the .lisp and .step files only for visualizing the trees.
Use the .dis files for training/testing purposes (the mapping program that produced the .dis file was written so as to eliminate the errors introduced by the annotation tool).

  • <docno>.edus - edus (elementary discourse units) listed line by line.

RSTtrees-WSJ-double-1.0

This directory contains the same types of files as the subdirectory RSTtrees-WSJ-main-1.0, for 53 documents which were reviewed by a second analyst.


In [8]:
import os
import sys
import glob

import nltk

RSTDT_MAIN_ROOT = os.path.expanduser('~/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0')
RSTDT_DOUBLE_ROOT = os.path.expanduser('~/repos/rst_discourse_treebank/data/RSTtrees-WSJ-double-1.0')
RSTDT_TOKENIZED_ROOT = os.path.expanduser('~/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-tokenized')

RSTDT_TEST_FILE = os.path.join(RSTDT_MAIN_ROOT, 'TEST', 'wsj_1306.out.dis')
RSTDT_TOKENIZED_TEST_FILE = os.path.join(RSTDT_TOKENIZED_ROOT, 'TEST', 'wsj_1306.out.dis')

PTB_WSJ_ROOT_DIR = os.path.expanduser('~/corpora/pennTreebank/parsed/mrg/wsj')

Find unparsable files

  • nltk's BracketParseCorpusReader fails completely on only 3 files

In [3]:
# build the paths from RSTDT_MAIN_ROOT, so that they match the glob results below
FILES_UNPARSABLE_WITH_NLTK = set(
    os.path.join(RSTDT_MAIN_ROOT, 'TRAINING', fname)
    for fname in ('wsj_1107.out.dis', 'wsj_2353.out.dis', 'wsj_2367.out.dis'))

In [4]:
def get_nodelabel(node):
    """returns the node label of an nltk Tree or one of its subtrees"""
    if isinstance(node, nltk.tree.Tree):
        return node.label()
    elif isinstance(node, unicode):
        return node.encode('utf-8')
    else:
        raise ValueError("Unexpected node type: {}, {}".format(type(node), node))

In [5]:
from nltk.corpus.reader import BracketParseCorpusReader

def parse_rstfile_nltk(rst_filepath):
    """parse a *.dis RST file into an nltk.tree.Tree"""
    rst_path, rst_filename = os.path.split(rst_filepath)
    parsed_doc = BracketParseCorpusReader(rst_path, [rst_filename])
    parsed_sents_iter = parsed_doc.parsed_sents()
    return parsed_sents_iter[0] # there's only one tree in a *.dis

In [6]:
from collections import defaultdict

def nested_tree_count(tree, result_dict=None):
    """
    counts, for each Nucleus/Satellite subtree, the sequence of its child labels;
    raises a ValueError if a leaf doesn't have exactly three children
    """
    if result_dict is None: # an empty dict is falsy, so don't test with 'not'
        result_dict = defaultdict(lambda : defaultdict(int))
    for subtree in tree:
        if isinstance(subtree, nltk.tree.Tree) and subtree.label() in ('Nucleus', 'Satellite'):
            rhs = tuple([get_nodelabel(st) for st in subtree])
            result_dict[get_nodelabel(subtree)][rhs] += 1
            if rhs[0] == u'leaf' and len(rhs) != 3: # (leaf, rel2par, text)
                raise ValueError('Badly escaped s-expression\n{}\n'.format(subtree))
            nested_tree_count(subtree, result_dict)
    return result_dict

Files with bad escaping

  • 22 badly escaped files

"(" and ")" aren't escaped in text field!

  • /home/arne/corpora/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0/TRAINING/wsj_0612.out.dis
  ( Satellite (span 22 28) (rel2par elaboration-set-member-e)
    ( Nucleus (span 22 23) (rel2par span)
      ( Nucleus (leaf 22) (rel2par span) (text _!Canadian Imperial Bank of Commerce_!) )
      ( Satellite (leaf 23) (rel2par elaboration-additional) (text _!(Canada) --_!) )
    )
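
One possible workaround (a sketch, not necessarily how the escaped files used later in this notebook were produced) is to rewrite every `(text _!..._!)` field as a double-quoted string before handing the file to an s-expression parser, so that parentheses inside the EDU text no longer count as brackets. The function name `escape_text_fields` is made up for this illustration, and it assumes that `_!` only ever occurs as the text delimiter.

```python
import re

# an EDU text field: everything between "(text _!" and the next "_!)"
TEXT_FIELD = re.compile(r'\(text _!(.*?)_!\)', re.DOTALL)


def escape_text_fields(dis_string):
    """hypothetical helper: wrap the content of each text field in double quotes"""
    def quote(match):
        # escape backslashes first, then double quotes, so the result
        # is a well-formed double-quoted string
        text = match.group(1).replace('\\', '\\\\').replace('"', '\\"')
        return '(text "{}")'.format(text)
    return TEXT_FIELD.sub(quote, dis_string)
```

Applied to the leaf shown above, `(text _!(Canada) --_!)` becomes `(text "(Canada) --")`.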

In [6]:
# BADLY_ESCAPED_FILES = set()

# for folder in ('TEST', 'TRAINING'):
#     for rst_fpath in glob.glob(os.path.join(RSTDT_MAIN_ROOT, folder, '*.dis')):
#         if rst_fpath not in FILES_UNPARSABLE_WITH_NLTK:
#             rst_tree = parse_rstfile_nltk(rst_fpath)
#             try:
#                 nested_tree_count(rst_tree)
#             except ValueError as e:
#                 BADLY_ESCAPED_FILES.add(rst_fpath)

# len(BADLY_ESCAPED_FILES) # 22 files

Files unparsable with sexpdata (due to bad bracketing)

  • 113 files that aren't valid s-expressions (nltk parses them, as it is very forgiving)

In [7]:
import sys
import traceback
import sexpdata

def parse_rstfile_sexpdata(rst_filepath):
    """parse a *.dis RST file into a nested list of sexpdata objects"""
    with open(rst_filepath) as rstfile:
        try:
            return sexpdata.load(rstfile)
        except sexpdata.ExpectClosingBracket as e:
            raise ValueError(u"{}\n{}\n\n".format(rst_filepath, e))
        except sexpdata.ExpectNothing as e:
            error_msg = e.args[0][:100] # complete msg would contain the whole document
            raise ValueError(u"{}\n{}...\n\n".format(rst_filepath, error_msg))
        except (AssertionError, AttributeError):
            raise ValueError(u"{}\n{}\n\n".format(rst_filepath, traceback.format_exc()))

In [8]:
# FILES_UNPARSABLE_WITH_SEXPDATA = set()
# for folder in ('TEST', 'TRAINING'):
#     for rst_fpath in glob.glob(os.path.join(RSTDT_MAIN_ROOT, folder, '*.dis')):
#         try:
#             parse_rstfile_sexpdata(rst_fpath)
#         except ValueError as e:
#             FILES_UNPARSABLE_WITH_SEXPDATA.add(rst_fpath)

# len(FILES_UNPARSABLE_WITH_SEXPDATA) # 113 unparsable files

set of all 'unparsable' files (before tokenization and text escaping)


In [9]:
# ALL_UNPARSABLE_FILES = FILES_UNPARSABLE_WITH_NLTK.union(FILES_UNPARSABLE_WITH_SEXPDATA).union(BADLY_ESCAPED_FILES)
# len(ALL_UNPARSABLE_FILES) # 124 unparsable files

try parsing files into graphs

Summary of RST tree rules

  • Root --> span (N+ | N S | S N)
  • Nucleus --> leaf rel2par text (N | S | re.compile('.*_!') )?
  • Nucleus --> span rel2par (N+ | N S | S N | S N S)
  • Satellite --> leaf rel2par text (N | re.compile('.*_!') )?
  • Satellite --> span rel2par (N+ | N S | S | S N | S N S)
  • rel2par --> any RST relation string
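
The child patterns of the span rules above can be checked mechanically by abbreviating each child to N or S and matching the resulting string against a regular expression. The following is a sketch only: the names `ALLOWED_CHILDREN` and `children_pattern_ok` are made up, and leaf nodes are not covered.

```python
import re

# allowed child sequences of span nodes, copied from the rule summary above
ALLOWED_CHILDREN = {
    'Root': re.compile(r'^(N+|NS|SN)$'),
    'Nucleus': re.compile(r'^(N+|NS|SN|SNS)$'),
    'Satellite': re.compile(r'^(N+|NS|S|SN|SNS)$'),
}


def children_pattern_ok(parent_type, child_types):
    """
    hypothetical helper: True, iff the sequence of child types
    (e.g. ['Nucleus', 'Satellite']) is allowed under the given parent type
    """
    pattern = ''.join(child_type[0] for child_type in child_types)
    return bool(ALLOWED_CHILDREN[parent_type].match(pattern))
```

For example, `children_pattern_ok('Satellite', ['Satellite'])` is accepted, while the same single-satellite child under a Nucleus is not.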

In [10]:
sexp_tree = parse_rstfile_sexpdata(RSTDT_TEST_FILE)
# a list that contains Symbol instances (and lists of Symbol instances and integers)

In [11]:
root = sexp_tree[0]

In [12]:
print sexp_tree[1]
print sexp_tree[1][0]
print sexp_tree[1][0].value()


[Symbol('span'), 1, 47]
Symbol('span')
span

In [13]:
nuc_tree = sexp_tree[2]
print nuc_tree[1][0].value()
print nuc_tree[1][1], nuc_tree[1][2]
for i, e in enumerate(nuc_tree):
    print i, e, '\n'


span
1 20
0 Symbol('Nucleus') 

1 [Symbol('span'), 1, 20] 

2 [Symbol('rel2par'), Symbol('span')] 

3 [Symbol('Nucleus'), [Symbol('span'), 1, 14], [Symbol('rel2par'), Symbol('span')], [Symbol('Nucleus'), [Symbol('span'), 1, 8], [Symbol('rel2par'), Symbol('span')], [Symbol('Nucleus'), [Symbol('span'), 1, 4], [Symbol('rel2par'), Symbol('Inverted-Sequence')], [Symbol('Nucleus'), [Symbol('span'), 1, 3], [Symbol('rel2par'), Symbol('span')], [Symbol('Satellite'), [Symbol('leaf'), 1], [Symbol('rel2par'), Symbol('attribution')], [Symbol('text'), Symbol('_!Tandy'), Symbol('Corp.'), Symbol('said_!')]], [Symbol('Nucleus'), [Symbol('span'), 2, 3], [Symbol('rel2par'), Symbol('span')], [Symbol('Nucleus'), [Symbol('leaf'), 2], [Symbol('rel2par'), Symbol('span')], [Symbol('text'), Symbol('_!it'), Symbol('won'), Quoted(True), Symbol('join'), Symbol('U.S.'), Symbol('Memories,'), Symbol('the'), Symbol('group_!')]], [Symbol('Satellite'), [Symbol('leaf'), 3], [Symbol('rel2par'), Symbol('elaboration-object-attribute-e')], [Symbol('text'), Symbol('_!that'), Symbol('seeks'), Symbol('to'), Symbol('battle'), Symbol('the'), Symbol('Japanese'), Symbol('in'), Symbol('the'), Symbol('market'), Symbol('for'), Symbol('computer'), Symbol('memory'), Symbol('chips.<P>_!')]]]], [Symbol('Satellite'), [Symbol('leaf'), 4], [Symbol('rel2par'), Symbol('elaboration-additional')], [Symbol('text'), Symbol('_!Tandy'), Quoted(Symbol('s')), Symbol('decision'), Symbol('is'), Symbol('a'), Symbol('second'), Symbol('setback'), Symbol('for'), Symbol('U.S.'), Symbol('Memories._!')]]], [Symbol('Nucleus'), [Symbol('span'), 5, 8], [Symbol('rel2par'), Symbol('Inverted-Sequence')], [Symbol('Nucleus'), [Symbol('span'), 5, 6], [Symbol('rel2par'), Symbol('span')], [Symbol('Satellite'), [Symbol('leaf'), 5], [Symbol('rel2par'), Symbol('attribution')], [Symbol('text'), Symbol('_!Last'), Symbol('month,'), Symbol('Apple'), Symbol('Computer'), Symbol('Inc.'), Symbol('said_!')]], [Symbol('Nucleus'), [Symbol('leaf'), 6], [Symbol('rel2par'), Symbol('span')], [Symbol('text'), Symbol('_!that'), Symbol('it'), Symbol('wouldn'), 
Quoted(True), Symbol('invest'), Symbol('in'), Symbol('the'), Symbol('group._!')]]], [Symbol('Satellite'), [Symbol('span'), 7, 8], [Symbol('rel2par'), Symbol('reason')], [Symbol('Satellite'), [Symbol('leaf'), 7], [Symbol('rel2par'), Symbol('attribution')], [Symbol('text'), Symbol('_!Apple'), Symbol('said_!')]], [Symbol('Nucleus'), [Symbol('leaf'), 8], [Symbol('rel2par'), Symbol('span')], [Symbol('text'), Symbol('_!that'), Symbol('its'), Symbol('money'), Symbol('would'), Symbol('be'), Symbol('better'), Symbol('spent'), Symbol('in'), Symbol('areas'), Symbol('such'), Symbol('as'), Symbol('research'), Symbol('and'), Symbol('development.<P>_!')]]]]], [Symbol('Satellite'), [Symbol('span'), 9, 14], [Symbol('rel2par'), Symbol('circumstance')], [Symbol('Nucleus'), [Symbol('span'), 9, 11], [Symbol('rel2par'), Symbol('span')], [Symbol('Nucleus'), [Symbol('leaf'), 9], [Symbol('rel2par'), Symbol('span')], [Symbol('text'), Symbol('_!U.S.'), Symbol('Memories'), Symbol('is'), Symbol('seeking'), Symbol('major'), Symbol('investors'), Symbol('to'), Symbol('back'), Symbol('its'), Symbol('attempt_!')]], [Symbol('Satellite'), [Symbol('span'), 10, 11], [Symbol('rel2par'), Symbol('elaboration-object-attribute-e')], [Symbol('Nucleus'), [Symbol('leaf'), 10], [Symbol('rel2par'), Symbol('span')], [Symbol('text'), Symbol('_!to'), Symbol('crack'), Symbol('the'), Symbol('$10'), Symbol('billion'), Symbol('market'), Symbol('for'), Symbol('dynamic'), Symbol('random'), Symbol('access'), Symbol('memory'), Symbol('chips,'), Symbol('a'), Symbol('market_!')]], [Symbol('Satellite'), [Symbol('leaf'), 11], [Symbol('rel2par'), Symbol('elaboration-object-attribute-e')], [Symbol('text'), Symbol('_!dominated'), Symbol('by'), Symbol('the'), Symbol('Japanese._!')]]]], [Symbol('Satellite'), [Symbol('span'), 12, 14], [Symbol('rel2par'), Symbol('elaboration-additional')], [Symbol('Nucleus'), [Symbol('leaf'), 12], [Symbol('rel2par'), Symbol('span')], [Symbol('text'), Symbol('_!Those'), Symbol('chips'), 
Symbol('were'), Symbol('in'), Symbol('dire'), Symbol('shortage'), Symbol('last'), Symbol('year,_!')]], [Symbol('Satellite'), [Symbol('span'), 13, 14], [Symbol('rel2par'), Symbol('consequence-s')], [Symbol('Nucleus'), [Symbol('leaf'), 13], [Symbol('rel2par'), Symbol('span')], [Symbol('text'), Symbol('_!hurting'), Symbol('many'), Symbol('U.S.'), Symbol('computer'), Symbol('companies_!')]], [Symbol('Satellite'), [Symbol('leaf'), 14], [Symbol('rel2par'), Symbol('elaboration-object-attribute-e')], [Symbol('text'), Symbol('_!that'), Symbol('couldn'), Quoted(True), Symbol('get'), Symbol('sufficient'), Symbol('Japanese-supplied'), Symbol('chips.<P>_!')]]]]]] 

4 [Symbol('Satellite'), [Symbol('span'), 15, 20], [Symbol('rel2par'), Symbol('reason')], [Symbol('Nucleus'), [Symbol('span'), 15, 17], [Symbol('rel2par'), Symbol('span')], [Symbol('Satellite'), [Symbol('leaf'), 15], [Symbol('rel2par'), Symbol('attribution')], [Symbol('text'), Symbol('_!Tandy'), Symbol('said_!')]], [Symbol('Nucleus'), [Symbol('span'), 16, 17], [Symbol('rel2par'), Symbol('span')], [Symbol('Nucleus'), [Symbol('leaf'), 16], [Symbol('rel2par'), Symbol('span')], [Symbol('text'), Symbol('_!its'), Symbol('experience'), Symbol('during'), Symbol('the'), Symbol('shortage'), Symbol('didn'), Quoted(True), Symbol('merit'), Symbol('the'), Symbol('$5'), Symbol('million'), Symbol('to'), Symbol('$50'), Symbol('million'), Symbol('investment_!')]], [Symbol('Satellite'), [Symbol('leaf'), 17], [Symbol('rel2par'), Symbol('elaboration-object-attribute-e')], [Symbol('text'), Symbol('_!U.S.'), Symbol('Memories'), Symbol('is'), Symbol('seeking'), Symbol('from'), Symbol('each'), Symbol('investor._!')]]]], [Symbol('Satellite'), [Symbol('span'), 18, 20], [Symbol('rel2par'), Symbol('explanation-argumentative')], [Symbol('Nucleus'), [Symbol('span'), 18, 19], [Symbol('rel2par'), Symbol('span')], [Symbol('Nucleus'), [Symbol('leaf'), 18], [Symbol('rel2par'), Symbol('span')], [Symbol('text'), Symbol('_!'), 'At this time, we elected not to get involved_!) )\n          ( Satellite (leaf 19) (rel2par reason) (text _!because we have been able to satisfy our need {for DRAMs} from the market as a rule,', Symbol('_!')]]], [Symbol('Satellite'), [Symbol('leaf'), 20], [Symbol('rel2par'), Symbol('attribution')], [Symbol('text'), Symbol('_!said'), Symbol('Ed'), Symbol('Juge,'), Symbol('Tandy'), Quoted(Symbol('s')), Symbol('director'), Symbol('of'), Symbol('market'), Symbol('planning.<P>_!')]]]] 

SEXPDATA fail: ' must be escaped

>>> sexpdata.loads("(text this won't hurt)")
[Symbol('text'), Symbol('this'), Symbol('won'), Quoted(True), Symbol('hurt')]
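
The apostrophe problem arises because the EDU text is read as a sequence of s-expression tokens. As an alternative, here is a sketch of a tiny dependency-free reader that treats everything between `_!` and `_!` as one atomic string. This is not what sexpdata or nltk do; `parse_dis` is a made-up name, and the reader assumes that `_!` never occurs inside an EDU text.

```python
import re

# token types: EDU text (_!..._!), a single bracket, or any other atom
TOKEN = re.compile(r'_!(.*?)_!|([()])|([^\s()]+)', re.DOTALL)


def parse_dis(dis_string):
    """hypothetical helper: parse a *.dis string into nested Python lists"""
    stack = [[]]  # stack[0] collects the top-level expressions
    for match in TOKEN.finditer(dis_string):
        text, bracket, atom = match.groups()
        if bracket == '(':
            subtree = []
            stack[-1].append(subtree)
            stack.append(subtree)
        elif bracket == ')':
            if len(stack) == 1:
                raise ValueError('Unexpected ")" at offset {}'.format(match.start()))
            stack.pop()
        elif text is not None:
            # EDU text is kept verbatim, incl. brackets and apostrophes
            stack[-1].append(text)
        else:
            stack[-1].append(int(atom) if atom.isdigit() else atom)
    if len(stack) != 1:
        raise ValueError('{} unclosed "("'.format(len(stack) - 1))
    return stack[0]
```

Because brackets inside `_!..._!` are never counted, a ValueError from this reader points to bracketing errors outside of the text fields.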

Epic fail: RST-DT files contain superfluous //TT_ERR strings

  • I fixed the files in the RSTtrees-WSJ-main-1.0-tokenized directory
arne@ziegelstein ~/repos/rst_discourse_treebank $ ack-grep -cl "//TT_ERR"
data/RSTtrees-WSJ-main-1.0/TRAINING/wsj_2367.out.dis:102
data/RSTtrees-WSJ-main-1.0/TRAINING/wsj_2353.out.dis:53
data/RSTtrees-WSJ-main-1.0-tokenized/TRAINING/wsj_2367.out.dis:102
data/RSTtrees-WSJ-main-1.0-tokenized/TRAINING/wsj_2353.out.dis:53
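
A minimal sketch of how such markers could be removed from a file's content. It assumes that the literal string `//TT_ERR` carries no information and can simply be dropped; whitespace around the marker is left untouched, and `strip_tt_err` is a made-up name.

```python
def strip_tt_err(dis_string, marker='//TT_ERR'):
    """
    hypothetical helper: returns the input string without the marker
    and the number of markers that were removed
    """
    count = dis_string.count(marker)
    return dis_string.replace(marker, ''), count
```

Returning the count makes it easy to compare the result against the per-file numbers reported by ack-grep.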

In [14]:
import discoursegraphs as dg
from collections import Counter

class RSTLispDocumentGraph(dg.DiscourseDocumentGraph):
    """
    A directed graph with multiple edges (based on a networkx
    MultiDiGraph) that represents the rhetorical structure of a
    document.

    Attributes
    ----------
    name : str
        name, ID of the document or file name of the input file
    ns : str
        the namespace of the document (default: rst)
    root : str
        name of the document root node ID
    tokens : list of str
        sorted list of all token node IDs contained in this document graph
    """
    def __init__(self, dis_filepath, name=None, namespace='rst',
                 tokenize=True, precedence=False):
        """
        Creates an RSTLispDocumentGraph from a Rhetorical Structure *.dis file and adds metadata
        to it.

        Parameters
        ----------
        dis_filepath : str
            absolute or relative path to the Rhetorical Structure *.dis file to be
            parsed.
        name : str or None
            the name or ID of the graph to be generated. If no name is
            given, the basename of the input file is used.
        namespace : str
            the namespace of the document (default: rst)
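        tokenize : bool
            If True, split each EDU text on whitespace and add one token
            node per token (default: True)
        precedence : bool
            If True, add precedence relation edges
            (root precedes token1, which precedes token2 etc.)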
        """
        # super calls __init__() of base class DiscourseDocumentGraph
        super(RSTLispDocumentGraph, self).__init__()

        self.name = name if name else os.path.basename(dis_filepath)
        self.ns = namespace
        self.root = 0
        self.add_node(self.root, layers={self.ns}, label=self.ns+':root_node')
        if 'discoursegraph:root_node' in self:
            self.remove_node('discoursegraph:root_node')
        
        self.tokenized = tokenize
        self.tokens = []
        self.rst_tree = parse_rstfile_sexpdata(dis_filepath)
        self.parse_rst_tree(self.rst_tree)
        
    def parse_rst_tree(self, rst_tree, indent=0):
        tree_type = self.get_tree_type(rst_tree)
        assert tree_type in ('Root', 'Nucleus', 'Satellite')
        if tree_type == 'Root':
            span, children = rst_tree[1], rst_tree[2:]
            for child in children:
                self.parse_rst_tree(child, indent=indent+1)

        else: # tree_type in ('Nucleus', 'Satellite')
            node_id = self.get_node_id(rst_tree)
            node_type = self.get_node_type(rst_tree)
            relation_type = self.get_relation_type(rst_tree)
            if node_type == 'leaf':
                edu_text = self.get_edu_text(rst_tree[3])
                self.add_node(node_id, attr_dict={self.ns+':text': edu_text,
                                                  'label': u'{}: {}'.format(node_id, edu_text[:20])})
                if self.tokenized:
                    edu_tokens = edu_text.split()
                    for i, token in enumerate(edu_tokens):
                        token_node_id = '{}_{}'.format(node_id, i)
                        self.tokens.append(token_node_id)
                        self.add_node(token_node_id, attr_dict={self.ns+':token': token,
                                                                'label': token})
                        self.add_edge(node_id, token_node_id)
                    
            else: # node_type == 'span'
                self.add_node(node_id, attr_dict={self.ns+':rel_type': relation_type,
                                                   self.ns+':node_type': node_type})
                children = rst_tree[3:]
                child_types = self.get_child_types(children)
                
                expected_child_types = set(['Nucleus', 'Satellite'])
                unexpected_child_types = set(child_types).difference(expected_child_types)
                assert not unexpected_child_types, \
                    "Node '{}' contains unexpected child types: {}\n".format(node_id, unexpected_child_types)
                
                if 'Satellite' not in child_types:
                    # span only contains nuclei -> multinuc
                    for child in children:
                        child_node_id = self.get_node_id(child)
                        self.add_edge(node_id, child_node_id, attr_dict={self.ns+':rel_type': relation_type})
                
                elif len(child_types['Satellite']) == 1 and len(child_types['Nucleus']) == 1:
                    # standard RST relation, where one satellite is dominated by one nucleus
                    nucleus_index = child_types['Nucleus'][0]
                    satellite_index = child_types['Satellite'][0]
                    
                    nucleus_node_id = self.get_node_id(children[nucleus_index])
                    satellite_node_id = self.get_node_id(children[satellite_index])
                    self.add_edge(node_id, nucleus_node_id, attr_dict={self.ns+':rel_type': 'span'},
                                  edge_type=dg.EdgeTypes.spanning_relation)
                    self.add_edge(nucleus_node_id, satellite_node_id,
                                  attr_dict={self.ns+':rel_type': relation_type},
                                  edge_type=dg.EdgeTypes.dominance_relation)
                else:
                    raise ValueError("Unexpected child combinations: {}\n".format(child_types))
                
                for child in children:
                    self.parse_rst_tree(child, indent=indent+1)

    def get_child_types(self, children):
        """
        maps from (sub)tree type (i.e. Nucleus or Satellite) to a list
        of all children of this type
        """
        child_types = defaultdict(list)
        for i, child in enumerate(children):
            child_types[self.get_tree_type(child)].append(i)
        return child_types


    def get_edu_text(self, text_subtree):
        assert text_subtree[0].value() == 'text'
        return u' '.join(word.value().decode('utf-8')
                         if isinstance(word, sexpdata.Symbol) else word.decode('utf-8')
                         for word in text_subtree[1:])
        
    def get_tree_type(self, tree):
        """returns the type of the (sub)tree: Root, Nucleus or Satellite"""
        tree_type = tree[0].value()
        return tree_type

    def get_node_type(self, tree):
        """returns the node type (leaf or span) of a subtree (i.e. Nucleus or Satellite)"""
        node_type = tree[1][0].value()
        assert node_type in ('leaf', 'span')
        return node_type

    def get_relation_type(self, tree):
        """returns the RST relation type attached to the parent node of an RST relation"""
        return tree[2][1].value()

    def get_node_id(self, nuc_or_sat):
        node_type = self.get_node_type(nuc_or_sat)
        if node_type == 'leaf':
            leaf_id = nuc_or_sat[1][1]
            return '{}:{}'.format(self.ns, leaf_id)
        else: # node_type == 'span'
            span_start = nuc_or_sat[1][1]
            span_end = nuc_or_sat[1][2]
            return '{}:span:{}-{}'.format(self.ns, span_start, span_end)



In [15]:
RSTDT_TOKENIZED_ROOT = os.path.expanduser('~/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-tokenized')

import traceback

# for folder in ('TEST', 'TRAINING'):
#     for rst_fpath in glob.glob(os.path.join(RSTDT_TOKENIZED_ROOT, folder, '*.dis')):
#         try:
#             RSTLispDocumentGraph(rst_fpath)
# #             print rst_fpath
#         except ValueError as e:
#             sys.stderr.write("Error in file '{}'\n{}\n".format(rst_fpath, e))

In [16]:
# TODO: error in attachment: rst:span:18-20 -> 18-19
rdg = RSTLispDocumentGraph(RSTDT_TOKENIZED_TEST_FILE, tokenize=False)

In [17]:
# %load_ext gvmagic
# %dotstr dg.print_dot(rdg)

Do RSTDT-CoreNLP tokenizations match PTB?


In [18]:
RSTDT_NLTK_TOKENIZED_ROOT = os.path.expanduser('~/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized')

dis_file = os.path.join(RSTDT_NLTK_TOKENIZED_ROOT, 'TEST/wsj_2386.out.dis')
mrg_file = os.path.join(PTB_WSJ_ROOT_DIR, '23/wsj_2386.mrg')

rdg = RSTLispDocumentGraph(dis_file)
pdg = dg.read_ptb(mrg_file)

for t in rdg.tokens[:10]: print t,
print
for t in pdg.tokens[:10]: print t,


rst:1_0 rst:1_1 rst:1_2 rst:1_3 rst:1_4 rst:1_5 rst:1_6 rst:1_7 rst:1_8 rst:1_9
4 6 9 12 14 16 18 21 24 27

In [19]:
print dis_file
rdg.merge_graphs(pdg, verbose=True)


/home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_2386.out.dis
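
Before RST-DT tokens are compared with PTB tokens, PTB-specific escapes have to be undone; the merge log in this section reports mismatches such as `1/2 != 1\/2`. The sketch below works on plain lists of token strings (not on the document graphs used in this notebook), and the names `normalize_ptb_token` and `first_mismatch` are made up for this illustration.

```python
# bracket replacement tokens used in PTB
PTB_BRACKETS = {
    '-LRB-': '(', '-RRB-': ')',
    '-LCB-': '{', '-RCB-': '}',
    '-LSB-': '[', '-RSB-': ']',
}


def normalize_ptb_token(token):
    """hypothetical helper: undo PTB's bracket and slash escaping"""
    token = PTB_BRACKETS.get(token, token)
    return token.replace('\\/', '/')


def first_mismatch(rst_tokens, ptb_tokens):
    """
    hypothetical helper: returns (index, rst token, ptb token) of the first
    position where the tokens differ, or None. Only the common prefix of
    both lists is compared; a length difference alone is not reported.
    """
    for i, (rst_tok, ptb_tok) in enumerate(zip(rst_tokens, ptb_tokens)):
        if rst_tok != normalize_ptb_token(ptb_tok):
            return i, rst_tok, ptb_tok
    return None
```

This only removes the escaping-related mismatches. Differences such as `Inc != Inc.` come from the tokenizer splitting off the abbreviation period and would need a different fix.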

In [20]:
import re
import glob
import sys


# 'wsj_1306' -> PTB subdirectory '13' / document ID '1306'
WSJ_SUBDIR_REGEX = re.compile(r'wsj_(\d{2})')
WSJ_DOCID_REGEX = re.compile(r'wsj_(\d{4})')

for folder in ('TEST', 'TRAINING'):
    for rst_fpath in glob.glob(os.path.join(RSTDT_NLTK_TOKENIZED_ROOT, folder, '*.dis')):
        try:
            rdg = RSTLispDocumentGraph(rst_fpath)
            rst_fname = os.path.basename(rst_fpath).lower()

            # NOTE: file1..file5 don't match these patterns and end up in the except branch
            doc_id = WSJ_DOCID_REGEX.match(rst_fname).groups()[0]
            wsj_subdir = WSJ_SUBDIR_REGEX.match(rst_fname).groups()[0]

            ptb_file = os.path.join(PTB_WSJ_ROOT_DIR, wsj_subdir, 'wsj_{}.mrg'.format(doc_id))
            pdg = dg.read_ptb(ptb_file)

            rdg.merge_graphs(pdg)
            print "merged: {}\n".format(rst_fpath)
        except Exception as e:
            # parsing, file lookup and merging errors are all reported the same way
            sys.stderr.write("Error in {}: {}\n".format(rst_fpath, e))


merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_2386.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_0627.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_2375.out.dis: 
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_1387.out.dis: Tokenization mismatch: wsj_1387.out.dis (rst) vs. wsj_1387.mrg (ptb)
	Co. != Co
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_0644.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_1365.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_0632.out.dis: Tokenization mismatch: wsj_0632.out.dis (rst) vs. wsj_0632.mrg (ptb)
	No != No.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_2385.out.dis: Tokenization mismatch: wsj_2385.out.dis (rst) vs. wsj_2385.mrg (ptb)
	B != B&H
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_0667.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_1189.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_0689.out.dis: Tokenization mismatch: wsj_0689.out.dis (rst) vs. wsj_0689.mrg (ptb)
	Cos != Cos.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_1183.out.dis: Tokenization mismatch: wsj_1183.out.dis (rst) vs. wsj_1183.mrg (ptb)
	US != US$
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_0654.out.dis: 
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_1169.out.dis: Tokenization mismatch: wsj_1169.out.dis (rst) vs. wsj_1169.mrg (ptb)
	Mfg != Mfg.
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_1346.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_1113.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_2336.out.dis: Tokenization mismatch: wsj_2336.out.dis (rst) vs. wsj_2336.mrg (ptb)
	Inc != Inc.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_0607.out.dis: Tokenization mismatch: wsj_0607.out.dis (rst) vs. wsj_0607.mrg (ptb)
	Nasdaq/National != Nasdaq\/National
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_2354.out.dis: Tokenization mismatch: wsj_2354.out.dis (rst) vs. wsj_2354.mrg (ptb)
	S != S&Ls
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_1354.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_1306.out.dis: 
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_1148.out.dis: Tokenization mismatch: wsj_1148.out.dis (rst) vs. wsj_1148.mrg (ptb)
	U.S != U.S.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_0655.out.dis: Tokenization mismatch: wsj_0655.out.dis (rst) vs. wsj_0655.mrg (ptb)
	. != ...
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_1331.out.dis: Tokenization mismatch: wsj_1331.out.dis (rst) vs. wsj_1331.mrg (ptb)
	`It != `
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_0684.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_1197.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_1376.out.dis: Tokenization mismatch: wsj_1376.out.dis (rst) vs. wsj_1376.mrg (ptb)
	. != ...
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_0602.out.dis: Tokenization mismatch: wsj_0602.out.dis (rst) vs. wsj_0602.mrg (ptb)
	1/2 != 1\/2
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_1129.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_1325.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_0623.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_1380.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_0616.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_2373.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_1142.out.dis: Tokenization mismatch: wsj_1142.out.dis (rst) vs. wsj_1142.mrg (ptb)
	7/8 != 7\/8
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_1146.out.dis: Tokenization mismatch: wsj_1146.out.dis (rst) vs. wsj_1146.mrg (ptb)
	1973's != 1973
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_1307.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TEST/wsj_1126.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2366.out.dis: Tokenization mismatch: wsj_2366.out.dis (rst) vs. wsj_2366.mrg (ptb)
	. != ...
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2327.out.dis: Tokenization mismatch: wsj_2327.out.dis (rst) vs. wsj_2327.mrg (ptb)
	HK != HK$
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1303.out.dis: Tokenization mismatch: wsj_1303.out.dis (rst) vs. wsj_1303.mrg (ptb)
	Mr != Mr.
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1198.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0650.out.dis: Tokenization mismatch: wsj_0650.out.dis (rst) vs. wsj_0650.mrg (ptb)
	3/8 != 3\/8
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0670.out.dis: Tokenization mismatch: wsj_0670.out.dis (rst) vs. wsj_0670.mrg (ptb)
	R.D != R.D.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2321.out.dis: Tokenization mismatch: wsj_2321.out.dis (rst) vs. wsj_2321.mrg (ptb)
	U.S != U.S.
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0679.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0686.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1371.out.dis: Tokenization mismatch: wsj_1371.out.dis (rst) vs. wsj_1371.mrg (ptb)
	S.p != S.p.A.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0605.out.dis: 
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0603.out.dis: Tokenization mismatch: wsj_0603.out.dis (rst) vs. wsj_0603.mrg (ptb)
	BOARD != BOARD'S
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1164.out.dis: Tokenization mismatch: wsj_1164.out.dis (rst) vs. wsj_1164.mrg (ptb)
	F.J != F.J.
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0621.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2346.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0662.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1369.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1985.out.dis: Tokenization mismatch: wsj_1985.out.dis (rst) vs. wsj_1985.mrg (ptb)
	1/2 != 1\/2
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1147.out.dis: Tokenization mismatch: wsj_1147.out.dis (rst) vs. wsj_1147.mrg (ptb)
	W.G != W.G.
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2322.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0665.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1330.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1107.out.dis: 
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2353.out.dis: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2353.out.dis
Traceback (most recent call last):
  File "<ipython-input-7-a3ce24290f79>", line 8, in parse_rstfile_sexpdata
    return sexpdata.load(rstfile)
  File "/usr/local/lib/python2.7/dist-packages/sexpdata.py", line 171, in load
    return loads(filelike.read(), **kwds)
  File "/usr/local/lib/python2.7/dist-packages/sexpdata.py", line 244, in loads
    assert len(obj) == 1  # FIXME: raise an appropriate error
AssertionError

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1140.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0661.out.dis: Tokenization mismatch: wsj_0661.out.dis (rst) vs. wsj_0661.mrg (ptb)
	N.A != N.A.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0604.out.dis: Tokenization mismatch: wsj_0604.out.dis (rst) vs. wsj_0604.mrg (ptb)
	B.J != B.J.
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1151.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1362.out.dis: Unexpected child combinations: defaultdict(<type 'list'>, {'Satellite': [0, 2], 'Nucleus': [1]})

Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0612.out.dis: Tokenization mismatch: wsj_0612.out.dis (rst) vs. wsj_0612.mrg (ptb)
	3/8 != 3\/8
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1963.out.dis: Tokenization mismatch: wsj_1963.out.dis (rst) vs. wsj_1963.mrg (ptb)
	Calif. != Calif
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1109.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1396.out.dis: 
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0696.out.dis: 
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1196.out.dis: Tokenization mismatch: wsj_1196.out.dis (rst) vs. wsj_1196.mrg (ptb)
	35,000-to- != 35,000-to-$50,000
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1316.out.dis: 
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1153.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1321.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1111.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1988.out.dis: Tokenization mismatch: wsj_1988.out.dis (rst) vs. wsj_1988.mrg (ptb)
	C != C$
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1374.out.dis: 
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0649.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0625.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1980.out.dis: Tokenization mismatch: wsj_1980.out.dis (rst) vs. wsj_1980.mrg (ptb)
	A != A.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1342.out.dis: Tokenization mismatch: wsj_1342.out.dis (rst) vs. wsj_1342.mrg (ptb)
	a-Discounted != a
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1130.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0663.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1384.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0668.out.dis: Tokenization mismatch: wsj_0668.out.dis (rst) vs. wsj_0668.mrg (ptb)
	N.V. != N.V
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2338.out.dis: Tokenization mismatch: wsj_2338.out.dis (rst) vs. wsj_2338.mrg (ptb)
	U.S != U.S.
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1391.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2395.out.dis: Tokenization mismatch: wsj_2395.out.dis (rst) vs. wsj_2395.mrg (ptb)
	Cities/ABC != Cities\/ABC
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1322.out.dis: Unexpected child combinations: defaultdict(<type 'list'>, {'Satellite': [0, 2], 'Nucleus': [1]})

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0651.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2344.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0688.out.dis: Tokenization mismatch: wsj_0688.out.dis (rst) vs. wsj_0688.mrg (ptb)
	Tex != Tex.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2352.out.dis: Tokenization mismatch: wsj_2352.out.dis (rst) vs. wsj_2352.mrg (ptb)
	U.S != U.S.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0617.out.dis: Tokenization mismatch: wsj_0617.out.dis (rst) vs. wsj_0617.mrg (ptb)
	No != No.
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0606.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1160.out.dis: Tokenization mismatch: wsj_1160.out.dis (rst) vs. wsj_1160.mrg (ptb)
	Gov != Gov.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1179.out.dis: 
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1343.out.dis: Tokenization mismatch: wsj_1343.out.dis (rst) vs. wsj_1343.mrg (ptb)
	AT != AT&T
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0634.out.dis: Tokenization mismatch: wsj_0634.out.dis (rst) vs. wsj_0634.mrg (ptb)
	US != US$
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1134.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1315.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1163.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0641.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0619.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1345.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1162.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1139.out.dis: Tokenization mismatch: wsj_1139.out.dis (rst) vs. wsj_1139.mrg (ptb)
	No != No.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1150.out.dis: Tokenization mismatch: wsj_1150.out.dis (rst) vs. wsj_1150.mrg (ptb)
	p.m. != p.m
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1344.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1351.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1340.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1352.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1117.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1320.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2362.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1149.out.dis: Tokenization mismatch: wsj_1149.out.dis (rst) vs. wsj_1149.mrg (ptb)
	Princeton/Newport != Princeton\/Newport
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2399.out.dis: Tokenization mismatch: wsj_2399.out.dis (rst) vs. wsj_2399.mrg (ptb)
	S != S&P
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1177.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1333.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1171.out.dis: Tokenization mismatch: wsj_1171.out.dis (rst) vs. wsj_1171.mrg (ptb)
	'controlled != `
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0614.out.dis: 
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1350.out.dis: Tokenization mismatch: wsj_1350.out.dis (rst) vs. wsj_1350.mrg (ptb)
	Co != Co.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1110.out.dis: Tokenization mismatch: wsj_1110.out.dis (rst) vs. wsj_1110.mrg (ptb)
	1/2 != 1\/2
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1187.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2380.out.dis: Tokenization mismatch: wsj_2380.out.dis (rst) vs. wsj_2380.mrg (ptb)
	1/2 != 1\/2
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0633.out.dis: Tokenization mismatch: wsj_0633.out.dis (rst) vs. wsj_0633.mrg (ptb)
	. != ...
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1131.out.dis: Tokenization mismatch: wsj_1131.out.dis (rst) vs. wsj_1131.mrg (ptb)
	FHA- != FHA
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1970.out.dis: Tokenization mismatch: wsj_1970.out.dis (rst) vs. wsj_1970.mrg (ptb)
	July's != July
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1358.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0620.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2325.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1121.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2331.out.dis: Tokenization mismatch: wsj_2331.out.dis (rst) vs. wsj_2331.mrg (ptb)
	. != ...
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2343.out.dis: Tokenization mismatch: wsj_2343.out.dis (rst) vs. wsj_2343.mrg (ptb)
	. != ...
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1304.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1366.out.dis: Tokenization mismatch: wsj_1366.out.dis (rst) vs. wsj_1366.mrg (ptb)
	S != S&L
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0624.out.dis: Tokenization mismatch: wsj_0624.out.dis (rst) vs. wsj_0624.mrg (ptb)
	1/4 != 1\/4
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1353.out.dis: Tokenization mismatch: wsj_1353.out.dis (rst) vs. wsj_1353.mrg (ptb)
	. != ...
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1349.out.dis: Tokenization mismatch: wsj_1349.out.dis (rst) vs. wsj_1349.mrg (ptb)
	Calif != Calif.
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0626.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2398.out.dis: Tokenization mismatch: wsj_2398.out.dis (rst) vs. wsj_2398.mrg (ptb)
	U.S != U.S.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1314.out.dis: Tokenization mismatch: wsj_1314.out.dis (rst) vs. wsj_1314.mrg (ptb)
	`Leverage != `
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1100.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1347.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1116.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1335.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1974.out.dis: Tokenization mismatch: wsj_1974.out.dis (rst) vs. wsj_1974.mrg (ptb)
	1/2 != 1\/2
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0656.out.dis: Tokenization mismatch: wsj_0656.out.dis (rst) vs. wsj_0656.mrg (ptb)
	`Oh != `
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2315.out.dis: Tokenization mismatch: wsj_2315.out.dis (rst) vs. wsj_2315.mrg (ptb)
	FCB/Leber != FCB\/Leber
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1180.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1301.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1167.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0659.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1166.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0683.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1934.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1944.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0660.out.dis: Tokenization mismatch: wsj_0660.out.dis (rst) vs. wsj_0660.mrg (ptb)
	S.p != S.p.A.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2339.out.dis: Tokenization mismatch: wsj_2339.out.dis (rst) vs. wsj_2339.mrg (ptb)
	US != US$
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1984.out.dis: Tokenization mismatch: wsj_1984.out.dis (rst) vs. wsj_1984.mrg (ptb)
	Sen != Sen.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2348.out.dis: Tokenization mismatch: wsj_2348.out.dis (rst) vs. wsj_2348.mrg (ptb)
	S != S&Ls
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0600.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1152.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2364.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2313.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0678.out.dis: Tokenization mismatch: wsj_0678.out.dis (rst) vs. wsj_0678.mrg (ptb)
	No != No.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1398.out.dis: Tokenization mismatch: wsj_1398.out.dis (rst) vs. wsj_1398.mrg (ptb)
	P.R != P.R.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1962.out.dis: Tokenization mismatch: wsj_1962.out.dis (rst) vs. wsj_1962.mrg (ptb)
	S != S&P
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2329.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1341.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1359.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2347.out.dis: Tokenization mismatch: wsj_2347.out.dis (rst) vs. wsj_2347.mrg (ptb)
	A.E != A.E.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0645.out.dis: Tokenization mismatch: wsj_0645.out.dis (rst) vs. wsj_0645.mrg (ptb)
	W.T != W.T.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0611.out.dis: 
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1124.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1119.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1983.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2363.out.dis: Tokenization mismatch: wsj_2363.out.dis (rst) vs. wsj_2363.mrg (ptb)
	US != US$
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2345.out.dis: Tokenization mismatch: wsj_2345.out.dis (rst) vs. wsj_2345.mrg (ptb)
	`Enough != `
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0677.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0698.out.dis: Tokenization mismatch: wsj_0698.out.dis (rst) vs. wsj_0698.mrg (ptb)
	7/8 != 7\/8
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2381.out.dis: Tokenization mismatch: wsj_2381.out.dis (rst) vs. wsj_2381.mrg (ptb)
	`portfolio != `
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1174.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0610.out.dis: Tokenization mismatch: wsj_0610.out.dis (rst) vs. wsj_0610.mrg (ptb)
	`healthy != `
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0694.out.dis: Tokenization mismatch: wsj_0694.out.dis (rst) vs. wsj_0694.mrg (ptb)
	. != ...
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1101.out.dis: Tokenization mismatch: wsj_1101.out.dis (rst) vs. wsj_1101.mrg (ptb)
	U.S != U.S.
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1392.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2320.out.dis: Tokenization mismatch: wsj_2320.out.dis (rst) vs. wsj_2320.mrg (ptb)
	P != P&G
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1157.out.dis: Tokenization mismatch: wsj_1157.out.dis (rst) vs. wsj_1157.mrg (ptb)
	U.S != U.S.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2360.out.dis: Tokenization mismatch: wsj_2360.out.dis (rst) vs. wsj_2360.mrg (ptb)
	Corp != Corp.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1373.out.dis: Tokenization mismatch: wsj_1373.out.dis (rst) vs. wsj_1373.mrg (ptb)
	Corp != Corp.
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0693.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1973.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1357.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1377.out.dis: Unexpected child combinations: defaultdict(<type 'list'>, {'Satellite': [0, 2], 'Nucleus': [1]})

Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2359.out.dis: 
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1348.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1185.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1360.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1103.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1144.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0690.out.dis: Tokenization mismatch: wsj_0690.out.dis (rst) vs. wsj_0690.mrg (ptb)
	Novo/Nordisk != Novo\/Nordisk
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1122.out.dis: Tokenization mismatch: wsj_1122.out.dis (rst) vs. wsj_1122.mrg (ptb)
	Gov != Gov.
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1992.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0640.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0630.out.dis: Tokenization mismatch: wsj_0630.out.dis (rst) vs. wsj_0630.mrg (ptb)
	Corp. != Corp
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1317.out.dis: Tokenization mismatch: wsj_1317.out.dis (rst) vs. wsj_1317.mrg (ptb)
	. != ...
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1161.out.dis: Tokenization mismatch: wsj_1161.out.dis (rst) vs. wsj_1161.mrg (ptb)
	1/4 != 1\/4
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/file5.dis: 'NoneType' object has no attribute 'groups'
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1186.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2396.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1137.out.dis: Tokenization mismatch: wsj_1137.out.dis (rst) vs. wsj_1137.mrg (ptb)
	No != No.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1339.out.dis: Tokenization mismatch: wsj_1339.out.dis (rst) vs. wsj_1339.mrg (ptb)
	1/2 != 1\/2
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1190.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0629.out.dis: Tokenization mismatch: wsj_0629.out.dis (rst) vs. wsj_0629.mrg (ptb)
	G.S != G.S.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1328.out.dis: Tokenization mismatch: wsj_1328.out.dis (rst) vs. wsj_1328.mrg (ptb)
	5/8 != 5\/8
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1356.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1383.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1319.out.dis: Tokenization mismatch: wsj_1319.out.dis (rst) vs. wsj_1319.mrg (ptb)
	Cie. != Cie
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1138.out.dis: Unexpected child combinations: defaultdict(<type 'list'>, {'Satellite': [0, 2], 'Nucleus': [1]})

Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1367.out.dis: Tokenization mismatch: wsj_1367.out.dis (rst) vs. wsj_1367.mrg (ptb)
	L.A != L.A.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1390.out.dis: Tokenization mismatch: wsj_1390.out.dis (rst) vs. wsj_1390.mrg (ptb)
	. != ...
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2382.out.dis: Tokenization mismatch: wsj_2382.out.dis (rst) vs. wsj_2382.mrg (ptb)
	Inc != Inc.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0638.out.dis: Tokenization mismatch: wsj_0638.out.dis (rst) vs. wsj_0638.mrg (ptb)
	Compaq's != Compaq
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1132.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2308.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1158.out.dis: Tokenization mismatch: wsj_1158.out.dis (rst) vs. wsj_1158.mrg (ptb)
	- != --
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2394.out.dis: Tokenization mismatch: wsj_2394.out.dis (rst) vs. wsj_2394.mrg (ptb)
	. != ...
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1173.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0682.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1175.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0692.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1309.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1143.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1172.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0666.out.dis: 
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0636.out.dis: Tokenization mismatch: wsj_0636.out.dis (rst) vs. wsj_0636.mrg (ptb)
	p.m != p.m.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1312.out.dis: Tokenization mismatch: wsj_1312.out.dis (rst) vs. wsj_1312.mrg (ptb)
	1/2 != 1\/2
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1127.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1930.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0635.out.dis: Tokenization mismatch: wsj_0635.out.dis (rst) vs. wsj_0635.mrg (ptb)
	`` != ''
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2393.out.dis: 
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0672.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0608.out.dis: 
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/file1.dis: 'NoneType' object has no attribute 'groups'
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1118.out.dis: Tokenization mismatch: wsj_1118.out.dis (rst) vs. wsj_1118.mrg (ptb)
	1/2 != 1\/2
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1375.out.dis: Tokenization mismatch: wsj_1375.out.dis (rst) vs. wsj_1375.mrg (ptb)
	S != S&P
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0687.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0618.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1394.out.dis: Tokenization mismatch: wsj_1394.out.dis (rst) vs. wsj_1394.mrg (ptb)
	J.X != J.X.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1115.out.dis: Tokenization mismatch: wsj_1115.out.dis (rst) vs. wsj_1115.mrg (ptb)
	Co != Co.
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1361.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1336.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1135.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1389.out.dis: Tokenization mismatch: wsj_1389.out.dis (rst) vs. wsj_1389.mrg (ptb)
	1/2 != 1\/2
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1176.out.dis: Tokenization mismatch: wsj_1176.out.dis (rst) vs. wsj_1176.mrg (ptb)
	`` != ''
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1327.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1370.out.dis: Tokenization mismatch: wsj_1370.out.dis (rst) vs. wsj_1370.mrg (ptb)
	Y-MP/832 != Y-MP\/832
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1193.out.dis: Tokenization mismatch: wsj_1193.out.dis (rst) vs. wsj_1193.mrg (ptb)
	Mr != Mr.
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1381.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0691.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1318.out.dis: Unexpected child combinations: defaultdict(<type 'list'>, {'Satellite': [0, 2], 'Nucleus': [1]})

Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1156.out.dis: Tokenization mismatch: wsj_1156.out.dis (rst) vs. wsj_1156.mrg (ptb)
	-results != -
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1120.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2365.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2328.out.dis: Tokenization mismatch: wsj_2328.out.dis (rst) vs. wsj_2328.mrg (ptb)
	S != S&L
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0674.out.dis: Tokenization mismatch: wsj_0674.out.dis (rst) vs. wsj_0674.mrg (ptb)
	30 != 30%-owned
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1338.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1155.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0673.out.dis: Tokenization mismatch: wsj_0673.out.dis (rst) vs. wsj_0673.mrg (ptb)
	C.J != C.J.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0671.out.dis: Tokenization mismatch: wsj_0671.out.dis (rst) vs. wsj_0671.mrg (ptb)
	property/casualty != property\/casualty
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0646.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1997.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0609.out.dis: Tokenization mismatch: wsj_0609.out.dis (rst) vs. wsj_0609.mrg (ptb)
	. != ...
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1181.out.dis: Tokenization mismatch: wsj_1181.out.dis (rst) vs. wsj_1181.mrg (ptb)
	US != US$
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1128.out.dis: Tokenization mismatch: wsj_1128.out.dis (rst) vs. wsj_1128.mrg (ptb)
	U.S != U.S.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1324.out.dis: Tokenization mismatch: wsj_1324.out.dis (rst) vs. wsj_1324.mrg (ptb)
	Co != Co.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1145.out.dis: Tokenization mismatch: wsj_1145.out.dis (rst) vs. wsj_1145.mrg (ptb)
	. != ...
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1123.out.dis: Tokenization mismatch: wsj_1123.out.dis (rst) vs. wsj_1123.mrg (ptb)
	`` != ''
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1924.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1302.out.dis: Tokenization mismatch: wsj_1302.out.dis (rst) vs. wsj_1302.mrg (ptb)
	Lloyd's != Lloyd
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1159.out.dis: Tokenization mismatch: wsj_1159.out.dis (rst) vs. wsj_1159.mrg (ptb)
	. != ...
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1382.out.dis: 
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2332.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1931.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1313.out.dis: Tokenization mismatch: wsj_1313.out.dis (rst) vs. wsj_1313.mrg (ptb)
	Inc != Inc.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1355.out.dis: Unexpected child combinations: defaultdict(<type 'list'>, {'Satellite': [0, 2], 'Nucleus': [1]})

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1102.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1323.out.dis: Tokenization mismatch: wsj_1323.out.dis (rst) vs. wsj_1323.mrg (ptb)
	A != A&M
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1379.out.dis: 
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0637.out.dis: Tokenization mismatch: wsj_0637.out.dis (rst) vs. wsj_0637.mrg (ptb)
	US != US$
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0601.out.dis: 
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0697.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1326.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1188.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1386.out.dis: Tokenization mismatch: wsj_1386.out.dis (rst) vs. wsj_1386.mrg (ptb)
	president-engineering != president
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2358.out.dis: Tokenization mismatch: wsj_2358.out.dis (rst) vs. wsj_2358.mrg (ptb)
	1982=100 != 1982
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2340.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0676.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2383.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2349.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1182.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2356.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1334.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1305.out.dis: Tokenization mismatch: wsj_1305.out.dis (rst) vs. wsj_1305.mrg (ptb)
	. != ...
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1311.out.dis: Tokenization mismatch: wsj_1311.out.dis (rst) vs. wsj_1311.mrg (ptb)
	C.J != C.J.
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2317.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1378.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0685.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1104.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1337.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1194.out.dis: Tokenization mismatch: wsj_1194.out.dis (rst) vs. wsj_1194.mrg (ptb)
	US != US$
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1125.out.dis: Tokenization mismatch: wsj_1125.out.dis (rst) vs. wsj_1125.mrg (ptb)
	U.S != U.S.
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0639.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/file2.dis: 'NoneType' object has no attribute 'groups'
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0622.out.dis: Tokenization mismatch: wsj_0622.out.dis (rst) vs. wsj_0622.mrg (ptb)
	`manipulation != `
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1106.out.dis: Tokenization mismatch: wsj_1106.out.dis (rst) vs. wsj_1106.mrg (ptb)
	late-summer/early-FALL != late-summer\/early-FALL
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2309.out.dis: Tokenization mismatch: wsj_2309.out.dis (rst) vs. wsj_2309.mrg (ptb)
	US != US$
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1395.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1191.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0631.out.dis: Tokenization mismatch: wsj_0631.out.dis (rst) vs. wsj_0631.mrg (ptb)
	50 != 50%-state-owned
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0648.out.dis: Tokenization mismatch: wsj_0648.out.dis (rst) vs. wsj_0648.mrg (ptb)
	Interstate/Johnson != Interstate\/Johnson
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1133.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2391.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1112.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2326.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1397.out.dis: Tokenization mismatch: wsj_1397.out.dis (rst) vs. wsj_1397.mrg (ptb)
	writer/producers != writer\/producers
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1308.out.dis: Tokenization mismatch: wsj_1308.out.dis (rst) vs. wsj_1308.mrg (ptb)
	Prop != Prop.
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0657.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1385.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1998.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1199.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1154.out.dis: Tokenization mismatch: wsj_1154.out.dis (rst) vs. wsj_1154.mrg (ptb)
	. != ...
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1141.out.dis: Tokenization mismatch: wsj_1141.out.dis (rst) vs. wsj_1141.mrg (ptb)
	30-Oct. != 30-Oct
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1168.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/file4.dis: 'NoneType' object has no attribute 'groups'
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1192.out.dis: Unexpected child combinations: defaultdict(<type 'list'>, {'Satellite': [0, 2], 'Nucleus': [1]})

Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0664.out.dis: Tokenization mismatch: wsj_0664.out.dis (rst) vs. wsj_0664.mrg (ptb)
	ATS/2 != ATS\/2
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2357.out.dis: Tokenization mismatch: wsj_2357.out.dis (rst) vs. wsj_2357.mrg (ptb)
	R.B != R.B.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1999.out.dis: Tokenization mismatch: wsj_1999.out.dis (rst) vs. wsj_1999.mrg (ptb)
	AT != AT&T
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0669.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1363.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0681.out.dis: Unexpected child combinations: defaultdict(<type 'list'>, {'Satellite': [0, 2], 'Nucleus': [1]})

Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1329.out.dis: Tokenization mismatch: wsj_1329.out.dis (rst) vs. wsj_1329.mrg (ptb)
	G.m.b != G.m.b.H.
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1105.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/file3.dis: 'NoneType' object has no attribute 'groups'
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2367.out.dis: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2367.out.dis
Traceback (most recent call last):
  File "<ipython-input-7-a3ce24290f79>", line 8, in parse_rstfile_sexpdata
    return sexpdata.load(rstfile)
  File "/usr/local/lib/python2.7/dist-packages/sexpdata.py", line 171, in load
    return loads(filelike.read(), **kwds)
  File "/usr/local/lib/python2.7/dist-packages/sexpdata.py", line 244, in loads
    assert len(obj) == 1  # FIXME: raise an appropriate error
AssertionError



Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0675.out.dis: Tokenization mismatch: wsj_0675.out.dis (rst) vs. wsj_0675.mrg (ptb)
	- != Stocks
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1364.out.dis: Tokenization mismatch: wsj_1364.out.dis (rst) vs. wsj_1364.mrg (ptb)
	S != S&L
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1136.out.dis: Tokenization mismatch: wsj_1136.out.dis (rst) vs. wsj_1136.mrg (ptb)
	P != P&G
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1300.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1114.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0643.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0652.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1165.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2342.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2350.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2316.out.dis: Tokenization mismatch: wsj_2316.out.dis (rst) vs. wsj_2316.mrg (ptb)
	U.S != U.S.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1178.out.dis: 
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0658.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1393.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0628.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1399.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0653.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1184.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2303.out.dis: Tokenization mismatch: wsj_2303.out.dis (rst) vs. wsj_2303.mrg (ptb)
	Co != Co.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1170.out.dis: Unexpected child combinations: defaultdict(<type 'list'>, {'Satellite': [0, 2], 'Nucleus': [1]})

Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1195.out.dis: Tokenization mismatch: wsj_1195.out.dis (rst) vs. wsj_1195.mrg (ptb)
	Corp. != Corp
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1108.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1310.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1372.out.dis: Tokenization mismatch: wsj_1372.out.dis (rst) vs. wsj_1372.mrg (ptb)
	US != US$
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1368.out.dis: Tokenization mismatch: wsj_1368.out.dis (rst) vs. wsj_1368.mrg (ptb)
	S.p != S.p.A.
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1976.out.dis
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1388.out.dis: Tokenization mismatch: wsj_1388.out.dis (rst) vs. wsj_1388.mrg (ptb)
	. != ...
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_1332.out.dis: Tokenization mismatch: wsj_1332.out.dis (rst) vs. wsj_1332.mrg (ptb)
	Bancorp != Bancorp.
merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_2341.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0699.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0647.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0615.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0613.out.dis

merged: /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0642.out.dis

Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0695.out.dis: Tokenization mismatch: wsj_0695.out.dis (rst) vs. wsj_0695.mrg (ptb)
	U.S != U.S.
Error in /home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0-nltk-tokenized/TRAINING/wsj_0680.out.dis: Tokenization mismatch: wsj_0680.out.dis (rst) vs. wsj_0680.mrg (ptb)
	a-Ex-dividend != a
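
Many of the mismatches logged above have the form 1/2 != 1\/2: the PTB .mrg side escapes the slash, while the RST-DT side does not. A minimal sketch of undoing such escapes before comparing tokens could look like the following. The names PTB_ESCAPES and unescape_ptb_token are hypothetical and not part of this notebook. The bracket codes are standard PTB conventions but do not occur in the log itself. The sketch would not help with cases such as Co != Co. or S != S&P, which look like the two sides splitting tokens differently.

```python
# Hypothetical helper (not from the notebook): map PTB escape sequences
# back to the plain characters found in the RST-DT text.
PTB_ESCAPES = {
    '\\/': '/',    # escaped slash, e.g. 1\/2
    '\\*': '*',    # escaped asterisk
    '-LRB-': '(',  # bracket codes: assumed relevant, not seen in the log
    '-RRB-': ')',
    '-LCB-': '{',
    '-RCB-': '}',
    '-LSB-': '[',
    '-RSB-': ']',
}


def unescape_ptb_token(token):
    """Return the token with all known PTB escapes replaced."""
    for escaped, plain in PTB_ESCAPES.items():
        token = token.replace(escaped, plain)
    return token


def tokens_match(rst_token, ptb_token):
    """Compare an RST-DT token with a PTB token after unescaping."""
    return rst_token == unescape_ptb_token(ptb_token)
```

With this, tokens_match('property/casualty', 'property\\/casualty') holds, while tokens_match('Co', 'Co.') still fails and would need a tokenization fix instead.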

In [21]:
os.path.basename(RSTDT_TEST_FILE)


Out[21]:
'wsj_1306.out.dis'

In [22]:
PTB_TEST_FILE = os.path.expanduser('~/corpora/pennTreebank/parsed/mrg/wsj/13/wsj_1306.mrg')

In [23]:
sent0_root = pdg.sentences[0]
ptb_1306_tokens = list(pdg.get_tokens(token_strings_only=True))

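Epic fail: we can't use nltk's bracket parser, because it parses (span 1 5) as (span 1), i.e. it silently drops the end index of the span. The cells below show this for wsj_1306.out.dis.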


In [10]:
RSTDT_TEST_FILE


Out[10]:
'/home/arne/repos/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0/TEST/wsj_1306.out.dis'

In [9]:
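rst_tree = parse_rstfile_nltk(RSTDT_TEST_FILE)
# the first child of the root should be the node (span 1 5)
span_tree = rst_tree[0]
# show the node, its productions and its leaves; only '1' survives
print span_tree, span_tree.productions(), span_tree.leaves()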


(span 1) [span -> '1'] [u'1']

In [21]:
print rst_tree[1][1]


(rel2par span)
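
For comparison, a bracket reader that keeps every atom can be sketched with the standard library alone. This is only an illustration, not the parser used in this notebook (which calls sexpdata.load). The names TOKEN_RE and parse_sexp are hypothetical. The sketch also assumes purely structural input such as (span 1 5): it does not treat the _!..._! text delimiters of .dis files specially.

```python
import re

# Hypothetical tokenizer: parentheses are tokens of their own,
# everything else is split on whitespace.
TOKEN_RE = re.compile(r'\(|\)|[^\s()]+')


def parse_sexp(string):
    """Parse a bracketed string into nested lists.

    Returns a list of all top-level expressions; digit-only atoms
    become ints, all other atoms stay strings.
    """
    stack = [[]]  # stack[0] collects the top-level expressions
    for token in TOKEN_RE.findall(string):
        if token == '(':
            stack.append([])
        elif token == ')':
            if len(stack) == 1:
                raise ValueError('unbalanced closing parenthesis')
            finished = stack.pop()
            stack[-1].append(finished)
        else:
            stack[-1].append(int(token) if token.isdigit() else token)
    if len(stack) != 1:
        raise ValueError('unbalanced opening parenthesis')
    return stack[0]
```

Here parse_sexp('(span 1 5)') keeps both indices and yields [['span', 1, 5]]. Because the function returns every top-level expression, a result whose length is not 1 would presumably correspond to the kind of input that trips the assert len(obj) == 1 seen in the sexpdata traceback for wsj_2367.out.dis.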

In [16]:
# print open(RSTDT_TEST_FILE).read()

In [25]: