RST Discourse Treebank experiments

  • major problem: the RST-DT files in *.rs3 format (provided by Maite) are broken
    (their EDUs appear in the wrong order)
  • solution: I'll try to parse the original *.dis files of the RST-DT instead, which use a Lisp-style s-expression format (cf. rstdt-lisp-import-test)
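
To give an idea of what parsing the s-expression files involves, here is a minimal sketch of a generic s-expression reader. It is not the discoursegraphs importer. It assumes that EDU text is wrapped in `_!` ... `_!` delimiters and that everything else is either a parenthesis or a whitespace-separated atom. The example string is hand-written for illustration and not copied from the corpus; `parse_sexpr` and `TOKEN_RE` are hypothetical names.

```python
import re

# Assumption: text spans are wrapped in `_!` ... `_!`; all other tokens are
# parentheses or whitespace-separated atoms.
TOKEN_RE = re.compile(r'_!.*?_!|[()]|[^\s()]+', re.DOTALL)


def parse_sexpr(source):
    """Parse s-expressions into nested Python lists of strings."""
    stack = [[]]  # stack[0] collects all top-level expressions
    for token in TOKEN_RE.findall(source):
        if token == '(':
            stack.append([])
        elif token == ')':
            if len(stack) < 2:
                raise ValueError('unbalanced closing parenthesis')
            finished = stack.pop()
            stack[-1].append(finished)
        elif len(token) >= 4 and token.startswith('_!') and token.endswith('_!'):
            # strip the text delimiters, keep the text (incl. parentheses) intact
            stack[-1].append(token[2:-2])
        else:
            stack[-1].append(token)
    if len(stack) != 1:
        raise ValueError('unbalanced opening parenthesis')
    return stack[0]


# hand-written example, not an actual corpus file
example = '( Root (span 1 2) ( Nucleus (leaf 1) (text _!Stocks rose (slightly)_!) ) )'
tree = parse_sexpr(example)[0]
print(tree)
```

Treating the delimited text as a single token is what keeps parentheses inside an EDU from being mistaken for tree structure.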

In [20]:
import os
import sys
import glob
import discoursegraphs as dg

RST_ROOT_DIR = os.path.expanduser('~/repos/rst_discourse_treebank_rs3')

RST_TEST_FILE = os.path.join(RST_ROOT_DIR, 'untokenized', 'TRAINING', 'wsj_1337.rs3')

In [21]:
rdg = dg.read_rs3(RST_TEST_FILE)
#dg.get_text(rdg)

In [22]:
# for folder_name in ('TEST', 'TRAINING'):
#     folder = os.path.join(RST_ROOT_DIR, 'untokenized', folder_name)
#     for rst_file in glob.glob(os.path.join(folder, '*.rs3')):
#         try:
#             dg.read_rs3(rst_file)
#         except KeyError:
#             sys.stderr.write("failed: {}\n".format(rst_file))

Find & Parse corresponding PTB/WSJ files


In [23]:
wsj_doc_ids = [os.path.basename(rst_file).lower().split('_')[1].split('.')[0]
               for folder in (os.path.join(RST_ROOT_DIR, 'untokenized', 'TEST'),
                              os.path.join(RST_ROOT_DIR, 'untokenized', 'TRAINING'))
               for rst_file in glob.glob(os.path.join(folder, '*.rs3'))
               if os.path.basename(rst_file).lower().startswith('wsj')]

In [24]:
PTB_WSJ_ROOT_DIR = os.path.expanduser('~/corpora/pennTreebank/parsed/mrg/wsj')

PTB_WSJ_TEST_FILE = os.path.join(PTB_WSJ_ROOT_DIR, '13/wsj_1337.mrg')
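
The test path above hard-codes the section directory `13`. A possible way to derive the PTB path for any of the collected document IDs is sketched below. It assumes that every WSJ file sits in a directory named after the first two digits of its ID, as `wsj_1337.mrg` does. `ptb_path_for` and `doc_id_from_rs3` are hypothetical helpers and not part of discoursegraphs.

```python
import os


def doc_id_from_rs3(rs3_path):
    """Extract the WSJ document ID from an RST file name (wsj_1337.rs3 -> '1337')."""
    base = os.path.basename(rs3_path).lower()
    return base.split('_')[1].split('.')[0]


def ptb_path_for(doc_id, ptb_root):
    """Map a four-digit WSJ document ID to the path of its *.mrg file.

    Assumption: the first two digits of the ID name the section directory.
    """
    if len(doc_id) != 4 or not doc_id.isdigit():
        raise ValueError('expected a four-digit WSJ ID, got {!r}'.format(doc_id))
    section = doc_id[:2]
    return os.path.join(ptb_root, section, 'wsj_{}.mrg'.format(doc_id))


doc_id = doc_id_from_rs3('/some/dir/TRAINING/wsj_1337.rs3')
print(ptb_path_for(doc_id, '/some/ptb/wsj'))
```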

In [25]:
import nltk

ptb_path, ptb_filename = os.path.split(PTB_WSJ_TEST_FILE)
document = nltk.corpus.BracketParseCorpusReader(ptb_path, [ptb_filename])
parsed_sents_iter = document.parsed_sents()

In [26]:
sent0 = parsed_sents_iter[0]

In [27]:
for elem in sent0:
    print type(elem), elem.label()


<class 'nltk.tree.Tree'> NP-SBJ
<class 'nltk.tree.Tree'> VP
<class 'nltk.tree.Tree'> .

In [28]:
nltk.__version__


Out[28]:
'3.0.1'

In [29]:
# %load_ext gvmagic
# %dotstr dg.print_dot(ptbg)

In [30]:
ptbg = dg.read_ptb(PTB_WSJ_TEST_FILE)

In [31]:
dg.get_text(ptbg)


Out[31]:
"Tokyo stocks closed firmer Monday , with the Nikkei index making its fifth consecutive daily gain . Stocks also rose in London , while the Frankfurt market was mixed . In Tokyo , the Nikkei index added 99.14 to 35585.52 . The index moved above 35670 at midmorning , nearly reaching the record of 35689.98 set Sept. 28 . But the market lost part of the early gains on index-linked investment trust fund selling . In early trading in Tokyo Tuesday , the Nikkei index rose 1.08 points to 35586.60 . On Monday , traders noted that some investors took profits against the backdrop of the Nikkei 's fast-paced recovery following its plunge last Monday in reaction to the Oct. 13 drop in New York stock prices . But overall buying interest remained strong through Monday , with many observers saying they expect the Nikkei to continue with moderate gains this week . Turnover remained relatively small . Volume on the first section was estimated at 600 million shares , down from 1.03 billion shares Friday . The Tokyo stock price index of first section issues was up 7.81 at 2687.53 . Relatively stable foreign currency dealings Monday were viewed favorably by market players , traders said . But institutional investors may wait a little longer to appraise the direction of the U.S. monetary policy and the dollar , traders said . Hiroyuki Wada , general manager of the stock department at Okasan Securities , said Monday 's trading was `` unfocused . '' He said investors were picking individual stocks based on specific incentives and the likelihood of a wider price increase over the short term . The selective approach blurred themes such as domestic-demand issues , large-capitalization issues or high-technology shares , which had been providing at least some trading direction over the past few weeks , Mr. Wada said . Investors took profits on major construction shares , which advanced last week , shifting their attention to some midsize companies such as Aoki Corp. , Tobishima and Maeda . 
Aoki gained 60 yen to 1,480 yen -LRB- $ 10.40 -RRB- . Some pharmaceutical shares were popular on rumors related to new products to be introduced at a cancer conference that opened in Nagoya . Teijin was up 15 at 936 , and Kyowa Hakko gained 30 to 1,770 . Mochida advanced 40 to 4,440 . Fujisawa continued to attract investors because of strong earning prospects stemming from a new immune control agent . Fujisawa gained 50 to 2,060 . Kikkoman was up 30 to 1,600 , receiving investor interest for its land property holdings near Tokyo , a trader said . London prices closed modestly higher in the year 's thinnest turnover , a condition that underscored a lack of conviction ahead of a U.K. balance of payments report Tuesday . Limited volume ahead of the September trade data showed the market is nervous , but dealers added that the day 's modest gains also signaled some support for London equities . They pegged the support largely to anticipation that Britain 's current account imbalance ca n't be much worse than the near record deficits seen in July and August . `` It 's a case of the market being too high to buy and too afraid to sell , '' a senior dealer with Kleinwort Benson Securities said . `` It 's better to wait . '' The Financial Times 100-share index finished 10.6 points higher at 2189.7 . The 30-share index closed 11.6 points higher at 1772.6 . Volume was 276.8 million shares , beneath the year 's previous low of 280.5 million shares Sept. 25 , the session before the August trade figures were released . Analysts ' expectations suggest a September current account deficit of # 1.6 billion -LRB- $ 2.54 billion -RRB- , compared with August 's # 2.0 billion deficit . Dealers , however , said forecasts are broadly divergent with estimates ranging between # 1 billion and # 2 billion . `` The range of expectations is so broad , '' a dealer at another major U.K. 
brokerage firm said , `` the deficit may have to be nearer or above # 2 billion for it to have any impact on the market . '' Lucas Industries , a British automotive and aerospace concern , rose 13 pence to 614 pence after it said its pretax profit for the year rose 28 % . Share prices on the Frankfurt stock exchange closed narrowly mixed in quiet dealings after recovering most of their early losses . The DAX index eased 0.99 point to end at 1523.22 after falling 5.5 points early in the session . Brokers said the declines early in the day were partly caused by losses of the ruling Christian-Democratic Union in communal elections in the state of Baden-Wuerttemberg . The start of a weeklong conference by the IG Metall metal worker union in Berlin is drawing attention to the impending wage negotiations , which could boost companies ' personnel costs next year , they said . But there was little selling pressure , and even small orders at the lower levels sufficed to bring the market back to Friday 's opening levels . Traders said the thin trading volume points to continued uncertainty by most investors following last Monday 's record 13 % loss . The market is still 4 % short of its level before the plunge , and analysts are n't sure how long it will take until the DAX has closed that gap . But Norbert Braeuer , chief trader at Hessische Landesbank Girozentrale -LRB- Helaba -RRB- , said he expects share prices to move upward in the coming weeks . Banking stocks were the major gainers Monday amid hope that interest rates have peaked , as Deutsche Bank and Dresdner Bank added 4 marks each to 664 marks -LRB- $ 357 -RRB- and 326 marks , respectively . Commerzbank gained 1 to 252.5 . Auto shares were mixed , as Daimler-Benz firmed 2 to 723 , Bayerische Motoren Werke lost the same amount to 554 , and Volkswagen inched down 1.4 to 451.6 . Elsewhere , prices closed higher in Amsterdam , lower in Zurich , Stockholm and Milan , mixed in Brussels and unchanged in Paris . 
Shares closed higher in Hong Kong , Singapore and Manila , and were lower in Sydney , Seoul and Taipei . Wellington was closed . Here are price trends on the world 's major stock markets , as calculated by Morgan Stanley Capital International Perspective , Geneva . To make them directly comparable , each index is based on the close of 1969 equaling 100 . The percentage change is since year-end ."

In [32]:
shortg = dg.read_ptb(PTB_WSJ_TEST_FILE, limit=1)

In [33]:
dg.write_dot(shortg, '/tmp/shortg.dot')

In [34]:
import networkx as nx
import matplotlib.pyplot as plt

%pylab inline

G = shortg

plt.title("draw_networkx")
pos=nx.graphviz_layout(G,prog='dot')
nx.draw(G,pos,with_labels=True,arrows=True)
plt.savefig('nx_test.png')  # save before show(); afterwards the figure is empty
plt.show()


Populating the interactive namespace from numpy and matplotlib
<matplotlib.figure.Figure at 0x644a790>

In [35]:
for n, ndict in shortg.nodes(data=True):
    print n,

print '\n\n'

for s,t, edict in shortg.edges(data=True):
    print "({}, {})".format(s,t),


0 1 2 3 5 7 8 10 11 13 14 16 18 19 21 22 23 25 27 29 30 32 33 35 36 38 40 42 44 


(0, 1) (1, 2) (1, 44) (1, 7) (2, 3) (2, 5) (7, 8) (7, 16) (7, 10) (7, 18) (7, 13) (10, 11) (13, 14) (18, 19) (18, 21) (21, 29) (21, 22) (22, 25) (22, 27) (22, 23) (29, 32) (29, 30) (32, 40) (32, 33) (32, 42) (32, 35) (35, 36) (35, 38)

In [36]:
for e in nx.bfs_edges(shortg, 0):
    print e,


(0, 1) (1, 2) (1, 44) (1, 7) (2, 3) (2, 5) (7, 8) (7, 16) (7, 10) (7, 18) (7, 13) (10, 11) (18, 19) (18, 21) (13, 14) (21, 29) (21, 22) (29, 32) (29, 30) (22, 25) (22, 27) (22, 23) (32, 40) (32, 33) (32, 42) (32, 35) (35, 36) (35, 38)
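
The two listings contain the same edges but in a different order: the breadth-first traversal emits all edges of one tree level before it descends to the next. The sketch below reproduces that behaviour with the standard library only. It is an illustration, not the networkx implementation, and the adjacency dict is a truncated, hand-copied subset of the graph above.

```python
from collections import deque


def bfs_edges(children, root):
    """Yield (parent, child) pairs level by level, starting at `root`."""
    seen = {root}
    queue = deque([root])
    while queue:
        parent = queue.popleft()
        for child in children.get(parent, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
                yield (parent, child)


# truncated subset of the tree above (node IDs only)
children = {0: [1], 1: [2, 44, 7], 2: [3, 5], 7: [8, 10], 10: [11]}
edges = list(bfs_edges(children, 0))
print(edges)
```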

In [37]:
# dg.get_text(ptbg, ptbg.sentences[1]) # TODO: TypeError: expected string or buffer

In [38]:
%load_ext gvmagic
%dotstr dg.print_dot(shortg, ignore_node_labels=True)
# %dotstr dg.print_dot(ptbg)


[Graphviz rendering of shortg (wsj_1337.mrg, first sentence only): a tree of node IDs with the same edges as listed above]

Problem: RST-DT files are untokenized

rdg = dg.read_rs3(RST_TEST_FILE)
pdg = dg.read_ptb(PTB_WSJ_TEST_FILE)
rdg.merge_graphs(pdg)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-20-d2261e9abd4f> in <module>()
      1 rdg = dg.read_rs3(RST_TEST_FILE)
      2 pdg = dg.read_ptb(PTB_WSJ_TEST_FILE)
----> 3 rdg.merge_graphs(pdg)

/usr/local/lib/python2.7/dist-packages/discoursegraphs-0.1.2-py2.7.egg/discoursegraphs/discoursegraph.pyc in merge_graphs(self, other_docgraph)
    554         """
    555         # renaming the tokens of the other graph to match this one
--> 556         rename_tokens(other_docgraph, self)
    557         self.add_nodes_from(other_docgraph.nodes(data=True))
    558 

/usr/local/lib/python2.7/dist-packages/discoursegraphs-0.1.2-py2.7.egg/discoursegraphs/discoursegraph.pyc in rename_tokens(docgraph_with_old_names, docgraph_with_new_names)
    602     """
    603     old2new = create_token_mapping(docgraph_with_old_names,
--> 604                                    docgraph_with_new_names)
    605     relabel_nodes(docgraph_with_old_names, old2new, copy=False)
    606     new_token_ids = old2new.values()

/usr/local/lib/python2.7/dist-packages/discoursegraphs-0.1.2-py2.7.egg/discoursegraphs/discoursegraph.pyc in create_token_mapping(docgraph_with_old_names, docgraph_with_new_names)
    644                     docgraph_with_new_names.name, docgraph_with_new_names.ns,
    645                     docgraph_with_old_names.name, docgraph_with_old_names.ns,
--> 646                     new_tok, old_tok).encode('utf-8'))
    647         else:
    648             old2new[old_tok_id] = new_tok_id

ValueError: Tokenization mismatch: wsj_1337.rs3 (rst) vs. wsj_1337.mrg (ptb)
    Monday, != Monday
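
The error boils down to comparing two token sequences position by position. A minimal sketch of such a check follows. It is not the discoursegraphs code, and it assumes that the untokenized RST text is simply split on whitespace. The RST string is reconstructed by gluing the comma back onto the PTB text; `first_mismatch` is a hypothetical helper.

```python
def first_mismatch(tokens_a, tokens_b):
    """Return (index, token_a, token_b) for the first difference, else None."""
    for index, (tok_a, tok_b) in enumerate(zip(tokens_a, tokens_b)):
        if tok_a != tok_b:
            return index, tok_a, tok_b
    if len(tokens_a) != len(tokens_b):
        # one sequence is a proper prefix of the other
        shorter = min(len(tokens_a), len(tokens_b))
        rest_a = tokens_a[shorter] if len(tokens_a) > shorter else None
        rest_b = tokens_b[shorter] if len(tokens_b) > shorter else None
        return shorter, rest_a, rest_b
    return None


rst_tokens = 'Tokyo stocks closed firmer Monday,'.split()   # untokenized
ptb_tokens = 'Tokyo stocks closed firmer Monday ,'.split()  # PTB-tokenized
mismatch = first_mismatch(rst_tokens, ptb_tokens)
print(mismatch)
```

The comma stays attached to `Monday` on the RST side, so the two sequences diverge at the fifth token. That is why the RST segments have to be tokenized before the graphs can be merged.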

Tokenize RST-DT files with Stanford CoreNLP


In [39]:
from simplejson import loads

import corenlp
from corenlp import StanfordCoreNLP
cnlp = StanfordCoreNLP()  # wait a few minutes...


Loading Models: 5/5                                                            

In [40]:
from discoursegraphs.readwrite.rst import get_edus

rdg_untokenized = dg.read_rs3(RST_TEST_FILE, tokenize=False)

In [41]:
rdg_untokenized.node['rst:138']


Out[41]:
{'label': u'[s]:138: before the August tr...',
 'layers': {'rst', 'rst:segment'},
 'rst:segment_type': 'satellite',
 'rst:text': u'before the August trade figures were released.'}

In [42]:
# for edu_root_node in get_edus(rdg_untokenized):
#     edu_str = rdg_untokenized.node[edu_root_node]['rst:text']
# #     print edu_root_node
# #     print edu_str
#     result_str = cnlp.parse(edu_str.encode('utf-8'))
#     result_sentences = loads(result_str)['sentences']
#     for sentence in result_sentences:
#         tokenized_sentence = u' '.join(word for (word, annotations) in sentence['words'])

In [44]:
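# NOTE: a bare `%time` only times an empty statement (hence the microseconds below);
# use `%%time` as the first line of the cell to time the whole loop.
%time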
from lxml import etree
import simplejson
import codecs

for folder in ('TEST', 'TRAINING'):
    for rst_file in glob.glob(os.path.join(RST_ROOT_DIR, 'untokenized', folder, '*.rs3')):
        tree = etree.parse(rst_file)
        for segment in tree.iter('segment'):
            untokenized_segment = segment.text
#             print untokenized_segment
            corenlp_result = cnlp.parse(untokenized_segment)
            segment.text = u' '.join(word
                                     for sentence in simplejson.loads(corenlp_result)['sentences']
                                     for (word, annotations) in sentence['words'])
#             print segment.text, '\n\n'
        output_dir = os.path.join(RST_ROOT_DIR, 'tokenized', folder)
        dg.util.create_dir(output_dir)
        with codecs.open(os.path.join(output_dir, os.path.basename(rst_file)), 'w', encoding='utf-8') as outfile:
            outfile.write(etree.tounicode(tree, pretty_print=True))


CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 5.01 µs
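
The loop above can be tried without the corpus or a running CoreNLP instance. The sketch below keeps the same idea (replace each segment's text with its space-joined tokens) but makes the tokenizer pluggable. It uses a crude regex tokenizer as a stand-in, which is NOT PTB-compliant, and the standard library's ElementTree instead of lxml. The XML snippet is hand-written and `tokenize_segments` / `simple_tokenize` are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET


def simple_tokenize(text):
    """Stand-in tokenizer: splits off punctuation (NOT PTB-compliant)."""
    return re.findall(r"\w+(?:[.'-]\w+)*|[^\w\s]", text, re.UNICODE)


def tokenize_segments(root, tokenize=simple_tokenize):
    """Replace the text of each <segment> with its space-joined tokens, in place."""
    for segment in root.iter('segment'):
        if segment.text:  # skip empty segments instead of passing None on
            segment.text = u' '.join(tokenize(segment.text))
    return root


# hand-written miniature document, not an actual corpus file
rs3 = ET.fromstring(
    '<rst><body>'
    '<segment id="1">Tokyo stocks closed firmer Monday,</segment>'
    '<segment id="2">before the August trade figures were released.</segment>'
    '<segment id="3"/>'
    '</body></rst>')
tokenize_segments(rs3)
texts = [segment.text for segment in rs3.iter('segment')]
print(texts)
```

Passing the tokenizer in as an argument would allow the CoreNLP call to be swapped in for the real run. The guard against empty segments is a precaution; whether the corpus actually contains any was not checked here.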

In [ ]: