In [2]:
import popplerqt4

In [3]:
filepath = '/Users/chbrown/work/tmp/pdfconv/nakassis.pdf'

In [9]:
filepath = '/Users/chbrown/github/acl/anthology/P/P13/P13-1006.pdf'

In [16]:
doc = popplerqt4.Poppler.Document.load(filepath)

In [17]:
print 'Document has %d pages' % doc.numPages()


Document has 11 pages

In [18]:
doc.pageMode()  # 0 is presumably Poppler.Document.UseNone, the first PageMode value


Out[18]:
0

In [12]:
print dir(doc)


['AcroForm', 'Antialiasing', 'ArthurBackend', 'FormType', 'FullScreen', 'NoForm', 'NoLayout', 'OneColumn', 'OverprintPreview', 'PageLayout', 'PageMode', 'RenderBackend', 'RenderHint', 'RenderHints', 'SinglePage', 'SplashBackend', 'TextAntialiasing', 'TextHinting', 'TextSlightHinting', 'ThinLineShape', 'ThinLineSolid', 'TwoColumnLeft', 'TwoColumnRight', 'TwoPageLeft', 'TwoPageRight', 'UseAttach', 'UseNone', 'UseOC', 'UseOutlines', 'UseThumbs', 'XfaForm', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'availableRenderBackends', 'colorDisplayProfile', 'colorRgbProfile', 'date', 'embeddedFiles', 'fontData', 'fonts', 'formType', 'getPdfId', 'getPdfVersion', 'hasEmbeddedFiles', 'hasOptionalContent', 'info', 'infoKeys', 'isEncrypted', 'isLinearized', 'isLocked', 'linkDestination', 'load', 'loadFromData', 'metadata', 'newFontIterator', 'numPages', 'okToAddNotes', 'okToAssemble', 'okToChange', 'okToCopy', 'okToCreateFormFields', 'okToExtractForAccessibility', 'okToFillForm', 'okToPrint', 'okToPrintHighRes', 'optionalContentModel', 'page', 'pageLayout', 'pageMode', 'paperColor', 'pdfConverter', 'psConverter', 'renderBackend', 'renderHints', 'scripts', 'setColorDisplayProfile', 'setColorDisplayProfileName', 'setPaperColor', 'setRenderBackend', 'setRenderHint', 'toc', 'unlock']

In [19]:
page = doc.page(4)  # page indices are zero-based, so this is the fifth page

In [ ]:
texts = page.textList()

In [20]:
for t in page.textList():
    print t.text()


two
lattices.
This
jointly
selects
the
optimal
detec-
tions
to
form
the
track,
together
with
the
optimal
state
sequence,
and
scores
that
combination.
over,
we
wish
the
track
to
be
temporally
coherent;
we
want
the
objects
in
a
track
to
move
smoothly
over
time
and
not
jump
around
the
field
of
view.
Let
G(D
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-20-02f30c156955> in <module>()
      1 for t in page.textList():
----> 2     print t.text()

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2212' in position 1: ordinal not in range(128)
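
The traceback above indicates the text is being encoded as ASCII on its way to the output, and ASCII cannot represent U+2212 (MINUS SIGN). A minimal sketch of one workaround is to encode to UTF-8 explicitly before writing. `encode_for_output` and the sample string are hypothetical and not part of popplerqt4; in the notebook the argument would be `unicode(t.text())`.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def encode_for_output(text, encoding='utf-8'):
    """Encode a unicode string to bytes; pass byte strings through unchanged.

    Hypothetical helper, not part of popplerqt4.
    """
    if isinstance(text, bytes):
        return text
    return text.encode(encoding)


# Made-up sample containing U+2212 (MINUS SIGN), the character named in the
# traceback. In the notebook this would be unicode(t.text()) instead.
sample = u'x\u2212y'

try:
    sample.encode('ascii')
except UnicodeEncodeError as exc:
    # Same failure as in the traceback, reproduced without Poppler.
    print('ascii fails:', exc.reason)

print(repr(encode_for_output(sample)))
```

In the Python 2 loop above, the equivalent would be `print encode_for_output(unicode(t.text()))`, assuming the notebook or terminal accepts UTF-8 bytes.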

In [15]:
# t1 is presumably the first text box on the page, e.g. t1 = page.textList()[0]
t1.text(), t1.boundingBox(), t1.charBoundingBox(1), t1.hasSpaceAfter(), t1.nextWord()


Out[15]:
(PyQt4.QtCore.QString(u'two'),
 PyQt4.QtCore.QRectF(307.27699999999993, 63.68599159999985, 16.25455900000003, 13.1454655),
 PyQt4.QtCore.QRectF(310.30972979999996, 63.68599159999985, 7.767279200000019, 13.1454655),
 True,
 <popplerqt4.TextBox at 0x10c272df8>)
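
A `QRectF` is constructed as (x, y, width, height), so words on the same text line should share roughly the same `y`. A minimal sketch of grouping words into lines on that assumption follows. It works on plain tuples rather than `QRectF` objects; `group_into_lines` and its `tolerance` parameter are hypothetical, and every box except the first is made up for illustration.

```python
def group_into_lines(words, tolerance=2.0):
    """Group (text, x, y, width, height) tuples into lines of text.

    Words whose top edge `y` lies within `tolerance` of the first word of the
    current line are placed on that line; each line is sorted left to right.
    """
    lines = []
    for word in sorted(words, key=lambda w: (w[2], w[1])):
        if lines and abs(word[2] - lines[-1][0][2]) <= tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [' '.join(w[0] for w in sorted(line, key=lambda w: w[1]))
            for line in lines]


words = [
    ('two', 307.277, 63.686, 16.255, 13.145),      # rounded from Out[15]
    ('lattices.', 326.5, 63.686, 38.0, 13.145),    # made-up box
    ('selects', 340.0, 77.2, 30.0, 13.145),        # made-up box
    ('jointly', 307.277, 77.2, 29.0, 13.145),      # made-up box
]
print(group_into_lines(words))
```

On a two-column page such as this one, words from both columns can share a `y`, so a fuller version would probably need to split on `x` as well.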

In [12]:
text1 = t1.text()

In [15]:
# http://www.cs.berkeley.edu/~dlwh/papers/spanparser.pdf
spanparser_filepath = '/Users/chbrown/work/tmp/pdfconv/spanparser.pdf'
spanparser_doc = popplerqt4.Poppler.Document.load(spanparser_filepath)
print ' # pages: %d' % spanparser_doc.numPages()


 # pages: 10

In [16]:
spanparser_page = spanparser_doc.page(0)
spanparser_textlist = spanparser_page.textList()

In [19]:
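def body(textboxes):
    """Yield each word as unicode, followed by a space where Poppler reports one."""
    for textbox in textboxes:
        # unicode() converts the QString returned by text()
        yield unicode(textbox.text())
        if textbox.hasSpaceAfter():
            yield ' '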

print ''.join(body(spanparser_textlist))


Less Grammar, More FeaturesDavid HallGreg DurrettDan KleinComputer Science DivisionUniversity of California, Berkeley{dlwh,gdurrett,klein}@cs.berkeley.eduAbstractWe present a parser that relies primar-ily on extracting information directly fromsurface spans rather than on propagat-ing information through enriched gram-mar structure. For example, instead of cre-ating separate grammar symbols to markthe definiteness of an NP, our parser mightinstead capture the same information fromthe first word of the NP. Moving contextout of the grammar and onto surface fea-tures can greatly simplify the structuralcomponent of the parser: because so manydeep syntactic cues have surface reflexes,our system can still parse accurately withcontext-free backbones as minimal as X-bar grammars. Keeping the structuralbackbone simple and moving features tothe surface also allows easy adaptationto new languages and even to new tasks.On the SPMRL 2013 multilingual con-stituency parsing shared task (Seddah etal., 2013), our system outperforms the topsingle parser system of Bj¨orkelund et al.(2013) on a range of languages. In addi-tion, despite being designed for syntacticanalysis, our system also achieves state-of-the-art numbers on the structural senti-ment task of Socher et al. (2013). Finally,we show that, in both syntactic parsing andsentiment analysis, many broad linguistictrends can be captured via surface features.1IntroductionNa¨ıve context-free grammars, such as those em-bodied by standard treebank annotations, do notparse well because their symbols have too littlecontext to constrain their syntactic behavior. Forexample, to PPs usually attach to verbs and ofPPs usually attach to nouns, but a context-free PPsymbol can equally well attach to either. Muchof the last few decades of parsing research hastherefore focused on propagating contextual in-formation from the leaves of the tree to inter-nal nodes. 
For example, head lexicalization (Eis-ner, 1996; Collins, 1997; Charniak, 1997), struc-tural annotation (Johnson, 1998; Klein and Man-ning, 2003), and state-splitting (Matsuzaki et al.,2005; Petrov et al., 2006) are all designed to takecoarse symbols like PP and decorate them withadditional context. The underlying reason thatsuch propagation is even needed is that PCFGparsers score trees based on local configurationsonly, and any information that is not threadedthrough the tree becomes inaccessible to the scor-ing function. There have been non-local ap-proaches as well, such as tree-substitution parsers(Bod, 1993; Sima’an, 2000), neural net parsers(Henderson, 2003), and rerankers (Collins andKoo, 2005; Charniak and Johnson, 2005; Huang,2008). These non-local approaches can actuallygo even further in enriching the grammar’s struc-tural complexity by coupling larger domains invarious ways, though their non-locality generallycomplicates inference.In this work, we instead try to minimize thestructural complexity of the grammar by movingas much context as possible onto local surface fea-tures. We examine the position that grammarsshould not propagate any information that is avail-able from surface strings, since a discriminativeparser can access that information directly. Wetherefore begin with a minimal grammar and it-eratively augment it with rich input features thatdo not enrich the context-free backbone. Previ-ous work has also used surface features in theirparsers, but the focus has been on machine learn-ing methods (Taskar et al., 2004), latent annota-tions (Petrov and Klein, 2008a; Petrov and Klein,2008b), or implementation (Finkel et al., 2008).By contrast, we investigate the extent to which
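
In the output above, words at line ends run together (`FeaturesDavid`, `primar-ily`), which suggests `hasSpaceAfter()` is false for the last word of each line. A minimal sketch that treats a missing space as a line break: it rejoins words split by a trailing hyphen and inserts a space otherwise. `FakeTextBox` is a hypothetical stand-in for `popplerqt4.TextBox` so the example runs without Poppler, and `body_with_breaks` is an illustrative name.

```python
class FakeTextBox(object):
    """Hypothetical stand-in exposing the two TextBox methods used below."""

    def __init__(self, text, space_after):
        self._text = text
        self._space_after = space_after

    def text(self):
        return self._text

    def hasSpaceAfter(self):
        return self._space_after


def body_with_breaks(textboxes):
    """Yield text pieces, assuming a missing space marks the end of a line."""
    for textbox in textboxes:
        # The original notebook wraps this in unicode() to convert the QString.
        text = textbox.text()
        if textbox.hasSpaceAfter():
            yield text
            yield ' '
        elif text.endswith('-'):
            # Word split across lines: drop the hyphen, join to the next word.
            yield text[:-1]
        else:
            # Ordinary line end: separate from the next line with a space.
            yield text
            yield ' '


boxes = [
    FakeTextBox('relies', True),
    FakeTextBox('primar-', False),
    FakeTextBox('ily', True),
    FakeTextBox('Features', False),
    FakeTextBox('David', True),
    FakeTextBox('Hall', False),
]
print(''.join(body_with_breaks(boxes)).strip())
```

This also strips genuine hyphens that happen to fall at a line end (the `X-bar` in the output may be one), so a dictionary check or similar heuristic would likely be needed before relying on it.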