Python Parsing with NLTK

(C) 2017-2020 by Damir Cavar

Download: This and various other Jupyter notebooks are available from my GitHub repo.

This is a tutorial related to the discussion of grammar engineering and parsing in the class Alternative Syntactic Theories and Advanced Natural Language Processing taught at Indiana University in Spring 2017 and Fall 2018.

Working with Grammars

The following examples are taken from the NLTK parsing HOWTO page.


In [1]:
from nltk import Nonterminal, nonterminals, Production, CFG

In [2]:
nt1 = Nonterminal('NP')
nt2 = Nonterminal('VP')

In [3]:
nt1.symbol()


Out[3]:
'NP'

In [4]:
nt1 == Nonterminal('NP')


Out[4]:
True

In [5]:
nt1 == nt2


Out[5]:
False

In [6]:
S, NP, VP, PP = nonterminals('S, NP, VP, PP')
print(S.symbol())


S

In [7]:
N, V, P, DT = nonterminals('N, V, P, DT')

In [8]:
prod1 = Production(S, [NP, VP])

In [9]:
prod2 = Production(NP, [DT, NP])

In [10]:
prod1.lhs()


Out[10]:
S

In [11]:
prod1.rhs()


Out[11]:
(NP, VP)

In [12]:
prod1 == Production(S, [NP, VP])


Out[12]:
True

In [13]:
prod1 == prod2


Out[13]:
False

In [14]:
grammar = CFG.fromstring("""
 S -> NP VP
 PP -> P NP
 PP -> P NP
 NP -> 'the' N | N PP | 'the' N PP
 VP -> V NP | V PP | V NP PP
 N -> 'cat'
 N -> 'fish'
 N -> 'aligator'
 N -> 'dog'
 N -> 'rug'
 N -> 'mouse'
 V -> 'chased'
 V -> 'sat'
 P -> 'in'
 P -> 'on'
""")

In [15]:
print(grammar)


Grammar with 19 productions (start state = S)
    S -> NP VP
    PP -> P NP
    PP -> P NP
    NP -> 'the' N
    NP -> N PP
    NP -> 'the' N PP
    VP -> V NP
    VP -> V PP
    VP -> V NP PP
    N -> 'cat'
    N -> 'fish'
    N -> 'aligator'
    N -> 'dog'
    N -> 'rug'
    N -> 'mouse'
    V -> 'chased'
    V -> 'sat'
    P -> 'in'
    P -> 'on'

Feature Structures

One can build complex feature structures using the following strategies:


In [16]:
import nltk

fstr = nltk.FeatStruct("[POS='N', AGR=[PER=3, NUM='pl', GND='fem']]")
print(fstr)


[       [ GND = 'fem' ] ]
[ AGR = [ NUM = 'pl'  ] ]
[       [ PER = 3     ] ]
[                       ]
[ POS = 'N'             ]

Creating shared paths is also possible:


In [17]:
fstr2 = nltk.FeatStruct("""[NAME='Lee', ADDRESS=(1)[NUMBER=74, STREET='rue Pascal'],
                          SPOUSE=[NAME='Kim', ADDRESS->(1)]]""")
print(fstr2)


[ ADDRESS = (1) [ NUMBER = 74           ] ]
[               [ STREET = 'rue Pascal' ] ]
[                                         ]
[ NAME    = 'Lee'                         ]
[                                         ]
[ SPOUSE  = [ ADDRESS -> (1)  ]           ]
[           [ NAME    = 'Kim' ]           ]

Let us create feature structures and try out unification:


In [21]:
fs1 = nltk.FeatStruct("[AGR=[PER=3, NUM='pl', GND='fem'], POS='N']")
fs2 = nltk.FeatStruct("[POS='N', AGR=[PER=3, GND='fem']]")

print(fs1.unify(fs2))


[       [ GND = 'fem' ] ]
[ AGR = [ NUM = 'pl'  ] ]
[       [ PER = 3     ] ]
[                       ]
[ POS = 'N'             ]

Chart Parser

The following examples are taken from the NLTK parsing HOWTO page.


In [22]:
import nltk

In [23]:
nltk.parse.chart.demo(2, print_times=False, trace=1,
                       sent='I saw a dog', numparses=1)


* Sentence:
I saw a dog
['I', 'saw', 'a', 'dog']

* Strategy: Bottom-up

|.    I    .   saw   .    a    .   dog   .|
|[---------]         .         .         .| [0:1] 'I'
|.         [---------]         .         .| [1:2] 'saw'
|.         .         [---------]         .| [2:3] 'a'
|.         .         .         [---------]| [3:4] 'dog'
|>         .         .         .         .| [0:0] NP -> * 'I'
|[---------]         .         .         .| [0:1] NP -> 'I' *
|>         .         .         .         .| [0:0] S  -> * NP VP
|>         .         .         .         .| [0:0] NP -> * NP PP
|[--------->         .         .         .| [0:1] S  -> NP * VP
|[--------->         .         .         .| [0:1] NP -> NP * PP
|.         >         .         .         .| [1:1] Verb -> * 'saw'
|.         [---------]         .         .| [1:2] Verb -> 'saw' *
|.         >         .         .         .| [1:1] VP -> * Verb NP
|.         >         .         .         .| [1:1] VP -> * Verb
|.         [--------->         .         .| [1:2] VP -> Verb * NP
|.         [---------]         .         .| [1:2] VP -> Verb *
|.         >         .         .         .| [1:1] VP -> * VP PP
|[-------------------]         .         .| [0:2] S  -> NP VP *
|.         [--------->         .         .| [1:2] VP -> VP * PP
|.         .         >         .         .| [2:2] Det -> * 'a'
|.         .         [---------]         .| [2:3] Det -> 'a' *
|.         .         >         .         .| [2:2] NP -> * Det Noun
|.         .         [--------->         .| [2:3] NP -> Det * Noun
|.         .         .         >         .| [3:3] Noun -> * 'dog'
|.         .         .         [---------]| [3:4] Noun -> 'dog' *
|.         .         [-------------------]| [2:4] NP -> Det Noun *
|.         .         >         .         .| [2:2] S  -> * NP VP
|.         .         >         .         .| [2:2] NP -> * NP PP
|.         [-----------------------------]| [1:4] VP -> Verb NP *
|.         .         [------------------->| [2:4] S  -> NP * VP
|.         .         [------------------->| [2:4] NP -> NP * PP
|[=======================================]| [0:4] S  -> NP VP *
|.         [----------------------------->| [1:4] VP -> VP * PP
Nr edges in chart: 33
(S (NP I) (VP (Verb saw) (NP (Det a) (Noun dog))))

This is an example how to apply top-down parsing:


In [24]:
nltk.parse.chart.demo(1, print_times=True, trace=0,
                       sent='she killed the man with the tie', numparses=2)


* Sentence:
she killed the man with the tie
['she', 'killed', 'the', 'man', 'with', 'the', 'tie']

* Strategy: Top-down

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-24-8b6bf75d1ad9> in <module>
      1 nltk.parse.chart.demo(1, print_times=True, trace=0,
----> 2                        sent='she killed the man with the tie', numparses=2)

c:\program files\python37\lib\site-packages\nltk\parse\chart.py in demo(choice, print_times, print_grammar, print_trees, trace, sent, numparses)
   1805         cp = ChartParser(grammar, strategies[strategy][1], trace=trace)
   1806         t = time.time()
-> 1807         chart = cp.chart_parse(tokens)
   1808         parses = list(chart.parses(grammar.start()))
   1809 

c:\program files\python37\lib\site-packages\nltk\parse\chart.py in chart_parse(self, tokens, trace)
   1447 
   1448         tokens = list(tokens)
-> 1449         self._grammar.check_coverage(tokens)
   1450         chart = self._chart_class(tokens)
   1451         grammar = self._grammar

c:\program files\python37\lib\site-packages\nltk\grammar.py in check_coverage(self, tokens)
    682             missing = ', '.join('%r' % (w,) for w in missing)
    683             raise ValueError(
--> 684                 "Grammar does not cover some of the " "input words: %r." % missing
    685             )
    686 

ValueError: Grammar does not cover some of the input words: "'she', 'killed', 'man', 'tie'".

This is how to apply bottom-up parsing:


In [25]:
nltk.parse.chart.demo(2, print_times=False, trace=0,
                       sent='I saw John with a dog', numparses=2)


* Sentence:
I saw John with a dog
['I', 'saw', 'John', 'with', 'a', 'dog']

* Strategy: Bottom-up

Nr edges in chart: 53
(S
  (NP I)
  (VP (VP (Verb saw) (NP John)) (PP with (NP (Det a) (Noun dog)))))
(S
  (NP I)
  (VP (Verb saw) (NP (NP John) (PP with (NP (Det a) (Noun dog))))))


In [ ]:
nltk.parse.featurechart.demo(print_times=False,
                              print_grammar=True,
                              parser=nltk.parse.featurechart.FeatureChartParser,
                              sent='I saw John with a dog')

Loading grammars from files and editing them

We will need the following NLTK modules in this section:


In [26]:
import nltk
from nltk import CFG
from nltk.grammar import FeatureGrammar as FCFG

We can load a grammar from a file, that is located in the same folder as the current Jupyter notebook, in the following way:


In [27]:
cfg = nltk.data.load('spanish1.cfg')
print(cfg)


Grammar with 31 productions (start state = S)
    S -> SN SV
    SV -> v SN
    SV -> v
    SN -> det GN
    GN -> nom_com
    GN -> nom_prop
    det -> 'el'
    det -> 'la'
    det -> 'los'
    det -> 'las'
    det -> 'un'
    det -> 'una'
    det -> 'unos'
    det -> 'unas'
    nom_com -> 'vecino'
    nom_com -> 'ladrones'
    nom_com -> 'mujeres'
    nom_com -> 'bosques'
    nom_com -> 'noche'
    nom_com -> 'flauta'
    nom_com -> 'ventana'
    nom_prop -> 'Jose'
    nom_prop -> 'Lucas'
    nom_prop -> 'Pedro'
    nom_prop -> 'Marta'
    v -> 'toca'
    v -> 'moja'
    v -> 'adoran'
    v -> 'robaron'
    v -> 'escondieron'
    v -> 'rompió'

We instantiate a ChartParser object with this grammar:


In [28]:
cp1 = nltk.parse.ChartParser(cfg)

The ChartParser object has a parse-function that takes a list of tokens as a parameter. The token list can be generated using a language specific tokenizer. In this case we simply tokenize using the Python-function split on strings. The output of the parse function is a list of trees. We loop through the list of parse trees and print them out:


In [29]:
"los mujeres adoran la Lucas".split()


Out[29]:
['los', 'mujeres', 'adoran', 'la', 'Lucas']

In [30]:
for x in cp1.parse("los mujeres adoran la Lucas".split()):
    print(x)


(S
  (SN (det los) (GN (nom_com mujeres)))
  (SV (v adoran) (SN (det la) (GN (nom_prop Lucas)))))

We can also edit a grammar directly:


In [32]:
cfg2 = CFG.fromstring("""
 S -> NP VP
 PP -> P NP
 NP -> 'the' N | N PP | 'the' N PP
 VP -> V NP | V PP | V NP PP
 N -> 'cat'
 N -> 'dog'
 N -> 'bird'
 N -> 'rug'
 N -> 'woman'
 N -> 'man'
 N -> 'tie'
 V -> 'chased'
 V -> 'killed'
 V -> 'sat'
 V -> 'bit'
 P -> 'in'
 P -> 'on'
 P -> 'with'
""")

We parse our example sentences using the same approach as above:


In [33]:
cp2 = nltk.parse.ChartParser(cfg2)
for x in cp2.parse("the woman killed the man with the tie".split()):
    print(x)


(S
  (NP the (N woman))
  (VP (V killed) (NP the (N man)) (PP (P with) (NP the (N tie)))))
(S
  (NP the (N woman))
  (VP (V killed) (NP the (N man) (PP (P with) (NP the (N tie))))))

The previous example included a Context-free grammar. In the following example we load a Context-free Grammar with Features, instantiate a FeatureChartParser, and loop through the parse trees that are generated by our grammar to print them out:


In [34]:
fcfg = nltk.data.load('spanish1.fcfg')
fcp1 = nltk.parse.FeatureChartParser(fcfg)
for x in fcp1.parse(u"Miguel adoró el gato".split()):
    print(x)


(S[]
  (SN[+PROP, gen=?g, num='singular'] (NP[num='singular'] Miguel))
  (SV[num='singular', tiempo='pasado']
    (VT[num='singular', tiempo='pasado'] adoró)
    (SN[-PROP, gen='masculino', num='singular']
      (DET[gen='masculino', num='singular'] el)
      (NC[gen='masculino', num='singular'] gato))))

We can edit a Feature CFG in the same way directly in this notebook and then parse with it:


In [ ]:
fcfg2 = FCFG.fromstring("""
% start CP
# ############################
# Grammar Rules
# ############################
CP -> Cbar[stype=decl]
Cbar[stype=decl] -> IP[+TNS]
IP[+TNS] -> DP[num=?n,pers=?p,case=nom] VP[num=?n,pers=?p]
DP[num=?n,pers=?p,case=?k] ->  Dbar[num=?n,pers=?p,case=?k]
Dbar[num=?n,pers=?p] -> D[num=?n,DEF=?d,COUNT=?c] NP[num=?n,pers=?p,DEF=?d,COUNT=?c]
Dbar[num=?n,pers=?p] -> NP[num=?n,pers=?p,DEF=?d,COUNT=?c]
Dbar[num=?n,pers=?p,case=?k] -> D[num=?n,pers=?p,+DEF,type=pron,case=?k]
NP[num=?n,pers=?p,COUNT=?c] -> N[num=?n,pers=?p,type=prop,COUNT=?c]
VP[num=?n,pers=?p] -> V[num=?n,pers=?p,val=1]
VP[num=?n,pers=?p] -> V[num=?n,pers=?p,val=2] DP[case=acc]
PP -> P DP[num=?n,pers=?p,case=acc]
#PP -> P DP[num=?n,pers=?p,case=dat]
#
# ############################
# Lexical Rules
# ############################
D[-DEF,+COUNT,num=sg] -> 'a'
D[-DEF,+COUNT,num=sg] -> 'an'
D[+DEF] -> 'the'
D[+DEF,gen=f,num=sg,case=nom,type=pron] -> 'she'
D[+DEF,gen=m,num=sg,case=nom,type=pron] -> 'he'
D[+DEF,gen=n,num=sg,type=pron] -> 'it'
D[+DEF,gen=f,num=sg,case=acc,type=pron] -> 'her'
D[+DEF,gen=m,num=sg,case=acc,type=pron] -> 'him'
N[num=sg,pers=3,type=prop] -> 'John' | 'Sara' | 'Mary'
V[tns=pres,num=sg,pers=3,val=2] -> 'loves' | 'calls' | 'sees' | 'buys'
N[num=sg,pers=3,-COUNT] -> 'furniture' | 'air' | 'justice'
N[num=sg,pers=3] -> 'cat' | 'dog' | 'mouse'
N[num=pl,pers=3] -> 'cats' | 'dogs' | 'mice'
V[tns=pres,num=sg,pers=3,val=1] -> 'sleeps' | 'snores'
V[tns=pres,num=sg,pers=1,val=1] -> 'sleep' | 'snore'
V[tns=pres,num=sg,pers=2,val=1] -> 'sleep' | 'snore'
V[tns=pres,num=pl,val=1] -> 'sleep' | 'snore'
V[tns=past,val=1] -> 'slept' | 'snored'
V[tns=pres,num=sg,pers=3,val=2] -> 'calls' | 'sees' | 'loves'
V[tns=pres,num=sg,pers=1,val=2] -> 'call' | 'see' | 'love'
V[tns=pres,num=sg,pers=2,val=2] -> 'call' | 'see' | 'love'
V[tns=pres,num=pl,val=2] -> 'call' | 'see' | 'love'
V[tns=past,val=2] -> 'called' | 'saw' | 'loved'
""")

We can now create a parser instance and parse with this grammar:


In [ ]:
fcp2 = nltk.parse.FeatureChartParser(fcfg2, trace=1)
sentence = "John buys the furniture"
result = list(fcp2.parse(sentence.split()))
if result:
    for x in result:
        print(x)
else:
    print("*", sentence)

Countable nouns and articles in a DP:


In [ ]:

DPs and pronouns


In [ ]:

CP/IP sentence structures


In [ ]:

Different Parsers

This is a list of the different Feature Parsers in NLTK.

  • nltk.parse.featurechart.FeatureChartParser
  • nltk.parse.featurechart.FeatureTopDownChartParser
  • nltk.parse.featurechart.FeatureBottomUpChartParser
  • nltk.parse.featurechart.FeatureBottomUpLeftCornerChartParser
  • nltk.parse.earleychart.FeatureIncrementalChartParser
  • nltk.parse.earleychart.FeatureEarleyChartParser
  • nltk.parse.earleychart.FeatureIncrementalTopDownChartParser
  • nltk.parse.earleychart.FeatureIncrementalBottomUpChartParser
  • nltk.parse.earleychart.FeatureIncrementalBottomUpLeftCornerChartParser

I do not know whether this is an exhaustive list.