A language can be represented by a set of grammar rules over a lexicon of words. Different languages call for different types of grammar, but in Natural Language Processing we are mainly interested in context-free grammars.
Many natural and programming languages can be represented by a Context-Free Grammar (CFG). A CFG is a grammar in which every rule has a single non-terminal symbol on the left-hand side. That means a non-terminal can be replaced by the right-hand side of the rule regardless of context. An example of a CFG:
S -> aSb | ε
That means S can be replaced by either aSb or ε (with ε we denote the empty string). The lexicon of the language is comprised of the terminals a and b, while S denotes the non-terminal symbol. In general, non-terminals are capitalized while terminals are not, and we usually name the starting non-terminal S. The language generated by the above grammar is aⁿbⁿ for n greater than or equal to 0 (n a's followed by n b's, including the empty string).
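To make the generated language concrete, here is a minimal sketch (not part of the module) that derives a string by applying S -> aSb a fixed number of times before ending with S -> ε, and checks membership in aⁿbⁿ:

def derive(n):
    """Apply S -> aSb exactly n times, then S -> ε, yielding a^n b^n."""
    return 'a' * n + 'b' * n

def in_language(s):
    """Check whether s belongs to { a^n b^n : n >= 0 }."""
    n = len(s) // 2
    return s == derive(n)

print([derive(n) for n in range(4)])              # ['', 'ab', 'aabb', 'aaabbb']
print(in_language('aabb'), in_language('abab'))   # True False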
While a simple CFG can be very useful, we might want to know the chance of each rule occurring. Above, we do not know whether S is more likely to be replaced by aSb or by ε. Probabilistic Context-Free Grammars (PCFGs) are built to fill exactly that need. Each rule has a probability, given in brackets, and the probabilities of all rules with the same left-hand side sum to 1:
S -> aSb [0.7] | ε [0.3]
Now we know it is more likely for S to be replaced by aSb than by ε.
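To see what these probabilities mean in practice, here is a minimal hand-rolled sampler (not the module's implementation) that expands S according to the rule probabilities above; the depth cap is only there to stop unlucky derivations from recursing forever:

import random

def sample_S(max_depth=50, depth=0):
    """Sample a string from S -> aSb [0.7] | ε [0.3]."""
    if depth >= max_depth or random.random() < 0.3:    # pick ε with probability 0.3
        return ''
    return 'a' + sample_S(max_depth, depth + 1) + 'b'  # pick aSb with probability 0.7

random.seed(0)
print([sample_S() for _ in range(5)])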
An issue with PCFGs is how to assign the various probabilities to the rules. We could use our knowledge as humans to assign them, but that is a laborious and error-prone task. Instead, we can learn the probabilities from data. Data is categorized as labeled (with correctly parsed sentences, usually called a treebank) or unlabeled (given only lexical and syntactic category names).
With labeled data, we can simply count the occurrences. For the above grammar, if we have 100 S rules and 30 of them are of the form S -> ε, we assign a probability of 0.3 to that rule.
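As a sketch of this counting approach (the tiny treebank here is made up purely for illustration), we collect every rule occurrence, count them, and divide by the total count of each left-hand side:

from collections import Counter

# Hypothetical treebank reduced to the rule applications found in its parse trees.
observed_rules = [
    ('S', 'aSb'), ('S', 'aSb'), ('S', 'ε'),
    ('S', 'aSb'), ('S', 'ε'), ('S', 'ε'),
]

rule_counts = Counter(observed_rules)                   # count(lhs -> rhs)
lhs_counts = Counter(lhs for lhs, _ in observed_rules)  # count(lhs)

# P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)
for (lhs, rhs), count in rule_counts.items():
    print(f"{lhs} -> {rhs} [{count / lhs_counts[lhs]:.2f}]")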
With unlabeled data, we have to learn both the grammar rules and the probability of each rule. There are many possible approaches; one of them is the inside-outside algorithm. It uses dynamic programming: it first computes the probability that each non-terminal generates each substring, and then uses those quantities to re-estimate the probability of each rule.
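The full inside-outside algorithm also needs "outside" probabilities and an iterative re-estimation loop, but the sketch below shows just the dynamic-programming "inside" pass on a toy grammar in Chomsky Normal Form (introduced just below); the rules and probabilities are assumed for illustration only. Here beta[(A, i, j)] is the probability that non-terminal A generates the words from position i to j.

from collections import defaultdict

lexical_rules = {('Article', 'the'): 0.5, ('Noun', 'robot'): 0.4, ('Verb', 'is'): 0.5}
binary_rules = {('S', 'NP', 'VP'): 1.0, ('NP', 'Article', 'Noun'): 0.6,
                ('VP', 'Verb', 'NP'): 0.5}

def inside_probabilities(words):
    """Compute beta[(A, i, j)] = P(A generates words[i..j]) for a CNF PCFG."""
    n = len(words)
    beta = defaultdict(float)
    for i, word in enumerate(words):                   # base case: lexical rules A -> a
        for (A, a), p in lexical_rules.items():
            if a == word:
                beta[(A, i, i)] += p
    for length in range(2, n + 1):                     # longer spans, shortest first
        for i in range(n - length + 1):
            j = i + length - 1
            for (X, Y, Z), p in binary_rules.items():  # combine two sub-spans with X -> Y Z
                for k in range(i, j):
                    beta[(X, i, j)] += p * beta[(Y, i, k)] * beta[(Z, k + 1, j)]
    return beta

words = ['the', 'robot', 'is', 'the', 'robot']
beta = inside_probabilities(words)
print(beta[('S', 0, len(words) - 1)])                  # probability that S generates the sentence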
A grammar is in Chomsky Normal Form (CNF, not to be confused with Conjunctive Normal Form) if each of its rules has one of the following three forms:
X -> Y Z
A -> a
S -> ε
Where X, Y, Z, A are non-terminals, a is a terminal, ε is the empty string and S is the start symbol (the start symbol must not appear on the right-hand side of any rule). Note that there can be multiple rules for each left-hand side non-terminal, as long as they follow the above. For example, a rule for X might be: X -> Y Z | A B | a | b.
Of course, a grammar in CNF can also carry probabilities on its rules.
This type of grammar may seem restrictive, but it can be proven that any context-free grammar can be converted to CNF.
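One step of that conversion is binarization: a rule whose right-hand side has more than two non-terminals is split into a chain of binary rules by introducing fresh non-terminals. Below is a minimal sketch of just this step (the fresh names are hypothetical, and the ε-rule and unit-rule elimination steps of the full conversion are omitted):

def binarize(lhs, rhs):
    """Split a rule with a long right-hand side into CNF-style binary rules."""
    rules, symbols, i = [], list(rhs), 0
    while len(symbols) > 2:
        i += 1
        fresh = f'{lhs}_{i}'                      # hypothetical fresh non-terminal name
        rules.append((lhs, [symbols[0], fresh]))
        lhs, symbols = fresh, symbols[1:]
    rules.append((lhs, symbols))
    return rules

# X -> Y Z W  becomes  X -> Y X_1  and  X_1 -> Z W
print(binarize('X', ['Y', 'Z', 'W']))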
The lexicon of a language is defined as a list of allowable words. These words are grouped into the usual classes: verbs, nouns, adjectives, adverbs, pronouns, names, articles, prepositions and conjunctions. For the first five classes it is impossible to list all words, since new words are continuously being added. Recently "google" was added to the list of verbs, and words like that will continue to pop up and get added. For that reason, these first five categories are called open classes. The remaining categories have far fewer words and change far more slowly. Words like "thou" were commonly used in the past but have almost completely fallen out of use; such changes take many decades or centuries to manifest, so we can safely assume these categories will remain static for the foreseeable future. Thus, they are called closed classes.
An example lexicon for a PCFG (note that other classes can also be used depending on the language, like Digit, or RelPro for relative pronoun):
Verb -> is [0.3] | say [0.1] | are [0.1] | ...
Noun -> robot [0.1] | sheep [0.05] | fence [0.05] | ...
Adjective -> good [0.1] | new [0.1] | sad [0.05] | ...
Adverb -> here [0.1] | lightly [0.05] | now [0.05] | ...
Pronoun -> me [0.1] | you [0.1] | he [0.05] | ...
RelPro -> that [0.4] | who [0.2] | which [0.2] | ...
Name -> john [0.05] | mary [0.05] | peter [0.01] | ...
Article -> the [0.35] | a [0.25] | an [0.025] | ...
Preposition -> to [0.25] | in [0.2] | at [0.1] | ...
Conjunction -> and [0.5] | or [0.2] | but [0.2] | ...
Digit -> 1 [0.3] | 2 [0.2] | 0 [0.2] | ...
With grammars we combine words from the lexicon into valid phrases. A grammar is comprised of grammar rules. Each rule transforms the left-hand side of the rule into the right-hand side. For example, A -> B means that A transforms into B. Let's build a grammar for the language we started building with the lexicon. We will use a PCFG.
S -> NP VP [0.9] | S Conjunction S [0.1]
NP -> Pronoun [0.3] | Name [0.1] | Noun [0.1] | Article Noun [0.25] |
Article Adjs Noun [0.05] | Digit [0.05] | NP PP [0.1] |
NP RelClause [0.05]
VP -> Verb [0.4] | VP NP [0.35] | VP Adjective [0.05] | VP PP [0.1] |
VP Adverb [0.1]
Adjs -> Adjective [0.8] | Adjective Adjs [0.2]
PP -> Preposition NP [1.0]
RelClause -> RelPro VP [1.0]
Some valid phrases the grammar produces: "mary is sad", "you are a robot" and "she likes mary and a good fence".
What if we wanted to check whether the phrase "mary is sad" is actually a valid sentence? We can use a parse tree to constructively prove that a string of words is a valid phrase in the given language, and even calculate the probability of generating the sentence.
The probability of the whole tree can be calculated by multiplying the probabilities of the individual rule transformations: 0.9 * 0.1 * 0.05 * 0.05 * 0.4 * 0.05 * 0.3 = 0.00000135.
To conserve space, we can also write the tree in linear form:
[S [NP [Name mary]] [VP [VP [Verb is]] [Adjective sad]]]
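As a quick check of the arithmetic, the rules used in this tree are S -> NP VP [0.9], NP -> Name [0.1], Name -> mary [0.05], VP -> VP Adjective [0.05], VP -> Verb [0.4], Verb -> is [0.3] and Adjective -> sad [0.05]; multiplying them reproduces the value above:

import math

# Probabilities of the rules applied in the parse tree of "mary is sad".
rule_probs = [0.9, 0.1, 0.05, 0.05, 0.4, 0.3, 0.05]
print(math.prod(rule_probs))   # 1.35e-06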
Unfortunately, the current grammar overgenerates, that is, it creates sentences that are not grammatically correct (according to the English language), like "the fence are john which say". It also undergenerates, which means there are valid sentences it does not generate, like "he believes mary is sad".
In the module, we have implemented both probabilistic and non-probabilistic grammars. Both of these implementations follow the same format. There are functions for the lexicon and the rules which can be combined to create a grammar object.
Execute the cell below to view the implementations:
In [2]:
import os, sys
sys.path = [os.path.abspath("../../")] + sys.path
from nlp4e import *
from notebook4e import psource
In [ ]:
psource(Lexicon, Rules, Grammar)
Let's build a lexicon and a grammar for the above language:
In [3]:
lexicon = Lexicon(
Verb = "is | say | are",
Noun = "robot | sheep | fence",
Adjective = "good | new | sad",
Adverb = "here | lightly | now",
Pronoun = "me | you | he",
RelPro = "that | who | which",
Name = "john | mary | peter",
Article = "the | a | an",
Preposition = "to | in | at",
Conjunction = "and | or | but",
Digit = "1 | 2 | 0"
)
print("Lexicon", lexicon)
rules = Rules(
S = "NP VP | S Conjunction S",
NP = "Pronoun | Name | Noun | Article Noun \
| Article Adjs Noun | Digit | NP PP | NP RelClause",
VP = "Verb | VP NP | VP Adjective | VP PP | VP Adverb",
Adjs = "Adjective | Adjective Adjs",
PP = "Preposition NP",
RelClause = "RelPro VP"
)
print("\nRules:", rules)
Both functions return a dictionary whose keys are the left-hand sides of the rules. For the lexicon, the values are the lists of terminals for each left-hand side non-terminal, while for the rules the values are the right-hand sides, each as a list of symbols.
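For instance, assuming the dictionary format just described, the entries can be inspected directly:

print(rules['VP'])      # right-hand sides of VP, each as a list of symbols
print(lexicon['Verb'])  # terminals that the Verb class can produce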
We can now use the variables lexicon and rules to build a grammar. After we've done so, we can find the transformations of a non-terminal (Noun, Verb and the other basic classes do not count as proper non-terminals in this implementation). We can also check if a word belongs to a particular class.
In [4]:
grammar = Grammar("A Simple Grammar", rules, lexicon)
print("How can we rewrite 'VP'?", grammar.rewrites_for('VP'))
print("Is 'the' an article?", grammar.isa('the', 'Article'))
print("Is 'here' a noun?", grammar.isa('here', 'Noun'))
In [5]:
E_Chomsky = Grammar("E_Prob_Chomsky", # A Grammar in Chomsky Normal Form
Rules(
S = "NP VP",
NP = "Article Noun | Adjective Noun",
VP = "Verb NP | Verb Adjective",
),
Lexicon(
Article = "the | a | an",
Noun = "robot | sheep | fence",
Adjective = "good | new | sad",
Verb = "is | say | are"
))
In [6]:
print(E_Chomsky.cnf_rules())
Finally, we can generate random phrases using our grammar. Most of them will be complete gibberish, falling under the overgenerated phrases of the grammar, which goes to show that the valid phrases are far outnumbered by the overgenerated ones.
In [8]:
grammar.generate_random('S')
Out[8]:
The probabilistic grammars follow the same approach. The rules and lexicon are again specified as strings, the grammar is assembled from rules and a lexicon, and it can generate random sentences (along with the probability of each sentence). The main difference is that in the lexicon we have (terminal, probability) tuples instead of plain strings, and for the rules we have a list of (list of non-terminals, probability) tuples instead of a list of lists of non-terminals.
Execute the cells to read the code:
In [ ]:
psource(ProbLexicon, ProbRules, ProbGrammar)
Let's build a lexicon and rules for the probabilistic grammar:
In [9]:
lexicon = ProbLexicon(
Verb = "is [0.5] | say [0.3] | are [0.2]",
Noun = "robot [0.4] | sheep [0.4] | fence [0.2]",
Adjective = "good [0.5] | new [0.2] | sad [0.3]",
Adverb = "here [0.6] | lightly [0.1] | now [0.3]",
Pronoun = "me [0.3] | you [0.4] | he [0.3]",
RelPro = "that [0.5] | who [0.3] | which [0.2]",
Name = "john [0.4] | mary [0.4] | peter [0.2]",
Article = "the [0.5] | a [0.25] | an [0.25]",
Preposition = "to [0.4] | in [0.3] | at [0.3]",
Conjunction = "and [0.5] | or [0.2] | but [0.3]",
Digit = "0 [0.35] | 1 [0.35] | 2 [0.3]"
)
print("Lexicon", lexicon)
rules = ProbRules(
S = "NP VP [0.6] | S Conjunction S [0.4]",
NP = "Pronoun [0.2] | Name [0.05] | Noun [0.2] | Article Noun [0.15] \
| Article Adjs Noun [0.1] | Digit [0.05] | NP PP [0.15] | NP RelClause [0.1]",
VP = "Verb [0.3] | VP NP [0.2] | VP Adjective [0.25] | VP PP [0.15] | VP Adverb [0.1]",
Adjs = "Adjective [0.5] | Adjective Adjs [0.5]",
PP = "Preposition NP [1]",
RelClause = "RelPro VP [1]"
)
print("\nRules:", rules)
Let's use the above to assemble our probabilistic grammar and run some simple queries:
In [10]:
grammar = ProbGrammar("A Simple Probabilistic Grammar", rules, lexicon)
print("How can we rewrite 'VP'?", grammar.rewrites_for('VP'))
print("Is 'the' an article?", grammar.isa('the', 'Article'))
print("Is 'here' a noun?", grammar.isa('here', 'Noun'))
If we have a grammar in CNF, we can get a list of all of its rules. Let's create a grammar in that form and print its CNF rules:
In [11]:
E_Prob_Chomsky = ProbGrammar("E_Prob_Chomsky", # A Probabilistic Grammar in CNF
ProbRules(
S = "NP VP [1]",
NP = "Article Noun [0.6] | Adjective Noun [0.4]",
VP = "Verb NP [0.5] | Verb Adjective [0.5]",
),
ProbLexicon(
Article = "the [0.5] | a [0.25] | an [0.25]",
Noun = "robot [0.4] | sheep [0.4] | fence [0.2]",
Adjective = "good [0.5] | new [0.2] | sad [0.3]",
Verb = "is [0.5] | say [0.3] | are [0.2]"
))
In [12]:
print(E_Prob_Chomsky.cnf_rules())
Lastly, we can generate random sentences from this grammar. The function generate_random returns a tuple (sentence, probability).
In [13]:
sentence, prob = grammar.generate_random('S')
print(sentence)
print(prob)
As with the non-probabilistic grammar, this one mostly overgenerates. You can also see that the probability is very, very low, which means there is an enormous number of sentences that can be generated (in this case infinitely many, since the grammar is recursive; notice how VP can produce another VP, for example).