Our third case study will examine the domain of geography queries. In particular, we'll focus on the Geo880 corpus, which contains 880 queries about U.S. geography. Examples include:
"which states border texas?"
"how many states border the largest state?"
"what is the size of the capital of texas?"
The Geo880 queries have a quite different character from the arithmetic queries and travel queries we have examined previously. They differ from the arithmetic queries in using a large vocabulary, and in exhibiting greater degrees of both lexical and syntactic ambiguity. They differ from the travel queries in adhering to conventional rules for spelling and syntax, and in having semantics with arbitrarily complex compositional structure. For example:
"what rivers flow through states that border the state with the largest population?"
"what is the population of the capital of the largest state through which the mississippi runs?"
"what is the longest river that passes the states that border the state that borders the most states?"
Geo880 was developed in Ray Mooney's group at UT Austin. It is of particular interest because it has for many years served as a standard evaluation for semantic parsing systems. (See, for example, Zelle & Mooney 1996, Tang & Mooney 2001, Zettlemoyer & Collins 2005, and Liang et al. 2011.) It has thereby become, for many, a paradigmatic application of semantic parsing. It has also served as a bridge between an older current of research on natural language interfaces to databases (NLIDBs) (see Androutsopoulos et al. 1995) and the modern era of semantic parsing.
The domain of geography queries is also of interest because there are many plausible real-world applications for semantic parsing which similarly involve complex compositional queries against a richly structured knowledge base. For example, some people are passionate about baseball statistics, and might want to ask queries like:
"pitchers who have struck out four batters in one inning"
"players who have stolen at least 100 bases in a season"
"complete games with fewer than 90 pitches"
"most home runs hit in one game"
Environmental advocates and policymakers might have queries like:
"which country has the highest co2 emissions"
"what five countries have the highest per capita co2 emissions"
"what country's co2 emissions increased the most over the last five years"
"what fraction of co2 emissions was from european countries in 2010"
Techniques that work in the geography domain are likely to work in these other domains too.
I've been told that the Geo880 queries were collected from students in classes taught by Ray Mooney at UT Austin. I'm not sure whether I've got the story right. But this account is consistent with one of the notable limitations of the dataset: it is not a natural distribution, and not a realistic representation of the geography queries that people actually ask on, say, Google. Nobody ever asks, "what is the longest river that passes the states that border the state that borders the most states?" Nobody. Ever.
The dataset was published online by Rohit Jaivant Kate in a Prolog file containing semantic representations in Prolog style. It was later republished by Yuk Wah Wong as an XML file containing additional metadata for each example, including translations into Spanish, Japanese, and Turkish; syntactic parse trees; and semantics in two different representations: Prolog and FunQL.
In SippyCup, we're not going to use either Prolog or FunQL semantics. Instead, we'll use examples which have been annotated only with denotations (which were provided by Percy Liang — thanks!). Of course, our grammar will require a semantic representation, even if our examples are not annotated with semantics. We will introduce one below.
The Geo880 dataset is conventionally divided into 600 training examples and 280 test examples. In SippyCup, the dataset can be found in geo880.py. Let's take a peek.
In [1]:
from geo880 import geo880_train_examples, geo880_test_examples
print('train examples:', len(geo880_train_examples))
print('test examples: ', len(geo880_test_examples))
print(geo880_train_examples[0])
print(geo880_test_examples[0])
Geobase is a small knowledge base about the geography of the United States. It contains (almost) all the information needed to answer queries in the Geo880 dataset, including facts about states, cities, rivers, mountains, roads, and lakes.
SippyCup contains a class called GeobaseReader (in geobase.py) which facilitates working with Geobase in Python. It reads and parses the Geobase Prolog file and creates a set of tuples representing its content. Let's take a look.
In [2]:
from geobase import GeobaseReader
reader = GeobaseReader()
unaries = [str(t) for t in reader.tuples if len(t) == 2]
print('\nSome unaries:\n ' + '\n '.join(unaries[:10]))
binaries = [str(t) for t in reader.tuples if len(t) == 3]
print('\nSome binaries:\n ' + '\n '.join(binaries[:10]))
Some observations here:
Entity identifiers have the form /type/name. This is a GeobaseReader convention; these identifiers are not used in the original Prolog file.
Some entities have the generic type place, because they occur in the Prolog file only as the highest or lowest point in a state, and it's hard to reliably assign such points to one of the more specific types.
GeobaseReader converts everything to SI units: meters and square meters.
GeobaseReader merely reads the data in Geobase into a set of tuples. It doesn't provide any facility for querying that data. That's where GraphKB and GraphKBExecutor come in. GraphKB is a graph-structured knowledge base, with indexing for fast lookups. GraphKBExecutor defines a representation for formal queries against that knowledge base, and supports query execution. The formal query language defined by GraphKBExecutor will serve as our semantic representation for the geography domain.
A GraphKB is a generic graph-structured knowledge base, or equivalently, a set of relational pairs and triples, with indexing for fast lookups. It represents a knowledge base as a set of tuples, each either a pair, consisting of a unary relation and an entity belonging to it, or a triple, consisting of a binary relation and a pair of entities that stand in that relation.
For example, we can construct a GraphKB representing facts about The Simpsons:
In [3]:
from graph_kb import GraphKB
simpsons_tuples = [
# unaries
('male', 'homer'),
('female', 'marge'),
('male', 'bart'),
('female', 'lisa'),
('female', 'maggie'),
('adult', 'homer'),
('adult', 'marge'),
('child', 'bart'),
('child', 'lisa'),
('child', 'maggie'),
# binaries
('has_age', 'homer', 36),
('has_age', 'marge', 34),
('has_age', 'bart', 10),
('has_age', 'lisa', 8),
('has_age', 'maggie', 1),
('has_brother', 'lisa', 'bart'),
('has_brother', 'maggie', 'bart'),
('has_sister', 'bart', 'maggie'),
('has_sister', 'bart', 'lisa'),
('has_sister', 'lisa', 'maggie'),
('has_sister', 'maggie', 'lisa'),
('has_father', 'bart', 'homer'),
('has_father', 'lisa', 'homer'),
('has_father', 'maggie', 'homer'),
('has_mother', 'bart', 'marge'),
('has_mother', 'lisa', 'marge'),
('has_mother', 'maggie', 'marge'),
]
simpsons_kb = GraphKB(simpsons_tuples)
The GraphKB object now contains three indexes:
unaries[U]: all entities belonging to unary relation U
binaries_fwd[B][E]: all entities X such that (E, X) belongs to binary relation B
binaries_rev[B][E]: all entities X such that (X, E) belongs to binary relation B
For example:
In [4]:
simpsons_kb.unaries['child']
Out[4]:
In [5]:
simpsons_kb.binaries_fwd['has_sister']['lisa']
Out[5]:
In [6]:
simpsons_kb.binaries_rev['has_sister']['lisa']
Out[6]:
A GraphKBExecutor executes formal queries against a GraphKB and returns their denotations.
Queries are represented by Python tuples, and can be nested.
Denotations are also represented by Python tuples, but are conceptually sets (possibly empty). The elements of these tuples are always sorted in canonical order, so that they can be reliably compared for set equality.
The query language defined by GraphKBExecutor is perhaps most easily explained by example:
In [7]:
queries = [
'bart',
'male',
('has_sister', 'lisa'), # who has sister lisa?
('lisa', 'has_sister'), # lisa has sister who, i.e., who is a sister of lisa?
('lisa', 'has_brother'), # lisa has brother who, i.e., who is a brother of lisa?
('.and', 'male', 'child'),
('.or', 'male', 'adult'),
('.not', 'child'),
('.any',), # anything
('.any', 'has_sister'), # anything has sister who, i.e., who is a sister of anything?
('.and', 'child', ('.not', ('.any', 'has_sister'))),
('.count', ('bart', 'has_sister')),
('has_age', ('.gt', 21)),
('has_age', ('.lt', 2)),
('has_age', ('.eq', 10)),
('.max', 'has_age', 'female'),
('.min', 'has_age', ('bart', 'has_sister')),
('.max', 'has_age', '.any'),
('.argmax', 'has_age', 'female'),
('.argmin', 'has_age', ('bart', 'has_sister')),
('.argmax', 'has_age', '.any'),
]
executor = simpsons_kb.executor()
for query in queries:
print()
print('Q ', query)
print('D ', executor.execute(query))
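Because denotations are canonically sorted tuples, checking whether two queries have the same denotation reduces to a plain == comparison. Here's a quick check against the Simpsons knowledge base (a sketch; the expected values follow from the tuples listed above):
d1 = executor.execute(('.and', 'child', 'female'))  # female children; expect ('lisa', 'maggie')
d2 = executor.execute(('bart', 'has_sister'))       # bart's sisters; expect ('lisa', 'maggie')
print(d1 == d2)                                     # expect True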
Note that the query (R, E) denotes entities having relation R to entity E, whereas the query (E, R) denotes entities to which entity E has relation R.
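To make the orientation concrete, here's a tiny check (a sketch; the expected outputs follow from the Simpsons tuples above):
print(executor.execute(('has_father', 'homer')))  # (R, E): who has father homer? expect ('bart', 'lisa', 'maggie')
print(executor.execute(('bart', 'has_father')))   # (E, R): bart has father who? expect ('homer',)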
For a more detailed understanding of the style of semantic representation defined by GraphKBExecutor, take a look at the source code. Now let's build a GraphKB from the Geobase tuples and run a few queries against it:
In [8]:
geobase = GraphKB(reader.tuples)
executor = geobase.executor()
queries = [
('/state/texas', 'capital'), # capital of texas
('.and', 'river', ('traverses', '/state/utah')), # rivers that traverse utah
('.argmax', 'height', 'mountain'), # tallest mountain
]
for query in queries:
print()
print(query)
print(executor.execute(query))
It's time to start developing a grammar for the geography domain. As in Unit 2, the performance metric we'll focus on during grammar engineering is oracle accuracy (the proportion of examples for which any parse is correct), not accuracy (the proportion of examples for which the first parse is correct). Remember that oracle accuracy is an upper bound on accuracy, and is a measure of the expressive power of the grammar: does it have the rules it needs to generate the correct parse? The gap between oracle accuracy and accuracy, on the other hand, reflects the ability of the scoring model to bring the correct parse to the top of the candidate list.
As always, we're going to take a data-driven approach to grammar engineering. We want to introduce rules which will enable us to handle the lexical items and syntactic structures that we actually observe in the Geo880 training data. To that end, let's count the words that appear among the 600 training examples. (We do not examine the test data!)
In [9]:
from collections import defaultdict
from operator import itemgetter
from geo880 import geo880_train_examples
words = [word for example in geo880_train_examples for word in example.input.split()]
counts = defaultdict(int)
for word in words:
counts[word] += 1
counts = sorted([(count, word) for word, count in counts.items()], reverse=True)
print('There were %d tokens of %d types:\n' % (len(words), len(counts)))
print(', '.join(['%s (%d)' % (word, count) for count, word in counts[:50]] + ['...']))
There are at least four major categories of words here: references to entities (such as "texas" or "mississippi"), references to types (such as "state" or "rivers"), references to relations (such as "border" or "capital"), and semantically vacuous words (such as "what", "the", or "is") which we can treat as optional.
One might make finer distinctions, but this seems like a reasonable starting point. Note that these categories do not always correspond to traditional syntactic categories. While the entities are typically proper nouns, and the types are typically common nouns, the relations include prepositions, verbs, nouns, and adjectives.
The design of our grammar will roughly follow this schema. The major categories will include $Entity, $Type, $Collection, $Relation, and $Optional.
In Unit 2, our grammar engineering process didn't really start cooking until we introduced optionals. This time around, let's begin with the optionals. We'll define as $Optional every word in the Geo880 training data which does not plainly refer to an entity, type, or relation. And we'll let any query be preceded or followed by a sequence of one or more $Optionals.
In [10]:
from parsing import Grammar, Rule
optional_words = [
'the', '?', 'what', 'is', 'in', 'of', 'how', 'many', 'are', 'which', 'that',
'with', 'has', 'major', 'does', 'have', 'where', 'me', 'there', 'give',
'name', 'all', 'a', 'by', 'you', 'to', 'tell', 'other', 'it', 'do', 'whose',
'show', 'one', 'on', 'for', 'can', 'whats', 'urban', 'them', 'list',
'exist', 'each', 'could', 'about'
]
rules_optionals = [
Rule('$ROOT', '?$Optionals $Query ?$Optionals', lambda sems: sems[1]),
Rule('$Optionals', '$Optional ?$Optionals'),
] + [Rule('$Optional', word) for word in optional_words]
Because $Query has not yet been defined, we won't be able to parse anything yet.
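If you want to convince yourself, here's a quick sanity check (a sketch; it assumes the Grammar constructor used below tolerates a grammar in which $Query has no rules):
sanity_grammar = Grammar(rules=rules_optionals, annotators=[])
print(len(sanity_grammar.parse_input('what is the')))  # expect 0, since nothing can be a $Query yet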
Our grammar will need to be able to recognize names of entities, such as "utah". There are hundreds of entities in Geobase, and we don't want to have to introduce a grammar rule for each one. Instead, we'll define a new annotator, GeobaseAnnotator, which simply annotates phrases which exactly match names in Geobase.
In [11]:
from annotator import Annotator, NumberAnnotator
class GeobaseAnnotator(Annotator):
def __init__(self, geobase):
self.geobase = geobase
def annotate(self, tokens):
phrase = ' '.join(tokens)
places = self.geobase.binaries_rev['name'][phrase]
return [('$Entity', place) for place in places]
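Here's a quick check of the annotator in isolation (a sketch; the identifier follows GeobaseReader's /type/name convention):
annotator = GeobaseAnnotator(geobase)
print(annotator.annotate('utah'.split()))  # expect something like [('$Entity', '/state/utah')]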
Now a couple of rules that will enable us to parse inputs that simply name locations, such as "utah". $Query is the top-level category that $ROOT expects between its optionals, while $Collection is the workhorse category for anything that denotes a set of places; an $Entity is just a singleton $Collection. Most of the rules we add from here on will build $Collections.
In [12]:
rules_collection_entity = [
Rule('$Query', '$Collection', lambda sems: sems[0]),
Rule('$Collection', '$Entity', lambda sems: sems[0]),
]
rules = rules_optionals + rules_collection_entity
Now let's make a grammar.
In [13]:
annotators = [NumberAnnotator(), GeobaseAnnotator(geobase)]
grammar = Grammar(rules=rules, annotators=annotators)
Let's try to parse some inputs which just name locations.
In [14]:
parses = grammar.parse_input('what is utah')
for parse in parses[:1]:
print('\n'.join([str(parse.semantics), str(executor.execute(parse.semantics))]))
Great, it worked. Now let's run an evaluation on the Geo880 training examples.
In [15]:
from experiment import sample_wins_and_losses
from geoquery import GeoQueryDomain
from metrics import DenotationOracleAccuracyMetric
from scoring import Model
domain = GeoQueryDomain()
model = Model(grammar=grammar, executor=executor.execute)
metric = DenotationOracleAccuracyMetric()
sample_wins_and_losses(domain=domain, model=model, metric=metric, seed=1)
We don't yet have a single win: denotation oracle accuracy remains stuck at zero. However, the average number of parses is slightly greater than zero, meaning that there are a few examples which our grammar can parse (though not correctly). It would be interesting to know which examples. There's a utility function in experiment.py which will give you the visibility you need. See if you can figure out what to do.
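If you'd rather roll your own before peeking at experiment.py, a minimal sketch is to loop over the training examples and report which ones currently receive any parse at all:
for example in geo880_train_examples:
    parses = grammar.parse_input(example.input)
    if parses:
        print(len(parses), example.input)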
The words in the training data include lots of words for types. Let's write down some lexical rules defining the category $Type, guided as usual by the words we actually see in the training data. We'll also make $Type a kind of $Collection.
In [16]:
rules_types = [
Rule('$Collection', '$Type', lambda sems: sems[0]),
Rule('$Type', 'state', 'state'),
Rule('$Type', 'states', 'state'),
Rule('$Type', 'city', 'city'),
Rule('$Type', 'cities', 'city'),
Rule('$Type', 'big cities', 'city'),
Rule('$Type', 'towns', 'city'),
Rule('$Type', 'river', 'river'),
Rule('$Type', 'rivers', 'river'),
Rule('$Type', 'mountain', 'mountain'),
Rule('$Type', 'mountains', 'mountain'),
Rule('$Type', 'mount', 'mountain'),
Rule('$Type', 'peak', 'mountain'),
Rule('$Type', 'road', 'road'),
Rule('$Type', 'roads', 'road'),
Rule('$Type', 'lake', 'lake'),
Rule('$Type', 'lakes', 'lake'),
Rule('$Type', 'country', 'country'),
Rule('$Type', 'countries', 'country'),
]
We should now be able to parse inputs denoting types, such as "name the lakes":
In [17]:
rules = rules_optionals + rules_collection_entity + rules_types
grammar = Grammar(rules=rules, annotators=annotators)
parses = grammar.parse_input('name the lakes')
for parse in parses[:1]:
print('\n'.join([str(parse.semantics), str(executor.execute(parse.semantics))]))
It worked. Let's evaluate on the Geo880 training data again.
In [18]:
model = Model(grammar=grammar, executor=executor.execute)
sample_wins_and_losses(domain=domain, model=model, metric=metric, seed=1)
Liftoff! We have two wins, and denotation oracle accuracy is greater than zero! Just barely.
In order to really make this bird fly, we're going to have to handle relations. In particular, we'd like to be able to parse queries which combine a relation with an entity or collection, such as "what is the capital of vermont".
As usual, we'll adopt a data-driven approach. The training examples include lots of words and phrases which refer to relations, both "forward" relations (like "traverses") and "reverse" relations (like "traversed by"). Guided by the training data, we'll write lexical rules which define the categories $FwdRelation and $RevRelation. Then we'll add rules that allow either a $FwdRelation or a $RevRelation to be promoted to a generic $Relation, with semantic functions which ensure that the semantics are constructed with the proper orientation. Finally, we'll define a rule for joining a $Relation (such as "capital of") with a $Collection (such as "vermont") to yield another $Collection (such as "capital of vermont").
In [19]:
rules_relations = [
Rule('$Collection', '$Relation ?$Optionals $Collection', lambda sems: sems[0](sems[2])),
Rule('$Relation', '$FwdRelation', lambda sems: (lambda arg: (sems[0], arg))),
Rule('$Relation', '$RevRelation', lambda sems: (lambda arg: (arg, sems[0]))),
Rule('$FwdRelation', '$FwdBordersRelation', 'borders'),
Rule('$FwdBordersRelation', 'border'),
Rule('$FwdBordersRelation', 'bordering'),
Rule('$FwdBordersRelation', 'borders'),
Rule('$FwdBordersRelation', 'neighbor'),
Rule('$FwdBordersRelation', 'neighboring'),
Rule('$FwdBordersRelation', 'surrounding'),
Rule('$FwdBordersRelation', 'next to'),
Rule('$FwdRelation', '$FwdTraversesRelation', 'traverses'),
Rule('$FwdTraversesRelation', 'cross ?over'),
Rule('$FwdTraversesRelation', 'flow through'),
Rule('$FwdTraversesRelation', 'flowing through'),
Rule('$FwdTraversesRelation', 'flows through'),
Rule('$FwdTraversesRelation', 'go through'),
Rule('$FwdTraversesRelation', 'goes through'),
Rule('$FwdTraversesRelation', 'in'),
Rule('$FwdTraversesRelation', 'pass through'),
Rule('$FwdTraversesRelation', 'passes through'),
Rule('$FwdTraversesRelation', 'run through'),
Rule('$FwdTraversesRelation', 'running through'),
Rule('$FwdTraversesRelation', 'runs through'),
Rule('$FwdTraversesRelation', 'traverse'),
Rule('$FwdTraversesRelation', 'traverses'),
Rule('$RevRelation', '$RevTraversesRelation', 'traverses'),
Rule('$RevTraversesRelation', 'has'),
Rule('$RevTraversesRelation', 'have'), # 'how many states have major rivers'
Rule('$RevTraversesRelation', 'lie on'),
Rule('$RevTraversesRelation', 'next to'),
Rule('$RevTraversesRelation', 'traversed by'),
Rule('$RevTraversesRelation', 'washed by'),
Rule('$FwdRelation', '$FwdContainsRelation', 'contains'),
# 'how many states have a city named springfield'
Rule('$FwdContainsRelation', 'has'),
Rule('$FwdContainsRelation', 'have'),
Rule('$RevRelation', '$RevContainsRelation', 'contains'),
Rule('$RevContainsRelation', 'contained by'),
Rule('$RevContainsRelation', 'in'),
Rule('$RevContainsRelation', 'found in'),
Rule('$RevContainsRelation', 'located in'),
Rule('$RevContainsRelation', 'of'),
Rule('$RevRelation', '$RevCapitalRelation', 'capital'),
Rule('$RevCapitalRelation', 'capital'),
Rule('$RevCapitalRelation', 'capitals'),
Rule('$RevRelation', '$RevHighestPointRelation', 'highest_point'),
Rule('$RevHighestPointRelation', 'high point'),
Rule('$RevHighestPointRelation', 'high points'),
Rule('$RevHighestPointRelation', 'highest point'),
Rule('$RevHighestPointRelation', 'highest points'),
Rule('$RevRelation', '$RevLowestPointRelation', 'lowest_point'),
Rule('$RevLowestPointRelation', 'low point'),
Rule('$RevLowestPointRelation', 'low points'),
Rule('$RevLowestPointRelation', 'lowest point'),
Rule('$RevLowestPointRelation', 'lowest points'),
Rule('$RevLowestPointRelation', 'lowest spot'),
Rule('$RevRelation', '$RevHighestElevationRelation', 'highest_elevation'),
Rule('$RevHighestElevationRelation', '?highest elevation'),
Rule('$RevRelation', '$RevHeightRelation', 'height'),
Rule('$RevHeightRelation', 'elevation'),
Rule('$RevHeightRelation', 'height'),
Rule('$RevHeightRelation', 'high'),
Rule('$RevHeightRelation', 'tall'),
Rule('$RevRelation', '$RevAreaRelation', 'area'),
Rule('$RevAreaRelation', 'area'),
Rule('$RevAreaRelation', 'big'),
Rule('$RevAreaRelation', 'large'),
Rule('$RevAreaRelation', 'size'),
Rule('$RevRelation', '$RevPopulationRelation', 'population'),
Rule('$RevPopulationRelation', 'big'),
Rule('$RevPopulationRelation', 'large'),
Rule('$RevPopulationRelation', 'populated'),
Rule('$RevPopulationRelation', 'population'),
Rule('$RevPopulationRelation', 'populations'),
Rule('$RevPopulationRelation', 'populous'),
Rule('$RevPopulationRelation', 'size'),
Rule('$RevRelation', '$RevLengthRelation', 'length'),
Rule('$RevLengthRelation', 'length'),
Rule('$RevLengthRelation', 'long'),
]
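To see how the semantic functions compose, here's a hand simulation of "capital of vermont" (a sketch; it assumes the annotator yields the identifier /state/vermont, following the /type/name convention, and during parsing the rule semantics perform these applications automatically):
relation_sem = lambda arg: (arg, 'capital')  # the $Relation semantics built from $RevRelation 'capital'
collection_sem = '/state/vermont'            # the $Collection semantics from the annotated $Entity
print(relation_sem(collection_sem))          # ('/state/vermont', 'capital')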
We should now be able to parse "what is the capital of vermont". Let's see:
In [20]:
rules = rules_optionals + rules_collection_entity + rules_types + rules_relations
grammar = Grammar(rules=rules, annotators=annotators)
parses = grammar.parse_input('what is the capital of vermont ?')
for parse in parses[:1]:
print('\n'.join([str(parse.semantics), str(executor.execute(parse.semantics))]))
Montpelier! I always forget that one.
OK, let's evaluate our progress on the Geo880 training data.
In [21]:
model = Model(grammar=grammar, executor=executor.execute)
sample_wins_and_losses(domain=domain, model=model, metric=metric, seed=1)
Hot diggity, it's working. Denotation oracle accuracy is over 12%, double digits. We have 75 wins, and they're what we expect: queries that simply combine a relation and an entity (or collection). Many of the remaining losses, however, involve intersections of collections, as in "states bordering california". Let's add rules which combine two collections by intersection, allowing them to be separated by up to two optional words.
In [22]:
rules_intersection = [
Rule('$Collection', '$Collection $Collection',
lambda sems: ('.and', sems[0], sems[1])),
Rule('$Collection', '$Collection $Optional $Collection',
lambda sems: ('.and', sems[0], sems[2])),
Rule('$Collection', '$Collection $Optional $Optional $Collection',
lambda sems: ('.and', sems[0], sems[3])),
]
In [23]:
rules = rules_optionals + rules_collection_entity + rules_types + rules_relations + rules_intersection
grammar = Grammar(rules=rules, annotators=annotators)
parses = grammar.parse_input('states bordering california')
for parse in parses[:1]:
print('\n'.join([str(parse.semantics), str(executor.execute(parse.semantics))]))
Let's evaluate the impact on the Geo880 training examples.
In [24]:
model = Model(grammar=grammar, executor=executor.execute)
sample_wins_and_losses(domain=domain, model=model, metric=metric, seed=1)
Great, denotation oracle accuracy has more than doubled, from 12% to 28%. And the wins now include intersections like "which states border new york". The losses, however, are clearly dominated by one category of error.
Many of the losses involve superlatives, such as "biggest" or "shortest". Let's remedy that. As usual, we let the training examples guide us in adding lexical rules.
In [25]:
rules_superlatives = [
Rule('$Collection', '$Superlative ?$Optionals $Collection', lambda sems: sems[0] + (sems[2],)),
Rule('$Collection', '$Collection ?$Optionals $Superlative', lambda sems: sems[2] + (sems[0],)),
Rule('$Superlative', 'largest', ('.argmax', 'area')),
Rule('$Superlative', 'largest', ('.argmax', 'population')),
Rule('$Superlative', 'biggest', ('.argmax', 'area')),
Rule('$Superlative', 'biggest', ('.argmax', 'population')),
Rule('$Superlative', 'smallest', ('.argmin', 'area')),
Rule('$Superlative', 'smallest', ('.argmin', 'population')),
Rule('$Superlative', 'longest', ('.argmax', 'length')),
Rule('$Superlative', 'shortest', ('.argmin', 'length')),
Rule('$Superlative', 'tallest', ('.argmax', 'height')),
Rule('$Superlative', 'highest', ('.argmax', 'height')),
Rule('$Superlative', '$MostLeast $RevRelation', lambda sems: (sems[0], sems[1])),
Rule('$MostLeast', 'most', '.argmax'),
Rule('$MostLeast', 'least', '.argmin'),
Rule('$MostLeast', 'lowest', '.argmin'),
Rule('$MostLeast', 'greatest', '.argmax'),
Rule('$MostLeast', 'highest', '.argmax'),
]
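Note how the semantics of a superlative get completed by tuple concatenation. A quick hand simulation for "tallest mountain" (a sketch of what the first combination rule above does):
superlative_sem = ('.argmax', 'height')     # from the lexical rule for 'tallest'
collection_sem = 'mountain'                 # from the $Type rule for 'mountain'
print(superlative_sem + (collection_sem,))  # ('.argmax', 'height', 'mountain')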
Now we should be able to parse "tallest mountain":
In [26]:
rules = rules_optionals + rules_collection_entity + rules_types + rules_relations + rules_intersection + rules_superlatives
grammar = Grammar(rules=rules, annotators=annotators)
parses = grammar.parse_input('tallest mountain')
for parse in parses[:1]:
print('\n'.join([str(parse.semantics), str(executor.execute(parse.semantics))]))
Let's evaluate the impact on the Geo880 training examples.
In [27]:
model = Model(grammar=grammar, executor=executor.execute)
sample_wins_and_losses(domain=domain, model=model, metric=metric, seed=1)
Wow, superlatives make a big difference. Denotation oracle accuracy has surged from 28% to 42%. Many of the remaining losses involve what we might call reverse joins, where the collection precedes the relation, as in "which states does the rio grande cross". Let's handle those with a helper function which reverses the orientation of a relation's semantics, plus one new rule.
In [28]:
def reverse(relation_sem):
    """Given relation semantics (a function from an argument to a pair),
    returns a function which produces the same pair with its elements swapped."""
    # relation_sem is a lambda function which takes an arg and forms a pair,
    # either (rel, arg) or (arg, rel). We want to swap the order of the pair.
    def apply_and_swap(arg):
        pair = relation_sem(arg)
        return (pair[1], pair[0])
    return apply_and_swap
rules_reverse_joins = [
Rule('$Collection', '$Collection ?$Optionals $Relation',
lambda sems: reverse(sems[2])(sems[0])),
]
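To see what reverse() does, here's a tiny illustration (a sketch; a forward relation's semantics is a function that builds a pair, and reverse() swaps the pair's elements):
fwd = lambda arg: ('traverses', arg)  # e.g., the $Relation semantics for "traverses"
print(fwd('/state/utah'))             # ('traverses', '/state/utah')
print(reverse(fwd)('/state/utah'))    # ('/state/utah', 'traverses')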
In [29]:
rules = rules_optionals + rules_collection_entity + rules_types + rules_relations + rules_intersection + rules_superlatives + rules_reverse_joins
grammar = Grammar(rules=rules, annotators=annotators)
parses = grammar.parse_input('which states does the rio grande cross')
for parse in parses[:1]:
print('\n'.join([str(parse.semantics), str(executor.execute(parse.semantics))]))
Let's evaluate the impact on the Geo880 training examples.
In [30]:
model = Model(grammar=grammar, executor=executor.execute)
sample_wins_and_losses(domain=domain, model=model, metric=metric, seed=1)
This time the gain in denotation oracle accuracy was more modest, from 42% to 47%. Still, we are making good progress. However, note that a substantial gap has opened between accuracy and oracle accuracy. This indicates that we could benefit from adding a scoring model.
Through an iterative process of grammar engineering, we've managed to increase denotation oracle accuracy to 47%. But we've been ignoring denotation accuracy, which now lags far behind, at 25%. This represents an opportunity.
In order to figure out how best to fix the problem, we need to do some error analysis. Let's look for some specific examples where denotation accuracy is 0, even though denotation oracle accuracy is 1. In other words, let's look for some examples where we have a correct parse, but it's not ranked at the top. We should be able to find some cases like that among the first ten examples of the Geo880 training data.
In [31]:
from experiment import evaluate_model
from metrics import denotation_match_metrics
evaluate_model(model=model,
examples=geo880_train_examples[:10],
metrics=denotation_match_metrics(),
print_examples=True)
Take a look through that output. Over the ten examples, we achieved denotation oracle accuracy of 60%, but denotation accuracy of just 40%. In other words, there were two examples where we generated a correct parse, but failed to rank it at the top. Take a closer look at those two cases.
The first case is "what state has the shortest river ?". The top parse has semantics ('.and', 'state', ('.argmin', 'length', 'river')), which means something like "states that are the shortest river". That's not right. In fact, there's no such thing: the denotation is empty.
The second case is "what is the highest mountain in alaska ?". The top parse has semantics ('.argmax', 'height', ('.and', 'mountain', '/state/alaska')), which means "the highest mountain which is alaska". Again, there's no such thing: the denotation is empty.
So in both of the cases where we put the wrong parse at the top, the top parse had nonsensical semantics with an empty denotation. In fact, if you scroll through the output above, you will see that there are a lot of candidate parses with empty denotations. Seems like we could make a big improvement just by downweighting parses with empty denotations. This is easy to do.
In [32]:
def empty_denotation_feature(parse):
features = defaultdict(float)
if parse.denotation == ():
features['empty_denotation'] += 1.0
return features
weights = {'empty_denotation': -1.0}
model = Model(grammar=grammar,
feature_fn=empty_denotation_feature,
weights=weights,
executor=executor.execute)
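To see why this should help, recall that the model scores a parse (roughly) as a dot product of its features with the weights, so a parse with an empty denotation is now penalized relative to one with a non-empty denotation. A minimal sketch of that arithmetic (the Model class's actual scoring may differ in detail):
def dot_product_score(features, weights):
    # Sum of weight * value over the features that fired.
    return sum(weights.get(f, 0.0) * v for f, v in features.items())
print(dot_product_score({'empty_denotation': 1.0}, weights))  # -1.0 for a parse with an empty denotation
print(dot_product_score({}, weights))                         #  0.0 for any other parse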
Let's evaluate the impact of using our new empty_denotation feature on the Geo880 training examples.
In [33]:
from experiment import evaluate_model
from metrics import denotation_match_metrics
evaluate_model(model=model,
examples=geo880_train_examples,
metrics=denotation_match_metrics(),
print_examples=False)
Great! Using the empty_denotation feature has enabled us to increase denotation accuracy from 25% to 39%. That's a big gain! In fact, we've closed most of the gap between accuracy and oracle accuracy. As a result, the headroom for further gains from feature engineering is limited — but it's not zero. The exercises will ask you to push further.
Several of these exercises ask you to measure the impact of your change on key evaluation metrics. Part of your job is to decide which evaluation metrics are most relevant for the change you're making. It's probably best to evaluate only on training data, in order to keep the test data unseen during development. (But the test data is hardly a state secret, so whatever.)
Extend the grammar to handle queries which use the phrase "how many people" to ask about population. There are many examples in the Geo880 training data. Measure the impact of this change on key evaluation metrics.
Extend the grammar to handle counting questions, indicated by the phrases "how many" or "number of", as in "how many states does missouri border". Measure the impact of this change on key evaluation metrics.
Several examples in the Geo880 training dataset fail to parse because they refer to locations using names that, while valid, are not recognized by the GeobaseAnnotator. For example, some queries use "america" to refer to /country/usa, but GeobaseAnnotator recognizes only "usa". Find unannotated location references in the Geo880 training dataset, and extend the grammar to handle them. Measure the impact of this change on key evaluation metrics.
Extend the grammar to handle examples of the form "where is X", such as "where is mount whitney" or "where is san diego". Measure the impact of this change on key evaluation metrics.
Quite a few examples in the Geo880 training dataset specify a set of entities by name, as in "how many states have a city named springfield" or "how many rivers are called colorado". Extend the grammar to handle these, leveraging the TokenAnnotator introduced in Unit 2. Measure the impact of this change on key evaluation metrics.
Extend the grammar to handle phrases like "austin texas" or "atlanta ga", where two entity names appear in sequence. Make sure that "atlanta ga" has semantics and a denotation, whereas "atlanta tx" has semantics but no denotation. Measure the impact of this change on key evaluation metrics.
In Unit 2, while examining the travel domain, we saw big gains from including rule features in our feature representation. Experiment with adding rule features to the GeoQuery model. Are the learned weights intuitive? Do the rule features help? If so, identify a few specific examples which are fixed by the inclusion of rule features. Measure the impact on key evaluation metrics.
Find an example where using the empty_denotation feature causes a loss (a bad prediction).
The empty_denotation feature doesn't help on count questions (e.g., "how many states ..."), where all parses, good or bad, typically yield a denotation which is a single number. Add a new feature which helps in such cases. Identify a few specific examples which are fixed by the new feature. Measure the impact on key evaluation metrics. [This exercise assumes you have already done the exercise on handling count questions.]
Extend the grammar to handle comparisons, such as "mountains with height greater than 5000" or "mountains with height greater than the height of bona".
Building on the previous exercise, extend the grammar to handle even those comparisons which involve ellipsis, such as "mountains higher than [the height of] mt. katahdin" or "rivers longer than [the length of] the colorado river" (where the bracketed phrase does not appear in the surface form!).
Extend the grammar to handle queries involving units, such as "what is the area of maryland in square kilometers" or "how long is the mississippi river in miles".
The Geo880 training dataset contains many examples involving population density. Extend the grammar to handle these examples. Measure the impact on key evaluation metrics.
Several examples in the Geo880 training dataset involve some form of negation, expressed by the words "not", "no", or "excluding". Extend the grammar to handle these examples. Measure the impact on key evaluation metrics.
The current grammar handles "capital of texas", but not "texas capital". It handles "has austin capital", but not "has capital austin". In general, it defines every phrase which expresses a relation as either a $FwdRelation or a $RevRelation, and constrains word order accordingly. Extend the grammar to allow any phrase which is ordinarily a $FwdRelation to function as a $RevRelation, and vice versa. Can you now handle "texas capital" and "has capital austin"? Measure the impact on key evaluation metrics.
What if we permit any word to be optionalized by adding the rule Rule('$Optional', '$Token') to our grammar? (Recall that $Token is produced by the TokenAnnotator.) Measure the impact of this change on key evaluation metrics. You will likely find that the change has some negative effects and some positive effects. Is there a way to mitigate the negative effects while preserving the positive effects?
The success of the empty_denotation feature demonstrates the potential of denotation features. Can we go further? Experiment with features that characterize the size of the denotation (that is, the number of answers). Are two answers better than one? Are ten answers better than two? If you find some features that seem to work, identify a few specific examples which are fixed by your new features. Measure the impact on key evaluation metrics.
Copyright (C) 2015 Bill MacCartney