In [1]:
from IPython.display import HTML
HTML("""<style>{}</style>""".format(open("assets/css/custom.css").read()))
Out[1]:
In [2]:
import processors
print(processors.__version__)
In [3]:
from processors import *
from processors.visualization import JupyterVisualizer as viz
API = ProcessorsAPI(port=8881, keep_alive=True)
This tutorial provides an introduction to Odin, a domain-independent rule-based system for information extraction.
YAML and familiar constructs
Odin operates over documents that have been tokenized, sentence-segmented, parsed, and annotated via an NLP pipeline with part-of-speech (PoS) tags, lemmas, and named entities.
Rules matched against these annotated documents produce mentions of entities, relations, or events that can then be reused to write more complicated rules (ex. entities $\rightarrow$ events $\rightarrow$ events involving other events).
These rules are written in a simple subset of YAML and can describe sequences of tokens or traversals over a syntactic dependency parse. Luckily, you don't need to be an expert in YAML in order to write Odin rules.
All rules have the following fields:
Field | Description |
---|---|
name | The name of the specific rule. When a rule matches, the match (Mention) stores the value of this field in its .foundBy attribute. |
label | What a rule's match represents (Person, Location, Phosphorylation, etc.). |
type | Currently there are two primary rule types: token and dependency. token refers to a surface pattern or sequence of tokens. dependency refers to a pattern over a graph (syntactic dependency parse). |
pattern | Specified as a multi-line string using the vertical bar character (i.e., \|). |
YAML
It's useful to keep in mind that YAML strings don't have to be quoted. This is a nice feature that allows one to write
shorter and cleaner rules. However, there is one exception that you should be aware of: strings that
start with a YAML indicator character must be quoted. Indicator characters have special semantics
and must be quoted if they should be interpreted as part of a string. These are all the valid YAML indicator characters:
- ? : , [ ] { } # & * ! | > ' " % @ `
As you can probably tell, these are not characters that occur frequently in practice. Usually names and labels are composed of alphanumeric characters and the occasional underscore, so, most of the time, you can get away without quoting strings.
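The quoting rule above is easy to check mechanically. The sketch below uses a hypothetical helper, `needs_quotes`, which is not part of Odin or py-processors. It flags a name or label that starts with one of the indicator characters listed above, and it is deliberately conservative: it ignores the finer points of the YAML specification, such as indicators that are only special when followed by a space.

```python
# Indicator characters copied from the list above.
YAML_INDICATORS = set("-?:,[]{}#&*!|>'\"%@`")


def needs_quotes(value):
    """Return True if `value` starts with a YAML indicator character.

    Hypothetical helper for illustration only; a conservative check,
    not a full implementation of the YAML specification.
    """
    return bool(value) and value[0] in YAML_INDICATORS


for candidate in ["ner-person", "Person", "@JobTitle", "*wildcard", "[tag=NNP]+"]:
    verdict = "quote it" if needs_quotes(candidate) else "fine unquoted"
    print("{!r}: {}".format(candidate, verdict))
```

Note that a hyphen inside a name (as in ner-person) is harmless; only the first character matters for this check.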
In [4]:
rules = """
taxonomy:
  - Entity:
      - ProperNoun
      - Organization
      - PossiblePerson:
          - Person
      - Location
      - Date
  - HasX:
      - HasTitle
  - Event:
      - Missing

rules:
  - name: "ner-location"
    label: Location
    priority: 1
    type: token
    pattern: |
      [entity="LOCATION"]+ |
      Twin Peaks

  - name: "ner-person"
    label: Person
    priority: 1
    type: token
    pattern: |
      [entity="PERSON"]+

  - name: "ner-org"
    label: Organization
    priority: 1
    type: token
    pattern: |
      [entity="ORGANIZATION"]+

  - name: "ner-date"
    label: Date
    priority: 1
    type: token
    pattern: |
      [entity="DATE"]+

  - name: "proper-noun"
    label: ProperNoun
    priority: 2
    type: token
    pattern: |
      [word=/^[A-Z]/ & tag=/^(JJ|NN)/ & !mention=Person]+ |
      [tag=/^NNP/]+

  - name: "has-title"
    label: HasTitle
    pattern: |
      person: Person
      title: ProperNoun = nn [!mention=Person]

  - name: "missing"
    label: Missing
    pattern: |
      trigger = [lemma=go] missing
      theme: Person = <xcomp nsubj
      date: Date? = prep_on
"""
mentions = API.odin.extract_from_text("FBI Special Agent Dale Cooper went missing on June 10, 1991. He was last seen in the woods of Twin Peaks. ", rules=rules)
for m in mentions: viz.display_mention(m)
viz.display_graph(mentions[-1].sentenceObj)
In [5]:
example_doc = API.annotate("Julia-Louis Dreyfus and Brad Hall were married in June of 1987.")
from processors.visualization import JupyterVisualizer as viz
viz.display_graph(example_doc.sentences[0])
In [6]:
rules = """
rules:
  - name: "job-title"
    label: JobTitle
    type: token
    pattern: |
      Special Agent
"""
tp_doc = API.annotate("FBI Special Agent Dale Cooper went missing on June 10, 1991")
mentions = API.odin.extract_from_document(tp_doc, rules)
for m in mentions: viz.display_mention(m)
Of course, as we'll see, rules can get much more sophisticated than this. For example, Odin allows you to write your pattern over combinations of token attributes (see the token constraints section for more details).
Much of the power of Odin comes from its ability to scaffold rules. The output of one rule can be referenced by its label in subsequent rules. This allows us to write compact, powerful grammars. In a surface pattern, this is done using the syntax @MyLabelHere, where MyLabelHere refers to whatever label you wish to reference. We'll apply this syntax in the example below, where we'll build another rule off of the output of JobTitle...
In [7]:
rules = """
rules:
  - name: "job-title"
    label: JobTitle
    # This rule runs in the first pass
    # of Odin and never again
    priority: 1
    type: token
    pattern: |
      Special Agent

  - name: "expanded-title"
    label: JobTitle
    priority: 2
    type: token
    pattern: |
      FBI @JobTitle
"""
tp_doc = API.annotate("FBI Special Agent Dale Cooper went missing on June 10, 1991")
mentions = API.odin.extract_from_document(tp_doc, rules)
for m in mentions: print("\"{}\"".format(m.foundBy)), viz.display_mention(m)
Note that we could omit the explicit priority from our second rule, "expanded-title", as it won't successfully match until an @JobTitle is available to reuse. Limiting the priority here is an efficiency decision: Odin won't even attempt to match the rule until told to do so.
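To make the idea of passes and mention reuse concrete, here is a toy simulation in plain Python. It is only a sketch of the general idea and not how Odin is implemented: patterns are lists of literal words or `@Label` references, every rule is tried in every pass, and `match_at` and `run_passes` are hypothetical names made up for this illustration.

```python
def match_at(pattern, tokens, mentions, pos):
    """Yield every end offset at which `pattern` can finish when started at `pos`."""
    if not pattern:
        yield pos
        return
    head, rest = pattern[0], pattern[1:]
    if head.startswith("@"):
        # "@Label" consumes a previously found mention that starts here.
        for label, start, end in mentions:
            if label == head[1:] and start == pos:
                yield from match_at(rest, tokens, mentions, end)
    elif pos < len(tokens) and tokens[pos] == head:
        # A literal word consumes exactly one token.
        yield from match_at(rest, tokens, mentions, pos + 1)


def run_passes(rules, tokens):
    """Apply all rules pass by pass until a pass finds nothing new.

    Returns a dict mapping (label, start, end) to the pass that found it.
    """
    found = {}
    pass_number = 0
    while True:
        pass_number += 1
        visible = frozenset(found)  # only mentions from earlier passes
        new = set()
        for label, pattern in rules:
            for start in range(len(tokens)):
                for end in match_at(pattern, tokens, visible, start):
                    if (label, start, end) not in found:
                        new.add((label, start, end))
        if not new:
            return found
        for mention in new:
            found[mention] = pass_number


tokens = "FBI Special Agent Dale Cooper went missing".split()
rules = [
    ("JobTitle", ["Special", "Agent"]),  # plays the role of "job-title"
    ("JobTitle", ["FBI", "@JobTitle"]),  # plays the role of "expanded-title"
]
result = run_passes(rules, tokens)
for (label, start, end), n in sorted(result.items(), key=lambda kv: kv[1]):
    print("pass {}: {} -> {}".format(n, label, " ".join(tokens[start:end])))
```

In this toy version the second pattern cannot match in pass 1 because no JobTitle mention exists yet, which mirrors why an explicit priority on "expanded-title" is optional.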
In [8]:
challenge_text = "FBI Special Agent Fox Mulder wants to believe you're grokking mention reuse."
challenge_rules = """
rules:
  - name: "job-title"
    label: JobTitle
    type: token
    pattern: |
      Special Agent

  - name: "org"
    label: Organization
    type: token
    # Don't worry if this rule doesn't make sense to you just yet.
    # This pattern is a peek ahead. Feel free to rewrite it in a form that is familiar.
    pattern: |
      [entity=ORGANIZATION]+

  - name: "star-fox"
    label: Person
    type: token
    pattern: |
      [entity=PERSON]+

  - name: "you-complete-me"
    label: ReallySpecialGuy
    type: token
    pattern: |
      ""
"""
challenge_doc = API.annotate(challenge_text)
mentions = API.odin.extract_from_document(challenge_doc, challenge_rules)
for m in mentions: print("\"{}\"".format(m.foundBy)), viz.display_mention(m)
Write a rule set that captures this simple phrase structure grammar for linguistic constituents:
Verb -> (identify by PoS tag of terminals)
Noun -> (identify by PoS tag of terminals)
Adjective -> (identify by PoS tag of terminals)
NP -> determiner (by tag) + zero or more Adjective + one or more Noun
In [9]:
challenge_psg_text = """
The black dog runs at night.
Out of nowhere, the mind comes forth.
"""
psg_rules = """
rules:
  - name: "verb"
    label: Verb
    type: token
    pattern: ???

  - name: "noun"
    label: Noun
    type: token
    pattern: ???

  - name: "adjective"
    label: Adjective
    type: token
    pattern: ???

  - name: "noun-phrase"
    label: NP
    type: token
    pattern: ???
"""
challenge_doc = API.annotate(challenge_psg_text)
# mentions = API.odin.extract_from_document(doc=challenge_doc, rules=psg_rules)
# for m in mentions: viz.display_mention(m)
Field | Description |
---|---|
word | The actual token. |
lemma | The lemma form of the token. |
tag | The part-of-speech (PoS) tag assigned to the token. |
incoming | Incoming relations from the dependency graph for the token. |
outgoing | Outgoing relations from the dependency graph for the token. |
chunk | The shallow constituent type (ex. NP, VP) immediately containing the token. |
entity | The NER label of the token. |
mention | The label of any Mention(s) (i.e., rule output) that contains the token. |
For more information on PoS tags (tag in the table above), see https://www.eecis.udel.edu/~vijay/cis889/ie/pos-set.pdf
In [10]:
text = """
the expensive delicate ship that must have seen
something amazing, a boy falling out of the sky,
had somewhere to get to and sailed calmly on.
"""
example_doc2 = API.annotate(text)
rules_v1 = """
rules:
  - name: "disjunction"
    label: Example
    type: token
    pattern: |
      [tag=RB] | [tag=JJ]
"""
mentions = API.odin.extract_from_document(example_doc2, rules_v1)
for m in mentions: viz.display_mention(m)
In [11]:
rules_v2 = """
rules:
  - name: "disjunction"
    label: Example
    type: token
    pattern: |
      # if it's easier to read
      # we can split the disjunction
      # onto two lines
      [tag=RB] |
      [tag=JJ]
"""
mentions = API.odin.extract_from_document(example_doc2, rules_v2)
for m in mentions: viz.display_mention(m)
You can blame the inclusion of this instance of somewhere on the PoS tagger.
In [12]:
text = "Hamlet killed Claudius. Rosencrantz and Guildenstern were executed."
example_doc2 = API.annotate(text)
# let's look at the syntactic dependency parse for each sentence
for s in example_doc2.sentences: viz.display_graph(s, distance=150)
In [13]:
rules_v1 = """
rules:
  - name: "example-1"
    label: Subject
    type: token
    pattern: |
      # a disjunction of two exact strings
      # denoting either a passive or active subject
      [incoming=nsubjpass] | [incoming=nsubj]
"""
mentions = API.odin.extract_from_document(example_doc2, rules_v1)
for m in mentions: viz.display_mention(m)
In [14]:
rules_v2 = """
rules:
  - name: "example-1"
    label: Subject
    # a single token constraint is a surface pattern, so the type must be token
    type: token
    pattern: |
      # a regex that will match
      # both passive and active subjects
      [incoming=/^nsubj/]
"""
mentions = API.odin.extract_from_document(example_doc2, rules_v2)
for m in mentions: viz.display_mention(m)
In [15]:
text = """HEY YOU GUYS! """
example_doc2 = API.annotate(text)
insensitive_rules = """
rules:
  - name: "insensitive"
    label: Example
    type: token
    pattern: |
      # if we don't use [],
      # Odin assumes the pattern is in terms
      # of the token's word attribute
      /(?i)guys/
"""
mentions = API.odin.extract_from_document(example_doc2, insensitive_rules)
for m in mentions: viz.display_mention(m)
In [16]:
text = """It's not prattle when I warn a gentle touch is needed with the glass menagerie on the mantle."""
example_doc2 = API.annotate(text)
insensitive_rules = """
rules:
  - name: "combined"
    label: Example
    type: token
    pattern: |
      [tag=/^NN/ & word=/tle$/]
"""
mentions = API.odin.extract_from_document(example_doc2, insensitive_rules)
for m in mentions: viz.display_mention(m)
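The conjunction in the "combined" rule can be mimicked in plain Python to see what each half of the constraint contributes. This is only an analogy: the tokens and tags below are hand-written for illustration rather than produced by the NLP pipeline, `satisfies` is a hypothetical helper, and the sketch assumes that a regex constraint matches anywhere in the attribute unless it is anchored.

```python
import re

# Hand-written annotations for illustration only.
tokens = [
    {"word": "gentle", "tag": "JJ"},
    {"word": "touch", "tag": "NN"},
    {"word": "prattle", "tag": "NN"},
    {"word": "mantle", "tag": "NN"},
    {"word": "menagerie", "tag": "NN"},
]


def satisfies(token, constraints):
    """True if every (field, regex) pair matches the token (a conjunction)."""
    return all(re.search(regex, token[field]) for field, regex in constraints)


# Analogue of [tag=/^NN/ & word=/tle$/]
noun_ending_in_tle = [("tag", r"^NN"), ("word", r"tle$")]
matches = [t["word"] for t in tokens if satisfies(t, noun_ending_in_tle)]
print(matches)
```

Here "gentle" ends in "tle" but is excluded by the tag constraint, while "touch" and "menagerie" are nouns that fail the word constraint.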
In [17]:
challenge_text = """
???
"""
morpheme_rules = """
rules:
  - name: "er-deriv-suffix"
    label: HasDerivSuffix
    type: token
    pattern: ???
"""
# challenge_doc = API.annotate(challenge_text)
# mentions = API.odin.extract_from_document(challenge_doc, morpheme_rules)
# for m in mentions: viz.display_mention(m)
In [18]:
text = "If you wish to make an apple pie from scratch, you must first invent the universe."
d = API.annotate(text)
viz.display_graph(d.sentences[0])
challenge_rules = """
rules:
  - name: "no-verbs"
    label: NotVerb
    # req. 1: This pattern should involve a single token constraint
    # req. 2: The token constraint should use a negated pattern
    pattern: ???
"""
# mentions = API.odin.extract_from_document(d, challenge_rules)
# for m in mentions: print(m)
Token constraints, arguments, and graph edges can all be quantified.
Symbol | Description | Lazy form |
---|---|---|
`?` | The quantified pattern is optional. | `??` |
`*` | Repeat the quantified pattern zero or more times. | `*?` |
`+` | Repeat the quantified pattern one or more times. | `+?` |
`{n}` | Exact repetition. Repeat the quantified pattern n times. | |
`{n,m}` | Ranged repetition. Repeat the quantified pattern between n and m times, where n < m. | `{n,m}?` |
`{,m}` | Open start ranged repetition. Repeat the quantified pattern between 0 and m times, where m > 0. | `{,m}?` |
`{n,}` | Open end ranged repetition. Repeat the quantified pattern at least n times, where n > 0. | `{n,}?` |
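If the difference between the greedy and lazy forms is unfamiliar, it may help to see the same symbols at work in ordinary regular expressions. The sketch below uses Python's `re` module over characters rather than Odin over tokens, so it is an analogy for the notation and not a demonstration of Odin itself.

```python
import re

text = "aaaa"

greedy = re.match(r"a+", text).group()            # takes as much as it can
lazy = re.match(r"a+?", text).group()             # takes as little as it can
exact = re.match(r"a{2}", text).group()           # exactly two repetitions
ranged = re.match(r"a{1,3}", text).group()        # between one and three, greedy
lazy_ranged = re.match(r"a{1,3}?", text).group()  # between one and three, lazy
open_start = re.match(r"a{,2}", text).group()     # between zero and two

for name in ["greedy", "lazy", "exact", "ranged", "lazy_ranged", "open_start"]:
    print("{:12s} {!r}".format(name, globals()[name]))
```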
Odin supports lookaround assertions, as well as start/end sentence anchors. You can use lookarounds to specify contextual constraints that you don't want to end up in your result (ex. "only match B if it's preceded by A").
Symbol | Description | Example Pattern | Match (in bold) |
---|---|---|---|
`^` | beginning of sentence | `^ My` | **My** name is Inigo Montoya . |
`$` | end of sentence | `"." $` | My name is Inigo Montoya **.** |
`(?=...)` | positive lookahead | `Inigo (?= Montoya)` | My name is **Inigo** Montoya . |
`(?!...)` | negative lookahead | `Inigo (?! Arocena)` | My name is **Inigo** Montoya . |
`(?<=...)` | positive lookbehind | `(?<= Inigo) Montoya` | My name is Inigo **Montoya** . |
`(?<!...)` | negative lookbehind | `(?<! Carlos) Montoya` | My name is Inigo **Montoya** . |
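The same assertions exist in ordinary regular expressions, which makes them easy to experiment with outside of Odin. The sketch below reproduces the table's examples with Python's `re` module over characters; Odin applies the equivalent logic over tokens, so treat this as an analogy rather than Odin's behavior.

```python
import re

sentence = "My name is Inigo Montoya ."

examples = {
    "beginning of sentence": r"^My",
    "end of sentence": r"\.$",
    "positive lookahead": r"Inigo(?= Montoya)",
    "negative lookahead": r"Inigo(?! Arocena)",
    "positive lookbehind": r"(?<=Inigo )Montoya",
    "negative lookbehind": r"(?<!Carlos )Montoya",
}

# The context inside a lookaround is checked but never becomes part of the match.
results = {name: re.search(p, sentence).group() for name, p in examples.items()}
for name, matched in results.items():
    print("{:22s} {!r}".format(name, matched))
```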
Rule writing can be an incremental process of refinement. Sometimes it's a matter of adding conjunctions to further constrain a match, or disjunctions to relax it. Other times, as demonstrated below, it comes down to picking the appropriate representation/attribute for a token...
The naive rule below tries to label any sequence of proper nouns as a Person mention. As you can see, this is too general. You can probably think of other spurious stuff that this would match, right?
In [19]:
entity_rule_v1 = """
rules:
  - name: "person"
    label: Person
    priority: 1
    type: token
    pattern: |
      [tag=NNP]+
"""
In [20]:
mentions = API.odin.extract_from_document(doc=example_doc, rules=entity_rule_v1)
for m in mentions: viz.display_mention(m)
Let's see if we can do better. It turns out we're lucky: the model used by the named entity recognizer (NER) built into our NLP pipeline has been trained to detect the PERSON label. Let's take a look...
In [21]:
entity_rule_v2 = """
rules:
  - name: "person"
    label: Person
    priority: 1
    type: token
    pattern: |
      [entity=PERSON]+
"""
In [22]:
mentions = API.odin.extract_from_document(doc=example_doc, rules=entity_rule_v2)
for m in mentions: viz.display_mention(m)
TODO
See the relevant section in the manual
See the relevant section in the manual
For a description of the dependency relations used by default in Odin, see the collapsed dependencies described in https://nlp.stanford.edu/software/dependencies_manual.pdf
In [23]:
rules_v1 = """
rules:
  - name: "person"
    label: Person
    type: token
    pattern: |
      [entity=PERSON]+

  - name: "marriage-event"
    label: Marriage
    pattern: |
      trigger = [lemma=marry]
      spouse: Person = nsubjpass
"""
In [24]:
mentions = API.odin.extract_from_document(doc=example_doc, rules=rules_v1)
for m in mentions:
    if m.matches("Marriage"):
        viz.display_mention(m)
We end up with two Marriage event mentions, each containing only one spouse. Wouldn't it be great if we had a way to specify how many of each argument were required for a single mention?
In [25]:
rules_v2 = """
rules:
  - name: "person"
    label: Person
    type: token
    pattern: |
      [entity=PERSON]+

  - name: "marriage-event"
    label: Marriage
    pattern: |
      trigger = [lemma=marry]
      spouse: Person+ = nsubjpass
"""
In [26]:
mentions = API.odin.extract_from_document(doc=example_doc, rules=rules_v2)
for m in mentions:
    if m.matches("Marriage"):
        viz.display_mention(m)
We can even specify an exact number for each argument.
In [27]:
rules_v3 = """
rules:
  - name: "person"
    label: Person
    type: token
    pattern: |
      [entity=PERSON]+

  - name: "marriage-event"
    label: Marriage
    pattern: |
      trigger = [lemma=marry]
      spouse: Person{2} = nsubjpass
"""
mentions = API.odin.extract_from_document(doc=example_doc, rules=rules_v3)
for m in mentions:
    if m.matches("Marriage"):
        viz.display_mention(m)
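One way to think about argument quantifiers is as a choice of how candidate arguments are grouped into mentions. The toy function below is a rough mental model only: `group_arguments` is a hypothetical helper, the candidate names are placeholders, and the handling of surplus candidates under an exact count (one mention per combination) is an assumption rather than something this tutorial demonstrates.

```python
from itertools import combinations


def group_arguments(candidates, quantifier=None):
    """Group argument candidates into per-mention argument lists.

    quantifier=None -> one mention per candidate
    quantifier="+"  -> a single mention holding every candidate
    quantifier=k    -> one mention per combination of exactly k candidates
                       (assumed behavior when more than k are available)
    """
    if quantifier is None:
        return [[c] for c in candidates]
    if quantifier == "+":
        return [list(candidates)] if candidates else []
    return [list(combo) for combo in combinations(candidates, quantifier)]


candidates = ["spouse_1", "spouse_2", "spouse_3"]  # placeholder arguments
for quantifier in (None, "+", 2):
    print(quantifier, "->", group_arguments(candidates, quantifier))
```

With exactly two candidates, as in the marriage example, both `+` and `{2}` collapse to a single group containing both spouses.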
In [28]:
text = "In a parallel universe, Marge married Homer, Ned Flanders, and Troy McClure."
d = API.annotate(text)
viz.display_graph(d.sentences[0], css=viz.parse_css)
challenge_rules = """
rules:
  - name: "person"
    label: Person
    type: token
    pattern: |
      [entity=PERSON]+

  - name: "marriage-event"
    label: Marriage
    pattern: ???
"""
#mentions = API.odin.extract_from_document(doc=d, rules=challenge_rules)
#for m in mentions: print(m)
In [29]:
text = "Gonzo and Camilla were married in October. Barack and Michelle were married in Chicago."
d = API.annotate(text)
challenge_rules = """
rules:
  - name: "person"
    label: Person
    type: token
    pattern: |
      [entity=PERSON]+

  # TODO: add a rule for Date
  # TODO: add a rule for Location
  # TODO: add optional args to "marriage-event"
  - name: "marriage-event"
    label: Marriage
    pattern: |
      trigger = [lemma=marry]
      spouse: Person{2} = nsubjpass
"""
mentions = API.odin.extract_from_document(doc=d, rules=challenge_rules)
for m in mentions:
    if m.matches("Marriage"):
        viz.display_mention(m)
See the relevant section in the manual
It can be tedious to write sets of rules by hand. Often you'll see that components of rules can or should be reused in subsets of your grammar. Odin supports the use of variables and templates to address just this. Variables and templates help to maintain large grammars and create rule sets that can be "recycled" or applied to related problems with a few tweaks.
For more details, see the relevant section in the manual
Templates work via file imports. For more complex cases of template use involving multiple files, see the odin examples sbt project or Reach.
See the relevant section in the manual
Rules are applied iteratively (pass 1, pass 2, ..., pass n). If you want to control when a rule should be applied, specify a value for the rule field priority. The value can be an open or closed range, an exact value, or a list of comma-separated values. By default, a rule will continue to be executed until no rule has produced a new match (priority: 1+). This default means that you usually don't need to worry about setting the priority, but the power is there if you need it.
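As a rough illustration of how the different kinds of priority value restrict the passes in which a rule runs, consider the sketch below. The spellings it accepts are assumptions (only `1+` and plain integers appear in this tutorial), and `applies_in_pass` is a hypothetical helper, not part of Odin or py-processors; consult the manual for the exact syntax.

```python
def applies_in_pass(priority, pass_number):
    """Return True if a rule with this priority spec runs in `pass_number`.

    Assumed spellings (check the manual for the real syntax):
      "2"      exact value
      "2+"     open range: pass 2 and every later pass
      "2-4"    closed range: passes 2 through 4
      "1,3,5"  comma-separated list (optionally wrapped in [ ])
    """
    spec = str(priority).strip().strip("[]").replace(" ", "")
    if spec.endswith("+"):
        return pass_number >= int(spec[:-1])
    if "-" in spec:
        low, high = spec.split("-")
        return int(low) <= pass_number <= int(high)
    return pass_number in {int(p) for p in spec.split(",")}


for spec in ["1+", "2", "2-4", "[1, 3, 5]"]:
    active = [n for n in range(1, 7) if applies_in_pass(spec, n)]
    print("{:10s} runs in passes {}".format(spec, active))
```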
Note that quantifiers can be applied to priorities.
label field...
Every rule must have either a label or labels field.
This field tells Odin what the type of the Mention is that you're trying to capture.
Remember that these types can be "reused" in subsequent rules (ex. find a Person and then find events involving some Person).
In [30]:
bad_rules = """
rules:
  - name: "person"
    type: token
    pattern: |
      [entity=PERSON]+
"""
API.odin.extract_from_document(doc=example_doc, rules=bad_rules)
In [31]:
bad_rules = """
rules:
  # we've misspelled "name"
  - nme: "person"
    label: Person
    type: token
    pattern: |
      [entity=PERSON]+
"""
API.odin.extract_from_document(doc=example_doc, rules=bad_rules)
In [32]:
bad_rules = """
rules:
  - name: "person"
    label: Person
    # we've misspelled "token"
    type: tken
    pattern: |
      [entity=PERSON]+
"""
API.odin.extract_from_document(doc=example_doc, rules=bad_rules)
field...
In the current version of Odin, you are restricted to a predefined set of token fields for use in your patterns.
See the token constraints table for a comprehensive list of valid token fields.
In [33]:
bad_rules = """
rules:
  - name: "person"
    label: Person
    type: token
    pattern: |
      [nonexistentfield=BLARG]+
"""
API.odin.extract_from_document(doc=example_doc, rules=bad_rules)
In [34]:
bad_rules = """
rules:
  - name: "person"
    label: Person
    priority: 1+
    type: token
    pattern: [entity=PERSON]+
"""
API.odin.extract_from_document(doc=example_doc, rules=bad_rules)
While the error message is cryptic, the solution is simply to make the pattern multi-line (ex. pattern: |).
This pattern never makes it to Odin, because it fails to parse as valid YAML. | denotes a YAML literal block scalar, which YAML will read without complaint and pass along to Odin.
Without the |, the YAML parser assumes that it's dealing with a list until it sees the +, which blows its mind with a wave of Cthulhu madness, upends its conception of reality, and sends it to an ashram for a period of convalescence and deep introspection.
In [35]:
bad_rules = """
rules:
  - name: "person"
    label: Person
    type: token
    pattern: |
      [entity=PERSON]+

  - name: "person"
    label: Person
    type: token
    pattern: |
      [tag=NNP]+
"""
API.odin.extract_from_document(doc=example_doc, rules=bad_rules)
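Several of the mistakes in this section can be caught before a grammar is ever sent to the server. The sketch below is a hypothetical pre-flight check written for this tutorial; `lint_rules` is not part of Odin or py-processors, it operates on rules that have already been loaded into Python dicts, and it only covers the problems illustrated above (a missing or misspelled name, a missing label, an unknown type, and duplicate rule names).

```python
VALID_TYPES = {"token", "dependency"}


def lint_rules(rules):
    """Return a list of human-readable problems found in a list of rule dicts."""
    problems = []
    seen = set()
    for i, rule in enumerate(rules):
        name = rule.get("name")
        where = "rule #{} ({})".format(i + 1, name or "unnamed")
        if name is None:
            problems.append(where + ": missing 'name'")
        elif name in seen:
            problems.append(where + ": duplicate name")
        else:
            seen.add(name)
        if "label" not in rule and "labels" not in rule:
            problems.append(where + ": missing 'label' or 'labels'")
        # The type field may be omitted, but if present it must be a known type.
        if "type" in rule and rule["type"] not in VALID_TYPES:
            problems.append(where + ": unknown type {!r}".format(rule["type"]))
        if "pattern" not in rule:
            problems.append(where + ": missing 'pattern'")
    return problems


bad = [
    {"nme": "person", "label": "Person", "type": "token", "pattern": "[entity=PERSON]+"},
    {"name": "person", "type": "tken", "pattern": "[tag=NNP]+"},
    {"name": "person", "label": "Person", "type": "token", "pattern": "[tag=NNP]+"},
]
problems = lint_rules(bad)
for problem in problems:
    print(problem)
```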
In [36]:
#API.stop_server()