In [1]:
from IPython.display import HTML
HTML("""<style>{}</style>""".format(open("assets/css/custom.css").read()))


Out[1]:

In [2]:
import processors
print(processors.__version__)


3.2.1

In [3]:
from processors import *
from processors.visualization import JupyterVisualizer as viz

API = ProcessorsAPI(port=8881, keep_alive=True)


INFO - Starting processors-server (java -Xmx3G -cp /Users/gus/anaconda3/lib/python3.5/site-packages/processors/processors-server.jar NLPServer --port 8881 --host localhost) ...
Waiting for server...
[============                                                ]

Connection with processors-server established (http://localhost:8881)

Rule-based information extraction with Odin

This tutorial provides an introduction to Odin, a domain-independent rule-based system for information extraction.

Why Odin?

  • Supports patterns over directed graphs, such as syntactic dependency parses
    • Good generalizability
  • Supports patterns over sequences of tokens and their attributes
  • Supports rule templates and variables
  • It was designed to be domain independent
  • Rules can be scaffolded and applied in cascades
    • The output of one rule can be the input to another rule
  • Odin is open source, under active development, and it even has a manual
  • You can use it natively from within the JVM (it was written in Scala) or from Python using a client-server architecture
  • Rules are written using YAML and familiar constructs

Useful resources for learning Odin

Projects using Odin

Prerequisites

This tutorial assumes that you have some familiarity with regular expressions.

Introduction

Odin operates over documents which have been tokenized, sentence-segmented, parsed, and annotated via an NLP pipeline for part-of-speech (PoS) tags, lemmas, and named entities.

Rules matched against these annotated documents produce mentions of entities, relations, or events that can then be reused to write more complicated rules (ex. entities $\rightarrow$ events $\rightarrow$ events involving other events).

These rules are written in a simple subset of YAML and can describe sequences of tokens or traversals over a syntactic dependency parse. Luckily, you don't need to be an expert in YAML in order to write Odin rules.

All rules have the following fields:

Field Description
name The name of the specific rule. When a rule matches, the match (Mention) stores the value of this field in its .foundBy attribute.
label What a rule's match represents (Person, Location, Phosphorylation, etc.).
type There are currently two primary rule types: token and dependency. token refers to a surface pattern (a sequence of tokens). dependency refers to a pattern over a graph (a syntactic dependency parse).
pattern The pattern itself, specified as a multi-line string introduced by the vertical bar character (|).

Notes on YAML

It's useful to keep in mind that YAML strings don’t have to be quoted. This is a nice feature that allows one to write shorter and cleaner rules. However, there is one exception that you should be aware of: strings that start with a YAML indicator character must be quoted. Indicator characters have special semantics and must be quoted if they should be interpreted as part of a string. These are all the valid YAML indicator characters:

- ? : , [ ] { } # & * ! | > ' " % @ `

As you can probably tell, these are not characters that occur frequently in practice. Usually names and labels are composed of alphanumeric characters and the occasional underscore, so, most of the time, you can get away without quoting strings.
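The quoting rule can be sketched in plain Python as a quick sanity check. The helper name needs_quotes is hypothetical (it is not part of Odin or py-processors), the indicator set uses the plain ASCII apostrophe and backtick, and the check only looks at the first character. Real YAML has further subtleties (some indicators are only special when followed by a space), so treat this as a rough guide rather than a validator.

```python
# YAML indicator characters (plain ASCII apostrophe and backtick assumed).
YAML_INDICATORS = set("-?:,[]{}#&*!|>'\"%@`")

def needs_quotes(value):
    """Return True if `value` starts with a YAML indicator character."""
    return bool(value) and value[0] in YAML_INDICATORS

for candidate in ["ner-person", "Person", "@JobTitle", "[entity=PERSON]+"]:
    verdict = "quote it" if needs_quotes(candidate) else "fine unquoted"
    print(candidate, "->", verdict)
```

Typical names and labels pass unquoted, while anything that begins like a token constraint or a mention reference does not.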

Outcome

By the end of this tutorial, you will understand how to interpret and modify the following grammar:


In [4]:
rules = """

taxonomy:
  - Entity:
    - ProperNoun
    - Organization
    - PossiblePerson:
      - Person
    - Location
  - Date
  - HasX:
    - HasTitle
  - Event:
    - Missing

rules:
  - name: "ner-location"
    label: Location
    priority: 1
    type: token
    pattern: |
      [entity="LOCATION"]+ | 
      Twin Peaks

  - name: "ner-person"
    label: Person
    priority: 1
    type: token
    pattern: |
     [entity="PERSON"]+

  - name: "ner-org"
    label: Organization
    priority: 1
    type: token
    pattern: |
      [entity="ORGANIZATION"]+

  - name: "ner-date"
    label: Date
    priority: 1
    type: token
    pattern: |
      [entity="DATE"]+

  - name: "proper-noun"
    label: ProperNoun
    priority: 2
    type: token
    pattern: |
      [word=/^[A-Z]/ & tag=/^(JJ|NN)/ & !mention=Person]+ |
      [tag=/^NNP/]+

  - name: "has-title"
    label: HasTitle
    pattern: |
      person: Person
      title: ProperNoun = nn [!mention=Person]

  - name: "missing"
    label: Missing
    pattern: |
      trigger = [lemma=go] missing
      theme: Person = <xcomp nsubj
      date: Date? = prep_on
"""

mentions = API.odin.extract_from_text("FBI Special Agent Dale Cooper went missing on June 10, 1991.  He was last seen in the woods of Twin Peaks. ", rules=rules)
for m in mentions: viz.display_mention(m)

viz.display_graph(mentions[-1].sentenceObj)


He was last seen in the woods of LocationTwin Peaks .
He was last seen in the woods of ProperNounTwin Peaks .
ProperNounFBI Special Agent Dale Cooper went missing on June 10 , 1991 .
HasTitleProperNounFBI Special Agenttitle PersonDale Cooperperson went missing on June 10 , 1991 .
OrganizationFBI Special Agent Dale Cooper went missing on June 10 , 1991 .
FBI Special Agent Dale Cooper went missing on DateJune 10 , 1991 .
FBI Special Agent MissingPersonDale Coopertheme went missingTRIGGER on DateJune 10 , 1991date .
FBI Special Agent PersonDale Cooper went missing on June 10 , 1991 .
FBI Special Agent ProperNounDale Cooper went missing on June 10 , 1991 .
FBI Special Agent Dale Cooper went missing on ProperNounJune 10 , 1991 .

Capturing entities

Before we can write rules to identify relations and events, we must first identify their participants. We'll refer to these participants as entities.

Consider the following sentence describing a marriage:

Julia-Louis Dreyfus and Brad Hall were married in June of 1987.


In [5]:
example_doc = API.annotate("Julia-Louis Dreyfus and Brad Hall were married in June of 1987.")

from processors.visualization import JupyterVisualizer as viz

viz.display_graph(example_doc.sentences[0])


Capturing entities with surface patterns

A surface pattern is a rule that is written in terms of a sequence of tokens. The simplest surface pattern is just a sequence of words. For example, the rule below will match the sequence Special Agent and tag it as being a JobTitle.


In [6]:
rules = """
rules:
  - name: "job-title"
    label: JobTitle
    type: token
    pattern: |
      Special Agent
"""

tp_doc = API.annotate("FBI Special Agent Dale Cooper went missing on June 10, 1991")

mentions = API.odin.extract_from_document(tp_doc, rules)
for m in mentions: viz.display_mention(m)


FBI JobTitleSpecial Agent Dale Cooper went missing on June 10 , 1991

Of course, as we'll see, rules can get much more sophisticated than this. For example, Odin allows you to write your pattern over combinations of token attributes (see the token constraints section for more details).

Reusing mentions from an earlier rule

Much of the power of Odin comes from its ability to scaffold rules. The output of one rule can be referenced by its label in subsequent rules. This allows us to write compact, powerful grammars. In a surface pattern, this is done using the syntax @MyLabelHere where MyLabelHere refers to whatever label you wish to reference. We'll apply this syntax in the example below where we'll build another rule off of the output of JobTitle...


In [7]:
rules = """
rules:
  - name: "job-title"
    label: JobTitle
    # This rule runs in the first pass
    # of Odin and never again
    priority: 1
    type: token
    pattern: |
      Special Agent
      
  - name: "expanded-title"
    label: JobTitle
    priority: 2
    type: token
    pattern: |
      FBI @JobTitle
"""

tp_doc = API.annotate("FBI Special Agent Dale Cooper went missing on June 10, 1991")

mentions = API.odin.extract_from_document(tp_doc, rules)
for m in mentions: print("\"{}\"".format(m.foundBy)), viz.display_mention(m)


"job-title"
FBI JobTitleSpecial Agent Dale Cooper went missing on June 10 , 1991
"expanded-title"
JobTitleFBI Special Agent Dale Cooper went missing on June 10 , 1991

Note that we could omit the explicit priority from our second rule, "expanded-title", as it won't successfully match until an @JobTitle is available to reuse. Limiting the priority here is an efficiency decision, in that Odin won't even attempt to match the rule until told to do so.
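If the idea of a cascade still feels abstract, the toy below mimics the two passes in plain Python. It is only a stand-in for illustration: the helper find_sequence and the tuple layout (label, rule name, span) are made up here, and Odin's real matching engine is far more general.

```python
tokens = "FBI Special Agent Dale Cooper went missing".split()

def find_sequence(tokens, words):
    """Return (start, end) spans where `words` occurs as a contiguous run."""
    n = len(words)
    return [(i, i + n) for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words]

# Pass 1: "job-title" matches the literal sequence "Special Agent".
mentions = [("JobTitle", "job-title", span)
            for span in find_sequence(tokens, ["Special", "Agent"])]

# Pass 2: "expanded-title" needs the word "FBI" immediately followed by
# an existing JobTitle mention (the role played by @JobTitle in the rule).
expanded = [("JobTitle", "expanded-title", (start - 1, end))
            for label, found_by, (start, end) in mentions
            if label == "JobTitle" and start > 0 and tokens[start - 1] == "FBI"]
mentions.extend(expanded)

for label, found_by, (start, end) in mentions:
    print(found_by, "->", " ".join(tokens[start:end]))
```

The second pass has nothing to work with until the first pass has produced a JobTitle, which is why the explicit priority on "expanded-title" is optional.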

Challenge: another agent

Complete the grammar below to match Special Agent Fox Mulder. Note that you only need to change the rule "you-complete-me".


In [8]:
challenge_text = "FBI Special Agent Fox Mulder wants to believe you're gokking mention reuse."

challenge_rules = """
rules:
  - name: "job-title"
    label: JobTitle
    type: token
    pattern: |
      Special Agent
      
  - name: "org"
    label: Organization
    type: token
    # Don't worry if this rule doesn't make sense to you just yet.
    # This pattern is a peek ahead.  Feel free to rewrite it in a form that is familiar.
    pattern: |
      [entity=ORGANIZATION]+
      
  - name: "star-fox"
    label: Person
    type: token
    pattern: |
      [entity=PERSON]+

  - name: "you-complete-me"
    label: ReallySpecialGuy
    type: token
    pattern: |
      ""
"""

challenge_doc = API.annotate(challenge_text)

mentions = API.odin.extract_from_document(challenge_doc, challenge_rules)
for m in mentions: print("\"{}\"".format(m.foundBy)), viz.display_mention(m)


"job-title"
FBI JobTitleSpecial Agent Fox Mulder wants to believe you 're gokking mention reuse .
"org"
OrganizationFBI Special Agent Fox Mulder wants to believe you 're gokking mention reuse .
"star-fox"
FBI Special Agent PersonFox Mulder wants to believe you 're gokking mention reuse .

Challenge: chunking text (part 1)

Write a rule set that captures this simple phrase structure grammar for linguistic constituents:

Verb        ->  (identify by PoS tag of terminals)
Noun        ->  (identify by PoS tag of terminals)
Adjective   ->  (identify by PoS tag of terminals)
NP          ->  determiner (by tag) + zero or more Adjective + one or more Noun

In [9]:
challenge_psg_text = """
The black dog runs at night. 
Out of nowhere, the mind comes forth.
"""

psg_rules = """
rules: 
    - name: "verb"
      label: Verb
      type: token
      pattern: ???
    
    - name: "noun"
      label: Noun
      type: token
      pattern: ???
      
    - name: "adjective"
      label: Adjective
      type: token
      pattern: ???  

    - name: "noun-phrase"
      label: NP
      type: token
      pattern: ???  
"""

challenge_doc = API.annotate(challenge_psg_text)

# mentions = API.odin.extract_from_document(doc=challenge_doc, rules=psg_rules)
# for m in mentions: viz.display_mention(m)

Challenge: chunking text (part 2)

Modify the PSG rules provided in the previous challenge to include one for a verb phrase (VP). Extend your grammar to cover your VP additions.

Token constraints

Field Description
word The actual token.
lemma The lemma form of the token
tag The part-of-speech (PoS) tag assigned to the token
incoming Incoming relations from the dependency graph for the token
outgoing Outgoing relations from the dependency graph for the token
chunk The shallow constituent type (ex. NP, VP) immediately containing the token
entity The NER label of the token
mention The label of any Mention(s) (i.e., rule output) that contains the token.

For more information on PoS tags (tag in the table above), see https://www.eecis.udel.edu/~vijay/cis889/ie/pos-set.pdf

Disjunctions

Disjunctions are specified using |. Imagine we want to find all adjectives and adverbs in the following snippet from W.H. Auden:

the expensive delicate ship that must have seen
something amazing, a boy falling out of the sky,
had somewhere to get to and sailed calmly on.


In [10]:
text = """
the expensive delicate ship that must have seen  
something amazing, a boy falling out of the sky,  
had somewhere to get to and sailed calmly on.
"""

example_doc2 = API.annotate(text)

rules_v1 = """
rules: 
    - name: "disjunction"
      label: Example
      type: token
      pattern: |
        [tag=RB] | [tag=JJ]   
"""

mentions = API.odin.extract_from_document(example_doc2, rules_v1)
for m in mentions: viz.display_mention(m)


the expensive Exampledelicate ship that must have seen something amazing , a boy falling out of the sky , had somewhere to get to and sailed calmly on .
the expensive delicate ship that must have seen something amazing , a boy falling out of the sky , had Examplesomewhere to get to and sailed calmly on .
the expensive delicate ship that must have seen something Exampleamazing , a boy falling out of the sky , had somewhere to get to and sailed calmly on .
the Exampleexpensive delicate ship that must have seen something amazing , a boy falling out of the sky , had somewhere to get to and sailed calmly on .
the expensive delicate ship that must have seen something amazing , a boy falling out of the sky , had somewhere to get to and sailed Examplecalmly on .

In [11]:
rules_v2 = """
rules: 
    - name: "disjunction"
      label: Example
      type: token
      pattern: |
        # if it's easier to read 
        # we can split the disjunction
        # onto two lines
        [tag=RB] | 
        [tag=JJ]   
"""

mentions = API.odin.extract_from_document(example_doc2, rules_v2)
for m in mentions: viz.display_mention(m)


the expensive Exampledelicate ship that must have seen something amazing , a boy falling out of the sky , had somewhere to get to and sailed calmly on .
the expensive delicate ship that must have seen something amazing , a boy falling out of the sky , had Examplesomewhere to get to and sailed calmly on .
the expensive delicate ship that must have seen something Exampleamazing , a boy falling out of the sky , had somewhere to get to and sailed calmly on .
the Exampleexpensive delicate ship that must have seen something amazing , a boy falling out of the sky , had somewhere to get to and sailed calmly on .
the expensive delicate ship that must have seen something amazing , a boy falling out of the sky , had somewhere to get to and sailed Examplecalmly on .

You can blame the inclusion of this instance of somewhere on the PoS tagger.

Exact or regex

Patterns may involve an exact string or use regular expressions (Java-flavored). Imagine we want to identify all syntactic subjects in the following text:

Hamlet killed Claudius. Rosencrantz and Guildenstern were executed.


In [12]:
text = "Hamlet killed Claudius.  Rosencrantz and Guildenstern were executed."

example_doc2 = API.annotate(text)

# let's look at the syntactic dependency parse for each sentence
for s in example_doc2.sentences: viz.display_graph(s, distance=150)



In [13]:
rules_v1 = """
rules: 
    - name: "example-1"
      label: Subject
      type: token
      pattern: |
        # a disjunction of two exact strings 
        # denoting either a passive or active subject
        [incoming=nsubjpass] | [incoming=nsubj]   
"""

mentions = API.odin.extract_from_document(example_doc2, rules_v1)
for m in mentions: viz.display_mention(m)


SubjectRosencrantz and Guildenstern were executed .
Rosencrantz and SubjectGuildenstern were executed .
SubjectHamlet killed Claudius .

In [14]:
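rules_v2 = """
rules: 
    - name: "example-1"
      label: Subject
      # without an explicit type, the rule would be
      # treated as a dependency pattern
      type: token
      pattern: |
        # a regex that will match
        # both passive and active subjects
        [incoming=/^nsubj/]        
"""

# note: this cell now runs rules_v2 (not rules_v1)
mentions = API.odin.extract_from_document(example_doc2, rules_v2)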
for m in mentions: viz.display_mention(m)


SubjectRosencrantz and Guildenstern were executed .
Rosencrantz and SubjectGuildenstern were executed .
SubjectHamlet killed Claudius .

Case-insensitive patterns

Patterns can be made case-insensitive by beginning a regex with (?i), as in /(?i)guys/.


In [15]:
text = """HEY YOU GUYS! """

example_doc2 = API.annotate(text)

insensitive_rules = """
rules: 
    - name: "insensitive"
      label: Example
      type: token
      pattern: |
        # if we don't use [],
        # Odin assumes the pattern is in terms
        # of the token's word attribute
        /(?i)guys/      
"""

mentions = API.odin.extract_from_document(example_doc2, insensitive_rules)
for m in mentions: viz.display_mention(m)


HEY YOU ExampleGUYS !
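The (?i) inline flag is not specific to Odin; Python's re module accepts the same prefix, so you can try the idea without the server. This sketch works on a hand-split word list rather than on Odin tokens.

```python
import re

words = "HEY YOU GUYS !".split()

# Without the flag, the lowercase pattern finds nothing in the shouted text.
sensitive = [w for w in words if re.search(r"guys", w)]

# With (?i) at the start of the regex, case is ignored.
insensitive = [w for w in words if re.search(r"(?i)guys", w)]

print(sensitive)
print(insensitive)
```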

Combining token constraints

Token constraints can be combined using &. Imagine we only want to capture nouns ending in tle in the following sentence:

It's not prattle when I warn a gentle touch is needed with the glass menagerie on the mantle.


In [16]:
text = """It's not prattle when I warn a gentle touch is needed with the glass menagerie on the mantle."""

example_doc2 = API.annotate(text)

combined_rules = """
rules: 
    - name: "combined"
      label: Example
      type: token
      pattern: |
        [tag=/^NN/ & word=/tle$/]  
"""

mentions = API.odin.extract_from_document(example_doc2, combined_rules)
for m in mentions: viz.display_mention(m)


It 's not prattle when I warn a gentle touch is needed with the glass menagerie on the Examplemantle .
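One way to think about & is as a logical AND over per-field checks. The sketch below imitates [tag=/^NN/ & word=/tle$/] in plain Python. The tokens and their tags are hand-written for illustration (they are not tagger output), the helper satisfies is hypothetical, and it assumes a regex counts as a match when it is found anywhere in the value, which is why the ^ and $ anchors matter.

```python
import re

# Hand-annotated tokens; the tags are assumed, not produced by the pipeline.
tokens = [
    {"word": "a", "tag": "DT"},
    {"word": "gentle", "tag": "JJ"},
    {"word": "turtle", "tag": "NN"},
    {"word": "sleeps", "tag": "VBZ"},
]

def satisfies(token, constraints):
    """True if every (field, regex) pair is found in the token (logical AND)."""
    return all(re.search(regex, token[field]) for field, regex in constraints)

# Mirrors [tag=/^NN/ & word=/tle$/]
combined = [("tag", r"^NN"), ("word", r"tle$")]
matches = [t["word"] for t in tokens if satisfies(t, combined)]
print(matches)
```

Here gentle ends in tle but is not a noun, so only turtle survives both constraints.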

Challenge: identifying words containing a certain morpheme

  • Write a rule that identifies nouns containing the derivational (agentive) suffix -er/-or, as in teacher, buyer, actor, doctor, etc., while avoiding the homophonous inflectional morpheme -er in calmer, bigger, etc.
  • Test it with a few sentences

In [17]:
challenge_text = """
???
"""

morpheme_rules = """
rules: 
    - name: "er-deriv-suffix"
      label: HasDerivSuffix
      type: token
      pattern: ??? 
"""

# challenge_doc = API.annotate(challenge_text)

# mentions = API.odin.extract_from_document(challenge_doc, morpheme_rules)
# for m in mentions: viz.display_mention(m)

Negating token constraints

Token constraints can be negated by prefacing the attribute name with ! (see example below):

[!fieldname=pattern]
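Negation simply flips the outcome of a single check. The plain-Python sketch below imitates [!tag=/^NN/] on a few hand-tagged words. The tags are assumed for illustration and the helper negated is hypothetical.

```python
import re

# (word, tag) pairs with hand-assigned tags, for illustration only.
tagged = [("Cooper", "NNP"), ("loves", "VBZ"), ("coffee", "NN"), (".", ".")]

def negated(regex):
    """Build a predicate that is True when the regex is NOT found in the value."""
    return lambda value: re.search(regex, value) is None

# Mirrors [!tag=/^NN/]: keep tokens whose tag does not start with NN.
not_noun = negated(r"^NN")
kept = [word for word, tag in tagged if not_noun(tag)]
print(kept)
```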

Challenge: no verbs!

Using a single token constraint, match all tokens in the following sentence that are not verbs:

If you wish to make an apple pie from scratch, you must first invent the universe.


In [18]:
text = "If you wish to make an apple pie from scratch, you must first invent the universe."

d = API.annotate(text)

viz.display_graph(d.sentences[0])

challenge_rules = """
rules:
    - name: "no-verbs"
      label: NotVerb
      # req. 1: This pattern should involve a single token constraint
      # req. 2: The token constraint should use a negated pattern
      pattern: | ???
"""

# mentions = API.odin.extract_from_document(d, challenge_rules)
# for m in mentions: print(m)


Wildcard

Sometimes any token will suffice to complete a pattern. In such cases where token constraints are unnecessary, the [] wildcard can be used.

Example pattern: [] people

  • Example matches
    • I see dead people
    • All the lonely people
    • They are a strange people
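The pattern [] people can be pictured as "any one token, then the word people". Here is a plain-Python sketch of that reading. The function name is made up, and it works on whitespace-split words rather than on Odin tokens.

```python
def match_wildcard_people(tokens):
    """Return two-token spans whose second token is 'people'.

    The first token is unconstrained, like the [] wildcard.
    """
    return [tokens[i - 1:i + 1] for i in range(1, len(tokens)) if tokens[i] == "people"]

for sentence in ["I see dead people", "All the lonely people", "people talk"]:
    print(sentence, "->", match_wildcard_people(sentence.split()))
```

Note that the wildcard still consumes a token: a sentence that starts with people has nothing in front of it to match, so the last example yields no spans.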

Quantifiers

Token constraints, arguments, and graph edges can all be quantified.

Symbol Description Lazy form
? The quantified pattern is optional. ??
* Repeat the quantified pattern zero or more times. *?
+ Repeat the quantified pattern one or more times. +?
{n} Exact repetition. Repeat the quantified pattern n times.
{n,m} Ranged repetition. Repeat the quantified pattern between n and m times, where n < m. {n,m}?
{,m} Open start ranged repetition. Repeat the quantified pattern between 0 and m times, where m > 0. {,m}?
{n,} Open end ranged repetition. Repeat the quantified pattern at least n times, where n > 0. {n,}?
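If the greedy and lazy forms are new to you, Python's re module offers a convenient analogy. The difference is that Python quantifies characters while Odin quantifies tokens, arguments, and graph edges, so this is an illustration of the idea rather than of Odin itself.

```python
import re

text = "aaaa"

# Greedy forms take as much as the quantifier allows.
greedy_plus = re.match(r"a+", text).group()
greedy_range = re.match(r"a{2,3}", text).group()

# Lazy forms (a trailing ?) take as little as the quantifier allows.
lazy_plus = re.match(r"a+?", text).group()
lazy_range = re.match(r"a{2,3}?", text).group()

print(greedy_plus, lazy_plus)
print(greedy_range, lazy_range)
```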

Lookarounds and other zero-width assertions

Odin supports lookaround assertions, as well as start/end sentence anchors. You can use lookarounds to specify contextual constraints that you don't want to end up in your result (ex. "only match B if it's preceded by A").

Symbol Description Example Pattern Match (in bold)
^ beginning of sentence ^ My My name is Inigo Montoya .
$ end of sentence "." $ My name is Inigo Montoya .
(?=...) positive lookahead Inigo (?= Montoya) My name is Inigo Montoya .
(?!...) negative lookahead Inigo (?! Arocena) My name is Inigo Montoya .
(?<=...) positive lookbehind (?<= Inigo) Montoya My name is Inigo Montoya .
(?<!...) negative lookbehind (?<! Carlos) Montoya My name is Inigo Montoya .
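The same four lookarounds exist in Python's re module, which makes it easy to convince yourself that the context is checked but never included in the match. The analogy is at the character level (Odin's assertions look at neighboring tokens), so the spaces are written into the patterns here.

```python
import re

sentence = "My name is Inigo Montoya ."

lookahead = re.search(r"Inigo(?= Montoya)", sentence)
neg_lookahead = re.search(r"Inigo(?! Arocena)", sentence)
lookbehind = re.search(r"(?<=Inigo )Montoya", sentence)
neg_lookbehind = re.search(r"(?<!Carlos )Montoya", sentence)

# Each match contains only the target word, never the surrounding context.
for m in (lookahead, neg_lookahead, lookbehind, neg_lookbehind):
    print(m.group(), m.span())
```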

Refining rules: an example

Rule writing can be an incremental process of refinement. Sometimes it's a matter of adding conjunctions to further constrain a match, or disjunctions to relax it. Other times, as demonstrated below, it comes down to picking the appropriate representation/attribute for a token...

The naive rule below tries to label any sequence of proper nouns as a Person mention. As you can see, this is too general. You can probably think of other spurious stuff that this would match, right?


In [19]:
entity_rule_v1 = """
rules: 
    - name: "person"
      label: Person
      priority: 1
      type: token
      pattern: |
        [tag=NNP]+
"""

In [20]:
mentions = API.odin.extract_from_document(doc=example_doc, rules=entity_rule_v1)
for m in mentions: viz.display_mention(m)


PersonJulia-Louis Dreyfus and Brad Hall were married in June of 1987 .
Julia-Louis Dreyfus and PersonBrad Hall were married in June of 1987 .
Julia-Louis Dreyfus and Brad Hall were married in PersonJune of 1987 .

Let's see if we can do better. It turns out we're lucky, as the model used by the named entity recognizer (NER) built into our NLP pipeline has been trained to detect that label. Let's take a look...


In [21]:
entity_rule_v2 = """
rules:
    - name: "person"
      label: Person
      priority: 1
      type: token
      pattern: |
        [entity=PERSON]+
"""

In [22]:
mentions = API.odin.extract_from_document(doc=example_doc, rules=entity_rule_v2)
for m in mentions: viz.display_mention(m)


Julia-Louis Dreyfus and PersonBrad Hall were married in June of 1987 .
PersonJulia-Louis Dreyfus and Brad Hall were married in June of 1987 .

Capturing events and relations

TODO

Capturing events and relations with surface patterns

See the relevant section in the manual

Capturing events and relations with dependency patterns

See the relevant section in the manual

For a description of the dependency relations used by default in Odin, see the collapsed dependencies described in https://nlp.stanford.edu/software/dependencies_manual.pdf


In [23]:
rules_v1 = """
rules: 
    - name: "person"
      label: Person
      type: token
      pattern: |
        [entity=PERSON]+

    - name: "marriage-event"
      label: Marriage
      pattern: |
        trigger = [lemma=marry]
        spouse: Person = nsubjpass
"""

In [24]:
mentions = API.odin.extract_from_document(doc=example_doc, rules=rules_v1)
for m in mentions: 
    if m.matches("Marriage"):
        viz.display_mention(m)


MarriagePersonJulia-Louis Dreyfusspouse and Brad Hall were marriedTRIGGER in June of 1987 .
Julia-Louis Dreyfus and MarriagePersonBrad Hallspouse were marriedTRIGGER in June of 1987 .

We end up with two Marriage event mentions, each containing only one spouse. Wouldn't it be great if we had a way to specify how many of each argument were required for a single mention?

Quantifiers for dependency patterns

We know it takes two to tango, so let's try to get those arguments in the same mention.


In [25]:
rules_v2 = """
rules: 
    - name: "person"
      label: Person
      type: token
      pattern: |
        [entity=PERSON]+

    - name: "marriage-event"
      label: Marriage
      pattern: |
        trigger = [lemma=marry]
        spouse: Person+ = nsubjpass
"""

In [26]:
mentions = API.odin.extract_from_document(doc=example_doc, rules=rules_v2)
for m in mentions:
    if m.matches("Marriage"):
        viz.display_mention(m)


MarriagePersonJulia-Louis Dreyfusspouse and PersonBrad Hallspouse were marriedTRIGGER in June of 1987 .

We can even specify an exact number for each argument.


In [27]:
rules_v3 = """
rules: 
    - name: "person"
      label: Person
      type: token
      pattern: |
        [entity=PERSON]+

    - name: "marriage-event"
      label: Marriage
      pattern: |
        trigger = [lemma=marry]
        spouse: Person{2} = nsubjpass
"""

mentions = API.odin.extract_from_document(doc=example_doc, rules=rules_v3)
for m in mentions:
    if m.matches("Marriage"):
        viz.display_mention(m)


MarriagePersonJulia-Louis Dreyfusspouse and PersonBrad Hallspouse were marriedTRIGGER in June of 1987 .
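Once arguments are grouped into a single mention, it is handy to summarize them as text. The sketch below assumes that an event mention exposes a label, and an arguments mapping from role name to a list of mentions that each have a text attribute. Those attribute names are assumptions here (check them against your version of py-processors), so the example runs on simple stand-in objects instead of real mentions. The helper describe_event is hypothetical.

```python
from types import SimpleNamespace

def describe_event(mention):
    """Format an event-like object as 'Label(role=text, ...)'."""
    parts = []
    for role, args in sorted(mention.arguments.items()):
        for arg in args:
            parts.append("{}={}".format(role, arg.text))
    return "{}({})".format(mention.label, ", ".join(parts))

# Stand-ins shaped like the Marriage mention above (assumed attribute names).
julia = SimpleNamespace(label="Person", text="Julia-Louis Dreyfus")
brad = SimpleNamespace(label="Person", text="Brad Hall")
marriage = SimpleNamespace(label="Marriage", arguments={"spouse": [julia, brad]})

print(describe_event(marriage))
```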

Challenge: no more than four!

Imagine a polyandrous society where a woman can have at most four husbands.

In a parallel universe, Marge married Homer, Ned Flanders, and Troy McClure.

Complete the grammar rule set below to satisfy the conditions specified in the challenge.


In [28]:
text = "In a parallel universe, Marge married Homer, Ned Flanders, and Troy McClure."

d = API.annotate(text)

viz.display_graph(d.sentences[0], css=viz.parse_css)

challenge_rules = """
rules: 
    - name: "person"
      label: Person
      type: token
      pattern: |
        [entity=PERSON]+

    - name: "marriage-event"
      label: Marriage
      pattern: ???
"""

#mentions = API.odin.extract_from_document(doc=d, rules=challenge_rules)
#for m in mentions: print(m)


Challenge: optional arguments

Modify the grammar below to include two optional arguments in the Marriage event: "date" of type Date and "location" of type Location. Remember that you'll need additional rules to capture Date and Location in order for them to be available to the event rule.


In [29]:
text = "Gonzo and Camilla were married in October.  Barack and Michelle were married in Chicago."
d = API.annotate(text)


challenge_rules = """
rules: 
    - name: "person"
      label: Person
      type: token
      pattern: |
        [entity=PERSON]+

    # TODO: add a rule for Date
    
    # TODO: add a rule for Location
    
    # TODO: add optional args to "marriage-event"
    - name: "marriage-event"
      label: Marriage
      pattern: |
        trigger = [lemma=marry]
        spouse: Person{2} = nsubjpass
"""

mentions = API.odin.extract_from_document(doc=d, rules=challenge_rules)
for m in mentions:
    if m.matches("Marriage"):
        viz.display_mention(m)


MarriagePersonBarackspouse and PersonMichellespouse were marriedTRIGGER in Chicago .

Quantifiers in graph traversals

See the relevant section in the manual

Variables and rule templates

It can be tedious to write sets of rules by hand. Often you'll see that components of rules can or should be reused in subsets of your grammar. Odin supports the use of variables and templates to address just this. Variables and templates help to maintain large grammars and create rule sets that can be "recycled" or applied to related problems with a few tweaks.

For more details, see the relevant section in the manual

Templates work via file imports. For more complex cases of template use involving multiple files, see the odin examples sbt project or Reach.

Defining a taxonomy

See the relevant section in the manual
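To make the idea concrete, here is the taxonomy from the grammar at the start of this tutorial written as nested Python data, along with a small lookup. The working assumption (see the manual for the authoritative description) is that a label implies all of its ancestors, so a Person is also a PossiblePerson and an Entity. The function hypernym_path is a hypothetical helper, not part of Odin.

```python
# The taxonomy from the opening grammar, as nested lists and dicts.
taxonomy = [
    {"Entity": ["ProperNoun", "Organization", {"PossiblePerson": ["Person"]}, "Location"]},
    "Date",
    {"HasX": ["HasTitle"]},
    {"Event": ["Missing"]},
]

def hypernym_path(nodes, target, path=()):
    """Return the labels from the root down to `target`, or None if absent."""
    for node in nodes:
        if isinstance(node, dict):
            for label, children in node.items():
                here = path + (label,)
                if label == target:
                    return list(here)
                found = hypernym_path(children, target, here)
                if found:
                    return found
        elif node == target:
            return list(path + (node,))
    return None

print(hypernym_path(taxonomy, "Person"))
print(hypernym_path(taxonomy, "Missing"))
```

Under that assumption, a rule that asks for an Entity argument would also accept a Person mention.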

Priorities for rules

Rules are applied iteratively (pass 1, pass 2, ..., pass n). If you want to control when a rule should be applied, specify a value for the rule field priority. The value can be an open or closed range, an exact value, or a list of comma-separated values. By default, a rule will continue to be executed until no rule has produced a new match (priority: 1+). This default means that you usually don't need to worry about setting the priority, but the power is there if you need it.

Note that quantifiers can be applied to priorities.
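The following sketch shows one way to read a priority value as "in which passes is this rule tried?". The exact spellings it accepts ("3", "2+", "2-4", and "[1,3]" or "1,3") are assumptions based on the description above, so consult the manual for the precise syntax. The function runs_in_pass is hypothetical and is not how Odin parses priorities internally.

```python
import re

def runs_in_pass(priority, n):
    """True if a rule with this priority spec would be tried in pass `n`.

    Handles the assumed forms "3", "2+", "2-4", and "[1,3]".
    """
    spec = priority.strip().strip("[]")
    if re.fullmatch(r"\d+", spec):            # exact value
        return n == int(spec)
    if re.fullmatch(r"\d+\+", spec):          # open range
        return n >= int(spec[:-1])
    closed = re.fullmatch(r"(\d+)\s*-\s*(\d+)", spec)
    if closed:                                # closed range
        return int(closed.group(1)) <= n <= int(closed.group(2))
    return n in {int(part) for part in spec.split(",")}  # list of values

for spec in ["1", "1+", "2-4", "[1,3]"]:
    print(spec, [n for n in range(1, 6) if runs_in_pass(spec, n)])
```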

Debugging rules

Making sense of errors

Here we describe some common errors you may encounter as you learn to write rules.

A misspelled or missing label field...

Every rule must have either a label or labels field.

This field tells Odin what the type of the Mention is that you're trying to capture.

Remember that these types can be "reused" in subsequent rules (ex. find a Person and then find events involving some Person).


In [30]:
bad_rules = """
rules: 
    - name: "person"
      type: token
      pattern: |
        [entity=PERSON]+
"""

API.odin.extract_from_document(doc=example_doc, rules=bad_rules)


OdinError: rule 'person' has no labels

rules: 
    - name: "person"
      type: token
      pattern: |
        [entity=PERSON]+

A misspelled or missing name field...

Every rule needs a name. B shur 2 spel it write two!


In [31]:
bad_rules = """
rules: 
    # we've misspelled "name"
    - nme: "person"
      label: Person
      type: token
      pattern: |
        [entity=PERSON]+
"""

API.odin.extract_from_document(doc=example_doc, rules=bad_rules)


OdinError: unnamed rule

rules: 
    # we've misspelled "name"
    - nme: "person"
      label: Person
      type: token
      pattern: |
        [entity=PERSON]+

An invalid rule type...

By default, rules are assumed to be of type dependency. If you're writing a dependency pattern, you can actually leave out the type field. Wow, talk about convenient!

If you're writing a token pattern, however, you'll need to specify type: token.


In [32]:
bad_rules = """
rules: 
    - name: "person"
      label: Person
      # we've misspelled "token"
      type: tken
      pattern: |
        [entity=PERSON]+
"""

API.odin.extract_from_document(doc=example_doc, rules=bad_rules)


OdinError: type 'tken' not recognized for rule 'person'

rules: 
    - name: "person"
      label: Person
      # we've misspelled "token"
      type: tken
      pattern: |
        [entity=PERSON]+

An invalid token field...

In the current version of Odin, you are restricted to a predefined set of token fields for use in your patterns.

See the token constraints table for a comprehensive list of valid token fields.


In [33]:
bad_rules = """
rules: 
    - name: "person"
      label: Person
      type: token
      pattern: |
        [nonexistentfield=BLARG]+
"""

API.odin.extract_from_document(doc=example_doc, rules=bad_rules)


OdinError: Error parsing rule 'person': unrecognized token field

rules: 
    - name: "person"
      label: Person
      type: token
      pattern: |
        [nonexistentfield=BLARG]+

Avoid single line patterns...


In [34]:
bad_rules = """
rules: 
    - name: "person"
      label: Person
      priority: 1+
      type: token
      pattern: [entity=PERSON]+
"""

API.odin.extract_from_document(doc=example_doc, rules=bad_rules)


OdinError: while parsing a block mapping
 in 'string', line 3, column 7:
        - name: "person"
          ^
expected <block end>, but found Scalar
 in 'string', line 7, column 31:
          pattern: [entity=PERSON]+
                                  ^


rules: 
    - name: "person"
      label: Person
      priority: 1+
      type: token
      pattern: [entity=PERSON]+

While the error message is cryptic, the solution is to simply make the pattern multiline (ex. pattern: |).

Great, but what's really happening here?

This pattern never makes it to Odin, because it fails to parse as valid YAML. | introduces a YAML block scalar (a literal multi-line string), which YAML will read without complaint and pass along to Odin.

Without the |, the YAML parser assumes that it's dealing with a list until it sees the +, which blows its mind with a wave of Cthulhu madness, upends its conception of reality, and sends it to an ashram for a period of convalescence and deep introspection.
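Since the YAML error is hard to read, a small pre-flight check can point at the offending line before the grammar is sent to the server. This is only a sketch: single_line_patterns is a hypothetical helper, it scans the text with a regex instead of parsing YAML, and it will also flag single-line patterns that happen to be valid YAML (for example, quoted ones).

```python
import re

def single_line_patterns(rules):
    """Return (line_number, text) for `pattern:` lines not using a block scalar."""
    problems = []
    for number, line in enumerate(rules.splitlines(), start=1):
        match = re.match(r"\s*pattern:\s*(.*)$", line)
        if match and not match.group(1).startswith(("|", ">")):
            problems.append((number, line.strip()))
    return problems

bad_rules = """
rules:
    - name: "person"
      label: Person
      type: token
      pattern: [entity=PERSON]+
"""

print(single_line_patterns(bad_rules))
```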

Every rule must have a unique name...

We keep track of what rule found each Mention, so rule names need to be unique to avoid ambiguities of provenance.


In [35]:
bad_rules = """
rules: 
    - name: "person"
      label: Person
      type: token
      pattern: |
        [entity=PERSON]+
        
    - name: "person"
      label: Person
      type: token
      pattern: |
        [tag=NNP]+
"""

API.odin.extract_from_document(doc=example_doc, rules=bad_rules)


OdinError: rule name 'person' is not unique

rules: 
    - name: "person"
      label: Person
      type: token
      pattern: |
        [entity=PERSON]+
        
    - name: "person"
      label: Person
      type: token
      pattern: |
        [tag=NNP]+
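In a long grammar, duplicate names are easy to introduce by copy and paste. The sketch below lists repeated names before you hit the error. It assumes each rule starts with a line of the form - name: "..." (as in every grammar in this tutorial), and duplicate_rule_names is a hypothetical helper that uses a regex scan, not a YAML parser.

```python
import re
from collections import Counter

def duplicate_rule_names(rules):
    """Return the rule names that appear more than once in a rules string."""
    names = re.findall(
        r"""^\s*-\s*name:\s*["']?([^"'\n]+?)["']?\s*$""",
        rules,
        flags=re.MULTILINE,
    )
    return sorted(name for name, count in Counter(names).items() if count > 1)

bad_rules = """
rules:
    - name: "person"
      label: Person
      type: token
      pattern: |
        [entity=PERSON]+

    - name: "person"
      label: Person
      type: token
      pattern: |
        [tag=NNP]+
"""

print(duplicate_rule_names(bad_rules))
```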

Ok, I've had enough. Give me my memory back, you animal!

Run the line below to shut down the NLP server.


In [36]:
#API.stop_server()