In [1]:

    
from IPython.display import HTML
HTML("""<style>{}</style>""".format(open("assets/css/custom.css").read()))









    Out[1]:



In [2]:

    
import processors
print(processors.__version__)



In [3]:

    
from processors import *
from processors.visualization import JupyterVisualizer as viz

API = ProcessorsAPI(port=8881, keep_alive=True)









    



INFO - Starting processors-server (java -Xmx3G -cp /Users/gus/anaconda3/lib/python3.5/site-packages/processors/processors-server.jar NLPServer --port 8881 --host localhost) ...






    



Waiting for server...
[============                                                ]

Connection with processors-server established (http://localhost:8881)

Rule-based information extraction with Odin

This tutorial provides an introduction to Odin, a domain-independent rule-based system for information extraction.

Why Odin?

- Supports patterns over directed graphs, such as syntactic dependency parses
  - Good generalizability
- Supports patterns over sequences of tokens and their attributes
- Supports rule templates and variables
- It was designed to be domain independent
- Rules can be scaffolded and applied in cascades
  - The output of one rule can be the input to another rule
- Odin is open source, under active development, and it even has a manual
- You can use it natively from within the JVM (it was written in Scala) or in Python using a client-server architecture
- Rules are written using YAML and familiar constructs

Useful resources for learning Odin

Projects using Odin

Reach, a machine reading system for biomedical publications developed for DARPA's Big Mechanism program
The Bill and Melinda Gates Foundation's Healthy Birth, Growth, and Development Knowledge Initiative (HBGDKi)
A seedling project for DARPA's World Modelers program

Prerequisites

This tutorial assumes that you have some familiarity with regular expressions.

Introduction

Odin operates over documents which have been tokenized, sentence-segmented, parsed, and annotated via an NLP pipeline for part-of-speech (PoS) tags, lemmas, and named entities.

Rules matched against these annotated documents produce mentions of entities, relations, or events that can then be reused to write more complicated rules (ex. entities $\rightarrow$ events $\rightarrow$ events involving other events).

These rules are written in a simple subset of YAML and can describe sequences of tokens or traversals over a syntactic dependency parse. Luckily, you don't need to be an expert in YAML in order to write Odin rules.

All rules have the following fields:

| Field | Description |
|---|---|
| `name` | The name of the specific rule. When a rule matches, the match (Mention) stores the value of this field in its `.foundBy` attribute. |
| `label` | What a rule's match represents (`Person`, `Location`, `Phosphorylation`, etc.). |
| `type` | The rule type. There are currently two primary types: `token` and `dependency`. `token` refers to a surface pattern over a sequence of tokens. `dependency` refers to a pattern over a graph (a syntactic dependency parse). |
| `pattern` | The pattern to match, specified as a multi-line string introduced by the vertical bar character (`\|`). |

Notes on `YAML`

It's useful to keep in mind that YAML strings don’t have to be quoted. This is a nice feature that allows one to write shorter and cleaner rules. However, there is one exception that you should be aware of: strings that start with a YAML indicator character must be quoted. Indicator characters have special semantics and must be quoted if they should be interpreted as part of a string. These are all the valid YAML indicator characters:

- ? : , [ ] { } # & * ! | > ' " % @ `

As you can probably tell, these are not characters that occur frequently in practice. Usually names and labels are composed of alphanumeric characters and the occasional underscore, so, most of the time, you can get away without quoting strings.
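
As a rough aid, the quoting rule above can be turned into a small check. The sketch below is plain Python and independent of Odin. The helper names `needs_quoting` and `quote_if_needed` are hypothetical, and the check applies the conservative rule stated above (quote whenever the first character is an indicator) rather than the full YAML grammar.

```python
# The YAML indicator characters listed above, written with ASCII quotes.
YAML_INDICATORS = set("-?:,[]{}#&*!|>'\"%@`")


def needs_quoting(value):
    """Return True if `value` starts with a YAML indicator character."""
    return bool(value) and value[0] in YAML_INDICATORS


def quote_if_needed(value):
    """Wrap `value` in double quotes only when its first character requires it."""
    if needs_quoting(value):
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return '"{}"'.format(escaped)
    return value


# Typical rule names and labels need no quotes; the last two candidates do.
for candidate in ["ner_person", "Person", "@special", "*starred"]:
    print(candidate, "->", quote_if_needed(candidate))
```

Names made of alphanumeric characters and underscores pass through unchanged, which is why most rules in this tutorial could be written without quotes.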

Outcome

By the end of this tutorial, you will understand how to interpret and modify the following grammar:



In [4]:

    
rules = """

taxonomy:
  - Entity:
    - ProperNoun
    - Organization
    - PossiblePerson:
      - Person
    - Location
  - Date
  - HasX:
    - HasTitle
  - Event:
    - Missing

rules:
  - name: "ner-location"
    label: Location
    priority: 1
    type: token
    pattern: |
      [entity="LOCATION"]+ | 
      Twin Peaks

  - name: "ner-person"
    label: Person
    priority: 1
    type: token
    pattern: |
     [entity="PERSON"]+

  - name: "ner-org"
    label: Organization
    priority: 1
    type: token
    pattern: |
      [entity="ORGANIZATION"]+

  - name: "ner-date"
    label: Date
    priority: 1
    type: token
    pattern: |
      [entity="DATE"]+

  - name: "proper-noun"
    label: ProperNoun
    priority: 2
    type: token
    pattern: |
      [word=/^[A-Z]/ & tag=/^(JJ|NN)/ & !mention=Person]+ |
      [tag=/^NNP/]+

  - name: "has-title"
    label: HasTitle
    pattern: |
      person: Person
      title: ProperNoun = nn [!mention=Person]

  - name: "missing"
    label: Missing
    pattern: |
      trigger = [lemma=go] missing
      theme: Person = <xcomp nsubj
      date: Date? = prep_on
"""

mentions = API.odin.extract_from_text("FBI Special Agent Dale Cooper went missing on June 10, 1991.  He was last seen in the woods of Twin Peaks. ", rules=rules)
for m in mentions: viz.display_mention(m)

viz.display_graph(mentions[-1].sentenceObj)









    




He was last seen in the woods of _LocationTwin Peaks .






    




He was last seen in the woods of _ProperNounTwin Peaks .






    




_ProperNounFBI Special Agent Dale Cooper went missing on June 10 , 1991 .






    




_HasTitle_ProperNounFBI Special Agent^title _PersonDale Cooper^person went missing on June 10 , 1991 .






    




_OrganizationFBI Special Agent Dale Cooper went missing on June 10 , 1991 .






    




FBI Special Agent Dale Cooper went missing on _DateJune 10 , 1991 .






    




FBI Special Agent _Missing_PersonDale Cooper^theme went missing^TRIGGER on _DateJune 10 , 1991^date .






    




FBI Special Agent _PersonDale Cooper went missing on June 10 , 1991 .






    




FBI Special Agent _ProperNounDale Cooper went missing on June 10 , 1991 .






    




FBI Special Agent Dale Cooper went missing on _ProperNounJune 10 , 1991 .

Capturing entities

Before we can write rules to identify relations and events, we must first identify their participants. We'll refer to these participants as entities.

Consider the following sentence describing a marriage:

Julia-Louis Dreyfus and Brad Hall were married in June of 1987.



In [5]:

    
example_doc = API.annotate("Julia-Louis Dreyfus and Brad Hall were married in June of 1987.")

from processors.visualization import JupyterVisualizer as viz

viz.display_graph(example_doc.sentences[0])

Capturing entities with surface patterns

A surface pattern is a rule written in terms of a sequence of tokens. The simplest surface pattern is just a sequence of words. For example, the rule below will match the sequence Special Agent and label it as a JobTitle.



In [6]:

    
rules = """
rules:
  - name: "job-title"
    label: JobTitle
    type: token
    pattern: |
      Special Agent
"""

tp_doc = API.annotate("FBI Special Agent Dale Cooper went missing on June 10, 1991")

mentions = API.odin.extract_from_document(tp_doc, rules)
for m in mentions: viz.display_mention(m)









    




FBI _JobTitleSpecial Agent Dale Cooper went missing on June 10 , 1991

Of course, as we'll see, rules can get much more sophisticated than this. For example, Odin allows you to write patterns over combinations of token attributes (see the token constraints section for more details).

Reusing mentions from an earlier rule

Much of the power of Odin comes from its ability to scaffold rules. The output of one rule can be referenced by its label in subsequent rules. This allows us to write compact, powerful grammars. In a surface pattern, this is done using the syntax @MyLabelHere where MyLabelHere refers to whatever label you wish to reference. We'll apply this syntax in the example below where we'll build another rule off of the output of JobTitle...



In [7]:

    
rules = """
rules:
  - name: "job-title"
    label: JobTitle
    # This rule runs in the first pass
    # of Odin and never again
    priority: 1
    type: token
    pattern: |
      Special Agent
      
  - name: "expanded-title"
    label: JobTitle
    priority: 2
    type: token
    pattern: |
      FBI @JobTitle
"""

tp_doc = API.annotate("FBI Special Agent Dale Cooper went missing on June 10, 1991")

mentions = API.odin.extract_from_document(tp_doc, rules)
for m in mentions:
    print('"{}"'.format(m.foundBy))
    viz.display_mention(m)









    



"job-title"






    




FBI _JobTitleSpecial Agent Dale Cooper went missing on June 10 , 1991






    



"expanded-title"






    




_JobTitleFBI Special Agent Dale Cooper went missing on June 10 , 1991

Note that we could omit the explicit priority from our second rule, "expanded-title", as it won't successfully match until an @JobTitle is available to reuse. Limiting the priority here is an efficiency decision: Odin won't even attempt to match the rule until told to do so.
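
To make the idea of a cascade concrete, here is a toy simulation in plain Python. It is only a sketch of the two passes described above, not how Odin is implemented: the helper `find_sequence` and the span representation (start and end token indices) are assumptions made for illustration.

```python
tokens = "FBI Special Agent Dale Cooper went missing on June 10 , 1991".split()


def find_sequence(tokens, words):
    """Return (start, end) spans where `words` occurs as a contiguous token sequence."""
    n = len(words)
    return [(i, i + n) for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words]


# Pass 1: the "job-title" rule matches the literal sequence "Special Agent".
job_titles = find_sequence(tokens, ["Special", "Agent"])

# Pass 2: "FBI @JobTitle" needs the token "FBI" immediately before
# a JobTitle span that an earlier pass has already produced.
expanded = [
    (start - 1, end)
    for (start, end) in job_titles
    if start > 0 and tokens[start - 1] == "FBI"
]

for start, end in job_titles + expanded:
    print(" ".join(tokens[start:end]))
```

The second pass has nothing to work with until the first pass has produced its spans, which mirrors why "expanded-title" cannot match before a JobTitle mention exists.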

Challenge: another agent

Complete the grammar below to match Special Agent Fox Mulder. Note that you only need to change the rule "you-complete-me".



In [8]:

    
challenge_text = "FBI Special Agent Fox Mulder wants to believe you're gokking mention reuse."

challenge_rules = """
rules:
  - name: "job-title"
    label: JobTitle
    type: token
    pattern: |
      Special Agent
      
  - name: "org"
    label: Organization
    type: token
    # Don't worry if this rule doesn't make sense to you just yet.
    # This pattern is a peek ahead.  Feel free to rewrite it in a form that is familiar.
    pattern: |
      [entity=ORGANIZATION]+
      
  - name: "star-fox"
    label: Person
    type: token
    pattern: |
      [entity=PERSON]+

  - name: "you-complete-me"
    label: ReallySpecialGuy
    type: token
    pattern: |
      ""
"""

challenge_doc = API.annotate(challenge_text)

mentions = API.odin.extract_from_document(challenge_doc, challenge_rules)
for m in mentions:
    print('"{}"'.format(m.foundBy))
    viz.display_mention(m)









    



"job-title"






    




FBI _JobTitleSpecial Agent Fox Mulder wants to believe you 're gokking mention reuse .






    



"org"






    




_OrganizationFBI Special Agent Fox Mulder wants to believe you 're gokking mention reuse .






    



"star-fox"






    




FBI Special Agent _PersonFox Mulder wants to believe you 're gokking mention reuse .

Challenge: chunking text (part 1)

Write a rule set that captures this simple phrase structure grammar for linguistic constituents:

Verb        ->  (identify by PoS tag of terminals)
Noun        ->  (identify by PoS tag of terminals)
Adjective   ->  (identify by PoS tag of terminals)
NP          ->  determiner (by tag) + zero or more Adjective + one or more Noun



In [9]:

    
challenge_psg_text = """
The black dog runs at night. 
Out of nowhere, the mind comes forth.
"""

psg_rules = """
rules: 
    - name: "verb"
      label: Verb
      type: token
      pattern: ???
    
    - name: "noun"
      label: Noun
      type: token
      pattern: ???
      
    - name: "adjective"
      label: Adjective
      type: token
      pattern: ???  

    - name: "noun-phrase"
      label: NP
      type: token
      pattern: ???  
"""

challenge_doc = API.annotate(challenge_psg_text)

# mentions = API.odin.extract_from_document(doc=challenge_doc, rules=psg_rules)
# for m in mentions: viz.display_mention(m)

Challenge: chunking text (part 2)

Modify the PSG rules provided in the previous challenge to include one for a verb phrase (VP). Extend your grammar to cover your VP additions.

Token constraints

| Field | Description |
|---|---|
| `word` | The actual token. |
| `lemma` | The lemma form of the token. |
| `tag` | The part-of-speech (PoS) tag assigned to the token. |
| `incoming` | Incoming relations from the dependency graph for the token. |
| `outgoing` | Outgoing relations from the dependency graph for the token. |
| `chunk` | The shallow constituent type (ex. NP, VP) immediately containing the token. |
| `entity` | The NER label of the token. |
| `mention` | The label of any Mention(s) (i.e., rule output) that contains the token. |

For more information on PoS tags (`tag` in the table above), see https://www.eecis.udel.edu/~vijay/cis889/ie/pos-set.pdf

Disjunctions

Disjunctions are specified using |. Imagine we want to find all adjectives and adverbs in the following snippet from W.H. Auden:

the expensive delicate ship that must have seen
something amazing, a boy falling out of the sky,
had somewhere to get to and sailed calmly on.



In [10]:

    
text = """
the expensive delicate ship that must have seen  
something amazing, a boy falling out of the sky,  
had somewhere to get to and sailed calmly on.
"""

example_doc2 = API.annotate(text)

rules_v1 = """
rules: 
    - name: "disjunction"
      label: Example
      type: token
      pattern: |
        [tag=RB] | [tag=JJ]   
"""

mentions = API.odin.extract_from_document(example_doc2, rules_v1)
for m in mentions: viz.display_mention(m)









    




the expensive _Exampledelicate ship that must have seen something amazing , a boy falling out of the sky , had somewhere to get to and sailed calmly on .






    




the expensive delicate ship that must have seen something amazing , a boy falling out of the sky , had _Examplesomewhere to get to and sailed calmly on .






    




the expensive delicate ship that must have seen something _Exampleamazing , a boy falling out of the sky , had somewhere to get to and sailed calmly on .






    




the _Exampleexpensive delicate ship that must have seen something amazing , a boy falling out of the sky , had somewhere to get to and sailed calmly on .






    




the expensive delicate ship that must have seen something amazing , a boy falling out of the sky , had somewhere to get to and sailed _Examplecalmly on .



In [11]:

    
rules_v2 = """
rules: 
    - name: "disjunction"
      label: Example
      type: token
      pattern: |
        # if it's easier to read 
        # we can split the disjunction
        # onto two lines
        [tag=RB] | 
        [tag=JJ]   
"""

mentions = API.odin.extract_from_document(example_doc2, rules_v2)
for m in mentions: viz.display_mention(m)









    




the expensive _Exampledelicate ship that must have seen something amazing , a boy falling out of the sky , had somewhere to get to and sailed calmly on .






    




the expensive delicate ship that must have seen something amazing , a boy falling out of the sky , had _Examplesomewhere to get to and sailed calmly on .






    




the expensive delicate ship that must have seen something _Exampleamazing , a boy falling out of the sky , had somewhere to get to and sailed calmly on .






    




the _Exampleexpensive delicate ship that must have seen something amazing , a boy falling out of the sky , had somewhere to get to and sailed calmly on .






    




the expensive delicate ship that must have seen something amazing , a boy falling out of the sky , had somewhere to get to and sailed _Examplecalmly on .

You can blame the inclusion of this instance of somewhere on the PoS tagger.

Exact or regex

Patterns may involve an exact string or use regular expressions (Java-flavored). Imagine we want to identify all syntactic subjects in the following text:

Hamlet killed Claudius. Rosencrantz and Guildenstern were executed.



In [12]:

    
text = "Hamlet killed Claudius.  Rosencrantz and Guildenstern were executed."

example_doc2 = API.annotate(text)

# let's look at the syntactic dependency parse for each sentence
for s in example_doc2.sentences: viz.display_graph(s, distance=150)



In [13]:

    
rules_v1 = """
rules: 
    - name: "example-1"
      label: Subject
      type: token
      pattern: |
        # a disjunction of two exact strings 
        # denoting either a passive or active subject
        [incoming=nsubjpass] | [incoming=nsubj]   
"""

mentions = API.odin.extract_from_document(example_doc2, rules_v1)
for m in mentions: viz.display_mention(m)









    




_SubjectRosencrantz and Guildenstern were executed .






    




Rosencrantz and _SubjectGuildenstern were executed .






    




_SubjectHamlet killed Claudius .



In [14]:

    
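rules_v2 = """
rules: 
    - name: "example-1"
      label: Subject
      # a token pattern must declare its type explicitly
      type: token
      pattern: |
        # a regex that will match
        # both passive and active subjects
        [incoming=/^nsubj/]        
"""

# apply the regex version of the rule (rules_v2, not rules_v1)
mentions = API.odin.extract_from_document(example_doc2, rules_v2)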
for m in mentions: viz.display_mention(m)









    




_SubjectRosencrantz and Guildenstern were executed .






    




Rosencrantz and _SubjectGuildenstern were executed .






    




_SubjectHamlet killed Claudius .

Case-insensitive patterns

Patterns can be made case-insensitive by beginning a regex with (?i), as in /(?i)guys/.



In [15]:

    
text = """HEY YOU GUYS! """

example_doc2 = API.annotate(text)

insensitive_rules = """
rules: 
    - name: "insensitive"
      label: Example
      type: token
      pattern: |
        # if we don't use [],
        # Odin assumes the pattern is in terms
        # of the token's word attribute
        /(?i)guys/      
"""

mentions = API.odin.extract_from_document(example_doc2, insensitive_rules)
for m in mentions: viz.display_mention(m)









    




HEY YOU _ExampleGUYS !
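
The `(?i)` flag is also understood by Python's `re` module, so its effect can be tried without a running server. The sketch below is only an analogy to the rule above. It uses `re.search` on the assumption that a regex constraint may match anywhere inside a token unless anchored, as the `^` and `$` anchors in other patterns in this tutorial suggest.

```python
import re

words = ["HEY", "YOU", "GUYS", "!"]

case_sensitive = re.compile(r"guys")
case_insensitive = re.compile(r"(?i)guys")

# Test each token's word form against both versions of the pattern.
sensitive_hits = [w for w in words if case_sensitive.search(w)]
insensitive_hits = [w for w in words if case_insensitive.search(w)]

print("without (?i):", sensitive_hits)
print("with (?i):", insensitive_hits)
```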

Combining token constraints

Token constraints can be combined using &. Imagine we only want to capture nouns ending in tle in the following sentence:

It's not prattle when I warn a gentle touch is needed with the glass menagerie on the mantle.



In [16]:

    
text = """It's not prattle when I warn a gentle touch is needed with the glass menagerie on the mantle."""

example_doc2 = API.annotate(text)

combined_rules = """
rules: 
    - name: "combined"
      label: Example
      type: token
      pattern: |
        [tag=/^NN/ & word=/tle$/]  
"""

mentions = API.odin.extract_from_document(example_doc2, combined_rules)
for m in mentions: viz.display_mention(m)









    




It 's not prattle when I warn a gentle touch is needed with the glass menagerie on the _Examplemantle .

Challenge: identifying words containing a certain morpheme

Write a rule that identifies nouns containing the derivational suffix er in teacher, buyer, actor, doctor, etc., while avoiding the homophonous inflectional morpheme er in calmer, bigger, etc.
Test it with a few sentences



In [17]:

    
challenge_text = """
???
"""

morpheme_rules = """
rules: 
    - name: "er-deriv-suffix"
      label: HasDerivSuffix
      type: token
      pattern: ??? 
"""

# challenge_doc = API.annotate(challenge_text)

# mentions = API.odin.extract_from_document(challenge_doc, morpheme_rules)
# for m in mentions: viz.display_mention(m)

Negating token constraints

Token constraints can be negated by prefacing the attribute name with ! (see example below):

[!fieldname=pattern]
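
One way to think about a negated constraint is as a predicate over a token's attributes whose result is flipped. The toy evaluator below is a plain-Python sketch, not Odin's implementation: the token dictionaries, their values, and the `constraint` helper are all hypothetical.

```python
import re

# Hypothetical token annotations; real values would come from the NLP pipeline.
tokens = [
    {"word": "Dale", "tag": "NNP", "entity": "PERSON"},
    {"word": "Cooper", "tag": "NNP", "entity": "PERSON"},
    {"word": "visited", "tag": "VBD", "entity": "O"},
    {"word": "Seattle", "tag": "NNP", "entity": "LOCATION"},
]


def constraint(field, pattern, negated=False):
    """Build a predicate for [field=/pattern/] or, if negated, [!field=/pattern/]."""
    regex = re.compile(pattern)

    def check(token):
        matched = regex.search(token[field]) is not None
        return not matched if negated else matched

    return check


# Roughly analogous to the token constraint [!entity=PERSON]
not_person = constraint("entity", r"^PERSON$", negated=True)
not_person_words = [t["word"] for t in tokens if not_person(t)]
print(not_person_words)
```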

Challenge: no verbs!

Using a single token constraint, match all tokens in the following sentence that are not verbs:

If you wish to make an apple pie from scratch, you must first invent the universe.



In [18]:

    
text = "If you wish to make an apple pie from scratch, you must first invent the universe."

d = API.annotate(text)

viz.display_graph(d.sentences[0])

challenge_rules = """
rules:
    - name: "no-verbs"
      label: NotVerb
      # req. 1: This pattern should involve a single token constraint
      # req. 2: The token constraint should use a negated pattern
      pattern: | ???
"""

# mentions = API.odin.extract_from_document(d, challenge_rules)
# for m in mentions: print(m)

Wildcard

Sometimes any token will suffice to complete a pattern. In such cases where token constraints are unnecessary, the [] wildcard can be used.

Example pattern: [] people

Example matches
- I see dead people
- All the lonely people
- They are a strange people

Quantifiers

Token constraints, arguments, and graph edges can all be quantified.

| Symbol | Description | Lazy form |
|---|---|---|
| `?` | The quantified pattern is optional. | `??` |
| `*` | Repeat the quantified pattern zero or more times. | `*?` |
| `+` | Repeat the quantified pattern one or more times. | `+?` |
| `{n}` | Exact repetition. Repeat the quantified pattern n times. | |
| `{n,m}` | Ranged repetition. Repeat the quantified pattern between n and m times, where n < m. | `{n,m}?` |
| `{,m}` | Open start ranged repetition. Repeat the quantified pattern between 0 and m times, where m > 0. | `{,m}?` |
| `{n,}` | Open end ranged repetition. Repeat the quantified pattern at least n times, where n > 0. | `{n,}?` |
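
If the difference between greedy and lazy forms is unfamiliar, it can be explored with Python's `re` module. This is an analogy only: Python quantifies characters, whereas Odin quantifies token constraints, arguments, and graph edges. In the sketch below each "token" is simulated as a word followed by a space.

```python
import re

text = "very very very good"

# Greedy forms consume as many repetitions as possible.
greedy_plus = re.match(r"(?:very )+", text).group()
greedy_range = re.match(r"(?:very ){1,2}", text).group()

# Lazy forms stop as soon as the overall pattern can succeed.
lazy_plus = re.match(r"(?:very )+?", text).group()
lazy_range = re.match(r"(?:very ){1,2}?", text).group()

print(repr(greedy_plus), repr(lazy_plus))
print(repr(greedy_range), repr(lazy_range))
```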

Lookarounds and other zero-width assertions

Odin supports lookaround assertions, as well as start/end sentence anchors. You can use lookarounds to specify contextual constraints that you don't want to end up in your result (ex. "only match B if it's preceded by A").

| Symbol | Description | Example pattern | Match (in bold) |
|---|---|---|---|
| `^` | beginning of sentence | `^ My` | **My** name is Inigo Montoya . |
| `$` | end of sentence | `"." $` | My name is Inigo Montoya **.** |
| `(?=...)` | positive lookahead | `Inigo (?= Montoya)` | My name is **Inigo** Montoya . |
| `(?!...)` | negative lookahead | `Inigo (?! Arocena)` | My name is **Inigo** Montoya . |
| `(?<=...)` | positive lookbehind | `(?<= Inigo) Montoya` | My name is Inigo **Montoya** . |
| `(?<!...)` | negative lookbehind | `(?<! Carlos) Montoya` | My name is Inigo **Montoya** . |
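
The same four lookarounds exist in Python's `re` module, which makes it easy to check that the context is tested but not included in the match. The sketch below works over characters rather than tokens, so it is an analogy to the table above and not Odin syntax.

```python
import re

sentence = "My name is Inigo Montoya ."

lookahead = re.search(r"Inigo(?= Montoya)", sentence)
neg_lookahead = re.search(r"Inigo(?! Arocena)", sentence)
lookbehind = re.search(r"(?<=Inigo )Montoya", sentence)
neg_lookbehind = re.search(r"(?<!Carlos )Montoya", sentence)

# Each match covers only the word itself; the surrounding context is not captured.
for m in (lookahead, neg_lookahead, lookbehind, neg_lookbehind):
    print(m.group(), m.span())

# A lookaround that fails blocks the match entirely.
blocked = re.search(r"(?<!Inigo )Montoya", sentence)
print(blocked)
```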

Refining rules: an example

Rule writing can be an incremental process of refinement. Sometimes it's a matter of adding conjunctions to further constrain a match, or disjunctions to relax it. Other times, as demonstrated below, it comes down to picking the appropriate representation/attribute for a token...

The naive rule below tries to label any sequence of proper nouns as a Person mention. As you can see, this is too general. You can probably think of other spurious stuff that this would match, right?



In [19]:

    
entity_rule_v1 = """
rules: 
    - name: "person"
      label: Person
      priority: 1
      type: token
      pattern: |
        [tag=NNP]+
"""



In [20]:

    
mentions = API.odin.extract_from_document(doc=example_doc, rules=entity_rule_v1)
for m in mentions: viz.display_mention(m)









    




_PersonJulia-Louis Dreyfus and Brad Hall were married in June of 1987 .






    




Julia-Louis Dreyfus and _PersonBrad Hall were married in June of 1987 .






    




Julia-Louis Dreyfus and Brad Hall were married in _PersonJune of 1987 .

Let's see if we can do better. It turns out we're lucky, as the model used by the named entity recognizer (NER) built into our NLP pipeline has been trained to detect that label. Let's take a look...



In [21]:

    
entity_rule_v2 = """
rules:
    - name: "person"
      label: Person
      priority: 1
      type: token
      pattern: |
        [entity=PERSON]+
"""



In [22]:

    
mentions = API.odin.extract_from_document(doc=example_doc, rules=entity_rule_v2)
for m in mentions: viz.display_mention(m)









    




Julia-Louis Dreyfus and _PersonBrad Hall were married in June of 1987 .






    




_PersonJulia-Louis Dreyfus and Brad Hall were married in June of 1987 .

Capturing events and relations

TODO

Capturing events and relations with surface patterns

See the relevant section in the manual

Capturing events and relations with dependency patterns

See the relevant section in the manual

For a description of the dependency relations used by default in Odin, see the collapsed dependencies described in https://nlp.stanford.edu/software/dependencies_manual.pdf



In [23]:

    
rules_v1 = """
rules: 
    - name: "person"
      label: Person
      type: token
      pattern: |
        [entity=PERSON]+

    - name: "marriage-event"
      label: Marriage
      pattern: |
        trigger = [lemma=marry]
        spouse: Person = nsubjpass
"""



In [24]:

    
mentions = API.odin.extract_from_document(doc=example_doc, rules=rules_v1)
for m in mentions: 
    if m.matches("Marriage"):
        viz.display_mention(m)









    




_Marriage_PersonJulia-Louis Dreyfus^spouse and Brad Hall were married^TRIGGER in June of 1987 .






    




Julia-Louis Dreyfus and _Marriage_PersonBrad Hall^spouse were married^TRIGGER in June of 1987 .

We end up with two Marriage event mentions, each containing only one spouse. Wouldn't it be great if we had a way to specify how many of each argument were required for a single mention?

Quantifiers for dependency patterns

We know it takes two to tango, so let's try to get those arguments in the same mention.



In [25]:

    
rules_v2 = """
rules: 
    - name: "person"
      label: Person
      type: token
      pattern: |
        [entity=PERSON]+

    - name: "marriage-event"
      label: Marriage
      pattern: |
        trigger = [lemma=marry]
        spouse: Person+ = nsubjpass
"""



In [26]:

    
mentions = API.odin.extract_from_document(doc=example_doc, rules=rules_v2)
for m in mentions:
    if m.matches("Marriage"):
        viz.display_mention(m)









    




_Marriage_PersonJulia-Louis Dreyfus^spouse and _PersonBrad Hall^spouse were married^TRIGGER in June of 1987 .

We can even specify an exact number for each argument.



In [27]:

    
rules_v3 = """
rules: 
    - name: "person"
      label: Person
      type: token
      pattern: |
        [entity=PERSON]+

    - name: "marriage-event"
      label: Marriage
      pattern: |
        trigger = [lemma=marry]
        spouse: Person{2} = nsubjpass
"""

mentions = API.odin.extract_from_document(doc=example_doc, rules=rules_v3)
for m in mentions:
    if m.matches("Marriage"):
        viz.display_mention(m)









    




_Marriage_PersonJulia-Louis Dreyfus^spouse and _PersonBrad Hall^spouse were married^TRIGGER in June of 1987 .

Challenge: no more than four!

Imagine a polyandrous society where a woman can have at most four husbands.

In a parallel universe, Marge married Homer, Ned Flanders, and Troy McClure.

Complete the grammar rule set below to satisfy the conditions specified in the challenge.



In [28]:

    
text = "In a parallel universe, Marge married Homer, Ned Flanders, and Troy McClure."

d = API.annotate(text)

viz.display_graph(d.sentences[0], css=viz.parse_css)

challenge_rules = """
rules: 
    - name: "person"
      label: Person
      type: token
      pattern: |
        [entity=PERSON]+

    - name: "marriage-event"
      label: Marriage
      pattern: ???
"""

#mentions = API.odin.extract_from_document(doc=d, rules=challenge_rules)
#for m in mentions: print(m)

Challenge: optional arguments

Modify the grammar below to include two optional arguments in the Marriage event: "date" of type Date and "location" of type Location. Remember that you'll need additional rules to capture Date and Location in order for them to be available to the event rule.



In [29]:

    
text = "Gonzo and Camilla were married in October.  Barack and Michelle were married in Chicago."
d = API.annotate(text)


challenge_rules = """
rules: 
    - name: "person"
      label: Person
      type: token
      pattern: |
        [entity=PERSON]+

    # TODO: add a rule for Date
    
    # TODO: add a rule for Location
    
    # TODO: add optional args to "marriage-event"
    - name: "marriage-event"
      label: Marriage
      pattern: |
        trigger = [lemma=marry]
        spouse: Person{2} = nsubjpass
"""

mentions = API.odin.extract_from_document(doc=d, rules=challenge_rules)
for m in mentions:
    if m.matches("Marriage"):
        viz.display_mention(m)









    




_Marriage_PersonBarack^spouse and _PersonMichelle^spouse were married^TRIGGER in Chicago .

Quantifiers in graph traversals

See the relevant section in the manual

Variables and rule templates

It can be tedious to write sets of rules by hand. Often you'll see that components of rules can or should be reused in subsets of your grammar. Odin supports the use of variables and templates to address exactly this. Variables and templates help to maintain large grammars and create rule sets that can be "recycled" or applied to related problems with a few tweaks.

For more details, see the relevant section in the manual

Templates work via file imports. For more complex cases of template use involving multiple files, see the odin examples sbt project or Reach.

Defining a taxonomy

See the relevant section in the manual

Priorities for rules

Rules are applied iteratively (pass 1, pass 2, ..., pass n). If you want to control when a rule is applied, specify a value for the rule's priority field. The value can be an exact value, an open or closed range, or a list of comma-separated values. By default, a rule continues to be executed until no rule produces a new match (priority: 1+). This default means that you usually don't need to worry about setting the priority, but the power is there if you need it.

Note that quantifiers can be applied to priorities.
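
To illustrate how the different kinds of priority value relate to passes, here is a small plain-Python sketch. The function `applies_in_pass` is hypothetical, and the written forms it accepts for ranges and lists ("2-4", "1,3,5", "[1,3,5]") are assumptions made for illustration; only the "1+" form appears in this tutorial, so check the manual for the exact syntax Odin expects.

```python
import re


def applies_in_pass(priority, pass_number):
    """Return True if a rule with this priority spec would run in `pass_number`.

    Assumed forms: "3" (exact), "2+" (open range), "2-4" (closed range),
    and "1,3,5" or "[1,3,5]" (list of passes).
    """
    spec = str(priority).strip()
    if re.fullmatch(r"\d+", spec):
        return pass_number == int(spec)
    m = re.fullmatch(r"(\d+)\s*\+", spec)
    if m:
        return pass_number >= int(m.group(1))
    m = re.fullmatch(r"(\d+)\s*-\s*(\d+)", spec)
    if m:
        return int(m.group(1)) <= pass_number <= int(m.group(2))
    m = re.fullmatch(r"\[?\s*(\d+(?:\s*,\s*\d+)*)\s*\]?", spec)
    if m:
        return pass_number in {int(n) for n in m.group(1).split(",")}
    raise ValueError("unrecognized priority: {!r}".format(priority))


# Show which of the first five passes each spec would take part in.
for spec in ["1", "1+", "2-4", "[1,3,5]"]:
    passes = [p for p in range(1, 6) if applies_in_pass(spec, p)]
    print(spec, "->", passes)
```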

Debugging rules

Making sense of errors

Here we describe some common errors you may encounter as you learn to write rules.

A misspelled or missing `label` field...

Every rule must have either a label or labels field.

This field tells Odin the type of the Mention that you're trying to capture.

Remember that these types can be "reused" in subsequent rules (ex. find a Person and then find events involving some Person).



In [30]:

    
bad_rules = """
rules: 
    - name: "person"
      type: token
      pattern: |
        [entity=PERSON]+
"""

API.odin.extract_from_document(doc=example_doc, rules=bad_rules)









    



OdinError: rule 'person' has no labels

rules: 
    - name: "person"
      type: token
      pattern: |
        [entity=PERSON]+

A misspelled or missing `name` field...

Every rule needs a name. B shur 2 spel it write two!



In [31]:

    
bad_rules = """
rules: 
    # we've misspelled "name"
    - nme: "person"
      label: Person
      type: token
      pattern: |
        [entity=PERSON]+
"""

API.odin.extract_from_document(doc=example_doc, rules=bad_rules)









    



OdinError: unnamed rule

rules: 
    # we've misspelled "name"
    - nme: "person"
      label: Person
      type: token
      pattern: |
        [entity=PERSON]+

An invalid rule `type`...

By default, rules are assumed to be of type dependency. If you're writing a dependency pattern, you can actually leave out the type field. Wow, talk about convenient!

If you're writing a token pattern, however, you'll need to specify type: token.



In [32]:

    
bad_rules = """
rules: 
    - name: "person"
      label: Person
      # we've misspelled "token"
      type: tken
      pattern: |
        [entity=PERSON]+
"""

API.odin.extract_from_document(doc=example_doc, rules=bad_rules)









    



OdinError: type 'tken' not recognized for rule 'person'

rules: 
    - name: "person"
      label: Person
      # we've misspelled "token"
      type: tken
      pattern: |
        [entity=PERSON]+

An invalid token `field`...

In the current version of Odin, you are restricted to a predefined set of token fields for use in your patterns.

See the token constraints table for a comprehensive list of valid token fields.



In [33]:

    
bad_rules = """
rules: 
    - name: "person"
      label: Person
      type: token
      pattern: |
        [nonexistentfield=BLARG]+
"""

API.odin.extract_from_document(doc=example_doc, rules=bad_rules)









    



OdinError: Error parsing rule 'person': unrecognized token field

rules: 
    - name: "person"
      label: Person
      type: token
      pattern: |
        [nonexistentfield=BLARG]+

Avoid single line patterns...



In [34]:

    
bad_rules = """
rules: 
    - name: "person"
      label: Person
      priority: 1+
      type: token
      pattern: [entity=PERSON]+
"""

API.odin.extract_from_document(doc=example_doc, rules=bad_rules)









    



OdinError: while parsing a block mapping
 in 'string', line 3, column 7:
        - name: "person"
          ^
expected <block end>, but found Scalar
 in 'string', line 7, column 31:
          pattern: [entity=PERSON]+
                                  ^


rules: 
    - name: "person"
      label: Person
      priority: 1+
      type: token
      pattern: [entity=PERSON]+

While the error message is cryptic, the solution is to simply make the pattern multiline (ex. pattern: |).

Great, but what's really happening here?

This pattern never makes it to Odin, because it fails to parse as valid YAML. The | introduces a YAML block scalar (a literal multi-line string), which the YAML parser will read without complaint and pass along to Odin.

Without the |, the YAML parser assumes that it's dealing with a list (the leading [ opens one) until it sees the +, which blows its mind with a wave of Cthulhu madness, upends its conception of reality, and sends it to an ashram for a period of convalescence and deep introspection.
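
For reference, this is the failing rule from above rewritten with a multi-line pattern, following the fix just described. The variable name `good_rules` is only illustrative, and the commented-out call assumes the server and `example_doc` from the earlier cells are still available.

```python
good_rules = """
rules:
    - name: "person"
      label: Person
      priority: 1+
      type: token
      pattern: |
        [entity=PERSON]+
"""

# With a running server, the corrected rule would be applied like the others:
# mentions = API.odin.extract_from_document(doc=example_doc, rules=good_rules)
print(good_rules)
```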

Every rule must have a unique name...

We keep track of what rule found each Mention, so rule names need to be unique to avoid ambiguities of provenance.



In [35]:

    
bad_rules = """
rules: 
    - name: "person"
      label: Person
      type: token
      pattern: |
        [entity=PERSON]+
        
    - name: "person"
      label: Person
      type: token
      pattern: |
        [tag=NNP]+
"""

API.odin.extract_from_document(doc=example_doc, rules=bad_rules)









    



OdinError: rule name 'person' is not unique

rules: 
    - name: "person"
      label: Person
      type: token
      pattern: |
        [entity=PERSON]+
        
    - name: "person"
      label: Person
      type: token
      pattern: |
        [tag=NNP]+
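
When a grammar grows, it can help to catch duplicate names before sending the rules to the server. The helper below is a hypothetical, line-based sketch in plain Python: it looks for `name:` keys with a regular expression and does not parse the YAML properly, so treat its answer as a hint rather than a guarantee.

```python
import re
from collections import Counter

NAME_LINE = re.compile(r'^[ \t]*(?:-[ \t]*)?name:[ \t]*"?([^"\n]+?)"?[ \t]*$', re.MULTILINE)


def duplicate_rule_names(rules_text):
    """Return the rule names that occur more than once in a rules string."""
    counts = Counter(NAME_LINE.findall(rules_text))
    return sorted(name for name, count in counts.items() if count > 1)


# Two rules that accidentally share the name "person".
example_rules = """
rules:
    - name: "person"
      label: Person
      type: token
      pattern: |
        [entity=PERSON]+

    - name: "person"
      label: Person
      type: token
      pattern: |
        [tag=NNP]+
"""

duplicates = duplicate_rule_names(example_rules)
print(duplicates)
```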

Ok, I've had enough. Give me my memory back, you animal!

Run the line below to shut down the NLP server.



In [36]:

    
#API.stop_server()
