Extracting a Custom Property


In [1]:
from chemdataextractor import Document
from chemdataextractor.model import Compound
from chemdataextractor.doc import Paragraph, Heading

Example Document

Let's create a simple example document with a single heading followed by a single paragraph:


In [2]:
d = Document(
    Heading(u'Synthesis of 2,4,6-trinitrotoluene (3a)'),
    Paragraph(u'The procedure was followed to yield a pale yellow solid (b.p. 240 °C)')
)

What does this look like:


In [3]:
d


Out[3]:

Synthesis of 2,4,6-trinitrotoluene (3a)

The procedure was followed to yield a pale yellow solid (b.p. 240 °C)

Default Parsers

By default, ChemDataExtractor won't extract the boiling point property:


In [4]:
d.records.serialize()


Out[4]:
[{'labels': ['3a'], 'names': ['2,4,6-trinitrotoluene'], 'roles': ['product']}]

Defining a New Property Model

The first task is to define the schema of a new property, and add it to the Compound model:


In [5]:
from chemdataextractor.model import BaseModel, StringType, ListType, ModelType

class BoilingPoint(BaseModel):
    value = StringType()
    units = StringType()
    
Compound.boiling_points = ListType(ModelType(BoilingPoint))

Writing a New Parser

Next, define parsing rules that define how to interpret text and convert it into the model:


In [6]:
import re
from chemdataextractor.parse import R, I, W, Optional, merge

prefix = (R(u'^b\.?p\.?$', re.I) | I(u'boiling') + I(u'point')).hide()
units = (W(u'°') + Optional(R(u'^[CFK]\.?$')))(u'units').add_action(merge)
value = R(u'^\d+(\.\d+)?$')(u'value')
bp = (prefix + value + units)(u'bp')

In [7]:
from chemdataextractor.parse.base import BaseParser
from chemdataextractor.utils import first

class BpParser(BaseParser):
    root = bp

    def interpret(self, result, start, end):
        compound = Compound(
            boiling_points=[
                BoilingPoint(
                    value=first(result.xpath('./value/text()')),
                    units=first(result.xpath('./units/text()'))
                )
            ]
        )
        yield compound

In [8]:
Paragraph.parsers = [BpParser()]

Running the New Parser


In [9]:
d = Document(
    Heading(u'Synthesis of 2,4,6-trinitrotoluene (3a)'),
    Paragraph(u'The procedure was followed to yield a pale yellow solid (b.p. 240 °C)')
)

d.records.serialize()


Out[9]:
[{'boiling_points': [{'units': '°C', 'value': '240'}],
  'labels': ['3a'],
  'names': ['2,4,6-trinitrotoluene'],
  'roles': ['product']}]