Extracting a Custom Property



In [1]:

    
from chemdataextractor import Document
from chemdataextractor.model import Compound
from chemdataextractor.doc import Paragraph, Heading

Example Document

Let's create a simple example document with a single heading followed by a single paragraph:



In [2]:

    
d = Document(
    Heading(u'Synthesis of 2,4,6-trinitrotoluene (3a)'),
    Paragraph(u'The procedure was followed to yield a pale yellow solid (b.p. 240 °C)')
)

Evaluating the document in a notebook cell shows how it renders:



In [3]:

    
d









    Out[3]:





Synthesis of 2,4,6-trinitrotoluene (3a)
The procedure was followed to yield a pale yellow solid (b.p. 240 °C)

Default Parsers

By default, ChemDataExtractor won't extract the boiling point property:



In [4]:

    
d.records.serialize()









    Out[4]:





[{'labels': ['3a'], 'names': ['2,4,6-trinitrotoluene'], 'roles': ['product']}]

Defining a New Property Model

First, define the schema for the new property and attach it to the Compound model:



In [5]:

    
from chemdataextractor.model import BaseModel, StringType, ListType, ModelType

class BoilingPoint(BaseModel):
    value = StringType()
    units = StringType()
    
Compound.boiling_points = ListType(ModelType(BoilingPoint))

Writing a New Parser

Next, write the parsing rules that describe how to recognize a boiling point in the text and map it onto the model:



In [6]:

    
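import re
from chemdataextractor.parse import R, I, W, Optional, merge

# Raw strings keep the regex backslashes intact and avoid invalid-escape warnings.
# Prefix: "b.p.", "bp" or "boiling point" (case-insensitive), hidden from the result.
prefix = (R(r'^b\.?p\.?$', re.I) | I(u'boiling') + I(u'point')).hide()
# Units: a degree sign and an optional scale letter, merged into one string such as "°C".
units = (W(u'°') + Optional(R(r'^[CFK]\.?$')))(u'units').add_action(merge)
# Value: an integer or decimal number.
value = R(r'^\d+(\.\d+)?$')(u'value')
# A boiling point is a prefix followed by a value and its units.
bp = (prefix + value + units)(u'bp')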
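To see what each regular expression in the grammar accepts, the patterns can be tried on individual tokens with the standard library alone. This is only a sketch: the token list is an assumed tokenization of the example text (the real tokenizer may split it differently), and `classify` is a hypothetical helper, not part of ChemDataExtractor.

```python
import re

# The same patterns as in the grammar, compiled with the standard library.
PREFIX = re.compile(r'^b\.?p\.?$', re.I)
VALUE = re.compile(r'^\d+(\.\d+)?$')
SCALE = re.compile(r'^[CFK]\.?$')

# Assumed tokenization of "(b.p. 240 °C)"; the real tokenizer may differ.
tokens = ['(', 'b.p.', '240', '°', 'C', ')']


def classify(token):
    """Return the name of the first pattern matching the whole token, or None."""
    if PREFIX.match(token):
        return 'prefix'
    if VALUE.match(token):
        return 'value'
    if token == '°':
        return 'degree'
    if SCALE.match(token):
        return 'scale'
    return None


labels = [classify(t) for t in tokens]
print(labels)
```

Because every pattern is anchored with `^` and `$`, a token must match in full; an unsplit string such as `240°C` would match none of them.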



In [7]:

    
from chemdataextractor.parse.base import BaseParser
from chemdataextractor.utils import first

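class BpParser(BaseParser):
    # Top-level grammar rule that this parser looks for
    root = bp

    def interpret(self, result, start, end):
        # result is the matched 'bp' element; its 'value' and 'units'
        # children hold the extracted text.
        compound = Compound(
            boiling_points=[
                BoilingPoint(
                    value=first(result.xpath('./value/text()')),
                    units=first(result.xpath('./units/text()'))
                )
            ]
        )
        yield compound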
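The `interpret` method reads child elements out of the matched result by XPath. The sketch below imitates that step with the standard library's `xml.etree.ElementTree` as a stand-in for the real result element; the hand-built tree and the `first_or_none` helper are assumptions made for illustration, not ChemDataExtractor code.

```python
import xml.etree.ElementTree as ET

# Hand-built stand-in for a parse result: <bp><value>240</value><units>°C</units></bp>
result = ET.Element('bp')
ET.SubElement(result, 'value').text = '240'
ET.SubElement(result, 'units').text = '°C'


def first_or_none(items):
    """Return the first item of an iterable, or None if it is empty."""
    for item in items:
        return item
    return None


# Pull the text of the first matching child, mirroring the XPath lookups above.
value = first_or_none(el.text for el in result.findall('./value'))
units = first_or_none(el.text for el in result.findall('./units'))

record = {'boiling_points': [{'value': value, 'units': units}]}
print(record)
```

Returning `None` for a missing child, rather than raising, lets a record be built even when an optional part of the match is absent.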



In [8]:

    
# Replace the paragraph parsers with the new boiling point parser
Paragraph.parsers = [BpParser()]

Running the New Parser

Create the document again and serialize its records; the boiling point is now part of the compound record:



In [9]:

    
d = Document(
    Heading(u'Synthesis of 2,4,6-trinitrotoluene (3a)'),
    Paragraph(u'The procedure was followed to yield a pale yellow solid (b.p. 240 °C)')
)

d.records.serialize()









    Out[9]:





[{'boiling_points': [{'units': '°C', 'value': '240'}],
  'labels': ['3a'],
  'names': ['2,4,6-trinitrotoluene'],
  'roles': ['product']}]
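The serialized records are plain lists and dictionaries, so they can be post-processed without ChemDataExtractor. As a minimal sketch, the snippet below copies the output above into a literal and groups numeric boiling points by compound name; `boiling_points_by_name` is a hypothetical helper introduced here.

```python
# Literal copy of the serialized output shown above.
records = [{
    'boiling_points': [{'units': '°C', 'value': '240'}],
    'labels': ['3a'],
    'names': ['2,4,6-trinitrotoluene'],
    'roles': ['product'],
}]


def boiling_points_by_name(records):
    """Map each compound name to a list of (numeric value, units) pairs."""
    table = {}
    for record in records:
        points = [
            (float(bp['value']), bp['units'])
            for bp in record.get('boiling_points', [])
        ]
        if not points:
            continue  # skip records that carry no boiling point
        for name in record.get('names', []):
            table.setdefault(name, []).extend(points)
    return table


table = boiling_points_by_name(records)
print(table)
```

The value is stored as a string in the model, so the conversion to `float` happens here, at the point of use.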