In this example, we'll be writing an application to extract mentions of chemical-induced-disease relationships from PubMed abstracts, as per the BioCreative CDR Challenge. This tutorial will show off some of the more advanced features of Snorkel, so we'll assume you've followed the Intro tutorial.
The CDR task comprises three sets of 500 documents each, called training, development, and test. A document consists of the title and abstract of an article from PubMed, an archive of biomedical and life sciences journal literature. The documents have been hand-annotated with chemical-induced-disease relations at the document level: if a chemical with MESH ID X induces a disease with MESH ID Y, the document will be annotated with Relation(X, Y). The goal is to extract the document-level relations on the test set (without accessing the entity or relation annotations). For this tutorial, we make the following assumptions and alterations to the task:
We consider an extracted pair (X, Y) in document D to be correct if Relation(X, Y) was hand-annotated at the document level for D. In effect, the only inputs to this application are the plain text of the documents, a pre-trained entity tagger, and a small development set of annotated documents. This is representative of many information extraction tasks, and Snorkel is the perfect tool to bootstrap the extraction process with weak supervision. Let's get going.
In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
from snorkel import SnorkelSession
session = SnorkelSession()
DocPreprocessor
We'll start by defining a DocPreprocessor
object to read in PubMed abstracts from PubTator. There is some extra annotation information in the file, which we'll skip for now. We'll use the XMLMultiDocPreprocessor
class, which allows us to use XPath queries to specify the relevant sections of the XML format.
Note that we are newline-concatenating text from the title and abstract together for simplicity, but if we wanted to, we could easily extend the DocPreprocessor
classes to preserve information about document structure.
In [2]:
import os
from snorkel.parser import XMLMultiDocPreprocessor
# The following line is for testing only. Feel free to ignore it.
file_path = 'data/CDR.BioC.small.xml' if 'CI' in os.environ else 'data/CDR.BioC.xml'
doc_preprocessor = XMLMultiDocPreprocessor(
    path=file_path,
    doc='.//document',
    text='.//passage/text/text()',
    id='.//id/text()'
)
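To make the XPath settings above concrete, here is a minimal sketch of what they select, run against a hypothetical BioC-style snippet (toy data, not the real CDR corpus). The standard-library ElementTree doesn't support the text() node selector used in the config above, so this sketch reads the .text attribute instead:

```python
from xml.etree import ElementTree

# A toy BioC-style document, mirroring the structure the XPath
# queries above select from (hypothetical data).
xml = """
<collection>
  <document>
    <id>12345</id>
    <passage><text>Cisplatin-induced nephrotoxicity.</text></passage>
    <passage><text>We report a case of renal failure after cisplatin.</text></passage>
  </document>
</collection>
"""

root = ElementTree.fromstring(xml)
docs = {}
for doc in root.iterfind('.//document'):          # doc='.//document'
    doc_id = doc.findtext('.//id')                # id='.//id/text()'
    # Newline-join all passage texts (title + abstract), mirroring
    # what the preprocessor does with text='.//passage/text/text()'
    text = '\n'.join(p.text for p in doc.iterfind('.//passage/text'))
    docs[doc_id] = text
```

Each document's passages (title and abstract) end up newline-joined into a single text field, which is the behavior described above.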
CorpusParser
Similar to the Intro tutorial, we'll now construct a CorpusParser
using the preprocessor we just defined. However, this one has an extra ingredient: an entity tagger. TaggerOne is a popular entity tagger for PubMed, so we went ahead and preprocessed its tags on the CDR corpus for you. The function TaggerOneTagger.tag
(in utils.py
) tags sentences with mentions of chemicals and diseases. We'll use these tags to extract candidates in Part II. The tags are stored in Sentence.entity_cids
and Sentence.entity_types
, which are aligned token-by-token with Sentence.words.
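To illustrate the alignment, here's a toy sketch of what such token-aligned tag lists look like (hypothetical values, not actual TaggerOne output):

```python
# Toy token-aligned tag lists, mirroring how Sentence.entity_types and
# Sentence.entity_cids line up with Sentence.words (hypothetical values).
words        = ['Cisplatin', 'can', 'cause', 'nephrotoxicity', '.']
entity_types = ['Chemical',  'O',   'O',     'Disease',        'O']
entity_cids  = ['D002945',   'O',   'O',     'D007674',        'O']

# Group (word, CID) pairs by entity type; 'O' marks untagged tokens here.
mentions = {}
for word, etype, cid in zip(words, entity_types, entity_cids):
    if etype != 'O':
        mentions.setdefault(etype, []).append((word, cid))
```

Because the three lists share one index per token, looking up the type or CID of any word is a simple parallel-list access.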
Recall that in the wild, we wouldn't have the manual labels included with the CDR data, and we'd have to use an automated tagger (like TaggerOne) to tag entity mentions. That's what we're doing here.
In [3]:
from snorkel.parser import CorpusParser
from utils import TaggerOneTagger
tagger_one = TaggerOneTagger()
corpus_parser = CorpusParser(fn=tagger_one.tag)
corpus_parser.apply(list(doc_preprocessor))
In [4]:
from snorkel.models import Document, Sentence
print("Documents:", session.query(Document).count())
print("Sentences:", session.query(Sentence).count())
In [5]:
from six.moves.cPickle import load
with open('data/doc_ids.pkl', 'rb') as f:
    train_ids, dev_ids, test_ids = load(f)
train_ids, dev_ids, test_ids = set(train_ids), set(dev_ids), set(test_ids)

train_sents, dev_sents, test_sents = set(), set(), set()
docs = session.query(Document).order_by(Document.name).all()
for i, doc in enumerate(docs):
    for s in doc.sentences:
        if doc.name in train_ids:
            train_sents.add(s)
        elif doc.name in dev_ids:
            dev_sents.add(s)
        elif doc.name in test_ids:
            test_sents.add(s)
        else:
            raise Exception('ID <{0}> not found in any id set'.format(doc.name))
In [6]:
from snorkel.models import Candidate, candidate_subclass
ChemicalDisease = candidate_subclass('ChemicalDisease', ['chemical', 'disease'])
In [7]:
from snorkel.candidates import PretaggedCandidateExtractor
candidate_extractor = PretaggedCandidateExtractor(ChemicalDisease, ['Chemical', 'Disease'])
We should get 8268 candidates in the training set, 888 candidates in the development set, and 4620 candidates in the test set.
In [8]:
for k, sents in enumerate([train_sents, dev_sents, test_sents]):
    candidate_extractor.apply(sents, split=k)
    print("Number of candidates:", session.query(ChemicalDisease).filter(ChemicalDisease.split == k).count())
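A pretagged extractor like this one pairs each tagged chemical span with each tagged disease span within a sentence. A minimal sketch of that cross-product step on toy mentions (not the extractor's actual internals):

```python
from itertools import product

# Hypothetical tagged mention spans from a single sentence.
chemicals = ['cisplatin', 'carboplatin']
diseases  = ['nephrotoxicity']

# One candidate per co-occurring (chemical, disease) pair: the cross product.
candidates = list(product(chemicals, diseases))
```

This is why candidate counts can grow quickly in sentences with many tagged entities: a sentence with m chemicals and n diseases yields m × n candidates.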
We will briefly discuss the issue of candidate recall. The end recall of the extraction is effectively upper-bounded by our candidate set: any chemical-disease pair that is present in a document but not identified as a candidate cannot be extracted by our end extraction model. A candidate can be missed, for example, if the entity tagger fails to identify a mention, or if the two entities never co-occur within a single sentence¹.
If we just look at the set of extractions at the end of this tutorial, we won't be able to account for some false negatives that we missed at the candidate extraction stage. For simplicity, we ignore candidate recall in this tutorial and evaluate our extraction model just on the set of extractions made by the end model. However, when you're developing information extraction applications in the future, it's important to keep candidate recall in mind.
¹Note that these specific issues can be combated with advanced techniques like noun-phrase chunking to expand the entity mention set, or coreference parsing for cross-sentence candidates. We don't employ these here in order to focus on weak supervision.
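To make the upper bound concrete, here's a minimal sketch of computing candidate recall against a hypothetical gold set of (chemical MESH ID, disease MESH ID) pairs:

```python
# Hypothetical document-level gold relations and extracted candidates,
# each a set of (chemical MESH ID, disease MESH ID) pairs.
gold       = {('C1', 'D1'), ('C2', 'D2'), ('C3', 'D3')}
candidates = {('C1', 'D1'), ('C2', 'D2'), ('C4', 'D4')}

# Fraction of gold relations that survive candidate extraction:
# an upper bound on the end model's achievable recall.
candidate_recall = len(gold & candidates) / len(gold)
```

However good the end model, it can never recover the gold relations (here, ('C3', 'D3')) that were dropped before it ever saw them.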