In this example, we'll be writing an application to extract mentions of chemical-induced-disease relationships from PubMed abstracts, as per the BioCreative CDR Challenge. This tutorial will show off some of the more advanced features of Snorkel, so we'll assume you've followed the Intro tutorial.
The CDR task comprises three sets of 500 documents each, called training, development, and test. A document consists of the title and abstract of an article from PubMed, an archive of biomedical and life sciences journal literature. The documents have been hand-annotated with chemical-induced-disease relations at the document level: if a chemical with MESH ID X induces a disease with MESH ID Y, the document will be annotated with Relation(X, Y). The goal is to extract the document-level relations on the test set (without accessing the entity or relation annotations). For this tutorial, we make the following assumptions and alterations to the task:
We consider an extracted pair (X, Y) in document D to be correct if Relation(X, Y) was hand-annotated at the document level for D. In effect, the only inputs to this application are the plain text of the documents, a pre-trained entity tagger, and a small development set of annotated documents. This is representative of many information extraction tasks, and Snorkel is the perfect tool to bootstrap the extraction process with weak supervision. Let's get going.
In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
from snorkel import SnorkelSession
session = SnorkelSession()
DocPreprocessor
We'll start by defining a DocPreprocessor
object to read in PubMed abstracts from PubTator. There is some extra annotation information in the file, which we'll skip for now. We'll use the XMLMultiDocPreprocessor
class, which allows us to use XPath queries to specify the relevant sections of the XML format.
Note that we are newline-concatenating text from the title and abstract together for simplicity, but if we wanted to, we could easily extend the DocPreprocessor
classes to preserve information about document structure.
In [2]:
import os
from snorkel.parser import XMLMultiDocPreprocessor
# The following line is for testing only. Feel free to ignore it.
file_path = 'data/CDR.BioC.small.xml' if 'CI' in os.environ else 'data/CDR.BioC.xml'
doc_preprocessor = XMLMultiDocPreprocessor(
    path=file_path,
    doc='.//document',
    text='.//passage/text/text()',
    id='.//id/text()'
)
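To make the XPath settings above concrete, here is a minimal sketch of what they select, run against a hypothetical BioC-style snippet (toy data, not the real CDR corpus). The standard-library ElementTree doesn't support the text() node selector used in the config above, so this sketch reads the .text attribute instead:

```python
from xml.etree import ElementTree

# A toy BioC-style document, mirroring the structure the XPath
# queries above select from (hypothetical data).
xml = """
<collection>
  <document>
    <id>12345</id>
    <passage><text>Cisplatin-induced nephrotoxicity.</text></passage>
    <passage><text>We report a case of renal failure after cisplatin.</text></passage>
  </document>
</collection>
"""

root = ElementTree.fromstring(xml)
docs = {}
for doc in root.iterfind('.//document'):          # doc='.//document'
    doc_id = doc.findtext('.//id')                # id='.//id/text()'
    # Newline-join all passage texts (title + abstract), mirroring
    # what the preprocessor does with text='.//passage/text/text()'
    text = '\n'.join(p.text for p in doc.iterfind('.//passage/text'))
    docs[doc_id] = text
```

Each document's passages (title and abstract) end up newline-joined into a single text field, which is the behavior described above.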
CorpusParser
Similar to the Intro tutorial, we'll now construct a CorpusParser
using the preprocessor we just defined. However, this one has an extra ingredient: an entity tagger. TaggerOne is a popular entity tagger for PubMed, so we went ahead and preprocessed its tags on the CDR corpus for you. The function TaggerOneTagger.tag
(in utils.py
) tags sentences with mentions of chemicals and diseases. We'll use these tags to extract candidates in Part II. The tags are stored in Sentence.entity_cids
and Sentence.entity_types
, which are aligned token-by-token with Sentence.words.
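To illustrate the alignment, here's a toy sketch of what such token-aligned tag lists look like (hypothetical values, not actual TaggerOne output):

```python
# Toy token-aligned tag lists, mirroring how Sentence.entity_types and
# Sentence.entity_cids line up with Sentence.words (hypothetical values).
words        = ['Cisplatin', 'can', 'cause', 'nephrotoxicity', '.']
entity_types = ['Chemical',  'O',   'O',     'Disease',        'O']
entity_cids  = ['D002945',   'O',   'O',     'D007674',        'O']

# Group (word, CID) pairs by entity type; 'O' marks untagged tokens here.
mentions = {}
for word, etype, cid in zip(words, entity_types, entity_cids):
    if etype != 'O':
        mentions.setdefault(etype, []).append((word, cid))
```

Because the three lists share one index per token, looking up the type or CID of any word is a simple parallel-list access.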
Recall that in the wild, we wouldn't have the manual labels included with the CDR data, and we'd have to use an automated tagger (like TaggerOne) to tag entity mentions. That's what we're doing here.
In [3]:
from snorkel.parser import CorpusParser
from utils import TaggerOneTagger
tagger_one = TaggerOneTagger()
corpus_parser = CorpusParser(fn=tagger_one.tag)
corpus_parser.apply(list(doc_preprocessor))
In [4]:
from snorkel.models import Document, Sentence
print("Documents:", session.query(Document).count())
print("Sentences:", session.query(Sentence).count())
In [5]:
from six.moves.cPickle import load
with open('data/doc_ids.pkl', 'rb') as f:
    train_ids, dev_ids, test_ids = load(f)
train_ids, dev_ids, test_ids = set(train_ids), set(dev_ids), set(test_ids)

train_sents, dev_sents, test_sents = set(), set(), set()
docs = session.query(Document).order_by(Document.name).all()
for i, doc in enumerate(docs):
    for s in doc.sentences:
        if doc.name in train_ids:
            train_sents.add(s)
        elif doc.name in dev_ids:
            dev_sents.add(s)
        elif doc.name in test_ids:
            test_sents.add(s)
        else:
            raise Exception('ID <{0}> not found in any id set'.format(doc.name))
In [6]:
from snorkel.models import Candidate, candidate_subclass
ChemicalDisease = candidate_subclass('ChemicalDisease', ['chemical', 'disease'])
In [7]:
from snorkel.candidates import PretaggedCandidateExtractor
candidate_extractor = PretaggedCandidateExtractor(ChemicalDisease, ['Chemical', 'Disease'])
We should get 8268 candidates in the training set, 888 candidates in the development set, and 4620 candidates in the test set.
In [8]:
for k, sents in enumerate([train_sents, dev_sents, test_sents]):
    candidate_extractor.apply(sents, split=k)
    print("Number of candidates:", session.query(ChemicalDisease).filter(ChemicalDisease.split == k).count())
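A pretagged extractor like this one pairs each tagged chemical span with each tagged disease span within a sentence. A minimal sketch of that cross-product step on toy mentions (not the extractor's actual internals):

```python
from itertools import product

# Hypothetical tagged mention spans from a single sentence.
chemicals = ['cisplatin', 'carboplatin']
diseases  = ['nephrotoxicity']

# One candidate per co-occurring (chemical, disease) pair: the cross product.
candidates = list(product(chemicals, diseases))
```

This is why candidate counts can grow quickly in sentences with many tagged entities: a sentence with m chemicals and n diseases yields m × n candidates.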
We will briefly discuss the issue of candidate recall. The end recall of the extraction is effectively upper-bounded by our candidate set: any chemical-disease pair that is present in a document but not identified as a candidate cannot be extracted by our end extraction model. A candidate can be missed, for example, if the entity tagger fails to identify a mention, or if the two entities never co-occur within a single sentence¹.
If we just look at the set of extractions at the end of this tutorial, we won't be able to account for some false negatives that we missed at the candidate extraction stage. For simplicity, we ignore candidate recall in this tutorial and evaluate our extraction model just on the set of extractions made by the end model. However, when you're developing information extraction applications in the future, it's important to keep candidate recall in mind.
¹Note that these specific issues can be combated with advanced techniques like noun-phrase chunking to expand the entity mention set, or coreference parsing for cross-sentence candidates. We don't employ these here in order to focus on weak supervision.
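To make the upper bound concrete, here's a minimal sketch of computing candidate recall against a hypothetical gold set of (chemical MESH ID, disease MESH ID) pairs:

```python
# Hypothetical document-level gold relations and extracted candidates,
# each a set of (chemical MESH ID, disease MESH ID) pairs.
gold       = {('C1', 'D1'), ('C2', 'D2'), ('C3', 'D3')}
candidates = {('C1', 'D1'), ('C2', 'D2'), ('C4', 'D4')}

# Fraction of gold relations that survive candidate extraction:
# an upper bound on the end model's achievable recall.
candidate_recall = len(gold & candidates) / len(gold)
```

However good the end model, it can never recover the gold relations (here, ('C3', 'D3')) that were dropped before it ever saw them.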