Extracting Structure from Scientific Abstracts

using an LSTM neural network

Paul Willot

This project was made for the ICADL 2015 conference.
In this notebook we go through all the steps required to build an LSTM neural network that classifies the sentences of a scientific paper abstract.

Summary:

Extract and parse the dataset
Pre-process (number filtering, sentence splitting, lemmatization)
Label analysis
Choosing labels
Creating train and test data


In [1]:
#%install_ext https://raw.githubusercontent.com/rasbt/watermark/master/watermark.py
%load_ext watermark
# for reproducibility
%watermark -a 'Paul Willot' -mvp numpy,scipy,spacy


Paul Willot 

CPython 2.7.10
IPython 4.0.0

numpy 1.8.0rc1
scipy 0.13.0b1
spacy 0.89

compiler   : GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)
system     : Darwin
release    : 14.5.0
machine    : x86_64
processor  : i386
CPU cores  : 4
interpreter: 64bit

First, let's gather some data. We use the PubMed database of medical papers.

Specifically, we focus on structured abstracts. Approximately 3 million are available; we will work on a reduced portion of them (500,000), but feel free to use a bigger corpus.

The easiest way to try this out is to use the toy_corpus.txt and tokenizer.pickle included in the project repository.

To work on a real dataset, I prepared the following files for convenience. Use whichever fits your needs; for example, you can download the training and testing data and jump straight to the next notebook.

Download the full corpus (~500,000 structured abstracts, 500 MB compressed)


In [5]:
!wget https://www.dropbox.com/s/lhqe3bls0mkbq57/pubmed_result_548899.txt.zip -P ./data/
!unzip -o ./data/pubmed_result_548899.txt.zip -d ./data/

Download a toy corpus (224 structured abstracts, 200 KB compressed)

Note: this file is already included in the project GitHub repository.


In [6]:
#!wget https://www.dropbox.com/s/ujo1l8duu31js34/toy_corpus.txt.zip -P ./data/
#!unzip -o ./TMP/toy_corpus.txt.zip -d ./data/

Download a lemmatized corpus (preprocessed, 350 MB compressed)


In [7]:
!wget https://www.dropbox.com/s/lmv88n1vpmp6c19/corpus_lemmatized.pickle.zip -P ./data/
!unzip -o ./data/corpus_lemmatized.pickle.zip -d ./data/

Download training and testing data for the LSTM (preprocessed, vectorized and split, 100 MB compressed)


In [8]:
!wget https://www.dropbox.com/s/0o7i0ejv4aqf6gs/training_4_BacObjMetCon.pickle.zip -P ./data/
!unzip -o ./data/training_4_BacObjMetCon.pickle.zip -d ./data/

Some imports


In [1]:
from __future__ import absolute_import
from __future__ import print_function

# import local libraries
import tools
import prepare
import lemmatize
import analyze
import preprocess

Extract and parse the dataset

Separate each document and isolate the abstracts.


In [2]:
data = prepare.extract_txt('data/toy_corpus.txt')


Exctracting from 'toy_corpus'...
224 documents exctracted - 1.9KB  [286.4KB/s]
Done. [0.01s]

Our data currently look like this:


In [3]:
print("%s\n[...]"%data[0][:800])


1. EJNMMI Res. 2014 Dec;4(1):75. doi: 10.1186/s13550-014-0075-x. Epub 2014 Dec 14.

Labeling galectin-3 for the assessment of myocardial infarction in rats.

Arias T(1), Petrov A, Chen J, de Haas H, Pérez-Medina C, Strijkers GJ, Hajjar RJ,
Fayad ZA, Fuster V, Narula J.

Author information: 
(1)Zena and Michael A. Wiener Cardiovascular Institute, Icahn School of Medicine 
at Mount Sinai, One Gustave L. Levy Place, Box 1030, New York, NY, 10029, USA,
tvarias@cnic.es.

BACKGROUND: Galectin-3 is a ß-galactoside-binding lectin expressed in most of
tissues in normal conditions and overexpressed in myocardium from early stages of
heart failure (HF). It is an established biomarker associated with extracellular 
matrix (ECM) turnover during myocardial remodeling. The aim of this study is to
test t
[...]

In [4]:
abstracts = prepare.get_abstracts(data)


Working on 4 core...
1.4KB/s on each of the [4] core
Done. [0.35s]

Cleaning: drop the abstracts with an incorrect number of labels.


In [5]:
def remove_err(datas, errs):
    # flatten the error indices and delete them in reverse order,
    # so earlier deletions do not shift the remaining indices
    err = sorted([item for subitem in errs for item in subitem], reverse=True)
    for e in err:
        for d in datas:
            del d[e]

In [6]:
remove_err([abstracts],prepare.get_errors(abstracts))

In [7]:
print("Working on %d documents."%len(abstracts))


Working on 219 documents.

Pre-process

Replacing every number with the ##NB placeholder.


In [8]:
abstracts = prepare.filter_numbers(abstracts)


Filtering numbers...
Done. [0.04s]
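
The masking itself is essentially a regular-expression substitution. A minimal sketch of the idea (the project's prepare.filter_numbers does the real work and may differ in details):

import re

NB_TOKEN = "##NB"

def mask_numbers(text):
    # replace any integer or decimal number with the ##NB placeholder
    return re.sub(r"\d+(?:\.\d+)?", NB_TOKEN, text)

print(mask_numbers("Galectin-3 was measured in 24 rats (p < 0.05)."))
# Galectin-##NB was measured in ##NB rats (p < ##NB).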

For correct sentence splitting, we train a tokenizer using the NLTK Punkt Sentence Tokenizer, which uses an unsupervised algorithm to learn how to split sentences from a corpus.


In [9]:
tokenizer = prepare.create_sentence_tokenizer(abstracts)
# For a more general parser, use the one provided in NLTK:
#import nltk.data
#tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

abstracts_labeled = prepare.ex_all_labels(abstracts,tokenizer)


Loading sentence tokenizer...
Done. [0.29s]
Working on 4 core...
2.0KB/s on each of the [4] core
Done. [0.26s]
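
For reference, training a Punkt tokenizer directly with NLTK looks roughly like the sketch below; prepare.create_sentence_tokenizer wraps something similar. Here train_text, standing for the abstracts joined into one raw string, is an assumption.

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True   # also learn collocations, useful for abbreviations
trainer.train(train_text)            # train_text: the corpus as one raw string (assumed)
custom_tokenizer = PunktSentenceTokenizer(trainer.get_params())

custom_tokenizer.tokenize("It is an established biomarker. The aim of this study is to test it.")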

Our data now look like this:


In [10]:
abstracts_labeled[0][0]


Out[10]:
[u'BACKGROUND',
 [u'Galectin-##NB is a \xdf-galactoside-binding lectin expressed in most of tissues in normal conditions and overexpressed in myocardium from early stages of heart failure (HF).',
  u'It is an established biomarker associated with extracellular  matrix (ECM) turnover during myocardial remodeling.',
  u'The aim of this study is to test the ability of (##NB)I-galectin-##NB (IG##NB) to assess cardiac remodeling in a model of myocardial infarction (MI) using imaging techniques. ']]

Lemmatization

This can be a long process on a huge dataset, but using spaCy currently makes it about 50 times faster than a simple use of the NLTK tools.
It also gets a huge speedup from parallelization (tested on up to 80 cores). Specify nb_core=X if needed.
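
For reference, lemmatizing a single sentence with spaCy directly looks like the sketch below. Note that this notebook pins spacy 0.89, whose interface differs; the sketch uses the current spaCy API.

import spacy

nlp = spacy.load("en_core_web_sm")   # recent spaCy; version 0.89 used spacy.en.English()
doc = nlp("It is an established biomarker associated with extracellular matrix turnover.")
print(" ".join(token.lemma_ for token in doc))
# roughly: "it be an establish biomarker associate with extracellular matrix turnover ."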


In [11]:
lemmatized = lemmatize.lemm(abstracts_labeled)


Working on 4 core...
Splitting datas... Done. [0.00s]
Lemmatizing...
Done. [0min 7s]

In [13]:
lemmatized[0]


Out[13]:
[[u'BACKGROUND',
  [u'galectin-##nb be a \xdf-galactoside bind lectin express in most of tissue in normal condition and overexpressed in myocardium from early stage of heart failure hf',
   u'it be an establish biomarker associate with extracellular matrix ecm turnover during myocardial remodeling',
   u'the aim of this study be to test the ability of nb)i galectin-##nb ig##nb to assess cardiac remodeling in a model of myocardial infarction mi use imaging technique']],
 [u'METHODS',
  [u'recombinant galectin-##nb be label with iodine-##nb and in vitro bind assay be conduct to test nb)i galectin-##nb ability to bind to ecm target',
   u'for in vivo study a rat model of induce mi be use',
   u'animal be subject to magnetic resonance and micro spetc/micro-ct image two nb w mi or four nb w mi week after mi.',
   u'sham rat be use as control',
   u'pharmacokinetic biodistribution and histological study be also perform after intravenous administration of ig##nb.']],
 [u'RESULTS',
  [u'in vitro study reveal that ig##nb show higher bind affinity measure as count per minute cpm p < nb to laminin nb \xb1 nb cpm fibronectin nb \xb1 nb cpm and collagen type -pron- nb \xb1 nb cpm compare to bovine serum albumin bsa nb \xb1 nb cpm',
   u'myocardial quantitative ig##nb uptake %id/g be high p < nb in the infarct of nb w mi rat nb \xb1 nb compare to control nb \xb1 nb',
   u'ig##nb infarct uptake correlate with the extent of scar r s = nb p = nb',
   u'total collagen deposition in the infarct percentage area be high p < nb at nb w mi nb \xb1 nb and nb w mi nb \xb1 nb compare to control nb \xb1 nb',
   u'however thick collagen content in the infarct square micrometer stain be high at nb w mi nb \xb1 nb \u03bcm(##nb compare to control nb \xb1 nb \u03bcm(##nb p < nb and nb w mi nb \xb1 nb \u03bcm(##nb p < nb']],
 [u'CONCLUSIONS',
  [u'this study show although preliminary enough data to consider ig##nb as a potential contrast agent for imaging of myocardial interstitial change in rat after mi.',
   u'label strategy need to be seek to improve in vivo ig##nb imaging and if prove galectin-##nb may be use as an imaging tool for the assessment and treatment of mi patient']]]

Let's save that


In [13]:
tools.dump_pickle(lemmatized,"data/fast_lemmatized.pickle")


Dumping...
Done. [0.05s]

To directly load a previously lemmatized corpus (such as the one downloaded above):


In [14]:
lemmatized = tools.load_pickle("data/corpus_lemmatized.pickle")



Label analysis

This does not affect the corpus; we simply do it to get some insights.


In [14]:
dic = analyze.create_dic_simple(lemmatized)


Copying corpus...Done. [0.02s]
Creating dictionary of labels...
Done. [0.00s]

In [18]:
print("Number of labels :",len(dic.keys()))
analyze.show_keys(dic,threshold=10)


Number of labels : 58
195______RESULTS
151______METHODS
146______BACKGROUND
117______CONCLUSIONS
91_______CONCLUSION
26_______INTRODUCTION
22_______OBJECTIVE
16_______MATERIALS AND METHODS
10_______OBJECTIVES
10_______PURPOSE
...
(48 other labels with less than 10 occurences)
...

In [19]:
primary_keyword=['AIM','BACKGROUND','INTRODUCTION','METHOD','RESULT','CONCLUSION','OBJECTIVE','DESIGN','FINDING','OUTCOME','PURPOSE']

In [20]:
analyze.regroup_keys(dic,primary_keyword)


Keys regrouped: 31

In [21]:
analyze.show_keys(dic,threshold=10)


212______CONCLUSION
200______RESULT
192______METHOD
149______BACKGROUND
33_______OBJECTIVE
26_______INTRODUCTION
10_______PURPOSE
...
(22 other labels with less than 10 occurences)
...
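
Judging from the label counts before and after, regroup_keys folds every label containing one of the primary keywords into that keyword (so METHODS and MATERIALS AND METHODS both become METHOD). A rough, hypothetical equivalent, assuming a plain dict mapping each label to a list of sentences (the project's internal structure is richer):

def regroup_keys_sketch(label_dict, keywords):
    # illustrative only; analyze.regroup_keys is the real implementation
    for label in list(label_dict.keys()):
        target = next((kw for kw in keywords if kw in label and kw != label), None)
        if target is not None:
            label_dict.setdefault(target, []).extend(label_dict.pop(label))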

In [22]:
keys_to_replace = [['INTRODUCTION','CONTEXT','PURPOSE'],
                  ['AIM','SETTING'],
                  ['FINDING','OUTCOME','DISCUSSION']]

replace_with =    ['BACKGROUND',
                  'METHOD',
                  'CONCLUSION']

In [23]:
analyze.replace_keys(dic,keys_to_replace,replace_with)


Keys regplaced: 8

In [24]:
analyze.show_keys(dic,threshold=10)


221______CONCLUSION
203______METHOD
200______RESULT
186______BACKGROUND
33_______OBJECTIVE
...
(16 other labels with less than 10 occurences)
...

Choosing labels

This does affect the corpus.

We can restrict our data to abstracts whose labels match a specific pattern...


In [26]:
pattern = [
    ['BACKGROUND','BACKGROUNDS'],
    ['METHOD','METHODS'],
    ['RESULT','RESULTS'],
    ['CONCLUSION','CONCLUSIONS'],
]

In [27]:
sub_perfect = analyze.get_exactly(lemmatized,pattern=pattern,no_truncate=True)


Selecting abstracts...
91/219 match the pattern (41%)
Done. [0.00s]

In [28]:
sub_perfect = analyze.get_exactly(lemmatized,pattern=pattern,no_truncate=False)


Selecting abstracts...
98/219 match the pattern (44%)
Done. [0.00s]

In [29]:
print("%d abstracts labeled and ready for the next part"%len(sub_perfect))


98 abstracts labeled and ready for the next part

... Or we can keep a noisier dataset and reduce it to a smaller set of labels.


In [30]:
dic = preprocess.create_dic(lemmatized,100)


Copying corpus...Done. [0.01s]
Creating dictionary of labels...
Done. [0.01s]

In [31]:
# We can re-use the variables defined in the analysis section
#primary_keyword=['AIM','BACKGROUND','METHOD','RESULT','CONCLUSION','OBJECTIVE','DESIGN','FINDINGS','OUTCOME','PURPOSE']
analyze.regroup_keys(dic,primary_keyword)


Keys regrouped: 31

In [32]:
#keys_to_replace = [['INTRODUCTION','BACKGROUND','AIM','PURPOSE','CONTEXT'],
#                  ['CONCLUSION']]

#replace_with =    ['OBJECTIVE',
#                  'RESULT']

analyze.replace_keys(dic,keys_to_replace,replace_with)


Keys regplaced: 8

In [33]:
# We can restrict our analysis to the main labels
dic = {key:dic[key] for key in ['BACKGROUND','RESULT','METHOD','CONCLUSION']}

In [34]:
analyze.show_keys(dic,threshold=10)


221______CONCLUSION
203______METHOD
200______RESULT
186______BACKGROUND

In [35]:
print("Sentences per label :",["%s %d"%(s,len(dic[s][1])) for s in dic.keys()])


Sentences per label : ['CONCLUSION 446', 'RESULT 946', 'BACKGROUND 481', 'METHOD 640']

Creating train and test data

Let's format the data for the classifier.

Reorder the labels for better readability


In [36]:
classes_names = ['BACKGROUND', 'METHOD', 'RESULT','CONCLUSION']
dic.keys()


Out[36]:
['CONCLUSION', 'RESULT', 'BACKGROUND', 'METHOD']

In [37]:
# train/test split
split = 0.8

# truncate the number of abstracts to consider for each label;
# -1 uses the maximum while keeping the number of sentences per label equal
raw_x_train, raw_y_train, raw_x_test, raw_y_test = preprocess.split_data(dic,classes_names,
                                                              split_train_test=split,
                                                              truncate=-1)

Vectorize the sentences.
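
Roughly speaking, vectorizing here means building a vocabulary over the training sentences and mapping each sentence to a sequence of word indices; padding to a fixed length happens in the next notebook, hence the "unpadded" pickle below. A minimal sketch of the idea, assuming raw_x_train is a list of sentence strings (preprocess.vectorize_data does the real work and also returns the fitted vectorizer and feature names):

from collections import Counter

def build_vocab(sentences):
    # index 0 is left free for padding later
    counts = Counter(word for s in sentences for word in s.split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def to_indices(sentence, vocab):
    return [vocab[w] for w in sentence.split() if w in vocab]

vocab = build_vocab(raw_x_train)
X_train_ids = [to_indices(s, vocab) for s in raw_x_train]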


In [38]:
X_train, y_train, X_test, y_test, feature_names, max_features, vectorizer = preprocess.vectorize_data(raw_x_train,
                                                                                                      raw_y_train,
                                                                                                      raw_x_test,
                                                                                                      raw_y_test)


Vectorizing the training set...Done. [0.07s]
Getting features...Done. [0.01s]
Creating order...Done. [0.05s]
Done. [0.13s]

In [39]:
print("Number of features : %d"%(max_features))


Number of features : 4506

Now let's save all this


In [40]:
tools.dump_pickle([X_train, y_train, X_test, y_test, feature_names, max_features, classes_names, vectorizer],
                  "data/unpadded_4_BacObjMetCon.pickle")


Dumping...
Done. [0.30s]

Then jump to the second notebook to train the LSTM.