Paul Willot
This project was developed for the ICADL 2015 conference.
In this notebook we will go through all the steps required to build an LSTM neural network that classifies the sentences of a scientific paper abstract.
Summary:
In [1]:
#%install_ext https://raw.githubusercontent.com/rasbt/watermark/master/watermark.py
%load_ext watermark
# for reproducibility
%watermark -a 'Paul Willot' -mvp numpy,scipy,spacy
First, let's gather some data. We use the PubMed database of medical papers.
Specifically, we will focus on structured abstracts. Approximately 3 million are available; we will work on a reduced portion of them (500,000), but feel free to use a bigger corpus.
The easiest way to try this notebook is to use the toy_corpus.txt and tokenizer.pickle included in the project repo.
To work on a real dataset, I prepared the following files for convenience. Use the ones appropriate for your needs; for example, you can download the training and testing data and jump straight to the next notebook.
Download the full corpus (~500,000 structured abstracts, 500 MB compressed)
In [5]:
!wget https://www.dropbox.com/s/lhqe3bls0mkbq57/pubmed_result_548899.txt.zip -P ./data/
!unzip -o ./data/pubmed_result_548899.txt.zip -d ./data/
Download a toy corpus (224 structured abstracts, 200 KB compressed)
Note: this file is already included in the project GitHub repository.
In [6]:
#!wget https://www.dropbox.com/s/ujo1l8duu31js34/toy_corpus.txt.zip -P ./data/
#!unzip -o ./data/toy_corpus.txt.zip -d ./data/
Download a lemmatized corpus (preprocessed, 350 MB compressed)
In [7]:
!wget https://www.dropbox.com/s/lmv88n1vpmp6c19/corpus_lemmatized.pickle.zip -P ./data/
!unzip -o ./data/corpus_lemmatized.pickle.zip -d ./data/
Download the training and testing data for the LSTM (preprocessed, vectorized, and split, 100 MB compressed)
In [8]:
!wget https://www.dropbox.com/s/0o7i0ejv4aqf6gs/training_4_BacObjMetCon.pickle.zip -P ./data/
!unzip -o ./data/training_4_BacObjMetCon.pickle.zip -d ./data/
Some imports
In [1]:
from __future__ import absolute_import
from __future__ import print_function
# import local libraries
import tools
import prepare
import lemmatize
import analyze
import preprocess
In [2]:
data = prepare.extract_txt('data/toy_corpus.txt')
Our data currently looks like this:
In [3]:
print("%s\n[...]"%data[0][:800])
In [4]:
abstracts = prepare.get_abstracts(data)
Cleaning: dropping the abstracts with an incorrect number of labels
In [5]:
def remove_err(datas, errs):
    # flatten the error indices and delete from the highest index down,
    # so earlier indices stay valid while we remove items
    err = sorted([item for subitem in errs for item in subitem], reverse=True)
    for e in err:
        for d in datas:
            del d[e]
In [6]:
remove_err([abstracts],prepare.get_errors(abstracts))
In [7]:
print("Working on %d documents."%len(abstracts))
In [8]:
abstracts = prepare.filter_numbers(abstracts)
For correct sentence splitting, we train a tokenizer using the NLTK Punkt Sentence Tokenizer. This tokenizer uses an unsupervised algorithm to learn how to split sentences from a corpus.
In [9]:
tokenizer = prepare.create_sentence_tokenizer(abstracts)
# For a more general parser, use the one provided in NLTK:
#import nltk.data
#tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
abstracts_labeled = prepare.ex_all_labels(abstracts,tokenizer)
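For reference, prepare.create_sentence_tokenizer presumably wraps NLTK's Punkt trainer. A minimal sketch of training such a tokenizer by hand could look like this (raw_text is a hypothetical string holding the concatenated abstracts):

from nltk.tokenize.punkt import PunktTrainer, PunktSentenceTokenizer

trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True      # also learn collocations such as "et al."
trainer.train(raw_text)                 # raw_text: concatenated abstracts (hypothetical)

# build a sentence splitter from the learned parameters
custom_tokenizer = PunktSentenceTokenizer(trainer.get_params())
custom_tokenizer.tokenize("The effect was significant (p < 0.05). No adverse events were reported.")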
Our data now looks like this:
In [10]:
abstracts_labeled[0][0]
Out[10]:
In [11]:
lemmatized = lemmatize.lemm(abstracts_labeled)
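lemmatize.lemm presumably reduces each token to its lemma; spaCy (listed in the watermark above) is one library that can do this. A rough sketch, assuming an English spaCy model is installed (the model name below is an assumption, not taken from this project):

import spacy

# assumed model name; install with: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def lemmatize_sentence(sentence):
    # return the lower-cased lemma of every token in the sentence
    return " ".join(token.lemma_.lower() for token in nlp(sentence))

lemmatize_sentence("Plasma glucose concentrations were measured in 42 patients.")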
In [13]:
lemmatized[0]
Out[13]:
Let's save that
In [13]:
tools.dump_pickle(lemmatized,"data/fast_lemmatized.pickle")
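tools.dump_pickle and tools.load_pickle are presumably thin wrappers around the standard pickle module, roughly equivalent to the following:

import pickle

def dump_pickle(obj, path):
    # serialize an arbitrary Python object to disk
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_pickle(path):
    # load a previously serialized object
    with open(path, "rb") as f:
        return pickle.load(f)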
To directly load a lemmatized corpus
In [14]:
lemmatized = tools.load_pickle("data/corpus_lemmatized.pickle")
In [14]:
dic = analyze.create_dic_simple(lemmatized)
In [18]:
print("Number of labels :",len(dic.keys()))
analyze.show_keys(dic,threshold=10)
In [19]:
primary_keyword=['AIM','BACKGROUND','INTRODUCTION','METHOD','RESULT','CONCLUSION','OBJECTIVE','DESIGN','FINDING','OUTCOME','PURPOSE']
In [20]:
analyze.regroup_keys(dic,primary_keyword)
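analyze.regroup_keys folds label variants (e.g. 'BACKGROUND AND AIMS') into the primary keyword they start with. The idea, sketched on a simplified label-to-sentences dict (a hypothetical helper, not the library code):

def regroup_by_prefix(label_dic, primary_keywords):
    # merge each label into the first primary keyword it starts with,
    # or keep it unchanged if no primary keyword matches
    grouped = {}
    for label, sentences in label_dic.items():
        target = next((kw for kw in primary_keywords if label.startswith(kw)), label)
        grouped.setdefault(target, []).extend(sentences)
    return grouped

# toy example
regroup_by_prefix({"BACKGROUND AND AIMS": ["s1"], "METHODS": ["s2"]},
                  ["BACKGROUND", "METHOD"])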
In [21]:
analyze.show_keys(dic,threshold=10)
In [22]:
keys_to_replace = [['INTRODUCTION', 'CONTEXT', 'PURPOSE'],
                   ['AIM', 'SETTING'],
                   ['FINDING', 'OUTCOME', 'DISCUSSION']]
replace_with = ['BACKGROUND',
                'METHOD',
                'CONCLUSION']
In [23]:
analyze.replace_keys(dic,keys_to_replace,replace_with)
In [24]:
analyze.show_keys(dic,threshold=10)
We can restrict our data to work only on abstracts whose labels match a specific pattern...
In [26]:
pattern = [
    ['BACKGROUND', 'BACKGROUNDS'],
    ['METHOD', 'METHODS'],
    ['RESULT', 'RESULTS'],
    ['CONCLUSION', 'CONCLUSIONS'],
]
In [27]:
sub_perfect = analyze.get_exactly(lemmatized,pattern=pattern,no_truncate=True)
In [28]:
sub_perfect = analyze.get_exactly(lemmatized,pattern=pattern,no_truncate=False)
In [29]:
print("%d abstracts labeled and ready for the next part"%len(sub_perfect))
... or we can keep a noisier dataset and reduce it to a smaller set of labels
In [30]:
dic = preprocess.create_dic(lemmatized,100)
In [31]:
# We can re-use the variables defined in the analysis section
#primary_keyword=['AIM','BACKGROUND','METHOD','RESULT','CONCLUSION','OBJECTIVE','DESIGN','FINDINGS','OUTCOME','PURPOSE']
analyze.regroup_keys(dic,primary_keyword)
In [32]:
#keys_to_replace = [['INTRODUCTION','BACKGROUND','AIM','PURPOSE','CONTEXT'],
# ['CONCLUSION']]
#replace_with = ['OBJECTIVE',
# 'RESULT']
analyze.replace_keys(dic,keys_to_replace,replace_with)
In [33]:
# We can restrict our analysis to the main labels
dic = {key:dic[key] for key in ['BACKGROUND','RESULT','METHOD','CONCLUSION']}
In [34]:
analyze.show_keys(dic,threshold=10)
In [35]:
print("Sentences per label :",["%s %d"%(s,len(dic[s][1])) for s in dic.keys()])
Reorder the labels for better readability
In [36]:
classes_names = ['BACKGROUND', 'METHOD', 'RESULT','CONCLUSION']
dic.keys()
Out[36]:
In [37]:
# train/test split
split = 0.8
# truncate the number of abstracts to consider for each label,
# -1 to use the maximum while keeping the number of sentences per label equal
raw_x_train, raw_y_train, raw_x_test, raw_y_test = preprocess.split_data(
    dic, classes_names,
    split_train_test=split,
    truncate=-1)
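The split itself boils down to cutting each label's sentences into train and test parts. A simplified sketch of the idea (ignoring the truncate option and the exact structure of dic):

import random

def split_per_label(label_to_sentences, classes_names, split_train_test=0.8):
    # shuffle each label's sentences and cut them into train/test parts
    x_train, y_train, x_test, y_test = [], [], [], []
    for label_index, label in enumerate(classes_names):
        sentences = list(label_to_sentences[label])
        random.shuffle(sentences)
        cut = int(len(sentences) * split_train_test)
        x_train += sentences[:cut]
        y_train += [label_index] * cut
        x_test += sentences[cut:]
        y_test += [label_index] * (len(sentences) - cut)
    return x_train, y_train, x_test, y_test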
Vectorize the sentences.
In [38]:
X_train, y_train, X_test, y_test, feature_names, max_features, vectorizer = preprocess.vectorize_data(
    raw_x_train, raw_y_train, raw_x_test, raw_y_test)
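Vectorization turns each sentence into a sequence of word indices that the LSTM can consume, and max_features presumably counts the distinct features (words). A bare-bones sketch of the general approach (an illustration, not necessarily what preprocess.vectorize_data does):

def build_vocabulary(sentences):
    # map every word to an integer index (0 is reserved for padding)
    vocab = {}
    for sentence in sentences:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def sentences_to_sequences(sentences, vocab):
    # replace each word by its index; unknown words are dropped
    return [[vocab[w] for w in s.split() if w in vocab] for s in sentences]

vocab = build_vocabulary(raw_x_train)        # fit on the training set only
X_train_seq = sentences_to_sequences(raw_x_train, vocab)
X_test_seq = sentences_to_sequences(raw_x_test, vocab)
max_features_est = len(vocab) + 1            # vocabulary size (+1 for the padding index)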
In [39]:
print("Number of features : %d"%(max_features))
Now let's save all this
In [40]:
tools.dump_pickle([X_train, y_train, X_test, y_test, feature_names, max_features, classes_names, vectorizer],
                  "data/unpadded_4_BacObjMetCon.pickle")
Then jump to the second notebook to train the LSTM.