Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify elements in text into pre-defined categories such as person names, organizations and locations.
In this tutorial you will learn how to use estnltk's out-of-the-box NER utilities and how to build your own NER models from scratch.
The estnltk package comes with pre-trained NER models for Python 2.7/Python 3.4. The models distinguish three types of entities: person names, organizations and locations.
A quick example below demonstrates how to extract named entities from the raw text:
In [1]:
from estnltk import Text
from pprint import pprint
text = Text('''Eesti Vabariik on riik Põhja-Euroopas.
Eesti piirneb põhjas üle Soome lahe Soome Vabariigiga.
Riigikogu on Eesti Vabariigi parlament. Riigikogule kuulub Eestis seadusandlik võim.
2005. aastal sai peaministriks Andrus Ansip, kes püsis sellel kohal 2014. aastani.
2006. aastal valiti presidendiks Toomas Hendrik Ilves.
''')
# Extract named entities
pprint(text.named_entities)
When you access the named_entities property of a Text instance, estnltk executes the whole text processing pipeline in the background, including tokenization, morphological analysis and named entity extraction.
The class Text additionally provides a number of useful methods to get more information on the extracted entities:
In [2]:
pprint(list(zip(text.named_entities, text.named_entity_labels, text.named_entity_spans)))
The default models use the tags PER, ORG and LOC to denote person names, organizations and locations respectively. Entity tags are encoded using the BIO annotation scheme, where each entity label is prefixed with either B or I. B- denotes the beginning of an entity and I- a word inside one. The prefixes make it possible to detect multiword entities, as shown in the example above. All other words, which don't refer to entities of interest, are labelled with the O tag.
The raw labels are accessible via the property labels of the Text instance:
In [3]:
pprint(list(zip(text.word_texts, text.labels)))
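To make the BIO scheme concrete, here is a plain-Python sketch (no estnltk required) of how BIO-labelled words can be grouped back into multiword entities; the example words and labels below are made up for illustration:

```python
def collect_entities(words, labels):
    """Group BIO-labelled words into (entity_text, category) pairs."""
    entities = []
    current_words = []
    current_label = None
    for word, label in zip(words, labels):
        if label.startswith('B-'):
            # A new entity begins; flush any entity in progress.
            if current_words:
                entities.append((' '.join(current_words), current_label))
            current_words = [word]
            current_label = label[2:]
        elif label.startswith('I-') and current_words:
            # Continuation of the current multiword entity.
            current_words.append(word)
        else:
            # An O tag ends any entity in progress.
            if current_words:
                entities.append((' '.join(current_words), current_label))
            current_words = []
            current_label = None
    if current_words:
        entities.append((' '.join(current_words), current_label))
    return entities

words = ['Toomas', 'Hendrik', 'Ilves', 'elab', 'Eestis', '.']
labels = ['B-PER', 'I-PER', 'I-PER', 'O', 'B-LOC', 'O']
print(collect_entities(words, labels))
# [('Toomas Hendrik Ilves', 'PER'), ('Eestis', 'LOC')]
```

This is essentially what estnltk does internally to produce the named_entities property from the per-word labels.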
Default models that come with estnltk are good enough for basic tasks. However, some specific tasks may require a custom NER model. To train your own model, you need to provide a training corpus and custom configuration settings. The following example demonstrates how to train a NER model using the default training dataset and settings:
In [4]:
from estnltk import estner
from estnltk.corpus import read_json_corpus
from estnltk.ner import NerTrainer
# Read the default training corpus
corpus = read_json_corpus('../../../estnltk/corpora/estner.json')
# Read the default settings
ner_settings = estner.settings
# Directory to save the model
model_dir = 'output_model_directory'
# Train and save the model
trainer = NerTrainer(ner_settings)
trainer.train(corpus, model_dir)
The specified output directory will contain the resulting model file model.bin and a copy of the settings module used for training. Now we can load the model and tag some text using NerTagger:
In [5]:
from estnltk.ner import NerTagger
document = Text('Eesti koeraspordiliidu ( EKL ) presidendi Piret Laanetu intervjuu Eesti Päevalehele.')
# Load the model and settings
tagger = NerTagger(model_dir)
# ne-tag the document
tagger.tag_document(document)
pprint(list(zip(document.word_texts, document.labels)))
Training a custom model requires a corpus of labelled documents. Let's build a small one by hand, starting from a tokenized Text:
In [6]:
text = Text('''Eesti Vabariik on riik Põhja-Euroopas.''')
text.tokenize_words()
pprint(text)
Next, let's add named entity tags to each word in the document:
In [7]:
words = text.words
# label each word as "other":
for word in words:
    word['label'] = 'O'
# label words "Eesti Vabariik" as a location
words[0]['label'] = 'B-LOC'
words[1]['label'] = 'I-LOC'
# label word "Põhja-Euroopas" as a location
words[4]['label'] = 'B-LOC'
pprint(text.words)
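Hand-labelled annotations are easy to get wrong, so before saving a corpus it can be worth checking that every I- tag actually continues an entity of the same category. A small sketch of such a check (plain Python, not part of estnltk) could be:

```python
def check_bio_labels(labels):
    """Return the positions of I- tags that do not continue
    a preceding B-/I- tag of the same category."""
    errors = []
    prev = 'O'
    for i, label in enumerate(labels):
        if label.startswith('I-'):
            # Valid only after B-X or I-X with the same category X.
            if not (prev.startswith(('B-', 'I-')) and prev[2:] == label[2:]):
                errors.append(i)
        prev = label
    return errors

print(check_bio_labels(['B-LOC', 'I-LOC', 'O', 'O', 'B-LOC']))  # []
print(check_bio_labels(['O', 'I-LOC', 'B-PER', 'I-ORG']))       # [1, 3]
```

Running it over `[w['label'] for w in text.words]` before serialization catches stray I- tags early, when they are still easy to fix.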
Once we have a collection of labelled documents, we can save it to disk using the function write_json_corpus():
In [8]:
from estnltk.corpus import write_json_corpus
documents = [text]
write_json_corpus(documents, 'output_file_name')
Out[8]:
This serializes each document object into a JSON string and writes it to the specified file, one document per line. The resulting training file can be used with the NerTrainer as shown above.
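The one-JSON-object-per-line layout is not specific to estnltk; a minimal sketch of the same round trip with the standard json module looks like this (the file name and field names below are illustrative, not estnltk's exact serialization schema):

```python
import json

# Write two toy documents in JSON-lines form (one object per line).
docs = [{'text': 'Eesti Vabariik on riik.'},
        {'text': 'Riigikogu on parlament.'}]
with open('toy_corpus.jsonl', 'w', encoding='utf-8') as f:
    for doc in docs:
        f.write(json.dumps(doc, ensure_ascii=False) + '\n')

# Read the documents back, one line at a time.
with open('toy_corpus.jsonl', encoding='utf-8') as f:
    loaded = [json.loads(line) for line in f]

print(loaded == docs)  # True
```

The line-oriented format means large corpora can be streamed document by document instead of being loaded whole.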
By default, estnltk uses the configuration module estnltk.estner.settings. A settings module defines training algorithm parameters, entity categories, feature extractors and feature templates. The simplest way to create a custom configuration is to make a new settings module, e.g. custom_settings.py, import the default settings and override the necessary parts. For example, a minimalistic custom configuration module could look like this:
In [9]:
%%writefile custom_settings.py
from estnltk.estner.settings import *
# Override feature templates
TEMPLATES = [
(('lem', 0),),
]
# Override feature extractors
FEATURE_EXTRACTORS = (
"estnltk.estner.featureextraction.MorphFeatureExtractor",
)
In [10]:
import custom_settings
In [11]:
ner_settings2 = custom_settings
Now the NerTrainer instance can be initialized with the custom_settings module (make sure custom_settings.py is on your Python path):
In [12]:
trainer = NerTrainer(ner_settings2)