This tutorial describes how to work with the Eesti
Koondkorpus and convert its
files from TEI format to the JSON format used by Estnltk. After this
conversion, you can check the database tutorial (database_tutorial) to see how to
import all these documents into a fast searchable text database. First,
download all the XML TEI format files to your computer, into a folder
corpora/koond
. Check the subcategories of the site to find the
download links.
On my computer, I have the following list of files:
ls -1 corpora/koond/
Agraarteadus.zip
Arvutitehnika.zip
Doktoritood.zip
EestiArst.zip
Ekspress.zip
foorum_uudisgrupp_kommentaar.zip
Horisont.zip
Ilukirjandus.zip
Kroonika.zip
LaaneElu.zip
Luup.zip
Maaleht.zip
Paevaleht.zip
Postimees.zip
Riigikogu.zip
Seadused.zip
SLOleht.tar.gz
Teadusartiklid.zip
Valgamaalane.zip
Next, we go into this directory and unzip all the files:
cd corpora/koond/
unzip "*.zip"
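Note that unzip only handles the .zip archives; SLOleht.tar.gz is a gzipped tarball, so it needs to be unpacked separately, for example:
tar xzf SLOleht.tar.gz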
As a result, we get a number of folders with a structure similar to the one below:
├── Kroonika
│ ├── bin
│ │ ├── koondkorpus_main_header.xml
│ │ └── tei_corpus.rng
│ └── Kroon
│ ├── bin
│ │ └── header_aja_kroonika.xml
│ └── kroonika
│ ├── kroonika_2000
│ │ ├── aja_kr_2000_12_08.xml
│ │ ├── aja_kr_2000_12_15.xml
│ │ ├── aja_kr_2000_12_22.xml
│ │ └── aja_kr_2000_12_29.xml
│ ├── kroonika_2001
│ │ ├── aja_kr_2001_01_05.xml
│ │ ├── aja_kr_2001_01_12.xml
│ │ ├── aja_kr_2001_01_19.xml
│ │ ├── aja_kr_2001_01_22.xml
The bin
folders contain headers and corpus descriptions, and such folders can
occur at several levels of the hierarchy. As we are only interested in the actual
articles themselves, we should ignore all files that have bin
in
their path and use only the files that end with .xml
.
Here is a script that performs this basic conversion:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals, print_function, absolute_import

import os
import os.path
import argparse
import logging

from estnltk.teicorpus import parse_tei_corpus
from estnltk.corpus import write_document

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger('koondkonverter')


def get_target(fnm):
    """Based on the file name, determine the TEI element that holds
    the actual documents of the corresponding subcorpus."""
    if 'drtood' in fnm:
        return 'dissertatsioon'
    if 'ilukirjandus' in fnm:
        return 'tervikteos'
    if 'seadused' in fnm:
        return 'seadus'
    if 'EestiArst' in fnm:
        return 'ajakirjanumber'
    if 'foorum' in fnm:
        return 'teema'
    if 'kommentaarid' in fnm:
        return 'kommentaarid'
    if 'uudisgrupid' in fnm:
        return 'uudisgrupi_salvestus'
    if 'jututoad' in fnm:
        return 'jututoavestlus'
    if 'stenogrammid' in fnm:
        return 'stenogramm'
    return 'artikkel'


def process(start_dir, out_dir, encoding=None):
    """Walk the directory tree and convert all TEI XML files found in
    the leaf directories, skipping everything under bin folders."""
    for dirpath, dirnames, filenames in os.walk(start_dir):
        # Process only leaf directories that contain files and are not
        # located under a bin folder
        if len(dirnames) > 0 or len(filenames) == 0 or 'bin' in dirpath:
            continue
        for fnm in filenames:
            full_fnm = os.path.join(dirpath, fnm)
            out_prefix = os.path.join(out_dir, fnm)
            target = get_target(full_fnm)
            # Skip files whose first output document already exists
            if os.path.exists(out_prefix + '_0.txt'):
                logger.info('Skipping file {0}, because it seems to be already processed'.format(full_fnm))
                continue
            logger.info('Processing file {0} with target {1}'.format(full_fnm, target))
            docs = parse_tei_corpus(full_fnm, target=target, encoding=encoding)
            # A single TEI file may contain several documents; write each
            # one to a separate output file
            for doc_id, doc in enumerate(docs):
                out_fnm = '{0}_{1}.txt'.format(out_prefix, doc_id)
                logger.info('Writing document {0}'.format(out_fnm))
                write_document(doc, out_fnm)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Convert a bunch of TEI XML files to Estnltk JSON files")
    parser.add_argument('startdir', type=str, help='The path of the downloaded and extracted koondkorpus files')
    parser.add_argument('outdir', type=str, help='The directory to store output results')
    parser.add_argument('-e', '--encoding', type=str, default=None, help='Encoding of the TEI XML files')
    args = parser.parse_args()

    process(args.startdir, args.outdir, args.encoding)
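Before running the full conversion, it can be useful to sanity-check the parsing step on a single file. A minimal sketch: the file path comes from the Kroonika listing shown earlier, and 'artikkel' is the target that get_target returns for newspaper files:
from estnltk.teicorpus import parse_tei_corpus

# Parse a single TEI XML file (path taken from the directory tree above);
# newspaper files use the default target 'artikkel'
docs = parse_tei_corpus('corpora/koond/Kroonika/Kroon/kroonika/kroonika_2000/aja_kr_2000_12_08.xml',
                        target='artikkel')
print(len(docs))  # number of documents extracted from this file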
Create an output directory corpora/converted
for the results and run
the script with the appropriate parameters:
python3 -m estnltk.examples.convert_koondkorpus corpora/koond corpora/converted
The results can be downloaded from here: http://ats.cs.ut.ee/keeletehnoloogia/estnltk/koond.zip .
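Each converted file holds one document in Estnltk's JSON format, which can be loaded back for further processing. A minimal sketch, assuming the read_document counterpart of write_document in estnltk.corpus; the file name below is only an example of the script's <input file name>_<document index>.txt naming scheme:
from estnltk.corpus import read_document

# Load one converted JSON document back into memory
# (the file name is an example, not a guaranteed output)
doc = read_document('corpora/converted/aja_kr_2000_12_08.xml_0.txt')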
Note
Currently, this zip package does not include files from
SLOleht.tar.gz
. In order to include the files from SLOleht, please download SLOleht.tar.gz
, unpack the contents, and use the script estnltk.examples.convert_koondkorpus
to obtain the missing part of the corpus.
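For example, if SLOleht.tar.gz has been unpacked under corpora/koond as shown earlier, the missing part can be converted with the same command as before (the SLOleht folder name is an assumption about what the tarball unpacks to):
python3 -m estnltk.examples.convert_koondkorpus corpora/koond/SLOleht corpora/converted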
The default sentence tokenizer can produce a tokenization of suboptimal quality when applied to the koondkorpus
files, as many texts in the koondkorpus
have already been tokenized at the word level, which is unexpected input for the default sentence tokenizer.
However, EstNLTK also provides a special sentence tokenizer, SentenceTokenizerForKoond, which fixes several known sentence-splitting problems in the corpus.
In the following examples, the default sentence tokenizer is compared with SentenceTokenizerForKoond on a text with problematic tokenization.
In [1]:
# a text containing problematic tokenisations (as can be found in koondkorpus):
problematic_tok = '''Kõigi võistlejate seast jõudsid punktikohale Tipp ( 2. ) ja Täpp ( 4. ) ja Käpp ( 7. ) .
Bänd , mis moodustati 1968 . aastal .
Kirjandusel ( resp. raamatul ) on läbi aegade olnud erinevaid funktsioone .
Iga päev teeme valikuid.Valime kõike alates pesupulbrist ja lõpetades autopesulatega.'''
In [2]:
from estnltk import Text
# Use the default sentence tokenizer
text = Text( problematic_tok )
text.sentence_texts
Out[2]:
In [3]:
from estnltk.tokenizers.sent_tokenizer_for_koond import SentenceTokenizerForKoond
from estnltk import Text
kwargs = {
    "sentence_tokenizer": SentenceTokenizerForKoond()
}
# Use the koondkorpus specific tokenizer
text = Text( problematic_tok, **kwargs )
text.sentence_texts
Out[3]:
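Comparing the two outputs should show the difference: the default tokenizer tends to insert spurious sentence breaks after whitespace-separated ordinals such as '( 2. )' and '1968 .', while SentenceTokenizerForKoond keeps these sentences intact.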