Language Detection Model

Import the modules needed for preprocessing, training, and testing.


In [1]:
import LangByWord as lbw
import BuildTrainingDataFiles as btdf

Preprocess the dataset

Set up the names of the source and destination directories below:


In [2]:
# Set the input directory for preprocessing here:
base_input_dir = '/Users/frank/data/LanguageDetectionModel/txt'
# Set the output directory for the preprocessing here:
base_output_dir = '/Users/frank/data/LanguageDetectionModel/exp_data_test'

Begin preprocessing the language data


In [3]:
build_obj = btdf.BuildTrainingDataFiles()
build_obj.start_building(base_input_dir, base_output_dir)


Processing directory: /Users/frank/data/LanguageDetectionModel/txt/bg
Processing directory: /Users/frank/data/LanguageDetectionModel/txt/cs
Processing directory: /Users/frank/data/LanguageDetectionModel/txt/da
Processing directory: /Users/frank/data/LanguageDetectionModel/txt/de
Processing directory: /Users/frank/data/LanguageDetectionModel/txt/el
Processing directory: /Users/frank/data/LanguageDetectionModel/txt/en
Processing directory: /Users/frank/data/LanguageDetectionModel/txt/es
Processing directory: /Users/frank/data/LanguageDetectionModel/txt/et
Processing directory: /Users/frank/data/LanguageDetectionModel/txt/fi
Processing directory: /Users/frank/data/LanguageDetectionModel/txt/fr
Processing directory: /Users/frank/data/LanguageDetectionModel/txt/hu
Processing directory: /Users/frank/data/LanguageDetectionModel/txt/it
Processing directory: /Users/frank/data/LanguageDetectionModel/txt/lt
Processing directory: /Users/frank/data/LanguageDetectionModel/txt/lv
Processing directory: /Users/frank/data/LanguageDetectionModel/txt/nl
Processing directory: /Users/frank/data/LanguageDetectionModel/txt/pl
Error on Unicode Decode. File will be ignored: /Users/frank/data/LanguageDetectionModel/txt/pl/ep-09-10-22-009.txt
Processing directory: /Users/frank/data/LanguageDetectionModel/txt/pt
Processing directory: /Users/frank/data/LanguageDetectionModel/txt/ro
Processing directory: /Users/frank/data/LanguageDetectionModel/txt/sk
Processing directory: /Users/frank/data/LanguageDetectionModel/txt/sl
Processing directory: /Users/frank/data/LanguageDetectionModel/txt/sv
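
The log above shows one Polish file being skipped because it could not be decoded. The internals of BuildTrainingDataFiles are not shown here, so the following is only a minimal sketch of that kind of behaviour, assuming the input files are expected to be UTF-8. read_text_files is a hypothetical helper, not part of the module.

```python
from pathlib import Path


def read_text_files(lang_dir):
    """Yield (path, text) for each .txt file in lang_dir that decodes as UTF-8.

    Hypothetical helper: files that fail to decode are reported and skipped,
    mirroring the "File will be ignored" message in the log.
    """
    for path in sorted(Path(lang_dir).glob('*.txt')):
        try:
            text = path.read_text(encoding='utf-8')
        except UnicodeDecodeError:
            print(f'Error on Unicode Decode. File will be ignored: {path}')
            continue
        yield path, text
```

Skipping a single unreadable file keeps the rest of the language directory usable, at the cost of losing that file's sentences.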

Train the language detection model and save the trained object to a file.


In [4]:
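# Train on the preprocessed data written to base_output_dir
lo = lbw.LangByWord()
lo.train(base_output_dir, max_words_per_lang=0, report_freq=0)
# Print the most probable word for each language
lo.print_most_prob_words()
# Save the trained object so it can be reloaded for testing
object_file = 'LbyW_obj.pck'
lo.save_object_to_file(object_file)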


final stats: bg unique words: 96355 sentences: 125622
final stats: cs unique words: 167943 sentences: 214385
final stats: da unique words: 388074 sentences: 701603
final stats: de unique words: 424052 sentences: 687737
final stats: el unique words: 264307 sentences: 510871
final stats: en unique words: 175125 sentences: 729287
final stats: es unique words: 207479 sentences: 706481
final stats: et unique words: 290487 sentences: 215298
final stats: fi unique words: 732503 sentences: 663728
final stats: fr unique words: 217814 sentences: 714641
final stats: hu unique words: 311165 sentences: 208222
final stats: it unique words: 238525 sentences: 710810
final stats: lt unique words: 229957 sentences: 211860
final stats: lv unique words: 161945 sentences: 211143
final stats: nl unique words: 307921 sentences: 704014
final stats: pl unique words: 173748 sentences: 211642
final stats: pt unique words: 221651 sentences: 704834
final stats: ro unique words: 86004 sentences: 128298
final stats: sk unique words: 166801 sentences: 211830
final stats: sl unique words: 138086 sentences: 206456
final stats: sv unique words: 382638 sentences: 676258
Most probable word of each language:
language: en word: the prob: 0.0767
language: pt word: de prob: 0.0477
language: hu word: a prob: 0.0850
language: it word: di prob: 0.0400
language: de word: die prob: 0.0446
language: sv word: att prob: 0.0411
language: cs word: a prob: 0.0375
language: da word: at prob: 0.0347
language: sk word: a prob: 0.0372
language: es word: de prob: 0.0692
language: ro word: de prob: 0.0478
language: bg word: на prob: 0.0660
language: et word: ja prob: 0.0330
language: sl word: in prob: 0.0329
language: lv word: un prob: 0.0381
language: fr word: de prob: 0.0556
language: nl word: de prob: 0.0761
language: fi word: ja prob: 0.0387
language: lt word: ir prob: 0.0395
language: pl word: w prob: 0.0385
language: el word: και prob: 0.0324
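
The output above reports, for each language, a most probable word and its probability, which suggests a per-language word-frequency model. The implementation of LangByWord is not shown, so the following is only a sketch of one way such a model could work: relative word frequencies per language, and classification by summed log-probability. The function names, the whitespace tokenization, and the floor value for unseen words are all assumptions.

```python
import math
from collections import Counter


def train_word_probs(sentences_by_lang):
    """Return {lang: {word: relative frequency}} from {lang: [sentence, ...]}."""
    model = {}
    for lang, sentences in sentences_by_lang.items():
        # Assumption: lowercase and split on whitespace; the real tokenizer is unknown.
        counts = Counter(word for s in sentences for word in s.lower().split())
        total = sum(counts.values())
        model[lang] = {word: count / total for word, count in counts.items()}
    return model


def most_probable_word(model, lang):
    """Return (word, probability) for the most frequent word of lang."""
    return max(model[lang].items(), key=lambda item: item[1])


def classify(model, sentence, floor=1e-7):
    """Return the language whose word probabilities best explain the sentence.

    Words never seen for a language get the small probability `floor`
    (an assumed value) so that the logarithm stays defined.
    """
    words = sentence.lower().split()

    def score(lang):
        probs = model[lang]
        return sum(math.log(probs.get(word, floor)) for word in words)

    return max(model, key=score)


toy_data = {
    'en': ['the cat sat on the mat', 'the dog ran'],
    'nl': ['de kat zat op de mat', 'de hond liep'],
}
toy_model = train_word_probs(toy_data)
print(most_probable_word(toy_model, 'en'))
print(classify(toy_model, 'de hond zat'))
```

Summing logarithms rather than multiplying probabilities avoids numeric underflow on long sentences.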

Test the model trained during the last step.

Set the path of the test file in test_file_name below:


In [5]:
test_file_name = '/Users/frank/data/LanguageDetectionModel/europarl.test'
# Reload the saved object and evaluate it on the test file
lo2 = lbw.LangByWord.load_object_from_file(object_file)
lo2.test_on_test(test_file_name, report_freq=0)


lv->pt  "Es runāju par Banco Português de Negócios un Banco Privado Português."
sk->cs  "Je to také jednoduché."
Error count: 2 sentence count: 21000 percent error rate:  0.0095
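
The reported figure can be checked by hand: 2 errors in 21000 sentences is 2 / 21000 × 100 ≈ 0.0095 percent. The two lines before the summary appear to list the expected language followed by the predicted one, though that reading is an assumption. percent_error_rate below is a hypothetical helper, not a function of LangByWord.

```python
def percent_error_rate(error_count, sentence_count):
    """Return the error rate as a percentage of all tested sentences."""
    return 100.0 * error_count / sentence_count


rate = percent_error_rate(2, 21000)
print(f'percent error rate: {rate:.4f}')  # percent error rate: 0.0095
```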