Natural language processing (NLP) involves the observation and manipulation of text strings, including but not limited to: transforming orthographies, tagging things within a text (such as parts-of-speech, syntax, proper nouns), counting things (e.g., word frequencies), and segmenting (dividing a text by sentences or by words). We'll cover a few of these here.
Latin orthography uses a few characters for sentence-final punctuation ('.', '?', '!'); however, problems arise because "." may also be used in non-sentence-final situations, most frequently abbreviated praenomina ('M. Tullius Cicero', 'Cn. Pompeius Magnus').
For Greek texts, similar challenges arise when periods (usually in the form of ellipses '...') are used as an editorial convention to mark missing characters or words.
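To see why this matters, here is a minimal sketch (made-up example sentences, not CLTK code) of how naive splitting on '.' mishandles an abbreviated praenomen:

# naive splitting treats the abbreviated praenomen as a sentence break
naive_example = 'M. Tullius Cicero consul erat. Cn. Pompeius Magnus imperator erat.'
print(naive_example.split('. '))
# ['M', 'Tullius Cicero consul erat', 'Cn', 'Pompeius Magnus imperator erat.']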
In [1]:
# Let's start by binding a text string to a variable
# intro to Cato's de agricultura
cato_agri_praef = "Est interdum praestare mercaturis rem quaerere, nisi tam periculosum sit, et item foenerari, si tam honestum. Maiores nostri sic habuerunt et ita in legibus posiverunt: furem dupli condemnari, foeneratorem quadrupli. Quanto peiorem civem existimarint foeneratorem quam furem, hinc licet existimare. Et virum bonum quom laudabant, ita laudabant: bonum agricolam bonumque colonum; amplissime laudari existimabatur qui ita laudabatur. Mercatorem autem strenuum studiosumque rei quaerendae existimo, verum, ut supra dixi, periculosum et calamitosum. At ex agricolis et viri fortissimi et milites strenuissimi gignuntur, maximeque pius quaestus stabilissimusque consequitur minimeque invidiosus, minimeque male cogitantes sunt qui in eo studio occupati sunt. Nunc, ut ad rem redeam, quod promisi institutum principium hoc erit."
In [2]:
print(cato_agri_praef)
In [3]:
# http://docs.cltk.org/en/latest/latin.html#sentence-tokenization
from cltk.tokenize.sentence import TokenizeSentence
In [4]:
tokenizer = TokenizeSentence('latin')
In [5]:
cato_sentence_tokens = tokenizer.tokenize_sentences(cato_agri_praef)
In [6]:
print(cato_sentence_tokens)
In [7]:
# This has correctly identified 9 sentences
print(len(cato_sentence_tokens))
In [8]:
# viewed another way
for sentence in cato_sentence_tokens:
    print(sentence)
    print()
The CLTK offers several ways to segment word tokens, that is, to automatically detect word boundaries. For most languages, splitting on whitespace and punctuation suffices; however, there are important edge cases.
For general tokenization, one of the methods here will likely work: http://docs.cltk.org/en/latest/multilingual.html#word-tokenization.
For the Latin language, we have a special word tokenizer which separates enclitics such as '-que' and '-ve'.
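As a quick illustration of those edge cases (a throwaway sketch, not the CLTK's approach), plain whitespace splitting leaves punctuation attached to words and cannot separate enclitics:

# whitespace splitting keeps punctuation glued to the preceding word
print('si tam honestum. Maiores nostri'.split())
# ['si', 'tam', 'honestum.', 'Maiores', 'nostri']
# and it cannot split an enclitic such as '-que' from its host word
print('bonum agricolam bonumque colonum'.split())
# ['bonum', 'agricolam', 'bonumque', 'colonum']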
In [9]:
# import general-use word tokenizer
from cltk.tokenize.word import nltk_tokenize_words
In [10]:
cato_word_tokens = nltk_tokenize_words(cato_agri_praef)
In [11]:
print(cato_word_tokens)
Notice that punctuation marks are split off as separate tokens and are thus counted as words. One way to remove them is with a list comprehension:
In [12]:
cato_word_tokens_no_punt = [token for token in cato_word_tokens if token not in ['.', ',', ':', ';']]
In [13]:
print(cato_word_tokens_no_punt)
In [14]:
# number of words
print(len(cato_word_tokens_no_punt))
In [15]:
# the set() function removes duplicates from a list
# let's see how many unique words are in here
cato_word_tokens_no_punt_unique = set(cato_word_tokens_no_punt)
print(cato_word_tokens_no_punt_unique)
In [16]:
print(len(cato_word_tokens_no_punt_unique))
In [17]:
# there's a mistake here, though:
# capitalized words ('At', 'Est', 'Nunc') are counted separately from their lowercase forms
# so let's lowercase the input string and try again
cato_agri_praef_lowered = cato_agri_praef.lower()
cato_word_tokens_lowered = nltk_tokenize_words(cato_agri_praef_lowered)
In [18]:
# now see all lowercase
print(cato_word_tokens_lowered)
In [19]:
# now let's do everything again
cato_word_tokens_no_punt_lowered = [token for token in cato_word_tokens_lowered if token not in ['.', ',', ':', ';']]
cato_word_tokens_no_punt_unique_lowered = set(cato_word_tokens_no_punt_lowered)
print(len(cato_word_tokens_no_punt_unique_lowered))
Observe that this corrected count has two fewer unique words.
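If you want to see which words were being double-counted, one quick check (a sketch built on the variables above, not part of the original steps) is to look for tokens whose lowercase form also appears in the mixed-case set:

# tokens whose capitalized and lowercase forms both appear in the mixed-case set
double_counted = {t for t in cato_word_tokens_no_punt_unique
                  if t != t.lower() and t.lower() in cato_word_tokens_no_punt_unique}
print(double_counted)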
In [20]:
from cltk.tokenize.word import WordTokenizer
word_tokenizer = WordTokenizer('latin')
cato_cltk_word_tokens = word_tokenizer.tokenize(cato_agri_praef_lowered)
cato_cltk_word_tokens_no_punt = [token for token in cato_cltk_word_tokens if token not in ['.', ',', ':', ';']]
# now you can see the enclitic '-que' split off as its own token
print(cato_cltk_word_tokens_no_punt)
In [21]:
# more total words than before, since the enclitics are now separate tokens
print(len(cato_cltk_word_tokens_no_punt)) # was 109
In [22]:
# a more accurate count of unique words
cato_cltk_word_tokens_no_punt_unique = set(cato_cltk_word_tokens_no_punt)
print(len(cato_cltk_word_tokens_no_punt_unique)) # balances out to be the same (90)
In [23]:
# .difference() is an easy way to compare two sets
cato_cltk_word_tokens_no_punt_unique.difference(cato_word_tokens_no_punt_unique_lowered)
Out[23]:
Normalization is not an exciting task, but it is an important one in NLP. The CLTK offers help for Greek, Latin, Sanskrit, and Egyptian.
One of the most common issues is the use of 'j' and 'v' (for 'i' and 'u', respectively) before vowels. Docs: http://docs.cltk.org/en/latest/latin.html#converting-j-to-i-v-to-u.
In [24]:
from cltk.stem.latin.j_v import JVReplacer
j = JVReplacer()
replaced_text = j.replace('vem jam')
print(replaced_text)
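The same replacer can be applied to a whole text, for example the lowered preface from above (a usage sketch; the variable name is our own):

cato_agri_praef_normalized = j.replace(cato_agri_praef_lowered)
print(cato_agri_praef_normalized[:60])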
The challenge with Greek is working with Unicode (and you should always use Unicode). Some guidance here: http://docs.cltk.org/en/latest/greek.html#accentuation.
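For instance, one common first step (sketched here with Python's standard-library unicodedata module rather than the CLTK helpers described in the docs) is making sure accented characters are all in one consistent Unicode form before comparing or counting them:

import unicodedata
# the same accented Greek word can be encoded precomposed or with combining marks
menin = 'μῆνιν'
nfc = unicodedata.normalize('NFC', menin)  # precomposed (composed) form
nfd = unicodedata.normalize('NFD', menin)  # base letters plus combining accents
print(nfc == nfd)                          # False: they look identical but differ in code points
print(len(nfc), len(nfd))                  # the decomposed form uses more code points
print(unicodedata.normalize('NFC', nfd) == nfc)  # True: normalizing makes them comparable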
In [25]:
# let's start with the easiest method, which is to use Counter() from Python's standard library
from collections import Counter
In [26]:
# pass the full token list (not the unique set), so that every occurrence is counted
cato_word_counts_counter = Counter(cato_cltk_word_tokens_no_punt)
print(cato_word_counts_counter)
In [27]:
# cato_word_counts_counter is a Counter, which behaves like a Python dictionary
# get the frequency of a particular word like this:
print(cato_word_counts_counter['et'])
In [28]:
print(cato_word_counts_counter['qui'])
In [29]:
print(cato_word_counts_counter['maiores'])
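A Counter can also report the most frequent tokens directly; for example (a usage sketch not in the original cells):

# the ten most common tokens in the preface, with their counts
print(cato_word_counts_counter.most_common(10))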
Lexical diversity is a simple measure: the number of unique words divided by the total number of words. It gives a rough sense of how varied, rather than repetitive, an author's vocabulary is.
Such lexical measures are simple but can be illuminating nevertheless.
For example, here are a few visualizations done on the Greek canon.
For code, see https://github.com/kylepjohnson/notebooks/tree/master/public_talks/2016_10_26_harvard.
In [30]:
# lexical diversity of this little paragraph
print(len(cato_cltk_word_tokens_no_punt_unique) / len(cato_cltk_word_tokens_no_punt))
# that is, the ratio of unique words to total words
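To reuse this measure on other token lists, you might wrap it in a small helper function (a hypothetical convenience, not part of the CLTK):

def lexical_diversity(tokens):
    """Ratio of unique tokens to total tokens; higher means a more varied vocabulary."""
    return len(set(tokens)) / len(tokens)

print(lexical_diversity(cato_cltk_word_tokens_no_punt))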
In [31]:
# a visualization of surviving word counts in the Greek canon
from IPython.display import Image
Image('images/tableau_bubble.png')
Out[31]:
In [32]:
Image('images/lexical_diversity_greek_canon.png')
Out[32]:
In [34]:
# http://docs.cltk.org/en/latest/latin.html#stopword-filtering
# the easiest way to do this in Python is to use a list comprehension to remove stopwords
from cltk.stop.latin.stops import STOPS_LIST
print(STOPS_LIST)
In [35]:
cato_no_stops = [w for w in cato_cltk_word_tokens_no_punt if w not in STOPS_LIST]
# observe no stopwords
#! consider others you might want to add to the Latin stops list
print(cato_no_stops)
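If you do want to extend the stoplist, one simple way (a sketch; the added words are only placeholders, so choose ones suited to your own corpus) is to build a custom list on top of the CLTK's:

# hypothetical custom additions to the CLTK stoplist
custom_stops = STOPS_LIST + ['item', 'autem']
cato_no_stops_custom = [w for w in cato_cltk_word_tokens_no_punt if w not in custom_stops]
print(cato_no_stops_custom)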