Text Processing

This is an introduction to some algorithms used in text analysis. While I cannot define what questions a scholar should ask, I can and do describe here the kind of information about a text that some popular methods deliver. What to make of this is then a matter of your own research interests and creativity...

I will describe methods for finding words that are characteristic of a certain passage ("tf/idf"), constructing fingerprints for passages that go beyond the most significant words ("word vectors"), grouping passages according to their similarity ("clustering"), and forming an idea of the different contexts treated in a passage ("topic modelling"). Of course, an important resource in text analysis is the hermeneutic interpretation of the scholar herself, so I will present a method for adding manual annotations to the text, and finally I will also say something about possible approaches to working across languages. This page will not cover stylistic analyses ("stylometry") or typical neighbourhood relations between words ("collocation", "word2vec"). Maybe these can be discussed on another occasion and on another page.

For many of the steps discussed on this page there are ready-made tools and libraries, often with easy interfaces. But first, it is important to understand what these tools are actually doing and how their results are affected by the choice of parameters (whether or not one can modify them). And second, most of these tools expect their input in some particular format, say, a series of plaintext files in their own directory. So, by understanding the process, you should be better prepared to provide your text to the tools in the most productive way. Finally, it is important to be aware of what information has been lost at which point in the process. If the research requires it, one can then either look for a different tool or approach for that step (e.g. using an additional dimension in the list of words to keep both original and regularised word forms, or remembering the position of the current token in the original text), or one can compensate for the data loss (e.g. by offering a lemmatised search to find occurrences after the analysis returns only normalised word forms)...

Preparations

As indicated above, before doing any maths, language processing tools normally expect their input to be in a certain format. First of all, you have to have an input in the first place: a scholar wishing to experiment with such methods should therefore avail herself of a full transcription of the text to be studied. This can come from transcribing it herself, from transcriptions available elsewhere, or even from OCR (although in the latter case, the results depend of course on the quality of the OCR output). Second, many tools get tripped up when formatting or bibliographical meta-information is included in their input. And since the approaches presented here are not concerned with a digital edition or any other form of true representation of the source, markup (e.g. for bold font, heading or note elements) should be suppressed. (Other tools accept marked-up text and strip the formatting internally.)

For another detail regarding these plain text files, we have to make a short excursus, because even with plain text there are some important aspects to consider: As you surely know, computers understand numbers only, and as you probably also know, the first standards for encoding alphanumeric characters as numbers, like ASCII, were designed for teleprinters and the reduced character set of the English language. When more unusual characters, like umlauts or accented letters, were to be encoded, one had to rely on extra rules, of which there have, unfortunately, been quite a lot. These rule sets are called "encodings"; one of the more important families is the Windows encodings (e.g. CP-1252), another is called Latin-9/ISO 8859-15 (it differs from the older Latin-1 encoding, among other things, by including the Euro sign). Maybe you have seen web pages with garbled umlauts or other special characters: that was probably because your browser interpreted the numbers according to an encoding different from the one the author of the page used. Anyway, the point here is that there is another standard encompassing virtually all the special signs of all languages, and for quite a few years now it has also been supported well by operating systems, programming languages and linguistic tools. This standard is called "Unicode", and the encoding you want to use is called UTF-8. So when you export or import your texts, try to make sure that this is what is used. (Here is a webpage with the complete Unicode table - it is loaded incrementally, so make sure to scroll down. On the other hand, it is so extensive that you will not want to scroll through all of it...)
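
To illustrate what is at stake, here is a small example (not part of the preparation steps, and the sample string is of course made up): the very same bytes, interpreted once as UTF-8 and once as CP-1252, give two very different results:


In [ ]:
data = 'Maß und Schönheit'.encode('utf-8')   # a sample string, stored as UTF-8 bytes

print(data.decode('utf-8'))      # read back as UTF-8:   Maß und Schönheit
print(data.decode('cp1252'))     # read back as CP-1252: MaÃŸ und SchÃ¶nheit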

Also, you should consider whether or not you can replace abbreviations with their expanded versions. While at some points (e.g. when lemmatising) you can associate expansions with abbreviations, the whole processing is easier when words in the text are indeed words, and periods mark sentence boundaries rather than abbreviations. Of course, this also depends on the effort you can spend on the text...
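
One very simple way of doing this would be a lookup table of expansions applied to the text before any further processing. The abbreviations below are made-up examples; a real list (and a more careful, word-level replacement) depends entirely on your material:


In [ ]:
expansions = {'Ioh.': 'Iohannes',      # hypothetical abbreviations
              'cap.': 'capite',        # and their expansions
              'q.':   'quaestione'}

sample = 'Vide Ioh. cap. 21. et q. 3.'
for abbreviation, expansion in expansions.items():
    sample = sample.replace(abbreviation, expansion)

print(sample)                          # Vide Iohannes capite 21. et quaestione 3.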

This section describes how the plaintext can be prepared further for analysis: e.g. if you want to process the distribution of words in the text, the processing method has to have some notion of different places in the text -- normally you want to manage words not according to their absolute position in the whole work (say, the 6,349th word and the 3,100th), but according to their occurrence in a particular section (say, in the third chapter, without caring too much whether it is in the 13th or the 643rd position in this chapter). So, you partition the text into meaningful segments which you can then label, compare etc.

Other preparatory work includes suppressing stopwords (like "the", "is", "of" in English) or making the tools treat different forms of the same word, or different historical spellings, identically. Here is what falls under this category:

Get fulltext

For the examples given on this page, I have loaded a plaintext export of Francisco de Vitoria's "Relectiones" from the School of Salamanca's project, available as one single file at this URL: [http://api.salamanca.school/txt/works.W0013.orig]. I have saved this to the file TextProcessing_2017/W0013.orig.txt.


In [229]:
bigsourcefile = 'TextProcessing_2017/W0013.orig.txt'         # This is the path to our file
input = open(bigsourcefile, encoding='utf-8').readlines()   # We use a variable 'input' for
                                                            # keeping its contents.

input[:10]                                     # Just for information,
                                               # let's see the first 10 lines of the file.


Out[229]:
['                  REVERENDI  PATRIS F. FRANCISCI DE VIctoria, ordinis Prædicatorũ, sacræ Theologiæ   in Salmanticensi Academia quondam  primarij Professoris, Relectiones  Theologicæ XII. in duos  Tomos diuisæ:  Quarum seriem uersa pagella iudicabit.   SVMMARIIS suis ubique locis adiectis, una cum  INDICE omnium copiosißimo.  TOMVS PRIMVS.           Lugduni, apud Iacobum Boyerium,  M. D. LVII.   Cum priuilegio Regis ad decennium.     \n',
 '  \n',
 '        PRIMVS TOMVS, \n',
 '  \n',
 '   De\n',
 '  \n',
 '  -\xa0Potestate ecclesiæ, prior & posterior. \n',
 '  -\xa0   Potestate ciuili.\n',
 '  -\xa0   Potestate concilij.\n',
 '  -\xa0   Indis prior.\n']

Segment source text

Next, as mentioned above, we want to associate information with individual passages of the text, not with the text as a whole. Therefore, the text has to be segmented: the single file is split into meaningful smaller chunks. What exactly constitutes a meaningful chunk -- a chapter, an article, a paragraph etc. -- cannot be known independently of the text in question and of the research questions. It is therefore suggested that the scholar either splits the text manually or inserts some symbol that does not otherwise appear in the text; processing tools can then identify these markers and split the file accordingly (a sketch of this follows below). To keep things neat and orderly, the resulting files should be saved in a directory of their own...
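
For illustration: if one had inserted a marker such as '###' at every meaningful break, the splitting could be as simple as this (a hypothetical sketch; our source file does not actually contain such markers, which is why a different approach is used below):


In [ ]:
# Hypothetical: assume '###' had been inserted at every meaningful break.
marked_text = open('TextProcessing_2017/W0013.orig.txt', encoding='utf-8').read()
chunks = marked_text.split('###')                  # a list with one string per segment
print(str(len(chunks)) + ' segments found.')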

Here, I am splitting the file arbitrarily every 80 lines... Note, though, that this leads to a rather unusual condition: all segments are of (roughly) the same length. When counting words and assessing their relative "importance", a word that occurs twice in a very short passage tells us more about that passage than the same two occurrences would in a very, very long one. Later, we will see ways to compensate for the normal variance in passage length.


In [230]:
splitLen = 80                 # 80 lines per file
outputBase = 'TextProcessing_2017/segment'  # TextProcessing_2017/segment.0.txt, segment.1.txt, etc.

count = 0                     # initialise some variables.
at    = 0
dest  = None                  # this later takes our destination files

for line in input:
    if count % splitLen == 0:                 # every splitLen lines, start a new segment file
        if dest: dest.close()
        dest = open(outputBase + '.' + str(at) + '.txt', encoding='utf-8', mode='w')   # 'w' is for writing: here we open the file the current segment is being written to
        at += 1
    dest.write(line.strip())
    count += 1
if dest: dest.close()                         # close (and flush) the last segment file, too

print(str(at - 1) + ' files written.')


45 files written.

Read segments into a variable

From the segments, we rebuild our corpus, iterating through them and reading them into another variable (which now stores, technically speaking, a list of strings).


In [231]:
import errno

path     = 'TextProcessing_2017'
filename = 'segment.'
suffix   = '.txt'
corpus   = []

for i in range(0, at - 1):
    try:
        with open(path + '/' + filename + str(i) + suffix, encoding='utf-8') as f:
            corpus.append(f.read())   # the 'with' block closes the file for us
    except IOError as exc:
        if exc.errno != errno.EISDIR: # Do not fail if a directory is found, just ignore it.
            raise # Propagate other kinds of IOError.

Now we should have 45 strings in the variable corpus to play around with:


In [232]:
len(corpus)


Out[232]:
45

For a quick impression, let's see the opening 500 characters of an arbitrary one of them:


In [233]:
corpus[5][:500]


Out[233]:
'claues uerò in foro interiori, Iohan. 20.{Iohan. 20.}Primatum autem, & plenitudinem potestatis uidetur Petrus accepisse Iohã.  21.{Iohan. 21.}Pasce oues meas. & quod Armachanus ait,  si non simul acceperunt totam potestatem, fore, ut non esset unum sacramentum ordinis,  non necesse est. Quãquam enim Christus potestatem clauium non simul totam nec uno loco dedit: non ideò tamen consequitur, ut etiã   pontifices possint eam authoritatem diuidere, sed totam simul tribuunt, & uno ordinis sacramento:'

Tokenising

"Tokenising" means splitting the long lines of the input into single words. Since we are dealing with plain latin, we can use the default split method which relies on spaces to identify word boundaries. (In languages like Japanese or scripts like Arabic, this is more difficult.) Note that we do not compensate for words that are hyphenated/split across lines here!


In [234]:
import re

tokenised = []
for segment in corpus:
    # split on sequences of non-word characters, drop empty strings, lowercase everything
    tokenised.append(list(filter(None, (word.lower() for word in re.split(r'\W+', segment)))))

For our examples, let's have a look at (the first 50 words of) an arbitrary one of those segments:


In [235]:
print(tokenised[5][:50])


['claues', 'uerò', 'in', 'foro', 'interiori', 'iohan', '20', 'iohan', '20', 'primatum', 'autem', 'plenitudinem', 'potestatis', 'uidetur', 'petrus', 'accepisse', 'iohã', '21', 'iohan', '21', 'pasce', 'oues', 'meas', 'quod', 'armachanus', 'ait', 'si', 'non', 'simul', 'acceperunt', 'totam', 'potestatem', 'fore', 'ut', 'non', 'esset', 'unum', 'sacramentum', 'ordinis', 'non', 'necesse', 'est', 'quãquam', 'enim', 'christus', 'potestatem', 'clauium', 'non', 'simul', 'totam']
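
As an aside to the hyphenation caveat above: if words split across lines were a concern, one could re-join them before segmenting and tokenising, for instance like this (a sketch which assumes that hyphenation is marked by a '-' at the end of a line; it is not applied in the rest of this page):


In [ ]:
import re

# Remove a '-' at the end of a line together with the following line break,
# gluing the two halves of the word back together:
dehyphenated = re.sub(r'-\s*\n\s*', '', ''.join(input))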

Already, we can have a first go at finding the most frequent words of a segment. (For this we use the 'collections' module from Python's standard library.):


In [236]:
import collections
counter = collections.Counter(tokenised[5])
print(counter.most_common(10))


[('in', 105), ('non', 103), ('est', 101), ('ad', 77), ('quòd', 59), ('nec', 56), ('sed', 50), ('potestas', 50), ('papa', 47), ('ut', 45)]

Perhaps now is a good opportunity for a small excursus. What we have printed in the last code cell is a series of pairs: words and their number of occurrences, sorted by the latter. Yet the display looks a bit ugly. With another library called "pandas" (for Python data analysis), we can make this more intuitive. (Of course, your system must have this library installed in the first place so that we can import it in our code.):


In [237]:
import pandas as pd
df1 = pd.DataFrame.from_dict(counter, orient='index').reset_index()   # one row per word
df2 = df1.rename(columns={'index':'lemma',0:'count'})                 # give the columns proper names

df2.sort_values(by='count', ascending=False)[:10]


Out[237]:
        lemma  count
2          in    105
23        non    103
35        est    101
191        ad     77
137      quòd     59
40        nec     56
53        sed     50
150  potestas     50
162      papa     47
29         ut     45

Looks better now, doesn't it?

Stemming / Lemmatising

Next, since we prefer to count different word forms as one and the same "lemma", we need a step called "lemmatisation". In languages like English, which are not strongly inflected, one can get away with "stemming", i.e. just cutting off the endings of words: "wish", "wished", "wishing", "wishes" can all count as instances of "wish*". With Latin this is not so easy: we want to count occurrences of "legum", "leges", "lex" as one and the same word, but if we truncate after "le", we get too many hits that have nothing to do with lex at all. There are a couple of lemmatising tools available; here, we build our own with a dictionary approach...

First, we need a dictionary which associates all known word forms with their lemma. This also helps us with historical orthography. Suppose that, from some other context, we have a file "wordforms-lat.txt" at our disposal in the TextProcessing_2017 directory. Its contents look like this:


In [238]:
wordfile_path = 'TextProcessing_2017/wordforms-lat.txt'
wordfile = open(wordfile_path, encoding='utf-8')

print(wordfile.read()[:59])    # show just the first few entries
wordfile.close()               # note the parentheses: they make this an actual method call


a > a
à > a
ab > a
abbas > abbas
abbate > abbas
abbatem > 

So, we can build a dictionary of key-value pairs, associating each wordform (a "key") with its lemma (the corresponding "value"):


In [239]:
lemma    = {}    # we build a so-called dictionary for the lookups
tempdict = []    # a temporary list of (wordform, lemma) pairs

wordfile = open(wordfile_path, encoding='utf-8')

for line in wordfile.readlines():
    tempdict.append(tuple(line.split('>')))

lemma = {k.strip(): v.strip() for k, v in tempdict}
wordfile.close()
print(str(len(lemma)) + ' wordforms registered.')


1682 wordforms registered.

Again, a quick test: Let's see with which basic word the wordform "ciuicior" is associated, or, in other words, what value our lemma variable returns when we query for the key "ciuicior":


In [240]:
lemma['ciuicior']


Out[240]:
'civicus'

Now we can use this dictionary to build a new list of words, where only lemmatised forms occur:


In [241]:
lemmatised = [[lemma[word] if word in lemma else word for word in segment] \
              for segment in tokenised]

print(lemmatised[5][:50])


['clavis', 'uerò', 'in', 'foro', 'interiori', 'iohan', '20', 'iohan', '20', 'primatum', 'autem', 'plenitudinem', 'potestas', 'uidetur', 'petrus', 'accepisse', 'iohã', '21', 'iohan', '21', 'pasce', 'oues', 'meas', 'qui', 'armachanus', 'aio', 'si', 'nolo', 'simul', 'acceperunt', 'totam', 'potestas', 'fore', 'ut', 'nolo', 'sum', 'unum', 'sacramentum', 'ordo', 'nolo', 'necesse', 'sum', 'quãquam', 'enim', 'christus', 'potestas', 'clavis', 'nolo', 'simul', 'totam']

As you can see, the original text is now lost from the data we are currently working with (unless we add another dimension to our lemmatised variable to keep the original word forms). But let us see whether anything has changed among the 10 most frequent words:


In [242]:
counter2 = collections.Counter(lemmatised[5])
df1 = pd.DataFrame.from_dict(counter2, orient='index').reset_index()
df2 = df1.rename(columns={'index':'lemma',0:'count'})

df2.sort_values(by='count', ascending=False)[:10]


Out[242]:
        lemma  count
29        sum    223
10   potestas    120
2          in    105
23       nolo    103
178        ad     77
129      quòd     59
37        nec     56
150      papa     52
50        sed     50
103     habeo     49

Yes, things have changed: "esse/sum" has moved to first place, "non" is now counted among the instances of "nolo" (I am not sure this makes sense, but such is the dictionary of wordforms we have used), and "potestas" has made it from eighth to second place!

Eliminate Stopwords

Probably "sum/esse", "non/nolo", "in", "ad" and the like are not really very informative words. They are what one calls stopwords, and we have another list of such words that we would rather want to ignore.


In [243]:
stopwords_path = 'TextProcessing_2017/stopwords-lat.txt'
stopwords = open(stopwords_path, encoding='utf-8').read().splitlines()

print(str(len(stopwords)) + ' stopwords, e.g.: ' + str(stopwords[24:54]))


297 stopwords, e.g.: ['ab', 'ac', 'ad', 'adhic', 'adhuc', 'ae', 'ait', 'ali', 'alii', 'aliis', 'alio', 'aliqua', 'aliqui', 'aliquid', 'aliquis', 'aliquo', 'am', 'an', 'ante', 'apud', 'ar', 'at', 'atque', 'au', 'aut', 'autem', 'bus', 'c', 'ca', 'ceptum']

Now let's try and suppress the stopwords in the segments...


In [244]:
stopped = [[item for item in lemmatised_segment if item not in stopwords] \
           for lemmatised_segment in lemmatised]
print(stopped[5][:20])


['clavis', 'uerò', 'foro', 'interiori', 'iohan', 'iohan', 'primatum', 'plenitudinem', 'potestas', 'uidetur', 'petrus', 'accepisse', 'iohã', 'iohan', 'pasce', 'oues', 'meas', 'armachanus', 'aio', 'simul']

With this, we can already create a first "profile" of our first 4 segments:


In [245]:
counter3 = collections.Counter(stopped[0])
counter4 = collections.Counter(stopped[1])
counter5 = collections.Counter(stopped[2])
counter6 = collections.Counter(stopped[3])

df0_1 = pd.DataFrame.from_dict(counter3, orient='index').reset_index()
df0_2 = df0_1.rename(columns={'index':'lemma',0:'count'})

df1_1 = pd.DataFrame.from_dict(counter4, orient='index').reset_index()
df1_2 = df1_1.rename(columns={'index':'lemma',0:'count'})

df2_1 = pd.DataFrame.from_dict(counter5, orient='index').reset_index()
df2_2 = df2_1.rename(columns={'index':'lemma',0:'count'})

df3_1 = pd.DataFrame.from_dict(counter6, orient='index').reset_index()
df3_2 = df3_1.rename(columns={'index':'lemma',0:'count'})

print(' ')
print(' Most frequent lemmata in the first text segment')
print(df0_2.sort_values(by='count',axis=0,ascending=False)[:10])
print(' ')
print(' ')
print(' Most frequent lemmata in the second text segment')
print(df1_2.sort_values(by='count',axis=0,ascending=False)[:10])
print(' ')
print(' ')
print(' Most frequent lemmata in the third text segment')
print(df2_2.sort_values(by='count',axis=0,ascending=False)[:10])
print(' ')
print(' ')
print(' Most frequent lemmata in the fourth text segment')
print(df3_2.sort_values(by='count',axis=0,ascending=False)[:10])


 
 Most frequent lemmata in the first text segment
          lemma  count
42     ecclesia     46
183        queo     14
270        homo      8
715  haereticus      8
823       video      7
706    synagoga      7
281         vir      7
342         res      7
3      victoria      6
831        dico      6
 
 
 Most frequent lemmata in the second text segment
           lemma  count
50      potestas     52
16      ecclesia     24
51       civilis     14
66   spiritualis     10
248        matth      9
107         deus      8
57          queo      6
81        regnum      6
167          rom      6
269          luc      6
 
 
 Most frequent lemmata in the third text segment
           lemma  count
53      potestas     50
413     peccatum     50
420       clavis     28
305  spiritualis     23
581     remissio     23
419     peccator     22
67          deus     22
144         dico     21
445        dolor     15
263         semi     14
 
 
 Most frequent lemmata in the fourth text segment
              lemma  count
219        potestas     74
222     spiritualis     32
252             ius     28
10             deus     22
59             queo     18
273            lego     16
254  ecclesiasticus     13
1              dico     13
265            tota     12
122           solum     12

So much for our initial analyses, then. There are several ways in which we could continue. We could, for example, use either our lemma list or our stopwords list to filter out certain words, like all non-substantives...

However, we can already observe that meaningful words like "potestas" are maybe not so helpful in characterising individual passages after all, since they occur all over the place. Also, we would like to give some weight to the fact that one passage may consist almost entirely of stopwords plus one or two substantial words, whereas another might be full of substantial words and contain few stopwords (think e.g. of an abstract or an opening chapter describing the rest of the work). And in case we have text segments of varying length (which is probably the norm rather than the exception), we would like our figures to reflect the fact that a tenfold occurrence in a very short passage may be more significant than a tenfold occurrence in a very, very long one.

These phenomena are addressed with somewhat more mathematical tools, so let's say that our preparatory work is done ...

Characterise passages: TF/IDF

As described, we are now going to delve a wee bit deeper into mathematics in order to get more precise characterisations of our text segments. The approach we are going to use is called "TF/IDF" (term frequency / inverse document frequency): a word's frequency within a segment is weighted against the number of segments in which it occurs at all. It is a simple yet powerful method that is very popular in data mining and in discussions of search engines.

Since maths works with numbers, let's first of all build a list of all the words (in their basic form) that occur anywhere in the text, and give each one of those words an ID:


In [ ]:
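
As a rough sketch of how this could be done (one possible approach, not necessarily the one intended here; it reuses the stopped list of lemmatised, stopword-free segments from above and already includes a simple tf/idf weighting for illustration):


In [ ]:
import collections
import math

# Give every lemma that occurs anywhere in the text a numerical ID
vocabulary = {}
for segment in stopped:
    for word in segment:
        if word not in vocabulary:
            vocabulary[word] = len(vocabulary)

# In how many segments does each word occur? (the "document frequency")
num_segments = len(stopped)
df = collections.Counter()
for segment in stopped:
    df.update(set(segment))

def tfidf(word, segment):
    tf  = segment.count(word) / len(segment)       # relative frequency within this segment
    idf = math.log(num_segments / df[word])        # penalty for words occurring everywhere
    return tf * idf

# e.g. the ten highest-scoring lemmata of segment 5
scores = {word: tfidf(word, stopped[5]) for word in set(stopped[5])}
sorted(scores.items(), key=lambda pair: pair[1], reverse=True)[:10]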

Find similar passages: Clustering

  • Find good measure (word vectors, authorities cited, style, ...)
  • Find starting centroids
  • Find good K value
  • K-Means clustering (a rough sketch follows below)
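
To give an idea of what such a clustering could look like in practice, here is a minimal sketch (assuming the scikit-learn library is installed; the choice of features and of K = 5 is arbitrary here, not the result of the considerations listed above):


In [ ]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Re-join the lemmatised, stopword-free tokens into one string per segment
documents = [' '.join(segment) for segment in stopped]

vectors = TfidfVectorizer().fit_transform(documents)    # tf/idf-weighted word vectors

kmeans = KMeans(n_clusters=5, random_state=0, n_init=10)
labels = kmeans.fit_predict(vectors)

print(labels)    # for each segment, the number of the cluster it was assigned to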

Topic Modelling

...
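
Just to give a first impression of what a topic modelling experiment might look like, here is a minimal sketch (assuming the gensim library; the number of topics and of passes are arbitrary choices, and our prepared stopped segments serve as input):


In [ ]:
from gensim import corpora, models

# Build a gensim dictionary and a bag-of-words representation of our segments
dictionary = corpora.Dictionary(stopped)
bow_corpus = [dictionary.doc2bow(segment) for segment in stopped]

# Train a small LDA topic model
lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=5, passes=10)

for topic in lda.print_topics(num_words=8):
    print(topic)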

Manual Annotation

...

Cope with different languages

...

This is the API documentation of conceptnet.io, which we can use to look up synonyms, related terms and translations, with a URI like the following:

http://api.conceptnet.io/related/c/la/rex?filter=/c/es
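
For example, a lookup of terms related to the Latin "rex", restricted to Spanish, could be queried like this (a sketch assuming the requests library and a working internet connection; the field names follow the ConceptNet API documentation):


In [ ]:
import requests

response = requests.get('http://api.conceptnet.io/related/c/la/rex?filter=/c/es').json()

for entry in response['related'][:10]:
    print(entry['@id'], entry['weight'])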

Further information