Text Processing

Introduction

This is an introduction to some algorithms used in text analysis. While I cannot define what questions a scholar may want to ask, I can and do describe here what kind of information about a text some popular methods can deliver. Beyond that, you will need to draw on your own research interests and creativity...

I will describe methods of finding words that are characteristic of a certain passage ("tf/idf"), and of constructing fingerprints or "wordclouds" for passages that go beyond the most significant words ("word vectors"). Of course, an important resource in text analysis is the hermeneutic interpretation of the scholar herself, so I will present a method of adding manual annotations to the text, and finally I will also say something about possible approaches to working across languages.

At the moment, the following topics are still waiting to be discussed: grouping passages according to their similarity ("clustering"), and forming an idea of the different contexts being treated in a passage ("topic modelling"). Some more prominent approaches in the areas mentioned so far are "collocation" analyses and the "word2vec" tool; I would like to add discussions of these at a later moment.

"Natural language processing" in the strict sense, i.e. analyses that have an understanding of how a language works, with its grammar, different modes, times, cases and the like, are not going to be covered; this implies "stylometric" analyses. Nor are there any discussions of "artificial intelligence" approaches. Maybe these can be discussed at another occasion and on another page.

For many of the steps discussed on this page there are ready-made tools and libraries, often with easy interfaces. But first, it is important to understand what these tools are actually doing and how their results are affected by the selection of parameters (that one can or cannot modify).

And second, most of these tools expect the input to be in some particular format, say, a series of plaintext files in their own directory, a list of (word, number) pairs, a table, or a series of integer (or floating-point) numbers, etc. So, by understanding the process, you should be better prepared to provide your text to the tools in the most productive way.

Finally, it is important to be aware of what information is lost at which point in the process. If the research requires it, one can then either look for a different tool or approach to this step (e.g. using an additional dimension in the list of words to keep both original and regularized word forms, or remembering the position of the current token in the original text), or one can compensate for the data loss (e.g. by offering a lemmatised search to find occurrences after the analysis has returned only normalised word forms)...

The programming language used in the following examples is called "python" and the tool used to get prose discussion and code samples together is called "jupyter". In jupyter, you have a "notebook" that you can populate with text or code and a program that pipes a nice rendering of the notebook to a web browser. In this notebook, in many places, the output that the code samples produce is printed right below the code itself. Sometimes this can be quite a lot of output and depending on your viewing environment you might have to scroll quite some way to get to the continuation of the discussion. You can save your notebook online (the current one is here at github) and there is an online service, nbviewer, able to render any notebook that it can access online. So chances are you are reading this present notebook at the web address https://nbviewer.jupyter.org/github/awagner-mainz/notebooks/blob/master/gallery/TextProcessing_Solorzano.ipynb.

A final word about the elements of this notebook:

At some points I am mentioning things I consider to be important decisions or take-away messages for scholarly readers. E.g. whether or not to insert certain artefacts into the very transcription of your text, what the methodological ramifications of a certain approach or parameter are, what the implications of an example solution are, or what a possible interpretation of a certain result might be. I am highlighting these things in a block like this one here or at least in **green bold font**.
**NOTE:** As I have continued improving the notebook on the side of the source text, wordlists and other parameters, I could (for lack of time) not keep the prose description in sync. So while the general descriptions still apply, the numbers mentioned in the prose (as where we speak e.g. of a "table with 20 rows and 1.672 columns") might no longer reflect the latest state of the sources, auxiliary files and parameters. I will try to update these as I get to it, but for now, you should take such numbers with a grain of salt and rely rather on the actual code and its diagnostic output. I apologize for the inconsistency.

Preparations

As indicated above, before doing maths, language processing tools normally expect their input to be in a certain format. First of all, you have to have an input in the first place: a scholar wishing to experiment with such methods needs a full transcription of the text to be studied. This can come from transcribing it herself, from transcriptions that are available elsewhere, or even from OCR (although in the latter case, the results of course depend on the quality of the OCR output). Second, many tools get tripped up when formatting or bibliographical meta-information is included in their input. And since the approaches presented here are not concerned with a digital edition or any other form of true representation of the source, markup (e.g. for bold font, heading or note elements) should be suppressed. (Other tools accept marked-up text and strip the formatting internally.) So you should try to get a copy of the text(s) you are working with in plaintext format.
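If your transcription exists as TEI/XML or HTML, a crude way of obtaining such a plaintext version is to strip everything that looks like a tag. The following is only a rough sketch with hypothetical filenames; a real edition file would deserve a proper XML parser (e.g. lxml), so that notes or apparatus elements can be dropped selectively:

import re

# Hypothetical input/output filenames - adjust to your own marked-up transcription.
markedup = open('my_edition.xml', encoding='utf-8').read()

# Replace anything between angle brackets with a space (a very blunt instrument:
# it keeps the text content of notes, headers etc. along with the main text).
plaintext = re.sub(r'<[^>]+>', ' ', markedup)

with open('my_edition_plain.txt', 'w', encoding='utf-8') as out:
    out.write(plaintext)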

For another detail regarding these plain text files, we have to make a short excursus, because even with plain text, there are some important aspects to consider: As you surely know, computers understand numbers only, and as you probably also know, the first standards for encoding alphanumeric characters as numbers, like ASCII, were designed for teleprinters and the reduced character set of the English language. When more unusual characters, like umlauts or accented letters, were to be encoded, one had to rely on extra rules, of which - unfortunately - there have been quite a lot. These are called "encodings", and one of the more important sets of such rules is the Windows encodings (e.g. CP-1252); another one is called Latin-9/ISO 8859-15 (it differs from the older Latin-1 encoding, among other things, by including the Euro sign). Maybe you have seen web pages with garbled umlauts or other special characters; that was probably because your browser interpreted the numbers according to an encoding different from the one that the webpage author used. Anyway, the point here is that there is another standard encompassing virtually all the special signs from all languages, and for a few years now it has also been supported quite well by operating systems, programming languages and linguistic tools. This standard is called "Unicode", and the encoding you want to use is called utf-8. So when you export or import your texts, try to make sure that this is what is used. (Here is a webpage with the complete Unicode table - it is loaded incrementally, so make sure to scroll down in order to get an impression of what signs this standard covers. But on the other hand, it is so extensive that you won't want to scroll through the whole table...)

Especially when you are coming from a Windows operating system, you might have to do some searching about how to export your text to utf-8. (At one point I managed to make a "unicode" plaintext export in WordPad, only to find out, after some time of desperate debugging, that it was utf-16 that I had been given. Maybe you can still find the traces of my own conversion of such files to utf-8 below.)
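Should you end up with a file in such a legacy encoding, converting it takes only a few lines of Python. The filenames and the source encoding below are hypothetical (use 'utf-16' instead of 'cp1252' if that is what your export produced); this snippet is not part of the processing pipeline of this notebook:

# Read with the legacy encoding, write back out as utf-8.
with open('mytext_windows.txt', encoding='cp1252') as infile:
    content = infile.read()
with open('mytext_utf8.txt', 'w', encoding='utf-8') as outfile:
    outfile.write(content)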

Also, you should consider whether or not you can replace *abbreviations* with their expanded versions in your transcription. While at some points (e.g. when lemmatising) you can associate expansions with abbreviations, the whole processing is easier when words in the text are indeed words, and periods are sentence punctuation rather than abbreviation signs. Of course, this also depends on the effort you can spend on the text...

This section now describes how the plaintext can further be prepared for analyses: E.g. if you want to process the distribution of words in the text, the processing method has to have some notion of different places in the text -- normally you don't want to manage words according to their absolute position in the whole work (say, the 6.349th word and the 3.100th one), but according to their occurrence in a particular section (say, in the third chapter, without caring too much whether it is in the 13th or in the 643rd position in this chapter). So, you partition the text into meaningful segments which you can then label, compare etc.

Other preparatory work includes suppressing stopwords (like "the", "is", "of" in English) or making the tools treat different forms of the same word, or different historical spellings, identically. Here is what falls under this category:

Get fulltext

For the examples given on this page, I am using a transcription of Juan de Solorzano's De Indiarum Iure, provided by Angela Ballone. Angela has inserted a special sequence of characters - "€€€ - [<Label for the section>]" - at places where she felt that a new section or argument begins, so that we can segment the big source file into different sections, each dealing with one particular argument. (Our first task.) But first, let's have a look at our big source file; it is in the folder "Solorzano" and is called Sections_I.1_TA.txt.


In [1]:
# This is the path to our file
bigsourcefile = 'Solorzano/Sections_I.1_TA.txt'
# We use a variable 'input' for keeping its contents.
input = open(bigsourcefile, encoding='utf-8').readlines()

# Just for information, let's see the first 10 lines of the file.
input[0:10]   # python starts counting with '0', so this gives us the lines 0 through 9,
              # i.e. exactly ten lines.
              # Since there is no line wrapping in the source file, a line can be quite long.
             # You can see the lines ending with a "newline" character "\n" in the output.


Out[1]:
['€€€ - [Book title]\n',
 'LIBER PRIMUS:\n',
 'In quo de personis, et servitiis Indorum.\n',
 '  Caput Primum.\n',
 '  De statu et libertate Indorum in communi; et de origine et damnatione servitii personalis eorum, quod sub tributorum colore iniuste ab aliquibus usurpatur.\n',
 '€€€ - [Indians are free]\n',
 '  Quae prioribus (1) illis libris scripsimus, quos nuper circa iustam harum Indiarum Occidentalium inquisitionem, acquisitionem, et retentionem luce donavimus, ea, ut ibidem advertimus, praestolantur, quae ad earundem gubernationem spectant. In quibus proponendis et exponendis, ut recto ordine procedatur, ab Indorum personis, earumque statu, et conditione initium capessemus; (2) quarum in omnibus iuris quaestionibus priorem, potioremque; inspectionem esse debere, optime docuit I.C. in l.2 D. de stat. hom. l. si quaeremus 6. D. de testam. l. quidam referunt 14. D. de iure codicil. § ult. Instit. de iur. natur. l.2 § post originem, de orig. iur.\n',
 '  Et plane (3) ipsos Indos Naturali, ac Civili Iure inspecto, et seriis, ac repetitis Regum nostrorum iussionibus et schedulis liberos esse, et ut liberos tractari debere, satis luculenter probatum reliquimus in lib.3 prioris voluminis, c.7 per tot. et optime supponit elegantissimus P. Ioseph. ACOSTA lib.2 de procur. Ind. salut. c.7.pag.235 quem ibídem n.53 retulimus, et iterum graviter repetit lib.3.cap.17 sic inquiens: Atque inprimis Indos non esse servitute mulctatos, sed liberos prorsus, et sui iuris, ex iis, quae in lib.2 disputata sunt, summimus. Etenim et publicae leges ita statuunt, et consuetudo diuturna, et ratio constans, ac certa, quod qui nulla iniuria lacessunt, non possint reddi belli iure captivi.\n',
 '  Sed cum rerum usus, et (4) mixtae iam, ac communis eorundem Indorum, et Hispanorum Reipublicae utilitas, et neccessitas, aliqua munia, sive servitia induxisse, aut etiam extorsisse videatur, quibus illi addici, et distribui coeperunt, quae isthaec, et qualia sint, et quatenus iuxta iuris regulas subsistere possint? Hoc libro sigillatim percurrere, et distinctis capitibus trutinare conabimur.\n',
 '€€€ - [Definition of «servicios personales»]\n']

Segment source text

Next, as mentioned above, we want to associate information with only passages of the text, not the text as a whole. Therefore, the text has to be segmented. The one big single file is being split into meaningful smaller chunks. What exactly constitutes a meaningful chunk -- a chapter, an article, a paragraph etc. -- cannot be known independently of the text in question and of the research questions. Therefore, a typical approach is that the scholar either splits the text manually or inserts some symbols that otherwise do not appear in the text. This is what we have here. Then, processing tools can find these symbols and split the file accordingly. For keeping things neat and orderly, the resulting files are saved in a directory of their own...

(Note here and in the following that in most cases, when the program is counting, it does so beginning with zero. Which means that if we end up with 20 segments, they are going to be called segment.0.txt, segment.1.txt, ..., segment.19.txt. There is not going to be a segment bearing the number twenty, although we do have twenty segments. The first one has the number zero and the twentieth one has the number nineteen. Even for more experienced coders, this sometimes leads to mistakes, called "off-by-one errors".)
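A quick way of convincing yourself of this counting convention (this snippet has no role in the actual processing):

# Twenty numbers, starting at zero - the last one is 19, not 20.
print(list(range(0, 20)))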


In [2]:
# folder for the several segment files:
outputBase = 'Solorzano/segment'

# initialise some variables:
at    = -1
dest  = None                  # this later takes our destination files

# Now, for every line, if it starts with our special string,
#    do nothing with the line,
#    but close the current and open the next destination file;
# if it does not,
#   append it to whatever is the current destination file
#   (stripping leading and trailing whitespace).
for line in input:
    if line[0:3] == '€€€':
        # if there is a file open, then close it
        if dest:
            dest.close()
        at += 1
        # open the next destination file for writing
        # (Its filename is built from our outputBase variable,
        #  the current position in the sequence of fragments,
        #  and a ".txt" ending)
        dest = open(outputBase + '.' + str(at) + '.txt',
                    encoding='utf-8',
                    mode='w')
    else:
        # write the line (after it has been stripped of leading and trailing whitespace)
        dest.write(line.strip())

dest.close()
at += 1

# How many segments/files do we then have?
print(str(at) + ' files written.')


20 files written.

Read segments into a variable

From the segments just created, we rebuild our corpus, iterating through them and reading them into another variable (which now stores, technically speaking, a list of 20 strings, one for each segment - whereas the variable input in the first code snippet held a list of the individual lines of the big source file).


In [3]:
path = 'Solorzano'
filename = 'segment.'
suffix = '.txt'
corpus = []           # This is our new variable. It will be populated below.

for i in range(0, at):
    with open(path + '/' + filename + str(i) + suffix, encoding='utf-8') as f:
        corpus.append(f.read())    # Here, a new element is added to our corpus.
                                   # Its content is read from the file 'f' opened above.
                                   # (The 'with' block closes the file automatically.)

Now we should have 20 strings in the variable corpus to play around with:


In [4]:
len(corpus)


Out[4]:
20

For a quick impression, let's see the opening 500 characters of an arbitrary one of them; in this case, we take the fourth segment, i.e. the one at position '3' (remember that counting starts at 0):


In [5]:
corpus[3][0:500]


Out[5]:
'Hoc autem servitium (6) inde (ut apparet) originem habuit, quod cum principio detectionis harum regionum, Indi Hispanis commendari coepissent, ut illos protegerent, et in Fide Catholica diligenter instruerent, et huius curae ratione Indi ipsi certum illis pensum, sive tributum praestare iuberentur (de quo inferius secundo Libro plenius agemus) eiusmodi Hispani, qui ita Indos in commendam acceperant, tributi loco, plenam in eos, et eorum bona dominationem usurparunt, nullum opus, quantumvis durum'

Tokenising

"Tokenising" means splitting the long lines of the input into single words. Since we are dealing with plain latin, we can use the default split method which relies on spaces to identify word boundaries. (In languages like Japanese or scripts like Arabic, this is more difficult.) Note that we do not compensate for words that are hyphenated/split across lines here! That is something that should be catered for in the transcription itself.


In [6]:
# We need a python library, because we want to use a "regular expression"
import re

tokenised = []     # A new variable again

# Every segment, initially a long string of characters, is now split into a list of words,
# based on non-word characters (whitespace, punctuation, parentheses and others - that's
# what we need the regular expression library for).
# Also, we make everything lower-case.
for segment in corpus:
    tokenised.append(list(filter(None, (word.lower() for word in re.split(r'\W+', segment)))))

print('We now have ' + str(sum(len(x) for x in tokenised)) + ' wordforms or "tokens" in our corpus of ' + str(len(tokenised)) + ' segments.')


We now have 4208 wordforms or "tokens" in our corpus of 20 segments.

Now, instead of corpus, we can use tokenised for our subsequent routines: a variable which, at 20 positions, contains the list of words of the corresponding segment. In order to see the difference in structure to the corpus variable above, let's have a look at (the first 49 words of) the fourth segment again:


In [7]:
print(tokenised[3][0:49])


['hoc', 'autem', 'servitium', '6', 'inde', 'ut', 'apparet', 'originem', 'habuit', 'quod', 'cum', 'principio', 'detectionis', 'harum', 'regionum', 'indi', 'hispanis', 'commendari', 'coepissent', 'ut', 'illos', 'protegerent', 'et', 'in', 'fide', 'catholica', 'diligenter', 'instruerent', 'et', 'huius', 'curae', 'ratione', 'indi', 'ipsi', 'certum', 'illis', 'pensum', 'sive', 'tributum', 'praestare', 'iuberentur', 'de', 'quo', 'inferius', 'secundo', 'libro', 'plenius', 'agemus', 'eiusmodi']

Already, we can have a first go at finding the most frequent words for a segment. (For this we use a simple library of functions that we import by the name of 'collections'.):


In [8]:
import collections
counter = collections.Counter(tokenised[3])  # Again, consider the fourth segment
print(counter.most_common(10))               # Making a counter 'object' of our segment,
                                             # this now has a 'method' called most_common,
                                             # offering us the object's most common elements.
                                             # More 'methods' can be found in the documentation:
                                             # https://docs.python.org/3/library/collections.html#collections.Counter


[('et', 7), ('in', 5), ('indi', 3), ('non', 3), ('hoc', 2), ('ut', 2), ('quod', 2), ('illis', 2), ('pensum', 2), ('de', 2)]

Nicer layout: tables instead of lists of tuples

Perhaps now is a good opportunity for another small excursus. What we have printed in the last code cell is a series of pairs: words associated with their number of occurrences, sorted by the latter. (Technically, most_common returns a list of tuples; the underlying Counter object behaves like a "dictionary" in python.) However, the display looks a bit ugly. With another library called "pandas" (for "python data analysis"), we can make this look more intuitive. (Of course, your system must have this library installed in the first place so that we can import it in our code.):


In [9]:
import pandas as pd
df1 = pd.DataFrame.from_dict(counter, orient='index').reset_index() # from our counter object,
                                                                    # we now make a DataFrame object
df2 = df1.rename(columns={'index':'lemma',0:'count'})               # and we name our columns

df2.sort_values(by='count', ascending=False)[:10]


Out[9]:
lemma count
21 et 7
22 in 5
15 indi 3
72 non 3
9 quod 2
52 tributi 2
55 eos 2
0 hoc 2
112 vel 2
5 ut 2

Looks better now, doesn't it?

(The bold number in the very first column is, as it were, the id of the respective lemma. You see that 'hoc' has the id '0' - because it was the first word to occur at all - and 'ut' has the id '5' because it was the sixth word in our segment. Most probably, we are currently not interested in the position of the word and can ignore the first column.)

Stemming / Lemmatising

Next, since we prefer to count different word forms as one and the same "lemma", we have to do a step called "lemmatisation". In languages that are not strongly inflected, like English, one can get away with "stemming", i.e. just eliminating the ending of words: "wish", "wished", "wishing", "wishes" all can count as instances of "wish*". With Latin this is not so easy: we want to count occurrences of "legum", "leges", "lex" as one and the same word, but if we truncate after "le", we get too many hits that have nothing to do with lex at all. There are a couple of "lemmatising" tools available, although with classical languages (or even early modern ones), it's a bit more difficult. Anyway, we do our own, using a dictionary approach...
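Just to illustrate what such "stemming" looks like for English - this sketch is not used for our Latin text and assumes that the nltk library is installed on your system:

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
# The Porter stemmer normalises English suffixes:
print([stemmer.stem(w) for w in ['wish', 'wished', 'wishing', 'wishes']])
# It knows nothing about Latin, so forms like these are not unified:
print([stemmer.stem(w) for w in ['lex', 'leges', 'legum']])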

First, we have to have a dictionary which associates all known word forms with their lemma. This can also help us with historical orthography. Suppose that, from some other context, we have a file "wordforms-lat-full.txt" at our disposal in the "Solorzano" folder. Its contents look like this:


In [115]:
wordfile_path = 'Solorzano/wordforms-lat-full.txt'
wordfile = open(wordfile_path, encoding='utf-8')

print(wordfile.read()[:64])     # in such from-to addresses, one can just skip the zero
wordfile.close()


aër > aër
aëre > aër
aërem > aër
aëri > aër
aëris > aër
a > a
ab

So, we again build a dictionary of key-value pairs associating all the lemmata ("values") with their wordforms ("keys"). And afterwards, we can quickly look up the value under a given key:


In [116]:
lemma    = {}    # we build a so-called dictionary for the lookups
tempdict = []

# open the wordfile (defined above) for reading
wordfile = open(wordfile_path, encoding='utf-8')

for line in wordfile.readlines():
    tempdict.append(tuple(line.split('>'))) # we split each line by ">" and append a tuple to a
                                            # temporary list.

lemma = {k.strip(): v.strip() for k, v in tempdict} # for every tuple in the list,
                                                    # we strip whitespace and make a key-value
                                                    # pair, appending it to our "lemma" dictionary
wordfile.close()
print(str(len(lemma)) + ' wordforms known to the system.')


2131211 wordforms known to the system.

Again, a quick test: Let's see with which "lemma"/basic word the particular wordform "fidem" is associated, or, in other words, what value our lemma variable returns when we query for the key "fidem":


In [117]:
lemma['fidem']


Out[117]:
'fides'

Now we can use this dictionary to build a new list of words, where only lemmatised forms occur:


In [118]:
# For each segment, and for each word in it, add the lemma to our new "lemmatised"
# list, or, if we cannot find a lemma, add the actual word from the tokenised list.
lemmatised = [[lemma[word] if word in lemma else word for word in segment]
              for segment in tokenised]

Again, let's see the first 49 words from the fourth segment, and compare them with the "tokenised" variant above:


In [119]:
print(lemmatised[3][:49])


['hic', 'autem', 'servitium', '6', 'inde', 'ut', 'appareo', 'origo', 'habeo', 'qui', 'cum', 'principium', 'detectio', 'hic', 'regio', 'indios', 'hispanis', 'commendo', 'coepio', 'ut', 'ille', 'protego', 'et', 'in', 'fides', 'catholicus', 'diligens', 'instruo', 'et', 'hic', 'cura', 'ratio', 'indios', 'ipse', 'certus', 'ille', 'pendo', 'sive', 'tribuo', 'praesto', 'iubeo', 'de', 'qui', 'inferius', 'secundo', 'liber', 'plenus', 'ago', 'eiusmodi']

As you can see, the original text is now lost from the data we are currently working with - unless we add another dimension to our lemmatised variable which keeps the original word form alongside the lemma.
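Here is a minimal sketch of what such an additional dimension could look like - a list of (original token, lemma) pairs per segment; this variable is not used in the rest of the notebook:

# Keep the original word form next to its lemma, so the source text remains recoverable.
lemmatised_with_original = [[(word, lemma.get(word, word)) for word in segment]
                            for segment in tokenised]
print(lemmatised_with_original[3][:5])

But let us see if something in the 10 most frequent words has changed: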


In [120]:
counter2 = collections.Counter(lemmatised[3])
df1 = pd.DataFrame.from_dict(counter2, orient='index').reset_index()
df2 = df1.rename(columns={'index':'lemma',0:'count'})

df2.sort_values(by='count', ascending=False)[:10]


Out[120]:
lemma count
20 et 7
21 in 5
0 hic 4
9 qui 4
14 indios 4
18 ille 3
64 nolo 3
48 is 3
32 tribuo 3
39 plenus 2

Yes, things have changed: "tributum" has moved one place up, "non" is now counted as "nolo" (I am not sure this makes sense, but such is the dictionary of wordforms we have used) and "pensum" has now made it on the list!

Eliminate Stopwords

Probably "et", "in", "de", "qui", "ad", "sum/esse", "non/nolo" and many of the most frequent words are not really very telling words. They are what one calls stopwords, and we have another list of such words that we would rather want to ignore:


In [121]:
stopwords_path = 'Solorzano/stopwords-lat.txt'
stopwords = open(stopwords_path, encoding='utf-8').read().splitlines()

print(str(len(stopwords)) + ' stopwords known to the system, e.g.: ' + str(stopwords[95:170]))


388 stopwords known to the system, e.g.: ['a', 'ab', 'ac', 'ad', 'adhic', 'adhuc', 'ae', 'ait', 'ali', 'alii', 'aliis', 'alio', 'aliqua', 'aliqui', 'aliquid', 'aliquis', 'aliquo', 'am', 'an', 'ante', 'apud', 'ar', 'at', 'atque', 'au', 'aut', 'autem', 'bus', 'c', 'ca', 'cap', 'ceptum', 'co', 'con', 'cons', 'cui', 'cum', 'cur', 'cùm', 'd', 'da', 'de', 'deinde', 'detur', 'di', 'diu', 'do', 'dum', 'e', 'ea', 'eadem', 'ec', 'eccle', 'ego', 'ei', 'eis', 'eius', 'el', 'em', 'en', 'enim', 'eo', 'eos', 'er', 'erat', 'ergo', 'erit', 'es', 'esse', 'essent', 'esset', 'est', 'et', 'etenim', 'eti']

Now let's try and suppress the stopwords in the segments (and see what the "reduced" fourth segment gives)...


In [122]:
# For each segment, and for each word in it,
# add it to a new list called "stopped",
# but only if it is not listed in the list of stopwords.
stopped = [[item for item in lemmatised_segment if item not in stopwords] \
           for lemmatised_segment in lemmatised]
print(stopped[3][:49])


['servitium', 'inde', 'appareo', 'origo', 'principium', 'detectio', 'regio', 'indios', 'hispanis', 'commendo', 'coepio', 'protego', 'fides', 'catholicus', 'diligens', 'instruo', 'cura', 'ratio', 'indios', 'certus', 'pendo', 'tribuo', 'praesto', 'iubeo', 'inferius', 'secundo', 'liber', 'plenus', 'ago', 'eiusmodi', 'hispani', 'queo', 'indios', 'commendam', 'accipio', 'tribuo', 'loco', 'plenus', 'bonum', 'dominatio', 'usurpo', 'nullus', 'opus', 'quantumvis', 'durus', 'laboriosus', 'praetermitto', 'tamquam', 'serva']

With this, we can already create a kind of first "profile" of, say, our first six segments, listing the most frequent words in each of them:


In [123]:
counter3 = collections.Counter(stopped[0])
df0_1 = pd.DataFrame.from_dict(counter3, orient='index').reset_index()
df0_2 = df0_1.rename(columns={'index':'lemma',0:'count'})
print(' Most frequent lemmata in the first text segment (segment number zero):')
df0_2.sort_values(by='count',axis=0,ascending=False)[:10]


 Most frequent lemmata in the first text segment (segment number zero):
Out[123]:
lemma count
3 servitium 2
4 indios 2
0 liber 1
1 prior 1
2 persona 1
5 caput 1
6 sisto 1
7 libertas 1
8 commune 1
9 origo 1

In [124]:
counter4 = collections.Counter(stopped[1])
df1_1 = pd.DataFrame.from_dict(counter4, orient='index').reset_index()
df1_2 = df1_1.rename(columns={'index':'lemma',0:'count'})
print(' Most frequent lemmata in the second text segment (segment number one):')
df1_2.sort_values(by='count',axis=0,ascending=False)[:10]


 Most frequent lemmata in the second text segment (segment number one):
Out[124]:
lemma count
31 ius 6
1 liber 5
23 indios 4
0 prior 3
55 repeto 2
43 refero 2
47 iur 2
35 debeo 2
36 bonus 2
76 gravis 1

In [125]:
counter5 = collections.Counter(stopped[2])
df2_1 = pd.DataFrame.from_dict(counter5, orient='index').reset_index()
df2_2 = df2_1.rename(columns={'index':'lemma',0:'count'})
print(' Most frequent lemmata in the third text segment:')
df2_2.sort_values(by='count',axis=0,ascending=False)[:10]


 Most frequent lemmata in the third text segment:
Out[125]:
lemma count
0 servitium 2
1 personalis 2
7 soleo 2
8 utilitas 2
9 indios 2
45 contemplatio 1
51 liber 1
50 gravo 1
49 persona 1
48 commodum 1

In [126]:
counter6 = collections.Counter(stopped[3])
df3_1 = pd.DataFrame.from_dict(counter6, orient='index').reset_index()
df3_2 = df3_1.rename(columns={'index':'lemma',0:'count'})
print(' Most frequent lemmata in the fourth text segment:')
df3_2.sort_values(by='count',axis=0,ascending=False)[:10]


 Most frequent lemmata in the fourth text segment:
Out[126]:
lemma count
7 indios 4
20 tribuo 3
21 praesto 2
22 iubeo 2
26 plenus 2
19 pendo 2
18 certus 2
55 species 1
59 quidam 1
58 designo 1

Yay, look here, we have our words "indios", "tribuo" and "pendo" ("pensum") from the top tens above again, but this time the non-significant (for our present purposes) words in between have been eliminated, and more substantial words like "praesto", "iubeo" or "certus" have made it into the top ten.


In [127]:
counter7 = collections.Counter(stopped[4])
df4_1 = pd.DataFrame.from_dict(counter7, orient='index').reset_index()
df4_2 = df4_1.rename(columns={'index':'lemma',0:'count'})
print(' Most frequent lemmata in the fifth text segment:')
df4_2.sort_values(by='count',axis=0,ascending=False)[:10]


 Most frequent lemmata in the fifth text segment:
Out[127]:
lemma count
0 personalis 6
1 servitium 5
72 indios 5
30 anno 4
38 queo 4
79 alea 4
32 semi 4
85 tribuo 3
34 eodem 3
73 novo 2

In [128]:
counter8 = collections.Counter(stopped[5])
df5_1 = pd.DataFrame.from_dict(counter8, orient='index').reset_index()
df5_2 = df5_1.rename(columns={'index':'lemma',0:'count'})
print(' Most frequent lemmata in the sixth text segment:')
df5_2.sort_values(by='count',axis=0,ascending=False)[:10]


 Most frequent lemmata in the sixth text segment:
Out[128]:
lemma count
11 regnum 5
78 alea 5
1 anno 4
64 personalis 3
68 servitium 3
104 nosco 2
25 accipio 2
34 servicio 2
40 indios 2
12 peruanum 2

So far our initial analyses, then. There are several ways in which we can continue now. We see that there are still words (like 'damnatione' or 'tributorum' in the first, or 'statuunt' in the second segment) that are not covered by our lemmatisation process. Also, abbreviations (like 'iur' in the second segment) could be expanded, either in the transcription or by adding an appropriate line to our list of lemmata. Words like 'dom' in the fifth segment could perhaps be added to the list of stopwords. In any case, the need for reviewing these two lists (lemmata/stopwords) is explained further below, and that is something that should definitely be done - after all, they were taken from the context of quite another project, and a scholar should control closely what is being suppressed and what is being replaced in the text at hand.

But we could also do more sophisticated things with these lists. We could e.g. use either our lemma list or our stopwords list to filter out certain classes of words, like all non-substantives. Or we could reduce all mentions of a certain name or literary work to a specific form (that would then be easily recognizable in all places).
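As a small, purely illustrative sketch - the two entries chosen here are hypothetical, and in practice one would rather edit the wordforms and stopwords files themselves and re-run the notebook:

# Extend the two auxiliary lists at runtime (hypothetical examples):
lemma['tributorum'] = 'tributum'   # map a so-far uncovered wordform to its lemma
stopwords.append('dom')            # treat 'dom' as a stopword from now on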

However, we can already observe that meaningful words like "indios/indis" are maybe not so helpful in characterising individual passages of this work, since they occur all over the place. After all, the work is called "De Indiarum Iure" and deals with various questions all related to indigenous people. Also, we would like to give some weight to the fact that one passage may consist almost entirely of stopwords plus perhaps one or two substantial words, whereas another might be full of substantial words and contain only few stopwords (think e.g. of an abstract or an opening chapter describing the rest of the work). Or, since we have text segments of varying length, we would like our figures to reflect the fact that a tenfold occurrence in a very short passage may be more significant than a tenfold occurrence in a very, very, very long passage.

These phenomena are treated with more mathematical tools, so let's say that our preparatory work is done ...

Characterise passages: TF/IDF

As described, we are now going to delve a wee bit deeper into mathematics in order to get more precise characterizations of our text segments. The approach we are going to use is called "TF/IDF" and is a simple, yet powerful method that is very popular in text mining and search engine discussions.

Build vocabulary

Since maths works best with numbers, let's first of all build a list of all the words (in their basic form) that occur anywhere in the text, and give each one of those words an ID (as it happens, the CountVectorizer that we use below simply numbers them in alphabetical order):


In [129]:
# We can use a library function for this
from sklearn.feature_extraction.text import CountVectorizer

# Since the library function can do all of the above (splitting, tokenising, lemmatising),
# and since it is providing hooks for us to feed our own tokenising, lemmatising and stopwords
# resources or functions to it,
# we use it and work on our rather raw "corpus" variable from way above again.

# So first we build a tokenising and lemmatising function to work as an input filter
# to the CountVectorizer function
def ourLemmatiser(str_input):
    wordforms = re.split(r'\W+', str_input)
    return [lemma[wordform].lower().strip() if wordform in lemma else wordform.lower().strip() for wordform in wordforms ]

# Then we initialize the CountVectorizer function to use our stopwords and lemmatising fct.
count_vectorizer = CountVectorizer(tokenizer=ourLemmatiser, stop_words=stopwords)

# Finally, we feed our corpus to the function, building a new "vocab" object
vocab = count_vectorizer.fit_transform(corpus)

# Print some results
print(str(len(count_vectorizer.get_feature_names())) + ' distinct words in the corpus:')
print(count_vectorizer.get_feature_names()[0:100])


1309 distinct words in the corpus:
['1542', '1549', '1555', '1563', '1568', '1581', '1591', '1595', '1601', '1604', '1609', '1610', '1617', '1620', '1639', '342', '43', '550', 'abb', 'aboleo', 'absolvo', 'absque', 'abutor', 'accipio', 'acost', 'acosta', 'acquisitio', 'act', 'act_non_detur', 'actio', 'actum', 'ad_leg', 'addico', 'addo', 'adduco', 'adelante', 'administratio', 'admodum', 'adscriptitiis', 'adversus', 'adverto', 'aedifico', 'aequum', 'aestimo', 'aetas', 'afflictus', 'affligo', 'africae', 'africanos', 'ager', 'agero', 'agia', 'ago', 'agric', 'alban', 'albano', 'alea', 'alguno', 'alibi', 'alieno', 'aliquot', 'alium', 'alius', 'alivio', 'allec', 'allego', 'aloe', 'alphan', 'alt', 'alter', 'ambigo', 'andr', 'angel', 'ann', 'anno', 'annua', 'annus', 'ant', 'antiqua', 'antiquus', 'aperio', 'apostolicus', 'appareo', 'appell', 'appello', 'applico', 'aqaeductu', 'aquí', 'aranjuecii', 'arceo', 'archid', 'ardens', 'arequipensi', 'argentina', 'argumentum', 'art', 'ascribo', 'assero', 'asservio', 'assigno']

You can see how our corpus of roughly four thousand "tokens" actually contains only some 1,300 distinct words (plus stopwords, of which there are at most 388). And, in contrast to simpler numbers, which have been filtered out by our stopwords filter, I have left years like "1610" in place.

Calculate Terms' Text Frequencies (TF)

However, our "vocab" object contains more than just all the unique words in our corpus. Let's get some information about it:


In [130]:
vocab


Out[130]:
<20x1309 sparse matrix of type '<class 'numpy.int64'>'
	with 2019 stored elements in Compressed Sparse Row format>

It is actually a table with 20 rows (the number of our segments) and 1.672 columns (the number of unique words in the corpus). So what we do have is a table where for each segment the amount of occurrences of every "possible" (in the sense of used somewhere in the corpus) word is listed.

("Sparse" means that the majority of fields is zero. And 2.142 fields are populated, which is more than the number of unique words in the corpus (1.672, see above) - that's obviously because some words occur in multiple segments = rows. Not much of a surprise, actually.)

Here is the whole table:


In [131]:
pd.DataFrame(vocab.toarray(), columns=count_vectorizer.get_feature_names())


Out[131]:
1542 1549 1555 1563 1568 1581 1591 1595 1601 1604 ... voluntad voluntarius vos vot votum vulgatus vulgo words1 zalsius zassi
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 1 1 1 1 0 0 0 0 0 1 ... 1 0 0 0 0 0 0 0 0 0
5 0 0 0 0 1 1 1 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 1 0 ... 1 1 0 0 0 0 1 0 0 0
7 0 0 0 0 0 0 0 0 1 0 ... 0 0 0 0 1 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 1
10 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
13 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
14 0 0 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
15 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
16 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 1 0 0 0 0 0 0
17 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
18 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
19 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0

20 rows × 1309 columns

Each row of this table is a kind of fingerprint of a segment: We don't know the order of words in the segment - for us, it is just a "bag of words" - but we know which words occur in the segment and how often they do. As of now, however, it is a rather bad fingerprint, because how significant a certain number of occurrences of a word in a segment is depends on the actual length of the segment. Ignorant as we are (per assumption) of the role and meaning of those words, still, if a word occurs twice in a short paragraph, that should prima facie count as more characteristic of the paragraph than if it occurs twice in a multi-volume work.

Normalise TF

We can reflect this if we divide the number of occurrences of a word by the number of tokens in the segment. Obviously the numbers will then be quite small - but what counts are the relations between the cells, and we can account for scaling and normalising later...
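Just to make the idea concrete, here is that division done by hand for the fourth segment, using the counter6 object from above. (The TfidfVectorizer below performs an equivalent normalisation internally, up to its own tokenising and stopword handling, so the numbers need not match exactly.)

# Relative term frequency: occurrences divided by the number of (stopword-filtered) tokens.
total = sum(counter6.values())
print({word: round(count / total, 4) for word, count in counter6.most_common(5)})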

We're almost there, and we are switching from the CountVectorizer function to another one that does the division just mentioned and will do more later on...


In [132]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the library's function
tfidf_vectorizer = TfidfVectorizer(stop_words=stopwords, use_idf=False, tokenizer=ourLemmatiser, norm='l1')

# Finally, we feed our corpus to the function to build a new "tf_matrix" object
tf_matrix = tfidf_vectorizer.fit_transform(corpus)

# Print some results
pd.DataFrame(tf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names())


Out[132]:
1542 1549 1555 1563 1568 1581 1591 1595 1601 1604 ... voluntad voluntarius vos vot votum vulgatus vulgo words1 zalsius zassi
0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
1 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
2 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
3 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
4 0.003717 0.003717 0.003717 0.003717 0.000000 0.000000 0.000000 0.000000 0.000000 0.003717 ... 0.003717 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
5 0.000000 0.000000 0.000000 0.000000 0.005291 0.005291 0.005291 0.005291 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
6 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.006849 0.000000 ... 0.006849 0.006849 0.000000 0.000000 0.000000 0.000000 0.006849 0.000000 0.000000 0.000000
7 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.005051 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.005051 0.000000 0.000000 0.000000 0.000000 0.000000
8 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
9 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.013889 0.013889
10 0.000000 0.000000 0.000000 0.000000 0.005051 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
11 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
12 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
13 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
14 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.007353 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
15 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
16 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.007042 0.007042 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
17 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
18 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.013889 0.000000 0.000000 0.000000 0.000000
19 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.012195 0.000000 0.000000

20 rows × 1309 columns

Now we have seen above that "indis" is occurring in all of the segments, because, as the title indicates, the whole work is about issues related to the Indies and to indigenous people. When we want to characterize a segment by referring to some of its words, is there a way to weigh down words like "indis" a little bit? Not filter them out completely, as we do with stopwords, but give them just a little less weight than words not appearing all over the place? Yes there is...

Inverse Document Frequencies (IDF) and TF-IDF

There is a measure called "term frequency / inverse document frequency" (tf-idf) that combines a local measure (how frequently a word appears in a segment, in comparison to the other words appearing in the same segment, viz. the table above) with a global measure (how frequently the word appears throughout the whole corpus). Roughly speaking, we have to add to the table above a new, global element: the number of documents the term appears in, divided by the number of all documents in the corpus - or, rather, the other way round (that's why it is the "inverse" document frequency): the number of documents in the corpus divided by the number of documents the current term occurs in. (As with our local measure above, there is also some normalisation going on, i.e. compensation for different lengths of documents and attenuation of high values, by taking the logarithm of the quotient.)

When you multiply the term frequency (from above) with this inverse document frequency, you have a formula which "rewards" frequent occurrences in one segment and rare occurrences over the whole corpus. (For more of the mathematical background, see this tutorial.)
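In its most basic (unsmoothed, unnormalised) form, this amounts to tf-idf(t, d) = tf(t, d) * log(N / df(t)), with N the number of segments and df(t) the number of segments containing the term t. Here is a toy example with an invented three-segment mini-corpus; note that scikit-learn's TfidfVectorizer, used below, applies a slightly different ("smoothed") idf and normalises the resulting vectors, so its absolute values will differ:

import math

docs = [['indios', 'servitium', 'tribuo'],
        ['indios', 'ius', 'liber'],
        ['indios', 'regnum', 'anno']]

def tf(term, doc):
    return doc.count(term) / len(doc)         # local: relative frequency in one segment

def idf(term, docs):
    df = sum(term in doc for doc in docs)     # number of segments containing the term
    return math.log(len(docs) / df)           # global: inverse document frequency

for term in ['indios', 'servitium']:
    print(term, [round(tf(term, d) * idf(term, docs), 3) for d in docs])
# 'indios' occurs everywhere, so its idf (and thus its tf-idf) is zero;
# 'servitium' occurs in only one segment and gets a positive weight there.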

Again, we do not have to implement all the counting, division and logarithm ourselves but can rely on SciKit-learn's TfidfVectorizer function to generate a matrix of our corpus in just a few lines of code:


In [133]:
# Initialize the library's function
tfidf_vectorizer = TfidfVectorizer(stop_words=stopwords, use_idf=True, tokenizer=ourLemmatiser, norm='l2')

# Finally, we feed our corpus to the function to build a new "tfidf_matrix" object
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

# Print some results
tfidf_matrix_frame = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names())

tfidf_matrix_frame


Out[133]:
1542 1549 1555 1563 1568 1581 1591 1595 1601 1604 ... voluntad voluntarius vos vot votum vulgatus vulgo words1 zalsius zassi
0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
1 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
2 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
3 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
4 0.062335 0.062335 0.062335 0.062335 0.000000 0.000000 0.000000 0.000000 0.000000 0.062335 ... 0.054794 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
5 0.000000 0.000000 0.000000 0.000000 0.064922 0.073857 0.073857 0.073857 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
6 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.064248 0.000000 ... 0.071201 0.081001 0.000000 0.000000 0.000000 0.000000 0.081001 0.000000 0.000000 0.000000
7 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.056385 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.071088 0.000000 0.000000 0.000000 0.000000 0.000000
8 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
9 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.124432 0.124432
10 0.000000 0.000000 0.000000 0.000000 0.064789 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
11 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
12 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
13 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
14 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.072300 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
15 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
16 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.088002 0.088002 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
17 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
18 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.134299 0.000000 0.000000 0.000000 0.000000
19 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.117808 0.000000 0.000000

20 rows × 1309 columns

Now let's print a more qualified "top 10" words for each segment:


In [134]:
# convert your matrix to an array to loop over it
mx_array = tfidf_matrix.toarray()

# get your feature names
fn = tfidf_vectorizer.get_feature_names()

for pos, l in enumerate(mx_array):
    print(' ')
    print(' Most significant words segment ' + str(pos) + ':')
    # Sort the row's tf/idf values in descending order, take the top 20 and display them nicely:
    top = pd.DataFrame.from_dict([(fn[x], l[x]) for x in (l*-1).argsort()][:20])
    print(top.rename(columns={0: 'lemma', 1: 'tf/idf value'}))


 
 Most significant words segment 0:
          lemma  tf/idf value
0      damnatio      0.331894
1      iniustus      0.331894
2         sisto      0.291740
3        usurpo      0.291740
4         origo      0.263250
5         prior      0.263250
6         caput      0.241152
7       commune      0.241152
8         color      0.241152
9        indios      0.239917
10    servitium      0.239917
11     libertas      0.223096
12        liber      0.207830
13      persona      0.194606
14   personalis      0.163069
15       tribuo      0.154452
16       perú_p      0.000000
17         peto      0.000000
18         1542      0.000000
19  perversitas      0.000000
 
 Most significant words segment 1:
        lemma  tf/idf value
0         ius      0.340458
1       liber      0.264301
2       prior      0.200868
3      repeto      0.148404
4       bonus      0.133912
5         iur      0.122671
6      indios      0.122043
7       debeo      0.105720
8      refero      0.098994
9     elegans      0.084415
10      recto      0.084415
11     qualis      0.084415
12   quaestio      0.084415
13    disputo      0.084415
14  distribuo      0.084415
15  diuturnus      0.084415
16    prorsus      0.084415
17       dono      0.084415
18   earumque      0.084415
19     procur      0.084415
 
 Most significant words segment 2:
           lemma  tf/idf value
0       utilitas      0.205950
1          soleo      0.152248
2       subdolus      0.129827
3         deludo      0.129827
4         excolo      0.129827
5       aedifico      0.129827
6          aetas      0.129827
7       percipio      0.129827
8          sexus      0.129827
9         defero      0.129827
10          ager      0.129827
11        fodina      0.129827
12         pasco      0.129827
13      operaque      0.129827
14       onustus      0.129827
15     intelligo      0.129827
16  contemplatio      0.129827
17    contractus      0.129827
18    contrarium      0.129827
19       semotus      0.129827
 
 Most significant words segment 3:
           lemma  tf/idf value
0          pendo      0.237419
1        praesto      0.188315
2         plenus      0.188315
3          iubeo      0.172507
4         certus      0.172507
5         indios      0.171623
6         tribuo      0.165730
7          lapis      0.118709
8         quando      0.118709
9        hispani      0.118709
10       appareo      0.118709
11       protego      0.118709
12     commendam      0.118709
13          vexo      0.118709
14  praetermitto      0.118709
15    famulitium      0.118709
16         durus      0.118709
17       instruo      0.118709
18      detectio      0.118709
19    quantumvis      0.118709
 
 Most significant words segment 4:
         lemma  tf/idf value
0   personalis      0.183763
1         anno      0.181170
2        eodem      0.164382
3         semi      0.146202
4           fr      0.124671
5        decad      0.124671
6     provisio      0.124671
7        están      0.124671
8    tractatus      0.124671
9     permitto      0.124671
10      memoro      0.124671
11       dadas      0.124671
12      indios      0.112652
13   servitium      0.112652
14        queo      0.110080
15        voco      0.109588
16        alea      0.104566
17      regnum      0.098886
18     expedio      0.098886
19        ioan      0.090585
 
 Most significant words segment 5:
             lemma  tf/idf value
0           regnum      0.292909
1             anno      0.214657
2             alea      0.154867
3           prorex      0.147715
4         postmodo      0.147715
5         quitensi      0.147715
6         peruanum      0.147715
7            nosco      0.129843
8         servicio      0.117163
9       personalis      0.108865
10          certus      0.107328
11         accipio      0.107328
12             dom      0.107328
13       servitium      0.080084
14       monzonium      0.073857
15         remaneo      0.073857
16            está      0.073857
17           modus      0.073857
18           mitto      0.073857
19  repartimientos      0.073857
 
 Most significant words segment 6:
         lemma  tf/idf value
0         semi      0.284969
1         haya      0.284804
2          por      0.176564
3        casso      0.162002
4        pario      0.162002
5         volo      0.142402
6         paro      0.142402
7    servicios      0.128496
8     servicio      0.128496
9     personal      0.128496
10      indios      0.117107
11      tribuo      0.113085
12         así      0.081001
13  tolerandus      0.081001
14    reparten      0.081001
15    asservio      0.081001
16     tierras      0.081001
17     novembr      0.081001
18     remedio      0.081001
19      assero      0.081001
 
 Most significant words segment 7:
           lemma  tf/idf value
0         magnus      0.213264
1          cuius      0.187462
2       schedula      0.166730
3           curo      0.154956
4   montisclario      0.142176
5        superus      0.142176
6      marchioni      0.142176
7      exsecutio      0.142176
8   excellentiss      0.142176
9         tracto      0.124975
10         mereo      0.124975
11      princeps      0.112770
12        omnino      0.112770
13          anno      0.103304
14           dom      0.103304
15          maga      0.103304
16        indios      0.102775
17          alea      0.089436
18         annus      0.089030
19         video      0.083365
 
 Most significant words segment 8:
          lemma  tf/idf value
0       liberta      0.339386
1        centum      0.254540
2         titio      0.223744
3           hom      0.223744
4          inst      0.201895
5       de_stat      0.169693
6        de_iur      0.169693
7        person      0.169693
8         liber      0.159391
9         allec      0.134596
10     servitus      0.134596
11         maga      0.123298
12        onero      0.123298
13         queo      0.112375
14          res      0.106261
15    existimen      0.084847
16          rer      0.084847
17       corras      0.084847
18    quis_fuer      0.084847
19  et_demonstr      0.084847
 
 Most significant words segment 9:
             lemma  tf/idf value
0          de_oper      0.373295
1         libertus      0.248863
2               gl      0.218755
3          praesto      0.197392
4              ius      0.167284
5            opera      0.129351
6     debueruntque      0.124432
7         de_usufr      0.124432
8      de_acquisit      0.124432
9         contendo      0.124432
10            serv      0.124432
11  si_ususfructus      0.124432
12          ad_leg      0.124432
13         vexatio      0.124432
14          singul      0.124432
15      supersedeo      0.124432
16          servus      0.124432
17       deminutio      0.124432
18           zassi      0.124432
19         propter      0.124432
 
 Most significant words segment 10:
       lemma  tf/idf value
0       auth      0.194368
1     de_off      0.147413
2      recop      0.147413
3   illicito      0.147413
4       sine      0.129578
5      princ      0.129578
6      pract      0.129578
7      casus      0.129578
8    violens      0.129578
9       onus      0.129578
10     novum      0.129578
11       tot      0.116925
12    impono      0.116925
13       ibi      0.116925
14      ioan      0.107109
15     omnis      0.099090
16     subdo      0.099090
17      verb      0.099090
18    noster      0.092309
19    multus      0.092309
 
 Most significant words segment 11:
         lemma  tf/idf value
0           gl      0.227964
1        opera      0.134797
2        deleg      0.129670
3         veto      0.129670
4         eccl      0.129670
5         iuro      0.129670
6        navis      0.129670
7          for      0.129670
8   nuevamente      0.129670
9        decim      0.129670
10        prax      0.129670
11      et_hon      0.129670
12    sequitut      0.129670
13      honoro      0.129670
14       tusch      0.129670
15        aloe      0.129670
16         tut      0.129670
17    et_curat      0.129670
18      platea      0.129670
19      multum      0.129670
 
 Most significant words segment 12:
            lemma  tf/idf value
0            marc      0.159257
1       et_censit      0.159257
2           angel      0.159257
3        rosental      0.159257
4   adscriptitiis      0.159257
5        terminus      0.159257
6         termino      0.159257
7           propr      0.159257
8         de_feud      0.159257
9           husan      0.159257
10            obs      0.159257
11     de_hominib      0.159257
12       de_offic      0.159257
13      uvaremund      0.159257
14       conservo      0.159257
15         menoch      0.159257
16            opt      0.159257
17      pristinus      0.159257
18          agric      0.159257
19        verosim      0.159257
 
 Most significant words segment 13:
            lemma  tf/idf value
0             ult      0.213108
1           trado      0.180602
2            alea      0.169012
3          refero      0.157539
4         salicet      0.134338
5          floreo      0.134338
6         subiici      0.134338
7     nova_vectig      0.134338
8          utique      0.134338
9            card      0.134338
10         eisque      0.134338
11          praet      0.134338
12  si_publicanus      0.134338
13         schard      0.134338
14         aviles      0.134338
15          vinea      0.134338
16         albano      0.134338
17       non_poss      0.134338
18       salarium      0.134338
19     restitutio      0.134338
 
 Most significant words segment 14:
           lemma  tf/idf value
0           dico      0.331153
1            met      0.182305
2          annus      0.171237
3            iud      0.160249
4   praescriptio      0.160249
5           loca      0.160249
6          facio      0.150731
7           fero      0.144600
8            res      0.114158
9          video      0.106894
10         soleo      0.106894
11     servitium      0.098837
12         fraus      0.091152
13       defendo      0.091152
14         sumta      0.091152
15       citatus      0.091152
16        mentio      0.091152
17     plerumque      0.091152
18        iurium      0.091152
19     de_probat      0.091152
 
 Most significant words segment 15:
            lemma  tf/idf value
0           fides      0.386697
1       possessor      0.329940
2             lex      0.257948
3            mala      0.219960
4           dubio      0.219960
5           bonum      0.174467
6          contra      0.147855
7          iudico      0.109980
8   manumisiionib      0.109980
9         affligo      0.109980
10          alban      0.109980
11         alieno      0.109980
12           uxor      0.109980
13      interdico      0.109980
14    inter_pares      0.109980
15       usucapio      0.109980
16         usucap      0.109980
17         ambigo      0.109980
18         mercor      0.109980
19         de_reg      0.109980
 
 Most significant words segment 16:
         lemma  tf/idf value
0         mali      0.176003
1    emendatio      0.176003
2         deus      0.176003
3      corrigo      0.176003
4        actio      0.154709
5    violentia      0.154709
6        bonus      0.139601
7         ioan      0.127882
8   iniustitia      0.088002
9       expens      0.088002
10       timor      0.088002
11      testor      0.088002
12      postul      0.088002
13   indebitus      0.088002
14   indagatio      0.088002
15        cado      0.088002
16  innocentio      0.088002
17       audio      0.088002
18     exerceo      0.088002
19     exemplo      0.088002
 
 Most significant words segment 17:
         lemma  tf/idf value
0         dies      0.221947
1       genero      0.221947
2        solvo      0.183461
3         dico      0.183461
4       tribuo      0.176254
5        opera      0.131239
6   practicari      0.126248
7      matienz      0.126248
8        veneo      0.126248
9      aestimo      0.126248
10    cognitus      0.126248
11      opinio      0.126248
12         rus      0.126248
13        regn      0.126248
14      postea      0.126248
15     inficio      0.126248
16        hora      0.126248
17  reiicienda      0.126248
18     congero      0.126248
19      perú_p      0.126248
 
 Most significant words segment 18:
           lemma  tf/idf value
0        de_pact      0.236102
1         multus      0.168195
2         indios      0.145622
3      conveneri      0.134299
4         alphan      0.134299
5         turpis      0.134299
6          valde      0.134299
7       superbia      0.134299
8   imbecillitas      0.134299
9      republica      0.134299
10         cuiac      0.134299
11          dego      0.134299
12       reporto      0.134299
13       demoveo      0.134299
14         quasi      0.134299
15         semel      0.134299
16      delictum      0.134299
17       collect      0.134299
18    granatensi      0.134299
19         dotal      0.134299
 
 Most significant words segment 19:
           lemma  tf/idf value
0    perversitas      0.235615
1         facile      0.235615
2         servio      0.207109
3        quamvis      0.207109
4          sibus      0.129872
5          sudor      0.117808
6        itemque      0.117808
7        perquam      0.117808
8     summarium3      0.117808
9          desum      0.117808
10        aequum      0.117808
11     constituo      0.117808
12       detraho      0.117808
13          fors      0.117808
14      perficio      0.117808
15  inaequabilis      0.117808
16      rebusque      0.117808
17      converto      0.117808
18      vicesque      0.117808
19       chapter      0.117808
You can see that, in the fourth segment, pensum and tributum have moved up while indis has fallen from first to third place. But in other segments you can also see that abbreviations like "fol", "gl" or "hom" are still a major nuisance, and so are Spanish passages. It would surely help to improve our stopword and lemma lists.
Of course, having more text would also help: the *idf* part can only kick in when there are many documents... Also, you could play around with the segmentation: make fewer but bigger segments, or smaller ones...
You may also notice that in many segments, the lemmata from around rank 5 onwards have the exact same value. Most certainly that is because they occur only a single time in the segment. (That those values differ from segment to segment has to do with the relation of the segment to the corpus as a whole.) And when four or fourteen of those words occur only once anyway, we should not assume that there is a meaningful sorting order between them (or that there is a good reason why the eighth one is in the top-ten list and the thirteenth one is not). But in those areas where there _is_ variation in the tf/idf values, that is indeed telling.

Due to the way they have been encoded in our sample texts, we can also spot some references to other literature by their underscore (e.g. "de_oper", "de_iur", "et_cur" etc.), which makes you wonder whether it would be worthwhile to mark up all the references in some way, so that we could either concentrate on them or filter them out altogether. But other than that, the result is in fact almost meaningful. Apart from making such lists, what can we do with this?
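
Just to illustrate the "filter them out" option, here is a minimal sketch (assuming the corpus list and the ourLemmatiser function from above are still defined) that simply drops every lemma containing an underscore before doing anything else with a segment:


In [ ]:
# A minimal sketch of the "filter them out" option: drop every lemma that
# contains an underscore (our ad-hoc encoding of references) and see what
# remains of one segment. (Assumes corpus and ourLemmatiser from above.)
lemmata = ourLemmatiser(corpus[3])
refs    = [t for t in lemmata if '_' in t]
no_refs = [t for t in lemmata if '_' not in t]
print(str(len(refs)) + ' reference-like tokens removed, ' + str(len(no_refs)) + ' tokens kept.')
print('Some of the removed tokens: ' + ', '.join(sorted(set(refs))[:10]))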

Vector Space Model of the text

First, let us recapitulate in more general terms what we have done so far, since a good part of it is extensible and applicable to many other methods: We have used a representation of each "document" (in our case, all those "documents" have been segments of one and the same text) as a series of values that indicated the document's relevance in particular "dimensions".

For example, the various values in the "alea dimension" indicate how characteristic this word, "alea", is for the present document. (By hypothesis, this also works the other way round, as an indication of which documents are the most relevant ones in matters of "alea". In fact, this is how search engines work.)
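
To make the search-engine analogy a bit more concrete, here is a small sketch (assuming the tfidf_matrix_frame from above, with one row per segment and one column per lemma) that ranks our segments by their weight in one chosen dimension:


In [ ]:
# The search-engine view of the matrix: which segments are the most
# relevant ones in matters of "alea"?
# (Assumes tfidf_matrix_frame from above: one row per segment, one column per lemma.)
term = 'alea'
if term in tfidf_matrix_frame.columns:
    print('Segments ranked by their weight in the "' + term + '" dimension:')
    print(tfidf_matrix_frame[term].sort_values(ascending=False).head(5))
else:
    print('"' + term + '" is not in the vocabulary.')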

Many words did not occur at all in most of the documents and the series of values (matrix rows) contained many zeroes. Other words were stopwords which we would not want to affect our documents' scores - they did not yield a salient "dimension" and were dropped from the series of values (matrix columns). The values work independently and can be combined (when a document is relevant in one and in another dimension).

Each document is thus characterised by a so-called "vector" (a series of independent, combinable values) and is mapped in a "space" constituted by the dimensions of those vectors (matrix columns, series of values). In our case the dimensions have been derived from the corpus's vocabulary. Hence, the representation of all the documents is called their vector space model. You can really think of it as similar to a three-dimensional space: Document A goes quite some way in the x-direction, not at all in the y-direction and just a little bit in the z-direction. Document B goes quite some way, perhaps even further than A did, in both the y- and z-directions, but only a wee bit in the x-direction. Etc. etc. Only with many, many more independent dimensions instead of just the three spatial dimensions we are used to.
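
To fix the intuition, here is the three-dimensional toy example in code (the numbers are completely made up):


In [ ]:
# The three-dimensional toy example from the paragraph above,
# with completely made-up numbers.
import pandas as pd
toy_space = pd.DataFrame(
    [[0.7, 0.0, 0.1],   # document A: quite far along x, not at all along y, a little along z
     [0.1, 0.8, 0.9]],  # document B: a wee bit along x, quite far along y and z
    index=['doc A', 'doc B'],
    columns=['x', 'y', 'z'])
print(toy_space)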

The following sections will discuss ways of manipulating the vector space -- using alternative or additional dimensions -- and also ways of leveraging the VSM representation of our text for various analyses...

Another method to generate the dimensions: n-grams

Instead of relying on the (either lemmatized or un-lemmatized) vocabulary of words occurring in your documents, you could also use other methods to generate a vector for them. A very popular such method is based on so-called n-grams and shall be presented here only briefly:

Imagine a moving window which captures, say, three words and slides over your text word by word. The first capture would get the first three words, the second one words two to four, the third one words three to five, and so on up to the last three words of your document. This procedure would generate all the "3-grams" contained in your text - not all the possible combinations of the words present in the vocabulary, but just the triples that happen to occur in the text. The meaningfulness of this method depends to a certain extent on how strongly the respective language inflects its words and on how freely it orders its sentences' parts (a sociolect or literary genre might constrain or enhance the potential variance of the language). Less variance here means that the same ideas tend to be (!) presented in the same formulations more often than in languages with more variance on this syntactic level. To a certain extent, you could play around with lemmatization and stopwords and with the size of your window. But in general, there are more 3-grams repeated in human language than one would expect. Even more so if we imagine our window encompassing only two words, resulting in 2-grams or, rather, bigrams.
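
The sliding window itself is easy to write out by hand; here is a minimal sketch (on a made-up example sentence) before we let a library do the counting for us:


In [ ]:
# The sliding window from the paragraph above, written out by hand
# (the example sentence is made up).
def ngrams_of(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

example = 'in hoc capite de servitiis personalibus indorum agitur'.split()
for trigram in ngrams_of(example, 3):
    print(' '.join(trigram))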

As a quick example, let's list the top bi- or 3-grams of our text segments, together with the respective number of occurrences, and the 10 most frequent n-grams in the whole corpus:


In [135]:
ngram_size_high = 3
ngram_size_low  = 2
top_n = 5

# Initialize a CountVectorizer, analogous to the TfidfVectorizer from above
# (again using our lemmatising fct., but no stopwords and no idf weighting this time)
vectorizer = CountVectorizer(ngram_range=(ngram_size_low, ngram_size_high), tokenizer=ourLemmatiser)
ngrams = vectorizer.fit_transform(corpus)

print('Most frequent 2-/3-grams')
print('========================')
print(' ')
ngrams_dict = []
df = []
df_2 = []
for i in range(0, len(corpus)):
    # (probably that's way too complicated here...)
    ngrams_dict.append(dict(zip(vectorizer.get_feature_names(), ngrams.toarray()[i])))
    df.append(pd.DataFrame.from_dict(ngrams_dict[i], orient='index').reset_index())
    df_2.append(df[i].rename(columns={'index':'n-gram',0:'count'}))
    print('Segment ' + str(i) + ':')
    if df_2[i]['count'].max() > 1:
        print(df_2[i].sort_values(by='count',axis=0,ascending=False)[:top_n])
        print(' ')
    else:
        print('  This segment has no bi- or 3-gram occurring more than just once.')
        print(' ')

ngrams_corpus = pd.DataFrame(ngrams.todense(), columns=vectorizer.get_feature_names())
ngrams_total = ngrams_corpus.cumsum()
print(' ')
print("The 10 most frequent n-grams in the whole corpus")
print("================================================")
ngrams_total.tail(1).T.rename(columns={19:'count'}).nlargest(10, 'count')


Most frequent 2-/3-grams
========================
 
Segment 0:
  This segment has no bi- or 3-gram occurring more than just once.
 
Segment 1:
      n-gram  count
1852    d de      3
3721  in lib      2
7053  sum et      2
4278     l 2      2
6157  qui in      2
 
Segment 2:
  This segment has no bi- or 3-gram occurring more than just once.
 
Segment 3:
                    n-gram  count
5754         praesto iubeo      2
845           ago eiusmodi      1
3653          in commendam      1
3654  in commendam accipio      1
846   ago eiusmodi hispani      1
 
Segment 4:
                        n-gram  count
6735      servitium personalis      4
2908                   et seqq      4
6660                   seqq et      3
3416  hic servitium personalis      3
3414             hic servitium      3
 
Segment 5:
            n-gram  count
2615       et alea      3
2747         et in      2
5065  nosco regnum      2
2313         dom d      2
7428   tribuo taxa      1
 
Segment 6:
                 n-gram  count
6044            que los      2
4586         los indios      2
2527             eo que      2
4556             lo que      2
6687  servicio personal      2
 
Segment 7:
                      n-gram  count
2747                   et in      3
1825       curo excellentiss      2
7760               video sum      2
4685  marchioni montisclario      2
3112          exsecutio curo      1
 
Segment 8:
            n-gram  count
4278           l 2      3
7317  titio centum      3
235            2 d      3
4267           l 1      2
1759      cum alea      2
 
Segment 9:
         n-gram  count
3709       in l      3
3711     in l 2      2
4427      lib 1      2
4278        l 2      2
1876  d de_oper      2
 
Segment 10:
          n-gram  count
7           1 et      3
1287        c ne      3
4297  l illicito      2
446         4 ne      2
2959  et violens      2
 
Segment 11:
     n-gram  count
3709   in l      3
0       1 c      2
4267    l 1      2
5293    p 5      2
4268  l 1 c      2
 
Segment 12:
  This segment has no bi- or 3-gram occurring more than just once.
 
Segment 13:
                 n-gram  count
4335              l ult      2
2615            et alea      2
6366     refero salicet      1
7624  utique vel rapiña      1
1852               d de      1
 
Segment 14:
           n-gram  count
1852         d de      2
5027  nolo possum      2
4870         ne 2      2
446          4 ne      2
4314         l si      2
 
Segment 15:
           n-gram  count
1852         d de      3
7476      ubi lex      2
6067  queo contra      2
1211  bonum fides      2
4646   mala fides      2
 
Segment 16:
           n-gram  count
2735       et hic      2
4443        lib 2      2
4851         ne 1      2
3637         in c      2
1670  contra deus      1
 
Segment 17:
            n-gram  count
7405     tribuo in      2
3790        in qui      2
0              1 c      1
1650     consto ut      1
4477  lib conficio      1
 
Segment 18:
                n-gram  count
1879         d de_pact      2
2839       et paciscor      1
3536  ille et paciscor      1
7379           trado a      1
6615       semel sibus      1
 
Segment 19:
  This segment has no bi- or 3-gram occurring more than just once.
 
 
The 10 most frequent n-grams in the whole corpus
================================================
Out[135]:
count
in l 14
et in 13
lib 2 12
d de 10
servitium personalis 10
l 1 9
et alea 8
et seqq 8
l 2 7
in qui 6

Extending the dimensions

Of course, there is no reason why the dimensions should be restricted to or identical with the vocabulary (or the occurring n-grams, for that matter). In fact, in the examples above, we have dropped some of the words already by using our list of stopwords. **We could also add other dimensions that are of interest for our current research question. We could add a dimension for the year in which the texts have been written, for their citing a certain author, or merely for their position in the encompassing work...**

Since in our examples the position is already represented in the "row number", and counting citations of a particular author would require some more normalisation (e.g. with the lemmatisation dictionary above), let's add a dimension for the length of the respective segment (in words) and another one for the number of occurrences of "_" (in our sample transcriptions, this character had been used to mark citations, although admittedly not all of them), just so you get the idea:


In [136]:
print("Original matrix of tf/idf values (rightmost columns):")
tfidf_matrix_frame.iloc[ :, -5:]


Original matrix of tf/idf values (rightmost columns):
Out[136]:
vulgatus vulgo words1 zalsius zassi
0 0.000000 0.000000 0.000000 0.000000 0.000000
1 0.000000 0.000000 0.000000 0.000000 0.000000
2 0.000000 0.000000 0.000000 0.000000 0.000000
3 0.000000 0.000000 0.000000 0.000000 0.000000
4 0.000000 0.000000 0.000000 0.000000 0.000000
5 0.000000 0.000000 0.000000 0.000000 0.000000
6 0.000000 0.081001 0.000000 0.000000 0.000000
7 0.000000 0.000000 0.000000 0.000000 0.000000
8 0.000000 0.000000 0.000000 0.000000 0.000000
9 0.000000 0.000000 0.000000 0.124432 0.124432
10 0.000000 0.000000 0.000000 0.000000 0.000000
11 0.000000 0.000000 0.000000 0.000000 0.000000
12 0.000000 0.000000 0.000000 0.000000 0.000000
13 0.000000 0.000000 0.000000 0.000000 0.000000
14 0.000000 0.000000 0.000000 0.000000 0.000000
15 0.000000 0.000000 0.000000 0.000000 0.000000
16 0.000000 0.000000 0.000000 0.000000 0.000000
17 0.000000 0.000000 0.000000 0.000000 0.000000
18 0.134299 0.000000 0.000000 0.000000 0.000000
19 0.000000 0.000000 0.117808 0.000000 0.000000

In [137]:
length = []
for i in range(0, len(corpus)):
    length.append(len(tokenised[i]))

citnum = []
for i in range(0, len(corpus)):
    citnum.append(corpus[i].count('_'))

print("New matrix extended with segment length and number of occurrences of '_':")
new_matrix = tfidf_matrix_frame.assign(seg_length = length).assign(cit_count = citnum)
new_matrix.iloc[ :, -6:]


New matrix extended with segment length and number of occurrences of '_':
Out[137]:
vulgo words1 zalsius zassi seg_length cit_count
0 0.000000 0.000000 0.000000 0.000000 34 0
1 0.000000 0.000000 0.000000 0.000000 268 0
2 0.000000 0.000000 0.000000 0.000000 119 0
3 0.000000 0.000000 0.000000 0.000000 140 0
4 0.000000 0.000000 0.000000 0.000000 510 0
5 0.000000 0.000000 0.000000 0.000000 313 0
6 0.081001 0.000000 0.000000 0.000000 246 0
7 0.000000 0.000000 0.000000 0.000000 304 0
8 0.000000 0.000000 0.000000 0.000000 254 15
9 0.000000 0.000000 0.124432 0.124432 144 9
10 0.000000 0.000000 0.000000 0.000000 396 7
11 0.000000 0.000000 0.000000 0.000000 138 7
12 0.000000 0.000000 0.000000 0.000000 122 4
13 0.000000 0.000000 0.000000 0.000000 113 3
14 0.000000 0.000000 0.000000 0.000000 294 8
15 0.000000 0.000000 0.000000 0.000000 139 7
16 0.000000 0.000000 0.000000 0.000000 253 0
17 0.000000 0.000000 0.000000 0.000000 144 4
18 0.000000 0.000000 0.000000 0.000000 127 3
19 0.000000 0.117808 0.000000 0.000000 150 0

You may notice that the segment with the most occurrences of "_" (taken with a grain of salt, that is likely the segment with the most citations) is not a particularly long one. If we had systematic markup of citations or author names in our transcription, we could be more certain, or add even more columns/"dimensions" to our table.
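
You can verify this directly on the extended matrix; a quick sketch using the new_matrix from the previous cell:


In [ ]:
# Quick check of the observation above: sort the extended matrix by the
# number of '_' occurrences and compare with the segment lengths.
print(new_matrix[['seg_length', 'cit_count']].sort_values(by='cit_count', ascending=False).head(5))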

If you bear with me for a final example, here is how to add the labels that you could see in our initial "one big source file":


In [138]:
# input should still have a handle on our source file.

label = []
# Now go through the lines again, looking for the special marker string,
# and extract the label from each marked line
for line in input:
    if line[0:3] == '€€€':
        label.append(line[6:].strip())

# How many segments/files do we then have?
print(str(len(label)) + ' labels read.')
print("New matrix extended with segment length, number of occurrences of '_' and label:")
yet_another_matrix = new_matrix.assign(seg_length = length).assign(label = label)
yet_another_matrix.iloc[ :, -6:]


20 labels read.
New matrix extended with segment length, number of occurrences of '_' and label:
Out[138]:
words1 zalsius zassi seg_length cit_count label
0 0.000000 0.000000 0.000000 34 0 [Book title]
1 0.000000 0.000000 0.000000 268 0 [Indians are free]
2 0.000000 0.000000 0.000000 119 0 [Definition of «servicios personales»]
3 0.000000 0.000000 0.000000 140 0 [rights & duties of both Indians & Spaniards]
4 0.000000 0.000000 0.000000 510 0 [literature & royal decrees against the servic...
5 0.000000 0.000000 0.000000 313 0 [Practical case from Peru – Viceroy Toledo in ...
6 0.000000 0.000000 0.000000 246 0 [Decree of the servicio personal, 1601]
7 0.000000 0.000000 0.000000 304 0 [implementation of the decree of the servicio ...
8 0.000000 0.000000 0.000000 254 15 [definition of freedom ex Aristotle]
9 0.000000 0.124432 0.124432 144 9 [encomiendas in theory]
10 0.000000 0.000000 0.000000 396 7 [Spaniards’ illegal imposition of labour over ...
11 0.000000 0.000000 0.000000 138 7 [on vassals]
12 0.000000 0.000000 0.000000 122 4 [on colons & adscripticios]
13 0.000000 0.000000 0.000000 113 3 [on both – vassals & colons]
14 0.000000 0.000000 0.000000 294 8 [on time passing & habits]
15 0.000000 0.000000 0.000000 139 7 [on usucapione, which does not apply in the ca...
16 0.000000 0.000000 0.000000 253 0 [habits established over time are not excuse f...
17 0.000000 0.000000 0.000000 144 4 [Practical case from Peru – with literature]
18 0.000000 0.000000 0.000000 127 3 [work extorted in axchange for tribute]
19 0.117808 0.000000 0.000000 150 0 [corruption of the practice – turned into habi...

Word Clouds

We can use a library that takes word frequencies like the ones above, calculates corresponding relative sizes of words, and creates nice wordcloud images for our sections (again taking the fourth segment as an example), like this:


In [139]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# We make tuples of (lemma, tf/idf score) for one of our segments
# But we have to convert our tf/idf weights to pseudo-frequencies (i.e. integer numbers)
frq = [ int(round(x * 100000, 0)) for x in mx_array[3]]
freq = dict(zip(fn, frq))

wc = WordCloud(background_color=None, mode="RGBA", max_font_size=40, relative_scaling=1).fit_words(freq)

# Now show/plot the wordcloud
plt.figure()
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()


In order to get a better overview of the many segments than is possible in this notebook, let's create a new html file listing some of the characteristics that we have found so far...


In [140]:
outputDir = "Solorzano"
htmlfile = open(outputDir + '/Overview.html', encoding='utf-8', mode='w')

# Write the html header and the opening of a layout table
htmlfile.write("""<!DOCTYPE html>
<html>
    <head>
        <title>Section Characteristics</title>
        <meta charset="utf-8"/>
    </head>
    <body>
        <table>
""")

a = [[]]
a.clear()
dicts = []
w = []

# For each segment, create a wordcloud and write it along with label and
# other information into a new row of the html table
for i in range(0, len(mx_array)):
    # this is like above in the single-segment example...
    a.append([ int(round(x * 100000, 0)) for x in mx_array[i]])
    dicts.append(dict(zip(fn, a[i])))
    w.append(WordCloud(background_color=None, mode="RGBA", \
                       max_font_size=40, min_font_size=10, \
                       max_words=60, relative_scaling=0.8).fit_words(dicts[i]))
    # We write the wordcloud image to a file
    w[i].to_file(outputDir + '/wc_' + str(i) + '.png')
    # Finally we write the table row for this segment
    htmlfile.write("""
            <tr>
                <td>
                    <head>Section {a}: <b>{b}</b></head><br/>
                    <img src="./wc_{a}.png"/><br/>
                    <small><i>length: {c} words</i></small>
                </td>
            </tr>
            <tr><td>&nbsp;</td></tr>
""".format(a = str(i), b = label[i], c = len(tokenised[i])))

# And then we write the end of the html file.
htmlfile.write("""
        </table>
    </body>
</html>
""")
htmlfile.close()

This should have created a nice html file which we can open here.

Similarity

Also, once we have a representation of our text as a vector - which we can imagine as an arrow that goes a certain distance in one direction, another distance in another direction and so on - we can compare the different arrows. Do they go the same distance in a particular direction? And maybe almost the same in another one? That would mean that one of the terms of our vocabulary has the same weight in both texts. Comparing the weights across our many, many dimensions, we can develop a measure for the similarity of the texts.

(Probably, similarity in words that are occurring all over the place in the corpus should not count so much, and in fact it is attenuated by our arrows being made up of tf/idf weights.)

Comparing arrows means calculating with angles, and technically what we are computing is the "cosine similarity" of the texts. Again, there is a library ready for us to use (but you can find some documentation here, here and here).
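
For two toy vectors, the computation boils down to the dot product of the vectors divided by the product of their lengths; here is a tiny sketch with made-up numbers, before we let the library handle the whole matrix:


In [ ]:
# Cosine similarity by hand, for two made-up three-dimensional "documents":
# the dot product of the two vectors, divided by the product of their lengths.
import numpy as np
a = np.array([0.7, 0.0, 0.1])
b = np.array([0.1, 0.8, 0.9])
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))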


In [141]:
from sklearn.metrics.pairwise import cosine_similarity

similarities = pd.DataFrame(cosine_similarity(tfidf_matrix))
similarities[round(similarities, 0) == 1] = 0 # Suppress a document's similarity to itself
print("Pairwise similarities:")
print(similarities)


Pairwise similarities:
          0         1         2         3         4         5         6   \
0   0.000000  0.222897  0.120293  0.147745  0.130148  0.081761  0.094125   
1   0.222897  0.000000  0.092630  0.072586  0.082198  0.029001  0.030976   
2   0.120293  0.092630  0.000000  0.057610  0.120441  0.043699  0.041051   
3   0.147745  0.072586  0.057610  0.000000  0.131746  0.097417  0.056908   
4   0.130148  0.082198  0.120441  0.131746  0.000000  0.221282  0.181299   
5   0.081761  0.029001  0.043699  0.097417  0.221282  0.000000  0.141404   
6   0.094125  0.030976  0.041051  0.056908  0.181299  0.141404  0.000000   
7   0.035931  0.067855  0.035364  0.082660  0.132764  0.162764  0.081296   
8   0.100506  0.166803  0.060454  0.083347  0.078576  0.039125  0.044615   
9   0.044722  0.086246  0.019550  0.102677  0.076310  0.039573  0.020726   
10  0.025080  0.067164  0.076931  0.052992  0.110786  0.050311  0.022992   
11  0.000000  0.005779  0.000000  0.026106  0.041592  0.008422  0.010421   
12  0.027620  0.055453  0.014834  0.011949  0.044076  0.027170  0.005829   
13  0.011649  0.043319  0.030025  0.020825  0.056447  0.062899  0.000000   
14  0.054892  0.066205  0.060191  0.082729  0.118508  0.078591  0.037649   
15  0.044378  0.036802  0.039557  0.070395  0.056771  0.038931  0.012981   
16  0.021745  0.090299  0.014982  0.040984  0.067496  0.038400  0.019335   
17  0.070181  0.080908  0.079733  0.139602  0.119730  0.087217  0.079050   
18  0.068122  0.050428  0.062793  0.058399  0.112498  0.039119  0.033904   
19  0.046805  0.057342  0.030257  0.068873  0.073610  0.062976  0.054613   

          7         8         9         10        11        12        13  \
0   0.035931  0.100506  0.044722  0.025080  0.000000  0.027620  0.011649   
1   0.067855  0.166803  0.086246  0.067164  0.005779  0.055453  0.043319   
2   0.035364  0.060454  0.019550  0.076931  0.000000  0.014834  0.030025   
3   0.082660  0.083347  0.102677  0.052992  0.026106  0.011949  0.020825   
4   0.132764  0.078576  0.076310  0.110786  0.041592  0.044076  0.056447   
5   0.162764  0.039125  0.039573  0.050311  0.008422  0.027170  0.062899   
6   0.081296  0.044615  0.020726  0.022992  0.010421  0.005829  0.000000   
7   0.000000  0.057690  0.033266  0.056735  0.004864  0.028876  0.035758   
8   0.057690  0.000000  0.107561  0.028066  0.067699  0.048026  0.044510   
9   0.033266  0.107561  0.000000  0.062075  0.110202  0.021130  0.061828   
10  0.056735  0.028066  0.062075  0.000000  0.091940  0.122667  0.126188   
11  0.004864  0.067699  0.110202  0.091940  0.000000  0.045572  0.059874   
12  0.028876  0.048026  0.021130  0.122667  0.045572  0.000000  0.058163   
13  0.035758  0.044510  0.061828  0.126188  0.059874  0.058163  0.000000   
14  0.093891  0.099462  0.085722  0.100223  0.029431  0.057516  0.076954   
15  0.035825  0.062046  0.015942  0.065009  0.021502  0.036766  0.016406   
16  0.042488  0.056924  0.016298  0.084189  0.031190  0.044171  0.044116   
17  0.091554  0.057106  0.049359  0.086471  0.052649  0.042823  0.078300   
18  0.059060  0.023174  0.089074  0.078308  0.018603  0.049908  0.060606   
19  0.021254  0.054920  0.019087  0.058867  0.023415  0.033189  0.006876   

          14        15        16        17        18        19  
0   0.054892  0.044378  0.021745  0.070181  0.068122  0.046805  
1   0.066205  0.036802  0.090299  0.080908  0.050428  0.057342  
2   0.060191  0.039557  0.014982  0.079733  0.062793  0.030257  
3   0.082729  0.070395  0.040984  0.139602  0.058399  0.068873  
4   0.118508  0.056771  0.067496  0.119730  0.112498  0.073610  
5   0.078591  0.038931  0.038400  0.087217  0.039119  0.062976  
6   0.037649  0.012981  0.019335  0.079050  0.033904  0.054613  
7   0.093891  0.035825  0.042488  0.091554  0.059060  0.021254  
8   0.099462  0.062046  0.056924  0.057106  0.023174  0.054920  
9   0.085722  0.015942  0.016298  0.049359  0.089074  0.019087  
10  0.100223  0.065009  0.084189  0.086471  0.078308  0.058867  
11  0.029431  0.021502  0.031190  0.052649  0.018603  0.023415  
12  0.057516  0.036766  0.044171  0.042823  0.049908  0.033189  
13  0.076954  0.016406  0.044116  0.078300  0.060606  0.006876  
14  0.000000  0.080179  0.100321  0.120819  0.112492  0.040502  
15  0.080179  0.000000  0.052674  0.039606  0.034220  0.024331  
16  0.100321  0.052674  0.000000  0.067907  0.023408  0.058716  
17  0.120819  0.039606  0.067907  0.000000  0.071507  0.060739  
18  0.112492  0.034220  0.023408  0.071507  0.000000  0.047117  
19  0.040502  0.024331  0.058716  0.060739  0.047117  0.000000  

In [142]:
print("The two most similar segments in the corpus are")
print("segments", \
      similarities[similarities == similarities.values.max()].idxmax(axis=0).idxmax(axis=1), \
      "and", \
      similarities[similarities == similarities.values.max()].idxmax(axis=0)[ similarities[similarities == similarities.values.max()].idxmax(axis=0).idxmax(axis=1) ].astype(int), \
      ".")
print("They have a similarity score of")
print(similarities.values.max())


The two most similar segments in the corpus are
segments 0 and 1 .
They have a similarity score of
0.222896735543
Of course, in every set of documents we will always find two that are "similar" in the sense of being more similar to each other than to the others. Whether or not this actually *means* anything in terms of content is still up to scholarly interpretation. But at least it means that a scholar can look at the two documents, and if she determines that they are not so similar after all, then perhaps there is something interesting to say about similar vocabulary being used for different purposes. Or the other way round: when the scholar knows that two passages are similar, but they have a low "similarity score", shouldn't that say something about the texts' rhetoric?

Clustering

Clustering is a method of grouping data into subsets so that each subset has some internal cohesion. Sentences that are more similar to a particular "paradigm" sentence than to another one are grouped with the first one; other sentences are grouped with their respective "paradigm" sentence. Of course, one of the challenges is finding sentences that work well as such paradigm sentences. So we have two (or even three) stages: find paradigms, then group the data accordingly. (And learn how many groups there are in the first place.)

I hope to be able to add a discussion of this subject soon. For now, here is an outline of the process, for which nice tutorials can be found online (a minimal k-means sketch follows after the list):

  • Find good measure (word vectors, authorities cited, style, ...)
  • Find starting centroids
  • Find good K value
  • K-Means clustering
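
Just to give an impression of what the last step looks like in code, here is a minimal sketch running k-means over our tf/idf matrix with an arbitrarily chosen k (finding a good k is precisely one of the open steps listed above):


In [ ]:
# A minimal sketch of the last step only: k-means clustering of our segments,
# based on the tf/idf matrix from above and an arbitrarily chosen k.
from sklearn.cluster import KMeans

k = 4  # arbitrary; finding a good value is one of the open steps above
km = KMeans(n_clusters=k, random_state=0)
clusters = km.fit_predict(tfidf_matrix)
for i, c in enumerate(clusters):
    print('Segment ' + str(i) + ' -> cluster ' + str(c))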

In [ ]:

Working with several languages

Let us prepare a second text, this time in Spanish, and see how they compare...


In [143]:
bigspanishfile = 'Solorzano/Sections_II.2_PI.txt'
spInput = open(bigspanishfile, encoding='utf-8').readlines()

spAt    = -1
spDest  = None

for line in spInput:
    if line[0:3] == '€€€':
        if spDest:
            spDest.close()
        spAt += 1
        spDest = open(outputBase + '.' + str(spAt) +
                    '.spanish.txt', encoding='utf-8', mode='w')
    else:
        spDest.write(line.strip())

spAt += 1
spDest.close()
print(str(spAt) + ' files written.')

spSuffix = '.spanish.txt'
spCorpus = []
for i in range(0, spAt):
    try:
        with open(path + '/' + filename + str(i) + spSuffix, encoding='utf-8') as f:
            spCorpus.append(f.read())
            f.close()
    except IOError as exc:
        if exc.errno != errno.EISDIR:  # Do not fail if a directory is found, just ignore it.
            raise                      # Propagate other kinds of IOError.

print(str(len(spCorpus)) + ' files read.')

# Labels
spLabel = []
i = 0
for spLine in spInput:
    if spLine[0:3] == '€€€':
        spLabel.append(spLine[6:].strip())
        i += 1
print(str(len(spLabel)) + ' labels found.')

# Tokens
spTokenised = []
for spSegment in spCorpus:
    spTokenised.append(list(filter(None, (spWord.lower()
                                        for spWord in re.split('\W+', spSegment)))))

# Lemmata
spLemma    = {}
spTempdict = []
spWordfile_path = 'Solorzano/wordforms-es.txt'
spWordfile = open(spWordfile_path, encoding='utf-8')

for spLine in spWordfile.readlines():
    spTempdict.append(tuple(spLine.split('>')))

spLemma = {k.strip(): v.strip() for k, v in spTempdict}
spWordfile.close()
print(str(len(spLemma)) + ' spanish wordforms known to the system.')

# Stopwords
spStopwords_path = 'Solorzano/stopwords-es.txt'
spStopwords = open(spStopwords_path, encoding='utf-8').read().splitlines()
print(str(len(spStopwords)) + ' spanish stopwords known to the system.')

print(' ')
print('Significant words in the spanish text:')

# tokenising and lemmatising function
def spOurLemmatiser(str_input):
    spWordforms = re.split('\W+', str_input)
    return [spLemma[spWordform].lower() if spWordform in spLemma else spWordform.lower() for spWordform in spWordforms ]

spTfidf_vectorizer = TfidfVectorizer(stop_words=spStopwords, use_idf=True, tokenizer=spOurLemmatiser, norm='l2')
spTfidf_matrix = spTfidf_vectorizer.fit_transform(spCorpus)

spMx_array = spTfidf_matrix.toarray()
spFn = spTfidf_vectorizer.get_feature_names()

pos = 1
for l in spMx_array:
    print(' ')
    print(' Most significant words in the ' + str(pos) + '. segment:')
    print(pd.DataFrame.rename(pd.DataFrame.from_dict([(spFn[x], l[x]) for x in (l*-1).argsort()][:10]), columns={0:'lemma',1:'tf/idf value'}))
    pos += 1


18 files written.
18 files read.
18 labels found.
614725 spanish wordforms known to the system.
743 spanish stopwords known to the system.
 
Significant words in the spanish text:
 
 Most significant words in the 1. segment:
        lemma  tf/idf value
0    capitvlo      0.399528
1  totalmente      0.399528
2     español      0.349703
3        casa      0.286932
4    tributar      0.286932
5  particular      0.264527
6      llamar      0.264527
7        cosa      0.245585
8    prohibir      0.245585
9    personal      0.201756
 
 Most significant words in the 2. segment:
        lemma  tf/idf value
0       indio      0.176840
1    servicio      0.158496
2  famulicios      0.138930
3     público      0.138930
4  domesticos      0.138930
5       carga      0.138930
6    reservar      0.138930
7       color      0.138930
8    reperida      0.138930
9      cobrar      0.138930
 
 Most significant words in the 3. segment:
          lemma  tf/idf value
0        hombre      0.382174
1        forzar      0.334514
2    conpadecen      0.191087
3  emperadores3      0.191087
4   contradecir      0.191087
5   aristoteles      0.191087
6      facultad      0.191087
7   impedimetos      0.191087
8      ocuparse      0.191087
9        servil      0.191087
 
 Most significant words in the 4. segment:
          lemma  tf/idf value
0         veder      0.272392
1      conducir      0.272392
2      alquilar      0.272392
3    extimables      0.272392
4       prohibe      0.272392
5         llano      0.272392
6      precioso      0.272392
7  regularmente      0.272392
8        dignar      0.238423
9        forzar      0.238423
 
 Most significant words in the 5. segment:
         lemma  tf/idf value
0       tratar      0.230925
1       grande      0.230925
2        razón      0.230925
3    provincia      0.174680
4        pagar      0.162171
5     servicio      0.150490
6    debiessen      0.131913
7     discurso      0.131913
8   permitirse      0.131913
9  franciscano      0.131913
 
 Most significant words in the 6. segment:
         lemma  tf/idf value
0        indio      0.200085
1    provisión      0.188630
2         1549      0.188630
3  encomendado      0.188630
4        tasar      0.187338
5        pagar      0.173923
6         real      0.165106
7    audiencia      0.165106
8          año      0.162302
9     voluntad      0.148416
 
 Most significant words in the 7. segment:
        lemma  tf/idf value
0     proveer      0.282385
1      entrar      0.231697
2    servicio      0.184026
3    personal      0.162917
4    convenir      0.161309
5  hizieredes      0.161309
6        vaco      0.161309
7      holgar      0.161309
8      juzgar      0.161309
9       vacar      0.161309
 
 Most significant words in the 8. segment:
       lemma  tf/idf value
0       a el      0.235682
1  envejecer      0.218778
2     quitar      0.191495
3    proveer      0.191495
4  costumbre      0.191495
5  audiencia      0.191495
6     reinar      0.172137
7     virrey      0.157121
8      de+el      0.156100
9    referir      0.144853
 
 Most significant words in the 9. segment:
      lemma  tf/idf value
0     mesmo      0.248627
1  personal      0.212763
2    perder      0.210663
3   ordenar      0.210663
4     casar      0.184391
5     indio      0.178764
6  voluntad      0.165751
7  servicio      0.160220
8    querer      0.151293
9   tributo      0.151293
 
 Most significant words in the 10. segment:
           lemma  tf/idf value
0         cedula      0.281725
1           a el      0.203724
2     particular      0.187817
3     presidente      0.141834
4   encargandome      0.141834
5        omisión      0.141834
6          oidor      0.141834
7       aranjuez      0.141834
8  expressamente      0.141834
9           limo      0.141834
 
 Most significant words in the 11. segment:
        lemma  tf/idf value
0      gravar      0.296242
1       docto      0.169225
2     materia      0.169225
3   ajustarse      0.169225
4      formar      0.169225
5   diciembre      0.169225
6        1610      0.169225
7  reformasse      0.169225
8       junta      0.169225
9     haberse      0.169225
 
 Most significant words in the 12. segment:
             lemma  tf/idf value
0            señor      0.283368
1          término      0.283368
2         disponer      0.248030
3        silvestro      0.141684
4          ultimar      0.141684
5          vasallo      0.141684
6          abrazar      0.141684
7          navarro      0.141684
8         exacción      0.141684
9  quebrantamiento      0.141684
 
 Most significant words in the 13. segment:
           lemma  tf/idf value
0         colono      0.236015
1       celebrar      0.236015
2         hablar      0.236015
3      condición      0.236015
4  adscripticios      0.236015
5      propósito      0.236015
6  violentamente      0.236015
7         enseña      0.236015
8       antiguar      0.236015
9        volumen      0.236015
 
 Most significant words in the 14. segment:
       lemma  tf/idf value
0      manar      0.250435
1   defender      0.250435
2  excluirse      0.250435
3     efecto      0.250435
4    recibir      0.250435
5  continuar      0.250435
6   posesión      0.250435
7   justicia      0.250435
8    ciencia      0.250435
9     fraude      0.250435
 
 Most significant words in the 15. segment:
          lemma  tf/idf value
0  prescripción      0.428202
1        seguir      0.214101
2         citar      0.214101
3         lucas      0.214101
4            fe      0.214101
5      alegarse      0.214101
6         valer      0.214101
7         anuas      0.214101
8       constar      0.214101
9       poderse      0.214101
 
 Most significant words in the 16. segment:
        lemma  tf/idf value
0      gravar      0.318044
1   inocencio      0.181679
2    glorioso      0.181679
3  frecuentar      0.181679
4      africa      0.181679
5    labrador      0.181679
6  excessivos      0.181679
7       mirar      0.181679
8   estrechar      0.181679
9    prefecto      0.181679
 
 Most significant words in the 17. segment:
       lemma  tf/idf value
0    señalar      0.431038
1   tributar      0.309562
2       cosa      0.176636
3    contado      0.143679
4    domingo      0.143679
5  comodidad      0.143679
6     demora      0.143679
7     alegar      0.143679
8  convencer      0.143679
9    titular      0.143679
 
 Most significant words in the 18. segment:
      lemma  tf/idf value
0      apud      0.406932
1     pagin      0.226074
2     latir      0.180859
3    acosta      0.177877
4      agia      0.142301
5     tomar      0.135644
6       ego      0.135644
7      dict      0.135644
8    librar      0.097416
9  capítulo      0.097416
Our Spanish wordfiles ([lemmata list](Solorzano/wordforms-es.txt) and [stopwords list](Solorzano/stopwords-es.txt)) are quite large and generous - they spare us some of the work of resolving quite a lot of abbreviations. However, since they actually originate from a completely different project, it is very unlikely that this goes without mistakes. Also, some "lemmata" (like "de+el" in the eighth segment) are not really lemmata at all. So we urgently need to clean our wordlists and adapt them to the current text material!

Now imagine how we would bring the two documents together in a vector space. We would generate dimensions for all the words of our Spanish vocabulary and would end up with a common space of roughly twice as many dimensions as before - and the Latin work would populate only the first half of the dimensions, the Spanish work only the second half. The respective other half would contain only zeroes. So in effect, we would not really have a common space, or anything on the basis of which we could compare the two works. :-(
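
You can check this quickly: the following sketch (assuming corpus and spCorpus from above, and using a plain CountVectorizer so as not to mix the two lemmatisers) counts how many vocabulary items actually occur in both works:


In [ ]:
# A quick check of the claim above: how much vocabulary do the two works share?
# (Assumes corpus and spCorpus from above; plain CountVectorizer, no lemmatising.)
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
latin_vocab   = set(cv.fit(corpus).get_feature_names())
spanish_vocab = set(cv.fit(spCorpus).get_feature_names())
shared = latin_vocab & spanish_vocab
print(str(len(shared)) + ' of ' + str(len(latin_vocab | spanish_vocab)) + ' vocabulary items occur in both works.')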

What might be an interesting perspective, however - since in this case, the second text is a translation of the first one - is a parallel, synoptic overview of both texts. So, let's at least add the second text to our html overview with the wordclouds:


In [144]:
htmlfile2 = open(outputDir + '/Synopsis.html', encoding='utf-8', mode='w')

htmlfile2.write("""<!DOCTYPE html>
<html>
    <head>
        <title>Section Characteristics, parallel view</title>
        <meta charset="utf-8"/>
    </head>
    <body>
        <table>
""")
spA = [[]]
spA.clear()
spDicts = []
spW = []
for i in range(0, max(len(mx_array), len(spMx_array))):
    if (i > len(mx_array) - 1):
        htmlfile2.write("""
            <tr>
                <td>
                    <head>Section {a}: n/a</head>
                </td>""".format(a = str(i)))
    else:
        htmlfile2.write("""
            <tr>
                <td>
                    <head>Section {a}: <b>{b}</b></head><br/>
                    <img src="./wc_{a}.png"/><br/>
                    <small><i>length: {c} words</i></small>
                </td>""".format(a = str(i), b = label[i], c = len(tokenised[i])))
    if (i > len(spMx_array) - 1):
        htmlfile2.write("""
                <td>
                    <head>Section {a}: n/a</head>
                </td>
            </tr><tr><td>&nbsp;</td></tr>""".format(a = str(i)))
    else:
        spA.append([ int(round(x * 100000, 0)) for x in spMx_array[i]])
        spDicts.append(dict(zip(spFn, spA[i])))
        spW.append(WordCloud(background_color=None, mode="RGBA", \
                           max_font_size=40, min_font_size=10, \
                           max_words=60, relative_scaling=0.8).fit_words(spDicts[i]))
        spW[i].to_file(outputDir + '/wc_' + str(i) + '_sp.png')
        htmlfile2.write("""
                <td>
                    <head>Section {d}: <b>{e}</b></head><br/>
                    <img src="./wc_{d}_sp.png"/><br/>
                    <small><i>length: {f} words</i></small>
                </td>
            </tr>
            <tr><td>&nbsp;</td></tr>""".format(d = str(i), e = spLabel[i], f = len(spTokenised[i])))
    
htmlfile2.write("""
        </table>
    </body>
</html>
""")
htmlfile2.close()

Again, the resulting file can be opened here.

Translations?

Maybe there is an approach to inter-lingual comparison after all. Here is the API documentation of conceptnet.io, which we can use to look up synonyms, related terms and translations. For example, with a URI like this one:

http://api.conceptnet.io/related/c/la/rex?filter=/c/es

We can get an identifier for a word and many possible translations for it. So we could - this remains to be tested in practice - look up our ten (or so) most significant words in one language and collect all their possible translations into the second language. Then we could compare these with what we actually find in the second work. How much overlap there is going to be, and how univocal it is going to be, remains to be seen, however...

For example, with a single segment, we could do something like this:


In [159]:
import urllib
import json
from collections import defaultdict

segment_no = 6
spSegment_no = 8

print("Comparing words from segments " + str(segment_no) + " (latin) and " + str(spSegment_no) + " (spanish)...")

print(" ")
# Build List of most significant words for a segment
top10a = []
top10a = ([fn[x] for x in (mx_array[segment_no]*-1).argsort()][:12])
print("Most significant words in the latin text:")
print(top10a)

print(" ")
# Build lists of possible translations (the 15 most closely related ones)
top10a_possible_translations = defaultdict(list)
for word in top10a:
    concepts_uri = "http://api.conceptnet.io/related/c/la/" + word + "?filter=/c/es"
    response = urllib.request.urlopen(concepts_uri)
    concepts = json.loads(response.read().decode(response.info().get_param('charset') or 'utf-8'))
    for rel in concepts["related"][0:15]:
        top10a_possible_translations[word].append(rel.get("@id").split('/')[-1])

print(" ")
print("For each of the latin words, here are possible translations:")
for word in top10a_possible_translations:
    print(word + ":")
    print(', '.join(trans for trans in top10a_possible_translations[word]))

print(" ")
print(" ")
# Build list of 10 most significant words in the second language
top10b = []
top10b = ([spFn[x] for x in (spMx_array[spSegment_no]*-1).argsort()][:12])
print("Most significant words in the spanish text:")
print(top10b)

# calculate number of overlapping terms
print(" ")
print(" ")
print("Overlaps:")
for word in top10a_possible_translations:
    print(', '.join(trans for trans in top10a_possible_translations[word] if (trans in top10b or trans == word)))

# do a nifty ranking


Comparing words from segments 6 (latin) and 8 (spanish)...
 
Most significant words in the latin text:
['semi', 'haya', 'por', 'casso', 'pario', 'volo', 'paro', 'servicios', 'servicio', 'personal', 'indios', 'tribuo']
 
 
For each of the latin words, here are possible translations:
semi:
mitad, semi, medio, parcialmente, media, parte, mediano, cora, intermedio, parcial, tercio, semifinal, mediana, cuasi, cuarta
haya:
haya, hamás, ele, jeque, alteza, cordobés, mahoma, córdoba, tanzania, princesa, árabe, israel, tablón, malentendido, palestina
por:
veces, ésos, ele, aquéllos, aquéllas, ésas, éste, doña, aquél, por, hai, éstas, ia, ése, favor
casso:
caer, caída, recaer, caerse, caído, comenzar, empezar, empiece, empiezo, comienzo, empezado, inicio, iniciar, iniciarse, vacilar
volo:
vuelo, volando, volar, avión, copiloto, chicago, paloma, palomar, milán, volador, mosca, pájaro, piloto, aves, aviación
paro:
huelga, paro, nepal, desempleo, perú, pelotudo, desempleado, paraguay, laburo, desocupación, delhi, lama, boludo, uruguay, bolivia
personal:
individual, personalmente, particular, personal, personales, individuo, privado, individualidad, personalidad, propio, espiritual, su, personalizado, subjetivo, suyo
 
 
Most significant words in the spanish text:
['mesmo', 'personal', 'perder', 'ordenar', 'casar', 'indio', 'voluntad', 'servicio', 'querer', 'tributo', 'encomendar', 'cosa']
 
 
Overlaps:
semi
haya
por


paro
personal

Graph-based NLP

Topic Modelling

...

Manual Annotation

...

Further information