This is an introduction to some algorithms used in text analysis. While I cannot define what questions a scholar can ask, I can and do describe here what kind of information about text some popular methods can deliver. From this, you need to draw on your own research interests and creativity...
I will describe methods of finding words that are characteristic of a certain passage ("tf/idf"), and of constructing fingerprints or "wordclouds" for passages that go beyond the most significant words ("word vectors"). Of course, an important resource in text analysis is the hermeneutic interpretation of the scholar herself, so I will present a method of adding manual annotations to the text, and finally I will also say something about possible approaches to working across languages.
At the moment, the following topics are still waiting to be discussed: grouping passages according to their similarity ("clustering"), and forming an idea about the different contexts being treated in a passage ("topic modelling"). Some more prominent approaches in the areas mentioned so far are "collocation" analyses and the "word2vec" tool; I would like to add discussions of these at a later moment.
"Natural language processing" in the strict sense, i.e. analyses that have an understanding of how a language works, with its grammar, different modes, times, cases and the like, are not going to be covered; this implies "stylometric" analyses. Nor are there any discussions of "artificial intelligence" approaches. Maybe these can be discussed at another occasion and on another page.
For many of the steps discussed on this page there are ready-made tools and libraries, often with easy interfaces. But first, it is important to understand what these tools are actually doing and how their results are affected by the selection of parameters (that one can or cannot modify).
And second, most of these tools expect the input to be in some particular format, say, a series of plaintext files in their own directory, a list of (word, number) pairs, a table, or a series of integer (or floating point) numbers, etc. So, by understanding the process, you should be better prepared to provide your text to the tools in the most productive way.
Finally, it is important to be aware of what information is lost at which point in the process. If the research requires it, one can then either look for a different tool or approach for this step (e.g. using an additional dimension in the list of words to keep both original and regularized word forms, or to remember the position of the current token in the original text), or one can compensate for the data loss (e.g. offering a lemmatised search to find occurrences after the analysis returns only normalised word forms)...
The programming language used in the following examples is called "python" and the tool used to get prose discussion and code samples together is called "jupyter". In jupyter, you have a "notebook" that you can populate with text or code and a program that pipes a nice rendering of the notebook to a web browser. In this notebook, in many places, the output that the code samples produce is printed right below the code itself. Sometimes this can be quite a lot of output and depending on your viewing environment you might have to scroll quite some way to get to the continuation of the discussion. You can save your notebook online (the current one is here at github) and there is an online service, nbviewer, able to render any notebook that it can access online. So chances are you are reading this present notebook at the web address https://nbviewer.jupyter.org/github/awagner-mainz/notebooks/blob/master/gallery/TextProcessing_Solorzano.ipynb.
A final word about the elements of this notebook:
As indicated above, before doing maths, language processing tools normally expect their input to be in a certain format. First of all, you have to have an input at all: a scholar wishing to experiment with such methods should avail herself of a full transcription of the text to be studied. This can come from transcribing it herself, from transcriptions that are available elsewhere, or even from OCR. (Although in the latter case, the results depend of course on the quality of the OCR output.) Second, many tools get tripped up when formatting or bibliographical metainformation is included in their input. And since the approaches presented here are not concerned with a digital edition or any other form of true representation of the source, markup (e.g. for bold font, heading or note elements) should be suppressed. (Other tools accept marked-up text and strip the formatting internally.) So you should try to get a copy of the text(s) you are working with in plaintext format.
For another detail regarding these plain text files, we have to make a short excursus, because even with plain text, there are some important aspects to consider: As you surely know, computers understand only numbers, and as you probably also know, the first standards to encode alphanumeric characters as numbers, like ASCII, were designed for teleprinters and the reduced character set of the English language. When more unusual characters, like umlauts or accented letters, were to be encoded, one had to rely on extra sets of rules, of which - unfortunately - there have been quite a lot. These are called "encodings", and among the more important ones are the Windows encodings (e.g. CP-1252) and Latin-9/ISO 8859-15 (which differs from the older Latin-1 encoding, among other things, by including the Euro sign). Maybe you have seen web pages with garbled umlauts or other special characters: that was probably because your browser interpreted the numbers according to an encoding different from the one that the webpage author used. Anyway, the point here is that there is another standard encompassing virtually all the special signs from all languages, and for quite a few years now it has been well supported by operating systems, programming languages and linguistic tools. This standard is called "Unicode", and the encoding you want to use is called utf-8. So when you export or import your texts, try to make sure that this is what is used. (Here is a webpage with the complete Unicode table - it is loaded incrementally, so make sure to scroll down in order to get an impression of what signs this standard covers. But on the other hand, it is so extensive that you don't really want to scroll through the whole table...)
Especially when you are coming from a Windows operating system, you might have to do some searching about how to export your text to utf-8. (At one point I could make a Unicode plaintext export in WordPad, only to find out after some time of desperate debugging that it was utf-16 that I had been given. Maybe you can still find the traces of my own conversion of such files to utf-8 below.)
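If you do end up with a file in the wrong encoding, the conversion itself takes only a few lines of python. Here is a minimal sketch (the file names are, of course, made up) that reads a utf-16 file and writes its content back out as utf-8:

# Read a file that was saved as utf-16 and write it out again as utf-8.
with open('my_export.txt', encoding='utf-16') as source:
    text = source.read()
with open('my_export_utf8.txt', encoding='utf-8', mode='w') as target:
    target.write(text)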
This section now describes how the plaintext can be prepared further for analyses: E.g. if you want to process the distribution of words in the text, the processing method has to have some notion of different places in the text -- normally you don't want to manage words according to their absolute position in the whole work (say, the 6,349th word and the 3,100th one), but according to their occurrence in a particular section (say, in the third chapter, without caring too much whether it is at the 13th or the 643rd position in this chapter). So, you partition the text into meaningful segments which you can then label, compare etc.
Other preparatory work includes suppressing stopwords (like "the", "is", "of" in English) or making the tools treat different forms of the same word, or different historical spellings, identically. Here is what falls under this category:
For the examples given on this page, I am using a transcription of Juan de Solorzano's De Indiarum Iure, provided by Angela Ballone. Angela has inserted a special sequence of characters - "€€€ - [<Label for the section>]" - at places where she felt that a new section or argument is beginning, so that we can segment the big source file into different sections each dealing with one particular argument. (Our first task.) But first, let's have a look at our big source file; it is in the folder "Solorzano" and is called Sections_I.1_TA.txt.
In [1]:
# This is the path to our file
bigsourcefile = 'Solorzano/Sections_I.1_TA.txt'
# We use a variable 'input' for keeping its contents.
input = open(bigsourcefile, encoding='utf-8').readlines()
# Just for information, let's see the first 10 lines of the file.
input[0:10] # the slice [0:10] gives us the items at positions 0 through 9,
# i.e. exactly ten lines (python starts counting with '0').
# and since there is no line wrapping in the source file,
# a line can be quite long.
# You can see the lines ending with a "newline" character "\n" in the output.
Out[1]:
Next, as mentioned above, we want to associate information with only passages of the text, not the text as a whole. Therefore, the text has to be segmented. The one big single file is being split into meaningful smaller chunks. What exactly constitutes a meaningful chunk -- a chapter, an article, a paragraph etc. -- cannot be known independently of the text in question and of the research questions. Therefore, a typical approach is that the scholar either splits the text manually or inserts some symbols that otherwise do not appear in the text. This is what we have here. Then, processing tools can find these symbols and split the file accordingly. For keeping things neat and orderly, the resulting files are saved in a directory of their own...
(Note here and in the following that in most cases, when the program is counting, it does so beginning with zero. Which means that if we end up with 20 segments, they are going to be called segment_0.txt, segment_1.txt, ..., segment_19.txt. There is not going to be a segment bearing the number twenty, although we do have twenty segments. The first one has the number zero and the twentieth one has the number nineteen. Even for more experienced coders, this sometimes leads to mistakes, called "off-by-one errors".)
In [2]:
# folder for the several segment files:
outputBase = 'Solorzano/segment'
# initialise some variables:
at = -1
dest = None # this later takes our destination files
# Now, for every line, if it starts with our special string,
# do nothing with the line,
# but close the current and open the next destination file;
# if it does not,
# append it to whatever is the current destination file
# (stripping leading and trailing whitespace).
for line in input:
    if line[0:3] == '€€€':
        # if there is a file open, then close it
        if dest:
            dest.close()
        at += 1
        # open the next destination file for writing
        # (its filename is built from our outputBase variable,
        # the current position in the sequence of fragments,
        # and a ".txt" ending)
        dest = open(outputBase + '.' + str(at) + '.txt',
                    encoding='utf-8',
                    mode='w')
    else:
        # write the line (after it has been stripped of leading and trailing whitespace)
        dest.write(line.strip())
dest.close()
at += 1
# How many segments/files do we then have?
print(str(at) + ' files written.')
From the segments just created, we rebuild our corpus, iterating through them and reading them into another variable (which now stores, technically speaking, a list of strings, one per segment - whereas the variable input in the first code snippet held a list of the source file's lines).
In [3]:
path = 'Solorzano'
filename = 'segment.'
suffix = '.txt'
corpus = [] # This is our new variable. It will be populated below.
for i in range(0, at):
    with open(path + '/' + filename + str(i) + suffix, encoding='utf-8') as f:
        corpus.append(f.read())  # Here, a new element is added to our corpus.
                                 # Its content is read from the file 'f' opened above;
                                 # the 'with' block closes the file for us automatically.
Now we should have 20 strings in the variable corpus to play around with:
In [4]:
len(corpus)
Out[4]:
For a quick impression, let's see the opening 500 characters of an arbitrary one of them; in this case, we take the fourth segment, i.e. the one at position '3' (remember that counting starts at 0):
In [5]:
corpus[3][0:500]
Out[5]:
"Tokenising" means splitting the long lines of the input into single words. Since we are dealing with plain latin, we can use the default split method which relies on spaces to identify word boundaries. (In languages like Japanese or scripts like Arabic, this is more difficult.) Note that we do not compensate for words that are hyphenated/split across lines here! That is something that should be catered for in the transcription itself.
In [6]:
# We need a python library, because we want to use a "regular expression"
import re
tokenised = [] # A new variable again
# Every segment, initially a long string of characters, is now split into a list of words,
# based on non-word characters (whitespace, punctuation, parentheses and others - that's
# what we need the regular expression library for).
# Also, we make everything lower-case.
for segment in corpus:
    tokenised.append(list(filter(None, (word.lower() for word in re.split(r'\W+', segment)))))
print('We now have ' + str(sum(len(x) for x in tokenised)) + ' wordforms or "tokens" in our corpus of ' + str(len(tokenised)) + ' segments.')
Now, instead of corpus, we can use tokenised for our subsequent routines: a variable which, at each of its 20 positions, contains the list of words of the corresponding segment. In order to see how its structure differs from the corpus variable above, let's have a look at (the first 50 words of) the fourth segment again:
In [7]:
print(tokenised[3][0:50])
Already, we can have a first go at finding the most frequent words for a segment. (For this we use a simple library of functions that we import by the name of 'collections'.):
In [8]:
import collections
counter = collections.Counter(tokenised[3]) # Again, consider the fourth segment
print(counter.most_common(10)) # Making a counter 'object' of our segment,
                               # this now has a 'method' called most_common,
                               # offering us the object's most common elements.
                               # More 'methods' can be found in the documentation:
                               # https://docs.python.org/3/library/collections.html#collections.Counter
Perhaps now is a good opportunity for another small excursus. What we have printed in the last code cell is a series of pairs: words associated with their number of occurrences, sorted by the latter. (The underlying data structure, which maps words to counts, is called a "dictionary" in python.) However, the display looks a bit ugly. With another library called "pandas" (for "python data analysis"), we can make this look more intuitive. (Of course, your system must have this library installed in the first place so that we can import it in our code.):
In [9]:
import pandas as pd
df1 = pd.DataFrame.from_dict(counter, orient='index').reset_index() # from our counter object,
# we now make a DataFrame object
df2 = df1.rename(columns={'index':'lemma',0:'count'}) # and we name our columns
df2.sort_values(by='count', ascending=False)[:10]
Out[9]:
Looks better now, doesn't it?
(The bold number in the very first column is, as it were, the id of the respective lemma. You see that 'hoc' has the id '0' - because it was the first distinct word to occur at all - and 'ut' has the id '5' because it was the sixth distinct word in our segment. Most probably we are currently not interested in this and can ignore the first column.)
Next, since we prefer to count different word forms as one and the same "lemma", we have to do a step called "lemmatisation". In languages that are not strongly inflected, like English, one can get away with "stemming", i.e. just eliminating the ending of words: "wish", "wished", "wishing", "wishes" all can count as instances of "wish*". With Latin this is not so easy: we want to count occurrences of "legum", "leges", "lex" as one and the same word, but if we truncate after "le", we get too many hits that have nothing to do with lex at all. There are a couple of "lemmatising" tools available, although with classical languages (or even early modern ones), it's a bit more difficult. Anyway, we do our own, using a dictionary approach...
First, we have to have a dictionary which associates all known word forms with their lemma. This can also help us with historical orthography. Suppose, from some other context, we have a file "wordforms-lat-full.txt" at our disposal in the "Solorzano" folder. Its contents look like this:
In [115]:
wordfile_path = 'Solorzano/wordforms-lat-full.txt'
wordfile = open(wordfile_path, encoding='utf-8')
print(wordfile.read()[:64]) # in such from-to addresses, one can just skip the zero
wordfile.close()
So, we again build a dictionary of key-value pairs associating all the lemmata ("values") with their wordforms ("keys"). And afterwards, we can quickly look up the value under a given key:
In [116]:
lemma = {} # we build a so-called dictionary for the lookups
tempdict = []
# open the wordfile (defined above) for reading
wordfile = open(wordfile_path, encoding='utf-8')
for line in wordfile.readlines():
    tempdict.append(tuple(line.split('>')))   # we split each line by ">" and append a tuple to a
                                              # temporary list.
lemma = {k.strip(): v.strip() for k, v in tempdict}   # for every tuple in the list,
                                                      # we strip whitespace and make a key-value
                                                      # pair, adding it to our "lemma" dictionary
wordfile.close()
print(str(len(lemma)) + ' wordforms known to the system.')
Again, a quick test: Let's see with which "lemma"/basic word the particular wordform "fidem" is associated, or, in other words, what value our lemma variable returns when we query for the key "fidem":
In [117]:
lemma['fidem']
Out[117]:
Now we can use this dictionary to build a new list of words, where only lemmatised forms occur:
In [118]:
# For each segment, and for each word in it, add the lemma to our new "lemmatised"
# list, or, if we cannot find a lemma, add the actual word from the tokenised list.
lemmatised = [[lemma[word] if word in lemma else word for word in segment]
for segment in tokenised]
Again, let's see the first 50 words from the fourth segment, and compare them with the "tokenised" variant above:
In [119]:
print(lemmatised[3][:50])
As you can see, the original text is lost now from the data that we are currently working with - unless we were to add another dimension to our lemmatised variable that keeps the original word form.
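(Such a variant could look like the following minimal sketch, which stores (original token, lemma) pairs instead of the bare lemmata; it is just an illustration, and we will not use it further below:)

# Keep the original word form next to its lemma, as (original, lemma) pairs.
lemmatised_pairs = [[(word, lemma[word]) if word in lemma else (word, word)
                     for word in segment]
                    for segment in tokenised]
print(lemmatised_pairs[3][:10])

But let us see if something in the 10 most frequent words has changed: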
In [120]:
counter2 = collections.Counter(lemmatised[3])
df1 = pd.DataFrame.from_dict(counter2, orient='index').reset_index()
df2 = df1.rename(columns={'index':'lemma',0:'count'})
df2.sort_values(by='count', ascending=False)[:10]
Out[120]:
Yes, things have changed: "tributum" has moved one place up, "non" is now counted as "nolo" (I am not sure this makes sense, but such is the dictionary of wordforms we have used) and "pensum" has now made it on the list!
Probably "et", "in", "de", "qui", "ad", "sum/esse", "non/nolo" and many of the most frequent words are not really very telling words. They are what one calls stopwords, and we have another list of such words that we would rather want to ignore:
In [121]:
stopwords_path = 'Solorzano/stopwords-lat.txt'
stopwords = open(stopwords_path, encoding='utf-8').read().splitlines()
print(str(len(stopwords)) + ' stopwords known to the system, e.g.: ' + str(stopwords[95:170]))
Now let's try and suppress the stopwords in the segments (and see what the "reduced" fourth segment gives)...
In [122]:
# For each segment, and for each word in it,
# add it to a new list called "stopped",
# but only if it is not listed in the list of stopwords.
stopped = [[item for item in lemmatised_segment if item not in stopwords] \
for lemmatised_segment in lemmatised]
print(stopped[3][:49])
With this, we can already create a kind of first "profile" of, say, our first six segments, listing the most frequent words in each of them:
In [123]:
counter3 = collections.Counter(stopped[0])
df0_1 = pd.DataFrame.from_dict(counter3, orient='index').reset_index()
df0_2 = df0_1.rename(columns={'index':'lemma',0:'count'})
print(' Most frequent lemmata in the first text segment (segment number zero):')
df0_2.sort_values(by='count',axis=0,ascending=False)[:10]
Out[123]:
In [124]:
counter4 = collections.Counter(stopped[1])
df1_1 = pd.DataFrame.from_dict(counter4, orient='index').reset_index()
df1_2 = df1_1.rename(columns={'index':'lemma',0:'count'})
print(' Most frequent lemmata in the second text segment (segment number one):')
df1_2.sort_values(by='count',axis=0,ascending=False)[:10]
Out[124]:
In [125]:
counter5 = collections.Counter(stopped[2])
df2_1 = pd.DataFrame.from_dict(counter5, orient='index').reset_index()
df2_2 = df2_1.rename(columns={'index':'lemma',0:'count'})
print(' Most frequent lemmata in the third text segment:')
df2_2.sort_values(by='count',axis=0,ascending=False)[:10]
Out[125]:
In [126]:
counter6 = collections.Counter(stopped[3])
df3_1 = pd.DataFrame.from_dict(counter6, orient='index').reset_index()
df3_2 = df3_1.rename(columns={'index':'lemma',0:'count'})
print(' Most frequent lemmata in the fourth text segment:')
df3_2.sort_values(by='count',axis=0,ascending=False)[:10]
Out[126]:
Yay, look here, we have our words "indis", "tributum", "pensum" from the top ten above again, but this time the non-significant (for our present purposes) words in-between have been eliminated. Instead, new words like "numerata", "operis" etc. have made it into the top ten.
In [127]:
counter7 = collections.Counter(stopped[4])
df4_1 = pd.DataFrame.from_dict(counter7, orient='index').reset_index()
df4_2 = df4_1.rename(columns={'index':'lemma',0:'count'})
print(' Most frequent lemmata in the fifth text segment:')
df4_2.sort_values(by='count',axis=0,ascending=False)[:10]
Out[127]:
In [128]:
counter8 = collections.Counter(stopped[5])
df5_1 = pd.DataFrame.from_dict(counter8, orient='index').reset_index()
df5_2 = df5_1.rename(columns={'index':'lemma',0:'count'})
print(' Most frequent lemmata in the sixth text segment:')
df5_2.sort_values(by='count',axis=0,ascending=False)[:10]
Out[128]:
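(By the way, instead of repeating an almost identical cell for each segment, the same profiles could of course be produced in a single loop; a compact sketch, equivalent to the six cells above:)

# The same kind of "profile" for the first six segments, this time in one loop.
for i in range(6):
    counter_i = collections.Counter(stopped[i])
    profile = pd.DataFrame.from_dict(counter_i, orient='index').reset_index()
    profile = profile.rename(columns={'index': 'lemma', 0: 'count'})
    print('Most frequent lemmata in segment number ' + str(i) + ':')
    print(profile.sort_values(by='count', axis=0, ascending=False)[:10])
    print(' ')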
However, we can already observe that meaningful words like "indios/indis" are maybe not so helpful in characterising individual passages of this work, since they occur all over the place. After all, the work is called "De Indiarum Iure" and deals with various questions all related to indigenous people. Also, we would like to give some weight to the fact that one passage may consist almost entirely of stopwords with perhaps one or two substantial words, whereas another might be full of substantial words and contain only few stopwords (think e.g. of an abstract or an opening chapter describing the rest of the work). Or, since we have text segments of varying length, we would like our figures to reflect the fact that a tenfold occurrence in a very short passage may be more significant than a tenfold occurrence in a very, very, very long passage.
These phenomena are treated with more mathematical tools, so let's say that our preparatory work is done ...
As described, we are now going to delve a wee bit deeper into mathematics in order to get more precise characterizations of our text segments. The approach we are going to use is called "TF/IDF" and is a simple, yet powerful method that is very popular in text mining and search engine discussions.
Since maths works best with numbers, let's first of all build a list of all the words (in their basic form) that occur anywhere in the text, and give each one of those words an ID (the library we are going to use simply numbers them in alphabetical order):
In [129]:
# We can use a library function for this
from sklearn.feature_extraction.text import CountVectorizer
# Since the library function can do all of the above (splitting, tokenising, lemmatising),
# and since it is providing hooks for us to feed our own tokenising, lemmatising and stopwords
# resources or functions to it,
# we use it and work on our rather raw "corpus" variable from way above again.
# So first we build a tokenising and lemmatising function to work as an input filter
# to the CountVectorizer function
def ourLemmatiser(str_input):
    wordforms = re.split(r'\W+', str_input)
    return [lemma[wordform].lower().strip() if wordform in lemma else wordform.lower().strip() for wordform in wordforms]
# Then we initialize the CountVectorizer function to use our stopwords and lemmatising fct.
count_vectorizer = CountVectorizer(tokenizer=ourLemmatiser, stop_words=stopwords)
# Finally, we feed our corpus to the function, building a new "vocab" object
vocab = count_vectorizer.fit_transform(corpus)
# Print some results
print(str(len(count_vectorizer.get_feature_names())) + ' distinct words in the corpus:')
print(count_vectorizer.get_feature_names()[0:100])
You can see how our corpus of four thousand "tokens" actually contains only one and a half thousand different words (plus stopwords, but these number at most 384). And, in contrast to simple numbers, which have been filtered out by our stopwords filter, years like "1610" have been left in place.
However, our "vocab" object contains more than just all the unique words in our corpus. Let's get some information about it:
In [130]:
vocab
Out[130]:
It is actually a table with 20 rows (the number of our segments) and 1,672 columns (the number of unique words in the corpus). So what we have is a table where, for each segment, the number of occurrences of every "possible" word (in the sense of: used somewhere in the corpus) is listed.
("Sparse" means that the majority of fields are zero. And 2,142 fields are populated, which is more than the number of unique words in the corpus (1,672, see above) - that's obviously because some words occur in multiple segments = rows. Not much of a surprise, actually.)
Here is the whole table:
In [131]:
pd.DataFrame(vocab.toarray(), columns=count_vectorizer.get_feature_names())
Out[131]:
Each row of this table is a kind of fingerprint of a segment: We don't know the order of words in the segment - for us, it is just a "bag of words" - but we know which words occur in the segment and how often they do. As of now, however, it is a rather bad fingerprint, because how significant a certain number of occurrences of a word in a segment is depends on the actual length of the segment. Ignorant as we are (per assumption) of the role and meaning of those words, still, if a word occurs twice in a short paragraph, that should prima facie count as more characteristic of the paragraph than if it occurs twice in a multi-volume work.
We can reflect this if we divide the number of occurrences of a word by the number of tokens in the segment. Obviously the numbers will then be quite small - but what counts are the relations between the cells, and we can account for scaling and normalizing later...
We're almost there and we are switching from the CountVectorizer function to another one, that does the division just mentioned and will do more later on...
In [132]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Initialize the library's function
tfidf_vectorizer = TfidfVectorizer(stop_words=stopwords, use_idf=False, tokenizer=ourLemmatiser, norm='l1')
# Finally, we feed our corpus to the function to build a new "tf_matrix" object
tf_matrix = tfidf_vectorizer.fit_transform(corpus)
# Print some results
pd.DataFrame(tf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names())
Out[132]:
Now we have seen above that "indis" is occurring in all of the segments, because, as the title indicates, the whole work is about issues related to the Indies and to indigenous people. When we want to characterize a segment by referring to some of its words, is there a way to weigh down words like "indis" a little bit? Not filter them out completely, as we do with stopwords, but give them just a little less weight than words not appearing all over the place? Yes there is...
There is a measure called "term frequency / inverse document frequency" (tf/idf) that combines a local measure (how frequently a word appears in a segment, in comparison to the other words appearing in the same segment, viz. the table above) with a global measure (how frequently the word appears throughout the whole corpus). Roughly speaking, we have to add to the table above a new, global element: the number of documents the term appears in divided by the number of all documents in the corpus - or, rather, the other way round (that's why it is the "inverse" document frequency): the number of documents in the corpus divided by the number of documents the current term occurs in. (As with our local measure above, there is also some normalization going on, i.e. compensation for different lengths of documents and attenuation of high values, by taking the logarithm of the quotient.)
When you multiply the term frequency (from above) with this inverse document frequency, you have a formula which "rewards" frequent occurrences in one segment and rare occurrences over the whole corpus. (For more of the mathematical background, see this tutorial.)
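To make the computation a little more tangible, here is a toy-sized sketch of the formula in plain python. It uses the "smoothed" variant of the inverse document frequency that scikit-learn's TfidfVectorizer applies by default; the three mini-"documents" are, of course, made up. (On top of this, the library also normalises each document's vector to length 1 - that is the norm='l2' parameter in the next code cell.)

import math

# Three made-up, already tokenised mini-"documents".
docs = [['rex', 'lex', 'rex'],
        ['lex', 'indis'],
        ['indis', 'indis', 'rex']]
n_docs = len(docs)

def tf(term, doc):
    # term frequency: how often the term occurs in this document
    return doc.count(term)

def idf(term):
    # smoothed inverse document frequency (scikit-learn's default):
    # ln((1 + number of documents) / (1 + number of documents containing the term)) + 1
    df = sum(1 for doc in docs if term in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1

# tf/idf weight of "rex" in the first mini-document:
print(tf('rex', docs[0]) * idf('rex'))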
Again, we do not have to implement all the counting, division and logarithm ourselves but can rely on SciKit-learn's TfidfVectorizer function to generate a matrix of our corpus in just a few lines of code:
In [133]:
# Initialize the library's function
tfidf_vectorizer = TfidfVectorizer(stop_words=stopwords, use_idf=True, tokenizer=ourLemmatiser, norm='l2')
# Finally, we feed our corpus to the function to build a new "tfidf_matrix" object
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)
# Print some results
tfidf_matrix_frame = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names())
tfidf_matrix_frame
Out[133]:
Now let's print a more qualified list of top words (the 20 highest-scoring lemmata) for each segment:
In [134]:
# convert your matrix to an array to loop over it
mx_array = tfidf_matrix.toarray()
# get your feature names
fn = tfidf_vectorizer.get_feature_names()
pos = 0
for l in mx_array:
    print(' ')
    print(' Most significant words in segment ' + str(pos) + ':')
    print(pd.DataFrame.rename(pd.DataFrame.from_dict([(fn[x], l[x]) for x in (l*-1).argsort()][:20]), columns={0:'lemma',1:'tf/idf value'}))
    pos += 1
Due to the way they have been encoded in our sample texts, we can also spot references to other literature by their underscores (e.g. "de_oper", "de_iur", "et_cur" etc.), which makes you wonder whether it would be worthwhile to mark up all the references in some way, so that we could either concentrate on them or filter them out altogether. But other than that, the lists are in fact almost meaningful.
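(To illustrate the second option, filtering them out: since the underscore counts as a word character in our tokenisation, such reference tokens can be dropped with a simple filter over the stopped word lists; a minimal sketch:)

# Drop every token that contains an underscore,
# i.e. (most of) the encoded literature references.
without_refs = [[word for word in segment if '_' not in word]
                for segment in stopped]
print(without_refs[3][:20])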
Apart from making such lists, what can we do with this? First, let us recapitulate in more general terms what we have done so far, since a good part of it is extensible and applicable to many other methods: We have used a representation of each "document" (in our case, all those "documents" have been segments of one and the same text) as a series of values that indicate the document's relevance in particular "dimensions".
For example, the various values in the "alea dimension" indicate how characteristic this word, "alea", is for the present document. (By hypothesis, this also works the other way round, as an indication of which documents are the most relevant ones in matters of "alea". In fact, this is how search engines work.)
Many words did not occur at all in most of the documents and the series of values (matrix rows) contained many zeroes. Other words were stopwords which we would not want to affect our documents' scores - they did not yield a salient "dimension" and were dropped from the series of values (matrix columns). The values work independently and can be combined (when a document is relevant in one and in another dimension).
Each document is thus characterised by a so-called "vector" (a series of independent, combinable values) and is mapped in a "space" constituted by the dimensions of those vectors (matrix columns, series of values). In our case the dimensions have been derived from the corpus's vocabulary. Hence, the representation of all the documents is called their vector space model. You can really think of it as similar to a three-dimensional space: Document A goes quite some way in the x-direction, it goes not at all in the y-direction and it goes just a little bit in the z-direction. Document B goes quite some way, perhaps even further than A did, in both the y- and z-directions, but only a wee bit in the x-direction. Etc. etc. Only with many, many more independent dimensions instead of just the three spatial dimensions we are used to.
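To make the spatial picture concrete, the two documents from the analogy could be written down like this (a made-up miniature example with only three dimensions; the numbers are invented):

# Documents A and B as vectors in a tiny, made-up three-dimensional "word space".
#         x    y    z
doc_a = [0.8, 0.0, 0.1]   # far along x, not at all along y, a little along z
doc_b = [0.1, 0.9, 0.7]   # far along y and z, only a wee bit along x
# In the real model, each dimension stands for one word of the vocabulary
# and the value is that word's tf/idf weight in the document.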
The following sections will discuss ways of manipulating the vector space -- using alternative or additional dimensions -- and also ways of leveraging the VSM representation of our text for various analyses...
Instead of relying on the (either lemmatised or un-lemmatised) vocabulary of words occurring in your documents, you could also use other methods to generate a vector for them. A very popular such method is called n-grams and shall be presented briefly here:
Imagine a window that captures, say, three words and slides over your text word by word. The first capture would get the first three words, the second one words two to four, the third one words three to five, and so on up to the last three words of your document. This procedure generates all the "3-grams" contained in your text - not all the possible combinations of the words present in the vocabulary, but just the triples that happen to occur in the text. The meaningfulness of this method depends to a certain extent on how strongly the respective language inflects its words and on how freely it orders its sentence parts (a sociolect or literary genre might constrain or enhance the potential variance of the language). Less variance here means that the same ideas tend (!) to be presented in the same formulations more often than in languages with more variance on this syntactic level. To a certain extent, you can play around with lemmatisation and stopwords and with the size of your window. But in general, there are more 3-grams repeated in human language than one would expect. Even more so if we imagine our window encompassing only two words, resulting in 2-grams or, rather, bigrams.
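The sliding window just described can be written in a line or two of python; here is a tiny sketch with a made-up word list (the actual extraction below is again left to a library function):

# All 3-grams of a toy sentence: a window of three words, moved one word at a time.
words = 'in nomine domini amen dico vobis'.split()
trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
print(trigrams)
# [('in', 'nomine', 'domini'), ('nomine', 'domini', 'amen'), ...]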
As a quick example, let's list the top bi- or 3-grams of our text segments, together with the respective number of occurrences, and the 10 most frequent n-grams in the whole corpus:
In [135]:
ngram_size_high = 3
ngram_size_low = 2
top_n = 5
# Initialize a CountVectorizer like above
# (again using our lemmatising fct., but no stopwords this time)
vectorizer = CountVectorizer(ngram_range=(ngram_size_low, ngram_size_high), tokenizer=ourLemmatiser)
ngrams = vectorizer.fit_transform(corpus)
print('Most frequent 2-/3-grams')
print('========================')
print(' ')
ngrams_dict = []
df = []
df_2 = []
for i in range(0, len(corpus)):
    # (probably that's way too complicated here...)
    ngrams_dict.append(dict(zip(vectorizer.get_feature_names(), ngrams.toarray()[i])))
    df.append(pd.DataFrame.from_dict(ngrams_dict[i], orient='index').reset_index())
    df_2.append(df[i].rename(columns={'index':'n-gram',0:'count'}))
    print('Segment ' + str(i) + ':')
    if df_2[i]['count'].max() > 1:
        print(df_2[i].sort_values(by='count',axis=0,ascending=False)[:top_n])
        print(' ')
    else:
        print('  This segment has no bi- or 3-gram occurring more than just once.')
        print(' ')
ngrams_corpus = pd.DataFrame(ngrams.todense(), columns=vectorizer.get_feature_names())
ngrams_total = ngrams_corpus.cumsum()
print(' ')
print("The 10 most frequent n-grams in the whole corpus")
print("================================================")
ngrams_total.tail(1).T.rename(columns={19:'count'}).nlargest(10, 'count')
Out[135]:
Of course, there is no reason why the dimensions should be restricted to or identical with the vocabulary (or the occurring n-grams, for that matter). In fact, in the examples above, we have dropped some of the words already by using our list of stopwords. **We could also add other dimensions that are of interest for our current research question. We could add a dimension for the year in which the texts have been written, for their citing a certain author, or merely for their position in the encompassing work...**
Since, in our examples, the position is already represented in the "row number", and counting citations of a particular author requires some more normalisations (e.g. with the lemmatisation dictionary above), let's add a dimension for the length of the respective segment (in words) and another one for the number of occurrences of "_" (in our sample transcriptions, this character had been used to mark citations, although admittedly not all of them), just so you get the idea:
In [136]:
print("Original matrix of tf/idf values (rightmost columns):")
tfidf_matrix_frame.iloc[ :, -5:]
Out[136]:
In [137]:
length = []
for i in range(0, len(corpus)):
    length.append(len(tokenised[i]))
citnum = []
for i in range(0, len(corpus)):
    citnum.append(corpus[i].count('_'))
print("New matrix extended with segment length and number of occurrences of '_':")
new_matrix = tfidf_matrix_frame.assign(seg_length = length).assign(cit_count = citnum)
new_matrix.iloc[ :, -6:]
Out[137]:
You may notice that the segment with most occurrences of "_" (taken with a grain of salt, that's likely the segment with most citations), is not a particularly long one. If we had systematic markup of citations or author names in our transcription, we could be more certain or add even more columns/"dimensions" to our table.
If you bear with me for a final example, here is how we can add the labels that you could see in our initial big source file:
In [138]:
# The variable input should still hold the lines of our source file.
label = []
# Now, for every line, revisit the special string and extract just the labels of the lines marked by it
for line in input:
    if line[0:3] == '€€€':
        label.append(line[6:].strip())
# How many labels do we then have?
print(str(len(label)) + ' labels read.')
print("New matrix extended with segment length, number of occurrences of '_' and label:")
yet_another_matrix = new_matrix.assign(seg_length = length).assign(label = label)
yet_another_matrix.iloc[ :, -6:]
Out[138]:
We can use a library that takes word frequencies like above, calculates corresponding relative sizes of words and creates nice wordcloud images for our sections (again, taking the fourth segment as an example) like this:
In [139]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# We make tuples of (lemma, tf/idf score) for one of our segments
# But we have to convert our tf/idf weights to pseudo-frequencies (i.e. integer numbers)
frq = [ int(round(x * 100000, 0)) for x in mx_array[3]]
freq = dict(zip(fn, frq))
wc = WordCloud(background_color=None, mode="RGBA", max_font_size=40, relative_scaling=1).fit_words(freq)
# Now show/plot the wordcloud
plt.figure()
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
In order to have a nicer overview over the many segments than is possible in this notebook, let's create a new html file listing some of the characteristics that we have found so far...
In [140]:
outputDir = "Solorzano"
htmlfile = open(outputDir + '/Overview.html', encoding='utf-8', mode='w')
# Write the html header and the opening of a layout table
htmlfile.write("""<!DOCTYPE html>
<html>
<head>
<title>Section Characteristics</title>
<meta charset="utf-8"/>
</head>
<body>
<table>
""")
a = [[]]
a.clear()
dicts = []
w = []
# For each segment, create a wordcloud and write it along with label and
# other information into a new row of the html table
for i in range(0, len(mx_array)):
    # this is like above in the single-segment example...
    a.append([ int(round(x * 100000, 0)) for x in mx_array[i]])
    dicts.append(dict(zip(fn, a[i])))
    w.append(WordCloud(background_color=None, mode="RGBA",
                       max_font_size=40, min_font_size=10,
                       max_words=60, relative_scaling=0.8).fit_words(dicts[i]))
    # We write the wordcloud image to a file
    w[i].to_file(outputDir + '/wc_' + str(i) + '.png')
    # Finally we write the table row
    htmlfile.write("""
    <tr>
      <td>
        <head>Section {a}: <b>{b}</b></head><br/>
        <img src="./wc_{a}.png"/><br/>
        <small><i>length: {c} words</i></small>
      </td>
    </tr>
    <tr><td> </td></tr>
    """.format(a = str(i), b = label[i], c = len(tokenised[i])))
# And then we write the end of the html file.
htmlfile.write("""
</table>
</body>
</html>
""")
htmlfile.close()
This should have created a nice html file which we can open here.
Also, once we have a representation of our text as a vector - which we can imagine as an arrow that goes a certain distance in one direction, another distance in another direction and so on - we can compare the different arrows. Do they go the same distance in a particular direction? And maybe almost the same in another direction? This would mean that one of the terms of our vocabulary has the same weight in both texts. Comparing the weight of our many, many dimensions, we can develop a measure for the similarity of the texts.
(Probably, similarity in words that are occurring all over the place in the corpus should not count so much, and in fact it is attenuated by our arrows being made up of tf/idf weights.)
Comparing arrows means calculating with angles and technically, what we are computing is the "cosine similarity" of texts. Again, there is a library ready for us to use (but you can find some documentation here, here and here.)
In [141]:
from sklearn.metrics.pairwise import cosine_similarity
similarities = pd.DataFrame(cosine_similarity(tfidf_matrix))
similarities[round(similarities, 0) == 1] = 0 # Suppress a document's similarity to itself
print("Pairwise similarities:")
print(similarities)
In [142]:
print("The two most similar segments in the corpus are")
print("segments", \
similarities[similarities == similarities.values.max()].idxmax(axis=0).idxmax(axis=1), \
"and", \
similarities[similarities == similarities.values.max()].idxmax(axis=0)[ similarities[similarities == similarities.values.max()].idxmax(axis=0).idxmax(axis=1) ].astype(int), \
".")
print("They have a similarity score of")
print(similarities.values.max())
Clustering is a method of grouping data into subsets so that each subset has some internal cohesion. Sentences that are more similar to a particular "paradigm" sentence than to another one are grouped with the first one; others are grouped with their respective "paradigm" sentence. Of course, one of the challenges is finding sentences that work well as such paradigm sentences. So we have two (or even three) stages: find paradigms, group data accordingly (and learn how many groups there are).
I hope to be able to add a discussion of this subject soon. For now, here are nice tutorials for the process:
In [ ]:
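Until that discussion materialises, here is at least a minimal sketch of what such a step could look like, using the KMeans implementation from scikit-learn on the tf/idf matrix built above; the number of clusters (4) is an arbitrary assumption:

from sklearn.cluster import KMeans

# Group the segments into a (freely chosen) number of clusters,
# based on their tf/idf vectors from above.
kmeans = KMeans(n_clusters=4, random_state=0)
clusters = kmeans.fit_predict(tfidf_matrix)
# Print, for each segment, the cluster it has been assigned to.
for segment_no, cluster_no in enumerate(clusters):
    print('Segment ' + str(segment_no) + ' -> cluster ' + str(cluster_no))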
Let us prepare a second text, this time in Spanish, and see how they compare...
In [143]:
bigspanishfile = 'Solorzano/Sections_II.2_PI.txt'
spInput = open(bigspanishfile, encoding='utf-8').readlines()
spAt = -1
spDest = None
for line in spInput:
    if line[0:3] == '€€€':
        if spDest:
            spDest.close()
        spAt += 1
        spDest = open(outputBase + '.' + str(spAt) +
                      '.spanish.txt', encoding='utf-8', mode='w')
    else:
        spDest.write(line.strip())
spAt += 1
spDest.close()
print(str(spAt) + ' files written.')

import errno   # needed for the error handling below

spSuffix = '.spanish.txt'
spCorpus = []
for i in range(0, spAt):
    try:
        with open(path + '/' + filename + str(i) + spSuffix, encoding='utf-8') as f:
            spCorpus.append(f.read())
    except IOError as exc:
        if exc.errno != errno.EISDIR:  # Do not fail if a directory is found, just ignore it.
            raise                      # Propagate other kinds of IOError.
print(str(len(spCorpus)) + ' files read.')

# Labels
spLabel = []
for spLine in spInput:
    if spLine[0:3] == '€€€':
        spLabel.append(spLine[6:].strip())
print(str(len(spLabel)) + ' labels found.')

# Tokens
spTokenised = []
for spSegment in spCorpus:
    spTokenised.append(list(filter(None, (spWord.lower()
                                          for spWord in re.split(r'\W+', spSegment)))))

# Lemmata
spLemma = {}
spTempdict = []
spWordfile_path = 'Solorzano/wordforms-es.txt'
spWordfile = open(spWordfile_path, encoding='utf-8')
for spLine in spWordfile.readlines():
    spTempdict.append(tuple(spLine.split('>')))
spLemma = {k.strip(): v.strip() for k, v in spTempdict}
spWordfile.close()
print(str(len(spLemma)) + ' spanish wordforms known to the system.')

# Stopwords
spStopwords_path = 'Solorzano/stopwords-es.txt'
spStopwords = open(spStopwords_path, encoding='utf-8').read().splitlines()
print(str(len(spStopwords)) + ' spanish stopwords known to the system.')
print(' ')

print('Significant words in the spanish text:')
# tokenising and lemmatising function
def spOurLemmatiser(str_input):
    spWordforms = re.split(r'\W+', str_input)
    return [spLemma[spWordform].lower() if spWordform in spLemma else spWordform.lower() for spWordform in spWordforms]

spTfidf_vectorizer = TfidfVectorizer(stop_words=spStopwords, use_idf=True, tokenizer=spOurLemmatiser, norm='l2')
spTfidf_matrix = spTfidf_vectorizer.fit_transform(spCorpus)
spMx_array = spTfidf_matrix.toarray()
spFn = spTfidf_vectorizer.get_feature_names()
pos = 1
for l in spMx_array:
    print(' ')
    print(' Most significant words in the ' + str(pos) + '. segment:')
    print(pd.DataFrame.rename(pd.DataFrame.from_dict([(spFn[x], l[x]) for x in (l*-1).argsort()][:10]), columns={0:'lemma',1:'tf/idf value'}))
    pos += 1
Now imagine how we would bring the two documents together in a vector space. We would generate dimensions for all the words of our spanish vocabulary and would end up with a common space of roughly twice as many dimensions as before - and the latin work would be only in the first half of the dimensions and the spanish work only in the second half. The respective other half would be populated with only zeroes. So in effect, we would not really have a common space or something on the basis of which we could compare the two works. :-(
What might be an interesting perspective, however - since in this case, the second text is a translation of the first one - is a parallel, synoptic overview of both texts. So, let's at least add the second text to our html overview with the wordclouds:
In [144]:
htmlfile2 = open(outputDir + '/Synopsis.html', encoding='utf-8', mode='w')
htmlfile2.write("""<!DOCTYPE html>
<html>
<head>
<title>Section Characteristics, parallel view</title>
<meta charset="utf-8"/>
</head>
<body>
<table>
""")
spA = [[]]
spA.clear()
spDicts = []
spW = []
for i in range(0, max(len(mx_array), len(spMx_array))):
    if (i > len(mx_array) - 1):
        htmlfile2.write("""
        <tr>
          <td>
            <head>Section {a}: n/a</head>
          </td>""".format(a = str(i)))
    else:
        htmlfile2.write("""
        <tr>
          <td>
            <head>Section {a}: <b>{b}</b></head><br/>
            <img src="./wc_{a}.png"/><br/>
            <small><i>length: {c} words</i></small>
          </td>""".format(a = str(i), b = label[i], c = len(tokenised[i])))
    if (i > len(spMx_array) - 1):
        htmlfile2.write("""
          <td>
            <head>Section {a}: n/a</head>
          </td>
        </tr><tr><td> </td></tr>""".format(a = str(i)))
    else:
        spA.append([ int(round(x * 100000, 0)) for x in spMx_array[i]])
        spDicts.append(dict(zip(spFn, spA[i])))
        spW.append(WordCloud(background_color=None, mode="RGBA",
                             max_font_size=40, min_font_size=10,
                             max_words=60, relative_scaling=0.8).fit_words(spDicts[i]))
        spW[i].to_file(outputDir + '/wc_' + str(i) + '_sp.png')
        htmlfile2.write("""
          <td>
            <head>Section {d}: <b>{e}</b></head><br/>
            <img src="./wc_{d}_sp.png"/><br/>
            <small><i>length: {f} words</i></small>
          </td>
        </tr>
        <tr><td> </td></tr>""".format(d = str(i), e = spLabel[i], f = len(spTokenised[i])))
htmlfile2.write("""
</table>
</body>
</html>
""")
htmlfile2.close()
Again, the resulting file can be opened here.
Maybe there is an approach to inter-lingual comparison after all. Here is the API documentation of conceptnet.io, which we can use to look up synonyms, related terms and translations, with a URI like this one:
http://api.conceptnet.io/related/c/la/rex?filter=/c/es
We can get an identifier for a word and many possible translations for it. So we could - this remains to be tested in practice - look up our ten (or so) most significant words in one language and collect all possible translations into the second language. Then we could compare these with what we actually find in the second work. How much overlap there is going to be, and how unambiguous it is going to be, remains to be seen, however...
For example, with a single segment, we could do something like this:
In [159]:
import urllib.request
import json
from collections import defaultdict

segment_no = 6
spSegment_no = 8
print("Comparing words from segments " + str(segment_no) + " (latin) and " + str(spSegment_no) + " (spanish)...")
print(" ")

# Build list of most significant words for a segment
top10a = []
top10a = ([fn[x] for x in (mx_array[segment_no]*-1).argsort()][:12])
print("Most significant words in the latin text:")
print(top10a)
print(" ")

# Build lists of possible translations (the 15 most closely related ones)
top10a_possible_translations = defaultdict(list)
for word in top10a:
    concepts_uri = "http://api.conceptnet.io/related/c/la/" + word + "?filter=/c/es"
    response = urllib.request.urlopen(concepts_uri)
    concepts = json.loads(response.read().decode(response.info().get_param('charset') or 'utf-8'))
    for rel in concepts["related"][0:15]:
        top10a_possible_translations[word].append(rel.get("@id").split('/')[-1])

print(" ")
print("For each of the latin words, here are possible translations:")
for word in top10a_possible_translations:
    print(word + ":")
    print(', '.join(trans for trans in top10a_possible_translations[word]))

print(" ")
print(" ")

# Build list of most significant words in the second language
top10b = []
top10b = ([spFn[x] for x in (spMx_array[spSegment_no]*-1).argsort()][:12])
print("Most significant words in the spanish text:")
print(top10b)

# calculate number of overlapping terms
print(" ")
print(" ")
print("Overlaps:")
for word in top10a_possible_translations:
    print(', '.join(trans for trans in top10a_possible_translations[word] if (trans in top10b or trans == word)))
# do a nifty ranking
...
...