Let's go back to the example of the sonnet about writing a sonnet, by Lope de Vega, in a French translation.
The output of the collation mostly contains differences of punctuation and capitalization, as we can see in the results below.
In [1]:
from collatex import *
collation = Collation()
witness_1707 = open( "../data/sonnet/Lope_soneto_FR_1707.txt", encoding='utf-8' ).read()
witness_1822 = open( "../data/sonnet/Lope_soneto_FR_1822.txt", encoding='utf-8' ).read()
collation.add_plain_witness( "wit 1707", witness_1707 )
collation.add_plain_witness( "wit 1822", witness_1822 )
alignment_table = collate(collation, output='html2')
Imagine that we are not interested in punctuation and capitalization: we only want what might be called 'substantive variants'.
The "hard way" of obtaining the expected result is to remove punctuation and lower-case all the texts. The code below will do just that: it will
data/sonnet
dir, called 'norm'The creation of a normalized copy is safer than just normalizing the original transcriptions. If you keep the originals, you can always come back to them and perform other kinds of normalization if needed.
Note: the code below contains lots of comments, that is, text that will not be executed but can be used for documentation. You have seen that in XML comments are placed inside <!-- -->. In Python, comments are marked with the sign #.
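For instance, these two lines are a minimal illustration of the syntax (the values here are arbitrary, not part of the exercise):
# this whole line is a comment and is ignored when the cell runs
greeting = 'bonjour'  # a comment can also follow some code on the same line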
In [3]:
import glob, re, os
path = '../data/sonnet/' # put the path into a variable
os.makedirs(path + 'norm', exist_ok=True) # create a new folder, if it does not exist
files = [os.path.basename(x) for x in glob.glob(path + '*.txt')] # take all txt files in the directory
for file in files: # for each file in the directory
    ### READ THE FILE CONTENT
    file_opened = open(path + file, 'r', encoding='utf-8') # open the file in mode 'r' (read)
    content = file_opened.read() # read the file content
    ### ALL TO LOWER CASE
    lowerContent = content.lower()
    ### REMOVE PUNCTUATION
    # replace everything that is not an alphanumeric character (\w) or whitespace (\s) with nothing
    # (or with a whitespace, depending on the language)
    noPunct_lowerContent = re.sub(r'[^\w\s]', '', lowerContent)
    ### REMOVE MULTIPLE WHITESPACES
    regularSpaces_noPunct_lowerContent = " ".join(noPunct_lowerContent.split())
    ### CREATE A NEW FILE
    filename = file.split('.')[0]
    new_file = open(path + 'norm/' + filename + '_norm.txt', 'w', encoding='utf-8') # open the new file in mode 'w' (write)
    ### WRITE THE NEW CONTENT INTO THE NEW FILE
    new_file.write(regularSpaces_noPunct_lowerContent)
    ### CLOSE THE FILE
    new_file.close()
print('Finished! All normalized!')
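If you are curious about what the two main transformations do, you can test them on a small string first. The sample sentence below is just an illustration, not taken from the witness files:
import re
sample = "Bonjour,   MONDE ! Comment ça va ?"  # made-up sample with punctuation and extra spaces
step_1 = re.sub(r'[^\w\s]', '', sample.lower())
print(step_1)                    # punctuation removed and everything lower-cased; multiple spaces still there
print(" ".join(step_1.split()))  # runs of whitespace collapsed into single spaces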
Now, let's collate the normalized copies.
Pay attention to the new path. The output should be different from the one above!
In [4]:
from collatex import *
collation = Collation()
witness_1707 = open( "../data/sonnet/norm/Lope_soneto_FR_1707_norm.txt", encoding='utf-8' ).read()
witness_1822 = open( "../data/sonnet/norm/Lope_soneto_FR_1822_norm.txt", encoding='utf-8' ).read()
collation.add_plain_witness( "wit 1707", witness_1707 )
collation.add_plain_witness( "wit 1822", witness_1822 )
alignment_table = collate(collation, output='html2')
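If you want to double-check what the normalization produced, you can also print one of the normalized files next to its original (same paths as above) and compare them:
print( open( "../data/sonnet/Lope_soneto_FR_1707.txt", encoding='utf-8' ).read() )
print( open( "../data/sonnet/norm/Lope_soneto_FR_1707_norm.txt", encoding='utf-8' ).read() )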
Another way to prepare the texts for collation is to enrich them with annotations: the annotation attached to each token will be treated by the software as its normalized form and used for collation. This approach is possible thanks to a fundamental feature of CollateX that we haven't seen yet.
By default, CollateX stores two pieces of information for each token, called 't' and 'n'. They were created precisely to handle normalization issues. The 't' property stores the original token, while the 'n' property can store a normalized form of 't'. CollateX always uses the string recorded in the 'n' property for collation. When the 'n' property is not explicitly defined, the value of the 't' property is copied into 'n'.
We can put whatever we want into 'n'. As in the example above, we can record there a copy of each token, but lower-cased and without punctuation. We can also do more sophisticated processing and use 'n' to store linguistic information for each token, to be reused during the alignment phase or for analysis. But we won't consider these more complex treatments in this course.
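Just to give a minimal preview of the mechanism before the full example below, here is a tiny, made-up input where the 't' values of two tokens differ but their 'n' values are identical, so CollateX aligns them as matches:
from collatex import *
tiny = { "witnesses": [
    { "id": "A", "tokens": [ { "t": "Hasard;", "n": "hasard" }, { "t": "et", "n": "et" } ] },
    { "id": "B", "tokens": [ { "t": "hasard,", "n": "hasard" }, { "t": "et,", "n": "et" } ] }
] }
print( collate( tiny, segmentation=False ) )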
Let's go back to the previous example and consider only the first tercet. We'll see that the only differences are punctuation marks.
In [5]:
# first tercet only
from collatex import *
collation = Collation()
collation.add_plain_witness( "wit 1707", "Je commence au hasard; et si je ne m'abuse,")
collation.add_plain_witness( "wit 1822", "Je commence au hasard, et, si je ne m'abuse,")
alignment_table = collate(collation, output='html2', segmentation=False)
print( alignment_table )
Now we want to arrive at the same results that we reached in Normalization 1, but using the 't' and 'n' properties. They become visible if we provide the data for collation as JSON (an open-standard file format for storing and exchanging data, widely used in web development and beyond).
In the example below, you can see that there are two witnesses, each having an "id" and some "tokens". For each token, the 't' and 'n' properties are recorded.
In this case, we manually store the value we prefer into the 'n' property. Remember, what matters is not the exact value, but that tokens we want to treat as equivalent share the same 'n' value.
Try to run the cell below!
Exercise. Change the value of the 'n' property for some tokens and re-run the cell to see what happens. For example, the fourth token is "hasard;" in the first witness and "hasard," in the second witness. Replace both their "n" values with "random" (the English translation of au hasard) and re-run the cell. The result should not change!
In [6]:
# first tercet only
# note: normalizing a token to nothing (an empty 'n' value) would give errors in the svg output
from collatex import *
import json
collation = Collation()
json_input = """{
"witnesses": [
{
"id": "wit1707",
"tokens": [
{
"t": "Je",
"n": "je"
},
{
"t": "commence",
"n": "commence"
},
{
"t": "au",
"n": "au"
},
{
"t": "hasard;",
"n": "hasard"
},
{
"t": "et",
"n": "et"
},
{
"t": "si",
"n": "si"
},
{
"t": "je",
"n": "je"
},
{
"t": "ne",
"n": "ne"
},
{
"t": "m'abuse,",
"n": "m'abuse"
}
]
},
{
"id": "wit1822",
"tokens": [
{
"t": "Je",
"n": "je"
},
{
"t": "commence",
"n": "commence"
},
{
"t": "au",
"n": "au"
},
{
"t": "hasard,",
"n": "hasard"
},
{
"t": "et,",
"n": "et"
},
{
"t": "si",
"n": "si"
},
{
"t": "je",
"n": "je"
},
{
"t": "ne",
"n": "ne"
},
{
"t": "m'abuse,",
"n": "m'abuse"
}
]
}
]
}"""
collate(json.loads(json_input), segmentation=False, output="html2")
This is wonderful, but very time-consuming!
We can automatically assign a value to the "n" property by processing the "t" value in some way. In the example below, the "n" value is a copy of the "t" value where all letters are lower-cased and the punctuation is removed.
In the results, you will see the JSON input, followed by the CollateX output. If you only want the CollateX output, in the code below comment out
print(json_input)
by adding # at the beginning of the line, as in
# print(json_input)
And re-run the cell!
Note: when the input is given as JSON and segmentation is set to True, there are no whitespaces between words in the output, because the input consists of single tokens without whitespace.
In [7]:
import re
from collatex import *
import json
witness_1707 = open( "../data/sonnet/Lope_soneto_FR_1707.txt", encoding='utf-8' ).read()
witness_1822 = open( "../data/sonnet/Lope_soneto_FR_1822.txt", encoding='utf-8' ).read()
A = ["wit 1707", witness_1707]
B = ["wit 1822", witness_1822]
listWitnesses = [A,B] # create a list of witnesses
data = {}
data["witnesses"] = []
for witness in listWitnesses: # for each witness in the list
    tokens = [] # create an empty list for the tokens
    data["witnesses"].append({
        "id": witness[0], # give as id the first item in A or B
        "tokens": tokens # and as tokens the (still empty) list
    })
    for w in witness[1].split(): # for each word in the witness text (second item in A or B)
        t = w # t is the original word
        # n is w with no upper-case and no punctuation.
        # Replace everything that is not an alphanumeric character (\w) or whitespace (\s) with nothing.
        # Attention: if replaced with a whitespace, it will create differences --> avoid.
        # This does not happen in the previous method (Normalization 1),
        # because there the tokenization happens afterwards and strips the whitespace.
        n = re.sub(r'[^\w\s]', '', w.lower())
        tokens.append({ # populate the token list with the values for t and n
            "t": t,
            "n": n
        })
json_input = json.dumps(data) # data created turned into json string with double quotes
print(json_input)
collation = Collation()
# if segmentation=True there would be no whitespaces between words, because the input consists of single tokens without whitespace
collate(json.loads(json_input), segmentation=False, output="html2")
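The same mechanism lends itself to more aggressive normalizations. As a purely hypothetical extension (not part of this exercise), the 'n' value could also ignore accents, so that accented and unaccented spellings of the same word are aligned as matches. Using only the standard library, the computation of n could be replaced with something like this:
import re, unicodedata

def normalize(word):
    # lower-case and strip punctuation, as before
    word = re.sub(r'[^\w\s]', '', word.lower())
    # decompose accented letters (é -> e + combining accent) and drop the combining marks
    word = unicodedata.normalize('NFD', word)
    return ''.join(c for c in word if unicodedata.category(c) != 'Mn')

# inside the loop above, the normalization line would then become: n = normalize(w)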