Normalization

There are various ways to normalize texts before collation.

Normalization 1. Remove and replace

Let's go back to the example of the sonnet about writing a sonnet, by Lope de Vega, in a French translation.

The output of the collation mostly shows differences in punctuation and capitalization, as we can see in the results below.


In [1]:
from collatex import *
collation = Collation()
witness_1707 = open( "../data/sonnet/Lope_soneto_FR_1707.txt", encoding='utf-8' ).read()
witness_1822 = open( "../data/sonnet/Lope_soneto_FR_1822.txt", encoding='utf-8' ).read()
collation.add_plain_witness( "wit 1707", witness_1707 )
collation.add_plain_witness( "wit 1822", witness_1822 )
alignment_table = collate(collation, output='html2')


wit 1707 | wit 1822
Doris, qui sait qu'aux vers quelquefois je me plais, Me demande un sonnet | Doris, qui sait qu'aux vers quelquefois je me plais, Me demande un sonnet
; | ,
et je m'en désespère: Quatorze vers, grand Dieu! | et je m'en désespère: Quatorze vers, grand Dieu!
le | Le
moyen de les faire | moyen de les faire
! | ?
En voilà cependant déjà quatre de faits. Je ne pouvais d'abord trouver de rime | En voilà cependant déjà quatre de faits. Je ne pouvais d'abord trouver de rime
; | ,
mais | mais
, | -
En faisant on apprend à se tirer d'affaire | En faisant on apprend à se tirer d'affaire
: | .
Poursuivons | Poursuivons
, | :
les quatrains ne m'étonneront guère | les quatrains ne m'étonneront guère
, | -
Si du premier tercet je puis faire les frais. Je commence au hasard | Si du premier tercet je puis faire les frais. Je commence au hasard
; | ,
et | et
- | ,
si je ne m'abuse, Je n'ai | si je ne m'abuse, Je n'ai
pas | point
commencé | commencé
, | -
sans l'aveu de | sans l'aveu de
la | ma
muse, Puisqu'en si peu de temps je | muse, Puisqu'en si peu de temps je
m'en | me
tire | tire
si | du
net. J'entame le second | net. J'entame le second
; | ,
et ma joie est extrême; Car des vers commandés j'achève le treizième; Comptez s'ils sont quatorze | et ma joie est extrême; Car des vers commandés j'achève le treizième; Comptez s'ils sont quatorze
; | ,
et voilà le sonnet. | et voilà le sonnet.

Imagine that we are not interested in punctuation and capitalization: we only want what might be called 'substantive variants'.

The "hard way" of obtaining the expected result is to remove the punctuation and lower-case all the texts. The code below does just that. It will

  • create a new directory, inside the data/sonnet dir, called 'norm'
  • make a normalized copy (without punctuation and all lower-case) of each file inside the new 'norm' dir

The creation of a normalized copy is safer than just normalizing the original transcriptions. If you keep the originals, you can always come back to them and perform other kinds of normalization if needed.
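
After running the code below, the data folder will contain the originals alongside the normalized copies, in a layout like this (only the two witnesses used in this lesson are shown):

../data/sonnet/
    Lope_soneto_FR_1707.txt
    Lope_soneto_FR_1822.txt
    norm/
        Lope_soneto_FR_1707_norm.txt
        Lope_soneto_FR_1822_norm.txt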

Note: the code below contains lots of comments, that is, text that will not be executed but can be used for documentation. You have seen that in XML comments are enclosed in <!-- -->. In Python, comments are marked with the sign #.
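
For example:

<!-- this is a comment in XML -->
# this is a comment in Python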


In [3]:
import glob, re, os

path = '../data/sonnet/'  # put the path into a variable 

os.makedirs(path + 'norm', exist_ok=True)  # create a new folder, if it does not exist yet

files = [os.path.basename(x) for x in glob.glob(path+'*.txt')]  # take all txt files in the directory

for file in files:  # for each file in the directory
    
    ### READ THE FILE CONTENT
    file_opened = open(path+file, 'r', encoding='utf-8') # open the file in mode 'r' (read)
    content = file_opened.read()  # read the file content
    file_opened.close()  # close the original file once its content has been read
    
    ### ALL TO LOWER CASE
    lowerContent = content.lower() 
    
    ### REMOVE PUNCTUATION 
    # replace everything that is not an alphanumeric character (\w) or whitespace (\s) with nothing
    # (for some languages, replacing with a space may work better)
    noPunct_lowerContent = re.sub(r'[^\w\s]','',lowerContent) 
    
    ### REMOVE MULTIPLE WHITESPACES
    regularSpaces_noPunct_lowerContent = " ".join(noPunct_lowerContent.split())
    
    ### CREATE A NEW FILE
    filename = file.split('.')[0]
    new_file = open(path+'norm/' + filename + '_norm.txt', 'w', encoding='utf-8') # open the new file in mode 'w' (write)
    
    ### WRITE THE NEW CONTENT INTO THE NEW FILE
    new_file.write(regularSpaces_noPunct_lowerContent) 
    
    ### CLOSE THE FILE
    new_file.close()
    
print('Finished! All normalized!')


Finished! All normalized!
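
If you want to check the result before collating, you can print the beginning of one of the new copies (a quick sanity check; the path variable comes from the cell above):

check = open(path + 'norm/Lope_soneto_FR_1707_norm.txt', encoding='utf-8').read()
print(check[:60])  # the first 60 characters: all lower-case, no punctuation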

Now, let's collate the normalized copies.

Pay attention to the new path. The output should be different from the one above!


In [4]:
from collatex import *
collation = Collation()
witness_1707 = open( "../data/sonnet/norm/Lope_soneto_FR_1707_norm.txt", encoding='utf-8' ).read()
witness_1822 = open( "../data/sonnet/norm/Lope_soneto_FR_1822_norm.txt", encoding='utf-8' ).read()
collation.add_plain_witness( "wit 1707", witness_1707 )
collation.add_plain_witness( "wit 1822", witness_1822 )
alignment_table = collate(collation, output='html2')


wit 1707 | wit 1822
doris qui sait quaux vers quelquefois je me plais me demande un sonnet et je men désespère quatorze vers grand dieu le moyen de les faire en voilà cependant déjà quatre de faits je ne pouvais dabord trouver de rime mais en faisant on apprend à se tirer daffaire poursuivons les quatrains ne métonneront guère si du premier tercet je puis faire les frais je commence au hasard et si je ne mabuse je nai | doris qui sait quaux vers quelquefois je me plais me demande un sonnet et je men désespère quatorze vers grand dieu le moyen de les faire en voilà cependant déjà quatre de faits je ne pouvais dabord trouver de rime mais en faisant on apprend à se tirer daffaire poursuivons les quatrains ne métonneront guère si du premier tercet je puis faire les frais je commence au hasard et si je ne mabuse je nai
pas | point
commencé sans laveu de | commencé sans laveu de
la | ma
muse puisquen si peu de temps je | muse puisquen si peu de temps je
men | me
tire | tire
si | du
net jentame le second et ma joie est extrême car des vers commandés jachève le treizième comptez sils sont quatorze et voilà le sonnet | net jentame le second et ma joie est extrême car des vers commandés jachève le treizième comptez sils sont quatorze et voilà le sonnet

Normalization 2. Annotate

Another way to prepare the texts for collation is to enrich them with annotations: the annotation for each token will be treated by the software as its normalized form and used for the collation. This approach is possible thanks to a fundamental characteristic of CollateX that we haven't seen yet.

By default, CollateX stores two pieces of information for each token, called 't' and 'n'. They were created precisely to handle normalization issues. The 't' property stores the original token, while the 'n' property can store a normalized form of 't'. CollateX always uses the string recorded in the 'n' property for the collation. When an 'n' property is not explicitly defined, the value of the 't' property is copied into 'n'.
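
Here is a minimal sketch of how this works (two made-up tokens, not from our data): the tokens below differ in their 't' values, but since their 'n' values are identical, CollateX will align them as a match.

{ "t": "Sonnet;", "n": "sonnet" }
{ "t": "sonnet,", "n": "sonnet" }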

We can put whatever we want into 'n'. As in the example above, we can record there a copy of each token, lower-cased and without punctuation. We can also do more sophisticated processing and use 'n' to store linguistic information about each token, to be reused during the alignment phase or for analysis. But we won't consider these complex treatments in this course.
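
Just to give an idea of such a treatment, a token could store its lemma in 'n' (a made-up illustration, not part of our data), so that different inflected forms of the same word would be aligned as matches:

{ "t": "sait", "n": "savoir" }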

Let's go back to the previous example and consider only the first tercet. We'll see that the only differences are punctuation marks.


In [5]:
# first tercet only

from collatex import *
collation = Collation()
collation.add_plain_witness( "wit 1707", "Je commence au hasard; et si je ne m'abuse,")
collation.add_plain_witness( "wit 1822", "Je commence au hasard, et, si je ne m'abuse,")
alignment_table = collate(collation, output='html2', segmentation=False)


wit 1707 | wit 1822
Je | Je
commence | commence
au | au
hasard | hasard
; | ,
et | et
- | ,
si | si
je | je
ne | ne
m | m
' | '
abuse | abuse
, | ,

Now we want to arrive at the same result that we reached in Normalization 1, but using the 't' and 'n' properties. They become visible if we supply the data for collation as JSON (an open-standard file format for storing and exchanging data, widely used in web development and beyond).

In the example below, you can see that there are two witnesses, each having an "id" and some "tokens". For each token, the 't' and 'n' properties are recorded.

In this case, we manually store the value we prefer in the 'n' property. Remember: the exact string does not matter; what matters is that tokens we consider equivalent receive the same 'n' value.

Try to run the cell below!

Exercise. Change the value of the 'n' property for some tokens and re-run the cell to see what happens. For example, the fourth token is "hasard;" in the first witness and "hasard," in the second witness. Replace both their "n" values with "random" (the English translation of au hasard) and re-run the cell. The result should not change!
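
After the change, the two tokens would look like this (a sketch of the modified JSON, not the cell as given below):

{ "t": "hasard;", "n": "random" }
{ "t": "hasard,", "n": "random" }

The two 'n' values are still identical, so CollateX still aligns the two tokens as a match, and the table still displays the original 't' values.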


In [6]:
# first tercet only
# normalizing a token to an empty string would give errors in the svg output

from collatex import *
import json
collation = Collation()
json_input = """{
    "witnesses": [
        {
            "id": "wit1707",
            "tokens": [
                {
                    "t": "Je",
                    "n": "je"
                },
                {
                    "t": "commence",
                    "n": "commence"
                },
                {
                    "t": "au",
                    "n": "au"
                },
                {
                    "t": "hasard;",
                    "n": "hasard"
                },
                {
                    "t": "et",
                    "n": "et"
                },
                {
                    "t": "si",
                    "n": "si"
                },
                {
                    "t": "je",
                    "n": "je"
                },
                {
                    "t": "ne",
                    "n": "ne"
                },
                {
                    "t": "m'abuse,",
                    "n": "m'abuse"
                }
            ]
        },
        {
            "id": "wit1822",
            "tokens": [
                {
                    "t": "Je",
                    "n": "je"
                },
                {
                    "t": "commence",
                    "n": "commence"
                },
                {
                    "t": "au",
                    "n": "au"
                },
                {
                    "t": "hasard,",
                    "n": "hasard"
                },
                {
                    "t": "et,",
                    "n": "et"
                },
                {
                    "t": "si",
                    "n": "si"
                },
                {
                    "t": "je",
                    "n": "je"
                },
                {
                    "t": "ne",
                    "n": "ne"
                },
                {
                    "t": "m'abuse,",
                    "n": "m'abuse"
                }
            ]
        }
    ]
}"""
collate(json.loads(json_input), segmentation=False, output="html2")


wit1707 | wit1822
Je | Je
commence | commence
au | au
hasard; | hasard,
et | et,
si | si
je | je
ne | ne
m'abuse, | m'abuse,

This is wonderful, but very time consuming!

We can automatically assign a value to the "n" property by processing the "t" value in some way. In the example below, the "n" value is a copy of the "t" value with all letters lower-cased and the punctuation removed.

In the results, you will see the JSON input, followed by the CollateX output. If you only want the output, comment out the following line in the code below

print(json_input)

by adding # at the beginning of the line, as in

# print(json_input)

And re-run the cell!

Note: when the input is given as JSON and segmentation is set to True, there are no whitespaces between the words in the output, because the input consists of single tokens without whitespace.
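
After running the cell below, you can verify this yourself by re-collating the same JSON input with segmentation turned on:

collate(json.loads(json_input), segmentation=True, output="html2")  # matching runs are merged, but with no spaces between the words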


In [7]:
import re
from collatex import *
import json

witness_1707 = open( "../data/sonnet/Lope_soneto_FR_1707.txt", encoding='utf-8' ).read()
witness_1822 = open( "../data/sonnet/Lope_soneto_FR_1822.txt", encoding='utf-8' ).read()

A = ["wit 1707", witness_1707]
B = ["wit 1822", witness_1822]

listWitnesses = [A,B]   # create a list of witnesses

data = {}
data["witnesses"] = []
for witness in listWitnesses: # for each witness in the list
    tokens = []   # create empty list for tokens
    data["witnesses"].append({
        "id": witness[0],  # give as id the first item in A or B
        "tokens" : tokens  # and as tokens the empty list
    })
    for w in witness[1].split():  # for each word in witness (second item in A or B)
        t = w   # t is the original word
        # n is w lower-cased and with punctuation removed:
        # replace everything that is not an alphanumeric character (\w) or whitespace (\s) with nothing.
        # Attention: replacing with a space instead would create spurious differences --> avoid.
        # This does not happen in the previous method (Normalization 1),
        # because there the tokenization happens afterwards and strips the whitespace.
        n = re.sub(r'[^\w\s]','',w.lower()) 
        tokens.append({    # populate the empty token list with values for t and n
            "t" : t,
            "n" : n
        })
        
json_input = json.dumps(data)  # turn the data structure into a JSON string (with double quotes)
print(json_input)

collation = Collation()
# if segmentation=True there are no whitespaces between words, because the input is given as single tokens without whitespace
collate(json.loads(json_input), segmentation=False, output="html2")


{"witnesses": [{"id": "wit 1707", "tokens": [{"t": "Doris,", "n": "doris"}, {"t": "qui", "n": "qui"}, {"t": "sait", "n": "sait"}, {"t": "qu'aux", "n": "quaux"}, {"t": "vers", "n": "vers"}, {"t": "quelquefois", "n": "quelquefois"}, {"t": "je", "n": "je"}, {"t": "me", "n": "me"}, {"t": "plais,", "n": "plais"}, {"t": "Me", "n": "me"}, {"t": "demande", "n": "demande"}, {"t": "un", "n": "un"}, {"t": "sonnet;", "n": "sonnet"}, {"t": "et", "n": "et"}, {"t": "je", "n": "je"}, {"t": "m'en", "n": "men"}, {"t": "d\u00e9sesp\u00e8re:", "n": "d\u00e9sesp\u00e8re"}, {"t": "Quatorze", "n": "quatorze"}, {"t": "vers,", "n": "vers"}, {"t": "grand", "n": "grand"}, {"t": "Dieu!", "n": "dieu"}, {"t": "le", "n": "le"}, {"t": "moyen", "n": "moyen"}, {"t": "de", "n": "de"}, {"t": "les", "n": "les"}, {"t": "faire!", "n": "faire"}, {"t": "En", "n": "en"}, {"t": "voil\u00e0", "n": "voil\u00e0"}, {"t": "cependant", "n": "cependant"}, {"t": "d\u00e9j\u00e0", "n": "d\u00e9j\u00e0"}, {"t": "quatre", "n": "quatre"}, {"t": "de", "n": "de"}, {"t": "faits.", "n": "faits"}, {"t": "Je", "n": "je"}, {"t": "ne", "n": "ne"}, {"t": "pouvais", "n": "pouvais"}, {"t": "d'abord", "n": "dabord"}, {"t": "trouver", "n": "trouver"}, {"t": "de", "n": "de"}, {"t": "rime;", "n": "rime"}, {"t": "mais,", "n": "mais"}, {"t": "En", "n": "en"}, {"t": "faisant", "n": "faisant"}, {"t": "on", "n": "on"}, {"t": "apprend", "n": "apprend"}, {"t": "\u00e0", "n": "\u00e0"}, {"t": "se", "n": "se"}, {"t": "tirer", "n": "tirer"}, {"t": "d'affaire:", "n": "daffaire"}, {"t": "Poursuivons,", "n": "poursuivons"}, {"t": "les", "n": "les"}, {"t": "quatrains", "n": "quatrains"}, {"t": "ne", "n": "ne"}, {"t": "m'\u00e9tonneront", "n": "m\u00e9tonneront"}, {"t": "gu\u00e8re,", "n": "gu\u00e8re"}, {"t": "Si", "n": "si"}, {"t": "du", "n": "du"}, {"t": "premier", "n": "premier"}, {"t": "tercet", "n": "tercet"}, {"t": "je", "n": "je"}, {"t": "puis", "n": "puis"}, {"t": "faire", "n": "faire"}, {"t": "les", "n": "les"}, {"t": "frais.", "n": "frais"}, {"t": "Je", "n": "je"}, {"t": "commence", "n": "commence"}, {"t": "au", "n": "au"}, {"t": "hasard;", "n": "hasard"}, {"t": "et", "n": "et"}, {"t": "si", "n": "si"}, {"t": "je", "n": "je"}, {"t": "ne", "n": "ne"}, {"t": "m'abuse,", "n": "mabuse"}, {"t": "Je", "n": "je"}, {"t": "n'ai", "n": "nai"}, {"t": "pas", "n": "pas"}, {"t": "commenc\u00e9,", "n": "commenc\u00e9"}, {"t": "sans", "n": "sans"}, {"t": "l'aveu", "n": "laveu"}, {"t": "de", "n": "de"}, {"t": "la", "n": "la"}, {"t": "muse,", "n": "muse"}, {"t": "Puisqu'en", "n": "puisquen"}, {"t": "si", "n": "si"}, {"t": "peu", "n": "peu"}, {"t": "de", "n": "de"}, {"t": "temps", "n": "temps"}, {"t": "je", "n": "je"}, {"t": "m'en", "n": "men"}, {"t": "tire", "n": "tire"}, {"t": "si", "n": "si"}, {"t": "net.", "n": "net"}, {"t": "J'entame", "n": "jentame"}, {"t": "le", "n": "le"}, {"t": "second;", "n": "second"}, {"t": "et", "n": "et"}, {"t": "ma", "n": "ma"}, {"t": "joie", "n": "joie"}, {"t": "est", "n": "est"}, {"t": "extr\u00eame;", "n": "extr\u00eame"}, {"t": "Car", "n": "car"}, {"t": "des", "n": "des"}, {"t": "vers", "n": "vers"}, {"t": "command\u00e9s", "n": "command\u00e9s"}, {"t": "j'ach\u00e8ve", "n": "jach\u00e8ve"}, {"t": "le", "n": "le"}, {"t": "treizi\u00e8me;", "n": "treizi\u00e8me"}, {"t": "Comptez", "n": "comptez"}, {"t": "s'ils", "n": "sils"}, {"t": "sont", "n": "sont"}, {"t": "quatorze;", "n": "quatorze"}, {"t": "et", "n": "et"}, {"t": "voil\u00e0", "n": "voil\u00e0"}, {"t": "le", "n": "le"}, {"t": "sonnet.", "n": "sonnet"}]}, {"id": "wit 1822", "tokens": [{"t": 
"Doris,", "n": "doris"}, {"t": "qui", "n": "qui"}, {"t": "sait", "n": "sait"}, {"t": "qu'aux", "n": "quaux"}, {"t": "vers", "n": "vers"}, {"t": "quelquefois", "n": "quelquefois"}, {"t": "je", "n": "je"}, {"t": "me", "n": "me"}, {"t": "plais,", "n": "plais"}, {"t": "Me", "n": "me"}, {"t": "demande", "n": "demande"}, {"t": "un", "n": "un"}, {"t": "sonnet,", "n": "sonnet"}, {"t": "et", "n": "et"}, {"t": "je", "n": "je"}, {"t": "m'en", "n": "men"}, {"t": "d\u00e9sesp\u00e8re:", "n": "d\u00e9sesp\u00e8re"}, {"t": "Quatorze", "n": "quatorze"}, {"t": "vers,", "n": "vers"}, {"t": "grand", "n": "grand"}, {"t": "Dieu!", "n": "dieu"}, {"t": "Le", "n": "le"}, {"t": "moyen", "n": "moyen"}, {"t": "de", "n": "de"}, {"t": "les", "n": "les"}, {"t": "faire?", "n": "faire"}, {"t": "En", "n": "en"}, {"t": "voil\u00e0", "n": "voil\u00e0"}, {"t": "cependant", "n": "cependant"}, {"t": "d\u00e9j\u00e0", "n": "d\u00e9j\u00e0"}, {"t": "quatre", "n": "quatre"}, {"t": "de", "n": "de"}, {"t": "faits.", "n": "faits"}, {"t": "Je", "n": "je"}, {"t": "ne", "n": "ne"}, {"t": "pouvais", "n": "pouvais"}, {"t": "d'abord", "n": "dabord"}, {"t": "trouver", "n": "trouver"}, {"t": "de", "n": "de"}, {"t": "rime,", "n": "rime"}, {"t": "mais", "n": "mais"}, {"t": "En", "n": "en"}, {"t": "faisant", "n": "faisant"}, {"t": "on", "n": "on"}, {"t": "apprend", "n": "apprend"}, {"t": "\u00e0", "n": "\u00e0"}, {"t": "se", "n": "se"}, {"t": "tirer", "n": "tirer"}, {"t": "d'affaire.", "n": "daffaire"}, {"t": "Poursuivons:", "n": "poursuivons"}, {"t": "les", "n": "les"}, {"t": "quatrains", "n": "quatrains"}, {"t": "ne", "n": "ne"}, {"t": "m'\u00e9tonneront", "n": "m\u00e9tonneront"}, {"t": "gu\u00e8re", "n": "gu\u00e8re"}, {"t": "Si", "n": "si"}, {"t": "du", "n": "du"}, {"t": "premier", "n": "premier"}, {"t": "tercet", "n": "tercet"}, {"t": "je", "n": "je"}, {"t": "puis", "n": "puis"}, {"t": "faire", "n": "faire"}, {"t": "les", "n": "les"}, {"t": "frais.", "n": "frais"}, {"t": "Je", "n": "je"}, {"t": "commence", "n": "commence"}, {"t": "au", "n": "au"}, {"t": "hasard,", "n": "hasard"}, {"t": "et,", "n": "et"}, {"t": "si", "n": "si"}, {"t": "je", "n": "je"}, {"t": "ne", "n": "ne"}, {"t": "m'abuse,", "n": "mabuse"}, {"t": "Je", "n": "je"}, {"t": "n'ai", "n": "nai"}, {"t": "point", "n": "point"}, {"t": "commenc\u00e9", "n": "commenc\u00e9"}, {"t": "sans", "n": "sans"}, {"t": "l'aveu", "n": "laveu"}, {"t": "de", "n": "de"}, {"t": "ma", "n": "ma"}, {"t": "muse,", "n": "muse"}, {"t": "Puisqu'en", "n": "puisquen"}, {"t": "si", "n": "si"}, {"t": "peu", "n": "peu"}, {"t": "de", "n": "de"}, {"t": "temps", "n": "temps"}, {"t": "je", "n": "je"}, {"t": "me", "n": "me"}, {"t": "tire", "n": "tire"}, {"t": "du", "n": "du"}, {"t": "net.", "n": "net"}, {"t": "J'entame", "n": "jentame"}, {"t": "le", "n": "le"}, {"t": "second,", "n": "second"}, {"t": "et", "n": "et"}, {"t": "ma", "n": "ma"}, {"t": "joie", "n": "joie"}, {"t": "est", "n": "est"}, {"t": "extr\u00eame;", "n": "extr\u00eame"}, {"t": "Car", "n": "car"}, {"t": "des", "n": "des"}, {"t": "vers", "n": "vers"}, {"t": "command\u00e9s", "n": "command\u00e9s"}, {"t": "j'ach\u00e8ve", "n": "jach\u00e8ve"}, {"t": "le", "n": "le"}, {"t": "treizi\u00e8me;", "n": "treizi\u00e8me"}, {"t": "Comptez", "n": "comptez"}, {"t": "s'ils", "n": "sils"}, {"t": "sont", "n": "sont"}, {"t": "quatorze,", "n": "quatorze"}, {"t": "et", "n": "et"}, {"t": "voil\u00e0", "n": "voil\u00e0"}, {"t": "le", "n": "le"}, {"t": "sonnet.", "n": "sonnet"}]}]}
wit 1707 | wit 1822
Doris, | Doris,
qui | qui
sait | sait
qu'aux | qu'aux
vers | vers
quelquefois | quelquefois
je | je
me | me
plais, | plais,
Me | Me
demande | demande
un | un
sonnet; | sonnet,
et | et
je | je
m'en | m'en
désespère: | désespère:
Quatorze | Quatorze
vers, | vers,
grand | grand
Dieu! | Dieu!
le | Le
moyen | moyen
de | de
les | les
faire! | faire?
En | En
voilà | voilà
cependant | cependant
déjà | déjà
quatre | quatre
de | de
faits. | faits.
Je | Je
ne | ne
pouvais | pouvais
d'abord | d'abord
trouver | trouver
de | de
rime; | rime,
mais, | mais
En | En
faisant | faisant
on | on
apprend | apprend
à | à
se | se
tirer | tirer
d'affaire: | d'affaire.
Poursuivons, | Poursuivons:
les | les
quatrains | quatrains
ne | ne
m'étonneront | m'étonneront
guère, | guère
Si | Si
du | du
premier | premier
tercet | tercet
je | je
puis | puis
faire | faire
les | les
frais. | frais.
Je | Je
commence | commence
au | au
hasard; | hasard,
et | et,
si | si
je | je
ne | ne
m'abuse, | m'abuse,
Je | Je
n'ai | n'ai
pas | point
commencé, | commencé
sans | sans
l'aveu | l'aveu
de | de
la | ma
muse, | muse,
Puisqu'en | Puisqu'en
si | si
peu | peu
de | de
temps | temps
je | je
m'en | me
tire | tire
si | du
net. | net.
J'entame | J'entame
le | le
second; | second,
et | et
ma | ma
joie | joie
est | est
extrême; | extrême;
Car | Car
des | des
vers | vers
commandés | commandés
j'achève | j'achève
le | le
treizième; | treizième;
Comptez | Comptez
s'ils | s'ils
sont | sont
quatorze; | quatorze,
et | et
voilà | voilà
le | le
sonnet. | sonnet.