Unicode normalization

Many complex Unicode characters can be expressed in more than one way. For example, an uppercase A with a ring above can be expressed as the single character U+212B (ANGSTROM SIGN) or as a base uppercase A (U+0041) followed by a combining ring above (U+030A). The documents you need to collate may contain alternative representations that you would like to treat as identical for collation purposes. To do that, you can create a shadow "n" property in your pretokenized JSON (see Unit 5 of this workshop) and normalize the strings. Here’s how to do that.

To show that the two representations differ at the underlying code point level but look alike to a human, we create two variables, a and b, assign one representation to each, and then print them to examine how they look.


In [1]:
a = "\u212b"

In [2]:
a


Out[2]:
'Å'

In [3]:
b = "\u0041\u030a"

In [4]:
b


Out[4]:
'Å'

If we check these two values for equality, Python tells us that they are not equal:


In [5]:
a == b


Out[5]:
False

We can use the unicodedata.normalize() function to convert both variables to the same normalization form (here NFC, Normalization Form C, which composes each sequence into its precomposed equivalent wherever possible) and then compare them. When we do that, the normalized versions are equal:


In [6]:
import unicodedata
unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)


Out[6]:
True

We’ve performed this normalization by itself so that we can examine the results. In a CollateX context, you would incorporate Unicode normalization into the process of generating an "n" property for your pretokenized JSON tokens, as in the sketch below. See http://unicode.org/reports/tr15/ for more information about Unicode normalization forms.
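For illustration, here is a minimal sketch of what that might look like. The witness ids, the sample strings, and the whitespace tokenization are assumptions made for this example; each token keeps its original reading as "t" and carries its NFC-normalized form as the shadow "n" property:


In [7]:
import unicodedata

def tokens_for(text):
    # Whitespace tokenization is a simplification for this sketch;
    # real tokenization strategies are covered in Unit 5.
    return [{"t": t, "n": unicodedata.normalize('NFC', t)} for t in text.split()]

witness_input = {
    "witnesses": [
        {"id": "A", "tokens": tokens_for("1 \u212b")},        # Å as U+212B
        {"id": "B", "tokens": tokens_for("1 \u0041\u030a")},  # Å as A + combining ring above
    ]
}
# The "t" values differ, but the normalized "n" values agree
witness_input["witnesses"][0]["tokens"][1]["n"] == witness_input["witnesses"][1]["tokens"][1]["n"]


Out[7]:
True

Because CollateX compares tokens by their "n" property when one is present, the two spellings of Å would collate as identical even though their "t" forms differ at the code point level.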

So what are the Unicode values of the characters in my text?


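One way to find out is to ask Python directly: the built-in ord() function returns a character’s code point, and unicodedata.name() returns its official Unicode name. Here is a minimal sketch, applied to the decomposed variable b from above:


In [8]:
for character in b:
    # Print each code point in U+ notation together with its Unicode name
    print('U+{:04X}  {}'.format(ord(character), unicodedata.name(character)))

U+0041  LATIN CAPITAL LETTER A
U+030A  COMBINING RING ABOVE

You can run the same loop over any string from your own documents to see exactly which code points it contains.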