Exercise 02: concerns about characters

The following shows examples of how to use codecs and normalize Unicode, and draws heavily on the article Metal umlaut.


In [1]:
x = "Rinôçérôse screams flow not unlike an encyclopædia, \
'TECHNICIÄNS ÖF SPÅCE SHIP EÅRTH THIS IS YÖÜR CÄPTÅIN SPEÄKING YÖÜR ØÅPTÅIN IS DEA̋D' to Spın̈al Tap."
type(x)


Out[1]:
str

The variable x is a str, Python's Unicode string type, so its repr keeps every character as typed:


In [2]:
repr(x)


Out[2]:
'"Rinôçérôse screams flow not unlike an encyclopædia, \'TECHNICIÄNS ÖF SPÅCE SHIP EÅRTH THIS IS YÖÜR CÄPTÅIN SPEÄKING YÖÜR ØÅPTÅIN IS DEA̋D\' to Spın̈al Tap."'

Calling ascii() only escapes the non-ASCII characters, so the result is still not text that a parser can use:


In [3]:
ascii(x)


Out[3]:
'"Rin\\xf4\\xe7\\xe9r\\xf4se screams \\ufb02ow not unlike an encyclop\\xe6dia, \'TECHNICI\\xc4NS \\xd6F SP\\xc5CE SHIP E\\xc5RTH THIS IS Y\\xd6\\xdcR C\\xc4PT\\xc5IN SPE\\xc4KING Y\\xd6\\xdcR \\xd8\\xc5PT\\xc5IN IS DEA\\u030bD\' to Sp\\u0131n\\u0308al Tap."'

Encoding as UTF-8 doesn't help much either. It turns the string into bytes, with each non-ASCII character stored as a multi-byte sequence:


In [4]:
x.encode('utf8')


Out[4]:
b"Rin\xc3\xb4\xc3\xa7\xc3\xa9r\xc3\xb4se screams \xef\xac\x82ow not unlike an encyclop\xc3\xa6dia, 'TECHNICI\xc3\x84NS \xc3\x96F SP\xc3\x85CE SHIP E\xc3\x85RTH THIS IS Y\xc3\x96\xc3\x9cR C\xc3\x84PT\xc3\x85IN SPE\xc3\x84KING Y\xc3\x96\xc3\x9cR \xc3\x98\xc3\x85PT\xc3\x85IN IS DEA\xcc\x8bD' to Sp\xc4\xb1n\xcc\x88al Tap."
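
UTF-8 does not simplify the text, but it is lossless, which matters when the bytes are stored or sent elsewhere. The following is a minimal sketch of that round trip; the sample string s is an assumption chosen for illustration, not the notebook's x.

```python
# Sample text chosen for illustration; it is not the notebook's x.
s = "encyclop\u00e6dia"  # contains U+00E6 (LATIN SMALL LETTER AE)

encoded = s.encode("utf-8")        # str -> bytes
decoded = encoded.decode("utf-8")  # bytes -> str

print(len(s), len(encoded))  # code points vs. bytes
print(decoded == s)          # the round trip loses nothing
```

The byte count exceeds the character count because the one non-ASCII character takes two bytes in UTF-8.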

Ignoring difficult characters is perhaps an even worse strategy. The 'ignore' error handler silently drops every character that ASCII cannot represent:


In [5]:
x.encode('ascii','ignore')


Out[5]:
b"Rinrse screams ow not unlike an encyclopdia, 'TECHNICINS F SPCE SHIP ERTH THIS IS YR CPTIN SPEKING YR PTIN IS DEAD' to Spnal Tap."
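
For comparison, 'ignore' is only one of the standard error handlers that str.encode accepts. The sketch below runs four of them on a short sample string (an assumption for illustration, not the notebook's x) to show what each one keeps.

```python
# Sample text chosen for illustration; it is not the notebook's x.
s = "SP\u00c5CE"  # U+00C5 is LATIN CAPITAL LETTER A WITH RING ABOVE

print(s.encode("ascii", "ignore"))             # drops the character
print(s.encode("ascii", "replace"))            # substitutes "?"
print(s.encode("ascii", "backslashreplace"))   # keeps a Python-style escape
print(s.encode("ascii", "xmlcharrefreplace"))  # keeps an XML character reference
```

The last two handlers at least preserve which character was there, which may be preferable when the output is only meant for logging or debugging.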

However, one can normalize first and then encode:

In [6]:
import unicodedata
# NFKD (compatibility decomposition) splits accented characters into a base letter
# plus combining marks and expands ligatures, so the base letters survive the ASCII encode
unicodedata.normalize('NFKD', x).encode('ascii','ignore')


Out[6]:
b"Rinocerose screams flow not unlike an encyclopdia, 'TECHNICIANS OF SPACE SHIP EARTH THIS IS YOUR CAPTAIN SPEAKING YOUR APTAIN IS DEAD' to Spnal Tap."
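
The output above still loses a few letters (the "æ" in "encyclopædia", the "Ø", and the dotless "ı") because those characters have no Unicode decomposition, so NFKD leaves them intact and the ASCII encode then drops them. One possible workaround, sketched here under the assumption that a small hand-written table is acceptable, is to map those characters before normalizing. The names FALLBACKS and to_ascii are hypothetical and not part of the original notebook.

```python
import unicodedata

# Hypothetical fallback table for characters that NFKD leaves untouched.
FALLBACKS = str.maketrans({"\u00e6": "ae", "\u00d8": "O", "\u0131": "i"})

def to_ascii(text):
    """Map known stragglers, decompose with NFKD, then drop what is left."""
    text = text.translate(FALLBACKS)
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# NFKD splits an accented letter into a base letter plus a combining mark.
print([unicodedata.name(c) for c in unicodedata.normalize("NFKD", "\u00f4")])
# U+00E6 has no decomposition, so NFKD returns it unchanged.
print(unicodedata.normalize("NFKD", "\u00e6") == "\u00e6")
print(to_ascii("encyclop\u00e6dia, Sp\u0131n\u0308al Tap"))
```

The table has to be extended by hand for each text, so this only suits cases where the set of problem characters is small and known.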

Even before this normalization and encoding, you may need to convert some characters explicitly before parsing. For example:


In [7]:
x = "The sky “above” the port … was the color of ‘cable television’ – tuned to the Weather Channel®"
ascii(x)


Out[7]:
"'The sky \\u201cabove\\u201d the port \\u2026 was the color of \\u2018cable television\\u2019 \\u2013 tuned to the Weather Channel\\xae'"

Then consider what normalizing and encoding does to this string:


In [8]:
unicodedata.normalize('NFKD', x).encode('ascii','ignore')


Out[8]:
b'The sky above the port ... was the color of cable television  tuned to the Weather Channel'
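
To see which characters this step discards, one option is to test each character on its own. The helper below is a sketch; the name lost_characters and the shortened sample string are assumptions for illustration.

```python
import unicodedata

def lost_characters(text):
    """Return (char, Unicode name) pairs that NFKD + ASCII 'ignore' drops entirely."""
    lost = []
    for ch in text:
        survived = unicodedata.normalize("NFKD", ch).encode("ascii", "ignore")
        if not survived:
            lost.append((ch, unicodedata.name(ch, "UNKNOWN")))
    return lost

# Shortened sample with the same punctuation as the sentence above.
sample = "\u201cabove\u201d \u2026 \u2013 Channel\u00ae"
for ch, name in lost_characters(sample):
    print(repr(ch), name)
```

The ellipsis is not reported because NFKD expands it to three periods, while the curly quotes, the en dash, and the registered sign have no ASCII-compatible form and vanish.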

This approach drops punctuation that may be important for parsing a sentence, so instead replace those characters explicitly before normalizing:


In [ ]:
x = x.replace('“', '"').replace('”', '"')
x = x.replace("‘", "'").replace("’", "'")
x = x.replace('…', '...').replace('–', '-')
print(x)
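
The chained replace calls can also be collected into a single translation table and applied with str.translate, which may be easier to maintain as the list of substitutions grows. This is a sketch of that alternative; the name PUNCTUATION and the shortened sample string are assumptions for illustration.

```python
import unicodedata

# Same substitutions as the replace calls above, collected in one table.
PUNCTUATION = str.maketrans({
    "\u201c": '"', "\u201d": '"',  # curly double quotes
    "\u2018": "'", "\u2019": "'",  # curly single quotes
    "\u2026": "...",               # horizontal ellipsis
    "\u2013": "-",                 # en dash
})

sample = "The sky \u201cabove\u201d the port \u2026 \u2013 tuned"
cleaned = sample.translate(PUNCTUATION)
print(cleaned)
# After the explicit substitutions, normalizing and encoding no longer loses punctuation.
print(unicodedata.normalize("NFKD", cleaned).encode("ascii", "ignore"))
```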