The following shows examples of how to use codecs and normalize Unicode, and draws heavily from the article Metal umlaut.
In [1]:
x = "Rinôçérôse screams flow not unlike an encyclopædia, \
'TECHNICIÄNS ÖF SPÅCE SHIP EÅRTH THIS IS YÖÜR CÄPTÅIN SPEÄKING YÖÜR ØÅPTÅIN IS DEA̋D' to Spın̈al Tap."
type(x)
Out[1]:
The variable x is a string in Python:
In [2]:
repr(x)
Out[2]:
Its ASCII representation, in which every non-ASCII character is replaced by an escape sequence, is unusable by parsers:
In [3]:
ascii(x)
Out[3]:
Encoding as UTF-8 doesn't help much, because the non-ASCII characters simply become multi-byte sequences:
In [4]:
x.encode('utf8')
Out[4]:
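As a minimal sketch of what UTF-8 encoding does to a single accented character, the following uses a one-character sample string (an assumption chosen for illustration, not taken from the text above). It shows that the character becomes two bytes and that decoding restores the original string, so nothing has been simplified for a parser that expects ASCII.

```python
# Illustrative sample: 'Ä' written as an escape so the code point is unambiguous.
text = "\u00c4"

encoded = text.encode("utf-8")     # bytes object
decoded = encoded.decode("utf-8")  # back to str

print(len(text), len(encoded))  # 1 character, 2 bytes
print(encoded)                  # b'\xc3\x84'
print(decoded == text)          # True: the round trip is lossless
```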
Ignoring difficult characters is perhaps an even worse strategy, since every character that cannot be encoded is silently dropped:
In [5]:
x.encode('ascii','ignore')
Out[5]:
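For comparison, `'ignore'` is only one of the error handlers that `str.encode` accepts. The sketch below runs a short sample word (an illustrative assumption, not part of the original text) through four of the standard handlers so the trade-offs are visible side by side.

```python
# Illustrative sample: 'Motörhead' with a precomposed o-umlaut (U+00F6).
sample = "Mot\u00f6rhead"

# Error handlers that str.encode supports for unencodable characters.
handlers = ["ignore", "replace", "backslashreplace", "xmlcharrefreplace"]
results = {h: sample.encode("ascii", h) for h in handlers}

for name, encoded in results.items():
    print(f"{name:>18}: {encoded!r}")
# ignore drops the character, replace substitutes '?', and the other two
# keep the code point in an escaped form that can be recovered later.
```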
In [6]:
import unicodedata
# NFKD normalization is a robust way to handle special characters: it decomposes
# them into base characters plus combining marks, so the result can be encoded as ASCII
unicodedata.normalize('NFKD', x).encode('ascii','ignore')
Out[6]:
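To see why normalizing before encoding preserves more of the text, it can help to inspect what NFKD does to individual characters. The sketch below is illustrative only, and the two sample characters are assumptions chosen for the demonstration. It shows that an accented letter decomposes into a base letter plus a combining mark, while a letter such as `æ` has no decomposition and is still dropped by `'ignore'`.

```python
import unicodedata

# 'Ä' (U+00C4) decomposes into a base letter and a combining mark.
decomposed = unicodedata.normalize("NFKD", "\u00c4")
names = [unicodedata.name(ch) for ch in decomposed]
print(names)  # ['LATIN CAPITAL LETTER A', 'COMBINING DIAERESIS']

# Encoding with 'ignore' then drops only the combining mark.
ascii_bytes = decomposed.encode("ascii", "ignore")
print(ascii_bytes)  # b'A'

# 'æ' (U+00E6) has no decomposition, so NFKD leaves it unchanged
# and 'ignore' removes it entirely.
ligature = unicodedata.normalize("NFKD", "\u00e6")
ligature_bytes = ligature.encode("ascii", "ignore")
print(ligature == "\u00e6", ligature_bytes)  # True b''
```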
Even before this normalization and encoding, you may need to convert some characters explicitly before parsing. For example:
In [7]:
x = "The sky “above” the port … was the color of ‘cable television’ – tuned to the Weather Channel®"
ascii(x)
Out[7]:
Then consider the results here:
In [8]:
unicodedata.normalize('NFKD', x).encode('ascii','ignore')
Out[8]:
That is one way to handle punctuation, but it drops characters that may be important for parsing a sentence, so instead replace them explicitly:
In [ ]:
x = x.replace('“', '"').replace('”', '"')
x = x.replace("‘", "'").replace("’", "'")
x = x.replace('…', '...').replace('–', '-')
print(x)
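The chained `replace` calls above can also be written as a single translation table, which may be easier to extend as more punctuation turns up. This is a minimal sketch: the names `PUNCT_MAP` and `to_ascii_punct` are hypothetical, and the mapping covers only the characters handled above.

```python
# Hypothetical mapping from typographic punctuation to ASCII equivalents.
PUNCT_MAP = str.maketrans({
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u2026": "...",  # horizontal ellipsis
    "\u2013": "-",    # en dash
})


def to_ascii_punct(text):
    """Replace the mapped punctuation in one pass; other characters are untouched."""
    return text.translate(PUNCT_MAP)


sample = "\u201cabove\u201d \u2026 \u2018cable\u2019 \u2013 tuned"
cleaned = to_ascii_punct(sample)
print(cleaned)  # "above" ... 'cable' - tuned
```

Characters that are not in the table, such as `®`, pass through unchanged, so a normalization and encoding step may still be needed afterwards.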