Understanding text (and handling it in Python)

Fiona Pigott

A Python notebook to teach you everything you never wanted to know about text encoding (specifically ASCII and UTF-8, and the difference between them, though we'll explain what some other encodings mean).

Credit to these sites for a helpful description of different file encodings:

And to these pages for a better understanding of emoji specifically:

And if you got here from Data-Science-45-min intros, check out https://github.com/fionapigott/emoji-counter for this tutorial and (a little) more.


Part 0: What do you mean by "text encoding"?


A text encoding is a scheme that allows us to convert between binary (stored on your computer) and a character that you can display and make sense of. A text encoding does not define a font.

When I say "character" I mean "unicode code point." Code point -> character is a 1->1 mapping of meaning. A font just decides how to display that character. Each emoji has a code point assigned to it by the Unicode Consortium, and "GRINNING FACE WITH SMILING EYES" should be a grinning face with smiley eyes on any platform. Windows Wingdings, if you remember that regrettable period, is a font.
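
If you want to see those official names from Python, the standard-library unicodedata module can look one up from a character. A quick sketch, using a smiley that sits in the basic range so it works on any Python 2 build:


In [ ]:
# Look up the official Unicode name assigned to a code point
import unicodedata
print unicodedata.name(u'\u263a')  # prints: WHITE SMILING FACE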

I'm going to use "code point" and "character" a little bit interchangeably. If you can get the code point represented by a string of bits, you can figure out what character it represents.

Decode = convert binary data to a code point

Encode = convert a code point (a big number) to binary data that you can write somewhere

Your code will always:

  • Ingest binary data (say, my_tweets.txt)
  • Decode that data into characters (whether or not you have to type decode.)
  • Encode that data so that you can write it again (whether or not you type encode. You can't write "128513" to a single byte.)
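
Here's a minimal sketch of that loop in Python 2 (my_tweets.txt stands in for any UTF-8 encoded file):


In [ ]:
# Ingest binary data, decode it to characters, encode it to write it out again
# (assumes a UTF-8 encoded file called my_tweets.txt)
raw_bytes = open("my_tweets.txt", "rb").read()  # binary data in
text = raw_bytes.decode("utf-8")                # bytes -> code points
with open("my_tweets_copy.txt", "wb") as out:
    out.write(text.encode("utf-8"))             # code points -> bytes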

Part 1: Text encodings are not magic


Get the actual data of the file in your terminal with xxd.

I want to spend a few minutes convincing you of what I'm about to say about text encoding in Python.

We're spending time in the terminal to make it painfully, horribly clear that text encoding/decoding is not some Python thing, but rather exactly what every text-display program does every time you convert some binary (stored on your computer) to something that you can read.

ASCII

ASCII is a character-encoding scheme where each character fits in exactly 1 byte--8 bits. ASCII, however, uses only the bottom 7 bits of an 8-bit byte, and thus can take only 2^7 (128) values. The value of the byte that encodes a character is exactly that character's code point.

"h" and "i" are both ascii characters--they fit in one byte in the ascii encoding scheme.


In [1]:
!printf "hi\n"
!printf "hi" | xxd -g1
!printf "hi" | xxd -b -g1


hi
0000000: 68 69                                            hi
0000000: 01101000 01101001                                      hi
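
The same check in Python: the byte values we just saw in the hexdump (0x68, 0x69) are exactly the code points for "h" and "i".


In [ ]:
# In ASCII, the value of the byte is the value of the code point
print ord('h'), hex(ord('h'))  # 104 0x68
print ord('i'), hex(ord('i'))  # 105 0x69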

In [2]:
# Generate a list of all of the ASCII characters:
# 'unichr' is a built-in Python function to take a number to a unicode code point 
# (I'll talk more about this and some other built-ins later)
for i in range(0,128):
    print str(i) + " -> " + repr(unichr(i)) + "->" + "'" + unichr(i).encode("ascii") + "'"


0 -> u'\x00'->''
1 -> u'\x01'->''
2 -> u'\x02'->''
3 -> u'\x03'->''
4 -> u'\x04'->''
5 -> u'\x05'->''
6 -> u'\x06'->''
7 -> u'\x07'->''
8 -> u'\x08'->''
9 -> u'\t'->'	'
10 -> u'\n'->'
'
11 -> u'\x0b'->''
12 -> u'\x0c'->''
13 -> u'\r'->'
'
14 -> u'\x0e'->''
15 -> u'\x0f'->''
16 -> u'\x10'->''
17 -> u'\x11'->''
18 -> u'\x12'->''
19 -> u'\x13'->''
20 -> u'\x14'->''
21 -> u'\x15'->''
22 -> u'\x16'->''
23 -> u'\x17'->''
24 -> u'\x18'->''
25 -> u'\x19'->''
26 -> u'\x1a'->''
27 -> u'\x1b'->''
28 -> u'\x1c'->''
29 -> u'\x1d'->''
30 -> u'\x1e'->''
31 -> u'\x1f'->''
32 -> u' '->' '
33 -> u'!'->'!'
34 -> u'"'->'"'
35 -> u'#'->'#'
36 -> u'$'->'$'
37 -> u'%'->'%'
38 -> u'&'->'&'
39 -> u"'"->'''
40 -> u'('->'('
41 -> u')'->')'
42 -> u'*'->'*'
43 -> u'+'->'+'
44 -> u','->','
45 -> u'-'->'-'
46 -> u'.'->'.'
47 -> u'/'->'/'
48 -> u'0'->'0'
49 -> u'1'->'1'
50 -> u'2'->'2'
51 -> u'3'->'3'
52 -> u'4'->'4'
53 -> u'5'->'5'
54 -> u'6'->'6'
55 -> u'7'->'7'
56 -> u'8'->'8'
57 -> u'9'->'9'
58 -> u':'->':'
59 -> u';'->';'
60 -> u'<'->'<'
61 -> u'='->'='
62 -> u'>'->'>'
63 -> u'?'->'?'
64 -> u'@'->'@'
65 -> u'A'->'A'
66 -> u'B'->'B'
67 -> u'C'->'C'
68 -> u'D'->'D'
69 -> u'E'->'E'
70 -> u'F'->'F'
71 -> u'G'->'G'
72 -> u'H'->'H'
73 -> u'I'->'I'
74 -> u'J'->'J'
75 -> u'K'->'K'
76 -> u'L'->'L'
77 -> u'M'->'M'
78 -> u'N'->'N'
79 -> u'O'->'O'
80 -> u'P'->'P'
81 -> u'Q'->'Q'
82 -> u'R'->'R'
83 -> u'S'->'S'
84 -> u'T'->'T'
85 -> u'U'->'U'
86 -> u'V'->'V'
87 -> u'W'->'W'
88 -> u'X'->'X'
89 -> u'Y'->'Y'
90 -> u'Z'->'Z'
91 -> u'['->'['
92 -> u'\\'->'\'
93 -> u']'->']'
94 -> u'^'->'^'
95 -> u'_'->'_'
96 -> u'`'->'`'
97 -> u'a'->'a'
98 -> u'b'->'b'
99 -> u'c'->'c'
100 -> u'd'->'d'
101 -> u'e'->'e'
102 -> u'f'->'f'
103 -> u'g'->'g'
104 -> u'h'->'h'
105 -> u'i'->'i'
106 -> u'j'->'j'
107 -> u'k'->'k'
108 -> u'l'->'l'
109 -> u'm'->'m'
110 -> u'n'->'n'
111 -> u'o'->'o'
112 -> u'p'->'p'
113 -> u'q'->'q'
114 -> u'r'->'r'
115 -> u's'->'s'
116 -> u't'->'t'
117 -> u'u'->'u'
118 -> u'v'->'v'
119 -> u'w'->'w'
120 -> u'x'->'x'
121 -> u'y'->'y'
122 -> u'z'->'z'
123 -> u'{'->'{'
124 -> u'|'->'|'
125 -> u'}'->'}'
126 -> u'~'->'~'
127 -> u'\x7f'->''

In [3]:
# And if you try to use "ascii" encoding on a character whose value is too high:
# Hint: you've definitely seen this error before
unichr(129).encode("ascii")


---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-3-99e41bb4ce9c> in <module>()
      1 # And if you try to use "ascii" encoding on a character whose value is too high:
      2 # Hint: you've definitely seen this error before
----> 3 unichr(129).encode("ascii")

UnicodeEncodeError: 'ascii' codec can't encode character u'\x81' in position 0: ordinal not in range(128)

UTF-8 (most commonly used multi-byte encoding, used by Twitter)

You might be familiar with the concept of Huffman coding (https://en.wikipedia.org/wiki/Huffman_coding). Huffman coding is a way of losslessly compressing data by encoding the most common values with the least amount of information. A Huffman coding tree of the English language might, for example, assign "e" a value of a single bit.

UTF-8 encoding is similar to a Huffman encoding. ASCII-compatible characters are encoded exactly the same way (a file that is UTF-8 encoded but contains only the 128 ASCII-compatible characters is effectively ASCII encoded). This way, those common characters occupy only one byte. All further characters are encoded in multiple bytes.

The multibyte encoding scheme works like this:

  • The number of leading 1s in the first byte maps to the length of the character in bytes.
  • Each following byte in the multibyte sequence begins with '10'
  • The value of the unicode code point is encoded in all of the unused bits. That is, every bit that isn't either a leading '1' of the first byte, or a leading '10' of the following bytes.
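
Laid out as bit patterns, where the x's are the bits that actually carry the code point's value:


0xxxxxxx                             (1 byte: the ASCII range)
110xxxxx 10xxxxxx                    (2 bytes)
1110xxxx 10xxxxxx 10xxxxxx           (3 bytes)
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  (4 bytes)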

In [4]:
# Example: \xf0 is the leading byte for a 4-byte emoji character:
print bin(ord('\xf0'))
# And it has 4 leading 1s!
print "Count the 1s at the beginning of the bit string: 4!"


0b11110000
Count the 1s at the beginning of the bit string: 4!

Now look at a multi-byte character, "GRINNING FACE WITH SMILING EYES." This guy doesn't fit in a single byte. In fact, his encoding takes 4 bytes (http://apps.timwhitlock.info/unicode/inspect/hex/1F601).


In [5]:
!printf "😁\n" 
!printf "😁" | xxd -g1
!printf "😁" | xxd -b -g1
# Figure out what some weird emoji is: 
# https://twitter.com/jrmontag/status/677621827410255872
!printf "📊" | xxd -g1
# https://twitter.com/SRKCHENNAIFC/status/677894680303017985
!printf "❤️" | xxd -g1


😁
0000000: f0 9f 98 81                                      ....
0000000: 11110000 10011111 10011000 10000001                    ....
0000000: f0 9f 93 8a                                      ....
0000000: e2 9d a4 ef b8 8f                                ......

This is pretty much exactly what you get when you're looking at Tweet data. If you don't believe me, try:


In [6]:
text = !cat test_tweet.json | xxd -g1 | grep "f0 9f 98 81"
for line in text:
    print line


cat: test_tweet.json: No such file or directory

In [7]:
# position of the emoji in bytes:
start = int(text[0][0:7],16)
end = int(text[0][0:7],16) + 16
print "Run the following to cat out just the first line of bytes from the hexdump:"
print "!head -c{} test_tweet.json | tail -c{}".format(end, end-start)


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-c87aadfda993> in <module>()
      1 # position of the emoji in bytes:
----> 2 start = int(text[0][0:7],16)
      3 end = int(text[0][0:7],16) + 16
      4 print "Run the following to cat out just the first line of bytes from the hexdump:"
      5 print "!head -c{} test_tweet.json | tail -c{}".format(end, end-start)

ValueError: invalid literal for int() with base 16: 'cat: te'

Rolling your own UTF-8 decoder.

This is going to be fun. And by 'fun' I mean "why are you making us do this?"


In [21]:
# Get the bits!
!printf "😁" | xxd -b -g1
# We're gonna use this in a minute
byte_string_smiley = !printf "😁" | xxd -b -g1
bytes = byte_string_smiley[0].split(" ")[1:5]
print bytes


0000000: 11110000 10011111 10011000 10000001                    ....
['11110000', '10011111', '10011000', '10000001']

In [22]:
first_byte = bytes[0]
print "The 1st byte: {}".format(first_byte)
length_of_char = 0
b = 0
while first_byte[b] == '1':
    length_of_char += 1
    b += 1
print "The character length in bytes, calculated using the 1st byte: {}".format(length_of_char)
print "The remaining bits in the first byte: {}".format(first_byte[b:])
print "The non-'leading 10' bits in the next 3 bytes: {}".format([x[2:] for x in bytes[1:]])
print "The bits of the code point: {}".format(
    [first_byte[b:]]+[x[2:] for x in bytes[1:]])
code_point_bits = "".join([first_byte[b:]]+[x[2:] for x in bytes[1:]])
print "The bit string of the code point: {}".format(code_point_bits)
code_point_int = int(code_point_bits,2)
print "The code point is: {} (or in hex {})".format(code_point_int, hex(code_point_int))
print "And the character is: {}".format(unichr(code_point_int).encode("utf-8"))
print "Phew!"


The 1st byte: 11110000
The character length in bytes, calculated using the 1st byte: 4
The remaining bits in the first byte: 0000
The non-'leading 10' bits in the next 3 bytes: ['011111', '011000', '000001']
The bits of the code point: ['0000', '011111', '011000', '000001']
The bit string of the code point: 0000011111011000000001
The code point is: 128513 (or in hex 0x1f601)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-22-f2bf30f75b2f> in <module>()
     15 code_point_int = int(code_point_bits,2)
     16 print "The code point is: {} (or in hex {})".format(code_point_int, hex(code_point_int))
---> 17 print "And the character is: {}".format(unichr(code_point_int).encode("utf-8"))
     18 print "Phew!"

ValueError: unichr() arg not in range(0x10000) (narrow Python build)
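
That error means this is a "narrow" Python 2 build, where unichr() only handles code points up to 0xFFFF. One workaround (a sketch): write out the escape sequence yourself and let the unicode-escape codec build the character.


In [ ]:
# Narrow-build workaround: unichr() can't make characters above 0xFFFF,
# but the unicode-escape codec can
smiley = ('\\U%08x' % code_point_int).decode('unicode-escape')
print "And the character is: {}".format(smiley.encode("utf-8"))
print "Phew!"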

Part 2: But what if I like magic?


Getting Python (2) to help you out with this.

The hard way:

The following should demonstrate to you that what we're about to do is exactly the same as what we just did, but easier.


In [23]:
# The 'rb' option to open (or mode = 'rb' to fileinput.FileInput)
# this means, "read in the file as a byte string." Basically, exactly what you get from
# the xxd hexdump
f = open("test.txt", 'rb')
# read the file (the whole file is one emoji character)
test_emoji = f.read().strip()
bytes = []
bits = []
code_point = test_emoji.decode("utf-8")
print code_point
code_point_integer = ord(code_point)
for byte in test_emoji:
    bytes.append(byte)
    bits.append(bin(ord(byte)).lstrip("0b"))
print "The Unicode code point: {}".format([code_point])
print "Integer value of the unicode code point: hex: {}, decimal: {}".format(
    hex(code_point_integer), code_point_integer)
print "The bytes (hex): {}".format(bytes)
print "The bytes (decimal): {}".format([ord(x) for x in bytes])
print "Each byte represented in bits: {}".format(bits)
f.close()


😁
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-23-5bd1108bf954> in <module>()
      9 code_point = test_emoji.decode("utf-8")
     10 print code_point
---> 11 code_point_integer = ord(code_point)
     12 for byte in test_emoji:
     13     bytes.append(byte)

TypeError: ord() expected a character, but string of length 2 found
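
The same narrow-build trouble again: decoding the 4-byte emoji gave back a surrogate pair (two 16-bit units), and ord() wants exactly one character. A sketch of recombining the pair by hand:


In [ ]:
# On a narrow build, a character above 0xFFFF decodes to a surrogate pair;
# recombine the two halves to recover the real code point
if len(code_point) == 2:
    high, low = ord(code_point[0]), ord(code_point[1])
    code_point_integer = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
else:
    code_point_integer = ord(code_point)
print hex(code_point_integer)  # 0x1f601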

The easy way:

Now, imagine that you didn't want to have to think about bit strings every time you dealt with text data. We live in that brave new world.

The big problem that I (we, I think) have been having with emoji and multibyte characters in general is decoding them in a way that allows us to process one character at a time. I had this problem because I didn't understand what the encoding/decoding steps meant.


In [31]:
!cat test.txt


😁

In [24]:
g = open("test.txt")
# read the file (the whole file is one emoji character)
test_emoji = g.read().strip()

In [25]:
# Now, try to get a list of characters
print "list(test_emoji)"
print list(test_emoji)


list(test_emoji)
['\xf0', '\x9f', '\x98', '\x81']

Just asking for a list of all of the characters doesn't work, because Python 2 hands back the undecoded byte string one byte at a time (as if each character fit in 1 byte, like ASCII). We'd have to search all of the bytes ourselves to figure out which ones constituted emoji.

I've implemented this, https://github.com/fionapigott/emoji-counter, because I didn't realize that there was a better way. But there is!


In [52]:
# *Now*, try to get a list of characters
# (on a narrow Python 2 build, the 4-byte emoji decodes to a surrogate
# pair--two 16-bit code units--rather than a single code point)
print "list(test_emoji.decode('utf-8'))"
print list(test_emoji.decode('utf-8'))

list(test_emoji.decode('utf-8'))
[u'\ud83d', u'\ude01']

Now if you want to search your code for "😁", you just need to know its code point (which you can find or even, if you're rather determined, derive).
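
For example, a sketch of counting those grins (my_tweets.txt stands in for any UTF-8 encoded file of Tweet text):


In [ ]:
# Count occurrences of GRINNING FACE WITH SMILING EYES by its code point
grin = u'\U0001f601'
tweet_text = open("my_tweets.txt", "rb").read().decode("utf-8")
print tweet_text.count(grin)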


In [19]:
# Get the code point for this weird emoji
"📊".decode("utf-8")


Out[19]:
u'\U0001f4ca'

Appendices

A word on other encodings, with a tiny example

There are many other encodings, such as ISO-8859-1, UTF-16, UTF-32, etc., which are less commonly used on the web; for the most part, don't worry about them. They represent a variety of other ways to map bytes -> code points and back again.

I want to show one quick example of the UTF-32 encoding, which simply assigns 1 code point per 4-byte block. I'm going to show the encoding/decoding in Python, write the encoded data to a file, and read it back.

I'm not showing this because UTF-32 is special or because you should use it. I'm showing it so you understand a little about how to work with other file encodings.


In [ ]:
print "😁"
# Remember, and this is a bit hard: that thing we just printed was encoded as UTF-8 
# (that's why Chrome renders it at all)
print repr("😁")

In [ ]:
# Get the code point, so that we can encode it again with a different scheme
code_point = "😁".decode("utf-8")
# You have to print the repr() to look at the code point value, 
# otherwise 'print' will automatically encode the character to print it
print repr(code_point)

In [ ]:
# Now encode the data as UTF32
utf32_smiley = code_point.encode("utf-32")
print repr(utf32_smiley)
print "The first 4 bytes means 'this file is UTF-32 encoded'. The next 4 are the character."

In [ ]:
# That's a byte string--we can write it to a file
utf32_file = open("test_utf32.txt","w")
utf32_file.write(utf32_smiley)
utf32_file.close()
# No nasty Encode errors. That's good.

In [ ]:
# Butttt, that file looks like garbage, because nothing is going to automatically
# decode that byte string as UTF-32
!cat test_utf32.txt 
print "\n"
# We can still look at the bytes tho! And they should look familiar
!cat test_utf32.txt | xxd -g1

In [ ]:
# And we can read in the file as long as we use the right decoder
utf32_file_2 = open("test_utf32.txt","rb")
code_point_back_again = utf32_file_2.read().decode("utf-32")
print code_point_back_again
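
If you'd rather not call decode yourself every time, the standard-library codecs module will decode as it reads. A sketch of the same read:


In [ ]:
# codecs.open decodes on the fly, given the right encoding
import codecs
with codecs.open("test_utf32.txt", "r", encoding="utf-32") as f:
    print f.read()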

Just when you thought you knew everything about Emoji

It's worse than it seems! Well, just a little worse.

One thing that I noticed when I was cat-ing a bunch of byte strings to my screen was that some emoji (not all) were followed by either "ef b8 8e" or "ef b8 8f." I felt sad. Had I totally failed to understand how emoji work on Twitter? Was there something I was missing?

The answer is no, not really. Those pesky multibyte characters are non-display characters called "variation selectors (http://unicode.org/charts/PDF/UFE00.pdf)," and they change how emoji are displayed. There are lots of variation selectors (16, I think), but two apply to emoji, and they correspond to "\xef\xb8\x8e, or text style" and "\xef\xb8\x8f, or emoji style" display of the emoji characters, to allow for even more variety in a world that already allows for a normal hotel (🏨) and a "love hotel" (🏩).

Not all emoji have variants for the variation selectors, nor do all platforms bother trying to deal with them, but Twitter does. If you ever find yourself in a position where you care, here's a quick example of what they do.

You will need to open a terminal, because I couldn't find a character that would display in-notebook as both text style and emoji style.


printf "\xE2\x8C\x9A"
printf "\xE2\x8C\x9A\xef\xb8\x8e"
printf "\xE2\x8C\x9A\xef\xb8\x8f"

Takeaway: Variation selectors are the difference between an Apple Watch and a Timex.
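
If the selectors ever get in the way of your emoji-counting, stripping them is a one-liner. A sketch, using the watch character from above:


In [ ]:
# Strip the two emoji variation selectors (U+FE0E text style, U+FE0F emoji style)
watch_emoji_style = "\xE2\x8C\x9A\xef\xb8\x8f".decode("utf-8")
stripped = watch_emoji_style.replace(u'\ufe0e', u'').replace(u'\ufe0f', u'')
print len(watch_emoji_style), len(stripped)  # prints: 2 1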

Python functions for dealing with data representations

Some of the built-in functions that I used to manipulate binary/hex/decimal representations here:


In [ ]:
# Shoutout to Josh's RST!
def print_output(function,input_data,kwargs={}):
    kwargs_repr = ",".join(["=".join([x[0], str(x[1])]) for x in kwargs.items()])
    print "{}({},{}) -> {}".format(function.__name__, repr(input_data), kwargs_repr,
                                    repr(function(input_data,**kwargs)))

In [ ]:
# Decimal to hex:
print "Converting decimal to hex string:"
print_output(hex,240)
# hex to decimal
print "\nConverting hex to decimal:"
print_output(int,hex(240),kwargs = {"base":16})
# decimal to binary
print "\nConverting decimal to binary:"
print_output(bin,240)
# binary string to an integer
print "\nConverting decimal to binary:"
print_output(int,"11110000",kwargs = {"base":2})
# byte string representation to ordinal (unicode code point value)
print "\nConverting byte string to ordinal"
print_output(ord,"\x31")
print_output(ord,"\xF0")
# ordinal to unicode code point
print "\nConverting ordinal number to unicode code point"
print_output(unichr,49)
print_output(unichr,240)