Understanding text (and handling it in Python)

Fiona Pigott

A Python notebook to teach you everything you never wanted to know about text encoding (specifically ASCII and UTF-8, and the difference between them, though we'll explain what some other encodings mean).

Credit to these sites for a helpful description of different file encodings:

And to these pages for a better understanding of emoji specifically:

And if you got here from Data-Science-45-min intros, check out https://github.com/fionapigott/emoji-counter for this tutorial and (a little) more.


Part 0: What do you mean by "text encoding"?


A text encoding is a scheme that allows us to convert between binary (stored on your computer) and a character that you can display and make sense of. A text encoding does not define a font.

When I say "character" I mean "unicode code point." Code point -> character is a 1->1 mapping of meaning. A font just decides how to display that character. Each emoji has a code point assigned to it by the Unicode Consortium, and "GRINNING FACE WITH SMILING EYES" should be a grinning face with smiley eyes on any platform. Windows Wingdings, if you remember that regrettable period, is a font.
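
If you want to see those official names from Python, the standard-library unicodedata module can look one up from a character. A quick sketch, using a smiley that sits in the basic range so it works on any Python 2 build:


In [ ]:
# Look up the official Unicode name assigned to a code point
import unicodedata
print unicodedata.name(u'\u263a')  # prints: WHITE SMILING FACE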

I'm going to use "code point" and "character" a little bit interchangeably. If you can get the code point represented by a string of bits, you can figure out what character it represents.

Decode = convert binary data to a code point

Encode = convert a code point (a big number) to binary data that you can write somewhere

Your code will always:

  • Ingest binary data (say, my_tweets.txt)
  • Decode that data into characters (whether or not you have to type decode.)
  • Encode that data so that you can write it again (whether or not you type encode. You can't write "128513" to a single byte.)
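
Here's a minimal sketch of that loop in Python 2 (my_tweets.txt stands in for any UTF-8 encoded file):


In [ ]:
# Ingest binary data, decode it to characters, encode it to write it out again
# (assumes a UTF-8 encoded file called my_tweets.txt)
raw_bytes = open("my_tweets.txt", "rb").read()  # binary data in
text = raw_bytes.decode("utf-8")                # bytes -> code points
with open("my_tweets_copy.txt", "wb") as out:
    out.write(text.encode("utf-8"))             # code points -> bytes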

Part 1: Text encodings are not magic


Get the actual data of the file in your terminal with xxd.

I want to spend a few minutes convincing you of what I'm about to say about text encoding in Python.

We're spending time in the terminal to make it painfully, horribly clear that text encoding/decoding is not some Python thing, but rather exactly what every text-display program does every time you convert some binary (stored on your computer) to something that you can read.

ASCII

ASCII is a character-encoding scheme where each character fits in exactly 1 byte--8 bits. ASCII, however, uses only the bottom 7 bits of an 8-bit byte, and thus can take only 2^7 (128) values. The value of the byte that encodes a character is exactly that character's code point.

"h" and "i" are both ascii characters--they fit in one byte in the ascii encoding scheme.


In [1]:
!printf "hi\n"
!printf "hi" | xxd -g1
!printf "hi" | xxd -b -g1


hi
0000000: 68 69                                            hi
0000000: 01101000 01101001                                      hi
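
The same check in Python: the byte values we just saw in the hexdump (0x68, 0x69) are exactly the code points for "h" and "i".


In [ ]:
# In ASCII, the value of the byte is the value of the code point
print ord('h'), hex(ord('h'))  # 104 0x68
print ord('i'), hex(ord('i'))  # 105 0x69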

In [2]:
# Generate a list of all of the ASCII characters:
# 'unichr' is a built-in Python function to take a number to a unicode code point 
# (I'll talk more about this and some other built-ins later)
for i in range(0,128):
    print str(i) + " -> " + repr(unichr(i)) + "->" + "'" + unichr(i).encode("ascii") + "'"


0 -> u'\x00'->''
1 -> u'\x01'->''
2 -> u'\x02'->''
3 -> u'\x03'->''
4 -> u'\x04'->''
5 -> u'\x05'->''
6 -> u'\x06'->''
7 -> u'\x07'->''
8 -> u'\x08'->''
9 -> u'\t'->'	'
10 -> u'\n'->'
'
11 -> u'\x0b'->''
12 -> u'\x0c'->''
13 -> u'\r'->'
'
14 -> u'\x0e'->''
15 -> u'\x0f'->''
16 -> u'\x10'->''
17 -> u'\x11'->''
18 -> u'\x12'->''
19 -> u'\x13'->''
20 -> u'\x14'->''
21 -> u'\x15'->''
22 -> u'\x16'->''
23 -> u'\x17'->''
24 -> u'\x18'->''
25 -> u'\x19'->''
26 -> u'\x1a'->''
27 -> u'\x1b'->''
28 -> u'\x1c'->''
29 -> u'\x1d'->''
30 -> u'\x1e'->''
31 -> u'\x1f'->''
32 -> u' '->' '
33 -> u'!'->'!'
34 -> u'"'->'"'
35 -> u'#'->'#'
36 -> u'$'->'$'
37 -> u'%'->'%'
38 -> u'&'->'&'
39 -> u"'"->'''
40 -> u'('->'('
41 -> u')'->')'
42 -> u'*'->'*'
43 -> u'+'->'+'
44 -> u','->','
45 -> u'-'->'-'
46 -> u'.'->'.'
47 -> u'/'->'/'
48 -> u'0'->'0'
49 -> u'1'->'1'
50 -> u'2'->'2'
51 -> u'3'->'3'
52 -> u'4'->'4'
53 -> u'5'->'5'
54 -> u'6'->'6'
55 -> u'7'->'7'
56 -> u'8'->'8'
57 -> u'9'->'9'
58 -> u':'->':'
59 -> u';'->';'
60 -> u'<'->'<'
61 -> u'='->'='
62 -> u'>'->'>'
63 -> u'?'->'?'
64 -> u'@'->'@'
65 -> u'A'->'A'
66 -> u'B'->'B'
67 -> u'C'->'C'
68 -> u'D'->'D'
69 -> u'E'->'E'
70 -> u'F'->'F'
71 -> u'G'->'G'
72 -> u'H'->'H'
73 -> u'I'->'I'
74 -> u'J'->'J'
75 -> u'K'->'K'
76 -> u'L'->'L'
77 -> u'M'->'M'
78 -> u'N'->'N'
79 -> u'O'->'O'
80 -> u'P'->'P'
81 -> u'Q'->'Q'
82 -> u'R'->'R'
83 -> u'S'->'S'
84 -> u'T'->'T'
85 -> u'U'->'U'
86 -> u'V'->'V'
87 -> u'W'->'W'
88 -> u'X'->'X'
89 -> u'Y'->'Y'
90 -> u'Z'->'Z'
91 -> u'['->'['
92 -> u'\\'->'\'
93 -> u']'->']'
94 -> u'^'->'^'
95 -> u'_'->'_'
96 -> u'`'->'`'
97 -> u'a'->'a'
98 -> u'b'->'b'
99 -> u'c'->'c'
100 -> u'd'->'d'
101 -> u'e'->'e'
102 -> u'f'->'f'
103 -> u'g'->'g'
104 -> u'h'->'h'
105 -> u'i'->'i'
106 -> u'j'->'j'
107 -> u'k'->'k'
108 -> u'l'->'l'
109 -> u'm'->'m'
110 -> u'n'->'n'
111 -> u'o'->'o'
112 -> u'p'->'p'
113 -> u'q'->'q'
114 -> u'r'->'r'
115 -> u's'->'s'
116 -> u't'->'t'
117 -> u'u'->'u'
118 -> u'v'->'v'
119 -> u'w'->'w'
120 -> u'x'->'x'
121 -> u'y'->'y'
122 -> u'z'->'z'
123 -> u'{'->'{'
124 -> u'|'->'|'
125 -> u'}'->'}'
126 -> u'~'->'~'
127 -> u'\x7f'->''

In [3]:
# And if you try to use "ascii" encoding on a character whose value is too high:
# Hint: you've definitely seen this error before
unichr(129).encode("ascii")


---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-3-99e41bb4ce9c> in <module>()
      1 # And if you try to use "ascii" encoding on a character whose value is too high:
      2 # Hint: you've definitely seen this error before
----> 3 unichr(129).encode("ascii")

UnicodeEncodeError: 'ascii' codec can't encode character u'\x81' in position 0: ordinal not in range(128)

UTF-8 (most commonly used multi-byte encoding, used by Twitter)

You might be familiar with the concept of Huffman coding (https://en.wikipedia.org/wiki/Huffman_coding). Huffman coding is a way of losslessly compressing data by encoding the most common values with the least amount of information. A Huffman coding tree of the English language might, for example, assign "e" a value of a single bit.

UTF-8 encoding is similar to a Huffman encoding. ASCII-compatible characters are encoded exactly the same way (a file that is UTF-8 encoded but contains only the 128 ASCII-compatible characters is effectively ASCII encoded). This way, those common characters occupy only one byte. All further characters are encoded in multiple bytes.

The multibyte encoding scheme works like this:

  • The number of leading 1s in the first byte maps to the length of the character in bytes.
  • Each following byte in the multibyte sequence begins with '10'
  • The value of the unicode code point is encoded in all of the unused bits. That is, every bit that isn't either a leading '1' of the first byte, or a leading '10' of the following bytes.
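
Laid out as bit patterns, where the x's are the bits that actually carry the code point's value:


0xxxxxxx                             (1 byte: the ASCII range)
110xxxxx 10xxxxxx                    (2 bytes)
1110xxxx 10xxxxxx 10xxxxxx           (3 bytes)
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  (4 bytes)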

In [4]:
# Example: \xf0 is the leading byte for a 4-byte emoji character:
print bin(ord('\xf0'))
# And it has 4 leading 1s!
print "Count the 1s at the beginning of the bit string: 4!"


0b11110000
Count the 1s at the beginning of the bit string: 4!

Now look at a multi-byte character, "GRINNING FACE WITH SMILING EYES." This guy doesn't fit in a single byte. In fact, his encoding takes 4 bytes (http://apps.timwhitlock.info/unicode/inspect/hex/1F601).


In [5]:
!printf "😁\n" 
!printf "😁" | xxd -g1
!printf "😁" | xxd -b -g1
# Figure out what some weird emoji is: 
# https://twitter.com/jrmontag/status/677621827410255872
!printf "📊" | xxd -g1
# https://twitter.com/SRKCHENNAIFC/status/677894680303017985
!printf "❤️" | xxd -g1


😁
0000000: f0 9f 98 81                                      ....
0000000: 11110000 10011111 10011000 10000001                    ....
0000000: f0 9f 93 8a                                      ....
0000000: e2 9d a4 ef b8 8f                                ......

This is pretty much exactly what you get when you're looking at Tweet data. If you don't believe me, try:


In [6]:
text = !cat test_tweet.json | xxd -g1 | grep "f0 9f 98 81"
for line in text:
    print line


cat: test_tweet.json: No such file or directory

In [7]:
# position of the emoji in bytes:
start = int(text[0][0:7],16)
end = int(text[0][0:7],16) + 16
print "Run the following to cat out just the first line of bytes from the hexdump:"
print "!head -c{} test_tweet.json | tail -c{}".format(end, end-start)


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-c87aadfda993> in <module>()
      1 # position of the emoji in bytes:
----> 2 start = int(text[0][0:7],16)
      3 end = int(text[0][0:7],16) + 16
      4 print "Run the following to cat out just the first line of bytes from the hexdump:"
      5 print "!head -c{} test_tweet.json | tail -c{}".format(end, end-start)

ValueError: invalid literal for int() with base 16: 'cat: te'

Rolling your own UTF-8 decoder.

This is going to be fun. And by 'fun' I mean "why are you making us do this?"


In [21]:
# Get the bits!
!printf "😁" | xxd -b -g1
# We're gonna use this in a minute
byte_string_smiley = !printf "😁" | xxd -b -g1
bytes = byte_string_smiley[0].split(" ")[1:5]
print bytes


0000000: 11110000 10011111 10011000 10000001                    ....
['11110000', '10011111', '10011000', '10000001']

In [22]:
first_byte = bytes[0]
print "The 1st byte: {}".format(first_byte)
length_of_char = 0
b = 0
while first_byte[b] == '1':
    length_of_char += 1
    b += 1
print "The character length in bytes, calculated using the 1st byte: {}".format(length_of_char)
print "The remaining bits in the first byte: {}".format(first_byte[b:])
print "The non-'leading 10' bits in the next 3 bytes: {}".format([x[2:] for x in bytes[1:]])
print "The bits of the code point: {}".format(
    [first_byte[b:]]+[x[2:] for x in bytes[1:]])
code_point_bits = "".join([first_byte[b:]]+[x[2:] for x in bytes[1:]])
print "The bit string of the code point: {}".format(code_point_bits)
code_point_int = int(code_point_bits,2)
print "The code point is: {} (or in hex {})".format(code_point_int, hex(code_point_int))
print "And the character is: {}".format(unichr(code_point_int).encode("utf-8"))
print "Phew!"


The 1st byte: 11110000
The character length in bytes, calculated using the 1st byte: 4
The remaining bits in the first byte: 0000
The non-'leading 10' bits in the next 3 bytes: ['011111', '011000', '000001']
The bits of the code point: ['0000', '011111', '011000', '000001']
The bit string of the code point: 0000011111011000000001
The code point is: 128513 (or in hex 0x1f601)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-22-f2bf30f75b2f> in <module>()
     15 code_point_int = int(code_point_bits,2)
     16 print "The code point is: {} (or in hex {})".format(code_point_int, hex(code_point_int))
---> 17 print "And the character is: {}".format(unichr(code_point_int).encode("utf-8"))
     18 print "Phew!"

ValueError: unichr() arg not in range(0x10000) (narrow Python build)
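
That error means this is a "narrow" Python 2 build, where unichr() only handles code points up to 0xFFFF. One workaround (a sketch): write out the escape sequence yourself and let the unicode-escape codec build the character.


In [ ]:
# Narrow-build workaround: unichr() can't make characters above 0xFFFF,
# but the unicode-escape codec can
smiley = ('\\U%08x' % code_point_int).decode('unicode-escape')
print "And the character is: {}".format(smiley.encode("utf-8"))
print "Phew!"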

Part 2: But what if I like magic?


Getting Python (2) to help you out with this.

The hard way:

The following should demonstrate to you that what we're about to do is exactly the same as what we just did, but easier.


In [23]:
# The 'rb' option to open (or mode = 'rb' to fileinput.FileInput)
# this means, "read in the file as a byte string." Basically, exactly what you get from
# the xxd hexdump
f = open("test.txt", 'rb')
# read the file (the whole file is one emoji character)
test_emoji = f.read().strip()
bytes = []
bits = []
code_point = test_emoji.decode("utf-8")
print code_point
code_point_integer = ord(code_point)
for byte in test_emoji:
    bytes.append(byte)
    bits.append(bin(ord(byte)).lstrip("0b"))
print "The Unicode code point: {}".format([code_point])
print "Integer value of the unicode code point: hex: {}, decimal: {}".format(
    hex(code_point_integer), code_point_integer)
print "The bytes (hex): {}".format(bytes)
print "The bytes (decimal): {}".format([ord(x) for x in bytes])
print "Each byte represented in bits: {}".format(bits)
f.close()


😁
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-23-5bd1108bf954> in <module>()
      9 code_point = test_emoji.decode("utf-8")
     10 print code_point
---> 11 code_point_integer = ord(code_point)
     12 for byte in test_emoji:
     13     bytes.append(byte)

TypeError: ord() expected a character, but string of length 2 found
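
The same narrow-build trouble again: decoding the 4-byte emoji gave back a surrogate pair (two 16-bit units), and ord() wants exactly one character. A sketch of recombining the pair by hand:


In [ ]:
# On a narrow build, a character above 0xFFFF decodes to a surrogate pair;
# recombine the two halves to recover the real code point
if len(code_point) == 2:
    high, low = ord(code_point[0]), ord(code_point[1])
    code_point_integer = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
else:
    code_point_integer = ord(code_point)
print hex(code_point_integer)  # 0x1f601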

The easy way:

Now, imagine that you didn't want to have to think about bit strings every time you dealt with text data. We live in that brave new world.

The big problem that I (we, I think) have been having with emoji and multibyte characters in general is decoding them in a way that allows us to process one character at a time. I had this problem because I didn't understand what the encoding/decoding steps meant.


In [31]:
!cat test.txt


😁

In [24]:
g = open("test.txt")
# read the file (the whole file is one emoji character)
test_emoji = g.read().strip()

In [25]:
# Now, try to get a list of characters
print "list(test_emoji)"
print list(test_emoji)


list(test_emoji)
['\xf0', '\x9f', '\x98', '\x81']

Just asking for a list of all of the characters doesn't work, because Python 2 hands back the undecoded byte string one byte at a time (as if each character fit in 1 byte, like ASCII). We'd have to search all of the bytes ourselves to figure out which ones constituted emoji.

I've implemented this, https://github.com/fionapigott/emoji-counter, because I didn't realize that there was a better way. But there is!


In [52]:
# *Now*, try to get a list of characters
# (on a narrow Python 2 build, the 4-byte emoji decodes to a surrogate
# pair--two 16-bit code units--rather than a single code point)
print "list(test_emoji.decode('utf-8'))"
print list(test_emoji.decode('utf-8'))

list(test_emoji.decode('utf-8'))
[u'\ud83d', u'\ude01']

Now if you want to search your code for "😁", you just need to know its code point (which you can find or even, if you're rather determined, derive).
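
For example, a sketch of counting those grins (my_tweets.txt stands in for any UTF-8 encoded file of Tweet text):


In [ ]:
# Count occurrences of GRINNING FACE WITH SMILING EYES by its code point
grin = u'\U0001f601'
tweet_text = open("my_tweets.txt", "rb").read().decode("utf-8")
print tweet_text.count(grin)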


In [19]:
# Get the code point for this weird emoji
"📊".decode("utf-8")


Out[19]:
u'\U0001f4ca'

Appendices

A word on other encodings, with a tiny example

There are many other encodings, such as ISO-8859-1, UTF-16, UTF-32, etc., which are less commonly used on the web; for the most part, don't worry about them. They represent a variety of other ways to map bytes -> code points and back again.

I want to show one quick example of the UTF-32 encoding, which simply assigns 1 code point per 4-byte block. I'm going to show the encoding/decoding in Python, write the encoded data to a file, and read it back.

I'm not showing this because UTF-32 is special or because you should use it. I'm showing it so you understand a little about how to work with other file encodings.


In [ ]:
print "😁"
# Remember, and this is a bit hard: that thing we just printed was encoded as UTF-8 
# (that's why Chrome renders it at all)
print repr("😁")

In [ ]:
# Get the code point, so that we can encode it again with a different scheme
code_point = "😁".decode("utf-8")
# You have to print the repr() to look at the code point value, 
# otherwise 'print' will automatically encode the character to print it
print repr(code_point)

In [ ]:
# Now encode the data as UTF32
utf32_smiley = code_point.encode("utf-32")
print repr(utf32_smiley)
print "The first 4 bytes means 'this file is UTF-32 encoded'. The next 4 are the character."

In [ ]:
# That's a byte string--we can write it to a file
utf32_file = open("test_utf32.txt","w")
utf32_file.write(utf32_smiley)
utf32_file.close()
# No nasty Encode errors. That's good.

In [ ]:
# Butttt, that file looks like garbage, because nothing is going to automatically
# decode that byte string as UTF-32
!cat test_utf32.txt 
print "\n"
# We can still look at the bytes tho! And they should look familiar
!cat test_utf32.txt | xxd -g1

In [ ]:
# And we can read in the file as long as we use the right decoder
utf32_file_2 = open("test_utf32.txt","rb")
code_point_back_again = utf32_file_2.read().decode("utf-32")
print code_point_back_again
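
If you'd rather not call decode yourself every time, the standard-library codecs module will decode as it reads. A sketch of the same read:


In [ ]:
# codecs.open decodes on the fly, given the right encoding
import codecs
with codecs.open("test_utf32.txt", "r", encoding="utf-32") as f:
    print f.read()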

Just when you thought you knew everything about Emoji

It's worse than it seems! Well, just a little worse.

One thing that I noticed when I was cat-ing a bunch of byte strings to my screen was that some emoji (not all) were followed by either "ef b8 8e" or "ef b8 8f." I felt sad. Had I totally failed to understand how emoji work on Twitter? Was there something I was missing?

The answer is no, not really. Those pesky multibyte characters are non-display characters called "variation selectors (http://unicode.org/charts/PDF/UFE00.pdf)," and they change how emoji are displayed. There are lots of variation selectors (16, I think), but two apply to emoji, and they correspond to "\xef\xb8\x8e, or text style" and "\xef\xb8\x8f, or emoji style" display of the emoji characters, to allow for even more variety in a world that already allows for a normal hotel (🏨) and a "love hotel" (🏩).

Not all emoji have variants for the variation selectors, nor do all platforms bother trying to deal with them, but Twitter does. If you ever find yourself in a position where you care, here's a quick example of what they do.

You will need to open a terminal, because I couldn't find a character that would display in-notebook as both text style and emoji style.


printf "\xE2\x8C\x9A"
printf "\xE2\x8C\x9A\xef\xb8\x8e"
printf "\xE2\x8C\x9A\xef\xb8\x8f"

Takeaway: Variation selectors are the difference between an Apple Watch and a Timex.
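
If the selectors ever get in the way of your emoji-counting, stripping them is a one-liner. A sketch, using the watch character from above:


In [ ]:
# Strip the two emoji variation selectors (U+FE0E text style, U+FE0F emoji style)
watch_emoji_style = "\xE2\x8C\x9A\xef\xb8\x8f".decode("utf-8")
stripped = watch_emoji_style.replace(u'\ufe0e', u'').replace(u'\ufe0f', u'')
print len(watch_emoji_style), len(stripped)  # prints: 2 1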

Python functions for dealing with data representations

Some of the built-in functions that I used to manipulate binary/hex/decimal representations here:


In [ ]:
# Shoutout to Josh's RST!
def print_output(function,input_data,kwargs={}):
    kwargs_repr = ",".join(["=".join([x[0], str(x[1])]) for x in kwargs.items()])
    print "{}({},{}) -> {}".format(function.__name__, repr(input_data), kwargs_repr,
                                    repr(function(input_data,**kwargs)))

In [ ]:
# Decimal to hex:
print "Converting decimal to hex string:"
print_output(hex,240)
# hex to decimal
print "\nConverting hex to decimal:"
print_output(int,hex(240),kwargs = {"base":16})
# decimal to binary
print "\nConverting decimal to binary:"
print_output(bin,240)
# binary string to an integer
print "\nConverting decimal to binary:"
print_output(int,"11110000",kwargs = {"base":2})
# byte string representation to ordinal (unicode code point value)
print "\nConverting byte string to ordinal"
print_output(ord,"\x31")
print_output(ord,"\xF0")
# ordinal to unicode code point
print "\nConverting ordinal number to unicode code point"
print_output(unichr,49)
print_output(unichr,240)