A Python notebook to teach you everything you never wanted to know about text encoding (specifically ASCII and UTF-8 and the difference between them, though we'll explain what some others mean).
Credit to these sites for a helpful description of different file encodings:
And to these pages for a better understanding of emoji specifically:
And if you got here from Data-Science-45-min intros, check out https://github.com/fionapigott/emoji-counter for this tutorial and (a little) more.
A text encoding is a scheme that allows us to convert between binary (stored on your computer) and a character that you can display and make sense of. A text encoding does not define a font.
When I say "character" I mean "unicode code point." Code point -> character is a 1->1 mapping of meaning. A font just decides how to display that character. Each emoji has a code point assigned to it by the Unicode Consortium, and "GRINNING FACE WITH SMILING EYES" should be a grinning face with smiling eyes on any platform. Windows Wingdings, if you remember that regrettable period, is a font.
I'm going to use "code point" and "character" a little bit interchangeably. If you can get the code point represented by a string of bits, you can figure out what character it represents.
Decode = convert binary data to a code point
Encode = convert a code point (a big number) to binary data that you can write somewhere
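To make that concrete, here's the round trip in Python (a minimal sketch, assuming the bytes are UTF-8):
In [ ]:
# decode: binary data -> code point; encode: code point -> binary data
byte_string = "\xf0\x9f\x98\x81"          # 4 bytes, exactly as stored on disk
code_point = byte_string.decode("utf-8")  # decode the bytes to a code point
print repr(code_point)                    # u'\U0001f601'
print repr(code_point.encode("utf-8"))    # encode it right back to the same bytes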
Your code will always:
I want to spend a few minutes convincing you of what I'm about to say about text encoding in Python.
We're spending time in the terminal to make it painfully, horribly clear that text encoding/decoding is not some Python thing, but rather exactly what every text-display program does every time you convert some binary (stored on your computer) to something that you can read.
ASCII is a character-encoding scheme where each character fits in exactly 1 byte--8 bits. ASCII, however, uses only the bottom 7 bits of an 8-bit byte, and thus can take only 2^7 (128) values. The value of the byte that encodes a character is exactly that character's code point.
"h" and "i" are both ascii characters--they fit in one byte in the ascii encoding scheme.
In [1]:
!printf "hi\n"
!printf "hi" | xxd -g1
!printf "hi" | xxd -b -g1
In [2]:
# Generate a list of all of the ASCII characters:
# 'unichr' is a built-in Python function to take a number to a unicode code point
# (I'll talk more about this and some other built-ins later)
for i in range(0,128):
    print str(i) + " -> " + repr(unichr(i)) + " -> " + "'" + unichr(i).encode("ascii") + "'"
In [3]:
# And if you try to use "ascii" encoding on a character whose value is too high:
# Hint: you've definitely seen this error before
unichr(129).encode("ascii")
You might be familiar with the concept of Huffman coding (https://en.wikipedia.org/wiki/Huffman_coding). Huffman coding is a way of losslessly compressing data by encoding the most common values with the least amount of information. A Huffman coding tree of the English language might, for example, assign "e" a value of a single bit.
UTF-8 encoding is similar to a Huffman encoding. ASCII-compatible characters are encoded exactly the same way (a file that is UTF-8 encoded but contains only the 128 ASCII-compatible characters is effectively ASCII encoded). This way, those common characters occupy only one byte. All further characters are encoded in multiple bytes.
The multibyte encoding scheme works like this: the number of leading 1 bits in the first byte tells you how many bytes the character occupies (110xxxxx means 2 bytes, 1110xxxx means 3, 11110xxx means 4), every continuation byte starts with the bits 10, and the remaining x bits, concatenated in order, are the code point.
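A quick way to see those widths (a sketch using Python's built-in codecs; each character below needs one more byte than the last):
In [ ]:
# 1-, 2-, 3-, and 4-byte UTF-8 characters
for ch in [u"e", u"\u00e9", u"\u20ac", u"\U0001F601"]:
    print repr(ch) + " -> " + str(len(ch.encode("utf-8"))) + " bytes"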
In [4]:
# Example: \xf0 is the leading byte for a 4-byte emoji:
print bin(ord('\xf0'))
# And it has 4 1s!
print "Count the 1s at the beginning of the bit string: 4!"
Now look at a multi-byte character, "GRINNING FACE WITH SMILING EYES." This guy doesn't fit in a single byte. In fact, his encoding takes 4 bytes (http://apps.timwhitlock.info/unicode/inspect/hex/1F601).
In [5]:
!printf "😁\n"
!printf "😁" | xxd -g1
!printf "😁" | xxd -b -g1
# Figure out what some weird emoji is:
# https://twitter.com/jrmontag/status/677621827410255872
!printf "📊" | xxd -g1
# https://twitter.com/SRKCHENNAIFC/status/677894680303017985
!printf "❤️" | xxd -g1
This is pretty much exactly what you get when you're looking at Tweet data. If you don't believe me, try:
In [6]:
text = !cat test_tweet.json | xxd -g1 | grep "f0 9f 98 81"
for line in text:
    print line
In [7]:
# position of the emoji in bytes:
start = int(text[0][0:7],16)
end = int(text[0][0:7],16) + 16
print "Run the following to cat out just the first line of bytes from the hexdump:"
print "!head -c{} test_tweet.json | tail -c{}".format(end, end-start)
In [21]:
# Get the bits!
!printf "😁" | xxd -b -g1
# We're gonna use this in a minute
byte_string_smiley = !printf "😁" | xxd -b -g1
bytes = byte_string_smiley[0].split(" ")[1:5]
print bytes
In [22]:
first_byte = bytes[0]
print "The 1st byte: {}".format(first_byte)
length_of_char = 0
b = 0
while first_byte[b] == '1':
    length_of_char += 1
    b += 1
print "The character length in bytes, calculated using the 1st byte: {}".format(length_of_char)
print "The remaining bits in the first byte: {}".format(first_byte[b:])
print "The non-'leading 10' bits in the next 3 bytes: {}".format([x[2:] for x in bytes[1:]])
print "The bits of the code point: {}".format(
    [first_byte[b:]]+[x[2:] for x in bytes[1:]])
code_point_bits = "".join([first_byte[b:]]+[x[2:] for x in bytes[1:]])
print "The bit string of the code point: {}".format(code_point_bits)
code_point_int = int(code_point_bits,2)
print "The code point is: {} (or in hex {})".format(code_point_int, hex(code_point_int))
print "And the character is: {}".format(unichr(code_point_int).encode("utf-8"))
print "Phew!"
In [23]:
# The 'rb' option to open (or mode = 'rb' to fileinput.FileInput)
# this means, "read in the file as a byte string." Basically, exactly what you get from
# the xxd hexdump
f = open("test.txt", 'rb')
# read the file (the whole file is one emoji character)
test_emoji = f.read().strip()
bytes = []
bits = []
code_point = test_emoji.decode("utf-8")
print code_point
code_point_integer = ord(code_point)
for byte in test_emoji:
    bytes.append(byte)
    # bin() gives e.g. '0b11110000'; [2:] slices off the '0b' prefix
    bits.append(bin(ord(byte))[2:])
print "The Unicode code point: {}".format([code_point])
print "Integer value of the unicode code point: hex: {}, decimal: {}".format(
    hex(code_point_integer), code_point_integer)
print "The bytes (hex): {}".format(bytes)
print "The bytes (decimal): {}".format([ord(x) for x in bytes])
print "Each byte represented in bits: {}".format(bits)
f.close()
Now, imagine that you didn't want to have to think about bit strings every time you dealt with text data. We live in that brave new world.
The big problem that I (we, I think) have been having with emoji and multibyte characters in general is decoding them in a way that allows us to process one character at a time. I had this problem because I didn't understand what the encoding/decoding steps meant.
In [31]:
!cat test.txt
In [24]:
g = open("test.txt")
# read the file (the whole file is one emoji character)
test_emoji = g.read().strip()
In [25]:
# Now, try to get a list of characters
print "list(test_emoji)"
print list(test_emoji)
Just asking for a list of all of the characters doesn't work, because Python 2 assumes ASCII (1 byte per character) and splits the string up byte by byte. We'd have to search all of the bytes to figure out which ones constituted emoji.
I've implemented this, https://github.com/fionapigott/emoji-counter, because I didn't realize that there was a better way. But there is!
In [52]:
# *Now*, try to get a list of characters
print "list(test_emoji.decode('utf-8'))"
print list(test_emoji.decode('utf-8'))
print list(test_emoji.decode('utf-8'))[0]
Now if you want to search your code for "😁", you just need to know its code point (which you can find or even, if you're rather determined, derive).
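For example, once the text is decoded, counting a specific emoji is a one-liner (a sketch; some_tweet_text is a made-up stand-in for your raw UTF-8 bytes, not real Tweet data):
In [ ]:
# some_tweet_text is hypothetical example data
some_tweet_text = "good morning \xf0\x9f\x98\x81\xf0\x9f\x98\x81"
print some_tweet_text.decode("utf-8").count(u"\U0001F601")  # 2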
In [19]:
# Get the code point for this weird emoji
"📊".decode("utf-8")
Out[19]:
u'\U0001f4ca'
There are many other encodings (ISO-8859-1, UTF-16, UTF-32, etc.) that are less commonly used on the web; for the most part, don't worry about them. They represent a variety of other ways to map bytes -> code points and back again.
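Just to give a flavor of how differently they map the same character, compare a few encodings of "é" (code point U+00E9):
In [ ]:
# One code point, four different byte strings (latin-1 is ISO-8859-1)
e_acute = u"\u00e9"
for scheme in ["latin-1", "utf-8", "utf-16", "utf-32"]:
    print scheme + " -> " + repr(e_acute.encode(scheme))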
I want to show one quick example of the UTF-32 encoding, which simply assigns 1 code point per 4-byte block. I'm going to show the encoding/decoding in Python, write the encoded data to a file, and read it back.
I'm not showing this because UTF-32 is special or because you should use it. I'm showing it so you understand a little about how to work with other file encodings.
In [ ]:
print "😁"
# Remember, and this is a bit hard: that thing we just printed was encoded as UTF-8
# (that's why Chrome renders it at all)
print repr("😁")
In [ ]:
# Get the code point, so that we can encode it again with a different scheme
code_point = "😁".decode("utf-8")
# You have to print the repr() to look at the code point value,
# otherwise 'print' will automatically encode the character to print it
print repr(code_point)
In [ ]:
# Now encode the data as UTF32
utf32_smiley = code_point.encode("utf-32")
print repr(utf32_smiley)
print "The first 4 bytes means 'this file is UTF-32 encoded'. The next 4 are the character."
In [ ]:
# That's a byte string--we can write it to a file
utf32_file = open("test_utf32.txt","wb")
utf32_file.write(utf32_smiley)
utf32_file.close()
# No nasty Encode errors. That's good.
In [ ]:
# Butttt, that file looks like garbage, because nothing is going to automatically
# decode that byte string as UTF-32
!cat test_utf32.txt
print "\n"
# We can still look at the bytes tho! And they should look familiar
!cat test_utf32.txt | xxd -g1
In [ ]:
# And we can read in the file as long as we use the right decoder
utf32_file_2 = open("test_utf32.txt","rb")
code_point_back_again = utf32_file_2.read().decode("utf-32")
print code_point_back_again
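An equivalent sketch, if you'd rather let the file object do the decoding: the codecs module can open a file with a declared encoding.
In [ ]:
# codecs.open hands you already-decoded text as you read
import codecs
utf32_file_3 = codecs.open("test_utf32.txt", "rb", encoding="utf-32")
print repr(utf32_file_3.read())
utf32_file_3.close()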
It's worse than it seems! Well, just a little worse.
One thing that I noticed when I was cat-ing a bunch of byte strings to my screen was that some emoji (not all) were followed by either "ef b8 8e" or "ef b8 8f." I felt sad. Had I totally failed to understand how emoji work on Twitter? Was there something I was missing?
The answer is no, not really. Those pesky multibyte characters are non-display characters called "variation selectors" (http://unicode.org/charts/PDF/UFE00.pdf), and they change how emoji are displayed. There are 16 variation selectors, but two apply to emoji: "\xef\xb8\x8e", for text style, and "\xef\xb8\x8f", for emoji style, to allow for even more variety in a world that already allows for a normal hotel (🏨) and a "love hotel" (🏩).
Not all emoji have variants for the variation selectors, nor do all platforms bother trying to deal with them, but Twitter does. If you ever find yourself in a position where you care, here's a quick example of what they do.
You will need to open a terminal, because I couldn't find a character that would display in-notebook as both text style and emoji style.
printf "\xE2\x8C\x9A"
printf "\xE2\x8C\x9A\xef\xb8\x8e"
printf "\xE2\x8C\x9A\xef\xb8\x8f"
Takeaway: Variation selectors are the difference between an Apple Watch and a Timex.
In [ ]:
# Shoutout to Josh's RST!
def print_output(function,input_data,kwargs={}):
    kwargs_repr = ",".join(["=".join([x[0], str(x[1])]) for x in kwargs.items()])
    print "{}({},{}) -> {}".format(function.__name__, repr(input_data), kwargs_repr,
                                   repr(function(input_data,**kwargs)))
In [ ]:
# Decimal to hex:
print "Converting decimal to hex string:"
print_output(hex,240)
# hex to decimal
print "\nConverting hex to decimal:"
print_output(int,hex(240),kwargs = {"base":16})
# decimal to binary
print "\nConverting decimal to binary:"
print_output(bin,240)
# binary string to an integer
print "\nConverting decimal to binary:"
print_output(int,"11110000",kwargs = {"base":2})
# byte string representation to ordinal (unicode code point value)
print "\nConverting byte string to ordinal"
print_output(ord,"\x31")
print_output(ord,"\xF0")
# ordinal to unicode code point
print "\nConverting ordinal number to unicode code point"
print_output(unichr,49)
print_output(unichr,240)