Representing text in a computer

As with everything else in the computer, we represent text using numbers. That sounds simple, but it gets complicated quickly once you dig into the details. As data scientists, we have to be able to load data from data files into memory and extract data from the Internet/web. That means we have to understand how computers represent characters and how they are stored on disk; it's often not just a simple sequence of numbers. Moreover, the numbers in the file might encode characters differently than you expect, particularly if the data file comes from another country.

If you look at the 7-bit ASCII codes, you'll see how Americans encode the English character set (upper and lower case letters, numbers, punctuation, and some other characters like newline and tab). The codes are the numbers 0..127, which fit in 7 bits (2^7 is 128). A string such as "abc" is represented by three bytes, one byte per character. It is a very dense encoding, meaning very few bits are wasted.
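To see that density from Python, we can ask for the underlying byte values directly (a quick sketch using the built-in encode method):

print(list('abc'.encode('ascii')))   # [97, 98, 99] -- one byte per character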

For a very long time, other languages were out of luck. A number of countries used the remaining 128..255 numeric values to encode characters useful to their languages, such as accented letters like ś and ŝ. Western European countries typically used the Latin-1 character set. The problem is that lots of countries used a number like 201 but for different characters. For example, Russian characters were often mapped to numbers using the KOI8-R mapping, which reuses the same 128..255 range that Latin-1 assigns to different characters. Enter Unicode. See Unicode vs ascii in python for more details than I have here.
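We can demonstrate the collision directly, since Python ships with both decoders. Here, the single byte value 201 decodes to two different characters depending on which mapping we assume (a small sketch; the characters shown are what Python's codecs report):

b = bytes([201])             # one byte with value 201
print(b.decode('latin-1'))   # É  (Latin-1's character 201)
print(b.decode('koi8-r'))    # и  (KOI8-R's character 201)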

Unicode is an agreed-upon standard that maps characters from just about any human language to numeric values (called code points). Conveniently, the first 128 values map exactly to the ASCII characters. Here is a mapping of character to numeric value. For example, here is how Bengali characters are encoded:

Reading this table left to right, the first character is 980+0, the second is 980+1, etc. The only trick is that the numbers on the left are in hexadecimal, so 980 really means hexadecimal 0x0980. You will see the notation U+0981. Hexadecimal, base 16, is used because all possible values within 16 bits fit in 4 hexadecimal digits.
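Knowing that the Bengali block starts at hexadecimal 980, we can regenerate the first few entries of that table ourselves (a minimal sketch using Python's hexadecimal literals and chr):

base = 0x0980                # hexadecimal 980 is 2432 in decimal
for offset in range(4):
    code_point = base + offset
    print(f"U+{code_point:04X} is {chr(code_point)}")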

To represent Unicode we have to use at least 16-bit (2-byte), not 8-bit (1-byte), characters. Rats. Oh well, we can buy more memory. According to the documentation:

Since Python 3.0, the language's str type contains Unicode characters

So, worst case, the Python string "abc" takes 3 x 2 bytes = 6 bytes. Python 3 does do some optimization (PEP 393's flexible string representation), keeping strings at 1 byte per character as long as possible, until we introduce a non-ASCII character. We can verify this character size with the getsizeof function:


In [11]:
from sys import getsizeof
print(getsizeof(''))   # 49 bytes of overhead for a string object
print(getsizeof('a'))
print(getsizeof('ab'))
print(getsizeof('abc'))
print(getsizeof('Ω')) # add non-ASCII char and overhead goes way up
print(getsizeof('ΩΩ'))
print(getsizeof('ΩΩΩ'))


49
50
51
52
76
78
80

In Python 2, we had to write u"é" to get a Unicode string. Now we just write "é" and can also specify characters by name:


In [37]:
import unicodedata
print(unicodedata.name(chr(9999)))
print(unicodedata.name(chr(9991)))
print("\N{GREEK CAPITAL LETTER OMEGA}")
print("\N{PENCIL}")
print("\N{TAPE DRIVE}")


PENCIL
TAPE DRIVE
Ω
✏
✇

Converting char codes to chars:


In [38]:
print(chr(100))
print(chr(4939))
print(chr(244), repr(chr(244)))


d
ፋ
ô 'ô'

You will see notation \xFF, which means FF in hexadecimal (all bits on) or 255 in decimal. A byte can be described in 2 hexadecimal digits, which is why we tend to use hexadecimal. To express 16 bit Unicode characters using code points, we use \uABCD notation for a two byte character:


In [49]:
'\u00ab'


Out[49]:
'«'

We can go the other way and, given a character, get its Unicode character code:


In [55]:
ord('Ω'), chr(ord('Ω'))


Out[55]:
(937, 'Ω')

In [58]:
[ord(c) for c in 'hiፋ']


Out[58]:
[104, 105, 4939]

By default, Python 3 programs themselves support Unicode characters, both in variable names and in strings:


In [40]:
répertoire = "/tmp/records.log"
print(répertoire)


/tmp/records.log

Text file encoding

Now, let's make a distinction between strings in memory and text files stored on the disk.

Storing a Python string whose character codes all fit into 8 bits (1 byte) in a file is straightforward. Every character in the string is written to the file as a byte, so we get a sequence of 8-bit numbers in the file, each byte representing a single character. Compression algorithms can reduce that space requirement further but, for an uncompressed format, one byte per character is very tight.

Not so for 16-bit Unicode characters. If we blindly save 16-bit numbers, such largesse doubles the size required to store a string, even if all of the characters fit in ASCII (codes < 128).

Instead of blindly storing two bytes per character, we should optimize for the case where characters fit within one byte, using an encoding called UTF-8. UTF stands for "Unicode Transformation Format," but I typically call it "Unicode To Follow" because of the way it does the encoding.
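We can see the difference by encoding the same ASCII-only string both ways (a quick sketch; utf-16-le is the plain 2-bytes-per-character layout, without a byte-order mark):

print(len('abc'.encode('utf-16-le')))   # 6 bytes: 2 per character, even for ASCII
print(len('abc'.encode('utf-8')))       # 3 bytes: 1 per character for ASCII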

UTF-8 is a simple encoding of Unicode strings that is optimized for the ASCII characters. In each byte of the encoding, the high bit determines if more bytes follow. A high bit of zero means that the byte has enough information to fully represent a character; ASCII characters require only a single byte. From UTF-8:

1st Byte    2nd Byte    3rd Byte    4th Byte    Number of Free Bits    Maximum Expressible Unicode Value
0xxxxxxx                                        7                      007F hex (127)
110xxxxx    10xxxxxx                            (5+6)=11               07FF hex (2047)
1110xxxx    10xxxxxx    10xxxxxx                (4+6+6)=16             FFFF hex (65535)
11110xxx    10xxxxxx    10xxxxxx    10xxxxxx    (3+6+6+6)=21           10FFFF hex (1,114,111)
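We can watch this variable-length scheme in action by encoding individual characters and counting the bytes (a quick sketch; the three-byte sequences here match the file dump shown later):

print('A'.encode('utf-8'))    # b'A'                  1 byte  (ASCII)
print('é'.encode('utf-8'))    # b'\xc3\xa9'           2 bytes
print('€'.encode('utf-8'))    # b'\xe2\x82\xac'       3 bytes
print('😀'.encode('utf-8'))   # b'\xf0\x9f\x98\x80'   4 bytes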

Encodings come into play when converting between raw 8-bit bytes and Unicode characters. For example, the default character encoding for text files on a US computer is UTF-8. On a Japanese machine, the encoding might be euc-jp, which is optimized for the Japanese character set.

Bottom line: if you are reading text from a file, you must know the encoding. If you receive a file from Japan, you should not expect it to have the same encoding as a file created locally on your US machine, even if the text content is identical. This becomes even more relevant when computers communicate over the network: strings must be encoded for efficient transmission.

As we will see when discussing the HTTP web protocol, servers can send back headers, which are essentially properties. One of the properties that browsers look for is the encoding of the data coming back from the server. Our computer science Web server, for example, responds to page fetches with this header (among other things):

content-type: text/html; charset=UTF-8
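From Python, one way to peek at the charset a server reports is to read it off a response (a sketch assuming network access; python.org is just an arbitrary example URL):

from urllib.request import urlopen

with urlopen('https://www.python.org') as response:
    # get_content_charset pulls the charset out of the content-type header
    print(response.headers.get_content_charset())   # e.g., utf-8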

Saving text

Character strings in Python use Unicode character mappings/values, but when we write them to disk, we have a choice of formats (encodings). In general you should stick with ASCII or UTF-8 (which is an agreed-upon file format that encodes the Unicode code points as bytes).

Ok, now let's write out some text using different encodings. First, let's write out a simple string of ASCII characters from a regular Python string:


In [3]:
# Write an ASCII-encoded text file
with open("/tmp/ascii.txt", "w") as f:
    f.write("ID 345\n")

In [6]:
! od -c -t dC /tmp/ascii.txt


0000000    I   D       3   4   5  \n                                    
           73  68  32  51  52  53  10                                    
0000007

In [63]:
! od -c -t xC /tmp/ascii.txt


0000000    I   D       3   4   5  \n                                    
           49  44  20  33  34  35  0a                                    
0000007

That demonstrates that each byte is associated with a single character. You can look up the od command, but -c tells it to print the bytes as characters, -t dC tells it to print the decimal values of those bytes, and -t xC tells it to print those values in hexadecimal.

Please note that 345 here is a sequence of three characters, not the binary value 345. This is how you will see numbers stored in a CSV file.
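To make that concrete, compare the text representation with an actual binary encoding of the integer (a small sketch; to_bytes is Python's built-in integer-to-bytes conversion):

print(len('345'))                 # 3 bytes as text: characters '3', '4', '5'
print((345).to_bytes(2, 'big'))   # b'\x01Y': the binary value 345 fits in 2 bytes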

Writing out a string known to contain Unicode characters should be done with an encoder and UTF-8 is the most commonly used encoder:


In [7]:
# Write a UTF-8-encoded text file
with open('/tmp/utf8.txt', encoding='utf-8', mode='w') as f:
    f.write('Pencil: \N{PENCIL}, Euro: \u20ac\n')
    # or use actual character: f.write('Pencil: ✏, Euro: €\n')

(But you can also write ASCII strings using this encoding because UTF-8 degenerates to ASCII for character codes < 128.)
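In fact, an ASCII-only string produces identical bytes under either encoding, which we can check directly (a quick sketch):

print('ID 345\n'.encode('utf-8') == 'ID 345\n'.encode('ascii'))   # True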

If we look at the file, we see that it's no longer one byte per character. Many characters are represented by one byte, but clearly others require more than one byte; od marks the continuation bytes of multi-byte characters with **:


In [8]:
! od -c -t xC /tmp/utf8.txt


0000000    P   e   n   c   i   l   :       ✏  **  **   ,       E   u   r
           50  65  6e  63  69  6c  3a  20  e2  9c  8f  2c  20  45  75  72
0000020    o   :       €  **  **  \n                                    
           6f  3a  20  e2  82  ac  0a                                    
0000027

Incidentally, some characters, such as emoji, have code points that don't even fit in 16 bits:

In [14]:
f"{ord('😀'):x}"


Out[14]:
'1f600'

We have to read it back using the decoder matching the encoding of the file:


In [9]:
with open('/tmp/utf8.txt', encoding='utf-8', mode='r') as f:
    s = f.read()
print(s)


Pencil: ✏, Euro: €

If you use the wrong encoding, you get the wrong string. Decoding our UTF-8 bytes as Latin-1, for example, turns each multi-byte sequence into garbage characters, some of which are invisible control characters:


In [15]:
with open('/tmp/utf8.txt', encoding='latin-1', mode='r') as f:
    s = f.read()
print(s)


Pencil: â, Euro: â¬

If you try to read the file as ASCII, you will get a decoding error, because bytes with values >= 128 are not legal ASCII.
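For example, attempting the read with the ascii codec fails on the first multi-byte character (a sketch; the exact message comes from Python's codec machinery):

with open('/tmp/utf8.txt', encoding='ascii', mode='r') as f:
    s = f.read()   # raises UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 ...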

Exercise

Test out those two simple Python programs to make sure you can write and read Unicode characters, but change the string so your code saves two characters: VICTORY HAND followed by HEAVY CHECK MARK. Use the od command to dump the characters in the file.

Language within a text file

If I tell you that a file is a text file, it tells you only that, with a proper decoder, the file represents a string of characters from some language's alphabet. Character-based (text) files are an incredibly common way to store information. All of the following types of files are text-based:

  • comma-separated values (CSV)
  • XML
  • HTML
  • Natural language text, such as an email message or tweet
  • Python, JavaScript, Java, C++, any programming language
  • JSON

Examples of non-text-based (binary) formats: mp3, png, jpg, mpg, ...

As we learn to process data files, you will see that they are all text-based, but the text inside follows the grammar of a specific format: CSV, XML, etc.

For your first project, you will be working with stock history obtained from Quandl finance in CSV format. Part of your project will be to convert it to HTML, JSON, and XML. The file sizes for the various formats are as follows.

$ ls -l
total 9728
-rw-r--r--@ 1 parrt  wheel   583817 Aug 22 12:06 AAPL.csv
-rw-r--r--  1 parrt  wheel  1177603 Aug 22 12:06 AAPL.html
-rw-r--r--  1 parrt  wheel  1438395 Aug 22 12:06 AAPL.json
-rw-r--r--  1 parrt  wheel  1771234 Aug 22 12:06 AAPL.xml

You can see that the same information takes a lot more storage, depending on the format. Compression tells us something about how much information is actually in a file. I discovered that, when compressed, the file sizes are very similar, indicating that all of the extra fluff in XML is a waste of space.

To compress everything with 7z, we can use a simple for loop from the bash shell:

for f in *; do 7z a $f.7z $f; done

Then, we can look at the compressed file sizes:

$ ls -l *.7z
-rw-r--r--  1 parrt  wheel  146388 Aug 22 12:18 AAPL.csv.7z
-rw-r--r--  1 parrt  wheel  159252 Aug 22 12:18 AAPL.html.7z
-rw-r--r--  1 parrt  wheel  182134 Aug 22 12:18 AAPL.json.7z
-rw-r--r--  1 parrt  wheel  187013 Aug 22 12:18 AAPL.xml.7z

The ratio of original to compressed size for CSV is 4, whereas the ratio for JSON is 7.9 and for XML 9.5. Apparently these other formats waste a hideous amount of space. The venerable CSV is actually a pretty efficient way to store data as text. Of course, that doesn't mean we can't still compress it 4 to 1.
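Those ratios come straight from the listings above; here is a tiny sketch reproducing the arithmetic:

# (original size, compressed size) taken from the ls listings above
sizes = {'csv':  (583817,  146388),
         'html': (1177603, 159252),
         'json': (1438395, 182134),
         'xml':  (1771234, 187013)}
for fmt, (orig, compressed) in sizes.items():
    print(f"{fmt}: {orig/compressed:.1f}x")   # csv: 4.0x, html: 7.4x, json: 7.9x, xml: 9.5x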