Week 2: Dealing with data

Welcome back everyone! Week 2 is all about data: how computers physically store and represent data, some common data abstractions that are used in data science, and using a package called NumPy to work with arrays of numerical data.

The agenda:

  • Review homework assignments, answer questions
  • How do computers store data?
  • Common data storage abstractions you should know
  • A lightning intro to a very important library: NumPy

How is data stored?

Actually, there are lots of different ways to answer this question...

Moving from abstract to concrete:

  • A brief intro to computer architecture
  • Data storage paradigms
  • Different types of files, the advantages and disadvantages of each

The motivation:

  • Important to be informed about the trade-offs between storage paradigms
  • Writing performant code requires familiarity with why some code is faster than other code

Stepping wayyyy back - what is a computer?

During WWII, the mathematical foundations of modern computing were worked out of necessity. In 1945, John von Neumann drafted a design for a stored-program computer:

  • A processing device
    • Control unit for fetching and sequencing instructions
    • Logic unit for executing those instructions
  • Memory for storing data
  • External mass storage
  • Input/output lines for communicating with the world

The advance over previous designs: instead of "hard-wiring" (literally) a program, design a device that takes in a program just like data. The logic unit can execute only a limited number of operations, but by composing those operations together, we can prove mathematically that such a machine can solve any problem that is solvable at all (ignoring resource usage...). This is still fundamentally the way computers work today.

Fast-forward 70 years, how do modern computers store and process data?

What this means for us:

  • For large organizations, real tape backup is a thing! https://aws.amazon.com/glacier/
  • Most of us store data long-term on hard drives
  • When actively working on a project, our data lives in RAM
  • When in the middle of a computation, data shuttles between RAM and the CPU cache
  • The CPU actually does work on bits that are in its registers

The typical computing workflow:

  • Want to process some data stored on a hard drive (either physically connected to our local machine or accessible over a network connection)
  • Provide the address of that data and some information for how to load it into memory
  • Provide a set of instructions for what to do with the data that's in memory
  • Write intermediate results to memory
  • Store final results on a hard drive

Note - modern computers handle the RAM <-> Cache <-> CPU pipeline for us. But understanding how it works allows us to write faster code (we'll return to this later today).

Storing data (on disk)

There are lots of ways that data can be stored on disk, and different formats can have drastic performance differences! Fundamentally, the data contained in a file is represented in a computer as a sequence of bits (0s and 1s), which are grouped into chunks of 8 called bytes.

Numbers

Because data on computers can only be represented as sequences of 0s and 1s, we need a way of representing numbers (both integers and decimals) in this system.

Integers

How do we represent integers on a computer?

Imagine a super simple world, where the only integers we ever wanted to talk about were 0 and 1. We would only need 1 bit:

We just need to remember that a blank bit corresponds to "0" and a filled bit corresponds to "1".

What if, instead, we lived in a world where the only integers we ever wanted to talk about were 0, 1, 2, and 3? In this case, we would have to use 2 bits:

And if we lived in a world where we only cared about integers 0-7, we could use 3 bits:

In other words, we can always use $N$ bits to represent $2^N$ unique integers. Conversely, representing $M$ distinct integers requires $\lceil \log_2 M \rceil$ bits.

In low-level languages like C and FORTRAN, integer types have fixed-width representations. For example, a C short contains 2 bytes, so it can represent $2^{16} = 65536$ unique integers. Depending on whether the type is unsigned or signed, the bit patterns correspond to either 0:65,535 or -32,768:32,767. Therefore, integers like 720,913 or -56,093 can't be represented as C short values - an integer type with more bits is required.
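
You can see these fixed-width limits from within Python using the standard library's struct module (a quick sketch):


In [ ]:
import struct
print(struct.pack('h', 32767))    # fits: the largest signed 2-byte value
struct.pack('h', 720913)          # raises struct.error - out of range for a 2-byte short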

Python, in contrast, only has a single integer data type (int, as we introduced last week). Python algorithmically determines the number of bytes required and automatically allocates the necessary memory.
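
For example, arbitrarily large integers just work:


In [ ]:
x = 2 ** 100
print(x)                  # 1267650600228229401496703205376
print(x.bit_length())     # 101 - more bits than any fixed-width C integer type provides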

Your turn

Write a function that takes a list of 0s and 1s and produces the corresponding integer. The equation for converting a list $L = [l_1, l_2, \ldots, l_n]$ of 0's and 1's to an integer is $\sum_{i=1}^{n} l_i \cdot 2^{n-i}$ (the last element is the least-significant bit). What is the integer representation of [1, 0, 0, 0, 1, 1, 0, 1]?


In [ ]:
def to_integer(x):
    the_sum = 0
    for index, val in enumerate(x[::-1]):    # reverse so the last element gets index 0
        the_sum += val * 2 ** index          # each bit contributes val * 2^position
    return the_sum

In [ ]:
# sanity check: to_integer([1, 1]) should be 3; the call below should give 141
to_integer([1, 0, 0, 0, 1, 1, 0, 1])

Decimals

Representing integers in binary is relatively straightforward because both integers and bytes are discrete units - there is a fixed, countable number of values representable with a bit sequence of a given length. Decimal numbers are trickier - in principle, a decimal can have arbitrarily many digits (even infinitely many, like $\pi$). How can you possibly represent an infinite set of decimals with a finite sequence of bits?

Unfortunately, the answer is that we can't represent decimals with arbitrary precision. Much as the number of bits in the integer representations above determined how many integers we could represent, the number of bits in the representation of a decimal number determines how precisely we can represent it.

It's not hard to find artifacts of the floating point representation. If you're not careful, it can get you in trouble:


In [ ]:
.1 + .1 + .1 == .3

What??


In [ ]:
.1 + .1 + .1

This behavior is a result of the finite precision with which Python is representing decimals.

Most software (including Python) uses the concept of floating point numbers (i.e. floats) to represent decimals. Essentially, a float is really two separate numbers: the mantissa and the exponent. This is the same idea as scientific notation, where, for example, the number $123456.789$ can be written as $1.23456789 \times 10^5$. In this case, $1.23456789$ is the mantissa, and $5$ is the exponent. In binary, we represent the mantissa as a string of bits and use an exponent of 2 (rather than 10). This is a much more compact way of representing numbers than, say, storing a fixed number of digits on each side of the decimal point.

A more complete discussion of floating point arithmetic is beyond the scope of this course, but the important thing to remember is that a floating point number is really represented with 2 numbers under the hood.
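
If you're curious, the standard library's math.frexp will show you the two pieces (a quick sketch; the mantissa comes back as a fraction in [0.5, 1)):


In [ ]:
import math
mantissa, exponent = math.frexp(123456.789)    # 123456.789 == mantissa * 2 ** exponent
print(mantissa, exponent)
print(mantissa * 2 ** exponent)                # reconstructs the original number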

Text vs binary data on disk

Broadly speaking, there are two categories of files - binary and text.

Binary files:

  • Can be any sequence of bytes (though they should adhere to some pattern)
  • Designed to be machine readable only (i.e. don't make sense to a human eye)
  • Examples: images (.png, .jpg), audio and video (.wav, .mp4), documents (.doc, .pdf), archives (.zip, .tar), executables (.exe, .dll)

Text files:

  • The sequence of bytes corresponds to an encoding that can be rendered as text
  • Examining in a text editor, the files are human-readable
  • Examples: documents (.txt, .md), web data (.html, .json), source code (.py, .java), tabular data (.csv)

In other words, text-formatted files have a particular structure to their bytes that can be rendered into characters displayed on a screen. Binary files don't adhere to the notion that sequences of bytes should correspond to characters, so they are free to implement other protocols.

Let's use Python to open and read some files:


In [ ]:
with open('data/hi.txt', 'r') as file:    # open the file data/hi.txt in read mode, refer to it as `file`
    text_data = file.read()               # read the contents of `file` into a variable called `text_data`

What's in text_data?


In [ ]:
type(text_data)

In [ ]:
len(text_data)

In [ ]:
text_data

So text_data is a string with two characters containing the text 'hi'.
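
For contrast, here's what the same file looks like if we skip the decoding step and open it in binary mode (a quick sketch):


In [ ]:
with open('data/hi.txt', 'rb') as file:    # 'rb' = read binary: raw bytes, no decoding
    raw_data = file.read()
print(type(raw_data))    # bytes, not str
print(raw_data)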

Question: What is the physical size of hi.txt on disk?


In [ ]:
import os
os.path.getsize('data/hi.txt')

In [ ]:
# `bytes` converts a Python character to a representation of its bytes
# `ord()` converts a Python character into an integer representation
for char in text_data:
    print(bytes(char, 'utf-8'), ord(char))

In [ ]:
len(bytes(char, 'utf-8'))    # `char` still holds the last character from the loop above

Python 3 encodes strings with the Unicode standard (UTF-8, specifically) by default. As a sanity check, we can look up the values of 104 and 105 in a Unicode table to double check that they correspond to the characters 'h' and 'i': http://www.ssec.wisc.edu/~tomw/java/unicode.html

In the days before Unicode became the de facto standard of internet communication, it was common to use ASCII to encode characters. In ASCII, each character corresponded to 1 byte in the computer, so there were at most 2^8 = 256 possible characters (standard ASCII actually defines only 128; extended variants used the full byte). To the early computer pioneers of the 60s and 70s, the majority of whom lived in English-speaking countries, 256 characters were plenty - there were 26 upper-case characters, 26 lower-case characters, 10 digits, some special symbols like "(" and "&", and a few accented characters.
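
You can poke at this range yourself with ord() and chr(), which convert between characters and their integer code points:


In [ ]:
print(ord('A'), ord('a'), ord('0'))    # 65 97 48 - all comfortably below 256
print(chr(65), chr(97), chr(48))       # and back again: A a 0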

The rise of the internet, however, meant that many non-English speakers wanted to communicate digitally. But with ASCII, there was no way for people to write in Cyrillic or Mandarin characters. This led to a proliferation of character encodings that were eventually unified into UTF-8.

Let's read a different file with Python, this time with some characters outside of the standard 26-character English alphabet:


In [ ]:
my_str = 'hi猫😺'
with open('data/hi2.txt', 'w', encoding='UTF-8') as f:
    f.write(my_str)

In [ ]:
with open('data/hi2.txt', 'r') as file:
    text_data = file.read()

In [ ]:
type(text_data)

In [ ]:
len(text_data)

In [ ]:
text_data

We can see that text_data is a string with 4 characters this time - the same 2 English characters "h" and "i", as well as a Chinese character and a cat emoji.

Question: What is the size of hi2.txt on disk?


In [ ]:
os.path.getsize('data/hi2.txt')

In [ ]:
for char in text_data:
    print(bytes(char, 'utf-8'), ord(char))

This gives us a better sense of where the file size comes from. The integer values of "h" and "i" are small enough that they can each be represented by a single byte, but several bytes are necessary to represent each of the other 2 characters. Printing the byte representation of the characters tells us that "猫" requires 3 bytes to store on disk, and "😺" requires 4 bytes, so there are a combined 9 bytes in hi2.txt. In other words, Unicode characters correspond to a variable number of bytes, as opposed to ASCII, where each character always corresponds to a single byte.

Now, let's read some bigger files into memory. We'll use %%timeit, an IPython "magic" for profiling blocks of code:


In [ ]:
%%timeit
with open('data/sherlock_holmes.txt', 'r') as file:    # open the file in read mode
    file.read()

An aside: Python provides a way of serializing data into a binary format for storing on disk called pickling. Many different types of Python objects can be pickled, so it's a useful step for checkpointing your work on long-running calculations or freezing the state of your code for later use.

For example, we can dump the text data to a pickle...


In [ ]:
import pickle
with open('data/sherlock_holmes.txt', 'r') as file:
    sherlock_text = file.read()
with open('data/sherlock_holmes.pickle', 'wb') as file:    # open a file in binary write mode
    pickle.dump(sherlock_text, file)

... and when we go to read the file, it loads a bit faster (even though it contains the same data).


In [ ]:
%%timeit
with open('data/sherlock_holmes.pickle', 'rb') as file:  # note the 'rb' - for "read binary"
    pickle.load(file)

In [ ]:
with open('data/sherlock_holmes.pickle', 'rb') as file:
    sherlock_pickle = pickle.load(file)

In [ ]:
sherlock_text == sherlock_pickle

What happened here? Recall - there are no primitive data types in Python; everything is an object! So to read data from disk, Python must create an object to store the data in.

When reading a file from disk into memory, Python:

  • Pulls raw bytes from disk into memory
  • Decodes the raw bytes into their character representations
  • Builds the Python objects that store those bytes

That second step, the decoding, can actually be fairly slow. If you're dealing with large text files, decoding them once and then pickling (or using another serialization method) can make reading them much more efficient in the future. We'll use pickling later in the course to serialize some intermediate results.

IMPORTANT: Pickling is NOT SAFE. Anyone can pickle arbitrary code objects. In other words, it is possible to use pickles to distribute malicious code. Never un-pickle data from someone you don't trust. Pickling really should only be used as a convenience for yourself, not as a way of distributing code.

Your turn

  • Read data/alice_in_wonderland.txt into memory. How many characters does it contain? How does this compare to its size on disk?
  • Print out the unique non-ASCII characters in Alice in Wonderland (hint: non-ASCII means that the number of bytes used is greater than 1).
  • Write the first 10,000 characters of Alice in Wonderland as text and as a pickle. What are the sizes of each file on disk?

In [ ]:
with open('data/alice_in_wonderland.txt', 'r', encoding='utf-8') as file:
    alice = file.read()

In [ ]:
len(alice)

In [ ]:
os.path.getsize('data/alice_in_wonderland.txt')

In [ ]:
char_list = []
for char in alice:
    if len(bytes(char, 'utf-8')) > 1:
        char_list.append(char)

In [ ]:
set(char_list)

In [ ]:
with open('data/alice_partial.txt', 'w', encoding='utf-8') as file:
    file.write(alice[:10000])               # first 10,000 characters as plain text
with open('data/alice_partial.pickle', 'wb') as file:
    pickle.dump(alice[:10000], file)        # the same characters as a pickle

In [ ]:
os.path.getsize('data/alice_partial.txt'), os.path.getsize('data/alice_partial.pickle')

Some common file types

We've already seen text and binary formats, but let's take a look at a couple others.

JSON

JavaScript Object Notation - a nested sequence of lists and dictionaries (or, in JavaScript terms, "arrays" and "objects"). A very common way of transmitting data on the web, because it's simple for both humans and computers to parse.


In [ ]:
import json
with open('data/good_movies.json', 'r') as file:
    good_movies = json.loads(file.read())

In [ ]:
from pprint import pprint    # pprint for pretty-printing nested objects
pprint(good_movies)

In [ ]:
good_movies[0]['stars']

Your turn

  • Iterating over good_movies, print the name of the movies that Ben Affleck stars in.
  • Find the total number of Oscar nominations for 2016 movies in the dataset.
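
One possible approach (a sketch - field names like 'year' and 'oscar_nominations' are guesses about this dataset; check the pprint output above for the real keys):


In [ ]:
# 'stars' and 'title' appear in the cells above; 'year' and 'oscar_nominations'
# are assumed names - adjust them to match the actual keys in good_movies
for movie in good_movies:
    if 'Ben Affleck' in movie['stars']:
        print(movie['title'])

total_nominations = 0
for movie in good_movies:
    if movie['year'] == 2016:
        total_nominations += movie['oscar_nominations']
print(total_nominations)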

CSV

Comma-separated value data is another very common, easy-to-use way of storing data. In particular, CSVs are used when you have tabular data - think, data that fits in a spreadsheet. The most common format is for columns to correspond to categories and rows to correspond to examples.


In [ ]:
import csv
good_movies = []
with open('data/good_movies.csv', 'r') as file:
    reader = csv.DictReader(file)
    for row in reader:
        good_movies.append(row)

In [ ]:
pprint(good_movies)

Look familiar? csv.DictReader is actually parsing the CSV row-by-row into a JSON-like structure!


In [ ]:
good_movies[0]['title']  # value of cell in first row, column called "title"

For doing simple things like iterating over data structures, these built-in methods and objects are sufficient. But more complicated tasks will require better tooling. We'll see a lot more CSV data in the next couple of weeks.

NumPy: an intro to making data analysis fast

Now that we have an introduction to using data in Python, let's introduce some more ways of manipulating that data.

NumPy is a fundamental library in the Python ecosystem for handling (and doing math on) array-like data. Most importantly, all of the "heavy lifting" is done by algorithms written in "fast" languages like C and FORTRAN. We'll see below the difference between the fast algorithms implemented in NumPy and the same algorithms written in pure Python.

The NumPy array

The fundamental object that NumPy provides is an array:


In [ ]:
import numpy as np

list_of_numbers = [1, 2, 3, 4, 5]
array_1d = np.array(list_of_numbers)
array_1d.shape

In [ ]:
type(array_1d)

In [ ]:
print(array_1d)

In [ ]:
another_list_of_numbers = [6, 7, 8, 9, 10]
array_2d = np.array([list_of_numbers, another_list_of_numbers])
array_2d.shape

In [ ]:
print(array_2d)

In addition to defining arrays by hand, we can produce them programmatically:


In [ ]:
# create a 1D array with numbers 0-9
x = np.arange(10)
print(x)

In [ ]:
# create an array with 3 evenly-spaced numbers starting at 1 and ending at 13
y = np.linspace(1, 13, 3)
print(y)

In [ ]:
# create some other common types of arrays
x = np.zeros((3, 5))
print(x)

In [ ]:
x = np.ones((3, 5))
print(x)

In [ ]:
x = np.eye(5)    # why this name?
print(x)

In [ ]:
x = np.random.rand(4)
print(x)

Indexing and slicing arrays

To access the elements of NumPy arrays, we use a notation that's very similar to the one we used for accessing elements of Python lists. Remember - in Python, indexing always starts at 0!


In [ ]:
x

In [ ]:
x[0]

In [ ]:
x[-2]

In [ ]:
array_2d

In [ ]:
array_2d[0][0]

In [ ]:
array_2d[1, 4]

In [ ]:
array_2d[1, 4] = 12
array_2d[1, 4]

In [ ]:
array_2d[1, 5] = 15    # IndexError - you can't grow an array by assigning past its end

If we want to access more than one value at a time, we can take slices of NumPy arrays, too. The general format is x[start_index:stop_index:step_size].


In [ ]:
x[1:3]

In [ ]:
x[:2]

In [ ]:
x[1:]

In [ ]:
x[1:3:2]

In [ ]:
x

In [ ]:
x[::-1]

There's one very important performance-related difference between how slicing works for Python lists and for NumPy arrays. For vanilla lists, a slice returns a copy of the sliced data:


In [ ]:
my_list = [1, 2, 3, 4]

In [ ]:
my_sliced_list = my_list[0:2]
my_sliced_list[0] = 10
my_list[0] == my_sliced_list[0]

For NumPy arrays, slices are views of the original array:


In [ ]:
my_array = np.array([1, 2, 3, 4])

In [ ]:
my_sliced_array = my_array[:2]
my_sliced_array[0] = 10
my_array[0] == my_sliced_array[0]

In [ ]:
my_array

In [ ]:
my_sliced_array

This memory efficiency is extremely useful when dealing with large datasets. But be careful!
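
If you actually want an independent copy of a slice, NumPy arrays provide a .copy() method:


In [ ]:
my_array = np.array([1, 2, 3, 4])
my_copied_slice = my_array[:2].copy()    # .copy() allocates a new array instead of a view
my_copied_slice[0] = 10
print(my_array)          # unchanged this time
print(my_copied_slice)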

Your turn

Create a NumPy array with 100,000 random integers between 0 and 100. Then, write the following functions (in pure Python, not using built-in NumPy functions):

  • Compute the average
  • Compute the standard deviation
  • Create a weight vector of 100,000 elements (the sum of the elements is 1), and compute the weighted average of your first vector with these weights.

In [ ]:
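
One possible pure-Python solution (a sketch - try it yourself before peeking):


In [ ]:
data = np.random.randint(0, 101, size=100_000)    # high end is exclusive, so 101 includes 100

def average(x):
    return sum(x) / len(x)

def std_dev(x):
    mean = average(x)
    return (sum((val - mean) ** 2 for val in x) / len(x)) ** 0.5

def weighted_average(x, weights):
    return sum(val * w for val, w in zip(x, weights))    # assumes the weights sum to 1

weights = [1 / len(data)] * len(data)    # the simplest valid weight vector: uniform
average(data), std_dev(data), weighted_average(data, weights)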

We'll return to these functions a little later.

Mutability and data types in NumPy

Python has a dynamic typing system. For example:


In [ ]:
x = 1

In [ ]:
type(x)

In [ ]:
x = 'hello'

In [ ]:
type(x)

In a statically typed language like C, you can't do this. The following code will fail when you try to compile it:

int x = 1;
x = "hello";

In other words, Python determines the best type to store your data at run time, whereas C requires you to explicitly specify the type of data (and enforces this typing). This design choice makes Python simple to use, but comes with a performance overhead, since Python needs to compute and store extra information about your data.
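
You can see that overhead directly with sys.getsizeof (exact numbers vary by Python version and platform - this is just a sketch):


In [ ]:
import sys
print(sys.getsizeof(1))        # ~28 bytes on 64-bit CPython, vs 4 bytes for a C int
print(sys.getsizeof('hello'))  # strings carry per-object overhead, too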

In addition, Python lists are able to hold heterogeneous data:


In [ ]:
my_list = [1, 3.1415, 'hello']
my_list

NumPy's approach to data types in arrays is slightly different from that of vanilla Python lists:


In [ ]:
x = np.array([1, 2, 3, 4])

In [ ]:
x.dtype

In [ ]:
type(x[0])

In [ ]:
x[0] = 1.1    # assigning a float into an int array silently truncates it
x[0]          # 1, not 1.1!

In [ ]:
x[0] = 'hello'    # ValueError - a string can't be cast to an integer

In [ ]:
x = x.astype(float)    # create a new array with a float dtype

In [ ]:
type(x[0])

In [ ]:
x[0] = 1.1
x[0]

In [ ]:
x = np.array([1, 2, 3, 4], dtype='float64')    # specify the dtype up front

In [ ]:
x.dtype

In [ ]:
x[0] = 1.1
x[0]

So - NumPy gives you the flexibility to use a wide variety of data types in arrays. However, NumPy arrays must be homogeneous in data type.
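
In fact, if you hand np.array heterogeneous data, it silently coerces everything to a single common type:


In [ ]:
print(np.array([1, 3.1415]).dtype)     # int + float -> float64
print(np.array([1, 'hello']).dtype)    # int + str -> a fixed-width string dtype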

What about the mutability of NumPy arrays? Recall that vanilla Python lists are totally mutable:


In [ ]:
my_list = [1, 2, 3, 4]

In [ ]:
my_list[0] = 10
my_list

In [ ]:
my_list.append(20)
my_list

In [ ]:
my_list.remove(2)
print(my_list)

We can always change the value of any element in a list, as well as add and delete elements as we wish. But whatever we do to the elements of the list, the same list object is always there. With NumPy arrays, the story is a bit different:


In [ ]:
my_array = np.array([1, 2, 3, 4])
print(my_array)

In [ ]:
my_array[0] = 10
print(my_array)

In [ ]:
my_new_array = np.append(my_array, 20)
my_new_array[0] = 20
print(my_array)
print(my_new_array)

The shape of a NumPy array is fixed. If you want to add data to a NumPy array, an entirely new array is created.

Your turn

Build a NumPy array of integers 0-999,999 three separate ways:

  • initialize an empty NumPy array, then iteratively append each element to it
  • initialize an array of 1,000,000 zeros, then iteratively change the value of each element
  • by using np.arange.

Use the %%timeit magic command to profile each approach. Which is fastest? Why?
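
A sketch of how you might set these up (one cell per approach; note that the append version scales quadratically, so consider profiling it at a smaller size first):


In [ ]:
%%timeit -n 1 -r 1
x = np.array([], dtype=int)
for i in range(100_000):    # quadratic - bump this to 1,000,000 at your own risk
    x = np.append(x, i)     # allocates a brand-new array every iteration

In [ ]:
%%timeit
x = np.zeros(1_000_000, dtype=int)
for i in range(1_000_000):
    x[i] = i                # mutates in place - no reallocation

In [ ]:
%%timeit
x = np.arange(1_000_000)    # the loop happens entirely in compiled code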

Vectorized operations with NumPy

Python is an interpreted language: each line in a script is fed to the Python interpreter one at a time, so Python has no sense of what is coming later in the same script. This makes Python especially slow with loops - the interpreter has no sense that a loop is doing essentially the same operation over and over, so it can't take advantage of the CPU architecture to speed those loops up.

Most of the functions in NumPy call lower-level functions written in C or FORTRAN. Both C and FORTRAN are statically compiled languages, meaning that a compiler reads an entire function at once, then turns the source code into the machine code that actually gets executed. Because of that intermediate step, compilers of statically-compiled languages can exploit the CPU architecture - in particular the CPU cache - to ensure that the data necessary for repeated computations (like those in a loop) is always close at hand. Exactly how this works is beyond the scope of this class, but suffice it to say: keep in mind that Python loops are slow.

Your turn

Using the %%timeit magic function, profile the three pure Python functions you built earlier (average, standard deviation, and weighted average). Then, profile the same operations using the built-in NumPy functions (you shouldn't have to explicitly write any for loops!).


In [ ]:

When dealing with array-like data in NumPy, you should always try to think of operations at the array level, rather than the element level. This type of thinking is called "vectorizing" your code, and it allows you to push a lot of your for loops (that are slow in Python) down into the underlying compiled code (where it runs much faster).

The bottom line is that NumPy functions that depend on looping tend to be anywhere between 10 and 100 times faster (independent of input size) than the exact same algorithm implemented in pure Python.
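
To make that concrete, here's a quick comparison you can run yourself:


In [ ]:
big = np.arange(1_000_000)

In [ ]:
%%timeit
total = 0
for val in big:    # pure-Python loop, one element at a time
    total += val

In [ ]:
%%timeit
big.sum()          # the same reduction, pushed down into compiled code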

Take-home exercises

  • Write a function that swaps any two rows of a NumPy array.

In [ ]:

  • Generate a 12 $\times$ 12 multiplication table (i.e. where element (i, j) = i*j). Do this two ways:

    • Using nested for loops
    • Using NumPy's fromfunction

      Which method is faster? Why?


In [ ]:

Write a function that computes moving averages for an array and window size. For example,

>>> a = [0, 3, 3, 3, 9, 6, 9, 9, 12]
>>> size = 3
>>> moving_average(a, size)
[  2.   3.   5.   6.   8.   8.  10.]

In [ ]: