Welcome back everyone! Week 2 is all about data: how computers physically store and represent data, some common data abstractions that are used in data science, and using a package called NumPy to work with arrays of numerical data.
The agenda:
Actually, there are lots of different ways to answer this question...
Moving from abstract to concrete:
The motivation:
During WWII, mathematical foundations of modern computing were invented out of necessity. In 1945, John von Neumann drafted a design for a stored program computer:
The advance over previous designs: instead of "hard-wiring" (literally) a program, design a device that takes in a program just like data. The logic unit can execute a limited set of operations, and by composing those operations it can be proven mathematically that such a machine can solve any problem that is solvable at all (ignoring resource usage...). This is still fundamentally the way computers work today.
Fast-forward 70 years, how do modern computers store and process data?
What this means for us:
The typical computing workflow:
Note - modern computers handle the RAM <-> Cache <-> CPU pipeline for us. But understanding how it works allows us to write faster code (we'll return to this later today).
There are lots of ways that data can be stored on disk, and different formats can have drastic performance differences! Fundamentally, the data contained in a file is represented in a computer as a sequence of bits (0s and 1s), which are grouped into chunks of 8 called bytes.
Because data on computers can only be represented as sequences of 0s and 1s, we need a way of representing numbers (both integers and decimals) in this system.
How do we represent integers on a computer?
Imagine a super simple world, where the only integers we ever wanted to talk about were 0 and 1. We would only need 1 bit:
We just need to remember that a blank bit corresponds to "0" and a filled bit corresponds to "1".
What if, instead, we lived in a world where the only integers we ever wanted to talk about were 0, 1, 2, and 3? In this case, we would have to use 2 bits:
And if we lived in a world where we only cared about integers 0-7, we could use 3 bits:
In other words, we can always use $N$ bits to represent $2^N$ unique integers. Conversely, representing $M$ distinct integers requires $\lceil \log_2 M \rceil$ bits.
In low-level languages like C and FORTRAN, integer types are given fixed-width representations. For example, a C short contains 2 bytes, so it can represent $2^{16} = 65536$ unique integers. Depending on whether the integer is signed or unsigned, the bit patterns correspond to either 0 to 65,535 or -32,768 to 32,767. Therefore, integers like 720,913 or -56,093 can't be represented as C short values - an integer type with more bits is required.
Python, in contrast, has only a single integer data type (int, as we introduced last week). Python algorithmically determines the number of bytes required and automatically allocates the necessary memory.
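For instance (a quick sketch, not from the original notes), Python integers never overflow - the representation simply grows:
In [ ]:
x = 2 ** 100           # far too big for any fixed-width C integer type
print(x)
print(x.bit_length())  # Python reports how many bits the value needs: 101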
Your turn
Write a function that takes a list of 0s and 1s and produces the corresponding integer. The formula for converting a list $L = [l_1, l_2, \ldots, l_n]$ of 0s and 1s (most significant bit first) to an integer is $\sum_{i=1}^{n} l_i \cdot 2^{n-i}$. What is the integer representation of [1, 0, 0, 0, 1, 1, 0, 1]?
In [ ]:
def to_integer(x):
    the_sum = 0
    for index, val in enumerate(x[::-1]):
        the_sum += val * 2 ** index
    return the_sum
In [ ]:
# sanity check: to_integer([1, 1]) == 3
to_integer([1, 0, 0, 0, 1, 1, 0, 1])
Representing integers in binary is relatively straightforward because both integers and bit patterns are discrete - there is a fixed, countable number of values representable with a sequence of bits of a given length. Decimal numbers are trickier: in principle, a decimal expansion can have any length, even an infinite one. How can you possibly represent infinitely many decimals with a finite sequence of bits?
Unfortunately, the answer is that we can't represent decimals with arbitrary precision. Much as the number of bits in the integer representations above determined how many integers we could represent, the number of bits in the representation of a decimal number determines how precisely we can specify that number.
It's not hard to find artifacts of the floating point representation. If you're not careful, it can get you in trouble:
In [ ]:
.1 + .1 + .1 == .3
What??
In [ ]:
.1 + .1 + .1
This behavior is a result of the finite precision with which Python represents decimals.
Most software (including Python) uses the concept of floating point numbers (i.e. floats) to represent decimals. Essentially, a float is really two separate numbers: the mantissa and the exponent. This is similar to scientific notation, where, for example, the number $123456.789$ can be written as $1.23456789 \times 10^5$. Here $1.23456789$ is the mantissa and $5$ is the exponent. In binary, the mantissa is a string of bits and the exponent is a power of 2 (rather than 10). This is a much more compact way of representing numbers than, say, storing every digit explicitly.
A more complete discussion of floating point arithmetic is beyond the scope of this course, but the important thing to remember is that a floating point number is really represented with 2 numbers under the hood.
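We can peek at those two numbers directly. A minimal sketch, using the standard library's math.frexp, which decomposes a float into its mantissa and power-of-2 exponent:
In [ ]:
import math

mantissa, exponent = math.frexp(0.1)  # 0.1 == mantissa * 2 ** exponent
print(mantissa, exponent)             # 0.8 -3
print(mantissa * 2 ** exponent)       # reconstructs 0.1
print((0.1).hex())                    # the exact (finite!) bits Python actually stores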
Broadly speaking, there are two categories of files - binary and text.
Binary files:
Text files:
In other words, text-formatted files have a particular structure to their bytes that can be rendered into characters and displayed on a screen. Binary files don't adhere to the convention that sequences of bytes should correspond to characters, so they are free to implement other protocols.
Let's use Python to open and read some files:
In [ ]:
with open('data/hi.txt', 'r') as file:  # open the file data/hi.txt in read mode, refer to it as `file`
    text_data = file.read()  # read the contents of `file` into a variable called `text_data`
What's in text_data?
In [ ]:
type(text_data)
In [ ]:
len(text_data)
In [ ]:
text_data
So text_data is a string with two characters, containing the text 'hi'.
Question: What is the physical size of hi.txt on disk?
In [ ]:
import os
os.path.getsize('data/hi.txt')
In [ ]:
# `bytes(char, 'utf-8')` encodes a Python character into its underlying bytes
# `ord()` gives the integer code point of a Python character
for char in text_data:
    print(bytes(char, 'utf-8'), ord(char))
In [ ]:
len(bytes(char, 'utf-8'))
Python 3 strings are Unicode, and when Python encodes them to bytes it uses UTF-8 by default. As a sanity check, we can look up the values 104 and 105 in a Unicode table to confirm that they correspond to the characters 'h' and 'i': http://www.ssec.wisc.edu/~tomw/java/unicode.html
In the days before Unicode became the de facto standard of internet communication, it was common to use ASCII to encode characters. ASCII is a 7-bit code stored one character per byte, so there were only 2^7 = 128 characters (extended variants using the full byte allowed 2^8 = 256). To the early computer pioneers of the 60s and 70s, the majority of whom lived in English-speaking countries, that was plenty: 26 upper case characters, 26 lower case characters, 10 digits, some special symbols like "(" and "&", and (in the extended variants) a few accented characters.
The rise of the internet, however, meant that many non-English speakers wanted to communicate digitally. But with ASCII, there was no way for people to write in Cyrillic or Mandarin characters. This led to a proliferation of character encodings that were eventually unified into UTF-8.
Let's read a different file with Python, this time with some characters outside of the standard 26-character English alphabet:
In [ ]:
my_str = 'hi猫😺'
with open('data/hi2.txt', 'w', encoding='UTF-8') as f:
    f.write(my_str)
In [ ]:
with open('data/hi2.txt', 'r', encoding='utf-8') as file:
    text_data = file.read()
In [ ]:
type(text_data)
In [ ]:
len(text_data)
In [ ]:
text_data
We can see that text_data is a string with 4 characters this time - the same 2 English characters "h" and "i", plus a Chinese character and a cat emoji.
Question: What is the size of hi2.txt on disk?
In [ ]:
os.path.getsize('data/hi2.txt')
In [ ]:
for char in text_data:
    print(bytes(char, 'utf-8'), ord(char))
This gives us a better sense of where the file size comes from. The integer values of "h" and "i" are small enough that they can each be represented by a single byte, but several bytes are necessary to represent each of the other 2 characters. Printing the byte representation of the characters tells us that "猫" requires 3 bytes to store on disk, and "😺" requires 4 bytes, so there are a combined 9 bytes in hi2.txt. In other words, a Unicode character corresponds to a variable number of bytes, as opposed to ASCII, where each character always corresponds to a single byte.
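A quick way to see this for yourself (a small sketch using the same string as above): compare a string's character count to the length of its UTF-8 encoding:
In [ ]:
s = 'hi猫😺'
print(len(s))                  # 4 characters
print(len(s.encode('utf-8')))  # 9 bytes: 1 + 1 + 3 + 4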
Now, let's read some bigger files into memory:
In [ ]:
%%timeit # an IPython "magic" function for profiling blocks of code
with open('data/sherlock_holmes.txt', 'r') as file:  # open the file in read mode
    file.read()
An aside: Python provides a way of serializing data into a binary format for storing on disk called pickling. Many different types of Python objects can be pickled, so it's a useful step for checkpointing your work on long-running calculations or freezing the state of your code for later use.
For example, we can dump the text data to a pickle...
In [ ]:
import pickle
with open('data/sherlock_holmes.txt', 'r') as file:
    sherlock_text = file.read()
with open('data/sherlock_holmes.pickle', 'wb') as file:  # open a file in binary write mode
    pickle.dump(sherlock_text, file)
... and when we go to read the file, it loads a bit faster (even though it contains the same data).
In [ ]:
%%timeit
with open('data/sherlock_holmes.pickle', 'rb') as file:  # note the 'rb' - for "read binary"
    pickle.load(file)
In [ ]:
with open('data/sherlock_holmes.pickle', 'rb') as file:
    sherlock_pickle = pickle.load(file)
In [ ]:
sherlock_text == sherlock_pickle
What happened here? Recall: there are no primitive data types in Python - everything is an object! So to read data from disk, Python must create an object to store the data in.
When reading a text file from disk into memory, Python (1) reads the raw bytes from disk, and (2) decodes those bytes into a string object.
That second step, the decoding, can actually be fairly slow. If you're dealing with large text files, decoding them once and then pickling (or using another serialization method) can be a much more efficient way to read them in the future. We'll use pickling later in the course to serialize some intermediate results.
IMPORTANT: Pickling is NOT SAFE. Anyone can pickle arbitrary code objects. In other words, it is possible to use pickles to distribute malicious code. Never un-pickle data from someone you don't trust. Pickling really should only be used as a convenience for yourself, not as a way of distributing code.
Your turn
Read data/alice_in_wonderland.txt into memory. How many characters does it contain? How does this compare to its size on disk?
In [ ]:
with open('data/alice_in_wonderland.txt', 'r', encoding='utf-8') as file:
    alice = file.read()
In [ ]:
len(alice)
In [ ]:
os.path.getsize('data/alice_in_wonderland.txt')
In [ ]:
char_list = []
for char in alice:
    if len(bytes(char, 'utf-8')) > 1:
        char_list.append(char)
In [ ]:
set(char_list)
In [ ]:
with open('data/alice_partial.pickle', 'wb') as file:
    pickle.dump(alice[:10000], file)
In [ ]:
os.path.getsize('data/alice_partial.pickle')
We've already seen text and binary formats, but let's take a look at a couple others.
JavaScript Object Notation - a nested structure of lists and dictionaries (or "arrays" and "objects"). A very common way of transmitting data on the web because it's simple for both humans and computers to parse.
In [ ]:
import json
with open('data/good_movies.json', 'r') as file:
    good_movies = json.loads(file.read())
In [ ]:
from pprint import pprint # pprint for pretty-printing nested objects
pprint(good_movies)
In [ ]:
good_movies[0]['stars']
Your turn
Using good_movies, print the names of the movies that Ben Affleck stars in.
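One possible solution (a sketch - it assumes, as the cell above suggests, that each movie's 'stars' value is a list of actor names and that each movie also has a 'title' key):
In [ ]:
for movie in good_movies:
    if 'Ben Affleck' in movie['stars']:
        print(movie['title'])
CSV (comma-separated values) is another very common text format for tabular data. Let's read the same movie data from a CSV file, using Python's built-in csv module: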
In [ ]:
import csv
good_movies = []
with open('data/good_movies.csv', 'r') as file:
    reader = csv.DictReader(file)
    for row in reader:
        good_movies.append(row)
In [ ]:
pprint(good_movies)
Look familiar? csv.DictReader is actually parsing the CSV row-by-row into a JSON-like structure!
In [ ]:
good_movies[0]['title'] # value of cell in first row, column called "title"
For doing simple things like iterating over data structures, these built-in methods and objects are sufficient. But more complicated tasks will require better tooling. We'll see a lot more CSV data in the next couple of weeks.
Now that we have an introduction to using data in Python, let's introduce some more ways of manipulating that data.
NumPy is a fundamental library in the Python ecosystem for handling (and doing math on) array-like data. Most importantly, all of the "heavy lifting" is done by algorithms written in "fast" languages like C and FORTRAN. We'll see below the difference between the fast algorithms implemented in NumPy and the same algorithms written in pure Python.
The fundamental object that NumPy provides is an array:
In [ ]:
import numpy as np
list_of_numbers = [1, 2, 3, 4, 5]
array_1d = np.array(list_of_numbers)
array_1d.shape
In [ ]:
type(array_1d)
In [ ]:
print(array_1d)
In [ ]:
another_list_of_numbers = [6, 7, 8, 9, 10]
array_2d = np.array([list_of_numbers, another_list_of_numbers])
array_2d.shape
In [ ]:
print(array_2d)
In addition to defining arrays by hand, we can produce them programmatically:
In [ ]:
# create a 1D array with numbers 0-9
x = np.arange(10)
print(x)
In [ ]:
# create an array with 3 evenly-spaced numbers starting at 1 and ending at 13
y = np.linspace(1, 13, 3)
print(y)
In [ ]:
# create some other common types of arrays
x = np.zeros((3, 5))
print(x)
In [ ]:
x = np.ones((3, 5))
print(x)
In [ ]:
x = np.eye(5) # why this name?
print(x)
In [ ]:
x = np.random.rand(4)
print(x)
In [ ]:
x
In [ ]:
x[0]
In [ ]:
x[-2]
In [ ]:
array_2d
In [ ]:
array_2d[0][0]
In [ ]:
array_2d[1, 4]
In [ ]:
array_2d[1, 4] = 12
array_2d[1, 4]
In [ ]:
array_2d[1, 5] = 15  # IndexError - index 5 is out of bounds for an array with 5 columns
If we want to access more than one value at a time, we can take slices of NumPy arrays, too. The general format is x[start_index:stop_index:step_size].
In [ ]:
x[1:3]
In [ ]:
x[:2]
In [ ]:
x[1:]
In [ ]:
x[1:3:2]
In [ ]:
x
In [ ]:
x[::-1]
There's one very important performance-related difference in the way slicing works for Python lists versus NumPy arrays. For vanilla lists, a slice returns a copy of the sliced data:
In [ ]:
my_list = [1, 2, 3, 4]
In [ ]:
my_sliced_list = my_list[0:2]
my_sliced_list[0] = 10
my_list[0] == my_sliced_list[0]
For NumPy arrays, slices are views of the original array:
In [ ]:
my_array = np.array([1, 2, 3, 4])
In [ ]:
my_sliced_array = my_array[:2]
my_sliced_array[0] = 10
my_array[0] == my_sliced_array[0]
In [ ]:
my_array
In [ ]:
my_sliced_array
This memory efficiency is extremely useful when dealing with large datasets. But be careful!
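If you want a slice you can modify without touching the original, ask for a copy explicitly (a small sketch):
In [ ]:
my_array = np.array([1, 2, 3, 4])
safe_slice = my_array[:2].copy()  # .copy() gives an independent array, not a view
safe_slice[0] = 10
print(my_array)                   # unchanged: [1 2 3 4]
print(safe_slice)                 # [10  2]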
Your turn
Create a NumPy array with 100,000 random integers between 0 and 100. Then, in pure Python (not using built-in NumPy functions), write functions that compute the array's average, standard deviation, and weighted average.
In [ ]:
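One possible solution (a sketch - the function names and the choice of weights are up to you):
In [ ]:
data = np.random.randint(0, 101, size=100_000)  # 100,000 integers between 0 and 100

def average(x):
    total = 0
    for val in x:
        total += val
    return total / len(x)

def standard_deviation(x):
    mean = average(x)
    squared_diffs = 0
    for val in x:
        squared_diffs += (val - mean) ** 2
    return (squared_diffs / len(x)) ** 0.5

def weighted_average(x, weights):
    total = 0
    weight_sum = 0
    for val, w in zip(x, weights):
        total += val * w
        weight_sum += w
    return total / weight_sum

weights = np.random.rand(len(data))
print(average(data), standard_deviation(data), weighted_average(data, weights))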
We'll return to these functions a little later.
Python has a dynamic typing system. For example:
In [ ]:
x = 1
In [ ]:
type(x)
In [ ]:
x = 'hello'
In [ ]:
type(x)
In a statically typed language like C, you can't do this. The following code will fail when you try to compile it:
int x = 1;
x = "hello";
In other words, Python determines the best type to store your data at run time, whereas C requires you to explicitly specify the type of data (and enforces this typing). This design choice makes Python simple to use, but comes with a performance overhead, since Python needs to compute and store extra information about your data.
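We can actually measure some of that overhead. A rough sketch (exact numbers vary by platform and Python version): compare the memory used by a list of 1000 Python ints to a NumPy array of the same values:
In [ ]:
import sys

py_list = list(range(1000))
np_array = np.arange(1000)
# the list stores 1000 pointers, and every int is a full object with its own overhead
print(sys.getsizeof(py_list) + sum(sys.getsizeof(i) for i in py_list))
# the array stores just the raw values - 8000 bytes on a 64-bit platform
print(np_array.nbytes)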
In addition, Python lists can hold heterogeneous data:
In [ ]:
my_list = [1, 3.1415, 'hello']
my_list
NumPy's approach to data types in arrays is slightly different from that of vanilla Python lists:
In [ ]:
x = np.array([1, 2, 3, 4])
In [ ]:
x.dtype
In [ ]:
type(x[0])
In [ ]:
x[0] = 1.1  # silently truncated to 1 - the array's dtype is fixed at int
x[0]
In [ ]:
x[0] = 'hello'  # raises ValueError - 'hello' can't be converted to an integer
In [ ]:
x = x.astype(float)
In [ ]:
type(x[0])
In [ ]:
x[0] = 1.1
x[0]
In [ ]:
x = np.array([1, 2, 3, 4], dtype='float64')  # specify the dtype up front
In [ ]:
x.dtype
In [ ]:
x[0] = 1.1
x[0]
So - NumPy gives you the flexibility to use a wide variety of data types in arrays. However, NumPy arrays must be homogeneous in data type.
What about the mutability of NumPy arrays? Recall that vanilla Python lists are totally mutable:
In [ ]:
my_list = [1, 2, 3, 4]
In [ ]:
my_list[0] = 10
my_list
In [ ]:
my_list.append(20)
my_list
In [ ]:
my_list.remove(2)
print(my_list)
We can always change the value of any element in a list, as well as add and delete elements as we wish. But whatever we do to the elements of the list, the same list object is always there. With NumPy arrays, the story is a bit different:
In [ ]:
my_array = np.array([1, 2, 3, 4])
print(my_array)
In [ ]:
my_array[0] = 10
print(my_array)
In [ ]:
my_new_array = np.append(my_array, 20)
my_new_array[0] = 20
print(my_array)
print(my_new_array)
The shape of a NumPy array is fixed. If you want to add data to a NumPy array, an entirely new array is created.
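Because every np.append allocates a fresh array, growing an array element by element is slow. When the final size is known in advance, a common pattern (a small sketch) is to preallocate and fill in place:
In [ ]:
result = np.zeros(5)      # allocate the full array once
for i in range(5):
    result[i] = i ** 2    # fill in place - no reallocation
print(result)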
Your turn
Build a NumPy array of integers 0-999,999 three separate ways:
1. By appending values to a Python list in a loop, then converting the list to an array.
2. By calling np.append in a loop.
3. Using np.arange.
Use the %%timeit magic command to profile each approach. Which is fastest? Why?
Python is an interpreted language: each statement in a script is fed to the Python interpreter one at a time, so Python has no sense of what is coming later in the same script. This makes Python especially slow with loops - the interpreter doesn't know that a loop is essentially doing the same operation over and over, so it can't take advantage of the CPU architecture to speed those loops up.
Most of the functions in NumPy call lower-level functions written in C or FORTRAN. Both C and FORTRAN are compiled languages, meaning that a compiler reads an entire function at once and turns the source code into the machine code that actually gets executed. Because of that intermediate step, the compilers of such languages can exploit the CPU architecture, in particular the CPU cache, to ensure that the data needed for repeated computations (like those in a loop) is always close at hand. How exactly this works is beyond the scope of this class, but suffice it to say: Python loops are slow.
Your turn
Using the %%timeit magic function, profile the three pure Python functions you built earlier (average, standard deviation, and weighted average). Then, profile the same operations using the built-in NumPy functions (you shouldn't have to explicitly write any for loops!).
In [ ]:
When dealing with array-like data in NumPy, you should always try to think of operations at the array level, rather than the element level. This type of thinking is called "vectorizing" your code, and it lets you push your for loops (which are slow in Python) down into the underlying compiled code, where they run much faster.
The bottom line is that NumPy functions that depend on looping tend to be anywhere between 10 and 100 times faster (independent of input size) than the exact same algorithm implemented in pure Python.
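As a rough illustration of that gap (a sketch - exact timings will vary from machine to machine), compare a hand-written loop to the vectorized equivalent:
In [ ]:
big = np.arange(1_000_000)

def python_sum(x):
    total = 0
    for val in x:
        total += val
    return total

%timeit python_sum(big)  # pure Python loop
%timeit np.sum(big)      # vectorized - typically orders of magnitude faster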
In [ ]:
Generate a 12 $\times$ 12 multiplication table (i.e. where element (i, j) = i*j). Do this two ways:
1. Using for loops.
2. Using NumPy's fromfunction.
Which method is faster? Why?
In [ ]:
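One possible solution (a sketch; np.fromfunction calls the lambda once with whole arrays of indices, so the multiplication itself is vectorized):
In [ ]:
# with nested for loops
table_loops = np.zeros((12, 12), dtype=int)
for i in range(12):
    for j in range(12):
        table_loops[i, j] = i * j

# with np.fromfunction
table_np = np.fromfunction(lambda i, j: i * j, (12, 12), dtype=int)

print(np.array_equal(table_loops, table_np))  # True - same table either way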
Write a function that computes moving averages for an array and window size. For example,
>>> a = [0, 3, 3, 3, 9, 6, 9, 9, 12]
>>> size = 3
>>> moving_average(a, size)
[ 2.  3.  5.  6.  8.  8. 10.]
In [ ]:
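One possible solution (a sketch using np.convolve; mode='valid' keeps only the windows that fit entirely inside the array):
In [ ]:
def moving_average(a, size):
    window = np.ones(size) / size  # equal weights summing to 1
    return np.convolve(a, window, mode='valid')

a = [0, 3, 3, 3, 9, 6, 9, 9, 12]
print(moving_average(a, 3))        # [ 2.  3.  5.  6.  8.  8. 10.]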