In this class you are expected to learn:
We will be using Numpy and Pandas methods to read and write files, but those libraries use the Python built-in open() function underneath.
To read a file at once:
In [7]:
f = open('data/example.txt', 'r') # 'r' stands for reading
s = f.read()
print(s)
f.close() # After finishing with the file, we must close it
You can also iterate over a file, line by line:
In [6]:
f = open('data/example.txt', 'r')
for line in f:
    print(line)
f.close()
And writing is equally simple:
In [8]:
f = open('test_file.txt', 'w') # 'w' opens for writing
f.write('This is a test \nand another test')
f.close()
There are several file modes:

- r: open for reading (the default).
- w: open for writing. Note: creates a new file or overwrites an existing one.
- r+: open for both reading and writing.
- b: binary mode, appended to another mode (e.g. 'rb'). Note: use for binary files, especially on Windows.
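By the way, instead of remembering to call close() every time, it is common in Python to open files in a with block, which closes the file automatically when the block ends; a minimal sketch reading the same example file:
In [ ]:
with open('data/example.txt', 'r') as f:  # the file is closed automatically at the end of the block
    for line in f:
        print(line)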
In [9]:
open('test_file.txt', 'r').read()
Out[9]:
That weird character \n is a way to encode a new line in plain text. There are other special characters, like \t, which prints a tab.
Text files are only able to handle a small set of characters to represent everything that can be written. For historical reasons, that set is made up of the 26 letters of English, plus some other single symbols such as the dollar sign, period, slash, asterisk, etc., up to 128 characters in total. That set is called ASCII. But now think about other weird characters like the Spanish ñ, the French circumflex accent in rôle, or even the whole Greek alphabet, as in περισπωμένη. If, in the end, everything is a character from ASCII, how is it that you can write those complex characters?
Well, to do so, we use an encoding, which is simply a way to represent complex characters, such as π, as sequences of simple byte values. The problem is that there are many different ways of doing that encoding, and that's the reason why a file from one computer sometimes looks weird on another. Windows uses a different default encoding than macOS.
Fortunately, UTF-8 is becoming the de facto standard for text encoding on the Internet, and eventually for every system out there. It represents not only the ASCII symbols but a whole bunch more characters, including those from Greek, Chinese or Korean. That huge set of characters is known as Unicode.
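In Python 3 you can actually see the bytes behind a character by encoding a string explicitly; a quick sketch:
In [ ]:
"ñ".encode("utf-8")          # b'\xc3\xb1' -- two bytes in UTF-8
"ñ".encode("latin1")         # b'\xf1'     -- one byte in Latin-1
b'\xc3\xb1'.decode("utf-8")  # back to 'ñ'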
The newest version of Python, which is 3, now uses Unicode by default, encoded using UTF-8 (it was ASCII in Python 2, HELP!). You don't need to tell it that the characters are in UTF-8 anymore, but it's always better to do so by adding the comment # coding: utf-8 at the beginning of the source file. Furthermore, in previous versions you needed to put a prefix, u"", to define a string as Unicode. Now the u"" prefix is optional, but if you want to write code that works in both current and older versions of Python, it's a good idea to always define strings using the prefix.
Let's see how some symbols are actually encoded.
In [11]:
"regular string"
Out[11]:
In [14]:
u"string with morriña and other stuff, like περισπωμένη"
Out[14]:
If you had Python 2, the output you'd see would be very different from what you defined as a string. You would see something like:
In [1]: u"string with morriña and other stuff, like περισπωμένη"
Out[1]: u'string with morri\xf1a and other stuff, like \u03c0\u03b5\u03c1\u03b9\u03c3\u03c0\u03c9\u03bc\u03ad\u03bd\u03b7'
And that is even different when not using the u""
prefix:
In [2]: "string with morriña and other stuff, like περισπωμένη"
Out[2]: 'string with morri\xc3\xb1a and other stuff, like \xcf\x80\xce\xb5\xcf\x81\xce\xb9\xcf\x83\xcf\x80\xcf\x89\xce\xbc\xce\xad\xce\xbd\xce\xb7'
Yes, having Python handle that for you now is beyond nice.
When you are writing code, character encoding is easy to handle, but it may become a real problem when manipulating text files, because it is virtually impossible to know the encoding of a text just by reading it. And that is a problem. Our only option is to make guesses, which is not very accurate.
In order to make things easier, the Standard Python Library includes codecs
, a module that can force the encoding when manipulating text files.
Python 3 assumes that everything is in UTF-8, but older versions assumed plain ASCII.
In [15]:
open("data/utf8.txt").read()
Out[15]:
In [18]:
import codecs
codecs.open("data/utf8.txt", encoding="utf8").read()
Out[18]:
Not much difference here. But let's try with the encoding usually used in Windows, the hideous ISO 8859-1, aka latin1
.
In [19]:
open("data/latin1.txt").read()
UnicodeDecodeError
has been the nightmare of Python programmers for decades, even centuries.
Now that we know about codecs, we can use it!
In [21]:
codecs.open("data/latin1.txt").read()
Instead of raising an error, it just returns the file contents as a string of bytes, so the user can do whatever they need at a low level. But we know the encoding of that file, so let's use it.
In [23]:
codecs.open("data/latin1.txt", encoding="latin1").read()
Out[23]:
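Note that in Python 3 the built-in open() also accepts an encoding argument, so codecs is not strictly needed for this anymore; a minimal sketch with the same file:
In [ ]:
open("data/latin1.txt", encoding="latin1").read()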
Object-oriented programming, or OOP, is a way of structuring and building programs by representing ideas and concepts from the real world as objects, with properties and even functions associated with them. It's a very commonly used paradigm.
For example, we can have an object representing an author, a book, a map, or a painting. OOP is based on three principles: encapsulation, polymorphism, and inheritance; but what we need to know about objects is simpler than that.
Let's say that we want to represent paintings. Like numbers that are of type int()
, we want objects that are of type Painting
. Roughly, to define those types Python uses the keyword class
, so we can create our own classes with the behaviour we expect. All we need to do is define the class with the attributes we want a painting to have. Let's say we just want the title and the name of the author.
In [37]:
class Painting:
    title = ""
    author = ""
We can now create objects of the class Painting
, which is called instantiating a class, therefore, objects created this way are usually called instances of the class. And it makes sense, because Las Meninas
is an instance of a Painting
, philosophically speaking.
In [38]:
las_meninas = Painting()
las_meninas.title = "Las Meninas"
las_meninas.author = "Velázquez"
print(las_meninas)
But creating instances that way doesn't feel very natural. Python has a better way to define a class, one that allows us to do something like this:
las_meninas = Painting(title="Las Meninas", author="Velázquez")
All we need to do is define a special method, __init__(), which is just a function inside our class whose first argument is the object itself, called self. It sounds complicated, but it's not, trust me.
In [39]:
class Painting:
    def __init__(self, title, author):
        self.title = title
        self.author = author
In [40]:
las_meninas = Painting(title="Las Meninas", author="Velázquez")
print(las_meninas)
But that printing is really awful. Python defines several special methods that, like __init__(), alter the behaviour of the instantiated objects. One of those is __str__(). Let's see how it works.
In [41]:
class Painting:
    def __init__(self, title, author):
        self.title = title
        self.author = author

    def __str__(self):
        return "Painting '{title}' by '{author}'".format(
            title=self.title,
            author=self.author,
        )
In [42]:
las_meninas = Painting(title="Las Meninas", author="Velázquez")
print(las_meninas)
What's happening is that when we execute print() passing an instance of Painting, the method __str__() gets called, or invoked. Although this may look like just a fancy feature, it turns out to be really useful.
But the real power of classes is in defining your own methods. For example, I might want a method that, given a Painting object, tells me whether the author is Velázquez or not.
In [44]:
class Painting:
    def __init__(self, title, author):
        self.title = title
        self.author = author

    def __str__(self):
        return "Painting '{title}' by '{author}'".format(
            title=self.title,
            author=self.author,
        )

    def is_painted_by_velazquez(self):
        """Return True if the painting was painted by Velázquez."""
        return self.author.lower() == "velázquez"
In [45]:
las_meninas = Painting(title="Las Meninas", author="Velázquez")
las_meninas.is_painted_by_velazquez()
Out[45]:
In [46]:
guernica = Painting(title="Guernica", author="Picasso")
guernica.is_painted_by_velazquez()
Out[46]:
Despite the fun of defining classes, we won't be using them that much. Instead, we will use a lot of classes defined by other people in packaged modules ready to use.
Activity
What do you think `self.author.lower()` does? How would you improve that method to detect different ways of writing the name?
In [2]:
import numpy as np # Now we have all the methods from Numpy in our namespace np!
The Numpy module (almost always imported as np) is used in almost all numerical computation with Python. Numpy is one of those external Python modules that predefine a lot of classes for you to use. The good thing about Numpy is that it is really, really efficient, both in terms of memory and performance.
As an example, let's sum the first million numbers, and count the time by using the IPython magic command %timeit
.
In [52]:
%timeit sum(range(10 ** 6))
In [53]:
%timeit np.sum(np.arange(10 ** 6))
Well, 59.2 ms vs. 5.05 ms, that's more than 10 times faster!
The main reason behind this speed is that Python is a dynamic language, and some of its awesomeness is traded off against performance and memory usage. Numpy uses optimized C code underneath that runs lightning fast because it's compiled. Compiled means that we tell the computer what to do in its own language, which is fast because there is no translation needed in between. The good news is that we can use that code from Python through libraries such as Numpy.
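The trick to getting that kind of speed in your own code is to operate on whole arrays at once (so-called vectorized operations) instead of looping over elements in Python. A small sketch:
In [ ]:
numbers = np.arange(10 ** 6)
numbers * 2        # multiplies every element at once, in compiled code
numbers + numbers  # element-wise sum, no Python for loop needed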
In the Numpy package the terminology used for vectors, matrices and higher-dimensional data sets is array. If you haven't heard about matrices before, think of them as tables or spreadsheets. Another analogy is to think of vectors as a shelf, matrices as a bookshelf, 3-dimensional arrays as a room full of bookshelves, 4 dimensions as a building with rooms, and so on. In this sense, if we had a 5-dimensional array and wanted one single element, we would actually be asking for something like book number 1, on shelf 3, of bookshelf number 2, in room number 7, in building number 5. In Python notation, calling that array m, that would be something like:
m[5][7][2][3][1]
| | | | |- book 1
| | | |---- shelf 3
| | |------- bookshelf 2
| |---------- room 7
|------------- building 5
But hopefully we won't handle data that complex.
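Just to make the analogy concrete, here is a sketch with a small 3-dimensional array and how indexing walks down the dimensions (the shape and values are made up for illustration):
In [ ]:
m = np.zeros((2, 3, 4))  # 2 "bookshelves", each with 3 "shelves" of 4 "books"
m[1, 2, 0] = 7           # pick one single element and change it
m[1]                     # everything on bookshelf 1: a 3x4 array
m[1, 2]                  # shelf 2 of bookshelf 1: a vector of 4 elements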
There are a number of ways to initialize new numpy arrays, for example:

- from a Python list or tuple,
- using functions that are dedicated to generating numpy arrays, such as arange(), zeros(), etc., or
- by reading data from files.
For example, to create new vector and matrix arrays from Python lists we can use the numpy.array()
function.
In [55]:
# a vector: the argument to the array function is a Python list
v = np.array([1, 2, 3, 4])
v
Out[55]:
In [56]:
# a matrix: the argument to the array function is a nested Python list
M = np.array([[1, 2], [3, 4]])
M
Out[56]:
The v
and M
objects are both of the type ndarray
that the numpy module provides.
In [57]:
type(v), type(M)
Out[57]:
The difference between the v
and M
arrays is only their shapes. We can get information about the shape of an array by using the ndarray.shape
property.
In [58]:
v.shape, M.shape
Out[58]:
The number of elements in the array is available through the ndarray.size property:
In [59]:
M.size
Out[59]:
Equivalently, we could use the functions numpy.shape() and numpy.size():
In [61]:
np.shape(M), np.size(M)
Out[61]:
So far the numpy.ndarray looks awfully much like a Python list (or nested list). Why not simply use Python lists for computations instead of creating a new array type? Remember what we said about performance and memory.
There are several reasons:

- Python lists are very general: they can contain any kind of object, but they don't support mathematical operations such as element-wise arithmetic (see the sketch below).
- Numpy arrays are statically typed and homogeneous: the type of the elements is fixed when the array is created, which makes them compact in memory.
- Thanks to that, operations on Numpy arrays can be implemented in fast, compiled code.
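A quick sketch of that first point, comparing what + means for a list and for an array:
In [ ]:
[1, 2, 3] + [1, 2, 3]                      # lists: concatenation -> [1, 2, 3, 1, 2, 3]
np.array([1, 2, 3]) + np.array([1, 2, 3])  # arrays: element-wise sum -> array([2, 4, 6])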
Using the dtype
(data type) property of an ndarray
, we can see what type the data of an array has:
In [62]:
M.dtype
Out[62]:
We get an error if we try to assign a value of the wrong type to an element in a numpy array:
In [63]:
M[0,0] = "hello"
If we want, we can explicitly define the type of the array data when we create it, using the dtype
keyword argument:
In [67]:
M = np.array([[1, 2], [3, 4]], dtype=complex) # Complex numbers
M
Out[67]:
Common types that can be used with dtype are: int, float, complex, bool, object, etc.
We can also explicitly define the bit size of the data types, for example: int64, int16, float128, complex128.
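For example, a quick sketch creating the same data with different explicit sizes (note that the exact set of sizes available, like float128, can depend on your platform):
In [ ]:
np.array([1, 2, 3], dtype=np.int16)           # 2 bytes per element
np.array([1, 2, 3], dtype=np.float64)         # 8 bytes per element
np.array([1, 2, 3], dtype=np.int64).itemsize  # 8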
Activity
Write a function, `print_items` that receives as the argument an Numpy array of size (2, 2), and prints out the items of the array.
For example, `print_items(np.array([[1, 2], [3, 4]]))` should print
`
1
2
3
4
`
Activity
Write a function, `flatten()` that receives as the argument a two dimensional Numpy array, and returns a list of the elements of the array.
For example, `flatten(np.array([[1, 2], [3, 4]]))` should return `[1, 2, 3, 4]`.
For larger arrays it is impractical to initialize the data manually using explicit Python lists. Instead we can use one of the many functions in Numpy that generate arrays of different forms. Some of the more common are:
In [72]:
# create a range
x = np.arange(0, 10, 1) # arguments: start, stop, step
x
Out[72]:
In [73]:
x = np.arange(-1, 1, 0.1)
x
Out[73]:
In [85]:
np.zeros((3,3))
Out[85]:
In [86]:
np.ones((3,3))
Out[86]:
A very common file format for data files is comma-separated values (CSV), or related formats such as TSV (tab-separated values). To read data from such a file into Numpy arrays we can use the numpy.genfromtxt() function. For example,
In [152]:
!head data/temperatures.dat
data = np.genfromtxt('data/temperatures.dat')
data.shape
Out[152]:
Using numpy.savetxt()
we can store a Numpy array to a file in CSV format:
In [121]:
M = np.random.rand(3,3)
np.savetxt("random-matrix.csv", M, fmt='%.5f') # fmt specifies the format
!cat random-matrix.csv
In [122]:
M.itemsize # bytes per element
Out[122]:
In [123]:
M.nbytes # number of bytes
Out[123]:
In [124]:
M.ndim # number of dimensions
Out[124]:
We can index elements in an array using square brackets and indices:
In [126]:
# v is a vector, and has only one dimension, taking one index
v[0]
Out[126]:
In [125]:
# M is a matrix, or a 2 dimensional array, taking two indices
M[1,1]
Out[125]:
If we omit an index of a multidimensional array it returns the whole row (or, in general, an N-1 dimensional array):
In [127]:
M
Out[127]:
In [128]:
M[1]
Out[128]:
The same thing can be achieved by using : instead of an index:
In [129]:
M[1,:] # row 1
Out[129]:
In [130]:
M[:,1] # column 1
Out[130]:
We can assign new values to elements in an array using indexing:
In [131]:
M[0,0] = 1
M
Out[131]:
In [132]:
# also works for rows and columns
M[1,:] = 0
M[:,2] = -1
M
Out[132]:
Index slicing is the technical name for the syntax M[lower:upper:step]
to extract part of an array:
In [133]:
A = np.array([1,2,3,4,5])
A
A[1:3]
Out[133]:
Array slices are mutable: if they are assigned a new value the original array from which the slice was extracted is modified:
In [134]:
A[1:3] = [-2,-3]
A
Out[134]:
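This happens because a slice of a Numpy array is a view into the original data, not a copy. If you want an independent copy you can modify freely, call .copy() on the slice; a minimal sketch:
In [ ]:
arr = np.array([1, 2, 3, 4, 5])
view = arr[1:3]           # a view: shares memory with arr
view[0] = 100
arr                       # arr has changed too: array([  1, 100,   3,   4,   5])
copied = arr[1:3].copy()  # an independent copy
copied[0] = -999
arr                       # arr is unaffected this time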
We can omit any of the three parameters in M[lower:upper:step]:
In [90]:
A[::] # lower, upper, step all take the default values
Out[90]:
In [91]:
A[::2] # step is 2, lower and upper defaults to the beginning and end of the array
Out[91]:
In [92]:
A[:3] # first three elements
Out[92]:
In [93]:
A[3:] # elements from index 3
Out[93]:
Negative indices count from the end of the array (positive indices count from the beginning):
In [95]:
A = np.array([1,2,3,4,5])
A[-1] # the last element in the array
Out[95]:
In [96]:
A[-3:] # the last three elements
Out[96]:
Index slicing works exactly the same way for multidimensional arrays:
In [99]:
A = np.array([[n + m * 10 for n in range(5)] for m in range(5)])
A
Out[99]:
In [100]:
# a block from the original array
A[1:4, 1:4]
Out[100]:
In [101]:
# strides
A[::2, ::2]
Out[101]:
Activity
Wait a second, what was that? Spend some time trying to understand what the next Python expression actually does.
`
[i for i in range(10)]
`
In [ ]:
[i for i in range(10)] # hmmm
Fancy indexing is the name for when an array or list is used in place of an index:
In [102]:
row_indices = [1, 2, 3]
A[row_indices]
Out[102]:
In [103]:
col_indices = [1, 2, -1] # remember, index -1 means the last element
A[row_indices, col_indices]
Out[103]:
We can also use index masks: if the index mask is a Numpy array of data type bool, then an element is selected (True) or not (False) depending on the value of the index mask at the position of each element:
In [105]:
B = np.array([n for n in range(5)])
B
Out[105]:
In [106]:
row_mask = np.array([True, False, True, False, False])
B[row_mask]
Out[106]:
In [107]:
# same thing
row_mask = np.array([1,0,1,0,0], dtype=bool)
B[row_mask]
Out[107]:
This feature is very useful to conditionally select elements from an array, using for example comparison operators:
In [3]:
x = np.arange(0, 10, 0.5)
x
Out[3]:
In [5]:
mask = (5 < x) * (x < 7.5)
mask
Out[5]:
Why do we use * instead of and if it's a logical expression? Because and tries to evaluate the truth value of the whole array, which is ambiguous, while * works element by element on the boolean arrays. (Also because Numpy says so, that's why >:D)
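If multiplying booleans feels odd, the more idiomatic element-wise operators are & (and), | (or) and ~ (not), or the function np.logical_and(); a quick sketch building the same mask:
In [ ]:
(5 < x) & (x < 7.5)             # element-wise AND on the boolean arrays
np.logical_and(5 < x, x < 7.5)  # exactly the same result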
In [111]:
x[mask]
Out[111]:
Activity
The data in [`populations.txt`](data/populations.txt) describes the populations of hares and lynxes (and carrots) in northern Canada during 20 years. Compute and print, based on the data in `populations.txt`:
1. The mean and standard deviation of the populations of each species for the years in the period.
2. In which year did each species have its largest population?
3. Which species had the largest population each year? (*Hint*: argsort & fancy indexing of `np.array(['H', 'L', 'C'])`).
4. In which years was any of the populations above 50000? (*Hint*: comparisons and `np.any`).
5. The top 2 years for each species when they had the lowest populations. (*Hint*: `argsort`, fancy indexing).
And all of this without a single for loop.
Numpy makes a distinction between a 2-dimensional (2D) array and a matrix. Every matrix is definitely a 2D array, but not every 2D array is a matrix. If we want Python to treat the Numpy arrays as matrices and vectors, we have to explicitly say so.
Like this:
In [10]:
a = np.eye(4) # Fills the diagonal elements with 1's
a = np.matrix(a)
a
Out[10]:
Now, we can do complex matrix operations over a
.
Another way is to keep your arrays as arrays and use specific Numpy functions that handle the conversion for you, so your arrays will always stay arrays.
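For instance, to multiply two plain arrays as matrices you could use np.dot(); a minimal sketch:
In [ ]:
m1 = np.eye(4)                   # a 4x4 identity, kept as a plain ndarray
m2 = np.arange(16).reshape(4, 4)
np.dot(m1, m2)                   # matrix product of two ordinary arrays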
The index mask can be converted to a positional index using the where() function:
In [135]:
indices = np.where(mask)
indices
Out[135]:
In [136]:
x[indices] # this indexing is equivalent to the fancy indexing x[mask]
Out[136]:
With the diag function we can also extract the diagonal and subdiagonals of an array:
In [138]:
np.diag(A)
Out[138]:
In [139]:
np.diag(A, -1)
Out[139]:
The take function is similar to fancy indexing described above:
In [140]:
v2 = np.arange(-3, 3)
row_indices = [1, 3, 5]
v2[row_indices] # fancy indexing
Out[140]:
In [141]:
v2.take(row_indices)
Out[141]:
But take
also works on lists and other objects:
In [142]:
np.take([-3, -2, -1, 0, 1, 2], row_indices)
Out[142]:
The choose() function constructs an array by picking elements from several arrays:
In [144]:
which = [1, 0, 1, 0]
choices = [[-2,-2,-2,-2], [5,5,5,5]]
np.choose(which, choices)
Out[144]:
Often it is useful to store datasets in Numpy arrays. Numpy provides a number of functions to calculate statistics of datasets in arrays.
For example, let's calculate some statistics from the temperature dataset used above.
In [153]:
# reminder, the temperature dataset is stored in the data variable:
np.shape(data)
Out[153]:
In [151]:
# the temperature data is in column 3
np.mean(data[:,3])
Out[151]:
The daily mean temperature over the last 200 years has been about 6.2 ºC.
In [154]:
np.std(data[:,3]), np.var(data[:,3])
Out[154]:
In [155]:
# lowest daily average temperature
data[:,3].min()
Out[155]:
In [156]:
# highest daily average temperature
data[:,3].max()
Out[156]:
In [158]:
d = np.arange(0, 10)
d
Out[158]:
In [159]:
# sum up all elements
np.sum(d)
Out[159]:
In [160]:
# product of all elements
np.prod(d + 1)
Out[160]:
In [161]:
# cumulative sum
np.cumsum(d)
Out[161]:
In [162]:
# cumulative product
np.cumprod(d + 1)
Out[162]:
In [164]:
# same as: diag(A).sum()
A = np.array([[n+m*10 for n in range(5)] for m in range(5)])
np.trace(A)
Out[164]:
The SciPy framework builds on top of the low-level Numpy framework for multidimensional arrays, and provides a large number of higher-level scientific algorithms. Some of the topics that SciPy covers are:

- Integration (scipy.integrate)
- Optimization (scipy.optimize)
- Interpolation (scipy.interpolate)
- Fourier transforms (scipy.fftpack)
- Signal processing (scipy.signal)
- Linear algebra (scipy.linalg)
- Statistics (scipy.stats)
Each of these submodules provides a number of functions and classes that can be used to solve problems in its respective topic. For our purposes, we will only look at the stats subpackage.
The scipy.stats
module contains a large number of statistical distributions, statistical functions, and tests. There is also a very powerful Python package for statistical modelling called statsmodels
.
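As a small taste of what is in there, a minimal sketch using a couple of scipy.stats functions (the samples are random, so your numbers will differ):
In [ ]:
from scipy import stats

samples = stats.norm.rvs(loc=0, scale=1, size=1000)  # draw from a normal distribution
stats.norm.pdf(0)                                    # density of the standard normal at 0
stats.describe(samples)                              # mean, variance, skewness, kurtosis, ...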