*Class 05*

In this class you are expected to learn:

- Getting data into and out of Python
- Objects
- NumPy
- Arrays and matrices
- SciPy

We will be using Numpy and Pandas methods to read and write files, but those libraries use Python's built-in `open()` function underneath. We will be using several files, so if you don't have them downloaded yet, do it now! Put them in a folder `data` in your IPython Notebook folder. The list is:

To read a file at once:

```
In [1]:
```
```
f = open('data/example.txt', 'r') # 'r' stands for reading
s = f.read()
print(s)
f.close() # After finishing with the file, we must close it
```

You can also iterate over a file, line by line:

```
In [2]:
```
```
f = open('data/example.txt', 'r')
for line in f:
    print(line)
f.close()
```

And writing is equally simple:

```
In [3]:
```
```
f = open('test_file.txt', 'w') # 'w' opens for writing
f.write('This is a test \nand another test')
f.close()
```

There are several file modes:

- Read-only: `r`
- Write-only: `w`. *Note*: Creates a new file or overwrites an existing file.
- Append to a file: `a`
- Read and write: `r+`
- Binary mode: `b`. *Note*: Use for binary files, especially on Windows.
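The modes listed above can be tried out safely in a temporary folder. This is only a small sketch: the file name is made up, and it uses the `with` statement, which closes the file for us even if something goes wrong, instead of calling `close()` by hand.

```python
import os
import tempfile

# Work in a throwaway directory so no real files are touched.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "test_file.txt")  # hypothetical file name

# 'w' creates the file (or overwrites it); the with block closes it for us.
with open(path, "w") as f:
    f.write("This is a test\n")

# 'a' appends to the end instead of overwriting.
with open(path, "a") as f:
    f.write("and another test\n")

# 'r' reads; iterating over the file yields one line at a time.
with open(path, "r") as f:
    lines = [line.rstrip("\n") for line in f]

print(lines)
```

Opening the same file with `'w'` again would throw away both lines, which is why `'a'` exists.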

```
In [4]:
```
```
open('test_file.txt', 'r').read()
```

```
Out[4]:
```

That weird character `\n` is a way to encode a new line in plain text. There are other characters, like `\t`, which encodes a tab.

Text files are only able to handle a small set of characters to represent all that can be written. For historical reasons, that set of characters is the 26 letters of the English alphabet, plus digits and some other single symbols such as dollar, point, slash, asterisk, etc., up to 128 single characters. That set is called ASCII. But now think about other weird characters like the Spanish *ñ*, the French circumflex accent in *rôle*, or even the whole Greek alphabet, *φοβερός*. If everything is, in the end, a character from ASCII, how is it that you can write those complex characters?

Well, to do so, we use an **encoding**, which is simply a way to represent complex characters, such as *π*, as sequences of bytes. The problem is that there are many different ways of doing that encoding, and that's the reason why a file from one computer sometimes looks weird on another. Windows has traditionally used a different default encoding than Mac.

Fortunately, UTF-8 is becoming the de facto standard for text encoding on the Internet, and eventually for every system out there. It represents not only ASCII symbols but a whole bunch more characters, including those from Greek, Chinese or Korean. That huge set of characters is known as Unicode.
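To make the idea of an encoding a bit more concrete, here is a minimal sketch (Python 3) that turns the same character into bytes using two different encodings, and then shows what happens when the bytes are read back with the wrong one. The variable names are just for illustration.

```python
text = "ñ"

utf8_bytes = text.encode("utf-8")      # two bytes in UTF-8
latin1_bytes = text.encode("latin-1")  # a single byte in Latin-1

print(utf8_bytes, latin1_bytes)

# Decoding with the wrong encoding does not fail here,
# it silently produces garbled text.
garbled = utf8_bytes.decode("latin-1")
print(garbled)

# Decoding with the right encoding recovers the original character.
restored = utf8_bytes.decode("utf-8")
print(restored == text)
```

This is exactly the "file looks weird on another computer" situation: the bytes are fine, but they are being interpreted with a different encoding.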

The newest version of Python, which is 3, now uses Unicode by default, with source files encoded in UTF-8 (it was ASCII in Python 2, *HALP!*). You don't need to tell it that the characters are in UTF-8 anymore, but you can still be explicit by adding the comment `# -*- coding: utf-8 -*-` at the beginning of the source file. Furthermore, in previous versions, you needed to put a prefix, `u""`, to define a string as Unicode. Now the `u""` is optional, but if you want to write code that works for current and older versions of Python, it's a good idea to always define strings by using the prefix.

Let's see how some symbols are actually encoded.

```
In [5]:
```
```
"regular string"
```

```
Out[5]:
```

```
In [6]:
```
```
u"string with morriña and other stuff, like περισπωμένη"
```

```
Out[6]:
```

If you had Python 2, what you'd see as the output would be very different from what you defined as a string. You would see something like:

```
In [1]: u"string with morriña and other stuff, like περισπωμένη"
Out[1]: u'string with morri\xf1a and other stuff, like \u03c0\u03b5\u03c1\u03b9\u03c3\u03c0\u03c9\u03bc\u03ad\u03bd\u03b7'
```

And that is even different when not using the `u""` prefix:

```
In [2]: "string with morriña and other stuff, like περισπωμένη"
Out[2]: 'string with morri\xc3\xb1a and other stuff, like \xcf\x80\xce\xb5\xcf\x81\xce\xb9\xcf\x83\xcf\x80\xcf\x89\xce\xbc\xce\xad\xce\xbd\xce\xb7'
```

Yes, having Python handle that for you now is beyond nice.

When you are writing code, character encoding is easy to handle, but it may become a real problem when manipulating text files, because it is virtually impossible to know the encoding of a text by simply reading it. And that is a problem. Our only option is to *make guesses*, which is not very accurate.

In order to make things easier, the Python Standard Library includes `codecs`, a module that can force the encoding when manipulating text files.

Python 3 assumes that source code is in UTF-8 and, when opening text files, uses the platform's default encoding (UTF-8 on most Linux and macOS systems) unless told otherwise, but older versions assumed plain ASCII.

```
In [7]:
```
```
open("data/utf8.txt").read()
```

```
Out[7]:
```

```
In [8]:
```
```
import codecs
codecs.open("data/utf8.txt", encoding="utf8").read()
```

```
Out[8]:
```

Now let's try with a file encoded in `latin1`.

```
In [9]:
```
```
open("data/latin1.txt").read()
```

`UnicodeDecodeError` has been the nightmare of Python programmers for decades, even centuries.

Because we now know `codecs`, we can use it!

```
In [10]:
```
```
codecs.open("data/latin1.txt").read()
```

```
Out[10]:
```

```
In [11]:
```
```
codecs.open("data/latin1.txt", encoding="latin1").read()
```

```
Out[11]:
```
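In Python 3 the built-in `open()` also accepts an `encoding` argument, so `codecs` is not the only way to force an encoding. The sketch below does not use the class files: it writes its own small Latin-1 file in a temporary folder (a stand-in for `data/latin1.txt`) and reads it back both ways.

```python
import os
import tempfile

# Throwaway stand-in for data/latin1.txt
path = os.path.join(tempfile.mkdtemp(), "latin1.txt")

# Write some text as Latin-1 bytes.
with open(path, "w", encoding="latin-1") as f:
    f.write("morriña")

# Reading it back as UTF-8 fails: the Latin-1 byte for ñ is not valid UTF-8 here.
try:
    with open(path, encoding="utf-8") as f:
        f.read()
    failed = False
except UnicodeDecodeError:
    failed = True

# Passing the right encoding works.
with open(path, encoding="latin-1") as f:
    content = f.read()

print(failed, content)
```

Being explicit about `encoding` every time you open a text file is a cheap way to avoid depending on the defaults of whatever machine the code runs on.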

Object-oriented programming, or OOP, is a way of structuring and building programs by representing ideas and concepts from the real world as objects with properties and even functions associated with them. It's a very, very commonly used paradigm.

For example, we can have an object representing an author, a book, a map, or a painting. OOP is based on three principles: encapsulation, polymorphism, and inheritance; but what we need to know about objects is simpler than that.

Let's say that we want to represent paintings. Like numbers that are of type `int`, we want objects that are of type `Painting`. Roughly, to define those types Python uses the keyword `class`, so we can create our own classes with the behaviour we expect. All we need to do is define the class with the attributes we want a painting to have. Let's say we just want the title and the name of the author.

```
In [12]:
```
```
class Painting:
    title = ""
    author = ""
```

Now we can create objects of type `Painting`, which is called instantiating a class; therefore, objects created this way are usually called instances of the class. And it makes sense, because `Las Meninas` is an instance of a `Painting`, philosophically speaking.

```
In [13]:
```
```
las_meninas = Painting()
las_meninas.title = "Las Meninas"
las_meninas.author = "Velázquez"
print(las_meninas)
```

But creating instances that way doesn't feel very natural. Python has a better way to define a class that lets us do something like this:

```
las_meninas = Painting(title="Las Meninas", author="Velázquez")
```

All we need to do is define a special method, `__init__()`, which is just a function inside our class whose first argument is the object itself, called `self`. It sounds complicated, but it's not, trust me.

```
In [14]:
```
```
class Painting:
    def __init__(self, title, author):
        self.title = title
        self.author = author
```

```
In [15]:
```
```
las_meninas = Painting(title="Las Meninas", author="Velázquez")
print(las_meninas)
```

There are other special methods that, like `__init__()`, alter the behaviour of the instantiated objects. One of those is `__str__()`. Let's see how it works.

```
In [16]:
```
```
class Painting:
    def __init__(self, title, author):
        self.title = title
        self.author = author

    def __str__(self):
        return "Painting '{title}' by '{author}'".format(
            title=self.title,
            author=self.author,
        )
```

```
In [17]:
```
```
las_meninas = Painting(title="Las Meninas", author="Velázquez")
print(las_meninas)
```

What is happening is that when we execute `print()` passing an instance of `Painting`, the method `__str__()` gets called or invoked. Although this is just a fancy feature, it gets really useful.
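`print()` is not the only thing that triggers `__str__()`. As a small sketch, the code below compares a class that defines it with one that does not; `PlainPainting` is a made-up class that exists only for the comparison.

```python
class Painting:
    def __init__(self, title, author):
        self.title = title
        self.author = author

    def __str__(self):
        return "Painting '{}' by '{}'".format(self.title, self.author)


class PlainPainting:
    """Same data, but without __str__ (hypothetical, for comparison only)."""

    def __init__(self, title, author):
        self.title = title
        self.author = author


las_meninas = Painting("Las Meninas", "Velázquez")

as_text = str(las_meninas)           # str() calls __str__
in_fstring = f"Seen: {las_meninas}"  # so does string formatting

# Without __str__, Python falls back to a generic description of the object.
default_text = str(PlainPainting("Las Meninas", "Velázquez"))

print(as_text)
print(in_fstring)
print(default_text)
```

The last line prints something that includes the class name and a memory address, which is what we were seeing before adding `__str__()`.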

But the real power of classes is to define your own methods. For example, I might want a method that, given a `Painting` object, tells me whether the author is Velázquez or not.

```
In [18]:
```
```
class Painting:
    def __init__(self, title, author):
        self.title = title
        self.author = author

    def __str__(self):
        return "Painting '{title}' by '{author}'".format(
            title=self.title,
            author=self.author,
        )

    def is_painted_by_velazquez(self):
        """Returns whether the painting was painted by Velázquez"""
        return self.author.lower() == "velázquez"
```

```
In [19]:
```
```
las_meninas = Painting(title="Las Meninas", author="Velázquez")
las_meninas.is_painted_by_velazquez()
```

```
Out[19]:
```

```
In [20]:
```
```
guernica = Painting(title="Guernica", author="Picasso")
guernica.is_painted_by_velazquez()
```

```
Out[20]:
```

Activity

What do you think `self.author.lower()` does? How would you improve that method to detect different ways of writing the name?

```
In [2]:
```
```
import numpy as np # Now we have all the methods from Numpy in our namespace np!
```

The Numpy module (almost always imported as **np**) is used in almost all numerical computation using Python. Numpy is one of those external Python modules that predefines a lot of classes for you to use. The good thing about Numpy is that it is really, really efficient, both in terms of memory and performance.

As an example, let's sum the first million numbers, and count the time by using the IPython magic command `%timeit`.

```
In [24]:
```
```
%timeit sum(range(10 ** 6))
```

```
In [25]:
```
```
%timeit np.sum(np.arange(10 ** 6))
```

Well, 27.9 ms vs. 3 ms, that's almost **10 times** faster!

The main reason behind this speed is that Python is a dynamic language, and some of its awesomeness is traded off against performance and memory usage. Numpy uses optimized C code underneath that runs as fast as thunder because it's compiled. Compiled means that we can tell the computer to do stuff in its own language, which is fast because there is no need for any translation in between. But the good news is that we can use that code in Python by using libraries such as Numpy.
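`%timeit` only works inside IPython. Outside of it, a rough version of the same comparison can be sketched with the standard `timeit` module. The exact numbers depend entirely on your machine, so nothing here should be read as a benchmark; the explicit `dtype=np.int64` is an added precaution for platforms whose default integer is 32 bits.

```python
import timeit

import numpy as np

n = 10 ** 6

# Both expressions compute the same sum.
python_total = sum(range(n))
numpy_total = int(np.arange(n, dtype=np.int64).sum())

# Time 10 repetitions of each approach.
python_time = timeit.timeit(lambda: sum(range(n)), number=10)
numpy_time = timeit.timeit(
    lambda: np.arange(n, dtype=np.int64).sum(), number=10
)

print(python_total == numpy_total)
print("pure Python: {:.4f} s, Numpy: {:.4f} s".format(python_time, numpy_time))
```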

In the Numpy package the terminology used for vectors, matrices and higher-dimensional data sets is array. If you haven't heard about matrices before, think about them like tables or spreadsheets. Another analogy is to think of vectors as a shelf, matrices as a bookshelf, 3-dimensional matrices as a room full of bookshelves, 4 dimensions as buildings with rooms, and so on. In this sense, if we had a 5-dimensional matrix and wanted one single element, we would actually be asking for something like book number 1, on shelf 3, of bookshelf number 2, in room number 7, in building number 5. In Python notation, having that matrix as `m`, that would be something like:

```
m[5][7][2][3][1]
| | | | |- book 1
| | | |---- shelf 3
| | |------- bookshelf 2
| |---------- room 7
|------------- building 5
```

But hopefully we won't handle data that complex.
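Just to see the analogy in action, here is a small sketch that builds a 5-dimensional array and asks for that one book. The sizes are invented so that the indices above are valid, and every "book" simply gets a consecutive number.

```python
import numpy as np

# Hypothetical sizes: 6 buildings, 8 rooms, 3 bookshelves,
# 4 shelves per bookshelf, 2 books per shelf.
shape = (6, 8, 3, 4, 2)
m = np.arange(np.prod(shape)).reshape(shape)

chained = m[5][7][2][3][1]  # one index at a time, as in the diagram
direct = m[5, 7, 2, 3, 1]   # all indices at once, the usual Numpy style

print(m.ndim, chained == direct)
```

Both forms return the same element; the comma-separated form is the one used in the rest of this class.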

There are a number of ways to create new Numpy arrays, for example from Python lists, using functions that are dedicated to generating arrays, such as `arange()`, `zeros()`, etc., or reading data from files.

To create new vector and matrix arrays from Python lists we can use the `numpy.array()` function.

```
In [26]:
```
```
# a vector: the argument to the array function is a Python list
v = np.array([1, 2, 3, 4])
v
```

```
Out[26]:
```

```
In [27]:
```
```
# a matrix: the argument to the array function is a nested Python list
M = np.array([[1, 2], [3, 4]])
M
```

```
Out[27]:
```

The `v` and `M` objects are both of the type `ndarray` that the numpy module provides.

```
In [28]:
```
```
type(v), type(M)
```

```
Out[28]:
```

The difference between the `v` and `M` arrays is only their shapes. We can get information about the shape of an array by using the `ndarray.shape` property.

```
In [29]:
```
```
v.shape, M.shape
```

```
Out[29]:
```

The number of elements in the array is available through the `ndarray.size` property:

```
In [30]:
```
```
M.size
```

```
Out[30]:
```

Equivalently, we could use the functions `numpy.shape()` and `numpy.size()`:

```
In [31]:
```
```
np.shape(M), np.size(M)
```

```
Out[31]:
```

So far the `numpy.ndarray` looks awfully much like a Python list (or nested list). Why not simply use Python lists for computations instead of creating a new array type? Remember what we said about performance and memory.

There are several reasons:

- Python lists are very general. They can contain any kind of object. They are dynamically typed. They do not support mathematical functions such as matrix and dot multiplications, etc. Implementing such functions for Python lists would not be very efficient because of the dynamic typing.
- Numpy arrays are statically typed and homogeneous. The type of the elements is determined when the array is created.
- Numpy arrays are memory efficient.
- Because of the static typing, fast implementation of mathematical functions such as multiplication and addition of numpy arrays can be implemented in a compiled language (C and Fortran are used).
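A quick way to feel the difference between the two containers is to apply the same operator to both. This is only an illustrative sketch with made-up values.

```python
import numpy as np

numbers = [1, 2, 3]
array = np.array(numbers)

list_times_two = numbers * 2   # for a list, * repeats the list
array_times_two = array * 2    # for an array, * multiplies every element

mixed = [1, "two", 3.0]        # a list happily mixes types

print(list_times_two)
print(array_times_two)
print(array.dtype, array.itemsize)  # a single type, fixed bytes per element
```

The array can do the arithmetic in one go precisely because every element has the same type and size.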

Using the `dtype` (data type) property of an `ndarray`, we can see what type the data of an array has:

```
In [32]:
```
```
M.dtype
```

```
Out[32]:
```

We get an error if we try to assign a value of the wrong type to an element in a numpy array:

```
In [33]:
```
```
M[0,0] = "hello"
```

If we want, we can explicitly define the type of the array data when we create it, using the `dtype` keyword argument:

```
In [34]:
```
```
M = np.array([[1, 2], [3, 4]], dtype=complex) # Complex numbers
M
```

```
Out[34]:
```

Common types that can be used with `dtype` are: `int`, `float`, `complex`, `bool`, `object`, etc.

We can also explicitly define the bit size of the data types, for example: `int64`, `int16`, `float128`, `complex128`. (Some of these, such as `float128`, are not available on every platform.)
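The following sketch puts the `dtype` ideas together on a tiny made-up array: an explicit bit size, what happens when a value of another type is assigned, and how to convert with `astype()`.

```python
import numpy as np

small = np.array([1, 2, 3], dtype=np.int16)
print(small.dtype, small.itemsize)  # int16 uses 2 bytes per element

# Assigning a float to an integer array silently drops the decimals.
small[0] = 3.7
truncated = int(small[0])

# Assigning something that cannot be converted raises an error.
try:
    small[1] = "hello"
    raised = False
except ValueError:
    raised = True

# astype() returns a converted copy and leaves the original untouched.
as_float = small.astype(float)

print(truncated, raised, as_float.dtype, small.dtype)
```

Note that the silent truncation is often more dangerous than the error, because nothing tells you that information was lost.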

Activity

Write a function, `print_items`, that receives as the argument a Numpy array of shape (2, 2), and prints out the items of the array.

For example, `print_items(np.array([[1, 2], [3, 4]]))` should print

`
1
2
3
4
`

Activity

Write a function, `flatten()` that receives as the argument a two dimensional Numpy array, and returns a list of the elements of the array.

For example, `flatten(np.array([[1, 2], [3, 4]]))` should return `[1, 2, 3, 4]`.

```
In [35]:
```
```
# create a range
x = np.arange(0, 10, 1) # arguments: start, stop, step
x
```

```
Out[35]:
```

```
In [36]:
```
```
x = np.arange(-1, 1, 0.1)
x
```

```
Out[36]:
```

```
In [37]:
```
```
np.zeros((3,3))
```

```
Out[37]:
```

```
In [38]:
```
```
np.ones((3,3))
```

```
Out[38]:
```

To read data from a text file into a Numpy array we can use the `numpy.genfromtxt()` function. For example,

```
In [39]:
```
```
!head data/temperatures.dat
data = np.genfromtxt('data/temperatures.dat')
data.shape
```

```
Out[39]:
```

Using `numpy.savetxt()` we can store a Numpy array to a file in CSV format:

```
In [40]:
```
```
M = np.random.rand(3,3)
# fmt specifies the number format, delimiter makes the file comma-separated
np.savetxt("random-matrix.csv", M, fmt='%.5f', delimiter=',')
!cat random-matrix.csv
```
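Writing and reading can be checked against each other with a round trip. The sketch below saves a random matrix to a temporary file (so nothing in your notebook folder is touched) and loads it back; the path is a throwaway one invented for the example.

```python
import os
import tempfile

import numpy as np

# Throwaway location for the file.
path = os.path.join(tempfile.mkdtemp(), "random-matrix.csv")

M = np.random.rand(3, 3)

# delimiter="," is what actually makes the file comma-separated.
np.savetxt(path, M, fmt="%.5f", delimiter=",")

# The same delimiter has to be given when reading the file back.
loaded = np.genfromtxt(path, delimiter=",")

# Only 5 decimals were written, so the values match up to that precision.
print(loaded.shape, np.allclose(M, loaded, atol=1e-5))
```

The small loss of precision comes from `fmt`, not from the file format itself.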

```
In [41]:
```
```
M.itemsize # bytes per element
```

```
Out[41]:
```

```
In [42]:
```
```
M.nbytes # number of bytes
```

```
Out[42]:
```

```
In [43]:
```
```
M.ndim # number of dimensions
```

```
Out[43]:
```

We can index elements in an array using the square bracket and indices:

```
In [44]:
```
```
# v is a vector, and has only one dimension, taking one index
v[0]
```

```
Out[44]:
```

```
In [45]:
```
```
# M is a matrix, or a 2 dimensional array, taking two indices
M[1,1]
```

```
Out[45]:
```

```
In [46]:
```
```
M
```

```
Out[46]:
```

```
In [47]:
```
```
M[1]
```

```
Out[47]:
```

The same thing can be achieved by using `:` instead of an index:

```
In [48]:
```
```
M[1,:] # row 1
```

```
Out[48]:
```

```
In [49]:
```
```
M[:,1] # column 1
```

```
Out[49]:
```

We can assign new values to elements in an array using indexing:

```
In [50]:
```
```
M[0,0] = 1
M
```

```
Out[50]:
```

```
In [51]:
```
```
# also works for rows and columns
M[1,:] = 0
M[:,2] = -1
M
```

```
Out[51]:
```

Index slicing is the technical name for the syntax `M[lower:upper:step]` to extract part of an array:

```
In [52]:
```
```
A = np.array([1,2,3,4,5])
A
A[1:3]
```

```
Out[52]:
```

```
In [53]:
```
```
A[1:3] = [-2,-3]
A
```

```
Out[53]:
```

We can omit any of the three parameters in `M[lower:upper:step]`:

```
In [54]:
```
```
A[::] # lower, upper, step all take the default values
```

```
Out[54]:
```

```
In [55]:
```
```
A[::2] # step is 2, lower and upper default to the beginning and end of the array
```

```
Out[55]:
```

```
In [56]:
```
```
A[:3] # first three elements
```

```
Out[56]:
```

```
In [57]:
```
```
A[3:] # elements from index 3
```

```
Out[57]:
```
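One detail worth knowing about slices, which goes a bit beyond what was shown above: a slice of a Numpy array is a view that shares memory with the original, not an independent copy as with Python lists. A minimal sketch:

```python
import numpy as np

A = np.array([1, 2, 3, 4, 5])

view = A[1:3]          # a slice shares memory with A
view[0] = -2           # so this also changes A
after_view = A.tolist()

copy = A[1:3].copy()   # an explicit copy is independent
copy[0] = 100          # A is not affected this time
after_copy = A.tolist()

print(after_view, after_copy)
```

When you want to modify a piece of an array without touching the original, call `.copy()` on the slice first.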

Negative indices count from the end of the array (positive indices from the beginning):

```
In [58]:
```
```
A = np.array([1,2,3,4,5])
A[-1] # the last element in the array
```

```
Out[58]:
```

```
In [59]:
```
```
A[-3:] # the last three elements
```

```
Out[59]:
```

Index slicing works exactly the same way for multidimensional arrays:

```
In [60]:
```
```
A = np.array([[n + m * 10 for n in range(5)] for m in range(5)])
A
```

```
Out[60]:
```

```
In [61]:
```
```
# a block from the original array
A[1:4, 1:4]
```

```
Out[61]:
```

```
In [62]:
```
```
# strides
A[::2, ::2]
```

```
Out[62]:
```

Activity

Wait a second, what was that? Spend some time trying to understand what the next Python expression actually does.

`
[i for i in range(10)]
`

```
In [ ]:
```
```
[i for i in range(10)] # hmmm
```

Fancy indexing is the name for when an array or list is used in place of an index:

```
In [63]:
```
```
row_indices = [1, 2, 3]
A[row_indices]
```

```
Out[63]:
```

```
In [64]:
```
```
col_indices = [1, 2, -1] # remember, index -1 means the last element
A[row_indices, col_indices]
```

```
Out[64]:
```
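It is easy to misread what two index lists do together: they are paired up element by element, they do not select a block. The sketch below rebuilds the same `A` and contrasts the pairing with `np.ix_`, a Numpy helper (not used elsewhere in this class) that combines every row with every column.

```python
import numpy as np

A = np.array([[n + m * 10 for n in range(5)] for m in range(5)])
row_indices = [1, 2, 3]
col_indices = [1, 2, -1]

# Paired: picks the elements at (1, 1), (2, 2) and (3, -1).
pairs = A[row_indices, col_indices]

# Block: every listed row combined with every listed column.
block = A[np.ix_(row_indices, col_indices)]

print(pairs)
print(block)
```

The first result is a vector with three elements; the second is a 3 by 3 array.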

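We can also use index masks: if the index mask is a Numpy array of data type `bool`, then an element is selected (`True`) or not (`False`) depending on the value of the index mask at the position of each element:

```
In [65]:
```
```
B = np.array([n for n in range(5)])
B
```

```
Out[65]:
```

```
In [66]:
```
```
row_mask = np.array([True, False, True, False, False])
B[row_mask]
```

```
Out[66]:
```

```
In [67]:
```
```
# same thing
row_mask = np.array([1,0,1,0,0], dtype=bool)
B[row_mask]
```

```
Out[67]:
```

```
In [68]:
```
```
x = np.arange(0, 10, 0.5)
x
```

```
Out[68]:
```

```
In [69]:
```
```
mask = (5 < x) * (x < 7.5)
mask
```

```
Out[69]:
```

Why do we use `*` instead of `and` if it's a logical expression? Because Numpy says so, that's why >:D More precisely, `and` asks each array for a single truth value, which Numpy refuses to give for arrays with more than one element, so we need an operator that works element by element, such as `*` or, more commonly, `&`.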
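To see what actually happens with `and`, and which alternatives to `*` exist, here is a small sketch on the same `x`. The `&` operator and `np.logical_and()` are standard Numpy ways of combining masks element by element.

```python
import numpy as np

x = np.arange(0, 10, 0.5)

mask_product = (5 < x) * (x < 7.5)          # the form used above
mask_and = (5 < x) & (x < 7.5)              # element-wise "and"; keep the parentheses
mask_func = np.logical_and(5 < x, x < 7.5)  # the same thing as a function

# Plain `and` needs one truth value per array, and Numpy raises an error.
try:
    (5 < x) and (x < 7.5)
    raised = False
except ValueError:
    raised = True

print(x[mask_and], raised)
```

All three masks are identical, so which one to use is mostly a matter of readability.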

```
In [70]:
```
```
x[mask]
```

```
Out[70]:
```

Often it is useful to store datasets in Numpy arrays. Numpy provides a number of functions to calculate statistics of datasets in arrays.

For example, let's calculate some properties of the data from the temperatures dataset used above.

```
In [80]:
```
```
# reminder, the temperature dataset is stored in the data variable:
np.shape(data)
```

```
Out[80]:
```

```
In [81]:
```
```
# the temperature data is in column 3
np.mean(data[:,3])
```

```
Out[81]:
```

The daily mean temperature over the last 200 years has been about 6.2 ºC.

```
In [85]:
```
```
np.std(data[:,3]), np.var(data[:,3])
```

```
Out[85]:
```

```
In [86]:
```
```
# lowest daily average temperature
data[:,3].min()
```

```
Out[86]:
```

```
In [87]:
```
```
# highest daily average temperature
data[:,3].max()
```

```
Out[87]:
```

```
In [88]:
```
```
d = np.arange(0, 10)
d
```

```
Out[88]:
```

```
In [89]:
```
```
# sum up all elements
np.sum(d)
```

```
Out[89]:
```

```
In [90]:
```
```
# product of all elements
np.prod(d + 1)
```

```
Out[90]:
```

```
In [91]:
```
```
# cumulative sum
np.cumsum(d)
```

```
Out[91]:
```

```
In [92]:
```
```
# cumulative product
np.cumprod(d + 1)
```

```
Out[92]:
```

```
In [93]:
```
```
# same as: np.diag(A).sum()
A = np.array([[n+m*10 for n in range(5)] for m in range(5)])
np.trace(A)
```

```
Out[93]:
```

```
In [94]:
```
```
data.mean(axis=0) # mean of each column (averaging over the first dimension)
```

```
Out[94]:
```

Or the mean of each row:

```
In [95]:
```
```
data.mean(axis=1) # mean of each row (averaging over the second dimension)
```

```
Out[95]:
```
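The `axis` argument is easier to follow on an array small enough to check by hand. This sketch uses a made-up 2 by 3 array instead of the temperature data.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

column_means = a.mean(axis=0)  # collapses the rows: one value per column
row_means = a.mean(axis=1)     # collapses the columns: one value per row
overall = a.mean()             # no axis: a single number for the whole array

print(column_means, row_means, overall)
```

A handy rule of thumb is that the axis you pass is the one that disappears from the shape of the result.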

Activity

The data in [`populations.txt`](data/populations.txt) describes the populations of hares and lynxes (and carrots) in northern Canada over 20 years. Compute and print, based on the data in `populations.txt`:

1. The mean and standard deviation of the populations of each species for the years in the period.

2. In which year each species had the largest population.

3. Which species had the largest population in each year. (*Hint*: `argsort` and fancy indexing of `np.array(['Hare', 'Lynx', 'Carrot'])`).

4. In which years any of the populations was above 50000. (*Hint*: comparisons and `np.any`).

5. The 2 years in which each species had its lowest populations. (*Hint*: `argsort`, fancy indexing).

And everything without a single for loop.
[Solution](data/populations.py)
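If the hints feel too abstract, the following warm-up shows the same tools on invented numbers. It is not the solution to the activity: the years, species names and counts are all made up, and there are only two species.

```python
import numpy as np

# Made-up numbers, NOT the real dataset: 4 years x 2 hypothetical species.
years = np.array([2000, 2001, 2002, 2003])
names = np.array(["A", "B"])
counts = np.array([[10, 70],
                   [40, 20],
                   [60, 30],
                   [50, 45]])

# Year of the maximum for each species (one result per column).
best_year = years[np.argmax(counts, axis=0)]

# Name of the largest species for each year (one result per row).
dominant = names[np.argmax(counts, axis=1)]

# Years in which any species is above 55.
busy_years = years[np.any(counts > 55, axis=1)]

# The two years with the lowest counts for each species (one column each).
lowest_two = years[np.argsort(counts, axis=0)[:2]]

print(best_year, dominant, busy_years)
print(lowest_two)
```

In every line an array of positions or booleans is computed first and then used as a fancy index, which is what lets us avoid `for` loops.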

```
In [3]:
```
```
data = np.loadtxt('data/populations.txt')
year, hares, lynxes, carrots = data.T
species = np.array(['Hare', 'Lynx', 'Carrot'])
data
```

```
Out[3]:
```

```
In [123]:
```
```
# we first create an array with the populations
populations = #...
populations
```

```
Out[123]:
```

```
In [143]:
```
```
# 1
means = # ...
stds = # ...
print("\t", "\t\t".join(species))
print("Mean:", means)
print("Std:", stds)
```

```
In [148]:
```
```
np.argmax(populations, axis=0)
```

```
Out[148]:
```

```
In [142]:
```
```
# 2
max_year = # ...
species_max_years = # ...
print("\t ", " ".join(species))
print("Max. year:", species_max_years)
```

```
In [111]:
```
```
# 3
max_species = # ...
max_species_names = # ...
print("Max species:")
print(year)
print(max_species_names)
```

```
In [112]:
```
```
# 4
above_50k = # ...
year_above_50k = # ...
print("Any above 50000:", year_above_50k)
```

```
In [115]:
```
```
# 5
top_2 = # ...
top_2_year = # ...
print("Top 2 years with lowest populations for each:")
print(" ", " ".join(species))
print(top_2_year)
```

Numpy makes a distinction between a 2-dimensional (2D) array and a matrix. Every matrix is definitely a 2D array, but not every 2D array is a matrix. If we want Python to treat the Numpy arrays as matrices and vectors, we have to explicitly say so.

Like this:

```
In [71]:
```
```
a = np.eye(4) # Fills the diagonal elements with 1's
a = np.matrix(a)
a
```

```
Out[71]:
```

Now, we can do complex matrix operations over `a`.

Another way is by keeping your arrays as arrays and using specific Numpy functions that make the conversion for you, so your arrays will be always arrays.
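As a sketch of the "keep your arrays as arrays" route: plain arrays already support matrix multiplication through the `@` operator (Python 3.5 or newer) and `np.dot()`, while `*` stays element-wise. Numpy's own documentation no longer recommends `np.matrix`, so this is usually the safer habit. The array below is made up.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])

elementwise = a * a           # multiplies matching entries
matrix_product = a @ a        # true matrix multiplication on plain arrays
same_with_dot = np.dot(a, a)  # the function form of the same product

print(elementwise)
print(matrix_product)
```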

The index mask can be converted to position indices using the `where` function:

```
In [72]:
```
```
indices = np.where(mask)
indices
```

```
Out[72]:
```

```
In [73]:
```
```
x[indices] # this indexing is equivalent to the fancy indexing x[mask]
```

```
Out[73]:
```

With the `diag` function we can also extract the diagonal and subdiagonals of an array:

```
In [74]:
```
```
np.diag(A)
```

```
Out[74]:
```

```
In [75]:
```
```
np.diag(A, -1)
```

```
Out[75]:
```

The `take` function is similar to the fancy indexing described above:

```
In [76]:
```
```
v2 = np.arange(-3, 3)
row_indices = [1, 3, 5]
v2[row_indices] # fancy indexing
```

```
Out[76]:
```

```
In [77]:
```
```
v2.take(row_indices)
```

```
Out[77]:
```

But `take` also works on lists and other objects:

```
In [78]:
```
```
np.take([-3, -2, -1, 0, 1, 2], row_indices)
```

```
Out[78]:
```

The `choose` function constructs an array by picking elements from several arrays:

```
In [79]:
```
```
which = [1, 0, 1, 0]
choices = [[-2,-2,-2,-2], [5,5,5,5]]
np.choose(which, choices)
```

```
Out[79]:
```

The SciPy framework builds on top of the low-level Numpy framework for multidimensional arrays, and provides a large number of higher-level scientific algorithms. Some of the topics that SciPy covers are:

- Special functions (scipy.special)
- Integration (scipy.integrate)
- Optimization (scipy.optimize)
- Interpolation (scipy.interpolate)
- Fourier Transforms (scipy.fftpack)
- Signal Processing (scipy.signal)
- Linear Algebra (scipy.linalg)
- Sparse Eigenvalue Problems (scipy.sparse)
- Statistics (scipy.stats)
- Multi-dimensional image processing (scipy.ndimage)
- File IO (scipy.io)

Each of these submodules provides a number of functions and classes that can be used to solve problems in their respective topics. For what we are concerned with, we will only look at the `stats` subpackage.

The `scipy.stats` module contains a large number of statistical distributions, statistical functions, and tests. There is also a very powerful Python package for statistical modelling called `statsmodels`.
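To give a first taste of `scipy.stats`, here is a short sketch that touches one distribution, one descriptive function and one statistical measure. The samples are invented, and this barely scratches the surface of the module.

```python
import numpy as np
from scipy import stats

# A standard normal distribution object (mean 0, standard deviation 1).
normal = stats.norm(loc=0, scale=1)
half = normal.cdf(0)  # probability of getting a value below the mean

# Descriptive statistics of a small made-up sample.
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
summary = stats.describe(sample)

# Pearson correlation between two perfectly linear made-up series.
r, p_value = stats.pearsonr([1, 2, 3, 4], [2, 4, 6, 8])

print(half, summary.nobs, summary.mean, summary.minmax, r)
```

`stats.describe()` returns several values at once (number of observations, minimum and maximum, mean, variance, and more), which is convenient for a quick look at a dataset.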

- Read 10 Minutes to Pandas.
- Read about Series and DataFrame.
- Read Chapters 1-5 from Julia Evans' pandas-cookbook.