[Data, the Humanist's New Best Friend](index.ipynb)
Class 05

In this class you are expected to learn:

Getting data into and out of Python
Objects
NumPy
Arrays and matrices
SciPy

Like `np.array(np.array(np.array([1, 2])))`?

Files

We will be using Numpy and Pandas methods to read and write files, but those libraries use Python builtin file() function underneath. We will be using several files, so if you don't have them downloaded yet, do it now! Put them in a folder data in your IPython Notebook folder. The list is:

To read a file at once:



In [1]:

    
f = open('data/example.txt', 'r')  # 'r' stands for reading
s = f.read()
print(s)
f.close()  # After finishing with the file, we must close it









    



This is a plain text file.
As we already know, Python code is itself written in plain text.

Each line is retrieved by Python, one at a time.

You can also iterate over a file, line by line:



In [2]:

    
f = open('data/example.txt', 'r')
for line in f:
    print(line)
f.close()









    



This is a plain text file.

As we already know, Python code is itself written in plain text.



Each line is retrieved by Python, one at a time.

And writing is equally simple:



In [3]:

    
f = open('test_file.txt', 'w')  # 'w' opens for writing
f.write('This is a test \nand another test')
f.close()

There are several file modes:

Read-only: r
Write-only: w. Note: Create a new file or overwrite existing file.
Append a file: a
Read and Write: r+
Binary mode: b. Note: Use for binary files, especially on Windows.



In [4]:

    
open('test_file.txt', 'r').read()









    Out[4]:





'This is a test \nand another test'

That weird character \n is a way to encode a new line using in plain text. There is other characters, like \t, that prints a tab.

Text files are only able to handle a small set of characters to represent all that can be written. For historical reasons, that set of characters is the same that the 28 letters of English, plus some other single symbols such as dollar, point, slash, asterisk, etc. up to 128 single characters. That set is called the ASCII. But now think about other weird characters like the Spanish ñ, the French circunflect accent rôle, or even the whole Greek alphabet, φοβερός. If everything is at the end a character from the ASCII, how's that you can write those complex characters?

Well, to do so, we use an encoding, that is simply a way to code complex characters, such as π, using only ASCII characters. The problem is that there are many different ways of doing that encoding, and that's the reason why sometimes a file from a computer looks weird in the other. Windows uses by default a different encoding that Mac.

Fortunately, UTF-8 is becoming the de-facto stantard for text encoding in the Internet, and eventually for every system out there. It tries to represent not only ASCII symbols by a whole bunch more characters, including those from Greek, Chinese or Korean. That huge set of characters is known as Unicode.

The newest version of Python, which is 3, now uses Unicode as default, encoded using UTF8 (it was ASCII in Python 2, HALP!). You don't need to tell it that the characters are in UTF8 anymore, but it's always better to do so by adding the comment # -*- coding: utf8 -*- at the beginning of the source file. Furthermore, in previous versions, you needed to put a prefix, u"", to define a string as Unicode. Now the u"" is optional, but if you want to write code that works for current and older versions of Python, it's a good idea to always define strings by using the prefix.

Let's see how are actually encoded some symbols.



In [5]:

    
"regular string"









    Out[5]:





'regular string'



In [6]:

    
u"string with morriña and other stuff, like περισπωμένη"









    Out[6]:





'string with morriña and other stuff, like περισπωμένη'

If you had Python 2, what you'd see as the output would be be very different that what you defined as a string. You would see something like:

In [1]: u"string with morriña and other stuff, like περισπωμένη"
Out[1]: u'string with morri\xf1a and other stuff, like \u03c0\u03b5\u03c1\u03b9\u03c3\u03c0\u03c9\u03bc\u03ad\u03bd\u03b7'

And that is even different when not using the u"" prefix:

In [2]: "string with morriña and other stuff, like περισπωμένη"
Out[2]: 'string with morri\xc3\xb1a and other stuff, like \xcf\x80\xce\xb5\xcf\x81\xce\xb9\xcf\x83\xcf\x80\xcf\x89\xce\xbc\xce\xad\xce\xbd\xce\xb7'

Yes, having Python handle that for you now is beyond nice.

When you are writing code, character encoding is easy to handle, but it may become a real problem when manipulating text files, because it is virtually impossible to know the encoding of a text by simply reading it. And that is a problem. Our only option is to makes guesses, which is not very accurate.

In order to make things easier, the Standard Python Library includes codecs, a module that can force the encoding when manipulating text files.

Python 3 assumes that everything is in UTF8, but older version assumed plain ASCII.



In [7]:

    
open("data/utf8.txt").read()









    Out[7]:





'Pérez, Javier\nGonzález, Manuel'



In [8]:

    
import codecs
codecs.open("data/utf8.txt", encoding="utf8").read()









    Out[8]:





'Pérez, Javier\nGonzález, Manuel'

Not much difference here. But let's try with the encoding usually used in Windows, the hideous ISO 8859-1, aka latin1.



In [9]:

    
open("data/latin1.txt").read()









    



---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-9-c028cc44e5fc> in <module>()
----> 1 open("data/latin1.txt").read()

/home/versae/.venvs/dh2304/lib/python3.4/codecs.py in decode(self, input, final)
    311         # decode input (taking the buffer into account)
    312         data = self.buffer + input
--> 313         (result, consumed) = self._buffer_decode(data, self.errors, final)
    314         # keep undecoded input until the next call
    315         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 1: invalid continuation byte

UnicodeDecodeError has been the nightmare of Python programmers for decades, even centuries.

Because we now know codecs, we can use it!



In [10]:

    
codecs.open("data/latin1.txt").read()









    Out[10]:





b'P\xe9rez, Javier\nGonz\xe1lez, Manuel\n'

Instead of raising an error, just return the file as a string of bytes, so the user can do whatever he needs in low level. But we know the encoding for that file, so let's use it.



In [11]:

    
codecs.open("data/latin1.txt", encoding="latin1").read()









    Out[11]:





'Pérez, Javier\nGonzález, Manuel\n'

Objects and classes

Object oriented programming, or OOP, is a way of structuring and building programs by representing ideas and concepts from the real world as objects with properties and even functions associated with them. It's a very very commonly used paradigm.

For example, we can have an object representing an author, a book, a map, or a painting. OOP is based on three principles: encapsulation, polymorphism, and inheritance; but what we need to know about objects is simpler than that.

Let's say that we want to represent paintings. Like numbers that are of type int(), we want objects that are of type Painting. Roughly, to define those types Python uses the keyword class, so we can create our own classes with the behaviour we expect. All we need to do is define the class with the attributes we want a painting to have. Let's say we just want the title and the name of the author.



In [12]:

    
class Painting:
    title = ""
    author = ""

We can now create objects of the class Painting, which is called instantiating a class, therefore, objects created this way are usually called instances of the class. And it makes sense, because Las Meninas is an instance of a Painting, philosophically speaking.



In [13]:

    
las_meninas = Painting()
las_meninas.title = "Las Meninas"
las_meninas.author = "Velázquez"
print(las_meninas)









    



<__main__.Painting object at 0x7fc58811b860>

But creating instances that way doesn't feel very natural. Python has a better way to define a class that allows to do something like this:

las_meninas = Painting(title="Las Meninas", author="Velázquez")

All we need to do is define a special method, __init__(), which is just a function inside our class which first argument is the object itself, called self. It sounds complicated, but it's not, trust me.



In [14]:

    
class Painting:
    
    def __init__(self, title, author):
        self.title = title
        self.author = author



In [15]:

    
las_meninas = Painting(title="Las Meninas", author="Velázquez")
print(las_meninas)









    



<__main__.Painting object at 0x7fc5880c2240>

But that printing is really awful. Python defines several special functions that, like __init__(), alters the behaviour of the instantiated objects. One of those is __str__(). Let's see how it works.



In [16]:

    
class Painting:
    
    def __init__(self, title, author):
        self.title = title
        self.author = author
    
    def __str__(self):
        return "Painting '{title}' by '{author}'".format(
            title=self.title,
            author=self.author,
        )



In [17]:

    
las_meninas = Painting(title="Las Meninas", author="Velázquez")
print(las_meninas)









    



Painting 'Las Meninas' by 'Velázquez'

What it's happening, is that when we execute print() passing an instance of Painting, the method __str__() gets called or invoked. Although this is just a fancy feature, it gets really useful.

But the real power of classes is to define your own methods. For example, I might want a method that, given an Painting object, tells me whether the author is Velázquez or not.



In [18]:

    
class Painting:
    
    def __init__(self, title, author):
        self.title = title
        self.author = author
    
    def __str__(self):
        return "Painting '{title}' by '{author}'".format(
            title=self.title,
            author=self.author,
        )
    
    def is_painted_by_velazquez(self):
        """Returns if the painting was painted by Velazquez"""
        return self.author.lower() == "velázquez"



In [19]:

    
las_meninas = Painting(title="Las Meninas", author="Velázquez")
las_meninas.is_painted_by_velazquez()









    Out[19]:





True



In [20]:

    
guernica = Painting(title="Guernica", author="Picasso")
guernica.is_painted_by_velazquez()









    Out[20]:





False

Despite the fun of defining classes, we won't be using them that much. Instead, we will use a lot of classes defined by other people in packaged modules ready to use.

Activity

What do you think `self.author.lower()` does? How would you improve that method to detect different ways of writing the name?

Numpy



In [2]:

    
import numpy as np  # Now we have all the methods from Numpy in our namespace np!

The Numpy module (almost always imported as np) is used in almost all numerical computation using Python. Numpy is one of those external Python modules that predefines a lot of classes for you to use. The good thing about abouy Numpy is that is really really efficient, both in terms of memory and performance.

As an example, let's sum the first million numbers, and count the time by using the IPython magic command %timeit.



In [24]:

    
%timeit sum(range(10 ** 6))









    



10 loops, best of 3: 27.9 ms per loop



In [25]:

    
%timeit np.sum(np.arange(10 ** 6))









    



100 loops, best of 3: 3 ms per loop

Well, 27.9 ms vs. 3 ms, that's almost 10 times faster!

The main reason behind this velocity is that Python is a dynamic language, and some of its awesomeness is traded off from its performance and memory usage. Numpy uses optimiezed C code underneath that runs as fast as a thunder because it's compiled. Compiled means that we can tell the computer to do stuff in its own language, which is fast because there is no need for any translation in between. But the good news are that we can use that code in Python by using libraries such as Numpy.

Arrays

In the Numpy package the terminology used for vectors, matrices and higher-dimensional data sets is array. If you haven't heard before about matrices think about them like tables or spreadsheets. Another analogy is to think of vectors as a shelf, matrices as a bookshelf, 3 dimensions matrices as a room full of bookshelfs, 4 dimensions as buildings with rooms, and so on. In this sense, if we had a 5 dimensional matrix and want one single element, we are actually asking for something like the book number 1, that is in the shelf 3, of the bookshelf number 2, in the room number 7, and in the building number 5. In Python notation, having that matrix as m, that would be something like:

m[5][7][2][3][1]
  |  |  |  |  |- book 1
  |  |  |  |---- shelf 3
  |  |  |------- bookshelf 2
  |  |---------- room 7
  |------------- building 5

But hopefully we won't handle data that complex.

`many accurate`

There are a number of ways to initialize new numpy arrays, for example from a Python list or tuples using functions that are dedicated to generating numpy arrays, such as arange(), zeros(), etc., or reading data from files.

From lists

For example, to create new vector and matrix arrays from Python lists we can use the numpy.array() function.



In [26]:

    
# a vector: the argument to the array function is a Python list
v = np.array([1, 2, 3, 4])
v









    Out[26]:





array([1, 2, 3, 4])



In [27]:

    
# a matrix: the argument to the array function is a nested Python list
M = np.array([[1, 2], [3, 4]])
M









    Out[27]:





array([[1, 2],
       [3, 4]])

The v and M objects are both of the type ndarray that the numpy module provides.



In [28]:

    
type(v), type(M)









    Out[28]:





(numpy.ndarray, numpy.ndarray)

The difference between the v and M arrays is only their shapes. We can get information about the shape of an array by using the ndarray.shape property.



In [29]:

    
v.shape, M.shape









    Out[29]:





((4,), (2, 2))

The number of elements in the array is available through the ndarray.size property:



In [30]:

    
M.size









    Out[30]:





4

Equivalently, we could use the function numpy.shape() and numpy.size()



In [31]:

    
np.shape(M), np.size(M)









    Out[31]:





((2, 2), 4)

So far the numpy.ndarray looks awefully much like a Python list (or nested list). Why not simply use Python lists for computations instead of creating a new array type? Remeber what we said about performance and memory.

There are several reasons:

Python lists are very general. They can contain any kind of object. They are dynamically typed. They do not support mathematical functions such as matrix and dot multiplications, etc. Implementating such functions for Python lists would not be very efficient because of the dynamic typing.
Numpy arrays are statically typed and homogeneous. The type of the elements is determined when array is created.
Numpy arrays are memory efficient.
Because of the static typing, fast implementation of mathematical functions such as multiplication and addition of numpy arrays can be implemented in a compiled language (C and Fortran are used).

Using the dtype (data type) property of an ndarray, we can see what type the data of an array has:



In [32]:

    
M.dtype









    Out[32]:





dtype('int64')

We get an error if we try to assign a value of the wrong type to an element in a numpy array:



In [33]:

    
M[0,0] = "hello"









    



---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-33-a09d72434238> in <module>()
----> 1 M[0,0] = "hello"

ValueError: invalid literal for int() with base 10: 'hello'

If we want, we can explicitly define the type of the array data when we create it, using the dtype keyword argument:



In [34]:

    
M = np.array([[1, 2], [3, 4]], dtype=complex)  # Complex numbers
M









    Out[34]:





array([[ 1.+0.j,  2.+0.j],
       [ 3.+0.j,  4.+0.j]])

Common type that can be used with dtype are: int, float, complex, bool, object, etc.

We can also explicitly define the bit size of the data types, for example: int64, int16, float128, complex128.

Activity

Write a function, `print_items` that receives as the argument an Numpy array of size (2, 2), and prints out the items of the array.
For example, `print_items(np.array([[1, 2], [3, 4]]))` should print
` 1 2 3 4 `

Activity

Write a function, `flatten()` that receives as the argument a two dimensional Numpy array, and returns a list of the elements of the array.
For example, `flatten(np.array([[1, 2], [3, 4]]))` should return `[1, 2, 3, 4]`.

Using array-generating functions

For larger arrays it is impractical to initialize the data manually, using explicit Python lists. Instead we can use one of the many functions in Numpy that generates arrays of different forms. Some of the more common are:

`arange`



In [35]:

    
# create a range
x = np.arange(0, 10, 1) # arguments: start, stop, step
x









    Out[35]:





array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])



In [36]:

    
x = np.arange(-1, 1, 0.1)
x









    Out[36]:





array([ -1.00000000e+00,  -9.00000000e-01,  -8.00000000e-01,
        -7.00000000e-01,  -6.00000000e-01,  -5.00000000e-01,
        -4.00000000e-01,  -3.00000000e-01,  -2.00000000e-01,
        -1.00000000e-01,  -2.22044605e-16,   1.00000000e-01,
         2.00000000e-01,   3.00000000e-01,   4.00000000e-01,
         5.00000000e-01,   6.00000000e-01,   7.00000000e-01,
         8.00000000e-01,   9.00000000e-01])

`zeros` and `ones`



In [37]:

    
np.zeros((3,3))









    Out[37]:





array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])



In [38]:

    
np.ones((3,3))









    Out[38]:





array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

From Tab and Comma-separated values (CSV)

A very common file format for data files are the comma-separated values (CSV), or related format such as TSV (tab-separated values). To read data from such file into Numpy arrays we can use the numpy.genfromtxt() function. For example,



In [39]:

    
!head data/temperatures.dat
data = np.genfromtxt('data/temperatures.dat')
data.shape









    



1800  1  1    -6.1    -6.1    -6.1 1
1800  1  2   -15.4   -15.4   -15.4 1
1800  1  3   -15.0   -15.0   -15.0 1
1800  1  4   -19.3   -19.3   -19.3 1
1800  1  5   -16.8   -16.8   -16.8 1
1800  1  6   -11.4   -11.4   -11.4 1
1800  1  7    -7.6    -7.6    -7.6 1
1800  1  8    -7.1    -7.1    -7.1 1
1800  1  9   -10.1   -10.1   -10.1 1
1800  1 10    -9.5    -9.5    -9.5 1






    Out[39]:





(77431, 7)

Using numpy.savetxt() we can store a Numpy array to a file in CSV format:



In [40]:

    
M = np.random.rand(3,3)
np.savetxt("random-matrix.csv", M, fmt='%.5f')  # fmt specifies the format
!cat random-matrix.csv









    



0.69101 0.08227 0.29611
0.05398 0.09247 0.20079
0.22264 0.24376 0.22630

More properties of the numpy arrays



In [41]:

    
M.itemsize # bytes per element









    Out[41]:





8



In [42]:

    
M.nbytes # number of bytes









    Out[42]:





72



In [43]:

    
M.ndim # number of dimensions









    Out[43]:





2

Manipulating arrays

Indexing

We can index elements in an array using the square bracket and indices:



In [44]:

    
# v is a vector, and has only one dimension, taking one index
v[0]









    Out[44]:





1



In [45]:

    
# M is a matrix, or a 2 dimensional array, taking two indices 
M[1,1]









    Out[45]:





0.092470178889929144

If we omit an index of a multidimensional array it returns the whole row (or, in general, a N-1 dimensional array)



In [46]:

    
M









    Out[46]:





array([[ 0.69100592,  0.08227421,  0.29610681],
       [ 0.05398449,  0.09247018,  0.20079446],
       [ 0.22264071,  0.24375519,  0.22630334]])



In [47]:

    
M[1]









    Out[47]:





array([ 0.05398449,  0.09247018,  0.20079446])

The same thing can be achieved by using : instead of an index:



In [48]:

    
M[1,:] # row 1









    Out[48]:





array([ 0.05398449,  0.09247018,  0.20079446])



In [49]:

    
M[:,1] # column 1









    Out[49]:





array([ 0.08227421,  0.09247018,  0.24375519])

We can assign new values to elements in an array using indexing:



In [50]:

    
M[0,0] = 1
M









    Out[50]:





array([[ 1.        ,  0.08227421,  0.29610681],
       [ 0.05398449,  0.09247018,  0.20079446],
       [ 0.22264071,  0.24375519,  0.22630334]])



In [51]:

    
# also works for rows and columns
M[1,:] = 0
M[:,2] = -1
M









    Out[51]:





array([[ 1.        ,  0.08227421, -1.        ],
       [ 0.        ,  0.        , -1.        ],
       [ 0.22264071,  0.24375519, -1.        ]])

Index slicing

Index slicing is the technical name for the syntax M[lower:upper:step] to extract part of an array:



In [52]:

    
A = np.array([1,2,3,4,5])
A
A[1:3]









    Out[52]:





array([2, 3])

Array slices are mutable: if they are assigned a new value the original array from which the slice was extracted is modified:



In [53]:

    
A[1:3] = [-2,-3]
A









    Out[53]:





array([ 1, -2, -3,  4,  5])

We can omit any of the three parameters in M[lower:upper:step]:



In [54]:

    
A[::] # lower, upper, step all take the default values









    Out[54]:





array([ 1, -2, -3,  4,  5])



In [55]:

    
A[::2] # step is 2, lower and upper defaults to the beginning and end of the array









    Out[55]:





array([ 1, -3,  5])



In [56]:

    
A[:3] # first three elements









    Out[56]:





array([ 1, -2, -3])



In [57]:

    
A[3:] # elements from index 3









    Out[57]:





array([4, 5])

Negative indices count from the end of the array (positive index from the begining):



In [58]:

    
A = np.array([1,2,3,4,5])
A[-1] # the last element in the array









    Out[58]:





5



In [59]:

    
A[-3:] # the last three elements









    Out[59]:





array([3, 4, 5])

Index slicing works exactly the same way for multidimensional arrays:



In [60]:

    
A = np.array([[n + m * 10 for n in range(5)] for m in range(5)])
A









    Out[60]:





array([[ 0,  1,  2,  3,  4],
       [10, 11, 12, 13, 14],
       [20, 21, 22, 23, 24],
       [30, 31, 32, 33, 34],
       [40, 41, 42, 43, 44]])



In [61]:

    
# a block from the original array
A[1:4, 1:4]









    Out[61]:





array([[11, 12, 13],
       [21, 22, 23],
       [31, 32, 33]])



In [62]:

    
# strides
A[::2, ::2]









    Out[62]:





array([[ 0,  2,  4],
       [20, 22, 24],
       [40, 42, 44]])

Activity

Wait a second, what was that? Spend some time trying to understand what the next Python expression actually does.
` [i for i in range(10)] `



In [ ]:

    
[i for i in range(10)]  # hmmm

Fancy indexing

Fancy indexing is the name for when an array or list is used in-place of an index:



In [63]:

    
row_indices = [1, 2, 3]
A[row_indices]









    Out[63]:





array([[10, 11, 12, 13, 14],
       [20, 21, 22, 23, 24],
       [30, 31, 32, 33, 34]])



In [64]:

    
col_indices = [1, 2, -1] # remember, index -1 means the last element
A[row_indices, col_indices]









    Out[64]:





array([11, 22, 34])

We can also index masks: If the index mask is an Numpy array of data type bool, then an element is selected (True) or not (False) depending on the value of the index mask at the position each element:



In [65]:

    
B = np.array([n for n in range(5)])
B









    Out[65]:





array([0, 1, 2, 3, 4])



In [66]:

    
row_mask = np.array([True, False, True, False, False])
B[row_mask]









    Out[66]:





array([0, 2])



In [67]:

    
# same thing
row_mask = np.array([1,0,1,0,0], dtype=bool)
B[row_mask]









    Out[67]:





array([0, 2])

This feature is very useful to conditionally select elements from an array, using for example comparison operators:



In [68]:

    
x = np.arange(0, 10, 0.5)
x









    Out[68]:





array([ 0. ,  0.5,  1. ,  1.5,  2. ,  2.5,  3. ,  3.5,  4. ,  4.5,  5. ,
        5.5,  6. ,  6.5,  7. ,  7.5,  8. ,  8.5,  9. ,  9.5])



In [69]:

    
mask = (5 < x) * (x < 7.5)
mask









    Out[69]:





array([False, False, False, False, False, False, False, False, False,
       False, False,  True,  True,  True,  True, False, False, False,
       False, False], dtype=bool)

Why we use * instead of and if it's a logical expression? Because Numpy says so, that's why >:D



In [70]:

    
x[mask]









    Out[70]:





array([ 5.5,  6. ,  6.5,  7. ])

Data processing

Often it is useful to store datasets in Numpy arrays. Numpy provides a number of functions to calculate statistics of datasets in arrays.

For example, let's calculate some properties data from the temperatures dataset used above.



In [80]:

    
# reminder, the tempeature dataset is stored in the data variable:
np.shape(data)









    Out[80]:





(77431, 7)

`mean`



In [81]:

    
# the temperature data is in column 3
np.mean(data[:,3])









    Out[81]:





6.1971096847515854

The daily mean temperature over the last 200 year has been about 6.2 ºC.

Standard deviations and variance



In [85]:

    
np.std(data[:,3]), np.var(data[:,3])









    Out[85]:





(8.2822716213405734, 68.596023209663414)

`min` and `max`



In [86]:

    
# lowest daily average temperature
data[:,3].min()









    Out[86]:





-25.800000000000001



In [87]:

    
# highest daily average temperature
data[:,3].max()









    Out[87]:





28.300000000000001

`sum`, `prod`, and `trace`



In [88]:

    
d = np.arange(0, 10)
d









    Out[88]:





array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])



In [89]:

    
# sum up all elements
np.sum(d)









    Out[89]:





45



In [90]:

    
# product of all elements
np.prod(d + 1)









    Out[90]:





3628800



In [91]:

    
# cummulative sum
np.cumsum(d)









    Out[91]:





array([ 0,  1,  3,  6, 10, 15, 21, 28, 36, 45])



In [92]:

    
# cummulative product
np.cumprod(d + 1)









    Out[92]:





array([      1,       2,       6,      24,     120,     720,    5040,
         40320,  362880, 3628800])



In [93]:

    
# same as: diag(A).sum()
A = np.array([[n+m*10 for n in range(5)] for m in range(5)])
np.trace(A)









    Out[93]:





110

Axis

Most methods of array can be also applied per column, per row, or per any other dimension. For example, if I wanted to calculate the average values per column and the sums per row, we have to specify the different axis. Remember that array methods can be invoked directly from the array object.



In [94]:

    
data.mean(axis=0)   # columns (first dimension)









    Out[94]:





array([  1.47544802e+08,   5.05086000e+05,   1.21793500e+06,
         4.79848400e+05,   4.51590800e+05,   4.47974100e+05,
         7.74310000e+04])

Or the sum of the rows



In [95]:

    
data.mean(axis=1)   # rows (second dimension)









    Out[95]:





array([ 254.95714286,  251.11428571,  251.42857143, ...,  295.18571429,
        293.48571429,  292.25714286])

Activity

The data in [`populations.txt`](data/populations.txt) describes the populations of hares and lynxes (and carrots) in northern Canada during 20 years. Compute and print, based on the data in `populations.txt`:
1. The mean and standard deviation of the populations of each species for the years in the period.
2. Which year each species had the largest population?
3. Which species has the largest population for each year. (*Hint*: argsort & fancy indexing of `np.array(['Hare', 'Lynx', 'Carrot'])`).
4. Which years any of the populations is above 50000. (*Hint*: comparisons and `np.any`).
5. The top 2 years for each species when they had the lowest populations. (*Hint*: `argsort`, fancy indexing).
And everything without a single for loop. [Solution](data/populations.py)



In [3]:

    
data = np.loadtxt('data/populations.txt')
year, hares, lynxes, carrots = data.T
species = np.array(['Hare', 'Lynx', 'Carrot'])
data









    Out[3]:





array([[  1900.,  30000.,   4000.,  48300.],
       [  1901.,  47200.,   6100.,  48200.],
       [  1902.,  70200.,   9800.,  41500.],
       [  1903.,  77400.,  35200.,  38200.],
       [  1904.,  36300.,  59400.,  40600.],
       [  1905.,  20600.,  41700.,  39800.],
       [  1906.,  18100.,  19000.,  38600.],
       [  1907.,  21400.,  13000.,  42300.],
       [  1908.,  22000.,   8300.,  44500.],
       [  1909.,  25400.,   9100.,  42100.],
       [  1910.,  27100.,   7400.,  46000.],
       [  1911.,  40300.,   8000.,  46800.],
       [  1912.,  57000.,  12300.,  43800.],
       [  1913.,  76600.,  19500.,  40900.],
       [  1914.,  52300.,  45700.,  39400.],
       [  1915.,  19500.,  51100.,  39000.],
       [  1916.,  11200.,  29700.,  36700.],
       [  1917.,   7600.,  15800.,  41800.],
       [  1918.,  14600.,   9700.,  43300.],
       [  1919.,  16200.,  10100.,  41300.],
       [  1920.,  24700.,   8600.,  47300.]])



In [123]:

    
# we first create an array with the populations
populations = #...
populations









    Out[123]:





array([[ 30000.,   4000.,  48300.],
       [ 47200.,   6100.,  48200.],
       [ 70200.,   9800.,  41500.],
       [ 77400.,  35200.,  38200.],
       [ 36300.,  59400.,  40600.],
       [ 20600.,  41700.,  39800.],
       [ 18100.,  19000.,  38600.],
       [ 21400.,  13000.,  42300.],
       [ 22000.,   8300.,  44500.],
       [ 25400.,   9100.,  42100.],
       [ 27100.,   7400.,  46000.],
       [ 40300.,   8000.,  46800.],
       [ 57000.,  12300.,  43800.],
       [ 76600.,  19500.,  40900.],
       [ 52300.,  45700.,  39400.],
       [ 19500.,  51100.,  39000.],
       [ 11200.,  29700.,  36700.],
       [  7600.,  15800.,  41800.],
       [ 14600.,   9700.,  43300.],
       [ 16200.,  10100.,  41300.],
       [ 24700.,   8600.,  47300.]])



In [143]:

    
# 1
means = # ...
stds = # ...
print("\t", "\t\t".join(species))
print("Mean:", means)
print("Std:", stds)









    



	 Hare		Lynx		Carrot
Mean: [ 34080.95238095  20166.66666667  42400.        ]
Std: [ 20897.90645809  16254.59153691   3322.50622558]



In [148]:

    
np.argmax(populations, axis=0)









    Out[148]:





array([3, 4, 0])



In [142]:

    
# 2
max_year = # ...
species_max_years = # ...
print("\t    ", "   ".join(species))
print("Max. year:", species_max_years)









    



	     Hare   Lynx   Carrot
Max. year: [ 1903.  1904.  1900.]



In [111]:

    
# 3
max_species = # ...
max_species_names = # ...
print("Max species:")
print(year)
print(max_species_names)









    



Max species:
[ 1900.  1901.  1902.  1903.  1904.  1905.  1906.  1907.  1908.  1909.
  1910.  1911.  1912.  1913.  1914.  1915.  1916.  1917.  1918.  1919.
  1920.]
['Carrot' 'Carrot' 'Hare' 'Hare' 'Lynx' 'Lynx' 'Carrot' 'Carrot' 'Carrot'
 'Carrot' 'Carrot' 'Carrot' 'Hare' 'Hare' 'Hare' 'Lynx' 'Carrot' 'Carrot'
 'Carrot' 'Carrot' 'Carrot']



In [112]:

    
# 4
above_50k = # ...
year_above_50k = # ...
print("Any above 50000:", year_above_50k)









    



Any above 50000: [ 1902.  1903.  1904.  1912.  1913.  1914.  1915.]



In [115]:

    
# 5
top_2 = # ...
top_2_year = # ...
print("Top 2 years with lowest populations for each:")
print("  ", "   ".join(species))
print(top_2_year)









    



Top 2 years with lowest populations for each:
   Hare   Lynx   Carrot
[[ 1917.  1900.  1916.]
 [ 1916.  1901.  1903.]]

Matrices

Numpy makes a distinction between a 2-dimensional (2D) array and a matrix. Every matrix is definitely a 2D array, but not every 2D array is a matrix. If we want Python to treat the Numpy arrays as matrices and vectors, we have to explicitly say so.

Like this:



In [71]:

    
a = np.eye(4)  # Fills the diagonal elements with 1's
a = np.matrix(a)
a









    Out[71]:





matrix([[ 1.,  0.,  0.,  0.],
        [ 0.,  1.,  0.,  0.],
        [ 0.,  0.,  1.,  0.],
        [ 0.,  0.,  0.,  1.]])

Now, we can do complex matrix operations over a.

Another way is by keeping your arrays as arrays and using specific Numpy functions that make the conversion for you, so your arrays will be always arrays.

Extracting data from arrays and creating arrays

`where`

The index mask can be converted to position index using the where function



In [72]:

    
indices = np.where(mask)
indices









    Out[72]:





(array([11, 12, 13, 14]),)



In [73]:

    
x[indices]  # this indexing is equivalent to the fancy indexing x[mask]









    Out[73]:





array([ 5.5,  6. ,  6.5,  7. ])

`diag`

With the diag function we can also extract the diagonal and subdiagonals of an array:



In [74]:

    
np.diag(A)









    Out[74]:





array([ 0, 11, 22, 33, 44])



In [75]:

    
np.diag(A, -1)









    Out[75]:





array([10, 21, 32, 43])

`take`

The take function is similar to fancy indexing described above:



In [76]:

    
v2 = np.arange(-3, 3)
row_indices = [1, 3, 5]
v2[row_indices]  # fancy indexing









    Out[76]:





array([-2,  0,  2])



In [77]:

    
v2.take(row_indices)









    Out[77]:





array([-2,  0,  2])

But take also works on lists and other objects:



In [78]:

    
np.take([-3, -2, -1,  0,  1,  2], row_indices)









    Out[78]:





array([-2,  0,  2])

`choose`

Constructs an array by picking elements form several arrays:



In [79]:

    
which = [1, 0, 1, 0]
choices = [[-2,-2,-2,-2], [5,5,5,5]]
np.choose(which, choices)









    Out[79]:





array([ 5, -2,  5, -2])

Scipy

The SciPy framework builds on top of the low-level Numpy framework for multidimensional arrays, and provides a large number of higher-level scientific algorithms. Some of the topics that SciPy covers are:

Special functions (scipy.special)
Integration (scipy.integrate)
Optimization (scipy.optimize)
Interpolation (scipy.interpolate)
Fourier Transforms (scipy.fftpack)
Signal Processing (scipy.signal)
Linear Algebra (scipy.linalg)
Sparse Eigenvalue Problems (scipy.sparse)
Statistics (scipy.stats)
Multi-dimensional image processing (scipy.ndimage)
File IO (scipy.io)

Each of these submodules provides a number of functions and classes that can be used to solve problems in their respective topics. For what we are concerned with, we will only look at the stats subpackage.

Statistics

The scipy.stats module contains a large number of statistical distributions, statistical functions, and tests. There is also a very powerful Python package for statistical modelling called statsmodels.

For the next class

Read 10 Minutes to Pandas.
Read about Series and DataFrame.
Read Chapters 1-5 from Julia Evans pandas-cookbok.

[Data, the Humanist's New Best Friend](index.ipynb)*Class 05*

Files

Objects and classes

Numpy

Arrays

From lists

Using array-generating functions

arange

zeros and ones

From Tab and Comma-separated values (CSV)

More properties of the numpy arrays

Manipulating arrays

Indexing

Index slicing

Fancy indexing

Data processing

mean

Standard deviations and variance

min and max

sum, prod, and trace

Axis

Matrices

Extracting data from arrays and creating arrays

where

diag

take

choose

Scipy

Statistics

For the next class

[Data, the Humanist's New Best Friend](index.ipynb)
Class 05

`arange`

`zeros` and `ones`

`mean`

`min` and `max`

`sum`, `prod`, and `trace`

`where`

`diag`

`take`

`choose`