Manipulating data with NumPy

Presented by Karen Cranston, uses some materials by Katy Huff and Matthew Terry.

The NumPy library includes (among other things) ways of storing and manipulating data that are more efficient than standard Python lists. For numerical data, NumPy arrays are usually much faster than Python lists or tuples, because an array stores values of a single type and operates on them in compiled code. The goals here are to understand some of the gotchas when using arrays versus lists and to get a tour of NumPy's features.

We will start by importing the library and creating a regular Python list and a numpy array from that list.


In [1]:
import numpy

x = [1, 2, 3, 4, 5, 6 ]
np_arr = numpy.array(x)

Let's look at the difference between x (a Python list) and np_arr (a NumPy array).


In [2]:
x


Out[2]:
[1, 2, 3, 4, 5, 6]

In [3]:
np_arr


Out[3]:
array([1, 2, 3, 4, 5, 6])

In [4]:
np_arr.ndim


Out[4]:
1

In [5]:
np_arr.shape


Out[5]:
(6,)

We can compare the two data structures. Operations on NumPy arrays work element by element. Can you explain this result?


In [6]:
x == np_arr


Out[6]:
array([ True,  True,  True,  True,  True,  True], dtype=bool)
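
The comparison above returns an array rather than a single True or False because NumPy converts the list to an array and compares the two element by element. Here is a minimal sketch of how to collapse that result into one boolean, if that is what you want. The names values, arr, elementwise, all_equal and same are illustrative and not part of the session above.

```python
import numpy

values = [1, 2, 3, 4, 5, 6]
arr = numpy.array(values)

# Element-by-element comparison: one boolean per position.
elementwise = (values == arr)

# Collapse to a single answer: True only if every element matches.
all_equal = bool(elementwise.all())

# array_equal also checks that the shapes agree before comparing.
same = numpy.array_equal(values, arr)

print(elementwise)
print(all_equal, same)
```

Use all() when you already have the elementwise result, and numpy.array_equal when the two inputs might differ in shape.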

Now, let's make a 2D array


In [7]:
x = [ [1, 2], [3, 4], [5, 6] ]
np_arr = numpy.array(x)

In [8]:
np_arr.shape


Out[8]:
(3, 2)

We can slice the matrix to get the second column. Note that slices are a view of the same data. What happens when we change an element of the slice?


In [9]:
array_slice = np_arr[:,1]
array_slice


Out[9]:
array([2, 4, 6])

In [10]:
array_slice[2]=7

In [11]:
np_arr


Out[11]:
array([[1, 2],
       [3, 4],
       [5, 7]])
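
The change to the slice shows up in np_arr because basic slicing returns a view, not a copy. The sketch below shows one way to check whether two arrays share data, using the base attribute and numpy.shares_memory. The names arr and column are illustrative.

```python
import numpy

arr = numpy.array([[1, 2], [3, 4], [5, 6]])

column = arr[:, 1]                        # basic slicing returns a view
print(column.base is arr)                 # the view's base is the original array
print(numpy.shares_memory(arr, column))   # both names refer to the same data

column[2] = 7                             # writing through the view...
print(arr[2, 1])                          # ...changes the original array
```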

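Shallow versus deep copies: a slice is a view (a shallow copy) of the original data. To get an independent (deep) copy, use the copy() method. Changes to the copy do not affect the original array.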


In [12]:
arr_copy = np_arr.copy()
arr_copy[0,0]=3
arr_copy


Out[12]:
array([[3, 2],
       [3, 4],
       [5, 7]])

In [13]:
np_arr


Out[13]:
array([[1, 2],
       [3, 4],
       [5, 7]])
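
For contrast, here is a small sketch comparing plain assignment with copy(). Assignment only gives the same array a second name, whereas copy() allocates new data. The names arr, alias and duplicate are illustrative.

```python
import numpy

arr = numpy.array([[1, 2], [3, 4], [5, 7]])

alias = arr              # plain assignment: a second name for the same array
duplicate = arr.copy()   # copy(): a new array with its own data

alias[0, 0] = 10         # visible through arr, because alias is arr
duplicate[0, 1] = 20     # not visible through arr

print(alias is arr)
print(numpy.shares_memory(arr, duplicate))
print(arr)
```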

Operating on Python lists and numpy arrays is very different.


In [14]:
x*2


Out[14]:
[[1, 2], [3, 4], [5, 6], [1, 2], [3, 4], [5, 6]]

In [15]:
np_arr * 3


Out[15]:
array([[ 3,  6],
       [ 9, 12],
       [15, 21]])

With NumPy arrays, operations are element by element: the multiplication multiplied each element individually. Compare this to the Python list, where multiplication repeated the entire list as a single unit. Try adding the list to itself and compare the result to adding the array to itself.


In [16]:
np_arr + np_arr


Out[16]:
array([[ 2,  4],
       [ 6,  8],
       [10, 14]])
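
To make the contrast concrete, the sketch below runs both additions side by side. It starts from the original values, so the array result differs from the output above, where one element had already been changed to 7. The names list_sum and array_sum are illustrative.

```python
import numpy

x = [[1, 2], [3, 4], [5, 6]]
arr = numpy.array(x)

list_sum = x + x        # concatenation: a list with six rows
array_sum = arr + arr   # element-by-element addition: shape stays (3, 2)

print(len(list_sum))
print(array_sum.shape)
print(array_sum)
```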

NumPy provides basic matrix operations and statistical functions.

T is the transpose; dot is the dot product (matrix multiplication for 2D arrays).


In [17]:
np_arr.T.dot(np_arr)


Out[17]:
array([[35, 49],
       [49, 69]])
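
The shapes explain why the result is 2 by 2: the transpose is (2, 3) and the original is (3, 2). The sketch below prints those shapes and, assuming Python 3.5 or later, shows the @ operator as an equivalent spelling of dot for 2D arrays. The names arr, product and same_product are illustrative.

```python
import numpy

arr = numpy.array([[1, 2], [3, 4], [5, 7]])

print(arr.shape)      # shape of the original array
print(arr.T.shape)    # shape of the transpose

product = arr.T.dot(arr)      # (2, 3) times (3, 2) gives (2, 2)
same_product = arr.T @ arr    # @ is the matrix-multiplication operator

print(product)
print(numpy.array_equal(product, same_product))
```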

In [20]:
numpy.average(np_arr)


Out[20]:
3.6666666666666665

Average of what? By default, numpy.average treats the whole array as one flattened sequence of values. Find the average of the first column.


In [21]:
numpy.average(np_arr[:,0])


Out[21]:
3.0
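
Slicing out one column works, but the axis argument of numpy.average computes every column (or row) average in one call. This is a minimal sketch; the names column_averages and row_averages are illustrative.

```python
import numpy

arr = numpy.array([[1, 2], [3, 4], [5, 7]])

column_averages = numpy.average(arr, axis=0)  # average down the rows: one value per column
row_averages = numpy.average(arr, axis=1)     # average across the columns: one value per row

print(column_averages)
print(row_averages)
```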

In [30]:
numpy.cov(np_arr)


Out[30]:
array([[ 0.5,  0.5,  1. ],
       [ 0.5,  0.5,  1. ],
       [ 1. ,  1. ,  2. ]])
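
The result above is 3 by 3 because numpy.cov treats each row as a variable and each column as an observation by default. If the columns are the variables instead, pass rowvar=False, as in the sketch below. The names by_rows and by_columns are illustrative.

```python
import numpy

arr = numpy.array([[1, 2], [3, 4], [5, 7]])

by_rows = numpy.cov(arr)                   # default: each row is a variable
by_columns = numpy.cov(arr, rowvar=False)  # each column is a variable

print(by_rows.shape)
print(by_columns.shape)
print(by_columns)
```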

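We can use NumPy functions to read data from a file into an array. First, write a small comma-separated file with the %%file cell magic; then load it with numpy.loadtxt.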


In [24]:
%%file example-data.txt
0,0
1,2
2,4
3,8
4,16
5,32
6,64


Writing example-data.txt

In [25]:
data = numpy.loadtxt('example-data.txt', delimiter=',')
print(data)


[[  0.   0.]
 [  1.   2.]
 [  2.   4.]
 [  3.   8.]
 [  4.  16.]
 [  5.  32.]
 [  6.  64.]]
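
numpy.loadtxt can also return each column as its own array through its unpack argument, which is handy when the columns are separate variables. The sketch below reads the same values from an in-memory string so that it is self-contained. The names text, x_values and y_values are illustrative.

```python
import io
import numpy

# Same contents as example-data.txt, held in memory so the sketch is self-contained.
text = u"0,0\n1,2\n2,4\n3,8\n4,16\n5,32\n6,64\n"

# unpack=True transposes the result, so each column comes back as its own array.
x_values, y_values = numpy.loadtxt(io.StringIO(text), delimiter=',', unpack=True)

print(x_values)
print(y_values)
print(y_values.dtype)   # loadtxt produces floats unless dtype is given
```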

In [37]:
x = [ 0, 1, 2, 3, 4, 5, 6 ]
y = [ 0, 2, 4, 8, 16, 32, 64 ]

import matplotlib.pyplot as plt
plt.plot(x, y)


Out[37]:
[<matplotlib.lines.Line2D at 0x10e3e9b50>]

In [38]:
plt.plot(x, y, 'r--', label='my favorite line')
plt.legend()


Out[38]:
<matplotlib.legend.Legend at 0x10e3e9750>

In [40]:
plt.plot(x, y, 'r-')
plt.axis(xmin=-10, xmax = 8, ymin=-10)


Out[40]:
(-10, 8, -10, 70.0)

In [27]:
plt.plot(x, y, 'r-')
plt.axis(xmin=-10, xmax = 8, ymin=-10)
plt.xlabel('This is my X axis')
plt.ylabel('This is my Y axis')
plt.title('foo')
plt.savefig('/tmp/figure.pdf')



In [42]:
plt.plot(x, y, 'r-')
plt.axis(xmin=-10, xmax = 8, ymin=-10)
plt.xlabel('This is my X axis')
plt.ylabel('This is my Y axis')
plt.title('foo')
plt.savefig('/tmp/figure.png')

In [43]:
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import numpy as np

bignum = 100
mat = np.random.random((bignum, bignum))
X, Y = np.mgrid[:bignum, :bignum]

fig = plt.figure()
ax = fig.add_subplot(1,1,1, projection='3d')
surf = ax.plot_surface(X,Y,mat)
plt.show()
