In [ ]:

    
%matplotlib inline
%pylab inline

Sorting arrays



In [ ]:

    
d = np.random.randint(0, 100, size=(10,10))



In [ ]:

    
np.sort(d)



In [ ]:

    
np.argsort(d)



In [ ]:

    
d[np.arange(d.shape[0])[:,np.newaxis], np.argsort(d)]

Lexographical sorting



In [89]:

    
x = np.ones((5,2))
x[1:3,0] = 0
x[:,1] = np.arange(5)
x









    Out[89]:





array([[ 1.,  0.],
       [ 0.,  1.],
       [ 0.,  2.],
       [ 1.,  3.],
       [ 1.,  4.]])



In [90]:

    
# create view as a structured array (requires x to be continous)
v = x.ravel().view(dtype=[('a', x.dtype), ('b', x.dtype)])
# inplace sort will change x
v.sort(order=('a', 'b'))
x









    Out[90]:





array([[ 0.,  1.],
       [ 0.,  2.],
       [ 1.,  0.],
       [ 1.,  3.],
       [ 1.,  4.]])



In [91]:

    
# alternative, lexsort
x = np.ones((5,2)); x[1:3,0] = 0; x[:,1] = np.arange(5)
x[np.lexsort((x[:,1], x[:,0]))]









    Out[91]:





array([[ 0.,  1.],
       [ 0.,  2.],
       [ 1.,  0.],
       [ 1.,  3.],
       [ 1.,  4.]])

With NumPy 1.8 order statistics can be retrieved with partition which is faster than applying a full sort



In [92]:

    
np.random.shuffle(d.ravel())
np.partition(d, (2, 8))









    



---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-92-dbe95ab6509c> in <module>()
      1 np.random.shuffle(d.ravel())
----> 2 np.partition(d, (2, 8))

/tmp/local/lib/python2.7/site-packages/numpy/core/fromnumeric.pyc in partition(a, kth, axis, kind, order)
    620     else:
    621         a = asanyarray(a).copy(order="K")
--> 622     a.partition(kth, axis=axis, kind=kind, order=order)
    623     return a
    624 

ValueError: kth(=8) out of bounds (3)

Statistics on arrays

mean



In [ ]:

    
data = np.genfromtxt('stockholm_td_adj.dat', names=True)
temp = data['temperature']
year = data['year']
np.mean(temp) # or data['temperature'].mean()



In [ ]:

    
data['temperature'].mean()

standard deviations and variance



In [ ]:

    
np.std(temp), np.var(temp)

min, max median



In [ ]:

    
temp.min(), temp.max()



In [ ]:

    
temp.argmin(), temp.argmax()



In [93]:

    
year[temp.argmin()], year[temp.argmax()]









    Out[93]:





(1814.0, 1975.0)



In [94]:

    
np.median(temp)









    Out[94]:





5.4000000000000004



In [95]:

    
np.percentile(temp, [5, 95])









    Out[95]:





array([ -7.6,  18.4])

Most functions have an axis keyword to only do the reduction on a certain axis



In [96]:

    
d = np.arange(12).reshape((4,3))
print(d)
print('naxis1', d.sum(axis=1))
print('naxis0', d.sum(axis=0))









    



[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]
('naxis1', array([ 3, 12, 21, 30]))
('naxis0', array([18, 22, 26]))



In [97]:

    
# numpy >= 1.7, reductions also work over multiply axes:
d = np.arange(5*4*3).reshape((5,4,3))
d.sum(axis=(-2, -1))









    Out[97]:





array([ 66, 210, 354, 498, 642])



In [98]:

    
#numpy < 1.7 way for trailing axes:
nshape = d.shape[:-2] + (-1,)
view = d.reshape(nshape)
print('new shape', view.shape)
view.sum(axis=1)









    



('new shape', (5, 12))






    Out[98]:





array([ 66, 210, 354, 498, 642])

Universal functions

A universal function (ufunc) is a function operating on arrays element-by-element. They allow implementing a large set of operations and getting features like broadcasting, type casting and reductions "for free".

E.g. all basic operations like addition, multiplication and also basic math like np.sin and np.cos are ufuncs and thus all support the same set of features.



In [99]:

    
a = np.array([1, 2, 3])
b = np.array([1, 2, 3])

# binary ufunc
np.add(a, b)
# unary ufunc
np.sin(a)









    Out[99]:





array([ 0.84147098,  0.90929743,  0.14112001])



In [100]:

    
r = np.empty_like(b)
np.add(a, b, out=r), r









    Out[100]:





(array([2, 4, 6]), array([2, 4, 6]))



In [101]:

    
np.add(a, b, dtype=np.float64)









    Out[101]:





array([ 2.,  4.,  6.])



In [102]:

    
np.logaddexp(a, b)









    Out[102]:





array([ 1.69314718,  2.69314718,  3.69314718])

Reductions

Binary ufuncs support reductions of arrays



In [103]:

    
np.add.reduce(a)









    Out[103]:





6

Reductions support the additional keepdims argument that retains the same dimensionality as the input by adding empty dimensions. This allows applying the result to operands via broadcasting.



In [106]:

    
a = np.arange(3*4).reshape(3,4)
r = np.add.reduce(a, axis=1, keepdims=True)
r









    Out[106]:





array([[ 6],
       [22],
       [38]])



In [107]:

    
a - r









    Out[107]:





array([[ -6,  -5,  -4,  -3],
       [-18, -17, -16, -15],
       [-30, -29, -28, -27]])

Additional ufunc methods

accumulate, apply operation successively onto elements (cumulative sum or product)
reduceat, reduction along slices
at, unbuffered in place operation: a[indices] += b which allows repeating indices



In [108]:

    
np.add.reduceat(np.arange(8),[0,4,6])









    Out[108]:





array([ 6,  9, 13])

Computations on subsets of arrays

We can compute with subsets of the data in an array using indexing, fancy indexing, and the other methods of extracting data from an array (described above).

For example, let's go back to the temperature dataset:



In [109]:

    
!head -n 3 stockholm_td_adj.dat









    



#year month day min   temperature   max dontknow
1800  1  1    -6.1    -6.1    -6.1 1
1800  1  2   -15.4   -15.4   -15.4 1

The dataformat is: year, month, day, daily average temperature, low, high, location.

If we are interested in the average temperature only in a particular month, say February, then we can create a index mask and use the select out only the data for that month using:



In [110]:

    
np.unique(data['month']) # the month column takes values from 1 to 12









    Out[110]:





array([  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,
        12.])



In [111]:

    
mask_feb = data['month'] == 2.



In [112]:

    
np.mean(data['temperature'][mask_feb])









    Out[112]:





-3.6189076332052785

With these tools we have very powerful data processing capabilities at our disposal. For example, to extract the average monthly average temperatures for each month of the year only takes a few lines of code:



In [113]:

    
months = arange(1,13)
monthly_mean = [np.mean(data['temperature'][data['month'] == month]) for month in months]

fig, ax = subplots()
ax.bar(months, monthly_mean)
ax.set_xlabel("Month")
ax.set_ylabel("Monthly avg. temp.");

Linear algebra

Vectorizing code is the key to writing efficient numerical calculation with Python/Numpy. That means that as much as possible of a program should be formulated in terms of matrix and vector operations.

Scalar-array operations

We can use the usual arithmetic operators to multiply, add, subtract, and divide arrays element-wise



In [114]:

    
v1 = np.arange(0, 5)
img = np.arange(25).reshape((5, 5))



In [115]:

    
v1 * 2









    Out[115]:





array([0, 2, 4, 6, 8])



In [116]:

    
img + img









    Out[116]:





array([[ 0,  2,  4,  6,  8],
       [10, 12, 14, 16, 18],
       [20, 22, 24, 26, 28],
       [30, 32, 34, 36, 38],
       [40, 42, 44, 46, 48]])

Matrix algebra

What about matrix mutiplication? There are two ways. We can either use the dot functions, which applies a matrix-matrix, matrix-vector, or inner vector multiplication to its two arguments:



In [117]:

    
np.dot(img, img)









    Out[117]:





array([[ 150,  160,  170,  180,  190],
       [ 400,  435,  470,  505,  540],
       [ 650,  710,  770,  830,  890],
       [ 900,  985, 1070, 1155, 1240],
       [1150, 1260, 1370, 1480, 1590]])



In [118]:

    
# matrix * vector
np.dot(img, v1)









    Out[118]:





array([ 30,  80, 130, 180, 230])



In [119]:

    
# vector * vector, scalar product
np.dot(v1, v1)









    Out[119]:





30

A dot product applies along the last axis of the first operand and the second to last of the second operand:

dot(a, b)[i,j,k,m] = sum(a[i,j,:] * b[k,:,m])

The same broadcasting rules apply to these conditions of the input operands:



In [120]:

    
m = np.arange(2*3).reshape(2,3)
v = np.arange(3 * 2).reshape(3, 2)
m









    Out[120]:





array([[0, 1, 2],
       [3, 4, 5]])



In [121]:

    
v









    Out[121]:





array([[0, 1],
       [2, 3],
       [4, 5]])



In [122]:

    
m.shape, v.shape









    Out[122]:





((2, 3), (3, 2))



In [123]:

    
np.dot(m, v)









    Out[123]:





array([[10, 13],
       [28, 40]])

NumPy includes the standard linear algebra functions to work on matrices, e.g. np.det (determinant), numpy.pinv etc. Some of them are placed in numpy.linalg (like eigenvalues, linear solvers, ...)

Einstein notation

$u \times b = u^jv^k\epsilon^i{}_{jk}$

Indices of operands denoted by comma separated string
alphabetic ordering
repeated indicies imply summation
output shape: ->
broadcasting: ...



In [124]:

    
a = np.arange(9.).reshape(3,3)
# identity
np.einsum('ij', a) == a









    Out[124]:





array([[ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]], dtype=bool)



In [125]:

    
#indices alphabetic -> transpose:
np.einsum('ji', a) == a.T









    Out[125]:





array([[ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]], dtype=bool)

Trace

$A^i{}_i$



In [126]:

    
#trace
a = np.arange(49.).reshape(7,7)
np.einsum('ii', a), a.trace()









    Out[126]:





(168.0, 168.0)



In [127]:

    
np.einsum('ii->i', a), np.diag(a)









    Out[127]:





(array([  0.,   8.,  16.,  24.,  32.,  40.,  48.]),
 array([  0.,   8.,  16.,  24.,  32.,  40.,  48.]))

Matrix dot product

$C_{ik} = \sum_j A_{ij}B_{jk}$



In [128]:

    
a = np.arange(12.).reshape(3,4)
b = np.arange(12.).reshape(4,3)
np.einsum('ij,jk->ik', a, b), a.dot(b)









    Out[128]:





(array([[  42.,   48.,   54.],
       [ 114.,  136.,  158.],
       [ 186.,  224.,  262.]]),
 array([[  42.,   48.,   54.],
       [ 114.,  136.,  158.],
       [ 186.,  224.,  262.]]))



In [129]:

    
#tensort dot product
a = np.arange(60.).reshape(3,4,5)
b = np.arange(24.).reshape(4,3,2)
np.einsum('ijk,jil->kl', a, b)









    Out[129]:





array([[ 4400.,  4730.],
       [ 4532.,  4874.],
       [ 4664.,  5018.],
       [ 4796.,  5162.],
       [ 4928.,  5306.]])