NumPy introduction

NumPy provides fast, low-level tools for manipulating arrays of data (the core is implemented in C). It also has more advanced features such as linear algebra, but in many cases Pandas provides a more convenient high-level interface for the same tasks (and more).

If you just want a quick overview, the following cheatsheet provides one:

NumPy arrays

The basic building block of NumPy is the array, which has a number of operations defined on it. Because these operations apply to the whole array at once, you don't need to write for loops to manipulate its elements. This is often called vectorization.
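As a rough illustration of vectorization, the sketch below contrasts an explicit Python loop with the equivalent array expression. The variable names are made up for this example.

```python
import numpy as np

values = [1, 2, 3, 4]

# Plain Python: a loop (here a list comprehension) visits each element
doubled_list = [v * 2 for v in values]

# NumPy: the multiplication is applied to every element at once
arr = np.array(values)
doubled_arr = arr * 2

# Arithmetic between two arrays of the same shape is element-wise too
summed = arr + doubled_arr

print(doubled_list)          # [2, 4, 6, 8]
print(doubled_arr.tolist())  # [2, 4, 6, 8]
print(summed.tolist())       # [3, 6, 9, 12]
```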


In [2]:
# This is a regular Python list (list() is needed on Python 3, where range is lazy)
list(range(1, 4))


Out[2]:
[1, 2, 3]

In [3]:
# If you multiply or add to it, it extends the list

In [4]:
a = list(range(1, 10))
a * 2


Out[4]:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [5]:
a = list(range(1, 11))
a + [ 11 ]


Out[5]:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

In [6]:
# Compare this to np.array:
import numpy as np
np.array(range(1,10))


Out[6]:
array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [7]:
# Multiplication is defined as multiplying each element in the array
a = np.array(range(1, 10))
a * 2


Out[7]:
array([ 2,  4,  6,  8, 10, 12, 14, 16, 18])

In [8]:
a + 5 # Addition works as well: this adds 5 to each element (a regular Python list raises a TypeError here)


Out[8]:
array([ 6,  7,  8,  9, 10, 11, 12, 13, 14])

An ndarray can actually be multi-dimensional:


In [9]:
np.array([[1,2],[3,4],[5,6]])


Out[9]:
array([[1, 2],
       [3, 4],
       [5, 6]])

In [10]:
a = np.array([[1,2],[3,4],[5,6]])
a.shape, a.dtype, a.size, a.ndim # shape -> dimension sizes, dtype -> datatype, size -> total number of elems, ndim -> number of dimensions


Out[10]:
((3, 2), dtype('int64'), 6, 2)

In [11]:
# You can use comma-separated indexing like so:
a[1,1] # same as a[1][1]


Out[11]:
4

In [12]:
# Note that 1,1 is really a tuple (the parentheses are just omitted), so this works too:
indices = (1,1)
a[indices]


Out[12]:
4

In [13]:
# Note that regular python doesn't support this
mylist = [[1,2],[3,4]]
# mylist[1,1] # error!

In [14]:
# As always, use ? to get details
a?

Generating arrays

NumPy has a number of convenience functions to initialize arrays with zeros, ones or empty (uninitialized) values. Others are identity for the identity matrix, and arange, which is the equivalent of Python's range.


In [15]:
np.zeros(5)


Out[15]:
array([ 0.,  0.,  0.,  0.,  0.])

In [16]:
np.ones(10)


Out[16]:
array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])

In [17]:
np.empty(7) # Empty returns uninitialized garbage values (not zeroes!)


Out[17]:
array([  0.00000000e+000,   2.68156175e+154,   2.21880926e-314,
         2.21883458e-314,   2.21883482e-314,   2.21883410e-314,
         2.21883510e-314])

In [18]:
np.identity(5) # identity matrix


Out[18]:
array([[ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  1.]])

In [19]:
np.arange(11) # same as np.array(range(11))


Out[19]:
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [20]:
np.array(range(11))


Out[20]:
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
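The generator functions above also accept a few extra arguments. The following is a small sketch of the ones that tend to be useful first: a start/stop/step for arange, a shape tuple for zeros, and an explicit dtype.

```python
import numpy as np

# arange accepts start, stop (exclusive) and step, like Python's range
evens = np.arange(0, 10, 2)

# zeros, ones and empty accept a shape tuple for multi-dimensional arrays
grid = np.zeros((2, 3))

# The default dtype of zeros/ones is float64; it can be set explicitly
int_ones = np.ones(4, dtype=np.int64)

print(evens.tolist())  # [0, 2, 4, 6, 8]
print(grid.shape)      # (2, 3)
print(int_ones.dtype)  # int64
```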

Datatypes

Each np.array has a datatype associated with it. You can also cast between types.


In [21]:
np.array([1,2,3], dtype='float64')


Out[21]:
array([ 1.,  2.,  3.])

In [22]:
# Show all available types
np.sctypes


Out[22]:
{'complex': [numpy.complex64, numpy.complex128, numpy.complex256],
 'float': [numpy.float16, numpy.float32, numpy.float64, numpy.float128],
 'int': [numpy.int8, numpy.int16, numpy.int32, numpy.int64],
 'others': [bool, object, str, unicode, numpy.void],
 'uint': [numpy.uint8, numpy.uint16, numpy.uint32, numpy.uint64]}

In [23]:
# Consider strings
a = np.array(['12', '999', '432536'])
a.dtype


Out[23]:
dtype('S6')

The datatype S6 stands for a fixed-length string of length 6, because the longest string in the array is of length 6. Compare this to:


In [24]:
np.array(['123', '21345312312'])


Out[24]:
array(['123', '21345312312'], 
      dtype='|S11')

In [25]:
# You can also cast between types
a.astype(np.int32) # This copies the data into a new array, it does not change the array itself!


Out[25]:
array([    12,    999, 432536], dtype=int32)
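The sketch below checks two points about casting: astype returns a new array and leaves the original untouched, and casting floats to integers truncates toward zero. It assumes Python 3, where string arrays get a unicode dtype (kind 'U') rather than the byte-string 'S' dtype shown in the outputs above.

```python
import numpy as np

strings = np.array(['12', '999', '432536'])

# astype builds a new array; 'strings' keeps its string dtype
as_ints = strings.astype(np.int32)

# Casting floats to integers drops the fractional part (truncation toward zero)
truncated = np.array([1.7, -1.7]).astype(np.int32)

print(strings.dtype.kind)  # U  (on Python 3)
print(as_ints.tolist())    # [12, 999, 432536]
print(truncated.tolist())  # [1, -1]
```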

Slicing

Index manipulation (slicing) with np.arrays is actually pretty similar to how it works with regular Python lists.


In [26]:
a = np.array(range(10, 20))
a[3:]


Out[26]:
array([13, 14, 15, 16, 17, 18, 19])

In [27]:
a[4:6]


Out[27]:
array([14, 15])

However, slices in Numpy are actually views on the original np.array which means that if you manipulate them, the array changes as well.


In [28]:
a[3:6] = 33
a


Out[28]:
array([10, 11, 12, 33, 33, 33, 16, 17, 18, 19])

Compare this to regular python:


In [29]:
b = list(range(1, 10))
# b[2:7] = 10  # raises a TypeError: a list slice can only be assigned an iterable

In [30]:
# Copies need to be explicit in numpy
b = a[3:6].copy()
b[:] = 22 # change all values to 22
b, a # print b and a, see that a is not modified


Out[30]:
(array([22, 22, 22]), array([10, 11, 12, 33, 33, 33, 16, 17, 18, 19]))
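If you are unsure whether something is a view or a copy, you can check it directly. The sketch below uses np.shares_memory and the .base attribute for this; the variable names are only illustrative.

```python
import numpy as np

a = np.arange(10, 20)

view = a[3:6]         # basic slicing returns a view on a's data
copy = a[3:6].copy()  # an explicit copy owns its own data

print(np.shares_memory(a, view))  # True
print(np.shares_memory(a, copy))  # False
print(view.base is a)             # True

# Writing through the view changes the original array
view[0] = -1
print(a[3])  # -1
```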

In [31]:
# You can also slice multi-dimensionally
c = np.array([[1,2,3], [4,5,6], [7,8,9]])
c[1:,:1] # Keep only the last 2 rows, and from each of them keep only the first element


Out[31]:
array([[4],
       [7]])

In [32]:
# Note how this differs from c[1:][:1], which performs 2 separate operations.
# First, c[1:] keeps the last 2 rows and returns a view: array([[4, 5, 6],[7, 8, 9]])
# Then [:1] slices that result along its first axis again, keeping only its first row.
c[1:][:1]


Out[32]:
array([[4, 5, 6]])
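Comparing shapes makes the difference between the two forms easier to see. The sketch below also shows that using an integer index instead of a slice drops that dimension entirely.

```python
import numpy as np

c = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

both_axes = c[1:, :1]   # rows 1.., then column 0 kept as a slice
chained = c[1:][:1]     # rows 1.., then the first of those rows
int_index = c[1:, 0]    # integer index on axis 1 removes that axis

print(both_axes.shape)     # (2, 1)
print(chained.shape)       # (1, 3)
print(int_index.shape)     # (2,)
print(int_index.tolist())  # [4, 7]
```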

This picture explains NumPy's array slicing pretty well.

Boolean indexing

There are 2 parts to boolean indexing:

  1. Applying a boolean mask to an np.array. Boolean masks are just arrays of booleans: [True, False, True]
  2. Creating boolean masks using boolean conditions

Applying a boolean mask


In [33]:
# A boolean mask is just a boolean array
mask = np.array([ True, False, True ])
mask


Out[33]:
array([ True, False,  True], dtype=bool)

In [34]:
# To apply the mask against a target, just pass it like an index.
# The result is an array with the elements from 'target' that had True on their corresponding index in 'mask'.
target = np.array([7,8,9])
target[mask]


Out[34]:
array([7, 9])

In [35]:
# This works for multi-dimensional arrays too, but the result is always a one-dimensional array
# Also, you need to make sure that the shapes of your target and mask arrays match
target2 = np.array([['a','b','c'], ['d','e','f'],['g','h','i']])
mask2 = np.array([[False,True,False], [True, True, False], [True, False, True]])
target2[mask2]


Out[35]:
array(['b', 'd', 'e', 'g', 'i'], 
      dtype='|S1')

Creating a boolean mask

The easiest way to create a boolean mask is to just create an array with booleans in it. However, you can also create boolean masks by applying a boolean expression to an existing array.


In [36]:
numbers = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
numbers > 5


Out[36]:
array([False, False, False, False, False,  True,  True,  True,  True], dtype=bool)

In [37]:
numbers % 2 == 0 # Even numbers


Out[37]:
array([False,  True, False,  True, False,  True, False,  True, False], dtype=bool)

Strings work too!


In [38]:
names = np.array(["John", "Mary", "Joe", "Jane", "Marc", "Jorge", "Adele" ])

In [39]:
names == "Joe"


Out[39]:
array([False, False,  True, False, False, False, False], dtype=bool)

You can combine filters using the boolean arithmetic operators | and &. Note that you have to put the individual boolean expressions between parentheses, because | and & bind more tightly than comparison operators such as == and >.


In [40]:
(names == "Joe") | (names == "Mary")


Out[40]:
array([False,  True,  True, False, False, False, False], dtype=bool)
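To see why the parentheses matter, here is a small sketch with a numeric array. Without them, Python parses the expression as a chained comparison and NumPy refuses to reduce an array to a single truth value.

```python
import numpy as np

numbers = np.arange(1, 10)

# With parentheses: two masks combined element-wise
mask = (numbers > 5) | (numbers < 3)

# Without parentheses this is parsed as: numbers > (5 | numbers) < 3
# The chained comparison needs bool(array), which raises a ValueError
try:
    numbers > 5 | numbers < 3
    raised = False
except ValueError:
    raised = True

# ~ negates a mask element-wise
inverse = ~mask

print(numbers[mask].tolist())     # [1, 2, 6, 7, 8, 9]
print(raised)                     # True
print(numbers[inverse].tolist())  # [3, 4, 5]
```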

Once you have a boolean mask, you can apply it to any array of the same length as the mask. This is often useful if you want to select certain values in an array, like so:


In [41]:
names[names == "Joe"], numbers[numbers > 5]


Out[41]:
(array(['Joe'], 
       dtype='|S5'), array([6, 7, 8, 9]))
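Boolean masks can be used for assignment and counting as well as selection. The sketch below illustrates that selecting with a mask gives a copy, while assigning through a mask modifies the array in place. The names clipped and n_big are just for illustration.

```python
import numpy as np

numbers = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Selecting with a mask returns a copy, not a view
big = numbers[numbers > 5]
big[:] = 0  # does not touch 'numbers'

# Assigning through a mask changes the selected elements in place
clipped = numbers.copy()
clipped[clipped > 5] = 5

# True counts as 1, so summing a mask counts the matches
n_big = int((numbers > 5).sum())

print(numbers.tolist())  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(clipped.tolist())  # [1, 2, 3, 4, 5, 5, 5, 5, 5]
print(n_big)             # 4
```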

Universal functions

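A universal function, or ufunc, is a function that performs elementwise operations on data in ndarrays. You can think of them as fast vectorized wrappers for simple functions that take one or more scalar values and produce one or more scalar results. Not every function shown in this section is a ufunc in the strict sense: np.sum, np.mean, np.max, np.min, np.sort and np.unique operate on the array as a whole rather than element by element.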
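As a quick sketch of what "elementwise" means here, the example below applies a unary and a binary ufunc and checks which NumPy functions are actual np.ufunc instances.

```python
import numpy as np

x = np.array([1.0, 4.0, 9.0])
y = np.array([2.0, 3.0, 10.0])

roots = np.sqrt(x)         # unary ufunc: one input array
larger = np.maximum(x, y)  # binary ufunc: element-wise maximum of two arrays

# Aggregations are built on ufuncs: summing is add applied as a reduction
total = np.add.reduce(x)

print(roots.tolist())                  # [1.0, 2.0, 3.0]
print(larger.tolist())                 # [2.0, 4.0, 10.0]
print(total)                           # 14.0
print(isinstance(np.sqrt, np.ufunc))   # True
print(isinstance(np.sum, np.ufunc))    # False
```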


In [42]:
numbers = np.array([-1, -9, 18.2, 3, 4.3, 0, 5.3, -12.2])
numbers


Out[42]:
array([ -1. ,  -9. ,  18.2,   3. ,   4.3,   0. ,   5.3, -12.2])

In [43]:
np.sum(numbers), np.mean(numbers)


Out[43]:
(8.5999999999999996, 1.075)

In [44]:
np.square(numbers)


Out[44]:
array([   1.  ,   81.  ,  331.24,    9.  ,   18.49,    0.  ,   28.09,
        148.84])

In [45]:
np.abs(numbers)


Out[45]:
array([  1. ,   9. ,  18.2,   3. ,   4.3,   0. ,   5.3,  12.2])

In [46]:
np.sqrt(np.abs(numbers)) # sqrt of a negative number yields nan (with a warning), so take the absolute values first


Out[46]:
array([ 1.        ,  3.        ,  4.2661458 ,  1.73205081,  2.07364414,
        0.        ,  2.30217289,  3.49284984])

In [47]:
np.max(numbers), np.min(numbers)


Out[47]:
(18.199999999999999, -12.199999999999999)

In [48]:
np.ceil(numbers), np.floor(numbers)


Out[48]:
(array([ -1.,  -9.,  19.,   3.,   5.,   0.,   6., -12.]),
 array([ -1.,  -9.,  18.,   3.,   4.,   0.,   5., -13.]))

The boolean expressions that create boolean masks (see the previous section) can also be expressed explicitly as function calls:


In [49]:
np.greater(numbers, 3)


Out[49]:
array([False, False,  True, False,  True, False,  True, False], dtype=bool)

In [50]:
# combining with boolean arithmetic
np.logical_or(np.less_equal(numbers, 4), np.greater(numbers, 0))


Out[50]:
array([ True,  True,  True,  True,  True,  True,  True,  True], dtype=bool)

In [51]:
np.sort(numbers)


Out[51]:
array([-12.2,  -9. ,  -1. ,   0. ,   3. ,   4.3,   5.3,  18.2])

In [52]:
np.unique(np.array([1, 2, 4, 2, 5, 1]))


Out[52]:
array([1, 2, 4, 5])

Some of these operations are also available directly as methods on the array:


In [53]:
numbers.sum(), numbers.mean(), numbers.min(), numbers.max()


Out[53]:
(8.5999999999999996, 1.075, -12.199999999999999, 18.199999999999999)

File IO

You can easily store/retrieve numpy arrays from files.


In [54]:
np.save("/tmp/myarray", np.arange(10))

In [55]:
# The .npy extension is automatically added
!cat /tmp/myarray.npy


�NUMPYF{'descr': '<i8', 'fortran_order': False, 'shape': (10,), }           
	

In [56]:
np.load("/tmp/myarray.npy") # You DO need to specify the .npy extension when loading


Out[56]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

You can also save several arrays into a single zip archive (.npz) using savez, and read it back with load.


In [57]:
np.savez("/tmp/myarray2", a=np.arange(2000))

In [58]:
np.load("/tmp/myarray2.npz")['a'] # Loading from an npz file is lazy: you need to specify which array to load


Out[58]:
array([   0,    1,    2, ..., 1997, 1998, 1999])
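An .npz archive can hold several named arrays. The following sketch does a round trip in a temporary directory (so it does not depend on /tmp existing) and lists the stored names through the .files attribute. The array names small and big are arbitrary.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "arrays.npz")

    # Each keyword argument becomes a named array inside the archive
    np.savez(path, small=np.arange(3), big=np.arange(2000))

    # The archive is read lazily; closing it releases the file handle
    with np.load(path) as archive:
        names = sorted(archive.files)
        small = archive["small"]
        big_len = len(archive["big"])

print(names)           # ['big', 'small']
print(small.tolist())  # [0, 1, 2]
print(big_len)         # 2000
```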

You can also load plain-text formats, such as comma-separated values, using loadtxt.


In [59]:
!echo "1,2,3,4" > /tmp/numpytxtsample.txt
!cat /tmp/numpytxtsample.txt


1,2,3,4

In [60]:
np.loadtxt("/tmp/numpytxtsample.txt", delimiter=",")


Out[60]:
array([ 1.,  2.,  3.,  4.])
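loadtxt also accepts file-like objects and multi-line input. The sketch below reads a two-row sample from an in-memory buffer instead of a file, and shows that the result is float64 unless a dtype is given.

```python
import io

import numpy as np

sample = "1,2,3\n4,5,6\n"

# Each line of text becomes a row; values are parsed as float64 by default
table = np.loadtxt(io.StringIO(sample), delimiter=",")

# Pass dtype to get integers instead
int_table = np.loadtxt(io.StringIO(sample), delimiter=",", dtype=int)

print(table.shape)         # (2, 3)
print(table.dtype)         # float64
print(int_table.tolist())  # [[1, 2, 3], [4, 5, 6]]
```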

Linear Algebra

NumPy also supports linear algebra, e.g. matrix multiplication, determinants, etc.


In [61]:
x = np.array([[1,2,3],[4,5,6], [7,8,9]])
y = np.array([[9,8,7],[6,5,4],[3,2,1]])
x,y


Out[61]:
(array([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]), array([[9, 8, 7],
        [6, 5, 4],
        [3, 2, 1]]))

In [62]:
# Matrix multiplication
np.dot(x,y) # same as: x.dot(y)


Out[62]:
array([[ 30,  24,  18],
       [ 84,  69,  54],
       [138, 114,  90]])

In [63]:
# The numpy.linalg package has a bunch of extra linear algebra functions
# For example, the determinant (https://en.wikipedia.org/wiki/Determinant)
from numpy.linalg import det
det(x)


Out[63]:
-9.5161973539299405e-16
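The tiny value above is floating-point round-off: this particular matrix is singular, so its exact determinant is 0. A hedged way to check this is to compare against zero with a tolerance and to look at the matrix rank, as sketched below.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

d = np.linalg.det(x)

# Compare with a tolerance rather than testing d == 0 exactly
is_zero = bool(np.isclose(d, 0.0))

# A 3x3 matrix with rank below 3 is singular (it has no inverse)
rank = int(np.linalg.matrix_rank(x))

print(is_zero)  # True
print(rank)     # 2
```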

Other commonly used linear algebra functions

Function  Description
diag      Return the diagonal of a matrix as a 1D array
dot       Matrix multiplication
trace     Sum of diagonal elements
det       Determinant
eig       Eigenvalues and eigenvectors
inv       Inverse of a square matrix
qr        QR decomposition
svd       Singular value decomposition (SVD)
solve     Solve the linear system Ax=b for x, where A is a square matrix
lstsq     Compute the least-squares solution to Ax=b

diag, dot and trace can be called from the top-level numpy namespace (np.diag, np.dot, np.trace); the others live in numpy.linalg.
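A few of the functions listed above can be tried on a small invertible matrix. This is only a sketch; the matrix A and vector b are made-up values chosen so the results are easy to verify.

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])

# Solve A x = b without forming the inverse explicitly
x = np.linalg.solve(A, b)

# The inverse satisfies A . A_inv = identity (up to round-off)
A_inv = np.linalg.inv(A)

# eig returns the eigenvalues and a matrix whose columns are the eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)

print(np.allclose(np.dot(A, x), b))              # True
print(np.allclose(np.dot(A, A_inv), np.eye(2)))  # True
# A v = lambda v holds for every (eigenvalue, eigenvector column) pair
print(np.allclose(np.dot(A, eigenvectors), eigenvectors * eigenvalues))  # True
```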
