NumPy: Numerical Arrays for Python

Learning Objectives: Learn how to create, transform and visualize multidimensional data of a single type using NumPy.

NumPy is the foundation for scientific computing and data science in Python. Its main data object is a multidimensional array with the following characteristics:

  • Any number of dimensions
  • All elements of an array have the same data type
  • Array elements are usually native data types
  • The memory for an array is a contiguous block that can be easily passed to other numerical libraries (BLAS, LAPACK, etc.).
  • Most of NumPy is implemented in C, so it is fast.

Plotting

While this notebook doesn't focus on plotting, Matplotlib will be used to make a few basic plots.


In [27]:
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns

The vizarray package will be used to visualize NumPy arrays:

import antipackage
from github.ellisonbg.misc import vizarray as va

Multidimensional array type

This is the canonical way to import NumPy:


In [3]:
import numpy as np

In [4]:
data = [0,2,4,6]
a = np.array(data)

In [5]:
type(a)


Out[5]:
numpy.ndarray

In [7]:
a


Out[7]:
array([0, 2, 4, 6])

The va.vizarray function can be used to visualize a 1d or 2d NumPy array using a colormap:


In [9]:
va.vizarray(a)


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-9-5fc38d85ddca> in <module>()
----> 1 va.vizarray(a)

NameError: name 'va' is not defined

The shape of the array:


In [10]:
a.shape


Out[10]:
(4,)

The number of array dimensions:


In [11]:
a.ndim


Out[11]:
1

The number of array elements:


In [12]:
a.size


Out[12]:
4

The number of bytes the array takes up:


In [13]:
a.nbytes


Out[13]:
32
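The attributes above are related: the byte count is the number of elements times the size of one element. The following is a minimal sketch of that relationship; the dtype is set explicitly to int64 here (an assumption made so the numbers do not depend on the platform's default integer size).

```python
import numpy as np

# dtype is fixed explicitly so the result is the same on every platform
a = np.array([0, 2, 4, 6], dtype=np.int64)

# itemsize is the number of bytes used by a single element
bytes_per_element = a.itemsize

# nbytes should equal the element count times the per-element size
total_bytes = a.size * a.itemsize

print(bytes_per_element, total_bytes, a.nbytes)
```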

The dtype attribute describes the "data type" of the elements:


In [14]:
a.dtype


Out[14]:
dtype('int64')

Creating arrays

Arrays can be created with nested lists or tuples:


In [15]:
data = [[0.0,2.0,4.0,6.0],[1.0,3.0,5.0,7.0]]
b = np.array(data)

In [16]:
b


Out[16]:
array([[ 0.,  2.,  4.,  6.],
       [ 1.,  3.,  5.,  7.]])

In [16]:
va.vizarray(b)


Out[16]:

In [17]:
b.shape, b.ndim, b.size, b.nbytes


Out[17]:
((2, 4), 2, 8, 64)

The arange function is similar to Python's builtin range function, but creates an array:


In [18]:
c = np.arange(0.0, 10.0, 1.0) # Step size of 1.0
c


Out[18]:
array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])

The linspace function is similar, but allows you to specify the number of points:


In [19]:
e = np.linspace(0.0, 5.0, 11) # 11 points
e


Out[19]:
array([ 0. ,  0.5,  1. ,  1.5,  2. ,  2.5,  3. ,  3.5,  4. ,  4.5,  5. ])
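One practical difference between the two functions is how they treat the stop value. The sketch below illustrates this with small, arbitrary ranges chosen for the example: arange excludes the stop value, while linspace includes it by default.

```python
import numpy as np

# arange uses a half-open interval: the stop value 1.0 is excluded
half_open = np.arange(0.0, 1.0, 0.25)

# linspace includes the stop value 1.0 by default
closed = np.linspace(0.0, 1.0, 5)

# endpoint=False makes linspace behave like arange for the same spacing
also_half_open = np.linspace(0.0, 1.0, 4, endpoint=False)

print(half_open)
print(closed)
print(also_half_open)
```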

There are also empty, zeros and ones functions:


In [20]:
np.empty((4,4))


Out[20]:
array([[  6.91270953e-310,   1.69744513e-316,   6.91270974e-310,
          6.91270974e-310],
       [  6.91267078e-310,   5.04011780e-317,   0.00000000e+000,
          0.00000000e+000],
       [  6.91267078e-310,   6.91267078e-310,   0.00000000e+000,
          0.00000000e+000],
       [  6.91270974e-310,   6.91270974e-310,   0.00000000e+000,
          0.00000000e+000]])

In [21]:
np.zeros((3,3))


Out[21]:
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

In [22]:
np.ones((3,3))


Out[22]:
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

See also:

  • empty_like, ones_like, zeros_like
  • eye, identity, diag
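As a rough illustration of the functions listed above, the following sketch builds a few small arrays. The template array and variable names are made up for this example.

```python
import numpy as np

template = np.array([[1, 2, 3], [4, 5, 6]])

# *_like functions copy the shape (and by default the dtype) of an existing array
z = np.zeros_like(template)
o = np.ones_like(template, dtype=float)  # the dtype can be overridden

# eye and identity both build identity matrices
i3 = np.eye(3)
same_as_eye = np.identity(3)

# diag builds a square matrix with the given values on the diagonal
d = np.diag([1, 2, 3])

print(z.shape, z.dtype, o.dtype)
print(d)
```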

dtype

Arrays have a dtype attribute that encapsulates the "data type" of each element. It can be set:

  • Implicitly by the element type
  • By passing the dtype argument to an array creation function

Here is an integer valued array:


In [23]:
a = np.array([0,1,2,3])

In [24]:
a, a.dtype


Out[24]:
(array([0, 1, 2, 3]), dtype('int64'))

All array creation functions accept an optional dtype argument:


In [25]:
b = np.zeros((2,2), dtype=np.complex64)
b


Out[25]:
array([[ 0.+0.j,  0.+0.j],
       [ 0.+0.j,  0.+0.j]], dtype=complex64)

In [26]:
c = np.arange(0, 10, 2, dtype=float) # the np.float alias was removed in NumPy 1.24; use the builtin float
c


Out[26]:
array([ 0.,  2.,  4.,  6.,  8.])

You can use the astype method to create a copy of the array with a given dtype:


In [27]:
d = c.astype(dtype=int) # the np.int alias was removed in NumPy 1.24; use the builtin int
d


Out[27]:
array([0, 2, 4, 6, 8])

IPython's tab completion and wildcard search (a pattern followed by ?) are useful for exploring the various available dtypes:


In [25]:
np.float*?

The NumPy documentation on dtypes describes the many other ways of specifying dtypes.
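A few of those alternative spellings are sketched below. This is only a small sample, assuming float32 as the target type; each form should produce the same dtype.

```python
import numpy as np

# Four ways of asking for a 32-bit float: a NumPy type object,
# a type name string, a short type code, and a dtype instance
specs = [np.float32, 'float32', 'f4', np.dtype('float32')]
arrays = [np.zeros(3, dtype=spec) for spec in specs]

all_same = all(arr.dtype == np.float32 for arr in arrays)

# The Python builtin float maps to a 64-bit float
builtin = np.zeros(3, dtype=float).dtype

print(all_same, builtin)
```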

Array operations

Basic mathematical operations are elementwise for:

  • Scalars and arrays
  • Arrays and arrays

Fill an array with a value:


In [28]:
a = np.empty((3,3))
a.fill(0.1)
a


Out[28]:
array([[ 0.1,  0.1,  0.1],
       [ 0.1,  0.1,  0.1],
       [ 0.1,  0.1,  0.1]])

In [29]:
b = np.ones((3,3))
b


Out[29]:
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

Addition is elementwise:


In [30]:
a+b


Out[30]:
array([[ 1.1,  1.1,  1.1],
       [ 1.1,  1.1,  1.1],
       [ 1.1,  1.1,  1.1]])

Division is elementwise:


In [31]:
b/a


Out[31]:
array([[ 10.,  10.,  10.],
       [ 10.,  10.,  10.],
       [ 10.,  10.,  10.]])

As are powers:


In [32]:
a**2


Out[32]:
array([[ 0.01,  0.01,  0.01],
       [ 0.01,  0.01,  0.01],
       [ 0.01,  0.01,  0.01]])

Scalar multiplication is also elementwise:


In [33]:
np.pi*b


Out[33]:
array([[ 3.14159265,  3.14159265,  3.14159265],
       [ 3.14159265,  3.14159265,  3.14159265],
       [ 3.14159265,  3.14159265,  3.14159265]])

Indexing and slicing

Indexing and slicing provide an efficient way of getting the values in an array and modifying them.


In [34]:
a = np.random.rand(10,10)

The enable function is part of vizarray and enables a nice display of arrays:


In [35]:
va.enable()

In [36]:
a


Out[36]:

Like Python lists and tuples, NumPy arrays have zero-based indexing and use the [] syntax for getting and setting values:


In [37]:
a[0,0]


Out[37]:
0.78608884399348022

An index of -1 refers to the last element along that axis:


In [38]:
a[-1,-1] == a[9,9]


Out[38]:
True

Extract the 0th column using the : syntax, which denotes all elements along that axis.


In [39]:
a[:,0]


Out[39]:

The last row:


In [40]:
a[-1,:]


Out[40]:

You can also slice ranges:


In [41]:
a[0:2,0:2]


Out[41]:

Assignment also works with slices:


In [42]:
a[0:5,0:5] = 1.0

In [43]:
a


Out[43]:

Note that although the value was assigned to the slice, the original array changed. This shows that slices are views of the same data, not copies.
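The view behavior can be checked directly. The sketch below uses a small array of zeros (chosen for this example) and contrasts a slice with an explicit copy made with the copy method.

```python
import numpy as np

a = np.zeros((4, 4))

# A basic slice is a view: writing through it changes the original array
view = a[0:2, 0:2]
view[:] = 1.0

# An explicit copy owns its data: writing to it leaves the original untouched
independent = a[2:4, 2:4].copy()
independent[:] = 5.0

print(a)
print(np.shares_memory(a, view), np.shares_memory(a, independent))
```

When a slice must not affect the original array, calling copy on it is the usual approach.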


In [44]:
va.disable()

Boolean indexing

Arrays can be indexed using other arrays that have boolean values.


In [45]:
ages = np.array([23,56,67,89,23,56,27,12,8,72])
genders = np.array(['m','m','f','f','m','f','m','m','m','f'])

Boolean expressions involving arrays create new arrays with a bool dtype and the elementwise result of the expression:


In [46]:
ages > 30


Out[46]:
array([False,  True,  True,  True, False,  True, False, False, False,  True], dtype=bool)

In [47]:
genders == 'm'


Out[47]:
array([ True,  True, False, False,  True, False,  True,  True,  True, False], dtype=bool)

Boolean expressions provide an extremely fast and flexible way of querying arrays:


In [48]:
(ages > 10) & (ages < 50)


Out[48]:
array([ True, False, False, False,  True, False,  True,  True, False, False], dtype=bool)

You can use a boolean array to index into the original or another array. This selects the ages of all females in the genders array:


In [49]:
mask = (genders == 'f')
ages[mask]


Out[49]:
array([67, 89, 56, 72])

In [50]:
ages[ages>30]


Out[50]:
array([56, 67, 89, 56, 72])
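Boolean masks can also be combined across arrays, counted, and used for assignment. The following sketch reuses the ages and genders data from above; the thresholds and the cap value of 60 are arbitrary choices for illustration.

```python
import numpy as np

ages = np.array([23, 56, 67, 89, 23, 56, 27, 12, 8, 72])
genders = np.array(['m', 'm', 'f', 'f', 'm', 'f', 'm', 'm', 'm', 'f'])

# Combine conditions from two different arrays of the same length
older_males = ages[(genders == 'm') & (ages > 20)]

# True counts as 1, so summing a mask counts the matching elements
n_over_30 = (ages > 30).sum()

# A mask on the left-hand side assigns only to the selected elements
capped = ages.copy()
capped[capped > 60] = 60

print(older_males)
print(n_over_30)
print(capped)
```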

Reshaping, transposing


In [51]:
va.enable()

In [52]:
a = np.random.rand(3,4)

In [53]:
a


Out[53]:

The T attribute contains the transpose of the original array:


In [54]:
a.T


Out[54]:

The reshape method can be used to change the shape and even the number of dimensions:


In [55]:
a.reshape(2,6)


Out[55]:

In [56]:
a.reshape(6,2)


Out[56]:

The ravel method strings the array out in one dimension:


In [57]:
a.ravel()


Out[57]:

In [59]:
va.disable()
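Since the visual output of the reshaping cells is not reproduced as text, here is a small sketch with a deterministic array (np.arange is used instead of random values, purely for readability) showing what T, reshape and ravel return.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

# Transposing swaps the axes
transposed_shape = a.T.shape

# Passing -1 lets NumPy infer that dimension from the total size
inferred = a.reshape(2, -1)

# ravel flattens in row-major order, so the original sequence is recovered
flat = a.ravel()

print(transposed_shape, inferred.shape)
print(flat)
```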

Universal functions

Universal functions, or "ufuncs," are functions that take and return arrays or scalars. They have the following characteristics:

  • Vectorized C implementations, much faster than hand-written loops in Python
  • Allow for concise Pythonic code
  • The NumPy documentation has a complete list of the available ufuncs.

In [60]:
va.set_block_size(5)
va.enable()

Here is a linear sequence of values:


In [61]:
t = np.linspace(0.0, 4*np.pi, 100)
t


Out[61]:

Take the $\sin$ of each element of the array:


In [62]:
np.sin(t)


Out[62]:

As the next two examples show, multiple ufuncs can be used to create complex mathematical expressions that can be computed efficiently:


In [63]:
np.exp(np.sqrt(t))


Out[63]:

In [65]:
va.disable()
va.set_block_size(30)

In [66]:
plt.plot(t, np.exp(-0.1*t)*np.sin(t))


Out[66]:
[<matplotlib.lines.Line2D at 0x7f2a40339150>]

In general, you should try to use ufuncs rather than writing for loops. These kinds of array-based computations are referred to as vectorized.
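To make the comparison concrete, the sketch below computes the same damped sine curve twice: once with an explicit Python loop and once with ufuncs. It only checks that the results agree; actual timings depend on the machine and are not shown.

```python
import math
import numpy as np

t = np.linspace(0.0, 4 * np.pi, 100)

# Loop version: one Python-level function call per element
looped = np.empty_like(t)
for i in range(t.size):
    looped[i] = math.exp(-0.1 * t[i]) * math.sin(t[i])

# Vectorized version: each ufunc processes the whole array at once
vectorized = np.exp(-0.1 * t) * np.sin(t)

print(np.allclose(looped, vectorized))
```

In an IPython session, the %timeit magic can be applied to each version to compare their speed.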

Basic data processing


In [67]:
ages = np.array([23,56,67,89,23,56,27,12,8,72])
genders = np.array(['m','m','f','f','m','f','m','m','m','f'])

NumPy has a basic set of methods and functions for computing summary statistics of data.


In [68]:
ages.min(), ages.max()


Out[68]:
(8, 89)

Compute the mean:


In [69]:
ages.mean()


Out[69]:
43.299999999999997

Compute the variance and standard deviation:


In [70]:
ages.var(), ages.std()


Out[70]:
(711.21000000000004, 26.668520768876554)

The bincount function counts how many times each value occurs in the array:


In [71]:
np.bincount(ages)


Out[71]:
array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

The cumsum and cumprod methods compute cumulative sums and products:


In [72]:
ages.cumsum()


Out[72]:
array([ 23,  79, 146, 235, 258, 314, 341, 353, 361, 433])

In [73]:
ages.cumprod()


Out[73]:
array([              23,             1288,            86296,
                7680344,        176647912,       9892283072,
           267091642944,    3205099715328,   25640797722624,
       1846137436028928])

Most of the functions and methods above take an axis argument that will apply the action along a particular axis:


In [74]:
a = np.random.randint(0,10,(3,4))
a


Out[74]:
array([[2, 6, 5, 2],
       [7, 4, 1, 8],
       [6, 1, 3, 0]])

With axis=0 the action runs along the rows axis (down each column), giving one result per column:


In [75]:
a.sum(axis=0)


Out[75]:
array([15, 11,  9, 10])

With axis=1 the action runs along the columns axis (across each row), giving one result per row:


In [76]:
a.sum(axis=1)


Out[76]:
array([15, 20, 10])
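One way to keep the axis argument straight is to look at result shapes: the axis that is named is the one that disappears. The sketch below hard-codes the array values shown above so that the results are reproducible.

```python
import numpy as np

a = np.array([[2, 6, 5, 2],
              [7, 4, 1, 8],
              [6, 1, 3, 0]])

# axis=0 collapses the 3 rows, leaving one sum per column
col_sums = a.sum(axis=0)

# axis=1 collapses the 4 columns, leaving one sum per row
row_sums = a.sum(axis=1)

# keepdims=True keeps the collapsed axis with length 1
row_sums_2d = a.sum(axis=1, keepdims=True)

print(col_sums.shape, row_sums.shape, row_sums_2d.shape)
```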

The unique function is extremely useful in working with categorical data:


In [77]:
np.unique(genders)


Out[77]:
array(['f', 'm'], 
      dtype='|S1')

In [78]:
np.unique(genders, return_counts=True)


Out[78]:
(array(['f', 'm'], 
       dtype='|S1'), array([4, 6]))

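The where function allows you to apply conditional logic to arrays. It returns the second argument where the condition is true and the third argument where it is false. Here is a rough sketch of its signature:

def where(condition, if_true, if_false):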

In [79]:
np.where(ages>30, 0, 1)


Out[79]:
array([1, 0, 0, 0, 1, 0, 1, 1, 1, 0])

The if_true and if_false values can be arrays themselves:


In [80]:
np.where(ages<30, 0, ages)


Out[80]:
array([ 0, 56, 67, 89,  0, 56,  0,  0,  0, 72])
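For intuition, np.where(condition, x, y) takes values from x where the condition is true and from y elsewhere. The function below, where_sketch, is a hypothetical and slow pure-Python stand-in written only for illustration; it is not how NumPy implements where.

```python
import numpy as np

def where_sketch(condition, if_true, if_false):
    """Illustrative stand-in for np.where (hypothetical helper, not NumPy's code)."""
    # Broadcast so that scalars and arrays can be mixed freely
    condition, if_true, if_false = np.broadcast_arrays(condition, if_true, if_false)
    # Pick elementwise: the if_true value when the condition holds, else if_false
    picked = [t if c else f
              for c, t, f in zip(condition.ravel(), if_true.ravel(), if_false.ravel())]
    return np.array(picked).reshape(condition.shape)

ages = np.array([23, 56, 67, 89, 23, 56, 27, 12, 8, 72])
result = where_sketch(ages < 30, 0, ages)
print(result)
```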

File IO

NumPy has a number of different functions for reading and writing arrays to and from disk.

Single array, binary format


In [81]:
a = np.random.rand(10)
a


Out[81]:
array([ 0.11893738,  0.34881727,  0.04730572,  0.09967683,  0.17042978,
        0.09753468,  0.42821737,  0.43256054,  0.25493353,  0.78965296])

Save the array to a binary file named array1.npy:


In [82]:
np.save('array1', a)

In [83]:
ls


array1.npy              CSV.ipynb        DataScienceProcess.ipynb  Numpy.ipynb
Chinook_Sqlite.sqlite*  DataIntro.ipynb  JSON.ipynb                SQL.ipynb

Using %pycat to look at the file shows that it is binary:


In [84]:
%pycat array1.npy

Load the array back into memory:


In [85]:
a_copy = np.load('array1.npy')

In [86]:
a_copy


Out[86]:
array([ 0.11893738,  0.34881727,  0.04730572,  0.09967683,  0.17042978,
        0.09753468,  0.42821737,  0.43256054,  0.25493353,  0.78965296])

Single array, text format


In [87]:
b = np.random.randint(0,10,(5,3))
b


Out[87]:
array([[9, 2, 4],
       [0, 0, 8],
       [9, 9, 6],
       [4, 8, 7],
       [2, 6, 3]])

The savetxt function saves arrays in a simple, textual format that is less efficient, but easier for other languages to read:


In [88]:
np.savetxt('array2.txt', b)

In [89]:
ls


array1.npy              CSV.ipynb                 JSON.ipynb
array2.txt              DataIntro.ipynb           Numpy.ipynb
Chinook_Sqlite.sqlite*  DataScienceProcess.ipynb  SQL.ipynb

Using %pycat to look at the contents shows that the file is indeed a plain text file:


In [90]:
%pycat array2.txt

In [91]:
np.loadtxt('array2.txt')


Out[91]:
array([[ 9.,  2.,  4.],
       [ 0.,  0.,  8.],
       [ 9.,  9.,  6.],
       [ 4.,  8.,  7.],
       [ 2.,  6.,  3.]])

Multiple arrays, binary format

The savez function provides an efficient way of saving multiple arrays to a single file:


In [92]:
np.savez('arrays.npz', a=a, b=b)

The load function returns a dictionary-like object that provides access to the individual arrays:


In [93]:
a_and_b = np.load('arrays.npz')

In [94]:
a_and_b['a']


Out[94]:
array([ 0.11893738,  0.34881727,  0.04730572,  0.09967683,  0.17042978,
        0.09753468,  0.42821737,  0.43256054,  0.25493353,  0.78965296])

In [95]:
a_and_b['b']


Out[95]:
array([[9, 2, 4],
       [0, 0, 8],
       [9, 9, 6],
       [4, 8, 7],
       [2, 6, 3]])
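The round trip can be checked programmatically. The sketch below writes to a temporary directory (an assumption made so the example leaves no files behind) and uses small deterministic arrays rather than the random ones above.

```python
import os
import tempfile
import numpy as np

a = np.arange(5, dtype=float)
b = np.arange(6).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'arrays.npz')
    np.savez(path, a=a, b=b)

    # The loaded object lists its array names in the files attribute;
    # using it as a context manager closes the underlying file afterwards
    with np.load(path) as archive:
        names = sorted(archive.files)
        a_loaded = archive['a']
        b_loaded = archive['b']

print(names)
print(np.array_equal(a, a_loaded), np.array_equal(b, b_loaded))
```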

Linear algebra

NumPy has excellent linear algebra capabilities.


In [96]:
a = np.random.rand(5,5)
b = np.random.rand(5,5)

Remember that array operations are elementwise. Thus, this is not matrix multiplication:


In [97]:
a*b


Out[97]:
array([[ 0.78372413,  0.47273738,  0.31153567,  0.40965476,  0.08740972],
       [ 0.25862387,  0.29022236,  0.17732679,  0.06547613,  0.40101443],
       [ 0.12667065,  0.47814036,  0.06160667,  0.42398101,  0.03274227],
       [ 0.03461579,  0.10325288,  0.68198206,  0.04473719,  0.05574423],
       [ 0.46070914,  0.31268264,  0.18929165,  0.01685043,  0.02724934]])

To get matrix multiplication use np.dot:


In [98]:
np.dot(a, b)


Out[98]:
array([[ 1.58001604,  1.3775459 ,  1.52535348,  1.7796161 ,  1.3377806 ],
       [ 2.1501139 ,  1.86328779,  1.58472464,  2.37983886,  1.47982503],
       [ 0.62227038,  0.63640089,  0.93216562,  0.8310279 ,  1.00756454],
       [ 1.45513071,  1.33584842,  1.21990405,  1.56385159,  0.98995202],
       [ 1.21957263,  1.03686995,  0.86930139,  1.2717817 ,  0.63968366]])
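On Python 3.5 and later, the @ operator is another way to write matrix multiplication for arrays. The sketch below uses small hand-picked matrices (not the random ones above) so the difference from elementwise multiplication is easy to verify by hand.

```python
import numpy as np

a = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([[0.0, 1.0],
              [1.0, 0.0]])

# Elementwise product: each entry is a[i, j] * b[i, j]
elementwise = a * b

# Matrix product: rows of a combined with columns of b
matmul = a @ b

print(elementwise)
print(matmul)
print(np.array_equal(matmul, np.dot(a, b)))
```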

Alternatively, NumPy has a matrix subclass for which matrix operations are the default:


In [99]:
m1 = np.matrix(a)
m2 = np.matrix(b)

In [100]:
m1*m2


Out[100]:
matrix([[ 1.58001604,  1.3775459 ,  1.52535348,  1.7796161 ,  1.3377806 ],
        [ 2.1501139 ,  1.86328779,  1.58472464,  2.37983886,  1.47982503],
        [ 0.62227038,  0.63640089,  0.93216562,  0.8310279 ,  1.00756454],
        [ 1.45513071,  1.33584842,  1.21990405,  1.56385159,  0.98995202],
        [ 1.21957263,  1.03686995,  0.86930139,  1.2717817 ,  0.63968366]])

The np.linalg package has a wide range of fast linear algebra operations.

Here is the determinant:


In [101]:
np.linalg.det(a)


Out[101]:
0.012941957503533077

Matrix inverse:


In [ ]:
np.linalg.inv(a)

Eigenvalues:


In [ ]:
np.linalg.eigvals(a)
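Because the two cells above have no recorded output, here is a small sketch on a hand-picked 2x2 matrix (chosen for this example, with eigenvalues that are easy to verify) that checks the results against standard identities.

```python
import numpy as np

a = np.array([[4.0, 1.0],
              [2.0, 3.0]])

det = np.linalg.det(a)
inv = np.linalg.inv(a)
eigs = np.linalg.eigvals(a)

# A matrix times its inverse should be (numerically) the identity
print(np.allclose(a @ inv, np.eye(2)))

# The eigenvalues sum to the trace and multiply to the determinant
print(np.isclose(eigs.sum(), np.trace(a)), np.isclose(eigs.prod(), det))
```

Results are compared with allclose and isclose rather than ==, since floating point linear algebra is only accurate up to rounding error.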

NumPy can be built against fast BLAS/LAPACK implementations for these linear algebra operations.


In [118]:
c = np.random.rand(2000,2000)

In [119]:
%timeit -n1 -r1 evs = np.linalg.eigvals(c)


1 loops, best of 1: 23.7 s per loop

Random numbers

NumPy has functions for creating arrays of random numbers from different distributions in np.random, as well as handling things like permutation, shuffling, and choosing.

Here is the numpy.random documentation.


In [104]:
plt.hist(np.random.random(250))
plt.title('Uniform Random Distribution $[0,1]$')
plt.xlabel('value')
plt.ylabel('count')


Out[104]:
<matplotlib.text.Text at 0x7f785a7dd990>

In [105]:
plt.hist(np.random.randn(250))
plt.title('Standard Normal Distribution')
plt.xlabel('value')
plt.ylabel('count')


Out[105]:
<matplotlib.text.Text at 0x7f785a646e10>

The shuffle function shuffles an array in place:


In [106]:
a = np.arange(0,10)
np.random.shuffle(a)
a


Out[106]:
array([9, 2, 1, 6, 5, 0, 7, 8, 3, 4])

The permutation function does the same thing but first makes a copy:


In [107]:
a = np.arange(0,10)
print(np.random.permutation(a))
print(a)


[0 9 6 1 7 2 5 3 8 4]
[0 1 2 3 4 5 6 7 8 9]

The choice function provides a powerful way of creating synthetic data sets of discrete data:


In [110]:
np.random.choice(['m','f'], 20, p=[0.25,0.75])


Out[110]:
array(['f', 'f', 'f', 'f', 'm', 'm', 'f', 'f', 'f', 'f', 'f', 'f', 'f',
       'f', 'f', 'f', 'f', 'f', 'f', 'm'], 
      dtype='|S1')
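The random results above differ on every run. When reproducibility matters, one option is a seeded generator created with np.random.default_rng (available in NumPy 1.17 and later). The sketch below is a minimal illustration; the seed value 42 is an arbitrary choice.

```python
import numpy as np

# Two generators created with the same seed produce the same stream
rng1 = np.random.default_rng(42)
rng2 = np.random.default_rng(42)

sample1 = rng1.choice(['m', 'f'], 20, p=[0.25, 0.75])
sample2 = rng2.choice(['m', 'f'], 20, p=[0.25, 0.75])

# permutation returns a shuffled copy and leaves the input unchanged
a = np.arange(10)
shuffled = rng1.permutation(a)

print(np.array_equal(sample1, sample2))
print(shuffled, a)
```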

Resources