NumPy: Numerical Arrays for Python

Learning Objectives: Learn how to create, transform, and visualize multidimensional data of a single type using NumPy.

NumPy is the foundation for scientific computing and data science in Python. Its central object is the multidimensional array, which has the following properties:

  • Any number of dimensions
  • All elements of an array have the same data type
  • Array elements are usually native data types
  • The memory for an array is a contiguous block that can be easily passed to other numerical libraries (BLAS, LAPACK, etc.).
  • Most of NumPy is implemented in C, so it is fast

NumPy arrays are the foundational data type that the entire Python numerical computing stack is built upon.
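
As a small illustration of the properties listed above, the following sketch checks that mixed inputs are converted to one common dtype and that a freshly created array is stored contiguously. The variable names are chosen only for this example.

```python
import numpy as np

# Mixed ints and floats are upcast to a single common dtype.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)                    # float64

# A newly created array is stored in one contiguous block (C order).
grid = np.zeros((2, 3))
print(grid.flags['C_CONTIGUOUS'])     # True
print(grid.ndim, grid.shape)          # 2 (2, 3)
```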

Plotting

While this notebook doesn't focus on plotting, matplotlib will be used to make a few basic plots.


In [1]:
%matplotlib inline
from matplotlib import pyplot as plt
plt.style.use('ggplot')

Multidimensional array type


In [2]:
import numpy as np
import vizarray as vz

In [3]:
data = [0,2,4,6]
a = np.array(data)

In [4]:
type(a)


Out[4]:
numpy.ndarray

In [5]:
a


Out[5]:
array([0, 2, 4, 6])

In [6]:
vz.vizarray(a)


Out[6]:

In [7]:
a.shape


Out[7]:
(4,)

In [8]:
a.ndim


Out[8]:
1

In [9]:
a.size


Out[9]:
4

In [10]:
a.nbytes


Out[10]:
32

In [11]:
a.dtype


Out[11]:
dtype('int64')
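
The attributes shown above are related: nbytes is the number of elements times the size of one element. A minimal sketch, with the dtype passed explicitly (an assumption made here so the result does not depend on the platform's default integer size):

```python
import numpy as np

a = np.array([0, 2, 4, 6], dtype=np.int64)

print(a.itemsize)                        # 8 bytes per int64 element
print(a.size * a.itemsize == a.nbytes)   # True: 4 elements * 8 bytes = 32
```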

Creating arrays


In [12]:
data = [[0.0,2.0,4.0,6.0],[1.0,3.0,5.0,7.0]]
b = np.array(data)

In [13]:
b


Out[13]:
array([[ 0.,  2.,  4.,  6.],
       [ 1.,  3.,  5.,  7.]])

In [14]:
vz.vizarray(b)


Out[14]:

In [15]:
b.shape, b.ndim, b.size, b.nbytes


Out[15]:
((2, 4), 2, 8, 64)

In [16]:
c = np.arange(0.0, 10.0, 1.0) # Step size of 1.0
c


Out[16]:
array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])

In [17]:
e = np.linspace(0.0, 5.0, 11) # 11 points
e


Out[17]:
array([ 0. ,  0.5,  1. ,  1.5,  2. ,  2.5,  3. ,  3.5,  4. ,  4.5,  5. ])

In [18]:
np.empty((4,4))


Out[18]:
array([[ -2.68156159e+154,  -2.68156159e+154,   2.18723965e-314,
          2.18725447e-314],
       [  2.18725561e-314,   2.18725518e-314,   2.18725476e-314,
          2.18725457e-314],
       [  2.18725442e-314,   2.18725490e-314,   2.18725433e-314,
          0.00000000e+000],
       [ -2.68156159e+154,  -3.10503637e+231,   9.88131292e-324,
          1.11253912e-308]])

In [19]:
np.zeros((3,3))


Out[19]:
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

In [20]:
np.ones((3,3))


Out[20]:
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

See also:

  • empty_like, ones_like, zeros_like
  • eye, identity
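
A short sketch of a few of these helpers may be useful. The array named template is only an illustration; the *_like functions copy its shape and dtype.

```python
import numpy as np

template = np.array([[1, 2, 3], [4, 5, 6]])

zeros = np.zeros_like(template)   # same shape and dtype as template, filled with 0
ones = np.ones_like(template)     # same shape and dtype, filled with 1
ident = np.eye(3)                 # 3x3 identity matrix (float64 by default)

print(zeros.shape, zeros.dtype == template.dtype)
print(np.array_equal(ident, np.identity(3)))   # True
```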

dtype

Arrays have a dtype attribute that encapsulates the data type of each element. It can be set:

  • Implicitly by the element type
  • By passing the dtype argument to an array creation function

In [21]:
a = np.array([0,1,2,3])

In [22]:
a, a.dtype


Out[22]:
(array([0, 1, 2, 3]), dtype('int64'))

All array creation functions accept an optional dtype argument:


In [23]:
b = np.zeros((2,2), dtype=np.complex64)
b


Out[23]:
array([[ 0.+0.j,  0.+0.j],
       [ 0.+0.j,  0.+0.j]], dtype=complex64)

In [24]:
c = np.arange(0, 10, 2, dtype=float)  # the np.float alias was removed in NumPy 1.24; use float or np.float64
c


Out[24]:
array([ 0.,  2.,  4.,  6.,  8.])

You can use the astype method to create a copy of the array with a given dtype:


In [25]:
d = c.astype(dtype=int)  # the np.int alias was removed in NumPy 1.24; use int or np.int64
d


Out[25]:
array([0, 2, 4, 6, 8])
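
Two details of astype are easy to miss: the result is a new array, and converting floats to integers drops the fractional part rather than rounding. A minimal sketch with made-up values:

```python
import numpy as np

c = np.array([0.5, 1.7, -1.7])
d = c.astype(int)      # new array; fractional parts are truncated toward zero

print(d)               # elements 0, 1, -1
d[0] = 99              # modifying the copy...
print(c[0])            # 0.5 (...leaves the original untouched)
```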

IPython's tab completion and wildcard search (a name pattern followed by ?) are useful for exploring the various available dtypes:


In [26]:
np.float*?


np.float
np.float128
np.float16
np.float32
np.float64
np.float_
np.floating

The NumPy documentation on dtypes describes the many other ways of specifying dtypes.

Array operations

Basic mathematical operations are elementwise for:

  • Scalars and arrays
  • Arrays and arrays

In [27]:
a = np.empty((3,3))
a.fill(0.1)
a


Out[27]:
array([[ 0.1,  0.1,  0.1],
       [ 0.1,  0.1,  0.1],
       [ 0.1,  0.1,  0.1]])

In [28]:
b = np.ones((3,3))
b


Out[28]:
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

In [29]:
a+b


Out[29]:
array([[ 1.1,  1.1,  1.1],
       [ 1.1,  1.1,  1.1],
       [ 1.1,  1.1,  1.1]])

In [30]:
b/a


Out[30]:
array([[ 10.,  10.,  10.],
       [ 10.,  10.,  10.],
       [ 10.,  10.,  10.]])

In [31]:
a**2


Out[31]:
array([[ 0.01,  0.01,  0.01],
       [ 0.01,  0.01,  0.01],
       [ 0.01,  0.01,  0.01]])

In [32]:
np.pi*b


Out[32]:
array([[ 3.14159265,  3.14159265,  3.14159265],
       [ 3.14159265,  3.14159265,  3.14159265],
       [ 3.14159265,  3.14159265,  3.14159265]])
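
To make "elementwise" concrete, the sketch below compares an array expression with the equivalent explicit Python loop. The small arrays x and y are illustrative only.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([10.0, 20.0, 30.0])

# Array with array: each element of x is paired with the matching element of y.
loop_sum = np.array([xi + yi for xi, yi in zip(x, y)])
print(np.array_equal(x + y, loop_sum))        # True

# Scalar with array: the scalar is applied to every element.
loop_scaled = np.array([2 * xi for xi in x])
print(np.array_equal(2 * x, loop_scaled))     # True
```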

Indexing and slicing

Indexing and slicing provide an efficient way of getting the values in an array and modifying them.


In [33]:
a = np.random.rand(10,10)

The enable function is part of vizarray and enables a nice display of arrays:


In [34]:
vz.enable()

In [35]:
a


Out[35]:

In [36]:
a[0,0]


Out[36]:
0.44765195731375318

In [37]:
a[-1,-1] == a[9,9]


Out[37]:
True

Extract the 0th column:


In [38]:
a[:,0]


Out[38]:

The last row:


In [39]:
a[-1,:]


Out[39]:

You can also slice ranges:


In [40]:
a[0:2,0:2]


Out[40]:

Assignment also works with slices:


In [41]:
a[0:5,0:5] = 1.0

In [42]:
a


Out[42]:

In [43]:
vz.disable()

Note that assigning a value to the slice changed the original array. This shows that slices are views of the same data, not copies.
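
The view behavior can be checked directly. This sketch uses np.shares_memory to test whether two arrays overlap in memory and the copy method to get independent data when that is what you want.

```python
import numpy as np

a = np.zeros((4, 4))

view = a[0:2, 0:2]                 # a slice is a view onto a's memory
view[:] = 1.0                      # writing through the view changes a
print(a[0, 0])                     # 1.0
print(np.shares_memory(a, view))   # True

safe = a[0:2, 0:2].copy()          # an explicit copy owns its own memory
safe[:] = 5.0
print(a[0, 0])                     # still 1.0
```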

Boolean indexing


In [44]:
ages = np.array([23,56,67,89,23,56,27,12,8,72])
genders = np.array(['m','m','f','f','m','f','m','m','m','f'])

In [45]:
ages > 30


Out[45]:
array([False,  True,  True,  True, False,  True, False, False, False,  True], dtype=bool)

In [46]:
genders == 'm'


Out[46]:
array([ True,  True, False, False,  True, False,  True,  True,  True, False], dtype=bool)

In [47]:
(ages > 10) & (ages < 50)


Out[47]:
array([ True, False, False, False,  True, False,  True,  True, False, False], dtype=bool)

You can use a boolean array to index into the original or another array:


In [48]:
mask = (genders == 'f')
ages[mask]


Out[48]:
array([67, 89, 56, 72])

In [49]:
ages[ages>30]


Out[49]:
array([56, 67, 89, 56, 72])
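
Masks can be combined and can also be used on the left-hand side of an assignment. A brief sketch using the same ages and genders data; the names adult_men and capped are introduced here only for illustration.

```python
import numpy as np

ages = np.array([23, 56, 67, 89, 23, 56, 27, 12, 8, 72])
genders = np.array(['m', 'm', 'f', 'f', 'm', 'f', 'm', 'm', 'm', 'f'])

# Combine conditions with & (and), | (or), ~ (not); the parentheses are required.
adult_men = ages[(genders == 'm') & (ages > 20)]
print(adult_men)             # elements 23, 56, 23, 27

# A mask can also select the elements to overwrite.
capped = ages.copy()
capped[capped > 60] = 60
print(capped.max())          # 60
```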

Reshaping, transposing


In [50]:
vz.enable()

In [51]:
a = np.random.rand(3,4)

In [52]:
a


Out[52]:

In [53]:
a.T


Out[53]:

In [54]:
a.reshape(2,6)


Out[54]:

In [55]:
a.reshape(6,2)


Out[55]:

In [56]:
a.ravel()


Out[56]:

In [57]:
vz.disable()
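
Since the outputs above are graphical, here is a text-only sketch of the same operations on an array with known values. Passing -1 to reshape asks NumPy to infer that dimension from the total size.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)   # values 0..11 laid out row by row

print(a.T.shape)                  # (4, 3)
print(a.reshape(2, -1).shape)     # (2, 6); -1 lets NumPy infer the length
print(a.ravel())                  # flattened back to 0..11
print(a.T[1, 0] == a[0, 1])       # True: transposing swaps the indices
```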

Universal functions

Universal functions, or "ufuncs," are functions that take and return arrays or scalars:

  • Vectorized C implementations, much faster than hand written loops in Python
  • Allow for concise Pythonic code
  • The NumPy documentation has a complete list of the available ufuncs.
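
To illustrate the first two points, this sketch computes the same damped sine curve with a pure-Python loop and with ufuncs, then checks that the results agree. It does not measure speed; timing both versions with %timeit is left to the reader.

```python
import math
import numpy as np

t = np.linspace(0.0, 4 * np.pi, 100)

# Pure-Python loop: one function call per element.
looped = np.array([math.exp(-0.1 * ti) * math.sin(ti) for ti in t])

# Ufuncs: the same computation expressed on whole arrays.
vectorized = np.exp(-0.1 * t) * np.sin(t)

print(np.allclose(looped, vectorized))   # True
```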

In [58]:
vz.set_block_size(5)
vz.enable()

In [59]:
t = np.linspace(0.0, 4*np.pi, 100)
t


Out[59]:

In [60]:
np.sin(t)


Out[60]:

In [61]:
np.exp(t)


Out[61]:

In [62]:
vz.disable()
vz.set_block_size(30)

In [63]:
plt.plot(t, np.exp(-0.1*t)*np.sin(t))


Out[63]:
[<matplotlib.lines.Line2D at 0x105187150>]

Basic data processing


In [64]:
ages = np.array([23,56,67,89,23,56,27,12,8,72])
genders = np.array(['m','m','f','f','m','f','m','m','m','f'])

NumPy has a basic set of methods and functions for computing summary statistics of data.


In [65]:
ages.min(), ages.max()


Out[65]:
(8, 89)

In [66]:
ages.mean()


Out[66]:
43.299999999999997

In [67]:
ages.var(), ages.std()


Out[67]:
(711.21000000000004, 26.668520768876554)

In [68]:
np.bincount(ages)


Out[68]:
array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

The cumsum and cumprod methods compute cumulative sums and products:


In [69]:
ages.cumsum()


Out[69]:
array([ 23,  79, 146, 235, 258, 314, 341, 353, 361, 433])

In [70]:
ages.cumprod()


Out[70]:
array([              23,             1288,            86296,
                7680344,        176647912,       9892283072,
           267091642944,    3205099715328,   25640797722624,
       1846137436028928])

Most of the functions and methods above take an axis argument that will apply the action along a particular axis:


In [71]:
a = np.random.randint(0,10,(3,4))
a


Out[71]:
array([[5, 1, 8, 3],
       [9, 5, 1, 3],
       [7, 7, 9, 9]])

With axis=0 the action runs down the rows, collapsing them to give one result per column:


In [72]:
a.sum(axis=0)


Out[72]:
array([21, 13, 18, 15])

With axis=1 the action runs across the columns, collapsing them to give one result per row:


In [73]:
a.sum(axis=1)


Out[73]:
array([17, 18, 32])
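
One way to remember the axis argument is that the named axis is the one that disappears from the result's shape. The sketch below uses the same values as the array above; keepdims is a standard option that keeps the collapsed axis with length 1.

```python
import numpy as np

a = np.array([[5, 1, 8, 3],
              [9, 5, 1, 3],
              [7, 7, 9, 9]])      # shape (3, 4)

print(a.sum(axis=0).shape)   # (4,): axis 0 (length 3) was collapsed
print(a.sum(axis=1).shape)   # (3,): axis 1 (length 4) was collapsed
print(a.sum())               # 67: with no axis, everything is summed
print(a.sum(axis=1, keepdims=True).shape)   # (3, 1)
```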

The unique function is extremely useful in working with categorical data:


In [74]:
np.unique(genders)


Out[74]:
array(['f', 'm'], 
      dtype='|S1')

In [75]:
np.unique(genders, return_counts=True)


Out[75]:
(array(['f', 'm'], 
       dtype='|S1'), array([4, 6]))

The where function allows you to apply conditional logic to arrays. Here is a rough sketch of its signature:

def where(condition, if_true, if_false):

In [76]:
np.where(ages>30, 0, 1)


Out[76]:
array([1, 0, 0, 0, 1, 0, 1, 1, 1, 0])

The if_true and if_false values can be arrays themselves:


In [77]:
np.where(ages<30, 0, ages)


Out[77]:
array([ 0, 56, 67, 89,  0, 56,  0,  0,  0, 72])
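
np.where(condition, x, y) takes each value from x where the condition is True and from y elsewhere. The helper below, where_sketch, is a hypothetical pure-Python equivalent for 1-D arrays of equal length, written only to make that rule explicit; it is not how NumPy implements the function.

```python
import numpy as np

def where_sketch(condition, if_true, if_false):
    """Illustrative element-by-element version of np.where for 1-D inputs."""
    return np.array([t if c else f
                     for c, t, f in zip(condition, if_true, if_false)])

ages = np.array([23, 56, 67, 89, 23, 56, 27, 12, 8, 72])
zeros = np.zeros_like(ages)

expected = np.where(ages < 30, zeros, ages)
print(np.array_equal(where_sketch(ages < 30, zeros, ages), expected))   # True
```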

File IO

NumPy has a number of different functions for reading and writing arrays to and from disk.

Single array, binary format


In [78]:
a = np.random.rand(10)
a


Out[78]:
array([ 0.91267205,  0.61166094,  0.79965337,  0.17049513,  0.47530885,
        0.49000934,  0.77588517,  0.71381787,  0.36675083,  0.40157028])

In [79]:
np.save('array1', a)

In [80]:
ls


Numpy Exercises.ipynb  array1.npy             vizarray.py
Numpy.ipynb            temps.txt              vizarray.pyc

Using %pycat to look at the file shows that it is binary:


In [81]:
%pycat array1.npy


“NUMPYF{'descr': '<f8', 'fortran_order': False, 'shape': (10,), }           


>œ4í?Ü(çø¹’ã?ÚmE¨Â–é?8ÊËÈÒÅ?8ýâÏukÞ?ø÷$%P\ß?A‹§"
Ôè?c6‘˜×æ?&ÅLvØx×?JšÔÐS³Ù?

*** ERROR: EOF in multi-line statement



In [82]:
a_copy = np.load('array1.npy')

In [83]:
a_copy


Out[83]:
array([ 0.91267205,  0.61166094,  0.79965337,  0.17049513,  0.47530885,
        0.49000934,  0.77588517,  0.71381787,  0.36675083,  0.40157028])

Single array, text format


In [84]:
b = np.random.randint(0,10,(5,3))
b


Out[84]:
array([[2, 0, 2],
       [6, 8, 9],
       [5, 4, 6],
       [0, 6, 6],
       [2, 3, 6]])

In [85]:
np.savetxt('array2.txt', b)

In [86]:
ls


Numpy Exercises.ipynb  array1.npy             temps.txt              vizarray.pyc
Numpy.ipynb            array2.txt             vizarray.py

Using %pycat to look at the contents shows that the file is indeed a plain text file:


In [87]:
%pycat array2.txt


2.000000000000000000e+00 0.000000000000000000e+00 2.000000000000000000e+00
6.000000000000000000e+00 8.000000000000000000e+00 9.000000000000000000e+00
5.000000000000000000e+00 4.000000000000000000e+00 6.000000000000000000e+00
0.000000000000000000e+00 6.000000000000000000e+00 6.000000000000000000e+00
2.000000000000000000e+00 3.000000000000000000e+00 6.000000000000000000e+00


In [88]:
np.loadtxt('array2.txt')


Out[88]:
array([[ 2.,  0.,  2.],
       [ 6.,  8.,  9.],
       [ 5.,  4.,  6.],
       [ 0.,  6.,  6.],
       [ 2.,  3.,  6.]])

Multiple arrays, binary format


In [89]:
np.savez('arrays.npz', a=a, b=b)

In [90]:
a_and_b = np.load('arrays.npz')

In [91]:
a_and_b['a']


Out[91]:
array([ 0.91267205,  0.61166094,  0.79965337,  0.17049513,  0.47530885,
        0.49000934,  0.77588517,  0.71381787,  0.36675083,  0.40157028])

In [92]:
a_and_b['b']


Out[92]:
array([[2, 0, 2],
       [6, 8, 9],
       [5, 4, 6],
       [0, 6, 6],
       [2, 3, 6]])
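
The three formats can be exercised together in one round trip. This sketch writes into a temporary directory (an assumption made here so it leaves no files behind) and passes fmt='%d' to savetxt so integers are written as integers rather than in the default floating-point format seen above.

```python
import os
import tempfile
import numpy as np

a = np.arange(5, dtype=np.float64)
b = np.arange(6).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp:
    # Binary .npy keeps dtype and shape exactly.
    np.save(os.path.join(tmp, 'a.npy'), a)
    a_back = np.load(os.path.join(tmp, 'a.npy'))

    # Text output uses a float format by default; fmt controls the formatting.
    np.savetxt(os.path.join(tmp, 'b.txt'), b, fmt='%d')
    b_back = np.loadtxt(os.path.join(tmp, 'b.txt'), dtype=int)

    # .npz bundles several named arrays; the with block closes the file.
    np.savez(os.path.join(tmp, 'both.npz'), a=a, b=b)
    with np.load(os.path.join(tmp, 'both.npz')) as bundle:
        names = sorted(bundle.files)

print(np.array_equal(a, a_back), np.array_equal(b, b_back), names)
```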

Linear algebra

NumPy has excellent linear algebra capabilities.


In [93]:
a = np.random.rand(5,5)
b = np.random.rand(5,5)

Remember that array operations are elementwise. Thus, this is not matrix multiplication:


In [94]:
a*b


Out[94]:
array([[ 0.00204705,  0.54096634,  0.47259787,  0.05412818,  0.19671501],
       [ 0.66129874,  0.13312196,  0.44184267,  0.12562755,  0.07581084],
       [ 0.03015261,  0.00721693,  0.49412109,  0.18472905,  0.10839132],
       [ 0.05813496,  0.89043886,  0.01727079,  0.13753354,  0.01119625],
       [ 0.17572881,  0.02565493,  0.7244605 ,  0.25694105,  0.24387372]])

To get matrix multiplication use np.dot:


In [95]:
np.dot(a, b)


Out[95]:
array([[ 0.80521201,  1.22847687,  1.59594972,  0.65024802,  0.84586594],
       [ 0.35977754,  1.68214909,  1.47852159,  0.58675961,  0.67976828],
       [ 0.5111124 ,  1.34764816,  1.39081146,  0.60400515,  0.7157343 ],
       [ 1.12466104,  1.98517212,  1.32702902,  0.51282008,  1.0631991 ],
       [ 0.35522803,  1.42543228,  1.88529867,  0.73195573,  0.87746894]])
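
On Python 3.5 and later, the @ operator is another way to write matrix multiplication of arrays. It is not used in the cells above; the sketch below, with small illustrative matrices, shows that it agrees with np.dot for 2-D arrays and differs from the elementwise product.

```python
import numpy as np

a = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([[0.0, 1.0],
              [1.0, 0.0]])

print(a * b)      # elementwise product
print(a @ b)      # matrix product: here it swaps the columns of a
print(np.array_equal(a @ b, np.dot(a, b)))   # True
```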

NumPy also has a matrix subclass for which matrix operations are the default (the NumPy documentation no longer recommends it for new code):


In [96]:
m1 = np.matrix(a)
m2 = np.matrix(b)

In [97]:
m1*m2


Out[97]:
matrix([[ 0.80521201,  1.22847687,  1.59594972,  0.65024802,  0.84586594],
        [ 0.35977754,  1.68214909,  1.47852159,  0.58675961,  0.67976828],
        [ 0.5111124 ,  1.34764816,  1.39081146,  0.60400515,  0.7157343 ],
        [ 1.12466104,  1.98517212,  1.32702902,  0.51282008,  1.0631991 ],
        [ 0.35522803,  1.42543228,  1.88529867,  0.73195573,  0.87746894]])

The np.linalg package has a wide range of fast linear algebra operations.

Here is the determinant:


In [98]:
np.linalg.det(a)


Out[98]:
-0.10718636089982998

Matrix inverse:


In [99]:
np.linalg.inv(a)


Out[99]:
array([[-0.78993064,  0.3031534 , -1.68709991,  0.7971441 ,  1.43548423],
       [ 0.8580796 , -0.05252972, -1.6586023 ,  0.98068693,  0.13396849],
       [ 1.54890704,  0.66091847, -2.20046493, -0.44462467,  0.60168609],
       [-0.81596389,  0.45517992,  2.84968253, -0.2779306 , -1.51073162],
       [-0.47209591, -1.68088455,  3.22377699, -0.4388912 ,  0.13065681]])

Eigenvalues:


In [100]:
np.linalg.eigvals(a)


Out[100]:
array([ 2.57901192+0.j        , -0.74790223+0.j        ,
       -0.18571386+0.24459447j, -0.18571386-0.24459447j,  0.58919024+0.j        ])
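
Because the matrix above is random, its outputs are hard to verify by eye. The following sketch repeats the same operations on a small fixed matrix whose determinant is easy to compute by hand, and adds np.linalg.solve, which is not used above but is the usual way to solve a linear system without forming the inverse.

```python
import numpy as np

a = np.array([[2.0, 1.0],
              [1.0, 3.0]])
rhs = np.array([3.0, 5.0])

a_inv = np.linalg.inv(a)
print(np.allclose(a @ a_inv, np.eye(2)))     # True
print(np.isclose(np.linalg.det(a), 5.0))     # True: 2*3 - 1*1

# Solve a @ x = rhs directly instead of multiplying by the inverse.
x = np.linalg.solve(a, rhs)
print(np.allclose(a @ x, rhs))               # True

# The eigenvalues of a square matrix multiply to its determinant.
print(np.isclose(np.prod(np.linalg.eigvals(a)), np.linalg.det(a)))   # True
```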

NumPy can be built against fast BLAS/LAPACK implementations for these linear algebra operations.


In [101]:
c = np.random.rand(2000,2000)

In [102]:
%timeit -n1 -r1 evs = np.linalg.eigvals(c)


1 loops, best of 1: 5.17 s per loop

Random numbers

NumPy has functions for creating arrays of random numbers from different distributions in np.random, as well as handling things like permutation and shuffling.

See the numpy.random documentation for the full list.


In [103]:
plt.hist(np.random.random(250))
plt.title('Uniform Random Distribution $[0,1]$')
plt.xlabel('value')
plt.ylabel('count')


Out[103]:
<matplotlib.text.Text at 0x10882d110>

In [104]:
plt.hist(np.random.randn(250))
plt.title('Standard Normal Distribution')
plt.xlabel('value')
plt.ylabel('count')


Out[104]:
<matplotlib.text.Text at 0x10a82f690>

The shuffle function shuffles an array in place:


In [105]:
a = np.arange(0,10)
np.random.shuffle(a)
a


Out[105]:
array([7, 2, 8, 0, 4, 6, 1, 5, 9, 3])

The permutation function does the same thing but first makes a copy:


In [106]:
a = np.arange(0,10)
print(np.random.permutation(a))
print(a)


[0 6 4 9 5 1 8 3 7 2]
[0 1 2 3 4 5 6 7 8 9]
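
Random results differ from run to run unless the generator is seeded. The sketch below uses np.random.default_rng, the Generator interface available in NumPy 1.17 and later; it is not used elsewhere in this notebook, and the seed value 42 is arbitrary. It also checks the in-place versus copy behavior of shuffle and permutation.

```python
import numpy as np

# Two generators created with the same seed produce the same stream.
rng1 = np.random.default_rng(42)
rng2 = np.random.default_rng(42)
print(np.array_equal(rng1.random(5), rng2.random(5)))   # True

a = np.arange(10)
shuffled = rng1.permutation(a)     # returns a shuffled copy
print(np.array_equal(a, np.arange(10)))                 # True: a is unchanged
print(sorted(shuffled) == list(range(10)))              # True: same elements

rng1.shuffle(a)                    # shuffles a in place
print(sorted(a) == list(range(10)))                     # True
```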

Resources