NumPy: Numerical Arrays for Python

Learning Objectives: Learn how to create, transform and visualize multidimensional data of a single type using NumPy.

NumPy is the foundation for scientific computing and data science in Python. Its core array type has the following properties:

Any number of dimensions
All elements of an array have the same data type
Array elements are usually native data types
The memory for an array is typically a contiguous block that can be easily passed to other numerical libraries (BLAS, LAPACK, etc.)
Most of NumPy is implemented in C, so it is fast

NumPy arrays are the foundational data type that the entire Python numerical computing stack is built upon.
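
The properties listed above can be checked directly on a small array. The following is a minimal sketch; the variable names (mixed, grid) are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np

# Mixing ints and floats still yields a single element type:
# NumPy upcasts every element to float64.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)                                # float64

# A freshly created array owns one contiguous block of memory.
grid = np.zeros((3, 4), dtype=np.float64)
print(grid.flags['C_CONTIGUOUS'])                 # True

# The size of that block is the element count times the element size.
print(grid.nbytes == grid.size * grid.itemsize)   # True
```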

Plotting

While this notebook doesn't focus on plotting, matplotlib will be used to make a few basic plots.



In [1]:

    
%matplotlib inline
from matplotlib import pyplot as plt
plt.style.use('ggplot')

Multidimensional array type



In [2]:

    
import numpy as np
import vizarray as vz



In [3]:

    
data = [0,2,4,6]
a = np.array(data)



In [4]:

    
type(a)









    Out[4]:





numpy.ndarray



In [5]:

    
a









    Out[5]:





array([0, 2, 4, 6])



In [6]:

    
vz.vizarray(a)









    Out[6]:



In [7]:

    
a.shape









    Out[7]:





(4,)



In [8]:

    
a.ndim









    Out[8]:





1



In [9]:

    
a.size









    Out[9]:





4



In [10]:

    
a.nbytes









    Out[10]:





32



In [11]:

    
a.dtype









    Out[11]:





dtype('int64')

Creating arrays



In [12]:

    
data = [[0.0,2.0,4.0,6.0],[1.0,3.0,5.0,7.0]]
b = np.array(data)



In [13]:

    
b









    Out[13]:





array([[ 0.,  2.,  4.,  6.],
       [ 1.,  3.,  5.,  7.]])



In [14]:

    
vz.vizarray(b)









    Out[14]:



In [15]:

    
b.shape, b.ndim, b.size, b.nbytes









    Out[15]:





((2, 4), 2, 8, 64)



In [16]:

    
c = np.arange(0.0, 10.0, 1.0) # Step size of 1.0
c









    Out[16]:





array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])



In [17]:

    
e = np.linspace(0.0, 5.0, 11) # 11 points
e









    Out[17]:





array([ 0. ,  0.5,  1. ,  1.5,  2. ,  2.5,  3. ,  3.5,  4. ,  4.5,  5. ])
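
The two cells above differ in how the range is specified: arange takes a step and leaves out the stop value, while linspace takes a number of points and includes the stop value by default. A small sketch of the difference, using illustrative variable names:

```python
import numpy as np

# arange takes a step and excludes the stop value.
by_step = np.arange(0.0, 5.0, 0.5)

# linspace takes a point count and includes the stop value by default;
# retstep=True also returns the spacing it chose.
by_count, step = np.linspace(0.0, 5.0, 11, retstep=True)

print(len(by_step), by_step[-1])     # 10 4.5
print(len(by_count), by_count[-1])   # 11 5.0
print(step)                          # 0.5
```

With a non-integer step, linspace is usually the safer choice, because rounding error can change the number of elements that arange produces.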



In [18]:

    
np.empty((4,4))









    Out[18]:





array([[ -2.68156159e+154,  -2.68156159e+154,   2.18723965e-314,
          2.18725447e-314],
       [  2.18725561e-314,   2.18725518e-314,   2.18725476e-314,
          2.18725457e-314],
       [  2.18725442e-314,   2.18725490e-314,   2.18725433e-314,
          0.00000000e+000],
       [ -2.68156159e+154,  -3.10503637e+231,   9.88131292e-324,
          1.11253912e-308]])



In [19]:

    
np.zeros((3,3))









    Out[19]:





array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])



In [20]:

    
np.ones((3,3))









    Out[20]:





array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

dtype

Arrays have a dtype attribute that encapsulates the data type of each element. It can be set:

Implicitly by the element type
By passing the dtype argument to an array creation function



In [21]:

    
a = np.array([0,1,2,3])



In [22]:

    
a, a.dtype









    Out[22]:





(array([0, 1, 2, 3]), dtype('int64'))

All array creation functions accept an optional dtype argument:



In [23]:

    
b = np.zeros((2,2), dtype=np.complex64)
b









    Out[23]:





array([[ 0.+0.j,  0.+0.j],
       [ 0.+0.j,  0.+0.j]], dtype=complex64)



In [24]:

    
c = np.arange(0, 10, 2, dtype=float)  # the np.float alias was removed in NumPy 1.24
c









    Out[24]:





array([ 0.,  2.,  4.,  6.,  8.])

You can use the astype method to create a copy of the array with a given dtype:



In [25]:

    
d = c.astype(dtype=int)  # the np.int alias was removed in NumPy 1.24
d









    Out[25]:





array([0, 2, 4, 6, 8])
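
Two details of astype are worth seeing in isolation: it returns a new array rather than modifying the original, and a float-to-integer cast drops the fractional part instead of rounding. The sketch below uses made-up values chosen to make both effects visible.

```python
import numpy as np

c = np.array([0.9, 1.5, -1.7, 2.0])

# Casting float to int truncates toward zero.
d = c.astype(np.int64)
print(d)                               # [ 0  1 -1  2]

# astype returned a copy, so changing d leaves c untouched.
d[0] = 100
print(c[0])                            # 0.9

# Round explicitly first if nearest-integer behaviour is wanted.
print(np.rint(c).astype(np.int64))     # [ 1  2 -2  2]
```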

IPython's tab completion and wildcard search (a pattern followed by ?) are useful for exploring the various available dtypes:



In [26]:

    
np.float*?









    



np.float
np.float128
np.float16
np.float32
np.float64
np.float_
np.floating

The NumPy documentation on dtypes describes the many other ways of specifying dtypes.

Array operations

Basic mathematical operations are elementwise for:

Scalars and arrays
Arrays and arrays



In [27]:

    
a = np.empty((3,3))
a.fill(0.1)
a









    Out[27]:





array([[ 0.1,  0.1,  0.1],
       [ 0.1,  0.1,  0.1],
       [ 0.1,  0.1,  0.1]])



In [28]:

    
b = np.ones((3,3))
b









    Out[28]:





array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])



In [29]:

    
a+b









    Out[29]:





array([[ 1.1,  1.1,  1.1],
       [ 1.1,  1.1,  1.1],
       [ 1.1,  1.1,  1.1]])



In [30]:

    
b/a









    Out[30]:





array([[ 10.,  10.,  10.],
       [ 10.,  10.,  10.],
       [ 10.,  10.,  10.]])



In [31]:

    
a**2









    Out[31]:





array([[ 0.01,  0.01,  0.01],
       [ 0.01,  0.01,  0.01],
       [ 0.01,  0.01,  0.01]])



In [32]:

    
np.pi*b









    Out[32]:





array([[ 3.14159265,  3.14159265,  3.14159265],
       [ 3.14159265,  3.14159265,  3.14159265],
       [ 3.14159265,  3.14159265,  3.14159265]])

Indexing and slicing

Indexing and slicing provide an efficient way of getting the values in an array and modifying them.



In [33]:

    
a = np.random.rand(10,10)

The enable function is part of vizarray and enables a nice display of arrays:



In [34]:

    
vz.enable()



In [35]:

    
a









    Out[35]:



In [36]:

    
a[0,0]









    Out[36]:





0.44765195731375318



In [37]:

    
a[-1,-1] == a[9,9]









    Out[37]:





True

Extract the 0th column:



In [38]:

    
a[:,0]









    Out[38]:

The last row:



In [39]:

    
a[-1,:]









    Out[39]:

You can also slice ranges:



In [40]:

    
a[0:2,0:2]









    Out[40]:

Assignment also works with slices:



In [41]:

    
a[0:5,0:5] = 1.0



In [42]:

    
a









    Out[42]:



In [43]:

    
vz.disable()

Note that assigning to the slice changed the original array. This shows that slices are views of the same data, not copies.
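
The view behaviour can be confirmed with np.shares_memory, and an independent array can be requested with the copy method. A minimal sketch on a small array (the names view and independent are illustrative):

```python
import numpy as np

a = np.zeros((4, 4))

# A basic slice is a view: it shares memory with the original array.
view = a[0:2, 0:2]
view[:] = 1.0
print(np.shares_memory(a, view))          # True
print(a[0, 0])                            # 1.0

# An explicit copy is independent of the original.
independent = a[0:2, 0:2].copy()
independent[:] = 5.0
print(np.shares_memory(a, independent))   # False
print(a[0, 0])                            # 1.0
```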

Boolean indexing



In [44]:

    
ages = np.array([23,56,67,89,23,56,27,12,8,72])
genders = np.array(['m','m','f','f','m','f','m','m','m','f'])



In [45]:

    
ages > 30









    Out[45]:





array([False,  True,  True,  True, False,  True, False, False, False,  True], dtype=bool)



In [46]:

    
genders == 'm'









    Out[46]:





array([ True,  True, False, False,  True, False,  True,  True,  True, False], dtype=bool)



In [47]:

    
(ages > 10) & (ages < 50)









    Out[47]:





array([ True, False, False, False,  True, False,  True,  True, False, False], dtype=bool)

You can use a boolean array to index into the original or another array:



In [48]:

    
mask = (genders == 'f')
ages[mask]









    Out[48]:





array([67, 89, 56, 72])



In [49]:

    
ages[ages>30]









    Out[49]:





array([56, 67, 89, 56, 72])
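
Masks built from different arrays can be combined, and a mask can also appear on the left-hand side of an assignment. The following sketch reuses the ages and genders data from above; the names men_over_20 and capped are illustrative.

```python
import numpy as np

ages = np.array([23, 56, 67, 89, 23, 56, 27, 12, 8, 72])
genders = np.array(['m', 'm', 'f', 'f', 'm', 'f', 'm', 'm', 'm', 'f'])

# Combine masks with & (and), | (or) and ~ (not). The parentheses are
# required because these operators bind more tightly than comparisons.
men_over_20 = ages[(genders == 'm') & (ages > 20)]
print(men_over_20)                   # [23 56 23 27]

# A mask computed from one array can summarize another.
print(ages[genders == 'f'].mean())   # 71.0

# Indexing with a mask returns a copy, but assigning through a mask
# modifies the array in place.
capped = ages.copy()
capped[capped > 60] = 60
print(capped)                        # [23 56 60 60 23 56 27 12  8 60]
```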

Reshaping, transposing



In [50]:

    
vz.enable()



In [51]:

    
a = np.random.rand(3,4)



In [52]:

    
a









    Out[52]:



In [53]:

    
a.T









    Out[53]:



In [54]:

    
a.reshape(2,6)









    Out[54]:



In [55]:

    
a.reshape(6,2)









    Out[55]:



In [56]:

    
a.ravel()









    Out[56]:



In [57]:

    
vz.disable()
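
Because the cells above render their results graphically, here is a text-only sketch of the same operations on a small integer array, so the shapes can be read off directly. It also shows -1 as a placeholder dimension, which is not used in the cells above.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

# The transpose swaps the axes.
print(a.T.shape)                 # (4, 3)

# -1 asks NumPy to infer that dimension from the total size.
print(a.reshape(2, -1).shape)    # (2, 6)

# Reshaping a contiguous array gives a view, so writes reach the original.
r = a.reshape(6, 2)
r[0, 0] = 99
print(a[0, 0])                   # 99

# ravel flattens to one dimension in row-major order.
print(a.ravel()[:5])             # [99  1  2  3  4]
```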

Universal functions

Universal functions, or "ufuncs," are functions that take and return arrays or scalars:

Vectorized C implementations, much faster than hand-written loops in Python
Allow for concise Pythonic code
The NumPy documentation has a complete list of the available ufuncs
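
To make the comparison with hand-written loops concrete, the sketch below computes the same damped sine wave twice, once element by element with the math module and once with ufuncs, and checks that the results agree. The names looped and vectorized are illustrative.

```python
import math
import numpy as np

t = np.linspace(0.0, 4 * np.pi, 100)

# Hand-written loop: one Python-level call per element.
looped = np.array([math.exp(-0.1 * x) * math.sin(x) for x in t])

# Ufuncs: the same computation applied to the whole array at once.
vectorized = np.exp(-0.1 * t) * np.sin(t)

print(np.allclose(looped, vectorized))   # True

# Ufuncs also accept plain scalars.
print(np.sin(0.0))                       # 0.0
```

Timing the two versions with %timeit on a larger array is a quick way to see the speed difference on a given machine.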



In [58]:

    
vz.set_block_size(5)
vz.enable()



In [59]:

    
t = np.linspace(0.0, 4*np.pi, 100)
t









    Out[59]:



In [60]:

    
np.sin(t)









    Out[60]:



In [61]:

    
np.exp(t)









    Out[61]:



In [62]:

    
vz.disable()
vz.set_block_size(30)



In [63]:

    
plt.plot(t, np.exp(-0.1*t)*np.sin(t))









    Out[63]:





[<matplotlib.lines.Line2D at 0x105187150>]

Basic data processing



In [64]:

    
ages = np.array([23,56,67,89,23,56,27,12,8,72])
genders = np.array(['m','m','f','f','m','f','m','m','m','f'])

NumPy has a set of methods and functions for computing basic statistics of data.



In [65]:

    
ages.min(), ages.max()









    Out[65]:





(8, 89)



In [66]:

    
ages.mean()









    Out[66]:





43.299999999999997



In [67]:

    
ages.var(), ages.std()









    Out[67]:





(711.21000000000004, 26.668520768876554)



In [68]:

    
np.bincount(ages)









    Out[68]:





array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

The cumsum and cumprod methods compute cumulative sums and products:



In [69]:

    
ages.cumsum()









    Out[69]:





array([ 23,  79, 146, 235, 258, 314, 341, 353, 361, 433])



In [70]:

    
ages.cumprod()









    Out[70]:





array([              23,             1288,            86296,
                7680344,        176647912,       9892283072,
           267091642944,    3205099715328,   25640797722624,
       1846137436028928])

Most of the functions and methods above take an axis argument that will apply the action along a particular axis:



In [71]:

    
a = np.random.randint(0,10,(3,4))
a









    Out[71]:





array([[5, 1, 8, 3],
       [9, 5, 1, 3],
       [7, 7, 9, 9]])

With axis=0 the action runs down the rows, producing one result per column:



In [72]:

    
a.sum(axis=0)









    Out[72]:





array([21, 13, 18, 15])

With axis=1 the action runs across the columns, producing one result per row:



In [73]:

    
a.sum(axis=1)









    Out[73]:





array([17, 18, 32])
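
One way to remember the axis argument is that it names the axis that disappears from the result. The sketch below repeats the sums on the same 3x4 array and adds keepdims, which is not used in the cells above, to show the collapsed axis explicitly.

```python
import numpy as np

a = np.array([[5, 1, 8, 3],
              [9, 5, 1, 3],
              [7, 7, 9, 9]])      # shape (3, 4)

# axis=0 collapses the first axis (rows): one sum per column.
print(a.sum(axis=0))                         # [21 13 18 15]

# axis=1 collapses the second axis (columns): one sum per row.
print(a.sum(axis=1))                         # [17 18 32]

# keepdims=True keeps the collapsed axis with length 1.
print(a.sum(axis=1, keepdims=True).shape)    # (3, 1)

# With no axis, the reduction runs over every element.
print(a.sum())                               # 67
```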

The unique function is extremely useful in working with categorical data:



In [74]:

    
np.unique(genders)









    Out[74]:





array(['f', 'm'], 
      dtype='|S1')



In [75]:

    
np.unique(genders, return_counts=True)









    Out[75]:





(array(['f', 'm'], 
       dtype='|S1'), array([4, 6]))

The where function allows you to apply conditional logic to arrays. Here is a rough sketch of how it works:

def where(condition, if_true, if_false):



In [76]:

    
np.where(ages>30, 0, 1)









    Out[76]:





array([1, 0, 0, 0, 1, 0, 1, 1, 1, 0])

The if_true and if_false values can be arrays themselves:



In [77]:

    
np.where(ages<30, 0, ages)









    Out[77]:





array([ 0, 56, 67, 89,  0, 56,  0,  0,  0, 72])
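
For intuition, the element-by-element selection can be written as a plain Python loop. The function where_sketch below is a hypothetical illustration restricted to one-dimensional inputs of equal length; it is not how NumPy implements np.where, which also handles broadcasting and runs in compiled code.

```python
import numpy as np

def where_sketch(condition, if_true, if_false):
    """Pure-Python illustration of np.where for 1-D inputs of equal length."""
    return np.array([t if c else f
                     for c, t, f in zip(condition, if_true, if_false)])

ages = np.array([23, 56, 67, 89, 23, 56, 27, 12, 8, 72])
zeros = np.zeros_like(ages)

# Take the value from zeros where the condition holds, from ages otherwise.
expected = np.where(ages < 30, zeros, ages)
sketch = where_sketch(ages < 30, zeros, ages)

print(sketch)                             # [ 0 56 67 89  0 56  0  0  0 72]
print(np.array_equal(sketch, expected))   # True
```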

File IO

NumPy has a number of different functions for reading and writing arrays to and from disk.

Single array, binary format



In [78]:

    
a = np.random.rand(10)
a









    Out[78]:





array([ 0.91267205,  0.61166094,  0.79965337,  0.17049513,  0.47530885,
        0.49000934,  0.77588517,  0.71381787,  0.36675083,  0.40157028])



In [79]:

    
np.save('array1', a)



In [80]:

    
ls









    



Numpy Exercises.ipynb  array1.npy             vizarray.py
Numpy.ipynb            temps.txt              vizarray.pyc

Using %pycat to look at the file shows that it is binary:



In [81]:

    
%pycat array1.npy









    



NUMPYF{'descr': '<f8', 'fortran_order': False, 'shape': (10,), }           


>4í?Ü(çø¹ã?ÚmE¨Âé?8ÊËÈÒÅ?8ýâÏukÞ?ø÷$%P\ß?A§"
Ôè?c6¶×æ?&ÅLvØx×?JÔÐS³Ù?

*** ERROR: EOF in multi-line statement



In [82]:

    
a_copy = np.load('array1.npy')



In [83]:

    
a_copy









    Out[83]:





array([ 0.91267205,  0.61166094,  0.79965337,  0.17049513,  0.47530885,
        0.49000934,  0.77588517,  0.71381787,  0.36675083,  0.40157028])

Single array, text format



In [84]:

    
b = np.random.randint(0,10,(5,3))
b









    Out[84]:





array([[2, 0, 2],
       [6, 8, 9],
       [5, 4, 6],
       [0, 6, 6],
       [2, 3, 6]])



In [85]:

    
np.savetxt('array2.txt', b)



In [86]:

    
ls









    



Numpy Exercises.ipynb  array1.npy             temps.txt              vizarray.pyc
Numpy.ipynb            array2.txt             vizarray.py

Using %pycat to look at the contents shows that the file is indeed a plain text file:



In [87]:

    
%pycat array2.txt









    



2.000000000000000000e+00 0.000000000000000000e+00 2.000000000000000000e+00
6.000000000000000000e+00 8.000000000000000000e+00 9.000000000000000000e+00
5.000000000000000000e+00 4.000000000000000000e+00 6.000000000000000000e+00
0.000000000000000000e+00 6.000000000000000000e+00 6.000000000000000000e+00
2.000000000000000000e+00 3.000000000000000000e+00 6.000000000000000000e+00



In [88]:

    
np.loadtxt('array2.txt')









    Out[88]:





array([[ 2.,  0.,  2.],
       [ 6.,  8.,  9.],
       [ 5.,  4.,  6.],
       [ 0.,  6.,  6.],
       [ 2.,  3.,  6.]])

Multiple arrays, binary format



In [89]:

    
np.savez('arrays.npz', a=a, b=b)



In [90]:

    
a_and_b = np.load('arrays.npz')



In [91]:

    
a_and_b['a']









    Out[91]:





array([ 0.91267205,  0.61166094,  0.79965337,  0.17049513,  0.47530885,
        0.49000934,  0.77588517,  0.71381787,  0.36675083,  0.40157028])



In [92]:

    
a_and_b['b']









    Out[92]:





array([[2, 0, 2],
       [6, 8, 9],
       [5, 4, 6],
       [0, 6, 6],
       [2, 3, 6]])
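
The three formats above can be compared in one round trip. The sketch below writes into a temporary directory so that it leaves no files behind; the fmt and dtype arguments are additions not used in the cells above, included to show how integers survive the text format.

```python
import os
import tempfile

import numpy as np

a = np.linspace(0.0, 1.0, 5)
b = np.arange(6).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp:
    # Binary .npy files preserve dtype and shape exactly.
    npy_path = os.path.join(tmp, 'array1.npy')
    np.save(npy_path, a)
    a_copy = np.load(npy_path)

    # Text files store plain numbers: fmt controls how they are written,
    # dtype controls how they are read back (the default is float).
    txt_path = os.path.join(tmp, 'array2.txt')
    np.savetxt(txt_path, b, fmt='%d')
    b_copy = np.loadtxt(txt_path, dtype=int)

    # .npz archives hold several named arrays; the context manager
    # closes the underlying file.
    npz_path = os.path.join(tmp, 'arrays.npz')
    np.savez(npz_path, a=a, b=b)
    with np.load(npz_path) as archive:
        names = sorted(archive.files)
        b_from_npz = archive['b']

print(np.array_equal(a, a_copy))   # True
print(b_copy.dtype.kind)           # i
print(names)                       # ['a', 'b']
```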

Linear algebra

NumPy has excellent linear algebra capabilities.



In [93]:

    
a = np.random.rand(5,5)
b = np.random.rand(5,5)

Remember that array operations are elementwise. Thus, this is not matrix multiplication:



In [94]:

    
a*b









    Out[94]:





array([[ 0.00204705,  0.54096634,  0.47259787,  0.05412818,  0.19671501],
       [ 0.66129874,  0.13312196,  0.44184267,  0.12562755,  0.07581084],
       [ 0.03015261,  0.00721693,  0.49412109,  0.18472905,  0.10839132],
       [ 0.05813496,  0.89043886,  0.01727079,  0.13753354,  0.01119625],
       [ 0.17572881,  0.02565493,  0.7244605 ,  0.25694105,  0.24387372]])

To get matrix multiplication use np.dot:



In [95]:

    
np.dot(a, b)









    Out[95]:





array([[ 0.80521201,  1.22847687,  1.59594972,  0.65024802,  0.84586594],
       [ 0.35977754,  1.68214909,  1.47852159,  0.58675961,  0.67976828],
       [ 0.5111124 ,  1.34764816,  1.39081146,  0.60400515,  0.7157343 ],
       [ 1.12466104,  1.98517212,  1.32702902,  0.51282008,  1.0631991 ],
       [ 0.35522803,  1.42543228,  1.88529867,  0.73195573,  0.87746894]])

Alternatively, NumPy has a matrix subclass for which matrix operations are the default:



In [96]:

    
m1 = np.matrix(a)
m2 = np.matrix(b)



In [97]:

    
m1*m2









    Out[97]:





matrix([[ 0.80521201,  1.22847687,  1.59594972,  0.65024802,  0.84586594],
        [ 0.35977754,  1.68214909,  1.47852159,  0.58675961,  0.67976828],
        [ 0.5111124 ,  1.34764816,  1.39081146,  0.60400515,  0.7157343 ],
        [ 1.12466104,  1.98517212,  1.32702902,  0.51282008,  1.0631991 ],
        [ 0.35522803,  1.42543228,  1.88529867,  0.73195573,  0.87746894]])
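
On Python 3.5 and later, the @ operator offers a third way to write a matrix product on ordinary arrays, and the current NumPy documentation no longer recommends np.matrix for new code. A small sketch with hand-picked 2x2 values so the results can be checked by eye:

```python
import numpy as np

a = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([[0.0, 1.0],
              [1.0, 0.0]])

# Elementwise product: each entry is multiplied by its counterpart.
print((a * b).tolist())                      # [[0.0, 2.0], [3.0, 0.0]]

# Matrix product: multiplying by b on the right swaps the columns of a.
print((a @ b).tolist())                      # [[2.0, 1.0], [4.0, 3.0]]

# For 2-D arrays, @ and np.dot give the same result.
print(np.array_equal(a @ b, np.dot(a, b)))   # True
```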

The np.linalg package has a wide range of fast linear algebra operations.

Here is the determinant:



In [98]:

    
np.linalg.det(a)









    Out[98]:





-0.10718636089982998

Matrix inverse:



In [99]:

    
np.linalg.inv(a)









    Out[99]:





array([[-0.78993064,  0.3031534 , -1.68709991,  0.7971441 ,  1.43548423],
       [ 0.8580796 , -0.05252972, -1.6586023 ,  0.98068693,  0.13396849],
       [ 1.54890704,  0.66091847, -2.20046493, -0.44462467,  0.60168609],
       [-0.81596389,  0.45517992,  2.84968253, -0.2779306 , -1.51073162],
       [-0.47209591, -1.68088455,  3.22377699, -0.4388912 ,  0.13065681]])

Eigenvalues:



In [100]:

    
np.linalg.eigvals(a)









    Out[100]:





array([ 2.57901192+0.j        , -0.74790223+0.j        ,
       -0.18571386+0.24459447j, -0.18571386-0.24459447j,  0.58919024+0.j        ])
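
The results of these routines can be cross-checked against each other. The sketch below assumes a recent NumPy (np.random.default_rng needs version 1.17 or later) and uses a seeded, diagonally dominant matrix rather than the random matrix above, so that invertibility is guaranteed. It also shows np.linalg.solve, which is not used in the cells above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Adding a multiple of the identity makes the matrix strictly diagonally
# dominant, which guarantees that it is invertible.
a = rng.random((4, 4)) + 4.0 * np.eye(4)

# A matrix times its inverse is the identity, up to rounding error.
print(np.allclose(a @ np.linalg.inv(a), np.eye(4)))              # True

# The determinant equals the product of the eigenvalues.
eigenvalues = np.linalg.eigvals(a)
print(np.isclose(np.prod(eigenvalues).real, np.linalg.det(a)))   # True

# To solve a @ x = rhs, solve() is preferable to forming the inverse.
rhs = np.ones(4)
x = np.linalg.solve(a, rhs)
print(np.allclose(a @ x, rhs))                                   # True
```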

NumPy can be built against a fast BLAS/LAPACK implementation for these linear algebra operations.



In [101]:

    
c = np.random.rand(2000,2000)



In [102]:

    
%timeit -n1 -r1 evs = np.linalg.eigvals(c)









    



1 loops, best of 1: 5.17 s per loop

Random numbers

NumPy has functions for creating arrays of random numbers from different distributions in np.random, as well as handling things like permutation and shuffling.

Here is the numpy.random documentation.



In [103]:

    
plt.hist(np.random.random(250))
plt.title('Uniform Random Distribution $[0,1)$')
plt.xlabel('value')
plt.ylabel('count')









    Out[103]:





<matplotlib.text.Text at 0x10882d110>



In [104]:

    
plt.hist(np.random.randn(250))
plt.title('Standard Normal Distribution')
plt.xlabel('value')
plt.ylabel('count')









    Out[104]:





<matplotlib.text.Text at 0x10a82f690>

The shuffle function shuffles an array in place:



In [105]:

    
a = np.arange(0,10)
np.random.shuffle(a)
a









    Out[105]:





array([7, 2, 8, 0, 4, 6, 1, 5, 9, 3])

The permutation function does the same thing but first makes a copy:



In [106]:

    
a = np.arange(0,10)
print(np.random.permutation(a))
print(a)









    



[0 6 4 9 5 1 8 3 7 2]
[0 1 2 3 4 5 6 7 8 9]
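
The outputs above change on every run. When reproducible results are needed, a seeded generator can be used instead. The sketch below assumes NumPy 1.17 or later, where np.random.default_rng provides a Generator object; this interface is an alternative to the module-level functions used in the cells above.

```python
import numpy as np

# Two generators built from the same seed produce the same stream.
rng1 = np.random.default_rng(42)
rng2 = np.random.default_rng(42)
print(np.array_equal(rng1.random(5), rng2.random(5)))   # True

a = np.arange(10)

# permutation returns a shuffled copy and leaves the input alone.
shuffled = rng1.permutation(a)
print(np.array_equal(a, np.arange(10)))                 # True

# shuffle reorders the array in place; the values are only rearranged.
rng1.shuffle(a)
print(sorted(a.tolist()) == list(range(10)))            # True
```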

Resources

NumPy Reference Documentation
Python Scientific Lecture Notes, Edited by Valentin Haenel, Emmanuelle Gouillart and Gaël Varoquaux.
Lectures on Scientific Computing with Python, J.R. Johansson.
Introduction to Scientific Computing in Python, Jake Vanderplas.