Introduction to numpy:

Package for scientific computing with Python

Numerical Python, or "NumPy" for short, is a foundational package on which many of the most common data science packages are built. NumPy provides high-performance multi-dimensional arrays which we can use as vectors or matrices.

The key features of numpy are:

ndarrays: n-dimensional arrays of the same data type which are fast and space-efficient. There are a number of built-in methods for ndarrays which allow for rapid processing of data without using loops (e.g., compute the mean).
Broadcasting: a set of rules that defines how arithmetic operations behave between arrays of different shapes.
Vectorization: applies numeric operations to whole ndarrays at once, element by element, without explicit Python loops.
Input/Output: simplifies reading and writing of data from/to file.
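
As a small illustration of the first three features, the sketch below computes a mean with and without an explicit loop, and then adds a scalar to every element. The variable names and values are illustrative only.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])   # illustrative data

# Loop-based mean, for comparison
total = 0.0
for v in values:
    total += v
loop_mean = total / len(values)

# Built-in ndarray method: no explicit loop needed
vector_mean = values.mean()

# Vectorization and broadcasting: the scalar 10 is applied to every element
shifted = values + 10

print(loop_mean, vector_mean)
print(shifted)
```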

Additional Recommended Resources:
Numpy Documentation
Python for Data Analysis by Wes McKinney
Python Data Science Handbook by Jake VanderPlas

Getting started with ndarray

ndarrays are time and space-efficient multidimensional arrays at the core of numpy. Like the data structures in Week 2, let's get started by creating ndarrays using the numpy package.

How to create Rank 1 numpy arrays:



In [1]:

    
import numpy as np

an_array = np.array([3, 33, 333])  # Create a rank 1 array

print(type(an_array))              # The type of an ndarray is: "<class 'numpy.ndarray'>"









    



<class 'numpy.ndarray'>



In [2]:

    
# test the shape of the array we just created, it should have just one dimension (Rank 1)
print(an_array.shape)









    



(3,)



In [3]:

    
# because this is a rank 1 array, we need only one index to access each element
print(an_array[0], an_array[1], an_array[2])



In [4]:

    
an_array[0] = 888           # ndarrays are mutable, here we change an element of the array

print(an_array)









    



[888  33 333]

How to create a Rank 2 numpy array:

A rank 2 ndarray is one with two dimensions. Notice the format below of [ [row] , [row] ]. Two-dimensional arrays are a natural way to represent matrices, which are often useful in data science.



In [5]:

    
another = np.array([[11,12,13],[21,22,23]])   # Create a rank 2 array

print(another)  # print the array

print("The shape is 2 rows, 3 columns: ", another.shape)  # rows x columns                   

print("Accessing elements [0,0], [0,1], and [1,0] of the ndarray: ", another[0, 0], ", ",another[0, 1],", ", another[1, 0])









    



[[11 12 13]
 [21 22 23]]
The shape is 2 rows, 3 columns:  (2, 3)
Accessing elements [0,0], [0,1], and [1,0] of the ndarray:  11 ,  12 ,  21

There are many ways to create numpy arrays:

Here we create a number of arrays with different shapes and different pre-filled values. numpy has a number of built-in functions which help us quickly and easily create multidimensional arrays.



In [6]:

    
import numpy as np

# create a 2x2 array of zeros
ex1 = np.zeros((2,2))      
print(ex1)









    



[[ 0.  0.]
 [ 0.  0.]]



In [7]:

    
# create a 2x2 array filled with 9.0
ex2 = np.full((2,2), 9.0)  
print(ex2)









    



[[ 9.  9.]
 [ 9.  9.]]



In [8]:

    
# create a 2x2 matrix with the diagonal 1s and the others 0
ex3 = np.eye(2,2)
print(ex3)









    



[[ 1.  0.]
 [ 0.  1.]]



In [9]:

    
# create an array of ones
ex4 = np.ones((1,2))
print(ex4)









    



[[ 1.  1.]]



In [10]:

    
# notice that the above ndarray (ex4) is actually rank 2, it is a 1x2 array (1 row, 2 columns)
print(ex4.shape)

# which means we need to use two indexes to access an element
print()
print(ex4[0,1])









    



(1, 2)

1.0



In [11]:

    
# create an array of random floats between 0 and 1
ex5 = np.random.random((2,2))
print(ex5)









    



[[ 0.14432896  0.11832371]
 [ 0.17678672  0.79801938]]

Array Indexing

Slice indexing:

Similar to the use of slice indexing with lists and strings, we can use slice indexing to pull out sub-regions of ndarrays.



In [16]:

    
import numpy as np

# Rank 2 array of shape (3, 4)
an_array = np.array([[11,12,13,14], [21,22,23,24], [31,32,33,34]])
print(an_array)









    



[[11 12 13 14]
 [21 22 23 24]
 [31 32 33 34]]

Use array slicing to get a subarray consisting of the first 2 rows and columns 1 and 2 (the second and third columns).



In [17]:

    
a_slice = an_array[:2, 1:3]
print(a_slice)









    



[[12 13]
 [22 23]]

When you modify a slice, you actually modify the underlying array.



In [18]:

    
print("Before:", an_array[0, 1])   #inspect the element at 0, 1  
a_slice[0, 0] = 1000    # a_slice[0, 0] is the same piece of data as an_array[0, 1]
print("After:", an_array[0, 1])









    



Before: 12
After: 1000
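
The behavior above follows from the slice being a view onto the same memory as the original array. The sketch below contrasts a view with an independent copy made with .copy(). The names base, view_slice and copy_slice are illustrative and do not appear in the cells above.

```python
import numpy as np

base = np.array([[11, 12, 13, 14], [21, 22, 23, 24], [31, 32, 33, 34]])

view_slice = base[:2, 1:3]          # a view: shares memory with base
copy_slice = base[:2, 1:3].copy()   # an independent copy

copy_slice[0, 0] = 1000             # does not touch base
print(base[0, 1])                   # still 12

view_slice[0, 0] = 2000             # writes through to base
print(base[0, 1])                   # now 2000

# np.shares_memory reports whether two arrays overlap in memory
print(np.shares_memory(base, view_slice), np.shares_memory(base, copy_slice))
```

Taking a copy is the usual choice when the sub-array will be modified and the original must stay intact.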

Use both integer indexing & slice indexing

We can use combinations of integer indexing and slice indexing to create different shaped matrices.



In [23]:

    
# Create a Rank 2 array of shape (3, 4)
an_array = np.array([[11,12,13,14], [21,22,23,24], [31,32,33,34]])
print(an_array)









    



[[11 12 13 14]
 [21 22 23 24]
 [31 32 33 34]]



In [20]:

    
# Using both integer indexing & slicing generates an array of lower rank
row_rank1 = an_array[1, :]    # Rank 1 view 

print(row_rank1, row_rank1.shape)  # notice only a single []









    



[21 22 23 24] (4,)



In [26]:

    
# Slicing alone: generates an array of the same rank as an_array
row_rank2 = an_array[1:2, :]  # Rank 2 view 

print(row_rank2, row_rank2.shape)   # Notice the [[ ]]









    



[[21 22 23 24]] (1, 4)



In [27]:

    
#We can do the same thing for columns of an array:

print()
col_rank1 = an_array[:, 1]
col_rank2 = an_array[:, 1:2]

print(col_rank1, col_rank1.shape)  # Rank 1
print()
print(col_rank2, col_rank2.shape)  # Rank 2









    



[12 22 32] (3,)

[[12]
 [22]
 [32]] (3, 1)

Array Indexing for changing elements:

Sometimes it's useful to use an array of indexes to access or change elements.



In [28]:

    
# Create a new array
an_array = np.array([[11,12,13], [21,22,23], [31,32,33], [41,42,43]])

print('Original Array:')
print(an_array)









    



Original Array:
[[11 12 13]
 [21 22 23]
 [31 32 33]
 [41 42 43]]



In [29]:

    
# Create an array of indices
col_indices = np.array([0, 1, 2, 0])
print('\nCol indices picked : ', col_indices)

row_indices = np.arange(4)
print('\nRows indices picked : ', row_indices)









    



Col indices picked :  [0 1 2 0]

Rows indices picked :  [0 1 2 3]



In [30]:

    
# Examine the pairings of row_indices and col_indices.  These are the elements we'll change next.
for row,col in zip(row_indices,col_indices):
    print(row, ", ",col)



In [31]:

    
# Select one element from each row
print('Values in the array at those indices: ',an_array[row_indices, col_indices])









    



Values in the array at those indices:  [11 22 33 41]



In [32]:

    
# Change one element from each row using the indices selected
an_array[row_indices, col_indices] += 100000

print('\nChanged Array:')
print(an_array)









    



Changed Array:
[[100011     12     13]
 [    21 100022     23]
 [    31     32 100033]
 [100041     42     43]]
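
One way to read the integer array indexing used above: the i-th row index is paired with the i-th column index, much like the zip loop shown earlier. The sketch below, using illustrative names (grid, rows, cols), checks that the indexed result matches an explicit loop over those pairs.

```python
import numpy as np

grid = np.array([[11, 12, 13], [21, 22, 23], [31, 32, 33], [41, 42, 43]])
rows = np.arange(4)
cols = np.array([0, 1, 2, 0])

# Integer array indexing pairs rows[i] with cols[i]
picked = grid[rows, cols]

# Equivalent explicit loop over the (row, col) pairs
picked_loop = np.array([grid[r, c] for r, c in zip(rows, cols)])

print(picked)
print(picked_loop)
```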

Boolean Indexing

Boolean indexing for selecting and changing elements:



In [38]:

    
# create a 3x2 array
an_array = np.array([[11,12], [21, 22], [31, 32]])
print(an_array)









    



[[11 12]
 [21 22]
 [31 32]]



In [39]:

    
# create a filter which will be boolean values for whether each element meets this condition
filter = (an_array > 15)
filter









    Out[39]:





array([[False, False],
       [ True,  True],
       [ True,  True]], dtype=bool)

Notice that the filter is an ndarray of the same shape as an_array. It holds True wherever the corresponding element of an_array is greater than 15, and False wherever it is 15 or less.



In [40]:

    
# we can now select just those elements which meet that criteria
print(an_array[filter])









    



[21 22 31 32]



In [41]:

    
# For short, we could have just used the approach below without the need for the separate filter array.

an_array[(an_array % 2 == 0)]









    Out[41]:





array([12, 22, 32])

What is particularly useful is that we can actually change elements in the array applying a similar logical filter. Let's add 100 to all the even values.



In [42]:

    
an_array[an_array % 2 == 0] +=100
print(an_array)
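
Boolean filters can also be combined. The sketch below, on an illustrative array named data, selects values that are both greater than 15 and even, and then applies the same kind of in-place update shown above.

```python
import numpy as np

data = np.array([[11, 12], [21, 22], [31, 32]])

# Combine conditions element-wise with & (and) and | (or).
# The parentheses are required because & and | bind tighter than comparisons.
mask = (data > 15) & (data % 2 == 0)
print(mask)
print(data[mask])

# In-place update through a mask: add 100 to every even value
data[data % 2 == 0] += 100
print(data)
```

Python's own "and" and "or" keywords do not work element-wise on arrays, which is why the bitwise operators are used here.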

Datatypes and Array Operations

Datatypes:



In [43]:

    
ex1 = np.array([11, 12]) # NumPy infers the data type from the values
print(ex1.dtype)









    



int64



In [44]:

    
ex2 = np.array([11.0, 12.0]) # NumPy infers the data type from the values
print(ex2.dtype)









    



float64



In [45]:

    
ex3 = np.array([11, 21], dtype=np.int64) # You can also specify the data type explicitly
print(ex3.dtype)









    



int64



In [46]:

    
# you can use this to force floats into integers (the fractional part is truncated toward zero)
ex4 = np.array([11.1,12.7], dtype=np.int64)
print(ex4.dtype)
print()
print(ex4)









    



int64

[11 12]
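
Converting floats to an integer dtype drops the fractional part, which differs from flooring for negative values. The sketch below compares the conversion with np.floor and np.rint on illustrative values. It uses astype, which is not used in the cell above, to convert an existing array.

```python
import numpy as np

floats = np.array([11.1, 12.7, -12.7])

as_int = floats.astype(np.int64)   # truncates toward zero
floored = np.floor(floats)         # rounds toward negative infinity (result stays float)
rounded = np.rint(floats)          # rounds to the nearest integer (result stays float)

print(as_int)
print(floored)
print(rounded)
```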



In [47]:

    
# you can use this to force integers into floats if you anticipate
# the values may change to floats later
ex5 = np.array([11, 21], dtype=np.float64)
print(ex5.dtype)
print()
print(ex5)









    



float64

[ 11.  21.]

Arithmetic Array Operations:



In [48]:

    
x = np.array([[111,112],[121,122]], dtype=np.int64)  # np.int was removed in NumPy 1.24; use np.int64 (or the built-in int)
y = np.array([[211.1,212.1],[221.1,222.1]], dtype=np.float64)

print(x)
print()
print(y)









    



[[111 112]
 [121 122]]

[[ 211.1  212.1]
 [ 221.1  222.1]]



In [49]:

    
# add
print(x + y)         # The plus sign works
print()
print(np.add(x, y))  # so does the numpy function "add"









    



[[ 322.1  324.1]
 [ 342.1  344.1]]

[[ 322.1  324.1]
 [ 342.1  344.1]]



In [50]:

    
# subtract
print(x - y)
print()
print(np.subtract(x, y))









    



[[-100.1 -100.1]
 [-100.1 -100.1]]

[[-100.1 -100.1]
 [-100.1 -100.1]]



In [51]:

    
# multiply
print(x * y)
print()
print(np.multiply(x, y))









    



[[ 23432.1  23755.2]
 [ 26753.1  27096.2]]

[[ 23432.1  23755.2]
 [ 26753.1  27096.2]]



In [52]:

    
# divide
print(x / y)
print()
print(np.divide(x, y))









    



[[ 0.52581715  0.52805281]
 [ 0.54726368  0.54930212]]

[[ 0.52581715  0.52805281]
 [ 0.54726368  0.54930212]]



In [53]:

    
# square root
print(np.sqrt(x))









    



[[ 10.53565375  10.58300524]
 [ 11.          11.04536102]]



In [54]:

    
# exponent (e ** x)
print(np.exp(x))









    



[[  1.60948707e+48   4.37503945e+48]
 [  3.54513118e+52   9.63666567e+52]]

Statistical Methods, Sorting, and Set Operations:

Basic Statistical Operations:



In [55]:

    
# set up a random 2 x 5 matrix
arr = 10 * np.random.randn(2,5)
print(arr)









    



[[  2.55882734  -7.61036994 -14.07225287   4.49477682  -4.50547344]
 [  2.50516952   9.93528865  -9.32300721  -3.17782367 -23.96102336]]



In [56]:

    
# compute the mean for all elements
print(arr.mean())









    



-4.31558881684



In [57]:

    
# compute the means by row
print(arr.mean(axis = 1))









    



[-3.82689842 -4.80427921]



In [58]:

    
# compute the means by column
print(arr.mean(axis = 0))









    



[  2.53199843   1.16245936 -11.69763004   0.65847658 -14.2332484 ]



In [59]:

    
# sum all the elements
print(arr.sum())









    



-43.1558881684



In [60]:

    
# compute the medians
print(np.median(arr, axis = 1))









    



[-4.50547344 -3.17782367]
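
Because the array above is random, the effect of the axis argument can be easier to see on fixed values. A minimal sketch with an illustrative 2 x 3 array named table: axis=0 collapses the rows and leaves one value per column, while axis=1 collapses the columns and leaves one value per row.

```python
import numpy as np

table = np.array([[1, 2, 3],
                  [4, 5, 6]])      # shape (2, 3)

col_means = table.mean(axis=0)    # one value per column, shape (3,)
row_means = table.mean(axis=1)    # one value per row, shape (2,)

print(col_means)
print(row_means)
```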

Sorting:



In [61]:

    
# create a 10 element array of randoms
unsorted = np.random.randn(10)

print(unsorted)









    



[-0.2863347   0.2375699  -1.26102839  1.31462515 -0.59745115  1.23701857
 -0.59616984 -1.30443288 -0.75020217 -0.77148224]



In [62]:

    
# create copy and sort
sorted = np.array(unsorted)
sorted.sort()

print(sorted)
print()
print(unsorted)









    



[-1.30443288 -1.26102839 -0.77148224 -0.75020217 -0.59745115 -0.59616984
 -0.2863347   0.2375699   1.23701857  1.31462515]

[-0.2863347   0.2375699  -1.26102839  1.31462515 -0.59745115  1.23701857
 -0.59616984 -1.30443288 -0.75020217 -0.77148224]



In [70]:

    
# inplace sorting
unsorted.sort() 

print(unsorted)









    



[-1.30443288 -1.26102839 -0.77148224 -0.75020217 -0.59745115 -0.59616984
 -0.2863347   0.2375699   1.23701857  1.31462515]
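
The copy-then-sort pattern above can also be written with np.sort, which returns a sorted copy and leaves its input unchanged. A short sketch with illustrative names:

```python
import numpy as np

original = np.array([3, 1, 2])

sorted_copy = np.sort(original)          # returns a new, sorted array
untouched = original.tolist()            # original is unchanged at this point
print(sorted_copy, original)

result = original.sort()                 # sorts in place and returns None
print(original, result)
```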

Finding Unique elements:



In [71]:

    
array = np.array([1,2,1,4,2,1,4,2])

print(np.unique(array))
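
np.unique returns the distinct values in sorted order. As a small extension, and assuming the counts are also of interest, the return_counts option reports how often each value occurs. The name items is illustrative.

```python
import numpy as np

items = np.array([1, 2, 1, 4, 2, 1, 4, 2])

unique_values = np.unique(items)
values, counts = np.unique(items, return_counts=True)

print(unique_values)
print(values, counts)
```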

Set Operations with np.array data type:



In [72]:

    
s1 = np.array(['desk','chair','bulb'])
s2 = np.array(['lamp','bulb','chair'])
print(s1, s2)









    



['desk' 'chair' 'bulb'] ['lamp' 'bulb' 'chair']



In [73]:

    
print( np.intersect1d(s1, s2) )









    



['bulb' 'chair']



In [74]:

    
print( np.union1d(s1, s2) )









    



['bulb' 'chair' 'desk' 'lamp']



In [75]:

    
print( np.setdiff1d(s1, s2) )  # elements in s1 that are not in s2









    



['desk']



In [76]:

    
print( np.isin(s1, s2) )  # for each element of s1, whether it is also in s2 (np.in1d is deprecated in NumPy 2.0)









    



[False  True  True]

Broadcasting:

Introduction to broadcasting.
For more details, please see:
https://docs.scipy.org/doc/numpy-1.10.1/user/basics.broadcasting.html



In [77]:

    
import numpy as np

start = np.zeros((4,3))
print(start)









    



[[ 0.  0.  0.]
 [ 0.  0.  0.]
 [ 0.  0.  0.]
 [ 0.  0.  0.]]



In [78]:

    
# create a rank 1 ndarray with 3 values
add_rows = np.array([1, 0, 2])
print(add_rows)



In [79]:

    
y = start + add_rows  # add to each row of 'start' using broadcasting
print(y)









    



[[ 1.  0.  2.]
 [ 1.  0.  2.]
 [ 1.  0.  2.]
 [ 1.  0.  2.]]



In [80]:

    
# create an ndarray which is 4 x 1 to broadcast across columns
add_cols = np.array([[0,1,2,3]])
add_cols = add_cols.T

print(add_cols)









    



[[0]
 [1]
 [2]
 [3]]



In [81]:

    
# add to each column of 'start' using broadcasting
y = start + add_cols 
print(y)









    



[[ 0.  0.  0.]
 [ 1.  1.  1.]
 [ 2.  2.  2.]
 [ 3.  3.  3.]]



In [82]:

    
# this will just broadcast in both dimensions
add_scalar = np.array([1])  
print(start+add_scalar)









    



[[ 1.  1.  1.]
 [ 1.  1.  1.]
 [ 1.  1.  1.]
 [ 1.  1.  1.]]

Example from the slides:



In [83]:

    
# create our 3x4 matrix
arrA = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12]])
print(arrA)









    



[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]



In [84]:

    
# create a 4-element list; numpy treats it as a rank 1 array of shape (4,)
arrB = [0,1,0,2]
print(arrB)









    



[0, 1, 0, 2]



In [85]:

    
# add the two together using broadcasting
print(arrA + arrB)









    



[[ 1  3  3  6]
 [ 5  7  7 10]
 [ 9 11 11 14]]
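
Broadcasting only works when the shapes are compatible: comparing dimensions from the right, each pair must be equal or one of them must be 1. The sketch below, with illustrative arrays, shows two compatible cases and one incompatible case that raises a ValueError.

```python
import numpy as np

a = np.ones((3, 4))

ok_row = a + np.array([0, 1, 0, 2])          # (3, 4) + (4,)   -> (3, 4)
ok_col = a + np.array([[10], [20], [30]])    # (3, 4) + (3, 1) -> (3, 4)

try:
    a + np.array([0, 1, 2])                  # (3, 4) + (3,): trailing 4 vs 3 do not match
    compatible = True
except ValueError as err:
    compatible = False
    print("Broadcast failed:", err)

print(ok_row.shape, ok_col.shape, compatible)
```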

Speedtest: ndarrays vs lists

First, set up the parameters for the speed test. We'll compare the time taken to sum the elements of an ndarray with the time taken to sum the elements of a list.



In [89]:

    
from numpy import arange
from timeit import Timer

size    = 1000000
timeits = 1000



In [90]:

    
# create the ndarray with values 0,1,2...,size-1
nd_array = arange(size)
print( type(nd_array) )









    



<class 'numpy.ndarray'>



In [91]:

    
# timer expects the operation as a parameter, 
# here we pass nd_array.sum()
timer_numpy = Timer("nd_array.sum()", "from __main__ import nd_array")

print("Time taken by numpy ndarray: %f seconds" % 
      (timer_numpy.timeit(timeits)/timeits))









    



Time taken by numpy ndarray: 0.000714 seconds



In [92]:

    
# create the list with values 0,1,2...,size-1
a_list = list(range(size))
print (type(a_list) )









    



<class 'list'>



In [93]:

    
# timer expects the operation as a parameter, here we pass sum(a_list)
timer_list = Timer("sum(a_list)", "from __main__ import a_list")

print("Time taken by list:  %f seconds" % 
      (timer_list.timeit(timeits)/timeits))









    



Time taken by list:  0.011726 seconds
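
The same comparison can be sketched without importing from __main__ by passing the objects through the globals argument of timeit.timeit. The sizes below are deliberately smaller than in the cells above so the sketch runs quickly. Actual timings depend on the machine, so no particular numbers are claimed here.

```python
import timeit
import numpy as np

n = 100_000      # illustrative size, smaller than in the cells above
repeats = 20     # illustrative repeat count

nd = np.arange(n, dtype=np.int64)   # explicit 64-bit dtype so the sum cannot overflow
lst = list(range(n))

# Average time per call, in seconds
t_numpy = timeit.timeit("nd.sum()", globals={"nd": nd}, number=repeats) / repeats
t_list = timeit.timeit("sum(lst)", globals={"lst": lst}, number=repeats) / repeats

print("numpy ndarray: %f seconds" % t_numpy)
print("list:          %f seconds" % t_list)
```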

Read or Write to Disk:

Binary Format:



In [94]:

    
x = np.array([ 23.23, 24.24] )



In [95]:

    
np.save('an_array', x)



In [96]:

    
np.load('an_array.npy')









    Out[96]:





array([ 23.23,  24.24])

Text Format:



In [97]:

    
np.savetxt('array.txt', X=x, delimiter=',')



In [98]:

    
!cat array.txt









    



2.323000000000000043e+01
2.423999999999999844e+01



In [99]:

    
np.loadtxt('array.txt', delimiter=',')









    Out[99]:





array([ 23.23,  24.24])
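
To check that both formats round-trip the data, the sketch below saves and reloads an array inside a temporary directory, so no files are left behind. The use of tempfile and the path names are choices made for this illustration, not part of the cells above.

```python
import os
import tempfile
import numpy as np

data = np.array([23.23, 24.24])

with tempfile.TemporaryDirectory() as tmp_dir:
    npy_path = os.path.join(tmp_dir, "an_array.npy")
    txt_path = os.path.join(tmp_dir, "array.txt")

    # Binary format: preserves dtype and shape exactly
    np.save(npy_path, data)
    from_binary = np.load(npy_path)

    # Text format: human-readable, values are written as decimal strings
    np.savetxt(txt_path, data, delimiter=",")
    from_text = np.loadtxt(txt_path, delimiter=",")

print(np.array_equal(data, from_binary))
print(np.allclose(data, from_text))
```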

Additional Common ndarray Operations

Dot Product on Matrices and Inner Product on Vectors:



In [100]:

    
# determine the dot product of two matrices
x2d = np.array([[1,1],[1,1]])
y2d = np.array([[2,2],[2,2]])

print(x2d.dot(y2d))
print()
print(np.dot(x2d, y2d))









    



[[4 4]
 [4 4]]

[[4 4]
 [4 4]]



In [101]:

    
# determine the inner product of two vectors
a1d = np.array([9 , 9 ])
b1d = np.array([10, 10])

print(a1d.dot(b1d))
print()
print(np.dot(a1d, b1d))



In [102]:

    
# dot product of a matrix and a vector
print(x2d.dot(a1d))
print()
print(np.dot(x2d, a1d))
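
For the one- and two-dimensional cases shown above, the @ operator gives the same results as dot. The sketch below, with illustrative names, also contrasts the matrix product with element-wise multiplication using *.

```python
import numpy as np

m1 = np.array([[1, 1], [1, 1]])
m2 = np.array([[2, 2], [2, 2]])
v1 = np.array([9, 9])
v2 = np.array([10, 10])

mat_prod = m1 @ m2        # matrix product
inner = v1 @ v2           # inner product of two vectors (a scalar)
mat_vec = m1 @ v1         # matrix-vector product
elementwise = m1 * m2     # element-wise product, not a matrix product

print(mat_prod)
print(inner)
print(mat_vec)
print(elementwise)
```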

Sum:



In [103]:

    
# sum elements in the array
ex1 = np.array([[11,12],[21,22]])

print(np.sum(ex1))          # add all members



In [104]:

    
print(np.sum(ex1, axis=0))  # columnwise sum



In [105]:

    
print(np.sum(ex1, axis=1))  # rowwise sum

Element-wise Functions:

For example, let's compare the values of two arrays to get the element-wise maximum.



In [106]:

    
# random array
x = np.random.randn(8)
x









    Out[106]:





array([ 0.95927671, -2.26261242,  1.09727197, -0.83709121,  0.93916061,
        0.91244378,  0.95143736,  0.82676922])



In [107]:

    
# another random array
y = np.random.randn(8)
y









    Out[107]:





array([ 1.42014679, -0.42237978,  0.19201318, -0.42200551, -0.53414409,
       -1.19860612, -0.17275329, -1.89702862])



In [108]:

    
# returns element wise maximum between two arrays

np.maximum(x, y)









    Out[108]:





array([ 1.42014679, -0.42237978,  1.09727197, -0.42200551,  0.93916061,
        0.91244378,  0.95143736,  0.82676922])

Reshaping array:



In [109]:

    
# grab values from 0 through 19 in an array
arr = np.arange(20)
print(arr)









    



[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]



In [110]:

    
# reshape to be a 4 x 5 matrix
arr.reshape(4,5)









    Out[110]:





array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])
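
Two details of reshape are worth a quick sketch: one dimension may be given as -1 so that it is inferred from the total size, and the call returns a new array object rather than changing the original's shape. The names flat and grid are illustrative.

```python
import numpy as np

flat = np.arange(20)

grid = flat.reshape(4, -1)     # -1 is inferred as 5, because 20 / 4 = 5
print(grid.shape)
print(flat.shape)              # the original array keeps its shape

try:
    flat.reshape(3, 7)         # 3 * 7 = 21 does not match 20 elements
    reshaped_ok = True
except ValueError as err:
    reshaped_ok = False
    print("Reshape failed:", err)
```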

Transpose:



In [111]:

    
# transpose
ex1 = np.array([[11,12],[21,22]])

ex1.T









    Out[111]:





array([[11, 21],
       [12, 22]])

Indexing using where():



In [112]:

    
x_1 = np.array([1,2,3,4,5])

y_1 = np.array([11,22,33,44,55])

filter = np.array([True, False, True, False, True])



In [113]:

    
out = np.where(filter, x_1, y_1)
print(out)









    



[ 1 22  3 44  5]



In [114]:

    
mat = np.random.rand(5,5)
mat









    Out[114]:





array([[ 0.29099293,  0.8513201 ,  0.02755382,  0.9153997 ,  0.1529128 ],
       [ 0.68892956,  0.8715403 ,  0.90271614,  0.83349504,  0.66892362],
       [ 0.3652049 ,  0.94876009,  0.93417171,  0.24069083,  0.17527422],
       [ 0.61088613,  0.32312425,  0.94039057,  0.61802663,  0.02472306],
       [ 0.6022862 ,  0.02278126,  0.03254162,  0.9401152 ,  0.2488345 ]])



In [115]:

    
np.where( mat > 0.5, 1000, -1)









    Out[115]:





array([[  -1, 1000,   -1, 1000,   -1],
       [1000, 1000, 1000, 1000, 1000],
       [  -1, 1000, 1000,   -1,   -1],
       [1000,   -1, 1000, 1000,   -1],
       [1000,   -1,   -1, 1000,   -1]])
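
np.where has a second form that is not used above: called with only a condition, it returns the indices at which the condition is True. A minimal sketch on fixed, illustrative values:

```python
import numpy as np

scores = np.array([0.2, 0.8, 0.5, 0.9])

# Three-argument form: choose between two values element by element
labels = np.where(scores > 0.5, 1000, -1)

# One-argument form: a tuple of index arrays, one per dimension
(idx,) = np.where(scores > 0.5)

print(labels)
print(idx)
```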

"any" or "all" conditionals:



In [116]:

    
arr_bools = np.array([ True, False, True, True, False ])



In [117]:

    
arr_bools.any()









    Out[117]:





True



In [118]:

    
arr_bools.all()









    Out[118]:





False

Random Number Generation:



In [119]:

    
Y = np.random.normal(size = (1,5))[0]
print(Y)









    



[-1.00174993  0.24429746 -1.26986219  1.53674401 -0.89022467]



In [120]:

    
Z = np.random.randint(low=2,high=50,size=4)
print(Z)









    



[43 27 30 35]



In [121]:

    
np.random.permutation(Z) #return a new ordering of elements in Z









    Out[121]:





array([30, 43, 35, 27])



In [122]:

    
np.random.uniform(size=4) #uniform distribution









    Out[122]:





array([ 0.78403831,  0.17108008,  0.82968747,  0.66615602])



In [123]:

    
np.random.normal(size=4) #normal distribution









    Out[123]:





array([ 0.22164685, -0.11694988,  1.17614076,  1.56761991])
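
The outputs above change on every run. When reproducible results are needed, a seeded generator can be used. The sketch below assumes NumPy 1.17 or later, where np.random.default_rng is available. It is a different interface from the np.random functions used in the cells above, and the seed value 42 is arbitrary.

```python
import numpy as np

# Two generators created with the same seed produce the same stream
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

draws_a = rng_a.normal(size=4)
draws_b = rng_b.normal(size=4)
print(np.array_equal(draws_a, draws_b))

# Integers from 2 (inclusive) to 50 (exclusive)
ints = rng_a.integers(low=2, high=50, size=4)
print(ints)
```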

Merging data sets:



In [124]:

    
K = np.random.randint(low=2,high=50,size=(2,2))
print(K)

print()
M = np.random.randint(low=2,high=50,size=(2,2))
print(M)









    



[[48  5]
 [17 11]]

[[31 13]
 [29 18]]



In [125]:

    
np.vstack((K,M))









    Out[125]:





array([[48,  5],
       [17, 11],
       [31, 13],
       [29, 18]])



In [126]:

    
np.hstack((K,M))









    Out[126]:





array([[48,  5, 31, 13],
       [17, 11, 29, 18]])



In [127]:

    
np.concatenate([K, M], axis = 0)









    Out[127]:





array([[48,  5],
       [17, 11],
       [31, 13],
       [29, 18]])



In [128]:

    
np.concatenate([K, M.T], axis = 1)









    Out[128]:





array([[48,  5, 31, 29],
       [17, 11, 13, 18]])