Numpy: Creating and Manipulating Numerical Data

1. The Numpy Array Object

1.1 What are Numpy and Numpy Arrays?

  • Numpy: the core tool for high-performance numerical computing with Python
  • Numpy arrays: multi-dimensional data structures in Numpy (e.g. 1-D vector, 2-D matrix, 3-D data object, etc.)

In [2]:
# import numpy by following the convention
import numpy as np

1.2 Creating Arrays


In [2]:
# manual construction of arrays
# 1-D
a = np.array([0,1,2,3])
a


Out[2]:
array([0, 1, 2, 3])

In [3]:
# 2-D
b = np.array([[1,2,3],[5,6,7]])
b


Out[3]:
array([[1, 2, 3],
       [5, 6, 7]])

In [5]:
# check for array dimension
a.ndim, b.ndim


Out[5]:
(1, 2)

In [6]:
# check for shape of the array
a.shape, b.shape


Out[6]:
((4,), (2, 3))

In [7]:
# functions for creating arrays
a = np.arange(1,9,2) # start, end (exclusive), step -> array([1, 3, 5, 7])
a = np.arange(10) # a is reassigned here, so only this result is displayed
a


Out[7]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [10]:
a = np.linspace(0,1,6) # start, end (inclusive), num-points
a


Out[10]:
array([ 0. ,  0.2,  0.4,  0.6,  0.8,  1. ])
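
The difference between the end-point conventions of np.arange and np.linspace can be made explicit with the endpoint and retstep keywords of np.linspace. The following is a small sketch; the variable names pts, step and half_open are chosen here for illustration only.

```python
import numpy as np

# retstep=True also returns the spacing between consecutive points
pts, step = np.linspace(0, 1, 6, retstep=True)
print(pts, step)

# endpoint=False excludes the end value, like np.arange does
half_open = np.linspace(0, 1, 5, endpoint=False)
print(half_open)
```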

In [14]:
a = np.ones((3,2)) # a matrix of ones
a


Out[14]:
array([[ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.]])

In [16]:
a= np.zeros((2,3)) # a matrix of zeros
a


Out[16]:
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

In [17]:
a = np.eye(3) # an identity matrix
a


Out[17]:
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

In [19]:
a = np.diag(np.array([1,2,3,4])) # a diagonal matrix
a


Out[19]:
array([[1, 0, 0, 0],
       [0, 2, 0, 0],
       [0, 0, 3, 0],
       [0, 0, 0, 4]])

In [26]:
# generating random numbers
# set seed
np.random.seed(1234)
# generate a vector of length 4, in which elements are iid draws from UNIF(0,1)
a = np.random.rand(4)
print(a)
# generate a vector of length 4, in which elements are iid draws from standard normal
a = np.random.randn(4)
print(a)


[ 0.19151945  0.62210877  0.43772774  0.78535858]
[-0.72058873  0.88716294  0.85958841 -0.6365235 ]
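
Recent Numpy versions also provide a Generator-based interface, np.random.default_rng, as an alternative to the global seed used above. The sketch below assumes a Numpy version that includes this interface; the name rng is only a convention, and the numbers drawn will differ from the output shown above because a different algorithm is used.

```python
import numpy as np

# a seeded generator object instead of the global np.random.seed
rng = np.random.default_rng(1234)

u = rng.random(4)           # iid draws from UNIF[0, 1)
z = rng.standard_normal(4)  # iid draws from the standard normal
print(u)
print(z)
```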

1.3 Basic Data Types


In [28]:
a = np.array([1,2,3],dtype=float)
a.dtype


Out[28]:
dtype('float64')

In [29]:
a = np.array([True, False, True])
a.dtype


Out[29]:
dtype('bool')
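
When no dtype is given, Numpy infers one from the input values, and an existing array can be converted with astype. A short sketch of both behaviours follows; the exact integer type that is inferred depends on the platform.

```python
import numpy as np

a = np.array([1, 2, 3])      # integers -> an integer dtype (size is platform dependent)
b = np.array([1., 2., 3.])   # floats -> float64
c = np.array([1.7, 2.2]).astype(int)  # conversion to int truncates toward zero
print(a.dtype.kind, b.dtype, c)
```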

1.4 Basic Visualization


In [37]:
import matplotlib.pyplot as plt
# to display plots in the notebook
%matplotlib inline
x = np.linspace(0,3,20)
y = np.linspace(0,9,20)
plt.plot(x,y) # line plot
plt.plot(x,y,'o') # dot plot


Out[37]:
[<matplotlib.lines.Line2D at 0x7fa859145518>]

1.5 Indexing and Slicing


In [40]:
# indices begin at 0
a = np.arange(10)
a[0], a[1], a[-1]


Out[40]:
(0, 1, 9)

In [42]:
# slicing
a[2:5:2] #[start:end:step]


Out[42]:
array([2, 4])

In [47]:
a[::]


Out[47]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [48]:
a[::-1]


Out[48]:
array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

In [52]:
# matrices
a = np.diag(np.arange(3))
a


Out[52]:
array([[0, 0, 0],
       [0, 1, 0],
       [0, 0, 2]])

In [53]:
# index a single element of the matrix
a[1,1], a[1,2]


Out[53]:
(1, 0)

In [57]:
# numpy arrays are mutable, so we can assign new values to their elements
a[0,0] = 100
a[1,1] = 10
a


Out[57]:
array([[100,   0,   0],
       [  0,  10,   0],
       [  0,   0,   2]])

In [58]:
# the second column of a
a[:,1]


Out[58]:
array([ 0, 10,  0])

In [59]:
# the first row of a
a[0,:]


Out[59]:
array([100,   0,   0])

1.6 Copies and views


In [61]:
# a slicing operation creates a view on the original array
a = np.arange(10)
b = a[::2]
b[0] = 100
a


Out[61]:
array([100,   1,   2,   3,   4,   5,   6,   7,   8,   9])

In [62]:
# force a copy
a = np.arange(10)
b = a[::2].copy()
b[0] = 100
a


Out[62]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
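
Whether two arrays refer to the same data can also be checked directly, without modifying either of them. The sketch below uses np.shares_memory and the base attribute; the names view and copy are chosen here for illustration.

```python
import numpy as np

a = np.arange(10)
view = a[::2]          # slicing gives a view
copy = a[::2].copy()   # an explicit copy owns its own data

print(np.shares_memory(a, view))  # True
print(np.shares_memory(a, copy))  # False
print(view.base is a)             # True: the view points back to a
```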

1.7 Fancy Indexing


In [64]:
# indexing with booleans 
a = np.arange(10)
ind = (a>5)
a[ind]


Out[64]:
array([6, 7, 8, 9])

In [67]:
# indexing with an array of integers
a = np.arange(10,100,10)
a[[2,3,4,2,1]]


Out[67]:
array([30, 40, 50, 30, 20])
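
Unlike slicing, fancy indexing returns a copy rather than a view, although assigning through a fancy index still modifies the original array in place. A minimal sketch of both cases (the name picked is illustrative):

```python
import numpy as np

a = np.arange(10)
picked = a[[0, 1, 2]]   # fancy indexing returns a copy
picked[0] = 100
print(a[0])             # 0: a is unchanged

a[a > 5] = -1           # assignment through a boolean mask changes a itself
print(a)
```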

2. Numerical Operations on Arrays

2.1 Elementwise Operations


In [69]:
# with scalars
a = np.array([1,2,3,4])
a + 1
2**a


Out[69]:
array([ 2,  4,  8, 16])

In [71]:
# arithmetic operations are elementwise
b = np.ones(4)
a - b
a*b


Out[71]:
array([ 1.,  2.,  3.,  4.])

In [72]:
# elementwise multiplication (not matrix multiplication)
c = np.ones((3,3))
c*c


Out[72]:
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

In [73]:
# matrix multiplication
c.dot(c)


Out[73]:
array([[ 3.,  3.,  3.],
       [ 3.,  3.,  3.],
       [ 3.,  3.,  3.]])
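
On Python 3.5 or later, the @ operator can be used for matrix multiplication instead of the dot method. The following sketch contrasts it with the elementwise product; the variable names elementwise and matrix are illustrative.

```python
import numpy as np

c = np.ones((3, 3))
elementwise = c * c   # each entry multiplied by itself -> all ones
matrix = c @ c        # matrix product, same result as c.dot(c)
print(elementwise)
print(matrix)
```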

In [75]:
# comparisons
a == b
a > b


Out[75]:
array([False,  True,  True,  True], dtype=bool)

In [76]:
# array-wise comparison
np.array_equal(a,b)


Out[76]:
False

In [77]:
# transcendental functions (only the last expression, np.sin(a), is displayed)
np.log(a)
np.exp(a)
np.sin(a)


Out[77]:
array([ 0.84147098,  0.90929743,  0.14112001, -0.7568025 ])

In [1]:
# shape mismatches (this raises a ValueError because shapes (4,) and (2,) are incompatible)
b = np.array([1,2])
a + b


ValueError: operands could not be broadcast together with shapes (4,) (2,)

In [81]:
# transposition
a = np.triu(np.ones((3,3)),1)
a.T


Out[81]:
array([[ 0.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  1.,  0.]])

2.2 Basic Reductions


In [84]:
# computing sums
a = np.array([1,2,3,4])
a.sum()


Out[84]:
10

In [88]:
a = np.array([[1,2],[3,4]])
a.sum()
a.sum(axis=0) # column sum
a.sum(axis=1) # row sum


Out[88]:
array([3, 7])

In [94]:
# other reductions
a = np.array([1,2,3,4])
a.min()
a.max()
a.argmin()
a.argmax()
a.mean()
a.std()


Out[94]:
1.1180339887498949
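
The reductions above also accept an axis argument, and keepdims=True keeps the reduced axis with length 1 so that the result lines up with the original array. A small sketch, using illustrative names; the subtraction at the end relies on broadcasting, which is covered in section 2.3.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
col_means = a.mean(axis=0)                 # shape (2,): one mean per column
row_means = a.mean(axis=1, keepdims=True)  # shape (2, 1): one mean per row
centered = a - row_means                   # subtract each row's mean from that row
print(col_means)
print(centered)
```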

2.3 Broadcasting


In [96]:
a = np.arange(0,40,10)
a = a[:,np.newaxis] # add a new axis -> 2D array
b = np.array([0,1,2])
a + b


Out[96]:
array([[ 0,  1,  2],
       [10, 11, 12],
       [20, 21, 22],
       [30, 31, 32]])

In [98]:
# create a matrix indicating the difference between any two observations
x = np.linspace(0,10,5)
y = x[:,np.newaxis]
np.abs(x-y)


Out[98]:
array([[  0. ,   2.5,   5. ,   7.5,  10. ],
       [  2.5,   0. ,   2.5,   5. ,   7.5],
       [  5. ,   2.5,   0. ,   2.5,   5. ],
       [  7.5,   5. ,   2.5,   0. ,   2.5],
       [ 10. ,   7.5,   5. ,   2.5,   0. ]])
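
Broadcasting compares shapes from the last axis backwards, and two axes are compatible when they are equal or when one of them has length 1. The sketch below shows an incompatible pair of shapes and one way to make them compatible by adding an axis; the arrays are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([1, 2])

try:
    a + b   # shapes (4,) and (2,) cannot be broadcast
except ValueError as err:
    print("ValueError:", err)

# with a new axis the shapes are (4, 1) and (2,), which broadcast to (4, 2)
c = a[:, np.newaxis] + b
print(c.shape)
```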

2.4 Array Shape Manipulation


In [100]:
# flattening
a = np.array([[1,2],[3,4]])
b= a.ravel()
b


Out[100]:
array([1, 2, 3, 4])

In [102]:
c = b.reshape((2,2))
c


Out[102]:
array([[1, 2],
       [3, 4]])
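
Two related details may be useful here: ravel returns a view when it can, while flatten always returns a copy, and reshape accepts -1 for one dimension to have it inferred from the array size. A short sketch with illustrative names:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
r = a.ravel()     # a view here, because a is stored contiguously
f = a.flatten()   # always a copy

r[0] = 99         # also changes a
f[1] = -1         # does not change a
print(a)

c = np.arange(6).reshape(2, -1)   # -1 lets numpy infer the 3 columns
print(c.shape)
```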

2.5 Sorting Data


In [104]:
a = np.array([[6,3,1],[9,1,4]])
b = np.sort(a,axis=1) # sort each row
b


Out[104]:
array([[1, 3, 6],
       [1, 4, 9]])

In [105]:
c = np.sort(a,axis=0) # sort each column
c


Out[105]:
array([[6, 1, 1],
       [9, 3, 4]])

In [107]:
# sorting with fancy indexing
a = np.array([14,13,11,12])
j = np.argsort(a)
j
a[j]


Out[107]:
array([11, 12, 13, 14])
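
The index array returned by np.argsort can also be used to reorder a second array in the same way, or reversed to sort in descending order. The sketch below uses made-up arrays named scores and labels for illustration.

```python
import numpy as np

scores = np.array([14, 13, 11, 12])
labels = np.array(['a', 'b', 'c', 'd'])

order = np.argsort(scores)          # indices that sort scores in ascending order
print(labels[order])                # labels reordered to match the sorted scores
descending = scores[order[::-1]]    # reverse the order for a descending sort
print(descending)
```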

In [109]:
# finding minima and maxima
a = np.array([4,22,3,9])
np.argmax(a)
np.argmin(a)


Out[109]:
2

3. Exercises (See section 1.3.5.1 in the Python Scientific Lecture Notes)


In [117]:
# Q1
# build the 2-D array below (without typing it explicitly)
x = np.arange(1,12,5)
y = np.arange(5)[:,np.newaxis]
z = x + y
z


Out[117]:
array([[ 1,  6, 11],
       [ 2,  7, 12],
       [ 3,  8, 13],
       [ 4,  9, 14],
       [ 5, 10, 15]])

In [119]:
# generate a new array containing its 2nd and 4th row
m = z[(1,3),:]
m


Out[119]:
array([[ 2,  7, 12],
       [ 4,  9, 14]])

In [121]:
# Q2
# divide each column of the array elementwise by b
a = np.arange(25).reshape(5,5)
a
b = np.array([1,5,10,15,20])
b = b[:,np.newaxis]
a / b


Out[121]:
array([[ 0.        ,  1.        ,  2.        ,  3.        ,  4.        ],
       [ 1.        ,  1.2       ,  1.4       ,  1.6       ,  1.8       ],
       [ 1.        ,  1.1       ,  1.2       ,  1.3       ,  1.4       ],
       [ 1.        ,  1.06666667,  1.13333333,  1.2       ,  1.26666667],
       [ 1.        ,  1.05      ,  1.1       ,  1.15      ,  1.2       ]])

In [124]:
# Q3
# generate a 10 by 3 array of random numbers 
np.random.seed(1234)
a = np.random.rand(30).reshape(10,3)
a


Out[124]:
array([[ 0.19151945,  0.62210877,  0.43772774],
       [ 0.78535858,  0.77997581,  0.27259261],
       [ 0.27646426,  0.80187218,  0.95813935],
       [ 0.87593263,  0.35781727,  0.50099513],
       [ 0.68346294,  0.71270203,  0.37025075],
       [ 0.56119619,  0.50308317,  0.01376845],
       [ 0.77282662,  0.88264119,  0.36488598],
       [ 0.61539618,  0.07538124,  0.36882401],
       [ 0.9331401 ,  0.65137814,  0.39720258],
       [ 0.78873014,  0.31683612,  0.56809865]])

In [132]:
# for each row, pick the number closest to 0.5
b = np.abs(a - 0.5)
ind = np.argmin(b,axis=1)
c = a[np.arange(10),ind]
c


Out[132]:
array([ 0.43772774,  0.27259261,  0.27646426,  0.50099513,  0.37025075,
        0.50308317,  0.36488598,  0.61539618,  0.39720258,  0.56809865])

4. Efficiency


In [142]:
a = range(1000)
%timeit [i**2 for i in a]


1000 loops, best of 3: 515 µs per loop

In [143]:
b = np.arange(1000)
%timeit b**2


100000 loops, best of 3: 3.03 µs per loop

In [2]:
a = range(10000)
%timeit [i+1 for i in a]


1000 loops, best of 3: 951 µs per loop

In [5]:
c = np.arange(10000)
%timeit c+1


10000 loops, best of 3: 15.7 µs per loop
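
The %timeit magic is only available in IPython. Outside a notebook, a comparable measurement can be sketched with the standard-library timeit module; the number of repetitions (200) is an arbitrary choice, and the timings depend on the machine.

```python
import timeit
import numpy as np

values = range(1000)
arr = np.arange(1000)

# total time in seconds for 200 runs of each statement
t_list = timeit.timeit(lambda: [i**2 for i in values], number=200)
t_numpy = timeit.timeit(lambda: arr**2, number=200)

print("list comprehension: %.6f s" % t_list)
print("numpy array:        %.6f s" % t_numpy)
```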
