Numpy: Creating and Manipulating Numerical Data

1. The Numpy Array Object

1.1 What are Numpy and Numpy Arrays?

Numpy: the core tool for high-performance numerical computing with Python
Numpy arrays: multi-dimensional data structures in Numpy (e.g. 1-D vector, 2-D matrix, 3-D data object, etc.)



In [2]:

    
# import numpy by following the convention
import numpy as np

1.2 Creating Arrays



In [2]:

    
# manual construction of arrays
# 1-D
a = np.array([0,1,2,3])
a









    Out[2]:





array([0, 1, 2, 3])



In [3]:

    
# 2-D
b = np.array([[1,2,3],[5,6,7]])
b









    Out[3]:





array([[1, 2, 3],
       [5, 6, 7]])



In [5]:

    
# check the number of dimensions of each array
a.ndim, b.ndim









    Out[5]:





(1, 2)



In [6]:

    
# check the shape of each array
a.shape, b.shape









    Out[6]:





((4,), (2, 3))



In [7]:

    
# functions for creating arrays
a = np.arange(1,9,2) # start, end (exclusive), step -> 1, 3, 5, 7
a = np.arange(10) # a single argument n gives 0, 1, ..., n-1
a









    Out[7]:





array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])



In [10]:

    
a = np.linspace(0,1,6) # start, end (inclusive), number of points
a









    Out[10]:





array([ 0. ,  0.2,  0.4,  0.6,  0.8,  1. ])
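The different treatment of the end value in arange and linspace is easy to trip over. A minimal sketch of the contrast, using the `endpoint` and `retstep` keywords of `np.linspace` (the variable names are only illustrative):

```python
import numpy as np

# arange excludes the end value; linspace includes it by default
r = np.arange(0, 1, 0.25)                  # 0.0, 0.25, 0.5, 0.75
s = np.linspace(0, 1, 5)                   # 0.0, 0.25, 0.5, 0.75, 1.0

# endpoint=False makes linspace exclude the end value as well
t = np.linspace(0, 1, 4, endpoint=False)   # same points as r

# retstep=True additionally returns the spacing between points
pts, step = np.linspace(0, 1, 6, retstep=True)
print(r, s, t, step)
```

When the number of points matters more than the step size, linspace is usually the safer choice, since a floating-point step in arange can make the length hard to predict.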



In [14]:

    
a = np.ones((3,2)) # a matrix of ones
a









    Out[14]:





array([[ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.]])



In [16]:

    
a= np.zeros((2,3)) # a matrix of zeros
a









    Out[16]:





array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])



In [17]:

    
a = np.eye(3) # an identity matrix
a









    Out[17]:





array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])



In [19]:

    
a = np.diag(np.array([1,2,3,4])) # a diagonal matrix
a









    Out[19]:





array([[1, 0, 0, 0],
       [0, 2, 0, 0],
       [0, 0, 3, 0],
       [0, 0, 0, 4]])



In [26]:

    
# generating random numbers
# set seed
np.random.seed(1234)
# generate a vector of length 4, in which elements are iid draws from UNIF(0,1)
a = np.random.rand(4)
print(a)
# generate a vector of length 4, in which elements are iid draws from standard normal
a = np.random.randn(4)
print(a)









    



[ 0.19151945  0.62210877  0.43772774  0.78535858]
[-0.72058873  0.88716294  0.85958841 -0.6365235 ]
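Numpy 1.17 and later also provide a Generator-based interface for random numbers. The sketch below assumes such a version is installed; it draws the same kinds of samples as above, but the values will differ from the legacy `np.random.seed` output because a different generator is used.

```python
import numpy as np

# a Generator with its own seed, independent of the global np.random state
rng = np.random.default_rng(1234)

u = rng.random(4)            # iid draws from UNIF(0,1)
z = rng.standard_normal(4)   # iid draws from the standard normal
print(u)
print(z)

# the same seed reproduces the same stream
rng2 = np.random.default_rng(1234)
u2 = rng2.random(4)
```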

1.3 Basic Data Types



In [28]:

    
a = np.array([1,2,3],dtype=float)
a.dtype









    Out[28]:





dtype('float64')



In [29]:

    
a = np.array([True, False, True])
a.dtype









    Out[29]:





dtype('bool')
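When no dtype is given, Numpy infers one from the values. A small sketch of that inference and of explicit conversion with `astype` (the exact integer width is platform dependent, so it is not shown):

```python
import numpy as np

i = np.array([1, 2, 3])       # all integers -> an integer dtype
f = np.array([1, 2.5, 3])     # one float promotes the whole array to float64
c = np.array([1 + 2j, 3])     # a complex value gives complex128

# astype returns a converted copy; float -> int drops the fractional part
g = f.astype(int)
print(i.dtype, f.dtype, c.dtype, g)
```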

1.4 Basic Visualization



In [37]:

    
import matplotlib.pyplot as plt
# to display plots in the notebook
%matplotlib inline
x = np.linspace(0,3,20)
y = np.linspace(0,9,20)
plt.plot(x,y) # line plot
plt.plot(x,y,'o') # dot plot







    Out[37]:





[<matplotlib.lines.Line2D at 0x7fa859145518>]

1.5 Indexing and Slicing



In [40]:

    
# indices begin at 0
a = np.arange(10)
a[0], a[1], a[-1]









    Out[40]:





(0, 1, 9)



In [42]:

    
# slicing
a[2:5:2] #[start:end:step]









    Out[42]:





array([2, 4])



In [47]:

    
a[::]









    Out[47]:





array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])



In [48]:

    
a[::-1]









    Out[48]:





array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])



In [52]:

    
# matrices
a = np.diag(np.arange(3))
a









    Out[52]:





array([[0, 0, 0],
       [0, 1, 0],
       [0, 0, 2]])



In [53]:

    
# index a single element of the matrix (row, column)
a[1,1], a[1,2]









    Out[53]:





(1, 0)



In [57]:

    
# numpy arrays are mutable, so we can assign new values to their elements
a[0,0] = 100
a[1,1] = 10
a









    Out[57]:





array([[100,   0,   0],
       [  0,  10,   0],
       [  0,   0,   2]])



In [58]:

    
# the second column of a
a[:,1]









    Out[58]:





array([ 0, 10,  0])



In [59]:

    
# the first row of a
a[0,:]









    Out[59]:





array([100,   0,   0])

1.6 Copies and views



In [61]:

    
# a slicing operation creates a view on the original array
a = np.arange(10)
b = a[::2]
b[0] = 100
a









    Out[61]:





array([100,   1,   2,   3,   4,   5,   6,   7,   8,   9])



In [62]:

    
# force a copy
a = np.arange(10)
b = a[::2].copy()
b[0] = 100
a









    Out[62]:





array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
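Whether two arrays share data can be checked directly instead of by modifying one and inspecting the other. A short sketch using `np.shares_memory` and the `base` attribute:

```python
import numpy as np

a = np.arange(10)
view = a[::2]            # slicing -> view
copy = a[::2].copy()     # explicit copy
fancy = a[[0, 2, 4]]     # indexing with a list of integers -> copy

print(np.shares_memory(a, view))    # True
print(np.shares_memory(a, copy))    # False
print(np.shares_memory(a, fancy))   # False
print(view.base is a)               # True: the view refers back to a
```

Note that indexing with an integer list or a boolean mask returns a copy, unlike slicing.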

1.7 Fancy Indexing



In [64]:

    
# indexing with booleans 
a = np.arange(10)
ind = (a>5)
a[ind]









    Out[64]:





array([6, 7, 8, 9])



In [67]:

    
# indexing with an array of integers
a = np.arange(10,100,10)
a[[2,3,4,2,1]]









    Out[67]:





array([30, 40, 50, 30, 20])
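Boolean masks and integer index arrays can also appear on the left-hand side of an assignment. A minimal sketch (the array names are illustrative):

```python
import numpy as np

# assign through a boolean mask (modifies a in place)
a = np.arange(10)
a[a > 5] = -1

# assign through an array of integer indices
b = np.arange(10, 100, 10)
b[[0, 2]] = 0

# combine conditions with & and |; each condition needs parentheses
c = np.arange(10)
sel = c[(c % 2 == 0) & (c > 2)]
print(a, b, sel)
```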

2. Numerical Operations on Arrays

2.1 Elementwise Operations



In [69]:

    
# with scalars
a = np.array([1,2,3,4])
a + 1
2**a









    Out[69]:





array([ 2,  4,  8, 16])



In [71]:

    
# arithmetic operations are elementwise
b = np.ones(4)
a - b
a*b









    Out[71]:





array([ 1.,  2.,  3.,  4.])



In [72]:

    
# array multiplications
c = np.ones((3,3))
c*c









    Out[72]:





array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])



In [73]:

    
# matrix multiplication
c.dot(c)









    Out[73]:





array([[ 3.,  3.,  3.],
       [ 3.,  3.,  3.],
       [ 3.,  3.,  3.]])
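In Python 3.5 and later, the `@` operator is an alternative spelling of matrix multiplication. A brief sketch contrasting it with the elementwise `*` (the vector `v` is an illustrative addition):

```python
import numpy as np

c = np.ones((3, 3))
v = np.array([1.0, 2.0, 3.0])

elementwise = c * c    # elementwise product, still all ones
matmul = c @ c         # matrix product, same result as c.dot(c)
mv = c @ v             # matrix-vector product, shape (3,)
print(matmul)
print(mv)
```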



In [75]:

    
# comparisons
a == b
a > b









    Out[75]:





array([False,  True,  True,  True], dtype=bool)



In [76]:

    
# array-wise comparison
np.array_equal(a,b)









    Out[76]:





False



In [77]:

    
# transcendental functions
np.log(a)
np.exp(a)
np.sin(a) # only the value of the last expression is displayed









    Out[77]:





array([ 0.84147098,  0.90929743,  0.14112001, -0.7568025 ])



In [1]:

    
# shape mismatches (this will cause an error)
b = np.array([1,2])
a + b









    



ValueError: operands could not be broadcast together with shapes (4,) (2,)
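With numpy imported and `a` defined as the 4-element array from above, adding arrays of incompatible shapes raises a ValueError. A self-contained sketch that catches the error and prints its message:

```python
import numpy as np

a = np.array([1, 2, 3, 4])   # shape (4,)
b = np.array([1, 2])         # shape (2,)

try:
    a + b
    message = None
except ValueError as err:
    # shapes (4,) and (2,) cannot be broadcast against each other
    message = str(err)

print(message)
```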



In [81]:

    
# transposition
a = np.triu(np.ones((3,3)),1)
a.T









    Out[81]:





array([[ 0.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  1.,  0.]])

2.2 Basic Reductions



In [84]:

    
# computing sums
a = np.array([1,2,3,4])
a.sum()









    Out[84]:





10



In [88]:

    
a = np.array([[1,2],[3,4]])
a.sum()
a.sum(axis=0) # column sum
a.sum(axis=1) # row sum









    Out[88]:





array([3, 7])



In [94]:

    
# other reductions
a = np.array([1,2,3,4])
a.min()
a.max()
a.argmin()
a.argmax()
a.mean()
a.std()









    Out[94]:





1.1180339887498949

2.3 Broadcasting



In [96]:

    
a = np.arange(0,40,10)
a = a[:,np.newaxis] # add a new axis -> 2D array
b = np.array([0,1,2])
a + b









    Out[96]:





array([[ 0,  1,  2],
       [10, 11, 12],
       [20, 21, 22],
       [30, 31, 32]])



In [98]:

    
# create a matrix of absolute differences between every pair of observations
x = np.linspace(0,10,5)
y = x[:,np.newaxis]
np.abs(x-y)









    Out[98]:





array([[  0. ,   2.5,   5. ,   7.5,  10. ],
       [  2.5,   0. ,   2.5,   5. ,   7.5],
       [  5. ,   2.5,   0. ,   2.5,   5. ],
       [  7.5,   5. ,   2.5,   0. ,   2.5],
       [ 10. ,   7.5,   5. ,   2.5,   0. ]])
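The shape that results from broadcasting can be inspected without computing anything, via `np.broadcast`. The sketch below also shows a common use, subtracting column means from a matrix; the `data` array is made up for illustration.

```python
import numpy as np

a = np.arange(0, 40, 10)[:, np.newaxis]   # shape (4, 1)
b = np.array([0, 1, 2])                   # shape (3,)

# dimensions of size 1 (or missing) are stretched to match
print(np.broadcast(a, b).shape)           # (4, 3)

# subtract the mean of each column: (2, 2) minus (2,)
data = np.array([[1.0, 2.0], [3.0, 6.0]])
centered = data - data.mean(axis=0)
print(centered)
```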

2.4 Array Shape Manipulation



In [100]:

    
# flattening
a = np.array([[1,2],[3,4]])
b= a.ravel()
b









    Out[100]:





array([1, 2, 3, 4])



In [102]:

    
c = b.reshape((2,2))
c









    Out[102]:





array([[1, 2],
       [3, 4]])
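Two details of reshaping are worth a short sketch: a dimension given as -1 is inferred from the array size, and `ravel` returns a view when it can while `flatten` always returns a copy.

```python
import numpy as np

a = np.arange(6)
m = a.reshape((2, -1))    # -1 is inferred, giving shape (2, 3)

flat_view = m.ravel()     # a view here, since m is contiguous
flat_copy = m.flatten()   # always a copy

m[0, 0] = 99              # also visible through a and flat_view
print(m.shape, a[0], flat_view[0], flat_copy[0])
```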

2.5 Sorting Data



In [104]:

    
a = np.array([[6,3,1],[9,1,4]])
b = np.sort(a,axis=1) # sort each row
b









    Out[104]:





array([[1, 3, 6],
       [1, 4, 9]])



In [105]:

    
c = np.sort(a,axis=0) # sort each column
c









    Out[105]:





array([[6, 1, 1],
       [9, 3, 4]])



In [107]:

    
# sorting with fancy indexing
a = np.array([14,13,11,12])
j = np.argsort(a)
j
a[j]









    Out[107]:





array([11, 12, 13, 14])
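The index array returned by argsort can reorder any array of the same length, which is a convenient way to sort one array by the values of another. A small sketch with made-up `scores` and `names`:

```python
import numpy as np

scores = np.array([14, 13, 11, 12])
names = np.array(["d", "c", "a", "b"])   # hypothetical labels

order = np.argsort(scores)               # indices that sort scores ascending
names_sorted = names[order]              # labels in order of increasing score
descending = scores[order[::-1]]         # reverse the order for a descending sort
print(order, names_sorted, descending)
```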



In [109]:

    
# finding minima and maxima
a = np.array([4,22,3,9])
np.argmax(a)
np.argmin(a)









    Out[109]:





2

3. Exercises (See section 1.3.5.1 in the Python Scientific Lecture Notes)



In [117]:

    
# Q1
# Form the 2-D array (without typing it explicitly)
x = np.arange(1,12,5)
y = np.arange(5)[:,np.newaxis]
z = x + y
z









    Out[117]:





array([[ 1,  6, 11],
       [ 2,  7, 12],
       [ 3,  8, 13],
       [ 4,  9, 14],
       [ 5, 10, 15]])



In [119]:

    
# generate a new array containing its 2nd and 4th rows
m = z[(1,3),:]
m









    Out[119]:





array([[ 2,  7, 12],
       [ 4,  9, 14]])



In [121]:

    
# Q2
# divide each column of the array a elementwise by the array b
a = np.arange(25).reshape(5,5)
a
b = np.array([1,5,10,15,20])
b = b[:,np.newaxis]
a / b









    Out[121]:





array([[ 0.        ,  1.        ,  2.        ,  3.        ,  4.        ],
       [ 1.        ,  1.2       ,  1.4       ,  1.6       ,  1.8       ],
       [ 1.        ,  1.1       ,  1.2       ,  1.3       ,  1.4       ],
       [ 1.        ,  1.06666667,  1.13333333,  1.2       ,  1.26666667],
       [ 1.        ,  1.05      ,  1.1       ,  1.15      ,  1.2       ]])



In [124]:

    
# Q3
# generate a 10 by 3 array of random numbers 
np.random.seed(1234)
a = np.random.rand(30).reshape(10,3)
a









    Out[124]:





array([[ 0.19151945,  0.62210877,  0.43772774],
       [ 0.78535858,  0.77997581,  0.27259261],
       [ 0.27646426,  0.80187218,  0.95813935],
       [ 0.87593263,  0.35781727,  0.50099513],
       [ 0.68346294,  0.71270203,  0.37025075],
       [ 0.56119619,  0.50308317,  0.01376845],
       [ 0.77282662,  0.88264119,  0.36488598],
       [ 0.61539618,  0.07538124,  0.36882401],
       [ 0.9331401 ,  0.65137814,  0.39720258],
       [ 0.78873014,  0.31683612,  0.56809865]])



In [132]:

    
# for each row, pick the number closest to 0.5
b = np.abs(a - 0.5)
ind = np.argmin(b,axis=1)
c = a[np.arange(10),ind]
c









    Out[132]:





array([ 0.43772774,  0.27259261,  0.27646426,  0.50099513,  0.37025075,
        0.50308317,  0.36488598,  0.61539618,  0.39720258,  0.56809865])
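The row-wise selection in Q3 can also be written with `np.take_along_axis`, available in Numpy 1.15 and later. This sketch assumes such a version and checks that both approaches agree:

```python
import numpy as np

np.random.seed(1234)
a = np.random.rand(30).reshape(10, 3)

# column index of the entry closest to 0.5 in each row
ind = np.argmin(np.abs(a - 0.5), axis=1)

# approach used above: pair each row number with its column index
c = a[np.arange(10), ind]

# alternative: the index array must have the same number of dimensions as a
d = np.take_along_axis(a, ind[:, np.newaxis], axis=1)[:, 0]
print(np.array_equal(c, d))
```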

4. Efficiency



In [142]:

    
a = range(1000)
%timeit [i**2 for i in a]









    



1000 loops, best of 3: 515 µs per loop



In [143]:

    
b = np.arange(1000)
%timeit b**2









    



100000 loops, best of 3: 3.03 µs per loop



In [2]:

    
a = range(10000)
%timeit [i+1 for i in a]









    



1000 loops, best of 3: 951 µs per loop



In [5]:

    
c = np.arange(10000)
%timeit c+1









    



10000 loops, best of 3: 15.7 µs per loop
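Outside IPython, the same comparison can be made with the standard-library timeit module. A rough sketch; the repetition count of 200 is an arbitrary choice, and the timings depend on the machine, so no expected numbers are given.

```python
import timeit

import numpy as np

n = 1000
values = range(n)
arr = np.arange(n)

# total time in seconds for 200 repetitions of each approach
t_list = timeit.timeit(lambda: [i**2 for i in values], number=200)
t_numpy = timeit.timeit(lambda: arr**2, number=200)

print(f"list comprehension: {t_list:.4f} s, numpy: {t_numpy:.4f} s")
```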

Additional Discussions

Python for Data Analysis
