NumPy

Numerical Python

NumPy provides an efficient way to store and manipulate arrays. NumPy is all about VECTORIZATION. The mental model is different from regular Python and works with:

"vectors"
"arrays"
"views"
"ufuncs" (advanced)

ndarray provides efficient storage and manipulation of 1-D arrays (vectors), N x M matrices, and higher-dimensional datasets.
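
A minimal sketch of the ideas listed above, assuming only a standard NumPy install. The variable names are illustrative and not part of the original notebook.

```python
import numpy as np

# A "vector": a 1-D array
v = np.array([1.0, 2.0, 3.0, 4.0])

# Vectorized arithmetic: the whole array is processed without a Python loop
doubled = v * 2            # array([2., 4., 6., 8.])

# A "ufunc" (universal function) is applied element by element
roots = np.sqrt(v)

# A "view": a slice shares memory with the original array instead of copying
tail = v[2:]
tail[0] = 30.0             # also changes v[2]
```

Writing through the slice changes the original array, which is the main thing to keep in mind about views.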

Example Using Object Oriented Techniques






In [5]:

    
import random


class RandomWalker:
    def __init__(self):
        self.position = 0

    def walk(self, n):
        # Yield the current position, then take a random step of +1 or -1
        self.position = 0
        for i in range(n):
            yield self.position
            self.position += 2 * random.randint(0, 1) - 1

Example of Cell Magic Using 2 '%' ('%%')



In [10]:

    
%%timeit
walker = RandomWalker()
walk = [position for position in walker.walk(10000)]









    



10 loops, best of 3: 13.4 ms per loop

Line magic uses 1 '%' and applies to a single line; cell magic ('%%') applies to the whole cell.
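
Outside IPython the timeit magics are not available. A rough equivalent can be sketched with the standard-library timeit module. The helper name best_time and the repeat/number values below are illustrative choices, not taken from the notebook.

```python
import random
import timeit


def random_walk(n):
    # Same idea as the walker above: start at 0, step +1 or -1 each time
    position = 0
    walk = [position]
    for _ in range(n):
        position += 2 * random.randint(0, 1) - 1
        walk.append(position)
    return walk


def best_time(func, n, repeat=3, number=10):
    # Run `number` calls per trial, `repeat` trials, and report the best
    # per-call time in seconds (similar in spirit to "best of 3")
    trials = timeit.repeat(lambda: func(n), repeat=repeat, number=number)
    return min(trials) / number


seconds = best_time(random_walk, 1000)
print(f"{seconds * 1e3:.3f} ms per call")
```

The printed number depends on the machine, so it will not match the timings recorded in this notebook.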

Example Using Functional Programming

Remove the class definition and use a plain function



In [12]:

    
def random_walk_f(n):
    position = 0
    walk = [position]
    for i in range(n):
        # Accumulate the step (+1 or -1) rather than overwriting the position
        position += 2 * random.randint(0, 1) - 1
        walk.append(position)
    return walk



In [13]:

    
%%timeit
walk = random_walk_f(10000)









    



100 loops, best of 3: 11.4 ms per loop

A small improvement in time: 13.4 ms down to 11.4 ms

Vectorized Approach Like When You Did Things in MATLAB :(

Get rid of the loop



In [14]:

    
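from itertools import accumulate
def random_walker_v(n):
    # Draw n steps from a pool of n (+1)s and n (-1)s (sampling without replacement)
    steps = random.sample([1, -1] * n, n)
    # The running sum of the steps gives the positions
    return list(accumulate(steps))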



In [16]:

    
%%timeit
walk = random_walker_v(10000)









    



100 loops, best of 3: 6.18 ms per loop

WOW, almost 2x as fast (11.4 ms down to 6.18 ms)

NumPy-ifying



In [17]:

    
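import numpy as np
def random_walker_np(n):
    # randint(0, 2) draws 0 or 1 (upper bound is exclusive); map to -1 or +1
    steps = 2 * np.random.randint(0, 2, size=n) - 1
    # Cumulative sum of the steps gives the positions
    return np.cumsum(steps)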



In [18]:

    
%%timeit
walk = random_walker_np(10000)









    



The slowest run took 2164.96 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 74.1 µs per loop
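
Speed aside, it is worth checking that the vectorized version still produces a valid walk. The check below is a small sketch, not part of the original notebook: every pair of consecutive positions should differ by exactly 1.

```python
import numpy as np


def random_walker_np(n):
    steps = 2 * np.random.randint(0, 2, size=n) - 1
    return np.cumsum(steps)


walk = random_walker_np(10000)

# Unlike the loop versions, this walk starts at the first step rather than at 0
first_is_unit = abs(int(walk[0])) == 1

# np.diff gives the difference between consecutive positions
diffs = np.diff(walk)
all_unit_steps = bool(np.all(np.abs(diffs) == 1))
```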

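Getting Started with a Basic NumPy Array

Create an array

Import NumPy under the conventional np alias. Clobbering the namespace with a star import (from numpy import *) would save typing np., but it can silently replace other names, such as the standard-library random module.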



In [1]:

    
import numpy as np

Create a NumPy array. You can pass any type of Python sequence: lists, tuples, etc.



In [25]:

    
a = np.array([0,1,2,3,4,5])



In [26]:

    
a









    Out[26]:





array([0, 1, 2, 3, 4, 5])

Multidimensional array using list of lists



In [3]:

    
m = np.array([[1,2,3], [4,5,6]])



In [5]:

    
m.shape









    Out[5]:





(2, 3)
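
For a 2-D array the shape reads (rows, columns). A short sketch of the related attributes and of indexing with a (row, column) pair. The variable names on the left are illustrative.

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6]])

n_dims = m.ndim        # number of axes: 2
n_items = m.size       # total number of elements: 6
corner = m[1, 2]       # row 1, column 2 (zero-based): 6
first_row = m[0]       # array([1, 2, 3])
second_col = m[:, 1]   # array([2, 5])
```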






In [27]:

    
# a.data is a buffer (memoryview) over the array's underlying memory
ad = a.data
list(ad)









    Out[27]:





[0, 1, 2, 3, 4, 5]



In [28]:

    
# what type is a
type(a)









    Out[28]:





numpy.ndarray



In [29]:

    
# what is the numeric type of the elements in the array
a.dtype









    Out[29]:





dtype('int64')



In [30]:

    
# What shape (dimensions) is the array
a.shape









    Out[30]:





(6,)



In [33]:

    
# Bytes per element. 64-bit integers (int64) take 8 bytes; 32-bit integers would take 4
a.itemsize









    Out[33]:





8



In [34]:

    
# Total size in bytes of the array
a.nbytes









    Out[34]:





48



In [35]:

    
# Beware of type coercion
# a holds int64 values, so an assigned float is truncated to an integer
print(a)
a[0] = 10.38383
print(a)









    



[0 1 2 3 4 5]
[10  1  2  3  4  5]
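
If the fractional part matters, the array needs a floating-point dtype before the assignment. Two ways to get one, sketched here with illustrative variable names:

```python
import numpy as np

a = np.array([0, 1, 2, 3, 4, 5])

# Option 1: convert an existing integer array (astype returns a new array)
a_float = a.astype(np.float64)
a_float[0] = 10.38383

# Option 2: request a float dtype when the array is created
b = np.array([0, 1, 2, 3, 4, 5], dtype=float)
b[0] = 10.38383

# The original integer array still truncates on assignment
a[0] = 10.38383
```

Because astype copies the data, changing a_float leaves a untouched.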



In [37]:

    
x = np.array([0,1,1.5,3])
y = np.array([1,2,3,1])

Reshape and Resize
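
A minimal sketch of the two operations, not taken from the original notebook: reshape changes the shape without changing the number of elements, while np.resize builds a new array of the requested shape, repeating the data when it runs out.

```python
import numpy as np

a = np.arange(6)                 # array([0, 1, 2, 3, 4, 5])

# reshape: same 6 elements, new shape; here the result is a view of a
grid = a.reshape(2, 3)           # [[0, 1, 2], [3, 4, 5]]
grid[0, 0] = 99                  # also changes a[0]

# -1 lets NumPy infer one dimension from the total size
columns = a.reshape(-1, 2)       # shape (3, 2)

# np.resize: returns a new array, repeating elements to fill a larger shape
bigger = np.resize(a, (2, 4))    # [[99, 1, 2, 3], [4, 5, 99, 1]]
```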

Operations



In [7]:

    
# Element wise addition
x + y



In [8]:

    
# Element wise subtraction
x - y




Do Some Vector Math, Not for-Loop Math



In [43]:

    
%%timeit
dy = y[1:] - y[:-1]









    



The slowest run took 29.78 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 813 ns per loop
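
The slice expression subtracts each element from its successor in one step. For comparison, here is the same computation written as an explicit loop, alongside np.diff, which computes the same first difference. The loop version is only there to show what the vectorized line replaces.

```python
import numpy as np

y = np.array([1, 2, 3, 1])

# Vectorized: y[1:] is [2, 3, 1] and y[:-1] is [1, 2, 3]
dy = y[1:] - y[:-1]              # array([ 1,  1, -2])

# Equivalent explicit loop over neighbouring pairs
dy_loop = [int(y[i + 1] - y[i]) for i in range(len(y) - 1)]

# np.diff computes the same first difference
dy_diff = np.diff(y)
```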

%%capture captures the output of the cell (stdout/stderr) into a variable



In [40]:

    
%%capture timeit_result
# In Python 3, range() is lazy: this times creating a range object, not building a list
%timeit python_list1 = range(1,1000)
%timeit python_list2 = np.arange(1,1000)



In [41]:

    
print(timeit_result)









    



The slowest run took 9.31 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 196 ns per loop
The slowest run took 31.32 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.4 us per loop

Statistical Analysis



In [19]:

    
# Use NumPy's random module; the standard-library random.random() takes no shape argument
data_set = np.random.random((2,3))
print(data_set)









    



[[ 0.93613618  0.33032079  0.19598773]
 [ 0.36707494  0.7528012   0.96362384]]






In [18]:

    
# Example of a namespace pitfall: a bare max() call uses the builtin max, not np.max






In [17]:

    
max(data_set[0])









    Out[17]:





0.95525652078806722
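
The builtin max works on a single row because a row is a flat sequence of numbers. On the full 2-D array it fails, which is where np.max and its axis argument come in. A small sketch with fixed values (the array below is illustrative, not the random data above):

```python
import numpy as np

data = np.array([[1, 5, 3],
                 [4, 2, 6]])

row_max = max(data[0])             # builtin max on one row: 5

# Builtin max on the 2-D array compares whole rows, which is ambiguous
try:
    max(data)
    builtin_failed = False
except ValueError:
    builtin_failed = True

overall = np.max(data)             # 6
per_column = np.max(data, axis=0)  # array([4, 5, 6])
per_row = np.max(data, axis=1)     # array([5, 6])
```

axis=0 collapses the rows and leaves one maximum per column; axis=1 collapses the columns and leaves one maximum per row.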

References

http://stackoverflow.com/questions/29759883/what-does-an-intermediate-result-is-being-cached-mean#29765693