NumPy Array Basics - Vectorization


In [2]:
import sys
print(sys.version)
import numpy as np
print(np.__version__)


3.3.2 (v3.3.2:d047928ae3f6, May 13 2013, 13:52:24) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
1.9.2

In [3]:
npa = np.random.random_integers(0,50,20)

I’ve talked a lot about vectorization in the last couple of videos and told you that it’s great, but I haven’t yet shown you why.

Here are two powerful reasons:

  • Concise
  • Efficient

The fundamental idea behind array programming is that operations apply at once to an entire set of values. This makes it a high-level programming model as it allows the programmer to think and operate on whole aggregates of data, without having to resort to explicit loops of individual scalar operations.

You can read more here: https://en.wikipedia.org/wiki/Array_programming


In [4]:
npa


Out[4]:
array([20, 43, 49,  2, 13, 24, 50, 44, 42, 35, 41, 10,  1,  2, 42,  4, 21,
       40, 38, 24])

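With vectorization we can apply a change to the entire array at once, with no explicit for loop. To double every element we multiply the array by 2; to cube every element we raise the array to the power of 3. The list comprehension in the last cell below does the doubling one element at a time, for comparison.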


In [5]:
npa * 2


Out[5]:
array([ 40,  86,  98,   4,  26,  48, 100,  88,  84,  70,  82,  20,   2,
         4,  84,   8,  42,  80,  76,  48])

In [6]:
npa ** 3


Out[6]:
array([  8000,  79507, 117649,      8,   2197,  13824, 125000,  85184,
        74088,  42875,  68921,   1000,      1,      8,  74088,     64,
         9261,  64000,  54872,  13824])

In [7]:
[x * 2 for x in npa]


Out[7]:
[40, 86, 98, 4, 26, 48, 100, 88, 84, 70, 82, 20, 2, 4, 84, 8, 42, 80, 76, 48]
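One rough way to see the efficiency gap is to time both forms on a larger array. The sketch below uses the standard-library timeit module rather than the %timeit magic so it runs outside IPython. The names big, double_with_loop and double_vectorized are made up for this illustration, and the array size is an arbitrary choice.

```python
import timeit

import numpy as np

# A larger array than npa, so the difference is measurable.
# The size (100,000 elements) is an arbitrary choice for this sketch.
big = np.arange(100000)


def double_with_loop():
    # One Python-level multiplication per element.
    return [x * 2 for x in big]


def double_vectorized():
    # A single whole-array multiplication.
    return big * 2


# Both forms produce the same values.
assert double_with_loop() == double_vectorized().tolist()

# Time 10 runs of each; absolute numbers depend on the machine.
loop_time = timeit.timeit(double_with_loop, number=10)
vec_time = timeit.timeit(double_vectorized, number=10)
print("list comprehension: %.4f s" % loop_time)
print("vectorized:         %.4f s" % vec_time)
```

The exact timings vary by machine and NumPy version, so no expected output is shown here.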

So who cares? Again, it’s going to be an efficiency thing, just like boolean selection. Let’s try something a bit more complex.

Define a function named new_func that cubes the value if it is less than 10 and squares it if it is greater than or equal to 10.


In [8]:
def new_func(numb):
    if numb < 10:
        return numb**3
    else:
        return numb**2

In [9]:
new_func(npa)


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-3e04545c215c> in <module>()
----> 1 new_func(npa)

<ipython-input-8-a509179a0915> in new_func(numb)
      1 def new_func(numb):
----> 2     if numb < 10:
      3         return numb**3
      4     else:
      5         return numb**2

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

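However, we can’t just pass in the whole array. The comparison numb < 10 produces an array of booleans, one per element, and the if statement needs a single True or False, so NumPy raises this ValueError about ambiguity.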
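The error can be reproduced in isolation. The sketch below rebuilds npa from the values printed in Out[4] and shows what the comparison returns, why the if statement rejects it, and what any() and all() do with it. The names mask and error_message are introduced only for this example.

```python
import numpy as np

# Same values as the npa array printed in Out[4].
npa = np.array([20, 43, 49, 2, 13, 24, 50, 44, 42, 35,
                41, 10, 1, 2, 42, 4, 21, 40, 38, 24])

# The comparison is applied element by element: one True/False per value.
mask = npa < 10
print(mask)

# An if statement needs exactly one boolean, so a multi-element
# array in that position raises ValueError.
error_message = None
try:
    if mask:
        pass
except ValueError as err:
    error_message = str(err)
    print(error_message)

# any() and all() each collapse the mask to a single boolean.
print(mask.any())  # is at least one element below 10?
print(mask.all())  # is every element below 10?
```

Note that any() and all() answer a question about the whole array. They do not apply a different branch to each element, which is what new_func needs.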


In [10]:
?np.vectorize

We need to vectorize this operation, and we do that with np.vectorize.

We can then apply that to our entire array and it takes care of the complexity for us. We can think in terms of the data without having to think about each individual element.


In [11]:
vect_new_func = np.vectorize(new_func)

In [12]:
type(vect_new_func)


Out[12]:
numpy.lib.function_base.vectorize

In [13]:
vect_new_func(npa)


Out[13]:
array([ 400, 1849, 2401,    8,  169,  576, 2500, 1936, 1764, 1225, 1681,
        100,    1,    8, 1764,   64,  441, 1600, 1444,  576])

In [14]:
[new_func(x) for x in npa]


Out[14]:
[400,
 1849,
 2401,
 8,
 169,
 576,
 2500,
 1936,
 1764,
 1225,
 1681,
 100,
 1,
 8,
 1764,
 64,
 441,
 1600,
 1444,
 576]

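Vectorized operations are generally much faster too, and while these are simple examples the benefits will become apparent as we continue through this course.

np.vectorize is a special case, though. The NumPy documentation describes it as provided primarily for convenience, not for performance, and its implementation is essentially a for loop. List comprehensions have also gotten faster since Python 3, and in the timings below the list comprehension actually beats vect_new_func. This doesn't mean that vectorization in general is slower, just that np.vectorize is a bit heavier because it places a lot more tools at your disposal, like we'll see in the next video.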


In [15]:
%timeit [new_func(x) for x in npa]
%timeit vect_new_func(npa)


100000 loops, best of 3: 12.8 µs per loop
10000 loops, best of 3: 42.4 µs per loop

In [16]:
npa2 = np.random.random_integers(0,100,20*1000)

Now let's repeat the speed comparison on a much larger array (20,000 elements).


In [17]:
%timeit [new_func(x) for x in npa2]
%timeit vect_new_func(npa2)


100 loops, best of 3: 9.87 ms per loop
100 loops, best of 3: 16.3 ms per loop
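In both timings vect_new_func is no faster than the list comprehension, because np.vectorize still calls new_func once per element. One way to express the same rule with whole-array operations is np.where. This is a sketch of an alternative, not something the notebook above uses, and new_func_where is a name made up for this example.

```python
import numpy as np

# Same values as the npa array printed in Out[4].
npa = np.array([20, 43, 49, 2, 13, 24, 50, 44, 42, 35,
                41, 10, 1, 2, 42, 4, 21, 40, 38, 24])


def new_func(numb):
    # Scalar version from the notebook: cube below 10, square otherwise.
    if numb < 10:
        return numb**3
    else:
        return numb**2


def new_func_where(arr):
    # np.where(condition, a, b) takes elements from a where the condition
    # is True and from b elsewhere. Both arr**3 and arr**2 are computed
    # for the whole array before the selection happens.
    return np.where(arr < 10, arr**3, arr**2)


result = new_func_where(npa)
print(result)

# The result matches the element-by-element version.
assert result.tolist() == [new_func(x) for x in npa]
```

Because the comparison, the powers and the selection each run over the whole array in compiled code, this form avoids the per-element Python function call. The trade-off is that both branches are evaluated for every element, which matters if a branch is expensive or invalid for some inputs.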
