Numpy Array Basics - Boolean Selection


In [3]:
import sys
print(sys.version)
import numpy as np
print(np.__version__)


3.3.2 (v3.3.2:d047928ae3f6, May 13 2013, 13:52:24) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
1.9.2

In [4]:
npa = np.arange(20)

In [5]:
npa


Out[5]:
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

Now I’m going to introduce some new notation that you’re going to see a lot in pandas. The syntax looks like a dictionary lookup, but instead of a key we pass in a boolean condition. This is boolean selection.

Boolean selection is not so different from filtering or using list comprehensions like we did in the last section.

Let's start with a worked example.

Let's get all values in our array that are divisible by 2.


In [6]:
[x for x in npa if x % 2 == 0]


Out[6]:
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

In [22]:
list(filter(lambda x: x % 2 ==0, npa))


Out[22]:
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

You can see how we did that with a list comprehension and with filter. Now let’s do it with numpy.


In [8]:
npa % 2 == 0


Out[8]:
array([ True, False,  True, False,  True, False,  True, False,  True,
       False,  True, False,  True, False,  True, False,  True, False,
        True, False], dtype=bool)

It’s an interesting notation, but the result isn't really so different. The comparison is applied to each element, so we get back an array holding the boolean result for each value in the original array.

So how might we complete the filter? Easy: we index our original array with that boolean array, much like a dictionary lookup, and get back only the values where it is True.


In [9]:
npa[npa % 2 == 0]


Out[9]:
array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])
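
The two steps above (build a boolean array, then index with it) can be written out separately. This is a minimal sketch; the names mask, evens and n_selected are only illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np

npa = np.arange(20)

# Step 1: the elementwise comparison gives one boolean per element.
mask = npa % 2 == 0

# Step 2: indexing with the mask keeps only the positions where it is True.
evens = npa[mask]

# The mask has the same shape as the array, and the number of True
# entries equals the number of selected values.
n_selected = int(np.count_nonzero(mask))

print(mask.dtype)
print(evens)
print(n_selected)
```

Keeping the mask in a variable is handy when the same condition is reused for several arrays of the same length.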

Now you might ask yourself why things are done this way, and that brings us to the efficiency of the operation. For datasets of reasonable size, the numpy version is typically much faster, often by an order of magnitude or more. Let me show you very quickly before we move on to different boolean selections.


In [10]:
np2 = np.arange(20000)

In [11]:
%timeit [x for x in np2 if x % 2 == 0]


100 loops, best of 3: 7.38 ms per loop

In [12]:
%timeit np2[np2 % 2 == 0]


1000 loops, best of 3: 689 µs per loop

We can see that it is about an order of magnitude faster than our original list comprehension (689 µs versus 7.38 ms per loop).
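
Outside IPython the %timeit magic is not available, so here is a rough equivalent using the standard library's timeit module. It is only a sketch: the helper function names and number=20 are arbitrary choices of mine, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np

np2 = np.arange(20000)


def with_comprehension():
    # Pure Python loop over the array elements.
    return [x for x in np2 if x % 2 == 0]


def with_boolean_selection():
    # Vectorized comparison followed by boolean indexing.
    return np2[np2 % 2 == 0]


# Both approaches select the same values.
same = with_boolean_selection().tolist() == [int(x) for x in with_comprehension()]

# Total time for 20 calls of each function, in seconds.
t_loop = timeit.timeit(with_comprehension, number=20)
t_mask = timeit.timeit(with_boolean_selection, number=20)

print(same)
print("comprehension: %.4f s, boolean selection: %.4f s" % (t_loop, t_mask))
```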


In [13]:
npa


Out[13]:
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

Now here’s an exercise: try to do the same thing, but get all numbers from that array that are greater than 10. Go ahead and pause and try it out.


In [14]:
npa > 10


Out[14]:
array([False, False, False, False, False, False, False, False, False,
       False, False,  True,  True,  True,  True,  True,  True,  True,
        True,  True], dtype=bool)

In [15]:
npa[npa > 10]


Out[15]:
array([11, 12, 13, 14, 15, 16, 17, 18, 19])

How about greater than 15 or less than 5?


In [16]:
npa > 15 or npa < 5


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-b6adf7e14f2b> in <module>()
----> 1 npa > 15 or npa < 5

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

In [17]:
npa > 15 | npa < 5


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-17-0f8de9488807> in <module>()
----> 1 npa > 15 | npa < 5

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

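The instinct is to use “or”, but we can’t: Python's or needs a single truth value for each operand, and an array with many elements doesn't have one. We have to use the bar (|) in place of or, and wrap each comparison in parentheses, because | binds more tightly than > and <.

Now we can filter down with this expression.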


In [18]:
(npa > 15) | (npa < 5)


Out[18]:
array([ True,  True,  True,  True,  True, False, False, False, False,
       False, False, False, False, False, False, False,  True,  True,
        True,  True], dtype=bool)

That is basically boolean selection: we are querying an array for the data we want. This is an extremely powerful concept that will come up over and over again. Don’t worry if you don’t understand it completely yet; for now, just get comfortable with how it is used.


In [19]:
npa[(npa > 15) | (npa < 5)]


Out[19]:
array([ 0,  1,  2,  3,  4, 16, 17, 18, 19])
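
The error from the unparenthesized version earlier comes from operator precedence. The sketch below reproduces it and shows the related operators & (and) and ~ (not), plus np.logical_or as a function form. The variable names are only illustrative.

```python
import numpy as np

npa = np.arange(20)

# Without parentheses, `npa > 15 | npa < 5` is read as the chained
# comparison `npa > (15 | npa) < 5`, which needs a single truth value.
try:
    npa > 15 | npa < 5
    raised = False
except ValueError:
    raised = True

# With parentheses, each comparison is evaluated first.
either = (npa > 15) | (npa < 5)   # elementwise "or"
both = (npa > 5) & (npa < 15)     # elementwise "and"
neither = ~either                 # elementwise "not"

# np.logical_or is the function form of the | operator for boolean arrays.
same_as_function = np.logical_or(npa > 15, npa < 5)

print(npa[either])
print(npa[both])
print(npa[neither])
```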

On a final note, for boolean selection to occur, you just have to pass in an array of boolean values with the same length as the original array. Let me show you quickly.


In [20]:
np3 = np.array([True for x in range(20)])

In [21]:
npa[np3]


Out[21]:
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])
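
An all-True mask selects everything, so a mask with mixed values makes the point more visibly. This is a small sketch with a hand-built mask. The last part assumes a recent NumPy release, where a mask of the wrong length raises IndexError; the 1.9.2 version used in this notebook may behave differently.

```python
import numpy as np

npa = np.arange(20)

# A hand-built mask: True for the first three positions and the last one.
mask = np.zeros(20, dtype=bool)
mask[:3] = True
mask[-1] = True

picked = npa[mask]
print(picked)

# Assumption: recent NumPy versions reject a mask whose length does not
# match the array by raising IndexError.
short_mask = np.array([True, False, True])
try:
    npa[short_mask]
    mismatch_rejected = False
except IndexError:
    mismatch_rejected = True

print(mismatch_rejected)
```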
