pandas Series Look Ups, Selections, and Indexing



In [1]:

    
%matplotlib inline
import sys
print(sys.version)
import numpy as np
print(np.__version__)
import pandas as pd
print(pd.__version__)
import matplotlib.pyplot as plt









    



3.3.2 (v3.3.2:d047928ae3f6, May 13 2013, 13:52:24) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
1.9.2
0.16.2

In this video I’ll be covering getting data out of a Series. That includes look ups and boolean selections.

I’ll start again by creating some random numbers and putting them into a pandas Series. Now I’ve got my Series I can start querying things out of it. I can do that in a couple of different ways.



In [2]:

    
np.random.seed(255)
raw_np_range = np.random.random_integers(1,20,26)



In [3]:

    
data = pd.Series.from_array(raw_np_range)



In [4]:

    
data









    Out[4]:





0     20
1     10
2      7
3      6
4     16
5     10
6     19
7      1
8      8
9     18
10    18
11     8
12    17
13     7
14     7
15    17
16     7
17     5
18    16
19    10
20    11
21    13
22     2
23    18
24    20
25     4
dtype: int64

Most simply I can query a specific row by using a dictionary lookup style. I can query mutiple rows but passing in a list.



In [5]:

    
data[20]









    Out[5]:





11



In [6]:

    
data[[10,20]]









    Out[6]:





10    18
20    11
dtype: int64

We can see that it actually returns the same index labels that we had in the previous Series - as a new Series.



In [12]:

    
import string
upcase = [c for c in string.ascii_uppercase]
print(upcase)









    



['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']



In [13]:

    
data.index=upcase



In [14]:

    
data.head()









    Out[14]:





A    20
B    10
C     7
D     6
E    16
dtype: int64

If the Series is indexed with Characters like uppercase values for example. Then passing in a number or list of numbers will perform the zero based look ups of those values in the Series.



In [15]:

    
data[0]









    Out[15]:





20



In [16]:

    
data[1:5]









    Out[16]:





B    10
C     7
D     6
E    16
dtype: int64



In [17]:

    
data[[1,5]]









    Out[17]:





B    10
F    10
dtype: int64

We can of course query with the actual labels as well which are now uppercase characters.

Now here is where things get tricky and it’s worth going into detail about what’s happening here.



In [18]:

    
data[['A','D']]









    Out[18]:





A    20
D     6
dtype: int64

First let’s create a Series that in number indexed with upper case letter as its values.



In [19]:

    
num_index = pd.Series(upcase[:5],index=range(5,10))
num_index









    Out[19]:





5    A
6    B
7    C
8    D
9    E
dtype: object



In [20]:

    
num_index[5]









    Out[20]:





'A'

Now when we try and query for 0 based values via the same way we did above we’ll get in trouble.



In [21]:

    
num_index[0]









    



---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-21-9f07cf977f25> in <module>()
----> 1 num_index[0]

/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/pandas/core/series.py in __getitem__(self, key)
    519     def __getitem__(self, key):
    520         try:
--> 521             result = self.index.get_value(self, key)
    522 
    523             if not np.isscalar(result):

/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/pandas/core/index.py in get_value(self, series, key)
   1593 
   1594         try:
-> 1595             return self._engine.get_value(s, k)
   1596         except KeyError as e1:
   1597             if len(self) > 0 and self.inferred_type in ['integer','boolean']:

pandas/index.pyx in pandas.index.IndexEngine.get_value (pandas/index.c:3113)()

pandas/index.pyx in pandas.index.IndexEngine.get_value (pandas/index.c:2844)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3704)()

pandas/hashtable.pyx in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:7224)()

pandas/hashtable.pyx in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:7162)()

KeyError: 0

In order to enforce 0 based lookups we have to use the iget command. We can also use iloc in the same way. These are enforcing zero based look ups.



In [22]:

    
num_index.iget(0)









    Out[22]:





'A'



In [23]:

    
num_index.iget([0,4])









    Out[23]:





5    A
9    E
dtype: object



In [24]:

    
num_index.iloc[[0,4]]









    Out[24]:





5    A
9    E
dtype: object



In [25]:

    
num_index.ix[[0,4]]









    Out[25]:





0    NaN
4    NaN
dtype: object



In [26]:

    
num_index









    Out[26]:





5    A
6    B
7    C
8    D
9    E
dtype: object

However we’ve got another property as well.



In [27]:

    
num_index.ix[[0,5]]









    Out[27]:





0    NaN
5      A
dtype: object

When we’ve got an integer based index. As we can see with num_index, it returns NaN or not a number. This is because we don’t have values for those specific numbers in our index. They don't exist in our index so it has nothing to return and returns NaN.

However if we’ve got an object or another type of index, it will return the 0-based look ups of those values as we can see with data.



In [28]:

    
data.ix[[0,4]]









    Out[28]:





A    20
E    16
dtype: int64



In [29]:

    
data[[0,1]]









    Out[29]:





A    20
B    10
dtype: int64



In [30]:

    
data.iloc[[0,1]]









    Out[30]:





A    20
B    10
dtype: int64



In [31]:

    
data.ix[[0,1]]









    Out[31]:





A    20
B    10
dtype: int64

So now we’ve seen a couple of different methods and properties that help you look things up along the index. So which is best? That’s a harder question to answer because some are more efficient than others in some circumstances. My recommendation would be that you try and be explicit with your commands and be careful about confusing indexes. I try to use dictionary style lookups, iloc and ix.



In [32]:

    
data









    Out[32]:





A    20
B    10
C     7
D     6
E    16
F    10
G    19
H     1
I     8
J    18
K    18
L     8
M    17
N     7
O     7
P    17
Q     7
R     5
S    16
T    10
U    11
V    13
W     2
X    18
Y    20
Z     4
dtype: int64

Now that we better understand these explicit ways of querying data, we can talk about boolean selection which is going to start feeling really familiar. Let’s see which values are under 10.



In [33]:

    
result = data < 10
result









    Out[33]:





A    False
B    False
C     True
D     True
E    False
F    False
G    False
H     True
I     True
J    False
K    False
L     True
M    False
N     True
O     True
P    False
Q     True
R     True
S    False
T    False
U    False
V    False
W     True
X    False
Y    False
Z     True
dtype: bool



In [34]:

    
data[result]









    Out[34]:





C    7
D    6
H    1
I    8
L    8
N    7
O    7
Q    7
R    5
W    2
Z    4
dtype: int64

Now that we’ve got this result we can perform our boolean selection. Of course we can just inline this as well.



In [35]:

    
data[data < 10]









    Out[35]:





C    7
D    6
H    1
I    8
L    8
N    7
O    7
Q    7
R    5
W    2
Z    4
dtype: int64

Just like with numpy we can’t just throw in multiple selections. We have to use the & or | symbols. Of course we can just do these one by one too because we are lining things up by index label.



In [36]:

    
data[data < 10 and data > 5]









    



---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-36-289262abe4b9> in <module>()
----> 1 data[data < 10 and data > 5]

/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/pandas/core/generic.py in __nonzero__(self)
    712         raise ValueError("The truth value of a {0} is ambiguous. "
    713                          "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
--> 714                          .format(self.__class__.__name__))
    715 
    716     __bool__ = __nonzero__

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().



In [37]:

    
# & or | instead of 'and' or 'or' keywords, although with parenthesis
data[(data < 10) & (data > 5)]









    Out[37]:





C    7
D    6
I    8
L    8
N    7
O    7
Q    7
dtype: int64



In [38]:

    
data[data < 10][data > 5]









    Out[38]:





C    7
D    6
I    8
L    8
N    7
O    7
Q    7
dtype: int64

There are times when you may not care how many values are in an array you just want to know if any satisfy a boolean requirement, say less than two. This is where the any and all keywords come in. Any will tell you if any of the values are true. You can either wrap it in parenthesis or just use it as a query like I did with the two examples above.



In [39]:

    
data[data < 10].any()









    Out[39]:





True



In [40]:

    
data[data > 50].any()









    Out[40]:





False



In [41]:

    
(data > 50).any()









    Out[41]:





False

All will check it all values are True.



In [42]:

    
(data > 0).all()









    Out[42]:





True



In [43]:

    
(data > 10).all()









    Out[43]:





False



In [44]:

    
data > 5









    Out[44]:





A     True
B     True
C     True
D     True
E     True
F     True
G     True
H    False
I     True
J     True
K     True
L     True
M     True
N     True
O     True
P     True
Q     True
R    False
S     True
T     True
U     True
V     True
W    False
X     True
Y     True
Z    False
dtype: bool

Now one thing we can take advantage of in pandas is that sum will count True as 1 and False as 0. So performing a sum of these boolean selections allows you to get the counts pretty easily.



In [45]:

    
(data > 5).sum()









    Out[45]:





22



In [46]:

    
data









    Out[46]:





A    20
B    10
C     7
D     6
E    16
F    10
G    19
H     1
I     8
J    18
K    18
L     8
M    17
N     7
O     7
P    17
Q     7
R     5
S    16
T    10
U    11
V    13
W     2
X    18
Y    20
Z     4
dtype: int64

Now let’s get to slicing which we’ve seen a bit of thus far. Slicing is done in the dictionary look up style but gives us some handy ways of slicing data. Just like we might be slicing an array.



In [47]:

    
data[0:10]









    Out[47]:





A    20
B    10
C     7
D     6
E    16
F    10
G    19
H     1
I     8
J    18
dtype: int64

This will work by doing 0 based lookups as we can see with our num index array.



In [48]:

    
num_index[1:3]









    Out[48]:





6    B
7    C
dtype: object

We can also do it in steps.



In [49]:

    
data[0:10:2]









    Out[49]:





A    20
C     7
E    16
G    19
I     8
dtype: int64

or up to a value or below a value.



In [50]:

    
data[:5]









    Out[50]:





A    20
B    10
C     7
D     6
E    16
dtype: int64



In [51]:

    
data[15:]









    Out[51]:





P    17
Q     7
R     5
S    16
T    10
U    11
V    13
W     2
X    18
Y    20
Z     4
dtype: int64

We can also do it backwards from the end of the array which is equivalent to the tail command we say above.



In [52]:

    
data[-2:]









    Out[52]:





Y    20
Z     4
dtype: int64



In [53]:

    
data.tail(2)









    Out[53]:





Y    20
Z     4
dtype: int64

However with slicing we can substitute so we’ve got to be careful. We aren’t using a copy of the data, we are using the original data. This means that any replacement of values that we do trickles down into the original list.



In [54]:

    
data[:10][0] = 25



In [55]:

    
data









    Out[55]:





A    25
B    10
C     7
D     6
E    16
F    10
G    19
H     1
I     8
J    18
K    18
L     8
M    17
N     7
O     7
P    17
Q     7
R     5
S    16
T    10
U    11
V    13
W     2
X    18
Y    20
Z     4
dtype: int64

That’s really something to be aware of and can get you in a ton of trouble.

The fundamental question is whether you’re modifying a copy of a Series or the original Series. If you run into some bugs, be sure to know when you're actually modifying the original series or dataframe.

I’ll be touching on this in the next video. When we talk about NaN values, reindexing, filling, index alignments and other useful parts of pandas.



In [ ]: