In [1]:
%matplotlib inline
import sys
print(sys.version)
import numpy as np
print(np.__version__)
import pandas as pd
print(pd.__version__)
import matplotlib.pyplot as plt
In this video I’ll be covering getting data out of a Series. That includes lookups and boolean selections.
I’ll start again by creating some random numbers and putting them into a pandas Series. Now that I’ve got my Series, I can start querying things out of it. I can do that in a couple of different ways.
In [2]:
np.random.seed(255)
# randint excludes the upper bound, so 21 keeps the range at 1..20 inclusive
raw_np_range = np.random.randint(1, 21, 26)
In [3]:
data = pd.Series(raw_np_range)
In [4]:
data
Out[4]:
Most simply, I can query a specific row by using a dictionary lookup style. I can query multiple rows by passing in a list.
In [5]:
data[20]
Out[5]:
In [6]:
data[[10,20]]
Out[6]:
We can see that it actually returns the same index labels that we had in the previous Series - as a new Series.
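As a small illustration of the point above, here is a minimal sketch with made-up values (not the random data from the video). A single label gives back one value, while a list of labels gives back a new Series that keeps the original index labels.

```python
import pandas as pd

# Hypothetical example values, chosen only for illustration.
s = pd.Series([11, 22, 33, 44, 55])

single = s[3]         # one label -> a single value
several = s[[1, 3]]   # list of labels -> a new Series

print(single)
print(several)
print(list(several.index))   # the labels 1 and 3 are carried over
```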
In [12]:
import string
upcase = [c for c in string.ascii_uppercase]
print(upcase)
In [13]:
data.index=upcase
In [14]:
data.head()
Out[14]:
If the Series is indexed with characters, like uppercase letters for example, then passing in a number or a list of numbers will perform zero-based lookups of those positions in the Series.
In [15]:
data[0]
Out[15]:
In [16]:
data[1:5]
Out[16]:
In [17]:
data[[1,5]]
Out[17]:
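The same positional lookups can be written with iloc, which states the intent explicitly and does not depend on how a given pandas version treats plain integers on a letter index. This is a sketch on a hypothetical Series named letters, not the data from the video.

```python
import string
import pandas as pd

# Hypothetical Series: values 100..125 labelled 'A'..'Z'.
letters = pd.Series(range(100, 126), index=list(string.ascii_uppercase))

first = letters.iloc[0]         # position 0, which carries label 'A'
middle = letters.iloc[1:5]      # positions 1, 2, 3, 4 (stop is excluded)
picked = letters.iloc[[1, 5]]   # positions 1 and 5 only

print(first)
print(list(middle.index))
print(list(picked.index))
```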
We can of course query with the actual labels as well which are now uppercase characters.
In [18]:
data[['A','D']]
Out[18]:
Now here is where things get tricky, and it’s worth going into detail about what’s happening.
First let’s create a Series that is indexed by numbers, with uppercase letters as its values.
In [19]:
num_index = pd.Series(upcase[:5],index=range(5,10))
num_index
Out[19]:
In [20]:
num_index[5]
Out[20]:
Now when we try to query for 0-based positions the same way we did above, we get into trouble: 0 is not a label in this index, so the lookup raises a KeyError.
In [21]:
num_index[0]
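To see the failure without stopping the notebook, the lookup can be wrapped in a try/except. This is a minimal sketch that rebuilds the same kind of Series as above: with an integer index, a plain integer is treated as a label, so 5 is found and 0 is not.

```python
import pandas as pd

# Same shape as num_index above: letters as values, labels 5..9.
num_index = pd.Series(list("ABCDE"), index=range(5, 10))

try:
    num_index[0]            # 0 is not one of the labels 5..9
    outcome = "found"
except KeyError:
    outcome = "KeyError"

print(outcome)
print(num_index[5])         # 5 is a label, so this works
```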
In order to enforce 0-based lookups we have to use the iget method. We can also use iloc in the same way. Both look values up by position rather than by label.
In [22]:
num_index.iget(0)
Out[22]:
In [23]:
num_index.iget([0,4])
Out[23]:
In [24]:
num_index.iloc[[0,4]]
Out[24]:
In [25]:
num_index.ix[[0,4]]
Out[25]:
In [26]:
num_index
Out[26]:
However, ix has another behavior that we need to be aware of.
In [27]:
num_index.ix[[0,5]]
Out[27]:
When we’ve got an integer-based index, as we can see with num_index, ix looks values up by label and returns NaN (not a number) for any label it can’t find. Those numbers don’t exist in our index, so it has nothing to return and returns NaN.
However, if we’ve got an object or another type of index, it will perform 0-based lookups of those values, as we can see with data.
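Recent pandas releases no longer provide iget or ix, so the two behaviors described above have to be requested explicitly. The sketch below assumes such a release: iloc is the positional lookup, and reindex is one way to get the "NaN for a missing label" result on an integer index.

```python
import pandas as pd

# Same shape as num_index above: letters as values, labels 5..9.
num_index = pd.Series(list("ABCDE"), index=range(5, 10))

# Positional: positions 0 and 4, which carry the labels 5 and 9.
by_position = num_index.iloc[[0, 4]]

# Label based: label 0 does not exist, so its value comes back as NaN;
# label 5 exists and returns 'A'.
by_label = num_index.reindex([0, 5])

print(by_position)
print(by_label)
```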
In [28]:
data.ix[[0,4]]
Out[28]:
In [29]:
data[[0,1]]
Out[29]:
In [30]:
data.iloc[[0,1]]
Out[30]:
In [31]:
data.ix[[0,1]]
Out[31]:
So now we’ve seen a couple of different methods and properties that help you look things up along the index. So which is best? That’s a harder question to answer because some are more efficient than others in some circumstances. My recommendation would be that you try to be explicit with your commands and be careful not to confuse label-based and position-based lookups. I try to use dictionary style lookups, iloc and ix.
In [32]:
data
Out[32]:
Now that we better understand these explicit ways of querying data, we can talk about boolean selection which is going to start feeling really familiar. Let’s see which values are under 10.
In [33]:
result = data < 10
result
Out[33]:
In [34]:
data[result]
Out[34]:
Now that we’ve got this result we can perform our boolean selection. Of course we can just inline this as well.
In [35]:
data[data < 10]
Out[35]:
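To make the two steps concrete, here is a minimal sketch with made-up values. The comparison produces a Series of True/False values that shares the index of the original, and indexing with that mask keeps only the rows marked True.

```python
import pandas as pd

# Hypothetical example values, chosen only for illustration.
s = pd.Series([3, 12, 7, 18], index=list("ABCD"))

mask = s < 10        # boolean Series with the same index as s
selected = s[mask]   # keeps only the rows where mask is True

print(mask)
print(selected)
```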
Just like with numpy, we can’t combine multiple conditions with the and/or keywords. We have to use the & or | symbols. Of course we can just apply the selections one by one too, because we are lining things up by index label.
In [36]:
data[data < 10 and data > 5]
In [37]:
# use & or | instead of the 'and' / 'or' keywords, with each condition in parentheses
data[(data < 10) & (data > 5)]
Out[37]:
In [38]:
data[data < 10][data > 5]
Out[38]:
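The sketch below, on made-up values, shows why the keyword form fails and why the parentheses matter. The and keyword asks pandas for a single True/False for the whole Series, which raises a ValueError. The & and | operators work element by element, but they bind more tightly than < and >, so each comparison needs its own parentheses.

```python
import pandas as pd

# Hypothetical example values, chosen only for illustration.
s = pd.Series([3, 12, 7, 18, 9], index=list("ABCDE"))

try:
    s[s < 10 and s > 5]      # 'and' needs one truth value per Series
    and_outcome = "ok"
except ValueError:
    and_outcome = "ValueError"

both = s[(s < 10) & (s > 5)]     # element-wise AND
either = s[(s < 5) | (s > 15)]   # element-wise OR

print(and_outcome)
print(both)
print(either)
```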
There are times when you may not care how many values in a Series satisfy a boolean requirement; you just want to know if any do, say, less than two. This is where the any and all methods come in. any will tell you if any of the values are True. You can either wrap the comparison in parentheses or call it on a selection, as the examples here show.
In [39]:
data[data < 10].any()
Out[39]:
In [40]:
data[data > 50].any()
Out[40]:
In [41]:
(data > 50).any()
Out[41]:
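One caveat worth checking: the two forms are not always interchangeable. Calling any on the comparison asks whether any row meets the condition. Calling any on the selected values asks whether any of those values is truthy, which differs when the selection contains zeros. A minimal sketch with made-up values:

```python
import pandas as pd

# Hypothetical values; the 0 is what makes the two forms disagree.
s = pd.Series([0, 15, 20])

on_mask = (s < 10).any()      # is any value below 10?
on_values = s[s < 10].any()   # is any *selected value* non-zero?

print(on_mask)
print(on_values)
```

For that reason the parenthesized form, (s < 10).any(), is the safer habit when the question is about the condition itself.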
all will check if all values are True.
In [42]:
(data > 0).all()
Out[42]:
In [43]:
(data > 10).all()
Out[43]:
In [44]:
data > 5
Out[44]:
Now one thing we can take advantage of in pandas is that sum will count True as 1 and False as 0. So performing a sum of these boolean selections allows you to get the counts pretty easily.
In [45]:
(data > 5).sum()
Out[45]:
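The same idea extends from counts to proportions, since the mean of a boolean Series is the fraction of True values. A short sketch with made-up values:

```python
import pandas as pd

# Hypothetical example values, chosen only for illustration.
s = pd.Series([3, 12, 7, 18])

count = (s > 5).sum()    # number of values above 5
share = (s > 5).mean()   # fraction of values above 5

print(count)
print(share)
```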
In [46]:
data
Out[46]:
Now let’s get to slicing, which we’ve seen a bit of thus far. Slicing is done in the dictionary lookup style and gives us some handy ways of cutting up the data, just like slicing an array.
In [47]:
data[0:10]
Out[47]:
This will work by doing 0-based lookups, as we can see with our num_index Series.
In [48]:
num_index[1:3]
Out[48]:
We can also do it in steps.
In [49]:
data[0:10:2]
Out[49]:
Or we can slice up to a position, or from a position onward.
In [50]:
data[:5]
Out[50]:
In [51]:
data[15:]
Out[51]:
We can also do it backwards from the end of the array, which is equivalent to the tail method we saw earlier.
In [52]:
data[-2:]
Out[52]:
In [53]:
data.tail(2)
Out[53]:
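The slices above can also be written with iloc, and it is worth comparing them with a label slice through loc. In this sketch on a hypothetical letters Series, the positional slice excludes its stop position, while the label slice includes its stop label.

```python
import string
import pandas as pd

# Hypothetical Series: values 0..25 labelled 'A'..'Z'.
letters = pd.Series(range(26), index=list(string.ascii_uppercase))

by_position = letters.iloc[0:3]    # positions 0, 1, 2 (stop excluded)
by_label = letters.loc['A':'C']    # labels 'A', 'B', 'C' (stop included)
stepped = letters.iloc[0:10:2]     # every second row of the first ten
last_two = letters.iloc[-2:]       # same rows as letters.tail(2)

print(list(by_position.index))
print(list(by_label.index))
print(list(stepped.index))
print(list(last_two.index))
```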
However, we can also assign through a slice, so we’ve got to be careful. We aren’t working on a copy of the data, we are working on the original data. This means that any replacement of values that we do trickles down into the original Series.
In [54]:
data[:10][0] = 25
In [55]:
data
Out[55]:
That’s really something to be aware of and can get you in a ton of trouble.
The fundamental question is whether you’re modifying a copy of a Series or the original Series. If you run into some bugs, be sure to know when you're actually modifying the original series or dataframe.
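One way to remove the ambiguity is to say which one you mean. The sketch below uses made-up values: taking an explicit copy protects the original, while assigning through iloc on the Series itself changes it. Whether an assignment through a chained slice reaches the original can depend on the pandas version, so this example avoids that form.

```python
import pandas as pd

# Hypothetical example values, chosen only for illustration.
original = pd.Series([1, 2, 3, 4], index=list("ABCD"))

# An explicit copy of the first two rows: edits stay in the copy.
safe = original.iloc[:2].copy()
safe.iloc[0] = 25
value_after_copy_edit = original.iloc[0]   # original is untouched

# Assigning directly on the Series changes the original.
original.iloc[0] = 99

print(value_after_copy_edit)
print(safe)
print(original)
```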
I’ll be touching on this in the next video, when we talk about NaN values, reindexing, filling, index alignment and other useful parts of pandas.