Hello all, I am back with another of my notes on Pandas, Today I will focus on indexing and selection of data from pandas object. It's really important since effective use of pandas requires a good knowledge of the indexing and selection of data.

In the last post while introduding data structures, I talked about basic indexing, I will show here as well for the sake of completeness.

Basic Indexing


In [2]:
import pandas as pd

In [3]:
SpotCrudePrices_2013_Data= { 'U.K. Brent' : 
                            {'2013-Q1':112.9, '2013-Q2':103.0, 
                             '2013-Q3':110.1, '2013-Q4':109.4},
                            'Dubai': 
                            {'2013-Q1':108.1, '2013-Q2':100.8,
                             '2013-Q3':106.1,'2013-Q4':106.7},
                             'West Texas Intermediate':
                            {'2013-Q1':94.4, '2013-Q2':94.2,
                             '2013-Q3':105.8,'2013-Q4':97.4}}
SpotCrudePrices_2013=pd.DataFrame.from_dict(SpotCrudePrices_2013_Data)
SpotCrudePrices_2013


Out[3]:
Dubai U.K. Brent West Texas Intermediate
2013-Q1 108.1 112.9 94.4
2013-Q2 100.8 103.0 94.2
2013-Q3 106.1 110.1 105.8
2013-Q4 106.7 109.4 97.4

We can select the prices for the available time periods of Dubai crude oil by using the [] operator:


In [4]:
dubaiPrices=SpotCrudePrices_2013['Dubai']
dubaiPrices


Out[4]:
2013-Q1    108.1
2013-Q2    100.8
2013-Q3    106.1
2013-Q4    106.7
Name: Dubai, dtype: float64

We can also pass a list of columns to the [] operator in order to select the columns in a particular order:


In [7]:
SpotCrudePrices_2013[['West Texas Intermediate','U.K. Brent']]


Out[7]:
West Texas Intermediate U.K. Brent
2013-Q1 94.4 112.9
2013-Q2 94.2 103.0
2013-Q3 105.8 110.1
2013-Q4 97.4 109.4

Rows cannot be selected with the bracket operator [] in a DataFrame.

One can retrieve values from a Series, DataFrame, or Panel directly as an attribute using dot operator


In [9]:
SpotCrudePrices_2013.Dubai


Out[9]:
2013-Q1    108.1
2013-Q2    100.8
2013-Q3    106.1
2013-Q4    106.7
Name: Dubai, dtype: float64

However, this only works if the index element is a valid Python identifier, Dubai in this case is valid but U.K. Brent is not.

We can change the names to valid identifiers:


In [10]:
SpotCrudePrices_2013.columns=['Dubai','UK_Brent','West_Texas_Intermediate']
SpotCrudePrices_2013


Out[10]:
Dubai UK_Brent West_Texas_Intermediate
2013-Q1 108.1 112.9 94.4
2013-Q2 100.8 103.0 94.2
2013-Q3 106.1 110.1 105.8
2013-Q4 106.7 109.4 97.4

In [14]:
SpotCrudePrices_2013.UK_Brent


Out[14]:
2013-Q1    112.9
2013-Q2    103.0
2013-Q3    110.1
2013-Q4    109.4
Name: UK_Brent, dtype: float64

We can also select prices by specifying a column index number to select column 1 (U.K. Brent)


In [12]:
SpotCrudePrices_2013[[1]]


Out[12]:
UK_Brent
2013-Q1 112.9
2013-Q2 103.0
2013-Q3 110.1
2013-Q4 109.4

We can slice a range by using the [] operator. The syntax of the slicing operator exactly matches that of NumPy's:

ar[startIndex: endIndex: stepValue]

For a DataFrame, [] slices across rows, Obrain all rows starting from index 2:


In [15]:
SpotCrudePrices_2013[2:]


Out[15]:
Dubai UK_Brent West_Texas_Intermediate
2013-Q3 106.1 110.1 105.8
2013-Q4 106.7 109.4 97.4

In [20]:
# Reverse the order of rows in DataFrame
SpotCrudePrices_2013[::-1]


Out[20]:
Dubai UK_Brent West_Texas_Intermediate
2013-Q4 106.7 109.4 97.4
2013-Q3 106.1 110.1 105.8
2013-Q2 100.8 103.0 94.2
2013-Q1 108.1 112.9 94.4

In [17]:
# Selecting Dubai's data as Pandas Series
dubaiPrices = SpotCrudePrices_2013['Dubai']
# Obtain the last 3 rows or all rows but the first:
dubaiPrices[1:]


Out[17]:
2013-Q2    100.8
2013-Q3    106.1
2013-Q4    106.7
Name: Dubai, dtype: float64

In [21]:
# Obtain all rows but the last
dubaiPrices[:-1]


Out[21]:
2013-Q1    108.1
2013-Q2    100.8
2013-Q3    106.1
Name: Dubai, dtype: float64

In [22]:
# Reverse the rows
dubaiPrices[::-1]


Out[22]:
2013-Q4    106.7
2013-Q3    106.1
2013-Q2    100.8
2013-Q1    108.1
Name: Dubai, dtype: float64

Label, Integer, and Mixed Indexing

In addition to the standard indexing operator [] and attribute operator, there are operators provided in pandas to make the job of indexing easier and more convenient.

By label indexing, we generally mean indexing by a header name, which tends to be a string value in most cases. These operators are as follows:

  • The .loc operator: It allows label-oriented indexing
  • The .iloc operator: It allows integer-based indexing
  • The .ix operator: It allows mixed label and integer-based indexing

Label-Oriented Indexing

The .loc operator supports pure label-based indexing.


In [24]:
NYC_SnowAvgsData={'Months' : ['January','February','March','April', 'November', 'December'],
                  'Avg SnowDays' : [4.0,2.7,1.7,0.2,0.2,2.3],
                  'Avg Precip. (cm)' : [17.8,22.4,9.1,1.5,0.8,12.2],
                  'Avg Low Temp. (F)' : [27,29,35,45,42,32] }
NYC_SnowAvgs = pd.DataFrame(NYC_SnowAvgsData, 
                            index=NYC_SnowAvgsData['Months'],
                            columns=['Avg SnowDays','Avg Precip. (cm)','Avg Low Temp. (F)'])
NYC_SnowAvgs


Out[24]:
Avg SnowDays Avg Precip. (cm) Avg Low Temp. (F)
January 4.0 17.8 27
February 2.7 22.4 29
March 1.7 9.1 35
April 0.2 1.5 45
November 0.2 0.8 42
December 2.3 12.2 32

In [25]:
# Using a single label:
NYC_SnowAvgs.loc['January']


Out[25]:
Avg SnowDays          4.0
Avg Precip. (cm)     17.8
Avg Low Temp. (F)    27.0
Name: January, dtype: float64

In [26]:
# Using a list of labels
NYC_SnowAvgs.loc[['January', 'April']]


Out[26]:
Avg SnowDays Avg Precip. (cm) Avg Low Temp. (F)
January 4.0 17.8 27
April 0.2 1.5 45

In [28]:
# Using a Label range:
NYC_SnowAvgs.loc['January' : 'March']


Out[28]:
Avg SnowDays Avg Precip. (cm) Avg Low Temp. (F)
January 4.0 17.8 27
February 2.7 22.4 29
March 1.7 9.1 35

Note that while using the .loc , .iloc , and .ix operators on a DataFrame, the row index must always be specified first. This is the opposite of the [] operator, where only columns can be selected directly.


In [29]:
NYC_SnowAvgs.loc[:,'Avg SnowDays']


Out[29]:
January     4.0
February    2.7
March       1.7
April       0.2
November    0.2
December    2.3
Name: Avg SnowDays, dtype: float64

In [31]:
# to select a specific coordinate value
NYC_SnowAvgs.loc['March','Avg SnowDays']


Out[31]:
1.7

In [32]:
# Alternative Style
NYC_SnowAvgs.loc['March']['Avg SnowDays']


Out[32]:
1.7

In [33]:
# Without using loc function, square bracket as follows
NYC_SnowAvgs['Avg SnowDays']['March']


Out[33]:
1.7

We can use the .loc operator to select the rows instead:


In [34]:
NYC_SnowAvgs.loc['March']


Out[34]:
Avg SnowDays          1.7
Avg Precip. (cm)      9.1
Avg Low Temp. (F)    35.0
Name: March, dtype: float64

We can use selection with boolean statements, while we are selecting in Pandas.


In [35]:
# Selecting months have less than one snow day average
NYC_SnowAvgs.loc[NYC_SnowAvgs['Avg SnowDays']<1,:]


Out[35]:
Avg SnowDays Avg Precip. (cm) Avg Low Temp. (F)
April 0.2 1.5 45
November 0.2 0.8 42

In [36]:
# brand of crude priced above 110 a barrel for row 2013-Q1
SpotCrudePrices_2013.loc[:,SpotCrudePrices_2013.loc['2013-Q1']>110]
# Using 2 .loc for more precise selection, how cool is that


Out[36]:
UK_Brent
2013-Q1 112.9
2013-Q2 103.0
2013-Q3 110.1
2013-Q4 109.4

Note that the preceding arguments involve the Boolean operators < and > that actually evaluate the Boolean arrays, for example:


In [37]:
SpotCrudePrices_2013.loc['2013-Q1']>110


Out[37]:
Dubai                      False
UK_Brent                    True
West_Texas_Intermediate    False
Name: 2013-Q1, dtype: bool

Integer-Oriented Indexing

The .iloc operator supports integer-based positional indexing. It accepts the following as inputs:

  • A single integer, for example, 7
  • A list or array of integers, for example, [2,3]
  • A slice object with integers, for example, 1:4

In [39]:
import scipy.constants as phys
import math

In [41]:
sci_values=pd.DataFrame([[math.pi, math.sin(math.pi),math.cos(math.pi)],
                         [math.e,math.log(math.e), phys.golden],
                         [phys.c,phys.g,phys.e],
                         [phys.m_e,phys.m_p,phys.m_n]],
                        index=list(range(0,20,5)))
sci_values


Out[41]:
0 1 2
0 3.141593e+00 1.224647e-16 -1.000000e+00
5 2.718282e+00 1.000000e+00 1.618034e+00
10 2.997925e+08 9.806650e+00 1.602177e-19
15 9.109384e-31 1.672622e-27 1.674927e-27

In [42]:
# Select first two rows by using integer slicing
sci_values.iloc[:2]


Out[42]:
0 1 2
0 3.141593 1.224647e-16 -1.000000
5 2.718282 1.000000e+00 1.618034

In [43]:
sci_values.iloc[2,0:2]


Out[43]:
0    2.997925e+08
1    9.806650e+00
Name: 10, dtype: float64

Note that the arguments to .iloc are strictly positional and have nothing to do with the index values.

we should use the label-indexing operator .loc instead...


In [44]:
sci_values.iloc[10]


---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-44-ddf42a6ff8e1> in <module>()
----> 1 sci_values.iloc[10]

/home/eneskemal/anaconda2/lib/python2.7/site-packages/pandas/core/indexing.pyc in __getitem__(self, key)
   1294             return self._getitem_tuple(key)
   1295         else:
-> 1296             return self._getitem_axis(key, axis=0)
   1297 
   1298     def _getitem_axis(self, key, axis=0):

/home/eneskemal/anaconda2/lib/python2.7/site-packages/pandas/core/indexing.pyc in _getitem_axis(self, key, axis)
   1610 
   1611                 # validate the location
-> 1612                 self._is_valid_integer(key, axis)
   1613 
   1614             return self._get_loc(key, axis=axis)

/home/eneskemal/anaconda2/lib/python2.7/site-packages/pandas/core/indexing.pyc in _is_valid_integer(self, key, axis)
   1524         l = len(ax)
   1525         if key >= l or key < -l:
-> 1526             raise IndexError("single positional indexer is out-of-bounds")
   1527         return True
   1528 

IndexError: single positional indexer is out-of-bounds

In [45]:
sci_values.loc[10]


Out[45]:
0    2.997925e+08
1    9.806650e+00
2    1.602177e-19
Name: 10, dtype: float64

In [46]:
# To Slice out a specific row
sci_values.iloc[2:3,:]


Out[46]:
0 1 2
10 299792458.0 9.80665 1.602177e-19

In [47]:
# TO obtain a cross-section using an integer position
sci_values.iloc[3]


Out[47]:
0    9.109384e-31
1    1.672622e-27
2    1.674927e-27
Name: 15, dtype: float64

The .iat and .at operators can be used for a quick selection of scalar values. They are faster than them but not really common


In [48]:
sci_values.iloc[3,0]


Out[48]:
9.1093835599999998e-31

In [49]:
sci_values.iat[3,0]


Out[49]:
9.1093835599999998e-31

In [51]:
%timeit sci_values.iloc[3,0]


The slowest run took 6.17 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 84 µs per loop

In [52]:
%timeit sci_values.iat[3,0]


The slowest run took 20.61 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 6.06 µs per loop

Mixed Indexing with .ix operator

The .ix operator behaves like a mixture of the .loc and .iloc operators, with the .loc behavior taking precedence. It takes the following as possible inputs:

  • A single label or integer
  • A list of integeres or labels
  • An integer slice or label slice
  • A Boolean array
  • In the following examples I will use this data set imported from csv

      TradingDate,Nasdaq,S&P 500,Russell 2000
      2014/01/30,4123.13,1794.19,1139.36
      2014/01/31,4103.88,1782.59,1130.88
      2014/02/03,3996.96,1741.89,1094.58
      2014/02/04,4031.52,1755.2,1102.84
      2014/02/05,4011.55,1751.64,1093.59
      2014/02/06,4057.12,1773.43,1103.93

In [55]:
stockIndexDataDF=pd.read_csv('stock_index_closing.csv')
stockIndexDataDF


Out[55]:
TradingDate Nasdaq S&P 500 Russell 2000
0 2014/01/30 4123.13 1794.19 1139.36
1 2014/01/31 4103.88 1782.59 1130.88
2 2014/02/03 3996.96 1741.89 1094.58
3 2014/02/04 4031.52 1755.20 1102.84
4 2014/02/05 4011.55 1751.64 1093.59
5 2014/02/06 4057.12 1773.43 1103.93

What we see from the preceding example is that the DataFrame created has an integer-based row index. We promptly set the index to be the trading date to index it based on the trading date so that we can use the .ix operator:


In [56]:
stockIndexDF=stockIndexDataDF.set_index('TradingDate')
stockIndexDF


Out[56]:
Nasdaq S&P 500 Russell 2000
TradingDate
2014/01/30 4123.13 1794.19 1139.36
2014/01/31 4103.88 1782.59 1130.88
2014/02/03 3996.96 1741.89 1094.58
2014/02/04 4031.52 1755.20 1102.84
2014/02/05 4011.55 1751.64 1093.59
2014/02/06 4057.12 1773.43 1103.93

In [57]:
# Using a single label
stockIndexDF.ix['2014/01/30']


Out[57]:
Nasdaq          4123.13
S&P 500         1794.19
Russell 2000    1139.36
Name: 2014/01/30, dtype: float64

In [59]:
# Using a list of labels
stockIndexDF.ix[['2014/01/30', '2014/02/06']]


Out[59]:
Nasdaq S&P 500 Russell 2000
TradingDate
2014/01/30 4123.13 1794.19 1139.36
2014/02/06 4057.12 1773.43 1103.93

In [60]:
type(stockIndexDF.ix['2014/01/30'])


Out[60]:
pandas.core.series.Series

In [61]:
type(stockIndexDF.ix[['2014/01/30']])


Out[61]:
pandas.core.frame.DataFrame

For the former, the indexer is a scalar; for the latter, the indexer is a list. A list indexer is used to select multiple columns. A multi-column slice of a DataFrame can only result in another DataFrame since it is 2D; hence, what is returned in the latter case is a DataFrame.


In [62]:
# Using a label-based slice:
tradingDates=stockIndexDataDF.TradingDate
stockIndexDF.ix[tradingDates[:3]]


Out[62]:
Nasdaq S&P 500 Russell 2000
TradingDate
2014/01/30 4123.13 1794.19 1139.36
2014/01/31 4103.88 1782.59 1130.88
2014/02/03 3996.96 1741.89 1094.58

In [63]:
# Using a single integer:
stockIndexDF.ix[0]


Out[63]:
Nasdaq          4123.13
S&P 500         1794.19
Russell 2000    1139.36
Name: 2014/01/30, dtype: float64

In [64]:
# Using a list of integers:
stockIndexDF.ix[[0,2]]


Out[64]:
Nasdaq S&P 500 Russell 2000
TradingDate
2014/01/30 4123.13 1794.19 1139.36
2014/02/03 3996.96 1741.89 1094.58

In [65]:
# Using an integer slice:
stockIndexDF.ix[1:3]


Out[65]:
Nasdaq S&P 500 Russell 2000
TradingDate
2014/01/31 4103.88 1782.59 1130.88
2014/02/03 3996.96 1741.89 1094.58

In [66]:
# Using an boolean array
stockIndexDF.ix[stockIndexDF['Russell 2000']>1100]


Out[66]:
Nasdaq S&P 500 Russell 2000
TradingDate
2014/01/30 4123.13 1794.19 1139.36
2014/01/31 4103.88 1782.59 1130.88
2014/02/04 4031.52 1755.20 1102.84
2014/02/06 4057.12 1773.43 1103.93

As in the case of .loc , the row index must be specified first for the .ix operator.

We now turn to the topic of MultiIndexing. Multi-level or hierarchical indexing is useful because it enables the pandas user to select and massage data in multiple dimensions by using data structures such as Series and DataFrame.


In [67]:
sharesIndexDataDF=pd.read_csv('stock_index_closing.csv')
sharesIndexDataDF


Out[67]:
TradingDate PriceType Nasdaq S&P 500 Russell 2000
0 2014/02/21 open 4282.17 1841.07 1166.25
1 2014/02/21 close 4263.41 1836.25 1164.63
2 2014/02/21 high 4284.85 1846.13 1168.43
3 2014/02/24 open 4273.32 1836.78 1166.74
4 2014/02/24 close 4292.97 1847.61 1174.55
5 2014/02/24 high 4311.13 1858.71 1180.29
6 2014/02/25 open 4298.48 1847.66 1176.00
7 2014/02/25 close 4287.59 1845.12 1173.95
8 2014/02/25 high 4307.51 1852.91 1179.43
9 2014/02/26 open 4300.45 1845.79 1176.11
10 2014/02/26 close 4292.06 1845.16 1181.72
11 2014/02/26 high 4316.82 1852.65 1188.06
12 2014/02/27 open 4291.47 1844.90 1179.28
13 2014/02/27 close 4318.93 1854.29 1187.94
14 2014/02/27 high 4322.46 1854.53 1187.94
15 2014/02/28 open 4323.52 1855.12 1189.19
16 2014/02/28 close 4308.12 1859.45 1183.03
17 2014/02/28 high 4342.59 1867.92 1193.50

In [70]:
# Create a MultiIndex from trading date and priceType columns
sharesIndexDF=sharesIndexDataDF.set_index(['TradingDate','PriceType'])
mIndex = sharesIndexDF.index
mIndex


Out[70]:
MultiIndex(levels=[[u'2014/02/21', u'2014/02/24', u'2014/02/25', u'2014/02/26', u'2014/02/27', u'2014/02/28'], [u'close', u'high', u'open']],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5], [2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1]],
           names=[u'TradingDate', u'PriceType'])

In [71]:
sharesIndexDF


Out[71]:
Nasdaq S&P 500 Russell 2000
TradingDate PriceType
2014/02/21 open 4282.17 1841.07 1166.25
close 4263.41 1836.25 1164.63
high 4284.85 1846.13 1168.43
2014/02/24 open 4273.32 1836.78 1166.74
close 4292.97 1847.61 1174.55
high 4311.13 1858.71 1180.29
2014/02/25 open 4298.48 1847.66 1176.00
close 4287.59 1845.12 1173.95
high 4307.51 1852.91 1179.43
2014/02/26 open 4300.45 1845.79 1176.11
close 4292.06 1845.16 1181.72
high 4316.82 1852.65 1188.06
2014/02/27 open 4291.47 1844.90 1179.28
close 4318.93 1854.29 1187.94
high 4322.46 1854.53 1187.94
2014/02/28 open 4323.52 1855.12 1189.19
close 4308.12 1859.45 1183.03
high 4342.59 1867.92 1193.50

Upon inspection, we see that the MultiIndex consists of a list of tuples. Applying the get_level_values function with the appropriate argument produces a list of the labels for each level of the index:


In [72]:
mIndex.get_level_values(0)


Out[72]:
Index([u'2014/02/21', u'2014/02/21', u'2014/02/21', u'2014/02/24',
       u'2014/02/24', u'2014/02/24', u'2014/02/25', u'2014/02/25',
       u'2014/02/25', u'2014/02/26', u'2014/02/26', u'2014/02/26',
       u'2014/02/27', u'2014/02/27', u'2014/02/27', u'2014/02/28',
       u'2014/02/28', u'2014/02/28'],
      dtype='object', name=u'TradingDate')

In [73]:
mIndex.get_level_values(1)


Out[73]:
Index([u'open', u'close', u'high', u'open', u'close', u'high', u'open',
       u'close', u'high', u'open', u'close', u'high', u'open', u'close',
       u'high', u'open', u'close', u'high'],
      dtype='object', name=u'PriceType')

You can achieve hierarchical indexing with a MultiIndexed DataFrame:


In [75]:
# Getting All Price Type of date
sharesIndexDF.ix['2014/02/21']


Out[75]:
Nasdaq S&P 500 Russell 2000
PriceType
open 4282.17 1841.07 1166.25
close 4263.41 1836.25 1164.63
high 4284.85 1846.13 1168.43

In [76]:
# Getting specific PriceType  of date
sharesIndexDF.ix['2014/02/21','open']


Out[76]:
Nasdaq          4282.17
S&P 500         1841.07
Russell 2000    1166.25
Name: (2014/02/21, open), dtype: float64

In [77]:
# We can slice on first level 
sharesIndexDF.ix['2014/02/21':'2014/02/24']


Out[77]:
Nasdaq S&P 500 Russell 2000
TradingDate PriceType
2014/02/21 open 4282.17 1841.07 1166.25
close 4263.41 1836.25 1164.63
high 4284.85 1846.13 1168.43
2014/02/24 open 4273.32 1836.78 1166.74
close 4292.97 1847.61 1174.55
high 4311.13 1858.71 1180.29

In [78]:
# But if we can slice at lower level:
sharesIndexDF.ix[('2014/02/21','open'):('2014/02/24','open')]


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-78-2e26e327c710> in <module>()
      1 # But if we can slice at lower level:
----> 2 sharesIndexDF.ix[('2014/02/21','open'):('2014/02/24','open')]

/home/eneskemal/anaconda2/lib/python2.7/site-packages/pandas/core/indexing.pyc in __getitem__(self, key)
     76         else:
     77             key = com._apply_if_callable(key, self.obj)
---> 78             return self._getitem_axis(key, axis=0)
     79 
     80     def _get_label(self, label, axis=0):

/home/eneskemal/anaconda2/lib/python2.7/site-packages/pandas/core/indexing.pyc in _getitem_axis(self, key, axis)
    990         labels = self.obj._get_axis(axis)
    991         if isinstance(key, slice):
--> 992             return self._get_slice_axis(key, axis=axis)
    993         elif (is_list_like_indexer(key) and
    994               not (isinstance(key, tuple) and

/home/eneskemal/anaconda2/lib/python2.7/site-packages/pandas/core/indexing.pyc in _get_slice_axis(self, slice_obj, axis)
   1235         if not need_slice(slice_obj):
   1236             return obj
-> 1237         indexer = self._convert_slice_indexer(slice_obj, axis)
   1238 
   1239         if isinstance(indexer, slice):

/home/eneskemal/anaconda2/lib/python2.7/site-packages/pandas/core/indexing.pyc in _convert_slice_indexer(self, key, axis)
    190         # if we are accessing via lowered dim, use the last dim
    191         ax = self.obj._get_axis(min(axis, self.ndim - 1))
--> 192         return ax._convert_slice_indexer(key, kind=self.name)
    193 
    194     def _has_valid_setitem_indexer(self, indexer):

/home/eneskemal/anaconda2/lib/python2.7/site-packages/pandas/indexes/base.pyc in _convert_slice_indexer(self, key, kind)
   1077         else:
   1078             try:
-> 1079                 indexer = self.slice_indexer(start, stop, step, kind=kind)
   1080             except Exception:
   1081                 if is_index_slice:

/home/eneskemal/anaconda2/lib/python2.7/site-packages/pandas/indexes/base.pyc in slice_indexer(self, start, end, step, kind)
   2783         """
   2784         start_slice, end_slice = self.slice_locs(start, end, step=step,
-> 2785                                                  kind=kind)
   2786 
   2787         # return a slice

/home/eneskemal/anaconda2/lib/python2.7/site-packages/pandas/indexes/multi.pyc in slice_locs(self, start, end, step, kind)
   1482         # This function adds nothing to its parent implementation (the magic
   1483         # happens in get_slice_bound method), but it adds meaningful doc.
-> 1484         return super(MultiIndex, self).slice_locs(start, end, step, kind=kind)
   1485 
   1486     def _partial_tup_index(self, tup, side='left'):

/home/eneskemal/anaconda2/lib/python2.7/site-packages/pandas/indexes/base.pyc in slice_locs(self, start, end, step, kind)
   2962         start_slice = None
   2963         if start is not None:
-> 2964             start_slice = self.get_slice_bound(start, 'left', kind)
   2965         if start_slice is None:
   2966             start_slice = 0

/home/eneskemal/anaconda2/lib/python2.7/site-packages/pandas/indexes/multi.pyc in get_slice_bound(self, label, side, kind)
   1453         if not isinstance(label, tuple):
   1454             label = label,
-> 1455         return self._partial_tup_index(label, side=side)
   1456 
   1457     def slice_locs(self, start=None, end=None, step=None, kind=None):

/home/eneskemal/anaconda2/lib/python2.7/site-packages/pandas/indexes/multi.pyc in _partial_tup_index(self, tup, side)
   1488             raise KeyError('Key length (%d) was greater than MultiIndex'
   1489                            ' lexsort depth (%d)' %
-> 1490                            (len(tup), self.lexsort_depth))
   1491 
   1492         n = len(tup)

KeyError: 'Key length (2) was greater than MultiIndex lexsort depth (1)'

However, this results in KeyError with a rather strange error message. The key lesson to be learned here is that the current incarnation of MultiIndex requires the labels to be sorted for the lower-level slicing routines to work correctly.

In order to do this, you can utilize the sortlevel() method, which sorts the labels of an axis within a MultiIndex. To be on the safe side, sort first before slicing with a MultiIndex. Thus, we can do the following:


In [79]:
sharesIndexDF.sortlevel(0).ix[('2014/02/21','open'):('2014/02/24','open')]


Out[79]:
Nasdaq S&P 500 Russell 2000
TradingDate PriceType
2014/02/21 open 4282.17 1841.07 1166.25
2014/02/24 close 4292.97 1847.61 1174.55
high 4311.13 1858.71 1180.29
open 4273.32 1836.78 1166.74

The swaplevel function enables levels within the MultiIndex to be swapped:


In [81]:
# Swapping level 0 and 1 in x axis
swappedDF=sharesIndexDF[:7].swaplevel(0, 1, axis=0)
swappedDF


Out[81]:
Nasdaq S&P 500 Russell 2000
PriceType TradingDate
open 2014/02/21 4282.17 1841.07 1166.25
close 2014/02/21 4263.41 1836.25 1164.63
high 2014/02/21 4284.85 1846.13 1168.43
open 2014/02/24 4273.32 1836.78 1166.74
close 2014/02/24 4292.97 1847.61 1174.55
high 2014/02/24 4311.13 1858.71 1180.29
open 2014/02/25 4298.48 1847.66 1176.00

The reorder_levels function is more general, allowing you to specify the order of the levels:


In [82]:
reorderedDF=sharesIndexDF[:7].reorder_levels(['PriceType','TradingDate'],axis=0)
reorderedDF


Out[82]:
Nasdaq S&P 500 Russell 2000
PriceType TradingDate
open 2014/02/21 4282.17 1841.07 1166.25
close 2014/02/21 4263.41 1836.25 1164.63
high 2014/02/21 4284.85 1846.13 1168.43
open 2014/02/24 4273.32 1836.78 1166.74
close 2014/02/24 4292.97 1847.61 1174.55
high 2014/02/24 4311.13 1858.71 1180.29
open 2014/02/25 4298.48 1847.66 1176.00

Boolean Indexing

We use Boolean indexing to filter or select parts of the data.

  • OR operator is |
  • AND operator is &
  • NOT operator is ~

These operators must be grouped using parentheses when used together.


In [83]:
# Selecting price type close which are bigger than 4300 in Nasdaq
sharesIndexDataDF.ix[(sharesIndexDataDF['PriceType']=='close')&(sharesIndexDataDF['Nasdaq']>4300) ]


Out[83]:
TradingDate PriceType Nasdaq S&P 500 Russell 2000
13 2014/02/27 close 4318.93 1854.29 1187.94
16 2014/02/28 close 4308.12 1859.45 1183.03

You can also create Boolean conditions in which you use arrays to filter out parts of the data:


In [84]:
# Ww can also do this extensively
highSelection=sharesIndexDataDF['PriceType']=='high'

NasdaqHigh=sharesIndexDataDF['Nasdaq']<4300

sharesIndexDataDF.ix[highSelection & NasdaqHigh]


Out[84]:
TradingDate PriceType Nasdaq S&P 500 Russell 2000
2 2014/02/21 high 4284.85 1846.13 1168.43

The isin and anyall methods enable user to achieve more with Boolean indexing that the standart operators used in the preceding sections.

The isin method takes a list of values and returns a Boolean array with True at the positions within the Series or DataFrame that match the values in the list.


In [88]:
# Check values in Series
stockSeries=pd.Series(['NFLX','AMZN','GOOG','FB','TWTR'])
stockSeries.isin(['AMZN','FB'])


Out[88]:
0    False
1     True
2    False
3     True
4    False
dtype: bool

In [89]:
# We can use the sub selecting to selecting true values
stockSeries[stockSeries.isin(['AMZN','FB'])]


Out[89]:
1    AMZN
3      FB
dtype: object

In [92]:
# Dictionary to create a dataframe
australianMammals= {'kangaroo': {'Subclass':'marsupial','Species Origin':'native'},
                    'flying fox' : {'Subclass':'placental','Species Origin':'native'},
                    'black rat': {'Subclass':'placental','Species Origin':'invasive'},
                    'platypus' : {'Subclass':'monotreme','Species Origin':'native'},
                    'wallaby' :{'Subclass':'marsupial','Species Origin':'native'},
                    'palm squirrel' : {'Subclass':'placental','Origin':'invasive'},
                    'anteater': {'Subclass':'monotreme', 'Origin':'native'},
                    'koala': {'Subclass':'marsupial', 'Origin':'native'}}
ozzieMammalsDF = pd.DataFrame(australianMammals)
ozzieMammalsDF


Out[92]:
anteater black rat flying fox kangaroo koala palm squirrel platypus wallaby
Origin native NaN NaN NaN native invasive NaN NaN
Species Origin NaN invasive native native NaN NaN native native
Subclass monotreme placental placental marsupial marsupial placental monotreme marsupial

In [93]:
aussieMammalsDF=ozzieMammalsDF.T # Transposing the data frame
aussieMammalsDF


Out[93]:
Origin Species Origin Subclass
anteater native NaN monotreme
black rat NaN invasive placental
flying fox NaN native placental
kangaroo NaN native marsupial
koala native NaN marsupial
palm squirrel invasive NaN placental
platypus NaN native monotreme
wallaby NaN native marsupial

In [94]:
# Selecting native animals
aussieMammalsDF.isin({'Subclass':['marsupial'],'Origin':['native']})


Out[94]:
Origin Species Origin Subclass
anteater True False False
black rat False False False
flying fox False False False
kangaroo False False True
koala True False True
palm squirrel False False False
platypus False False False
wallaby False False True

where() method

The where() method is used to ensure that the result of Boolean filtering is the same shape as the original data.


In [103]:
import numpy as np
np.random.seed(100)  # Setting random generator to 100 so we can generate same results later
normvals = pd.Series([np.random.normal() for i in np.arange(10)])
normvals


Out[103]:
0   -1.749765
1    0.342680
2    1.153036
3   -0.252436
4    0.981321
5    0.514219
6    0.221180
7   -1.070043
8   -0.189496
9    0.255001
dtype: float64

In [104]:
# Return values bigger than 0
normvals[normvals>0]


Out[104]:
1    0.342680
2    1.153036
4    0.981321
5    0.514219
6    0.221180
9    0.255001
dtype: float64

In [105]:
# Return values bigger than 0, prints the same shape
# by putting NaN to other places
normvals.where(normvals>0)


Out[105]:
0         NaN
1    0.342680
2    1.153036
3         NaN
4    0.981321
5    0.514219
6    0.221180
7         NaN
8         NaN
9    0.255001
dtype: float64

In [106]:
# Creating DataFrame with set random values
np.random.seed(100)
normDF = pd.DataFrame([[round(np.random.normal(),3) for i in np.arange(5)] for j in range(3)],
                      columns=['0','30','60','90','120'])
normDF


Out[106]:
0 30 60 90 120
0 -1.750 0.343 1.153 -0.252 0.981
1 0.514 0.221 -1.070 -0.189 0.255
2 -0.458 0.435 -0.584 0.817 0.673

In [107]:
# For DataFrames we get same shape no matter we use 
normDF[normDF>0]


Out[107]:
0 30 60 90 120
0 NaN 0.343 1.153 NaN 0.981
1 0.514 0.221 NaN NaN 0.255
2 NaN 0.435 NaN 0.817 0.673

In [110]:
# For DataFrames we get same shape no matter we use 
normDF.where(normDF>0)


Out[110]:
0 30 60 90 120
0 NaN 0.343 1.153 NaN 0.981
1 0.514 0.221 NaN NaN 0.255
2 NaN 0.435 NaN 0.817 0.673

In [111]:
# The inverse operation of the where is mask
normDF.mask(normDF>0)


Out[111]:
0 30 60 90 120
0 -1.750 NaN NaN -0.252 NaN
1 NaN NaN -1.070 -0.189 NaN
2 -0.458 NaN -0.584 NaN NaN

Summary

To summarize, there are various ways of selecting data from pandas:

  • We can use basic indexing, which is closest to our understanding of accessing data in an array.
  • We can use label- or integer-based indexing with the associated operators.
  • We can use a MultiIndex, which is the pandas version of a composite key comprising multiple fields.
  • We can use a Boolean/logical index.

For further reading and research check out the offical documentation, indexing