Hello all, I am back with another of my notes on Pandas, Today I will focus on indexing and selection of data from pandas object. It's really important since effective use of pandas requires a good knowledge of the indexing and selection of data.

In the last post while introduding data structures, I talked about basic indexing, I will show here as well for the sake of completeness.

Basic Indexing



In [2]:

    
import pandas as pd



In [3]:

    
SpotCrudePrices_2013_Data= { 'U.K. Brent' : 
                            {'2013-Q1':112.9, '2013-Q2':103.0, 
                             '2013-Q3':110.1, '2013-Q4':109.4},
                            'Dubai': 
                            {'2013-Q1':108.1, '2013-Q2':100.8,
                             '2013-Q3':106.1,'2013-Q4':106.7},
                             'West Texas Intermediate':
                            {'2013-Q1':94.4, '2013-Q2':94.2,
                             '2013-Q3':105.8,'2013-Q4':97.4}}
SpotCrudePrices_2013=pd.DataFrame.from_dict(SpotCrudePrices_2013_Data)
SpotCrudePrices_2013









    Out[3]:






  
    
      
      Dubai
      U.K. Brent
      West Texas Intermediate
    
  
  
    
      2013-Q1
      108.1
      112.9
      94.4
    
    
      2013-Q2
      100.8
      103.0
      94.2
    
    
      2013-Q3
      106.1
      110.1
      105.8
    
    
      2013-Q4
      106.7
      109.4
      97.4

We can select the prices for the available time periods of Dubai crude oil by using the [] operator:



In [4]:

    
dubaiPrices=SpotCrudePrices_2013['Dubai']
dubaiPrices









    Out[4]:





2013-Q1    108.1
2013-Q2    100.8
2013-Q3    106.1
2013-Q4    106.7
Name: Dubai, dtype: float64

We can also pass a list of columns to the [] operator in order to select the columns in a particular order:



In [7]:

    
SpotCrudePrices_2013[['West Texas Intermediate','U.K. Brent']]









    Out[7]:






  
    
      
      West Texas Intermediate
      U.K. Brent
    
  
  
    
      2013-Q1
      94.4
      112.9
    
    
      2013-Q2
      94.2
      103.0
    
    
      2013-Q3
      105.8
      110.1
    
    
      2013-Q4
      97.4
      109.4

Rows cannot be selected with the bracket operator [] in a DataFrame.

One can retrieve values from a Series, DataFrame, or Panel directly as an attribute using dot operator



In [9]:

    
SpotCrudePrices_2013.Dubai









    Out[9]:





2013-Q1    108.1
2013-Q2    100.8
2013-Q3    106.1
2013-Q4    106.7
Name: Dubai, dtype: float64

However, this only works if the index element is a valid Python identifier, Dubai in this case is valid but U.K. Brent is not.

We can change the names to valid identifiers:



In [10]:

    
SpotCrudePrices_2013.columns=['Dubai','UK_Brent','West_Texas_Intermediate']
SpotCrudePrices_2013









    Out[10]:






  
    
      
      Dubai
      UK_Brent
      West_Texas_Intermediate
    
  
  
    
      2013-Q1
      108.1
      112.9
      94.4
    
    
      2013-Q2
      100.8
      103.0
      94.2
    
    
      2013-Q3
      106.1
      110.1
      105.8
    
    
      2013-Q4
      106.7
      109.4
      97.4



In [14]:

    
SpotCrudePrices_2013.UK_Brent









    Out[14]:





2013-Q1    112.9
2013-Q2    103.0
2013-Q3    110.1
2013-Q4    109.4
Name: UK_Brent, dtype: float64

We can also select prices by specifying a column index number to select column 1 (U.K. Brent)



In [12]:

    
SpotCrudePrices_2013[[1]]

We can slice a range by using the [] operator. The syntax of the slicing operator exactly matches that of NumPy's:

ar[startIndex: endIndex: stepValue]

For a DataFrame, [] slices across rows, Obrain all rows starting from index 2:



In [15]:

    
SpotCrudePrices_2013[2:]









    Out[15]:






  
    
      
      Dubai
      UK_Brent
      West_Texas_Intermediate
    
  
  
    
      2013-Q3
      106.1
      110.1
      105.8
    
    
      2013-Q4
      106.7
      109.4
      97.4



In [20]:

    
# Reverse the order of rows in DataFrame
SpotCrudePrices_2013[::-1]









    Out[20]:






  
    
      
      Dubai
      UK_Brent
      West_Texas_Intermediate
    
  
  
    
      2013-Q4
      106.7
      109.4
      97.4
    
    
      2013-Q3
      106.1
      110.1
      105.8
    
    
      2013-Q2
      100.8
      103.0
      94.2
    
    
      2013-Q1
      108.1
      112.9
      94.4



In [17]:

    
# Selecting Dubai's data as Pandas Series
dubaiPrices = SpotCrudePrices_2013['Dubai']
# Obtain the last 3 rows or all rows but the first:
dubaiPrices[1:]









    Out[17]:





2013-Q2    100.8
2013-Q3    106.1
2013-Q4    106.7
Name: Dubai, dtype: float64



In [21]:

    
# Obtain all rows but the last
dubaiPrices[:-1]









    Out[21]:





2013-Q1    108.1
2013-Q2    100.8
2013-Q3    106.1
Name: Dubai, dtype: float64



In [22]:

    
# Reverse the rows
dubaiPrices[::-1]









    Out[22]:





2013-Q4    106.7
2013-Q3    106.1
2013-Q2    100.8
2013-Q1    108.1
Name: Dubai, dtype: float64

Label, Integer, and Mixed Indexing

In addition to the standard indexing operator [] and attribute operator, there are operators provided in pandas to make the job of indexing easier and more convenient.

By label indexing, we generally mean indexing by a header name, which tends to be a string value in most cases. These operators are as follows:

The .loc operator: It allows label-oriented indexing
The .iloc operator: It allows integer-based indexing
The .ix operator: It allows mixed label and integer-based indexing

Label-Oriented Indexing

The .loc operator supports pure label-based indexing.



In [24]:

    
NYC_SnowAvgsData={'Months' : ['January','February','March','April', 'November', 'December'],
                  'Avg SnowDays' : [4.0,2.7,1.7,0.2,0.2,2.3],
                  'Avg Precip. (cm)' : [17.8,22.4,9.1,1.5,0.8,12.2],
                  'Avg Low Temp. (F)' : [27,29,35,45,42,32] }
NYC_SnowAvgs = pd.DataFrame(NYC_SnowAvgsData, 
                            index=NYC_SnowAvgsData['Months'],
                            columns=['Avg SnowDays','Avg Precip. (cm)','Avg Low Temp. (F)'])
NYC_SnowAvgs









    Out[24]:






  
    
      
      Avg SnowDays
      Avg Precip. (cm)
      Avg Low Temp. (F)
    
  
  
    
      January
      4.0
      17.8
      27
    
    
      February
      2.7
      22.4
      29
    
    
      March
      1.7
      9.1
      35
    
    
      April
      0.2
      1.5
      45
    
    
      November
      0.2
      0.8
      42
    
    
      December
      2.3
      12.2
      32



In [25]:

    
# Using a single label:
NYC_SnowAvgs.loc['January']









    Out[25]:





Avg SnowDays          4.0
Avg Precip. (cm)     17.8
Avg Low Temp. (F)    27.0
Name: January, dtype: float64



In [26]:

    
# Using a list of labels
NYC_SnowAvgs.loc[['January', 'April']]









    Out[26]:






  
    
      
      Avg SnowDays
      Avg Precip. (cm)
      Avg Low Temp. (F)
    
  
  
    
      January
      4.0
      17.8
      27
    
    
      April
      0.2
      1.5
      45



In [28]:

    
# Using a Label range:
NYC_SnowAvgs.loc['January' : 'March']









    Out[28]:






  
    
      
      Avg SnowDays
      Avg Precip. (cm)
      Avg Low Temp. (F)
    
  
  
    
      January
      4.0
      17.8
      27
    
    
      February
      2.7
      22.4
      29
    
    
      March
      1.7
      9.1
      35

Note that while using the .loc , .iloc , and .ix operators on a DataFrame, the row index must always be specified first. This is the opposite of the [] operator, where only columns can be selected directly.



In [29]:

    
NYC_SnowAvgs.loc[:,'Avg SnowDays']









    Out[29]:





January     4.0
February    2.7
March       1.7
April       0.2
November    0.2
December    2.3
Name: Avg SnowDays, dtype: float64



In [31]:

    
# to select a specific coordinate value
NYC_SnowAvgs.loc['March','Avg SnowDays']









    Out[31]:





1.7



In [32]:

    
# Alternative Style
NYC_SnowAvgs.loc['March']['Avg SnowDays']









    Out[32]:





1.7



In [33]:

    
# Without using loc function, square bracket as follows
NYC_SnowAvgs['Avg SnowDays']['March']









    Out[33]:





1.7

We can use the .loc operator to select the rows instead:



In [34]:

    
NYC_SnowAvgs.loc['March']









    Out[34]:





Avg SnowDays          1.7
Avg Precip. (cm)      9.1
Avg Low Temp. (F)    35.0
Name: March, dtype: float64

We can use selection with boolean statements, while we are selecting in Pandas.



In [35]:

    
# Selecting months have less than one snow day average
NYC_SnowAvgs.loc[NYC_SnowAvgs['Avg SnowDays']<1,:]









    Out[35]:






  
    
      
      Avg SnowDays
      Avg Precip. (cm)
      Avg Low Temp. (F)
    
  
  
    
      April
      0.2
      1.5
      45
    
    
      November
      0.2
      0.8
      42



In [36]:

    
# brand of crude priced above 110 a barrel for row 2013-Q1
SpotCrudePrices_2013.loc[:,SpotCrudePrices_2013.loc['2013-Q1']>110]
# Using 2 .loc for more precise selection, how cool is that

Note that the preceding arguments involve the Boolean operators < and > that actually evaluate the Boolean arrays, for example:



In [37]:

    
SpotCrudePrices_2013.loc['2013-Q1']>110









    Out[37]:





Dubai                      False
UK_Brent                    True
West_Texas_Intermediate    False
Name: 2013-Q1, dtype: bool

Integer-Oriented Indexing

The .iloc operator supports integer-based positional indexing. It accepts the following as inputs:

A single integer, for example, 7
A list or array of integers, for example, [2,3]
A slice object with integers, for example, 1:4



In [39]:

    
import scipy.constants as phys
import math



In [41]:

    
sci_values=pd.DataFrame([[math.pi, math.sin(math.pi),math.cos(math.pi)],
                         [math.e,math.log(math.e), phys.golden],
                         [phys.c,phys.g,phys.e],
                         [phys.m_e,phys.m_p,phys.m_n]],
                        index=list(range(0,20,5)))
sci_values









    Out[41]:






  
    
      
      0
      1
      2
    
  
  
    
      0
      3.141593e+00
      1.224647e-16
      -1.000000e+00
    
    
      5
      2.718282e+00
      1.000000e+00
      1.618034e+00
    
    
      10
      2.997925e+08
      9.806650e+00
      1.602177e-19
    
    
      15
      9.109384e-31
      1.672622e-27
      1.674927e-27



In [42]:

    
# Select first two rows by using integer slicing
sci_values.iloc[:2]









    Out[42]:






  
    
      
      0
      1
      2
    
  
  
    
      0
      3.141593
      1.224647e-16
      -1.000000
    
    
      5
      2.718282
      1.000000e+00
      1.618034



In [43]:

    
sci_values.iloc[2,0:2]









    Out[43]:





0    2.997925e+08
1    9.806650e+00
Name: 10, dtype: float64

Note that the arguments to .iloc are strictly positional and have nothing to do with the index values.

we should use the label-indexing operator .loc instead...



In [44]:

    
sci_values.iloc[10]









    



---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-44-ddf42a6ff8e1> in <module>()
----> 1 sci_values.iloc[10]

/home/eneskemal/anaconda2/lib/python2.7/site-packages/pandas/core/indexing.pyc in __getitem__(self, key)
   1294             return self._getitem_tuple(key)
   1295         else:
-> 1296             return self._getitem_axis(key, axis=0)
   1297 
   1298     def _getitem_axis(self, key, axis=0):

/home/eneskemal/anaconda2/lib/python2.7/site-packages/pandas/core/indexing.pyc in _getitem_axis(self, key, axis)
   1610 
   1611                 # validate the location
-> 1612                 self._is_valid_integer(key, axis)
   1613 
   1614             return self._get_loc(key, axis=axis)

/home/eneskemal/anaconda2/lib/python2.7/site-packages/pandas/core/indexing.pyc in _is_valid_integer(self, key, axis)
   1524         l = len(ax)
   1525         if key >= l or key < -l:
-> 1526             raise IndexError("single positional indexer is out-of-bounds")
   1527         return True
   1528 

IndexError: single positional indexer is out-of-bounds



In [45]:

    
sci_values.loc[10]









    Out[45]:





0    2.997925e+08
1    9.806650e+00
2    1.602177e-19
Name: 10, dtype: float64



In [46]:

    
# To Slice out a specific row
sci_values.iloc[2:3,:]









    Out[46]:






  
    
      
      0
      1
      2
    
  
  
    
      10
      299792458.0
      9.80665
      1.602177e-19



In [47]:

    
# TO obtain a cross-section using an integer position
sci_values.iloc[3]









    Out[47]:





0    9.109384e-31
1    1.672622e-27
2    1.674927e-27
Name: 15, dtype: float64

The .iat and .at operators can be used for a quick selection of scalar values. They are faster than them but not really common



In [48]:

    
sci_values.iloc[3,0]









    Out[48]:





9.1093835599999998e-31



In [49]:

    
sci_values.iat[3,0]









    Out[49]:





9.1093835599999998e-31



In [51]:

    
%timeit sci_values.iloc[3,0]









    



The slowest run took 6.17 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 84 µs per loop



In [52]:

    
%timeit sci_values.iat[3,0]









    



The slowest run took 20.61 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 6.06 µs per loop

Mixed Indexing with `.ix` operator

The .ix operator behaves like a mixture of the .loc and .iloc operators, with the .loc behavior taking precedence. It takes the following as possible inputs:

A single label or integer
A list of integeres or labels
An integer slice or label slice
A Boolean array

In the following examples I will use this data set imported from csv

  TradingDate,Nasdaq,S&P 500,Russell 2000
  2014/01/30,4123.13,1794.19,1139.36
  2014/01/31,4103.88,1782.59,1130.88
  2014/02/03,3996.96,1741.89,1094.58
  2014/02/04,4031.52,1755.2,1102.84
  2014/02/05,4011.55,1751.64,1093.59
  2014/02/06,4057.12,1773.43,1103.93



In [55]:

    
stockIndexDataDF=pd.read_csv('stock_index_closing.csv')
stockIndexDataDF









    Out[55]:






  
    
      
      TradingDate
      Nasdaq
      S&P 500
      Russell 2000
    
  
  
    
      0
      2014/01/30
      4123.13
      1794.19
      1139.36
    
    
      1
      2014/01/31
      4103.88
      1782.59
      1130.88
    
    
      2
      2014/02/03
      3996.96
      1741.89
      1094.58
    
    
      3
      2014/02/04
      4031.52
      1755.20
      1102.84
    
    
      4
      2014/02/05
      4011.55
      1751.64
      1093.59
    
    
      5
      2014/02/06
      4057.12
      1773.43
      1103.93

What we see from the preceding example is that the DataFrame created has an integer-based row index. We promptly set the index to be the trading date to index it based on the trading date so that we can use the .ix operator:



In [56]:

    
stockIndexDF=stockIndexDataDF.set_index('TradingDate')
stockIndexDF









    Out[56]:






  
    
      
      Nasdaq
      S&P 500
      Russell 2000
    
    
      TradingDate
      
      
      
    
  
  
    
      2014/01/30
      4123.13
      1794.19
      1139.36
    
    
      2014/01/31
      4103.88
      1782.59
      1130.88
    
    
      2014/02/03
      3996.96
      1741.89
      1094.58
    
    
      2014/02/04
      4031.52
      1755.20
      1102.84
    
    
      2014/02/05
      4011.55
      1751.64
      1093.59
    
    
      2014/02/06
      4057.12
      1773.43
      1103.93



In [57]:

    
# Using a single label
stockIndexDF.ix['2014/01/30']









    Out[57]:





Nasdaq          4123.13
S&P 500         1794.19
Russell 2000    1139.36
Name: 2014/01/30, dtype: float64



In [59]:

    
# Using a list of labels
stockIndexDF.ix[['2014/01/30', '2014/02/06']]









    Out[59]:






  
    
      
      Nasdaq
      S&P 500
      Russell 2000
    
    
      TradingDate
      
      
      
    
  
  
    
      2014/01/30
      4123.13
      1794.19
      1139.36
    
    
      2014/02/06
      4057.12
      1773.43
      1103.93



In [60]:

    
type(stockIndexDF.ix['2014/01/30'])









    Out[60]:





pandas.core.series.Series



In [61]:

    
type(stockIndexDF.ix[['2014/01/30']])









    Out[61]:





pandas.core.frame.DataFrame

For the former, the indexer is a scalar; for the latter, the indexer is a list. A list indexer is used to select multiple columns. A multi-column slice of a DataFrame can only result in another DataFrame since it is 2D; hence, what is returned in the latter case is a DataFrame.



In [62]:

    
# Using a label-based slice:
tradingDates=stockIndexDataDF.TradingDate
stockIndexDF.ix[tradingDates[:3]]









    Out[62]:






  
    
      
      Nasdaq
      S&P 500
      Russell 2000
    
    
      TradingDate
      
      
      
    
  
  
    
      2014/01/30
      4123.13
      1794.19
      1139.36
    
    
      2014/01/31
      4103.88
      1782.59
      1130.88
    
    
      2014/02/03
      3996.96
      1741.89
      1094.58



In [63]:

    
# Using a single integer:
stockIndexDF.ix[0]









    Out[63]:





Nasdaq          4123.13
S&P 500         1794.19
Russell 2000    1139.36
Name: 2014/01/30, dtype: float64



In [64]:

    
# Using a list of integers:
stockIndexDF.ix[[0,2]]









    Out[64]:






  
    
      
      Nasdaq
      S&P 500
      Russell 2000
    
    
      TradingDate
      
      
      
    
  
  
    
      2014/01/30
      4123.13
      1794.19
      1139.36
    
    
      2014/02/03
      3996.96
      1741.89
      1094.58



In [65]:

    
# Using an integer slice:
stockIndexDF.ix[1:3]









    Out[65]:






  
    
      
      Nasdaq
      S&P 500
      Russell 2000
    
    
      TradingDate
      
      
      
    
  
  
    
      2014/01/31
      4103.88
      1782.59
      1130.88
    
    
      2014/02/03
      3996.96
      1741.89
      1094.58



In [66]:

    
# Using an boolean array
stockIndexDF.ix[stockIndexDF['Russell 2000']>1100]









    Out[66]:






  
    
      
      Nasdaq
      S&P 500
      Russell 2000
    
    
      TradingDate
      
      
      
    
  
  
    
      2014/01/30
      4123.13
      1794.19
      1139.36
    
    
      2014/01/31
      4103.88
      1782.59
      1130.88
    
    
      2014/02/04
      4031.52
      1755.20
      1102.84
    
    
      2014/02/06
      4057.12
      1773.43
      1103.93

As in the case of .loc , the row index must be specified first for the .ix operator.

We now turn to the topic of MultiIndexing. Multi-level or hierarchical indexing is useful because it enables the pandas user to select and massage data in multiple dimensions by using data structures such as Series and DataFrame.



In [67]:

    
sharesIndexDataDF=pd.read_csv('stock_index_closing.csv')
sharesIndexDataDF









    Out[67]:






  
    
      
      TradingDate
      PriceType
      Nasdaq
      S&P 500
      Russell 2000
    
  
  
    
      0
      2014/02/21
      open
      4282.17
      1841.07
      1166.25
    
    
      1
      2014/02/21
      close
      4263.41
      1836.25
      1164.63
    
    
      2
      2014/02/21
      high
      4284.85
      1846.13
      1168.43
    
    
      3
      2014/02/24
      open
      4273.32
      1836.78
      1166.74
    
    
      4
      2014/02/24
      close
      4292.97
      1847.61
      1174.55
    
    
      5
      2014/02/24
      high
      4311.13
      1858.71
      1180.29
    
    
      6
      2014/02/25
      open
      4298.48
      1847.66
      1176.00
    
    
      7
      2014/02/25
      close
      4287.59
      1845.12
      1173.95
    
    
      8
      2014/02/25
      high
      4307.51
      1852.91
      1179.43
    
    
      9
      2014/02/26
      open
      4300.45
      1845.79
      1176.11
    
    
      10
      2014/02/26
      close
      4292.06
      1845.16
      1181.72
    
    
      11
      2014/02/26
      high
      4316.82
      1852.65
      1188.06
    
    
      12
      2014/02/27
      open
      4291.47
      1844.90
      1179.28
    
    
      13
      2014/02/27
      close
      4318.93
      1854.29
      1187.94
    
    
      14
      2014/02/27
      high
      4322.46
      1854.53
      1187.94
    
    
      15
      2014/02/28
      open
      4323.52
      1855.12
      1189.19
    
    
      16
      2014/02/28
      close
      4308.12
      1859.45
      1183.03
    
    
      17
      2014/02/28
      high
      4342.59
      1867.92
      1193.50



In [70]:

    
# Create a MultiIndex from trading date and priceType columns
sharesIndexDF=sharesIndexDataDF.set_index(['TradingDate','PriceType'])
mIndex = sharesIndexDF.index
mIndex









    Out[70]:





MultiIndex(levels=[[u'2014/02/21', u'2014/02/24', u'2014/02/25', u'2014/02/26', u'2014/02/27', u'2014/02/28'], [u'close', u'high', u'open']],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5], [2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1]],
           names=[u'TradingDate', u'PriceType'])



In [71]:

    
sharesIndexDF









    Out[71]:






  
    
      
      
      Nasdaq
      S&P 500
      Russell 2000
    
    
      TradingDate
      PriceType
      
      
      
    
  
  
    
      2014/02/21
      open
      4282.17
      1841.07
      1166.25
    
    
      close
      4263.41
      1836.25
      1164.63
    
    
      high
      4284.85
      1846.13
      1168.43
    
    
      2014/02/24
      open
      4273.32
      1836.78
      1166.74
    
    
      close
      4292.97
      1847.61
      1174.55
    
    
      high
      4311.13
      1858.71
      1180.29
    
    
      2014/02/25
      open
      4298.48
      1847.66
      1176.00
    
    
      close
      4287.59
      1845.12
      1173.95
    
    
      high
      4307.51
      1852.91
      1179.43
    
    
      2014/02/26
      open
      4300.45
      1845.79
      1176.11
    
    
      close
      4292.06
      1845.16
      1181.72
    
    
      high
      4316.82
      1852.65
      1188.06
    
    
      2014/02/27
      open
      4291.47
      1844.90
      1179.28
    
    
      close
      4318.93
      1854.29
      1187.94
    
    
      high
      4322.46
      1854.53
      1187.94
    
    
      2014/02/28
      open
      4323.52
      1855.12
      1189.19
    
    
      close
      4308.12
      1859.45
      1183.03
    
    
      high
      4342.59
      1867.92
      1193.50

Upon inspection, we see that the MultiIndex consists of a list of tuples. Applying the get_level_values function with the appropriate argument produces a list of the labels for each level of the index:



In [72]:

    
mIndex.get_level_values(0)









    Out[72]:





Index([u'2014/02/21', u'2014/02/21', u'2014/02/21', u'2014/02/24',
       u'2014/02/24', u'2014/02/24', u'2014/02/25', u'2014/02/25',
       u'2014/02/25', u'2014/02/26', u'2014/02/26', u'2014/02/26',
       u'2014/02/27', u'2014/02/27', u'2014/02/27', u'2014/02/28',
       u'2014/02/28', u'2014/02/28'],
      dtype='object', name=u'TradingDate')



In [73]:

    
mIndex.get_level_values(1)









    Out[73]:





Index([u'open', u'close', u'high', u'open', u'close', u'high', u'open',
       u'close', u'high', u'open', u'close', u'high', u'open', u'close',
       u'high', u'open', u'close', u'high'],
      dtype='object', name=u'PriceType')

You can achieve hierarchical indexing with a MultiIndexed DataFrame:



In [75]:

    
# Getting All Price Type of date
sharesIndexDF.ix['2014/02/21']









    Out[75]:






  
    
      
      Nasdaq
      S&P 500
      Russell 2000
    
    
      PriceType
      
      
      
    
  
  
    
      open
      4282.17
      1841.07
      1166.25
    
    
      close
      4263.41
      1836.25
      1164.63
    
    
      high
      4284.85
      1846.13
      1168.43



In [76]:

    
# Getting specific PriceType  of date
sharesIndexDF.ix['2014/02/21','open']









    Out[76]:





Nasdaq          4282.17
S&P 500         1841.07
Russell 2000    1166.25
Name: (2014/02/21, open), dtype: float64



In [77]:

    
# We can slice on first level 
sharesIndexDF.ix['2014/02/21':'2014/02/24']









    Out[77]:






  
    
      
      
      Nasdaq
      S&P 500
      Russell 2000
    
    
      TradingDate
      PriceType
      
      
      
    
  
  
    
      2014/02/21
      open
      4282.17
      1841.07
      1166.25
    
    
      close
      4263.41
      1836.25
      1164.63
    
    
      high
      4284.85
      1846.13
      1168.43
    
    
      2014/02/24
      open
      4273.32
      1836.78
      1166.74
    
    
      close
      4292.97
      1847.61
      1174.55
    
    
      high
      4311.13
      1858.71
      1180.29



In [78]:

    
# But if we can slice at lower level:
sharesIndexDF.ix[('2014/02/21','open'):('2014/02/24','open')]









    



---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-78-2e26e327c710> in <module>()
      1 # But if we can slice at lower level:
----> 2 sharesIndexDF.ix[('2014/02/21','open'):('2014/02/24','open')]

/home/eneskemal/anaconda2/lib/python2.7/site-packages/pandas/core/indexing.pyc in __getitem__(self, key)
     76         else:
     77             key = com._apply_if_callable(key, self.obj)
---> 78             return self._getitem_axis(key, axis=0)
     79 
     80     def _get_label(self, label, axis=0):

/home/eneskemal/anaconda2/lib/python2.7/site-packages/pandas/core/indexing.pyc in _getitem_axis(self, key, axis)
    990         labels = self.obj._get_axis(axis)
    991         if isinstance(key, slice):
--> 992             return self._get_slice_axis(key, axis=axis)
    993         elif (is_list_like_indexer(key) and
    994               not (isinstance(key, tuple) and

/home/eneskemal/anaconda2/lib/python2.7/site-packages/pandas/core/indexing.pyc in _get_slice_axis(self, slice_obj, axis)
   1235         if not need_slice(slice_obj):
   1236             return obj
-> 1237         indexer = self._convert_slice_indexer(slice_obj, axis)
   1238 
   1239         if isinstance(indexer, slice):

/home/eneskemal/anaconda2/lib/python2.7/site-packages/pandas/core/indexing.pyc in _convert_slice_indexer(self, key, axis)
    190         # if we are accessing via lowered dim, use the last dim
    191         ax = self.obj._get_axis(min(axis, self.ndim - 1))
--> 192         return ax._convert_slice_indexer(key, kind=self.name)
    193 
    194     def _has_valid_setitem_indexer(self, indexer):

/home/eneskemal/anaconda2/lib/python2.7/site-packages/pandas/indexes/base.pyc in _convert_slice_indexer(self, key, kind)
   1077         else:
   1078             try:
-> 1079                 indexer = self.slice_indexer(start, stop, step, kind=kind)
   1080             except Exception:
   1081                 if is_index_slice:

/home/eneskemal/anaconda2/lib/python2.7/site-packages/pandas/indexes/base.pyc in slice_indexer(self, start, end, step, kind)
   2783         """
   2784         start_slice, end_slice = self.slice_locs(start, end, step=step,
-> 2785                                                  kind=kind)
   2786 
   2787         # return a slice

/home/eneskemal/anaconda2/lib/python2.7/site-packages/pandas/indexes/multi.pyc in slice_locs(self, start, end, step, kind)
   1482         # This function adds nothing to its parent implementation (the magic
   1483         # happens in get_slice_bound method), but it adds meaningful doc.
-> 1484         return super(MultiIndex, self).slice_locs(start, end, step, kind=kind)
   1485 
   1486     def _partial_tup_index(self, tup, side='left'):

/home/eneskemal/anaconda2/lib/python2.7/site-packages/pandas/indexes/base.pyc in slice_locs(self, start, end, step, kind)
   2962         start_slice = None
   2963         if start is not None:
-> 2964             start_slice = self.get_slice_bound(start, 'left', kind)
   2965         if start_slice is None:
   2966             start_slice = 0

/home/eneskemal/anaconda2/lib/python2.7/site-packages/pandas/indexes/multi.pyc in get_slice_bound(self, label, side, kind)
   1453         if not isinstance(label, tuple):
   1454             label = label,
-> 1455         return self._partial_tup_index(label, side=side)
   1456 
   1457     def slice_locs(self, start=None, end=None, step=None, kind=None):

/home/eneskemal/anaconda2/lib/python2.7/site-packages/pandas/indexes/multi.pyc in _partial_tup_index(self, tup, side)
   1488             raise KeyError('Key length (%d) was greater than MultiIndex'
   1489                            ' lexsort depth (%d)' %
-> 1490                            (len(tup), self.lexsort_depth))
   1491 
   1492         n = len(tup)

KeyError: 'Key length (2) was greater than MultiIndex lexsort depth (1)'

However, this results in KeyError with a rather strange error message. The key lesson to be learned here is that the current incarnation of MultiIndex requires the labels to be sorted for the lower-level slicing routines to work correctly.

In order to do this, you can utilize the sortlevel() method, which sorts the labels of an axis within a MultiIndex. To be on the safe side, sort first before slicing with a MultiIndex. Thus, we can do the following:



In [79]:

    
sharesIndexDF.sortlevel(0).ix[('2014/02/21','open'):('2014/02/24','open')]









    Out[79]:






  
    
      
      
      Nasdaq
      S&P 500
      Russell 2000
    
    
      TradingDate
      PriceType
      
      
      
    
  
  
    
      2014/02/21
      open
      4282.17
      1841.07
      1166.25
    
    
      2014/02/24
      close
      4292.97
      1847.61
      1174.55
    
    
      high
      4311.13
      1858.71
      1180.29
    
    
      open
      4273.32
      1836.78
      1166.74

The swaplevel function enables levels within the MultiIndex to be swapped:



In [81]:

    
# Swapping level 0 and 1 in x axis
swappedDF=sharesIndexDF[:7].swaplevel(0, 1, axis=0)
swappedDF









    Out[81]:






  
    
      
      
      Nasdaq
      S&P 500
      Russell 2000
    
    
      PriceType
      TradingDate
      
      
      
    
  
  
    
      open
      2014/02/21
      4282.17
      1841.07
      1166.25
    
    
      close
      2014/02/21
      4263.41
      1836.25
      1164.63
    
    
      high
      2014/02/21
      4284.85
      1846.13
      1168.43
    
    
      open
      2014/02/24
      4273.32
      1836.78
      1166.74
    
    
      close
      2014/02/24
      4292.97
      1847.61
      1174.55
    
    
      high
      2014/02/24
      4311.13
      1858.71
      1180.29
    
    
      open
      2014/02/25
      4298.48
      1847.66
      1176.00

The reorder_levels function is more general, allowing you to specify the order of the levels:



In [82]:

    
reorderedDF=sharesIndexDF[:7].reorder_levels(['PriceType','TradingDate'],axis=0)
reorderedDF









    Out[82]:






  
    
      
      
      Nasdaq
      S&P 500
      Russell 2000
    
    
      PriceType
      TradingDate
      
      
      
    
  
  
    
      open
      2014/02/21
      4282.17
      1841.07
      1166.25
    
    
      close
      2014/02/21
      4263.41
      1836.25
      1164.63
    
    
      high
      2014/02/21
      4284.85
      1846.13
      1168.43
    
    
      open
      2014/02/24
      4273.32
      1836.78
      1166.74
    
    
      close
      2014/02/24
      4292.97
      1847.61
      1174.55
    
    
      high
      2014/02/24
      4311.13
      1858.71
      1180.29
    
    
      open
      2014/02/25
      4298.48
      1847.66
      1176.00

Boolean Indexing

We use Boolean indexing to filter or select parts of the data.

OR operator is |
AND operator is &
NOT operator is ~

These operators must be grouped using parentheses when used together.



In [83]:

    
# Selecting price type close which are bigger than 4300 in Nasdaq
sharesIndexDataDF.ix[(sharesIndexDataDF['PriceType']=='close')&(sharesIndexDataDF['Nasdaq']>4300) ]









    Out[83]:






  
    
      
      TradingDate
      PriceType
      Nasdaq
      S&P 500
      Russell 2000
    
  
  
    
      13
      2014/02/27
      close
      4318.93
      1854.29
      1187.94
    
    
      16
      2014/02/28
      close
      4308.12
      1859.45
      1183.03

You can also create Boolean conditions in which you use arrays to filter out parts of the data:



In [84]:

    
# Ww can also do this extensively
highSelection=sharesIndexDataDF['PriceType']=='high'

NasdaqHigh=sharesIndexDataDF['Nasdaq']<4300

sharesIndexDataDF.ix[highSelection & NasdaqHigh]









    Out[84]:






  
    
      
      TradingDate
      PriceType
      Nasdaq
      S&P 500
      Russell 2000
    
  
  
    
      2
      2014/02/21
      high
      4284.85
      1846.13
      1168.43

The isin and anyall methods enable user to achieve more with Boolean indexing that the standart operators used in the preceding sections.

The isin method takes a list of values and returns a Boolean array with True at the positions within the Series or DataFrame that match the values in the list.



In [88]:

    
# Check values in Series
stockSeries=pd.Series(['NFLX','AMZN','GOOG','FB','TWTR'])
stockSeries.isin(['AMZN','FB'])









    Out[88]:





0    False
1     True
2    False
3     True
4    False
dtype: bool



In [89]:

    
# We can use the sub selecting to selecting true values
stockSeries[stockSeries.isin(['AMZN','FB'])]









    Out[89]:





1    AMZN
3      FB
dtype: object



In [92]:

    
# Dictionary to create a dataframe
australianMammals= {'kangaroo': {'Subclass':'marsupial','Species Origin':'native'},
                    'flying fox' : {'Subclass':'placental','Species Origin':'native'},
                    'black rat': {'Subclass':'placental','Species Origin':'invasive'},
                    'platypus' : {'Subclass':'monotreme','Species Origin':'native'},
                    'wallaby' :{'Subclass':'marsupial','Species Origin':'native'},
                    'palm squirrel' : {'Subclass':'placental','Origin':'invasive'},
                    'anteater': {'Subclass':'monotreme', 'Origin':'native'},
                    'koala': {'Subclass':'marsupial', 'Origin':'native'}}
ozzieMammalsDF = pd.DataFrame(australianMammals)
ozzieMammalsDF









    Out[92]:






  
    
      
      anteater
      black rat
      flying fox
      kangaroo
      koala
      palm squirrel
      platypus
      wallaby
    
  
  
    
      Origin
      native
      NaN
      NaN
      NaN
      native
      invasive
      NaN
      NaN
    
    
      Species Origin
      NaN
      invasive
      native
      native
      NaN
      NaN
      native
      native
    
    
      Subclass
      monotreme
      placental
      placental
      marsupial
      marsupial
      placental
      monotreme
      marsupial



In [93]:

    
aussieMammalsDF=ozzieMammalsDF.T # Transposing the data frame
aussieMammalsDF









    Out[93]:






  
    
      
      Origin
      Species Origin
      Subclass
    
  
  
    
      anteater
      native
      NaN
      monotreme
    
    
      black rat
      NaN
      invasive
      placental
    
    
      flying fox
      NaN
      native
      placental
    
    
      kangaroo
      NaN
      native
      marsupial
    
    
      koala
      native
      NaN
      marsupial
    
    
      palm squirrel
      invasive
      NaN
      placental
    
    
      platypus
      NaN
      native
      monotreme
    
    
      wallaby
      NaN
      native
      marsupial



In [94]:

    
# Selecting native animals
aussieMammalsDF.isin({'Subclass':['marsupial'],'Origin':['native']})









    Out[94]:






  
    
      
      Origin
      Species Origin
      Subclass
    
  
  
    
      anteater
      True
      False
      False
    
    
      black rat
      False
      False
      False
    
    
      flying fox
      False
      False
      False
    
    
      kangaroo
      False
      False
      True
    
    
      koala
      True
      False
      True
    
    
      palm squirrel
      False
      False
      False
    
    
      platypus
      False
      False
      False
    
    
      wallaby
      False
      False
      True

`where()` method

The where() method is used to ensure that the result of Boolean filtering is the same shape as the original data.



In [103]:

    
import numpy as np
np.random.seed(100)  # Setting random generator to 100 so we can generate same results later
normvals = pd.Series([np.random.normal() for i in np.arange(10)])
normvals









    Out[103]:





0   -1.749765
1    0.342680
2    1.153036
3   -0.252436
4    0.981321
5    0.514219
6    0.221180
7   -1.070043
8   -0.189496
9    0.255001
dtype: float64



In [104]:

    
# Return values bigger than 0
normvals[normvals>0]









    Out[104]:





1    0.342680
2    1.153036
4    0.981321
5    0.514219
6    0.221180
9    0.255001
dtype: float64



In [105]:

    
# Return values bigger than 0, prints the same shape
# by putting NaN to other places
normvals.where(normvals>0)









    Out[105]:





0         NaN
1    0.342680
2    1.153036
3         NaN
4    0.981321
5    0.514219
6    0.221180
7         NaN
8         NaN
9    0.255001
dtype: float64



In [106]:

    
# Creating DataFrame with set random values
np.random.seed(100)
normDF = pd.DataFrame([[round(np.random.normal(),3) for i in np.arange(5)] for j in range(3)],
                      columns=['0','30','60','90','120'])
normDF



In [107]:

    
# For DataFrames we get same shape no matter we use 
normDF[normDF>0]



In [110]:

    
# For DataFrames we get same shape no matter we use 
normDF.where(normDF>0)



In [111]:

    
# The inverse operation of the where is mask
normDF.mask(normDF>0)

Summary

To summarize, there are various ways of selecting data from pandas:

We can use basic indexing, which is closest to our understanding of accessing data in an array.
We can use label- or integer-based indexing with the associated operators.
We can use a MultiIndex, which is the pandas version of a composite key comprising multiple fields.
We can use a Boolean/logical index.

For further reading and research check out the offical documentation, indexing

	Dubai	U.K. Brent	West Texas Intermediate
2013-Q1	108.1	112.9	94.4
2013-Q2	100.8	103.0	94.2
2013-Q3	106.1	110.1	105.8
2013-Q4	106.7	109.4	97.4

	Avg SnowDays	Avg Precip. (cm)	Avg Low Temp. (F)
January	4.0	17.8	27
February	2.7	22.4	29
March	1.7	9.1	35
April	0.2	1.5	45
November	0.2	0.8	42
December	2.3	12.2	32

	0	1	2
0	3.141593e+00	1.224647e-16	-1.000000e+00
5	2.718282e+00	1.000000e+00	1.618034e+00
10	2.997925e+08	9.806650e+00	1.602177e-19
15	9.109384e-31	1.672622e-27	1.674927e-27

	TradingDate	Nasdaq	S&P 500	Russell 2000
0	2014/01/30	4123.13	1794.19	1139.36
1	2014/01/31	4103.88	1782.59	1130.88
2	2014/02/03	3996.96	1741.89	1094.58
3	2014/02/04	4031.52	1755.20	1102.84
4	2014/02/05	4011.55	1751.64	1093.59
5	2014/02/06	4057.12	1773.43	1103.93

	Nasdaq	S&P 500	Russell 2000
TradingDate
2014/01/30	4123.13	1794.19	1139.36
2014/01/31	4103.88	1782.59	1130.88
2014/02/03	3996.96	1741.89	1094.58
2014/02/04	4031.52	1755.20	1102.84
2014/02/05	4011.55	1751.64	1093.59
2014/02/06	4057.12	1773.43	1103.93

	TradingDate	PriceType	Nasdaq	S&P 500	Russell 2000
0	2014/02/21	open	4282.17	1841.07	1166.25
1	2014/02/21	close	4263.41	1836.25	1164.63
2	2014/02/21	high	4284.85	1846.13	1168.43
3	2014/02/24	open	4273.32	1836.78	1166.74
4	2014/02/24	close	4292.97	1847.61	1174.55
5	2014/02/24	high	4311.13	1858.71	1180.29
6	2014/02/25	open	4298.48	1847.66	1176.00
7	2014/02/25	close	4287.59	1845.12	1173.95
8	2014/02/25	high	4307.51	1852.91	1179.43
9	2014/02/26	open	4300.45	1845.79	1176.11
10	2014/02/26	close	4292.06	1845.16	1181.72
11	2014/02/26	high	4316.82	1852.65	1188.06
12	2014/02/27	open	4291.47	1844.90	1179.28
13	2014/02/27	close	4318.93	1854.29	1187.94
14	2014/02/27	high	4322.46	1854.53	1187.94
15	2014/02/28	open	4323.52	1855.12	1189.19
16	2014/02/28	close	4308.12	1859.45	1183.03
17	2014/02/28	high	4342.59	1867.92	1193.50

		Nasdaq	S&P 500	Russell 2000
TradingDate	PriceType
2014/02/21	open	4282.17	1841.07	1166.25
	close	4263.41	1836.25	1164.63
	high	4284.85	1846.13	1168.43
2014/02/24	open	4273.32	1836.78	1166.74
	close	4292.97	1847.61	1174.55
	high	4311.13	1858.71	1180.29
2014/02/25	open	4298.48	1847.66	1176.00
	close	4287.59	1845.12	1173.95
	high	4307.51	1852.91	1179.43
2014/02/26	open	4300.45	1845.79	1176.11
	close	4292.06	1845.16	1181.72
	high	4316.82	1852.65	1188.06
2014/02/27	open	4291.47	1844.90	1179.28
	close	4318.93	1854.29	1187.94
	high	4322.46	1854.53	1187.94
2014/02/28	open	4323.52	1855.12	1189.19
	close	4308.12	1859.45	1183.03
	high	4342.59	1867.92	1193.50

	anteater	black rat	flying fox	kangaroo	koala	palm squirrel	platypus	wallaby
Origin	native	NaN	NaN	NaN	native	invasive	NaN	NaN
Species Origin	NaN	invasive	native	native	NaN	NaN	native	native
Subclass	monotreme	placental	placental	marsupial	marsupial	placental	monotreme	marsupial

	Origin	Species Origin	Subclass
anteater	native	NaN	monotreme
black rat	NaN	invasive	placental
flying fox	NaN	native	placental
kangaroo	NaN	native	marsupial
koala	native	NaN	marsupial
palm squirrel	invasive	NaN	placental
platypus	NaN	native	monotreme
wallaby	NaN	native	marsupial

	Origin	Species Origin	Subclass
anteater	True	False	False
black rat	False	False	False
flying fox	False	False	False
kangaroo	False	False	True
koala	True	False	True
palm squirrel	False	False	False
platypus	False	False	False
wallaby	False	False	True

	0	30	60	90	120
0	-1.750	0.343	1.153	-0.252	0.981
1	0.514	0.221	-1.070	-0.189	0.255
2	-0.458	0.435	-0.584	0.817	0.673