Hello all, I am back with another of my notes on Pandas, Today I will focus on indexing and selection of data from pandas object. It's really important since effective use of pandas requires a good knowledge of the indexing and selection of data.
In the last post while introduding data structures, I talked about basic indexing, I will show here as well for the sake of completeness.
In [2]:
import pandas as pd
In [3]:
SpotCrudePrices_2013_Data= { 'U.K. Brent' :
{'2013-Q1':112.9, '2013-Q2':103.0,
'2013-Q3':110.1, '2013-Q4':109.4},
'Dubai':
{'2013-Q1':108.1, '2013-Q2':100.8,
'2013-Q3':106.1,'2013-Q4':106.7},
'West Texas Intermediate':
{'2013-Q1':94.4, '2013-Q2':94.2,
'2013-Q3':105.8,'2013-Q4':97.4}}
SpotCrudePrices_2013=pd.DataFrame.from_dict(SpotCrudePrices_2013_Data)
SpotCrudePrices_2013
Out[3]:
We can select the prices for the available time periods of Dubai crude oil by using the [] operator:
In [4]:
dubaiPrices=SpotCrudePrices_2013['Dubai']
dubaiPrices
Out[4]:
We can also pass a list of columns to the []
operator in order to select the columns in a particular order:
In [7]:
SpotCrudePrices_2013[['West Texas Intermediate','U.K. Brent']]
Out[7]:
Rows cannot be selected with the bracket operator
[]
in a DataFrame.
One can retrieve values from a Series, DataFrame, or Panel directly as an attribute using dot operator
In [9]:
SpotCrudePrices_2013.Dubai
Out[9]:
However, this only works if the index element is a valid Python identifier, Dubai
in this case is valid but U.K. Brent
is not.
We can change the names to valid identifiers:
In [10]:
SpotCrudePrices_2013.columns=['Dubai','UK_Brent','West_Texas_Intermediate']
SpotCrudePrices_2013
Out[10]:
In [14]:
SpotCrudePrices_2013.UK_Brent
Out[14]:
We can also select prices by specifying a column index number to select column 1 (U.K. Brent)
In [12]:
SpotCrudePrices_2013[[1]]
Out[12]:
We can slice a range by using the [] operator. The syntax of the slicing operator exactly matches that of NumPy
's:
ar[startIndex: endIndex: stepValue]
For a DataFrame, [] slices across rows, Obrain all rows starting from index 2:
In [15]:
SpotCrudePrices_2013[2:]
Out[15]:
In [20]:
# Reverse the order of rows in DataFrame
SpotCrudePrices_2013[::-1]
Out[20]:
In [17]:
# Selecting Dubai's data as Pandas Series
dubaiPrices = SpotCrudePrices_2013['Dubai']
# Obtain the last 3 rows or all rows but the first:
dubaiPrices[1:]
Out[17]:
In [21]:
# Obtain all rows but the last
dubaiPrices[:-1]
Out[21]:
In [22]:
# Reverse the rows
dubaiPrices[::-1]
Out[22]:
In addition to the standard indexing operator [] and attribute operator, there are operators provided in pandas to make the job of indexing easier and more convenient.
By label indexing, we generally mean indexing by a header name, which tends to be a string value in most cases. These operators are as follows:
.loc
operator: It allows label-oriented indexing.iloc
operator: It allows integer-based indexing.ix
operator: It allows mixed label and integer-based indexingThe .loc
operator supports pure label-based indexing.
In [24]:
NYC_SnowAvgsData={'Months' : ['January','February','March','April', 'November', 'December'],
'Avg SnowDays' : [4.0,2.7,1.7,0.2,0.2,2.3],
'Avg Precip. (cm)' : [17.8,22.4,9.1,1.5,0.8,12.2],
'Avg Low Temp. (F)' : [27,29,35,45,42,32] }
NYC_SnowAvgs = pd.DataFrame(NYC_SnowAvgsData,
index=NYC_SnowAvgsData['Months'],
columns=['Avg SnowDays','Avg Precip. (cm)','Avg Low Temp. (F)'])
NYC_SnowAvgs
Out[24]:
In [25]:
# Using a single label:
NYC_SnowAvgs.loc['January']
Out[25]:
In [26]:
# Using a list of labels
NYC_SnowAvgs.loc[['January', 'April']]
Out[26]:
In [28]:
# Using a Label range:
NYC_SnowAvgs.loc['January' : 'March']
Out[28]:
Note that while using the .loc , .iloc , and .ix operators on a DataFrame, the row index must always be specified first. This is the opposite of the [] operator, where only columns can be selected directly.
In [29]:
NYC_SnowAvgs.loc[:,'Avg SnowDays']
Out[29]:
In [31]:
# to select a specific coordinate value
NYC_SnowAvgs.loc['March','Avg SnowDays']
Out[31]:
In [32]:
# Alternative Style
NYC_SnowAvgs.loc['March']['Avg SnowDays']
Out[32]:
In [33]:
# Without using loc function, square bracket as follows
NYC_SnowAvgs['Avg SnowDays']['March']
Out[33]:
We can use the .loc
operator to select the rows instead:
In [34]:
NYC_SnowAvgs.loc['March']
Out[34]:
We can use selection with boolean statements, while we are selecting in Pandas.
In [35]:
# Selecting months have less than one snow day average
NYC_SnowAvgs.loc[NYC_SnowAvgs['Avg SnowDays']<1,:]
Out[35]:
In [36]:
# brand of crude priced above 110 a barrel for row 2013-Q1
SpotCrudePrices_2013.loc[:,SpotCrudePrices_2013.loc['2013-Q1']>110]
# Using 2 .loc for more precise selection, how cool is that
Out[36]:
Note that the preceding arguments involve the Boolean operators <
and >
that actually evaluate the Boolean arrays, for example:
In [37]:
SpotCrudePrices_2013.loc['2013-Q1']>110
Out[37]:
In [39]:
import scipy.constants as phys
import math
In [41]:
sci_values=pd.DataFrame([[math.pi, math.sin(math.pi),math.cos(math.pi)],
[math.e,math.log(math.e), phys.golden],
[phys.c,phys.g,phys.e],
[phys.m_e,phys.m_p,phys.m_n]],
index=list(range(0,20,5)))
sci_values
Out[41]:
In [42]:
# Select first two rows by using integer slicing
sci_values.iloc[:2]
Out[42]:
In [43]:
sci_values.iloc[2,0:2]
Out[43]:
Note that the arguments to .iloc
are strictly positional and have nothing to do with the index values.
we should use the label-indexing operator .loc
instead...
In [44]:
sci_values.iloc[10]
In [45]:
sci_values.loc[10]
Out[45]:
In [46]:
# To Slice out a specific row
sci_values.iloc[2:3,:]
Out[46]:
In [47]:
# TO obtain a cross-section using an integer position
sci_values.iloc[3]
Out[47]:
The
.iat
and.at
operators can be used for a quick selection of scalar values. They are faster than them but not really common
In [48]:
sci_values.iloc[3,0]
Out[48]:
In [49]:
sci_values.iat[3,0]
Out[49]:
In [51]:
%timeit sci_values.iloc[3,0]
In [52]:
%timeit sci_values.iat[3,0]
.ix
operatorThe .ix
operator behaves like a mixture of the .loc
and .iloc
operators, with the .loc
behavior taking precedence. It takes the following as possible inputs:
In the following examples I will use this data set imported from csv
TradingDate,Nasdaq,S&P 500,Russell 2000
2014/01/30,4123.13,1794.19,1139.36
2014/01/31,4103.88,1782.59,1130.88
2014/02/03,3996.96,1741.89,1094.58
2014/02/04,4031.52,1755.2,1102.84
2014/02/05,4011.55,1751.64,1093.59
2014/02/06,4057.12,1773.43,1103.93
In [55]:
stockIndexDataDF=pd.read_csv('stock_index_closing.csv')
stockIndexDataDF
Out[55]:
What we see from the preceding example is that the DataFrame created has an integer-based row index. We promptly set the index to be the trading date to index it based on the trading date so that we can use the .ix operator:
In [56]:
stockIndexDF=stockIndexDataDF.set_index('TradingDate')
stockIndexDF
Out[56]:
In [57]:
# Using a single label
stockIndexDF.ix['2014/01/30']
Out[57]:
In [59]:
# Using a list of labels
stockIndexDF.ix[['2014/01/30', '2014/02/06']]
Out[59]:
In [60]:
type(stockIndexDF.ix['2014/01/30'])
Out[60]:
In [61]:
type(stockIndexDF.ix[['2014/01/30']])
Out[61]:
For the former, the indexer is a scalar; for the latter, the indexer is a list. A list indexer is used to select multiple columns. A multi-column slice of a DataFrame can only result in another DataFrame since it is 2D; hence, what is returned in the latter case is a DataFrame.
In [62]:
# Using a label-based slice:
tradingDates=stockIndexDataDF.TradingDate
stockIndexDF.ix[tradingDates[:3]]
Out[62]:
In [63]:
# Using a single integer:
stockIndexDF.ix[0]
Out[63]:
In [64]:
# Using a list of integers:
stockIndexDF.ix[[0,2]]
Out[64]:
In [65]:
# Using an integer slice:
stockIndexDF.ix[1:3]
Out[65]:
In [66]:
# Using an boolean array
stockIndexDF.ix[stockIndexDF['Russell 2000']>1100]
Out[66]:
As in the case of .loc
, the row index must be specified first for the .ix
operator.
We now turn to the topic of MultiIndexing. Multi-level or hierarchical indexing is useful because it enables the pandas user to select and massage data in multiple dimensions by using data structures such as Series and DataFrame.
In [67]:
sharesIndexDataDF=pd.read_csv('stock_index_closing.csv')
sharesIndexDataDF
Out[67]:
In [70]:
# Create a MultiIndex from trading date and priceType columns
sharesIndexDF=sharesIndexDataDF.set_index(['TradingDate','PriceType'])
mIndex = sharesIndexDF.index
mIndex
Out[70]:
In [71]:
sharesIndexDF
Out[71]:
Upon inspection, we see that the MultiIndex consists of a list of tuples. Applying the get_level_values function with the appropriate argument produces a list of the labels for each level of the index:
In [72]:
mIndex.get_level_values(0)
Out[72]:
In [73]:
mIndex.get_level_values(1)
Out[73]:
You can achieve hierarchical indexing with a MultiIndexed DataFrame:
In [75]:
# Getting All Price Type of date
sharesIndexDF.ix['2014/02/21']
Out[75]:
In [76]:
# Getting specific PriceType of date
sharesIndexDF.ix['2014/02/21','open']
Out[76]:
In [77]:
# We can slice on first level
sharesIndexDF.ix['2014/02/21':'2014/02/24']
Out[77]:
In [78]:
# But if we can slice at lower level:
sharesIndexDF.ix[('2014/02/21','open'):('2014/02/24','open')]
However, this results in KeyError
with a rather strange error message. The key lesson to be learned here is that the current incarnation of MultiIndex requires the labels to be sorted for the lower-level slicing routines to work correctly.
In order to do this, you can utilize the sortlevel()
method, which sorts the labels of an axis within a MultiIndex. To be on the safe side, sort first before slicing with a MultiIndex. Thus, we can do the following:
In [79]:
sharesIndexDF.sortlevel(0).ix[('2014/02/21','open'):('2014/02/24','open')]
Out[79]:
The swaplevel
function enables levels within the MultiIndex to be swapped:
In [81]:
# Swapping level 0 and 1 in x axis
swappedDF=sharesIndexDF[:7].swaplevel(0, 1, axis=0)
swappedDF
Out[81]:
The reorder_levels
function is more general, allowing you to specify the order of the levels:
In [82]:
reorderedDF=sharesIndexDF[:7].reorder_levels(['PriceType','TradingDate'],axis=0)
reorderedDF
Out[82]:
In [83]:
# Selecting price type close which are bigger than 4300 in Nasdaq
sharesIndexDataDF.ix[(sharesIndexDataDF['PriceType']=='close')&(sharesIndexDataDF['Nasdaq']>4300) ]
Out[83]:
You can also create Boolean conditions in which you use arrays to filter out parts of the data:
In [84]:
# Ww can also do this extensively
highSelection=sharesIndexDataDF['PriceType']=='high'
NasdaqHigh=sharesIndexDataDF['Nasdaq']<4300
sharesIndexDataDF.ix[highSelection & NasdaqHigh]
Out[84]:
The isin
and anyall
methods enable user to achieve more with Boolean indexing that the standart operators used in the preceding sections.
The isin
method takes a list of values and returns a Boolean array with True at the positions within the Series or DataFrame that match the values in the list.
In [88]:
# Check values in Series
stockSeries=pd.Series(['NFLX','AMZN','GOOG','FB','TWTR'])
stockSeries.isin(['AMZN','FB'])
Out[88]:
In [89]:
# We can use the sub selecting to selecting true values
stockSeries[stockSeries.isin(['AMZN','FB'])]
Out[89]:
In [92]:
# Dictionary to create a dataframe
australianMammals= {'kangaroo': {'Subclass':'marsupial','Species Origin':'native'},
'flying fox' : {'Subclass':'placental','Species Origin':'native'},
'black rat': {'Subclass':'placental','Species Origin':'invasive'},
'platypus' : {'Subclass':'monotreme','Species Origin':'native'},
'wallaby' :{'Subclass':'marsupial','Species Origin':'native'},
'palm squirrel' : {'Subclass':'placental','Origin':'invasive'},
'anteater': {'Subclass':'monotreme', 'Origin':'native'},
'koala': {'Subclass':'marsupial', 'Origin':'native'}}
ozzieMammalsDF = pd.DataFrame(australianMammals)
ozzieMammalsDF
Out[92]:
In [93]:
aussieMammalsDF=ozzieMammalsDF.T # Transposing the data frame
aussieMammalsDF
Out[93]:
In [94]:
# Selecting native animals
aussieMammalsDF.isin({'Subclass':['marsupial'],'Origin':['native']})
Out[94]:
In [103]:
import numpy as np
np.random.seed(100) # Setting random generator to 100 so we can generate same results later
normvals = pd.Series([np.random.normal() for i in np.arange(10)])
normvals
Out[103]:
In [104]:
# Return values bigger than 0
normvals[normvals>0]
Out[104]:
In [105]:
# Return values bigger than 0, prints the same shape
# by putting NaN to other places
normvals.where(normvals>0)
Out[105]:
In [106]:
# Creating DataFrame with set random values
np.random.seed(100)
normDF = pd.DataFrame([[round(np.random.normal(),3) for i in np.arange(5)] for j in range(3)],
columns=['0','30','60','90','120'])
normDF
Out[106]:
In [107]:
# For DataFrames we get same shape no matter we use
normDF[normDF>0]
Out[107]:
In [110]:
# For DataFrames we get same shape no matter we use
normDF.where(normDF>0)
Out[110]:
In [111]:
# The inverse operation of the where is mask
normDF.mask(normDF>0)
Out[111]:
To summarize, there are various ways of selecting data from pandas:
For further reading and research check out the offical documentation, indexing