Advanced Pandas: Multi-axis indexing (`loc` and `iloc` methods)

In this notebook we are discussing the loc, iloc and ix methods, which are relatively newly implemented in pandas but have become the primary means of multi-axis indexing. The official pandas docs is here.

Imagine the following standard situation: we have a dataset where

Each row represents an observation
Each column is a named variable.
and we want: an easy, quick way to select relevant rows/columns (i.e. subsamples from dataset)

This notebook borrows examples from this blog post.

Note: requires internet access to run.

This Jupyter notebook was created by Balint Szoke for the NYU Stern course Data Bootcamp.



In [1]:

    
%matplotlib inline 

import numpy as np                     # foundation for Pandas   
import pandas as pd                    # data package
import matplotlib.pyplot as plt        # graphics module

To illustrate the main ideas, we are going to use a fake data set from this website. This file contains artificial names, addresses, companies, phone numbers etc. for fictitious US characters. Here is the complete list of variables

The main purpose of this dataset is testing. Straight from the website

"Always test your software with a "worst-case scenario" amount of sample data, to get an accurate sense of its performance in the real world."

The only issue is that the samples are stored in zipped csv files, i.e. we do not have url for the csv file directly.

How to download a zipfile and read the csv file inside



In [ ]:

    
# we need some extra tools to download and handle zip files 
import zipfile as zf    
import requests, io

url = "https://www.briandunning.com/sample-data/us-500.zip"
r = requests.get(url) 
file = zf.ZipFile(io.BytesIO(r.content)) 
file



In [ ]:

    
file.namelist()    # there is one csv file inside



In [ ]:

    
file_csv = file.open(file.namelist()[0])          # compare the type with file above (this is readable by pandas)
file_csv



In [ ]:

    
df = pd.read_csv(file_csv)
# What do we have?
print("Variables and types:\n\n", df.dtypes, sep='')



In [ ]:

    
df.head()



In [ ]:

    
# Sometimes this one-liner works too
#pd.read_csv(url, compression='zip')    # not in this case

(1) Selecting data using `loc`

loc is primarily label-location based indexer. That is,

it selects rows and columns by their labels (variable names for columns, index values for rows).
also works with a boolean array.

The syntax is

data.loc[<row selection>, <column selection>]

First, set an arbitrary index variable



In [ ]:

    
dff = df.set_index(['last_name'])
dff.head()

Now we can directly select rows by their index (last_name) values (just like we do with columns)



In [ ]:

    
dff.loc['Butt']



In [ ]:

    
# multiple rows
dff.loc[['Butt', 'Venere']]



In [ ]:

    
# select a subset of the data (subDataFrame)
dff.loc[['Butt', 'Foller'], ['city', 'email']]



In [ ]:

    
# ranges of index labels
dff.loc[['Butt', 'Foller'], 'address':'phone2']

Boolean indexing using `loc`

The most common method to work with data

Pass an array of True/False values to the .loc to select the rows/columns with True values.



In [ ]:

    
dff.loc[dff['city'] == 'New Orleans']

In fact, we don't need the loc indexer for this kind of task



In [ ]:

    
dff[dff['city'] == 'New Orleans']

But what if we don't want all variables?



In [ ]:

    
dff.loc[dff['city'] == 'New Orleans', ['company_name', 'zip']]

How would you get the same dataframe without loc?



In [ ]:

    
dff[dff['city'] == 'New Orleans'][['company_name', 'zip']]         # matter of taste

Recall the string methods applicable to DataFrames



In [ ]:

    
dff[dff['email'].str.endswith("gmail.com")].head()

and the isin method?



In [ ]:

    
dff.loc[dff['city'].isin(['New Orleans', 'New York'])]



In [ ]:

    
# intersection fo the two?

gmails = dff['email'].str.endswith("gmail.com")
NYNO = (dff['city'] == 'New Orleans') | (dff['city'] =='New York')
dff[gmails & NYNO]

A tricky one: we can pass a function that returns True/False values to .apply() and evaluate it at each row



In [ ]:

    
def short_company_name(x):
    """
    returns True if x contains less than 2 words
    """
    return len(x.split(' ')) < 2

dff.loc[dff['company_name'].apply(short_company_name)]

(2) Selecting data using `iloc`

iloc is primarily used for integer position based indexing. That is, it selects rows and columns by number, in the order that they appear in the data frame. Numbers are from $0$ to df.shape-1 of both axes.

The syntax is

data.iloc[<row selection>, <column selection>]



In [ ]:

    
# Rows:
df.iloc[0]             # first row



In [ ]:

    
df.iloc[-1]            # last row



In [ ]:

    
# Columns:
df.iloc[:, 0]          # first column = first variable (first_name)



In [ ]:

    
df.iloc[:, -1]         # last column (web)

For multiple columns and rows, use slicer



In [ ]:

    
df.iloc[:5]                        # first five rows



In [ ]:

    
df.iloc[:, :2]                      # first two columns



In [ ]:

    
df.iloc[[0, 4, 7, 25],                  # 1st, 5th, 6th, 26th row
        [0, 5, 6]]                      # 1st 6th 7th columns.



In [ ]:

    
df.iloc[:5, 5:8]                          # first 5 rows and 5th, 6th, 7th colum

(3) Selecting data using `ix`

ix is hybrid of loc and iloc. In general,

it is label-location based and acts just like loc
However, it also supports integer-location based selection just like iloc where passed an integer

Second option only works where the index of the DataFrame is NOT an integer.

The syntax is

data.ix[<row selection>, <column selection>]

Explicit usage of loc and iloc is preferred



In [ ]:

    
# ix indexing works just the same as loc when passed labels
dff.ix['Butt', 'city'] == dff.loc['Butt', 'city']



In [ ]:

    
# ix indexing works the same as iloc when passed integers.
dff.ix[33, 7] == dff.iloc[33, 7]

Hierarchical indexing with `loc` and `iloc`

Multi-level indexing allows us to work with higher dimensional data while storing info in lower dimensional data structures like 2D DataFrame or 1D Series. More on this here.

Read the WEO dataset that we have already used a couple of times



In [3]:

    
url_weo = 'http://www.imf.org/external/pubs/ft/weo/2016/02/weodata/WEOOct2016all.xls'

# (1) define the column indices 
col_indices = [1, 2, 3] + list(range(9, 46))

# (2) download the dataset
weo = pd.read_csv(url_weo, 
                  sep = '\t',
                  usecols=col_indices,
                  skipfooter=1, engine='python',
                  na_values=['n/a', '--'],
                  thousands =',')

# (3) change column labels to something more intuitive 
weo = weo.rename(columns={'WEO Subject Code': 'Variable', 
                          'Subject Descriptor': 'Description'})

# (4) create debt and deficits dataframe 
variables = ['GGXWDG_NGDP', 'GGXCNL_NGDP']
data = weo[weo['Variable'].isin(variables)]

data['Variable'] = data['Variable'].replace(to_replace=['GGXWDG_NGDP', 'GGXCNL_NGDP'], value=['Debt', 'Surplus'])









    



/home/daniel/anaconda3/lib/python3.5/site-packages/ipykernel/__main__.py:22: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy



In [4]:

    
data1 = data.set_index(['ISO', 'Variable'])
data1.head()









    Out[4]:






  
    
      
      
      Country
      1980
      1981
      1982
      1983
      1984
      1985
      1986
      1987
      1988
      ...
      2007
      2008
      2009
      2010
      2011
      2012
      2013
      2014
      2015
      2016
    
    
      ISO
      Variable
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      AFG
      Surplus
      Afghanistan
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      -2.492
      -3.882
      -1.779
      0.922
      -0.672
      0.182
      -0.631
      -1.741
      -1.404
      0.142
    
    
      Debt
      Afghanistan
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      20.240
      18.370
      16.776
      7.902
      7.137
      6.625
      6.749
      6.380
      6.246
      6.751
    
    
      ALB
      Surplus
      Albania
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      -3.179
      -4.884
      -6.586
      -3.520
      -3.516
      -3.441
      -5.215
      -5.461
      -4.050
      -2.463
    
    
      Debt
      Albania
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      53.550
      55.140
      59.666
      57.720
      59.413
      62.144
      70.391
      72.038
      73.321
      72.184
    
    
      DZA
      Surplus
      Algeria
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      6.133
      9.055
      -5.466
      -0.388
      -0.436
      -4.435
      -0.400
      -7.310
      -16.154
      -12.930
    
  

5 rows × 38 columns



In [17]:

    
data1.ix[('AFG', slice(None)), '1980':'1983']



In [ ]:

		Country	1980	1981	1982	1983	1984	1985	1986	1987	1988	...	2007	2008	2009	2010	2011	2012	2013	2014	2015	2016
ISO	Variable
AFG	Surplus	Afghanistan	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	-2.492	-3.882	-1.779	0.922	-0.672	0.182	-0.631	-1.741	-1.404	0.142
AFG	Debt	Afghanistan	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	20.240	18.370	16.776	7.902	7.137	6.625	6.749	6.380	6.246	6.751
ALB	Surplus	Albania	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	-3.179	-4.884	-6.586	-3.520	-3.516	-3.441	-5.215	-5.461	-4.050	-2.463
ALB	Debt	Albania	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	53.550	55.140	59.666	57.720	59.413	62.144	70.391	72.038	73.321	72.184
DZA	Surplus	Algeria	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	6.133	9.055	-5.466	-0.388	-0.436	-4.435	-0.400	-7.310	-16.154	-12.930

Advanced Pandas: Multi-axis indexing (loc and iloc methods)

How to download a zipfile and read the csv file inside

(1) Selecting data using loc

Boolean indexing using loc

(2) Selecting data using iloc

(3) Selecting data using ix

Hierarchical indexing with loc and iloc

Advanced Pandas: Multi-axis indexing (`loc` and `iloc` methods)

(1) Selecting data using `loc`

Boolean indexing using `loc`

(2) Selecting data using `iloc`

(3) Selecting data using `ix`

Hierarchical indexing with `loc` and `iloc`