Advanced Pandas: Multi-axis indexing (loc and iloc methods)

In this notebook we are discussing the loc, iloc and ix methods, which are relatively newly implemented in pandas but have become the primary means of multi-axis indexing. The official pandas docs is here.

Imagine the following standard situation: we have a dataset where

  • Each row represents an observation
  • Each column is a named variable.
  • and we want: an easy, quick way to select relevant rows/columns (i.e. subsamples from dataset)

This notebook borrows examples from this blog post.

Note: requires internet access to run.

This Jupyter notebook was created by Balint Szoke for the NYU Stern course Data Bootcamp.


In [1]:
%matplotlib inline 

import numpy as np                     # foundation for Pandas   
import pandas as pd                    # data package
import matplotlib.pyplot as plt        # graphics module

To illustrate the main ideas, we are going to use a fake data set from this website. This file contains artificial names, addresses, companies, phone numbers etc. for fictitious US characters. Here is the complete list of variables

The main purpose of this dataset is testing. Straight from the website

"Always test your software with a "worst-case scenario" amount of sample data, to get an accurate sense of its performance in the real world."

The only issue is that the samples are stored in zipped csv files, i.e. we do not have url for the csv file directly.

How to download a zipfile and read the csv file inside


In [ ]:
# we need some extra tools to download and handle zip files 
import zipfile as zf    
import requests, io

url = "https://www.briandunning.com/sample-data/us-500.zip"
r = requests.get(url) 
file = zf.ZipFile(io.BytesIO(r.content)) 
file

In [ ]:
file.namelist()    # there is one csv file inside

In [ ]:
file_csv = file.open(file.namelist()[0])          # compare the type with file above (this is readable by pandas)
file_csv

In [ ]:
df = pd.read_csv(file_csv)
# What do we have?
print("Variables and types:\n\n", df.dtypes, sep='')

In [ ]:
df.head()

In [ ]:
# Sometimes this one-liner works too
#pd.read_csv(url, compression='zip')    # not in this case

(1) Selecting data using loc

loc is primarily label-location based indexer. That is,

  • it selects rows and columns by their labels (variable names for columns, index values for rows).
  • also works with a boolean array.

The syntax is

data.loc[<row selection>, <column selection>]

First, set an arbitrary index variable


In [ ]:
dff = df.set_index(['last_name'])
dff.head()

Now we can directly select rows by their index (last_name) values (just like we do with columns)


In [ ]:
dff.loc['Butt']

In [ ]:
# multiple rows
dff.loc[['Butt', 'Venere']]

In [ ]:
# select a subset of the data (subDataFrame)
dff.loc[['Butt', 'Foller'], ['city', 'email']]

In [ ]:
# ranges of index labels
dff.loc[['Butt', 'Foller'], 'address':'phone2']

Boolean indexing using loc

The most common method to work with data

Pass an array of True/False values to the .loc to select the rows/columns with True values.


In [ ]:
dff.loc[dff['city'] == 'New Orleans']

In fact, we don't need the loc indexer for this kind of task


In [ ]:
dff[dff['city'] == 'New Orleans']

But what if we don't want all variables?


In [ ]:
dff.loc[dff['city'] == 'New Orleans', ['company_name', 'zip']]

How would you get the same dataframe without loc?


In [ ]:
dff[dff['city'] == 'New Orleans'][['company_name', 'zip']]         # matter of taste

Recall the string methods applicable to DataFrames


In [ ]:
dff[dff['email'].str.endswith("gmail.com")].head()

and the isin method?


In [ ]:
dff.loc[dff['city'].isin(['New Orleans', 'New York'])]

In [ ]:
# intersection fo the two?

gmails = dff['email'].str.endswith("gmail.com")
NYNO = (dff['city'] == 'New Orleans') | (dff['city'] =='New York')
dff[gmails & NYNO]

A tricky one: we can pass a function that returns True/False values to .apply() and evaluate it at each row


In [ ]:
def short_company_name(x):
    """
    returns True if x contains less than 2 words
    """
    return len(x.split(' ')) < 2

dff.loc[dff['company_name'].apply(short_company_name)]

(2) Selecting data using iloc

iloc is primarily used for integer position based indexing. That is, it selects rows and columns by number, in the order that they appear in the data frame. Numbers are from $0$ to df.shape-1 of both axes.

The syntax is

data.iloc[<row selection>, <column selection>]

In [ ]:
# Rows:
df.iloc[0]             # first row

In [ ]:
df.iloc[-1]            # last row

In [ ]:
# Columns:
df.iloc[:, 0]          # first column = first variable (first_name)

In [ ]:
df.iloc[:, -1]         # last column (web)

For multiple columns and rows, use slicer


In [ ]:
df.iloc[:5]                        # first five rows

In [ ]:
df.iloc[:, :2]                      # first two columns

In [ ]:
df.iloc[[0, 4, 7, 25],                  # 1st, 5th, 6th, 26th row
        [0, 5, 6]]                      # 1st 6th 7th columns.

In [ ]:
df.iloc[:5, 5:8]                          # first 5 rows and 5th, 6th, 7th colum

(3) Selecting data using ix

ix is hybrid of loc and iloc. In general,

  1. it is label-location based and acts just like loc
  2. However, it also supports integer-location based selection just like iloc where passed an integer

Second option only works where the index of the DataFrame is NOT an integer.

The syntax is

data.ix[<row selection>, <column selection>]

Explicit usage of loc and iloc is preferred


In [ ]:
# ix indexing works just the same as loc when passed labels
dff.ix['Butt', 'city'] == dff.loc['Butt', 'city']

In [ ]:
# ix indexing works the same as iloc when passed integers.
dff.ix[33, 7] == dff.iloc[33, 7]

Hierarchical indexing with loc and iloc

Multi-level indexing allows us to work with higher dimensional data while storing info in lower dimensional data structures like 2D DataFrame or 1D Series. More on this here.

Read the WEO dataset that we have already used a couple of times


In [3]:
url_weo = 'http://www.imf.org/external/pubs/ft/weo/2016/02/weodata/WEOOct2016all.xls'

# (1) define the column indices 
col_indices = [1, 2, 3] + list(range(9, 46))

# (2) download the dataset
weo = pd.read_csv(url_weo, 
                  sep = '\t',
                  usecols=col_indices,
                  skipfooter=1, engine='python',
                  na_values=['n/a', '--'],
                  thousands =',')

# (3) change column labels to something more intuitive 
weo = weo.rename(columns={'WEO Subject Code': 'Variable', 
                          'Subject Descriptor': 'Description'})

# (4) create debt and deficits dataframe 
variables = ['GGXWDG_NGDP', 'GGXCNL_NGDP']
data = weo[weo['Variable'].isin(variables)]

data['Variable'] = data['Variable'].replace(to_replace=['GGXWDG_NGDP', 'GGXCNL_NGDP'], value=['Debt', 'Surplus'])


/home/daniel/anaconda3/lib/python3.5/site-packages/ipykernel/__main__.py:22: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

In [4]:
data1 = data.set_index(['ISO', 'Variable'])
data1.head()


Out[4]:
Country 1980 1981 1982 1983 1984 1985 1986 1987 1988 ... 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
ISO Variable
AFG Surplus Afghanistan NaN NaN NaN NaN NaN NaN NaN NaN NaN ... -2.492 -3.882 -1.779 0.922 -0.672 0.182 -0.631 -1.741 -1.404 0.142
Debt Afghanistan NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 20.240 18.370 16.776 7.902 7.137 6.625 6.749 6.380 6.246 6.751
ALB Surplus Albania NaN NaN NaN NaN NaN NaN NaN NaN NaN ... -3.179 -4.884 -6.586 -3.520 -3.516 -3.441 -5.215 -5.461 -4.050 -2.463
Debt Albania NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 53.550 55.140 59.666 57.720 59.413 62.144 70.391 72.038 73.321 72.184
DZA Surplus Algeria NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 6.133 9.055 -5.466 -0.388 -0.436 -4.435 -0.400 -7.310 -16.154 -12.930

5 rows × 38 columns


In [17]:
data1.ix[('AFG', slice(None)), '1980':'1983']


Out[17]:
1980 1981 1982 1983
Variable
Surplus NaN NaN NaN NaN
Debt NaN NaN NaN NaN

In [ ]: