loc
and iloc
methods)In this notebook we are discussing the loc
, iloc
and ix
methods, which are relatively newly implemented in pandas
but have become the primary means of multi-axis indexing. The official pandas
docs is here.
Imagine the following standard situation: we have a dataset where
This notebook borrows examples from this blog post.
Note: requires internet access to run.
This Jupyter notebook was created by Balint Szoke for the NYU Stern course Data Bootcamp.
In [1]:
%matplotlib inline
import numpy as np # foundation for Pandas
import pandas as pd # data package
import matplotlib.pyplot as plt # graphics module
To illustrate the main ideas, we are going to use a fake data set from this website. This file contains artificial names, addresses, companies, phone numbers etc. for fictitious US characters. Here is the complete list of variables
The main purpose of this dataset is testing. Straight from the website
"Always test your software with a "worst-case scenario" amount of sample data, to get an accurate sense of its performance in the real world."
The only issue is that the samples are stored in zipped csv files, i.e. we do not have url for the csv file directly.
In [ ]:
# we need some extra tools to download and handle zip files
import zipfile as zf
import requests, io
url = "https://www.briandunning.com/sample-data/us-500.zip"
r = requests.get(url)
file = zf.ZipFile(io.BytesIO(r.content))
file
In [ ]:
file.namelist() # there is one csv file inside
In [ ]:
file_csv = file.open(file.namelist()[0]) # compare the type with file above (this is readable by pandas)
file_csv
In [ ]:
df = pd.read_csv(file_csv)
# What do we have?
print("Variables and types:\n\n", df.dtypes, sep='')
In [ ]:
df.head()
In [ ]:
# Sometimes this one-liner works too
#pd.read_csv(url, compression='zip') # not in this case
loc
loc
is primarily label-location based indexer. That is,
The syntax is
data.loc[<row selection>, <column selection>]
First, set an arbitrary index variable
In [ ]:
dff = df.set_index(['last_name'])
dff.head()
Now we can directly select rows by their index (last_name
) values (just like we do with columns)
In [ ]:
dff.loc['Butt']
In [ ]:
# multiple rows
dff.loc[['Butt', 'Venere']]
In [ ]:
# select a subset of the data (subDataFrame)
dff.loc[['Butt', 'Foller'], ['city', 'email']]
In [ ]:
# ranges of index labels
dff.loc[['Butt', 'Foller'], 'address':'phone2']
In [ ]:
dff.loc[dff['city'] == 'New Orleans']
In fact, we don't need the loc
indexer for this kind of task
In [ ]:
dff[dff['city'] == 'New Orleans']
But what if we don't want all variables?
In [ ]:
dff.loc[dff['city'] == 'New Orleans', ['company_name', 'zip']]
How would you get the same dataframe without loc
?
In [ ]:
dff[dff['city'] == 'New Orleans'][['company_name', 'zip']] # matter of taste
Recall the string methods applicable to DataFrame
s
In [ ]:
dff[dff['email'].str.endswith("gmail.com")].head()
and the isin
method?
In [ ]:
dff.loc[dff['city'].isin(['New Orleans', 'New York'])]
In [ ]:
# intersection fo the two?
gmails = dff['email'].str.endswith("gmail.com")
NYNO = (dff['city'] == 'New Orleans') | (dff['city'] =='New York')
dff[gmails & NYNO]
A tricky one: we can pass a function that returns True/False values to .apply()
and evaluate it at each row
In [ ]:
def short_company_name(x):
"""
returns True if x contains less than 2 words
"""
return len(x.split(' ')) < 2
dff.loc[dff['company_name'].apply(short_company_name)]
iloc
iloc
is primarily used for integer position based indexing. That is, it selects rows and columns by number, in the order that they appear in the data frame. Numbers are from $0$ to df.shape-1
of both axes.
The syntax is
data.iloc[<row selection>, <column selection>]
In [ ]:
# Rows:
df.iloc[0] # first row
In [ ]:
df.iloc[-1] # last row
In [ ]:
# Columns:
df.iloc[:, 0] # first column = first variable (first_name)
In [ ]:
df.iloc[:, -1] # last column (web)
For multiple columns and rows, use slicer
In [ ]:
df.iloc[:5] # first five rows
In [ ]:
df.iloc[:, :2] # first two columns
In [ ]:
df.iloc[[0, 4, 7, 25], # 1st, 5th, 6th, 26th row
[0, 5, 6]] # 1st 6th 7th columns.
In [ ]:
df.iloc[:5, 5:8] # first 5 rows and 5th, 6th, 7th colum
ix
ix
is hybrid of loc
and iloc
. In general,
loc
iloc
where passed an integerSecond option only works where the index of the DataFrame is NOT an integer.
The syntax is
data.ix[<row selection>, <column selection>]
Explicit usage of loc
and iloc
is preferred
In [ ]:
# ix indexing works just the same as loc when passed labels
dff.ix['Butt', 'city'] == dff.loc['Butt', 'city']
In [ ]:
# ix indexing works the same as iloc when passed integers.
dff.ix[33, 7] == dff.iloc[33, 7]
Multi-level indexing allows us to work with higher dimensional data while storing info in lower dimensional data structures like 2D DataFrame
or 1D Series
. More on this here.
Read the WEO dataset that we have already used a couple of times
In [3]:
url_weo = 'http://www.imf.org/external/pubs/ft/weo/2016/02/weodata/WEOOct2016all.xls'
# (1) define the column indices
col_indices = [1, 2, 3] + list(range(9, 46))
# (2) download the dataset
weo = pd.read_csv(url_weo,
sep = '\t',
usecols=col_indices,
skipfooter=1, engine='python',
na_values=['n/a', '--'],
thousands =',')
# (3) change column labels to something more intuitive
weo = weo.rename(columns={'WEO Subject Code': 'Variable',
'Subject Descriptor': 'Description'})
# (4) create debt and deficits dataframe
variables = ['GGXWDG_NGDP', 'GGXCNL_NGDP']
data = weo[weo['Variable'].isin(variables)]
data['Variable'] = data['Variable'].replace(to_replace=['GGXWDG_NGDP', 'GGXCNL_NGDP'], value=['Debt', 'Surplus'])
In [4]:
data1 = data.set_index(['ISO', 'Variable'])
data1.head()
Out[4]:
In [17]:
data1.ix[('AFG', slice(None)), '1980':'1983']
Out[17]:
In [ ]: