03 - Pandas: Indexing and selecting data - part II
DS Data manipulation, analysis and visualisation in Python
December, 2019© 2016-2019, Joris Van den Bossche and Stijn Van Hoey (mailto:jorisvandenbossche@gmail.com, mailto:stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons
In [1]:
import pandas as pd
In [2]:
# redefining the example objects
# series
population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3,
'United Kingdom': 64.9, 'Netherlands': 16.9})
# dataframe
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
'population': [11.3, 64.3, 81.3, 16.9, 64.9],
'area': [30510, 671308, 357050, 41526, 244820],
'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries
Out[2]:
We have mostly worked with DataFrames with the default 0, 1, 2, ... N row labels (except for the time series data). But, we can also set one of the columns as the index.
Setting the index to the country names:
In [3]:
countries = countries.set_index('country')
countries
Out[3]:
Reversing this operation, is reset_index
:
In [4]:
countries.reset_index('country')
Out[4]:
When using []
like above, you can only select from one axis at once (rows or columns, not both). For more advanced indexing, you have some extra attributes:
loc
: selection by labeliloc
: selection by positionBoth loc
and iloc
use the following pattern: df.loc[ <selection of the rows> , <selection of the columns> ]
.
This 'selection of the rows / columns' can be: a single label, a list of labels, a slice or a boolean mask.
Selecting a single element:
In [5]:
countries.loc['Germany', 'area']
Out[5]:
But the row or column indexer can also be a list, slice, boolean array (see next section), ..
In [6]:
countries.loc['France':'Germany', ['area', 'population']]
Out[6]:
Selecting by position with iloc
works similar as indexing numpy arrays:
In [7]:
countries.iloc[0:2,1:3]
Out[7]:
The different indexing methods can also be used to assign data:
In [8]:
countries2 = countries.copy()
countries2.loc['Belgium':'Germany', 'population'] = 10
In [9]:
countries2
Out[9]:
In [10]:
countries['density'] = countries['population']*1000000 / countries['area']
In [11]:
countries.loc[countries['density'] > 300, ['capital', 'population']]
Out[11]:
In [12]:
countries['density_ratio'] = countries['density'] / countries['density'].mean()
countries
Out[12]:
In [13]:
countries.loc['United Kingdom', 'capital'] = 'Cambridge'
countries
Out[13]:
In [14]:
countries[(countries['density'] > 100) & (countries['density'] < 300)]
Out[14]:
In [15]:
population = countries['population']
s1 = population[['Belgium', 'France']]
s2 = population[['France', 'Germany']]
In [16]:
s1
Out[16]:
In [17]:
s2
Out[17]:
In [18]:
s1 + s2
Out[18]:
In [19]:
df = countries.copy()
When updating values in a DataFrame, you can run into the infamous "SettingWithCopyWarning" and issues with chained indexing.
Assume we want to cap the population and replace all values above 50 with 50. We can do this using the basic []
indexing operation twice ("chained indexing"):
In [20]:
df[df['population'] > 50]['population'] = 50
However, we get a warning, and we can also see that the original dataframe did not change:
In [21]:
df
Out[21]:
The warning message explains that we should use .loc[row_indexer,col_indexer] = value
instead. That is what we just learned in this notebook, so we can do:
In [22]:
df.loc[df['population'] > 50, 'population'] = 50
And now the dataframe actually changed:
In [23]:
df
Out[23]:
To explain why the original df[df['population'] > 50]['population'] = 50
didn't work, we can do the "chained indexing" in two explicit steps:
In [24]:
temp = df[df['population'] > 50]
temp['population'] = 50
For Python, there is no real difference between the one-liner or this two-liner. And when writing it as two lines, you can see we make a temporary, filtered dataframe (called temp
above). So here, with temp['population'] = 50
, we are actually updating temp
but not the original df
.
In [26]:
df = pd.read_csv("../data/titanic.csv")
In [27]:
df.head()
Out[27]:
In [28]:
df.loc[df['Sex'] == 'male', 'Age'].mean()
Out[28]:
In [29]:
df.loc[df['Sex'] == 'female', 'Age'].mean()
Out[29]:
We will later see an easier way to calculate both averages at the same time with groupby.
In [ ]: