03 - Pandas: Indexing and selecting data - part II

DS Data manipulation, analysis and visualisation in Python
December, 2019

© 2016-2019, Joris Van den Bossche and Stijn Van Hoey (mailto:jorisvandenbossche@gmail.com, mailto:stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons



In [1]:
import pandas as pd

In [2]:
# redefining the example objects

# series
population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3, 
                        'United Kingdom': 64.9, 'Netherlands': 16.9})

# dataframe
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries


Out[2]:
country population area capital
0 Belgium 11.3 30510 Brussels
1 France 64.3 671308 Paris
2 Germany 81.3 357050 Berlin
3 Netherlands 16.9 41526 Amsterdam
4 United Kingdom 64.9 244820 London
REMEMBER:

So as a summary, `[]` provides the following convenience shortcuts: * **Series**: selecting a **label**: `s[label]` * **DataFrame**: selecting a single or multiple **columns**:`df['col']` or `df[['col1', 'col2']]` * **DataFrame**: slicing or filtering the **rows**: `df['row_label1':'row_label2']` or `df[mask]`

Changing the DataFrame index

We have mostly worked with DataFrames with the default 0, 1, 2, ... N row labels (except for the time series data). But, we can also set one of the columns as the index.

Setting the index to the country names:


In [3]:
countries = countries.set_index('country')
countries


Out[3]:
population area capital
country
Belgium 11.3 30510 Brussels
France 64.3 671308 Paris
Germany 81.3 357050 Berlin
Netherlands 16.9 41526 Amsterdam
United Kingdom 64.9 244820 London

Reversing this operation, is reset_index:


In [4]:
countries.reset_index('country')


Out[4]:
country population area capital
0 Belgium 11.3 30510 Brussels
1 France 64.3 671308 Paris
2 Germany 81.3 357050 Berlin
3 Netherlands 16.9 41526 Amsterdam
4 United Kingdom 64.9 244820 London

Selecting data based on the index

ATTENTION!:

One of pandas' basic features is the labeling of rows and columns, but this makes indexing also a bit more complex compared to numpy.

We now have to distuinguish between: * selection by **label** (using the row and column names) * selection by **position** (using integers)

Systematic indexing with loc and iloc

When using [] like above, you can only select from one axis at once (rows or columns, not both). For more advanced indexing, you have some extra attributes:

  • loc: selection by label
  • iloc: selection by position

Both loc and iloc use the following pattern: df.loc[ <selection of the rows> , <selection of the columns> ].

This 'selection of the rows / columns' can be: a single label, a list of labels, a slice or a boolean mask.

Selecting a single element:


In [5]:
countries.loc['Germany', 'area']


Out[5]:
357050

But the row or column indexer can also be a list, slice, boolean array (see next section), ..


In [6]:
countries.loc['France':'Germany', ['area', 'population']]


Out[6]:
area population
country
France 671308 64.3
Germany 357050 81.3
NOTE: * Unlike slicing in numpy, the end label is **included**!

Selecting by position with iloc works similar as indexing numpy arrays:


In [7]:
countries.iloc[0:2,1:3]


Out[7]:
area capital
country
Belgium 30510 Brussels
France 671308 Paris

The different indexing methods can also be used to assign data:


In [8]:
countries2 = countries.copy()
countries2.loc['Belgium':'Germany', 'population'] = 10

In [9]:
countries2


Out[9]:
population area capital
country
Belgium 10.0 30510 Brussels
France 10.0 671308 Paris
Germany 10.0 357050 Berlin
Netherlands 16.9 41526 Amsterdam
United Kingdom 64.9 244820 London
REMEMBER:

Advanced indexing with **loc** and **iloc** * **loc**: select by label: `df.loc[row_indexer, column_indexer]` * **iloc**: select by position: `df.iloc[row_indexer, column_indexer]`
EXERCISE:

  • Add the population density as column to the DataFrame.

Note: the population column is expressed in millions.

In [10]:
countries['density'] = countries['population']*1000000 / countries['area']
EXERCISE:
  • Select the capital and the population column of those countries where the density is larger than 300

In [11]:
countries.loc[countries['density'] > 300, ['capital', 'population']]


Out[11]:
capital population
country
Belgium Brussels 11.3
Netherlands Amsterdam 16.9
EXERCISE:
  • Add a column 'density_ratio' with the ratio of the population density to the average population density for all countries.

In [12]:
countries['density_ratio'] = countries['density'] / countries['density'].mean()
countries


Out[12]:
population area capital density density_ratio
country
Belgium 11.3 30510 Brussels 370.370370 1.355755
France 64.3 671308 Paris 95.783158 0.350618
Germany 81.3 357050 Berlin 227.699202 0.833502
Netherlands 16.9 41526 Amsterdam 406.973944 1.489744
United Kingdom 64.9 244820 London 265.092721 0.970382
EXERCISE:
  • Change the capital of the UK to Cambridge

In [13]:
countries.loc['United Kingdom', 'capital'] = 'Cambridge'
countries


Out[13]:
population area capital density density_ratio
country
Belgium 11.3 30510 Brussels 370.370370 1.355755
France 64.3 671308 Paris 95.783158 0.350618
Germany 81.3 357050 Berlin 227.699202 0.833502
Netherlands 16.9 41526 Amsterdam 406.973944 1.489744
United Kingdom 64.9 244820 Cambridge 265.092721 0.970382
EXERCISE:
  • Select all countries whose population density is between 100 and 300 people/km²

In [14]:
countries[(countries['density'] > 100) & (countries['density'] < 300)]


Out[14]:
population area capital density density_ratio
country
Germany 81.3 357050 Berlin 227.699202 0.833502
United Kingdom 64.9 244820 Cambridge 265.092721 0.970382

Alignment on the index

**WARNING**: **Alignment!** (unlike numpy) * Pay attention to **alignment**: operations between series will align on the index:

In [15]:
population = countries['population']
s1 = population[['Belgium', 'France']]
s2 = population[['France', 'Germany']]

In [16]:
s1


Out[16]:
country
Belgium    11.3
France     64.3
Name: population, dtype: float64

In [17]:
s2


Out[17]:
country
France     64.3
Germany    81.3
Name: population, dtype: float64

In [18]:
s1 + s2


Out[18]:
country
Belgium      NaN
France     128.6
Germany      NaN
Name: population, dtype: float64

Pitfall: chained indexing (and the 'SettingWithCopyWarning')


In [19]:
df = countries.copy()

When updating values in a DataFrame, you can run into the infamous "SettingWithCopyWarning" and issues with chained indexing.

Assume we want to cap the population and replace all values above 50 with 50. We can do this using the basic [] indexing operation twice ("chained indexing"):


In [20]:
df[df['population'] > 50]['population'] = 50


/home/joris/miniconda3/envs/DS-python-data-analysis/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.

However, we get a warning, and we can also see that the original dataframe did not change:


In [21]:
df


Out[21]:
population area capital density density_ratio
country
Belgium 11.3 30510 Brussels 370.370370 1.355755
France 64.3 671308 Paris 95.783158 0.350618
Germany 81.3 357050 Berlin 227.699202 0.833502
Netherlands 16.9 41526 Amsterdam 406.973944 1.489744
United Kingdom 64.9 244820 Cambridge 265.092721 0.970382

The warning message explains that we should use .loc[row_indexer,col_indexer] = value instead. That is what we just learned in this notebook, so we can do:


In [22]:
df.loc[df['population'] > 50, 'population'] = 50

And now the dataframe actually changed:


In [23]:
df


Out[23]:
population area capital density density_ratio
country
Belgium 11.3 30510 Brussels 370.370370 1.355755
France 50.0 671308 Paris 95.783158 0.350618
Germany 50.0 357050 Berlin 227.699202 0.833502
Netherlands 16.9 41526 Amsterdam 406.973944 1.489744
United Kingdom 50.0 244820 Cambridge 265.092721 0.970382

To explain why the original df[df['population'] > 50]['population'] = 50 didn't work, we can do the "chained indexing" in two explicit steps:


In [24]:
temp = df[df['population'] > 50]
temp['population'] = 50

For Python, there is no real difference between the one-liner or this two-liner. And when writing it as two lines, you can see we make a temporary, filtered dataframe (called temp above). So here, with temp['population'] = 50, we are actually updating temp but not the original df.

REMEMBER!

What to do when encountering the *value is trying to be set on a copy of a slice from a DataFrame* error? * Use `loc` instead of chained indexing **if possible**! * Or `copy` explicitly if you don't want to change the original data.

Exercises using the Titanic dataset


In [26]:
df = pd.read_csv("../data/titanic.csv")

In [27]:
df.head()


Out[27]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
EXERCISE: * Select all rows for male passengers and calculate the mean age of those passengers. Do the same for the female passengers. Do this now using `.loc`.

In [28]:
df.loc[df['Sex'] == 'male', 'Age'].mean()


Out[28]:
30.72664459161148

In [29]:
df.loc[df['Sex'] == 'female', 'Age'].mean()


Out[29]:
27.915708812260537

We will later see an easier way to calculate both averages at the same time with groupby.


In [ ]: