03 - Pandas: Indexing and selecting data - part II

DS Data manipulation, analysis and visualisation in Python
December, 2019

© 2016-2019, Joris Van den Bossche and Stijn Van Hoey (mailto:jorisvandenbossche@gmail.com, mailto:stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons



In [1]:

    
import pandas as pd



In [2]:

    
# redefining the example objects

# series
population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3, 
                        'United Kingdom': 64.9, 'Netherlands': 16.9})

# dataframe
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries









    Out[2]:







  
    
      
      country
      population
      area
      capital
    
  
  
    
      0
      Belgium
      11.3
      30510
      Brussels
    
    
      1
      France
      64.3
      671308
      Paris
    
    
      2
      Germany
      81.3
      357050
      Berlin
    
    
      3
      Netherlands
      16.9
      41526
      Amsterdam
    
    
      4
      United Kingdom
      64.9
      244820
      London

REMEMBER:

So as a summary, `[]` provides the following convenience shortcuts: * **Series**: selecting a **label**: `s[label]` * **DataFrame**: selecting a single or multiple **columns**:`df['col']` or `df[['col1', 'col2']]` * **DataFrame**: slicing or filtering the **rows**: `df['row_label1':'row_label2']` or `df[mask]`

Changing the DataFrame index

We have mostly worked with DataFrames with the default 0, 1, 2, ... N row labels (except for the time series data). But, we can also set one of the columns as the index.

Setting the index to the country names:



In [3]:

    
countries = countries.set_index('country')
countries









    Out[3]:







  
    
      
      population
      area
      capital
    
    
      country
      
      
      
    
  
  
    
      Belgium
      11.3
      30510
      Brussels
    
    
      France
      64.3
      671308
      Paris
    
    
      Germany
      81.3
      357050
      Berlin
    
    
      Netherlands
      16.9
      41526
      Amsterdam
    
    
      United Kingdom
      64.9
      244820
      London

Reversing this operation, is reset_index:



In [4]:

    
countries.reset_index('country')









    Out[4]:







  
    
      
      country
      population
      area
      capital
    
  
  
    
      0
      Belgium
      11.3
      30510
      Brussels
    
    
      1
      France
      64.3
      671308
      Paris
    
    
      2
      Germany
      81.3
      357050
      Berlin
    
    
      3
      Netherlands
      16.9
      41526
      Amsterdam
    
    
      4
      United Kingdom
      64.9
      244820
      London

Selecting data based on the index

ATTENTION!:

One of pandas' basic features is the labeling of rows and columns, but this makes indexing also a bit more complex compared to numpy.

We now have to distuinguish between: * selection by **label** (using the row and column names) * selection by **position** (using integers)

Systematic indexing with `loc` and `iloc`

When using [] like above, you can only select from one axis at once (rows or columns, not both). For more advanced indexing, you have some extra attributes:

loc: selection by label
iloc: selection by position

Both loc and iloc use the following pattern: df.loc[ <selection of the rows> , <selection of the columns> ].

This 'selection of the rows / columns' can be: a single label, a list of labels, a slice or a boolean mask.

Selecting a single element:



In [5]:

    
countries.loc['Germany', 'area']









    Out[5]:





357050

But the row or column indexer can also be a list, slice, boolean array (see next section), ..



In [6]:

    
countries.loc['France':'Germany', ['area', 'population']]









    Out[6]:







  
    
      
      area
      population
    
    
      country
      
      
    
  
  
    
      France
      671308
      64.3
    
    
      Germany
      357050
      81.3

NOTE: * Unlike slicing in numpy, the end label is **included**!

Selecting by position with iloc works similar as indexing numpy arrays:



In [7]:

    
countries.iloc[0:2,1:3]









    Out[7]:







  
    
      
      area
      capital
    
    
      country
      
      
    
  
  
    
      Belgium
      30510
      Brussels
    
    
      France
      671308
      Paris

The different indexing methods can also be used to assign data:



In [8]:

    
countries2 = countries.copy()
countries2.loc['Belgium':'Germany', 'population'] = 10



In [9]:

    
countries2









    Out[9]:







  
    
      
      population
      area
      capital
    
    
      country
      
      
      
    
  
  
    
      Belgium
      10.0
      30510
      Brussels
    
    
      France
      10.0
      671308
      Paris
    
    
      Germany
      10.0
      357050
      Berlin
    
    
      Netherlands
      16.9
      41526
      Amsterdam
    
    
      United Kingdom
      64.9
      244820
      London

REMEMBER:

Advanced indexing with **loc** and **iloc** * **loc**: select by label: `df.loc[row_indexer, column_indexer]` * **iloc**: select by position: `df.iloc[row_indexer, column_indexer]`

EXERCISE:

Add the population density as column to the DataFrame.

Note: the population column is expressed in millions.



In [10]:

    
countries['density'] = countries['population']*1000000 / countries['area']

EXERCISE:

Select the capital and the population column of those countries where the density is larger than 300



In [11]:

    
countries.loc[countries['density'] > 300, ['capital', 'population']]









    Out[11]:







  
    
      
      capital
      population
    
    
      country
      
      
    
  
  
    
      Belgium
      Brussels
      11.3
    
    
      Netherlands
      Amsterdam
      16.9

EXERCISE:

Add a column 'density_ratio' with the ratio of the population density to the average population density for all countries.



In [12]:

    
countries['density_ratio'] = countries['density'] / countries['density'].mean()
countries









    Out[12]:







  
    
      
      population
      area
      capital
      density
      density_ratio
    
    
      country
      
      
      
      
      
    
  
  
    
      Belgium
      11.3
      30510
      Brussels
      370.370370
      1.355755
    
    
      France
      64.3
      671308
      Paris
      95.783158
      0.350618
    
    
      Germany
      81.3
      357050
      Berlin
      227.699202
      0.833502
    
    
      Netherlands
      16.9
      41526
      Amsterdam
      406.973944
      1.489744
    
    
      United Kingdom
      64.9
      244820
      London
      265.092721
      0.970382

EXERCISE:

Change the capital of the UK to Cambridge



In [13]:

    
countries.loc['United Kingdom', 'capital'] = 'Cambridge'
countries









    Out[13]:







  
    
      
      population
      area
      capital
      density
      density_ratio
    
    
      country
      
      
      
      
      
    
  
  
    
      Belgium
      11.3
      30510
      Brussels
      370.370370
      1.355755
    
    
      France
      64.3
      671308
      Paris
      95.783158
      0.350618
    
    
      Germany
      81.3
      357050
      Berlin
      227.699202
      0.833502
    
    
      Netherlands
      16.9
      41526
      Amsterdam
      406.973944
      1.489744
    
    
      United Kingdom
      64.9
      244820
      Cambridge
      265.092721
      0.970382

EXERCISE:

Select all countries whose population density is between 100 and 300 people/km²



In [14]:

    
countries[(countries['density'] > 100) & (countries['density'] < 300)]









    Out[14]:







  
    
      
      population
      area
      capital
      density
      density_ratio
    
    
      country
      
      
      
      
      
    
  
  
    
      Germany
      81.3
      357050
      Berlin
      227.699202
      0.833502
    
    
      United Kingdom
      64.9
      244820
      Cambridge
      265.092721
      0.970382

Alignment on the index

**WARNING**: **Alignment!** (unlike numpy) * Pay attention to **alignment**: operations between series will align on the index:



In [15]:

    
population = countries['population']
s1 = population[['Belgium', 'France']]
s2 = population[['France', 'Germany']]



In [16]:

    
s1









    Out[16]:





country
Belgium    11.3
France     64.3
Name: population, dtype: float64



In [17]:

    
s2









    Out[17]:





country
France     64.3
Germany    81.3
Name: population, dtype: float64



In [18]:

    
s1 + s2









    Out[18]:





country
Belgium      NaN
France     128.6
Germany      NaN
Name: population, dtype: float64

Pitfall: chained indexing (and the 'SettingWithCopyWarning')



In [19]:

    
df = countries.copy()

When updating values in a DataFrame, you can run into the infamous "SettingWithCopyWarning" and issues with chained indexing.

Assume we want to cap the population and replace all values above 50 with 50. We can do this using the basic [] indexing operation twice ("chained indexing"):



In [20]:

    
df[df['population'] > 50]['population'] = 50









    



/home/joris/miniconda3/envs/DS-python-data-analysis/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.

However, we get a warning, and we can also see that the original dataframe did not change:



In [21]:

    
df









    Out[21]:







  
    
      
      population
      area
      capital
      density
      density_ratio
    
    
      country
      
      
      
      
      
    
  
  
    
      Belgium
      11.3
      30510
      Brussels
      370.370370
      1.355755
    
    
      France
      64.3
      671308
      Paris
      95.783158
      0.350618
    
    
      Germany
      81.3
      357050
      Berlin
      227.699202
      0.833502
    
    
      Netherlands
      16.9
      41526
      Amsterdam
      406.973944
      1.489744
    
    
      United Kingdom
      64.9
      244820
      Cambridge
      265.092721
      0.970382

The warning message explains that we should use .loc[row_indexer,col_indexer] = value instead. That is what we just learned in this notebook, so we can do:



In [22]:

    
df.loc[df['population'] > 50, 'population'] = 50

And now the dataframe actually changed:



In [23]:

    
df









    Out[23]:







  
    
      
      population
      area
      capital
      density
      density_ratio
    
    
      country
      
      
      
      
      
    
  
  
    
      Belgium
      11.3
      30510
      Brussels
      370.370370
      1.355755
    
    
      France
      50.0
      671308
      Paris
      95.783158
      0.350618
    
    
      Germany
      50.0
      357050
      Berlin
      227.699202
      0.833502
    
    
      Netherlands
      16.9
      41526
      Amsterdam
      406.973944
      1.489744
    
    
      United Kingdom
      50.0
      244820
      Cambridge
      265.092721
      0.970382

To explain why the original df[df['population'] > 50]['population'] = 50 didn't work, we can do the "chained indexing" in two explicit steps:



In [24]:

    
temp = df[df['population'] > 50]
temp['population'] = 50

For Python, there is no real difference between the one-liner or this two-liner. And when writing it as two lines, you can see we make a temporary, filtered dataframe (called temp above). So here, with temp['population'] = 50, we are actually updating temp but not the original df.

REMEMBER!

What to do when encountering the *value is trying to be set on a copy of a slice from a DataFrame* error? * Use `loc` instead of chained indexing **if possible**! * Or `copy` explicitly if you don't want to change the original data.

Exercises using the Titanic dataset



In [26]:

    
df = pd.read_csv("../data/titanic.csv")



In [27]:

    
df.head()









    Out[27]:







  
    
      
      PassengerId
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
    
  
  
    
      0
      1
      0
      3
      Braund, Mr. Owen Harris
      male
      22.0
      1
      0
      A/5 21171
      7.2500
      NaN
      S
    
    
      1
      2
      1
      1
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      female
      38.0
      1
      0
      PC 17599
      71.2833
      C85
      C
    
    
      2
      3
      1
      3
      Heikkinen, Miss. Laina
      female
      26.0
      0
      0
      STON/O2. 3101282
      7.9250
      NaN
      S
    
    
      3
      4
      1
      1
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      female
      35.0
      1
      0
      113803
      53.1000
      C123
      S
    
    
      4
      5
      0
      3
      Allen, Mr. William Henry
      male
      35.0
      0
      0
      373450
      8.0500
      NaN
      S

EXERCISE: * Select all rows for male passengers and calculate the mean age of those passengers. Do the same for the female passengers. Do this now using `.loc`.



In [28]:

    
df.loc[df['Sex'] == 'male', 'Age'].mean()









    Out[28]:





30.72664459161148



In [29]:

    
df.loc[df['Sex'] == 'female', 'Age'].mean()









    Out[29]:





27.915708812260537

We will later see an easier way to calculate both averages at the same time with groupby.



In [ ]:

	country	population	area	capital
0	Belgium	11.3	30510	Brussels
1	France	64.3	671308	Paris
2	Germany	81.3	357050	Berlin
3	Netherlands	16.9	41526	Amsterdam
4	United Kingdom	64.9	244820	London

	population	area	capital
country
Belgium	11.3	30510	Brussels
France	64.3	671308	Paris
Germany	81.3	357050	Berlin
Netherlands	16.9	41526	Amsterdam
United Kingdom	64.9	244820	London

	population	area	capital
country
Belgium	10.0	30510	Brussels
France	10.0	671308	Paris
Germany	10.0	357050	Berlin
Netherlands	16.9	41526	Amsterdam
United Kingdom	64.9	244820	London

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

Changing the DataFrame index

Selecting data based on the index

Systematic indexing with loc and iloc

Alignment on the index

Pitfall: chained indexing (and the 'SettingWithCopyWarning')

Exercises using the Titanic dataset

Systematic indexing with `loc` and `iloc`