Indexing and selecting data


In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

More on NumPy indexing


In [2]:
a = np.array([-2, 3, 4, -5, 5])
print(a)


[-2  3  4 -5  5]

Fancy indexing

Apart from indexing with integers and slices NumPy also supports indexing with arrays of integers (so-called fancy indexing). For example, to get the 2nd and 4th element of a:


In [3]:
a[[1, 3]]


Out[3]:
array([ 3, -5])

Boolean indexing

To select data fulfilling specific criteria, one can use the bolean indexing. This is best illustrated on 1D arrays; for example, lets select only positive elements of a:


In [4]:
a[a > 0]


Out[4]:
array([3, 4, 5])

Note that the index array has the same size as and type of boolean:


In [5]:
print(a)
print(a > 0)


[-2  3  4 -5  5]
[False  True  True False  True]

Multiple criteria can be also combine in one query:


In [6]:
a[(a > 0) & (a < 5)]


Out[6]:
array([3, 4])
EXERCISE: Select all odd numbers from the array a

In [ ]:

EXERCISE: Select negative odd numbers from the array a

In [ ]:

Indexing pandas Series

Series can be indexed similarly to 1D NumPy array.


In [9]:
pop_dict = {'Germany': 81.3, 
            'Belgium': 11.3, 
            'France': 64.3, 
            'United Kingdom': 64.9, 
            'Netherlands': 16.9}
population = pd.Series(pop_dict)
print(population)


Belgium           11.3
France            64.3
Germany           81.3
Netherlands       16.9
United Kingdom    64.9
dtype: float64

We can use fancy indexing with the rich index:


In [10]:
population[['Netherlands', 'Germany']]


Out[10]:
Netherlands    16.9
Germany        81.3
dtype: float64

Similarly, boolean indexing can be used to filter the Series. Lets select countries with population of more than 20 millions:


In [11]:
population[population > 20]


Out[11]:
France            64.3
Germany           81.3
United Kingdom    64.9
dtype: float64

You can also do position-based indexing by using integers instead of labels:


In [12]:
population[:2]


Out[12]:
Belgium    11.3
France     64.3
dtype: float64

Indexing DataFrame


In [13]:
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries


Out[13]:
area capital country population
0 30510 Brussels Belgium 11.3
1 671308 Paris France 64.3
2 357050 Berlin Germany 81.3
3 41526 Amsterdam Netherlands 16.9
4 244820 London United Kingdom 64.9

In [14]:
countries = countries.set_index('country')
countries


Out[14]:
area capital population
country
Belgium 30510 Brussels 11.3
France 671308 Paris 64.3
Germany 357050 Berlin 81.3
Netherlands 41526 Amsterdam 16.9
United Kingdom 244820 London 64.9

Some notes on selecting data

Data frames allow for labeling rows and columns, but this makes indexing also a bit more complex compared to 1D NumPy's array and pandas Series. We now have to distuinguish between:

  • selection of rows or columns,
  • selection by label or position.

[] provides some convenience shortcuts

For a DataFrame, basic indexing selects the columns.

Selecting a single column:


In [15]:
countries['area']


Out[15]:
country
Belgium            30510
France            671308
Germany           357050
Netherlands        41526
United Kingdom    244820
Name: area, dtype: int64

or multiple columns using fancy indexing:


In [16]:
countries[['area', 'population']]


Out[16]:
area population
country
Belgium 30510 11.3
France 671308 64.3
Germany 357050 81.3
Netherlands 41526 16.9
United Kingdom 244820 64.9

But, slicing accesses the rows:


In [17]:
countries['France':'Netherlands']


Out[17]:
area capital population
country
France 671308 Paris 64.3
Germany 357050 Berlin 81.3
Netherlands 41526 Amsterdam 16.9

We can also select rows similarly to the boolean indexing in numpy. The boolean mask should be 1-dimensional and the same length as the thing being indexed. Boolean indexing of DataFrame can be used like the WHERE clause of SQL to select rows matching some criteria:


In [18]:
countries[countries['area'] > 100000]


Out[18]:
area capital population
country
France 671308 Paris 64.3
Germany 357050 Berlin 81.3
United Kingdom 244820 London 64.9

So as a summary, [] provides the following convenience shortcuts:

NumPy/`Series` `DataFrame`
Integer index
`data[label]`
single element single **column**
Slice
`data[label1:label2]`
sequence one or more **rows**
Fancy indexing
`data[[label1,label2]]`
sequence one or more **columns**
Boolean indexing
`data[mask]`
sequence one or more **rows**
EXERCISE: Calculate the area of Germany relative to the total area of all other countries in the data frame. *Hint*: you can compare the index of the data frame to any string

In [ ]:

Systematic indexing with loc and iloc

When using [] like above, you can only select from one axis at once (rows or columns, not both). For more advanced indexing, you have some extra attributes:

  • loc: selection by label
  • iloc: selection by position

These methods index the different dimensions of the frame:

  • df.loc[row_indexer, column_indexer]
  • df.iloc[row_indexer, column_indexer]

Selecting a single element:


In [20]:
countries.loc['Germany', 'area']


Out[20]:
357050

But the row or column indexer can also be a list, slice, boolean array, ..


In [21]:
countries.loc['France':'Germany', ['area', 'population']]


Out[21]:
area population
country
France 671308 64.3
Germany 357050 81.3

Selecting by position with iloc works similar as indexing numpy arrays:


In [22]:
countries.iloc[:2,1:3]


Out[22]:
capital population
country
Belgium Brussels 11.3
France Paris 64.3

The different indexing methods can also be used to assign data:


In [23]:
countries2 = countries.copy()
countries2.loc['Belgium':'Germany', 'population'] = 10

In [24]:
countries2


Out[24]:
area capital population
country
Belgium 30510 Brussels 10.0
France 671308 Paris 10.0
Germany 357050 Berlin 10.0
Netherlands 41526 Amsterdam 16.9
United Kingdom 244820 London 64.9

EXERCISE: Add a column `density` with the population density (note: population column is expressed in millions)

In [ ]:

EXERCISE: Select the capital and the population column of those countries where the density is larger than 300

In [ ]:

EXERCISE: List names, capitals and population densities of two countries with highest population density.

In [ ]:

EXERCISE: Change the capital of the UK to Cambridge

In [ ]:

EXERCISE: Select all countries whose population density is between 100 and 300 people/km²

In [ ]:

More exercises!

For the quick ones among you, here are some more exercises with some larger dataframe with film data. These exercises are based on the PyCon tutorial of Brandon Rhodes (so all credit to him!) and the datasets he prepared for that. You can download these data from here: titles.csv and cast.csv and put them in the /data folder.


In [30]:
cast = pd.read_csv('data/cast.csv')
cast.head()


Out[30]:
title year name type character n
0 Suuri illusioni 1985 Homo $ actor Guests 22
1 Gangsta Rap: The Glockumentary 2007 Too $hort actor Himself NaN
2 Menace II Society 1993 Too $hort actor Lew-Loc 27
3 Porndogs: The Adventures of Sadie 2009 Too $hort actor Bosco 3
4 Stop Pepper Palmer 2014 Too $hort actor Himself NaN

In [31]:
titles = pd.read_csv('data/titles.csv')
titles.head()


Out[31]:
title year
0 The Rising Son 1990
1 Ashes of Kukulcan 2016
2 The Thousand Plane Raid 1969
3 Crucea de piatra 1993
4 The 86 2015
EXERCISE: How many movies are listed in the titles dataframe?

In [ ]:

EXERCISE: What are the earliest two films listed in the titles dataframe?

In [ ]:

EXERCISE: How many movies have the title "Hamlet"?

In [ ]:

EXERCISE: List all of the "Treasure Island" movies from earliest to most recent.

In [ ]:

EXERCISE: How many movies were made from 1950 through 1959?

In [ ]:


In [ ]:

EXERCISE: How many roles in the movie "Inception" are NOT ranked by an "n" value?

In [ ]:

EXERCISE: But how many roles in the movie "Inception" did receive an "n" value?

In [ ]:

EXERCISE: Display the cast of "North by Northwest" in their correct "n"-value order, ignoring roles that did not earn a numeric "n" value.

In [ ]:

EXERCISE: How many roles were credited in the silent 1921 version of Hamlet?

In [ ]:

EXERCISE: List the supporting roles (having n=2) played by Cary Grant in the 1940s, in order by year.

In [ ]: