Indexing and selecting data



In [1]:

    
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

More on NumPy indexing



In [2]:

    
a = np.array([-2, 3, 4, -5, 5])
print(a)









    



[-2  3  4 -5  5]

Fancy indexing

Apart from indexing with integers and slices NumPy also supports indexing with arrays of integers (so-called fancy indexing). For example, to get the 2nd and 4th element of a:



In [3]:

    
a[[1, 3]]









    Out[3]:





array([ 3, -5])

Boolean indexing

To select data fulfilling specific criteria, one can use the bolean indexing. This is best illustrated on 1D arrays; for example, lets select only positive elements of a:



In [4]:

    
a[a > 0]









    Out[4]:





array([3, 4, 5])

Note that the index array has the same size as and type of boolean:



In [5]:

    
print(a)
print(a > 0)









    



[-2  3  4 -5  5]
[False  True  True False  True]

Multiple criteria can be also combine in one query:



In [6]:

    
a[(a > 0) & (a < 5)]









    Out[6]:





array([3, 4])

EXERCISE: Select all odd numbers from the array a



In [ ]:

EXERCISE: Select negative odd numbers from the array a



In [ ]:

Indexing pandas `Series`

Series can be indexed similarly to 1D NumPy array.



In [9]:

    
pop_dict = {'Germany': 81.3, 
            'Belgium': 11.3, 
            'France': 64.3, 
            'United Kingdom': 64.9, 
            'Netherlands': 16.9}
population = pd.Series(pop_dict)
print(population)









    



Belgium           11.3
France            64.3
Germany           81.3
Netherlands       16.9
United Kingdom    64.9
dtype: float64

We can use fancy indexing with the rich index:



In [10]:

    
population[['Netherlands', 'Germany']]









    Out[10]:





Netherlands    16.9
Germany        81.3
dtype: float64

Similarly, boolean indexing can be used to filter the Series. Lets select countries with population of more than 20 millions:



In [11]:

    
population[population > 20]









    Out[11]:





France            64.3
Germany           81.3
United Kingdom    64.9
dtype: float64

You can also do position-based indexing by using integers instead of labels:



In [12]:

    
population[:2]









    Out[12]:





Belgium    11.3
France     64.3
dtype: float64

Indexing `DataFrame`



In [13]:

    
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries









    Out[13]:






  
    
      
      area
      capital
      country
      population
    
  
  
    
      0
      30510
      Brussels
      Belgium
      11.3
    
    
      1
      671308
      Paris
      France
      64.3
    
    
      2
      357050
      Berlin
      Germany
      81.3
    
    
      3
      41526
      Amsterdam
      Netherlands
      16.9
    
    
      4
      244820
      London
      United Kingdom
      64.9



In [14]:

    
countries = countries.set_index('country')
countries









    Out[14]:






  
    
      
      area
      capital
      population
    
    
      country
      
      
      
    
  
  
    
      Belgium
      30510
      Brussels
      11.3
    
    
      France
      671308
      Paris
      64.3
    
    
      Germany
      357050
      Berlin
      81.3
    
    
      Netherlands
      41526
      Amsterdam
      16.9
    
    
      United Kingdom
      244820
      London
      64.9

Some notes on selecting data

Data frames allow for labeling rows and columns, but this makes indexing also a bit more complex compared to 1D NumPy's array and pandas Series. We now have to distuinguish between:

selection of rows or columns,
selection by label or position.

`[]` provides some convenience shortcuts

For a DataFrame, basic indexing selects the columns.

Selecting a single column:



In [15]:

    
countries['area']









    Out[15]:





country
Belgium            30510
France            671308
Germany           357050
Netherlands        41526
United Kingdom    244820
Name: area, dtype: int64

or multiple columns using fancy indexing:



In [16]:

    
countries[['area', 'population']]









    Out[16]:






  
    
      
      area
      population
    
    
      country
      
      
    
  
  
    
      Belgium
      30510
      11.3
    
    
      France
      671308
      64.3
    
    
      Germany
      357050
      81.3
    
    
      Netherlands
      41526
      16.9
    
    
      United Kingdom
      244820
      64.9

But, slicing accesses the rows:



In [17]:

    
countries['France':'Netherlands']









    Out[17]:






  
    
      
      area
      capital
      population
    
    
      country
      
      
      
    
  
  
    
      France
      671308
      Paris
      64.3
    
    
      Germany
      357050
      Berlin
      81.3
    
    
      Netherlands
      41526
      Amsterdam
      16.9

We can also select rows similarly to the boolean indexing in numpy. The boolean mask should be 1-dimensional and the same length as the thing being indexed. Boolean indexing of DataFrame can be used like the WHERE clause of SQL to select rows matching some criteria:



In [18]:

    
countries[countries['area'] > 100000]









    Out[18]:






  
    
      
      area
      capital
      population
    
    
      country
      
      
      
    
  
  
    
      France
      671308
      Paris
      64.3
    
    
      Germany
      357050
      Berlin
      81.3
    
    
      United Kingdom
      244820
      London
      64.9

So as a summary, [] provides the following convenience shortcuts:

	NumPy/`Series`	`DataFrame`
Integer index `data[label]`	single element	single column
Slice `data[label1:label2]`	sequence	one or more rows
Fancy indexing `data[[label1,label2]]`	sequence	one or more columns
Boolean indexing `data[mask]`	sequence	one or more rows

EXERCISE: Calculate the area of Germany relative to the total area of all other countries in the data frame. *Hint*: you can compare the index of the data frame to any string



In [ ]:

Systematic indexing with `loc` and `iloc`

When using [] like above, you can only select from one axis at once (rows or columns, not both). For more advanced indexing, you have some extra attributes:

loc: selection by label
iloc: selection by position

These methods index the different dimensions of the frame:

df.loc[row_indexer, column_indexer]
df.iloc[row_indexer, column_indexer]

Selecting a single element:



In [20]:

    
countries.loc['Germany', 'area']









    Out[20]:





357050

But the row or column indexer can also be a list, slice, boolean array, ..



In [21]:

    
countries.loc['France':'Germany', ['area', 'population']]









    Out[21]:






  
    
      
      area
      population
    
    
      country
      
      
    
  
  
    
      France
      671308
      64.3
    
    
      Germany
      357050
      81.3

Selecting by position with iloc works similar as indexing numpy arrays:



In [22]:

    
countries.iloc[:2,1:3]









    Out[22]:






  
    
      
      capital
      population
    
    
      country
      
      
    
  
  
    
      Belgium
      Brussels
      11.3
    
    
      France
      Paris
      64.3

The different indexing methods can also be used to assign data:



In [23]:

    
countries2 = countries.copy()
countries2.loc['Belgium':'Germany', 'population'] = 10



In [24]:

    
countries2









    Out[24]:






  
    
      
      area
      capital
      population
    
    
      country
      
      
      
    
  
  
    
      Belgium
      30510
      Brussels
      10.0
    
    
      France
      671308
      Paris
      10.0
    
    
      Germany
      357050
      Berlin
      10.0
    
    
      Netherlands
      41526
      Amsterdam
      16.9
    
    
      United Kingdom
      244820
      London
      64.9

EXERCISE: Add a column `density` with the population density (note: population column is expressed in millions)



In [ ]:

EXERCISE: Select the capital and the population column of those countries where the density is larger than 300



In [ ]:

EXERCISE: List names, capitals and population densities of two countries with highest population density.



In [ ]:

EXERCISE: Change the capital of the UK to Cambridge



In [ ]:

EXERCISE: Select all countries whose population density is between 100 and 300 people/km²



In [ ]:

More exercises!

For the quick ones among you, here are some more exercises with some larger dataframe with film data. These exercises are based on the PyCon tutorial of Brandon Rhodes (so all credit to him!) and the datasets he prepared for that. You can download these data from here: titles.csv and cast.csv and put them in the /data folder.



In [30]:

    
cast = pd.read_csv('data/cast.csv')
cast.head()









    Out[30]:






  
    
      
      title
      year
      name
      type
      character
      n
    
  
  
    
      0
      Suuri illusioni
      1985
      Homo $
      actor
      Guests
      22
    
    
      1
      Gangsta Rap: The Glockumentary
      2007
      Too $hort
      actor
      Himself
      NaN
    
    
      2
      Menace II Society
      1993
      Too $hort
      actor
      Lew-Loc
      27
    
    
      3
      Porndogs: The Adventures of Sadie
      2009
      Too $hort
      actor
      Bosco
      3
    
    
      4
      Stop Pepper Palmer
      2014
      Too $hort
      actor
      Himself
      NaN



In [31]:

    
titles = pd.read_csv('data/titles.csv')
titles.head()









    Out[31]:






  
    
      
      title
      year
    
  
  
    
      0
      The Rising Son
      1990
    
    
      1
      Ashes of Kukulcan
      2016
    
    
      2
      The Thousand Plane Raid
      1969
    
    
      3
      Crucea de piatra
      1993
    
    
      4
      The 86
      2015

EXERCISE: How many movies are listed in the titles dataframe?



In [ ]:

EXERCISE: What are the earliest two films listed in the titles dataframe?



In [ ]:

EXERCISE: How many movies have the title "Hamlet"?



In [ ]:

EXERCISE: List all of the "Treasure Island" movies from earliest to most recent.



In [ ]:

EXERCISE: How many movies were made from 1950 through 1959?



In [ ]:



In [ ]:

EXERCISE: How many roles in the movie "Inception" are NOT ranked by an "n" value?



In [ ]:

EXERCISE: But how many roles in the movie "Inception" did receive an "n" value?



In [ ]:

EXERCISE: Display the cast of "North by Northwest" in their correct "n"-value order, ignoring roles that did not earn a numeric "n" value.



In [ ]:

EXERCISE: How many roles were credited in the silent 1921 version of Hamlet?



In [ ]:

EXERCISE: List the supporting roles (having n=2) played by Cary Grant in the 1940s, in order by year.



In [ ]:

	area	capital	country	population
0	30510	Brussels	Belgium	11.3
1	671308	Paris	France	64.3
2	357050	Berlin	Germany	81.3
3	41526	Amsterdam	Netherlands	16.9
4	244820	London	United Kingdom	64.9

	area	capital	population
country
Belgium	30510	Brussels	11.3
France	671308	Paris	64.3
Germany	357050	Berlin	81.3
Netherlands	41526	Amsterdam	16.9
United Kingdom	244820	London	64.9

	title	year	name	type	character	n
0	Suuri illusioni	1985	Homo $	actor	Guests	22
1	Gangsta Rap: The Glockumentary	2007	Too $hort	actor	Himself	NaN
2	Menace II Society	1993	Too $hort	actor	Lew-Loc	27
3	Porndogs: The Adventures of Sadie	2009	Too $hort	actor	Bosco	3
4	Stop Pepper Palmer	2014	Too $hort	actor	Himself	NaN

	title	year
0	The Rising Son	1990
1	Ashes of Kukulcan	2016
2	The Thousand Plane Raid	1969
3	Crucea de piatra	1993
4	The 86	2015