03 - Pandas: Indexing and selecting data - part I

DS Data manipulation, analysis and visualisation in Python
December, 2019

© 2016-2019, Joris Van den Bossche and Stijn Van Hoey (mailto:jorisvandenbossche@gmail.com, mailto:stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons



In [1]:
import pandas as pd

In [2]:
# redefining the example objects

# series
population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3, 
                        'United Kingdom': 64.9, 'Netherlands': 16.9})

# dataframe
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries


Out[2]:
country population area capital
0 Belgium 11.3 30510 Brussels
1 France 64.3 671308 Paris
2 Germany 81.3 357050 Berlin
3 Netherlands 16.9 41526 Amsterdam
4 United Kingdom 64.9 244820 London

Subsetting data

Subset variables (columns)

For a DataFrame, basic indexing selects the columns (cfr. the dictionaries of pure python)

Selecting a single column:


In [3]:
countries['area'] # single []


Out[3]:
0     30510
1    671308
2    357050
3     41526
4    244820
Name: area, dtype: int64

Remember that the same syntax can also be used to add a new columns: df['new'] = ....

We can also select multiple columns by passing a list of column names into []:


In [4]:
countries[['area', 'population']] # double [[]]


Out[4]:
area population
0 30510 11.3
1 671308 64.3
2 357050 81.3
3 41526 16.9
4 244820 64.9

Subset observations (rows)

Using [], slicing or boolean indexing accesses the rows:

Slicing


In [5]:
countries[0:4]


Out[5]:
country population area capital
0 Belgium 11.3 30510 Brussels
1 France 64.3 671308 Paris
2 Germany 81.3 357050 Berlin
3 Netherlands 16.9 41526 Amsterdam

Boolean indexing (filtering)

Often, you want to select rows based on a certain condition. This can be done with 'boolean indexing' (like a where clause in SQL) and comparable to numpy.

The indexer (or boolean mask) should be 1-dimensional and the same length as the thing being indexed.


In [6]:
countries['area'] > 100000


Out[6]:
0    False
1     True
2     True
3    False
4     True
Name: area, dtype: bool

In [7]:
countries[countries['area'] > 100000]


Out[7]:
country population area capital
1 France 64.3 671308 Paris
2 Germany 81.3 357050 Berlin
4 United Kingdom 64.9 244820 London

In [8]:
countries[countries['population'] > 50]


Out[8]:
country population area capital
1 France 64.3 671308 Paris
2 Germany 81.3 357050 Berlin
4 United Kingdom 64.9 244820 London

An overview of the possible comparison operations:

Operator Description
== Equal
!= Not equal
> Greater than
>= Greater than or equal
< Lesser than
!= Lesser than or equal

and to combine multiple conditions:

Operator Description
& And (cond1 & cond2)
| Or (cond1 | cond2)
REMEMBER:

So as a summary, `[]` provides the following convenience shortcuts: * **Series**: selecting a **label**: `s[label]` * **DataFrame**: selecting a single or multiple **columns**:`df['col']` or `df[['col1', 'col2']]` * **DataFrame**: slicing or filtering the **rows**: `df['row_label1':'row_label2']` or `df[mask]`

Some other useful methods: isin and string methods

The isin method of Series is very useful to select rows that may contain certain values:


In [9]:
s = countries['capital']

In [10]:
s.isin?


Signature: s.isin(values)
Docstring:
Check whether `values` are contained in Series.

Return a boolean Series showing whether each element in the Series
matches an element in the passed sequence of `values` exactly.

Parameters
----------
values : set or list-like
    The sequence of values to test. Passing in a single string will
    raise a ``TypeError``. Instead, turn a single string into a
    list of one element.

    .. versionadded:: 0.18.1

      Support for values as a set.

Returns
-------
Series
    Series of booleans indicating if each element is in values.

Raises
------
TypeError
  * If `values` is a string

See Also
--------
DataFrame.isin : Equivalent method on DataFrame.

Examples
--------
>>> s = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama',
...                'hippo'], name='animal')
>>> s.isin(['cow', 'lama'])
0     True
1     True
2     True
3    False
4     True
5    False
Name: animal, dtype: bool

Passing a single string as ``s.isin('lama')`` will raise an error. Use
a list of one element instead:

>>> s.isin(['lama'])
0     True
1    False
2     True
3    False
4     True
5    False
Name: animal, dtype: bool
File:      ~/miniconda3/envs/DS-python-data-analysis/lib/python3.7/site-packages/pandas/core/series.py
Type:      method

In [11]:
s.isin(['Berlin', 'London'])


Out[11]:
0    False
1    False
2     True
3    False
4     True
Name: capital, dtype: bool

This can then be used to filter the dataframe with boolean indexing:


In [12]:
countries[countries['capital'].isin(['Berlin', 'London'])]


Out[12]:
country population area capital
2 Germany 81.3 357050 Berlin
4 United Kingdom 64.9 244820 London

Let's say we want to select all data for which the capital starts with a 'B'. In Python, when having a string, we could use the startswith method:


In [13]:
string = 'Berlin'

In [14]:
string.startswith('B')


Out[14]:
True

In pandas, these are available on a Series through the str namespace:


In [15]:
countries['capital'].str.startswith('B')


Out[15]:
0     True
1    False
2     True
3    False
4    False
Name: capital, dtype: bool

For an overview of all string methods, see: http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling

Exercises using the Titanic dataset


In [16]:
df = pd.read_csv("../data/titanic.csv")

In [17]:
df.head()


Out[17]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
EXERCISE:
  • Select all rows for male passengers and calculate the mean age of those passengers. Do the same for the female passengers.

In [18]:
males = df[df['Sex'] == 'male']

In [19]:
males['Age'].mean()


Out[19]:
30.72664459161148

In [20]:
df[df['Sex'] == 'female']['Age'].mean()


Out[20]:
27.915708812260537

We will later see an easier way to calculate both averages at the same time with groupby.

EXERCISE:
  • How many passengers older than 70 were on the Titanic?

In [21]:
len(df[df['Age'] > 70])


Out[21]:
5

In [22]:
(df['Age'] > 70).sum()


Out[22]:
5
EXERCISE:
  • Select the passengers that are between 30 and 40 years old?

In [23]:
df[(df['Age'] > 30) & (df['Age'] <= 40)]


Out[23]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S
18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31.0 1 0 345763 18.0000 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
867 868 0 1 Roebling, Mr. Washington Augustus II male 31.0 0 0 PC 17590 50.4958 A24 S
872 873 0 1 Carlsson, Mr. Frans Olof male 33.0 0 0 695 5.0000 B51 B53 B55 S
881 882 0 3 Markun, Mr. Johann male 33.0 0 0 349257 7.8958 NaN S
885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

155 rows × 12 columns

EXERCISE:
  • Split the 'Name' column on the `,` extract the first part (the surname), and add this as new column 'Surname' .

Tip: try it first on a single string (and for this, check the `split` method of a string), and then try to 'apply' this on each row.

In [24]:
df['Surname'] = df['Name'].apply(lambda x: x.split(',')[0])
EXERCISE:
  • Select all passenger that have a surname starting with 'Williams'.

In [25]:
df[df['Surname'].str.startswith('Williams')]


Out[25]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Surname
17 18 1 2 Williams, Mr. Charles Eugene male NaN 0 0 244373 13.0000 NaN S Williams
155 156 0 1 Williams, Mr. Charles Duane male 51.0 0 1 PC 17597 61.3792 NaN C Williams
304 305 0 3 Williams, Mr. Howard Hugh "Harry" male NaN 0 0 A/5 2466 8.0500 NaN S Williams
351 352 0 1 Williams-Lambert, Mr. Fletcher Fellows male NaN 0 0 113510 35.0000 C128 S Williams-Lambert
735 736 0 3 Williams, Mr. Leslie male 28.5 0 0 54636 16.1000 NaN S Williams
EXERCISE:
  • Select all rows for the passengers with a surname of more than 15 characters.

In [26]:
df[df['Surname'].str.len() > 15]


Out[26]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Surname
307 308 1 1 Penasco y Castellana, Mrs. Victor de Satode (M... female 17.0 1 0 PC 17758 108.9000 C65 C Penasco y Castellana
351 352 0 1 Williams-Lambert, Mr. Fletcher Fellows male NaN 0 0 113510 35.0000 C128 S Williams-Lambert
430 431 1 1 Bjornstrom-Steffansson, Mr. Mauritz Hakan male 28.0 0 0 110564 26.5500 C52 S Bjornstrom-Steffansson
444 445 1 3 Johannesen-Bratthammer, Mr. Bernt male NaN 0 0 65306 8.1125 NaN S Johannesen-Bratthammer
505 506 0 1 Penasco y Castellana, Mr. Victor de Satode male 18.0 1 0 PC 17758 108.9000 C65 C Penasco y Castellana
587 588 1 1 Frolicher-Stehli, Mr. Maxmillian male 60.0 1 1 13567 79.2000 B41 C Frolicher-Stehli
632 633 1 1 Stahelin-Maeglin, Dr. Max male 32.0 0 0 13214 30.5000 B50 C Stahelin-Maeglin

In [ ]:

[OPTIONAL] more exercises

For the quick ones among you, here are some more exercises with some larger dataframe with film data. These exercises are based on the PyCon tutorial of Brandon Rhodes (so all credit to him!) and the datasets he prepared for that. You can download these data from here: titles.csv and cast.csv and put them in the /data folder.


In [27]:
cast = pd.read_csv('../data/cast.csv')
cast.head()


Out[27]:
title year name type character n
0 Suuri illusioni 1985 Homo $ actor Guests 22.0
1 Gangsta Rap: The Glockumentary 2007 Too $hort actor Himself NaN
2 Menace II Society 1993 Too $hort actor Lew-Loc 27.0
3 Porndogs: The Adventures of Sadie 2009 Too $hort actor Bosco 3.0
4 Stop Pepper Palmer 2014 Too $hort actor Himself NaN

In [28]:
titles = pd.read_csv('../data/titles.csv')
titles.head()


Out[28]:
title year
0 The Rising Son 1990
1 Ashes of Kukulcan 2016
2 The Thousand Plane Raid 1969
3 Crucea de piatra 1993
4 The 86 2015
EXERCISE:
  • How many movies are listed in the titles dataframe?

In [29]:
len(titles)


Out[29]:
215981
EXERCISE:
  • What are the earliest two films listed in the titles dataframe?

In [30]:
titles.sort_values('year').head(2)


Out[30]:
title year
165182 Miss Jerry 1894
85708 Reproduction of the Corbett and Fitzsimmons Fight 1897
EXERCISE:
  • How many movies have the title "Hamlet"?

In [31]:
len(titles[titles['title'] == 'Hamlet'])


Out[31]:
19
EXERCISE:
  • List all of the "Treasure Island" movies from earliest to most recent.

In [32]:
titles[titles.title == 'Treasure Island'].sort_values('year')


Out[32]:
title year
191379 Treasure Island 1918
47769 Treasure Island 1920
192917 Treasure Island 1934
90175 Treasure Island 1950
104714 Treasure Island 1972
103646 Treasure Island 1973
190792 Treasure Island 1985
166675 Treasure Island 1999
EXERCISE:
  • How many movies were made from 1950 through 1959?

In [33]:
len(titles[(titles['year'] >= 1950) & (titles['year'] <= 1959)])


Out[33]:
12120

In [34]:
len(titles[titles['year'] // 10 == 195])


Out[34]:
12120
EXERCISE:
  • How many roles in the movie "Inception" are NOT ranked by an "n" value?

In [35]:
inception = cast[cast['title'] == 'Inception']

In [36]:
len(inception[inception['n'].isnull()])


Out[36]:
22

In [37]:
inception['n'].isnull().sum()


Out[37]:
22
EXERCISE:
  • But how many roles in the movie "Inception" did receive an "n" value?

In [38]:
len(inception[inception['n'].notnull()])


Out[38]:
51
EXERCISE:
  • Display the cast of the "Titanic" (the most famous 1997 one) in their correct "n"-value order, ignoring roles that did not earn a numeric "n" value.

In [39]:
titanic = cast[(cast['title'] == 'Titanic') & (cast['year'] == 1997)]
titanic = titanic[titanic['n'].notnull()]
titanic.sort_values('n')


Out[39]:
title year name type character n
527123 Titanic 1997 Leonardo DiCaprio actor Jack Dawson 1.0
3304287 Titanic 1997 Kate Winslet actress Rose DeWitt Bukater 2.0
2241466 Titanic 1997 Billy Zane actor Caledon 'Cal' Hockley 3.0
2323120 Titanic 1997 Kathy Bates actress Molly Brown 4.0
2567572 Titanic 1997 Frances Fisher actress Ruth Dewitt Bukater 5.0
... ... ... ... ... ... ...
2678999 Titanic 1997 Andie Hicks actress Dancer 113.0
898633 Titanic 1997 Scott Hislop actor Dancer 114.0
1315604 Titanic 1997 Stan Mazin actor Dancer 115.0
3054629 Titanic 1997 Lisa Ratzin actress Dancer 116.0
3062849 Titanic 1997 Julene Renee actress Dancer 117.0

112 rows × 6 columns

EXERCISE:
  • List the supporting roles (having n=2) played by Brad Pitt in the 1990s, in order by year.

In [40]:
brad = cast[cast['name'] == 'Brad Pitt']
brad = brad[brad['year'] // 10 == 199]
brad = brad[brad['n'] == 2]
brad.sort_values('year')


Out[40]:
title year name type character n
1614615 Across the Tracks 1990 Brad Pitt actor Joe Maloney 2.0
1614613 A River Runs Through It 1992 Brad Pitt actor Paul Maclean 2.0
1614657 The Devil's Own 1997 Brad Pitt actor Rory Devaney 2.0
1614658 The Devil's Own 1997 Brad Pitt actor Francis Austin McGuire 2.0
1614623 Fight Club 1999 Brad Pitt actor Tyler Durden 2.0

Acknowledgement

The optional exercises are based on the PyCon tutorial of Brandon Rhodes (so all credit to him!) and the datasets he prepared for that.