03 - Pandas: Indexing and selecting data - part I
DS Data manipulation, analysis and visualisation in Python
December, 2019© 2016-2019, Joris Van den Bossche and Stijn Van Hoey (mailto:jorisvandenbossche@gmail.com, mailto:stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons
In [1]:
import pandas as pd
In [2]:
# redefining the example objects
# series
population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3,
'United Kingdom': 64.9, 'Netherlands': 16.9})
# dataframe
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
'population': [11.3, 64.3, 81.3, 16.9, 64.9],
'area': [30510, 671308, 357050, 41526, 244820],
'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries
Out[2]:
For a DataFrame, basic indexing selects the columns (cfr. the dictionaries of pure python)
Selecting a single column:
In [3]:
countries['area'] # single []
Out[3]:
Remember that the same syntax can also be used to add a new columns: df['new'] = ...
.
We can also select multiple columns by passing a list of column names into []
:
In [4]:
countries[['area', 'population']] # double [[]]
Out[4]:
Using []
, slicing or boolean indexing accesses the rows:
In [5]:
countries[0:4]
Out[5]:
Often, you want to select rows based on a certain condition. This can be done with 'boolean indexing' (like a where clause in SQL) and comparable to numpy.
The indexer (or boolean mask) should be 1-dimensional and the same length as the thing being indexed.
In [6]:
countries['area'] > 100000
Out[6]:
In [7]:
countries[countries['area'] > 100000]
Out[7]:
In [8]:
countries[countries['population'] > 50]
Out[8]:
An overview of the possible comparison operations:
Operator | Description |
---|---|
== | Equal |
!= | Not equal |
> | Greater than |
>= | Greater than or equal |
< | Lesser than |
!= | Lesser than or equal |
and to combine multiple conditions:
Operator | Description |
---|---|
& | And (cond1 & cond2 ) |
| | Or (cond1 | cond2 ) |
The isin
method of Series is very useful to select rows that may contain certain values:
In [9]:
s = countries['capital']
In [10]:
s.isin?
In [11]:
s.isin(['Berlin', 'London'])
Out[11]:
This can then be used to filter the dataframe with boolean indexing:
In [12]:
countries[countries['capital'].isin(['Berlin', 'London'])]
Out[12]:
Let's say we want to select all data for which the capital starts with a 'B'. In Python, when having a string, we could use the startswith
method:
In [13]:
string = 'Berlin'
In [14]:
string.startswith('B')
Out[14]:
In pandas, these are available on a Series through the str
namespace:
In [15]:
countries['capital'].str.startswith('B')
Out[15]:
For an overview of all string methods, see: http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling
In [16]:
df = pd.read_csv("../data/titanic.csv")
In [17]:
df.head()
Out[17]:
In [18]:
males = df[df['Sex'] == 'male']
In [19]:
males['Age'].mean()
Out[19]:
In [20]:
df[df['Sex'] == 'female']['Age'].mean()
Out[20]:
We will later see an easier way to calculate both averages at the same time with groupby.
In [21]:
len(df[df['Age'] > 70])
Out[21]:
In [22]:
(df['Age'] > 70).sum()
Out[22]:
In [23]:
df[(df['Age'] > 30) & (df['Age'] <= 40)]
Out[23]:
In [24]:
df['Surname'] = df['Name'].apply(lambda x: x.split(',')[0])
In [25]:
df[df['Surname'].str.startswith('Williams')]
Out[25]:
In [26]:
df[df['Surname'].str.len() > 15]
Out[26]:
In [ ]:
For the quick ones among you, here are some more exercises with some larger dataframe with film data. These exercises are based on the PyCon tutorial of Brandon Rhodes (so all credit to him!) and the datasets he prepared for that. You can download these data from here: titles.csv
and cast.csv
and put them in the /data
folder.
In [27]:
cast = pd.read_csv('../data/cast.csv')
cast.head()
Out[27]:
In [28]:
titles = pd.read_csv('../data/titles.csv')
titles.head()
Out[28]:
In [29]:
len(titles)
Out[29]:
In [30]:
titles.sort_values('year').head(2)
Out[30]:
In [31]:
len(titles[titles['title'] == 'Hamlet'])
Out[31]:
In [32]:
titles[titles.title == 'Treasure Island'].sort_values('year')
Out[32]:
In [33]:
len(titles[(titles['year'] >= 1950) & (titles['year'] <= 1959)])
Out[33]:
In [34]:
len(titles[titles['year'] // 10 == 195])
Out[34]:
In [35]:
inception = cast[cast['title'] == 'Inception']
In [36]:
len(inception[inception['n'].isnull()])
Out[36]:
In [37]:
inception['n'].isnull().sum()
Out[37]:
In [38]:
len(inception[inception['n'].notnull()])
Out[38]:
In [39]:
titanic = cast[(cast['title'] == 'Titanic') & (cast['year'] == 1997)]
titanic = titanic[titanic['n'].notnull()]
titanic.sort_values('n')
Out[39]:
In [40]:
brad = cast[cast['name'] == 'Brad Pitt']
brad = brad[brad['year'] // 10 == 199]
brad = brad[brad['n'] == 2]
brad.sort_values('year')
Out[40]:
The optional exercises are based on the PyCon tutorial of Brandon Rhodes (so all credit to him!) and the datasets he prepared for that.