In [ ]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
try:
    import seaborn
except ImportError:
    pass
In [ ]:
# redefining the example objects
# series
population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3,
                        'United Kingdom': 64.9, 'Netherlands': 16.9})
# dataframe
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries
Setting the index to the country names:
In [ ]:
countries = countries.set_index('country')
countries
For a DataFrame, basic indexing selects the columns.
Selecting a single column:
In [ ]:
countries['area']
or multiple columns:
In [ ]:
countries[['area', 'population']]
But slicing with [] accesses the rows:
In [ ]:
countries['France':'Netherlands']
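One detail of the slice above is easy to miss: a slice on labels includes both endpoints, whereas a slice on integer positions excludes the stop. The following is a minimal sketch of that difference; `countries_demo` is a hypothetical stand-in for the `countries` frame, rebuilt here so the block runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for the countries frame, indexed by country name
countries_demo = pd.DataFrame(
    {'population': [11.3, 64.3, 81.3, 16.9, 64.9],
     'area': [30510, 671308, 357050, 41526, 244820]},
    index=['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'])

# Label-based slice: both 'France' and 'Netherlands' are included
by_label = countries_demo['France':'Netherlands']
print(by_label.index.tolist())

# Position-based slice over the same rows: the stop position 4 is excluded
by_position = countries_demo[1:4]
print(by_position.index.tolist())
```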
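So as a summary, [] provides the following convenience shortcuts:
s[label] selects an element of a Series by label
df['col'] or df[['col1', 'col2']] selects one or several columns of a DataFrame
df['row_label1':'row_label2'] or df[mask] selects rows of a DataFrame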
When using [] like above, you can only select from one axis at once (rows or columns, not both). For more advanced indexing, you have some extra attributes:
loc: selection by label
iloc: selection by position
Both index the different dimensions of the frame:
df.loc[row_indexer, column_indexer]
df.iloc[row_indexer, column_indexer]
Selecting a single element:
In [ ]:
countries.loc['Germany', 'area']
But the row or column indexer can also be a list, a slice, a boolean array, ...
In [ ]:
countries.loc['France':'Germany', ['area', 'population']]
Selecting by position with iloc works similarly to indexing NumPy arrays:
In [ ]:
countries.iloc[0:2,1:3]
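To make the relation between the two indexers concrete, here is a small sketch that selects the same block of cells once by position and once by label. It assumes a frame with the same values and column order as `countries`; the name `countries_demo` is only for illustration.

```python
import pandas as pd

# Hypothetical copy of the countries frame (same values as in the example above)
countries_demo = pd.DataFrame(
    {'population': [11.3, 64.3, 81.3, 16.9, 64.9],
     'area': [30510, 671308, 357050, 41526, 244820],
     'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']},
    index=['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'])

# Positional: rows 0 and 1, columns 1 and 2 (stop positions are excluded)
by_position = countries_demo.iloc[0:2, 1:3]

# Label-based: the same rows and columns, with both slice endpoints included
by_label = countries_demo.loc['Belgium':'France', 'area':'capital']

# Both selections contain the same data
print(by_position.equals(by_label))
```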
The different indexing methods can also be used to assign data:
In [ ]:
countries2 = countries.copy()
countries2.loc['Belgium':'Germany', 'population'] = 10
In [ ]:
countries2
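Assignment works through iloc as well as loc. The sketch below uses a shortened, hypothetical frame (`countries_demo`) and an explicit copy (`updated`), so the effect of each assignment can be compared against the untouched original.

```python
import pandas as pd

# Shortened, hypothetical version of the countries frame
countries_demo = pd.DataFrame(
    {'population': [11.3, 64.3, 81.3],
     'area': [30510, 671308, 357050]},
    index=['Belgium', 'France', 'Germany'])

# Work on an explicit copy so the original frame stays untouched
updated = countries_demo.copy()

# Positional assignment: first two rows of the 'population' column
pop_position = updated.columns.get_loc('population')
updated.iloc[0:2, pop_position] = 10.0

# Label-based assignment of a single cell
updated.loc['Germany', 'area'] = 0

print(updated)
print(countries_demo)
```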
Often, you want to select rows based on a certain condition. This can be done with 'boolean indexing' (like a where clause in SQL).
The indexer (or boolean mask) should be 1-dimensional and the same length as the thing being indexed.
In [ ]:
countries['area'] > 100000
In [ ]:
countries[countries['area'] > 100000]
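Several conditions can be combined into one mask with the element-wise operators `&` (and), `|` (or) and `~` (not). Each condition needs its own parentheses, because these operators bind more tightly than comparisons. A minimal sketch, assuming a frame with the same values as `countries` (the names `countries_demo` and `mask` are illustrative):

```python
import pandas as pd

# Hypothetical stand-in for the countries frame
countries_demo = pd.DataFrame(
    {'population': [11.3, 64.3, 81.3, 16.9, 64.9],
     'area': [30510, 671308, 357050, 41526, 244820]},
    index=['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'])

# Element-wise "and" of two conditions; note the parentheses around each one
mask = (countries_demo['area'] > 100000) & (countries_demo['population'] < 70)

# The mask is a boolean Series with the same index as the frame
print(mask)

# Only the rows where the mask is True are kept
print(countries_demo[mask])
```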
The isin method of Series is very useful for selecting rows whose values appear in a given list:
In [ ]:
s = countries['capital']
In [ ]:
s.isin?
In [ ]:
s.isin(['Berlin', 'London'])
This can then be used to filter the dataframe with boolean indexing:
In [ ]:
countries[countries['capital'].isin(['Berlin', 'London'])]
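The opposite selection, rows whose value is not in the list, can be obtained by negating the mask with `~`. A short sketch under the same assumptions as before (`countries_demo` is a hypothetical stand-in for `countries`):

```python
import pandas as pd

# Hypothetical stand-in for the countries frame
countries_demo = pd.DataFrame(
    {'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']},
    index=['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'])

# True where the capital is one of the listed values
in_list = countries_demo['capital'].isin(['Berlin', 'London'])

# ~ flips every element of the boolean mask
not_in_list = countries_demo[~in_list]
print(not_in_list)
```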
Let's say we want to select all data for which the capital starts with a 'B'. In plain Python, a string has the startswith method for this:
In [ ]:
'Berlin'.startswith('B')
In pandas, such string methods are available on a Series through the str namespace:
In [ ]:
countries['capital'].str.startswith('B')
For an overview of all string methods, see: http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling
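Since `str.startswith` returns a boolean Series, it can be passed straight to [] as a mask. The sketch below shows this on a hypothetical stand-in for `countries` (named `countries_demo` here):

```python
import pandas as pd

# Hypothetical stand-in for the countries frame
countries_demo = pd.DataFrame(
    {'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']},
    index=['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'])

# Boolean mask: True where the capital starts with 'B'
starts_with_b = countries_demo['capital'].str.startswith('B')

# Use the mask to keep only the matching rows
print(countries_demo[starts_with_b])
```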
In [ ]:
countries.loc['Belgium', 'capital'] = 'Ghent'
In [ ]:
countries
In [ ]:
countries['capital']['Belgium'] = 'Antwerp'
In [ ]:
countries
In [ ]:
countries[countries['capital'] == 'Antwerp']['capital'] = 'Brussels'
In [ ]:
countries
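In the last assignment above, the boolean selection returns a new object, so setting a column on it does not reach the original frame. Passing the row mask and the column label to loc in one step avoids the intermediate object. A minimal sketch on a hypothetical frame, `capitals_demo`:

```python
import pandas as pd

# Hypothetical frame with one capital that needs correcting
capitals_demo = pd.DataFrame(
    {'capital': ['Antwerp', 'Paris', 'Berlin']},
    index=['Belgium', 'France', 'Germany'])

# Row mask and column label in a single .loc call: one indexing step,
# so the assignment is applied to capitals_demo itself
capitals_demo.loc[capitals_demo['capital'] == 'Antwerp', 'capital'] = 'Brussels'

print(capitals_demo.loc['Belgium', 'capital'])
```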
How to avoid this?
Use loc instead of chained indexing if possible!
Or copy explicitly if you don't want to change the original data.
For the quick ones among you, here are some more exercises with a larger dataframe of film data. These exercises are based on the PyCon tutorial of Brandon Rhodes (so all credit to him!) and the datasets he prepared for that. You can download these data from here: titles.csv and cast.csv, and put them in the data/ folder.
In [ ]:
cast = pd.read_csv('data/cast.csv')
cast.head()
In [ ]:
titles = pd.read_csv('data/titles.csv')
titles.head()