In [ ]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
try:
    import seaborn
except ImportError:
    pass
In [ ]:
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries
In [ ]:
countries = countries.set_index('country')
countries
For a DataFrame, basic indexing selects the columns.
Selecting a single column:
In [ ]:
countries['area']
or multiple columns:
In [ ]:
countries[['area', 'population']]
Slicing, however, selects rows:
In [ ]:
countries['France':'Netherlands']
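So as a summary, [] provides the following convenience shortcuts:
Series: selecting a label: s[label]
DataFrame: selecting a single column or multiple columns: df['col'] or df[['col1', 'col2']]
DataFrame: slicing or filtering the rows: df['row_label1':'row_label2'] or df[mask]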
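The four shortcuts can be tried side by side. The following is a minimal sketch on a reduced copy of the countries data; the names df, col, sub, rows and big are illustrative and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Reduced copy of the countries data, indexed by country name.
df = pd.DataFrame(
    {"population": [11.3, 64.3, 81.3], "area": [30510, 671308, 357050]},
    index=["Belgium", "France", "Germany"],
)

col = df["area"]                   # single label: one column, returned as a Series
sub = df[["area", "population"]]   # list of labels: several columns, returned as a DataFrame
rows = df["Belgium":"France"]      # label slice: rows, with the end label included
big = df[df["area"] > 100000]      # boolean mask: rows where the mask is True

print(type(col).__name__, list(sub.columns), list(rows.index), list(big.index))
```

Note that the same brackets pick columns or rows depending on what is passed in, which is why loc and iloc are usually clearer for anything beyond these simple cases.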
When using [] like above, you can only select from one axis at once (rows or columns, not both). For more advanced indexing, you have some extra attributes:
loc: selection by label
iloc: selection by position
These methods index the different dimensions of the frame:
df.loc[row_indexer, column_indexer]
df.iloc[row_indexer, column_indexer]
Selecting a single element:
In [ ]:
countries.loc['Germany', 'area']
But the row or column indexer can also be a list, a slice, a boolean array, etc.:
In [ ]:
countries.loc['France':'Germany', ['area', 'population']]
Selecting by position with iloc works similarly to indexing numpy arrays:
In [ ]:
countries.iloc[0:2,1:3]
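One difference between the two is worth keeping in mind: a loc slice includes its end label, while an iloc slice excludes its end position, as in numpy. A small sketch on a reduced copy of the countries data (df, by_label and by_position are illustrative names):

```python
import pandas as pd

df = pd.DataFrame(
    {"population": [11.3, 64.3, 81.3, 16.9],
     "area": [30510, 671308, 357050, 41526],
     "capital": ["Brussels", "Paris", "Berlin", "Amsterdam"]},
    index=["Belgium", "France", "Germany", "Netherlands"],
)

# Label slices are closed on both ends: 'France' and 'area' are included.
by_label = df.loc["Belgium":"France", "population":"area"]

# Position slices are half-open: row 2 and column 2 are excluded.
by_position = df.iloc[0:2, 0:2]

print(by_label.equals(by_position))  # both select the same 2x2 block
```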
The different indexing methods can also be used to assign data:
In [ ]:
countries2 = countries.copy()
countries2.loc['Belgium':'Germany', 'population'] = 10
In [ ]:
countries2
Boolean indexing selects the rows for which a condition holds, like a where clause in SQL. The indexer (or boolean mask) should be 1-dimensional and the same length as the thing being indexed.
In [ ]:
countries['area'] > 100000
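Masks can be combined element-wise with & (and), | (or) and ~ (not). Each condition needs its own parentheses, because these operators bind more tightly than comparisons. A minimal sketch that rebuilds the countries data so it runs on its own (df, mask and selected are illustrative names, and the thresholds are arbitrary):

```python
import pandas as pd

df = pd.DataFrame(
    {"population": [11.3, 64.3, 81.3, 16.9, 64.9],
     "area": [30510, 671308, 357050, 41526, 244820]},
    index=["Belgium", "France", "Germany", "Netherlands", "United Kingdom"],
)

# Both conditions must hold; the parentheses around each comparison are required.
mask = (df["area"] > 100000) & (df["population"] < 70)

selected = df[mask]      # rows where the mask is True
rejected = df[~mask]     # ~ inverts the mask

print(list(selected.index))
```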
The isin method of Series is very useful to select rows that may contain certain values:
In [ ]:
s = countries['capital']
In [ ]:
s.isin?
In [ ]:
s.isin(['Berlin', 'London'])
This can then be used to filter the dataframe with boolean indexing:
In [ ]:
countries[countries['capital'].isin(['Berlin', 'London'])]
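To make explicit what isin computes: it gives the same mask as chaining equality tests with |, and it can be inverted with ~ to keep everything except the listed values. A short sketch on a standalone Series (capitals, wanted and the mask names are illustrative):

```python
import pandas as pd

capitals = pd.Series(
    ["Brussels", "Paris", "Berlin", "Amsterdam", "London"],
    index=["Belgium", "France", "Germany", "Netherlands", "United Kingdom"],
)

wanted = ["Berlin", "London"]
mask_isin = capitals.isin(wanted)

# The same mask written out by hand; this gets unwieldy as the list grows.
mask_or = (capitals == "Berlin") | (capitals == "London")

# Inverting the mask keeps the rows whose value is NOT in the list.
others = capitals[~mask_isin]

print(mask_isin.equals(mask_or), list(others.index))
```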
Let's say we want to select all data for which the capital starts with a 'B'. In Python, given a string, we could use the startswith method:
In [ ]:
'Berlin'.startswith('B')
In pandas, these are available on a Series through the str namespace:
In [ ]:
countries['capital'].str.startswith('B')
For an overview of all string methods, see: http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling
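A few other methods from the str namespace, as a rough sketch on a standalone Series (capitals and the result names are illustrative). Methods that return booleans, such as startswith and contains, produce masks that can be used directly for boolean indexing:

```python
import pandas as pd

capitals = pd.Series(
    ["Brussels", "Paris", "Berlin", "Amsterdam", "London"],
    index=["Belgium", "France", "Germany", "Netherlands", "United Kingdom"],
)

starts_b = capitals.str.startswith("B")           # element-wise str.startswith
lengths = capitals.str.len()                      # number of characters per value
has_on = capitals.str.contains("on", regex=False) # plain substring test, no regex

# A boolean result can be used as a mask.
b_capitals = capitals[starts_b]

print(list(b_capitals.index), lengths.tolist())
```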
In [ ]:
countries.loc['Belgium', 'capital'] = 'Ghent'
In [ ]:
countries
In [ ]:
countries['capital']['Belgium'] = 'Antwerp'
In [ ]:
countries
In [ ]:
countries[countries['capital'] == 'Antwerp']['capital'] = 'Brussels'
In [ ]:
countries
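The last two assignments above use chained indexing (one [] applied to the result of another), so the value may be written to a temporary object instead of to countries itself; the exact behaviour depends on the pandas version. A minimal sketch of the single-step loc form and of an explicit copy, on a small hypothetical frame (df and subset are illustrative names):

```python
import pandas as pd

df = pd.DataFrame(
    {"capital": ["Antwerp", "Paris"], "area": [30510, 671308]},
    index=["Belgium", "France"],
)

# One loc call with a row mask and a column label: the assignment
# targets df directly, with no intermediate object involved.
df.loc[df["capital"] == "Antwerp", "capital"] = "Brussels"

# An explicit copy makes it unambiguous that later edits
# are not meant to reach the original frame.
subset = df[df["area"] > 100000].copy()
subset.loc["France", "area"] = 0

print(df.loc["Belgium", "capital"], df.loc["France", "area"])
```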
How to avoid this?
Use loc instead of chained indexing if possible!
Or copy explicitly if you don't want to change the original data.

For the quick ones among you, here are some more exercises with a larger dataframe of film data. These exercises are based on the PyCon tutorial of Brandon Rhodes (so all credit to him!) and the datasets he prepared for that. You can download these data from here: titles.csv and cast.csv, and put them in the /data folder.
In [ ]:
cast = pd.read_csv('data/cast.csv')
cast.head()
In [ ]:
titles = pd.read_csv('data/titles.csv')
titles.head()