Reshaping data
© 2016, Joris Van den Bossche and Stijn Van Hoey (mailto:jorisvandenbossche@gmail.com, mailto:stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons
In [ ]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
People who know Excel, probably know the Pivot functionality:
The data of the table:
In [ ]:
excelample = pd.DataFrame({'Month': ["January", "January", "January", "January",
"February", "February", "February", "February",
"March", "March", "March", "March"],
'Category': ["Transportation", "Grocery", "Household", "Entertainment",
"Transportation", "Grocery", "Household", "Entertainment",
"Transportation", "Grocery", "Household", "Entertainment"],
'Amount': [74., 235., 175., 100., 115., 240., 225., 125., 90., 260., 200., 120.]})
In [ ]:
excelample
In [ ]:
excelample_pivot = excelample.pivot(index="Category", columns="Month", values="Amount")
excelample_pivot
Interested in Grand totals?
In [ ]:
# sum columns
excelample_pivot.sum(axis=1)
In [ ]:
# sum rows
excelample_pivot.sum(axis=0)
Small subsample of the titanic dataset:
In [ ]:
df = pd.DataFrame({'Fare': [7.25, 71.2833, 51.8625, 30.0708, 7.8542, 13.0],
'Pclass': [3, 1, 1, 2, 3, 2],
'Sex': ['male', 'female', 'male', 'female', 'female', 'male'],
'Survived': [0, 1, 0, 1, 0, 1]})
In [ ]:
df
In [ ]:
df.pivot(index='Pclass', columns='Sex', values='Fare')
In [ ]:
df.pivot(index='Pclass', columns='Sex', values='Survived')
So far, so good...
Let's now use the full titanic dataset:
In [ ]:
df = pd.read_csv("data/titanic.csv")
In [ ]:
df.head()
And try the same pivot (no worries about the try-except, this is here just used to catch a loooong error):
In [ ]:
try:
df.pivot(index='Sex', columns='Pclass', values='Fare')
except Exception as e:
print("Exception!", e)
This does not work, because we would end up with multiple values for one cell of the resulting frame, as the error says: duplicated
values for the columns in the selection. As an example, consider the following rows of our three columns of interest:
In [ ]:
df.loc[[1, 3], ["Sex", 'Pclass', 'Fare']]
Since pivot
is just restructuring data, where would both values of Fare
for the same combination of Sex
and Pclass
need to go?
Well, they need to be combined, according to an aggregation
functionality, which is supported by the functionpivot_table
In [ ]:
df = pd.read_csv("data/titanic.csv")
In [ ]:
df.pivot_table(index='Sex', columns='Pclass', values='Fare')
In [ ]:
df.pivot_table(index='Sex', columns='Pclass',
values='Fare', aggfunc='max')
In [ ]:
df.pivot_table(index='Sex', columns='Pclass',
values='Fare', aggfunc='count')
In [ ]:
pd.crosstab(index=df['Sex'], columns=df['Pclass'])
In [ ]:
# %load snippets/06 - Reshaping data20.py
In [ ]:
# %load snippets/06 - Reshaping data21.py
In [ ]:
# %load snippets/06 - Reshaping data22.py
In [ ]:
# %load snippets/06 - Reshaping data23.py
The melt
function performs the inverse operation of a pivot
. This can be used to make your frame longer, i.e. to make a tidy version of your data.
In [ ]:
pivoted = df.pivot_table(index='Sex', columns='Pclass', values='Fare').reset_index()
pivoted.columns.name = None
In [ ]:
pivoted
Assume we have a DataFrame like the above. The observations (the average Fare people payed) are spread over different columns. In a tidy dataset, each observation is stored in one row. To obtain this, we can use the melt
function:
In [ ]:
pd.melt(pivoted)
As you can see above, the melt
function puts all column labels in one column, and all values in a second column.
In this case, this is not fully what we want. We would like to keep the 'Sex' column separately:
In [ ]:
pd.melt(pivoted, id_vars=['Sex']) #, var_name='Pclass', value_name='Fare')
The docs say:
Pivot a level of the (possibly hierarchical) column labels, returning a DataFrame (or Series in the case of an object with a single level of column labels) having a hierarchical index with a new inner-most level of row labels.
Indeed...
Before we speak about hierarchical index
, first check it in practice on the following dummy example:
In [ ]:
df = pd.DataFrame({'A':['one', 'one', 'two', 'two'],
'B':['a', 'b', 'a', 'b'],
'C':range(4)})
df
To use stack
/unstack
, we need the values we want to shift from rows to columns or the other way around as the index:
In [ ]:
df = df.set_index(['A', 'B']) # Indeed, you can combine two indices
df
In [ ]:
result = df['C'].unstack()
result
In [ ]:
df = result.stack().reset_index(name='C')
df
To better understand and reason about pivot tables, we can express this method as a combination of more basic steps. In short, the pivot is a convenient way of expressing the combination of a groupby
and stack/unstack
.
In [ ]:
df = pd.read_csv("data/titanic.csv")
In [ ]:
df.head()
In [ ]:
df.pivot_table(index='Pclass', columns='Sex',
values='Survived', aggfunc='mean')
In [ ]:
# %load snippets/06 - Reshaping data37.py
Like the pivot table above, we can now also obtain the result of melt
with stack/unstack.
Let's use the same pivoted
frame as above, and look at the final melt result:
In [ ]:
pivoted = df.pivot_table(index='Sex', columns='Pclass', values='Fare').reset_index()
pivoted.columns.name = None
pivoted
In [ ]:
pd.melt(pivoted, id_vars=['Sex'], var_name='Pclass', value_name='Fare')
In [ ]:
# %load snippets/06 - Reshaping data40.py
In [ ]:
# %load snippets/06 - Reshaping data41.py
In [ ]:
# %load snippets/06 - Reshaping data42.py
In [ ]:
# %load snippets/06 - Reshaping data43.py
These exercises are based on the PyCon tutorial of Brandon Rhodes (so credit to him!) and the datasets he prepared for that. You can download these data from here: titles.csv
and cast.csv
and put them in the /data
folder.
In [ ]:
cast = pd.read_csv('data/cast.csv')
cast.head()
In [ ]:
titles = pd.read_csv('data/titles.csv')
titles.head()
In [ ]:
# %load snippets/06 - Reshaping data46.py
In [ ]:
# %load snippets/06 - Reshaping data47.py
In [ ]:
# %load snippets/06 - Reshaping data48.py
In [ ]:
# %load snippets/06 - Reshaping data49.py
In [ ]:
# %load snippets/06 - Reshaping data50.py
In [ ]:
# %load snippets/06 - Reshaping data51.py
In [ ]:
# %load snippets/06 - Reshaping data52.py