07 - Pandas: Reshaping data
DS Data manipulation, analysis and visualisation in Python
December, 2019© 2016-2019, Joris Van den Bossche and Stijn Van Hoey (mailto:jorisvandenbossche@gmail.com, mailto:stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
People who know Excel, probably know the Pivot functionality:
The data of the table:
In [2]:
excelample = pd.DataFrame({'Month': ["January", "January", "January", "January",
"February", "February", "February", "February",
"March", "March", "March", "March"],
'Category': ["Transportation", "Grocery", "Household", "Entertainment",
"Transportation", "Grocery", "Household", "Entertainment",
"Transportation", "Grocery", "Household", "Entertainment"],
'Amount': [74., 235., 175., 100., 115., 240., 225., 125., 90., 260., 200., 120.]})
In [3]:
excelample
Out[3]:
In [4]:
excelample_pivot = excelample.pivot(index="Category", columns="Month", values="Amount")
excelample_pivot
Out[4]:
Interested in Grand totals?
In [5]:
# sum columns
excelample_pivot.sum(axis=1)
Out[5]:
In [6]:
# sum rows
excelample_pivot.sum(axis=0)
Out[6]:
Small subsample of the titanic dataset:
In [7]:
df = pd.DataFrame({'Fare': [7.25, 71.2833, 51.8625, 30.0708, 7.8542, 13.0],
'Pclass': [3, 1, 1, 2, 3, 2],
'Sex': ['male', 'female', 'male', 'female', 'female', 'male'],
'Survived': [0, 1, 0, 1, 0, 1]})
In [8]:
df
Out[8]:
In [9]:
df.pivot(index='Pclass', columns='Sex', values='Fare')
Out[9]:
In [10]:
df.pivot(index='Pclass', columns='Sex', values='Survived')
Out[10]:
So far, so good...
Let's now use the full titanic dataset:
In [11]:
df = pd.read_csv("../data/titanic.csv")
In [12]:
df.head()
Out[12]:
And try the same pivot (no worries about the try-except, this is here just used to catch a loooong error):
In [13]:
try:
df.pivot(index='Sex', columns='Pclass', values='Fare')
except Exception as e:
print("Exception!", e)
This does not work, because we would end up with multiple values for one cell of the resulting frame, as the error says: duplicated
values for the columns in the selection. As an example, consider the following rows of our three columns of interest:
In [14]:
df.loc[[1, 3], ["Sex", 'Pclass', 'Fare']]
Out[14]:
Since pivot
is just restructering data, where would both values of Fare
for the same combination of Sex
and Pclass
need to go?
Well, they need to be combined, according to an aggregation
functionality, which is supported by the functionpivot_table
In [15]:
df = pd.read_csv("../data/titanic.csv")
In [16]:
df.pivot_table(index='Sex', columns='Pclass', values='Fare')
Out[16]:
In [17]:
df.pivot_table(index='Sex', columns='Pclass',
values='Fare', aggfunc='max')
Out[17]:
In [18]:
df.pivot_table(index='Sex', columns='Pclass',
values='Fare', aggfunc='count')
Out[18]:
pivot_table
with a aggfunc='count'
as aggregation: crosstab
In [19]:
pd.crosstab(index=df['Sex'], columns=df['Pclass'])
Out[19]:
In [20]:
df.pivot_table(index='Pclass', columns='Sex',
values='Survived', aggfunc='mean')
Out[20]:
In [21]:
fig, ax1 = plt.subplots()
df.pivot_table(index='Pclass', columns='Sex',
values='Survived', aggfunc='mean').plot(kind='bar',
rot=0,
ax=ax1)
ax1.set_ylabel('Survival ratio')
Out[21]:
In [22]:
df['Underaged'] = df['Age'] <= 18
In [23]:
df.pivot_table(index='Underaged', columns='Sex',
values='Fare', aggfunc='median')
Out[23]:
The melt
function performs the inverse operation of a pivot
. This can be used to make your frame longer, i.e. to make a tidy version of your data.
In [24]:
pivoted = df.pivot_table(index='Sex', columns='Pclass', values='Fare').reset_index()
pivoted.columns.name = None
In [25]:
pivoted
Out[25]:
Assume we have a DataFrame like the above. The observations (the average Fare people payed) are spread over different columns. In a tidy dataset, each observation is stored in one row. To obtain this, we can use the melt
function:
In [26]:
pd.melt(pivoted)
Out[26]:
As you can see above, the melt
function puts all column labels in one column, and all values in a second column.
In this case, this is not fully what we want. We would like to keep the 'Sex' column separately:
In [27]:
pd.melt(pivoted, id_vars=['Sex']) #, var_name='Pclass', value_name='Fare')
Out[27]:
The docs say:
Pivot a level of the (possibly hierarchical) column labels, returning a DataFrame (or Series in the case of an object with a single level of column labels) having a hierarchical index with a new inner-most level of row labels.
Indeed...
Before we speak about hierarchical index
, first check it in practice on the following dummy example:
In [28]:
df = pd.DataFrame({'A':['one', 'one', 'two', 'two'],
'B':['a', 'b', 'a', 'b'],
'C':range(4)})
df
Out[28]:
To use stack
/unstack
, we need the values we want to shift from rows to columns or the other way around as the index:
In [29]:
df = df.set_index(['A', 'B']) # Indeed, you can combine two indices
df
Out[29]:
In [30]:
result = df['C'].unstack()
result
Out[30]:
In [31]:
df = result.stack().reset_index(name='C')
df
Out[31]:
To better understand and reason about pivot tables, we can express this method as a combination of more basic steps. In short, the pivot is a convenient way of expressing the combination of a groupby
and stack/unstack
.
In [32]:
df = pd.read_csv("../data/titanic.csv")
In [33]:
df.head()
Out[33]:
In [34]:
df.pivot_table(index='Pclass', columns='Sex',
values='Survived', aggfunc='mean')
Out[34]:
In [35]:
df.groupby(['Pclass', 'Sex'])['Survived'].mean().unstack()
Out[35]:
These exercises are based on the PyCon tutorial of Brandon Rhodes (so credit to him!) and the datasets he prepared for that. You can download these data from here: titles.csv
and cast.csv
and put them in the /data
folder.
In [38]:
cast = pd.read_csv('../data/cast.csv')
cast.head()
Out[38]:
In [39]:
titles = pd.read_csv('../data/titles.csv')
titles.head()
Out[39]:
In [40]:
grouped = cast.groupby(['year', 'type']).size()
table = grouped.unstack('type')
table.plot()
Out[40]:
In [41]:
cast.pivot_table(index='year', columns='type', values="character", aggfunc='count').plot()
# for values in using the , take a column with no Nan values in order to count effectively all values -> at this stage: aha-erlebnis about crosstab function(!)
Out[41]:
In [42]:
pd.crosstab(index=cast['year'], columns=cast['type']).plot()
Out[42]:
In [43]:
pd.crosstab(index=cast['year'], columns=cast['type']).plot(kind='area')
Out[43]:
In [44]:
grouped = cast.groupby(['year', 'type']).size()
table = grouped.unstack('type')
(table['actor'] / (table['actor'] + table['actress'])).plot(ylim=[0,1])
Out[44]:
In [45]:
c = cast
c = c[(c.character == 'Superman') | (c.character == 'Batman')]
c = c.groupby(['year', 'character']).size()
c = c.unstack()
c = c.fillna(0)
c.head()
Out[45]:
In [46]:
d = c.Superman - c.Batman
print('Superman years:')
print(len(d[d > 0.0]))