07 - Pandas: Reshaping data

DS Data manipulation, analysis and visualisation in Python
December, 2019

© 2016-2019, Joris Van den Bossche and Stijn Van Hoey (mailto:jorisvandenbossche@gmail.com, mailto:stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Pivoting data

Cfr. excel

People who know Excel, probably know the Pivot functionality:

The data of the table:


In [2]:
excelample = pd.DataFrame({'Month': ["January", "January", "January", "January", 
                                  "February", "February", "February", "February", 
                                  "March", "March", "March", "March"],
                   'Category': ["Transportation", "Grocery", "Household", "Entertainment",
                                "Transportation", "Grocery", "Household", "Entertainment",
                                "Transportation", "Grocery", "Household", "Entertainment"],
                   'Amount': [74., 235., 175., 100., 115., 240., 225., 125., 90., 260., 200., 120.]})

In [3]:
excelample


Out[3]:
Month Category Amount
0 January Transportation 74.0
1 January Grocery 235.0
2 January Household 175.0
3 January Entertainment 100.0
4 February Transportation 115.0
5 February Grocery 240.0
6 February Household 225.0
7 February Entertainment 125.0
8 March Transportation 90.0
9 March Grocery 260.0
10 March Household 200.0
11 March Entertainment 120.0

In [4]:
excelample_pivot = excelample.pivot(index="Category", columns="Month", values="Amount")
excelample_pivot


Out[4]:
Month February January March
Category
Entertainment 125.0 100.0 120.0
Grocery 240.0 235.0 260.0
Household 225.0 175.0 200.0
Transportation 115.0 74.0 90.0

Interested in Grand totals?


In [5]:
# sum columns
excelample_pivot.sum(axis=1)


Out[5]:
Category
Entertainment     345.0
Grocery           735.0
Household         600.0
Transportation    279.0
dtype: float64

In [6]:
# sum rows
excelample_pivot.sum(axis=0)


Out[6]:
Month
February    705.0
January     584.0
March       670.0
dtype: float64

Pivot is just reordering your data:

Small subsample of the titanic dataset:


In [7]:
df = pd.DataFrame({'Fare': [7.25, 71.2833, 51.8625, 30.0708, 7.8542, 13.0],
                   'Pclass': [3, 1, 1, 2, 3, 2],
                   'Sex': ['male', 'female', 'male', 'female', 'female', 'male'],
                   'Survived': [0, 1, 0, 1, 0, 1]})

In [8]:
df


Out[8]:
Fare Pclass Sex Survived
0 7.2500 3 male 0
1 71.2833 1 female 1
2 51.8625 1 male 0
3 30.0708 2 female 1
4 7.8542 3 female 0
5 13.0000 2 male 1

In [9]:
df.pivot(index='Pclass', columns='Sex', values='Fare')


Out[9]:
Sex female male
Pclass
1 71.2833 51.8625
2 30.0708 13.0000
3 7.8542 7.2500

In [10]:
df.pivot(index='Pclass', columns='Sex', values='Survived')


Out[10]:
Sex female male
Pclass
1 1 0
2 1 1
3 0 0

So far, so good...

Let's now use the full titanic dataset:


In [11]:
df = pd.read_csv("../data/titanic.csv")

In [12]:
df.head()


Out[12]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

And try the same pivot (no worries about the try-except, this is here just used to catch a loooong error):


In [13]:
try:
    df.pivot(index='Sex', columns='Pclass', values='Fare')
except Exception as e:
    print("Exception!", e)


Exception! Index contains duplicate entries, cannot reshape

This does not work, because we would end up with multiple values for one cell of the resulting frame, as the error says: duplicated values for the columns in the selection. As an example, consider the following rows of our three columns of interest:


In [14]:
df.loc[[1, 3], ["Sex", 'Pclass', 'Fare']]


Out[14]:
Sex Pclass Fare
1 female 1 71.2833
3 female 1 53.1000

Since pivot is just restructering data, where would both values of Fare for the same combination of Sex and Pclass need to go?

Well, they need to be combined, according to an aggregation functionality, which is supported by the functionpivot_table

NOTE:
  • Pivot is purely restructering: a single value for each index/column combination is required.

Pivot tables - aggregating while pivoting


In [15]:
df = pd.read_csv("../data/titanic.csv")

In [16]:
df.pivot_table(index='Sex', columns='Pclass', values='Fare')


Out[16]:
Pclass 1 2 3
Sex
female 106.125798 21.970121 16.118810
male 67.226127 19.741782 12.661633
REMEMBER: * By default, `pivot_table` takes the **mean** of all values that would end up into one cell. However, you can also specify other aggregation functions using the `aggfunc` keyword.

In [17]:
df.pivot_table(index='Sex', columns='Pclass', 
               values='Fare', aggfunc='max')


Out[17]:
Pclass 1 2 3
Sex
female 512.3292 65.0 69.55
male 512.3292 73.5 69.55

In [18]:
df.pivot_table(index='Sex', columns='Pclass', 
               values='Fare', aggfunc='count')


Out[18]:
Pclass 1 2 3
Sex
female 94 76 144
male 122 108 347
REMEMBER:
  • There is a shortcut function for a pivot_table with a aggfunc='count' as aggregation: crosstab

In [19]:
pd.crosstab(index=df['Sex'], columns=df['Pclass'])


Out[19]:
Pclass 1 2 3
Sex
female 94 76 144
male 122 108 347
EXERCISE:
  • Make a pivot table with the survival rates for Pclass vs Sex.

In [20]:
df.pivot_table(index='Pclass', columns='Sex', 
               values='Survived', aggfunc='mean')


Out[20]:
Sex female male
Pclass
1 0.968085 0.368852
2 0.921053 0.157407
3 0.500000 0.135447

In [21]:
fig, ax1 = plt.subplots()
df.pivot_table(index='Pclass', columns='Sex', 
               values='Survived', aggfunc='mean').plot(kind='bar', 
                                                       rot=0, 
                                                       ax=ax1)
ax1.set_ylabel('Survival ratio')


Out[21]:
Text(0, 0.5, 'Survival ratio')
EXERCISE:
  • Make a table of the median Fare payed by aged/underaged vs Sex.

In [22]:
df['Underaged'] = df['Age'] <= 18

In [23]:
df.pivot_table(index='Underaged', columns='Sex', 
               values='Fare', aggfunc='median')


Out[23]:
Sex female male
Underaged
False 46.612196 25.118081
True 36.765626 28.416023

Melt - from pivot table to long or tidy format

The melt function performs the inverse operation of a pivot. This can be used to make your frame longer, i.e. to make a tidy version of your data.


In [24]:
pivoted = df.pivot_table(index='Sex', columns='Pclass', values='Fare').reset_index()
pivoted.columns.name = None

In [25]:
pivoted


Out[25]:
Sex 1 2 3
0 female 106.125798 21.970121 16.118810
1 male 67.226127 19.741782 12.661633

Assume we have a DataFrame like the above. The observations (the average Fare people payed) are spread over different columns. In a tidy dataset, each observation is stored in one row. To obtain this, we can use the melt function:


In [26]:
pd.melt(pivoted)


Out[26]:
variable value
0 Sex female
1 Sex male
2 1 106.126
3 1 67.2261
4 2 21.9701
5 2 19.7418
6 3 16.1188
7 3 12.6616

As you can see above, the melt function puts all column labels in one column, and all values in a second column.

In this case, this is not fully what we want. We would like to keep the 'Sex' column separately:


In [27]:
pd.melt(pivoted, id_vars=['Sex']) #, var_name='Pclass', value_name='Fare')


Out[27]:
Sex variable value
0 female 1 106.125798
1 male 1 67.226127
2 female 2 21.970121
3 male 2 19.741782
4 female 3 16.118810
5 male 3 12.661633

Reshaping with stack and unstack

The docs say:

Pivot a level of the (possibly hierarchical) column labels, returning a DataFrame (or Series in the case of an object with a single level of column labels) having a hierarchical index with a new inner-most level of row labels.

Indeed...

Before we speak about hierarchical index, first check it in practice on the following dummy example:


In [28]:
df = pd.DataFrame({'A':['one', 'one', 'two', 'two'], 
                   'B':['a', 'b', 'a', 'b'], 
                   'C':range(4)})
df


Out[28]:
A B C
0 one a 0
1 one b 1
2 two a 2
3 two b 3

To use stack/unstack, we need the values we want to shift from rows to columns or the other way around as the index:


In [29]:
df = df.set_index(['A', 'B']) # Indeed, you can combine two indices
df


Out[29]:
C
A B
one a 0
b 1
two a 2
b 3

In [30]:
result = df['C'].unstack()
result


Out[30]:
B a b
A
one 0 1
two 2 3

In [31]:
df = result.stack().reset_index(name='C')
df


Out[31]:
A B C
0 one a 0
1 one b 1
2 two a 2
3 two b 3
REMEMBER:
  • stack: make your data longer and smaller
  • unstack: make your data shorter and wider

Mimick pivot table

To better understand and reason about pivot tables, we can express this method as a combination of more basic steps. In short, the pivot is a convenient way of expressing the combination of a groupby and stack/unstack.


In [32]:
df = pd.read_csv("../data/titanic.csv")

In [33]:
df.head()


Out[33]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

In [34]:
df.pivot_table(index='Pclass', columns='Sex', 
               values='Survived', aggfunc='mean')


Out[34]:
Sex female male
Pclass
1 0.968085 0.368852
2 0.921053 0.157407
3 0.500000 0.135447
EXERCISE:
  • Get the same result as above based on a combination of `groupby` and `unstack`
  • First use `groupby` to calculate the survival ratio for all groups`unstack`
  • Then, use `unstack` to reshape the output of the groupby operation

In [35]:
df.groupby(['Pclass', 'Sex'])['Survived'].mean().unstack()


Out[35]:
Sex female male
Pclass
1 0.968085 0.368852
2 0.921053 0.157407
3 0.500000 0.135447

[OPTIONAL] Exercises: use the reshaping methods with the movie data

These exercises are based on the PyCon tutorial of Brandon Rhodes (so credit to him!) and the datasets he prepared for that. You can download these data from here: titles.csv and cast.csv and put them in the /data folder.


In [38]:
cast = pd.read_csv('../data/cast.csv')
cast.head()


Out[38]:
title year name type character n
0 Suuri illusioni 1985 Homo $ actor Guests 22.0
1 Gangsta Rap: The Glockumentary 2007 Too $hort actor Himself NaN
2 Menace II Society 1993 Too $hort actor Lew-Loc 27.0
3 Porndogs: The Adventures of Sadie 2009 Too $hort actor Bosco 3.0
4 Stop Pepper Palmer 2014 Too $hort actor Himself NaN

In [39]:
titles = pd.read_csv('../data/titles.csv')
titles.head()


Out[39]:
title year
0 The Rising Son 1990
1 Ashes of Kukulcan 2016
2 The Thousand Plane Raid 1969
3 Crucea de piatra 1993
4 The 86 2015
EXERCISE:
  • Plot the number of actor roles each year and the number of actress roles each year over the whole period of available movie data.

In [40]:
grouped = cast.groupby(['year', 'type']).size()
table = grouped.unstack('type')
table.plot()


Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa3f1a7e390>

In [41]:
cast.pivot_table(index='year', columns='type', values="character", aggfunc='count').plot() 
# for values in using the , take a column with no Nan values in order to count effectively all values -> at this stage: aha-erlebnis about crosstab function(!)


Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa3f1a08e90>

In [42]:
pd.crosstab(index=cast['year'], columns=cast['type']).plot()


Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa3f19754d0>
EXERCISE:
  • Plot the number of actor roles each year and the number of actress roles each year. Use kind='area' as plot type

In [43]:
pd.crosstab(index=cast['year'], columns=cast['type']).plot(kind='area')


Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa3f18f8e50>
EXERCISE:
  • Plot the fraction of roles that have been 'actor' roles each year over the whole period of available movie data.

In [44]:
grouped = cast.groupby(['year', 'type']).size()
table = grouped.unstack('type')
(table['actor'] / (table['actor'] + table['actress'])).plot(ylim=[0,1])


Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa3f186d350>
EXERCISE:
  • Define a year as a "Superman year" when films of that year feature more Superman characters than Batman characters. How many years in film history have been Superman years?

In [45]:
c = cast
c = c[(c.character == 'Superman') | (c.character == 'Batman')]
c = c.groupby(['year', 'character']).size()
c = c.unstack()
c = c.fillna(0)
c.head()


Out[45]:
character Batman Superman
year
1938 1.0 0.0
1940 1.0 0.0
1943 1.0 0.0
1948 0.0 1.0
1949 2.0 0.0

In [46]:
d = c.Superman - c.Batman
print('Superman years:')
print(len(d[d > 0.0]))


Superman years:
12