06 - Pandas: "Group by" operations

DS Data manipulation, analysis and visualisation in Python
December, 2019

© 2016-2019, Joris Van den Bossche and Stijn Van Hoey (mailto:jorisvandenbossche@gmail.com, mailto:stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')

Some 'theory': the groupby operation (split-apply-combine)


In [2]:
df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})
df


Out[2]:
key data
0 A 0
1 B 5
2 C 10
3 A 5
4 B 10
5 C 15
6 A 10
7 B 15
8 C 20

Recap: aggregating functions

When analyzing data, you often calculate summary statistics (aggregations like the mean, max, ...). As we have seen before, we can easily calculate such a statistic for a Series or column using one of the many available methods. For example:


In [3]:
df['data'].sum()


Out[3]:
90

However, in many cases your data has certain groups in it, and in that case, you may want to calculate this statistic for each of the groups.

For example, in the above dataframe df, there is a column 'key' which has three possible values: 'A', 'B' and 'C'. When we want to calculate the sum for each of those groups, we could do the following:


In [4]:
for key in ['A', 'B', 'C']:
    print(key, df[df['key'] == key]['data'].sum())


A 15
B 30
C 45

This becomes very verbose when having multiple groups. You could make the above a bit easier by looping over the different values, but still, it is not very convenient to work with.

What we did above, applying a function on different groups, is a "groupby operation", and pandas provides some convenient functionality for this.

Groupby: applying functions per group

The "group by" concept: we want to apply the same function on subsets of your dataframe, based on some key to split the dataframe in subsets

This operation is also referred to as the "split-apply-combine" operation, involving the following steps:

  • Splitting the data into groups based on some criteria
  • Applying a function to each group independently
  • Combining the results into a data structure

Similar to SQL GROUP BY

Instead of doing the manual filtering as above

df[df['key'] == "A"].sum()
df[df['key'] == "B"].sum()
...

pandas provides the groupby method to do exactly this:


In [5]:
df.groupby('key').sum()


Out[5]:
data
key
A 15
B 30
C 45

In [6]:
df.groupby('key').aggregate(np.sum)  # 'sum'


Out[6]:
data
key
A 15
B 30
C 45

And many more methods are available.


In [7]:
df.groupby('key')['data'].sum()


Out[7]:
key
A    15
B    30
C    45
Name: data, dtype: int64

Application of the groupby concept on the titanic data

We go back to the titanic passengers survival data:


In [8]:
df = pd.read_csv("../data/titanic.csv")

In [9]:
df.head()


Out[9]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
EXERCISE:
  • Using groupby(), calculate the average age for each sex.

In [10]:
df.groupby('Sex')['Age'].mean()


Out[10]:
Sex
female    27.915709
male      30.726645
Name: Age, dtype: float64
EXERCISE:
  • Calculate the average survival ratio for all passengers.

In [11]:
# df['Survived'].sum() / len(df['Survived'])
df['Survived'].mean()


Out[11]:
0.3838383838383838
EXERCISE:
  • Calculate this survival ratio for all passengers younger than 25 (remember: filtering/boolean indexing).

In [12]:
df25 = df[df['Age'] < 25]
df25['Survived'].sum() / len(df25['Survived'])


Out[12]:
0.4244604316546763
EXERCISE:
  • What is the difference in the survival ratio between the sexes?

In [13]:
df.groupby('Sex')['Survived'].mean()


Out[13]:
Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64
EXERCISE:
  • Make a bar plot of the survival ratio for the different classes ('Pclass' column).

In [14]:
df.groupby('Pclass')['Survived'].mean().plot(kind='bar') #and what if you would compare the total number of survivors?


Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f76333a14e0>
EXERCISE:
  • Make a bar plot to visualize the average Fare payed by people depending on their age. The age column is devided is separate classes using the `pd.cut` function as provided below.

In [15]:
df['AgeClass'] = pd.cut(df['Age'], bins=np.arange(0,90,10))

In [16]:
df.groupby('AgeClass')['Fare'].mean().plot(kind='bar', rot=0)


Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f76312ca898>

If you are ready, more groupby exercises can be found below.

Some more theory

Specifying the grouper

In the previous example and exercises, we always grouped by a single column by passing its name. But, a column name is not the only value you can pass as the grouper in df.groupby(grouper). Other possibilities for grouper are:

  • a list of strings (to group by multiple columns)
  • a Series (similar to a string indicating a column in df) or array
  • function (to be applied on the index)
  • levels=[], names of levels in a MultiIndex

In [17]:
df.groupby(df['Age'] < 18)['Survived'].mean()


Out[17]:
Age
False    0.361183
True     0.539823
Name: Survived, dtype: float64

In [18]:
df.groupby(['Pclass', 'Sex'])['Survived'].mean()


Out[18]:
Pclass  Sex   
1       female    0.968085
        male      0.368852
2       female    0.921053
        male      0.157407
3       female    0.500000
        male      0.135447
Name: Survived, dtype: float64

The size of groups - value counts

Oftentimes you want to know how many elements there are in a certain group (or in other words: the number of occurences of the different values from a column).

To get the size of the groups, we can use size:


In [19]:
df.groupby('Pclass').size()


Out[19]:
Pclass
1    216
2    184
3    491
dtype: int64

In [20]:
df.groupby('Embarked').size()


Out[20]:
Embarked
C    168
Q     77
S    644
dtype: int64

Another way to obtain such counts, is to use the Series value_counts method:


In [21]:
df['Embarked'].value_counts()


Out[21]:
S    644
C    168
Q     77
Name: Embarked, dtype: int64

[OPTIONAL] Additional exercises using the movie data

These exercises are based on the PyCon tutorial of Brandon Rhodes (so credit to him!) and the datasets he prepared for that. You can download these data from here: titles.csv and cast.csv and put them in the /data folder.

cast dataset: different roles played by actors/actresses in films

  • title: title of the movie
  • year: year it was released
  • name: name of the actor/actress
  • type: actor/actress
  • n: the order of the role (n=1: leading role)

In [22]:
cast = pd.read_csv('../data/cast.csv')
cast.head()


Out[22]:
title year name type character n
0 Suuri illusioni 1985 Homo $ actor Guests 22.0
1 Gangsta Rap: The Glockumentary 2007 Too $hort actor Himself NaN
2 Menace II Society 1993 Too $hort actor Lew-Loc 27.0
3 Porndogs: The Adventures of Sadie 2009 Too $hort actor Bosco 3.0
4 Stop Pepper Palmer 2014 Too $hort actor Himself NaN

titles dataset:

  • title: title of the movie
  • year: year of release

In [23]:
titles = pd.read_csv('../data/titles.csv')
titles.head()


Out[23]:
title year
0 The Rising Son 1990
1 Ashes of Kukulcan 2016
2 The Thousand Plane Raid 1969
3 Crucea de piatra 1993
4 The 86 2015
EXERCISE:
  • Using `groupby()`, plot the number of films that have been released each decade in the history of cinema.

In [24]:
titles['decade'] = titles['year'] // 10 * 10

In [25]:
titles.groupby('decade').size().plot(kind='bar', color='green')


Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff84361beb8>
EXERCISE:
  • Use `groupby()` to plot the number of 'Hamlet' movies made each decade.

In [26]:
titles['decade'] = titles['year'] // 10 * 10
hamlet = titles[titles['title'] == 'Hamlet']
hamlet.groupby('decade').size().plot(kind='bar', color="orange")


Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff840f02a58>
EXERCISE:
  • For each decade, plot all movies of which the title contains "Hamlet".

In [27]:
titles['decade'] = titles['year'] // 10 * 10
hamlet = titles[titles['title'].str.contains('Hamlet')]
hamlet.groupby('decade').size().plot(kind='bar', color="lightblue")


Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff840ef3a90>
EXERCISE:
  • List the 10 actors/actresses that have the most leading roles (n=1) since the 1990's.

In [28]:
cast1990 = cast[cast['year'] >= 1990]
cast1990 = cast1990[cast1990['n'] == 1]
cast1990.groupby('name').size().nlargest(10)


Out[28]:
name
Mohanlal              126
Mammootty             118
Akshay Kumar           87
Jayaram                76
Andy Lau               72
Ajay Devgn             69
Amitabh Bachchan       68
Eric Roberts           68
Nagarjuna Akkineni     60
Dilip                  59
dtype: int64

In [29]:
cast1990['name'].value_counts().head(10)


Out[29]:
Mohanlal              126
Mammootty             118
Akshay Kumar           87
Jayaram                76
Andy Lau               72
Ajay Devgn             69
Amitabh Bachchan       68
Eric Roberts           68
Nagarjuna Akkineni     60
Dilip                  59
Name: name, dtype: int64
EXERCISE:
  • In a previous exercise, the number of 'Hamlet' films released each decade was checked. Not all titles are exactly called 'Hamlet'. Give an overview of the titles that contain 'Hamlet' and an overview of the titles that start with 'Hamlet', each time providing the amount of occurrences in the data set for each of the movies

In [30]:
hamlets = titles[titles['title'].str.contains('Hamlet')]
hamlets['title'].value_counts()


Out[30]:
Hamlet                                     19
Hamlet (II)                                 5
Hamlet (III)                                2
Han, hun og Hamlet                          2
Hamlet's Ghost                              1
Hamlet (A Modern Adaptation)                1
Kitchen Hamlet                              1
The Tragedy of Hamlet Prince of Denmark     1
Hamlet the Vampire Slayer                   1
Hamlet, Son of a Kingpin                    1
Predstava 'Hamleta' u Mrdusi Donjoj         1
Hamlet A.D.D.                               1
Zombie Hamlet                               1
A Sagebrush Hamlet                          1
Dogg's Hamlet, Cahoot's Macbeth             1
H for Hamlet                                1
Hamlet Unbound                              1
Harry, Hamlet and I                         1
Green Eggs and Hamlet                       1
Hamlet in the Hamptons                      1
Hamlet_X                                    1
Hamlet X                                    1
Hamlet liikemaailmassa                      1
Hamlet: The Fall of a Sparrow               1
Hamlet 2                                    1
Kadin Hamlet                                1
National Theatre Live: Hamlet               1
Fuck Hamlet                                 1
Hamlet: Prince of Denmark                   1
Hamlet, Prince of Denmark                   1
Name: title, dtype: int64

In [31]:
hamlets = titles[titles['title'].str.startswith('Hamlet')]
hamlets['title'].value_counts()


Out[31]:
Hamlet                           19
Hamlet (II)                       5
Hamlet (III)                      2
Hamlet's Ghost                    1
Hamlet, Son of a Kingpin          1
Hamlet, Prince of Denmark         1
Hamlet_X                          1
Hamlet X                          1
Hamlet the Vampire Slayer         1
Hamlet: The Fall of a Sparrow     1
Hamlet Unbound                    1
Hamlet (A Modern Adaptation)      1
Hamlet 2                          1
Hamlet A.D.D.                     1
Hamlet liikemaailmassa            1
Hamlet: Prince of Denmark         1
Hamlet in the Hamptons            1
Name: title, dtype: int64
EXERCISE:
  • List the 10 movie titles with the longest name.

In [32]:
title_longest = titles['title'].str.len().nlargest(10)
title_longest


Out[32]:
127048    208
28483     196
103597    116
8396      114
85693     104
108020    104
206303    101
122757     99
52929      94
187654     92
Name: title, dtype: int64

In [33]:
pd.options.display.max_colwidth = 210
titles.loc[title_longest.index]


Out[33]:
title year decade
127048 Night of the Day of the Dawn of the Son of the Bride of the Return of the Revenge of the Terror of the Attack of the Evil Mutant Hellbound Flesh Eating Crawling Alien Zombified Subhumanoid Living Dead, Part 5 2011 2010
28483 Night of the Day of the Dawn of the Son of the Bride of the Return of the Revenge of the Terror of the Attack of the Evil, Mutant, Hellbound, Flesh-Eating Subhumanoid Zombified Living Dead, Part 3 2005 2000
103597 Maverick and Ariel's First Ever Ever Movie Hollywood or Else... (Ang pinakamahabang title ng movie sa balat ng lupa) 2010 2010
8396 The Fable of the Kid Who Shifted His Ideals to Golf and Finally Became a Baseball Fan and Took the Only Known Cure 1916 1910
85693 Film d'amore e d'anarchia, ovvero 'stamattina alle 10 in via dei Fiori nella nota casa di tolleranza...' 1973 1970
108020 Those Magnificent Men in Their Flying Machines or How I Flew from London to Paris in 25 hours 11 minutes 1965 1960
206303 Ontologica! or a Brief Explanation of Absolutely Everything that is Known about Absolutely Everything 2012 2010
122757 The Official Motion Pictures of the Heavyweight Boxing Contest Between Gene Tunney and Jack Dempsey 1927 1920
52929 Something Strange: 23 Peculiar Perspectives of Metaphysical Phenomena in a Modern American Age 2012 2010
187654 The Personal History, Adventures, Experience, & Observation of David Copperfield the Younger 1935 1930
EXERCISE:
  • How many leading (n=1) roles were available to actors, and how many to actresses, in each year of the 1950s?

In [34]:
cast1950 = cast[cast['year'] // 10 == 195]
cast1950 = cast1950[cast1950['n'] == 1]
cast1950.groupby(['year', 'type']).size()


Out[34]:
year  type   
1950  actor      604
      actress    271
1951  actor      633
      actress    272
1952  actor      591
      actress    284
1953  actor      634
      actress    290
1954  actor      626
      actress    298
1955  actor      606
      actress    264
1956  actor      614
      actress    288
1957  actor      708
      actress    284
1958  actor      694
      actress    275
1959  actor      678
      actress    287
dtype: int64
EXERCISE:
  • What are the 11 most common character names in movie history?

In [35]:
cast.character.value_counts().head(11)


Out[35]:
Himself        18928
Dancer         11070
Extra           9141
Reporter        7646
Doctor          6846
Policeman       6501
Student         6406
Bartender       6178
Nurse           6164
Party Guest     5917
Minor Role      5880
Name: character, dtype: int64
EXERCISE:
  • Plot how many roles Brad Pitt has played in each year of his career.

In [36]:
cast[cast.name == 'Brad Pitt'].year.value_counts().sort_index().plot()


Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff843715e48>
EXERCISE:
  • What are the 10 most occurring movie titles that start with the words 'The Life'?

In [37]:
titles[titles['title'].str.startswith('The Life')]['title'].value_counts().head(10)


Out[37]:
The Life of Riley                                 3
The Life of the Party                             3
The Life Coach                                    2
The Life and Death of an Unhappily Married Man    1
The Life of Moses                                 1
The Life I Meant to Live                          1
The Lifehouse Concert                             1
The Life and Times of MC Beer Bong                1
The Life of Juanita Castro                        1
The Life and Adventures of Nicholas Nickleby      1
Name: title, dtype: int64
EXERCISE:
  • Which actors or actresses were most active in the year 2010 (i.e. appeared in the most movies)?

In [38]:
cast[cast.year == 2010].name.value_counts().head(10)


Out[38]:
Lloyd Kaufman           23
Suraaj Venjarammoodu    20
Jagathi Sreekumar       20
Chris (II) Eddy         20
Danny Trejo             17
Matt Simpson Siegel     17
Joe Estevez             15
Brahmanandam            15
Ben (II) Bishop         15
Kyle Rea                15
Name: name, dtype: int64
EXERCISE:
  • Determine how many roles are listed for each of 'The Pink Panther' movies.

In [39]:
pink = cast[cast['title'] == 'The Pink Panther']
pink.groupby(['year'])[['n']].max()


Out[39]:
n
year
1963 15.0
2006 50.0
EXERCISE:
  • List, in order by year, each of the movies in which 'Frank Oz' has played more than 1 role.

In [40]:
oz = cast[cast['name'] == 'Frank Oz']
oz_roles = oz.groupby(['year', 'title']).size()
oz_roles[oz_roles > 1]


Out[40]:
year  title                                   
1979  The Muppet Movie                            8
1981  An American Werewolf in London              2
      The Great Muppet Caper                      6
1982  The Dark Crystal                            2
1984  The Muppets Take Manhattan                  7
1985  Sesame Street Presents: Follow that Bird    3
1992  The Muppet Christmas Carol                  7
1996  Muppet Treasure Island                      4
1999  Muppets from Space                          4
      The Adventures of Elmo in Grouchland        3
dtype: int64
EXERCISE:
  • List each of the characters that Frank Oz has portrayed at least twice.

In [41]:
oz = cast[cast['name'] == 'Frank Oz']
oz_roles = oz.groupby(['character']).size()
oz_roles[oz_roles > 1].sort_values()


Out[41]:
character
Grover            2
Bert              3
Cookie Monster    3
Fozzie Bear       4
Sam the Eagle     5
Yoda              5
Animal            6
Miss Piggy        6
dtype: int64
EXERCISE:
  • Add a new column to the `cast` DataFrame that indicates the number of roles for each movie. [Hint](http://pandas.pydata.org/pandas-docs/stable/groupby.html#transformation)

In [42]:
cast['n_total'] = cast.groupby('title')['n'].transform('max') # transform will return an element for each row, so the max value is given to the whole group
cast.head()


Out[42]:
title year name type character n n_total
0 Suuri illusioni 1985 Homo $ actor Guests 22.0 22.0
1 Gangsta Rap: The Glockumentary 2007 Too $hort actor Himself NaN NaN
2 Menace II Society 1993 Too $hort actor Lew-Loc 27.0 45.0
3 Porndogs: The Adventures of Sadie 2009 Too $hort actor Bosco 3.0 9.0
4 Stop Pepper Palmer 2014 Too $hort actor Himself NaN NaN
EXERCISE:
  • Calculate the ratio of leading actor and actress roles to the total number of leading roles per decade.

**Tip**: you can do a groupby twice in two steps, first calculating the numbers, and secondly, the ratios.

In [43]:
leading = cast[cast['n'] == 1]
sums_decade = leading.groupby([cast['year'] // 10 * 10, 'type']).size()
sums_decade


Out[43]:
year  type   
1900  actor          5
      actress        1
1910  actor       2406
      actress     2753
1920  actor       4485
      actress     3014
1930  actor       6108
      actress     2826
1940  actor       5221
      actress     2143
1950  actor       6388
      actress     2813
1960  actor       7678
      actress     3112
1970  actor       8236
      actress     3320
1980  actor       9023
      actress     3274
1990  actor      10455
      actress     3972
2000  actor      17972
      actress     7537
2010  actor      17262
      actress     7384
2020  actor          3
      actress        1
dtype: int64

In [44]:
#sums_decade.groupby(level='year').transform(lambda x: x / x.sum())
ratios_decade = sums_decade / sums_decade.groupby(level='year').transform('sum')
ratios_decade


Out[44]:
year  type   
1900  actor      0.833333
      actress    0.166667
1910  actor      0.466369
      actress    0.533631
1920  actor      0.598080
      actress    0.401920
1930  actor      0.683680
      actress    0.316320
1940  actor      0.708990
      actress    0.291010
1950  actor      0.694272
      actress    0.305728
1960  actor      0.711585
      actress    0.288415
1970  actor      0.712703
      actress    0.287297
1980  actor      0.733756
      actress    0.266244
1990  actor      0.724683
      actress    0.275317
2000  actor      0.704536
      actress    0.295464
2010  actor      0.700398
      actress    0.299602
2020  actor      0.750000
      actress    0.250000
dtype: float64

In [45]:
ratios_decade[:, 'actor'].plot()
ratios_decade[:, 'actress'].plot()


Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff840d237f0>
EXERCISE:
  • In which years the most films were released?


In [46]:
t = titles
t.year.value_counts().head(3)


Out[46]:
2015    13978
2014     8209
2013     7664
Name: year, dtype: int64
EXERCISE:
  • How many leading (n=1) roles were available to actors, and how many to actresses, in the 1950s? And in 2000s?


In [47]:
cast1950 = cast[cast['year'] // 10 == 195]
cast1950 = cast1950[cast1950['n'] == 1]
cast1950['type'].value_counts()


Out[47]:
actor      6388
actress    2813
Name: type, dtype: int64

In [48]:
cast2000 = cast[cast['year'] // 10 == 200]
cast2000 = cast2000[cast2000['n'] == 1]
cast2000['type'].value_counts()


Out[48]:
actor      17972
actress     7537
Name: type, dtype: int64