06 - Pandas: "Group by" operations
DS Data manipulation, analysis and visualisation in Python
December 2019. © 2016-2019, Joris Van den Bossche and Stijn Van Hoey (jorisvandenbossche@gmail.com, stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons.
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
In [2]:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})
df
Out[2]:
When analyzing data, you often calculate summary statistics (aggregations like the mean, max, ...). As we have seen before, we can easily calculate such a statistic for a Series or column using one of the many available methods. For example:
In [3]:
df['data'].sum()
Out[3]:
However, in many cases your data has certain groups in it, and in that case, you may want to calculate this statistic for each of the groups.
For example, in the above dataframe df, there is a column 'key' which has three possible values: 'A', 'B' and 'C'. When we want to calculate the sum for each of those groups, we could do the following:
In [4]:
for key in ['A', 'B', 'C']:
    print(key, df[df['key'] == key]['data'].sum())
This quickly becomes verbose when there are many groups. Even written as a loop over the different values, as above, it is not very convenient to work with.
What we did above, applying a function on different groups, is a "groupby operation", and pandas provides some convenient functionality for this.
The "group by" concept: we want to apply the same function on subsets of your dataframe, based on some key to split the dataframe into subsets.
This operation is also referred to as the "split-apply-combine" operation, involving the following steps:

Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure
Similar to SQL GROUP BY
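A minimal sketch of the split-apply-combine steps done by hand, using the small df defined above:

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})

# 1. Split: one sub-frame per unique key value
pieces = {key: df[df['key'] == key] for key in df['key'].unique()}

# 2. Apply: compute the statistic on each piece separately
sums = {key: subdf['data'].sum() for key, subdf in pieces.items()}

# 3. Combine: gather the per-group results into a single Series
result = pd.Series(sums, name='data')
print(result)  # A -> 15, B -> 30, C -> 45
```

The groupby method shown next performs exactly these three steps in a single call.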
Instead of doing the manual filtering as above
df[df['key'] == "A"].sum()
df[df['key'] == "B"].sum()
...
pandas provides the groupby method to do exactly this:
In [5]:
df.groupby('key').sum()
Out[5]:
In [6]:
df.groupby('key').aggregate(np.sum)  # equivalently: .aggregate('sum')
Out[6]:
And many more methods are available.
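Besides calling a single method such as sum, the agg/aggregate method also accepts a list of function names to compute several statistics at once. A short sketch using the same small df:

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})

# pass a list of function names to compute several statistics per group;
# the result has one column per statistic
stats = df.groupby('key')['data'].agg(['sum', 'mean', 'max'])
print(stats)
```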
In [7]:
df.groupby('key')['data'].sum()
Out[7]:
We go back to the Titanic passenger survival data:
In [8]:
df = pd.read_csv("../data/titanic.csv")
In [9]:
df.head()
Out[9]:
In [10]:
df.groupby('Sex')['Age'].mean()
Out[10]:
In [11]:
# df['Survived'].sum() / len(df['Survived'])
df['Survived'].mean()
Out[11]:
In [12]:
df25 = df[df['Age'] < 25]
df25['Survived'].sum() / len(df25['Survived'])
Out[12]:
In [13]:
df.groupby('Sex')['Survived'].mean()
Out[13]:
In [14]:
df.groupby('Pclass')['Survived'].mean().plot(kind='bar')  # and what if you would compare the total number of survivors?
Out[14]:
In [15]:
df['AgeClass'] = pd.cut(df['Age'], bins=np.arange(0,90,10))
In [16]:
df.groupby('AgeClass')['Fare'].mean().plot(kind='bar', rot=0)
Out[16]:
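The pd.cut call used above bins a continuous column into interval categories, which can then serve as a grouper. A small sketch with made-up ages:

```python
import pandas as pd
import numpy as np

ages = pd.Series([5, 23, 41, 67])

# bin edges 0, 10, 20, ..., 80; each age falls into one interval
binned = pd.cut(ages, bins=np.arange(0, 90, 10))
print(binned)  # (0, 10], (20, 30], (40, 50], (60, 70]
```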
If you are ready, more groupby exercises can be found below.
In the previous example and exercises, we always grouped by a single column by passing its name. But a column name is not the only thing you can pass as the grouper in df.groupby(grouper). Other possibilities for grouper are:

a Series or array of the same length as the dataframe, for example a computed condition
a list of column names, to group by multiple keys at once
In [17]:
df.groupby(df['Age'] < 18)['Survived'].mean()
Out[17]:
In [18]:
df.groupby(['Pclass', 'Sex'])['Survived'].mean()
Out[18]:
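When grouping by multiple columns, the result has a MultiIndex; unstack() pivots the inner level into columns, which often makes such a table easier to read. A sketch with a small made-up stand-in for the titanic data (the values are illustrative only):

```python
import pandas as pd

# hypothetical mini version of the titanic data, for illustration
demo = pd.DataFrame({'Pclass': [1, 1, 2, 2, 3, 3],
                     'Sex': ['male', 'female'] * 3,
                     'Survived': [0, 1, 0, 1, 0, 0]})

# grouping by two columns gives a Series with a MultiIndex;
# unstack() pivots the inner level ('Sex') into columns
table = demo.groupby(['Pclass', 'Sex'])['Survived'].mean().unstack()
print(table)
```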
Oftentimes you want to know how many elements there are in a certain group (or, in other words, the number of occurrences of the different values in a column).
To get the size of the groups, we can use the size method:
In [19]:
df.groupby('Pclass').size()
Out[19]:
In [20]:
df.groupby('Embarked').size()
Out[20]:
Another way to obtain such counts is to use the Series value_counts method:
In [21]:
df['Embarked'].value_counts()
Out[21]:
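value_counts() is essentially a shortcut for grouping a Series by its own values and taking the size of each group, with the result sorted by count in descending order. A small sketch:

```python
import pandas as pd

s = pd.Series(['S', 'C', 'S', 'Q', 'S', 'C'])

# shortcut: counts per unique value, largest first
counts = s.value_counts()
print(counts)  # S -> 3, C -> 2, Q -> 1

# the same counts via groupby + size (ordered by index instead)
via_groupby = s.groupby(s).size()
```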
These exercises are based on the PyCon tutorial of Brandon Rhodes (so credit to him!) and the datasets he prepared for it. You can download these data from here: titles.csv and cast.csv, and put them in the /data folder.
cast dataset: the different roles played by actors/actresses in films
In [22]:
cast = pd.read_csv('../data/cast.csv')
cast.head()
Out[22]:
titles dataset:
In [23]:
titles = pd.read_csv('../data/titles.csv')
titles.head()
Out[23]:
In [24]:
titles['decade'] = titles['year'] // 10 * 10
In [25]:
titles.groupby('decade').size().plot(kind='bar', color='green')
Out[25]:
In [26]:
titles['decade'] = titles['year'] // 10 * 10
hamlet = titles[titles['title'] == 'Hamlet']
hamlet.groupby('decade').size().plot(kind='bar', color="orange")
Out[26]:
In [27]:
titles['decade'] = titles['year'] // 10 * 10
hamlet = titles[titles['title'].str.contains('Hamlet')]
hamlet.groupby('decade').size().plot(kind='bar', color="lightblue")
Out[27]:
In [28]:
cast1990 = cast[cast['year'] >= 1990]
cast1990 = cast1990[cast1990['n'] == 1]
cast1990.groupby('name').size().nlargest(10)
Out[28]:
In [29]:
cast1990['name'].value_counts().head(10)
Out[29]:
In [30]:
hamlets = titles[titles['title'].str.contains('Hamlet')]
hamlets['title'].value_counts()
Out[30]:
In [31]:
hamlets = titles[titles['title'].str.startswith('Hamlet')]
hamlets['title'].value_counts()
Out[31]:
In [32]:
title_longest = titles['title'].str.len().nlargest(10)
title_longest
Out[32]:
In [33]:
pd.options.display.max_colwidth = 210
titles.loc[title_longest.index]
Out[33]:
In [34]:
cast1950 = cast[cast['year'] // 10 == 195]
cast1950 = cast1950[cast1950['n'] == 1]
cast1950.groupby(['year', 'type']).size()
Out[34]:
In [35]:
cast.character.value_counts().head(11)
Out[35]:
In [36]:
cast[cast.name == 'Brad Pitt'].year.value_counts().sort_index().plot()
Out[36]:
In [37]:
titles[titles['title'].str.startswith('The Life')]['title'].value_counts().head(10)
Out[37]:
In [38]:
cast[cast.year == 2010].name.value_counts().head(10)
Out[38]:
In [39]:
pink = cast[cast['title'] == 'The Pink Panther']
pink.groupby(['year'])[['n']].max()
Out[39]:
In [40]:
oz = cast[cast['name'] == 'Frank Oz']
oz_roles = oz.groupby(['year', 'title']).size()
oz_roles[oz_roles > 1]
Out[40]:
In [41]:
oz = cast[cast['name'] == 'Frank Oz']
oz_roles = oz.groupby(['character']).size()
oz_roles[oz_roles > 1].sort_values()
Out[41]:
In [42]:
cast['n_total'] = cast.groupby('title')['n'].transform('max')  # transform returns a value for each row: the group maximum is broadcast back to every row of the group
cast.head()
Out[42]:
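The difference between transform and a plain aggregation is worth spelling out: an aggregation returns one value per group, while transform broadcasts that value back to every original row. A minimal sketch with a tiny made-up frame:

```python
import pandas as pd

demo = pd.DataFrame({'title': ['X', 'X', 'Y'],
                     'n': [1, 3, 2]})

# aggregation: one value per group (2 rows: X and Y)
agg_max = demo.groupby('title')['n'].max()

# transform: one value per original row (3 rows); the group
# maximum is broadcast back to every row of that group
demo['n_total'] = demo.groupby('title')['n'].transform('max')
print(demo)  # both 'X' rows get n_total 3
```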
In [43]:
leading = cast[cast['n'] == 1]
sums_decade = leading.groupby([cast['year'] // 10 * 10, 'type']).size()
sums_decade
Out[43]:
In [44]:
#sums_decade.groupby(level='year').transform(lambda x: x / x.sum())
ratios_decade = sums_decade / sums_decade.groupby(level='year').transform('sum')
ratios_decade
Out[44]:
In [45]:
ratios_decade[:, 'actor'].plot()
ratios_decade[:, 'actress'].plot()
Out[45]:
In [46]:
t = titles
t.year.value_counts().head(3)
Out[46]:
In [47]:
cast1950 = cast[cast['year'] // 10 == 195]
cast1950 = cast1950[cast1950['n'] == 1]
cast1950['type'].value_counts()
Out[47]:
In [48]:
cast2000 = cast[cast['year'] // 10 == 200]
cast2000 = cast2000[cast2000['n'] == 1]
cast2000['type'].value_counts()
Out[48]: