Some imports:
In [ ]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
try:
    import seaborn
except ImportError:
    pass
pd.options.display.max_rows = 10
The "group by" concept: we want to apply the same function to subsets of a dataframe, where some key determines how the dataframe is split into those subsets.
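This operation is also referred to as the "split-apply-combine" operation, involving the following steps: splitting the data into groups based on some key, applying a function to each group independently, and combining the results into a single data structure.
This is similar to SQL GROUP BY.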
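The three steps (split, apply, combine) can be spelled out by hand. The following is only a minimal sketch on made-up toy data; the names toy, pieces, sums and result are chosen for illustration and are not part of pandas:

```python
import pandas as pd

# Toy data, chosen only for illustration.
toy = pd.DataFrame({'key': ['A', 'B', 'A', 'B'],
                    'data': [1, 2, 3, 4]})

# Split: one sub-dataframe per distinct key.
pieces = {key: toy[toy['key'] == key] for key in toy['key'].unique()}

# Apply: reduce each piece independently.
sums = {key: piece['data'].sum() for key, piece in pieces.items()}

# Combine: gather the per-group results in one Series.
result = pd.Series(sums, name='data')
result.index.name = 'key'
print(result)
```

Writing the steps out like this shows what happens conceptually; in practice a single groupby call replaces all three.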
A small example in pandas syntax:
In [ ]:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})
df
Using the filtering and reduction operations we have seen in the previous notebooks, we could do something like:
df[df['key'] == "A"].sum()
df[df['key'] == "B"].sum()
...
But pandas provides the groupby method to do this:
In [ ]:
df.groupby('key').aggregate('sum')  # np.sum also works, but newer pandas versions may warn about it
In [ ]:
df.groupby('key').sum()
Many more aggregation methods are available, such as mean, median, min, max, count and size.
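As a small sketch of a few other reductions, here is the same toy dataframe again, restated so that the cell runs on its own. The variable names grouped, means, largest, sizes and summary are chosen for illustration:

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})

# Select the column to reduce after grouping.
grouped = df.groupby('key')['data']

means = grouped.mean()    # average per group
largest = grouped.max()   # largest value per group
sizes = grouped.size()    # number of rows per group

# Several reductions at once: one output column per function name.
summary = grouped.agg(['min', 'max', 'mean'])
print(summary)
```

Passing a list of function names to agg is convenient when several summary statistics are needed side by side.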
Let's go back to the Titanic survival data:
In [ ]:
df = pd.read_csv("data/titanic.csv")
In [ ]:
df.head()
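The same groupby pattern carries over to a dataset like this one. The sketch below does not read data/titanic.csv; it uses a few made-up rows, and the column names 'Sex', 'Age' and 'Survived' are assumptions about the file (they match the commonly distributed version of the dataset, but check df.columns first):

```python
import pandas as pd

# Made-up rows standing in for the real file; the column names
# 'Sex', 'Age' and 'Survived' are assumptions, not taken from the data.
toy_titanic = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'female', 'male'],
    'Age': [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
    'Survived': [0, 1, 1, 0, 1, 0],
})

# Average age per sex.
mean_age = toy_titanic.groupby('Sex')['Age'].mean()

# Fraction of survivors per sex: the mean of a 0/1 column is a ratio.
survival_rate = toy_titanic.groupby('Sex')['Survived'].mean()

print(mean_age)
print(survival_rate)
```

If the real file uses the same column names, replacing toy_titanic with df should give the corresponding figures for the full dataset.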
If you are ready, more groupby exercises can be found in the "Advanced groupby operations" notebook.