Some imports:
In [ ]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
try:
import seaborn
except ImportError:
pass
pd.options.display.max_rows = 10
By "group by" we are referring to a process involving one or more of the following steps
Similar to SQL GROUP BY
The example of the image in pandas syntax:
In [ ]:
df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],
'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})
df
In [ ]:
df.groupby('key').aggregate('sum') # np.sum
In [ ]:
df.groupby('key').sum()
These exercises are based on the PyCon tutorial of Brandon Rhodes (so all credit to him!) and the datasets he prepared for that. You can download these data from here: titles.csv
and cast.csv
and put them in the /data
folder.
In [ ]:
cast = pd.read_csv('data/cast.csv')
cast.head()
In [ ]:
titles = pd.read_csv('data/titles.csv')
titles.head()
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
Sometimes you don't want to aggregate the groups, but transform the values in each group. This can be achieved with transform
:£
In [ ]:
df
In [ ]:
def normalize(group):
return (group - group.mean()) / group.std()
In [ ]:
df.groupby('key').transform(normalize)
In [ ]:
In [ ]:
In [ ]:
A useful shortcut to calculate the number of occurences of certain values is value_counts
(this is somewhat equivalent to df.groupby(key).size())
)
For example, what are the most occuring movie titles?
In [ ]:
titles.title.value_counts().head()
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]: