Notebook accompanying pbpython article - Pandas Grouper and Agg Functions Explained
In [1]:
import pandas as pd
import collections
Read in the sample sales file then convert the date column to a proper date time column
In [2]:
df = pd.read_excel("https://github.com/chris1610/pbpython/blob/master/data/sample-salesv3.xlsx?raw=True")
In [3]:
df["date"] = pd.to_datetime(df['date'])
df.head()
Out[3]:
In [4]:
df.dtypes
Out[4]:
Example showing how resample can be used along with set_index
In [5]:
df.set_index('date').resample('M')["ext price"].sum()
Out[5]:
A more complex example with a groupby
In [6]:
df.set_index('date').groupby('name')["ext price"].resample("M").sum().head(20)
Out[6]:
A simpler example using pd.Grouper
In [7]:
df.groupby(['name', pd.Grouper(key='date', freq='M')])['ext price'].sum().head(20)
Out[7]:
In [8]:
df.groupby(['name', pd.Grouper(key='date', freq='A-DEC')])['ext price'].sum()
Out[8]:
In [9]:
# This works but is kind of slow and probably not that useful for this data set
#df.groupby(['name', pd.Grouper(key='date', freq='60s')])['ext price'].sum()
Some more examples using various off set alisases - http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
In [10]:
df.groupby(['name', pd.Grouper(key='date', freq='W-MON')])['ext price'].sum()
Out[10]:
In [11]:
df.groupby(['name', 'sku', pd.Grouper(key='date', freq='A-DEC')])['ext price'].sum()
Out[11]:
Now show how to use the new .agg function
First, how to get summary stats without agg
In [12]:
df[["ext price", "quantity"]].sum()
Out[12]:
In [13]:
df[["ext price", "quantity"]].mean()
Out[13]:
Using .agg for sums and means across multiple columns
In [14]:
df[["ext price", "quantity", "unit price"]].agg(['sum', 'mean'])
Out[14]:
Passing a dictionary containing different operations per column
In [15]:
df.agg({'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'unit price': ['mean']})
Out[15]:
Using custom functions
In [16]:
get_max = lambda x: x.value_counts(dropna=False).index[0]
In [17]:
df.agg({'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'unit price': ['mean'], 'sku': [get_max]})
Out[17]:
Clean up the naming in the output by defining the name for get_max
In [18]:
get_max.__name__ = "most frequent"
In [19]:
df.agg({'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'unit price': ['mean'], 'sku': [get_max]})
Out[19]:
Using an OrderedDictionary to maintain column order
In [20]:
f = collections.OrderedDict([('ext price', ['sum', 'mean']), ('quantity', ['sum', 'mean']), ('sku', [get_max])])
In [21]:
df.agg(f)
Out[21]:
In [ ]: