Notebook accompanying pbpython article - Pandas Grouper and Agg Functions Explained



In [1]:

    
import pandas as pd
import collections

Read in the sample sales file then convert the date column to a proper date time column



In [2]:

    
df = pd.read_excel("https://github.com/chris1610/pbpython/blob/master/data/sample-salesv3.xlsx?raw=True")



In [3]:

    
df["date"] = pd.to_datetime(df['date'])
df.head()









    Out[3]:







  
    
      
      account number
      name
      sku
      quantity
      unit price
      ext price
      date
    
  
  
    
      0
      740150
      Barton LLC
      B1-20000
      39
      86.69
      3380.91
      2014-01-01 07:21:51
    
    
      1
      714466
      Trantow-Barrows
      S2-77896
      -1
      63.16
      -63.16
      2014-01-01 10:00:47
    
    
      2
      218895
      Kulas Inc
      B1-69924
      23
      90.70
      2086.10
      2014-01-01 13:24:58
    
    
      3
      307599
      Kassulke, Ondricka and Metz
      S1-65481
      41
      21.05
      863.05
      2014-01-01 15:05:22
    
    
      4
      412290
      Jerde-Hilpert
      S2-34077
      6
      83.21
      499.26
      2014-01-01 23:26:55



In [4]:

    
df.dtypes









    Out[4]:





account number             int64
name                      object
sku                       object
quantity                   int64
unit price               float64
ext price                float64
date              datetime64[ns]
dtype: object

Example showing how resample can be used along with set_index



In [5]:

    
df.set_index('date').resample('M')["ext price"].sum()









    Out[5]:





date
2014-01-31    185361.66
2014-02-28    146211.62
2014-03-31    203921.38
2014-04-30    174574.11
2014-05-31    165418.55
2014-06-30    174089.33
2014-07-31    191662.11
2014-08-31    153778.59
2014-09-30    168443.17
2014-10-31    171495.32
2014-11-30    119961.22
2014-12-31    163867.26
Freq: M, Name: ext price, dtype: float64

A more complex example with a groupby



In [6]:

    
df.set_index('date').groupby('name')["ext price"].resample("M").sum().head(20)









    Out[6]:





name                             date      
Barton LLC                       2014-01-31     6177.57
                                 2014-02-28    12218.03
                                 2014-03-31     3513.53
                                 2014-04-30    11474.20
                                 2014-05-31    10220.17
                                 2014-06-30    10463.73
                                 2014-07-31     6750.48
                                 2014-08-31    17541.46
                                 2014-09-30    14053.61
                                 2014-10-31     9351.68
                                 2014-11-30     4901.14
                                 2014-12-31     2772.90
Cronin, Oberbrunner and Spencer  2014-01-31     1141.75
                                 2014-02-28    13976.26
                                 2014-03-31    11691.62
                                 2014-04-30     3685.44
                                 2014-05-31     6760.11
                                 2014-06-30     5379.67
                                 2014-07-31     6020.30
                                 2014-08-31     5399.58
Name: ext price, dtype: float64

A simpler example using pd.Grouper



In [7]:

    
df.groupby(['name', pd.Grouper(key='date', freq='M')])['ext price'].sum().head(20)









    Out[7]:





name                             date      
Barton LLC                       2014-01-31     6177.57
                                 2014-02-28    12218.03
                                 2014-03-31     3513.53
                                 2014-04-30    11474.20
                                 2014-05-31    10220.17
                                 2014-06-30    10463.73
                                 2014-07-31     6750.48
                                 2014-08-31    17541.46
                                 2014-09-30    14053.61
                                 2014-10-31     9351.68
                                 2014-11-30     4901.14
                                 2014-12-31     2772.90
Cronin, Oberbrunner and Spencer  2014-01-31     1141.75
                                 2014-02-28    13976.26
                                 2014-03-31    11691.62
                                 2014-04-30     3685.44
                                 2014-05-31     6760.11
                                 2014-06-30     5379.67
                                 2014-07-31     6020.30
                                 2014-08-31     5399.58
Name: ext price, dtype: float64



In [8]:

    
df.groupby(['name', pd.Grouper(key='date', freq='A-DEC')])['ext price'].sum()









    Out[8]:





name                             date      
Barton LLC                       2014-12-31    109438.50
Cronin, Oberbrunner and Spencer  2014-12-31     89734.55
Frami, Hills and Schmidt         2014-12-31    103569.59
Fritsch, Russel and Anderson     2014-12-31    112214.71
Halvorson, Crona and Champlin    2014-12-31     70004.36
Herman LLC                       2014-12-31     82865.00
Jerde-Hilpert                    2014-12-31    112591.43
Kassulke, Ondricka and Metz      2014-12-31     86451.07
Keeling LLC                      2014-12-31    100934.30
Kiehn-Spinka                     2014-12-31     99608.77
Koepp Ltd                        2014-12-31    103660.54
Kuhn-Gusikowski                  2014-12-31     91094.28
Kulas Inc                        2014-12-31    137351.96
Pollich LLC                      2014-12-31     87347.18
Purdy-Kunde                      2014-12-31     77898.21
Sanford and Sons                 2014-12-31     98822.98
Stokes LLC                       2014-12-31     91535.92
Trantow-Barrows                  2014-12-31    123381.38
White-Trantow                    2014-12-31    135841.99
Will LLC                         2014-12-31    104437.60
Name: ext price, dtype: float64



In [9]:

    
# This works but is kind of slow and probably not that useful for this data set
#df.groupby(['name', pd.Grouper(key='date', freq='60s')])['ext price'].sum()

Some more examples using various off set alisases - http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases



In [10]:

    
df.groupby(['name', pd.Grouper(key='date', freq='W-MON')])['ext price'].sum()









    Out[10]:





name        date      
Barton LLC  2014-01-06    3380.91
            2014-01-20     224.86
            2014-01-27    2423.96
            2014-02-03     147.84
            2014-02-10     217.14
            2014-02-17    8442.09
            2014-03-03    3558.80
            2014-03-10    1763.49
            2014-03-17    1750.04
            2014-04-07    1561.14
            2014-04-14    4702.61
            2014-04-21     836.63
            2014-04-28    2540.79
            2014-05-05    4109.99
            2014-05-12    1671.83
            2014-05-26    3251.43
            2014-06-02    6112.87
            2014-06-09     677.35
            2014-06-16    2798.00
            2014-06-23    3895.46
            2014-07-14    1033.26
            2014-07-21    3989.22
            2014-08-04    7384.80
            2014-08-11      84.18
            2014-08-18    3857.10
            2014-08-25    2925.94
            2014-09-01    5017.44
            2014-09-08    1319.28
            2014-09-15    2608.50
            2014-09-22    7371.67
                           ...   
Will LLC    2014-03-31    2595.78
            2014-04-07     991.99
            2014-04-14    2069.20
            2014-04-21    3007.03
            2014-04-28      26.72
            2014-05-12    2220.74
            2014-05-19    3250.88
            2014-06-02    6385.33
            2014-06-16    1440.96
            2014-06-23      56.15
            2014-06-30     922.41
            2014-07-07    2985.85
            2014-07-14    3073.44
            2014-07-21    3951.27
            2014-07-28    1006.98
            2014-08-11     773.90
            2014-08-18     254.04
            2014-08-25     411.88
            2014-09-01    2020.57
            2014-09-29    2325.42
            2014-10-06     371.52
            2014-10-20      99.47
            2014-10-27    1616.14
            2014-11-03    5768.60
            2014-11-10    1734.08
            2014-12-01     705.96
            2014-12-08    4301.00
            2014-12-15    3502.09
            2014-12-22    2993.40
            2015-01-05    1764.72
Name: ext price, Length: 805, dtype: float64



In [11]:

    
df.groupby(['name', 'sku', pd.Grouper(key='date', freq='A-DEC')])['ext price'].sum()









    Out[11]:





name                             sku       date      
Barton LLC                       B1-04202  2014-12-31     2433.20
                                 B1-05914  2014-12-31      324.72
                                 B1-20000  2014-12-31     7212.11
                                 B1-33087  2014-12-31     4159.97
                                 B1-33364  2014-12-31     4369.19
                                 B1-38851  2014-12-31     6627.93
                                 B1-50809  2014-12-31     5087.44
                                 B1-53102  2014-12-31      431.27
                                 B1-53636  2014-12-31     3737.59
                                 B1-65551  2014-12-31     4703.54
                                 B1-69924  2014-12-31     1634.29
                                 B1-86481  2014-12-31     1457.80
                                 S1-06532  2014-12-31     8346.20
                                 S1-27722  2014-12-31     5840.43
                                 S1-30248  2014-12-31      887.97
                                 S1-47412  2014-12-31     5176.74
                                 S1-50961  2014-12-31     1883.31
                                 S1-65481  2014-12-31     3352.62
                                 S1-82801  2014-12-31     5453.39
                                 S1-93683  2014-12-31     1715.28
                                 S2-10342  2014-12-31    12337.83
                                 S2-11481  2014-12-31      728.50
                                 S2-16558  2014-12-31      623.07
                                 S2-34077  2014-12-31     5726.16
                                 S2-77896  2014-12-31     6156.02
                                 S2-78676  2014-12-31     2616.60
                                 S2-82423  2014-12-31     3735.43
                                 S2-83881  2014-12-31     2679.90
Cronin, Oberbrunner and Spencer  B1-04202  2014-12-31     5054.84
                                 B1-05914  2014-12-31     4003.64
                                                           ...   
White-Trantow                    S2-82423  2014-12-31     5863.69
                                 S2-83881  2014-12-31     6506.07
Will LLC                         B1-04202  2014-12-31      464.33
                                 B1-05914  2014-12-31     5935.53
                                 B1-20000  2014-12-31     4113.19
                                 B1-33087  2014-12-31     3711.06
                                 B1-33364  2014-12-31    11599.40
                                 B1-38851  2014-12-31     1475.09
                                 B1-53102  2014-12-31     6995.58
                                 B1-53636  2014-12-31      532.80
                                 B1-65551  2014-12-31     3039.01
                                 B1-69924  2014-12-31    10021.68
                                 B1-86481  2014-12-31     1770.34
                                 S1-06532  2014-12-31     3314.72
                                 S1-27722  2014-12-31     4493.11
                                 S1-30248  2014-12-31      975.52
                                 S1-47412  2014-12-31     1642.68
                                 S1-50961  2014-12-31     3513.62
                                 S1-65481  2014-12-31      593.46
                                 S1-82801  2014-12-31     2034.92
                                 S1-93683  2014-12-31     9559.57
                                 S2-00301  2014-12-31     2117.33
                                 S2-10342  2014-12-31     4918.55
                                 S2-11481  2014-12-31     4324.79
                                 S2-16558  2014-12-31      413.52
                                 S2-23246  2014-12-31     1893.25
                                 S2-34077  2014-12-31      991.99
                                 S2-77896  2014-12-31     5321.87
                                 S2-78676  2014-12-31     5016.76
                                 S2-83881  2014-12-31     3653.93
Name: ext price, Length: 544, dtype: float64

Now show how to use the new .agg function

First, how to get summary stats without agg



In [12]:

    
df[["ext price", "quantity"]].sum()









    Out[12]:





ext price    2018784.32
quantity       36463.00
dtype: float64



In [13]:

    
df[["ext price", "quantity"]].mean()









    Out[13]:





ext price    1345.856213
quantity       24.308667
dtype: float64

Using .agg for sums and means across multiple columns



In [14]:

    
df[["ext price", "quantity", "unit price"]].agg(['sum', 'mean'])









    Out[14]:







  
    
      
      ext price
      quantity
      unit price
    
  
  
    
      sum
      2.018784e+06
      36463.000000
      82511.290000
    
    
      mean
      1.345856e+03
      24.308667
      55.007527

Passing a dictionary containing different operations per column



In [15]:

    
df.agg({'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'unit price': ['mean']})









    Out[15]:







  
    
      
      ext price
      quantity
      unit price
    
  
  
    
      mean
      1.345856e+03
      24.308667
      55.007527
    
    
      sum
      2.018784e+06
      36463.000000
      NaN

Using custom functions



In [16]:

    
get_max = lambda x: x.value_counts(dropna=False).index[0]



In [17]:

    
df.agg({'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'unit price': ['mean'], 'sku': [get_max]})









    Out[17]:







  
    
      
      ext price
      quantity
      unit price
      sku
    
  
  
    
      <lambda>
      NaN
      NaN
      NaN
      S2-77896
    
    
      mean
      1.345856e+03
      24.308667
      55.007527
      NaN
    
    
      sum
      2.018784e+06
      36463.000000
      NaN
      NaN

Clean up the naming in the output by defining the name for get_max



In [18]:

    
get_max.__name__ = "most frequent"



In [19]:

    
df.agg({'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'unit price': ['mean'], 'sku': [get_max]})









    Out[19]:







  
    
      
      ext price
      quantity
      unit price
      sku
    
  
  
    
      mean
      1.345856e+03
      24.308667
      55.007527
      NaN
    
    
      most frequent
      NaN
      NaN
      NaN
      S2-77896
    
    
      sum
      2.018784e+06
      36463.000000
      NaN
      NaN

Using an OrderedDictionary to maintain column order



In [20]:

    
f = collections.OrderedDict([('ext price', ['sum', 'mean']), ('quantity', ['sum', 'mean']), ('sku', [get_max])])



In [21]:

    
df.agg(f)









    Out[21]:







  
    
      
      ext price
      quantity
      sku
    
  
  
    
      mean
      1.345856e+03
      24.308667
      NaN
    
    
      most frequent
      NaN
      NaN
      S2-77896
    
    
      sum
      2.018784e+06
      36463.000000
      NaN



In [ ]:

	account number	name	sku	quantity	unit price	ext price	date
0	740150	Barton LLC	B1-20000	39	86.69	3380.91	2014-01-01 07:21:51
1	714466	Trantow-Barrows	S2-77896	-1	63.16	-63.16	2014-01-01 10:00:47
2	218895	Kulas Inc	B1-69924	23	90.70	2086.10	2014-01-01 13:24:58
3	307599	Kassulke, Ondricka and Metz	S1-65481	41	21.05	863.05	2014-01-01 15:05:22
4	412290	Jerde-Hilpert	S2-34077	6	83.21	499.26	2014-01-01 23:26:55

	ext price	quantity	unit price
sum	2.018784e+06	36463.000000	82511.290000
mean	1.345856e+03	24.308667	55.007527

	ext price	quantity	unit price	sku
<lambda>	NaN	NaN	NaN	S2-77896
mean	1.345856e+03	24.308667	55.007527	NaN
sum	2.018784e+06	36463.000000	NaN	NaN