Learning More About Pandas by Building and Using a Weighted Average Function

This notebook is based on the article on Pbpython.com. Please refer to it for background and additional details.



In [1]:

    
import pandas as pd
import numpy as np

Read in our sample sales data, which includes the projected price for our new product launch



In [2]:

    
sales = pd.read_excel("https://github.com/chris1610/pbpython/blob/master/data/sales-estimate.xlsx?raw=True", sheet_name="projections")
sales









    Out[2]:

    Account  Name                          State  Rep            Manager        Current_Price  Quantity  New_Product_Price
 0  714466   Trantow-Barrows               MN     Craig Booker   Debra Henley             500       100                550
 1  737550   Fritsch, Russel and Anderson  MN     Craig Booker   Debra Henley             600        90                725
 2  146832   Kiehn-Spinka                  TX     Daniel Hilton  Debra Henley             225       475                255
 3  218895   Kulas Inc                     TX     Daniel Hilton  Debra Henley             290       375                300
 4  412290   Jerde-Hilpert                 WI     John Smith     Debra Henley             375       400                400
 5  740150   Barton LLC                    WI     John Smith     Debra Henley             550       100                600
 6  141962   Herman LLC                    CA     Cedric Moss    Fred Anderson            400       200                425
 7  163416   Purdy-Kunde                   CA     Cedric Moss    Fred Anderson            450       150                475
 8  239344   Stokes LLC                    WA     Cedric Moss    Fred Anderson            550        75                610
 9  307599   Kassulke, Ondricka and Metz   NV     Wendy Yule     Fred Anderson            275       450                300
10  688981   Keeling LLC                   NV     Wendy Yule     Fred Anderson            300       250                350
11  729833   Koepp Ltd                     NV     Wendy Yule     Fred Anderson            350       100                375

Show the mean for our current and new product price



In [3]:

    
print(sales["Current_Price"].mean())
print(sales["New_Product_Price"].mean())









    



405.416666667
447.083333333

Calculate the weighted average using the long form



In [4]:

    
print((sales["Current_Price"] * sales["Quantity"]).sum() / sales["Quantity"].sum())
print((sales["New_Product_Price"] * sales["Quantity"]).sum() / sales["Quantity"].sum())









    



342.540687161
374.638336347
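
To see why the weighted figures differ from the plain means, it can help to work the formula on just the two MN accounts from the table. This is a small illustrative sketch; the `mn` frame is typed in by hand here rather than read from the spreadsheet.

```python
import pandas as pd

# The two MN rows from the sample data, typed in by hand for illustration.
mn = pd.DataFrame({
    "Current_Price": [500, 600],
    "Quantity": [100, 90],
})

# Plain mean treats both accounts equally.
simple_mean = mn["Current_Price"].mean()

# Weighted average = sum(price * quantity) / sum(quantity)
weighted = (mn["Current_Price"] * mn["Quantity"]).sum() / mn["Quantity"].sum()

print(simple_mean)         # 550.0
print(round(weighted, 2))  # 547.37
```

The account with the larger quantity (100 units at 500) pulls the weighted figure below the plain mean.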

Use np.average to simplify the formula



In [5]:

    
print(np.average(sales["Current_Price"], weights=sales["Quantity"]))
print(np.average(sales["New_Product_Price"], weights=sales["Quantity"]))









    



342.540687161
374.638336347

For maximum flexibility, build our own weighted average function



In [6]:

    
def wavg(group, avg_name, weight_name):
    """ http://stackoverflow.com/questions/10951341/pandas-dataframe-aggregate-function-using-multiple-columns
    In rare instances, we may not have weights, so just return the mean.
    """
    d = group[avg_name]
    w = group[weight_name]
    # pandas/NumPy division by zero yields nan or inf instead of raising
    # ZeroDivisionError, so check the total weight explicitly.
    if w.sum() == 0:
        return d.mean()
    return (d * w).sum() / w.sum()
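
The fallback for missing weights can be exercised with a small frame whose quantities are all zero. One detail worth knowing: dividing NumPy scalars by zero returns `nan` (with a warning) rather than raising an exception, so this sketch tests the total weight explicitly. The helper name `wavg_checked` and the zero-quantity rows are made up for the illustration.

```python
import numpy as np
import pandas as pd

def wavg_checked(group, avg_name, weight_name):
    """Hypothetical variant of wavg that tests the total weight explicitly."""
    d = group[avg_name]
    w = group[weight_name]
    total = w.sum()
    if total == 0:
        # No usable weights: fall back to the plain mean.
        return d.mean()
    return (d * w).sum() / total

# Made-up rows with zero quantities, only to trigger the fallback.
no_weights = pd.DataFrame({"Current_Price": [500, 600], "Quantity": [0, 0]})

# Plain NumPy scalar division by zero gives nan, not an exception.
# errstate silences the RuntimeWarning that would otherwise be shown.
with np.errstate(divide="ignore", invalid="ignore"):
    raw = (no_weights["Current_Price"] * no_weights["Quantity"]).sum() / no_weights["Quantity"].sum()

print(np.isnan(raw))
print(wavg_checked(no_weights, "Current_Price", "Quantity"))
```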

Call the weighted average on all of the data



In [7]:

    
print(wavg(sales, "Current_Price", "Quantity"))
print(wavg(sales, "New_Product_Price", "Quantity"))









    



342.540687161
374.638336347

Group the data by manager



In [8]:

    
sales.groupby("Manager").apply(wavg, "Current_Price", "Quantity")









    Out[8]:





Manager
Debra Henley     340.665584
Fred Anderson    344.897959
dtype: float64



In [9]:

    
sales.groupby("Manager").apply(wavg, "New_Product_Price", "Quantity")









    Out[9]:





Manager
Debra Henley     372.646104
Fred Anderson    377.142857
dtype: float64

You can also group by state



In [10]:

    
sales.groupby("State").apply(wavg, "New_Product_Price", "Quantity")









    Out[10]:





State
CA    446.428571
MN    632.894737
NV    325.000000
TX    274.852941
WA    610.000000
WI    440.000000
dtype: float64

You can also group by multiple criteria and the function will work correctly.



In [11]:

    
sales.groupby(["Manager", "State"]).apply(wavg, "New_Product_Price", "Quantity")









    Out[11]:





Manager        State
Debra Henley   MN       632.894737
               TX       274.852941
               WI       440.000000
Fred Anderson  CA       446.428571
               NV       325.000000
               WA       610.000000
dtype: float64
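
The grouped result is a Series with a two-level index. As a sketch, it can be pivoted into a Manager-by-State grid with `unstack`. The `sample` frame below retypes only the needed columns from the sample data, and `wavg` is a trimmed copy without the fallback; selecting the price and quantity columns before `apply` is a choice made for this example.

```python
import pandas as pd

# Columns retyped by hand from the sample data shown earlier.
sample = pd.DataFrame({
    "Manager": ["Debra Henley"] * 6 + ["Fred Anderson"] * 6,
    "State": ["MN", "MN", "TX", "TX", "WI", "WI",
              "CA", "CA", "WA", "NV", "NV", "NV"],
    "Quantity": [100, 90, 475, 375, 400, 100, 200, 150, 75, 450, 250, 100],
    "New_Product_Price": [550, 725, 255, 300, 400, 600,
                          425, 475, 610, 300, 350, 375],
})

def wavg(group, avg_name, weight_name):
    """Trimmed copy of the weighted average function (no fallback)."""
    d = group[avg_name]
    w = group[weight_name]
    return (d * w).sum() / w.sum()

by_state = (
    sample.groupby(["Manager", "State"])[["New_Product_Price", "Quantity"]]
    .apply(wavg, "New_Product_Price", "Quantity")
)

# Move the State level into columns; combinations that never occur become NaN.
grid = by_state.unstack("State")
print(grid.round(2))
```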

Example of applying multiple aggregation functions



In [12]:

    
f = {'New_Product_Price': ['mean'],'Current_Price': ['median'], 'Quantity': ['sum', 'mean']}
sales.groupby("Manager").agg(f)









    Out[12]:

              Quantity             New_Product_Price Current_Price
                   sum        mean              mean        median
Manager
Debra Henley      1540  256.666667        471.666667         437.5
Fred Anderson     1225  204.166667        422.500000         375.0
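
The dictionary form produces two-level column labels. If flat column names are preferred, pandas 0.25 and later also supports named aggregation. The sketch below uses hand-typed columns from the sample data, and the output names such as `qty_total` are illustrative rather than taken from the article.

```python
import pandas as pd

# Columns retyped by hand from the sample data shown earlier.
sample = pd.DataFrame({
    "Manager": ["Debra Henley"] * 6 + ["Fred Anderson"] * 6,
    "Current_Price": [500, 600, 225, 290, 375, 550,
                      400, 450, 550, 275, 300, 350],
    "Quantity": [100, 90, 475, 375, 400, 100, 200, 150, 75, 450, 250, 100],
    "New_Product_Price": [550, 725, 255, 300, 400, 600,
                          425, 475, 610, 300, 350, 375],
})

# Each keyword is an output column: name=(input column, aggregation).
flat = sample.groupby("Manager").agg(
    qty_total=("Quantity", "sum"),
    qty_mean=("Quantity", "mean"),
    new_price_mean=("New_Product_Price", "mean"),
    current_price_median=("Current_Price", "median"),
)
print(flat)
```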

Similar method to group multiple custom functions together into a single DataFrame



In [13]:

    
data_1 = sales.groupby("Manager").apply(wavg, "New_Product_Price", "Quantity")
data_2 = sales.groupby("Manager").apply(wavg, "Current_Price", "Quantity")



In [14]:

    
summary = pd.DataFrame(data=dict(s1=data_1, s2=data_2))
summary.columns = ["New Product Price","Current Product Price"]
summary.head()









    Out[14]:

               New Product Price  Current Product Price
Manager
Debra Henley          372.646104             340.665584
Fred Anderson         377.142857             344.897959
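
As an alternative sketch, one function can return a Series holding both weighted averages, so that a single `apply` yields the summary DataFrame directly. The helper name `price_summary` and the hand-typed `sample` frame are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Columns retyped by hand from the sample data shown earlier.
sample = pd.DataFrame({
    "Manager": ["Debra Henley"] * 6 + ["Fred Anderson"] * 6,
    "Current_Price": [500, 600, 225, 290, 375, 550,
                      400, 450, 550, 275, 300, 350],
    "Quantity": [100, 90, 475, 375, 400, 100, 200, 150, 75, 450, 250, 100],
    "New_Product_Price": [550, 725, 255, 300, 400, 600,
                          425, 475, 610, 300, 350, 375],
})

def price_summary(group):
    """Hypothetical helper: both weighted averages for one group."""
    weights = group["Quantity"]
    return pd.Series({
        "New Product Price": np.average(group["New_Product_Price"], weights=weights),
        "Current Product Price": np.average(group["Current_Price"], weights=weights),
    })

cols = ["New_Product_Price", "Current_Price", "Quantity"]
# One row per manager, one column per key of the returned Series.
summary_alt = sample.groupby("Manager")[cols].apply(price_summary)
print(summary_alt.round(6))
```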

Finally, as shown earlier, NumPy's average function accepts weights directly:



In [15]:

    
np.average(sales["Current_Price"], weights=sales["Quantity"])









    Out[15]:





342.54068716094031

Wrap np.average in a lambda function so that it works with grouped data



In [16]:

    
sales.groupby("Manager").apply(lambda x: np.average(x['New_Product_Price'], weights=x['Quantity']))









    Out[16]:





Manager
Debra Henley     372.646104
Fred Anderson    377.142857
dtype: float64
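
Calling a Python function per group with `apply` may become slow on large data. One possible alternative, offered as a sketch and not taken from the original article, computes price times quantity as a regular column and then relies on built-in group sums. The `weighted` column name and the hand-typed `sample` frame are made up for the example.

```python
import pandas as pd

# Columns retyped by hand from the sample data shown earlier.
sample = pd.DataFrame({
    "Manager": ["Debra Henley"] * 6 + ["Fred Anderson"] * 6,
    "Quantity": [100, 90, 475, 375, 400, 100, 200, 150, 75, 450, 250, 100],
    "New_Product_Price": [550, 725, 255, 300, 400, 600,
                          425, 475, 610, 300, 350, 375],
})

# Price times quantity per row, then sum both pieces within each group.
totals = (
    sample.assign(weighted=sample["New_Product_Price"] * sample["Quantity"])
    .groupby("Manager")[["weighted", "Quantity"]]
    .sum()
)

# Weighted average per manager = summed products / summed quantities.
wavg_by_manager = totals["weighted"] / totals["Quantity"]
print(wavg_by_manager.round(6))
```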


