In [1]:
# code for loading the format for the notebook
import os

# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir(os.path.join('..', '..', 'notebook_format'))

from formats import load_style
load_style(plot_style = False)


Out[1]:

In [2]:
os.chdir(path)
import numpy as np
import pandas as pd

# 1. magic to print version
# 2. magic so that the notebook will reload external python modules
%load_ext watermark
%load_ext autoreload 
%autoreload 2

%watermark -a 'Ethen' -d -t -v -p numpy,pandas


Ethen 2017-07-12 15:42:46 

CPython 3.5.2
IPython 5.4.1

numpy 1.13.1
pandas 0.20.2

Pandas Pivot Table

This notebook follows the tutorial from the link below. Blog: Pandas pivot table explained.

The general rule of thumb is that once you find yourself chaining multiple groupby operations, you should evaluate whether a pivot table is a more useful approach.

One of the challenges with using the pandas pivot_table is making sure you understand your data and what questions you are trying to answer with the pivot table. It is a seemingly simple function but can produce very powerful analysis very quickly. In this scenario, we'll be tracking a sales pipeline (also called a funnel). The basic problem is that some sales cycles are very long (e.g. enterprise software, capital equipment, etc.) and management wants to understand it in more detail throughout the year. Typical questions include:

  • How much revenue is in the pipeline?
  • What products are in the pipeline?
  • Who has what products at what stage?
  • How likely are we to close deals by year end?

Many companies will have CRM tools or other software that sales uses to track the process. While these may be useful tools for analyzing the data, inevitably someone will export the data to Excel and use a PivotTable to summarize it. Using a pandas pivot table can be a good alternative because it is:

  • Quicker (once it is set up)
  • Self-documenting (look at the code and you know what it does)
  • Easy to use to generate a report or email
  • More flexible because you can define custom aggregation functions (see the sketch after this list)
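
To give a feel for that last point, here is a minimal sketch of a custom aggregation function. It uses a small made-up DataFrame (the real sales data is only loaded below); the point is simply that aggfunc accepts any callable that reduces a group of values to a single number.

import pandas as pd

# toy data, purely for illustration (not the sales-funnel data used below)
toy = pd.DataFrame({'Rep': ['Alice', 'Alice', 'Bob'],
                    'Price': [100, 300, 200]})

# aggfunc can be any function that takes a Series (one group) and returns a scalar
def price_range(series):
    return series.max() - series.min()

toy.pivot_table(index = 'Rep', values = 'Price', aggfunc = price_range)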

In [3]:
df = pd.read_excel('sales-funnel.xlsx')
df.head()


Out[3]:
Account Name Rep Manager Product Quantity Price Status
0 714466 Trantow-Barrows Craig Booker Debra Henley CPU 1 30000 presented
1 714466 Trantow-Barrows Craig Booker Debra Henley Software 1 10000 presented
2 714466 Trantow-Barrows Craig Booker Debra Henley Maintenance 2 5000 pending
3 737550 Fritsch, Russel and Anderson Craig Booker Debra Henley CPU 1 35000 declined
4 146832 Kiehn-Spinka Daniel Hilton Debra Henley CPU 2 65000 won

Pivot the Data

As we build up the pivot table, it's probably easiest to take one step at a time. Add items and check each step to verify you are getting the results you expect.

The simplest pivot table must have a dataframe, an index (the column that the data will be grouped on) and values (the column whose values will be aggregated).


In [4]:
df.pivot_table(index = ['Manager', 'Rep'], values = ['Price'])


Out[4]:
Price
Manager Rep
Debra Henley Craig Booker 20000.000000
Daniel Hilton 38333.333333
John Smith 20000.000000
Fred Anderson Cedric Moss 27500.000000
Wendy Yule 44250.000000

By default, the values will be averaged, but we can do a count or a sum by providing the aggfunc parameter.


In [5]:
# you can pass a list to almost every argument of the pivot_table function
df.pivot_table(index = ['Manager', 'Rep'], values = ['Price'], aggfunc = [np.mean, len])


Out[5]:
mean len
Price Price
Manager Rep
Debra Henley Craig Booker 20000.000000 4
Daniel Hilton 38333.333333 3
John Smith 20000.000000 2
Fred Anderson Cedric Moss 27500.000000 4
Wendy Yule 44250.000000 4

If we want to see sales broken down by product, the columns argument allows us to define one or more columns. Note: a confusing point with pivot_table is the use of columns versus values. Columns are optional - they provide an additional way to segment the actual values you care about. The aggregation functions are applied to the values you've listed.


In [6]:
df.pivot_table(index = ['Manager','Rep'], values = ['Price'],
               columns = ['Product'], aggfunc = [np.sum])


Out[6]:
sum
Price
Product CPU Maintenance Monitor Software
Manager Rep
Debra Henley Craig Booker 65000.0 5000.0 NaN 10000.0
Daniel Hilton 105000.0 NaN NaN 10000.0
John Smith 35000.0 5000.0 NaN NaN
Fred Anderson Cedric Moss 95000.0 5000.0 NaN 10000.0
Wendy Yule 165000.0 7000.0 5000.0 NaN

The NaNs are a bit distracting. If we want to remove them, we could use fill_value to set them to 0.


In [7]:
df.pivot_table(index = ['Manager', 'Rep'], values = ['Price', 'Quantity'],
               columns = ['Product'], aggfunc = [np.sum], fill_value = 0)


Out[7]:
sum
Price Quantity
Product CPU Maintenance Monitor Software CPU Maintenance Monitor Software
Manager Rep
Debra Henley Craig Booker 65000 5000 0 10000 2 2 0 1
Daniel Hilton 105000 0 0 10000 4 0 0 1
John Smith 35000 5000 0 0 1 2 0 0
Fred Anderson Cedric Moss 95000 5000 0 10000 3 1 0 1
Wendy Yule 165000 7000 5000 0 7 3 2 0

You can move items to the index to get a different visual representation. The following code chunk removes Product from the columns, adds it to the index, and also uses the margins = True parameter to add totals to the pivot table.


In [8]:
df.pivot_table(index = ['Manager', 'Rep', 'Product'],
               values = ['Price', 'Quantity'], aggfunc = [np.sum], margins = True)


Out[8]:
sum
Price Quantity
Manager Rep Product
Debra Henley Craig Booker CPU 65000.0 2.0
Maintenance 5000.0 2.0
Software 10000.0 1.0
Daniel Hilton CPU 105000.0 4.0
Software 10000.0 1.0
John Smith CPU 35000.0 1.0
Maintenance 5000.0 2.0
Fred Anderson Cedric Moss CPU 95000.0 3.0
Maintenance 5000.0 1.0
Software 10000.0 1.0
Wendy Yule CPU 165000.0 7.0
Maintenance 7000.0 3.0
Monitor 5000.0 2.0
All 522000.0 30.0

We can define the status column as a category and set the order we want in the pivot table.


In [9]:
df['Status'] = df['Status'].astype('category')
df['Status'] = df['Status'].cat.set_categories(['won', 'pending', 'presented', 'declined'])
df.pivot_table(index = ['Manager', 'Status'], values = ['Price'],
               aggfunc = [np.sum], fill_value = 0, margins = True)


Out[9]:
sum
Price
Manager Status
Debra Henley won 65000.0
pending 50000.0
presented 50000.0
declined 70000.0
Fred Anderson won 172000.0
pending 5000.0
presented 45000.0
declined 65000.0
All 522000.0

A really handy feature is the ability to pass a dictionary to the aggfunc so you can perform different functions on each of the values you select. This has a side-effect of making the labels a little cleaner.


In [10]:
table = df.pivot_table(index = ['Manager','Status'], 
                       columns = ['Product'], 
                       values = ['Quantity','Price'],
                       aggfunc = {'Quantity': len, 'Price': [np.sum, np.mean]}, 
                       fill_value = 0)
table


Out[10]:
Price Quantity
mean sum len
Product CPU Maintenance Monitor Software CPU Maintenance Monitor Software CPU Maintenance Monitor Software
Manager Status
Debra Henley won 65000 0 0 0 65000 0 0 0 1 0 0 0
pending 40000 5000 0 0 40000 10000 0 0 1 2 0 0
presented 30000 0 0 10000 30000 0 0 20000 1 0 0 2
declined 35000 0 0 0 70000 0 0 0 2 0 0 0
Fred Anderson won 82500 7000 0 0 165000 7000 0 0 2 1 0 0
pending 0 5000 0 0 0 5000 0 0 0 1 0 0
presented 30000 0 5000 10000 30000 0 5000 10000 1 0 1 1
declined 65000 0 0 0 65000 0 0 0 1 0 0 0

Once you have generated the pivot table, the result is just another DataFrame, so you can filter it using standard DataFrame functions. For example, we can look at all of our pending and won deals.


In [11]:
# .query uses strings for boolean indexing and we don't have to
# specify the dataframe that the Status column is coming from
table.query("Status == ['pending','won']")


Out[11]:
Price Quantity
mean sum len
Product CPU Maintenance Monitor Software CPU Maintenance Monitor Software CPU Maintenance Monitor Software
Manager Status
Debra Henley won 65000 0 0 0 65000 0 0 0 1 0 0 0
pending 40000 5000 0 0 40000 10000 0 0 1 2 0 0
Fred Anderson won 82500 7000 0 0 165000 7000 0 0 2 1 0 0
pending 0 5000 0 0 0 5000 0 0 0 1 0 0
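
As a side note, the same filter can also be written with plain boolean indexing on the Status level of the index, in case you prefer not to use .query. This is just an alternative sketch, not part of the original tutorial.

# build a boolean mask from the 'Status' level of the pivot table's index
mask = table.index.get_level_values('Status').isin(['pending', 'won'])
table[mask]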