The pivot table is a powerful tool to summarize and present data. Pandas has a function which allows you to quickly convert a DataFrame to a pivot table - http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pivot_table.html

This function is very useful but sometimes it can be tricky to remember how to use it to get the data formatted in a way you need.

This notebook will walk through how to use the pivot_table.

The full blog post for this article is here - http://pbpython.com/pandas-pivot-table-explained.html

Read in the data


In [1]:
import pandas as pd
import numpy as np

Read in our sales funnel data into our DataFrame


In [2]:
df = pd.read_excel("../in/sales-funnel.xlsx")
df.head()


Out[2]:
Account Name Rep Manager Product Quantity Price Status
0 714466 Trantow-Barrows Craig Booker Debra Henley CPU 1 30000 presented
1 714466 Trantow-Barrows Craig Booker Debra Henley Software 1 10000 presented
2 714466 Trantow-Barrows Craig Booker Debra Henley Maintenance 2 5000 pending
3 737550 Fritsch, Russel and Anderson Craig Booker Debra Henley CPU 1 35000 declined
4 146832 Kiehn-Spinka Daniel Hilton Debra Henley CPU 2 65000 won

For convenience sake, let's define the status column as a category and set the order we want to view.

This isn't strictly required but helps us keep the order we want as we work through analyzing the data.


In [3]:
df["Status"] = df["Status"].astype("category")
df["Status"].cat.set_categories(["won","pending","presented","declined"],inplace=True)

Pivot the data

As we build up the pivot table, I think it's easiest to take it one step at a time. Add items one at a time and check each step to verify you are getting the results you expect.

The simplest pivot table must have a dataframe and an index. In this case, let's use the Name as our index.


In [4]:
pd.pivot_table(df,index=["Name"])


Out[4]:
Account Price Quantity
Name
Barton LLC 740150 35000 1.000000
Fritsch, Russel and Anderson 737550 35000 1.000000
Herman LLC 141962 65000 2.000000
Jerde-Hilpert 412290 5000 2.000000
Kassulke, Ondricka and Metz 307599 7000 3.000000
Keeling LLC 688981 100000 5.000000
Kiehn-Spinka 146832 65000 2.000000
Koepp Ltd 729833 35000 2.000000
Kulas Inc 218895 25000 1.500000
Purdy-Kunde 163416 30000 1.000000
Stokes LLC 239344 7500 1.000000
Trantow-Barrows 714466 15000 1.333333

You can have multiple indexes as well. In fact, most of the pivot_table args can take multiple values via a list.


In [5]:
pd.pivot_table(df,index=["Name","Rep","Manager"])


Out[5]:
Account Price Quantity
Name Rep Manager
Barton LLC John Smith Debra Henley 740150 35000 1.000000
Fritsch, Russel and Anderson Craig Booker Debra Henley 737550 35000 1.000000
Herman LLC Cedric Moss Fred Anderson 141962 65000 2.000000
Jerde-Hilpert John Smith Debra Henley 412290 5000 2.000000
Kassulke, Ondricka and Metz Wendy Yule Fred Anderson 307599 7000 3.000000
Keeling LLC Wendy Yule Fred Anderson 688981 100000 5.000000
Kiehn-Spinka Daniel Hilton Debra Henley 146832 65000 2.000000
Koepp Ltd Wendy Yule Fred Anderson 729833 35000 2.000000
Kulas Inc Daniel Hilton Debra Henley 218895 25000 1.500000
Purdy-Kunde Cedric Moss Fred Anderson 163416 30000 1.000000
Stokes LLC Cedric Moss Fred Anderson 239344 7500 1.000000
Trantow-Barrows Craig Booker Debra Henley 714466 15000 1.333333

This is interesting but not particularly useful. What we probably want to do is look at this by Manager and Director. It's easy enough to do by changing the index.


In [6]:
pd.pivot_table(df,index=["Manager","Rep"])


Out[6]:
Account Price Quantity
Manager Rep
Debra Henley Craig Booker 720237.0 20000.000000 1.250000
Daniel Hilton 194874.0 38333.333333 1.666667
John Smith 576220.0 20000.000000 1.500000
Fred Anderson Cedric Moss 196016.5 27500.000000 1.250000
Wendy Yule 614061.5 44250.000000 3.000000

You can see that the pivot table is smart enough to start aggregating the data and summarizing it by grouping the reps with their managers. Now we start to get a glimpse of what a pivot table can do for us.

For this purpose, the Account and Quantity columns aren't really useful. Let's remove it by explicitly defining the columns we care about using the values field.


In [7]:
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"])


Out[7]:
Price
Manager Rep
Debra Henley Craig Booker 20000
Daniel Hilton 38333
John Smith 20000
Fred Anderson Cedric Moss 27500
Wendy Yule 44250

The price column automatically averages the data but we can do a count or a sum. Adding them is simple using aggfunc.


In [8]:
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],aggfunc=np.sum)


Out[8]:
Price
Manager Rep
Debra Henley Craig Booker 80000
Daniel Hilton 115000
John Smith 40000
Fred Anderson Cedric Moss 110000
Wendy Yule 177000

aggfunc can take a list of functions. Let's try a mean using the numpy functions and len to get a count.


In [9]:
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],aggfunc=[np.mean,len])


Out[9]:
mean len
Price Price
Manager Rep
Debra Henley Craig Booker 20000 4
Daniel Hilton 38333 3
John Smith 20000 2
Fred Anderson Cedric Moss 27500 4
Wendy Yule 44250 4

If we want to see sales broken down by the products, the columns variable allows us to define one or more columns.


In [10]:
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],
               columns=["Product"],aggfunc=[np.sum])


Out[10]:
sum
Price
Product CPU Maintenance Monitor Software
Manager Rep
Debra Henley Craig Booker 65000 5000 NaN 10000
Daniel Hilton 105000 NaN NaN 10000
John Smith 35000 5000 NaN NaN
Fred Anderson Cedric Moss 95000 5000 NaN 10000
Wendy Yule 165000 7000 5000 NaN

The NaN's are a bit distracting. If we want to remove them, we could use fill_value to set them to 0.


In [11]:
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],
               columns=["Product"],aggfunc=[np.sum],fill_value=0)


Out[11]:
sum
Price
Product CPU Maintenance Monitor Software
Manager Rep
Debra Henley Craig Booker 65000 5000 0 10000
Daniel Hilton 105000 0 0 10000
John Smith 35000 5000 0 0
Fred Anderson Cedric Moss 95000 5000 0 10000
Wendy Yule 165000 7000 5000 0

I think it would be useful to add the quantity as well. Add Quantity to the values list.


In [12]:
pd.pivot_table(df,index=["Manager","Rep"],values=["Price","Quantity"],
               columns=["Product"],aggfunc=[np.sum],fill_value=0)


Out[12]:
sum
Price Quantity
Product CPU Maintenance Monitor Software CPU Maintenance Monitor Software
Manager Rep
Debra Henley Craig Booker 65000 5000 0 10000 2 2 0 1
Daniel Hilton 105000 0 0 10000 4 0 0 1
John Smith 35000 5000 0 0 1 2 0 0
Fred Anderson Cedric Moss 95000 5000 0 10000 3 1 0 1
Wendy Yule 165000 7000 5000 0 7 3 2 0

What's interesting is that you can move items to the index to get a different visual representation. We can add the Products to the index.


In [13]:
pd.pivot_table(df,index=["Manager","Rep","Product"],
               values=["Price","Quantity"],aggfunc=[np.sum],fill_value=0)


Out[13]:
sum
Price Quantity
Manager Rep Product
Debra Henley Craig Booker CPU 65000 2
Maintenance 5000 2
Software 10000 1
Daniel Hilton CPU 105000 4
Software 10000 1
John Smith CPU 35000 1
Maintenance 5000 2
Fred Anderson Cedric Moss CPU 95000 3
Maintenance 5000 1
Software 10000 1
Wendy Yule CPU 165000 7
Maintenance 7000 3
Monitor 5000 2

For this data set, this representation makes more sense. Now, what if I want to see some totals? margins=True does that for us.


In [14]:
pd.pivot_table(df,index=["Manager","Rep","Product"],
               values=["Price","Quantity"],
               aggfunc=[np.sum,np.mean],fill_value=0,margins=True)


Out[14]:
sum mean
Price Quantity Price Quantity
Manager Rep Product
Debra Henley Craig Booker CPU 65000 2 32500.000000 1.000000
Maintenance 5000 2 5000.000000 2.000000
Software 10000 1 10000.000000 1.000000
Daniel Hilton CPU 105000 4 52500.000000 2.000000
Software 10000 1 10000.000000 1.000000
John Smith CPU 35000 1 35000.000000 1.000000
Maintenance 5000 2 5000.000000 2.000000
Fred Anderson Cedric Moss CPU 95000 3 47500.000000 1.500000
Maintenance 5000 1 5000.000000 1.000000
Software 10000 1 10000.000000 1.000000
Wendy Yule CPU 165000 7 82500.000000 3.500000
Maintenance 7000 3 7000.000000 3.000000
Monitor 5000 2 5000.000000 2.000000
All 522000 30 30705.882353 1.764706

Let's move the analysis up a level and look at our pipeline at the manager level. Notice how the status is ordered based on our earlier category definition.


In [15]:
pd.pivot_table(df,index=["Manager","Status"],values=["Price"],
               aggfunc=[np.sum],fill_value=0,margins=True)


Out[15]:
sum
Price
Manager Status
Debra Henley declined 70000
pending 50000
presented 50000
won 65000
Fred Anderson declined 65000
pending 5000
presented 45000
won 172000
All 522000

A really handy feature is the ability to pass a dictionary to the aggfunc so you can perform different functions on each of the values you select. This has a side-effect of making the labels a little cleaner.


In [16]:
pd.pivot_table(df,index=["Manager","Status"],columns=["Product"],values=["Quantity","Price"],
               aggfunc={"Quantity":len,"Price":np.sum},fill_value=0)


Out[16]:
Price Quantity
Product CPU Maintenance Monitor Software CPU Maintenance Monitor Software
Manager Status
Debra Henley declined 70000 0 0 0 2 0 0 0
pending 40000 10000 0 0 1 2 0 0
presented 30000 0 0 20000 1 0 0 2
won 65000 0 0 0 1 0 0 0
Fred Anderson declined 65000 0 0 0 1 0 0 0
pending 0 5000 0 0 0 1 0 0
presented 30000 0 5000 10000 1 0 1 1
won 165000 7000 0 0 2 1 0 0

You can provide a list of aggfunctions to apply to each value too:


In [17]:
table = pd.pivot_table(df,index=["Manager","Status"],columns=["Product"],values=["Quantity","Price"],
               aggfunc={"Quantity":len,"Price":[np.sum,np.mean]},fill_value=0)
table


Out[17]:
Price Quantity
mean sum len
Product CPU Maintenance Monitor Software CPU Maintenance Monitor Software CPU Maintenance Monitor Software
Manager Status
Debra Henley declined 35000 0 0 0 70000 0 0 0 2 0 0 0
pending 40000 5000 0 0 40000 10000 0 0 1 2 0 0
presented 30000 0 0 10000 30000 0 0 20000 1 0 0 2
won 65000 0 0 0 65000 0 0 0 1 0 0 0
Fred Anderson declined 65000 0 0 0 65000 0 0 0 1 0 0 0
pending 0 5000 0 0 0 5000 0 0 0 1 0 0
presented 30000 0 5000 10000 30000 0 5000 10000 1 0 1 1
won 82500 7000 0 0 165000 7000 0 0 2 1 0 0

It can look daunting to try to pull this all together at once but as soon as you start playing with the data and slowly add the items, you can get a feel for how it works.

Advanced Pivot Table Filtering

Once you have generated your data, it is in a DataFrame so you can filter on it using your normal DataFrame functions.


In [18]:
table.query('Manager == ["Debra Henley"]')


Out[18]:
Price Quantity
mean sum len
Product CPU Maintenance Monitor Software CPU Maintenance Monitor Software CPU Maintenance Monitor Software
Manager Status
Debra Henley declined 35000 0 0 0 70000 0 0 0 2 0 0 0
pending 40000 5000 0 0 40000 10000 0 0 1 2 0 0
presented 30000 0 0 10000 30000 0 0 20000 1 0 0 2
won 65000 0 0 0 65000 0 0 0 1 0 0 0

In [19]:
table.query('Status == ["pending","won"]')


Out[19]:
Price Quantity
mean sum len
Product CPU Maintenance Monitor Software CPU Maintenance Monitor Software CPU Maintenance Monitor Software
Manager Status
Debra Henley pending 40000 5000 0 0 40000 10000 0 0 1 2 0 0
won 65000 0 0 0 65000 0 0 0 1 0 0 0
Fred Anderson pending 0 5000 0 0 0 5000 0 0 0 1 0 0
won 82500 7000 0 0 165000 7000 0 0 2 1 0 0

I hope this tutorial has shown you how to use pivot tables on your data sets.