Pivot Table on Pandas

In this notebook, we focus on the pivot_table feature of pandas.

Import Modules



In [1]:

    
import pandas as pd
import numpy as np
import plotly.plotly as py
import plotly.graph_objs as go

Import data



In [2]:

    
df = pd.read_csv('./asset/sydney_housing_market.txt', sep='\t')
df.head()









    Out[2]:






  
    
      
      suburb
      council
      type
      distance_to_CBD
      value
      sold
    
  
  
    
      0
      Turrella
      Rockdale
      house
      9.9
      $839,676
      17
    
    
      1
      Sydenham
      Marrickville
      house
      7.4
      $845,771
      21
    
    
      2
      St Peters
      Marrickville
      house
      6.9
      $920,169
      50
    
    
      3
      Tempe
      Marrickville
      house
      8.3
      $921,390
      59
    
    
      4
      Waterloo
      Sydney
      house
      4.7
      $937,316
      11

Pivot Table

In order to build up a pivot table, we must specify an index.



In [19]:

    
pd.pivot_table(df, index=['type'])









    Out[19]:






  
    
      
      distance_to_CBD
      sold
      value
    
    
      type
      
      
      
    
  
  
    
      house
      6.585
      69.20
      1024662.90
    
    
      unit
      6.280
      108.55
      585739.05

Note that the default aggregation function is np.mean. We can specify the aggregation function in the aggfunc parameter, as shown below.



In [18]:

    
pd.pivot_table(df, index=['type'], aggfunc={'distance_to_CBD':np.mean, 'sold':np.sum})









    Out[18]:






  
    
      
      distance_to_CBD
      sold
    
    
      type
      
      
    
  
  
    
      house
      6.585
      1384
    
    
      unit
      6.280
      2171

For simplicity, we will stick with the default aggregation function.

We also want to see value, but we need to change it into floats first



In [4]:

    
df['value']=df['value'].replace('[\$,]','',regex=True).astype(float)
pd.pivot_table(df, index=['type'])









    Out[4]:






  
    
      
      distance_to_CBD
      sold
      value
    
    
      type
      
      
      
    
  
  
    
      house
      6.585
      69.20
      1024662.90
    
    
      unit
      6.280
      108.55
      585739.05

We could also choose more than one column as index



In [5]:

    
pd.pivot_table(df, index=['type','council'])









    Out[5]:






  
    
      
      
      distance_to_CBD
      sold
      value
    
    
      type
      council
      
      
      
    
  
  
    
      house
      Ashfield
      7.600000
      79.500000
      1.101914e+06
    
    
      Botany Bay
      8.200000
      29.000000
      1.090744e+06
    
    
      Canada Bay
      9.200000
      15.000000
      1.036064e+06
    
    
      Canterbury
      9.100000
      36.000000
      1.066214e+06
    
    
      Leichhardt
      5.200000
      225.000000
      1.076623e+06
    
    
      Marrickville
      6.742857
      81.285714
      9.933390e+05
    
    
      Rockdale
      9.900000
      17.000000
      8.396760e+05
    
    
      Sydney
      4.616667
      55.666667
      1.037789e+06
    
    
      unit
      Ashfield
      7.233333
      109.666667
      5.933117e+05
    
    
      Botany Bay
      8.200000
      56.000000
      4.844870e+05
    
    
      Burwood
      8.400000
      43.000000
      6.059270e+05
    
    
      Canterbury
      9.100000
      15.000000
      5.930810e+05
    
    
      Lane Cove
      7.000000
      220.000000
      6.112070e+05
    
    
      Marrickville
      6.550000
      87.333333
      5.929137e+05
    
    
      Randwick
      5.200000
      53.000000
      5.778140e+05
    
    
      Ryde
      7.900000
      139.000000
      5.729230e+05
    
    
      Sydney
      3.760000
      158.400000
      5.863850e+05

columns provide an additional way to segment the data



In [6]:

    
pd.pivot_table(df, index=['council'], columns=['type'])









    Out[6]:






  
    
      
      distance_to_CBD
      value
      sold
    
    
      type
      house
      unit
      house
      unit
      house
      unit
    
    
      council
      
      
      
      
      
      
    
  
  
    
      Ashfield
      7.600000
      7.233333
      1.101914e+06
      593311.666667
      79.500000
      109.666667
    
    
      Botany Bay
      8.200000
      8.200000
      1.090744e+06
      484487.000000
      29.000000
      56.000000
    
    
      Burwood
      NaN
      8.400000
      NaN
      605927.000000
      NaN
      43.000000
    
    
      Canada Bay
      9.200000
      NaN
      1.036064e+06
      NaN
      15.000000
      NaN
    
    
      Canterbury
      9.100000
      9.100000
      1.066214e+06
      593081.000000
      36.000000
      15.000000
    
    
      Lane Cove
      NaN
      7.000000
      NaN
      611207.000000
      NaN
      220.000000
    
    
      Leichhardt
      5.200000
      NaN
      1.076623e+06
      NaN
      225.000000
      NaN
    
    
      Marrickville
      6.742857
      6.550000
      9.933390e+05
      592913.666667
      81.285714
      87.333333
    
    
      Randwick
      NaN
      5.200000
      NaN
      577814.000000
      NaN
      53.000000
    
    
      Rockdale
      9.900000
      NaN
      8.396760e+05
      NaN
      17.000000
      NaN
    
    
      Ryde
      NaN
      7.900000
      NaN
      572923.000000
      NaN
      139.000000
    
    
      Sydney
      4.616667
      3.760000
      1.037789e+06
      586385.000000
      55.666667
      158.400000

Note that NaN implies that there is no data here

The default aggfunc is avg but we could use other functions such as np.sum



In [7]:

    
pd.pivot_table(df, index=['type'],aggfunc=np.sum)









    Out[7]:






  
    
      
      distance_to_CBD
      sold
      value
    
    
      type
      
      
      
    
  
  
    
      house
      131.7
      1384
      20493258.0
    
    
      unit
      125.6
      2171
      11714781.0

Use margins=True to show the total numbers



In [8]:

    
pd.pivot_table(df, index=['type','council'], aggfunc=np.sum, margins=True)









    Out[8]:






  
    
      
      
      distance_to_CBD
      sold
      value
    
    
      type
      council
      
      
      
    
  
  
    
      house
      Ashfield
      15.2
      159.0
      2203829.0
    
    
      Botany Bay
      8.2
      29.0
      1090744.0
    
    
      Canada Bay
      9.2
      15.0
      1036064.0
    
    
      Canterbury
      9.1
      36.0
      1066214.0
    
    
      Leichhardt
      5.2
      225.0
      1076623.0
    
    
      Marrickville
      47.2
      569.0
      6953373.0
    
    
      Rockdale
      9.9
      17.0
      839676.0
    
    
      Sydney
      27.7
      334.0
      6226735.0
    
    
      unit
      Ashfield
      21.7
      329.0
      1779935.0
    
    
      Botany Bay
      8.2
      56.0
      484487.0
    
    
      Burwood
      8.4
      43.0
      605927.0
    
    
      Canterbury
      9.1
      15.0
      593081.0
    
    
      Lane Cove
      7.0
      220.0
      611207.0
    
    
      Marrickville
      39.3
      524.0
      3557482.0
    
    
      Randwick
      5.2
      53.0
      577814.0
    
    
      Ryde
      7.9
      139.0
      572923.0
    
    
      Sydney
      18.8
      792.0
      2931925.0
    
    
      All
      
      257.3
      3555.0
      32208039.0

We should use avg for value but sum for sold, and we do not want to see distance_to_CBD for now



In [9]:

    
pd.pivot_table(df, index=['council','suburb'], 
               columns=['type'], 
               values=['sold', 'value'], 
               aggfunc={'sold':np.sum, 'value':np.mean}, 
               margins=True)









    Out[9]:






  
    
      
      
      value
      sold
    
    
      
      type
      house
      unit
      All
      house
      unit
      All
    
    
      council
      suburb
      
      
      
      
      
      
    
  
  
    
      Ashfield
      Ashfield
      1123787.0
      575096.00
      849441.500
      110.0
      228.0
      338.0
    
    
      Haberfield
      NaN
      573347.00
      573347.000
      NaN
      13.0
      13.0
    
    
      Summer Hill
      1080042.0
      631492.00
      855767.000
      49.0
      88.0
      137.0
    
    
      Botany Bay
      Eastlakes
      1090744.0
      484487.00
      787615.500
      29.0
      56.0
      85.0
    
    
      Burwood
      Croydon
      NaN
      605927.00
      605927.000
      NaN
      43.0
      43.0
    
    
      Canada Bay
      Mortlake
      1036064.0
      NaN
      1036064.000
      15.0
      NaN
      15.0
    
    
      Canterbury
      Hurlstone Park
      1066214.0
      593081.00
      829647.500
      36.0
      15.0
      51.0
    
    
      Lane Cove
      Lane Cove North
      NaN
      611207.00
      611207.000
      NaN
      220.0
      220.0
    
    
      Leichhardt
      Leichhardt
      1076623.0
      NaN
      1076623.000
      225.0
      NaN
      225.0
    
    
      Marrickville
      Dulwich Hill
      NaN
      622141.00
      622141.000
      NaN
      143.0
      143.0
    
    
      Enmore
      1079403.0
      607195.00
      843299.000
      58.0
      15.0
      73.0
    
    
      Lewisham
      NaN
      569889.00
      569889.000
      NaN
      20.0
      20.0
    
    
      Marrickville
      994026.0
      552713.00
      773369.500
      191.0
      225.0
      416.0
    
    
      Petersham
      1122372.0
      610890.00
      866631.000
      82.0
      69.0
      151.0
    
    
      St Peters
      920169.0
      NaN
      920169.000
      50.0
      NaN
      50.0
    
    
      Stanmore
      1070242.0
      594654.00
      832448.000
      108.0
      52.0
      160.0
    
    
      Sydenham
      845771.0
      NaN
      845771.000
      21.0
      NaN
      21.0
    
    
      Tempe
      921390.0
      NaN
      921390.000
      59.0
      NaN
      59.0
    
    
      Randwick
      Centennial Park
      NaN
      577814.00
      577814.000
      NaN
      53.0
      53.0
    
    
      Rockdale
      Turrella
      839676.0
      NaN
      839676.000
      17.0
      NaN
      17.0
    
    
      Ryde
      Gladesville
      NaN
      572923.00
      572923.000
      NaN
      139.0
      139.0
    
    
      Sydney
      Alexandria
      NaN
      617024.00
      617024.000
      NaN
      211.0
      211.0
    
    
      Chippendale
      1060415.0
      NaN
      1060415.000
      31.0
      NaN
      31.0
    
    
      Darlington
      1062426.0
      NaN
      1062426.000
      32.0
      NaN
      32.0
    
    
      Elizabeth Bay
      NaN
      615967.00
      615967.000
      NaN
      182.0
      182.0
    
    
      Erskineville
      1005107.0
      NaN
      1005107.000
      84.0
      NaN
      84.0
    
    
      Newtown
      1044330.0
      559175.00
      801752.500
      166.0
      104.0
      270.0
    
    
      Potts Point
      NaN
      565141.00
      565141.000
      NaN
      217.0
      217.0
    
    
      Rushcutters Bay
      NaN
      574618.00
      574618.000
      NaN
      78.0
      78.0
    
    
      Waterloo
      937316.0
      NaN
      937316.000
      11.0
      NaN
      11.0
    
    
      Zetland
      1117141.0
      NaN
      1117141.000
      10.0
      NaN
      10.0
    
    
      All
      
      1024662.9
      585739.05
      805200.975
      1384.0
      2171.0
      3555.0

Advanced Filtering over Pivot Table

We firstly build a pivot table



In [10]:

    
table = pd.pivot_table(df, index=['council'], columns=['type'], values=['sold'], aggfunc=np.sum, margins=True)
table









    Out[10]:






  
    
      
      sold
    
    
      type
      house
      unit
      All
    
    
      council
      
      
      
    
  
  
    
      Ashfield
      159.0
      329.0
      488.0
    
    
      Botany Bay
      29.0
      56.0
      85.0
    
    
      Burwood
      NaN
      43.0
      43.0
    
    
      Canada Bay
      15.0
      NaN
      15.0
    
    
      Canterbury
      36.0
      15.0
      51.0
    
    
      Lane Cove
      NaN
      220.0
      220.0
    
    
      Leichhardt
      225.0
      NaN
      225.0
    
    
      Marrickville
      569.0
      524.0
      1093.0
    
    
      Randwick
      NaN
      53.0
      53.0
    
    
      Rockdale
      17.0
      NaN
      17.0
    
    
      Ryde
      NaN
      139.0
      139.0
    
    
      Sydney
      334.0
      792.0
      1126.0
    
    
      All
      1384.0
      2171.0
      3555.0

We can just look at data from one city



In [11]:

    
table.query('council==["Randwick"]')









    Out[11]:






  
    
      
      sold
    
    
      type
      house
      unit
      All
    
    
      council
      
      
      
    
  
  
    
      Randwick
      NaN
      53.0
      53.0

We can also specify multiple values



In [12]:

    
table.query('council==["Rockdale","Lane Cove"]')









    Out[12]:






  
    
      
      sold
    
    
      type
      house
      unit
      All
    
    
      council
      
      
      
    
  
  
    
      Lane Cove
      NaN
      220.0
      220.0
    
    
      Rockdale
      17.0
      NaN
      17.0

Note: we can not query a pivot table from the columns. So you should put item in index if you want to query it.

Plot the data using plot.ly

Before you can execute the code below, you need to install plotly, register a free account with them, and create a profile that contains your own API key. See https://plot.ly/python/getting-started/



In [13]:

    
plot_table = table[:-1]  # get rid of ALL



In [14]:

    
plot_table.sold.house









    Out[14]:





council
Ashfield        159.0
Botany Bay       29.0
Burwood           NaN
Canada Bay       15.0
Canterbury       36.0
Lane Cove         NaN
Leichhardt      225.0
Marrickville    569.0
Randwick          NaN
Rockdale         17.0
Ryde              NaN
Sydney          334.0
Name: house, dtype: float64



In [15]:

    
table.sold.house









    Out[15]:





council
Ashfield         159.0
Botany Bay        29.0
Burwood            NaN
Canada Bay        15.0
Canterbury        36.0
Lane Cove          NaN
Leichhardt       225.0
Marrickville     569.0
Randwick           NaN
Rockdale          17.0
Ryde               NaN
Sydney           334.0
All             1384.0
Name: house, dtype: float64



In [16]:

    
plot_table.index









    Out[16]:





Index(['Ashfield', 'Botany Bay', 'Burwood', 'Canada Bay', 'Canterbury',
       'Lane Cove', 'Leichhardt', 'Marrickville', 'Randwick', 'Rockdale',
       'Ryde', 'Sydney'],
      dtype='object', name='council')



In [17]:

    
trace1 = go.Bar(
    x=plot_table.index,
    y=plot_table.sold.house,
    name='House'
)
trace2 = go.Bar(
    x=plot_table.index,
    y=plot_table.sold.unit,
    name='Unit'
)

data = [trace1, trace2]
layout = go.Layout(
    barmode='group'
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='pandas-notebook-plot2')









    Out[17]:

Exercise

Perform some analysis that interests you using the Sydney Acution Data at https://auction-results.domain.com.au/Proofed/PDF/Sydney_Domain.pdf



In [ ]:



In [ ]:



In [ ]:



In [ ]:



In [ ]:

	suburb	council	type	distance_to_CBD	value	sold
0	Turrella	Rockdale	house	9.9	$839,676	17
1	Sydenham	Marrickville	house	7.4	$845,771	21
2	St Peters	Marrickville	house	6.9	$920,169	50
3	Tempe	Marrickville	house	8.3	$921,390	59
4	Waterloo	Sydney	house	4.7	$937,316	11

		distance_to_CBD	sold	value
type	council
house	Ashfield	7.600000	79.500000	1.101914e+06
	Botany Bay	8.200000	29.000000	1.090744e+06
	Canada Bay	9.200000	15.000000	1.036064e+06
	Canterbury	9.100000	36.000000	1.066214e+06
	Leichhardt	5.200000	225.000000	1.076623e+06
	Marrickville	6.742857	81.285714	9.933390e+05
	Rockdale	9.900000	17.000000	8.396760e+05
	Sydney	4.616667	55.666667	1.037789e+06
unit	Ashfield	7.233333	109.666667	5.933117e+05
	Botany Bay	8.200000	56.000000	4.844870e+05
	Burwood	8.400000	43.000000	6.059270e+05
	Canterbury	9.100000	15.000000	5.930810e+05
	Lane Cove	7.000000	220.000000	6.112070e+05
	Marrickville	6.550000	87.333333	5.929137e+05
	Randwick	5.200000	53.000000	5.778140e+05
	Ryde	7.900000	139.000000	5.729230e+05
	Sydney	3.760000	158.400000	5.863850e+05