SASPy Tabulation for Descriptive Statistics

This notebook demonstrates the usage of a powerful set of tools for descriptive statistics and nesting data in SASPy, powered by the TABULATE procedure.



In [27]:

    
import saspy
sas = saspy.SASsession(cfgname='default')









    



SAS Connection established. Subprocess id is 1046



In [28]:

    
cars = sas.sasdata('cars', 'sashelp')
cars.head()









    Out[28]:







  
    
      
      Make
      Model
      Type
      Origin
      DriveTrain
      MSRP
      Invoice
      EngineSize
      Cylinders
      Horsepower
      MPG_City
      MPG_Highway
      Weight
      Wheelbase
      Length
    
  
  
    
      0
      Acura
      MDX
      SUV
      Asia
      All
      36945
      33337
      3.5
      6
      265
      17
      23
      4451
      106
      189
    
    
      1
      Acura
      RSX Type S 2dr
      Sedan
      Asia
      Front
      23820
      21761
      2.0
      4
      200
      24
      31
      2778
      101
      172
    
    
      2
      Acura
      TSX 4dr
      Sedan
      Asia
      Front
      26990
      24647
      2.4
      4
      200
      22
      29
      3230
      105
      183
    
    
      3
      Acura
      TL 4dr
      Sedan
      Asia
      Front
      33195
      30299
      3.2
      6
      270
      20
      28
      3575
      108
      186
    
    
      4
      Acura
      3.5 RL 4dr
      Sedan
      Asia
      Front
      43755
      39014
      3.5
      6
      225
      18
      24
      3880
      115
      197

Basic usage

Like the TABULATE procedure on which it relies, using the tabulate methods attached to your SASPy data sets means specifying three things:

class columns, by which to group your data;
var columns, which contain data to be calculated; and
statistics, to be calculated on the var columns within groupings of data.

Then you compose the table using a simple syntax, in which * indicates a nesting and | indicates elements at the same level (this is made possible by Python's operator overloading).



In [11]:

    
# define columns to use as classes or computational vars
by_origin, by_type     = cars.tabulate.classes('origin', 'type')
horsepower, cylinders  = cars.tabulate.vars('horsepower', 'cylinders')

# grab statistics of interest
mean, n                = cars.tabulate.stats('mean', 'n')

# compose these elements into a table
cars.tabulate.table(
    left = by_origin * by_type,
    top  = (horsepower | cylinders) * (mean | n)
)









    










SAS Output





The SAS System 








 
Horsepower
Cylinders


Mean
N
Mean
N




Origin
Type
92.00
3
3.67
3


Asia
Hybrid


SUV
214.16
25
6.00
25


Sedan
181.98
94
5.04
94


Sports
225.35
17
5.07
15


Truck
190.25
8
5.50
8


Wagon
185.64
11
4.91
11


Europe
SUV
263.10
10
7.20
10


Sedan
236.53
78
6.08
78


Sports
316.74
23
6.61
23


Wagon
218.17
12
5.75
12


USA
SUV
246.56
25
6.88
25


Sedan
191.99
90
5.71
90


Sports
312.00
9
7.78
9


Truck
242.13
16
6.63
16


Wagon
165.71
7
5.14
7



In [4]:

    
# alternatively, you can output pure-text tables using .text_table()
cars.tabulate.text_table(
    left = by_origin * by_type,
    top  = (horsepower | cylinders) * (mean | n)
)









    



                                                           The SAS System                            20:09 Friday, April 6, 2018   1

                       -------------------------------------------------------------------------------------
                       |                               |       Horsepower        |        Cylinders        |
                       |                               |-------------------------+-------------------------|
                       |                               |    Mean    |     N      |    Mean    |     N      |
                       |-------------------------------+------------+------------+------------+------------|
                       |Origin         |Type           |            |            |            |            |
                       |---------------+---------------|            |            |            |            |
                       |Asia           |Hybrid         |       92.00|        3.00|        3.67|        3.00|
                       |               |---------------+------------+------------+------------+------------|
                       |               |SUV            |      214.16|       25.00|        6.00|       25.00|
                       |               |---------------+------------+------------+------------+------------|
                       |               |Sedan          |      181.98|       94.00|        5.04|       94.00|
                       |               |---------------+------------+------------+------------+------------|
                       |               |Sports         |      225.35|       17.00|        5.07|       15.00|
                       |               |---------------+------------+------------+------------+------------|
                       |               |Truck          |      190.25|        8.00|        5.50|        8.00|
                       |               |---------------+------------+------------+------------+------------|
                       |               |Wagon          |      185.64|       11.00|        4.91|       11.00|
                       |---------------+---------------+------------+------------+------------+------------|
                       |Europe         |SUV            |      263.10|       10.00|        7.20|       10.00|
                       |               |---------------+------------+------------+------------+------------|
                       |               |Sedan          |      236.53|       78.00|        6.08|       78.00|
                       |               |---------------+------------+------------+------------+------------|
                       |               |Sports         |      316.74|       23.00|        6.61|       23.00|
                       |               |---------------+------------+------------+------------+------------|
                       |               |Wagon          |      218.17|       12.00|        5.75|       12.00|
                       |---------------+---------------+------------+------------+------------+------------|
                       |USA            |SUV            |      246.56|       25.00|        6.88|       25.00|
                       |               |---------------+------------+------------+------------+------------|
                       |               |Sedan          |      191.99|       90.00|        5.71|       90.00|
                       |               |---------------+------------+------------+------------+------------|
                       |               |Sports         |      312.00|        9.00|        7.78|        9.00|
                       |               |---------------+------------+------------+------------+------------|
                       |               |Truck          |      242.13|       16.00|        6.63|       16.00|
                       |               |---------------+------------+------------+------------+------------|
                       |               |Wagon          |      165.71|        7.00|        5.14|        7.00|
                       -------------------------------------------------------------------------------------

Adding options

Labels, formats (for aggregate statistics), and total groupings ('all' keyword) may also be specified, either when creating a placeholder or during composition using .with_()



In [12]:

    
# assign labels, formats
by_origin, by_type     = cars.tabulate.classes('origin', 'type', labels=False)
horsepower, cylinders  = cars.tabulate.vars('horsepower', 'cylinders')
mean, n                = cars.tabulate.stats('mean', 'n', formats=['6.2', '3.'], labels=['Average', 'Count'])

# you can override or add options in composition by using .with_()
cars.tabulate.table(
    left = by_origin.with_(all='Total') * by_type,
    top  = (horsepower.with_(label='Power!') | cylinders) * (mean.with_(format='6.2') | n)
)









    










SAS Output





The SAS System 








 
Power!
Cylinders


Average
Count
Average
Count




Asia
Hybrid
92.00
3
3.67
3


SUV
214.16
25
6.00
25


Sedan
181.98
94
5.04
94


Sports
225.35
17
5.07
15


Truck
190.25
8
5.50
8


Wagon
185.64
11
4.91
11


Europe
SUV
263.10
10
7.20
10


Sedan
236.53
78
6.08
78


Sports
316.74
23
6.61
23


Wagon
218.17
12
5.75
12


USA
SUV
246.56
25
6.88
25


Sedan
191.99
90
5.71
90


Sports
312.00
9
7.78
9


Truck
242.13
16
6.63
16


Wagon
165.71
7
5.14
7


Total
Hybrid
92.00
3
3.67
3


SUV
235.82
60
6.57
60


Sedan
201.66
262
5.58
262


Sports
284.16
49
6.34
47


Truck
224.83
24
6.25
24


Wagon
194.00
30
5.30
30

Alternative: Create class, var, and stat elements individually



In [6]:

    
by_origin   = cars.tabulate.as_class('origin', label='', all='All Origins')
by_type     = cars.tabulate.as_class('type', label='')

horsepower  = cars.tabulate.as_var('horsepower')
average     = cars.tabulate.stat('mean', format='8.2', label='Mean HP')
n           = cars.tabulate.stat('n', label='Count', format='6.')


cars.tabulate.table(
    left = by_origin,
    top  = by_type * (n | horsepower.with_(label='') * average)
)









    










SAS Output





The SAS System 








 
Hybrid
SUV
Sedan
Sports
Truck
Wagon


Count
Mean HP
Count
Mean HP
Count
Mean HP
Count
Mean HP
Count
Mean HP
Count
Mean HP




Asia
3
92.00
25
214.16
94
181.98
17
225.35
8
190.25
11
185.64


Europe
.
.
10
263.10
78
236.53
23
316.74
.
.
12
218.17


USA
.
.
25
246.56
90
191.99
9
312.00
16
242.13
7
165.71


All Origins
3
92.00
60
235.82
262
201.66
49
284.16
24
224.83
30
194.00

Composition and Re-use

The real advantage of a Python interface to PROC TABULATE comes with the ability to compose fragments of interest and then recompose those into various tables at will.



In [7]:

    
# create some elements for reuse
by_origin   = cars.tabulate.as_class('origin', label='')
by_type     = cars.tabulate.as_class('type', label='')

horsepower  = cars.tabulate.as_var('horsepower', label='Horses')
cylinders   = cars.tabulate.as_var('cylinders', label='Cyls.')
enginesize  = cars.tabulate.as_var('enginesize', label='Engine Size')

average     = cars.tabulate.stat('mean', label='Avg', format='8.2')
stdev       = cars.tabulate.stat('std', label='Std. Dev.', format='5.2')
n           = cars.tabulate.stat('n', label='Count', format='6.0')

# create some compositional fragments
by_origin_and_type = by_origin.with_(all='All') * by_type.with_(all='All')
hpstats            = horsepower * (average | stdev | n)
cylstats           = cylinders * (average | stdev | n)
enginestats        = enginesize * (average | stdev)



In [13]:

    
# draw a table
cars.tabulate.table(
    left = by_origin_and_type,
    top  = hpstats
)









    










SAS Output





The SAS System 








 
Horses


Avg
Std. Dev.
Count




Asia
Hybrid
92.00
18.52
3


SUV
214.16
48.70
25


Sedan
181.98
57.29
94


Sports
225.35
57.60
17


Truck
190.25
51.76
8


Wagon
185.64
69.47
11


All
190.70
59.39
158


Europe
SUV
263.10
52.66
10


Sedan
236.53
71.35
78


Sports
316.74
96.21
23


Wagon
218.17
63.71
12


All
251.89
80.74
123


USA
SUV
246.56
58.68
25


Sedan
191.99
46.50
90


Sports
312.00
91.09
9


Truck
242.13
60.48
16


Wagon
165.71
44.20
7


All
212.82
63.75
147


All
Hybrid
92.00
18.52
3


SUV
235.82
56.23
60


Sedan
201.66
62.80
262


Sports
284.16
92.79
49


Truck
224.83
61.85
24


Wagon
194.00
63.79
30


All
215.89
71.84
428



In [14]:

    
# draw another table
cars.tabulate.table(
    left = by_type,
    top  = cylstats | enginestats
)









    










SAS Output





The SAS System 








 
Cyls.
Engine Size


Avg
Std. Dev.
Count
Avg
Std. Dev.




Hybrid
3.67
0.58
3
1.63
0.32


SUV
6.57
1.38
60
3.92
1.09


Sedan
5.58
1.47
262
2.97
0.93


Sports
6.34
1.78
47
3.44
1.39


Truck
6.25
1.59
24
4.08
1.25


Wagon
5.30
1.42
30
2.77
0.89



In [10]:

    
# grab another class when needed, draw another table
drivetrain = cars.tabulate.as_class('drivetrain', label='Drive Train')

cars.tabulate.table(
    left = by_type * drivetrain.with_(all='All'),
    top  = cylstats | enginestats
)









    










SAS Output





The SAS System 








 
Cyls.
Engine Size


Avg
Std. Dev.
Count
Avg
Std. Dev.




 
Drive Train
3.67
0.58
3
1.63
0.32


Hybrid
Front


All
3.67
0.58
3
1.63
0.32


SUV
Drive Train
6.58
1.54
38
3.93
1.19


All


Front
6.55
1.10
22
3.90
0.93


All
6.57
1.38
60
3.92
1.09


Sedan
Drive Train
5.89
1.13
28
3.13
0.70


All


Front
5.07
1.23
179
2.69
0.82


Rear
7.09
1.27
55
3.80
0.84


All
5.58
1.47
262
2.97
0.93


Sports
Drive Train
4.80
1.10
5
2.62
0.77


All


Front
5.75
1.67
8
2.80
0.92


Rear
6.71
1.77
34
3.70
1.47


All
6.34
1.78
47
3.44
1.39


Truck
Drive Train
6.67
1.56
12
4.55
1.32


All


Rear
5.83
1.59
12
3.61
1.02


All
6.25
1.59
24
4.08
1.25


Wagon
Drive Train
5.89
1.76
9
3.22
1.04


All


Front
4.71
1.27
14
2.36
0.75


Rear
5.71
0.76
7
3.01
0.63


All
5.30
1.42
30
2.77
0.89

Retrieve an indexed DataFrame

One of SASPy's best features is the integration with Pandas DataFrames. Instead of drawing a presentational table in HTML or plain text, you can have the resulting nested values converted to a DataFrame using nested indices.

Note that certain presentational elements (labels, formats, etc) aren't represented, nor is the exact visual arrangement of your groupings. However, all computations are in the DataFrame and can be further accessed or sliced easily in Python.



In [21]:

    
# since w
my_frame = cars.tabulate.to_dataframe(
    left = by_type * drivetrain * by_origin,
    top  = cylstats | enginestats
)

# showing an excerpt
my_frame[:10]









    Out[21]:







  
    
      
      
      
      Cylinders_Std
      EngineSize_Mean
      Cylinders_N
      Cylinders_Mean
      EngineSize_Std
    
    
      Type
      Origin
      DriveTrain
      
      
      
      
      
    
  
  
    
      Hybrid
      Asia
      Front
      0.577350
      1.633333
      3
      3.666667
      0.321455
    
    
      SUV
      Asia
      All
      1.460593
      3.493750
      16
      6.000000
      0.944788
    
    
      Front
      1.000000
      3.433333
      9
      6.000000
      0.951315
    
    
      Europe
      All
      1.032796
      3.950000
      10
      7.200000
      0.943104
    
    
      USA
      All
      1.800673
      4.500000
      12
      6.833333
      1.471548
    
    
      Front
      1.037749
      4.223077
      13
      6.923077
      0.790732
    
    
      Sedan
      Asia
      All
      1.069045
      2.785714
      7
      4.857143
      0.393398
    
    
      Front
      0.994490
      2.510256
      78
      4.846154
      0.718510
    
    
      Rear
      1.054093
      3.733333
      9
      6.888889
      0.665207
    
    
      Europe
      All
      1.017815
      3.144444
      18
      6.277778
      0.746933



In [23]:

    
# you can access portions of that nested frame by indices, from left to right
my_frame.loc[('Sedan','Asia')]









    Out[23]:







  
    
      
      Cylinders_Std
      EngineSize_Mean
      Cylinders_N
      Cylinders_Mean
      EngineSize_Std
    
    
      DriveTrain
      
      
      
      
      
    
  
  
    
      All
      1.069045
      2.785714
      7
      4.857143
      0.393398
    
    
      Front
      0.994490
      2.510256
      78
      4.846154
      0.718510
    
    
      Rear
      1.054093
      3.733333
      9
      6.888889
      0.665207



In [24]:

    
my_frame.loc[('SUV')]









    Out[24]:







  
    
      
      
      Cylinders_Std
      EngineSize_Mean
      Cylinders_N
      Cylinders_Mean
      EngineSize_Std
    
    
      Origin
      DriveTrain
      
      
      
      
      
    
  
  
    
      Asia
      All
      1.460593
      3.493750
      16
      6.000000
      0.944788
    
    
      Front
      1.000000
      3.433333
      9
      6.000000
      0.951315
    
    
      Europe
      All
      1.032796
      3.950000
      10
      7.200000
      0.943104
    
    
      USA
      All
      1.800673
      4.500000
      12
      6.833333
      1.471548
    
    
      Front
      1.037749
      4.223077
      13
      6.923077
      0.790732

View the generated code

This can also serve as an excellent tool for teaching the complex syntax of PROC TABULATE statements.



In [26]:

    
sas.teach_me_SAS(True)

cars.tabulate.table(
    left = by_type * drivetrain.with_(all='All'),
    top  = cylstats | enginestats
)









    



proc tabulate data=sashelp.cars ;
  table type='' * (drivetrain='Drive Train' ALL='All'), (cylinders='Cyls.' * (mean='Avg'*f=8.2 std='Std. Dev.'*f=5.2 n='Count'*f=6.0) enginesize='Engine Size' * (mean='Avg'*f=8.2 std='Std. Dev.'*f=5.2));
  var cylinders enginesize;
  class type drivetrain;
run;



In [ ]:

	Make	Model	Type	Origin	DriveTrain	MSRP	Invoice	EngineSize	Cylinders	Horsepower	MPG_City	MPG_Highway	Weight	Wheelbase	Length
0	Acura	MDX	SUV	Asia	All	36945	33337	3.5	6	265	17	23	4451	106	189
1	Acura	RSX Type S 2dr	Sedan	Asia	Front	23820	21761	2.0	4	200	24	31	2778	101	172
2	Acura	TSX 4dr	Sedan	Asia	Front	26990	24647	2.4	4	200	22	29	3230	105	183
3	Acura	TL 4dr	Sedan	Asia	Front	33195	30299	3.2	6	270	20	28	3575	108	186
4	Acura	3.5 RL 4dr	Sedan	Asia	Front	43755	39014	3.5	6	225	18	24	3880	115	197

		Horsepower		Cylinders
		Mean	N	Mean	N
Origin	Type	92.00	3	3.67	3
Asia	Hybrid	92.00	3	3.67	3
	SUV	214.16	25	6.00	25
	Sedan	181.98	94	5.04	94
	Sports	225.35	17	5.07	15
	Truck	190.25	8	5.50	8
	Wagon	185.64	11	4.91	11
Europe	SUV	263.10	10	7.20	10
	Sedan	236.53	78	6.08	78
	Sports	316.74	23	6.61	23
	Wagon	218.17	12	5.75	12
USA	SUV	246.56	25	6.88	25
	Sedan	191.99	90	5.71	90
	Sports	312.00	9	7.78	9
	Truck	242.13	16	6.63	16
	Wagon	165.71	7	5.14	7

		Power!		Cylinders
		Average	Count	Average	Count
Asia	Hybrid	92.00	3	3.67	3
	SUV	214.16	25	6.00	25
	Sedan	181.98	94	5.04	94
	Sports	225.35	17	5.07	15
	Truck	190.25	8	5.50	8
	Wagon	185.64	11	4.91	11
Europe	SUV	263.10	10	7.20	10
	Sedan	236.53	78	6.08	78
	Sports	316.74	23	6.61	23
	Wagon	218.17	12	5.75	12
USA	SUV	246.56	25	6.88	25
	Sedan	191.99	90	5.71	90
	Sports	312.00	9	7.78	9
	Truck	242.13	16	6.63	16
	Wagon	165.71	7	5.14	7
Total	Hybrid	92.00	3	3.67	3
	SUV	235.82	60	6.57	60
	Sedan	201.66	262	5.58	262
	Sports	284.16	49	6.34	47
	Truck	224.83	24	6.25	24
	Wagon	194.00	30	5.30	30

	Hybrid		SUV		Sedan		Sports		Truck		Wagon
	Count	Mean HP	Count	Mean HP	Count	Mean HP	Count	Mean HP	Count	Mean HP	Count	Mean HP
Asia	3	92.00	25	214.16	94	181.98	17	225.35	8	190.25	11	185.64
Europe	.	.	10	263.10	78	236.53	23	316.74	.	.	12	218.17
USA	.	.	25	246.56	90	191.99	9	312.00	16	242.13	7	165.71
All Origins	3	92.00	60	235.82	262	201.66	49	284.16	24	224.83	30	194.00

		Horses
		Avg	Std. Dev.	Count
Asia	Hybrid	92.00	18.52	3
	SUV	214.16	48.70	25
	Sedan	181.98	57.29	94
	Sports	225.35	57.60	17
	Truck	190.25	51.76	8
	Wagon	185.64	69.47	11
	All	190.70	59.39	158
Europe	SUV	263.10	52.66	10
	Sedan	236.53	71.35	78
	Sports	316.74	96.21	23
	Wagon	218.17	63.71	12
	All	251.89	80.74	123
USA	SUV	246.56	58.68	25
	Sedan	191.99	46.50	90
	Sports	312.00	91.09	9
	Truck	242.13	60.48	16
	Wagon	165.71	44.20	7
	All	212.82	63.75	147
All	Hybrid	92.00	18.52	3
	SUV	235.82	56.23	60
	Sedan	201.66	62.80	262
	Sports	284.16	92.79	49
	Truck	224.83	61.85	24
	Wagon	194.00	63.79	30
	All	215.89	71.84	428

			Cylinders_Std	EngineSize_Mean	Cylinders_N	Cylinders_Mean	EngineSize_Std
Type	Origin	DriveTrain
Hybrid	Asia	Front	0.577350	1.633333	3	3.666667	0.321455
SUV	Asia	All	1.460593	3.493750	16	6.000000	0.944788
	Asia	Front	1.000000	3.433333	9	6.000000	0.951315
	Europe	All	1.032796	3.950000	10	7.200000	0.943104
	USA	All	1.800673	4.500000	12	6.833333	1.471548
	USA	Front	1.037749	4.223077	13	6.923077	0.790732
Sedan	Asia	All	1.069045	2.785714	7	4.857143	0.393398
		Front	0.994490	2.510256	78	4.846154	0.718510
		Rear	1.054093	3.733333	9	6.888889	0.665207
	Europe	All	1.017815	3.144444	18	6.277778	0.746933