SASPy Tabulation for Descriptive Statistics

This notebook demonstrates the usage of a powerful set of tools for descriptive statistics and nesting data in SASPy, powered by the TABULATE procedure.


In [27]:
import saspy
sas = saspy.SASsession(cfgname='default')


SAS Connection established. Subprocess id is 1046


In [28]:
cars = sas.sasdata('cars', 'sashelp')
cars.head()


Out[28]:
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Acura MDX SUV Asia All 36945 33337 3.5 6 265 17 23 4451 106 189
1 Acura RSX Type S 2dr Sedan Asia Front 23820 21761 2.0 4 200 24 31 2778 101 172
2 Acura TSX 4dr Sedan Asia Front 26990 24647 2.4 4 200 22 29 3230 105 183
3 Acura TL 4dr Sedan Asia Front 33195 30299 3.2 6 270 20 28 3575 108 186
4 Acura 3.5 RL 4dr Sedan Asia Front 43755 39014 3.5 6 225 18 24 3880 115 197

Basic usage

Like the TABULATE procedure on which it relies, using the tabulate methods attached to your SASPy data sets means specifying three things:

  1. class columns, by which to group your data;
  2. var columns, which contain data to be calculated; and
  3. statistics, to be calculated on the var columns within groupings of data.

Then you compose the table using a simple syntax, in which * indicates a nesting and | indicates elements at the same level (this is made possible by Python's operator overloading).


In [11]:
# define columns to use as classes or computational vars
by_origin, by_type     = cars.tabulate.classes('origin', 'type')
horsepower, cylinders  = cars.tabulate.vars('horsepower', 'cylinders')

# grab statistics of interest
mean, n                = cars.tabulate.stats('mean', 'n')

# compose these elements into a table
cars.tabulate.table(
    left = by_origin * by_type,
    top  = (horsepower | cylinders) * (mean | n)
)


SAS Output

The SAS System

  Horsepower Cylinders
Mean N Mean N
Origin Type 92.00 3 3.67 3
Asia Hybrid
SUV 214.16 25 6.00 25
Sedan 181.98 94 5.04 94
Sports 225.35 17 5.07 15
Truck 190.25 8 5.50 8
Wagon 185.64 11 4.91 11
Europe SUV 263.10 10 7.20 10
Sedan 236.53 78 6.08 78
Sports 316.74 23 6.61 23
Wagon 218.17 12 5.75 12
USA SUV 246.56 25 6.88 25
Sedan 191.99 90 5.71 90
Sports 312.00 9 7.78 9
Truck 242.13 16 6.63 16
Wagon 165.71 7 5.14 7

In [4]:
# alternatively, you can output pure-text tables using .text_table()
cars.tabulate.text_table(
    left = by_origin * by_type,
    top  = (horsepower | cylinders) * (mean | n)
)


                                                           The SAS System                            20:09 Friday, April 6, 2018   1

                       -------------------------------------------------------------------------------------
                       |                               |       Horsepower        |        Cylinders        |
                       |                               |-------------------------+-------------------------|
                       |                               |    Mean    |     N      |    Mean    |     N      |
                       |-------------------------------+------------+------------+------------+------------|
                       |Origin         |Type           |            |            |            |            |
                       |---------------+---------------|            |            |            |            |
                       |Asia           |Hybrid         |       92.00|        3.00|        3.67|        3.00|
                       |               |---------------+------------+------------+------------+------------|
                       |               |SUV            |      214.16|       25.00|        6.00|       25.00|
                       |               |---------------+------------+------------+------------+------------|
                       |               |Sedan          |      181.98|       94.00|        5.04|       94.00|
                       |               |---------------+------------+------------+------------+------------|
                       |               |Sports         |      225.35|       17.00|        5.07|       15.00|
                       |               |---------------+------------+------------+------------+------------|
                       |               |Truck          |      190.25|        8.00|        5.50|        8.00|
                       |               |---------------+------------+------------+------------+------------|
                       |               |Wagon          |      185.64|       11.00|        4.91|       11.00|
                       |---------------+---------------+------------+------------+------------+------------|
                       |Europe         |SUV            |      263.10|       10.00|        7.20|       10.00|
                       |               |---------------+------------+------------+------------+------------|
                       |               |Sedan          |      236.53|       78.00|        6.08|       78.00|
                       |               |---------------+------------+------------+------------+------------|
                       |               |Sports         |      316.74|       23.00|        6.61|       23.00|
                       |               |---------------+------------+------------+------------+------------|
                       |               |Wagon          |      218.17|       12.00|        5.75|       12.00|
                       |---------------+---------------+------------+------------+------------+------------|
                       |USA            |SUV            |      246.56|       25.00|        6.88|       25.00|
                       |               |---------------+------------+------------+------------+------------|
                       |               |Sedan          |      191.99|       90.00|        5.71|       90.00|
                       |               |---------------+------------+------------+------------+------------|
                       |               |Sports         |      312.00|        9.00|        7.78|        9.00|
                       |               |---------------+------------+------------+------------+------------|
                       |               |Truck          |      242.13|       16.00|        6.63|       16.00|
                       |               |---------------+------------+------------+------------+------------|
                       |               |Wagon          |      165.71|        7.00|        5.14|        7.00|
                       -------------------------------------------------------------------------------------

Adding options

Labels, formats (for aggregate statistics), and total groupings ('all' keyword) may also be specified, either when creating a placeholder or during composition using .with_()


In [12]:
# assign labels, formats
by_origin, by_type     = cars.tabulate.classes('origin', 'type', labels=False)
horsepower, cylinders  = cars.tabulate.vars('horsepower', 'cylinders')
mean, n                = cars.tabulate.stats('mean', 'n', formats=['6.2', '3.'], labels=['Average', 'Count'])

# you can override or add options in composition by using .with_()
cars.tabulate.table(
    left = by_origin.with_(all='Total') * by_type,
    top  = (horsepower.with_(label='Power!') | cylinders) * (mean.with_(format='6.2') | n)
)


SAS Output

The SAS System

  Power! Cylinders
Average Count Average Count
Asia Hybrid 92.00 3 3.67 3
SUV 214.16 25 6.00 25
Sedan 181.98 94 5.04 94
Sports 225.35 17 5.07 15
Truck 190.25 8 5.50 8
Wagon 185.64 11 4.91 11
Europe SUV 263.10 10 7.20 10
Sedan 236.53 78 6.08 78
Sports 316.74 23 6.61 23
Wagon 218.17 12 5.75 12
USA SUV 246.56 25 6.88 25
Sedan 191.99 90 5.71 90
Sports 312.00 9 7.78 9
Truck 242.13 16 6.63 16
Wagon 165.71 7 5.14 7
Total Hybrid 92.00 3 3.67 3
SUV 235.82 60 6.57 60
Sedan 201.66 262 5.58 262
Sports 284.16 49 6.34 47
Truck 224.83 24 6.25 24
Wagon 194.00 30 5.30 30

Alternative: Create class, var, and stat elements individually


In [6]:
by_origin   = cars.tabulate.as_class('origin', label='', all='All Origins')
by_type     = cars.tabulate.as_class('type', label='')

horsepower  = cars.tabulate.as_var('horsepower')
average     = cars.tabulate.stat('mean', format='8.2', label='Mean HP')
n           = cars.tabulate.stat('n', label='Count', format='6.')


cars.tabulate.table(
    left = by_origin,
    top  = by_type * (n | horsepower.with_(label='') * average)
)


SAS Output

The SAS System

  Hybrid SUV Sedan Sports Truck Wagon
Count Mean HP Count Mean HP Count Mean HP Count Mean HP Count Mean HP Count Mean HP
Asia 3 92.00 25 214.16 94 181.98 17 225.35 8 190.25 11 185.64
Europe . . 10 263.10 78 236.53 23 316.74 . . 12 218.17
USA . . 25 246.56 90 191.99 9 312.00 16 242.13 7 165.71
All Origins 3 92.00 60 235.82 262 201.66 49 284.16 24 224.83 30 194.00

Composition and Re-use

The real advantage of a Python interface to PROC TABULATE comes with the ability to compose fragments of interest and then recompose those into various tables at will.


In [7]:
# create some elements for reuse
by_origin   = cars.tabulate.as_class('origin', label='')
by_type     = cars.tabulate.as_class('type', label='')

horsepower  = cars.tabulate.as_var('horsepower', label='Horses')
cylinders   = cars.tabulate.as_var('cylinders', label='Cyls.')
enginesize  = cars.tabulate.as_var('enginesize', label='Engine Size')

average     = cars.tabulate.stat('mean', label='Avg', format='8.2')
stdev       = cars.tabulate.stat('std', label='Std. Dev.', format='5.2')
n           = cars.tabulate.stat('n', label='Count', format='6.0')

# create some compositional fragments
by_origin_and_type = by_origin.with_(all='All') * by_type.with_(all='All')
hpstats            = horsepower * (average | stdev | n)
cylstats           = cylinders * (average | stdev | n)
enginestats        = enginesize * (average | stdev)

In [13]:
# draw a table
cars.tabulate.table(
    left = by_origin_and_type,
    top  = hpstats
)


SAS Output

The SAS System

  Horses
Avg Std. Dev. Count
Asia Hybrid 92.00 18.52 3
SUV 214.16 48.70 25
Sedan 181.98 57.29 94
Sports 225.35 57.60 17
Truck 190.25 51.76 8
Wagon 185.64 69.47 11
All 190.70 59.39 158
Europe SUV 263.10 52.66 10
Sedan 236.53 71.35 78
Sports 316.74 96.21 23
Wagon 218.17 63.71 12
All 251.89 80.74 123
USA SUV 246.56 58.68 25
Sedan 191.99 46.50 90
Sports 312.00 91.09 9
Truck 242.13 60.48 16
Wagon 165.71 44.20 7
All 212.82 63.75 147
All Hybrid 92.00 18.52 3
SUV 235.82 56.23 60
Sedan 201.66 62.80 262
Sports 284.16 92.79 49
Truck 224.83 61.85 24
Wagon 194.00 63.79 30
All 215.89 71.84 428

In [14]:
# draw another table
cars.tabulate.table(
    left = by_type,
    top  = cylstats | enginestats
)


SAS Output

The SAS System

  Cyls. Engine Size
Avg Std. Dev. Count Avg Std. Dev.
Hybrid 3.67 0.58 3 1.63 0.32
SUV 6.57 1.38 60 3.92 1.09
Sedan 5.58 1.47 262 2.97 0.93
Sports 6.34 1.78 47 3.44 1.39
Truck 6.25 1.59 24 4.08 1.25
Wagon 5.30 1.42 30 2.77 0.89

In [10]:
# grab another class when needed, draw another table
drivetrain = cars.tabulate.as_class('drivetrain', label='Drive Train')

cars.tabulate.table(
    left = by_type * drivetrain.with_(all='All'),
    top  = cylstats | enginestats
)


SAS Output

The SAS System

  Cyls. Engine Size
Avg Std. Dev. Count Avg Std. Dev.
  Drive Train 3.67 0.58 3 1.63 0.32
Hybrid Front
All 3.67 0.58 3 1.63 0.32
SUV Drive Train 6.58 1.54 38 3.93 1.19
All
Front 6.55 1.10 22 3.90 0.93
All 6.57 1.38 60 3.92 1.09
Sedan Drive Train 5.89 1.13 28 3.13 0.70
All
Front 5.07 1.23 179 2.69 0.82
Rear 7.09 1.27 55 3.80 0.84
All 5.58 1.47 262 2.97 0.93
Sports Drive Train 4.80 1.10 5 2.62 0.77
All
Front 5.75 1.67 8 2.80 0.92
Rear 6.71 1.77 34 3.70 1.47
All 6.34 1.78 47 3.44 1.39
Truck Drive Train 6.67 1.56 12 4.55 1.32
All
Rear 5.83 1.59 12 3.61 1.02
All 6.25 1.59 24 4.08 1.25
Wagon Drive Train 5.89 1.76 9 3.22 1.04
All
Front 4.71 1.27 14 2.36 0.75
Rear 5.71 0.76 7 3.01 0.63
All 5.30 1.42 30 2.77 0.89

Retrieve an indexed DataFrame

One of SASPy's best features is the integration with Pandas DataFrames. Instead of drawing a presentational table in HTML or plain text, you can have the resulting nested values converted to a DataFrame using nested indices.

Note that certain presentational elements (labels, formats, etc) aren't represented, nor is the exact visual arrangement of your groupings. However, all computations are in the DataFrame and can be further accessed or sliced easily in Python.


In [21]:
# since w
my_frame = cars.tabulate.to_dataframe(
    left = by_type * drivetrain * by_origin,
    top  = cylstats | enginestats
)

# showing an excerpt
my_frame[:10]


Out[21]:
Cylinders_Std EngineSize_Mean Cylinders_N Cylinders_Mean EngineSize_Std
Type Origin DriveTrain
Hybrid Asia Front 0.577350 1.633333 3 3.666667 0.321455
SUV Asia All 1.460593 3.493750 16 6.000000 0.944788
Front 1.000000 3.433333 9 6.000000 0.951315
Europe All 1.032796 3.950000 10 7.200000 0.943104
USA All 1.800673 4.500000 12 6.833333 1.471548
Front 1.037749 4.223077 13 6.923077 0.790732
Sedan Asia All 1.069045 2.785714 7 4.857143 0.393398
Front 0.994490 2.510256 78 4.846154 0.718510
Rear 1.054093 3.733333 9 6.888889 0.665207
Europe All 1.017815 3.144444 18 6.277778 0.746933

In [23]:
# you can access portions of that nested frame by indices, from left to right
my_frame.loc[('Sedan','Asia')]


Out[23]:
Cylinders_Std EngineSize_Mean Cylinders_N Cylinders_Mean EngineSize_Std
DriveTrain
All 1.069045 2.785714 7 4.857143 0.393398
Front 0.994490 2.510256 78 4.846154 0.718510
Rear 1.054093 3.733333 9 6.888889 0.665207

In [24]:
my_frame.loc[('SUV')]


Out[24]:
Cylinders_Std EngineSize_Mean Cylinders_N Cylinders_Mean EngineSize_Std
Origin DriveTrain
Asia All 1.460593 3.493750 16 6.000000 0.944788
Front 1.000000 3.433333 9 6.000000 0.951315
Europe All 1.032796 3.950000 10 7.200000 0.943104
USA All 1.800673 4.500000 12 6.833333 1.471548
Front 1.037749 4.223077 13 6.923077 0.790732

View the generated code

This can also serve as an excellent tool for teaching the complex syntax of PROC TABULATE statements.


In [26]:
sas.teach_me_SAS(True)

cars.tabulate.table(
    left = by_type * drivetrain.with_(all='All'),
    top  = cylstats | enginestats
)


proc tabulate data=sashelp.cars ;
  table type='' * (drivetrain='Drive Train' ALL='All'), (cylinders='Cyls.' * (mean='Avg'*f=8.2 std='Std. Dev.'*f=5.2 n='Count'*f=6.0) enginesize='Engine Size' * (mean='Avg'*f=8.2 std='Std. Dev.'*f=5.2));
  var cylinders enginesize;
  class type drivetrain;
run;

In [ ]: