Filtering Your Data using Python

Sometimes you just want to look at a small subset of your data. Luckily, CAS makes this fairly easy to do through the use of WHERE clauses and variable lists. If you are using CASTable objects in Python, you can also use DataFrame-like operations on them to filter your view of the data as well. We'll look at some examples of both of these here.

The first thing we need to do is create a CAS connection.


In [1]:
import swat

conn = swat.CAS(host, port, username, password)

Now we need some data to work with, so we'll upload the cars data set.


In [2]:
cars = conn.read_csv('https://raw.githubusercontent.com/sassoftware/sas-viya-programming/master/data/cars.csv')
cars


Out[2]:
CASTable('_T_8B0C9A5C_7F68FFC710C0', caslib='CASUSERHDFS(kesmit)')

Using the len function in Python and the table.columinfo action, we can get some more information about the data in the CAS table. We see that it has 428 rows and 15 columns of data.


In [3]:
len(cars)


Out[3]:
428

In [4]:
cars.table.columninfo()


Out[4]:
§ ColumnInfo
Column ID Type RawLength FormattedLength NFL NFD
0 Make 1 varchar 13 13 0 0
1 Model 2 varchar 39 39 0 0
2 Type 3 varchar 6 6 0 0
3 Origin 4 varchar 6 6 0 0
4 DriveTrain 5 varchar 5 5 0 0
5 MSRP 6 double 8 12 0 0
6 Invoice 7 double 8 12 0 0
7 EngineSize 8 double 8 12 0 0
8 Cylinders 9 double 8 12 0 0
9 Horsepower 10 double 8 12 0 0
10 MPG_City 11 double 8 12 0 0
11 MPG_Highway 12 double 8 12 0 0
12 Weight 13 double 8 12 0 0
13 Wheelbase 14 double 8 12 0 0
14 Length 15 double 8 12 0 0

elapsed 0.00545s · user 0.002s · sys 0.017s · mem 0.767MB

Let's say that we only want to see the sports cars. We can do this in one of two ways. We could set the where parameter on the CASTable object to contain the string 'Type = "Sports"', or we could use the DataFrame data selection syntax. Let's look at the where parameter first.

To apply a WHERE clause to a CASTable, we can manually set the where attribute to the expression we want to apply.


In [5]:
cars.where = 'Type = "Sports"'
cars


Out[5]:
CASTable('_T_8B0C9A5C_7F68FFC710C0', caslib='CASUSERHDFS(kesmit)', where='Type = "Sports"')

We can look at the length and a sample of the data to see that it has been subset.


In [6]:
len(cars)


Out[6]:
49

In [7]:
cars.head()


Out[7]:
Selected Rows from Table _T_8B0C9A5C_7F68FFC710C0
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Audi RS 6 4dr Sports Europe Front 84600.0 76417.0 4.2 8.0 450.0 15.0 22.0 4024.0 109.0 191.0
1 BMW Z4 convertible 3.0i 2dr Sports Europe Rear 41045.0 37575.0 3.0 6.0 225.0 21.0 29.0 2998.0 98.0 161.0
2 Jaguar XKR convertible 2dr Sports Europe Rear 86995.0 79226.0 4.2 8.0 390.0 16.0 23.0 4042.0 102.0 187.0
3 Mazda RX-8 4dr manual Sports Asia Rear 27200.0 25179.0 1.3 NaN 238.0 18.0 24.0 3029.0 106.0 174.0
4 Mercedes-Benz SLK230 convertible 2dr Sports Europe Rear 40320.0 37548.0 2.3 4.0 192.0 21.0 29.0 3055.0 95.0 158.0

Let's remove the where attribute and look at the DataFrame-like ways of subsetting data.


In [8]:
del cars.where
cars


Out[8]:
CASTable('_T_8B0C9A5C_7F68FFC710C0', caslib='CASUSERHDFS(kesmit)')

The query method on the CASTable object mimics the query method of DataFrames. However, in this case, the syntax of the expression is the same as a CAS WHERE clause. So we use the same expression from above as the argument to the query method.


In [9]:
cars.query('Type = "Sports"').head()


Out[9]:
Selected Rows from Table _T_8B0C9A5C_7F68FFC710C0
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Audi RS 6 4dr Sports Europe Front 84600.0 76417.0 4.2 8.0 450.0 15.0 22.0 4024.0 109.0 191.0
1 BMW Z4 convertible 3.0i 2dr Sports Europe Rear 41045.0 37575.0 3.0 6.0 225.0 21.0 29.0 2998.0 98.0 161.0
2 Jaguar XKR convertible 2dr Sports Europe Rear 86995.0 79226.0 4.2 8.0 390.0 16.0 23.0 4042.0 102.0 187.0
3 Mazda RX-8 4dr manual Sports Asia Rear 27200.0 25179.0 1.3 NaN 238.0 18.0 24.0 3029.0 106.0 174.0
4 Mercedes-Benz SLK230 convertible 2dr Sports Europe Rear 40320.0 37548.0 2.3 4.0 192.0 21.0 29.0 3055.0 95.0 158.0

Unlike setting the where parameter on the CASTable object, the query method does not embed the parameter in the CASTable object. It creates a copy of the table. If you want to apply the query to the CASTable object, you would add the inplace=True option.


In [10]:
cars.query('Type = "Sports"', inplace=True)
cars


Out[10]:
CASTable('_T_8B0C9A5C_7F68FFC710C0', caslib='CASUSERHDFS(kesmit)', where='(Type = "Sports")')

A very popular way of subsetting the data in a DataFrame is using Python's getitem syntax (i.e., df[...]). You can use that same syntax on CASTable objects. First, we need to delete the WHERE clause that we had embedded using the last query method.


In [11]:
del cars.where
cars


Out[11]:
CASTable('_T_8B0C9A5C_7F68FFC710C0', caslib='CASUSERHDFS(kesmit)')

The way to subset a table using DataFrame syntax is to index a CASTable object (e.g., cars[...]) using a condition on a column of that CASTable (e.g., cars.Type == 'Sports'). The condition is applied to the CASTable as a filter.


In [12]:
cars[cars.Type == 'Sports'].head()


Out[12]:
Selected Rows from Table _T_8B0C9A5C_7F68FFC710C0
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Audi RS 6 4dr Sports Europe Front 84600.0 76417.0 4.2 8.0 450.0 15.0 22.0 4024.0 109.0 191.0
1 BMW Z4 convertible 3.0i 2dr Sports Europe Rear 41045.0 37575.0 3.0 6.0 225.0 21.0 29.0 2998.0 98.0 161.0
2 Jaguar XKR convertible 2dr Sports Europe Rear 86995.0 79226.0 4.2 8.0 390.0 16.0 23.0 4042.0 102.0 187.0
3 Mazda RX-8 4dr manual Sports Asia Rear 27200.0 25179.0 1.3 NaN 238.0 18.0 24.0 3029.0 106.0 174.0
4 Mercedes-Benz SLK230 convertible 2dr Sports Europe Rear 40320.0 37548.0 2.3 4.0 192.0 21.0 29.0 3055.0 95.0 158.0

The way this works is the condition creates a computed column that is treated as a mask. If you look at the result of the expression, you'll see that it creates a computed column describing the expression.


In [13]:
cars.Type == 'Sports'


Out[13]:
CASColumn('_T_8B0C9A5C_7F68FFC710C0', caslib='CASUSERHDFS(kesmit)', computedvars=['_eq_1_'], computedvarsprogram="_eq_1_ = (Type = 'Sports'); ")['_eq_1_']

If you look at a sample of the data created by the computed column, you'll see that it is a series of ones and zeros.


In [14]:
(cars.Type == 'Sports').head(40)


Out[14]:
0     0.0
1     0.0
2     0.0
3     0.0
4     0.0
5     1.0
6     0.0
7     0.0
8     0.0
9     0.0
10    0.0
11    1.0
12    0.0
13    0.0
14    0.0
15    0.0
16    0.0
17    0.0
18    0.0
19    0.0
20    0.0
21    0.0
22    0.0
23    0.0
24    0.0
25    0.0
26    0.0
27    0.0
28    0.0
29    0.0
30    0.0
31    0.0
32    0.0
33    0.0
34    0.0
35    0.0
36    0.0
37    0.0
38    0.0
39    0.0
Name: _eq_2_, dtype: float64

When this mask is applied to a table, only the rows where the condition is true (i.e., computed expression is equal to one) show up in the output.


In [15]:
cars[cars.Type == 'Sports'].head()


Out[15]:
Selected Rows from Table _T_8B0C9A5C_7F68FFC710C0
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Audi RS 6 4dr Sports Europe Front 84600.0 76417.0 4.2 8.0 450.0 15.0 22.0 4024.0 109.0 191.0
1 BMW Z4 convertible 3.0i 2dr Sports Europe Rear 41045.0 37575.0 3.0 6.0 225.0 21.0 29.0 2998.0 98.0 161.0
2 Jaguar XKR convertible 2dr Sports Europe Rear 86995.0 79226.0 4.2 8.0 390.0 16.0 23.0 4042.0 102.0 187.0
3 Mazda RX-8 4dr manual Sports Asia Rear 27200.0 25179.0 1.3 NaN 238.0 18.0 24.0 3029.0 106.0 174.0
4 Mercedes-Benz SLK230 convertible 2dr Sports Europe Rear 40320.0 37548.0 2.3 4.0 192.0 21.0 29.0 3055.0 95.0 158.0

It is also possible to combine expressions using Python's & and | operators. Due to the order of operations, you need to surround each subexpression with parentheses.


In [16]:
cars[(cars.Type == 'Sports') & (cars.Cylinders > 6)].head()


Out[16]:
Selected Rows from Table _T_8B0C9A5C_7F68FFC710C0
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Audi RS 6 4dr Sports Europe Front 84600.0 76417.0 4.2 8.0 450.0 15.0 22.0 4024.0 109.0 191.0
1 Jaguar XKR convertible 2dr Sports Europe Rear 86995.0 79226.0 4.2 8.0 390.0 16.0 23.0 4042.0 102.0 187.0
2 Pontiac GTO 2dr Sports USA Rear 33500.0 30710.0 5.7 8.0 340.0 16.0 20.0 3725.0 110.0 190.0
3 Cadillac XLR convertible 2dr Sports USA Rear 76200.0 70546.0 4.6 8.0 320.0 17.0 25.0 3647.0 106.0 178.0
4 Jaguar XK8 coupe 2dr Sports Europe Rear 69995.0 63756.0 4.2 8.0 294.0 18.0 26.0 3779.0 102.0 187.0

Alternatively, you can chain your subsets which also results in combining those conditions into a single filter.


In [17]:
cars[cars.Type == 'Sports'][cars.Cylinders > 6].head()


Out[17]:
Selected Rows from Table _T_8B0C9A5C_7F68FFC710C0
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Audi RS 6 4dr Sports Europe Front 84600.0 76417.0 4.2 8.0 450.0 15.0 22.0 4024.0 109.0 191.0
1 Jaguar XKR convertible 2dr Sports Europe Rear 86995.0 79226.0 4.2 8.0 390.0 16.0 23.0 4042.0 102.0 187.0
2 Pontiac GTO 2dr Sports USA Rear 33500.0 30710.0 5.7 8.0 340.0 16.0 20.0 3725.0 110.0 190.0
3 Cadillac XLR convertible 2dr Sports USA Rear 76200.0 70546.0 4.6 8.0 320.0 17.0 25.0 3647.0 106.0 178.0
4 Jaguar XK8 coupe 2dr Sports Europe Rear 69995.0 63756.0 4.2 8.0 294.0 18.0 26.0 3779.0 102.0 187.0

If you want to see what's going on behind the scenes, you can always store the result in a variable. The resulting object is simply another CASTable object with a WHERE clause and computed columns for the expressions.


In [18]:
sports8 = cars[cars.Type == 'Sports'][cars.Cylinders > 6]
sports8


Out[18]:
CASTable('_T_8B0C9A5C_7F68FFC710C0', caslib='CASUSERHDFS(kesmit)', computedvars=['_eq_9_', '_gt_A_'], computedvarsprogram="_eq_9_ = (Type = 'Sports'); _gt_A_ = (Cylinders > 6); ", where='((_eq_9_)) and (_gt_A_)')[['Make', 'Model', 'Type', 'Origin', 'DriveTrain', 'MSRP', 'Invoice', 'EngineSize', 'Cylinders', 'Horsepower', 'MPG_City', 'MPG_Highway', 'Weight', 'Wheelbase', 'Length']]

In [19]:
conn.close()

Conclusion

We've shown two different ways of filtering data in a CAS table. First we showed how you could manually apply a WHERE clause to a CASTable object. The next way of filtering data was based on the Pandas DataFrame API. You can either use the query method with a WHERE clause, or you can use the getitem syntax of Python to generate computed columns and WHERE clauses based on Python syntax.


In [ ]: