SWAT (SAS Wrapper for Analytics Transfer) is the Python module used to interface with SAS Cloud Analytic Services (CAS). This module includes classes and functions to call actions on the server and process the results on the client. This workshop will cover the basics of connecting to CAS from Python, calling CAS actions, interacting with CAS tables, and using the results of CAS actions on the client.
We will cover several aspects of working with CAS in the following sections.
The SWAT module in Python is merely a CAS client; it does not start the server. So the first thing you need to do is get access to a running CAS server. To connect to the server, we need the hostname and port number that the server is running on, as well as a way to authenticate with the server.
There are various ways of authenticating with CAS, so you will need to check with your system administrator as to the proper method to use. The simplest form of authentication is username and password, and we will cover that here.
To connect to CAS, we will use the CAS class in the SWAT package. The most basic form of creating a connection is to supply the CAS host, port, username, and password to the CAS class.
In [16]:
import swat
conn = swat.CAS('sasserver.demo.sas.com', 5570)
We can print the CAS object to see the settings for that session. It includes the host, port, username, protocol, session name, and session ID. The host, port, and username should be self-explanatory. The protocol can be 'cas' for the binary protocol, 'http' for the unencrypted REST interface, or 'https' for the encrypted REST interface.
The session name is simply a human-readable tag given to a session so it is easier to identify in actions that display session information. The session ID is a unique identifier given to each session. The session ID can be used to reconnect to a session, or connect from another client.
In [17]:
conn
Out[17]:
It's generally not a good idea to put your username and password into a program, so the CAS class also supports Authinfo authentication. This is a method of storing your username and password in a file that is readable only by you; the CAS object can then safely read the username and password from that file. This file was used in the connection above, which is why you didn't need to enter a username or password.
The basic format of the file is to put a host, port, username, and password on each line.
host my-cas.com port 12345 user my-name password s3cr3t
The hostname must exactly match the hostname used in the CAS object. If no port number is specified on the line, the username and password will be used for any port on that machine.
By default, the file should be stored in ~/.authinfo, and it should be readable only by you. On Linux, the command to do this is:
chmod 0600 ~/.authinfo
Once you have a file with the appropriate authentication information, you can create a CAS object without specifying a username and password.
Now that we have a CAS connection, we can try running some CAS actions. Let's start with the serverstatus action. This action displays some basic status information about the CAS server, such as platform and release information, the number of nodes, and information about each node in the grid.
CAS actions are called on CAS objects just like Python methods. We will store the result of the action call in out as well as display the result.
In [18]:
out = conn.serverstatus()
out
Out[18]:
The result of every CAS action is a CASResults object. CASResults objects are simply a subclass of Python's ordered dictionary, so anything that you can do with Python dictionaries can also be done with CASResults objects. For example, we can print all of the result keys using the keys method.
In [19]:
list(out.keys())
Out[19]:
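Since CASResults is just a subclass of an ordered dictionary, all of the standard dictionary operations apply. As a plain-Python sketch (an OrderedDict standing in for a real result object, so no CAS connection is needed):

```python
from collections import OrderedDict

# Stand-in for a CASResults object: the dictionary behavior is the same,
# only the CAS-specific extras are missing.
out = OrderedDict([('About', {'Version': '...'}),
                   ('server', 'a SASDataFrame'),
                   ('nodestatus', 'a SASDataFrame')])

print(list(out.keys()))        # ['About', 'server', 'nodestatus']
print('nodestatus' in out)     # True -- membership tests work too
```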
You can also access the objects at each key using dictionary syntax.
In [20]:
out['nodestatus']
Out[20]:
The types of the objects at each key vary from action to action. In the case of serverstatus, the 'About' key holds a dictionary, and the 'server' and 'nodestatus' keys hold DataFrames.
In [21]:
[type(x) for x in out.values()]
Out[21]:
As you can see, the DataFrames held in the output are SASDataFrame objects. The SASDataFrame object is a subclass of the Pandas DataFrame, so the two can be treated as equivalents. The SASDataFrame object simply adds some extra SAS metadata such as titles, labels, formats, etc.
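To illustrate what subclassing buys here, the following sketch (not the actual SWAT implementation) builds a DataFrame subclass that carries one extra metadata attribute, the way SASDataFrame carries its SAS metadata:

```python
import pandas as pd

class SASLikeFrame(pd.DataFrame):
    """Toy analogy of SASDataFrame: a DataFrame plus extra metadata."""
    _metadata = ['title']        # extra SAS-style attribute

    @property
    def _constructor(self):
        return SASLikeFrame      # keep the subclass through operations

df = SASLikeFrame({'role': ['worker', 'controller'],
                   'uptime': [10.2, 11.5]})
df.title = 'Node Status'

# All normal DataFrame operations still work on the subclass
workers = df[df['role'] == 'worker']
print(df.title, len(workers))    # Node Status 1
```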
Of course, since it is a DataFrame, we can use indexing, sorting, and filtering operations on it.
In [22]:
nodestatus = out['nodestatus']
nodestatus[nodestatus['role'] == 'worker']
Out[22]:
In [23]:
nodestatus[nodestatus['role'] == 'controller']
Out[23]:
In [24]:
nodestatus = nodestatus.set_index('name')
nodestatus.loc['sasserver.demo.sas.com']
Out[24]:
Now that we've covered the basics of calling CAS actions and traversing the results, let's see what other actions are available.
The help Function

So far we've only introduced the serverstatus action. However, there are many actions that are available in all CAS installations, and various others may be available depending on which products have been licensed. To display all available actions, you use the help action. (Note that we added the semicolon at the end to suppress rendering of the output, which is a DataFrame of the same information that is displayed in the notes.)
In [25]:
conn.help();
You can also ask for help on specific action sets and actions.
In [26]:
conn.help(actionset='builtins');
In [27]:
conn.help(action='tableinfo');
In [28]:
help(conn.tableinfo)
In [29]:
conn.tableinfo?
In [30]:
#conn.
So far we've run a few CAS actions and taken a look at the output that they create, but we haven't done anything that requires data. In order to do that, we need to know how to load data into a CAS table.
There are many ways of getting data into CAS. We'll look at a couple of them here.
The most Pythonic way of getting data into CAS is to use the Pandas-style data readers on the CAS connection object. When we refer to "client side" data, we mean data that is located on the machine that Python is running on. "Server side" data is data that is on the machine that CAS is running on. These may or may not be the same machine.
One thing to keep in mind when loading client-side data is that all of the data must be uploaded to the CAS machine. This may be fine for small to medium-sized data sets, but for larger data sets you will probably want to move on to section 4.3 on loading server-side data.
One of the simplest data file formats is CSV. Pandas' read_csv function is a very capable CSV reader with many options. You can use the SWAT version of that same function on the CAS connection object. It supports all of the same options as the Pandas version (and actually uses it in the background). The difference is that the SWAT version returns a CASTable object rather than a DataFrame.
In [31]:
tbl = conn.read_csv('https://raw.githubusercontent.com/'
'sassoftware/sas-viya-programming/master/data/cars.csv')
tbl
Out[31]:
Since the code above uses a Python data parser, the content of the URL must be downloaded to the client machine first, then it gets parsed. The data is then uploaded to CAS. You may want to keep this in mind since it does mean that the data is transferred twice. If you have larger data files, you may want to use an alternative way of loading them.
A CASTable object is a Python object interface to a CAS table. You can call CAS actions on it just like on a CAS connection object. It also supports much of the Pandas DataFrame API. This means that you can treat it like a DataFrame, but all of the operations will be performed on the CAS server using CAS actions.

The first thing we'll do is take a look at the column information using the columninfo CAS action.
In [32]:
conn.columninfo(table=tbl)
Out[32]:
We can also fetch a sample of the data using the fetch CAS action.
In [33]:
conn.fetch(table=tbl, to=5)
Out[33]:
As you can see in the previous two code samples, we ran the CAS action on the connection object and supplied the table in the table= parameter. Since this is such a common thing to do, the CASTable object can also be used as the action dispatcher. In this case, you simply call the action on the CASTable object itself and don't specify a table= parameter; it will be populated automatically with the table information.
In [34]:
tbl.fetch(to=5)
Out[34]:
Another option for processing client-side data is to upload it as-is and let CAS parse it. This is generally faster since there isn't the additional overhead of converting Python objects into a form that CAS can read. In addition, some file formats such as CSV can be parsed in parallel in CAS, which gives another performance boost.
The example below uploads the same file as above, but this time CAS is doing the parsing. You should note that since this CSV file is a URL, the data must be downloaded to the client machine and then uploaded to CAS which adds a bit more overhead.
In [35]:
tbl2 = conn.upload_file('https://raw.githubusercontent.com/'
'sassoftware/sas-viya-programming/master/data/cars.csv')
In [36]:
tbl2.fetch(to=5)
Out[36]:
There is also an upload_frame method on the CAS connection object that will upload a Pandas DataFrame. This method simply exports the DataFrame to CSV and uploads that file to the server.
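The client-side half of that round trip can be sketched in plain Pandas (the actual upload to the server is omitted here):

```python
import io

import pandas as pd

# upload_frame serializes the DataFrame to CSV before shipping it to CAS.
df = pd.DataFrame({'Make': ['Acura', 'Audi'], 'MSRP': [36945, 31840]})

buf = io.StringIO()
df.to_csv(buf, index=False)                       # what gets sent to the server
restored = pd.read_csv(io.StringIO(buf.getvalue()))

print(restored.equals(df))                        # True -- lossless for this data
```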
In [37]:
conn.caslibinfo()
Out[37]:
In [38]:
conn.fileinfo(caslib='casuser')
Out[38]:
In [39]:
conn.fileinfo(caslib='casuser', path='Data')
Out[39]:
In [40]:
out = conn.loadtable(path='Data/cars.csv', caslib='casuser')
out
Out[40]:
You'll notice that we specified both path= and caslib= parameters. Files are not loaded using an absolute file name; they are always relative to a caslib. Caslibs can be configured to point to file system directories, databases, and other data sources. They can also be used to set access controls for various operations, but we won't go into those details here. In the example above, the file was located relative to the casuser caslib, which points to the user's home directory.
The result of the loadtable action is a CASResults object containing three keys: 'caslib', 'tableName', and 'casTable'. The 'casTable' key is simply a CASTable object constructed from the 'caslib' and 'tableName' keys.
In [41]:
tbl3 = out['casTable']
tbl3
Out[41]:
In [42]:
tbl3.fetch(to=5)
Out[42]:
Since we commonly just want the CASTable object from the loadtable action, there is a wrapper method on the CAS connection object that calls loadtable behind the scenes and just returns the 'casTable' key.

In this case, we'll load the same data file, but we'll also specify replace=True in the casout= parameter since the call will attempt to create an output table with the same name.
In [43]:
tbl4 = conn.load_path('Data/cars.csv', caslib='casuser', casout=dict(replace=True))
tbl4
Out[43]:
In [44]:
tbl4.fetch(to=5)
Out[44]:
CASTable as a DataFrame

We have shown a few examples of how the CASTable object acts like a Pandas DataFrame, but the similarities go much further. We have attempted to adopt as much of the DataFrame API as is relevant to CAS. When a method is implemented, both the method parameters and outputs work the same way between CASTable objects and Pandas DataFrame objects. Pandas DataFrame method options are mapped to CAS action options, and the CAS action outputs are transformed to match the form of the Pandas DataFrame outputs. If you are familiar with Pandas DataFrames, this makes it much easier to get started with CAS.
In [45]:
import pandas as pd
In [46]:
df = pd.read_csv('https://raw.githubusercontent.com/'
'sassoftware/sas-viya-programming/master/data/cars.csv')
df.head()
Out[46]:
Here is the same data set as a CASTable.
In [47]:
tbl.head()
Out[47]:
To select a subset of columns using a Pandas DataFrame, you use Python's indexing syntax. When you specify a list of column names, you get a new DataFrame with just those columns in it.
In [48]:
subdf = df[['Make', 'Model', 'MSRP', 'Cylinders', 'Horsepower']]
subdf.head()
Out[48]:
The same is true of CASTable objects.
In [49]:
subtbl = tbl[['Make', 'Model', 'MSRP', 'Cylinders', 'Horsepower']]
subtbl.head()
Out[49]:
You may or may not have gotten the same rows of data when displaying data from the Pandas DataFrame and the CASTable object. This is because CAS does not store data rows in any specified order. CAS tables can be distributed across a grid of machines, and rows may be redistributed as the table is processed.
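Because row order carries no meaning, any order-independent computation gives the same answer no matter how the rows happen to be distributed. A small Pandas illustration of the principle:

```python
import pandas as pd

# Shuffling the rows does not change order-independent summaries,
# which is why CAS is free to redistribute rows across the grid.
df = pd.DataFrame({'MSRP': [36945, 23820, 26990, 33195]})
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)

same = df['MSRP'].mean() == shuffled['MSRP'].mean()
print(same)    # True
```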
However, CASTable objects can emulate Pandas DataFrame-style sorting by supplying sort parameters to the fetch action when data is retrieved.

Again, we'll show how sorting works with Pandas DataFrames, then show the CASTable version.
In [50]:
subdf.sort_values(['MSRP']).head()
Out[50]:
In [51]:
subtbl.sort_values(['MSRP']).head()
Out[51]:
You'll see that we get the same rows, but not the same index values, when sorting. CAS currently does not have a row index; the index you see is the one automatically applied when the DataFrame is created on the client side.

The sort_values method of CASTable also supports the ascending= parameter.
In [52]:
subtbl.sort_values(['MSRP'], ascending=False).head()
Out[52]:
While there are CAS actions that give results similar to the Pandas DataFrame statistical methods, their output always requires some transformation to be in exactly the same form. For that reason, many of the Pandas DataFrame methods have been re-implemented on CASTable objects. This means that they look and act like Pandas DataFrame methods, but all of the real work is done behind the scenes by CAS actions.

Probably the most common statistical method on Pandas DataFrames is describe.
In [53]:
df.describe()
Out[53]:
CASTable objects also support the describe method. As you can see, the column and row indexes are exactly the same as what you get with the Pandas DataFrame method.
In [54]:
tbl.describe()
Out[54]:
You can also specify the same options to the CASTable.describe method as with DataFrames, as well as use the subsetting syntax described above.
In [55]:
tbl[['MSRP', 'EngineSize', 'Cylinders']].describe(percentiles=[0.3, 0.7])
Out[55]:
While the output was created to be compatible with the Pandas DataFrame describe method, CAS supplies many more statistics that can be displayed using the stats= parameter. You can specify a list of names of the statistics to display, or simply specify stats='all' to display all of them.
In [56]:
tbl.describe(stats='all')
Out[56]:
If you want a full-blown table with all statistics and all variables (numeric and character), you can specify include='all' and stats='all' as parameters.
In [57]:
tbl.describe(include='all', stats='all')
Out[57]:
You can also specify By groups using the groupby method on both DataFrames and CASTable objects. This works in conjunction with DataFrame methods like describe.
In [58]:
df.groupby('Origin').describe()
Out[58]:
Here is the CASTable version. Notice that in this case, since compatibility with Pandas was the primary concern, the By groups are all contained in the same DataFrame. Calling CAS actions directly will always split them into separate tables.
In [59]:
tbl.groupby('Origin').describe()
Out[59]:
Sometimes you want to filter your data so that only part of the data set is analyzed. There are a few ways to do this in SWAT:
- setting a where= parameter on the table= parameter
- using the query(...) method of CASTable
- using Pandas-style indexing expressions on CASTable objects
Essentially these are all equivalent. They all end up setting a where= parameter on the table= parameter, so which one you choose is mostly a matter of personal preference.
The first method is the brute-force method: setting the where= parameter on the table. One thing to keep in mind is that the expression must be a CAS WHERE clause, not a Python expression. This may trip people up because = is used to test equality in CAS, whereas == is used in Python.
In [60]:
tbl.set_param(where='Origin = "Asia"')
tbl
Out[60]:
In [61]:
tbl.head()
Out[61]:
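For comparison, the equivalent client-side filter in plain Pandas uses Python's ==, here shown on a tiny made-up sample:

```python
import pandas as pd

# CAS WHERE clause:  Origin = "Asia"     (SAS syntax, single '=')
# Pandas equivalent: Origin == "Asia"    (Python syntax, double '==')
df = pd.DataFrame({'Origin': ['Asia', 'USA', 'Asia', 'Europe'],
                   'MSRP': [36945, 23820, 26990, 33195]})

asia = df.query('Origin == "Asia"')
print(len(asia))    # 2
```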
Using the query method on CASTable is equivalent. The only difference is that it creates a copy of the CASTable object and returns the copy with the where= parameter set. Since tbl already had a where= parameter set, the conditions get combined.
In [62]:
tbl = tbl.query('Cylinders = 6')
tbl
Out[62]:
In [63]:
tbl.head()
Out[63]:
Finally, you can use indexing expressions just like in Pandas DataFrames. This style of indexing creates a computed column for the expression, then adds the result of that computed column as a condition of the where= parameter.
In [64]:
tbl = tbl[tbl.MSRP > 50000]
tbl
Out[64]:
You may notice that the string representation of the table now includes all of the column names. This is because there is now a computed column in the table, which automatically gets appended to the table; in order for it not to show up in the output, we must select the columns that we want to display.
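A Pandas analogy of what happens under the covers (the helper column name here is made up for illustration):

```python
import pandas as pd

# tbl[tbl.MSRP > 50000] behaves as if a computed column holding the
# expression result were added, then used as the filter condition.
df = pd.DataFrame({'Make': ['Acura', 'Porsche', 'Kia'],
                   'MSRP': [36945, 86970, 19999]})
df['_expr_'] = df['MSRP'] > 50000                  # the "computed column"
expensive = df[df['_expr_']][['Make', 'MSRP']]     # filter, then pick display columns

print(list(expensive['Make']))    # ['Porsche']
```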
In [65]:
tbl.head()
Out[65]:
In [66]:
conn.tableinfo()
Out[66]:
Let's create a fresh CASTable object.
In [67]:
tbl = conn.CASTable('cars')
In [68]:
conn.simple?
Let's start off with the distinct action. It takes the same table= parameter as the actions above, so we can call it directly on the CASTable object and have the table= parameter filled in automatically.

The distinct action outputs the number of distinct values and the number of missing values for each column in the table.
In [69]:
tbl.distinct()
Out[69]:
In [70]:
tbl.summary()
Out[70]:
There are two ways to specify the columns for which summary statistics are generated.
In [71]:
tbl[['MSRP','MPG_CITY','MPG_Highway']].summary()
Out[71]:
In [72]:
tbl.summary(inputs=['MSRP','MPG_CITY','MPG_Highway'])
Out[72]:
In [73]:
tbl[['Origin','Cylinders']].freq()
Out[73]:
In [74]:
import cufflinks
cufflinks.go_offline()
output = tbl[['Origin','Cylinders']].freq()
df = output['Frequency']
df[df['Column'] == 'Cylinders'].iplot(kind='bar', x='FmtVar', y='Frequency')
In [75]:
tbl.crosstab(row="Cylinders", col='Origin')
Out[75]:
In [76]:
tbl.crosstab(row="MSRP", col='Origin', rowNBins=10)
Out[76]:
In [77]:
tbl.crosstab(row="MSRP", col='Origin', rowNBins=10, chisq=True)
Out[77]:
In statistics, the Pearson correlation coefficient is a measure of the linear correlation between two variables X and Y. Its range is between +1 and −1, where +1 is total positive correlation, 0 is no correlation, and −1 is total negative correlation.
https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
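As a quick client-side sanity check, the coefficient can be computed with Pandas' corr (whose default method is Pearson) or directly from its definition; the sample values below are made up:

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.1, 3.9, 6.2, 8.0, 9.8])    # roughly 2*x, so r should be near +1

r = x.corr(y)    # Pandas' default correlation method is Pearson

# The same statistic straight from the definition
manual = ((x - x.mean()) * (y - y.mean())).sum() / (
    (((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum()) ** 0.5)

print(round(r, 6), round(manual, 6))    # both near +1
```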
In [78]:
tbl.correlation(simple=False)
Out[78]:
In [79]:
output = tbl.correlation(simple=False)['Correlation']
output.set_index(output['Variable']).iplot(kind='heatmap', colorscale='rdbu')
In [80]:
tbl[['MPG_City']].histogram()
Out[80]:
While the table of data is nice and may be useful in some contexts, it might be nice to see a visual representation of the histogram data. There are various ways of plotting data in Python. The most basic is to use the plot method on the DataFrame.
In [81]:
# Set matplotlib to render graphics inline
%matplotlib inline
hist = tbl[['MPG_City']].histogram()
hist['BinDetails'].plot.bar(x='MidPoint', y='Percent')
Out[81]:
Now that we've done some basic statistics, let's look at how to group data. By grouping is handled by a parameter on the table: groupby. The brute-force way of adding grouping to a table is to set it manually. The code below sets the groupby parameter to 'Origin'.
In [82]:
tbl.set_param(groupby='Origin')
tbl
Out[82]:
We can then run the same summary action call as before, and we will get the results grouped by Origin.
In [83]:
output = tbl[['MSRP','MPG_CITY','MPG_Highway']].summary()
output
Out[83]:
You'll notice that we get each By group in a separate result key. This is done to make the output as flexible as possible. When dealing with large sets of data, you could end up with very large result sets coming back from CAS; the By groups are split so that they can be handled piecemeal rather than as one big table. We won't go into detail on constructing action calls that handle results individually, but you can go to the SWAT documentation for more information.
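A Pandas analogy of handling By groups piecemeal is iterating over a GroupBy object one group at a time rather than materializing everything at once:

```python
import pandas as pd

df = pd.DataFrame({'Origin': ['Asia', 'USA', 'Asia'],
                   'MSRP': [36945, 23820, 26990]})

sizes = {}
for origin, group in df.groupby('Origin'):
    sizes[origin] = len(group)    # each By group processed separately

print(sizes)    # {'Asia': 2, 'USA': 1}
```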
In cases like this, where the data is fairly small, you can concatenate all of the By group tables using the concat_bygroups method on the CASResults object. This gives you results similar to the non-By-group code, but adds an index to the DataFrame that contains the By values.
In [84]:
output.concat_bygroups()
Out[84]:
Because SAS uses formatted values to do By grouping and Pandas typically works with raw values, there is a method called reshape_bygroups on the SASDataFrame object that allows you to change the representation of By groups in your DataFrames. You can indicate whether you want the By values (raw or formatted) as columns, as an index, or not at all. You can also specify a suffix to use when column-name collisions occur, for actions that can use the same column multiple times in the output.
When we started this section, we mentioned that using the set_param method was the brute-force way of setting By groupings. You can also use the Pandas-style way of setting groupings: the groupby method.

Using it returns a CASTable-like object with the groupby parameter set. All action calls on the new object behave just like a CASTable running the same action.
In [85]:
tbl.del_param('groupby')
tbl
Out[85]:
In [86]:
output = tbl[['MSRP','MPG_City']].groupby('Origin').summary()
output
Out[86]:
SQL stands for Structured Query Language. According to ANSI (the American National Standards Institute), it is the standard language for relational database management systems. In this section we illustrate how to use SQL to communicate with the CAS server.
In [87]:
conn.loadactionset('fedsql')
Out[87]:
In [88]:
str1 = "select make, model, msrp, mpg_highway from cars where msrp > 80000 and mpg_highway > 20"
conn.fedsql.execDirect(query=str1)
Out[88]:
In [89]:
str1 = "select origin, mean(mpg_highway), std(mpg_highway) from cars group by origin"
conn.fedsql.execDirect(query=str1)
Out[89]:
Show the SQL logic by specifying the method=True parameter.
In [90]:
str1 = "select origin, mean(mpg_highway), std(mpg_highway) from cars group by origin"
conn.fedsql.execDirect(query=str1, method=True)
Out[90]:
In [91]:
str1 = """
create table aggregated1 as
select origin, mean(mpg_highway), std(mpg_highway)
from cars group by origin
"""
conn.fedsql.execDirect(query=str1)
Out[91]:
In [92]:
conn.tableinfo()
Out[92]:
You can delete a table using the dropTable action.
In [93]:
# conn.dropTable('aggregated1')
Continue working with the table created by SQL.
In [94]:
tbl_temp = conn.CASTable('aggregated1')
In [95]:
tbl_temp.fetch()
Out[95]:
In [96]:
str1 = """
create table aggregated2 as
select origin, type, mean(mpg_highway), std(mpg_highway)
from cars
group by origin, type
"""
conn.fedsql.execDirect(query=str1)
Out[96]:
In [97]:
str1 = """
select a.origin, a.type, b.mean as Average_Origin, a.mean as Average_Type
from aggregated2 as a
left join aggregated1 as b
on b.origin = a.origin
"""
conn.fedsql.execDirect(query=str1)
Out[97]:
You can submit SAS DATA step code using the runCode action! Don't forget the data set option REPLACE=YES if you want to replace an existing data set on the CAS server. By default, the new table CARS_TEMP is saved to the CASUSER library on the CAS server.
In [98]:
conn.runCode(code="""
data cars_temp(replace=yes);
set cars;
sqrt_MSRP = sqrt(MSRP);
MPG_avg = (MPG_city+MPG_highway)/2;
run;
""")
Out[98]:
In [99]:
conn.tableinfo()
Out[99]:
In [100]:
tbl_cars_temp = conn.CASTable('cars_temp')
tbl_cars_temp[['sqrt_MSRP','MPG_avg']].summary()
Out[100]:
You need to load a modeling action set before you use it. Let's load the action sets for regression and decision tree models.
In [101]:
conn.loadactionset('regression')
conn.loadactionset('decisiontree')
tbl = conn.CASTable('cars')
The regression action set contains three models: linear regression, logistic regression, and generalized linear models.
In [102]:
conn.help(actionset='regression')
Out[102]:
Let us build a linear regression model to predict the MSRP values of the automobiles using the CARS data set.
In [103]:
output1 = tbl.glm(
target = 'MSRP',
nominals = ['Type', 'Origin'],
inputs = ['Type', 'Origin', 'MPG_City', 'Length', 'Weight']
)
In [104]:
output1
Out[104]:
You can get the results in an ODS-like style too.
In [105]:
from swat.render import render_html
render_html(output1)
Next let us build a decision tree model to predict MSRP.
In [106]:
output1 = tbl.dtreetrain(
target = 'MSRP',
nominals = ['Type', 'Origin'],
inputs = ['Type', 'Origin', 'MPG_City', 'Length', 'Weight'],
casout = conn.CASTable("treeModel"),
varImp = True
)
In [107]:
output1
Out[107]:
The treeModel CASTable stores the decision tree model. Each row contains the information for an individual split in the decision tree.
In [108]:
conn.CASTable("treeModel").fetch()
Out[108]:
Last but not least, let us visualize the importance of the predictors using Cufflinks.
In [109]:
output1['DTreeVarImpInfo'].sort_values(by='Importance').iplot(kind='barh',x='Variable',y='Importance')