Hands-On Workshop: Python and CAS Integration on SAS Viya

Exercise 1 - A Gentle Introduction to Pandas

Reading Data into DataFrames

Pandas is a Python package providing flexible and expressive data structures designed to make working with “relational” or “labeled” data straightforward. It is one of many ways for doing practical, real world data analysis.

The primary data structure in the Pandas package is the DataFrame. A DataFrame is a table-like object that contains named columns of data. Pandas provides readers for many different file types including CSV, Excel, SAS, HTML, etc.

Since CSV is such a common format, let's look at reading a CSV file into a Pandas DataFrame.

First, we need to import the Pandas package. A very common convention is to import Pandas using a shortcut named 'pd'. This is done with the following line of code.


In [ ]:
import pandas as pd

The function that reads CSV files into DataFrames is called read_csv. In the simplest form, you supply it with a filename or URL. The cars data set is stored as a CSV in github. We'll reference the URL, and use the head() method to print the first several observations.


In [ ]:
df=pd.read_csv('https://raw.githubusercontent.com/'
               'sassoftware/sas-viya-programming/master/data/cars.csv')

df.head(10)

Displaying Information about DataFrames

The data types of the column names can be displayed using the dtypes property.


In [ ]:
df.dtypes

Simple Statistics

One way to calculate simple statistics is to calculate multiple simple statistics for specific columns.


In [ ]:
df[['MSRP','Horsepower']].describe()

Another way is to calculate one simple statistic across all columns in a DataFrame...


In [ ]:
df.mean()

Subsetting DataFrames

You can do various types of data manipulation on DataFrames. One example is to create a new DataFrame containing only selected columns of an existing DataFrame.


In [ ]:
subdf=df[['Make','Model','Horsepower']]

subdf.head(15)

Indexing DataFrames

A useful way of indexing Pandas DataFrames is using row labels. We can set a column as a row index using the set_index method. Then we can use those labels for easy data selection.


In [ ]:
df=df.set_index('Model')

df.head()

Suppose I want a new DataFrame containing only models with an MSRP greater than 80,000.


In [ ]:
exp=df['MSRP'] > 80000

exp.head(15)

In [ ]:
df80=df[exp]

df80.head()

Plotting

There are several packages for creating plots in Python. One of those packages is matplotlib. Using matplotlib, we can create a scatter plot of MSRP values from the DataFrame we created.


In [ ]:
%matplotlib inline

In [ ]:
df.plot(kind='scatter', x='MSRP', y='Horsepower', figsize=(12,6))

Exercise 2 - Connecting to CAS

SWAT (SAS Wrapper for Analytics Transfer) is the Python module used to interface with SAS' Cloud Analytics Services (CAS). This module includes classes and functions to call actions on the server, and process the results on the client. This workshop will cover the basics of connecting to CAS from Python, calling CAS actions, interacting with CAS tables, and using the results of the CAS actions on the client.

The SWAT module in Python is merely a CAS client, it does not start the server. So the first thing you need to do is get access to a running CAS server. To connect to the server, we need the hostname and port number that the server is running at as well as a way to authenticate with the server.

There are various ways of authenticating with CAS, so you will need to check with your system administrator as to the proper method to use. The simplest form of authentication is username and password.

To connect to CAS, we will use the CAS class in the SWAT package. The most basic form of creating a connection is to supply the CAS host, port, username, and password to the CAS class.


In [ ]:
import swat 

conn = swat.CAS("server", 8777, "student", "Metadata0", protocol="http")

Tip: The object "conn" created above is referred to as a CAS Connection object.

The session name is simply a human-readable tag given to a session so it is easier to identify in actions that display session information. The session ID is a unique identifier given to each session. The session ID can be used to reconnect to a session, or connect from another client.


In [ ]:
conn

Exercise 3 - Running CAS Actions

Now that a CAS connection has been established, we can try running some CAS actions.

CAS actions are called on CAS objects just like Python methods. In the example below, we will use the serverstatus action to display some basic status information about the CAS server such as platform and release information, the number of nodes, and information about each node in the grid. That information will be stored in an object called out.

The result of every CAS action is a CASResults object. CASResults objects are simply a subclass of Python's ordered dictionary, so anything that you can do with Python dictionaries can also be done with CASResults objects.


In [ ]:
out = conn.serverstatus()

out

Using CAS' help Function

There are many actions that are available in all CAS installations, and various others may be available depending on which products have been licensed. To display all available actions, you use the help action.


In [ ]:
conn.help()

You can ask for help on specific actions and action sets.


In [ ]:
conn.help(actionset='simple')

In [ ]:
conn.help(action='correlation')

Exercise 4 - Loading Data into CAS

This example loads client-side data into CAS. The most "Pythonic" way of getting data into CAS is to use the Pandas-style data readers on the CAS connection object. When we refer to client-side data, we mean data that is located on the machine that Python is running on. Server-side data is data that is on the machine that CAS is running on. These may or may not be the same machine.

We can use Pandas' read_csv function to load data a CSV file into CAS.

Instead of returning a DataFrame, SWAT returns a CASTable object. As we'll see, you can treat a CASTable object much like a DataFrame from a programming standpoint.


In [ ]:
tbl = conn.read_csv('c:/users/student/Documents/cars.csv',"cars_cas")

tbl

CASTable object vs. DataFrame

A CASTable object is a Python object interface to a CAS table. You can call CAS actions on it just like a CAS connection object. It also supports much of the Pandas DataFrame API. This means that you can treat it like a DataFrame, but all of the operations will be performed on the CAS server using CAS actions.

Below is an example of requesting column information from a CASTable object.


In [ ]:
conn.columninfo(table=tbl)

Below is another example of printing the first ten observations from the CASTable object tbl


In [ ]:
conn.fetch(table=tbl, to=10)

Subsetting, Indexing, Plotting, and Calculating Simple Statistics for CASTable Objects

From a syntax perspective, we are treating the CASTable object the same as the DataFrame from earlier.


In [ ]:
expCAS=tbl['MSRP'] > 80000

expCAS.head()

Creating a new CASTable object containing only Models with an MSRP greater than $80,000.


In [ ]:
tbl80=tbl[expCAS]

tbl80.head()

Calculating simple statistics for the CASTable object.


In [ ]:
tbl[['MSRP','Horsepower']].describe()

In [ ]:
tbl.mean()

Exercise 5 - Loading and Using Action Sets

The syntax below loads one of the many Action sets identified above. You can load as many or as few Action sets as you wish using the CASConnection object and the loadactionset() function.


In [ ]:
conn.loadactionset('simple')

Now that the 'simple' Action set has been loaded, we can use the 'correlation' action within the 'simple' Action set to calculate a Pearson correlation matrix.


In [ ]:
tbl.correlation(simple=False)

Tip: Managing Your CAS Tables

How many CAS Tables have been created?


In [ ]:
conn.tableinfo()

How do you delete a CAS table?


In [ ]:
# conn.dropTable('<INSERT TABLE HERE>')

Running DATA Step Code

In this section, we will show how to use DATA Step code to communicate with the CAS Server.

The first step requires loading the dataStep Action set.


In [ ]:
conn.loadactionset('dataStep')

Create a new CAS Table using DATA Step Functionality

Next, using the 'runcode' action within the 'dataStep' action set, a new table called CARS_TEMP is written to CAS from the original table read in through the CSV file.


In [ ]:
out= conn.datastep.runcode(code='''
        data cars_temp(caslib='casuser');
          set cars_cas(caslib='casuser');
          if MSRP > 80000 then Category='Very Expensive'; 
          else Category='Less Expensive';
          keep Model MSRP Category EngineSize;
        run;
   ''')

out

Creating an object called 'new' to reference the CARS_TEMP table on the CAS server.


In [ ]:
new = out.OutputCasTables.loc[0, 'casTable']

new.head()

Now the CAS server has two tables loaded onto it. This can be verified using the tableinfo() function and the CASConnection object.


In [ ]:
conn.tableinfo()

Running SQL Code

SQL stands for Structured Query Language, and is commonly used when working with relational database management systems. In this section, we will illustrate how to use SQL to communicate to the CAS server.

The first step is to load the FedSQL actionset.


In [ ]:
conn.loadactionset('fedsql')

Tip: ExecDirect is a CAS action in the FedSQL action set to execute SQL statements in the CAS environment.


In [ ]:
conn.fedsql.execDirect(query='''
    select model, 
           msrp,
           case when msrp > 80000 then 'Very Expensive' else 'Less Expensive' end as CategorySQL
    from cars_temp
    where EngineSize > 5
    order by msrp desc
''')

Create a new CAS Table Using SQL

Use the CREATE TABLE syntax in SQL.


In [ ]:
conn.fedsql.execDirect(query='''
    create table cars_temp_SQL as 
    select *
    from cars_temp
    where EngineSize>5
    ''')

In [ ]:
conn.tableinfo()

Exercise 6 - Building Analytical Models

Regression Model

The first step is to load a modeling action set before using it. Let's load action sets for regression and decision tree models.


In [ ]:
conn.loadactionset('regression')

In [ ]:
conn.help(actionset='regression')

In [ ]:
glm_out=tbl.glm(
    target = 'MSRP',
    nominals = ['Type', 'Origin'],
    inputs = ['Type', 'Origin', 'MPG_City', 'Length', 'Weight']
)

glm_out

ODS-like Output Style


In [ ]:
from swat.render import render_html
render_html(glm_out)

Random Forest from the decisiontree Action Set.

The decisiontree action set is another popular analytic action set. It provides three distinct tree-based models: decision tree, random forests, and gradient boosting. Unlike the regression action set, the decisiontree action set splits a model into different actions, each represents a typical step of a machine learning process such as training, scoring and score code generation (as SAS DATA step score code).


In [ ]:
conn.loadactionset('decisiontree')

You can list all of the available actions (for decision trees, random forests, and gradient boosting tree models)


In [ ]:
conn.decisiontree?

The models in the decisiontree action set support either continuous, binary or multilevel response variable. Let us fit a random forest model to predict whether a vehicle is from Asia, Europe, or United States.

The forestModel1 object is the results table.


In [ ]:
forest1 = tbl.Foresttrain()
forest1.target = 'Origin'
forest1.inputs = ['MPG_City','MPG_Highway','Type','Weight','Length','Cylinders']
forest1.nominals = ['Type','Cylinders']
forest1.casout = conn.CASTable('forestModel1', replace=True)
forest1()

Random forest models are commonly used in variable selection. Selection is usually determined by the variable importance of the predictors in training the forest model. In the foresttrain action, this importance measure is defined as the total Gini reduction from all of the splits that use this predictor. You can request variable importance using the varimp option and generate the variable importance plot using the Matplotlib package.


In [ ]:
forest1.varimp = True
result = forest1()
dfVarImp = result['DTreeVarImpInfo']
import matplotlib.pyplot as plt
import numpy as np
y_pos = np.arange(len(dfVarImp['Importance']))
plt.barh(y_pos, dfVarImp['Importance'], align='center')
plt.yticks(y_pos, dfVarImp['Variable'])
plt.xlabel('Variable Importance')
plt.show()

To score the training data or the holdout data using the forest model, you can use the forestscore action.


In [ ]:
scored_data = conn.CASTable('scored_output', replace=True)
tbl.forestscore(modeltable=conn.CASTable('forestModel1'),casout=scored_data)
scored_data.head()

In [ ]: