Build models programmatically using the Python API

The SAS Python SWAT package enables you to connect to the SAS Cloud Analytic Services (CAS) engine, which is the centerpiece of the SAS Viya framework.

To use it, first download and install the SAS SWAT package from https://github.com/sassoftware/python-swat


In [1]:
# Import packages
from swat import CAS
from pprint import pprint
from swat.render import render_html
from matplotlib import pyplot as plt
import pandas as pd
import sys
%matplotlib inline

In [2]:
# Start a CAS session
cashost='<your CAS server here>'
casport=<your CAS server port here>
casauth="~/.authinfo"
sess = CAS(cashost, casport, authinfo=casauth, caslib="public")
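
The placeholders in the cell above have to be replaced by hand before it runs. One way to avoid editing the notebook, sketched here, is to read the connection settings from environment variables. The variable names `CAS_HOST` and `CAS_PORT`, the helper `cas_connection_settings`, and the example host and port values are all hypothetical, chosen for illustration; they are not defined by SWAT.

```python
import os

def cas_connection_settings(env=None):
    """Return (host, port) read from environment variables.

    CAS_HOST and CAS_PORT are hypothetical names used for this sketch.
    """
    env = os.environ if env is None else env
    host = env.get("CAS_HOST")
    port = env.get("CAS_PORT")
    if not host or not port:
        raise ValueError("Set CAS_HOST and CAS_PORT before connecting")
    return host, int(port)  # environment values are strings; convert the port

# Example with an explicit mapping instead of the real environment
cashost, casport = cas_connection_settings(
    {"CAS_HOST": "cas.example.com", "CAS_PORT": "5570"}
)
print(cashost, casport)
```

The resulting values can then be passed to `CAS(cashost, casport, authinfo=casauth, caslib="public")` as in the cell above.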

In [3]:
# Set helper variables
gcaslib="public"
prepped_data="bank_prepped"
target = {"b_tgt"}
class_inputs = {"cat_input1", "cat_input2", "demog_ho", "demog_genf"}
interval_inputs = {"im_demog_age", "im_demog_homeval", "im_demog_inc", "demog_pr", "log_rfm1", "rfm2", "log_im_rfm3", "rfm4", "rfm5", "rfm6", "rfm7", "rfm8", "rfm9", "rfm10", "rfm11", "rfm12"}
class_vars = target | class_inputs
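
The helper variables are Python sets, so `|` is a set union and the iteration order of the result is not guaranteed. The sketch below uses shortened stand-in sets (not the full variable lists) to show the union, and how sorting gives a reproducible order if one is wanted.

```python
# Stand-in role sets (shortened versions of the notebook's helper variables)
target = {"b_tgt"}
class_inputs = {"cat_input1", "cat_input2"}
interval_inputs = {"rfm4", "rfm5"}

# Set union: every variable that should be treated as categorical
class_vars = target | class_inputs

# All model inputs, in a deterministic (alphabetical) order
all_inputs = sorted(class_inputs | interval_inputs)

print(sorted(class_vars))
print(all_inputs)
```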

Train and score a stepwise logistic regression model using the data prepared in SAS Studio


In [4]:
# Load action set
sess.loadactionset(actionset="regression")

# Train Logistic Regression
lr=sess.regression.logistic(
  table={"name":prepped_data, "caslib":gcaslib},
  classVars=[{"vars":class_vars}],
  model={
    "depVars":[{"name":"b_tgt", "options":{"event":"1"}}],
    "effects":[{"vars":class_inputs | interval_inputs}]
  },
  partByVar={"name":"_partind_", "train":"1", "valid":"0"},
  selection={"method":"STEPWISE"},
  output={"casOut":{"name":"_scored_logistic", "replace":True}, "copyVars":{"account", "b_tgt", "_partind_"}}
)

# Output model statistics
render_html(lr)

# Compute p_b_tgt0 and p_b_tgt1 for assessment
sess.dataStep.runCode(
  code="data _scored_logistic; set _scored_logistic; p_b_tgt0=1-_pred_; rename _pred_=p_b_tgt1; run;"
)


NOTE: Added action set 'regression'.
NOTE: Convergence criterion (GCONV=1E-8) satisfied.
Model Information
RowId        Description             Value
DATA         Data Source             BANK_PREPPED
RESPONSEVAR  Response Variable       b_tgt
DIST         Distribution            Binary
LINK         Link Function           Logit
TECH         Optimization Technique  Newton-Raphson with Ridging
Number of Observations
RowId  Description                  Total   Training  Validation
NREAD  Number of Observations Read  485452  339405    146047
NUSED  Number of Observations Used  479001  334928    144073
Response Profile
Ordered Value  b_tgt  b_tgt  Total Frequency  Training  Validation  Probability Modeled
1              0      0      382846           267721    115125
2              1      1      96155            67207     28948       *
Class Level Information
Class       Levels  Values
demog_genf  2       0 1
cat_input2  5       A B C D E
cat_input1  3       X Y Z
demog_ho    2       0 1
Selection Information
Description                Value     Numeric Value
Selection Method           Stepwise  nan
Select Criterion           SBC       nan
Stop Criterion             SBC       nan
Effect Hierarchy Enforced  None      nan
Stop Horizon               3         3
Convergence Status
Reason                                         Status  Max Gradient
Convergence criterion (GCONV=1E-8) satisfied.  0       1.6972957E-9
Selection Summary
Control  Step  Effect Entered    Effect Removed  Number Of Effects  SBC           Optimal SBC
         0     Intercept                         1                  335823.22468  0
-        1     rfm5                              2                  290049.1829   0
         2     IM_demog_homeval                  3                  274382.34771  0
         3     LOG_RFM1                          4                  257507.89668  0
         4     rfm9                              5                  240126.58238  0
         5     rfm12                             6                  238716.80867  0
         6     cat_input1                        7                  237979.37684  0
         7     cat_input2                        8                  237499.27748  0
         8     rfm4                              9                  237290.67201  1
Stop Reason
Reason                                                                                               Code
Stepwise selection stopped because adding or removing an effect does not improve the SBC criterion.  8
Selection Reason
Reason
The model at step 8 is selected.
Selected Effects
Label              Effects
Selected Effects:  Intercept cat_input1 rfm9 LOG_RFM1 rfm12 cat_input2 rfm4 IM_demog_homeval rfm5
Dimensions
RowId        Description                 Value
NDESIGNCOLS  Columns in Design           15
NEFFECTS     Number of Effects           9
MAXEFCOLS    Max Effect Columns          5
DESIGNRANK   Rank of Design              13
OPTPARM      Parameters in Optimization  13
Likelihood Ratio Test
Test              DF  Chi-Square    Pr > ChiSq
Likelihood Ratio  12  98498.027651  0
Fit Statistics
RowId     Description                Training      Validation
M2LL      -2 Log Likelihood          237312.47536  102563.03002
AIC       AIC (smaller is better)    237338.47536  102589.03002
AICC      AICC (smaller is better)   237338.47644  102589.03255
SBC       SBC (smaller is better)    237477.85708  102717.445
ASE       Average Square Error       0.1104935231  0.1112464707
M2LLNULL  -2 Log L (Intercept-only)  335810.50301  144558.04897
RSQUARE   R-Square                   0.2547884714  0.2528462634
ADJRSQ    Max-rescaled R-Square      0.4024530966  0.3992160167
MCFADDEN  McFadden's R-Square        0.293314315   0.2905062655
MISCLASS  Misclassification Rate     0.1556065781  0.1568788045
DIFFMEAN  Difference of Means        0.3123048582  0.3089769919
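
The information criteria in the Fit Statistics table can be checked by hand. The sketch below assumes the textbook definitions AIC = -2 log L + 2p and SBC = -2 log L + p ln(n), with p = 13 taken from "Parameters in Optimization" in the Dimensions table and n the number of training observations used. McFadden's R-Square is taken as 1 minus the ratio of the fitted to the intercept-only -2 log L. These formulas are an assumption about how the action arrives at its numbers, although they do reproduce the reported training values.

```python
import math

m2ll = 237312.47536        # -2 Log Likelihood (training)
m2ll_null = 335810.50301   # -2 Log L, intercept-only (training)
n_params = 13              # Parameters in Optimization
n_obs = 334928             # training observations used

aic = m2ll + 2 * n_params                  # penalty of 2 per parameter
sbc = m2ll + n_params * math.log(n_obs)    # penalty of ln(n) per parameter
mcfadden = 1 - m2ll / m2ll_null            # likelihood-ratio based R-square

print(f"AIC      = {aic:.5f}")
print(f"SBC      = {sbc:.5f}")
print(f"McFadden = {mcfadden:.6f}")
```

Because ln(334928) is about 12.7, SBC penalizes each extra parameter far more heavily than AIC does, which is why SBC-driven stepwise selection tends to stop at smaller models.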
Parameter Estimates
Effect            cat_input1  cat_input2  Parameter         Parameter         DF  Estimate      Standard Error  Chi-Square    Pr > ChiSq
Intercept                                 Intercept         Intercept         1   2.9403074038  0.0475833267    3818.3512118  0
cat_input1        X                       cat_input1 X      cat_input1_X      1   0.6492219402  0.0239913822    732.27774713  2.86364E-161
cat_input1        Y                       cat_input1 Y      cat_input1_Y      1   0.5982330859  0.0298239181    402.35690791  1.689932E-89
cat_input1        Z                       cat_input1 Z      cat_input1_Z      0   0             nan             nan           nan
rfm9                                      rfm9              rfm9              1   -0.164985033  0.0013854125    14181.784788  0
LOG_RFM1                                  LOG_RFM1          LOG_RFM1          1   -1.666737455  0.0127879211    16987.69648   0
rfm12                                     rfm12             rfm12             1   0.0041828148  0.00015299      747.50045893  1.40247E-164
cat_input2                    A           cat_input2 A      cat_input2_A      1   0.2881918545  0.0174243394    273.55884525  1.902361E-61
cat_input2                    B           cat_input2 B      cat_input2_B      1   0.3163101691  0.016016355     390.03083017  8.150148E-87
cat_input2                    C           cat_input2 C      cat_input2_C      1   0.2128486231  0.0168797636    159.00433644  1.867209E-36
cat_input2                    D           cat_input2 D      cat_input2_D      1   0.1514813087  0.018757207     65.220144329  6.698132E-16
cat_input2                    E           cat_input2 E      cat_input2_E      0   0             nan             nan           nan
rfm4                                      rfm4              rfm4              1   0.0010625895  0.000100708     111.32773761  5.015454E-26
IM_demog_homeval                          IM_demog_homeval  IM_demog_homeval  1   7.0216498E-6  6.2597403E-8    12582.4639    0
rfm5                                      rfm5              rfm5              1   0.2493438809  0.0028788698    7501.5909655  0
Task Timing
RowId           Task                  Time          Relative Time
SETUP           Setup and Parsing     0.0098059177  0.0020223687
LEVELIZATION    Levelization          0.1778228283  0.0366741124
INITIALIZATION  Model Initialization  0.0009608269  0.0001981606
SSCP            SSCP Computation      0.1652541161  0.0340819459
FITTING         Model Selection       4.2662189007  0.8798633607
OUTPUT          Creating Output Data  0.2051000595  0.0422997581
CLEANUP         Cleanup               0.0080440044  0.0016589924
TOTAL           Total                 4.8487288952  1
Output CAS Tables
CAS Library      Name              Label  Number of Rows  Number of Columns  Table
CASUSER(ramyne)  _scored_logistic         485452          4                  CASTable('_scored_logistic', caslib='CASUSER(ramyne)')
NOTE: Missing values were generated as a result of performing an operation on missing values.
      Each place is given by: (Number of times) at (Line):(Column).
      154 at 0:55
      121 at 0:55
      85 at 0:55
      90 at 0:55
      255 at 0:55
      110 at 0:55
      178 at 0:55
      109 at 0:55
      440 at 0:55
      694 at 0:55
      665 at 0:55
      718 at 0:55
      657 at 0:55
      794 at 0:55
      673 at 0:55
      708 at 0:55
NOTE: Duplicate messages output by DATA step:
NOTE: Missing values were generated as a result of performing an operation on missing values.  (occurred 16 times)
      Each place is given by: (Number of times) at (Line):(Column).  (occurred 16 times)
Out[4]:
§ InputCasTables
casLib Name Rows Columns casTable
0 CASUSER(ramyne) _scored_logistic 485452 4 CASTable('_scored_logistic', caslib='CASUSER(r...
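
The DATA step above derives the non-event probability as `1 - _pred_`, and the log notes show that rows with a missing `_pred_` produce a missing `p_b_tgt0`. The same logic is sketched below on a small, made-up pandas DataFrame (the values are illustrative, not taken from the scored table), with NaN standing in for the SAS missing value.

```python
import numpy as np
import pandas as pd

# Made-up scored rows; one prediction is missing
scored = pd.DataFrame({"account": [1, 2, 3], "_pred_": [0.8, np.nan, 0.25]})

# Non-event probability is the complement of the event probability;
# a missing _pred_ propagates to a missing p_b_tgt0
scored["p_b_tgt0"] = 1 - scored["_pred_"]

# Rename the event probability to the name used for assessment
scored = scored.rename(columns={"_pred_": "p_b_tgt1"})

print(scored)
```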

Load the GBM model created in SAS Visual Analytics and score using this model


In [5]:
# 1. Load GBM model (ASTORE) created in VA
sess.loadTable(
  caslib="models", path="Gradient_Boosting_VA.sashdat", 
  casout={"name":"gbm_astore_model","caslib":"casuser", "replace":True}
)

# 2. Score code from VA (for data preparation)
sess.dataStep.runCode(
  code="""data bank_part_post; 
            set bank_part(caslib='public'); 
            _va_calculated_54_1=round('b_tgt'n,1.0);
            _va_calculated_54_2=round('demog_genf'n,1.0);
            _va_calculated_54_3=round('demog_ho'n,1.0);
            _va_calculated_54_4=round('_PartInd_'n,1.0);
          run;"""
)

# 3. Score using ASTORE
sess.loadactionset(actionset="astore")

sess.astore.score(
  table={"name":"bank_part_post"},
  rstore={"name":"gbm_astore_model"},
  out={"name":"_scored_gbm", "replace":True},
  copyVars={"account", "_partind_", "b_tgt"}
)

# 4. Rename the ASTORE probability columns to p_b_tgt0 and p_b_tgt1 for assessment
sess.dataStep.runCode(
  code="""data _scored_gbm; 
            set _scored_gbm; 
            rename p__va_calculated_54_10=p_b_tgt0
                   p__va_calculated_54_11=p_b_tgt1;
          run;"""
)


NOTE: Cloud Analytic Services made the file Gradient_Boosting_VA.sashdat available as table GBM_ASTORE_MODEL in caslib CASUSER(ramyne).
NOTE: Added action set 'astore'.
Out[5]:
§ InputCasTables
casLib Name Rows Columns casTable
0 CASUSER(ramyne) _scored_gbm 485452 7 CASTable('_scored_gbm', caslib='CASUSER(ramyne)')

Load the Forest model created in SAS Studio and score using this model


In [6]:
# Load action set 
sess.loadactionset(actionset="decisionTree")

# Score using forest_model table
sess.decisionTree.forestScore(
  table={"name":prepped_data, "caslib":gcaslib},
  modelTable={"name":"forest_model", "caslib":"public"},
  casOut={"name":"_scored_rf", "replace":True},
  copyVars={"account", "b_tgt", "_partind_"},
  vote="PROB"
)

# Create p_b_tgt0 and p_b_tgt1: _rf_predp_ is the probability of the level named in _rf_predname_
sess.dataStep.runCode(
  code="""data _scored_rf; 
            set _scored_rf; 
            if _rf_predname_=1 then do; 
              p_b_tgt1=_rf_predp_; 
              p_b_tgt0=1-p_b_tgt1; 
            end; 
            if _rf_predname_=0 then do; 
              p_b_tgt0=_rf_predp_; 
              p_b_tgt1=1-p_b_tgt0; 
            end; 
          run;"""
)


NOTE: Added action set 'decisionTree'.
NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column).
      0:61    0:192
NOTE: Duplicate messages output by DATA step:
NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column).  (occurred 16 times)
      0:61    0:192  (occurred 16 times)
Out[6]:
§ InputCasTables
casLib Name Rows Columns casTable
0 CASUSER(ramyne) _scored_rf 485452 8 CASTable('_scored_rf', caslib='CASUSER(ramyne)')
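
`_rf_predp_` is the probability of whichever level appears in `_rf_predname_`, so the event probability has to be unfolded from the pair. The log above also reports a character-to-numeric conversion, which suggests that `_rf_predname_` is stored as text. A pandas sketch of the same unfolding follows, on made-up rows and comparing against the stripped string rather than a number; the padded value in the third row is an assumption added to show why stripping helps.

```python
import numpy as np
import pandas as pd

# Made-up forest scores: predicted level (as text) and its probability
scored = pd.DataFrame({
    "_rf_predname_": ["1", "0", "           1"],
    "_rf_predp_": [0.9, 0.7, 0.6],
})

# True where the predicted level is the event "1"
is_event = scored["_rf_predname_"].str.strip() == "1"

# Event probability: _rf_predp_ when the event was predicted, else its complement
scored["p_b_tgt1"] = np.where(is_event, scored["_rf_predp_"], 1 - scored["_rf_predp_"])
scored["p_b_tgt0"] = 1 - scored["p_b_tgt1"]

print(scored[["p_b_tgt0", "p_b_tgt1"]])
```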

Load the SVM model created in SAS Studio and score using this model


In [7]:
# Score using ASTORE
sess.loadactionset(actionset="astore")

sess.astore.score(
  table={"name":prepped_data, "caslib":gcaslib},
  rstore={"name":"svm_astore_model", "caslib":"public"},
  out={"name":"_scored_svm", "replace":True},
  copyVars={"account", "_partind_", "b_tgt"}
)


NOTE: Added action set 'astore'.
Out[7]:
§ Timing
Task Timing
Task Seconds Percent
0 Loading the Store 0.000010 0.000010
1 Creating the State 0.005296 0.005052
2 Scoring 1.042918 0.994932
3 Total 1.048230 1.000000

elapsed 1.07s · user 5.24s · sys 1.88s · mem 32.4MB

Assess the models from SAS Visual Analytics and SAS Studio together with the new model created through the Python interface


In [8]:
# Assess models
def assess_model(prefix):
    return sess.percentile.assess(
      table={
        "name":"_scored_" + prefix, 
        "where": "strip(put(_partind_, best.))='0'"
      },
      inputs=[{"name":"p_b_tgt1"}],      
      response="b_tgt",
      event="1",
      pVar={"p_b_tgt0"},
      pEvent={"0"}      
    )

lrAssess=assess_model(prefix="logistic")    
lr_fitstat =lrAssess.FitStat
lr_rocinfo =lrAssess.ROCInfo
lr_liftinfo=lrAssess.LIFTInfo

rfAssess=assess_model(prefix="rf")    
rf_fitstat =rfAssess.FitStat
rf_rocinfo =rfAssess.ROCInfo
rf_liftinfo=rfAssess.LIFTInfo

gbmAssess=assess_model(prefix="gbm")    
gbm_fitstat =gbmAssess.FitStat
gbm_rocinfo =gbmAssess.ROCInfo
gbm_liftinfo=gbmAssess.LIFTInfo

svmAssess=assess_model(prefix="svm")    
svm_fitstat =svmAssess.FitStat
svm_rocinfo =svmAssess.ROCInfo
svm_liftinfo=svmAssess.LIFTInfo

In [9]:
# Add new variable to indicate type of model
lr_liftinfo["model"]="Logistic (Python API)"
lr_rocinfo["model"]='Logistic (Python API)'
rf_liftinfo["model"]="Autotuned Forest (SAS Studio)"
rf_rocinfo["model"]="Autotuned Forest (SAS Studio)"
gbm_liftinfo["model"]="Gradient Boosting (SAS VA)"
gbm_rocinfo["model"]="Gradient Boosting (SAS VA)"
svm_liftinfo["model"]="SVM (SAS Studio)"
svm_rocinfo["model"]="SVM (SAS Studio)"

# Append data
# DataFrame.append was removed in pandas 2.0, so concatenate instead
all_liftinfo=pd.concat(
    [lr_liftinfo, rf_liftinfo, gbm_liftinfo, svm_liftinfo], ignore_index=True
)
all_rocinfo=pd.concat(
    [lr_rocinfo, rf_rocinfo, gbm_rocinfo, svm_rocinfo], ignore_index=True
)
    
print("AUC (using validation data)".center(80, '-'))
all_rocinfo[["model", "C"]].drop_duplicates(keep="first").sort_values(by="C", ascending=False)


--------------------------AUC (using validation data)---------------------------
Out[9]:
model C
100 Autotuned Forest (SAS Studio) 0.947950
300 SVM (SAS Studio) 0.871695
0 Logistic (Python API) 0.860070
200 Gradient Boosting (SAS VA) 0.855693
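
The `C` column is the area under the ROC curve. As a rough illustration of what that number measures, the sketch below builds ROC points from made-up labels and scores by sweeping a cutoff, then integrates with the trapezoidal rule. It is a simplified stand-in, not the algorithm used by the `percentile.assess` action, and the helper names are hypothetical.

```python
import numpy as np

def roc_points(y_true, p_event, cutoffs):
    """Return (FPR, sensitivity) arrays, one point per cutoff."""
    y_true = np.asarray(y_true)
    p_event = np.asarray(p_event)
    n_pos = (y_true == 1).sum()
    n_neg = (y_true == 0).sum()
    fpr, tpr = [], []
    for c in cutoffs:
        predicted = p_event >= c                      # classify as event at this cutoff
        tpr.append((predicted & (y_true == 1)).sum() / n_pos)
        fpr.append((predicted & (y_true == 0)).sum() / n_neg)
    return np.array(fpr), np.array(tpr)

def auc_trapezoid(fpr, tpr):
    """Area under the curve after sorting the points by FPR, then sensitivity."""
    order = np.lexsort((tpr, fpr))
    x, y = fpr[order], tpr[order]
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2))

# Made-up validation labels and event probabilities
y = [1, 1, 0, 1, 0, 0]
p = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

fpr, tpr = roc_points(y, p, cutoffs=np.linspace(0, 1.01, 102))
auc = auc_trapezoid(fpr, tpr)
print(round(auc, 4))
```

For this toy data, 8 of the 9 event/non-event pairs are ranked correctly, so the area comes out at 8/9.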

Draw Assessment Plots


In [10]:
# Draw ROC charts  
plt.figure()
for key, grp in all_rocinfo.groupby("model"):
    plt.plot(grp["FPR"], grp["Sensitivity"], label=key)
plt.plot([0,1], [0,1], "k--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.grid(True)
plt.legend(loc="best")
plt.title("ROC Curve (using validation data)")
plt.show()

# Draw lift charts 
plt.figure()
for key, grp in all_liftinfo.groupby("model"):
    plt.plot(grp["Depth"], grp["Lift"], label=key)
plt.xlabel("Depth")
plt.ylabel("Lift")
plt.grid(True)
plt.legend(loc="best")
plt.title("Lift Chart (using validation data)")
plt.show()
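
Lift at a given depth is the event rate among the top-scored fraction of observations divided by the overall event rate. The sketch below computes it for made-up data; it illustrates the definition only and makes no claim about how `percentile.assess` bins observations. The helper name `lift_at_depth` is hypothetical.

```python
import numpy as np

def lift_at_depth(y_true, p_event, depth):
    """Lift for the top `depth` fraction (0 < depth <= 1) of scored rows."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(p_event))        # highest score first
    n_top = max(1, int(round(depth * len(y_true))))
    top_rate = y_true[order][:n_top].mean()         # event rate in the top slice
    base_rate = y_true.mean()                       # overall event rate
    return top_rate / base_rate

# Made-up labels and event probabilities
y = [1, 1, 0, 1, 0, 0, 0, 0, 1, 0]
p = [0.95, 0.9, 0.85, 0.8, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]

for depth in (0.2, 0.5, 1.0):
    print(depth, lift_at_depth(y, p, depth))
```

At full depth the lift is always 1, since the top slice is the whole data set; a useful model shows lift well above 1 at small depths.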



In [ ]:
# Close the CAS session
# sess.close()