This notebook is an example of how you can use SAS Viya with R for analysis. In this example, we will import R packages, start a CAS session, load data from the local file system into CAS, explore the data, impute missing values, create several models in R, create several models in CAS, score a test set using our models, and assess model performance.
Further documentation on using SWAT with R can be found here:
The SAS Scripting Wrapper for Analytics Transfer (SWAT) package for R is an R interface to SAS Cloud Analytic Services (CAS), which is the centerpiece of the SAS Viya framework. With this package, you can load data into memory and apply CAS actions to transform, summarize, model, and score the data. Result tables from the actions are a superclass of data frames, enabling you to apply your existing R programming skills to further post-process CAS result tables.
First, we will load the packages we want to use.
In [1]:
# Load necessary packages
library('swat')
library('ggplot2')
library('reshape2')
library('rpart')
library('randomForest')
library('xgboost')
# Suppress CAS messages and R warnings for cleaner output
options(cas.print.messages = FALSE)
options(warn = -1)
# Reset any active output diversion (harmless if none is open)
sink()
Now we can create our connection to CAS. Please see this documentation on connecting and starting CAS Sessions for more information.
In [2]:
conn <- CAS('localhost', port=5570, caslib = 'casuser')
Next, we need to load our CAS Action Sets. Please see this documentation on running CAS Actions for more information.
In [3]:
actionsets <- c('sampling', 'fedsql', 'decisionTree', 'percentile', 'autotune', 'regression')
for (i in actionsets) {
    loadActionSet(conn, i)
}
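To confirm that the action sets loaded successfully, we can query the builtins action set, which is available in every session. This is an optional check; I am assuming the default result key, setinfo.
# List the action sets currently loaded in this session
cas.builtins.actionSetInfo(conn)$setinfo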
Finally, we can load our data from a CSV file.
In [4]:
castbl <- cas.read.csv(conn, './data/hmeq.csv')
Let us begin exploring our data. Just as with a local R data frame, we can view the first few rows of our CAS data table using the head function.
In [5]:
head(castbl)
Visual exploration helps us better understand patterns and distributions within our data. We can use ggplot to look at our data distributions.
In [6]:
# Bring data locally
df <- to.casDataFrame(castbl, obs = nrow(castbl))
# Use reshape2's melt to help with data formatting
d <- melt(df[sapply(df, is.numeric)], id.vars=NULL)
ggplot(d, aes(x = value)) +
facet_wrap(~variable,scales = 'free_x') +
geom_histogram(fill = 'blue', bins = 25)
From our plots above, we can see that most of our home equity loans are good rather than bad, so our target is imbalanced. We can also see that most of our numeric variables have a slight right skew. Let’s keep exploring our data by checking for missing values.
In [7]:
# Get the number of missing values for all variables
tbl <- cas.simple.distinct(castbl)$Distinct[,c('Column', 'NMiss')]
tbl
In [8]:
# Easy way to get missing values for numeric variables
cas.nmiss(castbl)
In [9]:
# Visualize the missing data
tbl$PctMiss <- tbl$NMiss/nrow(castbl)
ggplot(tbl, aes(Column, PctMiss)) +
geom_col(fill = 'blue') +
ggtitle('Pct Missing Values') +
theme(plot.title = element_text(hjust = 0.5))
Using both R and SWAT, we have explored our data, and now we have an idea of what to look for when we clean our data.
First, we should impute those missing values.
In [10]:
# Impute missing values
cas.dataPreprocess.impute(castbl,
methodContinuous = 'MEDIAN',
methodNominal = 'MODE',
inputs = colnames(castbl)[-1],
copyAllVars = TRUE,
casOut = list(name = 'hmeq',
replace = TRUE)
)
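As a quick optional check, we can list the columns of the new hmeq table; the imputed copies should appear with the IMP_ prefix alongside the original variables.
# Verify that the imputed IMP_ columns were created
cas.table.columnInfo(conn, table = 'hmeq')$ColumnInfo$Column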
Now, we should partition our data into training and testing sets.
In [11]:
# Partition the data
cas.sampling.srs(conn,
table = 'hmeq',
samppct = 30,
partind = TRUE,
output = list(casOut = list(name = 'hmeq', replace = T), copyVars = 'ALL')
)
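To verify the partition, we can run a quick frequency on the partition indicator using the simple action set; roughly 30% of the rows should fall into the test partition (_PartInd_ = 1). This is an optional sanity check.
# Check the distribution of the partition indicator
cas.simple.freq(conn, table = 'hmeq', inputs = '_PartInd_')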
Finally, let’s store our variable names and roles in reusable objects for our later function calls.
In [12]:
# Note: I do not want to hard-code any of my variable names.
indata <- 'hmeq'
# Get variable info and types
colinfo <- head(cas.table.columnInfo(conn, table = indata)$ColumnInfo, -1)
# My target variable is the first column
target <- colinfo$Column[1]
# For models that can inherently handle missing values (ex: Decision Tree)
inputs <- colinfo$Column[-1]
nominals <- c(target, subset(colinfo, Type == 'varchar')$Column)
# For models that can't handle missing values
imp_inputs = c("IMP_CLAGE", "IMP_CLNO", "IMP_DEBTINC", "IMP_DELINQ",
"IMP_DEROG", "IMP_LOAN", "IMP_MORTDUE", "IMP_NINQ", "IMP_VALUE",
"IMP_YOJ", "IMP_JOB", "IMP_REASON")
imp_nominals = c("IMP_JOB", "IMP_REASON")
To run the R models, the data must be taken from the in-memory CAS table and placed into an R data frame.
In [13]:
# Connect to in-memory CAS table
hmeq1 <- defCasTable(conn, tablename="HMEQ")
In [14]:
# Create CAS DataFrame from CAS table
df1 = to.casDataFrame(hmeq1)
In [15]:
# Create R DataFrame from CAS DataFrame
df1 = to.data.frame(df1)
In [16]:
# Rename the partition indicator (_PartInd_, the last column) to part
names(df1)[length(names(df1))] <- "part"
# Make dummy variables
df1$reason_debtcon <- ifelse(df1$REASON == 'DebtCon', 1, 0)
df1$job_Office <- ifelse(df1$JOB == "Office", 1, 0)
df1$job_Mgr <- ifelse(df1$JOB == 'Mgr', 1, 0)
df1$job_ProfExe <- ifelse(df1$JOB == 'ProfExe', 1, 0)
In [17]:
# Keep the target, the imputed numeric variables, the partition flag, and the dummy variables
df2 = subset(df1, select=c(BAD, IMP_CLAGE, IMP_CLNO, IMP_DEBTINC, IMP_DELINQ,
IMP_DEROG, IMP_LOAN, IMP_MORTDUE, IMP_NINQ, IMP_VALUE,
IMP_YOJ, part, reason_debtcon, job_Office, job_Mgr, job_ProfExe))
# Split into train and test data frames
train = subset(df2, part==0)
train = subset(train, select = -c(part))
test = subset(df2, part==1)
test = subset(test, select = -c(part))
# Save actual values for test to use in assessment
actual = test$BAD
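As a quick sanity check, we can confirm the sizes of the two splits before modeling.
# Row counts for the training and test splits
nrow(train)
nrow(test)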
Great, now let’s begin modeling. I am looking at three models: a logistic regression, a decision tree, and a gradient boosting model. For each model, I will use an R function to build the model and score the test data set, but I will use SAS’s assessment action to make it easy to assess all models side by side.
R Logistic Regression
In [18]:
# Build logistic regression
rlog <- glm(BAD ~ ., family="binomial", data=train)
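Optionally, we can inspect the fitted model before scoring; summary shows the estimated coefficients and their significance.
# Review the coefficients of the logistic regression
summary(rlog)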
In [19]:
# Score test data
rlog_scored <- predict(rlog, test, type="response")
# Create data frame holding predicted values and actuals
rlog_scored <- cbind(rlog_scored, actual)
rlog_scored <- as.data.frame(rlog_scored)
# Save R dataframe to CAS table
rlog_scored <- as.casTable(conn, rlog_scored, casOut='rlog_scored')
In [20]:
# Assess performance
rlog_assessed <- cas.percentile.assess(conn,
table = list(name = 'rlog_scored'),
inputs = 'rlog_scored',
response = 'actual',
event = '1')
rlog_roc <- rlog_assessed$ROCInfo
rlog_roc$Model <- "R Logistic Regression"
R Decision Tree
In [21]:
# Build decision tree
rtree <- rpart(BAD ~ ., method="class", data=train)
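Optionally, rpart’s printcp shows the complexity parameter table for the fitted tree, which is useful for judging whether it should be pruned.
# Review the complexity parameter table for the fitted tree
printcp(rtree)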
In [22]:
# Score test data
rtree_scored <- predict(rtree, test, type="prob")
# Create data frame holding predicted values and actuals
rtree_scored <- cbind(rtree_scored, actual)
rtree_scored <- as.data.frame(rtree_scored)
names(rtree_scored) <- c("p_0", "p_1", "actual")
rtree_scored <- subset(rtree_scored, select = -c(p_0))
# Save R dataframe to CAS table
rtree_scored <- as.casTable(conn, rtree_scored, casOut='rtree_scored')
In [23]:
# Assess performance
rtree_assessed <- cas.percentile.assess(conn,
table = list(name = 'rtree_scored'),
inputs = 'p_1',
response = 'actual',
event = '1')
rtree_roc <- rtree_assessed$ROCInfo
rtree_roc$Model <- "R Decision Tree"
R Gradient Boosting
In [24]:
# Prepare data for the xgboost package
# Save the train target (the test target is already stored as actual)
labels <- train$BAD
mtrain <- subset(train, select = -c(BAD))
mtest <- subset(test, select = -c(BAD))
# Convert the train and test data frames to matrices
mtrain <- as.matrix(mtrain)
mtest <- as.matrix(mtest)
# Convert the train and test matrices to xgb.DMatrix objects
xgtrain <- xgb.DMatrix(data=mtrain, label=labels)
xgtest <- xgb.DMatrix(data=mtest, label=actual)
In [27]:
# Build gradient boosting model (the labels are already stored in xgtrain)
rboost <- xgboost(data = xgtrain, objective = "binary:logistic", nrounds = 100)
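Optionally, we can examine which features drive the boosted model with xgboost’s importance function.
# Compute and display feature importance for the boosted model
imp <- xgb.importance(feature_names = colnames(mtrain), model = rboost)
head(imp)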
In [28]:
# Score test data
rboost_scored <- predict(rboost, xgtest)
# Create data frame holding predicted values and actuals
rboost_scored <- data.frame(rboost_scored, actual)
# Save R dataframe to CAS table
rboost_scored <- as.casTable(conn, rboost_scored, casOut='rboost_scored')
In [29]:
# Assess performance
rboost_assessed <- cas.percentile.assess(conn,
table = list(name = 'rboost_scored'),
inputs = 'rboost_scored',
response = 'actual',
event = '1')
rboost_roc <- rboost_assessed$ROCInfo
rboost_roc$Model <- "R Gradient Boosting"
Now we can do the same thing in CAS: build, score, and assess a logistic regression, a decision tree, and a gradient boosting model. In addition, CAS has the option to autotune the tree-based models, so I would like to include an example of that as well! Autotuning searches for the combination of hyperparameters that maximizes model accuracy.
CAS Logistic Regression
In [30]:
# Build CAS Logistic Regression
cas.regression.logistic(conn,
table = list(name = indata),
class = imp_nominals,
model = list(
depVar = target,
effects = imp_inputs),
selection=list(method="BACKWARD"),
store=list(name='log_model', replace=TRUE),
output=list(casOut=list(name='log_score',replace=TRUE),
pred='pred', resChi='reschi', into='into',
copyVars=list(target, '_PartInd_')),
partByVar= list(name = "_partind_", train = "0", validate = "1")
)
In [31]:
# Pull the scored table locally and compute the probability of the event (BAD = 1);
# by default the logistic regression models the first ordered response level (BAD = 0)
log_score1 <- defCasTable(conn, tablename="log_score")
log_score1 <- to.casDataFrame(log_score1)
log_score1 <- to.data.frame(log_score1)
log_score1$pred1 <- 1 - log_score1$pred
log_sc1 <- as.casTable(conn, log_score1, casOut='log_sc1')
In [32]:
# Assess Performance
log_assessed <- cas.percentile.assess(conn,
table = list(name = 'log_sc1', where = '_PartInd_ = 1'),
inputs = 'pred1',
response = target,
event = '1')
log_roc <- log_assessed$ROCInfo
log_roc$Model <- "CAS Logistic Regression"
CAS Decision Tree
In [33]:
# Build CAS Decision Tree
cas.decisionTree.dtreeTrain(conn,
table = list(name = indata, where = '_PartInd_ = 0'),
target = target,
inputs = inputs,
nominals = nominals,
varImp = TRUE,
casOut = list(name = 'dt_model', replace = TRUE)
)
In [34]:
# Score Test Data
cas.decisionTree.dtreeScore(conn,
modelTable=list(name='dt_model'),
table=list(name=indata),
copyVars= list(target, '_PartInd_'),
assessOneRow=TRUE,
casOut = list(name = 'dt_scored', replace = T)
)
In [35]:
# Assess Performance
dt_assessed <- cas.percentile.assess(conn,
table = list(name = 'dt_scored', where = '_PartInd_ = 1'),
inputs = '_DT_P_ 1',
response = target,
event = '1')
dt_roc <- dt_assessed$ROCInfo
dt_roc$Model <- "CAS Decision Tree"
CAS Autotuned Decision Tree
In [36]:
# Find Best Decision Tree Configuration
tune_dt <- cas.autotune.tuneDecisionTree(conn,
trainOptions=list(table = list(name = indata, where = '_PartInd_ = 0'),
target = target,
inputs = inputs,
nominals = nominals,
varImp = TRUE,
casOut = list(name = 'tune_dt_model', replace = TRUE)))
In [37]:
# Score Test Data
cas.decisionTree.dtreeScore(conn,
    modelTable = list(name = 'tune_dt_model'),
    table = list(name = indata),
    copyVars = list(target, '_PartInd_'),
    assessOneRow = TRUE,
    casOut = list(name = 'tune_dt_scored', replace = T)
)
In [38]:
# Assess Performance
tune_dt_assessed <- cas.percentile.assess(conn,
table = list(name = 'tune_dt_scored', where = '_PartInd_ = 1'),
inputs = '_DT_P_ 1',
response = target,
event = '1')
tune_dt_roc <- tune_dt_assessed$ROCInfo
tune_dt_roc$Model <- "CAS Autotuned Decision Tree"
CAS Gradient Boosting
In [39]:
# Gradient Boosting
cas.decisionTree.gbtreeTrain(conn,
table = list(name = indata, where = '_PartInd_ = 0'),
target = target,
inputs = inputs,
nominals = nominals,
casOut = list(name = 'gbt_model', replace = TRUE)
)
In [40]:
# Score Test Data
cas.decisionTree.gbtreeScore(conn,
    modelTable = list(name = 'gbt_model'),
    table = list(name = indata),
    copyVars = list(target, '_PartInd_'),
    assessOneRow = TRUE,
    casOut = list(name = 'gbt_scored', replace = T)
)
In [41]:
# Assess Performance
gbt_assessed <- cas.percentile.assess(conn,
table = list(name = 'gbt_scored', where = '_PartInd_ = 1'),
inputs = '_gbt_P_ 1',
response = target,
event = '1')
gbt_roc <- gbt_assessed$ROCInfo
gbt_roc$Model <- "CAS Gradient Boosting"
We have built our seven models; now let’s see how they compare to each other.
In [42]:
roc.df <- data.frame()
# Add R Models
roc.df <- rbind(roc.df, rlog_roc, rtree_roc, rboost_roc)
# Add CAS Models
roc.df <- rbind(roc.df, log_roc, dt_roc, tune_dt_roc, gbt_roc)
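Before drilling into specific metrics, a quick way to rank all seven models is by the C column of the assessment output, which holds the ROC C-statistic (area under the curve) and is constant across cutoffs for each model.
# Rank models by C-statistic (area under the ROC curve), best first
auc <- unique(roc.df[, c('Model', 'C')])
auc[order(-as.numeric(auc$C)), ]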
Confusion Matrix
In [43]:
# Keep each model's row at the 0.5 probability cutoff
compare <- subset(roc.df, round(roc.df$CutOff, 2) == 0.5)
rownames(compare) <- NULL
cf <- compare[,c('Model','TP','FP','FN','TN')]
cf
Misclassification
In [44]:
# Build a dataframe to compare the misclassification rates
compare$Misclassification <- 1 - compare$ACC
miss <- compare[order(compare$Misclassification), c('Model','Misclassification')]
rownames(miss) <- NULL
miss
Notice the improvement in our autotuned decision tree above. It beats both our R decision tree and our default CAS decision tree.
ROC
In [45]:
# Add a new column to be used as the ROC curve label
roc.df$Models <- paste(roc.df$Model, round(roc.df$C, 3), sep = ' - ')
# Create the ROC curve
ggplot(data = roc.df[c('FPR', 'Sensitivity', 'Models')],
aes(x = as.numeric(FPR), y = as.numeric(Sensitivity), colour = Models)) +
geom_line() +
labs(x = 'False Positive Rate', y = 'True Positive Rate')
In [46]:
# End the session
cas.session.endSession(conn)
We have gone through an example of using SAS Viya with R for analysis. We connected to our CAS server; imported, explored, and cleaned our data; built three models in R, three in CAS, and one autotuned model in CAS; and ended by examining our models’ misclassification rates and ROC curves. Ultimately, we learned how easy it is to leverage SAS and R together using SWAT, allowing programmers to use the power of SAS analytics from the language they are comfortable in.