This notebook is an example of how you can use SAS Viya with R for analysis. In this example, we will import R packages, start a CAS session, load data from the local file system into CAS, explore the data, impute missing values, create several models in R, create several models in CAS, score a test set using our models, and assess model performance.
Further documentation on using SWAT with R can be found here:
The SAS Scripting Wrapper for Analytics Transfer (SWAT) package for R is an R interface to SAS Cloud Analytic Services (CAS), which is the centerpiece of the SAS Viya framework. With this package, you can load data into memory and apply CAS actions to transform, summarize, model, and score the data. Result tables from the actions are a superclass of data frames, enabling you to apply your existing R programming skills to further post-process CAS result tables.
First, we will load the packages we want to use.
In [1]:
# Load necessary packages
library('swat')
library('ggplot2')
library('reshape2')
library('rpart')
library('randomForest')
library('xgboost')
# Suppress CAS messages and R warnings for cleaner output
options(cas.print.messages = FALSE)
options(warn = -1)
# Reset any active output diversion (harmless if none is open)
sink()
Now we can create our connection to CAS. Please see this documentation on connecting and starting CAS Sessions for more information.
In [2]:
conn <- CAS('localhost', port=5570, caslib = 'casuser')
Next, we need to load our CAS Action Sets. Please see this documentation on running CAS Actions for more information.
In [3]:
actionsets <- c('sampling', 'fedsql', 'decisionTree', 'percentile', 'autotune', 'regression')
for (i in actionsets) {
    loadActionSet(conn, i)
}
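To confirm that the action sets loaded successfully, we can query the builtins action set, which is available in every session. This is an optional check; I am assuming the default result key, setinfo.
# List the action sets currently loaded in this session
cas.builtins.actionSetInfo(conn)$setinfo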
Finally, we can load our data from a CSV file.
In [4]:
castbl <- cas.read.csv(conn, './data/hmeq.csv')
Let us begin exploring our data. Just as with a local R data frame, we can view the first few rows of our CAS data table using the head function.
In [5]:
head(castbl)
Visual exploration helps us better understand patterns and distributions within our data. We can use ggplot to look at our data distributions.
In [6]:
# Bring data locally
df <- to.casDataFrame(castbl, obs = nrow(castbl))
# Use reshape2's melt to help with data formatting
d <- melt(df[sapply(df, is.numeric)], id.vars=NULL)
ggplot(d, aes(x = value)) +
facet_wrap(~variable,scales = 'free_x') +
geom_histogram(fill = 'blue', bins = 25)
From our plots above, we can see that most of our home equity loans are good rather than bad, so our target is imbalanced. We can also see that most of our numeric variables have a slight right skew. Let’s keep exploring our data by checking for missing values.
In [7]:
# Get the number of missing values for all variables
tbl <- cas.simple.distinct(castbl)$Distinct[,c('Column', 'NMiss')]
tbl
In [8]:
# Easy way to get missing values for numeric variables
cas.nmiss(castbl)
In [9]:
# Visualize the missing data
tbl$PctMiss <- tbl$NMiss/nrow(castbl)
ggplot(tbl, aes(Column, PctMiss)) +
geom_col(fill = 'blue') +
ggtitle('Pct Missing Values') +
theme(plot.title = element_text(hjust = 0.5))
Using both R and SWAT, we have explored our data, and now we have an idea of what to look for when we clean our data.
First, we should impute those missing values.
In [10]:
# Impute missing values
cas.dataPreprocess.impute(castbl,
methodContinuous = 'MEDIAN',
methodNominal = 'MODE',
inputs = colnames(castbl)[-1],
copyAllVars = TRUE,
casOut = list(name = 'hmeq',
replace = TRUE)
)
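As a quick optional check, we can list the columns of the new hmeq table; the imputed copies should appear with the IMP_ prefix alongside the original variables.
# Verify that the imputed IMP_ columns were created
cas.table.columnInfo(conn, table = 'hmeq')$ColumnInfo$Column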
Now, we should partition our data into training and testing sets.
In [11]:
# Partition the data
cas.sampling.srs(conn,
table = 'hmeq',
samppct = 30,
partind = TRUE,
output = list(casOut = list(name = 'hmeq', replace = T), copyVars = 'ALL')
)
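To verify the partition, we can run a quick frequency on the partition indicator using the simple action set; roughly 30% of the rows should fall into the test partition (_PartInd_ = 1). This is an optional sanity check.
# Check the distribution of the partition indicator
cas.simple.freq(conn, table = 'hmeq', inputs = '_PartInd_')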
Finally, let’s store our variable names and roles in reusable objects for our later function calls.
In [12]:
# Note: I do not want to hard-code any of my variable names.
indata <- 'hmeq'
# Get variable info and types
colinfo <- head(cas.table.columnInfo(conn, table = indata)$ColumnInfo, -1)
# My target variable is the first column
target <- colinfo$Column[1]
# For models that can inherently handle missing values (ex: Decision Tree)
inputs <- colinfo$Column[-1]
nominals <- c(target, subset(colinfo, Type == 'varchar')$Column)
# For models that can't handle missing values
imp_inputs = c("IMP_CLAGE", "IMP_CLNO", "IMP_DEBTINC", "IMP_DELINQ",
"IMP_DEROG", "IMP_LOAN", "IMP_MORTDUE", "IMP_NINQ", "IMP_VALUE",
"IMP_YOJ", "IMP_JOB", "IMP_REASON")
imp_nominals = c("IMP_JOB", "IMP_REASON")
To run the R models, the data must be taken from the in-memory CAS table and placed into an R data frame.
In [13]:
# Connect to in-memory CAS table
hmeq1 <- defCasTable(conn, tablename="HMEQ")
In [14]:
# Create CAS DataFrame from CAS table
df1 = to.casDataFrame(hmeq1)
In [15]:
# Create R DataFrame from CAS DataFrame
df1 = to.data.frame(df1)
In [16]:
# Rename the partition indicator (_PartInd_, the last column) to part
names(df1)[length(names(df1))] <- "part"
# Make dummy variables
df1$reason_debtcon <- ifelse(df1$REASON == 'DebtCon', 1, 0)
df1$job_Office <- ifelse(df1$JOB == "Office", 1, 0)
df1$job_Mgr <- ifelse(df1$JOB == 'Mgr', 1, 0)
df1$job_ProfExe <- ifelse(df1$JOB == 'ProfExe', 1, 0)
In [17]:
# Keep the target, the imputed numeric variables, the partition flag, and the dummy variables
df2 = subset(df1, select=c(BAD, IMP_CLAGE, IMP_CLNO, IMP_DEBTINC, IMP_DELINQ,
IMP_DEROG, IMP_LOAN, IMP_MORTDUE, IMP_NINQ, IMP_VALUE,
IMP_YOJ, part, reason_debtcon, job_Office, job_Mgr, job_ProfExe))
# Split into train and test data frames
train = subset(df2, part==0)
train = subset(train, select = -c(part))
test = subset(df2, part==1)
test = subset(test, select = -c(part))
# Save actual values for test to use in assessment
actual = test$BAD
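As a quick sanity check, we can confirm the sizes of the two splits before modeling.
# Row counts for the training and test splits
nrow(train)
nrow(test)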
Great, now let’s begin modeling. I am looking at three models: a logistic regression, a decision tree, and a gradient boosting model. For each model, I will use an R function to build the model and score the test data set, but I will use SAS’s assessment action to make it easy to assess all models side by side.
R Logistic Regression
In [18]:
# Build logistic regression
rlog <- glm(BAD ~ ., family="binomial", data=train)
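Optionally, we can inspect the fitted model before scoring; summary shows the estimated coefficients and their significance.
# Review the coefficients of the logistic regression
summary(rlog)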
In [19]:
# Score test data
rlog_scored <- predict(rlog, test, type="response")
# Create data frame holding predicted values and actuals
rlog_scored <- cbind(rlog_scored, actual)
rlog_scored <- as.data.frame(rlog_scored)
# Save R dataframe to CAS table
rlog_scored <- as.casTable(conn, rlog_scored, casOut='rlog_scored')
In [20]:
# Assess performance
rlog_assessed <- cas.percentile.assess(conn,
table = list(name = 'rlog_scored'),
inputs = 'rlog_scored',
response = 'actual',
event = '1')
rlog_roc <- rlog_assessed$ROCInfo
rlog_roc$Model <- "R Logistic Regression"
R Decision Tree
In [21]:
# Build decision tree
rtree <- rpart(BAD ~ ., method="class", data=train)
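Optionally, rpart’s printcp shows the complexity parameter table for the fitted tree, which is useful for judging whether it should be pruned.
# Review the complexity parameter table for the fitted tree
printcp(rtree)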
In [22]:
# Score test data
rtree_scored <- predict(rtree, test, type="prob")
# Create data frame holding predicted values and actuals
rtree_scored <- cbind(rtree_scored, actual)
rtree_scored <- as.data.frame(rtree_scored)
names(rtree_scored) <- c("p_0", "p_1", "actual")
rtree_scored <- subset(rtree_scored, select = -c(p_0))
# Save R dataframe to CAS table
rtree_scored <- as.casTable(conn, rtree_scored, casOut='rtree_scored')
In [23]:
# Assess performance
rtree_assessed <- cas.percentile.assess(conn,
table = list(name = 'rtree_scored'),
inputs = 'p_1',
response = 'actual',
event = '1')
rtree_roc <- rtree_assessed$ROCInfo
rtree_roc$Model <- "R Decision Tree"
R Gradient Boosting
In [24]:
# Prepare data for the xgboost package
# Save the train target (the test target is already stored as actual)
labels <- train$BAD
mtrain <- subset(train, select = -c(BAD))
mtest <- subset(test, select = -c(BAD))
# Convert the train and test data frames to matrices
mtrain <- as.matrix(mtrain)
mtest <- as.matrix(mtest)
# Convert the train and test matrices to xgb.DMatrix objects
xgtrain <- xgb.DMatrix(data=mtrain, label=labels)
xgtest <- xgb.DMatrix(data=mtest, label=actual)
In [27]:
# Build gradient boosting model (the labels are already stored in xgtrain)
rboost <- xgboost(data = xgtrain, objective = "binary:logistic", nrounds = 100)
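Optionally, we can examine which features drive the boosted model with xgboost’s importance function.
# Compute and display feature importance for the boosted model
imp <- xgb.importance(feature_names = colnames(mtrain), model = rboost)
head(imp)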
In [28]:
# Score test data
rboost_scored <- predict(rboost, xgtest)
# Create data frame holding predicted values and actuals
rboost_scored <- data.frame(rboost_scored, actual)
# Save R dataframe to CAS table
rboost_scored <- as.casTable(conn, rboost_scored, casOut='rboost_scored')
In [29]:
# Assess performance
rboost_assessed <- cas.percentile.assess(conn,
table = list(name = 'rboost_scored'),
inputs = 'rboost_scored',
response = 'actual',
event = '1')
rboost_roc <- rboost_assessed$ROCInfo
rboost_roc$Model <- "R Gradient Boosting"
Now we can do the same thing in CAS: build, score, and assess a logistic regression, a decision tree, and a gradient boosting model. In addition, CAS has the option to autotune the tree-based models, so I would like to include an example of that as well! Autotuning searches for the combination of hyperparameters that maximizes model accuracy.
CAS Logistic Regression
In [30]:
# Build CAS Logistic Regression
cas.regression.logistic(conn,
table = list(name = indata),
class = imp_nominals,
model = list(
depVar = target,
effects = imp_inputs),
selection=list(method="BACKWARD"),
store=list(name='log_model', replace=TRUE),
output=list(casOut=list(name='log_score',replace=TRUE),
pred='pred', resChi='reschi', into='into',
copyVars=list(target, '_PartInd_')),
partByVar= list(name = "_partind_", train = "0", validate = "1")
)
In [31]:
# Pull the scored table locally and compute the probability of the event (BAD = 1);
# by default the logistic regression models the first ordered response level (BAD = 0)
log_score1 <- defCasTable(conn, tablename="log_score")
log_score1 <- to.casDataFrame(log_score1)
log_score1 <- to.data.frame(log_score1)
log_score1$pred1 <- 1 - log_score1$pred
log_sc1 <- as.casTable(conn, log_score1, casOut='log_sc1')
In [32]:
# Assess Performance
log_assessed <- cas.percentile.assess(conn,
table = list(name = 'log_sc1', where = '_PartInd_ = 1'),
inputs = 'pred1',
response = target,
event = '1')
log_roc <- log_assessed$ROCInfo
log_roc$Model <- "CAS Logistic Regression"
CAS Decision Tree
In [33]:
# Build CAS Decision Tree
cas.decisionTree.dtreeTrain(conn,
table = list(name = indata, where = '_PartInd_ = 0'),
target = target,
inputs = inputs,
nominals = nominals,
varImp = TRUE,
casOut = list(name = 'dt_model', replace = TRUE)
)
In [34]:
# Score Test Data
cas.decisionTree.dtreeScore(conn,
modelTable=list(name='dt_model'),
table=list(name=indata),
copyVars= list(target, '_PartInd_'),
assessOneRow=TRUE,
casOut = list(name = 'dt_scored', replace = T)
)
In [35]:
# Assess Performance
dt_assessed <- cas.percentile.assess(conn,
table = list(name = 'dt_scored', where = '_PartInd_ = 1'),
inputs = '_DT_P_ 1',
response = target,
event = '1')
dt_roc <- dt_assessed$ROCInfo
dt_roc$Model <- "CAS Decision Tree"
CAS Autotuned Decision Tree
In [36]:
# Find Best Decision Tree Configuration
tune_dt <- cas.autotune.tuneDecisionTree(conn,
trainOptions=list(table = list(name = indata, where = '_PartInd_ = 0'),
target = target,
inputs = inputs,
nominals = nominals,
varImp = TRUE,
casOut = list(name = 'tune_dt_model', replace = TRUE)))
In [37]:
# Score Test Data
cas.decisionTree.dtreeScore(conn,
    modelTable = list(name = 'tune_dt_model'),
    table = list(name = indata),
    copyVars = list(target, '_PartInd_'),
    assessOneRow = TRUE,
    casOut = list(name = 'tune_dt_scored', replace = T)
)
In [38]:
# Assess Performance
tune_dt_assessed <- cas.percentile.assess(conn,
table = list(name = 'tune_dt_scored', where = '_PartInd_ = 1'),
inputs = '_DT_P_ 1',
response = target,
event = '1')
tune_dt_roc <- tune_dt_assessed$ROCInfo
tune_dt_roc$Model <- "CAS Autotuned Decision Tree"
CAS Gradient Boosting
In [39]:
# Gradient Boosting
cas.decisionTree.gbtreeTrain(conn,
table = list(name = indata, where = '_PartInd_ = 0'),
target = target,
inputs = inputs,
nominals = nominals,
casOut = list(name = 'gbt_model', replace = TRUE)
)
In [40]:
# Score Test Data
cas.decisionTree.gbtreeScore(conn,
    modelTable = list(name = 'gbt_model'),
    table = list(name = indata),
    copyVars = list(target, '_PartInd_'),
    assessOneRow = TRUE,
    casOut = list(name = 'gbt_scored', replace = T)
)
In [41]:
# Assess Performance
gbt_assessed <- cas.percentile.assess(conn,
table = list(name = 'gbt_scored', where = '_PartInd_ = 1'),
inputs = '_gbt_P_ 1',
response = target,
event = '1')
gbt_roc <- gbt_assessed$ROCInfo
gbt_roc$Model <- "CAS Gradient Boosting"
We have built our seven models; now let’s see how they compare to each other.
In [42]:
roc.df <- data.frame()
# Add R Models
roc.df <- rbind(roc.df, rlog_roc, rtree_roc, rboost_roc)
# Add CAS Models
roc.df <- rbind(roc.df, log_roc, dt_roc, tune_dt_roc, gbt_roc)
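Before drilling into specific metrics, a quick way to rank all seven models is by the C column of the assessment output, which holds the ROC C-statistic (area under the curve) and is constant across cutoffs for each model.
# Rank models by C-statistic (area under the ROC curve), best first
auc <- unique(roc.df[, c('Model', 'C')])
auc[order(-as.numeric(auc$C)), ]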
Confusion Matrix
In [43]:
# Keep each model's row at the 0.5 probability cutoff
compare <- subset(roc.df, round(roc.df$CutOff, 2) == 0.5)
rownames(compare) <- NULL
cf <- compare[,c('Model','TP','FP','FN','TN')]
cf
Misclassification
In [44]:
# Build a dataframe to compare the misclassification rates
compare$Misclassification <- 1 - compare$ACC
miss <- compare[order(compare$Misclassification), c('Model','Misclassification')]
rownames(miss) <- NULL
miss
Notice the improvement in our autotuned decision tree above. It beats both our R decision tree and our default CAS decision tree.
ROC
In [45]:
# Add a new column to be used as the ROC curve label
roc.df$Models <- paste(roc.df$Model, round(roc.df$C, 3), sep = ' - ')
# Create the ROC curve
ggplot(data = roc.df[c('FPR', 'Sensitivity', 'Models')],
aes(x = as.numeric(FPR), y = as.numeric(Sensitivity), colour = Models)) +
geom_line() +
labs(x = 'False Positive Rate', y = 'True Positive Rate')
In [46]:
# End the session
cas.session.endSession(conn)
We have gone through an example of using SAS Viya with R for analysis. We connected to our CAS server; imported, explored, and cleaned our data; built three models in R, three in CAS, and one autotuned model in CAS; and ended by examining our models’ misclassification rates and ROC curves. Ultimately, we learned how easy it is to leverage SAS and R together using SWAT, allowing programmers to use the power of SAS analytics from the language they are comfortable in.