Import SWAT Package


In [1]:
from swat import CAS

Connect to CAS Server and Load CAS Action Sets


In [2]:
s = CAS('host_name', 5570)

s.sessionprop.setsessopt(caslib='yourcaslib')
s.loadactionset('deepLearn')
s.loadactionset('castmine')
s.loadactionset('fedsql')


NOTE: 'yourcaslib' is now the active caslib.
NOTE: Added action set 'deepLearn'.
NOTE: Added action set 'castmine'.
NOTE: Added action set 'fedsql'.
Out[2]:
§ actionset
fedsql

elapsed 0.0633s · user 0.32s · sys 0.255s · mem 38.5MB

Load Data Sets - Training Data, Validation Data, and Test Data


In [3]:
train = s.loadtable('yelp_review_train.sashdat', casout = {"replace" : True} )['casTable']
val   = s.loadtable('yelp_review_val.sashdat',   casout = {"replace" : True} )['casTable']
test  = s.loadtable('yelp_review_test.sashdat',  casout = {"replace" : True} )['casTable']


NOTE: Cloud Analytic Services made the HDFS file yelp_review_train.sashdat available as table YELP_REVIEW_TRAIN in caslib yourcaslib.
NOTE: Cloud Analytic Services made the HDFS file yelp_review_val.sashdat available as table YELP_REVIEW_VAL in caslib yourcaslib.
NOTE: Cloud Analytic Services made the HDFS file yelp_review_test.sashdat available as table YELP_REVIEW_TEST in caslib yourcaslib.

What's in the Table


In [4]:
s.fetch(train, to=5)


Out[4]:
§ Fetch
Selected Rows from Table YELP_REVIEW_TRAIN
review sentiment
0 I love Marilo! She understands my hair type a... positive
1 I had lunch here today. I love the owner, he i... positive
2 All baristas are not created equal. The crew a... positive
3 Service okay. I always receive bad service fro... negative
4 There is nothing like riding your cruiser to L... positive

elapsed 0.105s · user 0.685s · sys 0.47s · mem 280MB

Load Word Embedding File


In [5]:
# GloVe: Global Vectors for Word Representation. GloVe is an unsupervised learning algorithm for obtaining vector 
# representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, 
# and the resulting representations showcase interesting linear substructures of the word vector space.


s.upload(r'..\folder_on_your_local_machine\glove_100d_tab_clean.txt', 
         casout=dict(name='glove', replace=True),
         importoptions=dict(fileType='delimited', delimiter='\t'))


NOTE: Cloud Analytic Services made the uploaded file available as table GLOVE in caslib yourcaslib.
NOTE: The table GLOVE has been created in caslib yourcaslib from binary data uploaded to Cloud Analytic Services.
Out[5]:
§ caslib
yourcaslib

§ tableName
GLOVE

§ casTable
CASTable('GLOVE', caslib='yourcaslib')

elapsed 12s · user 41.4s · sys 15s · mem 1.09e+04MB
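
The uploaded GLOVE table has 101 columns, which is consistent with one term column followed by 100 vector components. Below is a minimal sketch of reading such a tab-delimited file in plain Python. It assumes that layout (the notebook does not show the file itself) and uses a made-up three-dimensional sample rather than the real file; `parse_embeddings` and the sample rows are hypothetical.

```python
import csv
import io


def parse_embeddings(handle, delimiter="\t"):
    """Read rows of the form term<TAB>v1<TAB>v2... into a dict of float lists.

    Assumption: the first field is the term and the rest are vector components.
    """
    vectors = {}
    reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        term, values = row[0], row[1:]
        vectors[term] = [float(v) for v in values]
    return vectors


# Toy stand-in for the real file: three terms with 3-d vectors.
sample = io.StringIO(
    "good\t0.1\t0.2\t0.3\n"
    "bad\t-0.1\t-0.2\t-0.3\n"
    "ok\t0.0\t0.1\t0.0\n"
)
embeddings = parse_embeddings(sample)

# Every term should carry a vector of the same length.
dims = {len(v) for v in embeddings.values()}
print(len(embeddings), dims)
```

A check like this on the real file (expecting a single dimension of 100) can catch malformed rows before the 12-second upload.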

Building a Gated Recurrent Unit Model Architecture


In [6]:
# Sentiment classification
# This example uses a GRU model, as specified by the option rnnType. You can specify "LSTM" or "RNN" instead.
# Setting reverse=True in some layers makes the GRU bidirectional. Layers rnn11 and rnn21 run in the
# reverse direction, so they scan the sentence from the end to the beginning, while rnn12 and rnn22 run
# in the usual forward direction. The state at each position is therefore affected not only by the
# preceding words but also by the words that follow.

n=64
init='msra'

s.buildmodel(model=dict(name='sentiment', replace=True), type='RNN')
s.addlayer(model='sentiment', name='data', layer=dict(type='input'))

s.addlayer(model='sentiment', name='rnn11', srclayers=['data'],
           layer=dict(type='recurrent',n=n,init=init,rnnType='GRU',outputType='samelength', 
                      reverse=True))
s.addlayer(model='sentiment', name='rnn12', srclayers=['data'],
           layer=dict(type='recurrent',n=n,init=init,rnnType='GRU',outputType='samelength', 
                      reverse=False))

s.addlayer(model='sentiment', name='rnn21', srclayers=['rnn11', 'rnn12'],
           layer=dict(type='recurrent',n=n,init=init,rnnType='GRU',outputType='samelength', 
                      reverse=True))
s.addlayer(model='sentiment', name='rnn22', srclayers=['rnn11', 'rnn12'],
           layer=dict(type='recurrent',n=n,init=init,rnnType='GRU',outputType='samelength', 
                      reverse=False))

s.addlayer(model='sentiment', name='rnn3', srclayers=['rnn21', 'rnn22'],
           layer=dict(type='recurrent',n=n,init=init,rnnType='GRU',outputType='encoding'))
         
s.addlayer(model='sentiment', name='outlayer', srclayers=['rnn3'],
           layer=dict(type='output'))


Out[6]:
§ OutputCasTables
casLib Name Rows Columns casTable
0 yourcaslib sentiment 102 5 CASTable('sentiment', caslib='yourcaslib')

elapsed 0.392s · user 8.65s · sys 13.5s · mem 887MB
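
The forward/reverse pairing in the model above can be illustrated without a CAS server. The following is a toy sketch in plain Python, assuming a textbook GRU update and assuming that a layer with two source layers sees their outputs concatenated; the weights, sizes, and helper names (`make_gru`, `gru_step`, `run_layer`, `bidirectional`) are hypothetical and are not the deepLearn implementation.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def make_gru(n_in, n_hidden, rng):
    """Random weights for one GRU layer (illustrative, not the CAS initialization)."""
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {g: (mat(n_hidden, n_in), mat(n_hidden, n_hidden), [0.0] * n_hidden)
            for g in ("z", "r", "h")}


def gru_step(p, x, h):
    """One GRU update: update gate z, reset gate r, candidate state cand."""
    def gate(name, state, act):
        w, u, b = p[name]
        return [act(wx + us + bias)
                for wx, us, bias in zip(matvec(w, x), matvec(u, state), b)]
    z = gate("z", h, sigmoid)
    r = gate("r", h, sigmoid)
    cand = gate("h", [ri * hi for ri, hi in zip(r, h)], math.tanh)
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def run_layer(p, seq, n_hidden, reverse=False):
    """Return one state per position, scanning forward or backward."""
    order = range(len(seq) - 1, -1, -1) if reverse else range(len(seq))
    h = [0.0] * n_hidden
    out = [None] * len(seq)
    for t in order:
        h = gru_step(p, seq[t], h)
        out[t] = h
    return out


rng = random.Random(0)
n_in, n_hidden = 3, 2
fwd, bwd = make_gru(n_in, n_hidden, rng), make_gru(n_in, n_hidden, rng)
sentence = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(4)]


def bidirectional(seq):
    f = run_layer(fwd, seq, n_hidden, reverse=False)
    b = run_layer(bwd, seq, n_hidden, reverse=True)
    # Concatenate both directions at each position (assumed behavior).
    return [fs + bs for fs, bs in zip(f, b)]


states = bidirectional(sentence)
# Replace only the last word and compare the states at position 0.
changed = bidirectional(sentence[:-1] + [[1.0, -1.0, 1.0]])
print("forward half unchanged:", states[0][:n_hidden] == changed[0][:n_hidden])
print("backward half unchanged:", states[0][n_hidden:] == changed[0][n_hidden:])
```

The forward half of the first position depends only on the first word, while the backward half has already scanned the rest of the sentence, which is the effect the comments in the cell describe.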


In [7]:
s.tableinfo()


Out[7]:
§ TableInfo
Name Rows Columns IndexedColumns Encoding CreateTimeFormatted ModTimeFormatted AccessTimeFormatted JavaCharSet CreateTime ... Global Repeated View SourceName SourceCaslib Compressed Creator Modifier SourceModTimeFormatted SourceModTime
0 YELP_REVIEW_TRAIN 179892 2 0 wlatin1 2018-05-18T22:55:52-04:00 2018-05-18T22:55:52-04:00 2018-05-18T22:56:22-04:00 Cp1252 1.842318e+09 ... 0 0 0 yelp_review_train.sashdat yourcaslib 0 yourusername 2018-05-11T10:42:31-04:00 1.841669e+09
1 YELP_REVIEW_VAL 22437 2 0 wlatin1 2018-05-18T22:56:02-04:00 2018-05-18T22:56:02-04:00 2018-05-18T22:56:02-04:00 Cp1252 1.842318e+09 ... 0 0 0 yelp_review_val.sashdat yourcaslib 0 yourusername 2018-05-11T10:42:42-04:00 1.841669e+09
2 YELP_REVIEW_TEST 22643 2 0 wlatin1 2018-05-18T22:56:12-04:00 2018-05-18T22:56:12-04:00 2018-05-18T22:56:12-04:00 Cp1252 1.842318e+09 ... 0 0 0 yelp_review_test.sashdat HPS 0 yourusername 2018-05-11T10:43:00-04:00 1.841669e+09
3 GLOVE 399857 101 0 utf-8 2018-05-18T22:56:37-04:00 2018-05-18T22:56:37-04:00 2018-05-18T22:56:37-04:00 UTF8 1.842318e+09 ... 0 0 0 0 yourusername 2018-05-18T22:56:36-04:00 1.842318e+09
4 SENTIMENT 102 5 0 utf-8 2018-05-18T22:56:46-04:00 2018-05-18T22:56:46-04:00 2018-05-18T22:56:46-04:00 UTF8 1.842318e+09 ... 0 0 0 0 yourusername NaN

5 rows × 22 columns

elapsed 0.0519s · user 0.217s · sys 0.664s · mem 45.8MB

Training the Model


In [7]:
s.dlTrain(table=train, model='sentiment', validtable=val,
          modelWeights=dict(name='sentiment_trainedWeights', replace=True),
          textParms=dict(initEmbeddings='glove', hasInputTermIds=False, embeddingTrainable=False),
          target='sentiment',
          inputs=['review'],
          texts=['review'],
          nominals=['sentiment'],
          optimizer=dict(miniBatchSize=4, maxEpochs=20,
                         algorithm=dict(method='adam', beta1=0.9, beta2=0.999, gamma=0.5,
                                        learningRate=0.0005, clipGradMax=100, clipGradMin=-100,
                                        stepSize=20, lrPolicy='step')),
          seed=12345)


Out[7]:
§ ModelInfo
Descr Value
0 Model Name sentiment
1 Model Type Recurrent Neural Network
2 Number of Layers 7
3 Number of Input Layers 1
4 Number of Output Layers 1
5 Number of Convolutional Layers 0
6 Number of Pooling Layers 0
7 Number of Fully Connected Layers 0
8 Number of Recurrent Layers 5
9 Number of Weight Parameters 173696
10 Number of Bias Parameters 962
11 Total Number of Model Parameters 174658
12 Approximate Memory Cost for Training (MB) 424
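
The parameter counts in ModelInfo can be reproduced by hand. The sketch below assumes a standard GRU with three weight blocks per layer (update gate, reset gate, candidate state), 100-dimensional GloVe inputs, concatenation of the two source layers feeding rnn21, rnn22, and rnn3, and a two-class output layer. None of these assumptions is stated in the output, but together they reproduce its totals.

```python
def gru_params(n_in, n_hidden):
    """Weights and biases of one GRU layer with three gate blocks."""
    weights = 3 * (n_in * n_hidden + n_hidden * n_hidden)
    biases = 3 * n_hidden
    return weights, biases


n = 64            # hidden units per recurrent layer (n=64 in the model definition)
embed_dim = 100   # assumed from the 100-d GloVe table
n_classes = 2     # assumed: positive / negative

layers = {
    "rnn11": gru_params(embed_dim, n),
    "rnn12": gru_params(embed_dim, n),
    "rnn21": gru_params(2 * n, n),   # input: rnn11 and rnn12 outputs, concatenated
    "rnn22": gru_params(2 * n, n),
    "rnn3": gru_params(2 * n, n),    # input: rnn21 and rnn22 outputs, concatenated
    "outlayer": (n * n_classes, n_classes),
}

total_weights = sum(w for w, _ in layers.values())
total_biases = sum(b for _, b in layers.values())
print(total_weights, total_biases, total_weights + total_biases)
# prints: 173696 962 174658
```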

§ OptIterHistory
Epoch LearningRate Loss FitError ValidLoss ValidError
0 0.0 0.0005 0.308329 0.127810 0.189913 0.076525
1 1.0 0.0005 0.174183 0.070843 0.165256 0.066274
2 2.0 0.0005 0.154785 0.062565 0.151924 0.059723
3 3.0 0.0005 0.142475 0.057384 0.143584 0.056246
4 4.0 0.0005 0.133127 0.053577 0.140111 0.055088
5 5.0 0.0005 0.125744 0.050208 0.138658 0.054776
6 6.0 0.0005 0.120312 0.047612 0.143572 0.057093
7 7.0 0.0005 0.117012 0.045855 0.162514 0.059010
8 8.0 0.0005 0.113089 0.044332 0.152025 0.054909
9 9.0 0.0005 0.110064 0.042887 0.148457 0.050943
10 10.0 0.0005 0.106354 0.041330 0.161550 0.052191
11 11.0 0.0005 0.104263 0.040502 0.156013 0.050631
12 12.0 0.0005 0.103524 0.039991 0.186799 0.065784
13 13.0 0.0005 0.100664 0.039007 0.170191 0.058341
14 14.0 0.0005 0.097836 0.037550 0.195519 0.062887
15 15.0 0.0005 0.095586 0.036300 0.177083 0.051834
16 16.0 0.0005 0.093813 0.035510 0.334158 0.071578
17 17.0 0.0005 0.092652 0.034865 0.215818 0.052725
18 18.0 0.0005 0.089912 0.034132 0.191313 0.049294
19 19.0 0.0005 0.086826 0.032597 0.198737 0.051477
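
Training loss falls steadily in this history, while validation loss bottoms out early and then fluctuates. A small sketch that locates the best epochs, using only the ValidLoss and ValidError columns copied from the table above:

```python
valid_loss = [
    0.189913, 0.165256, 0.151924, 0.143584, 0.140111,
    0.138658, 0.143572, 0.162514, 0.152025, 0.148457,
    0.161550, 0.156013, 0.186799, 0.170191, 0.195519,
    0.177083, 0.334158, 0.215818, 0.191313, 0.198737,
]
valid_error = [
    0.076525, 0.066274, 0.059723, 0.056246, 0.055088,
    0.054776, 0.057093, 0.059010, 0.054909, 0.050943,
    0.052191, 0.050631, 0.065784, 0.058341, 0.062887,
    0.051834, 0.071578, 0.052725, 0.049294, 0.051477,
]

# Index of the smallest value in each column; the index equals the epoch number.
best_loss_epoch = min(range(len(valid_loss)), key=valid_loss.__getitem__)
best_error_epoch = min(range(len(valid_error)), key=valid_error.__getitem__)

print(best_loss_epoch, valid_loss[best_loss_epoch])
# prints: 5 0.138658
print(best_error_epoch, valid_error[best_error_epoch])
# prints: 18 0.049294
```

Validation loss is lowest at epoch 5 and validation error is lowest at epoch 18, so which epoch counts as best depends on the metric chosen.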

§ OutputCasTables
casLib Name Rows Columns casTable
0 yourcaslib sentiment_trainedWeights 174658 3 CASTable('sentiment_trainedWeights', caslib='H...

elapsed 2.3e+03s · user 7.38e+04s · sys 1.09e+03s · mem 5.22e+03MB
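
The LearningRate column stays at 0.0005 for all 20 epochs. That is what a conventional step schedule gives with stepSize=20 and gamma=0.5. The sketch below assumes the common definition lr = learningRate * gamma ** floor(epoch / stepSize); the exact formula dlTrain uses is not shown in this notebook, and `step_lr` is a hypothetical helper.

```python
def step_lr(base_lr, gamma, step_size, epoch):
    """Step schedule (assumed form): multiply by gamma once every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


# Epochs 0..19, as in the training run above.
schedule = [step_lr(0.0005, 0.5, 20, epoch) for epoch in range(20)]
print(set(schedule))

# Under this assumption the first reduction would only happen at epoch 20.
print(step_lr(0.0005, 0.5, 20, 20))
```

Under this assumption, a smaller stepSize (or a larger maxEpochs) would be needed for the schedule to have any effect.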


In [8]:
s.save(table='sentiment_trainedWeights', caslib='casuser', 
       name='demo_review_sentiment_trainedweights.sashdat', replace=True, saveAttrs = True)


NOTE: Cloud Analytic Services saved the file demo_review_sentiment_trainedweights.sashdat with attributes in caslib CASUSER(yourusername).
Out[8]:
§ caslib
CASUSER(yourusername)

§ name
demo_review_sentiment_trainedweights.sashdat

elapsed 0.176s · user 0.139s · sys 0.135s · mem 21.6MB

Scoring Test Data


In [9]:
s.dlScore(table=test, model='sentiment', initWeights='sentiment_trainedWeights', 
          copyVars=['review', 'sentiment'], textParms=dict(initInputEmbeddings='glove'), 
          casout=dict(name='sentiment_out', replace=True))


Out[9]:
§ ScoreInfo
Descr Value
0 Number of Observations Read 22643
1 Number of Observations Used 22643
2 Misclassification Error (%) 5.034669
3 Loss Error 0.186817

§ OutputCasTables
casLib Name Rows Columns casTable
0 yourcaslib sentiment_out 22643 7 CASTable('sentiment_out', caslib='yourcaslib')

elapsed 29.1s · user 325s · sys 57.7s · mem 4.27e+03MB
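
The reported misclassification error can be converted into counts. A short arithmetic sketch using only the two numbers from ScoreInfo:

```python
n_obs = 22643            # Number of Observations Used
misclass_pct = 5.034669  # Misclassification Error (%)

# Number of test reviews assigned the wrong sentiment.
n_wrong = round(n_obs * misclass_pct / 100)
accuracy = 1 - n_wrong / n_obs

print(n_wrong, round(accuracy * 100, 2))
# prints: 1140 94.97
```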