Import SWAT Package



In [1]:

    
from swat import *

Connect to CAS Server and Load CAS Action Sets



In [2]:

    
s = CAS('host_name', 5570)

s.sessionprop.setsessopt(caslib='yourcaslib')
s.loadactionset('deepLearn')
s.loadactionset('castmine')
s.loadactionset('fedsql')









    



NOTE: 'yourcaslib' is now the active caslib.
NOTE: Added action set 'deepLearn'.
NOTE: Added action set 'castmine'.
NOTE: Added action set 'fedsql'.






    Out[2]:




§ actionset

fedsql


elapsed 0.0633s · user 0.32s · sys 0.255s · mem 38.5MB

Load Data Sets - Training Data, Validation Data, and Test Data



In [3]:

    
train = s.loadtable('yelp_review_train.sashdat', casout={'replace': True})['casTable']
val   = s.loadtable('yelp_review_val.sashdat',   casout={'replace': True})['casTable']
test  = s.loadtable('yelp_review_test.sashdat',  casout={'replace': True})['casTable']









    



NOTE: Cloud Analytic Services made the HDFS file yelp_review_train.sashdat available as table YELP_REVIEW_TRAIN in caslib yourcaslib.
NOTE: Cloud Analytic Services made the HDFS file yelp_review_val.sashdat available as table YELP_REVIEW_VAL in caslib yourcaslib.
NOTE: Cloud Analytic Services made the HDFS file yelp_review_test.sashdat available as table YELP_REVIEW_TEST in caslib yourcaslib.

What's in the Table



In [4]:

    
s.fetch(train, to=5)









    Out[4]:




§ Fetch


Selected Rows from Table YELP_REVIEW_TRAIN
  
    
      
	review	sentiment
0	I love Marilo!  She understands my hair type a...	positive
1	I had lunch here today. I love the owner, he i...	positive
2	All baristas are not created equal. The crew a...	positive
3	Service okay. I always receive bad service fro...	negative
4	There is nothing like riding your cruiser to L...	positive
    
  




elapsed 0.105s · user 0.685s · sys 0.47s · mem 280MB

Load Word Encoding Files



In [5]:

    
# GloVe: Global Vectors for Word Representation. GloVe is an unsupervised learning algorithm for obtaining vector 
# representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, 
# and the resulting representations showcase interesting linear substructures of the word vector space.


s.upload(r'..\folder_on_your_local_machine\glove_100d_tab_clean.txt', 
         casout=dict(name='glove', replace=True),
         importoptions=dict(fileType='delimited', delimiter='\t'))
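
The upload above treats the file as tab-delimited, with one term per row followed by its vector components (the resulting GLOVE table has 101 columns, which is consistent with a term column plus 100 dimensions). The following is a minimal sketch of that layout and of how vector similarity can be measured. The three rows and their 3-d values are made up for illustration, and `parse_row` and `cosine` are hypothetical helper names, not part of SWAT.

```python
import math

# Toy rows in the assumed file layout: term, then tab-separated components.
# The 3-d values are invented for illustration; real rows carry 100 values.
toy_rows = [
    "good\t0.9\t0.1\t0.0",
    "great\t0.8\t0.2\t0.1",
    "awful\t-0.7\t0.3\t0.0",
]

def parse_row(line):
    """Split one tab-delimited row into (term, vector)."""
    term, *values = line.rstrip("\n").split("\t")
    return term, [float(v) for v in values]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

embeddings = dict(parse_row(row) for row in toy_rows)
sim_good_great = cosine(embeddings["good"], embeddings["great"])
sim_good_awful = cosine(embeddings["good"], embeddings["awful"])

# With these toy values, "good" is closer to "great" than to "awful".
print(sim_good_great > sim_good_awful)
```

In the notebook itself the embedding lookup happens on the CAS server; this sketch only illustrates the file layout and the idea of vector similarity.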









    



NOTE: Cloud Analytic Services made the uploaded file available as table GLOVE in caslib yourcaslib.
NOTE: The table GLOVE has been created in caslib yourcaslib from binary data uploaded to Cloud Analytic Services.






    Out[5]:




§ caslib

yourcaslib

§ tableName

GLOVE

§ casTable

CASTable('GLOVE', caslib='yourcaslib')


elapsed 12s · user 41.4s · sys 15s · mem 1.09e+04MB

Building a Gated Recurrent Unit Model Architecture



In [6]:

    
# Sentiment classification
# This example uses a GRU model, as specified by the option "rnnType". You can specify the other layer types, "LSTM" and "RNN", instead.
# Some layers specify reverse=True, which makes the GRU bidirectional. Specifically, layers rnn11 and rnn21
# run in the reverse direction, which means the model scans the sentence from the end to the beginning, while rnn12 and rnn22
# run in the usual forward direction. Therefore, the state of a neuron is affected not only by the preceding words but also by
# the words that follow it.

n=64
init='msra'

s.buildmodel(model=dict(name='sentiment', replace=True), type='RNN')
s.addlayer(model='sentiment', name='data', layer=dict(type='input'))

s.addlayer(model='sentiment', name='rnn11', srclayers=['data'],
           layer=dict(type='recurrent',n=n,init=init,rnnType='GRU',outputType='samelength', 
                      reverse=True))
s.addlayer(model='sentiment', name='rnn12', srclayers=['data'],
           layer=dict(type='recurrent',n=n,init=init,rnnType='GRU',outputType='samelength', 
                      reverse=False))

s.addlayer(model='sentiment', name='rnn21', srclayers=['rnn11', 'rnn12'],
           layer=dict(type='recurrent',n=n,init=init,rnnType='GRU',outputType='samelength', 
                      reverse=True))
s.addlayer(model='sentiment', name='rnn22', srclayers=['rnn11', 'rnn12'],
           layer=dict(type='recurrent',n=n,init=init,rnnType='GRU',outputType='samelength', 
                      reverse=False))

s.addlayer(model='sentiment', name='rnn3', srclayers=['rnn21', 'rnn22'],
           layer=dict(type='recurrent',n=n,init=init,rnnType='GRU',outputType='encoding'))
         
s.addlayer(model='sentiment', name='outlayer', srclayers=['rnn3'],
           layer=dict(type='output'))
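
To make the effect of `reverse=True` and of the two `outputType` values concrete, here is a minimal pure-Python sketch of a scalar GRU (hidden size 1) in the common textbook formulation. It is not the CAS implementation: the weights in `params` are arbitrary, `gru_step` and `run_layer` are hypothetical names, and re-aligning the reversed states with the original token positions is an assumption of this sketch.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, p):
    """One step of a scalar GRU in the textbook formulation."""
    z = sigmoid(p["wz"] * x + p["uz"] * h + p["bz"])   # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h + p["br"])   # reset gate
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h) + p["bh"])
    return (1.0 - z) * h + z * h_cand

def run_layer(xs, p, reverse=False, output_type="samelength"):
    """Scan the sequence; return every state or only the final one."""
    order = list(reversed(xs)) if reverse else list(xs)
    h, states = 0.0, []
    for x in order:
        h = gru_step(x, h, p)
        states.append(h)
    if output_type == "encoding":
        return h  # a single summary state for the whole sequence
    # Sketch assumption: re-align reversed states with the original positions.
    return list(reversed(states)) if reverse else states

params = dict(wz=0.5, uz=0.5, bz=0.0, wr=0.5, ur=0.5, br=0.0,
              wh=1.0, uh=1.0, bh=0.0)
seq_a = [0.2, -0.4, 0.9]
seq_b = [0.2, -0.4, -0.9]  # differs from seq_a only in the last token

fwd_a, fwd_b = run_layer(seq_a, params), run_layer(seq_b, params)
bwd_a = run_layer(seq_a, params, reverse=True)
bwd_b = run_layer(seq_b, params, reverse=True)

print(fwd_a[0] == fwd_b[0])  # forward state at position 0 never sees the last token
print(bwd_a[0] != bwd_b[0])  # reverse state at position 0 depends on the last token
print(run_layer(seq_a, params, output_type="encoding") == fwd_a[-1])
```

Feeding both directions into the next layer, as the `srclayers=['rnn11', 'rnn12']` arguments do, gives each position access to context on both sides.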









    Out[6]:




§ OutputCasTables



  
    
      
	casLib	Name	Rows	Columns	casTable
0	yourcaslib	sentiment	102	5	CASTable('sentiment', caslib='yourcaslib')
    
  




elapsed 0.392s · user 8.65s · sys 13.5s · mem 887MB



In [7]:

    
s.tableinfo()









    Out[7]:




§ TableInfo



  
    
      
	Name	Rows	Columns	IndexedColumns	Encoding	CreateTimeFormatted	ModTimeFormatted	AccessTimeFormatted	JavaCharSet	CreateTime	...	Global	Repeated	View	SourceName	SourceCaslib	Compressed	Creator	Modifier	SourceModTimeFormatted	SourceModTime
0	YELP_REVIEW_TRAIN	179892	2	0	wlatin1	2018-05-18T22:55:52-04:00	2018-05-18T22:55:52-04:00	2018-05-18T22:56:22-04:00	Cp1252	1.842318e+09	...	0	0	0	yelp_review_train.sashdat	yourcaslib	0	yourusername		2018-05-11T10:42:31-04:00	1.841669e+09
1	YELP_REVIEW_VAL	22437	2	0	wlatin1	2018-05-18T22:56:02-04:00	2018-05-18T22:56:02-04:00	2018-05-18T22:56:02-04:00	Cp1252	1.842318e+09	...	0	0	0	yelp_review_val.sashdat	yourcaslib	0	yourusername		2018-05-11T10:42:42-04:00	1.841669e+09
2	YELP_REVIEW_TEST	22643	2	0	wlatin1	2018-05-18T22:56:12-04:00	2018-05-18T22:56:12-04:00	2018-05-18T22:56:12-04:00	Cp1252	1.842318e+09	...	0	0	0	yelp_review_test.sashdat	HPS	0	yourusername		2018-05-11T10:43:00-04:00	1.841669e+09
3	GLOVE	399857	101	0	utf-8	2018-05-18T22:56:37-04:00	2018-05-18T22:56:37-04:00	2018-05-18T22:56:37-04:00	UTF8	1.842318e+09	...	0	0	0			0	yourusername		2018-05-18T22:56:36-04:00	1.842318e+09
4	SENTIMENT	102	5	0	utf-8	2018-05-18T22:56:46-04:00	2018-05-18T22:56:46-04:00	2018-05-18T22:56:46-04:00	UTF8	1.842318e+09	...	0	0	0			0	yourusername			NaN
  

5 rows × 22 columns



elapsed 0.0519s · user 0.217s · sys 0.664s · mem 45.8MB

Training the Model



In [7]:

    
s.dlTrain(table=train, model='sentiment', validtable=val,
            modelWeights=dict(name='sentiment_trainedWeights', replace=True),
            textParms=dict(initEmbeddings='glove', hasInputTermIds=False, embeddingTrainable=False),
            target='sentiment', 
            inputs=['review'], 
            texts=['review'], 
            nominals=['sentiment'],
            optimizer=dict(miniBatchSize=4, maxEpochs=20, 
                           algorithm=dict(method='adam', beta1=0.9, beta2=0.999, gamma=0.5, 
                                          learningRate=0.0005, clipGradMax=100, clipGradMin=-100, 
                                          stepSize=20, lrPolicy='step')
                          ),
            seed=12345
         )
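
The optimizer settings above combine `lrPolicy='step'` with `stepSize=20`, `gamma=0.5`, and `maxEpochs=20`. Assuming the common step-decay convention (multiply the learning rate by `gamma` once every `stepSize` epochs), the rate never drops during this 20-epoch run. The sketch below illustrates that arithmetic; `step_learning_rate` is a hypothetical helper, and the exact formula that CAS uses is an assumption here.

```python
def step_learning_rate(base_lr, gamma, step_size, epoch):
    """Step decay: multiply the base rate by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)

# Epochs 0..19, as in a run with maxEpochs=20.
schedule = [step_learning_rate(0.0005, 0.5, 20, epoch) for epoch in range(20)]
print(set(schedule))                             # {0.0005}

# The first decay would only happen at epoch 20, which this run never reaches.
print(step_learning_rate(0.0005, 0.5, 20, 20))   # 0.00025
```

To see the decay take effect within 20 epochs, `stepSize` would have to be smaller than `maxEpochs`.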









    Out[7]:




§ ModelInfo



  
    
      
	Descr	Value
0	Model Name	sentiment
1	Model Type	Recurrent Neural Network
2	Number of Layers	7
3	Number of Input Layers	1
4	Number of Output Layers	1
5	Number of Convolutional Layers	0
6	Number of Pooling Layers	0
7	Number of Fully Connected Layers	0
8	Number of Recurrent Layers	5
9	Number of Weight Parameters	173696
10	Number of Bias Parameters	962
11	Total Number of Model Parameters	174658
12	Approximate Memory Cost for Training (MB)	424
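
The parameter counts reported in ModelInfo can be reproduced by hand. The sketch below assumes a standard GRU layout (three gates, each with input weights, recurrent weights, and one bias vector), 100-dimensional input embeddings (inferred from the GloVe file name), and two target classes (positive and negative). `gru_params` is a hypothetical helper, not a SWAT function.

```python
def gru_params(n_inputs, n_hidden):
    """Weights and biases of one GRU layer with three gates."""
    weights = 3 * (n_inputs * n_hidden + n_hidden * n_hidden)
    biases = 3 * n_hidden
    return weights, biases

n = 64            # hidden size, from the notebook
embed_dim = 100   # assumption: 100-d GloVe vectors
n_classes = 2     # assumption: positive / negative

# Input width of each recurrent layer. Layers fed by two 64-unit layers
# read their concatenation, so they see 2 * n inputs.
layer_inputs = {
    "rnn11": embed_dim, "rnn12": embed_dim,
    "rnn21": 2 * n, "rnn22": 2 * n,
    "rnn3": 2 * n,
}

weights = sum(gru_params(width, n)[0] for width in layer_inputs.values())
biases = sum(gru_params(width, n)[1] for width in layer_inputs.values())

# Output layer: one weight per (hidden unit, class) pair plus one bias per class.
weights += n * n_classes
biases += n_classes

print(weights, biases, weights + biases)  # 173696 962 174658
```

The totals match the reported 173696 weights, 962 biases, and 174658 parameters, which is consistent with the stated assumptions.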
    
  



§ OptIterHistory



  
    
      
	Epoch	LearningRate	Loss	FitError	ValidLoss	ValidError
0	0.0	0.0005	0.308329	0.127810	0.189913	0.076525
1	1.0	0.0005	0.174183	0.070843	0.165256	0.066274
2	2.0	0.0005	0.154785	0.062565	0.151924	0.059723
3	3.0	0.0005	0.142475	0.057384	0.143584	0.056246
4	4.0	0.0005	0.133127	0.053577	0.140111	0.055088
5	5.0	0.0005	0.125744	0.050208	0.138658	0.054776
6	6.0	0.0005	0.120312	0.047612	0.143572	0.057093
7	7.0	0.0005	0.117012	0.045855	0.162514	0.059010
8	8.0	0.0005	0.113089	0.044332	0.152025	0.054909
9	9.0	0.0005	0.110064	0.042887	0.148457	0.050943
10	10.0	0.0005	0.106354	0.041330	0.161550	0.052191
11	11.0	0.0005	0.104263	0.040502	0.156013	0.050631
12	12.0	0.0005	0.103524	0.039991	0.186799	0.065784
13	13.0	0.0005	0.100664	0.039007	0.170191	0.058341
14	14.0	0.0005	0.097836	0.037550	0.195519	0.062887
15	15.0	0.0005	0.095586	0.036300	0.177083	0.051834
16	16.0	0.0005	0.093813	0.035510	0.334158	0.071578
17	17.0	0.0005	0.092652	0.034865	0.215818	0.052725
18	18.0	0.0005	0.089912	0.034132	0.191313	0.049294
19	19.0	0.0005	0.086826	0.032597	0.198737	0.051477
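
In the iteration history, the training loss falls at every epoch while the validation loss reaches its minimum early and then fluctuates, which suggests overfitting in the later epochs. The sketch below copies the ValidLoss and ValidError columns from the output and locates their best epochs. It is only an offline illustration of how the history could be inspected, not a step in the original notebook.

```python
# ValidLoss and ValidError columns copied from OptIterHistory (epochs 0..19).
valid_loss = [
    0.189913, 0.165256, 0.151924, 0.143584, 0.140111,
    0.138658, 0.143572, 0.162514, 0.152025, 0.148457,
    0.161550, 0.156013, 0.186799, 0.170191, 0.195519,
    0.177083, 0.334158, 0.215818, 0.191313, 0.198737,
]
valid_error = [
    0.076525, 0.066274, 0.059723, 0.056246, 0.055088,
    0.054776, 0.057093, 0.059010, 0.054909, 0.050943,
    0.052191, 0.050631, 0.065784, 0.058341, 0.062887,
    0.051834, 0.071578, 0.052725, 0.049294, 0.051477,
]

# Index of the smallest value in each column is the best epoch by that metric.
best_loss_epoch = min(range(len(valid_loss)), key=valid_loss.__getitem__)
best_error_epoch = min(range(len(valid_error)), key=valid_error.__getitem__)

print(best_loss_epoch, valid_loss[best_loss_epoch])      # 5 0.138658
print(best_error_epoch, valid_error[best_error_epoch])   # 18 0.049294
```

The two metrics disagree on the best epoch: the validation loss is lowest at epoch 5, whereas the validation error is lowest at epoch 18.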
    
  



§ OutputCasTables



  
    
      
	casLib	Name	Rows	Columns	casTable
0	yourcaslib	sentiment_trainedWeights	174658	3	CASTable('sentiment_trainedWeights', caslib='H...
    
  




elapsed 2.3e+03s · user 7.38e+04s · sys 1.09e+03s · mem 5.22e+03MB



In [8]:

    
s.save(table='sentiment_trainedWeights', caslib='casuser', 
       name='demo_review_sentiment_trainedweights.sashdat', replace=True, saveAttrs = True)









    



NOTE: Cloud Analytic Services saved the file demo_review_sentiment_trainedweights.sashdat with attributes in caslib CASUSER(yourusername).






    Out[8]:




§ caslib

CASUSER(yourusername)

§ name

demo_review_sentiment_trainedweights.sashdat


elapsed 0.176s · user 0.139s · sys 0.135s · mem 21.6MB

Scoring Test Data



In [9]:

    
s.dlScore(table=test, model='sentiment', initWeights='sentiment_trainedWeights', 
          copyVars=['review', 'sentiment'], textParms=dict(initInputEmbeddings='glove'), 
          casout=dict(name='sentiment_out', replace=True))









    Out[9]:




§ ScoreInfo



  
    
      
	Descr	Value
0	Number of Observations Read	22643
1	Number of Observations Used	22643
2	Misclassification Error (%)	5.034669
3	Loss Error	0.186817
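
The ScoreInfo figures can be turned into a count of misclassified reviews and an accuracy with a little arithmetic. This is a small worked example using only the numbers reported above; the variable names are illustrative.

```python
n_used = 22643                    # Number of Observations Used
misclassification_pct = 5.034669  # Misclassification Error (%)

# Number of test reviews whose predicted sentiment was wrong.
n_wrong = round(n_used * misclassification_pct / 100)

# Accuracy is the complement of the misclassification rate.
accuracy_pct = 100 - misclassification_pct

print(n_wrong)                  # 1140
print(round(accuracy_pct, 2))   # 94.97
```

The test error of about 5.03% is close to the validation error of the final training epoch (0.051477, or about 5.15%).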
    
  



§ OutputCasTables



  
    
      
	casLib	Name	Rows	Columns	casTable
0	yourcaslib	sentiment_out	22643	7	CASTable('sentiment_out', caslib='yourcaslib')
    
  




elapsed 29.1s · user 325s · sys 57.7s · mem 4.27e+03MB
