Example: Model Titanic Survival

This example is provided to demonstrate some of the typical programming activities for working with Python in a SAS Viya environment to run actions in SAS Cloud Analytic Services. The actions that are used in the example require SAS Visual Data Mining and Machine Learning.

For more information, see http://support.sas.com/documentation/prod-p/vdmml/index.html.



In [1]:

    
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.core.display import display, HTML
%matplotlib inline



In [6]:

    
import swat
s = swat.CAS('cloud.example.com', 5570)

Get the Data



In [ ]:

    
ulresult = s.upload('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.csv')

titanic3 = ulresult.casTable

type(titanic3)



In [8]:

    
titanic3.table.columnInfo()









    Out[8]:




§ ColumnInfo



  
    
      
      Column
      ID
      Type
      RawLength
      FormattedLength
      NFL
      NFD
    
  
  
    
      0
      pclass
      1
      double
      8
      12
      0
      0
    
    
      1
      survived
      2
      double
      8
      12
      0
      0
    
    
      2
      name
      3
      varchar
      82
      82
      0
      0
    
    
      3
      sex
      4
      varchar
      6
      6
      0
      0
    
    
      4
      age
      5
      double
      8
      12
      0
      0
    
    
      5
      sibsp
      6
      double
      8
      12
      0
      0
    
    
      6
      parch
      7
      double
      8
      12
      0
      0
    
    
      7
      ticket
      8
      varchar
      18
      18
      0
      0
    
    
      8
      fare
      9
      double
      8
      12
      0
      0
    
    
      9
      cabin
      10
      varchar
      15
      15
      0
      0
    
    
      10
      embarked
      11
      varchar
      1
      1
      0
      0
    
    
      11
      boat
      12
      varchar
      7
      7
      0
      0
    
    
      12
      body
      13
      double
      8
      12
      0
      0
    
    
      13
      home.dest
      14
      varchar
      50
      50
      0
      0
    
  




elapsed 0.00299s · user 0.007s · sys 0.005s · mem 0.73MB

Add a Computed Column

Some of the columns in the data are problematic for modeling:

The name column should not have a bearing on the analysis.
The boat and body columns are proxies for the response variable, survived.
The cabin column is similar to name in that it is too specific to be generalized. A computed column, deck, is created because it is slightly more general than cabin.



In [11]:

    
# Create a computed variable.

titanic3.computedVars = ['deck']                         # 1
titanic3.computedVarsProgram = \
   "if cabin ne '' then deck = ksubstr(cabin,1,1); else deck = '';"


numeric=['pclass', 'survived', 'age', 'sibsp', 'parch', 'fare']

# Remove boat and body because they are proxies for survived.
# Remove ticket and cabin. Use the computed column, deck, instead.
char = ['sex', 'deck', 'embarked', 'home.dest']

all = numeric + char

Group By Analysis: Descriptive Statistics

The simple.summary action is used to provide some descriptive statistics. The groupBy parameter is set on the Titanic3 object so that the statistics are shown for survivors and those that did not survive.



In [12]:

    
# The numeric variable was defined earlier.
results = titanic3[numeric].groupby("survived").simple.summary()

resultColumns = ['Column', 'Min', 'Max', 'N', 'NMiss', 'Mean', 'Sum', 'Std', 'StdErr'];

display(HTML('<h3>Perished</h3>'))
display(results['ByGroup1.Summary'][resultColumns])       # 1

display(HTML('<h3>Survived</h3>'))
display(results['ByGroup2.Summary'][resultColumns])









    




Perished






    





Descriptive Statistics for TITANIC3
  
    
      
      Column
      Min
      Max
      N
      NMiss
      Mean
      Sum
      Std
      StdErr
    
    
      survived
      
      
      
      
      
      
      
      
      
    
  
  
    
      0
      pclass
      1.00
      3
      809
      0
      2.500618
      2023.0000
      0.744825
      0.026187
    
    
      0
      survived
      0.00
      0
      809
      0
      0.000000
      0.0000
      0.000000
      0.000000
    
    
      0
      age
      0.33
      74
      619
      190
      30.545363
      18907.5800
      13.922550
      0.559595
    
    
      0
      sibsp
      0.00
      8
      809
      0
      0.521632
      422.0000
      1.210449
      0.042557
    
    
      0
      parch
      0.00
      9
      809
      0
      0.328801
      266.0000
      0.912332
      0.032076
    
    
      0
      fare
      0.00
      263
      808
      1
      23.353831
      18869.8951
      34.145096
      1.201220
    
  








    




Survived






    





Descriptive Statistics for TITANIC3
  
    
      
      Column
      Min
      Max
      N
      NMiss
      Mean
      Sum
      Std
      StdErr
    
    
      survived
      
      
      
      
      
      
      
      
      
    
  
  
    
      1
      pclass
      1.00
      3.0000
      500
      0
      1.962000
      981.0000
      0.872972
      0.039040
    
    
      1
      survived
      1.00
      1.0000
      500
      0
      1.000000
      500.0000
      0.000000
      0.000000
    
    
      1
      age
      0.17
      80.0000
      427
      73
      28.918244
      12348.0900
      15.061452
      0.728875
    
    
      1
      sibsp
      0.00
      4.0000
      500
      0
      0.462000
      231.0000
      0.685197
      0.030643
    
    
      1
      parch
      0.00
      5.0000
      500
      0
      0.476000
      238.0000
      0.776292
      0.034717
    
    
      1
      fare
      0.00
      512.3292
      500
      0
      49.361184
      24680.5918
      68.648795
      3.070067

Sample the Data



In [13]:

    
s.builtins.loadActionSet('sampling')

# The sampling.stratified action does not accept the vars parameter.
# Instead, copyVars is used to select the columns to copy to the output table.
if 'vars' in titanic3.params:
    del titanic3.vars

# Temporarily set a groupBy parameter.
with titanic3:
    titanic3.groupBy={'survived'}
    titanic3.sampling.stratified(
       partInd=True,                                     # 1
       samppct=40,                                       # 2
       seed=1234,
       output={
          'casout':{'name':'titanic3part', 'replace':True},
          'copyVars':all
       }
    )

titanic3.table.dropTable()                               # 3

titanic3part = s.CASTable('titanic3part')                # 4
ci = titanic3part.columnInfo()
display(ci)









    



NOTE: Added action set 'sampling'.
NOTE: Using SEED=1234 for sampling.
NOTE: Cloud Analytic Services dropped table TITANIC3 from caslib CASUSERHDFS(mimcki).






    




§ ColumnInfo



  
    
      
      Column
      Label
      ID
      Type
      RawLength
      FormattedLength
      NFL
      NFD
    
  
  
    
      0
      pclass
      
      1
      double
      8
      12
      0
      0
    
    
      1
      survived
      
      2
      double
      8
      12
      0
      0
    
    
      2
      age
      
      3
      double
      8
      12
      0
      0
    
    
      3
      sibsp
      
      4
      double
      8
      12
      0
      0
    
    
      4
      parch
      
      5
      double
      8
      12
      0
      0
    
    
      5
      fare
      
      6
      double
      8
      12
      0
      0
    
    
      6
      sex
      
      7
      varchar
      6
      6
      0
      0
    
    
      7
      deck
      
      8
      char
      15
      15
      0
      0
    
    
      8
      embarked
      
      9
      varchar
      1
      1
      0
      0
    
    
      9
      home.dest
      
      10
      varchar
      50
      50
      0
      0
    
    
      10
      _PartInd_
      Partition Indicator
      11
      double
      8
      12
      0
      0
    
  




elapsed 0.00333s · user 0.002s · sys 0.01s · mem 0.69MB

Check that Sampling is Even

As long as each partition has approximately .38 for the mean, then survivor rows are distributed evenly in the partitions.



In [14]:

    
survSummary = titanic3part['survived'].groupby('_partind_').simple.summary()
 
resultColumns = ['Column', 'N', 'NMiss', 'Mean', 'Sum', 'Std', 'StdErr']

display(survSummary['ByGroupInfo'])
display(survSummary['ByGroup1.Summary'][resultColumns])
display(survSummary['ByGroup2.Summary'][resultColumns])









    





ByGroupInfo
  
    
      
      _PartInd_
      _PartInd__f
      _key_
    
  
  
    
      0
      0
      0
      0
    
    
      1
      1
      1
      1
    
  








    





Descriptive Statistics for TITANIC3PART
  
    
      
      Column
      N
      NMiss
      Mean
      Sum
      Std
      StdErr
    
    
      _PartInd_
      
      
      
      
      
      
      
    
  
  
    
      0
      survived
      785
      0
      0.382166
      300
      0.486227
      0.017354
    
  








    





Descriptive Statistics for TITANIC3PART
  
    
      
      Column
      N
      NMiss
      Mean
      Sum
      Std
      StdErr
    
    
      _PartInd_
      
      
      
      
      
      
      
    
  
  
    
      1
      survived
      524
      0
      0.381679
      200
      0.486263
      0.021242

Train a Model

The casOut parameter that is shown in the example is used to store the model as an in-memory table. The next step of this example is to show how to score data with the model.



In [15]:

    
s.builtins.loadActionSet('decisionTree')                 # 1

training = titanic3part.query('0 = _partind_')           # 2

trainingResults = training.forestTrain(
        target='survived',
        inputs=all,
        nominals=char + ['pclass', 'survived'],
        casOut={'name':'forestModel', 'replace':True},
        seed=1234,
        binOrder=True,
        varImp=True
    )

display(trainingResults)









    



NOTE: Added action set 'decisionTree'.






    




§ ModelInfo


Forest for TITANIC3PART
  
    
      
      Descr
      Value
    
  
  
    
      0
      Number of Trees
      50.000000
    
    
      1
      Number of Selected Variables (M)
      3.000000
    
    
      2
      Random Number Seed
      1234.000000
    
    
      3
      Bootstrap Percentage (%)
      63.212056
    
    
      4
      Number of Bins
      20.000000
    
    
      5
      Number of Variables
      9.000000
    
    
      6
      Confidence Level for Pruning
      0.250000
    
    
      7
      Max Number of Tree Nodes
      37.000000
    
    
      8
      Min Number of Tree Nodes
      13.000000
    
    
      9
      Max Number of Branches
      2.000000
    
    
      10
      Min Number of Branches
      2.000000
    
    
      11
      Max Number of Levels
      6.000000
    
    
      12
      Min Number of Levels
      6.000000
    
    
      13
      Max Number of Leaves
      19.000000
    
    
      14
      Min Number of Leaves
      7.000000
    
    
      15
      Maximum Size of Leaves
      443.000000
    
    
      16
      Minimum Size of Leaves
      5.000000
    
    
      17
      Out-of-Bag MCR (%)
      NaN
    
  



§ DTreeVarImpInfo


Forest for TITANIC3PART
  
    
      
      Variable
      Importance
      Std
    
  
  
    
      0
      sex
      56.052681
      26.768993
    
    
      1
      deck
      12.867060
      6.508284
    
    
      2
      home.dest
      7.922736
      3.023373
    
    
      3
      fare
      5.464105
      4.177646
    
    
      4
      pclass
      4.949152
      6.724228
    
    
      5
      embarked
      3.283320
      4.081139
    
    
      6
      sibsp
      2.384015
      2.147542
    
    
      7
      age
      1.947175
      1.629284
    
    
      8
      parch
      1.574475
      1.497399
    
  



§ OutputCasTables



  
    
      
      casLib
      Name
      Rows
      Columns
      casTable
    
  
  
    
      0
      CASUSERHDFS(mimcki)
      forestModel
      1264
      327
      CASTable('forestModel', caslib='CASUSERHDFS(mi...
    
  




elapsed 0.203s · user 0.548s · sys 1.94s · mem 90.9MB

Use the Model for Scoring

In this example, both the training data and the validation data are scored. This is done so that we can assess the effectiveness of the model for predicting whether someone survives on the Titanic.

The in-memory table, forestModel, is used as the model. The scoring output is stored in an in-memory table that is named forestScored.



In [16]:

    
forestModel = s.CASTable('forestModel')

titanic3part.forestScore(
    modelTable=forestModel, 
    copyVars=['survived', '_partind_'],
    casOut={'name':'forestScored', 'replace':True}, 
    vote='prob'
)









    Out[16]:




§ OutputCasTables



  
    
      
      casLib
      Name
      Rows
      Columns
      casTable
    
  
  
    
      0
      CASUSERHDFS(mimcki)
      forestScored
      1309
      7
      CASTable('forestScored', caslib='CASUSERHDFS(m...
    
  



§ ScoreInfo



  
    
      
      Descr
      Value
    
  
  
    
      0
      Number of Observations Read
      1309
    
    
      1
      Number of Observations Used
      1309
    
    
      2
      Misclassification Error (%)
      19.480519481
    
  




elapsed 0.113s · user 0.217s · sys 0.427s · mem 244MB

Assess the Model

The assess action is part of the percentile action set. You can run the loadActionSet action to ensure that the action is available to your session.



In [17]:

    
s.builtins.loadActionSet('percentile')

forestScored = s.CASTable('forestScored')                # 1
forestScored.groupBy='_PartInd_'                         # 2
forestScored.computedVars=['P1', 'P0']                   # 3
forestScored.computedVarsProgram='''                     
   if '1' eq strip(_RF_PredName_) then do;
     P1 = _RF_PredP_;
     P0 = 1 - _RF_PredP_;
   end;
   else do;
     P1 = 1 - _RF_PredP_;
     P0 = _RF_PredP_;
   end;
'''                                                      # 4

forestScored.percentile.assess(
    casOut={'name':'forestAssess', 'replace':True},
    nbins=10,
    cutStep = 0.01,
    inputs=['P1'],
    response='survived', 
    event='1', 
    pVar=['P0'], 
    pEvent='0'
)









    



NOTE: Added action set 'percentile'.






    Out[17]:




§ OutputCasTables



  
    
      
      casLib
      Name
      Rows
      Columns
      casTable
    
  
  
    
      0
      CASUSERHDFS(mimcki)
      forestAssess
      20
      23
      CASTable('forestAssess', caslib='CASUSERHDFS(m...
    
    
      1
      CASUSERHDFS(mimcki)
      forestAssess_ROC
      200
      22
      CASTable('forestAssess_ROC', caslib='CASUSERHD...
    
    
      2
      CASUSERHDFS(mimcki)
      forestAssess_FITSTAT
      2
      8
      CASTable('forestAssess_FITSTAT', caslib='CASUS...
    
  




elapsed 0.0679s · user 0.1s · sys 0.137s · mem 28.7MB

Plot ROC



In [18]:

    
forestAssess_ROC = \
    s.CASTable('forestAssess_ROC', where='1 = _partind_') # 1

out2 = forestAssess_ROC.to_frame()

plt.figure(figsize=(8,8))
plt.plot(out2._FPR_,out2._Sensitivity_,'bo-',linewidth=2)
plt.plot(pd.Series(range(0,11,1))/10,pd.Series(range(0,11,1))/10,'k--',linewidth=1)
plt.xlabel('False Positive Rate')
plt.ylabel('Correct Classification Rate')
plt.grid(True)
plt.title('ROC Curve')
plt.show()

Plot Lift



In [19]:

    
forestAssess = \
     s.CASTable('forestAssess', where='1 = _partind_')  # 1
lift = forestAssess.to_frame()

plt.figure(figsize=(8,8))
plt.plot(lift._Depth_, lift._Lift_,'bo-',linewidth=2)
plt.xlabel('Percentile')
plt.ylabel('Lift')
plt.grid(True)
plt.title('Lift Chart')
plt.show()



In [20]:

    
s.close()

Disclaimer: SAS may reference other websites or content or resources for use at Customer's sole discretion. SAS has no control over any websites or resources that are provided by companies or persons other than SAS. Customer acknowledges and agrees that SAS is not responsible for the availability or use of any such external sites or resources, and does not endorse any advertising, products, or other materials on or available from such websites or resources. Customer acknowledges and agrees that SAS is not liable for any loss or damage that may be incurred by Customer or its end users as a result of the availability or use of those external sites or resources, or as a result of any reliance placed by Customer or its end users on the completeness, accuracy, or existence of any advertising, products, or other materials on, or available from, such websites or resources.

	Column	ID	Type	RawLength	FormattedLength
0	pclass	1	double	8	12
1	survived	2	double	8	12
2	name	3	varchar	82	82
3	sex	4	varchar	6	6
4	age	5	double	8	12
5	sibsp	6	double	8	12
6	parch	7	double	8	12
7	ticket	8	varchar	18	18
8	fare	9	double	8	12
9	cabin	10	varchar	15	15
10	embarked	11	varchar	1	1
11	boat	12	varchar	7	7
12	body	13	double	8	12
13	home.dest	14	varchar	50	50

	Column	Min	Max	N	NMiss	Mean	Sum	Std	StdErr
survived
0	pclass	1.00	3	809	0	2.500618	2023.0000	0.744825	0.026187
0	survived	0.00	0	809	0	0.000000	0.0000	0.000000	0.000000
0	age	0.33	74	619	190	30.545363	18907.5800	13.922550	0.559595
0	sibsp	0.00	8	809	0	0.521632	422.0000	1.210449	0.042557
0	parch	0.00	9	809	0	0.328801	266.0000	0.912332	0.032076
0	fare	0.00	263	808	1	23.353831	18869.8951	34.145096	1.201220

	Descr	Value
0	Number of Trees	50.000000
1	Number of Selected Variables (M)	3.000000
2	Random Number Seed	1234.000000
3	Bootstrap Percentage (%)	63.212056
4	Number of Bins	20.000000
5	Number of Variables	9.000000
6	Confidence Level for Pruning	0.250000
7	Max Number of Tree Nodes	37.000000
8	Min Number of Tree Nodes	13.000000
9	Max Number of Branches	2.000000
10	Min Number of Branches	2.000000
11	Max Number of Levels	6.000000
12	Min Number of Levels	6.000000
13	Max Number of Leaves	19.000000
14	Min Number of Leaves	7.000000
15	Maximum Size of Leaves	443.000000
16	Minimum Size of Leaves	5.000000
17	Out-of-Bag MCR (%)	NaN

	Variable	Importance	Std
0	sex	56.052681	26.768993
1	deck	12.867060	6.508284
2	home.dest	7.922736	3.023373
3	fare	5.464105	4.177646
4	pclass	4.949152	6.724228
5	embarked	3.283320	4.081139
6	sibsp	2.384015	2.147542
7	age	1.947175	1.629284
8	parch	1.574475	1.497399

	Descr	Value
0	Number of Observations Read	1309
1	Number of Observations Used	1309
2	Misclassification Error (%)	19.480519481

	casLib	Name	Rows	Columns	casTable
0	CASUSERHDFS(mimcki)	forestAssess	20	23	CASTable('forestAssess', caslib='CASUSERHDFS(m...
1	CASUSERHDFS(mimcki)	forestAssess_ROC	200	22	CASTable('forestAssess_ROC', caslib='CASUSERHD...
2	CASUSERHDFS(mimcki)	forestAssess_FITSTAT	2	8	CASTable('forestAssess_FITSTAT', caslib='CASUS...