Data Preparation Workflow

Set Up the Jupyter Notebook for Analysis

Note: We have our new package called swat - SAS Scripting Wrapper for Analytics Transfer - available on GitHub via pip install


In [1]:
# Import necessary packages and modules
import swat
%matplotlib inline

# Set the connection by specifying the hostname, port, username, and password
conn = swat.CAS(hostname, port, username, password)

# Get the hmeq csv file from SAS support documentation and lift into memory
castbl = conn.read_csv('http://support.sas.com/documentation/onlinedoc/viya/exampledatasets/hmeq.csv', casout = 'hmeq')
castbl.replace = True

Examine the first few rows


In [2]:
# Assign the variable name df to the new CASTable object
df = conn.CASTable('hmeq')

# Perform the head method to return the first 5 rows
df.head()


Out[2]:
Selected Rows from Table HMEQ
BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
0 1.0 1100.0 25860.0 39025.0 HomeImp Other 10.5 0.0 0.0 94.366667 1.0 9.0 NaN
1 1.0 1300.0 70053.0 68400.0 HomeImp Other 7.0 0.0 2.0 121.833333 0.0 14.0 NaN
2 1.0 1500.0 13500.0 16700.0 HomeImp Other 4.0 0.0 0.0 149.466667 1.0 10.0 NaN
3 1.0 1500.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 0.0 1700.0 97800.0 112000.0 HomeImp Office 3.0 0.0 0.0 93.333333 0.0 14.0 NaN

Create new features


In [3]:
# How much of their mortgage have they paid off?
df['MORTPAID'] = df['VALUE'] - df['MORTDUE']
df.head()


Out[3]:
Selected Rows from Table HMEQ
BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC MORTPAID
0 1.0 1100.0 25860.0 39025.0 HomeImp Other 10.5 0.0 0.0 94.366667 1.0 9.0 NaN 13165.0
1 1.0 1300.0 70053.0 68400.0 HomeImp Other 7.0 0.0 2.0 121.833333 0.0 14.0 NaN -1653.0
2 1.0 1500.0 13500.0 16700.0 HomeImp Other 4.0 0.0 0.0 149.466667 1.0 10.0 NaN 3200.0
3 1.0 1500.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 0.0 1700.0 97800.0 112000.0 HomeImp Office 3.0 0.0 0.0 93.333333 0.0 14.0 NaN 14200.0

In [4]:
# What percent of the time does this happen?
df.query('MORTPAID < 0')['MORTPAID'].count()/len(df)


Out[4]:
0.0085570469798657723

Examine numeric variable distribution


In [5]:
# Use the pandas/matplotlib method for plotting a histogram of all numeric variables
df.hist(figsize = (15, 10));


Examine summary statistics


In [6]:
# Use the pandas describe method, then switch rows and columns
summary = df.describe(include = 'all').transpose()
summary


Out[6]:
count unique top freq mean std min 25% 50% 75% max
BAD 5960 2 0 4771 0.199497 0.399656 0 0 0 0 1
LOAN 5960 540 15000 105 18608 11207.5 1100 11100 16300 23300 89900
MORTDUE 5442 5053 42000 11 73760.8 44457.6 2063 46268 65019 91491 399550
VALUE 5848 5381 60000 15 101776 57385.8 8000 66069 89235.5 119832 855909
REASON 5960 2 DebtCon 3928 NaN NaN DebtCon NaN NaN NaN HomeImp
JOB 5960 6 Other 2388 NaN NaN Mgr NaN NaN NaN Self
YOJ 5445 99 0 415 8.92227 7.57398 0 3 7 13 41
DEROG 5252 11 0 4527 0.25457 0.846047 0 0 0 0 10
DELINQ 5380 14 0 4179 0.449442 1.12727 0 0 0 0 15
CLAGE 5652 5314 206.967 7 179.766 85.8101 0 115.103 173.467 231.575 1168.23
NINQ 5450 16 0 2531 1.18606 1.72867 0 0 1 2 17
CLNO 5738 62 16 316 21.2961 10.1389 0 15 20 26 71
DEBTINC 4693 4693 203.312 1 33.7799 8.60175 0.524499 29.14 34.8183 39.0031 203.312
MORTPAID 5357 4915 18000 10 31235.2 26343.4 -205445 17355 26623 39757 287300

Is there an issue with missing data?


In [7]:
# Create percent missing column for plotting
summary['pctmiss'] = (len(df) - summary['count'])/len(df)

# Make a bar graph using pandas/matplotlib functionality
summary.query('pctmiss > 0')['pctmiss'].plot('bar', title = 'Pct Missing Values', figsize = (10, 6), color = 'c');


The variables need to be imputed for missing values


In [8]:
# This is using the CAS action impute - really nice method for imputing all variables at once
## Impute the median for numeric, most common for categorical
df.impute(
    methodContinuous = 'MEDIAN',
    methodNominal    = 'MODE',
    inputs           = list(summary.index[1:]), # exclude target column
    copyAllVars      = True,
    casOut           = castbl
)


Out[8]:
§ ImputeInfo
Imputation Information for HMEQ
Variable ImputeTech ResultVar N NMiss ImputedValueContinuous ImputedValueNominal
0 LOAN Median IMP_LOAN 5960.0 0.0 16300.000000
1 MORTDUE Median IMP_MORTDUE 5442.0 518.0 65019.000000
2 VALUE Median IMP_VALUE 5848.0 112.0 89235.500000
3 REASON Mode IMP_REASON 5708.0 252.0 NaN DebtCon
4 JOB Mode IMP_JOB 5681.0 279.0 NaN Other
5 YOJ Median IMP_YOJ 5445.0 515.0 7.000000
6 DEROG Median IMP_DEROG 5252.0 708.0 0.000000
7 DELINQ Median IMP_DELINQ 5380.0 580.0 0.000000
8 CLAGE Median IMP_CLAGE 5652.0 308.0 173.466667
9 NINQ Median IMP_NINQ 5450.0 510.0 1.000000
10 CLNO Median IMP_CLNO 5738.0 222.0 20.000000
11 DEBTINC Median IMP_DEBTINC 4693.0 1267.0 34.818262
12 MORTPAID Median IMP_MORTPAID 5357.0 603.0 26623.000000

§ OutputCasTables
casLib Name Rows Columns casTable
0 DMlib HMEQ 5960 26 CASTable('HMEQ', caslib='DMlib')

Create partition indicator so models won't overfit


In [9]:
# Load the sampling actionset
conn.loadactionset('sampling')

# Do a simple random sample with a 70/30 split
df.srs(samppct = 30, partind = True, seed = 1, output = dict(casout = castbl, copyvars = 'all'))

# What percentage is in each split?
castbl['_PartInd_'].groupby('_PartInd_').count()/len(castbl)


NOTE: Added action set 'sampling'.
NOTE: Using SEED=1 for sampling.
Out[9]:
_PartInd_
0.0    0.7
1.0    0.3
Name: _PartInd_, dtype: float64

Make sure the dataset looks good


In [10]:
castbl.head()


Out[10]:
Selected Rows from Table HMEQ
BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE ... IMP_LOAN IMP_MORTDUE IMP_MORTPAID IMP_NINQ IMP_VALUE IMP_YOJ IMP_JOB IMP_REASON MORTPAID _PartInd_
0 1.0 1100.0 25860.0 39025.0 HomeImp Other 10.5 0.0 0.0 94.366667 ... 1100.0 25860.0 13165.0 1.0 39025.0 10.5 Other HomeImp 13165.0 0.0
1 1.0 1300.0 70053.0 68400.0 HomeImp Other 7.0 0.0 2.0 121.833333 ... 1300.0 70053.0 -1653.0 0.0 68400.0 7.0 Other HomeImp -1653.0 0.0
2 1.0 1500.0 13500.0 16700.0 HomeImp Other 4.0 0.0 0.0 149.466667 ... 1500.0 13500.0 3200.0 1.0 16700.0 4.0 Other HomeImp 3200.0 0.0
3 1.0 1500.0 NaN NaN NaN NaN NaN NaN ... 1500.0 65019.0 26623.0 1.0 89235.5 7.0 Other DebtCon NaN 0.0
4 0.0 1700.0 97800.0 112000.0 HomeImp Office 3.0 0.0 0.0 93.333333 ... 1700.0 97800.0 14200.0 0.0 112000.0 3.0 Office HomeImp 14200.0 0.0

5 rows × 28 columns

Promote the table to public memory


In [11]:
# This allows me to share my data across sessions and with other colleagues
conn.promote(name = castbl, targetlib = 'public', target = 'hmeq_prepped')

Ensure that everything worked as intended


In [12]:
# Verify that hmeq_prepped has made it into the public caslib
conn.tableinfo(caslib = 'public')['TableInfo'].query('Name == "HMEQ_PREPPED"')


Out[12]:
Name Rows Columns Encoding CreateTimeFormatted ModTimeFormatted JavaCharSet CreateTime ModTime Global Repeated View SourceName SourceCaslib Compressed Creator Modifier
7 HMEQ_PREPPED 5960 28 utf-8 01Aug2017:21:17:11 01Aug2017:21:17:11 UTF8 1.817241e+09 1.817241e+09 1 0 0 0 jelueb

In [13]:
# End the session
conn.endsession()


Out[13]:

elapsed 0.00457s · user 0.001s · sys 0.005s · mem 0.114MB