SWAT (SAS Wrapper for Analytics Transfer) is the Python module used to interface with SAS Cloud Analytic Services (CAS). This module includes classes and functions to call actions on the server and process the results on the client. This workshop will cover the basics of connecting to CAS from Python, calling CAS actions, interacting with CAS tables, and using the results of CAS actions on the client.
We will cover several aspects of working with CAS in the following sections.
The SWAT module in Python is merely a CAS client; it does not start the server. So the first thing you need to do is get access to a running CAS server. To connect to the server, we need the hostname and port number that the server is running on, as well as a way to authenticate with the server.
There are various ways of authenticating with CAS, so you will need to check with your system administrator as to the proper method to use. The simplest form of authentication is username and password, and we will cover that here.
To connect to CAS, we will use the CAS class in the SWAT package. The most basic form of creating a connection is to supply the CAS host, port, username, and password to the CAS class.
In [16]:
import swat
conn = swat.CAS('sasserver.demo.sas.com', 5570)
We can print the CAS object to see the settings for that session. It includes the host, port, username, protocol, session name, and session ID. The host, port, and username should be self-explanatory. The protocol can be 'cas' for the binary protocol, 'http' for the unencrypted REST interface, or 'https' for the encrypted REST interface.
The session name is simply a human-readable tag given to a session so it is easier to identify in actions that display session information. The session ID is a unique identifier given to each session. The session ID can be used to reconnect to a session, or connect from another client.
In [17]:
conn
Out[17]:
It's generally not a good idea to put your username and password into a program, so the CAS class also supports Authinfo authentication. This is a method of storing your username and password in a file that is readable only by you; the CAS object can then safely read the username and password from that file. This file was used in the connection above, which is why you didn't need to enter a username or password.
The basic format of the file is to put a host, port, username, and password on each line.
host my-cas.com port 12345 user my-name password s3cr3t
The hostname must exactly match the hostname used in the CAS object. If no port number is specified on the line, the username and password will be used for any port on that machine.
By default, the file should be stored in ~/.authinfo, and it should be readable only by you. On Linux, the command to do this is:
chmod 0600 ~/.authinfo
Once you have a file with the appropriate authentication information, you can create a CAS object without specifying a username and password.
Now that we have a CAS connection, we can try running some CAS actions. Let's start with the serverstatus action. This action displays some basic status information about the CAS server, such as platform and release information, the number of nodes, and information about each node in the grid.
CAS actions are called on CAS objects just like Python methods. We will store the result of the action call in out as well as display the result.
In [18]:
out = conn.serverstatus()
out
Out[18]:
The result of every CAS action is a CASResults object. CASResults objects are simply a subclass of Python's ordered dictionary, so anything that you can do with Python dictionaries can also be done with CASResults objects. For example, we can print all of the result keys using the keys method.
In [19]:
list(out.keys())
Out[19]:
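Since CASResults is just a subclass of an ordered dictionary, all of the standard dictionary operations apply. As a plain-Python sketch (an OrderedDict standing in for a real result object, so no CAS connection is needed):

```python
from collections import OrderedDict

# Stand-in for a CASResults object: the dictionary behavior is the same,
# only the CAS-specific extras are missing.
out = OrderedDict([('About', {'Version': '...'}),
                   ('server', 'a SASDataFrame'),
                   ('nodestatus', 'a SASDataFrame')])

print(list(out.keys()))        # ['About', 'server', 'nodestatus']
print('nodestatus' in out)     # True -- membership tests work too
```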
You can also access the objects at each key using dictionary syntax.
In [20]:
out['nodestatus']
Out[20]:
The types of the objects at each key vary from action to action. In the case of serverstatus, the 'About' key holds a dictionary, and the 'server' and 'nodestatus' keys hold DataFrames.
In [21]:
[type(x) for x in out.values()]
Out[21]:
As you can see, the DataFrames held in the output are SASDataFrame objects. The SASDataFrame object is a subclass of the Pandas DataFrame, so the two can be treated as equivalents. The SASDataFrame object simply adds some extra SAS metadata such as titles, labels, formats, etc.
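To illustrate what subclassing buys here, the following sketch (not the actual SWAT implementation) builds a DataFrame subclass that carries one extra metadata attribute, the way SASDataFrame carries its SAS metadata:

```python
import pandas as pd

class SASLikeFrame(pd.DataFrame):
    """Toy analogy of SASDataFrame: a DataFrame plus extra metadata."""
    _metadata = ['title']        # extra SAS-style attribute

    @property
    def _constructor(self):
        return SASLikeFrame      # keep the subclass through operations

df = SASLikeFrame({'role': ['worker', 'controller'],
                   'uptime': [10.2, 11.5]})
df.title = 'Node Status'

# All normal DataFrame operations still work on the subclass
workers = df[df['role'] == 'worker']
print(df.title, len(workers))    # Node Status 1
```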
Of course, since it is a DataFrame, we can use indexing, sorting, and filtering operations on it.
In [22]:
nodestatus = out['nodestatus']
nodestatus[nodestatus['role'] == 'worker']
Out[22]:
In [23]:
nodestatus[nodestatus['role'] == 'controller']
Out[23]:
In [24]:
nodestatus = nodestatus.set_index('name')
nodestatus.loc['sasserver.demo.sas.com']
Out[24]:
Now that we've covered the basics of calling CAS actions and traversing the results, let's see what other actions are available.
The help Function

So far we've only introduced the serverstatus action. However, there are many actions that are available in all CAS installations, and various others may be available depending on which products have been licensed. To display all available actions, you use the help action. (Note that we added the semicolon at the end to suppress rendering of the output, which is a DataFrame of the same information that is displayed in the notes.)
In [25]:
conn.help();
You can also ask for help on specific action sets and actions.
In [26]:
conn.help(actionset='builtins');
In [27]:
conn.help(action='tableinfo');
In [28]:
help(conn.tableinfo)
In [29]:
conn.tableinfo?
In [30]:
#conn.
So far we've run a few CAS actions and taken a look at the output that they create, but we haven't done anything that requires data. In order to do that, we need to know how to load data into a CAS table.
There are many ways of getting data into CAS. We'll look at a couple of them here.
The most Pythonic way of getting data into CAS is to use the Pandas-style data readers on the CAS connection object. When we refer to "client side" data, we mean data that is located on the machine that Python is running on. "Server side" data is data that is on the machine that CAS is running on. These may or may not be the same machine.
One thing to keep in mind when loading client-side data is that all of the data must be uploaded to the CAS machine. This may be fine for small to medium-sized data sets, but for larger data sets you will probably want to move on to section 4.3 on loading server-side data.
One of the simplest data file formats is CSV. Pandas' read_csv function is a very capable CSV reader with many options. You can use the SWAT version of that same function on the CAS connection object. It supports all of the same options as the Pandas version (and actually uses it in the background). The difference is that the SWAT version returns a CASTable object rather than a DataFrame.
In [31]:
tbl = conn.read_csv('https://raw.githubusercontent.com/'
'sassoftware/sas-viya-programming/master/data/cars.csv')
tbl
Out[31]:
Since the code above uses a Python data parser, the content of the URL must be downloaded to the client machine first, then it gets parsed. The data is then uploaded to CAS. You may want to keep this in mind since it does mean that the data is transferred twice. If you have larger data files, you may want to use an alternative way of loading them.
A CASTable object is a Python object interface to a CAS table. You can call CAS actions on it just like on a CAS connection object. It also supports much of the Pandas DataFrame API. This means that you can treat it like a DataFrame, but all of the operations will be performed on the CAS server using CAS actions.

The first thing we'll do is take a look at the column information using the columninfo CAS action.
In [32]:
conn.columninfo(table=tbl)
Out[32]:
We can also fetch a sample of the data using the fetch CAS action.
In [33]:
conn.fetch(table=tbl, to=5)
Out[33]:
As you can see in the previous two code samples, we ran the CAS action on the connection object and supplied the table in the table= parameter. Since this is such a common thing to do, the CASTable object can also be used as the action dispatcher. In this case, you simply call the action on the CASTable object itself and don't specify a table= parameter; it will be populated automatically with the table information.
In [34]:
tbl.fetch(to=5)
Out[34]:
Another option for processing client-side data is to upload it as-is and let CAS parse it. This is generally faster since there isn't the additional overhead of converting Python objects into a form that CAS can read. In addition, some file formats such as CSV can be parsed in parallel in CAS, which gives another performance boost.
The example below uploads the same file as above, but this time CAS is doing the parsing. You should note that since this CSV file is a URL, the data must be downloaded to the client machine and then uploaded to CAS which adds a bit more overhead.
In [35]:
tbl2 = conn.upload_file('https://raw.githubusercontent.com/'
'sassoftware/sas-viya-programming/master/data/cars.csv')
In [36]:
tbl2.fetch(to=5)
Out[36]:
There is also an upload_frame method on the CAS connection object that will upload a Pandas DataFrame. This method simply exports the DataFrame to CSV and uploads that file to the server.
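The client-side half of that round trip can be sketched in plain Pandas (the actual upload to the server is omitted here):

```python
import io

import pandas as pd

# upload_frame serializes the DataFrame to CSV before shipping it to CAS.
df = pd.DataFrame({'Make': ['Acura', 'Audi'], 'MSRP': [36945, 31840]})

buf = io.StringIO()
df.to_csv(buf, index=False)                       # what gets sent to the server
restored = pd.read_csv(io.StringIO(buf.getvalue()))

print(restored.equals(df))                        # True -- lossless for this data
```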
In [37]:
conn.caslibinfo()
Out[37]:
In [38]:
conn.fileinfo(caslib='casuser')
Out[38]:
In [39]:
conn.fileinfo(caslib='casuser', path='Data')
Out[39]:
In [40]:
out = conn.loadtable(path='Data/cars.csv', caslib='casuser')
out
Out[40]:
You'll notice that we specified both path= and caslib= parameters. Files are not loaded using an absolute file name; they are always relative to a caslib. Caslibs can be configured to point to file system directories, databases, and other data sources. They can also be used to set access controls for various operations, but we won't go into those details here. In the example above, the file was located relative to the casuser caslib, which points to the user's home directory.
The result of the loadtable action is a CASResults object containing three keys: 'caslib', 'tableName', and 'casTable'. The 'casTable' key is simply a CASTable object constructed from the 'caslib' and 'tableName' keys.
In [41]:
tbl3 = out['casTable']
tbl3
Out[41]:
In [42]:
tbl3.fetch(to=5)
Out[42]:
Since we commonly just want the CASTable object from the loadtable action, there is a wrapper method on the CAS connection object that calls loadtable behind the scenes and just returns the 'casTable' key.

In this case, we'll load the same data file, but we'll also specify replace=True in the casout= parameter since the call will attempt to create an output table with the same name.
In [43]:
tbl4 = conn.load_path('Data/cars.csv', caslib='casuser', casout=dict(replace=True))
tbl4
Out[43]:
In [44]:
tbl4.fetch(to=5)
Out[44]:
CASTable as a DataFrame

We have shown a few examples of how the CASTable object acts like a Pandas DataFrame, but the similarities go much further. We have attempted to adopt as much of the DataFrame API as is relevant to CAS. When a method is implemented, both the method parameters and outputs work the same way between CASTable objects and Pandas DataFrame objects. Pandas DataFrame method options are mapped to CAS action options, and the CAS action outputs are transformed to match the form of the Pandas DataFrame outputs. If you are familiar with Pandas DataFrames, this makes it much easier to get started with CAS.
In [45]:
import pandas as pd
In [46]:
df = pd.read_csv('https://raw.githubusercontent.com/'
'sassoftware/sas-viya-programming/master/data/cars.csv')
df.head()
Out[46]:
Here is the same data set as a CASTable.
In [47]:
tbl.head()
Out[47]:
To select a subset of columns using a Pandas DataFrame, you use Python's indexing syntax. When you specify a list of column names, you get a new DataFrame with just those columns in it.
In [48]:
subdf = df[['Make', 'Model', 'MSRP', 'Cylinders', 'Horsepower']]
subdf.head()
Out[48]:
The same is true of CASTable objects.
In [49]:
subtbl = tbl[['Make', 'Model', 'MSRP', 'Cylinders', 'Horsepower']]
subtbl.head()
Out[49]:
You may or may not have gotten the same rows of data when displaying data from the Pandas DataFrame and the CASTable object. This is because CAS does not store data rows in any specified order. CAS tables can be distributed across a grid of machines, and rows may be redistributed as the table is processed.
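Because row order carries no meaning, any order-independent computation gives the same answer no matter how the rows happen to be distributed. A small Pandas illustration of the principle:

```python
import pandas as pd

# Shuffling the rows does not change order-independent summaries,
# which is why CAS is free to redistribute rows across the grid.
df = pd.DataFrame({'MSRP': [36945, 23820, 26990, 33195]})
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)

same = df['MSRP'].mean() == shuffled['MSRP'].mean()
print(same)    # True
```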
However, CASTable objects can emulate Pandas DataFrame-style sorting by supplying sort parameters to the fetch action when data is retrieved.

Again, we'll show how sorting works with Pandas DataFrames, then show the CASTable version.
In [50]:
subdf.sort_values(['MSRP']).head()
Out[50]:
In [51]:
subtbl.sort_values(['MSRP']).head()
Out[51]:
You'll see that we get the same rows, but not the same index values, when sorting. CAS currently does not have a row index; the index you see is the one automatically applied when the DataFrame is created on the client side.

The sort_values method of CASTable also supports the ascending= parameter.
In [52]:
subtbl.sort_values(['MSRP'], ascending=False).head()
Out[52]:
While there are CAS actions that give results similar to the Pandas DataFrame statistical methods, their output always requires some transformation to be in exactly the same form. For that reason, many of the Pandas DataFrame methods have been re-implemented on CASTable objects. This means that they look and act like Pandas DataFrame methods, but all of the real work is done behind the scenes by CAS actions.

Probably the most common statistical method on Pandas DataFrames is describe.
In [53]:
df.describe()
Out[53]:
CASTable objects also support the describe method. As you can see, the column and row indexes are exactly the same as what you get with the Pandas DataFrame method.
In [54]:
tbl.describe()
Out[54]:
You can also specify the same options to the CASTable.describe method as with DataFrames, as well as use the subsetting syntax described above.
In [55]:
tbl[['MSRP', 'EngineSize', 'Cylinders']].describe(percentiles=[0.3, 0.7])
Out[55]:
While the output was created to be compatible with the Pandas DataFrame describe method, CAS supplies many more statistics that can be displayed using the stats= parameter. You can specify a list of names of the statistics to display, or simply specify stats='all' to display all of them.
In [56]:
tbl.describe(stats='all')
Out[56]:
If you want a full-blown table with all statistics and all variables (numeric and character), you can specify include='all' and stats='all' as parameters.
In [57]:
tbl.describe(include='all', stats='all')
Out[57]:
You can also specify By groups using the groupby method on both DataFrames and CASTable objects. This works in conjunction with DataFrame methods like describe.
In [58]:
df.groupby('Origin').describe()
Out[58]:
Here is the CASTable version. Notice that in this case, since compatibility with Pandas was the primary concern, the By groups are all contained in the same DataFrame. Calling CAS actions directly will always split them into separate tables.
In [59]:
tbl.groupby('Origin').describe()
Out[59]:
Sometimes you want to filter your data so that only part of the data set is analyzed. There are a few ways to do this in SWAT:
- setting a where= parameter on the table= parameter
- using the query(...) method of CASTable
- using Pandas-style indexing expressions on CASTable objects
Essentially these are all equivalent. They all end up setting a where= parameter on the table= parameter, so which one you choose is mostly a matter of personal preference.
The first method is the brute-force method: setting the where= parameter on the table. One thing to keep in mind is that the expression must be a CAS WHERE clause, not a Python expression. This may trip people up because = is used to test equality in CAS, whereas == is used in Python.
In [60]:
tbl.set_param(where='Origin = "Asia"')
tbl
Out[60]:
In [61]:
tbl.head()
Out[61]:
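For comparison, the equivalent client-side filter in plain Pandas uses Python's ==, here shown on a tiny made-up sample:

```python
import pandas as pd

# CAS WHERE clause:  Origin = "Asia"     (SAS syntax, single '=')
# Pandas equivalent: Origin == "Asia"    (Python syntax, double '==')
df = pd.DataFrame({'Origin': ['Asia', 'USA', 'Asia', 'Europe'],
                   'MSRP': [36945, 23820, 26990, 33195]})

asia = df.query('Origin == "Asia"')
print(len(asia))    # 2
```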
Using the query method on CASTable is equivalent. The only difference is that it creates a copy of the CASTable object and returns the copy with the where= parameter set. Since tbl already had a where= parameter set, the conditions get combined.
In [62]:
tbl = tbl.query('Cylinders = 6')
tbl
Out[62]:
In [63]:
tbl.head()
Out[63]:
Finally, you can use indexing expressions just like in Pandas DataFrames. This style of indexing creates a computed column for the expression, then adds the result of that computed column as a condition of the where= parameter.
In [64]:
tbl = tbl[tbl.MSRP > 50000]
tbl
Out[64]:
You may notice that the string representation of the table now includes all of the column names. This is because there is now a computed column in the table, which automatically gets appended to the table; in order for it not to show up in the output, we must select the columns that we want to display.
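A Pandas analogy of what happens under the covers (the helper column name here is made up for illustration):

```python
import pandas as pd

# tbl[tbl.MSRP > 50000] behaves as if a computed column holding the
# expression result were added, then used as the filter condition.
df = pd.DataFrame({'Make': ['Acura', 'Porsche', 'Kia'],
                   'MSRP': [36945, 86970, 19999]})
df['_expr_'] = df['MSRP'] > 50000                  # the "computed column"
expensive = df[df['_expr_']][['Make', 'MSRP']]     # filter, then pick display columns

print(list(expensive['Make']))    # ['Porsche']
```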
In [65]:
tbl.head()
Out[65]:
In [66]:
conn.tableinfo()
Out[66]:
Let's create a fresh CASTable object.
In [67]:
tbl = conn.CASTable('cars')
In [68]:
conn.simple?
Let's start off with the distinct action. It takes the same table= parameter as the actions above, so we can call it directly on the CASTable object and have the table= parameter filled in automatically.

The distinct action outputs the number of distinct values and the number of missing values for each column in the table.
In [69]:
tbl.distinct()
Out[69]:
In [70]:
tbl.summary()
Out[70]:
There are two ways to specify the columns for which summary statistics are generated.
In [71]:
tbl[['MSRP','MPG_CITY','MPG_Highway']].summary()
Out[71]:
In [72]:
tbl.summary(inputs=['MSRP','MPG_CITY','MPG_Highway'])
Out[72]:
In [73]:
tbl[['Origin','Cylinders']].freq()
Out[73]:
In [74]:
import cufflinks
cufflinks.go_offline()
output = tbl[['Origin','Cylinders']].freq()
df = output['Frequency']
df[df['Column'] == 'Cylinders'].iplot(kind='bar', x='FmtVar', y='Frequency')
In [75]:
tbl.crosstab(row="Cylinders", col='Origin')
Out[75]:
In [76]:
tbl.crosstab(row="MSRP", col='Origin', rowNBins=10)
Out[76]:
In [77]:
tbl.crosstab(row="MSRP", col='Origin', rowNBins=10, chisq=True)
Out[77]:
In statistics, the Pearson correlation coefficient is a measure of the linear correlation between two variables X and Y. Its range is between +1 and −1, where +1 is total positive correlation, 0 is no correlation, and −1 is total negative correlation.
https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
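As a quick client-side sanity check, the coefficient can be computed with Pandas' corr (whose default method is Pearson) or directly from its definition; the sample values below are made up:

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.1, 3.9, 6.2, 8.0, 9.8])    # roughly 2*x, so r should be near +1

r = x.corr(y)    # Pandas' default correlation method is Pearson

# The same statistic straight from the definition
manual = ((x - x.mean()) * (y - y.mean())).sum() / (
    (((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum()) ** 0.5)

print(round(r, 6), round(manual, 6))    # both near +1
```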
In [78]:
tbl.correlation(simple=False)
Out[78]:
In [79]:
output = tbl.correlation(simple=False)['Correlation']
output.set_index(output['Variable']).iplot(kind='heatmap', colorscale='rdbu')
In [80]:
tbl[['MPG_City']].histogram()
Out[80]:
While the table of data is nice and may be useful in some contexts, it might be nice to see a visual representation of the histogram data. There are various ways of plotting data in Python. The most basic is to use the plot method on the DataFrame.
In [81]:
# Set matplotlib to render graphics inline
%matplotlib inline
hist = tbl[['MPG_City']].histogram()
hist['BinDetails'].plot.bar(x='MidPoint', y='Percent')
Out[81]:
Now that we've done some basic statistics, let's look at how to group data. By grouping is handled by a parameter on the table: groupby. The brute-force way of adding grouping to a table is to set it manually. The code below sets the groupby parameter to 'Origin'.
In [82]:
tbl.set_param(groupby='Origin')
tbl
Out[82]:
We can then run the same summary action call as before, and we will get the results grouped by Origin.
In [83]:
output = tbl[['MSRP','MPG_CITY','MPG_Highway']].summary()
output
Out[83]:
You'll notice that we get each By group in a separate result key. This is done to make the output as flexible as possible. When dealing with large sets of data, you could end up with very large result sets coming back from CAS; the By groups are split so that they can be handled piecemeal rather than as one big table. We won't go into detail on constructing action calls that handle results individually, but you can go to the SWAT documentation for more information.
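A Pandas analogy of handling By groups piecemeal is iterating over a GroupBy object one group at a time rather than materializing everything at once:

```python
import pandas as pd

df = pd.DataFrame({'Origin': ['Asia', 'USA', 'Asia'],
                   'MSRP': [36945, 23820, 26990]})

sizes = {}
for origin, group in df.groupby('Origin'):
    sizes[origin] = len(group)    # each By group processed separately

print(sizes)    # {'Asia': 2, 'USA': 1}
```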
In cases like this, where the data is fairly small, you can concatenate all of the By group tables using the concat_bygroups method on the CASResults object. This gives you results similar to the non-By-group code, but adds an index to the DataFrame that contains the By values.
In [84]:
output.concat_bygroups()
Out[84]:
Because SAS uses formatted values to do By grouping and Pandas typically works with raw values, there is a method called reshape_bygroups on the SASDataFrame object that allows you to change the representation of By groups in your DataFrames. You can indicate whether you want the By values (raw or formatted) as columns, as an index, or not at all. You can also specify a suffix to use when column-name collisions occur, for actions that can use the same column multiple times in the output.
When we started this section, we mentioned that using the set_param method was the brute-force way of setting By groupings. You can also use the Pandas-style way of setting groupings: the groupby method.

Using it returns a CASTable-like object with the groupby parameter set. All action calls on the new object behave just like a CASTable running the same action.
In [85]:
tbl.del_param('groupby')
tbl
Out[85]:
In [86]:
output = tbl[['MSRP','MPG_City']].groupby('Origin').summary()
output
Out[86]:
SQL stands for Structured Query Language. According to ANSI (the American National Standards Institute), it is the standard language for relational database management systems. In this section we illustrate how to use SQL to communicate with the CAS server.
In [87]:
conn.loadactionset('fedsql')
Out[87]:
In [88]:
str1 = "select make, model, msrp, mpg_highway from cars where msrp > 80000 and mpg_highway > 20"
conn.fedsql.execDirect(query=str1)
Out[88]:
In [89]:
str1 = "select origin, mean(mpg_highway), std(mpg_highway) from cars group by origin"
conn.fedsql.execDirect(query=str1)
Out[89]:
Show the SQL logic by specifying the method=True parameter.
In [90]:
str1 = "select origin, mean(mpg_highway), std(mpg_highway) from cars group by origin"
conn.fedsql.execDirect(query=str1, method=True)
Out[90]:
In [91]:
str1 = """
create table aggregated1 as
select origin, mean(mpg_highway), std(mpg_highway)
from cars group by origin
"""
conn.fedsql.execDirect(query=str1)
Out[91]:
In [92]:
conn.tableinfo()
Out[92]:
You can delete a table using the dropTable action.
In [93]:
# conn.dropTable('aggregated1')
Continue working with the table created by SQL.
In [94]:
tbl_temp = conn.CASTable('aggregated1')
In [95]:
tbl_temp.fetch()
Out[95]:
In [96]:
str1 = """
create table aggregated2 as
select origin, type, mean(mpg_highway), std(mpg_highway)
from cars
group by origin, type
"""
conn.fedsql.execDirect(query=str1)
Out[96]:
In [97]:
str1 = """
select a.origin, a.type, b.mean as Average_Origin, a.mean as Average_Type
from aggregated2 as a
left join aggregated1 as b
on b.origin = a.origin
"""
conn.fedsql.execDirect(query=str1)
Out[97]:
You can submit SAS DATA step code using the runCode action! Don't forget the data set option REPLACE=YES if you want to replace an existing data set on the CAS server. By default, the new table CARS_TEMP is saved to the CASUSER library on the CAS server.
In [98]:
conn.runCode(code="""
data cars_temp(replace=yes);
set cars;
sqrt_MSRP = sqrt(MSRP);
MPG_avg = (MPG_city+MPG_highway)/2;
run;
""")
Out[98]:
In [99]:
conn.tableinfo()
Out[99]:
In [100]:
tbl_cars_temp = conn.CASTable('cars_temp')
tbl_cars_temp[['sqrt_MSRP','MPG_avg']].summary()
Out[100]:
You need to load a modeling action set before you use it. Let's load the action sets for regression and decision tree models.
In [101]:
conn.loadactionset('regression')
conn.loadactionset('decisiontree')
tbl = conn.CASTable('cars')
The regression action set contains three models: linear regression, logistic regression, and generalized linear models.
In [102]:
conn.help(actionset='regression')
Out[102]:
Let us build a linear regression model to predict the MSRP values of the automobiles using the CARS data set.
In [103]:
output1 = tbl.glm(
target = 'MSRP',
nominals = ['Type', 'Origin'],
inputs = ['Type', 'Origin', 'MPG_City', 'Length', 'Weight']
)
In [104]:
output1
Out[104]:
You can get the results in an ODS-like style too.
In [105]:
from swat.render import render_html
render_html(output1)
Next let us build a decision tree model to predict MSRP.
In [106]:
output1 = tbl.dtreetrain(
target = 'MSRP',
nominals = ['Type', 'Origin'],
inputs = ['Type', 'Origin', 'MPG_City', 'Length', 'Weight'],
casout = conn.CASTable("treeModel"),
varImp = True
)
In [107]:
output1
Out[107]:
The treeModel CASTable stores the decision tree model. Each row contains the information for an individual split in the decision tree.
In [108]:
conn.CASTable("treeModel").fetch()
Out[108]:
Last but not least, let us visualize the importance of the predictors using Cufflinks.
In [109]:
output1['DTreeVarImpInfo'].sort_values(by='Importance').iplot(kind='barh',x='Variable',y='Importance')