Note: We are continually releasing changes to CivisML, and this notebook applies to versions 2.0.0 and above.
Data scientists are on the front lines of their organization’s most important customer growth and engagement questions, and they need to guide action as quickly as possible by getting models into production. CivisML is a machine learning service that makes it possible for data scientists to massively increase the speed with which they can get great models into production. And because it’s built on open-source packages, CivisML remains transparent and data scientists remain in control.
In this notebook, we’ll go over the new features introduced in CivisML 2.0. For a walkthrough of CivisML’s fundamentals, check out this introduction to the mechanics of CivisML: https://github.com/civisanalytics/civis-python/blob/master/examples/CivisML_parallel_training.ipynb
CivisML 2.0 is full of new features to make modeling faster, more accurate, and more portable. This notebook will cover the following topics:
- DataFrameETL, for easy, customizable ETL
- Model stacking
- Hyperparameter tuning with hyperband
- Multilayer perceptrons (MLPs)
- Trained model portability
CivisML can be used to build models that answer all kinds of business questions, such as what movie to recommend to a customer, or which customers are most likely to upgrade their accounts. For the sake of example, this notebook uses a publicly available dataset on US colleges, and focuses on predicting the type of college (public, private non-profit, or private for-profit).
In [1]:
# first, let's import the packages we need
import requests
from io import StringIO
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import model_selection
# import the Civis Python API client
import civis
# ModelPipeline is the class used to build CivisML models
from civis.ml import ModelPipeline
In [2]:
# Suppress warnings for demo purposes. This is not recommended as a general practice.
import warnings
warnings.filterwarnings('ignore')
Before we build any models, we need a dataset to play with. We're going to use the most recent College Scorecard data from the Department of Education.
This dataset is collected to study the performance of US higher education institutions. You can learn more about it in this technical paper, and you can find details on the dataset features in this data dictionary.
In [3]:
# Downloading data; this may take a minute
# Two kinds of null values appear in this dataset: 'NULL' and 'PrivacySuppressed'
df = pd.read_csv("https://ed-public-download.app.cloud.gov/downloads/Most-Recent-Cohorts-All-Data-Elements.csv", sep=",", na_values=['NULL', 'PrivacySuppressed'], low_memory=False)
In [4]:
# How many rows and columns?
df.shape
Out[4]:
In [5]:
# What are some of the column names?
df.columns
Out[5]:
Before running CivisML, we need to do some basic data munging, such as removing missing data from the dependent variable, and splitting the data into training and test sets.
Throughout this notebook, we'll be trying to predict whether a college is public (labelled as 1), private non-profit (2), or private for-profit (3). The column name for this dependent variable is "CONTROL".
In [6]:
# Make sure to remove any rows with nulls in the dependent variable
df = df[np.isfinite(df['CONTROL'])]
In [7]:
# split into training and test sets
train_data, test_data = model_selection.train_test_split(df, test_size=0.2)
In [8]:
# print a few sample columns
train_data.head()
Out[8]:
Some of these columns are duplicates, or contain information we don't want to use in our model (like college names and URLs). CivisML can take a list of columns to exclude and do this part of the data munging for us, so let's make that list here.
In [8]:
to_exclude = ['ADM_RATE_ALL', 'OPEID', 'OPEID6', 'ZIP', 'INSTNM',
'INSTURL', 'NPCURL', 'ACCREDAGENCY', 'T4APPROVALDATE',
'STABBR', 'ALIAS', 'REPAY_DT_MDN', 'SEPAR_DT_MDN']
When building a supervised model, there are a few basic things you'll probably want to do:
- preprocess the data (drop unneeded columns, expand categorical variables, impute nulls)
- train the model
- validate it, so you know how well it performs
- make predictions on new data
CivisML does all of this in three lines of code. Let's fit a basic sparse logistic model to see how.
The first thing we need to do is build a ModelPipeline object. This stores all of the basic configuration options for the model. We'll tell it things like the type of model, dependent variable, and columns we want to exclude. CivisML handles basic ETL for you, including categorical expansion of any string-type columns.
In [9]:
# Use a push-button workflow to fit a model with reasonable default parameters
sl_model = ModelPipeline(model='sparse_logistic',
model_name='Example sparse logistic',
primary_key='UNITID',
dependent_variable=['CONTROL'],
excluded_columns=to_exclude)
Next, we want to train and validate the model by calling .train on the ModelPipeline object. CivisML uses 4-fold cross-validation on the training set. You can train on local data or query data from Redshift. In this case, we have our data locally, so we just pass the data frame.
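For reference, if the training data already lived in a database table, you could point .train at it instead of passing a data frame. A minimal sketch, where the table and database names are hypothetical:
# Sketch of training from a database table; 'schema.college_scorecard' and
# 'My Redshift Cluster' are hypothetical names
db_train = sl_model.train(table_name='schema.college_scorecard',
                          database_name='My Redshift Cluster')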
In [10]:
sl_train = sl_model.train(train_data)
This returns a ModelFuture object, which is non-blocking: you can keep doing things in your notebook while the model runs on Civis Platform in the background. If you want to make a blocking call (one that doesn't complete until your model is finished), you can use .result().
In [11]:
# non-blocking
sl_train
Out[11]:
In [12]:
# blocking
sl_train.result()
Out[12]:
We didn't specify the number of jobs in the .train() call above, but behind the scenes, the model was actually training in parallel! In CivisML 2.0, model tuning and validation are automatically distributed across your computing cluster, without ever using more than 90% of the cluster resources. This means you can build models faster and try more model configurations, leaving you more time to think critically about your data. If you decide you want more control over the resources you're using, you can set the n_jobs parameter to a specific number of jobs, and CivisML won't run more than that at once.
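For example, to cap the number of concurrent jobs for this model, you could pass n_jobs directly to the training call. A minimal sketch reusing the model object defined above:
# Limit this training run to at most 5 concurrent tuning/validation jobs
sl_train_capped = sl_model.train(train_data, n_jobs=5)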
We can see how well the model did by looking at the validation metrics.
In [13]:
# loop through the metric names and print to screen
for key in sl_train.metrics:
    print(key)
In [14]:
# ROC AUC for each of the three categories in our dependent variable
sl_train.metrics['roc_auc']
Out[14]:
Impressive!
This is the basic CivisML workflow: create the model, train, and make predictions. There are other configuration options for more complex use cases; for example, you can create a custom estimator, pass custom dependencies, manage the computing resources for larger models, and more. For more information, see the Machine Learning section of the Python API client docs.
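As one example of that flexibility, ModelPipeline also accepts a scikit-learn estimator in place of a preset model name. A minimal sketch, with illustrative hyperparameter values:
from sklearn.linear_model import LogisticRegression

# Pass a scikit-learn estimator instead of a preset workflow name;
# CivisML still handles the ETL, validation, and scoring around it
custom_model = ModelPipeline(model=LogisticRegression(C=0.1, solver='liblinear'),
                             model_name='Custom logistic example',
                             primary_key='UNITID',
                             dependent_variable=['CONTROL'],
                             excluded_columns=to_exclude)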
Now that we can build a simple model, let's see what's new to CivisML 2.0!
CivisML can do several data transformations to prepare your data for modeling. This makes data preprocessing easier, and makes it part of your model pipeline rather than an additional script you have to run. CivisML's built-in ETL includes:
- dropping excluded columns
- categorical expansion of non-numeric columns
- null value imputation
With CivisML 2.0, you can now recreate and customize this ETL using DataFrameETL, our open source ETL transformer, available on GitHub.
By default, CivisML will use DataFrameETL to automatically detect non-numeric columns for categorical expansion. Our example college dataset has a lot of integer columns which are actually categorical, but we can make sure they're handled correctly by passing CivisML a custom ETL transformer.
In [15]:
# The ETL transformer used in CivisML can be found in the civismlext module
from civismlext.preprocessing import DataFrameETL
Next, we create a list of columns to categorically expand, identified using the data dictionary mentioned above.
In [16]:
# column indices for columns to expand
to_expand = list(df.columns[:21]) + list(df.columns[23:36]) + list(df.columns[99:290]) + \
list(df.columns[[1738, 1773, 1776]])
In [17]:
# create ETL estimator to pass to CivisML
etl = DataFrameETL(cols_to_drop=to_exclude,
cols_to_expand=to_expand, # we made this column list during data munging
check_null_cols='warn')
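Because DataFrameETL is a scikit-learn-style transformer, you can also run it locally to inspect what the modeling steps will receive. A minimal sketch, assuming the standard fit/transform interface:
# Apply the same ETL locally: excluded columns are dropped and
# the columns in to_expand are expanded into indicator columns
transformed = etl.fit_transform(train_data)
transformed.shape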
Now it's time to fit a model. Let's take a look at model stacking, which is new to CivisML 2.0.
Stacking lets you combine several algorithms into a single model which performs as well as or better than its component algorithms. We use stacking at Civis to build more accurate models, which saves our data scientists time comparing algorithm performance. In CivisML, we have two stacking workflows: stacking_classifier (sparse logistic, GBT, and random forest, with a logistic regression model as a "meta-estimator" to combine predictions from the other models) and stacking_regressor (sparse linear, GBT, and random forest, with a non-negative linear regression as the meta-estimator). Use them the same way you use sparse_logistic or other pre-defined models. If you want to learn more about how stacking works under the hood, take a look at this talk by the person at Civis who wrote it!
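To make the idea concrete, here is a rough sketch of stacking using the open-source civisml-extensions package that CivisML builds on. The constructor arguments are assumptions about that package's API, so treat this as illustrative rather than exactly what CivisML runs:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from civismlext.stacking import StackedClassifier

# Assumed API: a list of (name, estimator) pairs, where the final
# estimator acts as the meta-estimator combining the others' predictions
stacker = StackedClassifier([
    ('rf', RandomForestClassifier()),
    ('gbt', GradientBoostingClassifier()),
    ('meta', LogisticRegression()),
])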
Let's fit both a stacking classifier and some un-stacked models, so we can compare the performance.
In [19]:
workflows = ['stacking_classifier',
'sparse_logistic',
'random_forest_classifier',
'gradient_boosting_classifier']
models = []
# create a model object for each of the four model types
for wf in workflows:
model = ModelPipeline(model=wf,
model_name=wf + ' v2 example',
primary_key='UNITID',
dependent_variable=['CONTROL'],
etl=etl # use the custom ETL we created
)
models.append(model)
In [20]:
# iterate over the model objects and run a CivisML training job for each
trains = []
for model in models:
train = model.train(train_data)
trains.append(train)
Let's plot diagnostics for each of the models. In the Civis Platform, these plots will automatically be built and displayed in the "Models" tab. But for the sake of example, let's also explicitly plot ROC curves and AUCs in the notebook.
There are three classes (public, non-profit private, and for-profit private), so we'll have three curves per model. It looks like all of the models are doing well, with sparse logistic performing slightly worse than the other three.
In [21]:
%matplotlib inline
# Let's look at how the model performed during validation
def extract_roc(fut_job, model_name):
'''Build a data frame of ROC curve data from the completed training job `fut_job`
with model name `model_name`. Note that this function will only work for a classification
model where the dependent variable has more than two classes.'''
aucs = fut_job.metrics['roc_auc']
roc_curve = fut_job.metrics['roc_curve_by_class']
n_classes = len(roc_curve)
fpr = []
tpr = []
class_num = []
auc = []
for i, curve in enumerate(roc_curve):
fpr.extend(curve['fpr'])
tpr.extend(curve['tpr'])
class_num.extend([i] * len(curve['fpr']))
auc.extend([aucs[i]] * len(curve['fpr']))
model_vec = [model_name] * len(fpr)
df = pd.DataFrame({
'model': model_vec,
'class': class_num,
'fpr': fpr,
'tpr': tpr,
'auc': auc
})
return df
# extract ROC curve information for all of the trained models
workflows_abbrev = ['stacking', 'logistic', 'RF', 'GBT']
roc_dfs = [extract_roc(train, w) for train, w in zip(trains, workflows_abbrev)]
roc_df = pd.concat(roc_dfs)
# create faceted ROC curve plots. Each row of plots is a different model type, and each
# column of plots is a different class of the dependent variable.
g = sns.FacetGrid(roc_df, col="class", row="model")
g = g.map(plt.plot, "fpr", "tpr", color='blue')
All of the models perform quite well, so it's difficult to compare based on the ROC curves. Let's plot the AUCs themselves.
In [22]:
# Plot AUCs for each model
%matplotlib inline
auc_df = roc_df[['model', 'class', 'auc']].drop_duplicates()
sns.swarmplot(x=auc_df['model'], y=auc_df['auc'])
plt.show()
Here we can see that all of the models perform well, with sparse logistic lagging slightly, and that stacking performs marginally better than the others. For more challenging modeling tasks, the difference between stacking and other models will often be more pronounced.
Now our models are trained, and we know that they all perform very well. Because the AUCs are all so high, we would expect the models to make similar predictions. Let's see if that's true.
In [23]:
# kick off a prediction job for each of the four models
preds = [model.predict(test_data) for model in models]
In [24]:
# This will run on Civis Platform cloud resources
[pred.result() for pred in preds]
Out[24]:
In [25]:
# print the top few rows for each of the models
import pprint
pred_dfs = [pred.table.head() for pred in preds]
pprint.pprint(pred_dfs)
Looks like the probabilities here aren't exactly the same, but they are directionally consistent: if you chose the class with the highest probability for each row, you'd end up with the same predictions from all of the models. This makes sense, because all of the models performed well.
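To check that, you could take the highest-probability class for each row and compare across models. A minimal sketch, assuming the first column of each pred.table is the primary key and the remaining columns are class probabilities:
# For each model, pick the class whose probability column is largest in each row
top_classes = [pred.table.iloc[:, 1:].idxmax(axis=1) for pred in preds]
# Fraction of rows where each model's top class matches the stacking model's
[np.mean(tc.values == top_classes[0].values) for tc in top_classes]
Another CivisML 2.0 improvement is portability: you can pull the trained scikit-learn estimator out of the ModelFuture and use it right in the notebook.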
In [26]:
train_stack = trains[0] # Get the ModelFuture for the stacking model
trained_model = train_stack.estimator
This Pipeline contains all of the steps CivisML used to train the model, from ETL to the model itself. We can print each step individually to get a better sense of what is going on.
In [27]:
# print each of the estimators in the pipeline, separated by newlines for readability
for step in train_stack.estimator.steps:
print(step[1])
print('\n')
Now we can see that there are three steps: the DataFrameETL object we passed in, a null imputation step, and the stacking estimator itself.
We can use this outside of CivisML simply by calling .predict on the estimator. This will make predictions using the model in the notebook without using CivisML.
In [28]:
# drop the dependent variable so we don't use it to predict itself!
predictions = trained_model.predict(test_data.drop(labels=['CONTROL'], axis=1))
In [29]:
# print out the class predictions. These will be integers representing the predicted
# class rather than probabilities.
predictions
Out[29]:
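If you want class probabilities rather than class labels from the local estimator, the downloaded Pipeline also exposes predict_proba. A minimal sketch:
# Class probabilities from the downloaded pipeline, one column per class
probabilities = trained_model.predict_proba(test_data.drop(labels=['CONTROL'], axis=1))
probabilities[:5]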
Multilayer perceptrons (MLPs) are simple neural networks, and they are now built into CivisML. The MLP estimators in CivisML come from muffnn, another open source package written and maintained by Civis Analytics, built on TensorFlow. Let's fit one using hyperband.
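As an aside, muffnn's estimators follow the scikit-learn interface, so outside CivisML you could fit one directly on a fully numeric, null-free feature matrix. This is a rough sketch, and the constructor arguments (hidden_units, n_epochs) are assumptions about muffnn's API:
from muffnn import MLPClassifier

# Rough standalone sketch, separate from the CivisML workflow below;
# hidden_units and n_epochs are assumed parameter names
mlp = MLPClassifier(hidden_units=(64, 64), n_epochs=10)
# mlp.fit(X_numeric, y)  # X_numeric and y are placeholders for prepared data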
Tuning hyperparameters is a critical chore for getting an algorithm to perform at its best, but it can take a long time to run. With CivisML 2.0, you can use hyperband as an alternative to conventional grid search for hyperparameter optimization; it runs about twice as fast. While grid search runs every parameter combination for the full time, hyperband runs many combinations for a short time, keeps only the best, runs those for longer, filters again, and so on. This means that you can try more combinations in less time, so we recommend using it whenever possible. The hyperband estimator is open source and available on GitHub. You can learn about the details in the original paper, Li et al. (2016).
Right now, hyperband is implemented for several of CivisML's named preset models, including the multilayer perceptron models we'll fit below.
Unlike grid search, you don't need to specify values to search over. If you pass cross_validation_parameters='hyperband' to ModelPipeline, hyperparameter combinations will be randomly drawn from preset distributions.
In [30]:
# build a model specifying the MLP model with hyperband
model_mlp = ModelPipeline(model='multilayer_perceptron_classifier',
model_name='MLP example',
primary_key='UNITID',
dependent_variable=['CONTROL'],
cross_validation_parameters='hyperband',
etl=etl
)
train_mlp = model_mlp.train(train_data,
n_jobs=10) # parallel hyperparameter optimization and validation!
# block until the job finishes
train_mlp.result()
Out[30]:
Let's dig into the hyperband model a little bit. Like the stacking model, the model below starts with ETL and null imputation, but contains some additional steps: a step to scale the predictor variables (which improves neural network performance), and a hyperband searcher containing the MLP.
In [31]:
for step in train_mlp.estimator.steps:
print(step[1])
print('\n')
HyperbandSearchCV essentially works like GridSearchCV. If you want to get the best estimator without all of the extra CV information, you can access it using the best_estimator_ attribute.
In [32]:
train_mlp.estimator.steps[3][1].best_estimator_
Out[32]:
To see how well the best model performed, you can look at the best_score_ attribute.
In [33]:
train_mlp.estimator.steps[3][1].best_score_
Out[33]:
And to look at information about the different hyperparameter configurations that were tried, you can check cv_results_.
In [34]:
train_mlp.estimator.steps[3][1].cv_results_
Out[34]:
Just like any other model in CivisML, we can use hyperband-tuned models to make predictions by calling .predict() on the ModelPipeline.
In [35]:
predict_mlp = model_mlp.predict(test_data)
In [36]:
predict_mlp.table.head()
Out[36]:
It looks like this model is predicting the same categories as the models we tried earlier, so we can feel very confident about those predictions.
We're excited to see what problems you solve with these new capabilities. If you have any problems or questions, contact us at support@civisanalytics.com. Happy modeling!