Note: We are continually releasing changes to CivisML, and this notebook applies to versions 2.0.0 and above.
Data scientists are on the front lines of their organization’s most important customer growth and engagement questions, and they need to guide action as quickly as possible by getting models into production. CivisML is a machine learning service that makes it possible for data scientists to massively increase the speed with which they can get great models into production. And because it’s built on open-source packages, CivisML remains transparent and data scientists remain in control.
In this notebook, we’ll go over the new features introduced in CivisML 2.0. For a walkthrough of CivisML’s fundamentals, check out this introduction to the mechanics of CivisML: https://github.com/civisanalytics/civis-python/blob/master/examples/CivisML_parallel_training.ipynb
CivisML 2.0 is full of new features to make modeling faster, more accurate, and more portable. This notebook will cover the following topics:
- DataFrameETL, for easy, customizable ETL
- Model stacking
- Hyperparameter tuning with hyperband
- Multilayer perceptrons (MLPs)
- Trained model portability
CivisML can be used to build models that answer all kinds of business questions, such as what movie to recommend to a customer, or which customers are most likely to upgrade their accounts. For the sake of example, this notebook uses a publicly available dataset on US colleges, and focuses on predicting the type of college (public, private non-profit, or private for-profit).
In [1]:
# first, let's import the packages we need
import requests
from io import StringIO
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import model_selection
# import the Civis Python API client
import civis
# ModelPipeline is the class used to build CivisML models
from civis.ml import ModelPipeline
In [2]:
# Suppress warnings for demo purposes. This is not recommended as a general practice.
import warnings
warnings.filterwarnings('ignore')
Before we build any models, we need a dataset to play with. We're going to use the most recent College Scorecard data from the Department of Education.
This dataset is collected to study the performance of US higher education institutions. You can learn more about it in this technical paper, and you can find details on the dataset features in this data dictionary.
In [3]:
# Downloading data; this may take a minute
# Two kinds of null values appear in this dataset: 'NULL' and 'PrivacySuppressed'
df = pd.read_csv("https://ed-public-download.app.cloud.gov/downloads/Most-Recent-Cohorts-All-Data-Elements.csv", sep=",", na_values=['NULL', 'PrivacySuppressed'], low_memory=False)
In [4]:
# How many rows and columns?
df.shape
Out[4]:
In [5]:
# What are some of the column names?
df.columns
Out[5]:
Before running CivisML, we need to do some basic data munging, such as removing missing data from the dependent variable, and splitting the data into training and test sets.
Throughout this notebook, we'll be trying to predict whether a college is public (labelled as 1), private non-profit (2), or private for-profit (3). The column name for this dependent variable is "CONTROL".
In [6]:
# Make sure to remove any rows with nulls in the dependent variable
df = df[np.isfinite(df['CONTROL'])]
In [7]:
# split into training and test sets
train_data, test_data = model_selection.train_test_split(df, test_size=0.2)
In [8]:
# print a few sample columns
train_data.head()
Out[8]:
Some of these columns are duplicates, or contain information we don't want to use in our model (like college names and URLs). CivisML can take a list of columns to exclude and do this part of the data munging for us, so let's make that list here.
In [8]:
to_exclude = ['ADM_RATE_ALL', 'OPEID', 'OPEID6', 'ZIP', 'INSTNM',
'INSTURL', 'NPCURL', 'ACCREDAGENCY', 'T4APPROVALDATE',
'STABBR', 'ALIAS', 'REPAY_DT_MDN', 'SEPAR_DT_MDN']
When building a supervised model, there are a few basic things you'll probably want to do:
- preprocess the data (drop unneeded columns, expand categorical variables, impute nulls)
- train the model
- validate it, so you know how well it performs
- make predictions on new data
CivisML does all of this in three lines of code. Let's fit a basic sparse logistic model to see how.
The first thing we need to do is build a ModelPipeline object. This stores all of the basic configuration options for the model. We'll tell it things like the type of model, dependent variable, and columns we want to exclude. CivisML handles basic ETL for you, including categorical expansion of any string-type columns.
In [9]:
# Use a push-button workflow to fit a model with reasonable default parameters
sl_model = ModelPipeline(model='sparse_logistic',
model_name='Example sparse logistic',
primary_key='UNITID',
dependent_variable=['CONTROL'],
excluded_columns=to_exclude)
Next, we want to train and validate the model by calling .train on the ModelPipeline object. CivisML uses 4-fold cross-validation on the training set. You can train on local data or query data from Redshift. In this case, we have our data locally, so we just pass the data frame.
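For reference, if the training data already lived in a database table, you could point .train at it instead of passing a data frame. A minimal sketch, where the table and database names are hypothetical:
# Sketch of training from a database table; 'schema.college_scorecard' and
# 'My Redshift Cluster' are hypothetical names
db_train = sl_model.train(table_name='schema.college_scorecard',
                          database_name='My Redshift Cluster')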
In [10]:
sl_train = sl_model.train(train_data)
This returns a ModelFuture object, which is non-blocking: you can keep doing things in your notebook while the model runs on Civis Platform in the background. If you want to make a blocking call (one that doesn't complete until your model is finished), you can use .result().
In [11]:
# non-blocking
sl_train
Out[11]:
In [12]:
# blocking
sl_train.result()
Out[12]:
We didn't specify the number of jobs in the .train() call above, but behind the scenes, the model was actually training in parallel! In CivisML 2.0, model tuning and validation are automatically distributed across your computing cluster, without ever using more than 90% of the cluster resources. This means you can build models faster and try more model configurations, leaving you more time to think critically about your data. If you decide you want more control over the resources you're using, you can set the n_jobs parameter to a specific number of jobs, and CivisML won't run more than that at once.
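For example, to cap the number of concurrent jobs for this model, you could pass n_jobs directly to the training call. A minimal sketch reusing the model object defined above:
# Limit this training run to at most 5 concurrent tuning/validation jobs
sl_train_capped = sl_model.train(train_data, n_jobs=5)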
We can see how well the model did by looking at the validation metrics.
In [13]:
# loop through the metric names and print to screen
for key in sl_train.metrics:
    print(key)
In [14]:
# ROC AUC for each of the three categories in our dependent variable
sl_train.metrics['roc_auc']
Out[14]:
Impressive!
This is the basic CivisML workflow: create the model, train, and make predictions. There are other configuration options for more complex use cases; for example, you can create a custom estimator, pass custom dependencies, manage the computing resources for larger models, and more. For more information, see the Machine Learning section of the Python API client docs.
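As one example of that flexibility, ModelPipeline also accepts a scikit-learn estimator in place of a preset model name. A minimal sketch, with illustrative hyperparameter values:
from sklearn.linear_model import LogisticRegression

# Pass a scikit-learn estimator instead of a preset workflow name;
# CivisML still handles the ETL, validation, and scoring around it
custom_model = ModelPipeline(model=LogisticRegression(C=0.1, solver='liblinear'),
                             model_name='Custom logistic example',
                             primary_key='UNITID',
                             dependent_variable=['CONTROL'],
                             excluded_columns=to_exclude)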
Now that we can build a simple model, let's see what's new to CivisML 2.0!
CivisML can do several data transformations to prepare your data for modeling. This makes data preprocessing easier, and makes it part of your model pipeline rather than an additional script you have to run. CivisML's built-in ETL includes:
- dropping excluded columns
- categorical expansion of non-numeric columns
- null value imputation
With CivisML 2.0, you can now recreate and customize this ETL using DataFrameETL, our open source ETL transformer, available on GitHub.
By default, CivisML will use DataFrameETL to automatically detect non-numeric columns for categorical expansion. Our example college dataset has a lot of integer columns which are actually categorical, but we can make sure they're handled correctly by passing CivisML a custom ETL transformer.
In [15]:
# The ETL transformer used in CivisML can be found in the civismlext module
from civismlext.preprocessing import DataFrameETL
Next, we create a list of columns to categorically expand, identified using the data dictionary mentioned above.
In [16]:
# column indices for columns to expand
to_expand = list(df.columns[:21]) + list(df.columns[23:36]) + list(df.columns[99:290]) + \
list(df.columns[[1738, 1773, 1776]])
In [17]:
# create ETL estimator to pass to CivisML
etl = DataFrameETL(cols_to_drop=to_exclude,
cols_to_expand=to_expand, # we made this column list during data munging
check_null_cols='warn')
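Because DataFrameETL is a scikit-learn-style transformer, you can also run it locally to inspect what the modeling steps will receive. A minimal sketch, assuming the standard fit/transform interface:
# Apply the same ETL locally: excluded columns are dropped and
# the columns in to_expand are expanded into indicator columns
transformed = etl.fit_transform(train_data)
transformed.shape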
Now it's time to fit a model. Let's take a look at model stacking, which is new to CivisML 2.0.
Stacking lets you combine several algorithms into a single model which performs as well as or better than its component algorithms. We use stacking at Civis to build more accurate models, which saves our data scientists time comparing algorithm performance. In CivisML, we have two stacking workflows: stacking_classifier (sparse logistic, GBT, and random forest, with a logistic regression model as a "meta-estimator" to combine predictions from the other models) and stacking_regressor (sparse linear, GBT, and random forest, with a non-negative linear regression as the meta-estimator). Use them the same way you use sparse_logistic or other pre-defined models. If you want to learn more about how stacking works under the hood, take a look at this talk by the person at Civis who wrote it!
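To make the idea concrete, here is a rough sketch of stacking using the open-source civisml-extensions package that CivisML builds on. The constructor arguments are assumptions about that package's API, so treat this as illustrative rather than exactly what CivisML runs:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from civismlext.stacking import StackedClassifier

# Assumed API: a list of (name, estimator) pairs, where the final
# estimator acts as the meta-estimator combining the others' predictions
stacker = StackedClassifier([
    ('rf', RandomForestClassifier()),
    ('gbt', GradientBoostingClassifier()),
    ('meta', LogisticRegression()),
])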
Let's fit both a stacking classifier and some un-stacked models, so we can compare the performance.
In [19]:
workflows = ['stacking_classifier',
'sparse_logistic',
'random_forest_classifier',
'gradient_boosting_classifier']
models = []
# create a model object for each of the four model types
for wf in workflows:
model = ModelPipeline(model=wf,
model_name=wf + ' v2 example',
primary_key='UNITID',
dependent_variable=['CONTROL'],
etl=etl # use the custom ETL we created
)
models.append(model)
In [20]:
# iterate over the model objects and run a CivisML training job for each
trains = []
for model in models:
train = model.train(train_data)
trains.append(train)
Let's plot diagnostics for each of the models. In the Civis Platform, these plots will automatically be built and displayed in the "Models" tab. But for the sake of example, let's also explicitly plot ROC curves and AUCs in the notebook.
There are three classes (public, non-profit private, and for-profit private), so we'll have three curves per model. It looks like all of the models are doing well, with sparse logistic performing slightly worse than the other three.
In [21]:
%matplotlib inline
# Let's look at how the model performed during validation
def extract_roc(fut_job, model_name):
'''Build a data frame of ROC curve data from the completed training job `fut_job`
with model name `model_name`. Note that this function will only work for a classification
model where the dependent variable has more than two classes.'''
aucs = fut_job.metrics['roc_auc']
roc_curve = fut_job.metrics['roc_curve_by_class']
n_classes = len(roc_curve)
fpr = []
tpr = []
class_num = []
auc = []
for i, curve in enumerate(roc_curve):
fpr.extend(curve['fpr'])
tpr.extend(curve['tpr'])
class_num.extend([i] * len(curve['fpr']))
auc.extend([aucs[i]] * len(curve['fpr']))
model_vec = [model_name] * len(fpr)
df = pd.DataFrame({
'model': model_vec,
'class': class_num,
'fpr': fpr,
'tpr': tpr,
'auc': auc
})
return df
# extract ROC curve information for all of the trained models
workflows_abbrev = ['stacking', 'logistic', 'RF', 'GBT']
roc_dfs = [extract_roc(train, w) for train, w in zip(trains, workflows_abbrev)]
roc_df = pd.concat(roc_dfs)
# create faceted ROC curve plots. Each row of plots is a different model type, and each
# column of plots is a different class of the dependent variable.
g = sns.FacetGrid(roc_df, col="class", row="model")
g = g.map(plt.plot, "fpr", "tpr", color='blue')
All of the models perform quite well, so it's difficult to compare based on the ROC curves. Let's plot the AUCs themselves.
In [22]:
# Plot AUCs for each model
%matplotlib inline
auc_df = roc_df[['model', 'class', 'auc']].drop_duplicates()
sns.swarmplot(x=auc_df['model'], y=auc_df['auc'])
plt.show()
Here we can see that all of the models perform well, with sparse logistic lagging slightly, and that stacking performs marginally better than the others. For more challenging modeling tasks, the difference between stacking and other models will often be more pronounced.
Now our models are trained, and we know that they all perform very well. Because the AUCs are all so high, we would expect the models to make similar predictions. Let's see if that's true.
In [23]:
# kick off a prediction job for each of the four models
preds = [model.predict(test_data) for model in models]
In [24]:
# This will run on Civis Platform cloud resources
[pred.result() for pred in preds]
Out[24]:
In [25]:
# print the top few rows for each of the models
import pprint
pred_dfs = [pred.table.head() for pred in preds]
pprint.pprint(pred_dfs)
Looks like the probabilities here aren't exactly the same, but they are directionally consistent: if you chose the class with the highest probability for each row, you'd end up with the same predictions from all of the models. This makes sense, because all of the models performed well.
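To check that, you could take the highest-probability class for each row and compare across models. A minimal sketch, assuming the first column of each pred.table is the primary key and the remaining columns are class probabilities:
# For each model, pick the class whose probability column is largest in each row
top_classes = [pred.table.iloc[:, 1:].idxmax(axis=1) for pred in preds]
# Fraction of rows where each model's top class matches the stacking model's
[np.mean(tc.values == top_classes[0].values) for tc in top_classes]
Another CivisML 2.0 improvement is portability: you can pull the trained scikit-learn estimator out of the ModelFuture and use it right in the notebook.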
In [26]:
train_stack = trains[0] # Get the ModelFuture for the stacking model
trained_model = train_stack.estimator
This Pipeline contains all of the steps CivisML used to train the model, from ETL to the model itself. We can print each step individually to get a better sense of what is going on.
In [27]:
# print each of the estimators in the pipeline, separated by newlines for readability
for step in train_stack.estimator.steps:
print(step[1])
print('\n')
Now we can see that there are three steps: the DataFrameETL object we passed in, a null imputation step, and the stacking estimator itself.
We can use this outside of CivisML simply by calling .predict on the estimator. This will make predictions using the model in the notebook without using CivisML.
In [28]:
# drop the dependent variable so we don't use it to predict itself!
predictions = trained_model.predict(test_data.drop(labels=['CONTROL'], axis=1))
In [29]:
# print out the class predictions. These will be integers representing the predicted
# class rather than probabilities.
predictions
Out[29]:
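If you want class probabilities rather than class labels from the local estimator, the downloaded Pipeline also exposes predict_proba. A minimal sketch:
# Class probabilities from the downloaded pipeline, one column per class
probabilities = trained_model.predict_proba(test_data.drop(labels=['CONTROL'], axis=1))
probabilities[:5]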
Multilayer perceptrons (MLPs) are simple neural networks, and they are now built into CivisML. The MLP estimators in CivisML come from muffnn, another open source package written and maintained by Civis Analytics, built on TensorFlow. Let's fit one using hyperband.
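As an aside, muffnn's estimators follow the scikit-learn interface, so outside CivisML you could fit one directly on a fully numeric, null-free feature matrix. This is a rough sketch, and the constructor arguments (hidden_units, n_epochs) are assumptions about muffnn's API:
from muffnn import MLPClassifier

# Rough standalone sketch, separate from the CivisML workflow below;
# hidden_units and n_epochs are assumed parameter names
mlp = MLPClassifier(hidden_units=(64, 64), n_epochs=10)
# mlp.fit(X_numeric, y)  # X_numeric and y are placeholders for prepared data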
Tuning hyperparameters is a critical chore for getting an algorithm to perform at its best, but it can take a long time to run. With CivisML 2.0, you can use hyperband as an alternative to conventional grid search for hyperparameter optimization; it runs about twice as fast. While grid search runs every parameter combination for the full time, hyperband runs many combinations for a short time, keeps only the best, runs those for longer, filters again, and so on. This means that you can try more combinations in less time, so we recommend using it whenever possible. The hyperband estimator is open source and available on GitHub. You can learn about the details in the original paper, Li et al. (2016).
Right now, hyperband is implemented for several of CivisML's named preset models, including the multilayer perceptron models we'll fit below.
Unlike grid search, you don't need to specify values to search over. If you pass cross_validation_parameters='hyperband' to ModelPipeline, hyperparameter combinations will be randomly drawn from preset distributions.
In [30]:
# build a model specifying the MLP model with hyperband
model_mlp = ModelPipeline(model='multilayer_perceptron_classifier',
model_name='MLP example',
primary_key='UNITID',
dependent_variable=['CONTROL'],
cross_validation_parameters='hyperband',
etl=etl
)
train_mlp = model_mlp.train(train_data,
n_jobs=10) # parallel hyperparameter optimization and validation!
# block until the job finishes
train_mlp.result()
Out[30]:
Let's dig into the hyperband model a little bit. Like the stacking model, the model below starts with ETL and null imputation, but contains some additional steps: a step to scale the predictor variables (which improves neural network performance), and a hyperband searcher containing the MLP.
In [31]:
for step in train_mlp.estimator.steps:
print(step[1])
print('\n')
HyperbandSearchCV essentially works like GridSearchCV. If you want to get the best estimator without all of the extra CV information, you can access it using the best_estimator_ attribute.
In [32]:
train_mlp.estimator.steps[3][1].best_estimator_
Out[32]:
To see how well the best model performed, you can look at the best_score_ attribute.
In [33]:
train_mlp.estimator.steps[3][1].best_score_
Out[33]:
And to look at information about the different hyperparameter configurations that were tried, you can check cv_results_.
In [34]:
train_mlp.estimator.steps[3][1].cv_results_
Out[34]:
Just like any other model in CivisML, we can use hyperband-tuned models to make predictions by calling .predict() on the ModelPipeline.
In [35]:
predict_mlp = model_mlp.predict(test_data)
In [36]:
predict_mlp.table.head()
Out[36]:
It looks like this model is predicting the same categories as the models we tried earlier, so we can feel very confident about those predictions.
We're excited to see what problems you solve with these new capabilities. If you have any problems or questions, contact us at support@civisanalytics.com. Happy modeling!