Example CivisML Notebook

Setup

This notebook highlights CivisML features, with example code and commentary using sample data from a company called Brandable. We'll start by outlining the model types and parameters available to users, then work through training and prediction.


In [1]:
from civis.ml import ModelPipeline
from civis import APIClient, find
client = APIClient()

# dynamically look up the name of a Redshift database we have credentials for
creds = client.credentials.list()
dbs = [db for db in find(creds, type='Database')
       if 'redshift' in db.name.lower()]
db_name = dbs[0].name

In the first example, we'll use a random forest model from scikit-learn, grid searching over the maximum tree depth and the number of trees to best predict which users upgrade from the free version to a premium service. Our data live in an AWS Redshift database on Civis Platform.


In [2]:
model = ModelPipeline('random_forest_classifier',
                      dependent_variable='upgrade',
                      primary_key='brandable_user_id',
                      model_name='Brandable "upgrade" CivisML model',
                      excluded_columns=['residential_zip'],
                      cross_validation_parameters={"max_depth": [2, 3, 5],
                                                   "n_estimators": [50, 100, 500]})

In [3]:
from civis.io import read_civis
df = read_civis(table='sample_project.brandable_training_set',
                database=db_name,
                use_pandas=True)

In [4]:
print('Data has dimensions: {}'.format(df.shape))
df.head()


Data has dimensions: (1000, 88)
Out[4]:
brandable_user_id upgrade residential_zip census10_county_is_in_msa census10_is_in_place census10_place_is_in_principal_city census10_pct_under18 census10_pct_18plus census10_pct_18to34 census10_pct_35to64 ... acs14_pct_commute_over90min acs14_pct_vehicle_available acs14_pct_enrolled_in_higher_ed acs14_pct_educ_no_hs acs14_pct_educ_bachelors acs14_pct_speak_only_english acs14_pct_in_labor_force acs14_pct_disabled total_orders weeks_as_user
0 01f3f292d7201ff 0 45011 1 0 0 0.350267 0.649733 0.128342 0.481283 ... 0.023485 0.982299 0.097506 0.085747 0.482302 0.857657 0.724399 0.031550 1 24.6
1 09d737b0cd46d3b 1 43611 1 0 0 0.207792 0.792208 0.207792 0.376623 ... 0.036976 0.995948 0.073047 0.102290 0.150891 0.948855 0.648092 0.142474 6 24.0
2 144c2c9278ae7b8 1 43515 1 0 0 0.275329 0.724671 0.210361 0.388952 ... 0.031269 0.934026 0.085178 0.099053 0.155469 0.971146 0.729113 0.095766 4 24.4
3 222074d21a2e2f0 0 43558 1 0 0 0.184358 0.815642 0.206704 0.530726 ... 0.013513 0.992593 0.038514 0.088274 0.133475 0.981920 0.702739 0.117063 2 25.1
4 22a38064bb8aeb7 0 44060 1 1 0 0.262295 0.737705 0.114754 0.549180 ... 0.020622 0.987624 0.087978 0.042164 0.367049 0.942237 0.657324 0.075758 2 24.2

5 rows × 88 columns

Training

There are many ways to train a model -- you can use a pandas.DataFrame, a CSV file stored on Civis Platform at the Files endpoint, or a Redshift table. In this example, we'll train from a pandas.DataFrame, using the ModelPipeline class to create a ModelFuture corresponding to a Civis Platform model training job.
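
The other data sources use the same method with different arguments. A minimal sketch (not executed here; the file ID is hypothetical, and the argument names follow civis-python's documented ModelPipeline.train signature, so check your installed version):


In [ ]:
# from a Redshift table
train_from_table = model.train(table_name='sample_project.brandable_training_set',
                               database_name=db_name)

# from a CSV previously uploaded to the Files endpoint
train_from_file = model.train(file_id=1234567)  # hypothetical file ID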


In [5]:
train = model.train(df)

In [6]:
train.result()  # success!


Out[6]:
{'container_id': 5676302,
 'error': None,
 'finished_at': '2017-05-08T22:18:13.000Z',
 'id': 48418170,
 'is_cancel_requested': False,
 'started_at': '2017-05-08T22:15:24.000Z',
 'state': 'succeeded'}

We can find out which hyperparameter combination was optimal and easily extract out-of-sample scoring metrics, as well as metadata.


In [7]:
train.estimator


Out[7]:
GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)), ('randomforestclassifier', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_spli...stimators=10, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'randomforestclassifier__n_estimators': [50, 100, 500], 'randomforestclassifier__max_depth': [2, 3, 5]},
       pre_dispatch='n_jobs', refit=True, return_train_score=True,
       scoring='neg_log_loss', verbose=0)

In [8]:
train.estimator.best_params_


Out[8]:
{'randomforestclassifier__max_depth': 5,
 'randomforestclassifier__n_estimators': 500}

In [9]:
train.metrics['roc_auc']


Out[9]:
0.8007542663663708

In [10]:
train.metrics.keys()  # lots of other metrics here too


Out[10]:
dict_keys(['accuracy', 'confusion_matrix', 'p_correct', 'pop_incidence_true', 'pop_incidence_pred', 'roc_auc', 'log_loss', 'brier_score', 'roc_curve', 'calibration_curve', 'deciles', 'score_histogram', 'training_histogram', 'oos_score_table'])
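
Each entry is a plain Python data structure, so you can work with it directly. For example, a sketch of plotting the out-of-sample ROC curve, assuming the roc_curve entry stores its rates under 'fpr' and 'tpr' keys (inspect train.metrics['roc_curve'] on your version to confirm):


In [ ]:
import matplotlib.pyplot as plt

roc = train.metrics['roc_curve']  # assumed keys: 'fpr', 'tpr'
plt.plot(roc['fpr'], roc['tpr'], label='random forest')
plt.plot([0, 1], [0, 1], linestyle='--', label='chance')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()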

In [11]:
train.table.head()  # out-of-sample scores


Out[11]:
upgrade_1
brandable_user_id
01f3f292d7201ff 0.044406
09d737b0cd46d3b 0.401671
144c2c9278ae7b8 0.228400
222074d21a2e2f0 0.125058
22a38064bb8aeb7 0.055437
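
As a quick sanity check, we can join these out-of-sample scores back to the labels in our training DataFrame. A sketch, assuming train.table is indexed by the primary key as shown above:


In [ ]:
# mean out-of-sample score within each true class;
# well-separated means suggest the model is discriminating
oos = train.table.join(df.set_index('brandable_user_id')['upgrade'])
oos.groupby('upgrade')['upgrade_1'].mean()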

Fitting Custom Models

Often we have intuitions about which estimation strategies might work best -- and they may not be available as preconfigured CivisML workflows. Thankfully, ModelPipeline also accepts your own scikit-learn estimators!


In [12]:
from sklearn.svm import SVC
# we need to call `predict_proba` for our predictions,
# so we set `probability=True`
est = SVC(probability=True,
          kernel='rbf')

model_custom = ModelPipeline(model=est,
                             dependent_variable='upgrade',
                             primary_key='brandable_user_id',
                             model_name='Brandable "upgrade" CivisML custom model',
                             excluded_columns=['residential_zip'])

model_custom.model


Out[12]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

This time we'll train from the sample_project.brandable_training_set table in Redshift instead of a DataFrame.


In [13]:
train_custom = model_custom.train(table_name='sample_project.brandable_training_set',
                                  database_name=db_name)


In [14]:
train_custom.result()  # wait for result


Out[14]:
{'container_id': 5676326,
 'error': None,
 'finished_at': '2017-05-08T22:18:59.000Z',
 'id': 48418500,
 'is_cancel_requested': False,
 'started_at': '2017-05-08T22:18:32.000Z',
 'state': 'succeeded'}

In [15]:
train_custom.metrics['roc_auc']


Out[15]:
0.4912858805568572
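
An AUC near 0.5 means this SVM is doing no better than chance. One plausible culprit is feature scaling: RBF-kernel SVMs are sensitive to it, and our columns mix proportions with counts like total_orders and weeks_as_user. A sketch of a retry with standardized features, assuming CivisML accepts scikit-learn Pipeline objects as custom models (model_scaled is hypothetical and not trained here):


In [ ]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# standardize features before the RBF SVM (hypothetical retry)
scaled_est = make_pipeline(StandardScaler(),
                           SVC(probability=True, kernel='rbf'))
model_scaled = ModelPipeline(model=scaled_est,
                             dependent_variable='upgrade',
                             primary_key='brandable_user_id',
                             model_name='Brandable "upgrade" scaled SVM',
                             excluded_columns=['residential_zip'])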

Prediction

We can use our ModelPipeline objects for prediction, too. Prediction accepts the same data sources as training (i.e., CSVs stored at the Files endpoint, pandas.DataFrames, and Redshift tables). Here, we'll use our original random forest -- the better-performing model -- to predict on a much larger dataset.


In [16]:
predict = model.predict(table_name='sample_project.brandable_all_users',
                        database_name=db_name,
                        output_table='sample_project.brandable_user_scores')

In [18]:
predict.table.head()


Out[18]:
upgrade_1
brandable_user_id
00214b9181f2347 0.488163
004ac2d6147bdcd 0.129830
004df4f87236346 0.692678
0064ab441715d02 0.358899
00691fc4caa7f29 0.184070
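
Because we passed output_table, the scores were also written to sample_project.brandable_user_scores in Redshift, so they can be queried there directly or pulled back down with the read_civis call from earlier:


In [ ]:
scores = read_civis(table='sample_project.brandable_user_scores',
                    database=db_name,
                    use_pandas=True)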

Recreating CivisML models

It's also straightforward to recreate a model you've previously trained from its job ID. In this example, we'll recreate the model associated with our earlier training run.


In [19]:
old_model = ModelPipeline.from_existing(train.job_id)
old_model.model_name  # same as before!


Out[19]:
'Brandable "upgrade" CivisML model'
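
The recreated pipeline behaves just like the original, so you can score new data without retraining. A sketch, reusing the prediction table from above:


In [ ]:
predict_again = old_model.predict(table_name='sample_project.brandable_all_users',
                                  database_name=db_name)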