Example CivisML Notebook

Setup

This notebook highlights CivisML features, with example code and commentary using sample data from a company called Brandable. We'll start by outlining the model types and parameters available to users, then work through training and prediction.


In [1]:
from civis.ml import ModelPipeline
from civis import APIClient, find
client = APIClient()

# dynamically look up the name of a Redshift database we have credentials for
creds = client.credentials.list()
dbs = [db for db in find(creds, type='Database')
       if 'redshift' in db.name.lower()]
db_name = dbs[0].name

In the first example, we'll use a random forest model from scikit-learn, grid searching over the maximum tree depth and the number of trees to best predict which users upgrade from the free version to a premium service. Our data live in an AWS Redshift database on Civis Platform.


In [2]:
model = ModelPipeline('random_forest_classifier',
                      dependent_variable='upgrade',
                      primary_key='brandable_user_id',
                      model_name='Brandable "upgrade" CivisML model',
                      excluded_columns=['residential_zip'],
                      cross_validation_parameters={"max_depth": [2, 3, 5],
                                                   "n_estimators": [50, 100, 500]})

In [3]:
from civis.io import read_civis
df = read_civis(table='sample_project.brandable_training_set',
                database=db_name,
                use_pandas=True)

In [4]:
print('Data has dimensions: {}'.format(df.shape))
df.head()


Data has dimensions: (1000, 88)
Out[4]:
brandable_user_id upgrade residential_zip census10_county_is_in_msa census10_is_in_place census10_place_is_in_principal_city census10_pct_under18 census10_pct_18plus census10_pct_18to34 census10_pct_35to64 ... acs14_pct_commute_over90min acs14_pct_vehicle_available acs14_pct_enrolled_in_higher_ed acs14_pct_educ_no_hs acs14_pct_educ_bachelors acs14_pct_speak_only_english acs14_pct_in_labor_force acs14_pct_disabled total_orders weeks_as_user
0 01f3f292d7201ff 0 45011 1 0 0 0.350267 0.649733 0.128342 0.481283 ... 0.023485 0.982299 0.097506 0.085747 0.482302 0.857657 0.724399 0.031550 1 24.6
1 09d737b0cd46d3b 1 43611 1 0 0 0.207792 0.792208 0.207792 0.376623 ... 0.036976 0.995948 0.073047 0.102290 0.150891 0.948855 0.648092 0.142474 6 24.0
2 144c2c9278ae7b8 1 43515 1 0 0 0.275329 0.724671 0.210361 0.388952 ... 0.031269 0.934026 0.085178 0.099053 0.155469 0.971146 0.729113 0.095766 4 24.4
3 222074d21a2e2f0 0 43558 1 0 0 0.184358 0.815642 0.206704 0.530726 ... 0.013513 0.992593 0.038514 0.088274 0.133475 0.981920 0.702739 0.117063 2 25.1
4 22a38064bb8aeb7 0 44060 1 1 0 0.262295 0.737705 0.114754 0.549180 ... 0.020622 0.987624 0.087978 0.042164 0.367049 0.942237 0.657324 0.075758 2 24.2

5 rows × 88 columns

Training

There are many ways to train a model -- you can use a pandas.DataFrame, a CSV file stored on Civis Platform at the Files endpoint, or a Redshift table. In this example, we'll train from a pandas.DataFrame, using the ModelPipeline class to create a ModelFuture corresponding to a Civis Platform model training job.
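
The other data sources use the same method with different arguments. A minimal sketch (not executed here; the file ID is hypothetical, and the argument names follow civis-python's documented ModelPipeline.train signature, so check your installed version):


In [ ]:
# from a Redshift table
train_from_table = model.train(table_name='sample_project.brandable_training_set',
                               database_name=db_name)

# from a CSV previously uploaded to the Files endpoint
train_from_file = model.train(file_id=1234567)  # hypothetical file ID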


In [5]:
train = model.train(df)

In [6]:
train.result()  # success!


Out[6]:
{'container_id': 5676302,
 'error': None,
 'finished_at': '2017-05-08T22:18:13.000Z',
 'id': 48418170,
 'is_cancel_requested': False,
 'started_at': '2017-05-08T22:15:24.000Z',
 'state': 'succeeded'}

We can find out which hyperparameter combination was optimal and easily extract out-of-sample scoring metrics, as well as metadata.


In [7]:
train.estimator


Out[7]:
GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)), ('randomforestclassifier', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_spli...stimators=10, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'randomforestclassifier__n_estimators': [50, 100, 500], 'randomforestclassifier__max_depth': [2, 3, 5]},
       pre_dispatch='n_jobs', refit=True, return_train_score=True,
       scoring='neg_log_loss', verbose=0)

In [8]:
train.estimator.best_params_


Out[8]:
{'randomforestclassifier__max_depth': 5,
 'randomforestclassifier__n_estimators': 500}

In [9]:
train.metrics['roc_auc']


Out[9]:
0.8007542663663708

In [10]:
train.metrics.keys()  # lots of other metrics here too


Out[10]:
dict_keys(['accuracy', 'confusion_matrix', 'p_correct', 'pop_incidence_true', 'pop_incidence_pred', 'roc_auc', 'log_loss', 'brier_score', 'roc_curve', 'calibration_curve', 'deciles', 'score_histogram', 'training_histogram', 'oos_score_table'])
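
Each entry is a plain Python data structure, so you can work with it directly. For example, a sketch of plotting the out-of-sample ROC curve, assuming the roc_curve entry stores its rates under 'fpr' and 'tpr' keys (inspect train.metrics['roc_curve'] on your version to confirm):


In [ ]:
import matplotlib.pyplot as plt

roc = train.metrics['roc_curve']  # assumed keys: 'fpr', 'tpr'
plt.plot(roc['fpr'], roc['tpr'], label='random forest')
plt.plot([0, 1], [0, 1], linestyle='--', label='chance')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()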

In [11]:
train.table.head()  # out-of-sample scores


Out[11]:
upgrade_1
brandable_user_id
01f3f292d7201ff 0.044406
09d737b0cd46d3b 0.401671
144c2c9278ae7b8 0.228400
222074d21a2e2f0 0.125058
22a38064bb8aeb7 0.055437
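
As a quick sanity check, we can join these out-of-sample scores back to the labels in our training DataFrame. A sketch, assuming train.table is indexed by the primary key as shown above:


In [ ]:
# mean out-of-sample score within each true class;
# well-separated means suggest the model is discriminating
oos = train.table.join(df.set_index('brandable_user_id')['upgrade'])
oos.groupby('upgrade')['upgrade_1'].mean()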

Fitting Custom Models

Often we have intuitions about which estimation strategies might work best -- and they may not be available as preconfigured CivisML workflows. Thankfully, ModelPipeline also accepts your own scikit-learn estimators!


In [12]:
from sklearn.svm import SVC
# we need to call `predict_proba` for our predictions,
# so we set `probability=True`
est = SVC(probability=True,
          kernel='rbf')

model_custom = ModelPipeline(model=est,
                             dependent_variable='upgrade',
                             primary_key='brandable_user_id',
                             model_name='Brandable "upgrade" CivisML custom model',
                             excluded_columns=['residential_zip'])

model_custom.model


Out[12]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

This time we'll train from the sample_project.brandable_training_set table in Redshift instead of a DataFrame.


In [13]:
train_custom = model_custom.train(table_name='sample_project.brandable_training_set',
                                  database_name=db_name)


In [14]:
train_custom.result()  # wait for result


Out[14]:
{'container_id': 5676326,
 'error': None,
 'finished_at': '2017-05-08T22:18:59.000Z',
 'id': 48418500,
 'is_cancel_requested': False,
 'started_at': '2017-05-08T22:18:32.000Z',
 'state': 'succeeded'}

In [15]:
train_custom.metrics['roc_auc']


Out[15]:
0.4912858805568572
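
An AUC near 0.5 means this SVM is doing no better than chance. One plausible culprit is feature scaling: RBF-kernel SVMs are sensitive to it, and our columns mix proportions with counts like total_orders and weeks_as_user. A sketch of a retry with standardized features, assuming CivisML accepts scikit-learn Pipeline objects as custom models (model_scaled is hypothetical and not trained here):


In [ ]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# standardize features before the RBF SVM (hypothetical retry)
scaled_est = make_pipeline(StandardScaler(),
                           SVC(probability=True, kernel='rbf'))
model_scaled = ModelPipeline(model=scaled_est,
                             dependent_variable='upgrade',
                             primary_key='brandable_user_id',
                             model_name='Brandable "upgrade" scaled SVM',
                             excluded_columns=['residential_zip'])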

Prediction

We can use our ModelPipeline objects for prediction, too. Prediction accepts the same data sources as training (i.e., CSVs stored at the Files endpoint, pandas.DataFrames, and Redshift tables). Here, we'll use our original random forest -- the better-performing model -- to predict on a much larger dataset.


In [16]:
predict = model.predict(table_name='sample_project.brandable_all_users',
                        database_name=db_name,
                        output_table='sample_project.brandable_user_scores')

In [18]:
predict.table.head()


Out[18]:
upgrade_1
brandable_user_id
00214b9181f2347 0.488163
004ac2d6147bdcd 0.129830
004df4f87236346 0.692678
0064ab441715d02 0.358899
00691fc4caa7f29 0.184070
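
Because we passed output_table, the scores were also written to sample_project.brandable_user_scores in Redshift, so they can be queried there directly or pulled back down with the read_civis call from earlier:


In [ ]:
scores = read_civis(table='sample_project.brandable_user_scores',
                    database=db_name,
                    use_pandas=True)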

Recreating CivisML models

It's also straightforward to recreate a model you've previously trained from its job ID. In this example, we'll recreate the model associated with our earlier training run.


In [19]:
old_model = ModelPipeline.from_existing(train.job_id)
old_model.model_name  # same as before!


Out[19]:
'Brandable "upgrade" CivisML model'
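
The recreated pipeline behaves just like the original, so you can score new data without retraining. A sketch, reusing the prediction table from above:


In [ ]:
predict_again = old_model.predict(table_name='sample_project.brandable_all_users',
                                  database_name=db_name)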