This notebook highlights CivisML features with example code and commentary, using sample data from a company called Brandable. We'll start by outlining the available model types and parameters before working through training and prediction.
In [1]:
from civis.ml import ModelPipeline
from civis import APIClient, find
client = APIClient()
# dynamically get database name
creds = client.credentials.list()
dbs = [db for db in find(creds, type='Database')
       if 'redshift' in db.name.lower()]
db_name = dbs[0].name
In the first example, we'll use a random forest model from scikit-learn. We'll grid search over two hyperparameters, the maximum depth of each tree and the number of trees, to best predict which users choose to upgrade from the free version to a premium service. Our data are in an AWS Redshift database on Civis Platform.
In [2]:
model = ModelPipeline('random_forest_classifier',
                      dependent_variable='upgrade',
                      primary_key='brandable_user_id',
                      model_name='Brandable "upgrade" CivisML model',
                      excluded_columns=['residential_zip'],
                      cross_validation_parameters={"max_depth": [2, 3, 5],
                                                   "n_estimators": [50, 100, 500]})
In [3]:
from civis.io import read_civis
df = read_civis(table='sample_project.brandable_training_set',
                database=db_name,
                use_pandas=True)
In [4]:
print('Data has dimensions: {}'.format(df.shape))
df.head()
Out[4]:
There are many ways to train a model: you can use a pandas.DataFrame, a CSV file stored on Civis Platform at the Files endpoint, or a Redshift table. In this example, we'll walk through training from a pandas.DataFrame. Calling train on our ModelPipeline returns a ModelFuture corresponding to a Civis Platform model training job.
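For reference, the other two entry points look roughly like this. This is a sketch rather than a cell we run here; the table form is demonstrated later in this notebook, and the file form assumes a `file_id` keyword pointing at a CSV already uploaded to the Files endpoint (e.g., with civis.io.file_to_civis).

# Sketch: train from a Redshift table instead of a DataFrame
# (this form appears later in this notebook)
future = model.train(table_name='sample_project.brandable_training_set',
                     database_name=db_name)
# Sketch: train from a CSV on the Files endpoint; `file_id` is assumed
# to be an ID returned by, e.g., civis.io.file_to_civis
future = model.train(file_id=file_id)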
In [5]:
train = model.train(df)
In [6]:
train.result() # success!
Out[6]:
We can find out which hyperparameter combination was optimal and easily extract out-of-sample scoring metrics, as well as metadata.
In [7]:
train.estimator
Out[7]:
In [8]:
train.estimator.best_params_
Out[8]:
In [9]:
train.metrics['roc_auc']
Out[9]:
In [10]:
train.metrics.keys() # lots of other metrics here too
Out[10]:
In [11]:
train.table.head() # out-of-sample scores
Out[11]:
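So far we've used one of CivisML's named preset models. CivisML will also accept an arbitrary scikit-learn estimator; here we build a second ModelPipeline around a support vector classifier.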
In [12]:
from sklearn.svm import SVC
# We need to call `predict_proba` for our predictions,
# so we set `probability=True`.
est = SVC(probability=True, kernel='rbf')
model_custom = ModelPipeline(model=est,
                             dependent_variable='upgrade',
                             primary_key='brandable_user_id',
                             model_name='Brandable "upgrade" CivisML custom model',
                             excluded_columns=['residential_zip'])
model_custom.model
Out[12]:
In [13]:
train_custom = model_custom.train(table_name='sample_project.brandable_training_set',
                                  database_name=db_name)
This time we're training from the sample_project.brandable_training_set table in Redshift instead of a DataFrame.
In [14]:
train_custom.result() # wait for result
Out[14]:
In [15]:
train_custom.metrics['roc_auc']
Out[15]:
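With a trained model in hand, we can score new data. Here we predict upgrade probabilities for every Brandable user and write the scores back to Redshift via the output_table argument.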
In [16]:
predict = model.predict(table_name='sample_project.brandable_all_users',
                        database_name=db_name,
                        output_table='sample_project.brandable_user_scores')
In [18]:
predict.table.head()
Out[18]:
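Finally, a ModelPipeline can be rebuilt from the job ID of a past training run with ModelPipeline.from_existing, so trained models can be reused in later sessions.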
In [19]:
old_model = ModelPipeline.from_existing(train.job_id)
old_model.model_name # same as before!
Out[19]:
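The reloaded pipeline can score new data just like the original. As a sketch, reusing the all-users table from above:

# Score with the reloaded model; equivalent to predicting with `model` above
predictions = old_model.predict(table_name='sample_project.brandable_all_users',
                                database_name=db_name)
predictions.table.head()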