In [1]:
# Enable logging
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
Drain workflows consist of drain.step.Step objects. Take, for example, the drain.data.ClassificationData step:
In [2]:
import drain.data
data = drain.data.ClassificationData(target=True, n_samples=1000, n_features=100)
This step calls the sklearn.datasets.make_classification function to generate a dataset with a binary outcome. We can run the step by calling its execute method:
In [3]:
data.execute()
Out[3]:
The result is a dictionary containing a standard set of objects that drain uses for machine learning workflows:

- X is a matrix of features, also called a design matrix
- y is a vector of outcomes
- train is a binary vector indicating the rows of X which are in the training set
- test is a binary vector indicating the rows of X which are in the test set

Let's add another step to our workflow to construct a random forest estimator:
In [4]:
import drain.model, drain.step
estimator = drain.step.Construct('sklearn.ensemble.RandomForestClassifier', n_estimators=1)
The Construct step simply constructs an instance of the specified class with the given arguments:
In [5]:
estimator.execute()
Out[5]:
Next we add another step to fit this estimator on our previously generated dataset:
In [6]:
fit = drain.model.Fit(inputs=[estimator, data], return_estimator=True, return_feature_importances=True)
Note the special inputs argument. This argument is a collection of steps whose results Fit takes as input.
In [7]:
fit.execute()
Out[7]:
The Fit step returns the fitted estimator object as well as a dataframe containing the names of features and their importances.
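In plain sklearn terms, what Fit produces can be sketched as follows. The train mask and feature names here are made up for illustration, and drain's actual output format may differ.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-ins for the data step's result (hypothetical 750/250 split).
X, y = make_classification(n_samples=1000, n_features=100, random_state=0)
train = np.arange(len(y)) < 750

# Fit on the training rows only.
clf = RandomForestClassifier(n_estimators=1, random_state=0)
clf.fit(X[train], y[train])

# Feature importances as a dataframe, sorted most important first.
importances = pd.DataFrame({
    'feature': ['x%d' % i for i in range(X.shape[1])],
    'importance': clf.feature_importances_,
}).sort_values('importance', ascending=False)
```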
Let's add one final step to our pipeline to generate predictions on the test set of our classification data:
In [8]:
predict = drain.model.Predict(inputs=[fit, data])
predict.execute()
Out[8]:
The Predict step returns a dataframe with a score column containing the predictions of the estimator and a true column containing the true outcomes.
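Again in plain sklearn terms, here is a sketch of what such a score/true dataframe contains; the split and the use of predict_proba are illustrative assumptions, not drain's confirmed internals.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=100, random_state=0)
train = np.arange(len(y)) < 750  # hypothetical train/test split
test = ~train

clf = RandomForestClassifier(n_estimators=1, random_state=0).fit(X[train], y[train])

# score: estimated probability of the positive class; true: actual outcome.
y_df = pd.DataFrame({
    'score': clf.predict_proba(X[test])[:, 1],
    'true': y[test],
})
```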
The drain.model module contains a variety of metrics which can be run directly on the predict object:
In [9]:
drain.model.auc(predict)
Out[9]:
In [10]:
drain.model.baseline(predict)
Out[10]:
In [11]:
drain.model.precision(predict, k=10)
Out[11]:
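These metric names suggest standard definitions: AUC as the area under the ROC curve, baseline (presumably) as the overall positive rate, and precision at k as the fraction of true positives among the k highest-scoring predictions. A sketch under those assumptions, using synthetic score/true pairs in place of a Predict result:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

# Synthetic score/true pairs standing in for a Predict result.
rng = np.random.RandomState(0)
y_df = pd.DataFrame({'true': rng.randint(0, 2, 200), 'score': rng.rand(200)})

auc = roc_auc_score(y_df['true'], y_df['score'])  # area under the ROC curve
baseline = y_df['true'].mean()                    # positive rate
precision_at_10 = y_df.nlargest(10, 'score')['true'].mean()
```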
We can retrieve the results of any step that has been run through the get_result method:
In [12]:
predict.get_result()
Out[12]:
Let's redefine the above workflow using a function:
In [13]:
def prediction_workflow():
    # generate the data, including a training and test split
    data = drain.data.ClassificationData(target=True, n_samples=1000, n_features=100)
    # construct a random forest estimator
    estimator = drain.step.Construct('sklearn.ensemble.RandomForestClassifier', n_estimators=1)
    # fit the estimator
    fit = drain.model.Fit(inputs=[estimator, data], return_estimator=True, return_feature_importances=True)
    # make predictions
    return drain.model.Predict(inputs=[fit, data])
In [14]:
predict2 = prediction_workflow()
Note that step execution is recursive: the execute method ensures that all inputs, and inputs of inputs, and so on, have been run before running the given step:
In [15]:
predict2.execute()
Out[15]:
The steps of a workflow form a network (a directed acyclic graph or DAG, to be precise).
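A toy sketch of that structure, using named tuples in place of real steps: each node points at its inputs, and a depth-first walk visits every upstream dependency exactly once.

```python
from collections import namedtuple

# Minimal stand-in for a step: a name plus a list of input nodes.
Node = namedtuple('Node', ['name', 'inputs'])

data = Node('data', [])
estimator = Node('estimator', [])
fit = Node('fit', [estimator, data])
predict = Node('predict', [fit, data])  # data is shared by fit and predict

def upstream(node, seen=None):
    """Yield the names of all upstream dependencies, each exactly once."""
    seen = set() if seen is None else seen
    for i in node.inputs:
        if i.name not in seen:
            seen.add(i.name)
            yield from upstream(i, seen)
            yield i.name

print(list(upstream(predict)))  # ['estimator', 'data', 'fit']
```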
In practice we want to train many models on a given dataset. Let's define a workflow that searches over the number of trees in the random forest model:
In [16]:
def n_estimator_search():
    data = drain.data.ClassificationData(target=True, n_samples=1000, n_features=100)
    predict = []
    for n_estimators in range(1, 4):
        estimator = drain.step.Construct('sklearn.ensemble.RandomForestClassifier',
                                         n_estimators=n_estimators, name='estimator')
        fit = drain.model.Fit(inputs=[estimator, data], return_estimator=True, return_feature_importances=True)
        predict.append(drain.model.Predict(inputs=[fit, data]))
    return predict
In [17]:
predictions = n_estimator_search()
In [18]:
for p in predictions:
    p.execute()
Note that the ClassificationData step was only run once.
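That run-once behavior can be sketched with a toy cache keyed by step name; drain's real caching surely differs, but this illustrates why a shared data step executes a single time even when several branches depend on it.

```python
log = []  # records the order in which steps actually run

class Step:
    def __init__(self, name, inputs=()):
        self.name, self.inputs = name, list(inputs)

    def execute(self, cache):
        if self.name in cache:
            return cache[self.name]  # already run: reuse the cached result
        inputs = [i.execute(cache) for i in self.inputs]  # run dependencies first
        log.append(self.name)
        cache[self.name] = (self.name, inputs)
        return cache[self.name]

data = Step('data')
fits = [Step('fit%d' % n, [data]) for n in range(3)]

cache = {}
for f in fits:
    f.execute(cache)

print(log)  # ['data', 'fit0', 'fit1', 'fit2'] -- 'data' ran only once
```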
Drain provides some additional utilities for model exploration in the drain.explore module:
In [19]:
from drain import explore
df = explore.to_dataframe(predictions)
df
Out[19]:
In [20]:
from drain import model
explore.apply(df, model.auc)
Out[20]:
In [21]:
%matplotlib inline
explore.apply(df, model.precision_series).plot()
Out[21]:
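Assuming precision_series computes precision at every cutoff k (the fraction of true positives among the k top-scoring predictions), it can be sketched in pandas:

```python
import numpy as np
import pandas as pd

# Synthetic score/true pairs standing in for a Predict result.
rng = np.random.RandomState(0)
y_df = pd.DataFrame({'true': rng.randint(0, 2, 100), 'score': rng.rand(100)})

# Rank by score, then take the running fraction of positives at each cutoff k.
ranked = y_df.sort_values('score', ascending=False)['true'].values
k = np.arange(1, len(ranked) + 1)
precision = pd.Series(np.cumsum(ranked) / k, index=k, name='precision')
```

At k equal to the full dataset size, precision reduces to the baseline positive rate, which is a handy sanity check when plotting the series.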