from IPython import display
TPOT uses a genetic algorithm (implemented with the DEAP library) to search for a well-performing pipeline for a classification or regression task.

What is a pipeline?

A pipeline is composed of preprocessors, for example:

  • take polynomial transformations of features
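
To make the polynomial transformation concrete, here is a minimal sketch in plain Python. The helper name `polynomial_features` is hypothetical and this is not TPOT's own code; it only illustrates what such a preprocessor does: it expands each row of features into all monomials up to a given degree.

```python
from itertools import combinations_with_replacement


def polynomial_features(row, degree=2):
    """Return all monomials of the input features up to `degree`, bias term first."""
    expanded = []
    for d in range(degree + 1):
        # Each combination of feature indices (with repetition) is one monomial.
        for combo in combinations_with_replacement(range(len(row)), d):
            term = 1.0
            for i in combo:
                term *= row[i]
            expanded.append(term)
    return expanded


# Features a=2, b=3 expand to [1, a, b, a^2, a*b, b^2].
print(polynomial_features([2.0, 3.0]))  # [1.0, 2.0, 3.0, 4.0, 6.0, 9.0]
```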

TPOTBase is the key class. Its main parameters are:


  • population_size: int (default: 100) The number of pipelines in the genetic algorithm population. Must be > 0. The more pipelines in the population, the slower TPOT will run, but it's also more likely to find better pipelines.

  • generations: int (default: 100) The number of generations to run pipeline optimization for. Must be > 0. The more generations you give TPOT to run, the longer it takes, but it's also more likely to find better pipelines.
  • mutation_rate: float (default: 0.9) The mutation rate for the genetic programming algorithm in the range [0.0, 1.0]. This tells the genetic programming algorithm how many pipelines to apply random changes to every generation. We don't recommend that you tweak this parameter unless you know what you're doing.
  • crossover_rate: float (default: 0.05) The crossover rate for the genetic programming algorithm in the range [0.0, 1.0]. This tells the genetic programming algorithm how many pipelines to "breed" every generation. We don't recommend that you tweak this parameter unless you know what you're doing.
  • scoring: function or str Function used to evaluate the quality of a given pipeline for the problem. By default, balanced class accuracy is used for classification problems, mean squared error for regression problems. TPOT assumes that this scoring function should be maximized, i.e., higher is better. Offers the same options as sklearn.cross_validation.cross_val_score: ['accuracy', 'adjusted_rand_score', 'average_precision', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc']
  • num_cv_folds: int (default: 3) The number of folds to evaluate each pipeline over in k-fold cross-validation during the TPOT pipeline optimization process
  • max_time_mins: int (default: None) How many minutes TPOT has to optimize the pipeline. If not None, this setting will override the generations parameter.
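
The roles of population_size, generations, mutation_rate and crossover_rate can be sketched with a toy genetic algorithm. This is an illustration only, not TPOT's or DEAP's implementation: TPOT evolves pipeline trees, while this sketch evolves bit lists. The function name `evolve`, the tournament selection, and the rule that each individual is either crossed over, mutated, or copied are all assumptions made for the example.

```python
import random


def evolve(fitness, genome_length=8, population_size=10, generations=3,
           mutation_rate=0.9, crossover_rate=0.05, seed=1):
    """Toy genetic algorithm over bit lists; higher fitness is better.

    Assumes population_size >= 2 and genome_length >= 2.
    """
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(population_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # Tournament selection: the fitter of two random individuals becomes a parent.
        parents = [max(rng.sample(population, 2), key=fitness)
                   for _ in range(population_size)]
        offspring = []
        for parent in parents:
            child = list(parent)
            r = rng.random()
            if r < crossover_rate:
                # "Breed": splice the head of this parent onto the tail of a mate.
                mate = rng.choice(parents)
                cut = rng.randrange(1, genome_length)
                child = child[:cut] + mate[cut:]
            elif r < crossover_rate + mutation_rate:
                # Random change: flip one bit.
                pos = rng.randrange(genome_length)
                child[pos] = 1 - child[pos]
            offspring.append(child)
        population = offspring
        # Keep track of the best individual seen so far.
        best = max(population + [best], key=fitness)
    return best


# Toy fitness: the number of ones in the genome.
print(evolve(fitness=sum))
```

Each generation evaluates population_size individuals, so the total work grows roughly with population_size × generations, which is why larger values run slower.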

TPOTClassifier and TPOTRegressor inherit from the parent class TPOTBase and differ from it in their scoring function.
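
The inheritance pattern can be sketched as follows. The class names `SketchBase`, `SketchClassifier` and `SketchRegressor` and the scoring label strings are hypothetical stand-ins, not TPOT's real code; the sketch only shows how subclasses can share one constructor while overriding the default scoring (balanced accuracy for classification, mean squared error for regression, as described in the parameter list).

```python
class SketchBase(object):
    """Hypothetical stand-in for a shared base class; not TPOT's real code."""
    default_scoring = None

    def __init__(self, population_size=100, generations=100, scoring=None):
        if population_size <= 0 or generations <= 0:
            raise ValueError('population_size and generations must be > 0')
        self.population_size = population_size
        self.generations = generations
        # A user-supplied scoring overrides the subclass default.
        self.scoring = scoring if scoring is not None else self.default_scoring


class SketchClassifier(SketchBase):
    default_scoring = 'balanced_accuracy'   # illustrative label


class SketchRegressor(SketchBase):
    default_scoring = 'mean_squared_error'  # illustrative label


print(SketchClassifier().scoring)
print(SketchRegressor(scoring='r2').scoring)
```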

!sudo pip install deap update_checker tqdm xgboost tpot

In [3]:
import pandas as pd 
import numpy as np
import psycopg2 
import os
import json
from tpot import TPOTClassifier
from sklearn.metrics import classification_report

In [4]:
conn = psycopg2.connect(
    user = os.environ['REDSHIFT_USER']
    ,password = os.environ['REDSHIFT_PASS']
    ,port = os.environ['REDSHIFT_PORT']
    ,host = os.environ['REDSHIFT_HOST']
    ,database = 'tradesy'
)

# The select list was lost in the export; "select *" is an assumption.
query = """
    select *
    from saleability_model_v2
    limit 50000
"""

df = pd.read_sql(query, conn)

In [5]:
target = 'purchase_dummy'
# Feature columns: every column except the target.
domain = [col for col in df.columns if col != target]
df = df.astype(float)

y_all = df[target].values
X_all = df[domain].values

idx_all = np.random.RandomState(1).permutation(len(y_all))
idx_train = idx_all[:int(.8 * len(y_all))]
idx_test = idx_all[int(.8 *  len(y_all)):]

X_train = X_all[idx_train]
y_train = y_all[idx_train]
X_test = X_all[idx_test]
y_test = y_all[idx_test]

Sklearn model:

In [6]:
from sklearn.ensemble import RandomForestClassifier
sklearn_model = RandomForestClassifier()
sklearn_model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,

In [7]:
sklearn_predictions = sklearn_model.predict(X_test)
print classification_report(y_test, sklearn_predictions)

             precision    recall  f1-score   support

        0.0       0.86      0.96      0.91      8260
        1.0       0.60      0.27      0.37      1740

avg / total       0.82      0.84      0.82     10000

TPOT Classifier

In [14]:
tpot_model = TPOTClassifier(generations=3, population_size=10, verbosity=2, max_time_mins=10)
tpot_model.fit(X_train, y_train)

GP Progress:  90%|█████████ | 18/20 [09:47<01:56, 58.40s/pipeline]
Generation 1 - Current best internal CV score: 0.647821506914
Generation 2 - Current best internal CV score: 0.647821506914
GP Progress:  70%|███████   | 21/30 [10:09<06:28, 43.13s/pipeline]
GP closed prematurely - will use current best pipeline

Best pipeline: XGBClassifier(input_matrix, 32, 6, 0.48999999999999999, 27.0)

In [15]:
tpot_predictions = tpot_model.predict(X_test)
print classification_report(y_test, tpot_predictions)

             precision    recall  f1-score   support

        0.0       0.88      0.93      0.90      8260
        1.0       0.54      0.39      0.45      1740

avg / total       0.82      0.84      0.82     10000
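
The F1 scores for the positive class (1.0) in the two reports can be rechecked from the printed precision and recall, since F1 is their harmonic mean. The helper name `f1_from` is hypothetical; the inputs are the rounded values shown in the reports, so the result is only accurate to the printed two decimals.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# (precision, recall) for class 1.0, copied from the two reports.
reports = {
    'RandomForestClassifier': (0.60, 0.27),
    'TPOTClassifier': (0.54, 0.39),
}

for name in sorted(reports):
    p, r = reports[name]
    print('%s: F1 = %.2f' % (name, f1_from(p, r)))
# RandomForestClassifier: F1 = 0.37
# TPOTClassifier: F1 = 0.45
```

The TPOT pipeline trades some precision on the positive class (0.60 to 0.54) for higher recall (0.27 to 0.39), which is what raises its F1 from 0.37 to 0.45.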

Export Pseudo Pipeline Code

import numpy as np

from sklearn.cross_validation import train_test_split
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer
from xgboost import XGBClassifier

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)
features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1), tpot_data.dtype.names.index('class'), axis=1)
training_features, testing_features, training_classes, testing_classes = \
    train_test_split(features, tpot_data['class'], random_state=42)

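exported_pipeline = make_pipeline(
    XGBClassifier(learning_rate=0.49, max_depth=10, min_child_weight=6, n_estimators=500, subsample=1.0)
)

# Fit on the training split, then predict on the held-out split.
exported_pipeline.fit(training_features, training_classes)
results = exported_pipeline.predict(testing_features)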

