In [2]:
from IPython import display
URL = "https://github.com/rhiever/tpot"
display.IFrame(URL, 1000, 1000)


TPOT uses a genetic algorithm (implemented with the DEAP library) to search for a well-performing pipeline for a classification or regression task.

What is a pipeline?

A pipeline is composed of preprocessors, such as:

  • take polynomial transformations of features
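
To make the polynomial preprocessor concrete, here is a minimal pure-Python sketch of a degree-2 feature expansion. The function name `polynomial_features` is hypothetical and this is not TPOT's code; the output ordering is chosen to mirror what scikit-learn's `PolynomialFeatures` is documented to produce (bias, then linear terms, then products).

```python
from itertools import combinations_with_replacement


def polynomial_features(row, degree=2):
    """Expand one row of features into all monomials up to `degree`."""
    features = []
    for d in range(degree + 1):
        # Each combination of column indices (with repetition) is one monomial.
        for combo in combinations_with_replacement(range(len(row)), d):
            term = 1.0
            for i in combo:
                term *= row[i]
            features.append(term)
    return features


# Two input features a=2, b=3 become: 1, a, b, a^2, a*b, b^2
print(polynomial_features([2.0, 3.0]))
```

A preprocessor like this only reshapes the inputs; in a pipeline its output is passed on to the next step, typically a model.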

TPOTBase is the key class.

Parameters:

  • population_size: int (default: 100) The number of pipelines in the genetic algorithm population. Must be > 0. The more pipelines in the population, the slower TPOT will run, but it's also more likely to find better pipelines.
  • generations: int (default: 100) The number of generations to run pipeline optimization for. Must be > 0. The more generations you give TPOT to run, the longer it takes, but it's also more likely to find better pipelines.
  • mutation_rate: float (default: 0.9) The mutation rate for the genetic programming algorithm in the range [0.0, 1.0]. This tells the genetic programming algorithm how many pipelines to apply random changes to every generation. We don't recommend that you tweak this parameter unless you know what you're doing.
  • crossover_rate: float (default: 0.05) The crossover rate for the genetic programming algorithm in the range [0.0, 1.0]. This tells the genetic programming algorithm how many pipelines to "breed" every generation. We don't recommend that you tweak this parameter unless you know what you're doing.
  • scoring: function or str Function used to evaluate the quality of a given pipeline for the problem. By default, balanced class accuracy is used for classification problems, mean squared error for regression problems. TPOT assumes that this scoring function should be maximized, i.e., higher is better. Offers the same options as sklearn.cross_validation.cross_val_score: ['accuracy', 'adjusted_rand_score', 'average_precision', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc']
  • num_cv_folds: int (default: 3) The number of folds to evaluate each pipeline over in k-fold cross-validation during the TPOT pipeline optimization process.
  • max_time_mins: int (default: None) How many minutes TPOT has to optimize the pipeline. If not None, this setting will override the generations parameter.
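
To show how population_size, generations, mutation_rate and crossover_rate interact, here is a toy genetic algorithm over bit strings. It is only a sketch under stated assumptions: TPOT evolves pipeline trees through DEAP, whereas this example evolves lists of 0/1, and the function `evolve`, the tournament selection of size 2, and the "crossover or else mutation" rule are all choices made for illustration, not TPOT's implementation.

```python
import random


def evolve(fitness, genome_length, population_size=100, generations=100,
           mutation_rate=0.9, crossover_rate=0.05, seed=0):
    """Toy GA: maximize `fitness` over bit lists (needs population_size >= 2
    and genome_length >= 2)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(population_size)]
    best = max(population, key=fitness)

    def select():
        # Tournament of two: the fitter of two random individuals wins.
        return max(rng.sample(population, 2), key=fitness)

    for _ in range(generations):
        offspring = []
        for _ in range(population_size):
            child = list(select())
            r = rng.random()
            if r < crossover_rate:
                # "Breed": splice the child with a second parent at a cut point.
                mate = select()
                cut = rng.randrange(1, genome_length)
                child = child[:cut] + mate[cut:]
            elif r < crossover_rate + mutation_rate:
                # Random change: flip one bit.
                i = rng.randrange(genome_length)
                child[i] = 1 - child[i]
            offspring.append(child)
        population = offspring
        # Keep the best individual seen so far across generations.
        best = max(population + [best], key=fitness)
    return best


# Fitness is the number of ones, so higher is better.
best = evolve(sum, genome_length=10, population_size=20, generations=30)
print(best, sum(best))
```

Each generation costs population_size fitness evaluations, which is why both parameters trade run time for search quality. In TPOT each evaluation is a cross-validated pipeline fit, so the cost per individual is far higher than in this toy.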

TPOTClassifier and TPOTRegressor inherit from the parent class TPOTBase and differ mainly in their default scoring function.
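
The inheritance pattern can be sketched as follows. All class and method names here (`BaseOptimizer`, `ClassifierOptimizer`, `RegressorOptimizer`, `_scoring_function`) are hypothetical stand-ins, not TPOT's real code. The sketch assumes balanced accuracy means the average of per-class recall (one common definition) and that mean squared error is negated so that higher is still better.

```python
class BaseOptimizer(object):
    """Hypothetical shared base class; subclasses only supply the metric."""

    def score(self, y_true, y_pred):
        # Convention: a higher score always means a better pipeline.
        return self._scoring_function(y_true, y_pred)


class ClassifierOptimizer(BaseOptimizer):
    def _scoring_function(self, y_true, y_pred):
        # Balanced accuracy as the mean of per-class recall (assumed definition).
        recalls = []
        for label in sorted(set(y_true)):
            idx = [i for i, y in enumerate(y_true) if y == label]
            hits = sum(1 for i in idx if y_pred[i] == label)
            recalls.append(hits / float(len(idx)))
        return sum(recalls) / len(recalls)


class RegressorOptimizer(BaseOptimizer):
    def _scoring_function(self, y_true, y_pred):
        # Negated mean squared error, so that maximizing reduces the error.
        mse = sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / float(len(y_true))
        return -mse


# Predicting only the majority class gets 75% plain accuracy but 0.5 here.
print(ClassifierOptimizer().score([0, 0, 0, 1], [0, 0, 0, 0]))
print(RegressorOptimizer().score([1.0, 2.0], [1.0, 4.0]))
```

A class-balanced metric matters for data like the set used below, where one class is much rarer than the other.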


In [1]:
!sudo pip install deap update_checker tqdm xgboost tpot


Collecting deap
  Downloading deap-1.0.2.post2.tar.gz (852kB)
    100% |################################| 856kB 722kB/s 
Collecting update-checker
  Downloading update_checker-0.12-py2.py3-none-any.whl
Collecting tqdm
  Downloading tqdm-4.8.4-py2.py3-none-any.whl
Requirement already satisfied (use --upgrade to upgrade): requests>=2.3.0 in /usr/local/lib/python2.7/dist-packages (from update-checker)
Building wheels for collected packages: deap
  Running setup.py bdist_wheel for deap
  Stored in directory: /root/.cache/pip/wheels/c9/9c/cd/d52106f0148e675df35718c0efff2ecf03cc86d5bdcfb91db5
Successfully built deap
Installing collected packages: deap, update-checker, tqdm
Successfully installed deap-1.0.2 tqdm-4.8.4 update-checker-0.12
You are using pip version 7.1.2, however version 8.1.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

In [3]:
import pandas as pd 
import numpy as np
import psycopg2 
import os
import json
from tpot import TPOTClassifier
from sklearn.metrics import classification_report


/usr/local/lib/python2.7/dist-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
  warnings.warn(self.msg_depr % (key, alt_key))

In [4]:
conn = psycopg2.connect(
    user = os.environ['REDSHIFT_USER']
    ,password = os.environ['REDSHIFT_PASS']    
    ,port = os.environ['REDSHIFT_PORT']
    ,host = os.environ['REDSHIFT_HOST']
    ,database = 'tradesy'
)
query = """
    select 
        purchase_dummy
        ,shipping_price_ratio
        ,asking_price
        ,price_level
        ,brand_score
        ,brand_size
        ,a_over_b
        ,favorite_count
        ,has_blurb
        ,has_image
        ,seasonal_component
        ,description_length
        ,product_category_accessories
        ,product_category_shoes
        ,product_category_bags
        ,product_category_tops
        ,product_category_dresses
        ,product_category_weddings
        ,product_category_bottoms
        ,product_category_outerwear
        ,product_category_jeans
        ,product_category_activewear
        ,product_category_suiting
        ,product_category_swim
        
    from saleability_model_v2
     
    limit 50000
    
"""

df = pd.read_sql(query, conn)

In [5]:
target = 'purchase_dummy'
# Feature columns: everything except the target (a list works on Python 2 and 3)
domain = [col for col in df.columns if col != target]
df = df.astype(float)

y_all = df[target].values
X_all = df[domain].values

idx_all = np.random.RandomState(1).permutation(len(y_all))
idx_train = idx_all[:int(.8 * len(y_all))]
idx_test = idx_all[int(.8 *  len(y_all)):]

# TRAIN AND TEST DATA
X_train = X_all[idx_train]
y_train = y_all[idx_train]
X_test = X_all[idx_test]
y_test = y_all[idx_test]

Sklearn model:


In [6]:
from sklearn.ensemble import RandomForestClassifier
sklearn_model = RandomForestClassifier()
sklearn_model.fit(X_train, y_train)


Out[6]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [7]:
sklearn_predictions = sklearn_model.predict(X_test)
print(classification_report(y_test, sklearn_predictions))


             precision    recall  f1-score   support

        0.0       0.86      0.96      0.91      8260
        1.0       0.60      0.27      0.37      1740

avg / total       0.82      0.84      0.82     10000

TPOT Classifier


In [14]:
# max_time_mins overrides generations: the search stops after about 10 minutes
tpot_model = TPOTClassifier(generations=3, population_size=10, verbosity=2, max_time_mins=10)
tpot_model.fit(X_train, y_train)


GP Progress:  90%|█████████ | 18/20 [09:47<01:56, 58.40s/pipeline]
Generation 1 - Current best internal CV score: 0.647821506914
Generation 2 - Current best internal CV score: 0.647821506914
GP Progress:  70%|███████   | 21/30 [10:09<06:28, 43.13s/pipeline]
GP closed prematurely - will use current best pipeline
                                                                  

Best pipeline: XGBClassifier(input_matrix, 32, 6, 0.48999999999999999, 27.0)


In [15]:
tpot_predictions = tpot_model.predict(X_test)
print(classification_report(y_test, tpot_predictions))


             precision    recall  f1-score   support

        0.0       0.88      0.93      0.90      8260
        1.0       0.54      0.39      0.45      1740

avg / total       0.82      0.84      0.82     10000
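
The two reports differ mainly on the minority class (1.0). As a quick check, the sketch below recomputes the F1 score and the mean of the two class recalls from the rounded precision and recall values printed in the reports. The helper names `f1` and `mean_recall` are hypothetical, and because the inputs are already rounded to two decimals the results are approximate.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


def mean_recall(recall_class_0, recall_class_1):
    """Average of the per-class recalls (one definition of balanced accuracy)."""
    return (recall_class_0 + recall_class_1) / 2.0


# Values copied from the classification reports (class 1.0 rows, plus class 0.0 recall).
reports = {
    "random forest": {"precision": 0.60, "recall": 0.27, "recall_0": 0.96},
    "tpot": {"precision": 0.54, "recall": 0.39, "recall_0": 0.93},
}

for name in sorted(reports):
    r = reports[name]
    print("%s: F1(class 1.0)=%.2f, mean recall=%.3f" % (
        name, f1(r["precision"], r["recall"]), mean_recall(r["recall_0"], r["recall"])))
```

The recomputed F1 values match the reported 0.37 and 0.45. The TPOT pipeline gives up some precision on class 1.0 in exchange for higher recall, while the weighted averages in the last row stay the same for both models.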

Export Pseudo Pipeline Code


In [17]:
tpot_model.export('optimal-saleability-model.py')

In [18]:
!cat optimal-saleability-model.py


import numpy as np

from sklearn.cross_validation import train_test_split
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer
from xgboost import XGBClassifier

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)
features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1), tpot_data.dtype.names.index('class'), axis=1)
training_features, testing_features, training_classes, testing_classes = \
    train_test_split(features, tpot_data['class'], random_state=42)

exported_pipeline = make_pipeline(
    XGBClassifier(learning_rate=0.49, max_depth=10, min_child_weight=6, n_estimators=500, subsample=1.0)
)

exported_pipeline.fit(training_features, training_classes)
results = exported_pipeline.predict(testing_features)
