pyplearnr demo

Here I demonstrate pyplearnr, a wrapper for building/training/validating scikit-learn pipelines using GridSearchCV or RandomizedSearchCV.

Quick keyword arguments give access to optional feature selection (e.g. SelectKBest), scaling (e.g. standard scaling), use of feature interactions, and data transformations (e.g. PCA, t-SNE) before being fed to a classifier/regressor.

After building the pipeline, data can be used to perform a nested (stratified if classification) k-folds cross-validation and output an object containing data from the process, including the best model.

Various default pipeline step parameters are provided for the grid search to allow quick iteration over different pipelines, with the option to ignore or override them in a flexible way.

This is an ongoing project that I intend to update with more models and pre-processing options, along with corresponding defaults.

Titanic dataset example

Here I use the Titanic dataset I've cleaned and pickled in a separate tutorial.

Import data



In [ ]:

    
import pandas as pd

df = pd.read_pickle('trimmed_titanic_data.pkl')

df.info()

By "cleaned" I mean I've derived titles (e.g. "Mr.", "Mrs.", "Dr.") from the passenger names, imputed the missing Age values using polynomial regression with grid-searched 10-fold cross-validation, filled in the 3 missing Embarked values with the mode, and removed all fields that could be considered an id for that individual.

Thus, there is no missing data.
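
As a rough illustration of two of the cleaning steps mentioned above (deriving titles and filling Embarked with the mode), here is a minimal sketch on a few toy rows. The toy frame, the regular expression, and the column handling are my assumptions for illustration; the actual cleaning code lives in the separate tutorial and may differ.

```python
import pandas as pd

# Toy rows in the style of the raw Titanic data (not the actual file)
toy = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris',
             'Cumings, Mrs. John Bradley',
             'Heikkinen, Miss. Laina',
             'Brewe, Dr. Arthur Jackson'],
    'Embarked': ['S', 'C', None, 'S'],
})

# Title = the text between the comma and the first period (assumed pattern)
toy['Title'] = toy['Name'].str.extract(r',\s*([^.]+)\.', expand=False)

# Fill missing ports with the most frequent value (the mode)
toy['Embarked'] = toy['Embarked'].fillna(toy['Embarked'].mode()[0])

# Drop the identifying column once the title has been derived
toy = toy.drop('Name', axis=1)

print(toy)
```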

Set categorical features as type 'category'



In [ ]:

    
simulation_df = df.copy()

categorical_features = ['Survived','Pclass','Sex','Embarked','Title']

for feature in categorical_features:
    simulation_df[feature] = simulation_df[feature].astype('category')
    
simulation_df.info()

One-hot encode categorical features



In [ ]:

    
simulation_df = pd.get_dummies(simulation_df,drop_first=True)

simulation_df.info()

Now we have 17 features.
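
To make the effect of drop_first=True concrete, here is a small sketch on a toy frame (the toy values are made up). Each categorical column loses its first category, which is why a 0/1 column such as Survived ends up as the single column Survived_1.

```python
import pandas as pd

# Toy frame with one numeric and two categorical columns
toy = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0],
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'S'],
})

for col in ['Sex', 'Embarked']:
    toy[col] = toy[col].astype('category')

# drop_first=True drops the first category of each categorical column
# ('female' and 'C' here), leaving one indicator column per remaining level
encoded = pd.get_dummies(toy, drop_first=True)

print(list(encoded.columns))
```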

Split into input/output data



In [ ]:

    
# Set output feature
output_feature = 'Survived_1'

# Get all column names
column_names = list(simulation_df.columns)

# Get input features
input_features = [x for x in column_names if x != output_feature]

# Split into features and responses
X = simulation_df[input_features].copy()
y = simulation_df[output_feature].copy()

Null model



In [ ]:

    
simulation_df['Survived_1'].value_counts().values/float(simulation_df['Survived_1'].value_counts().values.sum())

Thus, null accuracy of ~62% if always predict death.
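
The same quantity can be computed a little more directly with value_counts(normalize=True). This is a sketch on a made-up label vector, not the Titanic labels:

```python
import pandas as pd

# Made-up binary labels: five of class 0, three of class 1
y_toy = pd.Series([0, 0, 0, 1, 1, 0, 1, 0])

# Fraction of rows in each class
class_fractions = y_toy.value_counts(normalize=True)

# Null accuracy = accuracy of always predicting the most frequent class
null_accuracy = class_fractions.max()
majority_class = class_fractions.idxmax()

print(null_accuracy)
```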

Import data science library and initialize optimized pipeline collection



In [ ]:

    
import pyplearnr as ppl

optimized_pipelines = {}

Basic models w/ no pre-processing

KNN

Here we do a simple K-nearest neighbors (KNN) classification with stratified 10-fold (default) cross-validation with a grid search over the default of 1 to 30 nearest neighbors and the use of either "uniform" or "distance" weights:



In [ ]:

    
%%time

reload(ppl)

estimator = 'knn'

# Set pipeline keyword arguments
optimized_pipeline_kwargs = {
    'feature_selection_type': None,
    'scale_type': None,
    'feature_interactions': False,
    'transform_type': None
    }

# Initialize pipeline
optimized_pipeline = ppl.OptimizedPipeline(estimator,**optimized_pipeline_kwargs)

# Set pipeline fitting parameters
fit_kwargs = {
    'cv': 10,
    'num_parameter_combos': None,
    'n_jobs': -1,
    'random_state': 6,
    'suppress_output': True,
    'use_default_param_dist': True,
    'param_dist': None,
    'test_size': 0.2 # 20% saved as test set
}

# Fit data
optimized_pipeline.fit(X,y,**fit_kwargs)

# Save 
optimized_pipelines[estimator] = optimized_pipeline

After calling the fit() method, the OptimizedPipeline instance contains all of the data associated with the nested stratified k-folds cross-validation.

This includes the data, its test/train splits (based on the test_size percentage keyword argument), the GridSearchCV or RandomizedSearchCV object, the Pipeline object that has been retrained using all of the data with the best parameters, test/train scores, and validation metrics/reports.
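
For readers who want to see roughly what this corresponds to in plain scikit-learn, here is a minimal sketch under my own assumptions: a synthetic dataset stands in for the Titanic data, and the step name 'estimator' is inferred from the estimator__n_neighbors parameter key used later. It is not pyplearnr's actual implementation.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption; not the Titanic set)
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     random_state=6)

# Outer split: hold out 20% as a stratified test set
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=6)

pipeline = Pipeline([('estimator', KNeighborsClassifier())])

param_grid = {
    'estimator__n_neighbors': list(range(1, 31)),
    'estimator__weights': ['uniform', 'distance'],
}

# Inner loop: 10-fold CV (stratified for classifiers) on the training part
search = GridSearchCV(pipeline, param_grid, cv=10, n_jobs=1)
search.fit(X_train, y_train)

train_score = search.score(X_train, y_train)
test_score = search.score(X_test, y_test)

# Retrain a fresh copy of the best pipeline on all of the data
final_model = clone(search.best_estimator_).fit(X_demo, y_demo)

print(search.best_params_)
```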

A report can be printed immediately after the fit by setting the suppress_output keyword argument to False.

It lists the steps in the pipeline, their optimized settings, the test/training accuracy (or L2 regression score), the grid search parameters, and the best parameters.

If the estimator used is a classifier it also includes the confusion matrix, normalized confusion matrix, and a classification report containing precision/recall/f1-score for each class.
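
The classifier-specific parts of such a report can be reproduced with scikit-learn's metrics module. The sketch below uses made-up labels and assumes the normalized confusion matrix is normalized per true class (row-wise); I haven't stated above which normalization the report uses, so treat that as an assumption.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Made-up true and predicted labels
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

# Row-wise normalization: each row sums to 1 (assumed convention)
cm_normalized = cm.astype(float) / cm.sum(axis=1, keepdims=True)

# Precision/recall/f1-score for each class as a formatted string
report = classification_report(y_true, y_pred)

print(cm)
print(cm_normalized)
print(report)
```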

This same report is also accessible by printing the OptimizedPipeline class instance:



In [ ]:

    
print optimized_pipeline

Turns out that the best settings are 12 neighbors and the use of the 'uniform' weight.

Note how I've set the random_state keyword argument to 6 so that the models can be compared using the same test/train split.

The default parameters to grid-search over for k-nearest neighbors are 1 to 30 neighbors and either the 'uniform' or 'distance' weight.

The defaults for the pre-processing steps, classifiers, and regressors can be viewed by using the get_default_pipeline_step_parameters() method with the number of features as the input:



In [ ]:

    
# X.shape[1] is the number of features (columns)
pre_processing_grid_parameters,classifier_grid_parameters,regression_grid_parameters = \
optimized_pipeline.get_default_pipeline_step_parameters(X.shape[1])

classifier_grid_parameters['knn']

These default parameters can be ignored by setting the use_default_param_dist keyword argument to False.

The param_dist keyword argument can be used either to override individual default parameters (if use_default_param_dist is set to True) or as the sole source of parameters (if use_default_param_dist is set to False).
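
One way to picture the two modes is as a dictionary merge. The helper below, resolve_param_dist, is hypothetical and written only to illustrate the behavior described above; it is not a pyplearnr function.

```python
def resolve_param_dist(defaults, param_dist=None, use_default_param_dist=True):
    """Hypothetical helper: combine default and custom search parameters."""
    # Start from a copy of the defaults, or from nothing if they are ignored
    resolved = dict(defaults) if use_default_param_dist else {}
    # Custom entries replace defaults that share the same key
    resolved.update(param_dist or {})
    return resolved


knn_defaults = {
    'estimator__n_neighbors': list(range(1, 31)),
    'estimator__weights': ['uniform', 'distance'],
}
custom = {'estimator__n_neighbors': list(range(30, 500))}

# Defaults kept, n_neighbors overridden
overridden = resolve_param_dist(knn_defaults, custom, True)

# Only the custom parameters are searched
from_scratch = resolve_param_dist(knn_defaults, custom, False)

print(sorted(overridden.keys()))
print(sorted(from_scratch.keys()))
```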

Here is a demonstration in which the default parameters are generated and then overridden by those in param_dist:



In [ ]:

    
%%time

reload(ppl)

estimator = 'knn'

model_name = 'custom_override_%s'%(estimator)

# Set custom parameters
param_dist = {
    'estimator__n_neighbors': range(30,500)
}

# Set pipeline keyword arguments
optimized_pipeline_kwargs = {
    'feature_selection_type': None,
    'scale_type': None,
    'feature_interactions': False,
    'transform_type': None
    }

# Initialize pipeline
optimized_pipeline = ppl.OptimizedPipeline(estimator,**optimized_pipeline_kwargs)

# Set pipeline fitting parameters
fit_kwargs = {
    'cv': 10,
    'num_parameter_combos': None,
    'n_jobs': -1,
    'random_state': 6,
    'suppress_output': False,
    'use_default_param_dist': True,
    'param_dist': param_dist,
    'test_size': 0.2 # 20% saved as test set
}

# Fit data
optimized_pipeline.fit(X,y,**fit_kwargs)

# Save 
optimized_pipelines[model_name] = optimized_pipeline

Note how the n_neighbors parameter was 30 to 499 instead of 1 to 30.

Here's an example of only using param_dist for parameters:



In [ ]:

    
%%time

reload(ppl)

estimator = 'knn'

model_name = 'from_scratch_%s'%(estimator)

# Set custom parameters
param_dist = {
    'estimator__n_neighbors': range(10,30)
}

# Set pipeline keyword arguments
optimized_pipeline_kwargs = {
    'feature_selection_type': None,
    'scale_type': None,
    'feature_interactions': False,
    'transform_type': None
    }

# Initialize pipeline
optimized_pipeline = ppl.OptimizedPipeline(estimator,**optimized_pipeline_kwargs)

# Set pipeline fitting parameters
fit_kwargs = {
    'cv': 10,
    'num_parameter_combos': None,
    'n_jobs': -1,
    'random_state': 6,
    'suppress_output': False,
    'use_default_param_dist': False,
    'param_dist': param_dist,
    'test_size': 0.2 # 20% saved as test set
}

# Fit data
optimized_pipeline.fit(X,y,**fit_kwargs)

# Save 
optimized_pipelines[model_name] = optimized_pipeline

Note how the estimator__weights parameter isn't set for the KNN estimator.

Other models

This code currently supports K-nearest neighbors, logistic regression, support vector machines, multilayer perceptrons, random forest, and AdaBoost. We can loop through them and pick the best model like this:



In [ ]:

    
%%time

reload(ppl)

classifiers = ['knn','logistic_regression','svm',
               'multilayer_perceptron','random_forest','adaboost']

for estimator in classifiers:
    # Set pipeline keyword arguments
    optimized_pipeline_kwargs = {
        'feature_selection_type': None,
        'scale_type': None,
        'feature_interactions': False,
        'transform_type': None
        }

    # Initialize pipeline
    optimized_pipeline = ppl.OptimizedPipeline(estimator,**optimized_pipeline_kwargs)
    
    # Set pipeline fitting parameters
    fit_kwargs = {
        'cv': 10,
        'num_parameter_combos': None,
        'n_jobs': -1,
        'random_state': 6,
        'suppress_output': True,
        'use_default_param_dist': True,
        'param_dist': None,
        'test_size': 0.2
    }
    
    # Fit data
    optimized_pipeline.fit(X,y,**fit_kwargs)
    
    # Save 
    optimized_pipelines[estimator] = optimized_pipeline



In [ ]:

    
format_str = '{0:<22} {1:<15} {2:<15}'

print format_str.format(*['model','train score','test score'])
print format_str.format(*['','',''])
for x in [[key,value.train_score_,value.test_score_] for key,value in optimized_pipelines.iteritems()]:
    print format_str.format(*x)

Random forest performed the best with a test score of ~0.854.

Let's look at the report:



In [ ]:

    
print optimized_pipelines['random_forest']

The optimal value of the n_estimators parameter for the RandomForestClassifier was 96.

All models with standard scaling

We can set the scaling type using the scale_type keyword argument:



In [ ]:

    
%%time

reload(ppl)

classifiers = ['knn','logistic_regression','svm',
               'multilayer_perceptron','random_forest','adaboost']

prefix = 'scale'

for estimator in classifiers:
    # Set pipeline keyword arguments
    optimized_pipeline_kwargs = {
        'feature_selection_type': None,
        'scale_type': 'standard',
        'feature_interactions': False,
        'transform_type': None
        }

    # Initialize pipeline
    optimized_pipeline = ppl.OptimizedPipeline(estimator,**optimized_pipeline_kwargs)
    
    # Set pipeline fitting parameters
    fit_kwargs = {
        'cv': 10,
        'num_parameter_combos': None,
        'n_jobs': -1,
        'random_state': 6,
        'suppress_output': True,
        'use_default_param_dist': True,
        'param_dist': None,
        'test_size': 0.2
    }
    
    # Fit data
    optimized_pipeline.fit(X,y,**fit_kwargs)
    
    # Form name used to save optimized pipeline
    pipeline_name = '%s_%s'%(prefix,estimator)
    
    # Save 
    optimized_pipelines[pipeline_name] = optimized_pipeline



In [ ]:

    
format_str = '{0:<30} {1:<15} {2:<15}'

print format_str.format(*['model','train score','test score'])
print format_str.format(*['','',''])
for x in [[key,value.train_score_,value.test_score_] for key,value in optimized_pipelines.iteritems()]:
    print format_str.format(*x)

Random forest without scaling still appears to have the best test score, though the scaled version had closer test and train scores.
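
For reference, standard scaling as a pipeline step looks roughly like this in plain scikit-learn. The synthetic data, the step names, and the fixed n_neighbors=12 are my assumptions for illustration. Placing the scaler inside the Pipeline means it is refit on the training part of every fold, so statistics from the held-out fold don't leak into training.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with one feature on a much larger scale
X_demo, y_demo = make_classification(n_samples=200, n_features=6,
                                     random_state=6)
X_demo[:, 0] *= 1000.0

unscaled = Pipeline([('estimator', KNeighborsClassifier(n_neighbors=12))])
scaled = Pipeline([('scaler', StandardScaler()),
                   ('estimator', KNeighborsClassifier(n_neighbors=12))])

# 10-fold cross-validated accuracy with and without the scaling step
unscaled_scores = cross_val_score(unscaled, X_demo, y_demo, cv=10)
scaled_scores = cross_val_score(scaled, X_demo, y_demo, cv=10)

# What the scaler does on its own: zero mean, unit variance per column
Z = StandardScaler().fit_transform(X_demo)

print(unscaled_scores.mean())
print(scaled_scores.mean())
```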

All models with SelectKBest feature selection

Setting the feature_selection_type keyword argument will use SelectKBest with f_classif for feature selection:



In [ ]:

    
%%time

reload(ppl)

classifiers = ['knn','logistic_regression','svm',
               'multilayer_perceptron','random_forest','adaboost']

prefix = 'select'

for estimator in classifiers:
    # Set pipeline keyword arguments
    optimized_pipeline_kwargs = {
        'feature_selection_type': 'select_k_best',
        'scale_type': None,
        'feature_interactions': False,
        'transform_type': None
        }

    # Initialize pipeline
    optimized_pipeline = ppl.OptimizedPipeline(estimator,**optimized_pipeline_kwargs)
    
    # Set pipeline fitting parameters
    fit_kwargs = {
        'cv': 10,
        'num_parameter_combos': None,
        'n_jobs': -1,
        'random_state': 6,
        'suppress_output': True,
        'use_default_param_dist': True,
        'param_dist': None,
        'test_size': 0.2
    }
    
    # Fit data
    optimized_pipeline.fit(X,y,**fit_kwargs)
    
    # Form name used to save optimized pipeline
    pipeline_name = '%s_%s'%(prefix,estimator)
    
    # Save 
    optimized_pipelines[pipeline_name] = optimized_pipeline



In [ ]:

    
format_str = '{0:<30} {1:<15} {2:<15} {3:<15}'

print format_str.format(*['model','train score','test score','train-test'])
print format_str.format(*['','','',''])
for x in [[key,value.train_score_,value.test_score_,value.train_score_-value.test_score_] for key,value in optimized_pipelines.iteritems()]:
    print format_str.format(*x)

Again, random_forest performs the best.

K-nearest neighbors, however, appears to have the smallest difference between training and test scores.
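
Here is what SelectKBest with f_classif does on its own, sketched on synthetic data (an assumption; the fixed k=3 is also just for illustration, since in a pipeline k would normally be one of the searched parameters):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data: 10 features, 3 of them informative
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     n_informative=3, n_redundant=0,
                                     random_state=6)

# Keep the 3 features with the highest ANOVA F-score against the labels
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X_demo, y_demo)

# Column indices of the features that were kept
kept = selector.get_support(indices=True)

print(kept)
print(np.round(selector.scores_, 2))
```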

All models with feature interaction

Setting the feature_interactions keyword argument to True will cause the use of feature interactions. The default is to only consider pairwise products, though this can be set higher by overriding it using param_dist:



In [ ]:

    
%%time

reload(ppl)

classifiers = ['knn','logistic_regression','svm','multilayer_perceptron','random_forest','adaboost']

prefix = 'interact'

for estimator in classifiers:
    # Set pipeline keyword arguments
    optimized_pipeline_kwargs = {
        'feature_selection_type': None,
        'scale_type': None,
        'feature_interactions': True,
        'transform_type': None
        }

    # Initialize pipeline
    optimized_pipeline = ppl.OptimizedPipeline(estimator,**optimized_pipeline_kwargs)
    
    # Set pipeline fitting parameters
    fit_kwargs = {
        'cv': 10,
        'num_parameter_combos': None,
        'n_jobs': -1,
        'random_state': 6,
        'suppress_output': True,
        'use_default_param_dist': True,
        'param_dist': None,
        'test_size': 0.2
    }
    
    # Fit data
    optimized_pipeline.fit(X,y,**fit_kwargs)
    
    # Form name used to save optimized pipeline
    pipeline_name = '%s_%s'%(prefix,estimator)
    
    # Save 
    optimized_pipelines[pipeline_name] = optimized_pipeline



In [ ]:

    
format_str = '{0:<30} {1:<15} {2:<15} {3:<15}'

print format_str.format(*['model','train score','test score','train-test'])
print format_str.format(*['','','',''])
for x in [[key,value.train_score_,value.test_score_,value.train_score_-value.test_score_] \
          for key,value in optimized_pipelines.iteritems()]:
    print format_str.format(*x)

This doesn't appear to result in many gains in this case.
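
To show what "pairwise products" means in practice, here is a sketch using scikit-learn's PolynomialFeatures with interaction_only=True. Whether pyplearnr uses exactly this transformer and these settings is an assumption on my part. With n input features this produces n + n(n-1)/2 columns, which grows quickly and may be part of why gains are limited.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two toy rows with three features each
X_toy = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

# degree=2 with interaction_only=True keeps the original features and adds
# products of distinct pairs (no squares); include_bias=False drops the
# constant column
interactions = PolynomialFeatures(degree=2, interaction_only=True,
                                  include_bias=False)
X_inter = interactions.fit_transform(X_toy)

# Columns: x0, x1, x2, x0*x1, x0*x2, x1*x2
print(X_inter)
```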

All models with transformed data

Setting the transform_type to 'pca' or 't-sne' will apply Principal Component Analysis or t-distributed stochastic neighbor embedding, respectively, to the data before applying the estimator:



In [ ]:

    
%%time

reload(ppl)

classifiers = ['knn','logistic_regression','svm',
               'multilayer_perceptron','random_forest','adaboost']

prefix = 'pca'

for estimator in classifiers:
    # Set pipeline keyword arguments
    optimized_pipeline_kwargs = {
        'feature_selection_type': None,
        'scale_type': None,
        'feature_interactions': None,
        'transform_type': 'pca'
        }

    # Initialize pipeline
    optimized_pipeline = ppl.OptimizedPipeline(estimator,**optimized_pipeline_kwargs)
    
    # Set pipeline fitting parameters
    fit_kwargs = {
        'cv': 10,
        'num_parameter_combos': None,
        'n_jobs': -1,
        'random_state': 6,
        'suppress_output': True,
        'use_default_param_dist': True,
        'param_dist': None,
        'test_size': 0.2
    }
    
    # Fit data
    optimized_pipeline.fit(X,y,**fit_kwargs)
    
    # Form name used to save optimized pipeline
    pipeline_name = '%s_%s'%(prefix,estimator)
    
    # Save 
    optimized_pipelines[pipeline_name] = optimized_pipeline



In [ ]:

    
format_str = '{0:<30} {1:<15} {2:<15} {3:<15}'

print format_str.format(*['model','train score','test score','train-test'])
print format_str.format(*['','','',''])
for x in [[key,value.train_score_,value.test_score_,value.train_score_-value.test_score_] for key,value in optimized_pipelines.iteritems()]:
    print format_str.format(*x)

Here's the use of t-SNE:



In [ ]:

    
%%time

reload(ppl)

classifiers = ['knn','logistic_regression','svm','multilayer_perceptron','random_forest','adaboost']

prefix = 't_sne'

for estimator in classifiers:
    # Set pipeline keyword arguments
    optimized_pipeline_kwargs = {
        'feature_selection_type': None,
        'scale_type': None,
        'feature_interactions': None,
        'transform_type': 't-sne'
        }

    # Initialize pipeline
    optimized_pipeline = ppl.OptimizedPipeline(estimator,**optimized_pipeline_kwargs)
    
    # Set pipeline fitting parameters
    fit_kwargs = {
        'cv': 10,
        'num_parameter_combos': None,
        'n_jobs': -1,
        'random_state': 6,
        'suppress_output': True,
        'use_default_param_dist': True,
        'param_dist': None,
        'test_size': 0.2
    }
    
    # Fit data
    optimized_pipeline.fit(X,y,**fit_kwargs)
    
    # Form name used to save optimized pipeline
    pipeline_name = '%s_%s'%(prefix,estimator)
    
    # Save 
    optimized_pipelines[pipeline_name] = optimized_pipeline



In [ ]:

    
format_str = '{0:<30} {1:<15} {2:<15} {3:<15}'

print format_str.format(*['model','train score','test score','train-test'])
print format_str.format(*['','','',''])
for x in [[key,value.train_score_,value.test_score_,value.train_score_-value.test_score_] for key,value in optimized_pipelines.iteritems()]:
    print format_str.format(*x)

Wow, that took forever.

We can get a better idea of how long this will take by setting the num_parameter_combos keyword argument. Setting this limits each run to that number of parameter combinations from the grid:



In [ ]:

    
%%time

reload(ppl)

classifiers = ['knn','logistic_regression','svm',
               'multilayer_perceptron','random_forest','adaboost']

prefix = 't_sne_less_combo'

for estimator in classifiers:
    # Set pipeline keyword arguments
    optimized_pipeline_kwargs = {
        'feature_selection_type': None,
        'scale_type': None,
        'feature_interactions': None,
        'transform_type': 't-sne'
        }

    # Initialize pipeline
    optimized_pipeline = ppl.OptimizedPipeline(estimator,**optimized_pipeline_kwargs)
    
    # Set pipeline fitting parameters
    fit_kwargs = {
        'cv': 10,
        'num_parameter_combos': 1,
        'n_jobs': -1,
        'random_state': 6,
        'suppress_output': True,
        'use_default_param_dist': True,
        'param_dist': None,
        'test_size': 0.2
    }
    
    # Fit data
    optimized_pipeline.fit(X,y,**fit_kwargs)
    
    # Form name used to save optimized pipeline
    pipeline_name = '%s_%s'%(prefix,estimator)
    
    # Save 
    optimized_pipelines[pipeline_name] = optimized_pipeline
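
Limiting the number of parameter combinations is what scikit-learn's RandomizedSearchCV does through its n_iter argument. The sketch below assumes that num_parameter_combos plays a similar role; the synthetic data and the value n_iter=5 are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     random_state=6)

pipeline = Pipeline([('estimator', KNeighborsClassifier())])

# 30 x 2 = 60 possible combinations in the full grid
param_dist = {
    'estimator__n_neighbors': list(range(1, 31)),
    'estimator__weights': ['uniform', 'distance'],
}

# Only n_iter randomly sampled combinations are evaluated
search = RandomizedSearchCV(pipeline, param_dist, n_iter=5, cv=10,
                            random_state=6, n_jobs=1)
search.fit(X_demo, y_demo)

n_tried = len(search.cv_results_['params'])
print(n_tried)
```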

Applying t-SNE to the data and then testing the 6 classifiers takes about 7 min. This could be optimized by pre-transforming the data once and then applying the classifiers. I'm thinking of creating some sort of container class that should be able to optimize this in the future.
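
The pre-transform idea could look something like the sketch below, which uses synthetic data and only two classifiers with fixed settings (all assumptions). One caveat: scikit-learn's TSNE offers fit_transform but no separate transform for new rows, so the embedding here is computed on all rows at once. The held-out folds therefore influence the embedding, and the resulting scores are likely optimistic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import TSNE
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=120, n_features=8,
                                     random_state=6)

# Compute the expensive embedding once, up front
X_embedded = TSNE(n_components=2, perplexity=20,
                  random_state=6).fit_transform(X_demo)

candidates = {
    'knn': KNeighborsClassifier(n_neighbors=12),
    'logistic_regression': LogisticRegression(),
}

# Reuse the same embedded data for every classifier
scores = {}
for name, model in candidates.items():
    scores[name] = cross_val_score(model, X_embedded, y_demo, cv=5).mean()

print(scores)
```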

SelectKBest, standard scaling, and all classifiers

Finally, here we apply feature selection and standard scaling for all 6 classifiers:



In [ ]:

    
%%time

reload(ppl)

classifiers = ['knn','logistic_regression','svm',
               'multilayer_perceptron','random_forest','adaboost']

prefix = 'select_standard'

for estimator in classifiers:
    # Set pipeline keyword arguments
    optimized_pipeline_kwargs = {
        'feature_selection_type': 'select_k_best',
        'scale_type': 'standard',
        'feature_interactions': None,
        'transform_type': None
        }

    # Initialize pipeline
    optimized_pipeline = ppl.OptimizedPipeline(estimator,**optimized_pipeline_kwargs)
    
    # Set pipeline fitting parameters
    fit_kwargs = {
        'cv': 10,
        'num_parameter_combos': None,
        'n_jobs': -1,
        'random_state': 6,
        'suppress_output': True,
        'use_default_param_dist': True,
        'param_dist': None,
        'test_size': 0.2
    }
    
    # Fit data
    optimized_pipeline.fit(X,y,**fit_kwargs)
    
    # Form name used to save optimized pipeline
    pipeline_name = '%s_%s'%(prefix,estimator)
    
    # Save 
    optimized_pipelines[pipeline_name] = optimized_pipeline



In [ ]:

    
format_str = '{0:<40} {1:<15} {2:<15} {3:<15}'

print format_str.format(*['model','train score','test score','train-test'])
print format_str.format(*['','','',''])
for x in [[key,value.train_score_,value.test_score_,value.train_score_-value.test_score_] for key,value in optimized_pipelines.iteritems()]:
    print format_str.format(*x)



In [ ]:

    
len(optimized_pipelines)

With 48 different pre-processing/transformation/classification combinations, this has become rather unwieldy.

Here I make a quick dataframe of the test/train scores and visualize:



In [ ]:

    
%matplotlib inline

model_indices = optimized_pipelines.keys()
train_scores = [value.train_score_ for key,value in optimized_pipelines.iteritems()]
test_scores = [value.test_score_ for key,value in optimized_pipelines.iteritems()]

score_df = pd.DataFrame({'training_score':train_scores,'test_score':test_scores},
                        index=model_indices)

score_df['test-train'] = score_df['test_score']-score_df['training_score']



In [ ]:

    
score_df['test_score'].sort_values().plot(kind='barh',figsize=(10,20))

The best training score was achieved by the random forest classifier.



In [ ]:

    
score_df['test-train'].sort_values().plot(kind='barh',figsize=(10,20))



In [ ]:

    
ax = score_df.plot(x='test_score',y='test-train',style='o',legend=None)

ax.set_xlabel('test score')
ax.set_ylabel('test-train')

So the best model was random forest.
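
Besides reading it off the plots, the best pipeline can be picked out of a score dataframe programmatically. The numbers below are made up purely to show the mechanics and are not results from the runs above:

```python
import pandas as pd

# Made-up scores in the same layout as score_df
toy_scores = pd.DataFrame(
    {'training_score': [0.90, 0.84, 0.99],
     'test_score': [0.82, 0.80, 0.85]},
    index=['knn', 'scale_knn', 'random_forest'])

toy_scores['test-train'] = (toy_scores['test_score']
                            - toy_scores['training_score'])

# Name of the pipeline with the highest test score
best_name = toy_scores['test_score'].idxmax()

# Full ranking, best test score first
ranked = toy_scores.sort_values('test_score', ascending=False)

print(best_name)
print(ranked)
```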

Here's the report for the model:



In [ ]:

    
print optimized_pipelines['random_forest']