Here I demonstrate pyplearnr, a wrapper for building, training, and validating scikit-learn pipelines using GridSearchCV or RandomizedSearchCV.
Quick keyword arguments give access to optional feature selection (e.g. SelectKBest), scaling (e.g. standard scaling), feature interactions, and data transformations (e.g. PCA, t-SNE), all applied before the data is fed to a classifier/regressor.
After building the pipeline, data can be used to perform a nested (stratified if classification) k-folds cross-validation and output an object containing data from the process, including the best model.
Default pipeline step parameters for the grid search are provided for quick iteration over different pipelines, with the option to ignore or override them in a flexible way.
This is an ongoing project that I intend to update with more models and pre-processing options, along with corresponding defaults.
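For orientation, the kind of workflow being wrapped can be sketched with plain scikit-learn. This is a minimal sketch, not pyplearnr's actual implementation: the synthetic data, the step names, and the grid values are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Titanic set used below)
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     n_informative=4, random_state=6)

# Outer step: hold out 20% of the rows as a stratified test set
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=6)

# Optional pre-processing steps followed by the estimator
pipeline = Pipeline([
    ('feature_selection', SelectKBest()),
    ('scaler', StandardScaler()),
    ('estimator', KNeighborsClassifier()),
])

# Grid keys follow scikit-learn's '<step name>__<parameter>' convention
param_grid = {
    'feature_selection__k': [4, 8],
    'estimator__n_neighbors': [3, 5, 7],
    'estimator__weights': ['uniform', 'distance'],
}

# Inner step: 5-fold cross-validated grid search on the training rows
# (an integer cv uses stratified folds when the estimator is a classifier)
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

# Score the refit best pipeline on the held-out rows
test_score = search.score(X_test, y_test)
print(search.best_params_)
print(test_score)
```

The held-out score gives an estimate that is not biased by the parameter search, which is the reason for separating the test split from the cross-validation folds.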
Here I use the Titanic dataset I've cleaned and pickled in a separate tutorial.
In [ ]:
import pandas as pd
df = pd.read_pickle('trimmed_titanic_data.pkl')
df.info()
By "cleaned" I mean I've derived titles (e.g. "Mr.", "Mrs.", "Dr.") from the passenger names, imputed the missing Age values using polynomial regression with grid-searched 10-fold cross-validation, filled in the 3 missing Embarked values with the mode, and removed all fields that could be considered an id for that individual.
Thus, there is no missing data.
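Two of these cleaning steps are easy to illustrate with pandas. The following is a rough sketch on a few invented rows, not the code from the separate cleaning tutorial; the regular expression and the toy names are assumptions.

```python
import pandas as pd

# Invented rows in the "Surname, Title. Given names" format (assumption)
toy = pd.DataFrame({
    'Name': ['Smith, Mr. John', 'Jones, Mrs. Mary',
             'Brown, Miss. Anna', 'Gray, Dr. Paul'],
    'Embarked': ['S', 'C', None, 'S'],
})

# Title: the text between the comma and the first period
toy['Title'] = toy['Name'].str.extract(r',\s*([^.]+)\.', expand=False).str.strip()

# Fill missing Embarked values with the most frequent value (the mode)
toy['Embarked'] = toy['Embarked'].fillna(toy['Embarked'].mode()[0])

print(toy)
```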
In [ ]:
simulation_df = df.copy()
categorical_features = ['Survived','Pclass','Sex','Embarked','Title']
for feature in categorical_features:
    simulation_df[feature] = simulation_df[feature].astype('category')
simulation_df.info()
In [ ]:
simulation_df = pd.get_dummies(simulation_df,drop_first=True)
simulation_df.info()
In [ ]:
# Set output feature
output_feature = 'Survived_1'
# Get all column names
column_names = list(simulation_df.columns)
# Get input features
input_features = [x for x in column_names if x != output_feature]
# Split into features and responses
X = simulation_df[input_features].copy()
y = simulation_df[output_feature].copy()
In [ ]:
simulation_df['Survived_1'].value_counts().values/float(simulation_df['Survived_1'].value_counts().values.sum())
In [ ]:
import pyplearnr as ppl
optimized_pipelines = {}
In [ ]:
%%time
reload(ppl)
estimator = 'knn'
# Set pipeline keyword arguments
optimized_pipeline_kwargs = {
    'feature_selection_type': None,
    'scale_type': None,
    'feature_interactions': False,
    'transform_type': None
}
# Initialize pipeline
optimized_pipeline = ppl.OptimizedPipeline(estimator,**optimized_pipeline_kwargs)
# Set pipeline fitting parameters
fit_kwargs = {
    'cv': 10,
    'num_parameter_combos': None,
    'n_jobs': -1,
    'random_state': 6,
    'suppress_output': True,
    'use_default_param_dist': True,
    'param_dist': None,
    'test_size': 0.2 # 20% saved as test set
}
# Fit data
optimized_pipeline.fit(X,y,**fit_kwargs)
# Save
optimized_pipelines[estimator] = optimized_pipeline
After the fit() method has run, the instance of my custom OptimizedPipeline class contains all of the data associated with the nested stratified k-folds cross-validation.
This includes the data, its test/train splits (based on the test_size keyword argument), the GridSearchCV or RandomizedSearchCV object, the Pipeline object that has been retrained using all of the data with the best parameters, test/train scores, and validation metrics/reports.
A report can be printed immediately after the fit by setting the suppress_output keyword argument to False.
It lists the steps in the pipeline, their optimized settings, the test/training accuracy (or L2 regression score), the grid search parameters, and the best parameters.
If the estimator used is a classifier it also includes the confusion matrix, normalized confusion matrix, and a classification report containing precision/recall/f1-score for each class.
This same report is also accessible by printing the OptimizedPipeline class instance:
In [ ]:
print optimized_pipeline
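The classifier metrics mentioned in the report can be reproduced with scikit-learn directly. This is a small sketch on hand-made labels (the arrays are invented for illustration); it shows what a confusion matrix, its row-normalized version, and a classification report look like, not how pyplearnr assembles its own report.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Invented true and predicted labels for a binary problem
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

# Normalize each row so entries are fractions of that true class
normalized_cm = cm.astype(float) / cm.sum(axis=1, keepdims=True)

print(cm)
print(normalized_cm)
# Precision, recall, and f1-score for each class
print(classification_report(y_true, y_pred))
```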
It turns out that the best settings are 12 neighbors with the 'uniform' weight.
Note how I've set the random_state keyword argument to 6 so that the models can be compared using the same test/train split.
The default parameters to grid-search over for k-nearest neighbors are 1 to 30 neighbors and either the 'uniform' or 'distance' weight.
The defaults for the pre-processing steps, classifiers, and regressors can be viewed by using the get_default_pipeline_step_parameters() method with the number of features as the input:
In [ ]:
pre_processing_grid_parameters,classifier_grid_parameters,regression_grid_parameters = \
    optimized_pipeline.get_default_pipeline_step_parameters(X.shape[1])
classifier_grid_parameters['knn']
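To get a feel for the size of the default k-nearest neighbors search described above, the grid can be written out by hand and counted. This is a sketch under assumptions: the 'estimator__weights' key is inferred by analogy with the 'estimator__n_neighbors' key used in this tutorial, and the dictionary is not copied from pyplearnr.

```python
from sklearn.model_selection import ParameterGrid

# Hand-written stand-in for the default KNN grid (assumed key names)
knn_grid = {
    'estimator__n_neighbors': list(range(1, 31)),   # 1 to 30 neighbors
    'estimator__weights': ['uniform', 'distance'],
}

# Number of parameter combinations in the full grid
n_combos = len(ParameterGrid(knn_grid))

# With 10-fold cross-validation, each combination is fit 10 times
n_fits = n_combos * 10

print(n_combos)
print(n_fits)
```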
These default parameters can be ignored by setting the use_default_param_dist keyword argument to False.
The param_dist keyword argument can be used either to override individual default parameters (if use_default_param_dist is set to True) or as the sole source of parameters (if use_default_param_dist is set to False).
Here is a demonstration in which the default parameters are generated and those also present in param_dist are overridden:
In [ ]:
%%time
reload(ppl)
estimator = 'knn'
model_name = 'custom_override_%s'%(estimator)
# Set custom parameters
param_dist = {
    'estimator__n_neighbors': range(30,500)
}
# Set pipeline keyword arguments
optimized_pipeline_kwargs = {
    'feature_selection_type': None,
    'scale_type': None,
    'feature_interactions': False,
    'transform_type': None
}
# Initialize pipeline
optimized_pipeline = ppl.OptimizedPipeline(estimator,**optimized_pipeline_kwargs)
# Set pipeline fitting parameters
fit_kwargs = {
    'cv': 10,
    'num_parameter_combos': None,
    'n_jobs': -1,
    'random_state': 6,
    'suppress_output': False,
    'use_default_param_dist': True,
    'param_dist': param_dist,
    'test_size': 0.2 # 20% saved as test set
}
# Fit data
optimized_pipeline.fit(X,y,**fit_kwargs)
# Save
optimized_pipelines[model_name] = optimized_pipeline
Note how the n_neighbors parameter was 30 to 499 instead of 1 to 30.
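The interplay between use_default_param_dist and param_dist amounts to a dictionary merge. The helper below is hypothetical (resolve_param_dist is not part of pyplearnr) and only sketches the behavior described in the text, using a hand-written stand-in for the defaults.

```python
def resolve_param_dist(defaults, param_dist=None, use_default_param_dist=True):
    """Hypothetical helper: combine default and custom search parameters."""
    # Start from the defaults only when they are requested
    resolved = dict(defaults) if use_default_param_dist else {}
    # Custom entries replace defaults that share the same key
    resolved.update(param_dist or {})
    return resolved

# Stand-in defaults (assumed key names)
defaults = {
    'estimator__n_neighbors': list(range(1, 31)),
    'estimator__weights': ['uniform', 'distance'],
}
custom = {'estimator__n_neighbors': list(range(30, 500))}

# Defaults kept, n_neighbors overridden
overridden = resolve_param_dist(defaults, custom, use_default_param_dist=True)

# Only the custom parameters are searched
from_scratch = resolve_param_dist(defaults, custom, use_default_param_dist=False)

print(sorted(overridden))
print(sorted(from_scratch))
```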
Here's an example of only using param_dist for parameters:
In [ ]:
%%time
reload(ppl)
estimator = 'knn'
model_name = 'from_scratch_%s'%(estimator)
# Set custom parameters
param_dist = {
    'estimator__n_neighbors': range(10,30)
}
# Set pipeline keyword arguments
optimized_pipeline_kwargs = {
    'feature_selection_type': None,
    'scale_type': None,
    'feature_interactions': False,
    'transform_type': None
}
# Initialize pipeline
optimized_pipeline = ppl.OptimizedPipeline(estimator,**optimized_pipeline_kwargs)
# Set pipeline fitting parameters
fit_kwargs = {
    'cv': 10,
    'num_parameter_combos': None,
    'n_jobs': -1,
    'random_state': 6,
    'suppress_output': False,
    'use_default_param_dist': False,
    'param_dist': param_dist,
    'test_size': 0.2 # 20% saved as test set
}
# Fit data
optimized_pipeline.fit(X,y,**fit_kwargs)
# Save
optimized_pipelines[model_name] = optimized_pipeline
In [ ]:
%%time
reload(ppl)
classifiers = ['knn','logistic_regression','svm',
               'multilayer_perceptron','random_forest','adaboost']
for estimator in classifiers:
    # Set pipeline keyword arguments
    optimized_pipeline_kwargs = {
        'feature_selection_type': None,
        'scale_type': None,
        'feature_interactions': False,
        'transform_type': None
    }
    # Initialize pipeline
    optimized_pipeline = ppl.OptimizedPipeline(estimator,**optimized_pipeline_kwargs)
    # Set pipeline fitting parameters
    fit_kwargs = {
        'cv': 10,
        'num_parameter_combos': None,
        'n_jobs': -1,
        'random_state': 6,
        'suppress_output': True,
        'use_default_param_dist': True,
        'param_dist': None,
        'test_size': 0.2
    }
    # Fit data
    optimized_pipeline.fit(X,y,**fit_kwargs)
    # Save
    optimized_pipelines[estimator] = optimized_pipeline
In [ ]:
format_str = '{0:<22} {1:<15} {2:<15}'
print format_str.format(*['model','train score','test score'])
print format_str.format(*['','',''])
for x in [[key,value.train_score_,value.test_score_] for key,value in optimized_pipelines.iteritems()]:
    print format_str.format(*x)
Random forest performed the best with a test score of ~0.854.
Let's look at the report:
In [ ]:
print optimized_pipelines['random_forest']
In [ ]:
%%time
reload(ppl)
classifiers = ['knn','logistic_regression','svm',
               'multilayer_perceptron','random_forest','adaboost']
prefix = 'scale'
for estimator in classifiers:
    # Set pipeline keyword arguments
    optimized_pipeline_kwargs = {
        'feature_selection_type': None,
        'scale_type': 'standard',
        'feature_interactions': False,
        'transform_type': None
    }
    # Initialize pipeline
    optimized_pipeline = ppl.OptimizedPipeline(estimator,**optimized_pipeline_kwargs)
    # Set pipeline fitting parameters
    fit_kwargs = {
        'cv': 10,
        'num_parameter_combos': None,
        'n_jobs': -1,
        'random_state': 6,
        'suppress_output': True,
        'use_default_param_dist': True,
        'param_dist': None,
        'test_size': 0.2
    }
    # Fit data
    optimized_pipeline.fit(X,y,**fit_kwargs)
    # Form name used to save optimized pipeline
    pipeline_name = '%s_%s'%(prefix,estimator)
    # Save
    optimized_pipelines[pipeline_name] = optimized_pipeline
In [ ]:
format_str = '{0:<30} {1:<15} {2:<15}'
print format_str.format(*['model','train score','test score'])
print format_str.format(*['','',''])
for x in [[key,value.train_score_,value.test_score_] for key,value in optimized_pipelines.iteritems()]:
    print format_str.format(*x)
In [ ]:
%%time
reload(ppl)
classifiers = ['knn','logistic_regression','svm',
               'multilayer_perceptron','random_forest','adaboost']
prefix = 'select'
for estimator in classifiers:
    # Set pipeline keyword arguments
    optimized_pipeline_kwargs = {
        'feature_selection_type': 'select_k_best',
        'scale_type': None,
        'feature_interactions': False,
        'transform_type': None
    }
    # Initialize pipeline
    optimized_pipeline = ppl.OptimizedPipeline(estimator,**optimized_pipeline_kwargs)
    # Set pipeline fitting parameters
    fit_kwargs = {
        'cv': 10,
        'num_parameter_combos': None,
        'n_jobs': -1,
        'random_state': 6,
        'suppress_output': True,
        'use_default_param_dist': True,
        'param_dist': None,
        'test_size': 0.2
    }
    # Fit data
    optimized_pipeline.fit(X,y,**fit_kwargs)
    # Form name used to save optimized pipeline
    pipeline_name = '%s_%s'%(prefix,estimator)
    # Save
    optimized_pipelines[pipeline_name] = optimized_pipeline
In [ ]:
format_str = '{0:<30} {1:<15} {2:<15} {3:<15}'
print format_str.format(*['model','train score','test score','train-test'])
print format_str.format(*['','','',''])
for x in [[key,value.train_score_,value.test_score_,value.train_score_-value.test_score_] for key,value in optimized_pipelines.iteritems()]:
    print format_str.format(*x)
Again, random_forest performs the best, though k-nearest neighbors appears to have the smallest difference between the training and test scores.
Setting the feature_interactions keyword argument to True enables feature interactions. The default is to only consider pairwise products, though this can be set higher by overriding it using param_dist:
In [ ]:
%%time
reload(ppl)
classifiers = ['knn','logistic_regression','svm','multilayer_perceptron','random_forest','adaboost']
prefix = 'interact'
for estimator in classifiers:
    # Set pipeline keyword arguments
    optimized_pipeline_kwargs = {
        'feature_selection_type': None,
        'scale_type': None,
        'feature_interactions': True,
        'transform_type': None
    }
    # Initialize pipeline
    optimized_pipeline = ppl.OptimizedPipeline(estimator,**optimized_pipeline_kwargs)
    # Set pipeline fitting parameters
    fit_kwargs = {
        'cv': 10,
        'num_parameter_combos': None,
        'n_jobs': -1,
        'random_state': 6,
        'suppress_output': True,
        'use_default_param_dist': True,
        'param_dist': None,
        'test_size': 0.2
    }
    # Fit data
    optimized_pipeline.fit(X,y,**fit_kwargs)
    # Form name used to save optimized pipeline
    pipeline_name = '%s_%s'%(prefix,estimator)
    # Save
    optimized_pipelines[pipeline_name] = optimized_pipeline
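What "pairwise products" means for the feature interactions can be shown with scikit-learn's PolynomialFeatures. Whether pyplearnr uses this exact transformer internally is an assumption; the sketch only illustrates the resulting columns for a single invented row.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One invented row with three features: x0=1, x1=2, x2=3
X_demo = np.array([[1.0, 2.0, 3.0]])

# Pairwise products only: x0, x1, x2, x0*x1, x0*x2, x1*x2
pairwise = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_pairwise = pairwise.fit_transform(X_demo)

# Raising the degree to 3 also adds the triple product x0*x1*x2
triple = PolynomialFeatures(degree=3, interaction_only=True, include_bias=False)
X_triple = triple.fit_transform(X_demo)

print(X_pairwise)
print(X_triple.shape)
```

The number of columns grows quickly with the number of features and the degree, which is worth keeping in mind for the grid-search run time.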
In [ ]:
format_str = '{0:<30} {1:<15} {2:<15} {3:<15}'
print format_str.format(*['model','train score','test score','train-test'])
print format_str.format(*['','','',''])
for x in [[key,value.train_score_,value.test_score_,value.train_score_-value.test_score_] \
          for key,value in optimized_pipelines.iteritems()]:
    print format_str.format(*x)
In [ ]:
%%time
reload(ppl)
classifiers = ['knn','logistic_regression','svm',
               'multilayer_perceptron','random_forest','adaboost']
prefix = 'pca'
for estimator in classifiers:
    # Set pipeline keyword arguments
    optimized_pipeline_kwargs = {
        'feature_selection_type': None,
        'scale_type': None,
        'feature_interactions': None,
        'transform_type': 'pca'
    }
    # Initialize pipeline
    optimized_pipeline = ppl.OptimizedPipeline(estimator,**optimized_pipeline_kwargs)
    # Set pipeline fitting parameters
    fit_kwargs = {
        'cv': 10,
        'num_parameter_combos': None,
        'n_jobs': -1,
        'random_state': 6,
        'suppress_output': True,
        'use_default_param_dist': True,
        'param_dist': None,
        'test_size': 0.2
    }
    # Fit data
    optimized_pipeline.fit(X,y,**fit_kwargs)
    # Form name used to save optimized pipeline
    pipeline_name = '%s_%s'%(prefix,estimator)
    # Save
    optimized_pipelines[pipeline_name] = optimized_pipeline
In [ ]:
format_str = '{0:<30} {1:<15} {2:<15} {3:<15}'
print format_str.format(*['model','train score','test score','train-test'])
print format_str.format(*['','','',''])
for x in [[key,value.train_score_,value.test_score_,value.train_score_-value.test_score_] for key,value in optimized_pipelines.iteritems()]:
    print format_str.format(*x)
Here's the use of t-SNE:
In [ ]:
%%time
reload(ppl)
classifiers = ['knn','logistic_regression','svm','multilayer_perceptron','random_forest','adaboost']
prefix = 't_sne'
for estimator in classifiers:
    # Set pipeline keyword arguments
    optimized_pipeline_kwargs = {
        'feature_selection_type': None,
        'scale_type': None,
        'feature_interactions': None,
        'transform_type': 't-sne'
    }
    # Initialize pipeline
    optimized_pipeline = ppl.OptimizedPipeline(estimator,**optimized_pipeline_kwargs)
    # Set pipeline fitting parameters
    fit_kwargs = {
        'cv': 10,
        'num_parameter_combos': None,
        'n_jobs': -1,
        'random_state': 6,
        'suppress_output': True,
        'use_default_param_dist': True,
        'param_dist': None,
        'test_size': 0.2
    }
    # Fit data
    optimized_pipeline.fit(X,y,**fit_kwargs)
    # Form name used to save optimized pipeline
    pipeline_name = '%s_%s'%(prefix,estimator)
    # Save
    optimized_pipelines[pipeline_name] = optimized_pipeline
In [ ]:
format_str = '{0:<30} {1:<15} {2:<15} {3:<15}'
print format_str.format(*['model','train score','test score','train-test'])
print format_str.format(*['','','',''])
for x in [[key,value.train_score_,value.test_score_,value.train_score_-value.test_score_] for key,value in optimized_pipelines.iteritems()]:
    print format_str.format(*x)
Wow, that took forever.
We can get a better idea of how long this will take by setting the num_parameter_combos keyword argument, which limits each run to that number of grid combinations:
In [ ]:
%%time
reload(ppl)
classifiers = ['knn','logistic_regression','svm',
               'multilayer_perceptron','random_forest','adaboost']
prefix = 't_sne_less_combo'
for estimator in classifiers:
    # Set pipeline keyword arguments
    optimized_pipeline_kwargs = {
        'feature_selection_type': None,
        'scale_type': None,
        'feature_interactions': None,
        'transform_type': 't-sne'
    }
    # Initialize pipeline
    optimized_pipeline = ppl.OptimizedPipeline(estimator,**optimized_pipeline_kwargs)
    # Set pipeline fitting parameters
    fit_kwargs = {
        'cv': 10,
        'num_parameter_combos': 1,
        'n_jobs': -1,
        'random_state': 6,
        'suppress_output': True,
        'use_default_param_dist': True,
        'param_dist': None,
        'test_size': 0.2
    }
    # Fit data
    optimized_pipeline.fit(X,y,**fit_kwargs)
    # Form name used to save optimized pipeline
    pipeline_name = '%s_%s'%(prefix,estimator)
    # Save
    optimized_pipelines[pipeline_name] = optimized_pipeline
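Capping the number of parameter combinations is the idea behind randomized search. The sketch below uses scikit-learn's ParameterSampler on a hand-written grid to show the reduction in work; the grid keys are assumed, and this is not a description of how num_parameter_combos is implemented.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Hand-written stand-in grid (assumed key names)
knn_grid = {
    'estimator__n_neighbors': list(range(1, 31)),
    'estimator__weights': ['uniform', 'distance'],
}

# Exhaustive search visits every combination
n_full = len(ParameterGrid(knn_grid))

# Randomized search visits only a fixed number of sampled combinations
sampled = list(ParameterSampler(knn_grid, n_iter=5, random_state=6))

print(n_full)
print(len(sampled))
for combo in sampled:
    print(combo)
```

Timing a run with a single combination and multiplying by the size of the full grid gives a rough estimate of the total run time.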
Applying t-SNE to the data and then testing the 6 classifiers takes about 7 min. This could be optimized by pre-transforming the data once and then applying the classifiers. I'm thinking of creating some sort of container class that should be able to optimize this in the future.
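The pre-transformation idea can be sketched with scikit-learn alone. This is a rough illustration on synthetic data, not a planned pyplearnr feature: the embedding is computed once and reused by each classifier. One caveat of this shortcut is that the embedding is fit on all rows, including those later used for validation, so the scores are only indicative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import TSNE
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic stand-in data set (assumption)
X_demo, y_demo = make_classification(n_samples=80, n_features=6, random_state=6)

# Compute the expensive t-SNE embedding a single time
embedding = TSNE(n_components=2, perplexity=10, random_state=6).fit_transform(X_demo)

# Reuse the same embedding for every classifier
classifiers = {
    'knn': KNeighborsClassifier(),
    'logistic_regression': LogisticRegression(),
}
scores = {}
for name, classifier in classifiers.items():
    scores[name] = cross_val_score(classifier, embedding, y_demo, cv=5).mean()

print(embedding.shape)
print(scores)
```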
Finally, here we apply feature selection and standard scaling for all 6 classifiers:
In [ ]:
%%time
reload(ppl)
classifiers = ['knn','logistic_regression','svm',
               'multilayer_perceptron','random_forest','adaboost']
prefix = 'select_standard'
for estimator in classifiers:
    # Set pipeline keyword arguments
    optimized_pipeline_kwargs = {
        'feature_selection_type': 'select_k_best',
        'scale_type': 'standard',
        'feature_interactions': None,
        'transform_type': None
    }
    # Initialize pipeline
    optimized_pipeline = ppl.OptimizedPipeline(estimator,**optimized_pipeline_kwargs)
    # Set pipeline fitting parameters
    fit_kwargs = {
        'cv': 10,
        'num_parameter_combos': None,
        'n_jobs': -1,
        'random_state': 6,
        'suppress_output': True,
        'use_default_param_dist': True,
        'param_dist': None,
        'test_size': 0.2
    }
    # Fit data
    optimized_pipeline.fit(X,y,**fit_kwargs)
    # Form name used to save optimized pipeline
    pipeline_name = '%s_%s'%(prefix,estimator)
    # Save
    optimized_pipelines[pipeline_name] = optimized_pipeline
In [ ]:
format_str = '{0:<40} {1:<15} {2:<15} {3:<15}'
print format_str.format(*['model','train score','test score','train-test'])
print format_str.format(*['','','',''])
for x in [[key,value.train_score_,value.test_score_,value.train_score_-value.test_score_] for key,value in optimized_pipelines.iteritems()]:
    print format_str.format(*x)
In [ ]:
len(optimized_pipelines)
With 48 different pre-processing/transformation/classification combinations, this has become rather unwieldy.
Here I make a quick dataframe of the test/train scores and visualize:
In [ ]:
%matplotlib inline
model_indices = optimized_pipelines.keys()
train_scores = [value.train_score_ for key,value in optimized_pipelines.iteritems()]
test_scores = [value.test_score_ for key,value in optimized_pipelines.iteritems()]
score_df = pd.DataFrame({'training_score':train_scores,'test_score':test_scores},
                        index=model_indices)
score_df['test-train'] = score_df['test_score']-score_df['training_score']
In [ ]:
score_df['test_score'].sort_values().plot(kind='barh',figsize=(10,20))
The best training score was achieved by the random forest classifier.
In [ ]:
score_df['test-train'].sort_values().plot(kind='barh',figsize=(10,20))
In [ ]:
ax = score_df.plot(x='test_score',y='test-train',style='o',legend=None)
ax.set_xlabel('test score')
ax.set_ylabel('test-train')
So the best model was random forest.
Here's the report for the model:
In [ ]:
print optimized_pipelines['random_forest']