In [1]:
import os
import pandas as pd
import numpy as np
In [2]:
df = pd.read_pickle('trimmed_titanic_data.pkl')
df.info()
By "cleaned" I mean I've derived titles (e.g. "Mr.", "Mrs.", "Dr.", etc) from the passenger names, imputed the missing Age values using polynomial regression with grid-searched 10-fold cross-validation, filled in the 3 missing Embarked values with the mode, and removed all fields that could be considered an id for that individual.
Thus, no data are missing or null.
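For reference, here's a minimal sketch of what that Age imputation might have looked like, assuming scikit-learn's Pipeline and GridSearchCV; the raw_df name and feature list are hypothetical, not the actual cleaning code:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical inputs; the actual imputation ran on the raw Titanic data
age_features = ['Fare', 'SibSp', 'Parch']
known_age = raw_df[raw_df['Age'].notnull()]
missing_age = raw_df[raw_df['Age'].isnull()]

# Polynomial regression with grid-searched 10-fold cross-validation
age_model = GridSearchCV(
    Pipeline([('poly', PolynomialFeatures()),
              ('regression', LinearRegression())]),
    param_grid={'poly__degree': [1, 2, 3]},
    cv=10)
age_model.fit(known_age[age_features], known_age['Age'])
raw_df.loc[raw_df['Age'].isnull(), 'Age'] = age_model.predict(missing_age[age_features])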
To one-hot encode the categorical data, it's best to first mark which features are considered categorical:
In [3]:
df.head(10)
Out[3]:
In [4]:
simulation_df = df.copy()
categorical_features = ['Survived','Pclass','Sex','Embarked','Title']
for feature in categorical_features:
    simulation_df[feature] = simulation_df[feature].astype('category')
simulation_df.info()
In [5]:
simulation_df = pd.get_dummies(simulation_df,drop_first=True)
simulation_df.info()
In [6]:
# Set output feature
output_feature = 'Survived_1'
# Get all column names
column_names = list(simulation_df.columns)
# Get input features
input_features = [x for x in column_names if x != output_feature]
# Split into features and responses
X = simulation_df[input_features].copy()
y = simulation_df[output_feature].copy()
In [7]:
# Baseline class balance: fraction of passengers who died vs. survived
simulation_df['Survived_1'].value_counts(normalize=True).values
Out[7]:
In [8]:
%matplotlib inline
import pyplearnr as ppl
In [9]:
%%time
# Initialize nested k-fold cross-validation object
kfcv = ppl.NestedKFoldCrossValidation(outer_loop_fold_count=3,
                                      inner_loop_fold_count=3,
                                      shuffle_seed=2369,
                                      outer_loop_split_seed=461,
                                      inner_loop_split_seeds=[284, 406, 303])
# Combinatorial pipeline schematic
pipeline_schematic = [
    {'estimator': {
        'knn': {
            'n_neighbors': range(1, 31),
            'weights': ['uniform', 'distance']
        }}}
]
# Perform nested k-fold cross-validation
kfcv.fit(X.values, y.values, pipeline_schematic=pipeline_schematic,
         scoring_metric='auc', score_type='median')
Pyplearnr has indicated that the contest for outer-fold 2 has resulted in a tie between two pipelines with the same median score over all inner-folds. We can resolve this by re-running the fit method with the best_inner_fold_pipeline_inds keyword argument. I'll choose the simpler of the two (the one with the higher number of neighbors):
In [10]:
kfcv.fit(X.values, y.values, pipeline_schematic=pipeline_schematic,
         best_inner_fold_pipeline_inds={2: 55})
Whichever pipeline wins the most outer-fold contests wins overall.
In this case, pyplearnr has notified us that the inner-fold contests of the outer-folds have each produced a different winner. We can resolve this conflict by, again, re-running the fit method, this time with the best_outer_fold_pipeline keyword argument:
In [11]:
kfcv.fit(X.values, y.values, pipeline_schematic=pipeline_schematic,
         best_outer_fold_pipeline=55)
The output report lists the winning pipeline index, its validation (outer-fold test) scores and statistics, inner-fold (IF) test scores and statistics for each outer-fold (OF), a layout of the pipeline steps, and the corresponding step parameters.
Additionally, the report contains the outer- and inner-fold counts, seeds, scoring metric, and score type. These, along with the same data and pipelines, can be used as inputs to the nested k-fold cross-validation object's initialization and its fit method to exactly reproduce the results of this run.
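For example, given the same data and pipeline schematic, the run above could be reproduced by re-using its reported settings:
# Reproduce the run above by re-using its fold counts and seeds
kfcv_repeat = ppl.NestedKFoldCrossValidation(outer_loop_fold_count=3,
                                             inner_loop_fold_count=3,
                                             shuffle_seed=2369,
                                             outer_loop_split_seed=461,
                                             inner_loop_split_seeds=[284, 406, 303])
kfcv_repeat.fit(X.values, y.values, pipeline_schematic=pipeline_schematic,
                scoring_metric='auc', score_type='median')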
We can get a visual report of this pipeline's validation scores and the inner-fold test scores for each outer-fold:
In [12]:
kfcv.plot_best_pipeline_scores()
Additionally, we can visualize the performance of all pipelines over all folds:
In [13]:
kfcv.plot_contest(all_folds=True, markersize=3, figsize=(5,10), fontsize=8)
We can also visualize the test-fold scores in each fold separately:
In [14]:
kfcv.plot_contest(markersize=2, figsize=(5,10), fontsize=8)
In [15]:
%%time
# Initialize nested k-fold cross-validation object
kfcv = ppl.NestedKFoldCrossValidation(outer_loop_fold_count=3,
                                      inner_loop_fold_count=3,
                                      shuffle_seed=2369,
                                      outer_loop_split_seed=461,
                                      inner_loop_split_seeds=[284, 406, 303])
# Combinatorial pipeline schematic
pipeline_schematic = [
    {'scaler': {
        'none': {},
        'standard': {},
        'normal': {},
        'min_max': {},
        'binary': {}
    }},
    {'estimator': {
        'knn': {
            'n_neighbors': range(1, 31),
            'weights': ['uniform', 'distance']
        }}}
]
# Perform nested k-fold cross-validation
kfcv.fit(X.values, y.values, pipeline_schematic=pipeline_schematic,
         scoring_metric='auc', score_type='median')
In [16]:
kfcv.fit(X.values, y.values, pipeline_schematic=pipeline_schematic,
         scoring_metric='auc', score_type='median',
         best_inner_fold_pipeline_inds={1: 15})
In [18]:
kfcv.fit(X.values, y.values, pipeline_schematic=pipeline_schematic,
         scoring_metric='auc', score_type='median',
         best_outer_fold_pipeline=27)
Pyplearnr has chosen a pipeline that scales the feature data to between 0 and 1 before feeding it to a KNN classifier set to consider the 14 nearest neighbors, weighted by distance.
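For reference, a roughly equivalent pipeline built directly in scikit-learn might look like this (assuming the min_max scaler option maps to scikit-learn's MinMaxScaler):
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier

# Rough scikit-learn equivalent of the winning pipeline
winning_pipeline = Pipeline([
    ('scaler', MinMaxScaler()),  # scale each feature to [0, 1]
    ('estimator', KNeighborsClassifier(n_neighbors=14, weights='distance'))
])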
We can plot the contests again, but this time with the color_by keyword argument, to see if there are any patterns:
In [19]:
kfcv.plot_contest(color_by='scaler', all_folds=True,
                  markersize=1, figsize=(10,20), fontsize=4)
As expected, the pipelines without scaling have the lowest scores. Additionally, the scikit-learn Normalizer scaler does worse than the others.
The MinMaxScaler and StandardScaler do the best for this dataset with the KNN classifier.
In [28]:
%%time
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
# Initialize nested k-fold cross-validation object
kfcv = ppl.NestedKFoldCrossValidation(outer_loop_fold_count=3,
                                      inner_loop_fold_count=3,
                                      shuffle_seed=2369,
                                      outer_loop_split_seed=461,
                                      inner_loop_split_seeds=[284, 406, 303])
# Combinatorial pipeline schematic
pipeline_schematic = [
    {'estimator': {
        'knn': {
            'n_neighbors': range(1, 31),
            'weights': ['uniform', 'distance']
        },
        'svm': {
            'sklo': LinearSVC,
            'loss': ['hinge', 'squared_hinge']
        },
        'logistic_regression': {
            'random_state': [65]
        },
        'random_forest': {
            'sklo': RandomForestClassifier,
            'max_depth': range(2, 6),
            'random_state': [57]
        },
        'gaussian': {
            'sklo': GaussianProcessClassifier
        },
        'adaboost': {
            'sklo': AdaBoostClassifier
        },
        'naive_bayes': {
            'sklo': GaussianNB
        },
        'qda': {
            'sklo': QuadraticDiscriminantAnalysis
        }
    }}
]
# Perform nested k-fold cross-validation
kfcv.fit(X.values, y.values, pipeline_schematic=pipeline_schematic,
         scoring_metric='auc', score_type='median')
Note how I've set the random_state kwarg for the random_forest and logistic_regression estimators to control their behavior so these results are repeatable. It's good practice to generate such seeds with a random number generator and save them wherever they're needed for debugging and reproducibility.
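As an illustration, seeds like those above might be generated and recorded like this (just one way to do it; the master seed 0 is arbitrary):
import numpy as np

# Generate seeds once from a fixed master seed, then save them with the results
seed_rng = np.random.RandomState(0)
saved_seeds = list(seed_rng.randint(1, 1000, size=3))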
I think plain logistic regression is "simpler" than the model produced by AdaBoostClassifier:
In [29]:
kfcv.fit(X.values, y.values, pipeline_schematic=pipeline_schematic,
         best_inner_fold_pipeline_inds={2: 60})
The best model appears to be logistic regression.
In [30]:
kfcv.plot_best_pipeline_scores()
The validation performance matches well with that in the inner-fold contests.
Let's look at all of the inner-fold pipeline contests:
In [31]:
kfcv.plot_contest(all_folds=True, color_by='estimator',
                  color_map='jet', markersize=5, fontsize=10)
Note that I've chosen different input parameters to make the plot more readable.
We'd like to see if there's any pattern when combining standard or min_max scaling, PCA, selection of different numbers of the transformed outputs (essentially choosing how many principal components to use to transform the data), and k-nearest neighbors over multiple values of k:
In [32]:
%%time
import numpy as np
# Initialize nested k-fold cross-validation object
kfcv = ppl.NestedKFoldCrossValidation(outer_loop_fold_count=3,
                                      inner_loop_fold_count=3,
                                      shuffle_seed=3243,
                                      outer_loop_split_seed=45,
                                      inner_loop_split_seeds=[62, 207, 516])
# Combinatorial pipeline schematic
feature_count = X.shape[1]
pipeline_schematic = [
    {'scaler': {
        'min_max': {},
        'standard': {}
    }},
    {'transform': {
        'pca': {
            'n_components': [feature_count]
        }
    }},
    {'feature_selection': {
        'select_k_best': {
            'k': range(1, feature_count + 1)
        }
    }},
    {'estimator': {
        'knn': {
            'n_neighbors': range(1, 31)
        }
    }}
]
# Perform nested k-fold cross-validation
kfcv.fit(X.values, y.values, pipeline_schematic=pipeline_schematic, scoring_metric='auc')
In [33]:
kfcv.fit(X.values, y.values,
         pipeline_schematic=pipeline_schematic,
         best_outer_fold_pipeline=728)
Our process has resulted in the selection of a pipeline with standard scaling, PCA, selection of 9 principal components, and a KNN classifier with 9 neighbors and uniform weighting.
In [34]:
kfcv.plot_best_pipeline_scores()
The validation performance is in line with the best pipeline's inner-fold testing performance for each outer-fold.
We can look at the effect of parameter values by changing the color_by keyword argument to a string with the format 'step__step_option__parameter_name__parameter_value'. To be clear, those are two underscores between step, step_option, parameter_name, and parameter_value.
Let's see if there are any patterns with regard to the number of principal components used:
In [35]:
kfcv.plot_contest(all_folds=True, markersize=1, fontsize=2, figsize=(20,60),
                  color_by='feature_selection__select_k_best__k', color_map='hot')
I'm not sure I see much of a pattern, other than that lower numbers of principal components tend to predominate at the lowest scores.
Now let's look at the number of k-nearest neighbors for the classifier:
In [36]:
kfcv.plot_contest(all_folds=True, markersize=1, fontsize=2, figsize=(20,60),
                  color_by='estimator__knn__n_neighbors', color_map='hot')
The results appear rather mixed, except that the highest scores tend to occur with roughly 1 to 13 nearest neighbors, though those values are represented at the lowest scores as well.
This process can quickly become time-intensive. So, in the spirit of scikit-learn's RandomizedSearchCV, I've included a random_combinations keyword argument to specify the number of pipeline combinations to randomly sample from those available, and a random_combination_seed keyword argument that, like the other seeds, can be reused to duplicate results:
In [37]:
%%time
import numpy as np
# Initialize nested k-fold cross-validation object
kfcv = ppl.NestedKFoldCrossValidation(outer_loop_fold_count=3,
                                      inner_loop_fold_count=3,
                                      shuffle_seed=3243,
                                      outer_loop_split_seed=45,
                                      inner_loop_split_seeds=[62, 207, 516],
                                      random_combinations=50,
                                      random_combination_seed=2374)
# Design combinatorial pipeline schematic
feature_count = X.shape[1]
pipeline_schematic = [
    {'scaler': {
        'min_max': {},
        'standard': {}
    }},
    {'transform': {
        'pca': {
            'n_components': [feature_count]
        }
    }},
    {'feature_selection': {
        'select_k_best': {
            'k': range(1, feature_count + 1)
        }
    }},
    {'estimator': {
        'knn': {
            'n_neighbors': range(1, 31)
        }
    }}
]
# Perform nested k-fold cross-validation
kfcv.fit(X.values, y.values, pipeline_schematic=pipeline_schematic, scoring_metric='auc')
In [38]:
kfcv.fit(X.values, y.values, pipeline_schematic=pipeline_schematic,
         best_inner_fold_pipeline_inds={2: 681})
In [39]:
kfcv.plot_best_pipeline_scores()
The best pipeline has a slightly lower median validation score (0.7918 vs. 0.7995) than the run that tested all pipelines, but took about 1/15 of the time (8.68 s versus 2 min 15 s).
In [43]:
kfcv.plot_contest(all_folds=True, markersize=5, fontsize=13, figsize=(30,15),
                  color_by='estimator__knn__n_neighbors', color_map='hot', legend_loc='best')
Having fewer pipelines certainly makes it easier to make these graphs look good.
The best pipeline is automatically placed in the pipeline field of the nested k-fold cross-validation object (kfcv).
This object is a custom pyplearnr.OuterFoldTrainedPipeline object whose own pipeline field contains the actual trained sklearn.pipeline.Pipeline object, which can be used as usual to inspect derived pipeline step parameters. Please see scikit-learn's documentation of Pipeline objects for more information.
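For instance, the fitted steps could be inspected through this nested pipeline attribute (a quick sketch based on the structure described above):
# Drill down to the trained scikit-learn Pipeline and inspect its fitted steps
sklearn_pipeline = kfcv.pipeline.pipeline
print(sklearn_pipeline.named_steps)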
All one has to do to make a prediction is use the .predict() method.
Here's an example of predicting whether I would survive on the Titanic. I'm 33, would probably have one family member with me, might be in Pclass 1 (I'd hope), am male, and have a Ph.D. (if that's what they mean by "Dr."). I'm using the median Fare for Pclass 1 and arbitrarily chose a city to have embarked from:
In [44]:
personal_stats = np.array([33, 1, 0, df[df['Pclass']==1]['Fare'].median(),
                           0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0])
list(zip(personal_stats, X.columns))
Out[44]:
In [45]:
kfcv.predict(personal_stats.reshape(1,-1))
Out[45]:
Looks like I survived!
Let's look at my predicted probability of surviving:
In [46]:
kfcv.predict_proba(personal_stats.reshape(1,-1))
Out[46]:
I would have a 60% chance of survival.
I've shown how to use pyplearnr to perform model selection and validation, via nested k-fold cross-validation, over a diverse collection of pipelines generated from a simple, intuitive, and flexible combinatorial pipeline schematic.
I've also shown how to visualize the performance of the best model and of all models in the inner-fold contests of each outer-fold, predict survival, and check the predicted probability according to the optimized pipeline.
I hope this proves to be a useful tool.
Please let me know if you have any questions or suggestions about how to improve this tool, my code, the approach I'm taking, etc.