REP implements the folding strategy as one more metaestimator.
When we don't have enough data for a separate train/test split, we stick to a k-fold cross-validation scheme. Folding becomes the only option when you use a multi-stage stacking algorithm, since each later stage needs predictions for events that the earlier stage did not see during training.
Usually we split the training data into folds manually, but this is tedious and error-prone. REP provides FoldingClassifier and FoldingRegressor, which do this automatically.
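To make the idea concrete, here is a minimal sketch of what a folding scheme does, written with NumPy only. It is not REP's implementation: `fold_column`, `CentroidClassifier`, and `out_of_fold_scores` are hypothetical names introduced for illustration, and the toy centroid model stands in for a real estimator such as gradient boosting.

```python
import numpy as np


def fold_column(n_samples, n_folds, seed=0):
    """Assign every sample a fold index in [0, n_folds), in shuffled order."""
    rng = np.random.RandomState(seed)
    return rng.permutation(np.arange(n_samples) % n_folds)


class CentroidClassifier:
    """Toy stand-in for a real estimator (hypothetical, illustration only)."""

    def fit(self, X, y):
        # Remember the mean feature vector of each class.
        self.mean0 = X[y == 0].mean(axis=0)
        self.mean1 = X[y == 1].mean(axis=0)
        return self

    def predict_score(self, X):
        d0 = np.linalg.norm(X - self.mean0, axis=1)
        d1 = np.linalg.norm(X - self.mean1, axis=1)
        # The closer a sample is to the class-1 centroid, the nearer the score is to 1.
        return d0 / (d0 + d1 + 1e-12)


def out_of_fold_scores(X, y, n_folds=4, seed=0):
    """Score each sample with a model trained on the other n_folds - 1 folds."""
    folds = fold_column(len(X), n_folds, seed)
    scores = np.empty(len(X))
    for k in range(n_folds):
        held_out = folds == k
        model = CentroidClassifier().fit(X[~held_out], y[~held_out])
        scores[held_out] = model.predict_score(X[held_out])
    return folds, scores


# Synthetic two-class data, generated only for this sketch.
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 3)),
               rng.normal(2.0, 1.0, size=(200, 3))])
y = np.array([0] * 200 + [1] * 200)

folds, scores = out_of_fold_scores(X, y, n_folds=4)
print(np.bincount(folds))  # number of samples in each fold
```

Every score is produced by a model that never saw the corresponding sample, which is what makes the predictions on the training set usable as unbiased inputs for a later stage.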
In [1]:
%pylab inline
In [2]:
!cd toy_datasets; wget -O MiniBooNE_PID.txt -nc --no-check-certificate https://archive.ics.uci.edu/ml/machine-learning-databases/00199/MiniBooNE_PID.txt
In [3]:
import numpy, pandas
from rep.utils import train_test_split
from sklearn.metrics import roc_auc_score
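# columns are separated by runs of whitespace; '\s*' can also match the empty string, so use '\s+'
data = pandas.read_csv('toy_datasets/MiniBooNE_PID.txt', sep=r'\s+', skiprows=[0], header=None, engine='python')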
labels = pandas.read_csv('toy_datasets/MiniBooNE_PID.txt', sep=' ', nrows=1, header=None)
labels = [1] * labels[1].values[0] + [0] * labels[2].values[0]
data.columns = ['feature_{}'.format(key) for key in data.columns]
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.5)
In [4]:
variables = list(data.columns)
In [5]:
from rep.estimators import SklearnClassifier
from sklearn.ensemble import GradientBoostingClassifier
from rep.metaml import FoldingClassifier
In [6]:
%%time
n_folds = 4
folder = FoldingClassifier(GradientBoostingClassifier(n_estimators=30),
                           n_folds=n_folds, features=variables,
                           parallel_profile='threads-4')
folder.fit(train_data, train_labels)
In [7]:
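# same data, in the same order, as in training: each event is predicted
# by the estimator that did not use it for training
folder.predict_proba(train_data)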
Out[7]:
In [8]:
# definition of mean function, which combines all predictions
def mean_vote(x):
    return numpy.mean(x, axis=0)
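A small standalone check of what `mean_vote` computes may help here. The sketch assumes the vote function receives the per-estimator predictions stacked into one array of shape `[n_estimators, n_samples, n_classes]`; that shape is an assumption to verify against the REP documentation, and the numbers below are made up for illustration.

```python
import numpy


def mean_vote(x):
    # Average over the first axis, i.e. over the fold estimators.
    return numpy.mean(x, axis=0)


# Hypothetical predictions of 2 estimators for 2 samples and 2 classes.
stacked = numpy.array([
    [[0.9, 0.1], [0.2, 0.8]],   # estimator 0
    [[0.7, 0.3], [0.4, 0.6]],   # estimator 1
])

combined = mean_vote(stacked)
print(combined.shape)  # (2, 2)
print(combined)
```

Averaging over axis 0 keeps one row per sample, and each row still sums to one, so the result can be used as a probability matrix.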
In [9]:
# for new data, combine the predictions of all fold estimators with the vote function
folder.predict_proba(test_data, vote_function=mean_vote)
Out[9]:
In [10]:
from rep.data.storage import LabeledDataStorage
from rep.report import ClassificationReport
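# add a FOLDS column to the dataset so that reports can be masked by fold
# (note: _get_folds_column is a private helper of the folding estimator)
train_data["FOLDS"] = folder._get_folds_column(len(train_data))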
lds = LabeledDataStorage(train_data, train_labels)
report = ClassificationReport({'folding': folder}, lds)
In [11]:
for fold_num in range(n_folds):
    report.prediction_pdf(mask="FOLDS == %d" % fold_num, labels_dict={1: 'sig fold %d' % fold_num}).plot()
In [12]:
for fold_num in range(n_folds):
    report.prediction_pdf(mask="FOLDS == %d" % fold_num, labels_dict={0: 'bck fold %d' % fold_num}).plot()
In [13]:
for fold_num in range(n_folds):
    report.roc(mask="FOLDS == %d" % fold_num).plot()
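The per-fold ROC curves can be complemented by a single number per fold. The sketch below computes the area under the ROC curve separately for each fold with plain NumPy; `pairwise_auc` and `auc_per_fold` are hypothetical helpers, and the labels, scores, and fold indices are made-up values. In the notebook itself, `roc_auc_score` (imported above) could be applied to the same masked arrays instead.

```python
import numpy as np


def pairwise_auc(labels, scores):
    """AUC as the fraction of (signal, background) pairs ranked correctly."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Compare every signal score with every background score; ties count as half.
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))


def auc_per_fold(labels, scores, folds):
    """Return {fold index: AUC computed only on the samples of that fold}."""
    labels, scores, folds = (np.asarray(a) for a in (labels, scores, folds))
    return {int(k): pairwise_auc(labels[folds == k], scores[folds == k])
            for k in np.unique(folds)}


# Made-up example: 8 samples split into 2 folds.
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0])
scores = np.array([0.9, 0.1, 0.8, 0.3, 0.4, 0.6, 0.7, 0.7])
folds = np.array([0, 0, 0, 0, 1, 1, 1, 1])

per_fold = auc_per_fold(labels, scores, folds)
print(per_fold)
```

The pairwise comparison costs memory proportional to the number of signal times background samples, so it is suitable only for small checks like this one.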
In [14]:
lds = LabeledDataStorage(test_data, test_labels)
report = ClassificationReport({'folding': folder}, lds)
In [15]:
report.prediction_pdf().plot(new_plot=True, figsize=(9, 4))
In [16]:
report.roc().plot(xlim=(0.5, 1))