In this tutorial we will show how to use Optunity in combination with sklearn to classify the digit recognition data set available in sklearn. The cool part is that we will use Optunity to choose the best approach from a set of available learning algorithms and optimize hyperparameters in one go. We will use the following learning algorithms:
k-nearest neighbour
SVM
Naive Bayes
Random Forest
For simplicity, we will focus on a binary classification task, namely digit 3 versus digit 9. We start with the necessary imports and create the data set.
In [1]:
import optunity
import optunity.metrics
import numpy as np
# k nearest neighbours
from sklearn.neighbors import KNeighborsClassifier
# support vector machine classifier
from sklearn.svm import SVC
# Naive Bayes
from sklearn.naive_bayes import GaussianNB
# Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits
digits = load_digits()
n = digits.data.shape[0]
positive_digit = 3
negative_digit = 9
positive_idx = [i for i in range(n) if digits.target[i] == positive_digit]
negative_idx = [i for i in range(n) if digits.target[i] == negative_digit]
# add some noise to the data to make it a little challenging
original_data = digits.data[positive_idx + negative_idx, ...]
data = original_data + 5 * np.random.randn(original_data.shape[0], original_data.shape[1])
labels = [True] * len(positive_idx) + [False] * len(negative_idx)
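If you want to verify what we just built, a quick look at the shape of the noisy data and the class balance looks like this (just reusing the variables defined above):
In [ ]:
# shape of the noisy feature matrix: (number of 3s and 9s, 64 pixel intensities)
print(data.shape)
# class balance: positives (digit 3) versus negatives (digit 9)
print(sum(labels), len(labels) - sum(labels))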
For the SVM model we will let Optunity optimize the kernel family, choosing from linear, polynomial and RBF. We start by creating a convenience function for SVM training that handles this:
In [2]:
def train_svm(data, labels, kernel, C, gamma, degree, coef0):
    """A generic SVM training function, with arguments based on the chosen kernel."""
    if kernel == 'linear':
        model = SVC(kernel=kernel, C=C)
    elif kernel == 'poly':
        model = SVC(kernel=kernel, C=C, degree=degree, coef0=coef0)
    elif kernel == 'rbf':
        model = SVC(kernel=kernel, C=C, gamma=gamma)
    else:
        raise ValueError("Unknown kernel function: %s" % kernel)
    model.fit(data, labels)
    return model
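To illustrate the call signature, we can fit this function directly on the noisy data; the hyperparameter values below are arbitrary placeholders, not tuned ones:
In [ ]:
# arbitrary hyperparameters, purely to show how train_svm is called
model = train_svm(data, labels, kernel='rbf', C=1.0, gamma=0.01, degree=0, coef0=0.0)
# training accuracy (optimistic, since we evaluate on the training data itself)
print(model.score(data, labels))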
Every learning algorithm has its own hyperparameters:
k-NN: $1 < n\_neighbors < 5$, the number of neighbours to use
SVM: kernel family and misclassification penalty; we will make the penalty contingent on the kernel family. Per kernel family, we have different hyperparameters:
    linear kernel: $0 < C < 2$
    polynomial kernel: $2 < degree < 5$, $0 < C < 50$, $0 < coef0 < 1$
    RBF kernel: $0 < gamma < 1$, $0 < C < 10$
naive Bayes: no hyperparameters
random forest:
    $10 < n\_estimators < 30$: the number of trees in the forest
    $5 < max\_features < 20$: the number of features to consider for each split
This translates into the following search space:
In [3]:
search = {'algorithm': {'k-nn': {'n_neighbors': [1, 5]},
                        'SVM': {'kernel': {'linear': {'C': [0, 2]},
                                           'rbf': {'gamma': [0, 1], 'C': [0, 10]},
                                           'poly': {'degree': [2, 5], 'C': [0, 50], 'coef0': [0, 1]}
                                           }
                                },
                        'naive-bayes': None,
                        'random-forest': {'n_estimators': [10, 30],
                                          'max_features': [5, 20]}
                        }
          }
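The search space is conditional: hyperparameters that do not belong to the sampled branch are simply passed as None. A throwaway objective makes this easy to see (the function name and the tiny budget below are just for illustration):
In [ ]:
def show_config(algorithm=None, n_neighbors=None, n_estimators=None, max_features=None,
                kernel=None, C=None, gamma=None, degree=None, coef0=None):
    # print the sampled configuration; parameters outside the chosen branch are None
    print(algorithm, kernel, C, gamma, degree, coef0, n_neighbors, n_estimators, max_features)
    return 0.0

optunity.maximize_structured(show_config, search_space=search, num_evals=5)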
We also need an objective function that can properly orchestrate everything. We will choose the best model based on area under the ROC curve in 5-fold cross-validation.
In [4]:
@optunity.cross_validated(x=data, y=labels, num_folds=5)
def performance(x_train, y_train, x_test, y_test,
                algorithm, n_neighbors=None, n_estimators=None, max_features=None,
                kernel=None, C=None, gamma=None, degree=None, coef0=None):
    # fit the model
    if algorithm == 'k-nn':
        model = KNeighborsClassifier(n_neighbors=int(n_neighbors))
        model.fit(x_train, y_train)
    elif algorithm == 'SVM':
        model = train_svm(x_train, y_train, kernel, C, gamma, degree, coef0)
    elif algorithm == 'naive-bayes':
        model = GaussianNB()
        model.fit(x_train, y_train)
    elif algorithm == 'random-forest':
        model = RandomForestClassifier(n_estimators=int(n_estimators),
                                       max_features=int(max_features))
        model.fit(x_train, y_train)
    else:
        raise ValueError('Unknown algorithm: %s' % algorithm)

    # predict the test set: decision values for the SVM, class probabilities otherwise
    if algorithm == 'SVM':
        predictions = model.decision_function(x_test)
    else:
        predictions = model.predict_proba(x_test)[:, 1]

    return optunity.metrics.roc_auc(y_test, predictions, positive=True)
Let's do a simple test run of this fancy objective function.
In [5]:
performance(algorithm='k-nn', n_neighbors=3)
Out[5]:
Seems okay! Now we can let Optunity do its magic with a budget of 300 tries.
In [17]:
optimal_configuration, info, _ = optunity.maximize_structured(performance,
                                                              search_space=search,
                                                              num_evals=300)
print(optimal_configuration)
print(info.optimum)
Finally, let's make the results a little more readable. All dictionary items in optimal_configuration with value None can be removed; these correspond to hyperparameters that do not apply to the winning configuration.
In [18]:
solution = dict([(k, v) for k, v in optimal_configuration.items() if v is not None])
print('Solution\n========')
print("\n".join(map(lambda x: "%s \t %s" % (x[0], str(x[1])), solution.items())))