Foundations of Data Mining: Assignment 2

Please complete all assignments in this notebook. You should submit this notebook, as well as a PDF version (See File > Download as).


In [1]:
%matplotlib inline
from preamble import *
plt.rcParams['savefig.dpi'] = 100 # This controls the size of your figures
# Comment out and restart notebook if you only want the last output of each cell.
InteractiveShell.ast_node_interactivity = "all"

In [2]:
# This is a temporary read-only OpenML key. Replace with your own key later. 
oml.config.apikey = '11e82c8d91c5abece86f424369c71590'

Kernel selection (4 points (1+2+1))

SVMs can be trained with different kernels. Generate a 2-dimensional dataset as shown below and study the effect of the choice of kernel by visualizing the results.

  • Train a SVM classifier on the dataset using respectively a linear, polynomial and radial basis function (RBF) kernel, evaluate the performance of each kernel using 10-fold cross-validation and AUC. Which one works best? Visualize the results. Can you intuitively explain why one kernel is more suited than another?
    • Hint: you can use the visualization code used in class. It is under mglearn/plot_svm.py > plot_svm_kernels().
  • Take the RBF kernel and vary both the C parameter and the kernel width ($\gamma$). Use 3 values for each (a very small, default, and very large value). For each of the 9 combinations, create the same RBF plot as before, report the number of support vectors, and the AUC performance. Explain the performance results. When are you over/underfitting?
    • Hint: values for C and $\gamma$ are typically in [$2^{-15}..2^{15}$] on a log scale.
    • Hint: don't count the support vectors manually, retrieve them from the trained SVM.
  • Vary C and $\gamma$ again, but this time use a grid of at least 20x20, vary both parameters uniformly on a log scale, and visualise the results using a $C \times \gamma \rightarrow AUC$ heatmap. Explain the performance results, and compare them to the 9 results obtained in the previous subquestion. Can you also tell in which regions of the heatmap you are over/underfitting?
    • Hint: We've constructed such a heatmap in class and in assignment 1.

In [3]:
from sklearn.datasets import make_blobs
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_blobs(centers=2, n_samples=1000, random_state=0)
svm_lin = SVC(kernel='linear')
svm_pol = SVC(kernel='poly')
svm_rbf = SVC(kernel='rbf')

lin_score = cross_val_score(svm_lin, X, y, cv=10, scoring='roc_auc', n_jobs=-1)
pol_score = cross_val_score(svm_pol, X, y, cv=10, scoring='roc_auc', n_jobs=-1)
rbf_score = cross_val_score(svm_rbf, X, y, cv=10, scoring='roc_auc', n_jobs=-1)

print("Mean 10-CV score of linear kernel: " + str(lin_score.mean()))
print("Mean 10-CV score of polynomial kernel: " + str(pol_score.mean()))
print("Mean 10-CV score of radial basis function kernel: " + str(rbf_score.mean()))

# Using a slightly adapted version of the plot_svm_kernels function from mglearn.
def plot_svm_kernels(X, y):
    # figure number
    fignum = 1

    # fit the model
    for kernel in ('linear', 'poly', 'rbf'):
        clf = SVC(kernel=kernel, gamma=2)
        clf.fit(X, y)

        # plot the line, the points, and the nearest vectors to the plane
        plt.figure(fignum, figsize=(4, 3))
        plt.suptitle('kernel = %s' % kernel)

        plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
                    s=85, edgecolors='k', c='w', zorder=10)
        plt.scatter(X[:, 0], X[:, 1], c=y, zorder=10, cmap=plt.cm.bwr)

#         for i, coef in enumerate(clf.dual_coef_[0]):
#             plt.annotate("%0.2f" % (coef), (clf.support_vectors_[i, 0]+0.15,clf.support_vectors_[i, 1]), fontsize=8, zorder=11)

        plt.axis('tight')
        x_min = np.min(X, axis=0)[0] - 1
        x_max = np.max(X, axis=0)[0] + 1
        y_min = np.min(X, axis=0)[1] - 1
        y_max = np.max(X, axis=0)[1] + 1

        XX, YY = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]
        Z = clf.decision_function(np.c_[XX.ravel(), YY.ravel()])

        # Put the result into a color plot
        Z = Z.reshape(XX.shape)
        plt.figure(fignum, figsize=(4, 3))
        #plt.pcolormesh(XX, YY, Z > 0, cmap=plt.cm.bwr, alpha=0.1)
        plt.contour(XX, YY, Z, colors=['k', 'k', 'k'], linestyles=['--', '-', '--'],
                    levels=[-.5, 0, .5])

        plt.xlim(x_min, x_max)
        plt.ylim(y_min, y_max)

        plt.xticks(())
        plt.yticks(())
        fignum = fignum + 1
        
    plt.show()


plot_svm_kernels(X, y)


Mean 10-CV score of linear kernel: 0.99364
Mean 10-CV score of polynomial kernel: 0.99288
Mean 10-CV score of radial basis function kernel: 0.9768

Results

As we can see, the linear kernel and the polynomial kernel fits the data very well, whereas the RBF kernel performs the slightly worse. We can see in the figure that the ideal classifier here would be a linear classifier, as the data is somewhat linearly seperable. A polynomial kernel can act like a linear kernel with high level polynomials, so that is why it performs similar to the linear kernel.

But the RBF kernel wants to look for the distance with respect to other points, which does not work so well in this case. It cannot find the optimal linear classifier, but instead tries to look for points near other points.


In [4]:
# First performing a 10CV grid search to obtain the ROC_AUC values.
from sklearn.model_selection import GridSearchCV

param_grid = {
    "C" : [2e-15, 2, 2e15],
    "gamma" : [2e-15, 2, 2e15]
}

svm_clf = SVC(kernel="rbf")

grid_search = GridSearchCV(svm_clf, param_grid, n_jobs=-1, cv=3, scoring="roc_auc")
_ = grid_search.fit(X, y)

results = pd.DataFrame(grid_search.cv_results_)
scores = np.array(results.mean_test_score)

In [ ]:
# Using a slightly adapted version of the plot_svm_kernels function from mglearn.
def plot_svm_rbf_kernel(X, y, clf, C, gamma):
    # figure number
    fignum = 1


    # plot the line, the points, and the nearest vectors to the plane
    plt.figure(fignum, figsize=(4, 3))
    plt.suptitle('C = ' + str(C) + ', gamma = ' + str(gamma))

    plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
                s=85, edgecolors='k', c='w', zorder=10)
    plt.scatter(X[:, 0], X[:, 1], c=y, zorder=10, cmap=plt.cm.bwr)

#         for i, coef in enumerate(clf.dual_coef_[0]):
#             plt.annotate("%0.2f" % (coef), (clf.support_vectors_[i, 0]+0.15,clf.support_vectors_[i, 1]), fontsize=8, zorder=11)

    plt.axis('tight')
    x_min = np.min(X, axis=0)[0] - 1
    x_max = np.max(X, axis=0)[0] + 1
    y_min = np.min(X, axis=0)[1] - 1
    y_max = np.max(X, axis=0)[1] + 1

    XX, YY = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]
    Z = clf.decision_function(np.c_[XX.ravel(), YY.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(XX.shape)
    plt.figure(fignum, figsize=(4, 3))
    #plt.pcolormesh(XX, YY, Z > 0, cmap=plt.cm.bwr, alpha=0.1)
    plt.contour(XX, YY, Z, colors=['k', 'k', 'k'], linestyles=['--', '-', '--'],
                levels=[-.5, 0, .5])

    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)

    plt.xticks(())
    plt.yticks(())
        
    plt.show()

# Question: moest je het handmatig scoren op 75-25 of moet je grid_search gebruiken?
# Zo ja, als je grid_search gebruikt, hoe krijg je dan alle estimators terug voor de support vectors?
idx = 0
for C in [2e-15, 2, 2e15]:
    for gamma in [2e-15, 2, 2e15]:
        svm_clf = SVC(kernel='rbf', C=C, gamma=gamma)
        svm_clf.fit(X, y)
        print("C value of " + str(C) + ", gamma value of " + str(gamma))
        print("Mean test score (10-CV AUC): " + str(results.mean_test_score[idx]))
        print("Number of support vectors: " + str(np.size(svm_clf.support_vectors_, axis=0)))
        plot_svm_rbf_kernel(X, y, svm_clf, C, gamma)
        idx += 1


Out[ ]:
SVC(C=2e-15, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=2e-15, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
C value of 2e-15, gamma value of 2e-15
Mean test score (10-CV AUC): 0.993685809105
Number of support vectors: 1000
Out[ ]:
SVC(C=2e-15, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=2, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
C value of 2e-15, gamma value of 2
Mean test score (10-CV AUC): 0.984234326528
Number of support vectors: 1000
Out[ ]:
SVC(C=2e-15, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=2000000000000000.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
C value of 2e-15, gamma value of 2000000000000000.0
Mean test score (10-CV AUC): 0.5
Number of support vectors: 1000
Out[ ]:
SVC(C=2, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=2e-15, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
C value of 2, gamma value of 2e-15
Mean test score (10-CV AUC): 0.993679352139
Number of support vectors: 1000
Out[ ]:
SVC(C=2, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=2, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
C value of 2, gamma value of 2
Mean test score (10-CV AUC): 0.973999206406
Number of support vectors: 234
Out[ ]:
SVC(C=2, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=2000000000000000.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
C value of 2, gamma value of 2000000000000000.0
Mean test score (10-CV AUC): 0.5
Number of support vectors: 1000
Out[ ]:
SVC(C=2000000000000000.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=2e-15, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
C value of 2000000000000000.0, gamma value of 2e-15
Mean test score (10-CV AUC): 0.993709328331
Number of support vectors: 1000

In [ ]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "C" : [2*10**(i) for i in range(-12, 13, 1)],
    "gamma" : [2*10**(i) for i in range(-12, 13, 1)]
}

svm_clf = SVC(kernel="rbf")

grid_search = GridSearchCV(svm_clf, param_grid, n_jobs=3, cv=10, scoring="roc_auc")
_ = grid_search.fit(X, y)
# For each of the 9 combinations, create the same RBF plot as before, report the number of support vectors, and the AUC performance.

In [ ]:
results = pd.DataFrame(grid_search.cv_results_)
scores = np.array(results.mean_test_score).reshape(24, 24)
plt.figure(figsize=[8, 8])

# Plots the mean cross-validation scores
mglearn.tools.heatmap(scores, xlabel='Gamma', xticklabels=param_grid["gamma"],
                      ylabel='C', yticklabels=param_grid["C"], cmap="viridis");

Robots and SVMs (4 points (2+1+1))

The Wall Robot Navigation dataset contains about 5500 readings of an ultrasound sensor array mounted on a robot, and your task is to finetune and train an SVM classifier to predict how the robot should move next.

  • Make a stratified 80-20 split of the data. On the training set alone, optimize the main hyperparameters of the SVM for Accuracy with a random search. Vary at least the main kernel types (linear, polynomial, and RBF), the C parameter, the $\gamma$ parameter for the RBF kernel and the exponent/degree for the polynomial kernel. Report the optimal hyperparameter settings and Accuracy performance.
    • The degree of the polynonial is typically in the range 2..10.
    • Hint: note that the hyperparameter ranges depend on each other. For instance, $\gamma$ only makes sense if you have selected the RBF kernel as well. We've seen in class how to define multiple hyperparameter spaces in a random/grid search.
  • Use a 5x3-fold (5 outer, 3 inner) nested cross-validation (CV) on the whole dataset to obtain a clean evaluation. What is the mean optimized performance? Is this in line with the optimized result of the random search of the previous question?
  • Train an SVM using the optimal hyperparameter configuration you found (in part 1 of this question) and test it on the held out (20%) test set. Compare this Accuracy result with the (mean) result of the nested CV. If you would build this robot in practice, how would you find the hyperparameters to use, and which performance would you expect? Is it truly necessary to tune the hyperparameters? Which hyperparameters were most important to tune?

In [3]:
robot_data = oml.datasets.get_dataset(1497) # Download Robot data
# Get the predictors X and the labels y
X, y = robot_data.get_data(target=robot_data.default_target_attribute);

In [11]:
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score


# We'll use stratify=None for the built-in stratify of train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)


param_grid = [
    {
        'kernel':['linear'],
        'C' : [2*10**(i) for i in range(-12, 13, 1)],
    },
    {
        'kernel':['poly'],
        'degree': [i for i in range(2, 11)],
        'gamma' : [2*10**(i) for i in range(-12, 13, 1)],
        'C' : [2*10**(i) for i in range(-12, 13, 1)],
    },
    {
        'kernel':['rbf', 'sigmoid'],
        'C' : [2*10**(i) for i in range(-12, 13, 1)],
        'gamma' : [2*10**(i) for i in range(-12, 13, 1)]
    }
]

random_search = RandomizedSearchCV(SVC(), param_distributions=param_grid, n_iter=30, n_jobs=-1, cv=3)
random_search.fit(X_train, y_train)
print("Best Score (3CV accuracy): " + str(best_score_))
print("Best parameters: ")
print(best_params_)


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-11-9192ff7f349c> in <module>()
     26 
     27 random_search = RandomizedSearchCV(SVC(), param_distributions=param_grid, n_iter=30, n_jobs=-1, cv=3)
---> 28 random_search.fit(X_train, y_train)
     29 print("Best Score (3CV accuracy): " + str(best_score_))
     30 print("Best parameters: ")

/usr/lib/python3.6/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups)
   1188                                           self.n_iter,
   1189                                           random_state=self.random_state)
-> 1190         return self._fit(X, y, groups, sampled_params)

/usr/lib/python3.6/site-packages/sklearn/model_selection/_search.py in _fit(self, X, y, groups, parameter_iterable)
    562                                   return_times=True, return_parameters=True,
    563                                   error_score=self.error_score)
--> 564           for parameters in parameter_iterable
    565           for train, test in cv_iter)
    566 

/usr/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
    756             # was dispatched. In particular this covers the edge
    757             # case of Parallel used with an exhausted iterator.
--> 758             while self.dispatch_one_batch(iterator):
    759                 self._iterating = True
    760             else:

/usr/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
    601 
    602         with self._lock:
--> 603             tasks = BatchedCalls(itertools.islice(iterator, batch_size))
    604             if len(tasks) == 0:
    605                 # No more tasks available in the iterator: tell caller to stop.

/usr/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __init__(self, iterator_slice)
    125 
    126     def __init__(self, iterator_slice):
--> 127         self.items = list(iterator_slice)
    128         self._size = len(self.items)
    129 

/usr/lib/python3.6/site-packages/sklearn/model_selection/_search.py in <genexpr>(.0)
    555             n_jobs=self.n_jobs, verbose=self.verbose,
    556             pre_dispatch=pre_dispatch
--> 557         )(delayed(_fit_and_score)(clone(base_estimator), X, y, self.scorer_,
    558                                   train, test, self.verbose, parameters,
    559                                   fit_params=self.fit_params,

/usr/lib/python3.6/site-packages/sklearn/model_selection/_search.py in __iter__(self)
    228         # in this case we want to sample without replacement
    229         all_lists = np.all([not hasattr(v, "rvs")
--> 230                             for v in self.param_distributions.values()])
    231         rnd = check_random_state(self.random_state)
    232 

AttributeError: 'tuple' object has no attribute 'values'

In [10]:
param_grid


Out[10]:
[{'C': [2e-12,
   2e-11,
   2e-10,
   2e-09,
   2e-08,
   2e-07,
   2e-06,
   2e-05,
   0.0002,
   0.002,
   0.02,
   0.2,
   2,
   20,
   200,
   2000,
   20000,
   200000,
   2000000,
   20000000,
   200000000,
   2000000000,
   20000000000,
   200000000000,
   2000000000000],
  'kernel': ['linear']}]

A benchmark study (3 points (2+1))

A benchmark study is an experiment in which multiple algorithms are evaluated on multiple datasets. The end goal is to study whether one algorithm is generally better than the others. Meaningful benchmark studies can grow quite complex, here we do a simplified variant.

  • Download OpenML datasets 37, 470, 1120, 1464 and 1471. They are sufficiently large (e.g., at least 500 data points) so that the performance estimation is trustworthy. Select at least three classifiers that we discussed in class, e.g. kNN, Logistic Regression, Random Forests, Gradient Boosting, SVMs, Naive Bayes. Note that some of these algorithms take longer to train. Evaluate all classifiers (with default parameter settings) on all datasets, using a 10-fold CV and AUC. Show the results in a table and interpret them. Which is the best algorithm in this benchmark?
    • Note that these datasets have categorical features, different scales, missing values, and (likely) irrelevant features. You'll need to build pipelines to correctly build all models. Also remove any row identifiers (see, e.g., https://www.openml.org/d/1120)
    • Hint: You can either compare the performances directly, or (better) use a statistical significance test, e.g. a pairwise t-test or (better) Wilcoxon signed ranks test, to see whether the performance differences are significant. This is covered in statistics courses. You can then count wins, ties and losses.
  • Repeat the benchmark, but now additionally optimize the main hyperparameters of each algorithm in a grid or random search (explore at least 5 values per hyperparameter, where possible). Does this affect the ranking of the algorithms?

Gaussian Processes (2 points (1+1))

Consider the RAM prices dataset (included in the data folder). Separate the data in a training set of all data points up until the year 2000, and a test set with all points after that.

  • Train several of the algorithms we have covered in the course that can handle regression. Include at least linear regression, decision tree, and RandomForest. Which ones give the best $R^2$ performance on the test set? Plot the predictions (both on the training and test data) on the figure below. Use different colors for different algorithms or build multiple plots.
  • Train a Gaussian process on an increasing amount of samples of the training data. Start with 5 random sample and plot the predictions (both the mean and the uncertainty interval) for both training and test data, as shown in class. Now add 5 more points and retrain and redraw. Do this a couple of times and interpret/explain what you see. Finally, train the Gaussian on the full dataset and again show plot the predictions. Evaluate on the test set using $R^2$. Compare these results with those achieved with other algorithms and explain.

In [3]:
ram_prices = pd.read_csv('data/ram_price.csv')

plt.semilogy(ram_prices.date, ram_prices.price)
plt.xlabel("Year")
plt.ylabel("Price in $/Mbyte");


A mini-data mining challenge (2 points (+1))

The goal here is to use everything you have learned to build the best model for a given classification task. The task is hosted on OpenML, so you will receive the train-test splits, and your model will be evaluated on the server. The goal is to reasonably select algorithms and hyperparameter settings to obtain the best model. You can also do model selection and parameter optimization as you have done before. Skeleton code is provided in the OpenML tutorial.

  • All details can be found online:
  • A leaderboard is kept of the best models: https://www.openml.org/t/145677#!people
    • You are able to see the solutions of others (by clicking in the timeline or run list), but resubmission of the exact same solution does not register on the leaderboard.
    • You can share one account (one API key) per team. In case you use two, we take the one that performs best.
  • You can document the different experiments that you ran in this notebook. For each experiment, provide a description of how you chose the algorithms and parameters that you submitted. Try to reason about which experiments to try, don't just do an immense random search.
  • Points are rewarded as follows:
    • 1 point for the breadth of experiments you ran (algorithms, hyperparameter settings)
    • 1 point for reasoning/insight and interpretation of the results
    • 1 (bonus) point for every team who has uploaded the best solution thus far on AUC (who reaches the top of the leaderboard at any moment during the assignment)
      • Note: On the leaderboard page, the 'frontier' line is drawn, and your top ranking is also shown in the table.

Note: Report AUC scores in your report as well. In case of issues with OpenML we will use the experiments and scores mentioned your report.


In [ ]: