To solve this task, you will write a lot of code to try several machine learning methods for classification and regression.
You are HIGHLY RECOMMENDED to read relevant documentation, e.g. for python, numpy, matlpotlib and sklearn. Also remember that seminars, lecture slides, Google and StackOverflow are your close friends during this course (and, probably, whole life?).
If you want an easy life, you have to use BUILT-IN METHODS of sklearn
library instead of writing tons of our yown code. There exists a class/method for almost everything you can imagine (related to this homework).
To do this part of homework, you have to write CODE directly inside specified places incide notebook CELLS.
In some problems you are asked to provide short discussion of the results. In this cases you have to create MARKDOWN cell with your comments right after the your code cell.
For every separate problem you can get only 0 points or maximal points for this problem. There are NO INTERMEDIATE scores. So make sure that you did everything required in the task
Your SOLUTION notebook MUST BE REPRODUCIBLE, i.e. if the reviewer decides to execute Kernel
-> Restart Kernel and Run All Cells
, after all the computation he will obtain exactly the same solution (with all the corresponding plots) as in your uploaded notebook. For this purpose, we suggest to fix random seed
or (better) define random_state=
inside every algorithm that uses some pseudorandomness.
Your code must be clear to the reviewer. For this purpose, try to include neccessary comments inside the code. But remember: GOOD CODE MUST BE SELF-EXPLANATORY without any additional comments.
Before the start, read several additional recommendations.
jupyter notebook
or ipython notebook
from linux console. Try jupyter lab
instead - it is a more convenient environment to work with notebooks.htop
(for CPU/RAM) or free -s 0.2
(for RAM) in terminal.sklearn
algorithms support multithreading (Ensemble Methods, Cross-Validation, etc.). Check if the particular algorithm has n_jobs
parameters and set it to -1
to use all the cores.To begin with, lets import the essential (for this assignment) libraries.
In [1]:
import numpy as np
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_moons, make_circles
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
### BEGIN Your imports
from sklearn.svm import SVC
from mlxtend.plotting import plot_decision_regions
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.gridspec as gridspec
import itertools
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, StratifiedKFold, GridSearchCV, cross_validate
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report, make_scorer, f1_score
from sklearn.random_projection import GaussianRandomProjection
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from IPython.display import display, HTML
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
### END Your imports
Write a function, which takes a matrix arr
and centers each of its rows
by the mean within that row. Check this
out for documentation.
In [2]:
def center(arr):
"""Center each row of a matrix by the mean value in that row.
Parameters
----------
arr : array, shape = (n_rows, n_cols)
The matrix, the rows of which to center.
Returns
----------
out : array, shape = (n_rows, n_cols)
The final row-centered matrix.
"""
assert arr.ndim == 2
n_rows, n_cols = arr.shape
### BEGIN Solution
out = (arr.T - np.mean(arr, axis=1) * np.ones_like(arr)).T
### END Solution
return out
In [3]:
def fancy_function(x):
def sigmoid(x):
return 1 / (1 + np.exp(-x))
"""Compute some fancy function.
Parameters
----------
x : array, 1 dimendional, shape=(n_samples,)
The array argument values.
Returns
-------
y : array, 1 dimendional, shape=(n_samples,)
The values of the fancy function.
"""
### BEGIN Solution
print(x)
max1 = np.amax((x + 5, np.zeros_like(x)), axis=0)
max2 = np.amax((5 - x, np.zeros_like(x)), axis=0)
min1 = np.amin((np.cos(2*x*np.pi), np.ones_like(x)*0.5), axis=0)
max3 = np.amax((min1, np.ones_like(x)*-0.25), axis=0)
out = sigmoid(max1 + max2 + max3)
#max2 = np.max([5 - x, 0])
#min3 = np.min([np.cos(2 * x * np.pi), 0.5])
#out = sigmoid(max1 + max2 + np.max([min3, -0.25]))
### END Solution
return out
Plot the function
In [4]:
x = np.linspace(-12.5, 12.5, num=1001)
fig, ax = plt.subplots(figsize=(5, 5))
ax.plot(x, fancy_function(x))
plt.show()
Plot the level sets of the $l^p$ norm \begin{equation} \|z\|_p = \biggl(\sum_i \lvert x_i\rvert^p\biggr)^\tfrac1{p} \,. \end{equation} and make the contour of the unit ball in $l^p$ norm stand out. Draw plots for $p \in \{0, \tfrac1{25}, \tfrac12, 1, 1.5, 2, 7, \infty\}$.
Study plotting examples on this and this pages (especially the last one) and have a look at these functions: np.meshgrid, np.linspace in numpy's documentation. We suggest to use np.linalg.norm
.
Try to produce a plot that looks like the one below:
In [5]:
p_values = [0., 0.04, 0.5, 1, 1.5, 2, 7, np.inf]
xx, yy = np.meshgrid(np.linspace(-3, 3, num=101),
np.linspace(-3, 3, num=101))
fig, axes = plt.subplots(ncols=(len(p_values) + 1)// 2,
nrows=2, figsize=(14, 7))
for p, ax in zip(p_values, axes.flat):
### BEGIN Solution
norm = np.linalg.norm([xx, yy], p, axis=0)
ax.contourf(xx, yy, norm, levels=np.linspace(0, norm.max(), 20), cmap=plt.cm.winter_r)
ax.contour(xx, yy, norm, levels=[1], colors="hotpink")
ax.set_title(r"contour = {$x : \| x \|_{%g}$ = 1}" % (p))
### END Solution
plt.show()
In this task your goal is to visualize the decision rules of several classifiers applied to artificial $2$-dimensional dataset generated by builtin sklearn.datasets
method called make_moons
. In the cell below we generate the dataset.
In [6]:
X, y = make_moons(n_samples=500, noise=0.2, random_state=404)
The goal of the subproblem is to fit the following classifiers on features X
to target y
:
For all the fitted classifiers you have to plot the decision regions (the example is shown below the cell). Each plot must have Title which contains name of the classifier and its accuracy on the data.
You can write the plotting code on your own, but we highly recommend just to use mlxtend
library (pip install mlxtend
in linux terminal), which has awesome one-line decision rule plotting function (you are to google it).
In [7]:
clfs = [
{'clf': DecisionTreeClassifier(max_depth=4, random_state=101)},
{'clf': RandomForestClassifier(n_estimators=25, max_depth=4, random_state=101, n_jobs=-1)},
{'clf': LogisticRegression(random_state=101, solver="lbfgs")},
{'clf': SVC(kernel="rbf", random_state=101, gamma="scale")},
{'clf': GaussianNB()},
{'clf': KNeighborsClassifier(n_neighbors=3, n_jobs=-1)}
]
fig, axes = plt.subplots(
ncols=(len(clfs) + 1)// 2,
nrows=2, figsize=(6 * ((len(clfs) + 1)) // 2, 12),
dpi=75
)
for clf, ax in zip(clfs, axes.flat):
clf['clf'].fit(X, y)
acc = cross_val_score(clf['clf'], X, y, scoring="accuracy", cv=5).mean()
fig = plot_decision_regions(X=X, y=y, clf=clf['clf'], legend=2, ax=ax)
ax.set_title(clf['clf'].__class__.__name__ + ", accuracy = %g" % acc, size=16)
plt.tight_layout()
In previous task 1.1 several classifiers obviously failed fitting to data. This happend because the decision rule of the classifier has a restricted form (e.g. linear for linear models), while the data is more complicated.
One may try to change the parameters of the classifier (e.g. increase the number of trees in Forest) in order to improve accuracy, but some models (especially linear) do not have parameters that can change the form of the decision rule.
In this case, the feature engeneering helps: one may try to compute new (e.g. non-linear) features based on the existing pool and fit the classifier in the new features. This may help low-complex claffifiers to fit to hard data dependencies.
Your task it to
It is your choice how to generate features. You may create hand-crafted featues and add them manualy. Nevertheless, we highly suggest to get used to and apply the following builtin sklearn
methods:
PolynomialFeatures
, GaussianRandomProjection
among others - for feature generationStandartScaler
, MinMaxScaler
among others - for feature scalingPipeline
- for combining several operations in a row (e.g. feature creation & prediction)
In [8]:
### BEGIN Solution
grid_params = [
{
'clf': [LogisticRegression(random_state=101, solver="lbfgs")],
'poly': [PolynomialFeatures()],
'poly__degree': [2,3],
'scaler': [StandardScaler()]},
{
'clf': [DecisionTreeClassifier(max_depth=4, random_state=101)],
'poly': [PolynomialFeatures()],
'poly__degree': [2,3],
'scaler' : [None]}
]
fig, axes = plt.subplots(ncols=2, nrows=1, figsize=(14, 7), dpi=75)
kfold = StratifiedKFold(n_splits=5, random_state=101, shuffle=True)
for grid_p, ax in zip(grid_params, fig.axes):
pipe = Pipeline([('scaler', StandardScaler()),
('poly', PolynomialFeatures()),
('clf', RandomForestClassifier(random_state=101, n_jobs=-1))])
feature_tuned_clf = GridSearchCV(pipe, param_grid = grid_p, cv=kfold, n_jobs=-1, \
verbose=0, scoring=make_scorer(accuracy_score))
feature_tuned_clf.fit(X, y)
fig = plot_decision_regions(X=X, y=y, clf=feature_tuned_clf.best_estimator_, legend=2, ax=ax)
ax.set_title(grid_p['clf'][0].__class__.__name__ + ", accuracy = %g" % \
accuracy_score(y, feature_tuned_clf.best_estimator_.predict(X)), size=16)
plt.tight_layout()
### END Solution
The separation line is not linear in the dataset given (and it is even not composed of lines), so that's why some conventional classifiers like LinearRegression and DecisionTree might fail. To address this problem one might transform initial feature space and make them linerarly separable. To do so we use polynomial in such a way that after preprocessing classifiers will be able to separate two classes. In case of LinearRegression the reason is simple, we just transform a curve to a line, but in case of DecisionTree we transform an initial curve to the set of lines (like pixelating), which can be easily processed then.
You are to test Random Forests and Support Vector Machines on a trivial Tic Tac Toe Endgame Dataset. Let's load it.
In [9]:
data = pd.read_csv('data/tic-tac-toe.csv')
X, y = data.drop('class', axis=1), data['class'].astype(int)
data.sample(3, random_state=101)
Out[9]:
The dataset consists of several possible endgame positions of the Tic-Tac-Toe game. The target variable is the victory of x
player over o
player (victory or defeat/draw). Since the features are categorical, we simply transform them to real-valued $-1$ for o
, $1$ for x
and $0$ for emply cell b
.
In [10]:
X = X.applymap(lambda v: 1.0 if v == 'x' else -1.0 if v == 'o' else 0.0)
X.sample(4, random_state=101).sort_index()
Out[10]:
We are going to test how machine learning algorithms can classify the final game positions into the ones when x
player won and all others. Everybody knows that for this problem there is a simple decision rule: x
wins if there are three x
's in a row/column/diagonal. But can our cool machine learning tools catch this trivial dependence? In this problem the class balance is around $2:1$ so we still use accuracy metric.
Intuitively, this rule is logical, i.e. one may expect decision-tree-based algorithm to be the most appropriate for this case. But is that true? In the code below we compare huge Random forest with simple Logistic Regression and SVM with default parameters.
In [11]:
rf = RandomForestClassifier(n_estimators=400, max_depth=10)
svm = SVC(gamma='auto')
lr = LogisticRegression(solver='lbfgs')
clfs = (rf, svm, lr)
for clf in clfs:
score = cross_val_score(clf, X, y, cv=5).mean()
name = clf.__class__.__name__
print(f'{name} scored {round(score, 3)}')
We counterintuitively see that RandomForest is outperformed by SVM. Moreover, the simplest Logistic regression significantly outperforms everything. Let's improve Random Forest and SVM in order to be able to compete with Logistic model.
In [12]:
### BEGIN Solution
grid_params = {
'clf': [SVC(random_state=101, gamma="scale")],
'clf__kernel': ['linear', 'poly', 'rbf', 'sigmoid'] }
kfold = StratifiedKFold(n_splits=5, random_state=101, shuffle=False)
pipe = Pipeline([('scaler', StandardScaler()),
('clf', SVC(random_state=101, gamma="scale"))])
tuned_clf = GridSearchCV(pipe, param_grid = grid_params, cv=kfold, n_jobs=-1, \
verbose=0, scoring=make_scorer(accuracy_score), iid=True)
tuned_clf.fit(X, y)
print("The best kernel: %s" % tuned_clf.best_params_['clf__kernel'])
print("Accuracy: %g" % cross_val_score(tuned_clf.best_estimator_, X, y, cv=5).mean())
### END Solution
In [13]:
rf = RandomForestClassifier(n_estimators=400, max_depth=10, random_state=404)
### BEGIN Solution
X_handcrafted = pd.DataFrame(X.sum(axis=1))
print("Accuracy: %g" % cross_val_score(rf, X_handcrafted, y, cv=5).mean())
### END Solution
Assuming that all possible values are +1, 0 and -1 we can sum all values on the board at final stage and classify win when total sum is more than 0. It is intuitive because win happens once imbalance occurs (except for a few cases when is doesn't hold).
In this problem you are to deal with Concrete Compressive Strength Dataset. You goal will be to determine the optimal parameters for two Bagging-Based Forest Ensemble Regressors and compare the forests. Lets load the data and split it into test and train parts.
In [14]:
data = pd.read_csv('data/concrete.csv').astype(float)
X = data.drop('concrete_compressive_strength', axis=1)
y = data.concrete_compressive_strength
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
data.sample(3, random_state=101).sort_index()
Out[14]:
Please note that both in Task 4 and Task 5 the whole data was the Train Data. In Task 4 the output score (accuracy) was the train score (i.e. the score on the train data of the model fitted on the same data). In task 5 the output score was the validation score, i.e. the result of validating the model on the train data.
In this problem, we do a step further and split the whole data into train part (on which we train & validate) and test part (where we compute the final test score on the validated model).
In this problem you are to consider the RandomForestRegressor
and ExtraTreesRegressor
models for the prediction of concrete compressive strength under squared loss function (mean squared error). Recall that Random Forest was discussed in the lectures. Extremely Randomized Forest is an other bootstraped forest with simple tree building algorithm. Basically, each split of each tree node is chosen at random both w.r.t. feature and threshold (while in random forest the split minimizes impurity).
You have to do the following steps and answer the following questions:
In [15]:
### BEGIN Solution
grid_params = [
{
'clf': [RandomForestRegressor(random_state=101, n_jobs=-1)],
'clf__n_estimators': [1, 10, 100],
'clf__max_features': [1, 2, 4, 8],
'clf__max_depth': np.linspace(4, 20, 8),
'clf__bootstrap' : [True, False]},
{
'clf': [ExtraTreesRegressor(random_state=101, n_jobs=-1)],
'clf__n_estimators': [1, 10, 100],
'clf__max_features': [1, 2, 4, 8],
'clf__max_depth': np.linspace(4, 20, 8),
'clf__bootstrap' : [True, False]}
]
output = [
{
'clf' : "RndForestRegressor",
'train' : 0,
'validation' : 0,
'test' : 0 },
{
'clf' : "ExtraTreesRegressor",
'train' : 0,
'validation' : 0,
'test' : 0 }
]
print('%s %8s %8s %6s' % ("Ensemble Classifier", "Train", "Valid", "Test"))
for i, grid_p in enumerate(grid_params):
pipe = Pipeline([('clf', RandomForestRegressor(random_state=101, n_jobs=-1))])
tuned_clf = GridSearchCV(pipe, param_grid=grid_p, cv=5, n_jobs=-1, \
verbose=0, scoring='neg_mean_squared_error', iid=True)
tuned_clf.fit(X_train, y_train)
output[i]['train'] = np.abs(tuned_clf.score(X_train, y_train))
output[i]['validation'] = np.abs(tuned_clf.best_score_)
output[i]['test'] = np.abs(tuned_clf.score(X_test, y_test))
print('%s %8.4g %8.4g %7.4g' % (output[i]['clf'], output[i]['train'], \
output[i]['validation'], output[i]['test']))
### END Solution
Randomization itself in the context of ExtraTrees is used to lower the variance. As it is intrinsic for forests structures to have a big bias, resulted loss in test fold might be greater than in train fold, which in fact leads to the case when model is not able to fit correctly.
In this task you deal with multiclass classification problem for Glass Classification Data. Lets load the dataset.
In [16]:
data = pd.read_csv('data/glass.csv')
X, y = data.drop('Type', axis=1), data.Type
data.sample(3, random_state=101)
Out[16]:
The features of each glass oject correspond to the fraction of the particular chemical element in the object. The target variable corresponds to the type of glass (6 classes).
In this problem you have to empirically compare the time complexity and performance of several multiclass labeling strategies for different algorithms. You must consider the following algorithms:
Note that all these algorithms by default support multiclass labeling. Nevertheless, we want you to compare this approach with OneVSRest and OneVSOne approaches applied to this algorithms. More precisely, for every pair (algorithm, approach) you are to perform 5-fold cross validation on the data and output the validation score and the computation time in the table form. Please note that you also have to choose the metric to optimize during CV (e.g. accuracy, balanced accuracy) on your own.
After that, you are to answer the following questions:
In [17]:
### BEGIN Solution
clfs = [
{'clf': DecisionTreeClassifier(max_depth=7, random_state=101)},
{'clf': RandomForestClassifier(n_estimators=100, max_depth=3, random_state=101, n_jobs=-1)},
{'clf': LogisticRegression(random_state=101, multi_class='multinomial', solver="newton-cg")},
{'clf': KNeighborsClassifier(n_neighbors=3, n_jobs=-1)}
]
# Create all neccesary tables
cross_valid = pd.DataFrame(columns=['','LogisticRegression','RandomForestClassifier',\
'KNeighborsClassifier','DecisionTreeClassifier'])
cross_valid[''] = ["Multiclass", "OneVsRest", "OneVsOne"]
time_performance = pd.DataFrame(columns=['','LogisticRegression','RandomForestClassifier',\
'KNeighborsClassifier','DecisionTreeClassifier'])
time_performance[''] = ["Multiclass", "OneVsRest", "OneVsOne"]
# Do all calculations
for clf in clfs:
one_vs_one = OneVsOneClassifier(clf['clf'], n_jobs=-1)
one_vs_rest = OneVsRestClassifier(clf['clf'], n_jobs=-1)
valid_score = []
comp_time = []
for clf_clf in reversed([one_vs_one, one_vs_rest, clf['clf']]):
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=101)
scores = cross_validate(clf_clf, X, y, scoring="recall_macro",
cv=cv, return_train_score=False)
valid_score.append(scores['test_score'].mean())
comp_time.append(scores['fit_time'].mean())
cross_valid[clf['clf'].__class__.__name__] = valid_score
time_performance[clf['clf'].__class__.__name__] = comp_time
# Display results
print("Cross validation")
display(HTML(cross_valid.to_html(index=False)))
print("Computation time")
display(HTML(time_performance.to_html(index=False)))
### END Solution
There is a zoo of possible metrics to be chosen, but the main feature of this dataset is that it is quite unbalanced and multiclass. For that reason, precision metrics are not feaasible because there might be cases when there are not samples from a particular class (that leads to zero in a denominator). Another possible metric is recall, but there are two kinds of them: macro and micro. The latter is not suitable as well, as it focuses too much on the largest classes imposing unfair conditions to less populated classes. After such contemplations the recall_macro was picked as a final evaluation score.
A few words need to be said about computational time. One might easily observed that the greatest score improvment was achieved with DecisionTreeClassifier, but computational time increased significantly (8 times in OneVsOne scenario). Just a little bit lower imporovement was achieved by RandomForestClassifier, but the price for that turned out to be higher (10 times slower in OneVsOne), which is in fact also a good result.