Tutorial held at University of Zurich, 23-24 March 2016
(c) 2016 Jan Šnajder (jan.snajder@fer.hr), FER, University of Zagreb
Version: 0.4 (2018-12-24)
In [1]:
import numpy as np
import scipy as sp
import scipy.stats as stats
import matplotlib.pyplot as plt
from numpy.random import normal
from SU import *
%pylab inline
Typical steps in applying an ML algorithm
Instance space
Hypothesis
Empirical error
Training a model
Model complexity
Inductive bias
The three ingredients of every ML algorithm
Algo 1: Logistic regression
Feature mapping
Hyperparameters
The problem of noise
Model selection
Cross-validation
Regularization
Algo 2: Support vector machine (SVM)
Kernel trick
Algo 3: Decision tree
Algo 4: Naive Bayes classifier
1. Data preparation (cleansing and wrangling)
2. Data annotation
3. Feature engineering
4. Dimensionality reduction / feature selection
5. Model selection
6. Model training
7. Model evaluation
8. Diagnostics and debugging
9. Deployment
1-3 and 8 are task-specific
ML focuses on 4-8
We will focus on 5-7
Labeled dataset: $\mathcal{D} = \big\{(x^{(i)}, y^{(i)})\big\}_{i=1}^N \subseteq \mathcal{X}\times\mathcal{Y}$
In matrix form:
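With $N$ instances and $n$ features, the dataset can be stacked into a design matrix $\mathbf{X}$ and a label vector $\mathbf{y}$:
$$
\mathbf{X} = \begin{pmatrix}
x^{(1)}_1 & \cdots & x^{(1)}_n \\
\vdots & \ddots & \vdots \\
x^{(N)}_1 & \cdots & x^{(N)}_n
\end{pmatrix} \in \mathbb{R}^{N\times n},
\qquad
\mathbf{y} = \begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(N)} \end{pmatrix} \in \mathcal{Y}^N
$$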
In [2]:
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=20, n_features=2, n_classes=2, n_redundant=0, n_clusters_per_class=1, random_state=42)
plot_problem(X, y)
In [3]:
X
Out[3]:
In [4]:
y
Out[4]:
In [5]:
def h(x, w) : return 1 if w[0] + w[1] * x[0] + w[2] * x[1] >= 0 else 0
In [6]:
def h0(x) : return h(x, [0, 0.5, -0.5])
In [7]:
print(h0([0, -11]))
In [8]:
h0([-2, 2])
Out[8]:
In [9]:
plot_problem(X, y, h0)
In [10]:
def h(x, w) : return 1 if w[0] + w[1] * x[0] + w[2] * x[1] > 0 else 0
In [11]:
def h0(x) : return h(x, [0, 0.5, -0.5])
def h1(x) : return h(x, [1, 1, 2.0])
def h2(x) : return h(x, [-1, 2, 1])
In [12]:
for hx in [h0, h1, h2] : plot_problem(X, y, hx, surfaces=False)
In [13]:
from sklearn.metrics import zero_one_loss
from sklearn.metrics import accuracy_score
def misclassification_error(h, X, y):
    # fraction of instances in (X, y) that hypothesis h misclassifies
    error = 0
    for xi, yi in zip(X, y):
        if h(xi) != yi:
            error += 1
    return float(error) / len(X)
In [14]:
misclassification_error(h0, X, y)
Out[14]:
In [15]:
misclassification_error(h1, X, y)
Out[15]:
In [16]:
misclassification_error(h2, X, y)
Out[16]:
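The same quantity is available directly in scikit-learn as zero_one_loss (imported above), which equals 1 − accuracy_score; a quick cross-check against the hand-rolled function (a sketch reusing h0, X, and y):
y_pred = [h0(xi) for xi in X]
# zero_one_loss is the fraction of misclassified instances,
# i.e. exactly the misclassification error computed above
print(zero_one_loss(y, y_pred), 1 - accuracy_score(y, y_pred))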
Two flavors:
Language bias: the model $\mathcal{H}$ limits the set of hypotheses from which we can choose $\Rightarrow$ defines where we search
Preference bias: we prefer some hypotheses over others $\Rightarrow$ defines how we search
(1) The model: $$ h(\mathbf{x}|\mathbf{w}) = \sigma\big(\mathbf{w}^\intercal\mathbf{x}\big) = \frac{1}{1+\exp(-\mathbf{w}^\intercal\mathbf{x})} $$
$\mathbf{w}^\intercal\mathbf{x}$ is the scalar product of the weight vector $\mathbf{w}$ and the instance vector $\mathbf{x}$
In [17]:
def sigm(x): return 1 / (1 + np.exp(-x))
xs = np.linspace(-10, 10)
plt.plot(xs, sigm(xs));
In [18]:
from sklearn.linear_model import LogisticRegression
h = LogisticRegression()
h.fit(X, y)
plot_problem(X, y, h.predict)
In [19]:
h.coef_
Out[19]:
In [20]:
h.intercept_
Out[20]:
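Plugging the learned parameters back into the model, $\sigma(\mathbf{w}^\intercal\mathbf{x} + w_0)$ should agree with scikit-learn's own probability estimate; a quick check (a sketch reusing sigm, h, and X from the cells above):
x0 = X[0]
# sigma(w·x + w0) computed by hand from the learned coefficients...
manual = sigm(np.dot(h.coef_[0], x0) + h.intercept_[0])
# ...versus scikit-learn's estimate of P(y=1|x)
print(manual, h.predict_proba([x0])[0, 1])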
In [21]:
misclassification_error(h.predict, X, y)
Out[21]:
In [22]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(4)
X2 = poly.fit_transform(X)
h = LogisticRegression()
h.fit(X2, y)
plot_problem(X, y, lambda x: h.predict(poly.transform(x)))
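To see what the mapping does, here is a degree-2 mapping applied to a single made-up instance $(x_1, x_2) = (2, 3)$; it produces the terms $1, x_1, x_2, x_1^2, x_1 x_2, x_2^2$:
poly2 = PolynomialFeatures(2)
# [1, x1, x2, x1^2, x1*x2, x2^2] for (x1, x2) = (2, 3)
print(poly2.fit_transform([[2.0, 3.0]]))  # [[1. 2. 3. 4. 6. 9.]]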
In [23]:
from sklearn.preprocessing import PolynomialFeatures
for m in [1, 2, 3, 4, 5]:
    # map the two original features to all polynomial terms up to degree m
    poly = PolynomialFeatures(m)
    X2 = poly.fit_transform(X)
    h = LogisticRegression()
    h.fit(X2, y)
    plot_problem(X, y, lambda x: h.predict(poly.transform(x)))
    error = misclassification_error(lambda x: h.predict(poly.transform(x)), X, y)
    plt.title(r'$m = %d, E(h|\mathcal{D})=%.2f$' % (m, error))
    plt.show()
Two extremes:
Underfitting - $\mathcal{H}$ is too simple for our problem $\Rightarrow$ the model performs poorly on the existing data as well as on unseen data
Overfitting - $\mathcal{H}$ is too complex for our problem $\Rightarrow$ the model performs excellently on the existing data but fails miserably on unseen data
$$
\mathcal{D} = \mathcal{D}_{\mathrm{train}} \cup \mathcal{D}_{\mathrm{test}}
$$
In [24]:
X, y = make_classification(n_samples=100, n_features=2, n_classes=2, n_redundant=0, n_clusters_per_class=2, random_state=53)
plot_problem(X, y)
In [25]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
In [26]:
print(X_train.shape, y_train.shape)
In [27]:
print(X_test.shape, y_test.shape)
In [28]:
plot_problem(X_train, y_train)
In [29]:
plot_problem(X_test, y_test)
In [30]:
from sklearn.metrics import zero_one_loss
train_errs, test_errs = [], []
degrees = range(1, 6)
for m in degrees:
    poly = PolynomialFeatures(m)
    X2 = poly.fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X2, y, test_size=0.3, random_state=42)
    h = LogisticRegression()
    h.fit(X_train, y_train)
    # misclassification error on the train and test splits
    train_errs.append(zero_one_loss(y_train, h.predict(X_train)))
    test_errs.append(zero_one_loss(y_test, h.predict(X_test)))
plt.plot(degrees, test_errs, 'b', label="test")
plt.plot(degrees, train_errs, 'r', label="train")
plt.legend();
$$
\mathcal{D} = \mathcal{D}_{\mathrm{train}} \cup \mathcal{D}_{\mathrm{val}} \cup \mathcal{D}_{\mathrm{test}}
$$
$$
\mathcal{D}_{\mathrm{train}} \cap \mathcal{D}_{\mathrm{val}} =
\mathcal{D}_{\mathrm{train}} \cap \mathcal{D}_{\mathrm{test}} =
\mathcal{D}_{\mathrm{val}} \cap \mathcal{D}_{\mathrm{test}} = \emptyset
$$
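When data is scarce, a fixed validation split wastes instances; k-fold cross-validation instead rotates the held-out fold so that every instance is used for both training and validation. A minimal sketch with scikit-learn (assumes the X and y defined above):
from sklearn.model_selection import cross_val_score
# accuracy of logistic regression on each of 5 held-out folds, and its mean
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores, scores.mean())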
In [31]:
from sklearn.svm import SVC
X, y = make_classification(n_samples=20, n_features=2, n_classes=2, n_redundant=0, n_clusters_per_class=1, random_state=41)
plot_problem(X, y)
h = SVC(kernel='linear')
h.fit(X, y)
plot_problem(X, y, h.predict)
In [32]:
h = LogisticRegression()
h.fit(X, y)
plot_problem(X, y, h.predict)
In [33]:
X, y = make_classification(n_samples=20, n_features=2, n_classes=2, n_redundant=0, n_clusters_per_class=1, random_state=42)
plot_problem(X, y)
In [34]:
h = SVC(kernel='rbf', C=1)
h.fit(X, y)
plot_problem(X, y, h.predict)
In [35]:
h = SVC(kernel='rbf', C=10**5)  # note: 10^5 in Python is bitwise XOR (= 15), not 10 to the power of 5
h.fit(X, y)
plot_problem(X, y, h.predict)
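The RBF kernel used above is $k(\mathbf{x}, \mathbf{x}') = \exp(-\gamma\|\mathbf{x}-\mathbf{x}'\|^2)$; computing it by hand matches scikit-learn's implementation (a sketch; the value of $\gamma$ is arbitrary):
from sklearn.metrics.pairwise import rbf_kernel
gamma = 0.5
x1, x2 = X[0], X[1]
# exp(-gamma * ||x1 - x2||^2), by hand and via scikit-learn
print(np.exp(-gamma * np.sum((x1 - x2) ** 2)), rbf_kernel([x1], [x2], gamma=gamma)[0, 0])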
In [36]:
from sklearn.model_selection import GridSearchCV
param_grid = [{'C': [2**x for x in range(-5, 5)],
               'gamma': [2**x for x in range(-5, 5)]}]
h = GridSearchCV(SVC(), param_grid)
h.fit(X, y)
plot_problem(X, y, h.predict)
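The hyperparameter combination selected by the grid search, and its cross-validated accuracy, can be inspected afterwards:
# best (C, gamma) found by the cross-validated grid search, and its mean CV accuracy
print(h.best_params_, h.best_score_)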
In [37]:
from sklearn import tree
h = tree.DecisionTreeClassifier()
h.fit(X, y)
plot_problem(X, y, h.predict)
In [38]:
from io import StringIO
import pyparsing
import pydot
from IPython.display import Image
dot_data = StringIO()
tree.export_graphviz(h, out_file=dot_data)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
img = Image(graph.create_png())
In [39]:
img.width=300; img
Out[39]:
In [40]:
X, y = make_classification(n_samples=100, n_features=2, n_classes=2, n_redundant=0, n_clusters_per_class=2, random_state=54)
plot_problem(X, y)
In [41]:
h = tree.DecisionTreeClassifier()
h.fit(X, y)
plot_problem(X, y, h.predict)
In [42]:
misclassification_error(h.predict, X, y)
Out[42]:
In [43]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3)
In [44]:
h.fit(X_train, y_train)
plot_problem(X_train, y_train, h.predict)
In [45]:
misclassification_error(h.predict, X_train, y_train)
Out[45]:
In [46]:
misclassification_error(h.predict, X_test, y_test)
Out[46]:
In [47]:
dot_data = StringIO()
tree.export_graphviz(h, out_file=dot_data)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
img = Image(graph.create_png())
img.width=500; img
Out[47]:
In [48]:
h = tree.DecisionTreeClassifier(max_depth=3)
h.fit(X_train, y_train);
In [49]:
misclassification_error(h.predict, X_train, y_train)
Out[49]:
In [50]:
misclassification_error(h.predict, X_test, y_test)
Out[50]:
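Instead of fixing max_depth by hand, it could also be chosen by cross-validated grid search, analogously to the SVM hyperparameters above (a sketch reusing GridSearchCV):
# choose the tree depth that cross-validates best on the training split
h = GridSearchCV(tree.DecisionTreeClassifier(), {'max_depth': list(range(1, 11))})
h.fit(X_train, y_train)
print(h.best_params_)
print(misclassification_error(h.predict, X_test, y_test))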
(1) The model: $$ h(\mathbf{x}) = \mathrm{argmax}_{j}\ p(\mathbf{x}|y=j) P(y=j) $$
Or, if we want the probability of each class, by Bayes' rule: $$ P(y=j|\mathbf{x}) = \frac{p(\mathbf{x}|y=j)\,P(y=j)}{\sum_{k} p(\mathbf{x}|y=k)\,P(y=k)} $$
In [51]:
from sklearn.naive_bayes import GaussianNB
h = GaussianNB()
h.fit(X_train, y_train)
plot_problem(X_train, y_train, h.predict)
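GaussianNB also exposes the class-membership probabilities $P(y=j|\mathbf{x})$ directly; a quick look at the held-out data (a sketch reusing the train/test split from above):
# posterior probabilities for the first three test instances (one column per class)
print(h.predict_proba(X_test[:3]))
# misclassification error on the held-out test data
print(misclassification_error(h.predict, X_test, y_test))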