Note that the optional watermark extension is a small IPython notebook plugin that I developed to make the code reproducible. You can just skip the following line(s).
In [1]:
%load_ext watermark
%watermark -a '' -u -d -v -p numpy,pandas,matplotlib,sklearn
The use of watermark is optional. You can install this IPython extension via "pip install watermark". For more information, please see: https://github.com/rasbt/watermark.
In [2]:
from IPython.display import Image
%matplotlib inline
In [3]:
# Version check to handle the API changes introduced in scikit-learn 0.18
from distutils.version import LooseVersion as Version
from sklearn import __version__ as sklearn_version
There is no free lunch; different algorithms are suitable for different data and applications.
In the linear perceptron part, we wrote the models from the ground up.
Existing machine learning libraries
Scikit-learn
Let's use the Iris flower dataset to compare machine learning methods:
Setosa | Versicolor | Virginica (example images of the three Iris species)
Loading the Iris dataset from scikit-learn.
In [4]:
from sklearn import datasets
import numpy as np
iris = datasets.load_iris()
print('Data set size: ' + str(iris.data.shape))
X = iris.data[:, [2, 3]]
y = iris.target
print('Class labels:', np.unique(y))
In [5]:
import pandas as pd
df = pd.DataFrame(iris.data)
df.tail()
Out[5]:
Here, the third column represents the petal length, and the fourth column the petal width of the flower samples.
The classes are already converted to integer labels where 0=Iris-Setosa, 1=Iris-Versicolor, 2=Iris-Virginica.
In [6]:
if Version(sklearn_version) < '0.18':
    from sklearn.cross_validation import train_test_split
else:
    from sklearn.model_selection import train_test_split
# splitting data into 70% training and 30% test data:
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=0)
In [7]:
num_training = y_train.shape[0]
num_test = y_test.shape[0]
print('training: ' + str(num_training) + ', test: ' + str(num_test))
In [8]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
We learned and coded the perceptron in Chapter 2. Here we use the scikit-learn implementation.
A single perceptron handles only 2 classes; we will discuss how to handle $N > 2$ classes below.
In [9]:
from sklearn.linear_model import Perceptron
ppn = Perceptron(n_iter=40, eta0=0.1, random_state=0)
_ = ppn.fit(X_train_std, y_train)
In [10]:
y_pred = ppn.predict(X_test_std)
print('Misclassified samples: %d out of %d' % ((y_test != y_pred).sum(), y_test.shape[0]))
In [11]:
from sklearn.metrics import accuracy_score
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))
In [12]:
from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt
import warnings
def versiontuple(v):
    return tuple(map(int, (v.split("."))))

def plot_decision_regions(X, y, classifier, test_idx=None,
                          resolution=0.02, xlabel='', ylabel='', title=''):
    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, c=cmap(idx),
                    marker=markers[idx], label=cl)

    # highlight test samples
    if test_idx:
        # plot all samples
        if not versiontuple(np.__version__) >= versiontuple('1.9.0'):
            X_test, y_test = X[list(test_idx), :], y[list(test_idx)]
            warnings.warn('Please update to NumPy 1.9.0 or newer')
        else:
            X_test, y_test = X[test_idx, :], y[test_idx]
        plt.scatter(X_test[:, 0],
                    X_test[:, 1],
                    c='',
                    alpha=1.0,
                    linewidths=1,
                    marker='o',
                    s=55, label='test set')

    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.legend(loc='upper left')
    plt.tight_layout()
    plt.show()
Plotting the decision regions of the perceptron trained on the standardized data:
In [13]:
X_combined_std = np.vstack((X_train_std, X_test_std))
y_combined = np.hstack((y_train, y_test))
test_idx = range(X_train_std.shape[0], X_combined_std.shape[0])
plot_decision_regions(X=X_combined_std, y=y_combined,
classifier=ppn, test_idx=test_idx,
xlabel='petal length [standardized]',
ylabel='petal width [standardized]')
In [14]:
import matplotlib.pyplot as plt
import numpy as np
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
z = np.arange(-7, 7, 0.1)
phi_z = sigmoid(z)
plt.plot(z, phi_z)
plt.axvline(0.0, color='k')
plt.ylim(-0.1, 1.1)
plt.xlabel('z')
plt.ylabel('$\phi (z)$')
# y axis ticks and gridline
plt.yticks([0.0, 0.5, 1.0])
ax = plt.gca()
ax.yaxis.grid(True)
plt.tight_layout()
# plt.savefig('./figures/sigmoid.png', dpi=300)
plt.show()
$\phi(z) = \frac{1}{1 + e^{-z}}$: sigmoid function
$\phi(z) \in [0, 1]$, so it can be interpreted as a probability: $P(y = 1 \; | \; \mathbf{x} ; \mathbf{w}) = \phi(\mathbf{w}^T \mathbf{x})$
We can then choose the class by thresholding the probability: $$ y = \begin{cases} 1 & \text{if } \phi(z) \geq 0.5 \\ 0 & \text{if } \phi(z) < 0.5 \end{cases} $$
The probability information can be very useful for many applications
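As a small illustration (not part of the original notebook), thresholding the sigmoid output at 0.5 turns the probability into a class label; the value of z below is made up:
In [ ]:
# hypothetical net input z; threshold the resulting probability at 0.5
z = 0.8
p = 1.0 / (1.0 + np.exp(-z))   # P(y = 1 | x)
label = 1 if p >= 0.5 else 0
print(p, label)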
Perceptron:
Adaline:
Logistic regression:
$J(\mathbf{w})$: cost function to minimize with parameters $\mathbf{w}$
$z = \mathbf{w}^T \mathbf{x}$
For Adaline, we minimize sum-of-squared-error: $$ J(\mathbf{w}) = \frac{1}{2} \sum_i \left( y^{(i)} - t^{(i)}\right)^2 = \frac{1}{2} \sum_i \left( \phi\left(z^{(i)}\right) - t^{(i)}\right)^2 $$
For logistic regression, we take advantage of the probability interpretation to maximize the likelihood: $$ L(\mathbf{w}) = P(t \; | \; \mathbf{x}; \mathbf{w}) = \prod_i P\left( t^{(i)} \; | \; \mathbf{x}^{(i)} ; \mathbf{w} \right) = \prod_i \phi\left(z^{(i)}\right)^{t^{(i)}} \left(1 - \phi\left(z^{(i)}\right)\right)^{1-t^{(i)}} $$
Why? $$ \begin{align} \phi\left(z^{(i)}\right)^{t^{(i)}} \left(1 - \phi\left(z^{(i)}\right)\right)^{1-t^{(i)}} = \begin{cases} \phi\left(z^{(i)} \right) & \; if \; t^{(i)} = 1 \\ 1 - \phi\left(z^{(i)}\right) & \; if \; t^{(i)} = 0 \end{cases} \end{align} $$
This is equivalent to minimizing the negative log-likelihood: $$ J(\mathbf{w}) = -\log L(\mathbf{w}) = \sum_i -t^{(i)}\log\left(\phi\left(z^{(i)}\right)\right) - \left(1 - t^{(i)}\right) \log\left(1 - \phi\left(z^{(i)}\right) \right) $$
Converting a product into a sum via log() is a common trick that makes the computation easier and more numerically stable.
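As a quick numeric illustration (not part of the original notebook), the product of many per-sample likelihoods underflows in floating point, while the sum of their logs stays well behaved; the likelihood values below are made up:
In [ ]:
rng = np.random.RandomState(0)
# 5000 hypothetical per-sample likelihood values between 0.6 and 0.9
likelihoods = rng.uniform(0.6, 0.9, size=5000)
print(np.prod(likelihoods))         # underflows to 0.0
print(np.sum(np.log(likelihoods)))  # the log-likelihood remains a usable number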
In [15]:
def cost_1(z):
    return - np.log(sigmoid(z))

def cost_0(z):
    return - np.log(1 - sigmoid(z))
z = np.arange(-10, 10, 0.1)
phi_z = sigmoid(z)
c1 = [cost_1(x) for x in z]
plt.plot(phi_z, c1, label='J(w) if t=1')
c0 = [cost_0(x) for x in z]
plt.plot(phi_z, c0, linestyle='--', label='J(w) if t=0')
plt.ylim(0.0, 5.1)
plt.xlim([0, 1])
plt.xlabel('$\phi$(z)')
plt.ylabel('J(w)')
plt.legend(loc='best')
plt.tight_layout()
# plt.savefig('./figures/log_cost.png', dpi=300)
plt.show()
From $$ \frac{\partial \log\left(x\right)}{\partial x} = \frac{1}{x} $$ and chain rule: $$ \frac{\partial f\left(y\left(x\right)\right)}{\partial x} = \frac{\partial f(y)}{\partial y} \frac{\partial y}{\partial x} $$
We know $$ J(\mathbf{w}) = \sum_i -t^{(i)}\log\left(\phi\left(z^{(i)}\right)\right) - \left(1 - t^{(i)}\right) \log\left(1 - \phi\left(z^{(i)}\right) \right) $$
$$ \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} = \sum_i \left(\frac{-t^{(i)}}{\phi\left(z^{(i)}\right)} + \frac{1- t^{(i)}}{1 - \phi\left(z^{(i)}\right)} \right) \frac{\partial \phi \left(z^{(i)}\right)}{\partial \mathbf{w}} $$For sigmoid $ \frac{\partial \phi(z)}{\partial z} = \phi(z)\left(1-\phi(z)\right) $
Thus $$ \begin{align} \delta J = \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} &= \sum_i \left(\frac{-t^{(i)}}{\phi\left(z^{(i)}\right)} + \frac{1- t^{(i)}}{1 - \phi\left(z^{(i)}\right)} \right) \phi\left(z^{(i)}\right)\left(1 - \phi\left(z^{(i)}\right) \right) \frac{\partial z^{(i)}}{\partial \mathbf{w}} \\ &= \sum_i \left( -t^{(i)}\left(1 - \phi\left(z^{(i)}\right)\right) + \left(1-t^{(i)}\right)\phi\left(z^{(i)}\right) \right) \mathbf{x}^{(i)} \\ &= \sum_i \left( -t^{(i)} + \phi\left( z^{(i)} \right) \right) \mathbf{x}^{(i)} \end{align} $$
For gradient descent $$ \begin{align} \delta \mathbf{w} &= -\eta \delta J = \eta \sum_i \left( t^{(i)} - \phi\left( z^{(i)} \right) \right) \mathbf{x}^{(i)} \\ \mathbf{w} & \leftarrow \mathbf{w} + \delta \mathbf{w} \end{align} $$ which parallels the update rules we derived in Chapter 2.
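To make the update rule concrete, here is a minimal sketch of batch gradient descent for logistic regression, applied to a binary subset (classes 0 and 1) of the standardized Iris training data from above. The class name LogisticRegressionGD and its hyper-parameters are illustrative, not part of the original notebook:
In [ ]:
class LogisticRegressionGD(object):
    """Minimal batch gradient-descent logistic regression (illustrative sketch)."""
    def __init__(self, eta=0.05, n_iter=100, random_state=1):
        self.eta = eta
        self.n_iter = n_iter
        self.random_state = random_state

    def fit(self, X, t):
        rgen = np.random.RandomState(self.random_state)
        self.w_ = rgen.normal(scale=0.01, size=1 + X.shape[1])  # w_[0] is the bias w0
        for _ in range(self.n_iter):
            z = X.dot(self.w_[1:]) + self.w_[0]
            phi = 1.0 / (1.0 + np.exp(-z))
            error = t - phi                           # (t - phi(z)) for every sample
            self.w_[1:] += self.eta * X.T.dot(error)  # w <- w + eta * sum_i (t - phi) x
            self.w_[0] += self.eta * error.sum()
        return self

    def predict(self, X):
        return np.where(X.dot(self.w_[1:]) + self.w_[0] >= 0.0, 1, 0)

# train on classes 0 and 1 only (a binary problem)
mask = (y_train == 0) | (y_train == 1)
lrgd = LogisticRegressionGD().fit(X_train_std[mask], y_train[mask])
print('training accuracy:', np.mean(lrgd.predict(X_train_std[mask]) == y_train[mask]))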
Both one-versus-one and one-versus-all have ambiguous regions and may incur more complexity/computation.
Ambiguity can be resolved via tie breakers, e.g.:
Multiple activation functions $\phi_k, k=1, 2, ... K$ each with different parameters $$ \phi_k\left(\mathbf{x}\right) = \phi\left(\mathbf{x}, \mathbf{w_k}\right) = \phi\left(\mathbf{w}_k^T \mathbf{x} \right) $$
We can then choose the class based on maximum activation: $$ y = \arg\max_k \; \phi_k\left( \mathbf{x} \right) $$
Caveat
Binary logistic regression: $$ J(\mathbf{w}) = \sum_{i \in samples} -t^{(i)}\log\left(\phi\left(z^{(i)}\right)\right) - \left(1 - t^{(i)}\right) \log\left(1 - \phi\left(z^{(i)}\right) \right) $$
Multi-class logistic regression: $$ J(\mathbf{w}) = \sum_{i \in samples} \sum_{j \in classes} -t^{(i, j)}\log\left(\phi\left(z^{(i, j)}\right)\right) - \left(1 - t^{(i, j)}\right) \log\left(1 - \phi\left(z^{(i, j)}\right) \right) $$
For $\phi \geq 0$, we can normalize for probabilistic interpretation: $$ P\left(k \; | \; \mathbf{x} ; \{\mathbf{w}_k\} \right) = \frac{\phi_k\left(\mathbf{x}\right)}{\sum_{m=1}^K \phi_m\left(\mathbf{x}\right) } $$
Or use softmax (normalized exponential) for any activation $\phi$: $$ P\left(k \; | \; \mathbf{x} ; \{\mathbf{w}_k\} \right) = \frac{e^{\phi_k\left(\mathbf{x}\right)}}{\sum_{m=1}^K e^{\phi_m\left(\mathbf{x}\right)} } $$
For example, if $\phi(z) = z$: $$ P\left(k \; | \; \mathbf{x} ; \{\mathbf{w}_k\} \right) = \frac{e^{\mathbf{w}_k^T\mathbf{x}}}{\sum_{m=1}^K e^{\mathbf{w}_m^T \mathbf{x}} } $$
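A tiny sketch of the softmax normalization above, using made-up activation values for $K = 3$ classes:
In [ ]:
def softmax(activations):
    # subtract the max for numerical stability; the result is unchanged
    e = np.exp(activations - np.max(activations))
    return e / e.sum()

phi = np.array([2.0, 1.0, -1.0])         # hypothetical activations phi_k(x)
print(softmax(phi), softmax(phi).sum())  # class probabilities, summing to 1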
For training, the model can be optimized via gradient descent.
The likelihood function (to maximize): $$ L(\mathbf{w}) = P(t \; | \; \mathbf{x}; \mathbf{w}) = \prod_i P\left( t^{(i)} \; | \; \mathbf{x}^{(i)} ; \mathbf{w} \right) $$
The loss function (to minimize): $$ J(\mathbf{w}) = -\log{L(\mathbf{w})} = -\sum_i \log{P\left( t^{(i)} \; | \; \mathbf{x}^{(i)} ; \mathbf{w} \right)} $$
In [16]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(C=1000.0, random_state=0)
lr.fit(X_train_std, y_train)
plot_decision_regions(X_combined_std, y_combined,
classifier=lr, test_idx=test_idx,
xlabel = 'petal length [standardized]',
ylabel='petal width [standardized]')
In [17]:
if Version(sklearn_version) < '0.17':
    print(lr.predict_proba(X_test_std[0, :]))
else:
    print(lr.predict_proba(X_test_std[0, :].reshape(1, -1)))
Recall our general representation of our modeling objective: $$\Phi(\mathbf{X}, \mathbf{T}, \Theta) = L\left(\mathbf{X}, \mathbf{T}, \mathbf{Y}=f(\mathbf{X}, \Theta)\right) + P(\Theta)$$
Need to balance between accuracy/bias (L) and complexity/variance (P)
A well-trained model should
The following illustrates bias and variance for a potentially non-linear model
$L_2$ norm is a common form for regularization, e.g. $ P = \lambda ||\mathbf{w}||^2 $ for the linear weights $\mathbf{w}$
$\lambda$ is a parameter to weigh between bias and variance
$C = \frac{1}{\lambda}$ for scikit-learn
In [18]:
weights, params = [], []
for c in np.arange(-5, 5):
    lr = LogisticRegression(C=10.**c, random_state=0)
    lr.fit(X_train_std, y_train)
    # coef_ has shape (n_classes, n_features); we visualize only class 1
    weights.append(lr.coef_[1])
    params.append(10.**c)
weights = np.array(weights)
plt.plot(params, weights[:, 0],
label='petal length')
plt.plot(params, weights[:, 1], linestyle='--',
label='petal width')
plt.ylabel('weight coefficient')
plt.xlabel('C')
plt.legend(loc='upper left')
plt.xscale('log')
# plt.savefig('./figures/regression_path.png', dpi=300)
plt.show()
Linear classification
Perceptron
SVM
Let us start with the simple case of two classes with labels +1 and -1. (We choose this particular combination of labeling for numerical simplicity, as follows.) Let the training dataset be $\{\mathbf{x}^{(i)}, y^{(i)}\}$, $i=1$ to $N$.
The goal is to find hyper-plane parameters $\mathbf{w}$ and $w_0$ so that $$y^{(i)} \left( \mathbf{w}^T\mathbf{x}^{(i)} + w_0\right) \geq 1, \; \forall i$$.
Note that $y^{(i)} = \pm1$ above.
For the purpose of optimization, we can cast the problem as maximize $\rho$ for: $$\frac{y^{(i)} \left( \mathbf{w}^T\mathbf{x}^{(i)} + w_0\right)}{||\mathbf{w}||} \geq \rho, \; \forall i$$ ; note that the left-hand side can be interpreted as the distance from $\mathbf{x}^{(i)}$ to the hyper-plane.
Note that the above constraint remains invariant if we multiply $\mathbf{w}$ and $w_0$ by any positive scalar.
To eliminate this ambiguity, we can fix $\rho ||\mathbf{w}|| = 1$ and minimize $||\mathbf{w}||$, i.e.: min $\frac{1}{2} ||\mathbf{w}||^2$ subject to $y^{(i)}\left( \mathbf{w}^T \mathbf{x}^{(i)} + w_0\right) \geq 1, \; \forall i$
We can use Lagrangian multipliers $\alpha^{(i)}$ for this constrained optimization problem: $$ \begin{align} L(\mathbf{w}, w_0, \alpha) &= \frac{1}{2} ||\mathbf{w}||^2 - \sum_i \alpha^{(i)} \left( y^{(i)} \left( \mathbf{w}^T \mathbf{x}^{(i)} + w_0\right) -1 \right) \\ &= \frac{1}{2} ||\mathbf{w}||^2 - \sum_i \alpha^{(i)} y^{(i)} \left( \mathbf{w}^T \mathbf{x}^{(i)} + w_0\right) + \sum_i \alpha^{(i)} \end{align} $$
With some calculus/algebraic manipulations: $$\frac{\partial L}{\partial \mathbf{w}} = 0 \Rightarrow \mathbf{w} = \sum_i \alpha^{(i)} y^{(i)} \mathbf{x}^{(i)}$$ $$\frac{\partial L}{\partial w_0} = 0 \Rightarrow \sum_i \alpha^{(i)} y^{(i)} = 0$$
Plug the above two into $L$ above, we have: $$ \begin{align} L(\mathbf{w}, w_0, \alpha) &= \frac{1}{2} \mathbf{w}^T \mathbf{w} - \mathbf{w}^T \sum_i \alpha^{(i)}y^{(i)}\mathbf{x}^{(i)} - w_0 \sum_i \alpha^{(i)} y^{(i)} + \sum_i \alpha^{(i)} \\ &= -\frac{1}{2} \mathbf{w}^T \mathbf{w} + \sum_i \alpha^{(i)} \\ &= -\frac{1}{2} \sum_i \sum_j \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} \left( \mathbf{x}^{(i)}\right)^T \mathbf{x}^{(j)} + \sum_i \alpha^{(i)} \end{align} $$ , which can be maximized, via quadratic optimization with $\alpha^{(i)}$ only, subject to the constraints: $\sum_i \alpha^{(i)} y^{(i)} = 0$ and $\alpha^{(i)} \geq 0, \; \forall i$
Note that $y^{(i)} = \pm 1$.
Once we solve for $\{ \alpha^{(i)} \}$, we will find that most of them are $0$, with only a few $> 0$.
The samples with $\alpha^{(i)} > 0$ lie on the margin boundaries and are thus called support vectors: $$y^{(i)} \left( \mathbf{w}^T \mathbf{x}^{(i)} + w_0\right) = 1$$ from which we can calculate $w_0$.
For datasets that are not linearly separable, we can introduce slack variables $\{\xi^{(i)}\}$ as follows: $$y^{(i)} \left( \mathbf{w}^T \mathbf{x}^{(i)} + w_0\right) \geq 1 - \xi^{(i)}, \; \forall i$$
For optimization, the goal is to minimize $$\frac{1}{2} ||\mathbf{w}||^2 + C \sum_i \xi^{(i)}$$ , where $C$ is the strength of the penalty factor (like in regularization).
Using the Lagrangian multipliers $\{\alpha^{(i)}, \mu^{(i)} \}$ with constraints we have: $$L = \frac{1}{2} ||\mathbf{w}||^2 + C \sum_i \xi^{(i)} - \sum_i \alpha^{(i)} \left( y^{(i)} \left( \mathbf{w}^T\mathbf{x}^{(i)} + w_0\right) - 1 + \xi^{(i)}\right) - \sum_i \mu^{(i)} \xi^{(i)}$$ , which can be solved via a similar process as in the original case without slack variables.
In [19]:
from sklearn.svm import SVC
svm = SVC(kernel='linear', C=1.0, random_state=0)
svm.fit(X_train_std, y_train)
plot_decision_regions(X_combined_std, y_combined,
classifier=svm, test_idx=test_idx,
xlabel='petal length [standardized]', ylabel='petal width [standardized]')
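After fitting, scikit-learn exposes the support vectors of the trained SVC directly; a short sketch (the exact numbers depend on the split above):
In [ ]:
# samples with alpha > 0, i.e. the support vectors found by the fitted SVC
print('support vectors per class:', svm.n_support_)
print('first few support vectors:\n', svm.support_vectors_[:3])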
In [20]:
from IPython.display import YouTubeVideo
YouTubeVideo('3liCbRZPrZA')
Out[20]:
In [21]:
YouTubeVideo('9NrALgHFwTo')
Out[21]:
For datasets that are not linearly separable, we can map them into a higher dimensional space and make them linearly separable. Let $\phi$ be this mapping: $\mathbf{z} = \phi(\mathbf{x})$
And we perform the linear decision in the $\mathbf{z}$ space instead of the original $\mathbf{x}$ space: $$y^{(i)} \left( \mathbf{w}^T \mathbf{z}^{(i)} + w_0\right) \geq 1 - \xi^{(i)}$$
Following a similar Lagrangian multiplier optimization as above, we eventually want to maximize: $$ \begin{align} L &= -\frac{1}{2} \sum_i \sum_j \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} \left(\mathbf{z}^{(i)}\right)^T \mathbf{z}^{(j)} + \sum_i \alpha^{(i)} \\ &= -\frac{1}{2} \sum_i \sum_j \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} \phi\left(\mathbf{x}^{(i)}\right)^T \phi\left(\mathbf{x}^{(j)}\right) + \sum_i \alpha^{(i)} \end{align} $$
The key idea behind the kernel trick, and kernel machines in general, is to represent the high-dimensional dot product by a kernel function: $$K\left(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}\right) = \phi\left(\mathbf{x}^{(i)}\right)^T \phi\left(\mathbf{x}^{(j)}\right)$$
Intuitively, the data points become more likely to be linearly separable in a higher dimensional space.
Recall from part of our derivation above: $$ \frac{\partial L}{\partial \mathbf{w}} = 0 \Rightarrow \mathbf{w} = \sum_i \alpha^{(i)} y^{(i)} \mathbf{z}^{(i)} $$
This allows us to compute the discriminant via the kernel trick as well: $$ \begin{align} \mathbf{w}^T \mathbf{z} &= \sum_i \alpha^{(i)} y^{(i)} \left(\mathbf{z}^{(i)}\right)^T \mathbf{z} \\ &= \sum_i \alpha^{(i)} y^{(i)} \phi\left(\mathbf{x}^{(i)}\right)^T \phi(\mathbf{x}) \\ &= \sum_i \alpha^{(i)} y^{(i)} K\left(\mathbf{x}^{(i)}, \mathbf{x}\right) \end{align} $$
Random point sets classified via XOR based on the signs of 2D coordinates:
In [22]:
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(0)
X_xor = np.random.randn(200, 2)
y_xor = np.logical_xor(X_xor[:, 0] > 0,
X_xor[:, 1] > 0)
y_xor = np.where(y_xor, 1, -1)
plt.scatter(X_xor[y_xor == 1, 0],
X_xor[y_xor == 1, 1],
c='b', marker='x',
label='1')
plt.scatter(X_xor[y_xor == -1, 0],
X_xor[y_xor == -1, 1],
c='r',
marker='s',
label='-1')
plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.legend(loc='best')
plt.tight_layout()
# plt.savefig('./figures/xor.png', dpi=300)
plt.show()
This is the classification result using an RBF (radial basis function) kernel.
Notice the non-linear decision boundaries
In [23]:
svm = SVC(kernel='rbf', random_state=0, gamma=0.10, C=10.0)
svm.fit(X_xor, y_xor)
plot_decision_regions(X_xor, y_xor,
classifier=svm)
$ K\left(\mathbf{x}, \mathbf{y}\right) = (\mathbf{x}^T\mathbf{y} + 1)^q $
Example for $d=2$ and $q=2$: $$ \begin{align} K\left(\mathbf{x}, \mathbf{y}\right) &= \left( x_1y_1 + x_2y_2 + 1 \right)^2 \\ &= 1 + 2x_1y_1 + 2x_2y_2 + 2x_1x_2y_1y_2 + x_1^2y_1^2 + x_2^2y_2^2 \end{align} $$ , which corresponds to the following mapping function: $$ \phi(\mathbf{x}) = \left[1, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_1x_2, x_1^2, x_2^2 \right]^T $$
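A quick numeric check (with made-up vectors) that the polynomial kernel equals the dot product in the mapped space for $d=2$, $q=2$:
In [ ]:
def poly_kernel(a, b, q=2):
    return (a.dot(b) + 1) ** q

def poly_map(v):
    # explicit mapping phi for d = 2, q = 2, matching the expansion above
    x1, x2 = v
    return np.array([1, np.sqrt(2)*x1, np.sqrt(2)*x2,
                     np.sqrt(2)*x1*x2, x1**2, x2**2])

a, b = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_kernel(a, b))             # kernel computed directly: 0.25
print(poly_map(a).dot(poly_map(b)))  # dot product in the mapped space: 0.25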
Scalar variance: $$ K\left(\mathbf{x}, \mathbf{y} \right) = e^{-\frac{\left|\mathbf{x} - \mathbf{y}\right|^2}{2s^2}} $$
General co-variance matrix: $$ K\left(\mathbf{x}, \mathbf{y} \right) = e^{-\frac{1}{2} \left(\mathbf{x}-\mathbf{y}\right)^T \mathbf{S}^{-1} \left(\mathbf{x} - \mathbf{y}\right)} $$
General distance function $D\left(\mathbf{x}, \mathbf{y}\right)$: $$ K\left(\mathbf{x}, \mathbf{y} \right) = e^{-\frac{D\left(\mathbf{x}, \mathbf{y} \right)}{2s^2}} $$
RBF essentially projects to an infinite dimensional space.
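As a sanity check (values made up), the scalar-variance RBF above matches scikit-learn's rbf_kernel when we set gamma $= \frac{1}{2s^2}$:
In [ ]:
from sklearn.metrics.pairwise import rbf_kernel

s = 1.5                                     # hypothetical RBF width
xa = np.array([[1.0, 2.0]])
xb = np.array([[0.0, 1.0]])
manual = np.exp(-np.sum((xa - xb) ** 2) / (2 * s ** 2))
print(manual)
print(rbf_kernel(xa, xb, gamma=1.0 / (2 * s ** 2))[0, 0])  # should match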
In [24]:
from sklearn.svm import SVC
svm = SVC(kernel='rbf', random_state=0, gamma=0.2, C=1.0)
svm.fit(X_train_std, y_train)
plot_decision_regions(X_combined_std, y_combined,
classifier=svm, test_idx=test_idx,
xlabel = 'petal length [standardized]',
ylabel = 'petal width [standardized]')
In [25]:
svm = SVC(kernel='rbf', random_state=0, gamma=100.0, C=1.0)
svm.fit(X_train_std, y_train)
plot_decision_regions(X_combined_std, y_combined,
classifier=svm, test_idx=test_idx,
xlabel = 'petal length [standardized]',
ylabel='petal width [standardized]')
$I(D)$ information/impurity for a tree node with dataset $D$
Maximize information gain $IG$ for splitting each (parent) node $D_p$ into $m$ child nodes $j$: $$ IG = I(D_p) - \sum_{j=1}^m \frac{N_j}{N_p} I(D_j) $$ Usually $m=2$ for simplicity (binary split)
Commonly used impurity measures $I$
$p(i|t)$ - proportion of the samples in node $t$ that belong to class $i$
Random variable $X$ with probability mass/density function $P(X)$
Information content $ I(X) = -\log_b\left(P(X)\right) $
Entropy is the expectation of information $$ H(X) = E(I(X)) = E(-\log_b(P(X))) $$
log base $b$ can be $2$, $e$, $10$
Continuous $X$: $$ H(X) = \int P(x) I(x) \; dx = -\int P(x) \log_b P(x) \;dx $$ Discrete $X$: $$ H(X) = \sum_i P(x_i) I(x_i) = -\sum_i P(x_i) \log_b P(x_i) $$
$-\log_b P(x)$ - the number of bits (for $b=2$) needed to encode an event that occurs with probability $P(x)$
Gini impurity: minimize the expected probability of misclassification (when labels are assigned at random according to the class proportions in the node)
$$ I_G(t) = \sum_{i=1}^c p(i|t) \left( 1 - p(i|t) \right) = 1 - \sum_{i=1}^c p(i|t)^2 $$
Similar to entropy
Entropy and Gini are probabilistic
Classification error is deterministic: $ I_E(t) = 1 - \max_i \; p(i|t) $
Entropy and the Gini index are similar, and tend to behave better than classification error
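A small worked sketch (with made-up counts) of the information gain formula above, using entropy as the impurity measure:
In [ ]:
def entropy_counts(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                     # 0 * log(0) is treated as 0
    return -np.sum(p * np.log2(p))

# hypothetical parent node (40 + 40 samples) split into two equal-sized children
parent, left, right = [40, 40], [30, 10], [10, 30]
ig = (entropy_counts(parent)
      - 0.5 * entropy_counts(left)    # each child holds 40 of the 80 samples
      - 0.5 * entropy_counts(right))
print('information gain:', ig)        # about 0.19 bits for this split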
In [26]:
import matplotlib.pyplot as plt
import numpy as np
def gini(p):
    return p * (1 - p) + (1 - p) * (1 - (1 - p))

def entropy(p):
    return - p * np.log2(p) - (1 - p) * np.log2((1 - p))

def error(p):
    return 1 - np.max([p, 1 - p])
x = np.arange(0.0, 1.0, 0.01)
ent = [entropy(p) if p != 0 else None for p in x]
sc_ent = [e * 0.5 if e else None for e in ent]
err = [error(i) for i in x]
fig = plt.figure()
ax = plt.subplot(111)
for i, lab, ls, c, in zip([ent, sc_ent, gini(x), err],
                          ['Entropy', 'Entropy (scaled)',
                           'Gini Impurity', 'Misclassification Error'],
                          ['-', '-', '--', '-.'],
                          ['black', 'lightgray', 'red', 'green', 'cyan']):
    line = ax.plot(x, i, label=lab, linestyle=ls, lw=2, color=c)
ax.legend(loc='upper center', bbox_to_anchor=(0.5, 1.15),
ncol=3, fancybox=True, shadow=False)
ax.axhline(y=0.5, linewidth=1, color='k', linestyle='--')
ax.axhline(y=1.0, linewidth=1, color='k', linestyle='--')
plt.ylim([0, 1.1])
plt.xlabel('p(i=1)')
plt.ylabel('Impurity Index')
plt.tight_layout()
#plt.savefig('./figures/impurity.png', dpi=300, bbox_inches='tight')
plt.show()
In [27]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)
tree.fit(X_train, y_train)
X_combined = np.vstack((X_train, X_test))
y_combined = np.hstack((y_train, y_test))
plot_decision_regions(X_combined, y_combined,
classifier=tree, test_idx=test_idx,
xlabel='petal length [cm]',ylabel='petal width [cm]')
In [28]:
from sklearn.tree import export_graphviz
export_graphviz(tree,
out_file='tree.dot',
feature_names=['petal length', 'petal width'])
Note
If you have scikit-learn 0.18 and pydotplus installed (e.g., via pip install pydotplus), you can also show the decision tree directly without creating a separate dot file, as shown below. Also note that scikit-learn 0.18 offers a few additional options to make the decision tree visually more appealing.
In [29]:
#import pydotplus
from IPython.display import Image
from IPython.display import display
if False and Version(sklearn_version) >= '0.18':
    try:
        import pydotplus
        dot_data = export_graphviz(
            tree,
            out_file=None,
            # the parameters below are new in sklearn 0.18
            feature_names=['petal length', 'petal width'],
            class_names=['setosa', 'versicolor', 'virginica'],
            filled=True,
            rounded=True)
        graph = pydotplus.graph_from_dot_data(dot_data)
        display(Image(graph.create_png()))
    except ImportError:
        print('pydotplus is not installed.')
Simplify the tree via pruning
Pre-pruning
Post-pruning
Pre-pruning is simpler, post-pruning works better
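In scikit-learn, pre-pruning is available through constructor parameters of DecisionTreeClassifier; a short sketch on the Iris data (the parameter values below are just examples):
In [ ]:
from sklearn.tree import DecisionTreeClassifier

# pre-pruning: cap the depth and require a minimum number of samples per leaf
pruned = DecisionTreeClassifier(criterion='entropy',
                                max_depth=3,
                                min_samples_leaf=5,
                                random_state=0)
pruned.fit(X_train, y_train)
print('test accuracy:', pruned.score(X_test, y_test))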
Decide how many trees to build
To train each tree:
(randomization is key)
Majority vote from all trees
In [30]:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(criterion='entropy',
n_estimators=10,
random_state=1,
n_jobs=2)
forest.fit(X_train, y_train)
plot_decision_regions(X_combined, y_combined,
classifier=forest, test_idx=test_idx,
xlabel = 'petal length [cm]', ylabel = 'petal width [cm]')
Linear classifiers (SVM, perceptron)
Kernel SVM
Decision tree
Personal take: Parametric versus non-parametric is more of a continuous spectrum than a binary decision. Many algorithms lie somewhere in between.
How it works:
$k$ is a hyper-parameter (picked by a human), not an (ordinary) parameter (learned from data by the machine)
Pro:
Con:
Minkowski distance of order $p$: $ d(\mathbf{x}, \mathbf{y}) = \sqrt[p]{\sum_k |\mathbf{x}_k - \mathbf{y}_k|^p} $ (see the short numeric sketch below)
Number of neighbors $k$ trade-off between bias and variance
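A short numeric sketch (made-up points) of how the Minkowski distance behaves for different $p$:
In [ ]:
def minkowski_dist(a, b, p):
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski_dist(a, b, 1))    # 7.0  (Manhattan distance)
print(minkowski_dist(a, b, 2))    # 5.0  (Euclidean distance)
print(minkowski_dist(a, b, 10))   # ~4.02, approaching max(|3|, |4|) as p grows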
In [31]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')
knn.fit(X_train_std, y_train)
plot_decision_regions(X_combined_std, y_combined,
classifier=knn, test_idx=test_idx,
xlabel='petal length [standardized]', ylabel='petal width [standardized]')
In [32]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1, p=2, metric='minkowski')
knn.fit(X_train_std, y_train)
plot_decision_regions(X_combined_std, y_combined,
classifier=knn, test_idx=test_idx,
xlabel='petal length [standardized]', ylabel='petal width [standardized]')
Too small k can cause overfitting (high variance).
In [33]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=100, p=2, metric='minkowski')
knn.fit(X_train_std, y_train)
plot_decision_regions(X_combined_std, y_combined,
classifier=knn, test_idx=test_idx,
xlabel='petal length [standardized]', ylabel='petal width [standardized]')
Too large k can cause under-fitting (high bias).
How about using different $p$ values for Minkowski distance?
In [34]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, p=1, metric='minkowski')
knn.fit(X_train_std, y_train)
plot_decision_regions(X_combined_std, y_combined,
classifier=knn, test_idx=test_idx,
xlabel='petal length [standardized]', ylabel='petal width [standardized]')
In [35]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, p=10, metric='minkowski')
knn.fit(X_train_std, y_train)
plot_decision_regions(X_combined_std, y_combined,
classifier=knn, test_idx=test_idx,
xlabel='petal length [standardized]', ylabel='petal width [standardized]')