Similarity-based approaches in machine learning come from the idea that the best way to make a prediction for a new case is to look at the most similar cases we have already seen and predict the same outcome again. The fundamental concepts required to build a system based on this idea are feature spaces and measures of similarity.
Slides are posted on the internal server http://131.96.197.204/~pmolnar/mlbook
The "distance" $d$ between two points in a vector space must satisfy the following requirements:
Some common measures of distance:
In [33]:
import numpy as np
import math as ma
import matplotlib.pyplot as plt
%matplotlib inline
In [28]:
X = np.array([3.3, 1.2])
Y = np.array([2.1, -1.8])
plt.arrow(0,0,*X, head_width=0.2);
plt.arrow(0,0,*Y, head_width=0.2);
plt.xlim([0, 4]);
plt.ylim([-2,2]);
plt.show();
In [18]:
# Euclidean distance manually:
ma.sqrt(np.sum((X-Y)**2))
Out[18]:
In [15]:
# numpy norm:
np.linalg.norm(X-Y)
Out[15]:
In a d-dimensional vector space, the Minkowski distance of order $p$ is defined as:
$d_p(X,Y) = \left(\sum_{i=1}^{d} \left|X_i - Y_i\right|^p \right)^{1/p}$
The Euclidean distance is a special case of the Minkowski distance with $p=2$.
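As an illustrative sketch (assuming the X and Y vectors defined above), the Minkowski distance for several values of $p$ can be computed with scipy.spatial.distance.minkowski; $p=2$ reproduces the Euclidean result, and letting $p \to \infty$ approaches the maximum coordinate difference:
In [ ]:
# Minkowski distance between X and Y for several orders p (illustrative sketch)
from scipy.spatial.distance import minkowski
for p in [1, 2, 3]:
    print(p, minkowski(X, Y, p))
print(np.linalg.norm(X - Y, ord=np.inf))   # the p -> infinity limit (Chebyshev)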
Some other common cases include:
- $p = 1$: the Manhattan (city-block) distance
- $p \to \infty$: the Chebyshev distance, $\max_i \left|X_i - Y_i\right|$
In [34]:
import scipy.spatial.distance as dst
In [25]:
# Manhattan distance
dst.cdist(np.expand_dims(X, axis=0),np.expand_dims(Y, axis=0),'cityblock')
Out[25]:
In [26]:
# Chebyshev distance
dst.cdist(np.expand_dims(X, axis=0),np.expand_dims(Y, axis=0),'chebyshev')
Out[26]:
The Euclidean distance can be written in (suggestive) vector notation as:
$d^2(X,Y) = (X-Y)^T I_{n \times n} (X-Y)$
Instead of the $n \times n$ identity matrix, we could use any positive definite matrix.
A positive definite matrix is a matrix $M$ for which $z^T M z > 0$ for all nonzero real vectors $z$.
We can use this matrix to appropriately rescale the axes, for example to correct for high variance along a given dimension in our feature space: this gives us the Mahalanobis metric,
$d_M^2(X,Y) = (X-Y)^T \Sigma^{-1} (X-Y)$,
where $\Sigma$ is the covariance matrix of the data points.
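As a sketch (the covariance matrix below is an illustrative assumption, not estimated from real data), the Mahalanobis distance between the X and Y vectors defined above can be computed with scipy.spatial.distance.mahalanobis, which takes the inverse covariance matrix and returns the (non-squared) distance:
In [ ]:
# Mahalanobis distance with an illustrative covariance matrix (assumed, not estimated)
from scipy.spatial.distance import mahalanobis
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
VI = np.linalg.inv(Sigma)   # scipy expects the *inverse* covariance matrix
mahalanobis(X, Y, VI)       # sqrt((X-Y)^T VI (X-Y))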
Distances can also be defined between words, taking their context into account, by representing each word as a vector (word2vec):
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
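Word2vec represents each word as a dense vector, and similarity between such vectors is usually measured with cosine similarity rather than Euclidean distance. A minimal sketch with made-up embedding vectors (the values below are purely illustrative, not real word2vec output):
In [ ]:
# cosine similarity between made-up "embedding" vectors (illustrative only)
from scipy.spatial.distance import cosine
king  = np.array([0.8, 0.1, 0.7, 0.2])
queen = np.array([0.7, 0.2, 0.8, 0.3])
apple = np.array([0.1, 0.9, 0.0, 0.8])
1 - cosine(king, queen), 1 - cosine(king, apple)   # cosine similarity = 1 - cosine distance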
In [56]:
import pandas as pd
from sklearn.neighbors import NearestNeighbors
In [54]:
# read in the basketball draft dataset
df = pd.read_csv('./Table5-2.csv', names=['ID','Speed','Agility','Draft'], skiprows=[0])
df.head()
Out[54]:
In [53]:
fig, ax = plt.subplots()
ax.margins(0.05)
groups = df.groupby('Draft')
for name, group in groups:
    ax.plot(group.Speed, group.Agility, marker='o', linestyle='', ms=12, label=name);
ax.legend(numpoints=1, loc='lower right');
In [60]:
# Let's fit a nearest-neighbor model to our data, using Euclidean distance...
nearest_neighbor_model = NearestNeighbors(n_neighbors=1, metric='euclidean').fit(df[['Speed','Agility']])
In [93]:
# OK, now let's find the nearest neighbors for some new data points!
samples = np.array([[7,7],[5,5]]) # samples to classify, in [speed, agility] format
In [82]:
fig, ax = plt.subplots()
ax.margins(0.05)
groups = df.groupby('Draft')
for name, group in groups:
    ax.plot(group.Speed, group.Agility, marker='o', linestyle='', ms=12, label=name);
ax.legend(numpoints=1, loc='lower right');
ax.plot(samples[:,0],samples[:,1], marker='o', linestyle='', ms=12, c='red');
In [94]:
nearest_neighbor_model.kneighbors(samples, return_distance=True)
Out[94]:
In [95]:
# kneighbors returns the positional indices of the nearest neighbors;
# we use them to look up the corresponding Draft labels
df.Draft.iloc[nearest_neighbor_model.kneighbors(samples, return_distance=False).ravel()]
Out[95]:
In [70]:
# indices only, this time with a different second query point [5, 4]
nearest_neighbor_model.kneighbors([[7,7],[5,4]], return_distance=False).ravel()
Out[70]:
The NearestNeighbors estimator recovers the neighbors that are closest to a query point; if we want to use k-nearest neighbors for classification, we can instead use KNeighborsClassifier.
In [84]:
from sklearn.neighbors import KNeighborsClassifier
In [88]:
# define model and train it on the input data
knn_model = KNeighborsClassifier(n_neighbors=5, metric='euclidean').fit(df[['Speed','Agility']], df['Draft'])
In [96]:
# predict classes for "samples", using k nearest neighbors
knn_model.predict(samples)
Out[96]:
We can also group the players without using the draft labels at all by clustering them with k-means, which assigns each point to the nearest of $k$ learned cluster centroids.
In [100]:
from sklearn.cluster import KMeans
In [102]:
# fit a k-means model with 2 clusters to the Speed/Agility features
kmeans_model = KMeans(2).fit(df[['Speed','Agility']])
In [104]:
# assign each player to its nearest cluster centroid
df['Clust'] = kmeans_model.predict(df[['Speed','Agility']])
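The fitted model also exposes the learned centroids via its cluster_centers_ attribute (a quick aside, not part of the original notebook), which shows where each cluster sits in the Speed/Agility plane:
In [ ]:
kmeans_model.cluster_centers_   # one [Speed, Agility] centroid per cluster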
In [105]:
fig, ax = plt.subplots()
ax.margins(0.05)
groups = df.groupby('Clust')
for name, group in groups:
    ax.plot(group.Speed, group.Agility, marker='o', linestyle='', ms=12, label=name);
ax.legend(numpoints=1, loc='lower right');
ax.plot(samples[:,0],samples[:,1], marker='o', linestyle='', ms=12, c='red');
In [101]:
help(KMeans)