Nearest Neighbor

  • No model is built
  • very costly

Distance/similarity measure:

  • Euclidean
  • Cosine similarity $cos(x,y)=\frac{\sum_i x_iy_i}{\sqrt{\sum_i x_i^2} \sqrt{\sum_i y_i^2}}$
    • often used for document data, binary data, etc
Example:

Age, Salary
Compute the Euclidean distance
What is the major limitation of Euclidean distance? the differeneces are dominated by the salary, since salary has much higher values compare to age. Solution: standardization

Example:
Document $w_1$ $w_2$ $w_3$ $w_4$ $w_5$
Doc1 1 1 0 0 0
Doc2 1 1 1 1 1
Doc3 1 1 1 0 0

Nearest Neighbor Parameters

  • if k is too small: sensitive to noise
  • if k is too large: neighborhood may include points from other classes

In [ ]: