In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
# special imports for running k-means
from k_means_clustering import init_centers, k_means, plot_final
In this problem we will use $k$-means clustering on a dataset consisting of observations of dogs, cats, and mops. An example observation from each type of object is presented below:
To try to recover this information we will use an unsupervised learning algorithm called $k$-means clustering. As you may recall from lecture, the $k$ here refers to how many clusters we think exist in the data, and the goal of the algorithm is to assign labels to the data points using their distance to the centers (or means) of the clusters. For this particular problem, we assume $k=3$. After randomly initializing cluster centers, the algorithm alternates between two steps:
1. Assign each data point to the cluster whose center is closest to it.
2. Recompute each cluster center as the mean of the data points currently assigned to it.
Before you begin, load the data we will be using. For answering the questions in this problem set, use the centers loaded from the X.npz file below (i.e., do NOT randomly initialize the values yourself; the autograder for this problem relies on a "stock" initialization).
In [2]:
# NOTE: we use non-random initializations for the cluster centers
# to make autograding feasible; normally cluster centers would be
# randomly initialized.
data = np.load('data/X.npz')
X = data['X']
centers = data['centers']
print('X:\n' + str(X))
print('\ncenters:\n' + str(centers))
Also, take a look at the imported function k_means:
In [3]:
k_means??
This is the function you will run in Part C once you have completed the helper functions in parts A and B.
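For intuition, here is a minimal sketch of what such an alternating loop might look like. This is an illustration only, not the imported implementation; it assumes the update_assignments and update_parameters helpers you will write below:

def k_means_sketch(num_clusters, X, centers, n_iter=4):
    # alternate between steps (1) and (2) for a fixed number of iterations
    for _ in range(n_iter):
        # step (1): assign each point to its nearest center
        cluster_assignments = update_assignments(num_clusters, X, centers)
        # step (2): move each center to the mean of its assigned points
        centers = update_parameters(num_clusters, X, cluster_assignments)
    return cluster_assignments, centers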
First, we will need a function that gives us the distance between two points. We can use Euclidean distance to compute the distance between two points ($x_1,y_1$) and ($x_2,y_2$). Recall that Euclidean distance in $\mathbb{R}^2$ is calculated as:
$$ distance((x_1,y_1),(x_2,y_2)) = \sqrt{(x_1 - x_2)^{2} + (y_1 - y_2)^{2}} $$
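For example, the distance between $(0,0)$ and $(3,4)$ is $\sqrt{(0-3)^{2} + (0-4)^{2}} = \sqrt{25} = 5$.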
In [4]:
def distance(a, b):
    """
    Returns the Euclidean distance between two points,
    a and b, in R^2.

    Parameters
    ----------
    a, b : numpy arrays of shape (2,)
        The (x, y) coordinates for two points, a and b,
        in R^2. E.g., a[0] is the x coordinate,
        and a[1] is the y coordinate.

    Returns
    -------
    distance : float
        The Euclidean distance between a and b

    """
    ### BEGIN SOLUTION
    return np.sqrt((a[0] - b[0])**2 + (a[1] - b[1])**2)
    ### END SOLUTION
In [5]:
# add your own test cases here!
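# for example, one quick sanity check (an extra check, not one of the
# graded tests) is that distance agrees with NumPy's built-in norm:
from numpy.testing import assert_allclose
for _ in range(5):
    a, b = np.random.rand(2), np.random.rand(2)
    assert_allclose(distance(a, b), np.linalg.norm(a - b))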
In [6]:
"""Check distances computes the correct values"""
from numpy.testing import assert_allclose
assert_allclose(distance(np.array([0.0, 0.0]), np.array([0.0, 1.0])), 1.0)
assert_allclose(distance(np.array([3.0, 3.0]), np.array([4.3, 5.0])), 2.3853720883753127)
assert_allclose(distance(np.array([130.0, -25.0]), np.array([0.4, 15.0])), 135.63244449614552)
print("Success!")
In [7]:
def update_assignments(num_clusters, X, centers):
    """
    Returns the cluster assignment (number) for each data point
    in X, computed as the index of the closest cluster center.

    Parameters
    ----------
    num_clusters : int
        The number of disjoint clusters (i.e., k) in
        the data X

    X : numpy array of shape (m, 2)
        An array of m data points in R^2.

    centers : numpy array of shape (num_clusters, 2)
        The coordinates for the centers of each cluster

    Returns
    -------
    cluster_assignments : numpy array of shape (m,)
        An array containing the cluster label assignments
        for each data point in X. Each cluster label is an integer
        between 0 and (num_clusters - 1).

    """
    ### BEGIN SOLUTION
    cluster_assignments = []
    for x in X:
        # label each point with the index of its nearest center
        cluster_assignments.append(np.array([distance(x, c) for c in centers]).argmin())
    return np.array(cluster_assignments)
    ### END SOLUTION
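As an aside, the same assignment step can be written without an explicit Python loop by broadcasting over all point-center pairs at once. This is just an equivalent sketch, not required for the problem (num_clusters is kept only to mirror the original signature):

def update_assignments_vectorized(num_clusters, X, centers):
    # pairwise differences between every point and every center
    diffs = X[:, None, :] - centers[None, :, :]   # shape (m, num_clusters, 2)
    dists = np.sqrt((diffs ** 2).sum(axis=-1))    # shape (m, num_clusters)
    return dists.argmin(axis=1)                   # shape (m,)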
In [8]:
# add your own test cases here!
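# for example (an extra check, not a graded test): with a single cluster,
# every point should be assigned the label 0
labels = update_assignments(1, X, np.array([[0.0, 0.0]]))
assert (labels == 0).all()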
In [9]:
"""Check update_assignments computes the correct values"""
from nose.tools import assert_equal
from numpy.testing import assert_array_equal
# load the data
data = np.load('data/X.npz')
X = data['X']
# validate update_assignments using different values
actual = update_assignments(2, X, np.array([[3, 2], [1, 4]]))
expected = np.array([
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0])
# is the output of the correct shape?
assert_equal(actual.shape[0], X.shape[0])
# are the cluster labels correct?
assert_array_equal(expected, actual)
# validate update_assignments using different values
actual = update_assignments(3, X[:X.shape[0] // 2], np.array([X[0], X[1], X[2]]))
expected = np.array([0, 1, 2, 2, 0, 2, 1, 2, 2, 2, 0, 0, 0, 0, 0])
# is the output of the correct shape?
assert_equal(actual.shape[0], X.shape[0] // 2)
# are the cluster labels correct?
assert_array_equal(expected, actual)
# check that it uses distance
old_distance = distance
del distance
try:
    update_assignments(2, X, np.array([[3, 2], [1, 4]]))
except NameError:
    pass
else:
    raise AssertionError("update_assignments does not call distance")
finally:
    distance = old_distance
del old_distance
print("Success!")
In [10]:
def update_parameters(num_clusters, X, cluster_assignment):
    """
    Recalculates the cluster centers, given the cluster
    assignments returned by update_assignments.

    Parameters
    ----------
    num_clusters : int
        The number of disjoint clusters (i.e., k) in
        the data X

    X : numpy array of shape (m, 2)
        An array of m data points in R^2

    cluster_assignment : numpy array of shape (m,)
        The array of cluster labels assigned to each data
        point as returned from update_assignments

    Returns
    -------
    updated_centers : numpy array of shape (num_clusters, 2)
        An array containing the new positions for each of
        the cluster centers

    """
    ### BEGIN SOLUTION
    updated_centers = []
    # note: iterating over np.unique assumes every cluster has at least
    # one assigned point; an empty cluster would be dropped here
    for i in np.unique(cluster_assignment):
        cluster_idx = np.argwhere(cluster_assignment == i).ravel()
        # the new center is the mean of the points assigned to cluster i
        updated_centers.append(np.mean(X[cluster_idx, :], axis=0))
    return np.asarray(updated_centers)
    ### END SOLUTION
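As an aside, the same update can be computed without an explicit loop, e.g. with np.bincount applied per coordinate. A sketch, assuming every cluster has at least one assigned point (otherwise a division by zero occurs):

def update_parameters_vectorized(num_clusters, X, cluster_assignment):
    # per-cluster counts and coordinate sums via np.bincount
    counts = np.bincount(cluster_assignment, minlength=num_clusters)
    sums_x = np.bincount(cluster_assignment, weights=X[:, 0], minlength=num_clusters)
    sums_y = np.bincount(cluster_assignment, weights=X[:, 1], minlength=num_clusters)
    return np.stack([sums_x / counts, sums_y / counts], axis=1)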
In [11]:
# add your own test cases here!
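# for example (an extra check, not a graded test): if every point is
# assigned to cluster 0, the single updated center is the mean of X
from numpy.testing import assert_allclose
assert_allclose(update_parameters(1, X, np.zeros(X.shape[0], dtype=int)),
                X.mean(axis=0).reshape(1, 2))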
In [12]:
"""Check update_parameters computes the correct values"""
from nose.tools import assert_equal
from numpy.testing import assert_allclose
# load the data
data = np.load('data/X.npz')
X = data['X']
# validate update_parameters using different values
cluster_assignment1 = np.array([
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0])
actual = update_parameters(2, X, cluster_assignment1)
expected = np.array([[ 3.24286584, 2.71362623], [ 2.80577245, 4.07633606]])
assert_allclose(expected, actual)
cluster_assignment2 = np.array([0, 1, 2, 2, 0, 2, 1, 2, 2, 2, 0, 0, 0, 0, 0])
actual = update_parameters(3, X[:X.shape[0] // 2], cluster_assignment2)
expected = np.array([[ 3.4914304 , 2.79181724], [ 3.03095255, 2.02958778], [ 2.86686881, 1.76070598]])
assert_allclose(expected, actual, rtol=1e-6)
print("Success!")
At this stage you are ready to run the $k$-means clustering algorithm! The k_means function below will call your functions from Parts A and B to run the $k$-means algorithm on the data points in X. Note that for this problem we assume that $k = 3$. Call the function like so:
In [13]:
# load the data
data = np.load('data/X.npz')
X = data['X']
centers = data['centers']
# run k-means
cluster_assignments, updated_centers = k_means(3, X, centers, update_assignments, update_parameters, n_iter=4)
If the functions you completed above are working properly, you should see a figure with one subplot per iteration, showing the output of steps (1) and (2) across four iterations of the algorithm. This plot should give you a sense of how the algorithm progresses over time. The data points are each assigned to one of three colors corresponding to their current cluster label. The cluster centers are plotted as stars.
In [14]:
def assign_new_object(new_object, updated_centers):
    """
    Returns the cluster label (number) for new_object using k-means
    clustering.

    Parameters
    ----------
    new_object : numpy array of shape (2,)
        The (x, y) coordinates of a new object to be classified

    updated_centers : numpy array of shape (num_clusters, 2)
        An array containing the updated (x, y) coordinates for
        each cluster center

    Returns
    -------
    label : int
        The cluster label assignment for new_object. This is a
        number between 0 and (num_clusters - 1).

    """
    ### BEGIN SOLUTION
    # label the new object with the index of its nearest center
    return np.array([distance(new_object, c) for c in updated_centers]).argmin()
    ### END SOLUTION
In [15]:
# add your own test cases here!
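# for example (an extra check, not a graded test): a new object placed
# exactly on a center should receive that center's label
centers_toy = np.array([[0.0, 0.0], [10.0, 10.0]])
assert assign_new_object(np.array([10.0, 10.0]), centers_toy) == 1
assert assign_new_object(np.array([0.1, -0.2]), centers_toy) == 0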
In [16]:
"""Check assign_new_object computes the correct values"""
from nose.tools import assert_equal
# validate assign_new_object using different values
centers1 = np.array([[ 3.17014624, 2.42738134], [ 2.90932354, 4.26426491]])
assert_equal(assign_new_object(np.array([0, 1]), centers1), 0)
assert_equal(assign_new_object(np.array([1, 0]), centers1), 0)
assert_equal(assign_new_object(np.array([3, 2]), centers1), 0)
assert_equal(assign_new_object(np.array([2, 4]), centers1), 1)
centers2 = np.array([[ 3.170146, 2.427381], [ 3.109456, 1.902395], [ 2.964183, 1.827484]])
assert_equal(assign_new_object(np.array([0, 1]), centers2), 2)
assert_equal(assign_new_object(np.array([1, 0]), centers2), 2)
assert_equal(assign_new_object(np.array([3, 2]), centers2), 1)
assert_equal(assign_new_object(np.array([2, 4]), centers2), 0)
# check that it uses distance
old_distance = distance
del distance
try:
    assign_new_object(np.array([0, 1]), centers1)
except NameError:
    pass
else:
    raise AssertionError("assign_new_object does not call distance")
finally:
    distance = old_distance
del old_distance
print("Success!")
Let's go ahead and rerun $k$-means, to make sure we have the correct variables set:
In [17]:
# load the data
data = np.load('data/X.npz')
X = data['X']
centers = data['centers']
# run k-means
cluster_assignments, updated_centers = k_means(3, X, centers, update_assignments, update_parameters, n_iter=4)
Once you've implemented assign_new_object, give it a spin on the image of the Shih-Tzu:
In [18]:
new_object = np.array([3.3, 3.5]) # image coordinates
label = assign_new_object(new_object, updated_centers)
print('The new object was assigned to cluster: ' + str(label))
Finally, we can visualize this result against the true assignments using the helper function plot_final:
In [19]:
plot_final(X, cluster_assignments, updated_centers, new_object, assign_new_object)
Correct written responses should have noted that the algorithm's random initialization of the cluster centers was the cause of the different results. 0.25 points were taken off if this was not the reason given for the differences.
Students should also have commented on whether the Shih-Tzu was correctly identified as a dog. Points were taken off for explanations that did not explain WHY the algorithm did or did not meet expectations (i.e., if they just said something like "Yes, it met my expectations...").