In [1]:
%matplotlib inline
Machine learning is often divided into three broad categories: supervised, unsupervised, and reinforcement learning. We'll be skipping reinforcement learning today so we can focus on supervised and unsupervised algorithms.
Supervised learning is based on the notion of taking a set of data called a training set, consisting of labeled examples, and using it to build a model that can assign labels to new, unseen examples.
In [2]:
import random
Instead of building a set by hand, let's generate a randomized set of points on a plot, colored according to which distribution they came from. We'll use three different Gaussian distributions to create the set.
In [28]:
red = [(round(random.gauss(27, 10)), round(random.gauss(31, 10))) for x in range(100)]
green = [(round(random.gauss(40, 8)), round(random.gauss(40, 9))) for x in range(100)]
blue = [(round(random.gauss(50, 10)), round(random.gauss(60, 12))) for x in range(100)]
In [29]:
import matplotlib.pyplot as plt # Let's plot!
In [30]:
plt.rc('axes', prop_cycle=plt.cycler(color=['r', 'g', 'b']))  # 'color_cycle' was removed in newer matplotlib; prop_cycle is the current equivalent
plt.axis([0, 100, 0, 100])
plt.plot([x for x,y in red], [y for x,y in red], 'ro')
plt.plot([x for x,y in green], [y for x,y in green], 'go')
plt.plot([x for x,y in blue], [y for x,y in blue], 'bo')
plt.show()
That looks like a pretty good distribution. Intuitively, there are three regions of the graph where one color dominates. The KNN (k-nearest neighbors) algorithm predicts that an unlabeled point most likely has the same label as the majority of its k nearest neighbors.
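Before handing this to sklearn, here's a minimal from-scratch sketch of the idea (the helper knn_predict and the query point (45, 45) are purely illustrative): measure the Euclidean distance from the query point to every labeled point, take the k closest, and return the majority label among them.
In [ ]:
import math
from collections import Counter

def knn_predict(point, labeled_points, k=5):
    # labeled_points is a list of ((x, y), label) pairs
    nearest = sorted(labeled_points, key=lambda p: math.dist(point, p[0]))[:k]  # Euclidean distance (Python 3.8+)
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]  # majority label among the k nearest

labeled = ([(p, 'red') for p in red] +
           [(p, 'green') for p in green] +
           [(p, 'blue') for p in blue])
knn_predict((45, 45), labeled)  # hypothetical query point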
In [31]:
from sklearn.neighbors import KNeighborsClassifier
In [33]:
neighbors = KNeighborsClassifier(n_neighbors=5)
In [39]:
neighbors.fit(red + green + blue, [0]*100 + [1]*100 + [2]*100) # one label per point: 0 = red, 1 = green, 2 = blue
Out[39]:
In [61]:
print("Point\t\tProbability")
for x, y in zip(range(20, 80, 3), range(20, 80, 3)):
    r, g, b = neighbors.predict_proba([[x, y]])[0]  # sklearn expects a 2D array of samples
    print("P: ({},{})\tR: {}%\n\t\tG: {}%\n\t\tB: {}%\n".format(x, y, r*100, g*100, b*100))
Unsupervised learning is based on the idea of finding patterns in unlabeled data: the algorithm learns structure on its own and separates the data according to which pattern each example fits.
For the unsupervised learning example, we're going to cluster some distinct points into groups. Let's start the same way as last time.
In [69]:
red2 = [(round(random.gauss(17, 5)), round(random.gauss(21, 6))) for x in range(20)]
green2 = [(round(random.gauss(40, 8)), round(random.gauss(40, 6))) for x in range(20)]
blue2 = [(round(random.gauss(60, 7)), round(random.gauss(60, 4))) for x in range(20)]
In [71]:
plt.rc('axes', prop_cycle=plt.cycler(color=['r', 'g', 'b']))  # same prop_cycle fix as above
plt.axis([0, 100, 0, 100])
plt.plot([x for x,y in red2], [y for x,y in red2], 'ro')
plt.plot([x for x,y in green2], [y for x,y in green2], 'go')
plt.plot([x for x,y in blue2], [y for x,y in blue2], 'bo')
plt.show()
Looking at the colored chart, it's clear for the most part where the boundaries between these point groups are. Now let's see if sklearn can figure out the same thing from the data alone, with no labels to help it.
In [73]:
from sklearn.cluster import KMeans
In [75]:
kmeans = KMeans(n_clusters=3) # We know there are three clusters
In [77]:
kmeans.fit(red2+green2+blue2)
Out[77]:
In [87]:
print("Red:\t{}".format(" ".join(str(kmeans.predict(pt)[0]) for pt in red2)))
print("Green:\t{}".format(" ".join(str(kmeans.predict(pt)[0]) for pt in green2)))
print("Blue:\t{}".format(" ".join(str(kmeans.predict(pt)[0]) for pt in blue2)))
There you have it! Knowing only that it should split the data into three clusters, the K-Means algorithm in sklearn was able to find the boundaries between the groups on its own.
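One caveat: K-Means assigns arbitrary cluster indices, so the number printed for the red group need not be 0; what matters is that points from the same group share an index. The fitted model also exposes the cluster centers it found, which we can plot over the data (a small sketch reusing the setup above):
In [ ]:
print(kmeans.cluster_centers_)  # one (x, y) center per cluster
plt.axis([0, 100, 0, 100])
plt.plot([x for x, y in red2 + green2 + blue2],
         [y for x, y in red2 + green2 + blue2], 'ko')
plt.plot(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], 'rx', markersize=12)
plt.show()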