The UCI Machine Learning Repository is a "go-to" place for fascinating data sets often used in machine learning education and research.
Let's check out the Stone Flakes data set. Find and download the StoneFlakes.dat file from the data set's page on the repository, or grab it by running wget as shown below.
In [1]:
!wget http://archive.ics.uci.edu/ml/machine-learning-databases/00299/StoneFlakes.dat
Let's look at the first few lines.
In [2]:
!head StoneFlakes.dat
Read about the column names and the meaning of the ID values at the data set's web site.
Notice that the values are separated by commas, except for the first column. Also notice that there are question marks where data is missing. How can we read this? Well, the usual answer is to "google" for the answer. Try searching for "read data set numpy".
numpy includes functions for reading csv files. However, the pandas package offers more powerful functions for parsing data. Let's try its read_csv function.
In [3]:
import pandas
In [4]:
d = pandas.read_csv(open('StoneFlakes.dat'))
In [5]:
d[:5]
Out[5]:
In [6]:
d = pandas.read_csv(open('StoneFlakes.dat'),sep=',')
d[:5]
Out[6]:
Let's just replace the spaces with commas, using unix, so that every value is comma-separated. Read about the tr unix command at Linux TR Command Examples.
In [7]:
! tr -s ' ' ',' < StoneFlakes.dat > StoneFlakes2.dat
! head StoneFlakes2.dat
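If you'd rather stay in Python than rely on unix tools, here is a minimal sketch of the same preprocessing. It collapses each run of spaces (and any adjacent commas) into a single comma, mimicking tr -s ' ' ','; the file names match the ones used above.
In [ ]:
import re
# Python-only alternative to the tr command above: rewrite the file with single commas as the only separator
with open('StoneFlakes.dat') as fin, open('StoneFlakes2.dat', 'w') as fout:
    for line in fin:
        fout.write(re.sub(r'[ ,]+', ',', line))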
In [8]:
d = pandas.read_csv(open('StoneFlakes2.dat'))
d[:5]
Out[8]:
In [9]:
d = pandas.read_csv(open('StoneFlakes2.dat'),na_values='?')
d[:5]
Out[9]:
In [10]:
d = pandas.read_csv(open('StoneFlakes2.dat'),na_values='?',error_bad_lines=False)
d[:5]
Out[10]:
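Note that error_bad_lines was deprecated in newer pandas releases and eventually removed. If the call above complains, a version-dependent sketch of the equivalent call uses on_bad_lines='skip' instead.
In [ ]:
# Equivalent call for newer pandas releases (parameter availability depends on your pandas version)
d = pandas.read_csv('StoneFlakes2.dat', na_values='?', on_bad_lines='skip')
d[:5]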
In [11]:
d[:5].isnull()
Out[11]:
In [12]:
d[:5].isnull().any(axis=1)
Out[12]:
In [13]:
d[:5].isnull().any(axis=1) == False
Out[13]:
In [14]:
print(d.shape)
d = d[d.isnull().any(axis=1)==False]
print(d.shape)
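pandas also provides a built-in shortcut for this; d.dropna() should give the same result as the boolean mask above.
In [ ]:
# Equivalent shortcut: drop every row that contains at least one NaN
d = d.dropna()
print(d.shape)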
In [4]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [16]:
data = d.iloc[:,1:].values
data.shape
Out[16]:
In [17]:
data[:5,:]
Out[17]:
To see this data, let's try plotting each column as a separate curve on the same axes.
In [18]:
plt.plot(data);
Each sample has 8 attributes, so each sample is a point in 8-dimensional space. I wonder how well the samples "clump" in those 8 dimensions. Let's try clustering them with the k-means algorithm.
First, let's try to find two clusters, so $k=2$. We must initialize the two means of the two clusters. Let's just pick two samples at random.
In [19]:
np.random.choice(range(data.shape[0]),2, replace=False) # data.shape[0] is number of rows, or samples
Out[19]:
In [20]:
np.random.choice(range(data.shape[0]),2, replace=False)
Out[20]:
In [21]:
centersIndex = np.random.choice(range(data.shape[0]),2, replace=False)
centersIndex
Out[21]:
In [22]:
centers = data[centersIndex,:]
centers
Out[22]:
Now we must find all samples that are closest to the first center, and those that are closest to the second center.
Check out the wonders of numpy broadcasting.
In [23]:
a = np.array([1,2,3])
b = np.array([10,20,30])
a, b
Out[23]:
In [24]:
a-b
Out[24]:
But what if we want to subtract every element of b from every element of a?
In [25]:
np.resize(a,(3,3))
Out[25]:
In [26]:
np.resize(b, (3,3))
Out[26]:
In [27]:
np.resize(a,(3,3)).T
Out[27]:
In [28]:
np.resize(a,(3,3)).T - np.resize(b,(3,3))
Out[28]:
However, we can ask numpy to do this duplication for us if we reshape a to be a column vector and leave b as a row vector.
$$ \begin{pmatrix} 1\\ 2\\ 3 \end{pmatrix} - \begin{pmatrix} 10 & 20 & 30 \end{pmatrix} \;\; = \;\; \begin{pmatrix} 1 & 1 & 1\\ 2 & 2 & 2\\ 3 & 3 & 3 \end{pmatrix} - \begin{pmatrix} 10 & 20 & 30\\ 10 & 20 & 30\\ 10 & 20 & 30 \end{pmatrix} $$
In [29]:
a[:,np.newaxis]
Out[29]:
In [30]:
a.reshape((-1,1))
Out[30]:
In [31]:
a.shape, a[:,np.newaxis].shape
Out[31]:
In [32]:
a[:,np.newaxis] - b
Out[32]:
In [33]:
a = np.array([1,2,3])
b = np.array([[10,20,30],[40,50,60]])
print(a)
print(b)
In [34]:
b-a
Out[34]:
The single row vector a is duplicated for as many rows as there are in b! We can use this to calculate the squared distance between a center and every sample.
In [35]:
centers[0,:]
Out[35]:
In [36]:
np.sum((centers[0,:] - data)**2, axis=1)
Out[36]:
In [37]:
np.sum((centers[1,:] - data)**2, axis=1) > np.sum((centers[0,:] - data)**2, axis=1)
Out[37]:
In [38]:
centers
Out[38]:
In [39]:
centers[:,np.newaxis,:].shape, data.shape
Out[39]:
In [40]:
(centers[:,np.newaxis,:] - data).shape
Out[40]:
In [41]:
np.sum((centers[:,np.newaxis,:] - data)**2, axis=2).shape
Out[41]:
In [42]:
np.argmin(np.sum((centers[:,np.newaxis,:] - data)**2, axis=2), axis=0)
Out[42]:
In [43]:
cluster = np.argmin(np.sum((centers[:,np.newaxis,:] - data)**2, axis=2), axis=0)
cluster
Out[43]:
In [44]:
data[cluster==0,:].mean(axis=0)
Out[44]:
In [45]:
data[cluster==1,:].mean(axis=0)
Out[45]:
In [46]:
k = 2
for i in range(k):
    centers[i,:] = data[cluster==i,:].mean(axis=0)
In [47]:
centers
Out[47]:
In [48]:
def kmeans(data, k = 2, n = 5):
    # Initial centers
    centers = data[np.random.choice(range(data.shape[0]),k, replace=False), :]
    # Repeat n times
    for iteration in range(n):
        # Which center is each sample closest to?
        closest = np.argmin(np.sum((centers[:,np.newaxis,:] - data)**2, axis=2), axis=0)
        # Update cluster centers
        for i in range(k):
            centers[i,:] = data[closest==i,:].mean(axis=0)
    return centers
In [49]:
kmeans(data,2)
Out[49]:
In [50]:
kmeans(data,2)
Out[50]:
Let's define $J$ from the book, which is a performance measure being minimized by k-means. It is defined on page 424 as $$ J = \sum_{n=1}^N \sum_{k=1}^K r_{nk} ||\mathbf{x}_n - \mathbf{\mu}_k||^2 $$ where $N$ is the number of samples, $K$ is the number of cluster centers, $\mathbf{x}_n$ is the $n^{th}$ sample and $\mathbf{\mu}_k$ is the $k^{th}$ center, each being an element of $\mathbf{R}^p$ where $p$ is the dimensionality of the data.
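To make the formula concrete, here is a sketch of computing $J$ directly with Python for loops (the function name calcJ_loops is just for illustration). Since $r_{nk}$ is 1 only for the center nearest to sample $n$, the inner sum over $k$ reduces to the smallest squared distance.
In [ ]:
import numpy as np

def calcJ_loops(data, centers):
    # Illustrative, loop-based version of J; slow for large data sets
    N, K = data.shape[0], centers.shape[0]
    J = 0.0
    for n in range(N):
        sqdists = [np.sum((data[n,:] - centers[k,:])**2) for k in range(K)]
        J += min(sqdists)    # r_nk picks out the closest center
    return J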
The sums can be computed using python for loops, but for loops are much slower than matrix operations in python, as the following cells show.
In [51]:
a = np.linspace(0,10,30).reshape(3,10)
a
Out[51]:
In [52]:
b = np.arange(30).reshape(3,10)
b
Out[52]:
In [53]:
result = np.zeros((3,10))
for i in range(3):
    for j in range(10):
        result[i,j] = a[i,j] + b[i,j]
result
Out[53]:
In [54]:
%%timeit
result = np.zeros((3,10))
for i in range(3):
    for j in range(10):
        result[i,j] = a[i,j] + b[i,j]
In [55]:
result = a + b
result
Out[55]:
In [56]:
%%timeit
result = a + b
So, the matrix form is roughly 10 times faster! numpy performs the element-wise work in compiled code, while the explicit for loops pay Python interpreter overhead on every element.
Now, back to our problem. How can we use matrix operations to calculate the squared distance between two centers and, say, five data samples? Let's say both are two-dimensional.
In [57]:
centers = np.array([[1,2],[5,4]])
centers
Out[57]:
In [58]:
data = np.array([[3,2],[4,6],[7,3],[4,6],[1,8]])
data
Out[58]:
This will be a little weird and hard to understand at first, but by adding an empty dimension to the centers array, numpy broadcasting does all the work for us.
In [59]:
centers[:,np.newaxis,:]
Out[59]:
In [60]:
centers[:,np.newaxis,:].shape
Out[60]:
In [61]:
data.shape
Out[61]:
In [62]:
diffsq = (centers[:,np.newaxis,:] - data)**2
diffsq
Out[62]:
In [63]:
diffsq.shape
Out[63]:
In [64]:
np.sum(diffsq,axis=2)
Out[64]:
Now we have a 2 x 5 array with the first row containing the squared distance from the first center to each of the five data samples, and the second row containing the squared distances from the second center to each of the five data samples.
Now we just have to find the smallest squared distance in each column and sum those minima.
In [65]:
np.min(np.sum(diffsq,axis=2), axis=0)
Out[65]:
In [66]:
np.sum(np.min(np.sum(diffsq,axis=2), axis=0))
Out[66]:
Let's define a function named calcJ to do this calculation.
In [7]:
def calcJ(data,centers):
    diffsq = (centers[:,np.newaxis,:] - data)**2
    return np.sum(np.min(np.sum(diffsq,axis=2), axis=0))
In [68]:
calcJ(data,centers)
Out[68]:
In [1]:
def kmeans(data, k = 2, n = 5):
    # Initialize centers and list J to track performance metric
    centers = data[np.random.choice(range(data.shape[0]),k,replace=False), :]
    J = []
    # Repeat n times
    for iteration in range(n):
        # Which center is each sample closest to?
        sqdistances = np.sum((centers[:,np.newaxis,:] - data)**2, axis=2)
        closest = np.argmin(sqdistances, axis=0)
        # Calculate J and append to list J
        J.append(calcJ(data,centers))
        # Update cluster centers
        for i in range(k):
            centers[i,:] = data[closest==i,:].mean(axis=0)
    # Calculate J one final time and return results
    J.append(calcJ(data,centers))
    return centers,J,closest
In [70]:
centers,J,closest = kmeans(data,2)
In [71]:
J
Out[71]:
In [72]:
plt.plot(J);
In [73]:
centers,J,closest = kmeans(data,2,10)
plt.plot(J);
In [74]:
centers,J,closest = kmeans(data,3,10)
plt.plot(J);
In [75]:
small = np.array([[8,7],[7,6.6],[9.2,8.3],[6.8,9.2], [1.2,3.2],[4.8,2.3],[3.4,3.2],[3.2,5.6],[1,4],[2,2.2]])
In [76]:
plt.scatter(small[:,0],small[:,1]);
In [85]:
c,J,closest = kmeans(small,2,n=2)
In [78]:
c
Out[78]:
In [79]:
closest
Out[79]:
In [80]:
plt.scatter(small[:,0], small[:,1], s=80, c=closest, alpha=0.5);
plt.scatter(c[:,0],c[:,1],s=80,c="green",alpha=0.5);
In [81]:
c,J,closest = kmeans(small,2,n=2)
plt.scatter(small[:,0], small[:,1], s=80, c=closest, alpha=0.5);
plt.scatter(c[:,0],c[:,1],s=80,c="green",alpha=0.5);
In [82]:
J
Out[82]:
In [5]:
import gzip
import pickle
with gzip.open('mnist.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f, encoding='latin1')
# zero = train_set[0][1,:].reshape((28,28,1))
# one = train_set[0][3,:].reshape((28,28,1))
# two = train_set[0][5,:].reshape((28,28,1))
# four = train_set[0][20,:].reshape((28,28,1))
X = train_set[0]                      # training images, one flattened 28x28 digit per row
T = train_set[1].reshape((-1,1))      # training labels as a column vector
Xtest = test_set[0]                   # test images
Ttest = test_set[1].reshape((-1,1))   # test labels
X.shape, T.shape, Xtest.shape, Ttest.shape
Out[5]:
In [8]:
c,J,closest = kmeans(X, k=10, n=20)
In [10]:
plt.plot(J)
Out[10]:
In [11]:
c.shape
Out[11]:
In [12]:
for i in range(10):
    plt.subplot(2,5,i+1)
    plt.imshow(-c[i,:].reshape((28,28)), interpolation='nearest', cmap='gray')
    plt.axis('off')
In [16]:
c,J,closest = kmeans(X, k=10, n=20)
plt.plot(J)
plt.figure()
for i in range(10):
    plt.subplot(2,5,i+1)
    plt.imshow(-c[i,:].reshape((28,28)), interpolation='nearest', cmap='gray')
    plt.axis('off')
In [15]:
c,J,closest = kmeans(X, k=20, n=20)
plt.plot(J)
plt.figure()
for i in range(20):
    plt.subplot(4,5,i+1)
    plt.imshow(-c[i,:].reshape((28,28)), interpolation='nearest', cmap='gray')
    plt.axis('off')
In [ ]: