In [1]:
# Some preliminary imports
%matplotlib inline
import numpy as np
import sklearn.datasets as data
import sklearn.metrics.pairwise as pairwise
from scipy.interpolate import interp1d
import matplotlib.pyplot as plt
import seaborn as sns
In [9]:
# Let's generate some toy data.
X, y = data.make_blobs(n_samples = 300, centers = np.array([[0, 0], [3, 3]]), cluster_std = 0.5, random_state = 12, shuffle = False)
plt.scatter(X[:, 0], X[:, 1])
Out[9]:
Two extremely well-separated Gaussian clusters. This will be easy--really not even requiring spectral analysis--but it's fine for demo purposes.
First step in any data science project: get a feel for the data. Fergus SSL relies quite a bit on the dimensions of the data being separable, so let's plot the histograms of each individual dimension.
In [10]:
# We'll set the number of histogram bins, B.
B = 50
plt.figure(0)
plt.hist(X[:, 0], bins = B, density = True)
plt.title("Dimension 1 ($x$-axis)")
plt.figure(1)
plt.hist(X[:, 1], bins = B, density = True)
plt.title("Dimension 2 ($y$-axis)")
Out[10]:
You see the data are easily separable along both the $x$ and $y$ axes. Put another way--if someone gives you an $x$ value of a data point and asks you what cluster the point comes from, it's easy to make that assessment. Same with just providing the $y$ value of a data point--you could easily predict what cluster it comes from.
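To make that concrete, here's a quick sketch (my own addition, not part of the original walkthrough): a naive one-dimensional threshold classifier, with the cutoff of 1.5 eyeballed from the histograms above, should recover the cluster labels almost perfectly.
In [ ]:
# Hedged sketch: classify each point using only its x-coordinate. The 1.5
# cutoff is an eyeballed assumption, not a fitted value.
pred = (X[:, 0] > 1.5).astype(int)
print("Accuracy using only the x-axis: {:.3f}".format((pred == y).mean()))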
Now let's make things a little harder.
In [11]:
rng = np.random.RandomState(74)
t = rng.normal(size = (2, 2))
Xr = X @ t
plt.scatter(Xr[:, 0], Xr[:, 1])
Out[11]:
We've effectively rotated the data about 45 degrees. If we now plot the 1D histograms for the $x$ and $y$ axes again...
In [12]:
plt.figure(0)
plt.hist(Xr[:, 0], bins = B, density = True)
plt.title("Dimension 1 ($x$-axis)")
plt.figure(1)
plt.hist(Xr[:, 1], bins = B, density = True)
plt.title("Dimension 2 ($y$-axis)")
Out[12]:
Good luck telling me where the clusters are! The two dimensions are now highly correlated. This is why, if this were the original data we were given, we would need to perform a rotation on the data--decorrelating the axes so that they become separable again in at least one dimension (i.e., in the 1D histograms).
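As a quick numerical check of that claim (my own addition, not in the original), the correlation between the two dimensions of the transformed data should now be large:
In [ ]:
# Hedged sketch: quantify the correlation the random transform introduced.
# np.corrcoef treats rows as variables, hence the transpose.
print("Correlation matrix of Xr:")
print(np.corrcoef(Xr.T))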
In [13]:
from sklearn.decomposition import PCA
pca = PCA(n_components = 2, svd_solver = "randomized", random_state = 12)
Xt = pca.fit_transform(Xr)
plt.scatter(Xt[:, 0], Xt[:, 1])
Out[13]:
In [14]:
plt.figure(0)
plt.hist(Xt[:, 0], bins = B, density = True)
plt.title("Dimension 1 ($x$-axis)")
plt.figure(1)
plt.hist(Xt[:, 1], bins = B, density = True)
plt.title("Dimension 2 ($y$-axis)")
Out[14]:
The $x$ axis is still a lost cause, but now the $y$ axis has become separable again!
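One more quick check before moving on (again, my own addition): PCA decorrelates the axes by construction, so the off-diagonal entries of the correlation matrix of Xt should be near zero--in contrast to Xr above.
In [ ]:
# Hedged sketch: the off-diagonal entries should now be approximately zero,
# since the principal components are uncorrelated by construction.
print("Correlation matrix of Xt:")
print(np.corrcoef(Xt.T))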
In [17]:
# First, define the method that, given a 1D histogram of the data, will return the
# solution to the generalized eigenvalue problem as defined in Eq. 2 of the Fergus paper.
import scipy.linalg as sla
def solve(hist):
    """
    Given a 1D histogram (density) of the data, solves the generalized
    eigenvalue problem (Ds - PWP) g = sigma * P * Dh * g and returns the
    eigenvalues (in ascending order) with their eigenvectors.

    hist : 1D histogram
    """
    W = pairwise.rbf_kernel(hist.reshape(-1, 1))  # pairwise affinities over the histogram values
    P = np.diag(hist)
    PW = P @ W  # matrix-matrix multiplication in Python 3
    PWP = PW @ P
    Ds = np.diag(PWP.sum(axis = 0))  # sum along the COLUMNS of PWP
    Dh = np.diag(PW.sum(axis = 0))  # sum along the COLUMNS of PW
    # Solve the generalized eigenvalue problem for symmetric matrices
    # (this is, after all, a Laplacian).
    L = Ds - PWP
    b = P @ Dh
    v, U = sla.eigh(L, b = b)
    return [v, U]
# Second, we'll create the 1D histograms for each dimension.
h1, e1 = np.histogram(Xt[:, 0], bins = B)
h2, e2 = np.histogram(Xt[:, 1], bins = B)
# Sidestep: add a small amount to each histogram bin, then renormalize to a density.
h1 = np.array(h1, dtype = float) + 0.01
h1 /= h1.sum()
h2 = np.array(h2, dtype = float) + 0.01
h2 /= h2.sum()
# Finally, we'll invoke the solver we wrote.
v1, U1 = solve(h1)
v2, U2 = solve(h2)
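A quick sanity check on the solver (my own addition): each row of $D_s - PWP$ sums to zero, so the constant vector is a generalized eigenvector with eigenvalue zero. The smallest eigenvalue returned by eigh should therefore be numerically zero, and its eigenfunction nearly flat.
In [ ]:
# Hedged check: eigh returns eigenvalues in ascending order, so index 0 is
# the smallest. Its eigenvalue should be ~0 and its eigenfunction ~constant.
print("Smallest eigenvalues:", v1[0], v2[0])
print("Std. dev. of first eigenfunctions:", U1[:, 0].std(), U2[:, 0].std())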
In [18]:
# How do the eigenvalues and eigenfunctions look?
plt.figure(0)
plt.title("Eigenvalues for axis 1")
plt.plot(v1)
plt.figure(1)
plt.title("Eigenvalues for axis 2")
plt.plot(v2)
Out[18]:
In [20]:
plt.figure(0)
plt.title("Axis 1")
plt.plot(U1[:, 0], label = "EF 1")
plt.plot(U1[:, 1], label = "EF 2")
plt.plot(U1[:, 2], label = "EF 3")
plt.legend(loc = 0)
plt.figure(1)
plt.title("Axis 2")
plt.plot(U2[:, 0], label = "EF 1")
plt.plot(U2[:, 1], label = "EF 2")
plt.plot(U2[:, 2], label = "EF 3")
plt.legend(loc = 0)
Out[20]:
We can see things happening with the eigenfunctions. Clearly, the eigenfunctions associated with the smallest eigenvalues are useless--but we knew that from looking at the examples in the Fergus paper itself. Let's continue on and get a feel for the interpolation process.
For simplicity, we'll just use the eigenfunctions associated with the 2nd axis ($y$ axis, in this particular case, but in $n$ dimensions this could be any axis).
In [21]:
# Assign the eigenfunctions
EF = U2[:, 1:3] # Take the eigenfunctions, starting at index 1 (inclusive) and ending at 3 (exclusive).
EV = v2[1:3] # Take the eigenvalues associated with the eigenfunctions
In [27]:
# Create the x-axis from the histogram edges.
x1 = (e1[:-1] + e1[1:]) / 2 # midpoints of the histogram bins
x2 = (e2[:-1] + e2[1:]) / 2
# Now, use the eigenfunctions (y-axis) to create the interpolators.
f1 = interp1d(x1, EF[:, 0])
f2 = interp1d(x2, EF[:, 1])
# What do they look like?
newdata = np.linspace(x1.min(), x1.max(), 100)
plt.figure(0)
plt.title("$x_1$ dimension")
plt.plot(x1, EF[:, 0], 'o', label = 'Original')
plt.plot(newdata, f1(newdata), '-', label = 'Interpolator')
plt.legend(loc = 0)
newdata = np.linspace(x2.min(), x2.max(), 100)
plt.figure(1)
plt.title("$x_2$ dimension")
plt.plot(x2, EF[:, 1], 'o', label = 'Original')
plt.plot(newdata, f2(newdata), '-', label = 'Interpolator')
plt.legend(loc = 0)
Out[27]:
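Before the heatmap, one more illustrative check (my own addition): if the first non-trivial eigenfunction of axis 2 really encodes the cluster structure, evaluating it at each data point's $y$ coordinate should split the two blobs cleanly. Here g2 is a hypothetical helper interpolator of my own, built from EF[:, 0] over that axis's bin centers.
In [ ]:
# Hedged sketch: g2 interpolates the first non-trivial eigenfunction of
# axis 2 (EF[:, 0], i.e. U2[:, 1]); extrapolation handles the few points
# that fall outside the bin-center range.
g2 = interp1d(x2, EF[:, 0], bounds_error = False, fill_value = "extrapolate")
plt.scatter(Xt[:, 0], Xt[:, 1], c = g2(Xt[:, 1]))
plt.colorbar()
plt.title("Points colored by the axis-2 eigenfunction")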
In [37]:
# We now have our eigenfunctions, at least in 1D. Let's see if we can reconstruct the heatmap from Fig. 1 of the paper.
plt.title("Original Data")
_ = plt.hexbin(Xt[:, 0], Xt[:, 1])
In [ ]:
x = np.linspace(-2, 2.5, num = 100) # Our original (rotated) data varies from about (-2, 2.5) on the x-axis.
y = np.linspace(-1, 1.0, num = 100) # ...and from about (-1, 1) on the y-axis.
# Evaluate the eigenfunctions on each dimension.
fx = f1(x)
fy = f2(y)
# Plot the heatmaps along the grid.
plt.figure(0)
plt.title("Dim $x_1$: $\sigma_1$ = {:.2f}".format(EV[0]))