This notebook contains a proof that hyperplanes are, in essence, not important.

(What is important, in a way that is not entirely clear, is their intersection, so this proof is relevant for K>2.)

1. Notation and basic analysis

$K$ is the number of classes and $D$ is the number of dimensions. Define $\{w_i\}_{i=1,\ldots,K}$, a set of $K$ random vectors, and $\{v_i\}_{i=1,\ldots,K}$, the $K$ normal vectors defining the hyperplanes of the multiclass classifier.

I will focus on the displacement applied to a vector $x$:

$$ T(x) = \sum_{i=1}^{K} \langle x, v_i \rangle w_i $$

Rewriting it a little, we can see it is just a linear transformation:

$$ T(x) = x(\sum_{i=1}^{K} v_i w_i^T) = xA $$

To gain some intuition about this linear transformation, note that it has a well-defined kernel:

$$ \ker(T) = \{x : \langle x, v_i \rangle = 0, \ i=1, \ldots, K\}, $$

and $\dim(\ker(T)) = D-K$, assuming all normal vectors are linearly independent. $\ker(T)$ is just the intersection of all the hyperplanes.
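
A quick numerical sanity check of this dimension count (a sketch; by rank-nullity $\dim(\ker(T)) = D - rank(A)$, and $rank(A) = K$ for generic normals):

In [ ]:
import numpy as np

# Sanity check: for random (hence almost surely linearly independent) normals,
# rank(A) = K, so dim(ker(T)) = D - K by rank-nullity.
K, D = 4, 10
v = [np.random.normal(size=(D, 1)) for _ in range(K)]
w = [np.random.normal(size=(1, D)) for _ in range(K)]
A = sum(v[i].dot(w[i]) for i in range(K))
print(D - np.linalg.matrix_rank(A))  # expected: D - K = 6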

2. Define helper functions


In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sympy import Matrix

In [2]:
def create_random_vectors(K, D, norm=True):
    """
    Creates K (# of classes) random D-dim row vectors, optionally unit-normalized
    """
    def make_rand_vector(dims, norm=True):
        vec = np.random.normal(0, 1, size=dims)
        if norm:
            vec = vec / np.linalg.norm(vec)
        return vec.reshape(1, -1)
    return [make_rand_vector(D, norm=norm) for _ in range(K)]

In [3]:
def calculate_transformation(v, w, x):
    """
    Executes the model, i.e. computes T(x) = xA
    @param v - hyperplane normal vectors
    @param w - random vectors
    @param x - transformed point
    """
    assert len(v) == len(w)
    # A is our transformation matrix
    A = sum(v[i].reshape(-1, 1).dot(w[i].reshape(1, -1)) for i in range(len(v)))
    return x.reshape(1, -1).dot(A)
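
A quick check (a sketch, with small arbitrary K and D) that the matrix form computed by calculate_transformation agrees with the original sum $\sum_i \langle x, v_i \rangle w_i$:

In [ ]:
# Compare x -> xA against the direct definition T(x) = sum_i <x, v_i> w_i
K, D = 3, 7
v = create_random_vectors(K, D, norm=True)
w = create_random_vectors(K, D, norm=False)
x = np.random.normal(size=D)
direct = sum(np.inner(x, v[i].reshape(-1)) * w[i] for i in range(K))
print(abs(calculate_transformation(v, w, x) - direct).max())  # ~0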

3. Theorem. What happens when we rotate the hyperplanes?

There are many ways to show that the hyperplanes are not important. One way to go about it is to show that, given ($V$, $W$) (the sets of normals and random vectors), any rotation around some axis applied to the normal vectors $V$ is equivalent to changing $W$.

Why is that equivalent to the hypothesis? If we have the optimal layer transformation $V^*$ and $W^*$, we can show that there is an equivalent pair ($RV^*$, $W'$). So we can start from the "bad", or "random", normals $RV^*$ and still recover the best projection. Because of that, we can start from an infinite number of different hyperplanes.

Theorem. Given ($V$, $W$), there exists an axis such that for each rotation $R$ around this axis, if we create $V' = \{Rv_i : v_i \in V\}$, then the model defined by ($V'$, $W$) is equivalent to ($V$, $W'$) for some $W'$.

4. Empirical proof (the theoretical one follows in Section 5)

4.1 Create a random (V, W) model

First, a quick sanity check that the hyperplanes $\langle x, v_i \rangle + b_i = 0$ have a non-empty intersection for $K < D$: we find a point on it by least squares.



In [56]:
import scipy
import scipy.linalg

In [79]:
K = 5
D = 30
v = create_random_vectors(K, D, norm=True)
V = np.vstack(v)
b = np.random.normal(size=(K, 1))
# Find a point on the intersection of the offset hyperplanes <x, v_i> + b_i = 0
sol = scipy.linalg.lstsq(V, -b)

In [83]:
print(sum(np.inner(v[i], sol[0].reshape(-1)) + b[i] for i in range(K)))  # the hyperplane equations are satisfied


[ -1.11022302e-15]

In [42]:
np.inner(v[0], sol[0].reshape(-1))  # equals -b[0], as it should


Out[42]:
array([ 0.2293337])


In [35]:
sol[1]  # residuals are empty because the system is underdetermined (K < D)


Out[35]:
array([], dtype=float64)



In [7]:
# Number of classes
K = 8
# Number of dimensions
D = 30

# Create normals and random vectors
v = create_random_vectors(K, D, norm=True)
w = create_random_vectors(K, D, norm=False)

# This is matrix A that defines the transformation
A = sum(v[i].reshape(-1,1).dot(w[i].reshape(1,-1)) for i in range(K))

4.2 What happens when we apply an orthogonal transformation to v

As a reminder, our model without the transformation is

$$ T(x) = x(\sum_{i=1}^{K} v_i w_i^T) = xA $$

After applying transformation

$$ T'(x) = x(\sum_{i=1}^{K} Qv_i w_i^T) = xQ(\sum_{i=1}^{K} v_i w_i^T) = xQA $$

Interestingly, applying this transformation to the $v_i$ is equivalent to "rotating" the whole linear transformation.
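
This is easy to verify numerically (a sketch reusing v, w, A, K, D from the cell above; here Q is just some random orthogonal matrix, not yet a smartly chosen rotation):

In [ ]:
# Verify that rotating the normals rotates the whole transformation:
# sum_i (Q v_i) w_i^T == Q A for any orthogonal Q
Q, _ = np.linalg.qr(np.random.normal(size=(D, D)))  # random orthogonal matrix
A_rot = sum(Q.dot(v[i].reshape(-1, 1)).dot(w[i].reshape(1, -1)) for i in range(K))
print(abs(A_rot - Q.dot(A)).max())  # ~0 up to floating point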

What we can do is try to find a new set $\{w_i'\}_{i=1,\ldots,K} = W'$. The main thing to note is that this amounts to a simple matrix equation. Define

$$ B = \sum_{i=1}^{K} v_i w_i'^T $$

and we would like that

$$ T'(x) = xB $$

so:

$$ T'(x) = xB = xQA $$

from which it follows that the matrices are equal:

$$ QA = B $$

This is just a linear system for the coefficients of $W'$. There are $K \times D$ such coefficients.

Cool, but there might be one problem: this is a set of $D \times D$ equations (one for each entry of the $QA$ matrix), while there are only $K \times D$ free parameters. We can prove mathematically that the system is nevertheless solvable if we choose the rotation axis smartly, which I will do in Section 5.

4.3 Choosing the rotation axis smartly

We need to choose the rotation axis such that we do not change the intersection of the hyperplanes (this is why the proof doesn't work for K=2 :).

Don't try to understand get_rotation_matrix from the code alone; the math in Section 5 is a better explanation. Note that it relies on the Gram-Schmidt routine gs defined in Section 5, so run that cell first.


In [14]:
import copy

def get_rotation_matrix(D, K, v, theta=5, random=False):
    """
    Constructs a rotation matrix that keeps the intersection of the
    hyperplanes (the nullspace of the stacked normals) invariant.
    Requires the Gram-Schmidt routine gs from Section 5.
    """

    def get_nullspace(v):
        """
        Calculates a basis of the nullspace of the list of vectors v
        """
        ns = Matrix(np.vstack(v)).nullspace()
        return np.array(ns, dtype=np.float64).reshape(len(ns), -1)

    N = get_nullspace(v)

    if random:
        v = copy.deepcopy(v)
        v[0:2] = create_random_vectors(2, D)

    # Define a basis: the K normals first, then the nullspace basis. The
    # rotation will act in the 2-dim subspace spanned by the first 2 basis vectors.
    P = np.hstack([np.hstack([v[i].reshape(-1, 1) for i in range(K)]), N.T])
    Port = np.array(gs(P.T)).T

    # Create the rotation matrix in that basis and conjugate it back
    R = np.eye(D)
    theta = theta * np.pi / 180
    R[0:2, 0:2] = np.array([[np.cos(theta), -np.sin(theta)],
                            [np.sin(theta), np.cos(theta)]])
    R_full = Port.dot(R).dot(np.linalg.inv(Port))

    return R_full


In [22]:
# Number of classes
K = 3
# Number of dimensions
D = 55

# Create normals and random vectors
v = create_random_vectors(K, D, norm=True)
w = create_random_vectors(K, D, norm=False)

# This is matrix A that defines the transformation
A = sum(v[i].reshape(-1,1).dot(w[i].reshape(1,-1)) for i in range(K))

# A naive rotation in the first two coordinates (note: theta here is in
# radians, so this is not a 45-degree rotation; it only needs to be large)
Q = np.eye(D)
theta = 45
Q[0:2,0:2] = np.array([[np.cos(theta), np.sin(theta)], [-np.sin(theta), np.cos(theta)]])
QA = Q.dot(A)
abs(QA-A).sum() # Making sure the change is big


Out[22]:
17.734953324339408

In [16]:
R = get_rotation_matrix(D, K, v, theta=45) # We rotate by 45 degrees
RA = R.dot(A)
abs(RA-A).sum() # Making sure change is big


Out[16]:
298.639646948294
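
We can also check directly (a sketch reusing R and v from the cells above) that the smart rotation leaves the intersection of the hyperplanes pointwise invariant:

In [ ]:
# Every nullspace vector of the stacked normals should be fixed by R
ns = np.array(Matrix(np.vstack(v)).nullspace(), dtype=np.float64).reshape(-1, D)
print(abs(ns.dot(R.T) - ns).max())  # ~0: kernel vectors are unchanged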

In [13]:
R = get_rotation_matrix(D, K, v, theta=45, random=True) # We rotate by 45 degrees
RA = R.dot(A)
abs(RA-A).sum() # Making sure change is big



4.4 Solve $QA = B$ by least squares

Note: there is some magic indexing below, but it just writes out the linear system entry by entry.


In [10]:
print(B[0,0:10])
print(QA[0,0:10])


[-0.06107082  0.05048638 -0.08127015  0.18721233  0.11800719  0.13899996
  0.03682909 -0.15950646 -0.10764648  0.10688499]
[-0.04935148  0.04122691 -0.06635005  0.17176184  0.11974439  0.11456017
  0.01916262 -0.13656447 -0.09649816  0.08251703]

In [ ]:
# K^2 D x K parameters

In [21]:
D*D


Out[21]:
3025

In [20]:
D*K


Out[20]:
165

In [18]:
np.linalg.matrix_rank(L)  # = D*K: only K*D of the D*D equations are independent


Out[18]:
165

In [17]:
# To test the naive rotation instead, uncomment:
# R = Q
# RA = QA

# Define the linear matrix equation: we need the coefficient matrix L and the
# right-hand side b, and then we solve the system with np.linalg.lstsq(L, b)
L = []
b = []

# Each entry (i, j) of B gives one equation: sum_k (v_k)_i * Wprim[k, j] = RA[i, j]
for i in range(D): 
    for j in range(D): 
        m = np.zeros(shape=(K,D))
        m[:, j] = [vi.reshape(-1)[i] for vi in v]
        b.append(RA[i,j])
        L.append(m.reshape(-1))
L = np.vstack(L)
b = np.array(b).reshape(-1, 1)

Wprim = np.linalg.lstsq(L, b)[0]
Wprim = Wprim.reshape(K,D)
B = sum(v[i].reshape(-1,1).dot(Wprim[i].reshape(1,-1)) for i in range(K))

print("This should be small!!!!!", abs(B - RA).max())
print("This means we have found an equivalent set of random vectors for this rotation")


This should be small!!!!! 2.88657986402541e-15
This means we have found an equivalent set of random vectors for this rotation
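
The same $W'$ can be found without the index gymnastics (a sketch): stacking the normals as rows of a $K \times D$ matrix $V$ and the unknown vectors as rows of $W'$, we have $B = \sum_i v_i w_i'^T = V^T W'$, so $W'$ is just the least-squares solution of $V^T W' = RA$:

In [ ]:
# Vectorized version of the solve above: B = V^T Wprim, so solve V^T Wprim = RA
V = np.vstack(v)                                  # K x D
Wprim2 = np.linalg.lstsq(V.T, RA, rcond=None)[0]  # K x D
print(abs(V.T.dot(Wprim2) - RA).max())  # should be as small as above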

5. Mathematical justification

This is just a sketch, because up to this point the result should already be convincing.

We can always decompose the space as $\mathbb{R}^D = \ker(T) \oplus \mathrm{span}(v_1, \ldots, v_K)$. Now we choose the rotation such that $\ker(T)$ is pointwise invariant. What is beautiful is that we then only need to "fix" the behavior of the transformation $T'$ on the $K$-dimensional complement of the kernel, and this behavior is governed by $K \times D$ parameters.

A more formal proof would perform a change of basis, using the Gram-Schmidt orthogonalization below.
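
To make the parameter counting explicit (a sketch of the argument): write $x = x_{\parallel} + x_{\perp}$ with $x_{\perp} \in \ker(T)$ and $x_{\parallel} \in \mathrm{span}(v_1, \ldots, v_K)$. If the rotation fixes $\ker(T)$ pointwise, then $T'(x_{\perp}) = 0$ and

$$ T'(x) = T'(x_{\parallel}) = \sum_{j=1}^{K} c_j \, T'(u_j), \qquad x_{\parallel} = \sum_{j=1}^{K} c_j u_j, $$

where $u_1, \ldots, u_K$ is any basis of $\mathrm{span}(v_1, \ldots, v_K)$. So $T'$ is fully determined by the $K$ images $T'(u_j) \in \mathbb{R}^D$, i.e. by exactly the $K \times D$ coefficients of $W'$.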


In [15]:
import numpy as np

def gs_coefficient(v1, v2):
    # Projection coefficient of v2 onto v1
    return np.dot(v2, v1) / np.dot(v1, v1)

def proj(v1, v2):
    # Projection of v2 onto v1
    return gs_coefficient(v1, v2) * np.asarray(v1, dtype=float)

def gs(X):
    """
    Classical Gram-Schmidt: orthogonalizes the rows of X (without normalizing)
    """
    Y = []
    for i in range(len(X)):
        temp_vec = np.asarray(X[i], dtype=float)
        for inY in Y:
            temp_vec = temp_vec - proj(inY, X[i])
        Y.append(temp_vec)
    return Y
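
A quick sanity check (a sketch) that gs indeed returns mutually orthogonal vectors:

In [ ]:
# The rows returned by gs should be mutually orthogonal (but not normalized;
# get_rotation_matrix compensates for that via np.linalg.inv(Port))
X = np.random.normal(size=(4, 6))
Y = np.vstack(gs(X))
G = Y.dot(Y.T)
print(abs(G - np.diag(np.diag(G))).max())  # off-diagonal entries ~0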