SVD Practice.

2018/2/12 - WNixalo

Fastai Computational Linear Algebra (2017) §2: Topic Modeling w NMF & SVD

facebook research: Fast Randomized SVD


1. Singular-Value Decomposition

SVD is a factorization of a real or complex matrix. It factorizes a matrix $A$ into a matrix with orthonormal columns $U$, a diagonal matrix of singular values $Σ$ (aka $S$ or $s$ or $σ$) which contains the relative importance of each factor, and a matrix with orthonormal rows $V^T$, so that $A = UΣV^T$.


In [4]:
from scipy.stats import ortho_group
import numpy as np

In [9]:
Q = ortho_group.rvs(dim=3)
B = np.random.randint(0,10,size=(3,3))
A = Q@B@Q.T

In [18]:
U,S,V = np.linalg.svd(A, full_matrices=False)  # NB: the third return value is actually V^T (aka Vh)

In [19]:
U


Out[19]:
array([[-0.30605318,  0.00646792,  0.95199245],
       [ 0.16912584, -0.98370154,  0.06105511],
       [ 0.93687134,  0.17969264,  0.29997109]])

In [20]:
S


Out[20]:
array([11.5434562 ,  3.42163627,  0.20254442])

In [21]:
V


Out[21]:
array([[ 0.09420232, -0.04237883,  0.99465067],
       [-0.3064163 ,  0.94935928,  0.06946947],
       [-0.94722488, -0.31132137,  0.07644628]])

In [32]:
for i in range(3):
    print(U[i] @ U[(i+1) % len(U)])
    # wraps around
    # U[0] @ U[1]
    # U[1] @ U[2]
    # U[2] @ U[0]


2.7755575615628914e-17
1.5959455978986625e-16
2.7755575615628914e-16

In [34]:
for i in range(len(U)):
    print(U[:,i] @ U[:, (i+1)%len(U[0])])


-1.6653345369377348e-16
-2.7755575615628914e-17
1.8041124150158794e-16

Wait so.. the rows of a matrix $A$ are orthogonal iff $AA^T$ is diagonal? Hmm. Math.StackEx Link
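Quick sanity check of that (a toy example of my own, not from the lecture): a matrix with orthogonal but non-unit-length rows gives a diagonal $AA^T$ that isn't the identity -- identity would additionally require the rows to be orthonormal.


In [ ]:
# rows of M are mutually orthogonal but NOT unit-length
M = np.array([[1.,  1., 0.],
              [1., -1., 0.],
              [0.,  0., 3.]])
print(M @ M.T)                                           # diag(2, 2, 9) -- diagonal, not identity
print(np.allclose(M @ M.T, np.diag(np.diag(M @ M.T))))   # True: M M^T is diagonal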


In [39]:
np.isclose(np.eye(len(U)), U @ U.T)


Out[39]:
array([[ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]])

In [41]:
np.isclose(np.eye(len(V)), V.T @ V)


Out[41]:
array([[ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]])

Wait but that also gives True for $VV^T$. Hmmm.
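I think that's expected though: $V$ is square with orthonormal rows, so its transpose is its inverse, and both $V^TV$ and $VV^T$ come out as the identity. (Same goes for $U$; and since np.linalg.svd's third return is really $V^T$, it makes no difference to these checks which way round it is.) A quick verification reusing the matrices from above:


In [ ]:
# square orthogonal matrix: transpose == inverse, so both products give I
print(np.allclose(Q.T @ Q, np.eye(3)), np.allclose(Q @ Q.T, np.eye(3)))
# and the exact SVD reconstructs A (remember the V variable holds V^T)
print(np.allclose(A, U @ np.diag(S) @ V))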

2. Truncated SVD

Okay, so SVD is an exact decomposition of a matrix and allows us to pull out distinct topics from data (due to their orthonormality (orthogonality?)).

But doing so for a large data corpus is ... bad. Especially if most of the data's meaning / information relevant to us is captured by a small prominent subset. IE: the prevalence of articles like a and the is likely a poor indicator of any particular meaning in a piece of text, since they're everywhere in English. Likewise for other types of data.

Hmm, so, if I understood correctly, the Σ/S/s/σ matrix is ordered by value max$\rightarrow$min.. but computing the full SVD of a large dataset $A$ is exactly what we want to avoid -- that's the point of T-SVD. Okay so how?

$\rightarrow$With full SVD we're calculating the full dimension of topics -- but it's handy to limit to the most important ones -- this is how SVD is used in compression.
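A tiny sketch of that compression idea (my own toy example, not from the lesson): build a matrix that's secretly low-rank, then keep only the top-$k$ singular values/vectors.


In [ ]:
# a matrix that is (secretly) rank 10, so k=10 components capture everything
A_big = np.random.rand(100, 10) @ np.random.rand(10, 80)
U_f, S_f, Vt_f = np.linalg.svd(A_big, full_matrices=False)

k = 10
A_k = U_f[:, :k] @ np.diag(S_f[:k]) @ Vt_f[:k, :]   # rank-k (truncated) reconstruction
print(np.allclose(A_big, A_k))                      # True: the top-k components suffice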

Aha. This is where I was confused. Truncation is used with Randomization in R-SVD. The Truncated section was just introducing the concept. Got it.

So that's where, in R-SVD, we use a buffer in addition to the portion of the dataset we take for SVD.

And yay scikit-learn has R-SVD built in.


In [43]:
from sklearn import decomposition

In [47]:
# ofc this is just dummy data to test it works
datavectors = np.random.randint(-1000,1000,size=(10,50))
U,S,V = decomposition.randomized_svd(datavectors, n_components=5)

In [49]:
U.shape, S.shape, V.shape


Out[49]:
((10, 5), (5,), (5, 50))

The idea of T-SVD is that we want to compute an approximation to the range of $A$. The range of $A$ is the space covered by the column basis.

ie: $\text{Range}(A) = \{y : Ax = y\}$

that is: all $y$ you can achieve by multiplying $x$ with $A$.

Depending on your space, the bases are vectors that you can take linear combinations of to get any value in your space.
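A small sketch of what that means in code (my own example, assuming $A$ has full column rank): grab an orthonormal basis $Q$ for the column space via a QR decomposition, and any $y = Ax$ can be rebuilt from that basis.


In [ ]:
# columns of Q_basis form an orthonormal basis for Range(A_tall)
A_tall = np.random.rand(8, 3)
Q_basis, _ = np.linalg.qr(A_tall)

x = np.random.rand(3)
y = A_tall @ x                                      # a point in Range(A_tall)
print(np.allclose(Q_basis @ (Q_basis.T @ y), y))    # True: y recovered from the basis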

3. Details of Randomized SVD (Truncated)

Our goal is to have an algorithm to perform Truncated SVD using Randomized values from the dataset matrix. We want to use randomization to calculate the topics we're interested in, instead of calculating all of them.

Aha. So.. the way to do that, using randomization, is to have a special kind of randomization. Find a matrix $Q$ with some special properties that will allow us to pull a matrix that is a near match to our dataset matrix $A$ in the ways we want it to be. Ie: It'll have the same singular values, meaning the same importance-ordered topics.

Wow mathematics is really.. somethin.

That process:

  1. Compute an approximation to the range of $A$. ie: we want $Q$ with $r$ orthonormal columns st:
$$A \approx QQ^TA$$
  2. Construct $B = Q^TA$, which is small $(r \times n)$

  3. Compute the SVD of $B$ by standard methods (fast since $B$ is smaller than $A$): $B = SΣV^T$

  4. Since: $$A \approx QQ^TA = Q(SΣV^T)$$ if we set $U = QS$, then we have a low-rank approximation of $A \approx UΣV^T$.

-- okay so.. confusion here. What is $S$ and $Σ$? Because I see them elsewhere taken to mean the same thing on this subject, but all of a sudden they seem to be totally different things.

-- oh, so apparently $S$ here is actually something different: it's the small matrix of left singular vectors of $B$. $Σ$ is what's been interchangeably referred to in Hellenic/Latin letters throughout the notebook.

NOTE that $A: m \times n$ while $Q: m \times r$, so $Q$ is generally a tall, skinny matrix and therefore much smaller & easier to compute with than $A$.

Also, because $S$ & $Q$ are both orthonormal, setting $U = QS$ makes $U$ orthonormal as well (consistent with step 4).

How do we find Q (in step 1)?

General Idea: we find this special $Q$, then we do SVD on this smaller matrix $Q^TA$, and we plug that back in to have our Truncated-SVD for $A$.

And HERE is where the Random part of Randomized SVD comes in! How do we find $Q$?:

We just take a bunch of random vectors $w_i$ and look at / evaluate the subspace formed by $Aw_i$. We form a matrix $W$ with the $w_i$'s as its columns. Then we take the QR Decomposition of $AW = QR$. Then the columns of $Q$ form an orthonormal basis for the range of $AW$, which approximates the range of $A$.

Basically a QR Decomposition exists for any matrix, and is an orthonormal matrix $\times$ an upper triangular matrix.

So basically: we take $AW$, $W$ is random, get the $QR$ -- and a property of the QR-Decomposition is that $Q$ forms an orthonormal basis for $AW$ -- and $AW$ gives the range of $A$.

Since $AW$ has far more rows than columns, it turns out in practice that these columns are approximately orthonormal. It's very unlikely you'll get linearly-dependent columns when you choose random values.
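Putting the pieces together -- a minimal NumPy sketch of the whole process (the function name, the Gaussian $W$, and skipping power iterations are my own choices; the lesson notebook has the real implementation):


In [ ]:
def randomized_svd_sketch(A, r):
    """Rough Truncated SVD of A via randomization -- steps 1-4 above."""
    m, n = A.shape
    W = np.random.normal(size=(n, r))    # random vectors w_i as columns
    Q, _ = np.linalg.qr(A @ W)           # step 1: orthonormal basis for range(AW) ≈ range(A)
    B = Q.T @ A                          # step 2: small (r × n) matrix
    S_small, Sigma, Vt = np.linalg.svd(B, full_matrices=False)   # step 3
    U = Q @ S_small                      # step 4: lift back up to m rows
    return U, Sigma, Vt

U_r, Sigma_r, Vt_r = randomized_svd_sketch(datavectors.astype(float), 5)
U_r.shape, Sigma_r.shape, Vt_r.shape    # ((10, 5), (5,), (5, 50))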

Aand apparently the QR-Decomp is v.foundational to Numerical Linear Algebra.

How do we choose r?

We chose $Q$ to have $r$ orthonormal columns, and $r$ gives us the dimension of $B$.

We choose $r$ to be the number of topics we want to retrieve $+$ some buffer.
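If I'm reading the scikit-learn docs right, that buffer corresponds to the n_oversamples parameter of sklearn.utils.extmath.randomized_svd -- a rough usage sketch (the numbers are just illustrative):


In [ ]:
from sklearn.utils.extmath import randomized_svd

n_topics = 5
U_b, S_b, V_b = randomized_svd(datavectors.astype(float),
                               n_components=n_topics,
                               n_oversamples=10)   # the extra "buffer" of random directions
U_b.shape, S_b.shape, V_b.shape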

See the lesson notebook and accompanying lecture time for an implementation of Randomized SVD. NOTE that Scikit-Learn's implementation is more powerful; the one in the lesson is just for illustration.


4. Non-negative Matrix Factorization

Wiki

NMF is a group of algorithms in multivariate analysis and linear algebra where a matrix $V$ is factorized into (usually) two matrices $W$ & $H$, with the property that all three matrices have no negative elements.

Lecture 2 40:32

The key thing in SVD is orthogonality -- basically everything is orthogonal to each other -- the key idea in NMF is that nothing is negative. The lower bound is zero-clamped.

NOTE your original dataset should be nonnegative if you use NMF, or else you won't be able to reconstruct it.

Idea

Rather than constraining our factors to be orthogonal, another idea would be to constrain them to be non-negative. NMF is a factorization of a non-negative dataset $V$: $$V=WH$$ into non-negative matrices $W$, $H$. Often positive factors will be more easily interpretable (and this is the reason behind NMF's popularity).

huh.. really now.?..

For example if your dataset is a matrix of faces $V$, where each column holds a vectorized face, then $W$ would be a matrix whose columns are facial features, and $H$ a matrix whose columns give the relative importance of those features in each image.

Applications of NMF / Sklearn

NMF is a 'difficult' problem because it is non-convex and NP-hard

NMF looks smth like this in schematic form:

  Documents  Topics  Topic Importance Indicators
W ---------   ---   -----------------
o | | | | |   |||   | | | | | | | | |
r | | | | | ≈ |||   -----------------
d | | | | |   |||
s ---------   ---
      V        W            H

In [55]:
# workflow w NMF is something like this
V = np.random.randint(0, 20, size=(10,10))

m,n = V.shape
d = 5 # num_topics

clsf = decomposition.NMF(n_components=d, random_state=1)

W1 = clsf.fit_transform(V)   # (m × d): each row of V expressed in the d topics
H1 = clsf.components_        # (d × n): the d topics over the original columns

H1


Out[55]:
array([[2.61262961, 0.        , 0.17223232, 0.        , 0.49167006,
        0.7636565 , 0.        , 3.6364682 , 3.50209071, 2.92555005],
       [0.20734896, 1.95822219, 1.26795513, 0.13318651, 0.        ,
        0.        , 3.9102733 , 3.00383236, 0.        , 1.14597237],
       [2.53447222, 2.92131282, 0.33628198, 0.        , 2.64144824,
        0.        , 0.25258609, 0.47547565, 1.41874121, 0.        ],
       [0.02012285, 0.        , 1.16795696, 2.04410331, 2.58123674,
        0.        , 0.03433443, 0.        , 2.34728833, 3.01734911],
       [0.        , 0.44988815, 3.90161126, 3.46595851, 0.7795191 ,
        4.6329324 , 0.        , 0.5792899 , 0.38211515, 0.        ]])

NOTE: NMF is non-exact. You'll get something close to the original matrix back.
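Just to quantify how close, using the factorization above (a quick check of my own):


In [ ]:
# relative reconstruction error of the NMF above -- nonzero, since the factorization is approximate
np.linalg.norm(V - W1 @ H1) / np.linalg.norm(V)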

NMF Summary:

Benefits: fast and easy to use.

Downsides: took years of research and expertise to create

NOTES:

  • For NMF, matrix needs to be at least as tall as it is wide, or we get an error with fit_transform
  • Can use min_df in CountVectorizer to only look at words that were in at least k of the split texts (see the sketch below).
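A toy sketch of that min_df usage (the corpus below is made up, not from the lecture):


In [ ]:
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog sat", "a cat and a dog", "the bird flew"]
vectorizer = CountVectorizer(min_df=2)       # keep only words appearing in >= 2 docs
doc_term = vectorizer.fit_transform(docs)
print(sorted(vectorizer.vocabulary_))        # ['cat', 'dog', 'sat', 'the']
print(doc_term.toarray())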

WNx: Okay, I'm not going to go through and implement NMF in NumPy & PyTorch using SGD today. Maybe later. -- 19:44

Lecture 2 @ 51:09

