2018/2/12 - WNixalo
Fastai Computational Linear Algebra (2017) §2: Topic Modeling w NMF & SVD
facebook research: Fast Randomized SVD
SVD is a factorization of a real or complex matrix. It factorizes a matrix $A$ into a matrix with orthonormal columns $U$, a matrix with orthonormal rows $V^T$, and a diagonal matrix of singular values $Σ$ (aka $S$ or $s$ or $σ$) which contains the relative importance of each factor.
In [4]:
from scipy.stats import ortho_group
import numpy as np
In [9]:
Q = ortho_group.rvs(dim=3)
B = np.random.randint(0,10,size=(3,3))
A = Q@B@Q.T
In [18]:
U,S,V = np.linalg.svd(A, full_matrices=False)
In [19]:
U
Out[19]:
In [20]:
S
Out[20]:
In [21]:
V
Out[21]:
In [32]:
for i in range(3):
    print(U[i] @ U[(i+1) % len(U)])
# wraps around
# U[0] @ U[1]
# U[1] @ U[2]
# U[2] @ U[0]
In [34]:
for i in range(len(U)):
    print(U[:,i] @ U[:, (i+1)%len(U[0])])
Wait so.. the rows of a matrix $A$ are orthogonal iff $AA^T$ is diagonal? Hmm. Math.StackEx Link
In [39]:
np.isclose(np.eye(len(U)), U @ U.T)
Out[39]:
In [41]:
np.isclose(np.eye(len(V)), V.T @ V)
Out[41]:
Wait, but that also gives True for $VV^T$. Hmmm.
Okay, so SVD is an exact decomposition of a matrix and allows us to pull out distinct topics from data (due to their orthonormality (orthogonality?)).
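A quick sanity check of that exactness, using the $U$, $S$, $V$ computed above (note that np.linalg.svd returns the third factor already transposed, so V here is really $V^T$):
In [ ]:
# reconstructing A from its factors -- exact up to floating-point error
np.allclose(A, U @ np.diag(S) @ V)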
But doing so for a large data corpus is ... bad, especially if most of the data's meaning / information relevant to us is captured by a small prominent subset. I.e.: the prevalence of articles like 'a' and 'the' is likely a poor indicator of any particular meaning in a piece of text, since they're everywhere in English. Likewise for other types of data.
Hmm, so, if I understood correctly, the $Σ$/$S$/$s$/$σ$ matrix is ordered by value max$\rightarrow$min.. but computing the full SVD of a large dataset $A$ is exactly what we want to avoid, and that's where T-SVD comes in. Okay, so how?
$\rightarrow$With full SVD we're calculating the full dimension of topics -- but it's handy to limit ourselves to the most important ones -- this is how SVD is used in compression.
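Here's a minimal truncation sketch of that compression idea (toy random matrix; the sizes and the choice of $k$ are just mine for illustration): keep only the top-$k$ singular values/vectors and see how much of the matrix survives.
In [ ]:
M = np.random.rand(100, 50)
U_f, S_f, Vt_f = np.linalg.svd(M, full_matrices=False)   # S_f comes back sorted max->min

k = 10
M_k = U_f[:, :k] @ np.diag(S_f[:k]) @ Vt_f[:k]    # rank-k approximation of M
np.linalg.norm(M - M_k) / np.linalg.norm(M)       # relative error left after truncating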
Aha. This is where I was confused. Truncation is used with Randomization in R-SVD. The Truncated section was just introducing the concept. Got it.
So that's where, in R-SVD, we use a buffer in addition to the portion of the dataset we take for SVD.
And yay, scikit-learn has R-SVD built in.
In [43]:
from sklearn import decomposition
In [47]:
# ofc this is just dummy data to test it works
datavectors = np.random.randint(-1000,1000,size=(10,50))
U,S,V = decomposition.randomized_svd(datavectors, n_components=5)
In [49]:
U.shape, S.shape, V.shape
Out[49]:
The idea of T-SVD is that we want to compute an approximation to the range of $A$. The range of $A$ is the space spanned by its columns; i.e.: $\text{Range}(A) = \{y : Ax = y\}$, that is: all $y$ you can reach by multiplying some $x$ by $A$.
Depending on your space, the bases are vectors that you can take linear combinations of to get any value in your space.
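A tiny sketch of what that means in code (toy matrix and names are mine): grab an orthonormal basis for $\text{Range}(C)$ via QR, and check that anything of the form $Cx$ already lies in that span.
In [ ]:
C = np.random.rand(8, 3)            # tall toy matrix
Qc, _ = np.linalg.qr(C)             # columns of Qc: orthonormal basis for Range(C)
x = np.random.rand(3)
y = C @ x                           # a point in Range(C)
np.allclose(Qc @ (Qc.T @ y), y)     # projecting y onto that basis leaves it unchanged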
Our goal is to have an algorithm to perform Truncated SVD using Randomized values from the dataset matrix. We want to use randomization to calculate the topics we're interested in, instead of calculating all of them.
Aha. So.. the way to do that, using randomization, is to have a special kind of randomization. Find a matrix $Q$ with some special properties that will allow us to pull out a smaller matrix that is a near match to our dataset matrix $A$ in the ways we want it to be. I.e.: it'll have (approximately) the same top singular values, meaning the same importance-ordered topics.
Wow mathematics is really.. somethin.
That process:
Construct $B = Q^TA$, which is small $(r \times n)$
Compute the SVD of $B$ by standard methods (fast since $B$ is smaller than $A$), $B = SΣV^T$
Since: $$A \approx QQ^TA = Q(SΣV^T)$$ if we set $U = QS$, then we have a low-rank approximation of $A \approx UΣV^T$.
-- okay so.. confusion here. What are $S$ and $Σ$? Because I see them elsewhere taken to mean the same thing on this subject, but all of a sudden they seem to be totally different things.
-- oh, so apparently $S$ here is actually something different. $Σ$ is what's been interchangeably referred to in Hellenic/Latin letters throughout the notebook.
NOTE that $A: m \times n$ while $Q: m \times r$, so $Q$ is generally a tall, skinny matrix and therefore much smaller & easier to compute with than $A$.
Also, because $S$ & $Q$ are both orthonormal, setting $U = QS$ makes $U$ orthonormal as well.
General Idea: we find this special $Q$, then we do SVD on this smaller matrix $Q^TA$, and we plug that back in to have our Truncated-SVD for $A$.
And HERE is where the Random part of Randomized SVD comes in! How do we find $Q$?:
We just take a bunch of random vectors $w_i$ and look at / evaluate the subspace formed by $Aw_i$. We form a matrix $W$ with the $w_i$'s as its columns. Then we take the QR Decomposition of $AW = QR$. Then the columns of $Q$ form an orthonormal basis for $AW$, which is the range of $A$.
Basically a QR Decomposition exists for any matrix, and is an orthonormal matrix $\times$ an upper triangular matrix.
So basically: we take $AW$, $W$ is random, get the $QR$ -- and a property of the QR-Decomposition is that $Q$ forms an orthonormal basis for $AW$ -- and $AW$ gives the range of $A$.
Since $AW$ has far more rows than columns, it turns out in practice that these columns are approximately orthonormal. It's very unlikely you'll get linearly-dependent columns when you choose random values.
Aand apparently the QR-Decomp is v.foundational to Numerical Linear Algebra.
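Quick check of those QR properties on a random tall matrix (toy sizes and names are mine): $Q$'s columns come out orthonormal, $R$ is upper triangular, and $QR$ reproduces the original.
In [ ]:
X = np.random.rand(10, 4)
Qx, Rx = np.linalg.qr(X)
np.allclose(Qx.T @ Qx, np.eye(4)), np.allclose(np.triu(Rx), Rx), np.allclose(Qx @ Rx, X)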
We chose $Q$ to have $r$ orthonormal columns, and $r$ gives us the dimension of $B$.
We choose $r$ to be the number of topics we want to retrieve $+$ some buffer.
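Putting the pieces above together, here's a minimal from-scratch sketch (the function name, toy sizes, and buffer value are my own choices, not the lesson's implementation):
In [ ]:
def simple_randomized_svd(A, n_topics, buffer=5):
    r = n_topics + buffer                        # rank to sample: topics we want + a buffer
    W = np.random.normal(size=(A.shape[1], r))   # random vectors w_i as columns
    Q, _ = np.linalg.qr(A @ W)                   # orthonormal basis for (approximately) Range(A)
    B = Q.T @ A                                  # small (r x n) matrix
    S_small, Sigma, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ S_small                              # orthonormal, since Q and S_small both are
    return U[:, :n_topics], Sigma[:n_topics], Vt[:n_topics]

A_big = np.random.rand(1000, 300)                # dummy data
U_r, Sig_r, Vt_r = simple_randomized_svd(A_big, n_topics=5)
U_r.shape, Sig_r.shape, Vt_r.shape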
See the lesson notebook and accompanying lecture time for an implementation of Randomized SVD. NOTE that Scikit-Learn's implementation is more powerful; the version there is for illustration purposes.
NMF is a group of algorithms in multivariate analysis and linear algebra where a matrix $V$ is factorized into (usually) two matrices $W$ & $H$, with the property that all three matrices have no negative elements.
The key thing in SVD is orthogonality -- basically everything is orthogonal to each other -- the key idea in NMF is that nothing is negative. The lower bound is zero-clamped.
NOTE: your original dataset should be nonnegative if you use NMF, or else you won't be able to reconstruct it.
Rather than constraining our factors to be orthogonal, another idea would be to constrain them to be non-negative. NMF is a factorization of a non-negative dataset $V$: $$V=WH$$ into non-negative matrices $W$, $H$. Often positive factors will be more easily interpretable (and this is the reason behind NMF's popularity).
huh.. really now.?..
For example, if your dataset is a matrix of faces $V$, where each column holds a vectorized face, then $W$ would be a matrix whose columns are facial features, and $H$ a matrix whose columns give the relative importance of those features in each image.
NMF is a 'difficult' problem because it is non-convex and NP-Hard
NMF looks something like this in schematic form:

        Documents          Topics      Topic Importance Indicators
      -------------        -----       ---------------------------
  W   | | | | | | |        | | |       |  |  |  |  |  |  |  |  | |
  o   | | | | | | |   ≈    | | |       ---------------------------
  r   | | | | | | |        | | |
  d   | | | | | | |        | | |
  s   -------------        -----
            V                W                      H
In [55]:
# workflow w NMF is something like this
V = np.random.randint(0, 20, size=(10,10))
m,n = V.shape
d = 5 # num_topics
clsf = decomposition.NMF(n_components=d, random_state=1)
W1 = clsf.fit_transform(V)
H1 = clsf.components_
NOTE: NMF is non-exact. You'll get something close to the original matrix back.
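A quick way to see that, using the V, W1, H1 from the cell above (the relative reconstruction error is small but not zero):
In [ ]:
np.linalg.norm(V - W1 @ H1) / np.linalg.norm(V)   # nonzero: NMF only approximates V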
Benefits: fast and easy to use.
Downsides: took years of research and expertise to create
NOTES:
fit_transform
min_df in CountVectorizer to only look at words that were in at least k of the split texts.

WNx: Okay, I'm not going to go through and implement NMF in NumPy & PyTorch using SGD today. Maybe later. -- 19:44
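For reference, a tiny illustrative sketch of min_df (the toy texts are made up): with min_df=2, words appearing in fewer than 2 of the documents get dropped from the vocabulary.
In [ ]:
from sklearn.feature_extraction.text import CountVectorizer

texts = ["the cat sat", "the dog sat", "a lone word"]
vec = CountVectorizer(min_df=2)      # keep only terms appearing in >= 2 documents
vec.fit(texts)
sorted(vec.vocabulary_)              # just 'sat' and 'the' survive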
In [ ]: