In [5]:
%load_ext autoreload
%autoreload 2

Problem set 2

155 pts

For problems 1, 2 and 4: all functions that you are asked to implement have to be completed in the separate file pset2.py, where we provide the signatures of the required functions. Only this py-file has to be submitted to the bot to check the correctness of your implementations.

For problem 3: see the instructions in the text.

Problem 1 (LU decomposition) 35 pts

1. LU for band matrices

The complexity of computing an LU decomposition of a dense $n\times n$ matrix is $\mathcal{O}(n^3)$. A significant reduction in complexity can be achieved if the matrix has a certain structure, e.g. it is sparse. In the following task we consider an important example of $LU$ for a special type of sparse matrices –– band matrices with bandwidth $m$ equal to 3 or 5, which are called tridiagonal and pentadiagonal, respectively.

  • (5 pts) Write a function band_lu(diag_broadcast, n) which computes the LU decomposition of a tridiagonal or pentadiagonal matrix with the given diagonal values. For example, the input parameters (diag_broadcast = [1,-2,1], n = 4) mean that we need to find the LU decomposition of the tridiagonal matrix of the form:
$$A = \begin{pmatrix} -2 & 1 & 0 & 0\\ 1 & -2 & 1 & 0 \\ 0 & 1 & -2 & 1 \\ 0 & 0 & 1 & -2 \\ \end{pmatrix}.$$

As an output, return L and U as 2D arrays representing the diagonals of the factors $L$ (L[0] keeps the first lower diagonal, L[1] the second lower, ...) and $U$ (U[:,0] keeps the main diagonal, U[:,1] the first upper, ...). More details can be found in the comments to the corresponding function in pset2.py.

  • (2 pts) Compare the execution time of a standard LU decomposition routine from scipy, i.e. one which takes the whole matrix and does not know about its special structure, with the band decomposition of your implementation. Comment on the results. (An illustrative sketch of the banded approach is given right after this list.)
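For reference, a minimal sketch of the idea behind band_lu for the tridiagonal case (the graded implementation with the exact output convention belongs in pset2.py; the helper name tridiag_lu_sketch and the simplified output below are only illustrative):

import numpy as np

def tridiag_lu_sketch(diag_broadcast, n):
    # diag_broadcast = [a, b, c]: subdiagonal a, main diagonal b, superdiagonal c.
    # Returns the subdiagonal l of the unit lower bidiagonal factor L and the
    # main diagonal u of the upper bidiagonal factor U (its superdiagonal stays c).
    a, b, c = diag_broadcast
    u = np.empty(n)
    l = np.empty(n - 1)
    u[0] = b
    for i in range(1, n):
        l[i - 1] = a / u[i - 1]        # elimination multiplier
        u[i] = b - l[i - 1] * c        # update of the next pivot
    return l, u

# quick check on the 4 x 4 example above
l, u = tridiag_lu_sketch([1, -2, 1], 4)
L = np.eye(4) + np.diag(l, -1)
U = np.diag(u) + np.diag(np.ones(3), 1)
A = -2 * np.eye(4) + np.diag(np.ones(3), 1) + np.diag(np.ones(3), -1)
print(np.allclose(L @ U, A))  # True

The loop above costs $\mathcal{O}(n)$ operations, and $\mathcal{O}(nm^2)$ for a general band of width $m$, which explains the gap in the timings below.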

In [6]:
# Implement function in the ```pset2.py``` file
from pset2 import band_lu
import scipy.sparse
import scipy as sp # can be used with broadcasting of scalars if desired dimensions are large
import numpy as np
import scipy.linalg as lg
import time
import matplotlib.pyplot as plt
%matplotlib inline

In [5]:
def build_diag(diag_broadcast, n):
    length = len(diag_broadcast) // 2
    diag_map = np.arange(-length, length + 1, 1)
    A = sp.sparse.diags(diag_broadcast, diag_map, shape=(n, n)).toarray()
    return A

def do_test(matrix_size):
    my_impl = []
    numpy_impl = []

    for n in matrix_size:
        diag_elem = np.random.random((1, 3))[0]

        start = time.time()
        L, U = band_lu(diag_elem, n)
        end = time.time()
        my_impl.append(end - start)

        start = time.time()
        A = build_diag(diag_elem, n)
        _, _, _ = lg.lu(A,permute_l=False)
        end = time.time()
        numpy_impl.append(end - start)    
    return my_impl, numpy_impl
        
matrix_size = [x for x in range(100, 3000, 100)]
my_impl_ = []
numpy_impl_ = []
N = 20

for i in range(N):
    a, b = do_test(matrix_size)
    my_impl_.append(a)
    numpy_impl_.append(b)
    
my_impl = [k/N for k in [sum(i) for i in zip(*my_impl_)]]
numpy_impl = [k/N for k in [sum(i) for i in zip(*numpy_impl_)]]
plt.figure(figsize=(14,7))
plt.loglog(matrix_size, my_impl, label='Optimized version for band matrices')
plt.loglog(matrix_size, numpy_impl, label='Scipy algorithm')
plt.ylabel(r"Time", size=14)
plt.xlabel("Square matrix size, items", size=14)
plt.title("Different LU decomposition algorithm performance (band=3)", size=14)
plt.legend(loc='upper left')
plt.grid()
plt.show()



In [158]:
def build_diag(diag_broadcast, n):
    length = len(diag_broadcast) // 2
    diag_map = np.arange(-length, length + 1, 1)
    A = sp.sparse.diags(diag_broadcast, diag_map, shape=(n, n)).toarray()
    return A

def do_test(matrix_size):
    my_impl = []
    numpy_impl = []

    for n in matrix_size:
        diag_elem = np.random.random((1, 5))[0]

        start = time.time()
        L, U = band_lu(diag_elem, n)
        end = time.time()
        my_impl.append(end - start)

        start = time.time()
        A = build_diag(diag_elem, n)
        _, _, _ = lg.lu(A,permute_l=False)
        end = time.time()
        numpy_impl.append(end - start)    
    return my_impl, numpy_impl
        
matrix_size = [x for x in range(100, 3000, 100)]
my_impl_ = []
numpy_impl_ = []
N = 20

for i in range(N):
    a, b = do_test(matrix_size)
    my_impl_.append(a)
    numpy_impl_.append(b)
    
my_impl = [k/N for k in [sum(i) for i in zip(*my_impl_)]]
numpy_impl = [k/N for k in [sum(i) for i in zip(*numpy_impl_)]]
plt.figure(figsize=(14,7))
plt.loglog(matrix_size, my_impl, label='Optimized version for band matrices')
plt.loglog(matrix_size, numpy_impl, label='Scipy algorithm')
plt.ylabel(r"Time", size=14)
plt.xlabel("Square matrix size, items", size=14)
plt.title("Different LU decomposition algorithm performance (band=5)", size=14)
plt.legend(loc='upper left')
plt.grid()
plt.show()


In our algorithm we exploit the fact that the matrix is banded and apply a much faster algorithm, which is linear in the matrix size and quadratic in the band size.

The SciPy routine is not aware of this structure and therefore performs much worse.

2. Completing the proof of existence of LU

Some details in lecture proofs about $LU$ were omitted. Let us complete them here.

  • (5 pts) Prove that if $LU$ decomposition exists, then matrix is strictly regular.
    A matrix is strictly regular if all leading principal submatrices of $A$ are non-singular.
    Let us represent the matrix $A = LU$ in block form: $$ \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{bmatrix} \begin{bmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{bmatrix} = \begin{bmatrix} L_{11}U_{11} & L_{11}U_{12} \\ L_{21}U_{11} & L_{21}U_{12} + L_{22}U_{22} \end{bmatrix}, $$ where $A_{11}$ is the leading principal $k\times k$ submatrix. Since $L$ is unit lower triangular, $\det(L_{11}) = 1$, and since the LU decomposition of the (non-singular) matrix $A$ has all diagonal entries of $U$ non-zero, $\det(U_{11}) \neq 0$. Hence $$ \det(A_{11}) = \det(L_{11}U_{11}) = \det(L_{11})\det(U_{11}) = \det(U_{11}) \neq 0, $$ so every leading principal submatrix $A_{11}$ is non-singular.
  • (5 pts) Prove that if $A$ is a strictly regular matrix, then $A_1 = D - \frac 1a b c^T$ (see lectures for notations) is also strictly regular. $$ A = \begin{bmatrix} a & c^T \\ b & D \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ z & I \end{bmatrix} \begin{bmatrix} a & c^T \\ 0 & A_{1} \end{bmatrix} = KM, \qquad z = \frac{1}{a}b, \quad A_1 = D - \frac{1}{a}bc^T. $$ The matrix $K$ is unit lower triangular, so $\det(K_k) = 1$ for every leading principal submatrix $K_k$, and since the upper-right block of $K$ is zero, the leading submatrices factor as $A_k = K_k M_k$. Therefore $\det(M_k) = \det(A_k) \neq 0$ for all $k$, i.e. all leading submatrices of $M$ are non-singular.
    For $k \geq 2$ the submatrix $M_k$ is block upper triangular with diagonal blocks $a$ and $(A_1)_{k-1}$, so $\det(M_k) = a \det\big((A_1)_{k-1}\big)$. Since $a \neq 0$ (it is the first leading minor of $A$), $\det\big((A_1)_{k-1}\big) \neq 0$ for every $k$, hence all leading submatrices of $A_1$ are non-singular, q.e.d.

3. Stability of LU

Let $A = \begin{pmatrix} \varepsilon & 1 & 0\\ 1 & 1 & 1 \\ 0 & 1 & 1 \end{pmatrix}.$

  • (5 pts) Find analytically LU decomposition with and without pivoting for the matrix $A$.
    w/o pivoting $$ A = \begin{pmatrix} 1 & \frac{1}{\varepsilon} & 0\\ 0 & 1 - \frac{1}{\varepsilon} & 1 \\ 0 & 1 & 1 \end{pmatrix} \sim \begin{pmatrix} 1 & \frac{1}{\varepsilon} & 0 \\ 0 & \frac{\varepsilon - 1}{\varepsilon} & 1 \\ 0 & 1 & 1 \end{pmatrix} \sim \begin{pmatrix} 1 & \frac{1}{\varepsilon} & 0 \\ 0 & 1 & \frac{\varepsilon}{\varepsilon - 1} \\ 0 & 0 & \frac{-1}{\varepsilon - 1} \end{pmatrix} $$ Hence $$ A = LU = \begin{pmatrix} \varepsilon & 0 & 0 \\ 1 & \frac{\varepsilon - 1}{\varepsilon} & 0 \\ 0 & 1 & 1 \end{pmatrix} \begin{pmatrix} 1 & \frac{1}{\varepsilon} & 0 \\ 0 & 1 & \frac{\varepsilon}{\varepsilon - 1} \\ 0 & 0 & \frac{-1}{\varepsilon - 1} \end{pmatrix} $$ w/ pivoting
    Skipping all intermediate steps, we get $$ A = PLU = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ \varepsilon & 1 - \varepsilon & 1 \end{pmatrix} \begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & -1 \end{pmatrix} $$
  • (3 pts) Explain, why can the LU decomposition fail to approximate factors $L$ and $U$ for $|\varepsilon|\ll 1$ in computer arithmetic?
    Because of roundoff errors in floating-point arithmetic. In the factorization without pivoting we divide by the small pivot $\varepsilon$, which produces huge entries of order $1/\varepsilon$; quantities like $1 - \frac{1}{\varepsilon}$ are then rounded to $-\frac{1}{\varepsilon}$, so the information about the original entries of $A$ is lost and the computed product $LU$ can differ from $A$ by $\mathcal{O}(1)$. Pivoting keeps the multipliers bounded by one and avoids this blow-up.
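To make this concrete, here is a small experiment (only a sketch, not part of the submission) that runs plain Gaussian elimination without pivoting in float64 and measures how badly the product $LU$ reproduces $A$ for a tiny $\varepsilon$:

import numpy as np

def lu_no_pivoting(A):
    # plain Doolittle elimination without any row exchanges
    n = A.shape[0]
    U = A.astype(float).copy()
    L = np.eye(n)
    for k in range(n - 1):
        for i in range(k + 1, n):
            L[i, k] = U[i, k] / U[k, k]
            U[i, k:] -= L[i, k] * U[k, k:]
    return L, U

eps = 1e-20
A = np.array([[eps, 1., 0.], [1., 1., 1.], [0., 1., 1.]])
L, U = lu_no_pivoting(A)
print(np.abs(A - L @ U).max())  # O(1) error: the entry A[1,1] = 1 is completely lost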

4. Block LU

Let $A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}$ be a block matrix. The goal is to solve the linear system

$$ \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} = \begin{bmatrix} f_1 \\ f_2 \end{bmatrix}. $$
  • (2 pts) Using block elimination find matrix $S$ and right-hand side $\hat{f_2}$ so that $u_2$ can be found from $S u_2 = \hat{f_2}$. Note that the matrix $S$ is called Schur complement of the block $A_{11}$. $$ Au=f, \qquad LAu = Lf = \hat{f}, $$ where $L$ is a block elementary matrix chosen such that $$ \begin{bmatrix} \dots \\ Su_2 \end{bmatrix} = \begin{bmatrix} \hat{f_1} \\ \hat{f_2} \end{bmatrix}. $$ Take $$ L = \begin{bmatrix} I_n & 0 \\ Y & I_m \end{bmatrix}: \qquad \begin{bmatrix} I_n & 0 \\ Y & I_m \end{bmatrix} \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} = \begin{bmatrix} I_n & 0 \\ Y & I_m \end{bmatrix} \begin{bmatrix} f_1 \\ f_2 \end{bmatrix}, $$ which gives $$ \begin{bmatrix} A_{11}u_1 + A_{12}u_2 \\ (Y A_{11} + A_{21})u_1 + (Y A_{12} + A_{22})u_2 \end{bmatrix} = \begin{bmatrix} f_1 \\ Y f_1 + f_2 \end{bmatrix} = \begin{bmatrix} \hat{f_1} \\ \hat{f_2} \end{bmatrix}. $$ Requiring the coefficient of $u_1$ in the second block row to vanish yields $$ Y A_{11} + A_{21} = 0 \;\Rightarrow\; Y = - A_{21} A_{11}^{-1}, \qquad S = A_{22} + Y A_{12}, $$ so that $$ \boxed{S = A_{22} - A_{21}A_{11}^{-1}A_{12}}, \qquad \boxed{\hat{f_2} = f_2 - A_{21} A_{11}^{-1}f_{1}}. $$
  • (4 pts) Using Schur complement properties prove that $$\det(X+AB) = \det(X)\det(I+BX^{-1}A), $$ where $X$ is a nonsingular square matrix. Consider $$ M = \begin{bmatrix} I_m & -B \\ A & X \end{bmatrix}, \qquad ML = \begin{bmatrix} I_m & -B \\ A & X \end{bmatrix} \begin{bmatrix} I_m & 0 \\ -X^{-1}A & I_n \end{bmatrix} = \begin{bmatrix} I_m + BX^{-1}A & -B \\ 0 & X \end{bmatrix}. $$ Since $\det(L) = 1$, $$ \det(M) = \det(ML) = \det(X)\det(I + BX^{-1}A). $$ On the other hand, eliminating the block $A$ by row operations (which do not change the determinant) gives $$ M \sim \begin{bmatrix} I_m & -B \\ 0 & X + AB \end{bmatrix}, \qquad \det(M) = \det(X + AB). $$ Hence $$ \det(X + AB) = \det(X)\det(I + BX^{-1}A). $$
  • (4 pts) Let matrix $F \in \mathbb{R}^{m \times n}$ and $G \in \mathbb{R}^{n \times m}$. Prove that $$\det(I_m - FG) = \det(I_n - GF).$$ Consider $$ M = \begin{bmatrix} I_n & G \\ F & I_m \end{bmatrix}, \qquad ML = \begin{bmatrix} I_n & G \\ F & I_m \end{bmatrix} \begin{bmatrix} I_n & 0 \\ -F & I_m \end{bmatrix} = \begin{bmatrix} I_n - GF & G \\ 0 & I_m \end{bmatrix}, $$ so $\det(M) = \det(ML) = \det(I_n - GF)\det(I_m)$. Eliminating $F$ by row operations instead, $$ M \sim \begin{bmatrix} I_n & G \\ 0 & I_m - FG \end{bmatrix}, \qquad \det(M) = \det(I_n)\det(I_m - FG). $$ Therefore $\det(I_m - FG) = \det(I_n - GF)$.
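A quick numerical sanity check of the last identity for randomly chosen shapes (just a sketch):

import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 7
F = rng.standard_normal((m, n))
G = rng.standard_normal((n, m))
# det(I_m - F G) should coincide with det(I_n - G F)
print(np.isclose(np.linalg.det(np.eye(m) - F @ G),
                 np.linalg.det(np.eye(n) - G @ F)))  # True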

Problem 2 (QR decomposition) 30 pts

1. Standard Gram-Schmidt algorithm

Our goal is to orthogonalize a system of linearly independent vectors $v_1,\dots,v_n$. The standard algorithm for this task is the Gram-Schmidt process:

$$ \begin{split} u_1 &= v_1, \\ u_2 &= v_2 - \frac{(v_2, u_1)}{(u_1, u_1)} u_1, \\ \dots \\ u_n &= v_n - \frac{(v_n, u_1)}{(u_1, u_1)} u_1 - \frac{(v_n, u_2)}{(u_2, u_2)} u_2 - \dots - \frac{(v_n, u_{n-1})}{(u_{n-1}, u_{n-1})} u_{n-1}. \end{split} $$

The obtained $u_1, \dots, u_n$ are orthogonal vectors in exact arithmetic. Then, to make the system orthonormal, you should divide each of the vectors by its norm: $u_i := u_i/\|u_i\|$. The Gram-Schmidt process can be considered as a QR decomposition. Let us show that.

  • (2 pts) Write out what the matrices $Q$ and $R$ obtained in the process above are.
    Let $A = QR$ be the QR decomposition of the matrix $A = [v_1 \dots v_n]$. Then $Q = [q_1 \dots q_n]$, where $q_i = u_i/\|u_i\|$ are the orthonormalized vectors, and $R$ is upper triangular with entries $r_{ij} = (q_i, v_j)$ for $i < j$ and $r_{ii} = \|u_i\|$, so that $v_j = \sum_{i \leq j} r_{ij} q_i$.
  • (5 pts) Implement in the pset2.py the described Gram-Schmidt algorithm as a function gram_schmidt_qr(A) that takes a rectangular matrix A and outputs Q,R.
  • (3 pts) Create a square Vandermonde matrix $V\in\mathbb{R}^{n\times n},\ n = 20$ defined by the vector $x$: x = np.linspace(0,1,n) (components of $x$ are spaced uniformly between 0 and 1). The loss of orthogonality can be described by the following error: $\|Q^{\top}Q-I\|_2$, where $Q^{\top}Q$ is called a Gram matrix. Compute QR decomposition of the created matrix $V$ with function that you have implemented and calculate error $\|Q^{\top}Q-I\|_2$. Comment on the result.
  • (5 pts) The observed loss of orthogonality is a problem of this particular algorithm. Luckily, there is a simple improvement to the algorithm above that reduces the loss of orthogonality. Implement this modification in the pset2.py as a function modified_gram_schmidt_qr(A) such that input and output are similar to gram_schmidt_qr(A).
  • (3 pts) Compute QR decomposition of the matrix $V$ from the previous task with the function modified_gram_schmidt_qr(A). Compute error $\|Q^{\top}Q-I\|_2$. Compare this error to the error obtained with a "pure" Gram-Schmidt and comment on the result.
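The graded routines gram_schmidt_qr and modified_gram_schmidt_qr live in pset2.py; the snippet below is only a rough sketch of the two variants (gs_qr_sketch is a hypothetical name). The classical variant projects each original column onto all previously computed q_i at once, while the modified variant removes the q_k component from all remaining columns as soon as q_k is formed, which is what reduces the loss of orthogonality.

import numpy as np

def gs_qr_sketch(A, modified=False):
    # Gram-Schmidt QR of an m x n matrix A with full column rank
    A = np.array(A, dtype=float)
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    V = A.copy()
    for k in range(n):
        if not modified:
            # classical GS: orthogonalize the original column against all previous q_i at once
            R[:k, k] = Q[:, :k].T @ A[:, k]
            V[:, k] = A[:, k] - Q[:, :k] @ R[:k, k]
        R[k, k] = np.linalg.norm(V[:, k])
        Q[:, k] = V[:, k] / R[k, k]
        if modified:
            # modified GS: immediately remove the new direction from the remaining columns
            R[k, k + 1:] = Q[:, k] @ V[:, k + 1:]
            V[:, k + 1:] -= np.outer(Q[:, k], R[k, k + 1:])
    return Q, R

# usage: Q, R = gs_qr_sketch(V, modified=True)
#        np.linalg.norm(Q.T @ Q - np.eye(V.shape[1]), 2)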

In [328]:
# Implement the functions in the ```pset2.py``` file
from pset2 import gram_schmidt_qr
from pset2 import modified_gram_schmidt_qr

n = 20
x = np.linspace(0, 1, n)
V = np.vander(x)
Q, _ = gram_schmidt_qr(V)
print("Error (first version):", np.linalg.norm(Q.T @ Q - np.identity(n), 2))
Q, _ = modified_gram_schmidt_qr(V)
print("Error (second version):", np.linalg.norm(Q.T @ Q - np.identity(n), 2))


Error (first version): 13.001024206525116
Error (second version): 0.16891754819348156

Roundoff errors lead to a loss of orthogonality; the modified Gram-Schmidt algorithm accumulates these errors much more slowly, so its error $\|Q^{\top}Q-I\|_2$ is roughly two orders of magnitude smaller here.

2. Householder QR (10 pts)

  • (7 pts) Implement an algorithm for computing QR decomposition based on Householder reflections as a function householder_qr(A) that takes a rectangular matrix A and outputs Q,R (a rough illustrative sketch is given after this list).
  • (2 pts) Apply it to the Vandermonde matrix $V$ created above. Print out the error $\|Q^{\top}Q-I\|_2$, where $Q$ is given by householder_qr(A). Compare it to the corresponding results of Gram-Schmidt and modified Gram-Schmidt algorithms and comment on it.
  • (3 pts) For values of $n = \{2,25,100,250,500\}$, create a $B\in\mathbb{R}^{n\times n}$ and an upper triangular matrix $R\in\mathbb{R}^{n\times n}$ both filled with standard normal entries. Use numpy (or scipy) built-in QR decomposition function to obtain a random orthogonal matrix $Q$ from the decomposition of $B$. Then compute $A = QR$ and apply your Gram-Schmidt and Householder algorithms to find the $Q$ and $R$ factors of $A$ – denoted as $\hat{Q}$ and $\hat{R}$. Calculate relative errors $$\frac{\|R-\hat{R}\|_2}{\|R\|_2}, \frac{\|Q-\hat{Q}\|_2}{\|Q\|_2}, \frac{\|A-\hat{Q}\hat{R}\|_2}{\|A\|_2}$$ for each value of $n$ and for both algorithms. Note: scale (multiply corresponding rows/columns by -1) $Q, R,\hat{Q},\hat{R}$ such that diagonal elements of $R$ and $\hat{R}$ be positive.
    • Comment on the relative errors in $Q$ and $R$ (forward error) compared to ones in $QR$ (backward error).
    • Comment on the backward error obtained for Gram-Schmidt compared to Householder.
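A compact sketch of the Householder approach (householder_qr_sketch is only an illustration; the graded householder_qr goes to pset2.py and may differ, e.g. in how the sign of the reflection is chosen):

import numpy as np

def householder_qr_sketch(A):
    # QR via Householder reflections for a real m x n matrix, m >= n
    A = np.array(A, dtype=float)
    m, n = A.shape
    R = A.copy()
    Q = np.eye(m)
    for k in range(min(m - 1, n)):
        x = R[k:, k].copy()
        alpha = np.linalg.norm(x)
        if alpha == 0.0:
            continue
        v = x
        v[0] += np.copysign(alpha, x[0])   # sign chosen to avoid cancellation
        v /= np.linalg.norm(v)
        # apply H_k = I - 2 v v^T to the trailing block of R and accumulate Q := Q H_k
        R[k:, k:] -= 2.0 * np.outer(v, v @ R[k:, k:])
        Q[:, k:] -= 2.0 * np.outer(Q[:, k:] @ v, v)
    return Q, R

# Q built from reflections should stay orthogonal at the level of machine precision,
# unlike the Q produced by classical Gram-Schmidt.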

In [334]:
from pset2 import householder_qr
A = np.array([[4, -4, 9], [4, 2, 0], [2, 4, 0]])
Q, R = householder_qr(A)
np.allclose(np.dot(Q, R), A)


Out[334]:
True

Problem 3 (Word2Vec as Matrix Factorization) 45 pts

In this assignment you are supposed to apply SVD to training your own word embedding model which maps English words to vectors of real numbers.

Skip-Gram Negative Sampling (SGNS) word embedding model, commonly known as word2vec (Mikolov et al., 2013), is usually optimized by stochastic gradient descent. However, the optimization of SGNS objective can be viewed as implicit matrix factorization objective as was shown in (Levy and Goldberg, 2015).

1. Notation

Assume we have a text corpus given as a sequence of words $\{w_1,w_2,\dots,w_n\}$ where $n$ may be larger than $10^{12}$ and $w_i \in \mathcal{V}$ belongs to a vocabulary of words $\mathcal{V}$. A word $c \in \mathcal{V}$ is called a context of word $w_i$ if they are found together in the text. More formally, given some measure $L$ of closeness between two words (typical choice is $L=2$), a word $c \in \mathcal{V}$ is called a context if $c \in \{w_{i-L}, \dots, w_{i-1}, w_{i+1}, \dots, w_{i+L} \}$ Let $\mathbf{w},\mathbf{c}\in\mathbb{R}^d$ be the word embeddings of word $w$ and context $c$, respectively. Assume they are specified by the mapping $\Phi:\mathcal{V}\rightarrow\mathbb{R}^d$, so $\mathbf{w}=\Phi(w)$. The ultimate goal of SGNS word embedding model is to fit a good mapping $\Phi$.

Let $\mathcal{D}$ be a multiset of all word-contexts pairs observed in the corpus. In the SGNS model, the probability that word-context pair $(w,c)$ is observed in the corpus is modeled as the following distribution:

$$ P(\#(w,c)\neq 0|w,c) = \sigma(\mathbf{w}^\top \mathbf{c}) = \frac{1}{1 + \exp(-\mathbf{w}^\top \mathbf{c})}, $$

where $\#(w,c)$ is the number of times the pair $(w,c)$ appears in $\mathcal{D}$ and $\mathbf{w}^\top\mathbf{c}$ is the scalar product of vectors $\mathbf{w}$ and $\mathbf{c}$. Two important quantities which we will also use further are the number of times the word $w$ and the context $c$ appear in $\mathcal{D}$, which can be computed as

$$ \#(w) = \sum_{c\in\mathcal{V}} \#(w,c), \quad \#(c) = \sum_{w\in\mathcal{V}} \#(w,c). $$
2. Optimization objective

Vanilla word embedding models are trained by maximizing log-likelihood of observed word-context pairs, namely

$$ \mathcal{L} = \sum_{w \in \mathcal{V}} \sum_{c \in \mathcal{V}} \#(w,c) \log \sigma(\mathbf{w}^\top\mathbf{c}) \rightarrow \max_{\mathbf{w},\mathbf{c} \in \mathbb{R}^d}. $$

Skip-Gram Negative Sampling approach modifies the objective by additionally minimizing the log-likelihood of random word-context pairs, so called negative samples. This idea incorporates some useful linguistic information that some number ($k$, usually $k=5$) of word-context pairs are not found together in the corpus which usually results in word embeddings of higher quality. The resulting optimization problem is

$$ \mathcal{L} = \sum_{w \in \mathcal{V}} \sum_{c \in \mathcal{V}} \left( \#(w,c) \log \sigma(\mathbf{w}^\top\mathbf{c}) + k \cdot \mathbb{E}_{c'\sim P_\mathcal{D}} \log \sigma (-\mathbf{w}^\top\mathbf{c}') \right) \rightarrow \max_{\mathbf{w},\mathbf{c} \in \mathbb{R}^d}, $$

where $P_\mathcal{D}(c)=\frac{\#(c)}{|\mathcal{D}|}$ is a probability distribution over word contexts from which negative samples are drawn.

Levy and Goldberg, 2015 showed that this objective can be equivalently written as

$$ \mathcal{L} = \sum_{w \in \mathcal{V}} \sum_{c \in \mathcal{V}} f(w,c) = \sum_{w \in \mathcal{V}} \sum_{c \in \mathcal{V}} \left( \#(w,c) \log \sigma(\mathbf{w}^\top\mathbf{c}) + \frac{k\cdot\#(w)\cdot\#(c)}{|\mathcal{D}|} \log \sigma (-\mathbf{w}^\top\mathbf{c}) \right) \rightarrow \max_{\mathbf{w},\mathbf{c} \in \mathbb{R}^d}, $$

A crucial observation is that this loss function depends only on the scalar product $\mathbf{w}^\top\mathbf{c}$ but not on embedding $\mathbf{w}$ and $\mathbf{c}$ separately.

3. Matrix factorization problem statement

Let $|\mathcal{V}|=m$, $W \in \mathbb{R}^{m\times d}$ and $C \in \mathbb{R}^{m\times d}$ be matrices, where each row $\mathbf{w}\in\mathbb{R}^d$ of matrix $W$ is the word embedding of the corresponding word $w$ and each row $\mathbf{c}\in\mathbb{R}^d$ of matrix $C$ is the context embedding of the corresponding context $c$. SGNS embeds both words and their contexts into a low-dimensional space $\mathbb{R}^d$, resulting in the word and context matrices $W$ and $C$. The rows of matrix $W$ are typically used in NLP tasks (such as computing word similarities) while $C$ is ignored. It is nonetheless instructive to consider the product $W C^\top = M$. Viewed this way, SGNS can be described as factorizing an implicit matrix $M$ of dimensions $m \times m$ into two smaller matrices.

Which matrix is being factorized? A matrix entry $M_{wc}$ corresponds to the dot product $\mathbf{w}^\top\mathbf{c}$ . Thus, SGNS is factorizing a matrix in which each row corresponds to a word $w \in \mathcal{V}$ , each column corresponds to a context $c \in \mathcal{V}$, and each cell contains a quantity $f(w,c)$ reflecting the strength of association between that particular word-context pair. Such word-context association matrices are very common in the NLP and word-similarity literature. That said, the objective of SGNS does not explicitly state what this association metric is. What can we say about the association function $f(w,c)$? In other words, which matrix is SGNS factorizing? Below you will find the answers.

Task 1 (theoretical) 5 pts

Solve SGNS optimization problem with respect to the $\mathbf{w}^\top\mathbf{c}$ and show that the matrix being factorized is

$$ M_{wc} = \mathbf{w}^\top\mathbf{c} = \log \left( \frac{\#(w,c) \cdot |\mathcal{D}|}{k\cdot\#(w)\cdot\#(c)} \right) $$

Hint: Denote $x=\mathbf{w}^\top\mathbf{c}$, rewrite the SGNS optimization problem in terms of $x$ and solve it.
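For reference, a sketch of how the hint unfolds: fix a pair $(w,c)$, denote $x = \mathbf{w}^\top\mathbf{c}$ and differentiate the corresponding term $f(w,c)$ of the objective with respect to $x$, using $\frac{d}{dx}\log\sigma(x) = \sigma(-x)$:

$$ \frac{\partial f}{\partial x} = \#(w,c)\,\sigma(-x) - \frac{k\cdot\#(w)\cdot\#(c)}{|\mathcal{D}|}\,\sigma(x) = 0 \;\Longrightarrow\; \frac{\sigma(x)}{\sigma(-x)} = e^{x} = \frac{\#(w,c)\cdot|\mathcal{D}|}{k\cdot\#(w)\cdot\#(c)}, $$

which gives exactly the $M_{wc}$ stated above.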

Note: This matrix is called Shifted Pointwise Mutual Information (SPMI) matrix, as its elements can be written as

$$ \text{SPMI}(w,c) = M_{wc} = \mathbf{w}^\top\mathbf{c} = \text{PMI}(w,c) - \log k $$

and $\text{PMI}(w,c) = \log \left( \frac{\#(w,c) \cdot |\mathcal{D}|}{\#(w)\cdot\#(c)} \right)$ is the well-known pointwise mutual information of $(w,c)$.

This equality was shown in Levy and Goldberg, 2015.

Task 2 (practical) 40 pts


In [1]:
import os
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse.linalg import svds
  1. Download the dataset enwik8 of compressed Wikipedia articles and preprocess the raw data with the Perl script main_.pl. This script cleans all unnecessary symbols, converts all words to lowercase, and produces only sentences consisting of words.
    wget http://mattmahoney.net/dc/enwik8.zip
    unzip enwik8.zip
    mkdir data
    perl main_.pl enwik8 > data/enwik8.txt

In [2]:
# Load enwik 8

import re
file = open("data/enwik8.txt", "r")
doclist = [line for line in file]
docstr = ''.join(doclist)
sentences = re.split(r'[.!?]', docstr)
sentences = [sentence.split() for sentence in sentences if len(sentence) > 1]

In [3]:
print (sentences[1249])


['achilles', 'wrath', 'is', 'terrible', 'and', 'he', 'slays', 'many', 'trojan', 'warriors', 'and', 'allies', 'including', 'priam', 's', 'son', 'lycaon', 'whom', 'achilles', 'had', 'previously', 'captured', 'and', 'sold', 'into', 'slavery', 'but', 'who', 'had', 'been', 'returned', 'to', 'troy']
  1. Construct the word vocabulary from the obtained sentences: it should enumerate the words which occur more than $r=200$ times in the corpus.

In [4]:
def create_vocabulary(sentences, r=200):
    vocabulary = {}
    # Your code is here
    return vocabulary
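One possible implementation (a sketch; it assumes, as the later cells suggest, that the vocabulary maps each frequent word to a contiguous row/column index):

from collections import Counter

def create_vocabulary_sketch(sentences, r=200):
    # count all tokens and keep those occurring more than r times
    counts = Counter(word for sentence in sentences for word in sentence)
    frequent = [word for word, cnt in counts.items() if cnt > r]
    return {word: idx for idx, word in enumerate(frequent)}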

In [5]:
vocab = create_vocabulary(sentences)
  1. Scan the text corpus with sliding window of size $5$ and step $1$ (which corresponds to $L$=2) and construct co-occurrence word-context matrix $D$ with elements $D_{wc}=\#(w,c)$. Please, ignore words which occur less than $r=200$ times, but include them into the sliding window. Please, see the graphical illustration of the procedure described.


In [6]:
def create_corpus_matrix(sentences, vocabulary):
    # Your code is here
    return corpus_matrix
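A possible sketch, assuming the word-to-index vocabulary above and the window half-width L = 2; rare words are skipped as centers and contexts but still occupy positions inside the window:

from collections import defaultdict
import scipy.sparse

def create_corpus_matrix_sketch(sentences, vocabulary, L=2):
    # D[w, c] = #(w, c) counted over a symmetric window of half-width L
    counts = defaultdict(int)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            if word not in vocabulary:
                continue
            window = sentence[max(0, i - L):i] + sentence[i + 1:i + 1 + L]
            for context in window:
                if context in vocabulary:
                    counts[(vocabulary[word], vocabulary[context])] += 1
    rows, cols = zip(*counts.keys())
    m = len(vocabulary)
    return scipy.sparse.csr_matrix((list(counts.values()), (rows, cols)), shape=(m, m))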

In [7]:
D = create_corpus_matrix(sentences, vocab)
  1. To find good word embeddings, Levy and Goldberg, 2015 proposed to find rank-$d$ SVD of Shifted Positive Pointwise Mutual Information (SPPMI) matrix
$$ U \Sigma V^\top \approx \text{SPPMI}, $$

where $\text{SPPMI}(w, c) = \max\left(\text{SPMI}(w, c), 0 \right)$ is the element of the matrix $\text{SPPMI}$ at position $(w, c)$, and $\text{SPMI}(w, c)$ was defined in Task 1. Then use $W=U\sqrt{\Sigma}$ as the word embedding matrix. Your task is to reproduce their results. Write a function that constructs the $\text{SPPMI}$ matrix, computes its SVD and produces the word-vector matrix $W$. Pay attention that the $\text{SPPMI}$ matrix is sparse!


In [9]:
def compute_embeddings(D, k, d=200):
    # Your code is here
    return embedding_matrix
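A sketch of one way to do it: build the sparse SPPMI matrix from the nonzero entries of D only (zero counts would give SPMI equal to minus infinity and are clipped to zero anyway) and take a truncated SVD; svds returns singular values in ascending order, hence the reordering.

import numpy as np
import scipy.sparse
from scipy.sparse.linalg import svds

def compute_embeddings_sketch(D, k, d=200):
    D = scipy.sparse.csr_matrix(D, dtype=float)
    total = D.sum()                               # |D|
    w_counts = np.asarray(D.sum(axis=1)).ravel()  # #(w)
    c_counts = np.asarray(D.sum(axis=0)).ravel()  # #(c)
    coo = D.tocoo()
    spmi = np.log(coo.data * total / (k * w_counts[coo.row] * c_counts[coo.col]))
    SPPMI = scipy.sparse.csr_matrix((np.maximum(spmi, 0.0), (coo.row, coo.col)), shape=D.shape)
    U, S, Vt = svds(SPPMI, k=d)
    order = np.argsort(S)[::-1]
    return U[:, order] * np.sqrt(S[order])        # W = U * sqrt(Sigma)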

In [ ]:
k = 5 # negative sampling parameter
W = compute_embeddings(D, k)
  1. Write class WordVectors using provided template.

In [11]:
class WordVectors:
    
    def __init__(self, vocabulary, embedding_matrix):
        self.vocab = vocabulary
        self.W = embedding_matrix
        self.inv_vocab = {v: k for k, v in self.vocab.items()}
        
    def word_vector(self, word):
        """ 
        Takes word and returns its word vector.
        """
#         Your code is here
        return word_vector
    
    def nearest_words(self, word, top_n=10):
        """ 
        Takes word from the vocabulary and returns its top_n
        nearest neighbors in terms of cosine similarity.
        """
#         Your code is here
        return neighbors
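Possible method bodies (a sketch; it assumes the vocabulary maps words to row indices of W and uses cosine_similarity as imported above):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

class WordVectorsSketch:
    def __init__(self, vocabulary, embedding_matrix):
        self.vocab = vocabulary
        self.W = embedding_matrix
        self.inv_vocab = {v: k for k, v in self.vocab.items()}

    def word_vector(self, word):
        # row of W corresponding to the word
        return self.W[self.vocab[word]]

    def nearest_words(self, word, top_n=10):
        # cosine similarities between the query vector and all rows of W
        sims = cosine_similarity(self.word_vector(word).reshape(1, -1), self.W).ravel()
        order = [i for i in np.argsort(sims)[::-1] if i != self.vocab[word]][:top_n]
        return [(self.inv_vocab[i], round(float(sims[i]), 3)) for i in order]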

In [12]:
model = WordVectors(vocab, W)

In [13]:
model.nearest_words("anarchism")


Out[13]:
[('communism', 0.792),
 ('anarcho', 0.787),
 ('capitalism', 0.784),
 ('socialism', 0.752),
 ('liberalism', 0.727),
 ('criticisms', 0.705),
 ('capitalist', 0.662),
 ('fascism', 0.565),
 ('anarchist', 0.527),
 ('marxist', 0.518)]

In [14]:
model.nearest_words("ussr")


Out[14]:
[('ukraine', 0.671),
 ('russia', 0.629),
 ('poland', 0.55),
 ('belarus', 0.538),
 ('yugoslavia', 0.538),
 ('romania', 0.517),
 ('serbia', 0.507),
 ('austria', 0.5),
 ('hungary', 0.466),
 ('bulgaria', 0.43)]

In [15]:
model.nearest_words("rap")


Out[15]:
[('hop', 0.828),
 ('hip', 0.817),
 ('funk', 0.758),
 ('rock', 0.737),
 ('punk', 0.706),
 ('music', 0.676),
 ('pop', 0.666),
 ('scene', 0.659),
 ('band', 0.657),
 ('jazz', 0.625)]
  1. Calculate the top 10 nearest neighbours with the corresponding cosine similarities for the words {numerical, linear, algebra} and insert them into the corresponding functions in pset2.py.

Problem 4 (eigenvalues) 45 pts

1. Theoretical tasks

  • (5 pts) Prove that normal matrix is Hermitian iff its eigenvalues are real. Prove that normal matrix is unitary iff its eigenvalues satisfy $|\lambda| = 1$.
    a)
    A normal matrix that is Hermitian has real eigenvalues: $$ A^{*}A = AA^{*}, \quad A = A^{*}, \quad Ax = \lambda x, \; x \neq 0 \\ x^*Ax = \lambda \|x\|^2_2, \qquad x^* A^* x = (x^* A x)^* = \overline{\lambda} \|x\|^2_2 \\ A = A^* \;\Rightarrow\; \lambda \|x\|^2_2 = \overline{\lambda} \|x\|^2_2 \;\Rightarrow\; \lambda = \overline{\lambda} \;\Rightarrow\; \lambda \in \mathbb{R}. $$ Conversely, if all eigenvalues of a normal matrix are real, then it is Hermitian: a normal matrix admits a unitary diagonalization $A = U \Lambda U^*$, and $$ A^* = (U \Lambda U^*)^* = U \overline{\Lambda} U^* = U \Lambda U^* = A, $$ since $\overline{\Lambda} = \Lambda$ when all eigenvalues are real.
    b)
    A normal matrix that is unitary has eigenvalues with $|\lambda| = 1$: for an eigenvector $x \neq 0$, $$ (Ax, Ax) = (\lambda x, \lambda x) = |\lambda|^2 \|x\|^2_2, \qquad (Ax, Ax) = x^* A^* A x = x^* x = \|x\|^2_2 \;\Rightarrow\; |\lambda|^2 = 1 \;\Rightarrow\; |\lambda| = 1. $$ Conversely, if all eigenvalues of a normal matrix satisfy $|\lambda| = 1$, then with the unitary diagonalization $A = U \Lambda U^*$ $$ A^* A = U \overline{\Lambda} \Lambda U^* = U \, \mathrm{diag}(|\lambda_1|^2, \dots, |\lambda_n|^2) \, U^* = U U^* = I, $$ hence $A$ is unitary, q.e.d.
  • (5 pts) The following problem illustrates instability of the Jordan form. Find theoretically the eigenvalues of the perturbed Jordan block: $$ J(\varepsilon) = \begin{bmatrix} \lambda & 1 & & & 0 \\ & \lambda & 1 & & \\ & & \ddots & \ddots & \\ & & & \lambda & 1 \\ \varepsilon & & & & \lambda \\ \end{bmatrix}_{n\times n} $$ Comment how eigenvalues of $J(0)$ are perturbed for large $n$. Expanding the characteristic determinant along the first column gives $$ \det(J(\varepsilon) - \alpha I) = (\lambda - \alpha)^n + (-1)^{n+1}\varepsilon = 0 \;\Longleftrightarrow\; (\alpha - \lambda)^n = \varepsilon, $$ so the eigenvalues of $J(\varepsilon)$ are $$ \boxed{\alpha_k = \lambda + \varepsilon^{1/n} e^{2\pi i k/n}, \quad k = 0, \dots, n-1,} $$ i.e. they lie on a circle of radius $|\varepsilon|^{1/n}$ around $\lambda$.

Even for $|\varepsilon| \ll 1$ the eigenvalues of $J(0)$ (all equal to $\lambda$) are perturbed by $|\varepsilon|^{1/n}$, which for large $n$ is dramatically larger than $\varepsilon$ itself: e.g. $\varepsilon = 10^{-16}$ and $n = 16$ already give a perturbation of magnitude $0.1$. This is why the Jordan form is numerically unstable.
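A quick numerical illustration of this sensitivity (a sketch):

import numpy as np

n, lam, eps = 16, 1.0, 1e-16
J = lam * np.eye(n) + np.diag(np.ones(n - 1), 1)
J[-1, 0] = eps
ev = np.linalg.eigvals(J)
# the eigenvalues scatter on a circle of radius about eps**(1/n) = 0.1 around lam,
# even though the perturbation itself has size 1e-16
print(np.max(np.abs(ev - lam)))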

2. PageRank

Damping factor importance

Solutions

  • (5 pts) Write the function pagerank_matrix(G) that takes an adjacency matrix $G$ (in both sparse and dense formats) as an input and outputs the corresponding PageRank matrix $A$.
    Done in pset2.py
  • (3 pts) Find PageRank matrix $A$ that corresponds to the following graph: What is its largest eigenvalue? What multiplicity does it have?

In [335]:
# implement the functions in the pset2.py file
from pset2 import pagerank_matrix, power_method, pagerank_matvec

G = np.matrix([[0, 0, 1, 0, 0],
               [1, 0, 1, 0, 0],
               [0, 1, 0, 0, 0],
               [0, 0, 0, 0, 1],
               [0, 0, 0, 1, 0]])
A = pagerank_matrix(G)
print("Matrix A:")
print(A)


Matrix A:
[[0.  0.  0.5 0.  0. ]
 [1.  0.  0.5 0.  0. ]
 [0.  1.  0.  0.  0. ]
 [0.  0.  0.  0.  1. ]
 [0.  0.  0.  1.  0. ]]

In [248]:
ev = np.linalg.eig(A)
print(ev[0])
print("Largest eigenvalue is 1 with multiplicity 2")


[-0.5+0.5j -0.5-0.5j  1. +0.j   1. +0.j  -1. +0.j ]
Largest eigenvalue is 1 with multiplicity 2
  • (5 pts) Implement the power method for a given matrix $A$, an initial guess $x_0$ and a number of iterations num_iter. It should be organized as a function power_method(A, x0, num_iter) that outputs approximation to eigenvector $x$, eigenvalue $\lambda$ and history of residuals $\{\|Ax_k - \lambda_k x_k\|_2\}$. Make sure that the method converges to the correct solution on a matrix $\begin{bmatrix} 2 & -1 \\ -1 & 2 \end{bmatrix}$ which is known to have the largest eigenvalue equal to $3$. A rough sketch of such an iteration is given below.
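The graded power_method lives in pset2.py; below is only a rough sketch of the iteration and of the residual history it is expected to return (the Rayleigh quotient is used as the eigenvalue estimate):

import numpy as np

def power_method_sketch(A, x0, num_iter):
    # A can be anything supporting the @ product with a vector (dense, sparse, LinearOperator)
    x = np.asarray(x0, dtype=float).ravel()
    x = x / np.linalg.norm(x)
    res = []
    lam = 0.0
    for _ in range(num_iter):
        y = np.asarray(A @ x).ravel()
        lam = x @ y                                  # Rayleigh quotient
        res.append(np.linalg.norm(y - lam * x))      # ||A x_k - lambda_k x_k||_2
        x = y / np.linalg.norm(y)
    return x, lam, np.array(res)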

In [249]:
A = np.matrix([[2, -1], [-1, 2]])
x0 = np.matrix([0, 1])
x, l, res = power_method(A, x0, 10)
print(x, l)


[[-0.70709481  0.70711876]] [[3.]]
  • (2 pts) Run the power method for the graph presented above and plot residuals $\|Ax_k - \lambda_k x_k\|_2$ as a function of $k$ for num_iter=100 and random initial guess x0. Explain the absence of convergence.

The convergence of the power method is governed by the ratio $q = |\lambda_2| / |\lambda_1|$. For this graph the eigenvalue $1$ has multiplicity $2$ and there is also an eigenvalue $-1$, so $|\lambda_2| = |\lambda_1| = 1$, $q = 1$, and the residuals do not decay.


In [250]:
A = pagerank_matrix(G)
x0 = np.random.rand(G.shape[0], 1)
x, l, res = power_method(A, x0, 100)

plt.figure(figsize=(14, 7))
plt.loglog(res/res[0])
plt.ylabel("Residual", size=14)
plt.xlabel("Iteration number", size=14)
plt.title(r'Convergence plot for the largest eigenvalue', size=14)
plt.grid()
plt.show()


  • (2 pts) Consider the same graph, but with a directed edge that goes from the node 3 to the node 4 being removed. Plot residuals as in the previous task and discuss the convergence. Now, run the power method with num_iter=100 for 10 different initial guesses and print/plot the resulting approximated eigenvectors. Why do they depend on the initial guess?

In [255]:
G = np.matrix([[0, 0, 1, 0, 0],
               [1, 0, 1, 0, 0],
               [0, 1, 0, 0, 0],
               [0, 0, 0, 0, 1],
               [0, 0, 0, 0, 0]])
A = pagerank_matrix(G)
ev = np.linalg.eig(A)
print(sorted([abs(x) for x in ev[0]], reverse=True))

x0 = np.random.rand(G.shape[0], 1)
x, l, res = power_method(A, x0, 100)

plt.figure(figsize=(14, 7))
plt.loglog(res/res[0])
plt.ylabel("Residual", size=14)
plt.xlabel("Iteration number", size=14)
plt.title(r'Convergence plot for the largest eigenvalue (one edge is removed)', size=14)
plt.grid()
plt.show()

np.set_printoptions(precision=4)
for i in range(10):
    x0 = np.random.rand(G.shape[0], 1)
    if (i == 2):
        x0[0] = 0
        x0[1] = 0
        x0[2] = 0
        x0[3] = 1
        x0[4] = -1
    x, l, res = power_method(A, x0, 100)
    print("Eigen vector: ")
    print(x.T)


[1.0000000000000004, 0.7071067811865476, 0.7071067811865476, 0.558257569495584, 0.358257569495584]
Eigen vector: 
[[3.3333e-01 6.6667e-01 6.6667e-01 3.2451e-26 1.1626e-26]]
Eigen vector: 
[[3.3333e-01 6.6667e-01 6.6667e-01 2.6934e-26 9.6492e-27]]
Eigen vector: 
[[ 2.9493e-01  3.9051e-01 -8.7208e-01 -6.4373e-11 -2.3062e-11]]
Eigen vector: 
[[3.3333e-01 6.6667e-01 6.6667e-01 4.8502e-26 1.7376e-26]]
Eigen vector: 
[[3.3333e-01 6.6667e-01 6.6667e-01 4.6167e-26 1.6540e-26]]
Eigen vector: 
[[3.3333e-01 6.6667e-01 6.6667e-01 1.4526e-26 5.2042e-27]]
Eigen vector: 
[[3.3333e-01 6.6667e-01 6.6667e-01 6.3572e-27 2.2775e-27]]
Eigen vector: 
[[3.3333e-01 6.6667e-01 6.6667e-01 3.0460e-26 1.0913e-26]]
Eigen vector: 
[[3.3333e-01 6.6667e-01 6.6667e-01 4.3599e-26 1.5620e-26]]
Eigen vector: 
[[3.3333e-01 6.6667e-01 6.6667e-01 1.0766e-26 3.8570e-27]]

Now the second largest eigenvalue in absolute value is $\approx 0.707 < 1$, so the convergence ratio is less than 1 and the method converges. The resulting vector still depends on the initial guess: if $x_0$ has (almost) no component along the dominant eigenvector, the iteration converges to an eigenvector associated with a smaller eigenvalue instead, as can be seen for the specially constructed $x_0$ at $i = 2$.

In order to avoid this problem Larry Page and Sergey Brin proposed to use the following regularization technique:

$$ A_d = dA + \frac{1-d}{N} \begin{pmatrix} 1 & \dots & 1 \\ \vdots & & \vdots \\ 1 & \dots & 1 \end{pmatrix}, $$

where $d$ is a small parameter in $[0,1]$ (typically $d=0.85$), which is called damping factor, $A$ is of size $N\times N$. Now $A_d$ is the matrix with multiplicity of the largest eigenvalue equal to 1. Recall that computing the eigenvector of the PageRank matrix, which corresponds to the largest eigenvalue, has the following interpretation. Consider a person who stays in a random node of a graph (i.e. opens a random web page); at each step s/he follows one of the outcoming edges uniformly at random (i.e. opens one of the links). So the person randomly walks through the graph and the eigenvector we are looking for is exactly his/her stationary distribution — for each node it tells you the probability of visiting this particular node. Therefore, if the person has started from a part of the graph which is not connected with the other part, he will never get there. In the regularized model, the person at each step follows one of the outcoming links with probability $d$ OR teleports to a random node from the whole graph with probability $(1-d)$.

  • (2 pts) Now, run the power method with $A_d$ and plot residuals $\|A_d x_k - \lambda_k x_k\|_2$ as a function of $k$ for $d=0.97$, num_iter=100 and a random initial guess x0.

In [256]:
d = 0.97

A_d = d * A + (1 - d)/A.shape[0] * np.ones(A.shape)

x0 = np.random.rand(A_d.shape[0], 1)
x, l, res = power_method(A_d, x0, 100)

plt.figure(figsize=(14, 7))
plt.loglog(res/res[0])
plt.ylabel("Residual", size=14)
plt.xlabel("Iteration number", size=14)
plt.title(r'Convergence plot for the largest eigenvalue (with regularization)', size=14)
plt.grid()
plt.show()


  • (5 pts) Find the second largest in the absolute value eigenvalue of the obtained matrix $A_d$. How and why is it connected to the damping factor $d$? What is the convergence rate of the PageRank algorithm when using damping factor?

Usually, graphs that arise in various areas are sparse (social, web, road networks, etc.) and, thus, computation of a matrix-vector product for corresponding PageRank matrix $A$ is much cheaper than $\mathcal{O}(N^2)$. However, if $A_d$ is calculated directly, it becomes dense and, therefore, $\mathcal{O}(N^2)$ cost grows prohibitively large for big $N$.


In [257]:
G = np.matrix([[0, 0, 1, 0, 0],
               [1, 0, 1, 0, 0],
               [0, 1, 0, 0, 0],
               [0, 0, 0, 0, 1],
               [0, 0, 0, 1, 0]])
A = pagerank_matrix(G)
A_d = d * A + (1 - d)/A.shape[0] * np.ones(A.shape)

ev = np.linalg.eig(A)
print("Before regularization")
print('l = ', sorted([abs(x) for x in ev[0]], reverse=True))
ev = np.linalg.eig(A_d)
print("After regularization")
print('l = ', sorted([abs(x) for x in ev[0]], reverse=True))


Before regularization
l =  [1.0000000000000004, 1.0, 1.0, 0.7071067811865476, 0.7071067811865476]
After regularization
l =  [0.9999999999999994, 0.97, 0.97, 0.6858935777509516, 0.6858935777509516]

Here it can be seen that after regularization $\lambda_2' = d\,\lambda_2$, where $d$ is the damping factor. The convergence rate of the power method is therefore $q = |\lambda_2'|/|\lambda_1'| = d\,|\lambda_2| \leq d$.

There (p. 46) it is proven that:

If the spectrum of the stochastic matrix $S$ is $\{1, \lambda_2, \lambda_3, \dots, \lambda_n\}$, then the spectrum of the Google matrix $G$ is $\{1, \alpha\lambda_2, \alpha\lambda_3, \dots, \alpha\lambda_n\}$.

  • (2 pts) Implement fast matrix-vector product for $A_d$ as a function pagerank_matvec(A, d, x), which takes a PageRank matrix $A$ (in sparse format, e.g., csr_matrix), damping factor $d$ and a vector $x$ as an input and returns $A_dx$ as an output.
    Done in pset2.py (a sketch of the underlying idea is shown after this list).
  • (1 pts) Generate a random adjacency matrix of size $10000 \times 10000$ with only 100 non-zero elements and compare pagerank_matvec performance with direct evaluation of $A_dx$.
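The idea behind pagerank_matvec (a sketch; the graded version is in pset2.py): the rank-one part of $A_d$ never has to be formed, because $A_d x = dAx + \frac{1-d}{N}\left(\sum_i x_i\right)\mathbf{1}$.

import numpy as np

def pagerank_matvec_sketch(A, d, x):
    # A is sparse (e.g. csr_matrix), so A @ x costs O(nnz(A)) instead of O(N^2)
    x = np.asarray(x, dtype=float)
    N = A.shape[0]
    return d * (A @ x) + (1.0 - d) / N * x.sum() * np.ones_like(x)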

In [258]:
import random
import time

N = 10000
G = np.zeros((N, N))
subG = (np.random.random((10, 10)) - 0.5).clip(0,1)
subG[subG.nonzero()] = 1
p1, p3 = random.randint(1, N - 10), random.randint(1, N - 10)
p2, p4 = p1 + 10, p3 + 10
G[p1:p2,p3:p4] = subG
x = np.random.rand(N, 1)
G = sp.sparse.csr_matrix(G)

A = pagerank_matrix(G)
d = 0.97

In [263]:
%%time
y = pagerank_matvec(A, d, x)


CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 390 µs

In [264]:
%%time
y = np.dot((d * A + (1 - d)/A.shape[0] * np.ones(A.shape)), x)


CPU times: user 1.5 s, sys: 348 ms, total: 1.84 s
Wall time: 1.54 s

Much faster!

DBLP: computer science bibliography

Download the dataset from here, unzip it and put dblp_authors.npz and dblp_graph.npz in the same folder with this notebook. Each value (author name) from dblp_authors.npz corresponds to the row/column of the matrix from dblp_graph.npz. Value at row i and column j of the matrix from dblp_graph.npz corresponds to the number of times author i cited papers of the author j. Let us now find the most significant scientists according to PageRank model over DBLP data.

  • (4 pts) Load the weighted adjacency matrix and the authors list into Python using load_dblp(...) function. Print its density (fraction of nonzero elements). Find top-10 most cited authors from the weighted adjacency matrix. Now, make all the weights of the adjacency matrix equal to 1 for simplicity (consider only existence of connection between authors, not its weight). Obtain the PageRank matrix $A$ from the adjacency matrix and verify that it is stochastic.

In [265]:
from scipy.sparse import load_npz
import numpy as np
def load_dblp(path_auth, path_graph):
    G = load_npz(path_graph).astype(float)
    with np.load(path_auth) as data: authors = data['authors']
    return G, authors

G, authors = load_dblp('dblp_authors.npz', 'dblp_graph.npz')
density = np.count_nonzero(G.data)/(G.shape[0]) ** 2
print("Density: ", density)
diags = np.squeeze(np.asarray(np.sum(G, axis=1)))
top10 = sorted(range(len(diags)), reverse=True, key=lambda k: diags[k])[:10]
for i,val in enumerate(top10):
    print(str(i + 1) + ". ", authors[val])
G[G.nonzero()]  = 1


Density:  4.4986518152305776e-05
1.  Wei Wang
2.  Lei Zhang
3.  Yang Liu
4.  Wei Zhang
5.  Jun Wang
6.  Dacheng Tao
7.  Philip S. Yu
8.  Lei Wang
9.  Wei Li
10.  Jun Zhang

In [279]:
A = pagerank_matrix(G)

In [284]:
columns_sum = np.squeeze(np.asarray(np.sum(A, axis=0)))

epsilon = 10e-13
passed = 1
for col in range(A.shape[0]):
    if (abs(columns_sum[col]) > epsilon) and (abs(columns_sum[col] - 1) > epsilon):
        passed = 0
        print(columns_sum[col])
        break
        
if (passed):
    print("Matrix is stochastic")
else:
    print("Matrix is not stochastic")


Matrix is stochastic
  • (1 pts) In order to provide pagerank_matvec to your power_method (without rewriting it) for fast calculation of $A_dx$, you can create a LinearOperator:
    L = scipy.sparse.linalg.LinearOperator(A.shape, matvec=lambda x, A=A, d=d: pagerank_matvec(A, d, x))
    
    Calling L@x or L.dot(x) will result in calculation of pagerank_matvec(A, d, x) and, thus, you can plug $L$ instead of the matrix $A$ into power_method directly. Note: though in the previous subtask the graph was very small (so you could neglect the fast matvec implementation), here it is very large (but sparse), so that direct evaluation of $A_dx$ would require storing $\sim 10^{12}$ matrix elements - good luck with that (^_<).

In [285]:
from scipy.sparse import linalg
d = 0.85
L = sp.sparse.linalg.LinearOperator(A.shape, matvec=lambda x, A=A, d=d: pagerank_matvec(A, d, x))
  • (2 pts) Run the power method starting from the vector of all ones and plot residuals $\|A_dx_k - \lambda_k x_k\|_2$ as a function of $k$ for $d=0.85$.

In [287]:
x0 = np.ones((A.shape[0], 1))
x_k, l, res = power_method(L, x0, 50)

In [288]:
plt.figure(figsize=(14, 7))
plt.loglog(res/res[0])
plt.ylabel("Residual", size=14)
plt.xlabel("Iteration number", size=14)
plt.title(r'Convergence plot for the DBLP', size=14)
plt.grid()
plt.show()


  • (1 pts) Print names of the top-10 authors according to PageRank over DBLP when $d=0.85$. Comment on your findings.

In [289]:
top10_2 = sorted(range(len(x_k)), reverse=True, key=lambda k: x_k[k])[:10]
for i,val in enumerate(top10_2):
    print(str(i + 1) + ". ", authors[val])


1.  David E. Goldberg
2.  Richard M. Karp
3.  John E. Hopcroft
4.  Azriel Rosenfeld
5.  Ronald L. Rivest
6.  Anil K. Jain
7.  Takeo Kanade
8.  Alfred V. Aho
9.  Robert Endre Tarjan
10.  Gerard Salton

These results look more reasonable. The first top-10 is based on raw citation counts, so an enormous number of citations coming from a single work (or a single author) directly inflates the score. With the binarized adjacency matrix and PageRank, only the existence of a link between two authors counts and the importance of the citing authors matters, so this kind of inflation no longer works.