Problem Set 1 (120 points)

Important information

We provide signatures of the functions that you have to implement. Make sure you follow the signatures defined, otherwise your coding solutions will not be graded.

Read the homework rules carefully. If you do not follow them, you will likely be penalized.

Problem 1 (Python demo) 40 pts

Data preparation (10 pts)

  • First of all, download the $\verb|.wav|$ file with a StarCraft sound from here. Load it in Python and play it using the following functions:

In [1]:
from scipy.linalg import toeplitz
import numpy as np
import math
import scipy.io.wavfile as wav
import matplotlib.pyplot as plt
from IPython.display import Audio
%matplotlib notebook

In [2]:
# reading
rate, audio = wav.read("TMaRdy00.wav")

# plotting
plt.plot(audio)
plt.ylabel("Amplitude")
plt.xlabel("Time")
plt.title("You wanna piece of me, boy?")
plt.show()

# playing
Audio(audio, rate=rate)


Out[2]:

Our next goal is to process this signal by multiplying it by a special type of matrix (convolution operation) that will smooth the signal.

  • (5 pts) Before processing this file, let us estimate what size of matrix we can afford. Let $N$ be the size of the signal. Estimate analytically the memory in megabytes required to store a dense square matrix of size $N\times N$, print this number, and check whether it fits in your main memory (RAM). Cut the signal so that you do not run into swapping (overflow of the main memory). Note: cut the signal by taking every p-th element of the array: signal[::p].

In [3]:
# Memory (in MB) needed to store a dense N x N float64 matrix: N^2 * 8 bytes / 2^20
mem_mb = len(audio) ** 2 * 8 / 1024 / 1024
print(mem_mb)
# ~10.7 GB would not fit comfortably in RAM, so keep only every 4th sample
audio = audio[::4]


10952.0
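
The hard-coded step p = 4 works, but a minimal sketch of choosing p automatically from a memory budget is shown below; the 2 GB budget and the sample count are illustrative assumptions (the printed estimate above, $8N^2/2^{20} \approx 10952$ MB, corresponds to $N \approx 3.8\cdot 10^4$ samples).

In [ ]:
def choose_decimation(n_samples, budget_gb=2.0, bytes_per_entry=8):
    # Smallest step p such that a dense (n/p) x (n/p) float64 matrix fits into budget_gb:
    # require (n_samples / p)^2 * bytes_per_entry <= budget_gb * 1024**3
    budget_bytes = budget_gb * 1024**3
    p = int(np.ceil(n_samples / np.sqrt(budget_bytes / bytes_per_entry)))
    return max(p, 1)

print(choose_decimation(38_000))   # -> 3 for a 2 GB budget; the notebook uses the more conservative p = 4
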
  • (5 pts) Write a function
    def gen_toeplitz(N, alpha):    
      return T
    
    that outputs matrix $T$: $$T_{ij} = \sqrt{\frac{\alpha}{\pi}}e^{-\alpha (i-j)^2}, \quad i,j=1,\dots,N$$ as numpy array. Avoid using loops or lists! The function np.meshgrid will be helpful for this task. Note: matrices that depend only on difference of indices: $T_{ij} \equiv T_{i-j}$ are called Toeplitz. Toeplitz matrix-by-vector multiplication is convolution since it can be written as $$y_i = \sum_{j=1}^N T_{i-j} x_j.$$ Convolutions can be computed faster than $\mathcal{O}(N^2)$ complexity using Fast Fourier transform (will be covered later in our course, no need to implement it here).

In [4]:
# INPUT: N - integer (positive), alpha - float (positive)
# OUTPUT: T - np.array (shape: NxN)

def gen_toeplitz(N, alpha):
    i = np.linspace(1, N, N)
    j = np.linspace(1, N, N)
    im, jm = np.meshgrid(i, j, indexing="ij")
    T = np.sqrt(alpha / np.pi) * np.exp(-alpha * (im - jm)**2)
    return T
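
A quick sanity check, not required by the task: since the matrix depends only on $i-j$, it can also be built from its first column with scipy.linalg.toeplitz (already imported above) and compared against gen_toeplitz.

In [ ]:
# Compare gen_toeplitz with a Toeplitz matrix built from its first column
N_test, alpha_test = 100, 0.2
col = np.sqrt(alpha_test / np.pi) * np.exp(-alpha_test * np.arange(N_test)**2)
print(np.allclose(gen_toeplitz(N_test, alpha_test), toeplitz(col)))   # expected: True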

Convolution (10 pts)

  • (5 pts) Write a function convolution (see below) that takes the signal you want to convolve and multiplies it by the Toeplitz matrix $T$ (use the @ operator for matvec operations).

In [5]:
# INPUT: signal - np.array (shape: Nx1), N - int (positive), alpha - float (positive)
# OUTPUT: convolved_signal - np.array (shape: Nx1)

def convolution(signal, N, alpha):
    T = gen_toeplitz(N, alpha)
    convolved_signal = T @ signal
    return convolved_signal
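
As the task above notes, the same product can be computed without forming $T$ at all, using the FFT. A minimal sketch (assuming scipy.signal is available; it reproduces convolution(signal, len(signal), alpha) up to floating-point error and is not required here):

In [ ]:
from scipy.signal import fftconvolve

def convolution_fft(signal, alpha):
    # Compute T @ signal without building T, via FFT-based convolution
    N = len(signal)
    k = np.arange(-(N - 1), N)                              # lags i - j = -(N-1), ..., N-1
    kernel = np.sqrt(alpha / np.pi) * np.exp(-alpha * k**2)
    # mode='same' keeps the N central samples, i.e. exactly y_i = sum_j T_{i-j} x_j
    return fftconvolve(signal, kernel, mode='same')
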
  • (3 pts) Plot the first $100$ points of the result and the first $100$ points of your signal on the same figure. Do the same plots for $\alpha = \frac{1}{5}$ and $\alpha = \frac{1}{100}$ using plt.subplots in matplotlib. Each subplot should contain the first $100$ points of the initial and convolved signals for some $\alpha$. Make sure that the results look like a smoothed version of the initial signal.

  • (2 pts) Play the resulting signal. In order to do so you should also scale the frequency (rate), which is one of the inputs in Audio.
    Note that you cannot play a signal which is too small.


In [6]:
audio_alpha1 = convolution(audio, len(audio), 0.2)
audio_alpha2 = convolution(audio, len(audio), 0.01)
# plotting
fig = plt.figure(figsize=(15,10))
ax1 = fig.add_subplot(211)
ax2 = fig.add_subplot(212)
fig.tight_layout()

ax1.plot(audio[:100:], label="Original")
ax1.plot(audio_alpha1[:100:], label="Convolved")
ax1.set_ylabel("Amplitude", size=13)
ax1.set_xlabel("Time", size=13)
ax1.set_title("Original and convolved signals (alpha = 0.2)", size=13)
ax1.legend(loc="upper right")

ax2.plot(audio[:100:], label="Original")
ax2.plot(audio_alpha2[:100:], label="Convolved")
ax2.set_ylabel("Amplitude", size=13)
ax2.set_xlabel("Time", size=13)
ax2.set_title("Original and convolved signals (alpha = 0.01)", size=13)
ax2.legend(loc="upper right")


Out[6]:
<matplotlib.legend.Legend at 0x7fad0537e0f0>

In [7]:
# Since we kept only every 4th sample of the original track, the sample rate must be decreased by a factor of 4
Audio(audio_alpha1, rate=rate/4)


Out[7]:

In [ ]:
Audio(audio_alpha2, rate=rate/4)

Deconvolution (20 pts)

Given a convolved signal $y$ and an initial signal $x$, our goal now is to recover $x$ by solving the system $$ y = Tx. $$ To do so we will run the iterative process $$ x_{k+1} = x_{k} - \tau_k (Tx_k - y), \quad k=1,2,\dots $$ starting from the zero vector $x_0$. There are different ways to define the parameters $\tau_k$; different choices lead to different methods (e.g. Richardson iteration, Chebyshev iteration, etc.). This topic will be covered in detail later in our course.

To get some intuition for why this process converges to the solution of $Tx=y$, note that if $x_k$ converges to some limit $x$, then so does $x_{k+1}$. Taking $k\to \infty$ we arrive at $x = x - \tau (Tx - y)$, and hence $x$ is the solution of $Tx = y$.

Another important point is that the iterative process requires only the matrix-vector product $Tx_k$ at each iteration instead of the whole matrix. In this problem we nevertheless work with the full matrix, but keep in mind that the convolution can be done efficiently without storing the whole matrix.

  • (5 pts) For each $k$ choose the parameter $\tau_k$ such that the residual $r_{k+1}=Tx_{k+1} - y$ is as small as possible (line search with search direction $r_k$): $$ \|Tx_{k+1} - y\|_2 \to \min_{\tau_k} $$ found analytically. The answer to this bullet is a derivation of $\tau_k$. The parameter $\tau_k$ should be expressed in terms of the residual $r_k = T x_k - y$.

First of all, let us work out the expression $Tx_{k+1} - y$: $$ Tx_{k+1} - y = T(x_k - \tau_k(Tx_k-y))-y = T(x_k - \tau_k T x_k + \tau_k y) - y = Tx_k - \tau_k T^2 x_k + \tau_kTy - y = Tx_k - y + \tau_k(Ty - T^2x_k) = r_k - \tau_k Tr_k $$ So $$ \|r_k - \tau_k Tr_k\|_2 \to \min_{\tau_k} $$ The squared norm can be rewritten as: $$ \langle r_k - \tau_k T r_k, r_k - \tau_k Tr_k \rangle = \langle r_k, r_k \rangle - \tau_k \langle r_k, Tr_k \rangle - \tau_k \langle Tr_k, r_k \rangle + \tau^2_k \langle T r_k, T r_k \rangle = \langle r_k, r_k \rangle - 2 \tau_k \langle r_k, Tr_k \rangle + \tau^2_k \langle T r_k, T r_k \rangle $$ So we have to minimize the following function over $\tau_k$: $$ f(\tau_k) = \langle r_k, r_k \rangle - 2 \tau_k \langle r_k, Tr_k \rangle + \tau^2_k \langle T r_k, T r_k \rangle $$ Setting the derivative of $f$ with respect to $\tau_k$ to zero: $$ f'(\tau_k) = -2 \langle r_k, T r_k \rangle + 2 \tau_k \langle T r_k, Tr_k \rangle = 0 $$ Hence at every step we take $$ \tau_k = \frac{\langle r_k, T r_k \rangle}{\langle Tr_k, Tr_k\rangle} $$
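
A quick numerical check of the derived step size (a sketch, not part of the graded answer; the symmetric positive definite test matrix is arbitrary):

In [ ]:
# tau_k should minimize ||r_k - tau * T r_k||_2 over all tau
np.random.seed(0)
M = np.random.randn(50, 50)
T_test = M @ M.T + 50 * np.eye(50)           # arbitrary SPD test matrix
y_test = np.random.randn(50)
r = T_test @ np.zeros(50) - y_test           # residual at x_0 = 0
Tr = T_test @ r
tau_opt = np.dot(r, Tr) / np.dot(Tr, Tr)
res_opt = np.linalg.norm(r - tau_opt * Tr)
res_grid = min(np.linalg.norm(r - t * Tr) for t in np.linspace(-1, 1, 401))
print(res_opt <= res_grid + 1e-12)           # expected: True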

  • (10 pts) Write a function iterative that outputs accuracy –– a numpy array of relative errors $\big\{\frac{\|x_{k+1} - x\|_2}{\|x\|_2}\big\}$ after num_iter iterations using $\tau_k$ from the previous task. Note: The only loop you are allowed to use here is a loop for $k$.

In [1]:
# INPUT:  N - int (positive), alpha - float (positive), num_iter - integer (positive), 
#  y - np.array (shape: Nx1, convolved signal), s - np.array (shape: Nx1, original signal)
# OUTPUT: rel_error - np.array size (num_iter x 1)

def iterative(N, num_iter, y, s, alpha):
    T = gen_toeplitz(N, alpha)
    x_k = np.zeros(s.shape)
    rel_error = []

    for k in range(num_iter):
        r_k = T @ x_k - y                                   # current residual
        Tr_k = T @ r_k
        tau_k = np.dot(r_k, Tr_k) / np.dot(Tr_k, Tr_k)      # optimal step from the derivation above
        x_k = x_k - tau_k * r_k
        rel_error.append(np.linalg.norm(x_k - s) / np.linalg.norm(s))
    return np.array(rel_error)
  • (2 pts) Set num_iter=1000, x=s[::20] and do a convergence plot for $\alpha = \frac{1}{2}$ and $\alpha = \frac{1}{5}$.

In [566]:
num_iter = 1000
# audio was already decimated by 4, so this is every 20th sample of the original signal s
x = audio[::5]
N = len(x)

x_conv_alpha1 = convolution(x, N, 0.5)
x_conv_alpha2 = convolution(x, N, 0.2)

conv_err_alpha1 = iterative(len(x), num_iter, x_conv_alpha1, x, 0.5)
conv_err_alpha2 = iterative(len(x), num_iter, x_conv_alpha2, x, 0.2)

plt.figure(figsize=(15, 7))
plt.plot(conv_err_alpha1, label=r"$\alpha = 0.5$")
plt.plot(conv_err_alpha2, label=r"$\alpha = 0.2$")
plt.legend(loc="upper right")
plt.ylabel("Relative error")
plt.xlabel("Iteration number")
plt.title(r'Convergence plot for $\alpha = 0.5$ and $\alpha = 0.2$')

plt.show()


  • (3 pts) Set x=s[::20], num_iter=1000 and $\alpha=\frac{1}{5}$. Explain what happens with the convergence if you add small random noise of amplitude $10^{-3}\max(x)$ to $y$. The answer to this question should be an explanation supported by plots and/or tables.

In [571]:
num_iter = 1000
# The audio was already decimated by a factor of 4, so this is every 20th sample of the original signal
x = audio[::5]
N = len(x)

noise = np.random.random(x.shape) * 0.001 * np.max(x)
x_conv_alpha1 = convolution(x, N, 0.2)
x_conv_noise = x_conv_alpha1 + noise

conv_err_alpha1 = iterative(len(x), num_iter, x_conv_alpha1, x, 0.2)
conv_err_noise =  iterative(len(x), num_iter, x_conv_noise, x, 0.2)

plt.figure(figsize=(15, 7))
plt.plot(conv_err_alpha1, label="Deconvolution")
plt.plot(conv_err_noise, label="Deconvolution + noise")
plt.legend(loc="upper left")
plt.ylabel("Relative error")
plt.xlabel("Iteration number")
plt.title(r'Convergence plot for deconvolution w/ and w/o noise ($\alpha = 0.2$)')

plt.show()


Problem 2 (Theoretical tasks) 45 pts

1.

  • (5 pts) Prove that $\| U A \|_F = \| A U \|_F = \| A \|_F$ for any unitary matrix $U$.
  • (5 pts) Prove that $\| Ux \|_2 = \| x \|_2$ for any $x$ iff $U$ is unitary.
  • (5 pts) Prove that $\| U A \|_2 = \| A U \|_2 = \| A \|_2$ for any unitary $U$.

2.

  • (5 pts) Using the results from the previous subproblem, prove that $\| A \|_F \le \sqrt{\mathrm{rank}(A)} \| A \|_2$. Hint: SVD will help you.
  • (5 pts) Show that for any $m, n$ and $k \le \min(m, n)$ there exists $A \in \mathbb{R}^{m \times n}: \mathrm{rank}(A) = k$, such that $\| A \|_F = \sqrt{\mathrm{rank}(A)} \| A \|_2$. In other words, show that the previous inequality is not strict.
  • (5 pts) Prove that if $\mathrm{rank}(A) = 1$, then $\| A \|_F = \| A \|_2$.
  • (5 pts) Prove that $\| A B \|_F \le \| A \|_2 \| B \|_F$.

3.

  • (3 pts) Differentiate with respect to $A$ the function $$ f(A) = \mathrm{sin}(x^\top A B C D x), $$ where $x$ is a vector and $A, B, C, D$ are square matrices.

  • (7 pts) Differentiate with respect to $y, A, X$ the function $$f(y, A, X) = \mathrm{tr}(\mathrm{diag}(y) A X),$$ where $y \in \mathbb{R}^n$ and $A, X \in \mathbb{R}^{n \times n}$. Here

$$ \mathrm{diag}(y)_{i, j} = \begin{cases} y_i, & \text{if}\ i = j \\ 0, & \text{otherwise} \end{cases} $$

Proofs

1.

  • (5 pts) Prove that $\| U A \|_F = \| A U \|_F = \| A \|_F$ for any unitary matrix $U$.
    Using the trace formulation of the Frobenius norm: $$\| UA \|_F^2 = \mbox{trace }[( U A )^{*}\, UA] = \mbox{trace }[A^{*}U^{*} \, UA] = \mbox{trace }[A^{*}A] = \|A\|^2_F$$ Since the matrix trace is invariant under cyclic permutations: $$\| AU \|_F^2 = \mbox{trace }[( A U )^{*}\, A U] = \mbox{trace }[U^{*}A^{*} \, AU] = \mbox{trace }[U U^{*} \, A^{*} A] = \mbox{trace }[A^{*}A] = \|A\|^2_F $$ q.e.d
  • (5 pts) Prove that $\| Ux \|_2 = \| x \|_2$ for any $x$ iff $U$ is unitary.
    Suppose $U$ is unitary and let $y=Ux$; then $$ \| Ux \|^2_2 = \langle Ux,Ux \rangle = \langle x, U^{*}Ux \rangle = \langle x, x \rangle = \| x \|^2_2 $$ Conversely, suppose $\| Ux \|_2 = \| x \|_2$ holds for every $x$. Writing both sides componentwise, $$ \|Ux\|^2_2 = \sum_i \Big( \sum_j U_{ij} x_j \Big)^{*} \Big(\sum_k U_{ik}x_k \Big) = \sum_j \sum_k \Big(\sum_i U_{ij}^{*}U_{ik}\Big)x^{*}_j x_k, \qquad \| x \|^2_2 = \sum_k \sum_j \delta_{jk} x^{*}_j x_k, $$ so $\sum_{j,k}\big(\sum_i U_{ij}^{*}U_{ik} - \delta_{jk}\big)x^{*}_j x_k = 0$ for every $x$. Taking $x = e_j$ gives the diagonal identities $\sum_i |U_{ij}|^2 = 1$, and taking $x = e_j + e_k$ and $x = e_j + i\,e_k$ for $j \neq k$ (polarization) forces the off-diagonal terms to vanish, hence $$ \sum_i U^{*}_{ij} U_{ik} = \delta_{jk} \quad \text{for all } j,k, \qquad \text{i.e.} \quad U^{*}U = I $$ q.e.d
  • (5 pts) Prove that $\| U A \|_2 = \| A U \|_2 = \| A \|_2$ for any unitary $U$.
    Using the characterization of the spectral norm through the largest eigenvalue of $A^{*}A$: $$ \| UA \|^2_2 = \lambda_{max} [(U A)^{*} \, U A] = \lambda_{max} [A^{*} U^{*} U A] = \lambda_{max} [A^{*} A] = \|A\|^2_2 $$ For multiplication from the right, note that $U^{*}$ is unitary as well and that the spectral norm is invariant under conjugate transposition (since $A$ and $A^{*}$ have the same singular values): $$ \|AU\|_2 = \|(AU)^{*}\|_2 = \|U^{*} A^{*}\|_2 = \|A^{*}\|_2 = \|A\|_2 $$ q.e.d

2.

  • (5 pts) Using the results from the previous subproblem, prove that $\| A \|_F \le \sqrt{\mathrm{rank}(A)} \| A \|_2$. Hint: SVD will help you.
    Let $r = \mathrm{rank}(A)$ and let $U \Sigma V^{*}$ be the SVD of $A$. Since unitary matrices change neither the Frobenius nor the spectral norm: $$ \| A \|_F = \|U \Sigma V^{*} \|_F = \| \Sigma \|_F = \sqrt{\mbox{trace }[ \Sigma^{*} \Sigma]} = \sqrt{\sigma^2_1 + ... + \sigma^2_r}$$ $$ \| A \|_2 = \|U\Sigma V^{*} \|_2 = \| \Sigma \|_2 = \sqrt{\lambda_{max}(\Sigma^{*}\Sigma)} = \sigma_1 $$ Since $ \sigma_1 \geq \, ... \, \geq \sigma_r > 0 $ and $ r = \mathrm{rank}(\Sigma)$: $$ \sqrt{ \sigma^2_1 + ... + \sigma^2_r} \leq \sqrt{r \sigma^2_1} = \sqrt{r}\,\sigma_1 $$ Hence: $$ \|A\|_F \leq \sqrt{\mathrm{rank}(A)}\, \|A\|_2 $$
  • (5 pts) Show that for any $m, n$ and $k \le \min(m, n)$ there exists $A \in \mathbb{R}^{m \times n}: \mathrm{rank}(A) = k$, such that $\| A \|_F = \sqrt{\mathrm{rank}(A)} \| A \|_2$. In other words, show that the previous inequality is not strict.
    Equality in $$ \sqrt{ \sigma^2_1 + ... + \sigma^2_k} \leq \sqrt{k\, \sigma^2_1} $$ is attained exactly when $\sigma_1 = \, ... \, = \sigma_k$. So it suffices to exhibit, for any $k \leq \min(m,n)$, a matrix $A \in \mathbb{R}^{m \times n}$ with $\mathrm{rank}(A) = k$ whose nonzero singular values are all equal. Take the $m\times n$ matrix $A$ whose upper-left $k\times k$ block is the identity and whose remaining entries are zero. Then $\mathrm{rank}(A) = k$ and $\sigma_1 = \, ... \, = \sigma_k = 1$, hence $$ \|A\|_F = \sqrt{k} = \sqrt{\mathrm{rank}(A)}\, \|A\|_2 $$ so such a matrix exists.
  • (5 pts) Prove that if $\mathrm{rank}(A) = 1$, then $\| A \|_F = \| A \|_2$.
    As was already shown, and given $\mathrm{rank}(A)=1$: $$ \| A \|_F = \| U\Sigma V^{*} \|_F = \|\Sigma\|_F = \sqrt{\mbox{trace }[\Sigma^{*}\Sigma]} = \sqrt{\sigma^2_1} $$ $$ \| A \|_2 = \| \Sigma \|_2 = \sqrt{\lambda_{max}(\Sigma^{*}\Sigma)} = \sqrt{\sigma^2_1} $$ Hence: $$ \| A \|_F = \| A \|_2 $$
  • (5 pts) Prove that $\| A B \|_F \le \| A \|_2 \| B \|_F$.
    Let us write $B$ columnwise as $\begin{bmatrix} b_1 , \dots , b_n \end{bmatrix}$; then $$ \|AB\|_F = \sqrt{\sum_i \|A b_i\|^2_2} $$ By the definition of the spectral norm, $\|A b_i\|_2 \leq \|A\|_2 \|b_i\|_2$, hence $$ \sqrt{\sum_i \|A b_i\|^2_2} \leq \sqrt{\|A\|^2_2 \sum_i \| b_i\|^2_2} = \|A\|_2 \|B\|_F $$ so $$ \| A B \|_F \le \| A \|_2 \| B \|_F $$ (A short numerical sanity check of the inequalities from this subproblem is given right after this list.)
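
A short numerical sanity check of the inequalities above (a sketch with random matrices; not part of the required proofs):

In [ ]:
np.random.seed(1)
A = np.random.randn(30, 20)
B = np.random.randn(20, 10)
spec_A = np.linalg.norm(A, 2)
# ||A||_F <= sqrt(rank(A)) * ||A||_2 and ||AB||_F <= ||A||_2 * ||B||_F
print(np.linalg.norm(A, 'fro') <= np.sqrt(np.linalg.matrix_rank(A)) * spec_A)   # expected: True
print(np.linalg.norm(A @ B, 'fro') <= spec_A * np.linalg.norm(B, 'fro'))        # expected: True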

3.

  • (3 pts) Differentiate with respect to $A$ the function $$ f(A) = \mathrm{sin}(x^\top A B C D x), $$ where $x$ is a vector and $A, B, C, D$ are square matrices.
    Let $y = BCDx$, then $f(A) = \sin(x^\top Ay)$ $$ \boxed{\frac{\partial{f(A)}}{\partial{A}}} = \cos(x^\top A y) \frac{\partial{(x^\top A y)}}{\partial{A}} $$ $$ \frac{\partial{(x^\top A y)}}{\partial{A}} = \begin{bmatrix} {\frac{\partial{\Big( \sum_j \sum_i x_i a_{ij}y_j \Big)}}{\partial{a_{11}}}} & \dots & {\frac{\partial{\Big( \sum_j \sum_i x_i a_{ij}y_j \Big) }}{\partial{a_{1n}}}} \\ \vdots & \ddots & \vdots \\ {\frac{\partial{\Big( \sum_j \sum_i x_i a_{ij}y_j \Big)}}{\partial{a_{n1}}}} & \dots & {\frac{\partial{\Big( \sum_j \sum_i x_i a_{ij}y_j \Big) }}{\partial{a_{nn}}}} \end{bmatrix} = \begin{bmatrix} x_1 y_1 & \dots & x_1 y_n \\ \vdots & \ddots & \vdots \\ x_n y_1 & \dots & x_n y_n \end{bmatrix} = xy^\top $$ Hence, substituting $y = BCDx$ back (so that $y^\top = x^\top D^\top C^\top B^\top$): $$ \frac{\partial{f(A)}}{\partial{A}} = \cos(x^\top A y)\, x y^\top = \boxed {\cos(x^\top A B C D x)\, x x^\top D^\top C^\top B^\top} $$
  • (7 pts) Differentiate with respect to $y, A, X$ the function $$f(y, A, X) = \mathrm{tr}(\mathrm{diag}(y) A X),$$ where $y \in \mathbb{R}^n$ and $A, X \in \mathbb{R}^{n \times n}$. Here $$ \mathrm{diag}(y)_{i, j} = \begin{cases} y_i, & \text{if}\ i = j \\ 0, & \text{otherwise} \end{cases} $$
    Writing the function elementwise, $$ f(y, A, X) = \mathrm{tr}(\mathrm{diag}(y) A X) = \sum_j y_j \sum_i a_{ji} x_{ij} = \sum_j y_j (AX)_{jj} $$ Differentiating with respect to each entry of each variable gives $$ \Big(\frac{\partial f}{\partial y}\Big)_{j} = \sum_i a_{ji} x_{ij} = (AX)_{jj}, \qquad \boxed{\frac{\partial f}{\partial y} = \begin{bmatrix} (AX)_{11} \\ \vdots \\ (AX)_{nn} \end{bmatrix} = \sum_i P^{(i)} A X P^{(i)} u}, $$ i.e. the vector of diagonal entries of $AX$, where $$ P^{(i)} = e_i e_i^\top, \qquad u = \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}; $$ $$ \Big(\frac{\partial f}{\partial A}\Big)_{pq} = y_p x_{qp}, \qquad \boxed{\frac{\partial f}{\partial A} = \mathrm{diag}(y)\, X^\top}; $$ $$ \Big(\frac{\partial f}{\partial X}\Big)_{pq} = y_q a_{qp}, \qquad \boxed{\frac{\partial f}{\partial X} = A^\top \mathrm{diag}(y)} $$ (A finite-difference check of the gradients from this problem, including the one from the previous item, is given right after this list.)
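
Since the index bookkeeping above is easy to get wrong, the following sketch verifies all four boxed gradients with central finite differences (random test data; not part of the graded answer):

In [ ]:
np.random.seed(2)
n = 4
x, y = np.random.randn(n), np.random.randn(n)
A, B, C, D, X = (np.random.randn(n, n) for _ in range(5))

def fd_grad(f, M, h=1e-6):
    # Central finite differences of a scalar function f with respect to an array M
    G = np.zeros_like(M)
    it = np.nditer(M, flags=['multi_index'])
    for _ in it:
        E = np.zeros_like(M); E[it.multi_index] = h
        G[it.multi_index] = (f(M + E) - f(M - E)) / (2 * h)
    return G

g_3a = np.cos(x @ A @ B @ C @ D @ x) * np.outer(x, B @ C @ D @ x)   # d/dA sin(x^T ABCDx)
g_y  = np.diag(A @ X)                                               # d/dy tr(diag(y) A X)
g_A  = np.diag(y) @ X.T                                             # d/dA tr(diag(y) A X)
g_X  = A.T @ np.diag(y)                                             # d/dX tr(diag(y) A X)

print(np.allclose(g_3a, fd_grad(lambda M: np.sin(x @ M @ B @ C @ D @ x), A), atol=1e-5))
print(np.allclose(g_y,  fd_grad(lambda v: np.trace(np.diag(v) @ A @ X), y), atol=1e-5))
print(np.allclose(g_A,  fd_grad(lambda M: np.trace(np.diag(y) @ M @ X), A), atol=1e-5))
print(np.allclose(g_X,  fd_grad(lambda M: np.trace(np.diag(y) @ A @ M), X), atol=1e-5))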

Problem 3 (Strassen algorithm) 15 pts

1. (3 pts) Implement the naive algorithm for square matrix multiplication with explicit “for” loops.


In [85]:
def naive_multiplication(A, B):
    """
    Implement naive matrix multiplication with explicit for loops
    
    Parameters: Matrices A, B
    
    Returns: Matrix C = AB
    """
    sizeA = A.shape
    sizeB = B.shape
    C = np.zeros((sizeA[0], sizeB[1]))
    for i in range(sizeA[0]):
        for j in range(sizeB[1]):
            for k in range(sizeA[1]):
                C[i,j] += A[i,k] * B[k,j]
    return C

2. (7 pts) Implement the Strassen algorithm.


In [308]:
def strassen(A, B):
    """
    Implement Strassen algorithm for matrix multiplication 
    
    Parameters: Matrices A, B
    
    Returns: Matrix C = AB
    """
    n = A.shape[0]

    if (n == 1):
        return A[0,0] * B[0,0]

    # We assume the input matrices are square and of equal size.
    # If n is not a power of two, zero-pad up to the next power of two;
    # the padding is stripped again before returning.
    n_orig = n
    if ((n & (n - 1)) != 0):
        padA = int(2 ** np.ceil(np.log2(n)))
        A = np.pad(A, ((0, padA - n), (0, padA - n)), 'constant')
        B = np.pad(B, ((0, padA - n), (0, padA - n)), 'constant')
        n = padA

    n2 = n // 2
    A11, A12, A22, A21 = A[:n2, :n2], A[:n2, n2:], A[n2:, n2:], A[n2:, :n2]
    B11, B12, B22, B21 = B[:n2, :n2], B[:n2, n2:], B[n2:, n2:], B[n2:, :n2]
    # The seven Strassen products
    m1 = strassen(A11 + A22, B11 + B22)
    m2 = strassen(A21 + A22, B11)
    m3 = strassen(A11, B12 - B22)
    m4 = strassen(A22, B21 - B11)
    m5 = strassen(A11 + A12, B22)
    m6 = strassen(A21 - A11, B11 + B12)
    m7 = strassen(A12 - A22, B21 + B22)
    if (isinstance(m1, np.ndarray)):
        C = np.bmat([[m1 + m4 - m5 + m7, m3 + m5], [m2 + m4, m1 - m2 + m3 + m6]])
    else:
        # 2 x 2 base case: the m's are scalars
        C = np.matrix([[m1 + m4 - m5 + m7, m3 + m5], [m2 + m4, m1 - m2 + m3 + m6]])
    # Strip the zero padding so that C has the size of the original inputs
    return C[:n_orig, :n_orig]
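
A quick correctness check against NumPy before timing (the size 37 is an arbitrary choice, deliberately not a power of two so that the padding branch is exercised):

In [ ]:
A_test = np.random.random((37, 37))
B_test = np.random.random((37, 37))
print(np.allclose(strassen(A_test, B_test), A_test @ B_test))   # expected: True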

3. (5 pts) Compare three approaches: naive multiplication, the Strassen algorithm and the standard NumPy function. Provide a log-scale plot of the dependence of the multiplication runtime on the matrix size. You will have three lines; do not forget to add a legend, axis labels and other attributes (see our requirements).
Consider matrix sizes in the range of 100 to 700 with step 100, i.e. $n=100, 200,\ldots, 700$. Justify the results theoretically (e.g., use the known formulas for the total multiplication complexity of the naive and Strassen algorithms).


In [307]:
import time

matrix_size = [x for x in range(100, 800, 100)]
naive_impl = []
strassen_impl = []
numpy_impl = []

for n in matrix_size:
    print(n)
    A = np.random.random((n, n))
    B = np.random.random((n, n))
    start = time.time()
    naive_multiplication(A,B)
    end = time.time()
    naive_impl.append(end - start)
    
    start = time.time()
    strassen(A,B)
    end = time.time()
    strassen_impl.append(end - start)
    
    start = time.time()
    np.matmul(A,B)
    end = time.time()
    numpy_impl.append(end - start)


100
200
300
400
500
600
700

In [393]:
plt.figure(figsize=(15,7))
plt.loglog(matrix_size, naive_impl, label='Naive algorithm')
plt.loglog(matrix_size, strassen_impl, label='Strassen algorithm')
plt.loglog(matrix_size, numpy_impl, label='Numpy algorithm')
plt.ylabel("Runtime, s")
plt.xlabel("Square matrix size, items")
plt.title("Different multiplication algorithm performance")
plt.legend(loc='upper left')
plt.show()


Theoretical justification

The naive algorithm performs $\mathcal{O}(n^3)$ operations, while Strassen's recursion $T(n) = 7\,T(n/2) + \mathcal{O}(n^2)$ gives $\mathcal{O}(n^{\log_2 7}) \approx \mathcal{O}(n^{2.81})$. Despite this, in our experiment the Strassen implementation is slower than the naive one, for several reasons. First, it is hard to gain speed in pure Python: the interpreter adds a large overhead to every line of code (NumPy's multiplication, in contrast, dispatches to optimized compiled routines). Second, every recursion step allocates new temporary matrices, which costs extra time. Finally, recursing all the way down to $1\times 1$ blocks makes the number of Python function calls itself grow like $n^{\log_2 7}$. One can also observe steps on the plot: they appear because every matrix is padded up to the nearest power of two before being multiplied, so the cost jumps when the padded size doubles.
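
As a rough illustration of these steps (a sketch that counts only scalar multiplications and ignores additions, memory traffic and Python overhead):

In [ ]:
n = np.array([100, 200, 300, 400, 500, 600, 700], dtype=float)
naive_mults = n ** 3
# This Strassen implementation pads to the next power of two and recurses down to 1x1 blocks,
# which costs 7**(log2 n_pad) = n_pad**(log2 7) scalar multiplications
n_pad = 2.0 ** np.ceil(np.log2(n))
strassen_mults = n_pad ** np.log2(7)
print(np.round(strassen_mults / naive_mults, 2))   # the ratio jumps above 1 right after each padding jump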

Problem 4 (SVD) 20 pts

In this assignment you are supposed to study how SVD could be used in image compression.

1. (2 pts) Compute the singular values of some predownloaded image (via the code provided below) and plot them. Do not forget to use logarithmic scale.


In [360]:
%matplotlib inline
import matplotlib.pyplot as plt
from PIL import Image, ImageDraw
import requests
import numpy as np


url = 'https://people.sc.fsu.edu/~jburkardt/data/png/lena.png'

face_raw = Image.open(requests.get(url, stream=True).raw)
face = np.array(face_raw).astype(np.uint8)

plt.imshow(face_raw)
plt.xticks(())
plt.yticks(())
plt.title('Original Picture')
plt.show()

plt.figure(figsize=(15,7))
u, s, vh = np.linalg.svd(face)
plt.semilogy(s/s[0])
plt.xlabel("Number of sing. value")
plt.ylabel(r"$\sigma_i/\sigma_0$")
plt.title("Distribution of singular value for three color channels")
plt.show()


2. (3 pts) Complete the function compress that performs the SVD and truncates it (using $k$ singular values/vectors). See the prototype below.

Note that in the colourful case you have to split your image into channels and work with the matrices corresponding to the different channels separately.

Plot the approximate reconstruction $M_\varepsilon$ of your favorite image such that $\mathrm{rank}(M_\varepsilon) = 5, 20, 50$ using plt.subplots.


In [485]:
def compress(image, k):
    """
    Perform svd decomposition and truncate it (using k singular values/vectors)
    
    Parameters: 
        image (np.array):        input image (probably, colourful)
        
        k (int):                 approximation rank
        
    --------
    Returns:
      reconst_matrix (np.array): reconstructed matrix (tensor in colourful case)
      
      s (np.array):              array of singular values 
    """
    u, s, v = np.linalg.svd(image, full_matrices=False)
    # Truncate to the k leading singular triplets: u[:, :k] diag(s[:k]) v[:k, :]
    reconst_matrix = u[:, :k] @ np.diag(s[:k]) @ v[:k, :]
    return reconst_matrix, s[:k]
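
compress above works on a single 2D channel; as the task note says, a colour image has to be processed channel by channel. A minimal wrapper (a sketch; compress_color and the clipping to uint8 are illustrative, not part of the required signature):

In [ ]:
def compress_color(image, k):
    # Apply compress() independently to each channel of an H x W x C image
    channels = [compress(image[:, :, c], k)[0] for c in range(image.shape[2])]
    return np.clip(np.stack(channels, axis=2), 0, 255).astype(np.uint8)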

In [486]:
plt.figure(figsize=(30,15))

for i,k in enumerate([5, 20, 50]):
    plt.subplot(1, 3, i + 1)
    img, sv = compress(face, k)
    plt.xticks(())
    plt.yticks(())
    plt.imshow(img, cmap='gray')
    plt.title('Number of singular values = {}'.format(k), fontsize=26)


3. (3 pts) Plot the following two figures for your favorite picture:

  • How does the relative error of the approximation depend on the approximation rank?
  • How does the compression rate in terms of stored information ((singular vectors + singular values) / total size of the image) depend on the approximation rank?

In [477]:
rank = [x for x in range(1, 200, 2)]
rel_error = []
compr_rate = []

for r in rank:
    img, sv = compress(face, r)
    rel_error.append(np.linalg.norm(img - face, 'fro')/np.linalg.norm(face, 'fro'))
    compr_rate.append((face.shape[0] * r + face.shape[1] * r + r) / (face.shape[0] * face.shape[1]))
    
plt.figure(figsize=(14,7))
plt.xlabel("Rank of approximation")
plt.ylabel("Relative error")
plt.title("Relative error of approximation")
plt.plot(rank, rel_error)
plt.show()

plt.figure(figsize=(14,7))
plt.xlabel("Rank of approximation")
plt.ylabel("Compression rate")
plt.title("Compression rate over approximation rank")
plt.plot(rank, compr_rate)
plt.show()


4. (2 pts) Consider the following two pictures. Compute their approximations (with the same rank, or the same relative error). What do you see? Explain the results.


In [517]:
url1 = 'http://sk.ru/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-00-60-11/skoltech1.jpg'
url2 = 'http://www.simpsoncrazy.com/content/characters/poster/bottom-right.jpg'
image_raw1 = Image.open(requests.get(url1, stream=True).raw)
image_raw2 = Image.open(requests.get(url2, stream=True).raw)

image1 = np.array(image_raw1).astype(np.uint8)
image2 = np.array(image_raw2).astype(np.uint8)

plt.figure(figsize=(18, 6))
plt.subplot(1,2,1)
plt.imshow(image_raw1)
plt.title('One Picture')
plt.xticks(())
plt.yticks(())

plt.subplot(1,2,2)
plt.imshow(image_raw2)
plt.title('Another Picture')
plt.xticks(())
plt.yticks(())

plt.show()



In [531]:
# Your code is here
rank = 50
n_channels = image1.shape[2]   # number of colour channels (image1.ndim also happens to be 3, but only by coincidence)

image_zip1 = np.zeros(image1.shape)
image_zip2 = np.zeros(image2.shape)
for i in range(n_channels):
    image_zip1[:,:,i], sing_val1 = compress(image1[:,:,i], rank)
    image_zip2[:,:,i], sing_val2 = compress(image2[:,:,i], rank)

plt.figure(figsize=(18, 6))
plt.subplot(1,2,1)
plt.imshow(Image.fromarray(image_zip1.astype(np.uint8), 'RGB'))
plt.title('One Picture')
plt.xticks(())
plt.yticks(())

plt.subplot(1,2,2)
plt.imshow(Image.fromarray(image_zip2.astype(np.uint8), 'RGB'))
plt.title('Another Picture')
plt.xticks(())
plt.yticks(())

plt.show()


To figure out why the second image is much more distorted than the first one (in the first case most of the objects are still well recognizable, in the second they are not), let us plot their singular values.


In [521]:
_, sing_val1, _ = np.linalg.svd(image1[:,:,0])
_, sing_val2, _ = np.linalg.svd(image2[:,:,0])

plt.figure(figsize=(18, 6))
plt.subplot(1,2,1)
plt.semilogy(sing_val1/sing_val1[0])
plt.ylabel(r"$\sigma_i/\sigma_0$")
plt.xlabel("Number of sing. value")
plt.title('One Picture, sing. values')

plt.subplot(1,2,2)
plt.semilogy(sing_val2/sing_val2[0])
plt.ylabel(r"$\sigma_i/\sigma_0$")
plt.xlabel("Number of sing. value")
plt.title('Another Picture, sing. values')

plt.show()


As we know, the more singular values we keep, the better the reconstruction quality; at the same time, the larger a singular value, the more it contributes to the final image (this follows directly from the definition of the SVD). Keeping that in mind, we can observe that in the first image only about 60 singular values lie between $\sigma_0$ and $0.01\sigma_0$, so the contribution of the rest is much less important, whereas in the second image about 150 singular values fall within the same range. Hence truncating both images at the same rank is not a fair comparison; keeping the singular values within the same relative range is more appropriate, so let us check it.


In [532]:
# Your code is here
n_channels = image1.shape[2]

image_zip1 = np.zeros(image1.shape)
image_zip2 = np.zeros(image2.shape)
# Keep the singular values down to roughly the same relative level for both images
for i in range(n_channels):
    image_zip1[:,:,i], sing_val1 = compress(image1[:,:,i], 60)
    image_zip2[:,:,i], sing_val2 = compress(image2[:,:,i], 150)

plt.figure(figsize=(18, 6))
plt.subplot(1,2,1)
plt.imshow(Image.fromarray(image_zip1.astype(np.uint8), 'RGB'))
plt.title('One Picture')
plt.xticks(())
plt.yticks(())

plt.subplot(1,2,2)
plt.imshow(Image.fromarray(image_zip2.astype(np.uint8), 'RGB'))
plt.title('Another Picture')
plt.xticks(())
plt.yticks(())

plt.show()


So both images are still well-recognizable.

Problem 5 (Bonus)

  1. The norm is called absolute if $\|x\|=\| \lvert x \lvert \|$ holds for any vector $x$, where $x=(x_1,\dots,x_n)^T$ and $\lvert x \lvert = (\lvert x_1 \lvert,\dots, \lvert x_n \lvert)^T$. Give an example of a norm which is not absolute.

  2. Write a function ranks_HOSVD(A, eps) that calculates Tucker ranks of a d-dimensional tensor $A$ using High-Order SVD (HOSVD) algorithm, where eps is the relative accuracy in the Frobenius norm between the approximated and the initial tensors. Details can be found here on Figure 4.3.

    def ranks_HOSVD(A, eps):
       return r #r should be a tuple of ranks r = (r1, r2, ..., rd)
    

In [ ]: