Softmax is a kind of multi-class sigmoid, but if you look at the softmax function, the outputs of all the softmax units are supposed to sum to 1. With sigmoid that is not required.
As mentioned earlier, with softmax, increasing the output value of one class makes the others go down (the sum stays 1). So sigmoids are preferable to softmax when your outputs are independent of one another. To put it more simply: if there are multiple classes and each input can belong to exactly one class, it makes sense to use softmax; in the other cases, sigmoid is the better fit.
source: https://www.quora.com/Why-is-it-better-to-use-Softmax-function-than-sigmoid-function
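A quick sketch to make the contrast concrete (plain NumPy; the score vectors are made-up example logits): softmax outputs are coupled and sum to 1, while per-class sigmoid outputs are independent of each other.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))      # subtract max for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores), softmax(scores).sum())   # sums to exactly 1.0
print(sigmoid(scores), sigmoid(scores).sum())   # each in (0, 1), sum unconstrained

# Raising one score: the softmax probabilities of the *other* classes drop,
# while their sigmoid values stay exactly the same.
scores2 = np.array([4.0, 1.0, 0.1])
print(softmax(scores2))
print(sigmoid(scores2))
```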
Sigmoid - It squashes all values between 0 and 1 and when you do so repeatedly, neuron outputs and their gradients can vanish entirely
source: https://codelabs.developers.google.com/codelabs/cloud-tensorflow-mnist/#7
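A small illustration of the vanishing effect (NumPy; the chain lengths are just illustrative): the sigmoid's derivative is at most 0.25, so a chain of n sigmoid layers scales the backpropagated signal by at most 0.25**n.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)      # maximum value is 0.25, reached at x = 0

x = 0.0                       # even in the best case the factor is only 0.25
for n in (1, 5, 10, 20):
    print(n, sigmoid_grad(x) ** n)   # 0.25, ~9.8e-4, ~9.5e-7, ~9.1e-13
```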
Softmax takes a 2-step approach to the activation: exponentiate each element, then normalize the resulting vector (divide by its sum) so the outputs add up to 1.
source: https://codelabs.developers.google.com/codelabs/cloud-tensorflow-mnist/#3
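Spelled out as code (NumPy; the logits are arbitrary example values), the two steps are: exponentiate, then normalize by the sum.

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.1])
exp = np.exp(logits)          # step 1: elementwise exponential (everything becomes positive)
probs = exp / exp.sum()       # step 2: normalize so the outputs sum to 1
print(probs, probs.sum())
```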
L1 penalties are great for recovering sparse data, as they are computationally tractable but still capable of recovering the exact sparse solution.
L2 penalization is preferable for data that is not at all sparse.
source: https://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization
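A sketch of the sparsity difference using scikit-learn's Lasso (L1) and Ridge (L2); the generated data, noise level, and alpha values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
w_true = np.zeros(50)
w_true[:5] = 3.0                      # only 5 of the 50 features actually matter
y = X @ w_true + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)    # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)    # L2 penalty
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))   # most of the 45 irrelevant ones
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))   # typically none
```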
One of the main disadvantages of Adagrad is that it can get stuck in a local minimum.
+ This happens because the accumulated L2 norm of the past gradients can only increase, so the effective learning rate keeps shrinking.
source: http://profs.info.uaic.ro/~rbenchea/rn/curs6.pdf - pg 25/40
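A toy sketch of the Adagrad update on f(w) = w**2 (the learning rate eta and the objective are made up for illustration): the per-parameter accumulator G only grows, so the effective step size eta / sqrt(G) can only shrink.

```python
import numpy as np

eta, eps = 0.5, 1e-8
w, G = 5.0, 0.0
for t in range(1, 101):
    grad = 2.0 * w                    # gradient of f(w) = w**2
    G += grad ** 2                    # accumulated squared gradients: monotonically increasing
    w -= eta / (np.sqrt(G) + eps) * grad
    if t in (1, 10, 50, 100):
        print(t, w, eta / (np.sqrt(G) + eps))   # the effective learning rate keeps decreasing
```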
1. The output is always positive
2. They can output 0 when they saturate
source: http://profs.info.uaic.ro/~rbenchea/rn/curs7.pdf - pg 4/46
smaller derivatives (since the output is always > 0 and bounded by 1, the derivative is at most 0.25) -> weaker gradients
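Assuming these points refer to the sigmoid (as the surrounding notes suggest), a small NumPy check of both: the outputs are strictly positive, and in the saturated regions the output approaches 0 or 1 and the derivative is essentially 0.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
out = sigmoid(xs)
grad = out * (1.0 - out)
print(out)    # all strictly positive; ~0 at x = -10, ~1 at x = 10
print(grad)   # ~4.5e-5 at |x| = 10: saturated units pass back almost no gradient
```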
smth
smth
smth
Useful formulas:
Example MathJax graph
Source: meta.math.stackexchange.com/questions/23055/how-can-i-draw-a-tree-like-this
smth
Given an image $I$ and a filter $F$, apply $F$ convolutionally over $I$, using the formula $F(I)(x, y) = \sum_{i}\sum_{j} F(i, j)\, I(x - i, y - j)$.
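A minimal sketch of that formula as code (NumPy; `convolve2d`, the toy image, and the filter are made-up names/values), computing the "valid" part of the convolution with an explicitly flipped filter.

```python
import numpy as np

def convolve2d(I, F):
    fh, fw = F.shape
    ih, iw = I.shape
    out = np.zeros((ih - fh + 1, iw - fw + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            # flip the filter (true convolution) and accumulate the weighted sum
            out[x, y] = np.sum(F[::-1, ::-1] * I[x:x + fh, y:y + fw])
    return out

I = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 "image"
F = np.array([[0.0, 1.0], [1.0, 0.0]])         # toy 2x2 filter
print(convolve2d(I, F))
```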