Softmax is a kind of multi-class sigmoid, but if you look at the softmax function, the outputs of all the softmax units are supposed to sum to 1. With sigmoid that is not required.
As mentioned earlier, with softmax, increasing the output value of one class makes the others go down (the sum stays 1). So sigmoids are preferable to softmax when your outputs are independent of one another. To put it more simply: if there are multiple classes and each input can belong to exactly one class, it makes sense to use softmax; in the other cases, sigmoid is the better fit.
source: https://www.quora.com/Why-is-it-better-to-use-Softmax-function-than-sigmoid-function
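A quick sketch to make the contrast concrete (plain NumPy; the score vectors are made-up example logits): softmax outputs are coupled and sum to 1, while per-class sigmoid outputs are independent of each other.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))      # subtract max for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores), softmax(scores).sum())   # sums to exactly 1.0
print(sigmoid(scores), sigmoid(scores).sum())   # each in (0, 1), sum unconstrained

# Raising one score: the softmax probabilities of the *other* classes drop,
# while their sigmoid values stay exactly the same.
scores2 = np.array([4.0, 1.0, 0.1])
print(softmax(scores2))
print(sigmoid(scores2))
```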
Sigmoid - It squashes all values between 0 and 1 and when you do so repeatedly, neuron outputs and their gradients can vanish entirely
source: https://codelabs.developers.google.com/codelabs/cloud-tensorflow-mnist/#7
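A small illustration of the vanishing effect (NumPy; the chain lengths are just illustrative): the sigmoid's derivative is at most 0.25, so a chain of n sigmoid layers scales the backpropagated signal by at most 0.25**n.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)      # maximum value is 0.25, reached at x = 0

x = 0.0                       # even in the best case the factor is only 0.25
for n in (1, 5, 10, 20):
    print(n, sigmoid_grad(x) ** n)   # 0.25, ~9.8e-4, ~9.5e-7, ~9.1e-13
```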
Softmax takes a 2-step approach to the activation: exponentiate each element, then normalize the resulting vector (divide by its sum) so the outputs add up to 1.
source: https://codelabs.developers.google.com/codelabs/cloud-tensorflow-mnist/#3
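Spelled out as code (NumPy; the logits are arbitrary example values), the two steps are: exponentiate, then normalize by the sum.

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.1])
exp = np.exp(logits)          # step 1: elementwise exponential (everything becomes positive)
probs = exp / exp.sum()       # step 2: normalize so the outputs sum to 1
print(probs, probs.sum())
```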
L1 penalties are great for recovering sparse data, as they are computationally tractable but still capable of recovering the exact sparse solution.
L2 penalization is preferable for data that is not at all sparse.
source: https://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization
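A sketch of the sparsity difference using scikit-learn's Lasso (L1) and Ridge (L2); the generated data, noise level, and alpha values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
w_true = np.zeros(50)
w_true[:5] = 3.0                      # only 5 of the 50 features actually matter
y = X @ w_true + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)    # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)    # L2 penalty
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))   # most of the 45 irrelevant ones
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))   # typically none
```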
One of the main disadvantages of Adagrad is that it can get stuck in a local minimum.
+ This happens because the accumulated L2 norm of the past gradients can only increase, so the effective learning rate keeps shrinking.
source: http://profs.info.uaic.ro/~rbenchea/rn/curs6.pdf - pg 25/40
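A toy sketch of the Adagrad update on f(w) = w**2 (the learning rate eta and the objective are made up for illustration): the per-parameter accumulator G only grows, so the effective step size eta / sqrt(G) can only shrink.

```python
import numpy as np

eta, eps = 0.5, 1e-8
w, G = 5.0, 0.0
for t in range(1, 101):
    grad = 2.0 * w                    # gradient of f(w) = w**2
    G += grad ** 2                    # accumulated squared gradients: monotonically increasing
    w -= eta / (np.sqrt(G) + eps) * grad
    if t in (1, 10, 50, 100):
        print(t, w, eta / (np.sqrt(G) + eps))   # the effective learning rate keeps decreasing
```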
1. The output is always positive
2. They can output 0 when they saturate
source: http://profs.info.uaic.ro/~rbenchea/rn/curs7.pdf - pg 4/46
smaller derivatives (since the output is always > 0 and bounded by 1, the derivative is at most 0.25) -> weaker gradients
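Assuming these points refer to the sigmoid (as the surrounding notes suggest), a small NumPy check of both: the outputs are strictly positive, and in the saturated regions the output approaches 0 or 1 and the derivative is essentially 0.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
out = sigmoid(xs)
grad = out * (1.0 - out)
print(out)    # all strictly positive; ~0 at x = -10, ~1 at x = 10
print(grad)   # ~4.5e-5 at |x| = 10: saturated units pass back almost no gradient
```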
smth
smth
smth
Useful formulas:
Example MathJax graph
Source: meta.math.stackexchange.com/questions/23055/how-can-i-draw-a-tree-like-this
smth
Given an image $I$ and a filter $F$, apply $F$ convolutionally over $I$, using the formula $F(I)(x, y) = \sum_{i}\sum_{j} F(i, j)\, I(x - i, y - j)$.
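A minimal sketch of that formula as code (NumPy; `convolve2d`, the toy image, and the filter are made-up names/values), computing the "valid" part of the convolution with an explicitly flipped filter.

```python
import numpy as np

def convolve2d(I, F):
    fh, fw = F.shape
    ih, iw = I.shape
    out = np.zeros((ih - fh + 1, iw - fw + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            # flip the filter (true convolution) and accumulate the weighted sum
            out[x, y] = np.sum(F[::-1, ::-1] * I[x:x + fh, y:y + fw])
    return out

I = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 "image"
F = np.array([[0.0, 1.0], [1.0, 0.0]])         # toy 2x2 filter
print(convolve2d(I, F))
```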