Suppose we have data consisting of a set of $N$ patterns $\mathbf{x}$, each $D$ elements long and labelled with a target class $\in \{1..K\}$. We can think of this as an $N\times D$ matrix $X$ and an $N\times K$ matrix $T$. Each row of $T$ contains zeros except for a single 1 in the column corresponding to the correct target class.
Consider a feed-forward neural network. When the network is given the input vector $ \mathbf{x}_{n} $ it generates an output vector $\mathbf{y}_{n}$ via the softmax function. We can make an $N\times K$ matrix $Y$ from these output vectors (one per row), and thus the network maps $X \rightarrow Y$ and $Y$ has the same dimensions as $T$. The softmax function is $$ Y_{n,i} = \frac{\exp(\phi_{n,i})}{Z_n} $$ where $\phi_{n,i} = \sum_j W_{i,j} X_{n,j}$ and $Z_n = \sum_k \exp(\phi_{n,k})$.
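As a concrete illustration, here is a minimal NumPy sketch of the softmax map $X \rightarrow Y$ above (the sizes and random values are made up; `W`, `X`, `Y`, `Phi` follow the notation in the text):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 5, 3, 4                       # patterns, input size, classes (arbitrary)
X = rng.normal(size=(N, D))
W = rng.normal(size=(K, D))

Phi = X @ W.T                           # Phi[n, i] = sum_j W[i, j] * X[n, j]
Phi -= Phi.max(axis=1, keepdims=True)   # subtracting the row max leaves the softmax
                                        # unchanged but avoids overflow in exp
Y = np.exp(Phi)
Y /= Y.sum(axis=1, keepdims=True)       # divide each row by its Z_n

assert np.allclose(Y.sum(axis=1), 1.0)  # each row of Y is a distribution over classes
```

The row-max subtraction is the usual numerical-stability trick; it cancels in the ratio $\exp(\phi_{n,i})/Z_n$.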
In lecture we met the "cross entropy" loss function: $$ E = \sum_n \sum_k T_{n,k} \log Y_{n,k} $$ (As written, $E$ is the log likelihood, to be maximised; the cross-entropy loss conventionally carries the opposite sign and is minimised.)
(Note that sometimes it's more convenient to write this as $$ E = \sum_n \log Y_{n, c_n} $$ where $ c_n $ is the index of the target class for the $ n^\text{th} $ item in the training set.)
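The two forms agree whenever $T$ is one-hot, since each row of $T$ picks out the single term $\log Y_{n,c_n}$. A quick NumPy check (the targets and outputs here are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 4, 3
c = rng.integers(0, K, size=N)          # target class index c_n for each pattern
T = np.eye(K)[c]                        # corresponding one-hot rows of T

Y = rng.random(size=(N, K))
Y /= Y.sum(axis=1, keepdims=True)       # any valid set of softmax outputs will do

E_matrix = np.sum(T * np.log(Y))                # sum_n sum_k T[n,k] log Y[n,k]
E_index  = np.sum(np.log(Y[np.arange(N), c]))   # sum_n log Y[n, c_n]
assert np.isclose(E_matrix, E_index)
```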
We motivated the cross-entropy loss by arguing that it is the log likelihood, i.e. the log of the probability that a stochastic form of this network would generate precisely the training classes: treat the softmax outputs as a categorical distribution and sample each class from it.
(See this, which discusses cross-entropy vs. training error vs. sum-of-squared errors without referring to a stochastic model.)
Q: Consider a simple neural network with no hidden layers and a softmax as the output layer. Show mathematically that gradient descent of the cross-entropy loss leads to the "delta rule" for the weight change: $\Delta W_{ij} \propto \sum_n (T_{n,i} - Y_{n,i}) X_{n,j}$
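The claimed gradient can be sanity-checked numerically: compute $(T-Y)^\top X$ analytically and compare it against a finite-difference estimate of $\partial E/\partial W_{ij}$. This sketch (with made-up sizes and data) verifies the formula but is no substitute for the derivation asked for above:

```python
import numpy as np

def forward(W, X):
    """Softmax outputs Y for weights W (K x D) and inputs X (N x D)."""
    Phi = X @ W.T
    Phi -= Phi.max(axis=1, keepdims=True)   # numerical stability; cancels in the ratio
    Y = np.exp(Phi)
    return Y / Y.sum(axis=1, keepdims=True)

rng = np.random.default_rng(2)
N, D, K = 6, 3, 4
X = rng.normal(size=(N, D))
W = rng.normal(size=(K, D))
T = np.eye(K)[rng.integers(0, K, size=N)]   # one-hot targets

Y = forward(W, X)
analytic = (T - Y).T @ X        # claimed dE/dW: entry (i,j) is sum_n (T-Y)[n,i] X[n,j]

# Central finite differences on E = sum_n sum_k T log Y
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(K):
    for j in range(D):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        numeric[i, j] = (np.sum(T * np.log(forward(Wp, X)))
                         - np.sum(T * np.log(forward(Wm, X)))) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-5)
```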
Slightly easier option: do it for the 2-class case. Hint: since there are only two options (say $a$ and $b$) and we know the second probability has to be $Y_b = 1-Y_a$, it's enough to worry about $Y_a$ alone, so we don't need two neurons computing two different $\phi$ values. Instead, just find one output per pattern, $Y_n = \Pr(\text{class}=a)$, which can be implemented by a sigmoid (or logistic) non-linearity applied to $\phi_n$. In this case the log likelihood is a sum over all items in the training set of $$ T_{n} \log Y_{n} + (1-T_{n}) \log (1- Y_{n}),$$ where $T_n = 1$ if item $n$ is in class $a$ and $0$ otherwise. Differentiate and reorganise to get the answer: the delta rule.
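The same finite-difference check works for the 2-class sigmoid version, where the delta rule reduces to $\Delta w_j \propto \sum_n (T_n - Y_n) X_{n,j}$ for a single weight vector $w$ (again, sizes and data below are made up for illustration):

```python
import numpy as np

def sigmoid(phi):
    return 1.0 / (1.0 + np.exp(-phi))

rng = np.random.default_rng(3)
N, D = 8, 3
X = rng.normal(size=(N, D))
w = rng.normal(size=D)                          # one weight vector, one output unit
t = rng.integers(0, 2, size=N).astype(float)    # t[n] = 1 if item n is class a, else 0

y = sigmoid(X @ w)
analytic = (t - y) @ X          # delta rule: dE/dw_j = sum_n (t_n - y_n) X[n, j]

def loglik(w):
    y = sigmoid(X @ w)
    return np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

# Central finite differences, one basis vector e per weight
eps = 1e-6
numeric = np.array([(loglik(w + eps * e) - loglik(w - eps * e)) / (2 * eps)
                    for e in np.eye(D)])
assert np.allclose(analytic, numeric, atol=1e-5)
```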
In [ ]: