consequences: if we are trying to fit some unimodal distribution $s$ to $r$, and $r$ has two modes:
- if we minimize $D_{KL}(r \| s)$, i.e. minimize $\mathbb{E}_r[\log(r/s)]$, then $s$ is penalized heavily wherever $r$ is large but $s$ is small, so we constrain $r$ and $s$ to be similar anywhere $r$ is large, while $s$ is relatively unconstrained where $r$ is small. The result is that the single mode of $s$ widens to span both of $r$'s modes
- if we minimize $D_{KL}(s \| r)$, i.e. minimize $\mathbb{E}_s[\log(s/r)]$, then we constrain $r$ and $s$ to be similar anywhere $s$ is large. A single mode that spans both of $r$'s modes would put high $s$ mass in the gap between $r$'s modes, where $r$ is small, and thus would not satisfy this constraint
- by contrast, if $s$ has its mode over only one of $r$'s modes, then $r$ and $s$ can have similar probabilities wherever $s$ is large, and thus the KL in this case can be small
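
To make the two cases concrete, here is a minimal numerical sketch. Everything in it is an illustrative assumption rather than part of the argument above: $r$ is taken to be an equal mixture of $\mathcal{N}(-3, 1)$ and $\mathcal{N}(3, 1)$, $s$ is a single Gaussian $\mathcal{N}(\mu, \sigma^2)$, the integrals are approximated by sums on a grid, and the fit is a brute-force search over a small grid of $(\mu, \sigma)$ candidates. The helper names (`log_normal_pdf`, `kl_r_s`, `kl_s_r`) are made up for this example.

```python
import numpy as np

# Integration grid; wide enough that r has negligible mass outside it.
x = np.linspace(-15.0, 15.0, 3001)
dx = x[1] - x[0]

def log_normal_pdf(x, mu, sigma):
    """Log density of a Gaussian N(mu, sigma^2)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

# Assumed bimodal target r: equal mixture of N(-3, 1) and N(3, 1).
log_r = np.logaddexp(
    np.log(0.5) + log_normal_pdf(x, -3.0, 1.0),
    np.log(0.5) + log_normal_pdf(x, 3.0, 1.0),
)
r = np.exp(log_r)

def kl_r_s(mu, sigma):
    """Grid approximation of D_KL(r || s) = E_r[log(r / s)]."""
    log_s = log_normal_pdf(x, mu, sigma)
    return np.sum(r * (log_r - log_s)) * dx

def kl_s_r(mu, sigma):
    """Grid approximation of D_KL(s || r) = E_s[log(s / r)]."""
    log_s = log_normal_pdf(x, mu, sigma)
    return np.sum(np.exp(log_s) * (log_s - log_r)) * dx

# Brute-force search over candidate unimodal fits s = N(mu, sigma^2).
mus = np.linspace(-5.0, 5.0, 41)
sigmas = np.linspace(0.5, 5.0, 46)
candidates = [(mu, sigma) for mu in mus for sigma in sigmas]

mu_fwd, sigma_fwd = min(candidates, key=lambda p: kl_r_s(*p))
mu_rev, sigma_rev = min(candidates, key=lambda p: kl_s_r(*p))

print(f"min D_KL(r || s): mu={mu_fwd:.2f}, sigma={sigma_fwd:.2f}")
print(f"min D_KL(s || r): mu={mu_rev:.2f}, sigma={sigma_rev:.2f}")
```

Under these assumptions, minimizing $D_{KL}(r \| s)$ over Gaussians amounts to matching the mean and variance of $r$, so the first fit should land near $\mu = 0$ with $\sigma$ close to $\sqrt{10}$, a wide mode covering both of $r$'s modes. The second fit should instead sit on one of $r$'s modes, with $\mu$ near $-3$ or $3$ and $\sigma$ close to 1; which of the two it picks is arbitrary, since the target is symmetric.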