Softmax from First Principles

Language barriers between humans and autonomous systems

If our goal is to help humans and autonomous systems communicate, we need to speak in a common language. Just as humans have developed verbal and written languages to communicate ideas, we have developed mathematical languages to communicate information. Probability is one of those languages and, thankfully for us, autonomous systems are quite good at describing probabilities, even if humans aren't. This document shows one technique for translating a human language (English) into a language known to autonomous systems (probability).

Our translator is something called the SoftMax classifier, a model that assigns a probability to each of a set of discrete labels at every point in a state space. We'll show you the details of how to build a SoftMax model, but let's get to the punchline first: we can decompose elements of human language into a partitioning of arbitrary state spaces.

Say, for instance, we'd like to specify the location of an object in two-dimensional Cartesian coordinates. Our state space is all combinations of x and y, and we'd like to translate human language into some probability that our target is at a given combination of x and y. One common tactic humans use to communicate position is range (near, far, next to, etc.) and bearing (North, South, Southeast, etc.). This already completely partitions our xy space: if something is north, it's not south; if it's east, it's not west; and so on.

A softmax model that translates range and bearing into probability in a state space is shown below:

Assuming that 'next to' doesn't require a range, we have seventeen different word combinations we can use to describe something's position: two ranges ('nearby' and 'far') for each of the eight cardinal and intercardinal directions, plus one extra label for 'next to'. This completely partitions our entire state space $\mathbb{R}^2$.

This range and bearing language is, by its nature, inexact. If I say, "That boat is far north," you don't have a deterministic notion of exactly where the boat is -- but you have a good sense of where it is, and where it is not. We can represent that sense probabilistically, such that the probability of a target being at a location described by a range and bearing label is nonzero over the entire state space, but very small outside the area most associated with that label.

What do we get from this probabilistic interpretation of the state space? We get a two-way translation between humans and autonomous systems to describe anything we'd like. If our state space is one-dimensional relative velocity (i.e. the derivative of range, without bearing), I can say, "She's moving really fast!", to give the autonomous system a probability distribution over my target's velocity with an expected value of, say, 4 m/s. Alternatively, if my autonomous system knows my target is moving at 0.04352 m/s, it can tell me, "Your target is moving slowly." Our labeled partitioning of the state space (that is, our classifier) is the mechanism that translates for us.

Softmax model construction

The SoftMax model goes by many names: normalized exponential, multinomial logistic function, log-linear model, sigmoidal function. We use the SoftMax function to develop a classification model for our state space:

$$ P(D=i \vert \mathbf{x}) = \frac{e^{\mathbf{w}_i^T \mathbf{x}}}{\sum_{k=1}^M e^{\mathbf{w}_k^T\mathbf{x}}} $$

where $D = i$ is our random variable of class labels instantiated as class $i$, $\mathbf{x}$ is our state vector, $\mathbf{w}_i$ is the vector of parameters (or weights) associated with class $i$, and $M$ is the total number of classes. The state vector $\mathbf{x}$ traditionally includes a constant bias term.

Note that a label is a set of words associated with a class (e.g., far northwest), whereas a class is a single probability distribution over the entire state space. The terms are sometimes used interchangeably.
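
Before moving on, here's what that equation looks like in code. This is a minimal NumPy sketch (a hypothetical helper written for this document, not the SoftMax class from the cops_and_robots package used in the examples below):

In [ ]:
import numpy as np

def softmax_probabilities(weights, x):
    """Return P(D = i | x) for every class i.

    weights : (M, n) array, one row of parameters w_i per class
    x       : (n,) state vector, including the constant bias term
    """
    scores = weights.dot(x)            # w_i^T x for each class
    scores = scores - scores.max()     # shift for numerical stability (doesn't change the result)
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()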

Several key factors come out of the SoftMax equation:

  • The probabilities of all classes for any given point $\mathbf{x}$ sum to 1 (a quick numerical check follows this list).
  • The probability of any single class for any given point $\mathbf{x}$ is bounded by 0 and 1.
  • The space can be partitioned into an arbitrary number of classes (with some restrictions about those classes - more on this later).
  • The probability of one class at a given point $\mathbf{x}$ is determined by the exponential of that class's weighted sum of the state vector, relative to the exponentiated weighted sums of all other classes.
  • Since the probability of a class is conditioned on $\mathbf{x}$, we can apply estimators such as Maximum Likelihood to learn SoftMax models.
  • The log-likelihood of a SoftMax model is concave in the weights, so Maximum Likelihood estimation is a convex optimization problem with no spurious local optima.
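
Here's the quick numerical check promised above: a plain-NumPy snippet (independent of the SoftMax class used later) confirming the first two properties for arbitrary weights and an arbitrary state:

In [ ]:
import numpy as np

# Spot-check: SoftMax probabilities sum to 1 and each lies in [0, 1],
# for arbitrary weights and an arbitrary state (random values used here)
np.random.seed(0)
W = np.random.randn(5, 3)   # 5 classes, 3 state dimensions (including the bias)
x = np.random.randn(3)

p = np.exp(W.dot(x)) / np.exp(W.dot(x)).sum()
print(p.sum())                        # 1.0 (up to floating-point error)
print(((p >= 0) & (p <= 1)).all())    # True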

For any two classes, we can take the ratio of their probabilities to determine the odds of one class instead of the other:

$$ L(i,j) =\frac{P(D=i \vert \mathbf{x})}{P(D=j \vert \mathbf{x})} = \frac{\frac{e^{\mathbf{w}_i^T \mathbf{x}}}{\sum_{k=1}^M e^{\mathbf{w}_k^T\mathbf{x}}}}{\frac{e^{\mathbf{w}_j^T \mathbf{x}}}{\sum_{k=1}^M e^{\mathbf{w}_k^T\mathbf{x}}}} = \frac{e^{\mathbf{w}_i^T \mathbf{x}}}{e^{\mathbf{w}_j^T\mathbf{x}}} $$

When $L(i,j)=1$, the two classes have equal probability. This doesn't give us a whole lot of insight until we take the log-odds (the logarithm of the odds):

$$ L_{log}(i,j) = \log{\frac{P(D=i \vert \mathbf{x})}{P(D=j \vert \mathbf{x})}} = \log{\frac{e^{\mathbf{w}_i^T \mathbf{x}}}{e^{\mathbf{w}_j^T\mathbf{x}}}} = \mathbf{w}_i^T\mathbf{x} - \mathbf{w}_j^T\mathbf{x} = (\mathbf{w}_i - \mathbf{w}_j)^T\mathbf{x} $$

When $L_{log}(i,j) = \log{L(i,j)} = \log{1} = 0$, the two classes have equal probability, and we've also stumbled upon the equation of an affine hyperplane dividing the two classes:

$$ \begin{align} 0 &= (\mathbf{w}_i - \mathbf{w}_j)^T\mathbf{x} \\ &= (w_{i,x_1} - w_{j,x_1})x_1 + (w_{i,x_2} - w_{j,x_2})x_2 + \dots + (w_{i,x_n} - w_{j,x_n})x_n \end{align} $$

This matches the general form of an affine hyperplane (a flat surface of dimension $n-1$ embedded in an $n$-dimensional space):

$$ a_1x_1 + a_2x_2 + \dots + a_nx_n - b = 0 $$

This gives us a general formula for class boundaries -- that is, we can specify the boundaries directly, rather than specifying the weights that lead to those boundaries.
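
As a concrete two-class illustration (again a plain-NumPy sketch, not the SoftMax class used below), any point lying on the hyperplane $(\mathbf{w}_i - \mathbf{w}_j)^T\mathbf{x} = 0$ receives equal probability from both classes:

In [ ]:
import numpy as np

w_i = np.array([0., 1., 1.])     # class i weights [bias, x, y]
w_j = np.array([0., 1., -1.])    # class j weights

# w_i - w_j = [0, 0, 2], so the boundary 2y = 0 is the line y = 0
x_on_boundary = np.array([1., 3., 0.])    # state [1, x, y] with y = 0

scores = np.array([w_i.dot(x_on_boundary), w_j.dot(x_on_boundary)])
probs = np.exp(scores) / np.exp(scores).sum()
print(probs)    # [0.5, 0.5] -- equal probability on the boundary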

Example

Let's take a step back and look at an example. Suppose I'm playing Pac-Man, and I want to warn our eponymous hero of a ghost approaching him. Let's restrict my language to the four intercardinal directions: NE, SE, SW and NW. My state vector is $\mathbf{x} = \begin{bmatrix}1 & x & y\end{bmatrix}^T$, which includes a bias term and one term for each Cartesian coordinate in $\mathbb{R}^2$.

In this simple problem, we can expect our weights to be something along the lines of:

$$ \begin{align} \mathbf{w}_{NE} &= \begin{bmatrix}0 & 1 & 1 \end{bmatrix}^T \\ \mathbf{w}_{SE} &= \begin{bmatrix}0 & 1 & -1 \end{bmatrix}^T \\ \mathbf{w}_{SW} &= \begin{bmatrix}0 & -1 & -1 \end{bmatrix}^T \\ \mathbf{w}_{NW} &= \begin{bmatrix}0 & -1 & 1 \end{bmatrix}^T \end{align} $$

If we run these weights in our SoftMax model, we get the following results:


In [46]:
import numpy as np
from cops_and_robots.robo_tools.fusion.softmax import SoftMax
%matplotlib inline

labels = ['SW','NW','SE','NE']
# One row of weights [bias, x, y] per class, in the same order as the labels
weights = np.array([[0, -1, -1],
                    [0, -1, 1],
                    [0, 1, -1],
                    [0, 1, 1],
                   ])
pacman = SoftMax(weights,class_labels=labels)
pacman.plot(title='Pac-Man Bearing Model')


This is along the right path, but it needs to be shifted down to Pac-Man's location. Say Pac-Man is about a quarter of the map south of the center point; assuming a $10m \times 10m$ space, we can update our model accordingly:

$$ \begin{align} \mathbf{w}_{NE} &= \begin{bmatrix}2.5 & 1 & 1 \end{bmatrix}^T \\ \mathbf{w}_{SE} &= \begin{bmatrix}-2.5 & 1 & -1 \end{bmatrix}^T \\ \mathbf{w}_{SW} &= \begin{bmatrix}-2.5 & -1 & -1 \end{bmatrix}^T \\ \mathbf{w}_{NW} &= \begin{bmatrix}2.5 & -1 & 1 \end{bmatrix}^T \end{align} $$

In [47]:
weights = np.array([[-2.5, -1, -1],
                    [2.5, -1, 1],
                    [-2.5, 1, -1],
                    [2.5, 1, 1],
                   ])
pacman = SoftMax(weights,class_labels=labels)
pacman.plot(title='Pac-Man Bearing Model')


Looking good! Note that we'd get the same answer had we used the following weights:

$$ \begin{align} \mathbf{w}_{NE} &= \begin{bmatrix}0 & 1 & 1 \end{bmatrix}^T \\ \mathbf{w}_{SE} &= \begin{bmatrix}-5 & 1 & -1 \end{bmatrix}^T \\ \mathbf{w}_{SW} &= \begin{bmatrix}-5 & -1 & -1 \end{bmatrix}^T \\ \mathbf{w}_{NW} &= \begin{bmatrix}0 & -1 & 1 \end{bmatrix}^T \end{align} $$

This is because the class boundaries (and the probabilities themselves) depend only on the relative differences between the weights: adding the same vector to every class's weights leaves the SoftMax distribution unchanged.
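
Here's a quick numerical check of that claim (a plain-NumPy sketch, separate from the SoftMax class used above): shifting every class's weights by the same vector leaves the probabilities untouched.

In [ ]:
import numpy as np

def softmax(W, x):
    s = np.exp(W.dot(x))
    return s / s.sum()

# The shifted Pac-Man weights (rows: SW, NW, SE, NE) and the equivalent set
# with 2.5 subtracted from every bias term
W1 = np.array([[-2.5, -1, -1], [2.5, -1, 1], [-2.5, 1, -1], [2.5, 1, 1]], dtype=float)
W2 = W1 - np.array([2.5, 0, 0])

x = np.array([1., 3., -2.])    # an arbitrary state [1, x, y]
print(np.allclose(softmax(W1, x), softmax(W2, x)))    # True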

One other thing we can illustrate with this example: how would the SoftMax model shift if we multiplied all our weights by 10? When we use:

$$ \begin{align} \mathbf{w}_{NE} &= \begin{bmatrix}25 & 10 & 10 \end{bmatrix}^T \\ \mathbf{w}_{SE} &= \begin{bmatrix}-25 & 10 & -10 \end{bmatrix}^T \\ \mathbf{w}_{SW} &= \begin{bmatrix}-25 & -10 & -10 \end{bmatrix}^T \\ \mathbf{w}_{NW} &= \begin{bmatrix}25 & -10 & 10 \end{bmatrix}^T \end{align} $$

We get:


In [48]:
weights = np.array([[-25, -10, -10],
                    [25, -10, 10],
                    [-25, 10, -10],
                    [25, 10, 10],
                   ])
pacman = SoftMax(weights,class_labels=labels)
pacman.plot(title='Pac-Man Bearing Model')


Why does this increase in slope happen? We'll let you ponder this question for now and get back to it later.

Specification from Normals

We've seen how we can create a simple SoftMax distribution by specifying each class's parameters, but we also saw that the log-odds between two classes define a hyperplane on which the two classes are equally probable. If we wanted to chop up the state space, couldn't we simply specify those hyperplanes instead?

First, we need to investigate the relationships between the weights and the normals. Take our first attempt at the intercardinal bearing Pac-Man problem: we have four classes, and each class shares a boundary with another class over an equiprobable region. In the case of NE and SE, for instance, we have a line dividing north from south as our boundary. In the case of NE and SW, we have a point at the cardinal center as our boundary. Let's see if we can calculate these.

Recall our weights:

$$ \begin{align} \mathbf{w}_{NE} &= \begin{bmatrix}0 & 1 & 1 \end{bmatrix}^T \\ \mathbf{w}_{SE} &= \begin{bmatrix}0 & 1 & -1 \end{bmatrix}^T \\ \mathbf{w}_{SW} &= \begin{bmatrix}0 & -1 & -1 \end{bmatrix}^T \\ \mathbf{w}_{NW} &= \begin{bmatrix}0 & -1 & 1 \end{bmatrix}^T \end{align} $$

And the definition of our class boundaries:

$$ \begin{align} 0 &= (\mathbf{w}_i - \mathbf{w}_j)^T\mathbf{x} \\ &= (w_{i,x_1} - w_{j,x_1})x_1 + (w_{i,x_2} - w_{j,x_2})x_2 + \dots + (w_{i,x_n} - w_{j,x_n})x_n \end{align} $$

From the equation of a plane, recall that the normal vector can be taken from the coefficients for the non-constant terms. We can define a normal vector for each class boundary:

$$ \begin{align} \mathbf{n}_{NW,NE} &= (\mathbf{w}_{NE} - \mathbf{w}_{NW})^T\mathbf{x} \\ &= (0 - 0) + (1 + 1)x + (1 - 1)y = 2x \\ \mathbf{n}_{NE,SE} &= (0 - 0) + (1 - 1)x + (-1 - 1)y = -2y \\ \mathbf{n}_{SE,SW} &= (0 - 0) + (-1 - 1)x + (-1 - 1)y = -2x \\ \mathbf{n}_{SW,NW} &= (0 - 0) + (-1 + 1)x + (1 + 1)y = 2y \end{align} $$

But these aren't the only class boundaries -- diagonally positioned classes also have normals:

$$ \begin{align} \mathbf{n}_{NW,SE} &= (0 - 0) + (1 + 1)x + (-1 - 1)y = 2x - 2y \\ \mathbf{n}_{NE,SW} &= (0 - 0) + (-1 - 1)x + (-1 - 1)y = -2x -2y \\ \end{align} $$
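
We can also compute these boundary coefficients programmatically. The small sketch below (plain NumPy and itertools, separate from the SoftMax class) forms $\mathbf{w}_j - \mathbf{w}_i$ for every pair of classes; note that each normal's sign flips if the pair order is swapped, which is why a couple of signs differ from the hand-derived versions above.

In [ ]:
import numpy as np
from itertools import combinations

w = {'NE': np.array([0, 1, 1]),
     'SE': np.array([0, 1, -1]),
     'SW': np.array([0, -1, -1]),
     'NW': np.array([0, -1, 1])}

for i, j in combinations(['NE', 'SE', 'SW', 'NW'], 2):
    diff = w[j] - w[i]    # coefficients [bias, x, y] of the boundary between i and j
    print('n_{%s,%s}: %+d x %+d y' % (i, j, diff[1], diff[2]))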

So we have a map of the normal vectors:


In [111]:
import matplotlib.pyplot as plt 
x = np.arange(-5,5,0.1)
n_NWNE = 0  # Vertical Line
n_NESE = 0 * x # Horizontal Line
n_SESW = 0 # Vertical Line
n_SWNW = 0 * x # Horizontal Line
n_NWSE = -x
n_SWNE = x

plt.plot(x, n_NESE, 'g-', label="NESE", lw=3)
plt.plot(x, n_SWNW, 'y--', label="SWNW", lw=3)
plt.plot(x, n_NWSE, 'k-', label="NWSE", lw=3)
plt.plot(x, n_SWNE, 'r-', label="SWNE", lw=3)
plt.axvline(color='blue', label="NWNE", lw=3)
plt.axvline(color='pink', ls='--', label="SESW", lw=3)

plt.grid()
plt.xlabel('x [m]')
plt.ylabel('y [m]')
plt.xlim([-5, 5])
plt.ylim([-5, 5])
plt.title('Normals to Intercardinal Spaces')
plt.legend(loc='lower center', bbox_to_anchor=(-0.2, -0.175, 1.4, -0.075),
            mode='expand', borderaxespad=0., ncol=6)


Out[111]:
<matplotlib.legend.Legend at 0x1169d5f50>

Using this visualization as a guide, we notice that the sum of all normal vectors is zero in each dimension. Will this always be true? We'll hold off on generalizing that observation for now.

Example

In [60]:
# Three classes specified directly by weights: the central class (weights [2, 0, 0])
# dominates the band between the parallel boundaries y = x - 2 and y = x + 2
weights = np.array([[0, -1, 1],
                    [2, 0, 0],
                    [0, 1, -1],
                   ])
simple = SoftMax(weights)
simple.plot()


Learning weights from data

Example

Prior class boundaries

Example

Using symmetry to develop models from sparse data

Examples





In [40]:
from IPython.core.display import HTML

# Borrowed style from Probabilistic Programming and Bayesian Methods for Hackers
def css_styling():
    styles = open("../styles/custom.css", "r").read()
    return HTML(styles)
css_styling()

