Why SoftMax? Full Bayesian filtering of arbitrary states. Is it possible to model dynamic targets/groundings with G^3? Is there a one-to-one mapping between our state-space representation and the factor graph representation? If you could do it, what would it require? Which of their assumptions don't scale?
We're not focused on the grounding problem - they are. We're focused on providing perception, given groundings.
Extracting Meaning from Human Sensors
This document provides the rough, unformatted run-through of the paper being submitted to the Robotics Science and Systems 2015 Workshop on Model Learning for Human-Robot Communication.
Answers to Heilmeier's Catechism:
We're developing a technique for humans to inform robots of the world around them. Put another way, we're developing a Bayesian framework for a (noisy) human sensor that the robot can use to understand its environment. We want to be able to plug this human sensor into existing perception algorithms.
Most researchers focus on providing robots with a spatial understanding of the environment, but this doesn't allow for many of the descriptions in human language. Moreover, they focus on control, and so they're looking for a single best estimate.
3 types of uncertainty - humans are not oracles. We're looking at typical Bayesian uncertainty - sensor error/imprecision/accuracy, association error.
talk about cops and robots
Physical vs. abstract concepts
scalability & implementation - dynamics? larger space-time scales for a grid?
Our approach is more general than simply focusing on Euclidean space: we want to translate an arbitrary state space, which includes Euclidean space. A state could be position, velocity, heading, a target's tactics -- anything a human can sense.
This approach will be successful because it captures the richness of human language instead of focusing solely on prepositional relationships, allowing for far more information to be provided by a human sensor (as long as the robot has a model for how to use the state estimate).
use GMMs instead - they scale far better than particles/grid models
we want the human to do more than labeling - we want the human to express a broader understanding
human doesn't necessarily know all the answers, but can provide helpful (positive or negative) information
think of difficult environments for sensors - constrained computing, out on the road at night
In the event of sensor failure, a human can step in to make perception more robust
big difference because humans are used as multi-state sensors, not simply labelers of the world. we have the human in the loop. they focus on human as an augmentation for planning, we focus on human as an augmentation for sensing.
we need our model to work well within a Bayesian filter. our focus is more on dynamics (aerospace - computation constraints are IMPORTANT).
estimation-friendly models for state-spatial language (sticking with codebook)
We can break down the "Who cares?" question into three types of people interested in using humans as sensors to inform an autonomous system of its environment:
The ability for humans to explain the world around them to robots is huge: robots are fantastic at intensely complicated and quick computation, but poor on developing deep understanding. Humans are the opposite.
The ability to tell the robot about aspects of its world is akin to telling a toddler about its world. Currently, we can only tell our toddler-bots about spatial relationships: the box is in a corner, the kitchen is down the hall from the office, etc.
Imagine how much more they'll learn if we can tell them about all the rest of the concepts described by human language: height, texture, movement models, etc.
We might be wasting our time/AI is dangerous/We might miss a better technique.
The payoff, if our model works well, is that we can take a big step towards conversational interaction with autonomous systems.
For a system that already has humans to interact with, this will not cost any money. However, the cost of interaction is an interesting problem: humans intuitively know the rules of when and how to interact with each other (to a varying degree...). Robots don't.
If a robot is able to ask questions about its environment, how does it determine the right number of questions to ask? How does it understand when it has annoyed its teammate? This is one of our core research questions: what is the cost, in terms of a human operator's attention and willingness to answer, of a robot asking questions and providing information?
Depends on how complicated and exact it needs to be. We can propose models, or we can learn models from human experience. The first requires only the time to build up a library of the ways in which humans describe environments; the second would be an ongoing process of calibrating a human's meaning to a probabilistic understanding of the state space.
We'd run experiments to determine:
1st technical nugget: generalized MMS with GMMs
2nd technical nugget: extension to velocity/heading
3rd technical nugget: learning
Think about: key figures, demos (esp. for 1, and how it connects to 2 and 3). Start document.
Used for the dynamics prediction step in Bayes' filters (e.g. KF, UKF, EKF, PF, GSF). \begin{equation} p(\left. X_k \right| \zeta_{1:k-1},D_{1:k-1}) = \int p(\left. X_k \right| X_{k-1}) p(\left. X_{k-1} \right| \zeta_{1:k-1},D_{1:k-1}) dX_{k-1} \end{equation}
Done before human update.
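A minimal sketch of this prediction step, assuming linear-Gaussian dynamics applied componentwise to a Gaussian mixture prior (the function name and matrices below are illustrative, not the paper's implementation):

import numpy as np

def gm_predict(weights, means, covs, F, Q):
    # Propagate each mixture component through x_k = F x_{k-1} + v, with v ~ N(0, Q)
    pred_means = [F @ m for m in means]
    pred_covs = [F @ P @ F.T + Q for P in covs]
    return weights, pred_means, pred_covs

# Example: 2D position state with a random-walk motion model
F = np.eye(2)
Q = 0.1 * np.eye(2)
w, mu, S = gm_predict([0.6, 0.4], [np.zeros(2), np.ones(2)], [np.eye(2), 2 * np.eye(2)], F, Q)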
\begin{align} \begin{split} p(\left. X_k \right| \zeta_{1:k},D_{1:k-1}) & = \frac{p(\left. \zeta_k \right| X_k) p(\left. X_k \right| \zeta_{1:k-1},D_{1:k-1})} {\int p(\left. \zeta_k \right| X_k) p(\left. X_k \right| \zeta_{1:k-1},D_{1:k-1}) dX_k} \\ &\equiv p( X_k ) \end{split} \end{align}The focus of our paper. We want to generate the likelihood model. \begin{align} \begin{split} p(\left. X_k \right| \zeta_{1:k},D_{1:k}) &= \frac{p(\left. D_k \right| X_k) p(\left. X_k \right| \zeta_{1:k},D_{1:k-1})} {\int p(\left. D_k \right| X_k) p(\left. X_k \right| \zeta_{1:k},D_{1:k-1}) dX_k} \\ &\equiv p(\left. X_k \right| D_{k} ) \end{split} \end{align}
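For reference, a minimal 1D grid sketch of this update (posterior proportional to likelihood times prior, then renormalized); this is exactly the kind of discretization the rest of the paper tries to avoid, and the likelihood below is only a placeholder sigmoid:

import numpy as np

# Discretize the state, multiply the prior by the human-observation likelihood, renormalize
x = np.linspace(-5, 5, 401)
prior = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)       # N(0, 1) prior p(x_k | zeta_{1:k}, D_{1:k-1})
likelihood = 1.0 / (1.0 + np.exp(-(2.0 * x - 1.0)))    # placeholder P(D_k | x_k)
posterior = likelihood * prior
posterior /= np.trapz(posterior, x)                    # normalize by the denominator integral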
Widely used. Well-suited for modeling hybrid continuous-to-discrete mappings (e.g. human utterances to continuous state-space probabilities). Always leads to a complete convex decomposition of the state space, so the state can be fully partitioned.
Problems: do not (necessarily) lead to closed-form posteriors. Thus, grid-based and particle approximations could work, but they scale poorly with state dimensionality, provide a cumbersome posterior, and, for grids, don't mesh easily with typical filters for hard sensor data. Gaussian mixtures via EM are prone to poor local maxima (?) and high computational cost.
\begin{equation} P(D_k=j \vert X_k = \mathbf{x}) = \frac{e^{\mathbf{w}_j^T \mathbf{x} + b_j}}{\sum_{h=1}^m e^{\mathbf{w}_h^T\mathbf{x} + b_h}} \end{equation}
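A small, numerically stable sketch of evaluating this likelihood at a state $\mathbf{x}$ (the weight matrix and biases below are arbitrary placeholders):

import numpy as np

def softmax_likelihood(x, W, b):
    # P(D_k = j | X_k = x) for all m classes; W is an (m x n) weight matrix, b an (m,) bias vector
    logits = W @ x + b
    p = np.exp(logits - logits.max())   # subtract the max for numerical stability
    return p / p.sum()

# Example: four classes over a 2D state
W = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.zeros(4)
print(softmax_likelihood(np.array([0.5, -0.2]), W, b))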
Could be critical or non-critical boundaries. \begin{equation} L_{log}(i,j) = (\mathbf{w}_i - \mathbf{w}_j)^T\mathbf{x} + (b_i - b_j) = 0 \end{equation}Should we consider a quaternion representation? What about generalized rotations?*
\begin{align} \begin{split} \mathbf{w}_i^\prime &= \mathbf{w}_i^T R(\theta) \\ b_i^\prime &= \mathbf{w}_i^T R(\theta) \mathbf{b} \end{split} \end{align}Defines the slope of a given class.
\begin{equation} \frac{\partial P(D_k = i \vert \mathbf{x})} {\partial \mathbf{x}} = P(D_k = i \vert \mathbf{x}) \left(\mathbf{w}_{i} - \sum_{h=1}^m \mathbf{w}_{h}P(D_k = h \vert \mathbf{x}) \right) \end{equation}
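As a quick sanity check on this expression, a sketch comparing the analytic gradient against a central finite difference (a small softmax helper is redefined so the snippet stands alone; all weights are illustrative):

import numpy as np

def softmax_probs(x, W, b):
    logits = W @ x + b
    p = np.exp(logits - logits.max())
    return p / p.sum()

def softmax_gradient(x, W, b, i):
    # Analytic gradient: P(i|x) * (w_i - sum_h w_h P(h|x))
    p = softmax_probs(x, W, b)
    return p[i] * (W[i] - W.T @ p)

W = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]])
b = np.zeros(3)
x = np.array([0.3, 0.7])
eps = 1e-6
fd = np.array([(softmax_probs(x + eps * e, W, b)[2] - softmax_probs(x - eps * e, W, b)[2]) / (2 * eps)
               for e in np.eye(2)])
print(softmax_gradient(x, W, b, 2), fd)   # the two should agree to roughly 1e-6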
General transformation: \begin{equation} \mathbf{n} = \mathbf{A}\mathbf{w} \end{equation}Or, for strict ordering and zero-weight reference class $i$:
\begin{equation} \mathbf{n}_{i,j} = \mathbf{w}_j \end{equation}Used in conjunction with the rank check to ensure proper A matrix. \begin{equation} rank\left(\mathbf{A}_{min}\right) = rank\left(\left[\begin{array}{r|r} \mathbf{A}_{min} & \mathbf{n}_{min}\end{array}\right]\right) \end{equation}
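A small sketch of that rank check (the standard solvability test for $\mathbf{A}_{min}\mathbf{w} = \mathbf{n}_{min}$); the array names here are placeholders:

import numpy as np

def normals_consistent(A_min, n_min):
    # A_min w = n_min is solvable iff rank(A_min) equals the rank of the augmented matrix [A_min | n_min]
    augmented = np.column_stack([A_min, n_min])
    return np.linalg.matrix_rank(A_min) == np.linalg.matrix_rank(augmented)

# Example: a consistent two-constraint system in 2D
print(normals_consistent(np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([1.0, 2.0])))   # True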
Sample points from polytopes to find exact weights. Use SVD or other matrix decomposition technique.
\begin{equation} \begin{bmatrix} \mathbf{x}_{i,1} & \mathbf{x}_{i,2} & \dots & \mathbf{x}_{i,n} & \mathbf{1} \\ \end{bmatrix} \begin{bmatrix} w_{i,1}\\ w_{i,2}\\ \vdots \\ b_i \end{bmatrix} =\begin{bmatrix} 0\\ 0\\ 0 \end{bmatrix} \end{equation}Sum of all normals must equal 0.
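An illustrative sketch of that point-sampling step: stack sampled boundary points with an appended 1 and take the SVD null space to recover $[\mathbf{w}_i;\, b_i]$ up to scale (the polytope sampling itself is assumed, not shown):

import numpy as np

def weights_from_boundary_points(points):
    # Solve [x^T 1][w; b] = 0 for all sampled boundary points x, via the SVD null space
    A = np.hstack([points, np.ones((points.shape[0], 1))])
    _, _, Vt = np.linalg.svd(A)
    null = Vt[-1]                  # right singular vector with the smallest singular value
    return null[:-1], null[-1]     # (w, b), defined up to a common scale and sign

# Example: points sampled from the boundary x + y - 1 = 0
pts = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])
w, b = weights_from_boundary_points(pts)
print(w, b)                        # proportional to [1, 1] and -1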
Groups subclasses $\sigma(j)$ together into classes. No longer convex, though subclasses are. Boundaries or gradients shift, but not both. Possible to estimate MMS class weights from data (e.g. through symmetry).
\begin{equation} P(D_k=j \vert X_k = \mathbf{x}) = \frac{\sum\limits_{r \, \in \, \sigma(j)} e^{\mathbf{w}_r^T \mathbf{x} + b_r}} {\sum\limits_{c = 1}^S e^{\mathbf{w}_c^T\mathbf{x} + b_c}} \end{equation}(Fixed boundaries vs. fixed gradients)
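A minimal sketch of evaluating this MMS likelihood, where sigma maps each class label to the indices of its subclasses (all weights and labels below are placeholders):

import numpy as np

def mms_likelihood(x, W, b, sigma):
    # P(D_k = j | x): sum the exponentiated subclass logits in sigma[j], divide by the sum over all S subclasses
    logits = W @ x + b
    e = np.exp(logits - logits.max())
    total = e.sum()
    return {label: e[idx].sum() / total for label, idx in sigma.items()}

# Example: four subclasses grouped into two (non-convex) classes
W = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.zeros(4)
sigma = {"near": [0, 1], "far": [2, 3]}
print(mms_likelihood(np.array([0.2, -0.4]), W, b, sigma))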
Take three arbitrary classes $A$, $B$, and $C$. We will sum $A$ and $C$ together to make superclass $D$:
Where $x_{AB}$ and $x_{BC}$ are the solutions to the above equations for fixed weights (i.e. the locations of the boundary in the state space). Must be solved numerically.
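One way to do that numerical solve is a scalar root find on the difference between two class probabilities; a 1D sketch with made-up weights (scipy's brentq is assumed available):

import numpy as np
from scipy.optimize import brentq

def class_probs(x, w, b):
    # Softmax class probabilities for a scalar state x
    logits = w * x + b
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Illustrative 1D weights and biases for classes A, B, C
w = np.array([-2.0, 0.0, 2.0])
b = np.array([0.0, 1.0, 0.0])

# x_AB: where P(A|x) = P(B|x); x_BC: where P(B|x) = P(C|x)
x_AB = brentq(lambda x: class_probs(x, w, b)[0] - class_probs(x, w, b)[1], -10.0, 0.0)
x_BC = brentq(lambda x: class_probs(x, w, b)[1] - class_probs(x, w, b)[2], 0.0, 10.0)
print(x_AB, x_BC)   # -0.5 and 0.5 for these weights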
Inputs: map object shapes and scales; scaling factor for object interiors
Outputs: Multiple binary, convex, linearly/non-linearly separable probabilistic decompositions of the state space.
Note: template concept applies to non-physical shapes (e.g. motion models for velocity states)
State space: $s$
State space: $\begin{bmatrix} x & y \end{bmatrix}$
State space: $\begin{bmatrix} x & y & x^2 & xy & y^2 \end{bmatrix}$
State space: $\begin{bmatrix} x & y & x^2 & xy & y^2 \end{bmatrix}$
Show near: clusters of objects, rooms
Video example for demo: "The ball is between three robots." as the robots are moving.