Why SoftMax? Full Bayesian filtering of arbitrary states. Is it possible to model dynamic targets/groundings with G^3? Is there a one-to-one mapping between our state-space representation and their factor graph representation? If you could do it, what would it require you to do? Which of their assumptions don't scale?

We're not focused on the grounding problem (they are); we're focused on providing perception, given groundings.

Extracting Meaning from Human Sensors

i. Preface

This document provides the rough, unformatted run-through of the paper being submitted to the Robotics Science and Systems 2015 Workshop on Model Learning for Human-Robot Communication.

ii. Big Picture

Answers to Heilmeier's Catechism:

ii.1 What are you trying to do? Articulate your objectives using absolutely no jargon.

We're developing a technique for humans to inform robots of the world around them. Put another way, we're developing a Bayesian framework for a (noisy) human sensor that the robot can use to understand its environment. We want to be able to plug this human sensor into existing perception algorithms.

ii.2 How is it done today, and what are the limits of current practice?

Most researchers focus on providing robots with a spatial understanding of the environment, but this doesn't capture many of the descriptions possible in human language. Moreover, they focus on control, so they're looking for a single best estimate.

3 types of uncertainty - humans are not oracles. We're looking at typical Bayesian uncertainty - sensor error/imprecision/accuracy, association error.

talk about cops and robots

Physical vs. abstract concepts

Scalability & implementation - dynamics? Larger space-time scales for a grid?

ii.3 What's new in your approach and why do you think it will be successful?

Our approach is more general than simply focusing on Euclidean space: we want to translate arbitrary state spaces, which include Euclidean space. A state could be position, velocity, heading, a target's tactics -- anything a human can sense.

This approach will be successful because it captures the richness of human language instead of focusing solely on prepositional relationships, allowing for far more information to be provided by a human sensor (as long as the robot has a model for how to use the state estimate).

Use GMMs instead - they scale far better than particle/grid models.

We want the human to do more than labeling - we want the human to express a broader understanding.

The human doesn't necessarily know all the answers, but can provide helpful (positive or negative) information.

Think of difficult environments for sensors - constrained computing, out on the road at night.

In the event of sensor failure, a human can step in to make perception more robust

A big difference: humans are used as multi-state sensors, not simply labelers of the world. We have the human in the loop. They focus on the human as an augmentation for planning; we focus on the human as an augmentation for sensing.

We need our model to work well within a Bayesian filter. Our focus is more on dynamics (aerospace - computation constraints are IMPORTANT).

estimation-friendly models for state-spatial language (sticking with codebook)

ii.4 Who cares?

We can break down the "Who cares?" question into three types of people interested in using humans as sensors to inform an autonomous system of its environment:

  • Researchers can use and build upon our model;
  • Developers can incorporate the use of human sensors to make their robots and autonomous systems take in a rich new spectrum of information;
  • End-users can use the products of the developers.

ii.5 If you're successful, what difference will it make?

The ability for humans to explain the world around them to robots is huge: robots are fantastic at intensely complicated and quick computation, but poor at developing deep understanding. Humans are the opposite.

The ability to tell the robot about aspects of its world is akin to telling a toddler about its world. Currently, we can only tell our toddler-bots about spatial relationships: the box is in a corner, the kitchen is down the hall from the office, etc.

Imagine how much more they'll learn if we can tell them about all the other concepts described in human language: height, texture, movement models, etc.

ii.6 What are the risks and the payoffs?

The risks: we might be wasting our time, AI is dangerous, or we might miss a better technique.

The payoff, if our model works well, is that we can take a big step towards conversational interaction with autonomous systems.

ii.7 How much will it cost?

For a system that already has humans to interact with, this will not cost any money. However, the cost of interaction is an interesting problem: humans intuitively know the rules of when and how to interact with each other (to a varying degree...). Robots don't.

If a robot is able to ask questions about its environment, how does it determine the right number of questions to ask? How does it understand when it has annoyed its teammate? This is one of our core research questions: what is the cost, in terms of a human operator's attention and willingness to answer, of a robot asking questions and providing information?

ii.8 How long will it take?

Depends on how complicated and exact it needs to be. We can propose models, or we can learn models from human experience. The first is simply the time needed to build up a library of the ways in which humans describe environments - the second would be an ongoing process of calibrating a human's meaning to a probabilistic understanding of the state space.

ii.9 What are the midterm and final "exams" to check for success?

We'd run experiments to determine:

  • How well does a human-robot team work with a human acting as a sensor?
  • Do some states work better than others? Are some more variable between humans?
  • Instead of re-learning all human models, can we effectively transfer 'calibration profiles' between humans?

iii. Outline

1st technical nugget: Generalized MMS with GMMs
2nd technical nugget: Extension to velocity/heading
3rd technical nugget: Learning

  1. how do you get the models?
    1. binary softmax
    2. deformable templates
  2. what properties of this model make it good for inference?
    1. dynamics
    2. motion models
  3. what properties of this model make it good for learning?
    1. GPs
    2. Population learning (priors based on hyperpriors shared among human sensors)
    3. Concept transfer (how are 'front' and 'nearby' related) & online learning (ICCPS)

Think about: key figures, demos (esp. for 1, and how it connects to 2 and 3). Start document.

  1. Introduction
    1. Ground Cops and Robots: what's in this problem that other related research can't solve?
  2. Related Work
  3. Generating SoftMax Models
    1. From Weights
    2. From Class Boundaries
    3. From Polytope Templates
    4. In n-dimensions
    5. Multimodal SoftMax
  4. Learning Spatial SoftMax Models
    1. From Collected Data
    2. With Prior Boundaries
    3. Using Symmetry to Minimize Data Collection
  5. Simulations
  6. Ongoing Work

iii.1 Key Equations

Chapman-Kolmogorov

Used for the dynamics prediction step in Bayes' filters (like the KF, UKF, EKF, PF, and GSF).

\begin{equation} p(\left. X_k \right| \zeta_{1:k-1},D_{1:k-1}) = \int p(\left. X_k \right| X_{k-1}) p(\left. X_{k-1} \right| \zeta_{1:k-1},D_{1:k-1}) \, dX_{k-1} \end{equation}

Bayes' Fusion (robot update)

Done before human update.

\begin{align} \begin{split} p(\left. X_k \right| \zeta_{1:k},D_{1:k-1}) & = \frac{p(\left. \zeta_k \right| X_k) p(\left. X_k \right| \zeta_{1:k-1},D_{1:k-1})} {\int p(\left. \zeta_k \right| X_k) p(\left. X_k \right| \zeta_{1:k-1},D_{1:k-1}) dX_k} \\ &\equiv p( X_k ) \end{split} \end{align}

Bayes' Fusion (human update)

The focus of our paper. We want to generate the likelihood model.

\begin{align} \begin{split} p(\left. X_k \right| \zeta_{1:k},D_{1:k}) &= \frac{p(\left. D_k \right| X_k) p(\left. X_k \right| \zeta_{1:k},D_{1:k-1})} {\int p(\left. D_k \right| X_k) p(\left. X_k \right| \zeta_{1:k},D_{1:k-1}) dX_k} \\ &\equiv p(\left. X_k \right| D_{k} ) \end{split} \end{align}
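As a concrete illustration, here is a minimal grid-based sketch of the prediction and human-update steps above, assuming a 1D state, a made-up random-walk motion model, and a placeholder logistic stand-in for the human likelihood (all hypothetical choices, not part of the paper's model):

```python
import numpy as np

# Hypothetical 1D grid over the state space
x = np.linspace(-10, 10, 401)
dx = x[1] - x[0]
prior = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)      # p(X_{k-1} | zeta_{1:k-1}, D_{1:k-1})

# Chapman-Kolmogorov prediction with a random-walk transition kernel
sigma_q = 0.5
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / sigma_q) ** 2)
K /= K.sum(axis=0, keepdims=True) * dx                # each column: p(X_k | X_{k-1} = x_j)
predicted = K @ (prior * dx)                          # integrate over X_{k-1}

# Human update with a placeholder likelihood p(D_k | X_k), e.g. "it's to the right"
likelihood = 1.0 / (1.0 + np.exp(-2.0 * x))
posterior = likelihood * predicted
posterior /= posterior.sum() * dx                     # normalize (denominator integral)
```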

Softmax Likelihood

Widely used. Well-suited for modeling hybrid continuous-to-discrete mappings (e.g. human utterances to probabilities over a continuous state space). Always leads to a complete convex decomposition of the state space, so the state can be fully partitioned.

Problems: softmax likelihoods do not (necessarily) lead to closed-form posteriors. Grid-based and particle approximations could work, but they scale poorly with state dimensionality, provide a cumbersome posterior, and, for grids, don't mesh easily with typical filters for hard sensor data. Gaussian mixtures via EM are prone to poor local maxima (?) and high computation costs.

\begin{equation} P(D_k=j \vert X_k = \mathbf{x}) = \frac{e^{\mathbf{w}_j^T \mathbf{x} + b_j}}{\sum_{h=1}^m e^{\mathbf{w}_h^T\mathbf{x} + b_h}} \end{equation}
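A minimal numerical sketch of this likelihood with made-up 2D weights; `scipy.special.logsumexp` is only there for numerical stability:

```python
import numpy as np
from scipy.special import logsumexp

def softmax_likelihood(x, W, b):
    """P(D_k = j | X_k = x) for every class j.

    x : (d,) state vector
    W : (m, d) matrix of class weights (rows are w_j)
    b : (m,) class biases
    """
    logits = W @ x + b
    return np.exp(logits - logsumexp(logits))   # numerically stable softmax

# Illustrative 2D example with four made-up classes ("left", "right", "up", "down")
W = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.zeros(4)
print(softmax_likelihood(np.array([2.0, 0.5]), W, b))
```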

Linear Hyperplane Class Boundaries

Could be critical or non-critical boundaries.

\begin{equation} L_{log}(i,j) = (\mathbf{w}_i - \mathbf{w}_j)^T\mathbf{x} + (b_i - b_j) = 0 \end{equation}

Shifted and Rotated Coordinate Frame

Should we consider a quaternion representation? What about generalized rotations?*

\begin{align} \begin{split} \mathbf{w}_i^\prime &= \mathbf{w}_i^T R(\theta) \\ b_i^\prime &= \mathbf{w}_i^T R(\theta) \mathbf{b} \end{split} \end{align}

Softmax Gradient

Defines the slope of a given class.

\begin{equation} \frac{\partial P(D_k = i \vert \mathbf{x})} {\partial \mathbf{x}} = P(D_k = i \vert \mathbf{x}) \left(\mathbf{w}_{i} - \sum_{h=1}^m \mathbf{w}_{h}P(D_k = h \vert \mathbf{x}) \right) \end{equation}
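A quick, self-contained finite-difference check of this gradient expression, again with made-up weights:

```python
import numpy as np
from scipy.special import logsumexp

def softmax_probs(x, W, b):
    logits = W @ x + b
    return np.exp(logits - logsumexp(logits))

def softmax_gradient(x, W, b, i):
    """Analytic gradient of P(D_k = i | x) w.r.t. x, per the expression above."""
    p = softmax_probs(x, W, b)
    return p[i] * (W[i] - W.T @ p)

# Finite-difference comparison at an arbitrary point, with made-up weights
W, b = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]), np.zeros(3)
x0, eps, i = np.array([0.3, -1.2]), 1e-6, 2
numeric = np.array([(softmax_probs(x0 + eps * e, W, b)[i]
                     - softmax_probs(x0 - eps * e, W, b)[i]) / (2 * eps)
                    for e in np.eye(2)])
assert np.allclose(numeric, softmax_gradient(x0, W, b, i), atol=1e-6)
```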

Weight-Normal Transform

General transformation:

\begin{equation} \mathbf{n} = \mathbf{A}\mathbf{w} \end{equation}

Or, for strict ordering and zero-weight reference class $i$:

\begin{equation} \mathbf{n}_{i,j} = \mathbf{w}_j \end{equation}

Rouché-Capelli Theorem

Used in conjunction with the rank check to ensure a proper $\mathbf{A}$ matrix.

\begin{equation} \mathrm{rank}\left(\mathbf{A}_{min}\right) = \mathrm{rank}\left(\left[\begin{array}{r|r} \mathbf{A}_{min} & \mathbf{n}_{min}\end{array}\right]\right) \end{equation}
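A minimal sketch of that rank check in NumPy, with `A_min` and `n_min` standing in for the minimal constraint matrix and normal vector:

```python
import numpy as np

def is_consistent(A_min, n_min):
    """Rouché-Capelli check: A_min w = n_min has a solution iff
    rank(A_min) == rank([A_min | n_min])."""
    augmented = np.hstack([A_min, n_min.reshape(-1, 1)])
    return np.linalg.matrix_rank(A_min) == np.linalg.matrix_rank(augmented)
```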

Polytopes to Weights

Sample points from polytopes to find exact weights. Use SVD or another matrix decomposition technique.

\begin{equation} \begin{bmatrix} \mathbf{x}_{i,1} & \mathbf{x}_{i,2} & \dots & \mathbf{x}_{i,n} & \mathbf{1} \\ \end{bmatrix} \begin{bmatrix} w_{i,1}\\ w_{i,2}\\ \vdots \\ b_i \end{bmatrix} =\begin{bmatrix} 0\\ 0\\ 0 \end{bmatrix} \end{equation}
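A minimal sketch of the null-space solve, assuming the sampled points lie on a single class boundary (the example line $x + y = 1$ is made up):

```python
import numpy as np

def weights_from_boundary_points(points):
    """Given points x on a class boundary (rows), solve [x 1][w; b] = 0
    for the weights w and bias b via the SVD null-space."""
    X = np.hstack([points, np.ones((points.shape[0], 1))])
    _, _, Vt = np.linalg.svd(X)
    w_b = Vt[-1]                  # right singular vector of the smallest singular value
    return w_b[:-1], w_b[-1]      # (w, b), defined up to scale

# Hypothetical boundary samples lying on the line x + y = 1
pts = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
w, b = weights_from_boundary_points(pts)
print(w, b)   # proportional to [1, 1] and -1
```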

Normal check

Sum of all normals must equal 0.

Multimodal Softmax Likelihood

Groups subclasses $\sigma(j)$ together into classes. No longer convex, though subclasses are. Boundaries or gradients shift, but not both. Possible to estimate MMS class weights from data (e.g. through symmetry).

\begin{equation} P(D_k=j \vert X_k = \mathbf{x}) = \frac{\sum\limits_{r \, \in \, \sigma(j)} e^{\mathbf{w}_r^T \mathbf{x} + b_r}} {\sum\limits_{c = 1}^S e^{\mathbf{w}_c^T\mathbf{x} + b_c}} \end{equation}
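A minimal sketch of the MMS likelihood, assuming the subclass weights are already known and using a hypothetical subclass-to-class map:

```python
import numpy as np
from scipy.special import logsumexp

def mms_likelihood(x, W, b, sigma):
    """P(D_k = j | x) for an MMS model.

    W, b  : weights and biases for all S subclasses
    sigma : dict mapping class label j -> list of its subclass indices
    """
    logits = W @ x + b
    log_Z = logsumexp(logits)                    # denominator over all subclasses
    return {j: np.exp(logsumexp(logits[idx]) - log_Z) for j, idx in sigma.items()}

# Hypothetical 1D example: subclasses 0 and 2 grouped into "near", subclass 1 is "far"
W, b = np.array([[-1.0], [0.0], [1.0]]), np.array([0.0, 1.0, 0.0])
sigma = {"near": [0, 2], "far": [1]}
print(mms_likelihood(np.array([0.5]), W, b, sigma))
```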

MMS Boundary Shaping

(Fixed boundaries vs. fixed gradients)

Take three arbitrary classes A, B, and C. We will sum A and C together to make superclass D:

\begin{align} L_{log}(B,D) &= \mathbf{w}_B^T \mathbf{x} + b_B - \ln{\left( e^{\mathbf{w}_A ^T \mathbf{x} + b_A} + e^{\mathbf{w}_C^T \mathbf{x} + b_C}\right)} = 0\\ L_{log}(B,A) &= (\mathbf{w}_B - \mathbf{w}_A)^T \mathbf{x}_{AB} + (b_B - b_A) = 0\\ L_{log}(B,C) &= (\mathbf{w}_B - \mathbf{w}_C)^T \mathbf{x}_{BC} + (b_B - b_C) = 0 \end{align}

where $\mathbf{x}_{AB}$ and $\mathbf{x}_{BC}$ are the solutions to the above equations for fixed weights (i.e. the locations of the boundaries in the state space). The superclass boundary $L_{log}(B,D)$ must be solved numerically.
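A minimal sketch of that numerical solve along a 1D slice of the state space, using made-up weights for A, B, and C and SciPy's `brentq` root finder:

```python
import numpy as np
from scipy.optimize import brentq

# Made-up 1D weights and biases for classes A, B, C (A and C form superclass D)
wA, bA = -1.0, 0.0
wB, bB = 0.0, 1.0
wC, bC = 1.0, 0.0

def L_log_BD(x):
    """L_log(B, D) for D = {A, C}; its root locates the B/D boundary."""
    return (wB * x + bB) - np.log(np.exp(wA * x + bA) + np.exp(wC * x + bC))

# Bracket and solve numerically for the boundary location on one side
x_BD = brentq(L_log_BD, -10.0, 0.0)
print(x_BD)
```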

Complete Multi-shape MMS Algorithm

Inputs:

  • Map object shapes + scales
  • Scaling factor for object interiors

Outputs: Multiple binary, convex, linearly/non-linearly separable probabilistic decompositions of the state space.

  1. Find normal vectors for all shapes (see the sketch after this list)
  2. Find independent shape weights (i.e. through learning - following section.)
  3. Use MMS boundary shaping on all shapes
  4. Account for object interiors

Note: the template concept applies to non-physical shapes (e.g. motion models for velocity states)
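A minimal sketch of step 1 for a single convex polygon: outward edge normals from a counter-clockwise vertex list (a hypothetical helper; the weight and boundary-shaping sketches above cover the later steps):

```python
import numpy as np

def edge_normals(vertices):
    """Outward unit normals for a convex polygon with counter-clockwise vertices."""
    verts = np.asarray(vertices, dtype=float)
    edges = np.roll(verts, -1, axis=0) - verts                 # edge vectors, one per side
    normals = np.stack([edges[:, 1], -edges[:, 0]], axis=1)    # rotate -90 deg: outward for CCW
    return normals / np.linalg.norm(normals, axis=1, keepdims=True)

# Unit square, counter-clockwise
print(edge_normals([[0, 0], [1, 0], [1, 1], [0, 1]]))
```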

iii.2 Examples

Speed example

State space: $s$

Non-MMS and MMS Pentagon examples (regular 2D polygon)

State space: $\begin{bmatrix} x & y \end{bmatrix}$

Football stadium example (curved 2D shape)

State space: $\begin{bmatrix} x & y & x^2 & xy & y^2 \end{bmatrix}$

Full room example (many objects, many barriers)

State space: $\begin{bmatrix} x & y & x^2 & xy & y^2 \end{bmatrix}$

Show near: clusters of objects, rooms

Video example for demo: "The ball is between three robots." as the robots are moving.

iv. Abstract

  • Keywords: human-robot dialog, spatial language modeling and interpretation, spatial-semantic mapping, semantic perception

1. Introduction

Questions this section answers:

  1. What's the importance of a physical understanding of the world?
  2. How does it apply to a human-robot team?
  3. How can humans convey a physical understanding of the world to the robot?
  4. What types of physical understanding can we convey? How do we convey them?
  5. What domains does this apply to? When would it be used?

2. Related Work

Questions this section answers:

  1. Who else has worked on communicating spatial understanding to a robot?
  2. How does our research differ from theirs?
  3. What have we worked on in the past, and how are we improving it?
  • Nisar's previous work
  • Nick Roy's Group (Tellex, Walter)
  • Jensfelt and Burgard
  • Dieter Fox, Cynthia Matuszek
  • Gärdenfors
  • Kuipers
  • Kaupp
  • Terry Regier (?)
  • Jamie Frost, Alastair Harrison
  • Marjorie Skubic (this?)

3. Generating SoftMax Models

Questions this section answers:

  1. What is a SoftMax Model and why does it best represent a spatial understanding?
  2. What different elements compose a SoftMax model, and what do they mean?
  3. What different ways can we construct a SoftMax model easily?

3.1. From Weights

Questions this section answers:

  1. What are the core insights we can get from SoftMax models?
  2. How is the slope of each class's probability derived and manipulated?
  3. How do we manipulate (i.e. rotate/translate) the model?

3.2. From Class Boundaries

Questions this section answers:

  1. How do we compute SoftMax weights from class boundaries?
  2. When *can't* we use class boundaries?

3.3. From Polytope Templates

Questions this section answers:

  1. How do we derive normals from Polytope templates?

3.4. In n-dimensions

Questions this section answers:

  1. What do 1D examples look like, and what purposes do they serve?
  2. Can we combine multiple states (e.g. position and velocity) into one SoftMax model?

3.5. Multimodal Softmax

Questions this section answers:

  1. Can we compose 'superclasses' from multiple SoftMax classes?
  2. Can we merge multiple state types (i.e. position and velocity) into a superclass?
  3. Can we fix the normal boundary problem using our classes as superclasses?

4. Learning SoftMax Models

Questions this section answers:

  1. Instead of specifying the SoftMax Models, can we learn them from user data?
  2. What use is collecting data from individuals? How different are they? Are humans good estimators?
  3. What language do humans use to express spatial representations? How many prepositions exist? Does their usage vary?
  4. Can we use negative information well?

4.1. From Collected Data

Questions this section answers:

  1. How do we learn distributions from experimental human data?

4.2. With Prior Boundaries

Questions this section answers:

  1. Given a grounding polytope, can we simply learn the gradients of the SoftMax classes?
  2. Can we represent the polytope boundaries probabilistically?

4.3. Using Symmetry to Minimize Data Collection

Questions this section answers:

  1. Can we minimize the amount of data needed to be collected by exploiting symmetry?
  2. Which prepositions or prepositional phrases imply symmetry?

5. Simulations

Questions this section answers:

  1. Is this SoftMax decomposition effective?
  2. Can it run in real-time?
  3. How did the robot perform on its own?
  4. How did the human-robot team perform when using positional statements? Velocity statements? Both?
  5. How useful was negative information?

6. Conclusion

Questions this section answers:

  1. What model did we introduce?
  2. What validation can we provide?
  3. What's next?
