Why SoftMax? Full Bayesian filtering of arbitrary states. Is it possible to model dynamic targets/groundings with G^3? Is there a one-to-one mapping between our state-space representation and the factor graph representation? If you could do it, what would it require you to do? Which of their assumptions don't scale?

we're not focused on the grounding problem - they are. we're focused on providing perception, given groundings

Extracting Meaning from Human Sensors

This document provides the rough, unformatted run-through of the paper being submitted to the Robotics: Science and Systems 2015 Workshop on Model Learning for Human-Robot Communication.

Answers to Heilmeier's Catechism:

We're developing a technique for humans to inform robots of the world around them. Put another way, we're developing a Bayesian framework for a (noisy) *human sensor* that the robot can use to understand its environment. We want to be able to plug this human sensor into existing perception algorithms.

Most researchers focus on providing robots with a spatial understanding of the environment, but this doesn't allow for many of the descriptions in human language. Moreover, they focus on control, and so they're looking for a single best estimate.

Three types of uncertainty - humans are not oracles. We're looking at typical Bayesian uncertainty: sensor error/imprecision/accuracy and association error.

talk about cops and robots

Physical vs. abstract concepts

scalability & implementation - dynamics? larger space-time scales for a grid?

Our approach is more general than simply focusing on Euclidean space: we want to translate arbitrary *state space*, which includes Euclidean space. A state could be position, velocity, heading, a target's tactics -- anything a human can sense.

This approach will be successful because it captures the richness of human language instead of focusing solely on prepositional relationships, allowing for far more information to be provided by a human sensor (as long as the robot has a model for how to *use* the state estimate).

use GMMs instead - they scale far better than particles/grid models

we want the human to do more than labeling - we want the human to express a broader understanding

human doesn't necessarily know all the answers, but can provide helpful (positive or negative) information

think of difficult environments for sensors - constrained computing, out on the road at night

In the event of sensor failure, a human can step in to make perception more robust

big difference because humans are used as multi-state sensors, not simply labelers of the world. we have the human in the loop. they focus on human as an augmentation for planning, we focus on human as an augmentation for sensing.

we need our model to work well within a Bayesian filter. our focus is more on *dynamics* (*aerospace* - computation constraints are IMPORTANT).

estimation-friendly models for state-spatial language (sticking with codebook)

We can break down the, "Who cares?" question into three types of people interested in using humans as sensors to inform an autonomous system of its environment:

- *Researchers* can use and build upon our model;
- *Developers* can incorporate human sensors so that their robots and autonomous systems take in a rich new spectrum of information;
- *End-users* can use the products of the developers.

The ability for humans to explain the world around them to robots is huge: robots are fantastic at intensely complicated and quick computation, but poor at developing deep understanding. Humans are the opposite.

The ability to tell the robot about aspects of its world is akin to telling a toddler about its world. Currently, we can only tell our toddler-bots about spatial relationships: the box is in a corner, the kitchen is down the hall from the office, etc.

Imagine how much more they'll learn if we can tell them about all the rest of the concepts described in human language: height, texture, movement models, etc.

Risks: we might be wasting our time; AI is dangerous; we might miss a better technique.

The payoff, if our model works well, is that we can take a big step towards conversational interaction with autonomous systems.

For a system that already has humans to interact with, this will not cost any money. However, the cost of interaction is an interesting problem: humans intuitively know the rules of when and how to interact with each other (to a varying degree...). Robots don't.

If a robot is able to ask questions about its environment, how does it determine the right number of questions to ask? How does it understand when it has annoyed its teammate? This is one of our core research questions: what is the cost, in terms of a human operator's attention and willingness to answer, of a robot asking questions and providing information?

Depends on how complicated and exact it needs to be. We can propose models, or we can learn them from human experience. The first takes only as long as building up a library of the ways in which humans describe environments; the second would be an ongoing process of calibrating a human's meaning to a probabilistic understanding of the state space.

We'd run experiments to determine:

- How well does a human-robot team work with a human acting as a sensor?
- Do some states work better than others? Are some more variable between humans?
- Instead of re-learning all human models, can we effectively transfer 'calibration profiles' between humans?

- 1st technical nugget: generalized MMS with GMMs
- 2nd technical nugget: extension to velocity/heading
- 3rd technical nugget: learning

- how do you get the models?
- binary softmax
- deformable templates

- what properties of this model makes it good for inference?
- dynamics
- motion models

- what properties of this model makes it good for learning?
- GPs
- Population learning (priors based on hyperpriors shared among human sensors)
- Concept transfer (how are 'front' and 'nearby' related) & online learning (ICCPS)

Think about: key figures, demos (esp. for 1, and how it connects to 2 and 3). Start the document.

- Introduction
- Ground Cops and Robots
- What's in this problem that other related research can't solve?

- Related Work
- Generating SoftMax Models
- From Weights
- From Class Boundaries
- From Polytope Templates
- In n-dimensions
- Multimodal SoftMax

- Learning Spatial SoftMax Models
- From Collected Data
- With Prior Boundaries
- Using Symmetry to Minimize Data Collection

- Simulations
- Ongoing Work

Used for the dynamics prediction step in Bayes filters (like KF, UKF, EKF, PF, GSF).

\begin{equation} p(\left. X_k \right| \zeta_{1:k-1},D_{1:k-1}) = \int p(\left. X_k \right| X_{k-1}) p(\left. X_{k-1} \right| \zeta_{1:k-1},D_{1:k-1}) dX_{k-1} \end{equation}
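As a concrete illustration, the prediction integral can be sketched on a discretized 1D state space. This is a minimal sketch only: the grid, Gaussian prior, and random-walk transition kernel are illustrative assumptions, not the paper's model.

```python
import numpy as np

# Discretized Chapman-Kolmogorov prediction step on a 1D grid.
grid = np.linspace(-5, 5, 101)

# Illustrative prior belief p(x_{k-1} | data so far): Gaussian at 0.
prior = np.exp(-0.5 * grid**2)
prior /= prior.sum()

# Assumed random-walk dynamics p(x_k | x_{k-1}) with std 0.5.
sigma = 0.5
K = np.exp(-0.5 * ((grid[:, None] - grid[None, :]) / sigma) ** 2)
K /= K.sum(axis=0, keepdims=True)   # each column is a valid transition pmf

# Prediction: p(x_k) = sum_j p(x_k | x_{k-1} = j) p(x_{k-1} = j)
predicted = K @ prior
assert np.isclose(predicted.sum(), 1.0)
```

The predicted density is flatter than the prior, reflecting the uncertainty injected by the dynamics.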

Done *before* human update.

The focus of our paper. We want to generate the likelihood model.

\begin{align} \begin{split} p(\left. X_k \right| \zeta_{1:k},D_{1:k}) &= \frac{p(\left. D_k \right| X_k) p(\left. X_k \right| \zeta_{1:k},D_{1:k-1})} {\int p(\left. D_k \right| X_k) p(\left. X_k \right| \zeta_{1:k},D_{1:k-1}) dX_k} \\ &\equiv p(\left. X_k \right| D_{k} ) \end{split} \end{align}
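A minimal sketch of this measurement update on a 1D grid, assuming a softmax-style human likelihood with three hypothetical classes ('left', 'middle', 'right'); the weights are illustrative only.

```python
import numpy as np

def softmax_likelihood(x, W, b):
    """P(D = i | x) for each class i; rows of W are weights w_i."""
    z = W @ x + b
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

grid = np.linspace(-5, 5, 101)
W = np.array([[-1.0], [0.0], [1.0]])   # hypothetical 'left'/'middle'/'right'
b = np.array([0.0, 1.0, 0.0])

# Human says "right" (class 2): posterior ∝ p(D_k = 2 | x) * prior.
prior = np.ones_like(grid) / grid.size
lik = np.array([softmax_likelihood(np.array([x]), W, b)[2] for x in grid])
posterior = lik * prior
posterior /= posterior.sum()
```

With a uniform prior, the posterior simply inherits the shape of the "right" class likelihood, concentrating mass on positive x.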

Widely used. Well-suited for modeling hybrid continuous-to-discrete mappings (e.g. human utterances to continuous state space probabilities). Always leads to a complete convex decomposition of the state space, so the state can be fully partitioned.

Problems: they do *not* (necessarily) lead to closed-form posteriors. Grid-based and particle approximations could work, but they scale poorly with state dimensionality, provide a cumbersome posterior, and, for grids, don't mesh easily with typical filters for hard sensor data. Gaussian mixtures via EM are prone to poor local maxima **(?)** and high computational cost.

Could be critical or non-critical boundaries.

\begin{equation} L_{log}(i,j) = (\mathbf{w}_i - \mathbf{w}_j)^T\mathbf{x} + (b_i - b_j) = 0 \end{equation}

**Should we consider a quaternion representation? What about generalized rotations?**

Defines the slope of a given class.

\begin{equation} \frac{\partial P(D_k = i \vert \mathbf{x})} {\partial \mathbf{x}} = P(D_k = i \vert \mathbf{x}) \left(\mathbf{w}_{i} - \sum_{h=1}^m \mathbf{w}_{h}P(D_k = h \vert \mathbf{x}) \right) \end{equation}

General transformation:
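The gradient expression above can be checked numerically. A small sketch with randomly drawn (hypothetical) weights, comparing the closed form against central differences:

```python
import numpy as np

def softmax(x, W, b):
    z = W @ x + b
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_grad(x, W, b, i):
    # dP(D=i|x)/dx = P(i|x) * (w_i - sum_h w_h P(h|x))
    P = softmax(x, W, b)
    return P[i] * (W[i] - P @ W)

rng = np.random.default_rng(0)
W, b, x = rng.normal(size=(4, 2)), rng.normal(size=4), rng.normal(size=2)

eps = 1e-6
for i in range(4):
    # Central-difference approximation in each state dimension.
    num = np.array([(softmax(x + eps * np.eye(2)[d], W, b)[i]
                     - softmax(x - eps * np.eye(2)[d], W, b)[i]) / (2 * eps)
                    for d in range(2)])
    assert np.allclose(num, softmax_grad(x, W, b, i), atol=1e-6)
```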

\begin{equation} \mathbf{n} = \mathbf{A}\mathbf{w} \end{equation}

Or, for strict ordering and zero-weight reference class $i$:

\begin{equation} \mathbf{n}_{i,j} = \mathbf{w}_j \end{equation}

Used in conjunction with the rank check to ensure a proper $\mathbf{A}$ matrix:

\begin{equation} \mathrm{rank}\left(\mathbf{A}_{min}\right) = \mathrm{rank}\left(\left[\begin{array}{r|r} \mathbf{A}_{min} & \mathbf{n}_{min}\end{array}\right]\right) \end{equation}

Sample points from polytopes to find exact weights. Use SVD or other matrix decomposition technique.

\begin{equation} \begin{bmatrix} \mathbf{x}_{i,1}^T & 1 \\ \mathbf{x}_{i,2}^T & 1 \\ \vdots & \vdots \\ \mathbf{x}_{i,n}^T & 1 \end{bmatrix} \begin{bmatrix} w_{i,1}\\ w_{i,2}\\ \vdots \\ b_i \end{bmatrix} = \mathbf{0} \end{equation}

The sum of *all* normals *must* equal 0.
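A minimal numerical sketch of this weight solve, assuming sampled points that lie on the boundary against a zero-weight reference class; the points (samples from the line y = 1) are hypothetical.

```python
import numpy as np

# Points assumed to lie on the class boundary, so [x^T 1][w_i; b_i] = 0.
pts = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])   # samples from y = 1
A = np.hstack([pts, np.ones((len(pts), 1))])

# [w_i; b_i] spans the null space of A: take the right singular vector
# associated with the smallest singular value.
_, _, Vt = np.linalg.svd(A)
w_aug = Vt[-1]
w, bias = w_aug[:2], w_aug[2]

# Up to scale, w ∝ [0, 1] and bias = -w[1], recovering the boundary y - 1 = 0.
assert np.allclose(A @ w_aug, 0, atol=1e-10)
```

The weights are recovered only up to an overall scale; fixing the scale (or the reference class) pins down the softmax sharpness.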

Groups subclasses $\theta(j)$ together into classes. No longer convex, though subclasses are. Boundaries *or* gradients shift, but *not both*. Possible to estimate MMS class weights from data (i.e. through symmetry).
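A sketch of the MMS grouping, where subclass softmax probabilities are summed into (possibly non-convex) class probabilities; the weights and the $\theta$ grouping used here are hypothetical.

```python
import numpy as np

def softmax_probs(x, W, b):
    z = W @ x + b
    e = np.exp(z - z.max())
    return e / e.sum()

# Four subclasses; theta groups subclasses 0 and 2 into one non-convex class.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
b = np.zeros(4)
theta = {"east_or_west": [0, 2], "north": [1], "south": [3]}

x = np.array([2.0, 0.5])
sub = softmax_probs(x, W, b)
class_probs = {name: sub[idx].sum() for name, idx in theta.items()}
assert np.isclose(sum(class_probs.values()), 1.0)
```

Each subclass region stays convex, but the union "east_or_west" is not, which is exactly what plain softmax cannot represent.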

(Fixed boundaries vs. fixed gradients)

Take three arbitrary classes `A`, `B`, and `C`. We will sum `A` and `C` together to make superclass `D`:

Where $x_{AB}$ and $x_{BC}$ are the solutions to the above equations for fixed weights (i.e. the locations of the boundaries in the state space). These must be solved for numerically.

*Inputs*:
Map object shapes + scales
Scaling factor for object interiors

*Outputs*:
Multiple binary, convex, linearly/non-linearly separable probabilistic decompositions of the state space.

- Find normal vectors for all shapes
- Find *independent* shape weights (i.e. through learning - following section)
- Use MMS boundary shaping on all shapes
- Account for object interiors

Note: template concept applies to non-physical shapes (i.e. motion models for velocity states)
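For a polygonal template, the per-edge normals can be read off directly from the vertices. A sketch assuming counter-clockwise vertex order; the unit square is illustrative only.

```python
import numpy as np

def edge_normals(vertices):
    """Outward (length-weighted) edge normals of a CCW-ordered polygon."""
    verts = np.asarray(vertices, dtype=float)
    edges = np.roll(verts, -1, axis=0) - verts           # vertex-to-vertex edges
    return np.column_stack([edges[:, 1], -edges[:, 0]])  # rotate each edge by -90 deg

square = [(0, 0), (1, 0), (1, 1), (0, 1)]                # unit square, CCW
n = edge_normals(square)

# A closed polygon's length-weighted normals sum to zero, matching the
# constraint that the sum of all normals must equal 0.
assert np.allclose(n.sum(axis=0), 0)
```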

State space: $s$

State space: $\begin{bmatrix} x & y \end{bmatrix}$

State space: $\begin{bmatrix} x & y & x^2 & xy & y^2 \end{bmatrix}$

State space: $\begin{bmatrix} x & y & x^2 & xy & y^2 \end{bmatrix}$

Show near: clusters of objects, rooms

Video example for demo: "The ball is between three robots." as the robots are moving.

- What's the importance of a physical understanding of the world?
- How does it apply to a human-robot team?
- How can humans convey a physical understanding of the world to the robot?
- What types of physical understanding can we convey? How do we convey them?
- What domains does this apply to? When would it be used?

- Who else has worked on communicating spatial understanding to a robot?
- How does our research differ from theirs?
- What have we worked on in the past, and how are we improving it?

- Nisar's previous work
- Nick Roy's Group (Tellex, Walter)
- Jensfelt and Burgard
- Dieter Fox, Cynthia Matuszek
- Gaardenfors
- Kuipers
- Kaupp
- Terry Regier (?)
- Jamie Frost, Alastair Harrison
- Marjorie Skubic (this?)

- What is a SoftMax Model and why does it best represent a spatial understanding?
- What different elements compose a SoftMax model, and what do they mean?
- What different ways can we construct a SoftMax model easily?

- What are the core insights we can get from SoftMax models?
- How is the slope of each class's probability derived and manipulated?
- How do we manipulate (i.e. rotate/translate) the model?

- How do we compute SoftMax weights from class boundaries?
- When *can't* we use class boundaries?

- How do we derive normals from Polytope templates?

- What do 1D examples looks like, and what purposes do they serve?
- Can we combine multiple states (i.e. position and velocity) into one SoftMax model?

- Can we compose 'superclasses' from multiple SoftMax classes?
- Can we merge multiple state types (i.e. position and velocity) into a superclass?
- Can we fix the normal boundary problem using our classes as superclasses?

- Instead of specifying the SoftMax Models, can we learn them from user data?
- What use is collecting data from individuals? How different are they? Are humans good estimators?
- What language do humans use to express spatial representations? How many prepositions exist? Does their usage vary?
- Can we use negative information well?

- How do we learn distributions from experimental human data?

- Given a grounding polytope, can we simply learn the gradients of the SoftMax classes?
- Can we represent the polytope boundaries probabilistically?

- Can we minimize the amount of data needed to be collected by exploiting symmetry?
- Which prepositions or prepositional phrases imply symmetry?

- Is this SoftMax decomposition effective?
- Can it run in real-time?
- How did the robot perform on its own?
- How did the human-robot team perform when using positional statements? Velocity statements? Both?
- How useful was negative information?

- What model did we introduce?
- What validation can we provide?
- What's next?
