Copyright 2019 The RecSim Authors.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Developing an Agent

Having familiarized ourselves with the overall structure of RecSim and how environments come together, we now turn to the final piece of the puzzle -- agent development. In this tutorial, we aim to cover the following topics:

  • basics: what data RecSim feeds to an agent (and how), and what it expects to receive in return;
  • design: what features RecSim provides for developing agents.

Basics

To start unpacking the functionality of a RecSim agent, we once again refer to the structural diagram. Here's what we discern from it on first pass -- an agent is meant to consume:

  • observations about the user's state,
  • observations about the user's response to a recommendation,
  • and a set of available documents $D$, each represented by a vector of features. In return, the agent is expected to produce a $K$-sized slate of elements of $D$ to be presented to the user's choice and transition model. (The full interaction loop is sketched below.)
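To preview the call pattern before we build anything concrete, the interaction loop looks roughly like this; it is a schematic sketch, with env and agent standing in for the concrete classes introduced later in this tutorial:

# Schematic sketch only -- 'env' and 'agent' stand for any RecSim environment
# and agent; the concrete classes are introduced below.
observation = env.reset()                  # no recommendation made yet
slate = agent.begin_episode(observation)   # first slate of K document indices
while True:
  observation, reward, done, _ = env.step(slate)
  if done:
    break
  slate = agent.step(reward, observation)  # K indices into observation['doc']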

To illustrate RecSim's agent API, we will develop a simple bandit agent for RecSim's interest exploration environment.

The interest exploration environment represents a clustered bandit problem: the world consists of a very large number of documents, which cluster into topics (this is a hard clustering -- one topic per document). We further posit that users also cluster into types.

A user's affinity towards a document is the sum of the document's production quality and the user's (user type's) affinity to the document's topic. This naturally creates a situation where a myopic agent that ranks documents by predicted click rate will favor documents with high production value, as they have a high a priori probability of being clicked across all user types. This leads the agent to neglect exploring niche interests, producing a suboptimal policy. Hence the need for active exploration.
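As a concrete illustration of this dynamic, consider the following toy computation (the numbers are entirely made up and are not the environment's actual parameters):

# Illustrative numbers only -- not the environment's actual parameters.
doc_quality    = {'doc_a': 2.3, 'doc_b': 1.1}   # production quality
doc_topic      = {'doc_a': 0, 'doc_b': 1}
topic_affinity = {0: 0.0, 1: 0.8}               # this user type's topic affinities

affinity = {d: doc_quality[d] + topic_affinity[doc_topic[d]]
            for d in doc_quality}
# affinity == {'doc_a': 2.3, 'doc_b': 1.9}: even for a user type that prefers
# topic 1, the high-production-value document still wins, so a myopic ranker
# never learns about the niche interest.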

For the purposes of exposition, we will define the agent method by method and then assemble these methods into a class.

Set-Up

We now instantiate an environment to illustrate the various data types it produces and consumes, and how they are handled within an agent.


In [0]:
# @title Install
!pip install --upgrade --no-cache-dir recsim

In [0]:
# @title Imports
# Generic imports
import functools
from gym import spaces
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# RecSim imports
from recsim import agent
from recsim import document
from recsim import user
from recsim.choice_model import MultinomialLogitChoiceModel
from recsim.simulator import environment
from recsim.simulator import recsim_gym
from recsim.simulator import runner_lib

In [0]:
from recsim.environments import interest_exploration

Since we're not about to do anything fancy with this environment, we will initialize it with the provided create_environment function (further details on this here).


In [0]:
env_config = {'slate_size': 2,
              'seed': 0,
              'num_candidates': 15,
              'resample_documents': True}
ie_environment = interest_exploration.create_environment(env_config)

At the start of each session, the simulator resets the environment, which triggers a resampling of the user. The reset call generates our initial observation.


In [0]:
initial_observation = ie_environment.reset()

Observations

A RecSim observation is a dictionary with 3 keys:

  • 'user', which represents the 'User Observable Features' in the structure diagram above,
  • 'doc', containing the current corpus of recommendable documents and their observable features ('Document Observable Features'),
  • and 'response', indicating the user's response to the last slate of recommendations ('User Response'). At this stage the 'response' key is vacuous and will be set to None, as no recommendation has been made yet.

Note that this environment does not implement user observable features, so this field will be empty at all times.


In [0]:
print('User Observable Features')
print(initial_observation['user'])
print('User Response')
print(initial_observation['response'])
print('Document Observable Features')
for doc_id, doc_features in initial_observation['doc'].items():
  print('ID:', doc_id, 'features:', doc_features)


User Observable Features
[]
User Response
None
Document Observable Features
ID: 15 features: {'quality': 1.2272016322975663, 'cluster_id': 1}
ID: 16 features: {'quality': 1.2925848895378007, 'cluster_id': 1}
ID: 17 features: {'quality': 1.239770781835802, 'cluster_id': 1}
ID: 18 features: {'quality': 1.4604555455549542, 'cluster_id': 1}
ID: 19 features: {'quality': 2.1023342470023874, 'cluster_id': 0}
ID: 20 features: {'quality': 1.0957290496089296, 'cluster_id': 1}
ID: 21 features: {'quality': 2.372569629131807, 'cluster_id': 0}
ID: 22 features: {'quality': 1.3492800243147158, 'cluster_id': 1}
ID: 23 features: {'quality': 1.0067018798187535, 'cluster_id': 1}
ID: 24 features: {'quality': 1.2044856191727935, 'cluster_id': 1}
ID: 25 features: {'quality': 2.1835115903440956, 'cluster_id': 0}
ID: 26 features: {'quality': 1.1941158468553823, 'cluster_id': 1}
ID: 27 features: {'quality': 1.0351464593750552, 'cluster_id': 1}
ID: 28 features: {'quality': 2.2959262349993166, 'cluster_id': 0}
ID: 29 features: {'quality': 2.059365556961282, 'cluster_id': 0}

We are thus presented with a corpus of 15 documents (num_candidates), each represented by its topic and its production quality score. Note, though, that the user's affinity is not an observable quantity.

The observation format specification can be accessed as an attribute of the environment in the form of an OpenAI gym space. It is also provided to the agent at initialization time.


In [0]:
print('Document observation space')
for key, space in ie_environment.observation_space['doc'].spaces.items():
  print(key, ':', space)
print('Response observation space')
print(ie_environment.observation_space['response'])
print('User observation space')
print(ie_environment.observation_space['user'])


Document observation space
15 : Dict(cluster_id:Discrete(2), quality:Box())
16 : Dict(cluster_id:Discrete(2), quality:Box())
17 : Dict(cluster_id:Discrete(2), quality:Box())
18 : Dict(cluster_id:Discrete(2), quality:Box())
19 : Dict(cluster_id:Discrete(2), quality:Box())
20 : Dict(cluster_id:Discrete(2), quality:Box())
21 : Dict(cluster_id:Discrete(2), quality:Box())
22 : Dict(cluster_id:Discrete(2), quality:Box())
23 : Dict(cluster_id:Discrete(2), quality:Box())
24 : Dict(cluster_id:Discrete(2), quality:Box())
25 : Dict(cluster_id:Discrete(2), quality:Box())
26 : Dict(cluster_id:Discrete(2), quality:Box())
27 : Dict(cluster_id:Discrete(2), quality:Box())
28 : Dict(cluster_id:Discrete(2), quality:Box())
29 : Dict(cluster_id:Discrete(2), quality:Box())
Response observation space
Tuple(Dict(click:Discrete(2), cluster_id:Discrete(2), quality:Box()), Dict(click:Discrete(2), cluster_id:Discrete(2), quality:Box()))
User observation space
Box(0,)

Slates

A RecSim slate is a list of $K$ indices into observation['doc']. E.g., the slate [0, 1] corresponds to the slate consisting of:


In [0]:
slate = [0, 1]
for slate_doc in slate:
  print(list(initial_observation['doc'].items())[slate_doc])


('15', {'quality': 1.2272016322975663, 'cluster_id': 1})
('16', {'quality': 1.2925848895378007, 'cluster_id': 1})

The action space gym specification is also provided by the environment.


In [0]:
ie_environment.action_space


Out[0]:
MultiDiscrete([15 15])

Once a slate is available, the simulator steps the environment, generating a new observation along with a reward for the agent.


In [0]:
observation, reward, done, _ = ie_environment.step(slate)

The main job of the agent is to produce a valid slate for each step of the simulation.

A trivial agent

At the most basic level, the main function of the agent can be fulfilled by simply implementing a step function. Let us implement a very basic agent that just serves the first $K$ documents from the corpus.


In [0]:
from recsim.agent import AbstractEpisodicRecommenderAgent

A RecSim agent inherits from AbstractEpisodicRecommenderAgent. The required arguments for the agent's constructor (which RecSim will pass to the agent at simulation time) are the observation_space and action_space. We can use them to validate whether the environment meets the preconditions for the agent's operation.


In [0]:
class StaticAgent(AbstractEpisodicRecommenderAgent):
  def __init__(self, observation_space, action_space):
    # Check if document corpus is large enough.
    if len(observation_space['doc'].spaces) < len(action_space.nvec):
      raise RuntimeError('Slate size larger than size of the corpus.')
    super(StaticAgent, self).__init__(action_space)

  def step(self, reward, observation):
    print(observation)
    return list(range(self._slate_size))

This agent will statically recommend the first K documents of the corpus. For reasons that will become clear soon, we'll also have it print the observation.

We can now run it in RecSim using runner_lib (See tutorial for details).


In [0]:
def create_agent(sess, environment, eval_mode, summary_writer=None):
  return StaticAgent(environment.observation_space, environment.action_space)

tmp_base_dir = '/tmp/recsim/'

runner = runner_lib.EvalRunner(
  base_dir=tmp_base_dir,
  create_agent_fn=create_agent,
  env=ie_environment,
  max_eval_episodes=1,
  max_steps_per_episode=5,
  test_mode=True)

# We won't run this, but we totally could
# runner.run_experiment()

Design: Hierarchical Agent Layers

Now that we've gotten a basic agent off the ground, we might want to set our aims a little higher. That is, let's see if we can build an agent that actually does something useful.

The way this problem is set up, a natural heuristic presents itself. We can run a bandit algorithm to learn the average engagement of a user with each cluster of documents; that is, each cluster becomes an arm. Once the algorithm has chosen a cluster, we serve the highest quality documents from that cluster. This is a metaphor for a situation that occurs often in recommender systems that serve as a front end to multiple (sub-)products: within each session, the user interacts with the recommender with some intent in mind, that is, to accomplish some task that can be fulfilled by one of the possible sub-products. Sometimes the user will issue an explicit query (e.g., enter search terms), which effectively makes that intent observable up to query-interpretation uncertainty. Most often, however, the intent will be latent -- the user reveals it indirectly by choosing among the items in the slate. We assume that, had the intent been observable, a product-specific policy would be available to fulfill it.

This set-up captures some typical features of practical recommender systems -- they tend to be very hierarchical, often very heuristic due to the complexity of the environment they operate in, and also very idiosyncratic to the task at hand. For this reason, RecSim's approach to agent engineering is very modular. Instead of providing a wide array of agents, we provide an easily extendable set of agent building blocks, called Agent Layers, which can be combined into hierarchies to create more complex agents.

Hierarchical agent layers

A hierarchical agent layer does not materialize a slate of documents itself, but relies on one or more base agents to do so. The hierarchical agent architecture in RecSim can roughly be described as follows:

  • A hierarchical agent layer receives an observation and reward from the environment; it preprocesses the raw observation and passes it to one or more base agents.
  • Each base agent outputs either a slate or an abstract action (depending on the use case), which is then post-processed by the layer to create the final slate (concrete action).

Hierarchical layers are recursively stackable in a fashion similar to Keras layers. A hierarchical layer is defined by its pre- and post-processing functions and can play many roles depending on how these are implemented. For example, a layer can be used as a pure feature injector -- it can extract some feature from the (history of) observations and pass it to the base agent, while keeping the post-processing function vacuous. This allows decoupling of feature engineering from agent engineering. Various regularizers can be implemented in a similar fashion by modifying the reward. Layers may also be stateful and dynamic, as the pre- or post-processing functions may implement parameter updates or learning mechanisms.

We will not discuss how to implement these layers here (the reader is referred to the examples in the layers/ directory); rather, we will show their usage and benefits.
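Still, to fix ideas, here is a purely conceptual sketch of the pre-/post-processing pattern. It is not RecSim's actual layer base class, and all names in it are illustrative:

# Conceptual sketch only -- this is not RecSim's layer base class; names are
# illustrative. A feature-injecting layer with a vacuous post-processing step.
class SketchFeatureInjectorLayer(object):
  def __init__(self, base_agent_ctor, observation_space, action_space):
    # A real layer would also extend the observation space it advertises
    # to its base agent.
    self._base_agent = base_agent_ctor(observation_space, action_space)

  def step(self, reward, observation):
    # Pre-processing: derive an extra feature and inject it into the
    # observation passed to the base agent.
    augmented = dict(observation,
                     extra_feature=self._compute_feature(observation))
    slate = self._base_agent.step(reward, augmented)
    # Post-processing: vacuous -- the base agent's slate is relayed as-is.
    return slate

  def _compute_feature(self, observation):
    return None  # placeholder for some statistic of the observation history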

ClusterClickStats

Recall that the interest exploration environment provides clicks as feedback, but does not keep track of cumulative click or impression counts. Since maintaining such statistics is generally useful, we provide an agent layer that does exactly that: it monitors the stream of responses and keeps running counts of clicks and impressions for each cluster. The precondition is that the response space has a 'click' key, as well as a 'cluster_id' key. If this is met, then the layer can be used with any environment/agent. Let's see how this works.


In [0]:
from recsim.agents.layers.cluster_click_statistics import ClusterClickStatsLayer

A hierarchical agent layer is instantiated in a similar way to ordinary agents, except that it also takes a constructor for a base agent, that is, an agent whose abstract action it can interpret. In the case of the cluster click stats layer, no post-processing of the abstract action is done; it simply relays the action of the base agent to the environment. This implies that the base agent must provide a full slate.

Once instantiated, the cluster click stats layer injects sufficient statistics containing click and impression counts into the base agent's observation. Thus, the combination of the two behaves as if the base agent had an additional field in its observation space. We showcase this using our StaticAgent.


In [0]:
static_agent = StaticAgent(ie_environment.observation_space,
                           ie_environment.action_space)
static_agent.step(reward, observation)


{'user': array([], dtype=float64), 'doc': {'30': {'quality': 2.489224450301943, 'cluster_id': 0}, '31': {'quality': 2.125926607579561, 'cluster_id': 0}, '32': {'quality': 1.27448138607991, 'cluster_id': 1}, '33': {'quality': 1.2179911236932994, 'cluster_id': 1}, '34': {'quality': 1.177703750911228, 'cluster_id': 1}, '35': {'quality': 2.079489146813576, 'cluster_id': 0}, '36': {'quality': 1.1416765236282371, 'cluster_id': 1}, '37': {'quality': 1.2052916542615082, 'cluster_id': 1}, '38': {'quality': 1.2424683972006194, 'cluster_id': 1}, '39': {'quality': 1.8727966807396805, 'cluster_id': 0}, '40': {'quality': 1.1964488835024119, 'cluster_id': 1}, '41': {'quality': 1.282540205315461, 'cluster_id': 1}, '42': {'quality': 2.015585394934561, 'cluster_id': 0}, '43': {'quality': 2.464004827721051, 'cluster_id': 0}, '44': {'quality': 1.33980633202097, 'cluster_id': 1}}, 'response': ({'click': 0, 'quality': 1.2272016322975663, 'cluster_id': 1}, {'click': 0, 'quality': 1.2925848895378007, 'cluster_id': 1})}
Out[0]:
[0, 1]

In [0]:
cluster_static_agent = ClusterClickStatsLayer(StaticAgent,
                                              ie_environment.observation_space,
                                              ie_environment.action_space)
cluster_static_agent.step(reward, observation)


{'user': {'raw_observation': array([], dtype=float64), 'sufficient_statistics': {'impression_count': array([0, 2]), 'click_count': array([0, 0])}}, 'doc': {'30': {'quality': 2.489224450301943, 'cluster_id': 0}, '31': {'quality': 2.125926607579561, 'cluster_id': 0}, '32': {'quality': 1.27448138607991, 'cluster_id': 1}, '33': {'quality': 1.2179911236932994, 'cluster_id': 1}, '34': {'quality': 1.177703750911228, 'cluster_id': 1}, '35': {'quality': 2.079489146813576, 'cluster_id': 0}, '36': {'quality': 1.1416765236282371, 'cluster_id': 1}, '37': {'quality': 1.2052916542615082, 'cluster_id': 1}, '38': {'quality': 1.2424683972006194, 'cluster_id': 1}, '39': {'quality': 1.8727966807396805, 'cluster_id': 0}, '40': {'quality': 1.1964488835024119, 'cluster_id': 1}, '41': {'quality': 1.282540205315461, 'cluster_id': 1}, '42': {'quality': 2.015585394934561, 'cluster_id': 0}, '43': {'quality': 2.464004827721051, 'cluster_id': 0}, '44': {'quality': 1.33980633202097, 'cluster_id': 1}}, 'response': ({'click': 0, 'quality': 1.2272016322975663, 'cluster_id': 1}, {'click': 0, 'quality': 1.2925848895378007, 'cluster_id': 1})}
Out[0]:
[0, 1]

Observe how the 'user' field of the observation dictionary (as printed from within the static agent's step function) now has a new key, 'sufficient_statistics', whereas the old user observation (which is vacuous) has moved under the 'raw_observation' key. This is done to avoid naming conflicts.
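A base agent wrapped by this layer can read the injected statistics straight from the observation. For example, a short sketch (using the key names printed above) of computing per-cluster empirical click-through rates:

def empirical_ctr(observation):
  # Sketch: read the statistics injected by ClusterClickStatsLayer.
  stats = observation['user']['sufficient_statistics']
  impressions = np.maximum(stats['impression_count'], 1)  # avoid division by zero
  return stats['click_count'] / impressions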

AbstractClickBandit

The ClusterClickStats layer takes care of computing the sufficient statistics needed for exploration. To implement the actual bandit policy, RecSim offers an abstract bandit layer. The AbstractClickBandit takes as input a list of base agents, which it treats as arms. It then utilizes one of a few implemented bandit policies (UCB1, KL-UCB, Thompson sampling) to mix the base policies in a way that achieves sub-linear regret relative to the best policy (which is a priori unknown), subject to certain assumptions about the environment.
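For intuition, a UCB1-style score for each cluster arm could be computed from the click/impression counts maintained by ClusterClickStats roughly as follows; this is an illustrative sketch, not RecSim's internal implementation:

def ucb1_scores(click_count, impression_count):
  # Mean click rate plus an exploration bonus; unexplored arms score +inf.
  clicks = np.asarray(click_count, dtype=float)
  pulls = np.asarray(impression_count, dtype=float)
  total_pulls = max(pulls.sum(), 1.0)
  with np.errstate(divide='ignore', invalid='ignore'):
    mean = np.where(pulls > 0, clicks / pulls, 0.0)
    bonus = np.sqrt(2.0 * np.log(total_pulls) / pulls)
  return np.where(pulls > 0, mean + bonus, np.inf)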


In [0]:
from recsim.agents.layers.abstract_click_bandit import AbstractClickBanditLayer

To instantiate an abstract click bandit, we must provide a list of base agents, one per arm. In our case, we will have one base agent for each cluster; that agent simply retrieves the documents of its cluster from the corpus and sorts them according to their quality.


In [0]:
class GreedyClusterAgent(agent.AbstractEpisodicRecommenderAgent):
  """Simple agent sorting all documents of a topic according to quality."""

  def __init__(self, observation_space, action_space, cluster_id, **kwargs):
    del observation_space
    super(GreedyClusterAgent, self).__init__(action_space)
    self._cluster_id = cluster_id

  def step(self, reward, observation):
    del reward
    my_docs = []
    my_doc_quality = []
    for i, doc in enumerate(observation['doc'].values()):
      if doc['cluster_id'] == self._cluster_id:
        my_docs.append(i)
        my_doc_quality.append(doc['quality'])
    if not bool(my_docs):
      return []
    sorted_indices = np.argsort(my_doc_quality)[::-1]
    return list(np.array(my_docs)[sorted_indices])

We will now instantiate one GreedyClusterAgent for each cluster.


In [0]:
num_topics = list(ie_environment.observation_space.spaces['doc']
                  .spaces.values())[0].spaces['cluster_id'].n
base_agent_ctors = [
    functools.partial(GreedyClusterAgent, cluster_id=i)
    for i in range(num_topics)
]

We can now instantiate our cluster bandit as a combination of ClusterClickStats, AbstractClickBandit, and GreedyClusterAgent:


In [0]:
bandit_ctor = functools.partial(AbstractClickBanditLayer,
                                arm_base_agent_ctors=base_agent_ctors)
cluster_bandit = ClusterClickStatsLayer(bandit_ctor,
                                        ie_environment.observation_space,
                                        ie_environment.action_space)

Our ClusterBandit is ready to use!


In [0]:
observation0 = ie_environment.reset()
slate = cluster_bandit.begin_episode(observation0)
print("Cluster bandit slate 0:")
doc_list = list(observation0['doc'].values())
for doc_position in slate:
  print(doc_list[doc_position])


Cluster bandit slate 0:
{'quality': 1.4686875120276195, 'cluster_id': 1}
{'quality': 1.4226918183479484, 'cluster_id': 1}
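
From here a session proceeds as before: we step the environment with this slate and feed the resulting reward and observation back into the bandit, whose click/impression statistics are updated by the ClusterClickStats layer along the way. For example (not executed here):

observation1, reward, done, _ = ie_environment.step(slate)
slate = cluster_bandit.step(reward, observation1)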