"World of Bits: An Open-Domain Platform for Web-Based Agents", Shi, Karpathy, Fan, Hernandez, Liang

Abstract

"While simulated game environments have greatly accelerated research in reinforcement learning, existing environments lack the open-domain realism of tasks in computer vision or natural language processing, which operate on artifacts created by humans in natural, organic settings. To foster reinforcement learning research in such settings, we introduce the World of Bits (WoB), a platform in which agents complete tasks on the Internet by performing low-level keyboard and mouse actions. The two main challenges are: (i) to curate a diverse set of natural web-based tasks, and (ii) to ensure that these tasks have a well-defined reward structure and are reproducible despite the transience of the web. To tackle this, we develop a methodology in which crowdworkers create tasks defined by natural language questions and provide demonstrations of how to answer the question on real websites using keyboard and mouse; HTTP traffic is cached to create a reproduceable offline approximation of the website. Finally, we show that agents trained via behavioral cloning and reinforcement learning can complete a range of web-based tasks."

1. Introduction

"Use of web as a learning platform offers three benefits:

  • Open-domain By allowing agents to interact with the web, we open up the world's supply of websites as a rich source of learning environments and application domains. Since agents directly work with the UI, we can use existing web infrastructure without designing specialized APIs
  • Open-source Unlike robotics, WoB is digital, which enables fast iteration and massive scaling. Webpages are open-source and consist entirely of HTML/CSS/Javascript, which is easy to inspect and change dynamically
  • Easy to collect data Because agents use the same interface as humans do, it is possible to crowdsource human demonstrations of a web task from anyone with an access to a web browser, keyboard and mouse at a low cost. This unlocks the potential for large-scale data collection

MiniWoB

"While WoB specifies a platform, the main conceptual challenge is to define meaningful web tasks in a scalable way. In Section 2.2, we start by constructing the Mini World of Bits (MiniWoB), 100 web tasks (see Figure 7 for examples) of varying difficulty, in which the reward function is manually constructed.

FormWoB

"next, in section 2.3, we describe FormWoB, which consists of four web tasks based on real flight booking websites. The main difficulty here is that websites are constantly changing, and yet we would like to package them into reproducible research environments for our agents. To this end, we use a man-in-the-middle proxy to capture and replay live HTTP traffic, building up an approximation of the live website.

QAWoB

"Finally, inspired by large datasets such as ImageNet in computer vision (Deng et al., 2009) and SQuAD in NLP (Rajpurkar et al., 2016), we would like to scale up to a diverse set of web tasks without manual effort on each web task. To tackle this, we develop a methodology based on crowdsourcing that effectively casts web tasks as question answering (Section 2.4). First, we ask crowdworkers to write queries that can be answered by interacting with a given website. Each query is defined by a query template and slot values (eg 'New York') that fill the template slots (see Figure 2 for examples). Positive reward is given if an agent clicks on the correct answer. We create a dataset QAWoB, which has 11,650 queries (from 521 templates). We collected initial demonstrations for four of the templates, with one demonstration per query. Collecting demonstration for the full dataset is on-going work.

"To benchmark a standard approach, we evaluate the performance of convolutional neural networks taht take as input the image and text from the DOM and outputs keyboard and mouse actions. We train these models using both supervised learning and reinforcement learning, and show that in some caes we can generalize across different queries of the same template. However, our overall error rates remain relatively high, suggesting that the proposed benchmark leaves a lot of room for improvement."

