Artificial Intelligence: A Modern Approach [4 ed.] 2019047498, 9780134610993, 0134610997

http://aima.cs.berkeley.edu/index.html

2,110 479 663MB

English Pages 1152 [1134] Year 2020

Report DMCA / Copyright

DOWNLOAD PDF FILE

Recommend Papers

Artificial Intelligence: A Modern Approach [4 ed.]
 2019047498, 9780134610993, 0134610997

  • Commentary
  • Ripped from eText. Original was 1.9Gb; rescaled to 50% to fit filesize requirements.
  • 0 0 0
  • Like this paper and download? You can publish your own PDF file online for free in a few minutes! Sign Up
File loading please wait...
Citation preview

Artificialfintelligence Bl

ol

ANVIOAEM APPreacH

Helligigh=eljtjelf)

Artificial Intelligence A Modern Approach Fourth Edition

PEARSON SERIES IN ARTIFICIAL INTELLIGENCE Stuart Russell and Peter Norvig, Editors

FORSYTH & PONCE

Computer Vision: A Modern Approach, 2nd ed.

JURAFSKY & MARTIN

Speech and Language Processing, 2nd ed.

RUSSELL & NORVIG

Artificial Intelligence: A Modern Approach, 4th ed.

GRAHAM

NEAPOLITAN

ANSI Common Lisp

Learning Bayesian Networks

Artificial Intelligence A Modern Approach Fourth Edition Stuart J. Russell and Peter Norvig

Contributing writers: Ming-Wei Chang Jacob Devlin Anca Dragan David Forsyth Tan Goodfellow Jitendra M. Malik Vikash Mansinghka Judea Pearl Michael Wooldridge

@

Pearson

Copyright © 2021, 2010, 2003 by Pearson Education, Inc. or its affiliates, 221 River Street, Hoboken, NJ 07030. All Rights Reserved. Manufactured in the United States of America. This publication is protected by copyright, and permission should be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise. For information regarding permissions, request forms, and the appropriate contacts within the Pearson Education Global Rights and Permissions department, please visit www.pearsoned.com/permissions/. Acknowledgments of third-party content appear on the appropriate page within the text. Cover Images: Alan Turing - Science History Images/Alamy Stock Photo Statue of Aristotle — Panos Karas/Shutterstock Ada Lovelace - Pictorial Press Ltd/Alamy Stock Photo Autonomous cars — Andrey Suslov/Shutterstock Atlas Robot ~ Boston Dynamics, Inc. Berkeley Campanile and Golden Gate Bridge — Ben Chu/Shutterstock Background ghosted nodes — Eugene Sergeev/Alamy Stock Photo Chess board with chess figure — Titania/Shutterstock Mars Rover - Stocktrek Images, Inc./Alamy Stock Photo Kasparov - KATHY WILLENS/AP Images PEARSON, ALWAYS LEARNING is an exclusive trademark owned by Pearson Education, Inc. or its affiliates in the U.S. and/or other countries.

Unless otherwise indicated herein, any third-party trademarks, logos, or icons that may appear in this work are the property of their respective owners, and any references to third-party trademarks, logos, icons, or other trade dress are for demonstrative or descriptive purposes only. Such references are not intended to imply any sponsorship, endorsement, authorization, or promotion of Pearson’s products

by the owners of such marks, or any relationship between the owner and Pearson Education, Inc., or its affiliates, authors, licensees, or distributors.

Library of Congress Cataloging-in-Publication Data Russell, Stuart J. (Stuart Jonathan), author. | Norvig, Peter, author. rtificial intelligence : @ modern approach/ Stuart J. Russell and Peter Norvig. Description: Fourth edition. | Hoboken : Pearson, [2021] | Series: Pearson series in artificial intelligence | Includes bibliographical references and index. | Summary: “Updated edition of popular textbook on Artificial Intelligence.”— Provided by publisher. Identifiers: LCCN 2019047498 | ISBN 9780134610993 (hardcover) Subjects: LCSH: Artificial intelligence. Classification: LCC Q335 .R86 2021 | DDC 006.3-dc23 LC record available at https://lcen.loc.gov/2019047498 ScoutAutomatedPrintCode

@

Pearson ISBN-10: ISBN-13:

0-13-461099-7 978-0-13-461099-3

For Loy, Gordon, Lucy, George, and Isaac — S.J.R. For Kris, Isabella, and Juliet — P.N.

This page intentionally left blank

Preface Artificial Intelligence (Al) is a big field, and this is a big book.

We have tried to explore

the full breadth of the field, which encompasses logic, probability, and continuous mathematics; perception, reasoning, learning, and action; fairness, trust, social good, and safety; and applications that range from microelectronic devices to robotic planetary explorers to online services with billions of users. The subtitle of this book is “A Modern Approach.”

That means we have chosen to tell

the story from a current perspective. We synthesize what is now known into a common

framework, recasting early work using the ideas and terminology that are prevalent today.

We apologize to those whose subfields are, as a result, less recognizable. New to this edition

This edition reflects the changes in Al since the last edition in 2010: « We focus more on machine learning rather than hand-crafted knowledge engineering, due to the increased availability of data, computing resources, and new algorithms.

« Deep learning, probabilistic programming, and multiagent systems receive expanded coverage, each with their own chapter.

« The coverage of natural language understanding, robotics, and computer vision has been revised to reflect the impact of deep learning. + The robotics chapter now includes robots that interact with humans and the application of reinforcement learning to robotics.

« Previously we defined the goal of Al as creating systems that try to maximize expected utility, where the specific utility information—the objective—is supplied by the human

designers of the system. Now we no longer assume that the objective is fixed and known by the Al system;

instead, the system may be uncertain about the true objectives of the

humans on whose behalf it operates. It must learn what to maximize and must function

appropriately even while uncertain about the objective. « We increase coverage of the impact of Al on society, including the vital issues of ethics, fairness, trust, and safety.

+ We have moved the exercises from the end of each chapter to an online site.

This

allows us to continuously add to, update, and improve the exercises, to meet the needs

of instructors and to reflect advances in the field and in Al-related software tools.

* Overall, about 25% of the material in the book is brand new. The remaining 75% has

been largely rewritten to present a more unified picture of the field. 22% of the citations in this edition are to works published after 2010.

Overview of the book The main unifying theme is the idea of an intelligent agent. We define Al as the study of agents that receive percepts from the environment and perform actions. Each such agent implements a function that maps percept sequences to actions, and we cover different ways

to represent these functions, such as reactive agents, real-time planners, decision-theoretic vii

viii

Preface

systems, and deep learning systems. We emphasize learning both as a construction method for competent systems and as a way of extending the reach of the designer into unknown

environments. We treat robotics and vision not as independently defined problems, but as occurring in the service of achieving goals. We stress the importance of the task environment in determining the appropriate agent design.

Our primary aim is to convey the ideas that have emerged over the past seventy years

of Al research and the past two millennia of related work.

We have tried to avoid exces-

sive formality in the presentation of these ideas, while retaining precision. We have included mathematical formulas and pseudocode algorithms to make the key ideas concrete; mathe-

matical concepts and notation are described in Appendix A and our pseudocode is described in Appendix B. This book is primarily intended for use in an undergraduate course or course sequence. The book has 28 chapters, each requiring about a week’s worth of lectures, so working through the whole book requires a two-semester sequence. A one-semester course can use. selected chapters to suit the interests of the instructor and students.

The book can also be

used in a graduate-level course (perhaps with the addition of some of the primary sources suggested in the bibliographical notes), or for self-study or as a reference.

Term

Throughout the book, important points are marked with a triangle icon in the margin.

‘Wherever a new term is defined, it is also noted in the margin. Subsequent significant uses

of the term are in bold, but not in the margin. We have included a comprehensive index and

an extensive bibliography. The only prerequisite is familiarity with basic concepts of computer science (algorithms, data structures, complexity) at a sophomore level. Freshman calculus and linear algebra are useful for some of the topics. Online resources

Online resources are available through pearsonhighered. com/cs-resources or at the book’s Web site, aima. cs . berkeley. edu. There you will find: « Exercises, programming projects, and research projects. These are no longer at the end of each chapter; they are online only. Within the book, we refer to an online exercise with a name like “Exercise 6.NARY.”

Instructions on the Web site allow you to find

exercises by name or by topi Implementations of the algorithms in the book in Python, Java, and other programming languages (currently hosted at github.com/aimacode).

A list of over 1400 schools that have used the book, many with links to online course

materials and syllabi. Supplementary material and links for students and instructors.

Instructions on how to report errors in the book, in the likely event that some exist.

Book cover

The cover depicts the final position from the decisive game 6 of the 1997 chess match in which the program Deep Blue defeated Garry Kasparov (playing Black), making this the first time a computer had beaten a world champion in a chess match. Kasparov is shown at the

Preface

top. To his right is a pivotal position from the second game of the historic Go match between former world champion Lee Sedol and DeepMind’s ALPHAGO program. Move 37 by ALPHAGO violated centuries of Go orthodoxy and was immediately seen by human experts as an embarrassing mistake, but it turned out to be a winning move. At top left is an Atlas humanoid robot built by Boston Dynamics. A depiction of a self-driving car sensing its environment appears between Ada Lovelace, the world’s first computer programmer, and Alan Turing, whose fundamental work defined artificial intelligence.

At the bottom of the chess

board are a Mars Exploration Rover robot and a statue of Aristotle, who pioneered the study of logic; his planning algorithm from De Motu Animalium appears behind the authors’ names. Behind the chess board is a probabilistic programming model used by the UN Comprehensive Nuclear-Test-Ban Treaty Organization for detecting nuclear explosions from seismic signals. Acknowledgments It takes a global village to make a book. Over 600 people read parts of the book and made

suggestions for improvement. The complete list is at aima.cs.berkeley.edu/ack.html;

we are grateful to all of them. We have space here to mention only a few especially important contributors. First the contributing writers:

Judea Pearl (Section 13.5, Causal Network Vikash Mansinghka (Section 15.3, Programs as Probability Models): Michael Wooldridge (Chapter 18, Multiagent Decision Making); Tan Goodfellow (Chapter 21, Deep Learning); Jacob Devlin and Mei-Wing Chang (Chapter 24, Deep Learning for Natural Language); « Jitendra Malik and David Forsyth (Chapter 25, Computer Vision); « Anca Dragan (Chapter 26, Robotics). Then some key roles: « Cynthia Yeung and Malika Cantor (project management); « Julie Sussman and Tom Galloway (copyediting and writing suggestions); « Omari Stephens (illustrations); « Tracy Johnson (editor); « Erin Ault and Rose Kernan (cover and color conversion); « Nalin Chhibber, Sam Goto, Raymond de Lacaze, Ravi Mohan, Ciaran O’Reilly, Amit Patel, Dragomir Radiv, and Samagra Sharma (online code development and mentoring); « Google Summer of Code students (online code development). Stuart would like to thank his wife, Loy Sheflott, for her endless patience and boundless wisdom. He hopes that Gordon, Lucy, George, and Isaac will soon be reading this book after they have forgiven him for working so long on it. RUGS (Russell’s Unusual Group of Students) have been unusually helpful, as always. Peter would like to thank his parents (Torsten and Gerda) for getting him started, and his wife (Kris), children (Bella and Juliet), colleagues, boss, and friends for encouraging and tolerating him through the long hours of writing and rewriting.

ix

About the Authors Stuart Russell was born in 1962 in Portsmouth, England.

He received his B.A. with first-

class honours in physics from Oxford University in 1982, and his Ph.D. in computer science from Stanford in 1986.

He then joined the faculty of the University of California at Berke-

ley, where he is a professor and former chair of computer science, director of the Center for Human-Compatible Al, and holder of the Smith-Zadeh Chair in Engineering. In 1990, he received the Presidential Young Investigator Award of the National Science Foundation, and in 1995 he was cowinner of the Computers and Thought Award. He is a Fellow of the Amer-

ican Association for Artificial Intelligence, the Association for Computing Machinery, and

the American Association for the Advancement of Science, an Honorary Fellow of Wadham College, Oxford, and an Andrew Carnegie Fellow. He held the Chaire Blaise Pascal in Paris

from 2012 to 2014. He has published over 300 papers on a wide range of topics in artificial intelligence. His other books include The Use of Knowledge in Analogy and Induction, Do the Right Thing: Studies in Limited Rationality (with Eric Wefald), and Human Compatible: Ariificial Intelligence and the Problem of Control.

Peter Norvig is currently a Director of Research at Google, Inc., and was previously the director responsible for the core Web search algorithms. He co-taught an online Al class that signed up 160,000 students, helping to kick off the current round of massive open online

classes. He was head of the Computational Sciences Division at NASA Ames Research Center, overseeing research and development in artificial intelligence and robotics. He received

aB.S. in applied mathematics from Brown University and a Ph.D. in computer science from Berkeley. He has been a professor at the University of Southern California and a faculty

member at Berkeley and Stanford. He is a Fellow of the American Association for Artificial

Intelligence, the Association for Computing Machinery, the American Academy of Arts and Sciences, and the California Academy of Science. His other books are Paradigms of Al Programming: Case Studies in Common Lisp, Verbmobil: A Translation System for Face-to-Face Dialog, and Intelligent Help Systems for UNIX. The two authors shared the inaugural AAAI/EAAI Outstanding Educator award in 2016.

Contents Atrtificial Intelligence Introduction L1 WhatTs AI? ... .o 1.2 The Foundations of Artificial Intelligence . 1.3 The History of Artificial Intelligence 1.4 The State of the Art . . . 1.5 Risks and Benefits of AI .

Summary

Bibliographical and Historical Notes

Intelligent Agents 2.1

22

2.3

24

Agentsand Environments

1 1 5 17 27 31

34

. . . . . . .

. . ..

..

... ...

Good Behavior: The Concept of Rationality

35 ...

The Nature of Environments . . . .. .

TheStructure of Agents

36 36

...................

. . . .. .. ... ...

SUMMATY . ..o o e Bibliographical and Historical Notes . . . . .. ..................

it

39

42

47

60 60

Problem-solving

Solving Problems by Searching 3.1 Problem-Solving Agents .

63 63

3.5

84

3.2 3.3 3.4

Example Problems . . . . Search Algorithms . . . Uninformed Search Strategies . . . . . . .

3.6

Heuristic Functions

Informed (Heuristic) Search Strategies . .

Summary

Bibliographical and Historical Notes

. . . . . . .

66 71 76 97

104

106

Search in Complex Environments

110

Summary

141

4.1 4.2 4.3 4.4 4.5

Local Search and Optimization Problems . Local Search in Continuous Spaces . . . . Search with Nondeterministic Actions Search in Partially Observable Environments Online Search Agents and Unknown Environments

Bibliographical and Historical Notes

. . . . .. ..................

Adversarial Search and Games 5.1 Game Theory 5.2 Optimal Decisions in Games

110 119 122 126 134 142

146 146 148 xi

Contents 5.3 54

Heuristic Alpha-Beta Tree Search Monte Carlo Tree Search

156 161

5.6 5.7

Partially Observable Games . Limitations of Game Search Algorithms

168 173

5.5

Stochastic Games

Summary

Bibliographical 6

174

and Historical Notes

175

Constraint Satisfaction Problems

180

6.1

Defining Constraint Satisfaction Problems

180

6.3 64 6.5

Backtracking Search for CSPs Local Search for CSPs . . . The Structure of Problems .

191 197 199

6.2

Constraint Propagation: Inference in CSPs .

Summary

Bibliographical and Historical Notes

III

7

164

. . . . .. ..................

204

Logical Agents 7.1 Knowledge-Based Agents

208 209

7.4 7.5

217 222

7.2 73

The Wumpus World . . Logic...................

7.6

Effective Propositional Model Checking

Propositional Logic: A Very Simple Logic . Propositional Theorem Proving

Bibliographical and Historical Notes

. . . . . ...................

210 214

232

237 246 247

First-Order Logic

251

8.1 8.2 8.3

251 256 265

Representation Revisited . Syntax and Semantics of First-Order Logic Using First-Order Logic . .

.

84 Knowledge Engineering in First-Order Logic Summary Bibliographical and Historical Notes

9

203

Knowledge, reasoning, and planning

7.7 Agents Based on Propositional Logic . . SUMMary ... 8

185

. . . . . .

271 277

278

Inference in First-Order Logic

280

Summary

309

9.1 9.2 9.3 9.4 9.5

Propositional vs. First-Order Inference . . . ... ............. Unification and First-Order Inference . . Forward Chaining . . . Backward Chaining . . Resolution . ......

Lo

Bibliographical and Historical Notes

. . . . . .

280 282 286 293 298

310

Contents

10 Knowledge Representation 10.1 Ontological Engineering . 102 Categories and Object 10.3 10.4

Events Mental Objects and Modal Logic

10.5

Reasoning Systems for Categories

10.6

Reasoning with Default Information

Summary

Bibliographical

and Historical NOes . . . . .. ..o oot oot

11 Automated Planning 11.1

112

11.3 11.4 11.5 11.6

Definition of Classical Planning

Algorithms for Classical Planning

Heuristics for Planning Hierarchical Planning . . Planning and Acting in Nondeterministic Domains Time, Schedules, and Resources

117 Analysis of Planning Approaches Summary ... Bibliographical and Historical Notes . . . . . . ... IV

. . . .

ooor

oot

Uncertain knowledge and reasoning

12 Quantifying Uncertainty

12.1

Acting under Uncertainty

12.2 12.3

Basic Probability Notation Inference Using Full Joint Distributions . .

12.5 12.6 12.7

Bayes’ Rule and Its Use . Naive Bayes Models . . . The Wumpus World Revisited

124

Independence

Summary

Bibliographical and Historical Notes

. . . . .. ..................

13 Probabilistic Reasoning

13.1

13.2 13.3

134

13.5

Representing Knowledge in an Uncertain Domain . . . .

The Semantics of Bayesian Networks . . . .. ... ... Exact Inference in Bayesian Networks

Approximate Inference for Bayesian Networks . . . . . . Causal Networks . . . . .

Summary

... .........

Bibliographical and Historical Notes 14 Probabilistic Reasoning over Time 14.1 Time and Uncertainty . . 14.2 Inference in Temporal Models

. . . . . . .

xiii

xiv

Contents

14.3 Hidden Markov Models 144 Kalman Filters . . . . . 14.5 Dynamic Bayesian Networks Summary Bibliographic and Historical Notes 15 Probabilistic Programming 15.1 Relational Probability Models 152 Open-Universe Probability Models

153

. . .

Keeping Track ofa Complex World . . .

Summary Bibliographic 16 Making Simple Decisions 16.1 16.2 16.3 16.4 16.5 16.6 16.7

Combining Beliefs and Desires under Uncertainty The Basis of Utility Theory .

. . . .. ........

Multiattribute Utility Functions Decision Networks . . . . .. ...... The Value of Information Unknown Preferences . . .

Summary Bibliographical and Historical Notes . . . . . . 17

18

Making Complex Decisions 17.1 Sequential Decision Problems 17.2 Algorithms for MDPs . 17.3 Bandit Problems . . . . 17.4 Partially Observable MDPs 17.5 Algorithms for Solving POMDPs . . . . Summary Bibliographical and Historical Notes . . . . . . Multiagent Decision Making 18.1

182 18.3

18.4

Properties of Multiagent Environments

Non-Cooperative Game Theory Cooperative Game Theory

. . . . .. ... ..........

Making Collective Decisions

Summary Bibliographical and Historical Notes . . . . . . . ... ..............

562 562 572 581 588 590 595 596 599 599 605 626 632 645 646

Machine Learning

19 Learning from Examples 19.1

FormsofLearning

. . .. ..........................

651

Contents

21

22

192 Supervised Learning. . . 193 Learning Decision Trees . 19.4 Model Selection and Optimization 19.5 The Theory of Learning . . . 19.6 Lincar Regression and Classification 19.7 Nonparametric Models 19.8 Ensemble Learning 19.9 Developing Machine Learning Systems . . Summary Bibliographical and Historical Notes Learning Probabilistic Models 20.1 Statistical Leaming . . .« ..o i i it 202 Leaming with Complete Data 203 Leaming with Hidden Variables: The EM Algorithm . . . Summary Bibliographical and Historical Notes Deep Learning 211 Simple Feedforward Networks . . ..« oo oo ooounittt o 212 Computation Graphs for Decp Learning 213 Convolutional Networks . . . . . . . . . . 214 Leaming Algorithms. . . 215 GeneraliZation . . . ... i i i 216 Recurrent Neural Networks - - .« oo« oo otot ot 217 Unsupervised Learning and Transfer Learning . . . . . . . ... ... .. 208 APPHCALONS « .« o« e e e e SUMMATY .« o o oo oo e e e e et Bibliographical and Historical Notes . . . . .« .. ... ..o ooooooo. Reinforcement Learning 22.1 Leaming from Rewards . . . . ..o oo oott it 222 Passive Reinforcement Learning 223 Active Reinforcement Learning 224 Generalization in Reinforcement Learning 225 Policy Search 22.6 Apprenticeship and Inverse Reinforcement Learning . . . 227 Applications of Reinforcement Learning SUMMALY - o o v vveeeerneneenenns Bibliographical and Historical Notes

VI

653 657 665 672 676 686 696 704 714 715 721 721 724 737 746 747 750 751 756 760 765 768 772 775 782 784 785 789 789 791 797 803 810 812 815 818 819

Communicating, perceiving, and acting

23 Natural Language Processing 23.1 Language Models 23.2

Grammar

823 823

833

xv

xvi

Contents 233 23.4

Parsing . ........ Augmented Grammars .

235 Complications of Real Natural Language 23.6 Natural Language Tasks . . . . . . . .. Summary Bibliographical and Historical Notes

24 Deep Learning for Natural Language Processing 24.1 24.2

Word Embeddings . . . .. .............. .. ... ..., Recurrent Neural Networks for NLP

24.4

The Transformer Architecture

243 245

24.6

Sequence-to-Sequence Models

Pretraining and Transfer Learning . . . . State of the art

Summary ... ..... ...

Bibliographical and Historical Notes 25

. . . . .. ..................

835 841

845 849 850 851

856

856 860

864 868

871

875

878

878

Computer Vision 25.1 Introduction 25.2 Image Formation . . . .

881 881 882

256

901

253 254 255

Simple Image Features Classifying Images . . . Detecting Objects . . .

The3D World

888 895 899

. . ...

25.7 Using Computer Vision Summary ... ... ...

906 919

Bibliographical and Historical Notes

920

26 Robotics 26.1 Robots 26.2 Robot Hardware . . . . 26.3 What kind of problem is robotics solving? 26.4 Robotic Perception . . . 26.5 Planning and Control 26.6 Planning Uncertain Movements . . . . . 26.7 Reinforcement Learning in Robotics 26.8 Humans and Robots 26.9 Alternative Robotic Frameworks 26.10 Application Domains

Summary ... ...

...

Bibliographical and Historical Notes

VII

. . . . .. ..................

925 925 926 930 931 938 956 958 961 968 971

974 975

Conclusions

27 Philosophy, Ethics, and Safety of AI 27.1

TheLimitsof AL . . ... ..........................

981

981

Contents 27.2 27.3

Can Machines Really Think? The Ethics of AT . .

.

Bibliographical and Historical Notes

.

Summary

. ... .....

L.

984 986

1005

1006

28 The Future of AT 28.1 28.2

A

B

. .

1012 1018

Mathematical Background

1023

A2 Vectors, Matrices, and Linear Algebra A3 Probability Distributions . . . . ... ... e L .. Bibliographical and Historical Notes . . . . .. ..................

1025 1027 1029

Al

Al Components Al Architectures

1012

Complexity Analysis and O() Notation . .

Notes on Languages and Algorithms B.1 B.2 B.3

Defining Languages with Backus—Naur Form (BNF) Describing Algorithms with Pseudocode . Online Supplemental Material . . . . . . .

Bibliography

Index

.

1023

1030

1030 1031 1032

1033

1069

xvii

This page intentionally left blank

G

1

INTRODUCTION In which we try to explain why we consider artificial intelligence to be a subject most

worthy of study, and in which we try to decide what exactly it is, this being a good thing to

decide before embarking.

We call ourselves Homo sapiens—man the wise—because our intelligence is so important

to us. For thousands of years, we have tried to understand how we think and act—that is,

how our brain, a mere handful of matter, can perceive, understand, predict, and manipulate a world far larger and more complicated than itself. The field of artificial intelligence, or Al

is concerned with not just understanding but also building intelligent entities—machines that

Intelligence Artificial intelligence

can compute how to act effectively and safely in a wide variety of novel situations.

Surveys regularly rank Al as one of the most interesting and fastest-growing fields, and it is already generating over a trillion dollars a year in revenue. Al expert Kai-Fu Lee predicts

that its impact will be “more than anything in the history of mankind.” Moreover, the intel-

lectual frontiers of Al are wide open. Whereas a student of an older science such as physics

might feel that the best ideas have already been discovered by Galileo, Newton, Curie, Ein-

stein, and the rest, Al still has many openings for full-time masterminds.

Al currently encompasses a huge variety of subfields, ranging from the general (learning,

reasoning, perception, and so on) to the specific, such as playing chess, proving mathematical theorems, writing poetry, driving a car, or diagnosing diseases. Al is relevant to any intellectual task; it is truly a universal field.

1.1

What

Is AI?

‘We have claimed that Al is interesting, but we have not said what it is. Historically, researchers have pursued several different versions of Al Some have defined intelligence in

terms of fidelity to human performance, while others prefer an abstract, formal definition of

intelligence called rationality—loosely speaking, doing the “right thing.” The subject matter Rationality itself also varies: some consider intelligence to be a property of internal thought processes

and reasoning, while others focus on intelligent behavior, an external characterization.! From these two dimensions—human

vs. rational® and thought vs. behavior—there are

four possible combinations, and there have been adherents and research programs for all 1" Inthe public eye, there i sometimes confusion between the terms “artificial intelligence” and “machine learning” Machine learning is a subfield of Al that studies the ability to improve performance based on experience. Some Al systems use machine learning methods to achieve competence, but some do not. 2 We are not suggesting that humans are “irrational” in the dictionary sense of “deprived of normal mental clarity” We are merely conceding that human decisions are not always mathematically perfect.

Chapter 1 Introduction

four. The methods used are necessarily different: the pursuit of human-like intelligence must be in part an empirical science related to psychology, involving observations and hypotheses about actual human behavior and thought processes; a rationalist approach, on the other hand, involves a combination of mathematics and engineering, and connecs to statistics, control theory, and economics. The various groups have both disparaged and helped cach other. Let us look at the four approaches in more detail. 1.1.1

Turing test

Acting humanly:

The Turing test approach

The Turing test, proposed by Alan Turing (1950), was designed as a thought experiment that would sidestep the philosophical vagueness of the question “Can a machine think?” A com-

puter passes the test if a human interrogator, after posing some written questions, cannot tell

whether the written responses come from a person or from a computer. Chapter 27 discusses the details of the test and whether a computer would really be intelligent if it passed.

For

now, we note that programming a computer to pass a rigorously applied test provides plenty

Natural language processing Knowledge representation Automated reasoning Machine learning Total Turing test Computer vision Robotics

to work on. The computer would need the following capabilities:

o natural language processing to communicate successfully in a human language;

o knowledge representation to store what it knows or hears; « automated reasoning to answer questions and to draw new conclusions;

« machine learning to adapt to new circumstances and to detect and extrapolate patterns. Turing viewed the physical simulation of a person as unnecessary to demonstrate intelligence.

However, other researchers have proposed a total Turing test, which requires interaction with

objects and people in the real world. To pass the total Turing test, a robot will need * computer vision and speech recognition to perceive the world;

o robotics to manipulate objects and move about.

These six disciplines compose most of AL Yet Al researchers have devoted little effort to

passing the Turing test, believing that it is more important to study the underlying principles of intelligence. The quest for “artificial flight” succeeded when engineers and inventors stopped imitating birds and started using wind tunnels and learning about aerodynamics.

Aeronautical engineering texts do not define the goal of their field as making “machines that

fly so exactly like pigeons that they can fool even other pigeons.” 1.1.2

Thinking humanly:

The cognitive modeling approach

To say that a program thinks like a human, we must know how humans think. We can learn about human thought in three ways:

Introspection

+ introspection—trying to catch our own thoughts as they go by;

Bra

« brain imaging—observing the brain in action.

Psychological experiment

imaging

« psychological experiments—observing a person in action;

Once we have a sufficiently precise theory of the mind, it becomes possible to express the

theory as a computer program. If the program’s input—output behavior matches corresponding human behavior, that is evidence that some of the program’s mechanisms could also be

operating in humans.

For example, Allen Newell and Herbert Simon, who developed GPS, the “General Problem Solver” (Newell and Simon, 1961), were not content merely to have their program solve

Section 1.1

What Is AI?

problems correctly. They were more concerned with comparing the sequence and timing of its reasoning steps to those of human subjects solving the same problems.

The interdisci-

plinary field of cognitive science brings together computer models from Al and experimental techniques from psychology to construct precise and testable theories of the human mind.

Cognitive science

Cognitive science is a fascinating field in itself, worthy of several textbooks and at least

one encyclopedia (Wilson and Keil, 1999). We will occasionally comment on similarities or

differences between Al techniques and human cognition. Real cognitive science, however, is necessarily based on experimental investigation of actual humans or animals. We will leave

that for other books, as we assume the reader has only a computer for experimentation.

In the early days of Al there was often confusion between the approaches. An author would argue that an algorithm performs well on a task and that it is therefore a good model

of human performance, or vice versa. Modern authors separate the two kinds of claims; this

distinction has allowed both Al and cognitive science to develop more rapidly. The two fields fertilize each other, most notably in computer vision, which incorporates neurophysiological evidence into computational models.

Recently, the combination of neuroimaging methods

combined with machine learning techniques for analyzing such data has led to the beginnings

of a capability to “read minds™—that is, to ascertain the semantic content of a person’s inner

thoughts. This capability could, in turn, shed further light on how human cognition works. 1.1.3

Thinking rationally:

The

“laws of thought”

approach

The Greek philosopher Aristotle was one of the first to attempt to codify “right thinking”™— that is, irrefutable reasoning processes. His syllogisms provided patterns for argument struc-

tures that always yielded correct conclusions when given correct premises. The canonical

Syllogism

example starts with Socrates is a man and all men are mortal and concludes that Socrates is

mortal. (This example is probably due to Sextus Empiricus rather than Aristotle.) These laws of thought were supposed to govern the operation of the mind; their study initiated the field called logic.

Logicians in the 19th century developed a precise notation for statements about objects

in the world and the relations among them. (Contrast this with ordinary arithmetic notation,

which provides only for statements about numbers.) By 1965, programs could, in principle, solve any solvable problem described in logical notation.

The so-called logicist tradition

within artificial intelligence hopes to build on such programs to create intelligent systems.

Logicist

Logic as conventionally understood requires knowledge of the world that is certain—

a condition that, in reality, is seldom achieved. We simply don’t know the rules of, say, politics or warfare in the same way that we know the rules of chess or arithmetic. The theory

of probability fills this gap, allowing rigorous reasoning with uncertain information.

In Probability

principle, it allows the construction of a comprehensive model of rational thought, leading

from raw perceptual information to an understanding of how the world works to predictions about the future. What it does not do, is generate intelligent behavior.

theory of rational action. Rational thought, by itself, is not enough. 1.1.4

Acting ra

nally:

The

For that, we need a

rational agent approach

An agent is just something that acts (agent comes from the Latin agere, to do). Of course,

all computer programs do something, but computer agents are expected to do more: operate autonomously, perceive their environment, persist over a prolonged time period, adapt to

Agent

Chapter 1 Introduction Rational agent

change, and create and pursue goals. A rational agent is one that acts so as to achieve the best outcome or, when there is uncertainty, the best expected outcome.

In the “laws of thought” approach to Al the emphasis was on correct inferences. Making correct inferences is sometimes part of being a rational agent, because one way to act rationally is to deduce that a given action is best and then to act on that conclusion.

On the

other hand, there are ways of acting rationally that cannot be said to involve inference. For

example, recoiling from a hot stove is a reflex action that is usually more successful than a slower action taken after careful deliberation.

All the skills needed for the Turing test also allow an agent to act rationally. Knowledge

representation and reasoning enable agents to reach good decisions. We need to be able to

generate comprehensible sentences in natural language to get by in a complex society.

We

need learning not only for erudition, but also because it improves our ability to generate

effective behavior, especially in circumstances that are new.

The rational-agent approach to Al has two advantages over the other approaches. First, it

is more general than the “laws of thought™ approach because correct inference is just one of several possible mechanisms for achieving rationality. Second, it is more amenable to scien-

tific development. The standard of rationality is mathematically well defined and completely

general. We can often work back from this specification to derive agent designs that provably

achieve it—something that is largely impossible if the goal is to imitate human behavior or

thought processes. For these reasons, the rational-agent approach to Al has prevailed throughout most of the field’s history. In the early decades, rational agents were built on logical foundations and formed definite plans to achieve specific goals. Later, methods based on probability Do the

> right thing

Standard model

theory and machine learning allowed the creation of agents that could make decisions under

uncertainty to attain the best expected outcome.

In a nutshell, A/ has focused on the study

and construction of agents that do the right thing. What counts as the right thing is defined

by the objective that we provide to the agent. This general paradigm is so pervasive that we

might call it the standard model. It prevails not only in Al but also in control theory, where a

controller minimizes a cost function; in operations research, where a policy maximizes a sum of rewards; in statistics, where a decision rule minimizes a loss function; and in economics, where a decision maker maximizes utility or some measure of social welfare.

We need to make one important refinement to the standard model to account for the fact

that perfect rationality—always taking the exactly optimal action—is not feasible in complex

Limited rationality

environments. The computational demands are just too high. Chapters 5 and 17 deal with the

issue of limited rationality—acting appropriately when there is not enough time to do all the computations one might like. However, perfect rationality often remains a good starting point for theoretical analysis.

1.1.5

Benef

| machines

The standard model

has been a useful guide for Al research since its inception, but it is

probably not the right model in the long run. The reason is that the standard model assumes that we will supply a fully specified objective to the machine.

For an artificially defined task such as chess or shortest-path computation, the task comes

with an objective built in—so the standard model is applicable.

As we move into the real

world, however, it becomes more and more difficult to specify the objective completely and

Section 1.2

The Foundations of Artificial Intelligence

correctly. For example, in designing a self-driving car, one might think that the objective is to reach the destination safely. But driving along any road incurs a risk of injury due to other

errant drivers, equipment failure, and so on; thus, a strict goal of safety requires staying in the garage. There is a tradeoff between making progress towards the destination and incurring a risk of injury. How should this tradeoff be made? Furthermore, to what extent can we allow the car to take actions that would annoy other drivers? How much should the car moderate

its acceleration, steering, and braking to avoid shaking up the passenger? These kinds of questions are difficult to answer a priori. They are particularly problematic in the general area of human-robot interaction, of which the self-driving car is one example. The problem of achieving agreement between our true preferences and the objective we put into the machine is called the value alignment problem: the values or objectives put into

the machine must be aligned with those of the human. If we are developing an Al system in

Value alignment problem

the lab or in a simulator—as has been the case for most of the field’s history—there is an easy

fix for an incorrectly specified objective: reset the system, fix the objective, and try again. As the field progresses towards increasingly capable intelligent systems that are deployed

in the real world, this approach is no longer viable. A system deployed with an incorrect

objective will have negative consequences. more negative the consequences.

Moreover, the more intelligent the system, the

Returning to the apparently unproblematic example of chess, consider what happens if

the machine is intelligent enough to reason and act beyond the confines of the chessboard. In that case, it might attempt to increase its chances of winning by such ruses as hypnotiz-

ing or blackmailing its opponent or bribing the audience to make rustling noises during its

opponent’s thinking time.>

It might also attempt to hijack additional computing power for

itself. These behaviors are not “unintelligent” or “insane”; they are a logical consequence of defining winning as the sole objective for the machine. It is impossible to anticipate all the ways in which a machine pursuing a fixed objective might misbehave. There is good reason, then, to think that the standard model is inadequate.

‘We don’t want machines that are intelligent in the sense of pursuing their objectives; we want

them to pursue our objectives. If we cannot transfer those objectives perfectly to the machine, then we need a new formulation—one in which the machine is pursuing our objectives, but is necessarily uncertain as to what they are. When a machine knows that it doesn’t know the

complete objective, it has an incentive to act cautiously, to ask permission, to learn more about

our preferences through observation, and to defer to human control. Ultimately, we want agents that are provably beneficial to humans. We will return to this topic in Section 1.5.

The Foundations of Arti

| Intelligence

In this section, we provide a brief history of the disciplines that contributed ideas, viewpoints,

and techniques to Al Like any history, this one concentrates on a small number of people, events, and ideas and ignores others that also were important. We organize the history around

a series of questions. We certainly would not wish to give the impression that these questions are the only ones the disciplines address or that the disciplines have all been working toward Al as their ultimate fruition.

3 In one of the fi opponent’s eyes.”

books on chess, Ruy Lopez (1561) wrote, “Always place the board so the sun

Provably beneficial

Chapter 1 Introduction 1.2.1

Philosophy

« Can formal rules be used to draw valid conclusions? « How does the mind arise from a physical brain? « Where does knowledge come from? + How does knowledge lead to action?

Aristotle (384-322 BCE) was the first to formulate a precise set of laws governing the rational

part of the mind. He developed an informal system of syllogisms for proper reasoning, which in principle allowed one to generate conclusions mechanically, given initial premises. Ramon Llull (c. 1232-1315) devised a system of reasoning published as Ars Magna or The Great Art (1305). Llull tried to implement his system using an actual mechanical device:

a set of paper wheels that could be rotated into different permutations.

Around 1500, Leonardo da Vinci (1452-1519) designed but did not build a mechanical calculator; recent reconstructions have shown the design to be functional. The first known calculating machine was constructed around 1623 by the German scientist Wilhelm Schickard (1592-1635). Blaise Pascal (1623-1662) built the Pascaline in 1642 and wrote that it “produces effects which appear nearer to thought than all the actions of animals.” Gottfried Wilhelm Leibniz (1646-1716) built a mechanical device intended to carry out operations on concepts rather than numbers, but its scope was rather limited. In his 1651 book Leviathan, Thomas Hobbes (1588-1679) suggested the idea of a thinking machine, an “artificial animal”

in his words, arguing “For what is the heart but a spring; and the nerves, but so many strings; and the joints, but so many wheels.” He also suggested that reasoning was like numerical computation: “For ‘reason’ ... is nothing but ‘reckoning,” that is adding and subtracting.”

It’s one thing to say that the mind operates, at least in part, according to logical or nu-

merical rules, and to build physical systems that emulate some of those rules. It’s another to

say that the mind itself is such a physical system. René Descartes (1596-1650) gave the first clear discussion of the distinction between mind and matter. He noted that a purely physical conception of the mind seems to leave little room for free will. If the mind is governed en-

Dualism

tirely by physical laws, then it has no more free will than a rock “deciding” to fall downward. Descartes was a proponent of dualism.

He held that there is a part of the human mind (or

soul or spirit) that is outside of nature, exempt from physical laws. Animals, on the other hand, did not possess this dual quality; they could be treated as machines. An alternative to dualism is materialism, which holds that the brain’s operation accord-

ing to the laws of physics constitutes the mind. Free will is simply the way that the perception

of available choices appears to the choosing entity. The terms physicalism and naturalism are also used to describe this view that stands in contrast to the supernatural.

Empiricism

Given a physical mind that manipulates knowledge, the next problem is to establish the source of knowledge. The empiricism movement, starting with Francis Bacon’s (1561-1626) Novum Organum,* is characterized by a dictum of John Locke (1632-1704): “Nothing is in the understanding, which was not first in the senses.”

Induction

David Hume’s (1711-1776) A Treatise of Human Nature (Hume,

1739) proposed what

is now known as the principle of induction: that general rules are acquired by exposure to

repeated associations between their elements.

4 The Novum Organum is an update of Aristotle’s Organon, or instrument of thought.

Section 1.2

The Foundations of Artificial Intelligence

Building on the work of Ludwig Wittgenstein (1889-1951) and Bertrand Russell (1872—

1970), the famous Vienna Circle (Sigmund, 2017), a group of philosophers and mathemati-

cians meeting in Vienna in the 1920s and 1930s, developed the doctrine of logical positivism. This doctrine holds that all knowledge can be characterized by logical theories connected, ul-

timately, to observation sentences that correspond to sensory inputs; thus logical positivism combines rationalism and empiricism.

The confirmation theory of Rudolf Carnap (1891-1970) and Carl Hempel (1905-1997)

attempted to analyze the acquisition of knowledge from experience by quantifying the degree

Logical positivism Observation sentence

Confirmation theory

of belief that should be assigned to logical sentences based on their connection to observations that confirm or disconfirm them.

Carnap’s book The Logical Structure of the World (1928)

was perhaps the first theory of mind as a computational process. The final element in the philosophical

picture of the mind is the connection

between

knowledge and action. This question is vital to AT because intelligence requires action as well as reasoning. Moreover, only by understanding how actions are justified can we understand how to build an agent whose actions are justifiable (or rational).

Avistotle argued (in De Motu Animalium) that actions are justified by a logical connection

between goals and knowledge of the action’s outcome:

But how does it happen that thinking is sometimes accompanied by action and sometimes not, sometimes by motion, and sometimes not? It looks as if almost the same thing happens as in the case of reasoning and making inferences about unchanging objects. But in that case the end is a speculative proposition ... whereas here the conclusion which results from the two premises is an action. . . I need covering; a cloak is a covering. T need a cloak. What I need, I have to make; I need a cloak. I have to make a cloak. And the conclusion, the “T have to make a cloak.” is an action.

In the Nicomachean Ethics (Book IIL. 3, 1112b), Aristotle further elaborates on this topic, suggesting an algorithm:

‘We deliberate not about ends, but about means. For a doctor does not deliberate whether he shall heal, nor an orator whether he shall persuade, . . They assume the end and consider how and by what means it is attained, and if it scems casily and best produced thereby: while if it is achieved by one means only they consider how it will be achieved by this and by what means this will be achieved, till they come to the first cause, ... and what is last in the order of analysis seems to be first in the order of becoming. And if we come on an impossibility, we give up the search, e.g., if we need money and this cannot be got; but if a thing appears possible we try to do it. Aristotle’s algorithm was implemented 2300 years later by Newell and Simon in their General Problem Solver program. We would now call it a greedy regression planning system (see Chapter 11). Methods based on logical planning to achieve definite goals dominated the first few decades of theoretical research in AL

Thinking purely in terms of actions achieving goals is often useful but sometimes inapplicable. For example, if there are several different ways to achieve a goal, there needs to be some way to choose among them. More importantly, it may not be possible to achieve a goal with certainty, but some action must still be taken. How then should one decide? Antoine Ar-

nauld (1662), analyzing the notion of rational decisions in gambling, proposed a quantitative formula for maximizing the expected monetary value of the outcome. Later, Daniel Bernoulli

(1738) introduced the more general notion of utility to capture the internal, subjective value Utility

Chapter 1 Introduction of an outcome.

The modern notion of rational decision making under uncertainty involves

maximizing expected utility, as explained in Chapter 16.

Utilitarianism

In matters of ethics and public policy, a decision maker must consider the interests of multiple individuals. Jeremy Bentham (1823) and John Stuart Mill (1863) promoted the idea

of utilitarianism: that rational decision making based on maximizing utility should apply to all spheres of human activity, including public policy decisions made on behalf of many

individuals. Utilitarianism is a specific kind of consequentialism: the idea that what is right

and wrong is determined by the expected outcomes of an action. Deontological ethics

In contrast, Immanuel Kant, in 1875 proposed a theory of rule-based or deontological

ethics, in which “doing the right thing” is determined not by outcomes but by universal social

laws that govern allowable actions, such as “don’t lie” or “don’t kill.” Thus, a utilitarian could tell a white lie if the expected good outweighs the bad, but a Kantian would be bound not to,

because lying is inherently wrong. Mill acknowledged the value of rules, but understood them as efficient decision procedures compiled from first-principles reasoning about consequences. Many modern Al systems adopt exactly this approach.

1.2.2

Mathematics

+ What are the formal rules to draw valid conclusions?

« What can be computed?

+ How do we reason with uncertain information? Philosophers staked out some of the fundamental ideas of Al but the leap to a formal science

required the mathematization of logic and probability and the introduction of a new branch

Formal logic

of mathematics: computation. The idea of formal logic can be traced back to the philosophers of ancient Greece, India, and China, but its mathematical development really began with the work of George Boole (1815-1864), who worked out the details of propositional, or Boolean, logic (Boole, 1847). In 1879, Gottlob Frege (1848-1925) extended Boole’s logic to include objects and relations,

creating the first-order logic that is used today.> In addition to its central role in the early period of Al research, first-order logic motivated the work of Godel and Turing that underpinned

Probability

computation itself, as we explain below.

The theory of probability can be seen as generalizing logic to situations with uncertain

information—a consideration of great importance for Al Gerolamo Cardano (1501-1576)

first framed the idea of probability, describing it in terms of the possible outcomes of gam-

bling events.

In 1654, Blaise Pascal (1623-1662), in a letter to Pierre Fermat (1601-1665),

showed how to predict the future of an unfinished gambling game and assign average pay-

offs to the gamblers.

Probability quickly became an invaluable part of the quantitative sci-

ences, helping to deal with uncertain measurements and incomplete theories. Jacob Bernoulli (1654-1705, uncle of Daniel), Pierre Laplace (1749-1827), and others advanced the theory and introduced new statistical methods. Thomas Bayes (1702-1761) proposed a rule for updating probabilities in the light of new evidence; Bayes’ rule is a crucial tool for Al systems. Statistics

The formalization of probability, combined with the availability of data, led to the emergence of statistics as a field. One of the first uses was John Graunt’s analysis of Lon-

5 Frege’s proposed notation for first-order logic—an arcane combination of textual and geometric features— never became popular.

Section 1.2

don census data in 1662.

The Foundations of Artificial Intelligence

Ronald Fisher is considered the first modern statistician (Fisher,

1922). He brought together the ideas of probability, experiment design, analysis of data, and

computing—in 1919, he insisted that he couldn’t do his work without a mechanical calculator called the MILLIONAIRE

(the first calculator that could do multiplication), even though the

cost of the calculator was more than his annual ry (Ross, 2012). The history of computation is as old as the history of numbers, but the first nontrivial

algorithm is thought to be Euclid’s algorithm for computing greatest common divisors. The word algorithm comes from Muhammad

Algorithm

ibn Musa al-Khwarizmi, a 9th century mathemati-

cian, whose writings also introduced Arabic numerals and algebra to Europe. Boole and others discussed algorithms for logical deduction, and, by the late 19th century, efforts were under way to formalize general mathematical reasoning as logical deduction.

Kurt Godel (1906-1978) showed that there exists an effective procedure to prove any true

statement in the first-order logic of Frege and Russell, but that first-order logic could not cap-

ture the principle of mathematical induction needed to characterize the natural numbers. In 1931, Godel showed that limits on deduction do exist. His incompleteness theorem showed

that in any formal theory as strong as Peano arithmetic (the elementary theory of natural

Incompleteness theorem

numbers), there are necessarily true statements that have no proof within the theory.

This fundamental result can also be interpreted as showing that some functions on the

integers cannot be represented by an algorithm—that is, they cannot be computed.

This

motivated Alan Turing (1912-1954) to try to characterize exactly which functions are com-

putable—capable of being computed by an effective procedure. The Church-Turing thesis Computability proposes to identify the general notion of computability with functions computed by a Turing machine (Turing, 1936). Turing also showed that there were some functions that no Turing machine can compute. For example, no machine can tell in general whether a given program will return an answer on a given input or run forever. Although computability is important to an understanding of computation, the notion of

tractability has had an even greater impact on Al Roughly speaking, a problem is called Tractability intractable if the time required to solve instances of the problem grows exponentially with

the size of the instances. The distinction between polynomial and exponential growth in

complexity was first emphasized in the mid-1960s (Cobham, 1964; Edmonds, 1965). It is important because exponential growth means that even moderately large instances cannot be

solved in any reasonable time. The theory of NP-completeness, pioneered by Cook (1971) and Karp (1972), provides a NP-completeness basis for analyzing the tractability of problems: any problem class to which the class of NPcomplete problems can be reduced is likely to be intractable. (Although it has not been proved that NP-complete problems are necessarily intractable, most theoreticians believe it.) These results contrast with the optimism with which the popular press greeted the first computers—

“Electronic Super-Brains” that were “Faster than Einstein!” Despite the increasing speed of computers, careful use of resources and necessary imperfection will characterize intelligent systems. Put crudely, the world is an extremely large problem instance!

1.2.3

Economics

« How should we make decisions in accordance with our preferences? « How should we do this when others may not go along? « How should we do this when the payoff may be far in the future?

10

Chapter 1 Introduction The science of economics originated in 1776, when Adam Smith (1723-1790) published An

Inquiry into the Nature and Causes of the Wealth of Nations. Smith proposed to analyze economies as consisting of many individual agents attending to their own interests. Smith

was not, however, advocating financial greed as a moral position: his earlier (1759) book The

Theory of Moral Sentiments begins by pointing out that concern for the well-being of others is an essential component of the interests of every individual.

Most people think of economics as being about money, and indeed the first mathemati-

cal analysis of decisions under uncertainty, the maximum-expected-value formula of Arnauld (1662), dealt with the monetary value of bets. Daniel Bernoulli (1738) noticed that this formula didn’t seem to work well for larger amounts of money, such as investments in maritime

trading expeditions. He proposed instead a principle based utility, and explained human investment choices by proposing additional quantity of money diminished as one acquired more Léon Walras (pronounced “Valrasse™) (1834-1910) gave

on maximization of expected that the marginal utility of an money. utility theory a more general

foundation in terms of preferences between gambles on any outcomes (not just monetary

outcomes). The theory was improved by Ramsey (1931) and later by John von Neumann

and Oskar Morgenstern in their book The Theory of Games and Economic Behavior (1944).

Decision theory

Economics is no longer the study of money; rather it is the study of desires and preferences.

Decision theory, which combines probability theory with utility theory, provides a for-

mal and complete framework for individual decisions (economic or otherwise) made under

uncertainty—that is, in cases where probabilistic descriptions appropriately capture the de-

cision maker’s environment.

This is suitable for “large” economies where each agent need

pay no attention to the actions of other agents as individuals.

For “small” economies, the

situation is much more like a game: the actions of one player can significantly affect the utility of another (either positively or negatively). Von Neumann and Morgenstern’s develop-

ment of game theory (see also Luce and Raiffa, 1957) included the surprising result that, for some games, a rational agent should adopt policies that are (or least appear to be) randomized. Unlike decision theory, game theory does not offer an unambiguous prescription for selecting actions. In AL decisions involving multiple agents are studied under the heading of multiagent systems (Chapter 18).

Economists, with some exceptions, did not address the third question listed above: how to

Operations research

make rational decisions when payoffs from actions are not immediate but instead result from

several actions taken in sequence. This topic was pursued in the field of operations research,

which emerged in World War II from efforts in Britain to optimize radar installations, and later

found innumerable civilian applications. The work of Richard Bellman (1957) formalized a

class of sequential decision problems called Markov decision processes, which we study in Chapter 17 and, under the heading of reinforcement learning, in Chapter 22. Work in economics and operations research has contributed much to our notion of rational

agents, yet for many years Al research developed along entirely separate paths. One reason was the apparent complexity of making rational decisions. The pioneering Al researcher

Satisfcing

Herbert Simon (1916-2001) won the Nobel Prize in economics in 1978 for his early work

showing that models based on satisficing—making decisions that are “good enough,” rather than laboriously calculating an optimal decision—gave a better description of actual human behavior (Simon, 1947). Since the 1990s, there has been a resurgence of interest in decisiontheoretic techniques for AL

Section 1.2

1.2.4

The Foundations of Artificial Intelligence

11

Neuroscience

+ How do brains process information?

Neuroscience is the study of the nervous system, particularly the brain. Although the exact Neuroscience way in which the brain enables thought is one of the great mysteries of science, the fact that it

does enable thought has been appreciated for thousands of years because of the evidence that strong blows to the head can lead to mental incapacitation. It has also long been known that human brains are somehow different; in about 335 BCE Aristotle wrote, “Of all the animals,

man has the largest brain in proportion to his size.”® Still, it was not until the middle of the

18th century that the brain was widely recognized as the seat of consciousness. Before then,

candidate locations included the heart and the spleen.

Paul Broca’s (1824-1880) investigation of aphasia (speech deficit) in brain-damaged patients in 1861 initiated the study of the brain’s functional organization by identifying a localized area in the left hemisphere—now called Broca’s area—that is responsible for speech production.” By that time, it was known that the brain consisted largely of nerve cells, or neurons, but it was not until 1873 that Camillo Golgi (1843-1926) developed a staining technique Neuron allowing the observation of individual neurons (see Figure 1.1). This technique was used by Santiago Ramon y Cajal (1852-1934) in his pioneering studies of neuronal organization.® It is now widely accepted that cognitive functions result from the electrochemical operation of these structures.

That is, a collection of simple cells can lead to thought, action, and

consciousness. In the pithy words of John Searle (1992), brains cause minds.



Actuators

Agent

Figure 2.14 A model-based. utility-based agent. It uses a model of the world, along with a utility function that measures its preferences among states of the world. Then it chooses the action that leads to the best expected utility, where expected utility is computed by averaging over all possible outcome states, weighted by the probability of the outcome. aim for, none of which can be achieved with certainty, utility provides a way in which the likelihood of success can be weighed against the importance of the goals. Partial observability and nondeterminism are ubiquitous in the real world, and so, there-

fore, is decision making under uncertainty. Technically speaking, a rational utility-based agent chooses the action that maximizes the expected utility of the action outcomes—that

is, the utility the agent expects to derive, on average, given the probabilities and utilities of each outcome. (Appendix A defines expectation more precisely.) In Chapter 16, we show that any rational agent must behave as if it possesses a utility function whose expected value

it tries to maximize. An agent that possesses an explicit utility function can make rational de-

cisions with a general-purpose algorithm that does not depend on the specific utility function

being maximized. In this way, the “global” definition of rationality—designating as rational

those agent functions that have the highest performance—is turned into a “local” constraint

on rational-agent designs that can be expressed in a simple program. The utility-based agent structure appears in Figure 2.14.

Utility-based agent programs

appear in Chapters 16 and 17, where we design decision-making agents that must handle the uncertainty inherent in nondeterministic or partially observable environments. Decision mak-

ing in multiagent environments is also studied in the framework of utility theory, as explained in Chapter 18. At this point, the reader may be wondering, “Is it that simple? We just build agents that maximize expected utility, and we're done?”

It’s true that such agents would be intelligent,

but it’s not simple. A utility-based agent has to model and keep track of its environment,

tasks that have involved a great deal of research on perception, representation, reasoning,

and learning. The results of this research fill many of the chapters of this book. Choosing

the utility-maximizing course of action is also a difficult task, requiring ingenious algorithms

that fill several more chapters.

Even with these algorithms, perfect rationality is usually

Expected utility

56

Chapter 2 Intelligent Agents Performance standard

Cride

|-

Sensors =

Learning element learning goals Problem generator Agent

changes knowledge

Performance element

JUSWIUOIAUF

feedback ‘

Actuators

Figure 2.15 A general leaning agent. The “performance element” box represents what we have previously considered to be the whole agent program. Now, the “learning element” box gets to modify that program to improve its performance. unachievable in practice because of computational complexity, as we noted in Chapter 1. We. Model-free agent

also note that not all utility-based agents are model-based; we will see in Chapters 22 and 26 that a model-free agent can learn what action is best in a particular situation without ever

learning exactly how that action changes the environment.

Finally, all of this assumes that the designer can specify the utility function correctly; Chapters 17, 18, and 22 consider the issue of unknown utility functions in more depth. 2.4.6

Learning agents

We have described agent programs with various methods for selecting actions. We have not, so far, explained how the agent programs come into being. Tn his famous early paper, Turing (1950) considers the idea of actually programming his intelligent machines by hand. He estimates how much work this might take and concludes, “Some more expeditious method seems desirable.”

them.

The method he proposes is to build learning machines and then to teach

In many areas of Al this is now the preferred method for creating state-of-the-art

systems. Any type of agent (model-based, goal-based, utility-based, etc.) can be built as a learning agent (or not). Learning has another advantage, as we noted earlier: it allows the agent to operate in initially unknown environments and to become more competent than its initial knowledge alone

might allow. In this section, we briefly introduce the main ideas of learning agents. Through-

out the book, we comment on opportunities and methods for learning in particular kinds of

Learning element

Performance element

agents. Chapters 19-22 go into much more depth on the learning algorithms themselves. A learning agent can be divided into four conceptual components, as shown in Figure 2.15.

The most important distinction is between the learning element,

which is re-

sponsible for making improvements, and the performance element, which is responsible for selecting external actions. The performance element is what we have previously considered

Section 2.4 The Structure of Agents 1o be the entire agent: it takes in percepts and decides on actions. The learning element uses feedback from the critic on how the agent is doing and determines how the performance element should be modified to do better in the future.

57

Critic

The design of the learning element depends very much on the design of the performance

element. When trying to design an agent that learns a certain capability, the first question is

not “How am I going to get it to learn this?” but “What kind of performance element will my

agent use to do this once it has learned how?” Given a design for the performance element,

learning mechanisms can be constructed to improve every part of the agent.

The critic tells the learning element how well the agent is doing with respect to a fixed

performance standard. The critic is necessary because the percepts themselves provide no indication of the agent’s success. For example, a chess program could receive a percept indicating that it has checkmated its opponent, but it needs a performance standard to know that this is a good thing; the percept itself does not say so. It is important that the performance

standard be fixed. Conceptually, one should think of it as being outside the agent altogether because the agent must not modify it to fit its own behavior.

The last component of the learning agent is the problem generator. It is responsible

for suggesting actions that will lead to new and informative experiences. If the performance

Problem generator

element had its way, it would keep doing the actions that are best, given what it knows, but if the agent is willing to explore a little and do some perhaps suboptimal actions in the short

run, it might discover much better actions for the long run. The problem generator’s job is to suggest these exploratory actions. This is what scientists do when they carry out experiments.

Galileo did not think that dropping rocks from the top of a tower in Pisa was valuable in itself.

He was not trying to break the rocks or to modify the brains of unfortunate pedestrians. His

aim was to modify his own brain by identifying a better theory of the motion of object The learning element can make changes to any of the “knowledge” components shown in the agent diagrams (Figures 2.9, 2.11, 2.13, and 2.14). The simplest cases involve learning directly from the percept sequence. Observation of pairs of successive states of the environ-

ment can allow the agent to learn “What my actions do” and “How the world evolves™ in

response (o its actions. For example, if the automated taxi exerts a certain braking pressure

when driving on a wet road, then it will soon find out how much deceleration is actually

achieved, and whether it skids off the road. The problem generator might identify certain parts of the model that are in need of improvement and suggest experiments, such as trying out the brakes on different road surfaces under different conditions.

Improving the model components of a model-based agent so that they conform better

with reality is almost always a good idea, regardless of the external performance standard.

(In some cases, it is better from a computational point of view to have a simple but slightly inaccurate model rather than a perfect but fiendishly complex model.)

Information from the

external standard is needed when trying to learn a reflex component or a utility function.

For example, suppose the taxi-driving agent receives no tips from passengers who have

been thoroughly shaken up during the trip. The external performance standard must inform

the agent that the loss of tips is a negative contribution to its overall performance; then the

agent might a sense, the (or penalty) performance

be able to learn that violent maneuvers do not contribute to its own utility. In performance standard distinguishes part of the incoming percept as a reward Reward that provides direct feedback on the quality of the agent’s behavior. Hard-wired Penalty standards such as pain and hunger in animals can be understood in this way.

Chapter 2 Intelligent Agents More example, settles on know it’s

generally, human choices can provide information about human preferences. For suppose the taxi does not know that people generally don’t like loud noises, and the idea of blowing its horn continuously as a way of ensuring that pedestrians coming. The consequent human behavior—covering ears, using bad language, and

possibly cutting the wires to the horn—would provide evidence to the agent with which to update its utility function. This

issue is discussed further in Chapter 22.

In summary, agents have a variety of components, and those components

can be repre-

sented in many ways within the agent program, so there appears to be great variety among

learning methods. There is, however, a single unifying theme. Learning in intelligent agents can be summarized as a process of modification of each component of the agent to bring the components into closer agreement with the available feedback information, thereby improv-

ing the overall performance of the agent. 2.4.7

How the components of agent programs work

We have described agent programs (in very high-level terms) as consisting of various compo-

nents, whose function it is to answer questions such as: “What is the world like now?” “What action should I do now?” “What do my actions do?” The next question for a student of Al

is, “How on Earth do these components work?” It takes about a thousand pages to begin to answer that question properly, but here we want to draw the reader’s attention to some basic

distinctions among the various ways that the components can represent the environment that the agent inhabits. Roughly speaking, we can place the representations along an axis of increasing complexity and expressive power—atomic, factored, and structured. To illustrate these ideas, it helps

to consider a particular agent component, such as the one that deals with “What my actions

do.” This component describes the changes that might occur in the environment as the result

(a) Atomic

oIIQ.

of taking an action, and Figure 2.16 provides schematic depictions of how those transitions might be represented.

mII.l

58

(b) Factored

() Structured

Figure 2.16 Three ways to represent states and the transitions between them. (a) Atomic

representation: a state (such as B or C) is a black box with no internal structure; (b) Factored representation: a state consists ofa vector of attribute values; values can be Boolean, real-

valued, or one of a fixed set of symbols. (c) Structured representation: a state includes objects, each of which may have attributes of its own as well s relationships to other objects.

Section 2.4 The Structure of Agents In an atomic representation each state of the world is indivisible—it has no internal

structure. Consider the task of finding a driving route from one end of a country to the other

59

Atomic representation

via some sequence of cities (we address this problem in Figure 3.1 on page 64). For the purposes of solving this problem, it may suffice to reduce the state of the world to just the name of the city we are in—a single atom of knowledge, a “black box whose only discernible property is that of being identical to or different from another black box. The standard algorithms

underlying search and game-playing (Chapters 3-5), hidden Markov models (Chapter 14), and Markov decision processes (Chapter 17) all work with atomic representations.

Factored representation each of which can have a value. Consider a higher-fidelity description for the same driving Variable problem, where we need to be concerned with more than just atomic location in one city or Attribute another; we might need to pay attention to how much gas is in the tank, our current GPS Value A factored representation splits up each state into a fixed set of variables or attributes,

coordinates, whether or not the oil warning light is working, how much money we have for

tolls, what station is on the radio, and so on. While two different atomic states have nothing in

common—they are just different black boxes—two different factored states can share some

attributes (such as being at some particular GPS location) and not others (such as having lots

of gas or having no gas); this makes it much easier to work out how to turn one state into an-

other. Many important areas of Al are based on factored representations, including constraint satisfaction algorithms (Chapter 6), propositional logic (Chapter 7), planning (Chapter 11), Bayesian networks (Chapters 12-16), and various machine learning algorithms. For many purposes, we need to understand the world as having things in it that are related to each other, not just variables with values. For example, we might notice that a large truck ahead of us is reversing into the driveway of a dairy farm, but a loose cow is blocking the truck’s path. A factored representation is unlikely to be pre-equipped with the at-

tribute TruckAheadBackingIntoDairyFarmDrivewayBlockedByLooseCow with value true or

Jfalse. Tnstead, we would need a structured representation, in which objects such as cows Structured representation and trucks and their various and varying relationships can be described explicitly (see Figure 2.16(c)).

Structured representations underlie relational databases and first-order logic

(Chapters 8, 9, and 10), first-order probability models (Chapter 15), and much of natural lan-

guage understanding (Chapters 23 and 24). In fact, much of what humans express in natural language concerns objects and their relationships.

As we mentioned earlier, the axis along which atomic, factored, and structured repre-

sentations lie is the axis of increasing expressiveness. Roughly speaking, a more expressive

representation can capture, at least as concisely, everything a less expressive one can capture,

Expressiveness

plus some more. Ofien, the more expressive language is much more concise; for example, the rules of chess can be written in a page or two of a structured-representation language such

as first-order logic but require thousands of pages when written in a factored-representation

language such as propositional logic and around 10*® pages when written in an atomic language such as that of finite-state automata. On the other hand, reasoning and learning become

more complex as the expressive power of the representation increases. To gain the benefits

of expressive representations while avoiding their drawbacks, intelligent systems for the real world may need to operate at all points along the axis simultaneously.

Another axis for representation involves the mapping of concepts to locations in physical

memory,

whether in a computer or in a brain.

If there is a one-to-one mapping between

concepts and memory locations, we call that a localist representation. On the other hand, Localist representation

Chapter 2 Intelligent Agents if the representation of a concept is spread over many memory locations, and each memory Distributed representation

location is employed as part of the representation of multiple different concepts, we call

that a distributed representation. Distributed representations are more robust against noise and information loss. With a localist representation, the mapping from concept to memory location is arbitrary, and if a transmission error garbles a few bits, we might confuse Truck

with the unrelated concept Truce. But with a

distributed representation, you can think of each

concept representing a point in multidimensional space, and if you garble a few bits you move

to a nearby point in that space, which will have similar meaning.

Summary

This chapter has been something of a whirlwind tour of AL which we have conceived of as the science of agent design. The major points to recall are as follows: « An agent is something that perceives and acts in an environment. The agent function for an agent specifies the action taken by the agent in response to any percept sequence.

+ The performance measure evaluates the behavior of the agent in an environment.

A

rational agent acts so as to maximize the expected value of the performance measure, given the percept sequence it has seen so far.

+ A task environment specification includes the performance measure, the external en-

vironment, the actuators, and the sensors. In designing an agent, the first step must always be to specify the task environment as fully as possible.

« Task environments vary along several significant dimensions. They can be fully or partially observable, single-agent or multiagent, deterministic or nondeterministic, episodic or sequential, static or dynamic, discrete or continuous, and known or unknown.

« In cases where the performance measure is unknown or hard to specify correctly, there

is a significant risk of the agent optimizing the wrong objective. In such cases the agent design should reflect uncertainty about the true objective. + The agent program implements the agent function. There exists a variety of basic

agent program designs reflecting the kind of information made explicit and used in the decision process. The designs vary in efficiency, compactness, and flexibility. The appropriate design of the agent program depends on the nature of the environment.

+ Simple reflex agents respond directly to percepts, whereas model-based reflex agents maintain internal state to track aspects of the world that are not evident in the current

percept. Goal-based agents act to achieve their goals, and utility-based agents try to

maximize their own expected “happiness.”

« All agents can improve their performance through learning.

Bibliographical and Historical Notes The central role of action in intelligence—the notion of practical reasoning—goes back at least as far as Aristotle’s Nicomachean Ethics.

Practical reasoning was also the subject of

McCarthy’s influential paper “Programs with Common Sense” (1958). The fields of robotics and control theory are, by their very nature, concerned principally with physical agents. The

Bibliographical and Historical Notes

61

concept of a controller in control theory is identical to that of an agent in AL Perhaps sur- Controller prisingly, Al has concentrated for most of its history on isolated components of agents—

question-answering systems, theorem-provers, vision systems, and so on—rather than on

whole agents. The discussion of agents in the text by Genesereth and Nilsson (1987) was an

influential exception. The whole-agent view is now widely accepted and is a central theme in

recent texts (Padgham and Winikoff, 2004; Jones, 2007; Poole and Mackworth, 2017).

Chapter 1 traced the roots of the concept of rationality in philosophy and economics. In

AL the concept was of peripheral interest until the mid-1980s, when it began to suffuse many discussions about the proper technical foundations of the field. A paper by Jon Doyle (1983)

predicted that rational agent design would come to be seen as the core mission of Al, while

other popular topics would spin off to form new disciplines. Careful attention to the properties of the environment and their consequences for rational agent design is most apparent in the control theory tradition—for example, cla:

control systems (Dorf and Bishop, 2004; Kirk, 2004) handle fully observable, deterministic environments; stochastic optimal control (Kumar and Varaiya, 1986; Bertsekas and Shreve,

2007) handles partially observable, stochastic environments; and hybrid control (Henzinger and Sastry, 1998; Cassandras and Lygeros, 2006) deals with environments containing both

discrete and continuous elements. The distinction between fully and partially observable en-

vironments is also central in the dynamic programming literature developed in the field of operations research (Puterman, 1994), which we discuss in Chapter 17.

Although simple reflex agents were central to behaviorist psychology (see Chapter 1),

most Al researchers view them as too simple to provide much leverage. (Rosenschein (1985)

and Brooks (1986) questioned this assumption; see Chapter 26.) A great deal of work has gone into finding efficient algorithms for keeping track of complex environments (Bar-

Shalom et al., 2001; Choset et al., 2005; Simon, 2006), most of it in the probabilistic setting. Goal-based agents are presupposed in everything from Aristotle’s view of practical rea-

soning to McCarthy’s early papers on logical Al Shakey the Robot (Fikes and Nilsson, 1971; Nilsson,

1984) was the first robotic embodiment of a logical, goal-based agent.

A

full logical analysis of goal-based agents appeared in Genesereth and Nilsson (1987), and a goal-based programming methodology called agent-oriented programming was developed by Shoham (1993). The agent-based approach is now extremely popular in software engineering (Ciancarini and Wooldridge, 2001). It has also infiltrated the area of operating systems, ‘where autonomic computing refers to computer systems and networks that monitor and con-

trol themselves with a perceive—act loop and machine learning methods (Kephart and Chess, 2003). Noting that a collection of agent programs designed to work well together in a true

multiagent environment necessarily exhibits modularity—the programs share no internal state

and communicate with each other only through the environment—it

is common

within the

field of multiagent systems to design the agent program of a single agent as a collection of autonomous sub-agents.

In some cases, one can even prove that the resulting system gives

the same optimal solutions as a monolithic design.

The goal-based view of agents also dominates the cognitive psychology tradition in the area of problem solving, beginning with the enormously influential Human Problem Solving (Newell and Simon, 1972) and running through all of Newell’s later work (Newell, 1990).

Goals, further analyzed as desires (general) and intentions (currently pursued), are central to the influential theory of agents developed by Michael Bratman (1987).

Autonomic computing

62

Chapter 2 Intelligent Agents As noted in Chapter 1, the development of utility theory as a basis for rational behavior goes back hundreds of years. In Al early research eschewed utilities in favor of goals, with some exceptions (Feldman and Sproull, 1977). The resurgence of interest in probabilistic

methods in the 1980s led to the acceptance of maximization of expected utility as the most general framework for decision making (Horvitz et al., 1988). The text by Pearl (1988) was

the first in Al to cover probability and utility theory in depth; its exposition of practical methods for reasoning and decision making under uncertainty was probably the single biggest factor in the rapid shift towards utility-based agents

in the 1990s (see Chapter 16). The for-

malization of reinforcement learning within a decision-theoretic framework also contributed

to this

shift (Sutton, 1988). Somewhat remarkably, almost all Al research until very recently

has assumed that the performance measure can be exactly and correctly specified in the form

of a utility function or reward function (Hadfield-Menell e al., 2017a; Russell, 2019).

The general design for learning agents portrayed in Figure 2.15 is classic in the machine

learning literature (Buchanan et al., 1978; Mitchell, 1997). Examples of the design, as em-

bodied in programs, go back at least as far as Arthur Samuel’s (1959, 1967) learning program for playing checkers. Learning agents are discussed in depth in Chapters 19-22. Some early papers on agent-based approaches are collected by Huhns and Singh (1998) and Wooldridge and Rao (1999). Texts on multiagent systems provide a good introduction to many aspects of agent design (Weiss, 2000a; Wooldridge, 2009). Several conference series devoted to agents began in the 1990s, including the International Workshop on Agent Theories, Architectures, and Languages (ATAL), the International Conference on Autonomous Agents (AGENTS), and the International Conference on Multi-Agent Systems (ICMAS). In

2002, these three merged to form the International Joint Conference on Autonomous Agents and Multi-Agent Systems (AAMAS).

From 2000 to 2012 there were annual workshops on

Agent-Oriented Software Engineering (AOSE). The journal Autonomous Agents and MultiAgent Systems was founded in 1998. Finally, Dung Beetle Ecology (Hanski and Cambefort, 1991) provides a wealth of interesting information on the behavior of dung beetles. YouTube

has inspiring video recordings of their activities.

TR 3

SOLVING PROBLEMS BY SEARCHING In which we see how an agent can look ahead to find a sequence of actions that will eventually achieve its goal.

‘When the correct action to take is not immediately obvious, an agent may need to to plan

ahead: 10 consider a sequence of actions that form a path to a goal state. Such an agent is called a problem-solving agent, and the computational process it undertakes is called search.

Problem-solving agents use atomic representations, as described in Section 2.4.7—that

is, states of the world are considered as wholes, with no internal structure visible to the

~5/ebiemsolvine

Search

problem-solving algorithms. Agents that use factored or structured representations of states are called planning agents and are discussed in Chapters 7 and 11.

We will cover several search algorithms. In this chapter, we consider only the simplest

environments: known.

episodic, single agent, fully observable, deterministic,

static, discrete, and

We distinguish between informed algorithms, in which the agent can estimate how

far it is from the goal, and uninformed algorithms, where no such estimate is available. Chapter 4 relaxes the constraints on environments, and Chapter 5 considers multiple agents. This chapter uses the concepts of asymptotic complexity (that is, O(n) notation). Readers

unfamiliar with these concepts should consult Appendix A. 3.1

Problem-Solving Agents

Imagine an agent enjoying a touring vacation in Romania. The agent wants to take in the sights, improve its Romanian, enjoy the nightlife, avoid hangovers, and so on. The decision problem is a complex one. Now, suppose the agent is currently in the city of Arad and has a nonrefundable ticket to fly out of Bucharest the following day.

The agent observes

street signs and sees that there are three roads leading out of Arad: one toward Sibiu, one to

Timisoara, and one to Zerind. None of these are the goal, so unless the agent is familiar with

the geography of Romania, it will not know which road to follow."

If the agent has no additional information—that is, if the environment is unknown—then

the agent can do no better than to execute one of the actions at random.

This

sad situation

is discussed in Chapter 4. In this chapter, we will assume our agents always have access to information about the world, such as the map in Figure 3.1. With that information, the agent

can follow this four-phase problem-solving process:

+ Goal formulation: The agent adopts the goal of reaching Bucharest.

Goals organize

behavior by limiting the objectives and hence the actions to be considered.

‘We are assuming that most readers are in the same position and can easily imagine themselves to be as clueles as our agent. We apologize to Romanian readers who are unable to take advantage of this pedagogic: device.

Goal formulation

Chapter 3 Solving Problems by Searching

75 Arad

Sibiu

0

Fagaras

18

Vashui Rimnicu Vileea

Timisoara

Urziceni

75 Drobeta

Bucharest 90

riu Figure 3.1 A simplified road map of part of Romania, with road distances in miles. Craiova

Problem formulation

Hirsova

Eforie

+ Problem formulation: The agent devises a description of the states and actions necessary to reach the goal—an abstract model of the relevant part of the world. For our

agent, one good model is to consider the actions of traveling from one city to an adja-

cent city, and therefore the only fact about the state of the world that will change due to

an action is the current city.

Search

+ Search:

Before taking any action in the real world, the agent simulates sequences of

actions in its model, searching until it finds a sequence of actions that reaches the goal. Such a sequence is called a solution. The agent might have to simulate multiple sequences that do not reach the goal, but eventually it will find a solution (such as going

Solution

from Arad to Sibiu to Fagaras to Bucharest), or it will find that no solution is possible.

Execution

|

+ Execution: The agent can now execute the actions in the solution, one at a time.

It is an important property that in a fully observable, deterministic, known environment, the

solution to any problem is a fixed sequence of actions: drive to Sibiu, then Fagaras, then

Bucharest. If the model is correct, then once the agent has found a solution, it can ignore its

Open-loop

Closed-loop

percepts while it is executing the actions—closing its eyes, so to speak—because the solution

is guaranteed to lead to the goal. Control theorists call this an open-loop system: ignoring the

percepts breaks the loop between agent and environment. If there is a chance that the model

is incorrect, or the environment is nondeterministic, then the agent would be safer using a

closed-loop approach that monitors the percepts (see Section 4.4).

In partially observable or nondeterministic environments, a solution would be a branching

strategy that recommends different future actions depending on what percepts arrive. For

example, the agent might plan to drive from Arad to Sibiu but might need a contingency plan

in case it arrives in Zerind by accident or finds a sign saying “Drum inchis” (Road Closed).

Section 3.1 3.1.1

Problem-Solving Agents

Search problems and solutions

Problem

A search problem can be defined formally as follows: * A set of possible states that the environment can be in. We call this the state space. + The initial state that the agent starts in. For example: Arad.

+ A set of one or more goal states. Sometimes there is one goal state (e.g., Bucharest), sometimes there is a small set of alternative goal states, and sometimes the goal is

defined by a property that applies to many states (potentially an infinite number). For example, in a vacuum-cleaner world, the goal might be to have no dirt in any location, regardless of any other facts about the state.

States State space Initial state Goal states,

We can account for all three of these

possibilities by specifying an Is-GOAL method for a problem. In this chapter we will

sometimes say “the goal” for simplicity, but what we say also applies to “any one of the

possible goal states.”

« The actions available to the agent. Given a state s, ACTIONS(s) returns a finite? set of actions that can be executed in 5. We say that each of these actions is applicable in s.

An example:

Action Applicable

ACTIONS (Arad) = {ToSibiu, ToTimisoara, ToZerind } + A transition model, which describes what each action does.

RESULT(s, a) returns the

state that results from doing action a in state s. For example,

Transition model

RESULT(Arad, ToZerind) = Zerind.

« Anaction cost function, denoted by ACTION-COST(s,a,s") when we are programming or ¢(s,a,s') when we are doing math, that gives the numeric cost of applying action a in state s to reach state s'.

Action cost function

A problem-solving agent should use a cost function that

reflects its own performance measure; for example, for route-finding agents, the cost of

an action might be the length in miles (as seen in Figure 3.1), or it might be the time it

takes to complete the action. A sequence of actions forms a path, and a solution is a path from the initial state to a goal

state. We assume that action costs are additive; that is, the total cost of a path is the sum of the

Path

individual action costs. An optimal solution has the lowest path cost among all solutions. In Optimal solution

this chapter, we assume that all action costs will be positive, to avoid certain complications.®

The state space can be represented as a graph in which the vertices are states and the Graph directed edges between them are actions. The map of Romania shown in Figure 3.1 is such a graph, where each road indicates two actions, one in each direction. 2 For problems with an infinite number of actions we would need techniques that go beyond this chapter. 3 Inany problem with a cycle of net negative cost, the cost-optimal solution is to go around that cycle an infinite number of times. The Bellman—Ford and Floyd-Warshall algorithms (not covered here) handle negative-cost actions, as long as there are no negative cycles. It is easy to accommodate zero-cost actions, as long as the number of consecutive zero-cost actions is bounded. For example, we might have a robot where there is a cost 1o move, but zero cost to rotate 90°; the algorithms in this chapter can handle this as long as no more than three consecutive 90° turns are allowed. There is also a complication with problems that have an infinite number of arbitrarily small action costs. Consi ion of Zeno's paradox where there is an action to move half way to the goal, at a cost of half of the previous move. This problem has no solution with a finite number of actions, but to prevent a search from taking an unbounded number of actions without quite reaching the goal, we can require that all action costs be at least , for some small positive value c.

66

Chapter 3 Solving Problems by Searching 3.1.2

Formulating problems

Our formulation of the problem of getting to Bucharest is a model—an abstract mathematical

description—and not the real thing. Compare the simple atomic state description Arad to an actual cross-country trip, where the state of the world includes so many things: the traveling companions, the current radio program, the scenery out of the window, the proximity of law enforcement officers, the distance to the next rest stop, the condition of the road, the weather,

the traffic, and so on.

Abstraction

All these considerations are left out of our model because they are

irrelevant to the problem of finding a route to Bucharest.

The process of removing detail from a representation is called abstraction.

A good

problem formulation has the right level of detail. If the actions were at the level of “move the

right foot forward a centimeter” or “turn the steering wheel one degree left,” the agent would

Level of abstraction

probably never find its way out of the parking lot, let alone to Bucharest.

Can we be more precise about the appropriate level of abstraction? Think of the abstract

states and actions we have chosen as corresponding to large sets of detailed world states and

detailed action sequences.

Now consider a solution to the abstract problem:

for example,

the path from Arad to Sibiu to Rimnicu Vilcea to Pitesti to Bucharest. This abstract solution

corresponds to a large number of more detailed paths. For example, we could drive with the radio on between Sibiu and Rimnicu Vilcea, and then switch it off for the rest of the trip.

The abstraction is valid if we can elaborate any abstract solution into a solution in the

more detailed world; a sufficient condition

is that for every detailed state that is “in Arad,”

there is a detailed path to some state that is “in Sibiu,” and so on.* The abstraction is useful if

carrying out each of the actions in the solution is easier than the original problem; in our case,

the action “drive from Arad to Sibiu™ can be carried out without further search or planning by a driver with average skill. The choice ofa good abstraction thus involves removing as much

detail as possible while retaining validity and ensuring that the abstract actions are easy to

carry out. Were it not for the ability to construct useful abstractions, intelligent agents would be completely swamped by the real world. 3.2

Example

Problems

The problem-solving approach has been applied to a vast array of task environments. We list Standardized problem Real-world problem

some of the best known here, distinguishing between standardized and real-world problems.

A standardized problem is intended to illustrate or exercise various problem-solving meth-

ods.

It can be given a concise, exact description and hence is suitable as a benchmark for

researchers to compare the performance of algorithms. A real-world problem, such as robot

navigation, is one whose solutions people actually use, and whose formulation is idiosyn-

cratic, not standardized, because, for example, each robot has different sensors that produce different data.

3.2.1 Grid world

Standardized problems

A grid world problem is a two-dimensional rectangular array of square cells in which agents can move from cell to cell. Typically the agent can move to any obstacle-free adjacent cell— horizontally or vertically and in some problems diagonally. Cells can contain objects, which

Section 3.2 Example Problems

67

Figure 3.2 The state-space graph for the two-cell vacuum world. There are 8 states and three actions for each state: L = Lefr, R = Right, S = Suck.

the agent can pick up, push, or otherwise act upon; a wall or other impassible obstacle in a cell prevents an agent from moving into that cell. The vacuum world from Section 2.1 can be formulated as a grid world problem as follows: e States:

A state of the world says which objects are in which cells.

For the vacuum

world, the objects are the agent and any dirt. In the simple two-cell version, the agent

can be in either of the two cells, and each call can either contain dirt or not, so there are 2-2-2 = 8 states (see Figure 3.2). In general, a vacuum environment with n cells has

n-2" states.

o Initial state: Any state can be designated as the initial state. ® Actions: In the two-cell world we defined three actions: Suck, move Lefr, and move Right. Tn a two-dimensional multi-cell world we need more movement actions. We

could add Upward and Downward, giving us four absolute movement actions, or we could switch to egocentric actions, defined relative to the viewpoint of the agent—for example, Forward, Backward, TurnRight, and TurnLefi. o Transition model: Suck removes any dirt from the agent’s cell; Forward moves the agent ahead one cell in the direction it is facing, unless it hits a wall, in which case

the action has no effect.

Backward moves the agent in the opposite direction, while

TurnRight and TurnLeft change the direction it is facing by 90°. * Goal states: The states in which every cell is clean. ® Action cost: Each action costs 1.

Another type of grid world is the sokoban puzzle, in which the agent’s goal is to push a

number of boxes, scattered about the grid, to designated storage locations. There can be at

most one box per cell. When an agent moves forward into a cell containing a box and there is an empty cell on the other side of the box, then both the box and the agent move forward.

Sckoban puzzle

68

Chapter 3 Solving Problems by Searching

Start State

Goal State

Figure 3.3 A typical instance of the 8-puzzle. The agent can’t push a box into another box or a wall. For a world with 2 non-obstacle cells

and b boxes, there are n x n!/(b!(n — b)!) states; for example on an 8 x 8 grid with a dozen

Sliding-tile puzzle

boxes, there are over 200 trillion states.

In a sliding-tile puzzle, a number of tiles (sometimes called blocks or pieces) are ar-

ranged in a grid with one or more blank spaces so that some of the tiles can slide into the blank space.

8-puzzle 15-puzzle

One variant is the Rush Hour puzzle, in which cars and trucks slide around a

6% 6 grid in an attempt to free a car from the traffic jam. Perhaps the best-known variant is the 8-puzzle (see Figure 3.3), which consists of a 3 x 3 grid with eight numbered tiles and

one blank space, and the 15-puzzle on a 4 x 4 grid. The object is to reach a specified goal state, such as the one shown on the right of the figure.

puzzle is as follows:

The standard formulation of the 8

o States: A state description specifies the location of each of the tiles.

o Initial state: Any state can be designated as the initial state. Note that a parity prop-

erty partitions the state space—any given goal can be reached from exactly half of the possible initial states (see Exercise 3.PART).

e Actions: While in the physical world it is a tile that slides, the simplest way of describ-

ing an action is to think of the blank space moving Left, Right, Up, or Down. If the blank is at an edge or corner then not all actions will be applicable.

« Transition model: Maps a state and action to a resulting state; for example, if we apply Left to the start state in Figure 3.3, the resulting state has the 5 and the blank switched.

Goal state: Although any state could be the goal, we typically specify a state with the numbers in order, as in Figure 3.3.

® Action cost: Each action costs 1.

Note that every problem formulation involves abstractions. The 8-puzzle actions are ab-

stracted to their beginning and final states, ignoring the intermediate locations where the tile

is sliding. We have abstracted away actions such as shaking the board when tiles get stuck

and ruled out extracting the tiles with a knife and putting them back again. We are left with a

description of the rules, avoiding all the details of physical manipulations.

Our final standardized problem was devised by Donald Knuth (1964) and illustrates how

infinite state spaces can arise. Knuth conjectured that starting with the number 4, a sequence

Section 3.2 Example Problems

of square root, floor, and factorial operations can reach any desired positive integer. For example, we can reach 5 from 4 as follows:

|

@y =s

The problem definition is simple: States: Positive real numbers. o Initial state: 4.

o Actions: Apply square root, floor, or factorial operation (factorial for integers only). o Transition model: As given by the mathematical definitions of the operations.

o Goal state: The desired positive integer. ® Action cost: Each action costs 1.

The state space for this problem is infinite: for any integer greater than 2 the factorial oper-

ator will always yield a larger integer. The problem is interesting because it explores very large numbers: the shortest path to 5 goes through (4!)! = 620,448,401,733,239,439,360,000.

Infinite state spaces arise frequently in tasks involving the generation of mathematical expressions, circuits, proofs, programs, and other recursively defined objects. 3.2.2

Real-world problems

We have already seen how the route-finding problem is defined in terms of specified locations and transitions along edges between them. Route-finding algorithms are used in a variety of applications.

Some, such as Web sites and in-car systems that provide driving

directions, are relatively straightforward extensions

of the Romania example.

(The main

complications are varying costs due to traffic-dependent delays, and rerouting due to road closures.) Others, such as routing video streams in computer networks, military operations planning, and airline travel-planning systems, involve much more complex specifications. Consider the airline travel problems that must be solved by a travel-planning Web site: o States: Each state obviously includes a location (e.g., an airport) and the current time. Furthermore, because the cost of an action (a flight segment) may depend on previous segments, their fare bases, and their status as domestic or international, the state must record extra information about these “historical” aspects. o Initial state: The user’s home airport.

® Actions: Take any flight from the current location, in any seat class, leaving after the

current time, leaving enough time for within-airport transfer if needed.

o Transition model: The state resulting from taking a flight will have the flight’s destination as the new location and the flight’s arrival time as the new time.

o Goal state: A destination city. Sometimes the goal can be more complex, such as “arrive at the destination on a nonstop flight.”

* Action cost: A combination of monetary cost, waiting time, flight time, customs and immigration procedures, seat quality, time of day, type of airplane, frequent-flyer re-

ward points, and 50 on.

69

70

Chapter 3 Solving Problems by Searching Commercial travel advice systems use a problem formulation of this kind, with many addi-

tional complications to handle the airlines’ byzantine fare structures. Any seasoned traveler

Touring problem Traveling salesperson problem (TSP)

knows, however, that not all air travel goes according to plan. A really good system should include contingency plans—what happens if this flight is delayed and the connection is missed? Touring problems describe a set of locations that must be visited, rather than a single

goal destination. The traveling salesperson problem (TSP) is a touring problem in which every city on a map must be visited. optimization

The aim is to find a tour with cost < C (or in the

version, to find a tour with the lowest cost possible).

An enormous

amount

of effort has been expended to improve the capabilities of TSP algorithms. The algorithms

can also be extended to handle fleets of vehicles. For example, a search and optimization algorithm for routing school buses in Boston saved $5 million, cut traffic and air pollution, and saved time for drivers and students (Bertsimas ez al., 2019). In addition to planning trips,

search algorithms have been used for tasks such as planning the movements of automatic

VLSI layout

circuit-board drills and of stocking machines on shop floors.

A VLSI layout problem requires positioning millions of components and connections on a chip to minimize area, minimize circuit delays, minimize stray capacitances, and maximize manufacturing yield. The layout problem comes after the logical design phase and is usually split into two parts: cell layout and channel routing. In cell layout, the primitive components of the circuit are grouped into cells, each of which performs some recognized function. Each cell has a fixed footprint (size and shape) and requires a certain number of connections to each of the other cells. The aim is to place the cells on the chip so that they do not overlap and so that there is room for the connecting wires to be placed between the cells. Channel

routing finds a specific route for each wire through the gaps between the cells. These search

Robot navigation

problems are extremely complex, but definitely worth solving.

Robot navigation is a generalization of the route-finding problem described earlier. Rather than following distinct paths (such as the roads in Romania), a robot can roam around,

in effect making its own paths. For a circular robot moving on a flat surface, the space is

essentially two-dimensional. When the robot has arms and legs that must also be controlled,

the search space becomes many-dimensional—one dimension for each joint angle. Advanced

techniques are required just to make the essentially continuous search space finite (see Chap-

ter 26). In addition to the complexity of the problem, real robots must also deal with errors

in their sensor readings and motor controls, with partial observability, and with other agents

Automatic assembly sequencing

that might alter the environment.

Automatic assembly sequencing of complex objects (such as electric motors) by a robot

has been standard industry practice since the 1970s. Algorithms first find a feasible assembly

sequence and then work to optimize the process. Minimizing the amount of manual human

labor on the assembly line can produce significant savings in time and cost. In assembly problems, the aim is to find an order in which to assemble the parts of some object.

If the

wrong order is chosen, there will be no way to add some part later in the sequence without

undoing some of the work already done. Checking an action in the sequence for feasibility is a difficult geometrical search problem closely related to robot navigation. Thus, the generation Protein design

of legal actions is the expensive part of assembly sequencing. Any practical algorithm must avoid exploring all but a tiny fraction of the state space. One important assembly problem is

protein design, in which the goal is to find a sequence of amino acids that will fold into a three-dimensional protein with the right properties to cure some disease.

Section 33 Search Algorithms 3.3 Search Algorithms A search algorithm takes a search problem as input and returns a solution, or an indication of ~Search algorithm failure. In this chapter we consider algorithms that superimpose a search tree over the state-

space graph, forming various paths from the initial state, trying to find a path that reaches a goal state. Each node in the search tree corresponds to a state in the state space and the edges

in the search tree correspond to actions. The root of the tree corresponds to the initial state of

Node

the problem.

It is important to understand the distinction between the state space and the search tree.

The state space describes the (possibly infinite) set of states in the world, and the actions that allow transitions from one state to another.

The search tree describes paths between

these states, reaching towards the goal. The search tree may have multiple paths to (and thus

‘multiple nodes for) any given state, but each node in the tree has a unique path back to the root (as in all trees). Figure 3.4 shows the first few steps in finding a path from Arad to Bucharest. The root

node of the search tree is at the initial state, Arad. We can expand the node, by considering

Figure 3.4 Three partial search trees for finding a route from Arad to Bucharest. Nodes that have been expanded are lavender with bold letters; nodes on the frontier that have been generated but not yet expanded are in green; the set of states corresponding to these two

types of nodes are said to have been reached. Nodes that could be generated next are shown

in faint dashed lines. Notice in the bottom tree there is a cycle from Arad to Sibiu to Arad; that can’t be an optimal path, so search should not continue from there.

Expand

72

Chapter 3 Solving Problems by Searching

Figure 3.5 A sequence of search trees generated by a graph search on the Romania problem of Figure 3.1. At each stage, we have expanded every node on the frontier, extending every path with all applicable actions that don’t result in a state that has already been reached. Notice that at the third stage, the topmost city (Oradea) has two successors, both of which have already been reached by other paths, so no paths are extended from Oradea.

(a)

(b)

©

Figure 3.6 The separation property of graph search, illustrated on a rectangular-grid problem. The frontier (green) separates the interior (lavender) from the exterior (faint dashed).

The frontier is the set of nodes (and corresponding states) that have been reached but not yet expanded; the interior s the set of nodes (and corresponding states) that have been expanded; and the exterior is the set of states that have not been reached. In (a), just the root has been

expanded. In (b), the top frontier node is expanded. In (c), the remaining successors of the root are expanded in clockwise order.

Generating Child node Successor node

the available ACTIONS

for that state, using the RESULT function to see where those actions

lead to, and generating a new node (called a child node or successor node) for each of the resulting states. Each child node has Arad as its parent node. Now we must choose which of these three child nodes to consider next.

This is the

Parent node

essence of search—following up one option now and putting the others aside for later. Sup-

Frontier Reached

panded nodes (outlined in bold). We call this the frontier of the search tree. We say that any state that has had a node generated for it has been reached (whether or not that node has been

Separator

pose we choose to expand Sibiu first. Figure 3.4 (bottom) shows the result: a set of 6 unex-

expanded). Figure 3.5 shows the search tree superimposed on the state-space graph.

Note that the frontier separates two regions of the state-space graph: an interior region

where every state has been expanded, and an exterior region of states that have not yet been

reached. This property is illustrated in Figure 3.6. 5 Some authors call the frontier the open list, which s both geographically less evocative and computationally less appropriate, because a queue is more efficient than a list here. Those authors use the term elosed list to refer 10 the set of previously expanded nodes, which in our terminology would be the reached nodes minus the frontier.

Section 33 Search Algorithms

73

function BEST-FIRST-SEARCH(problem, ) returns a solution node or failure node « NODE(STATE=problem.INITIAL)

frontier ® B D

c

G F G E D F G E Figure 3.8 Breadth-first scarch on a simple binary tree. At each stage, the node to be expanded next s indicated by the triangular marker.

Section 3.4

Uninformed Search Strategies

77

function BREADTH-FIRST-SEARCH(problem) returns a solution node or failure

node < NODE(problem.INITIAL) if problem.1s-GOAL(node. STATE) then return node frontier 4+ b = 0(b%) All the nodes remain in memory, so both time and space complexity are O(b). Exponential bounds like that are scary. As a typical real-world example, consider a problem with branching factor b = 10, processing speed 1 million nodes/second, and memory requirements of 1 A search to depth d = 10 would take less than 3 hours, but would require 10

terabytes of memory. The memory requirements are a bigger problem for breadth-first search than the execution time.

But time is still an important factor.

At depth d = 14, even with

infinite memory, the search would take 3.5 years. In general, exponential-complexity search problems cannot be solved by uninformed search for any but the smallest instances. 3.4.2

A A

Kbyte/node.

Dijkstra’s algorithm or uniform-cost search

‘When actions have different costs, an obvious choice is to use best-first search where the

evaluation function is the cost of the path from the root to the current node. This is called Di-

jkstra’s algorithm by the theoretical computer science community, and uniform-cost search

by the AT community. The idea is that while breadth-first search spreads out in waves of uni-

form depth—first depth 1, then depth 2, and so on—uniform-cost search spreads out in waves of uniform path-cost. The algorithm can be implemented as a call to BEST-FIRST-SEARCH with PATH-COST as the evaluation function, as shown in Figure 3.9.

Uniform-cost search

78

Chapter 3 Solving Problems by Searching

Bucharest Figure 3.10 Part of the Romania state space, selected to illustrate uniform-cost search.

Consider Figure 3.10, where the problem is to get from Sibiu to Bucharest. The succes-

sors of Sibiu are Rimnicu Vilcea and Fagaras, with costs 80 and 99, respectively. The least-

cost node, Rimnicu Vilcea, is expanded next, adding Pitesti with cost 80 +97=177.

The

least-cost node is now Fagaras, so it is expanded, adding Bucharest with cost 99 +211=310.

Bucharest is the goal, but the algorithm tests for goals only when it expands a node, not when it generates a node, so it has not yet detected that this is a path to the goal.

The algorithm continues on, choosing Pitesti for expansion next and adding a second path

to Bucharest with cost 80 +97 + 101 =278.

path in reached and is added to the frontier.

It has a lower cost, so it replaces the previous

It turns out this node now has the lowest cost,

50 it is considered next, found to be a goal, and returned. Note that if we had checked for a

goal upon generating a node rather than when expanding the lowest-cost node, then we would have returned a higher-cost path (the one through Fagaras).

The complexity of uniform-cost search is characterized in terms of C*, the cost of the

optimal solution,® and ¢, a lower bound on the cost of each action, with ¢ > 0.

Then the

algorithm’s worst-case time and space complexity is O(b'*1€"/ ¢>0;? cost-optimal if action costs are all identical; * if both directions are breadth-first

or uniform-cost. 3.5

Informed (Heuristic) Search Strategies

Informed search

This section shows how an informed search strategy—one that uses domain-specific hints

Heuristic function

The hints come in the form of a heuristic function, denoted /(n):'*

about the location of goals—can find solutions more efficiently than an uninformed strategy. h(n) = estimated cost of the cheapest path from the state at node n to a goal state.

For example, in route-finding problems, we can estimate the distance from the current state to a goal by computing the straight-line distance on the map between the two points. We study heuristics

and where they come from in more detail in Section 3.6.

10 It may seem odd that the heuristic function operates on a node, when all it really needs is the node’s state. It is traditional t0 use /(n) rather than h(s) to be consistent with the evaluation function J (n) and the path cost g(n).

Section 3.5 Arad Bucharest Craiova

366 0 160 242 161 176 77 151 226 244

Informed (Heuristic) Search Strategies

Mehadia Neamt Oradea Pitesti Rimnicu Vilcea Sibiu Timisoara Urziceni Vaslui Zerind

241 234 380 100 193 253 329 80 199 374

Figure 3.16 Values of is;p—straight-line distances to Bucharest. 3.5.1

Greedy best-first search

Greedy best-first search is a form of best-first search that expands first the node with the §eedy best-irst lowest /(n) value—the node that appears to be closest to the goal—on the grounds that this is likely to lead to a solution quickly. So the evaluation function f(n) = h(n).

Let us see how this works for route-finding problems in Romania; we use the straight-

line distance heuristic, which we will call hg;p. If the goal is Bucharest, we need to know the straight-line distances to Bucharest, which are shown in Figure 3.16. For example, hsip(Arad)=366. Notice that the values of hg;p cannot be computed from the problem description itself (that is, the ACTIONS

and RESULT functions).

Straight-ine distance

Moreover, it takes a certain

amount of world knowledge to know that g p is correlated with actual road distances and is, therefore, a useful heuristic.

Figure 3.17 shows the progress of a greedy best-first search using hg;p to find a path

from Arad to Bucharest.

The first node to be expanded from Arad will be Sibiu because the

heuristic says it is closer to Bucharest than is either Zerind or Timisoara. The next node to be

expanded will be Fagaras because it is now closest according to the heuristic. Fagaras in turn generates Bucharest, which is the goal. For this

particular problem, greedy best-first search

using hg;p finds a solution without ever expanding a node that is not on the solution path.

The solution it found does not have optimal cost, however: the path via Sibiu and Fagaras to Bucharest is 32 miles longer than the path through Rimnicu Vilcea and Pitesti. This is why

the algorithm is called “greedy”—on each iteration it tries to get as close to a goal as it can, but greediness can lead to worse results than being careful.

Greedy best-first graph search is complete in finite state spaces, but not in infinite ones. The worst-case time and space complexity is O(|V'|). With a good heuristic function, however, the complexity can be reduced substantially, on certain problems reaching O(bm). 3.5.2

A" search

The most common

informed search algorithm is A* search (pronounced “A-star search”), a

best-first search that uses the evaluation function

F(n) = g(n) +h(n) where g(n) is the path cost from the initial state to node n, and h(n) is the estimated cost of the shortest path from 7 to a goal state, so we have f(n) = estimated cost of the best path that continues from 7 to a goal.

A" search

86

Chapter 3 Solving Problems by Searching (a) The initial state

366

(b) After expanding Arad

Chmd>

b >

isoad>

253

CZerind>

329

374

(¢) After expanding Sibiu

(d) After expanding Fagaras

Figure 3.17 Stages in a greedy best-first tree-like search for Bucharest with the straight-line distance heuristic /iszp. Nodes are labeled with their A-values. In Figure 3.18, we show the progress of an A* search with the goal of reaching Bucharest.

The values of g are computed from the action costs in Figure 3.1, and the values of hgs.p are

given in Figure 3.16. Notice that Bucharest first appears on the frontier at step (e), but it is not selected for expansion (and thus not detected as a solution) because at f =450 it is not the lowest-cost node on the frontier—that would be Pitesti, at f=417.

Another way to say this

is that there might be a solution through Pitesti whose cost is as low as 417, so the algorithm will not settle for a solution that costs

450.

At step (f), a different path to Bucharest is now

the lowest-cost node, at f =418, so it is selected and detected as the optimal solution.

Admissible heuristic

A search is complete.!! Whether A* is cost-optimal depends on certain properties of

the heuristic.

A key property is admissibil

an admissible heuristic is one that never

overestimates the cost to reach a goal. (An admissible heuristic is therefore optimistic.) With

11 Again, assuming all action costs are > ¢ > 0, and the state space either has a solution or is finite.

Section 3.5

(a) The initial state

Informed (Heuristic) Search Strategies

366-0+366

(b) After expanding Arad 449754374

646=280+366 415-239+176 671-291+380 413-220+193

(d) After expanding Rimnicu Vilcea

SOI=3384253 4S0450+0

S26=366+160 4173174100 §53-300-253

(f) After expanding Pitesti 449754374

41841840 615-455+160 607=414+193

Figure 3.18 Stages in an A" search for Bucharest. Nodes are labeled with f = g+ /. The h values are the straight-line distances to Bucharest taken from Figure 3.16.

87

88

Chapter 3 Solving Problems by Searching

Figure 3.19 Triangle inequality: If the heuristic / is consistent, then the single number k()

will be less than the sum of the cost ¢(n,a,d’) of the action from n to n plus the heuristic

estimate h(n').

an admissible heuristic, A* is cost-optimal, which we can show with a proof by contradiction. Suppose the optimal path has cost C*, but the algorithm returns a path with cost C > C*. Then there must be some node 7 which is on the optimal path and is unexpanded (because if all the nodes on the optimal path had been expanded, then we would have returned that optimal solution). So then, using the notation g*(n) to mean the cost of the optimal path from the start to n, and h*(n) to mean the cost of the optimal path from # to the nearest goal, we have:

f(n) > € (otherwise n would have been expanded) f(n) = g(n)+h(n) f(n)

= g"(n)+h(n)

(by definition)

(because n is on an optimal path)

f(n) < g"(n)+"(n) (because of admissibility, h(n) < h*(n)) f(n) < € (by definition,C* = g*(n) + h*(n)) The first and last lines form a contradiction, so the supposition that the algorithm could return

Consistency

a suboptimal path must be wrong—it must be that A* returns only cost-optimal paths.

A slightly stronger property is called consistency.

A heuristic i(n) is consistent if, for

every node n and every successor n’ of n generated by an action a, we have:

h(n) < e(nan’)+h(n').

Triangle inequality

This is a form of the triangle inequality, which stipulates that a side of a triangle cannot

be longer than the sum of the other two sides (see Figure 3.19). An example of a consistent

heuristic is the straight-line distance /s> that we used in getting to Bucharest.

Every consistent heuristic is admissible (but not vice versa), so with a consistent heuristic, A" is cost-optimal. In addition, with a consistent heuristic, the first time we reach a state it will be on an optimal path, so we never have to re-add a state to the frontier, and never have to

change an entry in reached. But with an inconsistent heuristic, we may end up with multiple paths reaching the same state, and if each new path has a lower path cost than the previous

one, then we will end up with multiple nodes for that state in the frontier, costing us both

time and space. Because of that, some implementations of A* take care to only enter a state into the frontier once, and if a better path to the state is found, all the successors of the state

are updated (which requires that nodes have child pointers as well as parent pointers). These complications have led many implementers to avoid inconsistent heuristics, but Felner et al. (2011) argues that the worst effects rarely happen in practice, and one shouldn’t be afraid of inconsistent heuristics.

Section 3.5

Informed (Heuristic) Search Strategies

89

Figure 3.20 Map of Romania showing contours at f = 380, f = 400, and f = 420, with Arad as the start state. Nodes inside a given contour have f = g+ costs less than or equal to the contour value.

With an inadmissible heuristic, A* may or may not be cost-optimal. Here are two cases where it is: First, if there is even one cost-optimal path on which h(n) is admissible for all nodes n on the path, then that path will be found, no matter what the heuristic says for states off the path. Second, if the optimal solution has cost C*, and the second-best has cost C,, and

if h(n) overestimates some costs, but never by more than C; — C*, then A is guaranteed to

return cost-optimal solutions.

3.5.3 Search contours A useful way to visualize a search is to draw contours in the state space, just like the contours

in a topographic map. Figure 3.20 shows an example. Inside the contour labeled 400, all nodes have f(n) = g(n) +h(n) < 400, and so on. Then, because A" expands the frontier node

Contour

of lowest f-cost, we can see that an A” search fans out from the start node, adding nodes in concentric bands of increasing f-cost. ‘With uniform-cost search, we also have contours, but of g-cost, not g + . The contours

with uniform-cost search will be “circular” around the start state, spreading out equally in all

directions with no preference towards the goal. With A* search using a good heuristic, the g+ h bands will stretch toward a goal state (as in Figure 3.20) and become more narrowly focused around an optimal path. It should be clear that as you extend a path, the g costs are monotonic:

the path cost

always increases as you go along a path, because action costs are always positive.'? Therefore

you get concentric contour lines that don’t cross each other, and if you choose to draw the

lines fine enough, you can put a line between any two nodes on any path.

12 Technically, we say decrease, but might rem:

ctly monotonic™ for costs that always increase, and “monotonic” for cost the same.

that never

Monotonic

Chapter 3 Solving Problems by Searching But it is not obvious whether the f = g+ h cost will monotonically increase. As you ex-

tend a path from 7 to 1, the cost goes from g(n) +h(n) to g(n) +c(n,a,n’) +h(n’). Canceling

out the g(n) term, we see that the path’s cost will be monotonically increasing if and only if h(n) < c(n,a,n") + h(n'); in other words if and only if the heuristic is consistent.'> But note that a path might contribute several nodes in a row with the same g(n) + (n) score; this will

happen whenever the decrease in / is exactly equal to the action cost just taken (for example,

in a grid problem, when n is in the same row as the goal and you take a step towards the goal,

g s increased by 1 and h is decreased by 1). If C* is the cost of the optimal solution path, then we can say the following:

Surely expanded nodes

+ A expands all nodes that can be reached from the initial state on a path where every

node on the path has f(n) < C*. We say these are surely expanded nodes. A" might then expand some of the nodes right on the “goal contour” (where f(n) = C*) before selecting a goal node.

« A" expands no nodes with f(n) > C*.

Optimally efficient

We say that A* with a consistent heuristic is optimally efficient in the sense that any algorithm

that extends search paths from the initial state, and uses the same heuristic information, must

expand all nodes that are surely expanded by A* (because any one of them could have been

part of an optimal solution). Among the nodes with f(n)=C", one algorithm could get lucky

and choose the optimal one first while another algorithm is unlucky; we don’t consider this

Pruning

difference in defining optimal efficiency.

A" is efficient because it prunes away search tree nodes that are not necessary for finding an optimal solution. Tn Figure 3.18(b) we see that Timisoara has f = 447 and Zerind has f 449. Even though they are children of the root and would be among the first nodes expanded by uniform-cost or breadth-first search, they are never expanded by A* search because the solution with f = 418 is found first. The concept of pruning—eliminating possibilities from consideration without having to examine them—is important for many areas of Al

That A* search is complete, cost-optimal, and optimally efficient among all such algo-

rithms is rather satisfying.

Unfortunately, it does not mean that A* is the answer to all our

searching needs. The catch is that for many problems, the number of nodes expanded can be exponential in the length of the solution. For example, consider a version of the vacuum world with a super-powerful vacuum that can clean up any one square at a cost of 1 unit, without even having to visit the square; in that scenario, squares can be cleaned in any order.

With N initially dirty squares, there are 2V states where some subset has been cleaned; all of those states are on an optimal solution path, and hence satisfy f(n) < C*, so all of them would be visited by A*.

3.5.4

Inadmissible heuristic

Satisficing search:

Inadmissible heuristics and weighted A*

A" search has many good qualities, but it expands a lot of nodes. We can explore fewer nodes (taking less time and space) if we are willing to accept solutions that are suboptimal, but are “good enough”—what we call satisficing solutions. If we allow A® search to use an inadmissible heuristic—one that may overestimate—then we risk missing the optimal solution, but the heuristic can potentially be more accurate, thereby reducing the number of 13 In fact, the term * Jonotonic heuristic” is a synonym for “consistent heurist . The two ideas were developed independently, and ther was proved that they are equivalent (Pearl, 1984).

Section 3.5

Informed (Heuristic) Search Strategies

91

(b)

(a)

Figure 3.21 Two searches on the same grid: (a) an A" search and (b) a weighted A" search with weight W = 2. The gray bars are obstacles, the purple line is the path from the green start to red goal, and the small dots are states that were reached by each search. On this particular problem, weighted A” explores 7 times fewer states and finds a path that is 5% more costly.

nodes expanded. For example, road engineers know the concept of a detour index, which is Detour index a multiplier applied to the straight-line distance to account for the typical curvature of roads. A detour index of 1.3 means that if two cities are 10 miles apart in straight-line distance, a good estimate of the best path between them is 13 miles. For most localities, the detour index

ranges between 1.2 and 1.6. We can apply this idea to any problem, not just ones involving roads, with an approach

called weighted A* search where we weight the heuristic value more heavily, giving us the

evaluation function £(n) = g(n) +W x h(n), for some W > 1. Figure 3.21 shows a search problem on a grid world. In (), an A" search finds the optimal solution, but has to explore a large portion of the state space to find it. In (b), a weighted A*

search finds a solution that is slightly costlier, but the search time is much faster. We see that the weighted search focuses the contour of reached states towards a goal.

That means that

fewer states are explored, but if the optimal path ever strays outside of the weighted search’s

contour (as it does in this case), then the optimal path will not be found. In general, if the optimal solution costs C*, a weighted A* search will find a solution that costs somewhere

between C* and W x C*; but in practice we usually get results much closer to C* than W x C*.

We have considered searches that evaluate states by combining g and / in various ways; weighted A* can be seen as a generalization of the others:

At search:

Uniform-cost search:

Greedy best-first search: Weighted A" search:

g(n)+h(n) 2(n) h(n)

g(n) +W x h(n)

(1 0 then current

next

else current —next only with probability e~2£/T

Figure 4.5 The simulated annealing algorithm, a version of stochastic hill climbing where some downhill moves are allowed. The schedule input determines the value of the “tempera-

ture” 7 as a function of time.

all the probability is concentrated on the global maxima, which the algorithm will find with probability approaching 1. Simulated annealing was used to solve VLSI layout problems beginning in the 1980s. Tt has been applied widely to factory scheduling and other large-scale optimization tasks. 4.1.3

Local beam search

Keeping just one node in memory might seem to be an extreme reaction to the problem of

memory limitations.

The local beam search algorithm keeps track of k states rather than

just one. Tt begins with k randomly generated states. At each step, all the successors of all k states are generated. If any one is a goal, the algorithm halts. Otherwise, it selects the k best

Local beam search

successors from the complete list and repeats. At first sight, a local beam search with k states might seem to be nothing more than

running k random restarts in parallel instead of in sequence. are quite different.

In fact, the two algorithms

In a random-restart search, each search process runs independently of

the others. In a local beam search, useful information is passed among the parallel search threads.

In effect, the states that generate the best successors say to the others, “Come over


2, which occurs

only rarely in nature but is easy enough to simulate on computers.

Selection

« The selection process for selecting the individuals who will become the parents of the next generation: one possibility is to select from all individuals with probability proportional to their fitness score. Another possibility is to randomly select n individuals (n > p), and then select the p most fit ones as parents.

Crossover point

« The recombination procedure. One common approach (assuming p = 2), is to randomly select a crossover point to split each of the parent strings, and recombine the parts to form two children, one with the first part of parent 1 and the second part of parent 2; the other with the second part of parent 1 and the first part of parent 2.

Mutation rate

+ The mutation rate, which determines how often offspring have random mutations to

their representation. Once an offspring has been generated, every bit in its composition is flipped with probability equal to the mutation rate.

« The makeup of the next generation. This can be just the newly formed offspring, or it Elitism

can include a few top-scoring parents from the previous generation (a practice called

elitism, which guarantees that overall fitness will never decrease over time). The practice of culling, in which all individuals below a given threshold are discarded, can lead

to a speedup (Baum e al., 1995). Figure 4.6(a) shows a population of four 8-digit strings, each representing a state of the 8-

queens puzzle: the c-th digit represents the row number of the queen in column c. In (b), each state is rated by the fitness function.

Higher fitness values are better, so for the 8-

Section 4.1

Local Search and Optimization Problems

17

Figure 4.7 The 8-queens states corresponding to the first two parents in Figure 4.6(c) and the first offspring in Figure 4.6(d). The green columns are lost in the crossover step and the red columns are retained. (To interpret the numbers in Figure 4.6: row 1 is the bottom row, and 8 is the top row.) queens problem we use the number of nonattacking pairs of queens, which has a value of 8 x 7/2 =28 for a solution. The values of the four states in (b) are 24, 23, 20, and 11. The

fitness scores are then normalized to probabilities, and the resulting values are shown next to

the fitness values in (b).

In (c), two pairs of parents are selected, in accordance with the probabilities in (b). Notice that one individual is selected twice and one not at all. For each selected pair, a crossover point (dotted line) is chosen randomly. In (d), we cross over the parent strings at the crossover

points, yielding new offspring. For example, the first child of the first pair gets the first three digits (327) from the first parent and the remaining digits (48552) from the second parent.

The 8-queens states involved in this recombination step are shown in Figure 4.7.

Finally, in (e), each location in each string is subject to random mutation with a small independent probability. One digit was mutated in the first, third, and fourth offspring. In the

8-queens problem, this corresponds to choosing a queen at random and moving it to a random

square in its column. It is often the case that the population is diverse early on in the process,

so crossover frequently takes large steps in the state space early in the search process (as in simulated annealing). After many generations of selection towards higher fitness, the popu-

lation becomes less diverse, and smaller steps are typical. Figure 4.8 describes an algorithm

that implements all these steps.

Genetic algorithms are similar to stochastic beam search, but with the addition of the

crossover operation. This is advantageous if there are blocks that perform useful functions.

For example, it could be that putting the first three queens in positions 2, 4, and 6 (where they do not attack each other) constitutes a useful block that can be combined with other useful blocks that appear in other individuals to construct a solution. It can be shown mathematically

that, if the blocks do not serve a purpose—for example if the positions of the genetic code

are randomly permuted—then crossover conveys no advantage.

The theory of genetic algorithms explains how this works using the idea of a schema,

which is a substring in which some of the positions can be left unspecified. For example, the schema 246#*###

Schema

describes all 8-queens states in which the first three queens are in

positions 2, 4, and 6, respectively. Strings that match the schema (such as 24613578) are

called instances of the schema. It can be shown that if the average fitness of the instances of a schema is above the mean, then the number of instances of the schema will grow over time.

Instance

118

Chapter 4 Search in Complex Environments Evolution and Search The theory of evolution was developed by Charles Darwin in On the Origin of Species by Means of Natural Selection (1859) and independently by Alfred Russel ‘Wallace (1858).

The central idea is simple:

variations occur in reproduction and

will be preserved in successive generations approximately in proportion to their effect on reproductive fitness.

Darwin’s theory was developed with no knowledge of how the traits of organisms can be inherited and modified. The probabilistic laws governing these processes were first identified by Gregor Mendel (1866), a monk who experimented

with sweet peas. Much later, Watson and Crick (1953) identified the structure of the

DNA molecule and its alphabet, AGTC (adenine, guanine, thymine, cytosine). In the standard model, variation occurs both by point mutations in the letter sequence and by “crossover” (in which the DNA of an offspring is generated by combining long sections of DNA from each parent). The analogy to local search algorithms has already been described; the prin-

cipal difference between stochastic beam search and evolution is the use of sexual

reproduction, wherein successors are generated from multiple individuals rather than just one.

The actual mechanisms of evolution are, however, far richer than

most genetic algorithms allow. For example, mutations can involve reversals, duplications, and movement of large chunks of DNA;

some viruses borrow DNA

from one organism and insert it into another; and there are transposable genes that

do nothing but copy themselves many thousands of times within the genome.

There are even genes that poison cells from potential mates that do not carry

the gene, thereby increasing their own chances of replication. Most important is the fact that the genes themselves encode the mechanisms whereby the genome is reproduced and translated into an organism. In genetic algorithms, those mechanisms

are a separate program that is not represented within the strings being manipulated. Darwinian evolution may appear inefficient, having generated blindly some 10" or so organisms without improving its search heuristics one iota. But learning does play a role in evolution.

Although the otherwise great French naturalist

Jean Lamarck (1809) was wrong to propose that traits ing an organism’s lifetime would be passed on to its (1896) superficially similar theory is correct: learning ness landscape, leading to an acceleration in the rate of

acquired by adaptation duroffspring, James Baldwin’s can effectively relax the fitevolution. An organism that

has a trait that is not quite adaptive for its environment will pass on the trait if it also

has enough plasticity to learn to adapt to the environment in a way that is beneficial. Computer simulations (Hinton and Nowlan, 1987) confirm that this Baldwin

effect is real, and that a consequence is that things that are hard to learn end up in the genome, but things that are easy to learn need not reside there (Morgan and

Griffiths, 2015).

Section 4.2

Local Search in Continuous Spaces

function GENETIC-ALGORITHM(population, fitness) returns an individual

repeat weights— WEIGHTED-BY (population, fitness) population2 +—empty list for i = 1 to SIZE(population) do

parent], parent2< WEIGHTED-R ANDOM-CHOICES (population, weights, 2) child REPRODUCE (parent ], parent2) if (small random probability) then child < MUTATE(child) add child to population2 population < population2 until some individual is fit enough, or enough time has elapsed return the best individual in population, according to fitness function REPRODUCE(parent], parent2) returns an individual

14 LENGTH(parent])

¢ +random number from 1 to n

return APPEND(SUBSTRING(parentl, 1,c), SUBSTRING (parent2, ¢ + 1,m))

Figure 4.8 A genetic algorithm. Within the function, population is an ordered list of individuals, weights is a list of corresponding fitness values for each individual, and fitess is a function to compute these values.

Clearly, this effect is unlikely to be significant if adjacent bits are totally unrelated to

each other, because then there will be few contiguous blocks that provide a consistent benefit. Genetic algorithms work best when schemas correspond to meaningful components of a

solution. For example, if the string is a representation of an antenna, then the schemas may represent components of the antenna, such as reflectors and deflectors. A good component is likely to be good in a variety of different designs. This suggests that successful use of genetic

algorithms requires careful engineering of the representation. In practice, genetic algorithms have their place within the broad landscape of optimization methods (Marler and Arora, 2004), particularly for complex structured problems such as circuit layout or job-shop scheduling, and more recently for evolving the architecture of deep neural networks (Miikkulainen ef al., 2019). It is not clear how much of the appeal of genetic algorithms arises from their superiority on specific tasks, and how much from the appealing metaphor of evolution.

Local Search in Continuous Spaces In Chapter 2, we explained the distinction between discrete and continuous environments,

pointing out that most real-world environments are continuous.

A continuous action space

has an infinite branching factor, and thus can’t be handled by most of the algorithms we have covered so far (with the exception of first-choice hill climbing and simulated annealing). This section provides a very brief introduction to some local search techniques for continuous spaces. The literature on this topic is vast; many of the basic techniques originated

119

120

Chapter 4 Search in Complex Environments in the 17th century, after the development of calculus by Newton and Leibniz? We find uses for these techniques in several places in this book, including the chapters on learning, vision, and robotics.

We begin with an example. Suppose we want to place three new airports anywhere in

Romania, such that the sum of squared straight-line distances from each city on the map

Variable

to its nearest airport is minimized. (See Figure 3.1 for the map of Romania.) The state space is then defined by the coordinates of the three airports: (xy,y;), (x2,y2), and (x3,y3).

This is a six-dimensional space; we also say that states are defined by six variables. In general, states are defined by an n-dimensional vector of variables, x. Moving around in this space corresponds to moving one or more of the airports on the map. The objective function f(X) = f(x1,y1.X2,y2,x3,y3) is relatively easy to compute for any particular state once we

compute the closest cities. Let C; be the set of cities whose closest airport (in the state x) is airport i. Then, we have

3

F) = fiynx e ys) = Y Z(X:*X‘)2+(Yi*n)2»

@

This equation is correct not only for the state x but also for states in the local neighborhood

of x. However, it is not correct globally; if we stray too far from x (by altering the location

of one or more of the airports by a large amount) then the set of closest cities for that airport

Discretization

changes, and we need to recompute C;.

One way to deal with a continuous state space is to discretize it. For example, instead of

allowing the (x;,y;) locations to be any point in continuous two-dimensional space, we could

limit them to fixed points on a rectangular grid with spacing of size § (delta). Then instead of

having an infinite number of successors, each state in the space would have only 12 successors, corresponding to incrementing one of the 6 variables by 4. We can then apply any of our local search algorithms to this discrete space. Alternatively, we could make the branching factor finite by sampling successor states randomly, moving in a random direction by a small

Empirical gradient

amount, 5. Methods that measure progress by the change in the value of the objective function between two nearby points are called empirical gradient methods.

Empirical gradient

search is the same as steepest-ascent hill climbing in a discretized version of the state space.

Reducing the value of § over time can give a more accurate solution, but does not necessarily

converge to a global optimum in the limit.

Often we have an objective function expressed in a mathematical form such that we can

Gradient

use calculus to solve the problem analytically rather than empirically. Many methods attempt

to use the gradient of the landscape to find a maximum. The gradient of the objective function

is a vector V£ that gives the magnitude and direction of the steepest slope. For our problem, we have

vio (9F

9f

9f

9f

9f

of

1= (S A 2 5)

In some cases, we can find a maximum by solving the equation V £ =0. (This could be done,

for example, if we were placing just one airport; the solution is the arithmetic mean of all the

cities’ coordinates.) In many cases, however, this equation cannot be solved in closed form.

For example, with three airports, the expression for the gradient depends on what cities are

Knowledge of vectors, mat ces, and derivatives s useful for this

(see Appendix A).

Section 4.2

Local Search in Continuous Spaces

121

closest to each airport in the current state. This means we can compute the gradient locally (but not globally); for example, of 22 ¥ (x—x0). “2) Given alocally correct expression for the gradient, we can perform steepest-ascent hill climbing by updating the current state according to the formula X

x+aVf(x),

where a (alpha) is a small constant often called the step size. There exist a huge variety Step size of methods for adjusting a.. The basic problem is that if « is too small, too many steps are needed; if a is too large, the search could overshoot the maximum.

The technique of line

search tries to overcome this dilemma by extending the current gradient direction—usually by repeatedly doubling a—until f starts to decrease again. becomes

the new current state.

The point at which this occurs

There are several schools of thought about how the new

direction should be chosen at this point. For many problems, the most effective algorithm is the venerable Newton-Raphson

method. This is a general technique for finding roots of functions—that is, solving equations

of the form g(x)=0. Newton’s formula

Line search

Newton-Raphson

It works by computing a new estimate for the root x according to

x e x—g(x)/g'(x).

To find a maximum or minimum of f, we need to find x such that the gradient is a zero vector

(i.e., V(x)=0). Thus, g(x) in Newton’s formula becomes V £(x), and the update equation can be written in matrix-vector form as

x4 x—H

(x)Vf(x),

where Hy(x) is the Hessian matrix of second derivatives, whose elements H;; are given by

92f/9x;dx;. For our airport example, we can see from Equation (4.2) that Hy(x) is particularly simple: the off-diagonal elements are zero and the diagonal elements for airport are just

Hessian

twice the number of cities in C;. A moment’s calculation shows that one step of the update

moves airport i directly to the centroid of C;, which is the minimum of the local expression

for f from Equation (4.1).3 For high-dimensional problems, however, computing the n? en-

tries of the Hessian and inverting it may be expensive, so many approximate versions of the

Newton-Raphson method have been developed.

Local search methods suffer from local maxima, ridges, and plateaus in continuous state

spaces just as much as in discrete spaces. Random restarts and simulated annealing are often helpful. High-dimensional continuous spaces are, however, big places in which it is very easy to get lost. Constrained A final topic is constrained optimization. An optimization problem is constrained if optimization solutions must satisfy some hard constraints on the values of the variables.

For example, in

our airport-siting problem, we might constrain sites to be inside Romania and on dry land (rather than in the middle of lakes).

The difficulty of constrained

optimization problems

depends on the nature of the constraints and the objective function. The best-known category is that of linear programming problems, in which constraints must be linear inequalities

3 In general, the Newton-Raphson update can be seen s fitting a quadratic surface to / at x and then moving direetly to the minimum of that surface—which is also the minimum of f if / is quadrati.

Linear programming

122

Convex set Convex optimization

Chapter 4 Search in Complex Environments forming a convex set* and the objective function is also linear. The time complexity of linear programming is polynomial in the number of variables. Linear programming is probably the most widely studied and broadly useful method for optimization. It is a special case of the more general problem of convex optimization, which

allows the constraint region to be any convex region and the objective to be any function that is

convex within the constraint region. Under certain conditions, convex optimization problems are also polynomially solvable and may be feasible in practice with thousands of variables. Several important problems in machine learning and control theory can be formulated as convex optimization problems (see Chapter 20). 4.3

Search with Nondeterministic Actions

In Chapter 3, we assumed a fully observable, deterministic, known environment.

Therefore,

an agent can observe the initial state, calculate a sequence of actions that reach the goal, and

execute the actions with its “eyes closed,” never having to use its percepts. When the environment is partially observable, however, the agent doesn’t know for sure what state it is in; and when the environment is nondeterministic, the agent doesn’t know

what state it transitions to after taking an action. That means that rather than thinking “I'm in state 51 and ifI do action a I'll end up in state s»,” an agent will now be thinking “I'm either

Belief state Conditional plan

in state s or 53, and if I do action a 'l end up in state s,,54 or s5.” We call a set of physical states that the agent believes are possible a belief state.

In partially observable and nondeterministic environments, the solution to a problem is

no longer a sequence, but rather a conditional plan (sometimes called a contingency plan or a

strategy) that specifies what to do depending on what percepts agent receives while executing the plan. We examine nondeterminism in this section and partial observability in the next.

4.3.1

The erratic vacuum world

The vacuum world from Chapter 2 has eight states, as shown in Figure 4.9. There are three

actions—Right, Left, and Suck—and the goal is to clean up all the dirt (states 7 and 8). If the

environment is fully observable, deterministic,

and completely known, then the problem is

easy to solve with any of the algorithms in Chapter 3, and the solution is an action sequence.

For example, if the initial state is 1, then the action sequence [Suck, Right, Suck] will reach a

goal state, 8.

Now suppose that we introduce nondeterminism in the form of a powerful but erratic

vacuum cleaner. In the erratic vacuum world, the Suck action works as follows:

+ When applied to a dirty square the action cleans the square and sometimes cleans up

dirt in an adjacent square, too.

« When applied to a clean square the action sometimes deposits dirt on the carpet.”

To provide a precise formulation of this problem, we need to generalize the notion of a transition model from Chapter 3. Instead of defining the transition model by a RESULT function

4 A set of points is convex if the line joining any two points in & is also contained in S. A convex function is one for which the space “above” it forms a convex set; by definition, convex functions have no local (as opposed 10 global) minima. 5 We assume that most readers milar problems and can sympathize with our agent. We a owners of modern, efficient cleaning appliances who cannot take advantage of this pedagogical de

Section 4.3

Search with Nondeterministic Actions

123

=0 -

Figure 4.9 The eight possible states of the vacuum world; states 7 and 8 are goal states. that returns a single outcome state, we use a RESULTS function that returns a set of possible

outcome states. For example, in the erratic vacuum world, the Suck action in state 1 cleans

up either just the current location, or both locations RESULTS (1, Suck)= {5,7}

If we start in state 1, no single sequence of actions solves the problem, but the following

conditional plan does:

[Suck,if State =5 then [Right, Suck] else []] .

4.3)

Here we see that a conditional plan can contain if-then—else steps; this means that solutions are frees rather than sequences. Here the conditional in the if statement tests to see what the current state is; this is something the agent will be able to observe at runtime, but doesn’t

know at planning time. Alternatively, we could have had a formulation that tests the percept

rather than the state. Many problems in the real, physical world are contingency problems, because exact prediction of the future is impossible. For this reason, many people keep their eyes open while walking around. 4.3.2

AND-OR

search trees

How do we find these contingent solutions to nondeterministic problems?

As in Chapter 3,

we begin by constructing search trees, but here the trees have a different character. In a de-

terministic environment, the only branching is introduced by the agent’s own choices in each

state: I can do this action or that action. We call these nodes OR nodes. In the vacuum world,

Or node

environment, branching is also introduced by the environment’s choice of outcome for each action. We call these nodes AND nodes. For example, the Suck action in state 1 results in the

And node

two kinds of nodes alternate, leading to an AND-OR tree as illustrated in Figure 4.10.

And-or tree

for example, at an OR node the agent chooses Left or Right or Suck. In a nondeterministic

belief state {5,7}, so the agent would need to find a plan for state 5 and for state 7. These

124

Chapter 4 Search in Complex Environments

L

GoaL

=

Suck,

el

el

LOOP

Loop

=

GOAL

Light

Left

Fle]

[ S

EME

L

Loor

Suck

17

o]

GOAL

LooP

Figure 4.10 The first two levels of the search tree for the erratic vacuum world. State nodes are OR nodes where some action must be chosen. At the AND nodes, shown as circles, every ‘outcome must be handled, as indicated by the arc linking the outgoing branches. The solution found is shown in bold lines.

A solution for an AND-OR search problem is a subtree of the complete search tree that

(1) has a goal node at every leaf, (2) specifies one action at each of its OR nodes, and (3) includes every outcome branch at each of its AND nodes. The solution is shown in bold lines

in the figure; it corresponds to the plan given in Equation (4.3).

Figure 4.11 gives a recursive, depth-first algorithm for AND-OR graph search. One key aspect of the algorithm is the way in which it deals with cycles, which often arise in nonde-

terministic problems (e.g., if an action sometimes has no effect or if an unintended effect can be corrected). If the current state is identical to a state on the path from the root, then it returns with failure. This doesn’t mean that there is no solution from the current state; it simply

means that if there is a noncyclic solution, it must be reachable from the earlier incarnation of

the current state, so the new incarnation can be discarded. With this check, we ensure that the

algorithm terminates in every finite state space, because every path must reach a goal, a dead

end, or a repeated state. Notice that the algorithm does not check whether the current state is

a repetition of a state on some other path from the root, which is important for efficiency.

AND-OR graphs can be explored either breadth-first or best-first. The concept of a heuris-

tic function must be modified to estimate the cost of a contingent solution rather than a se-

quence, but the notion of admissibility carries over and there is an analog of the A* algorithm for finding optimal solutions. (See the bibliographical notes at the end of the chapter.)

Section 4.3

Search with Nondeterministic Actions

125

function AND-OR-SEARCH(problem) returns a conditional plan, or failure

return OR-SEARCH(problem, problem.INITIAL, [])

function OR-SEARCH(problem, state, path) returns a conditional plan, or failure if problem.1s-GOAL(state) then return the empty plan if Is-CYCLE(path) then return failure

for each action in problem. ACTIONS state) do

plan < AND-SEARCH(problem, RESULTS(state, action), [state] + path])

if plan # failure then return [action] + plan] return failure

function AND-SEARCH(problem, states, path) returns a conditional plan, or failure for each s; in states do plan; < OR-SEARCH(problem, s;. path)

if plan; = failure then return failure return [ifs, then plan, else if s, then plan, else ...if s,

then plan,,_, else plan,]

Figure 4.11 An algorithm for searching AND-OR graphs generated by nondeterministic environments. A solution is a conditional plan that considers every nondeterministic outcome and makes a plan for each one. 4.3.3

Try, try ag

Consider a slippery vacuum world, which is identical to the ordinary (non-erratic) vacuum

world except that movement actions sometimes fail, leaving the agent in the same location.

For example, moving Right in state 1 leads to the belief state {1,2}. Figure 4.12 shows part of the search graph; clearly, there are no longer any acyclic solutions from state 1, and

AND-OR-SEARCH would return with failure. There is, however, a cyclic solution, which is to keep trying Right until it works. We can express this with a new while construct:

[Suck, while State =S5 do Right, Suck] or by adding a label to denote some portion of the plan and referring to that label later: [Suck,Ly : Right,if State=S5 then L; else Suck].

When is a cyclic plan a solution? A minimum condition is that every leaf is a goal state and

that a leaf is reachable from every point in the plan. In addition to that, we need to consider the cause of the nondeterminism. If it is really the case that the vacuum robot’s drive mechanism

works some of the time, but randomly and independently slips on other occasions, then the

agent can be confident that if the action is repeated enough times, eventually it will work and the plan will succeed.

But if the nondeterminism is due to some unobserved fact about the

robot or environment—perhaps a drive belt has snapped and the robot will never move—then repeating the action will not help. One way to understand this decision is to say that the initial problem formulation (fully observable, nondeterministic) is abandoned in favor of a different formulation (partially observable, deterministic) where the failure of the cyclic plan is attributed to an unobserved

property of the drive belt. In Chapter 12 we discuss how to decide which of several uncertain possibilities is more likely.

Cyelic solution

126

Chapter 4 Search in Complex Environments

Figure 4.12 Part of the search graph for a slippery vacuum world, where we have shown (some) cycles explicitly. All solutions for this problem are cyclic plans because there is no way to move reliably. 4.4

Search in Partially Observable

Environments

‘We now turn to the problem of partial observability, where the agent’s percepts are not enough to pin down the exact state.

That means that some of the agent’s actions will be aimed at

reducing uncertainty about the current state.

4.4.1

Sensorless Conformant

Searching with no observation

‘When the agent’s percepts provide no information at all, we have what is called a sensorless

problem (or a conformant problem). At first, you might think the sensorless agent has no hope of solving a problem if it has no idea what state it starts in, but sensorless solutions are

surprisingly common and useful, primarily because they don’t rely on sensors working properly. In manufacturing systems, for example, many ingenious methods have been developed for orienting parts correctly from an unknown initial position by using a sequence of actions

with no sensing at all. Sometimes a sensorless plan is better even when a conditional plan

with sensing is available. For example, doctors often prescribe a broad-spectrum antibiotic

rather than using the conditional plan of doing a blood test, then waiting for the results to

come back, and then prescribing a more specific antibiotic. The sensorless plan saves time and money, and avoids the risk of the infection worsening before the test results are available.

Consider a sensorless version of the (deterministic) vacuum world. Assume that the agent

knows the geography of its world, but not its own location or the distribution of dirt. In that

case, its initial belief state is {1,2,3,4,5,6,7,8} (see Figure 4.9).

Now, if the agent moves

Right it will be in one of the states {2,4,6,8}—the agent has gained information without

Coercion

perceiving anything! After [Right,Suck] the agent will always end up in one of the states {4,8}. Finally, after [Right,Suck,Lefr,Suck] the agent is guaranteed to reach the goal state 7, no matter what the start state. We say that the agent can coerce the world into state 7.

Section 4.4

Search in Partially Observable Environments

The solution to a sensorless problem is a sequence of actions, not a conditional plan

(because there is no perceiving).

But we search in the space of belief states rather than

physical states.S In belief-state space, the problem is fully observable because the agent always knows its own belief state. Furthermore, the solution (if any) for a sensorless problem

is always a sequence of actions. This is because, as in the ordinary problems of Chapter 3, the percepts received after each action are completely predictable—they’re always empty! So there are no contingencies to plan for. This is true even if the environment is nondeterministic.

‘We could introduce new algorithms for sensorless search problems. But instead, we can

use the existing algorithms from Chapter 3 if we transform the underlying physical problem

into a belief-state problem, in which we search over belief states rather than physical states.

The original problem, P, has components Actionsp, Resultp etc., and the belief-state problem has the following components:

o States: The belief-state space contains every possible subset of the physical states. If P

has N states, then the belief-state problem has 2V belief states, although many of those may be unreachable from the initial state.

o Initial state: Typically the belief state consisting of all states in P, although in some cases the agent will have more knowledge than this.

o Actions: This is slightly tricky. Suppose the agent is in belief state b={s;,s2}, but

ACTIONSp(s1) # ACTIONSp(s>): then the agent is unsure of which actions are legal. If we assume that illegal actions have no effect on the environment, then it is safe to take the union of all the actions in any of the physical states in the current belief state b:

AcTIONS(b) = [J ACTIONSp(s). seb

On the other hand, if an illegal action might lead to catastrophe, it is safer to allow only the intersection, that is, the set of actions legal in all the states. For the vacuum world,

every state has the same legal actions, so both methods give the same result.

o Transition model: For deterministic actions, the new belief state has one result state for each of the current possible states (although some result states may be the same):

b =RESULT(b,a) = {s" : '

=RESULTp(s,a) and 5 € b}.

(4.4)

‘With nondeterminism, the new belief state consists of all the possible results of applying the action to any of the

b =ResuLT(b,a)

states in the current belief state:

= {s':5' € RESULTS p(s,a) and s € b}

= [JREsuLtsp(s,a),

seb

The size of b’ will be the same or smaller than b for deterministic actions, but may be larger than b with nondeterministic actions (see Figure 4.13).

o Goal test: The agent possibly achieves the goal if any state s in the belief state satisfies

the goal test of the underlying problem, Is-GOALp(s). The agent necessarily achieves the goal if every state satisfies IS-GOALp(s). We aim to necessarily achieve the goal.

® Action cost:

This is also tricky.

If the same action can have different costs in dif-

ferent states, then the cost of taking an action in a given belief state could be one of

© Ina fully observable environment, each belief state contai ins one physical state. Thus, we can view the algorithms in Chapter 3 as searching in a belief-state space of leton belief stat

127

128

Chapter 4 Search in Complex Environments

(@

(b)

Figure 4.13 (a) Predicting the next belief state for the sensorless vacuum world with the deterministic action, Right. (b) Prediction for the same belief state and action in the slippery version of the sensorless vacuum world.

several values. (This gives rise to a new class of problems, which we explore in Exercise 4.MVAL.) For now we assume that the cost of an action is the same in all states and 5o can be transferred directly from the underlying physical problem. Figure 4.14 shows the reachable belief-state space for the deterministic, sensorless vacuum world. There are only 12 reachable belief states out of 28 =256 possible belief states. The preceding definitions enable the automatic construction of the belief-state problem

formulation from the definition of the underlying physical problem. Once this is done, we can solve sensorless problems with any of the ordinary search algorithms of Chapter 3. In ordinary graph search, newly reached states are tested to see if they were previously reached. This works for belief states, too; for example, in Figure 4.14, the action sequence [Suck.Left,Suck] starting at the initial state reaches the same belief state as [Right,Left,Suck), namely, {5,7}. Now, consider the belief state reached by [Lefr], namely, {1,3,5,7}. Obviously, this is not identical to {5,7}, but it is a superser. We can discard (prune) any such superset belief state. Why? Because a solution from {1,3,5,7} must be a solution for each

of the individual states 1, 3, 5, and 7, and thus it is a solution for any combination of these

individual states, such as {5,7}; therefore we don’t need to try to solve {1,3,5,7}, we can

concentrate on trying to solve the strictly easier belief state {5,7}.

Conversely, if {1,3,5,7} has already been generated and found to be solvable, then any

subset, such as {5,7}, is guaranteed to be solvable. (If I have a solution that works when I'm very confused about what state I'm in, it will still work when I'm less confused.) This extra

level of pruning may dramatically improve the efficiency of sensorless problem solving. Even with this improvement, however, sensorless problem-solving as we have described

itis seldom feasible in practice. One issue is the vastness of the belief-state space—we saw in

the previous chapter that often a search space of size N is too large, and now we have search

spaces of size 2V. Furthermore, each element of the search space is a set of up to N elements.

For large N, we won’t be able to represent even a single belief state without running out of

memory space.

One solution is to represent the belief state by some more compact description.

In En-

glish, we could say the agent knows “Nothing” in the initial state; after moving Left, we could

Section 4.4

Search in Partially Observable Environments

129

L L1 3 [=] 5

1[ 54~7

L

| =]



2| e [=A] |55

sfA] %=

4y [

5-4“

o

o[

o[

1[5

s

1 [

=]

&

R

¢

|

|

|s [ [ 5[ 7[=a] ] o[

S

EL|

s [\EL

——

7[=] R

¥

o[ T s

7[=]

L



[~ lk

L

L R

S

e[ T

A A

TL

|

=

[F

s[4 Ll

= s |

[~

L

3[=

R

7~

l———l

TR

"=

Figure 4.14 The reachable portion of the belief-state space for the deterministic, sensorless

vacuum world. Each rectangular box corresponds to a single belief state. At any given point,

the agent has a belief state but does not know which physical state it is in. The initial belief state (complete ignorance) is the top center box.

say, “Not in the rightmost column,” and so on. Chapter 7 explains how to do this in a formal representation scheme.

Another approach is to avoid the standard search algorithms, which treat belief states as

black boxes just like any other problem state. Instead, we can look inside the belief states

and develop incremental belief-state search algorithms that build up the solution one phys- Incremental belief-state search ical state at a time.

For example, in the sensorless vacuum world, the initial belief state is

{1,2,3,4,5,6,7,8}, and we have to find an action sequence that works in all 8 states. We can

do this by first finding a solution that works for state 1; then we check if it works for state 2; if not, go back and find a different solution for state 1, and so on.

Just as an AND-OR search has to find a solution for every branch at an AND node, this

algorithm has to find a solution for every state in the belief state; the difference is that AND—

OR search can find a different solution for each branch, whereas search has to find one solution that works for all the states.

an incremental belief-state

The main advantage of the incremental approach is that it is typically able to detect failure

quickly—when a belief state is unsolvable, it is usually the case that a small subset of the

130

Chapter 4 Search in Complex Environments belief state, consisting of the first few states examined, is also unsolvable. In some cases, this

leads to a speedup proportional to the size of the belief states, which may themselves be as

large as the physical state space itself. 4.4.2

Searching in partially observable environments

Many problems cannot be solved without sensing. For example, the sensorless 8-puzzle is impossible. On the other hand, a little bit of sensing can go a long way: we can solve 8puzzles if we can see just the upper-left corner square.

The solution involves moving each

tile in turn into the observable square and keeping track of its location from then on .

For a partially observable problem, the problem specification will specify a PERCEPT (s)

function that returns the percept received by the agent in a given state. If sensing is non-

deterministic, then we can use a PERCEPTS function that returns a set of possible percepts. For fully observable problems, PERCEPT (s) = s for every state s, and for sensorless problems PERCEPT (s) = null.

Consider a local-sensing vacuum world, in which the agent has a position sensor that

yields the percept L in the left square, and R in the right square, and a dirt sensor that yields

Dirty when the current square is dirty and Clean when it is clean. Thus, the PERCEPT in state

Lis [L, Dirty]. With partial observability, it will usually be the case that several states produce

the same percept; state 3 will also produce [L, Dirfy]. Hence, given this

initial percept, the

initial belief state will be {1,3}. We can think of the transition model between belief states

for partially observable problems as occurring in three stages, as shown in Figure 4.15:

+ The prediction stage computes the belief state resulting from the action, RESULT (b, a),

exactly as we did with sensorless problems. To emphasize that this is a prediction, we

use the notation h=RESULT(b,a), where the “hat” over the b means “estimated,” and we also use PREDICT(b,a) as a synonym for RESULT (b, a). + The possible percepts stage computes the set of percepts that could be observed in the predicted belief state (using the letter o for observation):

POSSIBLE-PERCEPTS (b) = {0 : 0=PERCEPT(s) and s € b} . « The update stage computes, for each possible percept, the belief state that would result from the percept.

The updated belief state b, is the set of states in b that could have

produced the percept:

b, = UPDATE(b,0) = {s : 0=PERCEPT(s) and 5 € b}. The agent needs to deal with possible percepts at planning time, because it won’t know

the actual percepts until it executes the plan. Notice that nondeterminism in the phys-

ical environment can enlarge the belief state in the prediction stage, but each updated belief state b, can be no larger than the predicted belief state b; observations can only

help reduce uncertainty. Moreover, for deterministic sensing, the belief states for the

different possible percepts will be disjoint, forming a partition of the original predicted belief state.

Putting these three stages together, we obtain the possible belief states resulting from a given action and the subsequent possible percepts:

RESULTS (b,a) = {b, : b, = UPDATE(PREDICT(b,a),0) and 0 € POSSIBLE-PERCEPTS (PREDICT(b,a))} .

(4.5)

Section 4.4

Search in Partially Observable Environments

(a)

B.Dirty]

(b)

=

i1/

o=



[ o[z

)

1B Clean] Figure 4.15 Two examples of transitions in local-sensing vacuum worlds. (a) In the deterministic world, Right is applied in the initial belief state, resulting in a new predicted belief

state with two possible physical states; for those states, the possible percepts are [R, Dirty]

and [R, Clean], leading to two belief states, each of which is a singleton. (b) In the slippery world, Right i applied in the initial belief state, giving a new belief state with four physical states; for those states, the possible percepts are [L, Dirty], [R, Dirty], and [R, Clean], leading to three belief states as shown.

[4,Clean]

Figure 4.16 The first level of the AND-OR

[8.0iry) L

[B.Clean]

search tree for a problem in the local-sensing

vacuum world; Suck is the first action in the solution.

131

132

Chapter 4 Search in Complex Environments 4.4.3

Solving partially observable problems

The preceding section showed how to derive the RESULTS function for a nondeterministic

belief-state problem from an underlying physical problem, given the PERCEPT function. With this formulation, the AND—OR search algorithm of Figure 4.11 can be applied directly to

derive a solution.

Figure 4.16 shows part of the search tree for the local-sensing vacuum

world, assuming an initial percept [A, Dirty]. The solution is the conditional plan [Suck, Right, if Bstate={6} then Suck else []]. Notice that, because we supplied a belief-state problem to the AND-OR

search algorithm, it

returned a conditional plan that tests the belief state rather than the actual state. This is as it

should be:

in a partially observable environment the agent won’t know the actual state.

As in the case of standard search algorithms applied to sensorless problems, the AND—

OR search algorithm treats belief states as black boxes, just like any other states. One can improve on this by checking for previously generated belief states that are subsets or supersets

of the current state, just as for sensorless problems.

One can also derive incremental search

algorithms, analogous to those described for sensorless problems, that provide substantial

speedups over the black-box approach.

4.4.4

An agent for partially observable environments

An agent for partially observable environments formulates a problem, calls a search algorithm (such as AND-OR-SEARCH)

to solve it, and executes the solution. There are two main

differences between this agent and the one for fully observable deterministic environments.

First, the solution will be a conditional plan rather than a sequence; to execute an if-then—else

expression, the agent will need to test the condition and execute the appropriate branch of the conditional.

Second, the agent will need to maintain its belief state as it performs actions

and receives percepts. This process resembles the prediction-observation-update process in

Equation (4.5) but is actually simpler because the percept is given by the environment rather

than calculated by the agent. Given an initial belief state b, an action a, and a percept o, the new belief state is:

b’ = UPDATE(PREDICT(b,a),0).

(4.6)

Consider a kindergarten vacuum world wherein agents sense only the state of their current

square, and any square may become dirty at any time unless the agent is actively cleaning it at that moment.” Figure 4.17 shows the belief state being maintained in this environment. In partially observable environments—which include the vast majority of real-world Monitoring Filtering

State estimation

environments—maintaining one’s belief state is a core function of any intelligent system.

This function goes under various names, including monitoring, filtering, and state estima-

tion. Equation (4.6) is called a recursive state estimator because it computes the new belief

state from the previous one rather than by examining the entire percept sequence. If the agent is not to “fall behind,” the computation has to happen as fast as percepts are coming in. As

the environment becomes more complex, the agent will only have time to compute an ap-

proximate belief state, perhaps focusing on the implications of the percept for the aspects of the environment that are of current interest. Most work on this problem has been done for

7 The usual apologies to those who are unfamiliar with the effect of small children on the environment.

Section 4.4

[A.Clean)

1B.Diny]

4l

Suck

133

Search in Partially Observable Environments

Figure 4.17 Two prediction-update cycles of belief-state maintenance in the kindergarten vacuum world with local sensing. stochastic, continuous-state environments with the tools of probability theory, as explained in

Chapter 14.

In this section we will show an example in a discrete environment with deterministic

sensors and nondeterministic actions. The example concerns a robot with a particular state estimation task called localization: working out where it is, given a map of the world and

a sequence of percepts and actions. Our robot is placed in the maze-like environment of Figure 4.18. The robot is equipped with four sonar sensors that tell whether there is an obstacle—the outer wall or a dark shaded square in the figure—in each of the four compass directions. The percept is in the form of a bt vector, one bit for each of the directions north,

east, south, and west in that order, so 1011 means there are obstacles to the north, south, and

west, but not east. ‘We assume that the sensors give perfectly correct data, and that the robot has a correct

map of the environment.

But unfortunately, the robot’s navigational system is broken, so

when it executes a Right action, it moves randomly to one of the adjacent squares. robot’s task is to determine its current location.

The

Suppose the robot has just been switched on, and it does not know where it is—its initial

belief state b consists of the set of all locations.

The robot then receives the percept 1011

and does an update using the equation b, =UPDATE(1011), yielding the 4 locations shown

in Figure 4.18(a). You can inspect the maze to see that those are the only four locations that yield the percept 1011.

Next the robot executes a Right action, but the result is nondeterministic. The new belief

state, b,

=PREDICT (b, Right), contains all the locations that are one step away from the lo-

cations in b,. When the second percept, 1010, arrives, the robot does UPDATE (b,, 1010) and finds that the belief state has collapsed down to the single location shown in Figure 4.18(b). That’s the only location that could be the result of

UPDATE (PREDICT (UPDATE(b, 1011),Right),1010) . ‘With nondeterministic actions the PREDICT step grows the belief state, but the UPDATE step

shrinks it back down—as long as the percepts provide some useful identifying information.

Sometimes the percepts don’t help much for localization: If there were one or more long east-

west corridors, then a robot could receive a long sequence of 1010 percepts, but never know

Localization

134

Chapter 4 Search in Complex Environments

(b) Possible locations of robot after E; = 1011, E,=1010

Figure 4.18 Possible positions of the robot, ©, (a) after one observation, £; = 1011, and (b) after moving one square and making a second observation, E> = 1010. When sensors are noiseless and the transition model is accurate, there is only one possible location for the robot consistent with this sequence of two observations. where in the corridor(s) it was. But for environments with reasonable variation in geography, localization often converges quickly to a single point, even when actions are nondeterministic.

What happens if the sensors are faulty? If we can reason only with Boolean logic, then we.

have to treat every sensor bit as being either correct or incorrect, which is the same as having

no perceptual information at all. But we will see that probabilistic reasoning (Chapter 12), allows us to extract useful information from a faulty sensor as long as it is wrong less than half the time.

Online Search Agents and Unknown Offline search Online search

Environments

So far we have concentrated on agents that use offfine search algorithms. They compute

a complete solution before taking their first action.

In contrast, an online search® agent

interleaves computation and action: first it takes an action, then it observes the environment

and computes the next action. Online search is a good idea in dynamic or semi-dynamic environments, where there is a penalty for sitting around and computing too long. Online 8 The term “onl i¢” here refers to algorithms that must proc nput as it is received rather than waiting for the entire input ata set to become available. This usage of “online’ unrelated to the concept of “having an Internet

connecti

"

Section 4.5

Online Search Agents and Unknown Environments

135

search is also helpful in nondeterministic domains because it allows the agent to focus its

computational efforts on the contingencies that actually arise rather than those that might

happen but probably won’t.

Of course, there is a tradeoff:

the more an agent plans ahead, the less often it will find

itself up the creek without a paddle. In unknown environments, where the agent does not

know what states exist or what its actions do, the agent must use its actions as experiments in order to learn about the environment.

A canonical example of online search is the mapping problem: a robot is placed in an Mapping problem

unknown building and must explore to build a map that can later be used for getting from

A to B. Methods for escaping from labyrinths—required knowledge for aspiring heroes of antiquity—are also examples of online search algorithms. Spatial exploration is not the only

form of online exploration, however. Consider a newborn baby: it has many possible actions but knows the outcomes of none of them, and it has experienced only a few of the possible states that it can reach.

4.5.1

Online search problems

An online search problem is solved by interleaving computation, sensing, and acting. We’ll start by assuming a deterministic and fully observable environment (Chapter 17 relaxes these assumptions) and stipulate that the agent knows only the following: * ACTIONS(s), the legal actions in state s;

* ¢(s,a,s'), the cost of applying action a in state s to arrive at state s'. Note that this cannot be used until the agent knows that s” is the outcome. * Is-GOAL(s), the goal test. Note in particular that the agent cannot determine RESULT (s, @) except by actually being in s

and doing a. For example, in the maze problem shown in Figure 4.19, the agent does not know that going Up from (1,1) leads to (1,2); nor, having done that, does it know that going Down will take it back to (1,1). This degree of ignorance can be reduced in some applications—for example, a robot explorer might know how its movement actions work and be ignorant only of the locations of obstacles.

Finally, the agent might have access to an admissible heuristic function /(s) that estimates

the distance from the current state to a goal state. For example, in Figure 4.19, the agent might know the location of the goal and be able to use the Manhattan-distance heuristic (page 97).

Typically, the agent’s objective is to reach a goal state while minimizing cost. (Another

possible objective is simply to explore the entire environment.) The cost is the total path

cost that the agent incurs as it travels. It is common to compare this cost with the path cost

the agent would incur if it knew the search space in advance—that is, the optimal path in the known environment. In the language of online algorithms, this comparison is called the competitive ratio; we would like it to be as small as possible.

Online explorers are vulnerable to dead ends: states from which no goal state is reach-

able. If the agent doesn’t know what each action does, it might execute the “jump into bottomless pit” action, and thus never reach the goal. In general, no algorithm can avoid dead ends in all state spaces. Consider the two dead-end state spaces in Figure 4.20(a). An on-

line search algorithm that has visited states S and A cannot tell if it is in the top state or the bottom one; the two look identical based on what the agent has seen. Therefore, there is no

Competitive ratio Dead end


n, then the constraint cannot be satisfied.

Section 6.2 Constraint Propagation: Inference in CSPs

189

This leads to the following simple algorithm: First, remove any variable in the constraint

that has a singleton domain, and delete that variable’s value from the domains of the re-

maining variables. Repeat as long as there are singleton variables. If at any point an empty

domain is produced or there are more variables than domain values left, then an inconsistency has been detected. This method can detect the inconsistency in the assignment {WA = red, NSW =red} for Figure 6.1. Notice that the variables SA, NT, and Q are effectively connected by an Alldiff

constraint because each pair must have two different colors. After applying AC-3 with the partial assignment, the domains of SA, NT, and Q are all reduced to {green, blue}. That is, we have three variables and only two colors, so the Alldiff constraint is violated.

Thus, a

simple consistency procedure for a higher-order constraint is sometimes more effective than

applying arc consistency to an equivalent set of binary constraints.

Another important higher-order constraint is the resource constraint, sometimes called

the Atmost constraint. For example, in a scheduling problem, let P, ..., P; denote the numbers.

Resource constraint

of personnel assigned to each of four tasks. The constraint that no more than 10 personnel

are assigned in total is written as Ammost(10, Py, Py, Py, Py). We can detect an inconsistency simply by checking the sum of the minimum values of the current domains; for example, if each variable has the domain {3,4,5,6}, the Armost constraint cannot be satisfied. We can

also enforce consistency by deleting the maximum value of any domain if it is not consistent

with the minimum values of the other domains. Thus, if each variable in our example has the domain {2,3,4,5,6}, the values 5 and 6 can be deleted from each domain.

For large resource-limited problems with integer values—such as logistical problems involving moving thousands of people in hundreds of vehicles—it is usually not possible to represent the domain of each variable as a large set of integers and gradually reduce that set by consistency-checking methods. Instead, domains are represented by upper and lower bounds and are managed by bounds propagation. For example, in an airline-scheduling

problem, let’s suppose there are two flights, i and F, for which the planes have capacities 165 and 385, respectively. The initial domains for the numbers of passengers on flights Fi and F; are then

D;=[0,165]

and

Bounds propagation

D;=[0,385]

Now suppose we have the additional constraint that the two flights together must carry 420

people: Fj + Fy = 420. Propagating bounds constraints, we reduce the domains to

Dy =[35,165]

and D, =|[255,385].

‘We say that a CSP is bounds-consistent if for every variable X, and for both the lower-bound and upper-bound values of X, there exists some value of ¥ that satisfies the constraint between

Bounds-consistent

X and Y for every variable Y. This kind of bounds propagation is widely used in practical constraint problems. 6.2.6 Sudoku

The popular Sudoku puzzle has introduced millions of people to constraint satisfaction problems, although they may not realize it. A Sudoku board consists of 81 squares, some of which are initially filled with digits from 1 to 9. The puzzle is to fill in all the remaining squares such that no digit appears twice in any row, column, or 3x 3 box (see Figure 6.4). A row, column, or box is called a unit.

Sudoku

(@)

R [o|r|unfo|a|w|—| i.

To solve a tree-structured CSP, first pick any variable to be the root of the tree, and choose

an ordering of the variables such that each variable appears after its parent in the tree. Such an ordering is called a topological sort. Figure 6.10(a) shows a sample tree and (b) shows

one possible ordering. Any tree with n nodes has n— 1 edges, so we can make this graph directed arc-consistent in O(n) steps, each of which must compare up to d possible domain

values for two variables, for a total time of O(nd?). Once we have a directed arc-consistent

graph, we can just march down the list of variables and choose any remaining value. Since each edge from a parent to its child is arc-consistent, we know that for any value we choose

for the parent, there will be a valid value left to choose for the child. That means we won’t

3 A careful cartographer or patriotic Tasmanian might object that Tasmania should not be colored the same as its nearest mainland neighbor, to avoid the impression that it might be part of that state. 4 Sadly, very few regions of the world have tree-structured maps, although Sulawesi comes close.

Topological sort

200

Chapter 6

Constraint Satisfaction Problems

ze ez @

®

Figure 6.10 () The constraint graph of a tree-structured CSP. (b) A linear ordering of the variables consistent with the tree with A as the root. This is known as a topological sort of the variables. function TREE-CSP-SOLVER(csp) returns a solution, or failure inputs: csp, a CSP with components X, D, C n < number of variables in X

assignment < an empty assignment root any variable in X X < TOPOLOGICALSORT(X, roof) for j=n down to 2 do

MAKE-ARC-CONSISTENT(PARENT(X)), X))

if it cannot be made consistent then return failure fori=1tondo

assignment[X;]

any consistent value from D;

if there is no consistent value then return failure

return assignment

Figure 6.11 The TREE-CSP-SOLVER algorithm for solving tree-structured CSPs. If the (CSP has a solution, we will find it in linear time; if not, we will detect a contradiction.

have to backtrack; we can move linearly through the variables. The complete algorithm is shown in Figure 6.11.

Now that we have an efficient algorithm for trees, we can consider whether more general constraint graphs can be reduced to trees somehow. There are two ways to do this: by

removing nodes (Section 6.5.1) or by collapsing nodes together (Section 6.5.2). 6.5.1

Cutset conditioning

The first way to reduce a constraint graph to a tree involves assigning values to some variables so that the remaining variables form a tree. Consider the constraint graph for Australia, shown

again in Figure 6.12(a). Without South Australia, the graph would become a tree, as in (b). Fortunately, we can delete South Australia (in the graph, not the country) by fixing a value for SA and deleting from the domains of the other variables any values that are inconsistent with the value chosen for SA.

Now, any solution for the CSP after SA and its constraints are removed will be consistent

with the value chosen for SA. (This works for binary CSPs; the situation is more complicated with higher-order constraints.) Therefore, we can solve the remaining tree with the algorithm

Section 6.5

(@)

201

The Structure of Problems

(®)

Figure 6.12 (a) The original constraint graph from Figure 6.1. (b) After the removal of SA,

the constraint graph becomes a forest of two trees.

given above and thus solve the whole problem. OF course, in the general case (as opposed o map coloring), the value chosen for SA could be the wrong one, so we would need to try each possible value. The general algorithm is as follows: 1. Choose a subset S of the CSP’s variables such that the constraint graph becomes a tree after removal of S. is called a cycle cutset.

Cycle cutset

2. For each possible assignment to the variables in S that satisfies all constraints on S,

(a) remove from the domains of the remaining variables any values that are inconsis-

tent with the assignment for S, and (b) if the remaining CSP has a solution, return it together with the assignment for S.

If the cycle cutset has size c, then the total run time is O(d® - (n — ¢)d?): we have to try each of

the d° combinations of values for the variables in §, and for each combination we must solve

atree problem of size n — c. If the graph is “nearly a tree,” then ¢ will be small and the savings over straight backtracking will be huge—for our 100-Boolean-variable example, if we could

find a cutset of size ¢ =20, this would get us down from the lifetime of the Universe to a few

minutes. In the worst case, however, ¢ can be as large as (n — 2). Finding the smallest cycle cutset is NP-hard, but several efficient approximation algorithms are known.

The overall

algorithmic approach is called cutset conditioning; it comes up again in Chapter 13, where

itis used for reasoning about probabilities.

6.5.2

Cutset conditioning

Tree decomposition

The second way to reduce a constraint graph to a tree is based on constructing a tree decom-

position of the constraint graph: a transformation of the original graph into a tree where each node in the tree consists of a set of variables, as in Figure 6.13. A tree decomposition must

satisfy these three requirements: + Every variable in the original problem appears in at least one of the tree nodes. If two variables are connected by a constraint in the original problem, they must appear together (along with the constraint) in at least one of the tree nodes.

« If a variable appears in two nodes in the tree, it must appear in every node along the

path connecting those nodes.

Tree decomposition

202

Chapter 6

Constraint Satisfaction Problems

Figure 6.13 A tree decomposition of the constraint graph in Figure 6.12(a).

The first two conditions ensure that all the variables and constraints are represented in the tree decomposition.

The third condition seems rather technical, but allows us to say that

any variable from the original problem must have the same value wherever it appears: the

constraints in the tree say that a variable in one node of the tree must have the same value as

the corresponding variable in the adjacent node in the tree. For example, SA appears in all four of the connected nodes in Figure 6.13, so each edge in the tree decomposition therefore

includes the constraint that the value of SA in one node must be the same as the value of SA

in the next. You can verify from Figure 6.12 that this decomposition makes sense.

Once we have a tree-structured graph, we can apply TREE-CSP-SOLVER to get a solution

in O(nd?) time, where n is the number of tree nodes and d is the size of the largest domain. But note that in the tree, a domain is a set of tuples of values, not just individual values.

For example, the top left node in Figure 6.13 represents, at the level of the original prob-

lem, a subproblem with variables {WA,NT,SA}, domain {red, green, blue}, and constraints WA # NT,SA # NT,WA # SA. At the level of the tree, the node represents a single variable, which we can call SANTWA,

whose value must be a three-tuple of colors,

such as

(red, green,blue), but not (red,red,blue), because that would violate the constraint SA # NT

from the original problem. We can then move from that node to the adjacent one, with the variable we can call SANTQ, and find that there is only one tuple, (red, green, blue), that is

consistent with the choice for SANTWA.

The exact same process is repeated for the next two

nodes, and independently we can make any choice for T.

We can solve any tree decomposition problem in O(nd?) time with TREE-CSP-SOLVER,

which will be efficient as long as d remains small. Going back to our example with 100 Boolean variables, if each node has 10 variables, then d=2'° and we should be able to solve

the problem in seconds. But if there is a node with 30 variables, it would take centuries.

Tree width

A given graph admits many tree decompositions; in choosing a decomposition, the aim is to make the subproblems as small as possible. (Putting all the variables into one node is technically a tree, but is not helpful.) The tree width of a tree decomposition of a graph is

Summary

203

one less than the size of the largest node; the tree width of the graph itself is defined to be

the minimum width among all its tree decompositions.

If a graph has tree width w then the

problem can be solved in O(nd"*!) time given the corresponding tree decomposition. Hence, CSPs with constraint graphs of bounded tree width are solvable in polynomial time.

Unfortunately, finding the decomposition with minimal tree width is NP-hard, but there

are heuristic methods that work well in practice.

Which is better: the cutset decomposition

with time O(d°- (n — ¢)d?), or the tree decomposition with time O(nd"*')? Whenever you

have a cycle-cutset of size c, there is also a tree width of size w < ¢+ 1, and it may be far

smaller in some cases. So time consideration favors tree decomposition, but the advantage of

the cycle-cutset approach is that it can be executed in linear memory, while tree decomposition requires memory exponential in w.

6.5.3

Value symmetry

So far, we have looked at the structure of the constraint graph. There can also be important structure in the values of variables, or in the structure of the constraint relations themselves.

Consider the map-coloring problem with d colors.

For every consistent solution, there is

actually a set of d! solutions formed by permuting the color names. For example, on the

Australia map we know that WA, NT, and SA must all have different colors, but there are

31

=6 ways to assign three colors to three regions. This is called value symmetry. We would Value symmetry

like to reduce the search space by a factor of d! by breaking the symmetry in assignments.

Symmetry-breaking We do this by introducing a symmetry-breaking constraint. For our example, we might cons traint impose an arbitrary ordering constraint, NT < SA < WA, that requires the three values to be in alphabetical order.

This constraint ensures that only one of the d! solutions is possible:

{NT = blue,SA = green, WA = red}.

For map coloring, it was easy to find a constraint that eliminates the symmetry. In general it is NP-hard to eliminate all symmetry, but breaking value symmetry has proved to be

important and effective on a wide range of problems.

Summary + Constraint satisfaction problems (CSPs) represent a state with a set of variable/value

pairs and represent the conditions for a solution by a set of constraints on the variables. Many important real-world problems can be described as CSPs.

+ A number of inference techniques use the constraints to rule out certain variable as-

signments. These include node, arc, path, and k-consistency.

+ Backtracking search, a form of depth-first search, is commonly used for solving CSPs. Inference can be interwoven with search.

« The minimum-remaining-values and degree heuristics are domain-independent methods for deciding which variable to choose next in a backtracking search.

The least-

constraining-value heuristic helps in deciding which value to try first for a given variable. Backtracking occurs when no legal assignment can be found for a variable.

Conflict-directed backjumping backtracks directly to the source of the problem. Constraint learning records the conflicts as they are encountered during search in order to avoid the same conflict later in the search.

204

Chapter 6

Constraint Satisfaction Problems

+ Local search using the min-conflicts heuristic has also been applied to constraint satis-

faction problems with great success. + The complexity of solving a CSP is strongly related to the structure of its constraint

graph. Tree-structured problems can be solved in linear time. Cutset conditioning can

reduce a general CSP to a tree-structured one and is quite efficient (requiring only lin-

ear memory) if a small cutset can be found. Tree decomposition techniques transform

the CSP into a tree of subproblems and are efficient if the tree width of the constraint

graph is small; however they need memory exponential in the tree width of the con-

straint graph. Combining cutset conditioning with tree decomposition can allow a better

tradeoff of memory versus time.

Bibliographical and Historical Notes

Diophantine equations

The Greek mathematician Diophantus (c. 200-284) presented and solved problems involving algebraic constraints on equations, although he didn’t develop a generalized methodology. ‘We now call equations over integer domains Diophantine equations.

The Indian mathe-

matician Brahmagupta (c. 650) was the first to show a general solution over the domain of integers for the equation ax+ by = c. Systematic methods for solving linear equations by variable elimination were studied by Gauss (1829); the solution of linear inequality constraints

goes back to Fourier (1827). Finite-domain constraint satisfaction problems also have a long history. For example, graph coloring (of which map coloring is a special case) is an old problem in mathematics.

The four-color conjecture (that every planar graph can be colored with four or fewer colors) was first made by Francis Guthrie, a student of De Morgan, in 1852.

It resisted solution—

despite several published claims to the contrary—until a proof was devised by Appel and Haken (1977) (see the book Four Colors Suffice (Wilson, 2004)). Purists were disappointed that part of the proof relied on a computer,

so Georges Gonthier (2008), using the COQ

theorem prover, derived a formal proof that Appel and Haken’s proof program was correct.

Specific classes of constraint satisfaction problems occur throughout the history of com-

puter science. One of the most influential early examples was SKETCHPAD (Sutherland, 1963), which solved geometric constraints in diagrams and was the forerunner of modern

drawing programs and CAD tools. The identification of CSPs as a general class is due to Ugo Montanari (1974). The reduction of higher-order CSPs to purely binary CSPs with auxiliary variables (see Exercise 6.NARY) is due originally to the 19th-century logician Charles Sanders Peirce. It was introduced into the CSP literature by Dechter (1990b) and was elaborated by Bacchus and van Beek (1998). CSPs with preferences among solutions are studied

widely in the optimization literature; see Bistarelli er al. (1997) for a generalization of the

CSP framework to allow for preferences.

Constraint propagation methods were popularized by Waltz’s (1975) success on polyhedral line-labeling problems for computer vision. Waltz showed that in many problems, propagation completely eliminates the need for backtracking. Montanari (1974) introduced the notion of constraint graphs and propagation by path consistency. Alan Mackworth (1977)

proposed the AC-3 algorithm for enforcing arc consistency as well as the general idea of combining backtracking with some degree of consistency enforcement. AC-4, a more efficient

Bibliographical and Historical Notes

205

arc-consistency algorithm developed by Mohr and Henderson (1986), runs in O(cd?) worstcase time but can be slower than AC-3 on average cases. The PC-2 algorithm (Mackworth,

1977) achieves path consistency in much the same way that AC-3 achieves arc consistency. Soon after Mackworth’s paper appeared, researchers began experimenting with the trade-

off between the cost of consistency enforcement and the benefits in terms of search reduction. Haralick and Elliott (1980) favored the minimal forward-checking algorithm described

by McGregor (1979), whereas Gaschnig (1979) suggested full arc-consistency checking after each variable assignment—an algorithm later called MAC

by Sabin and Freuder (1994). The

latter paper provides somewhat convincing evidence that on harder CSPs, full arc-consistency

checking pays off. Freuder (1978, 1982) investigated the notion of k-consistency and its relationship to the complexity of solving CSPs. Dechter and Dechter (1987) introduced directional arc consistency. Apt (1999) describes a generic algorithmic framework within which consistency propagation algorithms can be analyzed, and surveys are given by Bessiére (2006) and Bartdk et al. (2010). Special methods for handling higher-order or global constraints were developed first logic within the context of constraint logic programming. Marriott and Stuckey (1998) pro- Constraint programming vide excellent coverage of research in this area. The Alldiff constraint was studied by Regin (1994), Stergiou and Walsh (1999), and van Hoeve (2001). There are more complex inference algorithms for Alldiff (see van Hoeve and Katriel, 2006) that propagate more constraints but are more computationally expensive to run. Bounds constraints were incorporated into con-

straint logic programming by Van Hentenryck et al. (1998). A survey of global constraints is provided by van Hoeve and Katriel (2006). Sudoku has become the most widely known CSP and was described as such by Simonis (2005). Agerbeck and Hansen (2008) describe some of the strategies and show that Sudoku

on an n? x n? board is in the class of NP-hard problems. In 1850, C. F. Gauss described

a recursive backtracking algorithm

for solving the 8-

queens problem, which had been published in the German chess magazine Schachzeirung in

1848. Gauss called his method Tatonniren, derived from the French word taronner—to grope around, as if in the dark. According to Donald Knuth (personal communication), R. J. Walker introduced the term backtrack in the 1950s. Walker (1960) described the basic backtracking algorithm and used it to find all solutions to the 13-queens problem. Golomb and Baumert (1965) formulated, with

examples, the general class of combinatorial problems to which backtracking can be applied, and introduced what we call the MRV

heuristic.

Bitner and Reingold (1975) provided an

influential survey of backtracking techniques. Brelaz (1979) used the degree heuristic as a

tiebreaker after applying the MRV heuristic. The resulting algorithm, despite its simplicity,

is still the best method for k-coloring arbitrary graphs. Haralick and Elliott (1980) proposed

the least-constraining-value heuristic. The basic backjumping

method is due to John Gaschnig (1977,

1979).

Kondrak and

van Beek (1997) showed that this algorithm is essentially subsumed by forward checking. Conflict-directed backjumping was devised by Prosser (1993).

Dechter (1990a) introduced

graph-based backjumping, which bounds the complexity of backjumping-based algorithms. as a function of the constraint graph (Dechter and Frost, 2002).

A very general form of intelligent backtracking was developed early on by Stallman and Sussman (1977). Their technique of dependency-directed backtracking combines bacl k. Dependency-directed backtracking

206

Chapter 6

Constraint Satisfaction Problems

jumping with no-good learning (McAllester, 1990) and led to the development of truth maintenance systems (Doyle, 1979), which we discuss in Section 10.6.2. The connection between

Constraint learning

the two areas is analyzed by de Kleer (1989).

The work of Stallman and Sussman also introduced the idea of constraint learning, in which partial results obtained by search can be saved and reused later in the search. The

idea was formalized by Dechter (1990a). Backmarking (Gaschnig, 1979) is a particularly

simple method in which consistent and inconsistent pairwise assignments are saved and used to avoid rechecking constraints. Backmarking can be combined with conflict-directed back-

jumping; Kondrak and van Beek (1997) present a hybrid algorithm that provably subsumes either method taken separately. The method of dynamic backtracking (Ginsberg,

1993) retains successful partial as-

signments from later subsets of variables when backtracking over an earlier choice that does

not invalidate the later success.

Moskewicz er al. (2001) show how these techniques and

others are used to create an efficient SAT solver. Empirical studies of several randomized backtracking methods were done by Gomes ez al. (2000) and Gomes and Selman (2001).

Van Beek (2006) surveys backtracking. Local search in constraint satisfaction problems was popularized by the work of Kirkpatrick et al. (1983) on simulated annealing (see Chapter 4), which is widely used for VLSI

layout and scheduling problems. Beck ef al. (2011) give an overview of recent work on jobshop scheduling. The min-conflicts heuristic was first proposed by Gu (1989) and was devel-

oped independently by Minton ez al. (1992).

Sosic and Gu (1994) showed how it could be

applied to solve the 3,000,000 queens problem in less than a minute. The astounding success

of local search using min-conflicts on the n-queens problem led to a reappraisal of the nature and prevalence of “easy” and “hard” problems.

Peter Cheeseman ez al. (1991) explored the

difficulty of randomly generated CSPs and discovered that almost all such problems either are trivially easy or have no solutions. Only if the parameters of the problem generator are

set in a certain narrow range, within which roughly half of the problems are solvable, do we. find “hard” problem instances. We discuss this phenomenon further in Chapter 7. Konolige (1994) showed that local search is inferior to backtracking search on problems

with a certain degree of local structure; this led to work that combined

local search and

inference, such as that by Pinkas and Dechter (1995). Hoos and Tsang (2006) provide a survey of local search techniques, and textbooks are offered by Hoos and Stiitzle (2004) and

Aarts and Lenstra (2003). Work relating the structure and complexity of CSPs originates with Freuder (1985) and

Mackworth and Freuder (1985), who showed that search on arc-consistent trees works with-

out any backtracking. A similar result, with extensions to acyclic hypergraphs, was developed in the database community (Beeri ef al., 1983). Bayardo and Miranker (1994) present an algorithm for tree-structured CSPs that runs in linear time without any preprocessing. Dechter

(1990a) describes the cycle-cutset approach. Since those papers were published, there has been a great deal of progress in developing more general results relating the complexity of solving a CSP to the structure of its constraint

graph.

The notion of tree width was introduced by the graph theorists Robertson and Sey-

mour (1986). Dechter and Pearl (1987, 1989), building on the work of Freuder, applied a related notion (which they called induced width but is identical to tree width) to constraint

satisfaction problems and developed the tree decomposition approach sketched in Section 6.5.

Bibliographical and Historical Notes Drawing on this work and on results from database theory, Gottlob ez al. (1999a, 1999b)

developed a notion, hypertree width, that is based on the characterization of the CSP as a

hypergraph. In addition to showing that any CSP with hypertree width w can be solved in time O(n"*+!logn), they also showed that hypertree width subsumes all previously defined measures of “width” in the sense that there are cases where the hypertree width is bounded

and the other measures are unbounded.

The RELSAT algorithm of Bayardo and Schrag (1997) combined constraint learning and

backjumping and was shown to outperform many other algorithms of the time. This led to AND-OR search algorithms applicable to both CSPs and probabilistic reasoning (Dechter and Mateescu, 2007). Brown et al. (1988) introduce the idea of symmetry breaking in CSPs,

and Gent et al. (2006) give a survey.

The field of distributed constraint satisfaction looks at solving CSPs when there is a

collection of agents, each of which controls a subset of the constraint variables. There have

been annual workshops on this problem since 2000, and good coverage elsewhere (Collin et al., 1999; Pearce et al., 2008).

Comparing CSP algorithms is mostly an empirical science: few theoretical results show that one algorithm dominates another on all problems; instead, we need to run experiments to see which algorithms perform better on typical instances of problems. As Hooker (1995)

points out, we need to be careful to distinguish between competitive testing—as occurs in

competitions among algorithms based on run time—and scientific testing, whose goal is to identify the properties of an algorithm that determine its efficacy on a class of problems. The textbooks by Apt (2003), Dechter (2003), Tsang (1993), and Lecoutre (2009), and the collection by Rossi er al. (2006), are excellent resources on constraint processing. There

are several good survey articles, including those by Dechter and Frost (2002), and Bartdk et al. (2010). Carbonnel and Cooper (2016) survey tractable classes of CSPs.

Kondrak and

van Beek (1997) give an analytical survey of backtracking search algorithms, and Bacchus and van Run (1995) give a more empirical survey. Constraint programming is covered in the books by Apt (2003) and Fruhwirth and Abdennadher (2003). Papers on constraint satisfac-

tion appear regularly in Artificial Intelligence and in the specialist journal Constraints; the latest SAT solvers are described in the annual International SAT Competition. The primary

conference venue is the International Conference on Principles and Practice of Constraint

Programming, often called CP.

207

TR

[

LOGICAL AGENTS In which we design agents that can form representations ofa complex world, use a process. of inference to derive new representations about the world, and use these new representa-

tions to deduce what to do.

Knowledge-based agents.

Reasoning Representation

Humans, it seems, know things; and what they know helps them do things. In AI knowledge-

based agents use a process of reasoning over an internal representation of knowledge to

decide what actions to take.

The problem-solving agents of Chapters 3 and 4 know things, but only in a very limited, inflexible sense. They know what actions are available and what the result of performing a specific action from a specific state will be, but they don’t know general facts. A route-finding agent doesn’t know that it is impossible for a road to be a negative number of kilometers long.

An 8-puzzle agent doesn’t know that two tiles cannot occupy the same space. The knowledge they have is very useful for finding a path from the start to a goal, but not for anything else. The atomic representations used by problem-solving agents are also very limiting. In a partially observable environment, for example, a problem-solving agent’s only choice for representing what it knows about the current state is to list all possible concrete states. I could

give a human the goal of driving to a U.S. town with population less than 10,000, but to say

that to a problem-solving agent, I could formally describe the goal only as an explicit set of the 16,000 or so towns that satisfy the description.

Chapter 6 introduced our first factored representation, whereby states are represented as

assignments of values to variables; this is a step in the right direction, enabling some parts of

the agent to work in a domain-independent way and allowing for more efficient algorithms.

In this chapter, we take this step to its logical conclusion, so to speak—we develop logic as a

general class of representations to support knowledge-based agents. These agents can com-

bine and recombine information to suit myriad purposes. This can be far removed from the needs of the moment—as

when a mathematician proves a theorem or an astronomer calcu-

lates the Earth’s life expectancy.

Knowledge-based agents can accept new tasks in the form

new environment, the wumpus

world, and illustrates the operation of a knowledge-based

of explicitly described goals; they can achieve competence quickly by being told or learning new knowledge about the environment; and they can adapt to changes in the environment by updating the relevant knowledge. We begin in Section 7.1 with the overall agent design. Section 7.2 introduces a simple agent without going into any technical detail. Then we explain the general principles of logic

in Section 7.3 and the specifics of propositional logic in Section 7.4. Propositional logic is

a factored representation; while less expressive than first-order logic (Chapter 8), which is the canonical structured representation, propositional logic illustrates all the basic concepts

Section 7.1

Knowledge-Based Agents

209

function KB-AGENT(percept) returns an action

persistent: KB, a knowledge base 1, acounter, initially 0, indicating time TELL(KB, MAKE-PERCEPT-SENTENCE(percept, 1))

action ASK(KB, MAKE-ACTION-QUERY(1)) TELL(KB, MAKE-ACTION-SENTENCE(action, 1)) tertl return action

Figure 7.1 A generic knowledge-based agent. Given a percept, the agent adds the percept o0 its knowledge base, asks the knowledge base for the best action, and tells the knowledge base that it has in fact taken that action.

of logic. It also comes with well-developed inference technologies, which we describe in sections 7.5 and 7.6. Finally, Section 7.7 combines the concept of knowledge-based agents with the technology of propositional logic to build some simple agents for the wumpus world. 7.1

Knowledge-Based

Agents

The central component of a knowledge-based agent is its knowledge base, or KB. A knowledge base

is a set of sentences.

(Here “sentence” is used as a technical term.

It is related

but not identical to the sentences of English and other natural languages.) Each sentence is

Knowledge base Sentence

expressed in a language called a knowledge representation language and represents some

Knowledge representation language

from other sentences, we call it an axiom.

Axiom

assertion about the world. When the sentence is taken as being given without being derived

There must be a way to add new sentences to the knowledge base and a way to query

what is known.

The standard names for these operations are TELL and ASK, respectively.

Both operations may involve inference—that is, deriving new sentences from old. Inference Inference must obey the requirement that when one ASKs a question of the knowledge base, the answer

should follow from what has been told (or TELLed) to the knowledge base previously. Later in this chapter, we will be more precise about the crucial word “follow.” For now, take it to

mean that the inference process should not make things up as it goes along. Figure 7.1 shows the outline of a knowledge-based agent program. Like all our agents, it takes a percept as input and returns an action. The agent maintains a knowledge base, KB,

which may initially contain some background knowledge.

Each time the agent program is called, it does three things. First, it TELLS the knowledge

base what it perceives. Second, it ASKs the knowledge base what action it should perform. In

the process of answering this query, extensive reasoning may be done about the current state of the world, about the outcomes of possible action sequences, and so on. Third, the agent

program TELLs the knowledge base which action was chosen, and returns the action so that it can be executed.

The details of the representation language are hidden inside three functions that imple-

ment the interface between the sensors and actuators on one side and the core representation and reasoning system on the other.

MAKE-PERCEPT-SENTENCE

constructs a sentence as-

Background knowledge

210

Chapter 7 Logical Agents serting that the agent perceived the given percept at the given time. MAKE-ACTION-QUERY constructs a sentence that asks what action should be done at the current time. Finally, MAKE-ACTION-SENTENCE constructs a sentence asserting that the chosen action was executed. The details of the inference mechanisms are hidden inside TELL and ASK. Later

sections will reveal these details.

The agent in Figure 7.1 appears quite similar to the agents with internal state described

in Chapter 2. Because of the definitions of TELL and ASK, however, the knowledge-based

Knowledge level

agent iis not an arbitrary program for calculating actions. It is amenable to a description at the knowledge level, where we need specify only what the agent knows and what its goals are, in order to determine its behavior.

For example, an automated taxi might have the goal of taking a passenger from San

Francisco to Marin County and might know that the Golden Gate Bridge is the only link

between the two locations. Then we can expect it to cross the Golden Gate Bridge because it

Implementation level

knows that that will achieve its goal. Notice that this analysis is independent of how the taxi

works at the implementation level. It doesn’t matter whether its geographical knowledge is

implemented as linked lists or pixel maps, or whether it reasons by manipulating strings of symbols stored in registers or by propagating noisy signals in a network of neurons.

A knowledge-based agent can be built simply by TELLing it what it needs to know. Start-

Declarative Procedural

ing with an empty knowledge base, the agent designer can TELL sentences one by one until

the agent knows how to operate in its environment. This is called the declarative approach to system building. In contrast, the procedural approach encodes desired behaviors directly as program code. In the 1970s and 1980s, advocates of the two approaches engaged in heated debates.

We now understand that a successful agent often combines both declarative and

procedural elements in its design, and that declarative knowledge can often be compiled into more efficient procedural code.

We can also provide a knowledge-based agent with mechanisms that allow it to learn for

itself. These mechanisms, which are discussed in Chapter 19, create general knowledge about

the environment from a series of percepts. A learning agent can be fully autonomous.

7.2 Wumpus world

The Wumpus

World

In this section we describe an environment in which knowledge-based agents can show their

worth. The wumpus world is a cave consisting of rooms connected by passageways. Lurking

somewhere in the cave is the terrible wumpus, a beast that eats anyone who enters its room.

The wumpus can be shot by an agent, but the agent has only one arrow. Some rooms contain bottomless pits that will trap anyone who wanders into these rooms (except for the wumpus, which

is too big to fall in).

The only redeeming feature of this bleak environment is the

possibility of finding a heap of gold. Although the wumpus world is rather tame by modern computer game standards, it illustrates some important points about intelligence. A sample wumpus world is shown in Figure 7.2. The precise definition of the task environment is given, as suggested in Section 2.3, by the PEAS description: o Performance measure:

+1000 for climbing out of the cave with the gold, 1000

for

falling into a pit or being eaten by the wumpus, —1 for each action taken, and ~10 for

using up the arrow. The game ends either when the agent dies or when the agent climbs out of the cave.

Section7.2

4

| S8

L0

Zhreozs-

W= o

|sss START. 1

The Wumpus World

@)= o PIT

2

3

4

Figure 7.2 A typical wumpus world. The agent is in the botiom left corner, facing east (rightward). o Environment: A 4 x4 grid of rooms, with walls surrounding the grid. The agent always starts in the square labeled [1,1], facing to the east. The locations of the gold and the wumpus are chosen randomly, with a uniform distribution, from the squares other

than the start square. In addition, each square other than the start can be a pit, with probability 0.2. o Actuators: The agent can move Forward, TurnLeft by 90°, or TurnRight by 90°. The agent dies a miserable death if it enters a square containing a pit or a live wumpus. (It is safe, albeit smelly, to enter a square with a dead wumpus.) If an agent tries to move

forward and bumps into a wall, then the agent does not move. The action Grab can be

used to pick up the gold if it is in the same square as the agent. The action Shoot can

be used to fire an arrow in a straight line in the direction the agent is facing. The arrow continues until it either hits (and hence kills) the wumpus or hits a wall. The agent has

only one arrow, so only the first Shoot action has any effect. Finally, the action Climb can be used to climb out of the cave, but only from square [1,1].

o Sensors: The agent has five sensors, each of which gives a single bit of information: — In the squares directly (not diagonally) adjacent to the wumpus, the agent will — — — ~

perceive a Stench.!

In the In the When When where

squares directly adjacent to a pit, the agent will perceive a Breeze. square where the gold is, the agent will perceive a Glitter. an agent walks into a wall, it will perceive a Bump. the wumpus is killed, it emits a woeful Scream that can be perceived anyin the cave.

The percepts will be given to the agent program in the form of a list of five symbols;

for example, if there is a stench and a breeze, but no glitter, bump, or scream, the agent program will get [Stench, Breeze, None, None, None]. ! Presumably the square containing the wumpus also has a stench, but any agent entering that square is eaten before being able to perceive anything.

211

212

Chapter 7 Logical Agents 34

33

A

B G o » s v W

2 31

=Wumpus

14

24

34

44

13

23

33

43

12

22 P

32

42

11

0K v oK.

@

31 5

[A1

()

Figure 7.3 The first step taken by the agent in the wumpus world. (a) The initial situation, after percept [None, None, None, None, None]. (b) After moving to [2,1] and perceiving [None, Breeze, None, None, None]. We can characterize the wumpus environment along the various dimensions given in Chapter 2. Clearly, it is deterministic, discrete, static, and single-agent.

(The wumpus doesn’t

move, fortunately.) It is sequential, because rewards may come only after many actions are taken. Itis partially observable, because some aspects of the state are not directly perceivable:

the agent’s location, the wumpus'’s state of health, and the availability of an arrow. As for the

locations of the pits and the wumpus: we could treat them as unobserved parts of the state— in which case, the transition model for the environment is completely known, and finding the

locations of pits completes the agent’s knowledge of the state. Alternatively, we could say that the transition model itself is unknown because the agent doesn’t know which Forward

actions are fatal—in which case, discovering the locations of pits and wumpus completes the agent’s knowledge of the transition model.

For an agent in the environment, the main challenge is its initial ignorance of the configuration of the environment; overcoming this ignorance seems to require logical reasoning. In most instances of the wumpus world, it is possible for the agent to retrieve the gold safely.

Occasionally, the agent must choose between going home empty-handed and risking death to

find the gold. About 21% of the environments are utterly unfair, because the gold is in a pit or surrounded by pits. Let us watch a knowledge-based wumpus agent exploring the environment shown in Figure 7.2. We use an informal knowledge representation language consisting of writing down symbols in a grid (as in Figures 7.3 and 7.4). The agent’s initial knowledge base contains the rules of the environment, as described

previously; in particular, it knows that it is in [1,1] and that [1,1] is a safe square; we denote that with an “A” and “OK,” respectively, in square [1,1]. The first percept is [None, None, None, None, None], from which the agent can conclude that its neighboring squares, [1,2] and [2,1], are free of dangers—they are OK. Figure 7.3(a) shows the agent’s state of knowledge at this point.

Section7.2

14

24

3.4

44

13w

|28

33

4.3

22

3.2

4.2

14

3.4

44

13 1

33 p,

[43

12

3.2

4.2

i OK

B

3.1

P

4.1

v OK

1.1

v oK 2,1

W OK

(@)

»

s

v oK

0K 2.1

2.4

The Wumpus World

B

3,1

Pt

4.1

WL OK

®)

Figure 7.4 Two later stages in the progress of the agent. (a) After moving to [1.1] and then [1.2), and perceiving [Stench, None, None, None, None]. (b) After moving to [2.2] and then [2.3), and perceiving [Stench, Breeze, Glitter, None, None]. A cautious agent will move only into a square that it knows to be OK. Let us suppose

the agent decides to move forward to [2,1]. The agent perceives a breeze (denoted by “B”) in [2,1], 50 there must be a pit in a neighboring square. The pit cannot be in [1,1], by the rules of the game, so there must be a pit in [2,2] or [3,1] or both. The notation “P?” in Figure 7.3(b) indicates a possible pit in those squares. At this point, there is only one known square that is

OK and that has not yet been visited. So the prudent agent will turn around, go back to [1,1], and then proceed to [1,2]. The agent perceives a stench in [1,2], resulting in the state of knowledge shown in Figure 7.4(a). The stench in [1,2] means that there must be a wumpus nearby. But the wumpus cannot be in [1,1], by the rules of the game, and it cannot be in [2,2] (or the agent would have detected a stench when it was in [2,1]). Therefore, the agent can infer that the wumpus

is in [1,3]. The notation W' indicates this inference. Moreover, the lack of a breeze in [1,2] implies that there is no pit in [2,2]. Yet the agent has already inferred that there must be a pit in either [2,2] or [3,1], so this means it must be in [3,1]. This is a fairly difficult inference, because it combines knowledge gained at different times in different places and relies on the

lack of a percept to make one crucial step.

The agent has now proved to itself that there is neither a pit nor a wumpus in [2,2], so it

is OK to move there. We do not show the agent’s state of knowledge at [2,2]; we just assume

that the agent turns and moves to [2,3], giving us Figure 7.4(b). In [2,3], the agent detects a glitter, so it should grab the gold and then return home.

Note that in each case for which the agent draws a conclusion from the available infor-

mation, that conclusion is guaranteed to be correct if the available information is correct.

This is a fundamental property of logical reasoning. In the rest of this chapter, we describe

how to build logical agents that can represent information and draw conclusions such as those

described in the preceding paragraphs.

213

214

Chapter 7 Logical Agents 7.3

Logic

This section summarizes the fundamental concepts of logical representation and reasoning. These beautiful ideas are independent of any of logic’s particular forms. We therefore post-

pone the technical details of those forms until the next section, using instead the familiar

example of ordinary arithmetic. Syntax Semantics Truth Possible world Model

In Section 7.1, we said that knowledge bases consist of sentences. These sentences are expressed according to the syntax of the representation language, which specifies all the

sentences that are well formed. The notion of syntax is clear enough in ordinary arithmetic: “x+y=4"is a well-formed sentence, whereas “x4y+ =" is not.

A logic must also define the semantics, or meaning, of sentences. The semantics defines

the truth of each sentence with respect to each possible world. For example, the semantics

for arithmetic specifies that the sentence “x+y=4"is true in a world where x is 2 and y is 2, but false in a world where x is 1 and y is 1. In standard logics, every sentence must be either true or false in each possible world—there is no “in between.”

When we need to be precise, we use the term model in place of “possible world.” Whereas possible worlds might be thought of as (potentially) real environments that the agent

might or might not be in, models are mathematical abstractions, each of which has a fixed

truth value (true or false) for every relevant sentence. Informally, we may think of a possible world as, for example, having x men and y women sitting at a table playing bridge, and the sentence x + y=4 is true when there are four people in total. Formally, the possible models are just all possible assignments of nonnegative integers to the variables x and y. Each such

Satisfaction Entailment

assignment determines the truth of any sentence of arithmetic whose variables are x and y. If a sentence « is true in model m, we say that m satisfies o or sometimes m is a model of o

‘We use the notation M(«) to mean the set of all models of a.

Now that we have a notion of truth, we are ready to talk about logical reasoning. This in-

volves the relation of logical entailment between sentences—the idea that a sentence follows

logically from another sentence. In mathematical notation, we write

aks

to mean that the sentence « entails the sentence 3. The formal definition of entailment is this:

«a k= Bif and only if, in every model in which a is true, 4 s also true. Using the notation just introduced, we can write

a k= Bifand only if M(a) C M(B). (Note the direction of the C here: if o |= 3, then « is a stronger assertion than 3 it rules out more possible worlds.) The relation of entailment is familiar from arithmetic; we are happy with the idea that the sentence x = 0 entails the sentence xy = 0. Obviously, in any model

where x i zero, it is the case that xy is zero (regardless of the value of y). We can apply the same kind of analysis to the wumpus-world reasoning example given in the preceding section.

Consider the situation in Figure 7.3(b):

the agent has detected

nothing in [1,1] and a breeze in [2,1]. These percepts, combined with the agent’s knowledge

of the rules of the wumpus world, constitute the KB. The agent is interested in whether the

adjacent squares [1,2], [2,2], and [3,1] contain pits. Each of the three squares might or might

2 Fuzzy logic,

cussed in Chapter 13, allows for degrees of truth.

Section7.3

Logic

215

(@

Figure 7.5 Possible models for the presence of pits in squares [1,2], [2:2], and [3.1]. The KB corresponding to the observations of nothing in [1.1] and a breeze in [2,1]is shown by the solid line. (a) Dotted line shows models of a1 (no pit in [1.2]). (b) Dotted line shows models of a3 (no pit in [2.2]). not contain a pit, so (ignoring other aspects of the world for now) there are 23=8 possible models. These eight models are shown in Figure 7.5.3

The KB can be thought of as a set of sentences or as a single sentence that asserts all the individual sentences. The KB is false in models that contradict what the agent knows—

for example, the KB is false in any model in which [1,2] contains a pit, because there is no breeze in [1,1]. There are in fact just three models in which the KB is true, and these are shown surrounded by a solid line in Figure 7.5. Now let us consider two possible conclusions: ay = “There is no pit in [1,2].”

=

“There is no pit in [2,2]."

We have surrounded the models of a; and a2 with dotted lines in Figures 7.5(a) and 7.5(b), respectively. By inspection, we see the following: in every model in which KB s true, v is also true. Hence, KB = a: there is no pit in [1,2]. We can also see that in some models in which KB is true, a is false.

Hence, KB does not entail a: the agent cannot conclude that there is no pit in [2,2]. (Nor can it conclude that there is a pit in [2,2].)*

The preceding example not only illustrates entailment but also shows how the definition

of entailment can be applied to derive conclusions—that is, to carry out logical inference.

The inference algorithm illustrated in Figure 7.5 is called model checking, because it enu‘merates all possible models to check that a is true in all models in which KB is true, that is, that M(KB) C M(a). 3 Although the figure shows the models as partial wumpus worlds, they are really nothing more than assignments of true and false 1o the sentences “there s a pit in [1,2]” etc. Models, in the mathematical sense, do not need to have “orrible "airy wumpuses in them. 4 The agent can calculate the probability that there is a pit in [2.2]; Chapter 12 shows how.

Logical inference Model checking

Chapter 7 Logical Agents

216

Aspects of the ~~=~Follows =~ > Aspect of the " veaiworld atwond '

Figure 7.6 Sentences are physical configurations of the agent, and reasoning is a process of constructing new physical configurations from old ones. Logical reasoning should ensure that the new configurations represent aspects of the world that actually follow from the aspects that the old configurations represent.

In understanding entailment and inference, it might help to think of the set of all consequences of KB as a haystack and of a as a needle. Entailment is like the needle being in the haystack; inference is like finding it. This distinction is embodied in some formal notation: if an inference algorithm i can derive from KB, we write KBt a, Sound

Truth-preserving

which is pronounced “a is derived from KB by i or “i derives a from KB.” An inference algorithm that derives only entailed sentences is called sound or truth-

preserving. Soundness is a highly desirable property. An unsound inference procedure essentially makes things up as it goes along—it announces the discovery of nonexistent needles.

Itis easy to see that model checking, when it is applicable,’ is a sound procedure.

Completeness

The property of completeness is also desirable:

an inference algorithm is complete if

it can derive any sentence that is entailed. For real haystacks, which are finite in extent, it seems obvious that a systematic examination can always decide whether the needle is in

the haystack. For many knowledge bases, however, the haystack of consequences is infinite, and completeness becomes an important issue.® Fortunately, there are complete inference procedures for logics that are sufficiently expressive to handle many knowledge bases. We have described a reasoning process whose conclusions are guaranteed to be true in any world in which the premises are true; in particular, if KB is true in the real world, then any sentence o derived from KB by a sound inference procedure is also true in the real world. So, while an inference process operates on “syntax™—internal physical configurations such as bits in registers or patterns of electrical blips in brains—the process corresponds to the realworld relationship whereby some aspect of the real world is the case by virtue of other aspects

of the real world being the case.” This correspondence between world and representation is Grounding

illustrated in Figure 7.6.

|

The final

issue to consider is grounding—the connection between logical reasoning pro-

cesses and the real environment in which the agent exists. In particular, how do we know that

5 Model checking works if the space of models is finite—for example, in wumpus worlds of fixed size. For arithmetic, on the other hand, the space of models is infinite: even if we restrict ourselves to the integers, there are infinitely many pairs of values forx and y in the sentence x-+y = 4. 6 Compare with the case of infinite search spaces in Chapter 3, where depth-first search is not complete. 7 As Wittgenstein (1922) put it in his famous Tractatus: “The world is everything that is the case.”

Section7.4

Propositional Logic: A Very Simple Logic

217

KB is true in the real world? (After all, KB is just “syntax” inside the agent’s head.) This is a

philosophical question about which many, many books have been written. (See Chapter 27.) A simple answer is that the agent’s sensors create the connection. For example, our wumpusworld agent has a smell sensor. The agent program creates a suitable sentence whenever there is a smell. Then, whenever that sentence is in the knowledge base, it is true in the real world.

Thus, the meaning and truth of percept sentences are defined by the processes of sensing and sentence construction that produce them. What about the rest of the agent’s knowledge, such

as its belief that wumpuses cause smells in adjacent squares? This is not a direct representation of a single percept, but a general rule—derived, perhaps, from perceptual experience but not identical to a statement of that experience.

General rules like this are produced by

a sentence construction process called learning, which is the subject of Part V. Learning is

fallible. Tt could be the case that wumpuses cause smells except on February 29 in leap years, which is when they take their baths.

Thus, KB may not be true in the real world, but with

good learning procedures, there is reason for optimism. Propo:

nal Logic:

A Very

ple Log

We now present propositional logic. We describe its syntax (the structure of sentences) and Propositional logic its semantics

(the way in which the truth of sentences is determined). From these, we derive

a simple, syntactic algorithm for logical inference that implements the semantic notion of

entailment. Everything takes place, of course, in the wumpus world. 7.4.1

Syntax

The syntax of propositional logic defines the allowable sentences. The atomic sentences Atomic sentences consist of a single proposition symbol. Each such symbol stands for a proposition that can be true or false.

Proposition symbol

We use symbols that start with an uppercase letter and may contain other

letters or subscripts, for example: P, Q, R, Wy3 and FacingEast. The names are arbitrary but are often chosen to have some mnemonic value—we use W 3 to stand for the proposition that the wumpus is in [1,3]. (Remember that symbols such as W;3 are atomic, i.e., W, 1, and 3 are not meaningful parts of the symbol.) There are two proposition symbols with fixed meanings: True is the always-true proposition and False is the always-false proposition. Complex sentences are constructed from simpler sentences, using parentheses and operators Complex sentences

called logical connectives. There are five connectives in common use:

Logical connectives

~ (not). A sentence such as =W, 3 is called the negation of W,5. A literal is cither an Negation Literal atomic sentence (a positive literal) or a negated atomic sentence (a negative literal). A (and). A sentence whose main connective is A, such as Wy 3 A Py 1, is called a conjunction; its parts are the conjuncts. (The A looks like an “A” for “And.”)

V (or). A sentence whose main connective is V, such as (Wi3 A Ps,1) V Wao, is a disjunc-

Conjunction

Disjunction tion; its parts are disjuncts—in this example, (Wj3 A Py;) and Wa. = (implies). A sentence such as (W3 APs;) = —Way is called an implication (or con- Implication ditional). Its premise or antecedent is (Wi3 A Ps.1), and its conclusion or consequent is =Wa.

Implications are also known as rules or if-then statements. The implication

symbol is sometimes written in other books as D or —.

< (if and only if). The sentence W; 3 —W, is a biconditional.

Premise Conclusion Rules Biconditional

218

Chapter 7 Logical Agents Sentence



AtomicSentence ComplexSentence

— —

True| False| P| Q| R| ... ( Sentence)

|

- Sentence

|

Sentence A Sentence

|

Sentence V Sentence

|

Sentence

=

Sentence

|

Sentence


=8 possible models—exactly those depicted in Figure 7.5. Notice, however, that the models are purely mathematical objects with no

necessary connection to wumpus worlds. P> is just a symbol; it might mean “there is a pit in [1,2]” or “I'm in Paris today and tomorrow.”

The semantics for propositional logic must specify how to compute the truth value of any

sentence, given a model. This is done recursively. All sentences are constructed from atomic

sentences and the five connectives; therefore, we need to specify how to compute the truth

of atomic sentences and how to compute the truth of sentences formed with each of the five

connectives. Atomic sentences are easy:

« True is true in every model and False is false in every model.

* The truth value of every other proposition symbol must be specified directly in the

model. For example, in the model m, given earlier, Py ; is false.

Section7.4 Propositional Logic: A Very Simple Logic

219

Figure 7.8 Truth tables for the five logical connectives. To use the table to compute, for example, the value of PVQ when P is true and Q is false, first look on the left for the row where P is true and Q is false (the third row). Then look in that row under the PV Q column

to see the result: true.

For complex sentences, we have five rules, which hold for any subsentences P and Q (atomic or complex) in any model m (here “iff” means “if and only if”): « =Pis true iffP s false in m. « PAQis true iff both P and Q are true in m. « PV Qis true iff either P or Q is truc in m. « P= Qis true unless P is true and Q is false in m. « P& Qis true iff P and Q are both true or both false in m. The rules can also be expressed with truth tables that specify the truth value of a complex

sentence for each possible assignment of truth values to its components. Truth tables for the

five connectives are given in Figure 7.8. From these tables, the truth value of any sentence s can be computed with respect to any model m by a simple recursive evaluation.

For ex-

ample, the sentence —P3 A (P2 V Py 1), evaluated in my, gives true A (false\/ true) = trueA true =true. Exercise 7.TRUV asks you to write the algorithm PL-TRUE?(s, m), which com-

putes the truth value of a propositional logic sentence s in a model 1.

The truth tables for “and,” “or,” and “not” are in close accord with our intuitions about

the English words. The main point of possible confusion is that P or Q is true or both.

Q is true when P is true

A different connective, called “exclusive or” (“xor” for short), yields

false when both disjuncts are true.® There is no consensus on the symbol for exclusive or;

some choices are \/ or # or &

The truth table for = may not quite fit one’s intuitive understanding of “P implies Q” or “if P then Q. For one thing, propositional logic does not require any relation of causation or relevance between P and Q. The sentence “5 is odd implies Tokyo is the capital of Japan” is a true sentence of propositional logic (under the normal interpretation), even though it is a decidedly odd sentence of English. Another point of confusion is that any implication is true whenever its antecedent is false. For example, “5 is even implies Sam is smart” is true, regardless of whether Sam is smart. This seems bizarre, but it makes sense if you think of

“P = Q" as saying, “If P is true, then I am claiming that Q is true; otherwise I am making no claim.” The only way for this sentence to be false is if P is true but Q is false. The biconditional, P < @, is true whenever both P => Q and Q = P are true. In English,

this is often written as “P if and only if Q.” Many of the rules of the wumpus world are best

8 Latin u;

two separate words:

el is inclusive or and “aut” is exclusive or.

Truth table

220

Chapter 7 Logical Agents

written using

For this reason, we have used a special notation—the double-boxed link—in Figure 10.4.

This link asserts that

Vx x€Persons

=

[Vy HasMother(x,y)

= y& FemalePersons).

‘We might also want to assert that persons have two legs—that is, Vx x€Persons

= Legs(x,2).

As before, we need to be careful not to assert that a category has legs; the single-boxed link

in Figure 10.4 is used to assert properties of every member of a category.

The semantic network notation makes it convenient to perform inheritance reasoning of

the kind introduced in Section 10.2. For example, by virtue of being a person, Mary inherits the property of having two legs. Thus, to find out how many legs Mary has, the inheritance

algorithm follows the MemberOf link from Mary to the category she belongs to, and then S Several early systems failed to distinguish between properties of members of a category and properties of the category as a whole. Thi istencies, as pointed out by Drew McDermott (1976) in his article “Artificial Intelligence Meets Natural Stupidity.” Another common problem was the use of /s links for both subset and membership relations, in correspondence with English usage: “a cat isa mammal” and “Fif is 4 cat” See Exercise 10.NATS for more on these is ues.

330

Chapter 10 Knowledge Representation

Tastioter

&5 Subser0y

ez

Persons )—=2 \uhm(l/ Subser0) Memberf

sisier0f

Menberof

Figure 10.4 A semantic network with four objects (John, Mary, 1, and 2) and four categories. Relations are denoted by labeled links.

Menberof Agent

During

Figure 10.5 A fragment of a semantic network showing the representation of the logical assertion Fly(Shankar, NewYork, NewDelhi, Yesterday). follows SubsetOf links up the hierarchy until it finds a category for which there is a boxed Legs link—in this case, the Persons category. The simplicity and efficiency of this inference mechanism, compared with semidecidable logical theorem proving, has been one of the main attractions of semantic networks.

Inheritance becomes complicated when an object can belong to more than one category

Multiple inheritance

or when a category can be a subset of more than one other category; this is called multiple inheritance. In such cases, the inheritance algorithm might find two or more conflicting values

answering the query. For this reason, multiple inheritance is banned in some object-oriented

programming (OOP) languages, such as Java, that use inheritance in a class hierarchy. It is usually allowed in semantic networks, but we defer discussion of that until Section 10.6.

The reader might have noticed an obvious drawback of semantic network notation, com-

pared to first-order logic: the fact that links between bubbles represent only binary relations.

For example, the sentence Fly(Shankar,NewYork, NewDelhi, Yesterday) cannot be asserted

directly in a semantic network.

Nonetheless, we can obtain the effect of n-ary assertions

by reifying the proposition itself as an event belonging to an appropriate event category.

Figure 10.5 shows the semantic network structure for this particular event.

Notice that the

restriction to binary relations forces the creation of a rich ontology of reified concepts.

Section 105

Reasoning Systems for Categories

331

Reification of propositions makes it possible to represent every ground, function-free

atomic sentence of first-order logic in the semantic network notation. Certain kinds of univer-

sally quantified sentences can be asserted using inverse links and the singly boxed and doubly boxed arrows applied to categories, but that still leaves us a long way short of full first-order logic.

Negation, disjunction, nested function symbols, and existential quantification are all

missing. Now it is possible to extend the notation to make it equivalent to first-order logic—as

in Peirce’s existential graphs—but doing so negates one of the main advantages of semantic

networks, which s the simplicity and transparency of the inference processes. Designers can build a large network and still have a good idea about what queries will be efficient, because (a) it is easy to visualize the steps that the inference procedure will go through and (b) in some cases the query language is so simple that difficult queries cannot be posed.

In cases where the expressive power proves to be too limiting, many semantic network

systems provide for procedural attachment to fill in the gaps.

Procedural attachment is a

technique whereby a query about (or sometimes an assertion of) a certain relation results in a

Procedural attachment

call to a special procedure designed for that relation rather than a general inference algorithm.

One of the most important aspects of semantic networks is their ability to represent default values for categories. Examining Figure 10.4 carefully, one notices that John has one Default value leg, despite the fact that he is a person and all persons have two legs. In a strictly logical KB, this would be a contradiction, but in a semantic network, the assertion that all persons have

two legs has only default status; that is, a person is assumed to have two legs unless this is contradicted by more specific information. The default semantics is enforced naturally by the inheritance algorithm, because it follows links upwards from the object itself (John in this

case) and stops as soon as it finds a value. We say that the default is overridden by the more

specific value. Notice that we could also override the default number of legs by creating a category of OneLeggedPersons, a subset of Persons of which John is a member. ‘We can retain a strictly logical semantics for the network if we say that the Legs

Overriding

rtion

for Persons includes an exception for John:

Vx x€Persons Ax # John = Legs(x,2). For a fixed network, this is semantically adequate but will be much less concise than the network notation itself if there are lots of exceptions. For a network that will be updated with

more assertions, however, such an approach fails—we really want to say that any persons as

yet unknown with one leg are exceptions t0o. Section 10.6 goes into more depth on this issue and on default reasoning in general. 10.5.2

Description logics

The syntax of first-order logic is designed to make it easy to say things about objects. Description logics are notations that are designed to make it easier to describe definitions

and

properties of categories. Description logic systems evolved from semantic networks in re-

Description logic

sponse to pressure to formalize what the networks mean while retaining the emphasis on

taxonomic structure as an organizing principle. The principal inference tasks for description logics are subsumption (checking if one Subsumption category is a subset of another by comparing their definitions) and classification (checking Classification whether an object belongs to a category). Some systems also include consistency of a cate-

gory definition—whether the membership criteria are logically satisfiable.

Consistency

332

Chapter 10 Knowledge Representation Concept



Thing| ConceptName

| And(Concept,...) | All(RoleName, Concept) | AtLeast(Integer, RoleName) | AtMost(Integer,RoleName)

| Fills(RoleName, IndividualName, . |

| Path

ConceptName RoleName



SameAs(Path, Path)

OneOf(IndividualName, . [RoleName, ...

— Adult| Female| Male| — Spouse| Daughter | Son| ...

Figure 10.6 The syntax of descriptions in a subset of the CLASSIC language.

The CLASSIC language (Borgida er al., 1989) is a typical description logic. The syntax

of CLASSIC descriptions is shown in Figure 10.6.% unmarried adult males we would write

For example, to say that bachelors are

Bachelor = And(Unmarried, Adult, Male) The equivalent in first-order logic would be Bachelor(x) < Unmarried(x) AAdult(x) A Male(x). Notice that the description logic has an algebra of operations on predicates, which of course

we can’t do in first-order logic. Any description in CLASSIC can be translated into an equiv-

alent first-order sentence, but some descriptions are more straightforward in CLASSIC.

For

example, to describe the set of men with at least three sons who are all unemployed and married to doctors, and at most two daughters who are all professors in physics or math

departments, we would use

And(Man, AtLeast(3, Son), AtMost(2, Daughter), All(Son, And(Unemployed, Married All(Spouse, Doctor))), All(Daughter, And(Professor, Fills(Department, Physics, Math)))) .

‘We leave it as an exercise to translate this into first-order logic.

Perhaps the most important aspect of description logics s their emphasis on tractability of inference. A problem instance is solved by describing it and then asking if it is subsumed by one of several possible solution categories. In standard first-order logic systems, predicting the solution time is often impossible. It is frequently left to the user to engineer the represen-

tation to detour around sets of sentences that seem to be causing the system to take several

© Notice that the language does not allow one to simply state that one concept, or category, is a subset of another. This is a deliberate policy: subsumption between categories must be derivable from some aspects of the descriptions of the categories. If not, then something is missing from the descriptions.

Section 10.6

Reasoning with Default Information

333

weeks to solve a problem. The thrust in description logics, on the other hand, is to ensure that

subsumption-testing can be solved in time polynomial in the size of the descriptions.”

This sounds wonderful in principle, until one realizes that it can only have one of two

consequences: either hard problems cannot be stated at all, or they require exponentially

large descriptions! However, the tractability results do shed light on what sorts of constructs

cause problems and thus help the user to understand how different representations behave. For example, description logics usually lack negation and disjunction. Each forces firstorder logical systems to go through a potentially exponential case analysis in order to ensure completeness.

CLASSIC allows only a limited form of disjunction in the Fills and OneOf

constructs, which permit disjunction over explicitly enumerated individuals but not over descriptions. With disjunctive descriptions, nested definitions can lead easily to an exponential

number of alternative routes by which one category can subsume another. 10.6

Reasoning with Default Information

In the preceding section, we saw a simple example of an assertion with default status: people have two legs. This default can be overridden by more specific information, such as that Long John Silver has one leg. We saw that the inheritance mechanism in semantic networks

implements the overriding of defaults in a simple and natural way. In this section, we study

defaults more generally, with a view toward understanding the semantics of defaults rather than just providing a procedural mechanism. 10.6.1

Circumscription and default logic

‘We have seen two examples of reasoning processes that violate the monotonicity property of

logic that was proved in Chapter 7.5 In this chapter we saw that a property inherited by all members of a category in a semantic network could be overridden by more specific informa-

Monotonicity

tion for a subcategory. In Section 9.4.4, we saw that under the closed-world assumption, ifa

proposition « is not mentioned in KB then KB }= —a, but KBA o |= ..

Simple introspection suggests that these failures of monotonicity are widespread in com-

monsense reasoning. It seems that humans often “jump to conclusions.” For example, when

one sees a car parked on the street, one is normally willing to believe that it has four wheels

even though only three are visible. Now, probability theory can certainly provide a conclusion

that the fourth wheel exists with high probability; yet, for most people, the possibility that the

car does not have four wheels will not arise unless some new evidence presents itself. Thus,

it seems that the four-wheel conclusion is reached by default, in the absence of any reason to

doubt it. If new evidence arrives—for example, if one sees the owner carrying a wheel and notices that the car is jacked up—then the conclusion can be retracted. This kind of reasoning

is said to exhibit nonmonotonicity, because the set of beliefs does not grow monotonically over time as new evidence arrives. Nonmonotonic logics have been devised with modified notions of truth and entailment in order to capture such behavior. We will look at two such

logics that have been studied extensively: circumscription and default logic.

Circumscription can be seen as a more powerful and precise version of the closed-world

7 CLASSIC provides efficient subsumption testing in practice, but the worst-case run time is exponential. 8 Recall that monotonicity requires all entailed sentences to remain entailed after new sentences are added to the KB. Thatis, if KB |= o then KBA 5 = .

Nonmonotonicity

Nonmonotonic logic Circumscription

334

Chapter 10 Knowledge Representation assumption. The idea is to specify particular predicates that are assumed to be “as false as possible”—that is, false for every object except those for which they are known to be true.

For example, suppose we want to assert the default rule that birds fly. We would introduce a predicate, say Abnormal, (x), and write Bird(x) A —Abnormaly (x) = Flies(x). If we say that Abnormal;

is to be circumscribed, a circumscriptive reasoner is entitled to

assume —Abnormal, (x) unless Abnormal; (x) is known to be true. This allows the conclusion

Model preference

Flies(Tweety) to be drawn from the premise Bird(Tweety), but the conclusion no longer holds if Abnormal, (Tweety) is asserted. Circumscription can be viewed as an example of a model preference logic.

In such

logics, a sentence is entailed (with default status) if it is true in all preferred models of the KB,

as opposed to the requirement of truth in all models in classical logic. For circumscription,

one model is preferred to another if it has fewer abnormal objects.” Let us see how this idea

works in the context of multiple inheritance in semantic networks. The standard example for

which multiple inheritance is problematic is called the “Nixon diamond.” It arises from the observation that Richard Nixon was both a Quaker (and hence by default a pacifist) and a

Republican (and hence by default not a pacifist). We can write this as follows: Republican(Nixon) A Quaker(Nixon). Republican(x) A ~Abnormaly(x) = —Pacifist(x).

Quaker(x) A —~Abnormals(x) = Pacifist(x).

If we circumscribe Abnormaly and Abnormals, there are two preferred models: one in which

Abnormaly(Nixon) and Pacifist(Nixon) are true and one in which Abnormals(Nixon) and —Pacifist(Nixon) are true. Thus, the circumscriptive reasoner remains properly agnostic as Prioritized circumscription

Default logic Default rules

to whether Nixon was a pacifist. If we wish, in addition, to assert that religious beliefs

take

precedence over political beliefs, we can use a formalism called prioritized circumscription to give preference to models where Abnormals is minimized.

Default logic is a formalism in which default rules can be written to generate contingent,

nonmonotonic conclusions. A default rule looks like this:

Bird(x) : Flies(x)/Flies(x). This rule means that if Bird(x) is true, and if Flies(x) is consistent with the knowledge base, then Flies(x) may be concluded by default. In general, a default rule has the form P:li,...,Jn/C

where P s called the prerequisite, C is the conclusion, and J; are the justifications—if any one

of them can be proven false, then the conclusion cannot be drawn. Any variable that appears

in J; or C must also appear in P. The Nixon-diamond example can be represented in default logic with one fact and two default rules:

Republican(Nixon) A Quaker(Nixon).

Republican(x) : ~Pacifist(x) | ~Pacifist(x) . Quaker(x) : Pacifist(x) /Pacifist(x) mption, one model s preferred to another if it has fewer true atoms—that i, preferred models are minimal models. There is a natural connection between the closed-world assumption and definiteclause KBs, because the fixed point reached by forward chaining on definite-clause KBs is the unique minimal model. See page 231 for more on this 9 Forthe closed-world

Section 10.6

Reasoning with Default Information

To interpret what the default rules mean, we define the notion of an extension of a default

335 Extension

theory to be a maximal set of consequences of the theory. That is, an extension S consists

of the original known facts and a set of conclusions from the default rules, such that no additional conclusions can be drawn from S, and the justifications of every default conclusion

in S are consistent with S. As in the case of the preferred models in circumscription, we have

two possible extensions for the Nixon diamond: one wherein he is a pacifist and one wherein he is not. Prioritized schemes exist in which some default rules can be given precedence over

others, allowing some ambiguities to be resolved. Since 1980, when nonmonotonic logics were first proposed, a great deal of progress

has been made in understanding their mathematical properties. There are still unresolved questions, however. For example, if “Cars have four wheels” is false, what does it mean to have it in one’s knowledge base? What is a good set of default rules to have? If we cannot

decide, for each rule separately, whether it belongs in our knowledge base, then we have a serious problem of nonmodularity.

Finally, how can beliefs that have default status be used

to make decisions? This is probably the hardest issue for default reasoning.

Decisions often involve tradeoffs, and one therefore needs to compare the strengths of be-

lief in the outcomes of different actions, and the costs of making a wrong decision. In cases

where the same kinds of decisions are being made repeatedly, it is possible to interpret default rules as “threshold probability” statements. For example, the default rule “My brakes are always OK” really means “The probability that my brakes are OK, given no other information, is sufficiently high that the optimal decision is for me to drive without checking them.” When

the decision context changes—for example, when one is driving a heavily laden truck down a steep mountain road—the default rule suddenly becomes inappropriate, even though there is no new evidence of faulty brakes. These considerations have led researchers to consider how

to embed default reasoning within probability theory or utility theory.

10.6.2

Truth maintenance systems

‘We have seen that many of the inferences drawn by a knowledge representation system will

have only default status, rather than being absolutely certain. Inevitably, some of these inferred facts will turn out to be wrong and will have to be retracted in the face of new infor-

mation. This process is called belief revision.!® Suppose that a knowledge base KB contains a sentence P—perhaps a default conclusion recorded by a forward-chaining algorithm, or perhaps just an incorrect assertion—and we want to execute TELL(KB, —P).

ating a contradiction, we must first execute RETRACT(KB, P).

To avoid cre-

This sounds easy enough.

Problems arise, however, if any additional sentences were inferred from P and asserted in the KB. For example, the implication P = Q might have been used to add Q. The obvious “solution”—retracting all sentences inferred from P—fails because such sentences may have other justifications besides P. For example, ifR and R = Q are also in the KB, then Q does not have to be removed after all. Truth maintenance systems, or TMSs, are designed to

handle exactly these kinds of complications. One simple approach to truth maintenance is to keep track of the order in which sentences are told to the knowledge

base by numbering

Belief revision

them from P; to P,.

When

the call

10 Belief revision is often contrasted with belief update, which occurs when a knowledge base is revised to reflect a change in the world rather than new information about a fixed world. Belief update combines belief revis with reasoning about time and change; it is also related to the process of filtering described in Chapter 14.

Truth maintenance system

336

Chapter 10 Knowledge Representation RETRACT(KB,

JT™S Justification

P,) is made, the system reverts to the state just before P; was added, thereby

removing both P, and any inferences that were derived from P, The sentences P through P, can then be added again. This is simple, and it guarantees that the knowledge base will be consistent, but retracting P; requires retracting and reasserting n — i sentences as well as undoing and redoing all the inferences drawn from those sentences. For systems to which many facts are being added—such as large commercial databas is impractical. A more efficient approach is the justification-based truth maintenance system, or JTMS.

In a JTMS, each sentence in the knowledge base is annotated with a justification consisting

of the set of sentences from which it was inferred. For example, if the knowledge base already contains P = Q, then TELL(P) will cause Q to be added with the justification {P, P = Q}.

In general, a sentence can have any number of justifications. Justifications make retraction efficient. Given the call RETRACT(P), the JTMS will delete exactly those sentences for which P is a member of every justification. So, if a sentence Q had the single justification {P. P =

Q}, it would be removed; if it had the additional justification {P, PVR

would still be removed; but if it also had the justification {R, PVR

=

Q}, it

= Q}, then it would

be spared. In this way, the time required for retraction of P depends only on the number of sentences derived from P rather than on the number of sentences added after P.

The JTMS assumes that sentences that are considered once will probably be considered

again, so rather than deleting a sentence from the knowledge base entirely when it loses

all justifications, we merely mark the sentence as being our of the knowledge base. If a subsequent assertion restores one of the justifications, then we mark the sentence as being back in.

In this way, the JTMS

retains all the inference chains that it uses and need not

rederive sentences when a justification becomes valid again.

In addition to handling the retraction of incorrect information, TMSs

can be used to

speed up the analysis of multiple hypothetical situations. Suppose, for example, that the Romanian Olympic Committee is choosing sites for the swimming, athletics, and equestrian

events at the 2048 Games to be held in Romania.

For example, let the first hypothesis be

Site(Swimming, Pitesti), Site(Athletics, Bucharest), and Site(Equestrian,Arad). A great deal of reasoning must then be done to work out the logistical consequences and hence

the desirability of this selection.

If we

want to consider Site(Athletics, Sibiu)

instead, the TMS avoids the need to start again from scratch. Instead, we simply retract Site(Athletics, Bucharest) and as ert Site(Athletics, Sibiu) and the TMS takes care of the necessary revisions. Inference chains generated from the choice of Bucharest can be reused with

ATMS

Sibiu, provided that the conclusions are the same.

An assumption-based truth maintenance system, or ATMS,

makes this type of context-

switching between hypothetical worlds particularly efficient. In a JTMS, the maintenance of

justifications allows you to move quickly from one state to another by making a few retrac-

tions and assertions, but at any time only one state is represented. An ATMS represents all the states that have ever been considered at the same time. Whereas a JTMS simply labels each

sentence as being in or out, an ATMS keeps track, for each sentence, of which assumptions

would cause the sentence to be true. In other words, each sentence has a label that consists of

a set of assumption sets. The sentence is true just in those cases in which all the assumptions

Explanation

in one of the assumption sets are true. Truth maintenance systems also provide a mechanism for generating explanations. Tech-

nically, an explanation of a sentence P is a set of sentences E such that E entails P. If the

Summary sentences in E are already known to be true, then E simply provides a sufficient basis for proving that P must be the case. But explanations can also include assumptions—sentences

that are not known to be true, but would suffice to prove P if they were true. For example, if your car won’t start, you probably don’t have enough information to definitively prove the

reason for the problem. But a reasonable explanation might include the assumption that the

battery is dead. This, combined with knowledge of how cars operate, explains the observed nonbehavior.

In most cases, we will prefer an explanation E that is minimal, meaning that

there is no proper subset of E that is also an explanation. An ATMS can generate explanations for the “car won’t start” problem by making assumptions (such as “no gas in car” or “battery

dead”) in any order we like, even if some assumptions are contradictory. Then we look at the

label for the sentence “car won’t start™ to read off the sets of assumptions that would justify the sentence. The exact algorithms used to implement truth maintenance systems are a little compli-

cated, and we do not cover them here. The computational complexity of the truth maintenance. problem is at least as great as that of propositional inference—that is, NP-hard.

Therefore,

you should not expect truth maintenance to be a panacea. When used carefully, however, a TMS can provide a substantial increase in the ability of a logical system to handle complex environments and hypotheses.

Summary

By delving into the details of how one represents a variety of knowledge, we hope we have given the reader a sense of how real knowledge bases are constructed and a feeling for the interesting philosophical issues that arise. The major points are as follows: + Large-scale knowledge representation requires a general-purpose ontology to organize

and tie together the various specific domains of knowledge. « A general-purpose ontology needs to cover a wide variety of knowledge and should be capable, in principle, of handling any domain. « Building a large, general-purpose ontology is a significant challenge that has yet to be fully realized, although current frameworks seem to be quite robust.

« We presented an upper ontology based on categories and the event calculus. We covered categories, subcategories, parts, structured objects, measurements, substances,

events, time and space, change, and beliefs. « Natural kinds cannot be defined completely in logic, but properties of natural kinds can be represented. * Actions, events, and time can be represented with the event calculus.

Such represen-

tations enable an agent to construct sequences of actions and make logical inferences

about what will be true when these actions happen.

* Special-purpose representation systems, such as semantic networks and description logics, have been devised to help in organizing a hierarchy of categories. Inheritance is an important form of inference, allowing the properties of objects to be deduced from

their membership in categories.

337

Assumption

338

Chapter 10 Knowledge Representation « The closed-world assumption, as implemented in logic programs, provides a simple way to avoid having to specify lots of negative information. default that can be overridden by additional information.

It is best interpreted as a

+ Nonmonotonic logics, such as circumscription and default logic, are intended to cap-

ture default reasoning in general.

+ Truth maintenance systems handle knowledge updates and revisions efficiently.

« It is difficult to construct large ontologies by hand; extracting knowledge from text makes the job easier. Bibliographical and Historical Notes Briggs (1985) claims that knowledge representation research began with first millennium BCE

Indian theorizing about the grammar of Shastric Sanskrit. Western philosophers trace their work on the subject back to c. 300 BCE in Aristotle’s Metaphysics (literally, what comes after

the book on physics). The development of technical terminology in any field can be regarded as a form of knowledge representation.

Early discussions of representation in Al tended to focus on “problem representation”

rather than “knowledge representation.” (See, for example, Amarel’s (1968) discussion of the “Mi naries and Cannibals” problem.) In the 1970s, Al emphasized the development of

“expert systems” (also called “knowledge-based systems”) that could, if given the appropriate domain knowledge, match or exceed the performance of human experts on narrowly defined tasks. For example, the first expert system, DENDRAL (Feigenbaum ef al., 1971; Lindsay

et al., 1980), interpreted the output of a mass spectrometer (a type of instrument used to ana-

Iyze the structure of organic chemical compounds) as accurately as expert chemists. Although

the success of DENDRAL was instrumental in convincing the Al research community of the importance of knowledge representation, the representational formalisms used in DENDRAL

are highly specific to the domain of chemistry.

Over time, researchers became interested in standardized knowledge representation for-

malisms and ontologies that could assist in the creation of new expert systems. This brought

them into territory previously explored by philosophers of science and of language. The discipline imposed in Al by the need for one’s theories to “work” has led to more rapid and deeper progress than when these problems were the exclusive domain of philosophy (although it has at times also led to the repeated reinvention of the wheel). But to what extent can we trust expert knowledge?

As far back as 1955, Paul Meehl

(see also Grove and Meehl, 1996) studied the decision-making processes of trained experts at subjective tasks such as predicting the success of a student in a training program or the recidivism of a criminal.

In 19 out of the 20 studies he looked at, Meehl found that simple

statistical learning algorithms (such as linear regression or naive Bayes) predict better than the experts. Tetlock (2017) also studies expert knowledge and finds it lacking in difficult cases. The Educational Testing Service has used an automated program to grade millions of essay questions on the GMAT exam since 1999. The program agrees with human graders 97%

of the time, about the same level that two human graders agree (Burstein ez al., 2001).

(This does not mean the program understands essays, just that it can distinguish good ones from bad ones about as well as human graders can.)

Bibliographical and Historical Notes The creation of comprehensive taxonomies or classifications dates back to ancient times.

Aristotle (384-322 BCE) strongly emphasized classification and categorization schemes. His

Organon, a collection of works on logic assembled by his students after his death, included a

treatise called Categories in which he attempted to construct what we would now call an upper ontology. He also introduced the notions of genus and species for lower-level classification. Our present system of biological classification, including the use of “binomial nomenclature” (classification via genus and species in the technical sense), was invented by the Swedish biologist Carolus Linnaeus, or Carl von Linne (1707-1778).

The problems associated with

natural kinds and inexact category boundaries have been addressed by Wittgenstein (1953), Quine (1953), Lakoff (1987), and Schwartz (1977), among others. See Chapter 24 for a discussion of deep neural network representations of words and concepts that escape some of the problems ofa strict ontology, but also sacrifice some of the precision.

We still don’t know the best way to combine the advantages of neural networks

and logical semantics for representation. Interest in larger-scale ontologies is increasing, as documented by the Handbook on Ontologies (Staab, 2004).

The OPENCYC

project (Lenat and Guha,

1990; Matuszek er al.,

2006) has released a 150,000-concept ontology, with an upper ontology similar to the one in Figure 10.1 as well as specific concepts like “OLED Display” and “iPhone,” which is a type of “cellular phone,” which in turn is a type of “consumer electronics,” “phone,” “wireless communication device,” and other concepts. The NEXTKB project extends CYC and other resources including FrameNet and WordNet into a knowledge base with almost 3 million

facts, and provides a reasoning engine, FIRE to go with it (Forbus et al., 2010). The DBPEDIA

project extracts structured data from Wikipedia,

specifically from In-

foboxes: the attribute/value pairs that accompany many Wikipedia articles (Wu and Weld, 2008; Bizer et al., 2007).

As of 2015, DBPEDIA

contained 400 million facts about 4 mil-

lion objects in the English version alone; counting all 110 languages yields 1.5 billion facts (Lehmann et al., 2015). The IEEE working group P1600.1 created SUMO, the Suggested Upper Merged Ontology (Niles and Pease, 2001; Pease and Niles, 2002), with about 1000 terms in the upper ontology and links to over 20,000 domain-specific terms. Stoffel ef al. (1997) describe algorithms for efficiently managing a very large ontology. A survey of techniques for extracting knowledge from Web pages is given by Etzioni ef al. (2008). On the Web, representation languages are emerging. RDF (Brickley and Guha, 2004)

allows for assertions to be made in the form of relational triples and provides some means for

evolving the meaning of names over time. OWL (Smith ef al., 2004) is a description logic that supports inferences over these triples. So far, usage seems to be inversely proportional

to representational complexity: the traditional HTML and CSS formats account for over 99% of Web content, followed by the simplest representation schemes, such as RDFa (Adida and Birbeck, 2008), and microformats (Khare, 2006; Patel-Schneider, 2014) which use HTML

and XHTML markup to add attributes to text on web pages. Usage of sophisticated RDF and OWL ontologies is not yet widespread, and the full vision of the Semantic Web (BernersLee et al., 2001) has not been realized. The conferences on Formal Ontology in Information

Systems (FOIS) covers both general and domain-specific ontologies.

The taxonomy used in this chapter was developed by the authors and is based in part

on their experience in the CYC project and in part on work by Hwang and Schubert (1993)

339

Chapter 10 Knowledge Representation and Davis (1990, 2005). An inspirational discussion of the general project of commonsense

knowledge representation appears in Hayes’s (1978, 1985b) “Naive Physics Manifesto.” Successful deep ontologies within a specific field include the Gene Ontology project (Gene Ontology Consortium, 2008) and the Chemical Markup Language (Murray-Rust ef al., 2003). Doubts about the feasibility of a single ontology for all knowledge are expressed by Doctorow (2001), Gruber (2004), Halevy et al. (2009), and Smith (2004).

The event calculus was introduced by Kowalski and Sergot (1986) to handle continuous time, and there have been several variations (Sadri and Kowalski, 1995; Shanahan, 1997) and overviews (Shanahan, 1999; Mueller, 2006). James Allen introduced time intervals for the same reason (Allen, 1984), arguing that intervals were much more natural than situations for reasoning about extended and concurrent events. In van Lambalgen and Hamm (2005) we see how the logic of events maps onto the language we use to talk about events. An alternative to the event and situation calculi is the fluent calculus (Thielscher, 1999), which reifies the facts

out of which states are composed.

Peter Ladkin (1986a, 1986b) introduced “concave” time intervals (intervals with gaps—

essentially, unions of ordinary “convex” time intervals) and applied the techniques of mathematical abstract algebra to time representation. Allen (1991) systematically investigates the wide variety of techniques available for time representation; van Beek and Manchak (1996)

analyze algorithms for temporal reasoning. There are significant commonalities between the

event-based ontology given in this chapter and an analysis of events due to the philosopher

Donald Davidson (1980). The histories in Pat Hayes’s (1985a) ontology of liquids and the chronicles in McDermott’s (1985) theory of plans were also important influences on the field

and on this chapter. The question of the ontological status of substances has a long history. Plato proposed

that substances were abstract entities entirely distinct from physical objects; he would say MadeOf (Butters, Butter) rather than Butters € Butter. This leads to a substance hierarchy in which, for example, UnsaltedButter is a more specific substance than Butter.

The position

adopted in this chapter, in which substances are categories of objects, was championed by

Richard Montague (1973). It has also been adopted in the CYC project. Copeland (1993) mounts a serious, but not invincible, attack.

The alternative approach mentioned in the chapter, in which butter is one object consisting of all buttery objects in the universe, was proposed originally by the Polish logician Lesniewski (1916).

His mereology (the name is derived from the Greek word for “part”)

used the part-whole relation as a substitute for mathematical set theory, with the aim of elim-

inating abstract entities such as sets. A more readable exposition of these ideas is given by Leonard and Goodman (1940), and Goodman’s The Structure of Appearance (1977) applies the ideas to various problems in knowledge representation. While some aspects of the mereological approach are awkward—for example, the need for a separate inheritance mechanism based on part-whole relations—the approach gained the support of Quine (1960). Harry Bunt (1985) has provided an extensive analysis of its use in knowledge representation. Casati and Varzi (1999) cover parts, wholes, and a general theory of spatial locations.

There are three main approaches to the study of mental objects. The one taken in this chapter, based on modal logic and possible worlds, is the classical approach from philosophy (Hintikka,

1962; Kripke,

1963; Hughes and Cresswell,

1996).

The book Reasoning about

Bibliographical and Historical Notes Knowledge (Fagin et al., 1995) provides a thorough introduction, and Gordon and Hobbs

(2017) provide A Formal Theory of Commonsense Psychology. The second approach is a first-order theory in which mental objects are fluents. Davis

(2005) and Davis and Morgenstern (2005) describe this approach. It relies on the possibleworlds formalism, and builds on work by Robert Moore (1980, 1985).

The third approach is a syntactic theory, in which mental objects are represented by

character strings. A string is just a complex term denoting a list of symbols, so CanFly(Clark)

can be represented by the list of symbols [C,a,n,F,1,y,(,C,l,a,r,k,)]. The syntactic theory of mental objects was first studied in depth by Kaplan and Montague (1960), who showed

that it led to paradoxes if not handled carefully. Ernie Davis (1990) provides an excellent comparison of the syntactic and modal theories of knowledge. Pnueli (1977) describes a

temporal logic used to reason about programs, work that won him the Turing Award and which was expanded upon by Vardi (1996). Littman e al. (2017) show that a temporal logic can be a good language for specifying goals to a reinforcement learning robot in a way that is easy for a human to specify, and generalizes well to different environments.

The Greek philosopher Porphyry (c. 234-305 CE), commenting on Aristotle’s Caregories, drew what might qualify as the first semantic network. Charles S. Peirce (1909) developed existential graphs as the first semantic network formalism using modern logic. Ross Quillian (1961), driven by an interest in human memory and language processing, initiated work on semantic networks within Al An influential paper by Marvin Minsky (1975)

presented a version of semantic networks called frames; a frame was a representation of an object or category, with attributes and relations to other objects or categories.

The question of semantics arose quite acutely with respect to Quillian’s semantic net-

works (and those of others who followed his approach), with their ubiquitous and very vague

“IS-A links” Bill Woods’s (1975) famous article “What's In a Link?” drew the attention of AT

researchers to the need for precise semantics in knowledge representation formalisms. Brachman (1979) elaborated on this point and proposed solutions.

Ron

Patrick Hayes’s (1979)

“The Logic of Frames” cut even deeper, claiming that “Most of ‘frames’ is just a new syntax for parts of first-order logic.” Drew McDermott’s (1978b) “Tarskian Semantics, or, No No-

tation without Denotation!”

argued that the model-theoretic approach to semantics used in

first-order logic should be applied to all knowledge representation formalisms. This remains

a controversial idea; notably, McDermott himself has reversed his position in “A Critique of Pure Reason” (McDermott, 1987). Selman and Levesque (1993) discuss the complexity of

inheritance with exceptions, showing that in most formulations it is NP-complete.

Description logics were developed as a useful subset of first-order logic for which infer-

ence is computationally tractable.

Hector Levesque and Ron Brachman (1987) showed that

certain uses of disjunction and negation were primarily responsible for the intractability of logical inference. This led to a better understanding of the interaction between complexity

and expressiveness in reasoning systems. Calvanese ef al. (1999) summarize the state of the art, and Baader et al. (2007) present a comprehensive handbook of description logic. The three main formalisms for dealing with nonmonotonic inference—circumscription

(McCarthy, 1980), default logic (Reiter, 1980), and modal nonmonotonic logic (McDermott

and Doyle, 1980)—were all introduced in one special issue of the Al Journal. Delgrande and Schaub (2003) discuss the merits of the variants, given 25 years of hindsight. Answer

set programming can be seen as an extension of negation as failure or as a refinement of

341

342

Chapter 10 Knowledge Representation circumscription; the underlying theory of stable model semantics was introduced by Gelfond

and Lifschitz (1988), and the leading answer set programming systems are DLV (Eiter ef al.,

1998) and SMODELS (Niemelii et al., 2000). Lifschitz (2001) discusses the use of answer set programming for planning. Brewka er al. (1997) give a good overview of the various approaches to nonmonotonic logic. Clark (1978) covers the negation-as-failure approach to logic programming and Clark completion. Lifschitz (2001) discusses the application of answer set programming to planning. A variety of nonmonotonic reasoning systems based on

logic programming are documented in the proceedings of the conferences on Logic Program-

ming and Nonmonotonic Reasoning (LPNMR).

The study of truth maintenance systems began with the TMS (Doyle, 1979) and RUP

(McAllester,

1980) systems, both of which were essentially JTMSs.

Forbus and de Kleer

(1993) explain in depth how TMSs can be used in Al applications. Nayak and Williams

(1997) show how an efficient incremental TMS called an ITMS makes it feasible to plan the

operations of a NASA spacecraft in real time.

This chapter could not cover every area of knowledge representation in depth. The three

principal topics omitted are the following: Qualitative physics.

Qualitative physics: Qualitative physics is a subfield of knowledge representation concerned specifically with constructing a logical, nonnumeric theory of physical objects and processes. The term was coined

by Johan de Kleer (1975), although the enterprise could be said to

have started in Fahlman’s (1974) BUILD, a sophisticated planner for constructing complex towers of blocks.

Fahlman discovered in the process of designing it that most of the effort

(80%, by his estimate) went into modeling the physics of the blocks world to calculate the

stability of various subassemblies of blocks, rather than into planning per se. He sketches a hypothetical naive-physics-like process to explain why young children can solve BUILD-like

problems without access to the high-speed floating-point arithmetic used in BUILD’s physical

modeling. Hayes (1985a) uses “histories"—four-dimensional slices of space-time similar to

Davidson’s events—to construct a fairly complex naive physics of liquids. Davis (2008) gives an update to the ontology of liquids that describes the pouring of liquids into containers.

De Kleer and Brown (1985), Ken Forbus (1985), and Benjamin Kuipers (1985) independently and almost simultaneously developed systems that can reason about a physical system based on qualitative abstractions of the underlying equations. Qualitative physics soon developed to the point where it became possible to analyze an impressive variety of complex phys-

ical systems (Yip, 1991). Qualitative techniques have been used to construct novel designs

for clocks, windshield wipers, and six-legged walkers (Subramanian and Wang, 1994). The collection Readings in Qualitative Reasoning about Physical Systems (Weld and de Kleer, 1990), an encyclopedia article by Kuipers (2001), and a handbook article by Davis (2007)

provide good introductions to the field.

Spatial reasoning

Spatial reasoning: The reasoning necessary to navigate in the wumpus world is trivial in comparison to the rich spatial structure of the real world.

The earliest serious attempt to

capture commonsense reasoning about space appears in the work of Ernest Davis (1986, 1990). The region connection calculus of Cohn ef al. (1997) supports a form of qualitative spatial reasoning and has led to new kinds of geographical information systems; see also (Davis, 2006). As with qualitative physics, an agent can go a long way, so to speak, without

resorting to a full metric representation.

Bibliographical and Historical Notes Psychological reasoning: Psychological reasoning involves the development of a working

psychology for artificial agents to use in reasoning about themselves and other agents. This

is often based on so-called folk psychology, the theory that humans in general are believed

to use in reasoning about themselves and other humans.

When Al researchers provide their

artificial agents with psychological theories for reasoning about other agents, the theories are

frequently based on the researchers” description of the logical agents” own design. Psychological reasoning is currently most useful within the context of natural language understanding,

where divining the speaker’s intentions is of paramount importance. Minker (2001) collects papers by leading researchers in knowledge representation, summarizing 40 years of work in the field. The proceedings of the international conferences on Principles of Knowledge Representation and Reasoning provide the most up-to-date sources. for work in this area. Readings in Knowledge Representation (Brachman and Levesque, 1985) and Formal Theories of the Commonsense

World (Hobbs and Moore,

1985) are ex-

cellent anthologies on knowledge representation; the former focuses more on historically

important papers in representation languages and formalisms, the latter on the accumulation of the knowledge itself. Davis (1990), Stefik (1995), and Sowa (1999) provide textbook introductions to knowledge representation, van Harmelen et al. (2007) contributes a handbook, and Davis and Morgenstern (2004) edited a special issue of the Al Journal on the topic. Davis

(2017) gives a survey of logic for commonsense reasoning. The biennial conference on Theoretical Aspects of Reasoning About Knowledge (TARK) covers applications of the theory of knowledge in Al, economics, and distributed systems.

Psychological reasoning

TR

1 1

AUTOMATED PLANNING In which we see how an agent can take advantage of the structure ofa problem to efficiently construct complex plans of action.

Planning a course of action is a key requirement for an intelligent agent. The right representation for actions and states and the right algorithms can make this easier.

In Section 11.1

we introduce a general factored representation language for planning problems that can naturally and succinctly represent a wide variety of domains, can efficiently scale up to large

problems, and does not require ad hoc heuristics for a new domain. Section 11.4 extends the

representation language to allow for hierarchical actions, allowing us to tackle more complex

problems. We cover efficient algorithms for planning in Section 11.2, and heuristics for them in Section

11.3.

In Section

11.5 we account for partially observable and nondeterministic

domains, and in Section 11.6 we extend the language once again to cover scheduling problems with resource constraints. This gets us closer to planners that are used in the real world

for planning and scheduling the operations of spacecraft, factories, and military campaigns. Section 11.7 analyzes the effectiveness of these techniques.

11.1 Classical planning

Definition of Classical Planning

Classical planning is defined as the task of finding a sequence of actions to accomplish a

goal in a discrete, deterministic, static, fully observable environment. We have seen two approaches to this task: the problem-solving agent of Chapter 3 and the hybrid propositional

logical agent of Chapter 7. Both share two limitations. First, they both require ad hoc heuristics for each new domai

heuristic evaluation function for search, and hand-written code

for the hybrid wumpus agent. Second, they both need to explicitly represent an exponentially large state space.

For example, in the propositional logic model of the wumpus world, the

axiom for moving a step forward had to be repeated for all four agent orientations, T time. steps, and n* current locations. PDDL

In response to these limitations, planning researchers have invested in a factored repre-

sentation using a family of languages called PDDL, the Planning Domain Definition Lan-

guage (Ghallab ef al., 1998), which allows us to express all 47n? actions with a single action schema, and does not need domain-specific knowledge.

Basic PDDL can handle classical

planning domains, and extensions can handle non-classical domains that are continuous, partially observable, concurrent, and multi-agent. The syntax of PDDL is based on Lisp, but we. State

will translate it into a form that matches the notation used in this book.

In PDDL, a state is represented as a conjunction of ground atomic fluents. Recall that

“ground” means no variables, “fluent” means an aspect of the world that changes over time,

Section 11.1

Definition of Classical Planning

and “ground atomic” means there is a single predicate, and if there are any arguments, they must be constants. For example, Poor A Unknown might represent the state of a hapless agent,

and Ar(Truck,, Melbourne) AAt(Trucks, Sydney) could represent a state in a package delivery problem. PDDL uses database semantics: the closed-world assumption means that any fluents that are not mentioned are false, and the unique names

assumption means that Truck;

and Truck; are distinct. The following fluents are nor allowed in a state: Ar(x,y) (because it has variables), ~Poor (because it is a negation), and Ar(Spouse(Ali), Sydney) (because it uses a function symbol, Spouse). When convenient, we can think of the conjunction of fluents as a set of fluents.

An action schema represents a family of ground actions. For example, here is an action Action schema

schema for flying a plane from one location to another:

Action(Fly(p. from,t0),

PRECOND:At(p. from) A Plane(p) A Airport(from) A Airport(to) EFFECT:—A!(p. from) A At(p, t0))

The schema consists of the action name,

a list of all the variables used in the schema,

a

precondition and an effect. The precondition and the effect are each conjunctions of literals (positive or negated atomic sentences).

yielding a ground (variable-free) action:

We can choose constants to instantiate the variables,

Precondition Effect

Action(Fly(P,,SFO,JFK),

PRECOND:At(Py,SFO) A Plane(Py) AAirport(SFO) AAirport(JFK) EFFECT:—AI(P1,SFO) AA1(Py,JFK))

A ground action a is applicable in state s if s entails the precondition of a; that is, if every

positive literal in the precondition is in s and every negated literal is not.

The result of executing applicable action a in state s is defined as a state s’ which is

represented by the set of fluents formed by starting with s, removing the fluents that appear as negative literals in the action’s effects (what we call the delete list or DEL(a)), and adding the fluents that are positive literals in the action’s effects (what we call the add list or ADD(a)):

RESULT(s,a) = (s — DEL(a)) UADD(a) .

(11.1)

For example, with the action Fiy(Py,SFO,JFK), we would remove the fluent Ar(Py,SFO) and add the fluent At(Py,JFK). A set of action schemas serves as a definition of a planning domain. A specific problem within the domain is defined with the addition of an initial state and a goal. The initial state is a conjunction of ground fluents (introduced with the keyword Inir in Figure 11.1). As with all states, the closed-world assumption is used, which means that any atoms that are not mentioned are false. The goal (introduced with Goal) is just like a precondition: a conjunction of literals (positive or negative) that may contain variables. For example, the goal At(C1,SFO) A=At(C2,SFO) AAt(p, SFO), refers to any state in which cargo C} is at SFO but C is not, and in which there is a plane at SFO. 11.1.1

Example domain:

Air cargo transport

Figure 11.1 shows an air cargo transport problem involving loading and unloading cargo and flying it from place to place. The problem can be defined with three actions: Load, Unload,

and Fly. The actions affect two predicates: In(c, p) means that cargo c is inside plane p, and Ar(x,a) means that object x (either plane or cargo) is at airport a. Note that some care

Delete list Add list

Chapter 11

Automated Planning

Init(AK(Cy, SFO) A At(Cy, JFK) A At(Py, SFO) A At(Py, JFK) A Cargo(Cy) A Cargo(Cs) A Plane(P;) A Plane(Ps) AAirport(JFK) A Airport(SFO)) Goal(A1(Cy, JFK) A At(Ca, SFO)) Action(Load(c, p, @), PRECOND: At(c, a) A At(p, a) A Cargo(c)

EFFECT: = At(c, a) A In(c, p))

Action(Unload(c, p, a),

PRECOND: In(c, p) A At(p, a) A Cargo(c) EFFECT: At(c, a) A —In(c, p))

A Plane(p) A Airport(a)

A Plane(p) A Airport(a)

Action(Fly(p, from, t0),

PRECOND: At(p, from) A Plane(p) A Airport(from) A Airport(to) EFFECT: = At(p, from) A\ At(p, t0))

Figure 1.1 A PDDL description of an air cargo transportation planning problem. must be taken to make sure the Az predicates are maintained properly.

When a plane flies

from one airport to another, all the cargo inside the plane goes with it. Tn first-order logic it would be easy to quantify over all objects that are inside the plane. But PDDL does not have

a universal quantifier, so we need a different solution. The approach we use is to say that a piece of cargo ceases to be Ar anywhere when it is /n a plane; the cargo only becomes At the new airport when it is unloaded.

So At really means “available for use at a given location.”

The following plan is a solution to the problem:

[Load(Cy, Py, SFO). Fly(Py,SFO.JFK), Unload(C\,Pi,JFK). Load(Ca. P2, JFK), Fly(P2,JFK SFO), Unload(Ca, P2, SFO)]..

11.1.2

Example domain:

The spare tire problem

Consider the problem of changing a flat tire (Figure 11.2). The goal is to have a good spare tire properly mounted onto the car’s axle, where the initial state has a flat tire on the axle and a good spare tire in the trunk.

To keep it simple, our version of the problem is

an abstract

one, with no sticky lug nuts or other complications. There are just four actions: removing the spare from the trunk, removing the flat tire from the axle, putting the spare on the axle, and leaving the car unattended overnight.

We assume that the car is parked in a particu-

larly bad neighborhood, so that the effect of leaving it overnight is that the tires disappear. [Remove(Flat,Axle),Remove(Spare, Trunk), PutOn(Spare,Axle)] is a solution to the problem. 11.1.3

Example domain:

The blocks world

One of the most famous planning domains is the blocks world. This domain consists of a set

of cube-shaped blocks sitting on an arbitrarily-large table.! The blocks can be stacked, but

only one block can fit directly on top of another. A robot arm can pick up a block and move it

to another position, either on the table or on top of another block. The arm can pick up only

one block at a time, so it cannot pick up a block that has another one on top of it. A typical

goal to get block A on B and block B on C (see Figure 11.3). 1" The blocks world commonly used in planning research is much simpler than SHRDLUs version (page 20).

Section 11.1

Definition of Classical Planning

Init(Tire(Flat) A Tire(Spare) A At(Flat,Axle) A At(Spare, Trunk)) Goal(At(Spare,Axle))

Action(Remove(obj. loc),

PRECOND: At(obj. loc)

EFFECT: At(obj,loc) A A(obj, Ground))

Action(PutOn(t, Axle),

PRECOND: Tire(t) A At(t,Ground) A — At(Flat,Axle) N\ — At(Spare,Axle) EFFECT: = At(1,Ground) A At(t,Axle))

Action(LeaveOvernight, PRECOND: EFFECT: — At(Spare, Ground) A — At(Spare, Axle) A = At(Spare, Trunk) A = At(Flat,Ground) A = At(Flat,Axle) A = At(Flat, Trunk))

Figure 11.2 The simple spare tire problem.

BN Start State

A

Goal State

Figure 11.3 Diagram of the blocks-world problem in Figure 11.4.

Init(On(A, Table) A On(B,Table) A On(C,A) A Block(A) A Block(B) A Block(C) A Clear(B) A Clear(C) A Clear(Table)) Goal(On(A,B) A On(B,C))

Action(Move(b.x.y),

PRECOND: On(b,x) A Clear(b) A Clear(y) A Block(b) A Block(y) N

(b#x) A (b#y) A (x#),

EFFECT: On(b,y) A Clear(x) A —On(b,x) A —Clear(y))

Action(MoveToTable(b,x),

PRECOND: On(b,x) A Clear(b) A Block(b) A Block(x), EFFECT: On(b, Table) A Clear(x) A —On(b,x))

Figure 11.4 A planning problem in the blocks world: building a three-block tower. One solution is the sequence [MoveToTuble(C.A), Move(B, Table.C), Move(A, Table, B)).

347

Chapter 11

Automated Planning

We use On(b,x) to indicate that block b is on x, where x is either another block or the

table. The action for moving block b from the top of x to the top of y will be Move(b,x,y).

Now, one of the preconditions on moving b is that no other block be on it. In first-order logic,

this would be —3x On(x,b) or, alternatively, Vx —On(x,b). Basic PDDL does not allow

quantifiers, so instead we introduce a predicate Clear(x) that is true when nothing is on x.

(The complete problem description is in Figure 11.4.)

The action Move moves a block b from x to y if both b and y are clear. After the move is made, b is still clear but y is not. A first attempt at the Move schema is

Action(Move(b, x.y),

PRECOND:On(b.x) A Clear(b) A Clear(y),

EFFECT:On(b,y) A Clear(x) A =On(b,x) A—Clear(y)).

Unfortunately, this does not maintain Clear properly when x or y is the table. When x is

the Table, this action has the effect Clear(Table), but the table should not become clear; and

when y=Table, it has the precondition Clear(Table), but the table does not have to be clear for us to move a block onto it. To fix this, we do two things.

action to move a block b from x to the table:

First, we introduce another

Action(MoveToTable(b,x),

PRECOND:On(b,x) A Clear(b),

EFFECT: On(b, Table) A Clear(x) A—On(b,x)) .

Second, we take the interpretation of Clear(x) to be “there is a clear space on x to hold a block.” Under this interpretation, Clear(Table) will always be true. The only problem is that nothing prevents the planner from using Move(b,x, Table) instead of MoveToTable(b,x). We could live with this problem—it will lead to a larger-than-necessary scarch space, but will not lead to incorrect answers—or we could introduce the predicate Block and add Block(b) A Block(y) to the precondition of Move, as shown in Figure 11.4. 11.2

Algorithms for Classical Planning

The description of a planning problem provides an obvious way to search from the initial state through the space of states, looking for a goal. A nice advantage of the declarative representation of action schemas is that we can also search backward from the goal, looking

for the initial state (Figure 11.5 compares forward and backward searches). A third possibility

is to translate the problem description into a set of logic sentences, to which we can apply a

logical inference algorithm to find a solution. 11.2.1

Forward state-space search for planning

We can solve planning problems by applying any of the heuristic search algorithms from Chapter 3 or Chapter 4. The states in this search state space are ground states, where every fluent is either true or not.

The goal is a state that has all the positive fluents in the prob-

lem’s goal and none of the negative fluents. The applicable actions in a state, Actions(s), are

grounded instantiations of the action schemas—that is, actions where the variables have all

been replaced by constant values.

To determine the applicable actions we unify the current state against the preconditions

of each action schema.

For each unification that successfully results in a substitution, we

Section 112 Algorithms for Classical Planning apply the substitution to the action schema to yield a ground action with no variables. (It is a requirement of action schemas that any variable in the effect must also appear in the

precondition; that way, we are guaranteed that no variables remain after the substitution.) Each schema may unify in multiple ways. In the spare tire example (page 346), the Remove action has the precondition Ar(obj, loc), which matches against the initial state in two

ways, resulting in the two substitutions {obj/Flat,loc/Axle} and {obj/Spare,loc/Trunk}; applying these substitutions yields two ground actions.

If an action has multiple literals in

the precondition, then each of them can potentially be matched against the current state in multiple ways.

At first, it seems that the state space might be too big for many problems. Consider an

air cargo problem with 10 airports, where each airport initially has 5 planes and 20 pieces of cargo. The goal is to move all the cargo at airport A to airport B. There is a 41-step solution

to the problem: load the 20 pieces of cargo into one of the planes at A, fly the plane to B, and

unload the 20 pieces. Finding this apparently straightforward solution can be difficult because the average branching factor is huge: each of the 50 planes can fly to 9 other airports, and each of the 200

packages can be either unloaded (if it is loaded) or loaded into any plane at its airport (if it is unloaded).

So in any state there is a minimum of 450 actions (when all the packages are

at airports with no planes) and a maximum of 10,450 (when all packages and planes are at

Fly(P,, A B) Fly(P,, A B)

Figure 11.5 Two approaches to searching for a plan. (a) Forward (progression) search through the space of ground states, starting in the initial state and using the problem’s actions to search forward for a member of the set of goal states.

(b) Backward (regression)

search through state descriptions, starting at the goal and using the inverse of the actions to search backward for the initial state.

349

350

Chapter 11

Automated Planning

the same airport). On average, let’s say there are about 2000 possible actions per state, so the search graph up to the depth of the 41-step solution has about 2000*' nodes.

Clearly, even this relatively small problem instance is hopeless without an accurate heuristic. Although many real-world applications of planning have relied on domain-specific heuris-

tics, it turns out (as we see in Section 11.3) that strong domain-independent heuristics can be

derived automatically; that is what makes forward search feasible. 11.2.2 Regression search Relevant action

Backward search for planning

In backward search (also called regression search) we start at the goal and apply the actions backward until we find a sequence of steps that reaches the initial state. At each step we

consider relevant actions (in contrast to forward search, which considers actions that are

applicable). This reduces the branching factor significantly, particularly in domains with many possible actions.

A relevant action is one with an effect that unifies with one of the goal literals, but with

no effect that negates any part of the goal. For example, with the goal ~PoorA Famous, an action with the sole effect Famous would be relevant, but one with the effect Poor A Famous

is not considered relevant: even though that action might be used at some point in the plan (to

establish Famous), it cannot appear at rhis point in the plan because then Poor would appear in the final state.

Regression

What does it mean to apply an action in the backward direction?

Given a goal g and

an action a, the regression from g over a gives us a state description g’ whose positive and negative literals are given by Pos(¢') = (Pos(g) — ADD(a)) UPOS(Precond(a))

NEG(g') = (NEG(g) — DEL(a)) UNEG(Precond(a)).

That is, the preconditions

must have held before, or else the action could not have been

executed, but the positive/negative literals that were added/deleted by the action need not have been true before.

These equations are straightforward for ground literals, but some care is required when there are variables in g and a. For example, suppose the goal is to deliver a specific piece of cargo to SFO: A1(Ca,SFO). The Unload action schema has the effect Ar(c,a). When we

unify that with the goal, we get the substitution {c/C,a/SFO}; applying that substitution to the schema gives us a new schema which captures the idea of using any plane that is at SFO:

Action(Unload(Cs, p/,SFO),

PRECOND:In(Cy, p') AAt(p',SFO) A Cargo(C,) A Plane(p') A Airport(SFO) EFFECT:A1(C2,SFO) A —In(Ca, p')) .

Here we replaced p with a new variable named p'. This is an instance of standardizing apart

variable names so there will be no conflict between different variables that happen to have the

same name (see page 284). The regressed state description gives us a new goal: & = In(Ca, p') AAI(p',SFO) A Cargo(C2) A Plane(p') A Airport(SFO) As another example, consider the goal of owning a book with a specific ISBN number: Own(9780134610993). Given a trillion 13-digit ISBNs and the single action schema A = Action(Buy(i), PRECOND:ISBN (i), EFFECT: Own(i)) .

a forward search without a heuristic would have to start enumerating the 10 billion ground

Buy actions. But with backward search, we would unify the goal Own(9780134610993 ) with

Section 112

Algorithms for Classical Planning

the effect Own(i"), yielding the substitution 6 = {i /9780134610993 }. Then we would regress over the action Subst(6,A) to yield the predecessor state description ISBN (9780134610993).

This is part of the initial state, so we have a solution and we are done, having considered just one action, not a trillion.

More formally, assume a goal description g that contains a goal literal g; and an action

schema A. If A has an effect literal ¢); where Unify(g;.¢})=0 and where we define A’

SUBST(0,A) and if there is no effect in A’ that is the negation of a literal in g, then A’ is a

relevant action towards g.

For most problem domains backward search keeps the branching factor lower than for-

ward search.

However, the fact that backward search uses states with variables rather than

ground states makes it harder to come up with good heuristics. That is the main reason why

the majority of current systems favor forward search. 11.2.3

Planning as Boolean satisfiability

In Section 7.7.4 we showed how some clever axiom-rewriting could turn a wumpus world problem into a propositional logic satisfiability problem that could be handed to an efficient satisfiability solver. SAT-based planners such as SATPLAN operate by translating a PDDL problem description into propositional form. The translation involves a series of steps: « Propositionalize the actions: for each action schema, form ground propositions by substituting constants for each of the variables. So instead of a single Unload(c,p,a) schema, we would have separate action propositions for each combination of cargo,

plane, and airport (here written with subscripts), and for each time step (here written as a superscript). + Add action exclusion axioms saying that no two actions can occur at the same time, e.g.

—(FlyP;SFOJFK' A FlyP,;SFOBUH").

+ Add precondition axioms:

For each ground action A, add the axiom A’ =

PRE(A)',

that is, if an action is taken at time 7, then the preconditions must have been true. For

example, FIyP;SFOJFK' = At(P,,SFO) A Plane(P\) A Airport(SFO) AAirport (JFK).

« Define the initial state: assert FO for every fluent F in the problem’s initial state, and

—F" for every fluent not mentioned in the initial state.

« Propositionalize the goal: the goal becomes a disjunction over all of its ground instances, where variables are replaced by constants. For example, the goal of having block A on another block, On(A,x) ABlock(x) in a world with objects A, B and C, would

be replaced by the goal

(On(A,A) ABlock(A)) V (On(A,B) ABlock(B)) V (On(A,C) A Block(C)). + Add successor-state axioms: For each fluent ', add an axiom of the form

F'*' & ActionCausesF' V (F' A=ActionCausesNotF"), where ActionCausesF stands for a disjunction of all the ground actions that add F, and ActionCausesNotF stands for a disjunction of all the ground actions that delete F.

The resulting translation is typically much larger than the original PDDL, but modern the efficiency of modern SAT solvers often more than makes up for this.

351

352

Chapter 11 11.2.4

Planning graph

Automated Planning

Other classical planning approaches

The three approaches we covered above are not the only ones tried in the 50-year history of automated planning. We briefly describe some others here. An approach called Graphplan uses a specialized data structure, a planning graph, to

encode constraints on how actions are related to their preconditions and effects, and on which

Situation calculus

things are mutually exclusive. Situation calculus is a method of describing planning problems in first-order logic. It uses successor-state axioms just as SATPLAN

does, but first-order logic allows for more

flexibility and more succinct axioms. Overall the approach has contributed to our theoretical

understanding of planning, but has not made a big impact in practical applications, perhaps

because first-order provers are not as well developed as propositional satisfiability programs. It is possible to encode a bounded planning problem (i.c., the problem of finding a plan

of length k) as a constraint satisfaction problem

(CSP). The encoding

is similar to the

encoding to a SAT problem (Section 11.2.3), with one important simplification: at each time

step we need only a single variable, Action', whose domain s the set of possible actions. We. no longer need one variable for every action, and we don’t need the action exclusion axioms.

Partial-order planning

All the approaches we have seen so far construct fotally ordered plans consisting of strictly linear sequences of actions. But if an air cargo problem has 30 packages being loaded onto one plane and 50 packages being loaded onto another, it seems pointless to decree a specific linear ordering of the 80 load actions. An alternative called partial-order planning represents a plan as a graph rather than a linear sequence: each action is a node in the graph, and for each precondition of the action there is an edge from another action (or from the initial state) that indicates that the predeces-

sor action establishes the precondition. So we could have a partial-order plan that says that ac-

tions Remove(Spare, Trunk) and Remove(Flat, Axle) must come before PutOn(Spare,Axle),

but without saying which of the two Remove actions should come first. We search in the space

of plans rather than world-states, inserting actions to satisfy conditions.

In the 1980s and 1990s, partial-order planning was seen as the best way to handle planning problems with independent subproblems. By 2000, forward-search planners had developed excellent heuristics that allowed them to efficiently discover the independent subprob-

lems that partial-order planning was designed for. Moreover, SATPLAN was able to take ad-

vantage of Moore’s law: a propositionalization that was hopelessly large in 1980 now looks tiny, because computers have 10,000 times more memory today.

As a result, partial-order

planners are not competitive on fully automated classical planning problems.

Nonetheless, partial-order planning remains an important part of the field.

For some

specific tasks, such as operations scheduling, partial-order planning with domain-specific heuristics is the technology of choice. Many of these systems use libraries of high-level plans, as described in Section 11.4.

Partial-order planning is also often used in domains where it is important for humans

to understand the plans. For example, operational plans for spacecraft and Mars rovers are generated by partial-order planners and are then checked by human operators before being uploaded to the vehicles for execution. The plan refinement approach makes it easier for the humans to understand what the planning algorithms are doing and to verify that the plans are

correct before they are executed.

Section 11.3 11.3

Heuristics for Planning

353

Heuristics for Planning

Neither forward nor backward search is efficient without a good heuristic function.

Recall

from Chapter 3 that a heuristic function h(s) estimates the distance from a state s to the goal, and that if we can derive an admissible heuristic for this distance—one that does not

overestimate—then we can use A* search to find optimal solutions.

By definition, there is no way to analyze an atomic state, and thus it requires some ingenuity by an analyst (usually human) to define good domain-specific heuristics for search problems with atomic states. But planning uses a factored representation for states and actions, which makes it possible to define good domain-independent heuristics.

Recall that an admissible heuristic can be derived by defining a relaxed problem that is

easier to solve. The exact cost of a solution to this easier problem then becomes the heuristic

for the original problem. A search problem is a graph where the nodes are states and the edges are actions.

The problem is to find a path connecting the initial state to a goal state.

There are two main ways we can relax this problem to make it easier: by adding more edges to the graph, making it strictly easier to find a path, or by grouping multiple nodes together, forming an abstraction of the state space that has fewer states, and thus is easier to search.

We look first at heuristics that add edges to the graph. Perhaps the simplest is the ignore-

preconditions heuristic, which drops all preconditions from actions. Every action becomes

applicable in every state, and any single goal fluent can be achieved in one step (if there are any applicable actions—if not, the problem is impossible). This almost implies that the number of steps required to solve the relaxed problem is the number of unsatisfied goals—

Ignore-preconditions heuristic

almost but not quite, because (1) some action may achieve multiple goals and (2) some actions

may undo the effects of others. For many problems an accurate heuristic is obtained by considering (1) and ignoring (2). First, we relax the actions by removing all preconditions and all effects except those that are

literals in the goal. Then, we count the minimum number of actions required such that the union of those actions’ effects satisfies the goal. This is an instance of the set-cover problem.

There is one minor irritation: the set-cover problem is NP-hard. Fortunately a simple greedy algorithm is guaranteed to return a set covering whose size is within a factor of logn of the true minimum covering, where n is the number of literals in the goal.

Unfortunately, the

greedy algorithm loses the guarantee of admissibility. It is also possible to ignore only selected preconditions of actions. Consider the slidingtile puzzle (8-puzzle or 15-puzzle) from Section 3.2. We could encode this as a planning problem involving tiles with a single schema Slide: Action(Slide(t,s1,52),

PRECOND:On(t,sy) ATile(t) A Blank(sy) A Adjacent (s, s2) EFFECT:On(t,s2) A Blank(s;) A =On(t,s1) A —Blank(s;))

As we saw in Section 3.6, if we remove the preconditions Blank(s) A Adjacent(sy ,s2) then any tile can move in one action to any space and we get the number-of-misplaced-tiles heuris-

tic. If we remove only the Blank(s,) precondition then we get the Manhattan-distance heuris

tic. It is easy to see how these heuristics could be derived automatically from the action schema description. The ease of manipulating the action schemas is the great advantage of the factored representation of planning problems, as compared with the atomic representation

of search problems.

Set-cover problem

354

Chapter 11

Automated Planning

Figure 11.6 Two state spaces from planning problems with the ignore-delete-lists heuristic. The height above the bottom plane is the heuristic score of a state; states on the bottom plane are goals. There are no local minima, so search for the goal is straightforward. From Hoffmann (2005). Ignore-delete-lists heuristic

Another possibility is the ignore-delete-lists heuristic.

Assume for a moment that all

goals and preconditions contain only positive literals.> We want to create a relaxed version

of the original problem that will be easier to solve, and where the length of the solution will serve as a good heuristic.

We can do that by removing the delete lists from all actions

(i.e., removing all negative literals from effects). That makes it possible to make monotonic

progress towards the goal—no action will ever undo progress made by another action. It turns

out it is still NP-hard to find the optimal solution to this relaxed problem, but an approximate

solution can be found in polynomial time by hill climbing.

Figure 11.6 diagrams part of the state space for two planning problems using the ignore-

delete-lists heuristic. The dots represent states and the edges actions, and the height of each

dot above the bottom plane represents the heuristic value. States on the bottom plane are solutions. Tn both of these problems, there is a wide path to the goal. There are no dead ends, 50 no need for backtracking; a simple hill-climbing search will easily find a solution to these

problems (although it may not be an optimal solution). 11.3.1

Dom:

ndependent

pruning

Factored representations make it obvious that many states are just variants of other states. For

example, suppose we have a dozen blocks on a table, and the goal is to have block A on top

of a three-block tower. The first step in a solution is to place some block x on top of block y

(where x, y, and A are all different). After that, place A on top ofx and we’re done. There are 11 choices for x, and given x, 10 choices for y, and thus 110 states to consider. But all these

Symmetry reduction

states are symmetric: choosing one over another makes no difference, and thus a planner should only consider one of them. This is the process of symmetry reduction: we prune out aren’t, replace every negative literal ~P in tate and the action effects accordingly.

Section 11.3

Heuristics for Planning

355

of consideration all symmetric branches of the search tree except for one. For many domains, this makes the difference between intractable and efficient solving.

Another possibility is to do forward pruning, accepting the risk that we might prune away an optimal solution, in order to focus the search on promising branches. We can define

a preferred action as follows: First, define a relaxed version of the problem, and solve it to

Preferred action

teractions can be ruled out. We say that a problem has serializable subgoals if there exists

Serializable subgoals

get a relaxed plan. Then a preferred action s either a step of the relaxed plan, or it achieves some precondition of the relaxed plan. Sometimes it is possible to solve a problem efficiently by recognizing that negative in-

an order of subgoals such that the planner can achieve them in that order without having to undo any of the previously achieved subgoals. For example, in the blocks world, if the goal is to build a tower (e.g., A on B, which in turn is on C, which in turn is on the Table, as in

Figure 11.3 on page 347), then the subgoals are serializable bottom to top: if we first achieve C on Table, we will never have to undo it while we are achieving the other subgoals.

A

planner that uses the bottom-to-top trick can solve any problem in the blocks world without backtracking (although it might not always find the shortest plan). As another example, if there is a room with n light switches, each controlling a separate light, and the goal is to have them all on, then we don’t have to consider permutations of the order; we could arbitrarily

restrict ourselves to plans that flip switches in, say, ascending order.

For the Remote Agent planner that commanded NASA’s Deep Space One spacecraft, it was determined that the propositions involved in commanding a spacecraft are serializable. This is perhaps not too surprising, because a spacecraft is designed by its engineers to be as easy as possible to control (subject to other constraints). Taking advantage of the serialized

ordering of goals, the Remote Agent planner was able to eliminate most of the search. This meant that it was fast enough to control the spacecraft in real time, something previously

considered impossible. 11.3.2

State abstraction in planning

A relaxed problem leaves us with a simplified planning problem just to calculate the value of the heuristic function. Many planning problems have 10'% states or more, and relaxing

the actions does nothing to reduce the number of states, which means that it may still be

expensive to compute the heuristic. Therefore, we now look at relaxations that decrease the

number of states by forming a state abstraction—a many-to-one mapping from states in the State abstraction ground representation of the problem to the abstract representation.

The easiest form of state abstraction is to ignore some fluents. For example, consider an

air cargo problem with 10 airports, 50 planes, and 200 pieces of cargo. Each plane can be at one of 10 airports and each package can be either in one of the planes or unloaded at one of the airports. So there are 10% x (50 + 10)** ~ 10**5 states. Now consider a particular

problem in that domain in which it happens that all the packages are at just 5 of the airports,

and all packages at a given airport have the same destination. Then a useful abstraction of the

problem is to drop all the Ar fluents except for the ones involving one plane and one package

at each of the 5 airports. Now there are only 10% x (5+ 10)° ~ 10'! states. A solution in this abstract state space will be shorter than a solution in the original space (and thus will be an admissible heuristic), and the abstract solution is easy to extend to a solution to the original

problem (by adding additional Load and Unload actions).

356 Decomposition B ence

Chapter 11

Automated Planning

A key idea in defining heuristics is decomposition: dividing a problem into parts, solving each part independently, and then combining the parts. The subgoal independence assumption is that the cost of solving a conjunction of subgoals is approximated by the sum of the costs of solving each subgoal independently. The subgoal independence assumption can be optimistic or pessimistic.

It is optimistic when there are negative interactions between

the subplans for each subgoal—for example, when an action in one subplan deletes a goal achieved by another subplan. It is pessimistic, and therefore inadmissible, when subplans contain redundant actions—for instance, two actions that could be replaced by a single action in the merged plan. Suppose the goal is a set of fluents G, which we divide into disjoint subsets Gy, . ., G. ‘We then find optimal plans P, ..., P, that solve the respective subgoals. What is an estimate

of the cost of the plan for achieving all of G? We can think of each COST(P;) as a heuristic estimate, and we know that if we combine estimates by taking their maximum

value, we

always get an admissible heuristic. So max;COST(P,) is admissible, and sometimes it is exactly correct: it could be that Py serendipitously achieves all the G;. But usually the estimate is too low. Could we sum the costs instead? For many problems that is a reasonable estimate, but it is not admissible. The best case is when G; and G; are independent, in the sense that plans for one cannot reduce the cost of plans for the other. In that case, the estimate

COST(P,) + CosT(P;) is admissible, and more accurate than the max estimate.

It is clear that there is great potential for cutting down the search space by forming abstractions. The trick is choosing the right abstractions and using them in a way that makes the total cost—defining an abstraction, doing an abstract search, and mapping the abstraction back to the original problem—Iless than the cost of solving the original problem. The techniques of pattern databases from Section 3.6.3 can be useful, because the cost of creating the pattern database can be amortized over multiple problem instances. A system that makes use of effective heuristics is FF, or FASTFORWARD

(Hoffmann,

2005), a forward state-space searcher that uses the ignore-delete-lists heuristic, estimating

the heuristic with the help of a planning graph. FF then uses hill climbing search (modified

to keep track of the plan) with the heuristic to find a solution. FF’s hill climbing algorithm is

nonstandard:

it avoids local maxima by running a breadth-first search from the current state

until a better one is found. If this fails, FF switches to a greedy best-first search instead.

11.4

Hierarchical Planning

The problem-solving and planning methods of the preceding chapters all operate with a fixed set of atomic actions.

Actions can be strung together, and state-of-the-art algorithms can

generate solutions containing thousands of actions. That’s fine if we are planning a vacation and the actions are at the level of “fly from San Francisco to Honolulu,” but at the motor-

control level of “bend the left knee by 5 degrees” we would need to string together millions or billions of actions, not thousands.

Bridging this gap requires planning at higher levels of abstraction. A high-level plan for

a Hawaii vacation might be “Go to San Francisco airport; take flight HA 11 to Honolulu; do vacation stuff for two weeks; take HA 12 back to San Francisco; go home.” Given such

a plan, the action “Go to San Francisco airport™ can be viewed as a planning task in itself,

with a solution such as “Choose a ride-hailing service; order a car; ride to airport.” Each of

Section 11.4

Hierarchical Planning

357

these actions, in turn, can be decomposed further, until we reach the low-level motor control

actions like a button-press.

In this example, planning and acting are interleaved; for example, one would defer the problem of planning the walk from the curb to the gate until after being dropped off. Thus, that particular action will remain at an abstract level prior to the execution phase. discussion of this topic until Section 11.5.

We defer

Here, we concentrate on the idea of hierarchi-

cal decomposition, an idea that pervades almost all attempts to manage complexity.

For

example, complex software is created from a hierarchy of subroutines and classes; armies, governments and corporations have organizational hierarchies. The key benefit of hierarchi-

Hierarchical decomposition

cal structure is that at each level of the hierarchy, a computational task, military mission, or

administrative function is reduced to a small number of activities at the next lower level, so

the computational cost of finding the correct way to arrange those activities for the current problem is small. 11.4.1

High-level actions

The basic formalism we adopt to understand hierarchical decomposition comes from the area

of hierarchical task networks or HTN planning. For now we assume full observability and determinism and a set of actions, now called primitive actions, with standard precondition—

effect schemas.

The key additional concept is the high-level action or HLA—for example,

the action “Go to San Francisco airport.” Each HLA has one or more possible refinements,

into a sequence of actions, each of which may be an HLA or a primitive action. For example, the action “Go to San Francisco airport,” represented formally as Go(Home,SFO), might have two possible refinements, as shown in Figure 11.7. The same figure shows a recursive

Hierarchical task network Primitive action High-level action Refinement

refinement for navigation in the vacuum world: to get to a destination, take a step, and then

20 to the destination.

These examples show that high-level actions and their refinements embody knowledge

about how to do things. For instance, the refinements for Go(Home,SFO) say that to get to

the airport you can drive or take a ride-hailing service; buying milk, sitting down, and moving the knight to e4 are not to be considered.

An HLA refinement that contains only primitive actions is called an implementation

of the HLA. In a grid world, the sequences [Right,Right, Down] and [Down, Right,Right] both implement the HLA Navigate([1,3],[3,2]). An implementation of a high-level plan (a sequence of HLAS) is the concatenation of implementations of each HLA in the sequence. Given the precondition—effect definitions of each primitive action, it is straightforward to

determine whether any given implementation of a high-level plan achieves the goal. We can say, then, that a high-level plan achieves the goal from a given state if at least one of its implementations achieves the goal from that state. ~ The “at least one” in this

definition is crucial—not all implementations need to achieve the goal, because the agent gets

to decide which implementation it will execute. Thus, the set of possible implementations in

HTN planning—each of which may have a different outcome—is not the same as the set of

possible outcomes in nondeterministic planning. There, we required that a plan work for all outcomes because the agent doesn’t get to choose the outcome; nature does.

The simplest case is an HLA that has exactly one implementation. In that case, we can

compute the preconditions and effects of the HLA from those of the implementation (see

Exercise 11.HLAU) and then treat the HLA exactly as if it were a primitive action itself. It

Implementation

358

Chapter 11

Automated Planning

Refinement(Go(Home,SFO), STEPS: [Drive(Home, SFOLongTermParking), Shuttle(SFOLongTermParking, SFO)] ) Refinement(Go(Home,SFO), STEPS: [Taxi(Home,SFO)] )

Refinement(Navigate([a,b], [x,]), PRECOND: a=x A b=y

STEPS: ] ) Refinement(Navigate([a.b), x.

PRECOND: Connected([ab], a— 1.b]) STEPS: [Left, Navigate([a— 1,b].[x.])] ) Refinement(Navigate([a,b], [ PRECOND: Connected ([a,b], a + 1,5])

STEPS: [Right. Navigate([a+ 1,5],[x.])]) Figure 11.7 Definitions of possible refinements for two high-level actions: going to San Francisco airport and navigating in the vacuum world. In the latter case, note the recursive nature of the refinements and the use of preconditions.

can be shown that the right collection of HLAs can result in the time complexity of blind

search dropping from exponential in the solution depth to linear in the solution depth, although devising such a collection of HLAs may be a nontrivial task in itself. When HLAs

have multiple possible implementations, there are two options: one is to search among the implementations for one that works, as in Section 11.4.2; the other is to reason directly about

the HLAs—despite the multiplicity of implementations—as explained in Section 11.4.3. The latter method enables the derivation of provably correct abstract plans, without the need to

consider their implementations.

11.4.2

Searching for primitive solutions

HTN planning is often formulated with a single “top level” action called Act, where the aim is to find an implementation of Act that achieves the goal. This approach is entirely general. For example, classical planning problems can be defined as follows: for each primitive action

a;, provide one refinement of Act with steps [a;, Act]. That creates a recursive definition of Act

that lets us add actions. But we need some way to stop the recursion; we do that by providing

one more refinement for Act, one with an empty list of steps and with a precondition equal to the goal of the problem.

This says that if the goal is already achieved, then the right

implementation is to do nothing. The approach leads to a simple algorithm: repeatedly choose an HLA in the current plan and replace it with one of its refinements, until the plan achieves the goal. One possible implementation based on breadth-first tree search is shown in Figure 11.8. Plans are considered

in order of depth of nesting of the refinements, rather than number of primitive steps. It is straightforward to design a graph-search version of the algorithm as well as depth-first and iterative deepening versions.

Section 11.4

Hierarchical Planning

function HIERARCHICAL-SEARCH(problem, hierarchy) returns a solution or failure

frontier —a FIFO queue with [Aci] as the only element while rrue do if IS-EMPTY( frontier) then return failure

plan — POP(frontier)

1/ chooses the shallowest plan in frontier

hla + the first HLA in plan, or null if none

prefix,suffix — the action subsequences before and after hla in plan outcome«— RESULT(problem.INITIAL, prefix) if hla is null then

// so plan is primitive and outcome is its result

if problem.1s-GOAL(outcome) then return plan else for each sequence in REFINEMENTS (hla, outcome, hierarchy) do add APPEND(prefix, sequence, suffix) to frontier

Figure 11.8 A breadth-first implementation of hierarchical forward planning search. The

initial plan supplied to the algorithm is [Acr]. The REFINEMENTS function returns a set of

action sequences, one for each refinement of the HLA whose preconditions are satisfied by the specified state, outcome.

In essence, this form of hierarchical search explores the space of sequences that conform to the knowledge contained in the HLA library about how things are to be done. A great deal of knowledge can be encoded, not just in the action sequences specified in each refinement but also in the preconditions for the refinements. For some domains, HTN planners have been able to generate huge plans with very little search. For example, O-PLAN (Bell and Tate, 1985), which combines HTN planning with scheduling, has been used to develop production plans for Hitachi. A typical problem involves a product line of 350 different products, 35 assembly machines, and over 2000 different operations. The planner generates a 30-day schedule with three 8-hour shifts a day, involving tens of millions of steps. Another important

aspect of HTN plans is that they are, by definition, hierarchically structured; usually this makes them easy for humans to understand.

The computational benefits of hierarchical search can be seen by examining an ideal-

ized case.

Suppose that a planning problem has a solution with d primitive actions.

a nonhierarchical,

For

forward state-space planner with b allowable actions at each state, the

cost is O(b?), as explained in Chapter 3. For an HTN planner, let us suppose a very regular refinement structure:

each nonprimitive action has r possible refinements,

k actions at the next lower level. there are with this structure.

Now,

each into

We want to know how many different refinement trees

if there are d actions at the primitive level, then the

number of levels below the root is log,d, so the number of internal refinement nodes is

14 k+ K24+ K24~

= (¢ — 1) /(k— 1). Each internal node has r possible refinements,

s0 rld=1/(=1) possible decomposition trees could be constructed.

Examining this formula, we see that keeping r small and k large can result in huge sav-

ings: we are taking the kth root of the nonhierarchical cost, if » and r are comparable. Small r

and large k means a library of HLAs with a small number of refinements each yielding a long

action sequence. This is not always possible: long action sequences that are usable across a

wide range of problems are extremely rare.

359

360

Chapter 11

Automated Planning

The key to HTN planning is a plan library containing known methods for implementing

complex, high-level actions.

One way to construct the library is to learn the methods from

problem-solving experience. After the excruciating experience of constructing a plan from scratch, the agent can save the plan in the library as a method for implementing the high-level

action defined by the task. In this way, the agent can become more and more competent over

time as new methods are built on top of old methods. One important aspect of this learning process is the ability to generalize the methods that are constructed, eliminating detail that is specific to the problem instance (e.g., the name of the builder or the address of the plot of land) and keeping just the key elements of the plan. It seems to us inconceivable that humans could be as competent as they are without some such mechanism. 11.4.3

Searching for abstract solutions

The hierarchical search algorithm in the preceding section refines HLAs all the way to primitive action sequences to determine if a plan is workable. This contradicts common sense: one should be able to determine that the two-HLA high-level plan

[Drive(Home,SFOLongTermParking), Shuttle(SFOLongTermParking, SFO)]

gets one to the airport without having to determine a precise route, choice of parking spot, and so on. The solution is to write precondition—effect descriptions of the HLAs, just as we

do for primitive actions. From the descriptions, it ought to be easy to prove that the high-level

plan achieves the goal. This is the holy grail, so to speak, of hierarchical planning, because if we derive a high-level plan that provably achieves the goal, working in a small search space of high-level actions, then we can commit to that plan and work on the problem of refining each step of the plan. This gives us the exponential reduction we seek.

For this to work, it has to be the case that every high-level plan that “claims” to achieve

Downward refinement property

the goal (by virtue of the descriptions of its steps) does in fact achieve the goal in the sense defined earlier: it must have at least one implementation that does achieve the goal. This

property has been called the downward refinement property for HLA descriptions.

Writing HLA descriptions that satisfy the downward refinement property is, in principle,

easy: as long as the descriptions are frue, then any high-level plan that claims to achieve

the goal must in fact do so—otherwise, the descriptions are making some false claim about

what the HLAs do. We have already seen how to write true descriptions for HLAs that have

exactly one implementation (Exercise 11.HLAU); a problem arises when the HLA has multiple implementations.

How can we describe the effects of an action that can be implemented in

many different ways? One safe answer (at least for problems where all preconditions and goals are positive) is

to include only the positive effects that are achieved by every implementation of the HLA and

the negative effects of any implementation. Then the downward refinement property would be satisfied. Unfortunately, this semantics for HLAs is much too conservative.

Consider again the HLA Go(Home, SFO), which has two refinements, and suppose, for the sake of argument, a simple world in which one can always drive to the airport and park, but taking a taxi requires Cash as a precondition. In that case, Go(Home, SFO) doesn’t always get you to the airport.

In particular, it fails if Cash is false, and so we cannot assert

At(Agent,SFO) as an effect of the HLA. This makes no sense, however; if the agent didn’t

have Cash, it would drive itself. Requiring that an effect hold for every implementation is

equivalent to assuming that someone else—an adversary—will choose the implementation.

Section 11.4

(@)

Hierarchical Planning

361

(b)

Figure 11.9 Schematic examples of reachable sets. The set of goal states is shaded in purple. Black and gray arrows indicate possible implementations of Ay and ha, respectively. (a) The reachable set of an HLA hy in a state s. (b) The reachable set for the sequence [k, ha]. Because this intersects the goal set, the sequence achieves the goal. It treats the HLA’s multiple outcomes exactly as if the HLA were a nondeterministic action,

as in Section 4.3. For our case, the agent itself will choose the implementation.

The programming languages community has coined the term demonic nondeterminism Demeric

for the case where an adversary makes the choices, contrasting this with angelic nondeterminism, where the agent itself makes the choices. We borrow this term to define angelic semantics for HLA descriptions. The basic concept required for understanding angelic se-

mantics is the reachable set of an HLA: given a state s, the reachable set for an HLA h,

written as REACH (s, h), is the set of states reachable by any of the HLA’s implementations.

The key idea is that the agent can choose which element of the reachable set it ends up in when it executes the HLA; thus, an HLA with multiple refinements is more “powerful” than the same HLA with fewer refinements. We can also define the reachable set of a sequence of HLAs. For example, the reachable set of a sequence [hy, /] is the union of all the reachable

sets obtained by applying h, in each state in the reachable set of h;: REACH(s, [, ha]) = U REACH(S, ). #eREACH(s.In) Given these definitions, a high-level plan—a sequence of HLAs—achieves the goal if its reachable set intersects the set of goal states. (Compare this to the much stronger condition for demonic semantics,

where every member of the reachable set has to be a goal state.)

Conversely, if the reachable set doesn’t intersect the goal, then the plan definitely doesn’t work. Figure 11.9 illustrates these ideas.

The notion of reachable sets yields a straightforward algorithm: search among highlevel plans, looking for one whose reachable set intersects the goal; once that happens, the

algorithm can commit to that abstract plan, knowing that it works, and focus on refining the

plan further. We will return to the algorithmic issues later; for now consider how the effects

:;g;l‘; rmminism Angelic semantics Reachable set

362

Chapter 11

Automated Planning

of an HLA—the reachable set for each possible initial state—are represented.

action can set a fluent to true or false or leave it unchanged.

A primitive

(With conditional effects (see

Section 11.5.1) there is a fourth possibility: flipping a variable to its opposite.)

An HLA under angelic semantics can do more: it can control the value of a fluent, setting

it to true or false depending on which implementation is chosen. That means that an HLA can

have nine different effects on a fluent: if the variable starts out true, it can always keep it true,

always make it false, or have a choice; if the fluent starts out false, it can always keep it false, always make it true, or have a choice; and the three choices for both cases can be combined

arbitrarily, making nine. Notationally, this is a bit challenging. We’ll use the language of add lists and delete lists (rather than true/false fluents) along with the ~ symbol to mean “possibly, if the agent

so chooses.” Thus, the effect -4 means “possibly add A,” that is, either leave A unchanged

or make it true. or delete A

Similarly, —A means “possibly delete A” and FA means “possibly add

For example, the HLA

Go(Home,SFO),

with the two refinements shown in

Figure 11.7, possibly deletes Cash (if the agent decides to take a taxi), so it should have the

effect ~Cash. Thus, we see that the descriptions of HLAs are derivable from the descriptions

of their refinements. Now, suppose we have the following schemas for the HLAs /; and h»: Action(hy, PRECOND: A, EFFECT:A A ~B)

Action(hy, PRECOND: ~B, EFFECT: +A A £C) That is, /; adds A and possibly deletes B, while &, possibly adds A and has full control over

C. Now, if only B is true in the initial state and the goal is A A C then the sequence [hy,hs]

achieves the goal: we choose an implementation of /; that makes B false, then choose an implementation of /2, that leaves A true and makes C true. The preceding discussion assumes that the effects of an HLA—the reachable set for any

given initial state—can be described exactly by describing the effect on each fluent. It would

be nice if this cause an HLA gly reachable page 243. For

were always true, but in many cases we can only approximate may have infinitely many implementations and may produce sets—rather like the wiggly-belief-state problem illustrated in example, we said that Go(Home, SFO) possibly deletes Cash;

the effects bearbitrarily wigFigure 7.21 on it also possibly

adds At(Car,SFOLongTermParking); but it cannot do both—in fact, it must do exactly one.

Optimistic lescription essimistic description

As with belief states, we may need to write approximate descriptions. We will use two kinds of approximation: an optimistic description REACH* (s, ) of an HLA h may overstate the reachable set, while a pessimistic description REACH ™ (s, 1) may understate the reachable set. Thus, we have

REACH ™ (s,h) C REACH(s, h) C REACHT (s, h)

For example, an optimistic description of Go(Home, SFO) says that it possibly deletes Cash and possibly adds At(Car,SFOLongTermParking). Another good example arises in the 8puzzle, half of whose states are unreachable from any given state (see Exercise 11.PART): the optimistic description of Act might well include the whole state space, since the exact reachable set is quite wiggly.

With approximate descriptions, the test for whether a plan achieves the goal needs to be modified slightly. If the optimistic reachable set for the plan doesn’t intersect the goal, then the plan doesn’t work; if the pessimistic reachable set intersects the goal, then the plan

does work (Figure 11.10(a)). With exact descriptions, a plan either works or it doesn’t, but

Section 11.4

Hierarchical Planning

Figure 11.10 Goal achievement for high-level plans with approximate descriptions. The set of goal states is shaded in purple. For each plan, the pessimistic (solid lines, light blue) and optimistic (dashed lines, light green) reachable sets are shown. (a) The plan indicated by the black arrow definitely achieves the goal, while the plan indicated by the gray arrow definitely doesn’t. (b) A plan that possibly achieves the goal (the optimistic reachable set intersects the goal) but does not necessarily achieve the goal (the pessimistic reachable set does not intersect the goal). The plan would need to be refined further to determine if it really does achieve the goal. with approximate descriptions, there is a middle ground: if the optimistic set intersects the goal but the pessimistic set doesn’t, then we cannot tell if the plan works (Figure 11.10(b)).

When this circumstance arises, the uncertainty can be resolved by refining the plan. This is a very common situation in human reasoning. For example, in planning the aforementioned

two-week Hawaii vacation, one might propose to spend two days on each of seven islands.

Prudence would indicate that this ambitious plan needs to be refined by adding details of inter-island transportation.

An algorithm for hierarchical planning with approximate angelic descriptions is shown in Figure 11.11. For simplicity, we have kept to the same overall scheme used previously in Figure 11.8, that is, a breadth-first search in the space of refinements. As just explained,

the algorithm can detect plans that will and won’t work by checking the intersections of the optimistic and pessimistic reachable sets with the goal. (The details of how to compute

the reachable sets of a plan, given approximate descriptions of each step, are covered in Exercise 1 1.LHLAP.)

When a workable abstract plan is found, the algorithm decomposes the original problem into subproblems, one for each step of the plan. The initial state and goal for each subproblem are obtained by regressing a guaranteed-reachable goal state through the action schemas for each step of the plan. (See Section 11.2.2 for a discussion of how regression works.) Figure 11.9(b) illustrates the basic idea: the right-hand circled state is the guaranteed-reachable

goal state, and the left-hand circled state is the intermediate goal obtained by regressing the

goal through the final action.

363

364

Chapter 11

Automated Planning

function ANGELIC-SEARCH(problem, hierarchy, initialPlan) returns solution or fail Jfrontier +a FIFO queue with initialPlan as the only element while rrue do if EMPTY?(frontier) then return fail

plan & POP(frontier) /1 chooses the shallowest node in frontier if REACH* (problem.INITIAL, plan) intersects problem.GOAL then if plan is primitive then return plan

// REACH" is exact for primitive plans

guaranteed— REACH™ (problem.INITIAL, plan) N problem.GOAL

if guaranteed#{ } and MAKING-PROGRESS(plan, initialPlan) then

JfinalState —any element of guaranteed return DECOMPOSE (hierarchy, problem.INITIAL, plan, finalState)

hla+some HLA in plan

prefix,suffix < the action subsequences before and after hla in plan outcome < RESULT(problem.INITIAL, prefix) for each sequence in REFINEMENTS(hla, outcome, hierarchy) do Sfrontier «— Insert( APPEND(prefix, sequence, suffix), frontier)

function DECOMPOSE hierarchy, so. plan, sy) returns a solution solution «an empty plan while plan is not empty do action < REMOVE-LAST(plan)

si¢—astate in REACH™ (sp, plan) such that s;€REACH™ (s;, action)

problem «—a problem with INITIAL = s; and GOAL = sy solution « APPEND(ANGELIC-SEARCH(problem, hierarchy, action), solution) Spesi

return solution

Figure 11.11 A hierarchical planning algorithm that uses angelic semantics to identify and commit to high-level plans that work while avoiding high-level plans that don’t. The predicate MAKING-PROGRESS

checks to make sure that we aren’t stuck in an infinite regression

of refinements. At top level, call ANGELIC-SEARCH with [Act] as the initialPlan.

The ability to commit to or reject high-level plans can give ANGELIC-SEARCH

a sig-

nificant computational advantage over HIERARCHICAL-SEARCH, which in turn may have a

large advantage over plain old BREADTH-FIRST-SEARCH.

Consider, for example, cleaning

up a large vacuum world consisting of an arrangement of rooms connected by narrow corridors, where each room is a w x h rectangle of squares. It makes sense to have an HLA for Navigate (as shown in Figure 11.7) and one for CleanWholeRoom. (Cleaning the room could

be implemented with the repeated application of another HLA to clean each row.) Since there

are five primitive actions, the cost for BREADTH-FIRST-SEARCH grows as 59, where d is the

length of the shortest solution (roughly twice the total number of squares); the algorithm

cannot manage even two 3 x 3 rooms.

HIERARCHICAL-SEARCH

is more efficient, but still

suffers from exponential growth because it tries all ways of cleaning that are consistent with the hierarchy. ANGELIC-SEARCH scales approximately linearly in the number of squares— it commits to a good high-level sequence of room-cleaning and navigation steps and prunes

away the other options.

Section 1.5

Planning and Acting in Nondeterministic Domains

365

Cleaning a set of rooms by cleaning each room in turn is hardly rocket science: it is

easy for humans because of the hierarchical structure of the task. When we consider how

difficult humans find it to solve small puzzles such as the 8-puzzle, it seems likely that the

human capacity for solving complex problems derives not from considering combinatorics, but rather from skill in abstracting and decomposing problems to eliminate combinatorics.

The angelic approach can be extended to find least-cost solutions by generalizing the

notion of reachable set. Instead of a state being reachable or not, each state will have a cost for the most efficient way to get there.

(The cost is

infinite for unreachable states.)

optimistic and pessimistic descriptions bound these costs.

The

In this way, angelic search can

find provably optimal abstract plans without having to consider their implementations. The

same approach can be used to obtain effective hierarchical look-ahead algorithms for online

search, in the style of LRTA* (page 140). In some ways, such algorithms mirror aspects of human deliberation in tasks such as planning a vacation to Hawaii—consideration of alternatives is done initially at an abstract

level over long time scales; some parts of the plan are left quite abstract until execution time,

such as how to spend two lazy days on Moloka'i, while others parts are planned in detail, such as the flights to be taken and lodging to be reserved—without these latter refinements,

there is no guarantee that the plan would be feasible. 11.5

Planning and Acting in Nondeterministic

Domains

In this section we extend planning to handle partially observable, nondeterministic, and un-

known environments. The basic concepts mirror those in Chapter 4, but there are differences

arising from the use of factored representations rather than atomic representations. This affects the way we represent the agent’s capability for action and observation and the way

we represent belief states—the sets of possible physical states the agent might be in—for partially observable environments.

We can also take advantage of many of the domain-

independent methods given in Section 11.3 for calculating search heuristics.

‘We will cover sensorless planning (also known as conformant planning) for environ-

ments with no observations; contingency planning for partially observable and nondeterministic environments; and online planning and replanning for unknown environments. This

will allow us to tackle sizable real-world problems.

Consider this problem:

given a chair and a table, the goal is to have them match—have

the same color. In the initial state we have two cans

of paint, but the colors of the paint and

the furniture are unknown. Only the table is initially in the agent’s field of view:

Init(Object(Tuble) A Object(Chair) A Can(Cy) A Can(Cz) A InView(Tuble)) Goal(Color(Chair,c) A Color(Table,c))

There are two actions:

removing the lid from a paint can and painting an object using the

paint from an open can. Action(RemoveLid(can), PRECOND:Can(can) EFFECT: Open(can))

Action(Paint(x,can),

PRECOND:Object(x) A Can(can) A Color(can,c) A Open(can) EFFECT: Color(x,c))

Hierarchical look-ahead

366

Chapter 11

Automated Planning

The action schemas are straightforward, with one exception: preconditions and effects now may contain variables that are not part of the action’s variable list.

That is, Paint(x,can)

does not mention the variable ¢, representing the color of the paint in the can. In the fully

observable case, this is not allowed—we would have to name the action Paint(x,can,c). But

in the partially observable case, we might or might not know what color is in the can.

To solve a partially observable problem, the agent will have to reason about the percepts

it will obtain when it is executing the plan.

The percept will be supplied by the agent’s

sensors when it is actually acting, but when it is planning it will need a model of its sensors. Percept schema

In Chapter 4, this model was given by a function, PERCEPT(s).

PDDL with a new type of schema, the percept schema: Percept(Color(x.c),

For planning, we augment

PRECOND: Object(x) A InView(x)

Percept(Color(can,c),

PRECOND:Can(can) A InView(can) A Open(can)

The first schema says that whenever an object is in view, the agent will perceive the color of the object (that is, for the object x, the agent will learn the truth value of Color(x,c) for all ¢). The second schema says that if an open can is in view, then the agent perceives the color of the paint in the can. Because there are no exogenous events in this world, the color

of an object will remain the same, even if it is not being perceived, until the agent performs

an action to change the object’s color. Of course, the agent will need an action that causes objects (one at a time) to come into view:

Action(LookAt(x),

PRECOND:InView(y) A (x # y) EFFECT: InView(x) A —~InView(y))

For a fully observable environment, we would have a Percept schema with no preconditions

for each fluent. A sensorless agent, on the other hand, has no Percept schemas at all. Note

that even a sensorless agent can solve the painting problem. One solution is to open any can of paint and apply it to both chair and table, thus coercing them to be the same color (even

though the agent doesn’t know what the color is). A contingent planning agent with sensors can generate a better plan. First, look at the

table and chair to obtain their colors; if they are already the same then the plan is done. If

not, look at the paint cans; if the paint in a can is the same color as one piece of furniture,

then apply that paint to the other piece. Otherwise, paint both pieces with any color.

Finally, an online planning agent might generate a contingent plan with fewer branches

at first—perhaps ignoring the possibility that no cans match any of the furniture—and deal

with problems when they arise by replanning. It could also deal with incorrectness of its

action schemas.

Whereas a contingent planner simply assumes that the effects of an action

always succeed—that painting the chair does the job—a replanning agent would check the result and make an additional plan to fix any unexpected failure, such as an unpainted area or the original color showing through. In the real world, agents use a combination of approaches. Car manufacturers sell spare tires and air bags, which are physical embodiments of contingent plan branches designed to handle punctures or crashes.

On the other hand, most car drivers never consider these

possibilities; when a problem arises they respond as replanning agents. In general, agents

Section 1.5

Planning and Acting in Nondeterministic Domains

plan only for contingencies that have important consequences and a nonnegligible chance of happening. Thus, a car driver contemplating a trip across the Sahara desert should make explicit contingency plans for breakdowns, whereas a trip o the supermarket requires less advance planning. We next look at cach of the three approaches in more detail. 11.5.1

Sensorless planning

Section 4.4.1 (page 126) introduced the basic idea of searching in belief-state space to find

a solution for sensorless problems. Conversion of a sensorless planning problem to a beliefstate planning problem works much the same way as it did in Section 4.4.1; the main dif-

ferences are that the underlying physical transition model is represented by a collection of

action schemas, and the belief state can be represented by a logical formula instead of by

an explicitly enumerated set of states. We assume that the underlying planning problem is deterministic.

The initial belief state for the sensorless painting problem can ignore InView fluents

because the agent has no sensors. Furthermore, we take as given the unchanging facts Object(Table) A Object(Chair) A Can(Cy) A Can(Cs) because these hold in every belief state. The agent doesn’t know the colors of the cans or the objects, or whether the cans are open or closed, but it does know that objects and cans have colors:

Skolemizing (see Section 9.5.1), we obtain the initial belief state:

Vx



Color(x,c).

After

bo = Color(x,C(x)).

In classical planning, where the closed-world assumption is made, we would assume that any fluent not mentioned in a state is false, but in sensorless (and partially observable) plan-

ning we have to switch to an open-world assumption in which states contain both positive and negative fluents, and if a fluent does not appear, its value is unknown.

Thus, the belief

state corresponds exactly to the set of possible worlds that satisfy the formula. Given this initial belief state, the following action sequence is a solution:

[RemoveLid(Cany ), Paint(Chair, Cany ), Paint(Table, Cany ).

‘We now show how to progress the belief state through the action sequence to show that the

final belief state satisfies the goal.

First, note that in a given belief state b, the agent can consider any action whose preconditions are satisfied by b. (The other actions cannot be used because the transition model doesn’t define the effects of actions whose preconditions might be unsatisfied.) According

to Equation (4.4) (page 127), the general formula for updating the belief state b given an applicable action a in a deterministic world is as follows:

b =RESULT(b,a) = {s' : ' =RESULTp(s,a) and s € b} where RESULTp defines the physical transition model. For the time being, we assume that the

initial belief state is always a conjunction of literals, that is, a 1-CNF formula. To construct the new belief state 5, we must consider what happens to each literal £ in each physical state s in b when action a is applied. For literals whose truth value is already known in b, the truth value in b is computed from the current value and the add list and delete list of the action. (For example, if £ is in the delete list of the action, then —¢ is added to &'.) What about a literal whose truth value is unknown in b? There are three cas 1. If the action adds ¢, then ¢ will be true in b’ regardless of its initial value.

367

368

Chapter 11

Automated Planning

2. If the action deletes £, then ¢ will be false in b’ regardless of its initial value.

3. If the action does not affect £, then ¢ will retain its initial value (which is unknown) and will not appear in 4. Hence, we see that the calculation of 4/ is almost identical to the observable case, which was specified by Equation (11.1) on page 345: b =ResuLT(b,a) = (b— DEL(a)) UADD(a). ‘We cannot quite use the set semantics because (1) we must make sure that b’ does not contain both ¢ and —¢, and (2) atoms may contain unbound variables.

But it is still the case

that RESULT (b, a) is computed by starting with b, setting any atom that appears in DEL(a) to false, and setting any atom that appears in ADD(a) to true. For example, if we apply

RemoveLid(Cany) to the initial belief state by, we get

by = Color(x,C(x)) A Open(Can,). When we apply the action Paint(Chair,Cany), the precondition Color(Cany,c) is satisfied by the literal Color(x,C(x)) with binding {x/Cany,c/C(Can;)} and the new belief state is by = Color(x,C(x)) A Open(Cany) A Color(Chair,C(Cany ). Finally, we apply the action Paint(Table, Cany) to obtain by = Color(x,C(x)) A Open(Can,) A Color(Chair,C(Cany ) A Color(Table,C(Cany)).

The final belief state satisfies the goal, Color(Table,c) A Color(Chair,c), with the variable ¢ bound to C(Cany). The preceding analysis of the update rule has

shown a very important fact: the family

of belief states defined as conjunctions of literals is closed under updates defined by PDDL action schemas.

That is, if the belief state starts as a conjunction of literals, then any update

will yield a conjunction of literals.

That means that in a world with n fluents, any belief

state can be represented by a conjunction of size O(n).

This is a very comforting result,

considering that there are 2" states in the world. It says we can compactly represent all the

subsets of those 2" states that we will ever need. Moreover, the process of checking for belief

states that are subsets or supersets of previously visited belief states is also easy, at least in

the propositional case.

The fly in the ointment of this pleasant picture is that it only works for action schemas

that have the same effects for all states in which their preconditions are satisfied.

It is this

property that enables the preservation of the 1-CNF belief-state representation. As soon as

the effect can depend on the state, dependencies are introduced between fluents, and the 1-

CNF property is lost.

Consider, for example, the simple vacuum world defined in Section 3.2.1. Let the fluents be AzL and AtR for the location of the robot and CleanL and CleanR for the state of the

squares. According to the definition of the problem, the Suck action has no precondition—it

Conditional effect

can always be done. The difficulty is that its effect depends on the robot’s location: when the robot is ArL, the result is CleanL, but when it is AzR, the result is CleanR. For such actions, our action schemas will need something new: a conditional effect. These have the syntax

Section 1.5

Planning and Acting in Nondeterministic Domains

“when condition: effect,” where condition is a logical formula to be compared against the current state, and effect is a formula describing the resulting state. For the vacuum world:

Action(Suck,

EFFECT:when ArL: CleanL \ when AtR: CleanR) .

When applied to the initial belief state True, the resulting belief state is (AL A CleanL) V' (ARA CleanR), which is no longer in 1-CNF. (This transition can be seen in Figure 4.14

on page 129.) In general, conditional effects can induce arbitrary dependencies among the fluents in a belief state, leading to belief states of exponential size in the worst case.

Itis important to understand the difference between preconditions and conditional effects.

All conditional effects whose conditions are satisfied have their effects applied to generate the resulting belief state; if none are satisfied, then the resulting state is unchanged. On the other hand, if a precondition is unsatisfied, then the action is inapplicable and the resulting state is undefined.

From the point of view of sensorless planning, it is better to have conditional

effects than an inapplicable action. unconditional effects as follows:

Action(SuckL,

PRECOND:AL;

Action(SuckR,

For example, we could split Suck into two actions with

EFFECT: CleanL)

PRECOND:AtR; EFFECT: CleanR).

Now we have only unconditional schemas, so the belief states all remain in 1-CNF; unfortu-

nately, we cannot determine the applicability of SuckL and SuckR in the initial belief state.

It seems inevitable, then, that nontrivial problems will involve wiggly belief states, just

like those encountered when we considered the problem of state estimation for the wumpus

world (see Figure 7.21 on page 243). The solution suggested then was to use a conservative

approximation to the exact belief state; for example, the belief state can remain in 1-CNF

if it contains all literals whose truth values can be determined and treats all other literals as

unknown.

While this approach is sound, in that it never generates an incorrect plan, it is

incomplete because it may be unable to find solutions to problems that necessarily involve interactions among literals.

To give a trivial example, if the goal is for the robot to be on

a clean square, then [Suck] is a solution but a sensorless agent that insists on 1-CNF belief

states will not find it.

Perhaps a better solution is to look for action sequences that keep the belief state as simple as possible. In the sensorless vacuum world, the action sequence [Right, Suck, Left, Suck] generates the following sequence of belief states: by

=

True

by

=

AR

by = ARACleanR by = AILACleanR by

= AtL A CleanR N CleanL

That is, the agent can solve the problem while retaining a 1-CNF belief state, even though

some sequences (e.g., those beginning with Suck) go outside 1-CNE. The general lesson is not lost on humans: we are always performing little actions (checking the time, patting our

369

370

Chapter 11

Automated Planning

pockets to make sure we have the car keys, reading street signs as we navigate through a city) to eliminate uncertainty and keep our belief state manageable.

There is another, quite different approach to the problem of unmanageably wiggly belief states: don’t bother computing them at all. Suppose the initial belief state is by and we would like to know the belief state resulting from the action sequence [a1, ...,

). Instead of com-

puting it explicitly, just represent it as “by then [ar, ..., a,,].” This is a lazy but unambiguous

representation of the belief state, and it’s quite concise—O(n +m) where n is the size of the initial belief state (assumed to be in 1-CNF) and m is the maximum

length of an action se-

quence. As a belief-state representation, it suffers from one drawback, however: determining

whether the goal is satisfied, or an action is applicable, may require a lot of computation.

The computation can be implemented as an entailment test: if A,, represents the collec-

tion of successor-state axioms required to define occurrences of the actions aj....,a,—as

explained for SATPLAN in Section 11.2.3—and G, asserts that the goal is true after m steps,

then the plan achieves the goal if by A A,, = G—that is, if by A Ay A G, is unsatisfiable. Given a modern SAT solver, it may be possible to do this much more quickly than computing the full belief state. For example, if none of the actions in the sequence has a particular goal fluent in its add list, the solver will detect this immediately.

It also helps if partial results

about the belief state—for example, fluents known to be true or false—are cached to simplify

subsequent computations.

The final piece of the sensorless planning puzzle is a heuristic function to guide the

search. The meaning of the heuristic function is the same as for classical planning: an esti-

mate (perhaps admissible) of the cost of achieving the goal from the given belief state. With

belief states, we have one additional fact: solving any subset of a belief state is necessarily easier than solving the belief state:

if by C by then h*(by) < h*(b2). Hence, any admissible heuristic computed for a subset is admissible for the belief state itself. The most obvious candidates are the singleton subsets, that is, individual physical states. We

can take any random collection of states s admissible heuristic /, and return

sy that are in the belief state b, apply any

H(b) =max{h(s1),....h(sn)}

as the heuristic estimate for solving b. We can also use inadmissible heuristics such as the ignore-delete-lists heuristic (page 354), which seems to work quite well in practice. 11.5.2

Contingent planning

We saw in Chapter 4 that contingency planning—the generation of plans with conditional

branching based on percepts—is appropriate for environments with partial observability, non-

determinism, or both. For the partially observable painting problem with the percept schemas given earlier, one possible conditional solution is as follows:

[LookAt(Table), LookAt(Chair), if Color(Tuble,c) A Color(Chair, c) then NoOp else [RemoveLid(Can, ), LookAt(Can, ), RemoveLid(Cans), LookAt(Cans), if Color(Table,c) A Color(can, c) then Paint(Chair, can) else if Color(Chair,c) A Color(can,c) then Paint(Table,can) else [Paint(Chair, Can), Paint (Table, Can, )|]]

Section 1.5

Planning and Acting in Nondeterministic Domains

Variables in this plan should be considered existentially quantified;

the second

line says

that if there exists some color ¢ that is the color of the table and the chair, then the agent

need not do anything to achieve the goal. When executing this plan, a contingent-planning

agent can maintain its belief state as a logical formula and evaluate each branch condition

by determining if the belief state entails the condition formula or its negation.

(It is up to

the contingent-planning algorithm to make sure that the agent will never end up in a belief

state where the condition formula’s truth value is unknown.)

Note that with first-order

conditions, the formula may be satisfied in more than one way; for example, the condition Color(Table, c) A Color(can, ¢) might be satisfied by {can/Cany } and by {can/Cans} if both cans are the same color as the table. In that case, the agent can choose any satisfying substitution to apply to the rest of the plan.

As shown in Section 4.4.2, calculating the new belief state b after an action a and subse-

quent percept is done in two stages. The first stage calculates the belief state after the action,

just as for the sensorless agent: b= (b—DEL(a))UADD(a)

where, as before, we have assumed a belief state represented as a conjunction of literals. The

second stageis a little trickier. Suppose that percept literals p;., ..., py are received. One might

think that we simply need to add these into the belief state; in fact, we can also infer that the

preconditions for sensing are satisfied. Now, if a percept p has exactly one percept schema, Percept(p, PRECOND:c), where c is a conjunction of literals, then those literals can be thrown into the belief state along with p. On the other hand, if p has more than one percept schema

whose preconditions might hold according to the predicted belief state b, then we have to add in the disjunction of the preconditions. Obviously, this takes the belief state outside 1-CNF

and brings up the same complications as conditional effects, with much the same classes of

solutions.

Given a mechanism for computing exact or approximate belief states, we can generate

contingent plans with an extension of the AND—OR

forward search over belief states used

in Section 4.4. Actions with nondeterministic effects—which are defined simply by using a

disjunction in the EFFECT of the action schema—can be accommodated with minor changes

to the belief-state update calculation and no change to the search algorithm.® For the heuristic

function, many of the methods suggested for sensorless planning are also applicable in the partially observable, nondeterministic case. 11.5.3

Online planning

Imagine watching a spot-welding robot in a car plant. The robot’s fast, accurate motions are

repeated over and over again as each car passes down the line. Although technically impressive, the robot probably does not seem at all intelligent because the motion is a fixed, preprogrammed sequence; the robot obviously doesn’t “know what it’s doing” in any meaningful sense.

Now suppose that a poorly attached door falls off the car just as the robot is

about to apply a spot-weld. The robot quickly replaces its welding actuator with a gripper,

picks up the door, checks it for scratches, reattaches it to the car, sends an email to the floor supervisor, switches back to the welding actuator, and resumes its work.

All of a sudden,

3 If cyclic solutions are required for a nondeterministic problem, AND—OR search must be generalized to a loopy version such as LAO" (Hansen and Zilberstein, 2001).

371

372

Chapter 11

Automated Planning

the robot’s behavior seems purposive rather than rote; we assume it results not from a vast,

precomputed contingent plan but from an online replanning process—which means that the

Execution monitoring

robot does need to know what it’s trying to do. Replanning presupposes some form of execution monitoring to determine the need for a new plan. One such need arises when a contingent planning agent gets tired of planning

for every little contingency, such as whether the sky might fall on its head.* This means that the contingent plan is left in an incomplete form.

For example, Some branches of a

partially constructed contingent plan can simply say Replan; if such a branch is reached

during execution, the agent reverts to planning mode. As we mentioned earlier, the decision

as to how much of the problem to solve in advance and how much to leave to replanning

is one that involves tradeoffs among possible events with different costs and probabilities of

occurring.

Nobody wants to have a car break down in the middle of the Sahara desert and

only then think about having enough water.

Missing precondition Missing effect

Missing fluent Exogenous event

Replanning may be needed if the agent’s model of the world is incorrect. The model

for an action may have a missing precondition—for example, the agent may not know that

removing the lid of a paint can often requires a screwdriver. The model may have a missing effect—painting an object may get paint on the floor as well.

Or the model may have a

missing fluent that is simply absent from the representation altogether—for example, the model given earlier has no notion of the amount of paint in a can, of how its actions affect

this amount, or of the need for the amount to be nonzero. The model may also lack provision

for exogenous events such as someone knocking over the paint can. Exogenous events can

also include changes in the goal, such as the addition of the requirement that the table and

chair not be painted black. Without the ability to monitor and replan, an agent’s behavior is likely to be fragile if it relies on absolute correctness of its model.

The online agent has a choice of (at least) three different approaches for monitoring the environment during plan execution:

Action monitoring

+ Action monitoring: before executing an action, the agent verifies that all the precondi-

Plan monitoring

+ Plan monitoring: before executing an action, the agent verifies that the remaining plan

Goal monitoring

* Goal monitoring: before executing an action, the agent checks to see if there is a better

tions still hold.

will still succeed.

set of goals it could be trying to achieve.

In Figure 11.12 we see a schematic of action monitoring.

The agent keeps track of both its

original plan, whole plan, and the part of the plan that has not been executed yet, which is denoted by plan.

After executing the first few steps of the plan, the agent expects to be in

state E. But the agent observes that it is actually in state O. It then needs to repair the plan by

finding some point P on the original plan that it can get back to. (It may be that P is the goal

state, G.) The agent tries to minimize the total cost of the plan: the repair part (from O to P)

plus the continuation (from P to G).

# In 1954, a Mrs. Hodges of Alabama was hit by meteorite that crashed through her roof. In 1992, a piece of the Mbale metcorite hit a small boy on the head; fortunately. its descent was slowed by banana leaves (Jenniskens etal., 1994). Andiin 2009, a German boy claimed to have been injuries resulted from any of these incidents, suggesting that the need for preplanning st such contingencies is sometimes overstated.

Section 1.5

Planning and Acting in Nondeterministic Domains whole plan

Figure 11.12 At first, the sequence “whole plan” is expected to get the agent from S to G. The agent executes steps of the plan until it expects to be in state £, but observes that it is actually in O. The agent then replans for the minimal repair plus continuation to reach G. Now let’s return to the example problem of achieving a chair and table of matching color.

Suppose the agent comes up with this plan:

[LookAt(Table), LookAt(Chair), if Color(Tuble, ) A Color(Chair, ) then NoOp else [RemoveLid(Cany ), LookAt(Cany),

if Color(Tuble, c) A Color(Can) ) then Paint(Chair, Cany) else REPLAN]].

Now the agent is ready to execute the plan. The agent observes that the table and can of paint are white and the chair is black.

It then executes Paint(Chair,Can,).

At this point a

classical planner would declare victory; the plan has been executed. But an online execution monitoring agent needs to check that the action succeeded.

Suppose the agent perceives that the chair is a mottled gray because the black paint is

showing through. The agent then needs to figure out a recovery position in the plan to aim for and a repair action sequence to get there. The agent notices that the current state is identical to the precondition before the Paint(Chair,Can;) action, so the agent chooses the empty

sequence for repair and makes its plan be the same [Paint] sequence that it just attempted.

With this new plan in place, execution monitoring resumes, and the Paint action is retried.

This behavior will loop until the chair is perceived to be completely painted. But notice that

the loop is created by a process of plan-execute-replan, rather than by an explicit loop in a plan. Note also that the original plan need not cover every contingency. If the agent reaches

the step marked REPLAN, it can then generate a new plan (perhaps involving Cany).

Action monitoring is a simple method of execution monitoring, but it can sometimes lead

to less than intelligent behavior. For example, suppose there is no black or white paint, and

the agent constructs a plan to solve the painting problem by painting both the chair and table

red. Suppose that there is only enough red paint for the chair. With action monitoring, the agent would go ahead and paint the chair red, then notice that it is out of paint and cannot

paint the table, at which point it would replan a repair—perhaps painting both chair and table

green. A plan-monitoring agent can detect failure whenever the current state is such that the

remaining plan no longer works. Thus, it would not waste time painting the chair red.

373

374

Chapter 11

Automated Planning

Plan monitoring achieves this by checking the preconditions for success of the entire

remaining plan—that is, the preconditions of each step in the plan, except those preconditions

that are achieved by another step in the remaining plan. Plan monitoring cuts off execution of a doomed plan as soon as possible, rather than continuing until the failure actually occurs.3 Plan monitoring also allows for serendipity—accidental success.

If someone comes along

and paints the table red at the same time that the agent is painting the chair red, then the final plan preconditions are satisfied (the goal has been achieved), and the agent can go home early. It is straightforward to modify a planning algorithm so that each action in the plan is annotated with the action’s preconditions, thus enabling action monitoring.

It is slightly more

complex to enable plan monitoring. Partial-order planners have the advantage that they have already built up structures that contain the relations necessary for plan monitoring. Augment-

ing state-space planners with the necessary annotations can be done by careful bookkeeping as the goal fluents are regressed through the plan. Now that we have described a method for monitoring and replanning, we need to ask, “Does it work?” This is a surprisingly tricky question. If we mean, “Can we guarantee that the agent will always achieve the goal?” then the answer is no, because the agent could

inadvertently arrive at a dead end from which there is no repair. For example, the vacuum

agent might have a faulty model of itself and not know that its batteries can run out. Once

they do, it cannot repair any plans. If we rule out dead ends—assume that there exists a plan to reach the goal from any state in the environment—and

assume that the environment is

really nondeterministic, in the sense that such a plan always has some chance of success on

any given execution attempt, then the agent will eventually reach the goal.

Trouble occurs when a seemingly-nondeterministic action is not actually random, but

rather depends on some precondition that the agent does not know about. For example, sometimes a paint can may be empty, so painting from that can has no effect.

No amount

of retrying is going to change this.® One solution is to choose randomly from among the set of possible repair plans, rather than to try the same one each time. In this case, the repair plan of opening another can might work. A better approach is to learn a better model. Every

prediction failure is an opportunity for learning; an agent should be able to modify its model

of the world to accord with its percepts. From then on, the replanner will be able to come up with a repair that gets at the root problem, rather than relying on luck to choose a good repair.

11.6_Time, Schedules, and Resources Classical planning talks about what to do, in what order, but does not talk about time: how

Scheduling Resource constraint

long an action takes and when it occurs. For example, in the airport domain we could produce aplan saying what planes go where, carrying what, but could not specify departure and arrival

times. This is the subject matter of scheduling. The real world also imposes resource constraints: an airline has a limited number of staff, and staff who are on one flight cannot be on another at the same time. This section

introduces techniques for planning and scheduling problems with resource constraints.

5 Plan monitoring means that finally, after 374 pages, we have an agent that is smarter than a dung beetle (see page 41). A plan-monitoring agent would notice that the dung ball was missing from its grasp and would replan 10 get another ball and plug its hole. 6 Futile repetition of a plan repair is exactly the behavior exhibited by the sphex wasp (page 41).

Section 11.6

Time, Schedules, and Resources

375

Jobs({AddEnginel < AddWheels] < Inspectl}, {AddEngine2 < AddWheels2 < Inspeci2}) Resources(EngineHoists(1), WheelStations(1), Inspectors(e2), LugNuts(500)) Action(AddEnginel, DURATION:30,

USE:EngineHoists(1)) Action(AddEngine2, DURATION:60,

USE:EngineHoists(1)) Action(AddWheels], DURATION:30,

CONSUME: LugNuts(20), USE: WheelStations(1))

Action(AddWheels2, DURATION:15,

CONSUME: LugNuts(20), USE:WheelStations(1))

Action(Inspect;, DURATION: 10,

UsE:Inspectors(1))

Figure 11.13 A job-shop scheduling problem for assembling two cars, with resource constraints. The notation A < B means that action A must precede action B. The approach we take is “plan first, schedule later”: divide the overall problem into a planning phase in which actions are selected, with some ordering constraints, to meet the goals of the problem, and a later scheduling phase, in which temporal information is added to the plan to ensure that it meets resource and deadline constraints. This approach is common in real-world manufacturing and logistical settings, where the planning phase is sometimes

automated, and sometimes performed by human experts. 11.6.1

Representing temporal and resource constraints

A typical job-shop scheduling problem (see Section 6.1.2), consists of a set of jobs, each

of which has a collection of actions with ordering constraints among them. Each action has

a duration and a set of resource constraints required by the action.

A constraint specifies

a fype of resource (e.g., bolts, wrenches, or pilots), the number of that resource required,

and whether that resource is consumable (e.g., the bolts are no longer available for use) or

reusable (e.g., a pilot is occupied during a flight but is available again when the flight is over). Actions can also produce resources (¢.g., manufacturing and resupply actions). A solution to a job-shop scheduling problem specifies the start times for each action and must satisfy all the temporal ordering constraints and resource constraints.

Job-shop scheduling problem Job Duration Consumable Reusable

As with search

and planning problems, solutions can be evaluated according to a cost function; this can be quite complicated, with nonlinear resource costs, time-dependent delay costs, and so on. For simplicity, we assume that the cost function is just the total duration of the plan, which is called the makespan.

Figure 11.13 shows a simple example: a problem involving the assembly of two cars. The problem consists of two jobs, each of the form [AddEngine,AddWheels, Inspect]. Then

the Resources statement declares that there are four types of resources, and gives the number of each type available at the start: 1 engine hoist, 1 wheel station, 2 inspectors, and 500 lug nuts. The action schemas give the duration and resource needs of each action. The lug nuts

Makespan

376

Chapter 11

Automated Planning

are consumed as wheels are added to the car, whereas the other resources are “borrowed” at

the start of an action and released at the action’s end.

Aggregation

The representation of resources as numerical quantities, such as Inspectors(2), rather than as named entities, such as Inspector (1) and Inspector (L), is an example of a technique called aggregation:

grouping individual objects into quantities when the objects are all in-

distinguishable. In our assembly problem, it does not matter which inspector inspects the car, so there is no need to make the distinction. Aggregation is essential for reducing complexity.

Consider what happens when a proposed schedule has 10 concurrent Inspect actions but only

9 inspectors are available. With inspectors represented as quantities, a failure is detected im-

mediately and the algorithm backtracks to try another schedule. With inspectors represented as individuals, the algorithm would try all 9! ways of assigning inspectors to actions before noticing that none of them work. 11.6.2

Solving scheduling problems

We begin by considering just the temporal scheduling problem, ignoring resource constraints.

To minimize makespan (plan duration), we must find the earliest start times for all the actions

consistent with the ordering constraints supplied with the problem. It is helpful to view these Critical path method

ordering constraints as a directed graph relating the actions, as shown in Figure 11.14. We can

apply the critical path method (CPM) to this graph to determine the possible start and end

times of each action. A path through a graph representing a partial-order plan is a linearly

Critical path

ordered sequence of actions beginning with Start and ending with Finish. (For example, there are two paths in the partial-order plan in Figure 11.14.) The critical path is that path whose total duration is longest; the path is “critical” because it determines the duration of the entire plan—shortening other paths doesn’t shorten the plan

as a whole, but delaying the start of any action on the critical path slows down the whole plan. Actions that are off the critical path have a window of time in which they can be executed.

Slack Schedule

The window is specified in terms of an earliest possible start time, ES, and a latest possible

start time, LS.

The quantity LS — ES is known as the slack of an action.

We can see in

Figure 11.14 that the whole plan will take 85 minutes, that each action in the top job has 15 minutes of slack, and that each action on the critical path has no slack (by definition). Together the ES and LS times for all the actions constitute a schedule for the problem.

The following formulas define ES and LS and constitute a dynamic-programming algo-

rithm to compute them. A and B are actions, and A < B means that A precedes B:

ES(Start) =0

ES(B) = maxs < ES(A) + Duration(A)

LS(Finish) = ES(Finish) LS(A) = ming,. o LS(B) — Duration(A) .

The idea is that we start by assigning ES(Start) to be 0. Then, as soon as we get an action

B such that all the actions that come immediately before B have ES values assigned, we

set ES(B) to be the maximum of the earliest finish times of those immediately preceding

actions, where the earliest finish time of an action is defined as the earliest start time plus

the duration. This process repeats until every action has been assigned an ES value. The LS values are computed in a similar manner, working backward from the Finish action.

The complexity of the critical path algorithm is just O(Nb), where N is the number of

actions and b is the maximum

branching factor into or out of an action.

(To see this, note

Section 11.6

Start

o

0

o g 30

[EX5] 07 nsawiess =] tmpect 30 10

o7 g2 &

[@er ] niawteci 15

2

o



Time, Schedules, and Resources

[85.85]

Finish

[FEE ! o o

0

@

o

"

»

Figure 11.14 Top: a representation of the temporal constraints for the job-shop scheduling problem of Figure 11.13. The duration of each action is given at the bottom of each rectangle. In solving the problem, we compute the earliest and latest start times as the pair [ES, LS],

displayed in the upper left. The difference between these two numbers is the slack of an

action; actions with zero slack are on the critical path, shown with bold arrows. Bottom: the

same solution shown as a timeline. Grey rectangles represent time intervals during which an action may be executed, provided that the ordering constraints are respected. The unoccupied portion ofa gray rectangle indicates the slack.

that the LS and ES computations are done once for each action, and each computation iterates

over at most b other actions.) Therefore, finding a minimum-duration schedule, given a partial

ordering on the actions and no resource constraints, is quite easy.

Mathematically speaking, critical-path problems are easy to solve because they are de-

fined as a conjunction of linear inequalities on the start and end times. When we introduce

resource constraints, the resulting constraints on start and end times become more complicated. For example, the AddEngine actions, which begin at the same time in Figure 11.14, require the same EngineHoist and so cannot overlap. The “cannot overlap” constraint is a disjunction of two linear inequalities, one for each possible ordering. The introduction of disjunctions turns out to make scheduling with resource constraints NP-hard.

Figure 11.15 shows the solution with the fastest completion time, 115 minutes. This is

30 minutes longer than the 85 minutes required for a schedule without resource constraints.

Notice that there is no time at which both inspectors are required, so we can immediately

‘move one of our two inspectors to a more productive position. There is a long history of work on optimal scheduling. A challenge problem posed in 1963—to find the optimal schedule for a problem involving just 10 machines and 10 jobs of 100 actions each—went unsolved for 23 years (Lawler et al., 1993). Many approaches have been tried, including branch-and-bound, simulated annealing, tabu search, and constraint sat-

377

378

Chapter 11

Automated Planning

——&

Engincoist)

o

T

oo

T



T

T w0

T 0

T @

W

T s»

T %

T w

T

Figure 11.15 A solution to the job-shop scheduling problem from Figure 11.13, taking into account resource constraints. The left-hand margin lists the three reusable resources, and actions are shown aligned horizontally with the resources they use. There are two pos ble schedules, depending on which assembly uses the engine hoist first; we’ve shown the shortest-duration solution, which takes 115 minutes. Minimum slack

isfaction. One popular approach is the minimum slack heuristic:

on each iteration, schedule

for the earliest possible start whichever unscheduled action has all its predecessors scheduled and has the least slack; then update the ES and LS times for each affected action and

repeat. This greedy heuristic resembles the minimum-remaining-values (MRV) heuristic in constraint satisfaction. It often works well in practice, but for our assembly problem it yields a 130-minute solution, not the 115-inute solution of Figure 11.15.

Up to this point, we have assumed that the set of actions and ordering constraints is fixed. Under these assumptions, every scheduling problem can be solved by a nonoverlapping sequence that avoids all resource conflicts, provided that each action is feasible by itself.

However if a scheduling problem is proving very difficult, it may not be a good idea to solve

it this way—it may be better to reconsider the actions and constraints, in case that leads to a

much easier scheduling problem. Thus, it makes sense to infegrate planning and scheduling by taking into account durations and overlaps during the construction of a plan. Several of

the planning algorithms in Section 11.2 can be augmented to handle this information.

11.7

Analysis of Planning Approaches

Planning combines the two major areas of Al we have covered so far: search and logic. A planner can be seen either as a program that searches for a solution or as one that (constructively) proves the existence of a solution. The cross-fertilization of ideas from the two areas

has allowed planners to scale up from toy problems where the number of actions and states was limited to around a dozen, to real-world industrial applications with millions of states and thousands of actions.

Planning is foremost an exercise in controlling combinatorial explosion. If there are n propositions in a domain, then there are 2" states. Against such pessimism, the identification of independent subproblems can be a powerful weapon. In the best case—full decomposability of the problem—we get an exponential speedup. Decomposability is destroyed, however, by negative interactions between actions.

SATPLAN can encode logical relations between

subproblems. Forward search addresses the problem heuristically by trying to find patterns (subsets of propositions) that cover the independent subproblems.

Since this approach is

heuristic, it can work even when the subproblems are not completely independent.

Summary Unfortunately, we do not yet have a clear understanding of which techniques work best on which kinds of problems. Quite possibly, new techniques will emerge, perhaps providing a synthesis of highly expressive first-order and hierarchical representations with the highly efficient factored and propositional representations that dominate today. We are seeing exam-

ples of portfolio planning systems, where a collection of algorithms are available to apply to

any given problem. This can be done selectively (the system classifies each new problem to choose the best algorithm for it), or in parallel (all the algorithms run concurrently, each on a different CPU), or by interleaving the algorithms according to a schedule.

Summary

In this chapter, we described the PDDL representation for both classical and extended planning problems, and presented several algorithmic approaches for finding solutions. The points to remember: + Planning systems are problem-solving algorithms that operate on explicit factored rep-

resentations of states and actions. These representations make possible the derivation of

efffective domain-independent heuristics and the development of powerful and flexible algorithms for solving problems. « PDDL, the Planning Domain Definition Language, describes the initial and goal states as conjunctions of literals, and actions in terms of their preconditions and effects. Ex-

tensions represent time, resources, percepts, contingent plans, and hierarchical plans

+ State-space search can operate in the forward direction (progression) or the backward

direction (regression). Effective heuristics can be derived by subgoal independence assumptions and by various relaxations of the planning problem.

« Other approaches include encoding a planning problem as a Boolean satisfiability problem or as a constraint satisfaction problem; and explicitly searching through the space of partially ordered plans. + Hierarchical task network (HTN) planning allows the agent to take advice from the

domain designer in the form of high-level actions (HLAs) that can be implemented in

various ways by lower-level action sequences. The effects of HLAs can be defined with angelic semantics, allowing provably correct high-level plans to be derived without consideration of lower-level implementations. HTN methods can create the very large plans required by many real-world applications. + Contingent plans allow the agent to sense the world during execution to decide what

branch of the plan to follow. In some cases, sensorless or conformant planning can be used to construct a plan that works without the need for perception. Both conformant and contingent plans can be constructed by search in the space of belief states. Efficient representation or computation of belief states is a key problem.

« An online planning agent uses execution monitoring and splices in repairs as needed to recover from unexpected situations, which can be due to nondeterministic

exogenous events, or incorrect models of the environment.

actions,

* Many actions consume resources, such as money, gas, or raw materials. It is convenient

to treat these resources as numeric measures in a pool rather than try to reason about,

379

Portfolio

380

Chapter 11

Automated Planning

say, each individual coin and bill in the world.

Time is one of the most important

resources. It can be handled by specialized scheduling algorithms, or scheduling can be integrated with planning. « This chapter extends classical planning to cover nondeterministic environments (where

outcomes of actions are uncertain), but it is not the last word on planning. Chapter 17 describes techniques for stochastic environments (in which outcomes of actions have

probabilities associated with them): Markov decision processes, partially observable Markov decision processes, and game theory. In Chapter 22 we show that reinforcement learning allows an agent to learn how to behave from past successes and failures. Bibliographical and Historical Notes

AT planning arose from investigations into state-space search, theorem proving, and control theory.

STRIPS (Fikes and Nilsson, 1971, 1993), the first major planning system, was de-

signed as the planner for the Shakey robot at SRL The first version of the program ran on a

computer with only 192 KB of memory.

Its overall control structure was modeled on GPS,

the General Problem Solver (Newell and Simon, 1961), a state-space search system that used

means—ends analysis.

The STRIPS representation language evolved into the Action Description Language, or ADL (Pednault, 1986), and then the Problem Domain Description Language, or PDDL (Ghallab et al., 1998), which has been used for the International Planning Competition since 1998. The most recent version is PDDL 3.1 (Kovacs, 2011).

Linear planning

Planners in the early 1970s decomposed problems by computing a subplan for each subgoal and then stringing the subplans together in some order. This approach, called linear planning by Sacerdoti (1975), was soon discovered to be incomplete. It cannot solve some very simple problems, such as the Sussman anomaly (see Exercise 11.5Uss), found by Allen Brown during experimentation with the HACKER system (Sussman, 1975). A complete plan-

ner must allow for interleaving of actions from different subplans within a single sequence.

‘Warren’s (1974) WARPLAN system achieved that, and demonstrated how the logic programming language Prolog can produce concise programs; WARPLAN is only 100 lines of code.

Partial-order planning dominated the next 20 years of research, with theoretical work

describing the detection of conflicts (Tate, 1975a) and the protection of achieved conditions (Sussman, 1975), and implementations including NOAH (Sacerdoti, 1977) and NONLIN (Tate, 1977). That led to formal models (Chapman, 1987; McAllester and Rosenblitt, 1991)

that allowed for theoretical analysis of various algorithms and planning problems, and to a widely distributed system, UCPOP (Penberthy and Weld, 1992).

Drew McDermott suspected that the emphasis on partial-order planning was crowding out

other techniques that should perhaps be reconsidered now that computers had 100 times the

memory of Shakey’s day. His UNPOP (McDermott, 1996) was a state-space planning program employing the ignore-delete-list heuristic. HSP, the Heuristic Search Planner (Bonet and Geffner, 1999; Haslum, 2006) made state-space search practical for large planning problems. The FF or Fast Forward planner (Hoffmann, 2001; Hoffmann and Nebel, 2001; Hoffmann, 2005) and the FASTDOWNWARD variant (Helmert, 2006) won international planning

competitions in the 2000s.

Bibliographical and Historical Notes Bidirectional search (see Section 3.4.5) has also been known

381

to suffer from a lack of

heuristics, but some success has been obtained by using backward search to create a perimeter around the goal, and then refining a heuristic to search forward towards that perimeter (Torralba et al., 2016). The SYMBA*

bidirectional search planner (Torralba ez al., 2016)

won the 2016 competition.

Researchers turned to PDDL and the planning paradigm so that they could use domain independent heuristics. Hoffmann (2005) analyzes the search space of the ignore-deletelist heuristic.

Edelkamp (2009) and Haslum er al. (2007) describe how to construct pattern

databases for planning heuristics. Felner ef al. (2004) show encouraging results using pattern databases for sliding-tile puzzles, which can be thought of as a planning domain, but Hoffmann ef al. (2006) show some limitations of abstraction for classical planning problems. (Rintanen, 2012) discusses planning-specific variable-selection heuristics for SAT solving. Helmert et al. (2011) describe the Fast Downward Stone Soup (FDSS) system, a portfolio

planner that, as in the fable of stone soup, invites us to throw in as many planning algorithms. as possible. The system maintains a set of training problems, and for each problem and each algorithm records the run time and resulting plan cost of the problem’s solution. Then when

faced with a new problem, it uses the past experience to decide which algorithm(s) to try, with what time limits, and takes the solution with minimal cost. FDSS

was a winner in the 2018

International Planning Competition (Seipp and Rger, 2018). Seipp ef al. (2015) describe a machine learning approach to automatically learn a good portfolio, given a new problem. Vallati ef al. (2015) give an overview of portfolio planning. The idea of algorithm portfolios for combinatorial search problems goes back to Gomes and Selman (2001). Sistla and Godefroid (2004) cover symmetry reduction, and Godefroid (1990) covers

heuristics for partial ordering. Richter and Helmert (2009) demonstrate the efficiency gains

of forward pruning using preferred actions.

Blum and Furst (1997) revitalized the field of planning with their Graphplan system, which was orders of magnitude faster than the partial-order planners of the time. Bryce and Kambhampati (2007) give an overview of planning graphs. The use of situation calculus for planning was introduced by John McCarthy (1963) and refined by Ray Reiter (2001). Kautz et al. (1996) investigated various ways to propositionalize action schemas, finding

that the most compact forms did not necessarily lead to the fastest solution times. A systematic analysis was carried out by Emst er al. (1997), who also developed an automatic “compiler” for generating propositional representations from PDDL problems. The BLACKBOX

planner, which combines ideas from Graphplan and SATPLAN, was developed by Kautz and Selman (1998). Planners based on constraint satisfaction include CPLAN van Beek and Chen

(1999) and GP-CSP (Do and Kambhampati, 2003). There has also been interest in the representation of a plan as a binary decision diagram

(BDD), a compact data structure for Boolean expressions widely studied in the hardware

verification community (Clarke and Grumberg, 1987; McMillan, 1993). There are techniques

for proving properties of binary decision diagrams, including the property of being a solution to a planning problem. Cimatti ef al. (1998) present a planner based on this approach. Other representations have also been used, such as integer programming (Vossen et al., 2001).

There are some interesting comparisons of the various approaches to planning. Helmert (2001) analyzes several classes of planning problems, and shows that constraint-based approaches such as Graphplan and SATPLAN are best for NP-hard domains, while search-based

Binary decision diagram (BDD)

382

Chapter 11

Automated Planning

approaches do better in domains where feasible solutions can be found without backtracking.

Graphplan and SATPLAN have trouble in domains with many objects because that means

they must create many actions. In some cases the problem can be delayed or avoided by generating the propositionalized actions dynamically, only as needed, rather than instantiating them all before the search begins.

Macrops Abstraction hierarchy

The first mechanism for hierarchical planning was a facility in the STRIPS program for

learning macrops—‘macro-operators”

consisting of a sequence of primitive steps (Fikes

et al., 1972). The ABSTRIPS system (Sacerdoti, 1974) introduced the idea of an abstraction

hierarchy, whereby planning at higher levels was permitted to ignore lower-level precon-

ditions of actions in order to derive the general structure of a working plan. Austin Tate’s Ph.D. thesis (1975b) and work by Earl Sacerdoti (1977) developed the basic ideas of HTN

planning. Erol, Hendler, and Nau (1994, 1996) present a complete hierarchical decomposi-

tion planner as well as a range of complexity results for pure HTN planners. Our presentation of HLAs and angelic semantics is due to Marthi ez al. (2007, 2008).

One of the goals of hierarchical planning has been the reuse of previous planning experience in the form of generalized plans. The technique of explanation-based learning has been used as a means of generalizing previously computed plans in systems such as SOAR (Laird et al., 1986) and PRODIGY

(Carbonell ef al., 1989). An alternative approach is

to store previously computed plans in their original form and then reuse them to solve new,

Case-based planning

similar problems by analogy to the original problem. This is the approach taken by the field called case-based planning (Carbonell, 1983; Alterman, 1988). Kambhampati (1994) argues that case-based planning should be analyzed as a form of refinement planning and provides a formal foundation for case-based partial-order planning. Early planners lacked conditionals and loops, but some could use coercion to form conformant plans. Sacerdoti’s NOAH solved the “keys and boxes” problem (in which the planner knows little about the initial state) using coercion.

Mason (1993) argued that sensing often

can and should be dispensed with in robotic planning, and described a sensorless plan that can move a tool into a specific position on a table by a sequence of tilting actions, regardless

of the initial position.

Goldman and Boddy (1996) introduced the term conformant planning, noting that sen-

sorless plans are often effective even if the agent has sensors. The first moderately efficient conformant planner was Smith and Weld's (1998) Conformant Graphplan (CGP). Ferraris and Giunchiglia (2000) and Rintanen (1999) independently developed SATPLAN-based conformant planners. Bonet and Geffner (2000) describe a conformant planner based on heuristic

search in the space of belief states, drawing on ideas first developed in the 1960s for partially

observable Markov decision processes, or POMDPs (see Chapter 17).

Currently, there are three main approaches to conformant planning. The first two use

heuristic search in belief-state space:

HSCP (Bertoli ef al., 2001a) uses binary decision di-

agrams (BDDs) to represent belief states, whereas Hoffmann and Brafman (2006) adopt the lazy approach of computing precondition and goal tests on demand using a SAT solver. The third approach, championed primarily by Jussi Rintanen (2007), formulates the entire sensorless planning problem as a quantified Boolean formula (QBF) and solves it using a

general-purpose QBF solver. Current conformant planners are five orders of magnitude faster than CGP. The winner of the 2006 conformant-planning track at the International Planning

Competition was Tj (Palacios and Geffner, 2007), which uses heuristic search in belief-state

Bibliographical and Historical Notes

383

space while keeping the belief-state representation simple by defining derived literals that cover conditional effects. Bryce and Kambhampati (2007) discuss how a planning graph can be generalized to generate good heuristics for conformant and contingent planning. The contingent-planning approach described in the chapter is based on Hoffmann and Brafman (2005), and was influenced by the efficient search algorithms for cyclic AND—OR graphs developed by Jimenez and Torras (2000) and Hansen and Zilberstein (2001).

The

problem of contingent planning received more attention after the publication of Drew Mc-

Dermott’s (1978a) influential article, Planning and Acting. Bertoli et al. (2001b) describe MBP

(Model-Based

Planner), which uses binary decision diagrams to do conformant and

contingent planning. Some authors use “conditional planning” and “contingent planning” as synonyms; others make the distinction that “conditional” refers to actions with nondetermin-

istic effects, and “contingent” means using sensing to overcome partial observability. In retrospect, it is now possible to see how the major classical planning algorithms led to

extended versions for uncertain domains. Fast-forward heuristic search through state space led to forward search in belief space (Bonet and Geffner, 2000; Hoffmann and Brafman, 2005); SATPLAN led to stochastic SATPLAN (Majercik and Littman, 2003) and to planning

with quantified Boolean logic (Rintanen, 2007); partial order planning led to UWL (Etzioni et al., 1992) and CNLP (Peot and Smith, 1992); Graphplan led to Sensory Graphplan or SGP

(Weld et al., 1998).

The first online planner with execution

monitoring

was

PLANEX

(Fikes ef al.,

1972),

which worked with the STRIPS planner to control the robot Shakey. SIPE (System for Interactive Planning and Execution monitoring) (Wilkins, 1988) was the first planner to deal

systematically with the problem of replanning. It has been used in demonstration projects in

several domains, including planning operations on the flight deck of an aircraft carrier, jobshop scheduling for an Australian beer factory, and planning the construction of multistory buildings (Kartam and Levitt, 1990).

In the mid-1980s, pessimism about the slow run times of planning systems led to the pro-

posal of reflex agents called reactive planning systems (Brooks, 1986; Agre and Chapman,

1987). “Universal plans” (Schoppers, 1989) were developed as a lookup-table method for

reactive planning, but turned out to be a rediscovery of the idea of policies that had long been

used in Markov decision processes (see Chapter 17). Koenig (2001) surveys online planning techniques, under the name Agent-Centered Search.

Planning with time constraints was first dealt with by DEVISER (Vere, 1983). The representation of time in plans was addressed by Allen (1984) and by Dean et al. (1990) in the FORBIN

system.

NONLIN+

(Tate and Whiter,

1984) and SiPE (Wilkins,

1990) could rea-

son about the allocation of limited resources to various plan steps. O-PLAN (Bell and Tate,

1985) has been applied to resource problems such as software procurement planning at Price Waterhouse and back-axle assembly planning at Jaguar Cars. The two planners SAPA

(Do and Kambhampati,

2001) and T4 (Haslum and Geffner,

2001) both used forward state-space search with sophisticated heuristics to handle actions

with durations and resources. An alternative is to use very expressive action languages, but guide them by human-written, domain-specific heuristics, as is done by ASPEN (Fukunaga et al., 1997), HSTS (Jonsson et al., 2000), and IxTeT (Ghallab and Laruelle, 1994). A number of hybrid planning-and-scheduling systems have been deployed: Isis (Fox et al., 1982; Fox, 1990) has been used for job-shop scheduling at Westinghouse, GARI (De-

Reactive planning

384

Chapter 11

Automated Planning

scotte and Latombe,

1985) planned

the machining and construction

of mechanical

parts,

FORBIN was used for factory control, and NONLIN+ was used for naval logistics planning.

We chose to present planning and scheduling as two separate problems; Cushing et al. (2007) show that this can lead to incompleteness on certain problems. There is a long history of scheduling in aerospace. T-SCHED (Drabble, 1990) was used

to schedule mission-command sequences for the UOSAT-II satellite. OPTIMUM-AIV (Aarup et al., 1994) and PLAN-ERS1 (Fuchs ez al., 1990), both based on O-PLAN, were used for

spacecraft assembly and observation planning, respectively, at the European Space Agency.

SPIKE (Johnston and Adorf, 1992) was used for observation planning at NASA for the Hub-

ble Space Telescope, while the Space Shuttle Ground Processing Scheduling System (Deale et al., 1994) does job-shop scheduling of up to 16,000 worker-shifts. Remote Agent (Muscettola et al., 1998) became the first autonomous planner—scheduler to control a spacecraft, when

it flew onboard the Deep Space One probe in 1999. Space applications have driven the development of algorithms for resource allocation; see Laborie (2003) and Muscettola (2002). The literature on scheduling is presented in a classic survey article (Lawler ef al., 1993), a book (Pinedo, 2008), and an edited handbook (Blazewicz et al., 2007).

The computational complexity of of planning has been analyzed by several authors (By-

lander,

1994; Ghallab et al., 2004; Rintanen, 2016).

There are two main tasks:

PlanSAT

is the question of whether there exists any plan that solves a planning problem.

Bounded

PlanSAT asks whether there is a solution of length k or less; this can be used to find an optimal plan. Both are decidable for classical planning (because the number of states is finite). But if we add function symbols to the language, then the number of states becomes infinite,

and PlanSAT becomes only semidecidable. For propositionalized problems both are in the

complexity class PSPACE, a class that is larger (and hence more difficult) than NP and refers

to problems that can be solved by a deterministic Turing machine with a polynomial amount of space. These theoretical results are discouraging, but in practice, the problems we want to solve tend to be not so bad.

The true advantage of the classical planning formalism is

that it has facilitated the development of very accurate domain-independent heuristics; other

approaches have not been as fruitful. Readings in Planning (Allen et al., 1990) is a comprehensive anthology of early work in the field. Weld (1994, 1999) provides two excellent surveys of planning algorithms of the

1990s. It is interesting to see the change in the five years between the two surveys: the first

concentrates on partial-order planning, and the second introduces Graphplan and SATPLAN.

Automated Planning and Acting (Ghallab ez al., 2016) is an excellent textbook on all aspects of the field. LaValle’s text Planning Algorithms (2006) covers both classical and stochastic

planning, with extensive coverage of robot motion planning.

Planning research has been central to Al since its inception, and papers on planning are a staple of mainstream Al journals and conferences. There are also specialized conferences such as the International Conference on Automated Planning and Scheduling and the Inter-

national Workshop on Planning and Scheduling for Space.

TS

12

QUANTIFYING UNCERTAINTY In which we see how to tame uncertainty with numeric degrees of belief.

12.1

Acting under Uncertainty

Agents in the real world need to handle uncertainty, whether due to partial observability,

nondeterminism, or adversaries. An agent may never know for sure what state it is in now or

Uncertainty

where it will end up after a sequence of actions.

We have seen problem-solving and logical agents handle uncertainty by keeping track of

a belief state—a representation of the set of all possible world states that it might be in—and

generating a contingency plan that handles every possible eventuality that its sensors may report during execution. This approach works on simple problems, but it has drawbacks: « The agent must consider every possible explanation for its sensor observations, no matter how unlikely. This leads to a large belief-state full of unlikely possibilities. « A correct contingent plan that handles every eventuality can grow arbitrarily large and must consider arbitrarily unlikely contingencies.

« Sometimes there is no plan that is guaranteed to achieve the goal—yet the agent must act. It must have some way to compare the merits of plans that are not guaranteed.

Suppose, for example, that an automated taxi has the goal of delivering a passenger to the

airport on time. The taxi forms a plan, A, that involves leaving home 90 minutes before the

flight departs and driving at a reasonable speed. Even though the airport is only 5 miles away,

alogical agent will not be able to conclude with absolute certainty that “Plan Agy will get us to the airport in time.” Instead, it reaches the weaker conclusion “Plan Agg will get us to the

airport in time, as long as the car doesn’t break down, and I don’t get into an accident, and

the road isn’t closed, and no meteorite hits the car, and ... .” None of these conditions can be

deduced for sure, so we can’t infer that the plan succeeds. This is the logical qualification

problem (page 241), for which we so far have seen no real solution.

Nonetheless, in some sense Ag is in fact the right thing to do. What do we mean by this?

As we discussed in Chapter 2, we mean that out of all the plans that could be executed, Agy

is expected to maximize the agent’s performance measure (where the expectation is relative

to the agent’s knowledge about the environment). The performance measure includes getting to the airport in time for the flight, avoiding a long, unproductive wait at the airport, and

avoiding speeding tickets along the way. The agent’s knowledge cannot guarantee any of these outcomes for Agg, but it can provide some degree of belief that they will be achieved. Other plans, such as Ago, might increase the agent’s belief that it will get to the airport on time, but also increase the likelihood of a long, boring wait. The right thing to do—the

rational decision—therefore depends on both the relative importance of various goals and



=28,224 entries—still a manageable number.

If we add the possibility of dirt in each of the 42 squares, the number of states is multiplied

by 2*2 and the transition matrix has more than 102 entries—no longer a manageable number.

In general, if the state is composed of n discrete variables with at most d values each, the

corresponding HMM transition matrix will have size O(d2") and the per-update computation time will also be O(d?").

For these reasons, although HMMs have many uses in areas ranging from speech recogni-

tion to molecular biology, they are fundamentally limited in their ability to represent complex

processes.

In the terminology introduced in Chapter 2, HMMs are an atomic representation:

states of the world have no internal structure and are simply labeled by integers. Section 14.5

shows how to use dynamic Bayesian networks—a factored representation—to model domains

with many state variables. The next section shows how to handle domains with continuous state variables, which of course lead to an infinite state space.

Section 14.4

14.4

Kalman

Kalman Filters

Filters

Imagine watching a small bird flying through dense jungle foliage at dusk: you glimpse

brief, intermittent flashes of motion; you try hard to guess where the bird is and where it will

appear next so that you don’t lose it. Or imagine that you are a World War II radar operator

peering at a faint, wandering blip that appears once every 10 seconds on the screen. Or, going

back further still, imagine you are Kepler trying to reconstruct the motions of the planets

from a collection of highly inaccurate angular observations taken at irregular and imprecisely measured intervals.

In all these cases, you are doing filtering: estimating state variables (here, the position and velocity of a moving object) from noisy observations over time. If the variables were discrete, we could model the system with a hidden Markov model.

This section examines

methods for handling continuous variables, using an algorithm called Kalman filtering, after Kalman filtering

one of its inventors, Rudolf Kalman.

The bird’s flight might be specified by six continuous variables at each time point; three for position (X;,Y;,Z) and three for velocity (X;.,¥;,Z). We will need suitable conditional

densities to represent the transition and sensor models; as in Chapter 13, we will use linear— Gaussian distributions. This means that the next state X,

must be a linear function of the

current state X, plus some Gaussian noise, a condition that turns out to be quite reasonable in

practice. Consider, for example, the X-coordinate of the bird, ignoring the other coordinates

for now. Let the time interval between observations be A, and assume constant velocity during

the interval; then the position update is given by X;;a = X, + X A. Adding Gaussian noise (to account for wind variation, etc.), we obtain a linear-Gaussian transition model:

P(Xrra=3eal X =2, X =5) = N (0a3% + 5% A,0%).

The Bayesian network structure for a system with position vector X, and velocity X, is shown

in Figure 14.9. Note that this is a very specific form of linear-Gaussian model: the general

form will be described later in this

section and covers a vast array of applications beyond the

simple motion examples of the first paragraph. The reader might wish to consult Appendix A for some of the mathematical properties of Gaussian distributions; for our immediate purposes, the most important is that a multivariate Gaussian

distribution for ¢ variables is

specified by a d-element mean 12 and a d x d covariance matrix . 14.4.1

Updating

Gaussian distributions

In Chapter 13 on page 423, we alluded to a key property of the lincar-Gaussian family of distributions: it remains closed under Bayesian updating.

(That is, given any evidence, the

posterior is still in the linear-Gaussian family.) Here we make this claim precise in the context

of filtering in a temporal probability model. The required properties correspond to the twostep filtering calculation in Equation (14.5):

1. If the current distribution P(X, |ey) is Gaussian and the transition model P(X,+|x,) is linear-Gaussian, then the one-step predicted distribution given by

P(Xpiifer) = |Jx, P(Xoior[%)P(x |er)dx is also a Gaussian distribution.

(14.17)

479

Chapter 14 Probabilistic Reasoning over Time

P

480

Figure 14.9 Bayesian network structure for a linear dynamical system with position X,. velocity X,. and position measurement Z 2. If the prediction P(X,; |e|,) is Gaussian and the sensor model P(e,1| X, 1) is linear— Gaussian, then, after conditioning on the new evidence, the updated distribution

P(Xpiieri1) = aP(erst [ X

is also a Gaussian distribution.

) P(Xrii | err)

(14.18)

Thus, the FORWARD operator for Kalman filtering takes a Gaussian forward message f.;,

specified by a mean z, and covariance ¥, and produces a new multivariate Gaussian forward

message f1... 1, specified by a mean i,

and covariance ¥, . So if we start with a Gaussian

prior f1.0=P(Xo) = (119, o), filtering with a linear-Gaussian model produces a Gaussian state distribution for all time.

This seems to be a nice, elegant result, but why is it so important? The reason is that except for a few special cases such as this, filtering with continuous or hybrid (discrete and continuous) networks generates state distributions whose representation grows without bound

over time. This statement is not easy to prove in general, but Exercise 14.KFSW shows what

happens for a simple example. 14.4.2

A simple one-dimensional example

‘We have said that the FORWARD operator for the Kalman filter maps a Gaussian into a new

Gaussian. This translates into computing a new mean and covariance from the previous mean

and covariance. Deriving the update rule in the general (multivariate) case requires rather a

lot of linear algebra, so we will stick to a very simple univariate case for now, and later give

the results for the general case. Even for the univariate case, the calculations are somewhat

tedious, but we feel that they are worth seeing because the usefulness of the Kalman filter is

tied so intimately to the mathematical properties of Gaussian distributions.

The temporal model we consider describes a random walk of a single continuous state

variable X, with a noisy observation Z. An example might be the “consumer confidence” in-

dex, which can be modeled as undergoing a random Gaussian-distributed change each month

and is measured by a random consumer survey that also introduces Gaussian sampling noise.

The prior distribution is assumed to be Gaussian with variance a‘%:

P(xo) =ae °

Section 14.4

Kalman Filters

(For simplicity, we use the same symbol o for all normalizing constants in this section.) The transition model adds a Gaussian perturbation of constant variance o2 to the current state: e

Pl

= e ().

The sensor model assumes Gaussian noise with variance o2:2

Pla

= ae

(),

Now, given the prior P(Xo), the one-step predicted distribution comes from Equation (14.17):

p) = [ Plalsopain=a [~ Al ‘This integral looks rather complicated. The key to progress is to notice that the exponent is the sum of two expressions that are quadratic in xo and hence

is itself a quadratic in xo. A simple

trick known as completing the square allows the rewriting of any quadratic ax} +bxo+cas Sempletie the the sum of a squared term a(xo — 52)? and a residual term ¢ — % that is independent of xo. In this case, we have a= (03 +02)/(0302), b=—2(03x) + 2 p0)/(03072). and c= (o} + o2u3)/(0302). The residual term can be taken outside the integral, giving us

P() = ae HEE) /"“ e Hatw-2) gy

Now the integral is just the integral of a Gaussian over its full range, which is simply 1. Thus,

we are left with only the residual term from the quadratic. Plugging back in the expressions for a, b, and ¢ and simplifying, we obtain -4

)

That is, the one-step predicted distribution is a Gaussian with the same mean 19 and a variance

equal to the sum of the original variance o7 and the transition variance 2.

To complete the update step, we need to condition on the observation at the first time

step, namely, z;. From Equation (14.18), this is given by

P(xi|z1) = aP(a|x)P(x) Once again, we combine the exponents and complete the square (Exercise 14.KALM), obtain-

ing the following expression for the posterior:

(14.19)

P(x|z1) =ae *

Thus, after one update cycle, we have a new Gaussian distribution for the state variable. From the Gaussian formula in Equation (14.19), we see that the new mean and standard deviation can be calculated from the old mean and standard deviation as follows: M1

=

+ o)z (2

+otm

Ftoit ol

and

(07 +a})e? 2 _ _

Fraitol

1420 14.20,

481

Chapter 14 Probabilistic Reasoning over Time 0.45 0.4 0.35 0.3

P(x)

482

0.25 0.2

0.15 0.1 0.05

xposition Figure 14.10 Stages in the Kalman filter update cycle for a random walk with a prior given by f1o=0.0 and o= 1.5, transition noise given by o, =2.0, sensor noise given by o= 1.0, and a first observation

.5 (marked on the x-axis).

flattened out, relative to P(xg), by the transition noise.

Notice how the prediction P(xi) is

Notice also that the mean of the

posterior distribution P(x; |2y is slightly to the left of the observation z; because the mean is a weighted average of the prediction and the observation.

Figure 14.10 shows one update cycle of the Kalman filter in the one-dimensional case for particular values of the transition and sensor models. Equation (14.20) plays exactly the same role as the general filtering equation (14.5) or the HMM filtering equation (14.12). Because of the special nature of Gaussian distributions, however, the equations have some interesting additional properties. First, we can interpret the calculation for the new mean y, 1| as a weighted mean of the

new observation z; and the old mean ;. If the observation is unreliable, then o2 is large

and we pay more attention to the old mean; if the old mean is unreliable (o7 is large) or the

process is highly unpredictable (o7 is large), then we pay more attention to the observation.

Second, notice that the update for the variance o7, | is independent of the observation. We

can therefore compute in advance what the sequence of variance values will be. Third, the sequence of variance values converges quickly to a fixed value that depends only on o2 and

2, thereby substantially simplifying the subsequent calculations. (See Exercise 14.VARL) 14.4.3

The general case

The preceding derivation illustrates the key property of Gaussian distributions that allows

Kalman filtering to work: the fact that the exponent is a quadratic form. This is true not just

for the univariate case; the full multivariate Gaussian distribution has the form L

N(x:p,E) =ae’5((x’m

)

B

(x’m)

Multiplying out the terms in the exponent, we see that the exponent is also a quadratic func-

tion of the values x; in . Thus, filtering preserves the Gaussian nature of the state distribution.

Section 14.4

Kalman Filters

483

Let us first define the general temporal model used with Kalman filtering. Both the tran-

sition model and the sensor model are required to be a linear transformation with additive Gaussian noise. Thus, we have

P(%1|x) = N(x1:F%, %) P(z|x) = N(z:Hx,.Z.),

(14.21)

where F and Z, are matrices describing the linear transition model and transition noise co-

variance, and H and X are the corresponding matrices for the sensor model. Now the update equations for the mean and covariance, in their full, hairy horribleness, are

Hrir = Fpy+ Ko (241 — HFpy,) S where K,

(14.22)

= (=K H)(FEFT 1),

=

(FL,F +X)H

(H(FZ,F' +X,)H" +X;)"! is the Kalman gain matrix. Be-

lieve it or not, these equations make some intuitive sense. For example, consider the up-

date for the mean state estimate j.. The term Fy, is the predicted state at 1 +

Kalman gain matrix

1, so HFpy, is

the predicted observation. Therefore, the term 7| — HFyy, represents the error in the pre-

dicted observation. This is multiplied by K, to correct the predicted state; hence, K, is a measure of how seriously to take the new observation relative to the prediction. As in

Equation (14.20), we also have the property that the variance update is independent of the observations.

The sequence of values for £, and K, can therefore be computed offline, and

the actual calculations required during online tracking are quite modest.

To illustrate these equations at work, we have applied them to the problem of tracking an object moving on the X—Y plane. The state variables are X = (X,Y,X,¥)7, so F, £,, H, and X are 4 x 4 matrices. Figure 14.11(a) shows the true trajectory, a series of noisy observations,

and the trajectory estimated by Kalman filtering, along with the covariances indicated by the one-standard-deviation contours. The filtering process does a good job of tracking the actual

motion, and, as expected, the variance quickly reaches a fixed point. We can also derive equations for smoothing as well as filtering with linear-Gaussian

models. The smoothing results are shown in Figure 14.11(b). Notice how the variance in the

position estimate is sharply reduced, except at the ends of the trajectory (why?), and that the

estimated trajectory is much smoother. 14.4.4

Applicability of Kalman

filtering

The Kalman filter and its elaborations are used in a vast array of applications. The “classical”

application is in radar tracking of aircraft and missiles. Related applications include acoustic tracking of submarines and ground vehicles and visual tracking of vehicles and people. In a slightly more esoteric vein, Kalman filters are used to reconstruct particle trajectories from

bubble-chamber photographs and ocean currents from satellite surface measurements. The range of application is much larger than just the tracking of motion: any system characterized

by continuous state variables and noisy measurements will do. Such systems include pulp mills, chemical plants, nuclear reactors, plant ecosystems, and national economies.

The fact that Kalman filtering can be applied to a system does not mean that the re-

sults will be valid or useful. The assumptions made—linear-Gaussian transition and sensor

Kalman models—are very strong. The extended Kalman filter (EKF) attempts to overcome nonlin- Extended filter (EKF)

earities in the system being modeled.

A system is nonlinear if the transition model cannot

be described as a matrix multiplication of the state vector, as in Equation (14.21). The EKF

Nonlinear

484

Chapter 14 Probabilistic Reasoning over Time 2D filiering iT

2D smoothing, e hered

(a)

Served Smoathed

(b)

Figure 14.11 (a) Results of Kalman filtering for an object moving on the XY plane, showing the true trajectory (left to right), a series of noisy observations, and the trajectory estimated by Kalman filtering. Variance in the position estimate is indicated by the ovals. (b) The results of Kalman smoothing for the same observation sequence. works by modeling the system as locally linear in x, in the region of X, = j1,, the mean of the

current state distribution. This works well for smooth, well-behaved systems and allows the

tracker to maintain and update a Gaussian state distribution that is a reasonable approximation to the true posterior. A detailed example is given in Chapter 26.

What does it mean for a system to be “unsmooth” or “poorly behaved”? Technically,

it means that there is significant nonlinearity in system response within the region that is “close” (according to the covariance X;) to the current mean f,.

To understand this idea

in nontechnical terms, consider the example of trying to track a bird as it flies through the

jungle. The bird appears to be heading at high speed straight for a tree trunk. The Kalman

filter, whether regular or extended, can make only a Gaussian prediction of the location of the

bird, and the mean of this Gaussian will be centered on the trunk, as shown in Figure 14.12(a). A reasonable model of the bird, on the other hand, would predict evasive action to one side or the other, as shown in Figure 14.12(b). Such a model is highly nonlinear, because the bird’s

decision varies sharply depending on its precise location relative to the trunk.

To handle examples like these, we clearly need a more expressive language for repre-

senting the behavior of the system being modeled. Within the control theory community, for

Switching Kalman filter

which problems such as evasive maneuvering by aircraft raise the same kinds of difficulties,

the standard solution is the switching Kalman filter. In this approach, multiple Kalman filters run in parallel, each using a different model of the system—for example, one for straight

flight, one for sharp left turns, and one for sharp right turns. A weighted sum of predictions

is used, where the weight depends on how well each filter fits the current data. We will see

in the next section that this is simply a special case of the general dynamic Bayesian net-

work model, obtained by adding a discrete “maneuver” state variable to the network shown in Figure 14.9. Switching Kalman filters are discussed further in Exercise 14.KFSW.

Section 14.5

Dynamic Bayesian Networks

485

Figure 14.12 A bird flying toward a tree (top views). (a) A Kalman filter will predict the location of the bird using a single Gaussian centered on the obstacle. (b) A more realistic model allows for the bird’s evasive action, predicting that it will fly to one side or the other. 14.5

Dyna

Bayesian

Networks

Bayesian Dynamic Bayesian networks, or DBNs, extend the semantics of standard Bayesian networks Dynamic network to handle temporal probability models of the kind described in Section 14.1. We have already seen examples of DBN:

the umbrella network in Figure 14.2 and the Kalman filter network

in Figure 14.9. In general, each slice of a DBN can have any number of state variables X,

and evidence variables E,. For simplicity, we assume that the variables, their links, and their

conditional distributions are exactly replicated from slice to slice and that the DBN represents

a first-order Markov process, so that each variable can have parents only in its own slice or the immediately preceding slice. In this way, the DBN corresponds to a Bayesian network with infinitely many variables.

It should be clear that every hidden Markov model can be represented as a DBN

with

a single state variable and a single evidence variable. It is also the case that every discrete-

variable DBN can be represented as an HMM; as explained in Section 14.3, we can combine all the state variables in the DBN into a single state variable whose values are all possible tuples of values of the individual state variables. Now, if every HMM is a DBN and every DBN can be translated into an HMM, what's the difference? The difference is that, by decomposing the state of a complex system into its constituent variables, we can take advantage

of sparseness in the temporal probability model.

To see what this means in practice, remember that in Section 14.3 we said that an HMM

representation for a temporal process with n discrete variables, each with up to d values,

needs a transition matrix of size 0({]2”). ‘The DBN representation, on the other hand, has size

0O(nd") if the number of parents of each variable is bounded by . In other words, the DBN

representation is linear rather than exponential in the number of variables.

For the vacuum

robot with 42 possibly dirty locations, the number of probabilities required is reduced from 5% 10% to a few thousand.

We have already explained that every Kalman filter model can be represented in a DBN

with continuous

variables and linear-Gaussian conditional distributions (Figure

14.9).

It

should be clear from the discussion at the end of the preceding section that not every DBN

can be represented by a Kalman filter model. In a Kalman filter, the current state distribution

486

Chapter 14 Probabilistic Reasoning over Time

o

o)

07!

Ry [PRilRo) S|

07

03

B

(P(UIIRY)| 0.9 0.2

Figure 14.13 Left: Specification of the prior, transition model. and sensor model for the umbrella DBN. Subsequent slices are copies of slice 1. Right: A simple DBN for robot motion in the X-Y plane. is always a single multivariate Gaussian distribution—that s, a single “bump” in a particular location. DBNSs, on the other hand, can model arbitrary distributions.

For many real-world applications, this flexibility is essential. Consider, for example, the current location of my keys. They might be in my pocket, on the bedside table, on the kitchen counter, dangling from the front door, or locked in the car. A single Gaussian bump that included all these places would have to allocate significant probability to the keys being in mid-air above the front garden. Aspects of the real world such as purposive agents, obstacles, and pockets introduce “nonlinearities” that require combinations of discrete and continuous variables in order to get reasonable models.

14.5.1

Constructing

DBNs

To construct a DBN, one must specify three kinds of information: the prior distribution over

the state variables, P(Xo); the transition model P(X,+ 1| X;); and the sensor model P(E, |X,).

To specify the transition and sensor models, one must also specify the topology of the connections between successive slices and between the state and evidence variables. Because the transition and sensor models are assumed to be time-homogeneous—the same for all r—

it is most convenient simply to specify them for the first slice. For example, the complete DBN specification for the umbrella world is given by the three-node network shown in Figure 14.13(a). From this specification, the complete DBN with an unbounded number of time

slices can be constructed as needed by copying the first slice. Let us now consider a more interesting example: monitoring a battery-powered robot moving in the X-Y plane, as introduced at the end of Section 14.1.

First, we need state

variables, which will include both X, = (X,,¥;) for position and X, = (X;.;) for velocity. We assume some method of measuring position—perhaps a fixed camera or onboard GPS (Global Positioning System)—yielding measurements Z,. The position at the next time step depends

on the current position and velocity, as in the standard Kalman filter model. The velocity at the next step depends on the current velocity and the state of the battery. We add Battery; to

Section 14.5

Dynamic Bayesian Networks

487

represent the actual battery charge level, which has as parents the previous battery level and the velocity, and we add BMeter,, which measures the battery charge level. This gives us the basic model shown in Figure 14.13(b). It is worth looking in more depth at the nature of the sensor model for BMeter;.

Let us

suppose, for simplicity, that both Battery, and BMeter, can take on discrete values 0 through 5. (Exercise

model.)

14.BATT asks you to relate this discrete model to a corresponding continuous

If the meter is always accurate, then the CPT P(BMeter, | Battery,) should have

probabilities of 1.0 “along the diagonal” and probabilities of 0.0 elsewhere.

always creeps into measurements.

In reality, noise

For continuous measurements, a Gaussian distribution

with a small variance might be used.”

For our discrete variables, we can approximate a

Gaussian using a distribution in which the probability of error drops off in the appropriate

way, so that the probability of a large error is very small. We use the term Gaussian error model to cover both the continuous and discrete versions.

Anyone with hands-on experience of robotics, computerized process control, or other

Gaussian error model

forms of automatic sensing will readily testify to the fact that small amounts of measurement noise are often the least of one’s

problems.

Real sensors fail.

When a sensor fails, it does

not necessarily send a signal saying, “Oh, by the way, the data I'm about to send you is a load of nonsense.”

Instead, it simply sends the nonsense.

The simplest kind of failure is

called a transient failure, where the sensor occasionally decides to send some nonsense. For

example, the battery level sensor might have a habit of sending a reading of 0 when someone bumps the robot, even if the battery is fully charged.

Let’s see what happens when a transient failure occurs with a Gaussian error model that

doesn’t accommodate such failures.

Suppose, for example, that the robot is sitting quietly

and observes 20 consecutive battery readings of 5. Then the battery meter has a temporary seizure and the next reading is BMetery;

=0. What will the simple Gaussian error model lead

us to believe about Barteryy? According to Bayes’ rule, the answer depends on both the sensor model P(BMeter =0| Batterys) and the prediction P(Batterys) | BMetery). If the

probability of a large sensor error is significantly less than the probability of a transition to

Battery,; =0, even if the latter is very unlikely, then the posterior distribution will assign a high probability to the battery’s being empty.

A second reading of 0 at 7 =22 will make this conclusion almost certain. If the transient

failure then disappears and the reading returns to 5 from 7 =23 onwards, the estimate for the

battery level will quickly return to 5. (This does not mean the algorithm thinks the battery magically recharged itself, which may be physically impossible; instead, the algorithm now

believes that the battery was never low and the extremely unlikely hypothesis that the battery

meter had two consecutive huge errors must be the right explanation.) This course of events

is illustrated in the upper curve of Figure 14.14(a), which shows the expected value (see

Appendix A) of Battery, over time, using a discrete Gaussian error model.

Despite the recovery, there is a time (r=22) when the robot is convinced that its battery is empty; presumably, then, it should send out a mayday signal and shut down. Alas, its oversimplified sensor model has led it astray. The moral of the story is simple: for the system 10 handle sensor failure properly, the sensor model must include the possibility of failure. 7 Strictly speaking, a Gaussia ution is problematic because it assigns nonzero probability to large negative charge levels. The beta distribution s a better choice for a variable whose range is restricted.

Transient failure

488

Chapter 14 Probabilistic Reasoning over Time E(Battery, ..5555005555...)

5

5

4

4

S 2 4 1

S 2 = 1

T

\

3

£

o

El

E(Battery, ..5555005555...

KM e K E(Battery, |...5555000000...) 15

20

25

30

Time step ¢

0

Bl

¥\

\

\\

k\'—n—n—n—n—n—n—x E(Bartery, |...5555000000...) 15

20

Time step

()

25

30

(®)

Figure 14.14 (a) Upper curve: trajectory of the expected value of Battery, for an observation sequence consisting of all Ss except for Os at =21 and =22, using a simple Gaussian error model. Lower curve: trajectory when the observation remains at 0 from =21 onwards. (b) The same experiment run with the transient failure model. The transient failure is handled well, but the persistent failure results in excessive pessimism about the battery charge. The simplest kind of failure model for a sensor allows a certain probability that the sensor will return some completely incorrect value, regardless of the true state of the world. For

example, if the battery meter fails by returning 0, we might say that

P(BMeter,=0| Battery, =5)=0.03, Transient failure

which is presumably much larger than the probability assigned by the simple Gaussian error

model. Let's call this the transient failure model. How does it help when we are faced

with a reading of 0? Provided that the predicted probability of an empty battery, according

to the readings so far, is much less than 0.03, then the best explanation of the observation

BMeter;; =0 is that the sensor has temporarily failed. Intuitively, we can think of the belief

about the battery level as having a certain amount of “inertia” that helps to overcome tempo-

rary blips in the meter reading. The upper curve in Figure 14.14(b) shows that the transient failure model can handle transient failures without a catastrophic change in beliefs.

So much for temporary blips. What about a persistent sensor failure? Sadly, failures of

this kind are all too common.

If the sensor returns 20 readings of 5 followed by 20 readings

of 0, then the transient sensor failure model described in the preceding paragraph will result in the robot gradually coming to believe that its battery is empty when in fact it may be that the meter has failed.

The lower curve in Figure

14.14(b) shows the belief “trajectory” for

this case. By r=25—five readings of O—the robot is convinced that its battery is empty.

Obviously, we would prefer the robot to believe that its battery meter is broken—if indeed

Pt

fallure

this is the more likely event.

Unsurprisingly, to handle persistent failure, we need a persistent failure model that

describes how the sensor behaves

under normal conditions

and after failure.

To do this,

we need to augment the state of the system with an additional variable, say, BMBroken, that

describes the status of the battery meter. The persistence of failure must be modeled by an

Section 14.5

Dynamic Bayesian Networks E(Battery,|...5555005555...)

B, | P(B)

7 [ 1.000

/|

489

ey

0.001

E(Battery,|..5555000000...)

BMBrokeny

P(BMBroken, |..5555000000...) Roe e eea

BMeter,

P(BMBroken, ...5555005555..) 15

(a)

20

“Time step

25

30

(b)

Figure 14.15 (a) A DBN fragment showing the sensor status variable required for modeling persistent failure of the battery sensor. (b) Upper curves: trajectories of the expected value of Battery, for the “transient failure” and “permanent failure” observations sequences. Lower curves: probability trajectories for BMBroken given the two observation sequences. arc linking BMBrokeny to BMBroken,.

This persistence arc has a CPT that gives a small

probability of failure in any given time step, say, 0.001, but specifies that the sensor stays

broken once it breaks. When the sensor is OK, the sensor model for BMeter is identical to the transient failure model; when the sensor is broken, it says BMeter is always 0, regardless

of the actual battery charge. The persistent failure model for the battery sensor is shown in Figure 14.15(a). Its performance on the two data sequences (temporary blip and persistent failure) is shown in Figure 14.15(b).

There are several things to notice about these curves.

First, in the case of the

temporary blip, the probability that the sensor is broken rises significantly after the second

0 reading, but immediately drops back to zero once a 5 is observed. Second, in the case of persistent failure, the probability that the sensor is broken rises quickly to almost 1 and stays there.

Finally, once the sensor is known to be broken, the robot can only assume that its

battery discharges at the “normal” rate. This is shown by the gradually descending level of E(Battery,|...).

So far, we have merely scraiched the surface of the problem of representing complex

processes.

The variety of transition models is huge, encompassing

topics as disparate as

‘modeling the human endocrine system and modeling multiple vehicles driving on a freeway. Sensor modeling is also a vast subfield in itself. But dynamic Bayesian networks can model

even subtle phenomena, such as sensor drift, sudden decalibration, and the effects of exoge-

nous conditions (such as weather) on sensor readings. 14.5.2

Exact inference in DBNs

Having sketched some ideas for representing complex processes

as DBNs, we now turn to the

question of inference. In a sense, this question has already been answered: dynamic Bayesian networks are Bayesian networks, and we already have algorithms for inference in Bayesian networks. Given a sequence of observations, one can construct the full Bayesian network rep-

Persistence arc

490

Chapter 14 Probabilistic Reasoning over Time

BR[O Ry|P(R)|R,

07

03

I

07

P(R)|R,

R,

[P(R IR,

Ry

|P(RyR;

f1

03

f1

03

f1

03

@@@@

T

(R, | P(UIR,)|

AN

f1l

02

T e

Ry |

@@

Tl T Tl (R |P(UIR)|

o[

fl

09

02

Ry [ PCUAR,)| fl

[

o9 | 02

(R; [ PCUSR,)

[t]

£l

o9

02

Figure 14.16 Unrolling a dynamic Bayesian network: slices are replicated to accommodate the observation sequence Umbrellay;s. Further slices have no effect on inferences within the observation period. resentation of a DBN by replicating slices until the network is large enough to accommodate the observations, as in Figure 14.16. This technique is called unrolling. (Technically, the DBN

is equivalent to the semi-infinite network obtained by unrolling forever.

Slices added

beyond the last observation have no effect on inferences within the observation period and can be omitted.) Once the DBN is unrolled, one can use any of the inference algorithms— variable elimination, clustering methods, and so on—described in Chapter 13.

Unfortunately, a naive application of unrolling would not be particularly efficient. If

we want to perform filtering or smoothing with a long sequence of observations ey, the

unrolled network would require O(r) space and would thus grow without bound as more observations were added. Moreover, if we simply run the inference algorithm anew each time an observation is added, the inference time per update will also increase as O(f). Looking back to Section 14.2.1, we see that constant time and space per filtering update

can be achieved if the computation can be done recursively. Essentially, the filtering update in Equation (14.5) works by summing out the state variables of the previous time step to get the distribution for the new time step. Summing out variables is exactly what the variable

elimination (Figure 13.13) algorithm does, and it turns out that running variable elimination

with the variables in temporal order exactly mimics the operation of the recursive filtering

update in Equation (14.5). The modified algorithm keeps at most two slices in memory at any one time: starting with slice 0, we add slice 1, then sum out slice 0, then add slice 2, then

sum out slice 1, and so on. In this way, we can achieve constant space and time per filtering update.

(The same performance can be achieved by suitable modifications to the clustering

algorithm.) Exercise 14.DBNE asks you to verify this fact for the umbrella network.

So much for the good news; now for the bad news: It turns out that the “constant” for the

per-update time and space complexity is, in almost all cases, exponential in the number of state variables. What happens is that, as the variable elimination proceeds, the factors grow

to include all the state variables (or, more precisely, all those state variables that have parents

in the previous time slice). The maximum factor size is O(d"**) and the total update cost per

step is O(nd"**), where d is the domain size of the variables and k is the maximum number

of parents of any state variable.

Of course, this is much less than the cost of HMM updating, which is O(d?"), but it s still infeasible for large numbers of variables. This grim fact means is that even though we can use

DBN s to represent very complex temporal processes with many sparsely connected variables,

Section 14.5

Dynamic Bayesian Networks

491

we cannot reason efficiently and exactly about those processes. The DBN model itself, which

represents the prior joint distribution over all the variables, is factorable into its constituent CPTs, but the posterior joint distribution conditioned on an observation sequence—that

is,

the forward message—is generally not factorable. The problem is intractable in general, so we must fall back on approximate methods.

14.5.3

Approximate inference in DBNs

Section 13.4 described two approximation algorithms: likelihood weighting (Figure 13.18)

and Markov chain Monte Carlo (MCMC, Figure 13.20). Of the two, the former is most easily adapted to the DBN context. (An MCMC filtering algorithm is described briefly in the notes

at the end of this chapter.) We will see, however, that several improvements are required over

the standard likelihood weighting algorithm before a practical method emerges. Recall that likelihood weighting works by sampling the nonevidence nodes of the network in topological order, weighting each sample by the likelihood it accords to the observed evidence variables. As with the exact algorithms, we could apply likelihood weighting directly to an unrolled DBN, but this would suffer from the same problems of increasing time

and space requirements per update as the observation sequence grows. The problem is that

the standard algorithm runs each sample in turn, all the way through the network.

Instead, we can simply run all N samples together through the DBN, one slice at a time.

The modified algorithm fits the general pattern of filtering algorithms, with the set of N samples as the forward message. The first key innovation, then, is to use the samples themselves as an approximate representation of the current state distribution.

This meets the require-

For a single customer Cj recommending a single book B1, the Bayes net might look like the one shown in Figure 15.2(a). (Just as in Section 9.1, expressions with parentheses such as Honest(Cy) are just fancy symbols—in this case, fancy names for random variables.) With 1" The name relational probability model was given by Pfeffer (2000) to a slightly different representation, but the underlying ideas are the same. 2 A game theorist would advise a dishones! customer to avoid detection by occasionally recommending a good book from a competitor. See Chapter 18.

Section 15.1

503

Relational Probability Models

two customers and two books, the Bayes net looks like the one in Figure 15.2(b). For larger

numbers of books and customers, it is quite impractical to specify a Bayes net by hand.

Fortunately, the network has a lot of repeated structure. Each Recommendation(c,b) vari-

able has as its parents the variables Honest(c), Kindness(c), and Quality(b). Moreover, the conditional probability tables (CPTs) for all the Recommendation(c,b) variables are identical, as are those for all the Honest(c) variables, and so on. The situation seems tailor-made

for a first-order language. We would like to say something like Recommendation(c,b) ~ RecCPT (Honest(c), Kindness c), Quality(b))

which means that a customer’s recommendation for a book depends probabilistically on the

customer’s honesty and kindness and the book’s quality according to a fixed CPT.

Like first-order logic, RPMs have constant, function, and predicate symbols. We will also

assume a type signature for each function—that is, a specification of the type of each argu-

ment and the function’s value. (If the type of each object is known, many spurious possible worlds are eliminated by this mechanism; for example, we need not worry about the kindness of each book, books recommending customers, and so on.)

Type signature

For the book-recommendation

domain, the types are Customer and Book, and the type signatures for the functions and predicates are as follows:

Honest : Customer — {true false}

Kindness : Customer — {1,2,3,4,5} Quality : Book — {1,2,3,4,5} Recommendation : Customer x Book — {1,2,3,4,5} The constant symbols will be whatever customer and book names appear in the retailer’s data

set. In the example given in Figure 15.2(b), these were Cy, C> and B1, B. Given the constants and their types, together with the functions and their type signatures,

the basic random variables of the RPM are obtained by instantiating each function with each possible combination of objects. For the book recommendation model, the basic random

variables include Honest(Cy), Quality(Bs), Recommendation(C\,B3), and so on. These are exactly the variables appearing in Figure 15.2(b). Because each type has only finitely many instances (thanks to the domain closure assumption), the number of basic random variables is also finite.

To complete the RPM, we have to write the dependencies that govern these random vari-

ables. There function is a For example, of honesty is

is one dependency statement for each function, where each argument of the logical variable (i.c., a variable that ranges over objects, as in first-order logic). the following dependency states that, for every customer ¢, the prior probability 0.9 true and 0.01 false:

Honest(c) ~ (0.99,0.01)

Similarly, we can state prior probabilities for the kindness value of each customer and the quality of each book, each on the 1-5 scale: Kindness(c) ~ (0.1,0.1,0.2,0.3,0.3)

Quality(h) ~ (0.05,0.2,0.4,0.2,0.15) Finally, we need the dependency for recommendations: for any customer ¢ and book b, the score depends on the honesty and kindness of the customer and the quality of the book:

Recommendation(c,b) ~ RecCPT(Honest(c), Kindness(c), Quality (b))

Basic random variable

504

Chapter 15 Probabilistic Programming where RecCPT is a separately defined conditional probability table with 2 x 5 x 5=50 rows,

each with 5 entries. For the purposes of illustration, we’ll assume that an honest recommen-

dation for a book of quality ¢ from a person of kindness is uniformly distributed in the range 1155, 145411 The semantics of the RPM can be obtained by instantiating these dependencies for all known constants, giving a Bayesian network (as in Figure 15.2(b)) that defines a joint distribution over the RPM’s random variables.>

The set of possible worlds is the Cartesian product of the ranges of all the basic random variables, and, as with Bayesian networks, the probability for each possible world is the product of the relevant conditional probabilities from the model. With C customers and B books, there are C Honest variables, C Kindness variables, B Quality variables, and BC Recommendation variables, leading to 265*3+5C possible worlds. With ten million books

and a billion customers, that’s about 107> 1" worlds. Thanks to the expressive power of

RPMs, the complete probability model still has only fewer than 300 parameters—most of them in the RecCPT table. We can refine the model by asserting a context-specific independence (see page 420) to reflect the fact that dishonest customers ignore quality when giving a recommendation; more-

over, kindness plays no role in their decisions. Thus, Recommendation(c,b) is independent of Kindness(c) and Quality(b) when Honest(c) = false:

Recommendation(c,b) ~

if Honest(c) then HonestRecCPT (Kindness(c), Quality(b)) else (0.4,0.1,0.0,0.1,0.4).

This kind of dependency may look like an ordinary if-then—else statement in a programming language, but there is a key difference: the inference engine doesn’t necessarily know the value of the conditional test because Honest(c) is a random variable.

We can elaborate this model in endless ways to make it more realistic.

For example,

suppose that an honest customer who is a fan of a book’s author always gives the book a 5, regardless of quality:

Recommendation(c,b) ~

if Honest(c) then if Fan(c, Author (b)) then Exactly(5) else HonestRecCPT (Kindness(c), Quality(b)) else (0.4,0.1,0.0,0.1,0.4)

Again, the conditional test Fan(c,Author(b)) is unknown, but if a customer gives only 5s to a

particular author’s books and is not otherwise especially kind, then the posterior probability

that the customer is a fan of that author will be high. Furthermore, the posterior distribution will tend to discount the customer’s 5s in evaluating the quality of that author’s books.

In this example, we implicitly assumed that the value of Author(b) is known for every b, but this may not be the case. How can the system reason about whether, say, C) is a fan of Author(B,) when Author(Bs) is unknown?

The answer is that the system may have to

reason about all possible authors. Suppose (to keep things simple) 3 Some technical conditions are required for an RPM to define a proper distribution. be acyclic; otherwise the resulting Bayesian network will have cycles. Second, the be well-founded: there can be no infinite ancestor chains, such as might arise from Exercise 15.HAMD for an exception to this rule.

that there are just two First, the dependencies must dependencies must (usually) recursive dependen

Section 15.1

Recommendation(Cy, B>

Relational Probability Models

505

Recommendation(Cy, By)

Figure 15.3 Fragment of the equivalent Bayes net for the book recommendation RPM when Author(Bz) is unknown. authors, A; and Ay, Then Author(Ba) is a random variable with two possible values, A; and Ay, and it is a parent of Recommendation(Cy, By). The variables Fan(Cy,A;) and Fan(C),As)

are parents too. The conditional distribution for Recommendation(C),B,) is then essentially a

‘multiplexer in which the Author(B,) parent acts as a selector to choose which of Fan(Cy,A;)

and Fan(C,A,) actually gets to influence the recommendation. A fragment of the equivalent

Multiplexer

Bayes net is shown in Figure 15.3. Uncertainty in the value of Author(B,), which affects the dependency structure of the network, is an instance of relational uncertainty. e In case you are wondering how the system can possibly work out who the author of B, is: consider the possibility that three other customers are fans of A} (and have no other favorite authors in common) and all three have given B; a 5, even though most other customers find

it quite dismal. In that case, it is extremely likely that A, is the author of B,. The emergence of sophisticated reasoning like this from an RPM model of just a few lines is an intriguing

example of how probabilistic influences spread through the web of interconnections among objects in the model. As more dependencies and more objects are added, the picture conveyed by the posterior distribution often becomes clearer and clearer.

15.1.2

Example:

Rating player skill levels

Many competitive games have a numerical measure of players’ skill levels, sometimes called arating. Perhaps the best-known is the Elo rating for chess players, which rates a typical be- Rating ginner at around 800 and the world champion usually somewhere above 2800. Although Elo ratings have a statistical basis, they have some ad hoc elements. We can develop a Bayesian rating scheme as follows: each player i has an underlying skill level Skill(i); in each game g, Ps actual performance is Performance(i,g), which may vary from the underlying skill level; and the winner of g is the player whose performance in g is better. As an RPM, the model looks like this: Skill(i) ~ N (11, 0?) Performance(i,g) ~ N (Skill(i),%)

Win(i.j,g) = if Game(g,i, j) then (Performance(i,g) > Performance(j,g))

where /32 is the variance ofa player’s actual performance in any specific game relative to the

player’s underlying skill level. Given a set of players and games, as well as outcomes for

some of the games, an RPM inference engine can compute a posterior distribution over the

skill of each player and the probable outcome of any additional game that might be played.

506

Chapter 15 Probabilistic Programming For team games, we’ll assume, as a first approximation, that the overall performance of

team ¢ in game g is the sum of the individual performances of the players on :

TeamPerformance(t,g) = ¥ic, Performance(i,g).

Even though the individual performances are not visible to the ratings engine, the players’ skill levels can still be estimated from the results of several games, as long as the team com-

positions vary across games. Microsoft’s TrueSkill™ ratings engine uses this model, along

with an efficient approximate inference algorithm, to serve hundreds of millions of users

every day. This model can be elaborated in numerous ways. For example, we might assume that weaker players have higher variance in their performance; we might include the player’s role on the team; and we might consider specific kinds of performance and skill—e.g., defending and attacking—in order to improve team composition and predictive accuracy.

15.1.3

Inference in relational probability models

The most straightforward approach to inference in RPMs s simply to construct the equivalent Bayesian network, given the known constant symbols belonging to each type. With B books

and C customers, the basic model given previously could be constructed with simple loops:*

forb=1toBdo add node Quality,, with no parents, prior ( 0.05,0.2,0.4,0.2,0.15) forc=1toCdo add node Honest, with no parents, prior ( 0.99,0.01 ) add node Kindness. with no parents, prior ( 0.1,0.1,0.2,0.3,03 ) forb=1toBdo add node Recommendation, with parents Honeste, Kindness.. Quality), and conditional distribution RecCPT (Honestc, Kindnessc, Quality,) Grounding Unrolling

This technique is called grounding or unrolling; it is the exact analog of propositionaliza-

tion for first-order logic (page 280). The obvious drawback is that the resulting Bayes net may be very large. Furthermore, if there are many candidate objects for an unknown relation or function—for example, the unknown author of By—then some variables in the network

may have many parents. Fortunately, it is often possible to avoid generating the entire implicit Bayes net. As we saw in the discussion of the variable elimination algorithm on page 433, every variable that is not an ancestor of a query variable or evidence variable is irrelevant to the query. Moreover, if the query is conditionally independent of some variable given the evidence, then that variable is also irrelevant. So, by chaining through the model starting from the query and evidence, we can identify just the set of variables that are relevant to the query. These are the only ones

that need to be instantiated to create a potentially tiny fragment of the implicit Bayes net.

Inference in this fragment gives the same answer as inference in the entire implicit Bayes net.

Another avenue for improving the efficiency of inference comes from the presence of repeated substructure in the unrolled Bayes net. This means that many of the factors constructed during variable elimination (and similar kinds of tables constructed by clustering algorithms) 4 Several statistical packages would view this code as defining the RPM, rather than just constructing a Bayes net to perform inference in the RPM. This view, however, misses an important role for RPM syntax: without a syntax with clear semantics, there is no way the model structure can be learned from data.

Section 152

Open-Universe Probability Models

507

will be identical; effective caching schemes have yielded speedups of three orders of magnitude for large networks.

Third, MCMC inference algorithms have some interesting properties when applied to RPMs with relational uncertainty. MCMC works by sampling complete possible worlds, 50 in each state the relational structure is completely known. In the example given earlier, each MCMC state would specify the value of Author(B,), and so the other potential authors are no longer parents of the recommendation nodes for B;.

For MCMC,

then, relational

uncertainty causes no increase in network complexity; instead, the MCMC process includes

transitions that change the relational structure, and hence the dependency structure, of the unrolled network.

Finally, it may be possible in some cases to avoid grounding the model altogether. Resolution theorem provers and logic programming systems avoid propositionalizing by instanti-

ating the logical variables only as needed to make the inference go through; that is, they /ift

the inference process above the level of ground propositional sentences and make each lifted step do the work of many ground steps.

The same idea can be applied in probabilistic inference. For example, in the variable

elimination algorithm, a lifted factor can represent an entire set of ground factors that assign

probabilities to random variables in the RPM, where those random variables differ only in the

constant symbols used to construct them. The details of this method are beyond the scope of

this book, but references are given at the end of the chapter. 15.2

Open-Universe

Probability Models

‘We argued earlier that database semantics was appropriate for situations in which we know

exactly the set of relevant objects that exist and can identify them unambiguously. (In partic-

ular, all observations about an object are correctly associated with the constant symbol that

names it.) In many real-world settings, however, these assumptions are simply untenable. For example, a book retailer might use an ISBN (International Standard Book Number)

constant symbol to name each book, even though a given “logical” book (e.g., “Gone With the Wind”) may have several ISBNs corresponding to hardcover, paperback, large print, reissues, and so on. It would make sense to aggregate recommendations across multiple ISBN, but the retailer may not know for sure which ISBNs are really the same book. (Note that we

are not reifying the individual copies of the book, which might be necessary for used-book

sales, car sales, and so on.) Worse still, each customer is identified by a login ID, but a dis-

honest customer may have thousands of IDs!

In the computer security field, these multiple

IDs are called sybils and their use to confound a reputation system is called a sybil attack.’ Thus, even a simple application in a relatively well-defined, online domain involves both existence uncertainty (what are the real books and customers underlying the observed data) and identity uncertainty (which logical terms really refer to the same object).

The phenomena of existence and identity uncertainty extend far beyond online book-

sellers. In fact they are pervasive:

+ A vision system doesn’t know what exists, if anything, around the next corner, and may not know if the object it sees now is the same one it saw a few minutes ago.

S The name “Sybi

comes from a famous case of multiple personality disorder.

Sybil

Sybil attack Existence uncertainty Identity uncertainty

508

Chapter 15 Probabilistic Programming +A

text-understanding system does not know in advance the entities that will be featured in a text, and must reason about whether phrases such as “Mary,” “Dr. Smith,” “she,” “his cardiologist,” “his mother,” and so on refer to the same object.

« An intelligence analyst hunting for spies never knows how many spies there really are and can only guess whether various pseudonyms, phone numbers, and sightings belong to the same individual.

Indeed, a major part of human cognition seems to require learning what objects exist and

being able to connect observations—which almost never come with unique IDs attached—to

Open universe probability model (OUPM)

hypothesized objects in the world.

Thus, we need to be able to define an open universe probability model (OUPM) based

on the standard semantics of first-order logic, as illustrated at the top of Figure

15.1.

A

language for OUPMs provides a way of easily writing such models while guaranteeing a unique, consistent probability distribution over the infinite space of possible worlds. 15.2.1

Syntax and semantics

The basic idea is to understand how ordinary Bayesian networks and RPMs manage to define a unique probability model and to transfer that insight to the first-order setting. In essence, a Bayes net generates each possible world, event by event, in the topological order defined

by the network structure, where each event is an assignment of a value to a variable. An RPM extends this to entire sets of events, defined by the possible instantiations of the logical

variables in a given predicate or function. OUPMs go further by allowing generative steps that

Number statement

add objects to the possible world under construction, where the number and type of objects may depend on the objects that are already in that world and their properties and relations. That is, the event being generated is not the assignment of a value to a variable, but the very existence of objects. One way o do this in OUPMs is to provide number statements that specify conditional distributions over the numbers of objects of various kinds. For example, in the book-

recommendation domain, we might want to distinguish between customers (real people) and their login IDs. (It’s actually login IDs that make recommendations, not customers!) Suppose

(to keep things simple) the number of customers is uniform between 1 and 3 and the number of books is uniform between 2 and 4:

#Customer ~ UniformInt(1,3) #Book ~ Uniformini(2,4).

(15.2)

‘We expect honest customers to have just one ID, whereas dishonest customers might have anywhere between 2 and 5 IDs:

#LoginID(Owner=c) ~ Origin function

if Honest(c) then Exactly(1) else Uniformint(2,5).

(15.3)

This number statement specifies the distribution over the number of login IDs for which customer c is the Owner.

The Owner function is called an origin function because it says

where each object generated by this number statement came from.

The example in the preceding paragraph uses a uniform distribution over the integers

between 2 and 5 to specify the number of logins for a dishonest customer.

This particular

distribution is bounded, but in general there may not be an a priori bound on the number of

Section 152

Open-Universe Probability Models

objects. The most commonly used distribution over the nonnegative integers is the Poisson distribution. The Poisson has one parameter, \, which is the expected number of objects, and a variable X sampled from Poisson(X) has the following distribution:

509

Poisson distribution

P(X=k)=Xe /K. The variance of the Poisson is also A, so the standard deviation is v/X.

This means that

for large values of ), the distribution is narrow relative to the mean—for example, if the number of ants in a nest is modeled by a Poisson with a mean of one million, the standard deviation is only a thousand, or 0.1%. For large numbers, it often makes more sense to use

the discrete log-normal distribution, which is appropriate when the log of the number of

Discrete log-normal distribution

magnitude distribution, uses logs to base 10: thus, a distribution OM(3, 1) has a mean of

Order-of-magnitude distribution

objects is normally distributed.

A particularly intuitive form, which we call the order-of-

10° and a standard deviation of one order of magnitude, i.c., the bulk of the probability mass falls between 102 and 10°.

The formal semantics of OUPMs begins with a definition of the objects that populate

possible worlds. In the standard semantics of typed first-order logic, objects are just numbered tokens with types. In OUPMs, each object is a generation history; for example, an object might be “the fourth login ID of the seventh customer.” (The reason for this slightly baroque construction will become clear shortly.) For types with no origin functions—e.g., the Customer and Book types in Equation (15.2)—the objects have an empty origin; for example, (Customer, ,2) refers to the second customer generated from that number statement.

For number statements with origin functions—e.g., Equation (15.3)—each object records its origin; for example, the object (LoginID, (Owner, (Customer, ,2)),3) is the third login belonging to the second customer. The number variables of an OUPM specify how many objects there are of each type with

each possible origin in each possible world; thus #L0ginlD guuer (Customer, 2)) () =4 means

that in world w, customer 2 owns 4 login IDs. As in relational probability models, the basic random variables determine the values of predicates and functions for all tuples of objects; thus, Honest(cusiomer, 2) (w) = true means that in world w, customer 2 is honest. A possible world is defined by the values of all the number variables and basic random variables. A

world may be generated from the model by sampling in topological order; Figure 15.4 shows an example. The probability of a world so constructed is the product of the probabilities for all the sampled values; in this case, 1.2672x 10~!!. Now it becomes clear why each object contains its origin: this property ensures that every world can be constructed by exactly

one generation sequence. If this were not the case, the probability of a world would be an unwieldy combinatorial sum over all possible generation sequences that create it.

Open-universe models may have infinitely many random variables, so the full theory in-

volves nontrivial measure-theoretic considerations.

For example, number statements with

Poisson or order-of-magnitude distributions allow for unbounded numbers of objects, lead-

ing to unbounded numbers of random variables for the properties and relations of those objects. Moreover, OUPMs can have recursive dependencies and infinite types (integers, strings, etc.). Finally, well-formedness disallows cyclic dependencies and infinitely receding ancestor chains; these conditions are undecidable in general, but certain syntactic sufficient conditions

can be checked easily.

Number variable

510

Chapter 15 Probabilistic Programming Variable #Customer #Book Honest cusiomer. 1)

Value 2 3 rue

Probability 03333 03333 099

1

0.1

Honest(cusiomer. 2) Kindness cusiomer..1)

Jalse 4

Quality gy, 1)

1 3

Kindness customer. 2) Quality gy, Quality gy,

2 3)

FLOgINID,e (Cusomer, 1)

#LoginID guner, (Customer, 2))

Recommendationtginip, (Owner,(Customer, 1))1),(Book 1) Recommendationtaginip, (Owner.(Customer, 1))1)(Book 2)

Recommendation(poginp,(Owner.(Customer, 1)1 (Book.3) RecommendationLoginip, (Owner. (Customer. 2)).1).(Book..1) RecommendationLoginip, (Owner. (Customer. 2)).1).(Book, 2) Recommendationt

ginip,(Owner.(Customer. 2)).1).(Boo

3)

i (0wner. (Custome 2)).2).(Book..1) Recommendationy.oginip,

Recommendation(g oginip, (Owner.(Customer, 2)).2).(Book. 2) Recommendation g oginip, (Owner.(Customer, 2)).2). (Book. 3

5 1 2 2 4 5 5 1 5 5 1

0.01 0.3

0.05 04

015

1.0 0.25

0.5 0.5

05

04 04

04

0.4 0.4 0.4

Figure 15.4 One particular world for the book recommendation OUPM. The number variables and basic random variables are shown in topological order, along with their chosen values and the probabilities for those values.

15.2.2

Inference in open-universe

probability models

Because of the potentially huge and sometimes unbounded size of the implicit Bayes net that corresponds to a typical OUPM, unrolling it fully and performing exact inference is quite impractical. Instead, we must consider approximate inference algorithms such as MCMC (see Section 13.4.2).

Roughly speaking, an MCMC algorithm for an OUPM is exploring the space of possible worlds defined by sets of objects and relations among them, as illustrated in Figure 15.1(top). A move between adjacent states in this space can not only alter relations and functions but also add or subtract objects and change the interpretations of constant symbols. Even though

each possible world may be huge, the probability computations required for each step— whether in Gibbs sampling or Metropolis-Hastings—are entirely local and in most cases take constant time. This is because the probability ratio between neighboring worlds depends on a subgraph of constant size around the variables whose values are changed. Moreover, a logical query can be evaluated incrementally in each world visited, usually in constant time. per world, rather than being recomputing from scratch.

Some special consideration needs to be given to the fact that a typical OUPM may have

possible worlds of infinite size. As an example, consider the multitarget tracking model in Figure 15.9: the function X (a.t), denoting the state of aircraft a at time 1, corresponds to an infinite sequence of variables for an unbounded number of aircraft at each step. For this

reason, MCMC for OUPMs samples not completely specified possible worlds but partial

Section 152

Open-Universe Probability Models

worlds, each corresponding to a disjoint set of complete worlds. A partial world is a minimal

self-supporting instantiation® of a subset of the relevant variables—that is, ancestors of the

evidence and query variables. For example, variables X (a,) for values of7 greater than the last observation time (or the query time, whichever is greater) are irrelevant, so the algorithm can consider just a finite prefix of the infinite sequence.

15.2.3

Examples

The standard “use case” for an OUPM

has three elements:

the model,

the evidence (the

Kknown facts in a given scenario), and the guery, which may be any expression, possibly with free logical variables. The answer is a posterior joint probability for each possible set of substitutions for the free variables, given the evidence, according to the model.” Every model includes type declarations, type signatures for the predicates and functions, one or

more number statements for each type, and one dependency statement for each predicate and

function. (In the examples below, declarations and signatures are omitted where the meaning

is clear.) As in RPMs, dependency statements use an if-then-else syntax to handle contextspecific dependencies. Citation matching

Millions of academic research papers and technical reports are to be found online in the form of pdf files. Such papers usually contain a section near the end called “References” or “Bibliography,” in which citations—strings of characters—are provided to inform the reader

of related work. These strings can be located and “scraped” from the pdf files with the aim of

creating a database-like representation that relates papers and researchers by authorship and

citation links. Systems such as CiteSeer and Google Scholar present such a representation to their users; behind the scenes, algorithms operate to find papers, scrape the citation strings, and identify the actual papers to which the citation strings refer. This is a difficult task because these strings contain no object identifiers and include errors of syntax, spelling, punctuation,

and content. To illustrate this, here are two relatively benign examples:

1. [Lashkari et al 94] Collaborative Interface Agents, Yezdi Lashkari, Max Metral, and Pattie Maes, Proceedings of the Twelfth National Conference on Articial Intelligence, MIT Press, Cambridge, MA, 1994. 2. Metral M. Lashkari, Y. and P. Maes. Collaborative interface agents. In Conference of the American Association for Artificial Intelligence, Seattle, WA, August 1994.

The key question is one of identity: are these citations of the same paper or different pa-

pers? Asked this question, even experts disagree or are unwilling to decide, indicating that

reasoning under uncertainty is going to be an important part of solving this problem.® Ad hoc approaches—such as methods based on a textual similarity metric—often fail miserably. For example, in 2002,

CiteSeer reported over 120 distinct books written by Russell and Norvig.

one in which the parents of every variable in the set are

7 As with Prolog, there may be infinitely many sets of substitutions of unbounded size; designing exploratory interfaces for such answers is an interesting visualization challenge. 8 The answer is yes, they are the same paper. The “National Conference on Articial Intelligence” (notice how the “fi” s missing, thanks to an error in seraping the ligature character) is another name for the AAAI conference; the conference took place in Seattle whereas the proceedings publisher is in Cambridge.

511

512

Chapter 15 Probabilistic Programming type Researcher, Paper, Citation random String Name(Researcher) random String Title(Paper) random Paper PubCited(Citation) random String Text(Citation) random Boolean Professor(Researcher) origin Researcher Author(Paper) #Researcher ~ OM(3,1) Name(r) ~ NamePrior() Professor(r) ~ Boolean(0.2) #Paper(Author=r) ~ if Professor(r) then OM(1.5,0.5) else OM(1,0.5) Title(p) ~ PaperTitlePrior() CitedPaper(c) ~ UniformChoice({Paper p}) Text(c) ~ HMMGrammar(Name(Author(CitedPaper(c))), Title(CitedPaper(c))) Figure 15.5 An OUPM for citation information extraction. For simplicity the model assumes one author per paper and omits details of the grammar and error models. In order to solve the problem using a probabilistic approach, we need a generative model

for the domain.

That is, we ask how these citation strings come to be in the world.

The

process begins with researchers, who have names. (We don’t need to worry about how the researchers came into existence; we just need to express our uncertainty about how many

there are.) These researchers write some papers, which have titles; people cite the papers,

combining the authors’ names and the paper’s title (with errors) into the text of the citation according to some grammar.

The basic elements of this model are shown in Figure 15.5,

covering the case where papers have just one author.”

Given just citation strings as evidence, probabilistic inference on this model to pick

out the most likely explanation for the data produces an error rate 2 to 3 times lower than CiteSeer’s (Pasula et al., 2003).

The inference process also exhibits a form of collective,

knowledge-driven disambiguation: the more citations for a given paper, the more accurately each of them is parsed, because the parses have to agree on the facts about the paper. Nuclear treaty monitoring

Verifying the Comprehensive Nuclear-Test-Ban Treaty requires finding all seismic events on Earth above a minimum

magnitude.

The UN CTBTO

maintains a network of sensors, the

International Monitoring System (IMS); its automated processing software, based on 100 years of seismology research, has a detection failure rate of about 30%. The NET-VISA system (Arora ef al., 2013), based on an OUPM, significantly reduces detection failures. The NET-VISA model (Figure 15.6) expresses the relevant geophysics directly. It describes distributions over the number of events in a given time interval (most of which are

9 The multi-author case has the same overall structure but is a bit more complicated. The parts of the model not shown—the NamePrior, ritlePrior, and HMMGrammar—are traditional probability models. For example, the NamePrior is a mixture of a categorical distribution over actual names and a letter trigram model (sec Section 23.1) to cover names not previously seen, both trained from data in the U.S. Census database.

Section 152

Open-Universe Probability Models

#SeismicEvents ~ Poisson(T * \e) Time(e) ~ UniformReal(0,T) EarthQuake(e) ~ Boolean(0.999) Location(e) ~ if Earthquake(e) then SpatialPrior() else UniformEarth() Depth(e) ~ if Earthquake(e) then UniformReal(0,700) else Exacly(0) Magnitude(e) ~ Exponential(log(10)) Detected(e,p.s) ~ Logistic(weights(s, p), Magnitude(e), Depth(e), Dist(e.s)) #Detections(site = 5) ~ Poisson(T * Af(s)) #Detections(event=e, phase=p, station=s) = if Detected(e, p.s) then 1 else0 OnserTime(a,s)if (event(a) = null) then ~ UniformReal(0,T) else = Time(event(a)) + GeoTT(Dist(event(a),s), Depth(event(a)).phase(a)) + Laplace(pu(s),o:(s)) Amplitude(a,s) if (event(a) = null) then ~ NoiseAmpModel(s) else = AmpModel(Magnitude(event(a)). Dist(event(a), 5), Depth(event(a)). phase(a) Azimuth(a,s) if (event(a) = null) then ~ UniformReal(0, 360) else = GeoAzimuth(Location(event(a)), Depth(event(a)), phase(a). Site(s)) + Laplace(0,7a(s)) Slowness(a,s) if (event(a) = null) then ~ UniformReal(0,20) else = GeoSlowness(Location(event(a)), Depth(event(a)), phase(a), Site(s)) + Laplace(0,7(s)) ObservedPhase(a,s) ~ CategoricalPhaseModel(phase(a)) Figure 15.6 A simplified version of the NET-VISA model (see text).

naturally occurring) as well as over their time, magnitude, depth, and location. The locations of natural events are distributed according to a spatial prior that is trained (like other parts

of the model) from historical data; man-made events are, by the treaty rules, assumed to oc-

cur uniformly over the surface of the Earth.

At every station s, each phase (seismic wave

type) p from an event ¢ produces either 0 or 1 detections (above-threshold signals); the detection probability depends on the event magnitude and depth and its distance from the station.

“False alarm” detections also occur according to a station-specific rate parameter. The measured arrival time, amplitude, and other properties of a detection d from a real event depend

on the properties of the originating event and its distance from the station.

Once trained, the model runs continuously. The evidence consists of detections (90% of

which are false alarms) extracted from raw IMS waveform data, and the query typically asks for the most likely event history, or bulletin, given the data. Results so far are encouraging;

for example, in 2009 the UN’s SEL3 automated bulletin missed 27.4% of the 27294 events in the magnitude range 3-4 while NET-VISA missed 11.1%. Moreover, comparisons with dense regional networks show that NET-VISA finds up to 50% more real events than the

final bulletins produced by the UN’s expert seismic analysts. NET-VISA also tends to associate more detections with a given event, leading to more accurate location estimates (see

Figure 15.7). As of January 1, 2018, NET-VISA has been deployed as part of the CTBTO ‘monitoring pipeline.

Despite superficial differences, the two examples are structurally similar: there are un-

Kknown objects (papers, earthquakes) that generate percepts according to some physical pro-

513

514

Chapter 15 Probabilistic Programming

i)

:

e (@

(b)

Figure 15.7 (a) Top: Example of seismic waveform recorded at Alice Springs, Australia. Bottom: the waveform after processing to detect the arrival times of seismic waves. Blue lines are the automatically detected arrivals; red lines are the true arrivals. (b) Location estimates for the DPRK nuclear test of February 12, 2013: UN CTBTO Late Event Bulletin (green triangle at top left): NET-VISA (blue square in center). The entrance to the underground test facility (small “x”) is 0.75km from NET-VISA's estimate. Contours show NET-VISA's posterior location distribution. Courtesy of CTBTO Preparatory Commission. cess (citation, seismic propagation). The percepts are ambiguous as to their origin, but when

multiple percepts are hypothesized to have originated with the same unknown object, that

object’s properties can be inferred more accurately. The same structure and reasoning patterns hold for areas such as database deduplication

and natural language understanding. In some cases, inferring an object’s existence involves

grouping percepts together—a process that resembles the clustering task in machine learning.

In other cases, an object may generate no percepts at all and still have its existence inferred— as happened, for example, when observations of Uranus led to the discovery of Neptune. The

existence of the unobserved object follows from its effects on the behavior and properties of

observed objects. 15.3

Keeping Track of a Complex World

Chapter 14 considered the problem of keeping track of the state of the world, but covered only the case of atomic representations (HMMs)

and factored representations (DBNs and

Kalman filters). This makes sense for worlds with a single object—perhaps a single patient

in the intensive care unit or a single bird flying through the forest. In this section, we see what

happens when two or more objects generate the observations. What makes this case different

from plain old state estimation is that there is now the possibility of uncertainty about which

object generated which observation. This is the identity uncertainty problem of Section 15.2 Data association

(page 507), now viewed in a temporal context. In the control theory literature, this is the data association problem—that is, the problem of associating observation data with the objects that generated them. Although we could view this as yet another example of open-universe

probabilistic modeling, it is important enough in practice to deserve its own section.

Section 153

Keeping Track of a Complex World

515

track termination detection failure



@

false alarm —

®

'rack/

initiation

0; a positive affine transformation.? This fact was noted

in Chapter 5 (page 167) for two-player games of chance; here, we see that it applies to all kinds of decision scenarios.

Value function Ordinal utility function

As

in game-playing, in a deterministic environment an agent needs only a preference

ranking on states—the numbers don’t matter.

This is called a value function or ordinal

utility function.

Itis important to remember that the existence ofa utility function that describes an agent’s

preference behavior does not necessarily mean that the agent is explicirly maximizing that utility function in its own deliberations. As we showed in Chapter 2, rational behavior can be

generated in any number of ways. A rational agent might be implemented with a table lookup

(if the number of possible states is small enough).

By observing a rational agent’s behavior, an observer can learn about the utility function

that represents what the agent is actually trying to achieve (even if the agent doesn’t know it). ‘We return to this point in Section 16.7.

16.3

Utility Functions

Utility functions map from lotteries to real numbers. We know they must obey the axioms

of orderability, transitivity, continuity, substitutability, monotonicity, and decomposability. Is

that all we can say about utility functions? Strictly speaking, that is it: an agent can have any preferences it likes. For example, an agent might prefer to have a prime number of dollars in its bank account; in which case, if it had $16 it would give away $3. This might be unusual,

but we can’t call it irrational. An agent might prefer a dented 1973 Ford Pinto to a shiny new

Mercedes. The agent might prefer prime numbers of dollars only when it owns the Pinto, but when it owns the Mercedes, it might prefer more dollars to fewer. Fortunately, the preferences

of real agents are usually more systematic and thus easier to deal with.

3 Inthis nse, utilities resemble temperatures: a temperature in Fahrenheit 1.8 times the Cel plus3 . but converting from one to the other doesn’t make you hotter or colder.

s temperature

Section 163

Utility Functions

533

y assessment and utility scales If we want to build a decision-theoretic system that helps a human make decisions or acts on his or her behalf, we must first work out what the human’s utility function is. This process,

often called preference elicitation, involves presenting choices to the human and using the

observed preferences to pin down the underlying utility function.

Preference elicitation

Equation (16.2) says that there is no absolute scale for utilities, but it is helpful, nonethe-

less, to establish some scale on which utilities can be recorded and compared for any particu-

lar problem. A scale can be established by fixing the utilities of any two particular outcomes,

just as we fix a temperature scale by fixing the freezing point and boiling point of water. Typically, we fix the utility of a “best possible prize” at U(S) = ur and a “worst possible

catastrophe” at U (S) = u,. (Both of these should be finite.) Normalized utilities use a scale

with u; =0 and ur = 1. With such a scale, an England fan might assign a utility of 1 to England winning the World Cup and a utility of 0 to England failing to qualify.

Normalized utilties

Given a utility scale between ut and u , we can assess the utility of any particular prize

S by asking the agent to choose between S and a standard lottery [p,ur; (1 — p),u.]. The Standard lottery probability p is adjusted until the agent is indifferent between S and the standard lottery.

Assuming normalized utilities, the utility ofS is given by p. Once this is done for each prize, the utilities for all lotteries involving those prizes are determined.

Suppose, for example,

we want to know how much our England fan values the outcome of England reaching the

semi-final and then losing. We compare that outcome to a standard lottery with probability p

of winning the trophy and probability 1 — p of an ignominious failure to qualify. If there is indifference at p=0.3, then 0.3 is the value of reaching the semi-final and then losing.

In medical, transportation, environmental and other decision problems, people’s lives are

at stake. (Yes, there are things more important than England’s fortunes in the World Cup.) In such cases, u, is the value assigned to immediate death (or in the really worst cases, many deaths).

Although nobody feels

comfortable with putting a value on human life, it is a fact

that tradeoffs on matters of life and death are made all the time. Aircraft are given a complete

overhaul at intervals, rather than after every trip. Cars are manufactured in a way that trades off costs against accident survival rates.

We tolerate a level of air pollution that kills four

million people a year. Paradoxically, a refusal to put a monetary value on life can mean that life is undervalued. Ross Shachter describes a government agency that commissioned a study on removing asbestos from schools. The decision analysts performing the study assumed a particular dollar value for the life of a school-age child, and argued that the rational choice under that assump-

tion was to remove the asbestos. The agency, morally outraged at the idea of setting the value

of a life, rejected the report out of hand. It then decided against asbestos removal—implicitly asserting a lower value for the life of a child than that assigned by the analysts.

Currently several agencies of the U.S. government, including the Environmental Protec-

tion Agency, the Food and Drug Administration, and the Department of Transportation, use the value of a statistical life to determine the costs and benefits of regulations and interven-

Value of a statistical

One common “currency” used in medical and safety analysis is the micromort, a one in a

Micromort

tions. Typical values in 2019 are roughly $10 million. Some attempts have been made to find out the value that people place on their own lives. million chance of death. If you ask people how much they would pay to avoid a risk—for

534

Chapter 16 Making Simple Decisions example, to avoid playing Russian roulette with a million-barreled revolver—they will respond with very large numbers, perhaps tens of thousands of dollars, but their actual behavior reflects a much lower monetary value for a micromort. For example, in the UK, driving in a car for 230 miles incurs a risk of one micromort.

Over the life of your car—say, 92,000 miles—that’s 400 micromorts.

People appear to be

willing to pay about $12,000 more for a safer car that halves the risk of death.

Thus, their

car-buying action says they have a value of $60 per micromort. A number of studies have confirmed a figure in this range across many individuals and risk types. However, government

agencies such as the U.S. Department of Transportation typically set a lower figure; they will

QALY

spend only about $6 in road repairs per expected life saved. Of course, these calculations hold only for small risks. Most people won’t agree to kill themselves, even for $60 million. Another measure is the QALY or quality-adjusted life year. Patients are willing to accept a shorter life expectancy to avoid a disability.

For example, kidney patients on average are

indifferent between living two years on dialysis and one year at full health. 16.3.2

The uti

y of money

Utility theory has its roots in economics, and economics provides one obvious candidate for a utility measure:

money (or more specifically, an agent’s total net assets). The almost

universal exchangeability of money for all kinds of goods and services suggests that money plays a significant role in human utility functions.

Monotonic preference

Tt will usually be the case that an agent prefers more money to less, all other things being

equal.

We say that the agent exhibits a monotonic preference for more money.

This does

not mean that money behaves as a utility function, because it says nothing about preferences

between lotteries involving money.

Suppose you have triumphed over the other competitors in a television game show. The

host now offers you a choice: either you can take the $1,000,000 prize or you can gamble it

Expected monetary value

on the flip of a coin. If the coin comes up heads, you end up with nothing, but if it comes up tails, you get $2,500,000. If you're like most people, you would decline the gamble and pocket the million. Are you being irrational?

Assuming the coin is fair, the expected monetary value (EMV) of the gamble is %($0) +

%(52.500,000) = §$1,250,000, which is more than the original $1,000,000. But that does not necessarily mean that accepting the gamble is a better decision. Suppose we use S, to denote

the state of possessing total wealth $n, and that your current wealth is $k. Then the expected utilities of the two actions of accepting and declining the gamble are

EU(Accept) = U(Sk U (Sk12:5 )+ 00.000) EU(Decline) = U(Sk41,000000) To determine what to do, we need to assign utilities to the outcome states.

Utility is not

directly proportional to monetary value, because the utility for your first million is very high (or so they say), whereas the utility for an additional million is smaller. Suppose you assign a utility of 5 to your current financial status (S), a 9 to the state Sk+2.500.000 and an 8 to the. state Si+1.000.000- Then the rational action would be to decline, because the expected utility of accepting is only 7 (less than the 8 for declining). On the other hand, a billionaire would most likely have a utility function that is locally linear over the range of a few million more,

and thus would accept the gamble.

Section 16,3 Utility Functions U

150000

535

U

800000

-3

(a)

-5

(b)

Figure 16.2 The utility of money. (a) Empirical data for Mr. Beard over a limited range. (b) A typical curve for the full range. In a pioneering study of actual utility functions, Grayson (1960) found that the utility

of money was almost exactly proportional to the logarithm of the amount. (This idea was

first suggested by Bernoulli (1738); see Exercise 16.STPT.) One particular utility curve, for a certain Mr. Beard, is shown in Figure 16.2(a). The data obtained for Mr. Beard’s preferences are consistent with a utility function

U(Sksn) = —263.31 422.091og(n+ 150,000)

for the range between n = —$150,000 and n = $800, 000.

‘We should not assume that this is the definitive utility function for monetary value, but

itis likely that most people have a utility function that is concave for positive wealth. Going into debt is bad, but preferences between different levels of debt can display a reversal of the concavity associated with positive wealth. For example, someone already $10,000,000 in debt might well accept a gamble on a fair coin with a gain of $10,000,000 for heads and a Toss of $20,000,000 for tails.* This yields the S-shaped curve shown in Figure 16.2(b). If we restrict our attention to the positive part of the curves, where the slope is decreasing, then for any lottery L, the utility of being faced with that lottery is less than the utility of being handed the expected monetary value of the lottery as a sure thing:

U(L) 0, then life is

positively enjoyable and the agent avoids borh exits. As long as the actions in (4,1), (3,2), and (3,3) are as shown, every policy is optimal, and the agent obtains infinite total reward

because it never enters a terminal state. It turns out that there are nine optimal policies in all

for various ranges of r; Exercise 17.THRC asks you to find them.

The introduction of uncertainty brings MDPs closer to the real world than deterministic search problems. For this reason, MDPs have been studied in several fields, including Al,

operations research, economics, and control theory. Dozens of solution algorithms have been proposed, several of which we discuss in Section 17.2. First, however, we spell out in more detail the definitions of utilities, optimal policies, and models for MDPs.

17.1.1

Utilities over time

In the MDP example in Figure 17.1, the performance of the agent was measured by a sum of rewards for the transitions experienced. This choice of performance measure is not arbitrary,

but it is not the only possibility for the utility function> on environment histories, which we

write as Up([s0,a0,51,a1 .., ,)).

2 In this chapter we use U for the utility function (to be consistent with the rest of the book), but many works about MDPs use V (for value) instead.

Section 17.1

Sequential Decision Problems

565

&1

|

o

T

Geeh

—0.0274 0. It is easy to see, from the definition of utilities as discounted sums of rewards, that a similar transformation of rewards will leave the optimal policy unchanged in an MDP: R(s,a,5) = mR(s,a,5') +b. It turns out, however, that the additive reward decomposition of utilities leads to significantly

more freedom in defining rewards. Let &(s) be any function of the state s. Then, according to the shaping theorem, the following transformation leaves the optimal policy unchanged:

R(s,a,5') = R(s,a,8') +7D(s') — B(s) .

(17.9)

To show that this is true, we need to prove that two MDPs, M and M’, have identical optimal

policies as long as they differ only in their reward functions as specified in Equation (17.9). We start from the Bellman equation for Q, the Q-function for MDP M:

0(s.0) = L) P | s.0)[R(s.a.8) + 7 maxs Qs )]

Now let Q'(s,a)=Q(s,a) — ®(s) and plug it into this equation; we get

Q'(s.0) +®(s) = Y. P(s | 5.0)[R(s.a.5') +7 myX(Q'(:'st/) +@(s))]. which then simplifies to

Q50) = LPE15.0)[RG.0,5) +1() ~B(s) +7 max ()] LPE |5 @)R (s.a.8) +7 max @/ ()]

Shaping theorem

570

Chapter 17 Making Complex Decisions In other words, Q'(s,a) satisfies the Bellman equation for MDP M. Now we can extract the optimal policy for M’ using Equation (17.7):

i (s) = argmax Q' (s, a) = argmax O(s,@) — B(s) = argmax Q(s,a) = wj(s) a

a

a

The function ®(s) is often called a potential, by analogy to the electrical potential (volt-

age) that gives rise to electric fields. The term 7(s') — d(s) functions as a gradient of the

potential. Thus, if ®(s) has higher value in states that have higher utility, the addition of ~YP(s') — D(s) to the reward has the effect of leading the agent “uphill” in utility. At first sight, it may seem rather counterintuitive that we can modify the reward in this

way without changing the optimal policy. It helps if we remember that all policies are optimal with a reward function that is zero everywhere. This means, according to the shaping theorem,

that all policies are optimal for any potential-based reward of the form R(s,a,s') = y®(s') — @(s).

Intuitively, this is because with such a reward it doesn’t matter which way the agent

goes from A to B. (This is easiest to see when v=1: along any path the sum of rewards collapses to d(B) —D(A), so all paths are equally good.) So adding a potential-based reward to any other reward shouldn’t change the optimal policy.

The flexibility afforded by the shaping theorem means that we can actually help out the

agent by making the immediate reward more directly reflect what the agent should do. In fact, if we set @(s) =U(s), then the greedy policy 7 with respect to the modified reward R’

is also an optimal policy:

7G(s) = argmax ) P(s'|5,a)R (s,a,5)

= argmax Y P(s'| 5,a) [R(s,a,5') +7(s') — ®(s)] . & = argmax Y P(s 5,0)[R(s,a,5') + U (s') = U (s)) a

g

i

I

= argmax )"

P(s'|5,a)[R(s,a,s') + U (s)]

(by Equation (17.4)).

Of course, in order to set (s) =U (s), we would need to know U (s); so there is no free lunch, but there is still considerable value in defining a reward function that is helpful to the extent

possible.

This is precisely what animal trainers do when they provide a small treat to the

animal for each step in the target sequence.

17.1.4

Representing MDPs

The simplest way to represent P(s'|s,a) and R(s,a,s') is with big, three-dimensional tables of size [S|?|A|. This is fine for small problems such as the 4 x 3 world, for which the tables

have 112 x 4=484 entries each. In some cases, the tables are sparse—most entries are zero

because each state s can transition to only a bounded number of states s'—which means the

tables are of size O(|S]|A]). For larger problems, even sparse tables are far too big.

Dynamic decision

Just as in Chapter 16, where Bayesian networks were extended with action and utility nodes to create decision networks, we can represent MDPs by extending dynamic Bayesian networks (DBNS, see Chapter 14) with decision, reward, and utility nodes to create dynamic

decision networks, or DDNs. DDN are factored representations in the terminology of

Section 17.1

[ Prag/nplus,

Sequential Decision Problems

Plug/Unplug.,| LeftWheel,,,

Figure 17.4 A dynamic decision network for a mobile robot with state variables for battery level, charging status, location, and velocity, and action variables for the left and right wheel motors and for charging. Chapter 2; they typically have an exponential complexity advantage over atomic representations and can model quite substantial real-world problems.

Figure 17.4, which is based on the DBN in Figure 14.13(b) (page 486), shows some

elements of a slightly realistic model for a mobile robot that can charge itself. The state S; is

decomposed into four state variables: + X, consists of the two-dimensional location on a grid plus the orientation;

« X, is the rate of change of X;;

« Charging, is true when the robot is plugged in to a power source; « Battery, is the battery level, which we model as an integer in the range 0, The state space for the MDP is the Cartesian product of the ranges of these four variables. The

action is now a set A, of action variables, comprised of Plug/Unplug, which has three values (plug, unplug, and noop); LefitWheel for the power sent to the left wheel; and RightWheel for

the power sent to the right wheel. The set of actions for the MDP is the Cartesian product of

the ranges of these three variables. Notice that each action variable affects only a subset of the state variables. The overall transition model is the conditional distribution P(X,|X;,A,), which can be

computed as a product of conditional probabilities from the DDN. The reward here is a single variable that depends only on the location X (for, say, arriving at a destination) and Charging, as the robot has to pay for electricity used; in this particular model, the reward doesn’t depend

on the action or the outcome state.

The network in Figure 17.4 has been projected three steps into the future. Notice that the

network includes nodes for the rewards for t, 1 + 1, and 1 + 2, but the wtility for 1 + 3. This

571

572

Chapter 17 Making Complex Decisions Next

F

&

o

A,

A1

CurrentPiece;

CurrentPiece

Filled,

Filledyy

(a)

(b)

Figure 17.5 (a) The game of Tetris. The T-shaped piece at the top center can be dropped in any orientation and in any horizontal position. Ifa row is completed, that row disappears and the rows above it move down, and the agent receives one point. The next piece (here, the L-shaped piece at top right) becomes the current piece, and a new next piece appears, chosen at random from the seven piece types. The game ends if the board fills up to the top. (b) The DDN for the Tetris MDP. is because the agent must maximize the (discounted) sum of all future rewards, and U (X;3) represents the reward for all rewards from ¢ + 3 onwards. If a heuristic approximation to U is available, it can be included in the MDP representation in this way and used in lieu of further expansion. This approach is closely related to the use of bounded-depth search and heuristic evaluation functions for games in Chapter 5.

Another interesting and well-studied MDP is the game of Tetris (Figure 17.5(a)). The

state variables for the game are the CurrentPiece, the NextPiece, and a bit-vector-valued

variable Filled with one bit for each of the 10 x 20 board locations. Thus, the state space has

7 %7 % 22% 2 10

states. The DDN for Tetris is shown in Figure 17.5(b). Note that Filled,

is a deterministic function of Filled, and A,. It turns out that every policy for Tetris is proper (reaches a terminal state): eventually the board fills despite one’s best efforts to empty it.

17.2

Algorithms for MDPs

In this section, we present four different algorithms for solving MDPs. The first three, value

Monte Carlo planning

iteration, policy iteration, and linear programming, generate exact solutions offline. The fourth is a family of online approximate algorithms that includes Monte Carlo planning.

17.2.1 Value lteration

Value iteration

‘The Bellman equation (Equation (17.5)) s the basis of the value iteration algorithm for solving MDPs. If there are n possible states, then there are n Bellman equations, one for cach

Section 17.2

Algorithms for MDPs

573

function VALUE-ITERATION(mdp, €) returnsa utility function

inputs: mdp, an MDP with states S, actions A(s), transition model P(s’|s,a),

rewards R(s,a, "), discount ¢, the maximum error allowed in the utility of any state

local variables: U, U’, vectors of utilities for states in S, initially zero

4, the maximum relative change in the utility of any state

repeat

VU360 for each state s in S do U'ls] max,c o5y Q-VALUE(mdp, s,a. U) if |U[s] — Uls]| > & then 5 |U'[s] — U[s]| until§ < e(1-7)/7 return U Figure 17.6 The value iteration algorithm for calculating tilities of states. The termination condition is from Equation (17.12). state. The n equations contain n unknowns—the utilities of the

states. So we would like to

solve these simultaneous equations to find the utilities. There is one problem: the equations

are nonlinear, because the “max” operator is not a linear operator. Whereas systems of linear equations can be solved quickly using linear algebra techniques, systems of nonlinear equations are more problematic. One thing to try is an iterative approach. We start with arbitrary

initial values for the utilities, calculate the right-hand side of the equation, and plug it into the

left-hand side—thereby updating the utility of each state from the utilities of its neighbors.

We repeat this until we reach an equilibrium.

Let U;(s) be the utility value for state s at the ith iteration.

Bellman update, looks like this:

The iteration step, called a

Ui (s) = aeA(s) max Y7 P(s'|5,a)[R(s,a.s) + v Uls')],

(17.10)

where the update is assumed to be applied simultaneously to all the states at each iteration. If we apply the Bellman update infinitely often, we are guaranteed to reach an equilibrium

(see “convergence of value iteration” below), in which case the final utility values must be

solutions to the Bellman equations. In fact, they are also the unigue solutions, and the corre-

sponding policy (obtained using Equation (17.4)) is optimal. The detailed algorithm, including a termination condition when the utilities are “close enough,” is shown in Figure 17.6.

Notice that we make use of the Q-VALUE function defined on page 569. ‘We can apply value iteration to the 4x 3 world in Figure 17.1(a).

Starting with initial

values of zero, the utilities evolve as shown in Figure 17.7(a). Notice how the states at different distances from (4,3) accumulate negative reward until a path is found to (4,3), whereupon

the utilities start to increase.

We can think of the value iteration algorithm as propagating

information through the state space by means of local updates.

Bellman update

574

Chapter 17 Making Complex Decisions 1x107

558 g S

Iierations required

1x10°

0

5

10 15 20 25 30 Number of terations (a)

35

40

05

06

07 08 09 Discount factorY ®)

1

Figure 17.7 (a) Graph showing the evolution of the utilities of selected states using value iteration. (b) The number of value iterations required to guarantee an error of at most e =c-

Rumax, for different values of ¢, as a function of the discount factor 7. Convergence of value iteration

We said that value iteration eventually converges to a unique set of solutions of the Bellman

equations. In this section, we explain why this happens. We introduce some useful mathematical ideas along the way, and we obtain some methods for assessing the error in the utility function returned when the algorithm is terminated early; this is useful because it means that we don’t have to run forever. This section is quite technical.

Contraction

The basic concept used in showing that value iteration converges is the notion of a con-

traction. Roughly speaking, a contraction is a function of one argument that, when applied to two different inputs in turn, produces two output values that are “closer together.” by at least some constant factor, than the original inputs. For example, the function “divide by two™ is

a contraction, because, after we divide any two numbers by two, their difference is halved.

Notice that the “divide by two” function has a fixed point, namely zero, that is unchanged by

the application of the function. From this example, we can discern two important properties of contractions:

+ A contraction has only one fixed point; if there were two fixed points they would not

get closer together when the function was applied, so it would not be a contraction. « When the function is applied to any argument, the value must get closer to the fixed point (because the fixed point does not move), so repeated application of a contraction

always reaches the fixed point in the limit. Now, suppose we view the Bellman update (Equation (17.10)) as an operator B that is applied simultaneously to update the utility of every state. Then the Bellman equation becomes

U=BU and the Bellman update equation can be written as Uit1

Max norm

0.5 and Go otherwise. Once we have utilities c,(s) for all the conditional plans p of depth 1 in each physical state s, we can compute the utilities for conditional plans of depth 2 by considering each

possible first action, each possible subsequent percept, and then each way of choosing a

depth-1 plan to execute for each percept: [Stay; if Percept=A then Stay else Stay] [Stay; if Percept =A then Stay else Go|

[Go; if Percept=A then Stay else Stay]

Dominated plan

There are eight distinct depth-2 plans in all, and their utilities are shown in Figure 17.15(b). Notice that four of the plans, shown as dashed lines, are suboptimal across the entire belief space—we say these plans are dominated, and they need not be considered further. There are four undominated plans, cach of which is optimal in a specific region, as shown in Figure 17.15(c). The regions partition the belief-state space. We repeat the process for depth 3, and so on. In general, let p be a depth-d conditional plan whose initial action is @ and whose depth-(d — 1) subplan for percept e is p.e; then (17.18) ap(s) = );P(!\smlkts,a,:’) +7 L P(e|s)ape(s))This recursion naturally gives us a value iteration algorithm, which is given in Figure 17.16. The structure of the algorithm and its error analysis are similar to those of the basic value

iteration algorithm in Figure 17.6 on page 573; the main difference is that instead of computing one utility number for each state, POMDP-VALUE-ITERATION

maintains a collection of

undominated plans with their utility hyperplanes.

The algorithm’s complexity depends primarily on how many plans get generated. Given |A] actions and |E| possible observations, there are |A|?(£1"") distinct depth-d plans. Even for the lowly two-state world with d=8, that’s 2255 plans. The elimination of dominated plans is essential for reducing this doubly exponential growth: the number of undominated plans with d =8 is just 144. The utility function for these 144 plans is shown in Figure 17.15(d). Notice that the intermediate belief states have lower value than state A and state B, because in the intermediate states the agent lacks the information needed to choose a good action. This is why information has value in the sense defined in Section 16.6 and optimal policies in POMDPs often include information-gathering actions.

Section 17.5

Algorithms for Solving POMDPs

function POMDP-VALUE-ITERATION(pomdp, ¢) returns a utility function inputs: pomdp, a POMDP with states S, actions A(s). transition model P(s'|s.a). sensor model P(e|s). rewards R(s), discount y ¢, the maximum error allowed in the utility of any state local variables: U, U, sets of plans p with associated utility vectors o, U’ ¢ a set containing just the empty plan []. with a)(s) = R(s) repeat [

U’ the set of all plans consisting of an action and, for each possible next percept, aplanin U with utility vectors computed according to Equation (17.18) U’ REMOVE-DOMINATED-PLANS(U")

until MAX-DIFFERENCE(U,U’) return U

< e(1-7)/y

Figure 17.16 A high-level sketch of the value iteration algorithm for POMDPs. The REMOVE-DOMINATED-PLANS step and MAX-DIFFERENCE test are typically implemented as linear programs. Given such a utility function, an executable policy can be extracted by looking at which hyperplane is optimal at any given belief state b and executing the first action of the corresponding plan. In Figure 17.15(d), the corresponding optimal policy is still the same as for depth-1 plans: Stay when b(B) > 0.5 and Go otherwise. In practice, the value iteration algorithm in Figure 17.16 is hopelessly inefficient for larger problems—even the 4 x 3 POMDP is too hard. The main reason is that given n undominated

conditional plans at level d, the algorithm constructs |A|-nlE| conditional plans at level d + 1 before eliminating the dominated ones. With the four-bit sensor, |E| is 16, and n can be in the hundreds, so this is hopeless. Since this algorithm was developed in the 1970s, there have been several advances, in-

cluding more efficient forms of value iteration and various kinds of policy iteration algorithms.

Some

of these are discussed in the notes at the end of the chapter.

For general

POMDPs, however, finding optimal policies is very difficult (PSPACE-hard, in fact—that is, very hard indeed). The next section describes a different, approximate method for solving POMDPs, one based on look-ahead search.

17.5.2

Online algorithms for POMDPs

The basic design for an online POMDP agent is straightforward: it starts with some prior belief state; it chooses an action based on some deliberation process centered on its current

belief state; after acting, it receives an observation and updates its belief state using a filtering algorithm; and the process repeats.

One obvious choice for the deliberation process is the expectimax algorithm from Sec-

tion 17.2.4, except with belief states rather than physical states as the decision nodes in the tree. The chance nodes in the POMDP

tree have branches labeled by possible observations

and leading to the next belief state, with transition probabilities given by Equation (17.17). A fragment of the belief-state expectimax tree for the 4 x 3 POMDP is shown in Figure 17.17.

593

594

Chapter 17 Making Complex Decisions

Up.

1100

Righ,

o110/

1100

Down

Left

1010

0110

Figure 17.17 Part of an expectimax tree for the 4 x 3 POMDP with a uniform initial belief state. The belief states are depicted with shading proportional to the probability of being in each location. The time complexity of an exhaustive search to depth d is O(JA|" - [E[Y), where |A| is

the number of available actions and |E| is the number of possible percepts.

(Notice that

this is far less than the number of possible depth-d conditional plans generated by value iteration.)

As in the observable case, sampling at the chance nodes is a good way to cut

down the branching factor without losing too much accuracy in the final decision. Thus, the

complexity of approximate online decision making in POMDPs may not be drastically worse than that in MDPs.

For very large state spaces, exact filtering is infeasible, so the agent will need to run

an approximate filtering algorithm such as particle filtering (see page 492). Then the belief

POMCP

states in the expectimax tree become collections of particles rather than exact probability distributions. For problems with long horizons, we may also need to run the kind of long-range playouts used in the UCT algorithm (Figure 5.11). The combination of particle filtering and UCT applied to POMDPs goes under the name of partially observable Monte Carlo planning

or POMCP.

With a DDN

representation for the model, the POMCP

algorithm is, at least

in principle, applicable to very large and realistic POMDPs. Details of the algorithm are explored in Exercise 17.PoMC. POMCP is capable of generating competent behavior in the

4 x 3 POMDP. A short (and somewhat fortunate) example is shown in Figure 17.18. POMDP agents based on dynamic decision networks and online decision making have a

number of advantages compared with other, simpler agent designs presented in earlier chap-

ters. In particular, they handle partially observable, stochastic environments and can easily revise their “plans” to handle unexpected evidence. With appropriate sensor models, they can handle sensor failure and can plan to gather information. They exhibit “graceful degradation” under time pressure and in complex environments, using various approximation techniques.

So what is missing? The principal obstacle to real-world deployment of such agents is

the inability to generate successful behavior over long time-scales. Random or near-random

playouts have no hope of gaining any positive reward on, say, the task of laying the table

Summary

B 1000 111 0001

1010

1001

Figure 17.18 A sequence of percepts, belief states, and actions in the 4 x 3 POMDP with a wall-sensing error of ¢=0.2. Notice how the early Lefr moves are safe—they are very unlikely to fall into (4,2)—and coerce the agent’s location into a small number of possible locations. After moving Up, the agent thinks it is probably in (3,3). but possibly in (1.3). Fortunately, moving Right is a good idea in both cases, so it moves Right, finds out that it had been in (1,3) and is now in (2,3), and then continues moving Right and reaches the goal. for dinner, which might take tens of millions of motor-control actions.

It seems necessary

to borrow some of the hierarchical planning ideas described in Section 11.4. At the time of

writing, there are not yet satisfactory and efficient ways to apply these ideas in stochastic,

partially observable environments.

Summary

This chapter shows how to use knowledge about the world to make decisions even when the outcomes of an action are uncertain and the rewards for acting might not be reaped until many actions have passed. The main points are as follows: + Sequential decision problems in stochastic environments, also called Markov decision

processes, or MDPs, are defined by a transition model specifying the probabilistic

outcomes of actions and a reward function specifying the reward in each state. « The utility of a state sequence is the sum of all the rewards over the sequence, possibly discounted over time.

The solution of an MDP is a policy that associates a decision

with every state that the agent might reach. An optimal policy maximizes the utility of

the state sequences encountered when it is executed. + The utility of a state is the expected sum of rewards when an optimal policy is executed

from that state. The value iteration algorithm iteratively solves a set of equations

relating the utility of each state to those of its neighbors. + Policy iteration alternates between calculating the utilities of states under the current

policy and improving the current policy with respect to the current utilities.

* Partially observable MDPs, or POMDPs, are much more difficult to solve than are MDPs. They can be solved by conversion to an MDP in the continuous space of belief

states; both value iteration and policy iteration algorithms have been devised. Optimal behavior in POMDPs

includes information gathering to reduce uncertainty and there-

fore make better decisions in the future.

* A decision-theoretic agent can be constructed for POMDP environments.

The agent

uses a dynamic decision network to represent the transition and sensor models, to update its belief state, and to project forward possible action sequences.

We shall return MDPs and POMDPs in Chapter 22, which covers reinforcement learning methods that allow an agent to improve its behavior from experience.

595

596

Chapter 17 Making Complex Decisions Bibliographical and Historical Notes

Richard Bellman developed the ideas underlying the modern approach to sequential decision problems while working at the RAND Corporation beginning in 1949. According to his autobiography (Bellman, 1984), he coined the term “dynamic programming” to hide from a research-phobic Secretary of Defense, Charles Wilson, the fact that his group was doing mathematics. (This cannot be strictly true, because his first paper using the term (Bellman,

1952) appeared before Wilson became Secretary of Defense in 1953.) Bellman’s book, Dynamic Programming (1957), gave the new field a solid foundation and introduced the value iteration algorithm.

Shapley (1953b) actually described the value iteration algorithm independently of Bellman, but his results were not widely appreciated in the operations research community, perhaps because they were presented in the more general context of Markov games. Although the original formulations included discounting, its analysis in terms of stationary preferences was suggested by Koopmans (1972). The shaping theorem is due to Ng ef al. (1999). Ron Howard’s Ph.D. thesis (1960) introduced policy iteration and the idea of average reward for solving infinite-horizon problems. Several additional results were introduced

by Bellman and Dreyfus (1962). The use of contraction mappings in analyzing dynamic programming algorithms is due to Denardo (1967). Modified policy iteration is due to van Nunen (1976) and Puterman and Shin (1978). Asynchronous policy iteration was analyzed by Williams and Baird (1993), who also proved the policy loss bound in Equation (17.13). The general family of prioritized sweeping algorithms aims to speed up convergence to optimal policies by heuristically ordering the value and policy update calculations (Moore and Atkeson, 1993; Andre ez al., 1998; Wingate and Seppi, 2005).

The formulation of MDP-solving as a linear program is due to de Ghellinck (1960), Manne (1960), and D’Epenoux (1963). Although linear programming has traditionally been considered inferior to dynamic programming as an exact solution method for MDPs, de Farias and Roy (2003) show that it is possible to use linear programming and a linear representation of the utility function to obtain provably good approximate solutions to very large MDPs.

Papadimitriou and Tsitsiklis (1987) and Littman er al. (1995) provide general results on the

computational complexity of MDPs.

Yinyu Ye (2011) analyzes the relationship between

policy iteration and the simplex method for linear programming and proves that for fixed v,

the runtime of policy iteration is polynomial in the number of states and actions.

Seminal work by Sutton (1988) and Watkins (1989) on reinforcement learning methods

for solving MDPs played a significant role in introducing MDPs into the Al community. (Earlier work by Werbos (1977) contained many similar ideas, but was not taken up to the same extent.) Al researchers have pushed MDPs in the direction of more expressive representations that can accommodate much larger problems than the traditional atomic representations based on transition matrices.

The basic ideas for an agent architecture using dynamic decision networks were proposed by Dean and Kanazawa (1989a). Tatman and Shachter (1990) showed how to apply dynamic programming algorithms to DDN models. Several authors made the connection between MDPs and Al planning problems, developing probabilistic forms of the compact STRIPS representation for transition models (Wellman,

1990b; Koenig,

1991).

The book Planning

and Control by Dean and Wellman (1991) explores the connection in great depth.

Bibliographical and Historical Notes Later work on factored MDPs (Boutilier ez al., 2000; Koller and Parr, 2000; Guestrin et al., 2003b) uses structured representations of the value function as well as the transition model, with provable improvements in complexity. Relational MDPs (Boutilier et al., 2001;

Guestrin e al., 2003a) go one step further, using structured representations to handle domains

with many related objects. Open-universe MDPs and POMDPs (Srivastava et al., 2014b) also allow for uncertainty over the existence and identity of objects and actions.

Many authors have developed approximate online algorithms for decision making in MDPs, often borrowing explicitly from earlier AT approaches to real-time search and gameplaying (Werbos, 1992; Dean ez al., 1993; Tash and Russell, 1994). The work of Barto et al.

(1995) on RTDP (real-time dynamic programming) provided a general framework for understanding such algorithms and their connection to reinforcement learning and heuristic search.

The analysis of depth-bounded expectimax with sampling at chance nodes is due to Kearns

et al. (2002). The UCT algorithm described in the chapter is due to Kocsis and Szepesvari (2006) and borrows from earlier work on random playouts for estimating the values of states (Abramson, 1990; Briigmann, 1993; Chang er al., 2005).

Bandit problems were introduced by Thompson (1933) but came to prominence after

‘World War II through the work of Herbert Robbins (1952).

Bradt et al. (1956) proved the

first results concerning stopping rules for one-armed bandits, which led eventually to the

breakthrough results of John Gittins (Gittins and Jones, 1974; Gittins, 1989). Katehakis and

Veinott (1987) suggested the restart MDP as a method of computing Gittins indices. The text by Berry and Fristedt (1985) covers many variations on the basic problem, while the pellucid online text by Ferguson (2001) connects bandit problems with stopping problems. Lai and Robbins (1985) initiated the study of the asymptotic regret of optimal bandit

policies. The UCB heuristic was introduced and analyzed by Auer et al. (2002).

Bandit su-

perprocesses (BSPs) were first studied by Nash (1973) but have remained largely unknown

in AL Hadfield-Menell and Russell (2015) describe an efficient branch-and-bound algorithm

capable of solving relatively large BSPs. Selection problems were introduced by Bechhofer (1954). Hay et al. (2012) developed a formal framework for metareasoning problems, showing that simple instances mapped to selection rather than bandit problems. They also proved

the satisfying result that expected computation cost of the optimal computational strategy is

never higher than the expected gain in decision quality—although there are cases where the optimal policy may, with some probability, keep computing long past the point where any

possible gain has been used up. The observation that a partially observable MDP can be transformed into a regular MDP over belief states is due to Astrom (1965) and Aoki (1965).

The first complete algorithm

for the exact solution of POMDPs—essentially the value iteration algorithm presented in

this chapter—was proposed by Edward Sondik (1971) in his Ph.D. thesis. (A later journal paper by Smallwood and Sondik (1973) contains some errors, but is more accessible.)

Lovejoy (1991) surveyed the first twenty-five years of POMDP research, reaching somewhat pessimistic conclusions about the feasibility of solving large problems.

The first significant contribution within Al was the Witness algorithm (Cassandra et al., 1994; Kaelbling er al., 1998), an improved version of POMDP value iteration. Other algo-

rithms soon followed, including an approach due to Hansen (1998) that constructs a policy

incrementally in the form of a finite-state automaton whose states define the possible belief

states of the agent.

597

Factored MDP Relational MDP.

598

Chapter 17 Making Complex Decisions More recent work in Al has focused on point-based value iteration methods that, at each

iteration, generate conditional plans and a-vectors for a finite set of belief states rather than

for the entire belief space. Lovejoy (1991) proposed such an algorithm for a fixed grid of points, an approach taken also by Bonet (2002). An influential paper by Pineau ef al. (2003) suggested generating reachable points by simulating trajectories in a somewhat greedy fash-

ion; Spaan and Vlassis (2005) observe that one need generate plans for only a small, randomly

selected subset of points to improve on the plans from the previous iteration for all points in

the set. Shani ef al. (2013) survey these and other developments in point-based algorithms,

which have led to good solutions for problems with thousands of states. Because POMDPs are PSPACE-hard (Papadimitriou and Tsitsiklis, 1987), further progress on offline solution

methods may require taking advantage of various kinds of structure in value functions arising from a factored representation of the model.

The online approach for POMDPs—using look-ahead search to select an action for the

current belief state—was first examined by Satia and Lave (1973).

The use of sampling at

chance nodes was explored analytically by Keamns ef al. (2000) and Ng and Jordan (2000). ‘The POMCP algorithm is due to Silver and Veness (2011).

With the development of reasonably effective approximation algorithms for POMDPs,

their use as models for real-world problems has increased, particularly in education (Rafferty et al., 2016), dialog systems (Young et al., 2013), robotics (Hsiao et al., 2007; Huynh and

Roy, 2009), and self-driving cars (Forbes et al., 1995; Bai et al., 2015). An important largescale application is the Airborne Collision Avoidance System X (ACAS X), which keeps airplanes and drones from colliding midair. The system uses POMDPs with neural networks

to do function approximation. ACAS X significantly improves safety compared to the legacy

TCAS system, which was built in the 1970s using expert system technology (Kochenderfer, 2015; Julian er al., 2018).

Complex decision making has also been studied by economists and psychologists. They find that decision makers are not always rational, and may not be operating exactly as described by the models in this chapter. For example, when given a choice, a majority of people prefer $100 today over a guarantee of $200 in two years, but those same people prefer $200 Hyperbolic reward

in eight years over $100 in six years. One way to interpret this result is that people are not

using additive exponentially discounted rewards; perhaps they are using hyperbolic rewards (the hyperbolic function dips more steeply in the near term than does the exponential decay function). This and other possible interpretations are discussed by Rubinstein (2003).

The texts by Bertsekas (1987) and Puterman (1994) provide rigorous introductions to sequential decision problems and dynamic programming. Bertsekas and Tsitsiklis (1996) include coverage of reinforcement learning. Sutton and Barto (2018) cover similar ground but in a more accessible style. Sigaud and Buffet (2010), Mausam and Kolobov (2012) and

Kochenderfer (2015) cover sequential decision making from an Al perspective. Krishnamurthy (2016) provides thorough coverage of POMDPs.

TG

18

MULTIAGENT DECISION MAKING In which we examine what to do when more than one agent inhabits the environment.

18.1

Properties of Multiagent

Environments

So far, we have largely assumed that only one agent has been doing the sensing, planning, and

acting. But this represents a huge simplifying assumption, which fails to capture many realworld Al settings.

In this chapter, therefore, we will consider the issues that arise when an

agent must make decisions in environments that contain multiple actors. Such environments

are called multiagent systems, and agents in such a system face a multiagent planning Multiagent systems problem. However, as we will see, the precise nature of the multiagent planning problem— }13{25" Plarnine and the techniques that are appropriate for solving it—will depend on the relationships among the agents in the environment.

18.1.1

One decision maker

The first possibility is that while the environment contains multiple actors, it contains only one decision maker. In such a case, the decision maker develops plans for the other agents, and tells them what to do.

The assumption that agents will simply do what they are told

agent is called the benevolent agent assumption. However, even in this setting, plans involving Benevolent assumption

multiple actors will require actors to synchronize their actions. Actors A and B will have to

act at the same time for joint actions (such as singing a duet), at different times for mutually exclusive actions (such as recharging batteries when there is only one plug), and sequentially

when one establishes a precondition for the other (such as A washing the dishes and then B

drying them).

One special case is where we have a single decision maker with multiple effectors that

can operate concurrently—for example, a human who can walk and talk at the same time.

Such an agent needs to do multieffector planning to manage each effector while handling =45 e/iec*>" positive and negative interactions among the effectors.

When the effectors are physically

decoupled into detached units—as in a fleet of delivery robots in a factory—multieffector

planning becomes multibody planning.

A multibody problem is still a “standard” single-agent problem as long as the relevant sensor information collected by each body can be pooled—either centrally or within each body—to

form a common

Multibody planning

estimate of the world state that then informs the execution of

the overall plan; in this case, the multiple bodies can be thought of as acting as a single body. When communication constraints make this impossible, we have what is sometimes

called a decentralized planning problem; this is perhaps a misnomer, because the planning - Jeente!= —3. « Now suppose we change the rules to force O to reveal his strategy first, followed by E. Then the minimax value of this game is Up ., and because this game favors E we know that U is at most Up &

know U < 42.

With pure strategies, the value is +2 (see Figure 18.2(b)), so we

Combining these two arguments, we see that the true utility U of the solution to the original game must satisfy

Upo v(C)+Vv(D)

forall CCDCN

If a game is superadditive, then the grand coalition receives a value that is at least as high

as or higher than the total received by any other coalition structure. However, as we will see

shortly, superadditive games do not always end up with a grand coalition, for much the same reason that the players do not always arrive at a collectively desirable Pareto-optimal outcome in the prisoner’s dilemma.

18.3.2

Strategy in cooperative games

The basic assumption in cooperative game theory is that players will make strategic decisions about who they will cooperate with. Intuitively, players will not desire to work with unproductive players—they will naturally seek out players that collectively yield a high coalitional

value. But these sought-after players will be doing their own strategic reasoning. Before we can describe this reasoning, we need some further definitions.

An imputation for a cooperative game (N,) is a payoff vector that satisfies the follow-

Imputation

ing two conditions:

2= v(N)

x> v({i}) forallie N. The first condition says that an imputation must distribute the total value of the grand coali-

tion; the second condition, known as individual rationality, says that each player is at least

Individual rationality

as well off as if it had worked alone.

Given an imputation X = (x,...,x,) and a coalition C C N, we define x(C) to be the sum

¥jecXi—the total amount disbursed to C by the imputation x.

Next, we define the core of a game (N, v) as the set of all imputations X that satisfy the

condition x(C) > v(C) for every possible coalition C C N. Thus, if an imputation x is not in the core, then there exists some coalition C C N such that v(C) > x(C). The players in C

would refuse to join the grand coalition because they would be better off sticking with C.

The core of a game therefore consists of all the possible payoff vectors that no coalition

could object to on the grounds that they could do better by not joining the grand coalition.

Thus, if the core is empty, then the grand coalition cannot form, because no matter how the

grand coalition divided its payoff, some smaller coalition would refuse to join. The main

computational questions around the core relate to whether or not it is empty, and whether a

particular payoff distribution is in the core.

Core

628

Chapter 18 Multiagent Decision Making

The definition of the core naturally leads to a system of linear inequalities, as follows (the unknowns are variables xi....,.x,, and the values v(C) are constants): x>

v({i})

forallie N

v(C)

foralCCN

Lienxi = V(N) Yiecxi

=

Any solution to these inequalities will define an imputation in the core. We can formulate the

inequalities as a linear program by using a dummy objective function (for example, maximizing ¥y x;), which will allow us to compute imputations in time polynomial in the number

of inequalities.

The difficulty is that this gives an exponential number of inequalities (one

for each of the 2" possible coalitions). Thus, this approach yields an algorithm for checking

non-emptiness of the core that runs in exponential time. Whether we can do better than this

depends on the game being studied: for many classes of cooperative game, the problem of checking non-emptiness of the core is co-NP-complete. We give an example below. Before proceeding, let’s see an example of a superadditive game with an empty core. The game has three players N = {1,2,3}, and has a characteristic function defined as follows:

[

=2

o) = { 0 otherwise.

Now consider any imputation (x;,x2,x3) for this game. Since v(N) = 1, it must be the case that at least one player i has x; > 0, and the other two get a total payoff less than 1. Those two could benefit by forming a coalition without player i and sharing the value 1 among themselves. But since this holds for all imputations, the core must be empty.

The core formalizes the idea of the grand coalition being stable, in the sense that no

coalition can profitably defect from it. However, the core may contain imputations that are unreasonable, in the sense that one or more players might feel they were unfair.

N = {1,2}, and we have a characteristic function v defined as follows:

Suppose

v({1}) =v({2}) =5

v({1,2}) =20.

Here, cooperation yields a surplus of 10 over what players could obtain working in isolation,

and so intuitively, cooperation will make sense in this scenario. Now, it is easy to see that the

imputation (6, 14) is in the core of this game: neither player can deviate to obtain a higher

uiility. But from the point of view of player 1, this might appear unreasonable, because it gives 9/10 of the surplus to player 2. Thus, the notion of the core tells us when a grand Shapley value

coalition can form, but it does not tell us how to distribute the payoff.

The Shapley value is an elegant proposal for how to divide the v(N) value among the players, given that the grand coalition N formed. Formulated by Nobel laureate Lloyd Shapley in the early 1950s, the Shapley value is intended to be a fair distribution scheme. What does fair mean?

It would be unfair to distribute v(N) based on the eye color of

players, or their gender, or skin color. Students often suggest that the value v(N) should be

divided equally, which seems like it might be fair, until we consider that this would give the

same reward to players that contribute a lot and players that contribute nothing. Shapley’s insight was to suggest that the only fair way to divide the value v(N) was to do so according

Mond |

to how much each player contributed to creating the value v(N).

First we need to define the notion of a player’s marginal contribution. The marginal

Section 18.3 contribution that a player i makes to a coalition C

Cooperative Game Theory

629

is the value that i would add (or remove),

should i join the coalition C. Formally, the marginal contribution that player i makes to C is

denoted by mc;(C):

me;(C) = v(Cu{i}) - v(C). Now, a first attempt to define a payoff division scheme in line with Shapley’s suggestion

that players should be rewarded according to their contribution would be to pay each playeri the value that they would add to the coalition containing all other players:

mei(N —{i}). The problem is that this implicitly assumes that player i is the last player to enter the coalition.

So, Shapley suggested, we need to consider all possible ways that the grand coalition could form, that is, all possible orderings of the players N, and consider the value that i adds to the preceding players in the ordering. Then, a player should be rewarded by being paid the average marginal contribution that player i makes, over all possible orderings of the players, 10 the set of players preceding i in the ordering.

‘We let P denote all possible permutations (e.g., orderings) of the players N, and denote

members of P by p,p',... etc. Where p € P and i € N, we denote by p; the set of players

that precede i in the ordering p.

Then the Shapley value for a game G is the imputation

3(G) = (61(G),...,¢a(G)) defined as follows: 1

4(G) = fip();mc.(p,).

(18.1)

This should convince you that the Shapley value is a reasonable proposal. But the remark-

able fact is that it is the unique solution to a set of axioms that characterizes a “fair” payoff distribution scheme. We’ll need some more definitions before defining the axioms.

‘We define a dummy player as a player i that never adds any value to a coalition—that is,

mc;(C) =0 for all C C N — {i}. We will say that two players i and j are symmetric players

if they always make identical contributions to coalitions—that is, mc;(C) = me;(C) for all

C CN—{i,j}. Finally, where G = (N,v) and G’ = (N, V') are games with the same set of players, the game G + G’ is the game with the same player set, and a characteristic function

V" defined by v/(C) = v(C) + V/(C).

Given these definitions, we can define the fairness axioms satisfied by the Shapley value:

« Efficiency: ¥y 6i(G) = v(N). (All the value should be distributed.) « Dummy Player: 1f i is a dummy player in G then ¢;(G) = 0. (Players who never contribute anything should never receive anything.)

« Symmetry: 1f i and j are symmetric in G then ¢;(G) = ¢,(G). (Players who make identical contributions should receive identical payoffs.)

« Additivity: The value is additive over games: For all games G = (N, v) and G’ = (N, V), and for all players i € N, we have ¢;(G +G') = ¢i(G) + ¢i(G').

The additivity axiom is admittedly rather technical. If we accept it as a requirement, however,

we can establish the following key property: the Shapley value is the only way to distribute

coalitional value so as to satisfy these fairness axioms.

Dummy player

Symmetric players

630

Chapter 18 Multiagent Decision Making 18.3.3

Computation

in cooperative games

From a theoretical point of view, we now have a satisfactory solution. But from a computa-

tional point of view, we need to know how to compactly represent cooperative games, and

how to efficiently compute solution concepts such as the core and the Shapley value.

The obvious representation for a characteristic function would be a table listing the value

v(C) for all 2" coalitions. This is infeasible for large n. A number of approaches to compactly representing cooperative games have been developed, which can be distinguished by whether or not they are complete. A complete representation scheme is one that is capable of representing any cooperative game. The drawback with complete representation schemes is that there will always be some games that cannot be represented compactly. An alternative is to use a representation scheme that is guaranteed to be compact, but which is not complete.

Marginal contribution nets

Marginal conibation net

‘We now describe one representation scheme, called marginal contribution nets (MC-nets).

We will use a slightly simplified version to facilitate presentation, and the simplification makes it incomplete—the full version of MC-nets is a complete representation.

The idea behind marginal contribution nets is to represent the characteristic function of a

game (N,v) as a set of coalition-value rules, of the form: (C;,x;), where C; C N is a coalition

and x; is a number. To compute the value of a coalition C, we simply sum the values of all rules (C;,x;) such that C; C C. Thus, given a set of rules R = {(C1,x1),...,(C,x¢)}, the corresponding characteristic function is: v(C€) = Y {xi | (Cixi) eRand G; C C} Suppose we have a rule set R containing the following three rules:

{({1.2}5), ({212, ({344} Then, for example, we have: « v({1}) = 0 (because no rules apply), v({3}) = 4 v({1,3}) = v({2,3}) = v({1,2,3})

(third rule), 4 (third rule), 6 (second and third rules), and = 11 (first, second, and third rules).

‘With this representation we can compute the Shapley value in polynomial time.

The key

insight is that each rule can be understood as defining a game on its own, in which the players

are symmetric. By appealing to Shapley’s axioms of additivity and symmetry, therefore, the Shapley value ¢;(R) of player i in the game associated with the rule set R is then simply:

aR=

T

iec . 5er |{ 0fr otherwise The version of marginal contribution nets that we have presented here is not a complete repre-

sentation scheme: there are games whose characteristic function cannot be represented using rule sets of the form described above. A richer type of marginal contribution networks al-

Tows for rules of the form (¢,x), where ¢ is a propositional logic formula over the players

N: a coalition C satisfies the condition ¢ if it corresponds to a satisfying assignment for ¢.

Section 18.3

Cooperative Game Theory

e 11,121,841

(s

!l

A= (eaen) (Buas)

631

ool (nwes)

level3.

©2590)

a1

Figure 187 The coalition structure graph for N = {1,2,3,4}. Level 1 has coalition structures containing a single coalition; level 2 has coalition structures containing two coalitions, and 0 on. This scheme is a complete representation—in the worst case, we need a rule for every possible coalition. Moreover, the Shapley value can be computed in polynomial time with this scheme; the details are more involved than for the simple rules described above, although the

basic principle is the same; see the notes at the end of the chapter for references.

Coalition structures for maximum

social welfare

‘We obtain a different perspective on cooperative games if we assume that the agents share

a common purpose. For example, if we think of the agents as being workers in a company,

then the strategic considerations relating to coalition formation that are addressed by the core,

for example, are not relevant. Instead, we might want to organize the workforce (the agents)

into teams so as to maximize their overall productivity. More generally, the task is to find a coalition that maximizes the social welfare of the system, defined as the sum of the values of

the individual coalitions. We write the social welfare of a coalition structure CS as sw(CS), with the following definition:

sw(CSs) =Y v(C). cecs Then a socially optimal coalition structure CS* with respect to G maximizes this quantity.

Finding a socially optimal coalition structure is a very natural computational problem, which has been studied beyond the multiagent systems community:

it is sometimes called the set

partitioning partitioning problem. Unfortunately, the problem is NP-hard, because the number of possi- Set problem ble coalition structures grows exponentially in the number of players.

Finding the optimal coalition structure by naive exhaustive search is therefore infeasible

in general. An influential approach to optimal coalition structure formation is based on the

Coalition structure idea of searching a subspace of the coalition structure graph. The idea is best explained graph with reference to an example.

Suppose we have a game with four agents, N = {1,2,3,4}. There are fifteen possible

coalition structures for this set of agents.

We can organize these into a coalition structure

graph as shown in Figure 18.7, where the nodes at level £ of the graph correspond to all the coalition structures with exactly £ coalitions.

An upward edge in the graph represents

the division of a coalition in the lower node into two separate coalitions in the upper node.

632

Chapter 18 Multiagent Decision Making For example, there is an edge from {{1},{2,3,4}} to {{1},{2}.{3.4}} because this latter

coalition structure is obtained from the former by dividing the coalition {2,3,4} into the

coalitions {2} and {3,4}.

The optimal coalition structure CS* lies somewhere within the coalition structure graph,

and so to find this, it seems we would have to evaluate every node in the graph. But consider

the bottom two rows of the graph—Ilevels 1 and 2. Every possible coalition (excluding the empty coalition) appears in these two levels. (Of course, not every possible coalition structure

appears in these two levels.) Now, suppose we restrict our search for a possible coalition structure to just these two levels—we go no higher in the graph. Let CS’ be the best coalition

structure that we find in these two levels, and let CS* be the best coalition structure overall.

Let C* be a coalition with the highest value of all possible coalitions:

C* € argmax v(C). CeN

The value of the best coalition structure we find in the first two levels of the coalition structure

graph must be at least as much as the value of the best possible coalition: sw(CS') > v(C*).

This is because every possible coalition appears in at least one coalition structure in the first two levels of the graph. So assume the worst case, that is, sw(CS') = v(C*).

Compare the value of sw(CS') to sw(CS*). Since sw(CS') is the highest possible value

of any coalition structure, and there are n agents (n = 4 in the case of Figure 18.7), then the

highest possible value of sw(CS*) would be nv(C*) = n-sw(CS'). In other words, in the

worst possible case, the value of the best coalition structure we find in the first two levels of

the graph would be L the value of the best, where n is the number of agents. Thus, although

searching the first two levels of the graph does not guarantee to give us the optimal coalition

structure, it does guarantee to give s one that is no worse that & of the optimal. In practice it will often be much better than that.

18.4

Making Collective Decisions

‘We will now turn from agent design to mechanism design—the problem of designing the right game for a collection of agents to play. Formally, a mechanism consists of 1. A language for describing the set of allowable strategies that agents may adopt.

Center

2. A distinguished agent, called the center, that collects reports of strategy choices from the agents in the game. (For example, the auctioneer is the center in an auction.)

3. An outcome rule, known to all agents, that the center uses to determine the payoffs to

each agent, given their strategy choices. This section discusses some of the most important mechanisms.

Contract net protocol

18.4.1

Allocating tasks with the contract net

The contract net protocol is probably the oldest and most important multiagent problem-

solving technique studied in AL It is a high-level protocol for task sharing. As the name suggests, the contract net was inspired from the way that companies make use of contracts.

The overall contract net protocol has four main phases—see Figure 18.8. The process

starts with an agent identifying the need for cooperative action with respect to some task.

The need might arise because the agent does not have the capability to carry out the task

Section 18.4

problem recognition

27 awarding

633

lask\ o ouncement %

‘X’

T _X_

Making Collective Decisions

® «4 * bidding

'X'

Figure 18.8 The contract net task allocation protocol.

in isolation, or because a cooperative solution might in some way be better (faster, more efficient, more accurate).

The agent advertises the task to other agents in the net with a task announcement mes-

sage, and then acts as the manager of that task for its duration.

The task announcement

Task announcement Manager

message must include sufficient information for recipients to judge whether or not they are willing and able to bid for the task. The precise information included in a task announcement

will depend on the application area. It might be some code that needs to be executed; or it ‘might be a logical specification of a goal to be achieved. The task announcement might also include other information that might be required by recipients, such as deadlines, quality-ofservice requirements, and so on. ‘When an agent receives a task announcement, it must evaluate it with respect to its own

capabilities and preferences. In particular, each agent must determine, whether it has the

capability to carry out the task, and second, whether or not it desires to do so. On this basis, it may then submit a bid for the task. A bid will typically indicate the capabilities of the bidder that are relevant to the advertised task, and any terms and conditions under which the task will be carried out.

In general, a manager may receive multiple bids in response to a single task announcement. Based on the information in the bids, the manager selects the most appropriate agent

(or agents) to execute the task. Successful agents are notified through an award message, and

become contractors for the task, taking responsibility for the task until it is completed.

The main computational tasks required to implement the contract net protocol can be

summarized as follows:

&id

634

Chapter 18 Multiagent Decision Making « Task announcement processing. On receipt ofa task announcement, an agent decides if it wishes to bid for the advertised task.

« Bid processing. On receiving multiple bids, the manager must decide which agent to award the task to, and then award the task. * Award processing. Successful bidders (contractors) must attempt to carry out the task,

which may mean generating new subtasks, which are advertised via further task announcements.

Despite (or perhaps because of) its simplicity, the contract net is probably the most widely

implemented and best-studied framework for cooperative problem solving. It is naturally

applicable in many settings—a variation of it is enacted every time you request a car with

Uber, for example. 18.4.2

Allocating scarce resources with auctions

One of the most important problems in multiagent systems is that of allocating scarce re-

sources; but we may as well simply say “allocating resources,” since in practice most useful

Auction

resources are scarce in some sense. The auction is the most important mechanism for allo-

Bidder

there are multiple possible bidders. Each bidder i has a utility value v; for the item.

cating resources. The simplest setting for an auction is where there is a single resource and In some cases, each bidder has a private value for the item. For example, a tacky sweater

might be attractive to one bidder and valueless to another.

In other cases, such as auctioning drilling rights for an oil tract, the item has a com-

mon value—the tract will produce some amount of money, X, and all bidders value a dollar

equally—but there is uncertainty as to what the actual value of X is. Different bidders have different information, and hence different estimates of the item’s true value.

In either case,

bidders end up with their own v;. Given v;, each bidder gets a chance, at the appropriate time

or times in the auction, to make a bid b;. The highest bid, b,,q,, wins the item, but the price Ascending-bid auction English auction

paid need not be b,,; that’s part of the mechanism design. The best-known auction mechanism is the ascending-bid auction,’ or English auction,

in which the center starts by asking for a minimum (or reserve) bid by, If some bidder is willing to pay that amount, the center then asks for by, + d, for some increment d, and continues up from there. The auction ends when nobody is willing to bid anymore; then the last bidder wins the item, paying the price bid. How do we know if this is a good mechanism? One goal is to maximize expected revenue for the seller. Another goal is to maximize a notion of global utility. These goals overlap to

some extent, because one aspect of maximizing global utility is to ensure that the winner of

Efficient

the auction is the agent who values the item the most (and thus is willing to pay the most). We

say an auction is efficient if the goods go to the agent who values them most. The ascendingbid auction is usually both efficient and revenue maximizing, but if the reserve price is set too

high, the bidder who values it most may not bid, and if the reserve is set too low, the seller

Collusion

may get less revenue. Probably the most important things that an auction mechanism can do is encourage a sufficient number of bidders to enter the game and discourage them from engaging in collusion.

Collusion is an unfair or illegal agreement by two or more bidders to manipulate prices. It can 3 The word ion” comes from the L augeo, to increase.

Section 18.4

Making Collective Decisions

635

happen in secret backroom deals or tacitly, within the rules of the mechanism. For example, in 1999, Germany auctioned ten blocks of cellphone spectrum with a simultaneous auction

(bids were taken on all ten blocks at the same time), using the rule that any bid must be a minimum of a 10% raise over the previous bid on a block. There were only two credible bidders, and the first, Mannesman, entered the bid of 20 million deutschmark on blocks 1-5 and 18.18 million on blocks 6-10. Why 18.18M? One of T-Mobile’s managers said they “interpreted Mannesman’s first bid as an offer.” Both parties could compute that a 10% raise on 18.18M

is 19.99M; thus Mannesman’s bid was interpreted as saying “we can each get half the blocks for 20M:; let’s not spoil it by bidding the prices up higher”” And in fact T-Mobile bid 20M on blocks 6-10 and that was the end of the bidding.

The German government got less than they expected, because the two competitors were

able to use the bidding mechanism to come to a tacit agreement on how not to compete.

From the government’s point of view, a better result could have been obtained by any of these

changes to the mechanism: a higher reserve price; a sealed-bid first-price auction, so that the competitors could not communicate through their bids; or incentives to bring in a third bidder.

Perhaps the 10% rule was an error in mechanism design, because it facilitated the

precise signaling from Mannesman to T-Mobile.

In general, both the seller and the global utility function benefit if there are more bidders,

although global utility can suffer if you count the cost of wasted time of bidders that have no

chance of winning. One way to encourage more bidders is to make the mechanism easier for them. After all, if it requires too much research or computation on the part of the bidders, they may decide to take their money elsewhere. So it is desirable that the bidders have a dominant strategy. Recall that “dominant™ means that the strategy works against all other strategies, which in turn means that an agent

can adopt it without regard for the other strategies. An agent with a dominant strategy can just bid, without wasting time contemplating other agents’ possible strategies. A mechanism by

which agents have a dominant strategy is called a strategy-proof mechanism. If, as is usually Strategy-proof the case, that strategy involves the bidders revealing their true value, v;, then it is called a truth-revealing, or truthful, auction; the term incentive compatible is also used.

The

revelation principle states that any mechanism can be transformed into an equivalent truth-

revealing mechanism, so part of mechanism design is finding these equivalent mechanisms.

It turns out that the ascending-bid auction has most of the desirable properties. The bidder

with the highest value v; gets the goods at a price of b, +d, where b, is the highest bid among all the other agents and d is the auctioneer’s increment.*

Bidders have a simple dominant

strategy: keep bidding as long as the current cost is below your v;. The mechanism is not

quite truth-revealing, because the winning bidder reveals only that his v; > b, +d; we have a

lower bound on v; but not an exact amount.

A disadvantage (from the point of view of the seller) of the ascending-bid auction is that

it can discourage competition. Suppose that in a bid for cellphone spectrum there is one

advantaged company that everyone agrees would be able to leverage existing customers and

infrastructure, and thus can make a larger profit than anyone else. Potential competitors can

see that they have no chance in an ascending-bid auction, because the advantaged company

4 There is actually a small chance that the agent with highest v; fails to gt the goods, in the case in which by < v; < by +d. The chance of this can be made arbitrarily small by decreasing the increment d.

Truth-revealing

Revelation principle

636

Chapter 18 Multiagent Decision Making can always bid higher. Thus, the competitors may not enter at all, and the advantaged company ends up winning at the reserve price.

Another negative property of the English auction is its high communication costs. Either the auction takes place in one room or all bidders have to have high-speed, secure communi-

cation lines; in either case they have to have time to go through several rounds of bidding.

Sealed-bid auction

An alternative mechanism, which requires much less communication,

is the sealed-bid

auction. Each bidder makes a single bid and communicates it to the auctioneer, without the

other bidders seeing it. With this mechanism, there is no longer a simple dominant strategy.

If your value is v; and you believe that the maximum of all the other agents” bids will be b,, then you should bid b, + ¢, for some small ¢, if that is less than v;. Thus, your bid depends on

your estimation of the other agents’ bids, requiring you to do more work. Also, note that the

agent with the highest v; might not win the auction. This is offset by the fact that the auction is more competitive, reducing the bias toward an advantaged bidder.

Sealed-bid second-price auction

Vickrey auction

A small change in the mechanism for sealed-bid auctions leads to the sealed-bid second-

price auction, also known as a Vickrey auction.’

In such auctions, the winner pays the

price of the second-highest bid, b,, rather than paying his own bid. This simple modification completely eliminates the complex deliberations required for standard (or first-price) sealedbid auctions, because the dominant strategy is now simply to bid v;; the mechanism is truth-

revealing. Note that the utility of agent i in terms of his bid b;, his value v;, and the best bid among the other agents, b,, is

Ui

_ { (i=by) ifbi>b, 0

otherwise.

To see that b; = v; is a dominant strategy, note that when (v; — b,) is positive, any bid that wins the auction is optimal, and bidding v; in particular wins the auction. On the other hand, when

(vi—b,) is negative, any bid that loses the auction is optimal, and bidding v; in particular loses the auction. So bidding v; is optimal for all possible values of b,, and in fact, v; is the only bid that has this property. Because of its simplicity and the minimal computation requirements for both seller and bidders, the Vickrey auction is widely used in distributed Al systems.

Internet search engines conduct several trillion auctions each year to sell advertisements

along with their search results, and online auction sites handle $100 billion a year in goods, all using variants of the Vickrey auction. Note that the expected value to the seller is b,,

Revenue equivalence theorem

which is the same expected return as the limit of the English auction as the increment d goes to zero. This is actually a very general result: the revenue equivalence theorem states that, with a few minor caveats, any auction mechanism in which bidders have values v; known only

to themselves (but know the probability distribution from which those values are sampled),

will yield the same expected revenue. This principle means that the various mechanisms are not competing on the basis of revenue generation, but rather on other qualities.

Although the second-price auction is truth-revealing, it turns out that auctioning n goods

with an n+ 1 price auction is not truth-revealing. Many Internet search engines use a mech-

anism where they auction n slots for ads on a page. The highest bidder wins the top spot,

the second highest gets the second spot, and so on. Each winner pays the price bid by the

next-lower bidder, with the understanding that payment is made only if the searcher actually

5 Named after William Vickrey (1914-1996), who won the 1996 Nobel Prize in economics for this work and died of a heart attack three days later.

Section 18.4

Making Collective Decisions

637

clicks on the ad. The top slots are considered more valuable because they are more likely to be noticed and clicked on.

Imagine that three bidders, by, b, and b3, have valuations for a click of vy =200, v, = 180,

and v3 =100, and that n = 2 slots are available; and it is known that the top spot is clicked on

5% of the time and the bottom spot 2%. If all bidders bid truthfully, then b; wins the top slot

and pays 180, and has an expected return of (200— 180) x 0.05= 1. The second slot goes to by. But by can see that if she were to bid anything in the range 101-179, she would concede

the top slot to by, win the second slot, and yield an expected return of (200 — 100) x .02=2. Thus, b; can double her expected return by bidding less than her true value in this case. In general, bidders in this n+ 1 price auction must spend a lot of energy analyzing the bids of others to determine their best strategy; there is no simple dominant strategy.

Aggarwal et al. (2006) show that there is a unique truthful auction mechanism for this

multislot problem, in which the winner of slot j pays the price for slot j just for those addi-

tional clicks that are available at slot j and not at slot j+ 1. The winner pays the price for the lower slot for the remaining clicks. In our example, by would bid 200 truthfully, and would

pay 180 for the additional .05 —.02=.03 clicks in the top slot, but would pay only the cost of the bottom slot, 100, for the remaining .02 clicks. (200— 180) x .03+ (200— 100) x .02=2.6.

Thus, the total return to b; would be

Another example of where auctions can come into play within Al is when a collection of agents are deciding whether to cooperate on a joint plan. Hunsberger and Grosz (2000) show that this can be accomplished efficiently with an auction in which the agents bid for roles in

the joint plan. Common

goods

Now let’s consider another type of game, in which countries set their policy for controlling air pollution. Each country has a choice: they can reduce pollution at a cost of -10 points for

implementing the necessary changes, or they can continue to pollute, which gives them a net

utility of -5 (in added health costs, etc.) and also contributes -1 points to every other country

(because the air is shared across countries). Clearly, the dominant strategy for each country

is “continue to pollute,” but if there are 100 countries and each follows this policy, then each

country gets a total utility of -104, whereas if every country reduced pollution, they would

each have a utility of -10. This situation is called the tragedy of the commons: if nobody has

to pay for using a common resource, then it may be exploited in a way that leads to a lower total utility for all agents.

It is similar to the prisoner’s dilemma:

Tragedy of the commons

there is another solution

to the game that is better for all parties, but there appears to be no way for rational agents to

arrive at that solution under the current game. One approach for dealing with the tragedy of the commons s to change the mechanism to one that charges each agent for using the commons.

More generally, we need to ensure that

all externalities—effects on global utility that are not recognized in the individual agents’ transactions—are made explicit.

Setting the prices correctly is the difficult part. In the limit, this approach amounts to

creating a mechanism in which each agent is effectively required to maximize global utility,

but can do so by making a local decision. For this example, a carbon tax would be an example of a mechanism that charges for use of the commons

maximizes global utility.

in a way that, if implemented well,

Externalities

638

VCG

Chapter 18 Multiagent Decision Making It turns out there is a mechanism design, known as the Vickrey-Clarke-Groves or VCG

mechanism, which has two favorable properties.

First, it is utility maximizing—that is, it

maximizes the global utility, which is the sum of the utilities for all parties, };v;.

Second,

the mechanism is truth-revealing—the dominant strategy for all agents is to reveal their true value. There is no need for them to engage in complicated strategic bidding calculations. We will give an example using the problem of allocating some common goods. Suppose a city decides it wants to install some free wireless Internet transceivers. However, the number

of transceivers available is less than the number of neighborhoods that want them. The city

wants to maximize global utility, but if it says to each neighborhood council you value a free transceiver (and by the way we will give them to the parties the most)?” then each neighborhood will have an incentive to report a very VCG mechanism discourages this ploy and gives them an incentive to report It works as follows:

1. The center asks each agent to report its value for an item, v;. 2. The center allocates the goods to a set of winners, W, to maximize ;¢

“How much do that value them high value. The their true value. v;.

3. The center calculates for each winning agent how much of a loss their individual pres-

ence in the game has caused to the losers (who each got 0 utility, but could have got v;

if they were a winner).

4. Each winning agent then pays to the center a tax equal (o this loss. For example, suppose there are 3 transceivers available and 5 bidders, who bid 100, 50, 40, 20, and 10. Thus the set of 3 winners, W, are the ones who bid 100, 50, and 40 and the

global utility from allocating these goods is 190. For each winner, it is the case that had they

not been in the game, the bid of 20 would have been a winner. Thus, each winner pays a tax

of 20 to the center.

All winners should be happy because they pay a tax that is less than their value, and all losers are as happy as they can be, because they value the goods less than the required

tax. That’s why the mechanism is truth-revealing. In this example, the crucial value is 20; it would be irrational to bid above 20 if your true value was actually below 20, and vice versa.

Since the crucial value could be anything (depending on the other bidders), that means that is always irrational to bid anything other than your true value.

The VCG mechanism is very general, and can be applied to all sorts of games, not just auctions, with a slight generalization of the mechanism described above. For example, in a

ccombinatorial auction there are multiple different items available and each bidder can place

multiple bids, each on a subset of the items. For example, in bidding on plots of land, one bidder might want either plot X or plot Y but not both; another might want any three adjacent plots, and so on. The VCG mechanism can be used to find the optimal outcome, although

with 2V subsets of N goods to contend with, the computation of the optimal outcome is NPcomplete. With a few caveats the VCG mechanism is unique: every other optimal mechanism is essentially equivalent.

18.4.3

Social choice theory

Voting

The next class of mechanisms that we look at are voting procedures, of the type that are used for political decision making in democratic societies. The study of voting procedures derives from the domain of social choice theory.

Section 18.4 The basic setting is as follows.

Making Collective Decisions

639

As usual, we have a set N = {1,...,n} of agents, who

in this section will be the voters. These voters want to make decisions with respect to a set

Q= {wy,w,...} of possible outcomes. In a political election, each element of Q could stand for a different candidate winning the election.

Each voter will have preferences over Q. These are usually expressed not as quantitative

utilities but rather as qualitative comparisons:

we write w >~; w' to mean that outcome w is

ranked above outcome w' by agent i. In an election with three candidates, agent i might have

W

w3 = Wi The fundamental problem of social choice theory is to combine these preferences, using Social welfare a social welfare function, to come up with a social preference order: a ranking of the Function candidates, from most preferred down to least preferred. In some cases, we are only interested

in a social outcome—the most preferred outcome by the group as a whole. We will write Social outcome

w =" w' to mean that w is ranked above w’ in the social preference order.

A simpler setting is where we are not concerned with obtaining an entire ordering of

candidates, but simply want to choose a set of winners.

A social choice function takes as

input a preference order for each voter, and produces as output a set of winners.

Democratic societies want a social outcome that reflects the preferences of the voters.

Unfortunately, this is not always straightforward. Consider Condorcet’s Paradox, a famous

example posed by the Marquis de Condorcet (1743-1794). Suppose we have three outcomes, Q = {w,,wp,w,}, and three voters, N = {1,2,3}, with preferences as follows. Wa =1 Wp =1 We

We >2 Wa =2 Wh

(18.2)

Wp >3 We =3 Wa

Now, suppose we have to choose one of the three candidates on the basis of these preferences. The paradox is that: * 2/3 of the voters prefer w; over w;. + 2/3 of the voters prefer w; over ws.

* 2/3 of the voters prefer w) over ws. So, for each possible winner, we can point to another candidate who would be preferred by at least 2/3 of the electorate. It is obvious that in a democracy we cannot hope to make every

voter happy. This demonstrates that there are scenarios in which no matter which outcome we choose, a majority of voters will prefer a different outcome. A natural question is whether there is any “good” social choice procedure that really reflects the preferences of voters. To answer this, we need to be precise about what we mean when we say that a rule is “good.” ‘We will list some properties we would like a good social welfare function to satis

* The Pareto Condition:

above wj, then w; =* w;.

* The Condorcet

The Pareto condition simply says that if every voter ranks w;

Winner Condition:

An outcome is said to be a Condorcet winner if

a majority of candidates prefer it over all other outcomes. To put it another way, a

Condorcet winner is a candidate that would beat every other candidate in a pairwise

election. The Condorcet winner condition says that if w; is a Condorcet winner, then w; should be ranked first.

« Independence of Irrelevant Alternatives (IIA): Suppose there are a number of candidates, including w; and w;, and voter preferences are such that w; = w;. Now, suppose

Social choice function Condorcet's Paradox

Chapter 18 Multiagent Decision Making

Arrow's theorem

Simple majority vote

Plurality voting

one voter changed their preferences in some way, but nor about the relative ranking of wi and w;. The TTA condition says that, w; * w; should not change. « No Dictatorships: 1t should not be the case that the social welfare function simply outputs one voter’s preferences and ignores all other voters. These four conditions seem reasonable, but a fundamental theorem of social choice theory called Arrow’s theorem (due to Kenneth Arrow) tells s that it is impossible to satisfy all four conditions (for cases where there are at least three outcomes). That means that for any social choice mechanism we might care to pick, there will be some situations (perhaps unusual or pathological) that lead to controversial outcomes. However, it does not mean that democratic decision making is hopeless in most cases. We have not yet seen any actual voting procedures, so let’s now look at some. = With just two candidates, simple majority vote (the standard method in the US and UK) s the favored mechanism. We ask each voter which of the two candidates they prefer, and the one with the most votes is the winner. + With more than two outcomes, plurality voting is a common system.

We ask each

voter for their top choice, and select the candidate(s) (more than one in the case of ties) who get the most votes, even if nobody gets a majority. While it is common, plurality

voting has been criticized for delivering unpopular outcomes. A key problem is that it only takes into account the top-ranked candidate in each voter’s preferences.

Borda count

+ The Borda count (after Jean-Charles de Borda, a contemporary and rival of Condorcet)

is a voting procedure that takes into account all the information in a voter’s preference

ordering. Suppose we have k candidates. Then for each voter i, we take their preference

ordering -;, and give a score of k to the top ranked candidate, a score of k — 1 to the

second-ranked candidate, and so on down to the least-favored candidate in i’s ordering. The total score for each candidate is their Borda count, and to obtain the social outcome

~*, outcomes are ordered by their Borda count—highest to lowest. One practical prob-

lem with this system is that it asks voters to express preferences on all the candidates,

Approval voting

and some voters may only care about a subset of candidates.

+ In approval voting, voters submit a subset of the candidates that they approve of. The winner(s) are those who are approved by the most voters. This system is often used when the task is to choose multiple winners.

Instant runoff voting

« In instant runoff voting, voters rank all the candidates, and if a candidate has a major-

ity of first-place votes, they are declared the winner. If not, the candidate with the fewest

first-place votes is eliminated. That candidate is removed from all the preference rank-

ings (so those voters who had the eliminated candidate as their first choice now have

another candidate as their new first choice) and the process is repeated.

True majority rule voting

Eventually,

some candidate will have a majority of first-place votes (unless there is a tie).

+ In true majority rule voting, the winner is the candidate who beats every other can-

didate in pairwise comparisons. Voters are asked for a full preference ranking of all

candidates. We say that w beats ', if more voters have w > w’ than have w’ - w. This

system has the nice property that the majority always agrees on the winner, but it has the bad property that not every election will be decided:

example, no candidate wins a majority.

in the Condorcet paradox, for

Section 18.4

Making Collective Decisions

641

Strategic manipulation Besides Arrow’s Theorem,

another important negative results in the area of social choice

theory is the Gibbard-Satterthwaite Theorem.

This result relates to the circumstances

under which a voter can benefit from misrepresenting their preferenc

Gibbard— Satterthwaite Theorem

Recall that a social choice function takes as input a preference order for each voter, and

gives as output a set of winning candidates. Each voter has, of course, their own true prefer-

ences, but there is nothing in the definition of a social choice function that requires voters to

report their preferences truthfully; they can declare whatever preferences they like.

In some cases, it can make sense for a voter to misrepresent their preferences. For exam-

ple, in plurality voting, voters who think their preferred candidate has no chance of winning may vote for their second choice instead. That means plurality voting is a game in which voters have to think strategically (about the other voters) to maximize their expected utility.

This raises an interesting question: can we design a voting mechanism that is immune to

such manipulation—a mechanism that is truth-revealing? The Gibbard-Satterthwaite Theo-

rem tells us that we can not: Any social choice function that satisfies the Pareto condition for a domain with more than two outcomes is either manipulable or a dictatorship. That is, for any “reasonable” social choice procedure, there will be some circumstances under which a

voter can in principle benefit by misrepresenting their preferences. However, it does not tell

us how such manipulation might be done; and it does not tell us that such manipulation is

likely in practice. 18.4.4

Bargaining

Bargaining, or negotiation, is another mechanism that is used frequently in everyday life. It has been studied in game theory since the 1950s and more recently has become a task for automated agents. Bargaining is used when agents need to reach agreement on a matter of common interest. The agents make offers (also called proposals or deals) to each other under specific protocols, and either accept or reject cach offer. Bargaining with the alternating offers protocol

offers One influential bargaining protocol is the alternating offers bargaining model. For simplic- Alternating bargaining model ity we’ll again

assume just two agents. Bargaining takes place in a sequence of rounds. A;

begins, at round 0, by making an offer. If A, accepts the offer, then the offer is implemented. If Ay rejects the offer, then negotiation moves to the next round. This time A> makes an offer and A chooses to accept or reject it, and so on. If the negotiation never terminates (because

agents reject every offer) then we define the outcome to be the conflict deal. A convenient

simplifying assumption is that both agents prefer to reach an outcome—any outcome—in finite time rather than being stuck in the infinitely time-consuming conflict deal. ‘We will use the scenario of dividing a pie to illustrate alternating offers. The idea is that

Conflict deal

there is some resource (the “pie”) whose value is 1, which can be divided into two parts, one

part for each agent. Thus an offer in this scenario is a pair (x,1 —x), where x is the amount

of the pie that A; gets, and 1 — x is the amount that A, gets. The space of possible deals (the negotiation set) is thus:

{(1-2):0. Agent A can take this fact into account by offering (1 —72,72), an

Section 18.4

Making Collective Decisions

offer that A, may as well accept because A, can do no better than 7, at this point in time. (If you are worried about what happens with ties, just make the offer be (1 — (v, +¢€),72 +€) for

some small value of ¢.) So, the two strategies of A

offering (1 —~2,72), and A accepting that offer are in Nash

equilibrium. Patient players (those with a larger ~2) will be able to obtain larger pieces of the pie under this protocol: in this setting, patience truly is a virtue. Now consider the general case, where there are no bounds on the number of rounds. As

in the I-round case, A, can craft a proposal that A, should accept, because it gives A, the maximal

achievable amount, given the discount factors. It turns out that A; will get

I-m

-

and A will get the remainder. Negotiation

in task-oriented domains

In this section, we consider negotiation for task-oriented domains. In such a domain, a set of

tasks must be carried out, and each task is initially assigned to a set of agents. The agents may

Task-oriented domain

be able to benefit by negotiating on who will carry out which tasks. For example, suppose

some tasks are done on a lathe machine and others on a milling machine, and that any agent

using a machine must incur a significant setup cost. Then it would make sense for one agent 1o offer another “T have to set up on the milling machine anyway; how about if I do all your milling tasks, and you do all my lathe tasks?”

Unlike the bargaining scenario, we start with an initial allocation, so if the agents fail to

agree on any offers, they perform the tasks T that they were originally allocated.

To keep things simple, we will again assume just two agents. Let 7 be the set of all tasks

and let (7}, 7) denote the initial allocation of tasks to the two agents at time 0. Each task in T must be assigned to exactly one agent.

We assume we have a cost function ¢, which

for every set of tasks 7" gives a positive real number ¢(7”) indicating the cost to any agent

of carrying out the tasks 7”. (Assume the cost depends only on the tasks, not on the agent

carrying out the task.) The cost function is monotonic—adding more tasks never reduces the cost—and the cost of doing nothing is zero: ¢({}) =

0. As an example, suppose the cost of

setting up the milling machine is 10 and each milling task costs 1, then the cost of a set of two milling tasks would be 12, and the cost of a set of five would be 15.

An offer of the form (7;,7>) means that agent i is committed to performing the set of tasks 7;, at cost ¢(7;). The utility to agent i is the amount they have to gain from accepting the offer—the difference between the cost of doing this new set of tasks versus the originally

assigned set of tasks:

Ui((T,T2) = e(T3) = e(TY)

An offer (T, T5) is individually rational if U;((7},73)) > 0 for both agents. If a deal is not Individually rational individually rational, then at least one agent can do better by simply performing the tasks it

was originally allocated.

The negotiation set for task-oriented domains (assuming rational agents) is the set of

offers that are both individually rational and Pareto optimal. There is no sense making an

individually irrational offer that will be refused, nor in making an offer when there is a better offer that improves one agent’s utility without hurting anyone else.

Chapter 18 Multiagent Decision Making

The monotonic concession protocol Mornotonic concession protocol

The negotiation protocol we consider for task-oriented domains

is known as the monotonic

concession protocol. The rules of this protocol are as follows. + Negotiation proceeds in a series of rounds. + On the first round, both agents simultaneously propose a deal, D; = (T}, T3), from the negotiation set. (This is different from the alternating offers we saw before.)

+ An agreement is reached if the two agents propose deals D; and D;, respectively, such

that either (i) Uy (D) > Uy (Dy) or (i) Us(Dy) > Us(D3), that is, if one of the agents finds that the deal proposed by the other is at least as good or better than the proposal it made. If agreement is reached, then the rule for determining the agreement deal is as follows:

If each agent’s offer matches or exceeds that of the other agent, then one of

the proposals is selected at random. If only one proposal exceeds or matches the other’s proposal, then this is the agreement deal. « If no agreement is reached, then negotiation proceeds to another round of simultaneous

Concession

proposals. In round 7 + 1, each agent must either repeat the proposal from the previous

round or make a concession—a proposal that is more preferred by the other agent (i.e.,

has higher utility).

« If neither agent makes a concession, then negotiation terminates, and both agents im-

plement the conflict deal, carrying out the tasks they were originally assigned.

Since the set of possible deals is finite, the agents cannot negotiate indefinitely:

either the

agents will reach agreement, or a round will occur in which neither agent concedes. However, the protocol does not guarantee that agreement will be reached quickly: since the number of

possible deals is O(271), it is conceivable that negotiation will continue for a number of

rounds exponential in the number of tasks to be allocated. The Zeuthen strategy

Zeuthen strategy

So far, we have said nothing about how negotiation participants might or should behave when using the monotonic concession protocol for task-oriented domains. One possible strategy is the Zeuthen strategy.

The idea of the Zeuthen strategy is to measure an agent’s willingness to risk conflict.

Intuitively, an agent will be more willing to risk conflict if the difference in utility between its current proposal and the conflict deal is low.

In this case, the agent has little to lose if

negotiation fails and the conflict deal is implemented, and so is more willing to risk conflict, and less willing to concede. In contrast, if the difference between the agent’s current proposal and the conflict deal is high, then the agent has more to lose from conflict and is therefore less willing to risk conflict—and thus more willing to concede. Agent i’s willingness to risk conflict at round 7, denoted risk!, is measured as follows:

by conceding and accepting js offer

Until an agreement is reached, the value of risk{ will be a value between 0 and 1. Higher

values of risk! (nearer to 1) indicate that i has less to lose from conflict, and so is more willing to risk conflict.

Summary The Zeuthen strategy says that each agent’s first proposal should be a deal in the negoti-

ation set that maximizes its own utility (there may be more than one). After that, the agent

who should concede on round 7 of negotiation should be the one with the smaller value of risk—the one with the most to lose from conflict if neither concedes.

The next question to answer is how much should be conceded? The answer provided by the Zeuthen strategy is, “Just enough to change the balance of risk to the other agent.” That is, an agent should make the smallest concession that will make the other agent concede on

the next round. There is one final refinement to the Zeuthen strategy.

Suppose that at some point both

agents have equal risk. Then, according to the strategy, both should concede. But, knowing this, one agent could potentially “defect” by not conceding, and so benefit.

To avoid the

possibility of both conceding at this point, we extend the strategy by having the agents “flip a coin” to decide who should concede if ever an equal risk situation is reached.

With this strategy, agreement will be Pareto optimal and individually rational. However,

since the space of possible deals is exponential in the number of tasks, following this

strategy

may require O(2/"!) computations of the cost function at each negotiation step. Finally, the

Zeuthen strategy (with the coin flipping rule) is in Nash equilibrium.

Summary « Multiagent planning is necessary when there are other agents in the environment with which to cooperate or compete. Joint plans can be constructed, but must be augmented with some form of coordination if two agents are to agree on which joint plan to execute.

+ Game

theory describes rational behavior for agents in situations in which multiple

agents interact. Game theory is to multiagent decision making as decision theory is to

single-agent decision making. « Solution concepts in game theory are intended to characterize rational outcomes of a game—outcomes that might occur if every agent acted rationally.

+ Non-cooperative game theory assumes that agents must make their decisions indepen-

dently. Nash equilibrium is the most important solution concept in non-cooperative game theory. A Nash equilibrium is a strategy profile in which no agent has an incentive to deviate from its specified strategy. We have techniques for dealing with repeated games and sequential games.

+ Cooperative game theory considers settings in which agents can make binding agree-

ments to form coalitions in order to cooperate. Solution concepts in cooperative game attempt to formulate which coalitions are stable (the core) and how to fairly divide the

value that a coalition obtains (the Shapley value). « Specialized techniques are available for certain important classes of multiagent decision: the contract net for task sharing; auctions are used to efficiently allocate scarce

resources; bargaining for reaching agreements on matters of common interest; and vot-

ing procedures for aggregating preferences.

Chapter 18 Multiagent Decision Making Bibliographical and Historical Notes Itis a curiosity of the field that researchers in Al did not begin to seriously consider the issues

surrounding interacting agents until the 1980s—and the multiagent systems field did not really become established as a distinctive subdiscipline of Al until a decade later. Nevertheless,

ideas that hint at multiagent systems were present in the 1970s. For example, in his highly influential Society of Mind theory, Marvin Minsky (1986, 2007) proposed that human minds are constructed from an ensemble of agents. Doug Lenat had similar ideas in a framework he called BEINGS (Lenat, 1975). In the 1970s, building on his PhD work on the PLANNER

system, Carl Hewitt proposed a model of computation as interacting agents called the ac-

tor model, which has become established as one of the fundamental models in concurrent computation (Hewitt, 1977; Agha, 1986).

The prehistory of the multiagent systems field is thoroughly documented in a collection

of papers entitled Readings in Distributed Artificial Intelligence (Bond and Gasser, 1988).

The collection is prefaced with a detailed statement of the key research challenges in multi-

agent systems, which remains remarkably relevant today, more than thirty years after it was written.

Early research on multiagent systems tended to assume that all agents in a system

were acting with common interest, with a single designer. This

Cooperative distributed problem solving

is now recognized as a spe-

cial case of the more general multiagent setting—the special case is known as cooperative

distributed problem solving. A key system of this time was the Distributed Vehicle Moni-

toring Testbed (DVMT), developed under the supervision of Victor Lesser at the University

of Massachusetts (Lesser and Corkill, 1988). The DVMT modeled a scenario in which a col-

lection of geographically distributed acoustic sensor agents cooperate to track the movement

of vehicles.

The contemporary era of multiagent systems research began in the late 1980s, when it

was widely realized that agents with differing preferences are the norm in Al and society— from this point on, game theory began to be established as the main methodology for studying such agents. Multiagent planning has leaped in popularity in recent years, although it does have a long history. Konolige (1982) formalizes multiagent planning in first-order logic, while Pednault (1986) gives a STRIPS-style description.

The notion of joint intention, which is essential if

agents are to execute a joint plan, comes from work on communicative acts (Cohen and Perrault, 1979; Cohen and Levesque,

1990; Cohen et al., 1990). Boutilier and Brafman (2001)

show how to adapt partial-order planning to a multiactor setting.

Brafman and Domshlak

(2008) devise a multiactor planning algorithm whose complexity grows only linearly with

the number of actors, provided that the degree of coupling (measured partly by the tree width

of the graph of interactions among agents) is bounded.

Multiagent planning is hardest when there are adversarial agents.

As Jean-Paul Sartre

(1960) said, “In a football match, everything is complicated by the presence of the other team.” General Dwight D. Eisenhower said, “In preparing for battle I have always found that plans are useless, but planning is indispensable,” meaning that it is important to have a conditional plan or policy, and not to expect an unconditional plan to succeed.

The topic of distributed and multiagent reinforcement learning (RL) was not covered in

this chapter but is of great current interest. In distributed RL, the aim is to devise methods by

which multiple, coordinated agents learn to optimize a common utility function. For example,

Bibliographical and Historical Notes can we devise methods whereby separate subagents for robot navigation and robot obstacle avoidance could cooperatively achieve a combined control system that is globally optimal?

Some basic results in this direction have been obtained (Guestrin et al., 2002; Russell and Zimdars, 2003). The basic idea is that each subagent learns its own Q-function (a kind of

utility function; see Section 22.3.3) from its own stream of rewards. For example, a robot-

navigation component can receive rewards for making progress towards the goal, while the obstacle-avoidance component receives negative rewards for every collision. Each global decision maximizes the sum of Q-functions and the whole process converges to globally optimal solutions. The roots of game theory can be traced back to proposals made in the 17th century by

Christiaan Huygens and Gottfried Leibniz to study competitive and cooperative human in-

ientifically and mathematically. Throughout the 19th century, several leading reated simple mathematical examples to analyze particular examples of compet-

itive situations. The first formal results in game theory are due to Zermelo (1913) (who had, the year before, suggested a form of minimax search for games, albeit an incorrect one). Emile Borel (1921) introduced the notion of a mixed strategy. John von Neumann (1928) proved that

every two-person, zero-sum game has a maximin equilibrium in mixed strategies and a well-

defined value. Von Neumann’s collaboration with the economist Oskar Morgenstern led to the publication in 1944 of the Theory of Games and Economic Behavior, the defining book

for game theory. Publication of the book was delayed by the wartime paper shortage until a member of the Rockefeller family personally subsidized its publication. In 1950, at the age of 21, John Nash published his ideas concerning equilibria in general (non-zero-sum) games. His definition of an equilibrium solution, although anticipated in the work of Cournot (1838), became known as Nash equilibrium. After a long delay because of the schizophrenia he suffered from 1959 onward, Nash was awarded the Nobel Memorial Prize in Economics (along with Reinhart Selten and John Harsanyi) in 1994. The Bayes-Nash

equilibrium is described by Harsanyi (1967) and discussed by Kadane and Larkey (1982). Some issues in the use of game theory for agent control are covered by Binmore (1982). Aumann and Brandenburger (1995) show how different equilibria can be reached depending on the knowleedge each player has.

The prisoner’s dilemma was invented as a classroom exercise by Albert W. Tucker in 1950 (based on an example by Merrill Flood and Melvin Dresher) and is covered extensively

by Axelrod (1985) and Poundstone (1993). Repeated games were introduced by Luce and

Raiffa (1957), and Abreu and Rubinstein (1988) discuss the use of finite state machines for

repeated games—technically, Moore machines. The text by Mailath and Samuelson (2006) concentrates on repeated games. Games of partial information in extensive form were introduced by Kuhn (1953).

The

sequence form for partial-information games was invented by Romanovskii (1962) and independently by Koller ef al. (1996); the paper by Koller and Pfeffer (1997) provides a readable introduction to the field and describes a system for representing and solving sequential games. The use of abstraction to reduce a game tree to a size that can be solved with Koller’s

technique was introduced by Billings ef al. (2003). Subsequently, improved methods for equilibrium-finding enabled solution of abstractions with 102 states (Gilpin er al., 2008;

Zinkevich et al., 2008).

Bowling ef al. (2008) show how to use importance sampling to

647

Chapter 18 Multiagent Decision Making get a better estimate of the value ofa strategy. Waugh ef al. (2009) found that the abstraction approach is vulnerable to making systematic errors in approximating the equilibrium solution: it works for some games but not others.

Brown and Sandholm (2019) showed that, at least

in the case of multiplayer Texas hold "em poker, these vulnerabilities can be overcome by sufficient computing power.

They used a 64-core server running for 8 days to compute a

baseline strategy for their Pluribus program. human champion opponents.

With that strategy they were able to defeat

Game theory and MDPs are combined in the theory of Markov games, also called stochastic games (Littman, 1994; Hu and Wellman, 1998). Shapley (1953b) actually described the

value iteration algorithm independently of Bellman, but his results were not widely appre-

ciated, perhaps because they were presented in the context of Markov games. Evolutionary

game theory (Smith, 1982; Weibull, 1995) looks at strategy drift over time: if your opponent’s strategy is changing, how should you react?

Textbooks on game theory from an economics point of view include those by Myerson

(1991), Fudenberg and Tirole (1991), Osborne (2004), and Osborne and Rubinstein (1994).

From an Al perspective we have Nisan ef al. (2007) and Leyton-Brown and Shoham (2008). See (Sandholm, 1999) for a useful survey of multiagent decision making. Multiagent RL is distinguished from distributed RL by the presence of agents who cannot

coordinate their actions (except by explicit communicative acts) and who may not share the

same utility function. Thus, multiagent RL deals with sequential game-theoretic problems or

Markov games, as defined in Chapter 17. What causes problems is the fact that, while an

agent is learning to defeat its opponent’s policy, the opponent is changing its policy to defeat the agent. Thus, the environment is nonstationary (see page 444).

Littman (1994) noted this difficulty when introducing the first RL algorithms for zero-

sum Markov games. Hu and Wellman (2003) present a Q-learning algorithm for generalsum games that converges when the Nash equilibrium is unique; when there are multiple equilibria, the notion of convergence is not so easy to define (Shoham ef al., 2004). Assistance games were introduced under the heading of cooperative inverse reinforce-

ment learning by Hadfield-Menell ez al. (2017a). Malik ez al. (2018) introduced an efficient

Principal-agent game

POMDP solver designed specifically for assistance games.

They are related to principal-

agent games in economics, in which a principal (e.g., an employer) and an agent (e.g., an employee) need to find a mutually beneficial arrangement despite having widely different preferences. The primary differences are that (1) the robot has no preferences of its own, and (2) the robot is uncertain about the human preferences it needs to optimize.

Cooperative games were first studied by von Neumann and Morgenstern (1944). The notion of the core was introduced by Donald Gillies (1959), and the Shapley value by Lloyd Shapley (1953a). A good introduction to the mathematics of cooperative games is Peleg and Sudholter (2002). Simple games in general are discussed in detail by Taylor and Zwicker (1999). For an introduction to the computational aspects of cooperative game theory, see

Chalkiadakis er al. (2011).

Many compact representation schemes for cooperative games have been developed over

the past three decades, starting with the work of Deng and Papadimitriou (1994). The most

influential of these schemes is the marginal contribution networks model, which was intro-

duced by Teong and Shoham (2005). The approach to coalition formation that we describe was developed by Sandholm ef al. (1999); Rahwan ef al. (2015) survey the state of the art.

Bibliographical and Historical Notes The contract net protocol was introduced by Reid Smith for his PhD work at Stanford

University in the late 1970s (Smith, 1980). The protocol seems to be so natural that it is reg-

ularly reinvented to the present day. The economic foundations of the protocol were studied by Sandholm (1993).

Auctions and mechanism design have been mainstream topics in computer science and

AI for several decades:

see Nisan (2007) for a mainstream computer science perspective,

Krishna (2002) for an introduction to the theory of auctions, and Cramton er al. (2006) for a collection of articles on computational aspects of auctions. The 2007 Nobel Memorial Prize in Economics went to Hurwicz, Maskin, and Myerson

“for having laid the foundations of mechanism design theory” (Hurwicz, 1973). The tragedy of the commons, a motivating problem for the field, was analyzed by William Lloyd (1833) but named and brought to public attention by Garrett Hardin (1968). Ronald Coase presented

a theorem that if resources are subject to private ownership and if transaction costs are low

enough, then the resources will be managed efficiently (Coase, 1960). He points out that, in practice, transaction costs are high, so this theorem does not apply, and we should look to other solutions beyond privatization and the marketplace. Elinor Ostrom’s Governing the Commons (1990) described solutions for the problem based on placing management control over the resources into the hands of the local people who have the most knowledge of the situation. Both Coase and Ostrom won the Nobel Prize in economics for their work.

The revelation principle is due to Myerson (1986), and the revenue equivalence theorem was developed independently by Myerson (1981) and Riley and Samuelson (1981). Two

economists, Milgrom (1997) and Klemperer (2002), write about the multibillion-dollar spec-

trum auctions they were involved in. Mechanism design is used in multiagent planning (Hunsberger and Grosz, 2000; Stone et al., 2009) and scheduling (Rassenti er al., 1982). Varian (1995) gives a brief overview with connections to the computer science literature, and Rosenschein and Zlotkin (1994)

present a book-length treatment with applications to distributed AL Related work on distributed AT goes under several names, including collective intelligence (Tumer and Wolpert, 2000; Segaran, 2007) and market-based control (Clearwater,

1996).

Since 2001

there has

been an annual Trading Agents Competition (TAC), in which agents try to make the best profit on a series of auctions (Wellman ez al., 2001; Arunachalam and Sadeh, 2005).

The social choice literature is enormous, and spans erations on the nature of democracy through to highly procedures. Campbell and Kelly (2002) provide a good Handbook of Computational Social Choice provides a

the gulf from philosophical considtechnical analyses of specific voting starting point for this literature. The range of articles surveying research

topics and methods in this field (Brandt ez al., 2016). Arrow’s theorem lists desired properties

of a voting system and proves that is impossible to achieve all of them (Arrow, 1951). Dasgupta and Maskin (2008) show that majority rule (not plurality rule, and not ranked choice voting) is the most robust voting system. The computational complexity of manipulating elections was first studied by Bartholdi ez al. (1989).

We have barely skimmed the surface of work on negotiation in multiagent planning. Durfee and Lesser (1989) discuss how tasks can be shared out among agents by negotiation. Kraus et al. (1991) describe a system for playing Diplomacy, a board game requiring negoti-

ation, coalition formation, and dishonesty. Stone (2000) shows how agents can cooperate as

teammates in the competitive, dynamic, partially observable environment of robotic soccer. In

649

650

Chapter 18 Multiagent Decision Making a later article, Stone (2003) analyzes two competitive multiagent environments—RoboCup,

a robotic soccer competition, and TAC, the auction-based Trading Agents Competition—

and finds that the computational intractability of our current theoretically well-founded approaches has led to many multiagent systems being designed by ad hoc methods. Sarit Kraus has developed a number of agents that can negotiate with humans and other agents—

see Kraus (2001) for a survey. The monotonic concession protocol for automated negotiation was proposed by Jeffrey S. Rosenschein and his students (Rosenschein and Zlotkin, 1994). The alternating offers protocol was developed by Rubinstein (1982). Books on multiagent systems include those by Weiss (2000a), Young (2004), Vlassis (2008), Shoham and Leyton-Brown (2009), and Wooldridge (2009). The primary conference for multiagent systems is the International Conference on Autonomous Agents and Multi-

Agent Systems (AAMAS); there is also a journal by the same name. The ACM Conference on Electronic Commerce (EC) also publishes many relevant papers, particularly in the area of auction algorithms. The principal journal for game theory is Games and Economic Behavior.

TG

19

LEARNING FROM EXAMPLES In which we describe agents that can improve their behavior through diligent study of past experiences and predictions about the future.

An agent is learning if it improves its performance after making observations about the world.

Learning can range from the trivial, such as jotting down a shopping list, to the profound, as when Albert Einstein inferred a new theory of the universe. When the agent is a computer, we call it machine learning:

a computer observes some data, builds a model based on the

data, and uses the model as both a hypothesis about the world and a piece of software that

can solve problems.

‘Why would we want a machine to learn? Why not just program it the right way to begin

with? There are two main reasons.

First, the designers cannot anticipate all possible future

situations. For example, a robot designed to navigate mazes must learn the layout of each new

maze it encounters; a program for predicting stock market prices must learn to adapt when

conditions change from boom to bust. Second, sometimes the designers have no idea how

to program a solution themselves. Most people are good at recognizing the faces of family

members, but they do it subconsciously, so even the best programmers don’t know how to

program a computer to accomplish that task, except by using machine learning algorithms.

In this chapter, we interleave a discussion of various model classes—decision trees (Sec-

tion 19.3), linear models (Section 19.6), nonparametric models such as nearest neighbors (Section 19.7), ensemble models such as random forests (Section 19.8)—with practical advice on building machine learning systems (Section 19.9), and discussion of the theory of ‘machine learning (Sections 19.1 to 19.5).

19.1

Forms of Learning

Any component of an agent program can be improved by machine learning. The improve‘ments, and the techniques used to make them, depend on these factors:

« Which component is to be improved. * What prior knowledge the agent has, which influences the model it builds. * What data and feedback on that data is available.

Chapter 2 described several agent designs. The components of these agents include: 1. A direct mapping from conditions on the current state to actions. 2. A means to infer relevant properties of the world from the percept sequence. 3. Information about the way the world evolves and about the results of possible actions

the agent can take.

Machine learning

652

Chapter 19 Learning from Examples 4. Utility information indicating the desirability of world states. 5. Action-value information indicating the desirability of actions.

6. Goals that describe the most desirable states.

7. A problem generator, critic, and learning element that enable the system to improve. Each of these components can be learned. Consider a self-driving car agent that learns by observing a human driver. Every time the driver brakes, the agent might learn a condition— action rule for when to brake (component 1). By seeing many camera images that it is told contain buses, it can learn to recognize them (component 2).

By trying actions and ob-

serving the results—for example, braking hard on a wet road—it can learn the effects of its actions (component 3). Then, when it receives complaints from passengers who have been thoroughly shaken up during the trip, it can learn a useful component of its overall utility function (component 4).

The technology of machine learning has become a standard part of software engineering.

Any time you are building a software system, even if you don’t think of it as an AI agent,

components of the system can potentially be improved with machine learning. For example,

software to analyze images of galaxies under gravitational lensing was speeded up by a factor

of 10 million with a machine-learned model (Hezaveh et al., 2017), and energy use for cooling data centers was reduced by 40% with another machine-learned model (Gao, 2014). Turing Award winner David Patterson and Google Al head Jeff Dean declared the dawn of a “Golden

Age” for computer architecture due to machine learning (Dean et al., 2018).

We have seen several examples of models for agent components: atomic, factored, and

relational models based on logic or probability, and so on. Learning algorithms have been Prior knowledge

devised for all of these.

This chapter assumes little prior knowledge on the part of the agent: it starts from scratch

and learns from the data. In Section 21.7.2 we consider transfer learning, in which knowl-

edge from one domain is transferred to a new domain, so that learning can proceed faster with less data.

We do assume, however, that the designer of the system chooses a model

framework that can lead to effective learning.

Going from a specific set of observations to a general rule is called induction; from the

observations that the sun rose every day in the past, we induce that the sun will come up tomorrow.

This differs from the deduction we studied in Chapter 7 because the inductive

conclusions may be incorrect, whereas deductive conclusions are guaranteed to be correct if

the premises are correct. This chapter concentrates on problems where the input is a factored representation—a vector of attribute values.

It is also possible for the input to be any kind of data structure,

including atomic and relational.

When the output is one of a finite set of values (such as sunny/cloudy/rainy or true/false),

Classification

the learning problem is called classification. When it is a number (such as tomorrow’s tem-

Regression

mittedly obscure!) name regression. ! A better name would have been function approximation or mumeric prediction. But in 1886 Francis Galton ‘wrote an influential article on the concept of regression o the mean (e.g.. the children of tall parents are likely to be taller than average, but not as tall as the parents). Galton showed plots with what he called “regression lines,” and readers came to associate the word “regression” with the statistical technique of function approximation rather than with the topic of regression to the mean.

perature, measured either as an integer or a real number), the learning problem has the (ad-

Section 1.2 Supervised Learning

653

There are three types of feedback that can accompany the inputs, and that determine the Feedback

three main types of learning:

« In supervised learning the agent observes input-output pairs and learns a function that Supervised learning maps from input to output.

For example, the inputs could be camera images, each

one accompanied by an output saying “bus” or “pedestrian,” etc. An output like this is called a label. The agent learns a function that, when given a new image, predicts

the appropriate label. In the case of braking actions (component 1 above), an input is

Label

the current state (speed and direction of the car, road condition), and an output is the distance it took to stop. In this case a set of output values can be obtained by the agent

from its own percepts (after the fact); the environment is the teacher, and the agent learns a function that maps states to stopping distance.

« In unsupervised learning the agent learns patterns in the input without any explicit Unsupervised learning feedback. The most common unsupervised learning task is clustering: detecting poten-

tially useful clusters of input examples. For example, when shown millions of images

taken from the Internet, a computer vision system can identify a large cluster of similar images which an English speaker would call “cats.”

«+ In reinforcement learning the agent learns from a series of reinforcements:

rewards

and punishments. For example, at the end of a chess game the agent is told that it has

Reinforcement learning

won (a reward) or lost (a punishment). It is up to the agent to decide which of the actions prior to the reinforcement were most responsible for it, and to alter its actions

to aim towards more rewards in the future.

19.2

Super

ed Learning

More formally, the task of supervised learning is this: Given a training set of N example input-output pairs

Training set

(1, y1)s (32,32)s - (s yn) 5 where each pair was generated by an unknown function y = f(x),

discover a function / that approximates the true function f. The function / is called a hypothesis about the world. It is drawn from a hypothesis space H of possible functions. For example, the hypothesis space might be the set of polynomials

Hypothesis space

of degree 3; or the set of Javascript functions; or the set of 3-SAT Boolean logic formulas.

‘With alternative vocabulary, we can say that / is a model of the data, drawn from a model class 7, or we can say a function drawn from a function class. We call the output y; the

Model class Ground truth

ground truth—the true answer we are asking our model to predict. How do we choose a hypothesis space? We might have some prior knowledge about the data process that generated the data. If not, we can perform exploratory data analysis: examining Exploratory analysis the data with statistical tests and visualizations—histograms, scatter plots, box plots—to get

afeel for the data, and some insight into what hypothesis space might be appropriate. Or we can just try multiple hypothesis spaces and evaluate which one works best.

How do we choose a good hypothesis from within the hypothesis space? We could hope

for a consistent hypothesis: and / such that each x; in the training set has h(x;) = y;. With

continuous-valued outputs we can’t expect an exact match to the ground truth; instead we

Consistent hypothesis

654

Chapter 19 Learning from Examples Sinusoidal

Piecewise linear

Degree-12 polynomial

o

Data set 2

Data set 1

Linear

flf}

Figure 19.1 Finding hypotheses to fit data. Top row: four plots of best-fit functions from four different hypothesis spaces trained on data set 1. Bottom row: the same four functions, but trained on a slightly different data

set (sampled from the same f(x) function).

look for a best-fit function for which each h(x;) is close to y; (in a way that we will formalize in Section 19.4.2).

The true measure of a hypothesis is not how it does on the training set, but rather how

Test set Generalization

well it handles inputs it has not yet seen. We can evaluate that with a second sample of (x;,y;) pairs called a test set. We say that / generalizes well if it accurately predicts the outputs of the test set. Figure 19.1 shows that the function that a learning algorithm discovers depends on the hypothesis space H it considers and on the training set it is given. Each of the four plots in the top row have the same training set of 13 data points in the (x,y) plane. The four plots in the bottom row have a second set of 13 data points; both sets are representative of the

same unknown function f(x). Each column shows the best-fit hypothesis / from a different hypothesis space:

o Column 1: Straight lines; functions of the form i(x) = wjx- wy. There is no line that

would be a consistent hypothesis for the data points.

« Column 2: Sinusoidal functions of the form A(x) = wy.x -+sin(wox). This choice is not quite consistent, but fits both data sets very well.

e Column 3: Piecewise-linear functions where each line segment connects the dots from

one data point to the next. These functions are always consistent. o Column 4: Degree-12 polynomials, h(x) = ¥/2gwix’. These are consistent: we can always get a degree-12 polynomial to perfectly fit 13 distinct points. But just because

the hypothesis is consistent does not mean it is a good guess. One way to analyze hypothesis spaces is by the bias they impose (regardless of the train-

Bias

ing data set) and the variance they produce (from one training set to another). By bias we mean (loosely) the tendency of a predictive hypothesis to deviate from the

expected value when averaged over different training sets. Bias often results from restrictions

Section 1.2 Supervised Learning

655

imposed by the hypothesis space. For example, the hypothesis space of linear functions

induces a strong bias: it only allows functions consisting of straight lines. If there are any

patterns in the data other than the overall slope of a

line, a linear function will not be able

to represent those patterns. We say that a hypothesis is underfitting when it fails to find a Underfitting pattern in the data. On the other hand, the piecewise linear function has low bias; the shape of the function is driven by the data.

By variance we mean the amount of change in the hypothesis due to fluctuation in the

training data. The two rows of Figure 19.1 represent data sets that were each sampled from the same f(x) function. The data sets turned out to be slightly different.

Variance

For the first three

columns, the small difference in the data set translates into a small difference in the hypothe-

sis. We call that low variance. But the degree-12 polynomials in the fourth column have high

variance: look how different the two functions are at both ends of the x-axis. Clearly, at least

one of these polynomials must be a poor approximation to the true f(x). We say a function

is overfitting the data when it pays too much attention to the particular data set it is trained

on, causing it to perform poorly on unseen data.

Often there is a bias-variance tradeoff: a choice between more complex, low-bias hy-

potheses that fit the training data well and simpler, low-variance hypotheses that may generalize better.

Albert Einstein said in 1933, “the supreme goal of all theory is to make the

irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience.” In other words, Einstein recommends choosing the simplest hypothesis that matches the data. This principle can be traced further back to the 14th-century English philosopher William of Ockham.? His principle that “plurality [of entities] should not be posited without necessity” is called Ockham’s razor because it is used to “shave off”” dubious explanations.

Defining simplicity is not easy. It seems clear that a polynomial with only two parameters is simpler than one with thirteen parameters. We will make this intuition more precise in

Section

19.3.4.

However, in Chapter 21 we will see that deep neural network models can

often generalize quite well, even though they are very complex—some of them have billions of parameters. So the number of parameters by itself is not a good measure of a model’s fitness. Perhaps we should be aiming for “appropriateness,” not “simplicity” in a model class. We will consider this issue in Section 19.4.1. ‘Which hypothesis is best in Figure 19.1? We can’t be certain.

If we knew the data

represented, say, the number of hits to a Web site that grows from day to day, but also cycles depending on the time of day, then we might favor the sinusoidal function.

If we knew the

data was definitely not cyclic but had high noise, that would favor the linear function.

In some cases, an analyst is willing to say not just that a hypothesis is possible or im-

possible, but rather how probable it is. Supervised learning can be done by choosing the hypothesis 4 that is most probable given the data:

1" = argmax P(h|data) . heM

By Bayes’ rule this s equivalent to I* = argmax P(datal ) P(h) . heM

2 The name is often misspelled as “Occam.”

Overfitting Bias-variance tradeoff

Chapter 19 Learning from Examples Then we can say that the prior probability P() is high for a smooth degree-1 and lower for a degree-12 polynomial with large, sharp spikes. We allow functions when the data say we really need them, but we discourage them low prior probability. Why not let H be the class of all computer programs, or all Turing

or -2 polynomial unusual-looking by giving them a machines? The

problem is that there is a tradeoff between the expressiveness of a hypothesis space and the

computational complexity of finding a good hypothesis within that space. For example, fitting a straight line to data is an easy computation; fitting high-degree polynomials is somewhat harder; and fitting Turing machines is undecidable. A second reason to prefer simple hypothesis spaces is that presumably we will want to use h after we have learned it, and computing h(x) when h is a linear function is guaranteed to be fast, while computing an arbitrary Turing machine program is not even guaranteed to terminate.

For these reasons, most work on learning has focused on simple representations. In recent

years there has been great interest in deep learning (Chapter 21), where representations are

not simple but where the h(x) computation still takes only a bounded number of steps to

compute with appropriate hardware.

We will see that the expressiveness—complexity tradeoff is not simple: it is often the case,

as we saw with first-order logic in Chapter 8, that an expressive language makes it possible for a simple hypothesis to fit the data, whereas restricting the expressiveness of the language

means that any consistent hypothesis must be complex. 19.2.1

Example problem:

Restaurant waiting

We will describe a sample supervised learning problem in detail: the problem of deciding

whether to wait for a table at a restaurant. This problem will be used throughout the chapter

to demonstrate different model classes. For this problem the output, y, is a Boolean variable

that we will call WillWait; it is true for examples where we do wait for a table. The input, x,

is a vector of ten attribute values, each of which has discrete values:

1. Alternate: whether there is a suitable alternative restaurant nearby.

0PN AE W

656

Bar: whether the restaurant has a comfortable bar area to wait in. Fri/Sat: true on Fridays and Saturdays. Hungry: whether we are hungry right now.

Patrons: how many people are in the restaurant (values are None, Some, and Full). Price: the restaurant’s price range ($, $3, $$8). Raining: whether it is raining outside.

Reservation: whether we made a reservation.

Type: the kind of restaurant (French, Italian, Thai, or burger).

10. WaitEstimate: host’s wait estimate: 0-10, 10-30, 3060, or >60 minutes.

A set of 12 examples, taken from the experience of one of us (SR), is shown in Figure 19.2.

Note how skimpy these data are: there are 26 x 3% x 4% = 9,216 possible combinations of

values for the input attributes, but we are given the correct output for only 12 of them; each of

the other 9,204 could be either true or false; we don’t know. This is the essence of induction: we need to make our best guess at these missing 9,204 output values, given only the evidence of the 12 examples.

Section 19.3

Example

Learning Decision Trees

657

Input Attribut Alt Yes No Yes No

Yes

No

No Yes No

Some

$3$

Some



%$Full $ Some $ Full $ Full ~ $8$

None Some Full Full None Full

Figure 19.2 Examples for the restaurant domain. 19.3

Learning Decision Trees

A decision tree is a representation of a function that maps a vector of attribute values to

a single output value—a “decision.” A decision tree reaches its decision by performing a

Decision tree

sequence of tests, starting at the root and following the appropriate branch until a leaf is reached. Each internal node in the tree corresponds to a test of the value of one of the input

attributes, the branches from the node are labeled with the possible values of the attribute,

and the leaf nodes specify what value s to be returned by the function. In general, the input and output values can be discrete or continuous, but for now we will

Positive example) or false (a negative example). We call this Boolean classification. We will use j Negative consider only inputs consisting of discrete values and outputs that are either true (a positive

to index the examples (x; is the input vector for the jth example and y; is the output), and x;;

for the ith attribute of the jth example.

The tree representing the decision function that SR uses for the restaurant problem is

shown in Figure 19.3. Following the branches, we see that an example with Patrons = Full and WaitEstimate =0-10 will be classified as positive (i.e., yes, we will wait for a table).

19.3.1

Expressiveness of decision trees

A Boolean decision tree is equivalent to a logical statement of the form:

Output
A +A; is hard to represent

with a decision tree because the decision boundary is a diagonal line, and all decision tree

tests divide the space up into rectangular, axis-aligned boxes. We would have to stack a lot

of boxes to closely approximate the diagonal line. In other words, decision trees are good for some kinds of functions and bad for others. Is there any kind of representation that is efficient for all kinds of functions?

Unfortu-

nately, the answer is no—there are just too many functions to be able to represent them all with a small number of bits.

Even just considering Boolean functions with n Boolean at-

tributes, the truth table will have 2" rows, and each row can output true or false, so there are

22" different functions. With 20 attributes there are 248576 ~ 10300000 fynctions, so if we limit ourselves to a million-bit representation, we can’t represent all these functions.

19.3.2

Learning decision trees from examples

‘We want to find a tree that is consistent with the examples in Figure 19.2 and is as small as possible. Unfortunately, it is intractable to find a guaranteed smallest consistent tree. But with some simple heuristics, we can efficiently find one that is close to the smallest.

The

LEARN-DECISION-TREE algorithm adopts a greedy divide-and-conquer strategy: always test the most important attribute first, then recursively solve the smaller subproblems that are defined by the possible results of the test. By “most important attribute,” we mean the one

that makes the most difference to the classification of an example. That way, we hope to get to the correct classification with a small number of tests, meaning that all paths in the tree will be short and the tree as a whole will be shallow.

Figure 19.4(a) shows that Type is a poor attribute, because it leaves us with four possible

Pations? None

Some"

o N Ves|

Full

WaitEstimate? Altemate? No Yes

Reservation? No Yes

No

No /"\ Yes Yes

Bai No /"\ Yes

Figure 19.3 A decision tree for deciding whether to wait for a table.

Section 19.3

8

alian

659

mEan BE@EBD

Type? French

Learning Decision Trees

Patrons?

Thai

Burger

©lad ao

None

Some

Ful

.

No// (@)

(b)

\Yes

B

Figure 19.4 Splitting the examples by testing on attributes. At each node we show the positive (light boxes) and negative (dark boxes) examples remaining. (a) Splitting on Type brings us no nearer to distinguishing between positive and negative examples. (b) Splitting on Patrons does a good job of separating positive and negative examples. After splitting on Patrons, Hungry is a fairly good second test. outcomes, each of which has the same number of positive as negative examples. On the other

hand, in (b) we see that Patrons is a fairly important attribute, because if the value is None or Some, then we are left with example sets for which we can answer definitively (No and Yes,

respectively). If the value is Full, we are left with a mixed set of examples. There are four cases to consider for these recursive subproblems:

1. If the remaining examples are all positive (or all negative), then we are done: we can answer Yes or No. Figure 19.4(b) shows examples of this happening in the None and Some branches.

2. If there are some positive and some negative examples, then choose the best attribute to split them. Figure 19.4(b) shows Hungry being used to split the remaining examples. 3. If there are no examples left, it means that no example has been observed for this com-

bination of attribute values, and we return the most common output value from the set

of examples that were used in constructing the node’s parent. 4. If there are no attributes left, but both positive and negative examples, it means that these examples have exactly the same description, but different classifications. This can

happen because there is an error or noise in the data; because the domain is nondeterministic; or because we can’t observe an attribute that would distinguish the examples.

The best we can do is return the most common output value of the remaining examples. The LEARN-DECISION-TREE algorithm is shown in Figure 19.5. Note that the set of exam-

ples is an input to the algorithm, but nowhere do the examples appear in the tree returned by the algorithm. A tree consists of tests on attributes in the interior nodes, values of attributes on the branches, and output values on the leaf nodes. The details of the IMPORTANCE func-

tion are given in Section 19.3.3. The output of the learning algorithm on our sample training

set is shown in Figure 19.6. The tree is clearly different from the original tree shown in Fig-

Noise

660

Chapter 19 Learning from Examples function LEARN-DECISION-TREE(examples, attributes, parent_examples) returns a tree

if examples is empty then return PLURALITY-VALUE(parent_examples)

else if all examples have the same classification then return the classification

else if artributes is empty then return PLURALITY-VALUE(examples) else

A~ argmax, ¢ uipues IMPORTANCE (a, examples) tree < a new decision tree with root test A for each value v ofA do

exs 0, while carthquakes have —4.9+ 1.7x; —x, < x=

Section 19.6

Linear Regression and Classification

683

0. We can make the equation easier to deal with by changing it into the vector dot product form—with xo=1 we have —4.9%+ 17x -3 =0, and we can define the vector of weights, w=(-49,17,~1), and write the classification hypothesis Jw(x) = 1if w-x > 0 and 0 otherwise.

Alternatively, we can think of / as the result of passing the linear function w - x through a threshold function:

Iw(x) = Threshold(w-x) where Threshold(z) =1 if z > 0 and 0 otherwise.

Threshold function

The threshold function is shown in Figure 19.17(a).

Now that the hypothesis &y (x) has a well-defined mathematical form, we can think about

choosing the weights w to minimize the loss. In Sections 19.6.1 and 19.6.3, we did this both

in closed form (by setting the gradient to zero and solving for the weights) and by gradient

descent in weight space. Here we cannot do either of those things because the gradient is zero almost everywhere in weight space except at those points where w - x =0, and at those points the gradient is undefined.

There is, however, a simple weight update rule that converges to a solution—that is, to

a linear separator that classifies the data perfectly—provided the data are linearly separable. For a single example (x,y), we have wi — wita(y

(X)) X x;

(19.8)

which is essentially identical to Equation (19.6), the update rule for linear regression!

This

rule is called the perceptron learning rule, for reasons that will become clear in Chapter 21. Because we are considering a 0/1 classification problem, however, the behavior is somewhat

Perceptron learning rule

different. Both the true value y and the hypothesis output /(x) are either 0 or 1, so there are three possibilities:

« If the output is correct (i.e., y=hy(x)) then the weights are not changed.

« Ifyis 1 but hy(x) is 0, then w; is increased when the corresponding input x; is positive and decreased when x; is negative. This makes sense, because we want to make w - x

bigger so that /i (X) outputs a 1.

« If yis 0 but hy(x) is 1, then w; is decreased when the corresponding input x; is positive and increased when x; is negative.

This makes sense, because we want to make W - X

smaller so that A (x) outputs a 0.

Typically the learning rule is applied one example at a time, choosing examples at random (as in stochastic gradient descent). Figure 19.16(a) shows a training curve for this learning rule Training curve applied to the earthquake/explosion data shown in Figure 19.15(a). A training curve measures the classifier performance on a fixed training set as the learning process proceeds one update at a time on that training set. The curve shows the update rule converging to a zero-error

linear separator. The “convergence” process isn’t exactly preity, but it always works. This

particular run takes 657 steps to converge, for a data set with 63 examples, so each example

is presented roughly 10 times on average. Typically, the variation across runs is large.

684

Chapter 19 Learning from Examples 5 £

P Zoo

51 209

§

507

§07

8 0,

£ o, z &

Sos

0 100200300400500600 700 Number of weight updates (@)

06 20 = 04

Sos

025000 50000 75000 Number of weight updates

€06 08 h=gg

The solution for @ is the same as before.

S

The solution for 6}, the probability that a cherry

candy has a red wrapper, is the observed fraction of cherry candies with red wrappers, and

similarly for 6.

These results are very comforting, and it is easy to see that they can be extended to any

Bayesian network whose conditional probabilities are represented as tables. The most impor-

Section20.2

Learning with Complete Data

727

tant point is that with complete data, the maximum-likelihood parameter learning problem

for a Bayesian network decomposes into separate learning problems, one for each parameter. (See Exercise 20.NORX

for the nontabulated case, where each parameter affects several

conditional probabilities.) The second point is that the parameter values for a variable, given its parents, are just the observed frequencies of the variable values for each setting of the parent values. As before, we must be careful to avoid zeroes when the data set is small.

20.2.2

Naive Bayes models

Probably the most common Bayesian network model used in machine learning is the naive Bayes model first introduced on page 402. In this model, the “class™ variable C (which is to be predicted) is the root and the “attribute™ variables X; are the leaves. The model is “naive”

because it assumes that the attributes are conditionally independent of each other, given the class. (The model in Figure 20.2(b) is a naive Bayes model with class Flavor and just one

attribute, Wrapper.) In the case of Boolean variables, the parameters are

0=P(C=true),0y =P(X;=true|C=true),0,, = P(X; = true| C =false). The maximum-likelihood parameter values are found in exactly the same way as in Figure 20.2(b). Once the model has been trained in this way, it can be used to classify new examples for which the class variable C is unobserved. With observed attribute values ¥y, .., %, the probability of each class is given by P(C|x1,...,%,) = a P(C ]'[Px,\c

A deterministic prediction can be obtained by choosing the most likely class. Figure 20.3 shows the learning curve for this method when it is applied to the restaurant problem from

Chapter 19. The method learns fairly well but not as well as decision tree learning; this is presumably because the true hypothesis—which is a decision tree—is not representable exactly

using a naive Bayes model. Naive Bayes learning turns out to do surprisingly well in a wide

range of applications; the boosted version (Exercise 20.BNBX) is one of the most effective

general-purpose learning algorithms. Naive Bayes learning scales well to very large prob-

lems: with n Boolean attributes, there are just 2n+ 1 parameters, and no search is required

10 find hy, the maximum-likelihood naive Bayes hypothesis. Finally, naive Bayes learning systems deal well with noisy or missing data and can give probabilistic predictions when appropriate. Their primary drawback is the fact that the conditional independence assumption is seldom accurate; as noted on page 403, the assumption leads to overconfident probabilities that are often very close to 0 or 1, especially with large numbers of attributes.

20.2.3

Generative and discriminative models

We can distinguish two kinds of machine learning models used for classifiers: generative and discriminative.

A generative model models the probability distribution of each class.

For

example, the naive Bayes text classifier from Section 12.6.1 creates a separate model for each

possible category of text—one for sports, one for weather, and so on. Each model includes

the prior probability of the category—for example P(Category=weather)—as well as the conditional probability P(Inputs | Category =weather). From these we can compute the joint

probability P(Inputs, Category = weather)) and we can generate a random selection of words that is representative of texts in the weather category.

Generative model

Chapter 20 Learning Probabilistic Models

Proportion correct on fest set

728

Naive Bayes

0

20

40 60 Training set size

80

100

Figure 20.3 The learning curve for naive Bayes learning applied to the restaurant problem from Chapter 19; the learning curve for decision tree learning is shown for comparison. Discriminative model

A discriminative model directly learns the decision boundary between classes. That is,

it learns P(Category | Inputs). Given example inputs, a discriminative model will come up

with an output category, but you cannot use a discriminative model to, say, generate random

words that are representative of a category. Logistic regression, decision trees, and support vector machines are all discriminative model:

Since discriminative models put all their emphasis on defining the decision boundary— that is, actually doing the classification task they were asked to do—they tend to perform

better in the limit, with an arbitrary amount of training data. However, with limited data, in

some cases a generative model performs better. (Ng and Jordan, 2002) compare the generative naive Bayes classifier to the discriminative logistic regression classifier on 15 (small) data sets, and find that with the maximum amount of data, the discriminative model does better on

9 out of 15 data sets, but with only a small amount of data, the generative model does better

on 14 out of 15 data sets.

20.2.4

Maximum-likelihood parameter learning:

Continuous models

Continuous probability models such as the linear-Gaussian model were shown on page 422.

Because continuous variables are ubiquitous in real-world applications, it is important to know how to learn the parameters of continuous models from data. The principles for maximum-likelihood learning are identical in the continuous and discrete case:

Let us begin with a very simple case: learning the parameters of a Gaussian density function on a single variable. That is, we assume the data are generated as follows:

P = The parameters of this model are the mean 4 and the standard deviation o. (Notice that the

normalizing “constant” depends on o, so we cannot ignore it.) Let the observed values be

Section20.2

Learning with Complete Data

0 01020304 0506070809 x ®)

(a)

1

Figure 20.4 (a) A linear-Gaussian model described as y =6+ 6 plus Gaussian noise with fixed variance. (b) A set of 50 data points generated from this model and the best-fit line. xy. Then the log likelihood is

m?

L= Zlog e 5T= N(—logV27 — logo) — Setting the denvzuves to zero as usual, we obtain

LyN 5oL = —Flial-p=0 AL =_ S+ N, ELL(y-p)?=0 IyN 2 Se

@04)

That is, the maximum-likelihood value of the mean is the sample average and the maximum-

likelihood value of the standard deviation is the square root of the sample variance. Again, these are comforting results that confirm “commonsense” practice. Now consider a linear-Gaussian model with one continuous parent X and a continuous

child Y. As explained on page 422, ¥ has a Gaussian distribution whose mean depends linearly on the value ofX and whose standard deviation is fixed. distribution P(Y | X), we can maximize the conditional likelihood

POl =

1

_o 0o’

To learn the conditional

@5

Here, the parameters are 0y, 02, and o. The data are a collection of (x;, ;) pairs, as illustrated

in Figure 20.4. Using the usual methods (Exercise 20.LINR), we can find the maximumlikelihood values of the parameters. The point here is different. If we consider just the parameters #, and 0, that define the linear relationship between x and y, it becomes clear

that maximizing the log likelihood with respect to these parameters is the same as minimizing

the numerator (y— (6,x+6,))? in the exponent of Equation (20.5). This is the Ly loss, the

squared error between the actual value y and the prediction §;x + 5.

This is the quantity minimized by the standard linear regression procedure described in

Section 19.6. Now we can understand why: minimizing the sum of squared errors gives the maximum-likelihood straight-line model, provided that the data are generated with Gaussian

noise of fixed variance.

729

730

Chapter 20 Learning Probabilistic

0

02

04 06 Parameter § ()

Models

08

1

0

02

04 06 Parameter ®)

08

1

Figure 20.5 Examples of the Bera(a,b) distribution for different values of (a,b). 20.2.5

Bayesian parameter learning

Maximum-likelihood learning gives rise to simple procedures, but it has serious deficiencies with small data sets. For example, after seeing one cherry candy, the maximum-likelihood hypothesis is that the bag is 100% cherry (i.c., §=1.0). Unless one’s hypothesis prior is that bags must be either all cherry or all lime, this is not a reasonable conclusion. It is more likely

that the bag is a mixture of lime and cherry.

The Bayesian approach to parameter learning

starts with a hypothesis prior and updates the distribution as data arrive. The candy example in Figure 20.2(a) has one parameter, 0: the probability that a randomly selected piece of candy is cherry-flavored. In the Bayesian view, 0 is the (unknown)

value of a random variable © that defines the hypothesis space; the hypothesis prior is the

prior distribution over P(®). Thus, P(@=8) is the prior probability that the bag has a frac-

tion 0 of cherry candies.

Beta distribution

Hyperparameter

If the parameter 6 can be any value between 0 and 1, then P(®) is a continuous probability density function (see Section A.3). If we don’t know anything about the possible values of 6 we can use the uniform density function P(6) = Uniform(6;0, 1), which says all values are equally likely. A more flexible family of probability density functions is known as the beta distributions. Each beta distribution is defined by two hyperparameters® a and b such that Beta(0:a,b) = a 0!

(1-9)"",

(20.6)

for 0 in the range [0, 1]. The normalization constant a, which makes the distribution integrate to 1, depends on a and b. Figure 20.5 shows what the distribution looks like for various values of a and b. The mean value of the beta distribution is a/(a+ b), so larger values of a

suggest a belief that @ is closer to 1 than to 0. Larger values of a -+ b make the distribution

more peaked, suggesting greater certainty about the value of ©. It turns out that the uniform

density function is the same as Beta(1,1): the mean is 1/2, and the distribution is flat.

3 They are called hyperparameters because they parameterize a distribution over 0, which is itselfa parameter.

Section20.2

Learning with Complete Data

731

Figure 20.6 A Bayesian network that corresponds to a Bayesian learning process. Posterior distributions for the parameter variables ©, ©;, and @, can be inferred from their prior distributions and the evidence in Flavor; and Wrapper;. Besides its flexibility, the beta family has another wonderful property: if Beta(a, b), then, after a data point is observed, the posterior distribution for © distribution. In other words, Beta is closed under update. The beta family conjugate prior for the family of distributions for a Boolean variable.* Let’s works. Suppose we observe a cherry candy: then we have

© has a prior is also a beta is called the see how this Conjugate prior

P(0|Dy=cherry) = a P(Dy=cherry|0)P(6)

= o 0-Beta(f;a,b) = o’ 6-0°'(1-9)"" = ' 0°(1-0)""" = o/ Beta(:a+1,b).

Thus, after seeing a cherry candy, we simply increment the a parameter to get the posterior;

similarly, after seeing a lime candy, we increment the b parameter. Thus, we can view the @ and b hyperparameters as virtual counts, in the sense that a prior Beta(a, b) behaves exactly

as if we had started out with a uniform prior Beta(1,1) and seen a— 1 actual cherry candies and b — 1 actual lime candies. By examining a sequence of beta distributions for increasing values ofa and b, keeping the proportions fixed, we can see vividly how the posterior distribution over the parameter

© changes as data arrive. For example, suppose the actual bag of candy is 75% cherry. Figure 20.5(b) shows the sequence Beta(3,1), Beta(6,2), Beta(30,10). Clearly, the distribution is converging to a narrow peak around the true value of ©. For large data sets, then, Bayesian learning (at least in this case) converges to the same answer as maximum-likelihood learning. Now let us consider a more complicated case. The network in Figure 20.2(b) has three parameters, 6, 01, and 6, where 6 is the probability ofa red wrapper on a cherry candy and 4 Other conjugate priors include the Dirichlet family for the parameters of a discrete multivalued distribution and the Normal-Wishart family for the parameters of a Gaussian ion. See Bernardo and Smith (1994).

Virtual count

732

Chapter 20 Learning Probabilistic

Models

0, is the probability of a red wrapper on a lime candy. The Bayesian hypothesis prior must

cover all three parameters—that is, we need to specify P(®,0,,0;).

Parameter independence

parameter independence:

Usually, we assume

P(0,0,.0,) = P(O)P(©,)P(©,).

‘With this assumption, each parameter can have its own beta distribution that is updated sepa-

rately as data arrive. Figure 20.6 shows how we can incorporate the hypothesis prior and any data into a Bayesian network, in which we have a node for each parameter variable. The nodes ©,0;,0, have no parents. For the ith observation of a wrapper and corresponding flavor of a piece of candy, we add nodes Wrapper; and Flavor;. Flavor; is dependent on the flavor parameter ©:

P(Flavor;=cherry|©=0) = 0. Wrapper; is dependent on ©; and ©,: P(Wrapper; = red| Flavor; =cherry,©,=6,) = 6 P(Wrapper; = red| Flavor; =lime,® = 0,) = 0, .

Now, the entire Bayesian learning process for the original Bayes net in Figure 20.2(b) can be formulated as an inference problem in the derived Bayes net shown in Figure 20.6, where the.

P>

data and parameters become nodes. Once we have added all the new evidence nodes, we can

then query the parameter variables (in this case, ©,©;,0,). Under this formulation there is just one learning algorithm—the inference algorithm for Bayesian networks. Of course, the nature of these networks is somewhat different from those of Chapter 13

because of the potentially huge number of evidence variables representing the training set and the prevalence of continuous-valued parameter variables. Exact inference may be impos-

sible except in very simple cases such as the naive Bayes model. Practitioners typically use approximate inference methods such as MCMC

(Section 13.4.2); many statistical software

packages incorporate efficient implementations of MCMC for this purpose. 20.2.6

Bayesian linear regression

Here we illustrate how to apply a Bayesian approach to a standard statistical task:

linear

regression. The conventional approach was described in Section 19.6 as minimizing the sum of squared errors and reinterpreted in Section 20.2.4 as maximizing likelihood assuming a Gaussian error model. These produce a single best hypothesis: a straight line with specific values for the slope and intercept and a fixed variance for the prediction error at any given

point. There is no measure of how confident one should be in the slope and intercept values.

Furthermore, if one is predicting a value for an unseen data point far from the observed

data points, it seems to make no sense to assume a prediction error that is the same as the

prediction error for a data point right next to an observed data point.

It would seem more

sensible for the prediction error to be larger, the farther the data point is from the observed

data, because a small change in the slope will cause a large change in the predicted value for a distant point. The Bayesian approach fixes both of these problems. The general idea, as in the preceding section, is to place a prior on the model parameters—here, the coefficients of the linear model and the noise variance—and then to compute the parameter posterior given the data. For

multivariate data and unknown noise model, this leads to rather a lot of linear algebra, so we

Section20.2

Learning with Complete Data

733

focus on a simple case: univariable data, a model that is constrained to go through the origin, and known noise:

and the model is

a normal distribution with variance o2. Then we have just one parameter ¢ 5 (ML) .

P(y[x,0) = N (y:6x,07) =

(20.7)

As the log likelihood is quadratic in 0, the appropriate form for a conjugate prior on is also a Gaussian. This ensures that the posterior for ¢ will also be Gaussian. We’ll assume a mean 0 and variance o7 for the prior, so that

P(0)=N(9:€u. We don’t necessarily expect the top halves of photos to look like bottom halves, so there is a scale beyond which spatial invariance no longer holds.

Local spatial invariance can be achieved by constraining the / weights connecting a local

region to a unit in the hidden layer to be the same for each hidden unit. (That is, for hidden

units i and j, the weights wy ..., w;,; are the same as wi j,...,wy,;.)

This makes the hidden

units into feature detectors that detect the same feature wherever it appear in the image.

Typically, we want the first hidden layer to detect many kinds of features, not just one; so

for each local image region we might have d hidden units with d distinct sets of weights.

This means that there are d/ weights in all—a number that is not only far smaller than n?,

Convolutional neural network (CNN) Kernel Convolution

but is actually independent of n, the image size. Thus, by injecting some prior knowledge— namely, knowledge of adjacency and spatial invariance—we can develop models that have far fewer parameters and can learn much more quickly. A convolutional neural network (CNN) is one that contains spatially local connections,

at least in the early layers, and has patterns of weights that are replicated across the units

in each layer. A pattern of weights that is replicated across multiple local regions is called a kernel and the process of applying the kernel to the pixels of the image (or to spatially

organized units in a subsequent layer) is called convolution.*

Kernels and convolutions are easiest to illustrate in one dimension rather than two or

more, so we will assume an input vector x of size n, corresponding to n pixels in a one-

3 Similar ideas can be applied to process time-series data sources such as audio waveforms. These typically exhibit temporal invariance—a word sounds the same no matter what time of day it is uttered. Recurrent neural neworks (Setion 216 auomatiall exhibittemmpors invaranee. # In the terminology of signa correlation, not a convolution. But “convolution” is used within the field of neural networks.

Section 21.3

Convolutional Networks

Figure 21.4 An example ofa one-dimensional convolution operation with a kernel of size 1=3 and a stride s=2. The peak response is centered on the darker (lower intensity) input pixel. The results would usually be fed through a nonlinear activation function (not shown) before going to the next hidden layer. dimensional image, and a vector kernel k of size 1. (For simplicity we will assume that / is an odd number.) All the ideas carry over straightforwardly to higher-dimensional case: ‘We write the convolution operation using the * symbol, for example:

operation is defined as follows:

=

!

£

|k

i (141)/2 -

z = x

k.

The

(21.8)

In other words, for each output position i, we take the dot product between the kernel k and a

snippet of x centered on x; with width /.

The process is illustrated in Figure 21.4 for a kernel vector [+1,—1,+1], which detects a darker point in the 1D image. (The 2D version might detect a darker line.) Notice that in

this example the pixels on which the kernels are centered are separated by a distance of 2 pixels; we say the kernel is applied with a stride s=2. Notice that the output layer has fewer Stride pixels: because of the stride, the number of pixels is reduced from 1 to roughly n/s. (In two dimensions, the number of pixels would be roughly n/s,s,, where s, and s, are the strides in the x and y directions in the image.) We say “roughly” because of what happens at the edge of the image: in Figure 21.4 the convolution stops at the edges of the image, but one can also

pad the input with extra pixels (either zeroes or copies of the outer pixels) so that the kernel can be applied exactly |n/s| times. For small kernels, we typically use s=1, so the output

has the same dimensions as the image (see Figure 21.5). The operation of applying a kernel across an image can be implemented in the obvious way by a program with suitable nested loops; but it can also be formulated as a single matrix

operation, just like the application of the weight matrix in Equation (21.1). For example, the convolution illustrated in Figure 21.4 can be viewed as the following matrix multiplication: 0 -1 0

0 +1 +1

0 0 -1

0 0 +1

5

oo

-1+ 0 +1 0 0

§ =[9]. 4

21.9)

hom

4+l 0 0

In this weight matrix, the kernel appears in each row, shifted according to the stride relative

to the previous row, One wouldn’t necessarily construct the weight matrix explicitly—it is

761

762

Chapter 21

Deep Learning

Figure 21.5 The first two layers of a CNN for a 1D image with a kernel size /=3 and a stride s= 1. Padding is added at the left and right ends in order to keep the hidden layers the same size as the input. Shown in red is the receptive field ofa unit in the second hidden layer. Generally speaking, the deeper the unit, the larger the receptive field. mostly zeroes, after all—but the fact that convolution is a linear matrix operation serves as a

reminder that gradient descent can be applied easily and effectively to CNNG, just as it can to plain vanilla neural networks. As mentioned earlier, there will be d kernels, not just one; so, with a stride of 1, the

output will be d times larger.

This means that a two-dimensional input array becomes a

three-dimensional array of hidden units, where the third dimension is of size d.

It is im-

portant to organize the hidden layer this way, so that all the kernel outputs from a particular image location stay associated with that location. Unlike the spatial dimensions of the image,

Receptive field

however, this additional “kernel dimension” does not have any adjacency properties, so it does not make sense to run convolutions along it. CNNs were inspired originally by models of the visual cortex proposed in neuroscience. In those models, the receptive

field of a neuron is the portion of the sensory input that can

affect that neuron’s activation. In a CNN, the receptive field of a unit in the first hidden layer

is small—just the size of the kernel, i.e., / pixels.

In the deeper layers of the network, it

can be much larger. Figure 21.5 illustrates this for a unit in the second hidden layer, whose

receptive field contains five pixels. When the stride is 1, as in the figure, a node in the mth

hidden layer will have a receptive field of size (I — 1)m+ 1; so the growth is linear in m. (In a 2D image, each dimension of the receptive field grows linearly with m, so the area grows quadratically.) When the stride is larger than 1, each pixel in layer m represents s pixels in layer m — 1; therefore, the receptive field grows as O(ls™)—that is, exponentially with depth.

The same effect occurs with pooling layers, which we discuss next. 21.3.1

Pooling

Pooling and downsampling

A pooling layer in a neural network summarizes a set of adjacent units from the preceding layer with a single value. Pooling works just like a convolution layer, with a kernel size / and stride s, but the operation that is applied is fixed rather than learned. Typically, no activation function is associated with the pooling layer. There are two common forms of pooling: * Average-pooling computes the average value of its / inputs.

Downsampling

This is identical to con-

volution with a uniform kernel vector k= (1/1,..., 1/I]. If we set [ =s, the effect is to coarsen the resolution of the image—to downsample it—by a factor of s. An object

that occupied, say, 10s pixels would now occupy only 10 pixels after pooling. The same

Section 21.3

Convolutional Networks

learned classifier that would be able to recognize the object at original image would now be able to recognize that object in if it was too big to recognize in the original image. In other facilitates multiscale recognition. It also reduces the number

a size of 10 pixels in the the pooled image, even words, average-pooling of weights required in

subsequent layers, leading to lower computational cost and possibly faster learning.

* Max-pooling computes the maximum value of its / inputs.

It can also be used purely

for downsampling, but it has a somewhat different semantics. Suppose we applied maxpooling to the hidden layer [5,9,4] in Figure 21.4: the result would be a 9, indicating that somewhere in the input image there is a darker dot that is detected by the kernel.

In other words, max-pooling acts as a kind of logical disjunction, saying that a feature

exists somewhere in the unit’s receptive field.

If the goal is to classify the image into one of ¢ categories, then the final layer of the network will be a softmax with ¢ output units.

The early layers of the CNN

are image-sized, so

somewhere in between there must be significant reductions in layer size. Convolution layers and pooling layers with stride larger than 1 all serve to reduce the layer size. It’s also possible to reduce the layer size simply by having a fully connected layer with fewer units than the preceding layer. CNNs often have one or two such layers preceding the final softmax layer. 21.3.2

Tensor operations in CNNs

We saw in Equations (21.1) and (21.3) that the use of vector and matrix notation can be helpful in keeping mathematical derivations simple and elegant and providing concise descriptions of computation graphs. Vectors and matrices are one-dimensional and two-dimensional special cases of tensors, which (in deep learning terminology) are simply multidimensional arrays Tensor of any dimension.’ For CNNs, tensors are a way of keeping track of the “shape™ of the data as it progresses

through the layers of the network. This is important because the whole notion of convolution

depends on the idea of adjacency:

adjacent data elements are assumed to be semantically

related, so it makes sense to apply operators to local regions of the data. Moreover, with suitable language primitives for constructing tensors and applying operators, the layers them-

selves can be described concisely as maps from tensor inputs to tensor outputs.

A final reason for describing CNNs in terms of tensor operations is computational effi-

ciency: given a description of a network as a sequence of tensor operations, a deep learning software package can generate compiled code that is highly optimized for the underlying

computational substrate. Deep learning workloads are often run on GPUs (graphics processing units) or TPUs (tensor processing units), which make available a high degree of parallelism. For example, one of Google’s third-generation TPU pods has throughput equivalent

to about ten million laptops. Taking advantage of these capabilities is essential if one is train-

ing a large CNN on a large database of images. Thus, it is common to process not one image at a time but many images in parallel; as we will see in Section 21.4, this also aligns nicely with the way that the stochastic gradient descent algorithm calculates gradients with respect

to a minibatch of training examples.

Let us put all this together in the form of an example.

256 x 256 RGB

images with a minibatch size of 64.

atical definition of tensors requires that cert

Suppose we are training on

The input in this case will be a four-

ariances hold under a change of b:

763

764

Chapter 21

Deep Learning

dimensional tensor of size 256 x 256 x 3 x 64.

Feature map Channel

Then we apply 96 kernels of size 5x5x 3

with a stride of 2 in both x and y directions in the image. This gives an output tensor of size 128 x 128 x 96 x 64. Such a

tensor is often called a feature map, since it shows how each

feature extracted by a kernel appears across the entire image; in this case it is composed of

96 channels, where each channel carries information from one feature. Notice that unlike the

input tensor, this feature map no longer has dedicated color channels; nonetheless, the color information may still be present in the various feature channels if the learning algorithm finds

color to be useful for the final predictions of the network.

21.3.3 Residual network

Residual networks

Residual networks are a popular and successful approach to building very deep networks that avoid the problem of vanishing gradients.

Typical deep models use layers that learn a new representation at layer i by completely re-

placing the representation at layer i — 1. Using the matrix—vector notation that we introduced

in Equation (21.3), with z(!) being the values of the units in layer i, we have

20 = f(20) = g (WD)

Because each layer completely replaces the representation from the preceding layer, all of the layers must learn to do something useful. Each layer must, at the very least, preserve the

task-relevant information contained in the preceding layer. If we set W = 0 for any layer i, the entire network ceases to function. If we also set W(~!) = 0, the network would not even

be able to learn: layer i would not learn because it would observe no variation in its input from layer i — 1, and layer i — 1 would not learn because the back-propagated gradient from

layer i would always be zero. Of course, these are extreme examples, but they illustrate the

need for layers to serve as conduits for the signals passing through the network.

The key idea of residual networks is that a layer should perturb the representation from the previous layer rather than replace it entirely. If the learned perturbation is small, the next

layer is close to being a copy of the previous layer. This is achieved by the following equation for layer i in terms of layer i — 1:

20 = g (@D 4 @),

Residual

1.10)

where g, denotes the activation functions for the residual layer. Here we think of f as the

residual, perturbing the default behavior of passing layer i — 1 through to layer i. The function used to compute the residual is typically a neural network with one nonlinear layer combined

with one linear layer:

f(z) =Vg(Wz), where W and V are learned weight matrices with the usual bias weights added.

Residual networks make it possible to learn significantly deeper networks reliably. Consider what happens if we set V=0 for a particular layer in order to disable that layer. Then the residual f disappears and Equation (21.10) simplifies to 20 = g,z ). Now suppose that g, con:

function to its inputs: z

s of ReLU activation functions and that z¢~!) also applies a ReLU =ReLU(in"""). In that case we have

21 = g,(z0V) = ReLU(2"")) = ReLU(ReLU(in"""))) = ReLU(in(""1) = 2(-1) |

Section 21.4 Learning Algorithms where the penultimate step follows because ReLU(ReLU(x))=ReLU(x).

In other words,

in residual nets with ReLU activations, a layer with zero weights simply passes its inputs

through with no change. The rest of the network functions just as if the layer had never existed. Whereas traditional networks must learn to propagate information and are subject to catastrophic failure of information propagation for bad choices of the parameters, residual

networks propagate information by default. Residual networks

are often used with convolutional layers in vision applications, but

they are in fact a general-purpose tool that makes deep networks more robust and allows

researchers to experiment more freely with complex and heterogeneous network designs. At the time of writing, it is not uncommon

to see residual networks with hundreds of layers.

The design of such networks is evolving rapidly, so any additional specifics we might provide would probably be outdated before reaching printed form. Readers desiring to know the best architectures for specific applications should consult recent research publications.

21.4

Learning Algorithms

Training a neural network consists of modifying the network’s parameters so as to minimize

the loss function on the training set. In principle, any kind of optimization algorithm could

be used. In practice, modern neural networks are almost always trained with some variant of

stochastic gradient descent (SGD).

‘We covered standard gradient descent and its stochastic version in Section 19.6.2. Here,

the goal i to minimize the loss L(w), where w represents all of the parameters of the network. Each update step in the gradient descent process looks like this: W

w—aVyL(w),

where « is the learning rate. For standard gradient descent, the loss L is defined with respect

to the entire training set. For SGD, it is defined with respect to a minibatch of m examples chosen randomly at each step.

As noted in Section 4.2, the literature on optimization methods for high-dimensional

continuous spaces includes innumerable enhancements to basic gradient descent.

We will

not cover all of them here, but it is worth mentioning a few important considerations that are

particularly relevant to training neural networks:

+ For most networks that solve real-world problems, both the dimensionality of w and the

size of the training set are very large. These considerations militate strongly in favor

of using SGD with a relatively small minibatch size m: stochasticity helps the algo-

rithm escape small local minima in the high-dimensional weight space (as in simulated annealing—see page 114); and the small minibatch size ensures that the computational

cost of each weight update step is a small constant, independent of the training set size. * Because the gradient contribution of each training example in the SGD minibatch can

be computed independently, the minibatch size is often chosen so as to take maximum advantage of hardware parallelism in GPUs or TPUs.

« To improve convergence, it is usually a good idea to use a learning rate that decreases over time. Choosing the right schedule is usually a matter of trial and error. + Near a local or global minimum of the loss function with respect to the entire training set, the gradients estimated from small minibatches will often have high variance and

765

766

Chapter 21

Deep Learning

Figure 21.6 Tllustration of the back-propagation of gradient information in an arbitrary computation graph. The forward computation of the output of the network proceeds from left to right, while the back-propagation of gradients proceeds from right to left. may point in entirely the wrong direction, making convergence difficult. One solution

Momentum

is to increase the minibatch size as training proceeds; another is to incorporate the idea

of momentum, which keeps a running average of the gradients of past minibatches in order to compensate for small minibatch sizes.

+ Care must be taken to mitigate numerical instabilities that may arise due to overflow,

underflow, and rounding error. These are particularly problematic with the use of exponentials in softmax,

sigmoid,

and tanh activation functions, and with the iterated

computations in very deep networks and recurrent networks (Section 21.6) that lead to

vanishing and exploding activations and gradients. Overall, the process of learning the weights of the network is usually one that exhibits diminishing returns.

We run until it is no longer practical to decrease the test error by running

longer. Usually this does not mean we have reached a global or even a local minimum of the loss function. Instead, it means we would have to make an impractically large number of very small steps to continue reducing the cost, or that additional steps would only cause overfitting, or that estimates of the gradient are too inaccurate to make further progress.

21.4.1

Computing

gradients in computation graphs

On page 755, we derived the gradient of the loss function with respect to the weights in a specific (and very simple) network. We observed that the gradient could be computed by back-propagating error information from the output layer of the network to the hidden layers. ‘We also said that this result holds in general for any feedforward computation graph. Here,

we explain how this works. Figure 21.6 shows a generic node in a computation graph. (The node & has in-degree and out-degree 2, but nothing in the analysis depends on this.) During the forward pass, the node

computes some arbitrary function / from its inputs, which come from nodes f and g. In turn, h feeds its value to nodes j and k.

The back-propagation process passes messages back along each link in the network. At each node, the incoming messages are collected and new messages are calculated to pass

Section 21.4 Learning Algorithms back to the next layer. As the figure shows, the messages are all partial derivatives of the loss

L. For example, the backward message dL/dh; is the partial derivative of L with respect to

Jj’s first input, which is the forward message from & to j. Now, / affects L through both j and k, so we have

AL/3h=OL/3h;+ AL/ .

@111y

‘With this equation, the node / can compute the derivative of L with respect to / by summing

the incoming messages from j and k. Now, to compute the outgoing messages dL/d fy, and 9L/dgp, we use the following equations:

OL

i

9L ah

ISy

and

AL

g,

JL Jh

9h dgi

(21.12)

In Equation (21.12), JL/dh was already computed by Equation (21.11), and 9h/d f;, and dh/dgy are just the derivatives of i with respect to its first and second arguments, respec-

tively. For example, if 4 is a multiplication node—that is, h(f,g)= f - g—then dh/d fi=g and 9h/dg, = f. Software packages for deep learning typically come with a library of node types (addition, multiplication, sigmoid, and so on), each of which knows how to compute its own derivatives as needed for Equation (21.12). The back-propagation process begins with the output nodes, where each initial message 9L/35; is calculated directly from the expression for L in terms of the predicted value § and the true value y from the training data.

At each internal node, the incoming backward

messages are summed according to Equation (21.11) and the outgoing messages are generated

from Equation (21.12). The process terminates at each node in the computation graph that represents a weight w (e.g., the light mauve ovals in Figure 21.3(b)). At that point, the sum of the incoming messages to w is JL/Jw—precisely the gradient we need to update w. Exercise 21.BPRE asks you to apply this process to the simple network in Figure 21.3 in order

to rederive the gradient expressions in Equations (21.4) and (21.5).

‘Weight-sharing, as used in convolutional networks (Section 21.3) and recurrent networks

(Section 21.6), is handled simply by treating each shared weight as a single node with multiple outgoing arcs in the computation graph. During back-propagation, this results in multiple incoming gradient messages. By Equation (21.11), this means that the gradient for the shared

weight is the sum of the gradient contributions from each place it is used in the network.

It is clear from this description of the back-propagation process that its computational

cost is linear in the number of nodes in the computation graph, just like the cost of the forward computation. Furthermore, because the node types are typically fixed when the network

is designed, all of the gradient computations can be prepared in symbolic form in advance

and compiled into very efficient code for each node in the graph.

Note also that the mes-

sages in Figure 21.6 need not be scalars: they could equally be vectors, matrices, or higher-

dimensional tensors, so that the gradient computations can be mapped onto GPUs or TPUs to benefit from parallelism.

One drawback of back-propagation is that it requires storing most of the intermediate

values that were computed during forward propagation in order to calculate gradients in the backward pass. This means that the total memory cost of training the network is proportional to the number of units in the entire network.

Thus, even if the network itself is represented

only implicitly by propagation code with lots of loops, rather than explicitly by a data struc-

ture, all of the intermediate results of that propagation code have to be stored explicitly.

767

768

Chapter 21 21.4.2

Batch normalization

Deep Learning

Batch normalization

Batch normalization is a commonly used technique that improves the rate of convergence of SGD by rescaling the values generated at the internal layers of the network from the examples within each minibatch.

Although the reasons for its effectiveness are not well understood at

the time of writing, we include it because it confers significant benefits in practice. To some

extent, batch normalization seems to have effects similar to those of the residual network. Consider a node z somewhere in the network: the values of z for the m examples in a

minibatch are zj., ... ,z,. Batch normalization replaces each z; with a new quantity Z;:

where 1 is the mean value of z across the minibatch, o is the standard deviation of zy, ...z,

€ is a small constant added to prevent division by zero, and and 3 are learned parameters.

Batch normalization standardizes the mean and variance of the values, as determined by the values of 3 and . This makes it much simpler to train a deep network. Without batch

normalization, information can get lost if a layer’s weights are too small, and the standard

deviation at that layer decays to near zero. Batch normalization prevents this from happening.

It also reduces the need for careful initialization of all the weights in the network to make sure

that the nodes in each layer are in the right operating region to allow information to propagate. With batch normalization, we usually include 3 and , which may be node-specific or layer-specific, among the parameters of the network, so that they are included in the learning process.

After training, 3 and ~ are fixed at their learned values.

21.5

Generalization

So far we have described how to fit a neural network to its training set, but in machine learn-

ing the goal is to generalize to new data that has not been seen previously, as measured by performance on a test set. In this section, we focus on three approaches to improving gener-

alization performance: choosing the right network architecture, penalizing large weights, and

randomly perturbing the values passing through the network during training. 21.5.1

Choosing a network architecture

A great deal of effort in deep learning research has gone into finding network architectures that generalize well. Indeed, for each particular kind of data—images, speech, text, video, and so on—a good deal of the progress in performance has come from exploring different kinds of network architectures and varying the number of layers, their connectivity, and the types of node in each layer.®

Some neural network architectures are explicitly designed to generalize well on particular

types of data: convolutional networks encode the idea that the same feature extractor is useful at all locations across a spatial grid, and recurrent networks encode the idea that the same

update rule is useful at all points in a stream of sequential data.

To the extent that these

assumptions are valid, we expect convolutional architectures to generalize well on images and recurrent networks to generalize well on text and audio signals.

& Noting that much of this incremental, exploratory work is arried out by graduate students, some have called the process graduate student descent (GSD).

Section 21.5

01

3.layer 1llayer

008 Test-set error

Generalization

006 004 002

o

1

2

3

4

5

Number of weights (x 10")

6

7

Figure 21.7 Test-set error as a function of layer width (as measured by total number of weights) for three-layer and eleven-layer convolutional networks. The data come from early versions of Google’s system for transcribing addresses in photos taken by Street View cars (Goodfellow et al., 2014). One of the most important empirical findings in the field of deep learning is that when

comparing two networks with similar numbers of weights, the deeper network usually gives better generalization performance.

Figure 21.7 shows this effect for at least one real-world

application—recognizing house numbers. The results show that for any fixed number of parameters, an eleven-layer network gives much lower test-set error than a three-layer network. Deep learning systems perform well on some but not all tasks. For tasks with high-

dimensional inputs—images, video, speech signals, etc.—they perform better than any other

pure machine learning approaches. Most of the algorithms described in Chapter 19 can handle high-dimensional input only if it is preprocessed using manually designed features to reduce the dimensionality. This preprocessing approach, which prevailed prior to 2010, has not yielded performance comparable to that achieved by deep learning systems. Clearly, deep learning models are capturing some important aspects of these tasks. In particular, their success implies that the tasks can be solved by parallel programs with a relatively

small number of steps (10 to 10° rather than, say, 107). This is perhaps not surprising, because these tasks are typically solved by the brain in less than a second, which is time enough for only a few tens of sequential neuron firings. Moreover, by examining the internal-layer representations learned by deep convolutional networks for vision tasks, we find evidence

that the processing steps seem to involve extracting a sequence of increasingly abstract representations of the scene, beginning with tiny edges, dots, and corner features and ending with

entire objects and arrangements of multiple objects. On the other hand, because they are simple circuits, deep learning models lack the compositional and quantificational expressive power that we see in first-order logic (Chapter 8) and context-free grammars (Chapter 23).

Although deep learning models generalize well in many cases, they may also produce

unintuitive errors. They tend to produce input-output mappings that are discontinuous, so

that a small change to an input can cause a large change in the output. For example, it may

769

770

Adversarial example

Chapter 21

Deep Learning

be possible to alter just a few pixels in an image of a dog and cause the network to classify the dog as an ostrich or a school bus—even though the altered image still looks exactly like a

dog. An altered image of this kind is called an adversarial example. In low-dimensional spaces it is hard to find adversarial examples. But for an image with

a million pixel values, it is often the case that even though most of the pixels contribute to

the image being classified in the middle of the “dog” region of the space, there are a few dimensions where the pixel value is near the boundary to another category. An adversary

with the ability to reverse engineer the network can find the smallest vector difference that

would move the image over the border.

When adversarial examples were first discovered, they set off two worldwide scrambles:

one to find learning algorithms and network architectures that would not be susceptible to adversarial attack, and another to create ever-more-effective adversarial attacks against all

kinds of learning systems.

So far the attackers seem to be ahead.

In fact, whereas it was

assumed initially that one would need access to the internals of the trained network in order

to construct an adversarial example specifically for that network, it has turned out that one can construct robust adversarial examples that fool multiple networks with different architec-

tures, hyperparameters, and training sets. These findings suggest that deep learning models

recognize objects in ways that are quite different from the human visual system. 21.5.2

Neural architecture search

Unfortunately, we don’t yet have a clear set of guidelines to help you choose the best network

architecture for a particular problem. Success in deploying a deep learning solution requires experience and good judgment. From the earliest days of neural network research, attempts have been made to automate

the process of architecture selection. We can think of this as a case of hyperparameter tuning (Section 19.4.4), where the hyperparameters determine the depth, width, connectivity, and

Neural architecture search

other attributes of the network.

However, there are so many choices to be made that simple

possible network architectures.

Many of the search techniques and learning techniques we

approaches like grid search can’t cover all possibilities in a reasonable amount of time. Therefore, it is common to use neural architecture search to explore the state space of covered earlier in the book have been applied to neural architecture search.

Evolutionary algorithms have been popular because it is sensible to do both recombination (joining parts of two networks together) and mutation (adding or removing a layer or changing a parameter value). Hill climbing can also be used with these same mutation operations.

Some researchers have framed the problem as reinforcement learning, and some

as Bayesian optimization.

Another possibility is to treat the architectural possibilities as a

continuous differentiable space and use gradient descent to find a locally optimal solution.

For all these search techniques, a major challenge is estimating the value of a candidate

network.

The straightforward way to evaluate an architecture is to train it on a test set for

multiple batches and then evaluate its accuracy on a validation set. But with large networks that could take many GPU-days.

Therefore, there have been many attempts to speed up this estimation process by eliminating or at least reducing the expensive training process. We can train on a smaller data set. We can train for a small number of batches and predict how the network would improve with more batches.

We can use a reduced version of the network architecture that we hope

Section 21.5

Generalization

771

retains the properties of the full version. We can train one big network and then search for subgraphs of the network that perform better; this search can be fast because the subgraphs

share parameters and don’t have to be retrained. Another approach is to learn a heuristic evaluation function (as was done for A* search).

That is, start by choosing a few hundred network architectures and train and evaluate them.

That gives us a data set of (network, score) pairs. Then learn a mapping from the features of a network to a predicted

score. From that point on we can generate a large number of candidate

networks and quickly estimate their value. After a search through the space of networks, the best one(s) can be fully evaluated with a complete training procedure.

21.5.3 Weight decay In Section 19.4.3 we saw that regularization—limiting the complexity of a model—can aid

generalization. This is true for deep learning models as well. In the context of neural networks

we usually call this approach weight decay.

Weight decay consists of adding a penalty AX; ;W

to the loss function used to train the

neural network, where \ is a hyperparameter controlling the strength of the penalty and the

sum is usually taken over all of the weights in the network. Using A=0 is equivalent to not using weight decay, while using larger values of A encourages the weights to become small. It is common to use weight decay with A near 104

Choosing a specific network architecture can be seen as an absolute constraint on the

hypothesis space: a function is either representable within that architecture or it is not. Loss

function penalty terms such as weight decay offer a softer constraint: functions represented

with large weights are in the function family, but the training set must provide more evidence in favor of these functions than is required to choose a function with small weights. It is not straightforward to interpret the effect of weight decay in a neural network. In

networks with sigmoid activation functions, it is hypothesized that weight decay helps to keep the activations near the linear part of the sigmoid, avoiding the flat operating region

that leads to vanishing gradients. With ReLU activation functions, weight decay seems to be beneficial, but the explanation that makes sense for sigmoids no longer applies because the ReLU’s output is either linear or zero.

Moreover, with residual connections, weight decay

encourages the network to have small differences between consecutive layers rather than

small absolute weight values.

Despite these differences in the behavior of weight decay

across many architectures, weight decay is still widely useful.

One explanation for the beneficial effect of weight decay is that it implements a form of maximum a posteriori (MAP) learning (see page 723). Letting X and y stand for the inputs

and outputs across the entire training set, the maximum a posteriori hypothesis /ap satisfies

Inap = argmax P(y| X, W)P(W) w = argmin[ log P(y|X, W) — log P(W)] The first term is the usual cross-entropy loss; the second term prefers weights that are likely

under a prior distribution. This aligns exactly with a regularized loss function if we set

logP(W) = -A Y W3, 7

which means that P(W) is a zero-mean Gaussian prior.

Weight decay

772

Chapter 21 21.5.4

Dropout

Deep Learning

Dropout

Another way that we can intervene to reduce the test-set error of a network—at the cost of making it harder to fit the training set—is to use dropout. At each step of training, dropout

applies one step of back-propagation learning to a new version of the network that is created

by deactivating a randomly chosen subset of the units. This is a rough and very low-cost approximation to training a large ensemble of different networks (see Section 19.8).

More specifically, let us suppose we are using stochastic gradient descent with minibatch

size m.

For each minibatch, the dropout algorithm applies the following process to every

node in the network: with probability p, the unit output is multiplied by a factor of 1/p;

otherwise, the unit output is fixed at zero. Dropout is typically applied to units in the hidden

layers with p=0.5; for input units, a value of p=0.8 turns out to be most effective. This process

produces a thinned network with about half as many units as the original, to which

back-propagation is applied with the minibatch of m training examples. The process repeats in the usual way until training is complete. At test time, the model is run with no dropout.

We can think of dropout from several perspectives:

+ By introducing noise at training time, the model is forced to become robust to noise. + As noted above, dropout approximates the creation of a large ensemble of thinned net-

works. This claim can be verified analytically for linear models, and appears to hold experimentally for deep learning models.

+ Hidden units trained with dropout must learn not only to be useful hidden units; they

must also learn to be compatible with many other possible sets of other hidden units

that may or may not be included in the full model.

This is similar to the selection

processes that guide the evolution of genes: each gene must not only be effective in its own function, but must work well with other genes, whose identity in future organisms

may vary considerably.

+ Dropout applied to later layers in a deep network forces the final decision to be made

robustly by paying attention to all of the abstract features of the example rather than focusing on just one and ignoring the others. For example, a classifier for animal images might be able to achieve high performance on the training set just by looking at the

animal’s nose, but would presumably fail on a test case where the nose was obscured or damaged. With dropout, there will be training cases where the internal “nose unit” is

zeroed out, causing the learning process to find additional identifying features. Notice

that trying to achieve the same degree of robustness by adding noise to the input data

would be difficult: there is no easy way to know in advance that the network is going to

focus on noses, and no easy way to delete noses automatically from each image.

Altogether, dropout forces the model to learn multiple, robust explanations for each input.

This causes the model to generalize well, but also makes it more difficult to fit the training

set—it is usually necessary to use a larger model and to train it for more iterations.

21.6

Recurrent Neural Networks

Recurrent neural networks (RNNs) are distinct from feedforward networks in that they allow

cycles in the computation graph. In all the cases we will consider, each cycle has a delay,

50 that units may take as input a value computed from their own output at an earlier step in

Section 21.6

(@)

Recurrent Neural Networks

773

(b)

Figure 21.8 (a) Schematic diagram of a basic RNN where the hidden layer z has recurrent connections; the A symbol indicates a delay. (b) The same network unrolled over three time steps to create a feedforward network. Note that the weights are shared across all time steps. the computation. (Without the delay, a cyclic circuit may reach an inconsistent state.) This

allows the RNN to have internal state, or memory: inputs received at earlier time steps affect

the RNN’s response to the current input. RNNs can also be used to perform more general computations—after all, ordinary com-

puters are just Boolean circuits with memory—and to model real neural systems, many of

which contain cyclic connections. Here we focus on the use of RNNs to analyze sequential data, where we assume that a new input vector X, arrives at each time step.

As tools for analyzing sequential data, RNNs can be compared to the hidden Markov

models, dynamic Bayesian networks, and Kalman filters described in Chapter 14. (The reader

may find it helpful to refer back to that chapter before proceeding.) Like those models, RNNs. make a Markov assumption (see page 463):

the hidden state z, of the network suffices

to capture the information from all previous inputs. Furthermore, suppose we describe the RNN'’s update process for the hidden state by the equation z, =f,,(z,_;,x;) for some param-

eterized function f,. Once trained, this function represents a time-homogeneous process

(page 463)—effectively a universally quantified assertion that the dynamics represented by

fw hold for all time steps. Thus, RNNs add expressive power compared to feedforward networks, just as convolutional networks do, and just as dynamic Bayes nets add expressive power compared to regular Bayes nets. Indeed, if you tried to use a feedforward network to analyze sequential data, the fixed size of the input layer would force the network to examine only a finite-length window of data, in which case the network would fail to detect

long-distance dependencies. 21.6.1

Training a basic RNN

The basic model we will consider has an input layer x, a hidden layer z with recurrent con-

nections, and an output layer y, as shown in Figure 21.8(a). We assume that both x and y are

observed in the training data at each time step. The equations defining the model refer to the values of the variables indexed by time step 7:

% = fo(z-1,%)=g (W21 + Wiexi) = g (inz,) 9 = g(Weyz) =gy(iny,),

(21.13)

Memory

774

Chapter 21

Deep Learning

where g, and g, denote the activation functions for the hidden and output layers, respectively.

As usual, we assume an extra dummy input fixed at +1 for each unit as well as bias weights

associated with those inputs. Given a sequence of input vectors Xi,...,x and observed outputs y;,...,yr, we can turn this model into a feedforward network by “unrolling” it for T steps, as shown in Figure 21.8(b). Notice that the weight matrices W, W__, and W are shared across all time steps.

In the unrolled network, it is easy to see that we can calculate gradients to train the

weights in the usual way; the only difference is that the sharing of weights across layers makes the gradient computation a little more complicated.

To keep the equations simple, we will show the gradient calculation for an RNN with

just one input unit, one hidden unit, and one output unit.

For this case, making the bias

weights explicit, we have z,=g:(w.z—1 + WioX; + woz) and § =gy (w.,z + woy). As in Equations (21.4) and (21.5), we will assume a squared-error loss L—in this case, summed

over the time steps. The derivations for the input-layer and output-layer weights w,. and w.., are essentially identical to Equation (21.4), so we leave them as an exercise. For the hidden-layer weight w. ., the first few steps also follow the same pattern as Equation (21.4):

%

= %

7

T

5

Z’Z(,Vl’fl);%

r

I

™=

M

I

y(iny) = );. 2y

$1)8(inys)

=200, = 30, i) g(weys + ) a

) wzy =20y = )&y (iny

(21.14)

I

Now the gradient for the hidden unit z; can be obtained from the previous time step as follows:

9z

T

W,

(Wezio1 + Wy

+wo2)

(21.15)

where the last line uses the rule for derivatives of products: (uv)/dx=vdu/dx+udv/dx.

Looking at Equation (21.15), we notice two things. First, the gradient expression is re-

cursive: the contribution to the gradient from time step 7 is calculated using the contribution

Back-propagation through time

from time step 7 — 1. If we order the calculations in the right way, the total run time for computing the gradient will be linear in the size of the network. This algorithm is called backpropagation through time, and is usually handled automatically by deep learning software

Exploding gradient

terms proportional o w.. TT_ g.(in.,). For sigmoids, tanhs, and ReLUs, g’ < 1, 5o our simple RNN will certainly suffer from the vanishing gradient problem (see page 756) if w.. < 1. On the other hand, if w.. > 1, we may experience the exploding gradient problem. (For the

systems. Second, if we iterate the recursive calculation, we see that gradients at 7 will include

general case, these outcomes depend on the first eigenvalue of the weight matrix W...) The next section describes a more elaborate RNN design intended to mitigate this issue.

Section 217 21.6.2

Long short-term memory

Unsupervised Learning and Transfer Learning

775

RNNs

Several specialized RNN architectures have been designed with the goal of enabling informa-

tion to be preserved over many time steps. One of the most popular is the long short-term

short-term ‘memory or LSTM. The long-term memory component of an LSTM, called the memory cell Long memory and denoted by ¢, is essentially copied from time step to time step. (In contrast, the basic RNN

multiplies its memory by a weight matrix at every time step, as shown in Equation (21.13).)

Memory cell

New information enters the memory by adding updates; in this way, the gradient expressions

do not accumulate multiplicatively over time. LSTMs also include gating units, which are Gating unit vectors that control the flow of information in the LSTM via elementwise multiplication of

the corresponding information vector: « The forget gate f determines if each element of the memory cell is remembered (copied Forget gate to the next time step) or forgotten (reset to zero).

« The input gate i determines if each element of the memory cell is updated additively Input gate by new information from the input vector at the current time step.

« The output gate o determines if each element of the memory cell is transferred to the Output gate short-term memory z, which plays a similar role to the hidden state in basic RNNs. Whereas the word “gate” in circuit design usually connotes a Boolean function, gates in

LSTMs are soft—for example, elements of the memory cell vector will be partially forgotten

if the corresponding elements of the forget-gate vector are small but not zero. The values for the gating units are always in the range [0, 1] and are obtained as the outputs of a sigmoid function applied to the current input and the previous hidden state. In detail, the update equations for the LSTM are as follows: £, =

o(Weyx +W_yz,1)

o(Weix+ Weizi1) (WeoX, +Wooz,1)

€ = ¢ 1 Of +i O tanh(Wy X, + Wez,1)

7, =

tanh(¢,) © o,

where the subscripts on the various weight matrices W indicate the origin and destination of

the corresponding links. The ® symbol denotes elementwise multiplication. LSTMs were among the first practically usable forms of RNN. They have demonstrated excellent performance on a wide range of tasks including speech recognition and handwriting recognition. Their use in natural language processing is discussed in Chapter 24. 21.7

Unsupervised

Learning and Transfer Learning

The deep learning systems we have discussed so far are based on supervised learning, which requires each training example to be labeled with a value for the target function.

Although

such systems can reach a high level of test-set accuracy—as shown by the ImageNet com-

petition results, for example—they often require far more labeled data than a human would for the same task. For example, a child needs to see only one picture of a giraffe, rather than thousands, in order to be able to recognize giraffes reliably in a wide range of settings

and views. Clearly, something is missing in our deep learning story; indeed, it may be the

776

Chapter 21

Deep Learning

case that our current approach to supervised deep learning renders some tasks completely

unattainable because the requirements for labeled data would exceed what the human race

(or the universe) can supply. Moreover, even in cases where the task is feasible, labeling large data sets usually requires scarce and expensive human labor.

For these reasons, there is intense interest in several learning paradigms that reduce the

dependence on labeled data. As we saw in Chapter 19, these paradigms include unsuper-

vised learning, transfer learning, and semisupervised learning. Unsupervised learning

algorithms learn solely from unlabeled inputs x, which are often more abundantly available than labeled examples. Unsupervised learning algorithms typically produce generative models, which can produce realistic text, images, audio, and video, rather than simply predicting labels for such data. Transfer learning algorithms require some labeled examples but are able to improve their performance further by studying labeled examples for different tasks, thus making it possible to draw on more existing sources of data. Semisupervised learning algo-

rithms require some labeled examples but are able to improve their performance further by also studying unlabeled examples. This section covers deep learning approaches to unsupervised and transfer learning; while semisupervised learning is also an active area of research in the deep learning community, the techniques developed so far have not proven broadly

effective in practice, so we do not cover them. 21.7.1

Unsupervised learning

Supervised learning algorithms all have essentially the same goal: given a training set of inputs x and corresponding outputs y=

f(x), learn a function A that approximates

f well.

Unsupervised learning algorithms, on the other hand, take a training set of unlabeled exam-

ples x. Here we describe two things that such an algorithm might try to do. The first is to

learn new representations—for example, new features of images that make it easier to iden-

tify the objects in an image. The second is to learn a generative model—typically in the form

of a probability distribution from which new samples can be generated. (The algorithms for learning Bayes nets in Chapter 20 fall in this category.) Many algorithms are capable of both representation learning and generative modeling. Suppose we learn a joint model Ay (x,z), where z is a set of latent, unobserved variables that represent the content of the data x in some way. In keeping with the spirit of the chapter, we do not predefine the meanings of the z variables; the model is free to learn to associate

z with x however it chooses. For example, a model trained on images of handwritten digits

might choose to use one direction in z space to represent the thickness of pen strokes, another to represent ink color, another to represent background color, and so on. With images of faces, the learning algorithm might choose one direction to represent gender and another to

capture the presence or absence of glasses, as illustrated in Figure 21.9. A learned probability model Py (x,z) achieves both representation learning (it has constructed meaningful z vectors from the raw x vectors) and generative modeling: grate z out of Py (x,z) we obtain Py (x).

if we inte-

Probabilistic PCA: A simple generative model PPCA

There have been many proposals for the form that Py (x,z) might take. One of the simplest

is the probabilistic principal components analysis (PPCA) model.” In a PPCA model, z

Section 217

Unsupervised Learning and Transfer Learning

Figure 21.9 A demonstration of how a generative model has learned to use different directions in z space to represent different aspects of faces. We can actually perform arithmetic in 2 space. The images here are all generated from the learned model and show what happens when we decode different points in z space. We start with the coordinates for the concept of “man with glasses.” subtract off the coordinates for “man,” add the coordinates for “woman.” and obtain the coordinates for “woman with glasses.” Images reproduced with permission from (Radford et al., 2015).

is chosen from a zero-mean, spherical Gaussian, then x is generated from z by applying a weight matrix W and adding spherical Gauss ian noise: P(z) = N(z0,I)

By(x|z) = N(x;Wz,0°1).

The weights W (and optionally the noise parameter o) can be learned by maximizing the likelihood of the data, given by

Pu(x) :/Ry(x.z)dz=N(x:0.WWT+rIZI).

(21.16)

The maximization with respect to W can be done by gradient methods or by an efficient iterative EM

algorithm

(see Section 20.3).

Once W

has been learned, new data samples

can be generated directly from Py (x) using Equation (21.16). Moreover, new observations. x that have very low probability according to Equation (21.16) can be flagged as potential

anomalies. With PPCA, we usually assume that the dimensionality ofz is much less than the dimensionality of x, so that the model learns to explain the data as well as possible in terms of a

small number of features. These features can be extracted for use in standard classifiers by computing Z, the expectation of Py (z|x). Generating data from a probabilistic PCA model is straightforward: first sample z from its fixed Gaussian prior, then sample x from a Gaussian with mean Wz.

As we will see

shortly, many other generative models resemble this process, but use complicated mappings

defined by deep models rather than linear mappings from z-space to x-space.

7 Standard PCA involves fitting a multivariate Gaussian to the raw input data and then selecting out the longest axes—the principal components—of that ellipsoidal distribution.

777

778

Chapter 21

Deep Learning

Autoencoders

Autoencoder

Many unsupervised deep learning algorithms are based on the idea of an autoencoder. An autoencoder is a model containing two parts: an encoder that maps from x to a representation

2 and a decoder that maps from a representation Z to observed data x. In general, the encoder

is just a parameterized function f and the decoder is just a parameterized function g. The

model is trained so that x ~ g(f(x)), so that the encoding decoding process. The functions f and g can be simple single matrix or they can be represented by a deep neural A very simple autoencoder is the linear autoencoder, a shared weight matrix W:

process is roughly inverted by the linear models parameterized by a network. where both f and g are linear with

One way to train this model is to minimize the squared error ¥; [[x; — g(f(x;))|[* so that

x ~ g(f(x)). The idea is to train W so that a low-dimensional 2 will retain as much information as possible to reconstruct the high-dimensional data x.

This linear autoencoder

turns out to be closely connected to classical principal components analysis (PCA). When

z is m-dimensional, the matrix W should learn to span the m principal components of the data—in other words, the set of m orthogonal directions in which the data has highest vari-

ance, or equivalently the m eigenvectors of the data covariance matrix that have the largest

eigenvalues—exactly as in PCA. The PCA model is a simple generative model that corresponds to a simple linear autoencoder. The correspondence suggests that there may be a way to capture more complex kinds Variational autoencoder Variational posterior

of generative models using more complex kinds of autoencoders.

coder (VAE) provides one way to do this.

The variational autoen-

Variational methods were introduced briefly on page 458 as a way to approximate the

posterior distribution in complex probability models, where summing or integrating out a large number of hidden variables is intractable.

The idea is to use a variational posterior

Q(z), drawn from a computationally tractable family of distributions, as an approximation to

the true posterior. For example, we might choose Q from the family of Gaussian distributions

with a diagonal covariance matrix. Within the chosen family of tractable distributions, Q is

optimized to be as close as possible to the true posterior distribution P(z|x).

For our purposes, the notion of “as close as possible” is defined by the KL divergence, which we mentioned on page 758. This is given by

Da(Q@IP(alx) = [ 0(a)ioggt

which is an average (with respect to Q) of the log ratio between Q and P. It is easy to see

Variational lower bound ELBO

that Dk (Q(z)||P(z|x)) > 0, with equality when Q and P coincide. We can then define the

variational lower bound £ (sometimes called the evidence lower bound, or ELBO) on the log likelihood of the data:

L(x,0) =logP(x) — Dg1(Q(2)||P(2]x)) QL17) We can see that £ is a lower bound for logP because the KL divergence is nonnegative. Variational learning maximizes £ with respect to parameters w rather than maximizing log P(x),

in the hope that the solution found, w", is close to maximizing log P(x) as well.

Section 217

Unsupervised Learning and Transfer Learning

779

As written, £ does not yet seem to be any easier to maximize than log P. Fortunately, we

can rewrite Equation (21.17) to reveal improved computational tractability:

£ = logP(x)— / Q(z)logp?z(T))()dz = 7/Q(z)logQ(z)dz+/Q(z)logP(x)P(z\x)dz = H(Q)+Fz-glogP(z,x)

where H(Q) is the entropy of the Q distribution.

For some variational families @ (such

as Gaussian distributions), H(Q) can be evaluated analytically.

Moreover, the expectation,

E,olog P(z,x), admits an efficient unbiased estimate via samples of z from Q. For each sample, P(z,x) can usually be evaluated efficiently—for example, if P is a Bayes net, P(z,X) is just a product of conditional probabilities because z and x comprise all the variables. Variational autoencoders provide a means of performing variational learning in the deep learning setting. Variational learning involves maximizing £ with respect to the parameters

of both P and Q. For a variational autoencoder, the decoder g(z) is interpreted as defining

log P(x|z). For example, the output of the decoder might define the mean of a conditional

Gaussian. Similarly, the output of the encoder f(x) is interpreted as defining the parameters of

Q—for example, Q might be a Gaussian with mean f(x). Training the variational autoencoder

then consists of maximizing £ with respect to the parameters of both the encoder f and the

decoder g, which can themselves be arbitrarily complicated deep networks.

Deep autoregressive models An autoregressive model (or AR model) is one in which each element x; of the data vector x

is predicted based on other elements of the vector. Such a model has no latent variables. If x

Autoregressive model

is of fixed size, an AR model can be thought of as a fully observable and possibly fully connected Bayes net. This means that calculating the likelihood ofa given data vector according to an AR model is trivial; the same holds for predicting the value of a single missing variable given all the others, and for sampling a data vector from the model.

The most common application of autoregressive models is in the analysis of time series

data, where an AR model of order k predicts x; given x,_j,...,x,_. In the terminology of Chapter 14, an AR model is a non-hidden Markov model. In the terminology of Chapter 23, an n-gram model of letter or word sequences is an AR model of order n — 1. In classical AR models, where the variables are real-valued, the conditional distribution

P(% | t»....X_1) is a linear-Gaussian model with fixed variance whose mean is a weighted linear combination of X, ..., X j—in other words, a standard linear regression model. The maximum likelihood solution is given by the Yule-Walker equations, which are closely Yule-Walker equations related to the normal equations on page 680.

A deep autoregressive model is one in which the linear-Gaussian

model is replaced

by an arbitrary deep network with a suitable output layer depending on whether x; is dis-

crete or continuous. Recent applications of this autoregressive approach include DeepMind’s ‘WaveNet model

for speech generation (van den Oord er al., 2016a).

WaveNet is trained

on raw acoustic signals, sampled 16,000 times per second, and implements a nonlinear AR model of order 4800 with a multilayer convolutional structure.

In tests it proves to be sub-

stantially more realistic than previous state-of-the-art speech generation systems.

Deep autoregressive model

780

Generative adversarial network GAN) enerator Discriminator Implicit model

Chapter 21

Deep Learning

Generative adversarial networks A generative adversarial network (GAN)

is actually a pair of networks that combine to

form a generative system. One of the networks, the generator, maps values from z to X in

order to produce samples from the distribution Py (x). A typical scheme samples z from a unit

Gaussian of moderate dimension and then passes it through a deep network /,, to obtain x. The other network, the discriminator, is a classifier trained to classify inputs x as real (drawn

from the training set) or fake (created by the generator). GANs are a kind of implicit model

in the sense that samples can be generated but their probabilities are not readily available; in a Bayes net, on the other hand, the probability of a sample is just the product of the conditional

probabilities along the sample generation path.

The generator is closely related to the decoder from the variational autoencoder frame-

work. The challenge in implicit modeling is to design a loss function that makes it possible to train the model using samples from the distribution, rather than maximizing the likelihood assigned to training examples from the data set.

Both the generator and the discriminator are trained simultaneously, with the generator

learning to fool the discriminator and the discriminator learning to accurately separate real from fake data.

The competition between generator and discriminator can be described in

the language of game theory (see Chapter 18). The idea is that in the equilibrium state of the

game, the generator should reproduce the training distribution perfectly, such that the discrim-

inator cannot perform better than random guessing. GANs have worked particularly well for

image generation tasks. For example, GAN can create photorealistic, high-resolution images of people who have never existed (Karras ef al., 2017). Unsupervised

translation

Translation tasks, broadly construed, consist of transforming an input x that has rich structure into an output y that also has rich structure.

In this context, “rich structure” means that the

data are multidimensional and have interesting statistical dependencies among the various

dimensions. Images and natural language sentences have a rich structure, but a single number, such as a clas ID, does not. Transforming a sentence from English to French or converting a photo of a night scene into an equivalent photo taken during the daytime are both examples

of translation tasks.

Supervised translation consists of gathering many (x,y) pairs and training the model to

map each X to the corresponding y.

For example, machine translation systems are often

trained on pairs of sentences that have been translated by professional human translators. For

other kinds of translation, supervised training data may not be available. For example, con-

sider a photo ofa night scene containing many moving cars and pedestrians. It is presumably

not feasible to find all of the cars and pedestrians and return them to their original positions in

Unsupervised translation

the night-time photo in order to retake the same photo in the daytime. To overcome this difficulty, it is possible to use unsupervised translation techniques that are capable of training

on many examples of x and many separate examples of y but no corresponding (x,y) pairs. These approaches are generally based on GANS; for example, one can train a GAN gen-

erator to produce a realistic example ofy when conditioned on x, and another GAN generator to perform the reverse mapping. The GAN training framework makes it possible to train a

generator to generate any one of many possible samples that the discriminator accepts as a

Section 217

Unsupervised Learning and Transfer Learning

781

realistic example of y given x, without any need for a specific paired y as is traditionally needed in supervised learning. More detail on unsupervised translation for images is given in Section 25.7.5. 21.7.2

Transfer learning and multitask learning

In transfer learning, experience with one learning task helps an agent learn better on another Transfer learning task. For example, a person who has already learned to play tennis will typically find it casier to learn related sports such as racquetball and squash; a pilot who has learned to fly one type of commercial passenger airplane will very quickly learn to fly another type: a student who has already learned algebra finds it easier to learn calculus.

‘We do not yet know the mechanisms of human transfer learning.

For neural networks,

learning consists of adjusting weights, so the most plausible approach for transfer learning is

to copy over the weights learned for task A to a network that will be trained for task B. The

weights are then updated by gradient descent in the usual way using data for task B. It may be a good idea to use a smaller learning rate in task B, depending on how similar the tasks are and how much data was used in task A.

Notice that this approach requires human expertise in selecting the tasks: for example,

weights learned during algebra training may not be very useful in a network intended for racquetball.

Also, the notion of copying weights requires a simple mapping between the

input spaces for the two tasks and essentially identical network architectures.

One reason for the popularity of transfer learning is the availability of high-quality pre-

trained models.

For example, you could download a pretrained visual object recognition

model such as the ResNet-50 model trained on the COCO data set, thereby saving yourself

weeks of work. From there you can modify the model parameters by supplying additional images and object labels for your specific task.

Suppose you want to classify types of unicycles. You have only a few hundred pictures

of different unicycles, but the COCO data set has over 3,000 images in each of the categories

of bicycles, motorcycles, and skateboards. This means that a model pretrained on COCO

already has experience with wheels and roads and other relevant features that will be helpful

in interpreting the unicycle images.

Often you will want to freeze the first few layers of the pretrained model—these layers

serve as feature detectors that will be useful for your new model. Your new data set will be

allowed to modify the parameters of the higher levels only; these are the layers that identify problem-specific features and do classification. However, sometimes the difference between sensors means that even the lowest-level layers need to be retrained.

As another example, for those building a natural language system, it is now common

to start with a pretrained model

such as the ROBERTA

model (see Section 24.6), which

already “knows” a great deal about the vocabulary and syntax of everyday language. The next step is to fine-tune the model in two ways. First, by giving it examples of the specialized vocabulary used in the desired domain; perhaps a medical domain (where it will learn about

“mycardial infarction”) or perhaps a financial domain (where it will learn about “fiduciary

responsibility”). Second, by training the model on the task it is to perform. If it is to do question answering, train it on question/answer pairs. One very important kind of transfer learning involves transfer between simulations and the real world.

For example, the controller for a self-driving car can be trained on billions

782

Chapter 21

Deep Learning

of miles of simulated driving, which would be impossible in the real world. Then, when the

Multitask learning

controller is transitioned to the real vehicle, it adapts quickly to the new environment. Multitask learning is a form of transfer learning in which we simultaneously

train a

model on multiple objectives. For example, rather than training a natural language system on part-of-speech tagging and then transferring the learned weights to a new task such as document classification, we train one system simultaneously on part-of-speech tagging, document

classification, language detection, word prediction, sentence difficulty modeling, plagiarism

detection, sentence entailment, and question answering. The idea is that to solve any one of these tasks, a model might be able to take advantage of superficial features of the data. But to

solve all eight at once with a common representation layer, the model is more likely to create a common representation that reflects real natural language usage and content.

21.8

Applications

Deep learning has been applied successfully to many important problem areas in Al For indepth explanations, we refer the reader to the relevant chapters: Chapter 22 for the use of deep learning in reinforcement learning systems, Chapter 24 for natural language processing, Chapter 25 (particularly Section 25.4) for computer vision, and Chapter 26 for robotics. 21.8.1

Vision

We begin with computer vision, which is the application area that has arguably had the biggest impact on deep learning, and vice versa. Although deep convolutional networks had been in use since the 1990s for tasks such as handwriting recognition, and neural networks had begun to surpass generative probability models for speech recognition by around 2010, it was the success of the AlexNet deep learning system in the 2012 ImageNet competition that propelled

deep learning into the limelight. The ImageNet competition was a supervised learning task with 1,200,000 images in 1,000 different categories, and systems were evaluated on the “top-5” score—how often the correct category appears in the top five predictions. AlexNet achieved an error rate of 15.3%, whereas the next best system had an error rate of more than 25%.

AlexNet had five convolutional

layers interspersed with max-pooling layers, followed by three fully connected layers. It used

ReLU activation functions and took advantage of GPUs to speed up the process of training

60 million weight

Since 2012, with improvements

in network design, training methods, and computing

resources, the top-5 error rate has been reduced to less than 2%—well below the error rate of

a trained human (around 5%). CNNs have been applied to a wide range of vision tasks, from self-driving cars to grading cucumbers.® Driving, which is covered in Section 25.7.6 and in several sections of Chapter 26, is among the most demanding of vision tasks: not only must

the algorithm detect, localize, track, and recognize pigeons, paper bags, and pedestrians, but it has to do it in real time with near-perfect accuracy.

8 The widely known tale of the Japanese cucumber farmer who built his own cucumber-s ing robot using TensorFlow is, it tums out, mostly mythical. The algorithm was developed by the farmer’s son, who worked previously as a software engineer at Toyota, and its low accuracy—about 70%—meant that the cucumbers il had to be sorted by hand (Zeeberg, 2017).

Section 218 Applications 21.8.2

Natural language processing

Deep learning has also had a huge impact on natural language processing (NLP) applications. such as machine translation and speech recognition. Some advantages of deep learning for these applications include the possibility of end-to-end learning, the automatic generation

of internal representations for the meanings of words, and the interchangeability of learned

encoders and decoders. End-to-end learning refers to the construction of entire systems as a single, learned func-

tion f. For example, an f for machine translation might take

as input an English sentence

S and produce an equivalent Japanese sentence S; = £(Sg). Such an f can be learned from training data in the form of human-translated pairs of sentences (or even pairs of texts, where

the alignment of corresponding sentences or phrases is part of the problem to be solved). A more classical pipeline approach might first parse Sg, then extract its meaning, then re-express

the meaning in Japanese as S, then post-edit S, using a language model for Japanese. This pipeline approach has two major drawbacks:

first, errors are compounded at each stage; and

second, humans have to determine what constitutes a “parse tree” and a “meaning representation,” but there is no easily accessible ground truth for these notions, and our theoretical

ideas about them are almost certainly incomplete.

At our present stage of understanding, then, the classical pipeline approach—which, at

least naively, seems to correspond to how a human translator works—is outperformed by the end-to-end method made possible by deep learning. For example, Wu ef al. (2016b) showed that end-to-end translation using deep learning reduced translation errors by 60% relative to a previous pipeline-based system. As of 2020, machine translation systems are approaching human performance for language pairs such as French and English for which very large paired data sets are available, and they are usable for other language pairs covering the majority of Earth’s population. There is even some evidence that networks trained on multiple languages do in fact learn an internal meaning representation: for example, after learning to translate Portuguese to English and English to Spanish, it is possible to translate Portuguese directly

into Spanish without any Portuguese/Spanish sentence pairs in the training set.

One of the most significant findings to emerge from the application of deep learning

to language tasks is that a great deal deal of mileage comes from re-representing individual words as vectors in a high-dimensional space—so-called word embeddings (see Section 24.1).

The vectors are usually extracted from the weights of the first hidden layer of

a network trained on large quantities of text, and they capture the statistics of the lexical

contexts in which words are used. Because words with similar meanings are used in similar

contexts, they end up close to each other in the vector space. This allows the network to generalize effectively across categories of words, without the need for humans to predefine those categories. For example, a sentence beginning “John bought a watermelon and two pounds of ... s likely to continue with “apples” or “bananas” but not with “thorium” or “geography.” Such a prediction is much easier to make if “apples” and “bananas” have similar representations in the internal layer. 21.8.3

Reinforcement learning

In reinforcement learning (RL), a decision-making agent learns from a sequence of reward

signals that provide some indication of the quality of its behavior. The goal is to optimize the

sum of future rewards. This can be done in several ways: in the terminology of Chapter 17,

783

784

Chapter 21

Deep Learning

the agent can learn a value function, a Q-function, a policy, and so on. From the point of view of deep learning, all these are functions that can be represented by computation graphs.

For example, a value function in Go takes a board position as input and returns an estimate of how advantageous the position is for the agent. While the methods of training in RL differ from those of supervised learning, the ability of multilayer computation graphs to represent

Deep reinforcement learning

complex functions over large input spaces has proved to be very useful. The resulting field of research is called deep reinforcement learning.

In the 1950s, Arthur Samuel experimented with multilayer representations of value func-

tions in his work on reinforcement learning for checkers, but he found that in practice a linear

function approximator worked best. (This may have been a consequence of working with a computer roughly 100 billion times less powerful than a modern tensor processing unit.) The first major successful demonstration of deep RL was DeepMind’s Atari-playing agent, DQN (Mnih et al., 2013). Different copies of this agent were trained to play each of several different Atari video games, and demonstrated skills such as shooting alien spaceships, bouncing balls with paddles, and driving simulated racing cars. In each case, the agent learned a Qfunction from raw image data with the reward signal being the game score. Subsequent work has produced deep RL systems that play at a superhuman level on the majority of the 57 different Atari games. DeepMind’s ALPHAGO system also used deep RL to defeat the best human players at the game of Go (see Chapter 5).

Despite its impressive successes, deep RL still faces significant obstacles: it is often difficult to get good performance, and the trained system may behave very unpredictably if the environment differs even a little from the training data (Irpan, 2018).

Compared to

other applications of deep learning, deep RL is rarely applied in commercial settings. It is, nonetheless, a very active area of research.

Summary

This chapter described methods for learning functions represented by deep computational graphs. The main points were: + Neural networks represent complex nonlinear functions with a network of parameterized linear-threshold units.

« The back-propagation algorithm implements a gradient descent in parameter space to minimize the loss function. « Deep learning works well for visual object recognition, speech recognition, natural language processing, and reinforcement learning in complex environments.

+ Convolutional networks are particularly well suited for image processing and other tasks

where the data have a grid topology. « Recurrent networks are effective for sequence-processing tasks including language modeling and machine translation.

Bibliographical and Historical Notes Bibliographical and

Historical Notes

The literature on neural networks is vast. Cowan and Sharp (1988b,

1988a) survey the early

history, beginning with the work of McCulloch and Pitts (1943). (As mentioned in Chap-

ter 1, John McCarthy has pointed to the work of Nicolas Rashevsky (1936, 1938) as the earliest mathematical model of neural learning.) Norbert Wiener, a pioneer of cybernetics and control theory (Wiener, 1948), worked with McCulloch and Pitts and influenced a num-

ber of young researchers, including Marvin Minsky, who may have been the first to develop a working neural network in hardware, in 1951 (see Minsky and Papert, 1988, pp. ix-x). Alan Turing (1948) wrote a research report titled Intelligent Machinery that begins with the

sentence “I propose to investigate the question as to whether it is possible for machinery to show intelligent behaviour” and goes on to describe a recurrent neural network architecture

he called “B-type unorganized machines” and an approach to training them. Unfortunately, the report went unpublished until 1969, and was all but ignored until recently.

The perceptron, a one-layer neural network with a hard-threshold activation function, was

popularized by Frank Rosenblatt (1957). After a demonstration in July 1958, the New York

Times described it as “the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.” Rosenblatt

(1960) later proved the perceptron convergence theorem, although it had been foreshadowed by purely mathematical work outside the context of neural networks (Agmon, 1954; Motzkin and Schoenberg, 1954). Some early work was also done on multilayer networks, including Gamba perceptrons (Gamba ef al., 1961) and madalines (Widrow, 1962). Learning Machines (Nilsson,

1965) covers much of this early work and more.

The subsequent demise

of early perceptron research efforts was hastened (or, the authors later claimed, merely explained) by the book Perceptrons (Minsky and Papert, 1969), which lamented the field’s lack of mathematical rigor. The book pointed out that single-layer perceptrons could represent only linearly separable concepts and noted the lack of effective learning algorithms for multilayer networks. These limitations were already well known (Hawkins, 1961) and had been acknowledged by Rosenblatt himself (Rosenblatt, 1962). The papers collected by Hinton and Anderson (1981), based on a conference in San Diego in 1979, can be regarded as marking a renaissance of connectionism. The two-volume “PDP” (Parallel Distributed Processing) anthology (Rumelhart and McClelland, 1986) helped

to spread the gospel, so to speak, particularly in the psychology and cognitive science com-

munities. The most important development of this period was the back-propagation algorithm

for training multilayer networks.

The back-propagation algorithm was discovered independently several times in different

contexts (Kelley, 1960; Bryson,

1962; Dreyfus, 1962; Bryson and Ho, 1969; Werbos,

1974;

Parker, 1985) and Stuart Dreyfus (1990) calls it the “Kelley-Bryson gradient procedure.” Although Werbos had applied it to neural networks, this idea did not become widely known

until a paper by David Rumelhart, Geoff Hinton, and Ron Williams (1986) appeared in Narure giving a nonmathematical presentation of the algorithm. Mathematical respectability was enhanced by papers showing that multilayer feedforward networks are (subject to technical conditions) universal function approximators (Cybenko,

1988,

1989).

The late 1980s and

early 1990s saw a huge growth in neural network research: the number of papers mushroomed by a factor of 200 between 1980-84 and 1990-94.

785

786

Chapter 21

Deep Learning

In the late 1990s and early 2000s, interest in neural networks waned as other techniques such as Bayes nets, ensemble methods, and kernel machines came to the fore. Interest in deep

models was sparked when Geoff Hinton’s research on deep Bayesian networks—generative models with category variables at the root and evidence variables at the leaves—began to bear fruit, outperforming kernel machines on small benchmark

data sets (Hinton er al., 2006).

Interest in deep learning exploded when Krizhevsky ef al. (2013) used deep convolutional networks to win the ImageNet competition (Russakovsky et al., 2015).

Commentators often cite the availability of “big data” and the processing power of GPUs

as the main contributing factors in the emergence of deep learning.

Architectural improve-

ments were also important, including the adoption of the ReLU activation function instead of the logistic sigmoid (Jarrett et al., 2009; Nair and Hinton, 2010; Glorot et al., 2011) and later the development of residual networks (He et al., 2016).

On the algorithmic side, the use of stochastic gradient descent (SGD) with small batches

was essential in allowing neural networks to scale to large data sets (Bottou and Bousquet, 2008). Batch normalization (loffe and Szegedy, 2015) also helped in making the training pro-

cess faster and more reliable and has spawned several additional normalization techniques (Ba etal.,2016; Wu and He, 2018; Miyato et al., 2018). Several papers have studied the empirical behavior of SGD on large networks and large data sets (Dauphin et al., 2015; Choromanska et al., 2014; Goodfellow et al., 2015b). On the theoretical side, some progress has been made

on explaining the observation that SGD applied to overparameterized networks often reaches

a global minimum with a training error of zero, although so far the theorems to this effect assume a network with layers far wider than would ever occur in practice (Allen-Zhu et al.,

2018; Du ef al., 2018). Such networks have more than enough capacity to function as lookup tables for the training data.

The last piece of the puzzle, at least for vision applications, was the use of convolutional

networks. These had their origins in the descriptions of the mammalian visual system by neurophysiologists David Hubel and Torsten Wiesel (Hubel and Wiesel, 1959, 1962, 1968).

They described “simple cells” in the visual system of a cat that resemble edge detectors,

as well as “complex cells” that are invariant to some transformations such as small spatial

translations. In modern convolutional networks, the output of a convolution is analogous to a

simple cell while the output of a pooling layer is analogous to a complex cell. The work of Hubel and Wiesel inspired many of the early connectionist models of vision (Marr and Poggio, 1976). The neocognitron (Fukushima, 1980; Fukushima and Miyake,

1982), designed as a model of the visual cortex, was essentially a convolutional network in terms of model architecture, although an effective training algorithm for such networks had

to wait until Yann LeCun and collaborators showed how to apply back-propagation (LeCun

et al., 1995). One of the early commercial successes of neural networks was handwritten digit recognition using convolutional networks (LeCun ez al., 1995).

Recurrent neural networks (RNNs) were commonly proposed as models of brain function

in the 1970s, but no effective learning algorithms were associated with these proposals. The

method of back-propagation through time appears in the PhD thesis of Paul Werbos (1974),

and his later review paper (Werbos, 1990) gives several additional references to rediscoveries of the method in the 1980s. One of the most influential early works on RNNs was due to Jeff Elman (1990), building on an RNN architecture suggested by Michael Jordan (1986).

Williams and Zipser (1989) present an algorithm for online learning in RNNs. Bengio et al.

Bibliographical and Historical Notes (1994) analyzed the problem of vanishing gradients in recurrent networks. The long shortterm memory (LSTM) architecture (Hochreiter,

1991; Hochreiter and Schmidhuber,

1997;

Gers et al., 2000) was proposed as a way of avoiding this problem. More recently, effective RNN designs have been derived automatically (Jozefowicz er al., 2015; Zoph and Le, 2016).

Many methods have been tried for improving generalization in neural networks. Weight decay was suggested by Hinton (1987) and analyzed mathematically by Krogh and Hertz

(1992). The dropout method is due to Srivastava et al. (2014a). Szegedy et al. (2013) intro-

duced the idea of adversarial examples, spawning a huge literature.

Poole et al. (2017) showed that deep networks (but not shallow ones) can disentangle complex functions into flat manifolds in the space of hidden units. Rolnick and Tegmark

(2018) showed that the number of units required to approximate a certain class of polynomials

of n variables grows exponentially for shallow networks but only linearly for deep networks. White et al. (2019) showed that their BANANAS

system could do neural architecture

search (NAS) by predicting the accuracy of a network to within 1% after training on just 200 random sample architectures. Zoph and Le (2016) use reinforcement learning to search the space of neural network architectures. Real er al. (2018) use an evolutionary algorithm to do model selection, Liu er al. (2017) use evolutionary algorithms on hierarchical representations, and Jaderberg ef al. (2017) describe population-based training. Liu er al. (2019)

relax the space of architectures to a continuous differentiable space and use gradient descent to find a locally optimal solution.

Pham et al. (2018) describe the ENAS

(Efficient Neural

Architecture Search) system, which searches for optimal subgraphs of a larger graph. It is fast because it does not need to retrain parameters. The idea of searching for a subgraph goes back to the “optimal brain damage™ algorithm of LeCun et al. (1990).

Despite this impressive array of approaches, there are critics who feel the field has not yet

matured. Yu et al. (2019) show that in some cases these NAS algorithms are no more efficient

than random architecture selection. For a survey of recent results in neural architecture search,

see Elsken ef al. (2018).

Unsupervised learning constitutes a large subfield within statistics, mostly under the

heading of density estimation. Silverman (1986) and Murphy (2012) are good sources for classical and modem techniques in this area. Principal components analysis (PCA) dates back to Pearson (1901); the name comes from independent work by Hotelling (1933). The probabilistic PCA model (Tipping and Bishop, 1999) adds a generative model for the principal components themselves. The variational autoencoder is due to Kingma and Welling (2013) and Rezende ef al. (2014); Jordan et al. (1999) provide an introduction to variational methods for inference in graphical models. For autoregressive models, the classic text is by Box er al. (2016). The Yule-Walker equations for fitting AR models were developed independently by Yule (1927) and Walker (1931).

Autoregressive models with nonlinear dependencies were developed by several authors (Frey, 1998; Bengio and Bengio, 2001; Larochelle and Murray, 2011). The autoregressive WaveNet

model (van den Oord et al., 2016a) was based on earlier work on autoregressive image gen-

eration (van den Oord et al., 2016b).

Generative adversarial networks, or GANs, were first

proposed by Goodfellow et al. (2015a), and have found many applications in AL Some theoretical understanding of their properties is emerging, leading to improved GAN models and algorithms (Li and Malik, 2018b, 2018a; Zhu et al., 2019). Part of that understanding involves

protecting against adversarial attacks (Carlini ef al., 2019).

787

788

Hopfield network

Boltzmann machine

Chapter 21

Deep Learning

Several branches of research into neural networks have been popular in the past but are not actively explored today. Hopfield networks (Hopfield, 1982) have symmetric connections between each pair of nodes and can learn to store patterns in an associative memory, 5o that an entire pattern can be retrieved by indexing into the memory using a fragment of the pattern.

Hopfield networks are deterministic; they were later generalized to stochastic

Boltzmann machines (Hinton and Sejnowski, 1983, 1986). Boltzmann machines are possi-

bly the earliest example of a deep generative model. The difficulty of inference in Boltzmann

machines led to advances in both Monte Carlo techniques and variational techniques (see

Section 13.4).

Research on neural networks for Al has also been intertwined to some extent with research into biological neural networks. The two topics coincided in the 1940s, and ideas for convolutional networks and reinforcement learning can be traced to studies of biological sys-

Computational

tems; but at present, new ideas in deep learning tend to be based on purely computational or statistical concerns.

The field of computational neuroscience aims to build computational

models that capture important and specific properties of actual biological systems. Overviews

are given by Dayan and Abbott (2001) and Trappenberg (2010). For modern neural nets and deep learning, the leading textbooks are those by Goodfellow et al. (2016) and Charniak (2018). There are also many hands-on guides associated with the various open-source software packages for deep learning.

Three of the leaders of the

field—Yann LeCun, Yoshua Bengio, and Geoff Hinton—introduced the key ideas to non-Al researchers in an influential Nature article (2015). The three were recipients of the 2018 Turing Award. Schmidhuber (2015) provides a general overview, and Deng et al. (2014)

focus on signal processing tasks.

The primary publication venues for deep learning research are the conference on Neural

Information Processing Systems (NeurIPS), the International Conference on Machine Learning (ICML), and the International Conference on Learning Representations (ICLR). The main

journals are Machine Learning, the Journal of Machine Learning Research, and Neural Com-

putation. Increasingly, because of the fast pace of research, papers appear first on arXiv.org and are often described in the research blogs of the major research centers.

TS

D2

REINFORCEMENT LEARNING In which we see how experiencing rewards and punishments can teach an agent how to maximize rewards in the future. ‘With supervised learning, an agent learns by passively observing example input/output

pairs provided by a “teacher” In this chapter, we will see how agents can actively learn from

their own experience, without a teacher, by considering their own ultimate success or failure.

22.1

Learning from Rewards

Consider the problem of learning to play chess.

Let’s imagine treating this as a supervised

learning problem using the methods of Chapters 19-21. The chess-playing agent function

takes as input a board position and returns a move, so we train this function by supplying

examples of chess positions, each labeled with the correct move. Now, it so happens that we have available databases of several million grandmaster games, each a sequence of positions. and moves. The moves made by the winner are, with few exceptions, assumed to be good,

if not always perfect. Thus, we have a promising training set. The problem is that there are

relatively few examples (about 108) compared to the space of all possible chess positions (about 10°). In a new game, one soon encounters positions that are significantly different

from those in the database, and the trained agent function is likely to fail miserably—not least because it has no idea of what its moves are supposed to achieve (checkmate) or even what

effect the moves have on the positions of the pieces. And of course chess is a tiny part of the real world. For more realistic problems, we would need much vaster grandmaster databases,

and they simply don’t exist.'

Analternative is reinforcement learning (RL), in which an agent interacts with the world

and periodically receives rewards (or, in the terminology of psychology, reinforcements) that reflect how well it is doing.

For example, in chess the reward is 1 for winning, 0 for

losing, and § for a draw. We have already seen the concept of rewards in Chapter 17 for Markov decision processes (MDPs). Indeed, the goal is the same in reinforcement learning:

maximize the expected sum of rewards.

Reinforcement learning differs from “just solving

an MDP” because the agent is not given the MDP as a problem to solve; the agent is in the MDP. It may not know the transition model or the reward function, and it has to act in order

to learn more. Imagine playing a new game whose rules you don’t know; after a hundred or 50 moves, the referee tells you “You lose.” That is reinforcement learning in a nutshell.

From our point of view as designers of Al systems, providing a reward signal to the agent is usually much easier than providing labeled examples of how to behave. First, the reward ! As Yann LeCun and Alyosha Efros have pointed out, “the Al revolution will not be supervised.”

{3570

790

Chapter 22 Reinforcement Learning function is often (as we saw for chess) very concise and easy to specify: it requires only a few lines of code to tell the chess agent if it has won or lost the game or to tell the car-racing agent that it has won or lost the race or has crashed.

Second, we don’t have to be experts,

capable of supplying the correct action in any situation, as would be the case if we tried to apply supervised learning. It turns out, however, that a

Sparse

little bit of expertise can go a long way in reinforcement

learning. The two examples in the preceding paragraph—the win/loss rewards for chess and racing—are what we call sparse rewards, because in the vast majority of states the agent is given no informative reward signal at all. In games such as tennis and cricket, we can easily supply additional rewards for each point won or for each run scored. In car racing, we could reward the agent for making progress around the track in the right direction. When learning to crawl, any forward motion is an achievement. much easier.

These intermediate rewards make learning

As long as we can provide the correct reward signal to the agent, reinforcement learning provides a very general way to build Al systems. This is particularly true for simulated environments, where there is no shortage of opportunities to gain experience. The addition of deep learning as a tool within RL systems has also made new applications possible, including learning to play Atari video games from raw visual input (Mnih ef al., 2013), controlling robots (Levine ef al., 2016), and playing poker (Brown and Sandholm, 2017). Literally hundreds of different reinforcement learning algorithms have been devised, and

Model-based reinforcement Iearning

many of them can employ as tools a wide range of learning methods from Chapters 19-21. In this chapter, we cover the basic ideas and give some sense of the variety of approaches through a few examples. We categorize the approaches as follows: o Model-based reinforcement learni

In these approaches the agent uses a transition

model of the environment to help interpret the reward signals and to make decisions about how to act. The model may be initially unknown, in which case the agent learns the model from observing the effects of its actions, or it may already be known—for example, a chess program may know the rules of chess even if it does not know how to choose good moves. In partially observable environments, the transition model is also useful for state estimation (see Chapter 14). Model-based reinforcement learning

Model-free reinforcement learning

Action-utiity learning Qlearning Q-function

Policy search

systems often learn a utility function U (s), defined (as in Chapter 17) in terms of the sum of rewards from state s onward.” o Model-free reinforcement learning: In these approaches the agent neither knows nor learns a transition model for the environment. Instead, it learns a more direct represen-

tation of how to behave. This comes in one of two varieties:

o Action-utility learning: We introduced action-utility functions in Chapter 17. The

most common form of action-utility learning is Q-learning, where the agent learns

a Q-function, or quality-function, Q(s,a), denoting the sum of rewards from state

s onward if action a is taken. Given a Q-function, the agent can choose what to do in s by finding the action with the highest Q-value.

o Policy search:

The agent learns a policy 7(s) that maps directly from states to

actions. In the terminology of Chapter 2, this a reflex agent. 2 Inthe RL literature, which draws more on operations research than economics, utility functions are often called value functions and denoted V (s).

Section 22.2

Passive Reinforcement Learning

791

08516 | 0.9078 | 0.9578 0.8016

(@)

(®)

Figure 22.1 (a) The optimal policies for the stochastic environment with R(s.a,s') = — 0.04 for transitions between nonterminal states. There are two policies because in state (3,1) both Left and Up are optimal. We saw this before in Figure 17.2. (b) The utilities of the states in the4 x 3 world, given policy 7. Passive

‘We begin in Section 22.2 with passive reinforcement learning, where the agent’s policy = reinforcement

is fixed and the task is to learn the utilities of states (or of state-action pairs); this could

="

reinforcement learning, where the agent must also figure out what to do. The principal

(e reinforcement

also involve learning a model of the environment. (An understanding of Markov decision processes, as described in Chapter 17, is essential for this section.) Section 22.3 covers active issue is exploration: an agent must experience as much as possible of its environment in

order to learn how to behave in it. Section 22.4 discusses how an agent can use inductive

learning (including deep learning methods) to learn much faster from its experiences. We also discuss other approaches that can help scale up RL to solve real problems, including providing intermediate pseudorewards to guide the learner and organizing behavior into a hierarchy of actions. Section 22.5 covers methods for policy search. In Section 22.6, we explore apprenticeship learning: training a learning agent using demonstrations rather than

reward signals. Finally, Section 22.7 reports on applications of reinforcement learning. 22.2

Passive Reinforcement Learning

We start with the simple case of a fully observable environment with a small number of

actions and states, in which an agent already has a fixed policy 7 (s) that determines its actions.

The agent is trying to learn the utility function U™ (s)—the expected total discounted reward if policy m is executed beginning in state s. We call this a passive learning agent.

The passive learning task is similar to the policy evaluation task, part of the policy iteration algorithm described in Section 17.2.2. The difference is that the passive learning agent does not know the transition model P(s'|s,a), which specifies the probability of reaching state s’ from state s after doing action a@; nor does it know the reward function R(s,a,s’), which specifies the reward for each transition.

‘We will use as our example the 4 x 3 world introduced in Chapter 17. Figure 22.1 shows

the optimal policies for that world and the corresponding utilities. The agent executes a set

Fassive leaming

792

Chapter 22 Reinforcement Learning

Trial

of trials in the environment using its policy 7. In each trial, the agent starts in state (1,1) and

experiences a sequence of state transitions until it reaches one of the terminal states, (4,2) or

(4,3). Tts percepts supply both the current state and the reward received for the transition that just occurred to reach that state. Typical trials might look like this:

P20

0203

B e

6

a 1>‘““(12 (4,2)

Note that each transition is annotated with both the action ldkcn and the reward received at the next state. The object is to use the information about rewards to learn the expected utility U™ (s) associated with each nonterminal state s. The utility is defined to be the expected sum

of (discounted) rewards obtained if policy 7 is followed. As in Equation (17.2) on page 567, we write

U™(s) =E | Y 1=0

+'R(S:,7(S:),S41) |»

2.1)

where R(S;,7(S;),5.1) is the reward received when action 7(S,) is taken in state S, and reaches state S..1. Note that S, is a random variable denoting the state reached at time ¢ when executing policy 7, starting from state So=s. We will include a discount factor

in all of

our equations, but for the 4 x 3 world we will set = 1, which means no discounting. Direct utility estimation

Reward-to-go

22.2.1

Direct utility estimation

The idea of direct utility estimation is that the utility of a

state is defined as the expected

total reward from that state onward (called the expected reward-to-go), and that each trial

provides a sample of this quantity for each state visited. For example, the first of the three

trials shown earlier provides a sample total reward of 0.76 for state (1,1), two samples of 0.80 and 0.88 for (1,2), two samples of 0.84 and 0.92 for (1,3), and so on. Thus, at the end of each

sequence, the algorithm calculates the observed reward-to-go for each state and updates the

estimated utility for that state accordingly, just by keeping a running average for each state

in a table. In the limit of infinitely many trials, the sample average will converge to the true

expectation in Equation (22.1). This means that we have reduced reinforcement learning to a standard supervised learn-

ing problem in which each example is a (state, reward-to-go) pair. We have a lot of powerful algorithms for supervised learning, so this approach seems promising, but it ignores an im-

portant constraint: The utility of a state is determined by the reward and the expected utility

of the successor states. More specifically, the utility values obey the Bellman equations for a

fixed policy (see also Equation (17.14)):

Uils) = X P(s' | 5.m1(8)) [R(s. mi(5), ) + 7 Ui(s))] s

(222)

By ignoring the connections between states, direct utility estimation misses opportunities for

learning.

For example, the second of the three trials given earlier reaches the state (3,2),

which has not previously been visited.

from the first trial to have a high utility.

The next transition reaches (3,3), which is known

The Bellman equation suggests immediately that

(3,2) is also likely to have a high utility, because it leads to (3,3), but direct utility estimation

Section 22.2

Passive Reinforcement Learning

793

learns nothing until the end of the trial. More broadly, we can view direct utility estimation

as searching for U in a hypothesis space that is much larger than it needs to be, in that it includes many functions that violate the Bellman equations.

often converges very slowly. 22.2.2

Adaptive dynamic

For this reason, the algorithm

programming

An adaptive dynamic programming

(or ADP) agent takes advantage of the constraints

among the utilities of states by learning the transition model that connects them and solving the corresponding Markov decision process using dynamic programming. For a passive learning agent, this means plugging the learned transition model P(s'[5,7(s)) and the observed rewards R(s,7(s),s')

into Equation (22.2) to calculate the utilities of the states.

As

we remarked in our discussion of policy iteration in Chapter 17, these Bellman equations are linear when the policy 7 is fixed, so they can be solved using any linear algebra package. Alternatively, we can adopt the approach of modified policy iteration (see page 578),

using a simplified value iteration process to update the utility estimates after each change to the learned model. Because the model usually changes only slightly with each observation, the value iteration process can use the previous utility estimates as initial values and typically

converge very quickly.

Learning the transition model is easy, because the environment is fully observable. This

means that we have a supervised learning task where the input for each training example is a

state-action pair, (s,a), and the output is the resulting state, s'. The transition model P(s'| s,a)

is represented as a table and it is estimated directly from the counts that are accumulated in

Ny - The counts record how often state is reached when executing in s. For example, in the three trials given on page 792, Right is executed four times in (3,3) and the resulting state

is (3,2) twice and (4,3) twice, so P((3,2)|(3,3),Right) and P((4,3)|(3,3),Right) are both estimated to be §. The full agent program for a passive ADP agent is shown in Figure 22.2. Its performance on the 4 x 3 world is shown in Figure 22.3. In terms of how quickly its value estimates

improve, the ADP agent is limited only by its ability to learn the transition model. In this

sense, it provides a standard against which to measure any other reinforcement learning al-

gorithms. It is, however, intractable for large state spaces. In backgammon, for example, it would involve solving roughly 10*° equations in 10*° unknowns. 22.2.3

Temporal-difference learning

Solving the underlying MDP as in the preceding section is not the only way to bring the

Bellman equations to bear on the learning problem.

Another way is to use the observed

transitions to adjust the utilities of the observed states so that they agree with the constraint

equations. Consider, for example, the transition from (1,3) to (2,3) in the second trial on page 792. Suppose that as a result of the first trial, the utility estimates are U™(1,3)=0.88

and U™(2,3)=0.96. Now, if this transition from (1,3) to (2,3) occurred all the time, we would expect the utilities to obey the equation

U™(1,3) = —0.04+ U™(2,3), 50 U7(1,3) would be 0.92. Thus, its current estimate of 0.84 might be a lttle low and should be increased. More generally, when a transition occurs from state s to state s’ via action 7(s),

Adaptive dynamic programming

794

Chapter 22 Reinforcement Learning function PASSIVE-ADP-LEARNER(percepr) returns an action

inputs: percept, a percept indicating the current state s’ and reward signal r

persistent: 7, a fixed policy

mdp, an MDP with model P, rewards R, actions A, discount y U, atable of utilities for states, initially empty

Nyjs.a» 2 table of outcome count vectors indexed by state and action, initially zero 5. a, the previous state and action, initially null if " is new then U[s'] 0 s is not null then

increment Ny s, alls’]

Rls,a,s'1r

add a to A[s] P(- | 5,a) =+ > [envada |06 | | [Caenviada |->-[de 03] ([ |»a et Score00 ] [Una [-2.1] | [Scow=03] [puera [-09] | [Seorer=11 |puema [-19] | [Seorer =17 Beam 2

Beam 2

Beam 2

Fypomests | [ Viord [Score] | [Fipoesis | [Word [Score] | [Fypomesis [-05 | |[Lapueraal] > |~>[envada]-03 | | [Lapvera |>foe s Scors: =21] [puena [-2.1] | [Scorer=12] (o ]-07] | [Scorer =15

++

Figure 24.8 Beam search with beam size of b=2. The score of each word is the logprobability generated by the target RNN softmax, and the score of each hypothesis is the sum of the word scores. At timestep 3, the highest scoring hypothesis La entrada can only generate low-probability continuations, so it “falls off the beam.” using a greedy decoder to translate into Spanish the English sentence we saw before, The

front door is red. The correct translation is “La puerta de entrada es roja”—literally “The door of entry is red.” Suppose the target RNN correctly generates the first word La for The. Next, a greedy decoder might propose entrada for front. But this is an error—Spanish word order should put the noun puerta before the modifier. Greedy decoding is fast—it only considers one choice at each timestep and can do so quickly—but the model has no mechanism to correct mistakes. We could try to improve the attention mechanism so that it always attends to the right word and guesses correctly every time.

But for many sentences it is infeasible to guess

correctly all the words at the start of the sentence until you have seen what’s at the end.

A better approach is to search for an optimal decoding (or at least a good one) using

one of the search algorithms from Chapter 3. A common choice is a beam search (see Sec-

tion 4.1.3). In the context of MT decoding, beam search typically keeps the top k hypotheses at each stage, extending each by one word using the top k choices of words, then chooses the best k of the resulting > new hypotheses. When all hypotheses in the beam generate the special token, the algorithm outputs the highest scoring hypothesis.

A visualization of beam search is given in Figure 24.8. As deep learning models become

more accurate, we can usually afford to use a smaller beam size.

Current state-of-the-art

neural MT models use a beam size of 4 to 8, whereas the older generation of statistical MT models would use a beam size of 100 or more.

24.4 Transformer Self-attention

The influential article “Attention is all you need” (Vaswani et al., 2018) introduced the transformer architecture, which uses a self-attention mechanism that can model long-distance

context without a sequential dependency.

24.4.1 Self-attention

The Transformer Architecture

Self-attention

Previously, in sequence-to-sequence models, attention was applied from the target RNN to the source RNN. Self-attention extends this mechanism so that each sequence of hidden states also attends to itself—the source to the source, and the target to the target. This allows

the model to additionally capture long-distance (and nearby) context within each sequence.

Section 24.4

The Transformer Architecture

869

The most straightforward way of applying self-attention is where the attention matrix is

directly formed by the dot product of the input vectors. However, this is problematic. The dot product between a vector and itself will always be high, so each hidden state will be biased

towards attending to itself. The transformer solves this by first projecting the input into three different representations using three different weight matrices:

« The query vector g;=W,x; is the one being attended from, like the target in the standard attention mechanism. + The key vector k;=W,x; attention mechanism.

is the one being attended to, like the source in the basic

« The value vector v;=W,x; is the context that is being generated.

Query vector

Key vector Value vector

In the standard attention mechanism, the key and value networks are identical, but intuitively

it makes sense for these to be separate representations. The encoding results of the ith word,

¢;, can be calculated by applying an attention mechanism to the projected vectors:

rj = (ak)/Va ay = & f(L)

a-X i

k

lij Vs

where d is the dimension of k and q. Note that i and j are indexes in the same sentence, since we are encoding the context using self-attention. In each transformer layer, self-attention uses

the hidden vectors from the previous layer, which initially is the embedding layer. There are several details worth mentioning here.

First of all, the self-attention mecha-

nism is asymmetric, as r;; is different from rj;. Second, the scale factor v/ was added to improve numerical stability. Third, the encoding for all words in a sentence can be calculated

simultaneously, as the above equations can be expressed using matrix operations that can be

computed efficiently in parallel on modern specialized hardware. The choice of which context to use is completely learned from training examples, not

prespecified. The context-based summarization, ¢;, is a sum over all previous positions in the

sentence. In theory, and information from the sentence could appear in ¢;, but in practice, sometimes important information gets lost, because it is essentially averaged out over the whole sentence.

One way to address that is called multiheaded attention.

We divide the

sentence up into m equal pieces and apply the attention model to each of the m pieces. Each piece has its own set of weights.

Then the results are concatenated together to form ¢;. By

concatenating rather than summing, we make it easier for an important subpiece to stand out.

24.4.2

From self-attention to transformer

Self-attention is only one component of the transformer model. Each transformer layer con-

sists of several sub-layers. At each transformer layer, self-attention is applied first. The output of the attention module is fed through feedforward layers, where the same feedforward

weight matrices are applied independently at each position. A nonlinear activation function, typically ReLU, is applied after the first feedforward layer. In order to address the potential

vanishing gradient problem, two residual connections (are added into the transformer layer. A single-layer transformer in shown in Figure 24.9. In practice, transformer models usually

Multiheaded attention

870

Chapter 24 Deep Learning for Natural Language Processing

{

Transformer

Output Vectors,

{

Layer

Residual

Connection

]

Rssiual)

Self-Attention-

Residual

P11

L input Vectors Figure 24.9 A single-layer transformer consists of self-attention, a feedforward network, and residual connections.

Clas: Adverb

Class Pronoun

Cla PastTenseVerb

t t 1 Feedforward | | Feedforward | | Feedforward | | f f i “Transformer Layer f f i ‘Transformer Layer f f i “Transformer Layer t f f Positional Positional Positional Embedding 1| |Embedding2| |Embedding3| + + + Embedding | [Embedding | [Embedding | lookup Tookup Tookup. f f Yesterday

Figure 24.10

they

cut

Class. Determiner

t Feedforward | | Feedforward f f f

f

f

f

f f Positional Positional |Embedding4| |Embedding 5 + + [Embedding | [ Embedding Tookup Tookup f f the

rope

Using the transformer architecture for POS tagging.

have six or more layers.

As with the other models that we’ve learned about, the output of

layer i is used as the input to layer i+ 1.

oo

The transformer architecture does not explicitly capture the order of words in the sequence, since context is modeled only through self-attention, which is agnostic to word order. To capture the ordering of the words, the transformer uses a technique called positional em-

bedding. If our input sequence has a maximum length of n, then we learn n new embedding

Section 245 Pretraining and Transfer Learning

871

vectors—one for each word position. The input to the first transformer layer is the sum of the word embedding at position 7 plus the positional embedding corresponding to position 7.

Figure 24.10 illustrates the transformer architecture for POS tagging, applied to the same

sentence used in Figure 24.3. At the bottom, the word embedding and the positional embed-

dings are summed to form the input for a three-layer transformer. The transformer produces

one vector per word, as in RNN-based POS tagging. Each vector is fed into a final output layer and softmax layer to produce a probability distribution over the tags.

In this section, we have actually only told half the transformer story: the model we de-

scribed here is called the transformer encoder. It is useful for text classification tasks. The

full transformer architecture was originally designed as a sequence-to-sequence model for machine translation. decoder.

Therefore, in addition to the encoder, it also includes a transformer

The encoder and decoder are nearly identical, except that the decoder uses a ver-

sion of self-attention where each word can only attend to the words before it, since text is

Transformer encoder Transformer decoder

generated left-to-right. The decoder also has a second attention module in each transformer

layer that attends to the output of the transformer encoder.

2

Pretraining and Transfer Learning

Getting enough data to build a robust model can be a challenge.

In computer vision (see

Chapter 25), that challenge was addressed by assembling large collections of images (such as ImageNet) and hand-labeling them. For natural language, it is more common

to work with text that is unlabeled.

The dif-

ference is in part due to the difficulty of labeling: an unskilled worker can easily label an image as “cat” or “sunset,” but it requires extensive training to annotate a sentence with partof-speech tags or parse trees. The difference is also due to the abundance of text: the Internet adds over 100 billion words of text each day, including digitized books, curated resources

such as Wikipedia, and uncurated social media posts. Projects such as Common Crawl provide easy access to this data. Any running text can

be used to build n-gram or word embedding models, and some text comes with structure that

can be helpful for a variety of tasks—for example, there are many FAQ sites with questionanswer pairs that can be used to train a question-answering system.

Similarly, many Web

sites publish side-by-side translations of texts, which can be used to train machine translation

systems. Some text even comes with labels of a sort, such as review sites where users annotate their text reviews with a 5-star rating system.

‘We would prefer not to have to go to the trouble of creating a new data set every time we want a new NLP model. In this section, we introduce the idea of pretraining: a form

of transfer learning (see Section 21.7.2) in which we use a large amount of shared general-

domain language data to train an initial version of an NLP model. From there, we can use a

smaller amount of domain-specific data (perhaps including some labeled data) to refine the model.

The refined model can learn the vocabulary, idioms, syntactic structures, and other

linguistic phenomena that are specific to the new domain. 24.5.1

Pretrained word embeddings

In Section 24.1, we briefly introduced word embeddings. We saw that how similar words like banana and apple end up with similar vectors, and we saw that we can solve analogy

Pretraining

872

Chapter 24 Deep Learning for Natural Language Processing problems with vector subtraction. This indicates that the word embeddings are capturing substantial information about the words.

In this section we will dive into the details of how word embeddings are created using an

entirely unsupervised process over a large corpus of text. That is in contrast to the embeddings

from Section 24.1, which were built during the process of supervised part of speech tagging,

and thus required POS tags that come from expensive hand annotation. We will concentrate

on one specific model for word embeddings,

the GloVe (Global

Vectors) model. The model starts by gathering counts of how many times each word appears

within a window of another word, similar to the skip-gram model. First choose window size

(perhaps 5 words) and let X;; be the number of times that words i and j co-occur within

a window,

and let X; be the number of times word i co-occurs with any other word.

Let

P,;=X;;/X; be the probability that word j appears in the context of word i. As before, let E; be the word embedding for word i. Part of the intuition of the GloVe model is that the relationship between two words can

best be captured by comparing them both to other words. Consider the words ice and steam. Now consider the ratio of their probabilities of co-occurrence with another word, w, that is:

Puice/Pustean

When w is the word solid the ratio will be high (meaning solid applies more to ice) and when w is the word gas it will be low (meaning gas applies more to steam). And when w is a non-content word like rhe, a word like warer that is equally relevant to both, or an equally irrelevant word like fashion, the ratio will be close to 1.

The GloVe model starts with this intuition and goes through some mathematical reason-

ing (Pennington ef al., 2014) that converts ratios of probabilities into vector differences and

dot products, eventually arriving at the constraint

E;-E}=log(P;). In other words, the dot product of two word vectors is equal to the log probability of their

co-occurrence. That makes intuitive sense: two nearly-orthogonal vectors have a dot product

close to 0, and two nearly-identical normalized vectors have a dot product close to 1. There

is a technical complication wherein the GloVe model creates two word embedding vectors

for each word, E; and EJ; computing the two and then adding them together at the end helps limit overfitting.

Training a model like GloVe is typically much less expensive than training a standard

neural network:

a new model can be trained from billions of words of text in a few hours

using a standard desktop CPU. It is possible to train word embeddings on a specific domain, and recover knowledge in

that domain. For example, Tshitoyan e al. (2019) used 3.3 million scientific abstracts on the

subject of material science to train a word embedding model. They found that, just as we saw

that a generic word embedding model can answer “Athens is to Greece as Oslo is to what?” with “Norway,” their material science model can answer “NiFe is to ferromagnetic as IrMn is to what?” with “antiferromagnetic.”

Their model does not rely solely on co-occurrence of words; it seems to be capturing more complex scientific knowledge. When asked what chemical compounds can be classified

as “thermoelectric™ or “topological insulator,” their model is able to answer correctly.

For

example, CsAgGapSey never appears near “thermoelectric” in the corpus, but it does appear

Section 245 Pretraining and Transfer Learning

873

near “chalcogenide.” “band gap,” and “optoelectric,” which are all clues enabling it to be classified as similar to “thermoelectric.”

Furthermore, when trained only on abstracts up

to the year 2008 and asked to pick compounds that are “thermoelectric” but have not yet appeared in abstracts, three of the model’s top five picks were discovered to be thermoelectric in papers published between 2009 and 2019. 24.5.2

Pretrained contextual representations

‘Word embeddings are better representations than atomic word tokens, but there is an impor-

tant issue with polysemous words. For example, the word rose can refer to a flower or the past tense of rise. Thus, we expect to find at least two entirely distinct clusters of word contexts for rose: one similar to flower names such as dahlia, and one similar to upsurge.

No

single embedding vector can capture both of these simultaneously. Rose is a clear example of a word with (at least) two distinct meanings, but other words have subtle shades of meaning that depend on context, such as the word need in you need to see this movie versus humans

need oxygen to survive. And some idiomatic phrases like break the bank are better analyzed as a whole rather than as component words.

Therefore, instead of just learning a word-to-embedding table, we want to train a model to

generate contextual representations of each word in a sentence. A contextual representation

maps both a word and the surrounding context of words into a word embedding vector. In

other words, if we feed this model the word rose and the context the gardener planted a rose bush, it should produce a contextual embedding that is similar (but not necessarily identical)

to the representation we get with the context the cabbage rose had an unusual fragrance, and very different from the representation of rose in the context the river rose five feet.

Figure 24.11 shows a recurrent network that creates contextual word embeddings—the boxes that are unlabeled in the figure. We assume we have already built a collection of

noncontextual word embeddings. We feed in one word at a time, and ask the model to predict the next word. So for example in the figure at the point where we have reached the word “car,” the the RNN node at that time step will receive two inputs: the noncontextual word embedding for “car” and the context, which encodes information from the previous words “The red.” The RNN node will then output a contextual representation for “car.” The network

as a whole then outputs a prediction for the next word, “is.” We then update the network’s weights to minimize the error between the prediction and the actual next word.

This model is similar to the one for POS tagging in Figure 24.5, with two important

differences. First, this model is unidirectional (left-to-right), whereas the POS model is bidi-

rectional. Second, instead of predicting the POS tags for the current word, this model predicts

the next word using the prior context. Once the model is built, we can use it to retrieve representations for words and pass them on to some other task: we need not continue to predict

the next word. Note that computing a contextual representations always requires two inputs, the current word and the context.

24.5.3

Masked

language models

A weakness of standard language models such as n-gram models is that the contextualization

of each word is based only on the previous words of the sentence. Predictions are made from

left to right. But sometimes context from later in a sentence—for example, feet in the phrase

rose five feet—helps to clarify earlier words.

Contextual representations

874

Chapter 24 Deep Learning for Natural Language Processing

t

is

Feedforward |

[Feedforward

Feedforward

t

big

car

red

t

{

Feedforward |

t

[Feedforward

Contextual

representations (RNN output)

RNN}——[RNN|——{RNN|——{RNN}-——[RNN

]\t“’,"‘“f::‘"\,:;“

(word embeddings)

1

f

‘°°; X

> X

Figure 26.11 A simple triangular robot that can translate, and needs to avoid a rectangular

obstacle. On the left is the workspace, on the right is the configuration space.

Once we have a path, the task of executing a sequence of actions to follow the path is

called trajectory tracking control. A trajectory is a path that has a time associated with

[rajecio tracking

each point on the path. A path just says “go from A to B to C, etc.” and a trajectory says Trajectory “start at A, take 1 second to get to B, and another 1.5 seconds to get to C, ete.” 26.5.1

Configuration space

Imagine a simple robot, R, in the shape of a right triangle as shown by the lavender triangle in the lower left corner of Figure 26.11. The robot needs to plan a path that avoids a rectangular obstacle, O. The physical space that a robot moves about in is called the workspace. This Workspace particular robot can move in any direction in the x — y plane, but cannot rotate. The figure shows five other possible positions of the robot with dashed outlines; these are each as close

to the obstacle as the robot can get. The body of the robot could be represented as a set of (x,y) points (or (x,y,z) points

for a three-dimensional robot), as could the obstacle. With this representation, avoiding the obstacle means that no point on the robot overlaps any point on the obstacle. Motion planning

would require calculations on sets of points, which can be complicated and time-consuming.

We can simplify the calculations by using a representation scheme in which all the points

that comprise the robot are represented as a single point in an abstract multidimensional

space, which we call the configuration space, or C-space. The idea is that the set of points Configuration space that comprise the robot can be computed if we know (1) the basic measurements of the robot

(for our triangle robot, the length of the three sides will do) and (2) the current pose of the robot—its position and orientation.

For our simple triangular robot, two dimensions suffice for the C-space: if we know the

(x,y) coordinates of a specific point on the robot—we’ll use the right-angle vertex—then we

can calculate where every other point of the triangle is (because we know the size and shape of the triangle and because the triangle cannot rotate). In the lower-left corner of Figure 26.11,

the lavender triangle can be represented by the configuration (0,0).

If we change the rules so that the robot can rotate, then we will need three dimensions,

(x,y,0), to be able to calculate where every point is. Here

is the robot’s angle of rotation

in the plane. If the robot also had the ability to stretch itself, growing uniformly by a scaling

factor s, then the C-space would have four dimensions, (x,y,6,5).

~C-space

Chapter 26 Robotics For now we’ll stick with the simple two-dimensional C-space of the non-rotating triangle

robot. The next task is to figure out where the points in the obstacle are in C-space. Consider

the five dashed-line triangles on the left of Figure 26.11 and notice where the right-angle vertex is on each of these. Then imagine all the ways that the triangle could slide about.

Obviously, the right-angle vertex can’t go inside the obstacle, and neither can it get any

C-space obstacle

closer than it is on any of the five dashed-line triangles. So you can see that the area where

the right-angle vertex can’t go—the C-space obstacle—is the five-sided polygon on the right

of Figure 26.11 labeled Cyp

In everyday language we speak of there being multiple obstacles for the robot—a table, a

chair, some walls. But the math notation is a bit easier if we think of all of these as combit

into one “obstacle” that happens to have disconnected components.

In general, the C:

obstacle is the set of all points ¢ in C such that, if the robot were placed in that configuration, its workspace geometry would intersect the workspace obstacle.

Let the obstacles in the workspace be the set of points O, and let the set of all points on

the robot in configuration ¢ be .A(g). Then the C-space obstacle is defined as

Free space Degrees of freedom (DOF)

Forward kinematics

Cons={q:q€CandA(q)NO# {}}

and the free space is Cjree = C — Cops.

The C-space becomes more interesting for robots with moving parts. Consider the two-

link arm from Figure 26.12(a). It is bolted to a table so the base does not move, but the arm

has two joints that move independently—we call these degrees of freedom (DOF). Moving

the joints alters the (x,y) coordinates of the elbow, the gripper, and every point on the arm. The arm’s configuration space is two-dimensional: (Bguous ferp), Where Oy, is the angle of the shoulder joint, and 6, is the angle of the elbow joint. Knowing the configuration for our two-link arm means we can determine where each

point on the arm is through simple trigonometry.

ping is a function

In general, the forward kinematics map-

¢p:C—=W

that takes in a configuration and outputs the location of a particular point b on the robot when

the robot is in that configuration. A particularly useful forward kinematics mapping is that for

the robot’s end effector, ¢z The set of all points on the robot in a particular configuration ¢ is denoted by A(q) C W:

Alg) =U{(@)}-

;

Inverse kinematics

The inverse problem, of mapping a desired location for a point on the robot to the config-

uration(s) the robot needs to be in for that to happen, is known as inverse kinematics:

IKy:x €W = {g €C st d(q) =x}.

Sometimes the inverse kinematics mapping might take not just a position, but also a desired

orientation as input. When we want a manipulator to grasp an object, for instance, we can

compute a desired position and orientation for its gripper, and use inverse kinematics to de-

termine a goal configuration for the robot. Then a planner needs to find a way to get the robot from its current configuration to the goal configuration without intersecting obstacles.

Workspace obstacles are often depicted as simple geometric forms—especially in robotics textbooks, which tend to focus on polygonal obstacles. But how do the obstacles look in configuration space?

Section 26.5 Planning and Control

Felby Fshou,

(b)

(a)

Figure 26.12 (a) Workspace representation of a robot arm with two degrees of freedom. The workspace is a box with a flat obstacle hanging from the ceiling. (b) Configuration space of the same robot. Only white regions in the space are configurations that are free of collisions. The dot in this diagram corresponds to the configuration of the robot shown on the left.

For the two-link arm, simple obstacles in the workspace, like a vertical line, have very

complex C-space counterparts, as shown in Figure 26.12(b). The different shadings of the occupied space correspond to the different objects in the robot’s workspace: the dark region

surrounding the entire free space corresponds to configurations in which the robot collides

with itself. It is easy to see that extreme values of the shoulder or elbow angles cause such a violation. The two oval-shaped regions on both sides of the robot correspond to the table on which the robot is mounted. The third oval region corresponds to the left wall.

Finally, the most interesting object in configuration space is the vertical obstacle that

hangs from the ceiling and impedes the robot’s motions.

This object has a funny shape in

configuration space: it is highly nonlinear and at places even concave. With a little bit of imagination the reader will recognize the shape of the gripper at the upper left end. ‘We encourage the reader to pause for a moment and study this diagram. The shape of this

obstacle in C-space is not at all obvious!

The dot inside Figure 26.12(b) marks the configu-

ration of the robot in Figure 26.12(a). Figure 26.13 depicts three additional configurations,

both in workspace and in configuration space. In configuration conf-1, the gripper is grasping the vertical obstacle.

We see that even if the robot’s workspace is represented by flat polygons, the shape of

the free space can be very complicated. In practice, therefore, one usually probes a configu-

ration space instead of constructing it explicitly. A planner may generate a configuration and

then test to see if it is in free space by applying the robot kinematics and then checking for collisions in workspace coordinates.

941

942

Chapter 26 Robotics

conf-3

conf-1

conf-2

() (a) Figure 26.13 Three robot configurations, shown in workspace and configuration space. 26.5.2 Motion planning

Motion planning

The motion planning problem is that of finding a plan that takes a robot from one configuration to another without colliding with an obstacle. It is a basic building block for movement and manipulation.

In Section 26.5.4 we will discuss how to do this under complicated dy-

namics, like steering a car that may drift off the path if you take a curve too fast. For now, we

will focus on the simple motion planning problem of finding a geometric path that is collision

free. Motion planning is a quintessentially continuous-state search problem, but it is often

Piano mover's problem

possible to discretize the space and apply the search algorithms from Chapter 3.

The motion planning problem is sometimes referred to as the piano mover’s problem. It

gets its name from a mover’s struggles with getting a large, irregular-shaped piano from one room to another without hitting anything. We are given:

« a workspace world W in either R? for the plane or R? for three dimensions,

an obstacle region O C W,

+ arobot with a configuration space C and set of points A(g) for g € C, * astarting configuration ¢ € C, and

« a goal configuration g € C. The obstacle region induces a C-space obstacle C,p, and its corresponding free space Cyee defined as in the previous section. We need to find a continuous path through free space. We

will use a parameterized curve, 7(r), to represent the path, where 7(0) = g, and 7(1) = g, and 7(r) for every

between 0 and 1 is some point in Cyy.. That is, ¢ parameterizes how

far we are along the path, from start to goal. Note that ¢ acts somewhat like time in that as

1 increases the distance along the path increases, but7 is always a point on the interval [0, 1] and is not measured in seconds.

Section 265

Planning and Control

943

4

o

Figure 26.14 A visibility graph. Lines connect every pair of vertices that can “see” each other—lines that don’t go through an obstacle. The shortest path must lie upon these lines. The motion planning problem can be made more complex in various ways: defining the

goal as

a set of possible configurations rather than a single configuration; defining the goal

in the workspace rather than the C-space; defining a cost function (e.g.,

path length) to be

minimized; satisfying constraints (e.g., if the path involves carrying a cup of coffee, making

sure that the cup is always oriented upright so the coffee does not spill).

The spaces of motion planning: Let’s take a step back and make sure we understand the

spaces involved in motion planning. First, there is the workspace or world W. Points in W are points in the everyday three-dimensional world. Next, we have the space of configurations,

C. Points g in C are d-dimensional, with d the robot’s number of degrees of freedom, and

map to sets of points .A(g) in W. Finally, there is the space of paths. The space of paths is a

space of functions. Each point in this space maps to an entire curve through C-space. This space is co-dimensional! Intuitively, we need d dimensions for each configuration along the path, and there are as many configurations on a path as there are points in the number line interval [0, 1]. Now let’s consider some ways of solving the motion planning problem. Visi

ility graphs

For the simplified case of two-dimensional configuration spaces and polygonal C-space ob-

stacles, vi ity graphs are a convenient way to solve the motion planning problem with a guaranteed shortest-path solution. Let Vops C C be the set of vertices of the polygons making

up Cops, and let V = VoU {gs.q5}-

We construct a graph G = (V,E) on the vertex set V with edges e;; € E connecting a

vertex v; to another vertex v; if the line connecting the two vertices

is collision-free—that is,

if {Avi+ (1= A)v;: A€ [0,1]}NCops = { }. When this happens, we say the two vertices “can

see each other,” which is where “visibility” graphs got their name. To solve the motion planning problem, all we need to do is run a discrete graph search (e.g., best-first search) on the graph G with starting state g, and goal g,. In Figure 26.14

we see a visibility graph and an optimal three-step solution. An optimal search on visibility

graphs will always give us the optimal path (if one exists), or report failure if no path exists.

Voronoi

diagrams

Visibility graphs encourage paths that run immediately adjacent to an obstacle—if you had to walk around a table to get to the door, the shortest path would be to stick as close to the table

as possible. However, if motion or sensing is nondeterministic, that would put you at risk of

bumping into the table. One way to address this is to pretend that the robot’s body is a bit

Visibility graph

Chapter 26 Robotics

Figure 26.15 A Voronoi diagram showing the set of points (black lines) equidistant to two

or more obstacles in configuration space.

larger than it actually is, providing a buffer zone. Another way is to accept that path length is not the only metric we want to optimize. Section 26.8.2 shows how to learn a good metric from human examples of behavior.

Voronoi diagram Region

Voronoi graph

A third way is to use a different technique, one that puts paths as far away from obstacles

as possible rather than hugging close to them. A Voronoi diagram is a representation that

allows us to do just that. To get an idea for what a Voronoi diagram does, consider a space

where the obstacles are, say, a dozen small points scattered about a plane. Now surround each

of the obstacle points with a region consisting of all the points in the plane that are closer to that obstacle point than to any other obstacle point. Thus, the regions partition the plane. The

Voronoi diagram consists

and vertices of the regions.

of the set of regions, and the Voronoi graph consists of the edges

‘When obstacles are areas, not points, everything stays pretty much the same. Each region

still contains all the points that are closer to one obstacle than to any other, where distance is

measured to the closest point on an obstacle. The boundaries between regions still correspond

to points that are equidistant between two obstacles, but now the boundary may be a curve

rather than a straight line. Computing these boundaries can be prohibitively expensive in high-dimensional spaces.

To solve the motion planning problem, we connect the start point g, to the closest point

on the Voronoi graph via a straight line, and the same for the goal point g,. We then use discrete graph search to find the shortest path on the graph. For problems like navigating

through corridors indoors, this gives a nice path that goes down the middle of the corridor.

However, in outdoor settings it can come up with inefficient paths, for example suggesting an

unnecessary 100 meter detour to stick to the middle of a wide-open 200-meter space.

Section 26.5

Planning and Control

945

joal

(a)

(b)

Figure 26.16 (a) Value function and path found for a discrete grid cell approximation of the configuration space. (b) The same path visualized in workspace coordinates. Notice how the robot bends its elbow to avoid a collision with the vertical obstacle. Cell decomposition

An alternative approach to motion planning is to discretize the C-space. Cell decomposition ~Cell decomposition methods decompose the free space into a finite number of contiguous regions, called cells. These cells are designed so that the path-planning problem within a single cell can be solved by simple means (e.g., moving along a straight line). The path-planning problem then becomes a discrete graph search problem (as with visibility graphs and Voronoi graphs) to find a path through a sequence of cel The simplest cell decomposition consists of a regularly spaced grid. Figure 26.16(a) shows a square grid decomposition of the space and a solution path that is optimal for this grid size. Grayscale shading indicates the value of each free-space grid cell—the cost of the shortest path from that cell to the goal. (These values can be computed by a deterministic form of the VALUE-ITERATION algorithm given in Figure 17.6 on page 573.) Figure 26.16(b)

shows the corresponding workspace trajectory for the arm. Of course, we could also use the A" algorithm to find a shortest path.

This grid decomposition has the advantage that it

from three limitations.

simple to implement, but it suffers

First, it is workable only for low-dimensional configuration spaces,

because the number of grid cells increases exponentially with d, the number of dimensions. (Sounds familiar? This is the curse of dimensionality.) Second, paths through discretized

state space will not always be smooth. We see in Figure 26.16(a) that the diagonal parts of the path are jagged and hence very difficult for the robot to follow accurately. The robot can attempt to smooth out the solution path, but this is far from straightforward. Third, there is the problem of what to do with cells that are “mixed”—that

is, neither

entirely within free space nor entirely within occupied space. A solution path that includes

Chapter 26 Robotics such a cell may not be a real solution, because there may be no way to safely cross the

cell. This would make the path planner unsound.

On the other hand, if we insist that only

completely free cells may be used, the planner will be incomplete, because it might be the case that the only paths to the goal go through mixed cells—it might be that a corridor is actually wide enough for the robot to pass, but the corridor is covered only by mixed cells. The first approach to this problem is further subdivision of the mixed cells—perhaps using cells of half the original size. This can be continued recursively until a path is found

that lies entirely within free cells. This method works well and is complete if there is a way to

decide if a given cell is a mixed cell, which is easy only if the configuration space boundaries have relatively simple mathematical descriptions.

It is important to note that cell decomposition does not necessarily require explicitly rep-

Collision checker

resenting the obstacle space C,p,. We can decide to include a cell or not by using a collision checker. This is a crucial notion to motion planning. A collision checker is a function 7(g)

that maps to 1 if the configuration collides with an obstacle, and 0 otherwise. It is much easier

to check whether a specific configuration is in collision than to explicitly construct the entire

obstacle space Cyps. Examining the solution path shown in Figure 26.16(a), we can see an additional difficulty that will have to be resolved. The path contains arbitrarily sharp corners, but a physical robot has momentum and cannot change direction instantaneously. This problem can be solved by storing, for each grid cell, the exact continuous state (position and velocity) that was attained

when the cell was reached in the search. Assume further that when propagating information to nearby grid cells, we use this continuous state as a basis, and apply the continuous robot

motion model for jumping to nearby cells. So we don’t make an instantaneous 90° turn; we

make a rounded turn governed by the laws of motion. We can now guarantee that the resulting

Hybrid A

trajectory is smooth and can indeed be executed by the robot. One algorithm that implements

this is hybrid A*.

Randomized motion planning Randomized motion planning does graph search on a random decomposition of the configuration space, rather than a regular cell decomposition. The key idea is to sample a random set

of points and to create edges between them if there is a very simple way to get from one to

Probabilistic roadmap (PRM) Simple planner

the other (e.g., via a straight line) without colliding; then we can search on this graph.

A probabilistic roadmap (PRM) algorithm is one way to leverage this idea. We assume

access to a collision checker 7 (defined on page 946), and to a simple planner B(q;,q>) that

returns a path from g, to ¢, (or failure) but does so quickly. This simple planner is not going

to be complete—it might return failure even if a solution actually exists. Its job is to quickly

try to connect gy and g5 and let the main algorithm know if it succeeds. We will use it to

Milestone

define whether an edge exists between two vertices.

The algorithm starts by sampling M milestones—points in Cy—in addition to the points g, and g,. It uses rejection sampling, where configurations are sampled randomly and collision-checked using + until a total of M milestones are found.

uses the simple planner to try to connect pairs of milestones.

Next, the algorithm

If the simple planner returns

success, then an edge between the pair is added to the graph; otherwise, the graph remains as

is. We try to connect each milestone either to its k nearest neighbors (we call this &-PRM), or

to all milestones in a sphere of a radius r. Finally, the algorithm searches for a path on this

Section 26.5

@

Planning and Control

947

7

Figure 26.17 The probabilistic roadmap (PRM) algorithm. Top left: the start and goal configurations. Top right: sample M collision-free milestones (here M = 5). Bottom left: connect each milestone to its k nearest neighbors (here k = 3). Bottom right: find the shortest path from the start to the goal on the resulting graph. graph from g to g. If no path is found, then M more milestones are sampled, added to the graph, and the process is repeated. Figure 26.17 shows a roadmap with the path found between two configurations. PRMs

Probabilistically are not complete, but they are what is called probabilistically complete—they will eventu- comp lete ally find a path, if one exists. Intuitively, this is because they keep sampling more milestones.

PRMs work well even in high-dimensional configuration spaces. PRM:s are also popular for multi-query planning, in which we have multiple motion Multi-query planning planning problems within the same C-space. Often, once the robot reaches a goal, it is called upon to reach another goal in the same workspace. PRMs are really useful, because the robot can dedicate time up front to constructing a roadmap, and amortize the use of that roadmap over multiple queries.

Rapidly-exploring random trees Rapidly exploring : ! of PRMs called rapidly exploring random trees (RRTS) is} popular for singleAn extension random trées query planning. We incrementally build two trees, one with g, as the oot and one with gg " * as the root.

Random

milestones are chosen, and an attempt is made to connect each new

milestone to the existing trees. If a milestone connects both trees, that means a solution has been found, as in Figure 26.18. If not, the algorithm finds the closest point in each tree and

adds to the tree a new edge that extends from the point by a distance § towards the milestone.

This tends to grow the tree towards previously unexplored sections of the space. Roboticists love RRTs for their ease of use. However, RRT solutions are typically nonoptimal and lack smoothness. Therefore, RRTs are often followed by a post-processing step.

The most common one is “short-cutting,” in which we randomly select one of the vertices on

the solution path and try to remove it by connecting its neighbors to each other (via the simple

Chapter 26 Robotics

Gsample

qs Figure 26.18 The bidirectional RRT algorithm constructs two trees (one from the start, the other from the goal) by incrementally connecting each sample to the closest node in each tree, if the connection is possible. When a sample connects to both trees, that means we have found a solution path.

Figure 26.19 Snapshots of a trajectory produced by an RRT and post-processed with short-

cutting. Courtesy of Anca Dragan.

planner). We do this repeatedly for as many steps as we have compute time for. Even then, RRT"

the trajectories might look a little unnatural due to the random positions of the milestone that were selected, as shown in Figure 26.19.

RRT is a modification to RRT that makes the algorithm asymptotically optimal: the

solution converges to the optimal solution as more and more milestones are sampled. The key idea is to pick the nearest neighbor based on a notion of cost to come rather than distance

from the milestone only, and to rewire the tree, swapping parents of older vertices if it is

cheaper to reach them via the new milestone.

Trajectory optimization for kinematic planning Randomized sampling algorithms tend to first construct a complex but feasible path and then optimize it. Trajectory optimization does the opposite: it starts with a simple but infeasible

path, and then works to push it out of collision. The goal is to find a path that optimizes a cost

Section 265

Planning and Control

949

function! over paths. That is, we want to minimize the cost function J(7), where 7(0) = g, and 7(1) = g, J is called a functional because it is a function over functions.

The argument to J is

7, which is itself a function: 7(r) takes as input a point in the [0, 1] interval and maps it to

a configuration. A standard cost functional trades off between two important aspects of the robot’s motion: collision avoidance and efficiency,

J = Jobs + Mo

where the efficiency J;- measures the length of the path and may also measure smoothness. A convenient way to define efficiency is with a quadratic: it integrates the squared first derivative of 7 (we will see in a bit why this does in fact incentivize short paths):

Ja = [ 30 Pas. e

For the obstacle term, assume we can compute the distance d(x) from any point x € W to the nearest obstacle edge. This distance is positive outside of obstacles, 0 at the edge, and

negative inside. This is called a signed distance field. We can now define a cost field in the

workspace, call it ¢, that has high cost inside of obstacles, and a small cost right outside. With

Signed distance field

this cost, we can make points in the workspace really hate being inside obstacles, and dislike being right next to them (avoiding the visibility graph problem of their always hanging out by the edges of obstacles). Of course, our robot is not a point in the workspace, so we have some more work to do—we need to consider all points b on the robot’s body:

(fn(T \—ob () [l db ds ,»m—// ew

This

is called a path mtegral~n does not just integrate ¢ along the way for each body point,

but it multiplies by the derivative to make the cost invariant to retiming of the path. Imagine a robot sweeping through the cost field, accumulating cost as is moves. Regardless of how fast

Path integral

or slow the arm moves through the field, it must accumulate the exact same cost.

The simplest way to solve the optimization problem above and find a path is gradient

descent. If you are wondering how to take gradients of functionals with respect to functions,

something called the calculus of variations is here to help. It is especially easy for functionals of the form

'

Il

/0 F(s,7(5),#(5))ds

which are integrals of functions that depend just on the parameter s, the value of the function

Euler-Lagrange at s, and the derivative of the function at s. In such a case, the Euler-Lagrange equation equation says that the gradient is

Vo J(s)

)=

JF

= o

d

OF

5O " warm

If we look closely at Jo; and Jyp, they both follow this pattern. In particular for J,7, we have F(s,7(s),7(s)) = [|#(s)[|. To get a bit more comfortable with this, let’s compute the gradient

1 Roboticists like to minimize a cost function J, whereas in other parts of Al we try to maximize a utility function

U orareward R.

950

Chapter 26 Robotics

Figure 26.20 Trajectory optimization for motion planning. Two point-obstacles with circular bands of decreasing cost around them. The optimizer starts with the straight line trajectory, and lets the obstacles bend the line away from collisions, finding the minimum path through the cost field.

for Jo5 only. We see that F does not have a direct dependence on 7(s), so the first term in the formula is 0. We are left with d

Vel(s)=0-2(s9)

since the partial of F with respect to 7(s) is 7(s).

Notice how we made things easier for ourselves when defining Jo;—it’s a nice quadratic

of the derivative (and we even put a % in front so that the 2 nicely cancels out). In practice, you will see this trick happen a lot for optimization—the art is not just in choosing how to

optimize the cost function, but also in choosing a cost function that will play nicely with how you will optimize it. Simplifying our gradient, we get

V. J(s) = —#(s). Now, since J is a quadratic, setting this gradient to 0 gives us the solution for 7 if we

didn’t have to deal with obstacles. Integrating once, we get that the first derivative needs to be constant; integrating again we get that 7(s) = a- s+ b, with a and b determined by the endpoint constraints for 7(0) and 7(1). The optimal path with respect to J.g is thus the

straight line from start to goal! It is indeed the most efficient way to go from one to the other if there are no obstacles to worry about.

Of course, the addition of Jp, is what makes things difficult—and we will spare you

deriving its gradient here. The robot would typically initialize its path to be a straight line,

which would plow right through some obstacles. It would then calculate the gradient of the cost about the current path, and the gradient would serve to push the path away from the obstacles (Figure 26.20). Keep in mind that gradient descent will only find a locally optimal solution—just like hill climbing. Methods such as simulated annealing (Section 4.1.2) can be

used for exploration, to make it more likely that the local optimum is a good one. 26.5.3

Control theory

Trajectory tracking control

We have covered how to plan motions, but not how to actually move—to apply current to motors, to produce torque, to move the robot.

This is the realm of control theory, a field

of increasing importance in Al There are two main questions to deal with: how do we turn

Section 26.5 Planning and Control

951

Figure 26.21 The task of reaching to grasp a bottle solved with a trajectory optimizer. Left: the initial trajectory, plotted for the end effector. Middle: the final trajectory after optimiza-

tion. Right: the goal configuration. Courtesy of Anca Dragan. See Ratliff ef al. (2009).

a mathematical description of a path into a sequence of actions in the real world (open-loop

control), and how do we make sure that we are staying on track (closed-loop control)?

From configurations to torques for open-loop tracking: Our path 7(1) gives us configurations. The robot starts at rest at g, = 7(0). From there the robot’s motors will turn currents into torques, leading to motion. But what torques should the robot aim for, such that it ends

up at g = 7(1)?

This is where the idea of a dynamics model (or transition model) comes in. We can give

the robot a function f that computes the effects torques have on the configuration. Remem-

Dynamics model

ber F = ma from physics? Well, there is something like that for torques too, in the form

u= f"(.4.G), with u a torque, ¢ a velocity, and g an acceleration.? If the robot is at config-

uration ¢ and velocity ¢, and applied torque u, that would lead to acceleration G = f(q,q,u). The tuple (g.¢) is a dynamic state, because it includes velocity, whereas g is the kinematic

state and is not sufficient for computing exactly what torque to apply. f is a deterministic

dynamics model in the MDP over dynamic states with torques as actions. f~! is the inverse

Dynamic state

Kinematic state

dynamics, telling us what torque to apply if we want a particular acceleration, which leads Inverse dynamics to a change in velocity and thus a change in dynamic state.

Now, naively, we could think of 7 € [0, 1] as “time” on a scale from 0 to 1 and select our

torque using inverse dynamics:

) = £ (r(0,7(0),5(1))

(262)

assuming that the robot starts at (7(0),7(0)). In reality though, things are not that easy.

The path 7 was created as a sequence of points, without taking velocities and accelera-

tions into account. As such, the path may not satisfy 7(0) = 0 (the robot starts at 0 velocity), or even that 7 is differentiable (let alone twice differentiable).

Further, the meaning of the

endpoint “1” is unclear: how many seconds does that map to? In practice, before we even think of tracking a reference path, we usually retime it, that Retiming is, transform it into a trajectory £(r) that maps the interval [0, 7] for some time duration 7'

into points in the configuration space C. (The symbol is the Greek letter Xi.) Retiming is trickier than you might think, but there are approximate ways to do it, for instance by picking

a maximum velocity and acceleration, and using a profile that accelerates to that maximum 2 We omit the details of /=" here, but they involve mass, inertia, gravity, and Coriolis and centrifugal forces.

952

Chapter 26 Robotics

e

(@)

s

(®)

(c)

Figure 26.22 Robot arm control using (a) proportional control with gain factor 1.0, (b) proportional control with gain factor 0.1, and (c) PD (proportional derivative) control with gain factors 0.3 for the proportional component and 0.8 for the differential component. In all cases the robot arm tries to follow the smooth line path, but in (a) and (b) deviates substantially from

the path.

velocity, stays there as long as it can, and then decelerates back to 0. Assuming we can do this, Equation (26.2) above can be rewritten as

ur) = 1760 E0,€0)-

(263)

Even with the change from 7 to £, an actual trajectory, the equation of applying torques from

Control law

above (called a control law) has a problem in practice. Thinking back to the reinforcement

Stiction

and inertias exactly, and f might not properly account for physical phenomena like stiction in the motors (the friction that tends to prevent stationary surfaces from being set in motion—to

learning section, you might guess what it is. The equation works great in the situation where [ is exact, but pesky reality gets in the way as usual: in real systems, we can’t measure masses

make them stick). So, when the robot arm starts applying those torques but f is wrong, the

errors accumulate and you deviate further and further from the reference path.

Rather than just letting those errors accumulate, a robot can use a control process that

looks at where it thinks it is, compares that to where it wanted to be, and applies a torque to minimize the error.

P controller Gain factor

A controller that provides force in negative proportion to the observed error is known as

a proportional controller or P controller for short. The equation for the force

u(t) = Kp(&(1) — q:)

where g, is the current configuration, and Kp is a constant representing the gain factor of the

controller. Kp regulates how strongly the controller corrects for deviations between the actual state g; and the desired state &(t).

Figure 26.22(a) illustrates what can go wrong with proportional control. Whenever a deviation occurs—whether due to noise or to constraints on the forces the robot can apply—the

robot provides an opposing force whose magnitude is proportional to this deviation. Intuitively, this might appear plausible, since deviations should be compensated by a counterforce to keep the robot on track. However, as Figure 26.22(a) illustrates, a proportional controller

can cause the robot to apply too much force, overshooting the desired path and zig-zagging

Section 265 back and forth. This is the result of the natural inertia of the robot:

Planning and Control

953

once driven back to its

reference position the robot has a velocity that can’t instantaneously be stopped.

In Figure 26.22(a), the parameter Kp = 1. At first glance, one might think that choosing a smaller value for Kp would remedy the problem, giving the robot a gentler approach to the desired path.

Kp

Unfortunately, this is not the case. Figure 26.22(b) shows a trajectory for

=1, still exhibiting oscillatory behavior. The lower value of the gain parameter helps, but

does not solve the problem. In fact, in the absence of friction, the P controller is essentially a

spring law; so it will oscillate indefinitely around a fixed target location.

There are a number of controllers that are superior to the simple proportional control law.

A controller is said to be stable if small perturbations lead to a bounded error between the

robot and the reference signal. It is said to be strictly stable if it is able to return to and then

stay on its reference path upon such perturbations. Our P controller appears to be stable but not strictly stable, since it fails to stay anywhere near its reference trajectory.

The simplest controller that achieves strict stability in our domain is a PD controller. The letter ‘P’ stands again for proportional, and ‘D’ stands for derivative. PD controllers are

Stable Strictly stable PD controller

described by the following equation:

(26.4) + Kp(€(r) = dr) = a1) £() ult) = Kp( As this equation suggests, PD controllers extend P controllers by a differential component,

which adds to the value of u(r) a term that is proportional to the first derivative of the error

&(t) — g, over time. What is the effect of such a term? In general, a derivative term damp-

ens the system that is being controlled.

To see this, consider a situation where the error is

changing rapidly over time, as is the case for our P controller above. The derivative of this error will then counteract the proportional term, which will reduce the overall response to

the perturbation. However, if the same error persists and does not change, the derivative will

vanish and the proportional term dominates the choice of control.

Figure 26.22(c) shows the result of applying this PD controller to our robot arm, using as gain parameters Kp = .3 and K = .8. Clearly, the resulting path is much smoother, and does

not exhibit any obvious oscillations.

PD controllers do have failure modes, however. In particular, PD controllers may fail to

regulate an error down to zero, even in the absence of external perturbations.

Often such a

situation is the result ofa systematic external force that is not part of the model. For example,

an autonomous car driving on a banked surface may find itself systematically pulled to one

side.

Wear and tear in robot arms causes similar systematic errors.

In such situations, an

over-proportional feedback is required to drive the error closer to zero. The solution to this

problem lies in adding a third term to the control law, based on the integrated error over time:

u(t) = KP({(f)*q:)+K:A’({(-Y)*q\)d»“rKv(f(')*d:)-

(26.5)

Here K; is a third gain parameter. The term [}(€(s) calculates the integral of the error over time. The effect of this term is that long-lasting deviations between the reference signal and the actual state are corrected.

Integral terms, then, ensure that a controller does not exhibit

systematic long-term error, although they do pose a danger of oscillatory behavior. A controller with all three terms is called a PID controller (for proportional integral

derivative). PID controllers are widely used in industry, for a variety of control problems. Think of the three terms as follows—proportional: try harder the farther away you are from

PID controller

954

Chapter 26 Robotics the path; derivative: try even harder if the error is increasing; integral: try harder if you haven’t made progress for a long time.

Computed torque control

A middle ground between open-loop control based on inverse dynamics and closed-loop PID control is called computed torque control. We compute the torque our model thinks we. will need, but compensate for model inaccuracy with proportional error terms:

u(t) = §7 (60, £0).60)) +m(E)) (Kp(€(1) — 1) +Kp(£(1) = 1)) Feedforward component Feedback component

feedforward

(26.6)

feedback

The first term is called the feedforward component because it looks forward to where the

robot needs to go and computes what torque might be required. The second is the feedback

component because it feeds the current error in the dynamic state back into the control law.

m(q) is the inertia matrix at configuration g—unlike normal PD control, the gains change with the configuration of the system. Plans versus policies

Let’s take a step back and make sure we understand the analogy between what happened so far in this chapter and what we learned in the search, MDP, and reinforcement learning chapters.

‘With motion in robotics, we are really considering an underlying MDP where the states are

dynamic states (configuration and velocity), and the actions are control inputs, usually in the form of torques. If you take another look at our control laws above, they are policies, not plans—they tell the robot what action to take from any state it might reach. However,

they are usually far from optimal policies. Because the dynamic state is continuous and high dimensional (as is the action space), optimal policies are computationally difficult to extract. Instead, what we did here is to break up the problem. We come up with a plan first, in a simplified state and action space: we use only the kinematic state, and assume that states are reachable from one another without paying attention to the underlying dynamics. This is motion planning, and it gives us the reference path. If we knew the dynamics perfectly, we

could turn this into a plan for the original state and action space with Equation (26.3).

But because our dynamics model is typically erroneous, we turn it instead into a policy

that tries to follow the plan—getting back to it when it drifts away. When doing this, we introduce suboptimality in two ways:

first by planning without considering dynamics, and

second by assuming that if we deviate from the plan, the optimal thing to do is to return to the original plan. In what follows, we describe techniques that compute policies directly over the dynamic state, avoiding the separation altogether.

26.5.4

Optimal

control

Rather than using a planner to create a kinematic path, and only worrying about the dynamics

of the system after the fact, here we discuss how we might be able to do it all at once. We'll

take the trajectory optimization problem for kinematic paths, and turn it into true trajectory

optimization with dynamics: we will optimize directly over the actions, taking the dynamics (or transitions) into account. This brings us much closer to what we’ve seen in the search and MDP chapters.

If we

know the system’s dynamics, then we can find a sequence of actions to execute, as we did in

Chapter 3. If we’re not sure, then we might want a policy, as in Chapter 17.

Section 265

Planning and Control

955

In this section, we are looking more directly at the underlying MDP the robot works

in. We're switching from the familiar discrete MDPs to continuous ones.

We will denote

our dynamic state of the world by x, as is common practice—the equivalent of s in discrete MDPs. Let x, and x; be the starting and goal states.

‘We want to find a sequence of actions that, when executed by the robot, result in state-

action pairs with low cumulative cost. The actions are torques which we denote with u(r)

for ¢ starting at 0 and ending at 7. Formally, we want to find the sequence of torques u that minimize a cumulative cost J:

min Ji J(x(t),u(r))dr

(26.7)

u

subject to the constraints

Ve, x(1) = f(x(0),u(t))

X(0) = xy, X(T) =%,

How is this connected to motion planning and trajectory tracking control? Well, imagine

we take the notion of efficiency and clearance away from the obstacles and put it into the cost function J, just as we did before in trajectory optimization over kinematic state. The dynamic

state is the configuration and velocity, and torques u change it via the dynamics

f from open-

loop trajectory tracking. The difference is that now we're thinking about the configurations

and the torques at the same time. Sometimes, we might want to treat collision avoidance as a

hard constraint as well, something we’ve also mentioned before when we looked at trajectory optimization for the kinematic state only.

To solve this optimization problem, we can take gradients of J—not with respect to the

sequence 7 of configurations anymore, but directly with respect to the controls . It is sometimes helpful to include the state sequence x as a decision variable too, and use the dynamics

constraints to ensure that x and u are consistent. There are various trajectory optimization

techniques using this approach; two of them go by the names multiple shooting and direct

collocation. None of these techniques will find the global optimal solution, but in practice they can effectively make humanoid robots walk and make autonomous cars drive. Magic happens when in the problem above, J is quadratic and f is linear in x and 1. We want to minimize

min / XTQx+ulRudt

o

subjectto

Vi, i(r) = Ax(r) + Bu(t).

‘We can optimize over an infinite horizon rather than a finite one, and we obtain a policy

from any state rather than just a sequence of controls. Q and R need to be positive definite matrices for this to work. This gives us the linear quadratic regulator (LQR). With LQR,

Linear quadratic regulator (LQR)

the optimal value function (called cost to go) is quadratic, and the optimal policy is linear. The policy looks like u = —Kx, where finding the matrix K requires solving an algebraic Riceati equation—no local optimization, no value iteration, no policy iteration are needed! Riccati equation Because of the ease of finding the optimal policy, LQR finds many uses in practice despite the fact that real problems seldom actually have quadratic costs and linear dynamics. A LQR really useful method is called iterative LQR (ILQR), which works by starting with a solu- Iterative (ILQR) tion and then iteratively computing a linear approximation of the dynamics and a quadratic approximation of the cost around it, then solving the resulting LQR system to arrive at a new

solution. Variants of LQR are also often used for trajectory tracking.

956

Chapter 26 Robotics 26.6_Planning Uncertain

Movements

In robotics, uncertainty arises from partial observability of the environment and from the stochastic (or unmodeled) effects of the robot’s actions. Errors can also arise from the use

of approximation algorithms

such as particle filtering, which does not give the robot an exact

belief state even if the environment is modeled perfectly.

The majority of today’s robots use deterministic algorithms for decision making, such as the path-planning algorithms of the previous section, or the search algorithms that were introduced in Chapter 3. These deterministic algorithms are adapted in two ways: first, they

deal with the continuous state space by turning it into a discrete space (for example with Most likely state

visibility graphs or cell decomposition).

Second, they deal with uncertainty in the current

state by choosing the most likely state from the probability distribution produced by the state estimation algorithm. That approach makes the computation faster and makes a better fit for the deterministic search algorithms. In this section we discuss methods for dealing with

uncertainty that are analogous to the more complex search algorithms covered in Chapter 4.

First, instead of deterministic plans, uncertainty calls for policies. We already discussed

how trajectory tracking control turns a plan into a policy to compensate for errors in dynamics.

Online replanning Model predictive

control (MPC)

Sometimes though, if the most likely hypothesis changes enough, tracking the plan designed for a different hypothesis is too suboptimal. This is where online replanning comes in: we can recompute a new plan based on the new belief. Many robots today use a technique called model predictive control (MPC), where they plan for a shorter time horizon, but replan

at every time step. (MPC is therefore closely related to real-time search and game-playing

algorithms.) This effectively results in a policy: at every step, we run a planner and take the

first action in the plan; if new information comes along, or we end up not where we expected, that’s OK, because we are going to replan anyway and that will tell us what to do next. Second, uncertainty calls for information gathering actions. When we consider only the information we have and make a plan based on it (this is called separating estimation from

control), we are effectively solving (approximately) a new MDP at every step, corresponding

to our current belief about where we are or how the world works. But in reality, uncertainty is

better captured by the POMDP framework: there is something we don’t directly observe, be it the robot’s location or configuration, the location of objects in the world, or the parameters

of the dynamics model itself—for example, where exactly is the center of mass of link two on this arm?

What we lose when we don’t solve the POMDP

is the ability to reason about future

information the robot will get: in MDPs we only plan with what we know, not with what we might eventually know. Remember the value of information? Well, robots that plan using

their current belief as if they will never find out anything more fail to account for the value of

information. They will never take actions that seem suboptimal right now according to what

they know, but that will actually result in a lot of information and enable the robot to do well.

What does such an action look like for a navigation robot?

The robot could get close

to a landmark to get a better estimate of where it is, even if that landmark is out of the way

according to what it currently knows. This action is optimal only if the robot considers the

new observations it will get, as opposed to looking only at the information it already has.

Guarded movement

To get around this, robotics techniques sometimes define information gathering actions

explicitly—such as moving a hand until it touches a surface (called guarded movements)—

Section 26.6 initial configuration

Planning Uncertain Movements

-

motion

envelope —

957

cy \2

Figure 26.23 A two-dimensional environment, velocity uncertainty cone, and envelope of possible robot motions. The intended velocity is v, but with uncertainty the actual velocity could be anywhere in C,. resulting in a final configuration somewhere in the motion envelope, which means we wouldn’t know if we hit the hole or not. and make sure the robot does that before coming up with a plan for reaching its actual goal. Each guarded motion consists of (1) a motion command

and (2) a termination condition,

which is a predicate on the robot’s sensor values saying when to stop. Sometimes, the goal itself could be reached via a sequence of guarded moves guaranteed to succeed regardless of uncertainty. As an example, Figure 26.23 shows a two-dimensional

configuration space with a narrow vertical hole. It could be the configuration space for inser-

tion ofa rectangular peg into a hole or a car key into the ignition. The motion commands are constant velocities. The termination conditions are contact with a surface. To model uncer-

tainty in control, we assume that instead of moving in the commanded direction, the robot’s actual motion lies in the cone C, about it.

The figure shows what would happen if the robot attempted to move straight down from

the initial configuration. Because of the uncertainty in velocity, the robot could move anywhere in the conical envelope, possibly going into the hole, but more likely landing to one side of it. Because the robot would not then know which side of the hole it was on, it would

not know which way to move.

A more sensible strategy is shown in Figures 26.24 and 26.25. In Figure 26.24, the robot deliberately moves to one side of the hole. The motion command is shown in the figure, and the termination test is contact with any surface. In Figure 26.25, a motion command is

given that causes the robot to slide along the surface and into the hole. Because all possible

velocities in the motion envelope are to the right, the robot will slide to the right whenever it

is in contact with a horizontal surface.

It will slide down the right-hand vertical edge of the hole when it touches it, because

all possible velocities are down relative to a vertical surface.

It will keep moving until it

reaches the bottom of the hole, because that is its termination condition.

In spite of the

control uncertainty, all possible trajectories of the robot terminate in contact with the bottom of the hole—that is, unless surface irregularities cause the robot to stick in one place.

Other techniques beyond guarded movements change the cost function to incentivize ac-

tions we know will lead to information—Tlike the coastal navigation heuristic which requires

the robot to stay near known landmarks. More generally, techniques can incorporate the ex-

pected information gain (reduction of entropy of the belief) as a term in the cost function,

Coastal navigation

958

Chapter 26

Robotics

initial configuration

7Cv

~

v motion

envelope

Figure 26.24 The first motion command and the resulting envelope of possible robot motions. No matter what actual motion ensues, we know the final configuration will be to the left of the hole.

DN

motion

v

/ envelope

Figure 26.25 The second motion command and the envelope of possible motions. Even with error, we will eventually get into the hole. leading to the robot explicitly reasoning about how much information each action might bring

when deciding what to do. While more difficult computationally, such approaches have the

advantage that the robot invents its own information gathering actions rather than relying on human-provided heuristics and scripted strategies that often lack flexibility. 26.7

Reinforcement Learning in Robotics

Thus far we have considered tasks in which the robot has access to the dynamics model of

the world. In many tasks, it is very difficult to write down such a model, which puts us in the

domain of reinforcement learning (RL).

One challenge of RL in robotics is the continuous nature of the state and action spaces,

which we handle either through discretization, or, more commonly, through function approxi-

mation. Policies or value functions are represented as combinations of known useful features,

or as deep neural networks. Neural nets can map from raw inputs directly to outputs, and thus largely avoid the need for feature engineering, but they do require more data. A bigger challenge is that robots operate in the real world. We have seen how reinforcement learning can be used to learn to play chess or Go by playing simulated games. But when

a real robot moves in the real world, we have to make sure that its actions are safe (things

Section 26.7

Reinforcement Learning in Robotics

(@)

Figure 26.26 Training a robust policy. (a) Multiple simulations are run of a robot hand ma-

nipulating objects, with different randomized parameters for physics and lighting. Courtesy of Wojciech Zaremba. (b) The real-world environment, with a single robot hand in the center

of a cage, surrounded by cameras and range finders. (c) Simulation and real-world training yields multiple different policies for grasping objects; here a pinch grasp and a quadpod grasp. Courtesy of OpenAL See Andrychowicz et al. (2018a).

break!), and we have to accept that progress will be slower than in a simulation because the

world refuses to move faster than one second per second. Much of what is interesting about using reinforcement learning in robotics boils down to how we might reduce the real world

sample complexity—the number of interactions with the physical world that the robot needs before it has learned how to do the task.

26.7.1

Exploiting models

A natural way to avoid the need for many real-world samples is to use as much knowledge of the world’s dynamics as possible. For instance, we might not know exactly what the coefficient of friction or the mass of an object is, but we might have equations that describe

the dynamics as a function of these parameters.

In such a case, model-based reinforcement learning (Chapter 22) is appealing, where

the robot can alternate between fitting the dynamics parameters and computing a better policy. Even if the equations are incorrect because they fail to model every detail of physics,

researchers have experimented with learning an error term, in addition to the parameters, that can compensate for the inaccuracy of the physical model. Or, we can abandon the equations

and instead fit locally linear models of the world that each approximate the dynamics in a region of the state space, an approach that has been successful in getting robots to master

complex dynamic tasks like juggling. A model of the world can also be useful in reducing the sample complexity of model-free

reinforcement learning methods by doing sim-to-real transfer: transferring policies that work ~sim-to-rea!

959

960

Chapter 26 Robotics in simulation to the real world. The idea is to use the model as a simulator for a policy search (Section 22.5). To learn a policy that transfers well, we can add noise to the model during

Domain randomization

training, thereby making the policy more robust. Or, we can train policies that will work with a variety of models by sampling different parameters in the simulations—sometimes referred to as domain randomization. An example is in Figure 26.26, where a dexterous manipulation task is trained in simulation by varying visual attributes, as well as physical attributes like friction or damping.

Finally, hybrid approaches that borrow ideas from both model-based and model-free al-

gorithms are meant to give us the best of both. The hybrid approach originated with the Dyna

architecture, where the idea was to iterate between acting and improving the policy, but the

policy improvement would come in two complementary ways:

1) the standard model-free

way of using the experience to directly update the policy, and 2) the model-based way of using the experience to fit a model, then plan with it to generate a policy.

More recent techniques have experimented with fitting local models, planning with them

to generate actions, and using these actions as supervision to fit a policy, then iterating to get better and better models around the areas that the policy needs.

This has been successfully

applied in end-to-end learning, where the policy takes pixels as input and directly outputs

torques as actions—it enabled the first demonstration of deep RL on physical robots. Models can also be exploited for the purpose of ensuring safe exploration.

Learning

slowly but safely may be better than learning quickly but crashing and burning half way through. So arguably, more important than reducing real-world samples is reducing realworld samples in dangerous states—we

don’t want robots falling off cliffs, and we don’t

want them breaking our favorite mugs or, even worse, colliding with objects and people. An approximate model, with uncertainty associated to it (for example by considering a range of

values for its parameters), can guide exploration and impose constraints on the actions that the robot is allowed to take in order to avoid these dangerous states. This is an active area of research in robotics and control.

26.7.2

Motion primitive

Exploiting other information

Models are useful, but there is more we can do to further reduce sample complexity. When setting up a reinforcement learning problem, we have to select the state and action spaces, the representation of the policy or value function, and the reward function we’re using. These decisions have a large impact on how easy or how hard we are making the problem. One approach is to use higher-level motion primitives instead of low-level actions like torque commands. A motion primitive is a parameterized skill that the robot has. For exam-

ple, a robotic soccer player might have the skill of “pass the ball to the player at (x,y).” All the policy needs to do is to figure out how to combine them and set their parameters, instead

of reinventing them. This approach often learns much faster than low-level approaches, but does restrict the space of possible behaviors that the robot can learn.

Another way to reduce the number of real-world samples required for learning is to reuse information from previous learning episodes on other tasks, rather than starting from scratch.

This falls under the umbrella of metalearning or transfer learning.

Finally, people are a great source of information. In the next section, we talk about how

to interact with people, and part of it is how to use their actions to guide the robot’s learning.

Section 26.8

26.8

Humans and Robots

961

Humans and Robots

Thus far, we’ve focused on a robot planning and learning how to act in isolation. This is useful for some robots, like the rovers we send out to explore distant planets on our behalf.

But, for the most part, we do not build robots to work in isolation. We build them to help us, and to work in human environments, around and with us.

This raises two complementary challenges. First is optimizing reward when there are

people acting in the same environment as the robot. We call this the coordination problem

(see Section 18.1). When the robot’s reward depends on not just its own actions, but also the

actions that people take, the robot has to choose its actions in a way that meshes well with

theirs. When the human and the robot are on the same team, this turns into collaboration.

Second is the challenge of optimizing for what people actually want. If a robot is to

help people, its reward function needs to incentivize the actions that people want the robot to

execute. Figuring out the right reward function (or policy) for the robot is itself an interaction problem. We will explore these two challenges in turn. 26.8.1

Coordination

Let’s assume for now, as we have been, that the robot has access to a clearly defined reward

function. But, instead of needing to optimize it in isolation, now the robot needs to optimize

it around a human who is also acting. For example, as an autonomous car merges on the

highway, it needs to negotiate the maneuver with the human driver coming in the target lane—

should it accelerate and merge in front, or slow down and merge behind? Later, as it pulls to

astop sign, preparing to take a right, it has to watch out for the cyclist in the bicycle lane, and for the pedestrian about to step onto the crosswalk.

Or, consider a mobile robot in a hallway.

Someone heading straight toward the robot

steps slightly to the right, indicating which side of the robot they want to pass on. The robot has to respond, clarifying its intentions. Humans

as approximately

rational agents

One way to formulate coordination with a human is to model it as a game between the robot

and the human (Section 18.2). With this approach, we explicitly make the assumption that

people are agents incentivized by objectives. This does not automatically mean that they are

perfectly rational agents (i.e., find optimal solutions in the game), but it does mean that the

robot can structure the way it reasons about the human via the notion of possible objectives that the human might have. In this game: « the state of the environment captures the configurations of both the robot and human agents; call itx = (xg,xg); + each agent can take actions, ug and uy respectively;

« cach agent has an objective that can be represented as a cost, Jg and Jy: each agent

wants to get to its goal safely and efficiently; + and, as in any game, each objective depends on the state and on the actions of both

agents: Jg(x, g, yr) and Jy (x, 1y, ug). Think of the car-pedestrian interaction—the car should stop if the pedestrian crosses, and should go forward if the pedestrian waits.

Three important aspects complicate this game. First is that the human and the robot don’t

necessarily know each other’s objectives. This makes it an incomplete information game.

Incomplete information game

962

Chapter 26 Robotics Second is that the state and action spaces are continuous, as they’ve been throughout this

chapter. We learned in Chapter 5 how to do tree search to tackle discrete games, but how do we tackle continuous spaces? Third, even though at the high level the game model makes sense—humans do move, and they do have objectives—a human’s behavior might not always be well-characterized as a solution to the game. The game comes with a computational challenge not only for the robot, but for us humans too. It requires thinking about what the robot will do in response to what the person does, which depends on what the robot thinks the person will do, and pretty soon we get to “what do you think I think you think I think”— it’s turtles all the way down! Humans can’t deal with all of that, and exhibit certain suboptimalities.

This means that the

robot should account for these suboptimalities.

So, then, what is an autonomous car to do when the coordination problem is this hard?

We will do something similar to what we’ve done before in this chapter. For motion planning and control, we took an MDP and broke it up into planning a trajectory and then tracking it with a controller. Here too, we will take the game, and break it up into making predictions about human actions, and deciding what the robot should do given these predictions.

Predicting human action Predicting human actions is hard because they depend on the robot’s actions and vice versa. One trick that robots use is to pretend the person is ignoring the robot. The robot assumes people are noisily optimal with respect to their objective, which is unknown to the robot and

is modeled as no longer dependent on the robot’s actions: Jj (x,

). In particular, the higher

the value of an action for the objective (the lower the cost to go), the more likely the human

is to take it. The robot can create a model for P(uy | x,Jy), for instance using the softmax function from page 811:

Plug | %, Jig) o< &= Q)

(268)

with Q(x, ups3Jy) the Q-value function corresponding to Ji; (the negative sign is there because in robotics we like to minimize cost, not maximize reward). Note that the robot does not

assume perfectly optimal actions, nor does it assume that the actions are chosen based on

reasoning about the robot at all.

Armed with this model, the robot uses the human’s ongoing actions as evidence about Jp;.

If we have an observation model for how human actions depend on the human’s objective,

each human action can be incorporated to update the robot’s belief over what objective the person has

b (J) o< b(Ju)Purg | %) -

(26.9)

An example is in Figure 26.27: the robot is tracking a human’s location and as the human moves, the robot updates its belief over human goals. As the human heads toward the

windows, the robot increases the probability that the goal is to look out the window, and

decreases the probability that the goal is going to the kitchen, which is in the other direction.

This is how the human’s past actions end up informing the robot about what the human will do in the future. Having a belief about the human’s goal helps the robot anticipate what next actions the human will take. The heatmap in the figure shows the robot’s future

predictions: red is most probable; blue least probable.

Section 26.8

Humans and Robots

(b)

(©)

Figure 26.27 Making predictions by assuming that people are noisily rational given their goal: the robot uses the past actions to update a belief over what goal the person is heading 10, and then uses the belief to make predictions about future actions. (a) The map ofa room. (b) Predictions after seeing a small part of the person’s trajectory (white path):; (¢) Predictions after seeing more human actions: the robot now knows that the person is not heading to the hallway on the left, because the path taken so far would be a poor path if that were the person’s goal. Images courtesy of Brian D. Ziebart. See Ziebart et al. (2009). The same can happen in driving. We might not know how much another driver values

efficiency, but if we see them accelerate as someone is trying to merge in front of them, we

now know a bit more about them. And once we know that, we can better anticipate what they will do in the future—the same driver is likely to come closer behind us, or weave through traffic to get ahead. Once the robot can make predictions about human future actions, it has reduced its prob-

lem to solving an MDP. The human actions complicate the transition function, but as long as

the robot can anticipate what action the person will take from any future state, the robot can

calculate P(¥' | x,ug): it can compute P(up | x) from P(ug | x,Ju) by marginalizing over Ji,

and combine it with P(x’ | x,ug,up), the transition (dynamics) function for how the world updates based on both the robot’s and the human’s actions. In Section 26.5 we focused on how to solve this in continuous state and action spaces for deterministic dynamics, and in

Section 26.6 we discussed doing it with stochastic dynamics and uncertainty.

Splitting prediction from action makes it easier for the robot to handle interaction, but

sacrifices performance much from control.

plitting estimation from motion did, or splitting planning

A robot with this split no longer understands that its actions can influence what people end up doing. In contrast, the robot in Figure 26.27 anticipates where people will go and then optimizes for reaching its own goal and avoiding collisions with them. In Figure 26.28, we have an autonomous car merging on the highway. If it just planned in reaction to other cars, it might have to wait a long time while other cars occupy its target lane. In contrast, a car that reasons about prediction and action jointly knows that different actions it could take will

result in different reactions from the human. If it starts to assert itself, the other cars are likely to slow down a bit and make room. Roboticists are working towards coordinated interactions like this so robots can work better with humans.

963

Chapter 26 Robotics

(b)

Figure 26.28 (a) Left: An autonomous car (middle lane) predicts that the human driver (left lane) wants to keep going forward, and plans a trajectory that slows down and merges behind. Right: The car accounts for the influence its actions can have on human actions, and realizes

it can merge in front and rely on the human driver to slow down. (b) That same algorithm produces an unusual strategy at an intersection: the car realizes that it can make it more

likely for the person (bottom) to proceed faster through the intersection by starting to inch

backwards. Images courtesy of Anca Dragan. See Sadigh et al. (2016). Human

predi

ions about the robot

Incomplete information is often two-sided:

the robot does not know the human’s objective

and the human, in turn, does not know the robot’s objective—people need to be making predictions about robots. As robot designers, we are not in charge of how the human makes

predictions; we can only control what the robot does. However, the robot can act in a way to make it easier for the human to make correct predictions.

The robot can assume that

the human is using something roughly analogous to Equation (26.8) to estimate the robot’s objective Jg, and thus the robot will act so that its true objective can be easily inferred. A special case of the game is when the human

and the robot are on the same team,

working toward the same goal or objective: Jy = Jg. Imagine getting a personal home robot

Joint agent

that is helping you make dinner or clean up—these are examples of collaboration.

We can now define a joint agent whose actions are tuples of human-robot actions,

(upr, ug) and who optimizes for Jy (x,up,ug) = Jg(x,ug,u), and we're solving a regular planning problem. We compute the optimal plan or policy for the joint agent, and voila, we now know what the robot and human should do.

This would work really well if people were perfectly optimal. The robot would do its part

of the joint plan, the human theirs. Unfortunately, in practice, people don’t seem to follow the perfectly laid out joint-agent plan; they have a mind of their own! We've already learned one way to handle this though, back in Section 26.6. We called it model predictive control

(MPC): the idea was to come up with a plan, execute the first action, and then replan. That

way, the robot always adapts its plan to what the human is actually doing.

Let’s work through an example. Suppose you and the robot are in your kitchen, and have

decided to make waffles. You are slightly closer to the fridge, so the optimal joint plan would

Section 26.8

Humans and Robots

have you grab the eggs and milk from the fridge, while the robot fetches the flour from the cabinet. The robot knows this because it can measure quite precisely where everyone is. But

suppose you start heading for the flour cabinet. You are going against the optimal joint plan. Rather than sticking to it and stubbornly also going for the flour, the MPC robot recalculates

the optimal plan, and now that you are close enough to the flour it is best for the robot to grab the waffle iron instead.

If we know that people might deviate from optimality, we can account for it ahead of time.

In our example, the robot can try to anticipate that you are going for the flour the moment you take your first step (say, using the prediction technique above). Even if it is still technically

optimal for you to turn around and head for the fridge, the robot should not assume that’s

what is going to happen. Instead, the robot can compute a plan in which you keep doing what you seem to want.

Humans as black box agents ‘We don’t have to treat people as objective-driven, intentional agents to get robots to coordinate with us. An alternative model is that the human is merely some agent whose policy 7y “messes” with the environment dynamics.

The robot does not know 7y, but can model the

problem as needing to act in an MDP with unknown dynamics. We have seen this before: for general agents in Chapter 22, and for robots in particular in Section 26.7. The robot can fit a policy model 7y to human data, and use it to compute an optimal policy for itself. Due to scarcity of data, this has been mostly used so far at the task level. For

instance, robots have learned through interaction what actions people tend to take (in response

to its own actions) for the task of placing and drilling screws in an industrial assembly task. Then there is also the model-free reinforcement learning alternative:

the robot can start

with some initial policy or value function, and keep improving it over time via trial and error.

26.8.2

Learning to do what humans want

Another way interaction with humans comes into robotics is in Jg itself—the robot’s cost or reward function. The framework of rational agents and the associated algorithms reduce the

problem of generating good behavior to specifying a good reward function.

as for many other Al agents, getting the cost right is still difficult.

But for robots,

Take autonomous cars: we want them to reach the destination, to be safe, to drive com-

fortably for their passengers, to obey traffic laws, etc. A designer of such a system needs to trade off these different components of the cost function. The designer’s task is hard because robots are built to help end users, and not every end user is the same. We all have different

preferences for how aggressively we want our car to drive, etc.

Below, we explore two alternatives for trying to get robot behavior to match what we

actually want the robot to do.

The first is to learn a cost function from human input.

The

second is to bypass the cost function and imitate human demonstrations of the task.

Preference learning: Learning cost functions Imagine that an end user is showing a robot how to do a task. For instance, they are driving

the car in the way they would like it to be driven by the robot. Can you think of a way for the robot to use these actions—we call them “demonstrations”—to figure out what cost function it should optimize?

965

966

Chapter 26 Robotics

Figure 26.29 Left: A mobile robot is shown a demonstration that stays on the dirt road. Middle: The robot infers the desired cost function, and uses it in a new scene, knowing to

put lower cost on the road there. Right: The robot plans a path for the new scene that also

stays on the road, reproducing the preferences behind the demonstration. Images courtesy of Nathan Ratliff and James A. Bagnell. See Ratliff er al. (2006).

We have actually already seen the answer to this back in Section 26.8.1. There, the setup

was a little different: we had another person taking actions in the same space as the robot, and the robot needed to predict what the person would do. But one technique we went over for making these predictions was to assume that people act to noisily optimize some cost function

Jy, and we can use their ongoing actions as evidence about what cost function that is. We

can do the same here, except not for the purpose of predicting human behavior in the future, but rather acquiring the cost function the robot itself should optimize.

If the person drives

defensively, the cost function that will explain their actions will put a lot of weight on safety

and less so on efficiency. The robot can adopt this cost function as its own and optimize it when driving the car itself.

Roboticists have experimented with different algorithms for making this cost inference

computationally tractable. In Figure 26.29, we see an example of teaching a robot to prefer staying on the road to going over the grassy

terrain. Traditionally in such methods, the cost

function has been represented as a combination of hand-crafted features, but recent work has also studied how to represent it using a deep neural network, without feature engineering.

There are other ways for a person to provide input. A person could use language rather

than demonstration to instruct the robot.

A person could act as a critic, watching the robot

perform a task one way (or two ways) and then saying how well the task was done (or which way was better), or giving advice on how to improve.

Learning policies directly via imitation An alternative is to bypass cost functions and learn the desired robot policy directly. In our car example, the human’s demonstrations make for a convenient data set of states labeled by

the action the robot should take at each state: D = {(x;,u;)}. The robot can run supervised Behavioral cloning Generalization

learning to fit a policy

behavioral cloning.

: x — u, and execute that policy. This is called imitation learning or

A challenge with this approach is in generalization to new states. The robot does not

know why the actions in its database have been marked as optimal. It has no causal rule; all

it can do is run a supervised learning algorithm to try to learn a policy that will generalize to unknown states. However, there is no guarantee that the generalization will be correct.

Section 26.8

Humans and Robots

Figure 26.30 A human teacher pushes the robot down to teach it to stay closer to the table. The robot appropriately updates its understanding of the desired cost function and starts optimizing it. Courtesy of Anca Dragan. See Bajcsy f al. (2017).

Figure 26.31 A programming interface that involves placing specially designed blocks in the robot’s workspace to select objects and specify high-level actions. Images courtesy of Maya Cakmak. See Sefidgar ef al. (2017). The ALVINN autonomous car project used this approach, and found that even when

starting from a

state in D, 7 will make small errors, which will take the car off the demon-

strated trajectory. There, 7 will make a larger error, which will take the car even further off the desired course.

‘We can address this at training time if we interleave collecting labels and learning: start

with a demonstration, learn a policy, then roll out that policy and ask the human for what action to take at every state along the way, then repeat. The robot then learns how to correct its mistakes as it deviates from the human’s desired actions.

Alternatively, we can address it by leveraging reinforcement learning. The robot can fit a

dynamics model based on the demonstrations, and then use optimal control (Section 26.5.4)

to generate a policy that optimizes for staying close to the demonstration.

A version of this

has been used to perform very challenging maneuvers at an expert level in a small radiocontrolled helicopter (see Figure 22.9(b)). The DAGGER (Data Aggregation) system starts with a human expert demonstration. From that it learns a policy, 7 and uses the policy to generate a data set D. Then from

D it generates a new policy 75 that best imitates the original human data. This repeats, and

967

968

Chapter 26 Robotics on the nth iteration it uses 7, to generate more data, to be added to D, which is then used to create 7, . In other words, at each iteration the system gathers new data under the current

policy and trains the next policy using all the data gathered so far. Related recent techniques use adversarial training:

they alternate between training a

classifier to distinguish between the robot’s learned policy and the human’s demonstrations,

and training a new robot policy via reinforcement learning to fool the classifier. These ad-

vances enable the robot to handle states that are near demonstrations, but generalization to

far-off states or to new dynamics is a work in progress.

Teaching interfaces and the correspondence problem.

So far, we have imagined the

case of an autonomous car or an autonomous helicopter, for which human demonstrations use

the same actions that the robot can take itself: accelerating, braking, and steering. But what

happens if we do this for tasks like cleaning up the Kitchen table? We have two choices here: either the person demonstrates using their own body while the robot watches, or the person physically guides the robot’s effectors. Correspondence problem

The first approach is appealing because it comes naturally to end users. Unfortunately,

it suffers from the correspondence problem: how to map human actions onto robot actions.

People have different kinematics and dynamics than robots. Not only does that make it difficult to translate or retarget human motion onto robot motion (e.g., retargeting a five-finger

human grasp to a two-finger robot grasp), but often the high-level strategy a person might use is not appropriate for the robot.

Kinesthetic teaching Keyframe

Visual programming

The second approach, where the human teacher moves the robot’s effectors into the right positions, is called kinesthetic teaching. It is not easy for humans to teach this way, espe-

cially to teach robots with multiple joints. The teacher needs to coordinate all the degrees of

freedom as it is guiding the arm through the task. Researchers have thus investigated alter-

natives, like demonstrating keyframes as opposed to continuous trajectories, as well as the use of visual programming to enable end users to program primitives for a task rather than

demonstrate from scratch (Figure 26.31). Sometimes both approaches are combined.

26.9 Deliberative Reactive

Alternative Robo

Frameworks

Thus far, we have taken a view of robotics based on the notion of defining or learning a reward function, and having the robot optimize that reward function (be it via planning or learning), sometimes in coordination or collaboration with humans.

view of robotics, to be contrasted with a reactive view.

26.9.1

This is a deliberative

Reactive controllers

In some cases, it is easier to set up a good policy for a robot than to model the world and plan.

Then, instead of a rational agent, we have a reflex agent.

For example, picture a legged robot that attempts to lift a leg over an obstacle. We could

give this robot a rule that says lift the leg a small height 7 and move it forward, and if the leg

encounters an obstacle, move it back and start again at a higher height. You could say that is modeling an aspect of the world, but we can also think of& as an auxiliary variable of the robot controller, devoid of direct physical meaning. One such example is the six-legged (hexapod) robot, shown in Figure 26.32(a), designed for walking through rough terrain. The robot’s sensors are inadequate to obtain accurate

Section 26.9

969

Alternative Robotic Frameworks

retract, lift higher

liftup

setdown push backward (®)

Figure 26.32 (a) Genghis, a hexapod robot. (Image courtesy of Rodney A. Brooks.) (b) An

augmented finite state machine (AFSM) that controls one leg. The AFSM

reacts to sensor

feedback: if a leg is stuck during the forward swinging phase, it will be lifted increasingly higher.

‘models of the terrain for path planning. Moreover, even if we added high-precision cameras and rangefinders, the 12 degrees of freedom (two for each leg) would render the resulting path planning problem computationally difficult. It is possible, nonetheless, to specify a controller directly without an explicit environ-

mental model.

(We have already seen this with the PD controller, which was able to keep a

complex robot arm on target without an explicit model of the robot dynamics.)

For the hexapod robot we first choose a gait, or pattern of movement of the limbs. One

statically stable gait is to first move the right front, right rear, and left center legs forward (keeping the other three fixed), and then move the other three.

flat terrain.

Gait

This gait works well on

On rugged terrain, obstacles may prevent a leg from swinging forward.

This

problem can be overcome by a remarkably simple control rule: when a leg’s forward motion is blocked, simply retract it, lift it higher; and try again. The resulting controller is shown in Figure 26.32(b) as a simple finite state machine; it constitutes a reflex agent with state, where the internal state is represented by the index of the current machine state (s; through s4).

26.9.2

Subsumption

architectures

The subsumption architecture (Brooks, 1986) is a framework for assembling reactive controllers out of finite state machines.

Nodes in these machines may contain tests for certain

Subsumption architecture

sensor variables, in which case the execution trace of a finite state machine is conditioned

on the outcome of such a test. Arcs can be tagged with messages that will be generated when traversing them, and that are sent to the robot’s motors or to other finite state machines.

Additionally, finite state machines possess internal timers (clocks) that control the time it

takes to traverse an arc. The resulting machines are called augmented finite state machines

(AFSMs), where the augmentation refers to the use of clocks.

An example of a simple AFSM is the four-state machine we just talked about, shown in

Figure 26.32(b).

This AFSM

implements a cyclic controller, whose execution mostly does

not rely on environmental feedback. The forward swing phase, however, does rely on sensor feedback.

If the leg is stuck, meaning that it has failed to execute the forward swing, the

Augmented finite

state machine {AFSM)

970

Chapter 26 Robotics robot retracts the leg, lifts it up a little higher, and attempts to execute the forward swing once

again. Thus, the controller is able to react to contingencies arising from the interplay of the robot and its environment.

The subsumption architecture offers additional primitives for synchronizing AFSMs, and for combining output values of multiple, possibly conflicting AFSMs. In this way, it enables the programmer to compose increasingly complex controllers in a bottom-up fashion. In our example, we might begin with AFSMs for individual legs, followed by an AFSM for coordinating multiple legs. On top of this, we might implement higher-level behaviors such as collision avoidance, which might involve backing up and turning. The idea of composing robot controllers from AFSMs is quite intriguing. Imagine how difficult it would be to generate the same behavior with any of the configuration-space path-

planning algorithms described in the previous section. First, we would need an accurate model of the terrain. The configuration space of a robot with six legs, each of which is driven

by two independent motors, totals 18 dimensions (12 dimensions for the configuration of the

legs, and six for the location and orientation of the robot relative to its environment).

Even

if our computers were fast enough to find paths in such high-dimensional spaces, we would

have to worry about nasty effects such as the robot sliding down a slope.

Because of such stochastic effects, a single path through configuration space would al-

most certainly be too brittle, and even a PID controller might not be able to cope with such

contingencies. In other words, generating motion behavior deliberately is simply too complex a problem in some cases for present-day robot motion planning algorithms. Unfortunately, the subsumption architecture has its own problems. First, the AFSMs

are driven by raw sensor input, an arrangement that works if the sensor data is reliable and contains all necessary

information for decision making,

but fails if sensor data has to be

integrated in nontrivial ways over time. Subsumption-style controllers have therefore mostly

been applied to simple tasks, such as following a wall or moving toward visible light sources. Second, the lack of deliberation makes it difficult to change the robot’s goals.

A robot

with a subsumption architecture usually does just one task, and it has no notion of how to

modify its controls to accommodate different goals (just like the dung beetle on page 41). Third, in many real-world problems, the policy we want is often too complex to encode explicitly. Think about the example from Figure 26.28, of an autonomous car needing to negotiate a lane change with a human driver.

goes into the target lane.

We might start off with a simple policy that

But when we test the car, we find out that not every driver in the

target lane will slow down to let the car in. We might then add a bit more complexity: make

the car nudge towards the target lane, wait for a response form the driver in that lane, and

then either proceed or retreat back. But then we test the car, and realize that the nudging needs to happen at a different speed depending on the speed of the vehicle in the target lane, on whether there is another vehicle in front in the target lane, on whether there is a vehicle

behind the car in the initial, and so on. The number of conditions that we need to consider

to determine the right course of action can be very large, even for such a deceptively simple

maneuver. This in turn presents scalability challenges for subsumption-style architectures. All that said, robotics is a complex problem with many approaches: deliberative, reactive, or a mixture thereof; based on physics, cognitive models, data, or a mixture thereof. The right approach s still a subject for debate, scientific inquiry, and engincering prowess.

Section 26.10

Application Domains

971

Figure 2633 (a) A patient with a brain-machine interface controlling a robot arm to grab a drink. Tmage courtesy of Brown University. (b) Roomba, the robot vacuum cleaner. Photo by HANDOUT/KRT/Newscom. 26.10

Application Domains

Robotic technology is already permeating our world, and has the potential to improve our independence, health, and productivity. Here are some example applications. Home care:

Robots have started to enter the home to care for older adults and people

with motor impairments, assisting them with activities of daily living and enabling them to live more independently.

These include wheelchairs and wheelchair-mounted arms like the

Kinova arm from Figure 26.1(b). Even though they start off as being operated by a human di-

rectly, these robots are gaining more and more autonomy. On the horizon are robots operated

by brain-machine interfaces, which have been shown to enable people with quadriplegia to use a robot arm to grasp objects and even feed themselves (Figure 26.33(a)). Related to these are prosthetic limbs that intelligently respond to our actions, and exoskeletons that give us superhuman strength or enable people who can’t control their muscles from the waist down

to walk again.

Personal robots are meant to assist us with daily tasks

like cleaning and organizing, free-

ing up our time. Although manipulation still has a way to go before it can operate seamlessly

in messy, unstructured human environments, navigation has made some headway. In particu-

lar, many homes already enjoy a mobile robot vacuum cleaner like the one in Figure 26.33(b). Health care:

Robots assist and augment surgeons, enabling more precise, minimally

invasive, safer procedures with better patient outcomes. The Da Vinci surgical robot from Figure 26.34(a) is now widely deployed at hospitals in the U.S. Services: Mobile robots help out in office buildings, hotels, and hospitals.

Savioke has

put robots in hotels delivering products like towels or toothpaste to your room. The Helpmate and TUG

robots carry food and medicine in hospitals (Figure 26.34(b)), while Dili-

gent Robotics’ Moxi robot helps out nurses with back-end logistical responsibilities. Co-Bot

roams the halls of Carnegie Mellon University, ready to guide you to someone’s office. We

can also use telepresence robots like the Beam to attend meetings and conferences remotely,

or check in on our grandparents.

Telepresence robots

972

Chapter 26 Robotics

(b)

Figure 26.34 (a) Surgical robot in the operating room. Photo by Patrick Landmann/Science Source. (b) Hospital delivery robot. Photo by Wired.

Figure 26.35 (a) Autonomous car BOsS which won the DARPA Urban Challenge. Photo by Tangi Quemener/AFP/Getty Images/Newscom. Courtesy of Sebastian Thrun. (b) Aerial view showing the perception and predictions of the Waymo autonomous car (white vehicle with green track). Other vehicles (blue boxes) and pedestrians (orange boxes) are shown with anticipated trajectories. Road/sidewalk boundaries are in yellow. Photo courtesy of Waymo. Autonomous cars: Some of us are occasionally distracted while driving, by cell phone

calls, texts, or other distractions. The sad result: more than a million people die every year in traffic accidents. Further, many of us spend a lot of time driving and would like to recapture

some of that time. All this has led to a massive ongoing effort to deploy autonomous cars. Prototypes have existed since the 1980s, but progress was

stimulated by the 2005 DARPA

Grand Challenge, an autonomous vehicle race over 200 challenging kilometers of unrehearsed desert terrain.

Stanford’s Stanley vehicle completed the course in less than seven

hours, winning a $2 million prize and a place in the National Museum of American History.

Section 26.10

Application Domains

973

(b)

Figure 26.36 (a) A robot mapping an abandoned coal mine. (b) A 3D map of the mine acquired by the robot. Courtesy of Sebastian Thrun.

Figure 26.35(a) depicts BOss, which in 2007 won the DARPA

Urban Challenge, a compli-

cated road race on city streets where robots faced other robots and had to obey traffic rules.

In 2009, Google started an autonomous driving project (featuring many of the researchers who had worked on Stanley and BOSs), which has now spun off as Waymo. In 2018 Waymo started driverless testing (with nobody in the driver seat) in the suburbs of Pheonix,

zona.

Ari-

In the meantime, other autonomous driving companies and ride-sharing companies

are working on developing their own technology, while car manufacturers have been selling cars with more and more assistive intelligence, such as Tesla’s driver assist, which is meant

for highway driving. Other companies are targeting non-highway driving applications including college campuses and retirement communities. Still other companies are focused on non-passenger applications such as trucking, grocery delivery, and valet parking. Entertainment:

Disney has been using robots (under the name animatronics) in their

Driver assist

Animatronics

parks since 1963. Originally, these robots were restricted to hand-designed, open-loop, unvarying motion (and speech), but since 2009 a version called autonomatronics can generate Autonomatronics autonomous actions. Robots

also take the form of intelligent toys for children; for example,

Anki’s Cozmo plays games with children and may pound the table with frustration when it loses. Finally, quadrotors like Skydio’s R1 from Figure 26.2(b) act as personal photographers and videographers, following us around to take action shots as we ski or bike. Exploration and hazardous environments: Robots have gone where no human has gone before, including the surface of Mars. Robotic arms assist astronauts in deploying and retrieving satellites and in building the International Space Station. Robots also help explore

under the sea. They are routinely used to acquire maps of sunken ships. Figure 26.36 shows a robot mapping an abandoned coal mine, along with a 3D model of the mine acquired using range sensors. In 1996, a team of researches released a legged robot into the crater of an active volcano to acquire data for climate research. Robots are becoming very effective tools

for gathering information in domains that are difficult (or dangerous) for people to access. Robots have assisted people in cleaning up nuclear waste, most notably in Three Mile Island, Chernobyl, and Fukushima. Robots were present after the collapse of the World Trade

974

Chapter 26 Robotics Center, where they entered structures deemed too dangerous for human

search and rescue

crews. Here too, these robots are initially deployed via teleoperation, and as technology advances they are becoming more and more autonomous, with a human operator in charge but not having to specify every single command.

Industry: The majority of robots today are deployed in factories, automating tasks that

are difficult, dangerous, or dull for humans. (The majority of factory robots are in automobile

factories.) Automating these tasks is a positive in terms of efficiently producing what society needs. At the same time, it also means displacing some human workers from their jobs. This

has important policy and economics implications—the need for retraining and education, the need for a fair division of resources, etc. These topics are discussed further in Section 27.3.5.

Summary

Robotics is about physically embodied agents, which can change the state of the physical world. In this chapter, we have leaned the following: « The most common types of robots arc manipulators (robot arms) and mobile robots. They have sensors for perceiving the world and actuators that produce motion, which then affects the world via effectors. « The general robotics problem involves stochasticity (which can be handled by MDPs), partial observability (which can be handled by POMDPs), and acting with and around other agents (which can be handled with game theory). The problem is made even

harder by the fact that most robots work in continuous and high-dimensional state and

action spaces. They also operate in the real world, which refuses to run faster than real time and in which failures lead to real things being damaged, with no “undo” capability. Ideally, the robot would solve the entire problem in one go: observations in the form

of raw sensor feeds go in, and actions in the form of torques or currents to the motors

come out. In practice though, this is too daunting, and roboticists typically decouple different aspects of the problem and treat them independently.

* We typically separate perception (estimation) from action (motion generation). Percep-

tion in robotics involves computer vision to recognize the surroundings through cam-

eras, but also localization and mapping.

« Robotic perception concerns itself with estimating decision-relevant quantities from sensor data. To do so, we need an internal representation and a method for updating this internal representation over time.

+ Probabilistic filtering algorithms such as particle filters and Kalman filters are useful

for robot perception. These techniques maintain the belief state, a posterior distribution over state variables.

« For generating motion, we use configuration spaces, where a point specifies everything

we need to know to locate every body point on the robot. For instance, for a robot arm with two joints, a configuration consists of the two joint angles.

« We typically decouple the motion generation problem into motion planning, concerned

with producing a plan, and trajectory tracking control, concerned with producing a

policy for control inputs (actuator commands) that results in executing the plan.

Bibliographical and Historical Notes

« Motion planning can be solved via graph scarch using cell decomposition; using randomized motion planning algorithms, which sample milestones in the continuous configuration space; or using trajectory optimization, which can iteratively push a straight-line path out of collision by leveraging a signed distance field. + A path found by a search algorithm can be executed using the path as the reference

trajectory for a PID controller, which constantly corrects for errors between where the robot is and where it is supposed o be, or via computed torque control, which adds a

feedforward term that makes use of inverse dynamics to compute roughly what torque to send to make progress along the trajectory.

« Optimal control unites motion planning and trajectory tracking by computing an optimal trajectory directly over control inputs.

This

is especially easy when we have

quadratic costs and linear dynamics, resulting in a linear quadratic regulator (LQR). Popular methods make use of this by linearizing the dynamics and computing secondorder approximations of the cost (ILQR). « Planning under uncertainty unites perception and action by online replanning (such as model predictive control) and information gathering actions that aid perception.

+ Reinforcement learning is applied in robotics, with techniques striving to reduce the

required number of interactions with the real world. Such techniques tend to exploit models, be it estimating models and using them to plan, or training policies that are robust with respect to different possible model parameters.

« Interaction with humans

requires the ability to coordinate the robot’s actions with

theirs, which can be formulated as a game. We usually decompose the solution into

prediction, in which we use the person’s ongoing actions to estimate what they will do in the future, and action, in which we use the predictions to compute the optimal motion for the robot. + Helping humans also requires the ability to learn or infer what they want. Robots can

approach this by learning the desired cost function they should optimize from human

input, such as demonstrations, corrections, or instruction in natural language. Alterna-

tively, robots can imitate human behavior, and use reinforcement learning to help tackle

the challenge of generalization to new states. Bibliographical and

Historical Notes

The word robot was popularized by Czech playwright Karel Capek in his 1920 play R.U.R. (Rossum’s Universal Robots).

The robots, which were grown chemically rather than con-

structed mechanically, end up resenting their masters and decide to take over. It appears that

it was Capek’s brother, Josef, who first combined the Czech words “robota” (obligatory work)

and “robotnik™ (serf) to yield “robot™ in his 1917 short story Opilec (Glanc, 1978). The term

robotics was invented for a science fiction story (Asimov, 1950). The idea of an autonomous machine predates the word “robot” by thousands of years. In 7th century BCE Greek mythology, a robot named Talos was built by Hephaistos, the Greek god of metallurgy, to protect the island of Crete.

The legend is that the sorceress Medea

defeated Talos by promising him immortality but then draining his life fluid. Thus, this is the

975

976

Chapter 26 Robotics first example of a robot making a mistake in the process of changing its objective function. In 322 BCE, Aristotle anticipated technological unemployment, speculating “If every tool, when ordered, or even of its own accord, could do the work that befits it.. . then there would be no need either of apprentices for the master workers or of slaves for the lords.” In the 3rd century BCE an actual humanoid robot called the Servant of Philon could pour wine or water into a cup; a series of valves cut off the flow at the right time. Wonderful automata were built in the 18th century—Jacques Vaucanson’s mechanical duck from 1738

being one early example—but the complex behaviors they exhibited were entirely fixed in advance. Possibly the earliest example of a programmable robot-like device was the Jacquard loom (1805), described on page 15. Grey Walter’s “turtle,” built in 1948, could be considered the first autonomous mobile

robot, although its control system was not programmable. The “Hopkins Beast” built in 1960 at Johns Hopkins University, was much more sophisticated; it had sonar and photocell sensors, pattern-recognition hardware, and could recognize the cover plate of a standard AC power outlet. It was capable of searching for outlets, plugging itself in, and then recharging its batteries! Still, the Beast had a limited repertoire of skills.

The first general-purpose mobile robot was “Shakey,” developed at what was then the

Stanford Research Institute (now SRI) in the late 1960s (Fikes and Nilsson, 1971; Nilsson,

1984). Shakey was the first robot to integrate perception, planning, and execution, and much

subsequent research in Al was influenced by this remarkable achievement. Shakey appears

on the cover of this book with project leader Charlie Rosen (1917-2002). Other influential projects include the Stanford Cart and the CMU Rover (Moravec, 1983). Cox and Wilfong (1990) describe classic work on autonomous vehicles. The first commercial robot was an arm called UNIMATE,

for universal automation, de-

veloped by Joseph Engelberger and George Devol in their compnay, Unimation. In 1961, the first UNIMATE robot was sold to General Motors for use in manufacturing TV picture tubes. 1961 was also the year when Devol obtained the first U.S. patent on a robot.

In 1973, Toyota and Nissan started using an updated version of UNIMATE for auto body

spot welding. This initiated a major revolution in automobile manufacturing that took place mostly in Japan and the U.S., and that is still ongoing. Unimation followed up in 1978 with the development of the Puma robot (Programmable Universal Machine for Assembly), which

was the de facto standard for robotic manipulation for the two decades that followed. About

500,000 robots are sold each year, with half of those going to the automotive industry. In manipulation,

the first major effort at creating a hand-eye machine was Heinrich

Emst’s MH-1, described in his MIT Ph.D. thesis (Ernst, 1961).

The Machine Intelligence

project at Edinburgh also demonstrated an impressive early system for vision-based assembly called FREDDY (Michie, 1972).

Research on mobile robotics has been stimulated by several important competitions.

AAAT’s annual mobile robot competition began in 1992.

The first competition winner was

CARMEL (Congdon et al., 1992). Progress has been steady and impressive: in recent com-

petitions robots entered the conference complex, found their way to the registration desk, registered for the conference, and even gave a short talk. The RoboCup

competition,

launched in 1995 by Kitano and colleagues (1997), aims

to “develop a team of fully autonomous humanoid robots that can win against the human world champion team in soccer” by 2050.

Some competitions use wheeled robots, some

Bibliographical and Historical Notes

977

humanoid robots, and some software simulations. Stone (2016) describes recent innovations

in RoboCup.

The DARPA Grand Challenge, organized by DARPA in 2004 and 2005, required autonomous vehicles to travel more than 200 kilometers through the desert in less than ten hours (Buehler ez al., 2006). In the original event in 2004, no robot traveled more than eight miles, leading many to believe the prize would never be claimed. In 2005, Stanford’s robot

Stanley won the competition in just under seven hours (Thrun, 2006).

DARPA then orga-

nized the Urban Challenge, a competition in which robots had to navigate 60 miles in an

urban environment with other traffic.

Carnegie Mellon University’s robot BOSS took first

place and claimed the $2 million prize (Urmson and Whittaker, 2008). Early pioneers in the development of robotic cars included Dickmanns and Zapp (1987) and Pomerleau (1993).

The field of robotic mapping has evolved from two distinct origins. The first thread began with work by Smith and Cheeseman (1986), who applied Kalman filters to the simultaneous localization and mapping (SLAM) problem. This algorithm was first implemented by Moutarlier and Chatila (1989) and later extended by Leonard and Durrant-Whyte (1992); see Dissanayake et al. (2001) for an overview of early Kalman filter variations. The second thread

began with the development of the occupancy grid representation for probabilistic mapping,

which specifies the probability that each (x,y) location is occupied by an obstacle (Moravec

Occupancy grid

and Elfes, 1985).

Kuipers and Levitt (1988) were among the first to propose topological rather than metric mapping, motivated by models of human spatial cognition. A seminal paper by Lu and Milios (1997) recognized the sparseness of the simultaneous localization and mapping problem, which gave rise to the development of nonlinear optimization techniques by Konolige (2004)

and Montemerlo and Thrun (2004), as well as hierarchical methods by Bosse et al. (2004). Shatkay and Kaelbling (1997) and Thrun er al. (1998) introduced the EM algorithm into the

field of robotic mapping for data association. An overview of probabilistic mapping methods can be found in (Thrun et al., 2005). Early mobile robot localization techniques are surveyed by Borenstein ef al. (1996). Although Kalman

filtering was well known

as a localization method in control theory for

decades, the general probabilistic formulation of the localization problem did not appear in the Al literature until much later, through the work of Tom Dean and colleagues (Dean er al., 1990) and of Simmons and Koenig (1995). The latter work introduced the term Markov

localization. The first real-world application of this technique was by Burgard ef al. (1999), Markov localization through a series of robots that were deployed in museums.

Monte Carlo localization based

on particle filters was developed by Fox ef al. (1999) and is now widely used. The Rao-

Blackwellized particle filter combines particle filtering for robot localization with exact

filtering for map building (Murphy and Russell, 2001; Montemerlo et al., 2002).

Rao-Blackwellized particle filter

A great deal of early work on motion planning focused on geometric algorithms for de-

terministic and fully observable motion planning problems. The PSPACE-hardness of robot

motion planning was shown in a seminal paper by Reif (1979). The configuration space rep-

resentation is due to Lozano-Perez (1983). A series of papers by Schwartz and Sharir on what

they called piano movers problems (Schwartz et al., 1987) was highly influential.

Recursive cell decomposition for configuration space planning was originated in the work

of Brooks and Lozano-Perez (1985) and improved significantly by Zhu and Latombe (1991). The earliest skeletonization algorithms were based on Voronoi diagrams (Rowat, 1979) and

Piano movers

978

Chapter 26 Robotics

Visibility graph

visibility graphs (Wesley and Lozano-Perez, 1979). Guibas ez al. (1992) developed efficient

techniques for calculating Voronoi diagrams incrementally, and Choset (1996) generalized Voronoi diagrams to broader motion planning problems. John Canny (1988) established the first singly exponential algorithm for motion planning.

The seminal text by Latombe (1991) covers a variety of approaches to motion planning, as

do the texts by Choset et al. (2005) and LaValle (2006). Kavraki et al. (1996) developed the theory of probabilistic roadmaps. Kuffner and LaValle (2000) developed rapidly exploring random trees (RRTS). Involving optimization in geometric motion planning began with elastic bands (Quinlan

and Khatib, 1993), which refine paths when the configuration-space obstacles change. Ratliff et al. (2009) formulated the idea as the solution to an optimal control problem,

allowing

the initial trajectory to start in collision, and deforming it by mapping workspace obstacle gradients via the Jacobian into the configuration space. Schulman et al. (2013) proposed a practical second-order alternative. The control of robots as dynamical systems—whether for manipulation or navigation—

has generated a vast literature. While this chapter explained the basics of trajectory tracking

control and optimal control, it left out entire subfields, including adaptive control, robust

control, and Lyapunov analysis. Rather than assuming everything about the system is known

a priori, adaptive control aims to adapt the dynamics parameters and/or the control law online.

Robust control, on the other hand, aims to design controllers that perform well in spite of

uncertainty and external disturbances. Lyapunov analysis was originally developed in the 1890s for the stability analysis of

general nonlinear systems, but it was not until the early 1930s that control theorists realized its true potential.

With the development of optimization methods, Lyapunov

analysis was

extended to control barrier functions, which lend themselves nicely to modern optimization tools. These methods are widely used in modern robotics for real-time controller design and

safety analysis. Crucial works in robotic control include a trilogy on impedance control by Hogan (1985) and a general study of robot dynamics by Featherstone (1987). Dean and Wellman (1991)

were among the first to try to tie together control theory and Al planning systems. Three clas-

Haptic feedback

sic textbooks on the mathematics of robot manipulation are due to Paul (1981), Craig (1989), and Yoshikawa (1990). Control for manipulation is covered by Murray (2017). The area of grasping is also important in robotics—the problem of determining a stable grasp is quite difficult (Mason and Salisbury, 1985). Competent grasping requires touch sensing, or haptic feedback, to determine contact forces and detect slip (Fearing and Hollerbach, 1985). Understanding how to grasp the the wide variety of objects in the world is a daunting

task.

(Bousmalis et al., 2017) describe a system that combines real-world experimentation

with simulations guided by sim-to-real transfer to produce robust grasping.

Potential-field control, which attempts to solve the motion planning and control problems

Vector field histogram

simultaneously, was developed for robotics by Khatib (1986). In mobile robotics, this idea was viewed as a practical solution to the collision avoidance problem, and was later extended

into an algorithm called vector field histograms by Borenstein (1991). ILQR is currently widely used at the intersection of motion planning and control and is

due to Li and Todorov (2004). It is a variant of the much older differential dynamic program-

ming technique (Jacobson and Mayne, 1970).

Bibliographical and Historical Notes Fine-motion planning with limited sensing was investigated by Lozano-Perez et al. (1984)

and Canny and Reif (1987). Landmark-based navigation (Lazanas and Latombe, 1992) uses many of the same ideas in the mobile robot arena. Navigation functions, the robotics version of a control policy for deterministic MDPs, were introduced by Koditschek (1987). Key work

applying POMDP methods (Section 17.4) to motion planning under uncertainty in robotics is

due to Pineau et al. (2003) and Roy et al. (2005). Reinforcement learning in robotics took off with the seminal work by Bagnell and Schneider (2001) and Ng ef al. (2003), who developed the paradigm in the context of autonomous helicopter control.

Kober er al. (2013) offers an overview of how reinforcement

learning changes when applied to the robotics problem. Many of the techniques implemented on physical systems build approximate dynamics models, dating back to locally weighted linear models due to Atkeson ef al. (1997). But policy gradients played their role as well,

enabling (simplified) humanoid robots to walk (Tedrake er al., 2004), or a robot arm to hit a baseball (Peters and Schaal, 2008).

Levine ez al. (2016) demonstrated the first deep reinforcement learning application on a

real robot. At the same time, model-free RL in simulation was being extended to continuous domains (Schulman et al., 2015a; Heess et al., 2016; Lillicrap et al., 2015). Other work

scaled up physical data collection massively to showcase the learning of grasps and dynamics

models (Pinto and Gupta, 2016; Agrawal ez al., 2017; Levine et al., 2018). Transfer from simulation to reality or sim-to-real (Sadeghi and Levine, 2016; Andrychowicz et al., 2018a), metalearning (Finn er al., 2017), and sample-efficient model-free reinforcement learning

(Andrychowicz ef al., 2018b) are active areas of research.

Early methods for predicting human actions made use of filtering approaches (Madha-

van and Schlenoff, 2003), but seminal work by Ziebart ef al. (2009) proposed prediction by modeling people as approximately rational agents. Sadigh ef al. (2016) captured how these predictions should actually depend on what the robot decides to do, building toward a gametheoretic setting. For collaborative settings, Sisbot et al. (2007) pioneered the idea of account-

ing for what people want in the robot’s cost function. Nikolaidis and Shah (2013) decomposed

collaboration into learning how the human will act, but also learning how the human wants

the robot to act, both achievable from demonstrations. For learning from demonstration see

Argall et al. (2009). Akgun et al. (2012) and Sefidgar et al. (2017) studied teaching by end users rather than by experts. Tellex et al. (2011) showed how robots can infer what people want from natural language

instructions. Finally, not only do robots need to infer what people want and plan on doing, but

people too need to make the same inferences about robots. Dragan et al. (2013) incorporated amodel of the human’s inferences into robot motion planning.

The field of human-robot interaction is much broader than what we covered in this

chapter, which focused primarily on the planning and learning aspects. Thomaz ef al. (2016) provides a survey of interaction more broadly from a computational perspective. Ross et al. (2011) describe the DAGGER system. The topic of software architectures for robots engenders much religious debate. The

good old-fashioned Al candidate—the three-layer architecture—dates back to the design of Shakey and is reviewed by Gat (1998). The subsumption architecture is due to Brooks (1986),

although similar ideas were developed independently by Braitenberg, whose book, Vehicles (1984), describes a

series of simple robots based on the behavioral approach.

979

980

Chapter 26 Robotics The success of Brooks’s

six-legged walking robot was followed by many other projects.

Connell, in his Ph.D. thesis (1989), developed an entirely reactive mobile robot that was ca-

pable of retrieving objects. Extensions of the paradigm to multirobot systems can be found in work by Parker (1996) and Mataric (1997). GRL (Horswill, 2000) and COLBERT (Kono-

lige, 1997) abstract the ideas of concurrent behavior-based robotics into general robot control

languages. Arkin (1998) surveys some of the most popular approaches in this field. Two early textbooks, by Dudek and Jenkin (2000) and by Murphy (2000), cover robotics

generally. More recent overviews are due to Bekey (2008) and Lynch and Park (2017). Anex-

cellent book on robot manipulation addresses advanced topics such as compliant motion (Ma-

son, 2001). Robot motion planning is covered in Choset ef al. (2005) and LaValle (2006).

Thrun ez al. (2005) introduces probabilistic robotics. The Handbook of Robotics (Siciliano and Khatib, 2016) is a massive, comprehensive overview of all of robotics.

The premiere conference for robotics is Robotics: Science and Systems Conference, fol-

lowed by the IEEE International Conference on Robotics and Automation.

Human-Robot

Interaction is the premiere venue for interaction. Leading robotics journals include IEEE

Robotics and Automation, the International Journal of Robotics Research, and Robotics and

Autonomous Systems.

TR 97

PHILOSOPHY, ETHICS, AND SAFETY OF Al In which we consider the big questions around the meaning of Al how we can ethically develop and apply it, and how we can keep it safe.

Philosophers have been asking big questions for a long time: How do minds work? s it possible for machines to act intelligently in the way that people do? Would such machines have real, conscious minds? To these, we add new ones: What are the ethical implications of intelligent machines in day-to-day use? Should machines be allowed to decide to kill humans? Can algorithms be fair and unbiased? What will humans do if machines can do all kinds of work? And how do we control machines that may become more intelligent than us? 27.1

The Limits of Al

In 1980, philosopher John Searle introduced a distinction between weak Al—the idea that

machines could act as if they were intelligent—and strong AI—the assertion that machines

that do so are actually consciously thinking (not just simulating thinking). Over time the definition of strong Al shifted to refer to what is

also called “human-level AI” or “general

AI’—programs that can solve an arbitrarily wide variety of tasks, including novel ones, and do so as well as a human.

Critics of weak AT who objected to the very possibility of intelligent behavior in machines now appear as shortsighted as Simon Newcomb, who in October 1903 wrote “aerial flight is one of the great class of problems with which man can never cope”—just two months before the Wright brothers’ flight at Kitty Hawk.

The rapid progress of recent years does

not, however, prove that there can be no limits to what Al can achieve. Alan Turing (1950),

the first person to define Al was also the first to raise possible objections to Al foreseeing almost all the ones subsequently raised by others. 27.1.1

The argument from informality

Turing’s “argument from informality of behavior” says that human behavior is far too com-

plex to be captured by any formal set of rules—humans must be using some informal guide-

lines that (the argument claims) could never be captured in a formal set of rules and thus

could never be codified in a computer program. A key proponent of this view was Hubert Dreyfus, who produced a series of influential critiques of artificial intelligence:

What Computers

Can’t Do (1972), the sequel What

weak Al

Strong Al

982

Chapter 27 Philosophy, Ethics, and Safety of AT Computers Still Can’t Do (1992), and, with his brother Stuart, Mind Over Machine (1986).

Good Old-Fashioned AT'(GOFAI)

Similarly, philosopher Kenneth Sayre (1993) said “Artificial intelligence pursued within the cult of computationalism stands not even a ghost of a chance of producing durable results.” The technology they criticize came to be called Good Old-Fashioned AI (GOFAI).

GOFAI corresponds to the simplest logical agent design described in Chapter 7, and we saw there that it is indeed difficult to capture every contingency of appropriate behavior in a set of necessary and sufficient logical rules; we called that the qualification problem.

But

as we saw in Chapter 12, probabilistic reasoning systems are more appropriate for openended domains, and as we saw in Chapter 21, deep learning systems do well on a variety of “informal” tasks. Thus, the critique is not addressed against computers per se, but rather against one particular style of programming them with logical rules—a style that was popular

in the 1980s but has been eclipsed by new approaches.

One of Dreyfus’s strongest arguments is for situated agents rather than disembodied log-

ical inference engines. An agent whose understanding of “dog” comes only from a limited set of logical sentences such as “Dog(x) = Mammal(x)” is at a disadvantage compared to an agent that has watched dogs run, has played fetch with them, and has been licked by one. As philosopher Andy Clark (1998) says, “Biological brains are first and foremost the control systems for biological bodies. Biological bodies move and act in rich real-world surroundings.”

Embodied cognition

According to Clark, we are “good at frisbee, bad at logic.”

The embodied cognition approach claims that it makes no sense to consider the brain

separately: cognition takes place within a body, which is embedded in an environment. We need to study the system as a whole; the brain’s functioning exploits regularities in its envi-

ronment, including the rest of its body.

Under the embodied cognition approach, robotics,

vision, and other sensors become central, not peripheral. Overall, Dreyfus

saw areas where Al did not have complete answers and said that Al is

therefore impossible; we now see many of these same areas undergoing continued research and development leading to increased capability, not impossibility. 27.1.2

The argument from disability

The “argument from disability” makes the claim that “a machine can never do X.” As examples of X, Turing lists the following: Be kind, resourceful, beautiful, friendly, have initiative, have a sense of humor, tell right from wrong, make mistakes, fall in love, enjoy strawberries and cream, make someone fall in love with it, learn from experience, use words properly, be the subject of its own thought, have as much diversity of behavior as man, do something really new. In retrospect, some of these are rather easy—we’re all familiar with computers that “make

mistakes.” Computers with metareasoning capabilities (Chapter 5) can examine heir own computations, thus being the subject of their own reasoning. A century-old technology has the proven ability to “make someone fall in love with it"—the teddy bear. Computer chess

expert David Levy predicts that by 2050 people will routinely fall in love with humanoid robots.

As for a robot falling in love, that is a common theme in fiction,' but there has

been only limited academic speculation on the subject (Kim ef al., 2007). Computers have

! For example, the opera Coppélia (1870), the novel Do Androids Dream of Electric Sheep? (1968), the movies AIQ001), Wall-E (2008), and Her (2013).

Section 27.1

The Limits of AT

done things that are “really new;” making significant discoveries in astronomy, mathematics, chemistry, mineralogy, biology, computer science, and other fields, and creating new forms of art through style transfer (Gatys e al., 2016). Overall, programs exceed human performance in some tasks and lag behind on others. The one thing that it is clear they can’t do is be exactly human. 27.1.3

The

mathematical

objection

Turing (1936) and Godel (1931) proved that certain mathematical questions are in princi-

ple unanswerable by particular formal systems. Godel’s incompleteness theorem (see Sec-

tion 9.5) is the most famous example of this. Briefly, for any formal axiomatic framework F

powerful enough to do arithmetic, it is possible to construct a so-called Godel sentence G(F)

with the following properties: « G(F) is a sentence of F, but cannot be proved within F. « If F is consistent, then G(F) is true.

Philosophers such as J. R. Lucas (1961) have claimed that this theorem shows that machines

are mentally inferior to humans, because machines are formal systems that are limited by the incompleteness theorem—they cannot establish the truth of their own Godel sentence—

while humans have no such limitation. This has caused a lot of controversy, spawning a vast literature, including two books by the mathematician/physicist Sir Roger Penrose (1989, 1994).

Penrose repeats Lucas’s claim with some fresh twists, such as the hypothesis that

humans are different because their brains operate by quantum gravity—a theory that makes

multiple false predictions about brain physiology.

‘We will examine three of the problems with Lucas’s claim. First, an agent should not be

ashamed that it cannot establish the truth of some sentence while other agents can. Consider the following sentence:

Lucas cannot consistently assert that this sentence is true. If Lucas asserted this sentence, then he would be contradicting himself, so therefore Lucas

cannot consistently assert it, and hence it is true. We have thus demonstrated that there is a

true sentence that Lucas cannot consistently assert while other people (and machines) can. But that does not make us think any less of Lucas.

Second, Gadel’s incompleteness theorem and related results apply to mathematics, not

to computers. No entity—human or machine—can prove things that are impossible to prove. Lucas and Penrose falsely assume that humans can somehow get around these limits, as when

Lucas (1976) says “we must assume our own consistency, if thought is to be possible at all.” But this is an unwarranted assumption: humans are notoriously inconsistent. This is certainly

true for everyday reasoning, but it is also true for careful mathematical thought. A famous example is the four-color map problem. Alfred Kempe (1879) published a proof that was widely accepted for 11 years until Percy Heawood (1890) pointed out a flaw. Third, Gédel’s incompleteness theorem technically applies only to formal systems that are powerful enough to do arithmetic. This

includes Turing machines, and Lucas’s claim is

in part based on the assertion that computers are equivalent to Turing machines. This is not

quite true. Turing machines are infinite, whereas computers (and brains) are finite, and any computer can therefore be described as a (very large) system in propositional logic, which is not subject to Gédel’s incompleteness theorem. Lucas assumes that humans can “change their

983

984

Chapter 27 Philosophy, Ethics, and Safety of AT minds” while computers cannot, but that is also false—a computer can retract a conclusion

after new evidence or further deliberation; it can upgrade its hardware; and it can change its

decision-making processes with machine learning or software rewriting. 27.1.4

Measuring Al

Alan Turing, in his famous paper “Computing Machinery and Intelligence” (1950), suggested that instead of

asking whether machines can think, we should ask whether machines can pass

a behavioral test, which has come to be called the Turing test. The test requires a program

to have a conversation (via typed messages) with an interrogator for five minutes.

The in-

terrogator then has to guess if the conversation is with a program or a person; the program

passes the test if it fools the interrogator 30% of the time. To Turing, the key point was not

the exact details of the test, but instead the idea of measuring intelligence by performance on some kind of open-ended behavioral task, rather than by philosophical speculation.

Nevertheless, Turing conjectured that by the year 2000 a computer with a storage of a

billion units could pass the test, but here we are on the other side of 2000, and we still can’t

agree whether any program has passed. Many people have been fooled when they didn’t know they might be chatting with a computer.

The ELIZA program and Internet chatbots such as

MGONZ (Humphrys, 2008) and NATACHATA (Jonathan et al., 2009) fool their correspondents

repeatedly, and the chatbot CYBERLOVER

has attracted the attention of law enforcement be-

cause of its penchant for tricking fellow chatters into divulging enough personal information that their identity can be stolen. In 2014, a chatbot called Eugene Goostman fooled 33% of the untrained amateur judges

in a Turing test. The program claimed to be a boy from Ukraine with limited command of

English; this helped explain its grammatical errors. Perhaps the Turing test is really a test of

human gullibility. So far no well-trained judge has been fooled (Aaronson, 2014).

Turing test competitions have led to better chatbots, but have not been a focus of research

within the AT community. Instead, Al researchers who crave competition are more likely

to concentrate on playing chess or Go or StarCraft II, or taking an 8th grade science exam,

or identifying objects in images.

In many of these competitions, programs have reached

or surpassed human-level performance, but that doesn’t mean the programs are human-like

outside the specific task. The point is to improve basic science and technology and to provide useful tools, not to fool judges.

27.2

Can Machines Really Think?

Some philosophers claim that a machine that acts intelligently would not be actually thinking,

but would be only a simulation of thinking. But most Al researchers are not concerned with

the distinction, and the computer scientist Edsger Dijkstra (1984) said that “The question of whether Machines Can Think ... is about as relevant as the question of whether Submarines Can Swim.” The American Heritage Dictionary’s first definition of swim is “To move through

water by means of the limbs, fins, or tail,” and most people agree that submarines, being limbless, cannot swim. The dictionary also defines fly as “To move through the air by means of wings or winglike parts,” and most people agree that airplanes, having winglike parts,

can fly. However, neither the questions nor the answers have any relevance to the design or

capabilities of airplanes and submarines; rather they are about word usage in English. (The

Section 27.2

985

Can Machines Really Think?

fact that ships do swim (“priver”) in Russian amplifies this point.) English speakers have

not yet settled on a precise definition for the word “think”—does it require “a brain” or just “brain-like parts?”

Again, the issue was addressed by Turing. He notes that we never have any direct ev-

idence about the internal mental states of other humans—a kind of mental solipsism. Nev-

ertheless, Turing says, “Instead of arguing continually over this point, it is usual to have the polite convention that everyone thinks.” Turing argues that we would also extend the polite

convention to machines, if only we had experience with ones that act intelligently. How-

Polite convention

ever, now that we do have some experience, it seems that our willingness to ascribe sentience

depends at least as much on humanoid appearance and voice as on pure intelligence. 27.2.1

The Chinese room

The philosopher John Searle rejects the polite convention. His famous Chinese room argu- Chinese room ment (Searle, 1990) goes as follows: Imagine a human, who understands only English, inside a room that contains a rule book, written in English, and various stacks of paper. Pieces of paper containing indecipherable symbols are slipped under the door to the room. The human follows the instructions in the rule book, finding symbols in the stacks, writing symbols on new pieces of paper, rearranging the stacks, and so on. Eventually, the instructions will cause

one or more symbols to be transcribed onto a piece of paper that is passed back to the outside world. From the outside, we see a system that is taking input in the form of Chinese sentences and generating fluent, intelligent Chinese responses. Searle then argues: it is given that the human does not understand Chinese. The rule book

and the stacks of paper, being just pieces of paper, do not understand Chinese. Therefore, there is no understanding of Chinese. And Searle says that the Chinese room is doing the same thing that a computer would do, so therefore computers generate no understanding. Searle (1980) is a proponent of biological naturalism, according to which mental states Biological naturalism are high-level emergent features that are caused by low-level physical processes in the neurons, and it is the (unspecified) properties of the neurons that matter: according to Searle’s biases, neurons have “it” and transistors do not. There have been many refutations of Searle’s

argument, but no consensus. His argument could equally well be used (perhaps by robots) to argue that a human cannot have true understanding; after all, a human is made out of cells,

the cells do not understand, therefore there is no understanding. In fact, that is the plot of Terry Bisson’s (1990) science fiction story They're Made Out of Meat, in which alien robots

explore Earth and can’t believe that hunks of meat could possibly be sentient. How they can be remains a mystery.

27.2.2

Consciousness and qualia

Running through all the debates about strong Al is the issue of consciousness:

awareness

of the outside world, and of the self, and the subjective experience of living. The technical

term for the intrinsic nature of experiences is qualia (from the Latin word meaning, roughly,

“of what kind”). The big question is whether machines can have qualia. In the movie 2001, when astronaut David Bowman is disconnecting the “cognitive circuits™ of the HAL 9000

computer, it says “I'm afraid, Dave. Dave, my mind is going. I can feel it.” Does HAL actually have feelings (and deserve sympathy)? Or is the reply just an algorithmic response, no different from “Error 404: not found”?

Consciousness Qualia

986

Chapter 27 Philosophy, Ethics, and Safety of AT There is a similar question for animals: pet owners are certain that their dog or cat has

consciousness, but not all scientists agree. Crickets change their behavior based on tempera-

ture, but few people would say that crickets experience the feeling of being warm or cold.

One reason that the problem of consciousness is hard is that it remains ill-defined, even

after centuries of debate.

But help may be on the way. Recently philosophers have teamed

with neuroscientists under the auspices of the Templeton Foundation to start a series of ex-

periments that could resolve some of the issues. Advocates of two leading theories of con-

sciousness (global workspace theory and integrated information theory) have agreed that the

experiments could confirm one theory over the other—a rarity in philosophy.

Alan Turing (1950) concedes that the question of consciousness is a difficult one, but

denies that it has much relevance to the practice of Al: “I do not wish to give the impression

that I think there is no mystery about consciousness ... But I do not think these mysteries

necessarily need to be solved before we can answer the question with which we are concerned in this paper” We agree with Turing—we are interested in creating programs that behave intelligently.

Individual aspects of consciousness—awareness,

self-awareness, attention—

can be programmed and can be part of an intelligent machine. The additional project of making a machine conscious in exactly the way humans are is not one that we are equipped

to take on. We do agree that behaving intelligently will require some degree of awareness,

which will differ from task to task, and that tasks involving interaction with humans will

require a model of human subjective experience. In the matter of modeling experience, humans have a clear advantage over machines, because they can use their own subjective apparatus to appreciate the subjective experience of others.

For example, if you want to know what it’s like when someone hits their thumb

with a hammer, you can hit your thumb with a hammer. Machines have no such capability— although unlike humans, they can run each other’s code. 27.3

The Ethics of Al

Given that Al is a powerful technology, we have a moral obligation to use it well, to promote the positive aspects and avoid or mitigate the negative ones.

The positive aspects are many. For example, Al can save lives through improved med-

ical diagnosis, new medical discoveries, better prediction of extreme weather events, and

safer driving with driver assistance and (eventually) self-driving technologies. There are also many opportunities to improve lives. Microsoft’s Al for Humanitarian Action program ap-

plies Al to recovering from natural disasters, addressing the needs of children, protecting refugees, and promoting human rights. Google’s Al for Social Good program supports work on rainforest protection, human rights jurisprudence, pollution monitoring, measurement of fossil fuel emissions, cris counseling, news fact checking, suicide prevention, recycling, and other issues. The University of Chicago’s Center for Data Science for Social Good applies machine learning to problems in criminal justice, economic development, education, public health, energy, and environment.

Al applications in crop management and food production help feed the world. Optimization of business processes using machine learning will make businesses more productive,

increasing wealth and providing more employment. Automation can replace the tedious and

dangerous tasks that many workers face, and free them to concentrate on more interesting

Section 27.3

The Ethics of Al

987

aspects. People with disabilities will benefit from Al-based assistance in seeing, hearing, and

mobility. Machine translation already allows people from different cultures to communicate.

Software-based Al solutions have near zero marginal cost of production, and so have the potential to democratize access to advanced technology (even as other aspects of software have the potential to centralize power).

Despite these many positive aspects, we shouldn’t ignore the negatives. Many new tech-

nologies have had unintended negative side effects:

nuclear fission brought Chernobyl and

the threat of global destruction; the internal combustion engine brought air pollution, global warming, and the paving of paradise. Other technologies can have negative effects even when

used as intended, such as sarin gas, AR-15 rifles, and telephone solicitation. Automation will

create wealth, but under current economic conditions much of that wealth will flow to the

owners of the automated systems, leading to increased income inequality. This can be disruptive to a well-functioning society. In developing countries, the traditional path to growth

through low-cost manufacturing for export may be cut off, as wealthy countries adopt fully

automated manufacturing facilities on-shore. Our ethical and governance decisions will dic-

tate the level of inequality that AT will engender.

All scientists and engineers face ethical considerations of what projects they should or

should not take on, and how they can make sure the execution of the project is safe and beneficial. In 2010, the UK’s Engineering and Physical Sciences Research Council held a meeting to develop a set of Principles of Robotics. In subsequent years other government agencies, nonprofit organizations, and companies created similar sets of principles. The gist is that every organization that creates Al technology, and everyone in the organization, has a responsibility to make sure the technology contributes to good, not harm. The most commonly-cited

principles are:

Ensure safety

Establish accountability

Promote collaboration

Avoid concentration of power

Ensure fairness Respect privacy

Provide transparency

Limit harmful uses of Al

Uphold human rights and values Reflect diversity/inclusion Acknowledge legal/policy implications

Contemplate implications for employment

Note that many of the principles, such as “ensure safety,” have applicability to all software or

hardware systems, not just Al systems. Several principles are worded in a vague way, making them difficult to measure or enforce.

That is in part because Al is a big field with many

subfields, each of which has a different set of historical norms and different relationships between the Al developers and the stakeholders. Mittelstadt (2019) suggests that the subfields

should each develop more specific actionable guidelines and case precedents. 27.3.1

Lethal autonomous weapons

The UN defines a lethal autonomous weapon as one that locates, selects, and engages (i.c., Kills) human targets without human supervision. Various weapons fulfill some of these criteria. For example, land mines have been used since the 17th century: they can select and engage targets in a limited sense according to the degree of pressure exerted or the quan-

tity of metal present, but they cannot go out and locate targets by themselves. (Land mines are banned under the Ottawa Treaty.) Guided missiles, in use since the 1940s, can chase

targets, but they have to be pointed in the right general direction by a human. Auto-firing

Negative side effects

988

Chapter 27 Philosophy, Ethics, and Safety of AT radar-controlled guns have been used to defend naval ships since the 1970s; they are mainly intended to destroy incoming missiles, but they could also attack manned aircraft. Although

the word “autonomous” is often used to describe unmanned air vehicles or drones, most such

weapons are both remotely piloted and require human actuation of the lethal payload.

At the time of writing, several weapons systems seem to have crossed the line into full au-

tonomy. For example Israel’s Harop missile is a “loitering munition” with a ten-foot wingspan and a fifty-pound warhead. It searches for up to six hours in a given geographical region for any target that meets a given criterion and then destroys it. The criterion could be “emits a radar signal resembling antiaircraft radar” or “looks like a tank.”

The Turkish manufac-

turer STM advertises its Kargu quadcopter—which carries up to 1.5kg of explosives—as

capable of “Autonomous hit ... . targets selected on images . . . tracking moving targets . ... anti-

personnel ... face recognition.” Autonomous weapons have been called the “third revolution in warfare” after gunpowder and nuclear weapons. Their military potential is obvious. For example, few experts doubt

that autonomous fighter aircraft would defeat any human pilot. Autonomous aircraft, tanks,

and submarines can be cheaper, faster, more maneuverable, and have longer range than their manned counterparts.

Since 2014, the United Nations in Geneva has conducted regular discussions under the

auspices of the Convention on Certain Conventional Weapons (CCW) on the question of whether to ban lethal autonomous

weapons.

At the time of writing, 30 nations, ranging

in size from China to the Holy See, have declared their support for an international treaty, while other key countries—including Israel, Russia, South Korea, and the United States—are

opposed to a ban. The debate over autonomous weapons includes legal, ethical and practical aspects. The legal issues are governed primarily by the CCW, which requires the possibility of discrim-

inating between combatants and non-combatants, the judgment of military necessity for an attack,

and the assessment of proportionality

possibility of collateral damage.

The feas

between the military value of a target and the

ity of meeting these criteria

is an engineering

question—one whose answer will undoubtedly change over time. At present, discrimination seems feasible in some circumstances and will undoubtedly improve rapidly, but necessity and proportionality are not presently feasible: they require that machines make subjective

and situational judgments that are considerably more difficult than the relatively simple tasks

of searching for and engaging potential targets. For these reasons, it would be legal to use autonomous weapons only in circumstances where a human operator can reasonably predict that the execution of the mission will not result in civilians being targeted or the weapons

conducting unnecessary or disproportionate attacks. This means that, for the time being, only

very restricted missions could be undertaken by autonomous weapons.

On the ethical side, some find it simply morally unacceptable to delegate the decision to

kill humans to a machine.

For example, Germany’s ambassador in Geneva has stated that

it “will not accept that the decision over life and death is taken solely by an autonomous

system” while Japan “has no plan to develop robots with humans out of the loop, which may

be capable of committing murder.” Gen. Paul Selva, at the time the second-ranking military

officer in the United States, said in 2017, “I don’t think it’s reasonable for us to put robots in charge of whether or not we take a human life.” Finally, Anténio Guterres, the head of the United Nations, stated in 2019 that “machines with the power and discretion to take lives

Section 27.3

The Ethics of Al

without human involvement are politically unacceptable, morally repugnant and should be prohibited by international law.” More than 140 NGOs in over 60 countries are part of the Campaign to Stop Killer Robots, and an open letter organized in 2015 by the Future of Life Institute organized an open letter was signed by over 4,000 Al researchers? and 22,000 others. Against this, it can be argued that as technology improves it ought to be possible to develop weapons that are less likely than human soldiers or pilots to cause civilian casualties. (There is also the important benefit that autonomous weapons reduce the need for human sol-

diers and pilots to risk death.) Autonomous systems will not succumb to fatigue, frustration,

hysteria, fear, anger, or revenge, and need not “shoot first, ask questions later” (Arkin, 2015). Just as guided munitions have reduced collateral damage compared to unguided bombs, one may expect intelligent weapons to further improve the precision of attacks. (Against this, see Benjamin (2013) for an analysis of drone warfare casualties.) This, apparently, is the position of the United States in the latest round of negotiations in Geneva.

Perhaps counterintuitively, the United States is also one of the few nations whose own

policies currently preclude the use of autonomous weapons. The 2011 U.S. Department of Defense (DOD) roadmap says: “For the foreseeable future, decisions over the use of force

[by autonomous systems] and the choice of which individual targets to engage with lethal force will be retained under human control.” The primary reason for this policy is practical: autonomous systems are not reliable enough to be trusted with military decisions. The issue of reliability came to the fore on September 26,

1983, when Soviet missile

officer Stanislav Petrov’s computer display flashed an alert of an incoming missile attack. According to protocol, Petrov should have initiated a nuclear counterattack, but he suspected the alert was a bug and treated it as such. He was correct, and World War III was (narrowly)

averted. We don’t know what would have happened if there had been no human in the loop. Reliability is a very serious concern for military commanders, who know well the complexity of battlefield situations. Machine learning systems that operate flawlessly in training

may perform poorly when deployed. Cyberattacks against autonomous weapons could result

in friendly-fire casualties; disconnecting the weapon from all communication may prevent

that (assuming it has not already been compromised), but then the weapon cannot be recalled if it is malfunctioning.

The overriding practical issue with autonomous weapons is that they they are scalable

weapons of mass destruction, in the sense that the scale of an attack that can be launched is

proportional to the amount of hardware one can afford to deploy. A quadcopter two inches

in diameter can carry a lethal explosive charge, and one million can fit in a regular shipping

container. Precisely because they are autonomous, these weapons would not need one million

human supervisors to do their work.

As weapons of mass destruction, scalable autonomous weapons have advantages for the

attacker compared to nuclear weapons and carpet bombing: they leave property intact and can be applied selectively to eliminate only those who might threaten an occupying force. They

could certainly be used to wipe out an entire ethnic group or all the adherents ofa particular religion. In many situations, they would also be untraceable. These characteristics make them particularly attractive to non-state actors.

the two authors of this book.

989

990

Chapter 27 Philosophy, Ethics, and Safety of AT These considerations—particularly

those characteristics that advantage the attacker—

suggest that autonomous weapons will reduce global and national security for all parties.

The rational response for governments seems to be to engage in arms control discussions Dual use

rather than an arms race. The process of designing a treaty is not without its difficulties, however.

Al is a dual

use technology: Al technologies that have peaceful applications such as flight control, visual tracking, mapping, navigation, and multiagent planning, can easily be applied to military

purposes.

It is easy to turn an autonomous quadcopter into a weapon simply by attaching

an explosive and commanding it to seek out a target.

Dealing with this will require care-

ful implementation of compliance regimes with industry cooperation, as has already been

demonstrated with some success by the Chemical Weapons Convention.

27.3.2

Surveillance, security, and privacy

In 1976, Joseph Weizenbaum warned that automated speech recognition technology could

lead to widespread wiretapping, and hence to a loss of civil liberties. Today, that threat has

been realized, with most electronic communication going through central servers that can

be monitored, and cities packed with microphones and cameras that can identify and track

Surveillance camera

individuals based on their voice, face, and gait. Surveillance that used to require expensive and scarce human resources can now be done at a mass scale by machines. As of 2018, there were as many as 350 million surveillance cameras in China and 70 million in the United States. China and other countries have begun exporting surveillance technology to low-tech countries, some with reputations for mistreating their citizens and

disproportionately targeting marginalized communities. Al engineers should be clear on what

uses of surveillance are compatible with human rights, and decline to work on applications that are incompatible. As more of our institutions operate online, we become more vulnerable to cybercrime

Cybersecurity

(phishing, credit card fraud, botnets, ransomware) and cyberterrorism (including potentially deadly attacks such as shutting down hospitals and power plants or commandeering selfdriving cars). Machine learning can be a powerful tool for both sides in the cybersecurity battle. Attackers can use automation to probe for insecurities and they can apply reinforce-

ment learning for phishing attempts and automated blackmail. Defenders can use unsuper-

vised learning to detect anomalous incoming traffic patterns (Chandola ef al., 2009; Malhotra

et al., 2015) and various machine learning techniques to detect fraud (Fawcett and Provost,

1997; Bolton and Hand, 2002). As attacks get more sophisticated, there is a greater responsi-

bility for all engineers, not just the security experts, to design secure systems from the start.

One forecast (Kanal, 2017) puts the market for machine learning in cybersecurity at about

$100 billion by 2021.

As we interact with computers for increasing amounts of our daily lives, more data on us

is being collected by governments and corporations. Data collectors have a moral and legal

responsibility to be good stewards of the data they hold.

In the U.S., the Health Insurance

Portability and Accountability Act (HIPAA) and the Family Educational Rights and Privacy Act (FERPA) protect the privacy of medical and student records. The European Union’s General Data Protection Regulation (GDPR) mandates that companies design their systems

with protection of data in mind and requires that they obtain user consent for any collection

or processing of data.

Section 27.3

991

The Ethics of Al

Balanced against the individual’s right to privacy is the value that society gains from sharing data. We want to be able to stop terrorists without oppressing peaceful dissent, and we want to cure diseases without compromising any individual’s right to keep their health history private. One key practice is de-identification: eliminating personally identifying in- De-identification formation (such as name and social security number) so that medical researchers can use the data to advance the common good. The problem is that the shared de-identified data may

be subject to re-identification.

For example, if the data strips out the name, social security

number, and street address, but includes date of birth, gender, and zip code, then, as shown by

Latanya Sweeney (2000), 87% of the U.S. population can be uniquely re-identified. Sweeney emphasized this point by re-identifying the health record for the governor of her state when

he was admitted to the hospital. In the Netflix Prize competition, de-identified records of in-

dividual movie ratings were released, and competitors were asked to come up with a machine learning algorithm that could accurately predict which movies an individual would like. But researchers were able to re-identify individual users by matching the date of a rating in the

Netflix Prize

Netflix database with the date of a similar ranking in the Internet Movie Database (IMDB), where users sometimes use their actual names (Narayanan and Shmatikov, 2006).

This risk can be mitigated somewhat by generalizing fields: for example, replacing the

exact birth date with just the year of birth, or a broader range like “20-30 years old.” Deleting a field altogether can be seen as a form of generalizing to “any.” But generalization alone does not guarantee that records are safe from re-identification; it may be that there is only one person in zip code 94720 who is 90-100 years old. A useful property is k-anonymity: K-anonymity a database is k-anonymized if every record in the database is indistinguishable from at least

k— 1 other records. If there are records that are more unique than this, they would have to be

further generalized.

An alternative to sharing de-identified records is to keep all records private, but allow

aggregate querying. An API for queries against the database is provided, and valid queries Aggregate querying receive a response that summarizes the data with a count or average. But no response is given if it would violate certain guarantees of privacy. For example, we could allow an epidemiologist to ask, for each zip code, the percentage of people with cancer. For zip codes with at least n people a percentage would be given (with a small amount of random noise),

but no response would be given for zip codes with fewer than n people..

Care must be taken to protect against de-identification using multiple queries. For exam-

ple, if the query “average salary and number of employees of XYZ company age 30-40” gives the response [$81,234, 12] and the query “average salary and number of employees of XYZ company age 30-41 gives the response [$81,199, 131, and if we use LinkedIn to find the one 41-year-old at XYZ company, then we have successfully identified them, and can compute their exact salary, even though all the responses involved 12 or more people. The system must

be carefully designed to protect against this, with a combination of limits on the queries that

can be asked (perhaps only a predefined set of non-overlapping age ranges can be queried) and the precision of the results (perhaps both queries give the answer “about $81,000). A stronger guarantee is differential privacy, which assures that an attacker cannot use Differential privacy queries to re-identify any individual in the database, even if the attacker can make multiple

queries and has access to separate linking databases. The query response employs a random-

ized algorithm that adds a small amount of noise to the result. Given a database D, any record

in the database r, any query Q, and a possible response y to the query, we say that the database

992

Chapter 27 Philosophy, Ethics, and Safety of AT

D has edifferential privacy if the log probability of the response y varies by less than € when we add the record r: |log P(Q(D)=y) —logP(Q(D+r)=y)| < €. In other words, whether any one person decides to participate in the data base or not makes no appreciable difference to the answers anyone can get, and therefore there is no privacy

disincentive to participate. Many databases are designed to guarantee differential privacy.

Federated learning

So far we have considered the issue of sharing de-identified data from a central database.

An approach called federated learning (Konecny et al., 2016) has no central database; in-

stead, users maintain their own local databases that keep their data private. However, they

can share parameters of a machine learning model that is enhanced with their data, without

the risk of revealing any of the private data. Imagine a speech understanding application that users can run locally on their phone. The application contains a baseline neural network, which is then improved by local training on the words that are heard on the user’s phone. Periodically, the owners of the application poll a subset of the users and ask them for the parameter values of their improved local network, but not for any of their raw data. The

parameter values are combined together to form a new improved model which is then made

available to all users, so that they all get the benefit of the training that is done by other users.

For this scheme to preserve privacy, we have to be able to guarantee that the model

parameters shared by each user cannot be reverse-engineered. If we sent the raw parameters,

Secure aggregation

there is a chance that an adversary inspecting them could deduce whether, say, a certain word had been heard by the user’s phone. One way to eliminate this risk is with secure aggregation (Bonawitz et al., 2017).

The idea is that the central server doesn’t need to know the exact

parameter value from each distributed user; it only needs to know the average value for each

parameter, over all polled users. So each user can disguise their parameter values by adding

a unique mask to each value; as long as the sum of the masks is zero, the central server will be able to compute the correct average. Details of the protocol make sure that it is efficient in terms of communication (less than half the bits transmitted correspond to masking), is robust to individual users failing to respond, and is secure in the face of adversarial users,

cavesdroppers, or even an adversarial central server. 27.3.3

Fairness and bias

Machine learning is augmenting and sometimes replacing human decision-making in im-

portant situations: whose loan gets approved, to what neighborhoods police officers are de-

Societal bias

ployed, who gets pretrial release or parole. But machine learning models can perpetuate

societal bias. Consider the example of an algorithm to predict whether criminal defendants

are likely to re-offend, and thus whether they should be released before trial. It could well be

that such a system picks up the racial or gender prejudices of human judges from the examples in the training set. Designers of machine learning systems have a moral responsibility to

ensure that their systems are in fact fair. In regulated domains such as credit, education, em-

ployment, and housing, they have a legal responsibility as well. But what is faimess? There are multiple criteria; here are six of the most commonly-used concepts:

o Individual fairness: A requirement that individuals are treated similarly to other similar individuals, regardless of what class they are in.

Section 27.3

993

The Ethics of Al

* Group fairness: A requirement that two classes are treated similarly, as measured by

some summary statistic. o Fairness through unawareness:

If we delete the race and gender attributes from the

data set, then it might seem that the system cannot discriminate on those attributes.

Unfortunately, we know that machine learning models can predict latent variables (such as race and gender) given other correlated variables (such as zip code and occupation). Furthermore, deleting those attributes makes it impossible to verify equal opportunity

or equal outcomes. Still, some countries (e.g., Germany) have chosen this approach for

their demographic statistics (whether or not machine learning models are involved). o Equal outcome: The idea that each demographic class gets the same results; they have demographic parity. For example, suppose we have to decide whether we should Demographic parity approve loan applications; the goal is to approve those applicants who will pay back the loan and not those who will default on the loan. Demographic parity says that

both males and females should have the same percentage of loans approved. Note that this is a group faimess criterion that does nothing to ensure individual fairness; a wellqualified applicant might be denied and a poorly-qualified applicant might be approved, as long as the overall percentages are equal. Also, this approach favors redress of past biases over accuracy of prediction. Ifa man and a woman are equal in every way, except the woman receives a lower salary for the same job, should she be approved because she would be equal if not for historical biases, or should she be denied because the lower salary does in fact make her more likely to default?

Equal opportunity: The idea that the people who truly have the ability to pay back the loan should have an equal chance of being correctly classified as such, regardless of their sex. This approach is also called “balance.” It can lead to unequal outcomes and

ignores the effect of bias in the societal processes that produced the training data.

Equal impact: People with similar likelihood to pay back the loan should have the same expected utility, regardless of the class they belong to. This goes beyond equal

opportunity in that it considers both the benefits of a true prediction and the costs of a

false prediction. Let us examine how these issues play out in a particular context. COMPAS is a commercial system for recidivism (re-offense) scoring. It assigns to a defendant in a criminal case a risk score, which is then used by a judge to help make decisions: Is it safe to release

the defendant before trial, or should they be held in jail? If convicted, how long should the

sentence be? Should parole be granted? Given the significance of these decisions, the system has been the subject of intense scrutiny (Dressel and Farid, 2018). COMPAS is designed to be well calibrated: all the individuals who are given the same

score by the algorithm should have approximately the same probability of re-offending, regardless of race. For example, among all people that the model assigns a risk score of 7 out of 10, 60% of whites and 61% of blacks re-offend. The designers thus claim that it meets the

desired fairness goal.

On the other hand, COMPAS

does not achieve equal opportunity:

the proportion of

those who did not re-offend but were falsely rated as high-risk was 45% for blacks and 23%

for whites. In the case State v. Loomis, where a judge relied on COMPAS

to determine the

sentence of the defendant, Loomis argued that the secretive inner workings of the algorithm

Well calibrated

994

Chapter 27 Philosophy, Ethics, and Safety of AT violated his due process rights. Though the Wisconsin Supreme Court found that the sentence

given would be no different without COMPAS

in this case, it did issue warnings about the

algorithm’s accuracy and risks to minority defendants. Other researchers have questioned whether it is appropriate to use algorithms in applications such as sentencing.

We could hope for an algorithm that is both well calibrated and equal opportunity, but,

as Kleinberg er al. (2016) show, that is impossible.

If the base classes are different, then

any algorithm that is well calibrated will necessarily not provide equal opportunity, and vice

versa. How can we weigh the two criteria? Equal impact is one possibility. In the case of COMPAS, this means weighing the negative utility of defendants being falsely classified as high risk and losing their freedom, versus the cost to society of an additional crime being

committed, and finding the point that optimizes the tradeoff.

This is complicated because

there are multiple costs to consider. There are individual costs—a defendant who is wrong-

fully held in jail suffers a loss, as does the victim of a defendant who was wrongfully released and re-offends. But beyond that there are group costs—everyone has a certain fear that they will be wrongfully jailed, or will be the victim ofa crime, and all taxpayers contribute to the costs of jails and courts. If we give value to those fears and costs in proportion to the size of a group, then utility for the majority may come at the expense of a minority.

Another problem with the whole idea of recidivism scoring, regardless of the model used, is that we don’t have unbiased ground truth data. The data does not tell us who has committed a crime—all we know is who has been convicted of a crime. If the arresting officers, judge, or jury is biased, then the data will be biased. If more officers patrol some locations, then

the data will be biased against people in those locations. Only defendants who are released

are candidates to recommit, so if the judges making the release decisions are biased, the data

may be biased. If you assume that behind the biased data set there is an underlying, unknown,

unbiased data set which has been corrupted by an agent with biases, then there are techniques

to recover an approximation to the unbiased data. Jiang and Nachum (2019) describe various scenarios and the techniques involved.

One more risk is that machine learning can be used to justify bias. If decisions are made by a biased human after consulting with a machine learning system, the human can say “here is how my interpretation of the model supports my decision, so you shouldn’t question my

decision.” But other interpretations could lead to an opposite decision.

Sometimes fairness means that we should reconsider the objective function, not the data

or the algorithm. For example, in making job hiring decisions, if the objective is to hire candidates with the best qualifications in hand, we risk unfairly rewarding those who have

had advantageous educational opportunities throughout their lives, thereby enforcing class

boundaries. But if the objective is to hire candidates with the best ability to learn on the job, we have a better chance to cut across class boundaries and choose from a broader pool. Many

companies have programs designed for such applicants, and find that after a year of training,

the employees hired this way do as well as the traditional candidates. Similarly, just 18% of computer science graduates in the U.S. are women, but some schools, such as Harvey Mudd University, have achieved 50% parity with an approach that is focused on encouraging and retaining those who start the computer science program, especially those who start with less

programming experience.

A final complication is deciding which classes deserve protection.

In the U.S., the Fair

Housing Act recognized seven protected classes: race, color, religion, national origin, sex,

Section 27.3

disability, and familial status.

995

The Ethics of Al

Other local, state, and federal laws recognize other classes,

including sexual orientation, and pregnancy, marital, and veteran status. Is it fair that these

classes count for some laws and not others? International human rights law, which encom-

passes a broad set of protected classes, is a potential framework to harmonize protections across various groups. Even in the absence of societal bias, sample

disparity can lead to biased results.

Sample size disparity

In most data sets there will be fewer training examples of minority class individuals than

of majority class individuals. Machine learning algorithms give better accuracy with more training data, so that means that members of minority classes will experience lower accuracy.

For example, Buolamwini and Gebru (2018) examined a computer vision gender identifica-

tion service, and found that it had near-perfect accuracy for light-skinned males, and a 33% error rate for dark-skinned females. A constrained model may not be able to simultaneously

fit both the majority and minority class—a linear regression model might minimize average

error by fitting just the majority class, and in an SVM model, the support vectors might all correspond to majority class members.

Bias can also come into play in the software development process (whether or not the software involves machine learning). Engineers who are debugging a system are more likely to notice and fix those problems that are applicable to themselves. For example, it is difficult

to notice that a user interface design won’t work for colorblind people unless you are in fact

colorblind, or that an Urdu language translation is faulty if you don’t speak Urdu.

How can we defend against these biases? First, understand the limits of the data you are using. It has been suggested that data sets (Gebru et al., 2018; Hind er al., 2018) and models (Mitchell er al., 2019) should come with annotations: declarations of provenance, security,

conformity, and fitness for use. This is similar to the data sheets that accompany electronic Data sheet components such as resi they allow designers to decide what components to use. In addition to the data sheets, it is important to train engineers to be aware of issues of fairness

and bias, both in school and with on-the-job training. Having a diversity of engineers from different backgrounds makes it easier for them to notice problems in the data or models.

A

study by the Al Now Institute (West et al., 2019) found that only 18% of authors at leading Al conferences and 20% of Al professors are women. Black Al workers are at less than 4%.

Rates at industry research labs are similar. Diversity could be increased by programs earlier in the pipeline—in college or high school—and by greater awareness at the professional level. Joy Buolamwini founded the Algorithmic Justice League to raise awareness of this issue and

develop practices for accountability.

A second idea is to de-bias the data (Zemel ez al., 2013).

We could over-sample from

minority classes to defend against sample size disparity. Techniques such as SMOTE, the synthetic minority over-sampling technique (Chawla e al., 2002) or ADASYN, the adaptive synthetic sampling approach for imbalanced learning (He ef al., 2008), provide principled ways of oversampling. We could examine the provenance of data and, for example, eliminate examples from judges who have exhibited bias in their past court cases. Some analysts object

to the idea of discarding data, and instead would recommend building a hierarchical model of

the data that includes sources of bias, so they can be modeled and compensated for. Google

and NeurlPS have attempted to raise awareness of this issue by sponsoring the Inclusive Images Competition, in which competitors train a network on a data set of labeled images

collected in North America and Europe, and then test it on images taken from all around the

996

Chapter 27 Philosophy, Ethics, and Safety of AT world. The issue is that given this data set, it is easy to apply the label “bride” to a woman

in a standard Western wedding dress, but harder to recognize traditional African and Indian matrimonial dress. A third idea is to invent new machine learning models and algorithms that are more resistant to bias; and the final idea is to let a system make initial recommendations that may be biased, but then train a second system to de-bias the recommendations of the first one. Bellamy

et al. (2018) introduced the IBM AT FAIRNESS 360 system, which provides a framework for all of these ideas. We expect there will be increased use of tools like this in the future.

How do you make sure that the systems you build will be fair? A set of best practices has

been emerging (although they are not always followed):

* Make sure that the software engineers talk with social scientists and domain experts to understand the issues and perspectives, and consider fairness from the start.

+ Create an environment that fosters the development of a diverse pool of software engi-

neers that are representative of society.

+ Define what groups your system will support: different language speakers, different age

groups, different abilities with sight and hearing, etc.

+ Optimize for an objective function that incorporates fairness. + Examine your data for prejudice and for correlations between protected attributes and other attributes. + Understand how any human annotation of data is done, design goals for annotation

accuracy, and verify that the goals are met.

+ Don’t just track overall metrics for your system; make sure you track metrics for sub-

groups that might be victims of bias.

+ Include system tests that reflect the experience of minority group users.

« Have a feedback loop so that when fairness problems come up, they are dealt with. 27.3.4

Trust

Trust and transparency

It is one challenge to make an Al system accurate, fair, safe, and secure; a different chal-

lenge to convince everyone else that you have done so. People need to be able to trust the systems they use. A PwC survey in 2017 found that 76% of businesses were slowing the

adoption of Al because of trustworthiness concerns.

Verification and validation

In Section 19.9.4 we covered some of

the engineering approaches to trust; here we discuss the policy issues.

To earn trust, any engineered systems must go through a verification and validation

(V&V) process.

Verification means that the product satisfies the specifications.

Validation

means ensuring that the specifications actually meet the needs of the user and other affected

parties. We have an elaborate V&V methodology for engineering in general, and for traditional software development done by human coders; much of that is applicable to Al systems.

But machine learning systems are different and demand a different V&V process, which has

not yet been fully developed.

We need to verify the data that these systems learn from; we

need to verify the accuracy and fairness of the results, even in the face of uncertainty that makes an exact result unknowable;

Certification

and we need to verify that adversaries cannot unduly

influence the model, nor steal information by querying the resulting model.

One instrument of trust is certification; for example, Underwriters Laboratories (UL) was founded in 1894 at a time when consumers were apprehensive about the risks of electric

Section 27.3

997

The Ethics of Al

power. UL certification of appliances gave consumers increased trust, and in fact UL is now considering entering the business of product testing and certification for AL Other industries have long had safety standards.

For example, ISO 26262 is an interna-

tional standard for the safety of automobiles, describing how to develop, produce, operate, and service vehicles in a safe way. The Al industry is not yet at this level of clarity, although there are some frameworks

in progress, such as IEEE P7001, a standard defining ethical de-

sign for artificial intelligence and autonomous systems (Bryson and Winfield, 2017). There is ongoing debate about what kind of certification is necessary, and to what extent it should be

done by the government, by professional organizations like IEEE, by independent certifiers such as UL, or through self-regulation by the product companies. Another aspect of trust is transparency:

consumers want to know what is going on

inside a system, and that the system is not working against them, whether due to intentional

Transparency

malice, an unintentional bug, or pervasive societal bias that is recapitulated by the system. In

some cases this transparency is delivered directly to the consumer. In other cases their are

intellectual property issues that keep some aspects of the system hidden to consumers, but

open to regulators and certification agencies. When an Al system turns you down for a loan, you deserve an explanation. In Europe,

the GDPR enforces this for you. An Al system that can explain itself is called explainable AT

(XAI). A good explanation has several properties: it should be understandable and convincing Explainable Al (XAl) to the user, it should accurately reflect the reasoning of the system, it should be complete, and it should be specific in that different users with different conditions or different outcomes

should get different explanations.

It is quite easy to give a decision algorithm access to its own deliberative processes,

simply by recording them and making them available as data structures. This means that machines may eventually be able to give better explanations of their decisions than humans can. Moreover, we can take steps to certify that the machine’s explanations are not deceptions (intentional or self-deception), something that is more difficult with a human.

An explanation is a helpful but not sufficient ingredient to trust. One issue is that expla-

nations are not decisions:

they are stories about decisions. As discussed in Section 19.9.4, we

say that a system is interpretable if we can inspect the source code of the model and see what

itis doing, and we say it is explainable if we can make up a story about what it is doing—even if the system itself is an uninterpretable black box. To explain an uninterpretable black box, we need to build, debug, and test a separate explanation system, and make sure it is in sync with the original system. And because humans love a good story, we are all too willing to be swayed by an explanation that sounds good. Take any political controversy of the day, and

you can always find two so-called experts with diametrically opposed explanations, both of which are internally consistent.

A final issue is that an explanation about one case does not give you a summary over

other cases. If the bank explains, “Sorry, you didn’t get the loan because you have a history

of previous financial problems,” you don’t know if that explanation is accurate or if the bank is

secretly biased against you for some reason. In this case, you require not just an explanation, but also an audit of past decisions, with aggregated statistics across various demographic

groups, to see if their approval rates are balanced.

Part of transparency is knowing whether you are interacting with an Al system or a hu-

man. Toby Walsh (2015) proposed that “an autonomous system should be designed so that

998

Chapter 27 Philosophy, Ethics, and Safety of AT itis unlikely to be mistaken for anything besides an autonomous system, and should identify itself at the start of any interaction.”” He called this the “red flag” law, in honor of the UK’s

1865 Locomotive Act, which required any motorized vehicle to have a person with a red flag walk in front of i, to signal the oncoming danger.

In 2019, California enacted a law stating that “It shall be unlawful for any person to use a bot to communicate or interact with another person in California online, with the intent to

mislead the other person about its artificial identity.”

27.3.5

The future of work

From the first agricultural revolution (10,000 BCE)

to the industrial revolution (late 18th

century) to the green revolution in food production (1950s), new technologies have changed

the way humanity works and lives. A primary concern arising from the advance of Al is that human labor will become obsolete.

point quite clearly:

Aristotle, in Book I of his Politics, presents the main

For if every instrument could accomplish its own work, obeying or anticipating the will of others ....if, in like manner, the shuttle would weave and the plectrum touch the lyre withouta hand to guide them, chief workmen would not want servants, nor masters slaves. Everyone agrees with Aristotle’s observation that there is an immediate reduction in employ-

ment when an employer finds a mechanical method to perform work previously done by a

person. The issue is whether the so-called compensation effects that ensue—and that tend to

increase employment—will eventually make up for this reduction. The primary compensa-

tion effect is the increase in overall wealth from greater productivity, which leads in turn to

greater demand for goods and tends to increase employment. For example, PwC (Rao and Verweij, 2017) predicts that Al contribute $15 trillion annually to global GDP by 2030. The healthcare and automotive/transportation industries stand to gain the most in the short term. However, the advantages of automation have not yet taken over in our economy: the current

rate of growth in labor productivity is actually below historical standards. Brynjolfsson ef al.

(2018) attempt to explain this paradox by suggesting that the lag between the development of

Technological unemployment

basic technology and its implementation in the economy is longer than commonly supposed. Technological innovations have historically put some people out of work. Weavers were replaced by automated looms in the 1810s, leading to the Luddite protests. The Luddites were not against technology per se; they just wanted the machines to be used by skilled workers paid a good wage to make high-quality goods, rather than by unskilled workers to make poorquality goods at low wages. The global destruction of jobs in the 1930s led John Maynard Keynes to coin the term technological unemployment.

employment levels eventally recovered.

In both cases, and several others,

The mainstream economic view for most of the 20th century was that technological em-

ployment was at most a short-term phenomenon. Increased productivity would always lead to increased wealth and increased demand, and thus net job growth. A commonly cited example is that of bank tellers: although ATMs replaced humans in the job of counting out cash for withdrawals, that made it cheaper to operate a bank branch, so the number of branches

increased, leading to more bank employees overall. The nature of the work also changed, becoming less routine and requiring more advanced business skills. The net effect of automation seems to be in eliminating fasks rather than jobs.

Section 27.3

The Ethics of Al

The majority of commenters predict that the same will hold true with AI technology, at

least in the short run. Gartner, McKinsey, Forbes, the World Economic Forum, and the Pew

Research Center each released reports in 2018 predicting a net increase in jobs due to Al-

driven automation. But some analysts think that this time around, things will be different. In 2019, IBM predicted that 120 million workers would need retraining due to automation by 2022, and Oxford Economics predicted that 20 million manufacturing jobs could be lost to

automation by 2030.

Frey and Osborne (2017) survey 702 different occupations, and estimate that 47% of them

are at risk of being automated, meaning that at least some of the tasks in the occupation can

be performed by machine. For example, almost 3% of the workforce in the U.S. are vehicle drivers, and in some districts, as much as 15% of the male workforce are drivers. As we saw in Chapter 26, the task of driving is likely to be eliminated by driverless cars/trucks/buses/taxis.

It is important to distinguish between occupations and the tasks within those occupations.

McKinsey estimates that only 5% of occupations are fully automatable, but that 60% of occupations can have about 30% of their tasks automated.

For example, future truck drivers

will spend less time holding the steering wheel and more time making sure that the goods are picked up and delivered properly; serving as customer service representatives and salespeople at either end of the journey; and perhaps managing convoys of, say, three robotic trucks. Replacing three drivers with one convoy manager implies a net loss in employment, but if transportation costs decrease, there will be more demand,

which wins some of the

jobs back—but perhaps not all of them. As another example, despite many advances in applying machine learning to the problem of medical imaging, radiologists have so far been

augmented, not replaced, by these tools. Ultimately, there is a choice of how to make use of automation: do we want to focus on cutting cost, and thus see job loss as a positive; or do we

want to focus on improving quality, making life better for the worker and the customer? It is difficult to predict exact timelines for automation, but currently, and for the next

few years, the emphasis is on automation of structured analytical tasks, such as reading x-ray images, customer relationship management (e.g., bots that automatically sort customer comBusiness process plaints and respond with suggested remedies), and business process automation that com- automation bines text documents and structured data to make business decisions and improve workflow. Over time, we will see more automation with physical robots, first in controlled warehouse

environments, then in more uncertain environments, building to a significant portion of the

marketplace by around 2030.

As populations in developed countries grow older, the ratio between workers and retirees

changes. In 2015 there were less than 30 retirees per 100 workers; by 2050 there may be over 60 per 100 workers. Care for the elderly will be an increasingly important role, one that can partially be filled by AL Moreover, if we want to maintain the current standard of living, it will also be necessary to make the remaining workers more productive; automation seems

like the best opportunity to do that.

Even if automation has a multi-trillion-dollar net positive impact, there may still be prob-

lems due to the pace of change. Consider how change came to the farming industry: in 1900, Pace of change

over 40% of the U.S. workforce was in agriculture, but by 2000 that had fallen to 2%.> That

In 2010, although only 2% of the U.S. workforce were actual farmers, over 25% of the population (80 million people) played the FARMVILLE game at least once.

999

1000

Chapter 27 Philosophy, Ethics, and Safety of AT is a huge disruption in the way we work, but it happened over a period of 100 years, and thus across generations, not in the lifetime of one worker.

Workers whose jobs are automated away this decade may have to retrain for a new profession within a few years—and then perhaps see their new profession automated and face

yet another retraining period. Some may be happy to leave their old profession—we see that

as the economy improves, trucking companies need to offer new incentives to hire enough

drivers—but workers will be apprehensive about their new roles. To handle this, we as a soci-

ety need to provide lifelong education, perhaps relying in part on online education driven by artificial intelligence (Martin, 2012). Bessen (2015) argues that workers will not see increases

Income inequality

in income until they are trained to implement the new technologies, a process that takes time. Technology tends to magnify income inequality.

In an information economy marked

by high-bandwidth global communication and zero-marginal-cost replication of intellectual property (what Frank and Cook (1996) call the “Winner-Take-All Society”), rewards tend to be concentrated.

If farmer Ali is 10% better than farmer Bo, then Ali gets about 10% more

income: Ali can charge slightly more for superior goods, but there is a limit on how much can be produced on the land, and how far it can be shipped. But if software app developer

Cary is 10% better than Dana, it may be that Cary ends up with 99% of the global market. AT

increases the pace of technological innovation and thus contributes to this overall trend, but

Al also holds the promise of allowing us to take some time off and let our automated agents

handle things for a while. Tim Ferriss (2007) recommends using automation and outsourcing to achieve a four-hour work week. Before the industrial revolution, people worked as farmers or in other crafts, but didn’t

report to a job at a place of work and put in hours for an employer.

But today, most adults

in developed countries do just that, and the job serves three purposes: it fuels the production

of the goods that society needs to flourish, it provides the income that the worker needs to live, and it gives the worker a sense of purpose, accomplishment, and social integration. With increasing automation, it may be that these three purposes become disaggregated—society’s

needs will be served in part by automation, and in the long run, individuals will get their sense of purpose from contributions other than work. Their income needs can be served by

social policies that include a combination of free or inexpensive access to social services and education, portable health care, retirement, and education accounts, progressive tax rates,

earned income tax credits, negative income tax, or universal basic income.

27.3.6

Robot rights

The question of robot consciousness, discussed in Section 27.2, is critical to the question of

what rights, if any, robots should have. would argue that they deserve rights.

If they have no consciousness, no qualia, then few

But if robots can feel pain, if they can dread death, if they are considered “persons,” then the argument can be made (e.g., by Sparrow (2004)) that they have rights and deserve to have their rights recognized, just as slaves, women, and other historically oppressed groups have

fought to have their rights recognized. The issue of robot personhood is often considered in fiction: from Pygmalion to Coppélia to Pinocchio to the movies A7 and Centennial Man, we have the legend of a doll/robot coming to life and striving to be accepted as a human with human rights. In real life, Saudi Arabia made headlines by giving honorary citizenship to Sophia, a human-looking puppet capable of speaking preprogrammed lines.

Section 27.3

1001

The Ethics of Al

If robots have rights, then they should not be enslaved, and there is a question of whether

reprogramming them would be a kind of enslavement. Another ethical issue involves voting rights: a rich person could buy thousands of robots and program them to cast thousands of votes—should

those votes count?

If a robot clones itself, can they both vote?

What is

the boundary between ballot stuffing and exercising free will, and when does robotic voting violate the “one person, one vote” principle? Ernie Davis argues for avoiding the dilemmas of robot consciousness by never building robots that could possibly be considered conscious. This argument was previously made by

Joseph Weizenbaum in his book Computer Power and Human Reason (1976), and before that by Julien de La Mettrie in L’Homme Machine (1748). Robots are tools that we create, to do

the tasks we direct them to do, and if we grant them personhood, we are just declining to take

responsibility for the actions of our own property: “I'm not at fault for my self-driving car crash—the car did it itself.” This issue takes a different turn if we develop human-robot hybrids. Of course we already

have humans enhanced by technology such as contact lenses, pacemakers, and artificial hips. But adding computational protheses may blur the lines between human and machine. 27.3.7

Al Safety

Almost any technology has the potential to cause harm in the wrong hands, but with AT and robotics, the hands might be operating on their own. Countless science fiction stories have warned about robots or cyborgs running amok. Early examples include Mary Shelley’s Frankenstein, or the Modern Prometheus (1818) and Karel Capek’s play R.U.R. (1920), in which robots conquer the world. In movies, we have The Terminator (1984) and The Matrix

(1999), which both feature robots trying to eliminate humans—the robopocalypse (Wilson, Robopocalypse 2011). Perhaps robots are so often the villains because they represent the unknown, just like the witches and ghosts of tales from earlier eras.

We can hope that a robot that is smart

enough to figure out how to terminate the human race is also smart enough to figure out that

that was not the intended utility function; but in building intelligent systems, we want to rely not just on hope, but on a design process with guarantees of safety.

It would be unethical to distribute an unsafe Al agent.

We require our agents to avoid

accidents, to be resistant to adversarial attacks and malicious abuse, and in general to cause

benefits, not harms. That is especially true as Al agents are deployed in safety-critical appli-

cations, such as driving cars, controlling robots in dangerous factory or construction settings, and making life-or-death medical decisions.

There is a long history of safety engineering in traditional engineering fields. We know

how to build bridges, airplanes, spacecraft, and power plants that are designed up front to

behave safely even when components of the system fail. The first technique is failure modes

and effect analysis (FMEA): analysts consider each component of the system, and imagine

every possible way the component could go wrong (for example, what if this bolt were to snap?), drawing on past experience and on calculations based on the physical properties of the component.

Safety engineering Failure modes and effect analysis

(FMEA)

Then the analysts work forward to see what would result from the failure.

If the result is severe (a section of the bridge could fall down) then the analysts alter the design to mitigate the failure. (With this additional cross-member, the bridge can survive the

failure of any 5 bolts; with this backup server, the online service can survive a tsunami taking

tree analysis out the primary server.) The technique of fault tree analysis (FTA) is used to make these Fault (FTA)

1002

Chapter 27 Philosophy, Ethics, and Safety of AT determinations:

analysts build an AND/OR

tree of possible failures and assign probabilities

to each root cause, allowing for calculations of overall failure probability. These techniques can and should be applied to all safety-critical engineered systems, including AI systems. The field of software engineering is aimed at producing reliable software, but the em-

phasis has historically been on correctness, not safety. Correctness means that the software

faithfully implements the specification. But safety goes beyond that to insist that the specification has considered any feasible failure modes, and is designed to degrade gracefully even in the face of unforeseen failures. For example, the software for a self-driving car wouldn’t

be considered safe unless it can handle unusual situations. For example, what if the power to

the main computer dies? A safe system will have a backup computer with a separate power

supply. What if a tire is punctured at high speed? A safe system will have tested for this, and will have software to correct for the resulting loss of control.

Unintended side effect

An agent designed as a utility maximizer, or as a goal achiever, can be unsafe if it has the wrong objective function. Suppose we give a robot the task of fetching a coffee from the kitchen. We might run into trouble with unintended side effects—the robot might rush

to accomplish the goal, knocking over lamps and tables along the way. In testing, we might

notice this kind of behavior and modify the utility function to penalize such damage, but it is

Low impact

difficult for the designers and testers to anticipate all possible side effects ahead of time.

One way to deal with this is to design a robot to have low impact (Armstrong and Levin-

stein, 2017): instead of just maximizing utility, maximize the utility minus a weighted summary of all changes to the state of the world.

In this way, all other things being equal, the

robot prefers not to change those things whose effect on utility is unknown; so it avoids knocking over the lamp not because it knows specifically that knocking the lamp will cause

it to fall over and break, but because it knows in general that disruption might be bad. This

can be seen as a version of the physician’s creed “first, do no harm,” or as an analog to regularization in machine learning: we want a policy that achieves goals, but we prefer policies that take smooth, low-impact actions to get there.

The trick is how to measure impact.

It

is not acceptable to knock over a fragile lamp, but perfectly fine if the air molecules in the room are disturbed a little, or if some bacteria in the room are inadvertently killed. It is cer-

tainly not acceptable to harm pets and humans in the room.

We need to make sure that the

robot knows the differences between these cases (and many subtle cases in between) through

a combination of explicit programming, machine learning over time, and rigorous testing.

Utility functions can go wrong due to externalities, the word used by economists for

factors that are outside of what is measured and paid for. The world suffers when green-

house gases are considered as externalities—companies and countries are not penalized for producing them, and as a result everyone suffers. Ecologist Garrett Hardin (1968) called the exploitation of shared resources the tragedy of the commons. We can mitigate the tragedy by internalizing the externalities—making them part of the utility function, for example with

a carbon tax—or by using the design principles that economist Elinor Ostrom identified as being used by local people throughout the world for centuries (work that won her the Nobel Prize in Economics in 2009):

Clearly define the shared resource and who has access.

+ Adapt to local conditions.

« Allow all parties to participate in decisions.

Section 27.3

The Ethics of Al

Monitor the resource with accountable monitors.

+ Sanctions, proportional to the severity of the violation. + Easy conflict resolution procedures. Hierarchical control for large shared resources.

Victoria Krakovna (2018) has cataloged examples of Al agents that have gamed the system, figuring out how to maximize utility without actually solving the problem that their designers

intended them to solve. To the designers this looks like cheating, but to the agents, they

are just doing their job.

Some agents took advantage of bugs in the simulation (such as

floating point overflow bugs) to propose solutions that would not work once the bug was

fixed. Several agents in video games discovered ways to crash or pause the game when they

were about to lose, thus avoiding a penalty. And in a specification where crashing the game

was penalized, one agent learned to use up just enough of the game’s memory so that when it

was the opponent’s turn, it would run out of memory and crash the game. Finally, a genetic

algorithm operating in a simulated world was supposed to evolve fast-moving creatures but

in fact produced creatures that were enormously tall and moved fast by falling over. Designers of agents should be aware of these kinds of specification failures and take steps

to avoid them. To help them do that, Krakovna was part of the team that released the Al Safety Gridworlds environments (Leike ef al., 2017), which allows designers to test how well their

agents perform.

The moral is that we need to be very careful in specifying what we want, because with

alignment utility maximizers we get what we actually asked for. The value alignment problem is the Value problem problem of making sure that what we ask for is what we really want; it is also known as the

King Midas problem, as discussed on page 33. We run into trouble when a utility function fails to capture background societal norms about acceptable behavior. For example, a human

who is hired to clean floors, when faced with a messy person who repeatedly tracks in dirt,

Kknows that it is acceptable to politely ask the person to be more careful, but it is not acceptable to kidnap or incapacitate said person.

A robotic cleaner needs to know these things too, either through explicit programming or

by learning from observation. Trying to write down all the rules so that the robot always does

the right thing is almost certainly hopeless. We have been trying to write loophole-free tax laws for several thousand years without success. Better to make the robot want to pay taxes,

50 to speak, than to try to make rules to force it to do so when it really wants to do something else. A sufficiently intelligent robot will find a way to do something else.

Robots can learn to conform better with human preferences by observing human behavfor. This is clearly related to the notion of apprenticeship learning (Section 22.6). The robot

may learn a policy that directly suggests what actions to take in what situations; this is often

a straightforward supervised learning problem if the environment is observable. For example, a robot can watch a human playing chess: each state-action pair is an example for the learning process.

Unfortunately, this form of imitation learning means that the robot will

repeat human mistakes. Instead, the robot can apply inverse reinforcement learning to

cover the utility function that the humans must be operating under. Watching even terrible chess players is probably enough for the robot to learn the objective of the game. Given just this information, the robot can then go on to exceed human performance—as, for example, ALPHAZERO

did in chess—by computing optimal or near-optimal policies from the objec-

1003

1004

Chapter 27 Philosophy, Ethics, and Safety of AT tive. This approach works not just in board games, but in real-world physical tasks such as helicopter aerobatics (Coates et al., 2009).

In more complex settings involving, for example, social interactions with humans, it is very unlikely that the robot will converge to exact and correct knowledge of each human’s individual preferences. (After all, many humans never quite learn what makes other humans tick, despite a lifetime of experience, and many of s are unsure of our own preferences t00.) It will be necessary, therefore, for machines to function appropriately when it is uncertain

about human preferences. In Chapter 18, we introduced assistance games, which capture

exactly this situation.

Solutions to assistance games include acting cautiously, so as not to

disturb aspects of the world that the human might care about, and asking questions. For example, the robot could ask whether turning the oceans into sulphuric acid is an acceptable

solution to global warming before it puts the plan into effect.

In dealing with humans, a robot solving an assistance game must accommodate human

imperfections. If the robot asks permission, the human may give it, not foreseeing that the

robot’s proposal is in fact catastrophic in the long term.

Moreover, humans do not have

complete introspective access to their true utility function, and they don’t always act in a way

that is compatible with it. Humans sometimes lie or cheat, or do things they know are wrong.

They sometimes take self-destructive actions like overeating or abusing drugs. AI systems need not learn to adopt these problematic tendencies, but they must understand that they exist when interpreting human behavior to get at the underlying human preferences. Despite this toolbox of safeguards, there is a fear, expressed by prominent technologists such as Bill Gates and Elon Musk and scientists such as Stephen Hawking and Martin Rees,

that AT could evolve out of control. They warn that we have no experience controlling powerful nonhuman entities with super-human capabilities. However, that’s not quite true; we have

centuries of experience with nations and corporations; non-human entities that aggregate the power of thousands or millions of people. Our record of controlling these entities is not very

encouraging: nations produce periodic convulsions called wars that kill tens of millions of human beings, and corporations are partly responsible for global warming and our inability

to confront it. Al systems may present much greater problems than nations and corporations because of

Ultraintelligent machine

Technological singularity

their potential to self-improve at a rapid pace, as considered by L. J. Good (1965b): Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines: there would then unquestionably be an “intelligence explosion,” and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make, provided that the machine is docile enough to tell us how to keep it under control.

Good’s “intelligence explosion” has also been called the technological singularity by math-

ematics professor and science fiction author Vernor Vinge, who wrote in 1993: “Within thirty

years, we will have the technological means to create superhuman intelligence. Shortly after, the human era will be ended.”

In 2017, inventor and futurist Ray Kurzweil predicted the

singularity would appear by 2045, which means it got 2 years closer in 24 years. (At that

rate, only 336 years to go!) Vinge and Kurzweil correctly note that technological progress on many measures is growing exponentially at present.

Summary

1005

It is, however, quite a leap to extrapolate all the way from the rapidly decreasing cost

of computation to a singularity.

So far, every technology has followed an S-shaped curve,

where the exponential growth eventually tapers off. Sometimes new technologies step in

when the old ones plateau, but sometimes it is not possible to keep the growth going, for

technical, political, or sociological reasons. For example, the technology of flight advanced

dramatically from the Wright brothers” flight in 1903 to the moon landing in 1969, but has

had no breakthroughs of comparable magnitude since then.

Another obstacle in the way of ultraintelligent machines taking over the world is the

world. More specifically, some kinds of progress require not just thinking but acting in the

physical world. (Kevin Kelly calls the overemphasis on pure intelligence thinkism.) An ul- Thinkism traintelligent machine tasked with creating a grand unified theory of physics might be capable

of cleverly manipulating equations progress, it would still need to raise and run physical experiments over analyzing the data and theorizing.

a billion times faster than Einstein, but to make any real millions of dollars to build a more powerful supercollider the course of months or years. Only then could it start Depending on how the data turn out, the next step might

require raising additional billions of dollars for an interstellar probe mission that would take

centuries to complete. The “ultraintelligent thinking™ part of this whole process might actu-

ally be the least important part. As another example, an ultraintelligent machine tasked with

bringing peace to the Middle East might just end up getting 1000 times more frustrated than

a human envoy. As yet, we don’t know how many of the big problems are like mathematics and how many are like the Middle East.

While some people fear the singularity, others relish it. The transhumanism social Transhumanism movement looks forward to a future in which humans are merged with—or replaced by—

robotic and biotech inventions. Ray Kurzweil writes in The Singularity is Near (2005):

The Singularity will allow us to transcend these limitations of our biological bodies and brain. We will gain power over our fates. ... We will be able to live as long as we want .. We will fully understand human thinking and will vastly extend and expand its reach. By the end of this century, the nonbiological portion of our intelligence will be trillions of trillions of times more powerful than unaided human intelligence. Similarly, when asked whether robots will inherit the Earth, Marvin Minsky said “yes, but they will be our children.” These possibilities present a challenge for most moral theorists,

who take the preservation of human life and the human species to be a good thing. Kurzweil also notes the potential dangers, writing “But the Singularity will also amplify the ability to act on our destructive inclinations, so its full story has not yet been written.”

We humans

would do well to make sure that any intelligent machine we design today that might evolve into an ultraintelligent machine will do so in a way that ends up treating us well.

As Eric

Brynjolfsson puts it, “The future is not preordained by machines. It’s created by humans.”

Summary This chapter has addressed the following issues:

+ Philosophers use the term weak Al for the hypothesis that machines could possibly behave intelligently, and strong AI for the hypothesis that such machines would count

as having actual minds (as opposed to simulated minds).

1006

Chapter 27 Philosophy, Ethics, and Safety of AT « Alan Turing rejected the question “Can machines think?” and replaced it with a behavioral test. He anticipated many objections to the possibility of thinking machines.

Few Al researchers pay attention to the Turing test, preferring to concentrate on their

systems’ performance on practical tasks, rather than the ability to imitate humans. « Consciousness remains a mystery. « Al s a powerful technology, and as such it poses potential dangers, through lethal autonomous weapons, security and privacy breaches, unintended side effects, uninten-

tional errors, and malignant misuse. Those who work with AI technology have an

ethical imperative to responsibly reduce those dangers. + Al systems must be able to demonstrate they are fair, trustworthy, and transparent.

+ There are multiple aspects of fairness, and it is impossible to maximize all of them at

once. So a first step is to decide what counts as fair.

« Automation is already changing the way people work. As a society, we will have to deal with these changes. Bibliographical and Historical Notes Weak AL: When Alan Turing (1950) proposed the possibility of AL he also posed many of the key philosophical questions, and provided possible replies. But various philosophers had raised similar issues long before Al was invented. Maurice Merleau-Ponty’s Phenomenology of Perception (1945) stressed the importance of the body and the subjective interpretation of reality afforded by our senses, and Martin Heidegger’s Being and Time (1927) asked what it means to actually be an agent. In the computer age, Alva Noe (2009) and Andy Clark (2015) propose that our brains form a rather minimal representation of the world, use the world itself

on a just-in-time basis to maintain the illusion of a detailed internal model, and use props

in the world (such as paper and pencil as well as computers) to increase the capabilities of the mind. Pfeifer ef al. (2006) and Lakoff and Johnson (1999) present arguments for how the body helps shape cognition. Speaking of bodies, Levy (2008), Danaher and McArthur the issue of robot sex. (2017), and Devlin (2018) address Strong AL René Descartes is known for his dualistic view of the human mind, but ironi-

cally his historical influence was toward mechanism and physicalism. He explicitly conceived

of animals as automata, and he anticipated the Turing test, writing “it is not conceivable [that

a machine] should produce different arrangements of words so as to give an appropriately

meaningful answer to whatever is said in its presence, as even the dullest of men can do”

(Descartes,

1637).

Descartes’s spirited defense of the animals-as-automata viewpoint ac-

tually had the effect of making it easier to conceive of humans as automata as well, even though he himself did not take this step. The book L’Homme Machine (La Mettrie, 1748) did explicitly argue that humans are automata. As far back as Homer (circa 700 BCE), the Greek legends envisioned automata such as the bronze giant Talos and considered the issue of biotechne, or life through craft (Mayor, 2018).

The Turing test (Turing, 1950) has been debated (Shieber, 2004), anthologized (Epstein et al., 2008), and criticized (Shieber, 1994; Ford and Hayes, 1995). Bringsjord (2008) gives advice for a Turing test judge, and Christian (2011) for a human contestant. The annual Loebner Prize competition is the longest-running Turing test-like contest; Steve Worswick’s

Bibliographical and Historical Notes MITSUKU won four in a row from 2016 to 2019. The Chinese room has been debated endlessly (Searle, 1980; Chalmers, 1992; Preston and Bishop, 2002). Herndndez-Orallo (2016)

gives an overview of approaches to measuring Al progress, and Chollet (2019) proposes a measure of intelligence based on skill-acquisition efficiency. Consciousness remains a vexing problem for philosophers, neuroscientists, and anyone who has pondered their own existence. Block (2009), Churchland (2013) and Dehaene (2014) provide overviews of the major theories. Crick and Koch (2003) add their expertise in biology and neuroscience to the debate, and Gazzaniga (2018) shows what can be learned from

studying brain disabilities in hospital cases. Koch (2019) gives a theory of consciousness—

“intelligence is about doing while experience is about being”—that includes most animals,

but not computers. Giulio Tononi and his colleagues propose integrated information theory

(Oizumi et al., 2014). Damasio (1999) has a theory based on three levels: emotion, feeling, and feeling a feeling. Bryson (2012) shows the value of conscious attention for the process of learning action selection. The philosophical literature on minds, brains, and related topics is large and jargon-filled. The Encyclopedia of Philosophy (Edwards, 1967) is an impressively authoritative and very useful navigation aid. The Cambridge Dictionary of Philosophy (Audi, 1999) is shorter and more accessible, and the online Stanford Encyclopedia of Philosophy offers many excellent articles and up-to-date references. The MIT Encyclopedia of Cognitive Science (Wilson and Keil, 1999) covers the philosophy, biology, and psychology of mind. There are multiple intro-

ductions to the philosophical “Al question” (Haugeland, 1985; Boden, 1990; Copeland, 1993; McCorduck, 2004; Minsky, 2007). The Behavioral and Brain Sciences, abbreviated BBS, is

amajor journal devoted to philosophical and scientific debates about Al and neuroscience.

Science fiction writer Isaac Asimov (1942, 1950) was one of the first to address the issue

of robot ethics, with his laws of robotics:

0. A robot may not harm humanity, or through inaction, allow humanity to come to harm.

1. A robot may not injure a human being or, through inaction, allow a human being to come to harm. 2. A robot must obey orders given to it by human beings, except where such orders would conflict with the First Law. 3. A robot must protect its own existence as long as such protection does not conflict with

the First or Second Law.

At first glance, these laws seem reasonable. But the trick is how to implement them. Should

a robot allow a human to cross the street, or eat junk food, if the human might conceivably

come to harm? In Asimov’s story Runaround (1942), humans need to debug a robot that is

found wandering in a circle, acting “drunk > They work out that the circle defines the locus of points that balance the second law (the robot was ordered to fetch some selenium at the center

of the circle) with the third law (there is a danger there that threatens the robot’s existence).* This suggests that the laws are not logical absolutes, but rather are weighed against each other, with a higher weight for the carlier laws. As this was 1942, before the emergence of

4 Science fiction writers are in broad agreement that robot e very b HAL 9000 computer becomes homicidal due to a con Captain Kirk tells an enemy robot that “Everything Harry el youis smoke comes out of the robot’s head and it shuts down.

at resolving contradictions. In 2001, the the Star Trek episode 1, Mudd,” Tam lying.” At this,

1007

1008

Chapter 27 Philosophy, Ethics, and Safety of AT digital computers, Asimov was probably thinking of an architecture based on control theory via analog computing. Weld and Etzioni (1994) analyze Asimov’s laws and suggest some ways to modify the planning techniques of Chapter 11 to generate plans that do no harm. Asimov has considered many of the ethical issues around technology; in his 1958 story The Feeling of Power he tackles the issue of automation leading to a lapse of human skill—a technician rediscovers

the lost art of multiplication—as well as the dilemma of what to do when the rediscovery is

applied to warfare.

Norbert Wiener’s book God & Golem,

Inc.

(1964) correctly predicted that computers

would achieve expert-level performance at games and other tasks, and that specifying what it is that we want would prove to be difficult. Wiener writes:

While it is always possible to ask for something other than we really want, this

possibility is most serious when the process by which we are to obtain our wish

is indirect, and the degree to which we have obtained our wish is not clear until

the very end. Usually we realize our wishes, insofar as we do actually realize them, by a feedback process, in which we compare the degree of attainment of intermediate goals with our anticipation of them.

In this process, the feedback

goes through us, and we can turn back before it is too late.

If the feedback is

built into a machine that cannot be inspected until the final goal is attained, the

possibilities for catastrophe are greatly increased. I should very much hate to ride

on the first trial of an automobile regulated by photoelectric feedback devices, unless there were somewhere a handle by which I could take over control ifT

found myself driving smack into a tree.

We summarized codes of ethics in the chapter, but the list of organizations that have is-

sued sets of principles is growing rapidly, and now includes Apple, DeepMind, Facebook, Google, IBM, Microsoft, the Organisation for Economic Co-operation and Development (OECD), the United Nations Educational, Scientific and Cultural Organization (UNESCO),

the U.S. Office of Science and Technology Policy the Beijing Academy of Artificial Intelligence (BAAI), the Institute of Electrical and Electronics Engineers (IEEE), the Association

of Computing Machinery (ACM), the World Economic Forum, the Group of Twenty (G20),

OpenAl, the Machine Intelligence Research Institute (MIRI), Al4People, the Centre for the Study of Existential Risk, the Center for Human-Compatible Al the Center for Humane Tech-

nology, the Partnership on Al the AI Now Institute, the Future of Life Institute, the Future

of Humanity Institute, the European Union, and at least 42 national governments. We have the handbook on the Ethics of Computing (Berleur and Brunnstein, 2001) and introductions to the topic of Al ethics in book (Boddington, 2017) and survey (Etzioni and Etzioni, 2017a)

form. The Journal of Artificial Intelligence and Law and Al and Society cover ethical issues. ‘We’ll now look at some of the individual issues.

Lethal autonomous weapons: P. W. Singer’s Wired for War (2009) raised ethical, legal,

and technical issues around robots on the battlefield.

Paul Scharre’s Army of None (2018),

written by one of the authors of current US policy on autonomous weapons, offers a balanced

and authoritative view.

Etzioni and Etzioni (2017b) address the question of whether artifi-

cial intelligence should be regulated; they recommend a pause in the development of lethal autonomous weapons, and an international discussion on the subject of regulation.

Bibliographical and Historical Notes Privacy: Latanya Sweeney (Sweeney, 2002b) presents the k-anonymity model and the idea of generalizing fields (Sweeney, 2002a). Achieving k-anonymity with minimal loss of data is an NP-hard problem, but Bayardo and Agrawal (2005) give an approximation algorithm. Cynthia Dwork (2008) describes differential privacy, and in subsequent work gives practical examples of clever ways to apply differential privacy to get better results than the naive approach (Dwork et al., 2014). Guo et al. (2019) describe a process for certified data

removal: if you train a model on some data, and then there is a request to delete some of the

data, this extension of differential privacy lets you modify the model and prove that it does not make use of the deleted data. Ji ef al. (2014) gives a review of the field of privacy. Etzioni (2004) argues for a balancing of privacy and security; individual rights and community. Fung etal. (2018), Bagdasaryan et al. (2018) discuss the various attacks on federated learning protocols. Narayanan e al. (2011) describe how they were able to de-anonymize the obfuscated

connection graph from the 2011 Social Network Challenge by crawling the site where the data was obtained (Flickr), and matching nodes with unusually high in-degree or out-degree between the provided data and the crawled data. This allowed them to gain additional information to win the challenge, and it also allowed them to uncover the true identity of nodes

in the data. Tools for user privacy are becoming available; for example, TensorFlow provides

modules for federated learning and privacy (McMahan and Andrew, 2018). Fairness: Cathy O’Neil's book Weapons of Math Destruction (2017) describes how various black box machine learning models influence our lives, often in unfair ways. She calls

on model builders to take responsibility for fairness, and for policy makers to impose appropriate regulation. Dwork er al. (2012) showed the flaws with the simplistic “fairness through

unawareness” approach. Bellamy er al. (2018) present a toolkit for mitigating bias in machine learning systems. Tramer et al. (2016) show how an adversary can “steal” a machine learning model by making queries against an API, Hardt ef al. (2017) describe equal opportunity as

a metric for fairness. Chouldechova and Roth (2018) give an overview of the frontiers of fairness, and Verma and Rubin (2018) give an exhaustive survey of fairness definitions. Kleinberg et al. (2016) show that, in general, an algorithm cannot be both well-calibrated

and equal opportunity. Berk ef al. (2017) give some additional definitions of types of fairness, and again conclude that it is impossible to satisfy all aspects at once. Beutel et al. (2019) give advice for how to put fairness metrics into practice.

Dressel and Farid (2018) report on the COMPAS recidivism scoring model. Christin et al. (2015) and Eckhouse ef al. (2019) discuss the use of predictive algorithms in the le-

gal system. Corbett-Davies e al. (2017) show that that there is a tension between ensuring

fairness and optimizing public safety, and Corbett-Davies and Goel (2018) discuss the differences between fairness frameworks. Chouldechova (2017) advocates for fair impact: all classes should have the same expected utility. Liu er al. (2018a) advocate for a long-term

measure of impact, pointing out that, for example, if we change the decision point for ap-

proving a loan in order to be more fair in the short run, this could have negative effect in the

long run on people who end up defaulting on a loan and thus have their credit score reduced.

Since 2014 there has been an annual conference on Fairness, Accountability, and Trans-

parency in Machine Learning. Mehrabi et al. (2019) give a comprehensive survey of bias and fairness in machine learning, cataloging 23 kinds of bias and 10 definitions of fairness.

Trust:

Explainable Al was an important topic going back to the early days of expert

systems (Neches ef al., 1985), and has been making a resurgence in recent years (Biran and

1009

1010

Chapter 27 Philosophy, Ethics, and Safety of AT Cotton, 2017; Miller er al., 2017; Kim, 2018).

Barreno et al. (2010) give a taxonomy

of

the types of security attacks that can be made against a machine learning system, and Tygar

(2011) surveys adversarial machine learning. Researchers at IBM have a proposal for gaining trust in AT systems through declarations of conformity (Hind et al., 2018). DARPA requires

explainable decisions for its battlefield systems, and has issued a call for research in the area

(Gunning, 2016). Al safety: The book Arrificial Intelligence Safety and Security (Yampolskiy, 2018) collects essays on Al safety, both recent and classic, going back to Bill Joy’s Why the Future Doesn’t Need Us (Joy, 2000). The “King Midas problem” was anticipated by Marvin Minsky, who once suggested that an Al program designed to solve the Riemann Hypothesis might end up taking over all the resources of Earth to build more powerful supercomputers. Similarly, Omohundro (2008) foresees a chess program that hijacks resources, and Bostrom (2014) de-

scribes the runaway paper clip factory that takes over the world. Yudkowsky (2008) goes into

more detail about how to design a Friendly AL Amodei er al. (2016) present five practical safety problems for Al systems.

Omohundro (2008) describes the Basic Al Drives and concludes, “Social structures which

cause individuals to bear the cost of their negative externalities would go a long way toward

ensuring a stable and positive future.” Elinor Ostrom’s Governing the Commons (1990) de-

scribes practices for dealing with externalities by traditional cultures. Ostrom has also applied

this approach to the idea of knowledge as a commons (Hess and Ostrom, 2007). Ray Kurzweil (2005) proclaimed The Singularity is Near, and a decade later Murray Shanahan (2015) gave an update on the topic.

Microsoft cofounder Paul Allen countered

with The Singularity isn’t Near (2011). He didn’t dispute the possibility of ultraintelligent machines; he just thought it would take more than a century to get there. Rod Brooks is a

frequent critic of singularitarianism; he points out that technologies often take longer than

predicted to mature, that we are prone to magical thinking, and that exponentials don’t last forever (Brooks, 2017). On the other hand, for every optimistic singularitarian there is a pessimist who fears

new technology. The Web site pessimists.co shows that this has been true throughout

history: for example, in the 1890s people were concerned that the elevator would inevitably

cause nausea, that the telegraph would lead to loss of privacy and moral corruption, that the subway would release dangerous underground air and disturb the dead, and that the bicycle— especially the idea of a woman riding one—was the work of the devil. Hans

Moravec

(2000) introduces

some of the ideas of transhumanism,

and Bostrom

(2005) gives an updated history. Good’s ultraintelligent machine idea was foreseen a hun-

dred years earlier in Samuel Butler’s Darwin Among

the Machines (1863).

years after the publication of Charles Darwin’s On the Origins of Species

Written four

and at a time when

the most sophisticated machines were steam engines, Butler’s article envisioned “the ultimate development of mechanical consciousness” by natural selection. The theme was reiterated by

George Dyson (1998) in a book of the same title, and was referenced by Alan Turing, who

wrote in 1951 “At some stage therefore we should have to expect the machines to take control in the way that is mentioned in Samuel Butler’s Erewhon™ (Turing, 1996).

Robot rights: A book edited by Yorick Wilks (2010) gives different perspectives on how we should deal with artificial companions, ranging from Joanna Bryson’s view that robots should serve us as tools, not as citizens, to Sherry Turkle’s observation that we already per-

Bibliographical and Historical Notes sonify our computers and other tools, and are quite willing to blur the boundaries between

machines and life. Wilks also contributed a recent update on his views (Wilks, 2019). The philosopher David Gunkel’s book Robot Rights (2018) considers four possibilities: can robots have rights or not, and should they or not? The American Society for the Prevention of Cruelty to Robots (ASPCR) proclaims that “The ASPCR is, and will continue to be, exactly as

serious as robots are sentient.” The future of work:

In 1888, Edward Bellamy published the best-seller Looking Back-

ward, which predicted that by the year 2000, technological advances would led to a utopia where equality is achieved and people work short hours and retire early. Soon after, E. M. Forster took the dystopian view in The Machine Stops (1909), in which a benevolent machine

takes over the running of a society; things fall apart when the machine inevitably fails. Nor-

bert Wiener’s prescient book The Human Use of Human Beings (1950) argues for the benefits

of automation in freeing people from drudgery while offering more creative work, but also

discusses several dangers that we recognize as problems today, particularly the problem of value alignment. The book Disrupting Unemployment (Nordfors et al., 2018) discuss some of the ways that work is changing, opening opportunities for new careers. Erik Brynjolfsson and Andrew McAfee address these themes and more in their books Race Against the Machine (2011) and The Second Machine Age (2014). Ford (2015) describes the challenges of increasing automation, and West (2018) provides recommendations to mitigate the problems, while MIT’s

Thomas Malone (2004) shows that many of the same issues were apparent a decade earlier, but at that time were attributed to worldwide communication networks, not to automation.

1011

TR 28

THE FUTURE OF Al In which we try to see a short distance ahead.

In Chapter 2, we decided to view Al as the task of designing approximately rational agents. A variety of different agent designs were considered, ranging from reflex agents to knowledge-

based decision-theoretic agents to deep learning agents using reinforcement learning. There is also variety in the component technologies from which these designs are assembled: logical, probabilistic, or neural reasoning; atomic, factored, or structured representations of states; various learning algorithms from various types of data; sensors

and actuators to interact with

the world. Finally, we have seen a variety of applications, in medicine, finance, transportation,

communication,

and other fields.

There has been progress on all these fronts, both in our

scientific understanding and in our technological capabilities.

Most experts are optimistic about continued progress; as we saw on page 28, the median

estimate is for approximately human-level Al across a broad variety of tasks somewhere in

the next 50 to 100 years. Within the next decade, Al is predicted to add trillions of dollars to

the economy each year. But as we also saw, there are some critics who think general Al is centuries off, and there are numerous ethical concerns about the fairness, equity, and lethality of AL In this chapter, we ask: where are we headed and what remains to be done?

We do

that by asking whether we have the right components, architectures, and goals to make Al a successful technology that delivers benefits to the world.

28.1

Al Components

This section examines the components of Al systems and the extent to which each of them might accelerate or hinder future progress.

Sensors and actuators For much of the history of Al, direct access to the world has been glaringly absent. With a

few notable exceptions, Al systems were built in such a way that humans had to supply the inputs and interpret the outputs.

Meanwhile, robotic systems focused on low-level tasks in

which high-level reasoning and planning were largely ignored and the need for perception

was minimized. This was partly due to the great expense and engineering effort required to

get real robots to work at all, and partly because of the lack of sufficient processing power and sufficiently effective algorithms to handle high-bandwidth visual input.

The situation has changed rapidly in recent years with the availability of ready-made programmable robots. These, in turn, have benefited from compact reliable motor drives and

improved sensors. The cost of lidar for a self-driving car has fallen from $75,000 to $1,000,

Section 28.1 Al Components and a single-chip version may reach $10 per unit (Poulton and Watts, 2016). Radar sensors, once capable of only coarse-grained detection, are now sensitive enough to count the number

of sheets in a stack of paper (Yeo ef al., 2018). The demand for better image processing in cellphone cameras has given us inexpensive

high-resolution cameras for use in robotics. MEMS (micro-electromechanical systems) tech-

nology has supplied miniaturized accelerometers, gyroscopes, and actuators small enough to fit in artificial flying insects (Floreano et al., 2009; Fuller et al., 2014). It may be possible to combine millions of MEMS devices to produce powerful macroscopic actuators. 3-D printing (Muth ez al., 2014) and bioprinting (Kolesky ef al., 2014) have made it easier to experiment with prototypes.

Thus, we see that Al systems are at the cusp of moving from primarily software-only sys-

tems to useful embedded robotic systems. The state of robotics today is roughly comparable

to the state of personal computers in the early 1980s: at that time personal computers were

becoming available, but it would take another decade before they became commonplace. It is

likely that flexible, intelligent robots will first make strides in industry (where environments are more controlled,

tasks are more repetitive, and the value of an investment

is easier to

measure) before the home market (where there is more variability in environment and tasks). Representing the state of the world

Keeping track of the world requires perception as well as updating of internal representations.

Chapter 4 showed how to keep track of atomic state representations; Chapter 7 described

how to do it for factored (propositional) state representations; Chapter 10 extended this to

first-order logic; and Chapter 14 described probabilistic reasoning over time in uncertain environments.

Chapter 21 introduced recurrent neural networks, which are also capable of

maintaining a state representation over time.

Current filtering and perception algorithms can be combined to do a reasonable job of recognizing objects (“that’s a cat”) and reporting low-level predicates (“the cup is on the table”). Recognizing higher-level actions, such as “Dr. Russell is having a cup of tea with Dr. Norvig while discussing plans for next week,” is more difficult. Currently it can sometimes be done

(see Figure 25.17 on page 908) given enough training examples, but future progress will require techniques that generalize to novel situations without requiring exhaustive examples

(Poppe, 2010; Kang and Wildes, 2016). Another problem is that although the approximate filtering algorithms from Chapter 14 can handle quite large environments, they are still dealing with a factored representation—

they have random variables, but do not represent objects and relations explicitly. Also, their notion of time is restricted to step-by-step change; given the recent trajectory of a ball, we

can predict where it will be at time 7+ 1, but it is difficult to represent the abstract idea that

what goes up must come down. Section 15.1 explained how probability and first-order logic can be combined to solve

these problems; Section 15.2 showed how we can handle uncertainty about the identity of

objects; and Chapter 25 showed how recurrent neural networks enable computer vision to track the world; but we don’t yet have a good way of putting all these techniques together.

Chapter 24 showed how word embeddings and similar representations can free us from the

strict bounds of concepts defined by necessary and sufficient conditions. It remains a daunting

task to define general, reusable representation schemes for complex domains.

1013

1014

Chapter 28 The Future of AT Selecting actions

The primary difficulty in action selection in the real world is coping with long-term plans—

such as graduating from college in four years—that consist of billions of primitive steps.

Search algorithms that consider sequences of primitive actions scale only to tens or perhaps hundreds of steps. It is only by imposing hierarchical structure on behavior that we humans

cope at all. We saw in Section 11.4 how to use hierarchical representations to handle problems

of this

scale; furthermore, work in hierarchical reinforcement learning has succeeded in

combining these ideas with the MDP formalism described in Chapter 17. As yet, these methods have not been extended to the partially observable case (POMDPs).

Moreover, algorithms for solving POMDPs are typically using the same atomic state repre-

sentation we used for the search algorithms of Chapter 3. There is clearly a great deal of

work to do here, but the technical foundations are largely in place for making progress. The main missing element is an effective method for constructing the hierarchical representations

of state and behavior that are necessary for decision making over long time scales. Deciding what we want

Chapter 3 introduced search algorithms to find a goal state. But goal-based agents are brittle when the environment is uncertain, and when there are multiple factors to consider. In princi-

ple, utility-maximization agents address those issues in a completely general way. The fields of economics and game theory, as well as AI, make use of this insight: just declare what you want to optimize, and what each action does, and we can compute the optimal action.

In practice, however, we now realize that the task of picking the right utility function is a

challenging problem in its own right. Imagine, for example, the complex web of interacting preferences that must be understood by an agent operating as an office assistant for a human

being. The problem is exacerbated by the fact that each human is different, so an agent just “out of the box™ will not have enough experience with any one individual to learn an accurate

preference model; it will necessarily need to operate under preference uncertainty.

Further

complexity arises if we want to ensure that our agents are acting in a way that is fair and equitable for society, rather than just one individual.

We do not yet have much experience with building complex real-world preference mod-

els, let alone probability distributions over such models.

Although there are factored for-

malisms, similar to Bayes nets, that are intended to decompose preferences over complex states, it has proven difficult to use these formalisms in practice.

One reason may be that

preferences over states are really compiled from preferences over state histories, which are

described by reward functions (see Chapter 17). Even if the reward function is simple, the

corresponding utility function may be very complex.

This suggests that we take seriously the task of knowledge engineering for reward func-

tions as a way of conveying to our agents what we want them to do. The idea of inverse

reinforcement learning (Section 22.6) is one approach to this problem when we have an

expert who can perform a task, but not explain it. We could also use better languages for

expressing what we want. For example, in robotics, linear temporal logic makes it easier to say what things we want to happen in the near future, what things we want to avoid, and what

states we want to persist forever (Littman er al., 2017). We need better ways of saying what

we want and better ways for robots to interpret the information we provide.

Section 28.1 Al Components The computer industry as a whole has developed a powerful ecosystem for aggregating user preferences. When you click on something in an app, online game, social network, or shopping site, that serves as a recommendation that you (and your similar peers) would like to see similar things in the future. (Or it might be that the site is confusing and you clicked on the wrong thing—the data are always noisy.) The feedback inherent in this system makes

it very effective in the short run for picking out ever more addictive games and videos.

But these systems often fail to provide an easy way of opting out—your device will auto-

play a relevant video, but it is less likely to tell you “maybe it is time to put away your devices and take a relaxing walk in nature.” A shopping site will help you find clothes that match your style, but will not address world peace or ending hunger and poverty. To the extent that the menu of choices is driven by companies trying to profit from a customer’s attention, the menu

will remain incomplete.

However, companies do respond to customers’ interests, and many customers have voiced

the opinion that they are interested in a fair and sustainable world. Tim O’Reilly explains why profitis not the only motive with the following analogy: “Money is like gasoline during a road trip. You don’t want to run out of gas on your trip, but you’re not doing a tour of gas stations. You have to pay attention to money, but it shouldn’t be about the money.”

Tristan Harris’s time well spent movement at the Center for Humane Technology is a Time well spent step towards giving us more well-rounded choices (Harris, 2016). The movement addresses an issue that was recognized by Herbert Simon in 1971: “A wealth of information creates a poverty of attention.” Perhaps in the future we will have personal agents that stick up for Personal agent our true long-term interests rather than the interests of the corporations whose apps currently

fill our devices. It will be the agent’s job to mediate the offerings of various vendors, protect

us from addictive attention-grabbers, and guide us towards the goals that really matter to us.

Learning Chapters 19 to 22 described how agents can learn. Current algorithms can cope with quite large problems, reaching or exceeding human capabilities in many tasks—as long as we have sufficient training examples and we are dealing with a predefined vocabulary of features and concepts. But learning can stall when data are sparse, or unsupervised, or when we are dealing with complex representations.

Much of the recent resurgence of Al in the popular press and in industry is due to the

success of deep learning (Chapter 21). On the one hand, this can be seen as the incremental

maturation of the subfield of neural networks.

On the other hand, we can see it as a rev-

olutionary leap in capabilities spurred by a confluence of factors: the availability of more

training data thanks to the Internet, increased processing power from specialized hardware,

and a few algorithmic tricks, such as generative adversarial networks (GANs), batch normalization, dropout, and the rectified linear (ReLU) activation function.

The future should see continued emphasis on improving deep learning for the tasks it excels at, and also extending it to cover other tasks. The brand name “deep learning” has proven to be so popular that we should expect its use to continue, even if the mix of techniques. that fuel it changes considerably.

‘We have seen the emergence of the field of data science as the confluence of statistics,

programming, and domain expertise. While we can expect to see continued development in

the tools and techniques necessary to acquire, manage, and maintain big data, we will also

1015

1016

Chapter 28 The Future of AT need advances in transfer learning so that we can take advantage of data in one domain to

improve performance on a related domain. The vast majority of machine learning research today assumes a factored representation, learning a function & : R" — R for regression and i : R" — {0, 1} for classification. Machine

learning has been less successful for problems that have only a small amount of data, or problems that require the construction of new structured, hierarchical representations. Deep learning, especially with convolutional networks applied to computer vision problems, has demonstrated some success in going from low-level pixels to intermediate-level concepts like Eye and Mouth, then to Face, and finally to Person or Cat.

A challenge for the future is to more smoothly combine learning and prior knowledge.

If we give a computer a problem it has

not encountered before—say, recognizing different

models of cars—we don’t want the system to be powerless until it has been fed millions of

labeled examples.

The ideal system should be able to draw on what it already knows: it should already have

amodel of how vision works, and how the design and branding of products in general work; now it should use transfer learning to apply that to the new problem of car models. It should

be able to find on its own information about car models, drawing from text, images, and

video available on the Internet. It should be capable of apprenticeship learning: having a conversation with a teacher, and not just asking “may I have a thousand images of a Corolla,” but rather being able to understand advice like “the Insight is similar to the Prius, but the

Insight has a larger grille.” It should know that each model comes in a small range of possible

colors, but that a car can be repainted, so there is a chance that it might see a car in a color

that was not in the training set. (If it didn’t know that, it should be capable of learning it, or being told about it.)

All this requires a communication and representation language that humans and computers can share; we can’t expect a human analyst to directly modify a model with millions of

weights. Probabilistic models (including probabilistic programming languages) give humans

some ability to describe what we know, but these models are not yet well integrated with

Differentiable programming

other learning mechanisms. The work of Bengio and LeCun (2007) is one step towards this integration. Recently Yann LeCun has suggested that the term “deep learning” should be replaced with the more

general differentiable programming (Siskind and Pearlmutter, 2016; Li ef al., 2018); this

suggests that our general programming languages and our machine learning models could be merged together. Right now, it is common to build a deep learning model that is differentiable, and thus can

be trained to minimize loss, and retrained when circumstances change. But that deep learning

model is only one part of a larger software system that takes in data, massages the data, feeds

it to the model, and figures out what to do with the model’s output. All these other parts of the

larger system were written by hand by a programmer, and thus are nondifferentiable, which means that when circumstances change, it is up to the programmer to recognize any problems and fix them by hand. With differentiable programming, the hope is that the entire system is subject to automated optimization.

The end goal is to be able to express what we know in whatever form is convenient to us:

informal advice given in natural language, a strong mathematical law like F = ma, a statistical

model accompanied by data, or a probabilistic program with unknown parameters that can

Section 28.1 Al Components

1017

be automatically optimized through gradient descent. Our computer models will learn from conversations with human experts as well as by using all the available data. Yann LeCun, Geoffrey Hinton, and others have suggested that the current emphasis on supervised learning (and to a lesser extent reinforcement learning) is not sustainable—that computer models will have to rely on weakly supervised learning, in which some supervi-

sion is given with a small number of labeled examples and/or a small number of rewards, but

most of the learning is unsupervised, because unannotated data are so much more plentiful.

LeCun uses the term predictive learning for an unsupervised learning system that can Predictive learning

model the world and learn to predict aspects of future states of the world—not just predict

labels for inputs that are independent and identically distributed with respect to past data, and

not just predict a value function over states. He suggests that GANs (generative adversarial networks) can be used to learn to minimize the difference between predictions and reality.

Geoffrey Hinton stated in 2017 that “My view is throw it all away and start again,” mean-

ing that the overall idea of learning by adjusting parameters in a network is enduring, but the

specifics of the architecture of the networks and the technique of back-propagation need to be

rethought. Smolensky (1988) had a prescription for how to think about connectionist models; his thoughts remain relevant today. Resources

Machine learning research and development has been accelerated by the increasing availabil-

ity of data, storage, processing power, software, trained experts, and the investments needed to

support them. Since the 1970s, there has been a 100,000-fold speedup in general-purpose processors and an additional 1,000-fold speedup due to specialized machine learning hardware.

The Web has served as a rich source of images, videos, speech, text, and semi-structured data,

currently adding over 10'8 bytes every day.

Hundreds of high-quality data sets are available for a range of tasks in computer vision,

speech recognition, and natural language processing. If the data you need is not already

available, you can often assemble it from other sources, or engage humans to label data for

you through a crowdsourcing platform. Validating the data obtained in this way becomes an important part of the overall workflow (Hirth et al., 2013).

An important recent development is the shift from shared data to shared models.

The

major cloud service providers (e.g., Amazon, Microsoft, Google, Alibaba, IBM, Salesforce) have begun competing to offer machine learning APIs with pre-built models for specific tasks such as visual object recognition, speech recognition, and machine translation. These models

can be used as is, or can serve as a baseline to be customized with your particular data for

your particular application.

‘We expect that these models will improve over time, and that it will become unusual to

start a machine learning project from scratch, just as it is now unusual to do a Web develop-

ment project from scratch, with no libraries. It is possible that a big jump in model quality

will occur when it becomes economical to process all the video on the Web; for example, the YouTube platform alone adds 300 hours of video every minute.

Moore’s law has made it more cost effective to process data;

a megabyte of storage cost $1

million in 1969 and less then $0.02 in 2019, and supercomputer throughput has increased by a

factor of more than 10'” in that time. Specialized hardware components for machine learning

such as graphics processing units (GPUS), tensor cores, tensor processing units (TPUs), and

Shared model

1018

Chapter 28 The Future of AT field programmable gate arrays (FPGAs) are hundreds of times faster than conventional CPUs for machine learning training (Vasilache ef al., 2014; Jouppi et al., 2017). In 2014 it took a full day to train an ImageNet model; in 2018 it takes just 2 minutes (Ying ef al., 2018).

The OpenAl Institute reports that the amount of compute power used to train the largest machine learning models doubled every 3.5 months from 2012 to 2018, reaching over an exaflop/second-day for ALPHAZERO (although they also report that some very influential work used 100 million times less computing power (Amodei and Hernandez, 2018)). The

same economic trends that have made cell-phone cameras cheaper and better also apply to processors—we will see continued progress in low-power, high-performance computing that benefits from economies of scale.

There is a possibility that quantum computers could accelerate Al Currently there are

some fast quantum algorithms for the linear algebra operations used in machine learning (Harrow et al., 2009; Dervovic et al., 2018), but no quantum computer capable of running them. We have some example applications of tasks such as image classification (Mott et al., 2017) where quantum algorithms are as good as classical algorithms on small problems. Current quantum computers handle only a few tens of bits, whereas machine learning algorithms often handle inputs with millions of bits and create models with hundreds of mil-

lions of parameters. So we need breakthroughs in both quantum hardware and software to make quantum computing practical for large-scale machine learning. Alternatively, there may

be a division of labor—perhaps a quantum algorithm to efficiently search the space of hyperparameters while the normal training process runs on conventional computers—but we don’t know how to do that yet. Research on quantum algorithms can sometimes inspire new and better algorithms on classical computers (Tang, 2018). We have also seen exponential growth in the number of publications, people, and dollars

in Al/machine learning/data science. Dean er al. (2018) show that the number of papers about “machine learning” on arXiv doubled every two years from 2009 to 2017. Investors are

funding startup companies in these fields, large companies are hiring and spending as they

determine their Al strategy, and governments are investing to make sure their country doesn’t fall too far behind.

28.2

Al Architectures

It is natural to ask, “Which of the agent architectures in Chapter 2 should an agent use?” The answer is, “All of them!” Reflex responses are needed for situations in which time is of the

essence, whereas knowledge-based deliberation allows the agent to plan ahead. Learning is convenient when we have lots of data, and necessary when the environment is changing, or when human designers have insufficient knowledge of the domain.

AI has long had a split between symbolic systems (based on logical and probabilistic

inference) and connectionist systems (based on loss minimization over a large number of

uninterpreted parameters).

A continuing challenge for Al is to bring these two together,

to capture the best of both. Symbolic systems allow us to string together long chains of

reasoning and to take advantage of the expressive power of structured representations, while connectionist systems can recognize patterns even in the face of noisy data. One line of

research aims to combine probabilistic programming with deep learning, although as yet the various proposals are limited in the extent to which the approaches are truly merged.

Section 28.2

Al Architectures

1019

Agents also need ways to control their own deliberations. They must be able to use the

available time well, and cease deliberating when action is demanded. For example, a taxidriving agent that sees an accident ahead must decide in a split second whether to brake or

swerve. It should also spend that split second thinking about the most important questions,

such as whether the lanes to the left and right are clear and whether there is a large truck close

behind, rather than worrying about where to pick up the next passenger. These issues are usually studied under the heading of real-time AL As Al systems move into more complex Real-time Al domains, all problems will become real-time, because the agent will never have long enough to solve the decision problem exactly.

Clearly, there is a pressing need for general methods of controlling deliberation, rather

than specific recipes for what to think about in each situation.

The first useful idea is the

anytime algorithms (Dean and Boddy, 1988; Horvitz, 1987): an algorithm whose output Anytime algorithm quality improves gradually over time, so that it has a reasonable decision ready whenever it is

interrupted. Examples of anytime algorithms include iterative deepening in game-tree search and MCMC in Bayesian networks. The second technique for controlling deliberation is decision-theoretic metareasoning (Russell and Wefald, 1989; Horvitz and Breese, 1996; Hay ez al., 2012). This method, which

was mentioned briefly in Sections 3.6.5 and 5.7, (Chapter

Decision-theoretic metareasoning

applies the theory of information value

16) to the selection of individual computations (Section 3.6.5).

The value of a

computation depends on both its cost (in terms of delaying action) and its benefits (in terms

of improved decision quality).

Metareasoning techniques can be used to design better search algorithms and to guarantee

that the algorithms have the anytime property. Monte Carlo tree search is one example: the

choice of leaf node at which to begin the next playout is made by an approximately rational metalevel decision derived from bandit theory.

Metareasoning is more expensive than reflex action, of course, but compilation methods can be applied so that the overhead is small compared to the costs of the computations being controlled.

Metalevel reinforcement learning may provide another way to acquire effective

policies for controlling deliberation: in essence, computations that lead to better decisions are

reinforced, while those that turn out to have no effect are penalized. This approach avoids the

myopia problems of the simple value-of-information calculation.

Metareasoning is one specific example of a reflective architecture—that is, an architec-

ture that enables deliberation about the computational entities and actions occurring within

the architecture itself.

A theoretical foundation for reflective architectures can be built by

defining a joint state space composed from the environment state and the computational state

of the agent itself. Decision-making and learning algorithms can be designed that operate

over this joint state space and thereby serve to implement and improve the agent’s compu-

tational activities. Eventually, we expect task-specific algorithms such as alpha—beta search,

regression planning, and variable elimination to disappear from Al systems, to be replaced

by general methods that direct the agent’s computations toward the efficient generation of

high-quality decisions.

Metareasoning and reflection (and many other efficiency-related architectural and algo-

rithmic devices explored in this book) are necessary because making decisions

is hard. Ever

since computers were invented, their blinding speed has led people to overestimate their ability to overcome complexity, or, equivalently, to underestimate what complexity really means.

Reflective architecture

1020

Chapter 28 The Future of AT The truly gargantuan power of today’s machines tempts one to think that we could bypass all the clever devices and rely more on brute force. So let’s try to counteract this tendency.

We begin with what physicists believe to be the speed of the ultimate 1kg computing device: about 105! operations per second, or a billion trillion trillion times faster than the fastest supercomputer as of 2020 (Lloyd, 2000)." Then we propose a simple task: enumerating strings of English words, much as Borges proposed in The Library of Babel. Borges stipulated books of 410 pages.

Would that be feasible? Not quite.

In fact, the computer running for a year

could enumerate only the 11-word strings. Now consider the fact that a detailed plan for a human life consists of (very roughly) twenty trillion potential muscle actuations (Russell, 2019), and you begin to see the scale of the problem.

A computer that is a billion trillion trillion times more powerful than the

human brain is much further from being rational than a slug is from overtaking the starship Enterprise traveling at warp nine.

With these considerations in mind, it seems that the goal of building rational agents is

perhaps a little too ambitious.

Rather than aiming for something that cannot possibly exist,

we should consider a different normative target—one that necessarily exists.

Chapter 2 the following simple idea:

Recall from

agent = architecture + program .

Now fix the agent architecture (the underlying machine capabilities, perhaps with a fixed software layer on top) and allow the agent program to vary over all possible programs that the architecture can support. In any given task environment, one of these programs (or an equiv-

alence class of them) delivers the best possible performance—perhaps not close to perfect

Bounded optimality

rationality, but still better than any other agent program.

the criterion of bounded optimality.

We say that this program satisfies

Clearly it exists, and clearly it constitutes a desirable

goal. The trick is finding it, or something close to it. For some elementary classes of agent programs in simple real-time environments, it is possible to identify bounded-optimal agent programs (Etzioni, 1989; Russell and Subramanian, 1995). The success of Monte Carlo tree search has revived interest in metalevel decision

making, and there is reason to hope that bounded optimality within more complex families of

agent programs can be achieved by techniques such as metalevel reinforcement learning. It should also be possible to develop a constructive theory of architecture, beginning with theorems on the bounded optimality of suitable methods of combining different bounded-optimal

components such as reflex and action-value systems. General Al

Much of the progress in Al in the 21st century so far has been guided by competition on narrow tasks, such as the DARPA Grand Challenge for autonomous cars, the ImageNet object recognition competition, or playing Go, chess, poker, or Jeopardy! against a world champion. For each separate task, we build a separate Al system, usually with a separate machine learning model trained from scratch with data collected specifically for this task. But a truly

intelligent agent should be able to do more than one thing. Alan Turing (1950) proposed his list (page 982) and science fiction author Robert Heinlein (1973) countered with:

! We gloss over the fact that this device consumes the entire energy output of a star and operates at a billion degrees cei

Section 28.2

Al Architectures

A human being should be able to change a diaper, plan an invasion, butcher a hog, conn a ship, design a building, write a sonnet, balance accounts, build a wall, set a bone, comfort the dying, take orders, give orders, cooperate, act alone, solve equations, analyse a new problem, pitch manure, program a computer, cook a tasty meal, fight efficiently, die gallantly. Specialization is for insects.

So far, no Al system measures up to either of these lists, and some proponents of general or human-level Al (HLAI) insist that continued work on specific tasks (or on individual com-

ponents) will not be enough to reach mastery on a wide variety of tasks; that we will need a

fundamentally new approach. It seems to us that numerous new breakthroughs will indeed be necessary, but overall, Al as a field has made a reasonable exploration/exploitation tradeoff, assembling a portfolio of components, improving on particular tasks, while also exploring promising and sometimes far-out new ideas.

It would have been a mistake to tell the Wright brothers in 1903 to stop work on their

single-task airplane and design an “artificial general flight” machine that can take off vertically, fly faster than sound, carry hundreds of passengers, and land on the moon. It also would have been a mistake to follow up their first flight with an annual competition to make spruce wood biplanes incrementally better.

We have seen that work on components can spur new ideas; for example, generative adversarial networks (GANs) and transformer language models each opened up new areas of research.

We have also seen steps towards “diversity of behaviour.”

For example, machine

translation systems in the 1990s were built one at a time for each language pair (such as

French to English), but today a single system can identifying the input text as being one of a hundred languages, and translate it into any of 100 target languages. Another natural language system can perform five distinct tasks with one joint model (Hashimoto et al., 2016).

Al engineering

The field of computer programming started with a few extraordinary pioneers. But it didn’t reach the status of a major industry until a practice of software engineering was developed,

with a powerful collection of widely available tools, and a thriving ecosystem of teachers, students, practitioners, entrepreneurs, investors, and customers.

The Al industry has not yet reached that level of maturity. We do have a variety of pow-

erful tools and frameworks, such as TensorFlow, Keras, PyTorch, CAFFE, Scikit-Learn and

ScIPY. But many of the most promising approaches, such as GANs and deep reinforcement learning, have proven to be difficult to work with—they require experience and a degree of

fiddling to get them to train properly in a new domain. We don’t have enough experts to do this across all the domains where we need it, and we don’t yet have the tools and ecosystem to let less-expert practitioners succeed. Google’s Jeff Dean sees a future where we will want machine learning to handle millions

of tasks; it won’t be feasible to develop each of them from scratch, so he suggests that rather

than building each new system from scratch, we should start with a single huge system and, for each new task, extract from it the parts that are relevant to the task. We have seen some steps

in this direction, such as the transformer language models (e.g., BERT,

GPT-2) with

billions of parameters, and an “outrageously large” ensemble neural network architecture

that scales up to 68 billion parameters in one experiment (Shazeer et al., 2017). Much work remains to be done.

1021

1022

Chapter 28 The Future of AT The future ‘Which way will the future go? Science fiction authors seem to favor dystopian futures over utopian ones, probably because they make for more interesting plots. So far, Al seems to fit

in with other powerful revolutionary technologies

such as printing, plumbing, air travel, and

telephony. All these technologies have made positive impacts, but also have some unintended

side effects that disproportionately impact disadvantaged classes. We would do well to invest in minimizing the negative impacts.

Alis also different from previous revolutionary technologies. Improving printing, plumb-

ing, air travel, and telephony to their logical limits would not produce anything to threaten

human supremacy in the world. Improving Al to its logical limit certainly could.

In conclusion, Al has made great progress in its short history, but the final sentence of

Alan Turing’s (1950) essay on Computing Machinery and Intelligence is still valid today: We can see only a short distance ahead,

but we can see that much remains to be done.

ESEGTA MATHEMATICAL BACKGROUND A.1

Complexity Analysis and O() Notation

Computer scientists are often faced with the task of comparing algorithms to see how fast

they run or how much memory they require. There are two approaches to this task. The first is benchmarking—running the algorithms on a computer and measuring speed in seconds

and memory consumption in bytes. Ultimately, this is what really matters, but a benchmark can be unsatisfactory because it is so specific:

it measures the performance of a particular

Benchmarking

program written in a particular language, running on a particular computer, with a particular compiler and particular input data. From the single result that the benchmark provides, it can be difficult to predict how well the algorithm would do on a different compiler, computer, or of data set. The second approach relies on a mathematical analysis of algorithms, independent Analysis algorithms of the particular implementation and input, as discussed below.

A.1.1

Asymptotic analysis

We will consider algorithm analysis through the following example, a program to compute the sum of a sequence of numbers: function SUMMATION(sequence) returns a number sum 0 for i = 1 to LENGTH(sequence) do

sum & sum + sequenceli] return sum

The first step in the analysis is to abstract over the input, in order to find some parameter or parameters that characterize the size of the input. In this example, the input can be charac-

terized by the length of the sequence, which we will call n. The second step is to abstract over the implementation, to find some measure that reflects the running time of the algorithm

but is not tied to a particular compiler or computer. For the SUMMATION program, this could be just the number of lines of code executed, or it could be more detailed, measuring the

number of additions, assignments, array references, and branches executed by the algorithm.

Either way gives us a characterization of the total number of steps taken by the algorithm as a function of the size of the input. We will call this characterization T (n). If we count lines

of code, we have T (1) =2n+2 for our example.

If all programs were as simple as SUMMATION,

the analysis of algorithms would be a

trivial field. But two problems make it more complicated. First, it is rare to find a parameter

like n that completely characterizes the number of steps taken by an algorithm. Instead,

the best we can usually do is compute the worst case Tyors (1) O the average case Ty ().

Computing an average means that the analyst must assume some distribution of inputs.

1024

Appendix A

Mathematical Background

The second problem is that algorithms tend to resist exact analysis. In that case, it is

necessary to fall back on an approximation. We say that the SUMMATION algorithm is O(n),

meaning that its measure is at most a constant times n, with the possible exception of a few small values of n. More formally, Asymptotic analysis

T(n) is O(f(n)) if T(n) < kf(n) for some k, for all n > no.

The O() notation gives us what is called an asymptotic analysis. We can say without question that, as n asymptotically approaches infinity, an O(n) algorithm is better than an O(n?)

algorithm. A single benchmark figure could not substantiate such a claim.

The O() notation abstracts over constant factors, which makes it easier to use, but less

precise, than the T() notation. For example, an O(n2) algorithm will always be worse than an O(n) in the long run, but if the two algorithms are T'(n* + 1) and 7100+

O(n?) algorithm is actually better for n < 110. Despite this drawback, asymptotic anal;

1000), then the

is the most widely used tool for analyzing

algorithms. It is precisely because the analysis abstracts over both the exact number of operations (by ignoring the constant factor ) and the exact content of the input (by considering

only its size n) that the analysis becomes mathematically feasible. The O() notation is a good compromise between precision and ease of analysis. A.1.2

Complexity analysis

NP and inherently hard problems

The analysis of algorithms and the O() notation allow us to talk about the efficiency of a particular algorithm. However, they have nothing to say about whether there could be a better algorithm for the problem at hand. The field of complexity analysis analyzes problems rather than algorithms. The first gross division is between problems that can be solved in polynomial time and problems that cannot be solved in polynomial time, no matter what algorithm is used. The class of polynomial problems—those which can be solved in time O(n) for some k—is called P. These are sometimes called “easy” problems, because the class contains those

problems with running times like O(logn) and O(n). But it also contains those with time

NP

O(n'), 50 the name “easy™ should not be taken too literally.

Another important class of problems is NP, the class of nondeterministic polynomial

problems. A problem is in this class

if there is some algorithm that can guess

a solution and

then verify whether a guess is correct in polynomial time. The idea is that if you have an

arbitrarily large number of processors so that you can try all the guesses at once, or if you are

very lucky and always guess right the first time, then the NP problems become P problems.

One of the biggest open questions in computer science is whether the class NP is equivalent

to the class P when one does not have the luxury of an infinite number of processors or

omniscient guessing. Most computer scientists are convinced that P 7 NP; that NP problems

are inherently hard and have no polynomial-time algorithms. But this has never been proven.

NP-complete

Those who are interested in deciding whether P = NP look at a subclass of NP called the NP-complete problems. The word “complete” is used here in the sense of “most extreme” and thus refers to the hardest problems in the class NP. It has been proven that either all the NP-complete problems are in P or none of them is. This makes the class theoretically

interesting, but the class is also of practical interest because many important problems are known to be NP-complete. An example is the satisfiability problem: given a sentence of propositional logic, is there an assignment of truth values to the proposition symbols of the sentence that makes it true? Unless a miracle occurs and P = NP, there can be no algorithm

Section A.2

1025

Vectors, Matrices, and Linear Algebra

that solves all satisfiability problems in polynomial time. However, Al is more interested in whether there are algorithms that perform efficiently on typical problems drawn from a pre-

determined distribution; as we saw in Chapter 7, there are algorithms such as WALKSAT that

do quite well on many problems. The class of NP-hard problems consists of those problems that are reducible (in poly- NP-hard nomial time) to all the problems in NP, so if you solved any NP-hard problem, you could solve all the problems in NP. The NP-complete problems are all NP-hard, but there are some NP-hard problems that are even harder than NP-complete.

The class co-NP is the complement of NP, in the sense that, for every decision problem in NP, there is a corresponding problem in co-NP with the “yes” and “no” answers reversed.

‘We know that P is a subset of both NP and co-NP, and it is believed that there are problems in co-NP that are not in P. The co-NP-complete problems are the hardest problems in co-NP. The class #P (pronounced “number P™ according to Garey and Johnson (1979), but often

CoNP Co-NP-complete

pronounced “sharp P”) is the set of counting problems corresponding to the decision problems. in NP. Decision problems have a yes-or-no answer: is there a solution to this 3-SAT formula?

Counting problems have an integer answer: how many solutions are there to this 3-SAT formula? In some cases, the counting problem is much harder than the decision problem. For example, deciding whether a bipartite graph has a perfect matching can be done in time O(VE) (where the graph has V vertices and E edges), but the counting problem “how many perfect matches does this bipartite graph have” is #P-complete, meaning that it is hard as any problem in #P and thus at least as hard as any NP problem. Another class is the class of PSPACE problems—those that require a polynomial amount

of space, even on a nondeterministic machine. It is believed that PSPACE-hard problems are

worse than NP-complete problems, although it could turn out that NP = PSPACE, just as it could turn out that P = NP.

A.2

Vectors,

Matrices,

and

Linear

Algebra

Mathematicians define a vector as a member of a vector space, but we will use a more con-

crete definition: a vector is an ordered sequence of values. For example, in two-dimensional

space, we have vectors such as x= (3,4) and y = (0,2). We follow the convention of boldface

characters for vector names, although some authors use arrows or bars over the names: X or

. The elements of a vector can be accessed using subscripts: 2= (1,22,...,2,). One confusing point: this book is synthesizing work from many subfields, which variously call their sequences vectors, lists, or tuples, and variously use the notations (1,2), [1, 2], or (1, 2). The two fundamental operations on vectors are vector addition and scalar multiplication.

The vector addition x +y is the elementwise sum: x+y=(3 +0,4 +2)=(3,6). Scalar multiplication multiplies each element by a constant: 5x= (5 x 3,5 x 4) = (15,20). The length of a vector is denoted [x| and is computed by taking the square root of the sum of the squares of the elements:

[x|=+/(3%+4%)=5.

The dot product x-y (also called

scalar product) of two vectors is the sum of the products of corresponding elements, that is, X-y= ¥ X,yi, or in our particular case, X-y=3 x 0+4 x 2=8. Vectors are often interpreted as directed line segments (arrows) in an n-dimensional Eu-

clidean space. Vector addition is then equivalent to placing the tail of one vector at the head

of the other, and the dot product x-y is [x| |y| cosf, where is the angle between x and y.

Vector

1026

Appendix A

Mathematical Background

Matrix

A matrix is a rectangular array of values arranged into rows and columns. Here is a matrix A of size 3 x 4: A Axr Asr

Az Az Arg Azz Ags Agy Asz Asz Az

The first index of A

specifies the row and the second the column. In programming lan-

guages, A, is often written A[1, 3] or A[1] [].

The sum of two matrices is defined by adding their corresponding elements; for example (A+B);j=A;;+B;. (The sum is undefined if A and B have different sizes.) We can also define the multiplication of a matrix by a scalar: (cA);;=cA,;. Matrix multiplication (the product of two matrices) is more complicated. The product AB is defined only if A is of size

ax band B is of size b x ¢ (i.e., the second matrix has the same number of rows as the first

has columns); the result is a matrix of size a x . If the matrices are of appropriate size, then the result is

(AB)= YA, By )

Matrix multiplication is not commutative, even for square matrices: AB # BA in general. It

is, however, associative: (AB)C = A(BC). Note that the dot product can be expressed in

has the property that AL=A for all A. The transpose of A, written A is formed by turning rows into columns and vice versa, or, more formally, by A™; ;= A ;. The inverse ofa square matrix A is another square matrix A~' such that A~'A =1. For a singular matrix, the inverse

does not exist. For a nonsingular matrix, it can be computed in O(#3) time.

Matrices are used to solve systems of linear equations in O(n?) time; the time is domi-

nated by inverting a matrix of coefficients. Consider the following set of equations, for which we want a solution in x, y, and z:

+2x+y—z = 8 “3x—y+2z = —11 “2ty+2 = 3. We can represent this system as the matrix equation Ax = b, where 21 -1 x 8 =[-3-1 2], v, —11 2 1 2 z -3

To solve Ax = b we multiply both sides by A~', yielding A~'Ax = A~'b, which simplifies 2 3

I

R

tox=A""b. After inverting A and multiplying by b, we get the answer I

Singular

The identity matrix I has elements I ; equal to 1 when i= j and equal to 0 otherwise. It

=

Identity matrix Transpose Inverse

terms of a transpose and a matrix multiplication: x-y =x"y.

-1

A few more miscellaneous points: we use log(x) for the natural logarithm, log,(x). We use argmax, f(x) for the value ofx for which £(x) is maximal.

Section A.3

A.3

Probability Distribu

1027

Probability Distributions

ns

A probability is a measure over a set of events that satisfies three axioms:

1. The measure of each event is between 0 and 1. We write this as 0 < P(X=x;) < I, where X is a random variable representing an event and x; are the possible values of X. In general, random variables are denoted by uppercase letters and their values by lowercase letters. 2. The measure of the whole set is 1: that is, ¥/ P(X =x;) = 1. 3. The probability of a union of disjoint events is the sum of the probabilities of the individual events; that is, P(X =x; VX =x;) = P(X =x;) + P(X =x2), in the case where x| and x; are disjoint. A probabilistic model con:

of a sample space of mutually exclusive possible outcomes,

together with a probability measure for each outcome.

For example, in a model of the

weather tomorrow, the outcomes might be sun, cloud, rain, and snow.

A subset of these

outcomes constitutes an event. For example, the event of precipitation is the subset consist-

ing of {rain, snow}.

We use P(X) to denote the vector of values (P(X =x1),...,

P(X =x,)). We also use P(x;)

as an abbreviation for P(X =x;) and ¥, P(x) for T2 P(X =x;). The conditional probability P(B|A) is defined as P(BNA)/P(A). A and B are conditionally independent if P(B| A) = P(B) (or equivalently, P(A|B) = P(A)).

For continuous variables, there are an infinite number of values, and unless there are

point spikes, the probability of any one exact value is 0.

So it makes more sense to talk

about the value being within a range. We do that with a probability density function, which has a slightly different meaning from the discrete probability function. Since P(X =x)—the

Probability density function

probability that X has the value x exactly—is zero, we instead measure measure how likely it is that X falls into an interval around x, compared to the the width of the interval, and take

the limit as the interval width goes to zero:

P(x)= Jim0 P(x