English Pages 1152 [1134] Year 2020
Artificial Intelligence A Modern Approach Fourth Edition
PEARSON SERIES IN ARTIFICIAL INTELLIGENCE Stuart Russell and Peter Norvig, Editors
FORSYTH & PONCE
Computer Vision: A Modern Approach, 2nd ed.
JURAFSKY & MARTIN
Speech and Language Processing, 2nd ed.
RUSSELL & NORVIG
Artificial Intelligence: A Modern Approach, 4th ed.
GRAHAM
ANSI Common Lisp
NEAPOLITAN
Learning Bayesian Networks
Artificial Intelligence A Modern Approach Fourth Edition Stuart J. Russell and Peter Norvig
Contributing writers: Ming-Wei Chang, Jacob Devlin, Anca Dragan, David Forsyth, Ian Goodfellow, Jitendra M. Malik, Vikash Mansinghka, Judea Pearl, Michael Wooldridge
Pearson
Copyright © 2021, 2010, 2003 by Pearson Education, Inc. or its affiliates, 221 River Street, Hoboken, NJ 07030. All Rights Reserved. Manufactured in the United States of America. This publication is protected by copyright, and permission should be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise. For information regarding permissions, request forms, and the appropriate contacts within the Pearson Education Global Rights and Permissions department, please visit www.pearsoned.com/permissions/. Acknowledgments of third-party content appear on the appropriate page within the text. Cover Images: Alan Turing - Science History Images/Alamy Stock Photo Statue of Aristotle — Panos Karas/Shutterstock Ada Lovelace - Pictorial Press Ltd/Alamy Stock Photo Autonomous cars — Andrey Suslov/Shutterstock Atlas Robot ~ Boston Dynamics, Inc. Berkeley Campanile and Golden Gate Bridge — Ben Chu/Shutterstock Background ghosted nodes — Eugene Sergeev/Alamy Stock Photo Chess board with chess figure — Titania/Shutterstock Mars Rover - Stocktrek Images, Inc./Alamy Stock Photo Kasparov - KATHY WILLENS/AP Images PEARSON, ALWAYS LEARNING is an exclusive trademark owned by Pearson Education, Inc. or its affiliates in the U.S. and/or other countries.
Unless otherwise indicated herein, any third-party trademarks, logos, or icons that may appear in this work are the property of their respective owners, and any references to third-party trademarks, logos, icons, or other trade dress are for demonstrative or descriptive purposes only. Such references are not intended to imply any sponsorship, endorsement, authorization, or promotion of Pearson’s products
by the owners of such marks, or any relationship between the owner and Pearson Education, Inc., or its affiliates, authors, licensees, or distributors.
Library of Congress Cataloging-in-Publication Data
Russell, Stuart J. (Stuart Jonathan), author. | Norvig, Peter, author.
Artificial intelligence : a modern approach / Stuart J. Russell and Peter Norvig.
Description: Fourth edition. | Hoboken : Pearson, [2021] | Series: Pearson series in artificial intelligence | Includes bibliographical references and index. | Summary: “Updated edition of popular textbook on Artificial Intelligence.”— Provided by publisher.
Identifiers: LCCN 2019047498 | ISBN 9780134610993 (hardcover)
Subjects: LCSH: Artificial intelligence.
Classification: LCC Q335 .R86 2021 | DDC 006.3-dc23
LC record available at https://lccn.loc.gov/2019047498
ISBN-10: 0-13-461099-7
ISBN-13: 978-0-13-461099-3
For Loy, Gordon, Lucy, George, and Isaac — S.J.R. For Kris, Isabella, and Juliet — P.N.
Preface
Artificial Intelligence (AI) is a big field, and this is a big book. We have tried to explore the full breadth of the field, which encompasses logic, probability, and continuous mathematics; perception, reasoning, learning, and action; fairness, trust, social good, and safety; and applications that range from microelectronic devices to robotic planetary explorers to online services with billions of users.
The subtitle of this book is “A Modern Approach.” That means we have chosen to tell the story from a current perspective. We synthesize what is now known into a common framework, recasting early work using the ideas and terminology that are prevalent today. We apologize to those whose subfields are, as a result, less recognizable.
New to this edition
This edition reflects the changes in AI since the last edition in 2010:
• We focus more on machine learning rather than hand-crafted knowledge engineering, due to the increased availability of data, computing resources, and new algorithms.
• Deep learning, probabilistic programming, and multiagent systems receive expanded coverage, each with their own chapter.
• The coverage of natural language understanding, robotics, and computer vision has been revised to reflect the impact of deep learning.
• The robotics chapter now includes robots that interact with humans and the application of reinforcement learning to robotics.
• Previously we defined the goal of AI as creating systems that try to maximize expected utility, where the specific utility information—the objective—is supplied by the human designers of the system. Now we no longer assume that the objective is fixed and known by the AI system; instead, the system may be uncertain about the true objectives of the humans on whose behalf it operates. It must learn what to maximize and must function appropriately even while uncertain about the objective.
• We increase coverage of the impact of AI on society, including the vital issues of ethics, fairness, trust, and safety.
• We have moved the exercises from the end of each chapter to an online site. This allows us to continuously add to, update, and improve the exercises, to meet the needs of instructors and to reflect advances in the field and in AI-related software tools.
• Overall, about 25% of the material in the book is brand new. The remaining 75% has been largely rewritten to present a more unified picture of the field. 22% of the citations in this edition are to works published after 2010.
Overview of the book
The main unifying theme is the idea of an intelligent agent. We define AI as the study of agents that receive percepts from the environment and perform actions. Each such agent implements a function that maps percept sequences to actions, and we cover different ways to represent these functions, such as reactive agents, real-time planners, decision-theoretic systems, and deep learning systems. We emphasize learning both as a construction method for competent systems and as a way of extending the reach of the designer into unknown environments. We treat robotics and vision not as independently defined problems, but as occurring in the service of achieving goals. We stress the importance of the task environment in determining the appropriate agent design.
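As a rough illustration of an agent as a function from percept sequences to actions, the sketch below defines a purely reactive agent. The two-location world, the percept encoding `(location, status)`, and the action names are assumptions made for this example only; they are not taken from the text above.

```python
from typing import Sequence, Tuple

# Hypothetical percept encoding: (location, status), e.g. ("A", "Dirty").
Percept = Tuple[str, str]


def reactive_agent(percepts: Sequence[Percept]) -> str:
    """Map a percept sequence to an action.

    A reactive agent ignores the history and looks only at the most
    recent percept; other agent designs could use the whole sequence.
    """
    location, status = percepts[-1]
    if status == "Dirty":
        return "Suck"
    # Assumed two-location world: move toward the other location.
    return "Right" if location == "A" else "Left"


history = [("A", "Dirty")]
print(reactive_agent(history))   # first decision
history.append(("A", "Clean"))
print(reactive_agent(history))   # decision after a second percept
```

The function signature takes the full percept sequence even though this particular agent uses only its last element, which keeps the interface the same for agents that do rely on history.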
Our primary aim is to convey the ideas that have emerged over the past seventy years of AI research and the past two millennia of related work. We have tried to avoid excessive formality in the presentation of these ideas, while retaining precision. We have included mathematical formulas and pseudocode algorithms to make the key ideas concrete; mathematical concepts and notation are described in Appendix A and our pseudocode is described in Appendix B.
This book is primarily intended for use in an undergraduate course or course sequence. The book has 28 chapters, each requiring about a week’s worth of lectures, so working through the whole book requires a two-semester sequence. A one-semester course can use selected chapters to suit the interests of the instructor and students. The book can also be used in a graduate-level course (perhaps with the addition of some of the primary sources suggested in the bibliographical notes), or for self-study or as a reference.
Throughout the book, important points are marked with a triangle icon in the margin. Wherever a new term is defined, it is also noted in the margin. Subsequent significant uses of the term are in bold, but not in the margin. We have included a comprehensive index and an extensive bibliography.
The only prerequisite is familiarity with basic concepts of computer science (algorithms, data structures, complexity) at a sophomore level. Freshman calculus and linear algebra are useful for some of the topics.
Online resources
Online resources are available through pearsonhighered.com/cs-resources or at the book’s Web site, aima.cs.berkeley.edu. There you will find:
• Exercises, programming projects, and research projects. These are no longer at the end of each chapter; they are online only. Within the book, we refer to an online exercise with a name like “Exercise 6.NARY.” Instructions on the Web site allow you to find exercises by name or by topic.
• Implementations of the algorithms in the book in Python, Java, and other programming languages (currently hosted at github.com/aimacode).
• A list of over 1400 schools that have used the book, many with links to online course materials and syllabi.
• Supplementary material and links for students and instructors.
• Instructions on how to report errors in the book, in the likely event that some exist.
Book cover
The cover depicts the final position from the decisive game 6 of the 1997 chess match in which the program Deep Blue defeated Garry Kasparov (playing Black), making this the first time a computer had beaten a world champion in a chess match. Kasparov is shown at the top. To his right is a pivotal position from the second game of the historic Go match between former world champion Lee Sedol and DeepMind’s ALPHAGO program. Move 37 by ALPHAGO violated centuries of Go orthodoxy and was immediately seen by human experts as an embarrassing mistake, but it turned out to be a winning move. At top left is an Atlas humanoid robot built by Boston Dynamics. A depiction of a self-driving car sensing its environment appears between Ada Lovelace, the world’s first computer programmer, and Alan Turing, whose fundamental work defined artificial intelligence. At the bottom of the chess board are a Mars Exploration Rover robot and a statue of Aristotle, who pioneered the study of logic; his planning algorithm from De Motu Animalium appears behind the authors’ names. Behind the chess board is a probabilistic programming model used by the UN Comprehensive Nuclear-Test-Ban Treaty Organization for detecting nuclear explosions from seismic signals.
Acknowledgments
It takes a global village to make a book. Over 600 people read parts of the book and made suggestions for improvement. The complete list is at aima.cs.berkeley.edu/ack.html; we are grateful to all of them. We have space here to mention only a few especially important contributors. First the contributing writers:
• Judea Pearl (Section 13.5, Causal Networks);
• Vikash Mansinghka (Section 15.3, Programs as Probability Models);
• Michael Wooldridge (Chapter 18, Multiagent Decision Making);
• Ian Goodfellow (Chapter 21, Deep Learning);
• Jacob Devlin and Ming-Wei Chang (Chapter 24, Deep Learning for Natural Language);
• Jitendra Malik and David Forsyth (Chapter 25, Computer Vision);
• Anca Dragan (Chapter 26, Robotics).
Then some key roles:
• Cynthia Yeung and Malika Cantor (project management);
• Julie Sussman and Tom Galloway (copyediting and writing suggestions);
• Omari Stephens (illustrations);
• Tracy Johnson (editor);
• Erin Ault and Rose Kernan (cover and color conversion);
• Nalin Chhibber, Sam Goto, Raymond de Lacaze, Ravi Mohan, Ciaran O’Reilly, Amit Patel, Dragomir Radev, and Samagra Sharma (online code development and mentoring);
• Google Summer of Code students (online code development).
Stuart would like to thank his wife, Loy Sheflott, for her endless patience and boundless wisdom. He hopes that Gordon, Lucy, George, and Isaac will soon be reading this book after they have forgiven him for working so long on it. RUGS (Russell’s Unusual Group of Students) have been unusually helpful, as always.
Peter would like to thank his parents (Torsten and Gerda) for getting him started, and his wife (Kris), children (Bella and Juliet), colleagues, boss, and friends for encouraging and tolerating him through the long hours of writing and rewriting.
About the Authors
Stuart Russell was born in 1962 in Portsmouth, England. He received his B.A. with first-class honours in physics from Oxford University in 1982, and his Ph.D. in computer science from Stanford in 1986. He then joined the faculty of the University of California at Berkeley, where he is a professor and former chair of computer science, director of the Center for Human-Compatible AI, and holder of the Smith-Zadeh Chair in Engineering. In 1990, he received the Presidential Young Investigator Award of the National Science Foundation, and in 1995 he was cowinner of the Computers and Thought Award. He is a Fellow of the American Association for Artificial Intelligence, the Association for Computing Machinery, and the American Association for the Advancement of Science, an Honorary Fellow of Wadham College, Oxford, and an Andrew Carnegie Fellow. He held the Chaire Blaise Pascal in Paris from 2012 to 2014. He has published over 300 papers on a wide range of topics in artificial intelligence. His other books include The Use of Knowledge in Analogy and Induction, Do the Right Thing: Studies in Limited Rationality (with Eric Wefald), and Human Compatible: Artificial Intelligence and the Problem of Control.
Peter Norvig is currently a Director of Research at Google, Inc., and was previously the director responsible for the core Web search algorithms. He co-taught an online AI class that signed up 160,000 students, helping to kick off the current round of massive open online classes. He was head of the Computational Sciences Division at NASA Ames Research Center, overseeing research and development in artificial intelligence and robotics. He received a B.S. in applied mathematics from Brown University and a Ph.D. in computer science from Berkeley. He has been a professor at the University of Southern California and a faculty member at Berkeley and Stanford. He is a Fellow of the American Association for Artificial Intelligence, the Association for Computing Machinery, the American Academy of Arts and Sciences, and the California Academy of Science. His other books are Paradigms of AI Programming: Case Studies in Common Lisp, Verbmobil: A Translation System for Face-to-Face Dialog, and Intelligent Help Systems for UNIX.
The two authors shared the inaugural AAAI/EAAI Outstanding Educator award in 2016.
Contents
I Artificial Intelligence
1 Introduction
  1.1 What Is AI?
  1.2 The Foundations of Artificial Intelligence
  1.3 The History of Artificial Intelligence
  1.4 The State of the Art
  1.5 Risks and Benefits of AI
  Summary
  Bibliographical and Historical Notes
2 Intelligent Agents
  2.1 Agents and Environments
  2.2 Good Behavior: The Concept of Rationality
  2.3 The Nature of Environments
  2.4 The Structure of Agents
  Summary
  Bibliographical and Historical Notes
II Problem-solving
3 Solving Problems by Searching
  3.1 Problem-Solving Agents
  3.2 Example Problems
  3.3 Search Algorithms
  3.4 Uninformed Search Strategies
  3.5 Informed (Heuristic) Search Strategies
  3.6 Heuristic Functions
  Summary
  Bibliographical and Historical Notes
4 Search in Complex Environments
  4.1 Local Search and Optimization Problems
  4.2 Local Search in Continuous Spaces
  4.3 Search with Nondeterministic Actions
  4.4 Search in Partially Observable Environments
  4.5 Online Search Agents and Unknown Environments
  Summary
  Bibliographical and Historical Notes
5 Adversarial Search and Games
  5.1 Game Theory
  5.2 Optimal Decisions in Games
  5.3 Heuristic Alpha-Beta Tree Search
  5.4 Monte Carlo Tree Search
  5.5 Stochastic Games
  5.6 Partially Observable Games
  5.7 Limitations of Game Search Algorithms
  Summary
  Bibliographical and Historical Notes
6 Constraint Satisfaction Problems
  6.1 Defining Constraint Satisfaction Problems
  6.2 Constraint Propagation: Inference in CSPs
  6.3 Backtracking Search for CSPs
  6.4 Local Search for CSPs
  6.5 The Structure of Problems
  Summary
  Bibliographical and Historical Notes
III Knowledge, reasoning, and planning
7 Logical Agents
  7.1 Knowledge-Based Agents
  7.2 The Wumpus World
  7.3 Logic
  7.4 Propositional Logic: A Very Simple Logic
  7.5 Propositional Theorem Proving
  7.6 Effective Propositional Model Checking
  7.7 Agents Based on Propositional Logic
  Summary
  Bibliographical and Historical Notes
8 First-Order Logic
  8.1 Representation Revisited
  8.2 Syntax and Semantics of First-Order Logic
  8.3 Using First-Order Logic
  8.4 Knowledge Engineering in First-Order Logic
  Summary
  Bibliographical and Historical Notes
9 Inference in First-Order Logic
  9.1 Propositional vs. First-Order Inference
  9.2 Unification and First-Order Inference
  9.3 Forward Chaining
  9.4 Backward Chaining
  9.5 Resolution
  Summary
  Bibliographical and Historical Notes
10 Knowledge Representation
  10.1 Ontological Engineering
  10.2 Categories and Objects
  10.3 Events
  10.4 Mental Objects and Modal Logic
  10.5 Reasoning Systems for Categories
  10.6 Reasoning with Default Information
  Summary
  Bibliographical and Historical Notes
11 Automated Planning
  11.1 Definition of Classical Planning
  11.2 Algorithms for Classical Planning
  11.3 Heuristics for Planning
  11.4 Hierarchical Planning
  11.5 Planning and Acting in Nondeterministic Domains
  11.6 Time, Schedules, and Resources
  11.7 Analysis of Planning Approaches
  Summary
  Bibliographical and Historical Notes
IV Uncertain knowledge and reasoning
12 Quantifying Uncertainty
  12.1 Acting under Uncertainty
  12.2 Basic Probability Notation
  12.3 Inference Using Full Joint Distributions
  12.4 Independence
  12.5 Bayes’ Rule and Its Use
  12.6 Naive Bayes Models
  12.7 The Wumpus World Revisited
  Summary
  Bibliographical and Historical Notes
13 Probabilistic Reasoning
  13.1 Representing Knowledge in an Uncertain Domain
  13.2 The Semantics of Bayesian Networks
  13.3 Exact Inference in Bayesian Networks
  13.4 Approximate Inference for Bayesian Networks
  13.5 Causal Networks
  Summary
  Bibliographical and Historical Notes
14 Probabilistic Reasoning over Time
  14.1 Time and Uncertainty
  14.2 Inference in Temporal Models
  14.3 Hidden Markov Models
  14.4 Kalman Filters
  14.5 Dynamic Bayesian Networks
  Summary
  Bibliographical and Historical Notes
15 Probabilistic Programming
  15.1 Relational Probability Models
  15.2 Open-Universe Probability Models
  15.3 Keeping Track of a Complex World
  Summary
  Bibliographical and Historical Notes
16 Making Simple Decisions
  16.1 Combining Beliefs and Desires under Uncertainty
  16.2 The Basis of Utility Theory
  16.3 Utility Functions
  16.4 Multiattribute Utility Functions
  16.5 Decision Networks
  16.6 The Value of Information
  16.7 Unknown Preferences
  Summary
  Bibliographical and Historical Notes
17 Making Complex Decisions
  17.1 Sequential Decision Problems
  17.2 Algorithms for MDPs
  17.3 Bandit Problems
  17.4 Partially Observable MDPs
  17.5 Algorithms for Solving POMDPs
  Summary
  Bibliographical and Historical Notes
18 Multiagent Decision Making
  18.1 Properties of Multiagent Environments
  18.2 Non-Cooperative Game Theory
  18.3 Cooperative Game Theory
  18.4 Making Collective Decisions
  Summary
  Bibliographical and Historical Notes
V Machine Learning
19 Learning from Examples
  19.1 Forms of Learning
  19.2 Supervised Learning
  19.3 Learning Decision Trees
  19.4 Model Selection and Optimization
  19.5 The Theory of Learning
  19.6 Linear Regression and Classification
  19.7 Nonparametric Models
  19.8 Ensemble Learning
  19.9 Developing Machine Learning Systems
  Summary
  Bibliographical and Historical Notes
20 Learning Probabilistic Models
  20.1 Statistical Learning
  20.2 Learning with Complete Data
  20.3 Learning with Hidden Variables: The EM Algorithm
  Summary
  Bibliographical and Historical Notes
21 Deep Learning
  21.1 Simple Feedforward Networks
  21.2 Computation Graphs for Deep Learning
  21.3 Convolutional Networks
  21.4 Learning Algorithms
  21.5 Generalization
  21.6 Recurrent Neural Networks
  21.7 Unsupervised Learning and Transfer Learning
  21.8 Applications
  Summary
  Bibliographical and Historical Notes
22 Reinforcement Learning
  22.1 Learning from Rewards
  22.2 Passive Reinforcement Learning
  22.3 Active Reinforcement Learning
  22.4 Generalization in Reinforcement Learning
  22.5 Policy Search
  22.6 Apprenticeship and Inverse Reinforcement Learning
  22.7 Applications of Reinforcement Learning
  Summary
  Bibliographical and Historical Notes
VI Communicating, perceiving, and acting
23 Natural Language Processing
  23.1 Language Models
  23.2 Grammar
  23.3 Parsing
  23.4 Augmented Grammars
  23.5 Complications of Real Natural Language
  23.6 Natural Language Tasks
  Summary
  Bibliographical and Historical Notes
24 Deep Learning for Natural Language Processing
  24.1 Word Embeddings
  24.2 Recurrent Neural Networks for NLP
  24.3 Sequence-to-Sequence Models
  24.4 The Transformer Architecture
  24.5 Pretraining and Transfer Learning
  24.6 State of the art
  Summary
  Bibliographical and Historical Notes
25 Computer Vision
  25.1 Introduction
  25.2 Image Formation
  25.3 Simple Image Features
  25.4 Classifying Images
  25.5 Detecting Objects
  25.6 The 3D World
  25.7 Using Computer Vision
  Summary
  Bibliographical and Historical Notes
26 Robotics
  26.1 Robots
  26.2 Robot Hardware
  26.3 What kind of problem is robotics solving?
  26.4 Robotic Perception
  26.5 Planning and Control
  26.6 Planning Uncertain Movements
  26.7 Reinforcement Learning in Robotics
  26.8 Humans and Robots
  26.9 Alternative Robotic Frameworks
  26.10 Application Domains
  Summary
  Bibliographical and Historical Notes
VII Conclusions
27 Philosophy, Ethics, and Safety of AI
  27.1 The Limits of AI
  27.2 Can Machines Really Think?
  27.3 The Ethics of AI
  Summary
  Bibliographical and Historical Notes
28 The Future of AI
  28.1 AI Components
  28.2 AI Architectures
A Mathematical Background
  A.1 Complexity Analysis and O() Notation
  A.2 Vectors, Matrices, and Linear Algebra
  A.3 Probability Distributions
  Bibliographical and Historical Notes
B Notes on Languages and Algorithms
  B.1 Defining Languages with Backus–Naur Form (BNF)
  B.2 Describing Algorithms with Pseudocode
  B.3 Online Supplemental Material
Bibliography
Index
1 INTRODUCTION
In which we try to explain why we consider artificial intelligence to be a subject most worthy of study, and in which we try to decide what exactly it is, this being a good thing to decide before embarking.
We call ourselves Homo sapiens—man the wise—because our intelligence is so important to us. For thousands of years, we have tried to understand how we think and act—that is, how our brain, a mere handful of matter, can perceive, understand, predict, and manipulate a world far larger and more complicated than itself. The field of artificial intelligence, or AI, is concerned with not just understanding but also building intelligent entities—machines that can compute how to act effectively and safely in a wide variety of novel situations.
Surveys regularly rank AI as one of the most interesting and fastest-growing fields, and it is already generating over a trillion dollars a year in revenue. AI expert Kai-Fu Lee predicts that its impact will be “more than anything in the history of mankind.” Moreover, the intellectual frontiers of AI are wide open. Whereas a student of an older science such as physics might feel that the best ideas have already been discovered by Galileo, Newton, Curie, Einstein, and the rest, AI still has many openings for full-time masterminds.
AI currently encompasses a huge variety of subfields, ranging from the general (learning, reasoning, perception, and so on) to the specific, such as playing chess, proving mathematical theorems, writing poetry, driving a car, or diagnosing diseases. AI is relevant to any intellectual task; it is truly a universal field.
1.1 What Is AI?
We have claimed that AI is interesting, but we have not said what it is. Historically, researchers have pursued several different versions of AI. Some have defined intelligence in terms of fidelity to human performance, while others prefer an abstract, formal definition of intelligence called rationality—loosely speaking, doing the “right thing.” The subject matter itself also varies: some consider intelligence to be a property of internal thought processes and reasoning, while others focus on intelligent behavior, an external characterization.¹
From these two dimensions—human vs. rational² and thought vs. behavior—there are four possible combinations, and there have been adherents and research programs for all four. The methods used are necessarily different: the pursuit of human-like intelligence must be in part an empirical science related to psychology, involving observations and hypotheses about actual human behavior and thought processes; a rationalist approach, on the other hand, involves a combination of mathematics and engineering, and connects to statistics, control theory, and economics. The various groups have both disparaged and helped each other. Let us look at the four approaches in more detail.
¹ In the public eye, there is sometimes confusion between the terms “artificial intelligence” and “machine learning.” Machine learning is a subfield of AI that studies the ability to improve performance based on experience. Some AI systems use machine learning methods to achieve competence, but some do not.
² We are not suggesting that humans are “irrational” in the dictionary sense of “deprived of normal mental clarity.” We are merely conceding that human decisions are not always mathematically perfect.
1.1.1 Acting humanly: The Turing test approach
The Turing test, proposed by Alan Turing (1950), was designed as a thought experiment that would sidestep the philosophical vagueness of the question “Can a machine think?” A computer passes the test if a human interrogator, after posing some written questions, cannot tell whether the written responses come from a person or from a computer. Chapter 27 discusses the details of the test and whether a computer would really be intelligent if it passed. For now, we note that programming a computer to pass a rigorously applied test provides plenty to work on. The computer would need the following capabilities:
• natural language processing to communicate successfully in a human language;
• knowledge representation to store what it knows or hears;
• automated reasoning to answer questions and to draw new conclusions;
• machine learning to adapt to new circumstances and to detect and extrapolate patterns.
Turing viewed the physical simulation of a person as unnecessary to demonstrate intelligence. However, other researchers have proposed a total Turing test, which requires interaction with objects and people in the real world. To pass the total Turing test, a robot will need
• computer vision and speech recognition to perceive the world;
• robotics to manipulate objects and move about.
These six disciplines compose most of AI. Yet AI researchers have devoted little effort to passing the Turing test, believing that it is more important to study the underlying principles of intelligence. The quest for “artificial flight” succeeded when engineers and inventors stopped imitating birds and started using wind tunnels and learning about aerodynamics. Aeronautical engineering texts do not define the goal of their field as making “machines that fly so exactly like pigeons that they can fool even other pigeons.”
1.1.2 Thinking humanly: The cognitive modeling approach
To say that a program thinks like a human, we must know how humans think. We can learn about human thought in three ways:
• introspection—trying to catch our own thoughts as they go by;
• psychological experiments—observing a person in action;
• brain imaging—observing the brain in action.
Once we have a sufficiently precise theory of the mind, it becomes possible to express the theory as a computer program. If the program’s input–output behavior matches corresponding human behavior, that is evidence that some of the program’s mechanisms could also be operating in humans.
For example, Allen Newell and Herbert Simon, who developed GPS, the “General Problem Solver” (Newell and Simon, 1961), were not content merely to have their program solve
Section 1.1
What Is AI?
problems correctly. They were more concerned with comparing the sequence and timing of its reasoning steps to those of human subjects solving the same problems.
The interdisci-
plinary field of cognitive science brings together computer models from Al and experimental techniques from psychology to construct precise and testable theories of the human mind.
Cognitive science
Cognitive science is a fascinating field in itself, worthy of several textbooks and at least
one encyclopedia (Wilson and Keil, 1999). We will occasionally comment on similarities or
differences between Al techniques and human cognition. Real cognitive science, however, is necessarily based on experimental investigation of actual humans or animals. We will leave
that for other books, as we assume the reader has only a computer for experimentation.
In the early days of Al there was often confusion between the approaches. An author would argue that an algorithm performs well on a task and that it is therefore a good model
of human performance, or vice versa. Modern authors separate the two kinds of claims; this
distinction has allowed both Al and cognitive science to develop more rapidly. The two fields fertilize each other, most notably in computer vision, which incorporates neurophysiological evidence into computational models.
Recently, neuroimaging methods combined with machine learning techniques for analyzing such data have led to the beginnings of a capability to “read minds,” that is, to ascertain the semantic content of a person’s inner thoughts. This capability could, in turn, shed further light on how human cognition works.
1.1.3 Thinking rationally: The “laws of thought” approach
The Greek philosopher Aristotle was one of the first to attempt to codify “right thinking,” that is, irrefutable reasoning processes. His syllogisms provided patterns for argument structures that always yielded correct conclusions when given correct premises. The canonical example starts with Socrates is a man and all men are mortal and concludes that Socrates is mortal. (This example is probably due to Sextus Empiricus rather than Aristotle.) These laws of thought were supposed to govern the operation of the mind; their study initiated the field called logic.
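The sense in which a syllogism yields its conclusion mechanically can be illustrated with a few lines of code. The sketch below is only an illustration: the pair encoding of facts and rules and the name `apply_syllogisms` are assumptions made for this example, not a notation used by Aristotle or by later logicians.

```python
# A minimal sketch of mechanical syllogistic inference.
# Facts are (individual, category) pairs; a rule (X, Y) reads "all X are Y".
# This encoding is a hypothetical choice made for illustration.

def apply_syllogisms(facts, rules):
    """Return all facts derivable by repeatedly applying 'all X are Y' rules."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for individual, category in list(derived):
            for premise, conclusion in rules:
                new_fact = (individual, conclusion)
                if category == premise and new_fact not in derived:
                    derived.add(new_fact)
                    changed = True
    return derived

facts = {("Socrates", "man")}        # Socrates is a man
rules = [("man", "mortal")]          # all men are mortal
print(("Socrates", "mortal") in apply_syllogisms(facts, rules))
```

Given correct premises, the procedure can only add conclusions that follow from them, which is the property the syllogism was meant to guarantee.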
Logicians in the 19th century developed a precise notation for statements about objects
in the world and the relations among them. (Contrast this with ordinary arithmetic notation,
which provides only for statements about numbers.) By 1965, programs could, in principle, solve any solvable problem described in logical notation.
The so-called logicist tradition
within artificial intelligence hopes to build on such programs to create intelligent systems.
Logic as conventionally understood requires knowledge of the world that is certain, a condition that, in reality, is seldom achieved. We simply don’t know the rules of, say, politics or warfare in the same way that we know the rules of chess or arithmetic. The theory of probability fills this gap, allowing rigorous reasoning with uncertain information. In principle, it allows the construction of a comprehensive model of rational thought, leading from raw perceptual information to an understanding of how the world works to predictions about the future. What it does not do is generate intelligent behavior. For that, we need a theory of rational action. Rational thought, by itself, is not enough.
1.1.4 Acting rationally: The rational agent approach
An agent is just something that acts (agent comes from the Latin agere, to do). Of course, all computer programs do something, but computer agents are expected to do more: operate autonomously, perceive their environment, persist over a prolonged time period, adapt to change, and create and pursue goals. A rational agent is one that acts so as to achieve the best outcome or, when there is uncertainty, the best expected outcome.
In the “laws of thought” approach to AI, the emphasis was on correct inferences. Making correct inferences is sometimes part of being a rational agent, because one way to act rationally is to deduce that a given action is best and then to act on that conclusion. On the other hand, there are ways of acting rationally that cannot be said to involve inference. For example, recoiling from a hot stove is a reflex action that is usually more successful than a slower action taken after careful deliberation.
All the skills needed for the Turing test also allow an agent to act rationally. Knowledge
representation and reasoning enable agents to reach good decisions. We need to be able to
generate comprehensible sentences in natural language to get by in a complex society.
We
need learning not only for erudition, but also because it improves our ability to generate
effective behavior, especially in circumstances that are new.
The rational-agent approach to AI has two advantages over the other approaches. First, it is more general than the “laws of thought” approach because correct inference is just one of several possible mechanisms for achieving rationality. Second, it is more amenable to scientific development. The standard of rationality is mathematically well defined and completely general. We can often work back from this specification to derive agent designs that provably achieve it, something that is largely impossible if the goal is to imitate human behavior or thought processes.
For these reasons, the rational-agent approach to AI has prevailed throughout most of the field’s history. In the early decades, rational agents were built on logical foundations and formed definite plans to achieve specific goals. Later, methods based on probability theory and machine learning allowed the creation of agents that could make decisions under uncertainty to attain the best expected outcome. In a nutshell, AI has focused on the study and construction of agents that do the right thing. What counts as the right thing is defined by the objective that we provide to the agent. This general paradigm is so pervasive that we might call it the standard model. It prevails not only in AI but also in control theory, where a controller minimizes a cost function; in operations research, where a policy maximizes a sum of rewards; in statistics, where a decision rule minimizes a loss function; and in economics, where a decision maker maximizes utility or some measure of social welfare.
We need to make one important refinement to the standard model to account for the fact that perfect rationality (always taking the exactly optimal action) is not feasible in complex environments. The computational demands are just too high. Chapters 5 and 17 deal with the issue of limited rationality: acting appropriately when there is not enough time to do all the computations one might like. However, perfect rationality often remains a good starting point for theoretical analysis.
1.1.5 Beneficial machines
The standard model has been a useful guide for AI research since its inception, but it is probably not the right model in the long run. The reason is that the standard model assumes that we will supply a fully specified objective to the machine.
For an artificially defined task such as chess or shortest-path computation, the task comes
with an objective built in—so the standard model is applicable.
As we move into the real
world, however, it becomes more and more difficult to specify the objective completely and
correctly. For example, in designing a self-driving car, one might think that the objective is to reach the destination safely. But driving along any road incurs a risk of injury due to other
errant drivers, equipment failure, and so on; thus, a strict goal of safety requires staying in the garage. There is a tradeoff between making progress towards the destination and incurring a risk of injury. How should this tradeoff be made? Furthermore, to what extent can we allow the car to take actions that would annoy other drivers? How much should the car moderate
its acceleration, steering, and braking to avoid shaking up the passenger? These kinds of questions are difficult to answer a priori. They are particularly problematic in the general area of human-robot interaction, of which the self-driving car is one example.
The problem of achieving agreement between our true preferences and the objective we put into the machine is called the value alignment problem: the values or objectives put into the machine must be aligned with those of the human. If we are developing an AI system in the lab or in a simulator (as has been the case for most of the field’s history), there is an easy fix for an incorrectly specified objective: reset the system, fix the objective, and try again. As the field progresses towards increasingly capable intelligent systems that are deployed in the real world, this approach is no longer viable. A system deployed with an incorrect objective will have negative consequences. Moreover, the more intelligent the system, the more negative the consequences.
Returning to the apparently unproblematic example of chess, consider what happens if the machine is intelligent enough to reason and act beyond the confines of the chessboard. In that case, it might attempt to increase its chances of winning by such ruses as hypnotizing or blackmailing its opponent or bribing the audience to make rustling noises during its opponent’s thinking time.³ It might also attempt to hijack additional computing power for itself. These behaviors are not “unintelligent” or “insane”; they are a logical consequence of defining winning as the sole objective for the machine.
It is impossible to anticipate all the ways in which a machine pursuing a fixed objective might misbehave. There is good reason, then, to think that the standard model is inadequate. We don’t want machines that are intelligent in the sense of pursuing their objectives; we want
them to pursue our objectives. If we cannot transfer those objectives perfectly to the machine, then we need a new formulation—one in which the machine is pursuing our objectives, but is necessarily uncertain as to what they are. When a machine knows that it doesn’t know the
complete objective, it has an incentive to act cautiously, to ask permission, to learn more about
our preferences through observation, and to defer to human control. Ultimately, we want agents that are provably beneficial to humans. We will return to this topic in Section 1.5.
1.2 The Foundations of Artificial Intelligence
In this section, we provide a brief history of the disciplines that contributed ideas, viewpoints, and techniques to AI. Like any history, this one concentrates on a small number of people, events, and ideas and ignores others that also were important. We organize the history around a series of questions. We certainly would not wish to give the impression that these questions are the only ones the disciplines address or that the disciplines have all been working toward AI as their ultimate fruition.
3 In one of the first books on chess, Ruy Lopez (1561) wrote, “Always place the board so the sun is in your opponent’s eyes.”
1.2.1 Philosophy
• Can formal rules be used to draw valid conclusions?
• How does the mind arise from a physical brain?
• Where does knowledge come from?
• How does knowledge lead to action?
Aristotle (384-322 BCE) was the first to formulate a precise set of laws governing the rational
part of the mind. He developed an informal system of syllogisms for proper reasoning, which in principle allowed one to generate conclusions mechanically, given initial premises. Ramon Llull (c. 1232-1315) devised a system of reasoning published as Ars Magna or The Great Art (1305). Llull tried to implement his system using an actual mechanical device:
a set of paper wheels that could be rotated into different permutations.
Around 1500, Leonardo da Vinci (1452-1519) designed but did not build a mechanical calculator; recent reconstructions have shown the design to be functional. The first known calculating machine was constructed around 1623 by the German scientist Wilhelm Schickard (1592-1635). Blaise Pascal (1623-1662) built the Pascaline in 1642 and wrote that it “produces effects which appear nearer to thought than all the actions of animals.” Gottfried Wilhelm Leibniz (1646-1716) built a mechanical device intended to carry out operations on concepts rather than numbers, but its scope was rather limited. In his 1651 book Leviathan, Thomas Hobbes (1588-1679) suggested the idea of a thinking machine, an “artificial animal”
in his words, arguing “For what is the heart but a spring; and the nerves, but so many strings; and the joints, but so many wheels.” He also suggested that reasoning was like numerical computation: “For ‘reason’ ... is nothing but ‘reckoning,’ that is adding and subtracting.”
It’s one thing to say that the mind operates, at least in part, according to logical or numerical rules, and to build physical systems that emulate some of those rules. It’s another to say that the mind itself is such a physical system. René Descartes (1596-1650) gave the first clear discussion of the distinction between mind and matter. He noted that a purely physical conception of the mind seems to leave little room for free will. If the mind is governed entirely by physical laws, then it has no more free will than a rock “deciding” to fall downward. Descartes was a proponent of dualism. He held that there is a part of the human mind (or soul or spirit) that is outside of nature, exempt from physical laws. Animals, on the other hand, did not possess this dual quality; they could be treated as machines.
An alternative to dualism is materialism, which holds that the brain’s operation according to the laws of physics constitutes the mind. Free will is simply the way that the perception of available choices appears to the choosing entity. The terms physicalism and naturalism are also used to describe this view that stands in contrast to the supernatural.
Given a physical mind that manipulates knowledge, the next problem is to establish the source of knowledge. The empiricism movement, starting with Francis Bacon’s (1561-1626) Novum Organum,⁴ is characterized by a dictum of John Locke (1632-1704): “Nothing is in the understanding, which was not first in the senses.” David Hume’s (1711-1776) A Treatise of Human Nature (Hume, 1739) proposed what is now known as the principle of induction: that general rules are acquired by exposure to repeated associations between their elements.
4 The Novum Organum is an update of Aristotle’s Organon, or instrument of thought.
Building on the work of Ludwig Wittgenstein (1889-1951) and Bertrand Russell (1872-1970), the famous Vienna Circle (Sigmund, 2017), a group of philosophers and mathematicians meeting in Vienna in the 1920s and 1930s, developed the doctrine of logical positivism. This doctrine holds that all knowledge can be characterized by logical theories connected, ultimately, to observation sentences that correspond to sensory inputs; thus logical positivism combines rationalism and empiricism. The confirmation theory of Rudolf Carnap (1891-1970) and Carl Hempel (1905-1997) attempted to analyze the acquisition of knowledge from experience by quantifying the degree of belief that should be assigned to logical sentences based on their connection to observations that confirm or disconfirm them. Carnap’s book The Logical Structure of the World (1928) was perhaps the first theory of mind as a computational process.
The final element in the philosophical picture of the mind is the connection between knowledge and action. This question is vital to AI because intelligence requires action as well as reasoning. Moreover, only by understanding how actions are justified can we understand how to build an agent whose actions are justifiable (or rational).
Aristotle argued (in De Motu Animalium) that actions are justified by a logical connection between goals and knowledge of the action’s outcome:
But how does it happen that thinking is sometimes accompanied by action and sometimes not, sometimes by motion, and sometimes not? It looks as if almost the same thing happens as in the case of reasoning and making inferences about unchanging objects. But in that case the end is a speculative proposition ... whereas here the conclusion which results from the two premises is an action. ... I need covering; a cloak is a covering. I need a cloak. What I need, I have to make; I need a cloak. I have to make a cloak. And the conclusion, the “I have to make a cloak,” is an action.
In the Nicomachean Ethics (Book III. 3, 1112b), Aristotle further elaborates on this topic, suggesting an algorithm:
We deliberate not about ends, but about means. For a doctor does not deliberate whether he shall heal, nor an orator whether he shall persuade, ... They assume the end and consider how and by what means it is attained, and if it seems easily and best produced thereby; while if it is achieved by one means only they consider how it will be achieved by this and by what means this will be achieved, till they come to the first cause, ... and what is last in the order of analysis seems to be first in the order of becoming. And if we come on an impossibility, we give up the search, e.g., if we need money and this cannot be got; but if a thing appears possible we try to do it.
Aristotle’s algorithm was implemented 2300 years later by Newell and Simon in their General Problem Solver program. We would now call it a greedy regression planning system (see Chapter 11). Methods based on logical planning to achieve definite goals dominated the first few decades of theoretical research in AI.
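The procedure in the quoted passage, working backward from an end to the means that achieve it until something directly doable is reached, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reconstruction of the General Problem Solver or of the planners in Chapter 11: the `MEANS` table, its action names, and the function `regress` are all hypothetical, and the sketch ignores interactions between subgoals.

```python
# Hypothetical means-ends table: each goal maps to alternative ways of
# achieving it, written as (action, preconditions). All names are invented
# for illustration.
MEANS = {
    "have covering": [("wear cloak", ["have cloak"])],
    "have cloak": [("make cloak", ["have cloth"])],
    "have cloth": [("weave cloth", [])],
}

def regress(goal, means, state=frozenset(), seen=frozenset()):
    """Work backward from goal; return a list of actions in execution order,
    or None if no means can be found ("we give up the search")."""
    if goal in state:
        return []                      # already true, nothing to do
    if goal in seen:
        return None                    # circular dependency, give up
    for action, preconditions in means.get(goal, []):
        plan = []
        for sub in preconditions:
            steps = regress(sub, means, state, seen | {goal})
            if steps is None:
                break                  # this means is impossible; try the next
            plan.extend(steps)
        else:
            return plan + [action]     # commit to the first means that works
    return None

print(regress("have covering", MEANS))
print(regress("have money", MEANS))
```

The action discovered last in the analysis ("weave cloth") comes first in the returned plan, mirroring the remark that what is last in the order of analysis is first in the order of becoming.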
Thinking purely in terms of actions achieving goals is often useful but sometimes inapplicable. For example, if there are several different ways to achieve a goal, there needs to be some way to choose among them. More importantly, it may not be possible to achieve a goal with certainty, but some action must still be taken. How then should one decide? Antoine Arnauld (1662), analyzing the notion of rational decisions in gambling, proposed a quantitative formula for maximizing the expected monetary value of the outcome. Later, Daniel Bernoulli (1738) introduced the more general notion of utility to capture the internal, subjective value of an outcome. The modern notion of rational decision making under uncertainty involves maximizing expected utility, as explained in Chapter 16.
In matters of ethics and public policy, a decision maker must consider the interests of multiple individuals. Jeremy Bentham (1823) and John Stuart Mill (1863) promoted the idea of utilitarianism: that rational decision making based on maximizing utility should apply to all spheres of human activity, including public policy decisions made on behalf of many individuals. Utilitarianism is a specific kind of consequentialism: the idea that what is right and wrong is determined by the expected outcomes of an action.
In contrast, Immanuel Kant, in 1785, proposed a theory of rule-based or deontological ethics, in which “doing the right thing” is determined not by outcomes but by universal social laws that govern allowable actions, such as “don’t lie” or “don’t kill.” Thus, a utilitarian could tell a white lie if the expected good outweighs the bad, but a Kantian would be bound not to, because lying is inherently wrong. Mill acknowledged the value of rules, but understood them as efficient decision procedures compiled from first-principles reasoning about consequences. Many modern AI systems adopt exactly this approach.
1.2.2 Mathematics
• What are the formal rules to draw valid conclusions?
• What can be computed?
• How do we reason with uncertain information?
Philosophers staked out some of the fundamental ideas of AI, but the leap to a formal science required the mathematization of logic and probability and the introduction of a new branch of mathematics: computation.
The idea of formal logic can be traced back to the philosophers of ancient Greece, India, and China, but its mathematical development really began with the work of George Boole (1815-1864), who worked out the details of propositional, or Boolean, logic (Boole, 1847). In 1879, Gottlob Frege (1848-1925) extended Boole’s logic to include objects and relations, creating the first-order logic that is used today.⁵ In addition to its central role in the early period of AI research, first-order logic motivated the work of Gödel and Turing that underpinned computation itself, as we explain below.
The theory of probability can be seen as generalizing logic to situations with uncertain information, a consideration of great importance for AI. Gerolamo Cardano (1501-1576) first framed the idea of probability, describing it in terms of the possible outcomes of gambling events. In 1654, Blaise Pascal (1623-1662), in a letter to Pierre Fermat (1601-1665), showed how to predict the future of an unfinished gambling game and assign average payoffs to the gamblers. Probability quickly became an invaluable part of the quantitative sciences, helping to deal with uncertain measurements and incomplete theories. Jacob Bernoulli (1654-1705, uncle of Daniel), Pierre Laplace (1749-1827), and others advanced the theory and introduced new statistical methods. Thomas Bayes (1702-1761) proposed a rule for updating probabilities in the light of new evidence; Bayes’ rule is a crucial tool for AI systems.
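As a small illustration of updating a probability in the light of new evidence, the sketch below applies Bayes’ rule to a single hypothesis H and a single observation e, computing P(H | e) = P(e | H) P(H) / P(e). The function name `bayes_update` and the numbers are hypothetical, chosen only to show the arithmetic.

```python
def bayes_update(prior, likelihood_if_true, likelihood_if_false):
    """Return P(H | e) given P(H), P(e | H), and P(e | not H)."""
    # Total probability of the evidence, summed over both cases.
    evidence = likelihood_if_true * prior + likelihood_if_false * (1.0 - prior)
    if evidence == 0.0:
        raise ValueError("the evidence has zero probability under both cases")
    return likelihood_if_true * prior / evidence

# Hypothetical numbers: a rare hypothesis (1%) and a fairly reliable signal.
posterior = bayes_update(prior=0.01, likelihood_if_true=0.9, likelihood_if_false=0.05)
print(posterior)
```

With these assumed numbers the posterior stays well below one half even though the signal is fairly reliable, because the hypothesis was so improbable to begin with.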
The formalization of probability, combined with the availability of data, led to the emergence of statistics as a field. One of the first uses was John Graunt’s analysis of London census data in 1662. Ronald Fisher is considered the first modern statistician (Fisher, 1922). He brought together the ideas of probability, experiment design, analysis of data, and computing: in 1919, he insisted that he couldn’t do his work without a mechanical calculator called the MILLIONAIRE (the first calculator that could do multiplication), even though the cost of the calculator was more than his annual salary (Ross, 2012).
5 Frege’s proposed notation for first-order logic (an arcane combination of textual and geometric features) never became popular.
The history of computation is as old as the history of numbers, but the first nontrivial algorithm is thought to be Euclid’s algorithm for computing greatest common divisors. The word algorithm comes from Muhammad ibn Musa al-Khwarizmi, a 9th century mathematician, whose writings also introduced Arabic numerals and algebra to Europe. Boole and others discussed algorithms for logical deduction, and, by the late 19th century, efforts were under way to formalize general mathematical reasoning as logical deduction.
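For readers who have not seen it, Euclid’s algorithm mentioned above can be written in a few lines. The sketch uses the common remainder-based formulation (an assumption about presentation; Euclid’s original description used repeated subtraction), and the function name `gcd` is simply chosen for this example.

```python
def gcd(a, b):
    """Greatest common divisor of two integers by Euclid's algorithm."""
    a, b = abs(a), abs(b)
    while b != 0:
        # gcd(a, b) equals gcd(b, a mod b); the second argument shrinks
        # at every step, so the loop must terminate.
        a, b = b, a % b
    return a

print(gcd(1071, 462))
```

For 1071 and 462 the successive remainders are 147, 21, and 0, so the loop stops with 21.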
Kurt Gödel (1906-1978) showed that there exists an effective procedure to prove any true statement in the first-order logic of Frege and Russell, but that first-order logic could not capture the principle of mathematical induction needed to characterize the natural numbers. In 1931, Gödel showed that limits on deduction do exist. His incompleteness theorem showed that in any formal theory as strong as Peano arithmetic (the elementary theory of natural numbers), there are necessarily true statements that have no proof within the theory.
This fundamental result can also be interpreted as showing that some functions on the integers cannot be represented by an algorithm; that is, they cannot be computed. This motivated Alan Turing (1912-1954) to try to characterize exactly which functions are computable, that is, capable of being computed by an effective procedure. The Church-Turing thesis proposes to identify the general notion of computability with functions computed by a Turing machine (Turing, 1936). Turing also showed that there were some functions that no Turing machine can compute. For example, no machine can tell in general whether a given program will return an answer on a given input or run forever.
Although computability is important to an understanding of computation, the notion of tractability has had an even greater impact on AI. Roughly speaking, a problem is called intractable if the time required to solve instances of the problem grows exponentially with the size of the instances. The distinction between polynomial and exponential growth in complexity was first emphasized in the mid-1960s (Cobham, 1964; Edmonds, 1965). It is important because exponential growth means that even moderately large instances cannot be solved in any reasonable time.
The theory of NP-completeness, pioneered by Cook (1971) and Karp (1972), provides a basis for analyzing the tractability of problems: any problem class to which the class of NP-complete problems can be reduced is likely to be intractable. (Although it has not been proved that NP-complete problems are necessarily intractable, most theoreticians believe it.) These results contrast with the optimism with which the popular press greeted the first computers: “Electronic Super-Brains” that were “Faster than Einstein!” Despite the increasing speed of computers, careful use of resources and necessary imperfection will characterize intelligent systems. Put crudely, the world is an extremely large problem instance!
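The gap between polynomial and exponential growth is easy to underestimate, so a rough numerical sketch may help. Everything specific here is an assumption chosen for illustration: a quadratic algorithm, an algorithm taking 2^n steps, and a machine executing one billion steps per second.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def polynomial_steps(n, degree=2):
    """Steps taken by a hypothetical algorithm with n**degree running time."""
    return n ** degree

def exponential_steps(n, base=2):
    """Steps taken by a hypothetical algorithm with base**n running time."""
    return base ** n

def years_needed(steps, steps_per_second=1e9):
    """Wall-clock years at an assumed rate of steps_per_second."""
    return steps / steps_per_second / SECONDS_PER_YEAR

for n in (10, 20, 40, 80):
    print(n, polynomial_steps(n), exponential_steps(n),
          years_needed(exponential_steps(n)))
```

Under these assumptions, an instance of size 80 costs the quadratic algorithm 6,400 steps, while the exponential one would run for tens of millions of years.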
1.2.3 Economics
• How should we make decisions in accordance with our preferences?
• How should we do this when others may not go along?
• How should we do this when the payoff may be far in the future?
The science of economics originated in 1776, when Adam Smith (1723-1790) published An Inquiry into the Nature and Causes of the Wealth of Nations. Smith proposed to analyze economies as consisting of many individual agents attending to their own interests. Smith was not, however, advocating financial greed as a moral position: his earlier (1759) book The Theory of Moral Sentiments begins by pointing out that concern for the well-being of others is an essential component of the interests of every individual.
Most people think of economics as being about money, and indeed the first mathematical analysis of decisions under uncertainty, the maximum-expected-value formula of Arnauld (1662), dealt with the monetary value of bets. Daniel Bernoulli (1738) noticed that this formula didn’t seem to work well for larger amounts of money, such as investments in maritime trading expeditions. He proposed instead a principle based on maximization of expected utility, and explained human investment choices by proposing that the marginal utility of an additional quantity of money diminished as one acquired more money. Léon Walras (pronounced “Valrasse”) (1834-1910) gave utility theory a more general foundation in terms of preferences between gambles on any outcomes (not just monetary outcomes). The theory was improved by Ramsey (1931) and later by John von Neumann and Oskar Morgenstern in their book The Theory of Games and Economic Behavior (1944). Economics is no longer the study of money; rather it is the study of desires and preferences.
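The difference between maximizing expected monetary value and maximizing expected utility can be seen in a small numerical sketch. The logarithmic utility function is one common choice that exhibits diminishing marginal utility; its use here, like the wealth figures and the gamble itself, is an assumption made for illustration rather than something specified in the text.

```python
import math

def expected_value(lottery):
    """Expected final wealth of a lottery given as (probability, wealth) pairs."""
    return sum(p * wealth for p, wealth in lottery)

def expected_utility(lottery, utility=math.log):
    """Expected utility of final wealth; log utility is an assumed example."""
    return sum(p * utility(wealth) for p, wealth in lottery)

wealth = 1000.0
# Hypothetical venture: equal chances of gaining 1000 or losing 900.
risky = [(0.5, wealth + 1000.0), (0.5, wealth - 900.0)]
safe = [(1.0, wealth)]

print(expected_value(risky) > expected_value(safe))      # money favors the gamble
print(expected_utility(risky) > expected_utility(safe))  # log utility does not
```

The gamble raises expected wealth from 1000 to 1050, yet an agent with diminishing marginal utility declines it, because the possible loss hurts more than the equally likely gain helps.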
Decision theory, which combines probability theory with utility theory, provides a formal and complete framework for individual decisions (economic or otherwise) made under uncertainty, that is, in cases where probabilistic descriptions appropriately capture the decision maker’s environment. This is suitable for “large” economies where each agent need pay no attention to the actions of other agents as individuals. For “small” economies, the situation is much more like a game: the actions of one player can significantly affect the utility of another (either positively or negatively). Von Neumann and Morgenstern’s development of game theory (see also Luce and Raiffa, 1957) included the surprising result that, for some games, a rational agent should adopt policies that are (or at least appear to be) randomized. Unlike decision theory, game theory does not offer an unambiguous prescription for selecting actions. In AI, decisions involving multiple agents are studied under the heading of multiagent systems (Chapter 18).
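To see why a randomized policy can be the rational choice, consider the game of matching pennies, used here as an assumed example since the text does not name a specific game. Each player shows heads or tails; one player wins 1 if the coins match and loses 1 otherwise. The sketch computes the worst-case expected payoff of a policy against an opponent who knows it; the function names are hypothetical.

```python
# Payoff to the "matcher" (row player); the opponent receives the negative.
PAYOFF = {("H", "H"): 1, ("H", "T"): -1, ("T", "H"): -1, ("T", "T"): 1}

def expected_payoff(p_heads_row, p_heads_col):
    """Row player's expected payoff when both players randomize independently."""
    total = 0.0
    for r, pr in (("H", p_heads_row), ("T", 1 - p_heads_row)):
        for c, pc in (("H", p_heads_col), ("T", 1 - p_heads_col)):
            total += pr * pc * PAYOFF[(r, c)]
    return total

def worst_case(p_heads_row):
    """Row player's payoff against the opponent's best pure reply."""
    return min(expected_payoff(p_heads_row, q) for q in (0.0, 1.0))

for p in (1.0, 0.75, 0.5):
    print(p, worst_case(p))
```

Any deterministic policy can be exploited by an opponent who anticipates it, whereas choosing heads with probability one half guarantees an expected payoff of zero whatever the opponent does.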
Economists, with some exceptions, did not address the third question listed above: how to make rational decisions when payoffs from actions are not immediate but instead result from several actions taken in sequence. This topic was pursued in the field of operations research, which emerged in World War II from efforts in Britain to optimize radar installations, and later found innumerable civilian applications. The work of Richard Bellman (1957) formalized a class of sequential decision problems called Markov decision processes, which we study in Chapter 17 and, under the heading of reinforcement learning, in Chapter 22.
Work in economics and operations research has contributed much to our notion of rational agents, yet for many years AI research developed along entirely separate paths. One reason was the apparent complexity of making rational decisions. The pioneering AI researcher Herbert Simon (1916-2001) won the Nobel Prize in economics in 1978 for his early work showing that models based on satisficing (making decisions that are “good enough,” rather than laboriously calculating an optimal decision) gave a better description of actual human behavior (Simon, 1947). Since the 1990s, there has been a resurgence of interest in decision-theoretic techniques for AI.
1.2.4 Neuroscience
• How do brains process information?
Neuroscience is the study of the nervous system, particularly the brain. Although the exact way in which the brain enables thought is one of the great mysteries of science, the fact that it does enable thought has been appreciated for thousands of years because of the evidence that strong blows to the head can lead to mental incapacitation. It has also long been known that human brains are somehow different; in about 335 BCE Aristotle wrote, “Of all the animals, man has the largest brain in proportion to his size.”⁶ Still, it was not until the middle of the 18th century that the brain was widely recognized as the seat of consciousness. Before then, candidate locations included the heart and the spleen.
Paul Broca’s (1824-1880) investigation of aphasia (speech deficit) in brain-damaged patients in 1861 initiated the study of the brain’s functional organization by identifying a localized area in the left hemisphere, now called Broca’s area, that is responsible for speech production.⁷ By that time, it was known that the brain consisted largely of nerve cells, or neurons, but it was not until 1873 that Camillo Golgi (1843-1926) developed a staining technique allowing the observation of individual neurons (see Figure 1.1). This technique was used by Santiago Ramon y Cajal (1852-1934) in his pioneering studies of neuronal organization.⁸ It is now widely accepted that cognitive functions result from the electrochemical operation of these structures. That is, a collection of simple cells can lead to thought, action, and consciousness. In the pithy words of John Searle (1992), brains cause minds.
Figure 2.14 A model-based, utility-based agent. It uses a model of the world, along with a utility function that measures its preferences among states of the world. Then it chooses the action that leads to the best expected utility, where expected utility is computed by averaging over all possible outcome states, weighted by the probability of the outcome.
aim for, none of which can be achieved with certainty, utility provides a way in which the likelihood of success can be weighed against the importance of the goals.
Partial observability and nondeterminism are ubiquitous in the real world, and so, therefore, is decision making under uncertainty. Technically speaking, a rational utility-based agent chooses the action that maximizes the expected utility of the action outcomes, that is, the utility the agent expects to derive, on average, given the probabilities and utilities of each outcome. (Appendix A defines expectation more precisely.) In Chapter 16, we show that any rational agent must behave as if it possesses a utility function whose expected value it tries to maximize. An agent that possesses an explicit utility function can make rational decisions with a general-purpose algorithm that does not depend on the specific utility function being maximized. In this way, the “global” definition of rationality (designating as rational those agent functions that have the highest performance) is turned into a “local” constraint on rational-agent designs that can be expressed in a simple program. The utility-based agent structure appears in Figure 2.14.
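The “simple program” referred to above might look like the following sketch, which picks the action with the highest expected utility. It is not the book’s agent program: the function names, the dictionary-based outcome model, and the commuting example with its probabilities and utilities are all assumptions made for illustration.

```python
def expected_utility(action, outcome_model, utility):
    """Average the utility of each outcome state, weighted by its probability."""
    return sum(p * utility(state) for state, p in outcome_model(action).items())

def choose_action(actions, outcome_model, utility):
    """General-purpose selection: works for any utility function passed in."""
    return max(actions, key=lambda a: expected_utility(a, outcome_model, utility))

# Hypothetical example: two routes with uncertain arrival times.
MODEL = {
    "highway": {"on_time": 0.7, "late": 0.3},
    "side_streets": {"on_time": 0.5, "late": 0.5},
}
UTILITY = {"on_time": 10.0, "late": -5.0}

best = choose_action(MODEL.keys(), MODEL.__getitem__, UTILITY.__getitem__)
print(best)
```

The selection code never inspects the utility function itself; swapping in different preferences changes the chosen action without changing the algorithm, which is the point of the local constraint described above.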
Utility-based agent programs appear in Chapters 16 and 17, where we design decision-making agents that must handle the uncertainty inherent in nondeterministic or partially observable environments. Decision making in multiagent environments is also studied in the framework of utility theory, as explained in Chapter 18.

At this point, the reader may be wondering, “Is it that simple? We just build agents that maximize expected utility, and we’re done?” It’s true that such agents would be intelligent,
but it’s not simple. A utility-based agent has to model and keep track of its environment,
tasks that have involved a great deal of research on perception, representation, reasoning,
and learning. The results of this research fill many of the chapters of this book. Choosing
the utility-maximizing course of action is also a difficult task, requiring ingenious algorithms
that fill several more chapters.
Even with these algorithms, perfect rationality is usually unachievable in practice because of computational complexity, as we noted in Chapter 1. We also note that not all utility-based agents are model-based; we will see in Chapters 22 and 26 that a model-free agent can learn what action is best in a particular situation without ever learning exactly how that action changes the environment.

Figure 2.15 A general learning agent. The “performance element” box represents what we have previously considered to be the whole agent program. Now, the “learning element” box gets to modify that program to improve its performance.
Finally, all of this assumes that the designer can specify the utility function correctly; Chapters 17, 18, and 22 consider the issue of unknown utility functions in more depth.

2.4.6 Learning agents
We have described agent programs with various methods for selecting actions. We have not, so far, explained how the agent programs come into being. In his famous early paper, Turing (1950) considers the idea of actually programming his intelligent machines by hand. He estimates how much work this might take and concludes, “Some more expeditious method seems desirable.” The method he proposes is to build learning machines and then to teach them. In many areas of AI this is now the preferred method for creating state-of-the-art
systems. Any type of agent (model-based, goal-based, utility-based, etc.) can be built as a learning agent (or not). Learning has another advantage, as we noted earlier: it allows the agent to operate in initially unknown environments and to become more competent than its initial knowledge alone
might allow. In this section, we briefly introduce the main ideas of learning agents. Throughout the book, we comment on opportunities and methods for learning in particular kinds of agents. Chapters 19-22 go into much more depth on the learning algorithms themselves.

A learning agent can be divided into four conceptual components, as shown in Figure 2.15. The most important distinction is between the learning element, which is responsible for making improvements, and the performance element, which is responsible for selecting external actions. The performance element is what we have previously considered to be the entire agent: it takes in percepts and decides on actions. The learning element uses feedback from the critic on how the agent is doing and determines how the performance element should be modified to do better in the future.
The design of the learning element depends very much on the design of the performance
element. When trying to design an agent that learns a certain capability, the first question is
not “How am I going to get it to learn this?” but “What kind of performance element will my
agent use to do this once it has learned how?” Given a design for the performance element,
learning mechanisms can be constructed to improve every part of the agent.
The critic tells the learning element how well the agent is doing with respect to a fixed
performance standard. The critic is necessary because the percepts themselves provide no indication of the agent’s success. For example, a chess program could receive a percept indicating that it has checkmated its opponent, but it needs a performance standard to know that this is a good thing; the percept itself does not say so. It is important that the performance
standard be fixed. Conceptually, one should think of it as being outside the agent altogether because the agent must not modify it to fit its own behavior.
The last component of the learning agent is the problem generator. It is responsible
for suggesting actions that will lead to new and informative experiences. If the performance
element had its way, it would keep doing the actions that are best, given what it knows, but if the agent is willing to explore a little and do some perhaps suboptimal actions in the short
run, it might discover much better actions for the long run. The problem generator’s job is to suggest these exploratory actions. This is what scientists do when they carry out experiments.
Galileo did not think that dropping rocks from the top of a tower in Pisa was valuable in itself.
He was not trying to break the rocks or to modify the brains of unfortunate pedestrians. His
aim was to modify his own brain by identifying a better theory of the motion of objects.

The learning element can make changes to any of the “knowledge” components shown in the agent diagrams (Figures 2.9, 2.11, 2.13, and 2.14). The simplest cases involve learning directly from the percept sequence. Observation of pairs of successive states of the environment can allow the agent to learn “What my actions do” and “How the world evolves” in response to its actions. For example, if the automated taxi exerts a certain braking pressure
when driving on a wet road, then it will soon find out how much deceleration is actually
achieved, and whether it skids off the road. The problem generator might identify certain parts of the model that are in need of improvement and suggest experiments, such as trying out the brakes on different road surfaces under different conditions.
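The four components of a learning agent can be sketched as a small Python class. This is only an illustration under strong simplifying assumptions: the class name, the two braking actions, the tip-based performance standard, and the running-mean update are all hypothetical, and exploration is reduced to occasionally picking a random action. It is not the design used later in the book's learning chapters.

```python
import random


class LearningAgent:
    """Toy learning agent with the four conceptual components."""

    def __init__(self, actions, performance_standard, explore_prob=0.1, seed=0):
        self.actions = list(actions)
        # Fixed standard supplied from outside; the agent never modifies it.
        self.performance_standard = performance_standard
        self.explore_prob = explore_prob
        self.rng = random.Random(seed)
        # The "knowledge" being learned: estimated value of each action.
        self.estimates = {a: 0.0 for a in self.actions}
        self.counts = {a: 0 for a in self.actions}

    def performance_element(self):
        """Select the action that looks best given current knowledge."""
        return max(self.actions, key=lambda a: self.estimates[a])

    def problem_generator(self):
        """Occasionally suggest an exploratory, possibly suboptimal, action."""
        if self.rng.random() < self.explore_prob:
            return self.rng.choice(self.actions)
        return None

    def critic(self, percept):
        """Score the percept against the fixed performance standard."""
        return self.performance_standard(percept)

    def learning_element(self, action, feedback):
        """Move the estimate for `action` toward the feedback (running mean)."""
        self.counts[action] += 1
        self.estimates[action] += (feedback - self.estimates[action]) / self.counts[action]

    def step(self, environment):
        suggestion = self.problem_generator()
        action = suggestion if suggestion is not None else self.performance_element()
        percept = environment(action)
        self.learning_element(action, self.critic(percept))
        return action


def taxi_environment(action):
    """Hypothetical environment: passengers tip only after gentle braking."""
    return {"tip": action == "gentle_braking"}


def tips_standard(percept):
    """Hypothetical performance standard: a tip is good, no tip is bad."""
    return 1.0 if percept["tip"] else -1.0


agent = LearningAgent(["gentle_braking", "violent_braking"], tips_standard, explore_prob=0.2)
for _ in range(200):
    agent.step(taxi_environment)
print(agent.performance_element())  # gentle_braking
```

Note that the percept `{"tip": False}` carries no judgment by itself; only the critic, using the fixed standard, turns it into negative feedback.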
Improving the model components of a model-based agent so that they conform better
with reality is almost always a good idea, regardless of the external performance standard.
(In some cases, it is better from a computational point of view to have a simple but slightly inaccurate model rather than a perfect but fiendishly complex model.)
Information from the
external standard is needed when trying to learn a reflex component or a utility function.
For example, suppose the taxi-driving agent receives no tips from passengers who have
been thoroughly shaken up during the trip. The external performance standard must inform
the agent that the loss of tips is a negative contribution to its overall performance; then the
agent might be able to learn that violent maneuvers do not contribute to its own utility. In a sense, the performance standard distinguishes part of the incoming percept as a reward (or penalty) that provides direct feedback on the quality of the agent’s behavior. Hard-wired performance standards such as pain and hunger in animals can be understood in this way.
More generally, human choices can provide information about human preferences. For example, suppose the taxi does not know that people generally don’t like loud noises, and settles on the idea of blowing its horn continuously as a way of ensuring that pedestrians know it’s coming. The consequent human behavior—covering ears, using bad language, and possibly cutting the wires to the horn—would provide evidence to the agent with which to update its utility function. This issue is discussed further in Chapter 22.
In summary, agents have a variety of components, and those components can be represented in many ways within the agent program, so there appears to be great variety among learning methods. There is, however, a single unifying theme. Learning in intelligent agents can be summarized as a process of modification of each component of the agent to bring the components into closer agreement with the available feedback information, thereby improving the overall performance of the agent.

2.4.7 How the components of agent programs work

We have described agent programs (in very high-level terms) as consisting of various components, whose function it is to answer questions such as: “What is the world like now?” “What action should I do now?” “What do my actions do?” The next question for a student of AI is, “How on Earth do these components work?” It takes about a thousand pages to begin to answer that question properly, but here we want to draw the reader’s attention to some basic
distinctions among the various ways that the components can represent the environment that the agent inhabits. Roughly speaking, we can place the representations along an axis of increasing complexity and expressive power—atomic, factored, and structured. To illustrate these ideas, it helps
to consider a particular agent component, such as the one that deals with “What my actions
do.” This component describes the changes that might occur in the environment as the result of taking an action, and Figure 2.16 provides schematic depictions of how those transitions might be represented.

Figure 2.16 Three ways to represent states and the transitions between them. (a) Atomic representation: a state (such as B or C) is a black box with no internal structure; (b) Factored representation: a state consists of a vector of attribute values; values can be Boolean, real-valued, or one of a fixed set of symbols. (c) Structured representation: a state includes objects, each of which may have attributes of its own as well as relationships to other objects.

In an atomic representation each state of the world is indivisible—it has no internal structure. Consider the task of finding a driving route from one end of a country to the other via some sequence of cities (we address this problem in Figure 3.1 on page 64). For the purposes of solving this problem, it may suffice to reduce the state of the world to just the name of the city we are in—a single atom of knowledge, a “black box” whose only discernible property is that of being identical to or different from another black box. The standard algorithms
underlying search and game-playing (Chapters 3-5), hidden Markov models (Chapter 14), and Markov decision processes (Chapter 17) all work with atomic representations.
A factored representation splits up each state into a fixed set of variables or attributes, each of which can have a value. Consider a higher-fidelity description for the same driving problem, where we need to be concerned with more than just atomic location in one city or another; we might need to pay attention to how much gas is in the tank, our current GPS coordinates, whether or not the oil warning light is working, how much money we have for tolls, what station is on the radio, and so on. While two different atomic states have nothing in common—they are just different black boxes—two different factored states can share some attributes (such as being at some particular GPS location) and not others (such as having lots of gas or having no gas); this makes it much easier to work out how to turn one state into another. Many important areas of AI are based on factored representations, including constraint satisfaction algorithms (Chapter 6), propositional logic (Chapter 7), planning (Chapter 11), Bayesian networks (Chapters 12-16), and various machine learning algorithms.

For many purposes, we need to understand the world as having things in it that are related to each other, not just variables with values. For example, we might notice that a large truck ahead of us is reversing into the driveway of a dairy farm, but a loose cow is blocking the truck’s path. A factored representation is unlikely to be pre-equipped with the attribute TruckAheadBackingIntoDairyFarmDrivewayBlockedByLooseCow with value true or false. Instead, we would need a structured representation, in which objects such as cows and trucks and their various and varying relationships can be described explicitly (see Figure 2.16(c)). Structured representations underlie relational databases and first-order logic (Chapters 8, 9, and 10), first-order probability models (Chapter 15), and much of natural language understanding (Chapters 23 and 24). In fact, much of what humans express in natural language concerns objects and their relationships.
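To make the three kinds of representation concrete, here is a small Python sketch of the same driving scenario in atomic, factored, and structured form. The variable names, attribute names, and the `Obj`/`StructuredState` classes are hypothetical choices made for illustration; real systems use far richer encodings.

```python
from dataclasses import dataclass, field

# Atomic: a state is an opaque label. The only available test is equality.
atomic_a, atomic_b = "Arad", "Sibiu"

# Factored: a state is a fixed set of variables, each with a value.
factored_a = {"city": "Arad", "fuel_liters": 40.0, "oil_light_ok": True}
factored_b = {"city": "Sibiu", "fuel_liters": 40.0, "oil_light_ok": True}


def shared_attributes(s1, s2):
    """Return the variables on which two factored states agree."""
    return {k for k in s1 if k in s2 and s1[k] == s2[k]}


# Structured: a state contains objects plus explicit relations among them.
@dataclass(frozen=True)
class Obj:
    name: str
    kind: str


@dataclass
class StructuredState:
    objects: set = field(default_factory=set)
    relations: set = field(default_factory=set)  # (predicate, subject, object)

    def holds(self, predicate, subj, obj):
        return (predicate, subj, obj) in self.relations


truck = Obj("truck1", "Truck")
cow = Obj("cow1", "Cow")
driveway = Obj("driveway1", "Driveway")
scene = StructuredState(
    objects={truck, cow, driveway},
    relations={("ReversingInto", truck, driveway), ("Blocking", cow, truck)},
)

print(atomic_a == atomic_b)                       # False: nothing more can be said
print(sorted(shared_attributes(factored_a, factored_b)))
print(scene.holds("Blocking", cow, truck))        # True
```

The factored states expose what they have in common, which the atomic labels cannot, while the structured state describes the truck-and-cow situation by combining objects and relations rather than by inventing a dedicated attribute for it.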
As we mentioned earlier, the axis along which atomic, factored, and structured representations lie is the axis of increasing expressiveness. Roughly speaking, a more expressive representation can capture, at least as concisely, everything a less expressive one can capture, plus some more. Often, the more expressive language is much more concise; for example, the rules of chess can be written in a page or two of a structured-representation language such as first-order logic but require thousands of pages when written in a factored-representation language such as propositional logic and around 10^38 pages when written in an atomic language such as that of finite-state automata. On the other hand, reasoning and learning become more complex as the expressive power of the representation increases. To gain the benefits of expressive representations while avoiding their drawbacks, intelligent systems for the real world may need to operate at all points along the axis simultaneously.
Another axis for representation involves the mapping of concepts to locations in physical memory, whether in a computer or in a brain. If there is a one-to-one mapping between concepts and memory locations, we call that a localist representation. On the other hand, if the representation of a concept is spread over many memory locations, and each memory location is employed as part of the representation of multiple different concepts, we call that a distributed representation. Distributed representations are more robust against noise and information loss. With a localist representation, the mapping from concept to memory location is arbitrary, and if a transmission error garbles a few bits, we might confuse Truck with the unrelated concept Truce. But with a distributed representation, you can think of each concept representing a point in multidimensional space, and if you garble a few bits you move to a nearby point in that space, which will have similar meaning.
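The Truck/Truce example can be sketched in a few lines of Python. The binary codes and the three-dimensional vectors below are invented for illustration, and real-valued noise stands in for garbled bits in the distributed case; the sketch only shows why a small corruption is harmless in one scheme and not in the other.

```python
import math

# Localist: each concept gets an arbitrary code, so neighboring codes
# need not be related in meaning (hypothetical codes).
LOCALIST = {0b0110: "Truck", 0b0111: "Truce", 0b1000: "Van"}


def decode_localist(code):
    return LOCALIST.get(code, "<unknown>")


# Distributed: each concept is a point in a space where nearby points
# have similar meanings (hypothetical 3-dimensional vectors).
DISTRIBUTED = {
    "Truck": (0.9, 0.8, 0.1),
    "Van": (0.8, 0.9, 0.1),
    "Truce": (0.1, 0.1, 0.9),
}


def decode_distributed(vector):
    """Return the concept whose vector is nearest (Euclidean distance)."""
    return min(DISTRIBUTED, key=lambda c: math.dist(vector, DISTRIBUTED[c]))


garbled_code = 0b0110 ^ 0b0001      # flip one bit of the Truck code
garbled_vector = (0.9, 0.8, 0.3)    # Truck vector with one noisy coordinate

print(decode_localist(garbled_code))       # Truce
print(decode_distributed(garbled_vector))  # Truck
```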
Summary
This chapter has been something of a whirlwind tour of AI, which we have conceived of as the science of agent design. The major points to recall are as follows:

• An agent is something that perceives and acts in an environment. The agent function for an agent specifies the action taken by the agent in response to any percept sequence.
• The performance measure evaluates the behavior of the agent in an environment. A rational agent acts so as to maximize the expected value of the performance measure, given the percept sequence it has seen so far.
• A task environment specification includes the performance measure, the external environment, the actuators, and the sensors. In designing an agent, the first step must always be to specify the task environment as fully as possible.
• Task environments vary along several significant dimensions. They can be fully or partially observable, single-agent or multiagent, deterministic or nondeterministic, episodic or sequential, static or dynamic, discrete or continuous, and known or unknown.
• In cases where the performance measure is unknown or hard to specify correctly, there is a significant risk of the agent optimizing the wrong objective. In such cases the agent design should reflect uncertainty about the true objective.
• The agent program implements the agent function. There exists a variety of basic agent program designs reflecting the kind of information made explicit and used in the decision process. The designs vary in efficiency, compactness, and flexibility. The appropriate design of the agent program depends on the nature of the environment.
• Simple reflex agents respond directly to percepts, whereas model-based reflex agents maintain internal state to track aspects of the world that are not evident in the current percept. Goal-based agents act to achieve their goals, and utility-based agents try to maximize their own expected “happiness.”
• All agents can improve their performance through learning.
Bibliographical and Historical Notes

The central role of action in intelligence—the notion of practical reasoning—goes back at least as far as Aristotle’s Nicomachean Ethics. Practical reasoning was also the subject of McCarthy’s influential paper “Programs with Common Sense” (1958). The fields of robotics and control theory are, by their very nature, concerned principally with physical agents. The concept of a controller in control theory is identical to that of an agent in AI. Perhaps surprisingly, AI has concentrated for most of its history on isolated components of agents—question-answering systems, theorem-provers, vision systems, and so on—rather than on whole agents. The discussion of agents in the text by Genesereth and Nilsson (1987) was an influential exception. The whole-agent view is now widely accepted and is a central theme in recent texts (Padgham and Winikoff, 2004; Jones, 2007; Poole and Mackworth, 2017).
Chapter 1 traced the roots of the concept of rationality in philosophy and economics. In AI, the concept was of peripheral interest until the mid-1980s, when it began to suffuse many discussions about the proper technical foundations of the field. A paper by Jon Doyle (1983) predicted that rational agent design would come to be seen as the core mission of AI, while other popular topics would spin off to form new disciplines.

Careful attention to the properties of the environment and their consequences for rational agent design is most apparent in the control theory tradition—for example, classical control systems (Dorf and Bishop, 2004; Kirk, 2004) handle fully observable, deterministic environments; stochastic optimal control (Kumar and Varaiya, 1986; Bertsekas and Shreve, 2007) handles partially observable, stochastic environments; and hybrid control (Henzinger and Sastry, 1998; Cassandras and Lygeros, 2006) deals with environments containing both discrete and continuous elements. The distinction between fully and partially observable environments is also central in the dynamic programming literature developed in the field of operations research (Puterman, 1994), which we discuss in Chapter 17.
Although simple reflex agents were central to behaviorist psychology (see Chapter 1), most AI researchers view them as too simple to provide much leverage. (Rosenschein (1985) and Brooks (1986) questioned this assumption; see Chapter 26.) A great deal of work has gone into finding efficient algorithms for keeping track of complex environments (Bar-Shalom et al., 2001; Choset et al., 2005; Simon, 2006), most of it in the probabilistic setting.

Goal-based agents are presupposed in everything from Aristotle’s view of practical reasoning to McCarthy’s early papers on logical AI. Shakey the Robot (Fikes and Nilsson, 1971; Nilsson, 1984) was the first robotic embodiment of a logical, goal-based agent. A full logical analysis of goal-based agents appeared in Genesereth and Nilsson (1987), and a goal-based programming methodology called agent-oriented programming was developed by Shoham (1993). The agent-based approach is now extremely popular in software engineering (Ciancarini and Wooldridge, 2001). It has also infiltrated the area of operating systems, where autonomic computing refers to computer systems and networks that monitor and control themselves with a perceive-act loop and machine learning methods (Kephart and Chess, 2003). Noting that a collection of agent programs designed to work well together in a true multiagent environment necessarily exhibits modularity—the programs share no internal state and communicate with each other only through the environment—it is common within the field of multiagent systems to design the agent program of a single agent as a collection of autonomous sub-agents. In some cases, one can even prove that the resulting system gives the same optimal solutions as a monolithic design.
The goal-based view of agents also dominates the cognitive psychology tradition in the area of problem solving, beginning with the enormously influential Human Problem Solving (Newell and Simon, 1972) and running through all of Newell’s later work (Newell, 1990).
Goals, further analyzed as desires (general) and intentions (currently pursued), are central to the influential theory of agents developed by Michael Bratman (1987).

As noted in Chapter 1, the development of utility theory as a basis for rational behavior goes back hundreds of years. In AI, early research eschewed utilities in favor of goals, with some exceptions (Feldman and Sproull, 1977). The resurgence of interest in probabilistic methods in the 1980s led to the acceptance of maximization of expected utility as the most general framework for decision making (Horvitz et al., 1988). The text by Pearl (1988) was the first in AI to cover probability and utility theory in depth; its exposition of practical methods for reasoning and decision making under uncertainty was probably the single biggest factor in the rapid shift towards utility-based agents in the 1990s (see Chapter 16). The formalization of reinforcement learning within a decision-theoretic framework also contributed to this shift (Sutton, 1988). Somewhat remarkably, almost all AI research until very recently has assumed that the performance measure can be exactly and correctly specified in the form of a utility function or reward function (Hadfield-Menell et al., 2017a; Russell, 2019).
The general design for learning agents portrayed in Figure 2.15 is classic in the machine learning literature (Buchanan et al., 1978; Mitchell, 1997). Examples of the design, as embodied in programs, go back at least as far as Arthur Samuel’s (1959, 1967) learning program for playing checkers. Learning agents are discussed in depth in Chapters 19-22.

Some early papers on agent-based approaches are collected by Huhns and Singh (1998) and Wooldridge and Rao (1999). Texts on multiagent systems provide a good introduction to many aspects of agent design (Weiss, 2000a; Wooldridge, 2009). Several conference series devoted to agents began in the 1990s, including the International Workshop on Agent Theories, Architectures, and Languages (ATAL), the International Conference on Autonomous Agents (AGENTS), and the International Conference on Multi-Agent Systems (ICMAS). In 2002, these three merged to form the International Joint Conference on Autonomous Agents and Multi-Agent Systems (AAMAS). From 2000 to 2012 there were annual workshops on Agent-Oriented Software Engineering (AOSE). The journal Autonomous Agents and Multi-Agent Systems was founded in 1998. Finally, Dung Beetle Ecology (Hanski and Cambefort, 1991) provides a wealth of interesting information on the behavior of dung beetles. YouTube has inspiring video recordings of their activities.
CHAPTER 3

SOLVING PROBLEMS BY SEARCHING

In which we see how an agent can look ahead to find a sequence of actions that will eventually achieve its goal.

When the correct action to take is not immediately obvious, an agent may need to plan ahead: to consider a sequence of actions that form a path to a goal state. Such an agent is called a problem-solving agent, and the computational process it undertakes is called search.

Problem-solving agents use atomic representations, as described in Section 2.4.7—that is, states of the world are considered as wholes, with no internal structure visible to the problem-solving algorithms. Agents that use factored or structured representations of states are called planning agents and are discussed in Chapters 7 and 11.

We will cover several search algorithms. In this chapter, we consider only the simplest environments: episodic, single agent, fully observable, deterministic, static, discrete, and known. We distinguish between informed algorithms, in which the agent can estimate how far it is from the goal, and uninformed algorithms, where no such estimate is available. Chapter 4 relaxes the constraints on environments, and Chapter 5 considers multiple agents.

This chapter uses the concepts of asymptotic complexity (that is, O(n) notation). Readers unfamiliar with these concepts should consult Appendix A.

3.1 Problem-Solving Agents
Imagine an agent enjoying a touring vacation in Romania. The agent wants to take in the sights, improve its Romanian, enjoy the nightlife, avoid hangovers, and so on. The decision problem is a complex one. Now, suppose the agent is currently in the city of Arad and has a nonrefundable ticket to fly out of Bucharest the following day.
The agent observes
street signs and sees that there are three roads leading out of Arad: one toward Sibiu, one to
Timisoara, and one to Zerind. None of these are the goal, so unless the agent is familiar with
the geography of Romania, it will not know which road to follow.¹

1 We are assuming that most readers are in the same position and can easily imagine themselves to be as clueless as our agent. We apologize to Romanian readers who are unable to take advantage of this pedagogical device.

If the agent has no additional information—that is, if the environment is unknown—then the agent can do no better than to execute one of the actions at random. This sad situation is discussed in Chapter 4. In this chapter, we will assume our agents always have access to information about the world, such as the map in Figure 3.1. With that information, the agent can follow this four-phase problem-solving process:

• Goal formulation: The agent adopts the goal of reaching Bucharest. Goals organize behavior by limiting the objectives and hence the actions to be considered.
Figure 3.1 A simplified road map of part of Romania, with road distances in miles.
• Problem formulation: The agent devises a description of the states and actions necessary to reach the goal—an abstract model of the relevant part of the world. For our agent, one good model is to consider the actions of traveling from one city to an adjacent city, and therefore the only fact about the state of the world that will change due to an action is the current city.

• Search: Before taking any action in the real world, the agent simulates sequences of actions in its model, searching until it finds a sequence of actions that reaches the goal. Such a sequence is called a solution. The agent might have to simulate multiple sequences that do not reach the goal, but eventually it will find a solution (such as going from Arad to Sibiu to Fagaras to Bucharest), or it will find that no solution is possible.

• Execution: The agent can now execute the actions in the solution, one at a time.
It is an important property that in a fully observable, deterministic, known environment, the
solution to any problem is a fixed sequence of actions: drive to Sibiu, then Fagaras, then
Bucharest. If the model is correct, then once the agent has found a solution, it can ignore its
percepts while it is executing the actions—closing its eyes, so to speak—because the solution
is guaranteed to lead to the goal. Control theorists call this an open-loop system: ignoring the
percepts breaks the loop between agent and environment. If there is a chance that the model
is incorrect, or the environment is nondeterministic, then the agent would be safer using a
closed-loop approach that monitors the percepts (see Section 4.4).
In partially observable or nondeterministic environments, a solution would be a branching
strategy that recommends different future actions depending on what percepts arrive. For
example, the agent might plan to drive from Arad to Sibiu but might need a contingency plan
in case it arrives in Zerind by accident or finds a sign saying “Drum inchis” (Road Closed).
3.1.1 Search problems and solutions

A search problem can be defined formally as follows:

• A set of possible states that the environment can be in. We call this the state space.

• The initial state that the agent starts in. For example: Arad.

• A set of one or more goal states. Sometimes there is one goal state (e.g., Bucharest), sometimes there is a small set of alternative goal states, and sometimes the goal is defined by a property that applies to many states (potentially an infinite number). For example, in a vacuum-cleaner world, the goal might be to have no dirt in any location, regardless of any other facts about the state. We can account for all three of these possibilities by specifying an IS-GOAL method for a problem. In this chapter we will sometimes say “the goal” for simplicity, but what we say also applies to “any one of the possible goal states.”

• The actions available to the agent. Given a state s, ACTIONS(s) returns a finite² set of actions that can be executed in s. We say that each of these actions is applicable in s. An example:

ACTIONS(Arad) = {ToSibiu, ToTimisoara, ToZerind}

• A transition model, which describes what each action does. RESULT(s, a) returns the state that results from doing action a in state s. For example,

RESULT(Arad, ToZerind) = Zerind.

• An action cost function, denoted by ACTION-COST(s, a, s') when we are programming or c(s, a, s') when we are doing math, that gives the numeric cost of applying action a in state s to reach state s'. A problem-solving agent should use a cost function that reflects its own performance measure; for example, for route-finding agents, the cost of an action might be the length in miles (as seen in Figure 3.1), or it might be the time it takes to complete the action.

A sequence of actions forms a path, and a solution is a path from the initial state to a goal state. We assume that action costs are additive; that is, the total cost of a path is the sum of the individual action costs. An optimal solution has the lowest path cost among all solutions. In this chapter, we assume that all action costs will be positive, to avoid certain complications.³

The state space can be represented as a graph in which the vertices are states and the directed edges between them are actions. The map of Romania shown in Figure 3.1 is such a graph, where each road indicates two actions, one in each direction.

2 For problems with an infinite number of actions we would need techniques that go beyond this chapter.

3 In any problem with a cycle of net negative cost, the cost-optimal solution is to go around that cycle an infinite number of times. The Bellman-Ford and Floyd-Warshall algorithms (not covered here) handle negative-cost actions, as long as there are no negative cycles. It is easy to accommodate zero-cost actions, as long as the number of consecutive zero-cost actions is bounded. For example, we might have a robot where there is a cost to move, but zero cost to rotate 90°; the algorithms in this chapter can handle this as long as no more than three consecutive 90° turns are allowed. There is also a complication with problems that have an infinite number of arbitrarily small action costs. Consider a version of Zeno’s paradox where there is an action to move half way to the goal, at a cost of half of the previous move. This problem has no solution with a finite number of actions, but to prevent a search from taking an unbounded number of actions without quite reaching the goal, we can require that all action costs be at least ε, for some small positive value ε.
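The formal definition of a search problem given above (states, initial state, goal test, ACTIONS, RESULT, and ACTION-COST) can be sketched in Python as follows. The class and helper names (`RouteProblem`, `two_way`, `path_cost`) are hypothetical, city names are written without spaces so that actions can be named `To<City>`, and the road distances are the values usually quoted for this map; treat them as illustrative, since the figure itself is not reproduced here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


def two_way(edges: Dict[Tuple[str, str], int]) -> Dict[str, Dict[str, int]]:
    """Each road yields two actions, one in each direction."""
    roads: Dict[str, Dict[str, int]] = {}
    for (a, b), miles in edges.items():
        roads.setdefault(a, {})[b] = miles
        roads.setdefault(b, {})[a] = miles
    return roads


@dataclass
class RouteProblem:
    initial: str
    goal: str
    roads: Dict[str, Dict[str, int]]

    def actions(self, s: str) -> List[str]:
        """ACTIONS(s): the finite set of actions applicable in s."""
        return ["To" + city for city in sorted(self.roads.get(s, {}))]

    def result(self, s: str, a: str) -> str:
        """RESULT(s, a): the transition model."""
        city = a[2:]  # strip the "To" prefix
        if city not in self.roads.get(s, {}):
            raise ValueError(f"{a} is not applicable in {s}")
        return city

    def action_cost(self, s: str, a: str, s2: str) -> int:
        """ACTION-COST(s, a, s'): road length in miles."""
        return self.roads[s][s2]

    def is_goal(self, s: str) -> bool:
        """IS-GOAL(s)."""
        return s == self.goal


def path_cost(problem: RouteProblem, actions: List[str]) -> Tuple[str, int]:
    """Follow a path from the initial state; costs are additive."""
    state, total = problem.initial, 0
    for a in actions:
        nxt = problem.result(state, a)
        total += problem.action_cost(state, a, nxt)
        state = nxt
    return state, total


ROADS = two_way({
    ("Arad", "Zerind"): 75, ("Arad", "Sibiu"): 140, ("Arad", "Timisoara"): 118,
    ("Sibiu", "Fagaras"): 99, ("Sibiu", "RimnicuVilcea"): 80,
    ("Fagaras", "Bucharest"): 211, ("RimnicuVilcea", "Pitesti"): 97,
    ("Pitesti", "Bucharest"): 101,
})
problem = RouteProblem(initial="Arad", goal="Bucharest", roads=ROADS)

print(problem.actions("Arad"))            # ['ToSibiu', 'ToTimisoara', 'ToZerind']
print(problem.result("Arad", "ToZerind")) # Zerind
print(path_cost(problem, ["ToSibiu", "ToFagaras", "ToBucharest"]))  # ('Bucharest', 450)
```

With these distances, the path through Rimnicu Vilcea and Pitesti costs 140 + 80 + 97 + 101 = 418, so the Fagaras route is a solution but not an optimal one.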
Chapter 3 Solving Problems by Searching
3.1.2 Formulating problems
Our formulation of the problem of getting to Bucharest is a model—an abstract mathematical
description—and not the real thing. Compare the simple atomic state description Arad to an actual cross-country trip, where the state of the world includes so many things: the traveling companions, the current radio program, the scenery out of the window, the proximity of law enforcement officers, the distance to the next rest stop, the condition of the road, the weather,
the traffic, and so on.
All these considerations are left out of our model because they are irrelevant to the problem of finding a route to Bucharest.
The process of removing detail from a representation is called abstraction. A good problem formulation has the right level of detail. If the actions were at the level of “move the right foot forward a centimeter” or “turn the steering wheel one degree left,” the agent would probably never find its way out of the parking lot, let alone to Bucharest.
Can we be more precise about the appropriate level of abstraction? Think of the abstract
states and actions we have chosen as corresponding to large sets of detailed world states and
detailed action sequences.
Now consider a solution to the abstract problem:
for example,
the path from Arad to Sibiu to Rimnicu Vilcea to Pitesti to Bucharest. This abstract solution
corresponds to a large number of more detailed paths. For example, we could drive with the radio on between Sibiu and Rimnicu Vilcea, and then switch it off for the rest of the trip.
The abstraction is valid if we can elaborate any abstract solution into a solution in the
more detailed world; a sufficient condition
is that for every detailed state that is “in Arad,”
there is a detailed path to some state that is “in Sibiu,” and so on.* The abstraction is useful if
carrying out each of the actions in the solution is easier than the original problem; in our case,
the action “drive from Arad to Sibiu” can be carried out without further search or planning by a driver with average skill. The choice of a good abstraction thus involves removing as much
detail as possible while retaining validity and ensuring that the abstract actions are easy to
carry out. Were it not for the ability to construct useful abstractions, intelligent agents would be completely swamped by the real world.
3.2 Example Problems
The problem-solving approach has been applied to a vast array of task environments. We list some of the best known here, distinguishing between standardized and real-world problems. A standardized problem is intended to illustrate or exercise various problem-solving methods. It can be given a concise, exact description and hence is suitable as a benchmark for researchers to compare the performance of algorithms. A real-world problem, such as robot navigation, is one whose solutions people actually use, and whose formulation is idiosyncratic, not standardized, because, for example, each robot has different sensors that produce different data.
3.2.1 Standardized problems
A grid world problem is a two-dimensional rectangular array of square cells in which agents can move from cell to cell. Typically the agent can move to any obstacle-free adjacent cell, horizontally or vertically and in some problems diagonally. Cells can contain objects, which the agent can pick up, push, or otherwise act upon; a wall or other impassible obstacle in a cell prevents an agent from moving into that cell. The vacuum world from Section 2.1 can be formulated as a grid world problem as follows:
• States: A state of the world says which objects are in which cells. For the vacuum world, the objects are the agent and any dirt. In the simple two-cell version, the agent can be in either of the two cells, and each cell can either contain dirt or not, so there are 2 · 2 · 2 = 8 states (see Figure 3.2). In general, a vacuum environment with n cells has $n \cdot 2^n$ states.
• Initial state: Any state can be designated as the initial state.
• Actions: In the two-cell world we defined three actions: Suck, move Left, and move Right. In a two-dimensional multi-cell world we need more movement actions. We could add Upward and Downward, giving us four absolute movement actions, or we could switch to egocentric actions, defined relative to the viewpoint of the agent (for example, Forward, Backward, TurnRight, and TurnLeft).
• Transition model: Suck removes any dirt from the agent’s cell; Forward moves the agent ahead one cell in the direction it is facing, unless it hits a wall, in which case the action has no effect. Backward moves the agent in the opposite direction, while TurnRight and TurnLeft change the direction it is facing by 90°.
• Goal states: The states in which every cell is clean.
• Action cost: Each action costs 1.
Figure 3.2 The state-space graph for the two-cell vacuum world. There are 8 states and three actions for each state: L = Left, R = Right, S = Suck.
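The vacuum-world formulation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cells are laid out in a single row, the absolute actions Left, Right, and Suck are used, and the names `vacuum_states`, `result`, and `is_goal` are hypothetical helpers rather than anything defined in the text. The sketch enumerates the states and checks the n · 2^n count.

```python
from itertools import product

def vacuum_states(n):
    """All states of an n-cell vacuum world: (agent position, dirt flag per cell)."""
    return [(pos, dirt)
            for pos in range(n)
            for dirt in product((False, True), repeat=n)]

def result(state, action, n):
    """Transition model, assuming the n cells form a single row."""
    pos, dirt = state
    if action == "Suck":
        # Remove any dirt from the agent's cell.
        return (pos, dirt[:pos] + (False,) + dirt[pos + 1:])
    if action == "Left":
        return (max(pos - 1, 0), dirt)        # moving into a wall has no effect
    if action == "Right":
        return (min(pos + 1, n - 1), dirt)    # moving into a wall has no effect
    raise ValueError(f"unknown action: {action}")

def is_goal(state):
    """Goal test: every cell is clean."""
    return not any(state[1])

print(len(vacuum_states(2)))   # 8 states in the two-cell world
print(len(vacuum_states(5)))   # 5 * 2**5 = 160
```

Representing a state as an immutable tuple keeps it hashable, which matters later when states are stored in a table of reached states.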
Figure 3.3 A typical instance of the 8-puzzle, showing a Start State and a Goal State.
Another type of grid world is the sokoban puzzle, in which the agent’s goal is to push a number of boxes, scattered about the grid, to designated storage locations. There can be at most one box per cell. When an agent moves forward into a cell containing a box and there is an empty cell on the other side of the box, then both the box and the agent move forward. The agent can’t push a box into another box or a wall. For a world with n non-obstacle cells and b boxes, there are n × n!/(b!(n − b)!) states; for example, on an 8 × 8 grid with a dozen boxes, there are over 200 trillion states.
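The sokoban state count can be checked with a short calculation. The sketch below simply evaluates the formula from the text, n × n!/(b!(n − b)!), using Python's `math.comb`; the function name `sokoban_state_count` is a hypothetical helper.

```python
from math import comb

def sokoban_state_count(n, b):
    """n choices for the agent's cell times C(n, b) placements of b boxes,
    following the formula quoted in the text."""
    return n * comb(n, b)

count = sokoban_state_count(64, 12)   # 8 x 8 grid, a dozen boxes
print(f"{count:,}")
print(count > 200 * 10**12)           # True: more than 200 trillion
```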
In a sliding-tile puzzle, a number of tiles (sometimes called blocks or pieces) are arranged in a grid with one or more blank spaces so that some of the tiles can slide into the blank space. One variant is the Rush Hour puzzle, in which cars and trucks slide around a 6 × 6 grid in an attempt to free a car from the traffic jam. Perhaps the best-known variant is the 8-puzzle (see Figure 3.3), which consists of a 3 × 3 grid with eight numbered tiles and one blank space, and the 15-puzzle on a 4 × 4 grid. The object is to reach a specified goal state, such as the one shown on the right of the figure. The standard formulation of the 8-puzzle is as follows:
• States: A state description specifies the location of each of the tiles.
• Initial state: Any state can be designated as the initial state. Note that a parity property partitions the state space: any given goal can be reached from exactly half of the possible initial states (see Exercise 3.PART).
• Actions: While in the physical world it is a tile that slides, the simplest way of describing an action is to think of the blank space moving Left, Right, Up, or Down. If the blank is at an edge or corner then not all actions will be applicable.
• Transition model: Maps a state and action to a resulting state; for example, if we apply Left to the start state in Figure 3.3, the resulting state has the 5 and the blank switched.
• Goal state: Although any state could be the goal, we typically specify a state with the numbers in order, as in Figure 3.3.
• Action cost: Each action costs 1.
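The 8-puzzle formulation above might be coded roughly as follows. This is a sketch under assumptions: the board is stored row by row as a tuple with 0 standing for the blank, and the `START` and `GOAL` layouts are assumed stand-ins for the figure (which is not reproduced here), chosen so that the blank sits just to the right of the 5 as the transition-model bullet describes. The function names are hypothetical.

```python
# Index offsets for moving the blank on a 3 x 3 board stored row by row.
MOVES = {"Left": -1, "Right": +1, "Up": -3, "Down": +3}

def actions(state):
    """Blank moves that keep the blank on the board."""
    row, col = divmod(state.index(0), 3)
    legal = []
    if col > 0:
        legal.append("Left")
    if col < 2:
        legal.append("Right")
    if row > 0:
        legal.append("Up")
    if row < 2:
        legal.append("Down")
    return legal

def result(state, action):
    """Transition model: swap the blank with the neighboring tile."""
    if action not in actions(state):
        raise ValueError(f"{action} is not applicable")
    blank = state.index(0)
    target = blank + MOVES[action]
    board = list(state)
    board[blank], board[target] = board[target], board[blank]
    return tuple(board)

# Assumed layouts; 0 is the blank.
START = (7, 2, 4,
         5, 0, 6,
         8, 3, 1)
GOAL = (0, 1, 2,
        3, 4, 5,
        6, 7, 8)

print(result(START, "Left"))   # the 5 and the blank trade places
print(actions(GOAL))           # blank in a corner: only two moves apply
```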
Note that every problem formulation involves abstractions. The 8-puzzle actions are abstracted to their beginning and final states, ignoring the intermediate locations where the tile
is sliding. We have abstracted away actions such as shaking the board when tiles get stuck
and ruled out extracting the tiles with a knife and putting them back again. We are left with a
description of the rules, avoiding all the details of physical manipulations.
Our final standardized problem was devised by Donald Knuth (1964) and illustrates how infinite state spaces can arise. Knuth conjectured that starting with the number 4, a sequence of square root, floor, and factorial operations can reach any desired positive integer. For example, we can reach 5 from 4 as follows:
$$\left\lfloor \sqrt{\sqrt{\sqrt{\sqrt{\sqrt{(4!)!}}}}} \right\rfloor = 5$$
The problem definition is simple:
• States: Positive real numbers.
• Initial state: 4.
• Actions: Apply square root, floor, or factorial operation (factorial for integers only).
• Transition model: As given by the mathematical definitions of the operations.
• Goal state: The desired positive integer.
• Action cost: Each action costs 1.
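The worked example of reaching 5 from 4 can be verified with exact integer arithmetic. The sketch below assumes five nested square roots of (4!)! and uses `math.isqrt`, which floors after every root; this gives the same final answer as taking the real square roots and flooring once at the end, because ⌊√⌊y⌋⌋ = ⌊√y⌋ for any non-negative real y. The helper name is hypothetical.

```python
from math import factorial, isqrt

def floor_of_nested_sqrt(x, times):
    """Floor of `times` nested square roots of the integer x, computed exactly."""
    for _ in range(times):
        x = isqrt(x)
    return x

big = factorial(factorial(4))           # (4!)! = 24!
print(big)                              # 620448401733239439360000
print(floor_of_nested_sqrt(big, 5))     # 5
```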
The state space for this problem is infinite: for any integer greater than 2 the factorial operator will always yield a larger integer. The problem is interesting because it explores very large numbers: the shortest path to 5 goes through (4!)! = 620,448,401,733,239,439,360,000. Infinite state spaces arise frequently in tasks involving the generation of mathematical expressions, circuits, proofs, programs, and other recursively defined objects.
3.2.2 Real-world problems
We have already seen how the route-finding problem is defined in terms of specified locations and transitions along edges between them. Route-finding algorithms are used in a variety of applications. Some, such as Web sites and in-car systems that provide driving directions, are relatively straightforward extensions of the Romania example. (The main complications are varying costs due to traffic-dependent delays, and rerouting due to road closures.) Others, such as routing video streams in computer networks, military operations planning, and airline travel-planning systems, involve much more complex specifications. Consider the airline travel problems that must be solved by a travel-planning Web site:
• States: Each state obviously includes a location (e.g., an airport) and the current time. Furthermore, because the cost of an action (a flight segment) may depend on previous segments, their fare bases, and their status as domestic or international, the state must record extra information about these “historical” aspects.
• Initial state: The user’s home airport.
• Actions: Take any flight from the current location, in any seat class, leaving after the current time, leaving enough time for within-airport transfer if needed.
• Transition model: The state resulting from taking a flight will have the flight’s destination as the new location and the flight’s arrival time as the new time.
• Goal state: A destination city. Sometimes the goal can be more complex, such as “arrive at the destination on a nonstop flight.”
• Action cost: A combination of monetary cost, waiting time, flight time, customs and immigration procedures, seat quality, time of day, type of airplane, frequent-flyer reward points, and so on.
Commercial travel advice systems use a problem formulation of this kind, with many additional complications to handle the airlines’ byzantine fare structures. Any seasoned traveler knows, however, that not all air travel goes according to plan. A really good system should include contingency plans: what happens if this flight is delayed and the connection is missed?
Touring problems describe a set of locations that must be visited, rather than a single goal destination. The traveling salesperson problem (TSP) is a touring problem in which every city on a map must be visited. The aim is to find a tour with cost < C (or in the optimization version, to find a tour with the lowest cost possible). An enormous amount of effort has been expended to improve the capabilities of TSP algorithms. The algorithms can also be extended to handle fleets of vehicles. For example, a search and optimization algorithm for routing school buses in Boston saved $5 million, cut traffic and air pollution, and saved time for drivers and students (Bertsimas et al., 2019). In addition to planning trips, search algorithms have been used for tasks such as planning the movements of automatic circuit-board drills and of stocking machines on shop floors.
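To make the two versions of the traveling salesperson problem concrete, here is a brute-force sketch on a tiny instance. Everything in it is illustrative: the four cities and their distances are made up, the helper names are hypothetical, and exhaustive enumeration is used only because the instance is small (it is not how practical TSP solvers work).

```python
from itertools import permutations

# Hypothetical symmetric distances between four made-up cities.
DIST = {
    ("A", "B"): 10, ("A", "C"): 15, ("A", "D"): 20,
    ("B", "C"): 35, ("B", "D"): 25, ("C", "D"): 30,
}

def distance(u, v):
    """Look up a distance regardless of the order of the endpoints."""
    return DIST[(u, v)] if (u, v) in DIST else DIST[(v, u)]

def tour_cost(tour):
    """Cost of visiting the cities in order and returning to the first one."""
    return sum(distance(tour[i], tour[(i + 1) % len(tour)])
               for i in range(len(tour)))

def all_tours(cities):
    """Every tour starting at the first city (fixing the start removes rotations)."""
    first, rest = cities[0], cities[1:]
    return [(first,) + p for p in permutations(rest)]

def best_tour(cities):
    """Optimization version: a tour with the lowest possible cost."""
    return min(all_tours(cities), key=tour_cost)

def has_tour_cheaper_than(cities, bound):
    """Decision version: is there a tour with cost < bound?"""
    return any(tour_cost(t) < bound for t in all_tours(cities))

cities = ("A", "B", "C", "D")
best = best_tour(cities)
print(best, tour_cost(best))                 # ('A', 'B', 'D', 'C') 80
print(has_tour_cheaper_than(cities, 80))     # False
```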
A VLSI layout problem requires positioning millions of components and connections on a chip to minimize area, minimize circuit delays, minimize stray capacitances, and maximize manufacturing yield. The layout problem comes after the logical design phase and is usually split into two parts: cell layout and channel routing. In cell layout, the primitive components of the circuit are grouped into cells, each of which performs some recognized function. Each cell has a fixed footprint (size and shape) and requires a certain number of connections to each of the other cells. The aim is to place the cells on the chip so that they do not overlap and so that there is room for the connecting wires to be placed between the cells. Channel
routing finds a specific route for each wire through the gaps between the cells. These search problems are extremely complex, but definitely worth solving.
Robot navigation is a generalization of the route-finding problem described earlier. Rather than following distinct paths (such as the roads in Romania), a robot can roam around,
in effect making its own paths. For a circular robot moving on a flat surface, the space is
essentially two-dimensional. When the robot has arms and legs that must also be controlled,
the search space becomes many-dimensional—one dimension for each joint angle. Advanced
techniques are required just to make the essentially continuous search space finite (see Chapter 26). In addition to the complexity of the problem, real robots must also deal with errors in their sensor readings and motor controls, with partial observability, and with other agents that might alter the environment.
Automatic assembly sequencing of complex objects (such as electric motors) by a robot
has been standard industry practice since the 1970s. Algorithms first find a feasible assembly
sequence and then work to optimize the process. Minimizing the amount of manual human
labor on the assembly line can produce significant savings in time and cost. In assembly problems, the aim is to find an order in which to assemble the parts of some object.
If the wrong order is chosen, there will be no way to add some part later in the sequence without undoing some of the work already done. Checking an action in the sequence for feasibility is a difficult geometrical search problem closely related to robot navigation. Thus, the generation of legal actions is the expensive part of assembly sequencing. Any practical algorithm must avoid exploring all but a tiny fraction of the state space. One important assembly problem is protein design, in which the goal is to find a sequence of amino acids that will fold into a three-dimensional protein with the right properties to cure some disease.
3.3 Search Algorithms
A search algorithm takes a search problem as input and returns a solution, or an indication of failure. In this chapter we consider algorithms that superimpose a search tree over the state-space graph, forming various paths from the initial state, trying to find a path that reaches a goal state. Each node in the search tree corresponds to a state in the state space and the edges in the search tree correspond to actions. The root of the tree corresponds to the initial state of the problem.
It is important to understand the distinction between the state space and the search tree.
The state space describes the (possibly infinite) set of states in the world, and the actions that allow transitions from one state to another.
The search tree describes paths between
these states, reaching towards the goal. The search tree may have multiple paths to (and thus
multiple nodes for) any given state, but each node in the tree has a unique path back to the root (as in all trees).
Figure 3.4 shows the first few steps in finding a path from Arad to Bucharest. The root node of the search tree is at the initial state, Arad. We can expand the node, by considering the available ACTIONS for that state, using the RESULT function to see where those actions lead to, and generating a new node (called a child node or successor node) for each of the resulting states. Each child node has Arad as its parent node.
Now we must choose which of these three child nodes to consider next. This is the essence of search: following up one option now and putting the others aside for later. Suppose we choose to expand Sibiu first. Figure 3.4 (bottom) shows the result: a set of 6 unexpanded nodes (outlined in bold). We call this the frontier of the search tree. We say that any state that has had a node generated for it has been reached (whether or not that node has been expanded). Figure 3.5 shows the search tree superimposed on the state-space graph.
Note that the frontier separates two regions of the state-space graph: an interior region where every state has been expanded, and an exterior region of states that have not yet been reached. This property is illustrated in Figure 3.6.
Figure 3.4 Three partial search trees for finding a route from Arad to Bucharest. Nodes that have been expanded are lavender with bold letters; nodes on the frontier that have been generated but not yet expanded are in green; the set of states corresponding to these two types of nodes are said to have been reached. Nodes that could be generated next are shown in faint dashed lines. Notice in the bottom tree there is a cycle from Arad to Sibiu to Arad; that can’t be an optimal path, so search should not continue from there.
Figure 3.5 A sequence of search trees generated by a graph search on the Romania problem of Figure 3.1. At each stage, we have expanded every node on the frontier, extending every path with all applicable actions that don’t result in a state that has already been reached. Notice that at the third stage, the topmost city (Oradea) has two successors, both of which have already been reached by other paths, so no paths are extended from Oradea.
Figure 3.6 The separation property of graph search, illustrated on a rectangular-grid problem. The frontier (green) separates the interior (lavender) from the exterior (faint dashed). The frontier is the set of nodes (and corresponding states) that have been reached but not yet expanded; the interior is the set of nodes (and corresponding states) that have been expanded; and the exterior is the set of states that have not been reached. In (a), just the root has been expanded. In (b), the top frontier node is expanded. In (c), the remaining successors of the root are expanded in clockwise order.
5 Some authors call the frontier the open list, which is both geographically less evocative and computationally less appropriate, because a queue is more efficient than a list here. Those authors use the term closed list to refer to the set of previously expanded nodes, which in our terminology would be the reached nodes minus the frontier.
function BEST-FIRST-SEARCH(problem, f) returns a solution node or failure
  node ← NODE(STATE=problem.INITIAL)
  frontier ← …
Figure 3.8 Breadth-first search on a simple binary tree. At each stage, the node to be expanded next is indicated by the triangular marker.
Section 3.4 Uninformed Search Strategies
function BREADTH-FIRST-SEARCH(problem) returns a solution node or failure
  node ← NODE(problem.INITIAL)
  if problem.IS-GOAL(node.STATE) then return node
  frontier ← …
… $+\, b^d = O(b^d)$. All the nodes remain in memory, so both time and space complexity are $O(b^d)$. Exponential bounds like that are scary. As a typical real-world example, consider a problem with branching factor b = 10, processing speed 1 million nodes/second, and memory requirements of 1 Kbyte/node. A search to depth d = 10 would take less than 3 hours, but would require 10 terabytes of memory. The memory requirements are a bigger problem for breadth-first search than the execution time. But time is still an important factor. At depth d = 14, even with infinite memory, the search would take 3.5 years. In general, exponential-complexity search problems cannot be solved by uninformed search for any but the smallest instances.
3.4.2 Dijkstra’s algorithm or uniform-cost search
When actions have different costs, an obvious choice is to use best-first search where the evaluation function is the cost of the path from the root to the current node. This is called Dijkstra’s algorithm by the theoretical computer science community, and uniform-cost search by the AI community. The idea is that while breadth-first search spreads out in waves of uniform depth (first depth 1, then depth 2, and so on), uniform-cost search spreads out in waves of uniform path-cost. The algorithm can be implemented as a call to BEST-FIRST-SEARCH with PATH-COST as the evaluation function, as shown in Figure 3.9.
Figure 3.10 Part of the Romania state space, selected to illustrate uniform-cost search.
Consider Figure 3.10, where the problem is to get from Sibiu to Bucharest. The successors of Sibiu are Rimnicu Vilcea and Fagaras, with costs 80 and 99, respectively. The least-cost node, Rimnicu Vilcea, is expanded next, adding Pitesti with cost 80 + 97 = 177. The least-cost node is now Fagaras, so it is expanded, adding Bucharest with cost 99 + 211 = 310. Bucharest is the goal, but the algorithm tests for goals only when it expands a node, not when it generates a node, so it has not yet detected that this is a path to the goal.
The algorithm continues on, choosing Pitesti for expansion next and adding a second path to Bucharest with cost 80 + 97 + 101 = 278. It has a lower cost, so it replaces the previous path in reached and is added to the frontier. It turns out this node now has the lowest cost, so it is considered next, found to be a goal, and returned. Note that if we had checked for a goal upon generating a node rather than when expanding the lowest-cost node, then we would have returned a higher-cost path (the one through Fagaras).
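The walk-through above can be reproduced with a small uniform-cost search. This is a minimal sketch, not the book's pseudocode: it uses only the five directed edges mentioned in the Sibiu-to-Bucharest example, Python's `heapq` as the priority queue, and a hypothetical `early_goal_test` flag that switches to testing for the goal when a node is generated, so the two behaviors can be compared.

```python
import heapq

# Only the edges used in the Sibiu-to-Bucharest example (costs from the text).
GRAPH = {
    "Sibiu": {"Rimnicu Vilcea": 80, "Fagaras": 99},
    "Rimnicu Vilcea": {"Pitesti": 97},
    "Fagaras": {"Bucharest": 211},
    "Pitesti": {"Bucharest": 101},
    "Bucharest": {},
}

def uniform_cost_search(start, goal, early_goal_test=False):
    """Best-first search ordered by path cost; returns (cost, path) or None."""
    frontier = [(0, [start])]          # priority queue of (path cost, path)
    reached = {start: 0}               # cheapest known cost to each state
    while frontier:
        cost, path = heapq.heappop(frontier)
        state = path[-1]
        if cost > reached[state]:
            continue                   # stale entry: a cheaper path was found later
        if state == goal:              # late goal test, done on expansion
            return cost, path
        for child, step in GRAPH[state].items():
            new_cost = cost + step
            if early_goal_test and child == goal:
                return new_cost, path + [child]   # may be suboptimal
            if child not in reached or new_cost < reached[child]:
                reached[child] = new_cost
                heapq.heappush(frontier, (new_cost, path + [child]))
    return None

print(uniform_cost_search("Sibiu", "Bucharest"))
# (278, ['Sibiu', 'Rimnicu Vilcea', 'Pitesti', 'Bucharest'])
print(uniform_cost_search("Sibiu", "Bucharest", early_goal_test=True))
# (310, ['Sibiu', 'Fagaras', 'Bucharest'])
```

Instead of updating an entry already on the frontier, the sketch pushes a new entry and skips stale ones when they are popped; this is a common simplification when the priority queue does not support decreasing a key.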
The complexity of uniform-cost search is characterized in terms of C*, the cost of the optimal solution, and ε, a lower bound on the cost of each action, with ε > 0. Then the algorithm’s worst-case time and space complexity is $O(b^{1+\lfloor C^*/\epsilon \rfloor})$ …
… ε > 0; 3 cost-optimal if action costs are all identical; 4 if both directions are breadth-first or uniform-cost.
3.5 Informed (Heuristic) Search Strategies
This section shows how an informed search strategy (one that uses domain-specific hints about the location of goals) can find solutions more efficiently than an uninformed strategy. The hints come in the form of a heuristic function, denoted h(n):¹⁰
h(n) = estimated cost of the cheapest path from the state at node n to a goal state.
For example, in route-finding problems, we can estimate the distance from the current state to a goal by computing the straight-line distance on the map between the two points. We study heuristics and where they come from in more detail in Section 3.6.
10 It may seem odd that the heuristic function operates on a node, when all it really needs is the node’s state. It is traditional to use h(n) rather than h(s) to be consistent with the evaluation function f(n) and the path cost g(n).
Arad         366    Mehadia          241
Bucharest      0    Neamt            234
Craiova      160    Oradea           380
Drobeta      242    Pitesti          100
Eforie       161    Rimnicu Vilcea   193
Fagaras      176    Sibiu            253
Giurgiu       77    Timisoara        329
Hirsova      151    Urziceni          80
Iasi         226    Vaslui           199
Lugoj        244    Zerind           374
Figure 3.16 Values of h_SLD: straight-line distances to Bucharest.
3.5.1 Greedy best-first search
Greedy best-first search is a form of best-first search that expands first the node with the lowest h(n) value (the node that appears to be closest to the goal) on the grounds that this is likely to lead to a solution quickly. So the evaluation function f(n) = h(n).
Let us see how this works for route-finding problems in Romania; we use the straight-line distance heuristic, which we will call h_SLD. If the goal is Bucharest, we need to know the straight-line distances to Bucharest, which are shown in Figure 3.16. For example, h_SLD(Arad) = 366. Notice that the values of h_SLD cannot be computed from the problem description itself (that is, the ACTIONS and RESULT functions). Moreover, it takes a certain amount of world knowledge to know that h_SLD is correlated with actual road distances and is, therefore, a useful heuristic.
Figure 3.17 shows the progress of a greedy best-first search using h_SLD to find a path from Arad to Bucharest. The first node to be expanded from Arad will be Sibiu because the heuristic says it is closer to Bucharest than is either Zerind or Timisoara. The next node to be expanded will be Fagaras because it is now closest according to the heuristic. Fagaras in turn generates Bucharest, which is the goal. For this particular problem, greedy best-first search using h_SLD finds a solution without ever expanding a node that is not on the solution path. The solution it found does not have optimal cost, however: the path via Sibiu and Fagaras to Bucharest is 32 miles longer than the path through Rimnicu Vilcea and Pitesti. This is why the algorithm is called “greedy”: on each iteration it tries to get as close to a goal as it can, but greediness can lead to worse results than being careful.
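The greedy search just described can be traced with a short program. The sketch below is a simplified illustration, not the book's algorithm verbatim: it covers only a fragment of the Romania map, the road lengths are assumed values for Figure 3.1 (the figure is not reproduced here, so treat them as assumptions), the h_SLD values come from Figure 3.16, and a plain set of reached states is used instead of a table of path costs.

```python
import heapq

# Fragment of the Romania map; road lengths are assumed to match Figure 3.1.
ROADS = [
    ("Arad", "Zerind", 75), ("Arad", "Timisoara", 118), ("Arad", "Sibiu", 140),
    ("Sibiu", "Oradea", 151), ("Sibiu", "Fagaras", 99),
    ("Sibiu", "Rimnicu Vilcea", 80), ("Rimnicu Vilcea", "Pitesti", 97),
    ("Rimnicu Vilcea", "Craiova", 146), ("Pitesti", "Craiova", 138),
    ("Pitesti", "Bucharest", 101), ("Fagaras", "Bucharest", 211),
]
# Straight-line distances to Bucharest (Figure 3.16).
H_SLD = {"Arad": 366, "Zerind": 374, "Timisoara": 329, "Sibiu": 253,
         "Oradea": 380, "Fagaras": 176, "Rimnicu Vilcea": 193,
         "Craiova": 160, "Pitesti": 100, "Bucharest": 0}

GRAPH = {}
for a, b, cost in ROADS:               # every road can be driven both ways
    GRAPH.setdefault(a, {})[b] = cost
    GRAPH.setdefault(b, {})[a] = cost

def greedy_best_first(start, goal):
    """Always expand the frontier node with the smallest h value."""
    frontier = [(H_SLD[start], start, [start], 0)]   # (h, state, path, path cost)
    reached = {start}
    expanded = []
    while frontier:
        _, state, path, cost = heapq.heappop(frontier)
        if state == goal:
            return path, cost, expanded
        expanded.append(state)
        for child, step in GRAPH[state].items():
            if child not in reached:
                reached.add(child)
                heapq.heappush(
                    frontier, (H_SLD[child], child, path + [child], cost + step))
    return None

path, cost, expanded = greedy_best_first("Arad", "Bucharest")
print(path, cost)    # ['Arad', 'Sibiu', 'Fagaras', 'Bucharest'] 450
print(expanded)      # ['Arad', 'Sibiu', 'Fagaras']
```

The returned cost of 450 is 32 more than the 418 of the route through Rimnicu Vilcea and Pitesti (140 + 80 + 97 + 101), matching the gap noted in the text.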
Greedy best-first graph search is complete in finite state spaces, but not in infinite ones. The worst-case time and space complexity is O(|V|). With a good heuristic function, however, the complexity can be reduced substantially, on certain problems reaching O(bm).
3.5.2 A* search
The most common informed search algorithm is A* search (pronounced “A-star search”), a best-first search that uses the evaluation function
f(n) = g(n) + h(n)
where g(n) is the path cost from the initial state to node n, and h(n) is the estimated cost of the shortest path from n to a goal state, so we have
f(n) = estimated cost of the best path that continues from n to a goal.
Figure 3.17 Stages in a greedy best-first tree-like search for Bucharest with the straight-line distance heuristic h_SLD. Nodes are labeled with their h-values. Panels: (a) The initial state; (b) After expanding Arad; (c) After expanding Sibiu; (d) After expanding Fagaras.
In Figure 3.18, we show the progress of an A* search with the goal of reaching Bucharest.
The values of g are computed from the action costs in Figure 3.1, and the values of h_SLD are given in Figure 3.16. Notice that Bucharest first appears on the frontier at step (e), but it is not selected for expansion (and thus not detected as a solution) because at f = 450 it is not the lowest-cost node on the frontier; that would be Pitesti, at f = 417. Another way to say this is that there might be a solution through Pitesti whose cost is as low as 417, so the algorithm will not settle for a solution that costs 450. At step (f), a different path to Bucharest is now the lowest-cost node, at f = 418, so it is selected and detected as the optimal solution.
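The A* trace described above can be replayed in code. What follows is a sketch under assumptions rather than a definitive implementation: the map is a fragment of Romania whose road lengths are taken to match the g values quoted for Figure 3.18 (Arad to Timisoara at 118 is an assumed value), h_SLD comes from Figure 3.16, and `best_first_search` is a hypothetical helper that takes the evaluation function as a parameter.

```python
import heapq

# Fragment of the Romania map (road lengths assumed to match Figure 3.1).
ROADS = [
    ("Arad", "Zerind", 75), ("Arad", "Timisoara", 118), ("Arad", "Sibiu", 140),
    ("Sibiu", "Oradea", 151), ("Sibiu", "Fagaras", 99),
    ("Sibiu", "Rimnicu Vilcea", 80), ("Rimnicu Vilcea", "Pitesti", 97),
    ("Rimnicu Vilcea", "Craiova", 146), ("Pitesti", "Craiova", 138),
    ("Pitesti", "Bucharest", 101), ("Fagaras", "Bucharest", 211),
]
H_SLD = {"Arad": 366, "Zerind": 374, "Timisoara": 329, "Sibiu": 253,
         "Oradea": 380, "Fagaras": 176, "Rimnicu Vilcea": 193,
         "Craiova": 160, "Pitesti": 100, "Bucharest": 0}

GRAPH = {}
for a, b, cost in ROADS:               # roads are two-way
    GRAPH.setdefault(a, {})[b] = cost
    GRAPH.setdefault(b, {})[a] = cost

def best_first_search(start, goal, f):
    """Expand the frontier node with the lowest f(g, state); goal test on expansion."""
    frontier = [(f(0, start), 0, start, [start])]    # (f, g, state, path)
    reached = {start: 0}                             # cheapest known g per state
    expanded = []
    while frontier:
        _, g, state, path = heapq.heappop(frontier)
        if g > reached[state]:
            continue                                 # stale entry, skip it
        if state == goal:
            return path, g, expanded
        expanded.append(state)
        for child, step in GRAPH[state].items():
            g2 = g + step
            if child not in reached or g2 < reached[child]:
                reached[child] = g2
                heapq.heappush(frontier, (f(g2, child), g2, child, path + [child]))
    return None

def a_star(start, goal):
    """A* is best-first search with f = g + h."""
    return best_first_search(start, goal, lambda g, s: g + H_SLD[s])

path, cost, expanded = a_star("Arad", "Bucharest")
print(path, cost)   # ['Arad', 'Sibiu', 'Rimnicu Vilcea', 'Pitesti', 'Bucharest'] 418
print(expanded)     # ['Arad', 'Sibiu', 'Rimnicu Vilcea', 'Fagaras', 'Pitesti']
```

Passing `lambda g, s: g` as the evaluation function would turn the same helper into uniform-cost search, and `lambda g, s: H_SLD[s]` into a greedy best-first search.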
A* search is complete.¹¹ Whether A* is cost-optimal depends on certain properties of the heuristic. A key property is admissibility: an admissible heuristic is one that never overestimates the cost to reach a goal. (An admissible heuristic is therefore optimistic.) With an admissible heuristic, A* is cost-optimal, which we can show with a proof by contradiction. Suppose the optimal path has cost C*, but the algorithm returns a path with cost C > C*. Then there must be some node n which is on the optimal path and is unexpanded (because if all the nodes on the optimal path had been expanded, then we would have returned that optimal solution). So then, using the notation g*(n) to mean the cost of the optimal path from the start to n, and h*(n) to mean the cost of the optimal path from n to the nearest goal, we have:
$$\begin{aligned}
f(n) &> C^* && \text{(otherwise } n \text{ would have been expanded)} \\
f(n) &= g(n) + h(n) && \text{(by definition)} \\
f(n) &= g^*(n) + h(n) && \text{(because } n \text{ is on an optimal path)} \\
f(n) &\le g^*(n) + h^*(n) && \text{(because of admissibility, } h(n) \le h^*(n)\text{)} \\
f(n) &\le C^* && \text{(by definition, } C^* = g^*(n) + h^*(n)\text{)}
\end{aligned}$$
The first and last lines form a contradiction, so the supposition that the algorithm could return a suboptimal path must be wrong: it must be that A* returns only cost-optimal paths.
11 Again, assuming all action costs are > ε > 0, and the state space either has a solution or is finite.
Figure 3.18 Stages in an A* search for Bucharest. Nodes are labeled with f = g + h. The h values are the straight-line distances to Bucharest taken from Figure 3.16. Surviving panel titles: (a) The initial state; (b) After expanding Arad; (d) After expanding Rimnicu Vilcea; (f) After expanding Pitesti.
Figure 3.19 Triangle inequality: If the heuristic h is consistent, then the single number h(n) will be less than the sum of the cost c(n, a, n′) of the action from n to n′ plus the heuristic estimate h(n′).
A slightly stronger property is called consistency. A heuristic h(n) is consistent if, for every node n and every successor n′ of n generated by an action a, we have:
$$h(n) \le c(n, a, n') + h(n').$$
This is a form of the triangle inequality, which stipulates that a side of a triangle cannot be longer than the sum of the other two sides (see Figure 3.19). An example of a consistent heuristic is the straight-line distance h_SLD that we used in getting to Bucharest.
Every consistent heuristic is admissible (but not vice versa), so with a consistent heuristic, A* is cost-optimal. In addition, with a consistent heuristic, the first time we reach a state it will be on an optimal path, so we never have to re-add a state to the frontier, and never have to change an entry in reached. But with an inconsistent heuristic, we may end up with multiple paths reaching the same state, and if each new path has a lower path cost than the previous one, then we will end up with multiple nodes for that state in the frontier, costing us both time and space. Because of that, some implementations of A* take care to only enter a state into the frontier once, and if a better path to the state is found, all the successors of the state are updated (which requires that nodes have child pointers as well as parent pointers). These complications have led many implementers to avoid inconsistent heuristics, but Felner et al. (2011) argue that the worst effects rarely happen in practice, and one shouldn’t be afraid of inconsistent heuristics.
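The consistency condition is easy to test mechanically on a finite graph. The following sketch checks h(n) ≤ c(n, a, n′) + h(n′) for every road, in both directions, on a fragment of the Romania map; the road lengths are assumed to match Figure 3.1, the h_SLD values come from Figure 3.16, and `is_consistent` is a hypothetical helper. It also shows how inflating a single heuristic value breaks the condition.

```python
# Fragment of the Romania map (road lengths assumed to match Figure 3.1).
EDGES = [
    ("Arad", "Zerind", 75), ("Arad", "Timisoara", 118), ("Arad", "Sibiu", 140),
    ("Sibiu", "Oradea", 151), ("Sibiu", "Fagaras", 99),
    ("Sibiu", "Rimnicu Vilcea", 80), ("Rimnicu Vilcea", "Pitesti", 97),
    ("Rimnicu Vilcea", "Craiova", 146), ("Pitesti", "Craiova", 138),
    ("Pitesti", "Bucharest", 101), ("Fagaras", "Bucharest", 211),
]
# Straight-line distances to Bucharest (Figure 3.16).
H_SLD = {"Arad": 366, "Zerind": 374, "Timisoara": 329, "Sibiu": 253,
         "Oradea": 380, "Fagaras": 176, "Rimnicu Vilcea": 193,
         "Craiova": 160, "Pitesti": 100, "Bucharest": 0}

def is_consistent(h, edges):
    """True if h(n) <= c(n, a, n') + h(n') holds along every road, both ways."""
    return all(h[u] <= cost + h[v] and h[v] <= cost + h[u]
               for u, v, cost in edges)

print(is_consistent(H_SLD, EDGES))        # True

# Inflating one estimate violates the triangle inequality:
# 400 > c(Sibiu, Fagaras) + h(Fagaras) = 99 + 176 = 275.
inflated = dict(H_SLD, Sibiu=400)
print(is_consistent(inflated, EDGES))     # False
```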
Figure 3.20 Map of Romania showing contours at f = 380, f = 400, and f = 420, with Arad as the start state. Nodes inside a given contour have f = g + h costs less than or equal to the contour value.
With an inadmissible heuristic, A* may or may not be cost-optimal. Here are two cases where it is: First, if there is even one cost-optimal path on which h(n) is admissible for all nodes n on the path, then that path will be found, no matter what the heuristic says for states off the path. Second, if the optimal solution has cost C*, and the second-best has cost C₂, and if h(n) overestimates some costs, but never by more than C₂ − C*, then A* is guaranteed to return cost-optimal solutions.
3.5.3 Search contours
A useful way to visualize a search is to draw contours in the state space, just like the contours in a topographic map. Figure 3.20 shows an example. Inside the contour labeled 400, all nodes have f(n) = g(n) + h(n) ≤ 400, and so on. Then, because A* expands the frontier node of lowest f-cost, we can see that an A* search fans out from the start node, adding nodes in concentric bands of increasing f-cost.
With uniform-cost search, we also have contours, but of g-cost, not g + h. The contours with uniform-cost search will be “circular” around the start state, spreading out equally in all directions with no preference towards the goal. With A* search using a good heuristic, the g + h bands will stretch toward a goal state (as in Figure 3.20) and become more narrowly focused around an optimal path.
It should be clear that as you extend a path, the g costs are monotonic: the path cost always increases as you go along a path, because action costs are always positive.¹² Therefore you get concentric contour lines that don’t cross each other, and if you choose to draw the lines fine enough, you can put a line between any two nodes on any path.
12 Technically, we say “strictly monotonic” for costs that always increase, and “monotonic” for costs that never decrease, but might remain the same.
Chapter 3 Solving Problems by Searching

But it is not obvious whether the f = g + h cost will monotonically increase. As you extend a path from n to n′, the cost goes from g(n) + h(n) to g(n) + c(n, a, n′) + h(n′). Canceling out the g(n) term, we see that the path’s cost will be monotonically increasing if and only if h(n) ≤ c(n, a, n′) + h(n′); in other words if and only if the heuristic is consistent.13 But note that a path might contribute several nodes in a row with the same g(n) + h(n) score; this will happen whenever the decrease in h is exactly equal to the action cost just taken (for example, in a grid problem, when n is in the same row as the goal and you take a step towards the goal, g is increased by 1 and h is decreased by 1). If C* is the cost of the optimal solution path, then we can say the following:
• A* expands all nodes that can be reached from the initial state on a path where every node on the path has f(n) < C*. We say these are surely expanded nodes.
• A* might then expand some of the nodes right on the “goal contour” (where f(n) = C*) before selecting a goal node.
• A* expands no nodes with f(n) > C*.
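The consistency condition h(n) ≤ c(n, a, n′) + h(n′) discussed above can be checked mechanically on a small explicit graph. The following is a minimal sketch, not part of the original text: the four-cell row problem, the two heuristic tables, and the function names are all hypothetical and chosen only for illustration. It shows f staying constant along a path under a consistent heuristic and dipping under an admissible but inconsistent one.

```python
def is_consistent(edges, h):
    """True if h(n) <= c(n, a, n') + h(n') holds for every edge.

    edges: iterable of (n, n_next, cost); h: dict mapping state -> estimate.
    """
    return all(h[n] <= cost + h[n_next] for n, n_next, cost in edges)


def f_along_path(path, cost, h):
    """Return the list of f = g + h values for the nodes on a path."""
    g = 0
    fs = [h[path[0]]]
    for n, n_next in zip(path, path[1:]):
        g += cost[(n, n_next)]
        fs.append(g + h[n_next])
    return fs


# Hypothetical toy problem: four cells in a row, goal at cell 3, unit step costs.
cost = {(i, j): 1 for i in range(4) for j in range(4) if abs(i - j) == 1}
edges = [(i, j, c) for (i, j), c in cost.items()]

h_good = {0: 3, 1: 2, 2: 1, 3: 0}  # exact distance to the goal: consistent
h_bad = {0: 3, 1: 0, 2: 1, 3: 0}   # admissible, but drops by 3 over a cost-1 step

print(is_consistent(edges, h_good), f_along_path([0, 1, 2, 3], cost, h_good))
print(is_consistent(edges, h_bad), f_along_path([0, 1, 2, 3], cost, h_bad))
```

With h_good every step raises g by 1 and lowers h by 1, so f never changes; with h_bad the f value drops between the first two nodes, which is exactly the non-monotonic behavior that consistency rules out.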
We say that A* with a consistent heuristic is optimally efficient in the sense that any algorithm that extends search paths from the initial state, and uses the same heuristic information, must expand all nodes that are surely expanded by A* (because any one of them could have been part of an optimal solution). Among the nodes with f(n) = C*, one algorithm could get lucky and choose the optimal one first while another algorithm is unlucky; we don’t consider this difference in defining optimal efficiency.

A* is efficient because it prunes away search tree nodes that are not necessary for finding an optimal solution. In Figure 3.18(b) we see that Timisoara has f = 447 and Zerind has f = 449. Even though they are children of the root and would be among the first nodes expanded by uniform-cost or breadth-first search, they are never expanded by A* search because the solution with f = 418 is found first. The concept of pruning (eliminating possibilities from consideration without having to examine them) is important for many areas of AI.
That A* search is complete, cost-optimal, and optimally efficient among all such algorithms is rather satisfying. Unfortunately, it does not mean that A* is the answer to all our searching needs. The catch is that for many problems, the number of nodes expanded can be exponential in the length of the solution. For example, consider a version of the vacuum world with a super-powerful vacuum that can clean up any one square at a cost of 1 unit, without even having to visit the square; in that scenario, squares can be cleaned in any order. With N initially dirty squares, there are 2^N states where some subset has been cleaned; all of those states are on an optimal solution path, and hence satisfy f(n) < C*, so all of them would be visited by A*.
13 In fact, the term “monotonic heuristic” is a synonym for “consistent heuristic.” The two ideas were developed independently, and then it was proved that they are equivalent (Pearl, 1984).

3.5.4 Inadmissible heuristics and weighted A*

A* search has many good qualities, but it expands a lot of nodes. We can explore fewer nodes (taking less time and space) if we are willing to accept solutions that are suboptimal, but are “good enough”, what we call satisficing solutions. If we allow A* search to use an inadmissible heuristic (one that may overestimate) then we risk missing the optimal solution, but the heuristic can potentially be more accurate, thereby reducing the number of nodes expanded. For example, road engineers know the concept of a detour index, which is a multiplier applied to the straight-line distance to account for the typical curvature of roads. A detour index of 1.3 means that if two cities are 10 miles apart in straight-line distance, a good estimate of the best path between them is 13 miles. For most localities, the detour index ranges between 1.2 and 1.6.

Figure 3.21 Two searches on the same grid: (a) an A* search and (b) a weighted A* search with weight W = 2. The gray bars are obstacles, the purple line is the path from the green start to red goal, and the small dots are states that were reached by each search. On this particular problem, weighted A* explores 7 times fewer states and finds a path that is 5% more costly.

We can apply this idea to any problem, not just ones involving roads, with an approach called weighted A* search where we weight the heuristic value more heavily, giving us the evaluation function f(n) = g(n) + W × h(n), for some W > 1.

Figure 3.21 shows a search problem on a grid world. In (a), an A* search finds the optimal solution, but has to explore a large portion of the state space to find it. In (b), a weighted A* search finds a solution that is slightly costlier, but the search time is much faster. We see that the weighted search focuses the contour of reached states towards a goal. That means that fewer states are explored, but if the optimal path ever strays outside of the weighted search’s contour (as it does in this case), then the optimal path will not be found. In general, if the optimal solution costs C*, a weighted A* search will find a solution that costs somewhere between C* and W × C*; but in practice we usually get results much closer to C* than W × C*.
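To make the evaluation function f(n) = g(n) + W × h(n) concrete, here is a minimal sketch of a best-first search parameterized by W. It is an illustration under stated assumptions rather than the book's implementation: the function name, the tuple-based frontier, and the choice to include only a fragment of the Romania map (road distances and straight-line distances to Bucharest) are all choices made for this example.

```python
import heapq

# Fragment of the Romania road map (assumed subset; distances between cities).
ROADS = [("Arad", "Sibiu", 140), ("Arad", "Timisoara", 118), ("Arad", "Zerind", 75),
         ("Zerind", "Oradea", 71), ("Oradea", "Sibiu", 151), ("Sibiu", "Fagaras", 99),
         ("Sibiu", "Rimnicu Vilcea", 80), ("Rimnicu Vilcea", "Pitesti", 97),
         ("Rimnicu Vilcea", "Craiova", 146), ("Craiova", "Pitesti", 138),
         ("Fagaras", "Bucharest", 211), ("Pitesti", "Bucharest", 101)]

# Straight-line distances to Bucharest, used as the heuristic h.
H_SLD = {"Arad": 366, "Bucharest": 0, "Craiova": 160, "Fagaras": 176,
         "Oradea": 380, "Pitesti": 100, "Rimnicu Vilcea": 193, "Sibiu": 253,
         "Timisoara": 329, "Zerind": 374}

GRAPH = {}
for a, b, d in ROADS:  # roads are two-way
    GRAPH.setdefault(a, {})[b] = d
    GRAPH.setdefault(b, {})[a] = d


def weighted_a_star(start, goal, h, w=1):
    """Best-first search on f(n) = g(n) + w * h(n); returns (cost, path) or None."""
    frontier = [(w * h[start], 0, [start])]
    best_g = {start: 0}
    while frontier:
        _, g, path = heapq.heappop(frontier)
        state = path[-1]
        if state == goal:          # goal test when the node is selected
            return g, path
        if g > best_g[state]:      # stale entry: a cheaper path was found later
            continue
        for nxt, step in GRAPH[state].items():
            g2 = g + step
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(frontier, (g2 + w * h[nxt], g2, path + [nxt]))
    return None


print(weighted_a_star("Arad", "Bucharest", H_SLD, w=1))
print(weighted_a_star("Arad", "Bucharest", H_SLD, w=2))
```

On this fragment, W = 1 returns the route through Rimnicu Vilcea and Pitesti with cost 418, while W = 2 commits to Fagaras earlier and returns a route of cost 450, which lies between C* = 418 and W × C* = 836 as the bound above requires.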
We have considered searches that evaluate states by combining g and h in various ways; weighted A* can be seen as a generalization of the others:

A* search:                  g(n) + h(n)         (W = 1)
Uniform-cost search:        g(n)                (W = 0)
Greedy best-first search:   h(n)                (W = ∞)
Weighted A* search:         g(n) + W × h(n)     (1 < W < ∞)

    if ΔE > 0 then current ← next
    else current ← next only with probability e^(−ΔE/T)
Figure 4.5 The simulated annealing algorithm, a version of stochastic hill climbing where some downhill moves are allowed. The schedule input determines the value of the “temperature” T as a function of time.
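The acceptance rule in the figure can be sketched in Python as follows. This is a hedged illustration, not the book's code: it assumes higher values are better and defines ΔE as value(next) − value(current), so a worsening move has ΔE < 0 and is accepted with probability e^(ΔE/T); the sign convention in the figure may differ depending on how ΔE is defined there. The landscape, neighbor function, and cooling schedule are hypothetical.

```python
import math
import random


def acceptance_probability(delta_e, temperature):
    """Always accept improvements; accept a worsening move with prob. e^(dE/T)."""
    if delta_e > 0:
        return 1.0
    return math.exp(delta_e / temperature)


def simulated_annealing(initial, neighbors, value, schedule, rng=random):
    """Stochastic hill climbing (maximizing value) with occasional downhill moves."""
    current = initial
    t = 1
    while True:
        temperature = schedule(t)
        if temperature <= 0:
            return current
        nxt = rng.choice(neighbors(current))
        delta_e = value(nxt) - value(current)
        if rng.random() < acceptance_probability(delta_e, temperature):
            current = nxt
        t += 1


# Hypothetical 1-D landscape: states are indices, value is the height.
LANDSCAPE = [1, 3, 2, 5, 4, 9, 7]


def line_neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(LANDSCAPE)]


def cooling_schedule(t):
    # Geometric cooling, cut off after 500 steps (an arbitrary choice).
    return 10.0 * 0.95 ** t if t < 500 else 0


if __name__ == "__main__":
    best = simulated_annealing(0, line_neighbors, LANDSCAPE.__getitem__,
                               cooling_schedule, random.Random(0))
    print(best, LANDSCAPE[best])
```

At high temperature nearly every move is accepted; as T shrinks, e^(ΔE/T) for a downhill move approaches zero and the procedure behaves like plain hill climbing.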
all the probability is concentrated on the global maxima, which the algorithm will find with probability approaching 1.

Simulated annealing was used to solve VLSI layout problems beginning in the 1980s. It has been applied widely to factory scheduling and other large-scale optimization tasks.

4.1.3 Local beam search

Keeping just one node in memory might seem to be an extreme reaction to the problem of memory limitations. The local beam search algorithm keeps track of k states rather than just one. It begins with k randomly generated states. At each step, all the successors of all k states are generated. If any one is a goal, the algorithm halts. Otherwise, it selects the k best successors from the complete list and repeats.

At first sight, a local beam search with k states might seem to be nothing more than running k random restarts in parallel instead of in sequence. In fact, the two algorithms are quite different. In a random-restart search, each search process runs independently of the others. In a local beam search, useful information is passed among the parallel search threads.
In effect, the states that generate the best successors say to the others, “Come over
2, which occurs
only rarely in nature but is easy enough to simulate on computers.
• The selection process for selecting the individuals who will become the parents of the next generation: one possibility is to select from all individuals with probability proportional to their fitness score. Another possibility is to randomly select n individuals (n > p), and then select the p most fit ones as parents.
• The recombination procedure. One common approach (assuming p = 2), is to randomly select a crossover point to split each of the parent strings, and recombine the parts to form two children, one with the first part of parent 1 and the second part of parent 2; the other with the second part of parent 1 and the first part of parent 2.
• The mutation rate, which determines how often offspring have random mutations to their representation. Once an offspring has been generated, every bit in its composition is flipped with probability equal to the mutation rate.
• The makeup of the next generation. This can be just the newly formed offspring, or it can include a few top-scoring parents from the previous generation (a practice called elitism, which guarantees that overall fitness will never decrease over time). The practice of culling, in which all individuals below a given threshold are discarded, can lead to a speedup (Baum et al., 1995).

Section 4.1 Local Search and Optimization Problems

Figure 4.7 The 8-queens states corresponding to the first two parents in Figure 4.6(c) and the first offspring in Figure 4.6(d). The green columns are lost in the crossover step and the red columns are retained. (To interpret the numbers in Figure 4.6: row 1 is the bottom row, and 8 is the top row.)

Figure 4.6(a) shows a population of four 8-digit strings, each representing a state of the 8-queens puzzle: the c-th digit represents the row number of the queen in column c. In (b), each state is rated by the fitness function. Higher fitness values are better, so for the 8-queens problem we use the number of nonattacking pairs of queens, which has a value of 8 × 7/2 = 28 for a solution. The values of the four states in (b) are 24, 23, 20, and 11. The fitness scores are then normalized to probabilities, and the resulting values are shown next to the fitness values in (b).
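The fitness function and normalization described above can be sketched as follows. This is an illustrative sketch with assumed names; the two example strings are taken to be the first two states of Figure 4.6(a), which is not reproduced here, so treat them as an assumption.

```python
from itertools import combinations


def nonattacking_pairs(state):
    """Fitness of an 8-queens string: digit c gives the row of the queen in column c."""
    rows = [int(ch) for ch in state]
    attacking = sum(
        1
        for (c1, r1), (c2, r2) in combinations(enumerate(rows), 2)
        if r1 == r2 or abs(r1 - r2) == abs(c1 - c2)  # same row or same diagonal
    )
    n = len(rows)
    return n * (n - 1) // 2 - attacking


def normalize(scores):
    """Turn fitness scores into selection probabilities that sum to 1."""
    total = sum(scores)
    return [score / total for score in scores]


print(nonattacking_pairs("24748552"), nonattacking_pairs("32752411"))
print([round(100 * p) for p in normalize([24, 23, 20, 11])])
```

Normalizing the four fitness values quoted in the text divides each by their sum, 78, so the fittest individual is chosen as a parent a little under a third of the time.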
In (c), two pairs of parents are selected, in accordance with the probabilities in (b). Notice that one individual is selected twice and one not at all. For each selected pair, a crossover point (dotted line) is chosen randomly. In (d), we cross over the parent strings at the crossover
points, yielding new offspring. For example, the first child of the first pair gets the first three digits (327) from the first parent and the remaining digits (48552) from the second parent.
The 8-queens states involved in this recombination step are shown in Figure 4.7.
Finally, in (e), each location in each string is subject to random mutation with a small independent probability. One digit was mutated in the first, third, and fourth offspring. In the
8-queens problem, this corresponds to choosing a queen at random and moving it to a random
square in its column. It is often the case that the population is diverse early on in the process,
so crossover frequently takes large steps in the state space early in the search process (as in simulated annealing). After many generations of selection towards higher fitness, the population becomes less diverse, and smaller steps are typical. Figure 4.8 describes an algorithm
that implements all these steps.
Genetic algorithms are similar to stochastic beam search, but with the addition of the
crossover operation. This is advantageous if there are blocks that perform useful functions.
For example, it could be that putting the first three queens in positions 2, 4, and 6 (where they do not attack each other) constitutes a useful block that can be combined with other useful blocks that appear in other individuals to construct a solution. It can be shown mathematically
that, if the blocks do not serve a purpose—for example if the positions of the genetic code
are randomly permuted—then crossover conveys no advantage.
The theory of genetic algorithms explains how this works using the idea of a schema, which is a substring in which some of the positions can be left unspecified. For example, the schema 246***** describes all 8-queens states in which the first three queens are in positions 2, 4, and 6, respectively. Strings that match the schema (such as 24613578) are called instances of the schema. It can be shown that if the average fitness of the instances of a schema is above the mean, then the number of instances of the schema will grow over time.
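Testing whether a string is an instance of a schema is a position-by-position comparison. A minimal sketch, assuming unspecified positions are written with a wildcard character (here "*", which is a notational assumption) and using a hypothetical function name:

```python
def is_instance(schema, string, wildcard="*"):
    """True if string matches schema at every specified (non-wildcard) position."""
    return len(schema) == len(string) and all(
        s == wildcard or s == ch for s, ch in zip(schema, string)
    )


print(is_instance("246*****", "24613578"))
print(is_instance("246*****", "32752411"))
```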
Chapter 4 Search in Complex Environments

Evolution and Search

The theory of evolution was developed by Charles Darwin in On the Origin of Species by Means of Natural Selection (1859) and independently by Alfred Russel Wallace (1858). The central idea is simple: variations occur in reproduction and will be preserved in successive generations approximately in proportion to their effect on reproductive fitness.
Darwin’s theory was developed with no knowledge of how the traits of organisms can be inherited and modified. The probabilistic laws governing these processes were first identified by Gregor Mendel (1866), a monk who experimented
with sweet peas. Much later, Watson and Crick (1953) identified the structure of the
DNA molecule and its alphabet, AGTC (adenine, guanine, thymine, cytosine). In the standard model, variation occurs both by point mutations in the letter sequence and by “crossover” (in which the DNA of an offspring is generated by combining long sections of DNA from each parent). The analogy to local search algorithms has already been described; the principal difference between stochastic beam search and evolution is the use of sexual
reproduction, wherein successors are generated from multiple individuals rather than just one.
The actual mechanisms of evolution are, however, far richer than
most genetic algorithms allow. For example, mutations can involve reversals, duplications, and movement of large chunks of DNA;
some viruses borrow DNA
from one organism and insert it into another; and there are transposable genes that
do nothing but copy themselves many thousands of times within the genome.
There are even genes that poison cells from potential mates that do not carry
the gene, thereby increasing their own chances of replication. Most important is the fact that the genes themselves encode the mechanisms whereby the genome is reproduced and translated into an organism. In genetic algorithms, those mechanisms
are a separate program that is not represented within the strings being manipulated. Darwinian evolution may appear inefficient, having generated blindly some 10" or so organisms without improving its search heuristics one iota. But learning does play a role in evolution.
Although the otherwise great French naturalist Jean Lamarck (1809) was wrong to propose that traits acquired by adaptation during an organism’s lifetime would be passed on to its offspring, James Baldwin’s (1896) superficially similar theory is correct: learning can effectively relax the fitness landscape, leading to an acceleration in the rate of evolution. An organism that
has a trait that is not quite adaptive for its environment will pass on the trait if it also
has enough plasticity to learn to adapt to the environment in a way that is beneficial. Computer simulations (Hinton and Nowlan, 1987) confirm that this Baldwin
effect is real, and that a consequence is that things that are hard to learn end up in the genome, but things that are easy to learn need not reside there (Morgan and
Griffiths, 2015).
Section 4.2
Local Search in Continuous Spaces
function GENETIC-ALGORITHM(population, fitness) returns an individual
  repeat
      weights ← WEIGHTED-BY(population, fitness)
      population2 ← empty list
      for i = 1 to SIZE(population) do
          parent1, parent2 ← WEIGHTED-RANDOM-CHOICES(population, weights, 2)
          child ← REPRODUCE(parent1, parent2)
          if (small random probability) then child ← MUTATE(child)
          add child to population2
      population ← population2
  until some individual is fit enough, or enough time has elapsed
  return the best individual in population, according to fitness

function REPRODUCE(parent1, parent2) returns an individual
  n ← LENGTH(parent1)
  c ← random number from 1 to n
  return APPEND(SUBSTRING(parent1, 1, c), SUBSTRING(parent2, c + 1, n))

Figure 4.8 A genetic algorithm. Within the function, population is an ordered list of individuals, weights is a list of corresponding fitness values for each individual, and fitness is a function to compute these values.
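The pseudocode in Figure 4.8 translates fairly directly into Python. The sketch below is one possible rendering under assumptions that are not in the figure: individuals are strings, MUTATE changes one randomly chosen position, termination uses a fitness threshold plus a generation cap, and a uniform fallback is used if every weight is zero. The bit-counting fitness at the end is a hypothetical toy problem.

```python
import random


def reproduce(parent1, parent2, c=None, rng=random):
    """Single-point crossover: first c symbols of parent1, the rest of parent2."""
    n = len(parent1)
    if c is None:
        c = rng.randint(1, n)  # crossover point from 1 to n inclusive
    return parent1[:c] + parent2[c:]


def mutate(child, alphabet, rng=random):
    """Replace one randomly chosen position with a random symbol (an assumption)."""
    i = rng.randrange(len(child))
    return child[:i] + rng.choice(alphabet) + child[i + 1:]


def genetic_algorithm(population, fitness, alphabet, fit_enough,
                      max_generations=200, mutation_prob=0.1, rng=random):
    for _generation in range(max_generations):
        if max(fitness(ind) for ind in population) >= fit_enough:
            break
        weights = [fitness(ind) for ind in population]
        if sum(weights) == 0:              # random.choices needs a positive total
            weights = [1] * len(population)
        next_population = []
        for _i in range(len(population)):
            parent1, parent2 = rng.choices(population, weights=weights, k=2)
            child = reproduce(parent1, parent2, rng=rng)
            if rng.random() < mutation_prob:
                child = mutate(child, alphabet, rng)
            next_population.append(child)
        population = next_population
    return max(population, key=fitness)


def count_ones(bits):
    """Hypothetical toy fitness: number of '1' symbols in a bit string."""
    return bits.count("1")


if __name__ == "__main__":
    start = ["00000000", "11110000", "00001111", "10101010"]
    print(genetic_algorithm(start, count_ones, "01", fit_enough=8,
                            rng=random.Random(0)))
```

Calling reproduce("32752411", "24748552", c=3) gives "32748552", the crossover described in the text (first three digits from the first parent, remaining five from the second); the two parent strings are assumed to be those of Figure 4.6(c).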
Clearly, this effect is unlikely to be significant if adjacent bits are totally unrelated to
each other, because then there will be few contiguous blocks that provide a consistent benefit. Genetic algorithms work best when schemas correspond to meaningful components of a
solution. For example, if the string is a representation of an antenna, then the schemas may represent components of the antenna, such as reflectors and deflectors. A good component is likely to be good in a variety of different designs. This suggests that successful use of genetic
algorithms requires careful engineering of the representation. In practice, genetic algorithms have their place within the broad landscape of optimization methods (Marler and Arora, 2004), particularly for complex structured problems such as circuit layout or job-shop scheduling, and more recently for evolving the architecture of deep neural networks (Miikkulainen et al., 2019). It is not clear how much of the appeal of genetic algorithms arises from their superiority on specific tasks, and how much from the appealing metaphor of evolution.
4.2 Local Search in Continuous Spaces

In Chapter 2, we explained the distinction between discrete and continuous environments, pointing out that most real-world environments are continuous. A continuous action space has an infinite branching factor, and thus can’t be handled by most of the algorithms we have covered so far (with the exception of first-choice hill climbing and simulated annealing).

This section provides a very brief introduction to some local search techniques for continuous spaces. The literature on this topic is vast; many of the basic techniques originated in the 17th century, after the development of calculus by Newton and Leibniz.2 We find uses for these techniques in several places in this book, including the chapters on learning, vision, and robotics.
We begin with an example. Suppose we want to place three new airports anywhere in Romania, such that the sum of squared straight-line distances from each city on the map to its nearest airport is minimized. (See Figure 3.1 for the map of Romania.) The state space is then defined by the coordinates of the three airports: (x1, y1), (x2, y2), and (x3, y3). This is a six-dimensional space; we also say that states are defined by six variables. In general, states are defined by an n-dimensional vector of variables, x. Moving around in this space corresponds to moving one or more of the airports on the map. The objective function f(x) = f(x1, y1, x2, y2, x3, y3) is relatively easy to compute for any particular state once we compute the closest cities. Let C_i be the set of cities whose closest airport (in the state x) is airport i. Then, we have

$$ f(\mathbf{x}) = f(x_1, y_1, x_2, y_2, x_3, y_3) = \sum_{i=1}^{3} \sum_{c \in C_i} (x_i - x_c)^2 + (y_i - y_c)^2 . \tag{4.1} $$

This equation is correct not only for the state x but also for states in the local neighborhood of x. However, it is not correct globally; if we stray too far from x (by altering the location of one or more of the airports by a large amount) then the set of closest cities for that airport changes, and we need to recompute C_i.
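Equation (4.1) can be evaluated in two steps: first build the sets C_i, then sum the squared distances. The following is a small sketch with made-up coordinates (the five city positions and three airport positions are hypothetical, not taken from the Romania map).

```python
def assign_cities(airports, cities):
    """Return C_i for each airport: indices of the cities nearest to airport i."""
    clusters = [[] for _ in airports]
    for c, (xc, yc) in enumerate(cities):
        nearest = min(
            range(len(airports)),
            key=lambda i: (airports[i][0] - xc) ** 2 + (airports[i][1] - yc) ** 2,
        )
        clusters[nearest].append(c)
    return clusters


def objective(airports, cities):
    """Sum over airports i and cities c in C_i of (x_i - x_c)^2 + (y_i - y_c)^2."""
    clusters = assign_cities(airports, cities)
    return sum(
        (xi - cities[c][0]) ** 2 + (yi - cities[c][1]) ** 2
        for (xi, yi), cluster in zip(airports, clusters)
        for c in cluster
    )


# Hypothetical coordinates for illustration only.
cities = [(0, 0), (2, 0), (10, 0), (10, 2), (5, 9)]
airports = [(1, 0), (10, 1), (5, 8)]

print(assign_cities(airports, cities))
print(objective(airports, cities))
```

Because the sets C_i are recomputed on every call, this function is valid globally; the closed-form expression with fixed C_i is only valid while no city changes its nearest airport.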
One way to deal with a continuous state space is to discretize it. For example, instead of allowing the (x_i, y_i) locations to be any point in continuous two-dimensional space, we could limit them to fixed points on a rectangular grid with spacing of size δ (delta). Then instead of having an infinite number of successors, each state in the space would have only 12 successors, corresponding to incrementing one of the 6 variables by ±δ. We can then apply any of our local search algorithms to this discrete space. Alternatively, we could make the branching factor finite by sampling successor states randomly, moving in a random direction by a small amount, δ. Methods that measure progress by the change in the value of the objective function between two nearby points are called empirical gradient methods. Empirical gradient search is the same as steepest-ascent hill climbing in a discretized version of the state space. Reducing the value of δ over time can give a more accurate solution, but does not necessarily converge to a global optimum in the limit.
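The discretized search can be sketched as below. Since the airport objective is to be minimized, the sketch takes the steepest downhill step on the grid rather than an uphill one; the function names and coordinates are hypothetical, and δ = 1 with integer coordinates is chosen so that every quantity stays an exact integer.

```python
def f(x, cities):
    """Airport objective for a flat state vector x = (x1, y1, x2, y2, x3, y3)."""
    airports = [(x[0], x[1]), (x[2], x[3]), (x[4], x[5])]
    return sum(
        min((ax - cx) ** 2 + (ay - cy) ** 2 for ax, ay in airports)
        for cx, cy in cities
    )


def successors(x, delta):
    """The 2 * len(x) grid neighbors: each variable moved by +delta or -delta."""
    for k in range(len(x)):
        for step in (+delta, -delta):
            y = list(x)
            y[k] += step
            yield tuple(y)


def grid_descent(x, cities, delta):
    """Move to the best grid neighbor until no neighbor improves f."""
    while True:
        best = min(successors(x, delta), key=lambda y: f(y, cities))
        if f(best, cities) >= f(x, cities):
            return x
        x = best


cities = [(0, 0), (2, 0), (10, 0), (10, 2), (5, 9)]  # hypothetical coordinates
start = (0, 0, 9, 0, 5, 5)

end = grid_descent(start, cities, delta=1)
print(end, f(start, cities), f(end, cities))
```

The result is only a local optimum of the grid problem: no single ±δ move improves it, but a finer grid or a different start could do better.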
2 Knowledge of vectors, matrices, and derivatives is useful for this section (see Appendix A).

Often we have an objective function expressed in a mathematical form such that we can use calculus to solve the problem analytically rather than empirically. Many methods attempt to use the gradient of the landscape to find a maximum. The gradient of the objective function is a vector ∇f that gives the magnitude and direction of the steepest slope. For our problem, we have

$$ \nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial y_1}, \frac{\partial f}{\partial x_2}, \frac{\partial f}{\partial y_2}, \frac{\partial f}{\partial x_3}, \frac{\partial f}{\partial y_3} \right) . $$

In some cases, we can find a maximum by solving the equation ∇f = 0. (This could be done, for example, if we were placing just one airport; the solution is the arithmetic mean of all the cities’ coordinates.) In many cases, however, this equation cannot be solved in closed form. For example, with three airports, the expression for the gradient depends on what cities are closest to each airport in the current state. This means we can compute the gradient locally (but not globally); for example,

$$ \frac{\partial f}{\partial x_1} = 2 \sum_{c \in C_1} (x_1 - x_c) . \tag{4.2} $$

Given a locally correct expression for the gradient, we can perform steepest-ascent hill climbing by updating the current state according to the formula

$$ \mathbf{x} \leftarrow \mathbf{x} + \alpha \nabla f(\mathbf{x}) , $$

where α (alpha) is a small constant often called the step size. There exist a huge variety of methods for adjusting α. The basic problem is that if α is too small, too many steps are needed; if α is too large, the search could overshoot the maximum. The technique of line search tries to overcome this dilemma by extending the current gradient direction (usually by repeatedly doubling α) until f starts to decrease again. The point at which this occurs becomes the new current state. There are several schools of thought about how the new direction should be chosen at this point.

For many problems, the most effective algorithm is the venerable Newton-Raphson method. This is a general technique for finding roots of functions, that is, solving equations of the form g(x) = 0. It works by computing a new estimate for the root x according to Newton’s formula

$$ x \leftarrow x - g(x)/g'(x) . $$

To find a maximum or minimum of f, we need to find x such that the gradient is a zero vector (i.e., ∇f(x) = 0). Thus, g(x) in Newton’s formula becomes ∇f(x), and the update equation can be written in matrix-vector form as

$$ \mathbf{x} \leftarrow \mathbf{x} - \mathbf{H}_f^{-1}(\mathbf{x}) \nabla f(\mathbf{x}) , $$

where H_f(x) is the Hessian matrix of second derivatives, whose elements H_ij are given by ∂²f/∂x_i∂x_j. For our airport example, we can see from Equation (4.2) that H_f(x) is particularly simple: the off-diagonal elements are zero and the diagonal elements for airport i are just twice the number of cities in C_i. A moment’s calculation shows that one step of the update moves airport i directly to the centroid of C_i, which is the minimum of the local expression for f from Equation (4.1).3 For high-dimensional problems, however, computing the n² entries of the Hessian and inverting it may be expensive, so many approximate versions of the Newton-Raphson method have been developed.
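Both updates can be tried on the airport problem. The sketch below is illustrative and makes several assumptions: the coordinates are hypothetical, the gradient step subtracts α∇f because the airport objective is being minimized (the formula above adds it, for maximization), and an airport with an empty C_i is simply left where it is. The Newton step uses the diagonal Hessian described in the text, with entries 2|C_i|.

```python
def clusters_for(airports, cities):
    """C_i for each airport: the cities whose nearest airport is airport i."""
    clusters = [[] for _ in airports]
    for cx, cy in cities:
        i = min(
            range(len(airports)),
            key=lambda k: (airports[k][0] - cx) ** 2 + (airports[k][1] - cy) ** 2,
        )
        clusters[i].append((cx, cy))
    return clusters


def objective(airports, cities):
    return sum(
        (ax - cx) ** 2 + (ay - cy) ** 2
        for (ax, ay), cluster in zip(airports, clusters_for(airports, cities))
        for cx, cy in cluster
    )


def local_gradient(airports, cities):
    """Per-airport (df/dx_i, df/dy_i), valid while the sets C_i do not change."""
    return [
        (2 * sum(ax - cx for cx, _ in cluster), 2 * sum(ay - cy for _, cy in cluster))
        for (ax, ay), cluster in zip(airports, clusters_for(airports, cities))
    ]


def gradient_step(airports, cities, alpha):
    """One descent step x <- x - alpha * grad f(x)."""
    grads = local_gradient(airports, cities)
    return [(ax - alpha * gx, ay - alpha * gy)
            for (ax, ay), (gx, gy) in zip(airports, grads)]


def newton_step(airports, cities):
    """One Newton-Raphson step; the Hessian is diagonal with entries 2 * |C_i|."""
    clusters = clusters_for(airports, cities)
    grads = local_gradient(airports, cities)
    moved = []
    for (ax, ay), cluster, (gx, gy) in zip(airports, clusters, grads):
        if not cluster:               # no cities assigned: leave the airport alone
            moved.append((ax, ay))
            continue
        hess = 2 * len(cluster)
        moved.append((ax - gx / hess, ay - gy / hess))
    return moved


cities = [(0, 0), (2, 0), (10, 0), (10, 2), (5, 9)]   # hypothetical coordinates
airports = [(1.5, 1.0), (9.0, 0.0), (5.0, 8.0)]

print(objective(airports, cities))
print(objective(gradient_step(airports, cities, alpha=0.1), cities))
print(newton_step(airports, cities))
```

For these coordinates the Newton step lands each airport exactly on the centroid of its cities, for example the first airport moves to (1.0, 0.0), the mean of (0, 0) and (2, 0), whereas the gradient step with α = 0.1 only moves part of the way.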
Local search methods suffer from local maxima, ridges, and plateaus in continuous state spaces just as much as in discrete spaces. Random restarts and simulated annealing are often helpful. High-dimensional continuous spaces are, however, big places in which it is very easy to get lost.

A final topic is constrained optimization. An optimization problem is constrained if solutions must satisfy some hard constraints on the values of the variables. For example, in our airport-siting problem, we might constrain sites to be inside Romania and on dry land (rather than in the middle of lakes). The difficulty of constrained optimization problems depends on the nature of the constraints and the objective function. The best-known category is that of linear programming problems, in which constraints must be linear inequalities forming a convex set4 and the objective function is also linear. The time complexity of linear programming is polynomial in the number of variables.

Linear programming is probably the most widely studied and broadly useful method for optimization. It is a special case of the more general problem of convex optimization, which allows the constraint region to be any convex region and the objective to be any function that is convex within the constraint region. Under certain conditions, convex optimization problems are also polynomially solvable and may be feasible in practice with thousands of variables. Several important problems in machine learning and control theory can be formulated as convex optimization problems (see Chapter 20).

3 In general, the Newton-Raphson update can be seen as fitting a quadratic surface to f at x and then moving directly to the minimum of that surface, which is also the minimum of f if f is quadratic.

4.3 Search with Nondeterministic Actions
In Chapter 3, we assumed a fully observable, deterministic, known environment. Therefore, an agent can observe the initial state, calculate a sequence of actions that reach the goal, and execute the actions with its “eyes closed,” never having to use its percepts.

When the environment is partially observable, however, the agent doesn’t know for sure what state it is in; and when the environment is nondeterministic, the agent doesn’t know what state it transitions to after taking an action. That means that rather than thinking “I’m in state s1 and if I do action a I’ll end up in state s2,” an agent will now be thinking “I’m either in state s1 or s3, and if I do action a I’ll end up in state s2, s4 or s5.” We call a set of physical states that the agent believes are possible a belief state.

In partially observable and nondeterministic environments, the solution to a problem is no longer a sequence, but rather a conditional plan (sometimes called a contingency plan or a strategy) that specifies what to do depending on what percepts the agent receives while executing the plan. We examine nondeterminism in this section and partial observability in the next.
4.3.1 The erratic vacuum world

The vacuum world from Chapter 2 has eight states, as shown in Figure 4.9. There are three actions (Right, Left, and Suck) and the goal is to clean up all the dirt (states 7 and 8). If the environment is fully observable, deterministic, and completely known, then the problem is easy to solve with any of the algorithms in Chapter 3, and the solution is an action sequence. For example, if the initial state is 1, then the action sequence [Suck, Right, Suck] will reach a goal state, 8.

Now suppose that we introduce nondeterminism in the form of a powerful but erratic vacuum cleaner. In the erratic vacuum world, the Suck action works as follows:

• When applied to a dirty square the action cleans the square and sometimes cleans up dirt in an adjacent square, too.
• When applied to a clean square the action sometimes deposits dirt on the carpet.5

Figure 4.9 The eight possible states of the vacuum world; states 7 and 8 are goal states.

To provide a precise formulation of this problem, we need to generalize the notion of a transition model from Chapter 3. Instead of defining the transition model by a RESULT function that returns a single outcome state, we use a RESULTS function that returns a set of possible outcome states. For example, in the erratic vacuum world, the Suck action in state 1 cleans up either just the current location, or both locations:

    RESULTS(1, Suck) = {5, 7}

If we start in state 1, no single sequence of actions solves the problem, but the following conditional plan does:

    [Suck, if State = 5 then [Right, Suck] else []] .    (4.3)

Here we see that a conditional plan can contain if-then-else steps; this means that solutions are trees rather than sequences. Here the conditional in the if statement tests to see what the current state is; this is something the agent will be able to observe at runtime, but doesn’t know at planning time. Alternatively, we could have had a formulation that tests the percept rather than the state. Many problems in the real, physical world are contingency problems, because exact prediction of the future is impossible. For this reason, many people keep their eyes open while walking around.

4 A set of points S is convex if the line joining any two points in S is also contained in S. A convex function is one for which the space “above” it forms a convex set; by definition, convex functions have no local (as opposed to global) minima.
5 We assume that most readers face similar problems and can sympathize with our agent. We apologize to owners of modern, efficient cleaning appliances who cannot take advantage of this pedagogical device.

4.3.2 AND-OR search trees
How do we find these contingent solutions to nondeterministic problems? As in Chapter 3, we begin by constructing search trees, but here the trees have a different character. In a deterministic environment, the only branching is introduced by the agent’s own choices in each state: I can do this action or that action. We call these nodes OR nodes. In the vacuum world, for example, at an OR node the agent chooses Left or Right or Suck. In a nondeterministic environment, branching is also introduced by the environment’s choice of outcome for each action. We call these nodes AND nodes. For example, the Suck action in state 1 results in the belief state {5, 7}, so the agent would need to find a plan for state 5 and for state 7. These two kinds of nodes alternate, leading to an AND-OR tree as illustrated in Figure 4.10.
Figure 4.10 The first two levels of the search tree for the erratic vacuum world. State nodes are OR nodes where some action must be chosen. At the AND nodes, shown as circles, every outcome must be handled, as indicated by the arc linking the outgoing branches. The solution found is shown in bold lines.
A solution for an AND-OR search problem is a subtree of the complete search tree that
(1) has a goal node at every leaf, (2) specifies one action at each of its OR nodes, and (3) includes every outcome branch at each of its AND nodes. The solution is shown in bold lines
in the figure; it corresponds to the plan given in Equation (4.3).
Figure 4.11 gives a recursive, depth-first algorithm for AND-OR graph search. One key aspect of the algorithm is the way in which it deals with cycles, which often arise in nondeterministic problems (e.g., if an action sometimes has no effect or if an unintended effect can be corrected). If the current state is identical to a state on the path from the root, then it returns with failure. This doesn’t mean that there is no solution from the current state; it simply means that if there is a noncyclic solution, it must be reachable from the earlier incarnation of the current state, so the new incarnation can be discarded. With this check, we ensure that the algorithm terminates in every finite state space, because every path must reach a goal, a dead end, or a repeated state. Notice that the algorithm does not check whether the current state is a repetition of a state on some other path from the root, which is important for efficiency.
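The recursive scheme of Figure 4.11 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the book's code: the state encoding `(loc, dirt)`, the helper names, the use of `None` for failure, and the choice to represent a conditional step as a dictionary from outcome state to sub-plan are all invented for the example. The erratic Suck action is modeled so that it cleans a dirty square and sometimes the other square too, and sometimes deposits dirt on a clean square.

```python
def results(state, action):
    """Possible outcomes in the erratic vacuum world (assumed encoding).

    A state is (loc, dirt): loc is 0 (left) or 1 (right); dirt is a pair of
    booleans, one per square.
    """
    loc, dirt = state
    if action == "Left":
        return {(0, dirt)}
    if action == "Right":
        return {(1, dirt)}
    # Erratic Suck: on a dirty square it cleans that square and sometimes the
    # other one as well; on a clean square it sometimes deposits dirt.
    here_clean = tuple(d and i != loc for i, d in enumerate(dirt))
    if dirt[loc]:
        return {(loc, here_clean), (loc, (False, False))}
    here_dirty = tuple(d or i == loc for i, d in enumerate(dirt))
    return {state, (loc, here_dirty)}


def is_goal(state):
    return not any(state[1])           # goal: no dirt anywhere


def and_or_search(initial, actions=("Suck", "Right", "Left")):
    return or_search(initial, actions, path=())


def or_search(state, actions, path):
    if is_goal(state):
        return []                      # the empty plan
    if state in path:
        return None                    # repeated state on this path: fail
    for action in actions:
        plan = and_search(results(state, action), actions, (state,) + path)
        if plan is not None:
            return [action, plan]
    return None


def and_search(states, actions, path):
    branches = {}
    for s in states:
        plan = or_search(s, actions, path)
        if plan is None:
            return None                # every outcome must be handled
        branches[s] = plan             # read as "if s then plan"
    return branches


s1 = (0, (True, True))                 # state 1: agent on the left, both squares dirty
plan = and_or_search(s1)
```

With this encoding, the plan returned for state 1 has the shape [Suck, if State = 5 then [Right, Suck] else []], where state 5 is `(0, (False, True))`.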
AND-OR graphs can be explored either breadth-first or best-first. The concept of a heuristic function must be modified to estimate the cost of a contingent solution rather than a sequence, but the notion of admissibility carries over and there is an analog of the A* algorithm for finding optimal solutions. (See the bibliographical notes at the end of the chapter.)
Section 4.3
Search with Nondeterministic Actions
function AND-OR-SEARCH(problem) returns a conditional plan, or failure
  return OR-SEARCH(problem, problem.INITIAL, [])

function OR-SEARCH(problem, state, path) returns a conditional plan, or failure
  if problem.IS-GOAL(state) then return the empty plan
  if IS-CYCLE(path) then return failure
  for each action in problem.ACTIONS(state) do
    plan ← AND-SEARCH(problem, RESULTS(state, action), [state] + path)
    if plan ≠ failure then return [action] + plan
  return failure

function AND-SEARCH(problem, states, path) returns a conditional plan, or failure
  for each s_i in states do
    plan_i ← OR-SEARCH(problem, s_i, path)
    if plan_i = failure then return failure
  return [if s_1 then plan_1 else if s_2 then plan_2 else ... if s_{n-1} then plan_{n-1} else plan_n]
Figure 4.11 An algorithm for searching AND-OR graphs generated by nondeterministic environments. A solution is a conditional plan that considers every nondeterministic outcome and makes a plan for each one.
4.3.3 Try, try again
Consider a slippery vacuum world, which is identical to the ordinary (non-erratic) vacuum
world except that movement actions sometimes fail, leaving the agent in the same location.
For example, moving Right in state 1 leads to the belief state {1,2}. Figure 4.12 shows part of the search graph; clearly, there are no longer any acyclic solutions from state 1, and
AND-OR-SEARCH would return with failure. There is, however, a cyclic solution, which is to keep trying Right until it works. We can express this with a new while construct:
[Suck, while State = 5 do Right, Suck]
or by adding a label to denote some portion of the plan and referring to that label later:
[Suck, L1 : Right, if State = 5 then L1 else Suck].
When is a cyclic plan a solution? A minimum condition is that every leaf is a goal state and
that a leaf is reachable from every point in the plan. In addition to that, we need to consider the cause of the nondeterminism. If it is really the case that the vacuum robot’s drive mechanism
works some of the time, but randomly and independently slips on other occasions, then the
agent can be confident that if the action is repeated enough times, eventually it will work and the plan will succeed.
But if the nondeterminism is due to some unobserved fact about the
robot or environment—perhaps a drive belt has snapped and the robot will never move—then repeating the action will not help. One way to understand this decision is to say that the initial problem formulation (fully observable, nondeterministic) is abandoned in favor of a different formulation (partially observable, deterministic) where the failure of the cyclic plan is attributed to an unobserved
property of the drive belt. In Chapter 12 we discuss how to decide which of several uncertain possibilities is more likely.
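To make the while construct concrete, here is a small sketch of executing the cyclic plan [Suck, while State = 5 do Right, Suck] in the slippery vacuum world. The state encoding `(loc, left_dirty, right_dirty)`, the function names, and the idea of feeding in a scripted sequence of slip outcomes are assumptions made for illustration; a real environment would decide the slips itself.

```python
def step(state, action, slipped=False):
    """One step of the slippery (non-erratic) vacuum world, assumed encoding."""
    loc, left_dirty, right_dirty = state
    if action == "Suck":
        return (loc, left_dirty and loc != 0, right_dirty and loc != 1)
    if slipped:
        return state                   # the movement failed; nothing changes
    return (0 if action == "Left" else 1, left_dirty, right_dirty)


STATE_5 = (0, False, True)             # agent on the left, only the right square dirty


def execute_cyclic_plan(state, slips):
    """Run [Suck, while State = 5 do Right, Suck]; slips scripts each Right attempt."""
    slips = iter(slips)
    executed = []
    state = step(state, "Suck")
    executed.append("Suck")
    while state == STATE_5:            # keep trying Right until it works
        state = step(state, "Right", slipped=next(slips))
        executed.append("Right")
    state = step(state, "Suck")
    executed.append("Suck")
    return state, executed


final, executed = execute_cyclic_plan((0, True, True), slips=[True, True, False])
```

Here the first two Right attempts slip and the third succeeds, so the plan executes Right three times before the final Suck. If the scripted slips never stop (the snapped drive belt case), the loop never exits, which is exactly why the cause of the nondeterminism matters.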
Figure 4.12 Part of the search graph for a slippery vacuum world, where we have shown (some) cycles explicitly. All solutions for this problem are cyclic plans because there is no way to move reliably.
4.4 Search in Partially Observable Environments
We now turn to the problem of partial observability, where the agent’s percepts are not enough to pin down the exact state. That means that some of the agent’s actions will be aimed at reducing uncertainty about the current state.
4.4.1 Searching with no observation
When the agent’s percepts provide no information at all, we have what is called a sensorless
problem (or a conformant problem). At first, you might think the sensorless agent has no hope of solving a problem if it has no idea what state it starts in, but sensorless solutions are
surprisingly common and useful, primarily because they don’t rely on sensors working properly. In manufacturing systems, for example, many ingenious methods have been developed for orienting parts correctly from an unknown initial position by using a sequence of actions
with no sensing at all. Sometimes a sensorless plan is better even when a conditional plan
with sensing is available. For example, doctors often prescribe a broad-spectrum antibiotic
rather than using the conditional plan of doing a blood test, then waiting for the results to
come back, and then prescribing a more specific antibiotic. The sensorless plan saves time and money, and avoids the risk of the infection worsening before the test results are available.
Consider a sensorless version of the (deterministic) vacuum world. Assume that the agent
knows the geography of its world, but not its own location or the distribution of dirt. In that
case, its initial belief state is {1,2,3,4,5,6,7,8} (see Figure 4.9).
Now, if the agent moves Right it will be in one of the states {2,4,6,8}: the agent has gained information without perceiving anything! After [Right,Suck] the agent will always end up in one of the states {4,8}. Finally, after [Right,Suck,Left,Suck] the agent is guaranteed to reach the goal state 7, no matter what the start state. We say that the agent can coerce the world into state 7.
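The coercion argument can be checked mechanically. The sketch below assumes the usual numbering of the eight vacuum-world states that is consistent with the belief states quoted above (odd states have the agent on the left, even states on the right, and states 7 and 8 are completely clean); `decode`, `encode`, `result_p`, and `predict` are hypothetical helper names.

```python
def decode(n):
    """Map a state number 1..8 to (loc, left_dirty, right_dirty); assumed numbering."""
    loc, k = (n - 1) % 2, (n - 1) // 2
    return loc, k < 2, k % 2 == 0


def encode(loc, left_dirty, right_dirty):
    k = (0 if left_dirty else 2) + (0 if right_dirty else 1)
    return 2 * k + loc + 1


def result_p(n, action):
    """Deterministic physical transition model."""
    loc, left_dirty, right_dirty = decode(n)
    if action == "Left":
        loc = 0
    elif action == "Right":
        loc = 1
    elif loc == 0:                     # Suck on the left square
        left_dirty = False
    else:                              # Suck on the right square
        right_dirty = False
    return encode(loc, left_dirty, right_dirty)


def predict(belief, action):
    """Belief-state transition: apply the action to every possible state."""
    return frozenset(result_p(s, action) for s in belief)


belief = frozenset(range(1, 9))        # complete ignorance
trace = []
for action in ["Right", "Suck", "Left", "Suck"]:
    belief = predict(belief, action)
    trace.append(sorted(belief))
```

After the four actions, `trace` holds the successive belief states {2,4,6,8}, {4,8}, {3,7}, and {7}.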
The solution to a sensorless problem is a sequence of actions, not a conditional plan
(because there is no perceiving).
But we search in the space of belief states rather than
physical states.⁶ In belief-state space, the problem is fully observable because the agent always knows its own belief state. Furthermore, the solution (if any) for a sensorless problem
is always a sequence of actions. This is because, as in the ordinary problems of Chapter 3, the percepts received after each action are completely predictable—they’re always empty! So there are no contingencies to plan for. This is true even if the environment is nondeterministic.
We could introduce new algorithms for sensorless search problems. But instead, we can
use the existing algorithms from Chapter 3 if we transform the underlying physical problem
into a belief-state problem, in which we search over belief states rather than physical states.
The original problem, P, has components $\text{ACTIONS}_P$, $\text{RESULT}_P$, etc., and the belief-state problem has the following components:
• States: The belief-state space contains every possible subset of the physical states. If P has N states, then the belief-state problem has $2^N$ belief states, although many of those may be unreachable from the initial state.
• Initial state: Typically the belief state consisting of all states in P, although in some cases the agent will have more knowledge than this.
• Actions: This is slightly tricky. Suppose the agent is in belief state $b = \{s_1, s_2\}$, but $\text{ACTIONS}_P(s_1) \neq \text{ACTIONS}_P(s_2)$; then the agent is unsure of which actions are legal. If we assume that illegal actions have no effect on the environment, then it is safe to take the union of all the actions in any of the physical states in the current belief state b:
$$\text{ACTIONS}(b) = \bigcup_{s \in b} \text{ACTIONS}_P(s).$$
On the other hand, if an illegal action might lead to catastrophe, it is safer to allow only the intersection, that is, the set of actions legal in all the states. For the vacuum world, every state has the same legal actions, so both methods give the same result.
• Transition model: For deterministic actions, the new belief state has one result state for each of the current possible states (although some result states may be the same):
$$b' = \text{RESULT}(b, a) = \{ s' : s' = \text{RESULT}_P(s, a) \text{ and } s \in b \}. \tag{4.4}$$
With nondeterminism, the new belief state consists of all the possible results of applying the action to any of the states in the current belief state:
$$b' = \text{RESULT}(b, a) = \{ s' : s' \in \text{RESULTS}_P(s, a) \text{ and } s \in b \} = \bigcup_{s \in b} \text{RESULTS}_P(s, a).$$
The size of b' will be the same or smaller than b for deterministic actions, but may be larger than b with nondeterministic actions (see Figure 4.13).
• Goal test: The agent possibly achieves the goal if any state s in the belief state satisfies the goal test of the underlying problem, $\text{IS-GOAL}_P(s)$. The agent necessarily achieves the goal if every state satisfies $\text{IS-GOAL}_P(s)$. We aim to necessarily achieve the goal.
• Action cost: This is also tricky. If the same action can have different costs in different states, then the cost of taking an action in a given belief state could be one of several values. (This gives rise to a new class of problems, which we explore in Exercise 4.MVAL.) For now we assume that the cost of an action is the same in all states and so can be transferred directly from the underlying physical problem.
6 In a fully observable environment, each belief state contains one physical state. Thus, we can view the algorithms in Chapter 3 as searching in a belief-state space of singleton belief states.
Figure 4.13 (a) Predicting the next belief state for the sensorless vacuum world with the deterministic action, Right. (b) Prediction for the same belief state and action in the slippery version of the sensorless vacuum world.
Figure 4.14 shows the reachable belief-state space for the deterministic, sensorless vacuum world. There are only 12 reachable belief states out of $2^8 = 256$ possible belief states. The preceding definitions enable the automatic construction of the belief-state problem
formulation from the definition of the underlying physical problem. Once this is done, we can solve sensorless problems with any of the ordinary search algorithms of Chapter 3.
In ordinary graph search, newly reached states are tested to see if they were previously reached. This works for belief states, too; for example, in Figure 4.14, the action sequence [Suck,Left,Suck] starting at the initial state reaches the same belief state as [Right,Left,Suck], namely, {5,7}. Now, consider the belief state reached by [Left], namely, {1,3,5,7}. Obviously, this is not identical to {5,7}, but it is a superset. We can discard (prune) any such superset belief state. Why? Because a solution from {1,3,5,7} must be a solution for each of the individual states 1, 3, 5, and 7, and thus it is a solution for any combination of these individual states, such as {5,7}; therefore we don’t need to try to solve {1,3,5,7}, we can concentrate on trying to solve the strictly easier belief state {5,7}.
Conversely, if {1,3,5,7} has already been generated and found to be solvable, then any
subset, such as {5,7}, is guaranteed to be solvable. (If I have a solution that works when I'm very confused about what state I'm in, it will still work when I'm less confused.) This extra
level of pruning may dramatically improve the efficiency of sensorless problem solving. Even with this improvement, however, sensorless problem-solving as we have described
it is seldom feasible in practice. One issue is the vastness of the belief-state space: we saw in the previous chapter that often a search space of size N is too large, and now we have search spaces of size $2^N$. Furthermore, each element of the search space is a set of up to N elements.
For large N, we won’t be able to represent even a single belief state without running out of
memory space.
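For a state space as small as the vacuum world's, the belief-state construction is still easy to carry out explicitly. The sketch below builds the belief-state transition model from a physical model and explores it with ordinary breadth-first graph search, without the subset and superset pruning discussed above. The tuple encoding `(loc, left_dirty, right_dirty)` and all function names are assumptions made for illustration.

```python
from collections import deque
from itertools import product

# Physical states: (loc, left_dirty, right_dirty), with loc 0 = left, 1 = right.
STATES = list(product((0, 1), (True, False), (True, False)))
ACTIONS = ("Left", "Right", "Suck")


def result_p(state, action):
    """Deterministic physical transition model."""
    loc, left_dirty, right_dirty = state
    if action == "Left":
        return (0, left_dirty, right_dirty)
    if action == "Right":
        return (1, left_dirty, right_dirty)
    return (loc, left_dirty and loc != 0, right_dirty and loc != 1)   # Suck


def predict(belief, action):
    """Belief-state transition model for deterministic actions (Equation 4.4)."""
    return frozenset(result_p(s, action) for s in belief)


def is_goal(belief):
    """Necessarily achieved: every physical state in the belief state is clean."""
    return all(not left and not right for _, left, right in belief)


def explore(initial):
    """Breadth-first graph search; maps each reachable belief state to a shortest plan."""
    plans = {initial: []}
    frontier = deque([initial])
    while frontier:
        b = frontier.popleft()
        for a in ACTIONS:
            b2 = predict(b, a)
            if b2 not in plans:        # the usual "previously reached" test
                plans[b2] = plans[b] + [a]
                frontier.append(b2)
    return plans


plans = explore(frozenset(STATES))
goal_plans = [plan for belief, plan in plans.items() if is_goal(belief)]
```

Under these assumptions the search reaches 12 belief states in total, and the shortest sensorless solutions have four actions, one of them being [Right, Suck, Left, Suck].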
One solution is to represent the belief state by some more compact description. In English, we could say the agent knows “Nothing” in the initial state; after moving Left, we could say, “Not in the rightmost column,” and so on. Chapter 7 explains how to do this in a formal representation scheme.
Figure 4.14 The reachable portion of the belief-state space for the deterministic, sensorless vacuum world. Each rectangular box corresponds to a single belief state. At any given point, the agent has a belief state but does not know which physical state it is in. The initial belief state (complete ignorance) is the top center box.
Another approach is to avoid the standard search algorithms, which treat belief states as black boxes just like any other problem state. Instead, we can look inside the belief states and develop incremental belief-state search algorithms that build up the solution one physical state at a time. For example, in the sensorless vacuum world, the initial belief state is {1,2,3,4,5,6,7,8}, and we have to find an action sequence that works in all 8 states. We can do this by first finding a solution that works for state 1; then we check if it works for state 2; if not, go back and find a different solution for state 1, and so on.
Just as an AND-OR search has to find a solution for every branch at an AND node, this algorithm has to find a solution for every state in the belief state; the difference is that AND-OR search can find a different solution for each branch, whereas an incremental belief-state search has to find one solution that works for all the states.
The main advantage of the incremental approach is that it is typically able to detect failure quickly: when a belief state is unsolvable, it is usually the case that a small subset of the belief state, consisting of the first few states examined, is also unsolvable. In some cases, this leads to a speedup proportional to the size of the belief states, which may themselves be as large as the physical state space itself.
4.4.2 Searching in partially observable environments
Many problems cannot be solved without sensing. For example, the sensorless 8-puzzle is impossible. On the other hand, a little bit of sensing can go a long way: we can solve 8-puzzles if we can see just the upper-left corner square. The solution involves moving each tile in turn into the observable square and keeping track of its location from then on.
For a partially observable problem, the problem specification will specify a PERCEPT(s) function that returns the percept received by the agent in a given state. If sensing is nondeterministic, then we can use a PERCEPTS function that returns a set of possible percepts. For fully observable problems, PERCEPT(s) = s for every state s, and for sensorless problems PERCEPT(s) = null.
Consider a local-sensing vacuum world, in which the agent has a position sensor that
yields the percept L in the left square, and R in the right square, and a dirt sensor that yields
Dirty when the current square is dirty and Clean when it is clean. Thus, the PERCEPT in state 1 is [L, Dirty]. With partial observability, it will usually be the case that several states produce the same percept; state 3 will also produce [L, Dirty]. Hence, given this initial percept, the initial belief state will be {1,3}. We can think of the transition model between belief states
for partially observable problems as occurring in three stages, as shown in Figure 4.15:
• The prediction stage computes the belief state resulting from the action, RESULT(b, a), exactly as we did with sensorless problems. To emphasize that this is a prediction, we use the notation $\hat{b} = \text{RESULT}(b, a)$, where the “hat” over the b means “estimated,” and we also use PREDICT(b, a) as a synonym for RESULT(b, a).
• The possible percepts stage computes the set of percepts that could be observed in the predicted belief state (using the letter o for observation):
$$\text{POSSIBLE-PERCEPTS}(\hat{b}) = \{ o : o = \text{PERCEPT}(s) \text{ and } s \in \hat{b} \}.$$
• The update stage computes, for each possible percept, the belief state that would result from the percept. The updated belief state $b_o$ is the set of states in $\hat{b}$ that could have produced the percept:
$$b_o = \text{UPDATE}(\hat{b}, o) = \{ s : o = \text{PERCEPT}(s) \text{ and } s \in \hat{b} \}.$$
The agent needs to deal with possible percepts at planning time, because it won’t know the actual percepts until it executes the plan. Notice that nondeterminism in the physical environment can enlarge the belief state in the prediction stage, but each updated belief state $b_o$ can be no larger than the predicted belief state $\hat{b}$; observations can only help reduce uncertainty. Moreover, for deterministic sensing, the belief states for the different possible percepts will be disjoint, forming a partition of the original predicted belief state.
Putting these three stages together, we obtain the possible belief states resulting from a given action and the subsequent possible percepts:
$$\text{RESULTS}(b, a) = \{ b_o : b_o = \text{UPDATE}(\text{PREDICT}(b, a), o) \text{ and } o \in \text{POSSIBLE-PERCEPTS}(\text{PREDICT}(b, a)) \}. \tag{4.5}$$
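The three stages translate almost line for line into code. The following sketch applies them to the local-sensing vacuum world, in both its deterministic and slippery variants. The state encoding, the `slippery` flag, and the decision to return a dictionary from percept to updated belief state (rather than a bare set of belief states) are assumptions made for illustration.

```python
def results_p(state, action, slippery=False):
    """Physical outcomes; a state is (loc, left_dirty, right_dirty)."""
    loc, left_dirty, right_dirty = state
    if action in ("Left", "Right"):
        moved = (0 if action == "Left" else 1, left_dirty, right_dirty)
        return {moved, state} if slippery else {moved}
    return {(loc, left_dirty and loc != 0, right_dirty and loc != 1)}   # Suck


def percept(state):
    """Local sensing: position plus the dirt status of the current square."""
    loc, left_dirty, right_dirty = state
    dirty_here = (left_dirty, right_dirty)[loc]
    return ("L" if loc == 0 else "R", "Dirty" if dirty_here else "Clean")


def predict(belief, action, slippery=False):
    return frozenset(s2 for s in belief for s2 in results_p(s, action, slippery))


def possible_percepts(b_hat):
    return {percept(s) for s in b_hat}


def update(b_hat, o):
    return frozenset(s for s in b_hat if percept(s) == o)


def results(belief, action, slippery=False):
    """Equation (4.5), keyed by the percept that leads to each belief state."""
    b_hat = predict(belief, action, slippery)
    return {o: update(b_hat, o) for o in possible_percepts(b_hat)}


s1, s2 = (0, True, True), (1, True, True)
s3, s4 = (0, True, False), (1, True, False)
b = frozenset({s1, s3})                # initial belief state after percept [L, Dirty]
deterministic = results(b, "Right")
slippery = results(b, "Right", slippery=True)
```

In the deterministic case the two possible percepts split the predicted belief state into two singletons; in the slippery case a third percept, [L, Dirty], leads back to the original belief state {1,3}.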
Figure 4.15 Two examples of transitions in local-sensing vacuum worlds. (a) In the deterministic world, Right is applied in the initial belief state, resulting in a new predicted belief state with two possible physical states; for those states, the possible percepts are [R, Dirty] and [R, Clean], leading to two belief states, each of which is a singleton. (b) In the slippery world, Right is applied in the initial belief state, giving a new belief state with four physical states; for those states, the possible percepts are [L, Dirty], [R, Dirty], and [R, Clean], leading to three belief states as shown.
Figure 4.16 The first level of the AND-OR search tree for a problem in the local-sensing vacuum world; Suck is the first action in the solution.
4.4.3 Solving partially observable problems
The preceding section showed how to derive the RESULTS function for a nondeterministic belief-state problem from an underlying physical problem, given the PERCEPT function. With this formulation, the AND-OR search algorithm of Figure 4.11 can be applied directly to derive a solution.
Figure 4.16 shows part of the search tree for the local-sensing vacuum
world, assuming an initial percept [A, Dirty]. The solution is the conditional plan [Suck, Right, if Bstate={6} then Suck else []]. Notice that, because we supplied a belief-state problem to the AND-OR
search algorithm, it
returned a conditional plan that tests the belief state rather than the actual state. This is as it
should be:
in a partially observable environment the agent won’t know the actual state.
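As a rough illustration of this point, the sketch below runs a small AND-OR search over belief states for the deterministic local-sensing vacuum world. The encoding, the helper names, the action ordering, and the representation of a conditional step as a dictionary from belief state to sub-plan are all assumptions for the example; failure is represented by `None`.

```python
def result_p(state, action):
    """Deterministic physical model; a state is (loc, left_dirty, right_dirty)."""
    loc, left_dirty, right_dirty = state
    if action == "Left":
        return (0, left_dirty, right_dirty)
    if action == "Right":
        return (1, left_dirty, right_dirty)
    return (loc, left_dirty and loc != 0, right_dirty and loc != 1)   # Suck


def percept(state):
    loc, left_dirty, right_dirty = state
    return (loc, (left_dirty, right_dirty)[loc])   # position, dirt in current square


def results(belief, action):
    """Belief states reachable by the action followed by each possible percept."""
    b_hat = frozenset(result_p(s, action) for s in belief)
    return {frozenset(s for s in b_hat if percept(s) == o) for o in map(percept, b_hat)}


def is_goal(belief):
    return all(not left and not right for _, left, right in belief)


ACTIONS = ("Suck", "Right", "Left")


def or_search(belief, path=()):
    if is_goal(belief):
        return []
    if belief in path:
        return None
    for action in ACTIONS:
        plan = and_search(results(belief, action), (belief,) + path)
        if plan is not None:
            return [action, plan]
    return None


def and_search(beliefs, path):
    branches = {}
    for b in beliefs:
        plan = or_search(b, path)
        if plan is None:
            return None
        branches[b] = plan             # read as "if Bstate = b then plan"
    return branches


b13 = frozenset({(0, True, True), (0, True, False)})   # belief state {1,3}
plan = or_search(b13)
```

The conditions in the returned plan are belief states, not physical states: after Suck and Right, the plan branches on whether the belief state is {6} (then Suck) or {8} (then do nothing).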
As in the case of standard search algorithms applied to sensorless problems, the AND-OR search algorithm treats belief states as black boxes, just like any other states. One can improve on this by checking for previously generated belief states that are subsets or supersets of the current state, just as for sensorless problems. One can also derive incremental search algorithms, analogous to those described for sensorless problems, that provide substantial speedups over the black-box approach.
4.4.4 An agent for partially observable environments
An agent for partially observable environments formulates a problem, calls a search algorithm (such as AND-OR-SEARCH)
to solve it, and executes the solution. There are two main
differences between this agent and the one for fully observable deterministic environments.
First, the solution will be a conditional plan rather than a sequence; to execute an if-then-else expression, the agent will need to test the condition and execute the appropriate branch of the conditional.
Second, the agent will need to maintain its belief state as it performs actions
and receives percepts. This process resembles the prediction-observation-update process in
Equation (4.5) but is actually simpler because the percept is given by the environment rather
than calculated by the agent. Given an initial belief state b, an action a, and a percept o, the new belief state is:
$$b' = \text{UPDATE}(\text{PREDICT}(b, a), o). \tag{4.6}$$
Consider a kindergarten vacuum world wherein agents sense only the state of their current square, and any square may become dirty at any time unless the agent is actively cleaning it at that moment.⁷ Figure 4.17 shows the belief state being maintained in this environment.
7 The usual apologies to those who are unfamiliar with the effect of small children on the environment.
Figure 4.17 Two prediction-update cycles of belief-state maintenance in the kindergarten vacuum world with local sensing.
In partially observable environments (which include the vast majority of real-world environments), maintaining one’s belief state is a core function of any intelligent system. This function goes under various names, including monitoring, filtering, and state estimation. Equation (4.6) is called a recursive state estimator because it computes the new belief state from the previous one rather than by examining the entire percept sequence. If the agent is not to “fall behind,” the computation has to happen as fast as percepts are coming in. As the environment becomes more complex, the agent will only have time to compute an approximate belief state, perhaps focusing on the implications of the percept for the aspects of the environment that are of current interest. Most work on this problem has been done for stochastic, continuous-state environments with the tools of probability theory, as explained in Chapter 14.
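A recursive state estimator in the sense of Equation (4.6) can be sketched for the kindergarten vacuum world as follows. The physical model here is an assumption that follows the verbal description (after the action's usual effect, any square not being actively cleaned may become dirty), and the encoding `(loc, dirt)` and function names are hypothetical.

```python
from itertools import product


def results_p(state, action):
    """Assumed kindergarten model; a state is (loc, (left_dirty, right_dirty))."""
    loc, dirt = state
    if action == "Left":
        loc = 0
    elif action == "Right":
        loc = 1
    protected = loc if action == "Suck" else None    # square being cleaned right now
    options = []
    for square, dirty in enumerate(dirt):
        if square == protected:
            options.append((False,))                 # cleaned, and it stays clean
        elif dirty:
            options.append((True,))                  # dirt does not vanish by itself
        else:
            options.append((False, True))            # may or may not become dirty
    return {(loc, d) for d in product(*options)}


def percept(state):
    loc, dirt = state
    return (loc, dirt[loc])            # position and dirt status of the current square


def predict(belief, action):
    return frozenset(s2 for s in belief for s2 in results_p(s, action))


def update(b_hat, o):
    return frozenset(s for s in b_hat if percept(s) == o)


def estimate(belief, action, o):
    """Equation (4.6): the new belief state from the old one, an action, and a percept."""
    return update(predict(belief, action), o)


ALL = frozenset((loc, d) for loc in (0, 1) for d in product((False, True), repeat=2))
b0 = update(ALL, (0, False))           # first percept: on the left, square clean
b1 = estimate(b0, "Suck", (0, False))
b2 = estimate(b1, "Right", (1, True))
```

Each call to `estimate` uses only the previous belief state, the action, and the newest percept, which is what makes the estimator recursive.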
In this section we will show an example in a discrete environment with deterministic sensors and nondeterministic actions. The example concerns a robot with a particular state estimation task called localization: working out where it is, given a map of the world and a sequence of percepts and actions. Our robot is placed in the maze-like environment of Figure 4.18. The robot is equipped with four sonar sensors that tell whether there is an obstacle (the outer wall or a dark shaded square in the figure) in each of the four compass directions. The percept is in the form of a bit vector, one bit for each of the directions north, east, south, and west in that order, so 1011 means there are obstacles to the north, south, and west, but not east.
We assume that the sensors give perfectly correct data, and that the robot has a correct map of the environment. But unfortunately, the robot’s navigational system is broken, so when it executes a Right action, it moves randomly to one of the adjacent squares. The robot’s task is to determine its current location.
Suppose the robot has just been switched on, and it does not know where it is; its initial belief state b consists of the set of all locations. The robot then receives the percept 1011 and does an update using the equation $b_o = \text{UPDATE}(b, 1011)$, yielding the 4 locations shown in Figure 4.18(a). You can inspect the maze to see that those are the only four locations that yield the percept 1011.
Next the robot executes a Right action, but the result is nondeterministic. The new belief state, $b_a = \text{PREDICT}(b_o, \textit{Right})$, contains all the locations that are one step away from the locations in $b_o$. When the second percept, 1010, arrives, the robot does $\text{UPDATE}(b_a, 1010)$ and finds that the belief state has collapsed down to the single location shown in Figure 4.18(b). That’s the only location that could be the result of
$$\text{UPDATE}(\text{PREDICT}(\text{UPDATE}(b, 1011), \textit{Right}), 1010).$$
Figure 4.18 Possible positions of the robot, ⊙, (a) after one observation, $E_1 = 1011$, and (b) after moving one square and making a second observation, $E_2 = 1010$. When sensors are noiseless and the transition model is accurate, there is only one possible location for the robot consistent with this sequence of two observations.
With nondeterministic actions the PREDICT step grows the belief state, but the UPDATE step shrinks it back down, as long as the percepts provide some useful identifying information. Sometimes the percepts don’t help much for localization: If there were one or more long east-west corridors, then a robot could receive a long sequence of 1010 percepts, but never know where in the corridor(s) it was. But for environments with reasonable variation in geography, localization often converges quickly to a single point, even when actions are nondeterministic.
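The same predict and update steps can be tried on a toy map. The grid below is a made-up 3x4 maze, not the one in Figure 4.18, and the function names are hypothetical; percepts are four-character strings giving obstacle bits for north, east, south, and west, and a move takes the robot to some adjacent free square.

```python
# Hypothetical map: "." is a free square, "#" is an obstacle.
GRID = ["...#",
        ".#..",
        "...."]
DIRS = [(-1, 0), (0, 1), (1, 0), (0, -1)]   # north, east, south, west


def free(r, c):
    return 0 <= r < len(GRID) and 0 <= c < len(GRID[0]) and GRID[r][c] == "."


def percept(loc):
    """Sonar bit vector: 1 where the neighboring square is a wall or obstacle."""
    r, c = loc
    return "".join("0" if free(r + dr, c + dc) else "1" for dr, dc in DIRS)


def neighbors(loc):
    r, c = loc
    return {(r + dr, c + dc) for dr, dc in DIRS if free(r + dr, c + dc)}


def predict(belief):
    """Broken navigation: the robot ends up in some adjacent free square."""
    return frozenset(n for loc in belief for n in neighbors(loc))


def update(belief, o):
    return frozenset(loc for loc in belief if percept(loc) == o)


ALL = frozenset((r, c) for r in range(len(GRID)) for c in range(len(GRID[0])) if free(r, c))
b1 = update(ALL, "1010")               # first percept
b2 = update(predict(b1), "0010")       # move, then second percept
```

On this map the first percept leaves two candidate squares, the move spreads the belief state to four squares, and the second percept collapses it to a single location.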
What happens if the sensors are faulty? If we can reason only with Boolean logic, then we have to treat every sensor bit as being either correct or incorrect, which is the same as having no perceptual information at all. But we will see that probabilistic reasoning (Chapter 12) allows us to extract useful information from a faulty sensor as long as it is wrong less than half the time.
4.5 Online Search Agents and Unknown Environments
So far we have concentrated on agents that use offline search algorithms. They compute a complete solution before taking their first action. In contrast, an online search⁸ agent interleaves computation and action: first it takes an action, then it observes the environment and computes the next action. Online search is a good idea in dynamic or semi-dynamic environments, where there is a penalty for sitting around and computing too long. Online search is also helpful in nondeterministic domains because it allows the agent to focus its computational efforts on the contingencies that actually arise rather than those that might happen but probably won’t.
8 The term “online” here refers to algorithms that must process input as it is received rather than waiting for the entire input data set to become available. This usage of “online” is unrelated to the concept of “having an Internet connection.”
Of course, there is a tradeoff:
the more an agent plans ahead, the less often it will find
itself up the creek without a paddle. In unknown environments, where the agent does not
know what states exist or what its actions do, the agent must use its actions as experiments in order to learn about the environment.
A canonical example of online search is the mapping problem: a robot is placed in an
unknown building and must explore to build a map that can later be used for getting from
A to B. Methods for escaping from labyrinths—required knowledge for aspiring heroes of antiquity—are also examples of online search algorithms. Spatial exploration is not the only
form of online exploration, however. Consider a newborn baby: it has many possible actions but knows the outcomes of none of them, and it has experienced only a few of the possible states that it can reach.
4.5.1 Online search problems
An online search problem is solved by interleaving computation, sensing, and acting. We’ll start by assuming a deterministic and fully observable environment (Chapter 17 relaxes these assumptions) and stipulate that the agent knows only the following:
• ACTIONS(s), the legal actions in state s;
• c(s, a, s'), the cost of applying action a in state s to arrive at state s'. Note that this cannot be used until the agent knows that s' is the outcome.
• IS-GOAL(s), the goal test.
Note in particular that the agent cannot determine RESULT(s, a) except by actually being in s and doing a. For example, in the maze problem shown in Figure 4.19, the agent does not know that going Up from (1,1) leads to (1,2); nor, having done that, does it know that going Down will take it back to (1,1). This degree of ignorance can be reduced in some applications: for example, a robot explorer might know how its movement actions work and be ignorant only of the locations of obstacles.
Finally, the agent might have access to an admissible heuristic function h(s) that estimates
the distance from the current state to a goal state. For example, in Figure 4.19, the agent might know the location of the goal and be able to use the Manhattan-distance heuristic (page 97).
Typically, the agent’s objective is to reach a goal state while minimizing cost. (Another
possible objective is simply to explore the entire environment.) The cost is the total path
cost that the agent incurs as it travels. It is common to compare this cost with the path cost
the agent would incur if it knew the search space in advance—that is, the optimal path in the known environment. In the language of online algorithms, this comparison is called the competitive ratio; we would like it to be as small as possible.
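Written as a formula, the comparison described above is simply a quotient of two path costs. The numbers in the example are invented purely for illustration.

```latex
\text{competitive ratio}
  = \frac{\text{total path cost actually incurred by the online agent}}
         {\text{cost of the optimal path in the known environment}}
% Hypothetical example: an agent that travels a path of cost 14 when the
% optimal path has cost 10 achieves a competitive ratio of 14/10 = 1.4.
```

A ratio of 1 would mean the online agent did as well as an agent that knew the state space in advance.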
Online explorers are vulnerable to dead ends: states from which no goal state is reachable. If the agent doesn’t know what each action does, it might execute the “jump into bottomless pit” action, and thus never reach the goal. In general, no algorithm can avoid dead ends in all state spaces. Consider the two dead-end state spaces in Figure 4.20(a). An online search algorithm that has visited states S and A cannot tell if it is in the top state or the bottom one; the two look identical based on what the agent has seen. Therefore, there is no
n, then the constraint cannot be satisfied.
Section 6.2 Constraint Propagation: Inference in CSPs
This leads to the following simple algorithm: First, remove any variable in the constraint
that has a singleton domain, and delete that variable’s value from the domains of the remaining variables. Repeat as long as there are singleton variables. If at any point an empty
domain is produced or there are more variables than domain values left, then an inconsistency has been detected. This method can detect the inconsistency in the assignment {WA = red, NSW =red} for Figure 6.1. Notice that the variables SA, NT, and Q are effectively connected by an Alldiff
constraint because each pair must have two different colors. After applying AC-3 with the partial assignment, the domains of SA, NT, and Q are all reduced to {green, blue}. That is, we have three variables and only two colors, so the Alldiff constraint is violated.
Thus, a
simple consistency procedure for a higher-order constraint is sometimes more effective than
applying arc consistency to an equivalent set of binary constraints.
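The simple Alldiff procedure described above fits in a few lines. In this sketch, the function name and the dictionary-of-sets representation of domains are assumptions; a return value of `True` means only that no inconsistency was detected, not that the constraint is guaranteed to be satisfiable.

```python
def alldiff_consistent(domains):
    """Return False if the simple Alldiff check detects an inconsistency.

    domains maps each variable in the constraint to a set of candidate values.
    """
    remaining = {var: set(vals) for var, vals in domains.items()}
    while True:
        if any(not vals for vals in remaining.values()):
            return False                       # an empty domain was produced
        values_left = set().union(*remaining.values())
        if len(remaining) > len(values_left):
            return False                       # more variables than values
        singleton = next((v for v, vals in remaining.items() if len(vals) == 1), None)
        if singleton is None:
            return True                        # nothing more to propagate
        (value,) = remaining.pop(singleton)    # remove the singleton variable
        for vals in remaining.values():
            vals.discard(value)                # and its value from the others


colors = {"SA": {"green", "blue"}, "NT": {"green", "blue"}, "Q": {"green", "blue"}}
map_ok = alldiff_consistent(colors)
chain_ok = alldiff_consistent({"A": {1}, "B": {1, 2}, "C": {1, 2, 3}})
```

For the map-coloring domains above there are three variables but only two values, so the check fails immediately; in the second example the singletons propagate until every variable has been assigned a distinct value.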
Another important higher-order constraint is the resource constraint, sometimes called the Atmost constraint. For example, in a scheduling problem, let $P_1, \ldots, P_4$ denote the numbers of personnel assigned to each of four tasks. The constraint that no more than 10 personnel are assigned in total is written as $\text{Atmost}(10, P_1, P_2, P_3, P_4)$. We can detect an inconsistency simply by checking the sum of the minimum values of the current domains; for example, if each variable has the domain {3,4,5,6}, the Atmost constraint cannot be satisfied. We can also enforce consistency by deleting the maximum value of any domain if it is not consistent with the minimum values of the other domains. Thus, if each variable in our example has the domain {2,3,4,5,6}, the values 5 and 6 can be deleted from each domain.
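Both checks for the Atmost constraint reduce to arithmetic on the domain minima. The sketch below assumes non-empty integer domains given as a list of sets; the function name `atmost_prune` and the convention of returning `None` for a detected inconsistency are choices made for this example.

```python
def atmost_prune(limit, domains):
    """Return pruned domains for Atmost(limit, ...), or None if inconsistent."""
    mins = [min(d) for d in domains]
    total_min = sum(mins)
    if total_min > limit:
        return None                            # even the smallest values exceed the limit
    # Keep a value only if it fits together with the minima of all other domains.
    return [{v for v in d if v + (total_min - m) <= limit}
            for d, m in zip(domains, mins)]


infeasible = atmost_prune(10, [{3, 4, 5, 6}] * 4)
pruned = atmost_prune(10, [{2, 3, 4, 5, 6}] * 4)
```

In the first call the minima already sum to 12, so the constraint cannot be satisfied; in the second, the other three minima sum to 6, so only the values 2, 3, and 4 survive in each domain.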
For large resource-limited problems with integer values—such as logistical problems involving moving thousands of people in hundreds of vehicles—it is usually not possible to represent the domain of each variable as a large set of integers and gradually reduce that set by consistency-checking methods. Instead, domains are represented by upper and lower bounds and are managed by bounds propagation. For example, in an airline-scheduling
problem, let’s suppose there are two flights, $F_1$ and $F_2$, for which the planes have capacities 165 and 385, respectively. The initial domains for the numbers of passengers on flights $F_1$ and $F_2$ are then
$D_1 = [0, 165]$ and $D_2 = [0, 385]$.
Now suppose we have the additional constraint that the two flights together must carry 420 people: $F_1 + F_2 = 420$. Propagating bounds constraints, we reduce the domains to
$D_1 = [35, 165]$ and $D_2 = [255, 385]$.
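The flight example can be reproduced with a small bounds-propagation routine. The sketch below handles only a single constraint of the form x1 + x2 = total; the function name `propagate_sum` and the (low, high) tuple representation are assumptions made for illustration.

```python
def propagate_sum(total, b1, b2):
    """Tighten (low, high) bounds of two variables under x1 + x2 == total.

    Returns the tightened bounds, or None if some interval becomes empty.
    """
    lo1, hi1 = b1
    lo2, hi2 = b2
    changed = True
    while changed:
        # x1 = total - x2, so x1 lies in [total - hi2, total - lo2].
        new1 = (max(lo1, total - hi2), min(hi1, total - lo2))
        # Same reasoning for x2, using the freshly tightened bounds of x1.
        new2 = (max(lo2, total - new1[1]), min(hi2, total - new1[0]))
        changed = (new1, new2) != ((lo1, hi1), (lo2, hi2))
        (lo1, hi1), (lo2, hi2) = new1, new2
        if lo1 > hi1 or lo2 > hi2:
            return None
    return (lo1, hi1), (lo2, hi2)


d1, d2 = propagate_sum(420, (0, 165), (0, 385))
```

Storing only two numbers per variable is what makes this approach practical when the domains are too large to enumerate.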
We say that a CSP is bounds-consistent if for every variable X, and for both the lower-bound and upper-bound values of X, there exists some value of Y that satisfies the constraint between X and Y for every variable Y. This kind of bounds propagation is widely used in practical constraint problems.

6.2.6 Sudoku
The popular Sudoku puzzle has introduced millions of people to constraint satisfaction problems, although they may not realize it. A Sudoku board consists of 81 squares, some of which are initially filled with digits from 1 to 9. The puzzle is to fill in all the remaining squares such that no digit appears twice in any row, column, or 3 × 3 box (see Figure 6.4). A row, column, or box is called a unit.
To solve a tree-structured CSP, first pick any variable to be the root of the tree, and choose
an ordering of the variables such that each variable appears after its parent in the tree. Such an ordering is called a topological sort. Figure 6.10(a) shows a sample tree and (b) shows
one possible ordering. Any tree with n nodes has n − 1 edges, so we can make this graph directed arc-consistent in $O(n)$ steps, each of which must compare up to d possible domain
values for two variables, for a total time of $O(nd^2)$. Once we have a directed arc-consistent
graph, we can just march down the list of variables and choose any remaining value. Since each edge from a parent to its child is arc-consistent, we know that for any value we choose
for the parent, there will be a valid value left to choose for the child. That means we won’t have to backtrack; we can move linearly through the variables. The complete algorithm is shown in Figure 6.11.

Figure 6.10 (a) The constraint graph of a tree-structured CSP. (b) A linear ordering of the variables consistent with the tree with A as the root. This is known as a topological sort of the variables.

function TREE-CSP-SOLVER(csp) returns a solution, or failure
  inputs: csp, a CSP with components X, D, C
  n ← number of variables in X
  assignment ← an empty assignment
  root ← any variable in X
  X ← TOPOLOGICALSORT(X, root)
  for j = n down to 2 do
    MAKE-ARC-CONSISTENT(PARENT(X_j), X_j)
    if it cannot be made consistent then return failure
  for i = 1 to n do
    assignment[X_i] ← any consistent value from D_i
    if there is no consistent value then return failure
  return assignment

Figure 6.11 The TREE-CSP-SOLVER algorithm for solving tree-structured CSPs. If the CSP has a solution, we will find it in linear time; if not, we will detect a contradiction.

3 A careful cartographer or patriotic Tasmanian might object that Tasmania should not be colored the same as its nearest mainland neighbor, to avoid the impression that it might be part of that state.
4 Sadly, very few regions of the world have tree-structured maps, although Sulawesi comes close.
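As a rough illustration of Figure 6.11, the following Python sketch solves a tree-structured binary CSP with one backward pass for directed arc consistency and one forward pass for assignment. Several details are assumptions made here rather than part of the text: the constraint graph is taken to be a single connected tree, the constraint test is supplied as a callback `ok(x, vx, y, vy)`, the topological sort is a breadth-first order from the first variable, and the example tree (edges A-B, B-C, B-D, D-E, D-F with two colors) is hypothetical.

```python
from collections import deque


def tree_csp_solver(variables, domains, edges, ok):
    """Solve a connected tree-structured binary CSP; return a dict or None."""
    neighbors = {v: [] for v in variables}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    # Topological sort: breadth-first order puts every parent before its children.
    root = variables[0]
    parent, order, queue = {root: None}, [], deque([root])
    while queue:
        x = queue.popleft()
        order.append(x)
        for y in neighbors[x]:
            if y not in parent:
                parent[y] = x
                queue.append(y)

    dom = {v: list(domains[v]) for v in variables}
    # Backward pass (j = n down to 2): make each arc parent -> child consistent.
    for child in reversed(order[1:]):
        p = parent[child]
        dom[p] = [vp for vp in dom[p]
                  if any(ok(p, vp, child, vc) for vc in dom[child])]
        if not dom[p]:
            return None

    # Forward pass (i = 1 to n): pick any value consistent with the parent.
    assignment = {}
    for x in order:
        p = parent[x]
        choices = [v for v in dom[x]
                   if p is None or ok(p, assignment[p], x, v)]
        if not choices:
            return None
        assignment[x] = choices[0]
    return assignment


variables = ["A", "B", "C", "D", "E", "F"]
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("D", "E"), ("D", "F")]
domains = {v: ["red", "green"] for v in variables}
different = lambda x, vx, y, vy: vx != vy
solution = tree_csp_solver(variables, domains, edges, different)
```

Because every arc is made consistent from the leaves upward, the forward pass never needs to revisit an earlier choice.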
Now that we have an efficient algorithm for trees, we can consider whether more general constraint graphs can be reduced to trees somehow. There are two ways to do this: by
removing nodes (Section 6.5.1) or by collapsing nodes together (Section 6.5.2).

6.5.1 Cutset conditioning
The first way to reduce a constraint graph to a tree involves assigning values to some variables so that the remaining variables form a tree. Consider the constraint graph for Australia, shown
again in Figure 6.12(a). Without South Australia, the graph would become a tree, as in (b). Fortunately, we can delete South Australia (in the graph, not the country) by fixing a value for SA and deleting from the domains of the other variables any values that are inconsistent with the value chosen for SA.
Now, any solution for the CSP after SA and its constraints are removed will be consistent
with the value chosen for SA. (This works for binary CSPs; the situation is more complicated with higher-order constraints.) Therefore, we can solve the remaining tree with the algorithm given above and thus solve the whole problem. Of course, in the general case (as opposed to map coloring), the value chosen for SA could be the wrong one, so we would need to try each possible value. The general algorithm is as follows:
1. Choose a subset S of the CSP’s variables such that the constraint graph becomes a tree after removal of S. S is called a cycle cutset.
2. For each possible assignment to the variables in S that satisfies all constraints on S,
   (a) remove from the domains of the remaining variables any values that are inconsistent with the assignment for S, and
   (b) if the remaining CSP has a solution, return it together with the assignment for S.

Figure 6.12 (a) The original constraint graph from Figure 6.1. (b) After the removal of SA, the constraint graph becomes a forest of two trees.
If the cycle cutset has size c, then the total run time is $O(d^c \cdot (n - c)d^2)$: we have to try each of the $d^c$ combinations of values for the variables in S, and for each combination we must solve a tree problem of size n − c. If the graph is “nearly a tree,” then c will be small and the savings over straight backtracking will be huge. For our 100-Boolean-variable example, if we could find a cutset of size c = 20, this would get us down from the lifetime of the Universe to a few minutes. In the worst case, however, c can be as large as (n − 2). Finding the smallest cycle cutset is NP-hard, but several efficient approximation algorithms are known. The overall algorithmic approach is called cutset conditioning; it comes up again in Chapter 13, where it is used for reasoning about probabilities.
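The cutset-conditioning loop can be sketched on the Australia map with the cutset S = {SA}. To keep the example short, the remaining forest is solved here by a plain recursive search (`solve_rest`) that merely stands in for TREE-CSP-SOLVER; that substitution, the function names, and the adjacency table are assumptions of this sketch.

```python
NEIGHBORS = {
    "WA": ["NT", "SA"],
    "NT": ["WA", "SA", "Q"],
    "SA": ["WA", "NT", "Q", "NSW", "V"],
    "Q": ["NT", "SA", "NSW"],
    "NSW": ["Q", "SA", "V"],
    "V": ["SA", "NSW"],
    "T": [],
}
COLORS = ["red", "green", "blue"]


def solve_rest(variables, domains, assignment):
    """Stand-in for TREE-CSP-SOLVER: simple recursive search on the remainder."""
    if not variables:
        return assignment
    x, rest = variables[0], variables[1:]
    for v in domains[x]:
        # Cutset variables are absent from `assignment`; they were handled by pruning.
        if all(assignment.get(n) != v for n in NEIGHBORS[x]):
            result = solve_rest(rest, domains, {**assignment, x: v})
            if result is not None:
                return result
    return None


def cutset_conditioning(cutset_var="SA"):
    others = [v for v in NEIGHBORS if v != cutset_var]
    for value in COLORS:  # step 2: each assignment to the cutset
        # Step (a): delete values inconsistent with the cutset assignment.
        domains = {
            v: [c for c in COLORS
                if not (cutset_var in NEIGHBORS[v] and c == value)]
            for v in others
        }
        # Step (b): solve what remains and combine it with the cutset value.
        rest = solve_rest(others, domains, {})
        if rest is not None:
            return {cutset_var: value, **rest}
    return None


solution = cutset_conditioning()
```

For map coloring the first value tried for SA already works, which matches the remark that any choice for SA is acceptable in this particular problem.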
6.5.2 Tree decomposition
The second way to reduce a constraint graph to a tree is based on constructing a tree decomposition of the constraint graph: a transformation of the original graph into a tree where each node in the tree consists of a set of variables, as in Figure 6.13. A tree decomposition must satisfy these three requirements:
• Every variable in the original problem appears in at least one of the tree nodes.
• If two variables are connected by a constraint in the original problem, they must appear together (along with the constraint) in at least one of the tree nodes.
• If a variable appears in two nodes in the tree, it must appear in every node along the path connecting those nodes.
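The three requirements can be checked mechanically. The sketch below is a hedged illustration: the function name, the node labels N1 to N5, and the exact node contents are assumptions made here (the contents are reconstructed from the prose description of Figure 6.13, with T in a node of its own), and the code takes for granted that the supplied edges contain no cycles. The third requirement is tested in an equivalent form: the nodes containing a given variable must form a connected piece of the tree.

```python
def is_tree_decomposition(bags, tree_edges, constraint_edges, variables):
    """Check the three tree-decomposition requirements (edges assumed acyclic)."""
    # 1. Every variable appears in at least one tree node.
    if not all(any(v in bag for bag in bags.values()) for v in variables):
        return False
    # 2. Every constrained pair appears together in at least one tree node.
    for a, b in constraint_edges:
        if not any(a in bag and b in bag for bag in bags.values()):
            return False
    # 3. The nodes holding a variable must be connected to one another.
    for v in variables:
        holders = {name for name, bag in bags.items() if v in bag}
        start = next(iter(holders))
        seen, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for a, b in tree_edges:
                for here, there in ((a, b), (b, a)):
                    if here == node and there in holders and there not in seen:
                        seen.add(there)
                        stack.append(there)
        if seen != holders:
            return False
    return True


variables = ["WA", "NT", "SA", "Q", "NSW", "V", "T"]
constraint_edges = [("WA", "NT"), ("WA", "SA"), ("NT", "SA"), ("NT", "Q"),
                    ("SA", "Q"), ("SA", "NSW"), ("SA", "V"), ("Q", "NSW"),
                    ("NSW", "V")]
bags = {
    "N1": {"WA", "NT", "SA"},
    "N2": {"NT", "SA", "Q"},
    "N3": {"SA", "Q", "NSW"},
    "N4": {"SA", "NSW", "V"},
    "N5": {"T"},
}
tree_edges = [("N1", "N2"), ("N2", "N3"), ("N3", "N4")]
valid = is_tree_decomposition(bags, tree_edges, constraint_edges, variables)

# Dropping SA from N2 breaks the third requirement: SA would then appear in
# N1 and N3 but not in the node on the path between them.
broken_bags = dict(bags, N2={"NT", "Q"})
broken = is_tree_decomposition(broken_bags, tree_edges, constraint_edges, variables)
```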
Figure 6.13 A tree decomposition of the constraint graph in Figure 6.12(a).
The first two conditions ensure that all the variables and constraints are represented in the tree decomposition.
The third condition seems rather technical, but allows us to say that
any variable from the original problem must have the same value wherever it appears: the
constraints in the tree say that a variable in one node of the tree must have the same value as
the corresponding variable in the adjacent node in the tree. For example, SA appears in all four of the connected nodes in Figure 6.13, so each edge in the tree decomposition therefore
includes the constraint that the value of SA in one node must be the same as the value of SA
in the next. You can verify from Figure 6.12 that this decomposition makes sense.
Once we have a tree-structured graph, we can apply TREE-CSP-SOLVER to get a solution in $O(nd^2)$ time, where n is the number of tree nodes and d is the size of the largest domain. But note that in the tree, a domain is a set of tuples of values, not just individual values.
For example, the top left node in Figure 6.13 represents, at the level of the original problem, a subproblem with variables {WA, NT, SA}, domain {red, green, blue}, and constraints WA ≠ NT, SA ≠ NT, WA ≠ SA. At the level of the tree, the node represents a single variable, which we can call SANTWA, whose value must be a three-tuple of colors, such as (red, green, blue), but not (red, red, blue), because that would violate the constraint SA ≠ NT from the original problem. We can then move from that node to the adjacent one, with the variable we can call SANTQ, and find that there is only one tuple, (red, green, blue), that is consistent with the choice for SANTWA. The exact same process is repeated for the next two nodes, and independently we can make any choice for T.
We can solve any tree decomposition problem in $O(nd^2)$ time with TREE-CSP-SOLVER, which will be efficient as long as d remains small. Going back to our example with 100 Boolean variables, if each node has 10 variables, then $d = 2^{10}$ and we should be able to solve the problem in seconds. But if there is a node with 30 variables, it would take centuries.
A given graph admits many tree decompositions; in choosing a decomposition, the aim is to make the subproblems as small as possible. (Putting all the variables into one node is technically a tree, but is not helpful.) The tree width of a tree decomposition of a graph is one less than the size of the largest node; the tree width of the graph itself is defined to be the minimum width among all its tree decompositions. If a graph has tree width w then the problem can be solved in $O(nd^{w+1})$ time given the corresponding tree decomposition. Hence, CSPs with constraint graphs of bounded tree width are solvable in polynomial time.
Unfortunately, finding the decomposition with minimal tree width is NP-hard, but there are heuristic methods that work well in practice. Which is better: the cutset decomposition with time $O(d^c \cdot (n - c)d^2)$, or the tree decomposition with time $O(nd^{w+1})$? Whenever you have a cycle-cutset of size c, there is also a tree width of size w < c + 1, and it may be far smaller in some cases. So time consideration favors tree decomposition, but the advantage of the cycle-cutset approach is that it can be executed in linear memory, while tree decomposition requires memory exponential in w.
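To get a feel for the two bounds, one can simply evaluate the expressions inside the O(·) for the 100-Boolean-variable example. This is a back-of-the-envelope sketch only: asymptotic bounds hide constant factors, the function names are made up here, and the choice w = 9 (tree nodes of at most 10 variables) is an assumption for illustration.

```python
def cutset_steps(n, d, c):
    """Evaluate d^c * (n - c) * d^2, the expression inside the cutset bound."""
    return d**c * (n - c) * d**2


def tree_decomposition_steps(n, d, w):
    """Evaluate n * d^(w + 1), the expression inside the tree-decomposition bound."""
    return n * d**(w + 1)


cutset = cutset_steps(n=100, d=2, c=20)            # cutset of size 20
tree = tree_decomposition_steps(n=100, d=2, w=9)   # assumed tree width 9
brute = 2**100                                     # all complete assignments
```

Both structured methods are many orders of magnitude below exhaustive enumeration here; which of the two is preferable in practice also depends on the memory consideration mentioned above.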
6.5.3 Value symmetry
So far, we have looked at the structure of the constraint graph. There can also be important structure in the values of variables, or in the structure of the constraint relations themselves.
Consider the map-coloring problem with d colors. For every consistent solution, there is actually a set of d! solutions formed by permuting the color names. For example, on the Australia map we know that WA, NT, and SA must all have different colors, but there are 3! = 6 ways to assign three colors to three regions. This is called value symmetry. We would like to reduce the search space by a factor of d! by breaking the symmetry in assignments. We do this by introducing a symmetry-breaking constraint. For our example, we might impose an arbitrary ordering constraint, NT < SA < WA, that requires the three values to be in alphabetical order. This constraint ensures that only one of the d! solutions is possible: {NT = blue, SA = green, WA = red}.
For map coloring, it was easy to find a constraint that eliminates the symmetry. In general it is NP-hard to eliminate all symmetry, but breaking value symmetry has proved to be
important and effective on a wide range of problems.
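The effect of the ordering constraint can be checked by brute-force enumeration on the Australia map. The sketch below is illustrative only: the helper name `colorings` and the border list are written out here as assumptions, and Python's string comparison is used as the alphabetical order on color names.

```python
from itertools import product

REGIONS = ["WA", "NT", "SA", "Q", "NSW", "V", "T"]
BORDERS = [("WA", "NT"), ("WA", "SA"), ("NT", "SA"), ("NT", "Q"), ("SA", "Q"),
           ("SA", "NSW"), ("SA", "V"), ("Q", "NSW"), ("NSW", "V")]
COLORS = ["blue", "green", "red"]


def colorings(extra=lambda a: True):
    """Yield every complete coloring that satisfies all border constraints
    and an optional extra (symmetry-breaking) predicate."""
    for values in product(COLORS, repeat=len(REGIONS)):
        a = dict(zip(REGIONS, values))
        if all(a[x] != a[y] for x, y in BORDERS) and extra(a):
            yield a


all_solutions = list(colorings())
# Symmetry-breaking constraint NT < SA < WA in alphabetical order of color names.
broken = list(colorings(lambda a: a["NT"] < a["SA"] < a["WA"]))
```

Enumeration gives 18 solutions without the ordering constraint and 3 with it, a reduction by the factor 3! = 6; the three survivors differ only in the color of T, which is unconstrained.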
Summary

• Constraint satisfaction problems (CSPs) represent a state with a set of variable/value pairs and represent the conditions for a solution by a set of constraints on the variables. Many important real-world problems can be described as CSPs.
• A number of inference techniques use the constraints to rule out certain variable assignments. These include node, arc, path, and k-consistency.
• Backtracking search, a form of depth-first search, is commonly used for solving CSPs. Inference can be interwoven with search.
• The minimum-remaining-values and degree heuristics are domain-independent methods for deciding which variable to choose next in a backtracking search. The least-constraining-value heuristic helps in deciding which value to try first for a given variable. Backtracking occurs when no legal assignment can be found for a variable. Conflict-directed backjumping backtracks directly to the source of the problem. Constraint learning records the conflicts as they are encountered during search in order to avoid the same conflict later in the search.
• Local search using the min-conflicts heuristic has also been applied to constraint satisfaction problems with great success.
• The complexity of solving a CSP is strongly related to the structure of its constraint graph. Tree-structured problems can be solved in linear time. Cutset conditioning can reduce a general CSP to a tree-structured one and is quite efficient (requiring only linear memory) if a small cutset can be found. Tree decomposition techniques transform the CSP into a tree of subproblems and are efficient if the tree width of the constraint graph is small; however they need memory exponential in the tree width of the constraint graph. Combining cutset conditioning with tree decomposition can allow a better tradeoff of memory versus time.
Bibliographical and Historical Notes
The Greek mathematician Diophantus (c. 200-284) presented and solved problems involving algebraic constraints on equations, although he didn’t develop a generalized methodology. We now call equations over integer domains Diophantine equations. The Indian mathematician Brahmagupta (c. 650) was the first to show a general solution over the domain of integers for the equation ax + by = c. Systematic methods for solving linear equations by variable elimination were studied by Gauss (1829); the solution of linear inequality constraints goes back to Fourier (1827).
Finite-domain constraint satisfaction problems also have a long history. For example, graph coloring (of which map coloring is a special case) is an old problem in mathematics.
The four-color conjecture (that every planar graph can be colored with four or fewer colors) was first made by Francis Guthrie, a student of De Morgan, in 1852.
It resisted solution—
despite several published claims to the contrary—until a proof was devised by Appel and Haken (1977) (see the book Four Colors Suffice (Wilson, 2004)). Purists were disappointed that part of the proof relied on a computer,
so Georges Gonthier (2008), using the COQ
theorem prover, derived a formal proof that Appel and Haken’s proof program was correct.
Specific classes of constraint satisfaction problems occur throughout the history of computer science. One of the most influential early examples was SKETCHPAD (Sutherland, 1963), which solved geometric constraints in diagrams and was the forerunner of modern drawing programs and CAD tools. The identification of CSPs as a general class is due to Ugo Montanari (1974). The reduction of higher-order CSPs to purely binary CSPs with auxiliary variables (see Exercise 6.NARY) is due originally to the 19th-century logician Charles Sanders Peirce. It was introduced into the CSP literature by Dechter (1990b) and was elaborated by Bacchus and van Beek (1998). CSPs with preferences among solutions are studied widely in the optimization literature; see Bistarelli et al. (1997) for a generalization of the CSP framework to allow for preferences.
Constraint propagation methods were popularized by Waltz’s (1975) success on polyhedral line-labeling problems for computer vision. Waltz showed that in many problems, propagation completely eliminates the need for backtracking. Montanari (1974) introduced the notion of constraint graphs and propagation by path consistency. Alan Mackworth (1977)
proposed the AC-3 algorithm for enforcing arc consistency as well as the general idea of combining backtracking with some degree of consistency enforcement. AC-4, a more efficient arc-consistency algorithm developed by Mohr and Henderson (1986), runs in $O(cd^2)$ worst-case time but can be slower than AC-3 on average cases. The PC-2 algorithm (Mackworth, 1977) achieves path consistency in much the same way that AC-3 achieves arc consistency.
Soon after Mackworth’s paper appeared, researchers began experimenting with the tradeoff between the cost of consistency enforcement and the benefits in terms of search reduction. Haralick and Elliott (1980) favored the minimal forward-checking algorithm described by McGregor (1979), whereas Gaschnig (1979) suggested full arc-consistency checking after each variable assignment, an algorithm later called MAC by Sabin and Freuder (1994). The latter paper provides somewhat convincing evidence that on harder CSPs, full arc-consistency checking pays off. Freuder (1978, 1982) investigated the notion of k-consistency and its relationship to the complexity of solving CSPs. Dechter and Dechter (1987) introduced directional arc consistency. Apt (1999) describes a generic algorithmic framework within which consistency propagation algorithms can be analyzed, and surveys are given by Bessière (2006) and Barták et al. (2010).
Special methods for handling higher-order or global constraints were developed first within the context of constraint logic programming. Marriott and Stuckey (1998) provide excellent coverage of research in this area. The Alldiff constraint was studied by Regin (1994), Stergiou and Walsh (1999), and van Hoeve (2001). There are more complex inference algorithms for Alldiff (see van Hoeve and Katriel, 2006) that propagate more constraints but are more computationally expensive to run. Bounds constraints were incorporated into constraint logic programming by Van Hentenryck et al. (1998). A survey of global constraints is provided by van Hoeve and Katriel (2006).
Sudoku has become the most widely known CSP and was described as such by Simonis (2005). Agerbeck and Hansen (2008) describe some of the strategies and show that Sudoku on an $n^2 \times n^2$ board is in the class of NP-hard problems.
In 1850, C. F. Gauss described a recursive backtracking algorithm for solving the 8-queens problem, which had been published in the German chess magazine Schachzeitung in 1848. Gauss called his method Tatonniren, derived from the French word tâtonner (to grope around, as if in the dark).
According to Donald Knuth (personal communication), R. J. Walker introduced the term backtrack in the 1950s. Walker (1960) described the basic backtracking algorithm and used it to find all solutions to the 13-queens problem. Golomb and Baumert (1965) formulated, with
examples, the general class of combinatorial problems to which backtracking can be applied, and introduced what we call the MRV heuristic. Bitner and Reingold (1975) provided an influential survey of backtracking techniques. Brelaz (1979) used the degree heuristic as a tiebreaker after applying the MRV heuristic. The resulting algorithm, despite its simplicity, is still the best method for k-coloring arbitrary graphs. Haralick and Elliott (1980) proposed the least-constraining-value heuristic.
The basic backjumping method is due to John Gaschnig (1977, 1979). Kondrak and van Beek (1997) showed that this algorithm is essentially subsumed by forward checking. Conflict-directed backjumping was devised by Prosser (1993). Dechter (1990a) introduced graph-based backjumping, which bounds the complexity of backjumping-based algorithms as a function of the constraint graph (Dechter and Frost, 2002).
A very general form of intelligent backtracking was developed early on by Stallman and Sussman (1977). Their technique of dependency-directed backtracking combines backjumping with no-good learning (McAllester, 1990) and led to the development of truth maintenance systems (Doyle, 1979), which we discuss in Section 10.6.2. The connection between the two areas is analyzed by de Kleer (1989).
The work of Stallman and Sussman also introduced the idea of constraint learning, in which partial results obtained by search can be saved and reused later in the search. The idea was formalized by Dechter (1990a). Backmarking (Gaschnig, 1979) is a particularly simple method in which consistent and inconsistent pairwise assignments are saved and used to avoid rechecking constraints. Backmarking can be combined with conflict-directed backjumping; Kondrak and van Beek (1997) present a hybrid algorithm that provably subsumes either method taken separately.
The method of dynamic backtracking (Ginsberg, 1993) retains successful partial assignments from later subsets of variables when backtracking over an earlier choice that does not invalidate the later success. Moskewicz et al. (2001) show how these techniques and others are used to create an efficient SAT solver. Empirical studies of several randomized backtracking methods were done by Gomes et al. (2000) and Gomes and Selman (2001).
Van Beek (2006) surveys backtracking.
Local search in constraint satisfaction problems was popularized by the work of Kirkpatrick et al. (1983) on simulated annealing (see Chapter 4), which is widely used for VLSI layout and scheduling problems. Beck et al. (2011) give an overview of recent work on job-shop scheduling. The min-conflicts heuristic was first proposed by Gu (1989) and was developed independently by Minton et al. (1992). Sosic and Gu (1994) showed how it could be applied to solve the 3,000,000 queens problem in less than a minute. The astounding success of local search using min-conflicts on the n-queens problem led to a reappraisal of the nature and prevalence of “easy” and “hard” problems. Peter Cheeseman et al. (1991) explored the difficulty of randomly generated CSPs and discovered that almost all such problems either are trivially easy or have no solutions. Only if the parameters of the problem generator are set in a certain narrow range, within which roughly half of the problems are solvable, do we find “hard” problem instances. We discuss this phenomenon further in Chapter 7.
Konolige (1994) showed that local search is inferior to backtracking search on problems with a certain degree of local structure; this led to work that combined local search and inference, such as that by Pinkas and Dechter (1995). Hoos and Tsang (2006) provide a survey of local search techniques, and textbooks are offered by Hoos and Stützle (2004) and
Aarts and Lenstra (2003).
Work relating the structure and complexity of CSPs originates with Freuder (1985) and Mackworth and Freuder (1985), who showed that search on arc-consistent trees works without any backtracking. A similar result, with extensions to acyclic hypergraphs, was developed in the database community (Beeri et al., 1983). Bayardo and Miranker (1994) present an algorithm for tree-structured CSPs that runs in linear time without any preprocessing. Dechter (1990a) describes the cycle-cutset approach.
Since those papers were published, there has been a great deal of progress in developing more general results relating the complexity of solving a CSP to the structure of its constraint graph. The notion of tree width was introduced by the graph theorists Robertson and Seymour (1986). Dechter and Pearl (1987, 1989), building on the work of Freuder, applied a related notion (which they called induced width but is identical to tree width) to constraint satisfaction problems and developed the tree decomposition approach sketched in Section 6.5. Drawing on this work and on results from database theory, Gottlob et al. (1999a, 1999b) developed a notion, hypertree width, that is based on the characterization of the CSP as a hypergraph. In addition to showing that any CSP with hypertree width w can be solved in time $O(n^{w+1} \log n)$, they also showed that hypertree width subsumes all previously defined measures of “width” in the sense that there are cases where the hypertree width is bounded and the other measures are unbounded.
The RELSAT algorithm of Bayardo and Schrag (1997) combined constraint learning and
backjumping and was shown to outperform many other algorithms of the time. This led to AND-OR search algorithms applicable to both CSPs and probabilistic reasoning (Dechter and Mateescu, 2007). Brown et al. (1988) introduce the idea of symmetry breaking in CSPs,
and Gent et al. (2006) give a survey.
The field of distributed constraint satisfaction looks at solving CSPs when there is a
collection of agents, each of which controls a subset of the constraint variables. There have
been annual workshops on this problem since 2000, and good coverage elsewhere (Collin et al., 1999; Pearce et al., 2008).
Comparing CSP algorithms is mostly an empirical science: few theoretical results show that one algorithm dominates another on all problems; instead, we need to run experiments to see which algorithms perform better on typical instances of problems. As Hooker (1995) points out, we need to be careful to distinguish between competitive testing, as occurs in competitions among algorithms based on run time, and scientific testing, whose goal is to identify the properties of an algorithm that determine its efficacy on a class of problems.
The textbooks by Apt (2003), Dechter (2003), Tsang (1993), and Lecoutre (2009), and the collection by Rossi et al. (2006), are excellent resources on constraint processing. There are several good survey articles, including those by Dechter and Frost (2002), and Barták et al. (2010). Carbonnel and Cooper (2016) survey tractable classes of CSPs. Kondrak and van Beek (1997) give an analytical survey of backtracking search algorithms, and Bacchus and van Run (1995) give a more empirical survey. Constraint programming is covered in the books by Apt (2003) and Fruhwirth and Abdennadher (2003). Papers on constraint satisfaction appear regularly in Artificial Intelligence and in the specialist journal Constraints; the latest SAT solvers are described in the annual International SAT Competition. The primary conference venue is the International Conference on Principles and Practice of Constraint Programming, often called CP.
CHAPTER 7

LOGICAL AGENTS

In which we design agents that can form representations of a complex world, use a process of inference to derive new representations about the world, and use these new representations to deduce what to do.
Humans, it seems, know things; and what they know helps them do things. In AI, knowledge-based agents use a process of reasoning over an internal representation of knowledge to decide what actions to take.
The problem-solving agents of Chapters 3 and 4 know things, but only in a very limited, inflexible sense. They know what actions are available and what the result of performing a specific action from a specific state will be, but they don’t know general facts. A route-finding agent doesn’t know that it is impossible for a road to be a negative number of kilometers long.
An 8-puzzle agent doesn’t know that two tiles cannot occupy the same space. The knowledge they have is very useful for finding a path from the start to a goal, but not for anything else. The atomic representations used by problem-solving agents are also very limiting. In a partially observable environment, for example, a problem-solving agent’s only choice for representing what it knows about the current state is to list all possible concrete states. I could
give a human the goal of driving to a U.S. town with population less than 10,000, but to say
that to a problem-solving agent, I could formally describe the goal only as an explicit set of the 16,000 or so towns that satisfy the description.
Chapter 6 introduced our first factored representation, whereby states are represented as
assignments of values to variables; this is a step in the right direction, enabling some parts of
the agent to work in a domain-independent way and allowing for more efficient algorithms.
In this chapter, we take this step to its logical conclusion, so to speak—we develop logic as a
general class of representations to support knowledge-based agents. These agents can combine and recombine information to suit myriad purposes. This can be far removed from the needs of the moment, as when a mathematician proves a theorem or an astronomer calculates the Earth’s life expectancy. Knowledge-based agents can accept new tasks in the form of explicitly described goals; they can achieve competence quickly by being told or learning new knowledge about the environment; and they can adapt to changes in the environment by updating the relevant knowledge.
We begin in Section 7.1 with the overall agent design. Section 7.2 introduces a simple new environment, the wumpus world, and illustrates the operation of a knowledge-based agent without going into any technical detail. Then we explain the general principles of logic in Section 7.3 and the specifics of propositional logic in Section 7.4. Propositional logic is a factored representation; while less expressive than first-order logic (Chapter 8), which is the canonical structured representation, propositional logic illustrates all the basic concepts of logic. It also comes with well-developed inference technologies, which we describe in Sections 7.5 and 7.6. Finally, Section 7.7 combines the concept of knowledge-based agents with the technology of propositional logic to build some simple agents for the wumpus world.

7.1 Knowledge-Based Agents

function KB-AGENT(percept) returns an action
  persistent: KB, a knowledge base
              t, a counter, initially 0, indicating time
  TELL(KB, MAKE-PERCEPT-SENTENCE(percept, t))
  action ← ASK(KB, MAKE-ACTION-QUERY(t))
  TELL(KB, MAKE-ACTION-SENTENCE(action, t))
  t ← t + 1
  return action

Figure 7.1 A generic knowledge-based agent. Given a percept, the agent adds the percept to its knowledge base, asks the knowledge base for the best action, and tells the knowledge base that it has in fact taken that action.
The central component of a knowledge-based agent is its knowledge base, or KB. A knowledge base is a set of sentences. (Here “sentence” is used as a technical term. It is related but not identical to the sentences of English and other natural languages.) Each sentence is expressed in a language called a knowledge representation language and represents some assertion about the world. When the sentence is taken as being given without being derived from other sentences, we call it an axiom.
There must be a way to add new sentences to the knowledge base and a way to query what is known. The standard names for these operations are TELL and ASK, respectively. Both operations may involve inference, that is, deriving new sentences from old. Inference must obey the requirement that when one ASKs a question of the knowledge base, the answer should follow from what has been told (or TELLed) to the knowledge base previously. Later in this chapter, we will be more precise about the crucial word “follow.” For now, take it to mean that the inference process should not make things up as it goes along.
Figure 7.1 shows the outline of a knowledge-based agent program. Like all our agents, it takes a percept as input and returns an action. The agent maintains a knowledge base, KB, which may initially contain some background knowledge.
Each time the agent program is called, it does three things. First, it TELLs the knowledge base what it perceives. Second, it ASKs the knowledge base what action it should perform. In the process of answering this query, extensive reasoning may be done about the current state of the world, about the outcomes of possible action sequences, and so on. Third, the agent program TELLs the knowledge base which action was chosen, and returns the action so that it can be executed.
The details of the representation language are hidden inside three functions that implement the interface between the sensors and actuators on one side and the core representation and reasoning system on the other. MAKE-PERCEPT-SENTENCE constructs a sentence asserting that the agent perceived the given percept at the given time. MAKE-ACTION-QUERY constructs a sentence that asks what action should be done at the current time. Finally, MAKE-ACTION-SENTENCE constructs a sentence asserting that the chosen action was executed. The details of the inference mechanisms are hidden inside TELL and ASK. Later sections will reveal these details.
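The three-step loop just described can be sketched in Python. This is only an illustrative sketch, not the program of Figure 7.1: the tuple encoding of sentences, the `KnowledgeBase` class, and the stand-in `toy_answer` function (which replaces real inference with a lookup of the latest percept) are all assumptions made for the example.

```python
class KnowledgeBase:
    """Toy KB: a list of sentences plus a caller-supplied query procedure."""

    def __init__(self, background, answer_query):
        self.sentences = list(background)   # background knowledge, possibly empty
        self.answer_query = answer_query    # hypothetical stand-in for inference

    def tell(self, sentence):
        self.sentences.append(sentence)

    def ask(self, query):
        return self.answer_query(self.sentences, query)


# The three interface functions; the tuple encoding is an assumption.
def make_percept_sentence(percept, t):
    return ("Percept", percept, t)


def make_action_query(t):
    return ("ActionQuery", t)


def make_action_sentence(action, t):
    return ("Action", action, t)


class KBAgent:
    """Takes a percept, returns an action, and keeps a time counter."""

    def __init__(self, kb):
        self.kb = kb
        self.t = 0

    def __call__(self, percept):
        self.kb.tell(make_percept_sentence(percept, self.t))   # 1. TELL the percept
        action = self.kb.ask(make_action_query(self.t))        # 2. ASK for an action
        self.kb.tell(make_action_sentence(action, self.t))     # 3. TELL the chosen action
        self.t += 1
        return action


def toy_answer(sentences, query):
    """Not real inference: grab if the current percept contains Glitter."""
    _, t = query
    for kind, content, time in reversed(sentences):
        if kind == "Percept" and time == t:
            return "Grab" if "Glitter" in content else "Forward"
    return "Forward"


agent = KBAgent(KnowledgeBase([], toy_answer))
first = agent([None, None, None, None, None])
second = agent([None, None, "Glitter", None, None])
print(first, second)
```

Keeping the three sentence-building functions separate from `tell` and `ask` mirrors the division in the text: the former hide the representation language, the latter hide the inference mechanism.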
The agent in Figure 7.1 appears quite similar to the agents with internal state described in Chapter 2. Because of the definitions of TELL and ASK, however, the knowledge-based agent is not an arbitrary program for calculating actions. It is amenable to a description at the knowledge level, where we need specify only what the agent knows and what its goals are, in order to determine its behavior.
For example, an automated taxi might have the goal of taking a passenger from San
Francisco to Marin County and might know that the Golden Gate Bridge is the only link
between the two locations. Then we can expect it to cross the Golden Gate Bridge because it
knows that that will achieve its goal. Notice that this analysis is independent of how the taxi
works at the implementation level. It doesn’t matter whether its geographical knowledge is
implemented as linked lists or pixel maps, or whether it reasons by manipulating strings of symbols stored in registers or by propagating noisy signals in a network of neurons.
A knowledge-based agent can be built simply by TELLing it what it needs to know. Starting with an empty knowledge base, the agent designer can TELL sentences one by one until
the agent knows how to operate in its environment. This is called the declarative approach to system building. In contrast, the procedural approach encodes desired behaviors directly as program code. In the 1970s and 1980s, advocates of the two approaches engaged in heated debates.
We now understand that a successful agent often combines both declarative and
procedural elements in its design, and that declarative knowledge can often be compiled into more efficient procedural code.
We can also provide a knowledge-based agent with mechanisms that allow it to learn for
itself. These mechanisms, which are discussed in Chapter 19, create general knowledge about
the environment from a series of percepts. A learning agent can be fully autonomous.
7.2 The Wumpus World
In this section we describe an environment in which knowledge-based agents can show their
worth. The wumpus world is a cave consisting of rooms connected by passageways. Lurking
somewhere in the cave is the terrible wumpus, a beast that eats anyone who enters its room.
The wumpus can be shot by an agent, but the agent has only one arrow. Some rooms contain bottomless pits that will trap anyone who wanders into these rooms (except for the wumpus, which
is too big to fall in).
The only redeeming feature of this bleak environment is the possibility of finding a heap of gold. Although the wumpus world is rather tame by modern computer game standards, it illustrates some important points about intelligence.

A sample wumpus world is shown in Figure 7.2. The precise definition of the task environment is given, as suggested in Section 2.3, by the PEAS description:

• Performance measure: +1000 for climbing out of the cave with the gold, −1000 for falling into a pit or being eaten by the wumpus, −1 for each action taken, and −10 for using up the arrow. The game ends either when the agent dies or when the agent climbs out of the cave.
Figure 7.2 A typical wumpus world. The agent is in the bottom left corner, facing east (rightward).

• Environment: A 4 × 4 grid of rooms, with walls surrounding the grid. The agent always starts in the square labeled [1,1], facing to the east. The locations of the gold and the wumpus are chosen randomly, with a uniform distribution, from the squares other than the start square. In addition, each square other than the start can be a pit, with probability 0.2.

• Actuators: The agent can move Forward, TurnLeft by 90°, or TurnRight by 90°. The agent dies a miserable death if it enters a square containing a pit or a live wumpus. (It is safe, albeit smelly, to enter a square with a dead wumpus.) If an agent tries to move forward and bumps into a wall, then the agent does not move. The action Grab can be used to pick up the gold if it is in the same square as the agent. The action Shoot can be used to fire an arrow in a straight line in the direction the agent is facing. The arrow continues until it either hits (and hence kills) the wumpus or hits a wall. The agent has only one arrow, so only the first Shoot action has any effect. Finally, the action Climb can be used to climb out of the cave, but only from square [1,1].

• Sensors: The agent has five sensors, each of which gives a single bit of information:
  - In the squares directly (not diagonally) adjacent to the wumpus, the agent will perceive a Stench.¹
  - In the squares directly adjacent to a pit, the agent will perceive a Breeze.
  - In the square where the gold is, the agent will perceive a Glitter.
  - When an agent walks into a wall, it will perceive a Bump.
  - When the wumpus is killed, it emits a woeful Scream that can be perceived anywhere in the cave.

¹ Presumably the square containing the wumpus also has a stench, but any agent entering that square is eaten before being able to perceive anything.

The percepts will be given to the agent program in the form of a list of five symbols; for example, if there is a stench and a breeze, but no glitter, bump, or scream, the agent program will get [Stench, Breeze, None, None, None].
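The sensor rules can be made concrete with a short Python sketch that builds the five-element percept list for a given square. The function names, the coordinate tuples, and the particular placement of the wumpus, gold, and pits are assumptions chosen for illustration; bumps and screams depend on the action just taken, so they are passed in as flags.

```python
def adjacent(a, b):
    """True when squares a and b share an edge (not diagonal)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1


def percept(agent, wumpus, pits, gold, bumped=False, screamed=False):
    """Return [Stench, Breeze, Glitter, Bump, Scream], with None for absent signals."""
    return [
        "Stench" if adjacent(agent, wumpus) else None,
        "Breeze" if any(adjacent(agent, p) for p in pits) else None,
        "Glitter" if agent == gold else None,
        "Bump" if bumped else None,
        "Scream" if screamed else None,
    ]


# Hypothetical placement used only for this example.
WUMPUS = (1, 3)
GOLD = (2, 3)
PITS = {(3, 1), (3, 3)}

for square in [(1, 1), (2, 1), (1, 2), (2, 3)]:
    print(square, percept(square, WUMPUS, PITS, GOLD))
```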
Figure 7.3 The first step taken by the agent in the wumpus world. (a) The initial situation, after percept [None, None, None, None, None]. (b) After moving to [2,1] and perceiving [None, Breeze, None, None, None].

We can characterize the wumpus environment along the various dimensions given in Chapter 2. Clearly, it is deterministic, discrete, static, and single-agent.
(The wumpus doesn’t move, fortunately.) It is sequential, because rewards may come only after many actions are taken. It is partially observable, because some aspects of the state are not directly perceivable: the agent’s location, the wumpus’s state of health, and the availability of an arrow. As for the
locations of the pits and the wumpus: we could treat them as unobserved parts of the state— in which case, the transition model for the environment is completely known, and finding the
locations of pits completes the agent’s knowledge of the state. Alternatively, we could say that the transition model itself is unknown because the agent doesn’t know which Forward
actions are fatal—in which case, discovering the locations of pits and wumpus completes the agent’s knowledge of the transition model.
For an agent in the environment, the main challenge is its initial ignorance of the configuration of the environment; overcoming this ignorance seems to require logical reasoning. In most instances of the wumpus world, it is possible for the agent to retrieve the gold safely.
Occasionally, the agent must choose between going home empty-handed and risking death to
find the gold. About 21% of the environments are utterly unfair, because the gold is in a pit or surrounded by pits. Let us watch a knowledge-based wumpus agent exploring the environment shown in Figure 7.2. We use an informal knowledge representation language consisting of writing down symbols in a grid (as in Figures 7.3 and 7.4). The agent’s initial knowledge base contains the rules of the environment, as described
previously; in particular, it knows that it is in [1,1] and that [1,1] is a safe square; we denote that with an “A” and “OK,” respectively, in square [1,1]. The first percept is [None, None, None, None, None], from which the agent can conclude that its neighboring squares, [1,2] and [2,1], are free of dangers—they are OK. Figure 7.3(a) shows the agent’s state of knowledge at this point.
Figure 7.4 Two later stages in the progress of the agent. (a) After moving to [1,1] and then [1,2], and perceiving [Stench, None, None, None, None]. (b) After moving to [2,2] and then [2,3], and perceiving [Stench, Breeze, Glitter, None, None].

A cautious agent will move only into a square that it knows to be OK. Let us suppose
the agent decides to move forward to [2,1]. The agent perceives a breeze (denoted by “B”) in [2,1], so there must be a pit in a neighboring square. The pit cannot be in [1,1], by the rules of the game, so there must be a pit in [2,2] or [3,1] or both. The notation “P?” in Figure 7.3(b) indicates a possible pit in those squares. At this point, there is only one known square that is OK and that has not yet been visited. So the prudent agent will turn around, go back to [1,1], and then proceed to [1,2].

The agent perceives a stench in [1,2], resulting in the state of knowledge shown in Figure 7.4(a). The stench in [1,2] means that there must be a wumpus nearby. But the wumpus cannot be in [1,1], by the rules of the game, and it cannot be in [2,2] (or the agent would have detected a stench when it was in [2,1]). Therefore, the agent can infer that the wumpus is in [1,3]. The notation “W!” indicates this inference. Moreover, the lack of a breeze in [1,2] implies that there is no pit in [2,2]. Yet the agent has already inferred that there must be a pit in either [2,2] or [3,1], so this means it must be in [3,1]. This is a fairly difficult inference, because it combines knowledge gained at different times in different places and relies on the
lack of a percept to make one crucial step.
The agent has now proved to itself that there is neither a pit nor a wumpus in [2,2], so it
is OK to move there. We do not show the agent’s state of knowledge at [2,2]; we just assume
that the agent turns and moves to [2,3], giving us Figure 7.4(b). In [2,3], the agent detects a glitter, so it should grab the gold and then return home.
Note that in each case for which the agent draws a conclusion from the available information, that conclusion is guaranteed to be correct if the available information is correct.
This is a fundamental property of logical reasoning. In the rest of this chapter, we describe
how to build logical agents that can represent information and draw conclusions such as those
described in the preceding paragraphs.

7.3 Logic
This section summarizes the fundamental concepts of logical representation and reasoning. These beautiful ideas are independent of any of logic’s particular forms. We therefore postpone the technical details of those forms until the next section, using instead the familiar example of ordinary arithmetic.

In Section 7.1, we said that knowledge bases consist of sentences. These sentences are expressed according to the syntax of the representation language, which specifies all the sentences that are well formed. The notion of syntax is clear enough in ordinary arithmetic: “$x + y = 4$” is a well-formed sentence, whereas “$x4y+ =$” is not.
A logic must also define the semantics, or meaning, of sentences. The semantics defines the truth of each sentence with respect to each possible world. For example, the semantics for arithmetic specifies that the sentence “$x + y = 4$” is true in a world where $x$ is 2 and $y$ is 2, but false in a world where $x$ is 1 and $y$ is 1. In standard logics, every sentence must be either true or false in each possible world; there is no “in between.”²

When we need to be precise, we use the term model in place of “possible world.” Whereas possible worlds might be thought of as (potentially) real environments that the agent might or might not be in, models are mathematical abstractions, each of which has a fixed truth value (true or false) for every relevant sentence. Informally, we may think of a possible world as, for example, having $x$ men and $y$ women sitting at a table playing bridge, and the sentence $x + y = 4$ is true when there are four people in total. Formally, the possible models are just all possible assignments of nonnegative integers to the variables $x$ and $y$. Each such assignment determines the truth of any sentence of arithmetic whose variables are $x$ and $y$. If a sentence $\alpha$ is true in model $m$, we say that $m$ satisfies $\alpha$ or sometimes $m$ is a model of $\alpha$. We use the notation $M(\alpha)$ to mean the set of all models of $\alpha$.
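As a small illustration of $M(\alpha)$, the following Python sketch enumerates assignments of nonnegative integers to $x$ and $y$ and keeps those that satisfy the sentence $x + y = 4$. The helper name `models` and the search bound are assumptions: the full space of assignments is infinite, so the sketch only inspects values up to the bound, which happens to be enough for this particular sentence.

```python
from itertools import product


def models(sentence, bound):
    """Set of (x, y) assignments with 0 <= x, y <= bound that satisfy the sentence."""
    return {(x, y)
            for x, y in product(range(bound + 1), repeat=2)
            if sentence(x, y)}


def alpha(x, y):
    """The sentence x + y = 4, written as a Python predicate."""
    return x + y == 4


M_alpha = models(alpha, bound=10)
print(sorted(M_alpha))  # [(0, 4), (1, 3), (2, 2), (3, 1), (4, 0)]
```

The model with $x = 2$ and $y = 2$ appears in the set, while the one with $x = 1$ and $y = 1$ does not, matching the example in the text.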
Now that we have a notion of truth, we are ready to talk about logical reasoning. This involves the relation of logical entailment between sentences: the idea that a sentence follows logically from another sentence. In mathematical notation, we write

$$\alpha \models \beta$$

to mean that the sentence $\alpha$ entails the sentence $\beta$. The formal definition of entailment is this:

$\alpha \models \beta$ if and only if, in every model in which $\alpha$ is true, $\beta$ is also true.

Using the notation just introduced, we can write

$$\alpha \models \beta \text{ if and only if } M(\alpha) \subseteq M(\beta).$$

(Note the direction of the $\subseteq$ here: if $\alpha \models \beta$, then $\alpha$ is a stronger assertion than $\beta$: it rules out more possible worlds.) The relation of entailment is familiar from arithmetic; we are happy with the idea that the sentence $x = 0$ entails the sentence $xy = 0$. Obviously, in any model where $x$ is zero, it is the case that $xy$ is zero (regardless of the value of $y$).

We can apply the same kind of analysis to the wumpus-world reasoning example given in the preceding section.
Consider the situation in Figure 7.3(b): the agent has detected nothing in [1,1] and a breeze in [2,1]. These percepts, combined with the agent’s knowledge of the rules of the wumpus world, constitute the KB. The agent is interested in whether the adjacent squares [1,2], [2,2], and [3,1] contain pits. Each of the three squares might or might not contain a pit, so (ignoring other aspects of the world for now) there are $2^3 = 8$ possible models. These eight models are shown in Figure 7.5.³

Figure 7.5 Possible models for the presence of pits in squares [1,2], [2,2], and [3,1]. The KB corresponding to the observations of nothing in [1,1] and a breeze in [2,1] is shown by the solid line. (a) Dotted line shows models of $\alpha_1$ (no pit in [1,2]). (b) Dotted line shows models of $\alpha_2$ (no pit in [2,2]).

² Fuzzy logic, discussed in Chapter 13, allows for degrees of truth.

The KB can be thought of as a set of sentences or as a single sentence that asserts all the individual sentences. The KB is false in models that contradict what the agent knows—
for example, the KB is false in any model in which [1,2] contains a pit, because there is no breeze in [1,1]. There are in fact just three models in which the KB is true, and these are shown surrounded by a solid line in Figure 7.5. Now let us consider two possible conclusions:

$\alpha_1$ = “There is no pit in [1,2].”
$\alpha_2$ = “There is no pit in [2,2].”

We have surrounded the models of $\alpha_1$ and $\alpha_2$ with dotted lines in Figures 7.5(a) and 7.5(b), respectively. By inspection, we see the following:

in every model in which KB is true, $\alpha_1$ is also true.

Hence, $\mathit{KB} \models \alpha_1$: there is no pit in [1,2]. We can also see that

in some models in which KB is true, $\alpha_2$ is false.

Hence, KB does not entail $\alpha_2$: the agent cannot conclude that there is no pit in [2,2]. (Nor can it conclude that there is a pit in [2,2].)⁴
³ Although the figure shows the models as partial wumpus worlds, they are really nothing more than assignments of true and false to the sentences “there is a pit in [1,2]” etc. Models, in the mathematical sense, do not need to have ’orrible ’airy wumpuses in them.
⁴ The agent can calculate the probability that there is a pit in [2,2]; Chapter 12 shows how.

The preceding example not only illustrates entailment but also shows how the definition of entailment can be applied to derive conclusions, that is, to carry out logical inference. The inference algorithm illustrated in Figure 7.5 is called model checking, because it enumerates all possible models to check that $\alpha$ is true in all models in which KB is true, that is, that $M(\mathit{KB}) \subseteq M(\alpha)$.
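Model checking for this three-square example is small enough to write out directly. The Python sketch below is a hedged illustration rather than a general algorithm: the symbol names `P12`, `P22`, `P31` and the hand-written `kb` predicate (encoding "no breeze in [1,1]" and "breeze in [2,1]" in terms of the three unknown squares) are assumptions made for the example.

```python
from itertools import product

SYMBOLS = ("P12", "P22", "P31")   # pit in [1,2], [2,2], [3,1]


def all_models():
    """Yield every assignment of True/False to the three symbols (2**3 = 8)."""
    for values in product([False, True], repeat=len(SYMBOLS)):
        yield dict(zip(SYMBOLS, values))


def kb(m):
    no_breeze_11 = not m["P12"]        # nothing perceived in [1,1]
    breeze_21 = m["P22"] or m["P31"]   # breeze in [2,1]; [1,1] is known to be safe
    return no_breeze_11 and breeze_21


def alpha1(m):
    return not m["P12"]                # "There is no pit in [1,2]."


def alpha2(m):
    return not m["P22"]                # "There is no pit in [2,2]."


def entails(kb, alpha):
    """KB entails alpha iff alpha holds in every model in which KB holds."""
    return all(alpha(m) for m in all_models() if kb(m))


kb_models = [m for m in all_models() if kb(m)]
print(len(kb_models))        # 3
print(entails(kb, alpha1))   # True
print(entails(kb, alpha2))   # False
```

The `entails` function is a direct transcription of the subset test $M(\mathit{KB}) \subseteq M(\alpha)$; it is feasible here only because the space of models is finite and tiny.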
Figure 7.6 Sentences are physical configurations of the agent, and reasoning is a process of constructing new physical configurations from old ones. Logical reasoning should ensure that the new configurations represent aspects of the world that actually follow from the aspects that the old configurations represent.
In understanding entailment and inference, it might help to think of the set of all consequences of KB as a haystack and of $\alpha$ as a needle. Entailment is like the needle being in the haystack; inference is like finding it. This distinction is embodied in some formal notation: if an inference algorithm $i$ can derive $\alpha$ from KB, we write

$$\mathit{KB} \vdash_i \alpha,$$

which is pronounced “$\alpha$ is derived from KB by $i$” or “$i$ derives $\alpha$ from KB.”

An inference algorithm that derives only entailed sentences is called sound or truth-preserving. Soundness is a highly desirable property. An unsound inference procedure essentially makes things up as it goes along: it announces the discovery of nonexistent needles. It is easy to see that model checking, when it is applicable,⁵ is a sound procedure.
The property of completeness is also desirable: an inference algorithm is complete if it can derive any sentence that is entailed. For real haystacks, which are finite in extent, it seems obvious that a systematic examination can always decide whether the needle is in the haystack. For many knowledge bases, however, the haystack of consequences is infinite, and completeness becomes an important issue.⁶ Fortunately, there are complete inference procedures for logics that are sufficiently expressive to handle many knowledge bases.

We have described a reasoning process whose conclusions are guaranteed to be true in any world in which the premises are true; in particular, if KB is true in the real world, then any sentence $\alpha$ derived from KB by a sound inference procedure is also true in the real world. So, while an inference process operates on “syntax” (internal physical configurations such as bits in registers or patterns of electrical blips in brains), the process corresponds to the real-world relationship whereby some aspect of the real world is the case by virtue of other aspects of the real world being the case.⁷ This correspondence between world and representation is illustrated in Figure 7.6.
⁵ Model checking works if the space of models is finite, for example, in wumpus worlds of fixed size. For arithmetic, on the other hand, the space of models is infinite: even if we restrict ourselves to the integers, there are infinitely many pairs of values for $x$ and $y$ in the sentence $x + y = 4$.
⁶ Compare with the case of infinite search spaces in Chapter 3, where depth-first search is not complete.
⁷ As Wittgenstein (1922) put it in his famous Tractatus: “The world is everything that is the case.”

The final issue to consider is grounding: the connection between logical reasoning processes and the real environment in which the agent exists. In particular, how do we know that KB is true in the real world? (After all, KB is just “syntax” inside the agent’s head.) This is a
philosophical question about which many, many books have been written. (See Chapter 27.) A simple answer is that the agent’s sensors create the connection. For example, our wumpus-world agent has a smell sensor. The agent program creates a suitable sentence whenever there is a smell. Then, whenever that sentence is in the knowledge base, it is true in the real world.
Thus, the meaning and truth of percept sentences are defined by the processes of sensing and sentence construction that produce them. What about the rest of the agent’s knowledge, such
as its belief that wumpuses cause smells in adjacent squares? This is not a direct representation of a single percept, but a general rule—derived, perhaps, from perceptual experience but not identical to a statement of that experience.
General rules like this are produced by
a sentence construction process called learning, which is the subject of Part V. Learning is
fallible. It could be the case that wumpuses cause smells except on February 29 in leap years, which is when they take their baths.
Thus, KB may not be true in the real world, but with
good learning procedures, there is reason for optimism.

7.4 Propositional Logic: A Very Simple Logic

We now present propositional logic. We describe its syntax (the structure of sentences) and its semantics (the way in which the truth of sentences is determined). From these, we derive a simple, syntactic algorithm for logical inference that implements the semantic notion of entailment. Everything takes place, of course, in the wumpus world.

7.4.1 Syntax
The syntax of propositional logic defines the allowable sentences. The atomic sentences consist of a single proposition symbol. Each such symbol stands for a proposition that can be true or false. We use symbols that start with an uppercase letter and may contain other letters or subscripts, for example: $P$, $Q$, $R$, $W_{1,3}$ and $\mathit{FacingEast}$. The names are arbitrary but are often chosen to have some mnemonic value: we use $W_{1,3}$ to stand for the proposition that the wumpus is in [1,3]. (Remember that symbols such as $W_{1,3}$ are atomic, i.e., $W$, 1, and 3 are not meaningful parts of the symbol.) There are two proposition symbols with fixed meanings: True is the always-true proposition and False is the always-false proposition. Complex sentences are constructed from simpler sentences, using parentheses and operators called logical connectives. There are five connectives in common use:

$\neg$ (not). A sentence such as $\neg W_{1,3}$ is called the negation of $W_{1,3}$. A literal is either an atomic sentence (a positive literal) or a negated atomic sentence (a negative literal).

$\land$ (and). A sentence whose main connective is $\land$, such as $W_{1,3} \land P_{3,1}$, is called a conjunction; its parts are the conjuncts. (The $\land$ looks like an “A” for “And.”)

$\lor$ (or). A sentence whose main connective is $\lor$, such as $(W_{1,3} \land P_{3,1}) \lor W_{2,2}$, is a disjunction; its parts are disjuncts (in this example, $(W_{1,3} \land P_{3,1})$ and $W_{2,2}$).

$\Rightarrow$ (implies). A sentence such as $(W_{1,3} \land P_{3,1}) \Rightarrow \neg W_{2,2}$ is called an implication (or conditional). Its premise or antecedent is $(W_{1,3} \land P_{3,1})$, and its conclusion or consequent is $\neg W_{2,2}$. Implications are also known as rules or if-then statements. The implication symbol is sometimes written in other books as $\supset$ or $\rightarrow$.

$\Leftrightarrow$ (if and only if). The sentence $W_{1,3} \Leftrightarrow \neg W_{2,2}$ is a biconditional.

Sentence → AtomicSentence | ComplexSentence
AtomicSentence → True | False | P | Q | R | ...
ComplexSentence → ( Sentence )
                | $\neg$ Sentence
                | Sentence $\land$ Sentence
                | Sentence $\lor$ Sentence
                | Sentence $\Rightarrow$ Sentence
                | Sentence $\Leftrightarrow$ Sentence

=8 possible models—exactly those depicted in Figure 7.5. Notice, however, that the models are purely mathematical objects with no
necessary connection to wumpus worlds. $P_{1,2}$ is just a symbol; it might mean “there is a pit in [1,2]” or “I’m in Paris today and tomorrow.”
The semantics for propositional logic must specify how to compute the truth value of any
sentence, given a model. This is done recursively. All sentences are constructed from atomic
sentences and the five connectives; therefore, we need to specify how to compute the truth
of atomic sentences and how to compute the truth of sentences formed with each of the five
connectives. Atomic sentences are easy:
• True is true in every model and False is false in every model.
• The truth value of every other proposition symbol must be specified directly in the model. For example, in the model $m_1$ given earlier, $P_{1,2}$ is false.

Figure 7.8 Truth tables for the five logical connectives. To use the table to compute, for example, the value of $P \lor Q$ when $P$ is true and $Q$ is false, first look on the left for the row where $P$ is true and $Q$ is false (the third row). Then look in that row under the $P \lor Q$ column to see the result: true.

For complex sentences, we have five rules, which hold for any subsentences $P$ and $Q$ (atomic or complex) in any model $m$ (here “iff” means “if and only if”):

• $\neg P$ is true iff $P$ is false in $m$.
• $P \land Q$ is true iff both $P$ and $Q$ are true in $m$.
• $P \lor Q$ is true iff either $P$ or $Q$ is true in $m$.
• $P \Rightarrow Q$ is true unless $P$ is true and $Q$ is false in $m$.
• $P \Leftrightarrow Q$ is true iff $P$ and $Q$ are both true or both false in $m$.

The rules can also be expressed with truth tables that specify the truth value of a complex
sentence for each possible assignment of truth values to its components. Truth tables for the
five connectives are given in Figure 7.8. From these tables, the truth value of any sentence s can be computed with respect to any model m by a simple recursive evaluation.
For example, the sentence $\neg P_{1,2} \land (P_{2,2} \lor P_{3,1})$, evaluated in $m_1$, gives $\mathit{true} \land (\mathit{false} \lor \mathit{true}) = \mathit{true} \land \mathit{true} = \mathit{true}$. Exercise 7.TRUV asks you to write the algorithm PL-TRUE?($s$, $m$), which computes the truth value of a propositional logic sentence $s$ in a model $m$.
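One possible shape for such a recursive evaluator is sketched below in Python. It is offered as an illustration of the five rules, not as the reference solution to the exercise. The nested-tuple encoding of sentences, the operator names, and the symbol names `P12`, `P22`, `P31` are assumptions; the model `m1` is chosen to agree with the evaluation shown above.

```python
def pl_true(s, m):
    """Truth value of sentence s in model m (a dict from symbol names to bools).

    Assumed encoding: True/False, a symbol name (str), or a tuple
    ("not", s1), ("and", s1, s2), ("or", s1, s2),
    ("implies", s1, s2), ("iff", s1, s2).
    """
    if s is True or s is False:
        return s                          # True is true and False is false in every model
    if isinstance(s, str):
        return m[s]                       # other symbols are looked up in the model
    op, *args = s
    if op == "not":
        return not pl_true(args[0], m)
    p, q = (pl_true(a, m) for a in args)  # every remaining connective is binary
    if op == "and":
        return p and q
    if op == "or":
        return p or q
    if op == "implies":
        return (not p) or q               # false only when p is true and q is false
    if op == "iff":
        return p == q
    raise ValueError(f"unknown connective: {op}")


m1 = {"P12": False, "P22": False, "P31": True}
sentence = ("and", ("not", "P12"), ("or", "P22", "P31"))
print(pl_true(sentence, m1))  # True
```

Because the function recurses on subsentences, its structure follows the grammar of sentences one production at a time.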
The truth tables for “and,” “or,” and “not” are in close accord with our intuitions about the English words. The main point of possible confusion is that $P \lor Q$ is true when $P$ is true or $Q$ is true or both. A different connective, called “exclusive or” (“xor” for short), yields false when both disjuncts are true.⁸ There is no consensus on the symbol for exclusive or; some choices are $\dot{\vee}$ or $\neq$ or $\oplus$.

The truth table for $\Rightarrow$ may not quite fit one’s intuitive understanding of “$P$ implies $Q$” or “if $P$ then $Q$.” For one thing, propositional logic does not require any relation of causation or relevance between $P$ and $Q$. The sentence “5 is odd implies Tokyo is the capital of Japan” is a true sentence of propositional logic (under the normal interpretation), even though it is a decidedly odd sentence of English. Another point of confusion is that any implication is true whenever its antecedent is false. For example, “5 is even implies Sam is smart” is true, regardless of whether Sam is smart. This seems bizarre, but it makes sense if you think of “$P \Rightarrow Q$” as saying, “If $P$ is true, then I am claiming that $Q$ is true; otherwise I am making no claim.” The only way for this sentence to be false is if $P$ is true but $Q$ is false.

The biconditional, $P \Leftrightarrow Q$, is true whenever both $P \Rightarrow Q$ and $Q \Rightarrow P$ are true. In English, this is often written as “$P$ if and only if $Q$.” Many of the rules of the wumpus world are best
written using $\Leftrightarrow$.

⁸ Latin uses two separate words: “vel” is inclusive or and “aut” is exclusive or.

For this reason, we have used a special notation—the double-boxed link—in Figure 10.4.
This link asserts that
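$$\forall x \;\; x \in \mathit{Persons} \Rightarrow [\forall y \;\; \mathit{HasMother}(x, y) \Rightarrow y \in \mathit{FemalePersons}].$$

We might also want to assert that persons have two legs, that is,

$$\forall x \;\; x \in \mathit{Persons} \Rightarrow \mathit{Legs}(x, 2).$$
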
As before, we need to be careful not to assert that a category has legs; the single-boxed link
in Figure 10.4 is used to assert properties of every member of a category.
⁵ Several early systems failed to distinguish between properties of members of a category and properties of the category as a whole. This can lead directly to inconsistencies, as pointed out by Drew McDermott (1976) in his article “Artificial Intelligence Meets Natural Stupidity.” Another common problem was the use of IsA links for both subset and membership relations, in correspondence with English usage: “a cat is a mammal” and “Fifi is a cat.” See Exercise 10.NATS for more on these issues.

Figure 10.4 A semantic network with four objects (John, Mary, 1, and 2) and four categories. Relations are denoted by labeled links.

Figure 10.5 A fragment of a semantic network showing the representation of the logical assertion Fly(Shankar, NewYork, NewDelhi, Yesterday).

The semantic network notation makes it convenient to perform inheritance reasoning of the kind introduced in Section 10.2. For example, by virtue of being a person, Mary inherits the property of having two legs. Thus, to find out how many legs Mary has, the inheritance algorithm follows the MemberOf link from Mary to the category she belongs to, and then follows SubsetOf links up the hierarchy until it finds a category for which there is a boxed Legs link, in this case the Persons category. The simplicity and efficiency of this inference mechanism, compared with semidecidable logical theorem proving, has been one of the main attractions of semantic networks.
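The link-following procedure just described can be sketched in a few lines of Python. The dictionaries below are a hypothetical encoding of a network in the spirit of Figure 10.4; the category names other than Persons and FemalePersons, and the function name `inherited_value`, are assumptions, and each object or category is given a single parent so that multiple inheritance does not arise.

```python
# Hypothetical encoding of the links; one parent per node.
MEMBER_OF = {"Mary": "FemalePersons", "John": "MalePersons"}
SUBSET_OF = {"FemalePersons": "Persons",
             "MalePersons": "Persons",
             "Persons": "Mammals"}
# Boxed links: properties asserted of every member of a category.
BOXED = {("Persons", "Legs"): 2}


def inherited_value(obj, relation):
    """Follow MemberOf once, then SubsetOf upward, until a boxed link is found."""
    category = MEMBER_OF.get(obj)
    while category is not None:
        if (category, relation) in BOXED:
            return BOXED[(category, relation)]
        category = SUBSET_OF.get(category)   # climb one level in the hierarchy
    return None                              # no category supplies a value


print(inherited_value("Mary", "Legs"))  # 2
```

The loop visits at most one category per level of the hierarchy, which is the source of the efficiency noted above.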
Inheritance becomes complicated when an object can belong to more than one category
or when a category can be a subset of more than one other category; this is called multiple inheritance. In such cases, the inheritance algorithm might find two or more conflicting values
answering the query. For this reason, multiple inheritance is banned in some object-oriented
programming (OOP) languages, such as Java, that use inheritance in a class hierarchy. It is usually allowed in semantic networks, but we defer discussion of that until Section 10.6.
The reader might have noticed an obvious drawback of semantic network notation, compared to first-order logic: the fact that links between bubbles represent only binary relations.
For example, the sentence Fly(Shankar,NewYork, NewDelhi, Yesterday) cannot be asserted
directly in a semantic network.
Nonetheless, we can obtain the effect of n-ary assertions
by reifying the proposition itself as an event belonging to an appropriate event category.
Figure 10.5 shows the semantic network structure for this particular event.
Notice that the
restriction to binary relations forces the creation of a rich ontology of reified concepts.
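Reification can be illustrated with a short Python sketch that turns one four-place assertion into a set of binary links hanging off a new event object. The event name `Fly17`, the category name `FlyEvents`, and the role names `Origin` and `Destination` are assumptions chosen for the example; only the arguments of the Fly assertion come from the text.

```python
def reify(event_id, category, roles):
    """Return (subject, relation, object) triples describing one reified event."""
    triples = [(event_id, "MemberOf", category)]
    triples += [(event_id, role, filler) for role, filler in roles.items()]
    return triples


# Fly(Shankar, NewYork, NewDelhi, Yesterday) as binary relations on one event.
triples = reify("Fly17", "FlyEvents", {
    "Agent": "Shankar",
    "Origin": "NewYork",
    "Destination": "NewDelhi",
    "During": "Yesterday",
})

for triple in triples:
    print(triple)
```

Every triple has exactly two arguments besides the relation name, so each one can be drawn as a single labeled link.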
Reification of propositions makes it possible to represent every ground, function-free
atomic sentence of first-order logic in the semantic network notation. Certain kinds of universally quantified sentences can be asserted using inverse links and the singly boxed and doubly boxed arrows applied to categories, but that still leaves us a long way short of full first-order logic. Negation, disjunction, nested function symbols, and existential quantification are all missing. Now it is possible to extend the notation to make it equivalent to first-order logic, as in Peirce’s existential graphs, but doing so negates one of the main advantages of semantic networks, which is the simplicity and transparency of the inference processes. Designers can build a large network and still have a good idea about what queries will be efficient, because (a) it is easy to visualize the steps that the inference procedure will go through and (b) in some cases the query language is so simple that difficult queries cannot be posed.
In cases where the expressive power proves to be too limiting, many semantic network
systems provide for procedural attachment to fill in the gaps.
Procedural attachment is a
technique whereby a query about (or sometimes an assertion of) a certain relation results in a
call to a special procedure designed for that relation rather than a general inference algorithm.
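A small sketch of the idea, under the assumption that the network is a set of triples and that queries go through a single `ask` function. The relation names, the `attach` decorator, and the sample facts are hypothetical and chosen only for illustration.

```python
# Hypothetical sketch: a relation with an attached procedure bypasses the generic lookup.
facts = {("John", "Age", 40), ("Mary", "Age", 35)}
attached = {}

def attach(relation):
    """Register a special-purpose procedure for one relation."""
    def register(proc):
        attached[relation] = proc
        return proc
    return register

@attach("OlderThan")
def older_than(x, y):
    # Computed from the stored ages rather than stored as explicit links.
    age = {s: v for (s, r, v) in facts if r == "Age"}
    return age[x] > age[y]

def ask(relation, x, y):
    """Answer a query; call the attached procedure when the relation has one."""
    if relation in attached:
        return attached[relation](x, y)
    return (x, relation, y) in facts
```

Here `ask("OlderThan", "John", "Mary")` is answered by arithmetic on the stored ages, while `ask("Age", "John", 40)` falls through to the ordinary link lookup.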
One of the most important aspects of semantic networks is their ability to represent default values for categories. Examining Figure 10.4 carefully, one notices that John has one leg, despite the fact that he is a person and all persons have two legs. In a strictly logical KB, this would be a contradiction, but in a semantic network, the assertion that all persons have
two legs has only default status; that is, a person is assumed to have two legs unless this is contradicted by more specific information. The default semantics is enforced naturally by the inheritance algorithm, because it follows links upwards from the object itself (John in this
case) and stops as soon as it finds a value. We say that the default is overridden by the more
specific value. Notice that we could also override the default number of legs by creating a category of OneLeggedPersons, a subset of Persons of which John is a member. We can retain a strictly logical semantics for the network if we say that the Legs assertion for Persons includes an exception for John:
∀x x ∈ Persons ∧ x ≠ John ⇒ Legs(x, 2).
For a fixed network, this is semantically adequate but will be much less concise than the network notation itself if there are lots of exceptions. For a network that will be updated with
more assertions, however, such an approach fails—we really want to say that any persons as
yet unknown with one leg are exceptions too. Section 10.6 goes into more depth on this issue and on default reasoning in general.
10.5.2 Description logics
The syntax of first-order logic is designed to make it easy to say things about objects. Description logics are notations that are designed to make it easier to describe definitions and properties of categories. Description logic systems evolved from semantic networks in response to pressure to formalize what the networks mean while retaining the emphasis on taxonomic structure as an organizing principle.
The principal inference tasks for description logics are subsumption (checking if one category is a subset of another by comparing their definitions) and classification (checking whether an object belongs to a category). Some systems also include consistency of a category definition: whether the membership criteria are logically satisfiable.
Concept → Thing | ConceptName
    | And(Concept, ...)
    | All(RoleName, Concept)
    | AtLeast(Integer, RoleName)
    | AtMost(Integer, RoleName)
    | Fills(RoleName, IndividualName, ...)
    | SameAs(Path, Path)
    | OneOf(IndividualName, ...)
Path → [RoleName, ...]
ConceptName → Adult | Female | Male | ...
RoleName → Spouse | Daughter | Son | ...
Figure 10.6 The syntax of descriptions in a subset of the CLASSIC language.
The CLASSIC language (Borgida et al., 1989) is a typical description logic. The syntax of CLASSIC descriptions is shown in Figure 10.6.⁶ For example, to say that bachelors are unmarried adult males we would write
Bachelor = And(Unmarried, Adult, Male).
The equivalent in first-order logic would be
Bachelor(x) ⇔ Unmarried(x) ∧ Adult(x) ∧ Male(x).
Notice that the description logic has an algebra of operations on predicates, which of course
we can’t do in first-order logic. Any description in CLASSIC can be translated into an equiv-
alent first-order sentence, but some descriptions are more straightforward in CLASSIC.
For
example, to describe the set of men with at least three sons who are all unemployed and married to doctors, and at most two daughters who are all professors in physics or math
departments, we would use
And(Man, AtLeast(3, Son), AtMost(2, Daughter), All(Son, And(Unemployed, Married, All(Spouse, Doctor))), All(Daughter, And(Professor, Fills(Department, Physics, Math)))).
We leave it as an exercise to translate this into first-order logic.
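To make the subsumption task concrete, here is a rough structural check over a fragment of the description syntax (And, All, AtLeast, AtMost, and concept names). It is a sketch under stated assumptions: descriptions are nested Python tuples, `DEFINITIONS` holds named concepts, and the test is sound but not complete (for instance, it does not merge several All restrictions on the same role). It is not CLASSIC's actual algorithm, and all function names are hypothetical.

```python
# Descriptions are concept names (strings) or tuples such as
# ("And", c1, c2, ...), ("All", role, c), ("AtLeast", n, role), ("AtMost", n, role).
DEFINITIONS = {"Bachelor": ("And", "Unmarried", "Adult", "Male")}

def expand(c):
    """Replace defined concept names by their definitions, recursively."""
    if isinstance(c, str):
        return expand(DEFINITIONS[c]) if c in DEFINITIONS else c
    return (c[0],) + tuple(p if isinstance(p, int) else expand(p) for p in c[1:])

def conjuncts(c):
    """Flatten nested And(...) into a set of non-And conjuncts."""
    if isinstance(c, tuple) and c[0] == "And":
        out = set()
        for part in c[1:]:
            out |= conjuncts(part)
        return out
    return {c}

def implied(g, spec):
    """Is the single conjunct g guaranteed by the conjuncts in spec?"""
    if g == "Thing" or g in spec:
        return True
    if isinstance(g, str):
        return False
    op = g[0]
    for s in spec:
        if isinstance(s, str) or s[0] != op:
            continue
        if op == "All" and s[1] == g[1] and subsumes(g[2], s[2]):
            return True
        if op == "AtLeast" and s[2] == g[2] and s[1] >= g[1]:
            return True
        if op == "AtMost" and s[2] == g[2] and s[1] <= g[1]:
            return True
    return False

def subsumes(general, specific):
    """True if every instance of `specific` must also be an instance of `general`."""
    spec = conjuncts(expand(specific))
    return all(implied(g, spec) for g in conjuncts(expand(general)))
```

With these definitions, `subsumes("Adult", "Bachelor")` holds because Adult is one of the conjuncts in the definition of Bachelor, while `subsumes("Bachelor", "Adult")` does not. Each check walks the two descriptions once per conjunct pair, which gives a feel for why restricting the language keeps subsumption cheap.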
Perhaps the most important aspect of description logics is their emphasis on tractability of inference. A problem instance is solved by describing it and then asking if it is subsumed by one of several possible solution categories. In standard first-order logic systems, predicting the solution time is often impossible. It is frequently left to the user to engineer the representation to detour around sets of sentences that seem to be causing the system to take several weeks to solve a problem. The thrust in description logics, on the other hand, is to ensure that subsumption-testing can be solved in time polynomial in the size of the descriptions.⁷
6 Notice that the language does not allow one to simply state that one concept, or category, is a subset of another. This is a deliberate policy: subsumption between categories must be derivable from some aspects of the descriptions of the categories. If not, then something is missing from the descriptions.
This sounds wonderful in principle, until one realizes that it can only have one of two
consequences: either hard problems cannot be stated at all, or they require exponentially
large descriptions! However, the tractability results do shed light on what sorts of constructs
cause problems and thus help the user to understand how different representations behave. For example, description logics usually lack negation and disjunction. Each forces first-order logical systems to go through a potentially exponential case analysis in order to ensure completeness.
CLASSIC allows only a limited form of disjunction in the Fills and OneOf
constructs, which permit disjunction over explicitly enumerated individuals but not over descriptions. With disjunctive descriptions, nested definitions can lead easily to an exponential
number of alternative routes by which one category can subsume another.
10.6 Reasoning with Default Information
In the preceding section, we saw a simple example of an assertion with default status: people have two legs. This default can be overridden by more specific information, such as that Long John Silver has one leg. We saw that the inheritance mechanism in semantic networks
implements the overriding of defaults in a simple and natural way. In this section, we study
defaults more generally, with a view toward understanding the semantics of defaults rather than just providing a procedural mechanism.
10.6.1 Circumscription and default logic
We have seen two examples of reasoning processes that violate the monotonicity property of logic that was proved in Chapter 7.⁸ In this chapter we saw that a property inherited by all members of a category in a semantic network could be overridden by more specific information for a subcategory. In Section 9.4.4, we saw that under the closed-world assumption, if a proposition α is not mentioned in KB then KB ⊨ ¬α, but KB ∧ α ⊨ α.
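The closed-world case can be illustrated with a toy query function. This is only a sketch: the KB is assumed to be a set of ground atoms written as strings, a negative literal is written as the pair ("not", atom), and the course names are made up for the example.

```python
def cwa_entails(kb, literal):
    """Closed-world query over a KB that is a set of ground atoms.
    A literal is an atom string, or the pair ("not", atom)."""
    if isinstance(literal, tuple) and literal[0] == "not":
        return literal[1] not in kb   # unmentioned atoms are taken to be false
    return literal in kb

kb = {"Course(CS101)"}
alpha = "Course(CS102)"

# alpha is not mentioned, so its negation is concluded ...
before = cwa_entails(kb, ("not", alpha))
# ... but adding alpha to the KB withdraws that conclusion.
after = cwa_entails(kb | {alpha}, ("not", alpha))
```

A conclusion that held for the smaller KB fails for the larger one, which is exactly what monotonicity forbids.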
Simple introspection suggests that these failures of monotonicity are widespread in com-
monsense reasoning. It seems that humans often “jump to conclusions.” For example, when
one sees a car parked on the street, one is normally willing to believe that it has four wheels
even though only three are visible. Now, probability theory can certainly provide a conclusion
that the fourth wheel exists with high probability; yet, for most people, the possibility that the
car does not have four wheels will not arise unless some new evidence presents itself. Thus,
it seems that the four-wheel conclusion is reached by default, in the absence of any reason to
doubt it. If new evidence arrives—for example, if one sees the owner carrying a wheel and notices that the car is jacked up—then the conclusion can be retracted. This kind of reasoning
is said to exhibit nonmonotonicity, because the set of beliefs does not grow monotonically over time as new evidence arrives. Nonmonotonic logics have been devised with modified notions of truth and entailment in order to capture such behavior. We will look at two such
logics that have been studied extensively: circumscription and default logic.
7 CLASSIC provides efficient subsumption testing in practice, but the worst-case run time is exponential.
8 Recall that monotonicity requires all entailed sentences to remain entailed after new sentences are added to the KB. That is, if KB ⊨ α then KB ∧ β ⊨ α.
Circumscription can be seen as a more powerful and precise version of the closed-world assumption. The idea is to specify particular predicates that are assumed to be “as false as possible” (that is, false for every object except those for which they are known to be true).
For example, suppose we want to assert the default rule that birds fly. We would introduce a predicate, say Abnormal₁(x), and write
Bird(x) ∧ ¬Abnormal₁(x) ⇒ Flies(x).
If we say that Abnormal₁ is to be circumscribed, a circumscriptive reasoner is entitled to assume ¬Abnormal₁(x) unless Abnormal₁(x) is known to be true. This allows the conclusion Flies(Tweety) to be drawn from the premise Bird(Tweety), but the conclusion no longer holds if Abnormal₁(Tweety) is asserted.
Circumscription can be viewed as an example of a model preference logic. In such logics, a sentence is entailed (with default status) if it is true in all preferred models of the KB, as opposed to the requirement of truth in all models in classical logic. For circumscription, one model is preferred to another if it has fewer abnormal objects.⁹ Let us see how this idea
works in the context of multiple inheritance in semantic networks. The standard example for
which multiple inheritance is problematic is called the “Nixon diamond.” It arises from the observation that Richard Nixon was both a Quaker (and hence by default a pacifist) and a
Republican (and hence by default not a pacifist). We can write this as follows:
Republican(Nixon) ∧ Quaker(Nixon).
Republican(x) ∧ ¬Abnormal₂(x) ⇒ ¬Pacifist(x).
Quaker(x) ∧ ¬Abnormal₃(x) ⇒ Pacifist(x).
If we circumscribe Abnormal₂ and Abnormal₃, there are two preferred models: one in which Abnormal₂(Nixon) and Pacifist(Nixon) are true and one in which Abnormal₃(Nixon) and ¬Pacifist(Nixon) are true. Thus, the circumscriptive reasoner remains properly agnostic as to whether Nixon was a pacifist. If we wish, in addition, to assert that religious beliefs take precedence over political beliefs, we can use a formalism called prioritized circumscription to give preference to models where Abnormal₃ is minimized.
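The two preferred models of the Nixon diamond can be found by brute force. The sketch below is an illustration under simplifying assumptions: only Nixon is considered, so a model is just a truth assignment to three atoms (the abnormality atom of the Republican rule, written Ab2, the abnormality atom of the Quaker rule, written Ab3, and Pacifist), and "fewer abnormal objects" is read as set inclusion on the abnormality atoms. The function names are hypothetical.

```python
from itertools import product

def models():
    """All assignments (ab2, ab3, pacifist) for Nixon that satisfy both rules."""
    for ab2, ab3, pacifist in product([False, True], repeat=3):
        rule_republican = ab2 or not pacifist   # Republican ∧ ¬Ab2 ⇒ ¬Pacifist
        rule_quaker = ab3 or pacifist           # Quaker ∧ ¬Ab3 ⇒ Pacifist
        if rule_republican and rule_quaker:
            yield (ab2, ab3, pacifist)

def abnormal(m):
    """The set of abnormality atoms that are true in model m."""
    return {name for name, value in zip(("Ab2", "Ab3"), m) if value}

def preferred(ms):
    """Models whose abnormality set has no proper subset among the other models."""
    ms = list(ms)
    return [m for m in ms if not any(abnormal(o) < abnormal(m) for o in ms)]

def prioritized(ms, first="Ab3"):
    """Minimize one abnormality atom first, then the remaining ones."""
    ms = list(ms)
    best = [m for m in ms if first not in abnormal(m)] or ms
    return preferred(best)
```

`preferred(models())` returns two models that disagree on Pacifist, so neither Pacifist nor its negation is entailed by default. `prioritized(models())` keeps only the model in which Ab3 is false, and in that model Nixon is a pacifist.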
Default logic is a formalism in which default rules can be written to generate contingent,
nonmonotonic conclusions. A default rule looks like this:
Bird(x) : Flies(x) / Flies(x).
This rule means that if Bird(x) is true, and if Flies(x) is consistent with the knowledge base, then Flies(x) may be concluded by default. In general, a default rule has the form
P : J₁, ..., Jₙ / C
where P is called the prerequisite, C is the conclusion, and Jᵢ are the justifications; if any one of them can be proven false, then the conclusion cannot be drawn. Any variable that appears in Jᵢ or C must also appear in P. The Nixon-diamond example can be represented in default logic with one fact and two default rules:
Republican(Nixon) ∧ Quaker(Nixon).
Republican(x) : ¬Pacifist(x) / ¬Pacifist(x).
Quaker(x) : Pacifist(x) / Pacifist(x).
9 For the closed-world assumption, one model is preferred to another if it has fewer true atoms; that is, preferred models are minimal models. There is a natural connection between the closed-world assumption and definite-clause KBs, because the fixed point reached by forward chaining on definite-clause KBs is the unique minimal model. See page 231 for more on this.
To interpret what the default rules mean, we define the notion of an extension of a default theory to be a maximal set of consequences of the theory. That is, an extension S consists
of the original known facts and a set of conclusions from the default rules, such that no additional conclusions can be drawn from S, and the justifications of every default conclusion
in S are consistent with S. As in the case of the preferred models in circumscription, we have
two possible extensions for the Nixon diamond: one wherein he is a pacifist and one wherein he is not. Prioritized schemes exist in which some default rules can be given precedence over
others, allowing some ambiguities to be resolved. Since 1980, when nonmonotonic logics were first proposed, a great deal of progress
has been made in understanding their mathematical properties. There are still unresolved questions, however. For example, if “Cars have four wheels” is false, what does it mean to have it in one’s knowledge base? What is a good set of default rules to have? If we cannot
decide, for each rule separately, whether it belongs in our knowledge base, then we have a serious problem of nonmodularity.
Finally, how can beliefs that have default status be used
to make decisions? This is probably the hardest issue for default reasoning.
Decisions often involve tradeoffs, and one therefore needs to compare the strengths of be-
lief in the outcomes of different actions, and the costs of making a wrong decision. In cases
where the same kinds of decisions are being made repeatedly, it is possible to interpret default rules as “threshold probability” statements. For example, the default rule “My brakes are always OK” really means “The probability that my brakes are OK, given no other information, is sufficiently high that the optimal decision is for me to drive without checking them.” When
the decision context changes—for example, when one is driving a heavily laden truck down a steep mountain road—the default rule suddenly becomes inappropriate, even though there is no new evidence of faulty brakes. These considerations have led researchers to consider how
to embed default reasoning within probability theory or utility theory.
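The "threshold probability" reading of a default can be sketched as a small expected-cost comparison. Everything here is an assumption made for illustration: the two actions, the cost figures, and the simplification that a check always finds and fixes a fault at no further cost.

```python
def best_action(p_ok, cost_check, cost_failure):
    """Compare the expected cost of skipping a brake check with the cost of checking.
    Assumes a check always detects a fault and fixes it at no extra cost."""
    expected_cost_skip = (1 - p_ok) * cost_failure
    return "drive" if expected_cost_skip < cost_check else "check"

def threshold(cost_check, cost_failure):
    """P(brakes OK) above which acting on the default 'brakes are OK' is optimal."""
    return 1 - cost_check / cost_failure
```

With the same belief `p_ok = 0.9999`, `best_action(0.9999, 10, 10_000)` gives "drive", while raising the cost of failure to 1,000,000 (the laden truck on the mountain road) gives "check". The evidence about the brakes has not changed; only the threshold has moved.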
10.6.2 Truth maintenance systems
We have seen that many of the inferences drawn by a knowledge representation system will have only default status, rather than being absolutely certain. Inevitably, some of these inferred facts will turn out to be wrong and will have to be retracted in the face of new information. This process is called belief revision.¹⁰ Suppose that a knowledge base KB contains a sentence P (perhaps a default conclusion recorded by a forward-chaining algorithm, or perhaps just an incorrect assertion) and we want to execute TELL(KB, ¬P). To avoid creating a contradiction, we must first execute RETRACT(KB, P).
This sounds easy enough.
Problems arise, however, if any additional sentences were inferred from P and asserted in the KB. For example, the implication P ⇒ Q might have been used to add Q. The obvious “solution” of retracting all sentences inferred from P fails because such sentences may have other justifications besides P. For example, if R and R ⇒ Q are also in the KB, then Q does not have to be removed after all. Truth maintenance systems, or TMSs, are designed to handle exactly these kinds of complications.
10 Belief revision is often contrasted with belief update, which occurs when a knowledge base is revised to reflect a change in the world rather than new information about a fixed world. Belief update combines belief revision with reasoning about time and change; it is also related to the process of filtering described in Chapter 14.
One simple approach to truth maintenance is to keep track of the order in which sentences are told to the knowledge base by numbering them from P₁ to Pₙ. When the call RETRACT(KB, Pᵢ) is made, the system reverts to the state just before Pᵢ was added, thereby removing both Pᵢ and any inferences that were derived from Pᵢ. The sentences Pᵢ₊₁ through Pₙ can then be added again. This is simple, and it guarantees that the knowledge base will be consistent, but retracting Pᵢ requires retracting and reasserting n − i sentences as well as undoing and redoing all the inferences drawn from those sentences. For systems to which many facts are being added, such as large commercial databases, this is impractical.
A more efficient approach is the justification-based truth maintenance system, or JTMS.
In a JTMS, each sentence in the knowledge base is annotated with a justification consisting
of the set of sentences from which it was inferred. For example, if the knowledge base already contains P ⇒ Q, then TELL(P) will cause Q to be added with the justification {P, P ⇒ Q}. In general, a sentence can have any number of justifications. Justifications make retraction efficient. Given the call RETRACT(P), the JTMS will delete exactly those sentences for which P is a member of every justification. So, if a sentence Q had the single justification {P, P ⇒ Q}, it would be removed; if it had the additional justification {P, P ∨ R ⇒ Q}, it would still be removed; but if it also had the justification {R, P ∨ R ⇒ Q}, then it would be spared. In this way, the time required for retraction of P depends only on the number of sentences derived from P rather than on the number of sentences added after P.
The JTMS assumes that sentences that are considered once will probably be considered
again, so rather than deleting a sentence from the knowledge base entirely when it loses
all justifications, we merely mark the sentence as being out of the knowledge base. If a subsequent assertion restores one of the justifications, then we mark the sentence as being back in.
In this way, the JTMS
retains all the inference chains that it uses and need not
rederive sentences when a justification becomes valid again.
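The bookkeeping described above can be sketched with a toy class. This is an illustration under simplifying assumptions, not a real JTMS: sentences are plain strings, justifications are assumed to be acyclic, and the in/out label is computed on demand instead of being stored and propagated as an actual system would do. The class and method names are hypothetical.

```python
class JTMS:
    """Toy justification-based TMS; assumes justifications are acyclic."""

    def __init__(self):
        self.premises = set()       # sentences currently told to the KB
        self.justifications = {}    # sentence -> list of frozensets of supporting sentences

    def tell(self, sentence):
        self.premises.add(sentence)

    def retract(self, sentence):
        # Only the premise is withdrawn; recorded justifications are kept for reuse.
        self.premises.discard(sentence)

    def justify(self, sentence, support):
        """Record that `sentence` was inferred from the sentences in `support`."""
        self.justifications.setdefault(sentence, []).append(frozenset(support))

    def is_in(self, sentence):
        """'In' if it is a premise or some justification consists only of 'in' sentences."""
        if sentence in self.premises:
            return True
        return any(all(self.is_in(s) for s in support)
                   for support in self.justifications.get(sentence, []))
```

Retracting P turns Q "out" when every justification of Q mentions P, and Q comes back "in" without rederivation as soon as some justification is fully supported again, because the justification records are never deleted.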
In addition to handling the retraction of incorrect information, TMSs
can be used to
speed up the analysis of multiple hypothetical situations. Suppose, for example, that the Romanian Olympic Committee is choosing sites for the swimming, athletics, and equestrian
events at the 2048 Games to be held in Romania.
For example, let the first hypothesis be
Site(Swimming, Pitesti), Site(Athletics, Bucharest), and Site(Equestrian,Arad). A great deal of reasoning must then be done to work out the logistical consequences and hence
the desirability of this selection.
If we
want to consider Site(Athletics, Sibiu)
instead, the TMS avoids the need to start again from scratch. Instead, we simply retract Site(Athletics, Bucharest) and assert Site(Athletics, Sibiu) and the TMS takes care of the necessary revisions. Inference chains generated from the choice of Bucharest can be reused with Sibiu, provided that the conclusions are the same.
An assumption-based truth maintenance system, or ATMS,
makes this type of context-
switching between hypothetical worlds particularly efficient. In a JTMS, the maintenance of
justifications allows you to move quickly from one state to another by making a few retrac-
tions and assertions, but at any time only one state is represented. An ATMS represents all the states that have ever been considered at the same time. Whereas a JTMS simply labels each
sentence as being in or out, an ATMS keeps track, for each sentence, of which assumptions
would cause the sentence to be true. In other words, each sentence has a label that consists of
a set of assumption sets. The sentence is true just in those cases in which all the assumptions
in one of the assumption sets are true.
Truth maintenance systems also provide a mechanism for generating explanations. Technically, an explanation of a sentence P is a set of sentences E such that E entails P. If the sentences in E are already known to be true, then E simply provides a sufficient basis for proving that P must be the case. But explanations can also include assumptions: sentences
that are not known to be true, but would suffice to prove P if they were true. For example, if your car won’t start, you probably don’t have enough information to definitively prove the
reason for the problem. But a reasonable explanation might include the assumption that the
battery is dead. This, combined with knowledge of how cars operate, explains the observed nonbehavior.
In most cases, we will prefer an explanation E that is minimal, meaning that
there is no proper subset of E that is also an explanation. An ATMS can generate explanations for the “car won’t start” problem by making assumptions (such as “no gas in car” or “battery
dead”) in any order we like, even if some assumptions are contradictory. Then we look at the
label for the sentence “car won’t start” to read off the sets of assumptions that would justify the sentence.
The exact algorithms used to implement truth maintenance systems are a little complicated, and we do not cover them here. The computational complexity of the truth maintenance problem is at least as great as that of propositional inference (that is, NP-hard).
Therefore,
you should not expect truth maintenance to be a panacea. When used carefully, however, a TMS can provide a substantial increase in the ability of a logical system to handle complex environments and hypotheses.
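Although the real algorithms are beyond the scope of this section, the idea of an ATMS label as a set of assumption sets, and of reading minimal explanations off that label, can be sketched naively. The code below is an assumption-laden illustration: justifications are given as a dictionary from a node to lists of antecedents, they are assumed to be acyclic, contradictory assumption sets are not filtered out, and the node names for the "car won't start" example are invented.

```python
from itertools import product

def minimal(envs):
    """Drop any assumption set that has a proper subset also present."""
    envs = set(envs)
    return {e for e in envs if not any(o < e for o in envs)}

def label(node, assumptions, justifications):
    """Minimal sets of assumptions under which `node` holds (acyclic justifications)."""
    if node in assumptions:
        return {frozenset([node])}
    envs = set()
    for antecedents in justifications.get(node, []):
        parts = [label(a, assumptions, justifications) for a in antecedents]
        # Combine one assumption set per antecedent into a set supporting the node.
        for combo in product(*parts):
            envs.add(frozenset().union(*combo))
    return minimal(envs)

assumptions = {"BatteryDead", "NoGas"}
justifications = {
    "NoSpark": [["BatteryDead"]],
    "WontStart": [["NoSpark"], ["NoGas"], ["BatteryDead", "NoGas"]],
}
explanations = label("WontStart", assumptions, justifications)
```

Here `explanations` contains the two minimal assumption sets, {BatteryDead} and {NoGas}; the combined set is discarded because it has a proper subset that already suffices.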
Summary
By delving into the details of how one represents a variety of knowledge, we hope we have given the reader a sense of how real knowledge bases are constructed and a feeling for the interesting philosophical issues that arise. The major points are as follows:
• Large-scale knowledge representation requires a general-purpose ontology to organize and tie together the various specific domains of knowledge.
• A general-purpose ontology needs to cover a wide variety of knowledge and should be capable, in principle, of handling any domain.
• Building a large, general-purpose ontology is a significant challenge that has yet to be fully realized, although current frameworks seem to be quite robust.
• We presented an upper ontology based on categories and the event calculus. We covered categories, subcategories, parts, structured objects, measurements, substances, events, time and space, change, and beliefs.
• Natural kinds cannot be defined completely in logic, but properties of natural kinds can be represented.
• Actions, events, and time can be represented with the event calculus. Such representations enable an agent to construct sequences of actions and make logical inferences about what will be true when these actions happen.
• Special-purpose representation systems, such as semantic networks and description logics, have been devised to help in organizing a hierarchy of categories. Inheritance is an important form of inference, allowing the properties of objects to be deduced from their membership in categories.
• The closed-world assumption, as implemented in logic programs, provides a simple way to avoid having to specify lots of negative information. It is best interpreted as a default that can be overridden by additional information.
• Nonmonotonic logics, such as circumscription and default logic, are intended to capture default reasoning in general.
• Truth maintenance systems handle knowledge updates and revisions efficiently.
• It is difficult to construct large ontologies by hand; extracting knowledge from text makes the job easier.
Bibliographical and Historical Notes
Briggs (1985) claims that knowledge representation research began with first millennium BCE
Indian theorizing about the grammar of Shastric Sanskrit. Western philosophers trace their work on the subject back to c. 300 BCE in Aristotle’s Metaphysics (literally, what comes after
the book on physics). The development of technical terminology in any field can be regarded as a form of knowledge representation.
Early discussions of representation in AI tended to focus on “problem representation” rather than “knowledge representation.” (See, for example, Amarel’s (1968) discussion of the “Missionaries and Cannibals” problem.) In the 1970s, AI emphasized the development of “expert systems” (also called “knowledge-based systems”) that could, if given the appropriate domain knowledge, match or exceed the performance of human experts on narrowly defined tasks. For example, the first expert system, DENDRAL (Feigenbaum et al., 1971; Lindsay et al., 1980), interpreted the output of a mass spectrometer (a type of instrument used to analyze the structure of organic chemical compounds) as accurately as expert chemists. Although the success of DENDRAL was instrumental in convincing the AI research community of the importance of knowledge representation, the representational formalisms used in DENDRAL
are highly specific to the domain of chemistry.
Over time, researchers became interested in standardized knowledge representation formalisms and ontologies that could assist in the creation of new expert systems. This brought them into territory previously explored by philosophers of science and of language. The discipline imposed in AI by the need for one’s theories to “work” has led to more rapid and deeper progress than when these problems were the exclusive domain of philosophy (although it has at times also led to the repeated reinvention of the wheel). But to what extent can we trust expert knowledge?
As far back as 1955, Paul Meehl
(see also Grove and Meehl, 1996) studied the decision-making processes of trained experts at subjective tasks such as predicting the success of a student in a training program or the recidivism of a criminal.
In 19 out of the 20 studies he looked at, Meehl found that simple
statistical learning algorithms (such as linear regression or naive Bayes) predict better than the experts. Tetlock (2017) also studies expert knowledge and finds it lacking in difficult cases. The Educational Testing Service has used an automated program to grade millions of essay questions on the GMAT exam since 1999. The program agrees with human graders 97%
of the time, about the same level that two human graders agree (Burstein et al., 2001).
(This does not mean the program understands essays, just that it can distinguish good ones from bad ones about as well as human graders can.)
The creation of comprehensive taxonomies or classifications dates back to ancient times.
Aristotle (384-322 BCE) strongly emphasized classification and categorization schemes. His
Organon, a collection of works on logic assembled by his students after his death, included a
treatise called Categories in which he attempted to construct what we would now call an upper ontology. He also introduced the notions of genus and species for lower-level classification. Our present system of biological classification, including the use of “binomial nomenclature” (classification via genus and species in the technical sense), was invented by the Swedish biologist Carolus Linnaeus, or Carl von Linne (1707-1778).
The problems associated with
natural kinds and inexact category boundaries have been addressed by Wittgenstein (1953), Quine (1953), Lakoff (1987), and Schwartz (1977), among others. See Chapter 24 for a discussion of deep neural network representations of words and concepts that escape some of the problems of a strict ontology, but also sacrifice some of the precision.
We still don’t know the best way to combine the advantages of neural networks
and logical semantics for representation. Interest in larger-scale ontologies is increasing, as documented by the Handbook on Ontologies (Staab, 2004).
The OPENCYC project (Lenat and Guha, 1990; Matuszek et al., 2006) has released a 150,000-concept ontology, with an upper ontology similar to the one in Figure 10.1 as well as specific concepts like “OLED Display” and “iPhone,” which is a type of “cellular phone,” which in turn is a type of “consumer electronics,” “phone,” “wireless communication device,” and other concepts. The NEXTKB project extends CYC and other resources including FrameNet and WordNet into a knowledge base with almost 3 million facts, and provides a reasoning engine, FIRE, to go with it (Forbus et al., 2010). The DBPEDIA
project extracts structured data from Wikipedia, specifically from Infoboxes: the attribute/value pairs that accompany many Wikipedia articles (Wu and Weld, 2008; Bizer et al., 2007). As of 2015, DBPEDIA contained 400 million facts about 4 million objects in the English version alone; counting all 110 languages yields 1.5 billion facts (Lehmann et al., 2015). The IEEE working group P1600.1 created SUMO, the Suggested Upper Merged Ontology (Niles and Pease, 2001; Pease and Niles, 2002), with about 1000 terms in the upper ontology and links to over 20,000 domain-specific terms. Stoffel et al. (1997) describe algorithms for efficiently managing a very large ontology. A survey of techniques for extracting knowledge from Web pages is given by Etzioni et al. (2008). On the Web, representation languages are emerging. RDF (Brickley and Guha, 2004)
allows for assertions to be made in the form of relational triples and provides some means for
evolving the meaning of names over time. OWL (Smith et al., 2004) is a description logic that supports inferences over these triples. So far, usage seems to be inversely proportional
to representational complexity: the traditional HTML and CSS formats account for over 99% of Web content, followed by the simplest representation schemes, such as RDFa (Adida and Birbeck, 2008), and microformats (Khare, 2006; Patel-Schneider, 2014) which use HTML
and XHTML markup to add attributes to text on web pages. Usage of sophisticated RDF and OWL ontologies is not yet widespread, and the full vision of the Semantic Web (Berners-Lee et al., 2001) has not been realized. The conferences on Formal Ontology in Information Systems (FOIS) cover both general and domain-specific ontologies.
The taxonomy used in this chapter was developed by the authors and is based in part
on their experience in the CYC project and in part on work by Hwang and Schubert (1993) and Davis (1990, 2005). An inspirational discussion of the general project of commonsense
knowledge representation appears in Hayes’s (1978, 1985b) “Naive Physics Manifesto.” Successful deep ontologies within a specific field include the Gene Ontology project (Gene Ontology Consortium, 2008) and the Chemical Markup Language (Murray-Rust et al., 2003). Doubts about the feasibility of a single ontology for all knowledge are expressed by Doctorow (2001), Gruber (2004), Halevy et al. (2009), and Smith (2004).
The event calculus was introduced by Kowalski and Sergot (1986) to handle continuous time, and there have been several variations (Sadri and Kowalski, 1995; Shanahan, 1997) and overviews (Shanahan, 1999; Mueller, 2006). James Allen introduced time intervals for the same reason (Allen, 1984), arguing that intervals were much more natural than situations for reasoning about extended and concurrent events. In van Lambalgen and Hamm (2005) we see how the logic of events maps onto the language we use to talk about events. An alternative to the event and situation calculi is the fluent calculus (Thielscher, 1999), which reifies the facts
out of which states are composed.
Peter Ladkin (1986a, 1986b) introduced “concave” time intervals (intervals with gaps—
essentially, unions of ordinary “convex” time intervals) and applied the techniques of mathematical abstract algebra to time representation. Allen (1991) systematically investigates the wide variety of techniques available for time representation; van Beek and Manchak (1996)
analyze algorithms for temporal reasoning. There are significant commonalities between the
event-based ontology given in this chapter and an analysis of events due to the philosopher
Donald Davidson (1980). The histories in Pat Hayes’s (1985a) ontology of liquids and the chronicles in McDermott’s (1985) theory of plans were also important influences on the field
and on this chapter. The question of the ontological status of substances has a long history. Plato proposed
that substances were abstract entities entirely distinct from physical objects; he would say MadeOf(Butter₃, Butter) rather than Butter₃ ∈ Butter. This leads to a substance hierarchy in which, for example, UnsaltedButter is a more specific substance than Butter.
The position
adopted in this chapter, in which substances are categories of objects, was championed by
Richard Montague (1973). It has also been adopted in the CYC project. Copeland (1993) mounts a serious, but not invincible, attack.
The alternative approach mentioned in the chapter, in which butter is one object consisting of all buttery objects in the universe, was proposed originally by the Polish logician Lesniewski (1916).
His mereology (the name is derived from the Greek word for “part”)
used the part-whole relation as a substitute for mathematical set theory, with the aim of elim-
inating abstract entities such as sets. A more readable exposition of these ideas is given by Leonard and Goodman (1940), and Goodman’s The Structure of Appearance (1977) applies the ideas to various problems in knowledge representation. While some aspects of the mereological approach are awkward—for example, the need for a separate inheritance mechanism based on part-whole relations—the approach gained the support of Quine (1960). Harry Bunt (1985) has provided an extensive analysis of its use in knowledge representation. Casati and Varzi (1999) cover parts, wholes, and a general theory of spatial locations.
There are three main approaches to the study of mental objects. The one taken in this chapter, based on modal logic and possible worlds, is the classical approach from philosophy (Hintikka, 1962; Kripke, 1963; Hughes and Cresswell, 1996). The book Reasoning about Knowledge (Fagin et al., 1995) provides a thorough introduction, and Gordon and Hobbs (2017) provide A Formal Theory of Commonsense Psychology.
The second approach is a first-order theory in which mental objects are fluents. Davis (2005) and Davis and Morgenstern (2005) describe this approach. It relies on the possible-worlds formalism, and builds on work by Robert Moore (1980, 1985).
The third approach is a syntactic theory, in which mental objects are represented by
character strings. A string is just a complex term denoting a list of symbols, so CanFly(Clark)
can be represented by the list of symbols [C,a,n,F,l,y,(,C,l,a,r,k,)]. The syntactic theory of mental objects was first studied in depth by Kaplan and Montague (1960), who showed
that it led to paradoxes if not handled carefully. Ernie Davis (1990) provides an excellent comparison of the syntactic and modal theories of knowledge. Pnueli (1977) describes a
temporal logic used to reason about programs, work that won him the Turing Award and which was expanded upon by Vardi (1996). Littman et al. (2017) show that a temporal logic can be a good language for specifying goals to a reinforcement learning robot in a way that is easy for a human to specify, and generalizes well to different environments.
The Greek philosopher Porphyry (c. 234-305 CE), commenting on Aristotle’s Categories, drew what might qualify as the first semantic network. Charles S. Peirce (1909) developed existential graphs as the first semantic network formalism using modern logic. Ross Quillian (1961), driven by an interest in human memory and language processing, initiated work on semantic networks within AI. An influential paper by Marvin Minsky (1975)
presented a version of semantic networks called frames; a frame was a representation of an object or category, with attributes and relations to other objects or categories.
The question of semantics arose quite acutely with respect to Quillian’s semantic networks (and those of others who followed his approach), with their ubiquitous and very vague “IS-A” links. Bill Woods’s (1975) famous article “What’s In a Link?” drew the attention of AI researchers to the need for precise semantics in knowledge representation formalisms. Ron Brachman (1979) elaborated on this point and proposed solutions. Patrick Hayes’s (1979) “The Logic of Frames” cut even deeper, claiming that “Most of ‘frames’ is just a new syntax for parts of first-order logic.” Drew McDermott’s (1978b) “Tarskian Semantics, or, No Notation without Denotation!” argued that the model-theoretic approach to semantics used in first-order logic should be applied to all knowledge representation formalisms. This remains a controversial idea; notably, McDermott himself has reversed his position in “A Critique of Pure Reason” (McDermott, 1987). Selman and Levesque (1993) discuss the complexity of inheritance with exceptions, showing that in most formulations it is NP-complete.
Description logics were developed as a useful subset of first-order logic for which inference is computationally tractable. Hector Levesque and Ron Brachman (1987) showed that certain uses of disjunction and negation were primarily responsible for the intractability of logical inference. This led to a better understanding of the interaction between complexity and expressiveness in reasoning systems. Calvanese et al. (1999) summarize the state of the art, and Baader et al. (2007) present a comprehensive handbook of description logic.
The three main formalisms for dealing with nonmonotonic inference—circumscription (McCarthy, 1980), default logic (Reiter, 1980), and modal nonmonotonic logic (McDermott and Doyle, 1980)—were all introduced in one special issue of the AI Journal. Delgrande and Schaub (2003) discuss the merits of the variants, given 25 years of hindsight. Answer set programming can be seen as an extension of negation as failure or as a refinement of circumscription; the underlying theory of stable model semantics was introduced by Gelfond and Lifschitz (1988), and the leading answer set programming systems are DLV (Eiter et al., 1998) and SMODELS (Niemelä et al., 2000). Lifschitz (2001) discusses the use of answer set programming for planning. Brewka et al. (1997) give a good overview of the various approaches to nonmonotonic logic. Clark (1978) covers the negation-as-failure approach to logic programming and Clark completion. A variety of nonmonotonic reasoning systems based on
logic programming are documented in the proceedings of the conferences on Logic Program-
ming and Nonmonotonic Reasoning (LPNMR).
The study of truth maintenance systems began with the TMS (Doyle, 1979) and RUP
(McAllester,
1980) systems, both of which were essentially JTMSs.
Forbus and de Kleer
(1993) explain in depth how TMSs can be used in AI applications. Nayak and Williams
(1997) show how an efficient incremental TMS called an ITMS makes it feasible to plan the
operations of a NASA spacecraft in real time.
This chapter could not cover every area of knowledge representation in depth. The three
principal topics omitted are the following:
Qualitative physics: Qualitative physics is a subfield of knowledge representation concerned specifically with constructing a logical, nonnumeric theory of physical objects and processes. The term was coined
by Johan de Kleer (1975), although the enterprise could be said to
have started in Fahlman’s (1974) BUILD, a sophisticated planner for constructing complex towers of blocks.
Fahlman discovered in the process of designing it that most of the effort
(80%, by his estimate) went into modeling the physics of the blocks world to calculate the
stability of various subassemblies of blocks, rather than into planning per se. He sketches a hypothetical naive-physics-like process to explain why young children can solve BUILD-like
problems without access to the high-speed floating-point arithmetic used in BUILD’s physical
modeling. Hayes (1985a) uses “histories”—four-dimensional slices of space-time similar to
Davidson’s events—to construct a fairly complex naive physics of liquids. Davis (2008) gives an update to the ontology of liquids that describes the pouring of liquids into containers.
De Kleer and Brown (1985), Ken Forbus (1985), and Benjamin Kuipers (1985) independently and almost simultaneously developed systems that can reason about a physical system based on qualitative abstractions of the underlying equations. Qualitative physics soon developed to the point where it became possible to analyze an impressive variety of complex phys-
ical systems (Yip, 1991). Qualitative techniques have been used to construct novel designs
for clocks, windshield wipers, and six-legged walkers (Subramanian and Wang, 1994). The collection Readings in Qualitative Reasoning about Physical Systems (Weld and de Kleer, 1990), an encyclopedia article by Kuipers (2001), and a handbook article by Davis (2007)
provide good introductions to the field.
Spatial reasoning: The reasoning necessary to navigate in the wumpus world is trivial in comparison to the rich spatial structure of the real world.
The earliest serious attempt to
capture commonsense reasoning about space appears in the work of Ernest Davis (1986, 1990). The region connection calculus of Cohn et al. (1997) supports a form of qualitative spatial reasoning and has led to new kinds of geographical information systems; see also (Davis, 2006). As with qualitative physics, an agent can go a long way, so to speak, without
resorting to a full metric representation.
Psychological reasoning: Psychological reasoning involves the development of a working
psychology for artificial agents to use in reasoning about themselves and other agents. This
is often based on so-called folk psychology, the theory that humans in general are believed
to use in reasoning about themselves and other humans.
When AI researchers provide their artificial agents with psychological theories for reasoning about other agents, the theories are frequently based on the researchers’ description of the logical agents’ own design. Psychological reasoning is currently most useful within the context of natural language understanding, where divining the speaker’s intentions is of paramount importance.
Minker (2001) collects papers by leading researchers in knowledge representation, summarizing 40 years of work in the field. The proceedings of the international conferences on Principles of Knowledge Representation and Reasoning provide the most up-to-date sources for work in this area. Readings in Knowledge Representation (Brachman and Levesque, 1985) and Formal Theories of the Commonsense World (Hobbs and Moore, 1985) are excellent anthologies on knowledge representation; the former focuses more on historically important papers in representation languages and formalisms, the latter on the accumulation of the knowledge itself. Davis (1990), Stefik (1995), and Sowa (1999) provide textbook introductions to knowledge representation, van Harmelen et al. (2007) contributes a handbook, and Davis and Morgenstern (2004) edited a special issue of the AI Journal on the topic. Davis (2017) gives a survey of logic for commonsense reasoning. The biennial conference on Theoretical Aspects of Reasoning About Knowledge (TARK) covers applications of the theory of knowledge in AI, economics, and distributed systems.
CHAPTER 11
AUTOMATED PLANNING
In which we see how an agent can take advantage of the structure of a problem to efficiently construct complex plans of action.
Planning a course of action is a key requirement for an intelligent agent. The right representation for actions and states and the right algorithms can make this easier.
In Section 11.1
we introduce a general factored representation language for planning problems that can naturally and succinctly represent a wide variety of domains, can efficiently scale up to large
problems, and does not require ad hoc heuristics for a new domain. Section 11.4 extends the
representation language to allow for hierarchical actions, allowing us to tackle more complex
problems. We cover efficient algorithms for planning in Section 11.2, and heuristics for them in Section
11.3.
In Section
11.5 we account for partially observable and nondeterministic
domains, and in Section 11.6 we extend the language once again to cover scheduling problems with resource constraints. This gets us closer to planners that are used in the real world
for planning and scheduling the operations of spacecraft, factories, and military campaigns. Section 11.7 analyzes the effectiveness of these techniques.
11.1 Definition of Classical Planning
Classical planning is defined as the task of finding a sequence of actions to accomplish a
goal in a discrete, deterministic, static, fully observable environment. We have seen two approaches to this task: the problem-solving agent of Chapter 3 and the hybrid propositional
logical agent of Chapter 7. Both share two limitations. First, they both require ad hoc heuristics for each new domain: a heuristic evaluation function for search, and hand-written code for the hybrid wumpus agent. Second, they both need to explicitly represent an exponentially large state space. For example, in the propositional logic model of the wumpus world, the axiom for moving a step forward had to be repeated for all four agent orientations, T time steps, and n² current locations.
In response to these limitations, planning researchers have invested in a factored representation using a family of languages called PDDL, the Planning Domain Definition Language (Ghallab et al., 1998), which allows us to express all 4Tn² actions with a single action schema, and does not need domain-specific knowledge. Basic PDDL can handle classical planning domains, and extensions can handle non-classical domains that are continuous, partially observable, concurrent, and multi-agent. The syntax of PDDL is based on Lisp, but we will translate it into a form that matches the notation used in this book.
In PDDL, a state is represented as a conjunction of ground atomic fluents. Recall that “ground” means no variables, “fluent” means an aspect of the world that changes over time, and “ground atomic” means there is a single predicate, and if there are any arguments, they must be constants. For example, Poor ∧ Unknown might represent the state of a hapless agent, and At(Truck₁, Melbourne) ∧ At(Truck₂, Sydney) could represent a state in a package delivery problem. PDDL uses database semantics: the closed-world assumption means that any fluents that are not mentioned are false, and the unique names assumption means that Truck₁ and Truck₂ are distinct.
The following fluents are not allowed in a state: At(x, y) (because it has variables), ¬Poor (because it is a negation), and At(Spouse(Ali), Sydney) (because it uses a function symbol, Spouse). When convenient, we can think of the conjunction of fluents as a set of fluents.
An action schema represents a family of ground actions. For example, here is an action schema for flying a plane from one location to another:
Action(Fly(p, from, to),
  PRECOND: At(p, from) ∧ Plane(p) ∧ Airport(from) ∧ Airport(to)
  EFFECT: ¬At(p, from) ∧ At(p, to))
The schema consists of the action name, a list of all the variables used in the schema, a precondition and an effect. The precondition and the effect are each conjunctions of literals (positive or negated atomic sentences). We can choose constants to instantiate the variables, yielding a ground (variable-free) action:
Action(Fly(P₁, SFO, JFK),
  PRECOND: At(P₁, SFO) ∧ Plane(P₁) ∧ Airport(SFO) ∧ Airport(JFK)
  EFFECT: ¬At(P₁, SFO) ∧ At(P₁, JFK))
A ground action a is applicable in state s if s entails the precondition of a; that is, if every
positive literal in the precondition is in s and every negated literal is not.
The result of executing applicable action a in state s is defined as a state s’ which is
represented by the set of fluents formed by starting with s, removing the fluents that appear as negative literals in the action’s effects (what we call the delete list or DEL(a)), and adding the fluents that are positive literals in the action’s effects (what we call the add list or ADD(a)):
RESULT(s, a) = (s − DEL(a)) ∪ ADD(a).    (11.1)
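Equation (11.1) and the applicability test can be tried out directly if a state is modeled as a set of ground fluents. The following is a minimal Python sketch under that assumption; the names GroundAction, applicable, and result are illustrative choices made here, not part of PDDL.

```python
from typing import FrozenSet, NamedTuple, Tuple

Fluent = Tuple[str, ...]      # e.g. ("At", "P1", "SFO")
State = FrozenSet[Fluent]     # closed world: any fluent not in the set is false


class GroundAction(NamedTuple):
    """Hypothetical container for a ground (variable-free) action."""
    name: str
    pos_pre: FrozenSet[Fluent]   # positive precondition literals
    neg_pre: FrozenSet[Fluent]   # negated precondition literals
    add: FrozenSet[Fluent]       # ADD(a): positive effect literals
    delete: FrozenSet[Fluent]    # DEL(a): negated effect literals


def applicable(s: State, a: GroundAction) -> bool:
    """a is applicable in s if every positive precondition is in s and no negated one is."""
    return a.pos_pre <= s and not (a.neg_pre & s)


def result(s: State, a: GroundAction) -> State:
    """RESULT(s, a) = (s - DEL(a)) | ADD(a), defined only for applicable actions."""
    if not applicable(s, a):
        raise ValueError(f"{a.name} is not applicable in this state")
    return (s - a.delete) | a.add


fly = GroundAction(
    name="Fly(P1, SFO, JFK)",
    pos_pre=frozenset({("At", "P1", "SFO"), ("Plane", "P1"),
                       ("Airport", "SFO"), ("Airport", "JFK")}),
    neg_pre=frozenset(),
    add=frozenset({("At", "P1", "JFK")}),
    delete=frozenset({("At", "P1", "SFO")}),
)

s0: State = frozenset({("At", "P1", "SFO"), ("Plane", "P1"),
                       ("Airport", "SFO"), ("Airport", "JFK")})
s1 = result(s0, fly)
print(sorted(s1))
```

Using immutable sets keeps each state a value that can be hashed and stored in a search frontier, which is one reasonable design choice among several.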
For example, with the action Fly(P₁, SFO, JFK), we would remove the fluent At(P₁, SFO) and add the fluent At(P₁, JFK).
A set of action schemas serves as a definition of a planning domain. A specific problem within the domain is defined with the addition of an initial state and a goal. The initial state is a conjunction of ground fluents (introduced with the keyword Init in Figure 11.1). As with all states, the closed-world assumption is used, which means that any atoms that are not mentioned are false. The goal (introduced with Goal) is just like a precondition: a conjunction of literals (positive or negative) that may contain variables. For example, the goal At(C₁, SFO) ∧ ¬At(C₂, SFO) ∧ At(p, SFO) refers to any state in which cargo C₁ is at SFO but C₂ is not, and in which there is a plane at SFO.
11.1.1 Example domain: Air cargo transport
Figure 11.1 shows an air cargo transport problem involving loading and unloading cargo and flying it from place to place. The problem can be defined with three actions: Load, Unload,
and Fly. The actions affect two predicates: In(c, p) means that cargo c is inside plane p, and At(x, a) means that object x (either plane or cargo) is at airport a. Note that some care must be taken to make sure the At predicates are maintained properly. When a plane flies from one airport to another, all the cargo inside the plane goes with it. In first-order logic it would be easy to quantify over all objects that are inside the plane. But PDDL does not have a universal quantifier, so we need a different solution. The approach we use is to say that a piece of cargo ceases to be At anywhere when it is In a plane; the cargo only becomes At the new airport when it is unloaded. So At really means “available for use at a given location.”
Init(At(C₁, SFO) ∧ At(C₂, JFK) ∧ At(P₁, SFO) ∧ At(P₂, JFK)
  ∧ Cargo(C₁) ∧ Cargo(C₂) ∧ Plane(P₁) ∧ Plane(P₂)
  ∧ Airport(JFK) ∧ Airport(SFO))
Goal(At(C₁, JFK) ∧ At(C₂, SFO))
Action(Load(c, p, a),
  PRECOND: At(c, a) ∧ At(p, a) ∧ Cargo(c) ∧ Plane(p) ∧ Airport(a)
  EFFECT: ¬At(c, a) ∧ In(c, p))
Action(Unload(c, p, a),
  PRECOND: In(c, p) ∧ At(p, a) ∧ Cargo(c) ∧ Plane(p) ∧ Airport(a)
  EFFECT: At(c, a) ∧ ¬In(c, p))
Action(Fly(p, from, to),
  PRECOND: At(p, from) ∧ Plane(p) ∧ Airport(from) ∧ Airport(to)
  EFFECT: ¬At(p, from) ∧ At(p, to))
Figure 11.1 A PDDL description of an air cargo transportation planning problem.
The following plan is a solution to the problem:
[Load(C₁, P₁, SFO), Fly(P₁, SFO, JFK), Unload(C₁, P₁, JFK), Load(C₂, P₂, JFK), Fly(P₂, JFK, SFO), Unload(C₂, P₂, SFO)].
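One way to check that this plan really solves the air cargo problem is to execute it step by step against the initial state. The sketch below assumes states are Python sets of tuples and writes each schema as a small function returning a ground action; the helper names (load, unload, fly, execute) and the dictionary layout are hypothetical conveniences for illustration.

```python
def load(c, p, a):
    # Ground instance of the Load schema.
    return {"pre": {("At", c, a), ("At", p, a), ("Cargo", c), ("Plane", p), ("Airport", a)},
            "add": {("In", c, p)}, "del": {("At", c, a)}}


def unload(c, p, a):
    # Ground instance of the Unload schema.
    return {"pre": {("In", c, p), ("At", p, a), ("Cargo", c), ("Plane", p), ("Airport", a)},
            "add": {("At", c, a)}, "del": {("In", c, p)}}


def fly(p, frm, to):
    # Ground instance of the Fly schema.
    return {"pre": {("At", p, frm), ("Plane", p), ("Airport", frm), ("Airport", to)},
            "add": {("At", p, to)}, "del": {("At", p, frm)}}


def execute(state, plan):
    """Apply each action in turn; fail if some precondition does not hold."""
    state = set(state)
    for step, act in enumerate(plan, start=1):
        if not act["pre"] <= state:
            raise ValueError(f"step {step}: precondition not satisfied")
        state = (state - act["del"]) | act["add"]
    return state


init = {("At", "C1", "SFO"), ("At", "C2", "JFK"), ("At", "P1", "SFO"), ("At", "P2", "JFK"),
        ("Cargo", "C1"), ("Cargo", "C2"), ("Plane", "P1"), ("Plane", "P2"),
        ("Airport", "JFK"), ("Airport", "SFO")}
goal = {("At", "C1", "JFK"), ("At", "C2", "SFO")}

plan = [load("C1", "P1", "SFO"), fly("P1", "SFO", "JFK"), unload("C1", "P1", "JFK"),
        load("C2", "P2", "JFK"), fly("P2", "JFK", "SFO"), unload("C2", "P2", "SFO")]

final = execute(init, plan)
plan_ok = goal <= final
print(plan_ok)
```

Only positive preconditions are handled here, which is enough for this domain because none of its schemas has a negated precondition.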
11.1.2 Example domain: The spare tire problem
Consider the problem of changing a flat tire (Figure 11.2). The goal is to have a good spare tire properly mounted onto the car’s axle, where the initial state has a flat tire on the axle and a good spare tire in the trunk.
To keep it simple, our version of the problem is an abstract one, with no sticky lug nuts or other complications. There are just four actions: removing the spare from the trunk, removing the flat tire from the axle, putting the spare on the axle, and leaving the car unattended overnight. We assume that the car is parked in a particularly bad neighborhood, so that the effect of leaving it overnight is that the tires disappear. [Remove(Flat, Axle), Remove(Spare, Trunk), PutOn(Spare, Axle)] is a solution to the problem.
11.1.3 Example domain: The blocks world
One of the most famous planning domains is the blocks world. This domain consists of a set
of cube-shaped blocks sitting on an arbitrarily large table.¹ The blocks can be stacked, but only one block can fit directly on top of another. A robot arm can pick up a block and move it to another position, either on the table or on top of another block. The arm can pick up only one block at a time, so it cannot pick up a block that has another one on top of it. A typical goal would be to get block A on B and block B on C (see Figure 11.3).
¹ The blocks world commonly used in planning research is much simpler than SHRDLU’s version (page 20).
Init(Tire(Flat) ∧ Tire(Spare) ∧ At(Flat, Axle) ∧ At(Spare, Trunk))
Goal(At(Spare, Axle))
Action(Remove(obj, loc),
  PRECOND: At(obj, loc)
  EFFECT: ¬At(obj, loc) ∧ At(obj, Ground))
Action(PutOn(t, Axle),
  PRECOND: Tire(t) ∧ At(t, Ground) ∧ ¬At(Flat, Axle) ∧ ¬At(Spare, Axle)
  EFFECT: ¬At(t, Ground) ∧ At(t, Axle))
Action(LeaveOvernight,
  PRECOND:
  EFFECT: ¬At(Spare, Ground) ∧ ¬At(Spare, Axle) ∧ ¬At(Spare, Trunk)
    ∧ ¬At(Flat, Ground) ∧ ¬At(Flat, Axle) ∧ ¬At(Flat, Trunk))
Figure 11.2 The simple spare tire problem.
[Panels: Start State; Goal State]
Figure 11.3 Diagram of the blocks-world problem in Figure 11.4.
Init(On(A, Table) ∧ On(B, Table) ∧ On(C, A)
  ∧ Block(A) ∧ Block(B) ∧ Block(C) ∧ Clear(B) ∧ Clear(C) ∧ Clear(Table))
Goal(On(A, B) ∧ On(B, C))
Action(Move(b, x, y),
  PRECOND: On(b, x) ∧ Clear(b) ∧ Clear(y) ∧ Block(b) ∧ Block(y) ∧
    (b ≠ x) ∧ (b ≠ y) ∧ (x ≠ y),
  EFFECT: On(b, y) ∧ Clear(x) ∧ ¬On(b, x) ∧ ¬Clear(y))
Action(MoveToTable(b, x),
  PRECOND: On(b, x) ∧ Clear(b) ∧ Block(b) ∧ Block(x),
  EFFECT: On(b, Table) ∧ Clear(x) ∧ ¬On(b, x))
Figure 11.4 A planning problem in the blocks world: building a three-block tower. One solution is the sequence [MoveToTable(C, A), Move(B, Table, C), Move(A, Table, B)].
We use On(b,x) to indicate that block b is on x, where x is either another block or the
table. The action for moving block b from the top of x to the top of y will be Move(b,x,y).
Now, one of the preconditions on moving b is that no other block be on it. In first-order logic,
this would be ¬∃x On(x, b) or, alternatively, ∀x ¬On(x, b). Basic PDDL does not allow
quantifiers, so instead we introduce a predicate Clear(x) that is true when nothing is on x.
(The complete problem description is in Figure 11.4.)
The action Move moves a block b from x to y if both b and y are clear. After the move is made, b is still clear but y is not. A first attempt at the Move schema is
Action(Move(b, x, y),
  PRECOND: On(b, x) ∧ Clear(b) ∧ Clear(y),
  EFFECT: On(b, y) ∧ Clear(x) ∧ ¬On(b, x) ∧ ¬Clear(y)).
Unfortunately, this does not maintain Clear properly when x or y is the table. When x is
the Table, this action has the effect Clear(Table), but the table should not become clear; and
when y = Table, it has the precondition Clear(Table), but the table does not have to be clear for us to move a block onto it. To fix this, we do two things. First, we introduce another action to move a block b from x to the table:
Action(MoveToTable(b, x),
  PRECOND: On(b, x) ∧ Clear(b),
  EFFECT: On(b, Table) ∧ Clear(x) ∧ ¬On(b, x)).
Second, we take the interpretation of Clear(x) to be “there is a clear space on x to hold a block.” Under this interpretation, Clear(Table) will always be true. The only problem is that nothing prevents the planner from using Move(b, x, Table) instead of MoveToTable(b, x). We could live with this problem—it will lead to a larger-than-necessary search space, but will not lead to incorrect answers—or we could introduce the predicate Block and add Block(b) ∧ Block(y) to the precondition of Move, as shown in Figure 11.4.
11.2 Algorithms for Classical Planning
The description of a planning problem provides an obvious way to search from the initial state through the space of states, looking for a goal. A nice advantage of the declarative representation of action schemas is that we can also search backward from the goal, looking
for the initial state (Figure 11.5 compares forward and backward searches). A third possibility
is to translate the problem description into a set of logic sentences, to which we can apply a
logical inference algorithm to find a solution.
11.2.1 Forward state-space search for planning
We can solve planning problems by applying any of the heuristic search algorithms from Chapter 3 or Chapter 4. The states in this search state space are ground states, where every fluent is either true or not.
The goal is a state that has all the positive fluents in the prob-
lem’s goal and none of the negative fluents. The applicable actions in a state, Actions(s), are
grounded instantiations of the action schemas—that is, actions where the variables have all
been replaced by constant values.
To determine the applicable actions we unify the current state against the preconditions
of each action schema.
For each unification that successfully results in a substitution, we
apply the substitution to the action schema to yield a ground action with no variables. (It is a requirement of action schemas that any variable in the effect must also appear in the
precondition; that way, we are guaranteed that no variables remain after the substitution.)
Each schema may unify in multiple ways. In the spare tire example (page 346), the Remove action has the precondition At(obj, loc), which matches against the initial state in two
ways, resulting in the two substitutions {obj/Flat,loc/Axle} and {obj/Spare,loc/Trunk}; applying these substitutions yields two ground actions.
If an action has multiple literals in
the precondition, then each of them can potentially be matched against the current state in multiple ways.
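The matching step described above can be sketched as a small backtracking matcher. This is an illustrative simplification, not a full unification algorithm: it handles positive precondition literals only, assumes states contain only ground facts, and uses the convention (an assumption made here) that variable names start with "?". The function names match and substitutions are hypothetical.

```python
def match(literal, fact, theta):
    """Extend substitution theta so that literal equals the ground fact, else None."""
    if literal[0] != fact[0] or len(literal) != len(fact):
        return None
    theta = dict(theta)  # copy, so failed branches do not leak bindings
    for term, const in zip(literal[1:], fact[1:]):
        if term.startswith("?"):
            # Bind the variable, or check consistency with an earlier binding.
            if theta.setdefault(term, const) != const:
                return None
        elif term != const:
            return None
    return theta


def substitutions(preconds, state, theta=None):
    """Yield every substitution that makes all positive preconditions hold in state."""
    theta = {} if theta is None else theta
    if not preconds:
        yield theta
        return
    first, rest = preconds[0], preconds[1:]
    for fact in sorted(state):
        extended = match(first, fact, theta)
        if extended is not None:
            yield from substitutions(rest, state, extended)


# Initial state of the spare tire problem.
init = {("Tire", "Flat"), ("Tire", "Spare"),
        ("At", "Flat", "Axle"), ("At", "Spare", "Trunk")}

remove_pre = [("At", "?obj", "?loc")]
subs = list(substitutions(remove_pre, init))
print(subs)

# With several literals, later ones must agree with earlier bindings.
puton_pos_pre = [("Tire", "?t"), ("At", "?t", "Ground")]
puton_subs = list(substitutions(puton_pos_pre, init))
```

In the initial state no tire is on the ground, so the positive part of PutOn's precondition has no match, while Remove matches in the two ways mentioned in the text.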
At first, it seems that the state space might be too big for many problems. Consider an
air cargo problem with 10 airports, where each airport initially has 5 planes and 20 pieces of cargo. The goal is to move all the cargo at airport A to airport B. There is a 41-step solution
to the problem: load the 20 pieces of cargo into one of the planes at A, fly the plane to B, and
unload the 20 pieces. Finding this apparently straightforward solution can be difficult because the average branching factor is huge: each of the 50 planes can fly to 9 other airports, and each of the 200
packages can be either unloaded (if it is loaded) or loaded into any plane at its airport (if it is unloaded).
So in any state there is a minimum of 450 actions (when all the packages are
at airports with no planes) and a maximum of 10,450 (when all packages and planes are at
the same airport). On average, let’s say there are about 2000 possible actions per state, so the search graph up to the depth of the 41-step solution has about 2000⁴¹ nodes.
Figure 11.5 Two approaches to searching for a plan. (a) Forward (progression) search through the space of ground states, starting in the initial state and using the problem’s actions to search forward for a member of the set of goal states. (b) Backward (regression) search through state descriptions, starting at the goal and using the inverse of the actions to search backward for the initial state.
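The branching-factor figures above follow from simple counting, which the short calculation below reproduces. The variable names are chosen here for illustration; the only inputs are the numbers stated in the text (10 airports, 5 planes and 20 pieces of cargo per airport, a 41-step solution, and an assumed average of 2000 actions per state).

```python
airports = 10
planes = airports * 5        # 50 planes in total
packages = airports * 20     # 200 pieces of cargo in total

# Every plane can always fly to any of the 9 other airports.
fly_actions = planes * (airports - 1)

# Minimum: every package sits at an airport with no planes, so nothing can be loaded.
min_actions = fly_actions

# Maximum: everything is at one airport, so each package can go into any of the 50 planes.
max_actions = fly_actions + packages * planes

# Size of the search graph at depth 41 with about 2000 actions per state.
nodes = 2000 ** 41

print(min_actions, max_actions, len(str(nodes)))
```

The node count has well over a hundred decimal digits, which is why uninformed forward search is not viable even for this small instance.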
Clearly, even this relatively small problem instance is hopeless without an accurate heuristic. Although many real-world applications of planning have relied on domain-specific heuris-
tics, it turns out (as we see in Section 11.3) that strong domain-independent heuristics can be
derived automatically; that is what makes forward search feasible.
11.2.2 Backward search for planning
In backward search (also called regression search) we start at the goal and apply the actions backward until we find a sequence of steps that reaches the initial state. At each step we
consider relevant actions (in contrast to forward search, which considers actions that are
applicable). This reduces the branching factor significantly, particularly in domains with many possible actions.
A relevant action is one with an effect that unifies with one of the goal literals, but with
no effect that negates any part of the goal. For example, with the goal ¬Poor ∧ Famous, an action with the sole effect Famous would be relevant, but one with the effect Poor ∧ Famous is not considered relevant: even though that action might be used at some point in the plan (to establish Famous), it cannot appear at this point in the plan because then Poor would appear in the final state.
What does it mean to apply an action in the backward direction? Given a goal g and an action a, the regression from g over a gives us a state description g′ whose positive and negative literals are given by
POS(g′) = (POS(g) − ADD(a)) ∪ POS(Precond(a))
NEG(g′) = (NEG(g) − DEL(a)) ∪ NEG(Precond(a)).
That is, the preconditions
must have held before, or else the action could not have been
executed, but the positive/negative literals that were added/deleted by the action need not have been true before.
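For ground actions, the relevance test and the two regression equations translate almost line for line into set operations. The sketch below is restricted to the ground case and uses hypothetical function names (is_relevant, regress); the Unload instance fixes the plane to P2 purely for illustration, whereas the schema-level treatment with variables is discussed next in the text.

```python
def is_relevant(goal_pos, goal_neg, add, delete):
    """An action is relevant if it achieves some goal literal and negates none."""
    achieves = bool(add & goal_pos) or bool(delete & goal_neg)
    negates = bool(delete & goal_pos) or bool(add & goal_neg)
    return achieves and not negates


def regress(goal_pos, goal_neg, pre_pos, pre_neg, add, delete):
    """Return (POS(g'), NEG(g')) for the regression of goal g over a ground action."""
    new_pos = (goal_pos - add) | pre_pos
    new_neg = (goal_neg - delete) | pre_neg
    return new_pos, new_neg


# Goal: not Poor and Famous.
goal_pos, goal_neg = {"Famous"}, {"Poor"}
rel_famous = is_relevant(goal_pos, goal_neg, add={"Famous"}, delete=set())
rel_poor_famous = is_relevant(goal_pos, goal_neg, add={"Poor", "Famous"}, delete=set())

# Regress the goal At(C2, SFO) over the ground action Unload(C2, P2, SFO).
unload_pre = {("In", "C2", "P2"), ("At", "P2", "SFO"),
              ("Cargo", "C2"), ("Plane", "P2"), ("Airport", "SFO")}
unload_add = {("At", "C2", "SFO")}
unload_del = {("In", "C2", "P2")}

pos, neg = regress({("At", "C2", "SFO")}, set(), unload_pre, set(), unload_add, unload_del)
print(sorted(pos), neg)
```

The regressed description contains exactly the preconditions of Unload: the goal literal At(C2, SFO) is dropped because the action adds it.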
These equations are straightforward for ground literals, but some care is required when there are variables in g and a. For example, suppose the goal is to deliver a specific piece of cargo to SFO: At(C₂, SFO). The Unload action schema has the effect At(c, a). When we unify that with the goal, we get the substitution {c/C₂, a/SFO}; applying that substitution to the schema gives us a new schema which captures the idea of using any plane that is at SFO:
Action(Unload(C₂, p′, SFO),
  PRECOND: In(C₂, p′) ∧ At(p′, SFO) ∧ Cargo(C₂) ∧ Plane(p′) ∧ Airport(SFO)
  EFFECT: At(C₂, SFO) ∧ ¬In(C₂, p′)).
Here we replaced p with a new variable named p′. This is an instance of standardizing apart variable names so there will be no conflict between different variables that happen to have the same name (see page 284). The regressed state description gives us a new goal:
g′ = In(C₂, p′) ∧ At(p′, SFO) ∧ Cargo(C₂) ∧ Plane(p′) ∧ Airport(SFO).
As another example, consider the goal of owning a book with a specific ISBN number: Own(9780134610993). Given a trillion 13-digit ISBNs and the single action schema
A = Action(Buy(i), PRECOND: ISBN(i), EFFECT: Own(i)),
a forward search without a heuristic would have to start enumerating the trillion ground Buy actions. But with backward search, we would unify the goal Own(9780134610993) with the effect Own(i′), yielding the substitution θ = {i′/9780134610993}. Then we would regress over the action SUBST(θ, A) to yield the predecessor state description ISBN(9780134610993). This is part of the initial state, so we have a solution and we are done, having considered just one action, not a trillion.
More formally, assume a goal description g that contains a goal literal g; and an action
schema A. If A has an effect literal e′ⱼ where UNIFY(gᵢ, e′ⱼ) = θ and where we define A′ = SUBST(θ, A), and if there is no effect in A′ that is the negation of a literal in g, then A′ is a
relevant action towards g.
For most problem domains backward search keeps the branching factor lower than for-
ward search.
However, the fact that backward search uses states with variables rather than
ground states makes it harder to come up with good heuristics. That is the main reason why
the majority of current systems favor forward search.
11.2.3 Planning as Boolean satisfiability
In Section 7.7.4 we showed how some clever axiom-rewriting could turn a wumpus world problem into a propositional logic satisfiability problem that could be handed to an efficient satisfiability solver. SAT-based planners such as SATPLAN operate by translating a PDDL problem description into propositional form. The translation involves a series of steps:
• Propositionalize the actions: for each action schema, form ground propositions by substituting constants for each of the variables. So instead of a single Unload(c, p, a) schema, we would have separate action propositions for each combination of cargo, plane, and airport (here written with subscripts), and for each time step (here written as a superscript).
• Add action exclusion axioms saying that no two actions can occur at the same time, e.g., ¬(FlyP1SFOJFK^1 ∧ FlyP1SFOBUH^1).
• Add precondition axioms: for each ground action A^t, add the axiom A^t ⇒ PRE(A)^t, that is, if an action is taken at time t, then the preconditions must have been true. For example, FlyP1SFOJFK^1 ⇒ At(P1, SFO) ∧ Plane(P1) ∧ Airport(SFO) ∧ Airport(JFK).
• Define the initial state: assert F^0 for every fluent F in the problem’s initial state, and ¬F^0 for every fluent not mentioned in the initial state.
• Propositionalize the goal: the goal becomes a disjunction over all of its ground instances, where variables are replaced by constants. For example, the goal of having block A on another block, On(A, x) ∧ Block(x) in a world with objects A, B and C, would be replaced by the goal
(On(A, A) ∧ Block(A)) ∨ (On(A, B) ∧ Block(B)) ∨ (On(A, C) ∧ Block(C)).
• Add successor-state axioms: for each fluent F, add an axiom of the form
F^(t+1) ⇔ ActionCausesF^t ∨ (F^t ∧ ¬ActionCausesNotF^t),
where ActionCausesF stands for a disjunction of all the ground actions that add F, and ActionCausesNotF stands for a disjunction of all the ground actions that delete F.
The resulting translation is typically much larger than the original PDDL, but the efficiency of modern SAT solvers often more than makes up for this.
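The translation steps above can be sketched as a small generator of axioms for a fixed horizon. This is an illustrative sketch only: it starts from actions that are already ground, writes axioms as plain strings rather than clauses for a real solver, and assumes a ground goal. The function name, the dictionary layout, and the two-airport example are hypothetical.

```python
from itertools import combinations

def encode(actions, fluents, init, goal, horizon):
    """Return propositional axioms (as strings) for plans of exactly `horizon` steps.

    actions: dict name -> {"pre": set, "add": set, "del": set} over ground fluents.
    """
    axioms = []
    # Initial state: F^0 for fluents in init, ~F^0 for every other fluent.
    for f in sorted(fluents):
        axioms.append(f"{f}^0" if f in init else f"~{f}^0")
    names = sorted(actions)
    for t in range(horizon):
        # Action exclusion: no two actions at the same time step.
        for a, b in combinations(names, 2):
            axioms.append(f"~({a}^{t} & {b}^{t})")
        # Precondition axioms: A^t => PRE(A)^t.
        for a in names:
            pre = " & ".join(f"{p}^{t}" for p in sorted(actions[a]["pre"])) or "True"
            axioms.append(f"{a}^{t} => ({pre})")
        # Successor-state axioms, one per fluent.
        for f in sorted(fluents):
            causes = " | ".join(f"{a}^{t}" for a in names if f in actions[a]["add"]) or "False"
            cancels = " | ".join(f"{a}^{t}" for a in names if f in actions[a]["del"]) or "False"
            axioms.append(f"{f}^{t+1} <=> ({causes}) | ({f}^{t} & ~({cancels}))")
    # Goal: every goal fluent holds at the final time step.
    axioms.append(" & ".join(f"{g}^{horizon}" for g in sorted(goal)))
    return axioms

fluents = {"AtP1SFO", "AtP1JFK"}
actions = {
    "FlyP1SFOJFK": {"pre": {"AtP1SFO"}, "add": {"AtP1JFK"}, "del": {"AtP1SFO"}},
    "FlyP1JFKSFO": {"pre": {"AtP1JFK"}, "add": {"AtP1SFO"}, "del": {"AtP1JFK"}},
}
axioms = encode(actions, fluents, init={"AtP1SFO"}, goal={"AtP1JFK"}, horizon=1)
```

Even in this tiny example the number of axioms grows with the number of ground actions, fluents, and time steps, which is why the propositional form is much larger than the PDDL description.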
11.2.4
Other classical planning approaches
The three approaches we covered above are not the only ones tried in the 50-year history of automated planning. We briefly describe some others here.
An approach called Graphplan uses a specialized data structure, a planning graph, to encode constraints on how actions are related to their preconditions and effects, and on which things are mutually exclusive.
Situation calculus is a method of describing planning problems in first-order logic. It uses successor-state axioms just as SATPLAN does, but first-order logic allows for more flexibility and more succinct axioms. Overall the approach has contributed to our theoretical understanding of planning, but has not made a big impact in practical applications, perhaps because first-order provers are not as well developed as propositional satisfiability programs.
It is possible to encode a bounded planning problem (i.e., the problem of finding a plan of length k) as a constraint satisfaction problem (CSP). The encoding is similar to the encoding to a SAT problem (Section 11.2.3), with one important simplification: at each time step we need only a single variable, Action^t, whose domain is the set of possible actions. We no longer need one variable for every action, and we don’t need the action exclusion axioms.
Partial-order planning
All the approaches we have seen so far construct totally ordered plans consisting of strictly linear sequences of actions. But if an air cargo problem has 30 packages being loaded onto one plane and 50 packages being loaded onto another, it seems pointless to decree a specific linear ordering of the 80 load actions.
An alternative called partial-order planning represents a plan as a graph rather than a linear sequence: each action is a node in the graph, and for each precondition of the action there is an edge from another action (or from the initial state) that indicates that the predecessor action establishes the precondition. So we could have a partial-order plan that says that actions Remove(Spare, Trunk) and Remove(Flat, Axle) must come before PutOn(Spare, Axle), but without saying which of the two Remove actions should come first. We search in the space of plans rather than world-states, inserting actions to satisfy conditions.
In the 1980s and 1990s, partial-order planning was seen as the best way to handle planning problems with independent subproblems. By 2000, forward-search planners had developed excellent heuristics that allowed them to efficiently discover the independent subproblems that partial-order planning was designed for. Moreover, SATPLAN was able to take advantage of Moore’s law: a propositionalization that was hopelessly large in 1980 now looks tiny, because computers have 10,000 times more memory today. As a result, partial-order planners are not competitive on fully automated classical planning problems.
Nonetheless, partial-order planning remains an important part of the field. For some specific tasks, such as operations scheduling, partial-order planning with domain-specific heuristics is the technology of choice. Many of these systems use libraries of high-level plans, as described in Section 11.4.
Partial-order planning is also often used in domains where it is important for humans
to understand the plans. For example, operational plans for spacecraft and Mars rovers are generated by partial-order planners and are then checked by human operators before being uploaded to the vehicles for execution. The plan refinement approach makes it easier for the humans to understand what the planning algorithms are doing and to verify that the plans are
correct before they are executed.
11.3
Heuristics for Planning
Neither forward nor backward search is efficient without a good heuristic function. Recall from Chapter 3 that a heuristic function h(s) estimates the distance from a state s to the goal, and that if we can derive an admissible heuristic for this distance (one that does not overestimate), then we can use A* search to find optimal solutions.
By definition, there is no way to analyze an atomic state, and thus it requires some ingenuity by an analyst (usually human) to define good domain-specific heuristics for search problems with atomic states. But planning uses a factored representation for states and actions, which makes it possible to define good domain-independent heuristics.
Recall that an admissible heuristic can be derived by defining a relaxed problem that is
easier to solve. The exact cost of a solution to this easier problem then becomes the heuristic
for the original problem. A search problem is a graph where the nodes are states and the edges are actions.
The problem is to find a path connecting the initial state to a goal state.
There are two main ways we can relax this problem to make it easier: by adding more edges to the graph, making it strictly easier to find a path, or by grouping multiple nodes together, forming an abstraction of the state space that has fewer states, and thus is easier to search.
We look first at heuristics that add edges to the graph. Perhaps the simplest is the ignore-preconditions heuristic, which drops all preconditions from actions. Every action becomes applicable in every state, and any single goal fluent can be achieved in one step (if there are any applicable actions; if not, the problem is impossible). This almost implies that the number of steps required to solve the relaxed problem is the number of unsatisfied goals: almost but not quite, because (1) some action may achieve multiple goals and (2) some actions may undo the effects of others. For many problems an accurate heuristic is obtained by considering (1) and ignoring (2). First, we relax the actions by removing all preconditions and all effects except those that are literals in the goal. Then, we count the minimum number of actions required such that the union of those actions’ effects satisfies the goal. This is an instance of the set-cover problem.
There is one minor irritation: the set-cover problem is NP-hard. Fortunately a simple greedy algorithm is guaranteed to return a set covering whose size is within a factor of log n of the true minimum covering, where n is the number of literals in the goal. Unfortunately, the greedy algorithm loses the guarantee of admissibility.
It is also possible to ignore only selected preconditions of actions. Consider the sliding-tile puzzle (8-puzzle or 15-puzzle) from Section 3.2. We could encode this as a planning problem involving tiles with a single schema Slide:
Action(Slide(t, s1, s2),
  PRECOND: On(t, s1) ∧ Tile(t) ∧ Blank(s2) ∧ Adjacent(s1, s2)
  EFFECT: On(t, s2) ∧ Blank(s1) ∧ ¬On(t, s1) ∧ ¬Blank(s2))
As we saw in Section 3.6, if we remove the preconditions Blank(s2) ∧ Adjacent(s1, s2) then any tile can move in one action to any space and we get the number-of-misplaced-tiles heuristic. If we remove only the Blank(s2) precondition then we get the Manhattan-distance heuristic. It is easy to see how these heuristics could be derived automatically from the action schema description. The ease of manipulating the action schemas is the great advantage of the factored representation of planning problems, as compared with the atomic representation of search problems.
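The two sliding-tile heuristics that fall out of dropping preconditions from Slide can be written directly. The sketch below assumes a board stored as a flat tuple in row-major order with 0 standing for the blank; that representation and the function names are choices made for this illustration.

```python
def misplaced_tiles(state, goal):
    """Relaxation without Blank and Adjacent: each misplaced tile needs one move."""
    return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)

def manhattan(state, goal, width=3):
    """Relaxation without Blank only: each tile moves one adjacent square at a time."""
    total = 0
    for idx, tile in enumerate(state):
        if tile == 0:
            continue  # the blank is not a tile
        goal_idx = goal.index(tile)
        total += abs(idx // width - goal_idx // width) + abs(idx % width - goal_idx % width)
    return total

start = (7, 2, 4, 5, 0, 6, 8, 3, 1)
goal = (0, 1, 2, 3, 4, 5, 6, 7, 8)
h1 = misplaced_tiles(start, goal)   # 8
h2 = manhattan(start, goal)         # 18
```

Dropping fewer preconditions gives the larger, more informative estimate, while both remain admissible because each is the exact cost of a relaxed problem.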
Figure 11.6 Two state spaces from planning problems with the ignore-delete-lists heuristic. The height above the bottom plane is the heuristic score of a state; states on the bottom plane are goals. There are no local minima, so search for the goal is straightforward. From Hoffmann (2005).
Another possibility is the ignore-delete-lists heuristic. Assume for a moment that all goals and preconditions contain only positive literals.³ We want to create a relaxed version of the original problem that will be easier to solve, and where the length of the solution will serve as a good heuristic. We can do that by removing the delete lists from all actions (i.e., removing all negative literals from effects). That makes it possible to make monotonic progress towards the goal: no action will ever undo progress made by another action. It turns out it is still NP-hard to find the optimal solution to this relaxed problem, but an approximate solution can be found in polynomial time by hill climbing.
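To make the monotonic-progress property concrete, the sketch below runs a delete-free problem forward in layers until every goal fluent has appeared. Note that this is a swapped-in, simpler estimate (a layer count), not the hill-climbing approximation mentioned above. The positive fluent Clear(Axle) is a hypothetical stand-in for the negative precondition of PutOn in the spare-tire problem.

```python
def relaxed_layers(init, goal, actions):
    """Layers of delete-free action application needed to reach every goal fluent.

    actions: list of (preconditions, add_list) pairs of sets; delete lists are dropped.
    Returns None if the goal is unreachable even in the relaxed problem.
    """
    reached = set(init)
    layers = 0
    while not goal <= reached:
        new = set()
        for pre, add in actions:
            if pre <= reached:        # applicable: facts are never removed
                new |= add
        if new <= reached:            # fixed point without the goal
            return None
        reached |= new
        layers += 1
    return layers

spare_tire = [
    ({"At(Spare,Trunk)"}, {"At(Spare,Ground)"}),                 # Remove(Spare,Trunk)
    ({"At(Flat,Axle)"}, {"At(Flat,Ground)", "Clear(Axle)"}),     # Remove(Flat,Axle)
    ({"At(Spare,Ground)", "Clear(Axle)"}, {"At(Spare,Axle)"}),   # PutOn(Spare,Axle)
]
h = relaxed_layers({"At(Spare,Trunk)", "At(Flat,Axle)"}, {"At(Spare,Axle)"}, spare_tire)
```

Because the set of reached fluents only grows, the loop terminates after at most as many layers as there are fluents. Each layer needs at least one action in any real plan, so the layer count never overestimates the true plan length.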
Figure 11.6 diagrams part of the state space for two planning problems using the ignore-delete-lists heuristic. The dots represent states and the edges actions, and the height of each dot above the bottom plane represents the heuristic value. States on the bottom plane are solutions. In both of these problems, there is a wide path to the goal. There are no dead ends, so no need for backtracking; a simple hill-climbing search will easily find a solution to these problems (although it may not be an optimal solution).
11.3.1
Domain-independent pruning
Factored representations make it obvious that many states are just variants of other states. For example, suppose we have a dozen blocks on a table, and the goal is to have block A on top of a three-block tower. The first step in a solution is to place some block x on top of block y (where x, y, and A are all different). After that, place A on top of x and we’re done. There are 11 choices for x, and given x, 10 choices for y, and thus 110 states to consider. But all these states are symmetric: choosing one over another makes no difference, and thus a planner should only consider one of them. This is the process of symmetry reduction: we prune out of consideration all symmetric branches of the search tree except for one. For many domains, this makes the difference between intractable and efficient solving.
³ Many problems are written with this convention. For problems that aren’t, replace every negative literal ¬P in a goal or precondition with a new positive literal, P', and modify the initial state and the action effects accordingly.
Another possibility is to do forward pruning, accepting the risk that we might prune away an optimal solution, in order to focus the search on promising branches. We can define a preferred action as follows: First, define a relaxed version of the problem, and solve it to get a relaxed plan. Then a preferred action is either a step of the relaxed plan, or it achieves some precondition of the relaxed plan.
Sometimes it is possible to solve a problem efficiently by recognizing that negative interactions can be ruled out. We say that a problem has serializable subgoals if there exists an order of subgoals such that the planner can achieve them in that order without having to undo any of the previously achieved subgoals. For example, in the blocks world, if the goal is to build a tower (e.g., A on B, which in turn is on C, which in turn is on the Table, as in Figure 11.3 on page 347), then the subgoals are serializable bottom to top: if we first achieve C on Table, we will never have to undo it while we are achieving the other subgoals. A planner that uses the bottom-to-top trick can solve any problem in the blocks world without backtracking (although it might not always find the shortest plan). As another example, if there is a room with n light switches, each controlling a separate light, and the goal is to have them all on, then we don’t have to consider permutations of the order; we could arbitrarily restrict ourselves to plans that flip switches in, say, ascending order.
For the Remote Agent planner that commanded NASA’s Deep Space One spacecraft, it was determined that the propositions involved in commanding a spacecraft are serializable. This is perhaps not too surprising, because a spacecraft is designed by its engineers to be as easy as possible to control (subject to other constraints). Taking advantage of the serialized ordering of goals, the Remote Agent planner was able to eliminate most of the search. This meant that it was fast enough to control the spacecraft in real time, something previously considered impossible.
11.3.2
State abstraction in planning
A relaxed problem leaves us with a simplified planning problem just to calculate the value of the heuristic function. Many planning problems have 10^100 states or more, and relaxing the actions does nothing to reduce the number of states, which means that it may still be expensive to compute the heuristic. Therefore, we now look at relaxations that decrease the number of states by forming a state abstraction: a many-to-one mapping from states in the ground representation of the problem to the abstract representation.
The easiest form of state abstraction is to ignore some fluents. For example, consider an air cargo problem with 10 airports, 50 planes, and 200 pieces of cargo. Each plane can be at one of 10 airports and each package can be either in one of the planes or unloaded at one of the airports. So there are 10^50 × (50 + 10)^200 ≈ 10^405 states. Now consider a particular problem in that domain in which it happens that all the packages are at just 5 of the airports, and all packages at a given airport have the same destination. Then a useful abstraction of the problem is to drop all the At fluents except for the ones involving one plane and one package at each of the 5 airports. Now there are only 10^5 × (5 + 10)^5 ≈ 10^11 states. A solution in this abstract state space will be shorter than a solution in the original space (and thus will be an admissible heuristic), and the abstract solution is easy to extend to a solution to the original problem (by adding additional Load and Unload actions).
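The two state counts are easy to check by working with base-10 logarithms, since the numbers themselves are far too large to be useful as floats. The helper name below is hypothetical; the counting rule is the one stated in the text.

```python
import math

def log10_states(airports, planes, packages):
    """log10 of airports^planes * (planes + airports)^packages.

    Each plane is at one airport; each package is in some plane or at some airport.
    """
    return planes * math.log10(airports) + packages * math.log10(planes + airports)

full = log10_states(airports=10, planes=50, packages=200)    # about 405.6
abstract = log10_states(airports=10, planes=5, packages=5)   # about 10.9
```

The abstraction therefore removes roughly 395 orders of magnitude from the state space while preserving the structure that matters for the heuristic.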
A key idea in defining heuristics is decomposition: dividing a problem into parts, solving each part independently, and then combining the parts. The subgoal independence assumption is that the cost of solving a conjunction of subgoals is approximated by the sum of the costs of solving each subgoal independently. The subgoal independence assumption can be optimistic or pessimistic. It is optimistic when there are negative interactions between the subplans for each subgoal, for example when an action in one subplan deletes a goal achieved by another subplan. It is pessimistic, and therefore inadmissible, when subplans contain redundant actions (for instance, two actions that could be replaced by a single action in the merged plan).
Suppose the goal is a set of fluents G, which we divide into disjoint subsets G_1, ..., G_n. We then find optimal plans P_1, ..., P_n that solve the respective subgoals. What is an estimate of the cost of the plan for achieving all of G? We can think of each COST(P_i) as a heuristic estimate, and we know that if we combine estimates by taking their maximum value, we always get an admissible heuristic. So max_i COST(P_i) is admissible, and sometimes it is exactly correct: it could be that P_1 serendipitously achieves all the G_i. But usually the estimate is too low. Could we sum the costs instead? For many problems that is a reasonable estimate, but it is not admissible. The best case is when G_i and G_j are independent, in the sense that plans for one cannot reduce the cost of plans for the other. In that case, the estimate COST(P_i) + COST(P_j) is admissible, and more accurate than the max estimate.
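A tiny experiment shows both behaviors of the max and sum estimates. The sketch uses a brute-force breadth-first planner over set-based actions; the two toy domains (one action that achieves both subgoals, versus one action per subgoal) and all names are invented for this illustration.

```python
from collections import deque

def optimal_cost(init, goal, actions):
    """Length of a shortest plan, by breadth-first search; actions are (pre, add, delete) sets."""
    start = frozenset(init)
    frontier, seen = deque([(start, 0)]), {start}
    while frontier:
        state, cost = frontier.popleft()
        if goal <= state:
            return cost
        for pre, add, delete in actions:
            if pre <= state:
                nxt = frozenset((state - delete) | add)
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, cost + 1))
    return None

def estimates(actions):
    """Return (max estimate, sum estimate, true cost) for the goal {G1, G2}."""
    costs = [optimal_cost(set(), {g}, actions) for g in ("G1", "G2")]
    return max(costs), sum(costs), optimal_cost(set(), {"G1", "G2"}, actions)

shared = [(set(), {"G1", "G2"}, set())]                        # one action achieves both
independent = [(set(), {"G1"}, set()), (set(), {"G2"}, set())]  # one action per subgoal

shared_result = estimates(shared)            # (1, 2, 1): sum overestimates
independent_result = estimates(independent)  # (1, 2, 2): sum is exact, max is low
```

In the shared domain the sum is pessimistic and therefore inadmissible; in the independent domain the sum is exact and strictly more accurate than the max.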
It is clear that there is great potential for cutting down the search space by forming abstractions. The trick is choosing the right abstractions and using them in a way that makes the total cost (defining an abstraction, doing an abstract search, and mapping the abstraction back to the original problem) less than the cost of solving the original problem. The techniques of pattern databases from Section 3.6.3 can be useful, because the cost of creating the pattern database can be amortized over multiple problem instances.
A system that makes use of effective heuristics is FF, or FASTFORWARD (Hoffmann, 2005), a forward state-space searcher that uses the ignore-delete-lists heuristic, estimating the heuristic with the help of a planning graph. FF then uses hill climbing search (modified to keep track of the plan) with the heuristic to find a solution. FF’s hill climbing algorithm is nonstandard: it avoids local maxima by running a breadth-first search from the current state until a better one is found. If this fails, FF switches to a greedy best-first search instead.
11.4
Hierarchical Planning
The problem-solving and planning methods of the preceding chapters all operate with a fixed set of atomic actions. Actions can be strung together, and state-of-the-art algorithms can generate solutions containing thousands of actions. That’s fine if we are planning a vacation and the actions are at the level of “fly from San Francisco to Honolulu,” but at the motor-control level of “bend the left knee by 5 degrees” we would need to string together millions or billions of actions, not thousands.
Bridging this gap requires planning at higher levels of abstraction. A high-level plan for a Hawaii vacation might be “Go to San Francisco airport; take flight HA 11 to Honolulu; do vacation stuff for two weeks; take HA 12 back to San Francisco; go home.” Given such a plan, the action “Go to San Francisco airport” can be viewed as a planning task in itself, with a solution such as “Choose a ride-hailing service; order a car; ride to airport.” Each of these actions, in turn, can be decomposed further, until we reach the low-level motor control actions like a button-press.
In this example, planning and acting are interleaved; for example, one would defer the problem of planning the walk from the curb to the gate until after being dropped off. Thus, that particular action will remain at an abstract level prior to the execution phase. We defer discussion of this topic until Section 11.5. Here, we concentrate on the idea of hierarchical decomposition, an idea that pervades almost all attempts to manage complexity. For example, complex software is created from a hierarchy of subroutines and classes; armies, governments and corporations have organizational hierarchies. The key benefit of hierarchical structure is that at each level of the hierarchy, a computational task, military mission, or administrative function is reduced to a small number of activities at the next lower level, so the computational cost of finding the correct way to arrange those activities for the current problem is small.
11.4.1
High-level actions
The basic formalism we adopt to understand hierarchical decomposition comes from the area of hierarchical task networks or HTN planning. For now we assume full observability and determinism and a set of actions, now called primitive actions, with standard precondition-effect schemas. The key additional concept is the high-level action or HLA, for example the action “Go to San Francisco airport.” Each HLA has one or more possible refinements, into a sequence of actions, each of which may be an HLA or a primitive action. For example, the action “Go to San Francisco airport,” represented formally as Go(Home, SFO), might have two possible refinements, as shown in Figure 11.7. The same figure shows a recursive refinement for navigation in the vacuum world: to get to a destination, take a step, and then go to the destination.
These examples show that high-level actions and their refinements embody knowledge
about how to do things. For instance, the refinements for Go(Home,SFO) say that to get to
the airport you can drive or take a ride-hailing service; buying milk, sitting down, and moving the knight to e4 are not to be considered.
An HLA refinement that contains only primitive actions is called an implementation of the HLA. In a grid world, the sequences [Right, Right, Down] and [Down, Right, Right] both implement the HLA Navigate([1, 3], [3, 2]). An implementation of a high-level plan (a sequence of HLAs) is the concatenation of implementations of each HLA in the sequence. Given the precondition-effect definitions of each primitive action, it is straightforward to determine whether any given implementation of a high-level plan achieves the goal. We can say, then, that a high-level plan achieves the goal from a given state if at least one of its implementations achieves the goal from that state. The “at least one” in this definition is crucial: not all implementations need to achieve the goal, because the agent gets to decide which implementation it will execute. Thus, the set of possible implementations in HTN planning (each of which may have a different outcome) is not the same as the set of possible outcomes in nondeterministic planning. There, we required that a plan work for all outcomes because the agent doesn’t get to choose the outcome; nature does.
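The “at least one implementation” criterion can be checked mechanically for the grid example. The sketch below assumes that Right and Up increase the first and second coordinate respectively and that there are no walls (so the Connected preconditions are ignored); both are assumptions of this illustration, not statements from the text.

```python
# Coordinate changes for primitive moves (assumed orientation, no walls).
MOVES = {"Left": (-1, 0), "Right": (1, 0), "Up": (0, 1), "Down": (0, -1)}

def run(start, steps):
    """Execute a sequence of primitive moves and return the final square."""
    x, y = start
    for step in steps:
        dx, dy = MOVES[step]
        x, y = x + dx, y + dy
    return (x, y)

def achieves(start, goal, implementations):
    """A high-level plan achieves the goal if at least one implementation does."""
    return any(run(start, impl) == goal for impl in implementations)

impl_a = ["Right", "Right", "Down"]
impl_b = ["Down", "Right", "Right"]
ok = achieves((1, 3), (3, 2), [["Up"], impl_a])   # True: one good implementation suffices
```

Replacing `any` with `all` would give the stricter requirement that applies to nondeterministic outcomes, where nature rather than the agent makes the choice.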
The simplest case is an HLA that has exactly one implementation. In that case, we can compute the preconditions and effects of the HLA from those of the implementation (see Exercise 11.HLAU) and then treat the HLA exactly as if it were a primitive action itself. It can be shown that the right collection of HLAs can result in the time complexity of blind search dropping from exponential in the solution depth to linear in the solution depth, although devising such a collection of HLAs may be a nontrivial task in itself. When HLAs have multiple possible implementations, there are two options: one is to search among the implementations for one that works, as in Section 11.4.2; the other is to reason directly about the HLAs, despite the multiplicity of implementations, as explained in Section 11.4.3. The latter method enables the derivation of provably correct abstract plans, without the need to consider their implementations.
Refinement(Go(Home, SFO),
  STEPS: [Drive(Home, SFOLongTermParking), Shuttle(SFOLongTermParking, SFO)])
Refinement(Go(Home, SFO),
  STEPS: [Taxi(Home, SFO)])
Refinement(Navigate([a, b], [x, y]),
  PRECOND: a = x ∧ b = y
  STEPS: [])
Refinement(Navigate([a, b], [x, y]),
  PRECOND: Connected([a, b], [a − 1, b])
  STEPS: [Left, Navigate([a − 1, b], [x, y])])
Refinement(Navigate([a, b], [x, y]),
  PRECOND: Connected([a, b], [a + 1, b])
  STEPS: [Right, Navigate([a + 1, b], [x, y])])
Figure 11.7 Definitions of possible refinements for two high-level actions: going to San Francisco airport and navigating in the vacuum world. In the latter case, note the recursive nature of the refinements and the use of preconditions.
11.4.2
Searching for primitive solutions
HTN planning is often formulated with a single “top level” action called Act, where the aim is to find an implementation of Act that achieves the goal. This approach is entirely general. For example, classical planning problems can be defined as follows: for each primitive action a_i, provide one refinement of Act with steps [a_i, Act]. That creates a recursive definition of Act that lets us add actions. But we need some way to stop the recursion; we do that by providing one more refinement for Act, one with an empty list of steps and with a precondition equal to the goal of the problem. This says that if the goal is already achieved, then the right implementation is to do nothing.
The approach leads to a simple algorithm: repeatedly choose an HLA in the current plan and replace it with one of its refinements, until the plan achieves the goal. One possible implementation based on breadth-first tree search is shown in Figure 11.8. Plans are considered in order of depth of nesting of the refinements, rather than number of primitive steps. It is straightforward to design a graph-search version of the algorithm as well as depth-first and iterative deepening versions.
function HIERARCHICAL-SEARCH(problem, hierarchy) returns a solution or failure
  frontier ← a FIFO queue with [Act] as the only element
  while true do
    if IS-EMPTY(frontier) then return failure
    plan ← POP(frontier)    // chooses the shallowest plan in frontier
    hla ← the first HLA in plan, or null if none
    prefix, suffix ← the action subsequences before and after hla in plan
    outcome ← RESULT(problem.INITIAL, prefix)
    if hla is null then    // so plan is primitive and outcome is its result
      if problem.IS-GOAL(outcome) then return plan
    else for each sequence in REFINEMENTS(hla, outcome, hierarchy) do
      add APPEND(prefix, sequence, suffix) to frontier
Figure 11.8 A breadth-first implementation of hierarchical forward planning search. The initial plan supplied to the algorithm is [Act]. The REFINEMENTS function returns a set of action sequences, one for each refinement of the HLA whose preconditions are satisfied by the specified state, outcome.
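A runnable Python rendering of Figure 11.8 might look as follows. It is a sketch under several assumptions: states are sets of ground fluents, the hierarchy is a dictionary from HLA names to (precondition, steps) pairs, and plans with an inapplicable primitive prefix are simply dropped. The fluent names, the primitive effects, the ordering of the two Go refinements, and the Cash precondition on the taxi refinement (borrowed from the discussion in Section 11.4.3) are all illustrative choices.

```python
from collections import deque

def apply_action(state, action, effects):
    """Result of a primitive action given (precond, add, delete) sets; None if inapplicable."""
    pre, add, delete = effects[action]
    return (state - delete) | add if pre <= state else None

def hierarchical_search(initial, goal, refinements, effects):
    """Breadth-first hierarchical search; returns a primitive plan or None."""
    frontier = deque([["Act"]])
    while frontier:
        plan = frontier.popleft()                     # shallowest plan first
        idx = next((i for i, a in enumerate(plan) if a in refinements), None)
        prefix = plan if idx is None else plan[:idx]
        outcome = initial
        for action in prefix:
            outcome = apply_action(outcome, action, effects)
            if outcome is None:
                break
        if outcome is None:
            continue                                  # prefix not executable: drop plan
        if idx is None:                               # plan is primitive
            if goal <= outcome:
                return plan
        else:
            for precond, steps in refinements[plan[idx]]:
                if precond <= outcome:
                    frontier.append(prefix + steps + plan[idx + 1:])
    return None

refinements = {
    "Act": [(frozenset(), ["Go(Home,SFO)"])],
    "Go(Home,SFO)": [
        (frozenset({"Cash"}), ["Taxi(Home,SFO)"]),
        (frozenset(), ["Drive(Home,SFOLongTermParking)", "Shuttle(SFOLongTermParking,SFO)"]),
    ],
}
effects = {
    "Taxi(Home,SFO)": ({"At(Home)", "Cash"}, {"At(SFO)"}, {"At(Home)", "Cash"}),
    "Drive(Home,SFOLongTermParking)": ({"At(Home)"}, {"At(Parking)"}, {"At(Home)"}),
    "Shuttle(SFOLongTermParking,SFO)": ({"At(Parking)"}, {"At(SFO)"}, {"At(Parking)"}),
}
no_cash_plan = hierarchical_search(frozenset({"At(Home)"}), {"At(SFO)"}, refinements, effects)
cash_plan = hierarchical_search(frozenset({"At(Home)", "Cash"}), {"At(SFO)"}, refinements, effects)
```

As in the figure, this is tree search: with a recursive hierarchy and an unreachable goal it would not terminate, which is one reason a graph-search or depth-limited variant may be preferable.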
In essence, this form of hierarchical search explores the space of sequences that conform to the knowledge contained in the HLA library about how things are to be done. A great deal of knowledge can be encoded, not just in the action sequences specified in each refinement but also in the preconditions for the refinements. For some domains, HTN planners have been able to generate huge plans with very little search. For example, O-PLAN (Bell and Tate, 1985), which combines HTN planning with scheduling, has been used to develop production plans for Hitachi. A typical problem involves a product line of 350 different products, 35 assembly machines, and over 2000 different operations. The planner generates a 30-day schedule with three 8-hour shifts a day, involving tens of millions of steps. Another important
aspect of HTN plans is that they are, by definition, hierarchically structured; usually this makes them easy for humans to understand.
The computational benefits of hierarchical search can be seen by examining an idealized case. Suppose that a planning problem has a solution with d primitive actions. For a nonhierarchical, forward state-space planner with b allowable actions at each state, the cost is O(b^d), as explained in Chapter 3. For an HTN planner, let us suppose a very regular refinement structure: each nonprimitive action has r possible refinements, each into k actions at the next lower level. We want to know how many different refinement trees there are with this structure. Now, if there are d actions at the primitive level, then the number of levels below the root is log_k d, so the number of internal refinement nodes is
1 + k + k^2 + ··· + k^(log_k d − 1) = (d − 1)/(k − 1).
Each internal node has r possible refinements, so r^((d−1)/(k−1)) possible decomposition trees could be constructed.
Examining this formula, we see that keeping r small and k large can result in huge savings: we are taking the kth root of the nonhierarchical cost, if b and r are comparable. Small r and large k means a library of HLAs with a small number of refinements each yielding a long action sequence. This is not always possible: long action sequences that are usable across a wide range of problems are extremely rare.
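The counting argument can be checked numerically for small cases. The sketch assumes d is an exact power of k, as in the idealized structure above; the function names are hypothetical.

```python
def internal_nodes(d, k):
    """Sum 1 + k + k^2 + ... over the levels above the d primitive actions."""
    total, level = 0, 1
    while level < d:
        total += level
        level *= k
    return total

def decomposition_trees(d, k, r):
    """Number of refinement trees when each internal node has r refinements."""
    return r ** internal_nodes(d, k)

nodes = internal_nodes(16, 4)                  # 5, which equals (16 - 1) / (4 - 1)
hierarchical = decomposition_trees(16, 4, 2)   # 2^5 = 32
flat = 2 ** 16                                 # b^d with b = 2: 65536
```

With b = r = 2, d = 16, and k = 4, the hierarchical count is 32 against 65,536 for the flat search, close to the fourth root (16) that the approximation (d − 1)/(k − 1) ≈ d/k predicts.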
The key to HTN planning is a plan library containing known methods for implementing complex, high-level actions. One way to construct the library is to learn the methods from problem-solving experience. After the excruciating experience of constructing a plan from scratch, the agent can save the plan in the library as a method for implementing the high-level action defined by the task. In this way, the agent can become more and more competent over time as new methods are built on top of old methods. One important aspect of this learning process is the ability to generalize the methods that are constructed, eliminating detail that is specific to the problem instance (e.g., the name of the builder or the address of the plot of land) and keeping just the key elements of the plan. It seems to us inconceivable that humans could be as competent as they are without some such mechanism.
11.4.3
Searching for abstract solutions
The hierarchical search algorithm in the preceding section refines HLAs all the way to primitive action sequences to determine if a plan is workable. This contradicts common sense: one should be able to determine that the two-HLA high-level plan
[Drive(Home,SFOLongTermParking), Shuttle(SFOLongTermParking, SFO)]
gets one to the airport without having to determine a precise route, choice of parking spot, and so on. The solution is to write precondition—effect descriptions of the HLAs, just as we
do for primitive actions. From the descriptions, it ought to be easy to prove that the high-level
plan achieves the goal. This is the holy grail, so to speak, of hierarchical planning, because if we derive a high-level plan that provably achieves the goal, working in a small search space of high-level actions, then we can commit to that plan and work on the problem of refining each step of the plan. This gives us the exponential reduction we seek.
For this to work, it has to be the case that every high-level plan that “claims” to achieve the goal (by virtue of the descriptions of its steps) does in fact achieve the goal in the sense defined earlier: it must have at least one implementation that does achieve the goal. This property has been called the downward refinement property for HLA descriptions.
Writing HLA descriptions that satisfy the downward refinement property is, in principle, easy: as long as the descriptions are true, then any high-level plan that claims to achieve the goal must in fact do so; otherwise, the descriptions are making some false claim about what the HLAs do. We have already seen how to write true descriptions for HLAs that have exactly one implementation (Exercise 11.HLAU); a problem arises when the HLA has multiple implementations. How can we describe the effects of an action that can be implemented in many different ways?
One safe answer (at least for problems where all preconditions and goals are positive) is to include only the positive effects that are achieved by every implementation of the HLA and the negative effects of any implementation. Then the downward refinement property would be satisfied. Unfortunately, this semantics for HLAs is much too conservative.
Consider again the HLA Go(Home, SFO), which has two refinements, and suppose, for the sake of argument, a simple world in which one can always drive to the airport and park, but taking a taxi requires Cash as a precondition. In that case, Go(Home, SFO) doesn’t always get you to the airport.
In particular, it fails if Cash is false, and so we cannot assert
At(Agent,SFO) as an effect of the HLA. This makes no sense, however; if the agent didn’t
have Cash, it would drive itself. Requiring that an effect hold for every implementation is
equivalent to assuming that someone else—an adversary—will choose the implementation.
Section 11.4
Hierarchical Planning
It treats the HLA’s multiple outcomes exactly as if the HLA were a nondeterministic action, as in Section 4.3. For our case, the agent itself will choose the implementation.

Figure 11.9 Schematic examples of reachable sets. The set of goal states is shaded in purple. Black and gray arrows indicate possible implementations of $h_1$ and $h_2$, respectively. (a) The reachable set of an HLA $h_1$ in a state s. (b) The reachable set for the sequence $[h_1, h_2]$. Because this intersects the goal set, the sequence achieves the goal.

The programming languages community has coined the term demonic nondeterminism for the case where an adversary makes the choices, contrasting this with angelic nondeterminism, where the agent itself makes the choices. We borrow this term to define angelic semantics for HLA descriptions. The basic concept required for understanding angelic semantics is the reachable set of an HLA: given a state s, the reachable set for an HLA h, written as REACH(s, h), is the set of states reachable by any of the HLA’s implementations.
The key idea is that the agent can choose which element of the reachable set it ends up in when it executes the HLA; thus, an HLA with multiple refinements is more “powerful” than the same HLA with fewer refinements. We can also define the reachable set of a sequence of HLAs. For example, the reachable set of a sequence $[h_1, h_2]$ is the union of all the reachable sets obtained by applying $h_2$ in each state in the reachable set of $h_1$:

$$\text{REACH}(s, [h_1, h_2]) = \bigcup_{s' \in \text{REACH}(s, h_1)} \text{REACH}(s', h_2).$$

Given these definitions, a high-level plan (a sequence of HLAs) achieves the goal if its reachable set intersects the set of goal states. (Compare this to the much stronger condition for demonic semantics, where every member of the reachable set has to be a goal state.) Conversely, if the reachable set doesn’t intersect the goal, then the plan definitely doesn’t work. Figure 11.9 illustrates these ideas.
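The difference between the angelic and demonic goal tests can be made concrete with a small sketch. It assumes a toy encoding that is not in the text: a state is a frozenset of true fluents, an HLA is a list of implementation functions, and the fluent names (AtHome, AtSFO, CarAtSFOParking, Cash) are hypothetical stand-ins for the Go(Home, SFO) example.

```python
from typing import Callable, FrozenSet, List, Optional, Sequence, Set

State = FrozenSet[str]                       # the set of fluents that are true
Impl = Callable[[State], Optional[State]]    # returns None when inapplicable
HLA = List[Impl]                             # an HLA, modelled by its implementations


def drive(s: State) -> Optional[State]:
    """Drive and park: assumed always possible in this toy world."""
    return (s - {"AtHome"}) | {"AtSFO", "CarAtSFOParking"}


def taxi(s: State) -> Optional[State]:
    """Take a taxi: requires Cash and spends it."""
    if "Cash" not in s:
        return None
    return (s - {"AtHome", "Cash"}) | {"AtSFO"}


GO_HOME_SFO: HLA = [drive, taxi]


def reach(s: State, h: HLA) -> Set[State]:
    """REACH(s, h): states reachable by any applicable implementation of h."""
    return {t for t in (impl(s) for impl in h) if t is not None}


def reach_seq(s: State, plan: Sequence[HLA]) -> Set[State]:
    """REACH(s, [h1, ..., hk]): apply the union formula one step at a time."""
    frontier: Set[State] = {s}
    for h in plan:
        frontier = set().union(*(reach(t, h) for t in frontier))
    return frontier


def achieves_angelic(s: State, plan: Sequence[HLA], is_goal) -> bool:
    """Angelic test: the reachable set intersects the goal."""
    return any(is_goal(t) for t in reach_seq(s, plan))


def achieves_demonic(s: State, plan: Sequence[HLA], is_goal) -> bool:
    """Demonic test: every choice an adversary could make must succeed."""
    if not plan:
        return is_goal(s)
    outcomes = [impl(s) for impl in plan[0]]
    return all(t is not None and achieves_demonic(t, plan[1:], is_goal)
               for t in outcomes)


no_cash = frozenset({"AtHome"})
at_airport = lambda s: "AtSFO" in s
print(achieves_angelic(no_cash, [GO_HOME_SFO], at_airport))   # True
print(achieves_demonic(no_cash, [GO_HOME_SFO], at_airport))   # False
```

Without Cash, the agent can still choose to drive, so the angelic test succeeds, while the demonic test fails because an adversary could pick the taxi.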
Chapter 11
Automated Planning
The notion of reachable sets yields a straightforward algorithm: search among high-level plans, looking for one whose reachable set intersects the goal; once that happens, the algorithm can commit to that abstract plan, knowing that it works, and focus on refining the plan further. We will return to the algorithmic issues later; for now consider how the effects of an HLA (the reachable set for each possible initial state) are represented. A primitive action can set a fluent to true or false or leave it unchanged. (With conditional effects (see Section 11.5.1) there is a fourth possibility: flipping a variable to its opposite.)
An HLA under angelic semantics can do more: it can control the value of a fluent, setting it to true or false depending on which implementation is chosen. That means that an HLA can have nine different effects on a fluent: if the variable starts out true, it can always keep it true, always make it false, or have a choice; if the fluent starts out false, it can always keep it false, always make it true, or have a choice; and the three choices for both cases can be combined arbitrarily, making nine.

Notationally, this is a bit challenging. We’ll use the language of add lists and delete lists (rather than true/false fluents) along with the ~ symbol to mean “possibly, if the agent so chooses.” Thus, the effect $\tilde{+}A$ means “possibly add A,” that is, either leave A unchanged or make it true. Similarly, $\tilde{-}A$ means “possibly delete A” and $\tilde{\pm}A$ means “possibly add or delete A.”
For example, the HLA Go(Home, SFO), with the two refinements shown in Figure 11.7, possibly deletes Cash (if the agent decides to take a taxi), so it should have the effect $\tilde{-}Cash$. Thus, we see that the descriptions of HLAs are derivable from the descriptions of their refinements. Now, suppose we have the following schemas for the HLAs $h_1$ and $h_2$:

Action($h_1$, PRECOND: $\lnot A$, EFFECT: $A \land \tilde{-}B$)
Action($h_2$, PRECOND: $\lnot B$, EFFECT: $\tilde{+}A \land \tilde{\pm}C$)

That is, $h_1$ adds A and possibly deletes B, while $h_2$ possibly adds A and has full control over C. Now, if only B is true in the initial state and the goal is $A \land C$ then the sequence $[h_1, h_2]$ achieves the goal: we choose an implementation of $h_1$ that makes B false, then choose an implementation of $h_2$ that leaves A true and makes C true.

The preceding discussion assumes that the effects of an HLA (the reachable set for any given initial state) can be described exactly by describing the effect on each fluent. It would be nice if this were always true, but in many cases we can only approximate the effects because an HLA may have infinitely many implementations and may produce arbitrarily wiggly reachable sets, rather like the wiggly-belief-state problem illustrated in Figure 7.21 on page 243. For example, we said that Go(Home, SFO) possibly deletes Cash; it also possibly adds At(Car, SFOLongTermParking); but it cannot do both; in fact, it must do exactly one.
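The $h_1$, $h_2$ example can be checked mechanically. The sketch below assumes a hypothetical record type, AngelicHLA, that stores definite and “possible” add and delete lists per fluent; it also assumes that the precondition of $h_1$ is $\lnot A$, which is what makes $h_1$ applicable when only B is true. It enumerates the reachable set by trying every combination of the agent's optional choices.

```python
from dataclasses import dataclass
from itertools import product
from typing import FrozenSet, Sequence, Set

State = FrozenSet[str]
EMPTY: FrozenSet[str] = frozenset()


@dataclass(frozen=True)
class AngelicHLA:
    """Hypothetical per-fluent angelic description of an HLA."""
    name: str
    pre_pos: FrozenSet[str] = EMPTY    # fluents that must be true
    pre_neg: FrozenSet[str] = EMPTY    # fluents that must be false
    add: FrozenSet[str] = EMPTY        # definitely added
    delete: FrozenSet[str] = EMPTY     # definitely deleted
    poss_add: FrozenSet[str] = EMPTY   # "possibly add"
    poss_del: FrozenSet[str] = EMPTY   # "possibly delete"


def reach(s: State, h: AngelicHLA) -> Set[State]:
    """Exact reachable set implied by the per-fluent description."""
    if not (h.pre_pos <= s) or (h.pre_neg & s):
        return set()
    base = (s - h.delete) | h.add
    optional = sorted(h.poss_add | h.poss_del)
    options = []
    for f in optional:
        opts = ["leave"]
        if f in h.poss_add:
            opts.append("add")
        if f in h.poss_del:
            opts.append("del")
        options.append(opts)
    states: Set[State] = set()
    for combo in product(*options):        # one choice per optional fluent
        t = set(base)
        for f, choice in zip(optional, combo):
            if choice == "add":
                t.add(f)
            elif choice == "del":
                t.discard(f)
        states.add(frozenset(t))
    return states


def reach_seq(s: State, plan: Sequence[AngelicHLA]) -> Set[State]:
    frontier: Set[State] = {s}
    for h in plan:
        frontier = set().union(*(reach(t, h) for t in frontier))
    return frontier


h1 = AngelicHLA("h1", pre_neg=frozenset({"A"}), add=frozenset({"A"}),
                poss_del=frozenset({"B"}))
h2 = AngelicHLA("h2", pre_neg=frozenset({"B"}), poss_add=frozenset({"A", "C"}),
                poss_del=frozenset({"C"}))

s0 = frozenset({"B"})
final = reach_seq(s0, [h1, h2])
print(any({"A", "C"} <= t for t in final))   # True: the goal A and C is reachable
```

Note that this per-fluent encoding treats every optional choice as independent, which is exactly why it cannot express “exactly one of the two” for Go(Home, SFO).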
As with belief states, we may need to write approximate descriptions. We will use two kinds of approximation: an optimistic description $\text{REACH}^+(s, h)$ of an HLA h may overstate the reachable set, while a pessimistic description $\text{REACH}^-(s, h)$ may understate the reachable set. Thus, we have

$$\text{REACH}^-(s, h) \subseteq \text{REACH}(s, h) \subseteq \text{REACH}^+(s, h).$$

For example, an optimistic description of Go(Home, SFO) says that it possibly deletes Cash and possibly adds At(Car, SFOLongTermParking). Another good example arises in the 8-puzzle, half of whose states are unreachable from any given state (see Exercise 11.PART): the optimistic description of Act might well include the whole state space, since the exact reachable set is quite wiggly.
Figure 11.10 Goal achievement for high-level plans with approximate descriptions. The set of goal states is shaded in purple. For each plan, the pessimistic (solid lines, light blue) and optimistic (dashed lines, light green) reachable sets are shown. (a) The plan indicated by the black arrow definitely achieves the goal, while the plan indicated by the gray arrow definitely doesn’t. (b) A plan that possibly achieves the goal (the optimistic reachable set intersects the goal) but does not necessarily achieve the goal (the pessimistic reachable set does not intersect the goal). The plan would need to be refined further to determine if it really does achieve the goal.

With approximate descriptions, the test for whether a plan achieves the goal needs to be modified slightly. If the optimistic reachable set for the plan doesn’t intersect the goal, then the plan doesn’t work; if the pessimistic reachable set intersects the goal, then the plan does work (Figure 11.10(a)). With exact descriptions, a plan either works or it doesn’t, but with approximate descriptions, there is a middle ground: if the optimistic set intersects the goal but the pessimistic set doesn’t, then we cannot tell if the plan works (Figure 11.10(b)).
When this circumstance arises, the uncertainty can be resolved by refining the plan. This is a very common situation in human reasoning. For example, in planning the aforementioned
two-week Hawaii vacation, one might propose to spend two days on each of seven islands.
Prudence would indicate that this ambitious plan needs to be refined by adding details of inter-island transportation.
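The three-way test described above can be written as a short sketch. The names Verdict and classify are hypothetical, and states are represented as arbitrary hashable values held in Python sets.

```python
from enum import Enum
from typing import AbstractSet, Hashable


class Verdict(Enum):
    WORKS = "definitely achieves the goal"
    FAILS = "definitely does not achieve the goal"
    UNKNOWN = "must be refined further"


def classify(optimistic: AbstractSet[Hashable],
             pessimistic: AbstractSet[Hashable],
             goal: AbstractSet[Hashable]) -> Verdict:
    """Goal test for a high-level plan with approximate reachable sets."""
    if not pessimistic <= optimistic:
        raise ValueError("expected REACH- to be a subset of REACH+")
    if pessimistic & goal:
        return Verdict.WORKS       # some guaranteed-reachable state is a goal
    if not (optimistic & goal):
        return Verdict.FAILS       # even the overestimate misses the goal
    return Verdict.UNKNOWN         # optimistic hits, pessimistic misses


goal_states = {7, 8, 9}
print(classify({1, 2, 7}, {1, 7}, goal_states))   # Verdict.WORKS
print(classify({1, 2, 3}, {1}, goal_states))      # Verdict.FAILS
print(classify({1, 2, 7}, {1}, goal_states))      # Verdict.UNKNOWN
```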
An algorithm for hierarchical planning with approximate angelic descriptions is shown in Figure 11.11. For simplicity, we have kept to the same overall scheme used previously in Figure 11.8, that is, a breadth-first search in the space of refinements. As just explained, the algorithm can detect plans that will and won’t work by checking the intersections of the optimistic and pessimistic reachable sets with the goal. (The details of how to compute the reachable sets of a plan, given approximate descriptions of each step, are covered in Exercise 11.LHLAP.)
When a workable abstract plan is found, the algorithm decomposes the original problem into subproblems, one for each step of the plan. The initial state and goal for each subproblem are obtained by regressing a guaranteed-reachable goal state through the action schemas for each step of the plan. (See Section 11.2.2 for a discussion of how regression works.) Figure 11.9(b) illustrates the basic idea: the right-hand circled state is the guaranteed-reachable
goal state, and the left-hand circled state is the intermediate goal obtained by regressing the
goal through the final action.
function ANGELIC-SEARCH(problem, hierarchy, initialPlan) returns solution or fail
  frontier ← a FIFO queue with initialPlan as the only element
  while true do
    if EMPTY?(frontier) then return fail
    plan ← POP(frontier)   // chooses the shallowest node in frontier
    if REACH⁺(problem.INITIAL, plan) intersects problem.GOAL then
      if plan is primitive then return plan   // REACH⁺ is exact for primitive plans
      guaranteed ← REACH⁻(problem.INITIAL, plan) ∩ problem.GOAL
      if guaranteed ≠ { } and MAKING-PROGRESS(plan, initialPlan) then
        finalState ← any element of guaranteed
        return DECOMPOSE(hierarchy, problem.INITIAL, plan, finalState)
      hla ← some HLA in plan
      prefix, suffix ← the action subsequences before and after hla in plan
      outcome ← RESULT(problem.INITIAL, prefix)
      for each sequence in REFINEMENTS(hla, outcome, hierarchy) do
        frontier ← INSERT(APPEND(prefix, sequence, suffix), frontier)

function DECOMPOSE(hierarchy, s0, plan, sf) returns a solution
  solution ← an empty plan
  while plan is not empty do
    action ← REMOVE-LAST(plan)
    si ← a state in REACH⁻(s0, plan) such that sf ∈ REACH⁻(si, action)
    problem ← a problem with INITIAL = si and GOAL = sf
    solution ← APPEND(ANGELIC-SEARCH(problem, hierarchy, action), solution)
    sf ← si
  return solution
Figure 11.11 A hierarchical planning algorithm that uses angelic semantics to identify and commit to high-level plans that work while avoiding high-level plans that don’t. The predicate MAKING-PROGRESS
checks to make sure that we aren’t stuck in an infinite regression
of refinements. At top level, call ANGELIC-SEARCH with [Act] as the initialPlan.
The ability to commit to or reject high-level plans can give ANGELIC-SEARCH a significant computational advantage over HIERARCHICAL-SEARCH, which in turn may have a large advantage over plain old BREADTH-FIRST-SEARCH. Consider, for example, cleaning up a large vacuum world consisting of an arrangement of rooms connected by narrow corridors, where each room is a w × h rectangle of squares. It makes sense to have an HLA for Navigate (as shown in Figure 11.7) and one for CleanWholeRoom. (Cleaning the room could be implemented with the repeated application of another HLA to clean each row.) Since there are five primitive actions, the cost for BREADTH-FIRST-SEARCH grows as $5^d$, where d is the length of the shortest solution (roughly twice the total number of squares); the algorithm cannot manage even two 3 × 3 rooms. HIERARCHICAL-SEARCH is more efficient, but still suffers from exponential growth because it tries all ways of cleaning that are consistent with the hierarchy. ANGELIC-SEARCH scales approximately linearly in the number of squares: it commits to a good high-level sequence of room-cleaning and navigation steps and prunes away the other options.
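A quick back-of-the-envelope calculation shows why breadth-first search is hopeless here. The sketch takes the text's rough estimate (five primitive actions, solution length about twice the number of squares) at face value and ignores the corridor squares; bfs_nodes is a hypothetical helper name.

```python
def bfs_nodes(num_squares: int, branching: int = 5) -> int:
    """Rough size of a breadth-first search tree: branching ** depth."""
    depth = 2 * num_squares   # assumption from the text: d is about twice the squares
    return branching ** depth


for rooms in (1, 2):
    squares = rooms * 3 * 3
    print(f"{rooms} room(s), {squares} squares: "
          f"about {bfs_nodes(squares):.2e} nodes")
```

Under these assumptions, two 3 × 3 rooms give d ≈ 36 and more than 10^25 nodes, whereas a cost that is linear in the number of squares only doubles when the number of rooms doubles.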
Cleaning a set of rooms by cleaning each room in turn is hardly rocket science: it is
easy for humans because of the hierarchical structure of the task. When we consider how
difficult humans find it to solve small puzzles such as the 8-puzzle, it seems likely that the
human capacity for solving complex problems derives not from considering combinatorics, but rather from skill in abstracting and decomposing problems to eliminate combinatorics.
The angelic approach can be extended to find least-cost solutions by generalizing the notion of reachable set. Instead of a state being reachable or not, each state will have a cost for the most efficient way to get there. (The cost is infinite for unreachable states.) The optimistic and pessimistic descriptions bound these costs. In this way, angelic search can find provably optimal abstract plans without having to consider their implementations. The same approach can be used to obtain effective hierarchical look-ahead algorithms for online search, in the style of LRTA* (page 140). In some ways, such algorithms mirror aspects of human deliberation in tasks such as planning a vacation to Hawaii: consideration of alternatives is done initially at an abstract level over long time scales; some parts of the plan are left quite abstract until execution time, such as how to spend two lazy days on Moloka'i, while other parts are planned in detail, such as the flights to be taken and lodging to be reserved. Without these latter refinements, there is no guarantee that the plan would be feasible.

11.5 Planning and Acting in Nondeterministic Domains
In this section we extend planning to handle partially observable, nondeterministic, and unknown environments. The basic concepts mirror those in Chapter 4, but there are differences arising from the use of factored representations rather than atomic representations. This affects the way we represent the agent’s capability for action and observation and the way we represent belief states (the sets of possible physical states the agent might be in) for partially observable environments. We can also take advantage of many of the domain-independent methods given in Section 11.3 for calculating search heuristics.

We will cover sensorless planning (also known as conformant planning) for environments with no observations; contingency planning for partially observable and nondeterministic environments; and online planning and replanning for unknown environments. This will allow us to tackle sizable real-world problems.
Consider this problem: given a chair and a table, the goal is to have them match, that is, have the same color. In the initial state we have two cans of paint, but the colors of the paint and the furniture are unknown. Only the table is initially in the agent’s field of view:

Init(Object(Table) ∧ Object(Chair) ∧ Can(C1) ∧ Can(C2) ∧ InView(Table))
Goal(Color(Chair, c) ∧ Color(Table, c))

There are two actions: removing the lid from a paint can and painting an object using the paint from an open can.

Action(RemoveLid(can),
  PRECOND: Can(can)
  EFFECT: Open(can))
Action(Paint(x, can),
  PRECOND: Object(x) ∧ Can(can) ∧ Color(can, c) ∧ Open(can)
  EFFECT: Color(x, c))
The action schemas are straightforward, with one exception: preconditions and effects now may contain variables that are not part of the action’s variable list. That is, Paint(x, can) does not mention the variable c, representing the color of the paint in the can. In the fully observable case, this is not allowed; we would have to name the action Paint(x, can, c). But in the partially observable case, we might or might not know what color is in the can.
To solve a partially observable problem, the agent will have to reason about the percepts it will obtain when it is executing the plan. The percept will be supplied by the agent’s sensors when it is actually acting, but when it is planning it will need a model of its sensors. In Chapter 4, this model was given by a function, PERCEPT(s). For planning, we augment PDDL with a new type of schema, the percept schema:

Percept(Color(x, c),
  PRECOND: Object(x) ∧ InView(x))
Percept(Color(can, c),
  PRECOND: Can(can) ∧ InView(can) ∧ Open(can))
The first schema says that whenever an object is in view, the agent will perceive the color of the object (that is, for the object x, the agent will learn the truth value of Color(x, c) for all c). The second schema says that if an open can is in view, then the agent perceives the color of the paint in the can. Because there are no exogenous events in this world, the color of an object will remain the same, even if it is not being perceived, until the agent performs an action to change the object’s color. Of course, the agent will need an action that causes objects (one at a time) to come into view:

Action(LookAt(x),
  PRECOND: InView(y) ∧ (x ≠ y)
  EFFECT: InView(x) ∧ ¬InView(y))
For a fully observable environment, we would have a Percept schema with no preconditions
for each fluent. A sensorless agent, on the other hand, has no Percept schemas at all. Note
that even a sensorless agent can solve the painting problem. One solution is to open any can of paint and apply it to both chair and table, thus coercing them to be the same color (even
though the agent doesn’t know what the color is). A contingent planning agent with sensors can generate a better plan. First, look at the
table and chair to obtain their colors; if they are already the same then the plan is done. If
not, look at the paint cans; if the paint in a can is the same color as one piece of furniture,
then apply that paint to the other piece. Otherwise, paint both pieces with any color.
Finally, an online planning agent might generate a contingent plan with fewer branches
at first—perhaps ignoring the possibility that no cans match any of the furniture—and deal
with problems when they arise by replanning. It could also deal with incorrectness of its
action schemas.
Whereas a contingent planner simply assumes that the effects of an action
always succeed—that painting the chair does the job—a replanning agent would check the result and make an additional plan to fix any unexpected failure, such as an unpainted area or the original color showing through. In the real world, agents use a combination of approaches. Car manufacturers sell spare tires and air bags, which are physical embodiments of contingent plan branches designed to handle punctures or crashes.
On the other hand, most car drivers never consider these possibilities; when a problem arises they respond as replanning agents. In general, agents plan only for contingencies that have important consequences and a nonnegligible chance of happening. Thus, a car driver contemplating a trip across the Sahara desert should make explicit contingency plans for breakdowns, whereas a trip to the supermarket requires less advance planning. We next look at each of the three approaches in more detail.

11.5.1 Sensorless planning
Section 4.4.1 (page 126) introduced the basic idea of searching in belief-state space to find a solution for sensorless problems. Conversion of a sensorless planning problem to a belief-state planning problem works much the same way as it did in Section 4.4.1; the main differences are that the underlying physical transition model is represented by a collection of action schemas, and the belief state can be represented by a logical formula instead of by an explicitly enumerated set of states. We assume that the underlying planning problem is deterministic.
The initial belief state for the sensorless painting problem can ignore InView fluents because the agent has no sensors. Furthermore, we take as given the unchanging facts Object(Table) ∧ Object(Chair) ∧ Can(C1) ∧ Can(C2) because these hold in every belief state. The agent doesn’t know the colors of the cans or the objects, or whether the cans are open or closed, but it does know that objects and cans have colors: ∀x ∃c Color(x, c). After Skolemizing (see Section 9.5.1), we obtain the initial belief state:

$b_0$ = Color(x, C(x)).
In classical planning, where the closed-world assumption is made, we would assume that any fluent not mentioned in a state is false, but in sensorless (and partially observable) planning we have to switch to an open-world assumption in which states contain both positive and negative fluents, and if a fluent does not appear, its value is unknown. Thus, the belief state corresponds exactly to the set of possible worlds that satisfy the formula. Given this initial belief state, the following action sequence is a solution:

[RemoveLid(Can1), Paint(Chair, Can1), Paint(Table, Can1)].

We now show how to progress the belief state through the action sequence to show that the final belief state satisfies the goal.
First, note that in a given belief state b, the agent can consider any action whose preconditions are satisfied by b. (The other actions cannot be used because the transition model doesn’t define the effects of actions whose preconditions might be unsatisfied.) According to Equation (4.4) (page 127), the general formula for updating the belief state b given an applicable action a in a deterministic world is as follows:

$$b' = \text{RESULT}(b, a) = \{s' : s' = \text{RESULT}_P(s, a) \text{ and } s \in b\}$$

where $\text{RESULT}_P$ defines the physical transition model. For the time being, we assume that the initial belief state is always a conjunction of literals, that is, a 1-CNF formula. To construct the new belief state $b'$, we must consider what happens to each literal $\ell$ in each physical state s in b when action a is applied. For literals whose truth value is already known in b, the truth value in $b'$ is computed from the current value and the add list and delete list of the action. (For example, if $\ell$ is in the delete list of the action, then $\lnot\ell$ is added to $b'$.) What about a literal whose truth value is unknown in b? There are three cases:

1. If the action adds $\ell$, then $\ell$ will be true in $b'$ regardless of its initial value.
2. If the action deletes $\ell$, then $\ell$ will be false in $b'$ regardless of its initial value.
3. If the action does not affect $\ell$, then $\ell$ will retain its initial value (which is unknown) and will not appear in $b'$.

Hence, we see that the calculation of $b'$ is almost identical to the observable case, which was specified by Equation (11.1) on page 345:

$$b' = \text{RESULT}(b, a) = (b - \text{DEL}(a)) \cup \text{ADD}(a).$$

We cannot quite use the set semantics because (1) we must make sure that $b'$ does not contain both $\ell$ and $\lnot\ell$, and (2) atoms may contain unbound variables. But it is still the case that RESULT(b, a) is computed by starting with b, setting any atom that appears in DEL(a) to false, and setting any atom that appears in ADD(a) to true. For example, if we apply RemoveLid(Can1) to the initial belief state $b_0$, we get

$b_1$ = Color(x, C(x)) ∧ Open(Can1).

When we apply the action Paint(Chair, Can1), the precondition Color(Can1, c) is satisfied by the literal Color(x, C(x)) with binding {x/Can1, c/C(Can1)} and the new belief state is

$b_2$ = Color(x, C(x)) ∧ Open(Can1) ∧ Color(Chair, C(Can1)).

Finally, we apply the action Paint(Table, Can1) to obtain

$b_3$ = Color(x, C(x)) ∧ Open(Can1) ∧ Color(Chair, C(Can1)) ∧ Color(Table, C(Can1)).
The final belief state satisfies the goal, Color(Table, c) ∧ Color(Chair, c), with the variable c bound to C(Can1).

The preceding analysis of the update rule has shown a very important fact: the family of belief states defined as conjunctions of literals is closed under updates defined by PDDL action schemas. That is, if the belief state starts as a conjunction of literals, then any update will yield a conjunction of literals. That means that in a world with n fluents, any belief state can be represented by a conjunction of size O(n). This is a very comforting result, considering that there are $2^n$ states in the world. It says we can compactly represent all the subsets of those $2^n$ states that we will ever need. Moreover, the process of checking for belief states that are subsets or supersets of previously visited belief states is also easy, at least in the propositional case.
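The update rule for 1-CNF belief states is easy to sketch for the propositional case. The code below is a simplified, ground version of the painting example: it assumes the Skolem term C(Can1) is treated as an ordinary constant, so that each atom becomes a plain string, and the Action tuple and helper names are hypothetical. A belief state is a dictionary of known literals; a fluent that is absent is unknown.

```python
from typing import Dict, FrozenSet, NamedTuple

Belief = Dict[str, bool]   # known literals only; a missing fluent is unknown


class Action(NamedTuple):
    name: str
    precond: Dict[str, bool]
    add: FrozenSet[str]
    delete: FrozenSet[str]


def applicable(b: Belief, a: Action) -> bool:
    """An action is usable only if every precondition literal is known to hold."""
    return all(f in b and b[f] == v for f, v in a.precond.items())


def progress(b: Belief, a: Action) -> Belief:
    """b' = (b - DEL(a)) U ADD(a), keeping b' free of complementary literals."""
    if not applicable(b, a):
        raise ValueError(f"{a.name} is not applicable")
    new = dict(b)
    for f in a.delete:
        new[f] = False      # deleted fluents become known false
    for f in a.add:
        new[f] = True       # added fluents become known true
    return new              # unaffected unknown fluents stay absent


PAINT_IN_CAN1 = "Color(Can1,C(Can1))"   # true by definition of the Skolem term
b0: Belief = {PAINT_IN_CAN1: True}

remove_lid = Action("RemoveLid(Can1)", {}, frozenset({"Open(Can1)"}), frozenset())
paint_chair = Action("Paint(Chair,Can1)",
                     {"Open(Can1)": True, PAINT_IN_CAN1: True},
                     frozenset({"Color(Chair,C(Can1))"}), frozenset())
paint_table = Action("Paint(Table,Can1)",
                     {"Open(Can1)": True, PAINT_IN_CAN1: True},
                     frozenset({"Color(Table,C(Can1))"}), frozenset())

b = b0
for action in (remove_lid, paint_chair, paint_table):
    b = progress(b, action)

goal_holds = b.get("Color(Chair,C(Can1))") and b.get("Color(Table,C(Can1))")
print(goal_holds)   # True
```

Because each update touches only the fluents in the add and delete lists, the representation never grows beyond one entry per fluent, which is the O(n) bound mentioned above.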
The fly in the ointment of this pleasant picture is that it only works for action schemas that have the same effects for all states in which their preconditions are satisfied. It is this property that enables the preservation of the 1-CNF belief-state representation. As soon as the effect can depend on the state, dependencies are introduced between fluents, and the 1-CNF property is lost.
Consider, for example, the simple vacuum world defined in Section 3.2.1. Let the fluents be AtL and AtR for the location of the robot and CleanL and CleanR for the state of the squares. According to the definition of the problem, the Suck action has no precondition: it can always be done. The difficulty is that its effect depends on the robot’s location: when the robot is AtL, the result is CleanL, but when it is AtR, the result is CleanR. For such actions, our action schemas will need something new: a conditional effect. These have the syntax “when condition: effect,” where condition is a logical formula to be compared against the current state, and effect is a formula describing the resulting state. For the vacuum world:

Action(Suck,
  EFFECT: when AtL: CleanL ∧ when AtR: CleanR).

When applied to the initial belief state True, the resulting belief state is (AtL ∧ CleanL) ∨ (AtR ∧ CleanR), which is no longer in 1-CNF. (This transition can be seen in Figure 4.14 on page 129.) In general, conditional effects can induce arbitrary dependencies among the fluents in a belief state, leading to belief states of exponential size in the worst case.
It is important to understand the difference between preconditions and conditional effects. All conditional effects whose conditions are satisfied have their effects applied to generate the resulting belief state; if none are satisfied, then the resulting state is unchanged. On the other hand, if a precondition is unsatisfied, then the action is inapplicable and the resulting state is undefined. From the point of view of sensorless planning, it is better to have conditional effects than an inapplicable action. For example, we could split Suck into two actions with unconditional effects as follows:

Action(SuckL,
  PRECOND: AtL; EFFECT: CleanL)
Action(SuckR,
  PRECOND: AtR; EFFECT: CleanR).

Now we have only unconditional schemas, so the belief states all remain in 1-CNF; unfortunately, we cannot determine the applicability of SuckL and SuckR in the initial belief state.
It seems inevitable, then, that nontrivial problems will involve wiggly belief states, just
like those encountered when we considered the problem of state estimation for the wumpus
world (see Figure 7.21 on page 243). The solution suggested then was to use a conservative
approximation to the exact belief state; for example, the belief state can remain in 1-CNF
if it contains all literals whose truth values can be determined and treats all other literals as
unknown.
While this approach is sound, in that it never generates an incorrect plan, it is
incomplete because it may be unable to find solutions to problems that necessarily involve interactions among literals.
To give a trivial example, if the goal is for the robot to be on
a clean square, then [Suck] is a solution but a sensorless agent that insists on 1-CNF belief
states will not find it.
Perhaps a better solution is to look for action sequences that keep the belief state as simple as possible. In the sensorless vacuum world, the action sequence [Right, Suck, Left, Suck] generates the following sequence of belief states:

$b_0$ = True
$b_1$ = AtR
$b_2$ = AtR ∧ CleanR
$b_3$ = AtL ∧ CleanR
$b_4$ = AtL ∧ CleanR ∧ CleanL

That is, the agent can solve the problem while retaining a 1-CNF belief state, even though some sequences (e.g., those beginning with Suck) go outside 1-CNF. The general lesson is not lost on humans: we are always performing little actions (checking the time, patting our pockets to make sure we have the car keys, reading street signs as we navigate through a city) to eliminate uncertainty and keep our belief state manageable.
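Both observations (Suck from the initial belief state leaves 1-CNF, while [Right, Suck, Left, Suck] stays inside it) can be verified by brute force on explicit state sets. The sketch assumes a hypothetical encoding of a physical state as a tuple (location, CleanL, CleanR); a belief state is expressible as a conjunction of literals exactly when it equals the full product of the values each component takes.

```python
from itertools import product
from typing import Set, Tuple

State = Tuple[str, bool, bool]   # (robot location "L" or "R", CleanL, CleanR)


def result_p(s: State, action: str) -> State:
    """Physical transition model, with Suck as a conditional effect."""
    loc, clean_l, clean_r = s
    if action == "Left":
        return ("L", clean_l, clean_r)
    if action == "Right":
        return ("R", clean_l, clean_r)
    if action == "Suck":
        return (loc, True, clean_r) if loc == "L" else (loc, clean_l, True)
    raise ValueError(action)


def result(b: Set[State], action: str) -> Set[State]:
    """Belief-state update: apply the action in every possible physical state."""
    return {result_p(s, action) for s in b}


def is_1cnf(b: Set[State]) -> bool:
    """True if b is exactly the product of its per-component value sets."""
    hull_size = 1
    for i in range(3):
        hull_size *= len({s[i] for s in b})
    return hull_size == len(b)


b0 = set(product("LR", [False, True], [False, True]))   # belief state True

print(is_1cnf(result(b0, "Suck")))   # False: (AtL and CleanL) or (AtR and CleanR)

b = b0
for action in ["Right", "Suck", "Left", "Suck"]:
    b = result(b, action)
    print(action, is_1cnf(b))        # stays True at every step
print(b)                             # {('L', True, True)}
```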
There is another, quite different approach to the problem of unmanageably wiggly belief states: don’t bother computing them at all. Suppose the initial belief state is $b_0$ and we would like to know the belief state resulting from the action sequence $[a_1, \ldots, a_m]$. Instead of computing it explicitly, just represent it as “$b_0$ then $[a_1, \ldots, a_m]$.” This is a lazy but unambiguous representation of the belief state, and it’s quite concise: O(n + m) where n is the size of the initial belief state (assumed to be in 1-CNF) and m is the maximum length of an action sequence. As a belief-state representation, it suffers from one drawback, however: determining whether the goal is satisfied, or an action is applicable, may require a lot of computation.
The computation can be implemented as an entailment test: if $A_m$ represents the collection of successor-state axioms required to define occurrences of the actions $a_1, \ldots, a_m$ (as explained for SATPLAN in Section 11.2.3) and $G_m$ asserts that the goal is true after m steps, then the plan achieves the goal if $b_0 \land A_m \models G_m$, that is, if $b_0 \land A_m \land \lnot G_m$ is unsatisfiable. Given a modern SAT solver, it may be possible to do this much more quickly than computing the full belief state. For example, if none of the actions in the sequence has a particular goal fluent in its add list, the solver will detect this immediately. It also helps if partial results about the belief state (for example, fluents known to be true or false) are cached to simplify subsequent computations.
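The entailment test can be illustrated with a tiny truth-table check; a real system would hand a CNF encoding to a SAT solver instead. Everything in this sketch is an assumption made for illustration: the time-indexed symbol names, the plan [Right, Suck] in the vacuum world, and the axioms, which are successor-state axioms already specialized to those two action occurrences.

```python
from itertools import product
from typing import Callable, Dict, List

Model = Dict[str, bool]
Formula = Callable[[Model], bool]

SYMBOLS = ["AtR0", "CleanR0", "AtR1", "CleanR1", "AtR2", "CleanR2"]


def satisfiable(formula: Formula, symbols: List[str]) -> bool:
    """Brute-force satisfiability: try every truth assignment."""
    return any(formula(dict(zip(symbols, values)))
               for values in product([False, True], repeat=len(symbols)))


def entails(kb: Formula, query: Formula, symbols: List[str]) -> bool:
    """kb entails query iff (kb and not query) is unsatisfiable."""
    return not satisfiable(lambda m: kb(m) and not query(m), symbols)


def b0(m: Model) -> bool:
    return True   # sensorless agent: nothing is known initially


def axioms(m: Model) -> bool:
    """Successor-state axioms for the occurrences Right at step 0, Suck at step 1."""
    return (
        m["AtR1"]                                           # Right moves the robot to R
        and m["CleanR1"] == m["CleanR0"]                    # Right leaves the dirt alone
        and m["AtR2"] == m["AtR1"]                          # Suck does not move the robot
        and m["CleanR2"] == (m["CleanR1"] or m["AtR1"])     # Suck cleans the current square
    )


def kb(m: Model) -> bool:
    return b0(m) and axioms(m)


goal_after_2 = lambda m: m["AtR2"] and m["CleanR2"]
print(entails(kb, goal_after_2, SYMBOLS))              # True
print(entails(kb, lambda m: m["CleanR1"], SYMBOLS))    # False: CleanR0 was unknown
```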
The final piece of the sensorless planning puzzle is a heuristic function to guide the search. The meaning of the heuristic function is the same as for classical planning: an estimate (perhaps admissible) of the cost of achieving the goal from the given belief state. With belief states, we have one additional fact: solving any subset of a belief state is necessarily easier than solving the belief state:

$$\text{if } b_1 \subseteq b_2 \text{ then } h^*(b_1) \le h^*(b_2).$$

Hence, any admissible heuristic computed for a subset is admissible for the belief state itself. The most obvious candidates are the singleton subsets, that is, individual physical states. We can take any random collection of states $s_1, \ldots, s_N$ that are in the belief state b, apply any admissible heuristic h, and return

$$H(b) = \max\{h(s_1), \ldots, h(s_N)\}$$

as the heuristic estimate for solving b. We can also use inadmissible heuristics such as the ignore-delete-lists heuristic (page 354), which seems to work quite well in practice.

11.5.2 Contingent planning
We saw in Chapter 4 that contingency planning (the generation of plans with conditional branching based on percepts) is appropriate for environments with partial observability, nondeterminism, or both. For the partially observable painting problem with the percept schemas given earlier, one possible conditional solution is as follows:

[LookAt(Table), LookAt(Chair),
  if Color(Table, c) ∧ Color(Chair, c) then NoOp
  else [RemoveLid(Can1), LookAt(Can1), RemoveLid(Can2), LookAt(Can2),
    if Color(Table, c) ∧ Color(can, c) then Paint(Chair, can)
    else if Color(Chair, c) ∧ Color(can, c) then Paint(Table, can)
    else [Paint(Chair, Can1), Paint(Table, Can1)]]]
Variables in this plan should be considered existentially quantified; the second line says that if there exists some color c that is the color of the table and the chair, then the agent need not do anything to achieve the goal. When executing this plan, a contingent-planning agent can maintain its belief state as a logical formula and evaluate each branch condition by determining if the belief state entails the condition formula or its negation. (It is up to the contingent-planning algorithm to make sure that the agent will never end up in a belief state where the condition formula’s truth value is unknown.) Note that with first-order conditions, the formula may be satisfied in more than one way; for example, the condition Color(Table, c) ∧ Color(can, c) might be satisfied by {can/Can1} and by {can/Can2} if both cans are the same color as the table. In that case, the agent can choose any satisfying substitution to apply to the rest of the plan.
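One way to convince oneself that the conditional plan is a solution is to execute it in every possible world. The sketch below assumes a hypothetical palette of three colors and represents a world by the four unknown colors; the LookAt and RemoveLid steps are implicit, since the simulation reads the colors directly at the point where the agent would have perceived them. It also counts Paint actions to compare against the sensorless plan.

```python
from itertools import product
from typing import Tuple

COLORS = ("red", "green", "blue")   # hypothetical palette, not from the text


def contingent_plan(table: str, chair: str, can1: str, can2: str) -> Tuple[str, str, int]:
    """Run the conditional plan; return (table color, chair color, paints used)."""
    # After LookAt(Table), LookAt(Chair) the agent knows both furniture colors.
    if table == chair:
        return table, chair, 0              # NoOp
    # After opening and looking at both cans the agent knows both paint colors.
    for paint in (can1, can2):
        if paint == table:                  # some can matches the table
            return table, paint, 1          # Paint(Chair, can)
    for paint in (can1, can2):
        if paint == chair:                  # some can matches the chair
            return paint, chair, 1          # Paint(Table, can)
    return can1, can1, 2                    # paint both pieces from Can1


def sensorless_plan(table: str, chair: str, can1: str, can2: str) -> Tuple[str, str, int]:
    """Open Can1 and paint both pieces without looking."""
    return can1, can1, 2


worlds = list(product(COLORS, repeat=4))
outcomes = [contingent_plan(*w) for w in worlds]
print(all(t == c for t, c, _ in outcomes))                 # True: goal holds everywhere
print(sum(n for _, _, n in outcomes) / len(worlds) < 2)    # True: fewer paints on average
```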
As shown in Section 4.4.2, calculating the new belief state after an action a and subsequent percept is done in two stages. The first stage calculates the belief state after the action, just as for the sensorless agent:
$\hat{b} = (b - \mathrm{DEL}(a)) \cup \mathrm{ADD}(a)$
where, as before, we have assumed a belief state represented as a conjunction of literals. The second stage is a little trickier. Suppose that percept literals $p_1, \ldots, p_k$ are received. One might think that we simply need to add these into the belief state; in fact, we can also infer that the preconditions for sensing are satisfied. Now, if a percept p has exactly one percept schema, Percept(p, PRECOND:c), where c is a conjunction of literals, then those literals can be thrown into the belief state along with p. On the other hand, if p has more than one percept schema whose preconditions might hold according to the predicted belief state $\hat{b}$, then we have to add in the disjunction of the preconditions. Obviously, this takes the belief state outside 1-CNF and brings up the same complications as conditional effects, with much the same classes of solutions.
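The two-stage update can be sketched as follows, under the simplifying assumptions that the belief state is a set of positive ground literals (written as strings) and that every percept has exactly one applicable schema. The function names and the literal strings are illustrative assumptions; the disjunctive case described above is deliberately not handled.

```python
from typing import Dict, Iterable, List, Set

Literal = str  # e.g. "InView(Can1)"; an assumed string encoding of ground literals

def predict(b: Set[Literal], delete: Set[Literal], add: Set[Literal]) -> Set[Literal]:
    """Stage 1: b_hat = (b - DEL(a)) | ADD(a)."""
    return (b - delete) | add

def incorporate(b_hat: Set[Literal], percepts: Iterable[Literal],
                schemas: Dict[Literal, List[Set[Literal]]]) -> Set[Literal]:
    """Stage 2: add each percept together with the precondition of its
    single schema. Several candidate schemas would require a disjunction,
    which this 1-CNF sketch does not represent."""
    new_b = set(b_hat)
    for p in percepts:
        options = schemas.get(p, [])
        if len(options) != 1:
            raise ValueError(f"percept {p!r} needs exactly one schema in this sketch")
        new_b.add(p)
        new_b |= options[0]
    return new_b

# Hypothetical grounded example: the agent looks at Can1 and sees that it is white.
b = {"Can(Can1)", "InView(Table)"}
b_hat = predict(b, delete={"InView(Table)"}, add={"InView(Can1)"})
schemas = {"Color(Can1,white)": [{"Can(Can1)", "InView(Can1)", "Open(Can1)"}]}
b_new = incorporate(b_hat, ["Color(Can1,white)"], schemas)
```

Note that `b_new` contains `Open(Can1)` even though no action asserted it: the agent infers that the sensing precondition must have held.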
Given a mechanism for computing exact or approximate belief states, we can generate contingent plans with an extension of the AND-OR forward search over belief states used in Section 4.4. Actions with nondeterministic effects (which are defined simply by using a disjunction in the EFFECT of the action schema) can be accommodated with minor changes to the belief-state update calculation and no change to the search algorithm.³ For the heuristic function, many of the methods suggested for sensorless planning are also applicable in the partially observable, nondeterministic case.

11.5.3 Online planning
Imagine watching a spot-welding robot in a car plant. The robot’s fast, accurate motions are
repeated over and over again as each car passes down the line. Although technically impressive, the robot probably does not seem at all intelligent because the motion is a fixed, preprogrammed sequence; the robot obviously doesn’t “know what it’s doing” in any meaningful sense.
Now suppose that a poorly attached door falls off the car just as the robot is about to apply a spot-weld. The robot quickly replaces its welding actuator with a gripper, picks up the door, checks it for scratches, reattaches it to the car, sends an email to the floor supervisor, switches back to the welding actuator, and resumes its work. All of a sudden, the robot's behavior seems purposive rather than rote; we assume it results not from a vast, precomputed contingent plan but from an online replanning process, which means that the robot does need to know what it's trying to do.

Replanning presupposes some form of execution monitoring to determine the need for a new plan. One such need arises when a contingent planning agent gets tired of planning for every little contingency, such as whether the sky might fall on its head.⁴ This means that the contingent plan is left in an incomplete form. For example, some branches of a partially constructed contingent plan can simply say Replan; if such a branch is reached during execution, the agent reverts to planning mode. As we mentioned earlier, the decision as to how much of the problem to solve in advance and how much to leave to replanning is one that involves tradeoffs among possible events with different costs and probabilities of occurring. Nobody wants to have a car break down in the middle of the Sahara desert and only then think about having enough water.

3 If cyclic solutions are required for a nondeterministic problem, AND-OR search must be generalized to a loopy version such as LAO* (Hansen and Zilberstein, 2001).
Replanning may be needed if the agent’s model of the world is incorrect. The model
for an action may have a missing precondition—for example, the agent may not know that
removing the lid of a paint can often requires a screwdriver. The model may have a missing effect—painting an object may get paint on the floor as well.
Or the model may have a
missing fluent that is simply absent from the representation altogether—for example, the model given earlier has no notion of the amount of paint in a can, of how its actions affect
this amount, or of the need for the amount to be nonzero. The model may also lack provision
for exogenous events such as someone knocking over the paint can. Exogenous events can
also include changes in the goal, such as the addition of the requirement that the table and
chair not be painted black. Without the ability to monitor and replan, an agent’s behavior is likely to be fragile if it relies on absolute correctness of its model.
The online agent has a choice of (at least) three different approaches for monitoring the environment during plan execution:
• Action monitoring: before executing an action, the agent verifies that all the preconditions still hold.
• Plan monitoring: before executing an action, the agent verifies that the remaining plan will still succeed.
• Goal monitoring: before executing an action, the agent checks to see if there is a better set of goals it could be trying to achieve.
In Figure 11.12 we see a schematic of action monitoring.
The agent keeps track of both its
original plan, whole plan, and the part of the plan that has not been executed yet, which is denoted by plan.
After executing the first few steps of the plan, the agent expects to be in
state E. But the agent observes that it is actually in state O. It then needs to repair the plan by
finding some point P on the original plan that it can get back to. (It may be that P is the goal
state, G.) The agent tries to minimize the total cost of the plan: the repair part (from O to P)
plus the continuation (from P to G).
4 In 1954, a Mrs. Hodges of Alabama was hit by a meteorite that crashed through her roof. In 1992, a piece of the Mbale meteorite hit a small boy on the head; fortunately, its descent was slowed by banana leaves (Jenniskens et al., 1994). And in 2009, a German boy claimed to have been hit in the hand by a pea-sized meteorite. No serious injuries resulted from any of these incidents, suggesting that the need for preplanning against such contingencies is sometimes overstated.
Figure 11.12 At first, the sequence "whole plan" is expected to get the agent from S to G. The agent executes steps of the plan until it expects to be in state E, but observes that it is actually in O. The agent then replans for the minimal repair plus continuation to reach G.

Now let's return to the example problem of achieving a chair and table of matching color. Suppose the agent comes up with this plan:
[LookAt(Table), LookAt(Chair),
 if Color(Table, c) ∧ Color(Chair, c) then NoOp
 else [RemoveLid(Can1), LookAt(Can1),
   if Color(Table, c) ∧ Color(Can1, c) then Paint(Chair, Can1)
   else REPLAN]].
Now the agent is ready to execute the plan. The agent observes that the table and can of paint are white and the chair is black. It then executes Paint(Chair, Can1). At this point a classical planner would declare victory; the plan has been executed. But an online execution monitoring agent needs to check that the action succeeded.

Suppose the agent perceives that the chair is a mottled gray because the black paint is showing through. The agent then needs to figure out a recovery position in the plan to aim for and a repair action sequence to get there. The agent notices that the current state is identical to the precondition before the Paint(Chair, Can1) action, so the agent chooses the empty sequence for repair and makes its plan be the same [Paint] sequence that it just attempted. With this new plan in place, execution monitoring resumes, and the Paint action is retried. This behavior will loop until the chair is perceived to be completely painted. But notice that the loop is created by a process of plan-execute-replan, rather than by an explicit loop in a plan. Note also that the original plan need not cover every contingency. If the agent reaches the step marked REPLAN, it can then generate a new plan (perhaps involving Can2).
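The plan-execute-replan loop in the painting example can be sketched as below. This is a simplified illustration, not the chapter's agent program: states are sets of ground literals, `execute` stands in for acting and then observing the world, and `replan` is a hypothetical callback that returns a repair plus continuation from the observed state. The simulated world in which the paint only covers on the third coat is likewise an assumption made for the demonstration.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List

@dataclass(frozen=True)
class Action:
    name: str
    precond: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str] = frozenset()

def monitor_and_execute(state: FrozenSet[str], plan: List[Action], goal: FrozenSet[str],
                        execute: Callable, replan: Callable, max_steps: int = 20) -> List[str]:
    """Run the plan, checking preconditions before each action and
    replanning from the observed state whenever the plan cannot continue."""
    log: List[str] = []
    plan = list(plan)
    for _ in range(max_steps):
        if goal <= state:                      # goal observed: stop
            return log
        if not plan or not plan[0].precond <= state:
            plan = list(replan(state, goal))   # repair from where we actually are
            continue
        action = plan.pop(0)
        state = execute(state, action)         # observed outcome may differ from the model
        log.append(action.name)
    raise RuntimeError("gave up after max_steps")

paint = Action("Paint(Chair,Can1)", frozenset({"Open(Can1)"}),
               frozenset({"Color(Chair,white)"}), frozenset({"Color(Chair,black)"}))
attempts = {"n": 0}

def flaky_execute(state, action):
    """Simulated world (assumption): the paint only covers on the third coat."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        return state                           # chair still looks unpainted
    return (state - action.delete) | action.add

start = frozenset({"Open(Can1)", "Color(Table,white)", "Color(Chair,black)"})
goal = frozenset({"Color(Table,white)", "Color(Chair,white)"})
log = monitor_and_execute(start, [paint], goal, flaky_execute, lambda s, g: [paint])
```

The retry behavior emerges from the monitoring loop rather than from any loop in the plan itself: `log` ends up containing the same Paint action three times.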
Action monitoring is a simple method of execution monitoring, but it can sometimes lead
to less than intelligent behavior. For example, suppose there is no black or white paint, and
the agent constructs a plan to solve the painting problem by painting both the chair and table
red. Suppose that there is only enough red paint for the chair. With action monitoring, the agent would go ahead and paint the chair red, then notice that it is out of paint and cannot
paint the table, at which point it would replan a repair—perhaps painting both chair and table
green. A plan-monitoring agent can detect failure whenever the current state is such that the
remaining plan no longer works. Thus, it would not waste time painting the chair red.
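One way to sketch the plan-monitoring check is to regress the goal backward through the remaining plan, which yields the conditions that must hold in the current state for the rest of the plan to succeed. The encoding below is an assumption made for illustration: actions have sets of positive literals for preconditions, add lists, and delete lists, and the limited red paint is modeled by two hypothetical unit fluents, `PaintUnit1` and `PaintUnit2`.

```python
from typing import FrozenSet, List, NamedTuple, Optional, Set

class Action(NamedTuple):
    name: str
    precond: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]

def required_now(remaining: List[Action], goal: Set[str]) -> Optional[Set[str]]:
    """Regress the goal through the remaining plan. Returns the literals that
    must hold now, or None if some step deletes a literal needed later."""
    needed = set(goal)
    for act in reversed(remaining):
        if needed & (act.delete - act.add):
            return None                      # a later requirement is clobbered
        needed = (needed - act.add) | act.precond
    return needed

def plan_still_valid(state: Set[str], remaining: List[Action], goal: Set[str]) -> bool:
    needed = required_now(remaining, goal)
    return needed is not None and needed <= state

paint_chair = Action("Paint(Chair,Red)", frozenset({"PaintUnit1"}),
                     frozenset({"Color(Chair,Red)"}), frozenset({"PaintUnit1"}))
paint_table = Action("Paint(Table,Red)", frozenset({"PaintUnit2"}),
                     frozenset({"Color(Table,Red)"}), frozenset({"PaintUnit2"}))
goal = {"Color(Chair,Red)", "Color(Table,Red)"}
plan = [paint_chair, paint_table]

enough = plan_still_valid({"PaintUnit1", "PaintUnit2"}, plan, goal)
short = plan_still_valid({"PaintUnit1"}, plan, goal)
```

With only one unit of red paint, `short` is false before any painting starts, so the agent can replan immediately instead of painting the chair first.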
Plan monitoring achieves this by checking the preconditions for success of the entire remaining plan; that is, the preconditions of each step in the plan, except those preconditions that are achieved by another step in the remaining plan. Plan monitoring cuts off execution of a doomed plan as soon as possible, rather than continuing until the failure actually occurs.⁵ Plan monitoring also allows for serendipity (accidental success). If someone comes along and paints the table red at the same time that the agent is painting the chair red, then the final plan preconditions are satisfied (the goal has been achieved), and the agent can go home early.

It is straightforward to modify a planning algorithm so that each action in the plan is annotated with the action's preconditions, thus enabling action monitoring. It is slightly more complex to enable plan monitoring. Partial-order planners have the advantage that they have already built up structures that contain the relations necessary for plan monitoring. Augmenting state-space planners with the necessary annotations can be done by careful bookkeeping as the goal fluents are regressed through the plan.

Now that we have described a method for monitoring and replanning, we need to ask, “Does it work?” This is a surprisingly tricky question. If we mean, “Can we guarantee that the agent will always achieve the goal?” then the answer is no, because the agent could
inadvertently arrive at a dead end from which there is no repair. For example, the vacuum
agent might have a faulty model of itself and not know that its batteries can run out. Once
they do, it cannot repair any plans. If we rule out dead ends—assume that there exists a plan to reach the goal from any state in the environment—and
assume that the environment is
really nondeterministic, in the sense that such a plan always has some chance of success on
any given execution attempt, then the agent will eventually reach the goal.
Trouble occurs when a seemingly-nondeterministic action is not actually random, but
rather depends on some precondition that the agent does not know about. For example, sometimes a paint can may be empty, so painting from that can has no effect.
No amount of retrying is going to change this.⁶ One solution is to choose randomly from among the set of possible repair plans, rather than to try the same one each time. In this case, the repair plan of opening another can might work. A better approach is to learn a better model. Every
prediction failure is an opportunity for learning; an agent should be able to modify its model
of the world to accord with its percepts. From then on, the replanner will be able to come up with a repair that gets at the root problem, rather than relying on luck to choose a good repair.
11.6 Time, Schedules, and Resources

Classical planning talks about what to do, in what order, but does not talk about time: how long an action takes and when it occurs. For example, in the airport domain we could produce a plan saying what planes go where, carrying what, but could not specify departure and arrival times. This is the subject matter of scheduling.

The real world also imposes resource constraints: an airline has a limited number of staff, and staff who are on one flight cannot be on another at the same time. This section introduces techniques for planning and scheduling problems with resource constraints.
5 Plan monitoring means that finally, after 374 pages, we have an agent that is smarter than a dung beetle (see page 41). A plan-monitoring agent would notice that the dung ball was missing from its grasp and would replan to get another ball and plug its hole.
6 Futile repetition of a plan repair is exactly the behavior exhibited by the sphex wasp (page 41).
Jobs({AddEngine1 ≺ AddWheels1 ≺ Inspect1},
     {AddEngine2 ≺ AddWheels2 ≺ Inspect2})
Resources(EngineHoists(1), WheelStations(1), Inspectors(2), LugNuts(500))
Action(AddEngine1, DURATION:30,
  USE:EngineHoists(1))
Action(AddEngine2, DURATION:60,
  USE:EngineHoists(1))
Action(AddWheels1, DURATION:30,
  CONSUME:LugNuts(20), USE:WheelStations(1))
Action(AddWheels2, DURATION:15,
  CONSUME:LugNuts(20), USE:WheelStations(1))
Action(Inspect_i, DURATION:10,
  USE:Inspectors(1))

Figure 11.13 A job-shop scheduling problem for assembling two cars, with resource constraints. The notation A ≺ B means that action A must precede action B.

The approach we take is “plan first, schedule later”: divide the overall problem into a planning phase in which actions are selected, with some ordering constraints, to meet the goals of the problem, and a later scheduling phase, in which temporal information is added to the plan to ensure that it meets resource and deadline constraints. This approach is common in real-world manufacturing and logistical settings, where the planning phase is sometimes automated, and sometimes performed by human experts.

11.6.1 Representing temporal and resource constraints
A typical job-shop scheduling problem (see Section 6.1.2) consists of a set of jobs, each of which has a collection of actions with ordering constraints among them. Each action has a duration and a set of resource constraints required by the action. A constraint specifies a type of resource (e.g., bolts, wrenches, or pilots), the number of that resource required, and whether that resource is consumable (e.g., the bolts are no longer available for use) or reusable (e.g., a pilot is occupied during a flight but is available again when the flight is over). Actions can also produce resources (e.g., manufacturing and resupply actions).

A solution to a job-shop scheduling problem specifies the start times for each action and must satisfy all the temporal ordering constraints and resource constraints. As with search and planning problems, solutions can be evaluated according to a cost function; this can be quite complicated, with nonlinear resource costs, time-dependent delay costs, and so on. For simplicity, we assume that the cost function is just the total duration of the plan, which is called the makespan.
Figure 11.13 shows a simple example: a problem involving the assembly of two cars. The problem consists of two jobs, each of the form [AddEngine, AddWheels, Inspect]. Then the Resources statement declares that there are four types of resources, and gives the number of each type available at the start: 1 engine hoist, 1 wheel station, 2 inspectors, and 500 lug nuts. The action schemas give the duration and resource needs of each action. The lug nuts are consumed as wheels are added to the car, whereas the other resources are “borrowed” at the start of an action and released at the action's end.
The representation of resources as numerical quantities, such as Inspectors(2), rather than as named entities, such as Inspector(I1) and Inspector(I2), is an example of a technique called aggregation: grouping individual objects into quantities when the objects are all indistinguishable. In our assembly problem, it does not matter which inspector inspects the car, so there is no need to make the distinction. Aggregation is essential for reducing complexity. Consider what happens when a proposed schedule has 10 concurrent Inspect actions but only 9 inspectors are available. With inspectors represented as quantities, a failure is detected immediately and the algorithm backtracks to try another schedule. With inspectors represented as individuals, the algorithm would try all 9! ways of assigning inspectors to actions before noticing that none of them work.

11.6.2 Solving scheduling problems
We begin by considering just the temporal scheduling problem, ignoring resource constraints. To minimize makespan (plan duration), we must find the earliest start times for all the actions consistent with the ordering constraints supplied with the problem. It is helpful to view these ordering constraints as a directed graph relating the actions, as shown in Figure 11.14. We can apply the critical path method (CPM) to this graph to determine the possible start and end times of each action. A path through a graph representing a partial-order plan is a linearly ordered sequence of actions beginning with Start and ending with Finish. (For example, there are two paths in the partial-order plan in Figure 11.14.)

The critical path is that path whose total duration is longest; the path is “critical” because it determines the duration of the entire plan: shortening other paths doesn't shorten the plan as a whole, but delaying the start of any action on the critical path slows down the whole plan. Actions that are off the critical path have a window of time in which they can be executed. The window is specified in terms of an earliest possible start time, ES, and a latest possible start time, LS. The quantity LS − ES is known as the slack of an action. We can see in Figure 11.14 that the whole plan will take 85 minutes, that each action in the top job has 15 minutes of slack, and that each action on the critical path has no slack (by definition). Together the ES and LS times for all the actions constitute a schedule for the problem.
The following formulas define ES and LS and constitute a dynamic-programming algorithm to compute them. A and B are actions, and A ≺ B means that A precedes B:
\[
\begin{aligned}
ES(\mathit{Start}) &= 0 \\
ES(B) &= \max_{A \prec B} \; ES(A) + \mathit{Duration}(A) \\
LS(\mathit{Finish}) &= ES(\mathit{Finish}) \\
LS(A) &= \min_{B \succ A} \; LS(B) - \mathit{Duration}(A)
\end{aligned}
\]
The idea is that we start by assigning ES(Start) to be 0. Then, as soon as we get an action
B such that all the actions that come immediately before B have ES values assigned, we
set ES(B) to be the maximum of the earliest finish times of those immediately preceding
actions, where the earliest finish time of an action is defined as the earliest start time plus
the duration. This process repeats until every action has been assigned an ES value. The LS values are computed in a similar manner, working backward from the Finish action.
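The ES and LS recurrences can be sketched directly in code. The following is a minimal illustration using the durations and ordering constraints of Figure 11.13; the function name `critical_path`, the dictionary encoding, and the zero-duration Start and Finish nodes are assumptions of this sketch. It uses the standard-library `graphlib.TopologicalSorter` (Python 3.9 or later) to visit each action after its predecessors.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Tuple

def critical_path(duration: Dict[str, int],
                  preds: Dict[str, List[str]]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Compute earliest (ES) and latest (LS) start times.
    `preds` maps every action to its immediate predecessors."""
    order = list(TopologicalSorter(preds).static_order())  # predecessors first
    es: Dict[str, int] = {}
    for b in order:                       # forward pass
        es[b] = max((es[a] + duration[a] for a in preds[b]), default=0)
    succs: Dict[str, List[str]] = {a: [] for a in order}
    for b in order:
        for a in preds[b]:
            succs[a].append(b)
    ls: Dict[str, int] = {}
    for a in reversed(order):             # backward pass
        if not succs[a]:
            ls[a] = es[a]                 # LS(Finish) = ES(Finish)
        else:
            ls[a] = min(ls[b] for b in succs[a]) - duration[a]
    return es, ls

duration = {"Start": 0, "AddEngine1": 30, "AddWheels1": 30, "Inspect1": 10,
            "AddEngine2": 60, "AddWheels2": 15, "Inspect2": 10, "Finish": 0}
preds = {"Start": [],
         "AddEngine1": ["Start"], "AddWheels1": ["AddEngine1"], "Inspect1": ["AddWheels1"],
         "AddEngine2": ["Start"], "AddWheels2": ["AddEngine2"], "Inspect2": ["AddWheels2"],
         "Finish": ["Inspect1", "Inspect2"]}
es, ls = critical_path(duration, preds)
slack = {a: ls[a] - es[a] for a in duration}
```

On this data `es["Finish"]` is 85, the three actions of the first car each have 15 minutes of slack, and the actions of the second car have none, so they form the critical path.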
The complexity of the critical path algorithm is just O(Nb), where N is the number of actions and b is the maximum branching factor into or out of an action. (To see this, note that the LS and ES computations are done once for each action, and each computation iterates over at most b other actions.) Therefore, finding a minimum-duration schedule, given a partial ordering on the actions and no resource constraints, is quite easy.

Figure 11.14 Top: a representation of the temporal constraints for the job-shop scheduling problem of Figure 11.13. The duration of each action is given at the bottom of each rectangle. In solving the problem, we compute the earliest and latest start times as the pair [ES, LS], displayed in the upper left. The difference between these two numbers is the slack of an action; actions with zero slack are on the critical path, shown with bold arrows. Bottom: the same solution shown as a timeline. Gray rectangles represent time intervals during which an action may be executed, provided that the ordering constraints are respected. The unoccupied portion of a gray rectangle indicates the slack.
Mathematically speaking, critical-path problems are easy to solve because they are de-
fined as a conjunction of linear inequalities on the start and end times. When we introduce
resource constraints, the resulting constraints on start and end times become more complicated. For example, the AddEngine actions, which begin at the same time in Figure 11.14, require the same EngineHoist and so cannot overlap. The “cannot overlap” constraint is a disjunction of two linear inequalities, one for each possible ordering. The introduction of disjunctions turns out to make scheduling with resource constraints NP-hard.
Figure 11.15 shows the solution with the fastest completion time, 115 minutes. This is 30 minutes longer than the 85 minutes required for a schedule without resource constraints. Notice that there is no time at which both inspectors are required, so we can immediately move one of our two inspectors to a more productive position.

Figure 11.15 A solution to the job-shop scheduling problem from Figure 11.13, taking into account resource constraints. The left-hand margin lists the three reusable resources, and actions are shown aligned horizontally with the resources they use. There are two possible schedules, depending on which assembly uses the engine hoist first; we've shown the shortest-duration solution, which takes 115 minutes.

There is a long history of work on optimal scheduling. A challenge problem posed in 1963 (to find the optimal schedule for a problem involving just 10 machines and 10 jobs of 100 actions each) went unsolved for 23 years (Lawler et al., 1993). Many approaches have been tried, including branch-and-bound, simulated annealing, tabu search, and constraint satisfaction. One popular approach is the minimum slack heuristic: on each iteration, schedule for the earliest possible start whichever unscheduled action has all its predecessors scheduled and has the least slack; then update the ES and LS times for each affected action and repeat. This greedy heuristic resembles the minimum-remaining-values (MRV) heuristic in constraint satisfaction. It often works well in practice, but for our assembly problem it yields a 130-minute solution, not the 115-minute solution of Figure 11.15.
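The gap between 130 and 115 minutes can be reproduced with a small greedy scheduler. This is a simplified sketch rather than the full heuristic: slack values are taken once from Figure 11.14 instead of being updated after every step, each action uses one unit of one reusable resource, and the consumable lug nuts are ignored because 500 are available and only 40 are needed. All names below (`earliest_fit`, `greedy_schedule`, the dictionaries) are illustrative assumptions.

```python
from typing import Callable, Dict, List, Optional, Tuple

DURATION = {"AddEngine1": 30, "AddWheels1": 30, "Inspect1": 10,
            "AddEngine2": 60, "AddWheels2": 15, "Inspect2": 10}
PRED: Dict[str, Optional[str]] = {
    "AddEngine1": None, "AddWheels1": "AddEngine1", "Inspect1": "AddWheels1",
    "AddEngine2": None, "AddWheels2": "AddEngine2", "Inspect2": "AddWheels2"}
USES = {"AddEngine1": "EngineHoists", "AddEngine2": "EngineHoists",
        "AddWheels1": "WheelStations", "AddWheels2": "WheelStations",
        "Inspect1": "Inspectors", "Inspect2": "Inspectors"}
CAPACITY = {"EngineHoists": 1, "WheelStations": 1, "Inspectors": 2}
# Static slack read off Figure 11.14 (an assumption: not recomputed during scheduling).
SLACK = {"AddEngine1": 15, "AddWheels1": 15, "Inspect1": 15,
         "AddEngine2": 0, "AddWheels2": 0, "Inspect2": 0}

def earliest_fit(bookings: List[Tuple[int, int]], capacity: int, ready: int, dur: int) -> int:
    """Earliest t >= ready at which one more unit is free throughout [t, t + dur)."""
    candidates = sorted({ready} | {end for (_, end) in bookings if end > ready})
    for t in candidates:
        # Usage can only peak at t or at the start of a booking inside the window.
        points = [t] + [s for (s, _) in bookings if t < s < t + dur]
        if all(sum(s <= p < e for (s, e) in bookings) < capacity for p in points):
            return t
    raise RuntimeError("no feasible start found")

def greedy_schedule(priority: Callable[[str], object]) -> Tuple[Dict[str, int], int]:
    """Repeatedly schedule the ready action with the smallest priority value."""
    start: Dict[str, int] = {}
    bookings: Dict[str, List[Tuple[int, int]]] = {r: [] for r in CAPACITY}
    while len(start) < len(DURATION):
        ready = [a for a in DURATION
                 if a not in start and (PRED[a] is None or PRED[a] in start)]
        a = min(ready, key=priority)
        p = PRED[a]
        pred_done = 0 if p is None else start[p] + DURATION[p]
        r = USES[a]
        t = earliest_fit(bookings[r], CAPACITY[r], pred_done, DURATION[a])
        start[a] = t
        bookings[r].append((t, t + DURATION[a]))
    return start, max(start[a] + DURATION[a] for a in start)

_, min_slack_makespan = greedy_schedule(lambda a: SLACK[a])
_, hoist_first_makespan = greedy_schedule(lambda a: (a != "AddEngine1", SLACK[a]))
```

Minimum slack gives the hoist to AddEngine2 first and finishes at 130 minutes; forcing AddEngine1 onto the hoist first, as in the second call, recovers the 115-minute schedule.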
Up to this point, we have assumed that the set of actions and ordering constraints is fixed. Under these assumptions, every scheduling problem can be solved by a nonoverlapping sequence that avoids all resource conflicts, provided that each action is feasible by itself. However, if a scheduling problem is proving very difficult, it may not be a good idea to solve it this way; it may be better to reconsider the actions and constraints, in case that leads to a much easier scheduling problem. Thus, it makes sense to integrate planning and scheduling by taking into account durations and overlaps during the construction of a plan. Several of the planning algorithms in Section 11.2 can be augmented to handle this information.
11.7 Analysis of Planning Approaches
Planning combines the two major areas of AI we have covered so far: search and logic. A planner can be seen either as a program that searches for a solution or as one that (constructively) proves the existence of a solution. The cross-fertilization of ideas from the two areas has allowed planners to scale up from toy problems where the number of actions and states was limited to around a dozen, to real-world industrial applications with millions of states and thousands of actions.

Planning is foremost an exercise in controlling combinatorial explosion. If there are n propositions in a domain, then there are $2^n$ states. Against such pessimism, the identification of independent subproblems can be a powerful weapon. In the best case (full decomposability of the problem) we get an exponential speedup. Decomposability is destroyed, however, by negative interactions between actions.
SATPLAN can encode logical relations between
subproblems. Forward search addresses the problem heuristically by trying to find patterns (subsets of propositions) that cover the independent subproblems.
Since this approach is
heuristic, it can work even when the subproblems are not completely independent.
Unfortunately, we do not yet have a clear understanding of which techniques work best on which kinds of problems. Quite possibly, new techniques will emerge, perhaps providing a synthesis of highly expressive first-order and hierarchical representations with the highly efficient factored and propositional representations that dominate today. We are seeing examples of portfolio planning systems, where a collection of algorithms are available to apply to any given problem. This can be done selectively (the system classifies each new problem to choose the best algorithm for it), or in parallel (all the algorithms run concurrently, each on a different CPU), or by interleaving the algorithms according to a schedule.
Summary
In this chapter, we described the PDDL representation for both classical and extended planning problems, and presented several algorithmic approaches for finding solutions. The points to remember:
• Planning systems are problem-solving algorithms that operate on explicit factored representations of states and actions. These representations make possible the derivation of effective domain-independent heuristics and the development of powerful and flexible algorithms for solving problems.
• PDDL, the Planning Domain Definition Language, describes the initial and goal states as conjunctions of literals, and actions in terms of their preconditions and effects. Extensions represent time, resources, percepts, contingent plans, and hierarchical plans.
• State-space search can operate in the forward direction (progression) or the backward direction (regression). Effective heuristics can be derived by subgoal independence assumptions and by various relaxations of the planning problem.
• Other approaches include encoding a planning problem as a Boolean satisfiability problem or as a constraint satisfaction problem; and explicitly searching through the space of partially ordered plans.
• Hierarchical task network (HTN) planning allows the agent to take advice from the domain designer in the form of high-level actions (HLAs) that can be implemented in various ways by lower-level action sequences. The effects of HLAs can be defined with angelic semantics, allowing provably correct high-level plans to be derived without consideration of lower-level implementations. HTN methods can create the very large plans required by many real-world applications.
• Contingent plans allow the agent to sense the world during execution to decide what branch of the plan to follow. In some cases, sensorless or conformant planning can be used to construct a plan that works without the need for perception. Both conformant and contingent plans can be constructed by search in the space of belief states. Efficient representation or computation of belief states is a key problem.
• An online planning agent uses execution monitoring and splices in repairs as needed to recover from unexpected situations, which can be due to nondeterministic actions, exogenous events, or incorrect models of the environment.
• Many actions consume resources, such as money, gas, or raw materials. It is convenient to treat these resources as numeric measures in a pool rather than try to reason about, say, each individual coin and bill in the world. Time is one of the most important resources. It can be handled by specialized scheduling algorithms, or scheduling can be integrated with planning.
• This chapter extends classical planning to cover nondeterministic environments (where outcomes of actions are uncertain), but it is not the last word on planning. Chapter 17 describes techniques for stochastic environments (in which outcomes of actions have probabilities associated with them): Markov decision processes, partially observable Markov decision processes, and game theory. In Chapter 22 we show that reinforcement learning allows an agent to learn how to behave from past successes and failures.

Bibliographical and Historical Notes
AI planning arose from investigations into state-space search, theorem proving, and control theory. STRIPS (Fikes and Nilsson, 1971, 1993), the first major planning system, was designed as the planner for the Shakey robot at SRI. The first version of the program ran on a computer with only 192 KB of memory. Its overall control structure was modeled on GPS, the General Problem Solver (Newell and Simon, 1961), a state-space search system that used means-ends analysis.

The STRIPS representation language evolved into the Action Description Language, or ADL (Pednault, 1986), and then the Planning Domain Definition Language, or PDDL (Ghallab et al., 1998), which has been used for the International Planning Competition since 1998. The most recent version is PDDL 3.1 (Kovacs, 2011).
Planners in the early 1970s decomposed problems by computing a subplan for each subgoal and then stringing the subplans together in some order. This approach, called linear planning by Sacerdoti (1975), was soon discovered to be incomplete. It cannot solve some very simple problems, such as the Sussman anomaly (see Exercise 11.SUSS), found by Allen Brown during experimentation with the HACKER system (Sussman, 1975). A complete planner must allow for interleaving of actions from different subplans within a single sequence. Warren's (1974) WARPLAN system achieved that, and demonstrated how the logic programming language Prolog can produce concise programs; WARPLAN is only 100 lines of code.
Partial-order planning dominated the next 20 years of research, with theoretical work
describing the detection of conflicts (Tate, 1975a) and the protection of achieved conditions (Sussman, 1975), and implementations including NOAH (Sacerdoti, 1977) and NONLIN (Tate, 1977). That led to formal models (Chapman, 1987; McAllester and Rosenblitt, 1991)
that allowed for theoretical analysis of various algorithms and planning problems, and to a widely distributed system, UCPOP (Penberthy and Weld, 1992).
Drew McDermott suspected that the emphasis on partial-order planning was crowding out
other techniques that should perhaps be reconsidered now that computers had 100 times the
memory of Shakey’s day. His UNPOP (McDermott, 1996) was a state-space planning program employing the ignore-delete-list heuristic. HSP, the Heuristic Search Planner (Bonet and Geffner, 1999; Haslum, 2006) made state-space search practical for large planning problems. The FF or Fast Forward planner (Hoffmann, 2001; Hoffmann and Nebel, 2001; Hoffmann, 2005) and the FASTDOWNWARD variant (Helmert, 2006) won international planning
competitions in the 2000s.
Bidirectional search (see Section 3.4.5) has also been known to suffer from a lack of heuristics, but some success has been obtained by using backward search to create a perimeter around the goal, and then refining a heuristic to search forward towards that perimeter (Torralba et al., 2016). The SYMBA* bidirectional search planner (Torralba et al., 2016) won the 2016 competition.
Researchers turned to PDDL and the planning paradigm so that they could use domain-independent heuristics. Hoffmann (2005) analyzes the search space of the ignore-delete-list heuristic. Edelkamp (2009) and Haslum et al. (2007) describe how to construct pattern databases for planning heuristics. Felner et al. (2004) show encouraging results using pattern databases for sliding-tile puzzles, which can be thought of as a planning domain, but Hoffmann et al. (2006) show some limitations of abstraction for classical planning problems. Rintanen (2012) discusses planning-specific variable-selection heuristics for SAT solving.

Helmert et al. (2011) describe the Fast Downward Stone Soup (FDSS) system, a portfolio planner that, as in the fable of stone soup, invites us to throw in as many planning algorithms as possible. The system maintains a set of training problems, and for each problem and each algorithm records the run time and resulting plan cost of the problem's solution. Then when faced with a new problem, it uses the past experience to decide which algorithm(s) to try, with what time limits, and takes the solution with minimal cost. FDSS was a winner in the 2018 International Planning Competition (Seipp and Röger, 2018). Seipp et al. (2015) describe a machine learning approach to automatically learn a good portfolio, given a new problem. Vallati et al. (2015) give an overview of portfolio planning. The idea of algorithm portfolios for combinatorial search problems goes back to Gomes and Selman (2001).

Sistla and Godefroid (2004) cover symmetry reduction, and Godefroid (1990) covers heuristics for partial ordering. Richter and Helmert (2009) demonstrate the efficiency gains of forward pruning using preferred actions.
Blum and Furst (1997) revitalized the field of planning with their Graphplan system, which was orders of magnitude faster than the partial-order planners of the time. Bryce and Kambhampati (2007) give an overview of planning graphs. The use of situation calculus for planning was introduced by John McCarthy (1963) and refined by Ray Reiter (2001). Kautz et al. (1996) investigated various ways to propositionalize action schemas, finding
that the most compact forms did not necessarily lead to the fastest solution times. A systematic analysis was carried out by Ernst et al. (1997), who also developed an automatic “compiler” for generating propositional representations from PDDL problems. The BLACKBOX planner, which combines ideas from Graphplan and SATPLAN, was developed by Kautz and Selman (1998). Planners based on constraint satisfaction include CPLAN (van Beek and Chen, 1999) and GP-CSP (Do and Kambhampati, 2003).
There has also been interest in the representation of a plan as a binary decision diagram (BDD), a compact data structure for Boolean expressions widely studied in the hardware verification community (Clarke and Grumberg, 1987; McMillan, 1993). There are techniques for proving properties of binary decision diagrams, including the property of being a solution to a planning problem. Cimatti et al. (1998) present a planner based on this approach. Other representations have also been used, such as integer programming (Vossen et al., 2001).
There are some interesting comparisons of the various approaches to planning. Helmert (2001) analyzes several classes of planning problems, and shows that constraint-based approaches such as Graphplan and SATPLAN are best for NP-hard domains, while search-based approaches do better in domains where feasible solutions can be found without backtracking. Graphplan and SATPLAN have trouble in domains with many objects because that means they must create many actions. In some cases the problem can be delayed or avoided by generating the propositionalized actions dynamically, only as needed, rather than instantiating them all before the search begins.
The first mechanism for hierarchical planning was a facility in the STRIPS program for learning macrops (“macro-operators” consisting of a sequence of primitive steps) (Fikes et al., 1972). The ABSTRIPS system (Sacerdoti, 1974) introduced the idea of an abstraction hierarchy, whereby planning at higher levels was permitted to ignore lower-level preconditions of actions in order to derive the general structure of a working plan. Austin Tate’s Ph.D. thesis (1975b) and work by Earl Sacerdoti (1977) developed the basic ideas of HTN planning. Erol, Hendler, and Nau (1994, 1996) present a complete hierarchical decomposition planner as well as a range of complexity results for pure HTN planners. Our presentation of HLAs and angelic semantics is due to Marthi et al. (2007, 2008).
One of the goals of hierarchical planning has been the reuse of previous planning experience in the form of generalized plans. The technique of explanation-based learning has been used as a means of generalizing previously computed plans in systems such as SOAR (Laird et al., 1986) and PRODIGY (Carbonell et al., 1989). An alternative approach is to store previously computed plans in their original form and then reuse them to solve new, similar problems by analogy to the original problem. This is the approach taken by the field called case-based planning (Carbonell, 1983; Alterman, 1988). Kambhampati (1994) argues that case-based planning should be analyzed as a form of refinement planning and provides a formal foundation for case-based partial-order planning.
Early planners lacked conditionals and loops, but some could use coercion to form conformant plans. Sacerdoti’s NOAH solved the “keys and boxes” problem (in which the planner knows little about the initial state) using coercion.
Mason (1993) argued that sensing often
can and should be dispensed with in robotic planning, and described a sensorless plan that can move a tool into a specific position on a table by a sequence of tilting actions, regardless
of the initial position.
Goldman and Boddy (1996) introduced the term conformant planning, noting that sensorless plans are often effective even if the agent has sensors. The first moderately efficient conformant planner was Smith and Weld's (1998) Conformant Graphplan (CGP). Ferraris and Giunchiglia (2000) and Rintanen (1999) independently developed SATPLAN-based conformant planners. Bonet and Geffner (2000) describe a conformant planner based on heuristic search in the space of belief states, drawing on ideas first developed in the 1960s for partially observable Markov decision processes, or POMDPs (see Chapter 17).
Currently, there are three main approaches to conformant planning. The first two use
heuristic search in belief-state space:
HSCP (Bertoli et al., 2001a) uses binary decision diagrams (BDDs) to represent belief states, whereas Hoffmann and Brafman (2006) adopt the lazy approach of computing precondition and goal tests on demand using a SAT solver. The third approach, championed primarily by Jussi Rintanen (2007), formulates the entire sensorless planning problem as a quantified Boolean formula (QBF) and solves it using a general-purpose QBF solver. Current conformant planners are five orders of magnitude faster than CGP. The winner of the 2006 conformant-planning track at the International Planning Competition was T0 (Palacios and Geffner, 2007), which uses heuristic search in belief-state
space while keeping the belief-state representation simple by defining derived literals that cover conditional effects. Bryce and Kambhampati (2007) discuss how a planning graph can be generalized to generate good heuristics for conformant and contingent planning. The contingent-planning approach described in the chapter is based on Hoffmann and Brafman (2005), and was influenced by the efficient search algorithms for cyclic AND–OR graphs developed by Jimenez and Torras (2000) and Hansen and Zilberstein (2001). The problem of contingent planning received more attention after the publication of Drew McDermott’s (1978a) influential article, Planning and Acting. Bertoli et al. (2001b) describe MBP (Model-Based Planner), which uses binary decision diagrams to do conformant and contingent planning. Some authors use “conditional planning” and “contingent planning” as synonyms; others make the distinction that “conditional” refers to actions with nondeterministic effects, and “contingent” means using sensing to overcome partial observability.
In retrospect, it is now possible to see how the major classical planning algorithms led to
extended versions for uncertain domains. Fast-forward heuristic search through state space led to forward search in belief space (Bonet and Geffner, 2000; Hoffmann and Brafman, 2005); SATPLAN led to stochastic SATPLAN (Majercik and Littman, 2003) and to planning
with quantified Boolean logic (Rintanen, 2007); partial order planning led to UWL (Etzioni et al., 1992) and CNLP (Peot and Smith, 1992); Graphplan led to Sensory Graphplan or SGP
(Weld et al., 1998).
The first online planner with execution monitoring was PLANEX (Fikes et al., 1972), which worked with the STRIPS planner to control the robot Shakey. SIPE (System for Interactive Planning and Execution monitoring) (Wilkins, 1988) was the first planner to deal systematically with the problem of replanning. It has been used in demonstration projects in several domains, including planning operations on the flight deck of an aircraft carrier, job-shop scheduling for an Australian beer factory, and planning the construction of multistory buildings (Kartam and Levitt, 1990).
In the mid-1980s, pessimism about the slow run times of planning systems led to the proposal of reflex agents called reactive planning systems (Brooks, 1986; Agre and Chapman, 1987). “Universal plans” (Schoppers, 1989) were developed as a lookup-table method for reactive planning, but turned out to be a rediscovery of the idea of policies that had long been used in Markov decision processes (see Chapter 17). Koenig (2001) surveys online planning techniques, under the name Agent-Centered Search.
Planning with time constraints was first dealt with by DEVISER (Vere, 1983). The representation of time in plans was addressed by Allen (1984) and by Dean et al. (1990) in the FORBIN system. NONLIN+ (Tate and Whiter, 1984) and SIPE (Wilkins, 1990) could reason about the allocation of limited resources to various plan steps. O-PLAN (Bell and Tate, 1985) has been applied to resource problems such as software procurement planning at Price Waterhouse and back-axle assembly planning at Jaguar Cars. The two planners SAPA (Do and Kambhampati, 2001) and T4 (Haslum and Geffner, 2001) both used forward state-space search with sophisticated heuristics to handle actions with durations and resources. An alternative is to use very expressive action languages, but guide them by human-written, domain-specific heuristics, as is done by ASPEN (Fukunaga et al., 1997), HSTS (Jonsson et al., 2000), and IxTeT (Ghallab and Laruelle, 1994).
A number of hybrid planning-and-scheduling systems have been deployed: ISIS (Fox et al., 1982; Fox, 1990) has been used for job-shop scheduling at Westinghouse, GARI (Descotte and Latombe, 1985) planned the machining and construction of mechanical parts, FORBIN was used for factory control, and NONLIN+ was used for naval logistics planning.
We chose to present planning and scheduling as two separate problems; Cushing et al. (2007) show that this can lead to incompleteness on certain problems.
There is a long history of scheduling in aerospace. T-SCHED (Drabble, 1990) was used to schedule mission-command sequences for the UOSAT-II satellite. OPTIMUM-AIV (Aarup et al., 1994) and PLAN-ERS1 (Fuchs et al., 1990), both based on O-PLAN, were used for spacecraft assembly and observation planning, respectively, at the European Space Agency. SPIKE (Johnston and Adorf, 1992) was used for observation planning at NASA for the Hubble Space Telescope, while the Space Shuttle Ground Processing Scheduling System (Deale et al., 1994) does job-shop scheduling of up to 16,000 worker-shifts. Remote Agent (Muscettola et al., 1998) became the first autonomous planner-scheduler to control a spacecraft, when it flew onboard the Deep Space One probe in 1999. Space applications have driven the development of algorithms for resource allocation; see Laborie (2003) and Muscettola (2002). The literature on scheduling is presented in a classic survey article (Lawler et al., 1993), a book (Pinedo, 2008), and an edited handbook (Blazewicz et al., 2007).
The computational complexity of planning has been analyzed by several authors (Bylander, 1994; Ghallab et al., 2004; Rintanen, 2016). There are two main tasks: PlanSAT is the question of whether there exists any plan that solves a planning problem. Bounded PlanSAT asks whether there is a solution of length k or less; this can be used to find an optimal plan. Both are decidable for classical planning (because the number of states is finite). But if we add function symbols to the language, then the number of states becomes infinite, and PlanSAT becomes only semidecidable. For propositionalized problems both are in the complexity class PSPACE, a class that is larger (and hence more difficult) than NP and refers to problems that can be solved by a deterministic Turing machine with a polynomial amount of space. These theoretical results are discouraging, but in practice, the problems we want to solve tend to be not so bad.
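The remark that Bounded PlanSAT can be used to find an optimal plan can be illustrated with a brute-force sketch: ask the bounded question for k = 0, 1, 2, ... and stop at the first k that has a solution. Everything in the example (the function names, the tuple encoding of actions, and the toy door domain) is hypothetical and chosen only for illustration; the search is exponential and is not how practical planners work.

```python
def bounded_plan(state, goal, actions, k):
    """Bounded PlanSAT by brute force: a plan of length <= k, or None."""
    if goal <= state:                 # every goal fact already holds
        return []
    if k == 0:
        return None
    for name, pre, add, delete in actions:
        if pre <= state:              # action is applicable in this state
            rest = bounded_plan((state - delete) | add, goal, actions, k - 1)
            if rest is not None:
                return [name] + rest
    return None

def shortest_plan(init, goal, actions, max_k):
    """Try k = 0, 1, 2, ...; the first k that succeeds gives a shortest plan."""
    for k in range(max_k + 1):
        plan = bounded_plan(init, goal, actions, k)
        if plan is not None:
            return plan
    return None

# Hypothetical toy domain: (name, preconditions, add list, delete list).
ACTIONS = [
    ("get_key",   frozenset({"at_door"}),   frozenset({"have_key"}),  frozenset()),
    ("open_door", frozenset({"have_key"}),  frozenset({"door_open"}), frozenset()),
    ("enter",     frozenset({"door_open"}), frozenset({"inside"}),    frozenset({"at_door"})),
]
print(shortest_plan(frozenset({"at_door"}), frozenset({"inside"}), ACTIONS, max_k=5))
# ['get_key', 'open_door', 'enter']
```

Because the bound is raised one step at a time, the first plan returned has minimal length, which is the sense of "optimal" used here (unit action costs are assumed).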
The true advantage of the classical planning formalism is
that it has facilitated the development of very accurate domain-independent heuristics; other
approaches have not been as fruitful. Readings in Planning (Allen et al., 1990) is a comprehensive anthology of early work in the field. Weld (1994, 1999) provides two excellent surveys of planning algorithms of the
1990s. It is interesting to see the change in the five years between the two surveys: the first
concentrates on partial-order planning, and the second introduces Graphplan and SATPLAN.
Automated Planning and Acting (Ghallab et al., 2016) is an excellent textbook on all aspects of the field. LaValle’s text Planning Algorithms (2006) covers both classical and stochastic planning, with extensive coverage of robot motion planning.
Planning research has been central to AI since its inception, and papers on planning are a staple of mainstream AI journals and conferences. There are also specialized conferences such as the International Conference on Automated Planning and Scheduling and the International Workshop on Planning and Scheduling for Space.
CHAPTER 12
QUANTIFYING UNCERTAINTY
In which we see how to tame uncertainty with numeric degrees of belief.
12.1 Acting under Uncertainty
Agents in the real world need to handle uncertainty, whether due to partial observability, nondeterminism, or adversaries. An agent may never know for sure what state it is in now or where it will end up after a sequence of actions.
We have seen problem-solving and logical agents handle uncertainty by keeping track of a belief state (a representation of the set of all possible world states that it might be in) and generating a contingency plan that handles every possible eventuality that its sensors may report during execution. This approach works on simple problems, but it has drawbacks:
• The agent must consider every possible explanation for its sensor observations, no matter how unlikely. This leads to a large belief state full of unlikely possibilities.
• A correct contingent plan that handles every eventuality can grow arbitrarily large and must consider arbitrarily unlikely contingencies.
• Sometimes there is no plan that is guaranteed to achieve the goal, yet the agent must act. It must have some way to compare the merits of plans that are not guaranteed.
Suppose, for example, that an automated taxi has the goal of delivering a passenger to the airport on time. The taxi forms a plan, $A_{90}$, that involves leaving home 90 minutes before the flight departs and driving at a reasonable speed. Even though the airport is only 5 miles away, a logical agent will not be able to conclude with absolute certainty that “Plan $A_{90}$ will get us to the airport in time.” Instead, it reaches the weaker conclusion “Plan $A_{90}$ will get us to the airport in time, as long as the car doesn’t break down, and I don’t get into an accident, and the road isn’t closed, and no meteorite hits the car, and ... .” None of these conditions can be deduced for sure, so we can’t infer that the plan succeeds. This is the logical qualification problem (page 241), for which we so far have seen no real solution.
Nonetheless, in some sense $A_{90}$ is in fact the right thing to do. What do we mean by this? As we discussed in Chapter 2, we mean that out of all the plans that could be executed, $A_{90}$ is expected to maximize the agent’s performance measure (where the expectation is relative to the agent’s knowledge about the environment). The performance measure includes getting to the airport in time for the flight, avoiding a long, unproductive wait at the airport, and avoiding speeding tickets along the way. The agent’s knowledge cannot guarantee any of these outcomes for $A_{90}$, but it can provide some degree of belief that they will be achieved. Other plans, such as $A_{180}$, might increase the agent’s belief that it will get to the airport on time, but also increase the likelihood of a long, boring wait. The right thing to do, the rational decision, therefore depends on both the relative importance of various goals and
=28,224 entries—still a manageable number.
If we add the possibility of dirt in each of the 42 squares, the number of states is multiplied by $2^{42}$ and the transition matrix has more than $10^{29}$ entries, which is no longer a manageable number. In general, if the state is composed of $n$ discrete variables with at most $d$ values each, the corresponding HMM transition matrix will have size $O(d^{2n})$ and the per-update computation time will also be $O(d^{2n})$.
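The growth described above is easy to check with exact integer arithmetic. The sketch below is only a back-of-the-envelope calculation; the function names are hypothetical, and the base count of 168 states is an assumption inferred from the 28,224 entries quoted in the text (168 × 168 = 28,224).

```python
def hmm_transition_entries(num_states: int) -> int:
    """Entries in a dense S x S transition matrix over S atomic states."""
    return num_states * num_states

def factored_state_count(n: int, d: int) -> int:
    """Number of joint states for n discrete variables with d values each."""
    return d ** n

base_states = 168  # assumption: square root of the 28,224 entries quoted in the text
print(hmm_transition_entries(base_states))               # 28224

# One Boolean "dirt" variable per square multiplies the state count by 2**42.
with_dirt = base_states * factored_state_count(42, 2)
print(hmm_transition_entries(with_dirt) > 10 ** 29)      # True
```

The matrix size is the square of the state count, which is why $n$ variables with $d$ values each give $d^{2n}$ entries.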
For these reasons, although HMMs have many uses in areas ranging from speech recognition to molecular biology, they are fundamentally limited in their ability to represent complex processes. In the terminology introduced in Chapter 2, HMMs are an atomic representation: states of the world have no internal structure and are simply labeled by integers. Section 14.5 shows how to use dynamic Bayesian networks (a factored representation) to model domains with many state variables. The next section shows how to handle domains with continuous state variables, which of course lead to an infinite state space.
14.4 Kalman Filters
Imagine watching a small bird flying through dense jungle foliage at dusk: you glimpse
brief, intermittent flashes of motion; you try hard to guess where the bird is and where it will
appear next so that you don’t lose it. Or imagine that you are a World War II radar operator
peering at a faint, wandering blip that appears once every 10 seconds on the screen. Or, going
back further still, imagine you are Kepler trying to reconstruct the motions of the planets
from a collection of highly inaccurate angular observations taken at irregular and imprecisely measured intervals.
In all these cases, you are doing filtering: estimating state variables (here, the position and velocity of a moving object) from noisy observations over time. If the variables were discrete, we could model the system with a hidden Markov model. This section examines methods for handling continuous variables, using an algorithm called Kalman filtering, after one of its inventors, Rudolf Kalman.
The bird’s flight might be specified by six continuous variables at each time point: three for position $(X_t, Y_t, Z_t)$ and three for velocity $(\dot{X}_t, \dot{Y}_t, \dot{Z}_t)$. We will need suitable conditional densities to represent the transition and sensor models; as in Chapter 13, we will use linear-Gaussian distributions. This means that the next state $\mathbf{X}_{t+1}$ must be a linear function of the current state $\mathbf{X}_t$, plus some Gaussian noise, a condition that turns out to be quite reasonable in practice. Consider, for example, the X-coordinate of the bird, ignoring the other coordinates for now. Let the time interval between observations be $\Delta$, and assume constant velocity during the interval; then the position update is given by $X_{t+\Delta} = X_t + \dot{X}\,\Delta$. Adding Gaussian noise (to account for wind variation, etc.), we obtain a linear-Gaussian transition model:
$$P(X_{t+\Delta} = x_{t+\Delta} \mid X_t = x_t, \dot{X}_t = \dot{x}_t) = \mathcal{N}(x_{t+\Delta};\; x_t + \dot{x}_t\,\Delta,\; \sigma^2).$$
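As a minimal sketch of what this transition model says, the function below draws the next X-coordinate from a Gaussian centered on the constant-velocity prediction. The function name, the parameter names, and the numbers in the usage lines are assumptions made for illustration only.

```python
import random

def sample_next_position(x, x_dot, delta, sigma, rng=random):
    """Draw X_{t+delta} from N(x + x_dot * delta, sigma^2)."""
    return rng.gauss(x + x_dot * delta, sigma)

# Hypothetical numbers: position 1.0, velocity 2.0, time step 0.5, noise std 0.5.
rng = random.Random(0)
samples = [sample_next_position(1.0, 2.0, 0.5, 0.5, rng) for _ in range(10000)]
print(sum(samples) / len(samples))   # close to the noise-free prediction 2.0
```

The mean of the distribution is a linear function of the current state $(x_t, \dot{x}_t)$, which is what makes the model linear-Gaussian.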
The Bayesian network structure for a system with position vector $\mathbf{X}_t$ and velocity $\dot{\mathbf{X}}_t$ is shown in Figure 14.9. Note that this is a very specific form of linear-Gaussian model: the general form will be described later in this section and covers a vast array of applications beyond the simple motion examples of the first paragraph. The reader might wish to consult Appendix A for some of the mathematical properties of Gaussian distributions; for our immediate purposes, the most important is that a multivariate Gaussian distribution for $d$ variables is specified by a $d$-element mean $\boldsymbol{\mu}$ and a $d \times d$ covariance matrix $\boldsymbol{\Sigma}$.
14.4.1 Updating Gaussian distributions
In Chapter 13 on page 423, we alluded to a key property of the linear-Gaussian family of distributions: it remains closed under Bayesian updating. (That is, given any evidence, the posterior is still in the linear-Gaussian family.) Here we make this claim precise in the context of filtering in a temporal probability model. The required properties correspond to the two-step filtering calculation in Equation (14.5):
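1. If the current distribution $P(\mathbf{X}_t \mid \mathbf{e}_{1:t})$ is Gaussian and the transition model $P(\mathbf{X}_{t+1} \mid \mathbf{x}_t)$ is linear-Gaussian, then the one-step predicted distribution given by
$$P(\mathbf{X}_{t+1} \mid \mathbf{e}_{1:t}) = \int_{\mathbf{x}_t} P(\mathbf{X}_{t+1} \mid \mathbf{x}_t)\, P(\mathbf{x}_t \mid \mathbf{e}_{1:t})\, d\mathbf{x}_t \tag{14.17}$$
is also a Gaussian distribution.
2. If the prediction $P(\mathbf{X}_{t+1} \mid \mathbf{e}_{1:t})$ is Gaussian and the sensor model $P(\mathbf{e}_{t+1} \mid \mathbf{X}_{t+1})$ is linear-Gaussian, then, after conditioning on the new evidence, the updated distribution
$$P(\mathbf{X}_{t+1} \mid \mathbf{e}_{1:t+1}) = \alpha\, P(\mathbf{e}_{t+1} \mid \mathbf{X}_{t+1})\, P(\mathbf{X}_{t+1} \mid \mathbf{e}_{1:t}) \tag{14.18}$$
is also a Gaussian distribution.
Figure 14.9 Bayesian network structure for a linear dynamical system with position $\mathbf{X}_t$, velocity $\dot{\mathbf{X}}_t$, and position measurement $\mathbf{Z}_t$.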
Thus, the FORWARD operator for Kalman filtering takes a Gaussian forward message $\mathbf{f}_{1:t}$, specified by a mean $\boldsymbol{\mu}_t$ and covariance $\boldsymbol{\Sigma}_t$, and produces a new multivariate Gaussian forward message $\mathbf{f}_{1:t+1}$, specified by a mean $\boldsymbol{\mu}_{t+1}$ and covariance $\boldsymbol{\Sigma}_{t+1}$. So if we start with a Gaussian prior $\mathbf{f}_{1:0} = P(\mathbf{X}_0) = \mathcal{N}(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0)$, filtering with a linear-Gaussian model produces a Gaussian state distribution for all time.
This seems to be a nice, elegant result, but why is it so important? The reason is that except for a few special cases such as this, filtering with continuous or hybrid (discrete and continuous) networks generates state distributions whose representation grows without bound over time. This statement is not easy to prove in general, but Exercise 14.KFSW shows what happens for a simple example.
14.4.2 A simple one-dimensional example
We have said that the FORWARD operator for the Kalman filter maps a Gaussian into a new
Gaussian. This translates into computing a new mean and covariance from the previous mean
and covariance. Deriving the update rule in the general (multivariate) case requires rather a
lot of linear algebra, so we will stick to a very simple univariate case for now, and later give
the results for the general case. Even for the univariate case, the calculations are somewhat
tedious, but we feel that they are worth seeing because the usefulness of the Kalman filter is
tied so intimately to the mathematical properties of Gaussian distributions.
The temporal model we consider describes a random walk of a single continuous state variable $X_t$ with a noisy observation $Z_t$. An example might be the “consumer confidence” index, which can be modeled as undergoing a random Gaussian-distributed change each month and is measured by a random consumer survey that also introduces Gaussian sampling noise. The prior distribution is assumed to be Gaussian with variance $\sigma_0^2$:
$$P(x_0) = \alpha\, e^{-\frac{1}{2}\left(\frac{(x_0 - \mu_0)^2}{\sigma_0^2}\right)}.$$
(For simplicity, we use the same symbol $\alpha$ for all normalizing constants in this section.) The transition model adds a Gaussian perturbation of constant variance $\sigma_x^2$ to the current state:
$$P(x_{t+1} \mid x_t) = \alpha\, e^{-\frac{1}{2}\left(\frac{(x_{t+1} - x_t)^2}{\sigma_x^2}\right)}.$$
The sensor model assumes Gaussian noise with variance $\sigma_z^2$:
$$P(z_t \mid x_t) = \alpha\, e^{-\frac{1}{2}\left(\frac{(z_t - x_t)^2}{\sigma_z^2}\right)}.$$
Now, given the prior $P(X_0)$, the one-step predicted distribution comes from Equation (14.17):
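$$P(x_1) = \int_{-\infty}^{\infty} P(x_1 \mid x_0)\, P(x_0)\, dx_0 = \alpha \int_{-\infty}^{\infty} e^{-\frac{1}{2}\left(\frac{(x_1 - x_0)^2}{\sigma_x^2}\right)}\, e^{-\frac{1}{2}\left(\frac{(x_0 - \mu_0)^2}{\sigma_0^2}\right)}\, dx_0 = \alpha \int_{-\infty}^{\infty} e^{-\frac{1}{2}\left(\frac{\sigma_0^2 (x_1 - x_0)^2 + \sigma_x^2 (x_0 - \mu_0)^2}{\sigma_0^2 \sigma_x^2}\right)}\, dx_0.$$
This integral looks rather complicated. The key to progress is to notice that the exponent is the sum of two expressions that are quadratic in $x_0$ and hence is itself a quadratic in $x_0$. A simple trick known as completing the square allows the rewriting of any quadratic $a x_0^2 + b x_0 + c$ as the sum of a squared term $a\left(x_0 - \frac{-b}{2a}\right)^2$ and a residual term $c - \frac{b^2}{4a}$ that is independent of $x_0$. In this case, we have $a = (\sigma_0^2 + \sigma_x^2)/(\sigma_0^2 \sigma_x^2)$, $b = -2(\sigma_0^2 x_1 + \sigma_x^2 \mu_0)/(\sigma_0^2 \sigma_x^2)$, and $c = (\sigma_0^2 x_1^2 + \sigma_x^2 \mu_0^2)/(\sigma_0^2 \sigma_x^2)$. The residual term can be taken outside the integral, giving us
$$P(x_1) = \alpha\, e^{-\frac{1}{2}\left(c - \frac{b^2}{4a}\right)} \int_{-\infty}^{\infty} e^{-\frac{1}{2}\left(a\left(x_0 - \frac{-b}{2a}\right)^2\right)}\, dx_0.$$
Now the integral is just the integral of a Gaussian over its full range, which is simply 1. Thus, we are left with only the residual term from the quadratic. Plugging back in the expressions for $a$, $b$, and $c$ and simplifying, we obtain
$$P(x_1) = \alpha\, e^{-\frac{1}{2}\left(\frac{(x_1 - \mu_0)^2}{\sigma_0^2 + \sigma_x^2}\right)}.$$
That is, the one-step predicted distribution is a Gaussian with the same mean $\mu_0$ and a variance equal to the sum of the original variance $\sigma_0^2$ and the transition variance $\sigma_x^2$.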
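The result of this derivation can be checked numerically without any algebra. The sketch below approximates the prediction integral with a midpoint rule and compares it with a Gaussian whose variance is $\sigma_0^2 + \sigma_x^2$. The function names, the integration limits, and the parameter values are assumptions chosen for illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    """Normalized Gaussian density with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def predicted_density_numeric(x1, mu0, var0, var_x, lo=-60.0, hi=60.0, steps=12000):
    """Midpoint-rule approximation of the integral over x0 of P(x1 | x0) P(x0)."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x0 = lo + (i + 0.5) * h
        total += gaussian_pdf(x1, x0, var_x) * gaussian_pdf(x0, mu0, var0)
    return total * h

def predicted_density_closed_form(x1, mu0, var0, var_x):
    """Gaussian with the same mean and the two variances added."""
    return gaussian_pdf(x1, mu0, var0 + var_x)

# Hypothetical values: prior N(0, 1.5^2), transition noise variance 2.0^2.
for x1 in (-3.0, 0.0, 2.5):
    numeric = predicted_density_numeric(x1, 0.0, 2.25, 4.0)
    exact = predicted_density_closed_form(x1, 0.0, 2.25, 4.0)
    print(x1, numeric, exact)
```

The two columns should agree to many decimal places, since the integration range is wide enough that the truncated tails are negligible for these variances.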
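To complete the update step, we need to condition on the observation at the first time step, namely, $z_1$. From Equation (14.18), this is given by
$$P(x_1 \mid z_1) = \alpha\, P(z_1 \mid x_1)\, P(x_1).$$
Once again, we combine the exponents and complete the square (Exercise 14.KALM), obtaining the following expression for the posterior:
$$P(x_1 \mid z_1) = \alpha\, e^{-\frac{1}{2}\left(\frac{\left(x_1 - \frac{(\sigma_0^2 + \sigma_x^2) z_1 + \sigma_z^2 \mu_0}{\sigma_0^2 + \sigma_x^2 + \sigma_z^2}\right)^2}{(\sigma_0^2 + \sigma_x^2)\,\sigma_z^2 / (\sigma_0^2 + \sigma_x^2 + \sigma_z^2)}\right)}. \tag{14.19}$$
Thus, after one update cycle, we have a new Gaussian distribution for the state variable. From the Gaussian formula in Equation (14.19), we see that the new mean and standard deviation can be calculated from the old mean and standard deviation as follows:
$$\mu_{t+1} = \frac{(\sigma_t^2 + \sigma_x^2)\, z_{t+1} + \sigma_z^2\, \mu_t}{\sigma_t^2 + \sigma_x^2 + \sigma_z^2} \qquad \text{and} \qquad \sigma_{t+1}^2 = \frac{(\sigma_t^2 + \sigma_x^2)\, \sigma_z^2}{\sigma_t^2 + \sigma_x^2 + \sigma_z^2}. \tag{14.20}$$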
Figure 14.10 Stages in the Kalman filter update cycle for a random walk with a prior given by $\mu_0 = 0.0$ and $\sigma_0 = 1.5$, transition noise given by $\sigma_x = 2.0$, sensor noise given by $\sigma_z = 1.0$, and a first observation $z_1 = 2.5$ (marked on the x-axis). Notice how the prediction $P(x_1)$ is flattened out, relative to $P(x_0)$, by the transition noise. Notice also that the mean of the posterior distribution $P(x_1 \mid z_1)$ is slightly to the left of the observation $z_1$ because the mean is a weighted average of the prediction and the observation. (Plot axes: x position, horizontal; P(x), vertical.)
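The update cycle in Equation (14.20) is short enough to run by hand or in a few lines of code. The sketch below uses the parameter values from the Figure 14.10 caption; the function name is hypothetical, and the observation value $z_1 = 2.5$ is an assumption, since the caption is damaged at that point in this copy.

```python
def kalman_update_1d(mu, var, z, var_x, var_z):
    """One cycle of Equation (14.20) for the random-walk model; returns (mean, variance)."""
    predicted_var = var + var_x            # prediction keeps the mean and adds variance
    total = predicted_var + var_z
    new_mu = (predicted_var * z + var_z * mu) / total
    new_var = predicted_var * var_z / total
    return new_mu, new_var

# Values as read from the Figure 14.10 caption (z1 = 2.5 is an assumption).
mu0, sigma0, sigma_x, sigma_z, z1 = 0.0, 1.5, 2.0, 1.0, 2.5
mu1, var1 = kalman_update_1d(mu0, sigma0 ** 2, z1, sigma_x ** 2, sigma_z ** 2)
print(round(mu1, 3), round(var1, 3))       # 2.155 0.862

# The variance recursion never looks at the observations, so it can be iterated alone.
var = sigma0 ** 2
for _ in range(20):
    _, var = kalman_update_1d(0.0, var, 0.0, sigma_x ** 2, sigma_z ** 2)
print(round(var, 4))                       # 0.8284
```

With these numbers the posterior mean lands a little to the left of the observation, and the variance settles after a handful of iterations to a value determined by $\sigma_x^2$ and $\sigma_z^2$ alone.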
Figure 14.10 shows one update cycle of the Kalman filter in the one-dimensional case for particular values of the transition and sensor models.
Equation (14.20) plays exactly the same role as the general filtering equation (14.5) or the HMM filtering equation (14.12). Because of the special nature of Gaussian distributions, however, the equations have some interesting additional properties. First, we can interpret the calculation for the new mean $\mu_{t+1}$ as a weighted mean of the new observation $z_{t+1}$ and the old mean $\mu_t$. If the observation is unreliable, then $\sigma_z^2$ is large and we pay more attention to the old mean; if the old mean is unreliable ($\sigma_t^2$ is large) or the process is highly unpredictable ($\sigma_x^2$ is large), then we pay more attention to the observation. Second, notice that the update for the variance $\sigma_{t+1}^2$ is independent of the observation. We can therefore compute in advance what the sequence of variance values will be. Third, the sequence of variance values converges quickly to a fixed value that depends only on $\sigma_x^2$ and $\sigma_z^2$, thereby substantially simplifying the subsequent calculations. (See Exercise 14.VARL)
14.4.3 The general case
The preceding derivation illustrates the key property of Gaussian distributions that allows Kalman filtering to work: the fact that the exponent is a quadratic form. This is true not just for the univariate case; the full multivariate Gaussian distribution has the form
$$\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \alpha\, e^{-\frac{1}{2}\left((\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)}.$$
Multiplying out the terms in the exponent, we see that the exponent is also a quadratic function of the values $x_i$ in $\mathbf{x}$. Thus, filtering preserves the Gaussian nature of the state distribution.
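Let us first define the general temporal model used with Kalman filtering. Both the transition model and the sensor model are required to be a linear transformation with additive Gaussian noise. Thus, we have
$$P(\mathbf{x}_{t+1} \mid \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t+1}; \mathbf{F}\mathbf{x}_t, \boldsymbol{\Sigma}_x)$$
$$P(\mathbf{z}_t \mid \mathbf{x}_t) = \mathcal{N}(\mathbf{z}_t; \mathbf{H}\mathbf{x}_t, \boldsymbol{\Sigma}_z), \tag{14.21}$$
where $\mathbf{F}$ and $\boldsymbol{\Sigma}_x$ are matrices describing the linear transition model and transition noise covariance, and $\mathbf{H}$ and $\boldsymbol{\Sigma}_z$ are the corresponding matrices for the sensor model. Now the update equations for the mean and covariance, in their full, hairy horribleness, are
$$\boldsymbol{\mu}_{t+1} = \mathbf{F}\boldsymbol{\mu}_t + \mathbf{K}_{t+1}(\mathbf{z}_{t+1} - \mathbf{H}\mathbf{F}\boldsymbol{\mu}_t)$$
$$\boldsymbol{\Sigma}_{t+1} = (\mathbf{I} - \mathbf{K}_{t+1}\mathbf{H})(\mathbf{F}\boldsymbol{\Sigma}_t\mathbf{F}^\top + \boldsymbol{\Sigma}_x), \tag{14.22}$$
where $\mathbf{K}_{t+1} = (\mathbf{F}\boldsymbol{\Sigma}_t\mathbf{F}^\top + \boldsymbol{\Sigma}_x)\mathbf{H}^\top (\mathbf{H}(\mathbf{F}\boldsymbol{\Sigma}_t\mathbf{F}^\top + \boldsymbol{\Sigma}_x)\mathbf{H}^\top + \boldsymbol{\Sigma}_z)^{-1}$ is the Kalman gain matrix. Believe it or not, these equations make some intuitive sense. For example, consider the update for the mean state estimate $\boldsymbol{\mu}$. The term $\mathbf{F}\boldsymbol{\mu}_t$ is the predicted state at $t+1$, so $\mathbf{H}\mathbf{F}\boldsymbol{\mu}_t$ is the predicted observation. Therefore, the term $\mathbf{z}_{t+1} - \mathbf{H}\mathbf{F}\boldsymbol{\mu}_t$ represents the error in the predicted observation. This is multiplied by $\mathbf{K}_{t+1}$ to correct the predicted state; hence, $\mathbf{K}_{t+1}$ is a measure of how seriously to take the new observation relative to the prediction. As in Equation (14.20), we also have the property that the variance update is independent of the observations. The sequence of values for $\boldsymbol{\Sigma}_t$ and $\mathbf{K}_t$ can therefore be computed offline, and the actual calculations required during online tracking are quite modest.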
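The update equations (14.22) can be sketched in plain Python. To stay free of external libraries, this version assumes a scalar observation, so the matrix inverse in the Kalman gain reduces to a division; that restriction, the helper names, and the constant-velocity example values are all assumptions made for illustration, not part of the text.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def kalman_step(mu, Sigma, z, F, Sigma_x, H, var_z):
    """One update of Equation (14.22), restricted to a scalar observation z.

    mu is an n x 1 column, Sigma is n x n, H is 1 x n, and var_z is the
    scalar sensor noise variance. Returns (mu_next, Sigma_next, K).
    """
    n = len(mu)
    mu_pred = matmul(F, mu)                                    # F mu_t
    P = add(matmul(matmul(F, Sigma), transpose(F)), Sigma_x)   # F Sigma_t F^T + Sigma_x
    PHt = matmul(P, transpose(H))                              # n x 1
    s = matmul(H, PHt)[0][0] + var_z                           # H P H^T + Sigma_z (1 x 1)
    K = [[row[0] / s] for row in PHt]                          # Kalman gain, n x 1
    error = z - matmul(H, mu_pred)[0][0]                       # z_{t+1} - H F mu_t
    mu_next = [[m[0] + k[0] * error] for m, k in zip(mu_pred, K)]
    KH = matmul(K, H)
    I_minus_KH = [[(1.0 if i == j else 0.0) - KH[i][j] for j in range(n)]
                  for i in range(n)]
    Sigma_next = matmul(I_minus_KH, P)
    return mu_next, Sigma_next, K

# Hypothetical constant-velocity tracker: state = (position, velocity).
dt = 1.0
F = [[1.0, dt], [0.0, 1.0]]
H = [[1.0, 0.0]]                        # only position is observed
Sigma_x = [[0.1, 0.0], [0.0, 0.1]]
mu, Sigma = [[0.0], [1.0]], [[1.0, 0.0], [0.0, 1.0]]
for z in [1.2, 1.9, 3.1]:
    mu, Sigma, K = kalman_step(mu, Sigma, z, F, Sigma_x, H, var_z=0.5)
print(mu, K)
```

Note that `Sigma` and `K` are computed without reference to `z`, so their whole sequence could be tabulated before any observation arrives. With 1 × 1 matrices the function reduces to the scalar update of Equation (14.20).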
To illustrate these equations at work, we have applied them to the problem of tracking an object moving on the X–Y plane. The state variables are $\mathbf{X} = (X, Y, \dot{X}, \dot{Y})^\top$, so $\mathbf{F}$, $\boldsymbol{\Sigma}_x$, $\mathbf{H}$, and $\boldsymbol{\Sigma}_z$ are $4 \times 4$ matrices. Figure 14.11(a) shows the true trajectory, a series of noisy observations, and the trajectory estimated by Kalman filtering, along with the covariances indicated by the one-standard-deviation contours. The filtering process does a good job of tracking the actual motion, and, as expected, the variance quickly reaches a fixed point.
We can also derive equations for smoothing as well as filtering with linear-Gaussian models. The smoothing results are shown in Figure 14.11(b). Notice how the variance in the position estimate is sharply reduced, except at the ends of the trajectory (why?), and that the estimated trajectory is much smoother.
14.4.4 Applicability of Kalman filtering
The Kalman filter and its elaborations are used in a vast array of applications. The “classical”
application is in radar tracking of aircraft and missiles. Related applications include acoustic tracking of submarines and ground vehicles and visual tracking of vehicles and people. In a slightly more esoteric vein, Kalman filters are used to reconstruct particle trajectories from
bubble-chamber photographs and ocean currents from satellite surface measurements. The range of application is much larger than just the tracking of motion: any system characterized
by continuous state variables and noisy measurements will do. Such systems include pulp mills, chemical plants, nuclear reactors, plant ecosystems, and national economies.
Figure 14.11 (a) Results of Kalman filtering for an object moving on the X–Y plane, showing the true trajectory (left to right), a series of noisy observations, and the trajectory estimated by Kalman filtering. Variance in the position estimate is indicated by the ovals. (b) The results of Kalman smoothing for the same observation sequence. (Panel titles: 2D filtering; 2D smoothing.)
The fact that Kalman filtering can be applied to a system does not mean that the results will be valid or useful. The assumptions made (linear-Gaussian transition and sensor models) are very strong. The extended Kalman filter (EKF) attempts to overcome nonlinearities in the system being modeled. A system is nonlinear if the transition model cannot be described as a matrix multiplication of the state vector, as in Equation (14.21). The EKF works by modeling the system as locally linear in $\mathbf{x}_t$ in the region of $\mathbf{x}_t = \boldsymbol{\mu}_t$, the mean of the current state distribution. This works well for smooth, well-behaved systems and allows the tracker to maintain and update a Gaussian state distribution that is a reasonable approximation to the true posterior. A detailed example is given in Chapter 26.
What does it mean for a system to be “unsmooth” or “poorly behaved”? Technically,
it means that there is significant nonlinearity in system response within the region that is “close” (according to the covariance $\boldsymbol{\Sigma}_t$) to the current mean $\boldsymbol{\mu}_t$.
To understand this idea
in nontechnical terms, consider the example of trying to track a bird as it flies through the
jungle. The bird appears to be heading at high speed straight for a tree trunk. The Kalman
filter, whether regular or extended, can make only a Gaussian prediction of the location of the
bird, and the mean of this Gaussian will be centered on the trunk, as shown in Figure 14.12(a). A reasonable model of the bird, on the other hand, would predict evasive action to one side or the other, as shown in Figure 14.12(b). Such a model is highly nonlinear, because the bird’s
decision varies sharply depending on its precise location relative to the trunk.
To handle examples like these, we clearly need a more expressive language for representing the behavior of the system being modeled. Within the control theory community, for
which problems such as evasive maneuvering by aircraft raise the same kinds of difficulties,
the standard solution is the switching Kalman filter. In this approach, multiple Kalman filters run in parallel, each using a different model of the system—for example, one for straight
flight, one for sharp left turns, and one for sharp right turns. A weighted sum of predictions
is used, where the weight depends on how well each filter fits the current data. We will see
in the next section that this is simply a special case of the general dynamic Bayesian network model, obtained by adding a discrete “maneuver” state variable to the network shown in Figure 14.9. Switching Kalman filters are discussed further in Exercise 14.KFSW.
Figure 14.12 A bird flying toward a tree (top views). (a) A Kalman filter will predict the location of the bird using a single Gaussian centered on the obstacle. (b) A more realistic model allows for the bird’s evasive action, predicting that it will fly to one side or the other.
14.5 Dynamic Bayesian Networks
Dynamic Bayesian networks, or DBNs, extend the semantics of standard Bayesian networks to handle temporal probability models of the kind described in Section 14.1. We have already seen examples of DBNs: the umbrella network in Figure 14.2 and the Kalman filter network in Figure 14.9. In general, each slice of a DBN can have any number of state variables $\mathbf{X}_t$ and evidence variables $\mathbf{E}_t$. For simplicity, we assume that the variables, their links, and their
conditional distributions are exactly replicated from slice to slice and that the DBN represents
a first-order Markov process, so that each variable can have parents only in its own slice or the immediately preceding slice. In this way, the DBN corresponds to a Bayesian network with infinitely many variables.
It should be clear that every hidden Markov model can be represented as a DBN with a single state variable and a single evidence variable. It is also the case that every discrete-variable DBN can be represented as an HMM; as explained in Section 14.3, we can combine all the state variables in the DBN into a single state variable whose values are all possible tuples of values of the individual state variables. Now, if every HMM is a DBN and every DBN can be translated into an HMM, what's the difference? The difference is that, by decomposing the state of a complex system into its constituent variables, we can take advantage
of sparseness in the temporal probability model.
To see what this means in practice, remember that in Section 14.3 we said that an HMM representation for a temporal process with n discrete variables, each with up to d values, needs a transition matrix of size $O(d^{2n})$. The DBN representation, on the other hand, has size $O(nd^k)$ if the number of parents of each variable is bounded by k. In other words, the DBN representation is linear rather than exponential in the number of variables. For the vacuum robot with 42 possibly dirty locations, the number of probabilities required is reduced from $5 \times 10^{29}$ to a few thousand.
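To make the size comparison concrete, the following Python sketch evaluates both expressions for one small configuration. The function names and the values n = 20, d = 2, k = 3 are illustrative assumptions, not figures from the text, and the DBN count is the row bound n·d^k rather than an exact parameter count.

```python
def hmm_transition_entries(n: int, d: int) -> int:
    """Entries in the transition matrix of the equivalent HMM.

    The combined state variable has d**n values, so the matrix is
    (d**n) x (d**n), i.e. d**(2n) entries.
    """
    return (d ** n) ** 2


def dbn_cpt_rows(n: int, d: int, k: int) -> int:
    """Upper bound on CPT rows in one DBN slice.

    Each of the n variables has at most k parents with d values each,
    hence at most d**k parent combinations per variable.
    """
    return n * d ** k


if __name__ == "__main__":
    n, d, k = 20, 2, 3  # illustrative values (assumption)
    print(hmm_transition_entries(n, d))  # 1099511627776
    print(dbn_cpt_rows(n, d, k))         # 160
```

Doubling n to 40 squares the HMM figure but only doubles the DBN bound, which is the sense in which the DBN representation is linear in the number of variables.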
Figure 14.13 Left: Specification of the prior, transition model, and sensor model for the umbrella DBN. Subsequent slices are copies of slice 1. Right: A simple DBN for robot motion in the X-Y plane.
We have already explained that every Kalman filter model can be represented in a DBN with continuous variables and linear-Gaussian conditional distributions (Figure 14.9). It should be clear from the discussion at the end of the preceding section that not every DBN can be represented by a Kalman filter model. In a Kalman filter, the current state distribution is always a single multivariate Gaussian distribution, that is, a single “bump” in a particular location. DBNs, on the other hand, can model arbitrary distributions.
For many real-world applications, this flexibility is essential. Consider, for example, the current location of my keys. They might be in my pocket, on the bedside table, on the kitchen counter, dangling from the front door, or locked in the car. A single Gaussian bump that included all these places would have to allocate significant probability to the keys being in mid-air above the front garden. Aspects of the real world such as purposive agents, obstacles, and pockets introduce “nonlinearities” that require combinations of discrete and continuous variables in order to get reasonable models.
14.5.1 Constructing DBNs
To construct a DBN, one must specify three kinds of information: the prior distribution over the state variables, $\mathbf{P}(\mathbf{X}_0)$; the transition model $\mathbf{P}(\mathbf{X}_{t+1} \mid \mathbf{X}_t)$; and the sensor model $\mathbf{P}(\mathbf{E}_t \mid \mathbf{X}_t)$. To specify the transition and sensor models, one must also specify the topology of the connections between successive slices and between the state and evidence variables. Because the transition and sensor models are assumed to be time-homogeneous (the same for all t), it is most convenient simply to specify them for the first slice. For example, the complete DBN specification for the umbrella world is given by the three-node network shown in Figure 14.13(a). From this specification, the complete DBN with an unbounded number of time slices can be constructed as needed by copying the first slice.
Let us now consider a more interesting example: monitoring a battery-powered robot moving in the X-Y plane, as introduced at the end of Section 14.1.
First, we need state variables, which will include both $\mathbf{X}_t = (X_t, Y_t)$ for position and $\dot{\mathbf{X}}_t = (\dot{X}_t, \dot{Y}_t)$ for velocity. We assume some method of measuring position, perhaps a fixed camera or onboard GPS (Global Positioning System), yielding measurements $\mathbf{Z}_t$. The position at the next time step depends on the current position and velocity, as in the standard Kalman filter model. The velocity at the next step depends on the current velocity and the state of the battery. We add $\mathit{Battery}_t$ to represent the actual battery charge level, which has as parents the previous battery level and the velocity, and we add $\mathit{BMeter}_t$, which measures the battery charge level. This gives us the basic model shown in Figure 14.13(b).
It is worth looking in more depth at the nature of the sensor model for $\mathit{BMeter}_t$.
Let us suppose, for simplicity, that both $\mathit{Battery}_t$ and $\mathit{BMeter}_t$ can take on discrete values 0 through 5. (Exercise 14.BATT asks you to relate this discrete model to a corresponding continuous model.) If the meter is always accurate, then the CPT $\mathbf{P}(\mathit{BMeter}_t \mid \mathit{Battery}_t)$ should have probabilities of 1.0 “along the diagonal” and probabilities of 0.0 elsewhere. In reality, noise always creeps into measurements. For continuous measurements, a Gaussian distribution with a small variance might be used.⁷ For our discrete variables, we can approximate a Gaussian using a distribution in which the probability of error drops off in the appropriate way, so that the probability of a large error is very small. We use the term Gaussian error model to cover both the continuous and discrete versions.
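One way such a discrete Gaussian error model might be built is sketched below in Python. The text does not specify the exact table, so the construction (Gaussian-shaped weights renormalized over the six levels), the function name, and the width sigma = 0.5 are all assumptions made for illustration.

```python
import math


def gaussian_error_cpt(levels: int = 6, sigma: float = 0.5) -> list[list[float]]:
    """Return cpt[b][m] = P(BMeter = m | Battery = b) for b, m in 0..levels-1.

    Each row is a Gaussian-shaped weight centered on the true level b,
    renormalized so that the row sums to 1. sigma is a hypothetical width.
    """
    cpt = []
    for b in range(levels):
        weights = [math.exp(-((m - b) ** 2) / (2 * sigma ** 2)) for m in range(levels)]
        total = sum(weights)
        cpt.append([w / total for w in weights])
    return cpt


if __name__ == "__main__":
    cpt = gaussian_error_cpt()
    # Most mass sits on the diagonal; an error of five levels is vanishingly rare.
    print(cpt[5][5], cpt[5][4], cpt[5][0])
```

With a small sigma the table is close to the perfectly accurate diagonal CPT, and the probability of reading 0 when the true level is 5 is far smaller than any plausible probability of an actual sensor fault.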
Anyone with hands-on experience of robotics, computerized process control, or other forms of automatic sensing will readily testify to the fact that small amounts of measurement noise are often the least of one’s problems. Real sensors fail. When a sensor fails, it does not necessarily send a signal saying, “Oh, by the way, the data I'm about to send you is a load of nonsense.” Instead, it simply sends the nonsense. The simplest kind of failure is called a transient failure, where the sensor occasionally decides to send some nonsense. For example, the battery level sensor might have a habit of sending a reading of 0 when someone bumps the robot, even if the battery is fully charged.
Let’s see what happens when a transient failure occurs with a Gaussian error model that doesn’t accommodate such failures. Suppose, for example, that the robot is sitting quietly and observes 20 consecutive battery readings of 5. Then the battery meter has a temporary seizure and the next reading is $\mathit{BMeter}_{21} = 0$. What will the simple Gaussian error model lead us to believe about $\mathit{Battery}_{21}$? According to Bayes’ rule, the answer depends on both the sensor model $\mathbf{P}(\mathit{BMeter}_{21} = 0 \mid \mathit{Battery}_{21})$ and the prediction $\mathbf{P}(\mathit{Battery}_{21} \mid \mathit{BMeter}_{1:20})$. If the probability of a large sensor error is significantly less than the probability of a transition to $\mathit{Battery}_{21} = 0$, even if the latter is very unlikely, then the posterior distribution will assign a high probability to the battery’s being empty.
A second reading of 0 at t = 22 will make this conclusion almost certain. If the transient failure then disappears and the reading returns to 5 from t = 23 onwards, the estimate for the battery level will quickly return to 5. (This does not mean the algorithm thinks the battery magically recharged itself, which may be physically impossible; instead, the algorithm now believes that the battery was never low and the extremely unlikely hypothesis that the battery meter had two consecutive huge errors must be the right explanation.) This course of events is illustrated in the upper curve of Figure 14.14(a), which shows the expected value (see Appendix A) of $\mathit{Battery}_t$ over time, using a discrete Gaussian error model.
Despite the recovery, there is a time (t = 22) when the robot is convinced that its battery is empty; presumably, then, it should send out a mayday signal and shut down. Alas, its oversimplified sensor model has led it astray. The moral of the story is simple: for the system to handle sensor failure properly, the sensor model must include the possibility of failure.
7 Strictly speaking, a Gaussian distribution is problematic because it assigns nonzero probability to large negative charge levels. The beta distribution is a better choice for a variable whose range is restricted.
Figure 14.14 (a) Upper curve: trajectory of the expected value of $\mathit{Battery}_t$ for an observation sequence consisting of all 5s except for 0s at t = 21 and t = 22, using a simple Gaussian error model. Lower curve: trajectory when the observation remains at 0 from t = 21 onwards. (b) The same experiment run with the transient failure model. The transient failure is handled well, but the persistent failure results in excessive pessimism about the battery charge.
The simplest kind of failure model for a sensor allows a certain probability that the sensor will return some completely incorrect value, regardless of the true state of the world. For example, if the battery meter fails by returning 0, we might say that
$P(\mathit{BMeter}_t = 0 \mid \mathit{Battery}_t = 5) = 0.03$,
which is presumably much larger than the probability assigned by the simple Gaussian error model. Let's call this the transient failure model. How does it help when we are faced with a reading of 0? Provided that the predicted probability of an empty battery, according to the readings so far, is much less than 0.03, then the best explanation of the observation $\mathit{BMeter}_{21} = 0$ is that the sensor has temporarily failed. Intuitively, we can think of the belief about the battery level as having a certain amount of “inertia” that helps to overcome temporary blips in the meter reading. The upper curve in Figure 14.14(b) shows that the transient failure model can handle transient failures without a catastrophic change in beliefs.
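The comparison between the two sensor models can be checked with a single Bayes-rule update. The Python sketch below is a deliberate simplification: it considers only two hypotheses (battery full at 5 or empty at 0), assumes an empty battery always reads 0, and uses a hypothetical predicted probability of 0.001 for the empty battery. Only the 0.03 failure probability comes from the text.

```python
def posterior_empty(p_empty_pred: float,
                    p_zero_given_full: float,
                    p_zero_given_empty: float = 1.0) -> float:
    """P(Battery = 0 | BMeter = 0) in a two-hypothesis world (full vs. empty).

    p_empty_pred      : predicted P(Battery = 0) before the reading (assumed)
    p_zero_given_full : sensor model P(BMeter = 0 | Battery = 5)
    p_zero_given_empty: sensor model P(BMeter = 0 | Battery = 0) (assumed 1.0)
    """
    numerator = p_zero_given_empty * p_empty_pred
    denominator = numerator + p_zero_given_full * (1.0 - p_empty_pred)
    return numerator / denominator


if __name__ == "__main__":
    p_pred = 0.001  # hypothetical prediction from the readings so far
    # Gaussian error model: a reading of 0 at full charge is astronomically unlikely.
    print(posterior_empty(p_pred, 1e-20))
    # Transient failure model: a bogus 0 has probability 0.03.
    print(posterior_empty(p_pred, 0.03))
```

Under these assumptions the Gaussian error model drives the posterior probability of an empty battery to essentially 1, whereas the transient failure model leaves it at roughly 0.03, because 0.001 is much less than 0.03.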
So much for temporary blips. What about a persistent sensor failure? Sadly, failures of this kind are all too common. If the sensor returns 20 readings of 5 followed by 20 readings of 0, then the transient sensor failure model described in the preceding paragraph will result in the robot gradually coming to believe that its battery is empty when in fact it may be that the meter has failed. The lower curve in Figure 14.14(b) shows the belief “trajectory” for this case. By t = 25 (five readings of 0) the robot is convinced that its battery is empty. Obviously, we would prefer the robot to believe that its battery meter is broken, if indeed this is the more likely event.
Figure 14.15 (a) A DBN fragment showing the sensor status variable required for modeling persistent failure of the battery sensor. (b) Upper curves: trajectories of the expected value of $\mathit{Battery}_t$ for the “transient failure” and “permanent failure” observation sequences. Lower curves: probability trajectories for BMBroken given the two observation sequences.
Unsurprisingly, to handle persistent failure, we need a persistent failure model that describes how the sensor behaves under normal conditions and after failure. To do this, we need to augment the state of the system with an additional variable, say, BMBroken, that describes the status of the battery meter. The persistence of failure must be modeled by an arc linking $\mathit{BMBroken}_0$ to $\mathit{BMBroken}_1$. This persistence arc has a CPT that gives a small probability of failure in any given time step, say, 0.001, but specifies that the sensor stays broken once it breaks. When the sensor is OK, the sensor model for BMeter is identical to the transient failure model; when the sensor is broken, it says BMeter is always 0, regardless of the actual battery charge. The persistent failure model for the battery sensor is shown in Figure 14.15(a). Its performance on the two data sequences (temporary blip and persistent failure) is shown in Figure 14.15(b).
There are several things to notice about these curves. First, in the case of the temporary blip, the probability that the sensor is broken rises significantly after the second 0 reading, but immediately drops back to zero once a 5 is observed. Second, in the case of persistent failure, the probability that the sensor is broken rises quickly to almost 1 and stays there. Finally, once the sensor is known to be broken, the robot can only assume that its battery discharges at the “normal” rate. This is shown by the gradually descending level of $E(\mathit{Battery}_t \mid \ldots)$.
So far, we have merely scratched the surface of the problem of representing complex processes. The variety of transition models is huge, encompassing topics as disparate as modeling the human endocrine system and modeling multiple vehicles driving on a freeway. Sensor modeling is also a vast subfield in itself. But dynamic Bayesian networks can model even subtle phenomena, such as sensor drift, sudden decalibration, and the effects of exogenous conditions (such as weather) on sensor readings.
14.5.2 Exact inference in DBNs
Having sketched some ideas for representing complex processes as DBNs, we now turn to the question of inference. In a sense, this question has already been answered: dynamic Bayesian networks are Bayesian networks, and we already have algorithms for inference in Bayesian networks. Given a sequence of observations, one can construct the full Bayesian network representation of a DBN by replicating slices until the network is large enough to accommodate the observations, as in Figure 14.16. This technique is called unrolling. (Technically, the DBN is equivalent to the semi-infinite network obtained by unrolling forever. Slices added beyond the last observation have no effect on inferences within the observation period and can be omitted.) Once the DBN is unrolled, one can use any of the inference algorithms (variable elimination, clustering methods, and so on) described in Chapter 13.
Figure 14.16 Unrolling a dynamic Bayesian network: slices are replicated to accommodate the observation sequence $\mathit{Umbrella}_{1:3}$. Further slices have no effect on inferences within the observation period.
Unfortunately, a naive application of unrolling would not be particularly efficient. If we want to perform filtering or smoothing with a long sequence of observations $\mathbf{e}_{1:t}$, the unrolled network would require O(t) space and would thus grow without bound as more observations were added. Moreover, if we simply run the inference algorithm anew each time an observation is added, the inference time per update will also increase as O(t).
Looking back to Section 14.2.1, we see that constant time and space per filtering update can be achieved if the computation can be done recursively. Essentially, the filtering update in Equation (14.5) works by summing out the state variables of the previous time step to get the distribution for the new time step. Summing out variables is exactly what the variable elimination (Figure 13.13) algorithm does, and it turns out that running variable elimination with the variables in temporal order exactly mimics the operation of the recursive filtering update in Equation (14.5). The modified algorithm keeps at most two slices in memory at any one time: starting with slice 0, we add slice 1, then sum out slice 0, then add slice 2, then sum out slice 1, and so on. In this way, we can achieve constant space and time per filtering update. (The same performance can be achieved by suitable modifications to the clustering algorithm.) Exercise 14.DBNE asks you to verify this fact for the umbrella network.
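The add-a-slice, sum-out-a-slice pattern can be sketched in Python for the single-variable umbrella network. The transition values (0.7/0.3) and sensor values (0.9/0.2) follow the umbrella specification of Figure 14.13; the uniform prior over Rain and the function names are assumptions of this sketch, which is not a general variable elimination routine.

```python
def filter_step(belief, evidence, transition, sensor):
    """One recursive filtering update that holds only two slices at a time.

    belief[x]        = P(X_t = x | e_1:t)
    transition[x][y] = P(X_{t+1} = y | X_t = x)
    sensor[y][e]     = P(E_{t+1} = e | X_{t+1} = y)
    """
    states = list(belief)
    # Add slice t+1, then sum out the state variable of slice t (prediction).
    predicted = {y: sum(belief[x] * transition[x][y] for x in states) for y in states}
    # Condition on the new evidence and normalize.
    unnormalized = {y: sensor[y][evidence] * predicted[y] for y in states}
    z = sum(unnormalized.values())
    return {y: p / z for y, p in unnormalized.items()}


def run_filter(observations, prior=None):
    """Filter a sequence of umbrella observations (True = umbrella seen)."""
    transition = {True: {True: 0.7, False: 0.3}, False: {True: 0.3, False: 0.7}}
    sensor = {True: {True: 0.9, False: 0.1}, False: {True: 0.2, False: 0.8}}
    belief = dict(prior) if prior else {True: 0.5, False: 0.5}  # uniform prior (assumption)
    for e in observations:
        belief = filter_step(belief, e, transition, sensor)
    return belief


if __name__ == "__main__":
    print(run_filter([True]))        # P(Rain_1 | u_1)
    print(run_filter([True, True]))  # P(Rain_2 | u_1, u_2)
```

The memory held between updates is just the current belief dictionary, whatever the length of the observation sequence. With more than one state variable, the belief would become a table over all joint assignments of the slice.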
So much for the good news; now for the bad news: It turns out that the “constant” for the per-update time and space complexity is, in almost all cases, exponential in the number of state variables. What happens is that, as the variable elimination proceeds, the factors grow to include all the state variables (or, more precisely, all those state variables that have parents in the previous time slice). The maximum factor size is $O(d^{n+k})$ and the total update cost per step is $O(nd^{n+k})$, where d is the domain size of the variables and k is the maximum number of parents of any state variable.
Of course, this is much less than the cost of HMM updating, which is $O(d^{2n})$, but it is still infeasible for large numbers of variables. This grim fact means that even though we can use DBNs to represent very complex temporal processes with many sparsely connected variables, we cannot reason efficiently and exactly about those processes. The DBN model itself, which represents the prior joint distribution over all the variables, is factorable into its constituent CPTs, but the posterior joint distribution conditioned on an observation sequence (that is, the forward message) is generally not factorable. The problem is intractable in general, so we must fall back on approximate methods.
14.5.3 Approximate inference in DBNs
Section 13.4 described two approximation algorithms: likelihood weighting (Figure 13.18)
and Markov chain Monte Carlo (MCMC, Figure 13.20). Of the two, the former is most easily adapted to the DBN context. (An MCMC filtering algorithm is described briefly in the notes
at the end of this chapter.) We will see, however, that several improvements are required over
the standard likelihood weighting algorithm before a practical method emerges. Recall that likelihood weighting works by sampling the nonevidence nodes of the network in topological order, weighting each sample by the likelihood it accords to the observed evidence variables. As with the exact algorithms, we could apply likelihood weighting directly to an unrolled DBN, but this would suffer from the same problems of increasing time
and space requirements per update as the observation sequence grows. The problem is that
the standard algorithm runs each sample in turn, all the way through the network.
Instead, we can simply run all N samples together through the DBN, one slice at a time.
The modified algorithm fits the general pattern of filtering algorithms, with the set of N samples as the forward message. The first key innovation, then, is to use the samples themselves as an approximate representation of the current state distribution.
This meets the require-
For a single customer $C_1$ recommending a single book $B_1$, the Bayes net might look like the one shown in Figure 15.2(a). (Just as in Section 9.1, expressions with parentheses such as $\mathit{Honest}(C_1)$ are just fancy symbols; in this case, fancy names for random variables.) With two customers and two books, the Bayes net looks like the one in Figure 15.2(b). For larger numbers of books and customers, it is quite impractical to specify a Bayes net by hand.
1 The name relational probability model was given by Pfeffer (2000) to a slightly different representation, but the underlying ideas are the same.
2 A game theorist would advise a dishonest customer to avoid detection by occasionally recommending a good book from a competitor. See Chapter 18.
Fortunately, the network has a lot of repeated structure. Each Recommendation(c,b) variable has as its parents the variables Honest(c), Kindness(c), and Quality(b). Moreover, the conditional probability tables (CPTs) for all the Recommendation(c,b) variables are identical, as are those for all the Honest(c) variables, and so on. The situation seems tailor-made for a first-order language. We would like to say something like
Recommendation(c,b) ~ RecCPT(Honest(c), Kindness(c), Quality(b))
which means that a customer’s recommendation for a book depends probabilistically on the customer’s honesty and kindness and the book’s quality according to a fixed CPT.
Like first-order logic, RPMs have constant, function, and predicate symbols. We will also assume a type signature for each function, that is, a specification of the type of each argument and the function’s value. (If the type of each object is known, many spurious possible worlds are eliminated by this mechanism; for example, we need not worry about the kindness of each book, books recommending customers, and so on.) For the book-recommendation domain, the types are Customer and Book, and the type signatures for the functions and predicates are as follows:
Honest : Customer → {true, false}
Kindness : Customer → {1, 2, 3, 4, 5}
Quality : Book → {1, 2, 3, 4, 5}
Recommendation : Customer × Book → {1, 2, 3, 4, 5}
The constant symbols will be whatever customer and book names appear in the retailer’s data set. In the example given in Figure 15.2(b), these were $C_1$, $C_2$ and $B_1$, $B_2$.
Given the constants and their types, together with the functions and their type signatures,
the basic random variables of the RPM are obtained by instantiating each function with each possible combination of objects. For the book recommendation model, the basic random variables include $\mathit{Honest}(C_1)$, $\mathit{Quality}(B_2)$, $\mathit{Recommendation}(C_1, B_2)$, and so on. These are exactly the variables appearing in Figure 15.2(b). Because each type has only finitely many instances (thanks to the domain closure assumption), the number of basic random variables is also finite.
To complete the RPM, we have to write the dependencies that govern these random variables. There is one dependency statement for each function, where each argument of the function is a logical variable (i.e., a variable that ranges over objects, as in first-order logic). For example, the following dependency states that, for every customer c, the prior probability of honesty is 0.99 true and 0.01 false:
Honest(c) ~ ⟨0.99, 0.01⟩
Similarly, we can state prior probabilities for the kindness value of each customer and the quality of each book, each on the 1-5 scale:
Kindness(c) ~ ⟨0.1, 0.1, 0.2, 0.3, 0.3⟩
Quality(b) ~ ⟨0.05, 0.2, 0.4, 0.2, 0.15⟩
Finally, we need the dependency for recommendations: for any customer c and book b, the score depends on the honesty and kindness of the customer and the quality of the book:
Recommendation(c,b) ~ RecCPT(Honest(c), Kindness(c), Quality(b))
where RecCPT is a separately defined conditional probability table with 2 × 5 × 5 = 50 rows, each with 5 entries. For the purposes of illustration, we’ll assume that an honest recommendation for a book of quality q from a person of kindness k is uniformly distributed in the range $[\lfloor (q+k)/2 \rfloor, \lceil (q+k)/2 \rceil]$.
The semantics of the RPM can be obtained by instantiating these dependencies for all known constants, giving a Bayesian network (as in Figure 15.2(b)) that defines a joint distribution over the RPM’s random variables.³
The set of possible worlds is the Cartesian product of the ranges of all the basic random variables, and, as with Bayesian networks, the probability for each possible world is the product of the relevant conditional probabilities from the model. With C customers and B books, there are C Honest variables, C Kindness variables, B Quality variables, and BC Recommendation variables, leading to $2^C 5^{C+B+BC}$ possible worlds. With ten million books and a billion customers, that’s about $10^{7 \times 10^{15}}$ worlds. Thanks to the expressive power of RPMs, the complete probability model still has fewer than 300 parameters, most of them in the RecCPT table.
We can refine the model by asserting a context-specific independence (see page 420) to reflect the fact that dishonest customers ignore quality when giving a recommendation; moreover, kindness plays no role in their decisions. Thus, Recommendation(c,b) is independent of Kindness(c) and Quality(b) when Honest(c) = false:
Recommendation(c,b) ~ if Honest(c) then
    HonestRecCPT(Kindness(c), Quality(b))
  else ⟨0.4, 0.1, 0.0, 0.1, 0.4⟩.
This kind of dependency may look like an ordinary if-then-else statement in a programming language, but there is a key difference: the inference engine doesn’t necessarily know the value of the conditional test because Honest(c) is a random variable.
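The contrast with an ordinary conditional can be seen by writing the dependency as a forward sampler, as in the Python sketch below. When generating a world from the prior, Honest(c) is sampled first, so the branch is known; an inference engine reasoning backward from an observed recommendation would instead have to weigh both branches. The priors are those given for the model; the honest-recommendation rule (uniform on the integers from ⌊(q+k)/2⌋ to ⌈(q+k)/2⌉) is an assumption of this sketch, as are all function and constant names.

```python
import random

SCORES = [1, 2, 3, 4, 5]
HONEST_PRIOR = 0.99                         # P(Honest(c) = true)
KINDNESS_PRIOR = [0.1, 0.1, 0.2, 0.3, 0.3]
QUALITY_PRIOR = [0.05, 0.2, 0.4, 0.2, 0.15]
DISHONEST_DIST = [0.4, 0.1, 0.0, 0.1, 0.4]  # ignores kindness and quality


def honest_rec(kindness: int, quality: int, rng: random.Random) -> int:
    """Assumed HonestRecCPT: uniform on [floor((q+k)/2), ceil((q+k)/2)]."""
    lo = (quality + kindness) // 2
    hi = (quality + kindness + 1) // 2
    return rng.randint(lo, hi)  # randint is inclusive at both ends


def sample_recommendation(rng: random.Random):
    """Sample (honest, kindness, quality, recommendation) for one customer-book pair."""
    honest = rng.random() < HONEST_PRIOR
    kindness = rng.choices(SCORES, weights=KINDNESS_PRIOR)[0]
    quality = rng.choices(SCORES, weights=QUALITY_PRIOR)[0]
    if honest:
        rec = honest_rec(kindness, quality, rng)
    else:
        # Context-specific independence: kindness and quality play no role here.
        rec = rng.choices(SCORES, weights=DISHONEST_DIST)[0]
    return honest, kindness, quality, rec


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        print(sample_recommendation(rng))
```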
We can elaborate this model in endless ways to make it more realistic. For example, suppose that an honest customer who is a fan of a book’s author always gives the book a 5, regardless of quality:
Recommendation(c,b) ~ if Honest(c) then
    if Fan(c, Author(b)) then Exactly(5)
    else HonestRecCPT(Kindness(c), Quality(b))
  else ⟨0.4, 0.1, 0.0, 0.1, 0.4⟩
Again, the conditional test Fan(c,Author(b)) is unknown, but if a customer gives only 5s to a
particular author’s books and is not otherwise especially kind, then the posterior probability
that the customer is a fan of that author will be high. Furthermore, the posterior distribution will tend to discount the customer’s 5s in evaluating the quality of that author’s books.
In this example, we implicitly assumed that the value of Author(b) is known for every b, but this may not be the case. How can the system reason about whether, say, $C_1$ is a fan of $\mathit{Author}(B_2)$ when $\mathit{Author}(B_2)$ is unknown? The answer is that the system may have to reason about all possible authors. Suppose (to keep things simple) that there are just two authors, $A_1$ and $A_2$. Then $\mathit{Author}(B_2)$ is a random variable with two possible values, $A_1$ and $A_2$, and it is a parent of $\mathit{Recommendation}(C_1, B_2)$. The variables $\mathit{Fan}(C_1, A_1)$ and $\mathit{Fan}(C_1, A_2)$ are parents too. The conditional distribution for $\mathit{Recommendation}(C_1, B_2)$ is then essentially a multiplexer in which the $\mathit{Author}(B_2)$ parent acts as a selector to choose which of $\mathit{Fan}(C_1, A_1)$ and $\mathit{Fan}(C_1, A_2)$ actually gets to influence the recommendation. A fragment of the equivalent Bayes net is shown in Figure 15.3. Uncertainty in the value of $\mathit{Author}(B_2)$, which affects the dependency structure of the network, is an instance of relational uncertainty.
Figure 15.3 Fragment of the equivalent Bayes net for the book recommendation RPM when $\mathit{Author}(B_2)$ is unknown.
3 Some technical conditions are required for an RPM to define a proper distribution. First, the dependencies must be acyclic; otherwise the resulting Bayesian network will have cycles. Second, the dependencies must (usually) be well-founded: there can be no infinite ancestor chains, such as might arise from recursive dependencies. See Exercise 15.HAMD for an exception to this rule.
In case you are wondering how the system can possibly work out who the author of $B_2$ is: consider the possibility that three other customers are fans of $A_1$ (and have no other favorite authors in common) and all three have given $B_2$ a 5, even though most other customers find it quite dismal. In that case, it is extremely likely that $A_1$ is the author of $B_2$. The emergence of sophisticated reasoning like this from an RPM model of just a few lines is an intriguing example of how probabilistic influences spread through the web of interconnections among objects in the model. As more dependencies and more objects are added, the picture conveyed by the posterior distribution often becomes clearer and clearer.
15.1.2 Example: Rating player skill levels
Many competitive games have a numerical measure of players’ skill levels, sometimes called arating. Perhaps the best-known is the Elo rating for chess players, which rates a typical be- Rating ginner at around 800 and the world champion usually somewhere above 2800. Although Elo ratings have a statistical basis, they have some ad hoc elements. We can develop a Bayesian rating scheme as follows: each player i has an underlying skill level Skill(i); in each game g, Ps actual performance is Performance(i,g), which may vary from the underlying skill level; and the winner of g is the player whose performance in g is better. As an RPM, the model looks like this: Skill(i) ~ N (11, 0?) Performance(i,g) ~ N (Skill(i),%)
Win(i.j,g) = if Game(g,i, j) then (Performance(i,g) > Performance(j,g))
where /32 is the variance ofa player’s actual performance in any specific game relative to the
player’s underlying skill level. Given a set of players and games, as well as outcomes for
some of the games, an RPM inference engine can compute a posterior distribution over the
skill of each player and the probable outcome of any additional game that might be played.
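As a small illustration of what the model implies for one game, suppose (purely as an assumption for this sketch) that both skill levels are known exactly. The two performances are then independent Gaussians, so their difference has mean Skill(i) − Skill(j) and variance 2β², and the win probability has a closed form. The function name `win_probability` and the numbers used are illustrative, not from the text; a real inference engine would instead maintain a posterior over the unknown skills.

```python
import math


def win_probability(skill_i: float, skill_j: float, beta: float) -> float:
    """P(Performance(i,g) > Performance(j,g)) when both skills are known.

    Each performance is N(skill, beta^2), so the difference of the two
    independent performances is N(skill_i - skill_j, 2 * beta^2).
    """
    z = (skill_i - skill_j) / (beta * math.sqrt(2.0))
    # Standard normal CDF written with the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Illustrative numbers only (assumed, not taken from the text).
p_equal = win_probability(1500.0, 1500.0, beta=200.0)
p_stronger = win_probability(1700.0, 1500.0, beta=200.0)
```

Equal skills give a win probability of one half, and the probability moves toward 1 as the skill gap grows relative to β.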
For team games, we’ll assume, as a first approximation, that the overall performance of team t in game g is the sum of the individual performances of the players on t:

$$\mathit{TeamPerformance}(t, g) = \sum_{i \in t} \mathit{Performance}(i, g).$$
Even though the individual performances are not visible to the ratings engine, the players’ skill levels can still be estimated from the results of several games, as long as the team com-
positions vary across games. Microsoft’s TrueSkill™ ratings engine uses this model, along
with an efficient approximate inference algorithm, to serve hundreds of millions of users
every day. This model can be elaborated in numerous ways. For example, we might assume that weaker players have higher variance in their performance; we might include the player’s role on the team; and we might consider specific kinds of performance and skill—e.g., defending and attacking—in order to improve team composition and predictive accuracy.
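The team model can be sketched in the same way, again under the simplifying assumption that all skills are known. A team's performance is a sum of independent Gaussians, so the difference between two teams is Gaussian with variance equal to β² times the total number of players. This is not the approximate inference algorithm that TrueSkill uses; `team_win_probability` is a hypothetical helper that only shows the forward model.

```python
import math


def team_win_probability(skills_a, skills_b, beta):
    """P(TeamPerformance(a,g) > TeamPerformance(b,g)) for known skills."""
    mean_diff = sum(skills_a) - sum(skills_b)
    # Every player contributes an independent N(skill, beta^2) performance,
    # so the variances of all players on both teams add up.
    var_diff = (len(skills_a) + len(skills_b)) * beta ** 2
    z = mean_diff / math.sqrt(var_diff)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Illustrative skills (assumed values).
p_even = team_win_probability([1500.0, 1500.0], [1500.0, 1500.0], beta=200.0)
p_edge = team_win_probability([1600.0, 1500.0], [1500.0, 1500.0], beta=200.0)
```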
15.1.3
Inference in relational probability models
The most straightforward approach to inference in RPMs is simply to construct the equivalent Bayesian network, given the known constant symbols belonging to each type. With B books and C customers, the basic model given previously could be constructed with simple loops:⁴

for b = 1 to B do
    add node Quality_b with no parents, prior ⟨0.05, 0.2, 0.4, 0.2, 0.15⟩
for c = 1 to C do
    add node Honest_c with no parents, prior ⟨0.99, 0.01⟩
    add node Kindness_c with no parents, prior ⟨0.1, 0.1, 0.2, 0.3, 0.3⟩
    for b = 1 to B do
        add node Recommendation_{c,b} with parents Honest_c, Kindness_c, Quality_b
            and conditional distribution RecCPT(Honest_c, Kindness_c, Quality_b)

This technique is called grounding or unrolling; it is the exact analog of propositionalization for first-order logic (page 280). The obvious drawback is that the resulting Bayes net may be very large. Furthermore, if there are many candidate objects for an unknown relation or function (for example, the unknown author of B2) then some variables in the network
may have many parents. Fortunately, it is often possible to avoid generating the entire implicit Bayes net. As we saw in the discussion of the variable elimination algorithm on page 433, every variable that is not an ancestor of a query variable or evidence variable is irrelevant to the query. Moreover, if the query is conditionally independent of some variable given the evidence, then that variable is also irrelevant. So, by chaining through the model starting from the query and evidence, we can identify just the set of variables that are relevant to the query. These are the only ones
that need to be instantiated to create a potentially tiny fragment of the implicit Bayes net.
Inference in this fragment gives the same answer as inference in the entire implicit Bayes net.
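The ancestor-based pruning described above can be sketched as follows. The code grounds only the structure of the basic book model (no probability tables) and then collects the query and evidence nodes together with their ancestors. It implements only the ancestor criterion, not the further pruning based on conditional independence given the evidence. The tuple node names and both function names are assumptions of this sketch.

```python
def ground_book_model(num_books, num_customers):
    """Return {node: [parents]} for the unrolled recommendation model."""
    parents = {}
    for b in range(1, num_books + 1):
        parents[("Quality", b)] = []
    for c in range(1, num_customers + 1):
        parents[("Honest", c)] = []
        parents[("Kindness", c)] = []
        for b in range(1, num_books + 1):
            parents[("Recommendation", c, b)] = [
                ("Honest", c), ("Kindness", c), ("Quality", b)]
    return parents


def relevant_nodes(parents, query, evidence):
    """Query and evidence nodes plus all of their ancestors."""
    relevant, stack = set(), list(query) + list(evidence)
    while stack:
        node = stack.pop()
        if node not in relevant:
            relevant.add(node)
            stack.extend(parents[node])  # chain backward through the parents
    return relevant


net = ground_book_model(num_books=100, num_customers=1000)
needed = relevant_nodes(net,
                        query=[("Quality", 1)],
                        evidence=[("Recommendation", 7, 1)])
```

With 100 books and 1000 customers the full grounding has 102,100 nodes, while this query with a single piece of evidence touches only four of them.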
Another avenue for improving the efficiency of inference comes from the presence of repeated substructure in the unrolled Bayes net. This means that many of the factors constructed during variable elimination (and similar kinds of tables constructed by clustering algorithms) will be identical; effective caching schemes have yielded speedups of three orders of magnitude for large networks.

4 Several statistical packages would view this code as defining the RPM, rather than just constructing a Bayes net to perform inference in the RPM. This view, however, misses an important role for RPM syntax: without a syntax with clear semantics, there is no way the model structure can be learned from data.
Third, MCMC inference algorithms have some interesting properties when applied to RPMs with relational uncertainty. MCMC works by sampling complete possible worlds, so in each state the relational structure is completely known. In the example given earlier, each MCMC state would specify the value of Author(B2), and so the other potential authors are no longer parents of the recommendation nodes for B2. For MCMC, then, relational uncertainty causes no increase in network complexity; instead, the MCMC process includes transitions that change the relational structure, and hence the dependency structure, of the unrolled network.
Finally, it may be possible in some cases to avoid grounding the model altogether. Resolution theorem provers and logic programming systems avoid propositionalizing by instantiating the logical variables only as needed to make the inference go through; that is, they lift the inference process above the level of ground propositional sentences and make each lifted step do the work of many ground steps.
The same idea can be applied in probabilistic inference. For example, in the variable
elimination algorithm, a lifted factor can represent an entire set of ground factors that assign
probabilities to random variables in the RPM, where those random variables differ only in the
constant symbols used to construct them. The details of this method are beyond the scope of
this book, but references are given at the end of the chapter.

15.2 Open-Universe Probability Models
We argued earlier that database semantics was appropriate for situations in which we know exactly the set of relevant objects that exist and can identify them unambiguously. (In particular, all observations about an object are correctly associated with the constant symbol that names it.) In many real-world settings, however, these assumptions are simply untenable. For example, a book retailer might use an ISBN (International Standard Book Number) constant symbol to name each book, even though a given “logical” book (e.g., “Gone With the Wind”) may have several ISBNs corresponding to hardcover, paperback, large print, reissues, and so on. It would make sense to aggregate recommendations across multiple ISBNs, but the retailer may not know for sure which ISBNs are really the same book. (Note that we are not reifying the individual copies of the book, which might be necessary for used-book sales, car sales, and so on.) Worse still, each customer is identified by a login ID, but a dishonest customer may have thousands of IDs! In the computer security field, these multiple IDs are called sybils and their use to confound a reputation system is called a sybil attack.⁵ Thus, even a simple application in a relatively well-defined, online domain involves both existence uncertainty (what are the real books and customers underlying the observed data) and identity uncertainty (which logical terms really refer to the same object).
The phenomena of existence and identity uncertainty extend far beyond online booksellers. In fact they are pervasive:

• A vision system doesn’t know what exists, if anything, around the next corner, and may not know if the object it sees now is the same one it saw a few minutes ago.
• A text-understanding system does not know in advance the entities that will be featured in a text, and must reason about whether phrases such as “Mary,” “Dr. Smith,” “she,” “his cardiologist,” “his mother,” and so on refer to the same object.
• An intelligence analyst hunting for spies never knows how many spies there really are and can only guess whether various pseudonyms, phone numbers, and sightings belong to the same individual.

5 The name “Sybil” comes from a famous case of multiple personality disorder.
Indeed, a major part of human cognition seems to require learning what objects exist and being able to connect observations (which almost never come with unique IDs attached) to hypothesized objects in the world.

Thus, we need to be able to define an open universe probability model (OUPM) based on the standard semantics of first-order logic, as illustrated at the top of Figure 15.1. A language for OUPMs provides a way of easily writing such models while guaranteeing a unique, consistent probability distribution over the infinite space of possible worlds.

15.2.1 Syntax and semantics
The basic idea is to understand how ordinary Bayesian networks and RPMs manage to define a unique probability model and to transfer that insight to the first-order setting. In essence, a Bayes net generates each possible world, event by event, in the topological order defined by the network structure, where each event is an assignment of a value to a variable. An RPM extends this to entire sets of events, defined by the possible instantiations of the logical variables in a given predicate or function. OUPMs go further by allowing generative steps that add objects to the possible world under construction, where the number and type of objects may depend on the objects that are already in that world and their properties and relations. That is, the event being generated is not the assignment of a value to a variable, but the very existence of objects.

One way to do this in OUPMs is to provide number statements that specify conditional distributions over the numbers of objects of various kinds. For example, in the book-recommendation domain, we might want to distinguish between customers (real people) and their login IDs. (It’s actually login IDs that make recommendations, not customers!) Suppose (to keep things simple) the number of customers is uniform between 1 and 3 and the number of books is uniform between 2 and 4:

#Customer ~ UniformInt(1,3)
#Book ~ UniformInt(2,4).    (15.2)
We expect honest customers to have just one ID, whereas dishonest customers might have anywhere between 2 and 5 IDs:

#LoginID(Owner = c) ~ if Honest(c) then Exactly(1) else UniformInt(2,5).    (15.3)

This number statement specifies the distribution over the number of login IDs for which customer c is the Owner. The Owner function is called an origin function because it says where each object generated by this number statement came from.
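The number statements (15.2) and (15.3) can be read as a small sampling program. The sketch below draws the number variables in topological order and tags every login ID with its origin. The tuple encoding of objects and the function name `sample_population` are assumptions made for illustration; the 0.99 prior on honesty is the one used in the grounded book model earlier in the chapter.

```python
import random


def sample_population(rng):
    """Sample the number variables of Equations (15.2) and (15.3)."""
    num_customers = rng.randint(1, 3)    # #Customer ~ UniformInt(1,3)
    num_books = rng.randint(2, 4)        # #Book ~ UniformInt(2,4)
    honest, logins = {}, {}
    for c in range(1, num_customers + 1):
        honest[c] = rng.random() < 0.99  # Honest(c) with prior <0.99, 0.01>
        # #LoginID(Owner=c): exactly 1 if honest, else UniformInt(2,5)
        n_ids = 1 if honest[c] else rng.randint(2, 5)
        # Each login ID records its origin (the owning customer) and an index.
        logins[c] = [("LoginID", ("Owner", ("Customer", c)), k)
                     for k in range(1, n_ids + 1)]
    return num_customers, num_books, honest, logins


n_customers, n_books, honest, logins = sample_population(random.Random(0))
```

Because the number of login IDs depends on Honest(c), honesty has to be sampled before the corresponding number variable.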
The example in the preceding paragraph uses a uniform distribution over the integers between 2 and 5 to specify the number of logins for a dishonest customer. This particular distribution is bounded, but in general there may not be an a priori bound on the number of objects. The most commonly used distribution over the nonnegative integers is the Poisson distribution. The Poisson has one parameter, λ, which is the expected number of objects, and a variable X sampled from Poisson(λ) has the following distribution:

$$P(X = k) = \lambda^k e^{-\lambda} / k!.$$

The variance of the Poisson is also λ, so the standard deviation is $\sqrt{\lambda}$. This means that for large values of λ, the distribution is narrow relative to the mean: for example, if the number of ants in a nest is modeled by a Poisson with a mean of one million, the standard deviation is only a thousand, or 0.1%. For large numbers, it often makes more sense to use the discrete log-normal distribution, which is appropriate when the log of the number of objects is normally distributed. A particularly intuitive form, which we call the order-of-magnitude distribution, uses logs to base 10: thus, a distribution OM(3, 1) has a mean of $10^3$ and a standard deviation of one order of magnitude, i.e., the bulk of the probability mass falls between $10^2$ and $10^4$.
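The following sketch checks the ant-nest arithmetic and shows one way to draw from an order-of-magnitude style distribution. The Poisson formula is evaluated in log space so that large k does not overflow. Rounding the log-normal draw to the nearest integer is an assumption of this sketch, not a definition taken from the text, and `sample_order_of_magnitude` is a hypothetical name.

```python
import math
import random


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) = lam^k e^(-lam) / k!, computed in log space for stability."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


# Ant-nest example: mean one million, standard deviation sqrt(lambda).
lam = 1_000_000.0
relative_spread = math.sqrt(lam) / lam  # standard deviation divided by mean


def sample_order_of_magnitude(mu: float, sigma: float, rng: random.Random) -> int:
    """Draw a count whose base-10 log is N(mu, sigma^2), rounded to an integer."""
    return round(10 ** rng.gauss(mu, sigma))


count = sample_order_of_magnitude(3.0, 1.0, random.Random(0))
```

For the ant nest, `relative_spread` comes out at 0.001, matching the 0.1% quoted above.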
The formal semantics of OUPMs begins with a definition of the objects that populate possible worlds. In the standard semantics of typed first-order logic, objects are just numbered tokens with types. In OUPMs, each object is a generation history; for example, an object might be “the fourth login ID of the seventh customer.” (The reason for this slightly baroque construction will become clear shortly.) For types with no origin functions (e.g., the Customer and Book types in Equation (15.2)) the objects have an empty origin; for example, ⟨Customer, , 2⟩ refers to the second customer generated from that number statement. For number statements with origin functions (e.g., Equation (15.3)) each object records its origin; for example, the object ⟨LoginID, ⟨Owner, ⟨Customer, , 2⟩⟩, 3⟩ is the third login belonging to the second customer.

The number variables of an OUPM specify how many objects there are of each type with each possible origin in each possible world; thus #LoginID_⟨Owner, ⟨Customer, , 2⟩⟩(ω) = 4 means that in world ω, customer 2 owns 4 login IDs. As in relational probability models, the basic random variables determine the values of predicates and functions for all tuples of objects; thus, Honest_⟨Customer, , 2⟩(ω) = true means that in world ω, customer 2 is honest. A possible world is defined by the values of all the number variables and basic random variables. A world may be generated from the model by sampling in topological order; Figure 15.4 shows an example. The probability of a world so constructed is the product of the probabilities for all the sampled values; in this case, $1.2672 \times 10^{-11}$. Now it becomes clear why each object contains its origin: this property ensures that every world can be constructed by exactly one generation sequence. If this were not the case, the probability of a world would be an unwieldy combinatorial sum over all possible generation sequences that create it.
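The quoted world probability can be checked by multiplying the probabilities listed in Figure 15.4. The grouping of the values below follows the figure. The two entries shown there as 0.3333 are taken to be exactly 1/3, since they come from uniform choices among three values.

```python
import math

# Probabilities of the sampled values in Figure 15.4, in topological order.
sampled_value_probs = (
    [1 / 3, 1 / 3]         # #Customer = 2 and #Book = 3 (shown as 0.3333)
    + [0.99, 0.01]         # Honest for customers 1 and 2
    + [0.3, 0.1]           # Kindness for customers 1 and 2
    + [0.05, 0.4, 0.15]    # Quality for books 1, 2 and 3
    + [1.0, 0.25]          # #LoginID for customers 1 and 2
    + [0.5] * 3            # recommendations from the honest customer's one ID
    + [0.4] * 6            # recommendations from the dishonest customer's two IDs
)

# The probability of the world is the product of all sampled-value probabilities.
world_prob = math.prod(sampled_value_probs)
```

The product agrees with the value $1.2672 \times 10^{-11}$ given in the text.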
Open-universe models may have infinitely many random variables, so the full theory in-
volves nontrivial measure-theoretic considerations.
For example, number statements with
Poisson or order-of-magnitude distributions allow for unbounded numbers of objects, lead-
ing to unbounded numbers of random variables for the properties and relations of those objects. Moreover, OUPMs can have recursive dependencies and infinite types (integers, strings, etc.). Finally, well-formedness disallows cyclic dependencies and infinitely receding ancestor chains; these conditions are undecidable in general, but certain syntactic sufficient conditions
can be checked easily.
Variable                                                              Value    Probability
#Customer                                                             2        0.3333
#Book                                                                 3        0.3333
Honest_⟨Customer, , 1⟩                                                true     0.99
Honest_⟨Customer, , 2⟩                                                false    0.01
Kindness_⟨Customer, , 1⟩                                              4        0.3
Kindness_⟨Customer, , 2⟩                                              1        0.1
Quality_⟨Book, , 1⟩                                                   1        0.05
Quality_⟨Book, , 2⟩                                                   3        0.4
Quality_⟨Book, , 3⟩                                                   5        0.15
#LoginID_⟨Owner, ⟨Customer, , 1⟩⟩                                     1        1.0
#LoginID_⟨Owner, ⟨Customer, , 2⟩⟩                                     2        0.25
Recommendation_⟨LoginID, ⟨Owner, ⟨Customer, , 1⟩⟩, 1⟩, ⟨Book, , 1⟩    2        0.5
Recommendation_⟨LoginID, ⟨Owner, ⟨Customer, , 1⟩⟩, 1⟩, ⟨Book, , 2⟩    4        0.5
Recommendation_⟨LoginID, ⟨Owner, ⟨Customer, , 1⟩⟩, 1⟩, ⟨Book, , 3⟩    5        0.5
Recommendation_⟨LoginID, ⟨Owner, ⟨Customer, , 2⟩⟩, 1⟩, ⟨Book, , 1⟩    5        0.4
Recommendation_⟨LoginID, ⟨Owner, ⟨Customer, , 2⟩⟩, 1⟩, ⟨Book, , 2⟩    1        0.4
Recommendation_⟨LoginID, ⟨Owner, ⟨Customer, , 2⟩⟩, 1⟩, ⟨Book, , 3⟩    5        0.4
Recommendation_⟨LoginID, ⟨Owner, ⟨Customer, , 2⟩⟩, 2⟩, ⟨Book, , 1⟩    5        0.4
Recommendation_⟨LoginID, ⟨Owner, ⟨Customer, , 2⟩⟩, 2⟩, ⟨Book, , 2⟩    1        0.4
Recommendation_⟨LoginID, ⟨Owner, ⟨Customer, , 2⟩⟩, 2⟩, ⟨Book, , 3⟩    [lost]   0.4

Figure 15.4 One particular world for the book recommendation OUPM. The number variables and basic random variables are shown in topological order, along with their chosen values and the probabilities for those values.
15.2.2 Inference in open-universe probability models
Because of the potentially huge and sometimes unbounded size of the implicit Bayes net that corresponds to a typical OUPM, unrolling it fully and performing exact inference is quite impractical. Instead, we must consider approximate inference algorithms such as MCMC (see Section 13.4.2).
Roughly speaking, an MCMC algorithm for an OUPM is exploring the space of possible worlds defined by sets of objects and relations among them, as illustrated in Figure 15.1 (top). A move between adjacent states in this space can not only alter relations and functions but also add or subtract objects and change the interpretations of constant symbols. Even though each possible world may be huge, the probability computations required for each step, whether in Gibbs sampling or Metropolis-Hastings, are entirely local and in most cases take constant time. This is because the probability ratio between neighboring worlds depends on a subgraph of constant size around the variables whose values are changed. Moreover, a logical query can be evaluated incrementally in each world visited, usually in constant time per world, rather than being recomputed from scratch.
Some special consideration needs to be given to the fact that a typical OUPM may have possible worlds of infinite size. As an example, consider the multitarget tracking model in Figure 15.9: the function X(a, t), denoting the state of aircraft a at time t, corresponds to an infinite sequence of variables for an unbounded number of aircraft at each step. For this reason, MCMC for OUPMs samples not completely specified possible worlds but partial worlds, each corresponding to a disjoint set of complete worlds. A partial world is a minimal self-supporting instantiation⁶ of a subset of the relevant variables, that is, ancestors of the evidence and query variables. For example, variables X(a, t) for values of t greater than the last observation time (or the query time, whichever is greater) are irrelevant, so the algorithm can consider just a finite prefix of the infinite sequence.
15.2.3 Examples

The standard “use case” for an OUPM has three elements: the model, the evidence (the known facts in a given scenario), and the query, which may be any expression, possibly with free logical variables. The answer is a posterior joint probability for each possible set of substitutions for the free variables, given the evidence, according to the model.⁷ Every model includes type declarations, type signatures for the predicates and functions, one or more number statements for each type, and one dependency statement for each predicate and function. (In the examples below, declarations and signatures are omitted where the meaning is clear.) As in RPMs, dependency statements use an if-then-else syntax to handle context-specific dependencies.

Citation matching
Millions of academic research papers and technical reports are to be found online in the form of pdf files. Such papers usually contain a section near the end called “References” or “Bibliography,” in which citations—strings of characters—are provided to inform the reader
of related work. These strings can be located and “scraped” from the pdf files with the aim of
creating a database-like representation that relates papers and researchers by authorship and
citation links. Systems such as CiteSeer and Google Scholar present such a representation to their users; behind the scenes, algorithms operate to find papers, scrape the citation strings, and identify the actual papers to which the citation strings refer. This is a difficult task because these strings contain no object identifiers and include errors of syntax, spelling, punctuation,
and content. To illustrate this, here are two relatively benign examples:
1. [Lashkari et al 94] Collaborative Interface Agents, Yezdi Lashkari, Max Metral, and Pattie Maes, Proceedings of the Twelfth National Conference on Articial Intelligence, MIT Press, Cambridge, MA, 1994.
2. Metral M. Lashkari, Y. and P. Maes. Collaborative interface agents. In Conference of the American Association for Artificial Intelligence, Seattle, WA, August 1994.
The key question is one of identity: are these citations of the same paper or different papers? Asked this question, even experts disagree or are unwilling to decide, indicating that reasoning under uncertainty is going to be an important part of solving this problem.⁸ Ad hoc approaches, such as methods based on a textual similarity metric, often fail miserably. For example, in 2002, CiteSeer reported over 120 distinct books written by Russell and Norvig.

6 A self-supporting instantiation of a set of variables is one in which the parents of every variable in the set are also in the set.
7 As with Prolog, there may be infinitely many sets of substitutions of unbounded size; designing exploratory interfaces for such answers is an interesting visualization challenge.
8 The answer is yes, they are the same paper. The “National Conference on Articial Intelligence” (notice how the “fi” is missing, thanks to an error in scraping the ligature character) is another name for the AAAI conference; the conference took place in Seattle whereas the proceedings publisher is in Cambridge.
type Researcher, Paper, Citation
random String Name(Researcher)
random String Title(Paper)
random Paper CitedPaper(Citation)
random String Text(Citation)
random Boolean Professor(Researcher)
origin Researcher Author(Paper)
#Researcher ~ OM(3,1)
Name(r) ~ NamePrior()
Professor(r) ~ Boolean(0.2)
#Paper(Author = r) ~ if Professor(r) then OM(1.5,0.5) else OM(1,0.5)
Title(p) ~ PaperTitlePrior()
CitedPaper(c) ~ UniformChoice({Paper p})
Text(c) ~ HMMGrammar(Name(Author(CitedPaper(c))), Title(CitedPaper(c)))

Figure 15.5 An OUPM for citation information extraction. For simplicity the model assumes one author per paper and omits details of the grammar and error models.

In order to solve the problem using a probabilistic approach, we need a generative model for the domain. That is, we ask how these citation strings come to be in the world. The process begins with researchers, who have names. (We don’t need to worry about how the researchers came into existence; we just need to express our uncertainty about how many there are.) These researchers write some papers, which have titles; people cite the papers, combining the authors’ names and the paper’s title (with errors) into the text of the citation according to some grammar. The basic elements of this model are shown in Figure 15.5, covering the case where papers have just one author.⁹
Given just citation strings as evidence, probabilistic inference on this model to pick out the most likely explanation for the data produces an error rate 2 to 3 times lower than CiteSeer’s (Pasula et al., 2003). The inference process also exhibits a form of collective, knowledge-driven disambiguation: the more citations for a given paper, the more accurately each of them is parsed, because the parses have to agree on the facts about the paper.

9 The multi-author case has the same overall structure but is a bit more complicated. The parts of the model not shown (the NamePrior, PaperTitlePrior, and HMMGrammar) are traditional probability models. For example, the NamePrior is a mixture of a categorical distribution over actual names and a letter trigram model (see Section 23.1) to cover names not previously seen, both trained from data in the U.S. Census database.

Nuclear treaty monitoring

Verifying the Comprehensive Nuclear-Test-Ban Treaty requires finding all seismic events on Earth above a minimum magnitude. The UN CTBTO maintains a network of sensors, the International Monitoring System (IMS); its automated processing software, based on 100 years of seismology research, has a detection failure rate of about 30%. The NET-VISA system (Arora et al., 2013), based on an OUPM, significantly reduces detection failures.

#SeismicEvents ~ Poisson(T * λ_e)
Time(e) ~ UniformReal(0, T)
Earthquake(e) ~ Boolean(0.999)
Location(e) ~ if Earthquake(e) then SpatialPrior() else UniformEarth()
Depth(e) ~ if Earthquake(e) then UniformReal(0, 700) else Exactly(0)
Magnitude(e) ~ Exponential(log(10))
Detected(e, p, s) ~ Logistic(weights(s, p), Magnitude(e), Depth(e), Dist(e, s))
#Detections(site = s) ~ Poisson(T * λ_f(s))
#Detections(event = e, phase = p, station = s) = if Detected(e, p, s) then 1 else 0
OnsetTime(a, s) if (event(a) = null) then ~ UniformReal(0, T)
    else = Time(event(a)) + GeoTT(Dist(event(a), s), Depth(event(a)), phase(a)) + Laplace(μ_t(s), σ_t(s))
Amplitude(a, s) if (event(a) = null) then ~ NoiseAmpModel(s)
    else = AmpModel(Magnitude(event(a)), Dist(event(a), s), Depth(event(a)), phase(a))
Azimuth(a, s) if (event(a) = null) then ~ UniformReal(0, 360)
    else = GeoAzimuth(Location(event(a)), Depth(event(a)), phase(a), Site(s)) + Laplace(0, σ_a(s))
Slowness(a, s) if (event(a) = null) then ~ UniformReal(0, 20)
    else = GeoSlowness(Location(event(a)), Depth(event(a)), phase(a), Site(s)) + Laplace(0, σ_s(s))
ObservedPhase(a, s) ~ CategoricalPhaseModel(phase(a))

Figure 15.6 A simplified version of the NET-VISA model (see text).

The NET-VISA model (Figure 15.6) expresses the relevant geophysics directly. It describes distributions over the number of events in a given time interval (most of which are naturally occurring) as well as over their time, magnitude, depth, and location. The locations of natural events are distributed according to a spatial prior that is trained (like other parts of the model) from historical data; man-made events are, by the treaty rules, assumed to occur uniformly over the surface of the Earth. At every station s, each phase (seismic wave type) p from an event e produces either 0 or 1 detections (above-threshold signals); the detection probability depends on the event magnitude and depth and its distance from the station. “False alarm” detections also occur according to a station-specific rate parameter. The measured arrival time, amplitude, and other properties of a detection d from a real event depend on the properties of the originating event and its distance from the station.
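One line of Figure 15.6 has a simple consequence that is easy to verify. If Magnitude(e) is exponentially distributed with rate log(10), and if log is read as the natural logarithm (an assumption here, since the figure does not say), then the probability of exceeding a value m on the model's magnitude scale is $10^{-m}$: each additional unit of magnitude is ten times rarer. The function name is hypothetical, and any offset for a minimum magnitude is ignored because the figure does not show one.

```python
import math

RATE = math.log(10)  # Exponential(log(10)), with log taken as natural log


def prob_magnitude_exceeds(m: float) -> float:
    """Survival function of Exponential(RATE): exp(-RATE * m) = 10 ** (-m)."""
    return math.exp(-RATE * m)


# Ratio of exceedance probabilities one magnitude unit apart.
ratio = prob_magnitude_exceeds(3.0) / prob_magnitude_exceeds(4.0)
```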
Once trained, the model runs continuously. The evidence consists of detections (90% of which are false alarms) extracted from raw IMS waveform data, and the query typically asks for the most likely event history, or bulletin, given the data. Results so far are encouraging; for example, in 2009 the UN’s SEL3 automated bulletin missed 27.4% of the 27294 events in the magnitude range 3-4 while NET-VISA missed 11.1%. Moreover, comparisons with dense regional networks show that NET-VISA finds up to 50% more real events than the final bulletins produced by the UN’s expert seismic analysts. NET-VISA also tends to associate more detections with a given event, leading to more accurate location estimates (see Figure 15.7). As of January 1, 2018, NET-VISA has been deployed as part of the CTBTO monitoring pipeline.
Despite superficial differences, the two examples are structurally similar: there are unknown objects (papers, earthquakes) that generate percepts according to some physical process (citation, seismic propagation). The percepts are ambiguous as to their origin, but when multiple percepts are hypothesized to have originated with the same unknown object, that object’s properties can be inferred more accurately.

The same structure and reasoning patterns hold for areas such as database deduplication and natural language understanding. In some cases, inferring an object’s existence involves grouping percepts together, a process that resembles the clustering task in machine learning. In other cases, an object may generate no percepts at all and still have its existence inferred, as happened, for example, when observations of Uranus led to the discovery of Neptune. The existence of the unobserved object follows from its effects on the behavior and properties of observed objects.

Figure 15.7 (a) Top: Example of seismic waveform recorded at Alice Springs, Australia. Bottom: the waveform after processing to detect the arrival times of seismic waves. Blue lines are the automatically detected arrivals; red lines are the true arrivals. (b) Location estimates for the DPRK nuclear test of February 12, 2013: UN CTBTO Late Event Bulletin (green triangle at top left); NET-VISA (blue square in center). The entrance to the underground test facility (small “x”) is 0.75 km from NET-VISA’s estimate. Contours show NET-VISA’s posterior location distribution. Courtesy of CTBTO Preparatory Commission.

15.3 Keeping Track of a Complex World
Chapter 14 considered the problem of keeping track of the state of the world, but covered only the case of atomic representations (HMMs) and factored representations (DBNs and Kalman filters). This makes sense for worlds with a single object, perhaps a single patient in the intensive care unit or a single bird flying through the forest. In this section, we see what happens when two or more objects generate the observations. What makes this case different from plain old state estimation is that there is now the possibility of uncertainty about which object generated which observation. This is the identity uncertainty problem of Section 15.2 (page 507), now viewed in a temporal context. In the control theory literature, this is the data association problem, that is, the problem of associating observation data with the objects that generated them. Although we could view this as yet another example of open-universe probabilistic modeling, it is important enough in practice to deserve its own section.
[Figure: object-tracking diagram; recoverable labels: track initiation, track termination, detection failure, false alarm]
0; a positive affine transformation.³ This fact was noted in Chapter 5 (page 167) for two-player games of chance; here, we see that it applies to all kinds of decision scenarios.
As in game-playing, in a deterministic environment an agent needs only a preference ranking on states; the numbers don’t matter. This is called a value function or ordinal utility function.
It is important to remember that the existence of a utility function that describes an agent’s preference behavior does not necessarily mean that the agent is explicitly maximizing that utility function in its own deliberations. As we showed in Chapter 2, rational behavior can be generated in any number of ways. A rational agent might be implemented with a table lookup (if the number of possible states is small enough).

By observing a rational agent’s behavior, an observer can learn about the utility function that represents what the agent is actually trying to achieve (even if the agent doesn’t know it). We return to this point in Section 16.7.
16.3 Utility Functions
Utility functions map from lotteries to real numbers. We know they must obey the axioms
of orderability, transitivity, continuity, substitutability, monotonicity, and decomposability. Is
that all we can say about utility functions? Strictly speaking, that is it: an agent can have any preferences it likes. For example, an agent might prefer to have a prime number of dollars in its bank account; in which case, if it had $16 it would give away $3. This might be unusual,
but we can’t call it irrational. An agent might prefer a dented 1973 Ford Pinto to a shiny new
Mercedes. The agent might prefer prime numbers of dollars only when it owns the Pinto, but when it owns the Mercedes, it might prefer more dollars to fewer. Fortunately, the preferences
of real agents are usually more systematic and thus easier to deal with.
3 In this sense, utilities resemble temperatures: a temperature in Fahrenheit is 1.8 times the Celsius temperature plus 32, but converting from one to the other doesn’t make you hotter or colder.
Utility assessment and utility scales

If we want to build a decision-theoretic system that helps a human make decisions or acts on his or her behalf, we must first work out what the human’s utility function is. This process, often called preference elicitation, involves presenting choices to the human and using the observed preferences to pin down the underlying utility function.
Equation (16.2) says that there is no absolute scale for utilities, but it is helpful, nonetheless, to establish some scale on which utilities can be recorded and compared for any particular problem. A scale can be established by fixing the utilities of any two particular outcomes, just as we fix a temperature scale by fixing the freezing point and boiling point of water. Typically, we fix the utility of a “best possible prize” at $U(S) = u_\top$ and a “worst possible catastrophe” at $U(S) = u_\bot$. (Both of these should be finite.) Normalized utilities use a scale with $u_\bot = 0$ and $u_\top = 1$. With such a scale, an England fan might assign a utility of 1 to England winning the World Cup and a utility of 0 to England failing to qualify.
Given a utility scale between ut and u , we can assess the utility of any particular prize
S by asking the agent to choose between S and a standard lottery [p,ur; (1 — p),u.]. The Standard lottery probability p is adjusted until the agent is indifferent between S and the standard lottery.
Assuming normalized utilities, the utility ofS is given by p. Once this is done for each prize, the utilities for all lotteries involving those prizes are determined.
Suppose, for example,
we want to know how much our England fan values the outcome of England reaching the
semi-final and then losing. We compare that outcome to a standard lottery with probability p
of winning the trophy and probability 1 — p of an ignominious failure to qualify. If there is indifference at p=0.3, then 0.3 is the value of reaching the semi-final and then losing.
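The adjustment of p described above can be sketched as a binary search. This is only an illustration under stated assumptions: `prefers_lottery` is a hypothetical oracle standing in for the human's answers, and the simulated fan's hidden utility of 0.3 is taken from the example in the text.

```python
def elicit_utility(prefers_lottery, tol=1e-6):
    """Find the indifference probability p by bisection.

    prefers_lottery(p) is a hypothetical oracle: it returns True when the
    agent prefers the standard lottery [p, best; (1 - p), worst] to the prize.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        p = (lo + hi) / 2
        if prefers_lottery(p):
            hi = p  # lottery is too attractive, so lower p
        else:
            lo = p  # prize is still preferred, so raise p
    return (lo + hi) / 2


# Simulated England fan (an assumption for this sketch). With normalized
# utilities the lottery is worth p * 1 + (1 - p) * 0 = p, so the fan
# prefers it exactly when p exceeds the hidden utility of the prize.
hidden_utility = 0.3
estimate = elicit_utility(lambda p: p > hidden_utility)
```

With normalized utilities, the returned indifference point is itself the utility of the prize.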
In medical, transportation, environmental and other decision problems, people’s lives are
at stake. (Yes, there are things more important than England’s fortunes in the World Cup.) In such cases, \(u_\bot\) is the value assigned to immediate death (or in the really worst cases, many deaths).
Although nobody feels
comfortable with putting a value on human life, it is a fact
that tradeoffs on matters of life and death are made all the time. Aircraft are given a complete
overhaul at intervals, rather than after every trip. Cars are manufactured in a way that trades off costs against accident survival rates.
We tolerate a level of air pollution that kills four
million people a year. Paradoxically, a refusal to put a monetary value on life can mean that life is undervalued. Ross Shachter describes a government agency that commissioned a study on removing asbestos from schools. The decision analysts performing the study assumed a particular dollar value for the life of a school-age child, and argued that the rational choice under that assump-
tion was to remove the asbestos. The agency, morally outraged at the idea of setting the value
of a life, rejected the report out of hand. It then decided against asbestos removal—implicitly asserting a lower value for the life of a child than that assigned by the analysts.
Chapter 16 Making Simple Decisions
Currently several agencies of the U.S. government, including the Environmental Protection Agency, the Food and Drug Administration, and the Department of Transportation, use the value of a statistical life to determine the costs and benefits of regulations and interventions. Typical values in 2019 are roughly $10 million.
Some attempts have been made to find out the value that people place on their own lives. One common “currency” used in medical and safety analysis is the micromort, a one in a million chance of death. If you ask people how much they would pay to avoid a risk (for example, to avoid playing Russian roulette with a million-barreled revolver), they will respond with very large numbers, perhaps tens of thousands of dollars, but their actual behavior reflects a much lower monetary value for a micromort. For example, in the UK, driving in a car for 230 miles incurs a risk of one micromort.
Over the life of your car—say, 92,000 miles—that’s 400 micromorts.
People appear to be
willing to pay about $12,000 more for a safer car that halves the risk of death.
Thus, their
car-buying action says they have a value of $60 per micromort. A number of studies have confirmed a figure in this range across many individuals and risk types. However, government
agencies such as the U.S. Department of Transportation typically set a lower figure; they will spend only about $6 in road repairs per expected life saved. Of course, these calculations hold only for small risks. Most people won’t agree to kill themselves, even for $60 million.
Another measure is the QALY, or quality-adjusted life year. Patients are willing to accept a shorter life expectancy to avoid a disability.
For example, kidney patients on average are
indifferent between living two years on dialysis and one year at full health.
16.3.2
The utility of money
Utility theory has its roots in economics, and economics provides one obvious candidate for a utility measure:
money (or more specifically, an agent’s total net assets). The almost
universal exchangeability of money for all kinds of goods and services suggests that money plays a significant role in human utility functions.
It will usually be the case that an agent prefers more money to less, all other things being
equal.
We say that the agent exhibits a monotonic preference for more money.
This does
not mean that money behaves as a utility function, because it says nothing about preferences
between lotteries involving money.
Suppose you have triumphed over the other competitors in a television game show. The
host now offers you a choice: either you can take the $1,000,000 prize or you can gamble it
on the flip of a coin. If the coin comes up heads, you end up with nothing, but if it comes up tails, you get $2,500,000. If you're like most people, you would decline the gamble and pocket the million. Are you being irrational?
Assuming the coin is fair, the expected monetary value (EMV) of the gamble is ½($0) + ½($2,500,000) = $1,250,000, which is more than the original $1,000,000. But that does not necessarily mean that accepting the gamble is a better decision. Suppose we use \(S_n\) to denote the state of possessing total wealth $n, and that your current wealth is $k. Then the expected utilities of the two actions of accepting and declining the gamble are
\[ EU(\mathit{Accept}) = \tfrac{1}{2}\,U(S_k) + \tfrac{1}{2}\,U(S_{k+2{,}500{,}000}) \]
\[ EU(\mathit{Decline}) = U(S_{k+1{,}000{,}000}) \]
To determine what to do, we need to assign utilities to the outcome states. Utility is not directly proportional to monetary value, because the utility for your first million is very high (or so they say), whereas the utility for an additional million is smaller. Suppose you assign a utility of 5 to your current financial status (\(S_k\)), a 9 to the state \(S_{k+2{,}500{,}000}\), and an 8 to the state \(S_{k+1{,}000{,}000}\). Then the rational action would be to decline, because the expected utility of accepting is only 7 (less than the 8 for declining). On the other hand, a billionaire would most likely have a utility function that is locally linear over the range of a few million more, and thus would accept the gamble.
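The game-show arithmetic can be checked with a few lines of Python. This is a minimal sketch: the helper `expected_value` and the dictionary layout are assumptions made for illustration, while the probabilities, prizes, and utilities (5, 8, 9) are the ones given in the text.

```python
def expected_value(lottery, value=lambda x: x):
    """Probability-weighted average of value(outcome) over a lottery."""
    return sum(p * value(x) for p, x in lottery)


# Outcomes are expressed as gains over the current wealth k.
gamble = [(0.5, 0), (0.5, 2_500_000)]
sure_thing = [(1.0, 1_000_000)]

emv_gamble = expected_value(gamble)

# Utilities assigned in the text to S_k, S_{k+1,000,000}, S_{k+2,500,000}.
utility = {0: 5, 1_000_000: 8, 2_500_000: 9}
eu_accept = expected_value(gamble, utility.get)
eu_decline = expected_value(sure_thing, utility.get)

best_action = "Accept" if eu_accept > eu_decline else "Decline"
```

The gamble wins on expected monetary value but loses on expected utility, which is why declining is the rational choice for this agent.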
Figure 16.2 The utility of money. (a) Empirical data for Mr. Beard over a limited range. (b) A typical curve for the full range.
In a pioneering study of actual utility functions, Grayson (1960) found that the utility of money was almost exactly proportional to the logarithm of the amount. (This idea was first suggested by Bernoulli (1738); see Exercise 16.STPT.) One particular utility curve, for a certain Mr. Beard, is shown in Figure 16.2(a). The data obtained for Mr. Beard’s preferences are consistent with a utility function
\[ U(S_{k+n}) = -263.31 + 22.09 \log(n + 150{,}000) \]
for the range between n = −$150,000 and n = $800,000.
We should not assume that this is the definitive utility function for monetary value, but it is likely that most people have a utility function that is concave for positive wealth. Going into debt is bad, but preferences between different levels of debt can display a reversal of the concavity associated with positive wealth. For example, someone already $10,000,000 in debt might well accept a gamble on a fair coin with a gain of $10,000,000 for heads and a loss of $20,000,000 for tails. This yields the S-shaped curve shown in Figure 16.2(b). If we restrict our attention to the positive part of the curves, where the slope is decreasing, then for any lottery L, the utility of being faced with that lottery is less than the utility of being handed the expected monetary value of the lottery as a sure thing:
\[ U(L) < U(S_{EMV(L)}) . \]
[...] if \(r > 0\), then life is
positively enjoyable and the agent avoids both exits. As long as the actions in (4,1), (3,2), and (3,3) are as shown, every policy is optimal, and the agent obtains infinite total reward
because it never enters a terminal state. It turns out that there are nine optimal policies in all
for various ranges of r; Exercise 17.THRC asks you to find them.
The introduction of uncertainty brings MDPs closer to the real world than deterministic search problems. For this reason, MDPs have been studied in several fields, including AI,
operations research, economics, and control theory. Dozens of solution algorithms have been proposed, several of which we discuss in Section 17.2. First, however, we spell out in more detail the definitions of utilities, optimal policies, and models for MDPs.
17.1.1
Utilities over time
In the MDP example in Figure 17.1, the performance of the agent was measured by a sum of rewards for the transitions experienced. This choice of performance measure is not arbitrary, but it is not the only possibility for the utility function² on environment histories, which we write as \(U_h([s_0, a_0, s_1, a_1, \ldots, s_n])\).
2 In this chapter we use U for the utility function (to be consistent with the rest of the book), but many works about MDPs use V (for value) instead.
Section 17.1
Sequential Decision Problems
[...] It is easy to see, from the definition of utilities as discounted sums of rewards, that a similar transformation of rewards will leave the optimal policy unchanged in an MDP:
\[ R'(s,a,s') = m\,R(s,a,s') + b . \]
It turns out, however, that the additive reward decomposition of utilities leads to significantly more freedom in defining rewards. Let \(\Phi(s)\) be any function of the state s. Then, according to the shaping theorem, the following transformation leaves the optimal policy unchanged:
\[ R'(s,a,s') = R(s,a,s') + \gamma\Phi(s') - \Phi(s) . \qquad (17.9) \]
To show that this is true, we need to prove that two MDPs, M and M’, have identical optimal
policies as long as they differ only in their reward functions as specified in Equation (17.9). We start from the Bellman equation for Q, the Q-function for MDP M:
\[ Q(s,a) = \sum_{s'} P(s' \mid s,a)\,[R(s,a,s') + \gamma \max_{a'} Q(s',a')] . \]
Now let \(Q'(s,a) = Q(s,a) - \Phi(s)\) and plug it into this equation; we get
\[ Q'(s,a) + \Phi(s) = \sum_{s'} P(s' \mid s,a)\,[R(s,a,s') + \gamma \max_{a'} (Q'(s',a') + \Phi(s'))] , \]
which then simplifies to
\[ Q'(s,a) = \sum_{s'} P(s' \mid s,a)\,[R(s,a,s') + \gamma\Phi(s') - \Phi(s) + \gamma \max_{a'} Q'(s',a')] = \sum_{s'} P(s' \mid s,a)\,[R'(s,a,s') + \gamma \max_{a'} Q'(s',a')] . \]
Chapter 17 Making Complex Decisions
In other words, Q'(s,a) satisfies the Bellman equation for MDP M'. Now we can extract the optimal policy for M' using Equation (17.7):
\[ \pi^*_{M'}(s) = \arg\max_a Q'(s,a) = \arg\max_a\,[Q(s,a) - \Phi(s)] = \arg\max_a Q(s,a) = \pi^*_M(s) . \]
The function \(\Phi(s)\) is often called a potential, by analogy to the electrical potential (voltage) that gives rise to electric fields. The term \(\gamma\Phi(s') - \Phi(s)\) functions as a gradient of the potential. Thus, if \(\Phi(s)\) has higher value in states that have higher utility, the addition of \(\gamma\Phi(s') - \Phi(s)\) to the reward has the effect of leading the agent “uphill” in utility. At first sight, it may seem rather counterintuitive that we can modify the reward in this
way without changing the optimal policy. It helps if we remember that all policies are optimal with a reward function that is zero everywhere. This means, according to the shaping theorem, that all policies are optimal for any potential-based reward of the form \(R(s,a,s') = \gamma\Phi(s') - \Phi(s)\). Intuitively, this is because with such a reward it doesn’t matter which way the agent goes from A to B. (This is easiest to see when \(\gamma = 1\): along any path the sum of rewards collapses to \(\Phi(B) - \Phi(A)\), so all paths are equally good.) So adding a potential-based reward to any other reward shouldn’t change the optimal policy.
The flexibility afforded by the shaping theorem means that we can actually help out the
agent by making the immediate reward more directly reflect what the agent should do. In fact, if we set \(\Phi(s) = U(s)\), then the greedy policy \(\pi_G\) with respect to the modified reward R' is also an optimal policy:
\[ \pi_G(s) = \arg\max_a \sum_{s'} P(s' \mid s,a)\,R'(s,a,s') \]
\[ = \arg\max_a \sum_{s'} P(s' \mid s,a)\,[R(s,a,s') + \gamma\Phi(s') - \Phi(s)] \]
\[ = \arg\max_a \sum_{s'} P(s' \mid s,a)\,[R(s,a,s') + \gamma U(s') - U(s)] \]
\[ = \arg\max_a \sum_{s'} P(s' \mid s,a)\,[R(s,a,s') + \gamma U(s')] \quad \text{(by Equation (17.4))}. \]
Of course, in order to set \(\Phi(s) = U(s)\), we would need to know U(s); so there is no free lunch, but there is still considerable value in defining a reward function that is helpful to the extent
possible.
This is precisely what animal trainers do when they provide a small treat to the
animal for each step in the target sequence.
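The shaping theorem can be checked numerically on a small example. The sketch below is not from the text: the three-state chain, its transition probabilities, its rewards, and the potential `phi` are all hypothetical choices, and the Q-functions are obtained by simply iterating the Bellman equation for Q a fixed number of times.

```python
GAMMA = 0.9
STATES = [0, 1, 2]
ACTIONS = ["left", "right"]


def step_dist(s, a):
    """Hypothetical model: intended move with prob. 0.8, stay put with 0.2."""
    target = max(s - 1, 0) if a == "left" else min(s + 1, 2)
    return [(0.8, target), (0.2, s)]


def reward(s, a, s2):
    return 1.0 if s2 == 2 else -0.1


def phi(s):
    return 5.0 * s  # an arbitrary potential function


def shaped_reward(s, a, s2):
    # Equation (17.9): R'(s,a,s') = R(s,a,s') + gamma * Phi(s') - Phi(s)
    return reward(s, a, s2) + GAMMA * phi(s2) - phi(s)


def solve_q(R, iters=500):
    """Iterate the Bellman equation for Q until (numerically) converged."""
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(iters):
        Q = {
            (s, a): sum(
                p * (R(s, a, s2) + GAMMA * max(Q[s2, b] for b in ACTIONS))
                for p, s2 in step_dist(s, a)
            )
            for s in STATES
            for a in ACTIONS
        }
    return Q


def greedy(Q):
    return {s: max(ACTIONS, key=lambda a: Q[s, a]) for s in STATES}


Q = solve_q(reward)
Q_shaped = solve_q(shaped_reward)
```

In this sketch the two greedy policies coincide, and the shaped Q-values differ from the original ones by exactly the potential, matching Q'(s,a) = Q(s,a) − Φ(s) in the derivation.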
17.1.4
Representing MDPs
The simplest way to represent P(s'|s,a) and R(s,a,s') is with big, three-dimensional tables of size \(|S|^2|A|\). This is fine for small problems such as the 4 × 3 world, for which the tables have \(11^2 \times 4 = 484\) entries each. In some cases, the tables are sparse (most entries are zero because each state s can transition to only a bounded number of states s'), which means the tables are of size \(O(|S||A|)\). For larger problems, even sparse tables are far too big.
Just as in Chapter 16, where Bayesian networks were extended with action and utility nodes to create decision networks, we can represent MDPs by extending dynamic Bayesian networks (DBNs, see Chapter 14) with decision, reward, and utility nodes to create dynamic decision networks, or DDNs. DDNs are factored representations in the terminology of Chapter 2; they typically have an exponential complexity advantage over atomic representations and can model quite substantial real-world problems.
Figure 17.4 A dynamic decision network for a mobile robot with state variables for battery level, charging status, location, and velocity, and action variables for the left and right wheel motors and for charging.
Figure 17.4, which is based on the DBN in Figure 14.13(b) (page 486), shows some elements of a slightly realistic model for a mobile robot that can charge itself. The state \(S_t\) is decomposed into four state variables:
• \(X_t\) consists of the two-dimensional location on a grid plus the orientation;
• \(\dot{X}_t\) is the rate of change of \(X_t\);
• \(\mathit{Charging}_t\) is true when the robot is plugged in to a power source;
• \(\mathit{Battery}_t\) is the battery level, which we model as an integer in the range 0, ...
The state space for the MDP is the Cartesian product of the ranges of these four variables. The action is now a set \(A_t\) of action variables, comprising Plug/Unplug, which has three values (plug, unplug, and noop); LeftWheel for the power sent to the left wheel; and RightWheel for the power sent to the right wheel. The set of actions for the MDP is the Cartesian product of the ranges of these three variables. Notice that each action variable affects only a subset of the state variables.
The overall transition model is the conditional distribution \(P(X_{t+1} \mid X_t, A_t)\), which can be computed as a product of conditional probabilities from the DDN. The reward here is a single variable that depends only on the location X (for, say, arriving at a destination) and Charging, as the robot has to pay for electricity used; in this particular model, the reward doesn’t depend
on the action or the outcome state.
The network in Figure 17.4 has been projected three steps into the future. Notice that the network includes nodes for the rewards for t, t + 1, and t + 2, but the utility for t + 3. This is because the agent must maximize the (discounted) sum of all future rewards, and \(U(X_{t+3})\) represents the total of all rewards from t + 3 onward. If a heuristic approximation to U is available, it can be included in the MDP representation in this way and used in lieu of further expansion. This approach is closely related to the use of bounded-depth search and heuristic evaluation functions for games in Chapter 5.
Figure 17.5 (a) The game of Tetris. The T-shaped piece at the top center can be dropped in any orientation and in any horizontal position. If a row is completed, that row disappears and the rows above it move down, and the agent receives one point. The next piece (here, the L-shaped piece at top right) becomes the current piece, and a new next piece appears, chosen at random from the seven piece types. The game ends if the board fills up to the top. (b) The DDN for the Tetris MDP.
Another interesting and well-studied MDP is the game of Tetris (Figure 17.5(a)). The
state variables for the game are the CurrentPiece, the NextPiece, and a bit-vector-valued
variable Filled with one bit for each of the 10 × 20 board locations. Thus, the state space has \(7 \times 7 \times 2^{200} \approx 10^{62}\) states. The DDN for Tetris is shown in Figure 17.5(b). Note that \(\mathit{Filled}_{t+1}\) is a deterministic function of \(\mathit{Filled}_t\) and \(A_t\). It turns out that every policy for Tetris is proper (reaches a terminal state): eventually the board fills despite one’s best efforts to empty it.
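The size of the Tetris state space follows directly from the three state variables, and Python's arbitrary-precision integers make the count easy to reproduce. The variable names below are illustrative only.

```python
import math

piece_types = 7        # possible values of CurrentPiece, and of NextPiece
board_cells = 10 * 20  # one bit of Filled per board location

# CurrentPiece x NextPiece x all bit-vectors for Filled
num_states = piece_types * piece_types * 2 ** board_cells
log10_states = math.log10(num_states)
```

The base-10 logarithm lies just under 62, which is the order of magnitude quoted for the state space and the reason a tabular representation is out of the question here.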
17.2
Algorithms for MDPs
In this section, we present four different algorithms for solving MDPs. The first three, value iteration, policy iteration, and linear programming, generate exact solutions offline. The fourth is a family of online approximate algorithms that includes Monte Carlo planning.
17.2.1
Value iteration
function VALUE-ITERATION(mdp, ε) returns a utility function
  inputs: mdp, an MDP with states S, actions A(s), transition model P(s'|s,a),
            rewards R(s,a,s'), discount γ
          ε, the maximum error allowed in the utility of any state
  local variables: U, U', vectors of utilities for states in S, initially zero
                   δ, the maximum relative change in the utility of any state
  repeat
    U ← U'; δ ← 0
    for each state s in S do
      U'[s] ← max_{a ∈ A(s)} Q-VALUE(mdp, s, a, U)
      if |U'[s] − U[s]| > δ then δ ← |U'[s] − U[s]|
  until δ < ε(1 − γ)/γ
  return U
Figure 17.6 The value iteration algorithm for calculating utilities of states. The termination condition is from Equation (17.12).
The Bellman equation (Equation (17.5)) is the basis of the value iteration algorithm for solving MDPs. If there are n possible states, then there are n Bellman equations, one for each state. The n equations contain n unknowns (the utilities of the states). So we would like to
solve these simultaneous equations to find the utilities. There is one problem: the equations
are nonlinear, because the “max” operator is not a linear operator. Whereas systems of linear equations can be solved quickly using linear algebra techniques, systems of nonlinear equations are more problematic. One thing to try is an iterative approach. We start with arbitrary
initial values for the utilities, calculate the right-hand side of the equation, and plug it into the
left-hand side—thereby updating the utility of each state from the utilities of its neighbors.
We repeat this until we reach an equilibrium.
Let \(U_i(s)\) be the utility value for state s at the ith iteration. The iteration step, called a Bellman update, looks like this:
\[ U_{i+1}(s) \leftarrow \max_{a \in A(s)} \sum_{s'} P(s' \mid s,a)\,[R(s,a,s') + \gamma\,U_i(s')] , \qquad (17.10) \]
where the update is assumed to be applied simultaneously to all the states at each iteration. If we apply the Bellman update infinitely often, we are guaranteed to reach an equilibrium
(see “convergence of value iteration” below), in which case the final utility values must be
solutions to the Bellman equations. In fact, they are also the unique solutions, and the corresponding policy (obtained using Equation (17.4)) is optimal. The detailed algorithm, including a termination condition when the utilities are “close enough,” is shown in Figure 17.6. Notice that we make use of the Q-VALUE function defined on page 569.
We can apply value iteration to the 4 × 3 world in Figure 17.1(a). Starting with initial
values of zero, the utilities evolve as shown in Figure 17.7(a). Notice how the states at different distances from (4,3) accumulate negative reward until a path is found to (4,3), whereupon
the utilities start to increase.
We can think of the value iteration algorithm as propagating
information through the state space by means of local updates.
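The algorithm in Figure 17.6 translates almost line for line into Python. The following is a sketch rather than a reference implementation: the dictionary layout of `mdp` and the two-state `toy_mdp` are assumptions made for this example (they are not the 4 × 3 world), and `q_value` computes the expected reward plus discounted successor utility that the Bellman update maximizes over.

```python
def q_value(mdp, s, a, U):
    """Expected immediate reward plus discounted utility of the successor."""
    return sum(p * (r + mdp["gamma"] * U[s2]) for p, s2, r in mdp["T"][s, a])


def value_iteration(mdp, eps=1e-6):
    gamma = mdp["gamma"]
    U_next = {s: 0.0 for s in mdp["states"]}
    while True:
        U = dict(U_next)  # U <- U'
        delta = 0.0
        for s in mdp["states"]:
            # Bellman update, applied simultaneously to all states
            U_next[s] = max(q_value(mdp, s, a, U) for a in mdp["actions"][s])
            delta = max(delta, abs(U_next[s] - U[s]))
        # "<=" rather than "<" so the loop also stops if delta reaches 0
        if delta <= eps * (1 - gamma) / gamma:
            return U


# Hypothetical deterministic MDP: T[(s, a)] lists (prob, next_state, reward).
toy_mdp = {
    "states": ["s0", "s1"],
    "actions": {"s0": ["stay", "go"], "s1": ["stay"]},
    "gamma": 0.5,
    "T": {
        ("s0", "stay"): [(1.0, "s0", 0.0)],
        ("s0", "go"): [(1.0, "s1", 1.0)],
        ("s1", "stay"): [(1.0, "s1", 2.0)],
    },
}
utilities = value_iteration(toy_mdp)
```

For this toy problem the Bellman equations can be solved by hand: U(s1) = 2/(1 − 0.5) = 4 and U(s0) = 1 + 0.5 · 4 = 3, and the iterates approach these values.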
Figure 17.7 (a) Graph showing the evolution of the utilities of selected states using value iteration. (b) The number of value iterations required to guarantee an error of at most \(\epsilon = c \cdot R_{\max}\), for different values of c, as a function of the discount factor \(\gamma\).
Convergence of value iteration
We said that value iteration eventually converges to a unique set of solutions of the Bellman
equations. In this section, we explain why this happens. We introduce some useful mathematical ideas along the way, and we obtain some methods for assessing the error in the utility function returned when the algorithm is terminated early; this is useful because it means that we don’t have to run forever. This section is quite technical.
The basic concept used in showing that value iteration converges is the notion of a contraction. Roughly speaking, a contraction is a function of one argument that, when applied to two different inputs in turn, produces two output values that are “closer together,” by at least some constant factor, than the original inputs. For example, the function “divide by two” is a contraction, because, after we divide any two numbers by two, their difference is halved.
Notice that the “divide by two” function has a fixed point, namely zero, that is unchanged by
the application of the function. From this example, we can discern two important properties of contractions:
• A contraction has only one fixed point; if there were two fixed points they would not get closer together when the function was applied, so it would not be a contraction.
• When the function is applied to any argument, the value must get closer to the fixed point (because the fixed point does not move), so repeated application of a contraction always reaches the fixed point in the limit.
Now, suppose we view the Bellman update (Equation (17.10)) as an operator B that is applied simultaneously to update the utility of every state. Then the Bellman equation becomes \(U = BU\) and the Bellman update equation can be written as \(U_{i+1} \leftarrow B\,U_i\).
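Both the “divide by two” example and the operator view of the Bellman update can be tried out numerically. The sketch below rests on assumptions: the tiny deterministic MDP is hypothetical, and the claim being illustrated (that one Bellman update shrinks the largest per-state difference between two utility vectors by at least a factor of γ) is the standard contraction result, stated here from general knowledge rather than from the surviving text.

```python
def halve(x):
    return x / 2


# Gap between two arbitrary inputs after repeated application of halve.
x, y = 10.0, -6.0
gaps = []
for _ in range(5):
    gaps.append(abs(x - y))
    x, y = halve(x), halve(y)

GAMMA = 0.5
# Hypothetical deterministic MDP: T[(s, a)] = (next_state, reward)
T = {(0, "stay"): (0, 0.0), (0, "go"): (1, 1.0), (1, "stay"): (1, 2.0)}
ACTIONS = {0: ["stay", "go"], 1: ["stay"]}


def bellman_update(U):
    """The operator B: one simultaneous Bellman update of every state."""
    return {
        s: max(T[s, a][1] + GAMMA * U[T[s, a][0]] for a in ACTIONS[s])
        for s in ACTIONS
    }


def max_norm_gap(U, V):
    """Largest absolute difference between two utility vectors."""
    return max(abs(U[s] - V[s]) for s in U)


U = {0: 10.0, 1: -3.0}
V = {0: -7.0, 1: 4.0}
gap_before = max_norm_gap(U, V)
gap_after = max_norm_gap(bellman_update(U), bellman_update(V))
```

The halving gaps shrink by a factor of two each step, and `gap_after` is no larger than `GAMMA * gap_before` for this pair of vectors.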
[...] Stay when \(b(B) > 0.5\) and Go otherwise.
Once we have utilities \(\alpha_p(s)\) for all the conditional plans p of depth 1 in each physical state s, we can compute the utilities for conditional plans of depth 2 by considering each possible first action, each possible subsequent percept, and then each way of choosing a depth-1 plan to execute for each percept:
[Stay; if Percept = A then Stay else Stay]
[Stay; if Percept = A then Stay else Go]
[Go; if Percept = A then Stay else Stay]
There are eight distinct depth-2 plans in all, and their utilities are shown in Figure 17.15(b). Notice that four of the plans, shown as dashed lines, are suboptimal across the entire belief space; we say these plans are dominated, and they need not be considered further. There are four undominated plans, each of which is optimal in a specific region, as shown in Figure 17.15(c). The regions partition the belief-state space.
We repeat the process for depth 3, and so on. In general, let p be a depth-d conditional plan whose initial action is a and whose depth-(d − 1) subplan for percept e is p.e; then
\[ \alpha_p(s) = \sum_{s'} P(s' \mid s,a)\,\Big[R(s,a,s') + \gamma \sum_e P(e \mid s')\,\alpha_{p.e}(s')\Big] . \qquad (17.18) \]
This recursion naturally gives us a value iteration algorithm, which is given in Figure 17.16. The structure of the algorithm and its error analysis are similar to those of the basic value
iteration algorithm in Figure 17.6 on page 573; the main difference is that instead of computing one utility number for each state, POMDP-VALUE-ITERATION
maintains a collection of
undominated plans with their utility hyperplanes.
The algorithm’s complexity depends primarily on how many plans get generated. Given |A| actions and |E| possible observations, there are \(|A|^{O(|E|^{d-1})}\) distinct depth-d plans. Even for the lowly two-state world with d = 8, that’s \(2^{255}\) plans. The elimination of dominated plans is essential for reducing this doubly exponential growth: the number of undominated plans with d = 8 is just 144. The utility function for these 144 plans is shown in Figure 17.15(d). Notice that the intermediate belief states have lower value than state A and state B, because in the intermediate states the agent lacks the information needed to choose a good action. This is why information has value in the sense defined in Section 16.6 and optimal policies in POMDPs often include information-gathering actions.
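The doubly exponential growth can be reproduced by counting plans recursively: a depth-d plan is a first action plus one depth-(d − 1) subplan for each percept. The function name `num_plans` is an assumption for this sketch; the two-state world is taken to have two actions (Stay, Go) and two percepts (A, B), as in the example plans listed earlier.

```python
def num_plans(num_actions, num_percepts, depth):
    """Count distinct conditional plans of the given depth (depth >= 1)."""
    count = num_actions  # a depth-1 plan is just a single action
    for _ in range(depth - 1):
        # one first action, then an independent subplan for every percept
        count = num_actions * count ** num_percepts
    return count


depth2 = num_plans(2, 2, 2)
depth8 = num_plans(2, 2, 8)
```

For the two-state world this gives eight depth-2 plans and \(2^{255}\) depth-8 plans, in agreement with the figures quoted in the text.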
Section 17.5
Algorithms for Solving POMDPs
function POMDP-VALUE-ITERATION(pomdp, ε) returns a utility function
  inputs: pomdp, a POMDP with states S, actions A(s), transition model P(s'|s,a),
            sensor model P(e|s), rewards R(s), discount γ
          ε, the maximum error allowed in the utility of any state
  local variables: U, U', sets of plans p with associated utility vectors α_p
  U' ← a set containing just the empty plan [], with α_[](s) = R(s)
  repeat
    U ← U'
    U' ← the set of all plans consisting of an action and, for each possible next percept,
         a plan in U with utility vectors computed according to Equation (17.18)
    U' ← REMOVE-DOMINATED-PLANS(U')
  until MAX-DIFFERENCE(U, U') < ε(1 − γ)/γ
  return U
Figure 17.16 A high-level sketch of the value iteration algorithm for POMDPs. The REMOVE-DOMINATED-PLANS step and MAX-DIFFERENCE test are typically implemented as linear programs.
Given such a utility function, an executable policy can be extracted by looking at which hyperplane is optimal at any given belief state b and executing the first action of the corresponding plan. In Figure 17.15(d), the corresponding optimal policy is still the same as for depth-1 plans: Stay when b(B) > 0.5 and Go otherwise.
In practice, the value iteration algorithm in Figure 17.16 is hopelessly inefficient for larger problems; even the 4 × 3 POMDP is too hard. The main reason is that given n undominated conditional plans at level d, the algorithm constructs \(|A| \cdot n^{|E|}\) conditional plans at level d + 1 before eliminating the dominated ones. With the four-bit sensor, |E| is 16, and n can be in the hundreds, so this is hopeless.
Since this algorithm was developed in the 1970s, there have been several advances, including more efficient forms of value iteration and various kinds of policy iteration algorithms.
Some
of these are discussed in the notes at the end of the chapter.
For general
POMDPs, however, finding optimal policies is very difficult (PSPACE-hard, in fact—that is, very hard indeed). The next section describes a different, approximate method for solving POMDPs, one based on look-ahead search.
17.5.2
Online algorithms for POMDPs
The basic design for an online POMDP agent is straightforward: it starts with some prior belief state; it chooses an action based on some deliberation process centered on its current
belief state; after acting, it receives an observation and updates its belief state using a filtering algorithm; and the process repeats.
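One pass through this loop can be sketched for a two-state world. Everything numeric here is an assumption made for illustration: the transition probabilities (0.9 for the intended outcome) and the sensor accuracy (0.6) are hypothetical, and the belief update is the standard filtering step, in which the new belief in s' is proportional to P(e|s') times the sum over s of P(s'|s,a) b(s). The Stay/Go rule in `choose_action` is the depth-1 policy quoted in the text.

```python
def update_belief(belief, action, percept, P_trans, P_sense):
    """One filtering step: predict with the action, then weight by the percept."""
    predicted = {
        s2: sum(P_trans[s, action].get(s2, 0.0) * b for s, b in belief.items())
        for s2 in belief
    }
    unnormalized = {
        s2: P_sense[s2].get(percept, 0.0) * predicted[s2] for s2 in belief
    }
    total = sum(unnormalized.values())
    return {s2: v / total for s2, v in unnormalized.items()}


def choose_action(belief):
    """Stand-in for deliberation: Stay when b(B) > 0.5 and Go otherwise."""
    return "Stay" if belief["B"] > 0.5 else "Go"


# Hypothetical models: P_trans[(s, a)][s'] and P_sense[s][e].
P_trans = {
    ("A", "Stay"): {"A": 0.9, "B": 0.1},
    ("A", "Go"): {"A": 0.1, "B": 0.9},
    ("B", "Stay"): {"A": 0.1, "B": 0.9},
    ("B", "Go"): {"A": 0.9, "B": 0.1},
}
P_sense = {"A": {"A": 0.6, "B": 0.4}, "B": {"A": 0.4, "B": 0.6}}

belief = {"A": 0.5, "B": 0.5}            # prior belief state
first_action = choose_action(belief)     # act on the current belief
belief = update_belief(belief, first_action, "B", P_trans, P_sense)
second_action = choose_action(belief)    # the process then repeats
```

A real agent would replace `choose_action` with a deliberation process such as look-ahead search over belief states; the fixed rule is used here only to keep the loop visible.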
One obvious choice for the deliberation process is the expectimax algorithm from Sec-
tion 17.2.4, except with belief states rather than physical states as the decision nodes in the tree. The chance nodes in the POMDP
tree have branches labeled by possible observations
and leading to the next belief state, with transition probabilities given by Equation (17.17). A fragment of the belief-state expectimax tree for the 4 x 3 POMDP is shown in Figure 17.17.
Figure 17.17 Part of an expectimax tree for the 4 × 3 POMDP with a uniform initial belief state. The belief states are depicted with shading proportional to the probability of being in each location.
The time complexity of an exhaustive search to depth d is \(O(|A|^d \cdot |E|^d)\), where |A| is
the number of available actions and |E| is the number of possible percepts.
(Notice that
this is far less than the number of possible depth-d conditional plans generated by value iteration.)
As in the observable case, sampling at the chance nodes is a good way to cut
down the branching factor without losing too much accuracy in the final decision. Thus, the
complexity of approximate online decision making in POMDPs may not be drastically worse than that in MDPs.
For very large state spaces, exact filtering is infeasible, so the agent will need to run
an approximate filtering algorithm such as particle filtering (see page 492). Then the belief
states in the expectimax tree become collections of particles rather than exact probability distributions. For problems with long horizons, we may also need to run the kind of long-range playouts used in the UCT algorithm (Figure 5.11). The combination of particle filtering and UCT applied to POMDPs goes under the name of partially observable Monte Carlo planning
or POMCP.
With a DDN
representation for the model, the POMCP
algorithm is, at least
in principle, applicable to very large and realistic POMDPs. Details of the algorithm are explored in Exercise 17.PoMC. POMCP is capable of generating competent behavior in the
4 x 3 POMDP. A short (and somewhat fortunate) example is shown in Figure 17.18. POMDP agents based on dynamic decision networks and online decision making have a
number of advantages compared with other, simpler agent designs presented in earlier chap-
ters. In particular, they handle partially observable, stochastic environments and can easily revise their “plans” to handle unexpected evidence. With appropriate sensor models, they can handle sensor failure and can plan to gather information. They exhibit “graceful degradation” under time pressure and in complex environments, using various approximation techniques.
Figure 17.18 A sequence of percepts, belief states, and actions in the 4 × 3 POMDP with a wall-sensing error of \(\epsilon = 0.2\). Notice how the early Left moves are safe (they are very unlikely to fall into (4,2)) and coerce the agent’s location into a small number of possible locations. After moving Up, the agent thinks it is probably in (3,3), but possibly in (1,3). Fortunately, moving Right is a good idea in both cases, so it moves Right, finds out that it had been in (1,3) and is now in (2,3), and then continues moving Right and reaches the goal.
So what is missing? The principal obstacle to real-world deployment of such agents is the inability to generate successful behavior over long time-scales. Random or near-random playouts have no hope of gaining any positive reward on, say, the task of laying the table for dinner, which might take tens of millions of motor-control actions.
It seems necessary
to borrow some of the hierarchical planning ideas described in Section 11.4. At the time of
writing, there are not yet satisfactory and efficient ways to apply these ideas in stochastic,
partially observable environments.
Summary
This chapter shows how to use knowledge about the world to make decisions even when the outcomes of an action are uncertain and the rewards for acting might not be reaped until many actions have passed. The main points are as follows:
• Sequential decision problems in stochastic environments, also called Markov decision processes, or MDPs, are defined by a transition model specifying the probabilistic outcomes of actions and a reward function specifying the reward in each state.
• The utility of a state sequence is the sum of all the rewards over the sequence, possibly discounted over time. The solution of an MDP is a policy that associates a decision with every state that the agent might reach. An optimal policy maximizes the utility of the state sequences encountered when it is executed.
• The utility of a state is the expected sum of rewards when an optimal policy is executed from that state. The value iteration algorithm iteratively solves a set of equations relating the utility of each state to those of its neighbors.
• Policy iteration alternates between calculating the utilities of states under the current policy and improving the current policy with respect to the current utilities.
• Partially observable MDPs, or POMDPs, are much more difficult to solve than are MDPs. They can be solved by conversion to an MDP in the continuous space of belief states; both value iteration and policy iteration algorithms have been devised. Optimal behavior in POMDPs includes information gathering to reduce uncertainty and therefore make better decisions in the future.
• A decision-theoretic agent can be constructed for POMDP environments. The agent uses a dynamic decision network to represent the transition and sensor models, to update its belief state, and to project forward possible action sequences.
We shall return to MDPs and POMDPs in Chapter 22, which covers reinforcement learning methods that allow an agent to improve its behavior from experience.
Bibliographical and Historical Notes
Richard Bellman developed the ideas underlying the modern approach to sequential decision problems while working at the RAND Corporation beginning in 1949. According to his autobiography (Bellman, 1984), he coined the term “dynamic programming” to hide from a research-phobic Secretary of Defense, Charles Wilson, the fact that his group was doing mathematics. (This cannot be strictly true, because his first paper using the term (Bellman,
1952) appeared before Wilson became Secretary of Defense in 1953.) Bellman’s book, Dynamic Programming (1957), gave the new field a solid foundation and introduced the value iteration algorithm.
Shapley (1953b) actually described the value iteration algorithm independently of Bellman, but his results were not widely appreciated in the operations research community, perhaps because they were presented in the more general context of Markov games. Although the original formulations included discounting, its analysis in terms of stationary preferences was suggested by Koopmans (1972). The shaping theorem is due to Ng et al. (1999). Ron Howard’s Ph.D. thesis (1960) introduced policy iteration and the idea of average reward for solving infinite-horizon problems. Several additional results were introduced by Bellman and Dreyfus (1962). The use of contraction mappings in analyzing dynamic programming algorithms is due to Denardo (1967). Modified policy iteration is due to van Nunen (1976) and Puterman and Shin (1978). Asynchronous policy iteration was analyzed by Williams and Baird (1993), who also proved the policy loss bound in Equation (17.13). The general family of prioritized sweeping algorithms aims to speed up convergence to optimal policies by heuristically ordering the value and policy update calculations (Moore and Atkeson, 1993; Andre et al., 1998; Wingate and Seppi, 2005).
The formulation of MDP-solving as a linear program is due to de Ghellinck (1960), Manne (1960), and D’Epenoux (1963). Although linear programming has traditionally been considered inferior to dynamic programming as an exact solution method for MDPs, de Farias and Roy (2003) show that it is possible to use linear programming and a linear representation of the utility function to obtain provably good approximate solutions to very large MDPs.
Papadimitriou and Tsitsiklis (1987) and Littman et al. (1995) provide general results on the computational complexity of MDPs. Yinyu Ye (2011) analyzes the relationship between policy iteration and the simplex method for linear programming and proves that for fixed γ, the runtime of policy iteration is polynomial in the number of states and actions.
Seminal work by Sutton (1988) and Watkins (1989) on reinforcement learning methods for solving MDPs played a significant role in introducing MDPs into the AI community. (Earlier work by Werbos (1977) contained many similar ideas, but was not taken up to the same extent.) AI researchers have pushed MDPs in the direction of more expressive representations that can accommodate much larger problems than the traditional atomic representations based on transition matrices.
The basic ideas for an agent architecture using dynamic decision networks were proposed by Dean and Kanazawa (1989a). Tatman and Shachter (1990) showed how to apply dynamic programming algorithms to DDN models. Several authors made the connection between MDPs and AI planning problems, developing probabilistic forms of the compact STRIPS representation for transition models (Wellman, 1990b; Koenig, 1991). The book Planning and Control by Dean and Wellman (1991) explores the connection in great depth.
Later work on factored MDPs (Boutilier et al., 2000; Koller and Parr, 2000; Guestrin et al., 2003b) uses structured representations of the value function as well as the transition model, with provable improvements in complexity. Relational MDPs (Boutilier et al., 2001; Guestrin et al., 2003a) go one step further, using structured representations to handle domains with many related objects. Open-universe MDPs and POMDPs (Srivastava et al., 2014b) also allow for uncertainty over the existence and identity of objects and actions.
Many authors have developed approximate online algorithms for decision making in MDPs, often borrowing explicitly from earlier AI approaches to real-time search and game playing (Werbos, 1992; Dean et al., 1993; Tash and Russell, 1994). The work of Barto et al. (1995) on RTDP (real-time dynamic programming) provided a general framework for understanding such algorithms and their connection to reinforcement learning and heuristic search. The analysis of depth-bounded expectimax with sampling at chance nodes is due to Kearns et al. (2002). The UCT algorithm described in the chapter is due to Kocsis and Szepesvari (2006) and borrows from earlier work on random playouts for estimating the values of states (Abramson, 1990; Brügmann, 1993; Chang et al., 2005).
Bandit problems were introduced by Thompson (1933) but came to prominence after World War II through the work of Herbert Robbins (1952). Bradt et al. (1956) proved the first results concerning stopping rules for one-armed bandits, which led eventually to the breakthrough results of John Gittins (Gittins and Jones, 1974; Gittins, 1989). Katehakis and Veinott (1987) suggested the restart MDP as a method of computing Gittins indices. The text by Berry and Fristedt (1985) covers many variations on the basic problem, while the pellucid online text by Ferguson (2001) connects bandit problems with stopping problems. Lai and Robbins (1985) initiated the study of the asymptotic regret of optimal bandit policies. The UCB heuristic was introduced and analyzed by Auer et al. (2002). Bandit superprocesses (BSPs) were first studied by Nash (1973) but have remained largely unknown in AI. Hadfield-Menell and Russell (2015) describe an efficient branch-and-bound algorithm capable of solving relatively large BSPs. Selection problems were introduced by Bechhofer (1954). Hay et al. (2012) developed a formal framework for metareasoning problems, showing that simple instances mapped to selection rather than bandit problems. They also proved the satisfying result that the expected computation cost of the optimal computational strategy is never higher than the expected gain in decision quality, although there are cases where the optimal policy may, with some probability, keep computing long past the point where any possible gain has been used up.
The observation that a partially observable MDP can be transformed into a regular MDP over belief states is due to Åström (1965) and Aoki (1965).
The first complete algorithm
for the exact solution of POMDPs—essentially the value iteration algorithm presented in
this chapter—was proposed by Edward Sondik (1971) in his Ph.D. thesis. (A later journal paper by Smallwood and Sondik (1973) contains some errors, but is more accessible.)
Lovejoy (1991) surveyed the first twenty-five years of POMDP research, reaching somewhat pessimistic conclusions about the feasibility of solving large problems.
The first significant contribution within AI was the Witness algorithm (Cassandra et al., 1994; Kaelbling et al., 1998), an improved version of POMDP value iteration. Other algorithms soon followed, including an approach due to Hansen (1998) that constructs a policy incrementally in the form of a finite-state automaton whose states define the possible belief states of the agent.
More recent work in AI has focused on point-based value iteration methods that, at each iteration, generate conditional plans and α-vectors for a finite set of belief states rather than for the entire belief space. Lovejoy (1991) proposed such an algorithm for a fixed grid of points, an approach taken also by Bonet (2002). An influential paper by Pineau et al. (2003) suggested generating reachable points by simulating trajectories in a somewhat greedy fashion; Spaan and Vlassis (2005) observe that one need generate plans for only a small, randomly selected subset of points to improve on the plans from the previous iteration for all points in the set. Shani et al. (2013) survey these and other developments in point-based algorithms, which have led to good solutions for problems with thousands of states. Because POMDPs are PSPACE-hard (Papadimitriou and Tsitsiklis, 1987), further progress on offline solution methods may require taking advantage of various kinds of structure in value functions arising from a factored representation of the model.
The online approach for POMDPs, using look-ahead search to select an action for the current belief state, was first examined by Satia and Lave (1973). The use of sampling at chance nodes was explored analytically by Kearns et al. (2000) and Ng and Jordan (2000). The POMCP algorithm is due to Silver and Veness (2011).
With the development of reasonably effective approximation algorithms for POMDPs, their use as models for real-world problems has increased, particularly in education (Rafferty et al., 2016), dialog systems (Young et al., 2013), robotics (Hsiao et al., 2007; Huynh and Roy, 2009), and self-driving cars (Forbes et al., 1995; Bai et al., 2015). An important large-scale application is the Airborne Collision Avoidance System X (ACAS X), which keeps airplanes and drones from colliding midair. The system uses POMDPs with neural networks to do function approximation. ACAS X significantly improves safety compared to the legacy TCAS system, which was built in the 1970s using expert system technology (Kochenderfer, 2015; Julian et al., 2018).
Complex decision making has also been studied by economists and psychologists. They find that decision makers are not always rational, and may not be operating exactly as described by the models in this chapter. For example, when given a choice, a majority of people prefer $100 today over a guarantee of $200 in two years, but those same people prefer $200 in eight years over $100 in six years. One way to interpret this result is that people are not using additive exponentially discounted rewards; perhaps they are using hyperbolic rewards (the hyperbolic function dips more steeply in the near term than does the exponential decay function). This and other possible interpretations are discussed by Rubinstein (2003).
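The preference reversal described above can be checked with a small numerical sketch. The hyperbolic form 1/(1 + kt), the rate k = 1 per year, and the exponential factor 0.7 are illustrative assumptions rather than values from the studies cited, and the function names are hypothetical.

```python
# Sketch: preference reversal under hyperbolic vs. exponential discounting.
# The discount forms and the rates k and gamma are illustrative assumptions.

def exponential_value(amount, years, gamma=0.7):
    """Present value with a constant per-year discount factor gamma."""
    return amount * gamma ** years

def hyperbolic_value(amount, years, k=1.0):
    """Present value with the hyperbolic discount 1 / (1 + k * years)."""
    return amount / (1.0 + k * years)

def prefers_larger_later(value, delay):
    """True if $200 at delay + 2 years is valued above $100 at delay years."""
    return value(200, delay + 2) > value(100, delay)

# Hyperbolic: choose $100 now, but $200 in eight years over $100 in six.
hyperbolic_choices = (prefers_larger_later(hyperbolic_value, 0),
                      prefers_larger_later(hyperbolic_value, 6))

# Exponential: the comparison reduces to 200 * gamma**2 > 100 at every delay,
# so the choice cannot flip when both rewards are pushed into the future.
exponential_choices = (prefers_larger_later(exponential_value, 0),
                       prefers_larger_later(exponential_value, 6))

print(hyperbolic_choices)   # (False, True)
print(exponential_choices)  # (False, False)
```

Under these assumed rates the hyperbolic agent reverses its choice while the exponential agent makes the same choice at both delays, which is the pattern the paragraph above describes.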
The texts by Bertsekas (1987) and Puterman (1994) provide rigorous introductions to sequential decision problems and dynamic programming. Bertsekas and Tsitsiklis (1996) include coverage of reinforcement learning. Sutton and Barto (2018) cover similar ground but in a more accessible style. Sigaud and Buffet (2010), Mausam and Kolobov (2012) and Kochenderfer (2015) cover sequential decision making from an AI perspective. Krishnamurthy (2016) provides thorough coverage of POMDPs.
18
MULTIAGENT DECISION MAKING
In which we examine what to do when more than one agent inhabits the environment.
18.1 Properties of Multiagent Environments
So far, we have largely assumed that only one agent has been doing the sensing, planning, and acting. But this represents a huge simplifying assumption, which fails to capture many real-world AI settings. In this chapter, therefore, we will consider the issues that arise when an agent must make decisions in environments that contain multiple actors. Such environments are called multiagent systems, and agents in such a system face a multiagent planning problem. However, as we will see, the precise nature of the multiagent planning problem, and the techniques that are appropriate for solving it, will depend on the relationships among the agents in the environment.
18.1.1 One decision maker
The first possibility is that while the environment contains multiple actors, it contains only one decision maker. In such a case, the decision maker develops plans for the other agents, and tells them what to do. The assumption that agents will simply do what they are told is called the benevolent agent assumption. However, even in this setting, plans involving multiple actors will require actors to synchronize their actions. Actors A and B will have to act at the same time for joint actions (such as singing a duet), at different times for mutually exclusive actions (such as recharging batteries when there is only one plug), and sequentially when one establishes a precondition for the other (such as A washing the dishes and then B drying them).
One special case is where we have a single decision maker with multiple effectors that can operate concurrently, for example, a human who can walk and talk at the same time. Such an agent needs to do multieffector planning to manage each effector while handling positive and negative interactions among the effectors. When the effectors are physically decoupled into detached units, as in a fleet of delivery robots in a factory, multieffector planning becomes multibody planning.
A multibody problem is still a “standard” single-agent problem as long as the relevant sensor information collected by each body can be pooled, either centrally or within each body, to form a common estimate of the world state that then informs the execution of the overall plan; in this case, the multiple bodies can be thought of as acting as a single body. When communication constraints make this impossible, we have what is sometimes
called a decentralized planning problem; this is perhaps a misnomer, because the planning [...]

[...] −3.
• Now suppose we change the rules to force O to reveal his strategy first, followed by E. Then the minimax value of this game is $U_{O,E}$, and because this game favors E we know that U is at most $U_{O,E}$. With pure strategies, the value is +2 (see Figure 18.2(b)), so we know $U \le +2$.

Combining these two arguments, we see that the true utility U of the solution to the original game must satisfy

$$U_{E,O} \le U \le U_{O,E}$$ [...]

[...] $$v(C \cup D) \ge v(C) + v(D) \quad \text{for all } C, D \subseteq N$$
If a game is superadditive, then the grand coalition receives a value that is at least as high
as or higher than the total received by any other coalition structure. However, as we will see
shortly, superadditive games do not always end up with a grand coalition, for much the same reason that the players do not always arrive at a collectively desirable Pareto-optimal outcome in the prisoner’s dilemma.
18.3.2 Strategy in cooperative games
The basic assumption in cooperative game theory is that players will make strategic decisions about who they will cooperate with. Intuitively, players will not desire to work with unproductive players—they will naturally seek out players that collectively yield a high coalitional
value. But these sought-after players will be doing their own strategic reasoning. Before we can describe this reasoning, we need some further definitions.
An imputation for a cooperative game $(N, v)$ is a payoff vector $x = (x_1, \ldots, x_n)$ that satisfies the following two conditions:

$$\sum_{i=1}^{n} x_i = v(N)$$
$$x_i \ge v(\{i\}) \quad \text{for all } i \in N.$$

The first condition says that an imputation must distribute the total value of the grand coalition; the second condition, known as individual rationality, says that each player is at least as well off as if it had worked alone.
Given an imputation $x = (x_1, \ldots, x_n)$ and a coalition $C \subseteq N$, we define $x(C)$ to be the sum $\sum_{i \in C} x_i$, the total amount disbursed to C by the imputation x.
Next, we define the core of a game $(N, v)$ as the set of all imputations x that satisfy the condition $x(C) \ge v(C)$ for every possible coalition $C \subseteq N$. Thus, if an imputation x is not in the core, then there exists some coalition $C \subseteq N$ such that $v(C) > x(C)$. The players in C would refuse to join the grand coalition because they would be better off sticking with C.
The core of a game therefore consists of all the possible payoff vectors that no coalition could object to on the grounds that they could do better by not joining the grand coalition. Thus, if the core is empty, then the grand coalition cannot form, because no matter how the grand coalition divided its payoff, some smaller coalition would refuse to join. The main computational questions around the core relate to whether or not it is empty, and whether a particular payoff distribution is in the core.
The definition of the core naturally leads to a system of linear inequalities, as follows (the unknowns are variables $x_1, \ldots, x_n$, and the values $v(C)$ are constants):

$$x_i \ge v(\{i\}) \quad \text{for all } i \in N$$
$$\sum_{i \in N} x_i = v(N)$$
$$\sum_{i \in C} x_i \ge v(C) \quad \text{for all } C \subseteq N$$

Any solution to these inequalities will define an imputation in the core. We can formulate the inequalities as a linear program by using a dummy objective function (for example, maximizing $\sum_{i \in N} x_i$), which will allow us to compute imputations in time polynomial in the number of inequalities. The difficulty is that this gives an exponential number of inequalities (one for each of the $2^n$ possible coalitions). Thus, this approach yields an algorithm for checking non-emptiness of the core that runs in exponential time. Whether we can do better than this
depends on the game being studied: for many classes of cooperative game, the problem of checking non-emptiness of the core is co-NP-complete. We give an example below. Before proceeding, let’s see an example of a superadditive game with an empty core. The game has three players N = {1,2,3}, and has a characteristic function defined as follows:
$$v(C) = \begin{cases} 1 & \text{if } |C| \ge 2 \\ 0 & \text{otherwise.} \end{cases}$$

Now consider any imputation $(x_1, x_2, x_3)$ for this game. Since $v(N) = 1$, it must be the case that at least one player i has $x_i > 0$, and the other two get a total payoff less than 1. Those two could benefit by forming a coalition without player i and sharing the value 1 among themselves. Since this holds for all imputations, the core must be empty.
The core formalizes the idea of the grand coalition being stable, in the sense that no coalition can profitably defect from it. However, the core may contain imputations that are unreasonable, in the sense that one or more players might feel that they are unfair. Suppose N = {1,2}, and we have a characteristic function v defined as follows:

v({1}) = v({2}) = 5
v({1,2}) = 20.

Here, cooperation yields a surplus of 10 over what players could obtain working in isolation, and so intuitively, cooperation will make sense in this scenario. Now, it is easy to see that the imputation (6, 14) is in the core of this game: neither player can deviate to obtain a higher utility. But from the point of view of player 1, this might appear unreasonable, because it gives 9/10 of the surplus to player 2. Thus, the notion of the core tells us when a grand coalition can form, but it does not tell us how to distribute the payoff.
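Core membership for the two example games above can be checked by brute force over all coalitions. The following is a minimal sketch, not an efficient method: it enumerates all nonempty coalitions, so it is exponential in the number of players. The function names and the tolerance parameter are hypothetical choices for this illustration.

```python
from itertools import combinations

def coalitions(players):
    """Yield every nonempty coalition of the given players as a frozenset."""
    for size in range(1, len(players) + 1):
        for members in combinations(players, size):
            yield frozenset(members)

def blocking_coalition(payoff, v, players, tol=1e-9):
    """Return a coalition C with v(C) > x(C), or None if no coalition objects."""
    for c in coalitions(players):
        if v(c) > sum(payoff[i] for i in c) + tol:
            return c
    return None

def in_core(payoff, v, players, tol=1e-9):
    """A payoff vector is in the core if it is efficient and unblocked."""
    efficient = abs(sum(payoff.values()) - v(frozenset(players))) <= tol
    return efficient and blocking_coalition(payoff, v, players, tol) is None

# Three-player game from the text: any two or more players earn 1.
def majority(c):
    return 1 if len(c) >= 2 else 0

# Two-player game from the text: 5 each alone, 20 together.
def two_player(c):
    return {1: 5, 2: 20}[len(c)]

equal_split = {1: 1/3, 2: 1/3, 3: 1/3}
print(in_core(equal_split, majority, [1, 2, 3]))     # False
print(in_core({1: 6, 2: 14}, two_player, [1, 2]))    # True
print(in_core({1: 4, 2: 16}, two_player, [1, 2]))    # False
```

Single-player coalitions are included in the enumeration, so individual rationality is checked along with the other core constraints. The last call fails because player 1 would do better alone.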
The Shapley value is an elegant proposal for how to divide the v(N) value among the players, given that the grand coalition N formed. Formulated by Nobel laureate Lloyd Shapley in the early 1950s, the Shapley value is intended to be a fair distribution scheme. What does fair mean?
It would be unfair to distribute v(N) based on the eye color of
players, or their gender, or skin color. Students often suggest that the value v(N) should be
divided equally, which seems like it might be fair, until we consider that this would give the
same reward to players that contribute a lot and players that contribute nothing. Shapley’s insight was to suggest that the only fair way to divide the value v(N) was to do so according to how much each player contributed to creating the value v(N).
First we need to define the notion of a player’s marginal contribution. The marginal contribution that a player i makes to a coalition C is the value that i would add (or remove), should i join the coalition C. Formally, the marginal contribution that player i makes to C is denoted by $mc_i(C)$:

$$mc_i(C) = v(C \cup \{i\}) - v(C).$$

Now, a first attempt to define a payoff division scheme in line with Shapley’s suggestion that players should be rewarded according to their contribution would be to pay each player i the value that they would add to the coalition containing all other players:

$$mc_i(N - \{i\}).$$

The problem is that this implicitly assumes that player i is the last player to enter the coalition.
So, Shapley suggested, we need to consider all possible ways that the grand coalition could form, that is, all possible orderings of the players N, and consider the value that i adds to the preceding players in the ordering. Then, a player should be rewarded by being paid the average marginal contribution that player i makes, over all possible orderings of the players, to the set of players preceding i in the ordering.
We let P denote all possible permutations (i.e., orderings) of the players N, and denote members of P by $p, p', \ldots$ etc. Where $p \in P$ and $i \in N$, we denote by $p_i$ the set of players that precede i in the ordering p. Then the Shapley value for a game G is the imputation $\phi(G) = (\phi_1(G), \ldots, \phi_n(G))$ defined as follows:

$$\phi_i(G) = \frac{1}{n!} \sum_{p \in P} mc_i(p_i). \qquad (18.1)$$
This should convince you that the Shapley value is a reasonable proposal. But the remark-
able fact is that it is the unique solution to a set of axioms that characterizes a “fair” payoff distribution scheme. We’ll need some more definitions before defining the axioms.
We define a dummy player as a player i that never adds any value to a coalition, that is, $mc_i(C) = 0$ for all $C \subseteq N - \{i\}$. We will say that two players i and j are symmetric players if they always make identical contributions to coalitions, that is, $mc_i(C) = mc_j(C)$ for all $C \subseteq N - \{i, j\}$. Finally, where $G = (N, v)$ and $G' = (N, v')$ are games with the same set of players, the game $G + G'$ is the game with the same player set, and a characteristic function $v''$ defined by $v''(C) = v(C) + v'(C)$.
Given these definitions, we can define the fairness axioms satisfied by the Shapley value:
• Efficiency: $\sum_{i \in N} \phi_i(G) = v(N)$. (All the value should be distributed.)
• Dummy Player: If i is a dummy player in G then $\phi_i(G) = 0$. (Players who never contribute anything should never receive anything.)
• Symmetry: If i and j are symmetric in G then $\phi_i(G) = \phi_j(G)$. (Players who make identical contributions should receive identical payoffs.)
• Additivity: The value is additive over games: For all games $G = (N, v)$ and $G' = (N, v')$, and for all players $i \in N$, we have $\phi_i(G + G') = \phi_i(G) + \phi_i(G')$.
The additivity axiom is admittedly rather technical. If we accept it as a requirement, however,
we can establish the following key property: the Shapley value is the only way to distribute
coalitional value so as to satisfy these fairness axioms.
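Equation (18.1) can be evaluated directly for small games by enumerating all orderings. The sketch below does this with exact fractions; it takes time proportional to n!, so it is only an illustration of the definition. The function names and the three-player variant with a dummy player are hypothetical, and the value of the empty coalition is assumed to be 0.

```python
from itertools import permutations
from fractions import Fraction

def shapley_values(players, v):
    """Average each player's marginal contribution over all orderings (Eq. 18.1)."""
    totals = {i: Fraction(0) for i in players}
    orderings = list(permutations(players))
    for ordering in orderings:
        preceding = frozenset()
        for i in ordering:
            # mc_i(p_i) = v(p_i | {i}) - v(p_i), where p_i is the set preceding i.
            totals[i] += v(preceding | {i}) - v(preceding)
            preceding = preceding | {i}
    return {i: total / len(orderings) for i, total in totals.items()}

# Two-player game from the text, with v of the empty coalition assumed to be 0.
def two_player(c):
    return {0: 0, 1: 5, 2: 20}[len(c)]

# Hypothetical variant with a third player who never adds value (a dummy).
def with_dummy(c):
    return two_player(c - {3})

print(shapley_values([1, 2], two_player))     # both players receive 10
print(shapley_values([1, 2, 3], with_dummy))  # players 1 and 2 receive 10, player 3 receives 0
```

For the two-player game, the Shapley value splits the surplus of 10 equally, in contrast to the core imputation (6, 14) discussed earlier. The dummy-player variant illustrates the efficiency, symmetry, and dummy player axioms on a concrete instance.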
18.3.3 Computation in cooperative games
From a theoretical point of view, we now have a satisfactory solution. But from a computa-
tional point of view, we need to know how to compactly represent cooperative games, and
how to efficiently compute solution concepts such as the core and the Shapley value.
The obvious representation for a characteristic function would be a table listing the value v(C) for all $2^n$ coalitions. This is infeasible for large n. A number of approaches to compactly representing cooperative games have been developed, which can be distinguished by whether or not they are complete. A complete representation scheme is one that is capable of representing any cooperative game. The drawback with complete representation schemes is that there will always be some games that cannot be represented compactly. An alternative is to use a representation scheme that is guaranteed to be compact, but which is not complete.
Marginal contribution nets
We now describe one representation scheme, called marginal contribution nets (MC-nets). We will use a slightly simplified version to facilitate presentation, and the simplification makes it incomplete; the full version of MC-nets is a complete representation.
The idea behind marginal contribution nets is to represent the characteristic function of a game $(N, v)$ as a set of coalition-value rules, of the form $(C_i, x_i)$, where $C_i \subseteq N$ is a coalition and $x_i$ is a number. To compute the value of a coalition C, we simply sum the values of all rules $(C_i, x_i)$ such that $C_i \subseteq C$. Thus, given a set of rules $R = \{(C_1, x_1), \ldots, (C_k, x_k)\}$, the corresponding characteristic function is:

$$v(C) = \sum \{x_i \mid (C_i, x_i) \in R \text{ and } C_i \subseteq C\}$$

Suppose we have a rule set R containing the following three rules:

$$\{(\{1,2\}, 5),\ (\{2\}, 2),\ (\{3\}, 4)\}$$

Then, for example, we have:
• v({1}) = 0 (because no rules apply),
• v({3}) = 4 (third rule),
• v({1,3}) = 4 (third rule),
• v({2,3}) = 6 (second and third rules), and
• v({1,2,3}) = 11 (first, second, and third rules).
With this representation we can compute the Shapley value in polynomial time. The key insight is that each rule can be understood as defining a game on its own, in which the players are symmetric. By appealing to Shapley’s axioms of additivity and symmetry, therefore, the Shapley value $\phi_i(R)$ of player i in the game associated with the rule set R is then simply:

$$\phi_i(R) = \sum_{(C_j, x_j) \in R} \begin{cases} x_j / |C_j| & \text{if } i \in C_j \\ 0 & \text{otherwise} \end{cases}$$

The version of marginal contribution nets that we have presented here is not a complete representation scheme: there are games whose characteristic function cannot be represented using rule sets of the form described above. A richer type of marginal contribution networks allows for rules of the form $(\varphi, x)$, where $\varphi$ is a propositional logic formula over the players N: a coalition C satisfies the condition $\varphi$ if it corresponds to a satisfying assignment for $\varphi$. This scheme is a complete representation: in the worst case, we need a rule for every possible coalition. Moreover, the Shapley value can be computed in polynomial time with this scheme; the details are more involved than for the simple rules described above, although the basic principle is the same; see the notes at the end of the chapter for references.

Figure 18.7 The coalition structure graph for N = {1,2,3,4}. Level 1 has coalition structures containing a single coalition; level 2 has coalition structures containing two coalitions, and so on.
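The simplified rule-based scheme described above is short enough to sketch directly. The code below uses the three-rule example from the text; the function names are hypothetical, and exact fractions are used so that the per-rule shares add up without rounding.

```python
from fractions import Fraction

# Rule set from the text: each rule pairs a coalition with a value.
RULES = [(frozenset({1, 2}), 5), (frozenset({2}), 2), (frozenset({3}), 4)]

def coalition_value(coalition, rules):
    """v(C): sum the values of all rules whose coalition is contained in C."""
    return sum(x for c, x in rules if c <= frozenset(coalition))

def shapley_from_rules(player, rules):
    """Shapley value of a player: each rule containing the player adds x / |C|."""
    return sum((Fraction(x, len(c)) for c, x in rules if player in c), Fraction(0))

print(coalition_value({1}, RULES))         # 0
print(coalition_value({2, 3}, RULES))      # 6
print(coalition_value({1, 2, 3}, RULES))   # 11
print([shapley_from_rules(i, RULES) for i in (1, 2, 3)])
```

Each Shapley value is computed in a single pass over the rules. For this rule set the three values are 5/2, 9/2, and 4, which sum to v({1,2,3}) = 11, as the efficiency axiom requires.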
Coalition structures for maximum social welfare
We obtain a different perspective on cooperative games if we assume that the agents share a common purpose. For example, if we think of the agents as being workers in a company, then the strategic considerations relating to coalition formation that are addressed by the core, for example, are not relevant. Instead, we might want to organize the workforce (the agents) into teams so as to maximize their overall productivity. More generally, the task is to find a coalition structure that maximizes the social welfare of the system, defined as the sum of the values of the individual coalitions. We write the social welfare of a coalition structure CS as sw(CS), with the following definition:

$$sw(CS) = \sum_{C \in CS} v(C).$$

Then a socially optimal coalition structure CS* with respect to G maximizes this quantity. Finding a socially optimal coalition structure is a very natural computational problem, which has been studied beyond the multiagent systems community: it is sometimes called the set partitioning problem. Unfortunately, the problem is NP-hard, and the number of possible coalition structures grows exponentially in the number of players.
Finding the optimal coalition structure by naive exhaustive search is therefore infeasible in general. An influential approach to optimal coalition structure formation is based on the idea of searching a subspace of the coalition structure graph. The idea is best explained with reference to an example.
Suppose we have a game with four agents, N = {1,2,3,4}. There are fifteen possible coalition structures for this set of agents. We can organize these into a coalition structure graph as shown in Figure 18.7, where the nodes at level ℓ of the graph correspond to all the coalition structures with exactly ℓ coalitions. An upward edge in the graph represents the division of a coalition in the lower node into two separate coalitions in the upper node.
For example, there is an edge from {{1},{2,3,4}} to {{1},{2},{3,4}} because this latter coalition structure is obtained from the former by dividing the coalition {2,3,4} into the coalitions {2} and {3,4}.
The optimal coalition structure CS* lies somewhere within the coalition structure graph, and so to find this, it seems we would have to evaluate every node in the graph. But consider the bottom two rows of the graph, levels 1 and 2. Every possible coalition (excluding the empty coalition) appears in these two levels. (Of course, not every possible coalition structure appears in these two levels.) Now, suppose we restrict our search for a possible coalition structure to just these two levels; we go no higher in the graph. Let CS' be the best coalition structure that we find in these two levels, and let CS* be the best coalition structure overall. Let C* be a coalition with the highest value of all possible coalitions:

$$C^* \in \arg\max_{C \subseteq N} v(C).$$

The value of the best coalition structure we find in the first two levels of the coalition structure graph must be at least as much as the value of the best possible coalition: $sw(CS') \ge v(C^*)$. This is because every possible coalition appears in at least one coalition structure in the first two levels of the graph. So assume the worst case, that is, $sw(CS') = v(C^*)$.
Compare the value of sw(CS') to sw(CS*). Since v(C*) is the highest possible value of any coalition, and a coalition structure contains at most n coalitions when there are n agents (n = 4 in the case of Figure 18.7), the highest possible value of sw(CS*) would be $n \, v(C^*) = n \cdot sw(CS')$. In other words, in the worst possible case, the value of the best coalition structure we find in the first two levels of the graph would be $\frac{1}{n}$ of the value of the best, where n is the number of agents. Thus, although searching the first two levels of the graph is not guaranteed to give us the optimal coalition structure, it is guaranteed to give us one that is no worse than $\frac{1}{n}$ of the optimal. In practice it will often be much better than that.
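The two-level argument can be illustrated on the four-agent case by enumerating all fifteen coalition structures. This is a sketch under stated assumptions: the characteristic function below is hypothetical, chosen (with nonnegative values) so that the optimum lies above level 2, and the helper names are not from the text.

```python
def partitions(items):
    """Yield every way of splitting a list into nonempty, disjoint blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        # Put `first` into each existing block in turn ...
        for k, block in enumerate(smaller):
            yield smaller[:k] + [[first] + block] + smaller[k + 1:]
        # ... or give it a block of its own.
        yield [[first]] + smaller

def social_welfare(structure, v):
    """sw(CS): the sum of v(C) over the coalitions in the structure."""
    return sum(v(frozenset(c)) for c in structure)

# Hypothetical characteristic function: singletons are worth 3, larger
# coalitions are worth only their size, so splitting up completely is optimal.
def v(c):
    return 3 if len(c) == 1 else len(c)

agents = [1, 2, 3, 4]
structures = list(partitions(agents))
best_overall = max(social_welfare(cs, v) for cs in structures)
best_low = max(social_welfare(cs, v) for cs in structures if len(cs) <= 2)

print(len(structures))                          # 15
print(best_overall, best_low)                   # 12 6
print(best_low >= best_overall / len(agents))   # True
```

Here the best structure in levels 1 and 2 achieves half of the optimal social welfare, comfortably within the 1/n bound derived above.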
18.4 Making Collective Decisions
We will now turn from agent design to mechanism design, the problem of designing the right game for a collection of agents to play. Formally, a mechanism consists of
1. A language for describing the set of allowable strategies that agents may adopt.
2. A distinguished agent, called the center, that collects reports of strategy choices from the agents in the game. (For example, the auctioneer is the center in an auction.)
3. An outcome rule, known to all agents, that the center uses to determine the payoffs to each agent, given their strategy choices.
This section discusses some of the most important mechanisms.
Contract net protocol
18.4.1
Allocating tasks with the contract net
The contract net protocol is probably the oldest and most important multiagent problem-solving technique studied in AI. It is a high-level protocol for task sharing. As the name suggests, the contract net was inspired by the way that companies make use of contracts.
The overall contract net protocol has four main phases; see Figure 18.8. The process starts with an agent identifying the need for cooperative action with respect to some task. The need might arise because the agent does not have the capability to carry out the task in isolation, or because a cooperative solution might in some way be better (faster, more efficient, more accurate).
Figure 18.8 The contract net task allocation protocol. (Panels: problem recognition, task announcement, bidding, awarding.)
The agent advertises the task to other agents in the net with a task announcement message, and then acts as the manager of that task for its duration. The task announcement message must include sufficient information for recipients to judge whether or not they are willing and able to bid for the task. The precise information included in a task announcement will depend on the application area. It might be some code that needs to be executed; or it might be a logical specification of a goal to be achieved. The task announcement might also include other information that might be required by recipients, such as deadlines, quality-of-service requirements, and so on.
When an agent receives a task announcement, it must evaluate it with respect to its own capabilities and preferences. In particular, each agent must determine, first, whether it has the capability to carry out the task, and second, whether or not it desires to do so. On this basis, it may then submit a bid for the task. A bid will typically indicate the capabilities of the bidder that are relevant to the advertised task, and any terms and conditions under which the task will be carried out.
In general, a manager may receive multiple bids in response to a single task announcement. Based on the information in the bids, the manager selects the most appropriate agent
(or agents) to execute the task. Successful agents are notified through an award message, and
become contractors for the task, taking responsibility for the task until it is completed.
The main computational tasks required to implement the contract net protocol can be summarized as follows:
• Task announcement processing. On receipt of a task announcement, an agent decides if it wishes to bid for the advertised task.
• Bid processing. On receiving multiple bids, the manager must decide which agent to award the task to, and then award the task.
• Award processing. Successful bidders (contractors) must attempt to carry out the task, which may mean generating new subtasks, which are advertised via further task announcements.
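The three processing steps above can be sketched as a single announce, bid, award round. This is a minimal illustration, not a full contract net implementation: the `Agent` and `Bid` classes, the cost-based bid, and the "lowest cost wins" award rule are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Bid:
    bidder: str
    cost: float


class Agent:
    def __init__(self, name, skills):
        self.name = name
        self.skills = skills  # hypothetical: maps task name -> cost of doing it

    def process_announcement(self, task):
        """Task announcement processing: bid only if able to do the task."""
        cost = self.skills.get(task)
        return None if cost is None else Bid(self.name, cost)


def run_contract_net(task, agents):
    """Manager side: announce the task, collect bids, award to the cheapest."""
    bids = [agent.process_announcement(task) for agent in agents]
    bids = [bid for bid in bids if bid is not None]
    if not bids:
        return None  # nobody bid, so no contractor is appointed
    # Bid processing: assumed award rule is "lowest quoted cost wins".
    return min(bids, key=lambda bid: bid.cost).bidder


agents = [
    Agent("a", {"paint": 5}),
    Agent("b", {"paint": 3, "weld": 7}),
    Agent("c", {"weld": 4}),
]
print(run_contract_net("paint", agents))
```

A contractor that needed help could call `run_contract_net` again for each subtask, which corresponds to the "further task announcements" mentioned under award processing.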
Despite (or perhaps because of) its simplicity, the contract net is probably the most widely
implemented and best-studied framework for cooperative problem solving. It is naturally
applicable in many settings—a variation of it is enacted every time you request a car with
Uber, for example.
18.4.2 Allocating scarce resources with auctions
One of the most important problems in multiagent systems is that of allocating scarce resources; but we may as well simply say “allocating resources,” since in practice most useful resources are scarce in some sense. The auction is the most important mechanism for allocating resources. The simplest setting for an auction is where there is a single resource and there are multiple possible bidders. Each bidder i has a utility value v_i for the item.
In some cases, each bidder has a private value for the item. For example, a tacky sweater might be attractive to one bidder and valueless to another.
In other cases, such as auctioning drilling rights for an oil tract, the item has a common value: the tract will produce some amount of money, X, and all bidders value a dollar equally, but there is uncertainty as to what the actual value of X is. Different bidders have different information, and hence different estimates of the item’s true value. In either case, bidders end up with their own v_i. Given v_i, each bidder gets a chance, at the appropriate time or times in the auction, to make a bid b_i. The highest bid, b_max, wins the item, but the price paid need not be b_max; that’s part of the mechanism design.
The best-known auction mechanism is the ascending-bid auction,³ or English auction, in which the center starts by asking for a minimum (or reserve) bid b_min. If some bidder is willing to pay that amount, the center then asks for b_min + d, for some increment d, and continues up from there. The auction ends when nobody is willing to bid anymore; then the last bidder wins the item, paying the price bid.
How do we know if this is a good mechanism? One goal is to maximize expected revenue for the seller. Another goal is to maximize a notion of global utility. These goals overlap to
some extent, because one aspect of maximizing global utility is to ensure that the winner of the auction is the agent who values the item the most (and thus is willing to pay the most). We say an auction is efficient if the goods go to the agent who values them most. The ascending-bid auction is usually both efficient and revenue maximizing, but if the reserve price is set too high, the bidder who values it most may not bid, and if the reserve is set too low, the seller may get less revenue.
³ The word “auction” comes from the Latin augeo, to increase.
Probably the most important thing that an auction mechanism can do is encourage a sufficient number of bidders to enter the game and discourage them from engaging in collusion. Collusion is an unfair or illegal agreement by two or more bidders to manipulate prices. It can
happen in secret backroom deals or tacitly, within the rules of the mechanism. For example, in 1999, Germany auctioned ten blocks of cellphone spectrum with a simultaneous auction
(bids were taken on all ten blocks at the same time), using the rule that any bid must be a minimum of a 10% raise over the previous bid on a block. There were only two credible bidders, and the first, Mannesman, entered the bid of 20 million deutschmark on blocks 1-5 and 18.18 million on blocks 6-10. Why 18.18M? One of T-Mobile’s managers said they “interpreted Mannesman’s first bid as an offer.” Both parties could compute that a 10% raise on 18.18M
is 19.99M; thus Mannesman’s bid was interpreted as saying “we can each get half the blocks for 20M; let’s not spoil it by bidding the prices up higher.” And in fact T-Mobile bid 20M on blocks 6-10 and that was the end of the bidding.
The German government got less than they expected, because the two competitors were
able to use the bidding mechanism to come to a tacit agreement on how not to compete.
From the government’s point of view, a better result could have been obtained by any of these
changes to the mechanism: a higher reserve price; a sealed-bid first-price auction, so that the competitors could not communicate through their bids; or incentives to bring in a third bidder.
Perhaps the 10% rule was an error in mechanism design, because it facilitated the
precise signaling from Mannesman to T-Mobile.
In general, both the seller and the global utility function benefit if there are more bidders,
although global utility can suffer if you count the cost of wasted time of bidders that have no
chance of winning. One way to encourage more bidders is to make the mechanism easier for them. After all, if it requires too much research or computation on the part of the bidders, they may decide to take their money elsewhere. So it is desirable that the bidders have a dominant strategy. Recall that “dominant” means that the strategy works against all other strategies, which in turn means that an agent can adopt it without regard for the other strategies. An agent with a dominant strategy can just bid, without wasting time contemplating other agents’ possible strategies. A mechanism by which agents have a dominant strategy is called a strategy-proof mechanism. If, as is usually the case, that strategy involves the bidders revealing their true value, v_i, then it is called a truth-revealing, or truthful, auction; the term incentive compatible is also used. The revelation principle states that any mechanism can be transformed into an equivalent truth-revealing mechanism, so part of mechanism design is finding these equivalent mechanisms.
It turns out that the ascending-bid auction has most of the desirable properties. The bidder with the highest value v_i gets the goods at a price of b_o + d, where b_o is the highest bid among all the other agents and d is the auctioneer’s increment.⁴ Bidders have a simple dominant strategy: keep bidding as long as the current cost is below your v_i. The mechanism is not quite truth-revealing, because the winning bidder reveals only that his v_i ≥ b_o + d; we have a lower bound on v_i but not an exact amount.
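To make the dominant strategy concrete, here is a small simulation of an ascending-bid auction in which every bidder keeps bidding while the asking price does not exceed their value. It is a sketch under stated assumptions: the function name `english_auction`, the integer prices, and the tie-break (the lowest-indexed willing bidder responds first) are choices made for the example, not part of the mechanism as described.

```python
def english_auction(values, reserve, d):
    """Simulate an ascending-bid auction.

    values:  each bidder's private value v_i
    reserve: the minimum bid b_min asked by the center
    d:       the bid increment
    Returns (winner_index, price_paid), or (None, None) if nobody meets
    the reserve.
    """
    asking = reserve
    holder, paid = None, None
    while True:
        # Bidders other than the current high bidder who can afford `asking`.
        willing = [i for i, v in enumerate(values)
                   if i != holder and v >= asking]
        if not willing:
            return holder, paid
        # Assumed tie-break: the lowest-indexed willing bidder bids first.
        holder, paid = willing[0], asking
        asking += d


winner, price = english_auction([100, 70, 40], reserve=10, d=5)
print(winner, price)
```

With these example values the highest-value bidder wins, and the price ends up between the second-highest value and that value plus one increment. With a single bidder the item sells at the reserve, because nobody pushes the price up.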
A disadvantage (from the point of view of the seller) of the ascending-bid auction is that
it can discourage competition. Suppose that in a bid for cellphone spectrum there is one
advantaged company that everyone agrees would be able to leverage existing customers and
infrastructure, and thus can make a larger profit than anyone else. Potential competitors can
see that they have no chance in an ascending-bid auction, because the advantaged company can always bid higher. Thus, the competitors may not enter at all, and the advantaged company ends up winning at the reserve price.
⁴ There is actually a small chance that the agent with highest v_i fails to get the goods, in the case in which b_o < v_i < b_o + d. The chance of this can be made arbitrarily small by decreasing the increment d.
Another negative property of the English auction is its high communication costs. Either the auction takes place in one room or all bidders have to have high-speed, secure communication lines; in either case they have to have time to go through several rounds of bidding.
An alternative mechanism, which requires much less communication, is the sealed-bid auction. Each bidder makes a single bid and communicates it to the auctioneer, without the other bidders seeing it. With this mechanism, there is no longer a simple dominant strategy. If your value is v_i and you believe that the maximum of all the other agents’ bids will be b_o, then you should bid b_o + ε, for some small ε, if that is less than v_i. Thus, your bid depends on your estimation of the other agents’ bids, requiring you to do more work. Also, note that the agent with the highest v_i might not win the auction. This is offset by the fact that the auction is more competitive, reducing the bias toward an advantaged bidder.
A small change in the mechanism for sealed-bid auctions leads to the sealed-bid second-price auction, also known as a Vickrey auction.⁵ In such auctions, the winner pays the price of the second-highest bid, b_o, rather than paying his own bid. This simple modification completely eliminates the complex deliberations required for standard (or first-price) sealed-bid auctions, because the dominant strategy is now simply to bid v_i; the mechanism is truth-revealing. Note that the utility of agent i in terms of his bid b_i, his value v_i, and the best bid among the other agents, b_o, is
$$u_i = \begin{cases} (v_i - b_o) & \text{if } b_i > b_o \\ 0 & \text{otherwise.} \end{cases}$$
To see that b_i = v_i is a dominant strategy, note that when (v_i − b_o) is positive, any bid that wins the auction is optimal, and bidding v_i in particular wins the auction. On the other hand, when (v_i − b_o) is negative, any bid that loses the auction is optimal, and bidding v_i in particular loses the auction. So bidding v_i is optimal for all possible values of b_o, and in fact, v_i is the only bid that has this property. Because of its simplicity and the minimal computation requirements for both seller and bidders, the Vickrey auction is widely used in distributed AI systems.
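The dominance argument can be checked numerically with the utility function above. The sketch below tries every bid on a grid against every possible best competing bid and confirms that no deviation ever beats bidding v_i. The grid, the example value of 100, and the function names are assumptions made for illustration.

```python
def vickrey_utility(bid, value, best_other):
    """u_i from the text: (v_i - b_o) if the bid wins, and 0 otherwise."""
    return value - best_other if bid > best_other else 0


def truthful_is_dominant(value, grid):
    """True if no bid on the grid ever does strictly better than bidding `value`."""
    for best_other in grid:
        truthful = vickrey_utility(value, value, best_other)
        for bid in grid:
            if vickrey_utility(bid, value, best_other) > truthful:
                return False
    return True


grid = range(0, 201, 5)  # hypothetical range of bids and competing bids
print(truthful_is_dominant(100, grid))
```

Overbidding only matters when it wins against a competing bid above the true value, which yields negative utility; underbidding only matters when it loses an auction that would have been profitable.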
Internet search engines conduct several trillion auctions each year to sell advertisements along with their search results, and online auction sites handle $100 billion a year in goods, all using variants of the Vickrey auction. Note that the expected value to the seller is b_o, which is the same expected return as the limit of the English auction as the increment d goes to zero. This is actually a very general result: the revenue equivalence theorem states that, with a few minor caveats, any auction mechanism in which bidders have values v_i known only to themselves (but know the probability distribution from which those values are sampled), will yield the same expected revenue. This principle means that the various mechanisms are not competing on the basis of revenue generation, but rather on other qualities.
Although the second-price auction is truth-revealing, it turns out that auctioning n goods with an n + 1 price auction is not truth-revealing. Many Internet search engines use a mechanism where they auction n slots for ads on a page. The highest bidder wins the top spot, the second highest gets the second spot, and so on. Each winner pays the price bid by the next-lower bidder, with the understanding that payment is made only if the searcher actually clicks on the ad. The top slots are considered more valuable because they are more likely to be noticed and clicked on.
⁵ Named after William Vickrey (1914-1996), who won the 1996 Nobel Prize in economics for this work and died of a heart attack three days later.
Imagine that three bidders, b_1, b_2, and b_3, have valuations for a click of v_1 = 200, v_2 = 180, and v_3 = 100, and that n = 2 slots are available; and it is known that the top spot is clicked on 5% of the time and the bottom spot 2%. If all bidders bid truthfully, then b_1 wins the top slot and pays 180, and has an expected return of (200 − 180) × 0.05 = 1. The second slot goes to b_2. But b_1 can see that if she were to bid anything in the range 101-179, she would concede the top slot to b_2, win the second slot, and yield an expected return of (200 − 100) × 0.02 = 2. Thus, b_1 can double her expected return by bidding less than her true value in this case. In general, bidders in this n + 1 price auction must spend a lot of energy analyzing the bids of others to determine their best strategy; there is no simple dominant strategy.
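The arithmetic in this example is easy to reproduce. The following sketch computes a bidder's expected return in the slot auction just described, assuming there are more bidders than slots; the function name and the dictionary representation of bids are illustrative choices.

```python
def slot_auction_return(bidder, value, bids, click_rates):
    """Expected return per search for `bidder` in the n-slot auction.

    bids:        dict mapping bidder name -> bid per click
    click_rates: click probability of each slot, best slot first
    Each slot winner pays the next-lower bid, and only when clicked.
    Assumes there are more bidders than slots.
    """
    order = sorted(bids, key=bids.get, reverse=True)
    slot = order.index(bidder)
    if slot >= len(click_rates):
        return 0.0  # no slot won
    price = bids[order[slot + 1]]
    return (value - price) * click_rates[slot]


click_rates = [0.05, 0.02]
truthful = {"b1": 200, "b2": 180, "b3": 100}
shaded = {"b1": 150, "b2": 180, "b3": 100}  # b1 bids inside the 101-179 range

print(slot_auction_return("b1", 200, truthful, click_rates))
print(slot_auction_return("b1", 200, shaded, click_rates))
```

The second call gives the larger return, which is the sense in which truthful bidding is not a dominant strategy here.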
Aggarwal et al. (2006) show that there is a unique truthful auction mechanism for this multislot problem, in which the winner of slot j pays the price for slot j just for those additional clicks that are available at slot j and not at slot j + 1. The winner pays the price for the lower slot for the remaining clicks. In our example, b_1 would bid 200 truthfully, and would pay 180 for the additional 0.05 − 0.02 = 0.03 clicks in the top slot, but would pay only the cost of the bottom slot, 100, for the remaining 0.02 clicks. Thus, the total return to b_1 would be
$$(200 - 180) \times 0.03 + (200 - 100) \times 0.02 = 2.6.$$
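The same calculation can be written as a sum over the slots at or below the one won. This is a sketch of the pricing rule as described in the paragraph above, for the top-slot winner only; the function name `laddered_return` and the argument layout are assumptions.

```python
def laddered_return(value, lower_bids, click_rates):
    """Expected return for the winner of the top slot under laddered pricing.

    lower_bids[k]:  the bid immediately below slot k (the price of slot k)
    click_rates[k]: click probability of slot k, best slot first
    The extra clicks gained at slot k over slot k + 1 are charged at
    slot k's price; the click rate below the last slot is taken as 0.
    """
    rates = list(click_rates) + [0.0]
    return sum((value - lower_bids[k]) * (rates[k] - rates[k + 1])
               for k in range(len(click_rates)))


# b_1's case from the text: value 200, prices 180 and 100, rates 5% and 2%.
print(laddered_return(200, [180, 100], [0.05, 0.02]))
```

Under this rule b_1's return from truthful bidding exceeds the return of 2 from shading her bid in the previous example, so the incentive to underbid disappears in this instance.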
Another example of where auctions can come into play within AI is when a collection of agents are deciding whether to cooperate on a joint plan. Hunsberger and Grosz (2000) show that this can be accomplished efficiently with an auction in which the agents bid for roles in the joint plan.
Common goods
Now let’s consider another type of game, in which countries set their policy for controlling air pollution. Each country has a choice: they can reduce pollution at a cost of -10 points for
implementing the necessary changes, or they can continue to pollute, which gives them a net
utility of -5 (in added health costs, etc.) and also contributes -1 points to every other country
(because the air is shared across countries). Clearly, the dominant strategy for each country
is “continue to pollute,” but if there are 100 countries and each follows this policy, then each
country gets a total utility of -104, whereas if every country reduced pollution, they would
each have a utility of -10. This situation is called the tragedy of the commons: if nobody has
to pay for using a common resource, then it may be exploited in a way that leads to a lower total utility for all agents.
It is similar to the prisoner’s dilemma: there is another solution to the game that is better for all parties, but there appears to be no way for rational agents to arrive at that solution under the current game.
One approach for dealing with the tragedy of the commons is to change the mechanism to one that charges each agent for using the commons. More generally, we need to ensure that all externalities (effects on global utility that are not recognized in the individual agents’ transactions) are made explicit.
Setting the prices correctly is the difficult part. In the limit, this approach amounts to creating a mechanism in which each agent is effectively required to maximize global utility, but can do so by making a local decision. For this example, a carbon tax would be an example of a mechanism that charges for use of the commons in a way that, if implemented well, maximizes global utility.
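The numbers in the pollution game, and the effect of charging for the externality, can be verified directly. In the sketch below the payoffs come from the text, while the tax level of 99 (the total harm one polluter imposes on the other 99 countries) is an assumed, illustrative choice; what happens to the tax revenue is ignored.

```python
def country_utility(pollutes, other_polluters, tax=0):
    """Utility of one country in the pollution game.

    pollutes:        True if this country continues to pollute
    other_polluters: number of other countries that pollute
    tax:             hypothetical charge paid only by polluters
    """
    own = -5 if pollutes else -10          # own cost of each policy
    harm_from_others = other_polluters     # -1 per other polluting country
    return own - harm_from_others - (tax if pollutes else 0)


n = 100
print(country_utility(True, n - 1))    # everyone pollutes
print(country_utility(False, 0))       # everyone reduces

# Without a tax, polluting is better whatever the others do.
no_tax = all(country_utility(True, k) > country_utility(False, k)
             for k in range(n))
# With the assumed tax of 99, reducing is better whatever the others do.
with_tax = all(country_utility(True, k, tax=99) < country_utility(False, k, tax=99)
               for k in range(n))
print(no_tax, with_tax)
```

Once each polluter pays for the harm it causes, the locally best decision coincides with the globally best one, which is the point of making externalities explicit.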
It turns out there is a mechanism design, known as the Vickrey-Clarke-Groves or VCG mechanism, which has two favorable properties. First, it is utility maximizing; that is, it maximizes the global utility, which is the sum of the utilities for all parties, Σ_i v_i. Second, the mechanism is truth-revealing: the dominant strategy for all agents is to reveal their true value. There is no need for them to engage in complicated strategic bidding calculations.
We will give an example using the problem of allocating some common goods. Suppose a city decides it wants to install some free wireless Internet transceivers. However, the number of transceivers available is less than the number of neighborhoods that want them. The city wants to maximize global utility, but if it says to each neighborhood council “How much do you value a free transceiver (and by the way we will give them to the parties that value them the most)?” then each neighborhood will have an incentive to report a very high value. The VCG mechanism discourages this ploy and gives them an incentive to report their true value. It works as follows:
1. The center asks each agent to report its value for an item, v_i.
2. The center allocates the goods to a set of winners, W, to maximize Σ_{i∈W} v_i.
3. The center calculates for each winning agent how much of a loss their individual presence in the game has caused to the losers (who each got 0 utility, but could have got v_i if they were a winner).
4. Each winning agent then pays to the center a tax equal to this loss.
For example, suppose there are 3 transceivers available and 5 bidders, who bid 100, 50, 40, 20, and 10. Thus the set of 3 winners, W, are the ones who bid 100, 50, and 40 and the global utility from allocating these goods is 190. For each winner, it is the case that had they not been in the game, the bid of 20 would have been a winner. Thus, each winner pays a tax of 20 to the center.
All winners should be happy because they pay a tax that is less than their value, and all losers are as happy as they can be, because they value the goods less than the required tax. That’s why the mechanism is truth-revealing. In this example, the crucial value is 20; it would be irrational to bid above 20 if your true value was actually below 20, and vice versa. Since the crucial value could be anything (depending on the other bidders), that means that it is always irrational to bid anything other than your true value.
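The transceiver example can be reproduced with a few lines of code. This sketch covers only the simple case in the text (identical items, one item per bidder); the function name `vcg_allocate` and the neighborhood labels are hypothetical.

```python
def vcg_allocate(bids, num_items):
    """VCG for identical items, at most one per bidder.

    bids: dict mapping bidder -> reported value v_i
    Returns (winners, taxes), where taxes[w] is the loss that winner w's
    presence imposes on the other agents.
    """
    ranked = sorted(bids, key=bids.get, reverse=True)
    winners = ranked[:num_items]
    taxes = {}
    for w in winners:
        others = [v for b, v in bids.items() if b != w]
        # Total value the others would obtain if w were absent ...
        without_w = sum(sorted(others, reverse=True)[:num_items])
        # ... minus the total value they obtain with w present.
        with_w = sum(bids[x] for x in winners if x != w)
        taxes[w] = without_w - with_w
    return winners, taxes


bids = {"n1": 100, "n2": 50, "n3": 40, "n4": 20, "n5": 10}
winners, taxes = vcg_allocate(bids, num_items=3)
print(winners, taxes)
```

Each winner's tax works out to the bid of the best-placed loser, matching the tax of 20 in the example.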
The VCG mechanism is very general, and can be applied to all sorts of games, not just auctions, with a slight generalization of the mechanism described above. For example, in a combinatorial auction there are multiple different items available and each bidder can place multiple bids, each on a subset of the items. For example, in bidding on plots of land, one bidder might want either plot X or plot Y but not both; another might want any three adjacent plots, and so on. The VCG mechanism can be used to find the optimal outcome, although with 2^N subsets of N goods to contend with, the computation of the optimal outcome is NP-complete. With a few caveats the VCG mechanism is unique: every other optimal mechanism is essentially equivalent.
18.4.3 Voting
The next class of mechanisms that we look at are voting procedures, of the type that are used for political decision making in democratic societies. The study of voting procedures derives from the domain of social choice theory.
The basic setting is as follows. As usual, we have a set N = {1, ..., n} of agents, who in this section will be the voters. These voters want to make decisions with respect to a set Ω = {ω_1, ω_2, ...} of possible outcomes. In a political election, each element of Ω could stand for a different candidate winning the election.
Each voter will have preferences over Ω. These are usually expressed not as quantitative utilities but rather as qualitative comparisons: we write ω ≻_i ω′ to mean that outcome ω is ranked above outcome ω′ by agent i. In an election with three candidates, agent i might have ω_2 ≻_i ω_3 ≻_i ω_1.
The fundamental problem of social choice theory is to combine these preferences, using a social welfare function, to come up with a social preference order: a ranking of the candidates, from most preferred down to least preferred. In some cases, we are only interested in a social outcome: the most preferred outcome by the group as a whole. We will write ω ≻* ω′ to mean that ω is ranked above ω′ in the social preference order.
A simpler setting is where we are not concerned with obtaining an entire ordering of
candidates, but simply want to choose a set of winners.
A social choice function takes as
input a preference order for each voter, and produces as output a set of winners.
Democratic societies want a social outcome that reflects the preferences of the voters.
Unfortunately, this is not always straightforward. Consider Condorcet’s Paradox, a famous
example posed by the Marquis de Condorcet (1743-1794). Suppose we have three outcomes, Ω = {ω_a, ω_b, ω_c}, and three voters, N = {1, 2, 3}, with preferences as follows.
$$\begin{aligned} \omega_a \succ_1 \omega_b \succ_1 \omega_c \\ \omega_c \succ_2 \omega_a \succ_2 \omega_b \\ \omega_b \succ_3 \omega_c \succ_3 \omega_a \end{aligned} \qquad (18.2)$$
Now, suppose we have to choose one of the three candidates on the basis of these preferences. The paradox is that:
• 2/3 of the voters prefer ω_c over ω_a.
• 2/3 of the voters prefer ω_a over ω_b.
• 2/3 of the voters prefer ω_b over ω_c.
So, for each possible winner, we can point to another candidate who would be preferred by at least 2/3 of the electorate. It is obvious that in a democracy we cannot hope to make every voter happy. This demonstrates that there are scenarios in which no matter which outcome we choose, a majority of voters will prefer a different outcome. A natural question is whether there is any “good” social choice procedure that really reflects the preferences of voters. To answer this, we need to be precise about what we mean when we say that a rule is “good.” We will list some properties we would like a good social welfare function to satisfy:
• The Pareto Condition: The Pareto condition simply says that if every voter ranks ω_i above ω_j, then ω_i ≻* ω_j.
• The Condorcet Winner Condition: An outcome is said to be a Condorcet winner if a majority of voters prefer it over all other outcomes. To put it another way, a Condorcet winner is a candidate that would beat every other candidate in a pairwise election. The Condorcet winner condition says that if ω_i is a Condorcet winner, then ω_i should be ranked first.
• Independence of Irrelevant Alternatives (IIA): Suppose there are a number of candidates, including ω_i and ω_j, and voter preferences are such that ω_i ≻* ω_j. Now, suppose one voter changed their preferences in some way, but not about the relative ranking of ω_i and ω_j. The IIA condition says that ω_i ≻* ω_j should not change.
• No Dictatorships: It should not be the case that the social welfare function simply outputs one voter’s preferences and ignores all other voters.
These four conditions seem reasonable, but a fundamental theorem of social choice theory called Arrow’s theorem (due to Kenneth Arrow) tells us that it is impossible to satisfy all four conditions (for cases where there are at least three outcomes). That means that for any social choice mechanism we might care to pick, there will be some situations (perhaps unusual or pathological) that lead to controversial outcomes. However, it does not mean that democratic decision making is hopeless in most cases.
We have not yet seen any actual voting procedures, so let’s now look at some.
• With just two candidates, simple majority vote (the standard method in the US and UK) is the favored mechanism. We ask each voter which of the two candidates they prefer, and the one with the most votes is the winner.
• With more than two outcomes, plurality voting is a common system.
We ask each voter for their top choice, and select the candidate(s) (more than one in the case of ties) who get the most votes, even if nobody gets a majority. While it is common, plurality voting has been criticized for delivering unpopular outcomes. A key problem is that it only takes into account the top-ranked candidate in each voter’s preferences.
• The Borda count (after Jean-Charles de Borda, a contemporary and rival of Condorcet) is a voting procedure that takes into account all the information in a voter’s preference ordering. Suppose we have k candidates. Then for each voter i, we take their preference ordering ≻_i, and give a score of k to the top ranked candidate, a score of k − 1 to the second-ranked candidate, and so on down to the least-favored candidate in i’s ordering. The total score for each candidate is their Borda count, and to obtain the social outcome ≻*, outcomes are ordered by their Borda count, highest to lowest. One practical problem with this system is that it asks voters to express preferences on all the candidates, and some voters may only care about a subset of candidates.
• In approval voting, voters submit a subset of the candidates that they approve of. The winner(s) are those who are approved by the most voters. This system is often used when the task is to choose multiple winners.
• In instant runoff voting, voters rank all the candidates, and if a candidate has a majority of first-place votes, they are declared the winner. If not, the candidate with the fewest first-place votes is eliminated. That candidate is removed from all the preference rankings (so those voters who had the eliminated candidate as their first choice now have another candidate as their new first choice) and the process is repeated. Eventually, some candidate will have a majority of first-place votes (unless there is a tie).
• In true majority rule voting, the winner is the candidate who beats every other candidate in pairwise comparisons. Voters are asked for a full preference ranking of all candidates. We say that ω beats ω′ if more voters have ω ≻ ω′ than have ω′ ≻ ω. This system has the nice property that the majority always agrees on the winner, but it has the bad property that not every election will be decided: in the Condorcet paradox, for example, no candidate wins a majority.
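Several of the procedures above are short enough to state as code and to run on the Condorcet paradox profile of Equation (18.2). This is a minimal sketch: preference orderings are represented as tuples from most to least preferred, every voter is assumed to rank every candidate, and the function names are illustrative.

```python
from collections import Counter


def plurality(profile):
    """Candidates with the most first-place votes (a set, to allow ties)."""
    counts = Counter(ranking[0] for ranking in profile)
    top = max(counts.values())
    return {c for c, n in counts.items() if n == top}


def borda(profile):
    """Borda counts: k points for first place, k - 1 for second, and so on."""
    k = len(profile[0])
    scores = Counter()
    for ranking in profile:
        for position, candidate in enumerate(ranking):
            scores[candidate] += k - position
    return dict(scores)


def true_majority_winner(profile):
    """The candidate who beats every other pairwise, or None if there is none."""
    def beats(x, y):
        prefer_x = sum(r.index(x) < r.index(y) for r in profile)
        return prefer_x > len(profile) - prefer_x

    candidates = profile[0]
    for x in candidates:
        if all(beats(x, y) for y in candidates if y != x):
            return x
    return None


# The three voters of Equation (18.2).
paradox = [("a", "b", "c"), ("c", "a", "b"), ("b", "c", "a")]
print(plurality(paradox), borda(paradox), true_majority_winner(paradox))

# A hypothetical profile in which candidate "a" is a Condorcet winner.
decided = [("a", "b", "c"), ("a", "c", "b"), ("b", "c", "a")]
print(plurality(decided), borda(decided), true_majority_winner(decided))
```

On the paradox profile all three procedures fail to separate the candidates: plurality and Borda produce three-way ties, and true majority rule returns no winner.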
Strategic manipulation
Besides Arrow’s Theorem, another important negative result in the area of social choice theory is the Gibbard-Satterthwaite Theorem. This result relates to the circumstances under which a voter can benefit from misrepresenting their preferences.
Recall that a social choice function takes as input a preference order for each voter, and gives as output a set of winning candidates. Each voter has, of course, their own true preferences, but there is nothing in the definition of a social choice function that requires voters to report their preferences truthfully; they can declare whatever preferences they like.
In some cases, it can make sense for a voter to misrepresent their preferences. For example, in plurality voting, voters who think their preferred candidate has no chance of winning may vote for their second choice instead. That means plurality voting is a game in which voters have to think strategically (about the other voters) to maximize their expected utility.
This raises an interesting question: can we design a voting mechanism that is immune to such manipulation, that is, a mechanism that is truth-revealing? The Gibbard-Satterthwaite Theorem tells us that we cannot: any social choice function that satisfies the Pareto condition for a domain with more than two outcomes is either manipulable or a dictatorship. That is, for any “reasonable” social choice procedure, there will be some circumstances under which a voter can in principle benefit by misrepresenting their preferences. However, it does not tell us how such manipulation might be done; and it does not tell us that such manipulation is likely in practice.
18.4.4 Bargaining
Bargaining, or negotiation, is another mechanism that is used frequently in everyday life. It has been studied in game theory since the 1950s and more recently has become a task for automated agents. Bargaining is used when agents need to reach agreement on a matter of common interest. The agents make offers (also called proposals or deals) to each other under specific protocols, and either accept or reject each offer.
Bargaining with the alternating offers protocol
One influential bargaining protocol is the alternating offers bargaining model. For simplicity we’ll again assume just two agents. Bargaining takes place in a sequence of rounds. A_1 begins, at round 0, by making an offer. If A_2 accepts the offer, then the offer is implemented. If A_2 rejects the offer, then negotiation moves to the next round. This time A_2 makes an offer and A_1 chooses to accept or reject it, and so on. If the negotiation never terminates (because agents reject every offer) then we define the outcome to be the conflict deal. A convenient simplifying assumption is that both agents prefer to reach an outcome (any outcome) in finite time rather than being stuck in the infinitely time-consuming conflict deal.
We will use the scenario of dividing a pie to illustrate alternating offers. The idea is that there is some resource (the “pie”) whose value is 1, which can be divided into two parts, one part for each agent. Thus an offer in this scenario is a pair (x, 1 − x), where x is the amount of the pie that A_1 gets, and 1 − x is the amount that A_2 gets. The space of possible deals (the negotiation set) is thus:
$$\{(x, 1 - x) : 0 \le x \le 1\}.$$
[...] Agent A_1 can take this fact into account by offering (1 − γ_2, γ_2), an
offer that A_2 may as well accept because A_2 can do no better than γ_2 at this point in time. (If you are worried about what happens with ties, just make the offer be (1 − (γ_2 + ε), γ_2 + ε) for some small value of ε.) So, the two strategies of A_1 offering (1 − γ_2, γ_2), and A_2 accepting that offer are in Nash equilibrium. Patient players (those with a larger γ_2) will be able to obtain larger pieces of the pie under this protocol: in this setting, patience truly is a virtue.
Now consider the general case, where there are no bounds on the number of rounds. As
in the I-round case, A, can craft a proposal that A, should accept, because it gives A, the maximal
achievable amount, given the discount factors. It turns out that A; will get
I-m
-
and A will get the remainder. Negotiation
in task-oriented domains
In this section, we consider negotiation for task-oriented domains. In such a domain, a set of tasks must be carried out, and each task is initially assigned to a set of agents. The agents may be able to benefit by negotiating on who will carry out which tasks. For example, suppose some tasks are done on a lathe machine and others on a milling machine, and that any agent using a machine must incur a significant setup cost. Then it would make sense for one agent to offer another, "I have to set up on the milling machine anyway; how about if I do all your milling tasks, and you do all my lathe tasks?"
Unlike the bargaining scenario, we start with an initial allocation, so if the agents fail to agree on any offers, they perform the tasks $T_i^0$ that they were originally allocated.

To keep things simple, we will again assume just two agents. Let $T$ be the set of all tasks and let $(T_1^0, T_2^0)$ denote the initial allocation of tasks to the two agents at time 0. Each task in $T$ must be assigned to exactly one agent. We assume we have a cost function $c$, which for every set of tasks $T'$ gives a positive real number $c(T')$ indicating the cost to any agent of carrying out the tasks $T'$. (Assume the cost depends only on the tasks, not on the agent carrying out the task.) The cost function is monotonic (adding more tasks never reduces the cost) and the cost of doing nothing is zero: $c(\{\}) = 0$. As an example, suppose the cost of setting up the milling machine is 10 and each milling task costs 1; then the cost of a set of two milling tasks would be 12, and the cost of a set of five would be 15.
An offer of the form $(T_1, T_2)$ means that agent $i$ is committed to performing the set of tasks $T_i$, at cost $c(T_i)$. The utility to agent $i$ is the amount they have to gain from accepting the offer: the cost of doing the originally assigned set of tasks minus the cost of doing this new set of tasks:

$$U_i((T_1, T_2)) = c(T_i^0) - c(T_i).$$

An offer $(T_1, T_2)$ is individually rational if $U_i((T_1, T_2)) \ge 0$ for both agents. If a deal is not individually rational, then at least one agent can do better by simply performing the tasks it was originally allocated.
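The cost function, utility, and individual-rationality test can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive model: the encoding of a task as a `(machine, name)` pair, the helper names, and the choice to charge the same setup cost of 10 for every machine (the text gives that figure only for milling) are all hypothetical. Agents are indexed 0 and 1 rather than 1 and 2.

```python
from typing import FrozenSet, Tuple

# A task is a (machine, name) pair, e.g. ("mill", "a"). Hypothetical encoding.
Task = Tuple[str, str]
TaskSet = FrozenSet[Task]
Deal = Tuple[TaskSet, TaskSet]  # (tasks for agent 0, tasks for agent 1)

SETUP_COST = 10  # per machine used; the text states 10 for the milling machine
TASK_COST = 1    # per task; the text states 1 for a milling task


def cost(tasks: TaskSet) -> int:
    """c(T'): one setup per distinct machine plus a unit cost per task."""
    machines = {machine for machine, _ in tasks}
    return SETUP_COST * len(machines) + TASK_COST * len(tasks)


def utility(i: int, offer: Deal, initial: Deal) -> int:
    """U_i(offer) = c(T_i^0) - c(T_i): what agent i saves by accepting."""
    return cost(initial[i]) - cost(offer[i])


def individually_rational(offer: Deal, initial: Deal) -> bool:
    """True if neither agent is worse off than under the initial allocation."""
    return all(utility(i, offer, initial) >= 0 for i in (0, 1))


# Each agent starts with one milling task and one lathe task.
initial: Deal = (frozenset({("mill", "a"), ("lathe", "b")}),
                 frozenset({("mill", "c"), ("lathe", "d")}))
# Agent 0 takes all milling, agent 1 takes all lathe work.
swap: Deal = (frozenset({("mill", "a"), ("mill", "c")}),
              frozenset({("lathe", "b"), ("lathe", "d")}))
# Agent 0 hands everything to agent 1.
dump: Deal = (frozenset(), initial[0] | initial[1])

print(individually_rational(swap, initial), individually_rational(dump, initial))
```

Under these assumed costs the swap saves each agent one machine setup, while the one-sided deal leaves agent 1 worse off than its original allocation, so only the swap is individually rational.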
The negotiation set for task-oriented domains (assuming rational agents) is the set of
offers that are both individually rational and Pareto optimal. There is no sense making an
individually irrational offer that will be refused, nor in making an offer when there is a better offer that improves one agent’s utility without hurting anyone else.
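For a small task set, the negotiation set can be found by brute force: enumerate every assignment of tasks to agents, keep the individually rational deals, and discard any deal that another deal Pareto-dominates. The sketch below assumes a hypothetical cost model (a setup cost of 10 per machine used plus 1 per task) and a hypothetical `(machine, name)` task encoding; it is meant to illustrate the definition, not to be an efficient algorithm, since it examines all $2^{|T|}$ deals.

```python
from itertools import product
from typing import FrozenSet, List, Tuple

Task = Tuple[str, str]          # (machine, name); hypothetical encoding
TaskSet = FrozenSet[Task]
Deal = Tuple[TaskSet, TaskSet]  # (tasks for agent 0, tasks for agent 1)


def cost(tasks: TaskSet) -> int:
    """Assumed cost model: setup of 10 per machine used plus 1 per task."""
    return 10 * len({machine for machine, _ in tasks}) + len(tasks)


def utilities(deal: Deal, initial: Deal) -> Tuple[int, int]:
    """Savings of each agent relative to its initial allocation."""
    return (cost(initial[0]) - cost(deal[0]),
            cost(initial[1]) - cost(deal[1]))


def all_deals(all_tasks: List[Task]) -> List[Deal]:
    """Every way to give each task to exactly one agent (2^|T| deals)."""
    deals = []
    for owners in product((0, 1), repeat=len(all_tasks)):
        t0 = frozenset(t for t, o in zip(all_tasks, owners) if o == 0)
        t1 = frozenset(t for t, o in zip(all_tasks, owners) if o == 1)
        deals.append((t0, t1))
    return deals


def dominates(u: Tuple[int, int], v: Tuple[int, int]) -> bool:
    """u Pareto-dominates v: nobody is worse off and somebody is better off."""
    return u[0] >= v[0] and u[1] >= v[1] and u != v


def negotiation_set(initial: Deal) -> List[Deal]:
    """Deals that are individually rational and Pareto optimal."""
    deals = all_deals(sorted(initial[0] | initial[1]))
    utils = {d: utilities(d, initial) for d in deals}
    return [d for d in deals
            if min(utils[d]) >= 0  # individually rational
            and not any(dominates(utils[e], utils[d]) for e in deals)]


initial: Deal = (frozenset({("mill", "a"), ("lathe", "b")}),
                 frozenset({("mill", "c"), ("lathe", "d")}))
for deal in negotiation_set(initial):
    print(sorted(deal[0]), sorted(deal[1]), utilities(deal, initial))
```

In this example the only surviving deals are the two in which each agent specializes in one machine; the initial allocation itself is individually rational but is dominated, so it is excluded.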
The monotonic concession protocol

The negotiation protocol we consider for task-oriented domains is known as the monotonic concession protocol. The rules of this protocol are as follows.

• Negotiation proceeds in a series of rounds.
• On the first round, both agents simultaneously propose a deal, $D_i = (T_1, T_2)$, from the negotiation set. (This is different from the alternating offers we saw before.)
• An agreement is reached if the two agents propose deals $D_1$ and $D_2$, respectively, such that either (i) $U_1(D_2) \ge U_1(D_1)$ or (ii) $U_2(D_1) \ge U_2(D_2)$, that is, if one of the agents finds that the deal proposed by the other is at least as good as the proposal it made. If agreement is reached, then the rule for determining the agreement deal is as follows: If each agent's offer matches or exceeds that of the other agent, then one of the proposals is selected at random. If only one proposal exceeds or matches the other's proposal, then this is the agreement deal.
• If no agreement is reached, then negotiation proceeds to another round of simultaneous proposals. In round $t + 1$, each agent must either repeat the proposal from the previous round or make a concession, that is, a proposal that is more preferred by the other agent (i.e., has higher utility).
• If neither agent makes a concession, then negotiation terminates, and both agents implement the conflict deal, carrying out the tasks they were originally assigned.
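The agreement rule in the third bullet can be written as a small function. This is a sketch under assumptions: the function name, the representation of utilities as callables, and the injectable random number generator are all hypothetical choices made for illustration.

```python
import random
from typing import Callable, Optional, TypeVar

D = TypeVar("D")  # a deal; its representation is left open


def agreement_deal(
    d1: D,
    d2: D,
    u1: Callable[[D], float],
    u2: Callable[[D], float],
    rng: Optional[random.Random] = None,
) -> Optional[D]:
    """Return the agreed deal for this round, or None if there is no agreement.

    d1 and d2 are the deals proposed by agents 1 and 2; u1 and u2 are
    their utility functions.
    """
    rng = rng or random.Random()
    agent1_satisfied = u1(d2) >= u1(d1)  # condition (i)
    agent2_satisfied = u2(d1) >= u2(d2)  # condition (ii)
    if agent1_satisfied and agent2_satisfied:
        return rng.choice([d1, d2])      # both match or exceed: pick at random
    if agent1_satisfied:
        return d2                        # agent 1 is content with agent 2's deal
    if agent2_satisfied:
        return d1                        # agent 2 is content with agent 1's deal
    return None                          # negotiate another round


# Toy usage: a deal is written directly as its pair of utilities.
u1 = lambda deal: deal[0]
u2 = lambda deal: deal[1]
print(agreement_deal((8, 2), (3, 7), u1, u2))
print(agreement_deal((8, 2), (8, 7), u1, u2))
```

Passing a seeded `random.Random` makes the tie-breaking choice reproducible, which is convenient when testing a negotiation loop built on top of this check.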
Since the set of possible deals is finite, the agents cannot negotiate indefinitely: either the agents will reach agreement, or a round will occur in which neither agent concedes. However, the protocol does not guarantee that agreement will be reached quickly: since the number of possible deals is $O(2^{|T|})$, it is conceivable that negotiation will continue for a number of rounds exponential in the number of tasks to be allocated.

The Zeuthen strategy
So far, we have said nothing about how negotiation participants might or should behave when using the monotonic concession protocol for task-oriented domains. One possible strategy is the Zeuthen strategy.
The idea of the Zeuthen strategy is to measure an agent’s willingness to risk conflict.
Intuitively, an agent will be more willing to risk conflict if the difference in utility between its current proposal and the conflict deal is low.
In this case, the agent has little to lose if
negotiation fails and the conflict deal is implemented, and so is more willing to risk conflict, and less willing to concede. In contrast, if the difference between the agent's current proposal and the conflict deal is high, then the agent has more to lose from conflict and is therefore less willing to risk conflict, and thus more willing to concede.

Agent $i$'s willingness to risk conflict at round $t$, denoted $\mathit{risk}_i^t$, is measured as follows:

$$\mathit{risk}_i^t = \frac{\text{utility } i \text{ loses by conceding and accepting } j\text{'s offer}}{\text{utility } i \text{ loses by not conceding and causing conflict}}$$

Until an agreement is reached, the value of $\mathit{risk}_i^t$ will be a value between 0 and 1. Higher values of $\mathit{risk}_i^t$ (nearer to 1) indicate that $i$ has less to lose from conflict, and so is more willing to risk conflict.
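A common way to make the risk measure concrete, assumed here rather than stated in the text, is to take the conflict deal as the zero point of utility. The utility an agent loses by conceding is then the gap between its own proposal and the other agent's proposal, and the utility it loses through conflict is the whole value of its own proposal. The function name and argument names below are hypothetical.

```python
def risk(u_own_proposal: float, u_other_proposal: float) -> float:
    """Willingness of an agent to risk conflict, as a value in [0, 1].

    u_own_proposal:   the agent's utility for the deal it currently proposes
    u_other_proposal: the agent's utility for the deal the other agent proposes
    Utilities are measured relative to the conflict deal, which has utility 0.
    """
    if u_own_proposal <= 0:
        # The agent's own proposal is no better than conflict, so it has
        # nothing to lose: maximal willingness to risk conflict.
        return 1.0
    return (u_own_proposal - u_other_proposal) / u_own_proposal


# The other agent's offer is far below the agent's own proposal: high risk value.
print(risk(10, 2))
# The other agent's offer is nearly as good: low risk value.
print(risk(10, 8))
```

The result stays between 0 and 1 as long as both proposals are individually rational and the agent still prefers its own proposal, which is the situation before an agreement is reached.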
The Zeuthen strategy says that each agent's first proposal should be a deal in the negotiation set that maximizes its own utility (there may be more than one). After that, the agent who should concede on round $t$ of negotiation should be the one with the smaller value of risk: the one with the most to lose from conflict if neither concedes.
The next question to answer is how much should be conceded? The answer provided by the Zeuthen strategy is, “Just enough to change the balance of risk to the other agent.” That is, an agent should make the smallest concession that will make the other agent concede on
the next round. There is one final refinement to the Zeuthen strategy.
Suppose that at some point both
agents have equal risk. Then, according to the strategy, both should concede. But, knowing this, one agent could potentially “defect” by not conceding, and so benefit.
To avoid the
possibility of both conceding at this point, we extend the strategy by having the agents “flip a coin” to decide who should concede if ever an equal risk situation is reached.
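The concession rule, including the coin flip for equal risk, can be sketched as follows. The `risk` helper uses an assumed formalization (conflict deal at utility 0) that is not given in the text, and the function and parameter names are hypothetical.

```python
import random
from typing import Optional


def risk(u_own: float, u_other: float) -> float:
    """Assumed risk measure: share of its own proposal's utility that the
    agent would give up by accepting the other agent's proposal."""
    return 1.0 if u_own <= 0 else (u_own - u_other) / u_own


def who_concedes(
    u1_d1: float,  # agent 1's utility for its own proposal D1
    u1_d2: float,  # agent 1's utility for agent 2's proposal D2
    u2_d2: float,  # agent 2's utility for its own proposal D2
    u2_d1: float,  # agent 2's utility for agent 1's proposal D1
    rng: Optional[random.Random] = None,
) -> int:
    """Return 1 or 2: the agent that should concede in this round."""
    rng = rng or random.Random()
    r1 = risk(u1_d1, u1_d2)
    r2 = risk(u2_d2, u2_d1)
    if r1 < r2:
        return 1                   # agent 1 has more to lose from conflict
    if r2 < r1:
        return 2                   # agent 2 has more to lose from conflict
    return rng.choice([1, 2])      # equal risk: flip a coin


# Agent 1 would give up 80% of its utility by conceding, agent 2 only 20%.
print(who_concedes(10, 2, 5, 4))
```

The tie test compares floating-point values exactly, which is adequate for a sketch; a real implementation might use exact rational arithmetic or a tolerance.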
With this strategy, agreement will be Pareto optimal and individually rational. However, since the space of possible deals is exponential in the number of tasks, following this strategy may require $O(2^{|T|})$ computations of the cost function at each negotiation step. Finally, the Zeuthen strategy (with the coin flipping rule) is in Nash equilibrium.
Summary

• Multiagent planning is necessary when there are other agents in the environment with which to cooperate or compete. Joint plans can be constructed, but must be augmented with some form of coordination if two agents are to agree on which joint plan to execute.
• Game theory describes rational behavior for agents in situations in which multiple agents interact. Game theory is to multiagent decision making as decision theory is to single-agent decision making.
• Solution concepts in game theory are intended to characterize rational outcomes of a game: outcomes that might occur if every agent acted rationally.
• Non-cooperative game theory assumes that agents must make their decisions independently. Nash equilibrium is the most important solution concept in non-cooperative game theory. A Nash equilibrium is a strategy profile in which no agent has an incentive to deviate from its specified strategy. We have techniques for dealing with repeated games and sequential games.
• Cooperative game theory considers settings in which agents can make binding agreements to form coalitions in order to cooperate. Solution concepts in cooperative game theory attempt to formulate which coalitions are stable (the core) and how to fairly divide the value that a coalition obtains (the Shapley value).
• Specialized techniques are available for certain important classes of multiagent decision: the contract net for task sharing; auctions for efficiently allocating scarce resources; bargaining for reaching agreements on matters of common interest; and voting procedures for aggregating preferences.
Bibliographical and Historical Notes

It is a curiosity of the field that researchers in AI did not begin to seriously consider the issues surrounding interacting agents until the 1980s, and the multiagent systems field did not really become established as a distinctive subdiscipline of AI until a decade later. Nevertheless, ideas that hint at multiagent systems were present in the 1970s. For example, in his highly influential Society of Mind theory, Marvin Minsky (1986, 2007) proposed that human minds are constructed from an ensemble of agents. Doug Lenat had similar ideas in a framework he called BEINGS (Lenat, 1975). In the 1970s, building on his PhD work on the PLANNER system, Carl Hewitt proposed a model of computation as interacting agents called the actor model, which has become established as one of the fundamental models in concurrent computation (Hewitt, 1977; Agha, 1986).
The prehistory of the multiagent systems field is thoroughly documented in a collection of papers entitled Readings in Distributed Artificial Intelligence (Bond and Gasser, 1988). The collection is prefaced with a detailed statement of the key research challenges in multiagent systems, which remains remarkably relevant today, more than thirty years after it was written. Early research on multiagent systems tended to assume that all agents in a system were acting with common interest, with a single designer. This is now recognized as a special case of the more general multiagent setting; the special case is known as cooperative distributed problem solving. A key system of this time was the Distributed Vehicle Monitoring Testbed (DVMT), developed under the supervision of Victor Lesser at the University of Massachusetts (Lesser and Corkill, 1988). The DVMT modeled a scenario in which a collection of geographically distributed acoustic sensor agents cooperate to track the movement of vehicles.
The contemporary era of multiagent systems research began in the late 1980s, when it was widely realized that agents with differing preferences are the norm in AI and society; from this point on, game theory began to be established as the main methodology for studying such agents.

Multiagent planning has leaped in popularity in recent years, although it does have a long history. Konolige (1982) formalizes multiagent planning in first-order logic, while Pednault (1986) gives a STRIPS-style description.
The notion of joint intention, which is essential if
agents are to execute a joint plan, comes from work on communicative acts (Cohen and Perrault, 1979; Cohen and Levesque,
1990; Cohen et al., 1990). Boutilier and Brafman (2001)
show how to adapt partial-order planning to a multiactor setting.
Brafman and Domshlak
(2008) devise a multiactor planning algorithm whose complexity grows only linearly with
the number of actors, provided that the degree of coupling (measured partly by the tree width
of the graph of interactions among agents) is bounded.
Multiagent planning is hardest when there are adversarial agents.
As Jean-Paul Sartre
(1960) said, “In a football match, everything is complicated by the presence of the other team.” General Dwight D. Eisenhower said, “In preparing for battle I have always found that plans are useless, but planning is indispensable,” meaning that it is important to have a conditional plan or policy, and not to expect an unconditional plan to succeed.
The topic of distributed and multiagent reinforcement learning (RL) was not covered in
this chapter but is of great current interest. In distributed RL, the aim is to devise methods by
which multiple, coordinated agents learn to optimize a common utility function. For example,
can we devise methods whereby separate subagents for robot navigation and robot obstacle avoidance could cooperatively achieve a combined control system that is globally optimal?
Some basic results in this direction have been obtained (Guestrin et al., 2002; Russell and Zimdars, 2003). The basic idea is that each subagent learns its own Q-function (a kind of
utility function; see Section 22.3.3) from its own stream of rewards. For example, a robot-navigation component can receive rewards for making progress towards the goal, while the obstacle-avoidance component receives negative rewards for every collision. Each global decision maximizes the sum of Q-functions and the whole process converges to globally optimal solutions.

The roots of game theory can be traced back to proposals made in the 17th century by Christiaan Huygens and Gottfried Leibniz to study competitive and cooperative human interactions scientifically and mathematically. Throughout the 19th century, several leading economists created simple mathematical examples to analyze particular examples of competitive situations. The first formal results in game theory are due to Zermelo (1913) (who had, the year before, suggested a form of minimax search for games, albeit an incorrect one). Emile Borel (1921) introduced the notion of a mixed strategy. John von Neumann (1928) proved that
every two-person, zero-sum game has a maximin equilibrium in mixed strategies and a well-defined value. Von Neumann's collaboration with the economist Oskar Morgenstern led to the publication in 1944 of the Theory of Games and Economic Behavior, the defining book for game theory. Publication of the book was delayed by the wartime paper shortage until a member of the Rockefeller family personally subsidized its publication.

In 1950, at the age of 21, John Nash published his ideas concerning equilibria in general (non-zero-sum) games. His definition of an equilibrium solution, although anticipated in the work of Cournot (1838), became known as Nash equilibrium. After a long delay because of the schizophrenia he suffered from 1959 onward, Nash was awarded the Nobel Memorial Prize in Economics (along with Reinhard Selten and John Harsanyi) in 1994. The Bayes-Nash equilibrium is described by Harsanyi (1967) and discussed by Kadane and Larkey (1982). Some issues in the use of game theory for agent control are covered by Binmore (1982). Aumann and Brandenburger (1995) show how different equilibria can be reached depending on the knowledge each player has.
The prisoner’s dilemma was invented as a classroom exercise by Albert W. Tucker in 1950 (based on an example by Merrill Flood and Melvin Dresher) and is covered extensively
by Axelrod (1985) and Poundstone (1993). Repeated games were introduced by Luce and
Raiffa (1957), and Abreu and Rubinstein (1988) discuss the use of finite state machines for
repeated games—technically, Moore machines. The text by Mailath and Samuelson (2006) concentrates on repeated games. Games of partial information in extensive form were introduced by Kuhn (1953).
The sequence form for partial-information games was invented by Romanovskii (1962) and independently by Koller et al. (1996); the paper by Koller and Pfeffer (1997) provides a readable introduction to the field and describes a system for representing and solving sequential games.

The use of abstraction to reduce a game tree to a size that can be solved with Koller's technique was introduced by Billings et al. (2003). Subsequently, improved methods for equilibrium-finding enabled solution of abstractions with $10^{12}$ states (Gilpin et al., 2008; Zinkevich et al., 2008). Bowling et al. (2008) show how to use importance sampling to get a better estimate of the value of a strategy. Waugh et al. (2009) found that the abstraction approach is vulnerable to making systematic errors in approximating the equilibrium solution: it works for some games but not others. Brown and Sandholm (2019) showed that, at least in the case of multiplayer Texas hold 'em poker, these vulnerabilities can be overcome by sufficient computing power. They used a 64-core server running for 8 days to compute a baseline strategy for their Pluribus program. With that strategy they were able to defeat human champion opponents.
Game theory and MDPs are combined in the theory of Markov games, also called stochastic games (Littman, 1994; Hu and Wellman, 1998). Shapley (1953b) actually described the
value iteration algorithm independently of Bellman, but his results were not widely appre-
ciated, perhaps because they were presented in the context of Markov games. Evolutionary
game theory (Smith, 1982; Weibull, 1995) looks at strategy drift over time: if your opponent’s strategy is changing, how should you react?
Textbooks on game theory from an economics point of view include those by Myerson
(1991), Fudenberg and Tirole (1991), Osborne (2004), and Osborne and Rubinstein (1994).
From an AI perspective we have Nisan et al. (2007) and Leyton-Brown and Shoham (2008). See (Sandholm, 1999) for a useful survey of multiagent decision making.

Multiagent RL is distinguished from distributed RL by the presence of agents who cannot
coordinate their actions (except by explicit communicative acts) and who may not share the
same utility function. Thus, multiagent RL deals with sequential game-theoretic problems or
Markov games, as defined in Chapter 17. What causes problems is the fact that, while an
agent is learning to defeat its opponent’s policy, the opponent is changing its policy to defeat the agent. Thus, the environment is nonstationary (see page 444).
Littman (1994) noted this difficulty when introducing the first RL algorithms for zero-sum Markov games. Hu and Wellman (2003) present a Q-learning algorithm for general-sum games that converges when the Nash equilibrium is unique; when there are multiple equilibria, the notion of convergence is not so easy to define (Shoham et al., 2004).

Assistance games were introduced under the heading of cooperative inverse reinforcement learning by Hadfield-Menell et al. (2017a). Malik et al. (2018) introduced an efficient POMDP solver designed specifically for assistance games. Assistance games are related to principal-agent games in economics, in which a principal (e.g., an employer) and an agent (e.g., an employee) need to find a mutually beneficial arrangement despite having widely different preferences. The primary differences are that (1) the robot has no preferences of its own, and (2) the robot is uncertain about the human preferences it needs to optimize.
Cooperative games were first studied by von Neumann and Morgenstern (1944). The notion of the core was introduced by Donald Gillies (1959), and the Shapley value by Lloyd Shapley (1953a). A good introduction to the mathematics of cooperative games is Peleg and Sudhölter (2002). Simple games in general are discussed in detail by Taylor and Zwicker (1999). For an introduction to the computational aspects of cooperative game theory, see Chalkiadakis et al. (2011).

Many compact representation schemes for cooperative games have been developed over the past three decades, starting with the work of Deng and Papadimitriou (1994). The most influential of these schemes is the marginal contribution networks model, which was introduced by Ieong and Shoham (2005). The approach to coalition formation that we describe was developed by Sandholm et al. (1999); Rahwan et al. (2015) survey the state of the art.
The contract net protocol was introduced by Reid Smith for his PhD work at Stanford University in the late 1970s (Smith, 1980). The protocol seems to be so natural that it is regularly reinvented to the present day. The economic foundations of the protocol were studied by Sandholm (1993).

Auctions and mechanism design have been mainstream topics in computer science and AI for several decades: see Nisan (2007) for a mainstream computer science perspective, Krishna (2002) for an introduction to the theory of auctions, and Cramton et al. (2006) for a collection of articles on computational aspects of auctions.

The 2007 Nobel Memorial Prize in Economics went to Hurwicz, Maskin, and Myerson
“for having laid the foundations of mechanism design theory” (Hurwicz, 1973). The tragedy of the commons, a motivating problem for the field, was analyzed by William Lloyd (1833) but named and brought to public attention by Garrett Hardin (1968). Ronald Coase presented
a theorem that if resources are subject to private ownership and if transaction costs are low
enough, then the resources will be managed efficiently (Coase, 1960). He points out that, in practice, transaction costs are high, so this theorem does not apply, and we should look to other solutions beyond privatization and the marketplace. Elinor Ostrom’s Governing the Commons (1990) described solutions for the problem based on placing management control over the resources into the hands of the local people who have the most knowledge of the situation. Both Coase and Ostrom won the Nobel Prize in economics for their work.
The revelation principle is due to Myerson (1986), and the revenue equivalence theorem was developed independently by Myerson (1981) and Riley and Samuelson (1981). Two
economists, Milgrom (1997) and Klemperer (2002), write about the multibillion-dollar spectrum auctions they were involved in.

Mechanism design is used in multiagent planning (Hunsberger and Grosz, 2000; Stone et al., 2009) and scheduling (Rassenti et al., 1982). Varian (1995) gives a brief overview with connections to the computer science literature, and Rosenschein and Zlotkin (1994) present a book-length treatment with applications to distributed AI. Related work on distributed AI goes under several names, including collective intelligence (Tumer and Wolpert, 2000; Segaran, 2007) and market-based control (Clearwater, 1996). Since 2001 there has been an annual Trading Agents Competition (TAC), in which agents try to make the best profit on a series of auctions (Wellman et al., 2001; Arunachalam and Sadeh, 2005).
The social choice literature is enormous, and spans the gulf from philosophical considerations on the nature of democracy through to highly technical analyses of specific voting procedures. Campbell and Kelly (2002) provide a good starting point for this literature. The Handbook of Computational Social Choice provides a range of articles surveying research topics and methods in this field (Brandt et al., 2016). Arrow's theorem lists desired properties of a voting system and proves that it is impossible to achieve all of them (Arrow, 1951). Dasgupta and Maskin (2008) show that majority rule (not plurality rule, and not ranked choice voting) is the most robust voting system. The computational complexity of manipulating elections was first studied by Bartholdi et al. (1989).
We have barely skimmed the surface of work on negotiation in multiagent planning. Durfee and Lesser (1989) discuss how tasks can be shared out among agents by negotiation. Kraus et al. (1991) describe a system for playing Diplomacy, a board game requiring negotiation, coalition formation, and dishonesty. Stone (2000) shows how agents can cooperate as teammates in the competitive, dynamic, partially observable environment of robotic soccer. In a later article, Stone (2003) analyzes two competitive multiagent environments (RoboCup, a robotic soccer competition, and TAC, the auction-based Trading Agents Competition) and finds that the computational intractability of our current theoretically well-founded approaches has led to many multiagent systems being designed by ad hoc methods. Sarit Kraus has developed a number of agents that can negotiate with humans and other agents; see Kraus (2001) for a survey. The monotonic concession protocol for automated negotiation was proposed by Jeffrey S. Rosenschein and his students (Rosenschein and Zlotkin, 1994). The alternating offers protocol was developed by Rubinstein (1982).

Books on multiagent systems include those by Weiss (2000a), Young (2004), Vlassis (2008), Shoham and Leyton-Brown (2009), and Wooldridge (2009). The primary conference for multiagent systems is the International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS); there is also a journal by the same name. The ACM Conference on Electronic Commerce (EC) also publishes many relevant papers, particularly in the area of auction algorithms. The principal journal for game theory is Games and Economic Behavior.
19
LEARNING FROM EXAMPLES

In which we describe agents that can improve their behavior through diligent study of past experiences and predictions about the future.
An agent is learning if it improves its performance after making observations about the world.
Learning can range from the trivial, such as jotting down a shopping list, to the profound, as when Albert Einstein inferred a new theory of the universe. When the agent is a computer, we call it machine learning:
a computer observes some data, builds a model based on the
data, and uses the model as both a hypothesis about the world and a piece of software that
can solve problems.
Why would we want a machine to learn? Why not just program it the right way to begin with? There are two main reasons.
First, the designers cannot anticipate all possible future
situations. For example, a robot designed to navigate mazes must learn the layout of each new
maze it encounters; a program for predicting stock market prices must learn to adapt when
conditions change from boom to bust. Second, sometimes the designers have no idea how
to program a solution themselves. Most people are good at recognizing the faces of family
members, but they do it subconsciously, so even the best programmers don’t know how to
program a computer to accomplish that task, except by using machine learning algorithms.
In this chapter, we interleave a discussion of various model classes (decision trees in Section 19.3, linear models in Section 19.6, nonparametric models such as nearest neighbors in Section 19.7, and ensemble models such as random forests in Section 19.8) with practical advice on building machine learning systems (Section 19.9), and discussion of the theory of machine learning (Sections 19.1 to 19.5).
19.1 Forms of Learning

Any component of an agent program can be improved by machine learning. The improvements, and the techniques used to make them, depend on these factors:

• Which component is to be improved.
• What prior knowledge the agent has, which influences the model it builds.
• What data and feedback on that data is available.

Chapter 2 described several agent designs. The components of these agents include:

1. A direct mapping from conditions on the current state to actions.
2. A means to infer relevant properties of the world from the percept sequence.
3. Information about the way the world evolves and about the results of possible actions the agent can take.
4. Utility information indicating the desirability of world states.
5. Action-value information indicating the desirability of actions.
6. Goals that describe the most desirable states.
7. A problem generator, critic, and learning element that enable the system to improve.

Each of these components can be learned. Consider a self-driving car agent that learns by observing a human driver. Every time the driver brakes, the agent might learn a condition-action rule for when to brake (component 1). By seeing many camera images that it is told contain buses, it can learn to recognize them (component 2).
By trying actions and ob-
serving the results—for example, braking hard on a wet road—it can learn the effects of its actions (component 3). Then, when it receives complaints from passengers who have been thoroughly shaken up during the trip, it can learn a useful component of its overall utility function (component 4).
The technology of machine learning has become a standard part of software engineering.
Any time you are building a software system, even if you don’t think of it as an AI agent,
components of the system can potentially be improved with machine learning. For example,
software to analyze images of galaxies under gravitational lensing was speeded up by a factor
of 10 million with a machine-learned model (Hezaveh et al., 2017), and energy use for cooling data centers was reduced by 40% with another machine-learned model (Gao, 2014). Turing Award winner David Patterson and Google AI head Jeff Dean declared the dawn of a "Golden Age" for computer architecture due to machine learning (Dean et al., 2018).

We have seen several examples of models for agent components: atomic, factored, and relational models based on logic or probability, and so on. Learning algorithms have been devised for all of these.
This chapter assumes little prior knowledge on the part of the agent: it starts from scratch
and learns from the data. In Section 21.7.2 we consider transfer learning, in which knowl-
edge from one domain is transferred to a new domain, so that learning can proceed faster with less data.
We do assume, however, that the designer of the system chooses a model
framework that can lead to effective learning.
Going from a specific set of observations to a general rule is called induction; from the
observations that the sun rose every day in the past, we induce that the sun will come up tomorrow.
This differs from the deduction we studied in Chapter 7 because the inductive
conclusions may be incorrect, whereas deductive conclusions are guaranteed to be correct if
the premises are correct. This chapter concentrates on problems where the input is a factored representation—a vector of attribute values.
It is also possible for the input to be any kind of data structure,
including atomic and relational.
When the output is one of a finite set of values (such as sunny/cloudy/rainy or true/false), the learning problem is called classification. When it is a number (such as tomorrow's temperature, measured either as an integer or a real number), the learning problem has the (admittedly obscure¹) name regression.

¹ A better name would have been function approximation or numeric prediction. But in 1886 Francis Galton wrote an influential article on the concept of regression to the mean (e.g., the children of tall parents are likely to be taller than average, but not as tall as the parents). Galton showed plots with what he called "regression lines," and readers came to associate the word "regression" with the statistical technique of function approximation rather than with the topic of regression to the mean.
There are three types of feedback that can accompany the inputs, and that determine the
three main types of learning:
• In supervised learning the agent observes input-output pairs and learns a function that maps from input to output. For example, the inputs could be camera images, each one accompanied by an output saying “bus” or “pedestrian,” etc. An output like this is called a label. The agent learns a function that, when given a new image, predicts the appropriate label. In the case of braking actions (component 1 above), an input is the current state (speed and direction of the car, road condition), and an output is the distance it took to stop. In this case a set of output values can be obtained by the agent from its own percepts (after the fact); the environment is the teacher, and the agent learns a function that maps states to stopping distance.
• In unsupervised learning the agent learns patterns in the input without any explicit feedback. The most common unsupervised learning task is clustering: detecting potentially useful clusters of input examples. For example, when shown millions of images taken from the Internet, a computer vision system can identify a large cluster of similar images which an English speaker would call “cats.”
• In reinforcement learning the agent learns from a series of reinforcements: rewards and punishments. For example, at the end of a chess game the agent is told that it has won (a reward) or lost (a punishment). It is up to the agent to decide which of the actions prior to the reinforcement were most responsible for it, and to alter its actions to aim towards more rewards in the future.
19.2
Supervised Learning
More formally, the task of supervised learning is this: Given a training set of N example input-output pairs
\[ (x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N) , \]
where each pair was generated by an unknown function y = f(x), discover a function h that approximates the true function f.
The function h is called a hypothesis about the world. It is drawn from a hypothesis space \(\mathcal{H}\) of possible functions. For example, the hypothesis space might be the set of polynomials of degree 3; or the set of Javascript functions; or the set of 3-SAT Boolean logic formulas.
With alternative vocabulary, we can say that h is a model of the data, drawn from a model class \(\mathcal{H}\), or we can say a function drawn from a function class. We call the output \(y_i\) the ground truth: the true answer we are asking our model to predict.
How do we choose a hypothesis space? We might have some prior knowledge about the process that generated the data. If not, we can perform exploratory data analysis: examining the data with statistical tests and visualizations (histograms, scatter plots, box plots) to get a feel for the data, and some insight into what hypothesis space might be appropriate. Or we can just try multiple hypothesis spaces and evaluate which one works best.
How do we choose a good hypothesis from within the hypothesis space? We could hope
for a consistent hypothesis: an h such that each \(x_i\) in the training set has \(h(x_i) = y_i\). With
continuous-valued outputs we can’t expect an exact match to the ground truth; instead we
look for a best-fit function for which each \(h(x_i)\) is close to \(y_i\) (in a way that we will formalize in Section 19.4.2).
[Figure 19.1: two rows of four plots. Columns: Linear, Sinusoidal, Piecewise linear, Degree-12 polynomial. Rows: Data set 1, Data set 2.]
Figure 19.1 Finding hypotheses to fit data. Top row: four plots of best-fit functions from four different hypothesis spaces trained on data set 1. Bottom row: the same four functions, but trained on a slightly different data set (sampled from the same f(x) function).
The true measure of a hypothesis is not how it does on the training set, but rather how
well it handles inputs it has not yet seen. We can evaluate that with a second sample of \((x_i, y_i)\) pairs called a test set. We say that h generalizes well if it accurately predicts the outputs of the test set.
Figure 19.1 shows that the function h that a learning algorithm discovers depends on the hypothesis space \(\mathcal{H}\) it considers and on the training set it is given. Each of the four plots in the top row has the same training set of 13 data points in the (x, y) plane. The four plots in the bottom row have a second set of 13 data points; both sets are representative of the
same unknown function f(x). Each column shows the best-fit hypothesis h from a different hypothesis space:
• Column 1: Straight lines; functions of the form \(h(x) = w_1 x + w_0\). There is no line that would be a consistent hypothesis for the data points.
• Column 2: Sinusoidal functions of the form \(h(x) = w_1 x + \sin(w_0 x)\). This choice is not quite consistent, but fits both data sets very well.
• Column 3: Piecewise-linear functions where each line segment connects the dots from one data point to the next. These functions are always consistent.
• Column 4: Degree-12 polynomials, \(h(x) = \sum_{i=0}^{12} w_i x^i\). These are consistent: we can always get a degree-12 polynomial to perfectly fit 13 distinct points. But just because the hypothesis is consistent does not mean it is a good guess.
One way to analyze hypothesis spaces is by the bias they impose (regardless of the training data set) and the variance they produce (from one training set to another).
By bias we mean (loosely) the tendency of a predictive hypothesis to deviate from the
expected value when averaged over different training sets. Bias often results from restrictions
imposed by the hypothesis space. For example, the hypothesis space of linear functions
induces a strong bias: it only allows functions consisting of straight lines. If there are any
patterns in the data other than the overall slope of a line, a linear function will not be able
to represent those patterns. We say that a hypothesis is underfitting when it fails to find a pattern in the data. On the other hand, the piecewise linear function has low bias; the shape of the function is driven by the data.
By variance we mean the amount of change in the hypothesis due to fluctuation in the
training data. The two rows of Figure 19.1 represent data sets that were each sampled from the same f(x) function. The data sets turned out to be slightly different. For the first three
columns, the small difference in the data set translates into a small difference in the hypothesis.
We call that low variance. But the degree-12 polynomials in the fourth column have high
variance: look how different the two functions are at both ends of the x-axis. Clearly, at least
one of these polynomials must be a poor approximation to the true f(x). We say a function
is overfitting the data when it pays too much attention to the particular data set it is trained
on, causing it to perform poorly on unseen data.
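The contrast between a high-bias and a high-variance hypothesis space can be reproduced in a few lines. The following Python sketch is only an illustration under stated assumptions: the target function `true_f`, the noise level, the random seed, and the probe point are all made up, and the degree-12 fit is obtained by Lagrange interpolation, which is one way (not necessarily the one used for Figure 19.1) to get the unique degree-12 polynomial through 13 distinct points.

```python
import math
import random

def true_f(x):
    # Assumed target function, chosen only for this illustration.
    return x + math.sin(3 * x)

def sample_outputs(xs, rng, noise=0.3):
    # One training set: the same inputs, fresh Gaussian noise on every output.
    return [true_f(x) + rng.gauss(0, noise) for x in xs]

def fit_line(xs, ys):
    # Least-squares straight line h(x) = w1*x + w0 (closed form).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    w0 = my - w1 * mx
    return lambda x: w1 * x + w0

def fit_interpolating_polynomial(xs, ys):
    # Lagrange form of the degree-(n-1) polynomial through n distinct points.
    def h(x):
        total = 0.0
        for j, (xj, yj) in enumerate(zip(xs, ys)):
            term = yj
            for k, xk in enumerate(xs):
                if k != j:
                    term *= (x - xk) / (xj - xk)
            total += term
        return total
    return h

rng = random.Random(0)                      # fixed seed so runs are repeatable
xs = [4 * i / 12 for i in range(13)]        # 13 distinct inputs
ys1 = sample_outputs(xs, rng)               # "data set 1"
ys2 = sample_outputs(xs, rng)               # "data set 2", same f, different noise

probe = (xs[0] + xs[1]) / 2                 # a point between two training inputs, near one end
for name, fit in [("linear", fit_line), ("degree-12 polynomial", fit_interpolating_polynomial)]:
    h1, h2 = fit(xs, ys1), fit(xs, ys2)
    worst_training_error = max(abs(h1(x) - y) for x, y in zip(xs, ys1))
    change = abs(h1(probe) - h2(probe))     # how much the hypothesis moved between data sets
    print(f"{name}: worst training error {worst_training_error:.3f}, change at probe {change:.3f}")
```

The worst training error shows whether a hypothesis is consistent; the change at the probe point is a crude, single-point stand-in for variance, so it should be read as a hint rather than a measurement.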
Often there is a bias-variance tradeoff: a choice between more complex, low-bias hypotheses
that fit the training data well and simpler, low-variance hypotheses that may generalize better.
Albert Einstein said in 1933, “the supreme goal of all theory is to make the
irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience.” In other words, Einstein recommends choosing the simplest hypothesis that matches the data. This principle can be traced further back to the 14th-century English philosopher William of Ockham.² His principle that “plurality [of entities] should not be posited without necessity” is called Ockham’s razor because it is used to “shave off” dubious explanations.
Defining simplicity is not easy. It seems clear that a polynomial with only two parameters is simpler than one with thirteen parameters. We will make this intuition more precise in
Section 19.3.4. However, in Chapter 21 we will see that deep neural network models can
often generalize quite well, even though they are very complex: some of them have billions of parameters. So the number of parameters by itself is not a good measure of a model’s fitness. Perhaps we should be aiming for “appropriateness,” not “simplicity” in a model class. We will consider this issue in Section 19.4.1.
Which hypothesis is best in Figure 19.1? We can’t be certain. If we knew the data
represented, say, the number of hits to a Web site that grows from day to day, but also cycles depending on the time of day, then we might favor the sinusoidal function. If we knew the
data was definitely not cyclic but had high noise, that would favor the linear function.
In some cases, an analyst is willing to say not just that a hypothesis is possible or impossible,
but rather how probable it is. Supervised learning can be done by choosing the hypothesis h* that is most probable given the data:
\[ h^* = \operatorname*{argmax}_{h \in \mathcal{H}} P(h \mid \mathit{data}) . \]
By Bayes’ rule this is equivalent to
\[ h^* = \operatorname*{argmax}_{h \in \mathcal{H}} P(\mathit{data} \mid h)\, P(h) . \]
2 The name is often misspelled as “Occam.”
Then we can say that the prior probability P(h) is high for a smooth degree-1 or -2 polynomial and lower for a degree-12 polynomial with large, sharp spikes. We allow unusual-looking functions when the data say we really need them, but we discourage them by giving them a low prior probability.
Why not let \(\mathcal{H}\) be the class of all computer programs, or all Turing machines? The
problem is that there is a tradeoff between the expressiveness of a hypothesis space and the
computational complexity of finding a good hypothesis within that space. For example, fitting a straight line to data is an easy computation; fitting high-degree polynomials is somewhat harder; and fitting Turing machines is undecidable. A second reason to prefer simple hypothesis spaces is that presumably we will want to use h after we have learned it, and computing h(x) when h is a linear function is guaranteed to be fast, while computing an arbitrary Turing machine program is not even guaranteed to terminate.
For these reasons, most work on learning has focused on simple representations. In recent
years there has been great interest in deep learning (Chapter 21), where representations are
not simple but where the h(x) computation still takes only a bounded number of steps to
compute with appropriate hardware.
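The rule of choosing the most probable hypothesis, h* = argmax P(data | h) P(h), can be tried out on a tiny hypothesis space. The sketch below is a hypothetical example, not one from this chapter: the two hypotheses are candidate probabilities that a coin lands heads, and the prior values and the observed counts are invented for illustration. It shows a case where the prior changes which hypothesis wins.

```python
# Hypothetical hypothesis space: each hypothesis is a probability of heads.
hypotheses = {"fair": 0.5, "biased": 0.9}

# Assumed prior: we strongly expect an ordinary coin.
prior = {"fair": 0.9, "biased": 0.1}

# Assumed data: 8 heads and 2 tails in 10 independent flips.
heads, tails = 8, 2

def likelihood(p_heads):
    # P(data | h) for independent flips.
    return p_heads ** heads * (1 - p_heads) ** tails

# Maximum-likelihood choice ignores the prior.
h_ml = max(hypotheses, key=lambda h: likelihood(hypotheses[h]))

# Most probable hypothesis given the data: maximize P(data | h) * P(h).
h_map = max(hypotheses, key=lambda h: likelihood(hypotheses[h]) * prior[h])

print("maximum likelihood:", h_ml)
print("most probable given data:", h_map)
```

With these made-up numbers the data alone favor the biased coin, but the low prior on that hypothesis is enough to keep the fair coin as the most probable explanation; more data of the same kind would eventually overturn the prior.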
We will see that the expressiveness–complexity tradeoff is not simple: it is often the case,
as we saw with first-order logic in Chapter 8, that an expressive language makes it possible for a simple hypothesis to fit the data, whereas restricting the expressiveness of the language
means that any consistent hypothesis must be complex.
19.2.1
Example problem: Restaurant waiting
We will describe a sample supervised learning problem in detail: the problem of deciding
whether to wait for a table at a restaurant. This problem will be used throughout the chapter
to demonstrate different model classes. For this problem the output, y, is a Boolean variable
that we will call WillWait; it is true for examples where we do wait for a table. The input, x,
is a vector of ten attribute values, each of which has discrete values:
1. Alternate: whether there is a suitable alternative restaurant nearby.
2. Bar: whether the restaurant has a comfortable bar area to wait in.
3. Fri/Sat: true on Fridays and Saturdays.
4. Hungry: whether we are hungry right now.
5. Patrons: how many people are in the restaurant (values are None, Some, and Full).
6. Price: the restaurant’s price range ($, $$, $$$).
7. Raining: whether it is raining outside.
8. Reservation: whether we made a reservation.
9. Type: the kind of restaurant (French, Italian, Thai, or burger).
10. WaitEstimate: host’s wait estimate: 0–10, 10–30, 30–60, or >60 minutes.
A set of 12 examples, taken from the experience of one of us (SR), is shown in Figure 19.2.
Note how skimpy these data are: there are \(2^6 \times 3^2 \times 4^2 = 9{,}216\) possible combinations of
values for the input attributes, but we are given the correct output for only 12 of them; each of
the other 9,204 could be either true or false; we don’t know. This is the essence of induction: we need to make our best guess at these missing 9,204 output values, given only the evidence of the 12 examples.
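The count of 9,216 follows directly from the number of values each attribute can take. As a quick check, the Python sketch below multiplies out the domain sizes from the list above; the dictionary name and layout are our own choices for illustration.

```python
from math import prod

# Number of possible values for each of the ten input attributes listed above.
attribute_value_counts = {
    "Alternate": 2, "Bar": 2, "Fri/Sat": 2, "Hungry": 2,
    "Patrons": 3, "Price": 3,
    "Raining": 2, "Reservation": 2,
    "Type": 4, "WaitEstimate": 4,
}

# Six Boolean attributes, two three-valued, two four-valued: 2^6 * 3^2 * 4^2.
total_combinations = prod(attribute_value_counts.values())

labeled_examples = 12
unlabeled_combinations = total_combinations - labeled_examples

print(total_combinations, unlabeled_combinations)
```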
[Figure 19.2: table of the 12 examples, one row per example, giving the values of its input attributes.]
Figure 19.2 Examples for the restaurant domain.
19.3
Learning Decision Trees
A decision tree is a representation of a function that maps a vector of attribute values to
a single output value, a “decision.” A decision tree reaches its decision by performing a
sequence of tests, starting at the root and following the appropriate branch until a leaf is reached. Each internal node in the tree corresponds to a test of the value of one of the input
attributes, the branches from the node are labeled with the possible values of the attribute,
and the leaf nodes specify what value is to be returned by the function.
In general, the input and output values can be discrete or continuous, but for now we will
consider only inputs consisting of discrete values and outputs that are either true (a positive
example) or false (a negative example). We call this Boolean classification. We will use j
to index the examples (\(\mathbf{x}_j\) is the input vector for the jth example and \(y_j\) is the output), and \(x_{j,i}\)
for the ith attribute of the jth example.
The tree representing the decision function that SR uses for the restaurant problem is
shown in Figure 19.3. Following the branches, we see that an example with Patrons = Full and WaitEstimate = 0–10 will be classified as positive (i.e., yes, we will wait for a table).
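The "sequence of tests" can be made concrete with a small data structure. In the Python sketch below a tree is a nested dictionary and `classify` follows branches until it reaches a leaf. The tree is a hypothetical fragment: only the path Patrons = Full, WaitEstimate = 0–10 leading to a positive answer is taken from the text; the other leaves are placeholders invented for illustration, and the representation itself is our assumption.

```python
# An internal node names the attribute it tests and maps each attribute value
# to a subtree; a leaf is simply the output value (True = wait, False = don't).
# Hypothetical fragment: leaves other than the Full / 0-10 path are placeholders.
tree = {
    "test": "Patrons",
    "branches": {
        "None": False,
        "Some": True,
        "Full": {
            "test": "WaitEstimate",
            "branches": {"0-10": True, "10-30": True, "30-60": False, ">60": False},
        },
    },
}

def classify(node, example):
    # Start at the root and follow the branch matching the example's value
    # for the tested attribute, until a leaf (a non-dict value) is reached.
    while isinstance(node, dict):
        node = node["branches"][example[node["test"]]]
    return node

example = {"Patrons": "Full", "WaitEstimate": "0-10"}
will_wait = classify(tree, example)
print(will_wait)
```

Note that an example only needs values for the attributes that are actually tested along its path.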
19.3.1
Expressiveness of decision trees
A Boolean decision tree is equivalent to a logical statement of the form:
Output [...]
[...] \(A_1 + A_2\) is hard to represent
with a decision tree because the decision boundary is a diagonal line, and all decision tree
tests divide the space up into rectangular, axis-aligned boxes. We would have to stack a lot
of boxes to closely approximate the diagonal line. In other words, decision trees are good for some kinds of functions and bad for others.
Is there any kind of representation that is efficient for all kinds of functions? Unfortunately,
the answer is no: there are just too many functions to be able to represent them all with a small number of bits.
Even just considering Boolean functions with n Boolean attributes,
the truth table will have \(2^n\) rows, and each row can output true or false, so there are
\(2^{2^n}\) different functions. With 20 attributes there are \(2^{1{,}048{,}576} \approx 10^{300{,}000}\) functions, so if we limit ourselves to a million-bit representation, we can’t represent all these functions.
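The counting argument is easy to verify numerically. The Python sketch below computes the number of truth-table rows for n = 20 and the base-10 logarithm of the number of Boolean functions; the function and variable names are our own. The exact exponent comes out near 315,653, which the text rounds to the order of magnitude 300,000.

```python
import math

def count_boolean_functions(n):
    # One output bit per truth-table row, and 2**n rows: 2**(2**n) functions.
    return 2 ** (2 ** n)

n = 20
truth_table_rows = 2 ** n                       # 1,048,576 rows
log10_count = truth_table_rows * math.log10(2)  # log10 of 2**(2**20)

# Naming one function out of 2**(2**n) takes 2**n bits in general,
# which for n = 20 already exceeds a million-bit budget.
bits_needed = truth_table_rows
exceeds_million_bits = bits_needed > 10 ** 6

print(truth_table_rows, round(log10_count), exceeds_million_bits)
```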
19.3.2
Learning decision trees from examples
We want to find a tree that is consistent with the examples in Figure 19.2 and is as small as possible. Unfortunately, it is intractable to find a guaranteed smallest consistent tree. But with some simple heuristics, we can efficiently find one that is close to the smallest. The
LEARN-DECISION-TREE algorithm adopts a greedy divide-and-conquer strategy: always test the most important attribute first, then recursively solve the smaller subproblems that are defined by the possible results of the test. By “most important attribute,” we mean the one
that makes the most difference to the classification of an example. That way, we hope to get to the correct classification with a small number of tests, meaning that all paths in the tree will be short and the tree as a whole will be shallow.
Figure 19.4(a) shows that Type is a poor attribute, because it leaves us with four possible
outcomes, each of which has the same number of positive as negative examples. On the other
hand, in (b) we see that Patrons is a fairly important attribute, because if the value is None or Some, then we are left with example sets for which we can answer definitively (No and Yes,
respectively). If the value is Full, we are left with a mixed set of examples.
[Figure 19.3: decision tree diagram. Root test: Patrons? (None, Some, Full); other tests shown include WaitEstimate?, Alternate?, Reservation?, and Bar?, with No/Yes leaves.]
Figure 19.3 A decision tree for deciding whether to wait for a table.
[Figure 19.4: two panels. (a) Type? with branches French, Italian, Thai, Burger; (b) Patrons? with branches None, Some, Full.]
Figure 19.4 Splitting the examples by testing on attributes. At each node we show the positive (light boxes) and negative (dark boxes) examples remaining. (a) Splitting on Type brings us no nearer to distinguishing between positive and negative examples. (b) Splitting on Patrons does a good job of separating positive and negative examples. After splitting on Patrons, Hungry is a fairly good second test.
There are four cases to consider for these recursive subproblems:
1. If the remaining examples are all positive (or all negative), then we are done: we can answer Yes or No. Figure 19.4(b) shows examples of this happening in the None and Some branches.
2. If there are some positive and some negative examples, then choose the best attribute to split them. Figure 19.4(b) shows Hungry being used to split the remaining examples.
3. If there are no examples left, it means that no example has been observed for this combination of attribute values, and we return the most common output value from the set of examples that were used in constructing the node’s parent.
4. If there are no attributes left, but both positive and negative examples, it means that these examples have exactly the same description, but different classifications. This can happen because there is an error or noise in the data; because the domain is nondeterministic; or because we can’t observe an attribute that would distinguish the examples. The best we can do is return the most common output value of the remaining examples.
The LEARN-DECISION-TREE algorithm is shown in Figure 19.5. Note that the set of examples
is an input to the algorithm, but nowhere do the examples appear in the tree returned by the algorithm. A tree consists of tests on attributes in the interior nodes, values of attributes on the branches, and output values on the leaf nodes. The details of the IMPORTANCE function
are given in Section 19.3.3. The output of the learning algorithm on our sample training
set is shown in Figure 19.6. The tree is clearly different from the original tree shown in Figure [...]
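The four cases above translate fairly directly into a recursive procedure. The Python sketch below is one possible rendering, under several assumptions: trees are nested dictionaries, `domains` (a name we introduce) lists the possible values of each attribute, the training examples are a made-up toy set rather than the data of Figure 19.2, and `importance` uses information gain as a stand-in, since the definition of IMPORTANCE belongs to Section 19.3.3.

```python
import math
from collections import Counter

def plurality_value(examples):
    # Most common output value among the examples.
    return Counter(y for _, y in examples).most_common(1)[0][0]

def entropy(examples):
    total = len(examples)
    counts = Counter(y for _, y in examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def importance(attribute, examples):
    # Assumed stand-in for IMPORTANCE: information gain of splitting on attribute.
    remainder = 0.0
    for value in {x[attribute] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(examples) - remainder

def learn_decision_tree(examples, attributes, parent_examples, domains):
    if not examples:                                   # case 3: no examples left
        return plurality_value(parent_examples)
    labels = {y for _, y in examples}
    if len(labels) == 1:                               # case 1: all the same class
        return labels.pop()
    if not attributes:                                 # case 4: no attributes left
        return plurality_value(examples)
    best = max(attributes, key=lambda a: importance(a, examples))   # case 2
    tree = {"test": best, "branches": {}}
    remaining = [a for a in attributes if a != best]
    for value in domains[best]:
        exs = [(x, y) for x, y in examples if x[best] == value]
        tree["branches"][value] = learn_decision_tree(exs, remaining, examples, domains)
    return tree

def classify(node, x):
    while isinstance(node, dict):
        node = node["branches"][x[node["test"]]]
    return node

# Hypothetical toy training set: (attribute values, WillWait).
domains = {"Patrons": ["None", "Some", "Full"], "Hungry": ["No", "Yes"]}
examples = [
    ({"Patrons": "None", "Hungry": "No"}, False),
    ({"Patrons": "None", "Hungry": "Yes"}, False),
    ({"Patrons": "Some", "Hungry": "No"}, True),
    ({"Patrons": "Some", "Hungry": "Yes"}, True),
    ({"Patrons": "Full", "Hungry": "No"}, False),
    ({"Patrons": "Full", "Hungry": "Yes"}, True),
]
tree = learn_decision_tree(examples, list(domains), examples, domains)
print(tree)
```

As the text notes, the returned tree contains only attribute tests, branch values, and leaf outputs; the examples themselves are not stored in it.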
function LEARN-DECISION-TREE(examples, attributes, parent_examples) returns a tree
  if examples is empty then return PLURALITY-VALUE(parent_examples)
  else if all examples have the same classification then return the classification
  else if attributes is empty then return PLURALITY-VALUE(examples)
  else
    A ← argmax_{a ∈ attributes} IMPORTANCE(a, examples)
    tree ← a new decision tree with root test A
    for each value v of A do
      exs ← [...]
[...] 0, while earthquakes have \(-4.9 + 1.7x_1 - x_2 < 0\). We can make the equation easier to deal with by changing it into the vector dot product form: with \(x_0 = 1\) we have \(-4.9x_0 + 1.7x_1 - x_2 = 0\), and we can define the vector of weights, \(\mathbf{w} = \langle -4.9, 1.7, -1 \rangle\), and write the classification hypothesis
\[ h_{\mathbf{w}}(\mathbf{x}) = 1 \text{ if } \mathbf{w} \cdot \mathbf{x} > 0 \text{ and } 0 \text{ otherwise.} \]
Alternatively, we can think of h as the result of passing the linear function \(\mathbf{w} \cdot \mathbf{x}\) through a threshold function:
\[ h_{\mathbf{w}}(\mathbf{x}) = \mathit{Threshold}(\mathbf{w} \cdot \mathbf{x}) \quad \text{where } \mathit{Threshold}(z) = 1 \text{ if } z > 0 \text{ and } 0 \text{ otherwise.} \]
The threshold function is shown in Figure 19.17(a).
Now that the hypothesis \(h_{\mathbf{w}}(\mathbf{x})\) has a well-defined mathematical form, we can think about
choosing the weights w to minimize the loss. In Sections 19.6.1 and 19.6.3, we did this both
in closed form (by setting the gradient to zero and solving for the weights) and by gradient
descent in weight space. Here we cannot do either of those things because the gradient is zero almost everywhere in weight space except at those points where \(\mathbf{w} \cdot \mathbf{x} = 0\), and at those points the gradient is undefined.
There is, however, a simple weight update rule that converges to a solution (that is, to
a linear separator that classifies the data perfectly) provided the data are linearly separable. For a single example (x, y), we have
\[ w_i \leftarrow w_i + \alpha\,(y - h_{\mathbf{w}}(\mathbf{x})) \times x_i \quad (19.8) \]
which is essentially identical to Equation (19.6), the update rule for linear regression! This
rule is called the perceptron learning rule, for reasons that will become clear in Chapter 21. Because we are considering a 0/1 classification problem, however, the behavior is somewhat
different. Both the true value y and the hypothesis output \(h_{\mathbf{w}}(\mathbf{x})\) are either 0 or 1, so there are three possibilities:
• If the output is correct (i.e., \(y = h_{\mathbf{w}}(\mathbf{x})\)) then the weights are not changed.
• If y is 1 but \(h_{\mathbf{w}}(\mathbf{x})\) is 0, then \(w_i\) is increased when the corresponding input \(x_i\) is positive and decreased when \(x_i\) is negative. This makes sense, because we want to make \(\mathbf{w} \cdot \mathbf{x}\) bigger so that \(h_{\mathbf{w}}(\mathbf{x})\) outputs a 1.
• If y is 0 but \(h_{\mathbf{w}}(\mathbf{x})\) is 1, then \(w_i\) is decreased when the corresponding input \(x_i\) is positive and increased when \(x_i\) is negative. This makes sense, because we want to make \(\mathbf{w} \cdot \mathbf{x}\) smaller so that \(h_{\mathbf{w}}(\mathbf{x})\) outputs a 0.
Typically the learning rule is applied one example at a time, choosing examples at random (as in stochastic gradient descent). Figure 19.16(a) shows a training curve for this learning rule applied to the earthquake/explosion data shown in Figure 19.15(a). A training curve measures the classifier performance on a fixed training set as the learning process proceeds one update at a time on that training set. The curve shows the update rule converging to a zero-error
linear separator. The “convergence” process isn’t exactly pretty, but it always works. This
particular run takes 657 steps to converge, for a data set with 63 examples, so each example
is presented roughly 10 times on average. Typically, the variation across runs is large.
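The threshold classifier and the perceptron learning rule are short enough to write out in full. The Python sketch below is illustrative only: the weight vector `w_text` is the separator quoted in the text, but the two test inputs and the training data (the logical AND of two inputs, which is linearly separable) are hypothetical, and the examples are visited in a fixed cycle rather than at random to keep the run repeatable.

```python
def threshold(z):
    # Hard threshold as written in the text: 1 if z > 0, otherwise 0.
    return 1 if z > 0 else 0

def h(w, x):
    # Classification hypothesis: threshold applied to the dot product w . x.
    return threshold(sum(wi * xi for wi, xi in zip(w, x)))

def perceptron_update(w, x, y, alpha):
    # Equation (19.8): w_i <- w_i + alpha * (y - h_w(x)) * x_i for every i.
    error = y - h(w, x)
    return [wi + alpha * error * xi for wi, xi in zip(w, x)]

# The separator from the text, with the dummy input x0 = 1 in front.
w_text = [-4.9, 1.7, -1.0]
print(h(w_text, [1.0, 5.0, 2.0]), h(w_text, [1.0, 2.0, 3.0]))

# Hypothetical, linearly separable training set (logical AND), x0 = 1 prepended.
data = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 0), ([1, 1, 1], 1)]

w = [0.0, 0.0, 0.0]
for epoch in range(1000):            # generous cap; separable data converge
    mistakes = 0
    for x, y in data:
        if h(w, x) != y:
            w = perceptron_update(w, x, y, alpha=1.0)
            mistakes += 1
    if mistakes == 0:                # a full pass with no errors: done
        break

print("learned weights:", w)
```

Because correct examples leave the weights unchanged, the loop only calls the update on mistakes; calling it on every example would give the same result.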
[Plot residue: training-curve panels; x-axis: Number of weight updates.]
[...]
The solution for \(\theta\) is the same as before. The solution for \(\theta_1\), the probability that a cherry
candy has a red wrapper, is the observed fraction of cherry candies with red wrappers, and
similarly for \(\theta_2\).
These results are very comforting, and it is easy to see that they can be extended to any
Bayesian network whose conditional probabilities are represented as tables. The most important
point is that with complete data, the maximum-likelihood parameter learning problem
for a Bayesian network decomposes into separate learning problems, one for each parameter. (See Exercise 20.NORX
for the nontabulated case, where each parameter affects several
conditional probabilities.) The second point is that the parameter values for a variable, given its parents, are just the observed frequencies of the variable values for each setting of the parent values. As before, we must be careful to avoid zeroes when the data set is small.
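To make the frequency-counting result concrete, the Python sketch below estimates the three parameters of the candy network from complete data. The list of observed candies is hypothetical, invented for illustration; with real data the same counts would be used, and a small data set could produce the zero estimates the text warns about.

```python
# Hypothetical complete data: (flavor, wrapper) for each candy that was unwrapped.
candies = [
    ("cherry", "red"), ("cherry", "red"), ("cherry", "green"), ("cherry", "red"),
    ("lime", "green"), ("lime", "red"), ("lime", "green"), ("lime", "green"),
]

cherry_wrappers = [wrapper for flavor, wrapper in candies if flavor == "cherry"]
lime_wrappers = [wrapper for flavor, wrapper in candies if flavor == "lime"]

# theta: fraction of candies that are cherry.
theta = len(cherry_wrappers) / len(candies)

# theta_1: fraction of cherry candies with red wrappers.
theta_1 = cherry_wrappers.count("red") / len(cherry_wrappers)

# theta_2: fraction of lime candies with red wrappers.
theta_2 = lime_wrappers.count("red") / len(lime_wrappers)

print(theta, theta_1, theta_2)
```

Each parameter is computed from its own slice of the data, which is the decomposition into separate learning problems described above.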
20.2.2
Naive Bayes models
Probably the most common Bayesian network model used in machine learning is the naive Bayes model first introduced on page 402. In this model, the “class” variable C (which is to be predicted) is the root and the “attribute” variables \(X_i\) are the leaves. The model is “naive”
because it assumes that the attributes are conditionally independent of each other, given the class. (The model in Figure 20.2(b) is a naive Bayes model with class Flavor and just one
attribute, Wrapper.) In the case of Boolean variables, the parameters are
\[ \theta = P(C = \mathit{true}),\quad \theta_{i1} = P(X_i = \mathit{true} \mid C = \mathit{true}),\quad \theta_{i2} = P(X_i = \mathit{true} \mid C = \mathit{false}) . \]
The maximum-likelihood parameter values are found in exactly the same way as in Figure 20.2(b). Once the model has been trained in this way, it can be used to classify new examples for which the class variable C is unobserved. With observed attribute values \(x_1, \ldots, x_n\), the probability of each class is given by
\[ \mathbf{P}(C \mid x_1, \ldots, x_n) = \alpha\, \mathbf{P}(C) \prod_i \mathbf{P}(x_i \mid C) . \]
A deterministic prediction can be obtained by choosing the most likely class. Figure 20.3 shows the learning curve for this method when it is applied to the restaurant problem from
Chapter 19. The method learns fairly well but not as well as decision tree learning; this is presumably because the true hypothesis, which is a decision tree, is not representable exactly
using a naive Bayes model. Naive Bayes learning turns out to do surprisingly well in a wide
range of applications; the boosted version (Exercise 20.BNBX) is one of the most effective
general-purpose learning algorithms. Naive Bayes learning scales well to very large problems:
with n Boolean attributes, there are just 2n + 1 parameters, and no search is required
to find \(h_{ML}\), the maximum-likelihood naive Bayes hypothesis. Finally, naive Bayes learning systems deal well with noisy or missing data and can give probabilistic predictions when appropriate. Their primary drawback is the fact that the conditional independence assumption is seldom accurate; as noted on page 403, the assumption leads to overconfident probabilities that are often very close to 0 or 1, especially with large numbers of attributes.
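The training and classification steps for Boolean naive Bayes fit in a few lines. The Python sketch below is a minimal illustration, assuming a made-up training set with two attributes; it uses plain maximum-likelihood frequencies with no smoothing, so it would divide by zero if a class were absent from the data or if both class scores were zero for some input.

```python
def train_naive_bayes(examples):
    # examples: list of (attribute tuple of booleans, boolean class).
    n = len(examples[0][0])
    pos = [x for x, c in examples if c]
    neg = [x for x, c in examples if not c]
    theta = len(pos) / len(examples)                                   # P(C = true)
    theta_true = [sum(x[i] for x in pos) / len(pos) for i in range(n)]  # P(X_i = true | C = true)
    theta_false = [sum(x[i] for x in neg) / len(neg) for i in range(n)] # P(X_i = true | C = false)
    return theta, theta_true, theta_false

def posterior_true(model, x):
    # P(C = true | x) = alpha * P(C = true) * prod_i P(x_i | C = true).
    theta, theta_true, theta_false = model
    score_true, score_false = theta, 1 - theta
    for i, xi in enumerate(x):
        score_true *= theta_true[i] if xi else 1 - theta_true[i]
        score_false *= theta_false[i] if xi else 1 - theta_false[i]
    return score_true / (score_true + score_false)   # normalization plays the role of alpha

# Hypothetical training data with two Boolean attributes.
examples = [
    ((True, True), True), ((True, False), True), ((True, True), True), ((False, True), True),
    ((False, False), False), ((False, True), False), ((True, False), False), ((False, False), False),
]
model = train_naive_bayes(examples)
parameter_count = 1 + len(model[1]) + len(model[2])   # 2n + 1 parameters

print(posterior_true(model, (True, True)), posterior_true(model, (False, False)))
```

A deterministic prediction is obtained by answering true whenever the posterior exceeds 0.5.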
20.2.3
Generative and discriminative models
We can distinguish two kinds of machine learning models used for classifiers: generative and discriminative.
A generative model models the probability distribution of each class. For
example, the naive Bayes text classifier from Section 12.6.1 creates a separate model for each
possible category of text: one for sports, one for weather, and so on. Each model includes
the prior probability of the category, for example P(Category = weather), as well as the conditional probability P(Inputs | Category = weather). From these we can compute the joint
probability P(Inputs, Category = weather) and we can generate a random selection of words that is representative of texts in the weather category.
[Figure 20.3: learning-curve plot; x-axis: Training set size; y-axis: Proportion correct on test set; curve label: Naive Bayes.]
Figure 20.3 The learning curve for naive Bayes learning applied to the restaurant problem from Chapter 19; the learning curve for decision tree learning is shown for comparison.
A discriminative model directly learns the decision boundary between classes. That is,
it learns P(Category | Inputs). Given example inputs, a discriminative model will come up
with an output category, but you cannot use a discriminative model to, say, generate random
words that are representative of a category. Logistic regression, decision trees, and support vector machines are all discriminative models.
Since discriminative models put all their emphasis on defining the decision boundary
(that is, actually doing the classification task they were asked to do) they tend to perform
better in the limit, with an arbitrary amount of training data. However, with limited data, in
some cases a generative model performs better. Ng and Jordan (2002) compare the generative naive Bayes classifier to the discriminative logistic regression classifier on 15 (small) data sets, and find that with the maximum amount of data, the discriminative model does better on
9 out of 15 data sets, but with only a small amount of data, the generative model does better
on 14 out of 15 data sets.
20.2.4
Maximum-likelihood parameter learning: Continuous models
Continuous probability models such as the linear-Gaussian model were shown on page 422.
Because continuous variables are ubiquitous in real-world applications, it is important to know how to learn the parameters of continuous models from data. The principles for maximum-likelihood learning are identical in the continuous and discrete cases.
Let us begin with a very simple case: learning the parameters of a Gaussian density function on a single variable. That is, we assume the data are generated as follows:
\[ P(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} . \]
The parameters of this model are the mean \(\mu\) and the standard deviation \(\sigma\). (Notice that the
normalizing “constant” depends on \(\sigma\), so we cannot ignore it.) Let the observed values be
\(x_1, \ldots, x_N\). Then the log likelihood is
\[ L = \sum_{j=1}^{N} \log \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x_j-\mu)^2}{2\sigma^2}} = N(-\log\sqrt{2\pi} - \log\sigma) - \sum_{j=1}^{N} \frac{(x_j-\mu)^2}{2\sigma^2} . \]
Setting the derivatives to zero as usual, we obtain
\[ \frac{\partial L}{\partial \mu} = \frac{1}{\sigma^2} \sum_{j=1}^{N} (x_j - \mu) = 0 \quad\Rightarrow\quad \mu = \frac{\sum_j x_j}{N} \]
\[ \frac{\partial L}{\partial \sigma} = -\frac{N}{\sigma} + \frac{1}{\sigma^3} \sum_{j=1}^{N} (x_j - \mu)^2 = 0 \quad\Rightarrow\quad \sigma = \sqrt{\frac{\sum_j (x_j - \mu)^2}{N}} \quad (20.4) \]
That is, the maximum-likelihood value of the mean is the sample average and the maximum-likelihood
value of the standard deviation is the square root of the sample variance. Again, these are comforting results that confirm “commonsense” practice.
[Figure 20.4: two panels, (a) and (b); horizontal axis: x.]
Figure 20.4 (a) A linear-Gaussian model described as \(y = \theta_1 x + \theta_2\) plus Gaussian noise with fixed variance. (b) A set of 50 data points generated from this model and the best-fit line.
Now consider a linear-Gaussian model with one continuous parent X and a continuous
child Y. As explained on page 422, Y has a Gaussian distribution whose mean depends linearly on the value of X and whose standard deviation is fixed. To learn the conditional
distribution P(Y | X), we can maximize the conditional likelihood
\[ P(y \mid x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(y - (\theta_1 x + \theta_2))^2}{2\sigma^2}} . \quad (20.5) \]
Here, the parameters are \(\theta_1\), \(\theta_2\), and \(\sigma\). The data are a collection of \((x_j, y_j)\) pairs, as illustrated
in Figure 20.4. Using the usual methods (Exercise 20.LINR), we can find the maximum-likelihood values of the parameters. The point here is different. If we consider just the parameters \(\theta_1\) and \(\theta_2\) that define the linear relationship between x and y, it becomes clear
that maximizing the log likelihood with respect to these parameters is the same as minimizing
the numerator \((y - (\theta_1 x + \theta_2))^2\) in the exponent of Equation (20.5). This is the \(L_2\) loss, the
squared error between the actual value y and the prediction \(\theta_1 x + \theta_2\).
This is the quantity minimized by the standard linear regression procedure described in
Section 19.6. Now we can understand why: minimizing the sum of squared errors gives the maximum-likelihood straight-line model, provided that the data are generated with Gaussian
noise of fixed variance.
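Both maximum-likelihood results in this section reduce to familiar computations, which the Python sketch below spells out. It is a minimal illustration under our own naming: `gaussian_mle` returns the sample average and the square root of the sample variance (dividing by N), and `linear_gaussian_mle` returns the least-squares slope and intercept together with the root-mean-square residual as the estimate of the noise standard deviation. The data values are made up.

```python
import math

def gaussian_mle(xs):
    # Maximum-likelihood mean and standard deviation of a single Gaussian.
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)   # divide by N, not N - 1
    return mu, sigma

def linear_gaussian_mle(points):
    # theta_1, theta_2 minimizing the sum of (y - (theta_1*x + theta_2))**2,
    # which is the maximum-likelihood line under fixed-variance Gaussian noise.
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    theta_1 = sum((x - mx) * (y - my) for x, y in points) / sum((x - mx) ** 2 for x, _ in points)
    theta_2 = my - theta_1 * mx
    # Noise level estimated from the residuals around the fitted line.
    sigma = math.sqrt(sum((y - (theta_1 * x + theta_2)) ** 2 for x, y in points) / n)
    return theta_1, theta_2, sigma

# Hypothetical observations of a single variable.
mu, sigma = gaussian_mle([2, 4, 4, 4, 5, 5, 7, 9])
print(mu, sigma)

# Hypothetical (x, y) pairs lying exactly on y = 2x + 1, so the residuals vanish.
theta_1, theta_2, noise = linear_gaussian_mle([(0, 1), (1, 3), (2, 5), (3, 7)])
print(theta_1, theta_2, noise)
```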
[Figure 20.5: two panels, (a) and (b); x-axis: Parameter θ.]
Figure 20.5 Examples of the Beta(a,b) distribution for different values of (a,b).
20.2.5
Bayesian parameter learning
Maximum-likelihood learning gives rise to simple procedures, but it has serious deficiencies with small data sets. For example, after seeing one cherry candy, the maximum-likelihood hypothesis is that the bag is 100% cherry (i.e., \(\theta = 1.0\)). Unless one’s hypothesis prior is that bags must be either all cherry or all lime, this is not a reasonable conclusion. It is more likely
that the bag is a mixture of lime and cherry. The Bayesian approach to parameter learning
starts with a hypothesis prior and updates the distribution as data arrive.
The candy example in Figure 20.2(a) has one parameter, \(\theta\): the probability that a randomly selected piece of candy is cherry-flavored. In the Bayesian view, \(\theta\) is the (unknown)
value of a random variable \(\Theta\) that defines the hypothesis space; the hypothesis prior is the
prior distribution \(P(\Theta)\). Thus, \(P(\Theta = \theta)\) is the prior probability that the bag has a fraction
\(\theta\) of cherry candies.
If the parameter \(\theta\) can be any value between 0 and 1, then \(P(\Theta)\) is a continuous probability density function (see Section A.3). If we don’t know anything about the possible values of \(\theta\) we can use the uniform density function \(P(\theta) = \mathit{Uniform}(\theta; 0, 1)\), which says all values are equally likely. A more flexible family of probability density functions is known as the beta distributions. Each beta distribution is defined by two hyperparameters³ a and b such that
\[ \mathit{Beta}(\theta; a, b) = \alpha\, \theta^{a-1} (1-\theta)^{b-1} , \quad (20.6) \]
for \(\theta\) in the range [0, 1]. The normalization constant \(\alpha\), which makes the distribution integrate to 1, depends on a and b. Figure 20.5 shows what the distribution looks like for various values of a and b. The mean value of the beta distribution is a/(a + b), so larger values of a
suggest a belief that \(\Theta\) is closer to 1 than to 0. Larger values of a + b make the distribution
more peaked, suggesting greater certainty about the value of \(\Theta\). It turns out that the uniform
density function is the same as Beta(1,1): the mean is 1/2, and the distribution is flat.
3 They are called hyperparameters because they parameterize a distribution over \(\theta\), which is itself a parameter.
Section 20.2 Learning with Complete Data
Figure 20.6 A Bayesian network that corresponds to a Bayesian learning process. Posterior distributions for the parameter variables $\Theta$, $\Theta_1$, and $\Theta_2$ can be inferred from their prior distributions and the evidence in $Flavor_i$ and $Wrapper_i$.

Besides its flexibility, the beta family has another wonderful property: if $\Theta$ has a prior Beta(a, b), then, after a data point is observed, the posterior distribution for $\Theta$ is also a beta distribution. In other words, Beta is closed under update. The beta family is called the conjugate prior for the family of distributions for a Boolean variable.⁴ Let’s see how this works. Suppose we observe a cherry candy; then we have

$$P(\theta \mid D_1 = cherry) = \alpha\, P(D_1 = cherry \mid \theta)\, P(\theta)$$
$$= \alpha'\, \theta \cdot \text{Beta}(\theta; a, b) = \alpha'\, \theta \cdot \theta^{a-1} (1-\theta)^{b-1}$$
$$= \alpha'\, \theta^{a} (1-\theta)^{b-1} = \alpha'\, \text{Beta}(\theta; a+1, b).$$
Thus, after seeing a cherry candy, we simply increment the $a$ parameter to get the posterior; similarly, after seeing a lime candy, we increment the $b$ parameter. Thus, we can view the $a$ and $b$ hyperparameters as virtual counts, in the sense that a prior Beta(a, b) behaves exactly as if we had started out with a uniform prior Beta(1, 1) and seen $a - 1$ actual cherry candies and $b - 1$ actual lime candies.

By examining a sequence of beta distributions for increasing values of $a$ and $b$, keeping the proportions fixed, we can see vividly how the posterior distribution over the parameter $\Theta$ changes as data arrive. For example, suppose the actual bag of candy is 75% cherry. Figure 20.5(b) shows the sequence Beta(3, 1), Beta(6, 2), Beta(30, 10). Clearly, the distribution is converging to a narrow peak around the true value of $\Theta$. For large data sets, then, Bayesian learning (at least in this case) converges to the same answer as maximum-likelihood learning.

4 Other conjugate priors include the Dirichlet family for the parameters of a discrete multivalued distribution and the Normal-Wishart family for the parameters of a Gaussian distribution. See Bernardo and Smith (1994).

Now let us consider a more complicated case. The network in Figure 20.2(b) has three parameters, $\theta$, $\theta_1$, and $\theta_2$, where $\theta_1$ is the probability of a red wrapper on a cherry candy and $\theta_2$ is the probability of a red wrapper on a lime candy. The Bayesian hypothesis prior must cover all three parameters; that is, we need to specify $P(\Theta, \Theta_1, \Theta_2)$.
Usually, we assume parameter independence:

$$P(\Theta, \Theta_1, \Theta_2) = P(\Theta)\, P(\Theta_1)\, P(\Theta_2).$$
With this assumption, each parameter can have its own beta distribution that is updated separately as data arrive. Figure 20.6 shows how we can incorporate the hypothesis prior and any data into a Bayesian network, in which we have a node for each parameter variable.

The nodes $\Theta$, $\Theta_1$, $\Theta_2$ have no parents. For the $i$th observation of a wrapper and corresponding flavor of a piece of candy, we add nodes $Wrapper_i$ and $Flavor_i$. $Flavor_i$ is dependent on the flavor parameter $\Theta$:

$$P(Flavor_i = cherry \mid \Theta = \theta) = \theta.$$

$Wrapper_i$ is dependent on $\Theta_1$ and $\Theta_2$:

$$P(Wrapper_i = red \mid Flavor_i = cherry, \Theta_1 = \theta_1) = \theta_1$$
$$P(Wrapper_i = red \mid Flavor_i = lime, \Theta_2 = \theta_2) = \theta_2.$$
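With complete data and parameter independence, the three posteriors can be tracked as three pairs of virtual counts. The following is a minimal sketch of that bookkeeping, not a general Bayesian-network learner. The class name `Beta`, the function `learn_candy_parameters`, the uniform Beta(1, 1) priors, the four-candy data set, and the label "green" for a non-red wrapper are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Beta:
    """Beta(a, b) belief over a probability; a and b act as virtual counts."""
    a: float = 1.0  # Beta(1, 1) is the uniform prior
    b: float = 1.0

    def update(self, success: bool) -> None:
        # Conjugate update: a success increments a, a failure increments b.
        if success:
            self.a += 1
        else:
            self.b += 1

    @property
    def mean(self) -> float:
        return self.a / (self.a + self.b)


def learn_candy_parameters(observations):
    """Update the three independent posteriors from (flavor, wrapper) pairs.

    theta  : P(Flavor = cherry)
    theta1 : P(Wrapper = red | Flavor = cherry)
    theta2 : P(Wrapper = red | Flavor = lime)
    """
    theta, theta1, theta2 = Beta(), Beta(), Beta()
    for flavor, wrapper in observations:
        theta.update(flavor == "cherry")
        # Only the wrapper parameter for the observed flavor is affected.
        if flavor == "cherry":
            theta1.update(wrapper == "red")
        else:
            theta2.update(wrapper == "red")
    return theta, theta1, theta2


# Hypothetical data: 3 cherry candies (2 red, 1 green) and 1 lime candy (green).
data = [("cherry", "red"), ("cherry", "red"), ("cherry", "green"), ("lime", "green")]
theta, theta1, theta2 = learn_candy_parameters(data)
print(theta, theta1, theta2)
```

Each observation touches exactly two of the three count pairs, which is what "updated separately" amounts to when every flavor and wrapper is observed.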
Now, the entire Bayesian learning process for the original Bayes net in Figure 20.2(b) can be formulated as an inference problem in the derived Bayes net shown in Figure 20.6, where the data and parameters become nodes. Once we have added all the new evidence nodes, we can then query the parameter variables (in this case, $\Theta$, $\Theta_1$, $\Theta_2$). Under this formulation there is just one learning algorithm: the inference algorithm for Bayesian networks. Of course, the nature of these networks is somewhat different from those of Chapter 13 because of the potentially huge number of evidence variables representing the training set and the prevalence of continuous-valued parameter variables. Exact inference may be impossible except in very simple cases such as the naive Bayes model. Practitioners typically use approximate inference methods such as MCMC (Section 13.4.2); many statistical software packages incorporate efficient implementations of MCMC for this purpose.

20.2.6 Bayesian linear regression
Here we illustrate how to apply a Bayesian approach to a standard statistical task:
linear
regression. The conventional approach was described in Section 19.6 as minimizing the sum of squared errors and reinterpreted in Section 20.2.4 as maximizing likelihood assuming a Gaussian error model. These produce a single best hypothesis: a straight line with specific values for the slope and intercept and a fixed variance for the prediction error at any given
point. There is no measure of how confident one should be in the slope and intercept values.
Furthermore, if one is predicting a value for an unseen data point far from the observed
data points, it seems to make no sense to assume a prediction error that is the same as the
prediction error for a data point right next to an observed data point.
It would seem more
sensible for the prediction error to be larger, the farther the data point is from the observed
data, because a small change in the slope will cause a large change in the predicted value for a distant point. The Bayesian approach fixes both of these problems. The general idea, as in the preceding section, is to place a prior on the model parameters—here, the coefficients of the linear model and the noise variance—and then to compute the parameter posterior given the data. For
multivariate data and unknown noise model, this leads to rather a lot of linear algebra, so we focus on a simple case: univariate data, a model that is constrained to go through the origin, and known noise: a normal distribution with variance $\sigma^2$. Then we have just one parameter $\theta$ and the model is

$$P(y \mid x, \theta) = \mathcal{N}(y; \theta x, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{(y - \theta x)^2}{\sigma^2}\right)}. \qquad (20.7)$$

As the log likelihood is quadratic in $\theta$, the appropriate form for a conjugate prior on $\theta$ is also a Gaussian. This ensures that the posterior for $\theta$ will also be Gaussian. We’ll assume a mean $\theta_0$ and variance $\sigma_0^2$ for the prior, so that

$$P(\theta) = \mathcal{N}(\theta; \theta_0, \sigma_0^2)$$

[...]

We don’t necessarily expect the top halves of photos to look like bottom halves, so there is a scale beyond which spatial invariance no longer holds.
Local spatial invariance can be achieved by constraining the $l$ weights connecting a local region to a unit in the hidden layer to be the same for each hidden unit. (That is, for hidden units $i$ and $j$, the weights $w_{1,i}, \ldots, w_{l,i}$ are the same as $w_{1,j}, \ldots, w_{l,j}$.) This makes the hidden units into feature detectors that detect the same feature wherever it appears in the image. Typically, we want the first hidden layer to detect many kinds of features, not just one; so for each local image region we might have $d$ hidden units with $d$ distinct sets of weights. This means that there are $dl$ weights in all, a number that is not only far smaller than $n^2$ but is actually independent of $n$, the image size. Thus, by injecting some prior knowledge (namely, knowledge of adjacency and spatial invariance) we can develop models that have far fewer parameters and can learn much more quickly.

A convolutional neural network (CNN) is one that contains spatially local connections, at least in the early layers, and has patterns of weights that are replicated across the units in each layer. A pattern of weights that is replicated across multiple local regions is called a kernel and the process of applying the kernel to the pixels of the image (or to spatially organized units in a subsequent layer) is called convolution.⁴
3 Similar ideas can be applied to process time-series data sources such as audio waveforms. These typically exhibit temporal invariance: a word sounds the same no matter what time of day it is uttered. Recurrent neural networks (Section 21.6) automatically exhibit temporal invariance.

4 In the terminology of signal processing, we would call this operation a cross-correlation, not a convolution. But “convolution” is used within the field of neural networks.

Section 21.3 Convolutional Networks

Figure 21.4 An example of a one-dimensional convolution operation with a kernel of size $l = 3$ and a stride $s = 2$. The peak response is centered on the darker (lower intensity) input pixel. The results would usually be fed through a nonlinear activation function (not shown) before going to the next hidden layer.

Kernels and convolutions are easiest to illustrate in one dimension rather than two or more, so we will assume an input vector $\mathbf{x}$ of size $n$, corresponding to $n$ pixels in a one-dimensional image, and a vector kernel $\mathbf{k}$ of size $l$. (For simplicity we will assume that $l$ is an odd number.) All the ideas carry over straightforwardly to higher-dimensional cases.

We write the convolution operation using the $*$ symbol, for example: $\mathbf{z} = \mathbf{x} * \mathbf{k}$. The operation is defined as follows:

$$z_i = \sum_{j=1}^{l} k_j\, x_{j+i-(l+1)/2}. \qquad (21.8)$$

In other words, for each output position $i$, we take the dot product between the kernel $\mathbf{k}$ and a snippet of $\mathbf{x}$ centered on $x_i$ with width $l$.
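Equation (21.8) translates almost directly into code. The sketch below uses NumPy and zero-based indexing, applies the kernel only where it fits entirely inside the input (no padding), and takes the stride as a parameter as in Figure 21.4. It also builds the equivalent weight matrix, with one shifted copy of the kernel per row, to confirm that the convolution is a linear operation. The function names `conv1d` and `conv_matrix` are hypothetical, and the input values are an assumption chosen so that the output reproduces the hidden layer [5, 9, 4] quoted for Figure 21.4.

```python
import numpy as np


def conv1d(x, k, stride=1):
    """Equation (21.8) applied at every position where the kernel fits.

    No padding is used, so the kernel centers start (l - 1) / 2 pixels in
    from the left edge and advance by `stride` pixels.
    """
    x = np.asarray(x, dtype=float)
    k = np.asarray(k, dtype=float)
    l = len(k)
    assert l % 2 == 1, "the text assumes an odd kernel size"
    half = (l - 1) // 2
    centers = range(half, len(x) - half, stride)
    # Dot product of the kernel with the width-l snippet centered on x[i].
    return np.array([k @ x[i - half:i + half + 1] for i in centers])


def conv_matrix(n, k, stride=1):
    """Weight matrix whose rows hold the kernel, shifted by the stride."""
    l = len(k)
    starts = range(0, n - l + 1, stride)
    W = np.zeros((len(starts), n))
    for row, start in enumerate(starts):
        W[row, start:start + l] = k
    return W


# Assumed input: values picked to reproduce the [5, 9, 4] hidden layer.
x = [5, 6, 6, 2, 5, 6, 5]
k = [+1, -1, +1]
z = conv1d(x, k, stride=2)
W = conv_matrix(len(x), k, stride=2)
print(z)                  # loop form
print(W @ np.array(x))    # matrix form gives the same values
```

The largest response sits over the input value 2, the darkest pixel, which is the behavior the kernel $[+1, -1, +1]$ is meant to detect.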
The process is illustrated in Figure 21.4 for a kernel vector $[+1, -1, +1]$, which detects a darker point in the 1D image. (The 2D version might detect a darker line.) Notice that in this example the pixels on which the kernels are centered are separated by a distance of 2 pixels; we say the kernel is applied with a stride $s = 2$. Notice that the output layer has fewer pixels: because of the stride, the number of pixels is reduced from $n$ to roughly $n/s$. (In two dimensions, the number of pixels would be roughly $n/(s_x s_y)$, where $s_x$ and $s_y$ are the strides in the $x$ and $y$ directions in the image.) We say “roughly” because of what happens at the edge of the image: in Figure 21.4 the convolution stops at the edges of the image, but one can also pad the input with extra pixels (either zeroes or copies of the outer pixels) so that the kernel can be applied exactly $\lfloor n/s \rfloor$ times. For small kernels, we typically use $s = 1$, so the output has the same dimensions as the image (see Figure 21.5).

The operation of applying a kernel across an image can be implemented in the obvious way by a program with suitable nested loops; but it can also be formulated as a single matrix
operation, just like the application of the weight matrix in Equation (21.1). For example, the convolution illustrated in Figure 21.4 can be viewed as the following matrix multiplication:

$$\begin{pmatrix} +1 & -1 & +1 & 0 & 0 & 0 & 0 \\ 0 & 0 & +1 & -1 & +1 & 0 & 0 \\ 0 & 0 & 0 & 0 & +1 & -1 & +1 \end{pmatrix} \begin{pmatrix} 5 \\ 6 \\ 6 \\ 2 \\ 5 \\ 6 \\ 5 \end{pmatrix} = \begin{pmatrix} 5 \\ 9 \\ 4 \end{pmatrix}. \qquad (21.9)$$

In this weight matrix, the kernel appears in each row, shifted according to the stride relative to the previous row. One wouldn’t necessarily construct the weight matrix explicitly (it is mostly zeroes, after all), but the fact that convolution is a linear matrix operation serves as a reminder that gradient descent can be applied easily and effectively to CNNs, just as it can to plain vanilla neural networks.

Chapter 21 Deep Learning

Figure 21.5 The first two layers of a CNN for a 1D image with a kernel size $l = 3$ and a stride $s = 1$. Padding is added at the left and right ends in order to keep the hidden layers the same size as the input. Shown in red is the receptive field of a unit in the second hidden layer. Generally speaking, the deeper the unit, the larger the receptive field.

As mentioned earlier, there will be $d$ kernels, not just one; so, with a stride of 1, the
output will be $d$ times larger. This means that a two-dimensional input array becomes a three-dimensional array of hidden units, where the third dimension is of size $d$. It is important to organize the hidden layer this way, so that all the kernel outputs from a particular image location stay associated with that location. Unlike the spatial dimensions of the image, however, this additional “kernel dimension” does not have any adjacency properties, so it does not make sense to run convolutions along it.

CNNs were inspired originally by models of the visual cortex proposed in neuroscience. In those models, the receptive field of a neuron is the portion of the sensory input that can affect that neuron’s activation. In a CNN, the receptive field of a unit in the first hidden layer is small: just the size of the kernel, i.e., $l$ pixels. In the deeper layers of the network, it can be much larger. Figure 21.5 illustrates this for a unit in the second hidden layer, whose receptive field contains five pixels. When the stride is 1, as in the figure, a node in the $m$th hidden layer will have a receptive field of size $(l - 1)m + 1$; so the growth is linear in $m$. (In a 2D image, each dimension of the receptive field grows linearly with $m$, so the area grows quadratically.) When the stride is larger than 1, each pixel in layer $m$ represents $s$ pixels in layer $m - 1$; therefore, the receptive field grows as $O(l s^m)$, that is, exponentially with depth. The same effect occurs with pooling layers, which we discuss next.

21.3.1 Pooling and downsampling
A pooling layer in a neural network summarizes a set of adjacent units from the preceding layer with a single value. Pooling works just like a convolution layer, with a kernel size $l$ and stride $s$, but the operation that is applied is fixed rather than learned. Typically, no activation function is associated with the pooling layer. There are two common forms of pooling:

* Average-pooling computes the average value of its $l$ inputs. This is identical to convolution with a uniform kernel vector $\mathbf{k} = [1/l, \ldots, 1/l]$. If we set $l = s$, the effect is to coarsen the resolution of the image (to downsample it) by a factor of $s$. An object that occupied, say, $10s$ pixels would now occupy only 10 pixels after pooling. The same learned classifier that would be able to recognize the object at a size of 10 pixels in the original image would now be able to recognize that object in the pooled image, even if it was too big to recognize in the original image. In other words, average-pooling facilitates multiscale recognition. It also reduces the number of weights required in subsequent layers, leading to lower computational cost and possibly faster learning.

* Max-pooling computes the maximum value of its $l$ inputs. It can also be used purely for downsampling, but it has a somewhat different semantics. Suppose we applied max-pooling to the hidden layer [5, 9, 4] in Figure 21.4: the result would be a 9, indicating that somewhere in the input image there is a darker dot that is detected by the kernel. In other words, max-pooling acts as a kind of logical disjunction, saying that a feature exists somewhere in the unit’s receptive field.
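Both forms of pooling can be sketched with one helper that slides a window and applies a fixed summary operation. In the sketch below the function names, the default of setting the stride equal to the window size, and the values in `signal` are illustrative assumptions; the hidden layer [5, 9, 4] is the one quoted from Figure 21.4. The last lines check the claim that average-pooling coincides with convolution by a uniform kernel.

```python
import numpy as np


def pool1d(z, size, stride, op):
    """Apply a fixed summary operation `op` to each window of `size` units."""
    z = np.asarray(z, dtype=float)
    starts = range(0, len(z) - size + 1, stride)
    return np.array([op(z[s:s + size]) for s in starts])


def average_pool(z, size, stride=None):
    # With stride equal to the window size, this downsamples by that factor.
    return pool1d(z, size, stride or size, np.mean)


def max_pool(z, size, stride=None):
    return pool1d(z, size, stride or size, np.max)


hidden = [5, 9, 4]                 # hidden layer quoted from Figure 21.4
print(max_pool(hidden, size=3))    # one window covering all three units

signal = [2, 4, 6, 8, 1, 3]        # hypothetical layer values
print(average_pool(signal, size=2))

# Average-pooling matches a convolution with the uniform kernel [1/l, ..., 1/l].
uniform = np.full(2, 1 / 2)
as_convolution = np.array([uniform @ np.array(signal[s:s + 2], dtype=float)
                           for s in range(0, len(signal) - 1, 2)])
print(as_convolution)
```

Because the operation is fixed, a pooling layer adds no weights of its own; only the window size and stride need to be chosen.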
If the goal is to classify the image into one of $c$ categories, then the final layer of the network will be a softmax with $c$ output units. The early layers of the CNN are image-sized, so somewhere in between there must be significant reductions in layer size. Convolution layers and pooling layers with stride larger than 1 all serve to reduce the layer size. It’s also possible to reduce the layer size simply by having a fully connected layer with fewer units than the preceding layer. CNNs often have one or two such layers preceding the final softmax layer.

21.3.2 Tensor operations in CNNs
We saw in Equations (21.1) and (21.3) that the use of vector and matrix notation can be helpful in keeping mathematical derivations simple and elegant and providing concise descriptions of computation graphs. Vectors and matrices are one-dimensional and two-dimensional special cases of tensors, which (in deep learning terminology) are simply multidimensional arrays of any dimension.⁵

For CNNs, tensors are a way of keeping track of the “shape” of the data as it progresses through the layers of the network. This is important because the whole notion of convolution depends on the idea of adjacency: adjacent data elements are assumed to be semantically related, so it makes sense to apply operators to local regions of the data. Moreover, with suitable language primitives for constructing tensors and applying operators, the layers themselves can be described concisely as maps from tensor inputs to tensor outputs.

A final reason for describing CNNs in terms of tensor operations is computational efficiency: given a description of a network as a sequence of tensor operations, a deep learning software package can generate compiled code that is highly optimized for the underlying computational substrate. Deep learning workloads are often run on GPUs (graphics processing units) or TPUs (tensor processing units), which make available a high degree of parallelism. For example, one of Google’s third-generation TPU pods has throughput equivalent to about ten million laptops. Taking advantage of these capabilities is essential if one is training a large CNN on a large database of images. Thus, it is common to process not one image at a time but many images in parallel; as we will see in Section 21.4, this also aligns nicely with the way that the stochastic gradient descent algorithm calculates gradients with respect to a minibatch of training examples.
5 The proper mathematical definition of tensors requires that certain invariances hold under a change of basis.

Let us put all this together in the form of an example. Suppose we are training on 256 × 256 RGB images with a minibatch size of 64. The input in this case will be a four-dimensional tensor of size 256 × 256 × 3 × 64. Then we apply 96 kernels of size 5 × 5 × 3 with a stride of 2 in both $x$ and $y$ directions in the image. This gives an output tensor of size 128 × 128 × 96 × 64. Such a tensor is often called a feature map, since it shows how each feature extracted by a kernel appears across the entire image; in this case it is composed of 96 channels, where each channel carries information from one feature. Notice that unlike the input tensor, this feature map no longer has dedicated color channels; nonetheless, the color information may still be present in the various feature channels if the learning algorithm finds color to be useful for the final predictions of the network.
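The shape bookkeeping in this example can be checked with a few lines of arithmetic. The sketch below uses the common rule that a kernel of width $k$ applied with stride $s$ and padding $p$ to $n$ pixels yields $\lfloor (n + 2p - k)/s \rfloor + 1$ positions. The padding of 2 pixels per edge is an assumption, chosen because it gives exactly 128 positions; the text does not say how the edges are handled. The function names are hypothetical.

```python
def conv_output_size(n, kernel, stride, padding):
    """Number of kernel positions along one spatial dimension."""
    return (n + 2 * padding - kernel) // stride + 1


def conv_layer_shape(input_shape, num_kernels, kernel, stride, padding):
    """Output shape for an input laid out as (height, width, channels, batch)."""
    height, width, _channels, batch = input_shape
    return (
        conv_output_size(height, kernel, stride, padding),
        conv_output_size(width, kernel, stride, padding),
        num_kernels,  # one output channel per kernel
        batch,
    )


input_shape = (256, 256, 3, 64)
# padding=2 is an assumption; the text does not state how the edges are handled.
output_shape = conv_layer_shape(input_shape, num_kernels=96, kernel=5, stride=2, padding=2)
num_weights = 96 * 5 * 5 * 3  # kernel weights only, biases not counted
print(output_shape, num_weights)
```

The layer has only 7,200 kernel weights regardless of image size, whereas the feature map it produces holds 128 × 128 × 96 values for every image in the minibatch.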
21.3.3 Residual networks

Residual networks are a popular and successful approach to building very deep networks that avoid the problem of vanishing gradients.

Typical deep models use layers that learn a new representation at layer $i$ by completely replacing the representation at layer $i - 1$. Using the matrix-vector notation that we introduced in Equation (21.3), with $\mathbf{z}^{(i)}$ being the values of the units in layer $i$, we have

$$\mathbf{z}^{(i)} = f(\mathbf{z}^{(i-1)}) = \mathbf{g}^{(i)}(\mathbf{W}^{(i)} \mathbf{z}^{(i-1)}).$$
Because each layer completely replaces the representation from the preceding layer, all of the layers must learn to do something useful. Each layer must, at the very least, preserve the task-relevant information contained in the preceding layer. If we set $\mathbf{W}^{(i)} = \mathbf{0}$ for any layer $i$, the entire network ceases to function. If we also set $\mathbf{W}^{(i-1)} = \mathbf{0}$, the network would not even be able to learn: layer $i$ would not learn because it would observe no variation in its input from layer $i - 1$, and layer $i - 1$ would not learn because the back-propagated gradient from layer $i$ would always be zero. Of course, these are extreme examples, but they illustrate the need for layers to serve as conduits for the signals passing through the network.

The key idea of residual networks is that a layer should perturb the representation from the previous layer rather than replace it entirely. If the learned perturbation is small, the next layer is close to being a copy of the previous layer. This is achieved by the following equation for layer $i$ in terms of layer $i - 1$:

$$\mathbf{z}^{(i)} = \mathbf{g}_r^{(i)}(\mathbf{z}^{(i-1)} + f(\mathbf{z}^{(i-1)})), \qquad (21.10)$$

where $\mathbf{g}_r$ denotes the activation functions for the residual layer. Here we think of $f$ as the residual, perturbing the default behavior of passing layer $i - 1$ through to layer $i$. The function used to compute the residual is typically a neural network with one nonlinear layer combined with one linear layer:

$$f(\mathbf{z}) = \mathbf{V} \mathbf{g}(\mathbf{W} \mathbf{z}),$$

where $\mathbf{W}$ and $\mathbf{V}$ are learned weight matrices with the usual bias weights added.
Residual networks make it possible to learn significantly deeper networks reliably. Consider what happens if we set $\mathbf{V} = \mathbf{0}$ for a particular layer in order to disable that layer. Then the residual $f$ disappears and Equation (21.10) simplifies to

$$\mathbf{z}^{(i)} = \mathbf{g}_r(\mathbf{z}^{(i-1)}).$$

Now suppose that $\mathbf{g}_r$ consists of ReLU activation functions and that $\mathbf{z}^{(i-1)}$ also applies a ReLU function to its inputs: $\mathbf{z}^{(i-1)} = \text{ReLU}(\mathbf{in}^{(i-1)})$. In that case we have

$$\mathbf{z}^{(i)} = \mathbf{g}_r(\mathbf{z}^{(i-1)}) = \text{ReLU}(\mathbf{z}^{(i-1)}) = \text{ReLU}(\text{ReLU}(\mathbf{in}^{(i-1)})) = \text{ReLU}(\mathbf{in}^{(i-1)}) = \mathbf{z}^{(i-1)},$$

where the penultimate step follows because $\text{ReLU}(\text{ReLU}(x)) = \text{ReLU}(x)$. In other words, in residual nets with ReLU activations, a layer with zero weights simply passes its inputs through with no change. The rest of the network functions just as if the layer had never existed. Whereas traditional networks must learn to propagate information and are subject to catastrophic failure of information propagation for bad choices of the parameters, residual networks propagate information by default.

Residual networks are often used with convolutional layers in vision applications, but they are in fact a general-purpose tool that makes deep networks more robust and allows researchers to experiment more freely with complex and heterogeneous network designs. At the time of writing, it is not uncommon to see residual networks with hundreds of layers. The design of such networks is evolving rapidly, so any additional specifics we might provide would probably be outdated before reaching printed form. Readers desiring to know the best architectures for specific applications should consult recent research publications.
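The pass-through property of a disabled residual layer is easy to confirm numerically. The following sketch implements a single residual layer of the form $\mathbf{z}^{(i)} = \mathbf{g}_r(\mathbf{z}^{(i-1)} + \mathbf{V}\mathbf{g}(\mathbf{W}\mathbf{z}^{(i-1)}))$ with ReLU for both activation functions. It omits bias weights for brevity, and the layer sizes and random values are arbitrary assumptions.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def residual_layer(z_prev, W, V):
    """z_i = g_r(z_prev + f(z_prev)) with f(z) = V g(W z); biases omitted."""
    residual = V @ relu(W @ z_prev)
    return relu(z_prev + residual)


rng = np.random.default_rng(0)
z_prev = relu(rng.normal(size=4))        # previous layer already applies ReLU
W = rng.normal(size=(8, 4))              # hypothetical layer sizes
V_zero = np.zeros((4, 8))                # V = 0 disables the layer

z_next = residual_layer(z_prev, W, V_zero)
print(np.array_equal(z_next, z_prev))
```

With $\mathbf{V} = \mathbf{0}$ the output equals the input whatever $\mathbf{W}$ contains, so a freshly added layer of this kind cannot block the signal from earlier layers.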
21.4 Learning Algorithms

Training a neural network consists of modifying the network’s parameters so as to minimize the loss function on the training set. In principle, any kind of optimization algorithm could be used. In practice, modern neural networks are almost always trained with some variant of stochastic gradient descent (SGD).

We covered standard gradient descent and its stochastic version in Section 19.6.2. Here, the goal is to minimize the loss $L(\mathbf{w})$, where $\mathbf{w}$ represents all of the parameters of the network. Each update step in the gradient descent process looks like this:

$$\mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla_{\mathbf{w}} L(\mathbf{w}),$$

where $\alpha$ is the learning rate. For standard gradient descent, the loss $L$ is defined with respect to the entire training set. For SGD, it is defined with respect to a minibatch of $m$ examples chosen randomly at each step.
As noted in Section 4.2, the literature on optimization methods for high-dimensional continuous spaces includes innumerable enhancements to basic gradient descent. We will not cover all of them here, but it is worth mentioning a few important considerations that are particularly relevant to training neural networks:

* For most networks that solve real-world problems, both the dimensionality of $\mathbf{w}$ and the size of the training set are very large. These considerations militate strongly in favor of using SGD with a relatively small minibatch size $m$: stochasticity helps the algorithm escape small local minima in the high-dimensional weight space (as in simulated annealing; see page 114); and the small minibatch size ensures that the computational cost of each weight update step is a small constant, independent of the training set size.
* Because the gradient contribution of each training example in the SGD minibatch can be computed independently, the minibatch size is often chosen so as to take maximum advantage of hardware parallelism in GPUs or TPUs.
* To improve convergence, it is usually a good idea to use a learning rate that decreases over time. Choosing the right schedule is usually a matter of trial and error.
* Near a local or global minimum of the loss function with respect to the entire training set, the gradients estimated from small minibatches will often have high variance and may point in entirely the wrong direction, making convergence difficult. One solution is to increase the minibatch size as training proceeds; another is to incorporate the idea of momentum, which keeps a running average of the gradients of past minibatches in order to compensate for small minibatch sizes.
* Care must be taken to mitigate numerical instabilities that may arise due to overflow, underflow, and rounding error. These are particularly problematic with the use of exponentials in softmax, sigmoid, and tanh activation functions, and with the iterated computations in very deep networks and recurrent networks (Section 21.6) that lead to vanishing and exploding activations and gradients.

Figure 21.6 Illustration of the back-propagation of gradient information in an arbitrary computation graph. The forward computation of the output of the network proceeds from left to right, while the back-propagation of gradients proceeds from right to left.

Overall, the process of learning the weights of the network is usually one that exhibits diminishing returns. We run until it is no longer practical to decrease the test error by running longer. Usually this does not mean we have reached a global or even a local minimum of the loss function. Instead, it means we would have to make an impractically large number of very small steps to continue reducing the cost, or that additional steps would only cause overfitting, or that estimates of the gradient are too inaccurate to make further progress.
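The SGD update rule and two of the considerations listed above (a decreasing learning rate and momentum) fit in a short loop. The sketch below trains a two-weight linear model on synthetic data rather than a neural network, so that the answer can be checked. The schedule $\alpha_t = \alpha_0/(1 + \text{decay} \cdot t)$, the form of the running average, every default value, and the function names are illustrative assumptions, not recommendations.

```python
import numpy as np


def sgd(grad_fn, w, data, *, steps=2000, batch_size=8, lr0=0.1,
        decay=0.01, momentum=0.9, seed=0):
    """Minibatch SGD with a decaying learning rate and momentum.

    grad_fn(w, batch) returns the gradient of the minibatch loss at w.
    The schedule lr0 / (1 + decay * t) and all default values are
    illustrative assumptions, not recommendations from the text.
    """
    rng = np.random.default_rng(seed)
    velocity = np.zeros_like(w)
    for t in range(steps):
        batch = data[rng.choice(len(data), size=batch_size, replace=False)]
        lr = lr0 / (1 + decay * t)                      # decreases over time
        # Running (exponentially weighted) average of past gradients.
        velocity = momentum * velocity + (1 - momentum) * grad_fn(w, batch)
        w = w - lr * velocity
    return w


def squared_error_grad(w, batch):
    """Gradient of the mean squared error of the linear model y = w . x."""
    X, y = batch[:, :-1], batch[:, -1]
    return 2 * X.T @ (X @ w - y) / len(y)


# Synthetic data: y = 3*x1 - 2*x2 plus a little Gaussian noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = X @ np.array([3.0, -2.0]) + 0.01 * rng.normal(size=500)
data = np.column_stack([X, y])

w = sgd(squared_error_grad, np.zeros(2), data)
print(w)
```

Each step costs the same regardless of how many of the 500 examples exist, since only `batch_size` of them are touched; the learned weights should land close to the generating values 3 and -2.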
21.4.1 Computing gradients in computation graphs

On page 755, we derived the gradient of the loss function with respect to the weights in a specific (and very simple) network. We observed that the gradient could be computed by back-propagating error information from the output layer of the network to the hidden layers. We also said that this result holds in general for any feedforward computation graph. Here, we explain how this works.

Figure 21.6 shows a generic node in a computation graph. (The node $h$ has in-degree and out-degree 2, but nothing in the analysis depends on this.) During the forward pass, the node computes some arbitrary function $h$ from its inputs, which come from nodes $f$ and $g$. In turn, $h$ feeds its value to nodes $j$ and $k$.
The back-propagation process passes messages back along each link in the network. At each node, the incoming messages are collected and new messages are calculated to pass back to the next layer. As the figure shows, the messages are all partial derivatives of the loss $L$. For example, the backward message $\partial L/\partial h_j$ is the partial derivative of $L$ with respect to $j$’s first input, which is the forward message from $h$ to $j$. Now, $h$ affects $L$ through both $j$ and $k$, so we have

$$\partial L/\partial h = \partial L/\partial h_j + \partial L/\partial h_k. \qquad (21.11)$$

With this equation, the node $h$ can compute the derivative of $L$ with respect to $h$ by summing the incoming messages from $j$ and $k$. Now, to compute the outgoing messages $\partial L/\partial f_h$ and $\partial L/\partial g_h$, we use the following equations:

$$\frac{\partial L}{\partial f_h} = \frac{\partial L}{\partial h} \frac{\partial h}{\partial f_h} \quad \text{and} \quad \frac{\partial L}{\partial g_h} = \frac{\partial L}{\partial h} \frac{\partial h}{\partial g_h}. \qquad (21.12)$$

In Equation (21.12), $\partial L/\partial h$ was already computed by Equation (21.11), and $\partial h/\partial f_h$ and $\partial h/\partial g_h$ are just the derivatives of $h$ with respect to its first and second arguments, respectively. For example, if $h$ is a multiplication node, that is, $h(f, g) = f \cdot g$, then $\partial h/\partial f_h = g$ and $\partial h/\partial g_h = f$. Software packages for deep learning typically come with a library of node types (addition, multiplication, sigmoid, and so on), each of which knows how to compute its own derivatives as needed for Equation (21.12).

The back-propagation process begins with the output nodes, where each initial message $\partial L/\partial \hat{y}_j$ is calculated directly from the expression for $L$ in terms of the predicted value $\hat{\mathbf{y}}$ and the true value $\mathbf{y}$ from the training data. At each internal node, the incoming backward messages are summed according to Equation (21.11) and the outgoing messages are generated from Equation (21.12). The process terminates at each node in the computation graph that represents a weight $w$ (e.g., the light mauve ovals in Figure 21.3(b)). At that point, the sum of the incoming messages to $w$ is $\partial L/\partial w$, precisely the gradient we need to update $w$. Exercise 21.BPRE asks you to apply this process to the simple network in Figure 21.3 in order to rederive the gradient expressions in Equations (21.4) and (21.5).
Weight-sharing, as used in convolutional networks (Section 21.3) and recurrent networks (Section 21.6), is handled simply by treating each shared weight as a single node with multiple outgoing arcs in the computation graph. During back-propagation, this results in multiple incoming gradient messages. By Equation (21.11), this means that the gradient for the shared weight is the sum of the gradient contributions from each place it is used in the network.
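The message-passing scheme, including the treatment of shared weights, can be sketched with a very small computation-graph class. This is a toy illustration, not how any particular deep learning package is implemented: the class `Node`, the function `backward`, and the restriction to scalar addition and multiplication nodes are all assumptions. Each node records the local derivatives needed for Equation (21.12), and the `+=` accumulation carries out the sum in Equation (21.11).

```python
class Node:
    """A node in a computation graph holding a value and its inputs."""

    def __init__(self, value, inputs=(), local_grads=()):
        self.value = value
        self.inputs = inputs            # nodes this node was computed from
        self.local_grads = local_grads  # d(self)/d(input), one per input
        self.grad = 0.0                 # accumulates dL/d(self)

    def __add__(self, other):
        return Node(self.value + other.value, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        # For h = f * g: dh/df = g and dh/dg = f.
        return Node(self.value * other.value, (self, other),
                    (other.value, self.value))


def backward(loss):
    """Propagate dL/d(node) from the loss back to every node."""
    # Order the nodes so that each one comes after all of its inputs.
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent in node.inputs:
                visit(parent)
            order.append(node)

    visit(loss)
    loss.grad = 1.0
    for node in reversed(order):
        # node.grad now holds the sum of all incoming messages (Eq. 21.11).
        for parent, local in zip(node.inputs, node.local_grads):
            parent.grad += node.grad * local   # outgoing message (Eq. 21.12)


# A shared weight w is used in two places: L = (w * x1) + (w * x2).
w, x1, x2 = Node(0.5), Node(2.0), Node(3.0)
L = (w * x1) + (w * x2)
backward(L)
print(w.grad)   # sum of the two contributions: x1 + x2
```

Every node is visited once in each direction, and each forward value is kept so that the local derivatives are available during the backward pass.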
It is clear from this description of the back-propagation process that its computational cost is linear in the number of nodes in the computation graph, just like the cost of the forward computation. Furthermore, because the node types are typically fixed when the network is designed, all of the gradient computations can be prepared in symbolic form in advance and compiled into very efficient code for each node in the graph. Note also that the messages in Figure 21.6 need not be scalars: they could equally be vectors, matrices, or higher-dimensional tensors, so that the gradient computations can be mapped onto GPUs or TPUs to benefit from parallelism.

One drawback of back-propagation is that it requires storing most of the intermediate values that were computed during forward propagation in order to calculate gradients in the backward pass. This means that the total memory cost of training the network is proportional to the number of units in the entire network. Thus, even if the network itself is represented only implicitly by propagation code with lots of loops, rather than explicitly by a data structure, all of the intermediate results of that propagation code have to be stored explicitly.
21.4.2 Batch normalization

Batch normalization is a commonly used technique that improves the rate of convergence of SGD by rescaling the values generated at the internal layers of the network from the examples within each minibatch. Although the reasons for its effectiveness are not well understood at the time of writing, we include it because it confers significant benefits in practice. To some extent, batch normalization seems to have effects similar to those of the residual network.

Consider a node $z$ somewhere in the network: the values of $z$ for the $m$ examples in a minibatch are $z_1, \ldots, z_m$. Batch normalization replaces each $z_i$ with a new quantity $\hat{z}_i$:

$$\hat{z}_i = \gamma\, \frac{z_i - \mu}{\sqrt{\epsilon + \sigma^2}} + \beta,$$

where $\mu$ is the mean value of $z$ across the minibatch, $\sigma$ is the standard deviation of $z_1, \ldots, z_m$, $\epsilon$ is a small constant added to prevent division by zero, and $\gamma$ and $\beta$ are learned parameters.

Batch normalization standardizes the mean and variance of the values, as determined by the values of $\beta$ and $\gamma$. This makes it much simpler to train a deep network. Without batch normalization, information can get lost if a layer’s weights are too small, and the standard deviation at that layer decays to near zero. Batch normalization prevents this from happening. It also reduces the need for careful initialization of all the weights in the network to make sure that the nodes in each layer are in the right operating region to allow information to propagate.

With batch normalization, we usually include $\beta$ and $\gamma$, which may be node-specific or layer-specific, among the parameters of the network, so that they are included in the learning process. After training, $\beta$ and $\gamma$ are fixed at their learned values.
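As a minimal sketch of the forward computation, assuming the usual form $\hat{z}_i = \gamma (z_i - \mu)/\sqrt{\epsilon + \sigma^2} + \beta$, batch normalization for a single node can be written as follows. The function name, the default $\epsilon = 10^{-5}$, and the example values are assumptions; in a real network $\gamma$ and $\beta$ would be updated by gradient descent rather than passed in by hand.

```python
import numpy as np


def batch_norm(z, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize the values of one node across a minibatch, then rescale.

    z holds the m values of the node, one per example. gamma and beta are
    the learned scale and shift; they are plain arguments here because this
    sketch covers only the forward computation during training.
    """
    z = np.asarray(z, dtype=float)
    mu = z.mean()
    var = z.var()   # sigma squared, computed over the minibatch
    return gamma * (z - mu) / np.sqrt(eps + var) + beta


z = np.array([0.1, 0.2, 0.3, 0.4])   # hypothetical small activations
z_hat = batch_norm(z, gamma=2.0, beta=1.0)
print(z_hat.mean(), z_hat.std())
```

Whatever the scale of the incoming values, the output has mean $\beta$ and standard deviation close to $\gamma$ (slightly below it because of $\epsilon$), which is the sense in which $\beta$ and $\gamma$ determine the standardized mean and variance.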
21.5
Generalization
So far we have described how to fit a neural network to its training set, but in machine learn-
ing the goal is to generalize to new data that has not been seen previously, as measured by performance on a test set. In this section, we focus on three approaches to improving gener-
alization performance: choosing the right network architecture, penalizing large weights, and
randomly perturbing the values passing through the network during training.
21.5.1 Choosing a network architecture
A great deal of effort in deep learning research has gone into finding network architectures that generalize well. Indeed, for each particular kind of data (images, speech, text, video, and so on) a good deal of the progress in performance has come from exploring different kinds of network architectures and varying the number of layers, their connectivity, and the types of node in each layer.⁶
Some neural network architectures are explicitly designed to generalize well on particular types of data: convolutional networks encode the idea that the same feature extractor is useful at all locations across a spatial grid, and recurrent networks encode the idea that the same update rule is useful at all points in a stream of sequential data. To the extent that these assumptions are valid, we expect convolutional architectures to generalize well on images and recurrent networks to generalize well on text and audio signals.
6 Noting that much of this incremental, exploratory work is carried out by graduate students, some have called the process graduate student descent (GSD).
[Plot omitted. Vertical axis: test-set error. Horizontal axis: number of weights. Curves: 3-layer and 11-layer networks.]
Figure 21.7 Test-set error as a function of layer width (as measured by total number of weights) for three-layer and eleven-layer convolutional networks. The data come from early versions of Google’s system for transcribing addresses in photos taken by Street View cars (Goodfellow et al., 2014).
One of the most important empirical findings in the field of deep learning is that when
comparing two networks with similar numbers of weights, the deeper network usually gives better generalization performance.
Figure 21.7 shows this effect for at least one real-world
application—recognizing house numbers. The results show that for any fixed number of parameters, an eleven-layer network gives much lower test-set error than a three-layer network. Deep learning systems perform well on some but not all tasks. For tasks with high-
dimensional inputs—images, video, speech signals, etc.—they perform better than any other
pure machine learning approaches. Most of the algorithms described in Chapter 19 can handle high-dimensional input only if it is preprocessed using manually designed features to reduce the dimensionality. This preprocessing approach, which prevailed prior to 2010, has not yielded performance comparable to that achieved by deep learning systems. Clearly, deep learning models are capturing some important aspects of these tasks. In particular, their success implies that the tasks can be solved by parallel programs with a relatively
small number of steps ($10$ to $10^3$ rather than, say, $10^7$). This is perhaps not surprising, because these tasks are typically solved by the brain in less than a second, which is time enough for only a few tens of sequential neuron firings. Moreover, by examining the internal-layer representations learned by deep convolutional networks for vision tasks, we find evidence
that the processing steps seem to involve extracting a sequence of increasingly abstract representations of the scene, beginning with tiny edges, dots, and corner features and ending with
entire objects and arrangements of multiple objects. On the other hand, because they are simple circuits, deep learning models lack the compositional and quantificational expressive power that we see in first-order logic (Chapter 8) and context-free grammars (Chapter 23).
Although deep learning models generalize well in many cases, they may also produce
unintuitive errors. They tend to produce input-output mappings that are discontinuous, so
that a small change to an input can cause a large change in the output. For example, it may
be possible to alter just a few pixels in an image of a dog and cause the network to classify the dog as an ostrich or a school bus—even though the altered image still looks exactly like a
dog. An altered image of this kind is called an adversarial example. In low-dimensional spaces it is hard to find adversarial examples. But for an image with
a million pixel values, it is often the case that even though most of the pixels contribute to
the image being classified in the middle of the “dog” region of the space, there are a few dimensions where the pixel value is near the boundary to another category. An adversary
with the ability to reverse engineer the network can find the smallest vector difference that
would move the image over the border.
When adversarial examples were first discovered, they set off two worldwide scrambles:
one to find learning algorithms and network architectures that would not be susceptible to adversarial attack, and another to create ever-more-effective adversarial attacks against all
kinds of learning systems.
So far the attackers seem to be ahead.
In fact, whereas it was
assumed initially that one would need access to the internals of the trained network in order
to construct an adversarial example specifically for that network, it has turned out that one can construct robust adversarial examples that fool multiple networks with different architec-
tures, hyperparameters, and training sets. These findings suggest that deep learning models
recognize objects in ways that are quite different from the human visual system.
21.5.2 Neural architecture search
Unfortunately, we don’t yet have a clear set of guidelines to help you choose the best network
architecture for a particular problem. Success in deploying a deep learning solution requires experience and good judgment. From the earliest days of neural network research, attempts have been made to automate
the process of architecture selection. We can think of this as a case of hyperparameter tuning (Section 19.4.4), where the hyperparameters determine the depth, width, connectivity, and other attributes of the network. However, there are so many choices to be made that simple approaches like grid search can’t cover all possibilities in a reasonable amount of time. Therefore, it is common to use neural architecture search to explore the state space of possible network architectures. Many of the search techniques and learning techniques we covered earlier in the book have been applied to neural architecture search.
Evolutionary algorithms have been popular because it is sensible to do both recombination (joining parts of two networks together) and mutation (adding or removing a layer or changing a parameter value). Hill climbing can also be used with these same mutation operations.
Some researchers have framed the problem as reinforcement learning, and some
as Bayesian optimization.
Another possibility is to treat the architectural possibilities as a
continuous differentiable space and use gradient descent to find a locally optimal solution.
For all these search techniques, a major challenge is estimating the value of a candidate
network.
The straightforward way to evaluate an architecture is to train it on a training set for
multiple batches and then evaluate its accuracy on a validation set. But with large networks that could take many GPU-days.
Therefore, there have been many attempts to speed up this estimation process by eliminating or at least reducing the expensive training process. We can train on a smaller data set. We can train for a small number of batches and predict how the network would improve with more batches.
We can use a reduced version of the network architecture that we hope
retains the properties of the full version. We can train one big network and then search for subgraphs of the network that perform better; this search can be fast because the subgraphs
share parameters and don’t have to be retrained. Another approach is to learn a heuristic evaluation function (as was done for A* search).
That is, start by choosing a few hundred network architectures and train and evaluate them.
That gives us a data set of (network, score) pairs. Then learn a mapping from the features of a network to a predicted
score. From that point on we can generate a large number of candidate
networks and quickly estimate their value. After a search through the space of networks, the best one(s) can be fully evaluated with a complete training procedure.
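The learned-evaluation-function idea can be sketched as follows. Everything specific here is an assumption for illustration: architectures are reduced to hypothetical (depth, width) pairs, `expensive_score` is a purely synthetic stand-in for "train and evaluate on a validation set", and the predictor is ordinary least squares rather than any particular method from the literature.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(arch):
    """Hypothetical feature vector for an architecture given as (depth, width)."""
    depth, width = arch
    return np.array([1.0, depth, np.log2(width), depth * np.log2(width)])

def expensive_score(arch):
    """Synthetic stand-in for training a network and measuring validation accuracy."""
    depth, width = arch
    return 0.5 + 0.03 * depth + 0.02 * np.log2(width) + rng.normal(scale=0.01)

# Step 1: train and evaluate a few hundred architectures to get (network, score) pairs.
evaluated = [(int(rng.integers(2, 12)), int(2 ** rng.integers(4, 10))) for _ in range(200)]
X = np.stack([features(a) for a in evaluated])
y = np.array([expensive_score(a) for a in evaluated])

# Step 2: learn a mapping from network features to predicted score.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

def predicted_score(arch):
    return float(features(arch) @ theta)

# Step 3: rank many candidates cheaply; only the shortlist gets a full training run.
candidates = [(d, w) for d in range(2, 12) for w in (16, 32, 64, 128, 256, 512)]
shortlist = sorted(candidates, key=predicted_score, reverse=True)[:3]
```

The predictor is only as good as its features and training pairs, which is why the text reserves a complete training procedure for the final candidates.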
21.5.3 Weight decay
In Section 19.4.3 we saw that regularization (limiting the complexity of a model) can aid
generalization. This is true for deep learning models as well. In the context of neural networks
we usually call this approach weight decay.
Weight decay consists of adding a penalty $\lambda \sum_{i,j} W_{i,j}^2$ to the loss function used to train the
neural network, where $\lambda$ is a hyperparameter controlling the strength of the penalty and the
sum is usually taken over all of the weights in the network. Using $\lambda = 0$ is equivalent to not using weight decay, while using larger values of $\lambda$ encourages the weights to become small. It is common to use weight decay with $\lambda$ near $10^{-4}$.
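As a small sketch of how the penalty enters training, the functions below compute $\lambda \sum_{i,j} W_{i,j}^2$ and add its gradient, $2\lambda W$, to a gradient step. The function names, the learning rate, and the plain gradient-descent update are assumptions for illustration.

```python
import numpy as np

def weight_decay_penalty(weights, lam=1e-4):
    """lam times the sum of squared weights over all weight matrices."""
    return lam * sum(float(np.sum(W ** 2)) for W in weights)

def weight_decay_grad(W, lam=1e-4):
    """Gradient of the penalty with respect to one weight matrix."""
    return 2.0 * lam * W

def sgd_step(W, data_grad, lr=0.1, lam=1e-4):
    """One gradient step on (data loss + weight decay penalty)."""
    return W - lr * (data_grad + weight_decay_grad(W, lam))

W = np.ones((2, 2))
penalty = weight_decay_penalty([W])                       # 1e-4 * 4
# With no data gradient, every weight shrinks by the factor (1 - 2 * lr * lam).
W_next = sgd_step(W, data_grad=np.zeros_like(W), lr=0.1, lam=0.5)
```

The shrink factor in the last line is where the name comes from: absent evidence from the data, the weights decay toward zero.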
Choosing a specific network architecture can be seen as an absolute constraint on the
hypothesis space: a function is either representable within that architecture or it is not. Loss
function penalty terms such as weight decay offer a softer constraint: functions represented
with large weights are in the function family, but the training set must provide more evidence in favor of these functions than is required to choose a function with small weights. It is not straightforward to interpret the effect of weight decay in a neural network. In
networks with sigmoid activation functions, it is hypothesized that weight decay helps to keep the activations near the linear part of the sigmoid, avoiding the flat operating region
that leads to vanishing gradients. With ReLU activation functions, weight decay seems to be beneficial, but the explanation that makes sense for sigmoids no longer applies because the ReLU’s output is either linear or zero.
Moreover, with residual connections, weight decay
encourages the network to have small differences between consecutive layers rather than
small absolute weight values.
Despite these differences in the behavior of weight decay
across many architectures, weight decay is still widely useful.
One explanation for the beneficial effect of weight decay is that it implements a form of maximum a posteriori (MAP) learning (see page 723). Letting X and y stand for the inputs
and outputs across the entire training set, the maximum a posteriori hypothesis $h_{MAP}$ satisfies
$$h_{MAP} = \operatorname{argmax}_{W} P(y \mid X, W)\, P(W) = \operatorname{argmin}_{W} \left[ -\log P(y \mid X, W) - \log P(W) \right] .$$
The first term is the usual cross-entropy loss; the second term prefers weights that are likely
under a prior distribution. This aligns exactly with a regularized loss function if we set
$$\log P(W) = -\lambda \sum_{i,j} W_{i,j}^2 ,$$
which means that P(W) is a zero-mean Gaussian prior.
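To see the link between $\lambda$ and the prior concretely, suppose (as an assumption for this sketch) that the $n$ weights are independent under the prior, each with variance $s^2$; the symbols $n$ and $s$ are introduced here and do not appear in the text. Then:

```latex
\begin{aligned}
P(W) &= \prod_{i,j} \frac{1}{\sqrt{2\pi s^2}} \exp\!\left(-\frac{W_{i,j}^2}{2 s^2}\right) \\
\log P(W) &= -\frac{1}{2 s^2} \sum_{i,j} W_{i,j}^2 \;-\; \frac{n}{2}\log\!\left(2\pi s^2\right) \\
\Rightarrow\quad \lambda &= \frac{1}{2 s^2}
\end{aligned}
```

The additive constant does not depend on W, so it has no effect on the argmin. A narrow prior (small $s^2$) corresponds to strong weight decay, and $\lambda = 0$ corresponds to a flat prior.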
21.5.4 Dropout
Another way that we can intervene to reduce the test-set error of a network—at the cost of making it harder to fit the training set—is to use dropout. At each step of training, dropout
applies one step of back-propagation learning to a new version of the network that is created
by deactivating a randomly chosen subset of the units. This is a rough and very low-cost approximation to training a large ensemble of different networks (see Section 19.8).
More specifically, let us suppose we are using stochastic gradient descent with minibatch
size m.
For each minibatch, the dropout algorithm applies the following process to every
node in the network: with probability p, the unit output is multiplied by a factor of 1/p;
otherwise, the unit output is fixed at zero. Dropout is typically applied to units in the hidden
layers with p=0.5; for input units, a value of p=0.8 turns out to be most effective. This process
produces a thinned network with about half as many units as the original, to which
back-propagation is applied with the minibatch of m training examples. The process repeats in the usual way until training is complete. At test time, the model is run with no dropout.
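The per-minibatch process just described can be sketched as below. The function name and array layout (examples in rows, units in columns) are assumptions; following the text, one keep/drop decision is drawn per unit for the whole minibatch, so all m examples pass through the same thinned network.

```python
import numpy as np

def dropout(activations, p, rng, train=True):
    """Keep each unit with probability p (scaling it by 1/p); otherwise zero it."""
    if not train:
        return activations                         # test time: no dropout
    keep = rng.random(activations.shape[-1]) < p   # one decision per unit
    return activations * keep / p

rng = np.random.default_rng(0)
hidden = np.ones((8, 1000))                # minibatch of 8 examples, 1000 hidden units
thinned = dropout(hidden, p=0.5, rng=rng)  # surviving units are scaled to 2.0
at_test = dropout(hidden, p=0.5, rng=rng, train=False)
```

Scaling the surviving units by 1/p keeps the expected value of each unit's output the same as without dropout, which is what allows the unmodified network to be used at test time.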
We can think of dropout from several perspectives:
• By introducing noise at training time, the model is forced to become robust to noise.
• As noted above, dropout approximates the creation of a large ensemble of thinned networks. This claim can be verified analytically for linear models, and appears to hold experimentally for deep learning models.
• Hidden units trained with dropout must learn not only to be useful hidden units; they must also learn to be compatible with many other possible sets of other hidden units that may or may not be included in the full model. This is similar to the selection processes that guide the evolution of genes: each gene must not only be effective in its own function, but must work well with other genes, whose identity in future organisms may vary considerably.
• Dropout applied to later layers in a deep network forces the final decision to be made robustly by paying attention to all of the abstract features of the example rather than focusing on just one and ignoring the others. For example, a classifier for animal images might be able to achieve high performance on the training set just by looking at the animal’s nose, but would presumably fail on a test case where the nose was obscured or damaged. With dropout, there will be training cases where the internal “nose unit” is zeroed out, causing the learning process to find additional identifying features. Notice that trying to achieve the same degree of robustness by adding noise to the input data would be difficult: there is no easy way to know in advance that the network is going to focus on noses, and no easy way to delete noses automatically from each image.
Altogether, dropout forces the model to learn multiple, robust explanations for each input.
This causes the model to generalize well, but also makes it more difficult to fit the training
set—it is usually necessary to use a larger model and to train it for more iterations.
21.6
Recurrent Neural Networks
Recurrent neural networks (RNNs) are distinct from feedforward networks in that they allow cycles in the computation graph. In all the cases we will consider, each cycle has a delay, so that units may take as input a value computed from their own output at an earlier step in the computation. (Without the delay, a cyclic circuit may reach an inconsistent state.) This allows the RNN to have internal state, or memory: inputs received at earlier time steps affect the RNN’s response to the current input.
Figure 21.8 (a) Schematic diagram of a basic RNN where the hidden layer z has recurrent connections; the Δ symbol indicates a delay. (b) The same network unrolled over three time steps to create a feedforward network. Note that the weights are shared across all time steps.
RNNs can also be used to perform more general computations (after all, ordinary computers are just Boolean circuits with memory) and to model real neural systems, many of which contain cyclic connections. Here we focus on the use of RNNs to analyze sequential data, where we assume that a new input vector $x_t$ arrives at each time step.
As tools for analyzing sequential data, RNNs can be compared to the hidden Markov
models, dynamic Bayesian networks, and Kalman filters described in Chapter 14. (The reader
may find it helpful to refer back to that chapter before proceeding.) Like those models, RNNs make a Markov assumption (see page 463): the hidden state $z_t$ of the network suffices to capture the information from all previous inputs. Furthermore, suppose we describe the RNN’s update process for the hidden state by the equation $z_t = f_w(z_{t-1}, x_t)$ for some parameterized function $f_w$. Once trained, this function represents a time-homogeneous process (page 463), effectively a universally quantified assertion that the dynamics represented by $f_w$ hold for all time steps. Thus, RNNs add expressive power compared to feedforward networks, just as convolutional networks do, and just as dynamic Bayes nets add expressive power compared to regular Bayes nets. Indeed, if you tried to use a feedforward network to analyze sequential data, the fixed size of the input layer would force the network to examine only a finite-length window of data, in which case the network would fail to detect long-distance dependencies.
21.6.1 Training a basic RNN
The basic model we will consider has an input layer x, a hidden layer z with recurrent con-
nections, and an output layer y, as shown in Figure 21.8(a). We assume that both x and y are
observed in the training data at each time step. The equations defining the model refer to the values of the variables indexed by time step t:
$$\begin{aligned} z_t &= f_w(z_{t-1}, x_t) = g_z(W_{z,z}\, z_{t-1} + W_{x,z}\, x_t) = g_z(in_{z,t}) \\ \hat{y}_t &= g_y(W_{z,y}\, z_t) = g_y(in_{y,t}), \end{aligned} \qquad (21.13)$$
where $g_z$ and $g_y$ denote the activation functions for the hidden and output layers, respectively.
As usual, we assume an extra dummy input fixed at +1 for each unit as well as bias weights
associated with those inputs. Given a sequence of input vectors $x_1, \ldots, x_T$ and observed outputs $y_1, \ldots, y_T$, we can turn this model into a feedforward network by “unrolling” it for T steps, as shown in Figure 21.8(b). Notice that the weight matrices $W_{x,z}$, $W_{z,z}$, and $W_{z,y}$ are shared across all time steps.
In the unrolled network, it is easy to see that we can calculate gradients to train the
weights in the usual way; the only difference is that the sharing of weights across layers makes the gradient computation a little more complicated.
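The unrolling can be made concrete with a short forward pass. This is a sketch under stated assumptions: the function name, the tanh hidden activation, the identity output activation, and the omission of bias terms are choices made for the example, not details fixed by the text.

```python
import numpy as np

def rnn_forward(xs, W_xz, W_zz, W_zy, z0, g_z=np.tanh, g_y=lambda v: v):
    """Apply Equation (21.13) at each time step, reusing the same weights throughout."""
    z, zs, ys = z0, [], []
    for x in xs:                          # one "layer" of the unrolled network per step
        z = g_z(W_zz @ z + W_xz @ x)      # hidden state carries memory forward
        ys.append(g_y(W_zy @ z))
        zs.append(z)
    return np.array(zs), np.array(ys)

rng = np.random.default_rng(0)
W_xz = rng.normal(size=(3, 2))
W_zz = 0.5 * rng.normal(size=(3, 3))
W_zy = rng.normal(size=(1, 3))
xs = rng.normal(size=(5, 2))              # sequence of T = 5 input vectors

zs, ys = rnn_forward(xs, W_xz, W_zz, W_zy, z0=np.zeros(3))

# Perturb only the first input: the final output changes, because of the recurrence.
xs_perturbed = xs.copy()
xs_perturbed[0] += 1.0
_, ys_perturbed = rnn_forward(xs_perturbed, W_xz, W_zz, W_zy, z0=np.zeros(3))
```

Setting `W_zz` to zero removes the recurrent connections, and the final output then depends only on the final input.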
To keep the equations simple, we will show the gradient calculation for an RNN with
just one input unit, one hidden unit, and one output unit.
For this case, making the bias
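weights explicit, we have $z_t = g_z(w_{z,z} z_{t-1} + w_{x,z} x_t + w_{0,z})$ and $\hat{y}_t = g_y(w_{z,y} z_t + w_{0,y})$. As in Equations (21.4) and (21.5), we will assume a squared-error loss L, in this case summed
over the time steps. The derivations for the input-layer and output-layer weights $w_{x,z}$ and $w_{z,y}$ are essentially identical to Equation (21.4), so we leave them as an exercise. For the hidden-layer weight $w_{z,z}$, the first few steps also follow the same pattern as Equation (21.4):
$$\begin{aligned}
\frac{\partial L}{\partial w_{z,z}} &= \frac{\partial}{\partial w_{z,z}} \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 = \sum_{t=1}^{T} -2 (y_t - \hat{y}_t) \frac{\partial \hat{y}_t}{\partial w_{z,z}} \\
&= \sum_{t=1}^{T} -2 (y_t - \hat{y}_t) \frac{\partial}{\partial w_{z,z}} g_y(in_{y,t}) = \sum_{t=1}^{T} -2 (y_t - \hat{y}_t)\, g_y'(in_{y,t}) \frac{\partial}{\partial w_{z,z}} in_{y,t} \\
&= \sum_{t=1}^{T} -2 (y_t - \hat{y}_t)\, g_y'(in_{y,t}) \frac{\partial}{\partial w_{z,z}} \left( w_{z,y} z_t + w_{0,y} \right) \\
&= \sum_{t=1}^{T} -2 (y_t - \hat{y}_t)\, g_y'(in_{y,t})\, w_{z,y} \frac{\partial z_t}{\partial w_{z,z}} .
\end{aligned} \qquad (21.14)$$
Now the gradient for the hidden unit $z_t$ can be obtained from the previous time step as follows:
$$\begin{aligned}
\frac{\partial z_t}{\partial w_{z,z}} &= \frac{\partial}{\partial w_{z,z}} g_z(in_{z,t}) = g_z'(in_{z,t}) \frac{\partial}{\partial w_{z,z}} in_{z,t} = g_z'(in_{z,t}) \frac{\partial}{\partial w_{z,z}} \left( w_{z,z} z_{t-1} + w_{x,z} x_t + w_{0,z} \right) \\
&= g_z'(in_{z,t}) \left( z_{t-1} + w_{z,z} \frac{\partial z_{t-1}}{\partial w_{z,z}} \right) ,
\end{aligned} \qquad (21.15)$$
where the last line uses the rule for derivatives of products: $\partial(uv)/\partial x = v\,\partial u/\partial x + u\,\partial v/\partial x$.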
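The one-unit derivation can be checked numerically. The sketch below evaluates the recursion of Equation (21.15) step by step alongside the forward pass and accumulates the sum in Equation (21.14); it then compares the result with a central finite difference. The tanh hidden activation, the identity output activation (so $g_y' = 1$), the parameter values, and the short data sequence are all assumptions chosen for the example.

```python
import numpy as np

def loss_and_grad(w_zz, w_xz, w_0z, w_zy, w_0y, xs, ys, z0=0.0):
    """Squared-error loss of a one-unit RNN and its derivative with respect to w_zz."""
    z_prev, dz_prev = z0, 0.0                 # z_0 does not depend on w_zz
    loss, grad = 0.0, 0.0
    for x, y in zip(xs, ys):
        z = np.tanh(w_zz * z_prev + w_xz * x + w_0z)
        # Equation (21.15), using tanh'(in) = 1 - tanh(in)^2
        dz = (1.0 - z ** 2) * (z_prev + w_zz * dz_prev)
        y_hat = w_zy * z + w_0y               # identity output activation
        loss += (y - y_hat) ** 2
        grad += -2.0 * (y - y_hat) * w_zy * dz   # one term of Equation (21.14)
        z_prev, dz_prev = z, dz
    return loss, grad

xs = [0.5, -1.0, 0.3, 0.8]
ys = [0.1, 0.0, -0.2, 0.4]
rest = dict(w_xz=0.4, w_0z=0.1, w_zy=1.3, w_0y=-0.2, xs=xs, ys=ys)

loss, grad = loss_and_grad(0.7, **rest)

h = 1e-6                                      # finite-difference step
fd = (loss_and_grad(0.7 + h, **rest)[0] - loss_and_grad(0.7 - h, **rest)[0]) / (2 * h)
```

The two numbers `grad` and `fd` should agree closely. Deep learning software typically organizes the same computation differently, but the quantity being computed is the one in Equation (21.14).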
Looking at Equation (21.15), we notice two things. First, the gradient expression is recursive: the contribution to the gradient from time step t is calculated using the contribution from time step t − 1. If we order the calculations in the right way, the total run time for computing the gradient will be linear in the size of the network. This algorithm is called back-propagation through time, and is usually handled automatically by deep learning software systems. Second, if we iterate the recursive calculation, we see that gradients at T will include terms proportional to $w_{z,z} \prod_{t=1}^{T} g_z'(in_{z,t})$. For sigmoids, tanhs, and ReLUs, $g' \le 1$, so our simple RNN will certainly suffer from the vanishing gradient problem (see page 756) if $w_{z,z} < 1$. On the other hand, if $w_{z,z} > 1$, we may experience the exploding gradient problem. (For the general case, these outcomes depend on the first eigenvalue of the weight matrix $W_{z,z}$.) The next section describes a more elaborate RNN design intended to mitigate this issue.
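A quick numeric illustration of the two regimes: unrolling Equation (21.15) multiplies the contribution from an early time step by a factor $w_{z,z}\, g_z'(in_{z,t})$ at every later step. The sketch below multiplies those factors out for an assumed constant $g' = 1$ (for instance a ReLU in its linear region). The function names are made up for the example, and reading "first eigenvalue" as the eigenvalue of largest magnitude is an interpretation.

```python
import numpy as np

def carried_factor(w_zz, g_primes):
    """Product over time steps of w_zz * g_z'(in_z,t) for a one-unit RNN."""
    return float(np.prod(w_zz * np.asarray(g_primes, dtype=float)))

def spectral_radius(W_zz):
    """Largest eigenvalue magnitude of a recurrent weight matrix."""
    return float(np.max(np.abs(np.linalg.eigvals(W_zz))))

T = 50
g_primes = np.ones(T)                         # assume g' = 1 at every step
shrinking = carried_factor(0.5, g_primes)     # 0.5 ** 50: vanishing
growing = carried_factor(1.5, g_primes)       # 1.5 ** 50: exploding

radius = spectral_radius(np.diag([0.5, 2.0]))
```

Any $g' < 1$ only shrinks the product further, which is why the vanishing case is described as certain while the exploding case is only possible.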
21.6.2 Long short-term memory RNNs
Several specialized RNN architectures have been designed with the goal of enabling information to be preserved over many time steps. One of the most popular is the long short-term memory or LSTM. The long-term memory component of an LSTM, called the memory cell and denoted by c, is essentially copied from time step to time step. (In contrast, the basic RNN multiplies its memory by a weight matrix at every time step, as shown in Equation (21.13).) New information enters the memory by adding updates; in this way, the gradient expressions do not accumulate multiplicatively over time. LSTMs also include gating units, which are vectors that control the flow of information in the LSTM via elementwise multiplication of the corresponding information vector:
• The forget gate f determines if each element of the memory cell is remembered (copied to the next time step) or forgotten (reset to zero).
• The input gate i determines if each element of the memory cell is updated additively by new information from the input vector at the current time step.
• The output gate o determines if each element of the memory cell is transferred to the short-term memory z, which plays a similar role to the hidden state in basic RNNs.
Whereas the word “gate” in circuit design usually connotes a Boolean function, gates in
LSTMs are soft: for example, elements of the memory cell vector will be partially forgotten
if the corresponding elements of the forget-gate vector are small but not zero. The values for the gating units are always in the range [0, 1] and are obtained as the outputs of a sigmoid function applied to the current input and the previous hidden state. In detail, the update equations for the LSTM are as follows:
$$\begin{aligned}
f_t &= \sigma(W_{x,f}\, x_t + W_{z,f}\, z_{t-1}) \\
i_t &= \sigma(W_{x,i}\, x_t + W_{z,i}\, z_{t-1}) \\
o_t &= \sigma(W_{x,o}\, x_t + W_{z,o}\, z_{t-1}) \\
c_t &= c_{t-1} \odot f_t + i_t \odot \tanh(W_{x,c}\, x_t + W_{z,c}\, z_{t-1}) \\
z_t &= \tanh(c_t) \odot o_t ,
\end{aligned}$$
where the subscripts on the various weight matrices W indicate the origin and destination of
the corresponding links. The $\odot$ symbol denotes elementwise multiplication.
LSTMs were among the first practically usable forms of RNN. They have demonstrated excellent performance on a wide range of tasks including speech recognition and handwriting recognition. Their use in natural language processing is discussed in Chapter 24.
21.7 Unsupervised Learning and Transfer Learning
The deep learning systems we have discussed so far are based on supervised learning, which requires each training example to be labeled with a value for the target function.
Although
such systems can reach a high level of test-set accuracy—as shown by the ImageNet com-
petition results, for example—they often require far more labeled data than a human would for the same task. For example, a child needs to see only one picture of a giraffe, rather than thousands, in order to be able to recognize giraffes reliably in a wide range of settings
and views. Clearly, something is missing in our deep learning story; indeed, it may be the
case that our current approach to supervised deep learning renders some tasks completely
unattainable because the requirements for labeled data would exceed what the human race
(or the universe) can supply. Moreover, even in cases where the task is feasible, labeling large data sets usually requires scarce and expensive human labor.
For these reasons, there is intense interest in several learning paradigms that reduce the
dependence on labeled data. As we saw in Chapter 19, these paradigms include unsuper-
vised learning, transfer learning, and semisupervised learning. Unsupervised learning
algorithms learn solely from unlabeled inputs x, which are often more abundantly available than labeled examples. Unsupervised learning algorithms typically produce generative models, which can produce realistic text, images, audio, and video, rather than simply predicting labels for such data. Transfer learning algorithms require some labeled examples but are able to improve their performance further by studying labeled examples for different tasks, thus making it possible to draw on more existing sources of data. Semisupervised learning algo-
rithms require some labeled examples but are able to improve their performance further by also studying unlabeled examples. This section covers deep learning approaches to unsupervised and transfer learning; while semisupervised learning is also an active area of research in the deep learning community, the techniques developed so far have not proven broadly
effective in practice, so we do not cover them.
21.7.1 Unsupervised learning
Supervised learning algorithms all have essentially the same goal: given a training set of inputs x and corresponding outputs y = f(x), learn a function h that approximates f well.
Unsupervised learning algorithms, on the other hand, take a training set of unlabeled exam-
ples x. Here we describe two things that such an algorithm might try to do. The first is to
learn new representations—for example, new features of images that make it easier to iden-
tify the objects in an image. The second is to learn a generative model—typically in the form
of a probability distribution from which new samples can be generated. (The algorithms for learning Bayes nets in Chapter 20 fall in this category.) Many algorithms are capable of both representation learning and generative modeling. Suppose we learn a joint model $P_W(x, z)$, where z is a set of latent, unobserved variables that represent the content of the data x in some way. In keeping with the spirit of the chapter, we do not predefine the meanings of the z variables; the model is free to learn to associate
z with x however it chooses. For example, a model trained on images of handwritten digits
might choose to use one direction in z space to represent the thickness of pen strokes, another to represent ink color, another to represent background color, and so on. With images of faces, the learning algorithm might choose one direction to represent gender and another to
capture the presence or absence of glasses, as illustrated in Figure 21.9. A learned probability model $P_W(x, z)$ achieves both representation learning (it has constructed meaningful z vectors from the raw x vectors) and generative modeling: if we integrate z out of $P_W(x, z)$ we obtain $P_W(x)$.
Probabilistic PCA: A simple generative model
There have been many proposals for the form that $P_W(x, z)$ might take. One of the simplest
is the probabilistic principal components analysis (PPCA) model.⁷ In a PPCA model, z
is chosen from a zero-mean, spherical Gaussian, then x is generated from z by applying a weight matrix W and adding spherical Gaussian noise:
$$\begin{aligned} P(z) &= N(z; 0, I) \\ P_W(x \mid z) &= N(x; Wz, \sigma^2 I) . \end{aligned}$$
The weights W (and optionally the noise parameter $\sigma^2$) can be learned by maximizing the likelihood of the data, given by
$$P_W(x) = \int P_W(x, z)\, dz = N(x; 0, WW^\top + \sigma^2 I) . \qquad (21.16)$$
Figure 21.9 A demonstration of how a generative model has learned to use different directions in z space to represent different aspects of faces. We can actually perform arithmetic in z space. The images here are all generated from the learned model and show what happens when we decode different points in z space. We start with the coordinates for the concept of “man with glasses,” subtract off the coordinates for “man,” add the coordinates for “woman,” and obtain the coordinates for “woman with glasses.” Images reproduced with permission from (Radford et al., 2015).
The maximization with respect to W can be done by gradient methods or by an efficient iterative EM algorithm (see Section 20.3). Once W has been learned, new data samples can be generated directly from $P_W(x)$ using Equation (21.16). Moreover, new observations x that have very low probability according to Equation (21.16) can be flagged as potential anomalies.
With PPCA, we usually assume that the dimensionality of z is much less than the dimensionality of x, so that the model learns to explain the data as well as possible in terms of a small number of features. These features can be extracted for use in standard classifiers by computing $\hat{z}$, the expectation of $P_W(z \mid x)$.
Generating data from a probabilistic PCA model is straightforward: first sample z from its fixed Gaussian prior, then sample x from a Gaussian with mean Wz.
As we will see
shortly, many other generative models resemble this process, but use complicated mappings
defined by deep models rather than linear mappings from z-space to x-space.
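The three uses of PPCA mentioned above (sampling, scoring observations for anomaly detection, and feature extraction) can be sketched directly from the model definition. The weight matrix, noise level, and function names below are assumptions for illustration. The closed form used for $\hat{z}$, namely $(W^\top W + \sigma^2 I)^{-1} W^\top x$, is the standard Gaussian conditioning result and is not derived in the text.

```python
import numpy as np

def sample_ppca(W, sigma, n, rng):
    """Draw z ~ N(0, I), then x ~ N(Wz, sigma^2 I)."""
    d, k = W.shape
    z = rng.standard_normal((n, k))
    x = z @ W.T + sigma * rng.standard_normal((n, d))
    return x, z

def log_likelihood(x, W, sigma):
    """Row-wise log N(x; 0, W W^T + sigma^2 I), as in Equation (21.16)."""
    d = W.shape[0]
    C = W @ W.T + sigma ** 2 * np.eye(d)
    _, logdet = np.linalg.slogdet(C)
    maha = np.einsum("ni,ij,nj->n", x, np.linalg.inv(C), x)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def posterior_mean(x, W, sigma):
    """Expectation of P_W(z | x) for each row of x."""
    k = W.shape[1]
    M = W.T @ W + sigma ** 2 * np.eye(k)
    return np.linalg.solve(M, W.T @ x.T).T

rng = np.random.default_rng(0)
W = np.array([[2.0], [1.0], [0.0]])       # 3-dimensional x, 1-dimensional z
sigma = 0.1

x, z = sample_ppca(W, sigma, n=50_000, rng=rng)
typical = log_likelihood(x, W, sigma)
odd_point = log_likelihood(np.array([[0.0, 0.0, 3.0]]), W, sigma)   # far off the subspace
z_hat = posterior_mean(x, W, sigma)
```

The off-subspace point receives a much lower log likelihood than typical samples, which is the basis for flagging it as a potential anomaly, and `z_hat` recovers the latent value up to the noise.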
7 Standard PCA involves fitting a multivariate Gaussian to the raw input data and then selecting out the longest axes—the principal components—of that ellipsoidal distribution.
Autoencoders
Many unsupervised deep learning algorithms are based on the idea of an autoencoder. An autoencoder is a model containing two parts: an encoder that maps from x to a representation $\hat{z}$ and a decoder that maps from a representation $\hat{z}$ to observed data x. In general, the encoder is just a parameterized function f and the decoder is just a parameterized function g. The model is trained so that $x \approx g(f(x))$, so that the encoding process is roughly inverted by the decoding process. The functions f and g can be simple linear models parameterized by a single matrix or they can be represented by a deep neural network.
A very simple autoencoder is the linear autoencoder, where both f and g are linear with a shared weight matrix W:
$$\begin{aligned} \hat{z} &= f(x) = Wx \\ x &= g(\hat{z}) = W^\top \hat{z} . \end{aligned}$$
One way to train this model is to minimize the squared error $\sum_j \lVert x_j - g(f(x_j)) \rVert^2$ so that
$x \approx g(f(x))$. The idea is to train W so that a low-dimensional $\hat{z}$ will retain as much information as possible to reconstruct the high-dimensional data x.
This linear autoencoder
turns out to be closely connected to classical principal components analysis (PCA). When
z is m-dimensional, the matrix W should learn to span the m principal components of the data—in other words, the set of m orthogonal directions in which the data has highest vari-
ance, or equivalently the m eigenvectors of the data covariance matrix that have the largest
eigenvalues, exactly as in PCA.
The PCA model is a simple generative model that corresponds to a simple linear autoencoder. The correspondence suggests that there may be a way to capture more complex kinds of generative models using more complex kinds of autoencoders. The variational autoencoder (VAE) provides one way to do this.
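The connection between the linear autoencoder and PCA described above can be checked numerically without running any training. The sketch below plugs the top-m eigenvectors of the data covariance in as the shared matrix W and compares the reconstruction error with that of a random orthonormal W. The synthetic data, the choice m = 2, and the function name are assumptions for the example, and the random projection is only a baseline, not a trained autoencoder.

```python
import numpy as np

def reconstruction_error(X, W):
    """Mean of ||x - W^T W x||^2 for a linear autoencoder with shared W of shape (m, d)."""
    Z = X @ W.T                          # encoder: z = W x
    X_rec = Z @ W                        # decoder: x = W^T z
    return float(np.mean(np.sum((X - X_rec) ** 2, axis=1)))

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])
X = X - X.mean(axis=0)                   # center the data

cov = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
m = 2
W_pca = eigvecs[:, ::-1][:, :m].T        # top-m eigenvectors as rows

Q, _ = np.linalg.qr(rng.standard_normal((5, m)))
W_rand = Q.T                             # random orthonormal rows, for comparison

err_pca = reconstruction_error(X, W_pca)
err_rand = reconstruction_error(X, W_rand)
discarded_variance = float(eigvals[: 5 - m].sum())
```

With W set to the principal directions, the reconstruction error equals the variance in the discarded directions, and it is no larger than the error of the random projection.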
Variational methods were introduced briefly on page 458 as a way to approximate the
posterior distribution in complex probability models, where summing or integrating out a large number of hidden variables is intractable.
The idea is to use a variational posterior
Q(z), drawn from a computationally tractable family of distributions, as an approximation to
the true posterior. For example, we might choose Q from the family of Gaussian distributions
with a diagonal covariance matrix. Within the chosen family of tractable distributions, Q is
optimized to be as close as possible to the true posterior distribution P(z|x).
For our purposes, the notion of “as close as possible” is defined by the KL divergence, which we mentioned on page 758. This is given by
$$D_{KL}(Q(z) \,\|\, P(z \mid x)) = \int Q(z) \log \frac{Q(z)}{P(z \mid x)}\, dz ,$$
which is an average (with respect to Q) of the log ratio between Q and P. It is easy to see
that $D_{KL}(Q(z) \,\|\, P(z \mid x)) \ge 0$, with equality when Q and P coincide. We can then define the
variational lower bound $\mathcal{L}$ (sometimes called the evidence lower bound, or ELBO) on the log likelihood of the data:
$$\mathcal{L}(x, Q) = \log P(x) - D_{KL}(Q(z) \,\|\, P(z \mid x)) . \qquad (21.17)$$
We can see that $\mathcal{L}$ is a lower bound for log P because the KL divergence is nonnegative. Variational learning maximizes $\mathcal{L}$ with respect to parameters w rather than maximizing log P(x),
in the hope that the solution found, $w^*$, is close to maximizing log P(x) as well.
Section 21.7 Unsupervised Learning and Transfer Learning
As written, $\mathcal{L}$ does not yet seem to be any easier to maximize than log P. Fortunately, we can rewrite Equation (21.17) to reveal improved computational tractability:
$$\begin{aligned}
\mathcal{L} &= \log P(x) - \int Q(z)\,\log\frac{Q(z)}{P(z|x)}\,dz \\
&= -\int Q(z)\,\log Q(z)\,dz + \int Q(z)\,\log P(x)P(z|x)\,dz \\
&= H(Q) + E_{z\sim Q}\log P(z,x)
\end{aligned}$$
where H(Q) is the entropy of the Q distribution. For some variational families $\mathcal{Q}$ (such as Gaussian distributions), H(Q) can be evaluated analytically. Moreover, the expectation, $E_{z\sim Q}\log P(z,x)$, admits an efficient unbiased estimate via samples of z from Q. For each sample, P(z,x) can usually be evaluated efficiently; for example, if P is a Bayes net, P(z,x) is just a product of conditional probabilities because z and x comprise all the variables.
Variational autoencoders provide a means of performing variational learning in the deep learning setting. Variational learning involves maximizing $\mathcal{L}$ with respect to the parameters of both P and Q. For a variational autoencoder, the decoder g(z) is interpreted as defining log P(x|z). For example, the output of the decoder might define the mean of a conditional Gaussian. Similarly, the output of the encoder f(x) is interpreted as defining the parameters of Q; for example, Q might be a Gaussian with mean f(x). Training the variational autoencoder then consists of maximizing $\mathcal{L}$ with respect to the parameters of both the encoder f and the decoder g, which can themselves be arbitrarily complicated deep networks.
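To make the sampling-based estimate of the lower bound concrete, here is a minimal Python sketch for a toy model. The model is an assumption of this example, not something taken from the text: a scalar latent z with prior N(0, 1), likelihood x | z ~ N(z, 1), and a Gaussian Q with mean `q_mean` and variance `q_variance`. For this toy model the exact evidence is N(x; 0, 2) and the exact posterior is N(x/2, 1/2), so the bound can be checked directly. All function names are illustrative.

```python
import math
import random

def log_normal_pdf(value, mean, variance):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * variance) + (value - mean) ** 2 / variance)

def log_joint(z, x):
    """log P(z, x) = log P(z) + log P(x | z) for the assumed toy model."""
    return log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)

def gaussian_entropy(variance):
    """Analytic entropy H(Q) of a one-dimensional Gaussian Q."""
    return 0.5 * math.log(2.0 * math.pi * math.e * variance)

def elbo_estimate(x, q_mean, q_variance, num_samples=20000, seed=0):
    """Estimate H(Q) + E_{z~Q} log P(z, x) by sampling z from Q."""
    rng = random.Random(seed)
    q_std = math.sqrt(q_variance)
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(q_mean, q_std)
        total += log_joint(z, x)
    return gaussian_entropy(q_variance) + total / num_samples

def log_evidence(x):
    """Exact log P(x) for the toy model, where x ~ N(0, 2) marginally."""
    return log_normal_pdf(x, 0.0, 2.0)

x_observed = 1.0
elbo_at_posterior = elbo_estimate(x_observed, q_mean=0.5, q_variance=0.5)
elbo_poor_fit = elbo_estimate(x_observed, q_mean=-1.0, q_variance=2.0)
```

With x = 1, choosing Q equal to the exact posterior makes the estimate agree with log P(x) up to sampling noise, while a poorly matched Q gives a clearly smaller value, as the bound predicts.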
Deep autoregressive models
An autoregressive model (or AR model) is one in which each element $x_i$ of the data vector x is predicted based on other elements of the vector. Such a model has no latent variables. If x is of fixed size, an AR model can be thought of as a fully observable and possibly fully connected Bayes net. This means that calculating the likelihood of a given data vector according to an AR model is trivial; the same holds for predicting the value of a single missing variable given all the others, and for sampling a data vector from the model.
The most common application of autoregressive models is in the analysis of time series data, where an AR model of order k predicts $x_t$ given $x_{t-k}, \ldots, x_{t-1}$. In the terminology of Chapter 14, an AR model is a non-hidden Markov model. In the terminology of Chapter 23, an n-gram model of letter or word sequences is an AR model of order n − 1.
In classical AR models, where the variables are real-valued, the conditional distribution $P(x_t \mid x_{t-k}, \ldots, x_{t-1})$ is a linear-Gaussian model with fixed variance whose mean is a weighted linear combination of $x_{t-k}, \ldots, x_{t-1}$; in other words, a standard linear regression model. The maximum likelihood solution is given by the Yule-Walker equations, which are closely related to the normal equations on page 680.
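As a rough illustration of fitting a classical AR model, the sketch below simulates an AR(2) series and recovers its two coefficients by solving the order-2 Yule-Walker equations with sample autocovariances. The coefficients, series length, and function names are assumptions chosen for the example; for a general order k one would solve the corresponding k-by-k linear system rather than use the closed form shown here.

```python
import random

def simulate_ar2(a1, a2, length, seed=0):
    """Generate x_t = a1*x_{t-1} + a2*x_{t-2} + unit-variance Gaussian noise."""
    rng = random.Random(seed)
    series = [0.0, 0.0]
    for _ in range(length):
        series.append(a1 * series[-1] + a2 * series[-2] + rng.gauss(0.0, 1.0))
    return series[2:]

def autocovariance(series, lag):
    """Sample autocovariance of the series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    return sum((series[t] - mean) * (series[t - lag] - mean)
               for t in range(lag, n)) / n

def yule_walker_ar2(series):
    """Solve [[r0, r1], [r1, r0]] [a1, a2]^T = [r1, r2]^T by Cramer's rule."""
    r0, r1, r2 = (autocovariance(series, lag) for lag in range(3))
    det = r0 * r0 - r1 * r1
    a1 = (r1 * r0 - r1 * r2) / det
    a2 = (r0 * r2 - r1 * r1) / det
    return a1, a2

series = simulate_ar2(a1=0.6, a2=-0.3, length=20000)
estimated_a1, estimated_a2 = yule_walker_ar2(series)
```

With a long enough series, the estimates should land close to the generating coefficients 0.6 and -0.3; a deep autoregressive model replaces this linear predictor with a network but keeps the same "predict the next element from earlier ones" structure.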
A deep autoregressive model is one in which the linear-Gaussian model is replaced by an arbitrary deep network with a suitable output layer depending on whether $x_t$ is discrete or continuous. Recent applications of this autoregressive approach include DeepMind’s WaveNet model for speech generation (van den Oord et al., 2016a). WaveNet is trained on raw acoustic signals, sampled 16,000 times per second, and implements a nonlinear AR model of order 4800 with a multilayer convolutional structure. In tests it proves to be substantially more realistic than previous state-of-the-art speech generation systems.
Chapter 21
Deep Learning
Generative adversarial networks
A generative adversarial network (GAN) is actually a pair of networks that combine to form a generative system. One of the networks, the generator, maps values from z to x in order to produce samples from the distribution $P_w(x)$. A typical scheme samples z from a unit Gaussian of moderate dimension and then passes it through a deep network $h_w$ to obtain x. The other network, the discriminator, is a classifier trained to classify inputs x as real (drawn from the training set) or fake (created by the generator). GANs are a kind of implicit model in the sense that samples can be generated but their probabilities are not readily available; in a Bayes net, on the other hand, the probability of a sample is just the product of the conditional probabilities along the sample generation path.
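As a rough sketch of how the two networks can be scored against each other, the functions below compute one common pair of objectives: a binary cross-entropy loss for the discriminator and the so-called non-saturating loss for the generator. This particular formulation is an assumption brought in for illustration (the section does not spell out a loss), and the inputs are simply lists of discriminator outputs in (0, 1), standing in for real networks.

```python
import math

def discriminator_loss(real_scores, fake_scores):
    """Cross-entropy: real inputs should score near 1, generated ones near 0."""
    real_term = -sum(math.log(d) for d in real_scores) / len(real_scores)
    fake_term = -sum(math.log(1.0 - d) for d in fake_scores) / len(fake_scores)
    return real_term + fake_term

def generator_loss(fake_scores):
    """Non-saturating loss: the generator wants its samples scored near 1."""
    return -sum(math.log(d) for d in fake_scores) / len(fake_scores)

# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
undecided_loss = discriminator_loss([0.5, 0.5], [0.5, 0.5])
# A discriminator that separates the two sets well has a much lower loss.
confident_loss = discriminator_loss([0.9, 0.95], [0.1, 0.05])
```

When the discriminator is reduced to guessing (every score equal to 0.5), its loss under this formulation is 2 log 2, and a generator whose samples receive higher scores has a lower generator loss.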
The generator is closely related to the decoder from the variational autoencoder framework. The challenge in implicit modeling is to design a loss function that makes it possible to train the model using samples from the distribution, rather than maximizing the likelihood assigned to training examples from the data set.
Both the generator and the discriminator are trained simultaneously, with the generator learning to fool the discriminator and the discriminator learning to accurately separate real from fake data. The competition between generator and discriminator can be described in the language of game theory (see Chapter 18). The idea is that in the equilibrium state of the game, the generator should reproduce the training distribution perfectly, such that the discriminator cannot perform better than random guessing. GANs have worked particularly well for image generation tasks. For example, GANs can create photorealistic, high-resolution images of people who have never existed (Karras et al., 2017).
Unsupervised translation
Translation tasks, broadly construed, consist of transforming an input x that has rich structure into an output y that also has rich structure.
In this context, “rich structure” means that the
data are multidimensional and have interesting statistical dependencies among the various
dimensions. Images and natural language sentences have a rich structure, but a single number, such as a class ID, does not. Transforming a sentence from English to French or converting a photo of a night scene into an equivalent photo taken during the daytime are both examples
of translation tasks.
Supervised translation consists of gathering many (x, y) pairs and training the model to map each x to the corresponding y. For example, machine translation systems are often trained on pairs of sentences that have been translated by professional human translators. For other kinds of translation, supervised training data may not be available. For example, consider a photo of a night scene containing many moving cars and pedestrians. It is presumably not feasible to find all of the cars and pedestrians and return them to their original positions in the night-time photo in order to retake the same photo in the daytime. To overcome this difficulty, it is possible to use unsupervised translation techniques that are capable of training on many examples of x and many separate examples of y but no corresponding (x, y) pairs.
These approaches are generally based on GANs; for example, one can train a GAN generator to produce a realistic example of y when conditioned on x, and another GAN generator to perform the reverse mapping. The GAN training framework makes it possible to train a generator to generate any one of many possible samples that the discriminator accepts as a realistic example of y given x, without any need for a specific paired y as is traditionally needed in supervised learning. More detail on unsupervised translation for images is given in Section 25.7.5.
21.7.2 Transfer learning and multitask learning
In transfer learning, experience with one learning task helps an agent learn better on another task. For example, a person who has already learned to play tennis will typically find it easier to learn related sports such as racquetball and squash; a pilot who has learned to fly one type of commercial passenger airplane will very quickly learn to fly another type; a student who has already learned algebra finds it easier to learn calculus.
We do not yet know the mechanisms of human transfer learning.
For neural networks,
learning consists of adjusting weights, so the most plausible approach for transfer learning is
to copy over the weights learned for task A to a network that will be trained for task B. The
weights are then updated by gradient descent in the usual way using data for task B. It may be a good idea to use a smaller learning rate in task B, depending on how similar the tasks are and how much data was used in task A.
Notice that this approach requires human expertise in selecting the tasks: for example,
weights learned during algebra training may not be very useful in a network intended for racquetball.
Also, the notion of copying weights requires a simple mapping between the
input spaces for the two tasks and essentially identical network architectures.
One reason for the popularity of transfer learning is the availability of high-quality pretrained models.
For example, you could download a pretrained visual object recognition
model such as the ResNet-50 model trained on the COCO data set, thereby saving yourself
weeks of work. From there you can modify the model parameters by supplying additional images and object labels for your specific task.
Suppose you want to classify types of unicycles. You have only a few hundred pictures
of different unicycles, but the COCO data set has over 3,000 images in each of the categories
of bicycles, motorcycles, and skateboards. This means that a model pretrained on COCO
already has experience with wheels and roads and other relevant features that will be helpful
in interpreting the unicycle images.
Often you will want to freeze the first few layers of the pretrained model—these layers
serve as feature detectors that will be useful for your new model. Your new data set will be
allowed to modify the parameters of the higher levels only; these are the layers that identify problem-specific features and do classification. However, sometimes the difference between sensors means that even the lowest-level layers need to be retrained.
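The freezing idea can be sketched without any deep learning library. In the toy example below, every name, weight, and data point is hypothetical: a "pretrained" layer of tanh units is held fixed as a feature detector, and only a small linear head is trained by gradient descent on the new task. In a real framework one would instead mark the early layers' parameters as non-trainable and let the optimizer update the rest.

```python
import math

# Hypothetical "pretrained" feature layer: one (weight, bias) pair per tanh unit.
# Stored as a tuple of tuples so fine-tuning cannot modify it.
FROZEN_LAYER = ((2.0, 0.0), (1.0, -1.0), (-1.0, 0.5))

def features(x, frozen_layer=FROZEN_LAYER):
    """Apply the frozen feature detectors to a scalar input."""
    return [math.tanh(w * x + b) for w, b in frozen_layer]

def predict(x, head):
    """Linear head on top of the frozen features."""
    weights, bias = head
    return sum(h * v for h, v in zip(features(x), weights)) + bias

def mean_squared_error(data, head):
    return sum((predict(x, head) - y) ** 2 for x, y in data) / len(data)

def fine_tune_head(data, head, learning_rate=0.05, epochs=200):
    """Stochastic gradient descent on the head only; FROZEN_LAYER is untouched."""
    weights, bias = list(head[0]), head[1]
    for _ in range(epochs):
        for x, y in data:
            hidden = features(x)
            error = sum(h * v for h, v in zip(hidden, weights)) + bias - y
            weights = [v - learning_rate * error * h for v, h in zip(weights, hidden)]
            bias -= learning_rate * error
    return weights, bias

# Toy "task B" data (made up for the example).
inputs = [-2.0 + 0.2 * i for i in range(21)]
task_b_data = [(x, 1.5 * math.tanh(2.0 * x) - 0.5 * math.tanh(x - 1.0) + 0.2)
               for x in inputs]
initial_head = ([0.0, 0.0, 0.0], 0.0)
tuned_head = fine_tune_head(task_b_data, initial_head)
```

Only the four head parameters change during training, which is why a few hundred examples can be enough when the frozen features are already relevant to the new task.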
As another example, for those building a natural language system, it is now common to start with a pretrained model such as the RoBERTa model (see Section 24.6), which already “knows” a great deal about the vocabulary and syntax of everyday language. The next step is to fine-tune the model in two ways. First, by giving it examples of the specialized vocabulary used in the desired domain; perhaps a medical domain (where it will learn about “myocardial infarction”) or perhaps a financial domain (where it will learn about “fiduciary
responsibility”). Second, by training the model on the task it is to perform. If it is to do question answering, train it on question/answer pairs.
One very important kind of transfer learning involves transfer between simulations and the real world. For example, the controller for a self-driving car can be trained on billions of miles of simulated driving, which would be impossible in the real world. Then, when the controller is transitioned to the real vehicle, it adapts quickly to the new environment.
Multitask learning is a form of transfer learning in which we simultaneously train a model on multiple objectives. For example, rather than training a natural language system on part-of-speech tagging and then transferring the learned weights to a new task such as document classification, we train one system simultaneously on part-of-speech tagging, document classification, language detection, word prediction, sentence difficulty modeling, plagiarism detection, sentence entailment, and question answering. The idea is that to solve any one of these tasks, a model might be able to take advantage of superficial features of the data. But to solve all eight at once with a common representation layer, the model is more likely to create a common representation that reflects real natural language usage and content.
21.8 Applications
Deep learning has been applied successfully to many important problem areas in AI. For in-depth explanations, we refer the reader to the relevant chapters: Chapter 22 for the use of deep learning in reinforcement learning systems, Chapter 24 for natural language processing, Chapter 25 (particularly Section 25.4) for computer vision, and Chapter 26 for robotics.
21.8.1 Vision
We begin with computer vision, which is the application area that has arguably had the biggest impact on deep learning, and vice versa. Although deep convolutional networks had been in use since the 1990s for tasks such as handwriting recognition, and neural networks had begun to surpass generative probability models for speech recognition by around 2010, it was the success of the AlexNet deep learning system in the 2012 ImageNet competition that propelled
deep learning into the limelight. The ImageNet competition was a supervised learning task with 1,200,000 images in 1,000 different categories, and systems were evaluated on the “top-5” score—how often the correct category appears in the top five predictions. AlexNet achieved an error rate of 15.3%, whereas the next best system had an error rate of more than 25%.
AlexNet had five convolutional layers interspersed with max-pooling layers, followed by three fully connected layers. It used ReLU activation functions and took advantage of GPUs to speed up the process of training 60 million weights.
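The "top-5" score used in the ImageNet evaluation can be written down in a few lines. The sketch below is a generic top-k error function; the scores and labels are made-up toy values over four classes (with k = 2 standing in for k = 5), not ImageNet data.

```python
def top_k_error(score_rows, true_labels, k=5):
    """Fraction of examples whose true label is not among the k highest scores."""
    misses = 0
    for scores, label in zip(score_rows, true_labels):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(true_labels)

# Toy predictions: three examples, four classes each.
toy_scores = [
    [0.10, 0.60, 0.20, 0.10],  # true class 2 is ranked second
    [0.40, 0.30, 0.20, 0.10],  # true class 3 is ranked last
    [0.05, 0.05, 0.10, 0.80],  # true class 3 is ranked first
]
toy_labels = [2, 3, 3]
top1_error = top_k_error(toy_scores, toy_labels, k=1)
top2_error = top_k_error(toy_scores, toy_labels, k=2)
```

On the toy data, two of the three examples are wrong under the top-1 criterion but only one is wrong under top-2, which shows why top-k error is always at most the top-1 error.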
Since 2012, with improvements in network design, training methods, and computing resources, the top-5 error rate has been reduced to less than 2%, well below the error rate of a trained human (around 5%). CNNs have been applied to a wide range of vision tasks, from self-driving cars to grading cucumbers.⁸ Driving, which is covered in Section 25.7.6 and in several sections of Chapter 26, is among the most demanding of vision tasks: not only must the algorithm detect, localize, track, and recognize pigeons, paper bags, and pedestrians, but it has to do it in real time with near-perfect accuracy.
⁸ The widely known tale of the Japanese cucumber farmer who built his own cucumber-sorting robot using TensorFlow is, it turns out, mostly mythical. The algorithm was developed by the farmer’s son, who worked previously as a software engineer at Toyota, and its low accuracy (about 70%) meant that the cucumbers still had to be sorted by hand (Zeeberg, 2017).
21.8.2 Natural language processing
Deep learning has also had a huge impact on natural language processing (NLP) applications, such as machine translation and speech recognition. Some advantages of deep learning for these applications include the possibility of end-to-end learning, the automatic generation of internal representations for the meanings of words, and the interchangeability of learned encoders and decoders.
End-to-end learning refers to the construction of entire systems as a single, learned function f. For example, an f for machine translation might take as input an English sentence $S_E$ and produce an equivalent Japanese sentence $S_J = f(S_E)$. Such an f can be learned from training data in the form of human-translated pairs of sentences (or even pairs of texts, where the alignment of corresponding sentences or phrases is part of the problem to be solved). A more classical pipeline approach might first parse $S_E$, then extract its meaning, then re-express the meaning in Japanese as $S_J$, then post-edit $S_J$ using a language model for Japanese. This pipeline approach has two major drawbacks:
first, errors are compounded at each stage; and
second, humans have to determine what constitutes a “parse tree” and a “meaning representation,” but there is no easily accessible ground truth for these notions, and our theoretical
ideas about them are almost certainly incomplete.
At our present stage of understanding, then, the classical pipeline approach (which, at least naively, seems to correspond to how a human translator works) is outperformed by the end-to-end method made possible by deep learning. For example, Wu et al. (2016b) showed that end-to-end translation using deep learning reduced translation errors by 60% relative to a previous pipeline-based system. As of 2020, machine translation systems are approaching human performance for language pairs such as French and English for which very large paired data sets are available, and they are usable for other language pairs covering the majority of Earth’s population. There is even some evidence that networks trained on multiple languages do in fact learn an internal meaning representation: for example, after learning to translate Portuguese to English and English to Spanish, it is possible to translate Portuguese directly into Spanish without any Portuguese/Spanish sentence pairs in the training set.
One of the most significant findings to emerge from the application of deep learning to language tasks is that a great deal of mileage comes from re-representing individual words as vectors in a high-dimensional space, so-called word embeddings (see Section 24.1). The vectors are usually extracted from the weights of the first hidden layer of a network trained on large quantities of text, and they capture the statistics of the lexical contexts in which words are used. Because words with similar meanings are used in similar contexts, they end up close to each other in the vector space. This allows the network to generalize effectively across categories of words, without the need for humans to predefine those categories. For example, a sentence beginning “John bought a watermelon and two pounds of ...” is likely to continue with “apples” or “bananas” but not with “thorium” or “geography.” Such a prediction is much easier to make if “apples” and “bananas” have similar representations in the internal layer.
21.8.3 Reinforcement learning
In reinforcement learning (RL), a decision-making agent learns from a sequence of reward
signals that provide some indication of the quality of its behavior. The goal is to optimize the
sum of future rewards. This can be done in several ways: in the terminology of Chapter 17,
the agent can learn a value function, a Q-function, a policy, and so on. From the point of view of deep learning, all these are functions that can be represented by computation graphs.
For example, a value function in Go takes a board position as input and returns an estimate of how advantageous the position is for the agent. While the methods of training in RL differ from those of supervised learning, the ability of multilayer computation graphs to represent
complex functions over large input spaces has proved to be very useful. The resulting field of research is called deep reinforcement learning.
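For intuition about what learning a Q-function from reward signals involves, here is a small tabular Q-learning sketch on a hypothetical five-state corridor with a reward of 1 for reaching the right-hand end. It uses a lookup table rather than a deep network, so it illustrates only the update rule; in deep RL the table would be replaced by a computation graph trained toward the same kind of target. The environment, hyperparameters, and names are assumptions made for the example.

```python
import random

def q_learning_corridor(num_states=5, episodes=500, alpha=0.5, gamma=0.9,
                        epsilon=0.2, seed=0):
    """Tabular Q-learning; actions are -1 (left) and +1 (right)."""
    rng = random.Random(seed)
    actions = (-1, +1)
    goal = num_states - 1
    q = {(s, a): 0.0 for s in range(num_states) for a in actions}
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on episode length
            if rng.random() < epsilon:
                action = rng.choice(actions)  # explore
            else:
                best = max(q[(state, a)] for a in actions)
                action = rng.choice([a for a in actions if q[(state, a)] == best])
            next_state = min(max(state + action, 0), goal)
            reward = 1.0 if next_state == goal else 0.0
            # No future value is added once the terminal goal state is reached.
            future = 0.0 if next_state == goal else max(q[(next_state, a)] for a in actions)
            q[(state, action)] += alpha * (reward + gamma * future - q[(state, action)])
            state = next_state
            if state == goal:
                break
    return q

q_table = q_learning_corridor()
greedy_policy = [max((-1, +1), key=lambda a: q_table[(s, a)]) for s in range(4)]
```

After training, the greedy policy read off the table moves right in every non-terminal state, and the learned values decay by the discount factor with distance from the goal.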
In the 1950s, Arthur Samuel experimented with multilayer representations of value functions in his work on reinforcement learning for checkers, but he found that in practice a linear function approximator worked best. (This may have been a consequence of working with a computer roughly 100 billion times less powerful than a modern tensor processing unit.)
The first major successful demonstration of deep RL was DeepMind’s Atari-playing agent, DQN (Mnih et al., 2013). Different copies of this agent were trained to play each of several different Atari video games, and demonstrated skills such as shooting alien spaceships, bouncing balls with paddles, and driving simulated racing cars. In each case, the agent learned a Q-function from raw image data with the reward signal being the game score. Subsequent work has produced deep RL systems that play at a superhuman level on the majority of the 57 different Atari games. DeepMind’s ALPHAGO system also used deep RL to defeat the best human players at the game of Go (see Chapter 5).
Despite its impressive successes, deep RL still faces significant obstacles: it is often difficult to get good performance, and the trained system may behave very unpredictably if the environment differs even a little from the training data (Irpan, 2018).
Compared to
other applications of deep learning, deep RL is rarely applied in commercial settings. It is, nonetheless, a very active area of research.
Summary
This chapter described methods for learning functions represented by deep computational graphs. The main points were:
• Neural networks represent complex nonlinear functions with a network of parameterized linear-threshold units.
• The back-propagation algorithm implements a gradient descent in parameter space to minimize the loss function.
• Deep learning works well for visual object recognition, speech recognition, natural language processing, and reinforcement learning in complex environments.
• Convolutional networks are particularly well suited for image processing and other tasks where the data have a grid topology.
• Recurrent networks are effective for sequence-processing tasks including language modeling and machine translation.
Bibliographical and Historical Notes
The literature on neural networks is vast. Cowan and Sharp (1988b, 1988a) survey the early history, beginning with the work of McCulloch and Pitts (1943). (As mentioned in Chapter 1, John McCarthy has pointed to the work of Nicolas Rashevsky (1936, 1938) as the earliest mathematical model of neural learning.) Norbert Wiener, a pioneer of cybernetics and control theory (Wiener, 1948), worked with McCulloch and Pitts and influenced a number of young researchers, including Marvin Minsky, who may have been the first to develop a working neural network in hardware, in 1951 (see Minsky and Papert, 1988, pp. ix-x). Alan Turing (1948) wrote a research report titled Intelligent Machinery that begins with the
sentence “I propose to investigate the question as to whether it is possible for machinery to show intelligent behaviour” and goes on to describe a recurrent neural network architecture
he called “B-type unorganized machines” and an approach to training them. Unfortunately, the report went unpublished until 1969, and was all but ignored until recently.
The perceptron, a one-layer neural network with a hard-threshold activation function, was
popularized by Frank Rosenblatt (1957). After a demonstration in July 1958, the New York
Times described it as “the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.” Rosenblatt
(1960) later proved the perceptron convergence theorem, although it had been foreshadowed by purely mathematical work outside the context of neural networks (Agmon, 1954; Motzkin and Schoenberg, 1954). Some early work was also done on multilayer networks, including Gamba perceptrons (Gamba et al., 1961) and madalines (Widrow, 1962). Learning Machines (Nilsson, 1965) covers much of this early work and more.
The subsequent demise of early perceptron research efforts was hastened (or, the authors later claimed, merely explained) by the book Perceptrons (Minsky and Papert, 1969), which lamented the field’s lack of mathematical rigor. The book pointed out that single-layer perceptrons could represent only linearly separable concepts and noted the lack of effective learning algorithms for multilayer networks. These limitations were already well known (Hawkins, 1961) and had been acknowledged by Rosenblatt himself (Rosenblatt, 1962).
The papers collected by Hinton and Anderson (1981), based on a conference in San Diego in 1979, can be regarded as marking a renaissance of connectionism. The two-volume “PDP” (Parallel Distributed Processing) anthology (Rumelhart and McClelland, 1986) helped to spread the gospel, so to speak, particularly in the psychology and cognitive science communities. The most important development of this period was the back-propagation algorithm for training multilayer networks.
The back-propagation algorithm was discovered independently several times in different contexts (Kelley, 1960; Bryson, 1962; Dreyfus, 1962; Bryson and Ho, 1969; Werbos, 1974; Parker, 1985) and Stuart Dreyfus (1990) calls it the “Kelley-Bryson gradient procedure.” Although Werbos had applied it to neural networks, this idea did not become widely known until a paper by David Rumelhart, Geoff Hinton, and Ron Williams (1986) appeared in Nature giving a nonmathematical presentation of the algorithm. Mathematical respectability was enhanced by papers showing that multilayer feedforward networks are (subject to technical conditions) universal function approximators (Cybenko, 1988, 1989). The late 1980s and early 1990s saw a huge growth in neural network research: the number of papers mushroomed by a factor of 200 between 1980-84 and 1990-94.
In the late 1990s and early 2000s, interest in neural networks waned as other techniques such as Bayes nets, ensemble methods, and kernel machines came to the fore. Interest in deep models was sparked when Geoff Hinton’s research on deep Bayesian networks (generative models with category variables at the root and evidence variables at the leaves) began to bear fruit, outperforming kernel machines on small benchmark data sets (Hinton et al., 2006). Interest in deep learning exploded when Krizhevsky et al. (2013) used deep convolutional networks to win the ImageNet competition (Russakovsky et al., 2015).
Commentators often cite the availability of “big data” and the processing power of GPUs as the main contributing factors in the emergence of deep learning. Architectural improvements were also important, including the adoption of the ReLU activation function instead of the logistic sigmoid (Jarrett et al., 2009; Nair and Hinton, 2010; Glorot et al., 2011) and later the development of residual networks (He et al., 2016).
On the algorithmic side, the use of stochastic gradient descent (SGD) with small batches was essential in allowing neural networks to scale to large data sets (Bottou and Bousquet, 2008). Batch normalization (Ioffe and Szegedy, 2015) also helped in making the training process faster and more reliable and has spawned several additional normalization techniques (Ba et al., 2016; Wu and He, 2018; Miyato et al., 2018). Several papers have studied the empirical behavior of SGD on large networks and large data sets (Dauphin et al., 2015; Choromanska et al., 2014; Goodfellow et al., 2015b). On the theoretical side, some progress has been made on explaining the observation that SGD applied to overparameterized networks often reaches a global minimum with a training error of zero, although so far the theorems to this effect assume a network with layers far wider than would ever occur in practice (Allen-Zhu et al., 2018; Du et al., 2018). Such networks have more than enough capacity to function as lookup tables for the training data.
The last piece of the puzzle, at least for vision applications, was the use of convolutional
networks. These had their origins in the descriptions of the mammalian visual system by neurophysiologists David Hubel and Torsten Wiesel (Hubel and Wiesel, 1959, 1962, 1968).
They described “simple cells” in the visual system of a cat that resemble edge detectors,
as well as “complex cells” that are invariant to some transformations such as small spatial
translations. In modern convolutional networks, the output of a convolution is analogous to a
simple cell while the output of a pooling layer is analogous to a complex cell. The work of Hubel and Wiesel inspired many of the early connectionist models of vision (Marr and Poggio, 1976). The neocognitron (Fukushima, 1980; Fukushima and Miyake, 1982), designed as a model of the visual cortex, was essentially a convolutional network in terms of model architecture, although an effective training algorithm for such networks had to wait until Yann LeCun and collaborators showed how to apply back-propagation (LeCun et al., 1995). One of the early commercial successes of neural networks was handwritten digit recognition using convolutional networks (LeCun et al., 1995).
Recurrent neural networks (RNNs) were commonly proposed as models of brain function
in the 1970s, but no effective learning algorithms were associated with these proposals. The
method of back-propagation through time appears in the PhD thesis of Paul Werbos (1974),
and his later review paper (Werbos, 1990) gives several additional references to rediscoveries of the method in the 1980s. One of the most influential early works on RNNs was due to Jeff Elman (1990), building on an RNN architecture suggested by Michael Jordan (1986).
Williams and Zipser (1989) present an algorithm for online learning in RNNs. Bengio et al. (1994) analyzed the problem of vanishing gradients in recurrent networks. The long short-term memory (LSTM) architecture (Hochreiter, 1991; Hochreiter and Schmidhuber, 1997; Gers et al., 2000) was proposed as a way of avoiding this problem. More recently, effective RNN designs have been derived automatically (Jozefowicz et al., 2015; Zoph and Le, 2016).
Many methods have been tried for improving generalization in neural networks. Weight decay was suggested by Hinton (1987) and analyzed mathematically by Krogh and Hertz (1992). The dropout method is due to Srivastava et al. (2014a). Szegedy et al. (2013) introduced the idea of adversarial examples, spawning a huge literature.
Poole et al. (2017) showed that deep networks (but not shallow ones) can disentangle complex functions into flat manifolds in the space of hidden units. Rolnick and Tegmark (2018) showed that the number of units required to approximate a certain class of polynomials of n variables grows exponentially for shallow networks but only linearly for deep networks.
White et al. (2019) showed that their BANANAS system could do neural architecture search (NAS) by predicting the accuracy of a network to within 1% after training on just 200 random sample architectures. Zoph and Le (2016) use reinforcement learning to search the space of neural network architectures. Real et al. (2018) use an evolutionary algorithm to do model selection, Liu et al. (2017) use evolutionary algorithms on hierarchical representations, and Jaderberg et al. (2017) describe population-based training. Liu et al. (2019) relax the space of architectures to a continuous differentiable space and use gradient descent to find a locally optimal solution. Pham et al. (2018) describe the ENAS (Efficient Neural Architecture Search) system, which searches for optimal subgraphs of a larger graph. It is fast because it does not need to retrain parameters. The idea of searching for a subgraph goes back to the “optimal brain damage” algorithm of LeCun et al. (1990).
Despite this impressive array of approaches, there are critics who feel the field has not yet
matured. Yu et al. (2019) show that in some cases these NAS algorithms are no more efficient
than random architecture selection. For a survey of recent results in neural architecture search,
see Elsken et al. (2018).
Unsupervised learning constitutes a large subfield within statistics, mostly under the
heading of density estimation. Silverman (1986) and Murphy (2012) are good sources for classical and modem techniques in this area. Principal components analysis (PCA) dates back to Pearson (1901); the name comes from independent work by Hotelling (1933). The probabilistic PCA model (Tipping and Bishop, 1999) adds a generative model for the principal components themselves. The variational autoencoder is due to Kingma and Welling (2013) and Rezende ef al. (2014); Jordan et al. (1999) provide an introduction to variational methods for inference in graphical models. For autoregressive models, the classic text is by Box er al. (2016). The Yule-Walker equations for fitting AR models were developed independently by Yule (1927) and Walker (1931).
Autoregressive models with nonlinear dependencies were developed by several authors (Frey, 1998; Bengio and Bengio, 2001; Larochelle and Murray, 2011). The autoregressive WaveNet
model (van den Oord et al., 2016a) was based on earlier work on autoregressive image gen-
eration (van den Oord et al., 2016b).
Generative adversarial networks, or GANs, were first
proposed by Goodfellow et al. (2015a), and have found many applications in AI. Some theoretical understanding of their properties is emerging, leading to improved GAN models and algorithms (Li and Malik, 2018b, 2018a; Zhu et al., 2019). Part of that understanding involves
protecting against adversarial attacks (Carlini et al., 2019).
Chapter 21
Deep Learning
Several branches of research into neural networks have been popular in the past but are not actively explored today. Hopfield networks (Hopfield, 1982) have symmetric connections between each pair of nodes and can learn to store patterns in an associative memory, so that an entire pattern can be retrieved by indexing into the memory using a fragment of the pattern.
Hopfield networks are deterministic; they were later generalized to stochastic
Boltzmann machines (Hinton and Sejnowski, 1983, 1986). Boltzmann machines are possi-
bly the earliest example of a deep generative model. The difficulty of inference in Boltzmann
machines led to advances in both Monte Carlo techniques and variational techniques (see
Section 13.4).
Research on neural networks for AI has also been intertwined to some extent with research into biological neural networks. The two topics coincided in the 1940s, and ideas for convolutional networks and reinforcement learning can be traced to studies of biological systems; but at present, new ideas in deep learning tend to be based on purely computational or statistical concerns.
The field of computational neuroscience aims to build computational
models that capture important and specific properties of actual biological systems. Overviews
are given by Dayan and Abbott (2001) and Trappenberg (2010). For modern neural nets and deep learning, the leading textbooks are those by Goodfellow et al. (2016) and Charniak (2018). There are also many hands-on guides associated with the various open-source software packages for deep learning.
Three of the leaders of the
field (Yann LeCun, Yoshua Bengio, and Geoff Hinton) introduced the key ideas to non-AI researchers in an influential Nature article (2015). The three were recipients of the 2018 Turing Award. Schmidhuber (2015) provides a general overview, and Deng et al. (2014)
focus on signal processing tasks.
The primary publication venues for deep learning research are the conference on Neural
Information Processing Systems (NeurIPS), the International Conference on Machine Learning (ICML), and the International Conference on Learning Representations (ICLR). The main
journals are Machine Learning, the Journal of Machine Learning Research, and Neural Com-
putation. Increasingly, because of the fast pace of research, papers appear first on arXiv.org and are often described in the research blogs of the major research centers.
CHAPTER 22
REINFORCEMENT LEARNING
In which we see how experiencing rewards and punishments can teach an agent how to maximize rewards in the future.
With supervised learning, an agent learns by passively observing example input/output
pairs provided by a “teacher.” In this chapter, we will see how agents can actively learn from
their own experience, without a teacher, by considering their own ultimate success or failure.
22.1 Learning from Rewards
Consider the problem of learning to play chess.
Let’s imagine treating this as a supervised
learning problem using the methods of Chapters 19-21. The chess-playing agent function
takes as input a board position and returns a move, so we train this function by supplying
examples of chess positions, each labeled with the correct move. Now, it so happens that we have available databases of several million grandmaster games, each a sequence of positions and moves. The moves made by the winner are, with few exceptions, assumed to be good,
if not always perfect. Thus, we have a promising training set. The problem is that there are
relatively few examples (about $10^8$) compared to the space of all possible chess positions (about $10^{40}$). In a new game, one soon encounters positions that are significantly different
from those in the database, and the trained agent function is likely to fail miserably—not least because it has no idea of what its moves are supposed to achieve (checkmate) or even what
effect the moves have on the positions of the pieces. And of course chess is a tiny part of the real world. For more realistic problems, we would need much vaster grandmaster databases,
and they simply don’t exist.¹
An alternative is reinforcement learning (RL), in which an agent interacts with the world
and periodically receives rewards (or, in the terminology of psychology, reinforcements) that reflect how well it is doing.
For example, in chess the reward is 1 for winning, 0 for
losing, and ½ for a draw. We have already seen the concept of rewards in Chapter 17 for Markov decision processes (MDPs). Indeed, the goal is the same in reinforcement learning:
maximize the expected sum of rewards.
Reinforcement learning differs from “just solving
an MDP” because the agent is not given the MDP as a problem to solve; the agent is in the MDP. It may not know the transition model or the reward function, and it has to act in order
to learn more. Imagine playing a new game whose rules you don’t know; after a hundred or so moves, the referee tells you “You lose.” That is reinforcement learning in a nutshell.
¹ As Yann LeCun and Alyosha Efros have pointed out, “the AI revolution will not be supervised.”

From our point of view as designers of AI systems, providing a reward signal to the agent is usually much easier than providing labeled examples of how to behave. First, the reward function is often (as we saw for chess) very concise and easy to specify: it requires only a few lines of code to tell the chess agent if it has won or lost the game or to tell the car-racing agent that it has won or lost the race or has crashed.
Second, we don’t have to be experts,
capable of supplying the correct action in any situation, as would be the case if we tried to apply supervised learning.
It turns out, however, that a little bit of expertise can go a long way in reinforcement learning. The two examples in the preceding paragraph (the win/loss rewards for chess and racing) are what we call sparse rewards, because in the vast majority of states the agent is given no informative reward signal at all. In games such as tennis and cricket, we can easily supply additional rewards for each point won or for each run scored. In car racing, we could reward the agent for making progress around the track in the right direction. When learning to crawl, any forward motion is an achievement. These intermediate rewards make learning much easier.
As long as we can provide the correct reward signal to the agent, reinforcement learning provides a very general way to build AI systems. This is particularly true for simulated environments, where there is no shortage of opportunities to gain experience. The addition of deep learning as a tool within RL systems has also made new applications possible, including learning to play Atari video games from raw visual input (Mnih et al., 2013), controlling robots (Levine et al., 2016), and playing poker (Brown and Sandholm, 2017).

Literally hundreds of different reinforcement learning algorithms have been devised, and many of them can employ as tools a wide range of learning methods from Chapters 19-21. In this chapter, we cover the basic ideas and give some sense of the variety of approaches through a few examples. We categorize the approaches as follows:

• Model-based reinforcement learning: In these approaches the agent uses a transition model of the environment to help interpret the reward signals and to make decisions about how to act. The model may be initially unknown, in which case the agent learns the model from observing the effects of its actions, or it may already be known; for example, a chess program may know the rules of chess even if it does not know how to choose good moves. In partially observable environments, the transition model is also useful for state estimation (see Chapter 14). Model-based reinforcement learning systems often learn a utility function U(s), defined (as in Chapter 17) in terms of the sum of rewards from state s onward.²
• Model-free reinforcement learning: In these approaches the agent neither knows nor learns a transition model for the environment. Instead, it learns a more direct representation of how to behave. This comes in one of two varieties:
  ◦ Action-utility learning: We introduced action-utility functions in Chapter 17. The most common form of action-utility learning is Q-learning, where the agent learns a Q-function, or quality-function, Q(s,a), denoting the sum of rewards from state s onward if action a is taken. Given a Q-function, the agent can choose what to do in s by finding the action with the highest Q-value.
  ◦ Policy search: The agent learns a policy π(s) that maps directly from states to actions. In the terminology of Chapter 2, this is a reflex agent.

² In the RL literature, which draws more on operations research than economics, utility functions are often called value functions and denoted V(s).
Figure 22.1 (a) The optimal policies for the stochastic environment with R(s,a,s') = −0.04 for transitions between nonterminal states. There are two policies because in state (3,1) both Left and Up are optimal. We saw this before in Figure 17.2. (b) The utilities of the states in the 4 × 3 world, given policy π.
We begin in Section 22.2 with passive reinforcement learning, where the agent’s policy π is fixed and the task is to learn the utilities of states (or of state-action pairs); this could also involve learning a model of the environment. (An understanding of Markov decision processes, as described in Chapter 17, is essential for this section.) Section 22.3 covers active reinforcement learning, where the agent must also figure out what to do. The principal issue is exploration: an agent must experience as much as possible of its environment in order to learn how to behave in it. Section 22.4 discusses how an agent can use inductive learning (including deep learning methods) to learn much faster from its experiences. We also discuss other approaches that can help scale up RL to solve real problems, including providing intermediate pseudorewards to guide the learner and organizing behavior into a hierarchy of actions. Section 22.5 covers methods for policy search. In Section 22.6, we explore apprenticeship learning: training a learning agent using demonstrations rather than reward signals. Finally, Section 22.7 reports on applications of reinforcement learning.

22.2 Passive Reinforcement Learning
We start with the simple case of a fully observable environment with a small number of actions and states, in which an agent already has a fixed policy π(s) that determines its actions. The agent is trying to learn the utility function $U^\pi(s)$, the expected total discounted reward if policy π is executed beginning in state s. We call this a passive learning agent.

The passive learning task is similar to the policy evaluation task, part of the policy iteration algorithm described in Section 17.2.2. The difference is that the passive learning agent does not know the transition model P(s' | s, a), which specifies the probability of reaching state s' from state s after doing action a; nor does it know the reward function R(s, a, s'), which specifies the reward for each transition.

We will use as our example the 4 × 3 world introduced in Chapter 17. Figure 22.1 shows
the optimal policies for that world and the corresponding utilities. The agent executes a set of trials in the environment using its policy π. In each trial, the agent starts in state (1,1) and experiences a sequence of state transitions until it reaches one of the terminal states, (4,2) or (4,3). Its percepts supply both the current state and the reward received for the transition that just occurred to reach that state. Typical trials might look like this:

[Display of three sample trials: each is a sequence of states from (1,1) to a terminal state, with every transition annotated by an action and a reward; the third trial ends in (4,2).]

Note that each transition is annotated with both the action taken and the reward received at the next state. The object is to use the information about rewards to learn the expected utility $U^\pi(s)$ associated with each nonterminal state s. The utility is defined to be the expected sum
of (discounted) rewards obtained if policy π is followed. As in Equation (17.2) on page 567, we write

$$U^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(S_t, \pi(S_t), S_{t+1})\right], \qquad (22.1)$$

where $R(S_t, \pi(S_t), S_{t+1})$ is the reward received when action $\pi(S_t)$ is taken in state $S_t$ and reaches state $S_{t+1}$. Note that $S_t$ is a random variable denoting the state reached at time t when executing policy π, starting from state $S_0 = s$. We will include a discount factor γ in all of our equations, but for the 4 × 3 world we will set γ = 1, which means no discounting.

22.2.1 Direct utility estimation
The idea of direct utility estimation is that the utility of a
state is defined as the expected
total reward from that state onward (called the expected reward-to-go), and that each trial
provides a sample of this quantity for each state visited. For example, the first of the three
trials shown earlier provides a sample total reward of 0.76 for state (1,1), two samples of 0.80 and 0.88 for (1,2), two samples of 0.84 and 0.92 for (1,3), and so on. Thus, at the end of each
sequence, the algorithm calculates the observed reward-to-go for each state and updates the
estimated utility for that state accordingly, just by keeping a running average for each state
in a table. In the limit of infinitely many trials, the sample average will converge to the true
expectation in Equation (22.1). This means that we have reduced reinforcement learning to a standard supervised learn-
ing problem in which each example is a (state, reward-to-go) pair. We have a lot of powerful algorithms for supervised learning, so this approach seems promising, but it ignores an im-
portant constraint: The utility of a state is determined by the reward and the expected utility
of the successor states. More specifically, the utility values obey the Bellman equations for a
fixed policy (see also Equation (17.14)):
$$U_i(s) = \sum_{s'} P(s' \mid s, \pi_i(s)) \left[ R(s, \pi_i(s), s') + \gamma\, U_i(s') \right]. \qquad (22.2)$$
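Direct utility estimation, as described above, can be sketched in a few lines. This is a minimal illustration and not the book's agent program: the function name, the trial encoding as (state, reward) pairs, and the sample trial are all assumptions. The trial is hypothetical and was chosen only to be consistent with the sample values quoted in the text (0.76 for (1,1), 0.80 and 0.88 for (1,2), 0.84 and 0.92 for (1,3)).

```python
from collections import defaultdict

def direct_utility_estimation(trials, gamma=1.0):
    """Average the observed reward-to-go over every visit to each state.

    Each trial is a list of (state, reward) pairs, where `reward` is the
    reward received on the transition that leaves `state`.
    """
    totals = defaultdict(float)  # sum of observed reward-to-go per state
    visits = defaultdict(int)    # number of samples per state
    for trial in trials:
        reward_to_go = 0.0
        # Walk backward so each step extends the tail that is already summed.
        for state, reward in reversed(trial):
            reward_to_go = reward + gamma * reward_to_go
            totals[state] += reward_to_go
            visits[state] += 1
    return {s: totals[s] / visits[s] for s in totals}

# Hypothetical trial (an assumption): six transitions with reward -0.04,
# then a final transition with reward +1 into the terminal state.
trial = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04), ((1, 2), -0.04),
         ((1, 3), -0.04), ((2, 3), -0.04), ((3, 3), 1.0)]
estimates = direct_utility_estimation([trial])
```

With this one trial the running averages are about 0.76 for (1,1), 0.84 for (1,2), and 0.88 for (1,3). Each state is treated in isolation, so nothing learned about one state is passed on to its neighbors.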
By ignoring the connections between states, direct utility estimation misses opportunities for
learning.
For example, the second of the three trials given earlier reaches the state (3,2), which has not previously been visited. The next transition reaches (3,3), which is known from the first trial to have a high utility. The Bellman equation suggests immediately that (3,2) is also likely to have a high utility, because it leads to (3,3), but direct utility estimation
learns nothing until the end of the trial. More broadly, we can view direct utility estimation
as searching for U in a hypothesis space that is much larger than it needs to be, in that it includes many functions that violate the Bellman equations. For this reason, the algorithm often converges very slowly.

22.2.2 Adaptive dynamic programming

An adaptive dynamic programming (or ADP) agent takes advantage of the constraints
among the utilities of states by learning the transition model that connects them and solving the corresponding Markov decision process using dynamic programming. For a passive learning agent, this means plugging the learned transition model P(s' | s, π(s)) and the observed rewards R(s, π(s), s') into Equation (22.2) to calculate the utilities of the states. As we remarked in our discussion of policy iteration in Chapter 17, these Bellman equations are linear when the policy π is fixed, so they can be solved using any linear algebra package. Alternatively, we can adopt the approach of modified policy iteration (see page 578),
using a simplified value iteration process to update the utility estimates after each change to the learned model. Because the model usually changes only slightly with each observation, the value iteration process can use the previous utility estimates as initial values and typically
converge very quickly.
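The iterative alternative can be sketched as follows. This is an illustrative sketch under stated assumptions, not the agent program of Figure 22.2: the function names, the dictionary-based tables, and the three-state toy problem (states A and B, terminal T, a single action "go") are all hypothetical. The transition model is taken to be the normalized outcome counts, and the utilities come from repeated fixed-policy Bellman updates (Equation 22.2) started from the previous estimates.

```python
from collections import defaultdict

def estimate_model(transitions):
    """Estimate P(s' | s, a) as normalized outcome counts."""
    counts = defaultdict(lambda: defaultdict(int))
    for s, a, s_next in transitions:
        counts[(s, a)][s_next] += 1
    model = {}
    for sa, outcomes in counts.items():
        total = sum(outcomes.values())
        model[sa] = {s_next: n / total for s_next, n in outcomes.items()}
    return model

def evaluate_policy(policy, model, rewards, gamma=1.0, utilities=None, sweeps=100):
    """Apply the fixed-policy Bellman update repeatedly, starting from `utilities`."""
    U = dict(utilities or {})
    for _ in range(sweeps):
        for s, a in policy.items():
            if (s, a) not in model:
                continue  # no experience yet for this state-action pair
            U[s] = sum(p * (rewards[(s, a, s2)] + gamma * U.get(s2, 0.0))
                       for s2, p in model[(s, a)].items())
    return U

# Toy experience (an assumption): from B the agent reached T once and A once.
observed = [("A", "go", "B"), ("B", "go", "T"), ("B", "go", "A")]
rewards = {("A", "go", "B"): -0.04, ("B", "go", "T"): 1.0, ("B", "go", "A"): -0.04}
policy = {"A": "go", "B": "go"}

model = estimate_model(observed)
U = evaluate_policy(policy, model, rewards)
```

In this toy problem the estimates settle near U(A) = 0.88 and U(B) = 0.92, the solution of the two linear Bellman equations. Passing the previous table as `utilities` after each new observation mirrors the warm start described above.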
Learning the transition model is easy, because the environment is fully observable. This
means that we have a supervised learning task where the input for each training example is a
state-action pair, (s,a), and the output is the resulting state, s'. The transition model P(s'| s,a)
is represented as a table and it is estimated directly from the counts that are accumulated in
$N_{s'|s,a}$. The counts record how often state s' is reached when executing a in s. For example, in the three trials given on page 792, Right is executed four times in (3,3) and the resulting state is (3,2) twice and (4,3) twice, so P((3,2) | (3,3), Right) and P((4,3) | (3,3), Right) are both estimated to be 1/2.

The full agent program for a passive ADP agent is shown in Figure 22.2. Its performance on the 4 × 3 world is shown in Figure 22.3. In terms of how quickly its value estimates
improve, the ADP agent is limited only by its ability to learn the transition model. In this
sense, it provides a standard against which to measure any other reinforcement learning algorithms. It is, however, intractable for large state spaces. In backgammon, for example, it would involve solving roughly $10^{20}$ equations in $10^{20}$ unknowns.

22.2.3 Temporal-difference learning
Solving the underlying MDP as in the preceding section is not the only way to bring the
Bellman equations to bear on the learning problem.
Another way is to use the observed
transitions to adjust the utilities of the observed states so that they agree with the constraint
equations. Consider, for example, the transition from (1,3) to (2,3) in the second trial on page 792. Suppose that as a result of the first trial, the utility estimates are $U^\pi(1,3)=0.88$ and $U^\pi(2,3)=0.96$. Now, if this transition from (1,3) to (2,3) occurred all the time, we would expect the utilities to obey the equation

$$U^\pi(1,3) = -0.04 + U^\pi(2,3),$$

so $U^\pi(1,3)$ would be 0.92. Thus, its current estimate of 0.88 might be a little low and should be increased. More generally, when a transition occurs from state s to state s' via action π(s),
function PASSIVE-ADP-LEARNER(percept) returns an action
  inputs: percept, a percept indicating the current state s' and reward signal r
  persistent: π, a fixed policy
              mdp, an MDP with model P, rewards R, actions A, discount γ
              U, a table of utilities for states, initially empty
              N_{s'|s,a}, a table of outcome count vectors indexed by state and action, initially zero
              s, a, the previous state and action, initially null

  if s' is new then U[s'] ← 0
  if s is not null then
      increment N_{s'|s,a}[s, a][s']
      R[s, a, s'] ← r
      add a to A[s]
      P(· | s, a) ← …

Figure 24.8 Beam search with beam size of b=2. The score of each word is the log-probability generated by the target RNN softmax, and the score of each hypothesis is the sum of the word scores. At timestep 3, the highest scoring hypothesis La entrada can only generate low-probability continuations, so it “falls off the beam.”

using a greedy decoder to translate into Spanish the English sentence we saw before, The
front door is red. The correct translation is “La puerta de entrada es roja”—literally “The door of entry is red.” Suppose the target RNN correctly generates the first word La for The. Next, a greedy decoder might propose entrada for front. But this is an error—Spanish word order should put the noun puerta before the modifier. Greedy decoding is fast—it only considers one choice at each timestep and can do so quickly—but the model has no mechanism to correct mistakes. We could try to improve the attention mechanism so that it always attends to the right word and guesses correctly every time.
But for many sentences it is infeasible to guess
correctly all the words at the start of the sentence until you have seen what’s at the end.
A better approach is to search for an optimal decoding (or at least a good one) using
one of the search algorithms from Chapter 3. A common choice is a beam search (see Section 4.1.3). In the context of MT decoding, beam search typically keeps the top k hypotheses at each stage, extending each by one word using the top k choices of words, then chooses the best k of the resulting $k^2$ new hypotheses. When all hypotheses in the beam generate the special end-of-sequence token, the algorithm outputs the highest scoring hypothesis.
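The procedure just described can be sketched as below. This is a minimal illustration, not a production decoder: the function names, the `<end>` token string, and the toy probability table are assumptions. The table stands in for the target RNN softmax and is hand-made so that the locally best second word (entrada) leads to weaker continuations than puerta.

```python
import math

def beam_search(next_word_logprobs, k, end_token="<end>", max_len=20):
    """Keep the k best partial hypotheses; stop when all of them have ended."""
    beam = [((), 0.0)]  # (words so far, sum of word log-probabilities)
    for _ in range(max_len):
        if all(words and words[-1] == end_token for words, _ in beam):
            break
        candidates = []
        for words, score in beam:
            if words and words[-1] == end_token:
                candidates.append((words, score))  # finished: carry over unchanged
                continue
            options = next_word_logprobs(words)
            top_k = sorted(options.items(), key=lambda kv: kv[1], reverse=True)[:k]
            candidates.extend((words + (w,), score + lp) for w, lp in top_k)
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return max(beam, key=lambda c: c[1])

# Hypothetical next-word probabilities (an assumption, not real model output).
TOY_TABLE = {
    (): {"La": 1.0},
    ("La",): {"entrada": 0.6, "puerta": 0.4},
    ("La", "entrada"): {"de": 0.5, "es": 0.5},
    ("La", "puerta"): {"de": 0.9, "roja": 0.1},
}

def toy_logprobs(words):
    # Any prefix missing from the table simply ends the sentence.
    probs = TOY_TABLE.get(words, {"<end>": 1.0})
    return {w: math.log(p) for w, p in probs.items()}

best_words, best_score = beam_search(toy_logprobs, k=2)
greedy_words, greedy_score = beam_search(toy_logprobs, k=1)
```

With k=1 the search reduces to greedy decoding and commits to "La entrada". With k=2 the hypothesis "La puerta" stays on the beam long enough to overtake it.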
A visualization of beam search is given in Figure 24.8. As deep learning models become
more accurate, we can usually afford to use a smaller beam size.
Current state-of-the-art
neural MT models use a beam size of 4 to 8, whereas the older generation of statistical MT models would use a beam size of 100 or more.
24.4 The Transformer Architecture
The influential article “Attention is all you need” (Vaswani et al., 2018) introduced the transformer architecture, which uses a self-attention mechanism that can model long-distance
context without a sequential dependency.
24.4.1 Self-attention
Previously, in sequence-to-sequence models, attention was applied from the target RNN to the source RNN. Self-attention extends this mechanism so that each sequence of hidden states also attends to itself—the source to the source, and the target to the target. This allows
the model to additionally capture long-distance (and nearby) context within each sequence.
The most straightforward way of applying self-attention is where the attention matrix is
directly formed by the dot product of the input vectors. However, this is problematic. The dot product between a vector and itself will always be high, so each hidden state will be biased
towards attending to itself. The transformer solves this by first projecting the input into three different representations using three different weight matrices:
• The query vector $q_i = W_q x_i$ is the one being attended from, like the target in the standard attention mechanism.
• The key vector $k_i = W_k x_i$ is the one being attended to, like the source in the basic attention mechanism.
• The value vector $v_i = W_v x_i$ is the context that is being generated.

In the standard attention mechanism, the key and value networks are identical, but intuitively it makes sense for these to be separate representations. The encoding results of the ith word, $c_i$, can be calculated by applying an attention mechanism to the projected vectors:

$$r_{ij} = (q_i \cdot k_j)/\sqrt{d}$$
$$a_{ij} = e^{r_{ij}} \Big/ \sum_k e^{r_{ik}}$$
$$c_i = \sum_j a_{ij} \cdot v_j ,$$

where d is the dimension of k and q. Note that i and j are indexes in the same sentence, since we are encoding the context using self-attention. In each transformer layer, self-attention uses
the hidden vectors from the previous layer, which initially is the embedding layer. There are several details worth mentioning here.
First of all, the self-attention mechanism is asymmetric, as $r_{ij}$ is different from $r_{ji}$. Second, the scale factor $\sqrt{d}$ was added to improve numerical stability. Third, the encoding for all words in a sentence can be calculated
simultaneously, as the above equations can be expressed using matrix operations that can be
computed efficiently in parallel on modern specialized hardware. The choice of which context to use is completely learned from training examples, not
prespecified. The context-based summarization, $c_i$, is a sum over all previous positions in the
sentence. In theory, any information from the sentence could appear in $c_i$, but in practice, sometimes important information gets lost, because it is essentially averaged out over the whole sentence.
One way to address that is called multiheaded attention.
We divide the
sentence up into m equal pieces and apply the attention model to each of the m pieces. Each piece has its own set of weights.
Then the results are concatenated together to form $c_i$. By
concatenating rather than summing, we make it easier for an important subpiece to stand out.
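The single-head self-attention computation of this section (query, key, and value projections, scaled dot products, softmax weights, weighted sum) can be sketched in plain Python. This is only an illustration: the helper names and the tiny hand-set weight matrices are assumptions, and a real implementation would use batched matrix operations on learned weights.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, W_q, W_k, W_v):
    """Return (A, C): attention weights a_ij and context vectors c_i."""
    Q = [matvec(W_q, x) for x in X]
    K = [matvec(W_k, x) for x in X]
    V = [matvec(W_v, x) for x in X]
    d = len(K[0])
    A, C = [], []
    for q in Q:
        # r_ij = (q_i . k_j) / sqrt(d) for every position j
        r = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        a = softmax(r)
        A.append(a)
        # c_i = sum_j a_ij * v_j
        C.append([sum(a_j * v[t] for a_j, v in zip(a, V)) for t in range(len(V[0]))])
    return A, C

# Two 2-dimensional input vectors and hand-set weights (assumptions).
X = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[1.0, 1.0], [0.0, 1.0]]
A, C = self_attention(X, identity, W_k, identity)
```

Because the query and key projections differ, the weight from position 0 to position 1 is not the same as the weight from position 1 to position 0, which is the asymmetry noted above.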
24.4.2
From self-attention to transformer
Self-attention is only one component of the transformer model. Each transformer layer con-
sists of several sub-layers. At each transformer layer, self-attention is applied first. The output of the attention module is fed through feedforward layers, where the same feedforward
weight matrices are applied independently at each position. A nonlinear activation function, typically ReLU, is applied after the first feedforward layer. In order to address the potential
vanishing gradient problem, two residual connections are added into the transformer layer. A single-layer transformer is shown in Figure 24.9. In practice, transformer models usually have six or more layers. As with the other models that we’ve learned about, the output of layer i is used as the input to layer i+1.

Chapter 24 Deep Learning for Natural Language Processing

Figure 24.9 A single-layer transformer consists of self-attention, a feedforward network, and residual connections.

Figure 24.10 Using the transformer architecture for POS tagging. (Input sentence: Yesterday they cut the rope. Each word passes through an embedding lookup plus a positional embedding, three transformer layers, and a feedforward output layer that predicts a class such as Adverb, Pronoun, PastTenseVerb, or Determiner.)
The transformer architecture does not explicitly capture the order of words in the sequence, since context is modeled only through self-attention, which is agnostic to word order. To capture the ordering of the words, the transformer uses a technique called positional embedding. If our input sequence has a maximum length of n, then we learn n new embedding vectors, one for each word position. The input to the first transformer layer is the sum of the word embedding at position t plus the positional embedding corresponding to position t.
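The summation of word and positional embeddings can be sketched as follows. The function name and the hand-set two-dimensional vectors are assumptions for illustration. In a real model both tables are learned.

```python
def transformer_input(token_ids, word_embeddings, positional_embeddings):
    """Sum the word embedding and the positional embedding at each position t."""
    if len(token_ids) > len(positional_embeddings):
        raise ValueError("sequence is longer than the maximum length n")
    return [
        [w + p for w, p in zip(word_embeddings[tok], positional_embeddings[t])]
        for t, tok in enumerate(token_ids)
    ]

# Hypothetical tables: two words and a maximum length of n = 3.
word_embeddings = {"the": [1.0, 0.0], "rope": [0.0, 1.0]}
positional_embeddings = [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]

inputs = transformer_input(["the", "rope", "the"],
                           word_embeddings, positional_embeddings)
```

The two occurrences of "the" receive different input vectors because they sit at different positions, which is how word order reaches the otherwise order-agnostic self-attention layers.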
Figure 24.10 illustrates the transformer architecture for POS tagging, applied to the same
sentence used in Figure 24.3. At the bottom, the word embedding and the positional embed-
dings are summed to form the input for a three-layer transformer. The transformer produces
one vector per word, as in RNN-based POS tagging. Each vector is fed into a final output layer and softmax layer to produce a probability distribution over the tags.
In this section, we have actually only told half the transformer story: the model we described here is called the transformer encoder. It is useful for text classification tasks. The full transformer architecture was originally designed as a sequence-to-sequence model for machine translation. Therefore, in addition to the encoder, it also includes a transformer decoder. The encoder and decoder are nearly identical, except that the decoder uses a version of self-attention where each word can only attend to the words before it, since text is generated left-to-right. The decoder also has a second attention module in each transformer layer that attends to the output of the transformer encoder.
24.5 Pretraining and Transfer Learning
Getting enough data to build a robust model can be a challenge.
In computer vision (see
Chapter 25), that challenge was addressed by assembling large collections of images (such as ImageNet) and hand-labeling them. For natural language, it is more common
to work with text that is unlabeled. The difference is in part due to the difficulty of labeling: an unskilled worker can easily label an image as “cat” or “sunset,” but it requires extensive training to annotate a sentence with part-of-speech tags or parse trees. The difference is also due to the abundance of text: the Internet adds over 100 billion words of text each day, including digitized books, curated resources
such as Wikipedia, and uncurated social media posts. Projects such as Common Crawl provide easy access to this data. Any running text can
be used to build n-gram or word embedding models, and some text comes with structure that
can be helpful for a variety of tasks; for example, there are many FAQ sites with question-answer pairs that can be used to train a question-answering system.
Similarly, many Web
sites publish side-by-side translations of texts, which can be used to train machine translation
systems. Some text even comes with labels of a sort, such as review sites where users annotate their text reviews with a 5-star rating system.
We would prefer not to have to go to the trouble of creating a new data set every time we want a new NLP model. In this section, we introduce the idea of pretraining: a form
of transfer learning (see Section 21.7.2) in which we use a large amount of shared general-
domain language data to train an initial version of an NLP model. From there, we can use a
smaller amount of domain-specific data (perhaps including some labeled data) to refine the model.
The refined model can learn the vocabulary, idioms, syntactic structures, and other
linguistic phenomena that are specific to the new domain.

24.5.1 Pretrained word embeddings
In Section 24.1, we briefly introduced word embeddings. We saw how similar words like banana and apple end up with similar vectors, and we saw that we can solve analogy problems with vector subtraction. This indicates that the word embeddings are capturing substantial information about the words.
In this section we will dive into the details of how word embeddings are created using an
entirely unsupervised process over a large corpus of text. That is in contrast to the embeddings
from Section 24.1, which were built during the process of supervised part of speech tagging,
and thus required POS tags that come from expensive hand annotation. We will concentrate
on one specific model for word embeddings, the GloVe (Global Vectors) model. The model starts by gathering counts of how many times each word appears within a window of another word, similar to the skip-gram model. First choose a window size (perhaps 5 words) and let $X_{ij}$ be the number of times that words $i$ and $j$ co-occur within a window, and let $X_i$ be the number of times word $i$ co-occurs with any other word. Let $P_{ij} = X_{ij}/X_i$ be the probability that word $j$ appears in the context of word $i$. As before, let $E_i$ be the word embedding for word $i$.

Part of the intuition of the GloVe model is that the relationship between two words can best be captured by comparing them both to other words. Consider the words ice and steam. Now consider the ratio of their probabilities of co-occurrence with another word, $w$, that is:

$$P_{w,\text{ice}} / P_{w,\text{steam}}$$

When $w$ is the word solid the ratio will be high (meaning solid applies more to ice) and when $w$ is the word gas it will be low (meaning gas applies more to steam). And when $w$ is a non-content word like the, a word like water that is equally relevant to both, or an equally irrelevant word like fashion, the ratio will be close to 1.
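The counting step and the co-occurrence ratio can be illustrated with a minimal Python sketch. The toy corpus, the function names, and the convention of returning infinity when the denominator is zero are all assumptions made for illustration; a real corpus would be vastly larger and would rarely produce exact zeros.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(tokens, window=5):
    """X[i][j] = number of times word j appears within `window` positions of word i."""
    X = defaultdict(Counter)
    for pos, word in enumerate(tokens):
        lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
        for ctx in range(lo, hi):
            if ctx != pos:
                X[word][tokens[ctx]] += 1
    return X

def cooccurrence_prob(X, i, j):
    """P_ij = X_ij / X_i, with X_i the total co-occurrence count of word i."""
    X_i = sum(X[i].values())
    return X[i][j] / X_i if X_i else 0.0

def ice_steam_ratio(X, w):
    """Ratio P_{w,ice} / P_{w,steam}; infinite when steam never co-occurs with w."""
    p_ice = cooccurrence_prob(X, w, "ice")
    p_steam = cooccurrence_prob(X, w, "steam")
    return float("inf") if p_steam == 0 else p_ice / p_steam

# Hypothetical toy corpus, far too small to be realistic.
tokens = "solid ice and gas steam and solid ice".split()
X = cooccurrence_counts(tokens, window=1)
for w in ("solid", "gas", "and"):
    print(w, ice_steam_ratio(X, w))
```

On this toy corpus the ratio is infinite for solid, zero for gas, and exactly 1 for the non-content word and, which mirrors the high, low, and near-1 cases described in the text.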
The GloVe model starts with this intuition and goes through some mathematical reasoning (Pennington et al., 2014) that converts ratios of probabilities into vector differences and dot products, eventually arriving at the constraint

$$E_i \cdot E'_j = \log(P_{ij}).$$

In other words, the dot product of two word vectors is equal to the log probability of their co-occurrence. That makes intuitive sense: two nearly-orthogonal vectors have a dot product close to 0, and two nearly-identical normalized vectors have a dot product close to 1. There is a technical complication wherein the GloVe model creates two word embedding vectors for each word, $E_i$ and $E'_i$; computing the two and then adding them together at the end helps limit overfitting.
Training a model like GloVe is typically much less expensive than training a standard neural network: a new model can be trained from billions of words of text in a few hours using a standard desktop CPU.

It is possible to train word embeddings on a specific domain, and recover knowledge in
that domain. For example, Tshitoyan et al. (2019) used 3.3 million scientific abstracts on the
subject of material science to train a word embedding model. They found that, just as we saw
that a generic word embedding model can answer “Athens is to Greece as Oslo is to what?” with “Norway,” their material science model can answer “NiFe is to ferromagnetic as IrMn is to what?” with “antiferromagnetic.”
Their model does not rely solely on co-occurrence of words; it seems to be capturing more complex scientific knowledge. When asked what chemical compounds can be classified as “thermoelectric” or “topological insulator,” their model is able to answer correctly. For example, CsAgGa₂Se₄ never appears near “thermoelectric” in the corpus, but it does appear near “chalcogenide,” “band gap,” and “optoelectric,” which are all clues enabling it to be classified as similar to “thermoelectric.” Furthermore, when trained only on abstracts up to the year 2008 and asked to pick compounds that are “thermoelectric” but have not yet appeared in abstracts, three of the model’s top five picks were discovered to be thermoelectric in papers published between 2009 and 2019.

24.5.2 Pretrained contextual representations
Word embeddings are better representations than atomic word tokens, but there is an important issue with polysemous words. For example, the word rose can refer to a flower or the past tense of rise. Thus, we expect to find at least two entirely distinct clusters of word contexts for rose: one similar to flower names such as dahlia, and one similar to upsurge. No single embedding vector can capture both of these simultaneously. Rose is a clear example of a word with (at least) two distinct meanings, but other words have subtle shades of meaning that depend on context, such as the word need in you need to see this movie versus humans need oxygen to survive. And some idiomatic phrases like break the bank are better analyzed as a whole rather than as component words.
Therefore, instead of just learning a word-to-embedding table, we want to train a model to
generate contextual representations of each word in a sentence. A contextual representation
maps both a word and the surrounding context of words into a word embedding vector. In
other words, if we feed this model the word rose and the context the gardener planted a rose bush, it should produce a contextual embedding that is similar (but not necessarily identical)
to the representation we get with the context the cabbage rose had an unusual fragrance, and very different from the representation of rose in the context the river rose five feet.
Figure 24.11 shows a recurrent network that creates contextual word embeddings—the boxes that are unlabeled in the figure. We assume we have already built a collection of
noncontextual word embeddings. We feed in one word at a time, and ask the model to predict the next word. So, for example, at the point in the figure where we have reached the word “car,” the RNN node at that time step will receive two inputs: the noncontextual word embedding for “car” and the context, which encodes information from the previous words “The red.” The RNN node will then output a contextual representation for “car.” The network
as a whole then outputs a prediction for the next word, “is.” We then update the network’s weights to minimize the error between the prediction and the actual next word.
This model is similar to the one for POS tagging in Figure 24.5, with two important
differences. First, this model is unidirectional (left-to-right), whereas the POS model is bidi-
rectional. Second, instead of predicting the POS tags for the current word, this model predicts
the next word using the prior context. Once the model is built, we can use it to retrieve representations for words and pass them on to some other task: we need not continue to predict
the next word. Note that computing a contextual representation always requires two inputs, the current word and the context.
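The two-input structure (current word plus context) can be sketched with a tiny recurrent cell in NumPy. This is only an illustration under loud assumptions: the vocabulary, the dimensions, and the randomly initialized and untrained weights `W_x` and `W_h` are hypothetical, and the next-word prediction layer and training loop are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "gardener", "planted", "a", "rose", "river"]
dim_in, dim_hid = 4, 6

# Noncontextual embeddings (random stand-ins for pretrained vectors).
E = {w: rng.normal(size=dim_in) for w in vocab}
W_x = 0.5 * rng.normal(size=(dim_hid, dim_in))   # input-to-hidden weights (untrained)
W_h = 0.5 * rng.normal(size=(dim_hid, dim_hid))  # hidden-to-hidden weights (untrained)

def contextual_representations(words):
    """Return one hidden state per word; each depends on the word and the left context."""
    h = np.zeros(dim_hid)
    outputs = []
    for w in words:
        h = np.tanh(W_x @ E[w] + W_h @ h)  # two inputs: current word, previous context
        outputs.append(h)
    return outputs

flower = contextual_representations("the gardener planted a rose".split())[-1]
verb = contextual_representations("the river rose".split())[-1]
print(np.allclose(flower, verb))  # the same word receives different vectors
```

Because the hidden state carries the left context forward, the vector produced for rose differs between the two sentences even though its noncontextual embedding is identical.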
24.5.3 Masked language models
A weakness of standard language models such as n-gram models is that the contextualization
of each word is based only on the previous words of the sentence. Predictions are made from
left to right. But sometimes context from later in a sentence—for example, feet in the phrase
rose five feet—helps to clarify earlier words.
[Figure 24.11 (image not recovered). Recoverable labels: word embeddings as inputs, a chain of RNN nodes whose outputs are the contextual representations, and feedforward layers on top of each RNN output.]
Figure 26.11 A simple triangular robot that can translate, and needs to avoid a rectangular
obstacle. On the left is the workspace, on the right is the configuration space.
Once we have a path, the task of executing a sequence of actions to follow the path is called trajectory tracking control. A trajectory is a path that has a time associated with each point on the path. A path just says “go from A to B to C, etc.” and a trajectory says “start at A, take 1 second to get to B, and another 1.5 seconds to get to C, etc.”

26.5.1 Configuration space
Imagine a simple robot, R, in the shape of a right triangle as shown by the lavender triangle in the lower left corner of Figure 26.11. The robot needs to plan a path that avoids a rectangular obstacle, O. The physical space that a robot moves about in is called the workspace. This particular robot can move in any direction in the x-y plane, but cannot rotate. The figure shows five other possible positions of the robot with dashed outlines; these are each as close
to the obstacle as the robot can get. The body of the robot could be represented as a set of (x,y) points (or (x,y,z) points
for a three-dimensional robot), as could the obstacle. With this representation, avoiding the obstacle means that no point on the robot overlaps any point on the obstacle. Motion planning
would require calculations on sets of points, which can be complicated and time-consuming.
We can simplify the calculations by using a representation scheme in which all the points
that comprise the robot are represented as a single point in an abstract multidimensional
space, which we call the configuration space, or C-space. The idea is that the set of points that comprise the robot can be computed if we know (1) the basic measurements of the robot
(for our triangle robot, the length of the three sides will do) and (2) the current pose of the robot—its position and orientation.
For our simple triangular robot, two dimensions suffice for the C-space: if we know the
(x,y) coordinates of a specific point on the robot—we’ll use the right-angle vertex—then we
can calculate where every other point of the triangle is (because we know the size and shape of the triangle and because the triangle cannot rotate). In the lower-left corner of Figure 26.11,
the lavender triangle can be represented by the configuration (0,0).
If we change the rules so that the robot can rotate, then we will need three dimensions, (x, y, θ), to be able to calculate where every point is. Here θ is the robot’s angle of rotation in the plane. If the robot also had the ability to stretch itself, growing uniformly by a scaling factor s, then the C-space would have four dimensions, (x, y, θ, s).
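To make the idea concrete, here is a small Python sketch that recovers the workspace vertices of a right-triangle robot from a configuration (x, y, θ). The leg lengths and the function name `triangle_points` are assumptions for illustration; the reference point is the right-angle vertex, as in the text.

```python
import math

def triangle_points(config, legs=(1.0, 1.0)):
    """Workspace vertices of a right-triangle robot in configuration (x, y, theta).

    (x, y) locates the right-angle vertex; `legs` holds the two leg lengths
    (hypothetical values). Setting theta = 0 gives the non-rotating robot.
    """
    x, y, theta = config
    a, b = legs
    c, s = math.cos(theta), math.sin(theta)
    local = [(0.0, 0.0), (a, 0.0), (0.0, b)]  # vertices in the robot's own frame
    return [(x + c * px - s * py, y + s * px + c * py) for px, py in local]

print(triangle_points((0.0, 0.0, 0.0)))
print(triangle_points((2.0, 3.0, math.pi / 2)))
```

A single point in C-space therefore determines every vertex (and hence every point) of the robot once the fixed measurements are known.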
For now we’ll stick with the simple two-dimensional C-space of the non-rotating triangle
robot. The next task is to figure out where the points in the obstacle are in C-space. Consider
the five dashed-line triangles on the left of Figure 26.11 and notice where the right-angle vertex is on each of these. Then imagine all the ways that the triangle could slide about.
Obviously, the right-angle vertex can’t go inside the obstacle, and neither can it get any closer than it is on any of the five dashed-line triangles. So you can see that the area where the right-angle vertex can’t go (the C-space obstacle) is the five-sided polygon on the right of Figure 26.11 labeled $C_{obs}$.
In everyday language we speak of there being multiple obstacles for the robot: a table, a chair, some walls. But the math notation is a bit easier if we think of all of these as combined into one “obstacle” that happens to have disconnected components. In general, the C-space obstacle is the set of all points $q$ in $C$ such that, if the robot were placed in that configuration, its workspace geometry would intersect the workspace obstacle.
Let the obstacles in the workspace be the set of points $O$, and let the set of all points on the robot in configuration $q$ be $\mathcal{A}(q)$. Then the C-space obstacle is defined as

$$C_{obs} = \{q : q \in C \text{ and } \mathcal{A}(q) \cap O \neq \{\}\}$$

and the free space is $C_{free} = C - C_{obs}$.
The C-space becomes more interesting for robots with moving parts. Consider the two-
link arm from Figure 26.12(a). It is bolted to a table so the base does not move, but the arm
has two joints that move independently—we call these degrees of freedom (DOF). Moving
the joints alters the (x,y) coordinates of the elbow, the gripper, and every point on the arm. The arm’s configuration space is two-dimensional: $(\theta_{shou}, \theta_{elb})$, where $\theta_{shou}$ is the angle of the shoulder joint, and $\theta_{elb}$ is the angle of the elbow joint.

Knowing the configuration for our two-link arm means we can determine where each point on the arm is through simple trigonometry. In general, the forward kinematics mapping is a function

$$\phi_b : C \to W$$

that takes in a configuration and outputs the location of a particular point $b$ on the robot when the robot is in that configuration. A particularly useful forward kinematics mapping is that for the robot’s end effector, $\phi_{EE}$. The set of all points on the robot in a particular configuration $q$ is denoted by $\mathcal{A}(q) \subset W$:

$$\mathcal{A}(q) = \bigcup_b \{\phi_b(q)\}.$$
The inverse problem, of mapping a desired location for a point on the robot to the config-
uration(s) the robot needs to be in for that to happen, is known as inverse kinematics:
$$IK_b : x \in W \mapsto \{q \in C \text{ s.t. } \phi_b(q) = x\}.$$
Sometimes the inverse kinematics mapping might take not just a position, but also a desired
orientation as input. When we want a manipulator to grasp an object, for instance, we can
compute a desired position and orientation for its gripper, and use inverse kinematics to de-
termine a goal configuration for the robot. Then a planner needs to find a way to get the robot from its current configuration to the goal configuration without intersecting obstacles.
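For a planar two-link arm, both mappings have closed forms. The sketch below is an illustration under stated assumptions: the link lengths `l1` and `l2` are hypothetical, the base sits at the origin, the elbow angle is measured relative to the first link, and only the gripper position (not its orientation) is considered.

```python
import math

def forward_kinematics(theta_shou, theta_elb, l1=1.0, l2=1.0):
    """Return the (x, y) positions of the elbow and the gripper."""
    elbow = (l1 * math.cos(theta_shou), l1 * math.sin(theta_shou))
    gripper = (elbow[0] + l2 * math.cos(theta_shou + theta_elb),
               elbow[1] + l2 * math.sin(theta_shou + theta_elb))
    return elbow, gripper

def inverse_kinematics(x, y, l1=1.0, l2=1.0):
    """Return every (theta_shou, theta_elb) that puts the gripper at (x, y)."""
    cos_elb = (x * x + y * y - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    if abs(cos_elb) > 1:
        return []                      # target is out of reach
    solutions = []
    # Elbow-up and elbow-down solutions (they coincide when the arm is straight).
    for elb in {math.acos(cos_elb), -math.acos(cos_elb)}:
        shou = math.atan2(y, x) - math.atan2(l2 * math.sin(elb),
                                             l1 + l2 * math.cos(elb))
        solutions.append((shou, elb))
    return solutions

for q in inverse_kinematics(1.0, 1.0):
    print(q, forward_kinematics(*q)[1])
```

Note that the inverse mapping returns a set of configurations: a reachable target in the interior of the workspace generally has two solutions, which is why the definition above maps a point to a set.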
Workspace obstacles are often depicted as simple geometric forms—especially in robotics textbooks, which tend to focus on polygonal obstacles. But how do the obstacles look in configuration space?
Figure 26.12 (a) Workspace representation of a robot arm with two degrees of freedom. The workspace is a box with a flat obstacle hanging from the ceiling. (b) Configuration space of the same robot. Only white regions in the space are configurations that are free of collisions. The dot in this diagram corresponds to the configuration of the robot shown on the left.
For the two-link arm, simple obstacles in the workspace, like a vertical line, have very
complex C-space counterparts, as shown in Figure 26.12(b). The different shadings of the occupied space correspond to the different objects in the robot’s workspace: the dark region
surrounding the entire free space corresponds to configurations in which the robot collides
with itself. It is easy to see that extreme values of the shoulder or elbow angles cause such a violation. The two oval-shaped regions on both sides of the robot correspond to the table on which the robot is mounted. The third oval region corresponds to the left wall.
Finally, the most interesting object in configuration space is the vertical obstacle that
hangs from the ceiling and impedes the robot’s motions.
This object has a funny shape in
configuration space: it is highly nonlinear and at places even concave. With a little bit of imagination the reader will recognize the shape of the gripper at the upper left end. We encourage the reader to pause for a moment and study this diagram. The shape of this
obstacle in C-space is not at all obvious!
The dot inside Figure 26.12(b) marks the configu-
ration of the robot in Figure 26.12(a). Figure 26.13 depicts three additional configurations,
both in workspace and in configuration space. In configuration conf-1, the gripper is grasping the vertical obstacle.
We see that even if the robot’s workspace is represented by flat polygons, the shape of
the free space can be very complicated. In practice, therefore, one usually probes a configu-
ration space instead of constructing it explicitly. A planner may generate a configuration and
then test to see if it is in free space by applying the robot kinematics and then checking for collisions in workspace coordinates.
Figure 26.13 Three robot configurations (conf-1, conf-2, and conf-3), shown in workspace and configuration space.

26.5.2 Motion planning
The motion planning problem is that of finding a plan that takes a robot from one configuration to another without colliding with an obstacle. It is a basic building block for movement and manipulation.
In Section 26.5.4 we will discuss how to do this under complicated dy-
namics, like steering a car that may drift off the path if you take a curve too fast. For now, we
will focus on the simple motion planning problem of finding a geometric path that is collision
free. Motion planning is a quintessentially continuous-state search problem, but it is often possible to discretize the space and apply the search algorithms from Chapter 3.
The motion planning problem is sometimes referred to as the piano mover’s problem. It
gets its name from a mover’s struggles with getting a large, irregular-shaped piano from one room to another without hitting anything. We are given:
• a workspace world $W$ in either $\mathbb{R}^2$ for the plane or $\mathbb{R}^3$ for three dimensions,
• an obstacle region $O \subset W$,
• a robot with a configuration space $C$ and set of points $\mathcal{A}(q)$ for $q \in C$,
• a starting configuration $q_s \in C$, and
• a goal configuration $q_g \in C$.

The obstacle region induces a C-space obstacle $C_{obs}$ and its corresponding free space $C_{free}$, defined as in the previous section. We need to find a continuous path through free space. We will use a parameterized curve, $\tau(t)$, to represent the path, where $\tau(0) = q_s$ and $\tau(1) = q_g$, and $\tau(t)$ for every $t$ between 0 and 1 is some point in $C_{free}$. That is, $t$ parameterizes how far we are along the path, from start to goal. Note that $t$ acts somewhat like time in that as $t$ increases the distance along the path increases, but $t$ is always a point on the interval [0, 1] and is not measured in seconds.
Figure 26.14 A visibility graph. Lines connect every pair of vertices that can “see” each other, that is, lines that don’t go through an obstacle. The shortest path must lie upon these lines.

The motion planning problem can be made more complex in various ways: defining the goal as a set of possible configurations rather than a single configuration; defining the goal in the workspace rather than the C-space; defining a cost function (e.g., path length) to be minimized; satisfying constraints (e.g., if the path involves carrying a cup of coffee, making sure that the cup is always oriented upright so the coffee does not spill).
The spaces of motion planning: Let’s take a step back and make sure we understand the spaces involved in motion planning. First, there is the workspace or world $W$. Points in $W$ are points in the everyday three-dimensional world. Next, we have the space of configurations, $C$. Points $q$ in $C$ are $d$-dimensional, with $d$ the robot’s number of degrees of freedom, and map to sets of points $\mathcal{A}(q)$ in $W$. Finally, there is the space of paths. The space of paths is a space of functions. Each point in this space maps to an entire curve through C-space. This space is ∞-dimensional! Intuitively, we need $d$ dimensions for each configuration along the path, and there are as many configurations on a path as there are points in the number line interval [0, 1]. Now let’s consider some ways of solving the motion planning problem.

Visibility graphs
For the simplified case of two-dimensional configuration spaces and polygonal C-space obstacles, visibility graphs are a convenient way to solve the motion planning problem with a guaranteed shortest-path solution. Let $V_{obs} \subset C$ be the set of vertices of the polygons making up $C_{obs}$, and let $V = V_{obs} \cup \{q_s, q_g\}$.

We construct a graph $G = (V, E)$ on the vertex set $V$ with edges $e_{ij} \in E$ connecting a vertex $v_i$ to another vertex $v_j$ if the line connecting the two vertices is collision-free, that is, if $\{\lambda v_i + (1 - \lambda) v_j : \lambda \in [0, 1]\} \cap C_{obs} = \{\}$. When this happens, we say the two vertices “can see each other,” which is where “visibility” graphs got their name.

To solve the motion planning problem, all we need to do is run a discrete graph search (e.g., best-first search) on the graph $G$ with starting state $q_s$ and goal $q_g$. In Figure 26.14 we see a visibility graph and an optimal three-step solution. An optimal search on visibility graphs will always give us the optimal path (if one exists), or report failure if no path exists.
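The construction above can be sketched in Python for a simplified setting. The assumptions are substantial and should be read as such: obstacles are convex polygons given as counterclockwise vertex tuples, the obstacle interior is treated as the forbidden set (so paths may touch edges and vertices), inputs are in general position (no three relevant points collinear), and uniform-cost search stands in for the best-first search mentioned in the text. All function names are hypothetical.

```python
import heapq
import math
from itertools import combinations

def orient(a, b, c):
    """Twice the signed area of triangle abc (positive if counterclockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def properly_cross(p, q, a, b):
    """True if segments pq and ab cross at a point interior to both."""
    return (orient(p, q, a) * orient(p, q, b) < 0 and
            orient(a, b, p) * orient(a, b, q) < 0)

def strictly_inside(pt, poly):
    n = len(poly)
    return all(orient(poly[i], poly[(i + 1) % n], pt) > 0 for i in range(n))

def visible(u, v, polygons):
    """Two vertices see each other if the segment avoids every obstacle interior."""
    mid = ((u[0] + v[0]) / 2, (u[1] + v[1]) / 2)
    for poly in polygons:
        n = len(poly)
        if any(properly_cross(u, v, poly[i], poly[(i + 1) % n]) for i in range(n)):
            return False
        if strictly_inside(mid, poly):   # catches diagonals of a single polygon
            return False
    return True

def shortest_path(q_s, q_g, polygons):
    V = [q_s, q_g] + [p for poly in polygons for p in poly]
    adj = {v: [] for v in V}
    for u, v in combinations(V, 2):
        if visible(u, v, polygons):
            w = math.dist(u, v)
            adj[u].append((v, w))
            adj[v].append((u, w))
    best, parent, frontier = {q_s: 0.0}, {q_s: None}, [(0.0, q_s)]
    while frontier:                      # uniform-cost search on the visibility graph
        cost, u = heapq.heappop(frontier)
        if u == q_g:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return cost, path[::-1]
        if cost > best[u]:
            continue
        for v, w in adj[u]:
            if cost + w < best.get(v, math.inf):
                best[v], parent[v] = cost + w, u
                heapq.heappush(frontier, (cost + w, v))
    return None

square = [(1, -1), (3, -1), (3, 1), (1, 1)]
print(shortest_path((0, 0), (4, 0), [square]))
```

In this example the direct segment is blocked, so the returned path hugs two corners of the square, which illustrates why visibility-graph paths run immediately adjacent to obstacles.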
Voronoi diagrams
Visibility graphs encourage paths that run immediately adjacent to an obstacle: if you had to walk around a table to get to the door, the shortest path would be to stick as close to the table as possible. However, if motion or sensing is nondeterministic, that would put you at risk of bumping into the table. One way to address this is to pretend that the robot’s body is a bit larger than it actually is, providing a buffer zone. Another way is to accept that path length is not the only metric we want to optimize. Section 26.8.2 shows how to learn a good metric from human examples of behavior.

Figure 26.15 A Voronoi diagram showing the set of points (black lines) equidistant to two or more obstacles in configuration space.

A third way is to use a different technique, one that puts paths as far away from obstacles
as possible rather than hugging close to them. A Voronoi diagram is a representation that
allows us to do just that. To get an idea for what a Voronoi diagram does, consider a space
where the obstacles are, say, a dozen small points scattered about a plane. Now surround each
of the obstacle points with a region consisting of all the points in the plane that are closer to that obstacle point than to any other obstacle point. Thus, the regions partition the plane. The
Voronoi diagram consists of the set of regions, and the Voronoi graph consists of the edges
and vertices of the regions.
When obstacles are areas, not points, everything stays pretty much the same. Each region
still contains all the points that are closer to one obstacle than to any other, where distance is
measured to the closest point on an obstacle. The boundaries between regions still correspond
to points that are equidistant between two obstacles, but now the boundary may be a curve
rather than a straight line. Computing these boundaries can be prohibitively expensive in high-dimensional spaces.
To solve the motion planning problem, we connect the start point $q_s$ to the closest point on the Voronoi graph via a straight line, and the same for the goal point $q_g$. We then use discrete graph search to find the shortest path on the graph. For problems like navigating
through corridors indoors, this gives a nice path that goes down the middle of the corridor.
However, in outdoor settings it can come up with inefficient paths, for example suggesting an
unnecessary 100 meter detour to stick to the middle of a wide-open 200-meter space.
Figure 26.16 (a) Value function and path found for a discrete grid cell approximation of the configuration space. (b) The same path visualized in workspace coordinates. Notice how the robot bends its elbow to avoid a collision with the vertical obstacle.

Cell decomposition
An alternative approach to motion planning is to discretize the C-space. Cell decomposition methods decompose the free space into a finite number of contiguous regions, called cells. These cells are designed so that the path-planning problem within a single cell can be solved by simple means (e.g., moving along a straight line). The path-planning problem then becomes a discrete graph search problem (as with visibility graphs and Voronoi graphs) to find a path through a sequence of cells.

The simplest cell decomposition consists of a regularly spaced grid. Figure 26.16(a) shows a square grid decomposition of the space and a solution path that is optimal for this grid size. Grayscale shading indicates the value of each free-space grid cell, that is, the cost of the shortest path from that cell to the goal. (These values can be computed by a deterministic form of the VALUE-ITERATION algorithm given in Figure 17.6 on page 573.) Figure 26.16(b) shows the corresponding workspace trajectory for the arm. Of course, we could also use the A* algorithm to find a shortest path.
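A grid value function of this kind can be sketched as follows. This is a simplification under stated assumptions: the grid is a hypothetical list of strings with `#` marking occupied cells, moves are 4-connected with unit cost, and a breadth-first wavefront from the goal is used in place of value iteration (with unit step costs and deterministic moves it yields the same cost-to-goal values).

```python
from collections import deque

MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))

def grid_values(grid, goal):
    """Cost-to-goal (in steps) for every free cell that can reach the goal."""
    rows, cols = len(grid), len(grid[0])
    value = {goal: 0}
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in MOVES:
            cell = (r + dr, c + dc)
            if (0 <= cell[0] < rows and 0 <= cell[1] < cols
                    and grid[cell[0]][cell[1]] != "#" and cell not in value):
                value[cell] = value[(r, c)] + 1
                queue.append(cell)
    return value

def descend(value, start):
    """Follow strictly decreasing values from start to the goal; None if unreachable."""
    if start not in value:
        return None
    path = [start]
    while value[path[-1]] > 0:
        r, c = path[-1]
        neighbors = [(r + dr, c + dc) for dr, dc in MOVES if (r + dr, c + dc) in value]
        path.append(min(neighbors, key=value.get))
    return path

grid = ["....",
        ".##.",
        "...."]
value = grid_values(grid, goal=(2, 3))
print(descend(value, (0, 0)))
```

Once the value function is computed, a path from any free cell is obtained by greedy descent, which is convenient when the same goal is queried from many start cells.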
This grid decomposition has the advantage that it is simple to implement, but it suffers from three limitations. First, it is workable only for low-dimensional configuration spaces, because the number of grid cells increases exponentially with $d$, the number of dimensions. (Sounds familiar? This is the curse of dimensionality.) Second, paths through discretized state space will not always be smooth. We see in Figure 26.16(a) that the diagonal parts of the path are jagged and hence very difficult for the robot to follow accurately. The robot can attempt to smooth out the solution path, but this is far from straightforward.

Third, there is the problem of what to do with cells that are “mixed,” that is, neither entirely within free space nor entirely within occupied space. A solution path that includes such a cell may not be a real solution, because there may be no way to safely cross the
cell. This would make the path planner unsound.
On the other hand, if we insist that only
completely free cells may be used, the planner will be incomplete, because it might be the case that the only paths to the goal go through mixed cells—it might be that a corridor is actually wide enough for the robot to pass, but the corridor is covered only by mixed cells. The first approach to this problem is further subdivision of the mixed cells—perhaps using cells of half the original size. This can be continued recursively until a path is found
that lies entirely within free cells. This method works well and is complete if there is a way to
decide if a given cell is a mixed cell, which is easy only if the configuration space boundaries have relatively simple mathematical descriptions.
It is important to note that cell decomposition does not necessarily require explicitly representing the obstacle space $C_{obs}$. We can decide to include a cell or not by using a collision checker. This is a crucial notion in motion planning. A collision checker is a function $\gamma(q)$ that maps to 1 if the configuration collides with an obstacle, and 0 otherwise. It is much easier to check whether a specific configuration is in collision than to explicitly construct the entire obstacle space $C_{obs}$.

Examining the solution path shown in Figure 26.16(a), we can see an additional difficulty that will have to be resolved. The path contains arbitrarily sharp corners, but a physical robot has momentum and cannot change direction instantaneously. This problem can be solved by storing, for each grid cell, the exact continuous state (position and velocity) that was attained
when the cell was reached in the search. Assume further that when propagating information to nearby grid cells, we use this continuous state as a basis, and apply the continuous robot
motion model for jumping to nearby cells. So we don’t make an instantaneous 90° turn; we
make a rounded turn governed by the laws of motion. We can now guarantee that the resulting trajectory is smooth and can indeed be executed by the robot. One algorithm that implements this is hybrid A*.
Randomized motion planning
Randomized motion planning does graph search on a random decomposition of the configuration space, rather than a regular cell decomposition. The key idea is to sample a random set of points and to create edges between them if there is a very simple way to get from one to the other (e.g., via a straight line) without colliding; then we can search on this graph.

A probabilistic roadmap (PRM) algorithm is one way to leverage this idea. We assume access to a collision checker $\gamma$ (defined above under cell decomposition), and to a simple planner $B(q_1, q_2)$ that returns a path from $q_1$ to $q_2$ (or failure) but does so quickly. This simple planner is not going to be complete: it might return failure even if a solution actually exists. Its job is to quickly try to connect $q_1$ and $q_2$ and let the main algorithm know if it succeeds. We will use it to define whether an edge exists between two vertices.

The algorithm starts by sampling $M$ milestones (points in $C_{free}$) in addition to the points $q_s$ and $q_g$. It uses rejection sampling, where configurations are sampled randomly and collision-checked using $\gamma$ until a total of $M$ milestones are found. Next, the algorithm uses the simple planner to try to connect pairs of milestones. If the simple planner returns success, then an edge between the pair is added to the graph; otherwise, the graph remains as is. We try to connect each milestone either to its $k$ nearest neighbors (we call this $k$-PRM), or to all milestones in a sphere of a radius $r$. Finally, the algorithm searches for a path on this
graph from $q_s$ to $q_g$. If no path is found, then $M$ more milestones are sampled, added to the graph, and the process is repeated.

Figure 26.17 The probabilistic roadmap (PRM) algorithm. Top left: the start and goal configurations. Top right: sample $M$ collision-free milestones (here $M = 5$). Bottom left: connect each milestone to its $k$ nearest neighbors (here $k = 3$). Bottom right: find the shortest path from the start to the goal on the resulting graph.

Figure 26.17 shows a roadmap with the path found between two configurations. PRMs are not complete, but they are what is called probabilistically complete: they will eventually find a path, if one exists. Intuitively, this is because they keep sampling more milestones. PRMs work well even in high-dimensional configuration spaces.

PRMs are also popular for multi-query planning, in which we have multiple motion planning problems within the same C-space. Often, once the robot reaches a goal, it is called upon to reach another goal in the same workspace. PRMs are really useful, because the robot can dedicate time up front to constructing a roadmap, and amortize the use of that roadmap over multiple queries.
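The PRM loop can be sketched in a few dozen lines of Python. Everything concrete here is an assumption for illustration: the C-space is the unit square, the single disk obstacle in `OBSTACLES` is hypothetical, the simple planner is a straight line checked at sampled points, milestones are connected to all others within a radius (the sphere variant rather than the k-nearest variant), and uniform-cost search is used on the roadmap.

```python
import heapq
import math
import random

OBSTACLES = [((0.5, 0.5), 0.2)]  # hypothetical disk obstacles: (center, radius)

def gamma(q):
    """Collision checker: 1 if configuration q lies inside an obstacle, else 0."""
    return int(any(math.dist(q, center) <= radius for center, radius in OBSTACLES))

def simple_planner(q1, q2, step=0.02):
    """Incomplete straight-line planner: True if sampled points on the segment are free."""
    n = max(1, math.ceil(math.dist(q1, q2) / step))
    return all(
        gamma((q1[0] + (q2[0] - q1[0]) * i / n, q1[1] + (q2[1] - q1[1]) * i / n)) == 0
        for i in range(n + 1))

def graph_search(adj, q_s, q_g):
    """Uniform-cost search; returns the configurations on a shortest roadmap path."""
    best, parent, frontier = {q_s: 0.0}, {q_s: None}, [(0.0, q_s)]
    while frontier:
        cost, u = heapq.heappop(frontier)
        if u == q_g:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        if cost > best[u]:
            continue
        for v, w in adj[u]:
            if cost + w < best.get(v, math.inf):
                best[v], parent[v] = cost + w, u
                heapq.heappush(frontier, (cost + w, v))
    return None

def prm(q_s, q_g, m=100, radius=0.3, max_rounds=10, seed=0):
    rng = random.Random(seed)
    adj = {q_s: [], q_g: []}
    for _ in range(max_rounds):
        added = 0
        while added < m:                       # rejection sampling of milestones
            q = (rng.random(), rng.random())
            if gamma(q) or q in adj:
                continue
            near = [p for p in adj if math.dist(p, q) <= radius and simple_planner(p, q)]
            adj[q] = []
            for p in near:                     # add an edge for each successful connection
                w = math.dist(p, q)
                adj[q].append((p, w))
                adj[p].append((q, w))
            added += 1
        path = graph_search(adj, q_s, q_g)
        if path is not None:
            return path
    return None                                # gave up after max_rounds batches

path = prm((0.1, 0.1), (0.9, 0.9))
print(len(path) if path else "no path found")
```

The roadmap `adj` is built independently of any particular query apart from the two endpoints, so in a multi-query setting it could be kept and reused, with only the new start and goal connected each time.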
Rapidly-exploring random trees
An extension of PRMs called rapidly exploring random trees (RRTs) is popular for single-query planning. We incrementally build two trees, one with $q_s$ as the root and one with $q_g$ as the root. Random milestones are chosen, and an attempt is made to connect each new milestone to the existing trees. If a milestone connects both trees, that means a solution has been found, as in Figure 26.18. If not, the algorithm finds the closest point in each tree and adds to the tree a new edge that extends from the point by a distance $\delta$ towards the milestone. This tends to grow the tree towards previously unexplored sections of the space.

Roboticists love RRTs for their ease of use. However, RRT solutions are typically nonoptimal and lack smoothness. Therefore, RRTs are often followed by a post-processing step.
The most common one is “short-cutting,” in which we randomly select one of the vertices on the solution path and try to remove it by connecting its neighbors to each other (via the simple planner). We do this repeatedly for as many steps as we have compute time for. Even then, the trajectories might look a little unnatural due to the random positions of the milestones that were selected, as shown in Figure 26.19.

Figure 26.18 The bidirectional RRT algorithm constructs two trees (one from the start, the other from the goal) by incrementally connecting each sample to the closest node in each tree, if the connection is possible. When a sample connects to both trees, that means we have found a solution path.

Figure 26.19 Snapshots of a trajectory produced by an RRT and post-processed with short-cutting. Courtesy of Anca Dragan.

RRT* is a modification to RRT that makes the algorithm asymptotically optimal: the solution converges to the optimal solution as more and more milestones are sampled. The key idea is to pick the nearest neighbor based on a notion of cost to come rather than distance from the milestone only, and to rewire the tree, swapping parents of older vertices if it is cheaper to reach them via the new milestone.
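The extension step and the short-cutting post-process can be sketched as follows. Note the deliberate simplifications, all of which are assumptions: this is a single-tree RRT with a small goal bias rather than the bidirectional variant described above, it is not RRT* (no rewiring), the C-space is the unit square with one hypothetical disk obstacle, and the simple planner is a sampled straight line.

```python
import math
import random

OBSTACLES = [((0.5, 0.5), 0.2)]  # hypothetical disk obstacles: (center, radius)

def gamma(q):
    """Collision checker: 1 if q lies inside an obstacle, else 0."""
    return int(any(math.dist(q, c) <= r for c, r in OBSTACLES))

def simple_planner(q1, q2, step=0.02):
    """Straight-line planner: True if sampled points along the segment are free."""
    n = max(1, math.ceil(math.dist(q1, q2) / step))
    return all(
        gamma((q1[0] + (q2[0] - q1[0]) * i / n, q1[1] + (q2[1] - q1[1]) * i / n)) == 0
        for i in range(n + 1))

def rrt(q_s, q_g, delta=0.05, max_iters=5000, goal_bias=0.1, seed=0):
    if q_s == q_g:
        return [q_s]
    rng = random.Random(seed)
    parent = {q_s: None}                      # tree stored as child -> parent
    for _ in range(max_iters):
        target = q_g if rng.random() < goal_bias else (rng.random(), rng.random())
        near = min(parent, key=lambda q: math.dist(q, target))
        d = math.dist(near, target)
        if d == 0.0:
            continue
        frac = min(1.0, delta / d)            # extend by at most delta toward the sample
        new = (near[0] + frac * (target[0] - near[0]),
               near[1] + frac * (target[1] - near[1]))
        if new in parent or not simple_planner(near, new):
            continue
        parent[new] = near
        if math.dist(new, q_g) <= delta and simple_planner(new, q_g):
            if new != q_g:
                parent[q_g] = new
            path = [q_g]
            while parent[path[-1]] is not None:
                path.append(parent[path[-1]])
            return path[::-1]
    return None

def path_length(path):
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))

def shortcut(path, attempts=200, seed=0):
    """Repeatedly try to drop a random interior vertex by joining its neighbors."""
    rng = random.Random(seed)
    path = list(path)
    for _ in range(attempts):
        if len(path) < 3:
            break
        i = rng.randrange(1, len(path) - 1)
        if simple_planner(path[i - 1], path[i + 1]):
            del path[i]
    return path

raw = rrt((0.1, 0.1), (0.9, 0.9))
smooth = shortcut(raw)
print(len(raw), path_length(raw), len(smooth), path_length(smooth))
```

Short-cutting can only shorten the path (by the triangle inequality), but the result is still built from randomly placed milestones, so it may remain visibly unnatural.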
Trajectory optimization for kinematic planning
Randomized sampling algorithms tend to first construct a complex but feasible path and then optimize it. Trajectory optimization does the opposite: it starts with a simple but infeasible path, and then works to push it out of collision. The goal is to find a path that optimizes a cost function¹ over paths. That is, we want to minimize the cost function $J(\tau)$, where $\tau(0) = q_s$ and $\tau(1) = q_g$.

$J$ is called a functional because it is a function over functions. The argument to $J$ is $\tau$, which is itself a function: $\tau(t)$ takes as input a point in the [0, 1] interval and maps it to a configuration. A standard cost functional trades off between two important aspects of the robot’s motion: collision avoidance and efficiency,

$$J = J_{obs} + \lambda J_{eff}$$

where the efficiency $J_{eff}$ measures the length of the path and may also measure smoothness. A convenient way to define efficiency is with a quadratic: it integrates the squared first derivative of $\tau$ (we will see in a bit why this does in fact incentivize short paths):

$$J_{eff} = \int_0^1 \frac{1}{2} \|\dot{\tau}(s)\|^2 \, ds.$$
For the obstacle term, assume we can compute the distance $d(x)$ from any point $x \in \mathcal{W}$ to the nearest obstacle edge. This distance is positive outside of obstacles, 0 at the edge, and negative inside. This is called a signed distance field. We can now define a cost field in the workspace, call it $c$, that has high cost inside of obstacles, and a small cost right outside. With this cost, we can make points in the workspace really hate being inside obstacles, and dislike being right next to them (avoiding the visibility graph problem of their always hanging out by the edges of obstacles). Of course, our robot is not a point in the workspace, so we have some more work to do: we need to consider all points $b$ on the robot's body, each mapped into the workspace by its forward kinematics $\phi_b$:
$$J_{obs}(\tau) = \int_0^1 \int_b c(\phi_b(\tau(s))) \left\| \frac{d}{ds} \phi_b(\tau(s)) \right\| \, db \, ds.$$
This is called a path integral: it does not just integrate $c$ along the way for each body point, but it multiplies by the derivative to make the cost invariant to retiming of the path. Imagine a robot sweeping through the cost field, accumulating cost as it moves. Regardless of how fast or slow the arm moves through the field, it must accumulate the exact same cost.
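To see why multiplying by the derivative gives retiming invariance, consider a single body point and write $p(s)$ for its workspace position when the robot is at configuration $\tau(s)$. The following change-of-variables sketch assumes a smooth, strictly increasing retiming $\sigma : [0,1] \to [0,1]$ with $\sigma(0) = 0$ and $\sigma(1) = 1$; the symbols $p$ and $\sigma$ are introduced here only for illustration.

```latex
\int_0^1 c\big(p(\sigma(r))\big)\,\Big\|\tfrac{d}{dr}\,p(\sigma(r))\Big\|\,dr
  = \int_0^1 c\big(p(\sigma(r))\big)\,\big\|p'(\sigma(r))\big\|\,\sigma'(r)\,dr
  = \int_0^1 c\big(p(s)\big)\,\big\|p'(s)\big\|\,ds
```

The first step is the chain rule (with $\sigma'(r) > 0$ pulled out of the norm), and the second substitutes $s = \sigma(r)$, $ds = \sigma'(r)\,dr$. The retimed path therefore accumulates the same cost as the original one.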
The simplest way to solve the optimization problem above and find a path is gradient descent. If you are wondering how to take gradients of functionals with respect to functions, something called the calculus of variations is here to help. It is especially easy for functionals of the form
$$J[\tau] = \int_0^1 F(s, \tau(s), \dot\tau(s)) \, ds$$
which are integrals of functions that depend just on the parameter $s$, the value of the function at $s$, and the derivative of the function at $s$. In such a case, the Euler-Lagrange equation says that the gradient is
$$\nabla_\tau J(s) = \frac{\partial F}{\partial \tau(s)}(s) - \frac{d}{ds} \frac{\partial F}{\partial \dot\tau(s)}(s).$$
1 Roboticists like to minimize a cost function J, whereas in other parts of AI we try to maximize a utility function U or a reward R.
Figure 26.20 Trajectory optimization for motion planning. Two point-obstacles with circular bands of decreasing cost around them. The optimizer starts with the straight line trajectory, and lets the obstacles bend the line away from collisions, finding the minimum path through the cost field.
If we look closely at $J_{eff}$ and $J_{obs}$, they both follow this pattern. In particular for $J_{eff}$, we have $F(s, \tau(s), \dot\tau(s)) = \frac{1}{2}\|\dot\tau(s)\|^2$. To get a bit more comfortable with this, let's compute the gradient for $J_{eff}$ only. We see that $F$ does not have a direct dependence on $\tau(s)$, so the first term in the formula is 0. We are left with
$$\nabla_\tau J(s) = 0 - \frac{d}{ds} \dot\tau(s)$$
since the partial of $F$ with respect to $\dot\tau(s)$ is $\dot\tau(s)$.
Notice how we made things easier for ourselves when defining $J_{eff}$: it's a nice quadratic of the derivative (and we even put a $\frac{1}{2}$ in front so that the 2 nicely cancels out). In practice, you will see this trick happen a lot for optimization: the art is not just in choosing how to optimize the cost function, but also in choosing a cost function that will play nicely with how you will optimize it. Simplifying our gradient, we get
$$\nabla_\tau J(s) = -\ddot\tau(s).$$
Now, since $J$ is a quadratic, setting this gradient to 0 gives us the solution for $\tau$ if we didn't have to deal with obstacles. Integrating once, we get that the first derivative needs to be constant; integrating again we get that $\tau(s) = a \cdot s + b$, with $a$ and $b$ determined by the endpoint constraints for $\tau(0)$ and $\tau(1)$. The optimal path with respect to $J_{eff}$ is thus the straight line from start to goal! It is indeed the most efficient way to go from one to the other if there are no obstacles to worry about.
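The straight-line result can be checked numerically. The sketch below is an illustration under simplifying assumptions: the path is a list of scalar waypoints with fixed endpoints, the efficiency cost is discretized as a sum of half squared differences between neighbors, and the step size and iteration count are arbitrary choices. The function names are hypothetical.

```python
def efficiency_gradient(path):
    """Discrete analogue of -tau''(s) at each interior waypoint."""
    grad = [0.0] * len(path)
    for i in range(1, len(path) - 1):
        grad[i] = -(path[i - 1] - 2.0 * path[i] + path[i + 1])
    return grad

def optimize(path, step=0.25, iters=2000):
    """Plain gradient descent on the discretized efficiency cost."""
    path = list(path)
    for _ in range(iters):
        g = efficiency_gradient(path)
        # Endpoints stay fixed: they are the start and goal constraints.
        for i in range(1, len(path) - 1):
            path[i] -= step * g[i]
    return path

# A wiggly 1-D path from 0 to 1.
initial = [0.0, 0.9, -0.3, 0.8, 0.1, 1.0]
result = optimize(initial)
straight = [i / 5 for i in range(6)]
```

After the descent, `result` agrees with the evenly spaced `straight` path to numerical precision, matching the closed-form solution derived above. Adding an obstacle cost would bend this path away from the straight line.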
Of course, the addition of $J_{obs}$ is what makes things difficult, and we will spare you deriving its gradient here. The robot would typically initialize its path to be a straight line, which would plow right through some obstacles. It would then calculate the gradient of the cost about the current path, and the gradient would serve to push the path away from the obstacles (Figure 26.20). Keep in mind that gradient descent will only find a locally optimal solution, just like hill climbing. Methods such as simulated annealing (Section 4.1.2) can be used for exploration, to make it more likely that the local optimum is a good one.
26.5.3 Trajectory tracking control
We have covered how to plan motions, but not how to actually move: to apply current to motors, to produce torque, to move the robot. This is the realm of control theory, a field of increasing importance in AI. There are two main questions to deal with: how do we turn
a mathematical description of a path into a sequence of actions in the real world (open-loop control), and how do we make sure that we are staying on track (closed-loop control)?
Figure 26.21 The task of reaching to grasp a bottle solved with a trajectory optimizer. Left: the initial trajectory, plotted for the end effector. Middle: the final trajectory after optimization. Right: the goal configuration. Courtesy of Anca Dragan. See Ratliff et al. (2009).
From configurations to torques for open-loop tracking: Our path $\tau(t)$ gives us configurations. The robot starts at rest at $q_s = \tau(0)$. From there the robot's motors will turn currents into torques, leading to motion. But what torques should the robot aim for, such that it ends up at $q_g = \tau(1)$?
This is where the idea of a dynamics model (or transition model) comes in. We can give the robot a function $f$ that computes the effects torques have on the configuration. Remember $F = ma$ from physics? Well, there is something like that for torques too, in the form $u = f^{-1}(q, \dot q, \ddot q)$, with $u$ a torque, $\dot q$ a velocity, and $\ddot q$ an acceleration.² If the robot is at configuration $q$ and velocity $\dot q$, and applied torque $u$, that would lead to acceleration $\ddot q = f(q, \dot q, u)$. The tuple $(q, \dot q)$ is a dynamic state, because it includes velocity, whereas $q$ is the kinematic state and is not sufficient for computing exactly what torque to apply. $f$ is a deterministic dynamics model in the MDP over dynamic states with torques as actions. $f^{-1}$ is the inverse dynamics, telling us what torque to apply if we want a particular acceleration, which leads to a change in velocity and thus a change in dynamic state.
Now, naively, we could think of $t \in [0, 1]$ as "time" on a scale from 0 to 1 and select our torque using inverse dynamics:
$$u(t) = f^{-1}(\tau(t), \dot\tau(t), \ddot\tau(t)) \qquad (26.2)$$
assuming that the robot starts at $(\tau(0), \dot\tau(0))$. In reality though, things are not that easy.
The path $\tau$ was created as a sequence of points, without taking velocities and accelerations into account. As such, the path may not satisfy $\dot\tau(0) = 0$ (the robot starts at 0 velocity), or even that $\tau$ is differentiable (let alone twice differentiable). Further, the meaning of the endpoint "1" is unclear: how many seconds does that map to?
2 We omit the details of $f^{-1}$ here, but they involve mass, inertia, gravity, and Coriolis and centrifugal forces.
Figure 26.22 Robot arm control using (a) proportional control with gain factor 1.0, (b) proportional control with gain factor 0.1, and (c) PD (proportional derivative) control with gain factors 0.3 for the proportional component and 0.8 for the differential component. In all cases the robot arm tries to follow the smooth line path, but in (a) and (b) deviates substantially from the path.
In practice, before we even think of tracking a reference path, we usually retime it, that is, transform it into a trajectory $\xi(t)$ that maps the interval $[0, T]$ for some time duration $T$ into points in the configuration space $\mathcal{C}$. (The symbol $\xi$ is the Greek letter Xi.) Retiming is trickier than you might think, but there are approximate ways to do it, for instance by picking a maximum velocity and acceleration, and using a profile that accelerates to that maximum velocity, stays there as long as it can, and then decelerates back to 0. Assuming we can do this, Equation (26.2) above can be rewritten as
$$u(t) = f^{-1}(\xi(t), \dot\xi(t), \ddot\xi(t)). \qquad (26.3)$$
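The accelerate, cruise, decelerate profile mentioned above for retiming can be sketched as follows. This is an approximate illustration, assuming the limits are expressed in units of the path parameter (so velocity means progress along the path per second) and that the same limit applies to acceleration and deceleration. The function name `trapezoid_profile` and its parameters are hypothetical.

```python
import math

def trapezoid_profile(v_max, a_max, length=1.0):
    """Return (T, s) where s(t) gives progress along a path of the given
    length at time t: accelerate to v_max, cruise, then decelerate to 0."""
    t_acc = v_max / a_max
    d_acc = 0.5 * a_max * t_acc ** 2
    if 2 * d_acc > length:
        # Path too short to reach v_max: use a triangular profile instead.
        t_acc = math.sqrt(length / a_max)
        v_peak = a_max * t_acc
        d_acc = 0.5 * length
        t_cruise = 0.0
    else:
        v_peak = v_max
        t_cruise = (length - 2 * d_acc) / v_max
    T = 2 * t_acc + t_cruise

    def s(t):
        t = min(max(t, 0.0), T)               # clamp to [0, T]
        if t < t_acc:                         # accelerating
            return 0.5 * a_max * t ** 2
        if t < t_acc + t_cruise:              # cruising
            return d_acc + v_peak * (t - t_acc)
        t_left = T - t                        # decelerating
        return length - 0.5 * a_max * t_left ** 2

    return T, s

T, s = trapezoid_profile(v_max=0.5, a_max=1.0)
```

With these example limits the duration is `T = 2.5`, and the retimed trajectory would be obtained by composing the path with the profile, so that the configuration at time `t` is the path evaluated at `s(t)`. Velocity starts and ends at 0, which addresses the starting-at-rest requirement, although the underlying path still has to be smooth enough to differentiate.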
Even with the change from $\tau$ to $\xi$, an actual trajectory, the equation of applying torques from above (called a control law) has a problem in practice. Thinking back to the reinforcement learning section, you might guess what it is. The equation works great in the situation where $f$ is exact, but pesky reality gets in the way as usual: in real systems, we can't measure masses and inertias exactly, and $f$ might not properly account for physical phenomena like stiction in the motors (the friction that tends to prevent stationary surfaces from being set in motion, making them stick). So, when the robot arm starts applying those torques but $f$ is wrong, the errors accumulate and you deviate further and further from the reference path.
Rather than just letting those errors accumulate, a robot can use a control process that looks at where it thinks it is, compares that to where it wanted to be, and applies a torque to minimize the error.
A controller that provides force in negative proportion to the observed error is known as a proportional controller or P controller for short. The equation for the force is
$$u(t) = K_P(\xi(t) - q_t)$$
where $q_t$ is the current configuration, and $K_P$ is a constant representing the gain factor of the controller. $K_P$ regulates how strongly the controller corrects for deviations between the actual state $q_t$ and the desired state $\xi(t)$.
Figure 26.22(a) illustrates what can go wrong with proportional control. Whenever a deviation occurs—whether due to noise or to constraints on the forces the robot can apply—the
robot provides an opposing force whose magnitude is proportional to this deviation. Intuitively, this might appear plausible, since deviations should be compensated by a counterforce to keep the robot on track. However, as Figure 26.22(a) illustrates, a proportional controller
can cause the robot to apply too much force, overshooting the desired path and zig-zagging back and forth. This is the result of the natural inertia of the robot: once driven back to its reference position the robot has a velocity that can't instantaneously be stopped.
In Figure 26.22(a), the parameter $K_P = 1$. At first glance, one might think that choosing a smaller value for $K_P$ would remedy the problem, giving the robot a gentler approach to the desired path. Unfortunately, this is not the case. Figure 26.22(b) shows a trajectory for $K_P = 0.1$, still exhibiting oscillatory behavior. The lower value of the gain parameter helps, but does not solve the problem. In fact, in the absence of friction, the P controller is essentially a spring law; so it will oscillate indefinitely around a fixed target location.
There are a number of controllers that are superior to the simple proportional control law.
A controller is said to be stable if small perturbations lead to a bounded error between the
robot and the reference signal. It is said to be strictly stable if it is able to return to and then
stay on its reference path upon such perturbations. Our P controller appears to be stable but not strictly stable, since it fails to stay anywhere near its reference trajectory.
The simplest controller that achieves strict stability in our domain is a PD controller. The letter 'P' stands again for proportional, and 'D' stands for derivative. PD controllers are described by the following equation:
$$u(t) = K_P(\xi(t) - q_t) + K_D(\dot\xi(t) - \dot q_t). \qquad (26.4)$$
As this equation suggests, PD controllers extend P controllers by a differential component, which adds to the value of $u(t)$ a term that is proportional to the first derivative of the error $\xi(t) - q_t$ over time. What is the effect of such a term? In general, a derivative term dampens the system that is being controlled.
To see this, consider a situation where the error is
changing rapidly over time, as is the case for our P controller above. The derivative of this error will then counteract the proportional term, which will reduce the overall response to
the perturbation. However, if the same error persists and does not change, the derivative will
vanish and the proportional term dominates the choice of control.
Figure 26.22(c) shows the result of applying this PD controller to our robot arm, using as gain parameters $K_P = 0.3$ and $K_D = 0.8$. Clearly, the resulting path is much smoother, and does not exhibit any obvious oscillations.
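The difference between P and PD control can be reproduced on a toy system. The sketch below is not the robot arm of Figure 26.22; it assumes a frictionless unit-mass point whose acceleration equals the control input, a fixed target (so the reference velocity is 0), and a semi-implicit Euler integrator. Only the gain values 0.3 and 0.8 are taken from the text.

```python
def simulate(kp, kd, target=1.0, dt=0.01, steps=5000):
    """Unit-mass point with acceleration equal to the control u,
    tracking a fixed target. Returns the error at every time step."""
    q, qdot = 0.0, 0.0
    errors = []
    for _ in range(steps):
        # Proportional term on position error, derivative term on velocity error.
        u = kp * (target - q) + kd * (0.0 - qdot)
        qdot += u * dt      # semi-implicit Euler: update velocity first
        q += qdot * dt
        errors.append(target - q)
    return errors

p_errors = simulate(kp=0.3, kd=0.0)    # P controller: behaves like a spring
pd_errors = simulate(kp=0.3, kd=0.8)   # PD controller: spring plus damper

# Largest error magnitude over the last 20 simulated seconds.
late_p = max(abs(e) for e in p_errors[-2000:])
late_pd = max(abs(e) for e in pd_errors[-2000:])
```

In this toy setting the P controller keeps swinging past the target with roughly its initial amplitude, while the PD controller's error has decayed to nearly zero, which is the damping effect described above.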
PD controllers do have failure modes, however. In particular, PD controllers may fail to regulate an error down to zero, even in the absence of external perturbations. Often such a situation is the result of a systematic external force that is not part of the model. For example, an autonomous car driving on a banked surface may find itself systematically pulled to one side. Wear and tear in robot arms causes similar systematic errors. In such situations, an over-proportional feedback is required to drive the error closer to zero. The solution to this problem lies in adding a third term to the control law, based on the integrated error over time:
$$u(t) = K_P(\xi(t) - q_t) + K_I \int_0^t (\xi(s) - q_s) \, ds + K_D(\dot\xi(t) - \dot q_t). \qquad (26.5)$$
Here $K_I$ is a third gain parameter. The term $\int_0^t (\xi(s) - q_s) \, ds$ calculates the integral of the error over time. The effect of this term is that long-lasting deviations between the reference signal and the actual state are corrected. Integral terms, then, ensure that a controller does not exhibit systematic long-term error, although they do pose a danger of oscillatory behavior.
A controller with all three terms is called a PID controller (for proportional integral derivative). PID controllers are widely used in industry, for a variety of control problems. Think of the three terms as follows. Proportional: try harder the farther away you are from the path; derivative: try even harder if the error is increasing; integral: try harder if you haven't made progress for a long time.
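The effect of the integral term can be illustrated on a toy system with a systematic unmodeled force. The sketch below assumes a unit-mass point pushed by a constant disturbance and a fixed target. The integral gain of 0.05 and the disturbance of -0.15 are arbitrary illustrative choices, not values from the text.

```python
def simulate_pid(kp, ki, kd, disturbance, target=1.0, dt=0.01, steps=10000):
    """Unit-mass point pushed by a constant unmodeled force,
    tracking a fixed target. Returns the final error."""
    q, qdot, integral = 0.0, 0.0, 0.0
    for _ in range(steps):
        error = target - q
        integral += error * dt                  # running integral of the error
        u = kp * error + ki * integral + kd * (0.0 - qdot)
        qdot += (u + disturbance) * dt          # semi-implicit Euler step
        q += qdot * dt
    return target - q

pd_final = simulate_pid(kp=0.3, ki=0.0, kd=0.8, disturbance=-0.15)
pid_final = simulate_pid(kp=0.3, ki=0.05, kd=0.8, disturbance=-0.15)
```

With PD control alone the point settles where the proportional force exactly balances the disturbance, leaving a persistent error of about 0.5 in this setup. With the integral term the accumulated error keeps growing the control until the disturbance is cancelled, and the final error is close to zero. Larger integral gains would eventually destabilize this system, which is the oscillation danger noted above.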
Computed torque control
A middle ground between open-loop control based on inverse dynamics and closed-loop PID control is called computed torque control. We compute the torque our model thinks we will need, but compensate for model inaccuracy with proportional error terms:
$$u(t) = \underbrace{f^{-1}(\xi(t), \dot\xi(t), \ddot\xi(t))}_{\text{feedforward}} + \underbrace{m(\xi(t)) \left( K_P(\xi(t) - q_t) + K_D(\dot\xi(t) - \dot q_t) \right)}_{\text{feedback}} \qquad (26.6)$$
The first term is called the feedforward component because it looks forward to where the robot needs to go and computes what torque might be required. The second is the feedback component because it feeds the current error in the dynamic state back into the control law. $m(q)$ is the inertia matrix at configuration $q$; unlike normal PD control, the gains change with the configuration of the system.
Plans versus policies
Let’s take a step back and make sure we understand the analogy between what happened so far in this chapter and what we learned in the search, MDP, and reinforcement learning chapters.
With motion in robotics, we are really considering an underlying MDP where the states are
dynamic states (configuration and velocity), and the actions are control inputs, usually in the form of torques. If you take another look at our control laws above, they are policies, not plans—they tell the robot what action to take from any state it might reach. However,
they are usually far from optimal policies. Because the dynamic state is continuous and high dimensional (as is the action space), optimal policies are computationally difficult to extract. Instead, what we did here is to break up the problem. We come up with a plan first, in a simplified state and action space: we use only the kinematic state, and assume that states are reachable from one another without paying attention to the underlying dynamics. This is motion planning, and it gives us the reference path. If we knew the dynamics perfectly, we
could turn this into a plan for the original state and action space with Equation (26.3).
But because our dynamics model is typically erroneous, we turn it instead into a policy
that tries to follow the plan—getting back to it when it drifts away. When doing this, we introduce suboptimality in two ways:
first by planning without considering dynamics, and
second by assuming that if we deviate from the plan, the optimal thing to do is to return to the original plan. In what follows, we describe techniques that compute policies directly over the dynamic state, avoiding the separation altogether.
26.5.4 Optimal control
Rather than using a planner to create a kinematic path, and only worrying about the dynamics
of the system after the fact, here we discuss how we might be able to do it all at once. We'll
take the trajectory optimization problem for kinematic paths, and turn it into true trajectory
optimization with dynamics: we will optimize directly over the actions, taking the dynamics (or transitions) into account. This brings us much closer to what we’ve seen in the search and MDP chapters.
If we
know the system’s dynamics, then we can find a sequence of actions to execute, as we did in
Chapter 3. If we’re not sure, then we might want a policy, as in Chapter 17.
In this section, we are looking more directly at the underlying MDP the robot works in. We're switching from the familiar discrete MDPs to continuous ones. We will denote our dynamic state of the world by $x$, as is common practice; it is the equivalent of $s$ in discrete MDPs. Let $x_s$ and $x_g$ be the starting and goal states.
We want to find a sequence of actions that, when executed by the robot, result in state-action pairs with low cumulative cost. The actions are torques which we denote with $u(t)$ for $t$ starting at 0 and ending at $T$. Formally, we want to find the sequence of torques $u$ that minimize a cumulative cost $J$:
$$\min_u \int_0^T J(x(t), u(t)) \, dt \qquad (26.7)$$
subject to the constraints
$$\forall t, \ \dot x(t) = f(x(t), u(t))$$
$$x(0) = x_s, \quad x(T) = x_g.$$
How is this connected to motion planning and trajectory tracking control? Well, imagine
we take the notion of efficiency and clearance away from the obstacles and put it into the cost function J, just as we did before in trajectory optimization over kinematic state. The dynamic
state is the configuration and velocity, and torques $u$ change it via the dynamics $f$ from open-loop trajectory tracking. The difference is that now we're thinking about the configurations and the torques at the same time. Sometimes, we might want to treat collision avoidance as a hard constraint as well, something we've also mentioned before when we looked at trajectory optimization for the kinematic state only.
To solve this optimization problem, we can take gradients of $J$, not with respect to the sequence $\tau$ of configurations anymore, but directly with respect to the controls $u$. It is sometimes helpful to include the state sequence $x$ as a decision variable too, and use the dynamics constraints to ensure that $x$ and $u$ are consistent. There are various trajectory optimization techniques using this approach; two of them go by the names multiple shooting and direct collocation. None of these techniques is guaranteed to find the globally optimal solution, but in practice they can effectively make humanoid robots walk and make autonomous cars drive.
Magic happens when in the problem above, $J$ is quadratic and $f$ is linear in $x$ and $u$. We want to minimize
$$\min \int_0^\infty x^\top Q x + u^\top R u \, dt \quad \text{subject to} \quad \forall t, \ \dot x(t) = A x(t) + B u(t).$$
We can optimize over an infinite horizon rather than a finite one, and we obtain a policy from any state rather than just a sequence of controls. $Q$ and $R$ need to be positive definite matrices for this to work. This gives us the linear quadratic regulator (LQR). With LQR, the optimal value function (called cost to go) is quadratic, and the optimal policy is linear. The policy looks like $u = -Kx$, where finding the matrix $K$ requires solving an algebraic Riccati equation; no local optimization, no value iteration, no policy iteration are needed!
Because of the ease of finding the optimal policy, LQR finds many uses in practice despite the fact that real problems seldom actually have quadratic costs and linear dynamics. A really useful method is called iterative LQR (ILQR), which works by starting with a solution and then iteratively computing a linear approximation of the dynamics and a quadratic approximation of the cost around it, then solving the resulting LQR system to arrive at a new solution. Variants of LQR are also often used for trajectory tracking.
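For intuition, the LQR recipe can be worked through in the scalar case, where the algebraic Riccati equation reduces to a quadratic with a closed-form root. The sketch below assumes a one-dimensional system with state derivative `a*x + b*u` and running cost `q*x**2 + r*u**2`; the function name and the example numbers are illustrative only. The matrix case needs a numerical Riccati solver instead of the quadratic formula.

```python
import math

def scalar_lqr_gain(a, b, q, r):
    """Return (k, p) for the scalar LQR problem: the policy is u = -k x
    and the cost to go is p x^2. Requires q > 0, r > 0, b != 0."""
    # Scalar algebraic Riccati equation: 2 a p - (b^2 / r) p^2 + q = 0.
    # Take the positive root so that the cost to go is positive.
    p = r * (a + math.sqrt(a * a + b * b * q / r)) / (b * b)
    k = b * p / r
    return k, p

# An unstable open-loop system (a > 0) that the controller must stabilize.
k, p = scalar_lqr_gain(a=1.0, b=1.0, q=3.0, r=1.0)
closed_loop_pole = 1.0 - 1.0 * k      # a - b k
riccati_residual = 2 * 1.0 * p - p * p + 3.0
```

For these numbers the gain and the cost-to-go coefficient both come out to 3, and the closed-loop pole is negative, so the linear policy stabilizes a system that would otherwise diverge. No iteration over values or policies was needed, only the root of one equation.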
26.6 Planning Uncertain Movements
In robotics, uncertainty arises from partial observability of the environment and from the stochastic (or unmodeled) effects of the robot’s actions. Errors can also arise from the use
of approximation algorithms
such as particle filtering, which does not give the robot an exact
belief state even if the environment is modeled perfectly.
The majority of today’s robots use deterministic algorithms for decision making, such as the path-planning algorithms of the previous section, or the search algorithms that were introduced in Chapter 3. These deterministic algorithms are adapted in two ways: first, they
deal with the continuous state space by turning it into a discrete space (for example with visibility graphs or cell decomposition). Second, they deal with uncertainty in the current
state by choosing the most likely state from the probability distribution produced by the state estimation algorithm. That approach makes the computation faster and makes a better fit for the deterministic search algorithms. In this section we discuss methods for dealing with
uncertainty that are analogous to the more complex search algorithms covered in Chapter 4.
First, instead of deterministic plans, uncertainty calls for policies. We already discussed
how trajectory tracking control turns a plan into a policy to compensate for errors in dynamics.
Sometimes though, if the most likely hypothesis changes enough, tracking the plan designed for a different hypothesis is too suboptimal. This is where online replanning comes in: we can recompute a new plan based on the new belief. Many robots today use a technique called model predictive control (MPC), where they plan for a shorter time horizon, but replan
at every time step. (MPC is therefore closely related to real-time search and game-playing
algorithms.) This effectively results in a policy: at every step, we run a planner and take the
first action in the plan; if new information comes along, or we end up not where we expected, that's OK, because we are going to replan anyway and that will tell us what to do next.
Second, uncertainty calls for information gathering actions. When we consider only the information we have and make a plan based on it (this is called separating estimation from
control), we are effectively solving (approximately) a new MDP at every step, corresponding
to our current belief about where we are or how the world works. But in reality, uncertainty is
better captured by the POMDP framework: there is something we don’t directly observe, be it the robot’s location or configuration, the location of objects in the world, or the parameters
of the dynamics model itself—for example, where exactly is the center of mass of link two on this arm?
What we lose when we don’t solve the POMDP
is the ability to reason about future
information the robot will get: in MDPs we only plan with what we know, not with what we might eventually know. Remember the value of information? Well, robots that plan using
their current belief as if they will never find out anything more fail to account for the value of
information. They will never take actions that seem suboptimal right now according to what
they know, but that will actually result in a lot of information and enable the robot to do well.
What does such an action look like for a navigation robot?
The robot could get close
to a landmark to get a better estimate of where it is, even if that landmark is out of the way
according to what it currently knows. This action is optimal only if the robot considers the
new observations it will get, as opposed to looking only at the information it already has.
To get around this, robotics techniques sometimes define information gathering actions explicitly, such as moving a hand until it touches a surface (called guarded movements), and make sure the robot does that before coming up with a plan for reaching its actual goal. Each guarded motion consists of (1) a motion command and (2) a termination condition, which is a predicate on the robot's sensor values saying when to stop.
Figure 26.23 A two-dimensional environment, velocity uncertainty cone, and envelope of possible robot motions. The intended velocity is $v$, but with uncertainty the actual velocity could be anywhere in $C_v$, resulting in a final configuration somewhere in the motion envelope, which means we wouldn't know if we hit the hole or not.
Sometimes, the goal itself could be reached via a sequence of guarded moves guaranteed to succeed regardless of uncertainty. As an example, Figure 26.23 shows a two-dimensional configuration space with a narrow vertical hole. It could be the configuration space for insertion of a rectangular peg into a hole or a car key into the ignition. The motion commands are constant velocities. The termination conditions are contact with a surface. To model uncertainty in control, we assume that instead of moving in the commanded direction, the robot's actual motion lies in the cone $C_v$ about it.
The figure shows what would happen if the robot attempted to move straight down from
the initial configuration. Because of the uncertainty in velocity, the robot could move anywhere in the conical envelope, possibly going into the hole, but more likely landing to one side of it. Because the robot would not then know which side of the hole it was on, it would
not know which way to move.
A more sensible strategy is shown in Figures 26.24 and 26.25. In Figure 26.24, the robot deliberately moves to one side of the hole. The motion command is shown in the figure, and the termination test is contact with any surface. In Figure 26.25, a motion command is
given that causes the robot to slide along the surface and into the hole. Because all possible
velocities in the motion envelope are to the right, the robot will slide to the right whenever it
is in contact with a horizontal surface.
It will slide down the right-hand vertical edge of the hole when it touches it, because
all possible velocities are down relative to a vertical surface.
It will keep moving until it
reaches the bottom of the hole, because that is its termination condition.
In spite of the
control uncertainty, all possible trajectories of the robot terminate in contact with the bottom of the hole—that is, unless surface irregularities cause the robot to stick in one place.
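The structure of a guarded move, a motion command paired with a termination predicate, can be sketched in a few lines. The toy world below is an assumption made for illustration: a point descends toward a floor at height 0 with sideways velocity noise standing in for the uncertainty cone, and contact is sensed when the height reaches 0. All names (`guarded_move`, `noisy_step`, `in_contact`) and the noise model are hypothetical.

```python
import random

def guarded_move(state, command, terminated, step, max_steps=10000):
    """Apply a motion command until the termination predicate on the
    sensed state holds, or until the step budget runs out."""
    for _ in range(max_steps):
        if terminated(state):
            break
        state = step(state, command)
    return state

def noisy_step(state, command, dt=0.01, spread=0.3, rng=random.Random(0)):
    """Move with a velocity perturbed sideways, within a cone about the command."""
    x, y = state
    vx, vy = command
    vx += rng.uniform(-spread, spread) * abs(vy)
    x, y = x + vx * dt, y + vy * dt
    return (x, max(y, 0.0))     # the floor at y = 0 stops the motion

def in_contact(state):
    """Termination condition: the contact sensor fires at the floor."""
    return state[1] <= 0.0

# Command: move straight down from height 1 until contact.
final = guarded_move((0.0, 1.0), (0.0, -1.0), in_contact, noisy_step)
```

The move always ends in contact with the floor, even though the horizontal position at contact is uncertain. Chaining such moves, each chosen so that every velocity in the cone makes progress, is what gives the two-step strategy of Figures 26.24 and 26.25 its guarantee.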
Other techniques beyond guarded movements change the cost function to incentivize actions we know will lead to information, like the coastal navigation heuristic which requires the robot to stay near known landmarks. More generally, techniques can incorporate the expected information gain (reduction of entropy of the belief) as a term in the cost function, leading to the robot explicitly reasoning about how much information each action might bring when deciding what to do. While more difficult computationally, such approaches have the advantage that the robot invents its own information gathering actions rather than relying on human-provided heuristics and scripted strategies that often lack flexibility.
Figure 26.24 The first motion command and the resulting envelope of possible robot motions. No matter what actual motion ensues, we know the final configuration will be to the left of the hole.
Figure 26.25 The second motion command and the envelope of possible motions. Even with error, we will eventually get into the hole.
26.7 Reinforcement Learning in Robotics
Thus far we have considered tasks in which the robot has access to the dynamics model of
the world. In many tasks, it is very difficult to write down such a model, which puts us in the
domain of reinforcement learning (RL).
One challenge of RL in robotics is the continuous nature of the state and action spaces, which we handle either through discretization, or, more commonly, through function approximation. Policies or value functions are represented as combinations of known useful features, or as deep neural networks. Neural nets can map from raw inputs directly to outputs, and thus largely avoid the need for feature engineering, but they do require more data.
A bigger challenge is that robots operate in the real world. We have seen how reinforcement learning can be used to learn to play chess or Go by playing simulated games. But when a real robot moves in the real world, we have to make sure that its actions are safe (things break!), and we have to accept that progress will be slower than in a simulation because the world refuses to move faster than one second per second. Much of what is interesting about using reinforcement learning in robotics boils down to how we might reduce the real world sample complexity: the number of interactions with the physical world that the robot needs before it has learned how to do the task.
Figure 26.26 Training a robust policy. (a) Multiple simulations are run of a robot hand manipulating objects, with different randomized parameters for physics and lighting. Courtesy of Wojciech Zaremba. (b) The real-world environment, with a single robot hand in the center of a cage, surrounded by cameras and range finders. (c) Simulation and real-world training yields multiple different policies for grasping objects; here a pinch grasp and a quadpod grasp. Courtesy of OpenAI. See Andrychowicz et al. (2018a).
26.7.1 Exploiting models
A natural way to avoid the need for many real-world samples is to use as much knowledge of the world’s dynamics as possible. For instance, we might not know exactly what the coefficient of friction or the mass of an object is, but we might have equations that describe
the dynamics as a function of these parameters.
In such a case, model-based reinforcement learning (Chapter 22) is appealing, where
the robot can alternate between fitting the dynamics parameters and computing a better policy. Even if the equations are incorrect because they fail to model every detail of physics,
researchers have experimented with learning an error term, in addition to the parameters, that can compensate for the inaccuracy of the physical model. Or, we can abandon the equations
and instead fit locally linear models of the world that each approximate the dynamics in a region of the state space, an approach that has been successful in getting robots to master
complex dynamic tasks like juggling.
A model of the world can also be useful in reducing the sample complexity of model-free reinforcement learning methods by doing sim-to-real transfer: transferring policies that work in simulation to the real world. The idea is to use the model as a simulator for a policy search (Section 22.5). To learn a policy that transfers well, we can add noise to the model during training, thereby making the policy more robust. Or, we can train policies that will work with a variety of models by sampling different parameters in the simulations, sometimes referred to as domain randomization. An example is in Figure 26.26, where a dexterous manipulation task is trained in simulation by varying visual attributes, as well as physical attributes like friction or damping.
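The idea of domain randomization can be illustrated with a deliberately small policy search. The sketch below is a toy, not the method behind Figure 26.26: the policy is a pair of PD gains, the simulator is a point mass whose mass and friction are sampled from assumed ranges, and the search is a grid over a few candidate gains scored by their average cost across the sampled simulators. All names and numeric ranges are hypothetical.

```python
import random

def rollout_cost(kp, kd, mass, friction, dt=0.02, steps=500):
    """Cost of a PD policy on one simulated point mass reaching target 1."""
    q, qdot, cost = 0.0, 0.0, 0.0
    for _ in range(steps):
        u = kp * (1.0 - q) - kd * qdot
        qdot += (u - friction * qdot) / mass * dt
        q += qdot * dt
        cost += (1.0 - q) ** 2 * dt       # accumulated squared tracking error
    return cost

def randomized_cost(kp, kd, rng, n_models=20):
    """Average cost over simulators with randomly sampled physics."""
    total = 0.0
    for _ in range(n_models):
        mass = rng.uniform(0.5, 2.0)      # assumed parameter ranges
        friction = rng.uniform(0.0, 0.5)
        total += rollout_cost(kp, kd, mass, friction)
    return total / n_models

candidates = [(kp, kd) for kp in (0.5, 2.0, 8.0) for kd in (0.0, 1.0, 4.0)]
# Reuse the same seed so every candidate is scored on the same sampled models.
scores = {c: randomized_cost(c[0], c[1], random.Random(0)) for c in candidates}
best = min(scores, key=scores.get)
```

Because each candidate is scored on many sampled simulators rather than one nominal model, the selected gains are those that work acceptably across the whole range of masses and friction values, which is the robustness that sim-to-real transfer relies on.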
Finally, hybrid approaches that borrow ideas from both model-based and model-free algorithms are meant to give us the best of both. The hybrid approach originated with the Dyna architecture, where the idea was to iterate between acting and improving the policy, but the policy improvement would come in two complementary ways: (1) the standard model-free way of using the experience to directly update the policy, and (2) the model-based way of using the experience to fit a model, then plan with it to generate a policy.
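The two complementary updates can be sketched with tabular Dyna-Q, one common instance of the Dyna idea. This is a minimal sketch under stated assumptions: it updates Q-values rather than an explicit policy, it "fits" a deterministic model simply by memorizing observed transitions, and the five-state chain environment `chain_step` is a hypothetical toy task.

```python
import random
from collections import defaultdict

def dyna_q(step, actions, start, episodes=30, planning_steps=20,
           alpha=0.5, gamma=0.95, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)
    model = {}  # (state, action) -> (reward, next_state, done)

    def update(s, a, r, s2, done):
        target = r if done else r + gamma * max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    for _ in range(episodes):
        s, done = start, False
        while not done:
            if rng.random() < epsilon:
                a = rng.choice(actions)
            else:  # greedy action, with ties broken at random
                a = max(actions, key=lambda b: (Q[(s, b)], rng.random()))
            r, s2, done = step(s, a)
            update(s, a, r, s2, done)        # (1) model-free update from real experience
            model[(s, a)] = (r, s2, done)    # (2) fit the model from the same experience
            for _ in range(planning_steps):  # ... and plan with simulated experience
                ps, pa = rng.choice(list(model))
                update(ps, pa, *model[(ps, pa)])
            s = s2
    return Q

def chain_step(state, action):
    """Hypothetical toy task: states 0..4, reward 1 on reaching state 4."""
    nxt = min(max(state + action, 0), 4)
    done = nxt == 4
    return (1.0 if done else 0.0), nxt, done

Q = dyna_q(chain_step, actions=[-1, 1], start=0)
```

Each real interaction is reused many times through the model, which is the source of the sample-efficiency gain; the price is that planning is only as good as the fitted model.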
More recent techniques have experimented with fitting local models, planning with them
to generate actions, and using these actions as supervision to fit a policy, then iterating to get better and better models around the areas that the policy needs.
This has been successfully applied in end-to-end learning, where the policy takes pixels as input and directly outputs torques as actions; it enabled the first demonstration of deep RL on physical robots.
Models can also be exploited for the purpose of ensuring safe exploration. Learning slowly but safely may be better than learning quickly but crashing and burning halfway through. So arguably, more important than reducing real-world samples is reducing real-world samples in dangerous states: we don’t want robots falling off cliffs, and we don’t want them breaking our favorite mugs or, even worse, colliding with objects and people. An approximate model, with uncertainty associated with it (for example by considering a range of values for its parameters), can guide exploration and impose constraints on the actions that the robot is allowed to take in order to avoid these dangerous states. This is an active area of research in robotics and control.
26.7.2
Exploiting other information
Models are useful, but there is more we can do to further reduce sample complexity. When setting up a reinforcement learning problem, we have to select the state and action spaces, the representation of the policy or value function, and the reward function we’re using. These decisions have a large impact on how easy or how hard we are making the problem.
One approach is to use higher-level motion primitives instead of low-level actions like torque commands. A motion primitive is a parameterized skill that the robot has. For example, a robotic soccer player might have the skill of “pass the ball to the player at (x,y).” All the policy needs to do is to figure out how to combine these primitives and set their parameters, instead of reinventing them. This approach often learns much faster than low-level approaches, but does restrict the space of possible behaviors that the robot can learn.
Another way to reduce the number of real-world samples required for learning is to reuse information from previous learning episodes on other tasks, rather than starting from scratch.
This falls under the umbrella of metalearning or transfer learning.
Finally, people are a great source of information. In the next section, we talk about how
to interact with people, and part of it is how to use their actions to guide the robot’s learning.
26.8
Humans and Robots
Thus far, we’ve focused on a robot planning and learning how to act in isolation. This is useful for some robots, like the rovers we send out to explore distant planets on our behalf.
But, for the most part, we do not build robots to work in isolation. We build them to help us, and to work in human environments, around and with us.
This raises two complementary challenges. First is optimizing reward when there are
people acting in the same environment as the robot. We call this the coordination problem
(see Section 18.1). When the robot’s reward depends on not just its own actions, but also the
actions that people take, the robot has to choose its actions in a way that meshes well with
theirs. When the human and the robot are on the same team, this turns into collaboration.
Second is the challenge of optimizing for what people actually want. If a robot is to
help people, its reward function needs to incentivize the actions that people want the robot to
execute. Figuring out the right reward function (or policy) for the robot is itself an interaction problem. We will explore these two challenges in turn.
26.8.1
Coordination
Let’s assume for now, as we have been, that the robot has access to a clearly defined reward
function. But, instead of needing to optimize it in isolation, now the robot needs to optimize
it around a human who is also acting. For example, as an autonomous car merges on the
highway, it needs to negotiate the maneuver with the human driver coming in the target lane—
should it accelerate and merge in front, or slow down and merge behind? Later, as it pulls to
a stop sign, preparing to take a right, it has to watch out for the cyclist in the bicycle lane, and for the pedestrian about to step onto the crosswalk.
Or, consider a mobile robot in a hallway.
Someone heading straight toward the robot
steps slightly to the right, indicating which side of the robot they want to pass on. The robot has to respond, clarifying its intentions.
Humans as approximately rational agents
One way to formulate coordination with a human is to model it as a game between the robot
and the human (Section 18.2). With this approach, we explicitly make the assumption that
people are agents incentivized by objectives. This does not automatically mean that they are
perfectly rational agents (i.e., find optimal solutions in the game), but it does mean that the
robot can structure the way it reasons about the human via the notion of possible objectives that the human might have. In this game:
• the state of the environment captures the configurations of both the robot and human agents; call it $x = (x_R, x_H)$;
• each agent can take actions, $u_R$ and $u_H$ respectively;
• each agent has an objective that can be represented as a cost, $J_R$ and $J_H$: each agent wants to get to its goal safely and efficiently;
• and, as in any game, each objective depends on the state and on the actions of both agents: $J_R(x, u_R, u_H)$ and $J_H(x, u_H, u_R)$. Think of the car-pedestrian interaction: the car should stop if the pedestrian crosses, and should go forward if the pedestrian waits.
Three important aspects complicate this game. First is that the human and the robot don’t
necessarily know each other’s objectives. This makes it an incomplete information game.
Second is that the state and action spaces are continuous, as they’ve been throughout this
chapter. We learned in Chapter 5 how to do tree search to tackle discrete games, but how do we tackle continuous spaces? Third, even though at the high level the game model makes sense—humans do move, and they do have objectives—a human’s behavior might not always be well-characterized as a solution to the game. The game comes with a computational challenge not only for the robot, but for us humans too. It requires thinking about what the robot will do in response to what the person does, which depends on what the robot thinks the person will do, and pretty soon we get to “what do you think I think you think I think”— it’s turtles all the way down! Humans can’t deal with all of that, and exhibit certain suboptimalities.
This means that the
robot should account for these suboptimalities.
So, then, what is an autonomous car to do when the coordination problem is this hard?
We will do something similar to what we’ve done before in this chapter. For motion planning and control, we took an MDP and broke it up into planning a trajectory and then tracking it with a controller. Here too, we will take the game, and break it up into making predictions about human actions, and deciding what the robot should do given these predictions.
Predicting human action
Predicting human actions is hard because they depend on the robot’s actions and vice versa. One trick that robots use is to pretend the person is ignoring the robot. The robot assumes people are noisily optimal with respect to their objective, which is unknown to the robot and is modeled as no longer dependent on the robot’s actions: $J_H(x, u_H)$. In particular, the higher the value of an action for the objective (the lower the cost to go), the more likely the human is to take it. The robot can create a model for $P(u_H \mid x, J_H)$, for instance using the softmax function from page 811:
$$P(u_H \mid x, J_H) \propto e^{-Q(x, u_H; J_H)} \qquad (26.8)$$
with $Q(x, u_H; J_H)$ the Q-value function corresponding to $J_H$ (the negative sign is there because in robotics we like to minimize cost, not maximize reward). Note that the robot does not assume perfectly optimal actions, nor does it assume that the actions are chosen based on reasoning about the robot at all.
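For a finite set of candidate actions, the noisily optimal model of Equation (26.8) can be sketched as a softmax over negated Q-values. The function name `action_probabilities` and the pedestrian cost numbers below are illustrative assumptions, not values from the text.

```python
import math

def action_probabilities(q_values):
    """Return P(u_H | x, J_H) proportional to exp(-Q(x, u_H; J_H)).

    q_values maps each candidate human action to its cost-to-go Q.
    Subtracting the minimum Q first keeps the exponentials well behaved.
    """
    q_min = min(q_values.values())
    weights = {u: math.exp(-(q - q_min)) for u, q in q_values.items()}
    total = sum(weights.values())
    return {u: w / total for u, w in weights.items()}

# Hypothetical pedestrian for whom crossing now has lower cost than waiting.
probs = action_probabilities({"cross": 1.0, "wait": 3.0})
```

With these numbers the lower-cost action "cross" gets probability of about 0.88, so the model prefers it without ruling out the suboptimal action.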
Armed with this model, the robot uses the human’s ongoing actions as evidence about $J_H$. If we have an observation model for how human actions depend on the human’s objective, each human action can be incorporated to update the robot’s belief over what objective the person has:
$$b'(J_H) \propto b(J_H)\, P(u_H \mid x, J_H) \qquad (26.9)$$
An example is in Figure 26.27: the robot is tracking a human’s location and as the human moves, the robot updates its belief over human goals. As the human heads toward the
windows, the robot increases the probability that the goal is to look out the window, and
decreases the probability that the goal is going to the kitchen, which is in the other direction.
This is how the human’s past actions end up informing the robot about what the human will do in the future. Having a belief about the human’s goal helps the robot anticipate what next actions the human will take. The heatmap in the figure shows the robot’s future
predictions: red is most probable; blue least probable.
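The belief update of Equation (26.9) can be sketched for a small discrete set of candidate goals. The sketch makes several simplifying assumptions that are not in the text: the human moves on a grid with four unit-step actions, and the Q-value of an action is taken to be the straight-line distance to the goal after the step. The goal locations and function names are hypothetical.

```python
import math

def softmax_likelihood(x, u, goal, actions):
    """P(u | x, goal) proportional to exp(-Q), with Q assumed to be the
    distance to the goal after taking step u from position x."""
    def q(a):
        return math.dist((x[0] + a[0], x[1] + a[1]), goal)
    weights = {a: math.exp(-q(a)) for a in actions}
    return weights[u] / sum(weights.values())

def update_belief(belief, x, u, goals, actions):
    """One step of b'(J_H) proportional to b(J_H) * P(u_H | x, J_H)."""
    unnormalized = {
        name: belief[name] * softmax_likelihood(x, u, goals[name], actions)
        for name in belief
    }
    z = sum(unnormalized.values())
    return {name: p / z for name, p in unnormalized.items()}

goals = {"window": (5, 0), "kitchen": (-5, 0)}   # hypothetical room layout
actions = [(1, 0), (-1, 0), (0, 1), (0, -1)]
belief = {"window": 0.5, "kitchen": 0.5}
belief = update_belief(belief, x=(0, 0), u=(1, 0), goals=goals, actions=actions)
```

After observing a single step toward the window, the belief in the "window" goal rises from 0.5 to roughly 0.88; further steps in the same direction would sharpen it more.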
Figure 26.27 Making predictions by assuming that people are noisily rational given their goal: the robot uses the past actions to update a belief over what goal the person is heading to, and then uses the belief to make predictions about future actions. (a) The map of a room. (b) Predictions after seeing a small part of the person’s trajectory (white path). (c) Predictions after seeing more human actions: the robot now knows that the person is not heading to the hallway on the left, because the path taken so far would be a poor path if that were the person’s goal. Images courtesy of Brian D. Ziebart. See Ziebart et al. (2009).
The same can happen in driving. We might not know how much another driver values
efficiency, but if we see them accelerate as someone is trying to merge in front of them, we
now know a bit more about them. And once we know that, we can better anticipate what they will do in the future: the same driver is likely to come closer behind us, or weave through traffic to get ahead.
Once the robot can make predictions about human future actions, it has reduced its problem to solving an MDP. The human actions complicate the transition function, but as long as the robot can anticipate what action the person will take from any future state, the robot can calculate $P(x' \mid x, u_R)$: it can compute $P(u_H \mid x)$ from $P(u_H \mid x, J_H)$ by marginalizing over $J_H$, and combine it with $P(x' \mid x, u_R, u_H)$, the transition (dynamics) function for how the world updates based on both the robot’s and the human’s actions. In Section 26.5 we focused on how to solve this in continuous state and action spaces for deterministic dynamics, and in Section 26.6 we discussed doing it with stochastic dynamics and uncertainty.
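Written out, the two steps just described look as follows. This is a sketch that assumes a discrete set of candidate objectives and human actions (the sums become integrals in the continuous case) and uses the robot's current belief $b(J_H)$ from Equation (26.9) as the weight on each objective.

```latex
\begin{align*}
P(u_H \mid x) &= \sum_{J_H} b(J_H)\, P(u_H \mid x, J_H), \\
P(x' \mid x, u_R) &= \sum_{u_H} P(x' \mid x, u_R, u_H)\, P(u_H \mid x).
\end{align*}
```

The first line removes the dependence on the unknown objective; the second folds the predicted human action into the dynamics, leaving a transition model that depends only on the robot's own action.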
Splitting prediction from action makes it easier for the robot to handle interaction, but sacrifices performance, much as splitting estimation from motion did, or splitting planning from control.
A robot with this split no longer understands that its actions can influence what people end up doing. In contrast, the robot in Figure 26.27 anticipates where people will go and then optimizes for reaching its own goal and avoiding collisions with them. In Figure 26.28, we have an autonomous car merging on the highway. If it just planned in reaction to other cars, it might have to wait a long time while other cars occupy its target lane. In contrast, a car that reasons about prediction and action jointly knows that different actions it could take will
result in different reactions from the human. If it starts to assert itself, the other cars are likely to slow down a bit and make room. Roboticists are working towards coordinated interactions like this so robots can work better with humans.
Figure 26.28 (a) Left: An autonomous car (middle lane) predicts that the human driver (left lane) wants to keep going forward, and plans a trajectory that slows down and merges behind. Right: The car accounts for the influence its actions can have on human actions, and realizes
it can merge in front and rely on the human driver to slow down. (b) That same algorithm produces an unusual strategy at an intersection: the car realizes that it can make it more
likely for the person (bottom) to proceed faster through the intersection by starting to inch
backwards. Images courtesy of Anca Dragan. See Sadigh et al. (2016).
Human predictions about the robot
Incomplete information is often two-sided:
the robot does not know the human’s objective
and the human, in turn, does not know the robot’s objective—people need to be making predictions about robots. As robot designers, we are not in charge of how the human makes
predictions; we can only control what the robot does. However, the robot can act in a way to make it easier for the human to make correct predictions.
The robot can assume that
the human is using something roughly analogous to Equation (26.8) to estimate the robot’s objective $J_R$, and thus the robot will act so that its true objective can be easily inferred.
A special case of the game is when the human and the robot are on the same team, working toward the same goal or objective: $J_H = J_R$. Imagine getting a personal home robot that is helping you make dinner or clean up; these are examples of collaboration.
We can now define a joint agent whose actions are tuples of human-robot actions, $(u_H, u_R)$, and who optimizes for $J_H(x, u_H, u_R) = J_R(x, u_R, u_H)$, and we’re solving a regular planning problem. We compute the optimal plan or policy for the joint agent, and voila, we now know what the robot and human should do.
This would work really well if people were perfectly optimal. The robot would do its part
of the joint plan, the human theirs. Unfortunately, in practice, people don’t seem to follow the perfectly laid out joint-agent plan; they have a mind of their own! We've already learned one way to handle this though, back in Section 26.6. We called it model predictive control
(MPC): the idea was to come up with a plan, execute the first action, and then replan. That
way, the robot always adapts its plan to what the human is actually doing.
Let’s work through an example. Suppose you and the robot are in your kitchen, and have
decided to make waffles. You are slightly closer to the fridge, so the optimal joint plan would
have you grab the eggs and milk from the fridge, while the robot fetches the flour from the cabinet. The robot knows this because it can measure quite precisely where everyone is. But
suppose you start heading for the flour cabinet. You are going against the optimal joint plan. Rather than sticking to it and stubbornly also going for the flour, the MPC robot recalculates
the optimal plan, and now that you are close enough to the flour it is best for the robot to grab the waffle iron instead.
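The replanning loop in the waffle example can be sketched as follows. This is a heavily simplified sketch: the "joint plan" is reduced to assigning one task each to the human and the robot so as to minimize total travel distance, and the kitchen coordinates, the robot's step size, and the human's path are all made-up illustrative values.

```python
import math
from itertools import permutations

def plan_joint(human_pos, robot_pos, tasks):
    """Assign distinct tasks to the human and the robot, minimizing the
    total straight-line travel distance (a stand-in for the joint cost)."""
    best = min(
        permutations(tasks, 2),
        key=lambda pair: math.dist(human_pos, tasks[pair[0]])
        + math.dist(robot_pos, tasks[pair[1]]),
    )
    return {"human": best[0], "robot": best[1]}

def step_toward(pos, target, step):
    """Move at most `step` from pos toward target."""
    d = math.dist(pos, target)
    if d <= step:
        return target
    return (pos[0] + step * (target[0] - pos[0]) / d,
            pos[1] + step * (target[1] - pos[1]) / d)

def mpc_robot_targets(human_path, robot_pos, tasks, robot_step=0.2):
    """Replan from each observed human position; execute only the first action."""
    targets = []
    for human_pos in human_path:
        plan = plan_joint(human_pos, robot_pos, tasks)
        targets.append(plan["robot"])
        robot_pos = step_toward(robot_pos, tasks[plan["robot"]], robot_step)
    return targets

tasks = {"fridge": (0, 4), "flour": (4, 4), "waffle_iron": (4, 0)}
human_path = [(1, 3), (2.5, 3.5), (3.6, 3.9)]   # the human heads for the flour
targets = mpc_robot_targets(human_path, robot_pos=(4, 2.2), tasks=tasks)
```

With these numbers the robot first heads for the flour, then switches to the waffle iron as soon as the observed human position makes "human takes the flour" the cheaper joint assignment.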
If we know that people might deviate from optimality, we can account for it ahead of time.
In our example, the robot can try to anticipate that you are going for the flour the moment you take your first step (say, using the prediction technique above). Even if it is still technically
optimal for you to turn around and head for the fridge, the robot should not assume that’s
what is going to happen. Instead, the robot can compute a plan in which you keep doing what you seem to want.
Humans as black box agents
We don’t have to treat people as objective-driven, intentional agents to get robots to coordinate with us. An alternative model is that the human is merely some agent whose policy $\pi_H$ “messes” with the environment dynamics. The robot does not know $\pi_H$, but can model the problem as needing to act in an MDP with unknown dynamics. We have seen this before: for general agents in Chapter 22, and for robots in particular in Section 26.7.
The robot can fit a policy model $\pi_H$ to human data, and use it to compute an optimal policy for itself. Due to scarcity of data, this has been mostly used so far at the task level. For instance, robots have learned through interaction what actions people tend to take (in response to the robot’s own actions) when placing and drilling screws in an industrial assembly task.
Then there is also the model-free reinforcement learning alternative: the robot can start
26.8.2
Learning to do what humans want
Another way interaction with humans comes into robotics is in $J_R$ itself, the robot’s cost or reward function. The framework of rational agents and the associated algorithms reduce the problem of generating good behavior to specifying a good reward function. But for robots, as for many other AI agents, getting the cost right is still difficult.
Take autonomous cars: we want them to reach the destination, to be safe, to drive comfortably for their passengers, to obey traffic laws, etc. A designer of such a system needs to trade off these different components of the cost function. The designer’s task is hard because robots are built to help end users, and not every end user is the same. We all have different preferences for how aggressively we want our car to drive, etc.
Below, we explore two alternatives for trying to get robot behavior to match what we
actually want the robot to do.
The first is to learn a cost function from human input.
The
second is to bypass the cost function and imitate human demonstrations of the task.
Preference learning: Learning cost functions
Imagine that an end user is showing a robot how to do a task. For instance, they are driving the car in the way they would like it to be driven by the robot. Can you think of a way for the robot to use these actions (we call them “demonstrations”) to figure out what cost function it should optimize?
Figure 26.29 Left: A mobile robot is shown a demonstration that stays on the dirt road. Middle: The robot infers the desired cost function, and uses it in a new scene, knowing to
put lower cost on the road there. Right: The robot plans a path for the new scene that also
stays on the road, reproducing the preferences behind the demonstration. Images courtesy of Nathan Ratliff and James A. Bagnell. See Ratliff et al. (2006).
We have actually already seen the answer to this back in Section 26.8.1. There, the setup
was a little different: we had another person taking actions in the same space as the robot, and the robot needed to predict what the person would do. But one technique we went over for making these predictions was to assume that people act to noisily optimize some cost function
$J_H$, and we can use their ongoing actions as evidence about what cost function that is. We
can do the same here, except not for the purpose of predicting human behavior in the future, but rather acquiring the cost function the robot itself should optimize.
If the person drives
defensively, the cost function that will explain their actions will put a lot of weight on safety
and less so on efficiency. The robot can adopt this cost function as its own and optimize it when driving the car itself.
Roboticists have experimented with different algorithms for making this cost inference
computationally tractable. In Figure 26.29, we see an example of teaching a robot to prefer staying on the road to going over the grassy
terrain. Traditionally in such methods, the cost
function has been represented as a combination of hand-crafted features, but recent work has also studied how to represent it using a deep neural network, without feature engineering.
There are other ways for a person to provide input. A person could use language rather
than demonstration to instruct the robot.
A person could act as a critic, watching the robot
perform a task one way (or two ways) and then saying how well the task was done (or which way was better), or giving advice on how to improve.
Learning policies directly via imitation
An alternative is to bypass cost functions and learn the desired robot policy directly. In our car example, the human’s demonstrations make for a convenient data set of states labeled by the action the robot should take at each state: $D = \{(x_i, u_i)\}$. The robot can run supervised learning to fit a policy $\pi : x \mapsto u$, and execute that policy. This is called imitation learning or behavioral cloning.
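A minimal sketch of behavioral cloning, assuming a one-dimensional state (lateral offset from the lane center), a one-dimensional action (steering command), and a linear policy fit by ordinary least squares. The demonstration data and the name `fit_linear_policy` are hypothetical; a real system would use a far richer state and policy class.

```python
def fit_linear_policy(demos):
    """Least-squares fit of u = w * x + b to demonstration pairs (x_i, u_i)."""
    n = len(demos)
    mean_x = sum(x for x, _ in demos) / n
    mean_u = sum(u for _, u in demos) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in demos)
    cov_xu = sum((x - mean_x) * (u - mean_u) for x, u in demos)
    w = cov_xu / var_x
    b = mean_u - w * mean_x
    return lambda x: w * x + b

# Hypothetical demonstrations: the human steers back toward the lane center.
demos = [(x, -0.5 * x) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
policy = fit_linear_policy(demos)
steering = policy(0.2)   # a state inside the range covered by the demonstrations
```

The fitted policy reproduces the demonstrated behavior on states like those in $D$, but nothing in the fit constrains what it does at, say, an offset of 5.0, which no demonstration covers.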
A challenge with this approach is in generalization to new states. The robot does not
know why the actions in its database have been marked as optimal. It has no causal rule; all
it can do is run a supervised learning algorithm to try to learn a policy that will generalize to unknown states. However, there is no guarantee that the generalization will be correct.
Figure 26.30 A human teacher pushes the robot down to teach it to stay closer to the table. The robot appropriately updates its understanding of the desired cost function and starts optimizing it. Courtesy of Anca Dragan. See Bajcsy et al. (2017).
Figure 26.31 A programming interface that involves placing specially designed blocks in the robot’s workspace to select objects and specify high-level actions. Images courtesy of Maya Cakmak. See Sefidgar et al. (2017).
The ALVINN autonomous car project used this approach, and found that even when starting from a state in $D$, $\pi$ will make small errors, which will take the car off the demonstrated trajectory. There, $\pi$ will make a larger error, which will take the car even further off the desired course.
We can address this at training time if we interleave collecting labels and learning: start with a demonstration, learn a policy, then roll out that policy and ask the human for what action to take at every state along the way, then repeat. The robot then learns how to correct its mistakes as it deviates from the human’s desired actions.
Alternatively, we can address it by leveraging reinforcement learning. The robot can fit a
dynamics model based on the demonstrations, and then use optimal control (Section 26.5.4)
to generate a policy that optimizes for staying close to the demonstration.
A version of this has been used to perform very challenging maneuvers at an expert level in a small radio-controlled helicopter (see Figure 22.9(b)).
The DAGGER (Data Aggregation) system starts with a human expert demonstration. From that it learns a policy, $\pi_1$, and uses the policy to generate a data set $D$. Then from $D$ it generates a new policy $\pi_2$ that best imitates the original human data. This repeats, and on the $n$th iteration it uses $\pi_n$ to generate more data, to be added to $D$, which is then used to create $\pi_{n+1}$. In other words, at each iteration the system gathers new data under the current policy and trains the next policy using all the data gathered so far.
Related recent techniques use adversarial training: they alternate between training a classifier to distinguish between the robot’s learned policy and the human’s demonstrations, and training a new robot policy via reinforcement learning to fool the classifier. These advances enable the robot to handle states that are near demonstrations, but generalization to far-off states or to new dynamics is a work in progress.
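The data aggregation loop described for DAGGER can be sketched as below. The sketch assumes the usual formulation in which the expert labels the states that the learner's own policy visits; the nearest-neighbor learner, the one-dimensional dynamics, and the stand-in `expert` function are all hypothetical choices made to keep the example short.

```python
def fit_nearest_neighbor(data):
    """Policy that copies the action of the closest labeled state."""
    snapshot = list(data)  # freeze the training set for this policy
    def policy(x):
        return min(snapshot, key=lambda pair: abs(pair[0] - x))[1]
    return policy

def dagger(expert, dynamics, x0, horizon=10, iterations=5):
    # Initial data set: states visited by the expert, labeled with expert actions.
    data, x = [], x0
    for _ in range(horizon):
        u = expert(x)
        data.append((x, u))
        x = dynamics(x, u)
    policy = fit_nearest_neighbor(data)

    for _ in range(iterations):
        x = x0
        for _ in range(horizon):
            data.append((x, expert(x)))   # expert labels the state the learner reached
            x = dynamics(x, policy(x))    # but the learner's own action is executed
        policy = fit_nearest_neighbor(data)   # retrain on all data gathered so far
    return policy, data

expert = lambda x: -0.5 * x        # stand-in for the human's desired action
dynamics = lambda x, u: x + u      # toy deterministic dynamics
policy, data = dagger(expert, dynamics, x0=4.0)
```

Because labels are collected along the learner's own trajectories, the aggregated data set covers exactly the off-demonstration states where the compounding errors described above would otherwise occur.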
Teaching interfaces and the correspondence problem. So far, we have imagined the case of an autonomous car or an autonomous helicopter, for which human demonstrations use the same actions that the robot can take itself: accelerating, braking, and steering. But what happens if we do this for tasks like cleaning up the kitchen table? We have two choices here: either the person demonstrates using their own body while the robot watches, or the person physically guides the robot’s effectors.
The first approach is appealing because it comes naturally to end users. Unfortunately,
it suffers from the correspondence problem: how to map human actions onto robot actions.
People have different kinematics and dynamics than robots. Not only does that make it difficult to translate or retarget human motion onto robot motion (e.g., retargeting a five-finger
human grasp to a two-finger robot grasp), but often the high-level strategy a person might use is not appropriate for the robot.
The second approach, where the human teacher moves the robot’s effectors into the right positions, is called kinesthetic teaching. It is not easy for humans to teach this way, especially to teach robots with multiple joints. The teacher needs to coordinate all the degrees of freedom while guiding the arm through the task. Researchers have thus investigated alternatives, like demonstrating keyframes as opposed to continuous trajectories, as well as the use of visual programming to enable end users to program primitives for a task rather than demonstrate from scratch (Figure 26.31). Sometimes both approaches are combined.
26.9
Alternative Robotic Frameworks
Thus far, we have taken a view of robotics based on the notion of defining or learning a reward function, and having the robot optimize that reward function (be it via planning or learning), sometimes in coordination or collaboration with humans. This is a deliberative view of robotics, to be contrasted with a reactive view.
26.9.1
Reactive controllers
In some cases, it is easier to set up a good policy for a robot than to model the world and plan.
Then, instead of a rational agent, we have a reflex agent.
For example, picture a legged robot that attempts to lift a leg over an obstacle. We could give this robot a rule that says lift the leg a small height $h$ and move it forward, and if the leg encounters an obstacle, move it back and start again at a higher height. You could say that $h$ is modeling an aspect of the world, but we can also think of $h$ as an auxiliary variable of the robot controller, devoid of direct physical meaning.
One such example is the six-legged (hexapod) robot, shown in Figure 26.32(a), designed for walking through rough terrain. The robot’s sensors are inadequate to obtain accurate models of the terrain for path planning. Moreover, even if we added high-precision cameras and rangefinders, the 12 degrees of freedom (two for each leg) would render the resulting path planning problem computationally difficult.
[Figure 26.32(b) diagram labels: lift up; set down; push backward; retract, lift higher]
Figure 26.32 (a) Genghis, a hexapod robot. (Image courtesy of Rodney A. Brooks.) (b) An augmented finite state machine (AFSM) that controls one leg. The AFSM reacts to sensor feedback: if a leg is stuck during the forward swinging phase, it will be lifted increasingly higher.
It is possible, nonetheless, to specify a controller directly without an explicit environmental model. (We have already seen this with the PD controller, which was able to keep a complex robot arm on target without an explicit model of the robot dynamics.)
For the hexapod robot we first choose a gait, or pattern of movement of the limbs. One statically stable gait is to first move the right front, right rear, and left center legs forward (keeping the other three fixed), and then move the other three. This gait works well on flat terrain. On rugged terrain, obstacles may prevent a leg from swinging forward. This problem can be overcome by a remarkably simple control rule: when a leg’s forward motion is blocked, simply retract it, lift it higher, and try again. The resulting controller is shown in Figure 26.32(b) as a simple finite state machine; it constitutes a reflex agent with state, where the internal state is represented by the index of the current machine state ($s_1$ through $s_4$).
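The control rule for a single leg can be sketched as a small state machine. The state names, the height increment, and the `is_stuck` sensor test are assumptions made for illustration; in particular, how these four states map onto $s_1$ through $s_4$ in the figure is a guess.

```python
def run_leg_afsm(is_stuck, base_height=1.0, increment=0.5, max_steps=50):
    """Run one step cycle of a leg controller and return the commands issued.

    is_stuck(height) is a hypothetical sensor test: it reports whether the
    forward swing is blocked when the leg is lifted to the given height.
    """
    state, height, commands = "lift_up", base_height, []
    for _ in range(max_steps):
        if state == "lift_up":
            commands.append(("lift up", height))
            state = "swing_forward"
        elif state == "swing_forward":
            if is_stuck(height):
                # Reflex: retract, lift higher, and stay in the swing state.
                height += increment
                commands.append(("retract, lift higher", height))
            else:
                commands.append(("move forward", height))
                state = "set_down"
        elif state == "set_down":
            commands.append(("set down", 0.0))
            state = "push_backward"
        else:  # push_backward ends the cycle
            commands.append(("push backward", 0.0))
            break
    return commands

# An obstacle that blocks the swing unless the leg is lifted to at least 1.8.
commands = run_leg_afsm(lambda h: h < 1.8)
```

The controller never estimates the obstacle's height; it simply keeps raising the leg until the swing succeeds, which is what makes it a reflex agent with state rather than a planner.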
26.9.2
Subsumption architectures
The subsumption architecture (Brooks, 1986) is a framework for assembling reactive controllers out of finite state machines. Nodes in these machines may contain tests for certain sensor variables, in which case the execution trace of a finite state machine is conditioned on the outcome of such a test. Arcs can be tagged with messages that will be generated when traversing them, and that are sent to the robot’s motors or to other finite state machines. Additionally, finite state machines possess internal timers (clocks) that control the time it takes to traverse an arc. The resulting machines are called augmented finite state machines (AFSMs), where the augmentation refers to the use of clocks.
An example of a simple AFSM is the four-state machine we just talked about, shown in Figure 26.32(b). This AFSM implements a cyclic controller, whose execution mostly does not rely on environmental feedback. The forward swing phase, however, does rely on sensor feedback. If the leg is stuck, meaning that it has failed to execute the forward swing, the robot retracts the leg, lifts it up a little higher, and attempts to execute the forward swing once again. Thus, the controller is able to react to contingencies arising from the interplay of the robot and its environment.
The subsumption architecture offers additional primitives for synchronizing AFSMs, and for combining output values of multiple, possibly conflicting AFSMs. In this way, it enables the programmer to compose increasingly complex controllers in a bottom-up fashion. In our example, we might begin with AFSMs for individual legs, followed by an AFSM for coordinating multiple legs. On top of this, we might implement higher-level behaviors such as collision avoidance, which might involve backing up and turning.
The idea of composing robot controllers from AFSMs is quite intriguing. Imagine how difficult it would be to generate the same behavior with any of the configuration-space path-planning algorithms described in the previous section. First, we would need an accurate model of the terrain. The configuration space of a robot with six legs, each of which is driven by two independent motors, totals 18 dimensions (12 dimensions for the configuration of the legs, and six for the location and orientation of the robot relative to its environment). Even if our computers were fast enough to find paths in such high-dimensional spaces, we would have to worry about nasty effects such as the robot sliding down a slope. Because of such stochastic effects, a single path through configuration space would almost certainly be too brittle, and even a PID controller might not be able to cope with such contingencies. In other words, generating motion behavior deliberately is simply too complex a problem in some cases for present-day robot motion planning algorithms.
Unfortunately, the subsumption architecture has its own problems. First, the AFSMs are driven by raw sensor input, an arrangement that works if the sensor data is reliable and contains all necessary information for decision making, but fails if sensor data has to be integrated in nontrivial ways over time. Subsumption-style controllers have therefore mostly been applied to simple tasks, such as following a wall or moving toward visible light sources.
Second, the lack of deliberation makes it difficult to change the robot’s goals.
A robot
with a subsumption architecture usually does just one task, and it has no notion of how to
modify its controls to accommodate different goals (just like the dung beetle on page 41). Third, in many real-world problems, the policy we want is often too complex to encode explicitly. Think about the example from Figure 26.28, of an autonomous car needing to negotiate a lane change with a human driver. We might start off with a simple policy that goes into the target lane. But when we test the car, we find out that not every driver in the target lane will slow down to let the car in. We might then add a bit more complexity: make the car nudge towards the target lane, wait for a response from the driver in that lane, and then either proceed or retreat back. But then we test the car, and realize that the nudging needs to happen at a different speed depending on the speed of the vehicle in the target lane, on whether there is another vehicle in front in the target lane, on whether there is a vehicle behind the car in the initial lane, and so on. The number of conditions that we need to consider to determine the right course of action can be very large, even for such a deceptively simple maneuver. This in turn presents scalability challenges for subsumption-style architectures.
All that said, robotics is a complex problem with many approaches: deliberative, reactive, or a mixture thereof; based on physics, cognitive models, data, or a mixture thereof. The right approach is still a subject for debate, scientific inquiry, and engineering prowess.
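The bottom-up composition of reactive behaviors described in this section can be illustrated with a minimal sketch. It is not Brooks's actual mechanism: it assumes a fixed priority ordering in which a higher layer, when it has something to say, suppresses every layer below it. The sensor key `front_distance_m`, the 0.3 m threshold, the command labels, and the `stop` default are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List, Optional

Sensors = Dict[str, float]   # hypothetical raw sensor readings
Command = str                # hypothetical motor command label
Behavior = Callable[[Sensors], Optional[Command]]


def walk_forward(sensors: Sensors) -> Optional[Command]:
    """Lowest layer: always proposes walking forward."""
    return "walk_forward"


def avoid_collision(sensors: Sensors) -> Optional[Command]:
    """Higher layer: back up and turn when an obstacle is close, else stay silent."""
    # Assumed threshold; a missing reading is treated as "no obstacle in range".
    if sensors.get("front_distance_m", float("inf")) < 0.3:
        return "back_up_and_turn"
    return None


def arbitrate(layers: List[Behavior], sensors: Sensors) -> Command:
    """Return the command of the highest layer that produces an output.

    `layers` is ordered from lowest to highest priority, so a higher layer
    that returns a command suppresses everything below it.
    """
    for behavior in reversed(layers):
        command = behavior(sensors)
        if command is not None:
            return command
    return "stop"  # assumed safe default when no layer responds


# Layers are added bottom-up, mirroring how the controller is composed.
LAYERS = [walk_forward, avoid_collision]
```

Each layer reads only the current raw sensor values, which is why this style works when the readings are reliable and fails when evidence must be integrated over time. Every extra condition in the lane-change example would need another hand-written layer or branch, which is the scalability problem noted above.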
Figure 26.33 (a) A patient with a brain-machine interface controlling a robot arm to grab a drink. Image courtesy of Brown University. (b) Roomba, the robot vacuum cleaner. Photo by HANDOUT/KRT/Newscom.
26.10 Application Domains
Robotic technology is already permeating our world, and has the potential to improve our independence, health, and productivity. Here are some example applications.
Home care: Robots have started to enter the home to care for older adults and people with motor impairments, assisting them with activities of daily living and enabling them to live more independently. These include wheelchairs and wheelchair-mounted arms like the Kinova arm from Figure 26.1(b). Even though they start off as being operated by a human directly, these robots are gaining more and more autonomy. On the horizon are robots operated by brain-machine interfaces, which have been shown to enable people with quadriplegia to use a robot arm to grasp objects and even feed themselves (Figure 26.33(a)). Related to these are prosthetic limbs that intelligently respond to our actions, and exoskeletons that give us superhuman strength or enable people who can’t control their muscles from the waist down to walk again.
Personal robots are meant to assist us with daily tasks like cleaning and organizing, freeing up our time. Although manipulation still has a way to go before it can operate seamlessly in messy, unstructured human environments, navigation has made some headway. In particular, many homes already enjoy a mobile robot vacuum cleaner like the one in Figure 26.33(b).
Health care:
Robots assist and augment surgeons, enabling more precise, minimally invasive, safer procedures with better patient outcomes. The Da Vinci surgical robot from Figure 26.34(a) is now widely deployed at hospitals in the U.S.
Services: Mobile robots help out in office buildings, hotels, and hospitals. Savioke has put robots in hotels delivering products like towels or toothpaste to your room. The Helpmate and TUG robots carry food and medicine in hospitals (Figure 26.34(b)), while Diligent Robotics’ Moxi robot helps out nurses with back-end logistical responsibilities. Co-Bot roams the halls of Carnegie Mellon University, ready to guide you to someone’s office. We can also use telepresence robots like the Beam to attend meetings and conferences remotely, or check in on our grandparents.
Figure 26.34 (a) Surgical robot in the operating room. Photo by Patrick Landmann/Science Source. (b) Hospital delivery robot. Photo by Wired.
Figure 26.35 (a) Autonomous car BOSS, which won the DARPA Urban Challenge. Photo by Tangi Quemener/AFP/Getty Images/Newscom. Courtesy of Sebastian Thrun. (b) Aerial view showing the perception and predictions of the Waymo autonomous car (white vehicle with green track). Other vehicles (blue boxes) and pedestrians (orange boxes) are shown with anticipated trajectories. Road/sidewalk boundaries are in yellow. Photo courtesy of Waymo.
Autonomous cars: Some of us are occasionally distracted while driving, by cell phone calls, texts, or other distractions. The sad result: more than a million people die every year in traffic accidents. Further, many of us spend a lot of time driving and would like to recapture some of that time. All this has led to a massive ongoing effort to deploy autonomous cars. Prototypes have existed since the 1980s, but progress was stimulated by the 2005 DARPA Grand Challenge, an autonomous vehicle race over 200 challenging kilometers of unrehearsed desert terrain. Stanford’s Stanley vehicle completed the course in less than seven hours, winning a $2 million prize and a place in the National Museum of American History.
Figure 26.36 (a) A robot mapping an abandoned coal mine. (b) A 3D map of the mine acquired by the robot. Courtesy of Sebastian Thrun.
Figure 26.35(a) depicts BOSS, which in 2007 won the DARPA Urban Challenge, a complicated road race on city streets where robots faced other robots and had to obey traffic rules.
In 2009, Google started an autonomous driving project (featuring many of the researchers who had worked on Stanley and BOSS), which has now spun off as Waymo. In 2018 Waymo started driverless testing (with nobody in the driver seat) in the suburbs of Phoenix, Arizona. In the meantime, other autonomous driving companies and ride-sharing companies are working on developing their own technology, while car manufacturers have been selling cars with more and more assistive intelligence, such as Tesla’s driver assist, which is meant for highway driving. Other companies are targeting non-highway driving applications including college campuses and retirement communities. Still other companies are focused on non-passenger applications such as trucking, grocery delivery, and valet parking.
Entertainment:
Disney has been using robots (under the name animatronics) in their parks since 1963. Originally, these robots were restricted to hand-designed, open-loop, unvarying motion (and speech), but since 2009 a version called autonomatronics can generate autonomous actions. Robots also take the form of intelligent toys for children; for example, Anki’s Cozmo plays games with children and may pound the table with frustration when it loses. Finally, quadrotors like Skydio’s R1 from Figure 26.2(b) act as personal photographers and videographers, following us around to take action shots as we ski or bike.
Exploration and hazardous environments: Robots have gone where no human has gone before, including the surface of Mars. Robotic arms assist astronauts in deploying and retrieving satellites and in building the International Space Station. Robots also help explore
under the sea. They are routinely used to acquire maps of sunken ships. Figure 26.36 shows a robot mapping an abandoned coal mine, along with a 3D model of the mine acquired using range sensors. In 1996, a team of researchers released a legged robot into the crater of an active volcano to acquire data for climate research. Robots are becoming very effective tools for gathering information in domains that are difficult (or dangerous) for people to access. Robots have assisted people in cleaning up nuclear waste, most notably in Three Mile Island, Chernobyl, and Fukushima. Robots were present after the collapse of the World Trade Center, where they entered structures deemed too dangerous for human search and rescue crews. Here too, these robots are initially deployed via teleoperation, and as technology advances they are becoming more and more autonomous, with a human operator in charge but not having to specify every single command.
Industry: The majority of robots today are deployed in factories, automating tasks that
are difficult, dangerous, or dull for humans. (The majority of factory robots are in automobile
factories.) Automating these tasks is a positive in terms of efficiently producing what society needs. At the same time, it also means displacing some human workers from their jobs. This
has important policy and economics implications—the need for retraining and education, the need for a fair division of resources, etc. These topics are discussed further in Section 27.3.5.
Summary
Robotics is about physically embodied agents, which can change the state of the physical world. In this chapter, we have learned the following:
• The most common types of robots are manipulators (robot arms) and mobile robots. They have sensors for perceiving the world and actuators that produce motion, which then affects the world via effectors.
• The general robotics problem involves stochasticity (which can be handled by MDPs), partial observability (which can be handled by POMDPs), and acting with and around other agents (which can be handled with game theory). The problem is made even harder by the fact that most robots work in continuous and high-dimensional state and action spaces. They also operate in the real world, which refuses to run faster than real time and in which failures lead to real things being damaged, with no “undo” capability. Ideally, the robot would solve the entire problem in one go: observations in the form of raw sensor feeds go in, and actions in the form of torques or currents to the motors come out. In practice though, this is too daunting, and roboticists typically decouple different aspects of the problem and treat them independently.
• We typically separate perception (estimation) from action (motion generation). Perception in robotics involves computer vision to recognize the surroundings through cameras, but also localization and mapping.
• Robotic perception concerns itself with estimating decision-relevant quantities from sensor data. To do so, we need an internal representation and a method for updating this internal representation over time.
• Probabilistic filtering algorithms such as particle filters and Kalman filters are useful for robot perception. These techniques maintain the belief state, a posterior distribution over state variables.
• For generating motion, we use configuration spaces, where a point specifies everything we need to know to locate every body point on the robot. For instance, for a robot arm with two joints, a configuration consists of the two joint angles.
• We typically decouple the motion generation problem into motion planning, concerned with producing a plan, and trajectory tracking control, concerned with producing a policy for control inputs (actuator commands) that results in executing the plan.
• Motion planning can be solved via graph search using cell decomposition; using randomized motion planning algorithms, which sample milestones in the continuous configuration space; or using trajectory optimization, which can iteratively push a straight-line path out of collision by leveraging a signed distance field.
• A path found by a search algorithm can be executed using the path as the reference trajectory for a PID controller, which constantly corrects for errors between where the robot is and where it is supposed to be, or via computed torque control, which adds a feedforward term that makes use of inverse dynamics to compute roughly what torque to send to make progress along the trajectory.
• Optimal control unites motion planning and trajectory tracking by computing an optimal trajectory directly over control inputs. This is especially easy when we have quadratic costs and linear dynamics, resulting in a linear quadratic regulator (LQR). Popular methods make use of this by linearizing the dynamics and computing second-order approximations of the cost (ILQR).
• Planning under uncertainty unites perception and action by online replanning (such as model predictive control) and information gathering actions that aid perception.
• Reinforcement learning is applied in robotics, with techniques striving to reduce the required number of interactions with the real world. Such techniques tend to exploit models, be it estimating models and using them to plan, or training policies that are robust with respect to different possible model parameters.
• Interaction with humans requires the ability to coordinate the robot’s actions with theirs, which can be formulated as a game. We usually decompose the solution into prediction, in which we use the person’s ongoing actions to estimate what they will do in the future, and action, in which we use the predictions to compute the optimal motion for the robot.
• Helping humans also requires the ability to learn or infer what they want. Robots can approach this by learning the desired cost function they should optimize from human input, such as demonstrations, corrections, or instruction in natural language. Alternatively, robots can imitate human behavior, and use reinforcement learning to help tackle the challenge of generalization to new states.
Bibliographical and Historical Notes
The word robot was popularized by Czech playwright Karel Čapek in his 1920 play R.U.R. (Rossum’s Universal Robots). The robots, which were grown chemically rather than constructed mechanically, end up resenting their masters and decide to take over. It appears that it was Čapek’s brother, Josef, who first combined the Czech words “robota” (obligatory work) and “robotnik” (serf) to yield “robot” in his 1917 short story Opilec (Glanc, 1978). The term robotics was invented for a science fiction story (Asimov, 1950).
The idea of an autonomous machine predates the word “robot” by thousands of years. In 7th century BCE Greek mythology, a robot named Talos was built by Hephaistos, the Greek god of metallurgy, to protect the island of Crete. The legend is that the sorceress Medea defeated Talos by promising him immortality but then draining his life fluid. Thus, this is the first example of a robot making a mistake in the process of changing its objective function. In 322 BCE, Aristotle anticipated technological unemployment, speculating “If every tool, when ordered, or even of its own accord, could do the work that befits it . . . then there would be no need either of apprentices for the master workers or of slaves for the lords.” In the 3rd century BCE an actual humanoid robot called the Servant of Philon could pour wine or water into a cup; a series of valves cut off the flow at the right time. Wonderful automata were built in the 18th century (Jacques Vaucanson’s mechanical duck from 1738 being one early example), but the complex behaviors they exhibited were entirely fixed in advance. Possibly the earliest example of a programmable robot-like device was the Jacquard loom (1805), described on page 15.
Grey Walter’s “turtle,” built in 1948, could be considered the first autonomous mobile robot, although its control system was not programmable. The “Hopkins Beast,” built in 1960 at Johns Hopkins University, was much more sophisticated; it had sonar and photocell sensors, pattern-recognition hardware, and could recognize the cover plate of a standard AC power outlet. It was capable of searching for outlets, plugging itself in, and then recharging its batteries! Still, the Beast had a limited repertoire of skills.
The first general-purpose mobile robot was “Shakey,” developed at what was then the Stanford Research Institute (now SRI) in the late 1960s (Fikes and Nilsson, 1971; Nilsson, 1984). Shakey was the first robot to integrate perception, planning, and execution, and much subsequent research in AI was influenced by this remarkable achievement. Shakey appears on the cover of this book with project leader Charlie Rosen (1917-2002). Other influential projects include the Stanford Cart and the CMU Rover (Moravec, 1983). Cox and Wilfong (1990) describe classic work on autonomous vehicles.
The first commercial robot was an arm called UNIMATE, for universal automation, developed by Joseph Engelberger and George Devol in their company, Unimation. In 1961, the first UNIMATE robot was sold to General Motors for use in manufacturing TV picture tubes. 1961 was also the year when Devol obtained the first U.S. patent on a robot. In 1973, Toyota and Nissan started using an updated version of UNIMATE for auto body spot welding. This initiated a major revolution in automobile manufacturing that took place mostly in Japan and the U.S., and that is still ongoing. Unimation followed up in 1978 with the development of the Puma robot (Programmable Universal Machine for Assembly), which was the de facto standard for robotic manipulation for the two decades that followed. About 500,000 robots are sold each year, with half of those going to the automotive industry.
In manipulation, the first major effort at creating a hand-eye machine was Heinrich Ernst’s MH-1, described in his MIT Ph.D. thesis (Ernst, 1961). The Machine Intelligence project at Edinburgh also demonstrated an impressive early system for vision-based assembly called FREDDY (Michie, 1972).
Research on mobile robotics has been stimulated by several important competitions. AAAI’s annual mobile robot competition began in 1992. The first competition winner was CARMEL (Congdon et al., 1992). Progress has been steady and impressive: in recent competitions robots entered the conference complex, found their way to the registration desk, registered for the conference, and even gave a short talk.
The RoboCup competition, launched in 1995 by Kitano and colleagues (1997), aims to “develop a team of fully autonomous humanoid robots that can win against the human world champion team in soccer” by 2050. Some competitions use wheeled robots, some humanoid robots, and some software simulations. Stone (2016) describes recent innovations in RoboCup.
The DARPA Grand Challenge, organized by DARPA in 2004 and 2005, required autonomous vehicles to travel more than 200 kilometers through the desert in less than ten hours (Buehler et al., 2006). In the original event in 2004, no robot traveled more than eight miles, leading many to believe the prize would never be claimed. In 2005, Stanford’s robot Stanley won the competition in just under seven hours (Thrun, 2006). DARPA then organized the Urban Challenge, a competition in which robots had to navigate 60 miles in an urban environment with other traffic. Carnegie Mellon University’s robot BOSS took first place and claimed the $2 million prize (Urmson and Whittaker, 2008). Early pioneers in the development of robotic cars included Dickmanns and Zapp (1987) and Pomerleau (1993).
The field of robotic mapping has evolved from two distinct origins. The first thread began with work by Smith and Cheeseman (1986), who applied Kalman filters to the simultaneous localization and mapping (SLAM) problem. This algorithm was first implemented by Moutarlier and Chatila (1989) and later extended by Leonard and Durrant-Whyte (1992); see Dissanayake et al. (2001) for an overview of early Kalman filter variations. The second thread began with the development of the occupancy grid representation for probabilistic mapping, which specifies the probability that each (x,y) location is occupied by an obstacle (Moravec and Elfes, 1985).
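As a minimal sketch of the occupancy grid idea, the grid can be stored as a 2D array of occupancy probabilities, with each cell updated by Bayes' rule when a range sensor reports a hit or a miss for that cell. The update rule, the function names, and the sensor-model values (0.9 and 0.2) are illustrative assumptions, not details taken from Moravec and Elfes.

```python
from typing import List

Grid = List[List[float]]  # grid[y][x] = probability that cell (x, y) is occupied


def make_grid(width: int, height: int, prior: float = 0.5) -> Grid:
    """Create a grid in which every cell starts at the same prior probability."""
    return [[prior for _ in range(width)] for _ in range(height)]


def update_cell(
    p_occupied: float,
    hit: bool,
    p_hit_given_occupied: float = 0.9,  # assumed sensor model
    p_hit_given_free: float = 0.2,      # assumed false-alarm rate
) -> float:
    """Posterior occupancy probability of one cell after one sensor reading."""
    if hit:
        like_occupied, like_free = p_hit_given_occupied, p_hit_given_free
    else:
        like_occupied = 1.0 - p_hit_given_occupied
        like_free = 1.0 - p_hit_given_free
    numerator = like_occupied * p_occupied
    return numerator / (numerator + like_free * (1.0 - p_occupied))


def observe(grid: Grid, x: int, y: int, hit: bool) -> None:
    """Fold one reading for cell (x, y) into the grid, in place."""
    grid[y][x] = update_cell(grid[y][x], hit)
```

Treating each cell independently, as this sketch does, keeps the update cheap; it is a simplifying assumption rather than a property of real environments.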
Kuipers and Levitt (1988) were among the first to propose topological rather than metric mapping, motivated by models of human spatial cognition. A seminal paper by Lu and Milios (1997) recognized the sparseness of the simultaneous localization and mapping problem, which gave rise to the development of nonlinear optimization techniques by Konolige (2004) and Montemerlo and Thrun (2004), as well as hierarchical methods by Bosse et al. (2004). Shatkay and Kaelbling (1997) and Thrun et al. (1998) introduced the EM algorithm into the field of robotic mapping for data association. An overview of probabilistic mapping methods can be found in (Thrun et al., 2005).
Early mobile robot localization techniques are surveyed by Borenstein et al. (1996). Although Kalman filtering was well known as a localization method in control theory for decades, the general probabilistic formulation of the localization problem did not appear in the AI literature until much later, through the work of Tom Dean and colleagues (Dean et al., 1990) and of Simmons and Koenig (1995). The latter work introduced the term Markov localization. The first real-world application of this technique was by Burgard et al. (1999), through a series of robots that were deployed in museums. Monte Carlo localization based on particle filters was developed by Fox et al. (1999) and is now widely used. The Rao-Blackwellized particle filter combines particle filtering for robot localization with exact filtering for map building (Murphy and Russell, 2001; Montemerlo et al., 2002).
A great deal of early work on motion planning focused on geometric algorithms for deterministic and fully observable motion planning problems. The PSPACE-hardness of robot motion planning was shown in a seminal paper by Reif (1979). The configuration space representation is due to Lozano-Perez (1983). A series of papers by Schwartz and Sharir on what they called piano movers problems (Schwartz et al., 1987) was highly influential.
Recursive cell decomposition for configuration space planning originated in the work of Brooks and Lozano-Perez (1985) and was improved significantly by Zhu and Latombe (1991). The earliest skeletonization algorithms were based on Voronoi diagrams (Rowat, 1979) and visibility graphs (Wesley and Lozano-Perez, 1979). Guibas et al. (1992) developed efficient techniques for calculating Voronoi diagrams incrementally, and Choset (1996) generalized Voronoi diagrams to broader motion planning problems. John Canny (1988) established the first singly exponential algorithm for motion planning. The seminal text by Latombe (1991) covers a variety of approaches to motion planning, as do the texts by Choset et al. (2005) and LaValle (2006). Kavraki et al. (1996) developed the theory of probabilistic roadmaps. Kuffner and LaValle (2000) developed rapidly exploring random trees (RRTs).
Involving optimization in geometric motion planning began with elastic bands (Quinlan and Khatib, 1993), which refine paths when the configuration-space obstacles change. Ratliff et al. (2009) formulated the idea as the solution to an optimal control problem, allowing the initial trajectory to start in collision, and deforming it by mapping workspace obstacle gradients via the Jacobian into the configuration space. Schulman et al. (2013) proposed a practical second-order alternative.
The control of robots as dynamical systems, whether for manipulation or navigation,
has generated a vast literature. While this chapter explained the basics of trajectory tracking
control and optimal control, it left out entire subfields, including adaptive control, robust
control, and Lyapunov analysis. Rather than assuming everything about the system is known
a priori, adaptive control aims to adapt the dynamics parameters and/or the control law online.
Robust control, on the other hand, aims to design controllers that perform well in spite of
uncertainty and external disturbances. Lyapunov analysis was originally developed in the 1890s for the stability analysis of
general nonlinear systems, but it was not until the early 1930s that control theorists realized its true potential.
With the development of optimization methods, Lyapunov
analysis was
extended to control barrier functions, which lend themselves nicely to modern optimization tools. These methods are widely used in modern robotics for real-time controller design and
safety analysis.
Crucial works in robotic control include a trilogy on impedance control by Hogan (1985) and a general study of robot dynamics by Featherstone (1987). Dean and Wellman (1991) were among the first to try to tie together control theory and AI planning systems. Three classic textbooks on the mathematics of robot manipulation are due to Paul (1981), Craig (1989), and Yoshikawa (1990). Control for manipulation is covered by Murray (2017).
The area of grasping is also important in robotics: the problem of determining a stable grasp is quite difficult (Mason and Salisbury, 1985). Competent grasping requires touch sensing, or haptic feedback, to determine contact forces and detect slip (Fearing and Hollerbach, 1985). Understanding how to grasp the wide variety of objects in the world is a daunting task. Bousmalis et al. (2017) describe a system that combines real-world experimentation with simulations guided by sim-to-real transfer to produce robust grasping.
Potential-field control, which attempts to solve the motion planning and control problems simultaneously, was developed for robotics by Khatib (1986). In mobile robotics, this idea was viewed as a practical solution to the collision avoidance problem, and was later extended into an algorithm called vector field histograms by Borenstein (1991).
ILQR is currently widely used at the intersection of motion planning and control and is due to Li and Todorov (2004). It is a variant of the much older differential dynamic programming technique (Jacobson and Mayne, 1970).
Fine-motion planning with limited sensing was investigated by Lozano-Perez et al. (1984) and Canny and Reif (1987). Landmark-based navigation (Lazanas and Latombe, 1992) uses many of the same ideas in the mobile robot arena. Navigation functions, the robotics version of a control policy for deterministic MDPs, were introduced by Koditschek (1987). Key work applying POMDP methods (Section 17.4) to motion planning under uncertainty in robotics is due to Pineau et al. (2003) and Roy et al. (2005).
Reinforcement learning in robotics took off with the seminal work by Bagnell and Schneider (2001) and Ng et al. (2003), who developed the paradigm in the context of autonomous helicopter control. Kober et al. (2013) offer an overview of how reinforcement learning changes when applied to the robotics problem. Many of the techniques implemented on physical systems build approximate dynamics models, dating back to locally weighted linear models due to Atkeson et al. (1997). But policy gradients played their role as well, enabling (simplified) humanoid robots to walk (Tedrake et al., 2004), or a robot arm to hit a baseball (Peters and Schaal, 2008).
Levine et al. (2016) demonstrated the first deep reinforcement learning application on a real robot. At the same time, model-free RL in simulation was being extended to continuous domains (Schulman et al., 2015a; Heess et al., 2016; Lillicrap et al., 2015). Other work scaled up physical data collection massively to showcase the learning of grasps and dynamics models (Pinto and Gupta, 2016; Agrawal et al., 2017; Levine et al., 2018). Transfer from simulation to reality or sim-to-real (Sadeghi and Levine, 2016; Andrychowicz et al., 2018a), metalearning (Finn et al., 2017), and sample-efficient model-free reinforcement learning (Andrychowicz et al., 2018b) are active areas of research.
Early methods for predicting human actions made use of filtering approaches (Madhavan and Schlenoff, 2003), but seminal work by Ziebart et al. (2009) proposed prediction by modeling people as approximately rational agents. Sadigh et al. (2016) captured how these predictions should actually depend on what the robot decides to do, building toward a game-theoretic setting. For collaborative settings, Sisbot et al. (2007) pioneered the idea of accounting for what people want in the robot’s cost function. Nikolaidis and Shah (2013) decomposed collaboration into learning how the human will act, but also learning how the human wants the robot to act, both achievable from demonstrations. For learning from demonstration see Argall et al. (2009). Akgun et al. (2012) and Sefidgar et al. (2017) studied teaching by end users rather than by experts.
Tellex et al. (2011) showed how robots can infer what people want from natural language instructions. Finally, not only do robots need to infer what people want and plan on doing, but people too need to make the same inferences about robots. Dragan et al. (2013) incorporated a model of the human’s inferences into robot motion planning.
The field of human-robot interaction is much broader than what we covered in this chapter, which focused primarily on the planning and learning aspects. Thomaz et al. (2016) provide a survey of interaction more broadly from a computational perspective. Ross et al. (2011) describe the DAGGER system.
The topic of software architectures for robots engenders much religious debate. The good old-fashioned AI candidate, the three-layer architecture, dates back to the design of Shakey and is reviewed by Gat (1998). The subsumption architecture is due to Brooks (1986), although similar ideas were developed independently by Braitenberg, whose book, Vehicles (1984), describes a series of simple robots based on the behavioral approach.
The success of Brooks’s six-legged walking robot was followed by many other projects. Connell, in his Ph.D. thesis (1989), developed an entirely reactive mobile robot that was capable of retrieving objects. Extensions of the paradigm to multirobot systems can be found in work by Parker (1996) and Mataric (1997). GRL (Horswill, 2000) and COLBERT (Konolige, 1997) abstract the ideas of concurrent behavior-based robotics into general robot control languages. Arkin (1998) surveys some of the most popular approaches in this field.
Two early textbooks, by Dudek and Jenkin (2000) and by Murphy (2000), cover robotics generally. More recent overviews are due to Bekey (2008) and Lynch and Park (2017). An excellent book on robot manipulation addresses advanced topics such as compliant motion (Mason, 2001). Robot motion planning is covered in Choset et al. (2005) and LaValle (2006). Thrun et al. (2005) introduce probabilistic robotics. The Handbook of Robotics (Siciliano and Khatib, 2016) is a massive, comprehensive overview of all of robotics.
The premier conference for robotics is the Robotics: Science and Systems Conference, followed by the IEEE International Conference on Robotics and Automation. Human-Robot Interaction is the premier venue for interaction. Leading robotics journals include IEEE Robotics and Automation, the International Journal of Robotics Research, and Robotics and Autonomous Systems.
CHAPTER 27
PHILOSOPHY, ETHICS, AND SAFETY OF AI
In which we consider the big questions around the meaning of AI, how we can ethically develop and apply it, and how we can keep it safe.
Philosophers have been asking big questions for a long time: How do minds work? Is it possible for machines to act intelligently in the way that people do? Would such machines have real, conscious minds? To these, we add new ones: What are the ethical implications of intelligent machines in day-to-day use? Should machines be allowed to decide to kill humans? Can algorithms be fair and unbiased? What will humans do if machines can do all kinds of work? And how do we control machines that may become more intelligent than us?
27.1 The Limits of AI
In 1980, philosopher John Searle introduced a distinction between weak AI—the idea that machines could act as if they were intelligent—and strong AI—the assertion that machines that do so are actually consciously thinking (not just simulating thinking). Over time the definition of strong AI shifted to refer to what is also called “human-level AI” or “general AI”—programs that can solve an arbitrarily wide variety of tasks, including novel ones, and do so as well as a human.
Critics of weak AI who objected to the very possibility of intelligent behavior in machines now appear as shortsighted as Simon Newcomb, who in October 1903 wrote “aerial flight is one of the great class of problems with which man can never cope”—just two months before the Wright brothers’ flight at Kitty Hawk. The rapid progress of recent years does not, however, prove that there can be no limits to what AI can achieve. Alan Turing (1950), the first person to define AI, was also the first to raise possible objections to AI, foreseeing almost all the ones subsequently raised by others.

27.1.1 The argument from informality
Turing’s “argument from informality of behavior” says that human behavior is far too complex to be captured by any formal set of rules—humans must be using some informal guidelines that (the argument claims) could never be captured in a formal set of rules and thus could never be codified in a computer program. A key proponent of this view was Hubert Dreyfus, who produced a series of influential critiques of artificial intelligence: What Computers Can’t Do (1972), the sequel What Computers Still Can’t Do (1992), and, with his brother Stuart, Mind Over Machine (1986).
Similarly, philosopher Kenneth Sayre (1993) said “Artificial intelligence pursued within the cult of computationalism stands not even a ghost of a chance of producing durable results.” The technology they criticize came to be called Good Old-Fashioned AI (GOFAI).
GOFAI corresponds to the simplest logical agent design described in Chapter 7, and we saw there that it is indeed difficult to capture every contingency of appropriate behavior in a set of necessary and sufficient logical rules; we called that the qualification problem. But as we saw in Chapter 12, probabilistic reasoning systems are more appropriate for open-ended domains, and as we saw in Chapter 21, deep learning systems do well on a variety of “informal” tasks. Thus, the critique is not addressed against computers per se, but rather against one particular style of programming them with logical rules—a style that was popular in the 1980s but has been eclipsed by new approaches.
One of Dreyfus’s strongest arguments is for situated agents rather than disembodied logical inference engines. An agent whose understanding of “dog” comes only from a limited set of logical sentences such as “Dog(x) ⇒ Mammal(x)” is at a disadvantage compared to an agent that has watched dogs run, has played fetch with them, and has been licked by one. As philosopher Andy Clark (1998) says, “Biological brains are first and foremost the control systems for biological bodies. Biological bodies move and act in rich real-world surroundings.” According to Clark, we are “good at frisbee, bad at logic.”
The embodied cognition approach claims that it makes no sense to consider the brain separately: cognition takes place within a body, which is embedded in an environment. We need to study the system as a whole; the brain’s functioning exploits regularities in its environment, including the rest of its body. Under the embodied cognition approach, robotics, vision, and other sensors become central, not peripheral.

Overall, Dreyfus saw areas where AI did not have complete answers and said that AI is therefore impossible; we now see many of these same areas undergoing continued research and development leading to increased capability, not impossibility.

27.1.2 The argument from disability
The “argument from disability” makes the claim that “a machine can never do X.” As examples of X, Turing lists the following:

Be kind, resourceful, beautiful, friendly, have initiative, have a sense of humor, tell right from wrong, make mistakes, fall in love, enjoy strawberries and cream, make someone fall in love with it, learn from experience, use words properly, be the subject of its own thought, have as much diversity of behavior as man, do something really new.

In retrospect, some of these are rather easy—we’re all familiar with computers that “make mistakes.” Computers with metareasoning capabilities (Chapter 5) can examine their own computations, thus being the subject of their own reasoning. A century-old technology has the proven ability to “make someone fall in love with it”—the teddy bear. Computer chess expert David Levy predicts that by 2050 people will routinely fall in love with humanoid robots. As for a robot falling in love, that is a common theme in fiction,¹ but there has been only limited academic speculation on the subject (Kim et al., 2007). Computers have done things that are “really new,” making significant discoveries in astronomy, mathematics, chemistry, mineralogy, biology, computer science, and other fields, and creating new forms of art through style transfer (Gatys et al., 2016). Overall, programs exceed human performance in some tasks and lag behind on others. The one thing that it is clear they can’t do is be exactly human.

¹ For example, the opera Coppélia (1870), the novel Do Androids Dream of Electric Sheep? (1968), the movies AI (2001), Wall-E (2008), and Her (2013).

27.1.3 The mathematical objection
Turing (1936) and Gödel (1931) proved that certain mathematical questions are in principle unanswerable by particular formal systems. Gödel’s incompleteness theorem (see Section 9.5) is the most famous example of this. Briefly, for any formal axiomatic framework F powerful enough to do arithmetic, it is possible to construct a so-called Gödel sentence G(F) with the following properties:
• G(F) is a sentence of F, but cannot be proved within F.
• If F is consistent, then G(F) is true.
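The two properties can also be written compactly. This is only a notational sketch: the symbol ⊬ (“does not prove”) and the abbreviation Con(F) (“F is consistent”) are conventions assumed here, not notation used in the text. As in the prose, the first property is stated without the consistency premise that a fully careful statement would carry.

```latex
F \nvdash G(F)
\qquad\text{and}\qquad
\mathrm{Con}(F) \;\Rightarrow\; G(F)\ \text{is true}
```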
Philosophers such as J. R. Lucas (1961) have claimed that this theorem shows that machines are mentally inferior to humans, because machines are formal systems that are limited by the incompleteness theorem—they cannot establish the truth of their own Gödel sentence—while humans have no such limitation. This has caused a lot of controversy, spawning a vast literature, including two books by the mathematician/physicist Sir Roger Penrose (1989, 1994). Penrose repeats Lucas’s claim with some fresh twists, such as the hypothesis that humans are different because their brains operate by quantum gravity—a theory that makes multiple false predictions about brain physiology.
We will examine three of the problems with Lucas’s claim. First, an agent should not be ashamed that it cannot establish the truth of some sentence while other agents can. Consider the following sentence:

Lucas cannot consistently assert that this sentence is true.

If Lucas asserted this sentence, then he would be contradicting himself, so therefore Lucas cannot consistently assert it, and hence it is true. We have thus demonstrated that there is a true sentence that Lucas cannot consistently assert while other people (and machines) can. But that does not make us think any less of Lucas.
Second, Gödel’s incompleteness theorem and related results apply to mathematics, not to computers. No entity—human or machine—can prove things that are impossible to prove. Lucas and Penrose falsely assume that humans can somehow get around these limits, as when Lucas (1976) says “we must assume our own consistency, if thought is to be possible at all.” But this is an unwarranted assumption: humans are notoriously inconsistent. This is certainly true for everyday reasoning, but it is also true for careful mathematical thought. A famous example is the four-color map problem. Alfred Kempe (1879) published a proof that was widely accepted for 11 years until Percy Heawood (1890) pointed out a flaw.

Third, Gödel’s incompleteness theorem technically applies only to formal systems that are powerful enough to do arithmetic. This includes Turing machines, and Lucas’s claim is in part based on the assertion that computers are equivalent to Turing machines. This is not quite true. Turing machines are infinite, whereas computers (and brains) are finite, and any computer can therefore be described as a (very large) system in propositional logic, which is not subject to Gödel’s incompleteness theorem. Lucas assumes that humans can “change their minds” while computers cannot, but that is also false—a computer can retract a conclusion after new evidence or further deliberation; it can upgrade its hardware; and it can change its decision-making processes with machine learning or software rewriting.

27.1.4 Measuring AI
Alan Turing, in his famous paper “Computing Machinery and Intelligence” (1950), suggested that instead of asking whether machines can think, we should ask whether machines can pass a behavioral test, which has come to be called the Turing test. The test requires a program to have a conversation (via typed messages) with an interrogator for five minutes. The interrogator then has to guess if the conversation is with a program or a person; the program passes the test if it fools the interrogator 30% of the time. To Turing, the key point was not the exact details of the test, but instead the idea of measuring intelligence by performance on some kind of open-ended behavioral task, rather than by philosophical speculation.
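The pass criterion described above is simple enough to state as code. The following is a minimal sketch, assuming each five-minute conversation is recorded as a boolean; the function name `passes_turing_test` and the parameter `threshold_percent` are hypothetical names chosen for illustration, and the default of 30 is the figure quoted in the text.

```python
def passes_turing_test(judged_human, threshold_percent=30):
    """Return True if the program fooled interrogators often enough.

    judged_human: list of booleans, one per conversation, True when the
        interrogator guessed "person" while actually talking to the program.
    threshold_percent: share of conversations the program must win
        (30 is the figure quoted in the text; the name is hypothetical).
    """
    if not judged_human:
        raise ValueError("at least one conversation is required")
    fooled = sum(judged_human)
    # Integer comparison avoids floating-point rounding at the boundary.
    return 100 * fooled >= threshold_percent * len(judged_human)


# Hypothetical records, for illustration only: 3 of 10 interrogators fooled.
trials = [True] * 3 + [False] * 7
print(passes_turing_test(trials))
```

The sketch captures only the scoring rule. It says nothing about who the interrogators are or how well trained they are, which is where most of the disagreement about reported results lies.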
Nevertheless, Turing conjectured that by the year 2000 a computer with a storage of a billion units could pass the test, but here we are on the other side of 2000, and we still can’t agree whether any program has passed. Many people have been fooled when they didn’t know they might be chatting with a computer. The ELIZA program and Internet chatbots such as MGONZ (Humphrys, 2008) and NATACHATA (Jonathan et al., 2009) fool their correspondents repeatedly, and the chatbot CYBERLOVER has attracted the attention of law enforcement because of its penchant for tricking fellow chatters into divulging enough personal information that their identity can be stolen.

In 2014, a chatbot called Eugene Goostman fooled 33% of the untrained amateur judges in a Turing test. The program claimed to be a boy from Ukraine with limited command of English; this helped explain its grammatical errors. Perhaps the Turing test is really a test of human gullibility. So far no well-trained judge has been fooled (Aaronson, 2014).
Turing test competitions have led to better chatbots, but have not been a focus of research within the AI community. Instead, AI researchers who crave competition are more likely to concentrate on playing chess or Go or StarCraft II, or taking an 8th grade science exam, or identifying objects in images. In many of these competitions, programs have reached or surpassed human-level performance, but that doesn’t mean the programs are human-like outside the specific task. The point is to improve basic science and technology and to provide useful tools, not to fool judges.
27.2 Can Machines Really Think?
Some philosophers claim that a machine that acts intelligently would not be actually thinking, but would be only a simulation of thinking. But most AI researchers are not concerned with the distinction, and the computer scientist Edsger Dijkstra (1984) said that “The question of whether Machines Can Think ... is about as relevant as the question of whether Submarines Can Swim.” The American Heritage Dictionary’s first definition of swim is “To move through water by means of the limbs, fins, or tail,” and most people agree that submarines, being limbless, cannot swim. The dictionary also defines fly as “To move through the air by means of wings or winglike parts,” and most people agree that airplanes, having winglike parts, can fly. However, neither the questions nor the answers have any relevance to the design or capabilities of airplanes and submarines; rather they are about word usage in English. (The fact that ships do swim (“plavat”) in Russian amplifies this point.) English speakers have not yet settled on a precise definition for the word “think”—does it require “a brain” or just “brain-like parts”?
Again, the issue was addressed by Turing. He notes that we never have any direct evidence about the internal mental states of other humans—a kind of mental solipsism. Nevertheless, Turing says, “Instead of arguing continually over this point, it is usual to have the polite convention that everyone thinks.” Turing argues that we would also extend the polite convention to machines, if only we had experience with ones that act intelligently. However, now that we do have some experience, it seems that our willingness to ascribe sentience depends at least as much on humanoid appearance and voice as on pure intelligence.

27.2.1 The Chinese room
The philosopher John Searle rejects the polite convention. His famous Chinese room argument (Searle, 1990) goes as follows: Imagine a human, who understands only English, inside a room that contains a rule book, written in English, and various stacks of paper. Pieces of paper containing indecipherable symbols are slipped under the door to the room. The human follows the instructions in the rule book, finding symbols in the stacks, writing symbols on new pieces of paper, rearranging the stacks, and so on. Eventually, the instructions will cause one or more symbols to be transcribed onto a piece of paper that is passed back to the outside world. From the outside, we see a system that is taking input in the form of Chinese sentences and generating fluent, intelligent Chinese responses.

Searle then argues: it is given that the human does not understand Chinese. The rule book and the stacks of paper, being just pieces of paper, do not understand Chinese. Therefore, there is no understanding of Chinese. And Searle says that the Chinese room is doing the same thing that a computer would do, so therefore computers generate no understanding.

Searle (1980) is a proponent of biological naturalism, according to which mental states are high-level emergent features that are caused by low-level physical processes in the neurons, and it is the (unspecified) properties of the neurons that matter: according to Searle’s biases, neurons have “it” and transistors do not. There have been many refutations of Searle’s argument, but no consensus. His argument could equally well be used (perhaps by robots) to argue that a human cannot have true understanding; after all, a human is made out of cells, the cells do not understand, therefore there is no understanding. In fact, that is the plot of Terry Bisson’s (1990) science fiction story They’re Made Out of Meat, in which alien robots explore Earth and can’t believe that hunks of meat could possibly be sentient. How they can be remains a mystery.
27.2.2 Consciousness and qualia
Running through all the debates about strong AI is the issue of consciousness: awareness of the outside world, and of the self, and the subjective experience of living. The technical term for the intrinsic nature of experiences is qualia (from the Latin word meaning, roughly, “of what kind”). The big question is whether machines can have qualia. In the movie 2001, when astronaut David Bowman is disconnecting the “cognitive circuits” of the HAL 9000 computer, it says “I’m afraid, Dave. Dave, my mind is going. I can feel it.” Does HAL actually have feelings (and deserve sympathy)? Or is the reply just an algorithmic response, no different from “Error 404: not found”?

There is a similar question for animals: pet owners are certain that their dog or cat has consciousness, but not all scientists agree. Crickets change their behavior based on temperature, but few people would say that crickets experience the feeling of being warm or cold.
One reason that the problem of consciousness is hard is that it remains ill-defined, even after centuries of debate. But help may be on the way. Recently philosophers have teamed with neuroscientists under the auspices of the Templeton Foundation to start a series of experiments that could resolve some of the issues. Advocates of two leading theories of consciousness (global workspace theory and integrated information theory) have agreed that the experiments could confirm one theory over the other—a rarity in philosophy.
Alan Turing (1950) concedes that the question of consciousness is a difficult one, but denies that it has much relevance to the practice of AI: “I do not wish to give the impression that I think there is no mystery about consciousness ... But I do not think these mysteries necessarily need to be solved before we can answer the question with which we are concerned in this paper.” We agree with Turing—we are interested in creating programs that behave intelligently. Individual aspects of consciousness—awareness, self-awareness, attention—can be programmed and can be part of an intelligent machine. The additional project of making a machine conscious in exactly the way humans are is not one that we are equipped to take on. We do agree that behaving intelligently will require some degree of awareness, which will differ from task to task, and that tasks involving interaction with humans will require a model of human subjective experience.

In the matter of modeling experience, humans have a clear advantage over machines, because they can use their own subjective apparatus to appreciate the subjective experience of others. For example, if you want to know what it’s like when someone hits their thumb with a hammer, you can hit your thumb with a hammer. Machines have no such capability—although unlike humans, they can run each other’s code.

27.3 The Ethics of AI
Given that AI is a powerful technology, we have a moral obligation to use it well, to promote the positive aspects and avoid or mitigate the negative ones.

The positive aspects are many. For example, AI can save lives through improved medical diagnosis, new medical discoveries, better prediction of extreme weather events, and safer driving with driver assistance and (eventually) self-driving technologies. There are also many opportunities to improve lives. Microsoft’s AI for Humanitarian Action program applies AI to recovering from natural disasters, addressing the needs of children, protecting refugees, and promoting human rights. Google’s AI for Social Good program supports work on rainforest protection, human rights jurisprudence, pollution monitoring, measurement of fossil fuel emissions, crisis counseling, news fact checking, suicide prevention, recycling, and other issues. The University of Chicago’s Center for Data Science for Social Good applies machine learning to problems in criminal justice, economic development, education, public health, energy, and environment.
AI applications in crop management and food production help feed the world. Optimization of business processes using machine learning will make businesses more productive, increasing wealth and providing more employment. Automation can replace the tedious and dangerous tasks that many workers face, and free them to concentrate on more interesting aspects. People with disabilities will benefit from AI-based assistance in seeing, hearing, and mobility. Machine translation already allows people from different cultures to communicate. Software-based AI solutions have near zero marginal cost of production, and so have the potential to democratize access to advanced technology (even as other aspects of software have the potential to centralize power).
Despite these many positive aspects, we shouldn’t ignore the negatives. Many new technologies have had unintended negative side effects: nuclear fission brought Chernobyl and the threat of global destruction; the internal combustion engine brought air pollution, global warming, and the paving of paradise. Other technologies can have negative effects even when used as intended, such as sarin gas, AR-15 rifles, and telephone solicitation. Automation will create wealth, but under current economic conditions much of that wealth will flow to the owners of the automated systems, leading to increased income inequality. This can be disruptive to a well-functioning society. In developing countries, the traditional path to growth through low-cost manufacturing for export may be cut off, as wealthy countries adopt fully automated manufacturing facilities on-shore. Our ethical and governance decisions will dictate the level of inequality that AI will engender.
All scientists and engineers face ethical considerations of what projects they should or should not take on, and how they can make sure the execution of the project is safe and beneficial. In 2010, the UK’s Engineering and Physical Sciences Research Council held a meeting to develop a set of Principles of Robotics. In subsequent years other government agencies, nonprofit organizations, and companies created similar sets of principles. The gist is that every organization that creates AI technology, and everyone in the organization, has a responsibility to make sure the technology contributes to good, not harm. The most commonly cited principles are:

Ensure safety
Establish accountability
Promote collaboration
Avoid concentration of power
Ensure fairness
Respect privacy
Provide transparency
Limit harmful uses of AI
Uphold human rights and values
Reflect diversity/inclusion
Acknowledge legal/policy implications
Contemplate implications for employment
Note that many of the principles, such as “ensure safety,” have applicability to all software or hardware systems, not just AI systems. Several principles are worded in a vague way, making them difficult to measure or enforce. That is in part because AI is a big field with many subfields, each of which has a different set of historical norms and different relationships between the AI developers and the stakeholders. Mittelstadt (2019) suggests that the subfields should each develop more specific actionable guidelines and case precedents.

27.3.1 Lethal autonomous weapons
The UN defines a lethal autonomous weapon as one that locates, selects, and engages (i.e., kills) human targets without human supervision. Various weapons fulfill some of these criteria. For example, land mines have been used since the 17th century: they can select and engage targets in a limited sense according to the degree of pressure exerted or the quantity of metal present, but they cannot go out and locate targets by themselves. (Land mines are banned under the Ottawa Treaty.) Guided missiles, in use since the 1940s, can chase targets, but they have to be pointed in the right general direction by a human. Auto-firing radar-controlled guns have been used to defend naval ships since the 1970s; they are mainly intended to destroy incoming missiles, but they could also attack manned aircraft. Although the word “autonomous” is often used to describe unmanned air vehicles or drones, most such weapons are both remotely piloted and require human actuation of the lethal payload.
At the time of writing, several weapons systems seem to have crossed the line into full autonomy. For example, Israel’s Harop missile is a “loitering munition” with a ten-foot wingspan and a fifty-pound warhead. It searches for up to six hours in a given geographical region for any target that meets a given criterion and then destroys it. The criterion could be “emits a radar signal resembling antiaircraft radar” or “looks like a tank.” The Turkish manufacturer STM advertises its Kargu quadcopter—which carries up to 1.5 kg of explosives—as capable of “Autonomous hit ... targets selected on images ... tracking moving targets ... anti-personnel ... face recognition.”

Autonomous weapons have been called the “third revolution in warfare” after gunpowder and nuclear weapons. Their military potential is obvious. For example, few experts doubt that autonomous fighter aircraft would defeat any human pilot. Autonomous aircraft, tanks, and submarines can be cheaper, faster, more maneuverable, and have longer range than their manned counterparts.
Since 2014, the United Nations in Geneva has conducted regular discussions under the auspices of the Convention on Certain Conventional Weapons (CCW) on the question of whether to ban lethal autonomous weapons. At the time of writing, 30 nations, ranging in size from China to the Holy See, have declared their support for an international treaty, while other key countries—including Israel, Russia, South Korea, and the United States—are opposed to a ban.

The debate over autonomous weapons includes legal, ethical, and practical aspects. The legal issues are governed primarily by the CCW, which requires the possibility of discriminating between combatants and non-combatants, the judgment of military necessity for an attack, and the assessment of proportionality between the military value of a target and the possibility of collateral damage. The feasibility of meeting these criteria is an engineering question—one whose answer will undoubtedly change over time. At present, discrimination seems feasible in some circumstances and will undoubtedly improve rapidly, but necessity and proportionality are not presently feasible: they require that machines make subjective and situational judgments that are considerably more difficult than the relatively simple tasks of searching for and engaging potential targets. For these reasons, it would be legal to use autonomous weapons only in circumstances where a human operator can reasonably predict that the execution of the mission will not result in civilians being targeted or the weapons conducting unnecessary or disproportionate attacks. This means that, for the time being, only very restricted missions could be undertaken by autonomous weapons.
On the ethical side, some find it simply morally unacceptable to delegate the decision to kill humans to a machine. For example, Germany’s ambassador in Geneva has stated that it “will not accept that the decision over life and death is taken solely by an autonomous system” while Japan “has no plan to develop robots with humans out of the loop, which may be capable of committing murder.” Gen. Paul Selva, at the time the second-ranking military officer in the United States, said in 2017, “I don’t think it’s reasonable for us to put robots in charge of whether or not we take a human life.” Finally, António Guterres, the head of the United Nations, stated in 2019 that “machines with the power and discretion to take lives without human involvement are politically unacceptable, morally repugnant and should be prohibited by international law.”

More than 140 NGOs in over 60 countries are part of the Campaign to Stop Killer Robots, and an open letter organized in 2015 by the Future of Life Institute was signed by over 4,000 AI researchers² and 22,000 others.

Against this, it can be argued that as technology improves it ought to be possible to develop weapons that are less likely than human soldiers or pilots to cause civilian casualties. (There is also the important benefit that autonomous weapons reduce the need for human soldiers and pilots to risk death.) Autonomous systems will not succumb to fatigue, frustration, hysteria, fear, anger, or revenge, and need not “shoot first, ask questions later” (Arkin, 2015). Just as guided munitions have reduced collateral damage compared to unguided bombs, one may expect intelligent weapons to further improve the precision of attacks. (Against this, see Benjamin (2013) for an analysis of drone warfare casualties.) This, apparently, is the position of the United States in the latest round of negotiations in Geneva.
Perhaps counterintuitively, the United States is also one of the few nations whose own policies currently preclude the use of autonomous weapons. The 2011 U.S. Department of Defense (DOD) roadmap says: “For the foreseeable future, decisions over the use of force [by autonomous systems] and the choice of which individual targets to engage with lethal force will be retained under human control.” The primary reason for this policy is practical: autonomous systems are not reliable enough to be trusted with military decisions.

The issue of reliability came to the fore on September 26, 1983, when Soviet missile officer Stanislav Petrov’s computer display flashed an alert of an incoming missile attack. According to protocol, Petrov should have initiated a nuclear counterattack, but he suspected the alert was a bug and treated it as such. He was correct, and World War III was (narrowly) averted. We don’t know what would have happened if there had been no human in the loop.

Reliability is a very serious concern for military commanders, who know well the complexity of battlefield situations. Machine learning systems that operate flawlessly in training may perform poorly when deployed. Cyberattacks against autonomous weapons could result in friendly-fire casualties; disconnecting the weapon from all communication may prevent that (assuming it has not already been compromised), but then the weapon cannot be recalled if it is malfunctioning.
The overriding practical issue with autonomous weapons is that they are scalable weapons of mass destruction, in the sense that the scale of an attack that can be launched is proportional to the amount of hardware one can afford to deploy. A quadcopter two inches in diameter can carry a lethal explosive charge, and one million can fit in a regular shipping container. Precisely because they are autonomous, these weapons would not need one million human supervisors to do their work.
As weapons of mass destruction, scalable autonomous weapons have advantages for the attacker compared to nuclear weapons and carpet bombing: they leave property intact and can be applied selectively to eliminate only those who might threaten an occupying force. They could certainly be used to wipe out an entire ethnic group or all the adherents of a particular religion. In many situations, they would also be untraceable. These characteristics make them particularly attractive to non-state actors.
² Including the two authors of this book.
These considerations—particularly those characteristics that advantage the attacker—suggest that autonomous weapons will reduce global and national security for all parties. The rational response for governments seems to be to engage in arms control discussions rather than an arms race.

The process of designing a treaty is not without its difficulties, however. AI is a dual use technology: AI technologies that have peaceful applications, such as flight control, visual tracking, mapping, navigation, and multiagent planning, can easily be applied to military purposes. It is easy to turn an autonomous quadcopter into a weapon simply by attaching an explosive and commanding it to seek out a target. Dealing with this will require careful implementation of compliance regimes with industry cooperation, as has already been demonstrated with some success by the Chemical Weapons Convention.
27.3.2 Surveillance, security, and privacy
In 1976, Joseph Weizenbaum warned that automated speech recognition technology could
lead to widespread wiretapping, and hence to a loss of civil liberties. Today, that threat has
been realized, with most electronic communication going through central servers that can
be monitored, and cities packed with microphones and cameras that can identify and track individuals based on their voice, face, and gait. Surveillance that used to require expensive and scarce human resources can now be done at a mass scale by machines. As of 2018, there were as many as 350 million surveillance cameras in China and 70 million in the United States. China and other countries have begun exporting surveillance technology to low-tech countries, some with reputations for mistreating their citizens and disproportionately targeting marginalized communities. AI engineers should be clear on what uses of surveillance are compatible with human rights, and decline to work on applications that are incompatible.
As more of our institutions operate online, we become more vulnerable to cybercrime (phishing, credit card fraud, botnets, ransomware) and cyberterrorism (including potentially deadly attacks such as shutting down hospitals and power plants or commandeering self-driving cars). Machine learning can be a powerful tool for both sides in the cybersecurity battle. Attackers can use automation to probe for insecurities and they can apply reinforcement learning for phishing attempts and automated blackmail. Defenders can use unsupervised learning to detect anomalous incoming traffic patterns (Chandola et al., 2009; Malhotra et al., 2015) and various machine learning techniques to detect fraud (Fawcett and Provost, 1997; Bolton and Hand, 2002). As attacks get more sophisticated, there is a greater responsibility for all engineers, not just the security experts, to design secure systems from the start. One forecast (Kanal, 2017) puts the market for machine learning in cybersecurity at about $100 billion by 2021.
As we interact with computers for increasing amounts of our daily lives, more data on us
is being collected by governments and corporations. Data collectors have a moral and legal
responsibility to be good stewards of the data they hold.
In the U.S., the Health Insurance
Portability and Accountability Act (HIPAA) and the Family Educational Rights and Privacy Act (FERPA) protect the privacy of medical and student records. The European Union’s General Data Protection Regulation (GDPR) mandates that companies design their systems
with protection of data in mind and requires that they obtain user consent for any collection
or processing of data.
Section 27.3 The Ethics of AI
Balanced against the individual’s right to privacy is the value that society gains from sharing data. We want to be able to stop terrorists without oppressing peaceful dissent, and we want to cure diseases without compromising any individual’s right to keep their health history private. One key practice is de-identification: eliminating personally identifying information (such as name and social security number) so that medical researchers can use the data to advance the common good. The problem is that the shared de-identified data may be subject to re-identification. For example, if the data strips out the name, social security number, and street address, but includes date of birth, gender, and zip code, then, as shown by Latanya Sweeney (2000), 87% of the U.S. population can be uniquely re-identified. Sweeney emphasized this point by re-identifying the health record for the governor of her state when he was admitted to the hospital. In the Netflix Prize competition, de-identified records of individual movie ratings were released, and competitors were asked to come up with a machine learning algorithm that could accurately predict which movies an individual would like. But researchers were able to re-identify individual users by matching the date of a rating in the Netflix database with the date of a similar ranking in the Internet Movie Database (IMDB), where users sometimes use their actual names (Narayanan and Shmatikov, 2006).
This risk can be mitigated somewhat by generalizing fields: for example, replacing the exact birth date with just the year of birth, or a broader range like “20-30 years old.” Deleting a field altogether can be seen as a form of generalizing to “any.” But generalization alone does not guarantee that records are safe from re-identification; it may be that there is only one person in zip code 94720 who is 90-100 years old. A useful property is k-anonymity: a database is k-anonymized if every record in the database is indistinguishable from at least k − 1 other records. If there are records that are more unique than this, they would have to be further generalized.
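The k-anonymity property can be checked mechanically by grouping records on the fields an attacker might link against. The sketch below is only an illustration: the function names, the record layout, and the choice of zip code and age as the linking fields are assumptions, not part of the text.

```python
from collections import Counter

def anonymity_level(records, quasi_identifiers):
    """Largest k for which `records` is k-anonymous on the given fields."""
    # Records with identical values on these fields are indistinguishable.
    groups = Counter(tuple(r[f] for f in quasi_identifiers) for r in records)
    # The smallest group determines k; an empty database gets k = 0.
    return min(groups.values(), default=0)

def generalize_age(record, width=10):
    """Replace an exact age with a range such as '20-30'."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width}"}

# Hypothetical records, all in one zip code.
records = [
    {"zip": "94720", "age": 23},
    {"zip": "94720", "age": 27},
    {"zip": "94720", "age": 21},
    {"zip": "94720", "age": 94},
]
decades = [generalize_age(r) for r in records]
suppressed = [{**r, "age": "any"} for r in records]

print(anonymity_level(records, ["zip", "age"]))     # 1: every exact age is unique
print(anonymity_level(decades, ["zip", "age"]))     # 1: one person is 90-100
print(anonymity_level(suppressed, ["zip", "age"]))  # 4: age generalized to "any"
```

Generalizing ages to decades still leaves the single 90-100 year old exposed, so that field has to be generalized further before the table reaches k greater than 1.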
An alternative to sharing de-identified records is to keep all records private, but allow aggregate querying. An API for queries against the database is provided, and valid queries receive a response that summarizes the data with a count or average. But no response is given if it would violate certain guarantees of privacy. For example, we could allow an epidemiologist to ask, for each zip code, the percentage of people with cancer. For zip codes with at least n people a percentage would be given (with a small amount of random noise), but no response would be given for zip codes with fewer than n people.
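A minimal sketch of such a query interface follows. The function name, the record layout, the Gaussian noise, and the default noise scale are all assumptions chosen for illustration; the text does not specify how the noise is drawn.

```python
import random

def cancer_rate_by_zip(records, min_group_size, noise_scale=0.5, rng=random):
    """Noisy cancer percentage per zip code, or None for small groups.

    `records` is an iterable of (zip_code, has_cancer) pairs.
    """
    totals, cases = {}, {}
    for zip_code, has_cancer in records:
        totals[zip_code] = totals.get(zip_code, 0) + 1
        cases[zip_code] = cases.get(zip_code, 0) + int(has_cancer)

    answers = {}
    for zip_code, n in totals.items():
        if n < min_group_size:
            answers[zip_code] = None  # refuse: too few people to hide among
            continue
        percentage = 100.0 * cases[zip_code] / n
        noisy = percentage + rng.gauss(0.0, noise_scale)
        answers[zip_code] = min(100.0, max(0.0, noisy))  # keep a valid percentage
    return answers

# Hypothetical data: 20 people in one zip code, 3 in another.
records = [("94720", i < 5) for i in range(20)] + [("94105", True)] * 3
print(cancer_rate_by_zip(records, min_group_size=10))
```

The raw records never leave the function; callers see only the summary, and the small zip code gets no answer at all.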
Care must be taken to protect against de-identification using multiple queries. For example, if the query “average salary and number of employees of XYZ company age 30-40” gives the response [$81,234, 12] and the query “average salary and number of employees of XYZ company age 30-41” gives the response [$81,199, 13], and if we use LinkedIn to find the one 41-year-old at XYZ company, then we have successfully identified them, and can compute their exact salary (13 × $81,199 − 12 × $81,234 = $80,779, if the reported averages are exact), even though all the responses involved 12 or more people. The system must be carefully designed to protect against this, with a combination of limits on the queries that can be asked (perhaps only a predefined set of non-overlapping age ranges can be queried) and the precision of the results (perhaps both queries give the answer “about $81,000”).
A stronger guarantee is differential privacy, which assures that an attacker cannot use queries to re-identify any individual in the database, even if the attacker can make multiple queries and has access to separate linking databases. The query response employs a randomized algorithm that adds a small amount of noise to the result. Given a database D, any record in the database r, any query Q, and a possible response y to the query, we say that the database D has ε-differential privacy if the log probability of the response y varies by less than ε when we add the record r:
$$|\log P(Q(D) = y) - \log P(Q(D + r) = y)| < \epsilon$$
In other words, whether any one person decides to participate in the database or not makes no appreciable difference to the answers anyone can get, and therefore there is no privacy disincentive to participate. Many databases are designed to guarantee differential privacy.
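One common way to build such a randomized response is the Laplace mechanism, which is a standard construction and not one the text itself describes. The sketch below applies it to a counting query; the function names are illustrative. Because the noisy answer is continuous, the probabilities in the definition are read as densities, and the gap is bounded by ε with equality allowed, as in the usual statement of the definition.

```python
import math
import random

def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng=random):
    """Counting query answered with Laplace noise of scale 1/epsilon."""
    true_count = sum(1 for r in records if predicate(r))
    # Adding one record changes a count by at most 1, so scale 1/epsilon suffices.
    return true_count + laplace_noise(1.0 / epsilon, rng)

def log_density(y, true_count, epsilon):
    """Log density of the noisy answer y when the true count is `true_count`."""
    scale = 1.0 / epsilon
    return -abs(y - true_count) / scale - math.log(2.0 * scale)

# Adding one matching record moves the true count from 12 to 13.
epsilon = 0.5
for y in (0.0, 12.0, 12.4, 13.0, 40.0):
    gap = abs(log_density(y, 12, epsilon) - log_density(y, 13, epsilon))
    print(y, gap <= epsilon + 1e-12)
```

A smaller ε forces more noise (scale 1/ε), which is the usual trade between privacy and the accuracy of the answer.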
So far we have considered the issue of sharing de-identified data from a central database. An approach called federated learning (Konečný et al., 2016) has no central database; instead, users maintain their own local databases that keep their data private. However, they can share parameters of a machine learning model that is enhanced with their data, without the risk of revealing any of the private data. Imagine a speech understanding application that users can run locally on their phone. The application contains a baseline neural network, which is then improved by local training on the words that are heard on the user’s phone. Periodically, the owners of the application poll a subset of the users and ask them for the parameter values of their improved local network, but not for any of their raw data. The parameter values are combined together to form a new improved model which is then made available to all users, so that they all get the benefit of the training that is done by other users.
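The polling-and-combining step can be sketched as a plain average of the parameter vectors. This is a simplification under stated assumptions: the function names are hypothetical, each model is a flat list of floats, local training is a single gradient step, and every user counts equally (practical systems often weight users by how much data they hold).

```python
def local_update(weights, gradient, learning_rate=0.1):
    """One step of local training on a user's device (raw data stays local)."""
    return [w - learning_rate * g for w, g in zip(weights, gradient)]

def federated_average(client_weights):
    """Combine the polled users' parameters into one new global model."""
    n = len(client_weights)
    # zip(*...) walks the models in parallel, one parameter position at a time.
    return [sum(values) / n for values in zip(*client_weights)]

# Hypothetical round: two users start from the same baseline model.
baseline = [1.0, 2.0]
user_a = local_update(baseline, gradient=[10.0, 0.0])   # -> [0.0, 2.0]
user_b = local_update(baseline, gradient=[-10.0, 0.0])  # -> [2.0, 2.0]
print(federated_average([user_a, user_b]))
```

Only the parameter lists cross the network; the gradients here stand in for whatever each phone learned from its own audio.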
For this scheme to preserve privacy, we have to be able to guarantee that the model parameters shared by each user cannot be reverse-engineered. If we sent the raw parameters, there is a chance that an adversary inspecting them could deduce whether, say, a certain word had been heard by the user’s phone. One way to eliminate this risk is with secure aggregation (Bonawitz et al., 2017). The idea is that the central server doesn’t need to know the exact parameter value from each distributed user; it only needs to know the average value for each parameter, over all polled users. So each user can disguise their parameter values by adding a unique mask to each value; as long as the sum of the masks is zero, the central server will be able to compute the correct average. For instance, two users holding the values 2 and 4 with masks +5 and −5 would send 7 and −1, which still sum to 6. Details of the protocol make sure that it is efficient in terms of communication (less than half the bits transmitted correspond to masking), is robust to individual users failing to respond, and is secure in the face of adversarial users, eavesdroppers, or even an adversarial central server.
27.3.3 Fairness and bias
Machine learning is augmenting and sometimes replacing human decision-making in important situations: whose loan gets approved, to what neighborhoods police officers are deployed, who gets pretrial release or parole. But machine learning models can perpetuate
societal bias. Consider the example of an algorithm to predict whether criminal defendants
are likely to re-offend, and thus whether they should be released before trial. It could well be
that such a system picks up the racial or gender prejudices of human judges from the examples in the training set. Designers of machine learning systems have a moral responsibility to
ensure that their systems are in fact fair. In regulated domains such as credit, education, employment, and housing, they have a legal responsibility as well. But what is fairness? There are multiple criteria; here are six of the most commonly-used concepts:
• Individual fairness: A requirement that individuals are treated similarly to other similar individuals, regardless of what class they are in.
• Group fairness: A requirement that two classes are treated similarly, as measured by some summary statistic.
• Fairness through unawareness: If we delete the race and gender attributes from the data set, then it might seem that the system cannot discriminate on those attributes. Unfortunately, we know that machine learning models can predict latent variables (such as race and gender) given other correlated variables (such as zip code and occupation). Furthermore, deleting those attributes makes it impossible to verify equal opportunity or equal outcomes. Still, some countries (e.g., Germany) have chosen this approach for their demographic statistics (whether or not machine learning models are involved).
• Equal outcome: The idea that each demographic class gets the same results; they have demographic parity. For example, suppose we have to decide whether we should approve loan applications; the goal is to approve those applicants who will pay back the loan and not those who will default on the loan. Demographic parity says that both males and females should have the same percentage of loans approved. Note that this is a group fairness criterion that does nothing to ensure individual fairness; a well-qualified applicant might be denied and a poorly-qualified applicant might be approved, as long as the overall percentages are equal. Also, this approach favors redress of past biases over accuracy of prediction. If a man and a woman are equal in every way, except the woman receives a lower salary for the same job, should she be approved because she would be equal if not for historical biases, or should she be denied because the lower salary does in fact make her more likely to default?
• Equal opportunity: The idea that the people who truly have the ability to pay back the loan should have an equal chance of being correctly classified as such, regardless of their sex. This approach is also called “balance.” It can lead to unequal outcomes and ignores the effect of bias in the societal processes that produced the training data.
• Equal impact: People with similar likelihood to pay back the loan should have the same expected utility, regardless of the class they belong to. This goes beyond equal opportunity in that it considers both the benefits of a true prediction and the costs of a false prediction.
Let us examine how these issues play out in a particular context. COMPAS is a commercial system for recidivism (re-offense) scoring. It assigns to a defendant in a criminal case a risk score, which is then used by a judge to help make decisions: Is it safe to release the defendant before trial, or should they be held in jail? If convicted, how long should the sentence be? Should parole be granted? Given the significance of these decisions, the system has been the subject of intense scrutiny (Dressel and Farid, 2018).
COMPAS is designed to be well calibrated: all the individuals who are given the same score by the algorithm should have approximately the same probability of re-offending, regardless of race. For example, among all people that the model assigns a risk score of 7 out of 10, 60% of whites and 61% of blacks re-offend. The designers thus claim that it meets the desired fairness goal.
On the other hand, COMPAS does not achieve equal opportunity: the proportion of those who did not re-offend but were falsely rated as high-risk was 45% for blacks and 23% for whites. In the case State v. Loomis, where a judge relied on COMPAS to determine the sentence of the defendant, Loomis argued that the secretive inner workings of the algorithm violated his due process rights. Though the Wisconsin Supreme Court found that the sentence given would be no different without COMPAS in this case, it did issue warnings about the algorithm’s accuracy and risks to minority defendants. Other researchers have questioned whether it is appropriate to use algorithms in applications such as sentencing.
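The group criteria discussed above can be measured directly from a table of decisions and outcomes. The sketch below is illustrative only: the function names, the record layout, and the toy data are assumptions. It reports, per group, the share of positive decisions (the quantity demographic parity compares), the true positive rate (the quantity equal opportunity compares in the loan example), and the false positive rate (the quantity quoted for COMPAS).

```python
def rate(numerator, denominator):
    """Fraction, or None when the group has no cases of that kind."""
    return numerator / denominator if denominator else None

def group_rates(records):
    """Per-group rates from (group, predicted_positive, actually_positive)."""
    counts = {}
    for group, predicted, actual in records:
        predicted, actual = bool(predicted), bool(actual)
        c = counts.setdefault(group, {"n": 0, "pred": 0, "pos": 0, "tp": 0, "fp": 0})
        c["n"] += 1
        c["pred"] += predicted
        c["pos"] += actual
        c["tp"] += predicted and actual
        c["fp"] += predicted and not actual
    return {
        group: {
            "positive_rate": rate(c["pred"], c["n"]),
            "true_positive_rate": rate(c["tp"], c["pos"]),
            "false_positive_rate": rate(c["fp"], c["n"] - c["pos"]),
        }
        for group, c in counts.items()
    }

# Hypothetical decisions: (group, predicted positive?, actually positive?)
decisions = [
    ("A", 1, 1), ("A", 1, 1), ("A", 0, 1), ("A", 1, 0), ("A", 0, 0),
    ("B", 1, 1), ("B", 0, 1), ("B", 0, 0), ("B", 0, 0),
]
for group, rates in group_rates(decisions).items():
    print(group, rates)
```

Comparing one row of this table across groups tests one criterion; a system can match on one row and still differ on the others.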
We could hope for an algorithm that is both well calibrated and equal opportunity, but, as Kleinberg et al. (2016) show, that is impossible. If the base classes are different, then any algorithm that is well calibrated will necessarily not provide equal opportunity, and vice versa. How can we weigh the two criteria? Equal impact is one possibility. In the case of COMPAS, this means weighing the negative utility of defendants being falsely classified as high risk and losing their freedom, versus the cost to society of an additional crime being committed, and finding the point that optimizes the tradeoff. This is complicated because there are multiple costs to consider. There are individual costs: a defendant who is wrongfully held in jail suffers a loss, as does the victim of a defendant who was wrongfully released and re-offends. But beyond that there are group costs: everyone has a certain fear that they will be wrongfully jailed, or will be the victim of a crime, and all taxpayers contribute to the costs of jails and courts. If we give value to those fears and costs in proportion to the size of a group, then utility for the majority may come at the expense of a minority.
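The tension between calibration and equal opportunity noted above can be seen in a simplified setting. This is not the general argument of Kleinberg et al.; it is a confusion-matrix identity for a binary high-risk/low-risk classifier, under the assumption that "well calibrated" is reduced to equal precision (the fraction of people rated high-risk who do re-offend). With base rate p, precision PPV, and true positive rate TPR, the false positive rate is forced to be FPR = p/(1 − p) · (1 − PPV)/PPV · TPR. The function name and the numbers below are illustrative.

```python
def implied_false_positive_rate(base_rate, precision, true_positive_rate):
    """False positive rate forced by the other three quantities.

    Follows from FP = TP * (1 - precision) / precision, with
    TP = true_positive_rate * base_rate and FP = FPR * (1 - base_rate),
    both taken as fractions of the group.
    """
    odds = base_rate / (1.0 - base_rate)
    return odds * (1.0 - precision) / precision * true_positive_rate

# Two hypothetical groups with the same precision and true positive rate
# but different base rates of re-offense.
for base_rate in (0.5, 0.3):
    fpr = implied_false_positive_rate(base_rate, precision=0.6, true_positive_rate=0.7)
    print(base_rate, round(fpr, 4))
```

With equal precision and equal true positive rates, the group with the higher base rate necessarily has the higher false positive rate (about 0.47 versus 0.2 here), so the error rates cannot all be equalized at once unless the base rates match.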
Another problem with the whole idea of recidivism scoring, regardless of the model used, is that we don’t have unbiased ground truth data. The data does not tell us who has committed a crime—all we know is who has been convicted of a crime. If the arresting officers, judge, or jury is biased, then the data will be biased. If more officers patrol some locations, then
the data will be biased against people in those locations. Only defendants who are released
are candidates to recommit, so if the judges making the release decisions are biased, the data
may be biased. If you assume that behind the biased data set there is an underlying, unknown,
unbiased data set which has been corrupted by an agent with biases, then there are techniques
to recover an approximation to the unbiased data. Jiang and Nachum (2019) describe various scenarios and the techniques involved.
One more risk is that machine learning can be used to justify bias. If decisions are made by a biased human after consulting with a machine learning system, the human can say “here is how my interpretation of the model supports my decision, so you shouldn’t question my
decision.” But other interpretations could lead to an opposite decision.
Sometimes fairness means that we should reconsider the objective function, not the data
or the algorithm. For example, in making job hiring decisions, if the objective is to hire candidates with the best qualifications in hand, we risk unfairly rewarding those who have
had advantageous educational opportunities throughout their lives, thereby enforcing class
boundaries. But if the objective is to hire candidates with the best ability to learn on the job, we have a better chance to cut across class boundaries and choose from a broader pool. Many
companies have programs designed for such applicants, and find that after a year of training,
the employees hired this way do as well as the traditional candidates. Similarly, just 18% of computer science graduates in the U.S. are women, but some schools, such as Harvey Mudd College, have achieved 50% parity with an approach that is focused on encouraging and retaining those who start the computer science program, especially those who start with less
programming experience.
A final complication is deciding which classes deserve protection. In the U.S., the Fair Housing Act recognized seven protected classes: race, color, religion, national origin, sex, disability, and familial status. Other local, state, and federal laws recognize other classes, including sexual orientation, and pregnancy, marital, and veteran status. Is it fair that these classes count for some laws and not others? International human rights law, which encompasses a broad set of protected classes, is a potential framework to harmonize protections across various groups.
Even in the absence of societal bias, sample size disparity can lead to biased results. In most data sets there will be fewer training examples of minority class individuals than of majority class individuals. Machine learning algorithms give better accuracy with more training data, so that means that members of minority classes will experience lower accuracy. For example, Buolamwini and Gebru (2018) examined a computer vision gender identification service, and found that it had near-perfect accuracy for light-skinned males, and a 33% error rate for dark-skinned females. A constrained model may not be able to simultaneously fit both the majority and minority class: a linear regression model might minimize average error by fitting just the majority class, and in an SVM model, the support vectors might all correspond to majority class members.
Bias can also come into play in the software development process (whether or not the software involves machine learning). Engineers who are debugging a system are more likely to notice and fix those problems that are applicable to themselves. For example, it is difficult
to notice that a user interface design won’t work for colorblind people unless you are in fact
colorblind, or that an Urdu language translation is faulty if you don’t speak Urdu.
How can we defend against these biases? First, understand the limits of the data you are using. It has been suggested that data sets (Gebru et al., 2018; Hind et al., 2018) and models (Mitchell et al., 2019) should come with annotations: declarations of provenance, security, conformity, and fitness for use. This is similar to the data sheets that accompany electronic components such as resistors; they allow designers to decide what components to use. In addition to the data sheets, it is important to train engineers to be aware of issues of fairness and bias, both in school and with on-the-job training. Having a diversity of engineers from different backgrounds makes it easier for them to notice problems in the data or models. A study by the AI Now Institute (West et al., 2019) found that only 18% of authors at leading AI conferences and 20% of AI professors are women. Black AI workers are at less than 4%. Rates at industry research labs are similar. Diversity could be increased by programs earlier in the pipeline (in college or high school) and by greater awareness at the professional level. Joy Buolamwini founded the Algorithmic Justice League to raise awareness of this issue and develop practices for accountability.
A second idea is to de-bias the data (Zemel et al., 2013). We could over-sample from minority classes to defend against sample size disparity. Techniques such as SMOTE, the synthetic minority over-sampling technique (Chawla et al., 2002) or ADASYN, the adaptive synthetic sampling approach for imbalanced learning (He et al., 2008), provide principled ways of oversampling. We could examine the provenance of data and, for example, eliminate examples from judges who have exhibited bias in their past court cases. Some analysts object to the idea of discarding data, and instead would recommend building a hierarchical model of the data that includes sources of bias, so they can be modeled and compensated for. Google and NeurIPS have attempted to raise awareness of this issue by sponsoring the Inclusive Images Competition, in which competitors train a network on a data set of labeled images collected in North America and Europe, and then test it on images taken from all around the world. The issue is that given this data set, it is easy to apply the label “bride” to a woman in a standard Western wedding dress, but harder to recognize traditional African and Indian matrimonial dress.
A third idea is to invent new machine learning models and algorithms that are more resistant to bias; and the final idea is to let a system make initial recommendations that may be biased, but then train a second system to de-bias the recommendations of the first one. Bellamy et al. (2018) introduced the IBM AI FAIRNESS 360 system, which provides a framework for all of these ideas. We expect there will be increased use of tools like this in the future.
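The over-sampling idea mentioned above can be sketched in its simplest form: duplicate randomly chosen minority examples until every group is as large as the largest one. This is plain random over-sampling with replacement, not SMOTE or ADASYN, which create new synthetic examples instead of copies. The function name and the `group_of` parameter are hypothetical.

```python
import random

def oversample_to_balance(examples, group_of, rng=random):
    """Return a list in which every group is as large as the largest group."""
    by_group = {}
    for example in examples:
        by_group.setdefault(group_of(example), []).append(example)
    if not by_group:
        return []

    target = max(len(members) for members in by_group.values())
    balanced = []
    for members in by_group.values():
        balanced.extend(members)  # keep every original example
        # Pad smaller groups with random duplicates (sampling with replacement).
        balanced.extend(rng.choice(members) for _ in range(target - len(members)))
    return balanced

# Hypothetical data set: 8 majority examples and 2 minority examples.
examples = [("majority", i) for i in range(8)] + [("minority", i) for i in range(2)]
balanced = oversample_to_balance(examples, group_of=lambda ex: ex[0])
print(len(balanced))
```

Duplicates add no new information about the minority group, so this helps a learner weight the groups evenly but does not replace collecting more representative data.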
How do you make sure that the systems you build will be fair? A set of best practices has
been emerging (although they are not always followed):
• Make sure that the software engineers talk with social scientists and domain experts to understand the issues and perspectives, and consider fairness from the start.
• Create an environment that fosters the development of a diverse pool of software engineers that are representative of society.
• Define what groups your system will support: different language speakers, different age groups, different abilities with sight and hearing, etc.
• Optimize for an objective function that incorporates fairness.
• Examine your data for prejudice and for correlations between protected attributes and other attributes.
• Understand how any human annotation of data is done, design goals for annotation accuracy, and verify that the goals are met.
• Don’t just track overall metrics for your system; make sure you track metrics for subgroups that might be victims of bias.
• Include system tests that reflect the experience of minority group users.
• Have a feedback loop so that when fairness problems come up, they are dealt with.
27.3.4 Trust and transparency
It is one challenge to make an AI system accurate, fair, safe, and secure; a different challenge to convince everyone else that you have done so. People need to be able to trust the systems they use. A PwC survey in 2017 found that 76% of businesses were slowing the adoption of AI because of trustworthiness concerns. In Section 19.9.4 we covered some of the engineering approaches to trust; here we discuss the policy issues.
To earn trust, any engineered system must go through a verification and validation (V&V) process. Verification means that the product satisfies the specifications. Validation means ensuring that the specifications actually meet the needs of the user and other affected parties. We have an elaborate V&V methodology for engineering in general, and for traditional software development done by human coders; much of that is applicable to AI systems. But machine learning systems are different and demand a different V&V process, which has not yet been fully developed. We need to verify the data that these systems learn from; we need to verify the accuracy and fairness of the results, even in the face of uncertainty that makes an exact result unknowable; and we need to verify that adversaries cannot unduly influence the model, nor steal information by querying the resulting model.
One instrument of trust is certification; for example, Underwriters Laboratories (UL) was founded in 1894 at a time when consumers were apprehensive about the risks of electric power. UL certification of appliances gave consumers increased trust, and in fact UL is now considering entering the business of product testing and certification for AI.
Other industries have long had safety standards. For example, ISO 26262 is an international standard for the safety of automobiles, describing how to develop, produce, operate, and service vehicles in a safe way. The AI industry is not yet at this level of clarity, although there are some frameworks in progress, such as IEEE P7001, a standard defining ethical design for artificial intelligence and autonomous systems (Bryson and Winfield, 2017). There is ongoing debate about what kind of certification is necessary, and to what extent it should be done by the government, by professional organizations like IEEE, by independent certifiers such as UL, or through self-regulation by the product companies.
Another aspect of trust is transparency: consumers want to know what is going on inside a system, and that the system is not working against them, whether due to intentional malice, an unintentional bug, or pervasive societal bias that is recapitulated by the system. In some cases this transparency is delivered directly to the consumer. In other cases there are intellectual property issues that keep some aspects of the system hidden to consumers, but open to regulators and certification agencies.
When an AI system turns you down for a loan, you deserve an explanation. In Europe, the GDPR enforces this for you. An AI system that can explain itself is called explainable AI (XAI). A good explanation has several properties: it should be understandable and convincing to the user, it should accurately reflect the reasoning of the system, it should be complete, and it should be specific in that different users with different conditions or different outcomes should get different explanations.
It is quite easy to give a decision algorithm access to its own deliberative processes,
simply by recording them and making them available as data structures. This means that machines may eventually be able to give better explanations of their decisions than humans can. Moreover, we can take steps to certify that the machine’s explanations are not deceptions (intentional or self-deception), something that is more difficult with a human.
An explanation is a helpful but not sufficient ingredient to trust. One issue is that explanations are not decisions: they are stories about decisions. As discussed in Section 19.9.4, we say that a system is interpretable if we can inspect the source code of the model and see what it is doing, and we say it is explainable if we can make up a story about what it is doing, even if the system itself is an uninterpretable black box. To explain an uninterpretable black box, we need to build, debug, and test a separate explanation system, and make sure it is in sync with the original system. And because humans love a good story, we are all too willing to be swayed by an explanation that sounds good. Take any political controversy of the day, and you can always find two so-called experts with diametrically opposed explanations, both of which are internally consistent.
A final issue is that an explanation about one case does not give you a summary over
other cases. If the bank explains, “Sorry, you didn’t get the loan because you have a history
of previous financial problems,” you don’t know if that explanation is accurate or if the bank is
secretly biased against you for some reason. In this case, you require not just an explanation, but also an audit of past decisions, with aggregated statistics across various demographic
groups, to see if their approval rates are balanced.
Part of transparency is knowing whether you are interacting with an AI system or a human. Toby Walsh (2015) proposed that “an autonomous system should be designed so that it is unlikely to be mistaken for anything besides an autonomous system, and should identify itself at the start of any interaction.” He called this the “red flag” law, in honor of the UK’s 1865 Locomotive Act, which required any motorized vehicle to have a person with a red flag walk in front of it, to signal the oncoming danger.
In 2019, California enacted a law stating that “It shall be unlawful for any person to use a bot to communicate or interact with another person in California online, with the intent to mislead the other person about its artificial identity.”
27.3.5 The future of work
From the first agricultural revolution (10,000 BCE) to the industrial revolution (late 18th century) to the green revolution in food production (1950s), new technologies have changed the way humanity works and lives. A primary concern arising from the advance of AI is that human labor will become obsolete. Aristotle, in Book I of his Politics, presents the main point quite clearly:
For if every instrument could accomplish its own work, obeying or anticipating the will of others ....if, in like manner, the shuttle would weave and the plectrum touch the lyre without a hand to guide them, chief workmen would not want servants, nor masters slaves.
Everyone agrees with Aristotle’s observation that there is an immediate reduction in employment when an employer finds a mechanical method to perform work previously done by a person. The issue is whether the so-called compensation effects that ensue (and that tend to increase employment) will eventually make up for this reduction. The primary compensation effect is the increase in overall wealth from greater productivity, which leads in turn to greater demand for goods and tends to increase employment. For example, PwC (Rao and Verweij, 2017) predicts that AI will contribute $15 trillion annually to global GDP by 2030. The healthcare and automotive/transportation industries stand to gain the most in the short term. However, the advantages of automation have not yet taken over in our economy: the current rate of growth in labor productivity is actually below historical standards. Brynjolfsson et al. (2018) attempt to explain this paradox by suggesting that the lag between the development of basic technology and its implementation in the economy is longer than commonly supposed.
Technological innovations have historically put some people out of work. Weavers were replaced by automated looms in the 1810s, leading to the Luddite protests. The Luddites were not against technology per se; they just wanted the machines to be used by skilled workers paid a good wage to make high-quality goods, rather than by unskilled workers to make poor-quality goods at low wages. The global destruction of jobs in the 1930s led John Maynard Keynes to coin the term technological unemployment. In both cases, and several others, employment levels eventually recovered.
The mainstream economic view for most of the 20th century was that technological unemployment was at most a short-term phenomenon. Increased productivity would always lead to increased wealth and increased demand, and thus net job growth. A commonly cited example is that of bank tellers: although ATMs replaced humans in the job of counting out cash for withdrawals, that made it cheaper to operate a bank branch, so the number of branches increased, leading to more bank employees overall. The nature of the work also changed, becoming less routine and requiring more advanced business skills. The net effect of automation seems to be in eliminating tasks rather than jobs.
The majority of commenters predict that the same will hold true with AI technology, at least in the short run. Gartner, McKinsey, Forbes, the World Economic Forum, and the Pew Research Center each released reports in 2018 predicting a net increase in jobs due to AI-driven automation. But some analysts think that this time around, things will be different. In 2019, IBM predicted that 120 million workers would need retraining due to automation by 2022, and Oxford Economics predicted that 20 million manufacturing jobs could be lost to automation by 2030.
Frey and Osborne (2017) survey 702 different occupations, and estimate that 47% of them
are at risk of being automated, meaning that at least some of the tasks in the occupation can
be performed by machine. For example, almost 3% of the workforce in the U.S. are vehicle drivers, and in some districts, as much as 15% of the male workforce are drivers. As we saw in Chapter 26, the task of driving is likely to be eliminated by driverless cars/trucks/buses/taxis.
It is important to distinguish between occupations and the tasks within those occupations.
McKinsey estimates that only 5% of occupations are fully automatable, but that 60% of occupations can have about 30% of their tasks automated. For example, future truck drivers will spend less time holding the steering wheel and more time making sure that the goods are picked up and delivered properly; serving as customer service representatives and salespeople at either end of the journey; and perhaps managing convoys of, say, three robotic trucks. Replacing three drivers with one convoy manager implies a net loss in employment, but if transportation costs decrease, there will be more demand, which wins some of the jobs back—but perhaps not all of them. As another example, despite many advances in applying machine learning to the problem of medical imaging, radiologists have so far been augmented, not replaced, by these tools. Ultimately, there is a choice of how to make use of automation: do we want to focus on cutting cost, and thus see job loss as a positive; or do we want to focus on improving quality, making life better for the worker and the customer?
It is difficult to predict exact timelines for automation, but currently, and for the next few years, the emphasis is on automation of structured analytical tasks, such as reading x-ray images, customer relationship management (e.g., bots that automatically sort customer complaints and respond with suggested remedies), and business process automation that combines text documents and structured data to make business decisions and improve workflow. Over time, we will see more automation with physical robots, first in controlled warehouse environments, then in more uncertain environments, building to a significant portion of the marketplace by around 2030.
As populations in developed countries grow older, the ratio between workers and retirees
changes. In 2015 there were fewer than 30 retirees per 100 workers; by 2050 there may be over 60 per 100 workers. Care for the elderly will be an increasingly important role, one that can partially be filled by AI. Moreover, if we want to maintain the current standard of living, it will also be necessary to make the remaining workers more productive; automation seems
like the best opportunity to do that.
Even if automation has a multi-trillion-dollar net positive impact, there may still be problems due to the pace of change. Consider how change came to the farming industry: in 1900, over 40% of the U.S. workforce was in agriculture, but by 2000 that had fallen to 2%.³ That is a huge disruption in the way we work, but it happened over a period of 100 years, and thus across generations, not in the lifetime of one worker.
3 In 2010, although only 2% of the U.S. workforce were actual farmers, over 25% of the population (80 million people) played the FARMVILLE game at least once.
Workers whose jobs are automated away this decade may have to retrain for a new profession within a few years—and then perhaps see their new profession automated and face
yet another retraining period. Some may be happy to leave their old profession—we see that
as the economy improves, trucking companies need to offer new incentives to hire enough
drivers—but workers will be apprehensive about their new roles. To handle this, we as a society need to provide lifelong education, perhaps relying in part on online education driven by artificial intelligence (Martin, 2012). Bessen (2015) argues that workers will not see increases in income until they are trained to implement the new technologies, a process that takes time.
Technology tends to magnify income inequality. In an information economy marked by high-bandwidth global communication and zero-marginal-cost replication of intellectual property (what Frank and Cook (1996) call the “Winner-Take-All Society”), rewards tend to be concentrated. If farmer Ali is 10% better than farmer Bo, then Ali gets about 10% more
income: Ali can charge slightly more for superior goods, but there is a limit on how much can be produced on the land, and how far it can be shipped. But if software app developer
Cary is 10% better than Dana, it may be that Cary ends up with 99% of the global market. AI increases the pace of technological innovation and thus contributes to this overall trend, but AI also holds the promise of allowing us to take some time off and let our automated agents handle things for a while. Tim Ferriss (2007) recommends using automation and outsourcing to achieve a four-hour work week.
Before the industrial revolution, people worked as farmers or in other crafts, but didn’t report to a job at a place of work and put in hours for an employer. But today, most adults
in developed countries do just that, and the job serves three purposes: it fuels the production
of the goods that society needs to flourish, it provides the income that the worker needs to live, and it gives the worker a sense of purpose, accomplishment, and social integration. With increasing automation, it may be that these three purposes become disaggregated—society’s
needs will be served in part by automation, and in the long run, individuals will get their sense of purpose from contributions other than work. Their income needs can be served by
social policies that include a combination of free or inexpensive access to social services and education, portable health care, retirement, and education accounts, progressive tax rates,
earned income tax credits, negative income tax, or universal basic income.
27.3.6 Robot rights
The question of robot consciousness, discussed in Section 27.2, is critical to the question of what rights, if any, robots should have. If they have no consciousness, no qualia, then few would argue that they deserve rights.
But if robots can feel pain, if they can dread death, if they are considered “persons,” then the argument can be made (e.g., by Sparrow (2004)) that they have rights and deserve to have their rights recognized, just as slaves, women, and other historically oppressed groups have fought to have their rights recognized. The issue of robot personhood is often considered in fiction: from Pygmalion to Coppélia to Pinocchio to the movies AI and Bicentennial Man, we have the legend of a doll/robot coming to life and striving to be accepted as a human with human rights. In real life, Saudi Arabia made headlines by giving honorary citizenship to Sophia, a human-looking puppet capable of speaking preprogrammed lines.
If robots have rights, then they should not be enslaved, and there is a question of whether reprogramming them would be a kind of enslavement. Another ethical issue involves voting rights: a rich person could buy thousands of robots and program them to cast thousands of votes—should those votes count? If a robot clones itself, can they both vote? What is the boundary between ballot stuffing and exercising free will, and when does robotic voting violate the “one person, one vote” principle?
Ernie Davis argues for avoiding the dilemmas of robot consciousness by never building robots that could possibly be considered conscious. This argument was previously made by Joseph Weizenbaum in his book Computer Power and Human Reason (1976), and before that by Julien de La Mettrie in L’Homme Machine (1748). Robots are tools that we create, to do the tasks we direct them to do, and if we grant them personhood, we are just declining to take responsibility for the actions of our own property: “I’m not at fault for my self-driving car crash—the car did it itself.”
This issue takes a different turn if we develop human-robot hybrids. Of course we already have humans enhanced by technology such as contact lenses, pacemakers, and artificial hips. But adding computational prostheses may blur the lines between human and machine.
27.3.7 AI Safety
Almost any technology has the potential to cause harm in the wrong hands, but with AI and robotics, the hands might be operating on their own. Countless science fiction stories have warned about robots or cyborgs running amok. Early examples include Mary Shelley’s Frankenstein, or the Modern Prometheus (1818) and Karel Čapek’s play R.U.R. (1920), in which robots conquer the world. In movies, we have The Terminator (1984) and The Matrix (1999), which both feature robots trying to eliminate humans—the robopocalypse (Wilson, 2011). Perhaps robots are so often the villains because they represent the unknown, just like the witches and ghosts of tales from earlier eras. We can hope that a robot that is smart enough to figure out how to terminate the human race is also smart enough to figure out that that was not the intended utility function; but in building intelligent systems, we want to rely not just on hope, but on a design process with guarantees of safety.
It would be unethical to distribute an unsafe AI agent. We require our agents to avoid accidents, to be resistant to adversarial attacks and malicious abuse, and in general to cause benefits, not harms. That is especially true as AI agents are deployed in safety-critical applications, such as driving cars, controlling robots in dangerous factory or construction settings, and making life-or-death medical decisions.
There is a long history of safety engineering in traditional engineering fields. We know how to build bridges, airplanes, spacecraft, and power plants that are designed up front to behave safely even when components of the system fail. The first technique is failure modes and effect analysis (FMEA): analysts consider each component of the system, and imagine every possible way the component could go wrong (for example, what if this bolt were to snap?), drawing on past experience and on calculations based on the physical properties of the component. Then the analysts work forward to see what would result from the failure. If the result is severe (a section of the bridge could fall down) then the analysts alter the design to mitigate the failure. (With this additional cross-member, the bridge can survive the failure of any 5 bolts; with this backup server, the online service can survive a tsunami taking out the primary server.) The technique of fault tree analysis (FTA) is used to make these determinations: analysts build an AND/OR tree of possible failures and assign probabilities to each root cause, allowing for calculations of overall failure probability. These techniques can and should be applied to all safety-critical engineered systems, including AI systems.
The field of software engineering is aimed at producing reliable software, but the emphasis has historically been on correctness, not safety. Correctness means that the software faithfully implements the specification. But safety goes beyond that to insist that the specification has considered any feasible failure modes, and is designed to degrade gracefully even in the face of unforeseen failures. For example, the software for a self-driving car wouldn’t be considered safe unless it can handle unusual situations. What if the power to the main computer dies? A safe system will have a backup computer with a separate power supply. What if a tire is punctured at high speed? A safe system will have tested for this, and will have software to correct for the resulting loss of control.
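The fault tree calculation described earlier in this section can be sketched in a few lines of Python. This is a minimal illustration, not a standard FTA tool: the `Event` and `Gate` classes, the `failure_probability` function, and every probability below are hypothetical, and the arithmetic assumes that root causes fail independently and that each root cause appears only once in the tree.

```python
from dataclasses import dataclass
from math import prod
from typing import List, Union


@dataclass
class Event:
    """A root cause (leaf) with an assumed probability of failure."""
    name: str
    probability: float


@dataclass
class Gate:
    """An internal node: 'AND' fails only if all children fail, 'OR' if any child fails."""
    kind: str
    children: List[Union["Gate", Event]]


def failure_probability(node: Union[Gate, Event]) -> float:
    """Probability that the failure at this node occurs, assuming independent root causes."""
    if isinstance(node, Event):
        return node.probability
    child_probs = [failure_probability(child) for child in node.children]
    if node.kind == "AND":
        # All children must fail: multiply their probabilities.
        return prod(child_probs)
    if node.kind == "OR":
        # At least one child fails: one minus the chance that none fail.
        return 1.0 - prod(1.0 - p for p in child_probs)
    raise ValueError(f"unknown gate kind: {node.kind}")


# Hypothetical tree: the online service goes down if both servers fail,
# or if the shared network link fails. All numbers are made up.
servers_down = Gate("AND", [Event("primary server", 0.1), Event("backup server", 0.1)])
service_down = Gate("OR", [servers_down, Event("network link", 0.01)])

print(f"P(service down) = {failure_probability(service_down):.4f}")
```

With these made-up numbers the backup server cuts the server-related risk from 0.1 to 0.01, after which the network link contributes about as much risk as the servers do. A real analysis would also have to account for correlated failures (the tsunami that takes out both servers), which the independence assumption ignores.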
An agent designed as a utility maximizer, or as a goal achiever, can be unsafe if it has the wrong objective function. Suppose we give a robot the task of fetching a coffee from the kitchen. We might run into trouble with unintended side effects—the robot might rush to accomplish the goal, knocking over lamps and tables along the way. In testing, we might notice this kind of behavior and modify the utility function to penalize such damage, but it is difficult for the designers and testers to anticipate all possible side effects ahead of time.
One way to deal with this is to design a robot to have low impact (Armstrong and Levinstein, 2017): instead of just maximizing utility, maximize the utility minus a weighted summary of all changes to the state of the world. In this way, all other things being equal, the robot prefers not to change those things whose effect on utility is unknown; so it avoids knocking over the lamp not because it knows specifically that knocking the lamp will cause it to fall over and break, but because it knows in general that disruption might be bad. This can be seen as a version of the physician’s creed “first, do no harm,” or as an analog to regularization in machine learning: we want a policy that achieves goals, but we prefer policies that take smooth, low-impact actions to get there. The trick is how to measure impact. It is not acceptable to knock over a fragile lamp, but perfectly fine if the air molecules in the room are disturbed a little, or if some bacteria in the room are inadvertently killed. It is certainly not acceptable to harm pets and humans in the room. We need to make sure that the robot knows the differences between these cases (and many subtle cases in between) through a combination of explicit programming, machine learning over time, and rigorous testing.
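The low-impact objective, utility minus a weighted summary of changes to the world, can be written down directly. The sketch below is only an illustration under stated assumptions: the function `low_impact_score`, the feature names, the weights, and the two candidate plans are all hypothetical and are not taken from Armstrong and Levinstein's formulation.

```python
from typing import Dict


def low_impact_score(utility: float,
                     changes: Dict[str, float],
                     weights: Dict[str, float],
                     default_weight: float = 1.0) -> float:
    """Utility minus a weighted sum of the magnitudes of changes to the world state.

    Features missing from `weights` get `default_weight`, so changes whose
    effect on utility is unknown are penalized rather than ignored.
    """
    penalty = sum(weights.get(feature, default_weight) * abs(delta)
                  for feature, delta in changes.items())
    return utility - penalty


# Hypothetical weights: disturbing the air costs nothing, breaking a lamp costs a lot.
weights = {"air_disturbed": 0.0, "lamp_broken": 50.0}

# Two hypothetical plans for fetching the coffee.
plans = {
    "rush":    {"utility": 10.0, "changes": {"air_disturbed": 3.0, "lamp_broken": 1.0}},
    "careful": {"utility": 9.0,  "changes": {"air_disturbed": 2.0}},
}

scores = {name: low_impact_score(p["utility"], p["changes"], weights)
          for name, p in plans.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Here the rushed plan has the higher raw utility but loses once the broken lamp is penalized. The hard part, as the text notes, is hidden in `weights`: deciding which changes matter and by how much.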
Utility functions can go wrong due to externalities, the word used by economists for factors that are outside of what is measured and paid for. The world suffers when greenhouse gases are considered as externalities—companies and countries are not penalized for producing them, and as a result everyone suffers. Ecologist Garrett Hardin (1968) called the exploitation of shared resources the tragedy of the commons. We can mitigate the tragedy by internalizing the externalities—making them part of the utility function, for example with a carbon tax—or by using the design principles that economist Elinor Ostrom identified as being used by local people throughout the world for centuries (work that won her the Nobel Prize in Economics in 2009):
• Clearly define the shared resource and who has access.
• Adapt to local conditions.
• Allow all parties to participate in decisions.
• Monitor the resource with accountable monitors.
• Sanctions, proportional to the severity of the violation.
• Easy conflict resolution procedures.
• Hierarchical control for large shared resources.
Victoria Krakovna (2018) has cataloged examples of AI agents that have gamed the system, figuring out how to maximize utility without actually solving the problem that their designers intended them to solve. To the designers this looks like cheating, but to the agents, they are just doing their job. Some agents took advantage of bugs in the simulation (such as floating point overflow bugs) to propose solutions that would not work once the bug was fixed. Several agents in video games discovered ways to crash or pause the game when they were about to lose, thus avoiding a penalty. And in a specification where crashing the game was penalized, one agent learned to use up just enough of the game’s memory so that when it was the opponent’s turn, it would run out of memory and crash the game. Finally, a genetic algorithm operating in a simulated world was supposed to evolve fast-moving creatures but in fact produced creatures that were enormously tall and moved fast by falling over.
Designers of agents should be aware of these kinds of specification failures and take steps to avoid them. To help them do that, Krakovna was part of the team that released the AI Safety Gridworlds environments (Leike et al., 2017), which allow designers to test how well their agents perform.
The moral is that we need to be very careful in specifying what we want, because with utility maximizers we get what we actually asked for. The value alignment problem is the problem of making sure that what we ask for is what we really want; it is also known as the King Midas problem, as discussed on page 33. We run into trouble when a utility function fails to capture background societal norms about acceptable behavior. For example, a human who is hired to clean floors, when faced with a messy person who repeatedly tracks in dirt, knows that it is acceptable to politely ask the person to be more careful, but it is not acceptable to kidnap or incapacitate said person.
A robotic cleaner needs to know these things too, either through explicit programming or
by learning from observation. Trying to write down all the rules so that the robot always does
the right thing is almost certainly hopeless. We have been trying to write loophole-free tax laws for several thousand years without success. Better to make the robot want to pay taxes,
so to speak, than to try to make rules to force it to do so when it really wants to do something else. A sufficiently intelligent robot will find a way to do something else.
Robots can learn to conform better with human preferences by observing human behavior. This is clearly related to the notion of apprenticeship learning (Section 22.6). The robot
may learn a policy that directly suggests what actions to take in what situations; this is often
a straightforward supervised learning problem if the environment is observable. For example, a robot can watch a human playing chess: each state-action pair is an example for the learning process.
Unfortunately, this form of imitation learning means that the robot will repeat human mistakes. Instead, the robot can apply inverse reinforcement learning to discover the utility function that the humans must be operating under. Watching even terrible chess players is probably enough for the robot to learn the objective of the game. Given just this information, the robot can then go on to exceed human performance—as, for example, ALPHAZERO did in chess—by computing optimal or near-optimal policies from the objective. This approach works not just in board games, but in real-world physical tasks such as helicopter aerobatics (Coates et al., 2009).
In more complex settings involving, for example, social interactions with humans, it is very unlikely that the robot will converge to exact and correct knowledge of each human’s individual preferences. (After all, many humans never quite learn what makes other humans tick, despite a lifetime of experience, and many of us are unsure of our own preferences too.) It will be necessary, therefore, for machines to function appropriately when they are uncertain about human preferences. In Chapter 18, we introduced assistance games, which capture
exactly this situation.
Solutions to assistance games include acting cautiously, so as not to
disturb aspects of the world that the human might care about, and asking questions. For example, the robot could ask whether turning the oceans into sulphuric acid is an acceptable
solution to global warming before it puts the plan into effect.
In dealing with humans, a robot solving an assistance game must accommodate human
imperfections. If the robot asks permission, the human may give it, not foreseeing that the
robot’s proposal is in fact catastrophic in the long term.
Moreover, humans do not have
complete introspective access to their true utility function, and they don’t always act in a way
that is compatible with it. Humans sometimes lie or cheat, or do things they know are wrong.
They sometimes take self-destructive actions like overeating or abusing drugs. AI systems need not learn to adopt these problematic tendencies, but they must understand that they exist when interpreting human behavior to get at the underlying human preferences.
Despite this toolbox of safeguards, there is a fear, expressed by prominent technologists such as Bill Gates and Elon Musk and scientists such as Stephen Hawking and Martin Rees, that AI could evolve out of control. They warn that we have no experience controlling powerful nonhuman entities with super-human capabilities. However, that’s not quite true; we have centuries of experience with nations and corporations, nonhuman entities that aggregate the power of thousands or millions of people. Our record of controlling these entities is not very encouraging: nations produce periodic convulsions called wars that kill tens of millions of human beings, and corporations are partly responsible for global warming and our inability to confront it.
AI systems may present much greater problems than nations and corporations because of their potential to self-improve at a rapid pace, as considered by I. J. Good (1965b):
Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines: there would then unquestionably be an “intelligence explosion,” and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make, provided that the machine is docile enough to tell us how to keep it under control.
Good’s “intelligence explosion” has also been called the technological singularity by math-
ematics professor and science fiction author Vernor Vinge, who wrote in 1993: “Within thirty
years, we will have the technological means to create superhuman intelligence. Shortly after, the human era will be ended.”
In 2017, inventor and futurist Ray Kurzweil predicted the
singularity would appear by 2045, which means it got 2 years closer in 24 years. (At that
rate, only 336 years to go!) Vinge and Kurzweil correctly note that technological progress on many measures is growing exponentially at present.
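The figure of 336 years comes from a linear extrapolation that is easy to check. In 1993 the singularity was said to be at most 30 years away; in 2017 it was 2045 − 2017 = 28 years away, so the predicted distance shrank by 2 years over 24 calendar years. The helper below is only a sketch of that arithmetic; its name and parameters are made up for illustration.

```python
def years_until_arrival(first_year: int, first_horizon: float,
                        second_year: int, second_horizon: float) -> float:
    """Calendar years until the predicted horizon reaches zero, if it keeps
    shrinking at the rate observed between the two predictions."""
    elapsed = second_year - first_year        # calendar years between predictions
    gained = first_horizon - second_horizon   # how much closer the event got
    if gained <= 0:
        raise ValueError("the predicted event is not getting any closer")
    return second_horizon * elapsed / gained


vinge_horizon = 30              # "within thirty years," stated in 1993
kurzweil_horizon = 2045 - 2017  # 28 years, stated in 2017

print(years_until_arrival(1993, vinge_horizon, 2017, kurzweil_horizon))  # 336.0
```

The remaining 28 years of predicted distance, closing at 2 years per 24 calendar years, take 28 × 12 = 336 years.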
It is, however, quite a leap to extrapolate all the way from the rapidly decreasing cost
of computation to a singularity.
So far, every technology has followed an S-shaped curve,
where the exponential growth eventually tapers off. Sometimes new technologies step in
when the old ones plateau, but sometimes it is not possible to keep the growth going, for
technical, political, or sociological reasons. For example, the technology of flight advanced
dramatically from the Wright brothers’ flight in 1903 to the moon landing in 1969, but has
had no breakthroughs of comparable magnitude since then.
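The difference between exponential growth and an S-shaped curve is hard to see early on, which is part of why extrapolation is risky. The sketch below compares a pure exponential with a logistic curve that starts at the same value; the logistic form, the ceiling of 1000, and the growth rate are arbitrary assumptions chosen only to illustrate the shape and do not model any real technology.

```python
import math


def exponential(t: float, rate: float = 1.0) -> float:
    """Unbounded exponential growth starting from 1 at t = 0."""
    return math.exp(rate * t)


def logistic(t: float, ceiling: float = 1000.0, rate: float = 1.0) -> float:
    """S-shaped (logistic) growth starting from 1 at t = 0 and leveling off at `ceiling`."""
    return ceiling / (1.0 + (ceiling - 1.0) * math.exp(-rate * t))


# Print both curves side by side at a few time points.
for t in (0, 2, 5, 10, 20):
    print(f"t={t:2d}  exponential={exponential(t):14.1f}  logistic={logistic(t):8.1f}")
```

For small `t` the two curves are nearly indistinguishable, so early data cannot tell them apart; later the logistic curve flattens just below its ceiling while the exponential keeps growing.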
Another obstacle in the way of ultraintelligent machines taking over the world is the
world. More specifically, some kinds of progress require not just thinking but acting in the
physical world. (Kevin Kelly calls the overemphasis on pure intelligence thinkism.) An ultraintelligent machine tasked with creating a grand unified theory of physics might be capable of cleverly manipulating equations a billion times faster than Einstein, but to make any real progress, it would still need to raise millions of dollars to build a more powerful supercollider and run physical experiments over the course of months or years. Only then could it start analyzing the data and theorizing. Depending on how the data turn out, the next step might require raising additional billions of dollars for an interstellar probe mission that would take centuries to complete. The “ultraintelligent thinking” part of this whole process might actually be the least important part. As another example, an ultraintelligent machine tasked with
bringing peace to the Middle East might just end up getting 1000 times more frustrated than
a human envoy. As yet, we don’t know how many of the big problems are like mathematics and how many are like the Middle East.
While some people fear the singularity, others relish it. The transhumanism social movement looks forward to a future in which humans are merged with—or replaced by—robotic and biotech inventions. Ray Kurzweil writes in The Singularity is Near (2005):
The Singularity will allow us to transcend these limitations of our biological bodies and brain. We will gain power over our fates. ... We will be able to live as long as we want ... We will fully understand human thinking and will vastly extend and expand its reach. By the end of this century, the nonbiological portion of our intelligence will be trillions of trillions of times more powerful than unaided human intelligence.
Similarly, when asked whether robots will inherit the Earth, Marvin Minsky said “yes, but they will be our children.” These possibilities present a challenge for most moral theorists, who take the preservation of human life and the human species to be a good thing. Kurzweil also notes the potential dangers, writing “But the Singularity will also amplify the ability to act on our destructive inclinations, so its full story has not yet been written.” We humans would do well to make sure that any intelligent machine we design today that might evolve into an ultraintelligent machine will do so in a way that ends up treating us well. As Eric Brynjolfsson puts it, “The future is not preordained by machines. It’s created by humans.”
Summary
This chapter has addressed the following issues:
• Philosophers use the term weak AI for the hypothesis that machines could possibly behave intelligently, and strong AI for the hypothesis that such machines would count as having actual minds (as opposed to simulated minds).
• Alan Turing rejected the question “Can machines think?” and replaced it with a behavioral test. He anticipated many objections to the possibility of thinking machines. Few AI researchers pay attention to the Turing test, preferring to concentrate on their systems’ performance on practical tasks, rather than the ability to imitate humans.
• Consciousness remains a mystery.
• AI is a powerful technology, and as such it poses potential dangers, through lethal autonomous weapons, security and privacy breaches, unintended side effects, unintentional errors, and malignant misuse. Those who work with AI technology have an ethical imperative to responsibly reduce those dangers.
• AI systems must be able to demonstrate they are fair, trustworthy, and transparent.
• There are multiple aspects of fairness, and it is impossible to maximize all of them at once. So a first step is to decide what counts as fair.
• Automation is already changing the way people work. As a society, we will have to deal with these changes.
Bibliographical and Historical Notes
Weak AI: When Alan Turing (1950) proposed the possibility of AI, he also posed many of the key philosophical questions, and provided possible replies. But various philosophers had raised similar issues long before AI was invented. Maurice Merleau-Ponty’s Phenomenology of Perception (1945) stressed the importance of the body and the subjective interpretation of reality afforded by our senses, and Martin Heidegger’s Being and Time (1927) asked what it means to actually be an agent. In the computer age, Alva Noë (2009) and Andy Clark (2015) propose that our brains form a rather minimal representation of the world, use the world itself
on a just-in-time basis to maintain the illusion of a detailed internal model, and use props
in the world (such as paper and pencil as well as computers) to increase the capabilities of the mind. Pfeifer et al. (2006) and Lakoff and Johnson (1999) present arguments for how the body helps shape cognition. Speaking of bodies, Levy (2008), Danaher and McArthur (2017), and Devlin (2018) address the issue of robot sex.
Strong AI: René Descartes is known for his dualistic view of the human mind, but ironically his historical influence was toward mechanism and physicalism. He explicitly conceived of animals as automata, and he anticipated the Turing test, writing “it is not conceivable [that a machine] should produce different arrangements of words so as to give an appropriately meaningful answer to whatever is said in its presence, as even the dullest of men can do” (Descartes, 1637). Descartes’s spirited defense of the animals-as-automata viewpoint actually had the effect of making it easier to conceive of humans as automata as well, even though he himself did not take this step. The book L’Homme Machine (La Mettrie, 1748) did explicitly argue that humans are automata. As far back as Homer (circa 700 BCE), the Greek legends envisioned automata such as the bronze giant Talos and considered the issue of biotechne, or life through craft (Mayor, 2018).
The Turing test (Turing, 1950) has been debated (Shieber, 2004), anthologized (Epstein et al., 2008), and criticized (Shieber, 1994; Ford and Hayes, 1995). Bringsjord (2008) gives advice for a Turing test judge, and Christian (2011) for a human contestant. The annual Loebner Prize competition is the longest-running Turing test-like contest; Steve Worswick’s MITSUKU won four in a row from 2016 to 2019. The Chinese room has been debated endlessly (Searle, 1980; Chalmers, 1992; Preston and Bishop, 2002). Hernández-Orallo (2016) gives an overview of approaches to measuring AI progress, and Chollet (2019) proposes a measure of intelligence based on skill-acquisition efficiency.
Consciousness remains a vexing problem for philosophers, neuroscientists, and anyone who has pondered their own existence. Block (2009), Churchland (2013) and Dehaene (2014) provide overviews of the major theories. Crick and Koch (2003) add their expertise in biology and neuroscience to the debate, and Gazzaniga (2018) shows what can be learned from studying brain disabilities in hospital cases. Koch (2019) gives a theory of consciousness—“intelligence is about doing while experience is about being”—that includes most animals, but not computers. Giulio Tononi and his colleagues propose integrated information theory (Oizumi et al., 2014). Damasio (1999) has a theory based on three levels: emotion, feeling, and feeling a feeling. Bryson (2012) shows the value of conscious attention for the process of learning action selection.
The philosophical literature on minds, brains, and related topics is large and jargon-filled. The Encyclopedia of Philosophy (Edwards, 1967) is an impressively authoritative and very useful navigation aid. The Cambridge Dictionary of Philosophy (Audi, 1999) is shorter and more accessible, and the online Stanford Encyclopedia of Philosophy offers many excellent articles and up-to-date references. The MIT Encyclopedia of Cognitive Science (Wilson and Keil, 1999) covers the philosophy, biology, and psychology of mind. There are multiple introductions to the philosophical “AI question” (Haugeland, 1985; Boden, 1990; Copeland, 1993; McCorduck, 2004; Minsky, 2007). The Behavioral and Brain Sciences, abbreviated BBS, is a major journal devoted to philosophical and scientific debates about AI and neuroscience.
Science fiction writer Isaac Asimov (1942, 1950) was one of the first to address the issue
of robot ethics, with his laws of robotics:
0. A robot may not harm humanity, or through inaction, allow humanity to come to harm.
1. A robot may not injure a human being or, through inaction, allow a human being to come to harm.
2. A robot must obey orders given to it by human beings, except where such orders would conflict with the First Law.
3. A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.
At first glance, these laws seem reasonable. But the trick is how to implement them. Should a robot allow a human to cross the street, or eat junk food, if the human might conceivably come to harm? In Asimov’s story Runaround (1942), humans need to debug a robot that is found wandering in a circle, acting “drunk.” They work out that the circle defines the locus of points that balance the second law (the robot was ordered to fetch some selenium at the center of the circle) with the third law (there is a danger there that threatens the robot’s existence).⁴ This suggests that the laws are not logical absolutes, but rather are weighed against each other, with a higher weight for the earlier laws. As this was 1942, before the emergence of digital computers, Asimov was probably thinking of an architecture based on control theory via analog computing. Weld and Etzioni (1994) analyze Asimov’s laws and suggest some ways to modify the planning techniques of Chapter 11 to generate plans that do no harm. Asimov has considered many of the ethical issues around technology; in his 1958 story The Feeling of Power he tackles the issue of automation leading to a lapse of human skill (a technician rediscovers the lost art of multiplication) as well as the dilemma of what to do when the rediscovery is applied to warfare.
⁴ Science fiction writers are in broad agreement that robots are very bad at resolving contradictions. In 2001, the HAL 9000 computer becomes homicidal due to a conflict in its orders, and in the Star Trek episode “I, Mudd,” Captain Kirk tells an enemy robot that “Everything Harry tells you is a lie,” and Harry says “I am lying.” At this, smoke comes out of the robot’s head and it shuts down.
Norbert Wiener’s book God & Golem, Inc. (1964) correctly predicted that computers would achieve expert-level performance at games and other tasks, and that specifying what it is that we want would prove to be difficult. Wiener writes:
While it is always possible to ask for something other than we really want, this possibility is most serious when the process by which we are to obtain our wish is indirect, and the degree to which we have obtained our wish is not clear until the very end. Usually we realize our wishes, insofar as we do actually realize them, by a feedback process, in which we compare the degree of attainment of intermediate goals with our anticipation of them. In this process, the feedback goes through us, and we can turn back before it is too late. If the feedback is built into a machine that cannot be inspected until the final goal is attained, the possibilities for catastrophe are greatly increased. I should very much hate to ride on the first trial of an automobile regulated by photoelectric feedback devices, unless there were somewhere a handle by which I could take over control if I found myself driving smack into a tree.
We summarized codes of ethics in the chapter, but the list of organizations that have issued sets of principles is growing rapidly, and now includes Apple, DeepMind, Facebook, Google, IBM, Microsoft, the Organisation for Economic Co-operation and Development (OECD), the United Nations Educational, Scientific and Cultural Organization (UNESCO), the U.S. Office of Science and Technology Policy, the Beijing Academy of Artificial Intelligence (BAAI), the Institute of Electrical and Electronics Engineers (IEEE), the Association for Computing Machinery (ACM), the World Economic Forum, the Group of Twenty (G20), OpenAI, the Machine Intelligence Research Institute (MIRI), AI4People, the Centre for the Study of Existential Risk, the Center for Human-Compatible AI, the Center for Humane Technology, the Partnership on AI, the AI Now Institute, the Future of Life Institute, the Future of Humanity Institute, the European Union, and at least 42 national governments. We have the handbook on the Ethics of Computing (Berleur and Brunnstein, 2001) and introductions to the topic of AI ethics in book (Boddington, 2017) and survey (Etzioni and Etzioni, 2017a) form. The Journal of Artificial Intelligence and Law and AI and Society cover ethical issues. We’ll now look at some of the individual issues.
Lethal autonomous weapons: P. W. Singer’s Wired for War (2009) raised ethical, legal, and technical issues around robots on the battlefield. Paul Scharre’s Army of None (2018), written by one of the authors of current US policy on autonomous weapons, offers a balanced and authoritative view. Etzioni and Etzioni (2017b) address the question of whether artificial intelligence should be regulated; they recommend a pause in the development of lethal autonomous weapons, and an international discussion on the subject of regulation.
Privacy: Latanya Sweeney (Sweeney, 2002b) presents the k-anonymity model and the idea of generalizing fields (Sweeney, 2002a). Achieving k-anonymity with minimal loss of data is an NP-hard problem, but Bayardo and Agrawal (2005) give an approximation algorithm. Cynthia Dwork (2008) describes differential privacy, and in subsequent work gives practical examples of clever ways to apply differential privacy to get better results than the naive approach (Dwork et al., 2014). Guo et al. (2019) describe a process for certified data removal: if you train a model on some data, and then there is a request to delete some of the data, this extension of differential privacy lets you modify the model and prove that it does not make use of the deleted data. Ji et al. (2014) give a review of the field of privacy. Etzioni (2004) argues for a balancing of privacy and security; individual rights and community. Fung et al. (2018) and Bagdasaryan et al. (2018) discuss the various attacks on federated learning protocols. Narayanan et al. (2011) describe how they were able to de-anonymize the obfuscated connection graph from the 2011 Social Network Challenge by crawling the site where the data was obtained (Flickr), and matching nodes with unusually high in-degree or out-degree between the provided data and the crawled data. This allowed them to gain additional information to win the challenge, and it also allowed them to uncover the true identity of nodes in the data. Tools for user privacy are becoming available; for example, TensorFlow provides modules for federated learning and privacy (McMahan and Andrew, 2018).
Fairness: Cathy O’Neil’s book Weapons of Math Destruction (2017) describes how various black box machine learning models influence our lives, often in unfair ways. She calls on model builders to take responsibility for fairness, and for policy makers to impose appropriate regulation. Dwork et al. (2012) showed the flaws with the simplistic “fairness through unawareness” approach. Bellamy et al. (2018) present a toolkit for mitigating bias in machine learning systems. Tramer et al. (2016) show how an adversary can “steal” a machine learning model by making queries against an API. Hardt et al. (2017) describe equal opportunity as a metric for fairness. Chouldechova and Roth (2018) give an overview of the frontiers of fairness, and Verma and Rubin (2018) give an exhaustive survey of fairness definitions.
Kleinberg et al. (2016) show that, in general, an algorithm cannot be both well-calibrated and equal opportunity. Berk et al. (2017) give some additional definitions of types of fairness, and again conclude that it is impossible to satisfy all aspects at once. Beutel et al. (2019) give advice for how to put fairness metrics into practice.
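To make two of the definitions above concrete, here is a minimal sketch of how equal opportunity (equal true-positive rates across groups) and calibration (among individuals given score s, a fraction s are actually positive) might be measured. The function names, the toy data, and the 0.5 decision threshold are illustrative assumptions and are not taken from the cited papers.

```python
from collections import defaultdict

def true_positive_rate(records, threshold=0.5):
    """Fraction of actual positives whose score reaches the threshold."""
    positives = [r for r in records if r["label"] == 1]
    if not positives:
        return float("nan")
    hits = sum(1 for r in positives if r["score"] >= threshold)
    return hits / len(positives)

def calibration_table(records):
    """Map each distinct score to the observed fraction of positives."""
    buckets = defaultdict(list)
    for r in records:
        buckets[r["score"]].append(r["label"])
    return {s: sum(labels) / len(labels) for s, labels in buckets.items()}

def make_group(n_high, n_low):
    """Hypothetical group in which scores 0.8 and 0.2 are each calibrated."""
    records = []
    for score, n in ((0.8, n_high), (0.2, n_low)):
        n_pos = round(score * n)  # number of true positives in this bucket
        for i in range(n):
            records.append({"score": score, "label": 1 if i < n_pos else 0})
    return records

# Two invented groups with different base rates of the positive label.
group_a = make_group(n_high=10, n_low=10)   # base rate 0.50
group_b = make_group(n_high=5, n_low=15)    # base rate 0.35

for name, group in (("A", group_a), ("B", group_b)):
    print(name, calibration_table(group), true_positive_rate(group))
```

In this toy data both groups are calibrated by construction, yet their true-positive rates differ because their base rates differ. That is only an illustration of the tension, not a proof of the general result.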
Dressel and Farid (2018) report on the COMPAS recidivism scoring model. Christin et al. (2015) and Eckhouse et al. (2019) discuss the use of predictive algorithms in the legal system. Corbett-Davies et al. (2017) show that there is a tension between ensuring fairness and optimizing public safety, and Corbett-Davies and Goel (2018) discuss the differences between fairness frameworks. Chouldechova (2017) advocates for fair impact: all classes should have the same expected utility. Liu et al. (2018a) advocate for a long-term measure of impact, pointing out that, for example, if we change the decision point for approving a loan in order to be more fair in the short run, this could have a negative effect in the long run on people who end up defaulting on a loan and thus have their credit score reduced.
Since 2014 there has been an annual conference on Fairness, Accountability, and Transparency in Machine Learning. Mehrabi et al. (2019) give a comprehensive survey of bias and fairness in machine learning, cataloging 23 kinds of bias and 10 definitions of fairness.
Trust: Explainable AI was an important topic going back to the early days of expert systems (Neches et al., 1985), and has been making a resurgence in recent years (Biran and Cotton, 2017; Miller et al., 2017; Kim, 2018). Barreno et al. (2010) give a taxonomy of the types of security attacks that can be made against a machine learning system, and Tygar (2011) surveys adversarial machine learning. Researchers at IBM have a proposal for gaining trust in AI systems through declarations of conformity (Hind et al., 2018). DARPA requires explainable decisions for its battlefield systems, and has issued a call for research in the area (Gunning, 2016).
AI safety: The book Artificial Intelligence Safety and Security (Yampolskiy, 2018) collects essays on AI safety, both recent and classic, going back to Bill Joy’s Why the Future Doesn’t Need Us (Joy, 2000). The “King Midas problem” was anticipated by Marvin Minsky, who once suggested that an AI program designed to solve the Riemann Hypothesis might end up taking over all the resources of Earth to build more powerful supercomputers. Similarly, Omohundro (2008) foresees a chess program that hijacks resources, and Bostrom (2014) describes the runaway paper clip factory that takes over the world. Yudkowsky (2008) goes into more detail about how to design a Friendly AI. Amodei et al. (2016) present five practical safety problems for AI systems.
Omohundro (2008) describes the Basic AI Drives and concludes, “Social structures which cause individuals to bear the cost of their negative externalities would go a long way toward ensuring a stable and positive future.” Elinor Ostrom’s Governing the Commons (1990) describes practices for dealing with externalities by traditional cultures. Ostrom has also applied this approach to the idea of knowledge as a commons (Hess and Ostrom, 2007).
Ray Kurzweil (2005) proclaimed The Singularity is Near, and a decade later Murray Shanahan (2015) gave an update on the topic. Microsoft cofounder Paul Allen countered with The Singularity isn’t Near (2011). He didn’t dispute the possibility of ultraintelligent machines; he just thought it would take more than a century to get there. Rod Brooks is a frequent critic of singularitarianism; he points out that technologies often take longer than predicted to mature, that we are prone to magical thinking, and that exponentials don’t last forever (Brooks, 2017).
On the other hand, for every optimistic singularitarian there is a pessimist who fears
new technology. The Web site pessimists.co shows that this has been true throughout
history: for example, in the 1890s people were concerned that the elevator would inevitably
cause nausea, that the telegraph would lead to loss of privacy and moral corruption, that the subway would release dangerous underground air and disturb the dead, and that the bicycle (especially the idea of a woman riding one) was the work of the devil.
Hans Moravec (2000) introduces some of the ideas of transhumanism, and Bostrom (2005) gives an updated history. Good’s ultraintelligent machine idea was foreseen a hundred years earlier in Samuel Butler’s Darwin Among the Machines (1863). Written four years after the publication of Charles Darwin’s On the Origin of Species and at a time when the most sophisticated machines were steam engines, Butler’s article envisioned “the ultimate development of mechanical consciousness” by natural selection. The theme was reiterated by George Dyson (1998) in a book of the same title, and was referenced by Alan Turing, who wrote in 1951 “At some stage therefore we should have to expect the machines to take control in the way that is mentioned in Samuel Butler’s Erewhon” (Turing, 1996).
Robot rights: A book edited by Yorick Wilks (2010) gives different perspectives on how we should deal with artificial companions, ranging from Joanna Bryson’s view that robots should serve us as tools, not as citizens, to Sherry Turkle’s observation that we already personify our computers and other tools, and are quite willing to blur the boundaries between machines and life. Wilks also contributed a recent update on his views (Wilks, 2019). The philosopher David Gunkel’s book Robot Rights (2018) considers four possibilities: can robots have rights or not, and should they or not? The American Society for the Prevention of Cruelty to Robots (ASPCR) proclaims that “The ASPCR is, and will continue to be, exactly as serious as robots are sentient.”
The future of work: In 1888, Edward Bellamy published the best-seller Looking Backward, which predicted that by the year 2000, technological advances would lead to a utopia where equality is achieved and people work short hours and retire early. Soon after, E. M. Forster took the dystopian view in The Machine Stops (1909), in which a benevolent machine takes over the running of a society; things fall apart when the machine inevitably fails. Norbert Wiener’s prescient book The Human Use of Human Beings (1950) argues for the benefits of automation in freeing people from drudgery while offering more creative work, but also discusses several dangers that we recognize as problems today, particularly the problem of value alignment.
The book Disrupting Unemployment (Nordfors et al., 2018) discusses some of the ways that work is changing, opening opportunities for new careers. Erik Brynjolfsson and Andrew McAfee address these themes and more in their books Race Against the Machine (2011) and The Second Machine Age (2014). Ford (2015) describes the challenges of increasing automation, and West (2018) provides recommendations to mitigate the problems, while MIT’s Thomas Malone (2004) shows that many of the same issues were apparent a decade earlier, but at that time were attributed to worldwide communication networks, not to automation.
CHAPTER 28
THE FUTURE OF AI
In which we try to see a short distance ahead.
In Chapter 2, we decided to view AI as the task of designing approximately rational agents. A variety of different agent designs were considered, ranging from reflex agents to knowledge-based decision-theoretic agents to deep learning agents using reinforcement learning. There is also variety in the component technologies from which these designs are assembled: logical, probabilistic, or neural reasoning; atomic, factored, or structured representations of states; various learning algorithms from various types of data; sensors and actuators to interact with the world. Finally, we have seen a variety of applications, in medicine, finance, transportation, communication, and other fields. There has been progress on all these fronts, both in our scientific understanding and in our technological capabilities.
Most experts are optimistic about continued progress; as we saw on page 28, the median estimate is for approximately human-level AI across a broad variety of tasks somewhere in the next 50 to 100 years. Within the next decade, AI is predicted to add trillions of dollars to the economy each year. But as we also saw, there are some critics who think general AI is centuries off, and there are numerous ethical concerns about the fairness, equity, and lethality of AI. In this chapter, we ask: where are we headed and what remains to be done? We do that by asking whether we have the right components, architectures, and goals to make AI a successful technology that delivers benefits to the world.
28.1 AI Components
This section examines the components of AI systems and the extent to which each of them might accelerate or hinder future progress.
Sensors and actuators
For much of the history of AI, direct access to the world has been glaringly absent. With a few notable exceptions, AI systems were built in such a way that humans had to supply the inputs and interpret the outputs. Meanwhile, robotic systems focused on low-level tasks in which high-level reasoning and planning were largely ignored and the need for perception was minimized. This was partly due to the great expense and engineering effort required to get real robots to work at all, and partly because of the lack of sufficient processing power and sufficiently effective algorithms to handle high-bandwidth visual input.
The situation has changed rapidly in recent years with the availability of ready-made programmable robots. These, in turn, have benefited from compact reliable motor drives and improved sensors. The cost of lidar for a self-driving car has fallen from $75,000 to $1,000, and a single-chip version may reach $10 per unit (Poulton and Watts, 2016). Radar sensors, once capable of only coarse-grained detection, are now sensitive enough to count the number of sheets in a stack of paper (Yeo et al., 2018).
The demand for better image processing in cellphone cameras has given us inexpensive high-resolution cameras for use in robotics. MEMS (micro-electromechanical systems) technology has supplied miniaturized accelerometers, gyroscopes, and actuators small enough to fit in artificial flying insects (Floreano et al., 2009; Fuller et al., 2014). It may be possible to combine millions of MEMS devices to produce powerful macroscopic actuators. 3-D printing (Muth et al., 2014) and bioprinting (Kolesky et al., 2014) have made it easier to experiment with prototypes.
Thus, we see that AI systems are at the cusp of moving from primarily software-only systems to useful embedded robotic systems. The state of robotics today is roughly comparable to the state of personal computers in the early 1980s: at that time personal computers were becoming available, but it would take another decade before they became commonplace. It is likely that flexible, intelligent robots will first make strides in industry (where environments are more controlled, tasks are more repetitive, and the value of an investment is easier to measure) before the home market (where there is more variability in environment and tasks).
Representing the state of the world
Keeping track of the world requires perception as well as updating of internal representations.
Chapter 4 showed how to keep track of atomic state representations; Chapter 7 described
how to do it for factored (propositional) state representations; Chapter 10 extended this to
first-order logic; and Chapter 14 described probabilistic reasoning over time in uncertain environments.
Chapter 21 introduced recurrent neural networks, which are also capable of
maintaining a state representation over time.
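As a reminder of what keeping track of the world means in the probabilistic case, the following is a minimal sketch of one predict-then-update filtering step over a single discrete state variable. The function name `filter_step`, the two-state rain/umbrella model, and all of its numbers are illustrative assumptions chosen for this example.

```python
def filter_step(belief, transition, sensor, observation):
    """One step of discrete Bayesian filtering.

    belief:      dict state -> P(state | evidence so far)
    transition:  dict prev_state -> dict next_state -> probability
    sensor:      dict state -> dict observation -> probability
    """
    states = list(belief)
    # Predict: push the current belief through the transition model.
    predicted = {
        s: sum(belief[prev] * transition[prev][s] for prev in states)
        for s in states
    }
    # Update: weight by the likelihood of the observation, then normalize.
    unnormalized = {s: sensor[s][observation] * predicted[s] for s in states}
    total = sum(unnormalized.values())
    return {s: p / total for s, p in unnormalized.items()}

# Hypothetical two-state world: is it raining, given umbrella sightings?
transition = {
    "rain": {"rain": 0.7, "dry": 0.3},
    "dry": {"rain": 0.3, "dry": 0.7},
}
sensor = {
    "rain": {"umbrella": 0.9, "none": 0.1},
    "dry": {"umbrella": 0.2, "none": 0.8},
}

belief = {"rain": 0.5, "dry": 0.5}
for obs in ["umbrella", "umbrella"]:
    belief = filter_step(belief, transition, sensor, obs)
print(belief)
```

The internal representation here is just a probability table over one variable; the text that follows discusses why richer representations with objects and relations are harder to maintain.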
Current filtering and perception algorithms can be combined to do a reasonable job of recognizing objects (“that’s a cat”) and reporting low-level predicates (“the cup is on the table”). Recognizing higher-level actions, such as “Dr. Russell is having a cup of tea with Dr. Norvig while discussing plans for next week,” is more difficult. Currently it can sometimes be done
(see Figure 25.17 on page 908) given enough training examples, but future progress will require techniques that generalize to novel situations without requiring exhaustive examples
(Poppe, 2010; Kang and Wildes, 2016).
Another problem is that although the approximate filtering algorithms from Chapter 14 can handle quite large environments, they are still dealing with a factored representation: they have random variables, but do not represent objects and relations explicitly. Also, their notion of time is restricted to step-by-step change; given the recent trajectory of a ball, we can predict where it will be at time t + 1, but it is difficult to represent the abstract idea that what goes up must come down.
Section 15.1 explained how probability and first-order logic can be combined to solve these problems; Section 15.2 showed how we can handle uncertainty about the identity of objects; and Chapter 25 showed how recurrent neural networks enable computer vision to track the world; but we don’t yet have a good way of putting all these techniques together. Chapter 24 showed how word embeddings and similar representations can free us from the strict bounds of concepts defined by necessary and sufficient conditions. It remains a daunting task to define general, reusable representation schemes for complex domains.
Selecting actions
The primary difficulty in action selection in the real world is coping with long-term plans (such as graduating from college in four years) that consist of billions of primitive steps. Search algorithms that consider sequences of primitive actions scale only to tens or perhaps hundreds of steps. It is only by imposing hierarchical structure on behavior that we humans cope at all. We saw in Section 11.4 how to use hierarchical representations to handle problems of this scale; furthermore, work in hierarchical reinforcement learning has succeeded in combining these ideas with the MDP formalism described in Chapter 17.
As yet, these methods have not been extended to the partially observable case (POMDPs). Moreover, algorithms for solving POMDPs typically use the same atomic state representation we used for the search algorithms of Chapter 3. There is clearly a great deal of work to do here, but the technical foundations are largely in place for making progress. The main missing element is an effective method for constructing the hierarchical representations of state and behavior that are necessary for decision making over long time scales.
Deciding what we want
Chapter 3 introduced search algorithms to find a goal state. But goal-based agents are brittle when the environment is uncertain, and when there are multiple factors to consider. In principle, utility-maximization agents address those issues in a completely general way. The fields of economics and game theory, as well as AI, make use of this insight: just declare what you want to optimize, and what each action does, and we can compute the optimal action.
In practice, however, we now realize that the task of picking the right utility function is a challenging problem in its own right. Imagine, for example, the complex web of interacting preferences that must be understood by an agent operating as an office assistant for a human being. The problem is exacerbated by the fact that each human is different, so an agent just “out of the box” will not have enough experience with any one individual to learn an accurate preference model; it will necessarily need to operate under preference uncertainty. Further complexity arises if we want to ensure that our agents are acting in a way that is fair and equitable for society, rather than just one individual.
We do not yet have much experience with building complex real-world preference models, let alone probability distributions over such models. Although there are factored formalisms, similar to Bayes nets, that are intended to decompose preferences over complex states, it has proven difficult to use these formalisms in practice. One reason may be that preferences over states are really compiled from preferences over state histories, which are described by reward functions (see Chapter 17). Even if the reward function is simple, the corresponding utility function may be very complex.
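The gap between a simple reward function and the utility function it induces can be seen even in a toy problem. The sketch below assumes a hypothetical one-dimensional corridor with deterministic left and right moves, a single rewarded goal cell, and a discount factor of 0.9; the function name and all numbers are invented for illustration.

```python
def value_iteration(n_states, goal, gamma=0.9, sweeps=100):
    """Compute state utilities for a deterministic corridor world.

    The reward function is a single rule: +1 for stepping into `goal`,
    0 for every other step. The goal is terminal.
    """
    utility = [0.0] * n_states
    for _ in range(sweeps):
        new_utility = utility[:]
        for s in range(n_states):
            if s == goal:
                continue  # terminal state: no further reward
            candidates = []
            for step in (-1, +1):
                # Moves off either end of the corridor leave the agent in place.
                nxt = min(max(s + step, 0), n_states - 1)
                reward = 1.0 if nxt == goal else 0.0
                candidates.append(reward + gamma * utility[nxt])
            new_utility[s] = max(candidates)
        utility = new_utility
    return utility

utilities = value_iteration(n_states=6, goal=5)
print([round(u, 4) for u in utilities])
```

The reward is described by one rule, but the utility takes a different value in every state because it folds in the dynamics and the discounting. With richer state spaces and stochastic dynamics, the utility function can become far harder to write down directly than the reward.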
This suggests that we take seriously the task of knowledge engineering for reward functions as a way of conveying to our agents what we want them to do. The idea of inverse reinforcement learning (Section 22.6) is one approach to this problem when we have an expert who can perform a task, but not explain it. We could also use better languages for expressing what we want. For example, in robotics, linear temporal logic makes it easier to say what things we want to happen in the near future, what things we want to avoid, and what states we want to persist forever (Littman et al., 2017). We need better ways of saying what we want and better ways for robots to interpret the information we provide.
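The three kinds of statements mentioned above can be sketched as checks over a finite recorded trace of states. This is only an approximation: linear temporal logic proper is defined over infinite sequences and has further operators such as “next” and “until.” The helper names and the delivery-robot trace are hypothetical.

```python
def eventually(pred, trace, within=None):
    """True if pred holds at some step, optionally within the first `within` steps."""
    window = trace if within is None else trace[:within]
    return any(pred(state) for state in window)

def always(pred, trace):
    """True if pred holds at every step of the trace."""
    return all(pred(state) for state in trace)

def never(pred, trace):
    """True if pred holds at no step of the trace (something to avoid)."""
    return not eventually(pred, trace)

def eventually_always(pred, trace):
    """True if, from some step onward, pred holds for the rest of the trace."""
    return any(always(pred, trace[i:]) for i in range(len(trace)))

# Hypothetical trace of a delivery robot, one dictionary per time step.
trace = [
    {"delivered": False, "collision": False, "docked": False},
    {"delivered": True, "collision": False, "docked": False},
    {"delivered": True, "collision": False, "docked": True},
    {"delivered": True, "collision": False, "docked": True},
]

spec_ok = (
    eventually(lambda s: s["delivered"], trace, within=2)  # happen soon
    and never(lambda s: s["collision"], trace)             # avoid
    and eventually_always(lambda s: s["docked"], trace)    # persist
)
print(spec_ok)
```

Writing the objective as a combination of such operators keeps the statement of what we want separate from the question of how the robot achieves it.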
The computer industry as a whole has developed a powerful ecosystem for aggregating user preferences. When you click on something in an app, online game, social network, or shopping site, that serves as a recommendation that you (and your similar peers) would like to see similar things in the future. (Or it might be that the site is confusing and you clicked on the wrong thing; the data are always noisy.) The feedback inherent in this system makes it very effective in the short run for picking out ever more addictive games and videos.
But these systems often fail to provide an easy way of opting out: your device will auto-play a relevant video, but it is less likely to tell you “maybe it is time to put away your devices and take a relaxing walk in nature.” A shopping site will help you find clothes that match your style, but will not address world peace or ending hunger and poverty. To the extent that the menu of choices is driven by companies trying to profit from a customer’s attention, the menu will remain incomplete.
However, companies do respond to customers’ interests, and many customers have voiced the opinion that they are interested in a fair and sustainable world. Tim O’Reilly explains why profit is not the only motive with the following analogy: “Money is like gasoline during a road trip. You don’t want to run out of gas on your trip, but you’re not doing a tour of gas stations. You have to pay attention to money, but it shouldn’t be about the money.”
Tristan Harris’s time well spent movement at the Center for Humane Technology is a step towards giving us more well-rounded choices (Harris, 2016). The movement addresses an issue that was recognized by Herbert Simon in 1971: “A wealth of information creates a poverty of attention.” Perhaps in the future we will have personal agents that stick up for our true long-term interests rather than the interests of the corporations whose apps currently fill our devices. It will be the agent’s job to mediate the offerings of various vendors, protect us from addictive attention-grabbers, and guide us towards the goals that really matter to us.
Learning
Chapters 19 to 22 described how agents can learn. Current algorithms can cope with quite large problems, reaching or exceeding human capabilities in many tasks, as long as we have sufficient training examples and we are dealing with a predefined vocabulary of features and concepts. But learning can stall when data are sparse, or unsupervised, or when we are dealing with complex representations.
Much of the recent resurgence of AI in the popular press and in industry is due to the success of deep learning (Chapter 21). On the one hand, this can be seen as the incremental maturation of the subfield of neural networks. On the other hand, we can see it as a revolutionary leap in capabilities spurred by a confluence of factors: the availability of more training data thanks to the Internet, increased processing power from specialized hardware, and a few algorithmic tricks, such as generative adversarial networks (GANs), batch normalization, dropout, and the rectified linear (ReLU) activation function.
The future should see continued emphasis on improving deep learning for the tasks it excels at, and also extending it to cover other tasks. The brand name “deep learning” has proven to be so popular that we should expect its use to continue, even if the mix of techniques that fuel it changes considerably.
We have seen the emergence of the field of data science as the confluence of statistics, programming, and domain expertise. While we can expect to see continued development in the tools and techniques necessary to acquire, manage, and maintain big data, we will also need advances in transfer learning so that we can take advantage of data in one domain to improve performance on a related domain.
The vast majority of machine learning research today assumes a factored representation, learning a function $h : \mathbb{R}^n \to \mathbb{R}$ for regression and $h : \mathbb{R}^n \to \{0, 1\}$ for classification. Machine learning has been less successful for problems that have only a small amount of data, or problems that require the construction of new structured, hierarchical representations. Deep learning, especially with convolutional networks applied to computer vision problems, has demonstrated some success in going from low-level pixels to intermediate-level concepts like Eye and Mouth, then to Face, and finally to Person or Cat.
A challenge for the future is to more smoothly combine learning and prior knowledge. If we give a computer a problem it has not encountered before (say, recognizing different models of cars), we don’t want the system to be powerless until it has been fed millions of labeled examples.
The ideal system should be able to draw on what it already knows: it should already have a model of how vision works, and how the design and branding of products in general work; now it should use transfer learning to apply that to the new problem of car models. It should be able to find on its own information about car models, drawing from text, images, and video available on the Internet. It should be capable of apprenticeship learning: having a conversation with a teacher, and not just asking “may I have a thousand images of a Corolla,” but rather being able to understand advice like “the Insight is similar to the Prius, but the Insight has a larger grille.” It should know that each model comes in a small range of possible colors, but that a car can be repainted, so there is a chance that it might see a car in a color that was not in the training set. (If it didn’t know that, it should be capable of learning it, or being told about it.)
All this requires a communication and representation language that humans and computers can share; we can’t expect a human analyst to directly modify a model with millions of weights. Probabilistic models (including probabilistic programming languages) give humans some ability to describe what we know, but these models are not yet well integrated with other learning mechanisms.
The work of Bengio and LeCun (2007) is one step towards this integration. Recently Yann LeCun has suggested that the term “deep learning” should be replaced with the more general differentiable programming (Siskind and Pearlmutter, 2016; Li et al., 2018); this suggests that our general programming languages and our machine learning models could be merged together.
Right now, it is common to build a deep learning model that is differentiable, and thus can be trained to minimize loss, and retrained when circumstances change. But that deep learning model is only one part of a larger software system that takes in data, massages the data, feeds it to the model, and figures out what to do with the model’s output. All these other parts of the larger system were written by hand by a programmer, and thus are nondifferentiable, which means that when circumstances change, it is up to the programmer to recognize any problems and fix them by hand. With differentiable programming, the hope is that the entire system is subject to automated optimization.
The end goal is to be able to express what we know in whatever form is convenient to us:
informal advice given in natural language, a strong mathematical law like F = ma, a statistical
model accompanied by data, or a probabilistic program with unknown parameters that can
be automatically optimized through gradient descent. Our computer models will learn from conversations with human experts as well as by using all the available data.
Yann LeCun, Geoffrey Hinton, and others have suggested that the current emphasis on supervised learning (and to a lesser extent reinforcement learning) is not sustainable, and that computer models will have to rely on weakly supervised learning, in which some supervision is given with a small number of labeled examples and/or a small number of rewards, but
most of the learning is unsupervised, because unannotated data are so much more plentiful.
LeCun uses the term predictive learning for an unsupervised learning system that can
model the world and learn to predict aspects of future states of the world—not just predict
labels for inputs that are independent and identically distributed with respect to past data, and
not just predict a value function over states. He suggests that GANs (generative adversarial networks) can be used to learn to minimize the difference between predictions and reality.
Geoffrey Hinton stated in 2017 that “My view is throw it all away and start again,” meaning that the overall idea of learning by adjusting parameters in a network is enduring, but the
specifics of the architecture of the networks and the technique of back-propagation need to be
rethought. Smolensky (1988) had a prescription for how to think about connectionist models; his thoughts remain relevant today.
Resources
Machine learning research and development has been accelerated by the increasing availability of data, storage, processing power, software, trained experts, and the investments needed to
support them. Since the 1970s, there has been a 100,000-fold speedup in general-purpose processors and an additional 1,000-fold speedup due to specialized machine learning hardware.
The Web has served as a rich source of images, videos, speech, text, and semi-structured data,
currently adding over 10^18 bytes every day.
Hundreds of high-quality data sets are available for a range of tasks in computer vision,
speech recognition, and natural language processing. If the data you need is not already
available, you can often assemble it from other sources, or engage humans to label data for
you through a crowdsourcing platform. Validating the data obtained in this way becomes an important part of the overall workflow (Hirth et al., 2013).
An important recent development is the shift from shared data to shared models. The
major cloud service providers (e.g., Amazon, Microsoft, Google, Alibaba, IBM, Salesforce) have begun competing to offer machine learning APIs with pre-built models for specific tasks such as visual object recognition, speech recognition, and machine translation. These models
can be used as is, or can serve as a baseline to be customized with your particular data for
your particular application.
We expect that these models will improve over time, and that it will become unusual to
start a machine learning project from scratch, just as it is now unusual to do a Web development project from scratch, with no libraries. It is possible that a big jump in model quality
will occur when it becomes economical to process all the video on the Web; for example, the YouTube platform alone adds 300 hours of video every minute.
Moore’s law has made it more cost effective to process data; a megabyte of storage cost $1
million in 1969 and less than $0.02 in 2019, and supercomputer throughput has increased by a
factor of more than 10^10 in that time. Specialized hardware components for machine learning
such as graphics processing units (GPUs), tensor cores, tensor processing units (TPUs), and
field programmable gate arrays (FPGAs) are hundreds of times faster than conventional CPUs for machine learning training (Vasilache et al., 2014; Jouppi et al., 2017). In 2014 it took a full day to train an ImageNet model; in 2018 it takes just 2 minutes (Ying et al., 2018).
The OpenAI Institute reports that the amount of compute power used to train the largest machine learning models doubled every 3.5 months from 2012 to 2018, reaching over an exaflop/second-day for ALPHAZERO (although they also report that some very influential work used 100 million times less computing power (Amodei and Hernandez, 2018)). The
same economic trends that have made cell-phone cameras cheaper and better also apply to processors—we will see continued progress in low-power, high-performance computing that benefits from economies of scale.
There is a possibility that quantum computers could accelerate AI. Currently there are
some fast quantum algorithms for the linear algebra operations used in machine learning (Harrow et al., 2009; Dervovic et al., 2018), but no quantum computer capable of running them. We have some example applications of tasks such as image classification (Mott et al., 2017) where quantum algorithms are as good as classical algorithms on small problems. Current quantum computers handle only a few tens of bits, whereas machine learning algorithms often handle inputs with millions of bits and create models with hundreds of millions of parameters. So we need breakthroughs in both quantum hardware and software to make quantum computing practical for large-scale machine learning. Alternatively, there may
be a division of labor (perhaps a quantum algorithm to efficiently search the space of hyperparameters while the normal training process runs on conventional computers), but we don’t know how to do that yet. Research on quantum algorithms can sometimes inspire new and better algorithms on classical computers (Tang, 2018).
We have also seen exponential growth in the number of publications, people, and dollars
in AI/machine learning/data science. Dean et al. (2018) show that the number of papers about “machine learning” on arXiv doubled every two years from 2009 to 2017. Investors are
funding startup companies in these fields, large companies are hiring and spending as they
determine their AI strategy, and governments are investing to make sure their country doesn’t fall too far behind.
28.2 AI Architectures
It is natural to ask, “Which of the agent architectures in Chapter 2 should an agent use?” The answer is, “All of them!” Reflex responses are needed for situations in which time is of the
essence, whereas knowledge-based deliberation allows the agent to plan ahead. Learning is convenient when we have lots of data, and necessary when the environment is changing, or when human designers have insufficient knowledge of the domain.
AI has long had a split between symbolic systems (based on logical and probabilistic
inference) and connectionist systems (based on loss minimization over a large number of
uninterpreted parameters).
A continuing challenge for AI is to bring these two together,
to capture the best of both. Symbolic systems allow us to string together long chains of
reasoning and to take advantage of the expressive power of structured representations, while connectionist systems can recognize patterns even in the face of noisy data. One line of
research aims to combine probabilistic programming with deep learning, although as yet the various proposals are limited in the extent to which the approaches are truly merged.
Agents also need ways to control their own deliberations. They must be able to use the
available time well, and cease deliberating when action is demanded. For example, a taxi-driving agent that sees an accident ahead must decide in a split second whether to brake or
swerve. It should also spend that split second thinking about the most important questions,
such as whether the lanes to the left and right are clear and whether there is a large truck close
behind, rather than worrying about where to pick up the next passenger. These issues are usually studied under the heading of real-time AI. As AI systems move into more complex domains, all problems will become real-time, because the agent will never have long enough to solve the decision problem exactly.
Clearly, there is a pressing need for general methods of controlling deliberation, rather
than specific recipes for what to think about in each situation.
The first useful idea is the
anytime algorithm (Dean and Boddy, 1988; Horvitz, 1987): an algorithm whose output quality improves gradually over time, so that it has a reasonable decision ready whenever it is
interrupted. Examples of anytime algorithms include iterative deepening in game-tree search and MCMC in Bayesian networks.
The second technique for controlling deliberation is decision-theoretic metareasoning (Russell and Wefald, 1989; Horvitz and Breese, 1996; Hay et al., 2012). This method, which
was mentioned briefly in Sections 3.6.5 and 5.7, applies the theory of information value (Chapter
16) to the selection of individual computations (Section 3.6.5). The value of a
computation depends on both its cost (in terms of delaying action) and its benefits (in terms
of improved decision quality).
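The two ideas above can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: the task (bracketing a square root by bisection) is a stand-in chosen for brevity, the names `anytime_sqrt` and `deliberate` are hypothetical, and the stopping rule is a crude proxy for the value of computation rather than a real decision-theoretic calculation.

```python
from itertools import islice


def anytime_sqrt(target):
    """Yield (estimate, error_bound) pairs that tighten with every step.

    Any yielded estimate is usable, so the caller may interrupt at will.
    """
    lo, hi = 0.0, max(1.0, target)
    while True:
        mid = (lo + hi) / 2
        if mid * mid <= target:
            lo = mid
        else:
            hi = mid
        # The midpoint of the current bracket is the answer that is "ready";
        # its error is at most half the bracket width.
        yield (lo + hi) / 2, (hi - lo) / 2


def deliberate(target, cost_per_step):
    """Stop once the most a further step could gain no longer exceeds its cost.

    Assumption: the error bound stands in for the benefit of more computation.
    """
    steps = 0
    for estimate, error_bound in anytime_sqrt(target):
        steps += 1
        if error_bound <= cost_per_step:
            return estimate, steps


# Interrupting after 5 steps still gives a usable (if rough) answer.
print(list(islice(anytime_sqrt(2.0), 5))[-1])
# A higher cost of delay means fewer steps of deliberation.
print(deliberate(2.0, 0.01))
print(deliberate(2.0, 0.1))
```

The generator form makes the anytime property explicit: quality improves monotonically, and the decision about when to stop is left to a separate metalevel routine.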
Metareasoning techniques can be used to design better search algorithms and to guarantee
that the algorithms have the anytime property. Monte Carlo tree search is one example: the
choice of leaf node at which to begin the next playout is made by an approximately rational metalevel decision derived from bandit theory.
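As a rough sketch of such a bandit-derived metalevel choice, the following assumes the widely used UCB1 rule; the text does not say which bandit formula is meant, so the rule, the exploration constant `c`, and the names `ucb1` and `select_child` are all assumptions made for illustration.

```python
import math


def ucb1(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """Upper confidence bound: average reward plus an exploration bonus."""
    if visits == 0:
        return math.inf  # untried moves are always worth one playout
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, c=math.sqrt(2)):
    """children maps a move name to (total_reward, visits)."""
    parent_visits = sum(visits for _, visits in children.values())
    return max(children, key=lambda m: ucb1(*children[m], parent_visits, c))


stats = {"well_explored": (60.0, 100), "barely_tried": (5.0, 10)}
# The less-visited move wins despite its lower average reward.
print(select_child(stats))
# With no exploration bonus the choice is purely greedy.
print(select_child(stats, c=0.0))
```

The bonus term shrinks as a node is visited more often, so computation is steered toward nodes whose value is still uncertain enough to change the final decision.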
Metareasoning is more expensive than reflex action, of course, but compilation methods can be applied so that the overhead is small compared to the costs of the computations being controlled.
Metalevel reinforcement learning may provide another way to acquire effective
policies for controlling deliberation: in essence, computations that lead to better decisions are
reinforced, while those that turn out to have no effect are penalized. This approach avoids the
myopia problems of the simple value-of-information calculation.
Metareasoning is one specific example of a reflective architecture, that is, an architecture that enables deliberation about the computational entities and actions occurring within
the architecture itself.
A theoretical foundation for reflective architectures can be built by
defining a joint state space composed from the environment state and the computational state
of the agent itself. Decision-making and learning algorithms can be designed that operate
over this joint state space and thereby serve to implement and improve the agent’s computational activities. Eventually, we expect task-specific algorithms such as alpha-beta search,
regression planning, and variable elimination to disappear from AI systems, to be replaced
by general methods that direct the agent’s computations toward the efficient generation of
high-quality decisions.
Metareasoning and reflection (and many other efficiency-related architectural and algorithmic devices explored in this book) are necessary because making decisions is hard. Ever
since computers were invented, their blinding speed has led people to overestimate their ability to overcome complexity, or, equivalently, to underestimate what complexity really means.
The truly gargantuan power of today’s machines tempts one to think that we could bypass all the clever devices and rely more on brute force. So let’s try to counteract this tendency.
We begin with what physicists believe to be the speed of the ultimate 1 kg computing device: about 10^51 operations per second, or a billion trillion trillion times faster than the fastest supercomputer as of 2020 (Lloyd, 2000).¹ Then we propose a simple task: enumerating strings of English words, much as Borges proposed in The Library of Babel. Borges stipulated books of 410 pages.
Would that be feasible? Not quite. In fact, the computer running for a year
could enumerate only the 11-word strings. Now consider the fact that a detailed plan for a human life consists of (very roughly) twenty trillion potential muscle actuations (Russell, 2019), and you begin to see the scale of the problem.
A computer that is a billion trillion trillion times more powerful than the
human brain is much further from being rational than a slug is from overtaking the starship Enterprise traveling at warp nine.
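The 11-word figure can be reproduced with a back-of-the-envelope calculation. The sketch below assumes a vocabulary of 100,000 English words and one operation per enumerated string; neither number is given in the text, so both are labeled assumptions, and the function name is hypothetical.

```python
OPS_PER_SECOND = 10**51                  # speed of the ultimate 1 kg device
SECONDS_PER_YEAR = 365 * 24 * 60 * 60
VOCABULARY_SIZE = 100_000                # assumption: size of the English vocabulary


def longest_enumerable_length(budget, vocabulary_size):
    """Largest n such that all vocabulary_size**n strings fit in the budget.

    Assumption: enumerating one string costs exactly one operation.
    """
    n = 0
    while vocabulary_size ** (n + 1) <= budget:
        n += 1
    return n


budget = OPS_PER_SECOND * SECONDS_PER_YEAR   # operations available in one year
print(longest_enumerable_length(budget, VOCABULARY_SIZE))   # prints 11
```

Under these assumptions a year of computation yields about 3 × 10^58 operations, enough for the 10^55 strings of length 11 but not for the 10^60 strings of length 12.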
With these considerations in mind, it seems that the goal of building rational agents is
perhaps a little too ambitious.
Rather than aiming for something that cannot possibly exist,
we should consider a different normative target, one that necessarily exists. Recall from
Chapter 2 the following simple idea:
agent = architecture + program .
Now fix the agent architecture (the underlying machine capabilities, perhaps with a fixed software layer on top) and allow the agent program to vary over all possible programs that the architecture can support. In any given task environment, one of these programs (or an equivalence class of them) delivers the best possible performance: perhaps not close to perfect
rationality, but still better than any other agent program. We say that this program satisfies
the criterion of bounded optimality. Clearly it exists, and clearly it constitutes a desirable
goal. The trick is finding it, or something close to it. For some elementary classes of agent programs in simple real-time environments, it is possible to identify bounded-optimal agent programs (Etzioni, 1989; Russell and Subramanian, 1995). The success of Monte Carlo tree search has revived interest in metalevel decision
making, and there is reason to hope that bounded optimality within more complex families of
agent programs can be achieved by techniques such as metalevel reinforcement learning. It should also be possible to develop a constructive theory of architecture, beginning with theorems on the bounded optimality of suitable methods of combining different bounded-optimal
components such as reflex and action-value systems.
General AI
Much of the progress in AI in the 21st century so far has been guided by competition on narrow tasks, such as the DARPA Grand Challenge for autonomous cars, the ImageNet object recognition competition, or playing Go, chess, poker, or Jeopardy! against a world champion. For each separate task, we build a separate AI system, usually with a separate machine learning model trained from scratch with data collected specifically for this task. But a truly
intelligent agent should be able to do more than one thing. Alan Turing (1950) proposed his list (page 982) and science fiction author Robert Heinlein (1973) countered with:
¹ We gloss over the fact that this device consumes the entire energy output of a star and operates at a billion degrees centigrade.
A human being should be able to change a diaper, plan an invasion, butcher a hog, conn a ship, design a building, write a sonnet, balance accounts, build a wall, set a bone, comfort the dying, take orders, give orders, cooperate, act alone, solve equations, analyse a new problem, pitch manure, program a computer, cook a tasty meal, fight efficiently, die gallantly. Specialization is for insects.
So far, no AI system measures up to either of these lists, and some proponents of general or human-level AI (HLAI) insist that continued work on specific tasks (or on individual components) will not be enough to reach mastery on a wide variety of tasks; that we will need a
fundamentally new approach. It seems to us that numerous new breakthroughs will indeed be necessary, but overall, AI as a field has made a reasonable exploration/exploitation tradeoff, assembling a portfolio of components, improving on particular tasks, while also exploring promising and sometimes far-out new ideas.
It would have been a mistake to tell the Wright brothers in 1903 to stop work on their
single-task airplane and design an “artificial general flight” machine that can take off vertically, fly faster than sound, carry hundreds of passengers, and land on the moon. It also would have been a mistake to follow up their first flight with an annual competition to make spruce wood biplanes incrementally better.
We have seen that work on components can spur new ideas; for example, generative adversarial networks (GANs) and transformer language models each opened up new areas of research.
We have also seen steps towards “diversity of behaviour.” For example, machine
translation systems in the 1990s were built one at a time for each language pair (such as
French to English), but today a single system can identify the input text as being one of a hundred languages, and translate it into any of 100 target languages. Another natural language system can perform five distinct tasks with one joint model (Hashimoto et al., 2016).
AI engineering
The field of computer programming started with a few extraordinary pioneers. But it didn’t reach the status of a major industry until a practice of software engineering was developed,
with a powerful collection of widely available tools, and a thriving ecosystem of teachers, students, practitioners, entrepreneurs, investors, and customers.
The AI industry has not yet reached that level of maturity. We do have a variety of powerful tools and frameworks, such as TensorFlow, Keras, PyTorch, Caffe, Scikit-Learn, and
SciPy. But many of the most promising approaches, such as GANs and deep reinforcement learning, have proven to be difficult to work with: they require experience and a degree of
fiddling to get them to train properly in a new domain. We don’t have enough experts to do this across all the domains where we need it, and we don’t yet have the tools and ecosystem to let less-expert practitioners succeed.
Google’s Jeff Dean sees a future where we will want machine learning to handle millions
of tasks; it won’t be feasible to develop each of them from scratch, so he suggests that rather
than building each new system from scratch, we should start with a single huge system and, for each new task, extract from it the parts that are relevant to the task. We have seen some steps
in this direction, such as the transformer language models (e.g., BERT, GPT-2) with
billions of parameters, and an “outrageously large” ensemble neural network architecture
that scales up to 68 billion parameters in one experiment (Shazeer et al., 2017). Much work remains to be done.
The future
Which way will the future go? Science fiction authors seem to favor dystopian futures over utopian ones, probably because they make for more interesting plots. So far, AI seems to fit
in with other powerful revolutionary technologies
such as printing, plumbing, air travel, and
telephony. All these technologies have made positive impacts, but also have some unintended
side effects that disproportionately impact disadvantaged classes. We would do well to invest in minimizing the negative impacts.
AI is also different from previous revolutionary technologies. Improving printing, plumbing, air travel, and telephony to their logical limits would not produce anything to threaten
human supremacy in the world. Improving AI to its logical limit certainly could.
In conclusion, AI has made great progress in its short history, but the final sentence of
Alan Turing’s (1950) essay on Computing Machinery and Intelligence is still valid today:
We can see only a short distance ahead,
but we can see that much remains to be done.
APPENDIX A
MATHEMATICAL BACKGROUND
A.1 Complexity Analysis and O() Notation
Computer scientists are often faced with the task of comparing algorithms to see how fast
they run or how much memory they require. There are two approaches to this task. The first is benchmarking—running the algorithms on a computer and measuring speed in seconds
and memory consumption in bytes. Ultimately, this is what really matters, but a benchmark can be unsatisfactory because it is so specific: it measures the performance of a particular
program written in a particular language, running on a particular computer, with a particular compiler and particular input data. From the single result that the benchmark provides, it can be difficult to predict how well the algorithm would do on a different compiler, computer, or data set. The second approach relies on a mathematical analysis of algorithms, independent of the particular implementation and input, as discussed below.
A.1.1 Asymptotic analysis
We will consider algorithm analysis through the following example, a program to compute the sum of a sequence of numbers:
function SUMMATION(sequence) returns a number
  sum ← 0
  for i = 1 to LENGTH(sequence) do
    sum ← sum + sequence[i]
  return sum
The first step in the analysis is to abstract over the input, in order to find some parameter or parameters that characterize the size of the input. In this example, the input can be characterized by the length of the sequence, which we will call n. The second step is to abstract over the implementation, to find some measure that reflects the running time of the algorithm
but is not tied to a particular compiler or computer. For the SUMMATION program, this could be just the number of lines of code executed, or it could be more detailed, measuring the
number of additions, assignments, array references, and branches executed by the algorithm.
Either way gives us a characterization of the total number of steps taken by the algorithm as a function of the size of the input. We will call this characterization T(n). If we count lines
of code, we have T(n) = 2n + 2 for our example.
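The line count can be checked by instrumenting a Python version of SUMMATION. This is a sketch under one assumed counting convention (the loop header and the loop body each count once per iteration, and the initialization and return count once each); the function name is hypothetical.

```python
def summation_with_step_count(sequence):
    """Return (sum, number of pseudocode lines executed)."""
    steps = 0

    total = 0
    steps += 1              # sum <- 0

    for value in sequence:
        steps += 1          # the "for" line, once per iteration (assumed convention)
        total += value
        steps += 1          # sum <- sum + sequence[i]

    steps += 1              # return sum
    return total, steps


for n in (0, 1, 10, 1000):
    total, steps = summation_with_step_count(range(n))
    print(n, steps, steps == 2 * n + 2)
```

A different convention (for example, also counting the final failed loop test) would change the constant term but not the linear growth, which is the point the asymptotic analysis goes on to make.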
If all programs were as simple as SUMMATION,
the analysis of algorithms would be a
trivial field. But two problems make it more complicated. First, it is rare to find a parameter
like n that completely characterizes the number of steps taken by an algorithm. Instead,
the best we can usually do is compute the worst case $T_{\text{worst}}(n)$ or the average case $T_{\text{avg}}(n)$.
Computing an average means that the analyst must assume some distribution of inputs.
The second problem is that algorithms tend to resist exact analysis. In that case, it is
necessary to fall back on an approximation. We say that the SUMMATION algorithm is O(n),
meaning that its measure is at most a constant times n, with the possible exception of a few small values of n. More formally,
$T(n)$ is $O(f(n))$ if $T(n) \le k f(n)$ for some $k$, for all $n > n_0$.
The O() notation gives us what is called an asymptotic analysis. We can say without question that, as n asymptotically approaches infinity, an $O(n)$ algorithm is better than an $O(n^2)$
algorithm. A single benchmark figure could not substantiate such a claim.
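The definition can be made concrete by testing candidate witnesses k and n_0 numerically. The sketch below uses the step count T(n) = 2n + 2 from the SUMMATION example; the helper name and the finite cutoff `n_max` are assumptions, and a finite check is only evidence for a bound, not a proof.

```python
def bound_holds(T, f, k, n0, n_max=10_000):
    """Check T(n) <= k * f(n) for every n with n0 < n <= n_max."""
    return all(T(n) <= k * f(n) for n in range(n0 + 1, n_max + 1))


def T(n):
    return 2 * n + 2        # step count of SUMMATION


def linear(n):
    return n


print(bound_holds(T, linear, k=3, n0=1))   # True: 2n + 2 <= 3n once n >= 2
print(bound_holds(T, linear, k=3, n0=0))   # False: fails at n = 1 (4 > 3)
print(bound_holds(T, linear, k=2, n0=1))   # False: 2n + 2 > 2n for every n
```

The second and third calls show why both constants matter: n_0 absorbs the small exceptional values of n, and k absorbs the constant factor.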
The O() notation abstracts over constant factors, which makes it easier to use, but less
precise, than the T() notation. For example, an $O(n^2)$ algorithm will always be worse than an $O(n)$ in the long run, but if the two algorithms are $T(n^2 + 1)$ and $T(100n + 1000)$, then the
$O(n^2)$ algorithm is actually better for $n < 110$. Despite this drawback, asymptotic analysis
is the most widely used tool for analyzing
algorithms. It is precisely because the analysis abstracts over both the exact number of operations (by ignoring the constant factor k) and the exact content of the input (by considering
only its size n) that the analysis becomes mathematically feasible. The O() notation is a good compromise between precision and ease of analysis.
A.1.2 NP and inherently hard problems
The analysis of algorithms and the O() notation allow us to talk about the efficiency of a particular algorithm. However, they have nothing to say about whether there could be a better algorithm for the problem at hand. The field of complexity analysis analyzes problems rather than algorithms. The first gross division is between problems that can be solved in polynomial time and problems that cannot be solved in polynomial time, no matter what algorithm is used. The class of polynomial problems (those which can be solved in time $O(n^k)$ for some k) is called P. These are sometimes called “easy” problems, because the class contains those
problems with running times like O(log n) and O(n). But it also contains those with time
$O(n^{1000})$, so the name “easy” should not be taken too literally.
Another important class of problems is NP, the class of nondeterministic polynomial
problems. A problem is in this class if there is some algorithm that can guess a solution and
then verify whether a guess is correct in polynomial time. The idea is that if you have an
arbitrarily large number of processors so that you can try all the guesses at once, or if you are
very lucky and always guess right the first time, then the NP problems become P problems.
One of the biggest open questions in computer science is whether the class NP is equivalent
to the class P when one does not have the luxury of an infinite number of processors or
omniscient guessing. Most computer scientists are convinced that P ≠ NP; that NP problems
are inherently hard and have no polynomial-time algorithms. But this has never been proven.
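The guess-and-verify view of NP can be illustrated with a small sketch. It assumes propositional satisfiability in conjunctive normal form as the example problem and an integer encoding of literals (3 for x3, -3 for its negation); that encoding and the function names are choices made here, not taken from the text.

```python
from itertools import product


def verifies(formula, assignment):
    """Polynomial-time check: does the assignment satisfy every clause?

    formula: list of clauses, each a list of nonzero ints (k means x_k, -k means not x_k).
    assignment: dict mapping variable number to bool.
    """
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in formula
    )


def brute_force_sat(formula, num_vars):
    """Deterministic stand-in for guessing: try all 2**num_vars assignments."""
    for values in product([False, True], repeat=num_vars):
        assignment = dict(enumerate(values, start=1))
        if verifies(formula, assignment):
            return assignment
    return None


# (x1 or not x2) and (x2 or x3) and (not x1 or not x3)
formula = [[1, -2], [2, 3], [-1, -3]]
print(brute_force_sat(formula, 3))
print(brute_force_sat([[1], [-1]], 1))   # unsatisfiable, so None
```

Checking one guess takes time linear in the size of the formula, whereas the deterministic search may have to examine exponentially many guesses; whether that gap can always be closed is the P versus NP question.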
Those who are interested in deciding whether P = NP look at a subclass of NP called the NP-complete problems. The word “complete” is used here in the sense of “most extreme” and thus refers to the hardest problems in the class NP. It has been proven that either all the NP-complete problems are in P or none of them is. This makes the class theoretically
interesting, but the class is also of practical interest because many important problems are known to be NP-complete. An example is the satisfiability problem: given a sentence of propositional logic, is there an assignment of truth values to the proposition symbols of the sentence that makes it true? Unless a miracle occurs and P = NP, there can be no algorithm
that solves all satisfiability problems in polynomial time. However, AI is more interested in whether there are algorithms that perform efficiently on typical problems drawn from a predetermined distribution; as we saw in Chapter 7, there are algorithms such as WALKSAT that
do quite well on many problems.
The class of NP-hard problems consists of those problems to which all the problems in NP are reducible (in polynomial time), so if you solved any NP-hard problem, you could solve all the problems in NP. The NP-complete problems are all NP-hard, but there are some NP-hard problems that are even harder than NP-complete.
The class co-NP is the complement of NP, in the sense that, for every decision problem in NP, there is a corresponding problem in co-NP with the “yes” and “no” answers reversed.
We know that P is a subset of both NP and co-NP, and it is believed that there are problems in co-NP that are not in P. The co-NP-complete problems are the hardest problems in co-NP.
The class #P (pronounced “number P” according to Garey and Johnson (1979), but often
pronounced “sharp P”) is the set of counting problems corresponding to the decision problems in NP. Decision problems have a yes-or-no answer: is there a solution to this 3-SAT formula?
Counting problems have an integer answer: how many solutions are there to this 3-SAT formula? In some cases, the counting problem is much harder than the decision problem. For example, deciding whether a bipartite graph has a perfect matching can be done in time O(VE) (where the graph has V vertices and E edges), but the counting problem “how many perfect matchings does this bipartite graph have” is #P-complete, meaning that it is as hard as any problem in #P and thus at least as hard as any NP problem.
Another class is the class of PSPACE problems: those that require a polynomial amount
of space, even on a nondeterministic machine. It is believed that PSPACE-hard problems are
worse than NP-complete problems, although it could turn out that NP = PSPACE, just as it could turn out that P = NP.
A.2 Vectors, Matrices, and Linear Algebra
Mathematicians define a vector as a member of a vector space, but we will use a more concrete definition: a vector is an ordered sequence of values. For example, in two-dimensional
space, we have vectors such as x = (3, 4) and y = (0, 2). We follow the convention of boldface
characters for vector names, although some authors use arrows or bars over the names: $\vec{x}$ or
$\bar{y}$. The elements of a vector can be accessed using subscripts: $z = (z_1, z_2, \ldots, z_n)$. One confusing point: this book is synthesizing work from many subfields, which variously call their sequences vectors, lists, or tuples, and variously use the notations ⟨1, 2⟩, [1, 2], or (1, 2).
The two fundamental operations on vectors are vector addition and scalar multiplication.
The vector addition x + y is the elementwise sum: x + y = (3 + 0, 4 + 2) = (3, 6). Scalar multiplication multiplies each element by a constant: 5x = (5 × 3, 5 × 4) = (15, 20).
The length of a vector is denoted |x| and is computed by taking the square root of the sum of the squares of the elements:
$|x| = \sqrt{3^2 + 4^2} = 5$.
The dot product x·y (also called
scalar product) of two vectors is the sum of the products of corresponding elements, that is, $x \cdot y = \sum_i x_i y_i$, or in our particular case, x·y = 3 × 0 + 4 × 2 = 8.
Vectors are often interpreted as directed line segments (arrows) in an n-dimensional Euclidean space. Vector addition is then equivalent to placing the tail of one vector at the head
of the other, and the dot product x·y is $|x|\,|y| \cos\theta$, where $\theta$ is the angle between x and y.
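The vector operations above translate directly into code. The following is a minimal sketch using plain Python tuples and the example vectors x = (3, 4) and y = (0, 2); the function names are illustrative choices rather than an established API.

```python
import math


def add(x, y):
    """Elementwise vector sum."""
    return tuple(a + b for a, b in zip(x, y))


def scale(c, x):
    """Scalar multiplication."""
    return tuple(c * a for a in x)


def dot(x, y):
    """Sum of the products of corresponding elements."""
    return sum(a * b for a, b in zip(x, y))


def length(x):
    """Square root of the sum of the squares of the elements."""
    return math.sqrt(dot(x, x))


x, y = (3, 4), (0, 2)
print(add(x, y))        # (3, 6)
print(scale(5, x))      # (15, 20)
print(length(x))        # 5.0
print(dot(x, y))        # 8
# Geometric reading of the dot product: x.y = |x| |y| cos(theta)
cos_theta = dot(x, y) / (length(x) * length(y))
print(cos_theta)
```

Here cos θ comes out as 8 / (5 × 2) = 0.8 up to floating-point rounding, matching the geometric interpretation given in the text.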
A matrix is a rectangular array of values arranged into rows and columns. Here is a matrix A of size 3 × 4:
$$\begin{pmatrix} A_{1,1} & A_{1,2} & A_{1,3} & A_{1,4} \\ A_{2,1} & A_{2,2} & A_{2,3} & A_{2,4} \\ A_{3,1} & A_{3,2} & A_{3,3} & A_{3,4} \end{pmatrix}$$
The first index of $A_{i,j}$ specifies the row and the second the column. In programming languages, $A_{i,j}$ is often written A[i,j] or A[i][j].
The sum of two matrices is defined by adding their corresponding elements; for example $(A+B)_{i,j} = A_{i,j} + B_{i,j}$. (The sum is undefined if A and B have different sizes.) We can also define the multiplication of a matrix by a scalar: $(cA)_{i,j} = cA_{i,j}$. Matrix multiplication (the product of two matrices) is more complicated. The product AB is defined only if A is of size
a × b and B is of size b × c (i.e., the second matrix has the same number of rows as the first
has columns); the result is a matrix of size a × c. If the matrices are of appropriate size, then the result is
$$(AB)_{i,k} = \sum_j A_{i,j} B_{j,k} .$$
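Matrix multiplication is not commutative, even for square matrices: $\mathbf{AB} \neq \mathbf{BA}$ in general. It is, however, associative: $(\mathbf{AB})\mathbf{C} = \mathbf{A}(\mathbf{BC})$. Note that the dot product can be expressed in terms of a transpose and a matrix multiplication: $\mathbf{x} \cdot \mathbf{y} = \mathbf{x}^\top \mathbf{y}$.

The identity matrix $\mathbf{I}$ has elements $I_{i,j}$ equal to 1 when $i = j$ and equal to 0 otherwise. It has the property that $\mathbf{AI} = \mathbf{A}$ for all $\mathbf{A}$. The transpose of $\mathbf{A}$, written $\mathbf{A}^\top$, is formed by turning rows into columns and vice versa, or, more formally, by $A^\top_{i,j} = A_{j,i}$. The inverse of a square matrix $\mathbf{A}$ is another square matrix $\mathbf{A}^{-1}$ such that $\mathbf{A}^{-1}\mathbf{A} = \mathbf{I}$. For a singular matrix, the inverse does not exist. For a nonsingular matrix, it can be computed in $O(n^3)$ time.

Matrices are used to solve systems of linear equations in $O(n^3)$ time; the time is dominated by inverting a matrix of coefficients. Consider the following set of equations, for which we want a solution in x, y, and z:

$$\begin{aligned} +2x + y - z &= 8 \\ -3x - y + 2z &= -11 \\ -2x + y + 2z &= -3. \end{aligned}$$

We can represent this system as the matrix equation $\mathbf{A}\mathbf{x} = \mathbf{b}$, where

$$\mathbf{A} = \begin{pmatrix} 2 & 1 & -1 \\ -3 & -1 & 2 \\ -2 & 1 & 2 \end{pmatrix}, \quad \mathbf{x} = \begin{pmatrix} x \\ y \\ z \end{pmatrix}, \quad \mathbf{b} = \begin{pmatrix} 8 \\ -11 \\ -3 \end{pmatrix}.$$

To solve $\mathbf{A}\mathbf{x} = \mathbf{b}$ we multiply both sides by $\mathbf{A}^{-1}$, yielding $\mathbf{A}^{-1}\mathbf{A}\mathbf{x} = \mathbf{A}^{-1}\mathbf{b}$, which simplifies to $\mathbf{x} = \mathbf{A}^{-1}\mathbf{b}$. After inverting $\mathbf{A}$ and multiplying by $\mathbf{b}$, we get the answer

$$\mathbf{x} = \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} 2 \\ 3 \\ -1 \end{pmatrix}.$$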
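The inversion step can be sketched in Python using Gauss-Jordan elimination, which is one common way to invert a matrix in cubic time (the text does not say which inversion method it has in mind). The functions `inverse` and `mat_vec` are illustrative names, and exact rational arithmetic from the standard `fractions` module is used here only to avoid rounding error in the small example.

```python
from fractions import Fraction

def inverse(A):
    """Invert a square matrix by Gauss-Jordan elimination on [A | I]."""
    n = len(A)
    # Augment A with the identity matrix, using exact rational arithmetic.
    M = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for col in range(n):
        # Find a row at or below the diagonal with a nonzero entry in this column.
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular; the inverse does not exist")
        M[col], M[pivot] = M[pivot], M[col]
        # Scale the pivot row so that the pivot element becomes 1.
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    # The right half of the augmented matrix now holds the inverse.
    return [row[n:] for row in M]

def mat_vec(A, v):
    """Multiply matrix A by column vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

A = [[2, 1, -1], [-3, -1, 2], [-2, 1, 2]]
b = [8, -11, -3]
x = mat_vec(inverse(A), b)
print([int(v) for v in x])  # [2, 3, -1]
```

The three nested passes over the n × 2n augmented matrix are what give the cubic running time. In practice a numerical library would normally solve the system directly rather than forming the inverse explicitly.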
A few more miscellaneous points: we use $\log(x)$ for the natural logarithm, $\log_e(x)$. We use $\operatorname{argmax}_x f(x)$ for the value of x for which $f(x)$ is maximal.
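For a finite set of candidate values, both conventions have direct counterparts in Python. In this small sketch the function `f` and the candidate range are made up for illustration; note that `max` returns the first maximizer if there are ties.

```python
import math

def argmax(xs, f):
    """Return the element x of xs for which f(x) is largest."""
    return max(xs, key=f)

def f(x):
    # Hypothetical function with a single peak at x = 2.
    return -(x - 2) ** 2

print(argmax(range(-5, 6), f))                # 2
print(math.isclose(math.log(math.e), 1.0))    # True: math.log is the natural logarithm
```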
A.3 Probability Distributions
A probability is a measure over a set of events that satisfies three axioms:

1. The measure of each event is between 0 and 1. We write this as $0 \le P(X = x_i) \le 1$, where X is a random variable representing an event and $x_i$ are the possible values of X. In general, random variables are denoted by uppercase letters and their values by lowercase letters.
2. The measure of the whole set is 1; that is, $\sum_{i=1}^{n} P(X = x_i) = 1$.
3. The probability of a union of disjoint events is the sum of the probabilities of the individual events; that is, $P(X = x_1 \lor X = x_2) = P(X = x_1) + P(X = x_2)$, in the case where $x_1$ and $x_2$ are disjoint.

A probabilistic model consists of a sample space of mutually exclusive possible outcomes, together with a probability measure for each outcome. For example, in a model of the weather tomorrow, the outcomes might be sun, cloud, rain, and snow. A subset of these outcomes constitutes an event. For example, the event of precipitation is the subset consisting of {rain, snow}.
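A small Python sketch makes the weather model concrete. The probability values below are invented for illustration and do not come from the text; `is_valid_model` and `prob` are likewise hypothetical helper names.

```python
import math

# Hypothetical probabilities for tomorrow's weather (illustrative values only).
P = {"sun": 0.6, "cloud": 0.25, "rain": 0.1, "snow": 0.05}

def is_valid_model(P):
    """Check axioms 1 and 2: each measure lies in [0, 1] and the total is 1."""
    in_range = all(0 <= p <= 1 for p in P.values())
    return in_range and math.isclose(sum(P.values()), 1.0)

def prob(event, P):
    """Axiom 3: an event is a set of disjoint outcomes, so its probability is a sum."""
    return sum(P[outcome] for outcome in event)

precipitation = {"rain", "snow"}
print(is_valid_model(P))                            # True
print(math.isclose(prob(precipitation, P), 0.15))   # True
```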
We use $\mathbf{P}(X)$ to denote the vector of values $\langle P(X = x_1), \ldots, P(X = x_n) \rangle$. We also use $P(x_i)$ as an abbreviation for $P(X = x_i)$ and $\sum_x P(x)$ for $\sum_{i=1}^{n} P(X = x_i)$. The conditional probability $P(B \mid A)$ is defined as $P(B \cap A)/P(A)$. A and B are independent if $P(B \mid A) = P(B)$ (or equivalently, $P(A \mid B) = P(A)$).
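The definition of conditional probability and the condition $P(B \mid A) = P(B)$ can be checked on a small joint distribution. The numbers below are hypothetical, chosen so that the condition holds exactly; exact fractions are used so the comparisons are not affected by rounding.

```python
from fractions import Fraction as F

# Hypothetical joint distribution over (a, b), the truth values of events A and B.
joint = {
    (True, True): F(3, 10),
    (True, False): F(1, 5),
    (False, True): F(3, 10),
    (False, False): F(1, 5),
}

def p_a(a):
    """Marginal probability that A has truth value a."""
    return sum(p for (ai, _), p in joint.items() if ai == a)

def p_b(b):
    """Marginal probability that B has truth value b."""
    return sum(p for (_, bi), p in joint.items() if bi == b)

def p_b_given_a(b, a):
    """P(B | A) = P(B and A) / P(A)."""
    return joint[(a, b)] / p_a(a)

def p_a_given_b(a, b):
    """P(A | B) = P(A and B) / P(B)."""
    return joint[(a, b)] / p_b(b)

print(p_b_given_a(True, True), p_b(True))  # 3/5 3/5
print(p_a_given_b(True, True), p_a(True))  # 1/2 1/2
```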
For continuous variables, there are an infinite number of values, and unless there are point spikes, the probability of any one exact value is 0. So it makes more sense to talk about the value being within a range. We do that with a probability density function, which has a slightly different meaning from the discrete probability function. Since $P(X = x)$, the probability that X has the value x exactly, is zero, we instead measure how likely it is that X falls into an interval around x, compared to the width of the interval, and take the limit as the interval width goes to zero:
P(x)= Jim0 P(x