The Soar Papers
Artificial Intelligence
Patrick Henry Winston, founding editor
J. Michael Brady, Daniel G. Bobrow, and Randall Davis, current editors
Artificial intelligence is the study of intelligence using the ideas and methods of computation. Unfortunately, a definition of intelligence seems impossible at the moment because intelligence appears to be an amalgam of so many information-processing and information-representation abilities.

Of course psychology, philosophy, linguistics, and related disciplines offer various perspectives and methodologies for studying intelligence. For the most part, however, the theories proposed in these fields are too incomplete and too vaguely stated to be realized in computational terms. Something more is needed, even though valuable ideas, relationships, and constraints can be gleaned from traditional studies of what are, after all, impressive existence proofs that intelligence is in fact possible.

Artificial intelligence offers a new perspective and a new methodology. Its central goal is to make computers intelligent, both to make them more useful and to understand the principles that make intelligence possible. That intelligent computers will be extremely useful is obvious. The more profound point is that artificial intelligence aims to understand intelligence using the ideas and methods of computation, thus offering a radically new and different basis for theory formation. Most of the people doing work in artificial intelligence believe that these theories will apply to any intelligent information processor, whether biological or solid state.

There are side effects that deserve attention, too. Any program that will successfully model even a small part of intelligence will be inherently massive and complex. Consequently, artificial intelligence continually confronts the limits of computer-science technology. The problems encountered have been hard enough and interesting enough to seduce artificial intelligence people into working on them with enthusiasm. It is natural, then, that there has been a steady flow of ideas from artificial intelligence to computer science, and the flow shows no sign of abating.

The purpose of this series in artificial intelligence is to provide people in many areas, both professionals and students, with timely, detailed information about what is happening on the frontiers in research centers all over the world.

J. Michael Brady
Daniel Bobrow
Randall Davis
The Soar Papers: Research on Integrated Intelligence
Volume One
Edited by
Paul S. Rosenbloom, John E. Laird, and Allen Newell
THE MIT PRESS
CAMBRIDGE, MASSACHUSETTS
LONDON, ENGLAND
© 1993 Massachusetts Institute of Technology
All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher. Printed and bound in the United States of America.
Library of Congress Cataloging-in-Publication Data

The Soar papers: research on integrated intelligence / edited by Paul S. Rosenbloom, John E. Laird, and Allen Newell.
    p. cm.
    Includes bibliographical references and index.
    ISBN 0-262-18152-5 (hc). -- ISBN 0-262-68071-8 (pbk)
    1. Artificial intelligence. I. Rosenbloom, Paul S. II. Laird, John, 1954- . III. Newell, Allen.
Q335.S63 1992
006.3--dc20    91-48463
               CIP
This book is dedicated to:
the memory of Allen Newell, who inspired so much;
the Soar community, who contributed so much;
Elaine, Ann, and Noel, who tolerated and supported so much.
Contents

Volume One: 1969-1988

Acknowledgments / XIII
Introduction / XIX

1  Heuristic Programming: Ill-Structured Problems, by A. Newell / 3
2  Reasoning, Problem Solving, and Decision Processes: The Problem Space as a Fundamental Category, by A. Newell / 55
3  Mechanisms of Skill Acquisition and the Law of Practice, by A. Newell and P. S. Rosenbloom / 81
4  The Knowledge Level, by A. Newell / 136
5  Learning by Chunking: A Production-System Model of Practice, by P. S. Rosenbloom and A. Newell / 177
6  A Universal Weak Method, by J. E. Laird and A. Newell / 245
7  The Chunking of Goal Hierarchies: A Generalized Model of Practice, by P. S. Rosenbloom and A. Newell / 293
8  Towards Chunking as a General Learning Mechanism, by J. E. Laird, P. S. Rosenbloom, and A. Newell / 335
9  R1-Soar: An Experiment in Knowledge-Intensive Programming in a Problem-Solving Architecture, by P. S. Rosenbloom, J. E. Laird, J. McDermott, A. Newell, and E. Orciuch / 340
10  Chunking in Soar: The Anatomy of a General Learning Mechanism, by J. E. Laird, P. S. Rosenbloom, and A. Newell / 351
11  Overgeneralization During Knowledge Compilation in Soar, by J. E. Laird, P. S. Rosenbloom, and A. Newell / 387
12  Mapping Explanation-Based Generalization onto Soar, by P. S. Rosenbloom and J. E. Laird / 399
13  Efficient Matching Algorithms for the Soar/OPS5 Production System, by D. J. Scales / 406
14  Learning General Search Control from Outside Guidance, by A. R. Golding, P. S. Rosenbloom, and J. E. Laird / 459
15  Soar: An Architecture for General Intelligence, by J. E. Laird, A. Newell, and P. S. Rosenbloom / 463
16  Knowledge Level Learning in Soar, by P. S. Rosenbloom, J. E. Laird, and A. Newell / 527
17  CYPRESS-Soar: A Case Study in Search and Learning in Algorithm Design, by D. M. Steier / 533
18  Varieties of Learning in Soar: 1987, by D. M. Steier, J. E. Laird, A. Newell, P. S. Rosenbloom, R. Flynn, A. Golding, T. A. Polk, O. G. Shivers, A. Unruh, and G. R. Yost / 537
19  Dynamic Abstraction Problem Solving in Soar, by A. Unruh, P. S. Rosenbloom, and J. E. Laird / 549
20  Electronic Mail and Scientific Communication: A Study of the Soar Extended Research Group, by K. Carley and K. Wendt / 563
21  Placing Soar on the Connection Machine, by R. Flynn / 598
22  Recovery from Incorrect Knowledge in Soar, by J. E. Laird / 615
23  Comparison of the Rete and Treat Production Matchers for Soar (A Summary), by P. Nayak, A. Gupta, and P. S. Rosenbloom / 621
24  Modeling Human Syllogistic Reasoning in Soar, by T. A. Polk and A. Newell / 627
25  Beyond Generalization as Search: Towards a Unified Framework for the Acquisition of New Knowledge, by P. S. Rosenbloom / 634
26  Meta-Levels in Soar, by P. S. Rosenbloom, J. E. Laird, and A. Newell / 639
27  Integrating Multiple Sources of Knowledge into Designer-Soar: An Automatic Algorithm Designer, by D. M. Steier and A. Newell / 653
28  Soar/PSM-E: Investigating Match Parallelism in a Learning Production System, by M. Tambe, D. Kalp, A. Gupta, C. L. Forgy, B. G. Milnes, and A. Newell / 659
29  Applying Problem Solving and Learning to Diagnosis, by R. Washington and P. S. Rosenbloom / 674
30  Learning New Tasks in Soar, by G. R. Yost and A. Newell / 688

Index / A-1 (following page 703)
Volume Two: 1989-1991
31  A Discussion of "The Chunking of Skill and Knowledge" by Paul S. Rosenbloom, John E. Laird and Allen Newell, by T. Bosser / 705
32  A Comparative Analysis of Chunking and Decision-Analytic Control, by O. Etzioni and T. M. Mitchell / 713
33  Toward a Soar Theory of Taking Instructions for Immediate Reasoning Tasks, by R. L. Lewis, A. Newell, and T. A. Polk / 719
34  Integrating Learning and Problem Solving within a Chemical Process Designer, by A. K. Modi and A. W. Westerberg / 727
35  Symbolic Architectures for Cognition, by A. Newell, P. S. Rosenbloom, and J. E. Laird / 754
36  Approaches to the Study of Intelligence, by D. A. Norman / 793
37  Toward a Unified Theory of Immediate Reasoning in Soar, by T. A. Polk, A. Newell, and R. L. Lewis / 813
38  A Symbolic Goal-Oriented Perspective on Connectionism and Soar, by P. S. Rosenbloom / 821
39  The Chunking of Skill and Knowledge, by P. S. Rosenbloom, J. E. Laird, and A. Newell / 840
40  A Preliminary Analysis of the Soar Architecture as a Basis for General Intelligence, by P. S. Rosenbloom, J. E. Laird, A. Newell, and R. McCarl / 860
41  Towards the Knowledge Level in Soar: The Role of the Architecture in the Use of Knowledge, by P. S. Rosenbloom, A. Newell, and J. E. Laird / 897
42  Tower-Noticing Triggers Strategy-Change in the Tower of Hanoi: A Soar Model, by D. Ruiz and A. Newell / 934
43  "But How Did You Know To Do That?": What a Theory of Algorithm Design Process Can Tell Us, by D. M. Steier / 942
44  Abstraction in Problem Solving and Learning, by A. Unruh and P. S. Rosenbloom / 959
45  A Computational Model of Musical Creativity (Extended Abstract), by S. Vicinanza and M. J. Prietula / 974
46  A Problem Space Approach to Expert System Specification, by G. R. Yost and A. Newell / 982
1990

47  Learning Control Knowledge in an Unsupervised Planning Domain, by C. B. Congdon / 991
48  Task-Specific Architectures for Flexible Systems, by T. R. Johnson, J. W. Smith, and B. Chandrasekaran / 1004
49  Correcting and Extending Domain Knowledge Using Outside Guidance, by J. E. Laird, M. Hucka, E. S. Yager, and C. M. Tuck / 1027
50  Integrating Execution, Planning, and Learning in Soar for External Environments, by J. E. Laird and P. S. Rosenbloom / 1036
51  Soar as a Unified Theory of Cognition: Spring 1990, by R. L. Lewis, S. B. Huffman, B. E. John, J. E. Laird, J. F. Lehman, A. Newell, P. S. Rosenbloom, T. Simon, and S. G. Tessler / 1044
52  Applying an Architecture for General Intelligence to Reduce Scheduling Effort, by M. J. Prietula, W. Hsu, D. M. Steier, and A. Newell / 1052
53  Knowledge Level and Inductive Uses of Chunking (EBL), by P. S. Rosenbloom and J. Aasman / 1096
54  Responding to Impasses in Memory-Driven Behavior: A Framework for Planning, by P. S. Rosenbloom, S. Lee, and A. Unruh / 1103
55  Intelligent Architectures for Integration, by D. M. Steier / 1114
56  The Problem of Expensive Chunks and its Solution by Restricting Expressiveness, by M. Tambe, A. Newell, and P. S. Rosenbloom / 1123
57  A Framework for Investigating Production System Formulations with Polynomially Bounded Match, by M. Tambe and P. S. Rosenbloom / 1173
58  Two New Weak Method Increments for Abstraction, by A. Unruh and P. S. Rosenbloom / 1181
59  Using a Knowledge Analysis to Predict Conceptual Errors in Text-Editor Usage, by R. M. Young and J. Whittington / 1190
1991

60  Neuro-Soar: A Neural-Network Architecture for Goal-Oriented Behavior, by B. Cho, P. S. Rosenbloom, and C. P. Dolan / 1199
61  Predicting the Learnability of Task-Action Mappings, by A. Howes and R. M. Young / 1204
62  Towards Real-Time GOMS, by B. E. John, A. H. Vera, and A. Newell / 1210
63  Extending Problem Spaces to External Environments, by J. E. Laird / 1294
64  Integrating Knowledge Sources in Language Comprehension, by J. F. Lehman, R. L. Lewis, and A. Newell / 1309
65  A Constraint-Motivated Lexical Acquisition Model, by C. S. Miller and J. E. Laird / 1315
66  Formulating the Problem Space Computational Model, by A. Newell, G. R. Yost, J. E. Laird, P. S. Rosenbloom, and E. Altmann / 1321
67  A Computational Account of Children's Learning about Number Conservation, by T. Simon, A. Newell, and D. Klahr / 1360
68  Attentional Modeling of Object Identification and Search, by M. Wiesmeyer and J. Laird / 1400

Index / 1423
Acknowledgments

The editors and the publishers are grateful to the following:

Academic Press, for
Rosenbloom, P. S., Laird, J. E., and Newell, A., "The Chunking of Skill and Knowledge." Reprinted from Working Models of Human Perception, edited by B. A. G. Elsendoorn and H. Bouma, pp. 391-410. Copyright © 1989, Academic Press, London. Reprinted with permission.

Bosser, T., "A Discussion of 'The Chunking of Skill and Knowledge' by Paul S. Rosenbloom, John E. Laird and Allen Newell." Reprinted from Working Models of Human Perception, edited by B. A. G. Elsendoorn and H. Bouma, pp. 411-418. Copyright © 1989, Academic Press, London. Reprinted with permission.

The American Association for Artificial Intelligence for
Rosenbloom, P. S. and Newell, A., "Learning by Chunking: Summary of a Task and a Model." Reprinted from Proceedings of the National Conference on Artificial Intelligence, pp. 255-257. Copyright © 1982, American Association for Artificial Intelligence, Menlo Park, California. Reprinted with permission.

Laird, J. E., Rosenbloom, P. S., and Newell, A., "Towards Chunking as a General Learning Mechanism." Reprinted from Proceedings of the National Conference on Artificial Intelligence, Austin, Texas, pp. 188-192. Copyright © 1984, Association for Artificial Intelligence, Menlo Park, California. Reprinted with permission.

Laird, J. E., Rosenbloom, P. S., and Newell, A., "Overgeneralization During Knowledge Compilation in Soar." Reprinted from Proceedings of the Workshop on Knowledge Compilation, edited by T. G. Dietterich. Copyright © 1986, Association for Artificial Intelligence, Menlo Park, California. Reprinted with permission.

Rosenbloom, P. S., and Laird, J. E., "Mapping Explanation-Based Generalization onto Soar." Reprinted from Proceedings of the Fifth National Conference on Artificial Intelligence, Philadelphia, Pennsylvania, pp. 561-567. Copyright © 1986, Association for Artificial Intelligence, Menlo Park, California. Reprinted with permission.

Rosenbloom, P. S., Laird, J. E., and Newell, A., "Knowledge Level Learning in Soar." Reprinted from Proceedings of the Sixth National Conference on Artificial Intelligence, Seattle, Washington, pp. 499-504. Copyright © 1987, Association for Artificial Intelligence, Menlo Park, California. Reprinted with permission.

Flynn, R., "Placing Soar on the Connection Machine." Reprinted from How Can Slow Components Think So Fast? AAAI 1988 Mini-Symposium. Copyright © 1988, American Association for Artificial Intelligence, Menlo Park, California. Reprinted with permission.

Laird, J. E., "Recovery from Incorrect Knowledge in Soar." Reprinted from Proceedings of the Seventh National Conference on Artificial Intelligence, St. Paul, Minnesota, pp. 618-623. Copyright © 1988, Association for Artificial Intelligence, Menlo Park, California. Reprinted with permission.

Nayak, P., Gupta, A., and Rosenbloom, P. S., "Comparison of the Rete and Treat Production Matchers for Soar." Reprinted from Proceedings of the Seventh National Conference on Artificial Intelligence, St. Paul, Minnesota, pp. 693-698. Copyright © 1988, Association for Artificial Intelligence, Menlo Park, California. Reprinted with permission.

Steier, D. M. and Newell, A., "Integrating Multiple Sources of Knowledge into Designer-Soar: An Automatic Algorithm Designer." Reprinted from Proceedings of the Seventh National Conference on Artificial Intelligence, St. Paul, Minnesota, pp. 8-13. Copyright © 1988, Association for Artificial Intelligence, Menlo Park, California. Reprinted with permission.
Rosenbloom, P. S., "Beyond Generalization as Search: Towards a Unified Framework for the Acquisition of New Knowledge, pp. 17-2 1 . Reprinted from Proceedings of the AAA/ Symposium on Explanation-Based Learning, edited by G. F. DeJong. Copyright© 1988, American Association for Artificial Intelligence, Menlo Park, California. Reprint ed with permission. Etzioni, 0., and Mitchell, T. M., "A Comparative Analysis of Chunking and Decision-Analytic Control." Reprinted from the Proceed ings of the 1989 AAA/ Spring Symposium on Al and Limited Rationality, Stanford, California. Copy right© 1989, American Association for Artificial Intelligence, Menlo Park, California. Reprinted with permission. Laird, J. E. and Rosenbloom, P. S., "Integrating Execution, Planning, and Leaming in Soar for External Environ ments." Reprinted from Proceedings of the Eighth National Conference on Artificial Intelligence, Boston, Massachu setts, pp. 1022- 1 029. Copyright© 1990, Association for Artificial Intelligence, Menlo Park, California. Reprinted with permission. Rosenbloom, P. S. and Aasman, J., "Knowledge Level and Inductive Uses of Chunking ( EBL)." Reprinted from Pro ceedings of the E ighth National Conference on Artificial Intelligence, Boston, Massachusetts, pp. 821-827. Copyright © 1990, Association for Artificial Intelligence, Menlo Park, California. Reprinted with permission. Tambe, M. and Rosenbloom, P. S., "A Framework for Investigating Production System Formulations with Polynomi ally Bounded Match." Reprinted from Proceedings of the E ighth National C onference on Artificial Intelligence, Boston, Massachusetts, pp. 693-700. Copyright© 1990, Association for Artificial Intelligence, Menlo Park, California. Reprinted with permission. Unruh, A., and Rosenbloom, P. S., "Two New Weak Method �crements for Abstraction." Reprinted from Working Notes ofthe AAA/-90 Workshop on Automatic Generation ofApproximations an d Abstractions, edited by T. Ellman, pp. 78-86. Copyright© 1990, American Association for Artificial Intelligence, Menlo Park, California. Reprinted with permission. The Association for Computing Machinery, for
Tambe, M.; Kalp, D.; Gupta, A.; Forgy, C. L.; Milnes, B. G.; and Newell, A., "Soar/PSM-E: Investigating Match Parallelism in a Learning Production System." Reprinted from Proceedings of the ACM/SIGPLAN Symposium on Parallel Programming: Experience with Applications, Languages, and Systems, July, pp. 146-160. Copyright © 1988, Association for Computing Machinery, New York, New York. Reprinted with permission.

Howes, A. and Young, R. M., "Predicting the Learnability of Task-Action Mappings." Reprinted from Proceedings of CHI '91 Human Factors in Computing Systems, ACM Press, New Orleans. Copyright © 1991, Association for Computing Machinery, New York, New York. Reprinted with permission.

Newell, A., Yost, G. R., Laird, J. E., Rosenbloom, P. S., & Altmann, E., "Formulating the Problem Space Computational Model." Reprinted from Carnegie Mellon Computer Science: A 25 Year Commemorative, edited by R. F. Rashid, pp. 255-293. Reading, Massachusetts: Addison-Wesley. Copyright © 1991, Association for Computing Machinery, New York, New York. Reprinted with permission.

Young, R. M. and Whittington, J., "Using a Knowledge Analysis to Predict Conceptual Errors in Text-Editor Usage." Reprinted from Proceedings of CHI '90, April 1990, pp. 91-97. Copyright © 1990, Association for Computing Machinery, New York, New York. Reprinted with permission.

The Cognitive Science Society and Lawrence Erlbaum Associates for
Lewis, R. L., Huffman, S. B., John, B. E., Laird, J. E., Lehman, J. F., Newell, A., Rosenbloom, P. S., Simon, T., and Tessler, S. G., "Soar as a Unified Theory of Cognition: Spring 1990." Reprinted from Proceedings of the 12th Annual Conference of the Cognitive Science Society, Cambridge, Massachusetts, pp. 1035-1042. Copyright © 1990, Cognitive Science Society Incorporated, Pittsburgh, Pennsylvania. Reprinted with permission.

Wiesmeyer, M. and Laird, J., "A Computer Model of 2D Visual Attention and Search." Reprinted from Proceedings of the 12th Annual Conference of the Cognitive Science Society, Cambridge, Massachusetts. Copyright © 1990, Cognitive Science Society Incorporated, Pittsburgh, Pennsylvania. Reprinted with permission.

Cho, B., Rosenbloom, P. S., and Dolan, C. P., "Neuro-Soar: A Neural-Network Architecture for Goal-Oriented Behavior." Reprinted from the Proceedings of the Thirteenth Annual Conference of the Cognitive Science Society, pp. 673-677. Copyright © 1991, Cognitive Science Society Incorporated, Pittsburgh, Pennsylvania. Reprinted with permission.

Lehman, J. F., Lewis, R. L., and Newell, A., "Integrating Knowledge Sources in Language Comprehension." Reprinted from the Proceedings of the Thirteenth Annual Conference of the Cognitive Science Society. Copyright © 1991, Cognitive Science Society Incorporated, Pittsburgh, Pennsylvania. Reprinted with permission.

Miller, C. S., and Laird, J. E., "A Constraint-Motivated Lexical Acquisition Model." Reprinted from the Proceedings of the Thirteenth Annual Conference of the Cognitive Science Society. Copyright © 1991, Cognitive Science Society Incorporated, Pittsburgh, Pennsylvania. Reprinted with permission.

Polk, T. A. and Newell, A., "Modeling Human Syllogistic Reasoning in Soar." Reprinted from the Proceedings of the Tenth Annual Conference of the Cognitive Science Society, Montreal, pp. 181-187. Copyright © 1988, Cognitive Science Society Incorporated, Pittsburgh, Pennsylvania. Reprinted with permission.

Lewis, R. L., Newell, A., and Polk, T. A., "Toward a Soar Theory of Taking Instructions for Immediate Reasoning Tasks." Reprinted from Proceedings of the Eleventh Annual Conference of the Cognitive Science Society, pp. 514-512. Copyright © 1989, Cognitive Science Society, Inc., Pittsburgh, Pennsylvania. Reprinted with permission.

Polk, T. A., Newell, A., and Lewis, R. L., "Toward a Unified Theory of Immediate Reasoning in Soar." Reprinted from Proceedings of the Eleventh Annual Conference of the Cognitive Science Society, pp. 506-513. Copyright © 1989, Cognitive Science Society, Inc., Pittsburgh, Pennsylvania. Reprinted with permission.

Ruiz, D. and Newell, A., "Tower-Noticing Triggers Strategy-Change in the Tower of Hanoi: A Soar Model." Reprinted from Proceedings of the Eleventh Annual Conference of the Cognitive Science Society, pp. 522-529. Copyright © 1989, Cognitive Science Society, Inc., Pittsburgh, Pennsylvania. Reprinted with permission.

Elsevier Science Publishers for
Newell, A., " The Knowledge Level." Reprinted from Artificial Intelligence, Volume 18, pp. 87-1 27. Copyright © 1982, Elsevier Science Publishers B.V., Amsterdam, The Netherlands. Reprinted with permission. Laird, J. E., Newell, A., & Rosenbloom, P. S., " Soar: An Architecture for General Intelligence." Reprinted from Artifi cial Intelligence, Volume 33, pp. 1-64. Copyright© 1987, Elsevier Science Publishers B.V., Amsterdam, The Nether lands. Reprinted with permission. Norman, D. A., " Approaches to the Study of Intelligence," Reprinted from Artificial Intelligence, Vol. 47, pp. 327346. Copyright© 199 1 , Elsevier Science Publishers B.V., Amsterdam, The Netherlands. Reprinted with permission. Rosenbloom, P. S., Laird, J. E., Newell, A., and McCarl, R., " A Preliminary Analysis of the Soar Architecture as a Basis for General Intelligence." Reprinted from Artificial Intelligence, Volume 47, pp. 289-325. Copyright© 199 1 , Elsevier Science Publishers B.V., Amsterdam, The Netherlands. Reprinted with permission. Rosenbloom, P. S., " A Symbolic Goal-Oriented Perspective on Connectionism and Soar." Reprinted from Connection ism in P erspective, edited by R. Pfeifer, Z. Schreter, F. Fogelman-Soulie, and L. Steels, pp. 245-263.Copyright © 1989, Elsevier Science Publishers B.V., Amsterdam, The Netherlands. Reprinted with permission. Rosenbloom, P. S., Laird, J. E., and Newell, A., " Meta-Levels in Soar." Reprinted from Meta-Level Architectures and Reflection, edited by P. Maes and D. Nardi, pp. 227-240. Copyright© 1988, Elsevier Science Publishers B.V., Amster dam, The Netherlands. Reprinted with permission. Todd R. Johnson for
Johnson, T. R., Smith, J. W., and Chandrasekaran, B., "Task-Specific Architectures for Flexible Systems." Reprinted with permission.
The Institute of Electrical and Electronics Engineers, Inc. for
Rosenbloom, P. S., Laird, J. E., McDermott, J., Newell, A. and Orciuch, E., "R1-Soar: An Experiment in Knowledge-Intensive Programming in a Problem-Solving Architecture." Reprinted from IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 7, pp. 561-569. Copyright © 1985, The Institute of Electrical and Electronics Engineers, Inc., Piscataway, New Jersey. Reprinted with permission.

Steier, D. M., "Intelligent Architectures for Integration." Reprinted from Proceedings of the IEEE Conference on Systems Integration, August 1990. (A shorter version of this paper appeared in the proceedings.) Copyright © 1990, The Institute of Electrical and Electronics Engineers, Inc., Piscataway, New Jersey. Reprinted with permission.

International Joint Conferences on Artificial Intelligence, Inc. for
Golding, A. R., Rosenbloom, P. S., and Laird, J. E., "Learning General Search Control from Outside Guidance." Reprinted from Proceedings of the Tenth International Joint Conference on Artificial Intelligence, pp. 334-337. Copyright © 1987, International Joint Conferences on Artificial Intelligence, Inc. Copies of this and other IJCAI proceedings are available from Morgan Kaufmann Publishers, Inc., 2929 Campus Drive, San Mateo, California 94403. Reprinted with permission.

Steier, D., "CYPRESS-Soar: A Case Study in Search and Learning in Algorithm Design." Reprinted from Proceedings of the Tenth International Joint Conference on Artificial Intelligence, pp. 327-330. Copyright © 1987, International Joint Conferences on Artificial Intelligence, Inc. Copies of this and other IJCAI proceedings are available from Morgan Kaufmann Publishers, Inc., 2929 Campus Drive, San Mateo, California 94403. Reprinted with permission.

Unruh, A. and Rosenbloom, P. S., "Abstraction in Problem Solving and Learning." Reprinted from Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, pp. 681-687. Copyright © 1989, International Joint Conferences on Artificial Intelligence, Inc. Copies of this and other IJCAI proceedings are available from Morgan Kaufmann Publishers, Inc., 2929 Campus Drive, San Mateo, California 94403. Reprinted with permission.

Yost, G. R. & Newell, A., "A Problem Space Approach to Expert System Specification." Reprinted from Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, pp. 621-627. Copyright © 1989, International Joint Conferences on Artificial Intelligence, Inc. Copies of this and other IJCAI proceedings are available from Morgan Kaufmann Publishers, Inc., 2929 Campus Drive, San Mateo, California 94403. Reprinted with permission.

Bonnie E. John, Carnegie Mellon University, and Nintendo of America, Inc. for
John, B. E., Vera, A. H., and Newell, A., "Towards Real-Time GOMS." Abbreviated version of "Towards Real-Time GOMS," by John, B. E., Vera, A. H., and Newell, A. Reprinted from Carnegie Mellon University School of Computer Science Technical Report CMU-CS-90-195. The illustration of the first screen from Super Mario Brothers 3® where Mario starts World 1, Level 1, and the photographs from pages 6 and 11 from the Super Mario Brothers 3® Instruction Booklet are copyright © 1991, Nintendo. Used with permission of Nintendo of America Inc.

Kluwer Academic Publishers for
Laird, J. E., Rosenbloom, P. S., and Newell, A., "Chunking in Soar: The Anatomy of a General Learning Mechanism." Reprinted from Machine Learning, Volume 1, pp. 11-46. Copyright © 1986, Kluwer Academic Publishers, Norwell, Massachusetts. Reprinted with permission.

Tambe, M., Newell, A., and Rosenbloom, P. S., "The Problem of Expensive Chunks and Its Solution by Restricting Expressiveness." Reprinted from Machine Learning, Volume 5, pp. 299-348. Copyright © 1990, Kluwer Academic Publishers, Norwell, Massachusetts. Reprinted with permission.
Lawrence Erlbaum Associates, Inc. for
Newell, A., "Reasoning, Problem Solving and Decision Processes: The Problem Space as a Fundamental Category." Reprinted from Attention and Performance Vil/, edited by R. Nickerson. Copyright© 1980, Lawrence Erlbaum Asso ciates, Inc., Hillsdale, New Jersey. Reprinted with permission.
Newell, A. and Rosenbloom, P. S., "Mechanisms of Skill Acquisition and the Law of Practice." Reprinted from Cognitive Skills and their Acquisition, edited by J. R. Anderson. Copyright © 1981, Lawrence Erlbaum Associates, Inc., Hillsdale, New Jersey. Reprinted with permission.

Rosenbloom, P. S., Newell, A., and Laird, J. E., "Towards the Knowledge Level in Soar: The Role of the Architecture in the Use of Knowledge." Reprinted from Architectures for Intelligence, edited by K. VanLehn, pp. 75-111. Copyright © 1991, Lawrence Erlbaum Associates, Inc., Hillsdale, New Jersey. Reprinted with permission.

Ajay K. Modi for
Modi, A. K. and Westerberg, A. W., "Integrating Learning and Problem Solving within a Chemical Process Designer." Presented at the Annual Meeting of the American Institute of Chemical Engineers. Reprinted with permission.

Morgan Kaufmann Publishers for
Rosenbloom, P. S. and Newell, A., " The Chunking of Goal Hierarchies: A Generalized Model of Practice." Reprinted from Machine Learning: An Artificial Intelligence Approach, Volume II, edited by R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, pp. 247-288. Copyright© 1986,,Morgan Kaufmann Publishers, Inc., Los Altos, California. Reprinted with permission. Steier, D. M., Laird, J. E., Newell, A., Rosenbloom, P. S., Flynn, R., Golding, A., Polk, T. A., Shivers, 0. G., Unruh, A., and Yost, G. R., " Varieties of Leaming in Soar: 1987." Reprinted from Proceedings of the Fourth International Workshop on Machine Learning, edited by P. Langley, pp. 300-3 1 1 . Copyright© 1987, Morgan Kaufmann Publishers, Inc., Los Altos, California. Reprinted with permission. Laird, J. E., Hucka, M., Yager, E.S., and Tuck, C.M., " Correcting and Extending Domain Knowledge Using Outside Guidance." Reprinted from Proceedings of the Seventh International Conference on Machine Learning, pp. 235-243. Copyright© 1989, Morgan Kaufmann Publishers, Inc., Los Altos, California. Reprinted with permission. Simon, T., Newell, A., and Klahr, D., " Q-Soar: A Computational Account of Children's Leaming about Number Con servation." Reprinted from Computational Approaches to Concept Formation, edited by D. Fisher and M. Pazzani. Copyright© 199 1 , Morgan Kaufmann, Los Altos, California. Reprinted with permission. Operations Research Society of America for
Prietula, M. J., Hsu, W., Steier, D. M., and Newell, A., "Applying an Architecture for General Intelligence to Reduce Scheduling Effort." Reprinted from ORSA Journal on Computing. Copyright © 1992, Operations Research Society of America, Baltimore, Maryland. Reprinted with permission.

Mike Prietula for
Vicinanza, S. and Prietula, M. J., "A Computational Model of Musical Creativity (Extended Abstract)." Prepared for the AI and Music Workshop, Eleventh International Joint Conference on Artificial Intelligence. Reprinted with permission.

Sage Publications, Inc. for
Carley, K. and Wendt, K., "Electronic Mail and Scientific Communication: A Study of the Soar Extended Research Group." Reprinted from Knowledge: Creation, Diffusion, Utilization, Volume 12, pp. 406-440. Copyright © 1991, Sage Publications, Inc., Newbury Park, California. Reprinted with permission.

Stanford University for
Scales, D. J., "Efficient Matching Algorithms for the Soar/Ops5 Production System," Technical Report KSL-8647. Reprinted with permission.

David Steier for
" But How Did You Know To Do That?": What a Theory of Algorithm Design Process Can Tell Us," Engineering
R. Washington for
Washington, R. and Rosenbloom, P. S., "Applying Problem Solving and Learning to Diagnosis." Department of Computer Science, Stanford University. Reprinted with permission.

John Wiley and Sons for
Newell, A., "Heuristic Programming: Ill-Structured Problems." Reprinted from Progress in Operations Research, Ill, edited by J. Aronofsky, pp. 360-414. Copyright© 1969, John Wiley and Sons, New York, 1969. Reprinted with per mission. The University of Michigan for
Congdon, C. B., "Learning Control Knowledge in an Unsupervised Planning Domain," Artificial Intelligence Laboratory, The University of Michigan. Reprinted with permission.

Laird, J. E., "Extending Problem Spaces to External Environments," Artificial Intelligence Laboratory, The University of Michigan.

Amy Unruh for
Unruh, A., Rosenbloom, P. S., and Laird, J. E., "Dynamic Abstraction Problem Solving in Soar." Reprinted from Proceedings of the Third Annual Aerospace Applications of Artificial Intelligence Conference, Dayton, Ohio, pp. 245-256. Reprinted with permission.

Gregg Yost for
Yost, G. R. and Newell, A., "Learning New Tasks in Soar," Department of Computer Science, Carnegie-Mellon University. Reprinted with permission.
Introduction
The Soar project is an attempt to develop and apply a unified theory of human and artificial intelligence. At the core of this effort is an investigation into the architecture-the fixed base of tightly-coupled mechanisms-underlying intelligent behavior. This architecture then forms the basis for wide-ranging investigations into basic intelligent capabilities-such as problem solving, planning, learning, knowledge representation, natural language, perception, and robotics-as well as applications in areas such as expert systems and psychological modeling. This is a true cognitive-science enterprise, where human and artificial evidence and criteria are constantly intermingled in service of progress in both areas [u:91].

Since the project's official inception in 1983, it has grown from a three-man effort to a geographically distributed, interdisciplinary community of more than ninety researchers (as of early 1992). To a first approximation, the community consists of computer scientists and psychologists in the United States and Europe; however, there is representation from other locales (such as Canada) and disciplines, primarily within engineering and the social sciences. What binds this community together is a commitment to develop, study and use a single software architecture-the Soar architecture-as the basis for intelligent behavior. To date, this shared enterprise has produced over one hundred and sixty articles and books across the board on Soar and intelligent behavior. The wide range of topics covered by this body of scientific literature has led to an equally wide range of publication methods, and thus to a situation where investigators within individual disciplines have trouble locating the full set of publications of
interest. The principal purpose of the present volumes is not to add further to this literature and problem, but to help solve it by bringing together in one place a relatively comprehensive set of core papers from the Soar community. Towards this end we have selected sixty-eight articles for inclusion out of those available by early 1991. This includes articles previously published in journals, conferences, workshops, and books, as well as some technical reports and unpublished papers.

In selecting the set of articles to be included, one criterion was whether an article makes a principal contribution on its own, independent of the other articles. Though this still leaves some overlap among the articles, it is primarily limited to the introductory descriptions of Soar that, by necessity, occur in nearly all of the articles. Many of these descriptions actually provide distinct ways of viewing Soar-for example, as a hierarchy of cognitive levels, or as a general goal-oriented system-and thus maintain some degree of independent utility, while the remainder tend to be short. We ask the reader's forbearance in this matter.

Most published collections of papers fall within one of two categories. Either they contain papers by unrelated groups of researchers covering the state of the art in some generic topic area, or they contain papers by members of a single tightly-knit project providing a retrospective on the project's accomplishments. The present collection fits into a niche that partially overlaps these
two traditional categories, but also covers territory between them. The contributors are members of the Soar community, comprising a set of tightly-knit projects bound together in a looser confederation. The topic is simultaneously quite narrow-it is focused on Soar-and unusually broad, attempting to cover the full span of intelligent behavior. The perspective combines state-of-the-art papers with their historical antecedents, including some important antecedents (by community members) that predate the construction of Soar. Rather than serving primarily a retrospective function, this should provide a window onto an ongoing and vital research enterprise.

In broadest terms, the intended audience for these volumes is researchers interested in intelligent behavior. Because the prime focus of Soar is on integrating together the capabilities required for intelligent behavior, these volumes should be of most interest to researchers studying unified theories of cognition, cognitive architectures, intelligent agents, integrated architectures for intelligence, etc. However, many of the articles also make contributions in more limited areas; particularly in learning, problem solving, expert systems, robotics, production systems, immediate reasoning, and human-computer interaction. Though these volumes assume no background on Soar-the relevant introductory articles are included-many of the articles do assume some amount of technical background in artificial intelligence or cognitive psychology.

The structure of these volumes is atypical for a collection of papers. The standard approach is to divide the articles into parts according to each article's principal contribution. The index, or discursive front matter, is then used to deal with hidden secondary contributions. Such an approach was not taken for this collection because it loses the logical (and historical) flow of the material-which is particularly important when following the development of a single system such as Soar-and because too many of the articles make principal contributions in multiple areas (in addition to secondary contributions). The collection is therefore physically organized around chronological parts, with an alphabetic ordering by authors used within each part. Volume One covers the direct precursors to Soar (prior to 1983) and 1983 through 1988. Volume Two covers 1989 through the early part of 1991.

The remainder of this introduction comprises sections discussing Soar's direct precursors, its architecture, implementation issues, the capabilities it supports, the domains in which it has been applied, its use in psychological modeling, perspectives on it and the accompanying research effort, and its use as a programmable system. These sections correspond to what would be part introductions in a more traditional organization, and provide a conceptual road map for the collection through citation of the key articles that cover these topics. Each citation is a two-part code of the form number:year. The number part specifies where the article falls in the alphabetical listing for its year. The year part specifies the year of publication for published papers, and the year of completion for unpublished papers. So, for example, the citation 9:89 indicates the ninth article alphabetically in 1989. Occasionally there was a very long gap between completion of an article and when it was finally published. In such cases, a pair of years is specified, as in 1:88/91.
The first specifies the year of completion (1988) while the second specifies the year of publication (1991). To maintain the logical flow of the articles, such an article is included in the part corresponding to the first-that is, completion-date. So, citation 1:88/91 appears as the first article in the 1988 part.
Immediately following the citations to articles in the collection, and occasionally instead of them, citations are also included to Soar articles (and books) not in the collection. Though these external citations are not an exhaustive set, they do include a number of the most relevant ones through part of 1992. External citations are structurally similar to internal ones, except that external articles are ordered within years by letters rather than numbers. So, for example, b:90 is the second additional article listed in 1990. To further differentiate internal and external citations, where there are both, they are always printed in separate lists.

Following the introduction is a complete bibliography of the included articles, along with their associated citations. Enclosed in square brackets at the end of each entry is the volume and page within this collection where the article can be found. Following the bibliography of included articles is a bibliography for the external citations. Then come the actual articles included in this volume, organized into their chronological parts.
Direct Precursors
We are now in a position to seriously attempt the construction of integrated intelligent systems only because of the steady background of progress in understanding the individual components of intelligent behavior. The development of Soar owes much to this rich background. The direct precursors cover the portion of this background most closely tied to Soar; in particular, the background research by the developers of Soar, and some of their close colleagues, that is on the direct line to the development of Soar. This is in no way intended to deny the influence of other work on Soar, but is instead consistent with this collection's focus on Soar and the community of researchers that have grown up around it.

The initial driver in the development of Soar was to combine a flexible problem solving capability based on problem spaces and weak methods with the recognitional use of knowledge that is provided by production systems. Because of their strong influence on the development of Soar, and their current lack of accessibility (in particular to researchers in artificial intelligence), classic papers on problem spaces [1:80] and weak methods [1:69] are included in this collection. The other classic precursor on these topics is [a:72], but that is a book unto itself.

The critical production-system precursors fall into four groups. The first group covers the initial development of production systems for cognitive modeling [a:73]. The second group covers the Ops5 production-system language [a:81], which is the original basis for Soar's current memory architecture. The third group covers the instructable production system project [a:78, a:80], which is the seed out of which the Soar project grew. The fourth group covers a production-system language, Xaps2 [2:82/87], that was both the basis for the first implementation of Soar, and the basis for the implementation of the earliest version of Soar-style chunking.

Soar's chunking mechanism arose out of earlier work on models of human practice. This work progressed from an analysis of the data on human practice-in particular, the power-law shape of practice curves-and the development of an abstract chunking model [1:81]; to the development of a task-specific, production-system implementation of chunking [2:82/87]; to a
task-independent, goal-based implementation [2:83/86] [d:83]. It was a variant of this last approach that was then implemented in Soar.

The knowledge level did not play a large role in the early development of Soar, but it has played an increasing role in our recent understanding of the system. The seminal paper on the knowledge level [1:82] is included in this collection.
Architecture
In computer science, the architecture of a system is the fixed structure that provides a system that can be programmed. It often coincides with the boundary between hardware and software, but it need not-for example, when one programming language is built on top of another. In cognitive science, the term has been extended to refer to the architecture of the mind; that is, the fixed structure underlying the flexible domain of cognitive processing. Architectures proposed as the basis for human cognition are termed cognitive architectures, while those proposed as the basis for artificial cognition are referred to as architectures for integrated intelligent systems (or architectures for intelligent agents, or architectures for general intelligence). Introductions to the former can be found in [5:89] [d:90, n:92] and a brief note on the latter can be found in [j:89]. Our ultimate goal for the Soar architecture is that it serve as a basis for both human and artificial cognition. There is no generally accepted term for such a combination, so how Soar is described usually varies by context.

Since the construction of the first major version of Soar (Soar1) in 1982, the architecture has always been embodied as runnable code. During this period, revisions to the theory and implementation have led to five additional major versions of Soar-Soar2 in 1983, Soar3 in 1984, Soar4 in 1986, Soar5 in 1989, and Soar6 in 1992. Soar1 through Soar5 were all implemented in Lisp, while Soar6 is in C.

The prime rationale for Soar1 [1:83] was the development of a universal weak method-that is, a system capable of exhibiting a wide range of problem-solving strategies-through the combination of (multiple) problem spaces and production systems (in this case, the Xaps2 production system [2:82/87]). Goals, problem spaces, states and operators were symbolized in the production system's working memory. Processing was driven by an elaboration-decision-application cycle. Control knowledge was brought to bear during the elaboration phase via the parallel firing of productions, with the results being integrated together during the decision phase via a voting scheme. Soar1 operators were applied when they were selected.

Soar2 [b:83] was developed to explain how and why subgoals arise. Soar1's deliberate generation strategy was augmented by a universal-subgoaling mechanism that automatically created subgoals for impasses (then referred to as difficulties) in decision making. To facilitate the detection and diagnosis of impasses, the vote-based decision procedure was replaced by one based on symbolic preferences. The processing cycle was also simplified by dropping the separate application phase-operator application now occurred by a combination of elaboration and decision. Soar2, along with all later versions of Soar, utilized a modified version of the Ops5 production system [a:81] rather than Xaps2.
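To give a concrete feel for the processing cycle just described, the following is a minimal, illustrative sketch in Python. It is not code from any version of Soar: working memory is reduced to a set of attribute-value tuples, productions to pairs of condition and action functions, and the decision procedure to the acceptable/reject preference vocabulary of Soar2 rather than Soar1's voting scheme. All names (op-a, op-b, s1) are invented for the example.

    def elaborate(wm, productions):
        # Elaboration phase: fire all matching productions "in parallel" until
        # quiescence, monotonically adding their results to working memory.
        while True:
            new = set()
            for condition, action in productions:
                if condition(wm):
                    new |= action(wm) - wm
            if not new:
                return wm          # quiescence: nothing further to add
            wm = wm | new

    def decide(wm):
        # Decision phase: interpret the symbolic preferences accumulated during
        # elaboration; a unique surviving candidate is selected, anything else
        # would signal an impasse (and, in Soar2 and later, a subgoal).
        acceptable = {x for (kind, x) in wm if kind == "acceptable"}
        rejected = {x for (kind, x) in wm if kind == "reject"}
        candidates = acceptable - rejected
        return candidates.pop() if len(candidates) == 1 else None

    # Toy knowledge: propose two operators for state s1, then reject one of them.
    productions = [
        (lambda wm: ("state", "s1") in wm,
         lambda wm: {("acceptable", "op-a"), ("acceptable", "op-b")}),
        (lambda wm: ("acceptable", "op-b") in wm,
         lambda wm: {("reject", "op-b")}),
    ]

    wm = elaborate({("state", "s1")}, productions)
    print(decide(wm))              # -> op-a; no unique choice would mean an impasse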
Soar3 completed the transition started with Soar2 by making subgoal generation and termination completely automatic. Soar1's deliberate generation strategy was removed, and automatic subgoal termination was added. In the process, working memory was extended to contain a stack of problem-solving contexts, rather than just the current one, and production firing and subgoal termination were allowed to proceed in parallel for all of these contexts. The other major addition in Soar3 was learning. A chunking mechanism was constructed that built new productions based on the results of subgoal-based problem-solving experience [1:84, 1:86].

Soar4 [2:87, 10:89/91] was the first publicly-released, and officially documented [c:86], version of the architecture. It embodied a number of refinements, particularly in chunking, but its primary reason for separation from Soar3 was its public status. An early detailed description of Soar4 can be found in [2:87]. A later abstract description of Soar4 can be found in [10:89/91], along with an analysis of Soar4 as an architecture for general intelligence. (Though the description in [10:89/91] is mostly of Soar4, there are aspects of Soar5 in it as well.)
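To make the chunking idea introduced with Soar3 slightly more concrete, here is a schematic sketch in the same illustrative Python style as the earlier fragment. It is not the actual mechanism: real chunking derives its conditions by tracing back from a subgoal's results to the pre-subgoal working-memory elements they depended on, whereas this toy version is simply handed the dependencies, and the element names are invented for the example.

    def chunk(pre_subgoal_wm, dependencies, result):
        # Build a new production whose conditions are the pre-subgoal elements the
        # result depended on, and whose action produces the result directly, so the
        # same situation no longer requires re-deriving the result in a subgoal.
        conditions = frozenset(d for d in dependencies if d in pre_subgoal_wm)

        def condition(wm):
            return conditions <= wm

        def action(wm):
            return {result}

        return condition, action

    # Example: a subgoal evaluating operator op-a consulted ("task", "blocks") and
    # returned ("best", "op-a"); the chunk short-circuits that subgoal next time.
    condition, action = chunk(
        pre_subgoal_wm={("task", "blocks"), ("goal", "evaluate-op-a")},
        dependencies=[("task", "blocks")],
        result=("best", "op-a"))

    later_wm = {("task", "blocks")}
    print(action(later_wm) if condition(later_wm) else "no match")   # {('best', 'op-a')}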
Soar5 was developed to support the single-state principle (that only one state should be available per problem space) [d:90], to eliminate the large amounts of state copying that occurred during operator application in the earlier architectures, and to support interaction with the outside world [4:91]. In the process, a number of significant modifications were made to the architecture, including the use of preferences for all modifications to working memory (enabling destructive state modification, among other things), the distinction between persistent and ephemeral working-memory elements (which is supported by a justification-based truth-maintenance system), and the development of input and output interfaces. Soar5 is publicly distributed with a completely new manual [b:90].

Soar6 is the first complete reimplementation of the Soar architecture. Functionally it is quite close to Soar5; however, it provides improvements in efficiency through the more careful selection of algorithms, and should provide significant improvements in robustness and maintainability by bringing the implementation up to the level of modern software engineering practice. Also, in a radical departure, Soar6 is written in C rather than in Lisp to improve portability and to further improve efficiency. Portability is of particular concern in supporting the large and rather diffuse Soar community, and in supporting investigations into the implementation of Soar on parallel machines. The development of Soar6 is being driven by a specification that provides a formal implementation-free description of the Soar architecture.
Implementation Issues
Though the development of Soar6 is almost totally driven by implementation concerns such as robustness, maintainability, efficiency, and portability, it is by no means the first or only investigation of these issues with respect to Soar. Throughout the evolution of the Soar architecture there has been an ongoing concern with implementation issues. The three such issues that have led to significant research-as opposed to simply coding effort-are efficiency, boundedness, and scalability.

Efficiency is a constant concern because it has such a large impact on the size and complexity of the systems that can be developed within Soar. Thus a good deal of effort has gone into optimizing its innermost loop-the production matcher. Some of this has been concerned with improving the existing serial algorithms, while the remainder has been concerned with developing and implementing parallel algorithms. On the serial side, a set of enhancements has been made to the implementation of the Rete algorithm that came with the Lisp version of Ops5 [4:86], and then this implementation was compared with an implementation based on the Treat algorithm [4:88]. On the parallel side, matchers for Soar have been investigated for both the Encore Multimax [9:88] and the Connection Machine [2:88]. The Encore implementation was based on earlier work on parallel implementations of Ops5 [a:83, a:84, a:86, b:86]. The Connection Machine implementation was substantially different, and only partially completed.

Boundedness interacts with efficiency, but is fundamentally distinct. It consists of the ability to limit the resources, such as time, utilized by architectural processes. An appropriate limit may sometimes be a fixed value, such as one second, but at other times might be a restriction on the computational complexity of the process, such as restricting the match time of a production to be linear in the number of conditions. Boundedness did not start out as a major issue for Soar, but has become of increasing concern because of its impact on real-time behavior, on the complexity of parallel implementations, and on the utility of learned rules. As with efficiency, the research effort here has so far been focused on bounding the time required to match productions to working memory [10:90, 11:90] [n:89, g:90].

Though a number of reasonably large systems have been constructed in Soar over the years-for example, Neomycin-Soar [10:88], which reached 4869 productions, and NL-Soar [5:91] [n:91], which reached 1838 productions-only recently has scalability begun to be explicitly addressed as a research issue [e:92].
Capabilities
The architecture provides a set of fixed mechanisms that directly perform a limited number of low-level functions, such as creating, matching and executing productions; selecting problem-space components; and detecting the existence and disappearance of impasses. What it doesn't do is provide directly all of the higher-level capabilities required of an intelligent system, such as the use of a wide range of problem solving and planning methods, or the performance of skill and knowledge acquisition. These capabilities arise, if at all, from combining the architecture with the appropriate knowledge. The architecture may directly provide aspects of the capability-for example, the storage aspect of a learning capability-but for the rest, the most that can be hoped is that the architecture effectively supports the requisite use of the appropriate knowledge.

Soar-based research on higher-level capabilities has both a top-down thread and a bottom-up thread. The top-down thread focuses on understanding how Soar can provide the full range of higher-level capabilities that have independently been assessed as important in producing intelligent behavior. The principal issues raised in this work are: whether Soar can produce the capability; how easily, naturally, and elegantly Soar can do so; what role the architecture plays in producing
(or hindering the production of) the capability; and what, if any, additional action a Soar-based version of the capability provides over an isolated implementation of it. The bottom-up thread-what is sometimes referred to as "listening to the architecture"-focuses on the understanding of the, possibly novel, capabilities that are suggested by experience with Soar and its use. The principal issues raised in this work are: what the capability is, how it is produced, and what it is good for. Of course, even in top-down research, there is usually a bottom-up aspect-as when a Soar-based implementation of a known capability takes on an interestingly different shape because of the particular functions and constraints provided by the architecture-and vice versa (e.g., when it is discovered that a particular interaction among Soar's mechanisms provides a capability that maps quite well onto one that is well known in the literature, but with a novel way of producing it).

Capability research in Soar also has, at least conceptually, both an isolative thread and an integrative thread. The isolative thread investigates individual capabilities divorced from each other. The integrative thread investigates how capabilities combine-both with other capabilities of the same type (e.g., the combination of multiple problem solving methods) and with capabilities of other types (e.g., the combination of abstraction planning and explanation-based learning). While some of the capability research in Soar does follow one of these pure threads, most of the research actually intertwines them in a particular way, focusing on the production of an individual capability plus that capability's relationship to Soar's various architectural mechanisms. So, for example, work on abstraction focuses on how abstraction is produced, and how it interacts with impasses, chunking, etc. It does not, to nearly the same extent, focus on abstraction's integration with other high-level capabilities, except for some planning or problem solving capability to which the abstraction must be tied.

The high-level capabilities investigated so far in Soar are most naturally partitioned into four general types: problem solving and planning, learning, external interaction, and knowledge representation. This ordering roughly corresponds to the ordering in which these capability types began to be addressed in Soar, and is the order in which we go through them here.

Problem solving and planning capabilities were central to the early development of Soar, and still attract considerable research effort. The original focus was on weak problem-solving methods. In particular, a universal weak method was developed that could, with the addition of small increments of knowledge, exhibit the range of standard AI search methods [1:83] [c:83, b:83]. Other major thrusts in this area include investigations of goals [b:83, e:91, a:92], abstraction [6:87, 14:89, 12:90, 8:90] [k:88], planning [4:90, 8:90] [i:88, f:89, i:91, c:92, d:92, m:92], generic tasks [2:90] [c:88, j:88, h:89, g:91, j:91, o:92], abduction [f:91, j:91, l:91], and mental models [5:88, 7:89, 3:89]. There has also been a continuing low-level interest in analogy and case-based reasoning [1:87, 15:89] [a:87, f:90, c:92], plus more recent thrusts in multi-tasking [b:92] and multi-agent problem solving [a:91].

Overviews of the learning research in Soar can be found in [1:86, 5:87] [e:86, c:87].
At the grossest level, this research can be partitioned into work on skill acquisition, knowledge acquisi tion, and integrated learning. Most of the early work was on skill acquisition; in particular, on learning to perform existing tasks more quickly. Our work with chunking began with its applica-
as a model of human practice [1:81, 2:82/87, 2:83/86, 9:89] [b:82, d:83], which then grew into investigations of the acquisition of control, (macro-)operator, and reactive knowledge [1:84, 1:86, 1:87, 2:89] [i:91]. This expansion into core machine learning topics raised the question of the relationship of chunking to explanation-based learning [3:86] [c:90]. It also forced us to address the utility question-that is, whether chunking was actually speeding up Soar when measured in terms of real time-and led to the identification of the problem of expensive chunks, and to a space of possible solutions [10:90, 11:90] [m:88, n:88, n:89, g:90, v:91, w:91].

Investigations of knowledge acquisition began by looking at simple forms of associative learning [3:87, 9:89, 6:88, 11:89/91, 7:90, 1:90] [c:87]. On the cognitive side this focused on forms of verbal learning, such as learning to recognize and recall words and nonsense syllables, while on the AI side it focused on demonstrating how chunking could perform knowledge-level learning. More complex forms of knowledge acquisition then began to appear in work on inductive generalization and concept formation [7:90, 6:91, 14:89, 5:90, 8:91] [c:84, j:88, i:92]. At the most complex end comes work on the acquisition, extension and correction of domain models (that is, problem spaces) [2:86, 3:88, 11:88, 3:89, 3:90, 1:90] [b:88, k:88, f:92]. Investigations of integrated learning can be found in [5:87, 9:89, 6:88, 7:90] [j:88]. Most of this work focuses on combining the use of chunking as a skill acquisition mechanism with its use as the storage mechanism underlying knowledge acquisition.

The study of external interaction in Soar is much less mature than either the study of problem solving and planning or the study of learning. It did not really come into its own until the development of the interaction framework in Soar5, and Soar still does not have a stable and well-developed perceptual-motor system. However, work has been progressing along a number of fronts in getting Soar to interact with people, with the physical world, and with other software systems. Research on interaction with people is focused on ways they can communicate to Soar other than via the standard programming model of the person directly creating new structures in Soar's memory. This has consisted of studies of in-context advice taking [1:87, 3:90] [m:91], instruction taking [11:88, 3:89], and natural language comprehension [5:91, 3:89, 5:90] [b:84, g:89, o:89, n:91]. Research on interaction with the physical world has focused on low-level work on perceptual attention and search [9:91] [i:90, p:92], goal-directed visual processing [g:92, i:91], robotics (hand-eye and mobile-hand systems) [3:90, 4:90, 4:91] [m:91, i:89], and interruption [4:90] [o:88, f:89, b:92]. Research on interaction with other software systems is still in its formative stages, but there has been work on both low-level communication details and high-level analyses of issues [p:91].

Knowledge representation has not traditionally been a major explicit focus in Soar. Since its inception, Soar has been based on objects as sets of attribute-value pairs, long-term knowledge as sets of productions, and domain models as sets of problem spaces. On top of this basic structure sits default knowledge which defines the top problem solving context, default responses to impasses, monitoring abilities, and default spaces (and concepts) that facilitate look-ahead search and operator subgoaling [b:90].
Each system-building effort then starts with this, and acts as a (mostly) implicit study of the specifics of representing the types of knowledge required for the domains and methods of interest. Citations to this work are thus best found by looking under the
respective domains and capabilities. One explicit study of the representational issues for procedural, episodic, and declarative knowledge in Soar can be found in [11:89/91].
Domains
An important aspect of generality in an intelligent system is the range of task domains in which it can
effectively behave. Thus a significant thread throughout the development of Soar has been its application
in diverse domains. Befitting the early emphasis on weak problem solving methods, the earliest applications of Soar were in classic toy tasks, such as puzzles and games [1:83, 1:84, 1:86, 2:87] [b:83, b:87]. Even with the increased emphasis now placed on more practical knowledge-intensive and interactive domains, toy tasks continue to be of interest because of their ability to capture in particularly clear ways the essence of many hard search problems [3:88, 1:90, 3:90, 4:90, 12:90] [b:87, b:88].

The shift to more knowledge-intensive domains began with the implementation of the R1-Soar computer-configuration system [1:85]. Since its initial development, R1-Soar has been used as the basis for experiments in skill acquisition [2:87], abstraction [6:87, 14:89], and task acquisition [16:89]. The development of R1-Soar was soon followed by a sequence of systems in the area of algorithm and software design [4:87, 8:88, 13:89] [k:89, a:88]; and then by a set of systems that focused on other design domains [4:89, 15:89] [j:88, k:88], medical diagnosis [10:88], blood banking [a:90, g:91, l:91], and factory scheduling [6:90/92] [f:88, d:89, e:89]. The first in what is to be a series of articles on the Soar approach to building knowledge-intensive systems can be found in [o:92].

The most recent shift in application focus has been to domains that require tight coupling between Soar and its environment, whether this environment be the physical world or other software systems (or people, for that matter, though no substantial domains have shown up here yet). The physical-world domains investigated so far are still quite simple; in particular, block alignment by a hand-eye system [3:90] [m:91] and cup collection and disposal by a mobile-hand system [4:90]. The software domains being investigated are more varied, including databases [p:91], symbolic mathematics packages [p:91], chemical process simulators [p:91], drawing packages [p:91], tutorial environments [p:91, x:91], building-design tools [9:90] [p:91], and physical-world simulators. In addition to the domains listed in this section, there has also been work on algorithmic domains, such as multi-column subtraction [10:89/91], and a significant number of applications in the context of psychological modeling (which is the focus of the next section).
Psychological Modeling
One of the grand challenges for Soar is to develop it into a unified theory of cognition that, through the structure of its architecture and the content of its problem spaces, can veridically produce the full range of human cognitive behavior. The major statement of this grand challenge, and of Soar's status with respect to it, can be found in [d:90]. Additional overviews and status reports can be found in [5:90] [p:88, q:88, h:92, j:92], and an attempt to extend this idea into the social realm can be found in [b:91].
The work covered in papers included in this collection provides a patchwork of results in the general areas of perception, routine cognitive skill, reasoning, problem solving, learning, and development. The principal focus so far in perception has been on modeling visual attention [9:91] [i:90, p:92]. Some issues of perception also arise in the work on routine cognitive skill; however, the emphasis there is much more on high-level aspects of perception and motor control, and their relationship to cognition. This work to date has focused mostly on using Soar as the substrate for, and extensions to, two general human-computer interaction (HCI) formalisms: GOMS (Goals, Operators, Methods, and Selection rules) [5:90, 3:91] and PUM (Programmable User Models) [13:90, 2:91]. Domains covered in this work include highly interactive ones, such as on-line browsers [5:90] [k:92] and video games [3:91], plus the more traditional domain of text editing [13:90, 2:91]. Other work on routine cognitive skill includes efforts to model calculator use [c:91], automobile driving [b:92], and the NASA Test Director at the Kennedy Space Center [k:91].

With reasoning, the emphasis shifts to pure cognitive processing. The work so far in human reasoning has concentrated on extended mental-model-based approaches to classic tasks that reveal ways in which people do not follow the standard rules of logic. Categorical syllogisms have received the most attention; however, there has also been work on relational reasoning, conditional reasoning, and the Wason selection task [5:88, 7:89, 3:89]. The work on problem solving has focused on modeling human protocols in non-routine domains ranging from simple puzzles, such as the Tower of Hanoi [12:89] and cryptarithmetic [d:90], to more pragmatic domains such as instructional design [q:91].

Much of the Soar-based research on human learning has concerned relatively low-level phenomena such as practice phenomena-in particular, modeling the power law of practice [1:81, 2:82/87, 2:83/86, 9:89]-and the beginnings of a model of classical verbal learning [3:87, 9:89]. However, there has been some work on higher level phenomena, such as concept acquisition-in the context of series completion [5:90] and lexical acquisition [6:91] [i:92]-and strategy change [12:89]. At the highest end, learning shades into developmental transitions, with work on quantity conservation [8:91] and the balance beam [d:90].
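The power law of practice referred to here is the empirical regularity that the time to perform a task falls off as a power function of the amount of practice. In its simplest, asymptote-free form (the symbols below are the conventional ones, not parameter values taken from the cited papers), it can be written as:

\[
T_N = B \, N^{-\alpha}
\]

where \(T_N\) is the time to perform the task on trial \(N\), \(B\) is the time on the first trial, and \(\alpha > 0\) is the learning-rate exponent. On log-log coordinates this relation is a straight line, which is the signature that the chunking-based practice models cited above are asked to reproduce.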
Perspectives
The most basic way to view Soar is as a symbol system; that is, as a collection of symbols and mechanisms for their processing [5:89, 2:87] [e:90, o:92]. However, this view-though accurate-by itself misses much of what gives Soar its particular character.

The first missing aspect is the fine structure of how Soar's mechanisms form a tightly coupled hierarchy of layers-memory, decision, and goal-in which each layer forms the inner loop of the layer above it [10:89/91] [d:90]. These layers increase progressively in both complexity and time scale from the bottom to the top of the hierarchy.

The second missing aspect is the identification of the class(es) of symbol systems within which Soar should be situated. This is not a completely straightforward task for Soar because of its range of capabilities and application domains. It is also not a completely bounded task, as new
developments are periodically increasing the set of classes. Nonetheless, various analyses have identified at least the following classes: a general problem solver (or goal-oriented system) [1:83, 8:89], a learning system [5:87], a cognitive architecture [5:89], an expert system shell [1:85], a meta-level (reflective) system [7:88], and a hybrid planning system [4:90, 8:90].

The third missing aspect is the comparison of Soar with other integrated architectures. So far, not many such comparisons exist in the published literature (as opposed to the larger number of informal comparisons that are generated during courses on integrated architectures). However, written comparisons do at least exist for GPS [j:92], Act* [5:89], and the constructive-integration model [h:90].

The fourth missing aspect is that Soar provides more than just a symbol system. Above the symbol system, it provides a problem-space system that is based on the symbol system, but has its own distinct computational model [7:91] [b:90, e:90, o:92]. Further up-above the problem-space system-Soar provides an approximation to a knowledge-level system [11:89/91, 7:91] [1:82, d:90, n:92, o:92]. There is currently no theoretically interesting system under Soar's symbol system; however, some thought has been given to the relationship of Soar to lower neural systems [8:89] [d:90, n:92], and work has begun on a neural-network basis for Soar [1:91].

The fifth missing aspect is how Soar is viewed by researchers from outside the community directly involved in its development and use. Several external commentaries on Soar originated as invited responses to papers presented at workshops and symposia (though the commentaries often address more general issues than those specifically raised in the papers at hand, as well as often addressing other systems in addition to Soar). In particular, a commentary based on [9:89] can be found in [1:89], one based on [10:89/91] can be found in [6:89/91], and two based on [11:89/91] can be found in [d:91] and [r:91]. Other commentaries have taken the form of book reviews: [d:87] and [h:88] are reviews of [d:86], while [h:91] and [o:91] are reviews of [d:90]. In addition to commentaries, several researchers have compared individual Soar capabilities directly with other alternatives. In particular, Soar's learning abilities have been compared with decision analytic control [2:89], backpropagation [b:88], and a recursion-controlled version of explanation-based generalization [c:90].

The sixth, and final, missing aspect is the view of Soar, and the Soar research effort, from historical and sociological perspectives. On the historical side, this introduction covers much of the development of Soar, plus a bit about its prehistory. Additional details on its prehistory can also be found in [d:86]. On the sociological side, a study of the development of the Soar community can be found in [1:88/91].
Using Soar
Ideally Soar should be "programmable" as a knowledge-level system. Adjustments and enhancements of behavior should be possible by simply conveying the appropriate knowledge to the system. Though such an activity can be thought of as programming at the knowledge level, it really
has more in common with acts of communication than with acts of programming. Failing this, Soar should be programmable as a problem-space system. Augmentations of its store of problem spaces should be directly derivable from provided descriptions. Programming a problem-space system is primarily a modeling activity, in which systems are constructed by describing the objects and actions in the domains of interest. To the extent that Soar fails to be programmable as a problem-space system, it must be programmed as a symbol-level system. Here the requisite activity is closer to classical production-system programming, though the myriad of ways in which Soar differs from such classical systems-including the absence of conflict resolution and the provision of preferences, subgoaling, chunking, etc.-dictates that programming Soar at the symbol level is still a rather distinct activity.

In reality, Soar is not currently programmable at the knowledge level, and it is far from clear how to provide such a capability. However, there has been progress towards the ability to program it as a problem-space system. A front end, called TAQL (Task AcQuisition Language), has been developed that provides a problem-space language which can be automatically compiled into Soar productions [y:91]. When using TAQL, its manual should be used in conjunction with the Soar manual [b:90]. The Soar manual provides an overview of the problem space level, details of Soar's symbol level structures and mechanisms, an introduction to encoding tasks in Soar, a description of the default knowledge that comes with the system, descriptions of the user-accessible variables and functions, a glossary, and production templates for common problem-space operations (which are useful when programming Soar at the symbol level). When programming at the symbol level, the Soar manual should be sufficient on its own.

In addition to TAQL, the usability of Soar has been enhanced by the construction of the Developers Soar Interface (DSI). The DSI provides GnuEmacs editor modes for Soar [t:91] and TAQL [s:91], plus an X-based graphical display interface [l:92]. 2
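To give a concrete feel for what symbol-level programming involves, here is a minimal, hypothetical sketch of a single Soar production for a toy blocks-world task. The task, the attribute names, and the surface syntax (including the comment character) are illustrative assumptions only; the authoritative details are in the Soar manual [b:90] and differ across Soar releases. The point is just the shape of the activity: a condition side that matches structures in working memory, and an action side that creates an operator proposal with an acceptable preference.

sp {blocks-world*propose*move-block
    # Conditions: a state in the blocks-world problem space with two distinct clear blocks.
    (state <s> ^problem-space blocks-world
               ^clear <block-a>
               ^clear { <block-b> <> <block-a> })
    -->
    # Actions: propose an operator (acceptable preference, "+") to move one block onto
    # the other; Soar's decision procedure, rather than conflict resolution, selects among
    # the proposed operators.
    (<s> ^operator <o> +)
    (<o> ^name move-block
         ^moving-block <block-a>
         ^destination <block-b>)}

In practice, a TAQL user would not write such productions by hand; TAQL compiles a higher-level, problem-space-oriented description of spaces and operators into productions of roughly this kind [y:91].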
Conclusion
This brings us to the end of the road map. Following the presentation of the two bibliographies-of included and additional articles-come the Soar papers.
2 An earlier attempt to build such an interface for Soar is described in [f:86, g:86].
Bibliography of Included Articles
Prior to 1983
1:69 Newell, A., "Heuristic programming: Ill-structured problems," in Progress in Operations Research, III, Aronofsky, J., ed., Wiley, New York, 1969, pp. 360-414. [One: 3]
1:80 Newell, A., "Reasoning, problem solving and decision processes: The problem space as a fundamental category," in Attention and Performance VIII, R. Nickerson, ed., Erlbaum, Hillsdale, NJ, 1980. [One: 55]
1:81 Newell, A. & Rosenbloom, P. S., "Mechanisms of skill acquisition and the law of practice," in Cognitive Skills and their Acquisition, J. R. Anderson, ed., Erlbaum, Hillsdale, NJ, 1981, pp. 1-55. [One: 81]
1:82 Newell, A., "The knowledge level," Artificial Intelligence, Vol. 18, 1982, pp. 87-127. [One: 136]
2:82/87 Rosenbloom, P. S. & Newell, A., "Learning by chunking: A production-system model of practice," in Production System Models of Learning and Development, D. Klahr, P. Langley, R. Neches, eds., Bradford Books/The MIT Press, Cambridge, MA, 1987, pp. 221-286. [One: 177]
1983-1985
1:83 Laird, J. E. & Newell, A., "A Universal Weak Method," Tech. report #83-141, Carnegie-Mellon University Computer Science Department, June 1983. [One: 245]
2:83/86 Rosenbloom, P. S. & Newell, A., "The chunking of goal hierarchies: A generalized model of practice," in Machine Learning: An Artificial Intelligence Approach, Volume II, R. S. Michalski, J. G. Carbonell, & T. M. Mitchell, eds., Morgan Kaufmann Publishers, Inc., Los Altos, CA, 1986, pp. 247-288. [One: 293]
1:84 Laird, J. E., Rosenbloom, P. S., & Newell, A., "Towards chunking as a general learning mechanism," Proceedings of the National Conference on Artificial Intelligence, AAAI, 1984, pp. 188-192. [One: 335]
1:85 Rosenbloom, P. S., Laird, J. E., McDermott, J., Newell, A. & Orciuch, E., "R1-Soar: An experiment in knowledge-intensive programming in a problem-solving architecture," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 7, 1985, pp. 561-569. [One: 340]

1986
1:86 Laird, J. E., Rosenbloom, P. S., & Newell, A., "Chunking in Soar: The anatomy of a general learning mechanism," Machine Learning, Vol. 1, 1986, pp. 11-46. [One: 351]
2:86 Laird, J. E., Rosenbloom, P. S., & Newell, A., "Overgeneralization during knowledge compilation in Soar," Proceedings of the Workshop on Knowledge Compilation, T. G. Dietterich, ed., AAAI/Oregon State University, 1986, pp. 46-57. [One: 387]
3:86 Rosenbloom, P. S., & Laird, J. E., "Mapping explanation-based generalization onto Soar," Proceedings of the Fifth National Conference on Artificial Intelligence, AAAI, 1986, pp. 561-567. [One: 399]
4:86 Scales, D. J., "Efficient Matching Algorithms for the Soar/Ops5 Production System," Tech. report KSL-86-47, Knowledge Systems Laboratory, Department of Computer Science, Stanford University, 1986. [One: 407]
1987
1:87 Golding, A. R., Rosenbloom, P. S., & Laird, J. E., "Learning general search control from outside guidance," Proceedings of the Tenth International Joint Conference on Artificial Intelligence, IJCAII, 1987, pp. 334-337. [One: 459]
2:87 Laird, J. E., Newell, A., & Rosenbloom, P. S., "Soar: An architecture for general intelligence," Artificial Intelligence, Vol. 33, 1987, pp. 1-64. [One: 463]
3:87 Rosenbloom, P. S., Laird, J. E., & Newell, A., "Knowledge level learning in Soar," Proceedings of the Sixth National Conference on Artificial Intelligence, AAAI, 1987, pp. 499-504. [One: 527]
4:87 Steier, D., "Cypress-Soar: A case study in search and learning in algorithm design," Proceedings of the Tenth International Joint Conference on Artificial Intelligence, IJCAII, 1987, pp. 327-330. [One: 533]
5:87 Steier, D. M., Laird, J. E., Newell, A., Rosenbloom, P. S., Flynn, R., Golding, A., Polk, T. A., Shivers, O. G., Unruh, A., & Yost, G. R., "Varieties of Learning in Soar: 1987," Proceedings of the Fourth International Workshop on Machine Learning, P. Langley, ed., Morgan Kaufmann Publishers, Inc., Los Altos, CA, 1987, pp. 300-311. [One: 537]
6:87 Unruh, A., Rosenbloom, P. S., & Laird, J. E., "Dynamic abstraction problem solving in Soar," Proceedings of the Third Annual Aerospace Applications of Artificial Intelligence Conference, Dayton, OH, 1987, pp. 245-256. [One: 549]
1988
1:88/91 Carley, K. & Wendt, K., "Electronic Mail and Scientific Communication: A study of the Soar extended research group," Knowledge: Creation, Diffusion, Utilization, Vol. 12, 1991, pp. 406-440. [One: 563]
2:88 Flynn, R., "Placing Soar on the Connection Machine," Digital Equipment Corporation, September, 1988. An extended abstract of this paper was distributed at the AAAI Mini-Symposium, How Can Slow Components Think So Fast?. [One: 598]
3:88 Laird, J. E., "Recovery from incorrect knowledge in Soar," Proceedings of the Seventh National Conference on Artificial Intelligence, AAAI, 1988, pp. 618-623. [One: 615]
4:88 Nayak, P., Gupta, A., & Rosenbloom, P. S., "Comparison of the Rete and Treat production matchers for Soar (a summary)," Proceedings of the Seventh National Conference on Artificial Intelligence, AAAI, 1988, pp. 693-698. [One: 621]
5:88 Polk, T. A. & Newell, A., "Modeling human syllogistic reasoning in Soar," Proceedings of the 10th Annual Conference of the Cognitive Science Society, 1988, pp. 181-187. [One: 627]
6:88 Rosenbloom, P. S., "Beyond generalization as search: Towards a unified framework for the acquisition of new knowledge," Proceedings of the AAAI Symposium on Explanation-Based Learning, G. F. DeJong, ed., AAAI, Stanford, CA, 1988, pp. 17-21. [One: 634]
7:88 Rosenbloom, P. S., Laird, J. E., & Newell, A., "Meta-levels in Soar," in Meta-Level Architectures and Reflection, P. Maes & D. Nardi, eds., North Holland, Amsterdam, 1988, pp. 227-240. [One: 639]
8:88 Steier, D. M. & Newell, A., "Integrating multiple sources of knowledge into Designer-Soar, an automatic algorithm designer," Proceedings of the Seventh National Conference on Artificial Intelligence, AAAI, 1988, pp. 8-13. [One: 653]
9:88 Tambe, M., Kalp, D., Gupta, A., Forgy, C.L., Milnes, B.G., & Newell, A., "Soar/PSM-E: Investigating match parallelism in a learning production system," Proceedings of the ACM/SIGPLAN Symposium on Parallel Programming: Experience with Applications, Languages, and Systems, July 1988, pp. 146-160. [One: 659]
10:88 Washington, R. & Rosenbloom, P. S., "Applying Problem Solving and Learning to Diagnosis," Department of Computer Science, Stanford University. Unpublished. [One: 674]
11:88 Yost, G. R. & Newell, A., "Learning new tasks in Soar," Department of Computer Science, Carnegie Mellon University. Unpublished. [One: 688]
1989
1:89 Bosser, T., "A discussion of 'The chunking of skill and knowledge' by Paul S. Rosenbloom, John E. Laird & Allen Newell," in Working Models of Human Perception, B. A. G. Elsendoorn & H. Bouma, eds., Academic Press, London, 1989, pp. 411-418. [Two: 705]
2:89 Etzioni, O., & Mitchell, T. M., "A comparative analysis of chunking and decision-analytic control," Proceedings of the 1989 AAAI Spring Symposium on AI and Limited Rationality, Stanford, 1989. [Two: 713]
3:89 Lewis, R. L., Newell, A., & Polk, T. A., "Toward a Soar theory of taking instructions for immediate reasoning tasks," Proceedings of the 11th Annual Conference of the Cognitive Science Society, 1989, pp. 514-521. [Two: 719]
4:89 Modi, A.K. & Westerberg, A.W., "Integrating learning and problem solving within a chemical process designer," Presented at the Annual Meeting of the American Institute of Chemical Engineers. [Two: 727]
5:89 Newell, A., Rosenbloom, P. S., & Laird, J. E., "Symbolic architectures for cognition," in Foundations of Cognitive Science, M. I. Posner, ed., Bradford Books/MIT Press, Cambridge, MA, 1989, pp. 93-131. [Two: 754]
6:89/91 Norman, D. A., "Approaches to the study of intelligence," Artificial Intelligence, Vol. 47, 1991, pp. 327-346. [Two: 793]
7:89 Polk, T. A., Newell, A., & Lewis, R. L., "Toward a unified theory of immediate reasoning in Soar," Proceedings of the 11th Annual Conference of the Cognitive Science Society, 1989, pp. 506-513. [Two: 813]
8:89 Rosenbloom, P. S., "A symbolic goal-oriented perspective on connectionism and Soar," in Connectionism in Perspective, R. Pfeifer, Z. Schreter, F. Fogelman-Soulie, & L. Steels, eds., Elsevier (North Holland), Amsterdam, 1989, pp. 245-263. [Two: 821]
9:89 Rosenbloom, P. S., Laird, J. E., & Newell, A., "The chunking of skill and knowledge," in Working Models of Human Perception, B. A. G. Elsendoorn & H. Bouma, eds., Academic Press, London, 1989, pp. 391-410. [Two: 840]
10:89/91 Rosenbloom, P. S., Laird, J. E., Newell, A., & McCarl, R., "A preliminary analysis of the Soar architecture as a basis for general intelligence," Artificial Intelligence, Vol. 47, 1991, pp. 289-325. [Two: 860]
11:89/91 Rosenbloom, P. S., Newell, A., & Laird, J. E., "Towards the knowledge level in Soar: The role of the architecture in the use of knowledge," in Architectures for Intelligence, K. VanLehn, ed., Erlbaum, Hillsdale, NJ, 1991, pp. 75-111. [Two: 897]
12:89 Ruiz, D. & Newell, A., "Tower-noticing triggers strategy-change in the Tower of Hanoi: A Soar model," Proceedings of the Annual Conference of the Cognitive Science Society, 1989, pp. 522-529. [Two: 934]
13:89 Steier, D., "'But How Did You Know To Do That?': What a theory of algorithm design process can tell us," Engineering Design Research Center, Carnegie Mellon University. Unpublished. [Two: 942]
14:89 Unruh, A. & Rosenbloom, P. S., "Abstraction in Problem Solving and Learning," Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, IJCAII, 1989, pp. 681-687. [Two: 959]
15:89 Vicinanza, S. & Prietula, M.J., "A computational model of musical creativity (Extended abstract)," Prepared for the AI and Music Workshop, Eleventh International Joint Conference on Artificial Intelligence. [Two: 974]
16:89 Yost, G. R. & Newell, A., "A problem space approach to expert system specification," Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, IJCAII, 1989, pp. 621-627. [Two: 982]
1990
1:90 Congdon, C.B., "Learning Control Knowledge in an Unsupervised Planning Domain," Artificial Intelligence Laboratory, The University of Michigan. Unpublished. [Two: 991]
2:90 Johnson, T. R., Smith, J. W., & Chandrasekaran, B., "Task-specific architectures for flexible systems," Submitted to Applied AI. [Two: 1004]
3:90 Laird, J.E., Hucka, M., Yager, E.S., & Tuck, C.M., "Correcting and extending domain knowledge using outside guidance," Proceedings of the Seventh International Conference on Machine Learning, 1990, pp. 235-243. [Two: 1027]
4:90 Laird, J. E., & Rosenbloom, P. S., "Integrating execution, planning, and learning in Soar for external environments," Proceedings of the Eighth National Conference on Artificial Intelligence, AAAI, 1990, pp. 1022-1029. [Two: 1036]
5:90 Lewis, R. L., Huffman, S. B., John, B. E., Laird, J. E., Lehman, J. F., Newell, A., Rosenbloom, P. S., Simon, T., & Tessler, S. G., "Soar as a unified theory of cognition: Spring 1990," Proceedings of the 12th Annual Conference of the Cognitive Science Society, 1990, pp. 1035-1042. [Two: 1044]
6:90/92 Prietula, M. J., Hsu, W., Steier, D. M., & Newell, A., "Applying an architecture for general intelligence to reduce scheduling effort," ORSA Journal on Computing, 1992, In press. [Two: 1052]
7:90 Rosenbloom, P. S. & Aasman, J., "Knowledge level and inductive uses of chunking (EBL)," Proceedings of the Eighth National Conference on Artificial Intelligence, AAAI, 1990, pp. 821-827. [Two: 1096]
8:90 Rosenbloom, P. S., Lee, S., & Unruh, A., "Responding to impasses in memory-driven behavior: A framework for planning," Proceedings of the Workshop on Innovative Approaches to Planning, Scheduling, and Control, DARPA, 1990, pp. 181-191. [Two: 1103]
9:90 Steier, D.M., "Intelligent architectures for integration," Proceedings of the IEEE Conference on Systems Integration, August 1990. A shorter version of this paper appeared in the proceedings. [Two: 1114]
10:90 Tambe, M., Newell, A., & Rosenbloom, P. S., "The problem of expensive chunks and its solution by restricting expressiveness," Machine Learning, Vol. 5, 1990, pp. 299-348. [Two: 1123]
11:90 Tambe, M. & Rosenbloom, P. S., "A framework for investigating production system formulations with polynomially bounded match," Proceedings of the Eighth National Conference on Artificial Intelligence, AAAI, 1990, pp. 693-700. [Two: 1173]
12:90 Unruh, A., & Rosenbloom, P. S., "Two new weak method increments for abstraction," Working Notes of the AAAI-90 Workshop on Automatic Generation of Approximations and Abstractions, T. Ellman, ed., AAAI, 1990, pp. 78-86. [Two: 1181]
13:90 Young, R.M. & Whittington, J., "Using a knowledge analysis to predict conceptual errors in text-editor usage," Proceedings of CHI'90, 1990, pp. 91-97. [Two: 1190]

1991
1:91 Cho, B., Rosenbloom, P. S., & Dolan, C. P., "Neuro-Soar: A neural-network architecture for goal-oriented behavior," Proceedings of the Thirteenth Annual Conference of the Cognitive Science Society, Erlbaum, Hillsdale, NJ, 1991, pp. 673-677. [Two: 1199]
2:91 Howes, A. & Young, R. M., "Predicting the learnability of task-action mappings," Proceedings of CHI'91 Human Factors in Computing Systems, ACM Press, 1991, pp. 113-118. [Two: 1204]
3:91 John, B. E., Vera, A. H., & Newell, A., "Towards Real-Time GOMS," Abbreviated version of "Towards Real-Time GOMS," by John, B. E., Vera, A. H., & Newell, A., Carnegie Mellon University School of Computer Science Technical Report CMU-CS-90-195. [Two: 1210]
4:91 Laird, J. E., "Extending problem spaces to external environments," Artificial Intelligence Laboratory, The University of Michigan. Unpublished. [Two: 1294]
5:91 Lehman, J. F., Lewis, R. L., & Newell, A., "Integrating knowledge sources in language comprehension," Proceedings of the Thirteenth Annual Meeting of the Cognitive Science Society, Erlbaum, Hillsdale, NJ, 1991, pp. 461-466. [Two: 1309]
6:91 Miller, C. S., & Laird, J. E., "A constraint-motivated lexical acquisition model," Proceedings of the Thirteenth Annual Meeting of the Cognitive Science Society, Erlbaum, Hillsdale, NJ, 1991, pp. 827-831. [Two: 1315]
7:91 Newell, A., Yost, G. R., Laird, J. E., Rosenbloom, P. S., & Altmann, E., "Formulating the problem space computational model," in CMU Computer Science: A 25th Anniversary Commemorative, R. F. Rashid, ed., ACM Press/Addison-Wesley, 1991, pp. 255-293. [Two: 1321]
8:91 Simon, T., Newell, A., & Klahr, D., "A computational account of children's learning about number conservation," in Concept Formation: Knowledge & Experience in Unsupervised Learning, Fisher, D.H., Pazzani, M.J. & Langley, P., eds., Morgan Kaufmann, Los Altos, CA, 1991. [Two: 1360]
9:91 Wiesmeyer, M. & Laird, J.E., "Attentional Modeling of Object Identification and Search," Tech. report CSE-TR-108-91, Department of Electrical Engineering and Computer Science, University of Michigan, 1991. A portion of this article appeared under the title "A computer model of 2D visual attention," in Proceedings of the Twelfth Annual Conference of the Cognitive Science Society, 1990. [Two: 1400]
Bibliography of Additional Articles
Prior to 1983
a:72 Newell, A. & Simon, H. A., Human Problem Solving, Prentice-Hall, Englewood Cliffs, 1972.
a:73 Newell, A., "Production systems: Models of control structures," in Visual Information Processing, Chase, W. G., ed., Academic Press, New York, 1973, pp. 463-526.
a:78 Rychener, M. D. & Newell, A., "An instructable production system: Basic design issues," in Pattern-Directed Inference Systems, Waterman, D. A. & Hayes-Roth, F., eds., Academic Press, New York, 1978, pp. 135-153.
a:81 Forgy, C. L., "OPS5 User's Manual," Tech. report CMU-CS-81-135, Computer Science Department, Carnegie-Mellon University, July 1981.
a:82 Forgy, C. L., "Rete: A fast algorithm for the many pattern/many object pattern match problem," Artificial Intelligence, Vol. 19, 1982, pp. 17-37.
b:82 Rosenbloom, P. S. & Newell, A., "Learning by chunking: Summary of a task and a model," Proceedings of the National Conference on Artificial Intelligence, AAAI, 1982, pp. 255-257.
1983-1985
a:83 Gupta, A. & Forgy, C., "Measurements on Production Systems," Tech. report CMU-CS-83-167, Computer Science Department, Carnegie Mellon University, December 1983.
b:83 Laird, J. E., Universal Subgoaling, PhD dissertation, Carnegie-Mellon University, 1983. (Available in Laird, J. E., Rosenbloom, P. S., & Newell, A., Universal Subgoaling and Chunking: The Automatic Generation and Learning of Goal Hierarchies, Hingham, MA: Kluwer, 1986.)
c:83 Laird, J. E. & Newell, A., "A universal weak method: Summary of results," Proceedings of the Eighth International Joint Conference on Artificial Intelligence, IJCAII, 1983, pp. 771-773.
d:83 Rosenbloom, P. S., The Chunking of Goal Hierarchies: A Model of Practice and Stimulus-Response Compatibility, PhD dissertation, Carnegie-Mellon University, 1983. (Available in Laird, J. E., Rosenbloom, P. S., & Newell, A., Universal Subgoaling and Chunking: The Automatic Generation and Learning of Goal Hierarchies, Hingham, MA: Kluwer, 1986.)
a:84 Gupta, A., "Implementing OPS5 Production Systems on DADO," Tech. report CMU-CS-84-115, Computer Science Department, Carnegie Mellon University, March 1984.
b:84 Powell, L., "Parsing the Picnic-Problem with a Soar3 Implementation of Dypar-1," Department of Computer Science, Carnegie-Mellon University. Unpublished.
c:84 Saul, R. H., "A SOAR2 Implementation of Version-Space Inductive Learning," Department of Computer Science, Carnegie-Mellon University. Unpublished.
1986
a:86 Gupta, A., Parallelism in Production Systems, PhD dissertation, Computer Science Department, Carnegie Mellon University, March 1986. Also available in Parallelism in Production Systems, Morgan Kaufman, 1987.
b:86 Gupta, A., Forgy, C., Newell, A., and Wedig, R., "Parallel algorithms and architectures for production systems," Proceedings of the Thirteenth International Symposium on Computer Architectures, June 1986, pp. 28-35.
c:86 Laird, J. E., "Soar User's Manual (Version 4)," Tech. report ISL-15, Xerox Palo Alto Research Center, 1986.
d:86 Laird, J. E., Rosenbloom, P. S., & Newell, A., Universal Subgoaling and Chunking: The Automatic Generation and Learning of Goal Hierarchies, Kluwer Academic Publishers, Hingham, MA, 1986.
e:86 Rosenbloom, P. S., Laird, J. E., Newell, A., Golding, A., & Unruh, A., "Current research on learning in Soar," in Machine Learning: A Guide to Current Research, T. M. Mitchell, J. G. Carbonell, & R. S. Michalski, eds., Kluwer Academic Press, Boston, MA, 1986, pp. 281-290.
f:86 Unruh, A., "Comprehensive Programming Project: An Interface for Soar," Department of Computer Science, Stanford University. Unpublished.
1987
a:87 Faieta, B., "Implementing PUPS analogy in SOAR," Unpublished progress note.
b:87 Faieta, B., "Analysis of learning to play GO using the Soar model," Unpublished final project.
c:87 Laird, J. E., & Rosenbloom, P. S., "Research on learning in Soar," Proceedings of the Second Annual Artificial Intelligence Research Forum, NASA Ames Research Center, Palo Alto, CA, 1987, pp. 240-253.
d:87 Lewis, C., "The search for cognitive harmony," Contemporary Psychology, Vol. 32, 1987, pp. 427-428,
Review of Universal Subgoaling and Chunking: The Automatic Generation and Learning of Goal Hierarchies.
1988
a:88 Adelson, B., "Modeling software design in a problem-space architecture," Proceedings of the Annual Conference of the Cognitive Science Society, August 1988, pp. 174-180.
b:88 Crawford, A., "On Soar and PDP models: Learning in a probabilistic environment," Honors Thesis, Computer Science Department, Harvard University.
c:88 Flynn, R., "Restructuring Knowledge in Soar: Searching for the right paradigm-specific representations," Digital Equipment Corporation, September, 1988. Reprinted and available from Carnegie Mellon University. Unpublished.
d:88 Gupta, A. & Tambe, M., "Suitability of Message Passing Computers for Implementing Production Systems," Proceedings of the Seventh National Conference on Artificial Intelligence, AAAI, 1988, pp. 687-692.
e:88 Gupta, A., Tambe, M., Kalp, D., Forgy, C. L., and Newell, A., "Parallel implementation of OPS5 on the Encore Multiprocessor: Results and analysis," International Journal of Parallel Programming, Vol. 17, No. 2, 1988, pp. 95-124.
f:88 Hsu, W., Prietula, M., & Steier, D., "Merl-Soar: Applying Soar to scheduling," Proceedings of the Workshop on Artificial Intelligence Simulation, The National Conference on Artificial Intelligence, 1988, pp. 81-84.
g:88 Kalp, D., Tambe, M., Gupta, A., Forgy, C., Newell, A., Acharya, A., Milnes, B., & Swedlow, K., "Parallel OPS5 User's Manual," Tech. report CMU-CS-88-187, Computer Science Department, Carnegie Mellon University, November 1988.
h:88 Mostow, J., "Review of Universal Subgoaling and Chunking: The Automatic Generation and Learning of Goal Hierarchies," American Scientist, Vol. 76, 1988, pp. 410.
i:88 Reich, Y., "Learning Plans as a Weak Method for Design," Department of Civil Engineering, Carnegie Mellon University. Unpublished.
j:88 Reich, Y. & Fenves, S., "Integration of Generic Learning Tasks," Department of Civil Engineering, Carnegie Mellon University. Unpublished.
k:88 Reich, Y. & Fenves, S., "Floor-System Design in Soar: A Case Study of Learning to Learn," Tech. report EDRC-12-26-88, Engineering Design Research Center, Carnegie Mellon University, March 1988.
l:88 Rosenbloom, P. S. & Newell, A., "An integrated computational model of stimulus-response compatibility and practice," in The Psychology of Learning and Motivation, Volume 21, G. H. Bower, ed., Academic Press, 1988, pp. 1-52.
m:88 Tambe, M. & Newell, A., "Some chunks are expensive," Proceedings of the Fifth International Conference on Machine Learning, J. Laird, ed., 1988, pp. 451-458.
n:88 Tambe, M., & Newell, A., "Why Some Chunks are Expensive," Tech. report CMU-CS-88-103, Computer Science Department, Carnegie-Mellon University, 1988.
o:88 van Berkum, J. J. A., "Cognitive Modeling in Soar," Tech. report WR 88-01, Traffic Research Center, University of Groningen, July 1988.
p:88 Waldrop, M. M., "Toward a unified theory of cognition," Science, Vol. 241, 1988, pp. 27-29.
q:88 Waldrop, M. M., "Soar: A unified theory of cognition?," Science, Vol. 241, 1988, pp. 296-298.
1989
a:89 Acharya, A. & Tambe, M., "Efficient implementations of production systems," VIVEK: A quarterly in artificial intelligence, Vol. 2, No. 1, 1989, pp. 3-18.
b:89 Acharya, A. & Tambe, M., "Production systems on message passing computers: Simulation results and analysis," Proceedings of the International Conference on Parallel Processing, 1989, pp. 246-254.
c:89 Harvey, W., Kalp, D., Tambe, M., McKeown, D., & Newell, A., "Measuring the Effectiveness of Task-Level Parallelism for High-Level Vision," Tech. report CMU-CS-89-125, School of Computer Science, Carnegie Mellon University, March 1989.
d:89 Hsu, W.L., Newell, A., Prietula, M.J., & Steier, D.M., "Sharing scheduling knowledge between intelligent agents (Extended abstract)," Proceedings of the AAAI-SIGMAN Workshop on Manufacturing Scheduling, Eleventh International Joint Conference on Artificial Intelligence, 1989.
e:89 Hsu, W., Prietula, M., & Steier, D., "Merl-Soar: Scheduling within a general architecture for intelligence," Proceedings of the Third International Conference on Expert Systems and the Leading Edge in Production and Operations Management, May 1989, pp. 467-481.
f:89 Hucka, M., "Planning, Interruptability, and Learning in Soar," Department of Electrical Engineering and Computer Science, University of Michigan. Unpublished.
g:89 Huffman, S. B., "A natural-language system for interaction with problem-solving domains: Extensions to NL-Soar," Unpublished report on directed-study research.
h:89 Johnson, T. R., Smith, J. W. Jr., & Chandrasekaran, B., "Generic Tasks and Soar," Working Notes of the AAAI Spring Symposium on Knowledge System Development Tools and Languages, Stanford, CA, 1989, pp. 25-28.
i:89 Laird, J. E., Yager, E. S., Tuck, C. M., & Hucka, M., "Learning in tele-autonomous systems using Soar," Proceedings of the NASA Conference on Space Telerobotics, Pasadena, CA, 1989, pp. 415-424.
j:89 Newell, A., "The quest for architectures for integrated intelligent systems (Extended abstract)," Presented at the Eleventh International Joint Conference on Artificial Intelligence.
k:89 Steier, D.M., Automating Algorithm Design Within an Architecture for General Intelligence, PhD dissertation, School of Computer Science, Carnegie Mellon University, March 1989.
l:89 Tambe, M. & Acharya, A., "Parallel implementations of production systems," VIVEK: A quarterly in artificial intelligence, Vol. 2, No. 2, 1989, pp. 3-22.
m:89 Tambe, M., Acharya, A., & Gupta, A., "Implementation of Production Systems on Message Passing Computers: Techniques, Simulation Results and Analysis," Tech. report CMU-CS-89-129, School of Computer Science, Carnegie Mellon University, April 1989.
n:89 Tambe, M. & Rosenbloom, P. S., "Eliminating expensive chunks by restricting expressiveness," Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, IJCAII, 1989, pp. 731-737.
o:89 Wu, H. J., "Coherence and ambiguity resolution," Artificial Intelligence Laboratory, The University of Michigan. Unpublished.
1990
a:90 Johnson, K.A., Smith, J.A., & Smith, P., "Design of a blood bank tutor," Proceedings of the AAAI Mini-Symposium on Artificial Intelligence in Medicine, March 1990, pp. 102-104.
b:90 Laird, J. E., Congdon, C. B., Altmann, E., & Swedlow, K. R., "Soar User's Manual: Version 5.2," Tech. report, Electrical Engineering and Computer Science, The University of Michigan, 1990. Also available from The Soar Project, School of Computer Science, Carnegie Mellon University, CMU-CS-90-179.
c:90 Letovsky, S., "Operationality Criteria for Recursive Predicates," Proceedings of the Eighth National Conference on Artificial Intelligence, AAAI, 1990, pp. 936-941.
d:90 Newell, A., Unified Theories of Cognition, Harvard University Press, Cambridge, Massachusetts, 1990.
e:90 Piersma, E. H., "Constructive Criticisms on Theory and Implementation: The Soar6 Computational Model," Tech. report WR 90-01, Traffic Research Center, University of Groningen, 1990.
f:90 Tambe, M., "Understanding Case-based Reasoning in Comparison with Soar: A preliminary report," School of Computer Science, Carnegie Mellon University. Unpublished.
g:90 Tambe, M. & Rosenbloom, P. S., "Production system formulations with non-combinatoric match: Some results and analysis," School of Computer Science, Carnegie Mellon University. Unpublished.
h:90 Wharton, C., & Lewis, C., "Soar and the Constructive-Integration Model: Pressing a Button in Two Cognitive Architectures," Tech. report CU-CS-466-90, Department of Computer Science, University of Colorado at Boulder, March 1990.
i:90 Wiesmeyer, M. & Laird, J., "A computer model of 2D visual attention," Proceedings of the Twelfth Annual Conference of the Cognitive Science Society, Erlbaum, Hillsdale, NJ, 1990, pp. 582-589.
1991
a:91 Carley, K., Kjaer-Hansen, J., Prietula, M. & Newell, A., "Plural-Soar: A Prolegomenon to Artificial Agents and Organizational Behavior," Carnegie Mellon University and Joint Research Center, Ispra. Unpublished.
b:91 Carley, K. and Newell, A., "The Nature of the Social Agent," School of Computer Science and Department of Psychology, Carnegie Mellon University. Unpublished.
c:91 Churchill, E.F. & Young, R.M., "Modeling Representations of Device Knowledge in Soar," AISB-91 Artificial Intelligence and Simulation of Behaviour, Springer-Verlag, 1991, pp. 247-255.
d:91 Clancey, W. J., "The frame of reference problem in the design of intelligent machines," in Architectures for Intelligence, K. VanLehn, ed., Erlbaum, Hillsdale, NJ, 1991, pp. 357-423.
e:91 Covrigaru, A. A. & Lindsay, R. K., "Deterministic autonomous systems," AI Magazine, Vol. 12, 1991, pp. 110-117.
f:91 DeJongh, M., Causal Processes in the Problem Space Computational Model: Integrating Multiple Representations of Causal Processes in Abductive Problem Solving, PhD dissertation, Laboratory for Knowledge-Based Medical Systems, Ohio State University, 1991.
g:91 DeJongh, M. & Smith, J. W., Jr., "Integrating Models of a Domain for Problem Solving," Proceedings of the Eleventh International Conference on Expert Systems and Their Applications: Second Generation Expert Systems, 1991, pp. 125-135, Avignon, France.
h:91 Feldman, J. A., "Cognition as Search," Science, Vol. 251, 1991, pp. 575, Review of Unified Theories of Cognition.
i:91 Huffman, S.B. & Laird, J.E., "Perception, Projection and Reaction in External Environments in Soar: Initial Approaches," Tech. report CSE-TR-118-91, Department of Electrical Engineering and Computer Science, University of Michigan, 1991.
j:91 Johnson, T.R., Generic Tasks in the Problem-Space Paradigm: Building Flexible Knowledge Systems While Using Task-Level Constraints, PhD dissertation, Laboratory for Knowledge-Based Medical Systems, Ohio State University, 1991.
k:91 John, B.E., Remington, R.W. & Steier, D.M., "An Analysis of Space Shuttle Countdown Activities: Preliminaries to a Computational Model of the NASA Test Director," Tech. report CMU-CS-91-138, School of Computer Science, Carnegie Mellon University, 1991.
l:91 Johnson, T. R. & Smith, J. W., "A framework for opportunistic abductive strategies," Proceedings of the Thirteenth Annual Conference of the Cognitive Science Society, Erlbaum, Hillsdale, NJ, 1991, pp. 760-764.
m:91 Laird, J. E., Yager, E. S., Hucka, M., & Tuck, C. M., "Robo-Soar: An integration of external interaction, planning and learning using Soar," Robotics and Autonomous Systems, Vol. 8, 1991, pp. 113-129.
n:91 Lehman, J. F., Lewis, R.L. & Newell, A., "Natural Language Comprehension in Soar: Spring 1991," Tech. report CMU-CS-91-117, School of Computer Science, Carnegie Mellon University, 1991.
o:91 Lindsay, R., "What is Cognitive Science," Psychological Science, Vol. 2, 1991, pp. 228-311, Review of Unified Theories of Cognition.
p:91 Newell, A. & Steier, D., "Intelligent Control of External Software Systems," Tech. report EDRC 05-55-91, Engineering Design Research Center, Carnegie Mellon University, 1991.
q:91 Pirolli, P. & Berger, D., "The Structure of Ill-structured Problem Solving in Instructional Design," Tech. report DPS-5, University of California at Berkeley, 1991.
r:91 Pylyshyn, Z. W., "The role of cognitive architectures in theories of cognition," in Architectures for Intelligence, K. VanLehn, ed., Erlbaum, Hillsdale, NJ, 1991, pp. 189-223.
s:91 Ritter, F.E., "TAQL-mode Manual," The Soar Project, School of Computer Science, Carnegie Mellon University. Unpublished.
t:91 Ritter, F.E., Hucka, M., & McGinnis, T.F., "Soar-mode Manual," The Soar Project, School of Computer Science, Carnegie Mellon University. Unpublished.
u:91 Rosenbloom, P. S., "Climbing the hill of cognitive-science theory," Psychological Science, Vol. 2, 1991, pp. 308-311, Invited Commentary.
v:91 Tambe, M., Eliminating Combinatorics from Production Match, PhD dissertation, School of Computer Science, Carnegie Mellon University, 1991.
w:91 Tambe, M., Kalp, D. & Rosenbloom, P., "Uni-Rete: Specializing the Rete Match Algorithm for the Unique-attribute Representation," Tech. report CMU-CS-91-180, School of Computer Science, Carnegie Mellon University, 1991.
x:91 Ward, B., ET-Soar: Toward an ITS for Theory-Based Representations, PhD dissertation, School of Computer Science, Carnegie Mellon University, 1991.
y:91 Yost, G.R. & Altmann, E., "TAQL 3.1.3: Soar Task Acquisition Language User Manual," School of Computer Science, Carnegie Mellon University. Unpublished.
1992
a:92 Aasman, J. & Akyürek, A., "Flattening goal hierarchies," in Soar: A Cognitive Architecture in Perspective, Michon, J. A. & Akyürek, A., eds., Kluwer Academic Publishers, Dordrecht, The Netherlands, 1992, pp. 199-217.
b:92 Aasman, J. & Michon, J. A., "Multitasking in driving," in Soar: A Cognitive Architecture in Perspective, Michon, J. A. & Akyürek, A., eds., Kluwer Academic Publishers, Dordrecht, The Netherlands, 1992, pp. 169-198.
c:92 Akyürek, A., "On a Computational Model of Human Planning," in Soar: A Cognitive Architecture in Perspective, Michon, J. A. & Akyürek, A., eds., Kluwer Academic Publishers, Dordrecht, The Netherlands, 1992, pp. 81-108.
d:92 Akyürek, A., "Means-Ends Planning: An Example Soar System," in Soar: A Cognitive Architecture in Perspective, Michon, J. A. & Akyürek, A., eds., Kluwer Academic Publishers, Dordrecht, The Netherlands, 1992, pp. 109-167.
e:92 Doorenbos, B., Tambe, M., & Newell, A., "Learning 10,000 chunks: What's it like out there?," Proceedings of the Tenth National Conference on Artificial Intelligence, AAAI, 1992, pp. 830-836.
f:92 Huffman, S.B., Pearson, D.J. & Laird, J.E., "Correcting imperfect domain theories: A knowledge-level analysis," in Machine Learning: Induction, Analogy and Discovery, Chipman, S. & Meyrowitz, A., eds., Kluwer Academic Publishers, Hingham, MA, 1992, In press.
g:92 Johnson, T. R., Smith, J. W., Johnson, K., Amra, N., & DeJongh, M., "Diagrammatic Reasoning of Tabular Data," Tech. report OSU-LKBMS-92-101, Laboratory for Knowledge-based Medical Systems, Ohio State University, 1992.
h:92 Michon, J. A. & Akyürek, A., editors, Soar: A Cognitive Architecture in Perspective, Kluwer Academic Publishers, Dordrecht, The Netherlands, 1992.
i:92 Miller, C.S. & Laird, J.E., "A simple, symbolic model for associative learning and retrieval," Artificial Intelligence Laboratory, The University of Michigan. Unpublished.
j:92 Newell, A., "Unified theories of cognition and the role of Soar," in Soar: A Cognitive Architecture in Perspective, Michon, J. A. & Akyürek, A., eds., Kluwer Academic Publishers, Dordrecht, The Netherlands, 1992, pp. 25-79.
k:92 Peck, V.A. & John, B.E., "Browser-Soar: A computational model of a highly interactive task," Proceedings of CHI'92, ACM Press, New York, 1992, pp. 165-172.
l:92 Ritter, F.E. & McGinnis, T.F., "SX: A manual for the Soar graphical display and interface in X windows," The Soar Project, School of Computer Science, Carnegie Mellon University. Unpublished.
m:92 Rosenbloom, P. S., Lee, S., & Unruh, A., "Bias in planning and explanation-based learning," in Machine Learning Methods for Planning, S. Minton, ed., Morgan Kaufmann, San Mateo, CA, 1992, In press.
n:92 Rosenbloom, P. S. & Newell, A., "Symbolic Architectures: Organization of Intelligence I," in Exploring Brain Functions: Models in Neuroscience, T. Poggio & D. A. Glaser, eds., John Wiley & Sons, Chichester, England, 1992, In press.
o:92 Smith, J.W. & Johnson, T.R., "Stratified Knowledge System Design and Implementation: The Soar Approach," IEEE Expert, 1992, In press.
p:92 Wiesmeyer, M., An Operator-Based Model of Human Covert Visual Attention, PhD dissertation, Department of Electrical Engineering and Computer Science, The University of Michigan, 1992.
The Soar Papers 1969-1982
CHAPTER 1
Heuristic Programming: Ill-Structured Problems
A. Newell, Carnegie Mellon University
I wish to acknowledge the help of J. Moore in clarifying the nature of the methods of artificial intelligence and also my discussions with my colleague H. A. Simon. The research was supported by the Advanced Research Projects Agency of the Office of the Secretary of Defense (SD-146).

We observe that on occasion expressions in some language are put forward that purport to state "a problem." In response a method (or algorithm) is advanced that claims to solve the problem. That is, if input data are given that meet all the specifications of the problem statement, the method produces another expression in the language that is the solution to the problem. If there is a challenge as to whether the method actually provides a general solution to the problem (i.e., for all admissible inputs), a proof may be forthcoming that it does. If there is a challenge to whether the problem statement is well defined, additional formalization of the problem statement may occur. In the extreme this can reach back to formalization of the language used to state the problem, until a formal logical calculus is used.

We also observe that for other problems that people have and solve there seems to be no such formalized statement and formalized method. Although usually definite in some respects, problems of this type seem incurably fuzzy in others. That there should be ill-defined problems around is not very surprising. That is, that there should be expressions that have some characteristics of problem statements but are fragmentary seems not surprising. However, that there should exist systems (i.e., men) that can solve these problems without the eventual intervention of formal statements and formal methods does pose an issue.

Perhaps there are two domains of problems, the well structured and the ill structured. Formalization always implies the first. Men can deal with both kinds. By virtue of their capacity for working with ill-structured problems, they can transmute some of these into well-structured (or formalized) problems. But the study of formalized problems has nothing to say about the domain of ill-structured problems. In particular, there can never be a formalization of ill-structured problems, hence never a theory (in a strict sense) about them. All that is possible is the conversion of particular problems from ill structured to well structured via the one transducer that exists, namely, man. Perhaps an analog is useful: well-structured problems are to ill-structured problems as linear systems are to nonlinear systems, or as stable systems are to unstable systems, or as rational behavior is to nonrational behavior. In each case, it is not that the world has been neatly divided into two parts, each with a theory proper to it.
Rather, one
member of the pair is a very special case about which much can be said, whereas the other member consists of all the rest of the world-uncharted, lacking uniform approach, inchoate, and incoherent. This is not the only view, of course.
Alternatively, one can assert
that all problems can be formulated in the same way. The formalization exists because there is some symbolic system, whether on paper or in the head, that holds the specific, quite definite information in the problem statement. The fragmentation of problem statement that occurs when an attempt is made to explicate the problem only shows that there are serious (perhaps even profound) communication problems. But it does not say that ill-structured problems are somehow different in nature. Thus we have an issue-somewhat ill structured, to be sure-but still an issue. What are ill-structured problems and are they a breed apart from well-structured ones? This chapter is essentially an essay devoted to exploring the issue, as it stands in 1968.

We have an issue defined. What gives life to it are two concerns,
one broad, one narrow. At root, there is the long-standing concern over the rationalization of human life and action. More sharply stated, this is the challenge of art by science. In terms of encounters long resolved, it is whether photography will displace painting, or whether biology and physiology can contribute to the practice of medicine.
In terms of encounters now in twilight, it is whether new products come from applied science or from lone inventors. In terms of encounters still active, it is whether the holistic diagnostic judgment of the clinical psychologist is better than the judgments of a regression equation [12]. For the purpose of this essay, of course, it is to what extent management science can extend into the domain of business judgment. When put in this context, the issue has a charge.
The concern flows
partly from economics, where it is no� usually labeled the problem of automation.
Concern also flows from a problem of identity, in which some are compelled to ask what attributes man can uniquely call his own. As has been pointed out, probably most thoroughly by Ellul [3], it makes no sense to separate hardware and software, that is, to separate machines from procedures, programs, and formalized rules. They are all expressions of the rationalization of life, in which human beings become simply the agents or carriers of a universalistic system of orderly relations of means to ends. Thus, viewed broadly, the issue is emotionally toned. However, this
fact neither eliminates nor affects the scientific questions involved, although it provides reasons for attending to them. Our aim in this essay is essentially scientific, a fact which leads to the second, more narrow context. Within management science the nature of rationalization has varied somewhat over time. In the early days, those of Frederick Taylor, it was expressed in explicit work procedures, but since World War II it has been expressed in the development of mathematical models and quantitative methods. In 1958 we put it as follows:
A problem is well structured to the extent that it satisfies the following criteria:
1. It can be described in terms of numerical variables, scalar and vector quantities.
2. The goals to be attained can be specified in terms of a well-defined objective function, for example, the maximization of profit or the minimization of cost.
3. There exist computational routines (algorithms) that permit the solution to be found and stated in actual numerical terms. Common examples of such algorithms, which have played an important role in operations research, are maximization procedures in the calculus and calculus of variations, linear-programming algorithms like the stepping-stone and simplex methods, Monte Carlo techniques, and so on [21, pp. 4-5].
Ill-structured problems were defined, as in the introduction of this essay, in the negative: all problems that are not well structured. Now the point of the 1958 paper, and the reason that it contrasted well- and ill-structured problems, was to introduce heuristic programming as relevant to the issue:

With recent developments in our understanding of heuristic processes and their simulation by digital computers, the way is open to deal scientifically with ill-structured problems, to make the computer co-extensive with the human mind [21, p. 9].
That is, before the development of heuristic programming (more generally, artificial intelligence) the domain of ill-structured problems had been the exclusive preserve of human problem solvers. Now we had other systems that also could work on ill-structured problems and that could be studied and used. This 1958 paper is a convenient marker for the narrow concern of the present essay. It can symbolize the radical transformation brought by the computer to the larger, almost philosophical concern over the nature and possibilities for rationalization. The issue has become almost technical, although now it involves three terms, where before it involved only two:
• What is the nature of problems solved by formal algorithms?
• What is the nature of problems solved by computers?
• What is the nature of problems solved by men?
We have called the first well-structured problems; the last remains the residual keeper of ill-structured problems; and the middle term offers the opportunity for clarification. Our course will be to review the 1958 paper a little more carefully, leading to a discussion of the nature of problem solving. From this will emerge an hypothesis about the nature of generality in problem solving, which will generate a corresponding hypothesis about the nature of ill-structured problems. With theses in hand, we first consider some implications of the hypotheses, proceed to explore these a little, and finally bring out some deficiencies.
The 1958 paper asserted that computers (more precisely, computers appropriately programmed) could deal with ill-structured problems, where the latter was defined negatively. The basis of this assertion was twofold.
First, there had just come into existence the first successful heuristic programs, that is to say, programs that performed tasks requiring intelligence when performed by human beings. They included a theorem-proving program in logic [15], a checker-playing program [19], and a pattern recognition program [20]. These were tasks for which algorithms either did not exist or were so immensely expensive as to preclude their use. Thus, there existed some instances of programs successfully solving interesting ill-structured problems. The second basis was the connection between these programs and the nature of human problem solving [16]. Insofar as these programs reflected the same problem-solving processes as human beings used, there was additional reason to believe that the programs dealt with ill-structured problems. The data base for the assertion was fairly small, but there followed in the next few years additional heuristic programs that provided support.
There was one that proved theorems in plane geometry, one that did symbolic indefinite integration, a couple of chess programs, a program for balancing assembly lines, and several pattern recognition programs [5]. The 1958 paper provided no positive characterization of ill-structured problems. Although it could be said that some ill-structured problems were being handled, these might constitute a small and particularly "well formed" subset. This was essentially the position taken by Reitman, in one of the few existing direct contributions to the question of ill-formed problems [17, 18].
He observed, as have others, that all of the heuristic
programs, although lacking well-specified algorithms, were otherwise quite
precisely defined. In particular, the test whereby one determined whether the problem was solved was well specified, as was the initial data base from which the problem started. Thus, he asserted that all existing heuristic programs were in a special class by virtue of certain aspects being well defined, and thus shed little light on the more general case. Stating this another way, it is not enough for a problem to become ill structured only with respect to the methods of solution. It is required also to become ill structured with respect to both the initial data and the criteria for solution. To the complaint that one would not then really know what the problem was, the rejoinder is that almost all problems dealt with by human beings are ill structured in these respects. To use an example discussed by Reitman, in the problem of making a silk purse from a sow's ear, neither "silk purse" nor "sow's ear" is defined beyond cavil. To attempt really to solve such a problem would be, for instance, to search for some ways to stretch the implicit definitions to force acceptance of the criteria, for example, chemical decomposition and resynthesis. Reitman attempted a positive characterization of problems by setting out the possible forms of uncertainty in the specification of a problem: the ways in which the givens, the sought-for transformation, or the goal could be ill defined. This course has the virtue, if successful, of defining "ill structured" independently of problem solving and thus providing a firm base on which to consider how such problems might be tackled. I will not follow him in this approach, however. It seems more fruitful here to start with the activity of problem solving.
1. THE NATURE OF PROBLEM SOLVING
A rather general diagram, shown in Fig. 10.1, will serve to convey a view of problem solving that captures a good deal of what is known, both casually and scientifically. A problem solver exists in a task environment, some small part of which is the immediate stimulus for evoking the problem and which thus serves as the initial problem statement.1 This external representation is translated into some internal representation (a condition, if you please, for assimilation and acceptance of the problem by the problem solver). There is located within the memory of the problem solver a collection of methods. A method is some organ-

1 Its statement form is clear when given linguistically, as in "Where do we locate the new warehouse?" Otherwise, "statement" is to be taken metaphorically as comprising those clues in the environment attended to by the problem solver that indicate to him the existence of the problem.
Figure 10.1. General schema of a problem solver. [The diagram shows the task environment supplying the problem statement, an input translation into the problem solver's internal representation, and a method store and general knowledge operating on that internal representation. A note indicates that the input representation is not under the control of the inputting process.]
ized program or plan for behavior that manipulates the internal representation in an attempt to solve the problem. For the type of problem solvers we have in mind (business men, analysts, etc.) there exist many relatively independent methods, so that the total behavior of the problem
solver is made up as an iterative cycle in which methods are selected on the basis of current information (in the internal representation) and tried with consequent modification of the internal representation, and a new method is selected. Let us stay with this view of problem solving for a while, even though it de-emphasizes some important aspects, such as the initial determination of an internal representation, its possible change, and the search for or construction of new methods (by other methods) in the course of problem solving. What Fig. 10.1 does emphasize is the method, the discrete package of information that guides behavior in an attempt at problem solution. It prompts us to inquire about its anatomy.

1.1. The Anatomy of a Method
Let us examine some method in management science. The simplex method of linear programming will serve admirably. It is well known, important, and undoubtedly a method. It also works on well-structured problems, but we will take due account of this later. The basic linear programming problem is as follows.

Given: a set of positive real variables x_j ≥ 0, j = 1, . . . , n, and real constants a_ij, b_i, c_j (i = 1, . . . , m; j = 1, . . . , n);
maximize z = Σ_j c_j x_j
subject to Σ_j a_ij x_j ≤ b_i, i = 1, . . . , m.
Figure 10.2 shows the standard data organization used in the simplex method and gives the procedure to be followed. We have left out the procedure for getting an initial solution, that is, we assume that the tableau of Fig. 10.2 is initially filled in. Likewise, we have ignored the detection of degeneracy and the procedure for handling it. There are three parts to the simplex method. The first is the state ment of the problem. This determines the entities you must be able to identify in a situation and what you must know about their properties and mutual interrelationships. Unless you can find a set of nonnegative numerical quantities, subject to linear constraints and having a linear obj ective function to be maximized, you do not have a LP problem, and
the method cannot help you. The second part of the method is the procedure that delivers the solution to the problem. It makes use only of information available in the problem statement. Indeed, these two are coupled together as night and day. Any information in the problem statement which is known not to enter into the procedure can be discarded, thereby making the problem just that much more general.

Simplex tableau
[The tableau lists the current basis of m variables with their coefficients c and constants b, the columns for all n + m variables x1, . . . , xn+m with entries t_ij, and a bottom row of the quantities z_j − c_j.]

Procedure
1. Let j0 = column with min {z_j − c_j | z_j − c_j < 0}; if no z_j − c_j < 0, then z is at its maximum, end.
2. Let i0 = row with min {b_i / t_ij0 | b_i / t_ij0 > 0}; if no b_i / t_ij0 > 0, then z is unbounded, end.
3. For row i0, t_i0j ← t_i0j / t_i0j0.
4. For row i ≠ i0, t_ij ← t_ij − t_ij0 t_i0j (t_i0j is the value from step 3).

Define x_{n+i} = b_i − Σ_j a_ij x_j, c_{n+i} = 0, a_{i,n+k} = 1 if i = k, 0 otherwise (i = 1, . . . , m).

Figure 10.2. Simplex method.
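For readers who prefer running code to a tableau, the following Python sketch follows the four steps of the procedure above. It assumes the standard maximization problem with constraints Ax ≤ b and b ≥ 0 (so that the slack variables give an initially feasible tableau), ignores degeneracy as the text does, and is illustrative rather than a production implementation; the example data are invented.

import numpy as np

def simplex(c, A, b):
    m, n = A.shape
    # Tableau columns: the n original variables, m slack variables, then the constants b.
    T = np.hstack([A, np.eye(m), b.reshape(-1, 1)]).astype(float)
    cost = np.concatenate([c, np.zeros(m)]).astype(float)
    basis = list(range(n, n + m))                     # start from the slack basis
    while True:
        z_minus_c = cost[basis] @ T[:, :-1] - cost    # the (z_j - c_j) row of the tableau
        j0 = int(np.argmin(z_minus_c))                # step 1: most negative z_j - c_j
        if z_minus_c[j0] >= 0:
            break                                     # at the maximum of z
        ratios = [T[i, -1] / T[i, j0] if T[i, j0] > 0 else np.inf for i in range(m)]
        i0 = int(np.argmin(ratios))                   # step 2: ratio test picks the leaving row
        if ratios[i0] == np.inf:
            raise ValueError("z is unbounded")
        T[i0] /= T[i0, j0]                            # step 3: normalize the pivot row
        for i in range(m):                            # step 4: eliminate column j0 elsewhere
            if i != i0:
                T[i] -= T[i, j0] * T[i0]
        basis[i0] = j0
    x = np.zeros(n + m)
    x[basis] = T[:, -1]
    return x[:n], float(cost @ x)

# Usage: maximize 3x1 + 2x2 subject to x1 + x2 <= 4, x1 + 3x2 <= 6.
x, z = simplex(np.array([3.0, 2.0]), np.array([[1.0, 1.0], [1.0, 3.0]]), np.array([4.0, 6.0]))
print(x, z)  # [4. 0.] 12.0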
The third part of the method is the proof or justification that the procedure in fact delivers the solution to the problem (or delivers it within certain specified limits). The existence of this justification has several consequences. One, already noted, is the complete adaptation of means to ends, of the shaping of the problem statement so that it is as general as possible with respect to the procedure. Another consequence is a toleration of apparent meaninglessness in the procedure. It makes no difference that there seems to be neither rhyme nor reason to the steps of the method in Fig. 10.2. Careful analysis reveals that they are in fact just those steps necessary to the attainment of the solution. This feature is characteristic of mathematical and computational methods generally and sometimes is even viewed as a hallmark. An additional part of the simplex method is a rationale that can be used to make the method understandable. The one usually used for the simplex is geometrical, with each constraint being a (potential) boundary plane of the space of feasible solutions. Then the simplex procedure is akin to climbing from vertex to vertex until the maximal one is reached. This fourth part is less essential than the other three. The first three parts seem to be characteristic of all methods. Certainly, examples can be multiplied endlessly. The quadratic formula provides another clear one:
Problem statement: Find x such that ax² + bx + c = 0.
Procedure: compute x = −b/(2a) ± (1/(2a))√(b² − 4ac).
Justification: substitute the formula in ax² + bx + c and show by algebraic manipulation that 0 results.
In each case a justification is required (and forthcoming) that establishes the relation of method to problem statement. As we move toward more empirical methods, the precision of both the problem statement and the procedure declines, and concurrently the precision of the justification; in fact, justification and plausible rationale merge into one.

1.2. Generality and Power
We need to distinguish the generality of a method from its power. A method lays claim via its problem statement to being applicable to a certain set of problems, namely, to all those for which the problem state ment applies. The generality of a method is determined by how large the set of problems is. Even without a well-defined domain of all prob lems, or any suitable measure on the sets of problems, it is still often possible to compare two problem statements and judge one to be more inclusive than another, hence one method more general than the other.
A method that is applicable only to locating warehouses is less general than one that is applicable to problems involving the location of all physical resources. But nothing interesting can be said about the relative generality of a specific method for inventory decisions versus one for production scheduling. Within the claimed domain of a method we can inquire after its ability to deliver solutions: the higher this is, the more powerful the method. At least three somewhat independent dimensions exist along which this ability can be measured. First, the method may or may not solve every problem in the domain; and we may loosely summarize this by talking of the probability of solution. Second, there may exist a dimension of quality in the solution, such as how close an optimizing method gets to the peak. Then methods can differ on the quality of their solutions. (To obtain a simple characteristic for this requires some summarization over the applicable domain, but this feature need not concern us here.) Third, the method may be able to use varying amounts of resources. Then, judgments of probability of solution and of quality are relative to the amount of resources used. Usually the resource will be time, but it can also be amount of computation, amount of memory space, number of dollars to acquire information, and so on. For example, most iterative methods for solving systems of equations do not terminate in a finite number of iterations, but produce better solutions if run longer; the rate of convergence becomes a significant aspect of the power of such methods. In these terms the simplex method would seem to rank as one of limited generality but high power. The restrictions to linearity, both in the constraints and the objective function, and to a situation describable by a set of real numbers, all constrict generality. But the simplex method is an algorithm within its domain and guarantees delivery of the complete solution. It is not the least general method, as is indicated by the transportation problem with its more specialized assumptions; nor is it the most powerful method for its domain, since it can be augmented with additional schemes that obtain solutions more expeditiously. Evidently there is an inverse relationship between the generality of a method and its power. Each added condition in the problem statement is one more item that can be exploited in finding the solution, hence in increasing the power. If one takes a method, such as the simplex method, and generalizes the problem statement, the procedure no longer solves every problem in the wider domain, but only a subset of these. Thus the power diminishes. The relationship is not one-one, but more like a limiting relationship in which the amount of information in the problem statement puts bounds on how powerful the method can be.
Figure 10.3. Generality versus power. [Axes: power (ordinate) against information demanded, that is, decreasing generality (abscissa).]
This relationship is important enough to the argument of this essay that we indicate it symbolically in Fig. 10.3. The abscissa represents increasing information in the problem statement, that is, decreasing generality. The ordinate represents increasing power. For each degree of generality an upper bound exists on the possible power of a method, though there are clearly numerous methods which do not fully exploit the information in the problem statement.

2. TWO HYPOTHESES: ON GENERALITY AND ON ILL-STRUCTURED PROBLEMS
With this view of method and problem solver we can move back toward the nature of ill-structured problems. However, we need to ad dress one intermediate issue : the nature of a general problem solver. The first heuristic programs that were built laid claims to power, not to generality. A chess or checker program was an example of artificial intelligence because it solved a problem difficult by human standards ; there was never a pretense of its being general. Today's chess programs cannot even play checkers, and vice versa. Now this narrowness is completely consistent with our general ex perience with computer programs as highly special methods for restricted tasks. Consider a typical subroutine library, with its specific routines for inverting matrices, computing the sine, carrying out the simplex method, and so on. The only general "programs" are the higher-level
programming languages, and these are not problem solvers in the usual sense, but only provide means to express particular methods.2 Thus the view has arisen that, although it may be possible to construct an artificial intelligence for any highly specific task domain, it will not prove possible to provide a general intelligence. In other words, it is the ability to be a general problem solver that marks the dividing line between human intelligence and machine intelligence. The formulation of method and problem solver given earlier leads rather directly to a simple hypothesis about this question:

Generality Hypothesis. A general problem solver is one that has a collection of successively weaker methods that demand successively less of the environment in order to be applied. Thus a good general problem solver is simply one that has the best of the weak methods.
This hypothesis, although itself general, is not without content. (To put it the way that philosophers of science prefer, it is falsifiable. ) It says that there are no special mechanisms of generality-nothing beyond the willingness to carry around specific methods that make very weak de mands on the environment for information. By the relationship expressed in Fig. 10.3 magic is unlikely, so that these methods of weak demands will also be methods of low power. Having a few of them down at the very low tip in the figure gives the problem solver the ability to tackle almost any kind of problem, even if only with marginal success. There are some ways in which this generality hypothesis is almost surely incorrect or at least incomplete, and we will come to these later ; but let us remain with the main argument. There is at least one close association between generality and ill-structured problems : it is man that can cope with both. It is also true that ill-structured problems, what ever else may be the case, do not lend themselves to sharp solutions. Indeed, their lack of specificity would seem to be instrumental in pro hibiting the use of precisely defined methods. Since every problem does present some array of available information-something that could meet the conditions of a problem statement of some method-the suspicion arises that lack of structure in a problem is simply another indication that there are not methods of high power for the particular array of information available. Clearly this situation does not prevail absolutely, but only with respect to a given problem solver and his collection of methods (or, equally, a population of highly similar problem solvers) . We can phrase this suspicion in sharper form : 2
The relationship of programming languages to problem solvers, especially as the
languages become more problem-oriented, is unexplored territory. Although relevant to the main question of this essay, it cannot be investigated further here.
Ill-structured Problem Hypothesis. A problem solver finds a problem ill structured
if the power of his methods that are applicable to the problem lies below a certain threshold.
The lack of any uniform measure of power, with the consequent lack of precision about a threshold on this power, is not of real concern: the notion of ill-structuredness is similarly vague. The hypothesis says that the problem of locating a new warehouse will look well structured to a firm that has, either by experience, analysis, or purchase, acquired a programmed procedure for locating warehouses, providing it has decided that the probability of obtaining an answer of suitable quality is high enough simply to evoke the program in the face of the new location problem. The problem will look ill structured to a firm that has only its general problem-solving abilities to fall back on. It can only have the most general faith that these procedures will discover appropriate information and use it in appropriate ways in making the decision. My intent is not to argue either of these two hypotheses directly, but rather to examine some of their implications. First, the weak methods must be describable in more concrete terms. This we will do in some detail, since it has been the gradual evolution of such methods in artificial intelligence that suggested the hypotheses in the first place. Second, the picture of Fig. 10.3 suggests not only that there are weak methods and strong ones, but that there is continuity between them in some sense. Phrased another way, at some level the methods of artificial intelligence and those of operations research should look like members of the same family. We will also look at this implication, although somewhat more sketchily, since little work has been done in this direction. Third, we can revisit human decision makers in ill-structured situations. This we do in an even more sketchy manner, since the main thrust of this essay stems from the more formal concerns. Finally, after these (essentially positive) explications of the hypotheses, we will turn to discussion of some difficulties.

3. THE METHODS OF HEURISTIC PROGRAMMING
There has been continuous work in artificial intelligence ever since the article quoted at the beginning of this chapter (21 ] took note of the initial efforts. The field has had two main branches. We will concentrate on the one called heuristic programming. It is most closely identified with the programmed digital computer and with problem solving. Also, almost all the artificial intelligence efforts that touch management sci ence are included within it. The other branch, identified with pattern
recognition, self-organizing systems, and learning systems, although not exempt from the observations to be made here, is sufficiently different to preclude its treatment. A rather substantial number of heuristic programs have been con structed or designed and have gone far enough to get into the literature. They cover a wide range of tasks : game playing, mostly chess, checkers, and bridge ; theorem proving, mostly logic, synthetic geometry, and various elementary algebraic· systems ; all kinds of puzzles ; a range of management science tasks, including line balancing, production sched uling, and warehouse location ; question-answering systems that accept quasi-English of various degrees of sophistication ; and induction prob lems of the kind that appear on intelligence tests. The main line of progress has constituted a meandering tour through new task areas which seemed to demand new analyses. For example, there is considerable cur rent work on coordinated effector-receptor activity (e.g., hand-eye) in the real world-a domain of problems requiring intelligence that has not been touched until this time. Examination of this collection of programs reveals that only a few ideas seem to be involved, despite the diversity of tasks. These ideas, if properly expressed, can become a collection of methods in the sense used . earlier. Examination of these methods shows them to be extraordinarily weak compared with the methods, say, of linear programming. In compen sation, they have a generality that lets them be applied to tasks such as discovering proofs of theorems, where strong methods are unknown.3 It thus appears that the work in heuristic programming may provide a first formalization of the kind of weak methods called for by our two hypotheses. (To be sure, as already 1 noted, psychological invention runs the other way : the discovery that there seems to be a small set of methods underlying the diversity of heuristic programs suggested the two hy potheses.) It might be claimed that the small set of methods shows, not parsi mony, but the primitive state of development of the field and that in vestigators read each other's papers. Although there is clearly some force to this argument, to an important degree each new task attempted in heuristic programming represents an encounter with an unknown set of 3
Strong methods of proof discovery do get developed in the course of mathe
matical progress, but their effect is to reduce whole areas to a calculus. The develop ment of the operational calculus-later the Laplace and Fourier transforms-is a case in point. But the present theorem-proving programs and the methods that lie behind them do not involve mathematical advances ; rather they appear to capture methods available for proof discovery within the existing state of ignorance.
demands that have to be met on their own terms. Certainly the people who have created heuristic programs have often felt this way. In fact, the complaint is more often the opposite to the above caveat-that arti ficial intelligence is a field full of isolated cases with no underlying coherency. In fact, the view expressed in this chapter is not widely held. There is some agreement that all heuristic theorem provers and game players make use of a single scheme, called heuristic search. But there is little acknowledgment that the remainder of the methods listed below con stitute some kind of basic set. With this prelude, let us describe briefly some methods. An adequate j ob cannot be done in a single chapter ; it is more an undertaking for a textbook. Hopefully, however, some feeling for the essential char acteristics of generality and power can be obtained from what is given. The first three, generate-and-test, match, and hill climbing, rarely occur as complete methods in themselves (although they can) , but are rather the building blocks out of which more complex methods are composed. 3.1. Generate-and-Test
This is the weak method par excellence. All that must be given is a way to generate possible candidates for solution plus a way to test whether they are indeed solutions. Figure 10.4 provides a picture of generate-and-test that permits us to view it as a method with a problem statement and a procedure. The flow diagram in the figure adopts some conventions that will be used throughout. They allow expresson of the central idea of a method without unnecessary detail. The lines in the diagram show the flow of data, rather than the flow of control-more in the style of an analog computer flow diagram than a digital com puter flow diagram. Thus the nodes represent processes that receive inputs and deliver outputs. If a node is an item of data, as in the pre dicate P or the set { x} (braces are used to indicate sets) , it is a memory process that makes the data item available. A process executes (or fires) when it receives an input ; if there are several inputs, it waits until all appropriate ones have arrived before firing. A generator is a process that takes information specifying a set and produces elements of that set one by one. It should be viewed as autono mously "pushing" elements through the system. Hence there is a flow of elements from generate to the process called test. Test is a process that determines whether some condition or predicate is true of its input and behaves differentially as a result. Two different outputs are possible : satisfied ( + ) and unsatisfied ( - ) . The exact output behavior depends
Problem statement
Given: a generator of the set {x}; a test of the predicate P defined on elements of {x}.
Find: an element of {x} that satisfies P(x).

Procedure
[Flow diagram: generate produces elements of {x} one by one; each element is input to test, whose + output is the solution.]

Justification
To show y is a solution if and only if y ∈ {x} and P(y).
Notation: π → a means that process π produces a; a/β means that a is on the line labeled β.
Test has associated with it a predicate P on one variable, such that:
  test → a/+ if and only if P(a) and a/input;
  test → a/− if and only if not P(a) and a/input.
Generate has associated with it a set {x} such that:
  generate → a/element only if a ∈ {x};
  a ∈ {x} implies there exists a time when generate → a/element.
Working backward from the flow line labeled solution, we get:
1. y/solution if and only if test → y/+.
2. test → y/+ if and only if P(y) and y/input.
Now we need only show that y/input if and only if y ∈ {x}.
3. y/input if and only if generate → y/element.
4. generate → y/element only if y ∈ {x}.
Now we need only show that y ∈ {x} implies generate → y/element; however, the best we can do is:
5. y ∈ {x} implies there exists a time when generate → y/element.

Figure 10.4. Generate-and-test method.
HEURISTIC PROGRAMMING
cates production of an output, and the slash (/) indicates that a data item is on the line with the given label. Thus the first proposition under test says that test produces an item on its + output line if and only if that item was input and also satisfies the associated predicate. For any particular generator the associated set must be fully specified, but clearly that specification can be shared in particular ways between the structure of the generator and some actual inputs ; for example, a generator could take an integer as input and produce the integers greater than the input, or it could have no input at all and simply generate the positive integers. The same situation holds for test or any other process : its associated constructs must be fully specified, but that specification can be shared in different ways between the structure of the process and some of its inputs. Sometimes we will put the associated construct on the flow dia gram, as we have in Fig. 10.4, to show the connection between the proc esses in the flow diagram and the constructs used in the statement of the problem. We use dotted lines to show that these are not really inputs, although inputs could exist that partially specify them. We have provided a sketch of a justification that the procedure of the method actually solves the problem. In substance the proof is trivial. To carry it through in detail requires formalization of both the procedure and the language for giving the problem statements and the properties known to hold if a process is executed [ 6] The handling of time is a bit fussy and requires more formal apparatus than is worthwhile to pre sent here. Note, for instance, that if the generator were not allowed to go to conclusion, generate-and-test would not necessarily produce a so lution. Similar issues arise with infinite sets. Justifications will not be presented for the other methods. The purpose of doing it for this (sim plest) one is to show that all the components of a method-problem state ment, procedure, and j ustification-exist for these methods of artificial intelligence. However, no separate rationale is needed for generate-and test, partly because of its simplicity and partly because of the use of a highly descriptive procedural language. If we had used a machine code, for instance, we might have drawn the procedure of Fig. 10.4 as an in formal picture of what was going on. Generate-and-test is used as a complete method, for instance, in opening a combination lock (when in desperation) . Its low power is demonstrated by the assertion that a file with a combination lock is a "safe." Still, the method will serve to open the safe eventually. Generate and-test is often used by human beings as a second method for finding lost items, such as a sock or a tiepin. The first method relies on recol lections about where the item was left or last seen. After this has failed, .
19
20
CHAPTER 1
generate-and-test is evoked, generating the physical locations in the room one by one, and looking in each. The poor record of generate-and-test as a complete method should not blind one to its ubiquitous use when other information is absent. It is used to scan the want ads for neighborhood rentals after the proper column is discovered (to the retort "What else?", the answer is, "Right! That's why the method is so general"). In problem-solving programs it is used to go down lists of theorems or of subproblems. It serves to detect squares of interest on chessboards, words of interest in expressions, and figures of interest in geometrical displays.

3.2. Match
We are given the following expression in symbolic logic: e: (p ∨ q) ⊃ ((p ∨ q) ∨ (r ⊃ p))
A variety of problems arise from asking whether e is a member of various specified sets of logic expressions. Such problems can usually be thrown into the form of a generate-and-test, at which point the difficulty of find ing the solution is directly proportional to the size of the set. If we know more about the structure of the set, better methods are available. For instance, consider the following two definitions of sets :
S1: x ⊃ (x ∨ y), where x and y are any logic expressions.
    Examples: p ⊃ (p ∨ q), q ⊃ (q ∨ q), (p ∨ p) ⊃ ((p ∨ p) ∨ p), . . . .
S2: a, where a may be replaced (independently at each occurrence) according to the following schemes:
    a ← q, a ← (p ∨ a), a ← a ⊃ a.
    Examples: q, p ∨ q, q ⊃ q, p ∨ (p ∨ q), (p ∨ q) ⊃ (p ∨ q), . . . .
In 81, x and y are variables in the standard fashion, where each occur rence of the variable is to be replaced by its value. In S2 we have defined a replacement system, where each separate occurrence of the symbol a may be replaced by any of the given expressions. These may include a, hence lead to further replacements. A legal logic expression exists only when no a's occur. It is trivial to determine that e is a member of the set of expressions defined by Si, and not so trivial to determine that it is not a member of the set defined by 82• The difference is that for 81 we could simply match the expressions against the form and determine directly the values of the variables required to do the j ob. In the case of 82 we had essentially to generate-and-test. (Actually, the structure of the replacement system
permits the generation to be shaped somewhat to the needs of the task, so it is not pure generate-and-test, which assumes no knowledge of the internal structure of the generator.) Figure 10.5 shows the structure of the match method, using the same symbolism as in Fig. 10.4 for the generate-and-test. A key assumption, implicit in calling X and F expressions, is that it is possible to generate the subparts of X and F, and that X and F are equal if and only if cor responding subparts are equal. Thus there are two generators, which produce corresponding subparts of the two expressions as elements. These are compared : if equal, the generation continues ; if not equal, a test is made i f the element from the form is a variable. If it is, a substitution of the corresponding part of X for the variable is possible, thus making the two expressions identical at that point, and permitting generation to continue. The generators must also produce some special end signal, whose co-occurrence is detected by the compare routine to determine that a solution has been found. The match procedure sketched in Fig. 10.5 is not the most general one possible. Operations other than substitution can be used to modify the form (more generally, the kernel structure) so that it is equal to X. There can be several such operations with the type of difference between the two elements selecting out the appropriate action. This action can Problem statement
Given: expressions made up of parts from a set S; a set of variables {v} with values in S; a form F, which is an expression containing variables; an expression X.
Find: whether X is in the set defined by F; that is, values for {v} such that X = F (with values substituted).

Procedure
[Flow diagram: two generators produce corresponding subparts (elements) of F and of X; compare tests each pair; equal parts let generation continue, while an unequal part from the form is tested for being a variable, in which case substitute binds the corresponding part of X to it; co-occurrence of the end signals yields the solution, otherwise failure.]

Figure 10.5. Match method.
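A minimal sketch of the match method in Python follows. The representation of logic expressions as nested tuples and of variables as strings beginning with '?' is an assumption made for the example, not part of the original formulation.

def is_variable(term):
    return isinstance(term, str) and term.startswith('?')

def match(form, expr, bindings=None):
    """Return a dict of variable bindings if expr is an instance of form, else None."""
    bindings = {} if bindings is None else bindings
    if is_variable(form):
        if form in bindings:                      # variable already bound: the parts must agree
            return bindings if bindings[form] == expr else None
        bindings[form] = expr                     # substitute: bind the variable to this subpart
        return bindings
    if isinstance(form, tuple) and isinstance(expr, tuple) and len(form) == len(expr):
        for f, e in zip(form, expr):              # generate corresponding subparts in step
            bindings = match(f, e, bindings)
            if bindings is None:
                return None
        return bindings
    return bindings if form == expr else None     # constant parts must be identical

# Usage: S1 is x IMPLIES (x OR y); e is (p OR q) IMPLIES ((p OR q) OR (r IMPLIES p)).
S1 = ('IMPLIES', '?x', ('OR', '?x', '?y'))
e  = ('IMPLIES', ('OR', 'p', 'q'), ('OR', ('OR', 'p', 'q'), ('IMPLIES', 'r', 'p')))
print(match(S1, e))   # {'?x': ('OR', 'p', 'q'), '?y': ('IMPLIES', 'r', 'p')}

Because the bindings are fixed in a single pass over corresponding subparts, the cost is proportional to the size of the form rather than to the size of the set it defines, which is the point the text goes on to make.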
result in modification of X as well as F. It is possible to write a single procedure that expresses these more general possibilities, but the detail does not warrant it. The essential point is that generation occurs on the parts of the expressions, and when parts fail to correspond it is possible to make a local decision on what modifying operation is necessary (though perhaps not sufficient) for the two expressions to become equal. Matching is used so pervasively in mathematical manipulation, from algebraic forms to the conditions of a theorem, that our mathematical sophistication leads us not to notice how powerful it is. Whenever a set of possible solutions can be packaged as a form with variables, the search for a solution is no longer proportional to the size of the set of all pos sible solutions, but only to the size of the form itself. Notice that the generate process in generate...,and-test (Fig. 10.4) operates on quite a different set from the generate of the match (Fig. 10.5) . Beside the obvious uses in proving theorems and doing other mathe matics, matching shows up in tasks that seem remote from this discipline. One of them, as shown below, is inducing a pattern from a part. Another use is in answering questions in quasi-natural language. In the latter, information is extracted from the raw text by means of forms, with the variables taking subexpressions in the language as values. 3.3. Hill Climbing
The most elementary procedure for finding an optimum is akin to generate-and-test, with the addition that the candidate element is com pared against a stored elemen�the best so far-and replaces it if higher. The element often involves other information in addition to the position in the space being searched, for example, a function value. With just a little stronger assumptions in the problem statement, the problem can be converted into an analog of climbing a hill. There must be available a set of operators that find new elements on the hill, given an existing ele ment. That is, new candidate elements are generated by taking a step from the present position (one is tempted to say a "nearby" step, but it is the operators themselves that define the concept of nearness) . Thus the highest element so far plays a dual role, both as the base for gener ation of new elements and as the criterion for whether they should be kept. Figure 10.6 provides the capsule formulation of hill climbing. Gener ation is over the set of operators, which are then applied to the best x so far, until a better one is found. This method differs from the various forms of steepest ascent in not finding the best step from the current position before making the next step.
Problem statement
Given: a comparison of two elements of a set {x} to determine which is greater; a set of operators {q} whose range and domain is {x} [i.e., q(x) = x', another element of {x}].
Find: the greatest x ∈ {x}.

Procedure
[Flow diagram: generate produces operators q from {q}; apply applies each q to the best element so far; compare keeps the result as the new best element whenever it is greater.]

Figure 10.6. Hill climbing.
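As a running illustration, the following Python sketch implements the procedure of Fig. 10.6; the integer objective and the two step operators are invented for the example.

def hill_climb(start, operators, better):
    """Repeatedly apply the first operator that yields an improvement; stop at a local peak."""
    best = start
    improved = True
    while improved:
        improved = False
        for op in operators:                  # generate over the set of operators {q}
            candidate = op(best)              # apply q to the best element so far
            if better(candidate, best):       # compare; keep the candidate if greater
                best = candidate
                improved = True
                break                         # unlike steepest ascent, take the first improving step
    return best                               # a local peak (unimodality would make it the global one)

# Usage: maximize f(x) = -(x - 7)**2 over the integers, stepping by +1 or -1.
f = lambda x: -(x - 7) ** 2
peak = hill_climb(0, [lambda x: x + 1, lambda x: x - 1], lambda a, b: f(a) > f(b))
print(peak)  # 7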
A great deal has been written about hill climbing, and the interested reader will find a thorough discussion within the context of more elabo rate methods for finding optima in Chapter 9 on structured heuristic programming. Here we note only the familiar fact that the method does not guarantee to find the best element in the space. An additional con dition, unimodality, is required ; otherwise the procedure may end up on a local peak, which is lower than the highest peak. Actually, uni modality is not easy to define in the most general situation to which hill climbing applies, since the underlying space (which is "seen" only through the operators) need not have neighborhoods in the sense required to de fine a local peak. Hill climbing shows up in a subordinate way in many heuristic pro grams, especially in the adaptive setting of parametric values. For ex ample, in one program that attempted to construct programs satisfying certain criteria [9] , the various basic instructions were selected at random and used to extend the program built so far. The entire program was an elementary form of heuristic search, discussed in Section 3.4. But super imposed on it was a hill-climbing program that gradually modified the probabilities of selecting the basic instructions so as to maximize the yield of programs over the long haul. The operators randomly j iggled the selection probabilities around (always maintaining their sum equal to one) . The comparison was made on a statistical basis after observing the performance with the new probabilities for, say, 100 randomly se lected problems. Management science is much concerned with seeking optima, although, as mentioned above, the methods used are more elaborate. This can be
illustrated by a heuristic program developed by Kuehn and Hamburger [ 1 1 ] for locating warehouses so as to balance the costs of distribution (which decrease with more warehouses) and the costs of operation (which increase with more warehouses) . The program consists of three separate optimizers : a cascade of a generate-and-test optimizer and a steepest ascent optimizer, followed by a simple hill climber, followed by a set of simple generate-and-test optimizers. Figure 10.7 gives the problem state ment (leaving out details on the cost functions) and an indication of how the problem is mapped into the problem space4 for optimization. Three operators are defined, corresponding to the three separate stages already mentioned. The procedure is given for the first stage (called the Main Routine) , but not for the other two (called the Bump and Shift Routine) . The elements of the problem space consist of all subsets of warehouses, taken from a list of possible sites (which is a subset of the total set of sites with customer demand) . The program builds up the set of ware houses by the operation of adding one warehouse at a time. The actual data structure corresponding to the element consists not only of the list of warehouses but also the assignment of each customer to a warehouse, the partial cost borne by that warehouse, and the total cost of operating ( TC ) . (That incremental cost calculations are easier to make than calcu lations starting from scratch is an important aspect of the efficiency of programs such as this one.) The main part of the program simply con siders adding new warehouses ( i.e., taking steps in the problem space) and comparing these against the current position on total cost. It is a steepest ascent scheme, since it goes through the whole set and then picks the best one. The additional wrinkle is to eliminate from the set of un used warehouses any whose costs are less than the current position, thus depleting the set to be considered. In fact, the first stage terminates when this set becomes empty. The operator generator delivers only a fixed subset of _all possible warehouse sites. It does this by a simple hill-climbing scheme whereby the best n sites are chosen from the total set on the basis of local cost (LC ) , which is the cost savings to be made from handling the local de mand at the same site as the warehouse ( n is a parameter of the pro gram) . This cascading of two optimizers keeps the second one from becoming excessively costly. The next two stages (the Bump and Shift Routine) make minor ' We often use the term problem space to refer to the set of potential solutions
as defined by the problem statement of a method. It includes the associated oper
ations for moving around
in
the space.
I
HEURISTIC PROGRAMMING
Problem statement
Given: a set of customers, {c} , with locations and sales volume; a set of factories, {f} , with locations ; a set of warehouse sites, {w} , with transportation costs to cus tomers and to factories, and operating costs. Find: a set of warehouses that minimizes total costs. Problem space for hill climbing
El,em,enJ,s : {xlx is a subset of {w} } , for each x can compute TC(x) . Initial element: the null set. DeS'ired element: the element with lowest TC. Operators: 1 . Add w to x ; 2. delete w from x ; 3. move w E x to location of customer of w. Note : all these permit incremental calculation of TC, since only paired comparisons with existing w for each customer affected are required. Procedure for stage 1 ( operator 1 )
delete {w} � ! to : f greater TC generate -------+ apply -----+ compare � {xlnext moves} ( b
A P'PlY:
Figure 10.9.
LT : Logic Theorist.
=
"-'a V b
30
CHAYI'ER 1
that any of the set of theorems will do. Although we have not shown the internal structure of the match, it does use definitions as well as substitutions to make a theorem and an expression the same. 3.5. Induction
Figure 10.10 shows a task that clearly involves inductive reasoning : you are to determine which of the figures 1-5 bears the same relationship to figure C as figure B does to figure A [ 4] . Similar problems exist in extrapolating series [22] : for example, what is the blank in abcbcdcd_?
ANALOGIES A
B
@
0
0
D
1
0
@
c
0
0
0
0
~ 0
2
~ 0 4
3
D
& Figure 10.10.
Analogies task.
5
HEURISTIC PROGRAMMING
Another similar task is to discover a concept, given a sequence of items, some of which exemplify the concept whereas others do not [8] : for example, if xoxox, xoxxo, oxoxo are positive instances and xooxo, xxoox, xoooo are negative instances, what is oxxxo? Computer programs have been constructed for these tasks. They show a certain diversity due to the gross shape of the task ; that is, the task of Fig. 10.10 gives one instance of the concept (A : B ) and five additional possible ones {C : l, C : 2, . . , C : 5}, whereas the series provides a long sequence of exemplars if one assumes that each letter can be predicted from its predecessors. However, most of the programs use a single method, adapted to the particular top-level task structure.6 Figure 10. 1 1 gives the method, although somewhat more sketchily than for the others. The first essential feature of the method is revealed in the problem statement, which requires the problem to be cast as one of finding a function or mapping of the given data into the associated (or predicted) data. The space of functions is never provided by the problem poser certainly not in the three examples just presented. Often it is not even clear what the range and domain of the function should be. For the series extrapolation task, to view {x:y} as {a: b, ab : c, abc : b, . . . , abcbcdcd :_} is already problem solving. Thus the key inductive step is the assumption of some space of functions. Once this is done the problem reduces to finding in this space one function (or perhaps the simplest one) that fits the exemplars. The second essential feature of the method is the use of a form or kernel for the function. This can be matched ( in the sense of the match method) against the exemplars. Evidence in the items then operates directly to specify the actual function from the kernel. Implicit in the procedure in Fig. 10. 1 1 is that, inside the match, generation on the kernel (refer back to Fig. 10.5) produces, not the parts of the kernel itself, but the predictions of the y associated with the presented x. However, parts of the kernel expression must show through in these predictions, so that the modification operations of the match can specify or modify them in the light of differences. When the kernel expression actually has variables in it, the prediction from the kernel is sometimes a variable. Its value can be made equal to what it should be from the given x : y, and thus the kernel expression itself specified. Often the modification operations are like linguistic replacement rules, and then the matter is somewhat more complex to describe. .
8
Most, but not all.
Several adapt the paradigm used for pattern recognition
programs. In addition, a method called the method of successive differences is ap plicable to series extrapolation where the terms of the series are expressed as numbers.
31
32
CHAPTER 1
Problem statement
Given: a domain {x} ; a range {y} ; a generator of associated pairs {x� y} . Find: a function f with domain {x} and range {y} such that f(x) = y for all {x : y} . Additional assumption (almost never given with the problem statement, and therefore constituting the actual inductive step) : Given: a set of functions {f} constructable from a set of kernel forms {k} . Procedure
{k} ············ ··� generate
{x : y}
--- ---![:-:� �
----4
!k
I
/ (succeed )
match
gene
solution (all z:11 work)
Figure 10.11.
Induction method.
It is not often possible to express the entire space of functions as a single form (whence a single match would do the j ob) . Consequently a sequential generation of the kernels feeds the match process. Sometimes clues in the exemplars are used to order the generation ; more often, generation is simply from the simplest functions to the more complex. This method is clearly a version of "hypothesis-and-test." However the latter term is used much more generally than to designate the class of induction problems handled by this method. Furthermore, there is nothing in hypothesis-and-test which implies the use of match ; it may be only generate-and-test. Consequently, we choose to call the method simply the induction method, after the type of task it is used for. 3.6. Summary
The set of methods just sketched-generate-and-test, hill climbing, match, heuristic search, and induction-constitutes a substantial fraction of all methods used in heuristic programming. To be sure, this is only a judgment. No detailed demonstration yet exists. Also, one or two im-
HEURISTIC PROGRAMMING
portant methods are m1ssmg ; for example, an almost universally used paradigm for pattern recognition. Two characteristics of the set of methods stand out. First, explicit references to processes occur in the problem statement, whereas this is not true of mathematical methods, such as the simplex method. Thus-.. generate-and-test specifies that you must have a generator and you must have a test ; then the procedure tells how to organize these. This feature seems to be related to the strength of the method. Methods with stronger assumptions make use of known processes whose existence is implied by the assumptions. In the simplex method generation on a set of variables is done over the index, and the tests used are equality and inequality on real numbers. Hence there is no need to posit directly, say, the generator of the set. The second characteristic is the strong similarity of the methods to each other. They give the impression of ringing the various changes on a small set of structural features. Thus there appear to be only two differences between heuristic search and hill climbing. First, it is neces sary to compare for the greater element in hill climbing ; heuristic search needs only a test for solution (although it can use the stronger compari son, as in means-ends analysis) . Second, hill climbing keeps only the best element found so far, that is, it searches the problem space from where it is. Heuristic search, on the other hand, keeps around a set of obtained elements and selects from it where next to continue the search. In consequence, it permits a more global view of the space than hill climbing-and pays for it, not only by extra memory and processing, but also by the threat of exponential expansion. Similarly, the difference between match and heuristic search is pri marily one of memory for past actions and positions. Our diagram for match does not reveal this clearly, since it shows only the case of a single modification operation, substitution ; but with a set of modification oper ations (corresponding to the set of operators in heuristic search) the match looks very much like a means-ends analysis that never has to back up. Finally, the more complex processes, such as LT and the warehouse program, seem to make use of the more elementary ones in recognizable combinations. Such combination does not always take the form of dis tinct units tied output to input (i.e., of closed subroutines) , but a flavor still exists of structures composed of building blocks . . In reviewing these methods instruction in the details of artificial intel ligence has not been intended. Hopefully, however, enough information has been given to convince the reader of two main points : ( 1 ) there is in
33
34
CHAPTER l
heuristic programming a set of methods, as this term was used in the beginning of the paper ; and (2) these methods make much weaker de mands for information on the task environment than do methods such as the simplex, and hence they provide entries toward the lower, more general end of the graph in Fig. 10.3. 4. THE CONTINUNITY OF METHODS
If the two hypotheses that we have stated are correct, we should certainly expect there to be methods all along the range exhibited in Fig. 10.3. In particular, the mathematical methods of management science should not be a species apart from the methods of artificial intelligence, but should be distinguished more by having additional constraints in the problem statement. Specific mathematical content should arise as state ments strong enough to permit reasonable mathematical analysis are introduced. Evidence for this continuity comes from several sources. One is the variety of optimization techniques, ranging from hill climbing to the calculation methods of the differential calculus, each with increasing amounts of specification. Another is the existence of several methods, such as so-called branch and bound techniques, that seem equally akin to mathematical and heuristic programming. Again, dynamic programming, when applied to tasks with little mathematical structure, leads to pro cedures which seem not to differ from some of the methods in heuristic programming, for example, the minimax search techniques for playing games such as chess and checkers. What we should like most of all is that each (or at least a satisfactory number) of the mathematical methods of management science would lie along a line of methods that extends back to some very weak but general ancestors. Then, hopefully, the effect on the procedure of the increasing information in the problem statement would be visible and we could see the continuity directly. As a simple example, consider inverting a matrix. The normal algo rithms for this are highly polished procedures. Yet one can look at in version as a problem-as it is to anyone who does not know the available theory and the algorithms based on it. Parts of the problem space are clear : the elements are matrices (say of order n) , hence include both the given matrix, A, the identity matrix, I, and the desired inverse, X. The problem statement is to find X such that AX = I. Simply generating and testing is not a promising way to proceed, nor is expressing X as a form, multiplying out, and getting n2 equations to solve. Not only are these
poor approaches, but also they clearly are not the remote ancestors of the existing algorithms.
If the inverse is seen as a transformation on A, carrying it into I, a more interesting specification develops. The initial object is A, the desired object is I, and the operators consist of premultiplication (say) by some basic set of matrices. Then, if operators E1, E2, ..., Ek transform A into I, we have Ek···E2E1A = I, hence Ek···E2E1I = A^{-1}. If the basic operators are the elementary row operations (permute two rows, add one row to another, multiply a row by a constant), we have the basic ingredients of several of the existing direct algorithms (those that use elimination rather than successive approximation). These algorithms prescribe the exact transformations to be applied at each stage, but if we view this knowledge as being degraded we can envision a problem solver doing a heuristic search (or perhaps hill climbing if the approach were monotone). Better information about the nature of the space should lead to better selection of the transformation until existing algorithms are approached.
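To make the elimination view concrete, the following sketch (an illustration added here, not an algorithm taken from this chapter) applies the three elementary row operations to carry A into I while applying the same operations to I, which thereby accumulates A^{-1}; it is ordinary Gauss-Jordan elimination, with the function and variable names invented for the example.

# Minimal sketch: row operations that carry A into I, applied in parallel to I,
# accumulate the inverse of A.
def invert(A):
    n = len(A)
    # Work on an augmented copy [A | I]; the row operations are the "operators".
    M = [list(map(float, A[i])) + [1.0 if j == i else 0.0 for j in range(n)]
         for i in range(n)]
    for col in range(n):
        # Select a pivot row (partial pivoting keeps the elimination well behaved).
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]          # operator: permute two rows
        scale = M[col][col]
        M[col] = [v / scale for v in M[col]]         # operator: multiply a row by a constant
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                factor = M[r][col]                   # operator: add a multiple of one row to another
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [row[n:] for row in M]

print(invert([[4.0, 7.0], [2.0, 6.0]]))  # [[0.6, -0.7], [-0.2, 0.4]]

The point of the illustration is only that each step is an operator application in the problem space just described; a heuristic searcher with degraded knowledge would have to select among such operators rather than follow the fixed prescription.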
4.1. An Example: the Simplex Method
The simplex method clearly involves both optimization and search, hence should eventually show kinship with the methods that we have been describing. We should be able to construct a sequence of methods, each with somewhat less stringent conditions and therefore with more general applicability but less power. Power can be measured here by applying the method to the original linear programming (LP) problem, where the true nature of the problem is known.
Figure 10.12 reformulates the LP problem and the simplex algorithm in the present notation. Indices have been suppressed as much as possible, partly by using the scalar product, in the interests of keeping the figures uncluttered. The problem statement for the simplex method, which we call SM from now on, is a special case of the LP problem, having equalities for constraints rather than inequalities, but involving n + m variables rather than just n. The transformation between problems is straightforward (although the original selection of this specialization is not necessarily so). The elements of the problem space (called bases) are all the subsets of m out of the n + m variables. An element consists of much more information than just the subset of variables, of course, namely, of the entire tableau shown in Fig. 10.2. The operators theoretically could be any rules that replace one subset of m variables with another. In fact, they involve adding just a single variable, hence removing one.
Problem statement for LP problem
Given: a set of n variables, {x}, where each x ≥ 0; let x be the n-tuple (x1, x2, ..., xn);
a set of m constraints, {g = b - ax}; let the feasible region be {x | g ≥ 0};
an objective function, z = cx.
Find: x in the feasible region such that z is maximum.

Problem statement for SM, the simplex method
Given: a set of n + m variables, {x}, where each x ≥ 0; let x be the (n + m)-tuple (x1, x2, ..., xn+m);
a set of m constraints, {g = b - ax}; let the feasible region be {x | g = 0};
an objective function, z = cx.
Find: x in the feasible region such that z is maximum.
Note: any LP problem can be solved if this one can. It is a separate problem to provide the translation between them (define x_{n+i} = g_i, and determine c and the a accordingly).

Problem space for SM
Elements: {B (bases), the (n+m choose m) subsets of m variables from {x}}; with B is associated T(B) (the tableau) containing: a feasible x such that x ∈ B implies x > 0, otherwise x = 0; the current value of z for x; the exchange rate (e) for each x [(z - c) in the tableau]; auxiliary data to permit application of any operator.
Initial element: B0, a feasible basis (not obtained by SM).
Operators: {x not in B}.

Procedure: select an operator from {x not in B} and apply it to (B, T(B)); repeat, stopping if the problem is unbounded.
Select (pick x with maximum e): generate over {x not in B}, compute (z', e') for each, and compare, keeping the x whose e' > e.

Figure 10.12. SM: reformulation of the simplex method.
This would still leave (n - m)m operators (any of n - m in, any of m out), except that no choice exists on the one to be removed. Hence, there are just n - m operators, specified by the n - m variables not in the current basis. Applying these operators to the current element consists of almost the entire calculation of the simplex procedure specified in Fig. 10.2 (actually steps 2, 3, and 4), which amounts to roughly m(n + m) multiplications (to use a time-honored unit of computational effort).
The procedure in Fig. 10.12 looks like a hill climber with the comparison missing: as each operator is selected, the new (B, T) is immediately calculated (i.e., the tableau updated) and a new operator selected. No comparison is needed because the selection produces only a single operator, and this is known to advance the current position. The procedure for selecting the operator reveals that the process generates over all potential operators-over all x not in the current basis-and selects the best one with respect to a quantity called the exchange rate (e). Thus the select is a simple optimizer, with one exception (not indicated in the figure): it is given an initial bias of zero, so that only operators with e > 0 have a chance. The exchange rate is the rate of change of the objective function (z) with a change in x, given movement along a boundary of the constraints, where the only variables changing are those already in the basis. Given this kind of exploration in the larger space of the n + m variables, e measures how fast z will increase or decrease. Thus, e > 0 guarantees that the compare routine is not needed in the main procedure.
The selection of an x with the maximum exchange rate does not guarantee either the maximum increase from the current position or the minimum number of steps to the optimum. The former could be achieved by inserting a compare routine and trying all operators from the current position; but this would require many times as much effort for (probably) small additional gain. However, since the space is unimodal in the feasible region, the procedure does guarantee that eventually an optimum will be reached.
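The control structure just described, in which selection already guarantees improvement so that the main loop has no comparison, can be sketched as follows. This skeleton is an illustration added here, not the chapter's own formulation; the exchange-rate computation and the tableau update (steps 2-4 of the simplex procedure) are assumed to be supplied as the callables exchange_rate and apply_operator.

# Sketch of the SM control structure: a hill climber with the comparison missing.
def sm_loop(basis, tableau, candidates, exchange_rate, apply_operator):
    while True:
        # Generate over all x not in the current basis, keeping the best rate.
        best_x, best_e = None, 0.0          # initial bias of zero: only e > 0 qualifies
        for x in candidates(basis):
            e = exchange_rate(tableau, x)
            if e > best_e:
                best_x, best_e = x, e
        if best_x is None:                  # no improving operator: an optimum has been reached
            return basis, tableau
        # Apply the single selected operator; no compare step is needed afterward.
        basis, tableau = apply_operator(basis, tableau, best_x)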
We need a method at the general end of the scale against which the SM can be compared. Figure 10.13 gives the obvious one, which we call M1. It retains the shape of the problem but without any specific content. The problem is still to optimize a function, f, of n positive variables, subject to m inequality constraints, {g}. But the only thing known about f and the g is that f is unimodal in the feasible set (which accounts for its descriptive title). This justifies using hill climbing as the procedure. The operators must be any increments to the current position, either positive or negative.
Problem statement for M1, the unimodal objective method
Given: a set of n variables, {x}, where each x ≥ 0; let x be the n-tuple (x1, x2, ..., xn);
a set of m constraints, {g(x) ≥ 0}; let the feasible region be {x | g ≥ 0};
an objective function, z = f(x); f is unimodal in the feasible region.
Find: x in the feasible region such that z is maximum.

Problem space PS1, for hill climbing
Elements: {x}.
Initial element: x0, a feasible solution (not obtained by M1).
Operators: {Δx, where each Δx is any real number}, and x' = x + Δx.

Procedure: generate a Δx, apply it, test feasibility, compare z' > z, and keep (z, x) for the best point found so far.

Figure 10.13. M1: unimodal objective method.
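A runnable toy instance of the M1 procedure just defined may help; it is an illustration added here, with the particular objective, constraint, step size, and names invented for the example. Random increments are generated, rejected when infeasible, and kept only when they improve z.

import random

def feasible(x):
    # The constraints g >= 0 for the toy problem: x1 + x2 <= 4 and x1, x2 >= 0.
    return all(v >= 0 for v in x) and (4 - x[0] - x[1]) >= 0

def z(x):
    # A toy unimodal (in fact linear) objective: maximize 3*x1 + 2*x2.
    return 3 * x[0] + 2 * x[1]

def m1_hill_climb(x, steps=20000, step_size=0.1, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        dx = [rng.uniform(-step_size, step_size) for _ in x]   # generate an operator
        candidate = [xi + di for xi, di in zip(x, dx)]          # apply it
        if feasible(candidate) and z(candidate) > z(x):         # test, then compare
            x = candidate                                       # keep only the best element
    return x

print(m1_hill_climb([1.0, 1.0]))   # creeps toward the optimum at x1 = 4, x2 = 0

The slow progress along the bounding plane in this toy run is exactly the behavior the text attributes to M1 when it is applied to an LP problem.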
Many will produce a location outside the feasible region, but these can be rejected by testing against the g. The procedure in the figure does not provide any information on how to generate the operators.
If M1 is compared to SM, several differences are apparent. First, and most striking, the problem space for M1 is n-dimensional Euclidean space, whereas for SM it is a finite set of (n+m choose m) points in (n + m)-dimensional space. Thus the search space has been drastically reduced, independently of what techniques are used to search it. Second, and almost as striking, M1 has all of {Δx} as operators (i.e., all of n-dimensional space again), whereas SM has only n - m operators. Third, a unique operator is selected for application on the basis of partial information; it always both applies and improves the position. In M1 there is no reason to expect an operator either to produce a feasible solution or, if it does, to obtain an improvement; thus, extensive testing and comparing must be done. Finally, we observe that the cost per step in M1 (when applied to the same LP problem as SM) is mk, where k is the number of variables in Δx and would normally be rather small.
Compared to m(m + n) for SM, this yields the one aspect favoring M1. One can take about (m + n)/k steps in the space of M1 for each step in the space of SM. However, the thrashing around necessary at each point to obtain a positive step will largely nullify this advantage.
This list of differences suggests constructing a sequence of methods that extend from M1 to SM, with decreasing spaces and operators and increasing cost per step (to pay for the additional sophistication). Much of the gain, of course, will come without changing problem spaces, but from acquiring better operator selection. There may be relatively few substantial changes of problem space. Figures 10.14 and 10.15 provide two that seem to reflect some of the major boundaries to be crossed in getting from M1 to SM.
Problem statement for M2, the monotone objective method
Given: the conditions of M1, plus f is monotone in the feasible region.
Find: x in the feasible region such that z is maximum.

Problem space PS2, main hill climbing
Elements: {x on boundary of feasible region (at least one x = 0 or g = 0)}.
Initial element: x0 in feasible region (not obtained by M2).
Operators: {x}, where x' = the point on the boundary given by M2*.

Problem statement for M2*, the M2-operator method
Given: the conditions of M2, plus x is on the boundary; x ∈ {x}.
Find: x on the boundary such that Δx to increase z is not feasible; all other x unchanged; z increased.
Additional assumption for efficiency: g(x) = 0 can be solved for any x with all other x fixed.

Problem space for M2*, hill climbing
Elements: {x}.
Initial element: x, given by M2 operator.
Operators: {Δx, with appropriate sign}, where x' = x + Δx.

Problem space PS1 for M1
Used as backup when PS2 terminates without optimum.

Figure 10.14. M2: monotone objective method.
Problem statement for M3, the consistent exchange method
Given: the conditions of M2, plus if an x is exchanged for other variables by moving along a maximal boundary, then Δz has a consistent sign.
Find: x in the feasible region such that z is a maximum.

Problem space PS3, main hill climbing
Elements: {x on the maximal boundary of the feasible set; i.e., no x can be changed to increase z, holding the other x fixed}.
Initial element: x0, a feasible solution on the maximal boundary (not obtained by M3).
Operators: {x}, where x' = the point on the maximal boundary given by M3*.

Problem statement for M3*, the M3-operator method
Given: the conditions of M3, plus x is on a maximal boundary; x ∈ {x}.
Find: x on the maximal boundary, such that exchange for x to increase z is not feasible; z increased.
Additional assumption for efficiency: any system of k equations, {g(x) = 0}, can be solved for any set of k variables with the others fixed.

Problem space for M3*
Elements: {x}.
Initial element: x, given by M3 operator.
Operators: {Δx, with appropriate sign} and {Δx, subsets of {x}}, where x' = x + Δx.

Figure 10.15. M3: consistent exchange method.
Figure 10.14 shows method M2, which adds the assumption that the objective function, f, is monotone in the feasible region. In the LP problem ∂f/∂x is constant, but this is not required to justify the main conclusion; namely, that if a given change in a variable is good, more change in the same direction is better. The effect of this is to create new operators and, through them, a new space. The basic decision is always to drive a
variable to a boundary (in the direction of increasing z, of course). Thus the space for M2 becomes the boundary set of the original space for M1 (those points where at least one of the constraints, including the x ≥ 0, attains zero). The operators in the space of M2 are full steps to a boundary (what are sometimes called macromoves in artificial intelligence). Now, finding the boundary is still a problem, although a more manageable one. Thus M2 has a second problem method, M2*, for this purpose. As described in Fig. 10.14 it can be a simple hill climber.
An additional strong assumption has been made in the procedure of M2, namely, that only changes in a single variable, x, will be considered. This reduces the number of operators, as well as making the operator submethod M2* simpler. It is not justified by the assumptions of the problem statement, however, and consequently M2 will terminate at suboptimal positions where no single variable can be changed to increase z without decreasing some other variables. (This is often called the maximal or the Pareto optimum set.) Rather than relax the operators to a wider class, the original method, M1, is held in reserve to move off the maximal set. (However, if done with small steps, this is extremely inefficient for the LP problem, since the system just jitters its way slowly up a bounding plane.) The description of M2* gives an additional assumption: each equation g(x) can be solved directly for any x, given that the values of the other x's are determined. This permits direct calculation of the extreme value of x on a boundary that is maximal. Slightly stronger conditions on the g's allow determination of the first constraint to become binding, without multiple evaluations.
Figure 10.15 shows method M3, which adds the assumption that the exchange rate (e) always has a consistent sign as one moves along the feasible region, in response to introducing a variable x (what we have called exchanging). Again, in the LP problem e is constant, but this is not required to justify the main conclusion: that in a maximal situation, if adjustments are made in other variables to allow a particular variable to increase, and the gain from the exchange is positive, it will always be positive; hence the new variable should be exchanged for as much as possible, namely, until another boundary is reached.
This assumption not only allows a better way of dealing with the maximal cul-de-sac than does M2, with its regression to M1, but also permits the problem space to be changed radically a second time. The elements of the space now become the set of maximal points, thus a small subset of the space of M2. The operators remain the same: the individual variables. The application of an operator again
corresponds to the solving of a subproblem, hence is accomplished by a submethod, M3*. The problem is as follows: given x (with a positive exchange rate), to advance it as far as possible. This means solving the constraints simultaneously for the variables, so as to remain on a boundary. As a change in the selected x is made, the current x moves off the maximal boundary by violating either the constraints or maximality. Adjustments must be made in the other x's to restore these conditions. What the new clause in the problem statement provides is not a way of making the adjustments, but a guarantee that if a change is once found that does increase z (after adjustment) it should be pushed to the limit.
We have not described M3*, other than to indicate the available operators. At its most general (i.e., assuming no other information), it requires a two-stage process, one to discover a good direction and the other to push it. The latter is again a two-stage process, one to change the selected x and the other to make the adjustments. We have included an additional assumption, similar to the one for M2*, that a direct way exists of solving systems of constraints for some variables in terms of others. This clearly can make an immense difference in the total efficiency of problem solving but does not alter the basic structuring of the task.
M3 is already a recognizable facsimile of SM. The space has been cut to all subsets of the variables, although the final contraction to subsets of m variables has not occurred. (It is implicit in the problem statement of M3, with some mild conditions on the g's, but has not been brought out.) The operators of M3 and SM are the same. More precisely, they are isomorphic-the process of applying an operator is quite different in the two methods. There are still some steps to go. The kinds of methods that are possible for the operator need explication. They are represented in M3* only by the assumption that systems of equations can be solved. But the schemes in SM use special properties of linear systems. Similarly, we have not explored direct calculation of the exchange rates, with the subsequent replacement of comparison in the main method by comparison in the operator, to avoid expensive computation.
We have not carried this example through in complete detail, nor have we established very many points on the path from a crude hill climber to SM. The two points determined are clearly appropriate ones and capture some of the important features of the method. They are not unexpected points, of course, since linear programming is well understood. The viewpoint underlying the analysis is essentially combinatorial, and such aspects have been thoroughly explored (e.g., see [23]). If these intermediate problems have any peculiar flavor, it is that they become established where the search spaces change, and these need not always
correspond to nice mathematical properties, abstractly considered. Thus convexity is not posited and its implications explored; rather a change of search space is posited and the problem statement that admits it sought.
A single ancestral lineage should not be expected. Just as theorems can have many proofs, so methods can have many decompositions of their information. In fact, in one respect at least the line represented by M2 and M3 does violence to SM. It never recognizes the shift of problem into a set of equality constraints with the consequent change in dimensionality. Thus, the g's and the x's are handled separately, whereas it is a very distinct feature of the simplex procedure that it handles them uniformly. One could easily construct another line starting from SM, which would preserve this feature. (It would face a problem of making the transition to M1.)
The examples selected-linear programming and matrix inversion-are certainly ones that seem most amenable to the kind of analysis we have proposed. If we considered methods, say, for determining inventory levels, the story might be different. Nevertheless, perhaps the case for continuity between weak and strong methods has been made plausible.
5. HUMAN BEHAVIOR IN ILL-STRUCTURED PROBLEMS
In the two issues discussed so far-the existence of weak methods, and the continuity between weak and strong methods-we have not seemed to be dealing directly with ill-structured problems. To re-evoke the concern of Reitman, the problem statements that we have exhibited seem quite precise. (Indeed, we took pains to make them so and in a more technical exposition would have completely formalized them.) According to our hypotheses the world is always formalized, seen from the viewpoint of the methods available, which require quite definite properties to operate. A human problem solver, however, would not feel that a problem was well structured just because he was using a method on it. Our second hypothesis identifies this feeling with the low power of the applicable methods.
The concern just expressed is still well taken. If we examine some problem solvers who are working on "really ill-structured" problems, what will we find? They will necessarily be human, since as noted earlier, men are the gatekeepers of this residual class of problems. Thus we cannot observe their problem-solving processes directly but must infer them from their behavior. To have something definite in mind consider the following problem solvers and tasks:
A financial adviser: what investments should be made in a new account?
A foreman: is a given subordinate well adjusted to his work?
A marketing executive: which of two competitors will dominate a given market to which his firm is considering entry?
None of these problems is as ill structured as the proverbial injunctions to "know thyself" (asked of every man) and to "publish or perish" (asked of the academician). Still they are perhaps more typical of management problems than these two almost completely open-ended problems. They do have the feature of most concern to Reitman; namely, neither the criteria for whether a solution is acceptable nor the data base upon which to feed are particularly well defined.
The framework we have been using says that below the surface we should discover a set of methods operating. Our two hypotheses assert, first, that we should find general but weak methods; and, second, that we should not find methods that deal with the unstructured aspects (however they come to be defined) through any mechanism other than being general enough to apply to a situation with little definite information.
Our first implication would seem to be upset if we discover that the human being has available very strong methods that are applicable to these ill-structured problems. This is a rather difficult proposition to test, since, without the methods themselves to scrutinize, we have very little basis for judging the nature of problem solving. Powerful methods imply good solutions, but if only men solve the problem, comparative quality is hard to judge. The three tasks in our list have the virtue that comparisons have been made between the solutions obtained by human effort and those obtained by some mechanical procedure. For the second two the actual tasks are close nonmanagement analogs of the tasks listed. However, they all have the property (implicit in our list) that the problem solver is a man who by profession is concerned with solving the stated type of problem. This condition is important in discussing real management problems, since the capabilities of a novice (e.g., a college student used as a subject in an experiment) may differ considerably from those of the professional. In particular, the novice may differ in the direction of using only very general reasoning abilities (since he is inexperienced), whereas the professional may have special methods.
In all of the cases the result is the same. Rather simple mechanical procedures seem to do as well as the professional problem solver or even better; certainly they do not do worse.
The first task was investigated by Clarkson [1] in one of the early simulation studies. He actually constructed a program to simulate a trust investment officer in a bank. Thus the program and the human being attain the same level of solution. The program itself consists of a series of elementary evaluations as a data base, plus a recognition structure (called a discrimination net) to make contact between the specific situation and the evaluation; there are also several generate-and-tests. Thus the program does not have any special mechanisms for dealing with ill-structuredness. Indeed it deals with the task in a highly structured way, though with a rather large data base of information. The key point is that the human being, who can still be hypothesized to have special methods for ill-structured situations (since his internal structure is unknown), does not show evidence of this capability through superior performance.
The second task in its nonmanagement form is that of clinical judgment. It has been an active, substantial-and somewhat heated-concern in clinical psychology ever since the forties. In its original form, as reviewed by Meehl [12], it concerned the use of statistical techniques versus the judgments of clinicians. With the development of the computer it has broadened to any programmed procedure. Many studies have been done to confront the two types of judgment in an environment sufficiently controlled and understood to reveal whether one or the other was better. The results are almost uniformly that the programmed procedures perform no worse (and often better) than the human judgment of the professional clinician, even when the clinician is allowed access to a larger "data base" in the form of his direct impressions of the patient. Needless to say, specific objections, both methodological and substantive, have been raised about various studies, so the conclusion is not quite as clear-cut as stated. Nevertheless, it is a fair assertion that no positive evidence of the existence of strong methods of unknown nature has emerged.7
The third task is really an analog of an analog. Harris [7], in order to investigate the clinical versus statistical prediction problem just discussed, made use of an analog situation, which is an equally good analog to the marketing problem in our list. He tested whether formulas for predicting the outcome of college football games are better than human judgment. To get the best human judgments (i.e., professional) he made use of coaches of rival teams. Although there is a problem of bias, these coaches clearly have a wealth of information of as professional a nature
7 A recent volume [10] contains some recent papers in this area, which provide access to the literature.
as the marketing manager has about the market for his goods. On the program side, Harris used some formulas whose predictions are published each week in the newspapers during the football season. An unfortunate aspect of the study is that these formulas are proprietary, although enough information is given about them to make the study meaningful. The result is the same: the coaches do slightly worse than the formulas.
Having found no evidence for strong methods that deal with unstructured problems, we might feel that our two hypotheses are somewhat more strongly confirmed. However, unless the weak methods used by human beings bear some relationship to the ones we have enumerated, we should take little comfort. For our hypotheses take on substantial meaning only when the weak methods become explicit. There is less solid evidence on what methods people use than on the general absence of strong methods. Most studies simply compare performance, and do not attempt to characterize the methods used by the human problem solver. Likewise, many of the psychological studies on problem solving, although positive to our argument [14], employ artificial tasks that are not sufficiently ill structured to aid us here. The study by Clarkson just reviewed is an exception, since he did investigate closely the behavior of his investment officer. The evidence that this study provides is positive.
Numerous studies in the management science literature might be winnowed either to support or refute assertions about the methods used. Work in the behavioral theory of the firm [2], for instance, provides a picture of the processes used in organizational decision making that is highly compatible with our list of weak methods-searching for alternatives, changing levels of aspiration, etc. However, the characterizations are sufficiently abstract that a substantial issue remains whether they can be converted into methods that really do the decision making. Such descriptions abstract from task content. Now the methods that we have described also abstract from task content. But we know that these can be specialized to solve the problems they claim to solve. In empirical studies we do not know what other methods might have to be added to handle the actual detail of the management decision.
Only rarely are studies performed, such as Clarkson's, in which the problem is ill structured, but the analysis is carried out in detail. Reitman has studied the composition of a fugue by a professional composer [18], which is certainly ill structured enough. Some of the methods we have described, such as means-ends analysis, do show up there. Reitman's characterization is still sufficiently incomplete, however, that no real evidence is provided on our question.
6. DIFFICULTIES
We have explored three areas in which some positive evidence can be adduced for the two hypotheses. We have an explicit set of weak methods; there is some chance that continuity can be established between the weak and the strong methods; and there is some evidence that human beings do not have strong methods of unknown nature for dealing with ill-structured problems. Now it is time to consider some difficulties with our hypotheses. There are several.
At the beginning of this essay we noted that methods were only a part of problem solving, but nevertheless persisted in ignoring all the other parts. Let us now list some of them : Recognition Evaluation Representation Method identification
Information acquisition Executive construction Method construction Representation construction
A single concern applies to all of these items. Do the aspects of prob lem solving that permit a problem solver to deal with ill-structured problems reside in one (or more) of these parts, rather than in the methods? If so, the discussions of this essay are largely beside the point. This shift could be due simply to the power (or generality) of a problem solver not being localized in the methods rather than to anything specific to ill-structuredness. The first two items on the list illustrate this possibility. Some problems are solved directly by recognition ; for example, who is it that has j ust appeared before my eyes? In many problems we seem to get nowhere until we suddenly "just recognize" the essential connection or form of the solution. Gestalt psychology has made this phenomenon of sudden restructuring central to its theory of problem solving. If it were true, our two hypotheses would certainly not be valid. Likewise for the second item, our methods say more about the organization of tests than about the tests themselves. Perhaps most of the power resides in sophisticated evaluations. This would work strongly against our hypotheses. In both examples it is possible, of course, that hypotheses of similar nature to ours apply. In the case of evalu ations, for example, it might be that ill-structured problems could be handled only because the problem solver always had available some dis-
47
48
CHAPTER 1
tinctions that applied to every situation, even though with less and less relevance. The third item on the list, representation of problems, also raises a question of the locus of power (rather than of specific mechanisms re lated to ill-structured problems) . At a global level we talk of the repre sentation of a problem in a mathematical model, presumably a trans lation from its representation in some other global form, such as natural language. These changes of the basic representational system are clearly of great importance to problem solving. It seems, however, that most problems, both well structured and ill structured, are solved without such shifts. Thus the discovery of particularly apt or powerful global repre sentations does not lie at the heart of the handling of ill-structured prob lems. More to the point might be the possibility that only special represen tations can hold ill-structured problems. Natural language or visual imagery might be candidates. To handle ill-structured problems is to be able to work in such a representation. There is no direct evidence to sup port this, except the general observations that human beings have (all) such representations, and that we do not have good descriptions of them. More narrowly, we often talk about a change in representation of a problem, even when both representations are expressed in the same lan guage or imagery. Thus we said that Fig. 10. 12 contained two represen tations of the LP problem, the original and the one for the simplex method. Such transformations of a problem occur frequently. For ex ample, to discuss the application of heuristic search to inverting matrices we had to recast the problem as one of getting from the matrix A to I, rather than of getting from the initial data (A, I, AX = I) to X. Only after this step was the application of the method possible. A suspicion arises that changes of representation at this level-symbolic manipulation into equivalent but more useful form-might constitute a substantial part of problem solving. Whether such manipulations play any special role in handling ill-structured problems is harder to see. In any event, current research in artificial intelligence attempts to incorporate this type of problem solving simply as manipulations in another, more symbolic problem space. The spaces used by theorem provers, such as LT, are relevant to handling such changes. Method identification, the next item, concerns how a problem state ment of a method comes to be identified with a new problem, so that each of the terms in the problem statement has its appropriate referent in the problem as originally given. Clearly, some process performs this identifi cation, and we know from casual experience that it often requires an
HEURISTIC PROGRAMMING
exercise of intellect. How difficult it is for the LP novice to "see" a new problem as an LP problem, and how easy for an old hand ! Conceivably this identification process could play a critical role in dealing with ill-structured problems. Much of the structuring of a prob lem takes place in creating the identification. Now it might be that methods still play the role assigned to them by our hypotheses, but even so it is not possible to instruct a computer to handle ill-structured prob lems, because it cannot handle the identification properly. Faced with an appropriate environment, given the method and told that it was the applicable one, the computer still could not proceed to solve the problem. Thus, though our hypotheses would be correct, the attempt to give them substance by describing methods would be misplaced and futile. Little information exists about the processes of identification in situ ations relevant to this issue. When the situation is already formalized, matching is clearly appropriate. But we are concerned precisely with identification from a unformalized environment to the problem state ment of a method. No substantial programs exist that perform such a task. Pattern recognition programs, although clearly designed to work in "natural" environments, have never been explored in an appropriately integrated situation. Perhaps the first significant clues will come out of the work, mentioned at the beginning of this chapter and still in its early stages, on how a machine can use a hand and eye in coordination. Al though the problems that such a device faces seem far removed from management science problems, all the essentials of method identification are there in embryo. (Given that one has a method for picking up blocks, how does one identify how to apply this to the real world, seen through a moving television eye?) An additional speculation is possible. The problem of identification is to find a mapping of the elements in the original representation (say, external ) into the new representation (dictated by the problem state ment of the method to be applied) . Hence there are methods for the solution t{> this, just as for any other problem. These methods will be like those we have exhibited. (Note, however, that pattern recognition methods would be included.) The construction of functions in the in duction method may provide some clues about how this mapping might be found. As long as the ultimate set of contacts with the external repre sentation (represented in these identification methods as generates and tests) were rather elementary, such a reduction would indeed answer the issue raised and leave our hypotheses relevant. An important aspect of problem solving is the acquisition of new information, the next item on the list. This occurs at almost every step,
49
50
CHAPTER 1
of course, but most of the time it is directed at a highly specific goal ; for instance, in method identification, which is a major occasion for assimi lating information, acquisition is directed by the problem statement. In contrast, we are concerned here with the acquisition of information to be used at some later time in unforeseen ways. The process of education provides numerous examples of such accumulation. For an ill-structured problem one general strategy is to gather ad ditional information, without asking much about its relevance until ob tained and examined. Clearly, in the viewpoint adopted here, a problem may change from ill structured to well structured under such a strategy, if information is picked up that makes a strong method applicable. The difficulty posed for our hypotheses by information acquisition is not in assimilating it to our picture of methods. It is plausible to assume that there are methods for acquisition and even that some of them might be familiar ; for example, browsing through a scientific j ournal as generate and-test. The difficulty is that information acquisition could easily play a central role in handling ill-structured problems but that this depends on the specific content of its methods. If so, then without an explicit description of these methods our hypotheses cannot claim to be relevant. These methods might not formalize easily, so that ill-structured problems would remain solely the domain of human problem solvers. The schemes whereby information is stored away yet seems available almost instantly -as in the recognition of faces or odd relevant facts-are possibly aspects of acquisition methods that may be hard to explicate. The last three items on the list name things that can be constructed by a problem solver and that affect his subsequent problem-solving be havior. Executive construction occurs because the gross shape of a par ticular task may have to be reflected in the top-level structure of the procedure that solves it. The induction method, with the three separate induction tasks mentioned, provides an example. Each requires a separate executive structure, and we could not give a single uni_fied procedure to handle them all. Yet each uses the same fundamental method. Relative to our hypotheses, construction seems only to provide additional loci for problem-solving power. This item could become important if it were shown that solutions are not obtained to ill-structured problems without some construction activity. The extended discussion of the parts of the problem-solving process other than methods, and the ways in which they might either refute or nullify our two hypotheses, stems from a conviction that the major weakness of these hypotheses is the substantial incompleteness of our
HEURlSTIC PROGRAMMING
knowledge about problem solving. They have been created in response to partial evidence, and it seems unlikely that they will emerge unscathed as some of these other parts become better known. 6.2. Measures of Informational Demands
Throughout the chapter we have talked as if adding information to a problem statement leads to a decrease in generality and an increase in power. Figure 10.3 is the baldest form of this assertion. At the most general level it seems plausible enough. Here one crudely identifies the number of conditions in the problem statement with the size of the space being searched : as it gets smaller, so the problem solver must grow more powerful. At a finer level of analysi.;, however, this assertion seems often violated, and in significant ways ; for example, a linear program ming problem is changed into an integer programming problem by the addition of the constraint that the variables {x} range over the positive integers rather than the positive reals. But this makes the problem harder, not easier. Of course, it may be that existing methods of integer programming are simply inefficient compared to what they could be. This position seems tenuous, at best. It is preferable, I think, to take as a major difficulty with these hypotheses that they are built on foundations of sand. 6.3. Vague Information
It is a major deficiency of these hypotheses (and of this chapter) that they do not come to grips directly with the nature of vague information. Typically, an ill-structured problem is full of vague information. This might almost be taken as a definition of such a problem, except that the term vague is itself vague. All extant ideas for dealing with vagueness have one concept in com mon : they locate the vagueness in the referent of a quite definite (hence un-vague} expression. To have a probability is to have an indefinite event, but a quite definite probability. To have a subset is to have a quite definite expression (the name or description of the subset) which is used to refer to an indefinite, or vague, element. Finally, the constructs of this chapter are similarly definite. The problem solver has a definite problem statement, and all the vagueness exists in the indefinite set of problems that can be identified with the problem statement.8 •
Reitman's proposals, although we have not described them here, have the same
definite character ( 17] . So also does the proposal by Zadeh for "fuzzy" sets (24] .
51
52
CHAPTER 1
The difficulty with this picture is that, when a human problem solver has a problem he calls ill structured, he does not seem to have definite expressions which refer to his vague information. Rather he has nothing definite at all. As an external observer we might form a definite expres sion describing the range (or probability distribution ) of information that the subject has, but this "meta" expression is not what the subject has that is this information. It seems to me that the notion of vague information is at the core of the feeling that ill-structured problems are essentially different from well-structured ones. Definite processes must deal with definite things, say, definite expressions. Vague information is not definite in any way. This chapter implies a position on vague information ; namely, that there are quite definite expressions in the problem solver (his problem state ment) . This is a far cry from a theory that explains the different varieties of vague information that a problem solver has. Without such expla nations the question of what is an ill-structured problem will remain only half answered. 7. CONCLUSION
The items j ust discussed-other aspects of problem solving, the meas urement of power and generality, and the concept of vagueness-do not exhaust the difficulties or deficiencies of the proposed hypotheses. But they are enough to indicate their highly tentative nature. Almost surely the two hypotheses will be substantially modified and qualified (probably even compromised) with additional knowledge. Even so, there are excel lent reasons for putting them forth in bold form. The general nature of problems and of methods is no longer a quasi philosophic enterprise, carried on in the relaxed interstic�s between the development of particular mathematical models and theorems. The de velopment of the computer has initiated the study of information proc essing, and these highly general schema that we call methods and problem solving strategies are part of its proper obj ect of study. The nature of generality in problem solving and of ill-structuredness in problems is also part of computer science, and little is known about either. The as sertion of some definite hypotheses in crystallized form has the virtue of focusing on these topics as worthy of serious, technical concern. These two hypotheses (to the extent that they hold true) also have some general implications for the proper study of management science. They say that the field need not be viewed as a collection of isolated mathematical gems, whose application is an art and which is largely
HEURISTIC PROGRAMMING
excluded from the domain of "nonquantifiable aspects" of management.9 Proper to management science is the creation of methods general enough to apply to the ill-structured problems of management-taking them on their own terms and dealing with them in all their vagueness-and not demanding more in the way of data than the situations provide. To be sure, these methods will also be weak but not necessarily weaker than is inherent in the ill-structuring of the task. That management science should deal with the full range of manage ment problems is by no means a new conclusion. In this respect these two hypotheses only reinforce some existing strands of research and ap plication. They do, however, put special emphasis on the extent to which the hard mathematical core of management science 8hould be involved in ill-structured problems. They say such involvement is possible.
BIBLIOGRAPHY 1. G. P. E. Clarkson, Portfolio Selection: a Simulation of
Trust Investment,
Prentice-Hall, Englewood Cliffs, N .J., 1962. 2. R. M . Cyert and J. G. March, A Behavioral Theory of the Firm, Prentice-Hall, Englewood Cliffs, N .J., 1963. 3. J. Ellul, The Technological Society, Knopf, New York, 1964. 4. T. G. Evans, "A Heuristic Program to Solve Geometric Analogy Problems,"
Proc. Spring Joint Computer Conference, Vol. 25, 1964, pp. 327-338. 5. E. A. Feigenbaum and J. Feldman (eds.) , Computers and Thought, McGraw Hill, New York, 1963. (Reprints many of the basic papers.) 6. R. W. Floyd, "Assigning Meanings to Programs," Proc. Am. Math. Soc., Sym
posium on Applied Mathematics, Vol. 19, 1967, pp. 19-32. 7. J. Harris, "Judgmental versus Mathematical Prediction : an Investigation by Analogy of the Clinical versus Statistical Controversy,'' Behav. Sci., 8, No. 4, 324-335 ( Oct. 1963 ) . 8 . E . S . Johnson, "An Information-Processing Model of One Kind of Problem Solving,'' Psychol. Monog., Whole No. 581, 1964. 9. T. Kilburn, R. L. Grimsdale, and F. H. Summer, "Experiments in Machine Learning and Thinking," Proc. International Conference on Information Proc
essing, UNESCO, Paris, 1959.
10. B. Kleinmuntz (ed.), Formal Representation of Human Judgment, Wiley, New York, 1968. 11. A. A. Kuehn and M . J. Hamburger, "A Heuristic Program for Locating Ware houses," Mgmt. Sci., 9, No. 4, 643-006 (July 1963) .
9
One need only note the extensive calls to arms issued to the industrial operations
researcher to consider the total decision context, and not merely that which he can quantify and put into his model, to realize the firm grip of this image.
53
54
CHAPTER 1
12. P. E. Meehl, Clinical vs. Statistical Prediction, University of Minnesota Press, Minneapolis, Minn., 1954. , 13. A. Newell and G. Ernst, "The Search for Generality," in Proc. IFIP Congress 65
aA
(20)
The structure of Fig. I . I S suggests that many of the deviations in the empiri cal curves could be due simply to starting point or asymptote effects. Because the effect of these two phenomena is to bend toward the horizontal at separate ends, it is possible to tell from the curve in log-log space what effect might be operating . The original Snoddy data in Fig. I . I provides an example of a clear initial deviation. It cannot possibly be due to an earlier starting point, because the initial curve rises toward the vertical. However, it could be due to the asymptote, because raising the asymptote parameter ( A ) will pull the right-hand part of the curve down and make its slope steeper. The Seibel data in Fig. 1 . 6 provides an example where there are deviations from linearity at both ends. Use of a nonzero value for E (previous experience) will steepen the initial portion of the curve, whereas doing likewise for A will steepen the high N portion of the curve. (The results of such a manipulation are seen in Fig. I .2 1 . ) Trials or Time?
The form of the law of practice is performance time ( T) as a function of trials (N). But trials is simply a way of marking the temporal continuum (t) into intervals , each one performance-time long. Since the performance time is itself a monotone decreasing function of trial number, trials (N) becomes a nonlinear compression of time {t). It is important to understand the effect on the law of practice of viewing it in terms of time or in terms of trials.
The fundamental relationship between time and trials is
t(N) = T_0 + \sum_{i=1}^{N} T_i = T_0 + \sum_{i=1}^{N} B i^{-a} = T_0 + B \sum_{i=1}^{N} i^{-a}    (21)
T_0 is the time from the arbitrary time origin to the start of the first trial. This equation cannot be inverted explicitly to obtain an expression for N(t) that would permit the basic law (Equation 4) to be transformed to yield T(t). Instead, we proceed indirectly by means of the differential forms. From Equation 21 we obtain

\frac{dt}{dN} = T    (22)

Think of the corresponding integral formulation, \frac{d}{dz} \int_a^z f(x)\,dx = f(z).

Now, starting with the power law in terms of trials we find:

    (for a ≠ 1)    (27)
Thus, a power law in terms of trials is a power law in terms of time, though with a different exponent, reflecting the expansion of time over trials. The results are significantly altered when a = 1 (the hyperbolic), however. Equation 25 becomes (28). This is no longer the differential form of a power law. Instead it is that of an exponential (29).
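The asymptotic argument behind these claims can be sketched as follows; this is an added sketch, not the chapter's Equations 23-29, and it ignores the asymptote A and prior experience E and assumes 0 < a < 1 (the range observed in the practice data).

\begin{align*}
T(N) &= B N^{-a}, \qquad \frac{dt}{dN} = T \\
t - T_0 &= B \sum_{i=1}^{N} i^{-a} \;\approx\; \frac{B}{1-a}\,N^{1-a}
  \;\Rightarrow\; N \propto (t - T_0)^{1/(1-a)}
  \;\Rightarrow\; T \propto (t - T_0)^{-a/(1-a)} \quad\text{(a power law in time)} \\
a = 1:\quad t - T_0 &\approx B \ln N
  \;\Rightarrow\; N \propto e^{(t-T_0)/B}
  \;\Rightarrow\; T = B N^{-1} \propto e^{-(t-T_0)/B} \quad\text{(an exponential in time)}
\end{align*}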
It is left as an exercise for the reader to confirm that an exponential function in trials transforms to a linear function in time (hence, Zeno-like, an infinite set of trials can be accomplished in a finite amount of time).
FITTING THE DATA TO A FAMILY OF CURVES
Given empirical curves, such as occur in abundance in the second section, it is important to understand how well they are described by curves of a given family (e.g., power laws) and whether there are alternative general forms that fit them just as well (as noted in the introduction, exponential, hyperbolic, and logistic curves have enjoyed much more favor than power functions). Curve fitting without benefit of a model is notoriously a black art. Nonetheless, we have deliberately chosen not to be model driven initially, because we want to have empirical generalizations as the starting point in the search for theory, not just the raw data.
The basic issue of curve fitting can be introduced from Seibel's own treatment of his data (Fig. 1.6), which appears to be an extremely good fit to the log-log law over an extensive range (40,000 trials). Seibel (1963) fit his points to three curves by least squares: (1) a power law with asymptote only (i.e., E fixed at 0); (2) an exponential with asymptote; and (3) a general power law with both asymptote and starting point.6 He obtained an r² of .991 for the power function with asymptote only. But he also obtained an r² of .971 for the exponential with asymptote. His general power law fit was .997. (His parameters for asymptotes and starting points are mostly reasonable but not entirely.) Thus, all the curves give good fits by normal standards. If only differences in the least-squared residual are used, there can hardly be much to choose from. This is an annoying result, in any case; but it is also somewhat unexpected, for the plots that we have shown, though they surely contain noise, are still impressively linear by intuitive standards and involve lots of data.
It is important to recognize that two basic kinds of failure occur in fitting data to a family of smooth curves: (1) failure of the shape of the data curve to fit to the shapes available within the family; and (2) noise in the data, which will not be fit by any of the families under consideration or even noticeably changed by parametric variation within a family. These distinctions are precisely analogous to the frequency spectrum of the noise in the data. However, the analogy probably should not be exploited too literally, because an attempt to filter out the high-frequency noise prior to data fitting simply adds another family of empirical curves (the filters) to confound the issues. What does seem sensible is to attempt to distinguish fits of shape without worrying too much about the jitter.
A simple example of this point of view is the (sensible) rejection of the family of logistic curves from consideration for our data.
6 The exponential is translation invariant, so a special starting point is not distinguishable for it; that is, Be^{-a(N+E)} = (Be^{-aE})e^{-aN} = B'e^{-aN}.
The logistic provides a sigmoid curve (i.e., a slow but accelerating start with a point of inflection and then asymptoting). No trace of an S-shape appears in any of our data, though it would not be lost to view by any of the various monotone transformations (logs, powers, and exponentials) that we are considering. Hence, independent of how the competing measures of error come out, the logistic is not to be considered.
The size of the jitter (i.e., the high-frequency noise) will limit the precision of the shape that can be detected and the confidence of the statements that can be made about it. It provides a band through which smooth curves can be threaded, and if that band is wide enough-and it may not have to be very wide-then it may be possible to get suitable members of conceptually distinct curves through it. In all cases, the main contribution to any error measure will be provided by the jitter, so that only relatively small differences will distinguish the different families.
The Data Analysis Procedure
With the elimination of the logistic from consideration, we have focused our efforts on three families of curves: exponential, hyperbolic, and power law. The analysis procedure that we have ended up using is primarily graphical in nature. We look at what types of deviations remain, once an empirical curve has been fit optimally by a family of theoretical curves. The analysis consists of judgments as to whether the deviations represent actual distortions of shape, or merely jitter. The procedure has the following components:
1. Find spaces where the family of curves should plot as straight lines. Judgments of shape deviation are most easily made and described when the norm is a line. These are the transformation spaces of the given family. There may be more than one such space.
2. For each family of curves, find the best linear approximation to the data in the transformation spaces of the family. This will generally involve a combination of search and linear regression.
3. Accept a curve for a family, if the best fit plots as a straight line in the space of that family. Reject it, if it has significant shape distortion.
4. Understand the shape distortion of family X when plotted in the space of family Y. Expect curves of family X to show the characteristic distortion when plotted in the spaces of alternative families.
5. Compute an estimate of fit (r²) for the best approximation in each transformation space. Expect these values to support the judgments made on the basis of shape distortion.
These criteria contain elements both of acceptance and rejection and provide a mixture of absolute judgments about whether data belong to a given family and relative judgments about the discrimination between families. The parameters for the best fits as well as the estimates of fit (r²) can be found in Table 1.2.
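As an illustration of step 2 (added here; this is not the authors' program, and the grid search stands in for whatever search they used), the general power law T = A + B(N + E)^{-a} can be fit in its log space by searching over the two parameters that cannot be linearized, A and E, and regressing for log B and a at each grid point.

import math

def linregress(xs, ys):
    # Plain least-squares line fit, returning slope, intercept, and r**2.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = my - slope * mx
    r2 = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, intercept, r2

def fit_power_law(trials, times, a_grid, e_grid):
    best = None
    for A in a_grid:                      # asymptote: 0 <= A < min(T)
        if A >= min(times):
            continue
        for E in e_grid:                  # prior experience: E >= 0
            xs = [math.log(n + E) for n in trials]
            ys = [math.log(t - A) for t in times]
            slope, intercept, r2 = linregress(xs, ys)
            if best is None or r2 > best[-1]:
                best = (A, math.exp(intercept), E, -slope, r2)   # (A, B, E, a, r2)
    return best

# Synthetic check: data generated from T = 2 + 50*(N + 5)**(-0.4)
trials = list(range(1, 200))
times = [2 + 50 * (n + 5) ** (-0.4) for n in trials]
print(fit_power_law(trials, times,
                    a_grid=[i * 0.25 for i in range(9)],     # A in 0..2
                    e_grid=[i * 0.5 for i in range(21)]))    # E in 0..10

On data generated from a known power law the search recovers the generating parameters; on empirical data the resulting r² values play the role described in step 5.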
TABLE 1.2
The General Learning Curves: Parameters from Optimal Fits in the Log Transformation Spaces
[The table gives, for each data set, the parameters and r² of the optimal fits for three families: the exponential T = A + Be^{-aN} (columns A, B, a, r²); the hyperbolic T = A + B/(N + E) (columns A, B, E, r²); and the power law T = A + B(N + E)^{-a} (columns A, B, E, a, r²). Data sets: Snoddy (1926); Crossman (1959); Kolers (1975), Subject HA; Neisser et al. (1963), ten targets and one target; Card, English & Burr (1978), stepping keys and mouse, Subject 14; Seibel (1963), Subject JK; Anderson (1980), Fan 1; Moran (1980), total time and method time; Neves & Anderson (in press), total time, Subject D; the game of Stair, won and lost games; Hirsch (1952); the general power law T = 5 + 75(N + 25)^{-0.5}; a 40-term additive mixture; and the chunking model (combinatorial TE). The individual parameter values appear in the original table.]
The remainder of this section shows how we applied this data analysis procedure. We start by looking at the transformation spaces. This is followed by an examination of the distortions that occur when a theoretical curve is plotted in a space belonging to a different family. We are then in a position to analyze a couple of the empirical curves that appeared in the second section.

The Transformation Spaces
The curves that we are interested in belong to multiparameter families (3 for the exponential and hyperbolic; 4 for the power law). Regression can be used to fit a line to an empirical curve plotted in a multidimensional space. Unfortunately, for the three families that we are interested in, there is no space in which all the parameters (three or four) can be determined by linear regression. The most that we can obtain is two parameters. The remainder must be determined by some other means, such as search. The choice of which parameters are to drop out of the analysis determines the transformation space. We have primarily worked in two different types of transformation spaces. The first type consists of the log spaces. These are the most commonly used linearizing spaces for functions with powers. The log transformations that we use are the following:

Exponential:  T' = log(B) - αN     for T' = log(T - A)                           (30)
Hyperbolic:   T' = log(B) - N'     for T' = log(T - A) and N' = log(N + E)       (31)
Power Law:    T' = log(B) - αN'    for T' = log(T - A) and N' = log(N + E)       (32)
The log spaces for the hyperbolic and the power law turn out to be the standard log-log space, whereas the exponential is in semilog space. Determining fits in these spaces requires a combination of search (over 0 ≤ A ≤ Tmin and 0 ≤ E) and regression (for B and α). Because the exponential and hyperbolic families are each missing one of these parameters, the process becomes simpler for them. The exponential only requires a one-dimensional search (over 0 ≤ A ≤ Tmin), whereas the hyperbolic can replace the regression (for B and α) with the computation of the average for B. The log spaces have been used exclusively for the data analyses that are described in the following section (Table 1.2 was computed in the log spaces). It is important to realize though that they are not the only transformation spaces that can be used. We have explored what we call the T-X spaces, though space precludes presenting the analysis. Transforming a curve into its T-X space involves pushing all the nonlinearities into the definition of X as follows:
Exponential:  T = A + BX     for X = e^(-αN)           (33)
Hyperbolic:   T = A + BX     for X = 1/(N + E)         (34)
Power Law:    T = A + BX     for X = (N + E)^(-α)      (35)
In the T-X spaces, searches are over α ≥ 0 and E ≥ 0, with A and B determined by regression.
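A comparable sketch for a T-X space fit, again assumed rather than taken from the chapter (the function name and the range searched over α are illustrative): for the exponential family only a one-dimensional search is needed, and the regression intercept gives the asymptote directly.

```python
# Sketch of fitting T = A + B*X with X = exp(-alpha*N): a one-dimensional search over
# alpha, with A and B obtained by linear regression of T on X. The intercept of the
# regression is the asymptote A.
import numpy as np

def fit_exponential_TX(N, T, alphas=np.linspace(0.001, 0.5, 500)):
    best = None
    for alpha in alphas:                  # search over alpha >= 0
        X = np.exp(-alpha * N)
        B, A = np.polyfit(X, T, 1)        # slope = B, intercept = A (the asymptote)
        r2 = np.corrcoef(X, T)[0, 1] ** 2
        if best is None or r2 > best[0]:
            best = (r2, A, B, alpha)
    return best
```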
Only single-dimensional searches are needed for the two three-parameter families. The T-X spaces prove especially useful for estimating the asymptote (A), because it maps into the intercept of the transformed curve.

The Theoretical Curves
When a curve is optimally fit in a space corresponding to its family, it plots as a straight line (by definition). This is not true though when the curve is fit in a space corresponding to some other family. There will be distortions that show up as nonlinearities in the plot. By understanding these characteristic shape distortions, we are able to interpret the deviations that we find when we plot the data in these spaces. This will help us to distinguish between random jitter and distortions that signal a bad fit by the family of curves. Data that plot with the same deviations as one of the theoretical curves have a good chance of belonging to that curve's family. Figure 1.16 shows the best that a power law can be fit in exponential log space. The power law curve is

T = 5 + 75(N + 25)^(-0.5)     (36)
[Figure 1.16. The power law T = 5 + 75(N + 25)^(-0.5), plotted in the exponential family's semilog space over roughly 0-900 trials, together with its best exponential fit, T = 7.21 + 6.78e^(-αN).]
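A brief sketch, assumed rather than reproduced from the chapter, of how a comparison like Figure 1.16 can be regenerated; the asymptote 7.21 is the value quoted for the best exponential fit, and the plotting calls are illustrative.

```python
# Plot the power law of Equation 36 in the exponential family's semilog space:
# log(T - A) versus N is a straight line only for an exponential, so the power law
# shows the characteristic bowed deviation discussed in the text.
import numpy as np
import matplotlib.pyplot as plt

N = np.arange(1, 901)
T = 5 + 75 * (N + 25) ** -0.5      # Equation 36
A = 7.21                           # asymptote of the best exponential fit (Figure 1.16)
plt.semilogy(N, T - A)             # straight in this space only if the curve is exponential
plt.xlabel("Trial number N")
plt.ylabel("T - A (log scale)")
plt.show()
```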
Combining this with Equation 59 yields Equation 71. We want C_te(s), so we first solve for C'_te(s) (Equations 72-74); C_te(s) is then found by integrating C'_te(s) with respect to s, which gives Equation 75.
Though it is somewhat obscured by the complex initial constant, this is a power law in s. Power-law learning thus implies a power-law environment. An important, and indeed pleasing, feature of the chunking model is this connection between the structure of the task environment and the learning behavior of the subject. The richer the task environment (i.e., the ensemble of environments with which the subject must potentially cope), the more difficult his learning.

Relation to Existing Work on Chunking
An important aspect of the chunking model of learning is the amount of power it gets by making connection with a wide body of existing psychological work. For example, the pervasiveness of the phenomenon of chunking amply accounts for the ubiquity of log-log learning. We have been able to develop the primary assumptions of the model from this work without the necessity of pulling an arbitrary "natural" learning curve out of the air. Much of the existing work on chunking has focused on showing that chunks are the structures of memory and operate in behavior in various ways (Bower & Winzenz, 1969; Johnson, 1972). It is consonant with the present model but does not make interesting contact with it. However, the work on chess perception (Chase & Simon, 1973; DeGroot, 1965) bears directly on the present model. The basic phenomenon investigated there was the differential short-term memory for meaningful chess positions with expertness. Novices are able to recall only a few pieces of a complex middle-game position after a 5-second exposure, whereas masters can recall most of the pieces. A well-articulated theory has evolved to explain this striking phenomenon. The theory is an elaboration of the basic assumptions about chunking. The master has acquired an immense memory for chess positions, organized as a collection of chunks. His ability for immediate perception and short-term memory of chess positions depends directly on how many chunks are used to encode a position. Estimates of the number of chunks available to the master are of the order of 50,000, based on extrapolation of a simulation program (Simon & Gilmartin, 1973) that fits novice- and expert-level players. By implication, master players must spend an immense amount of time with the game, in order to acquire the large number of chunks; this seems to be well supported by historical data. The chunking model of learning presented here for the power law is essentially the same as the chess perception model. The present model has been elaborated quantitatively for learning data, whereas the chess perception data had the products of learning to work with. The explanation for why the number of perceptual chess chunks is so large lies in the combinatorial complexity of chess positions. High-level chess chunks encode large subpatterns of pieces on the board; they are the necessary means for rapid perception. But the actual configurations
to which they apply do not show up often. Thus, to gain coverage of the population of chess positions requires acquisition of immense numbers of high-level chunks. This is precisely the notion of environmental exhaustion that is the key mechanism of the present model. One would expect from this that the time course of chess skill would also follow the power law, if one would take the trouble to measure it. Indeed, the data on the Stair game of solitaire in Fig. 1.10 can be taken as a reasonable analog of the chess game.
CONCLUSION
If we may, let us start this conclusion by recounting our personal odyssey in this research. We started out, simply enough, intrigued by a great quantitative regularity that seemed to be of immense importance (and of consequence for an applied quantitative psychology), well known, yet seemingly ignored in cognitive psychology. We saw the law as tied to skill and hence relevant to the modern work in automatization. The commitment to write this chapter was the goad to serious research. When we started, our theoretical stance was neutral-we just wanted to find out what the law could tell us. Through the fall of 1979, in searching for explanations, we became convinced that plausible substantive theories of power laws were hard to find, though it seemed relatively easy to obtain an exponent of -1 (i.e., hyperbolics). In November, discovering the chunking model (by looking for forms of exhaustion, in fact), we became convinced that it was the right theory (at least A. N. did) and that lack of good alternative theories helped to make the case. The chunking model also implied that the power law was not restricted to perceptual-motor skills but should apply much more generally. This led to our demonstration experiment on Stair, which showed a genuine problem-solving task to be log-log linear. At the same time, in conversations with John Anderson, additional data emerged from the work of his group (Figs. 1.7 and 1.9) that bolstered this. This picture seemed reasonably satisfactory, though the existence of log-log linear industrial learning curves (Fig. 1.12) nagged a bit, as did the persistence of some of our colleagues in believing in the argument of mixtures. However, as we proceeded to write the chapter, additional work kept emerging from the literature, including especially the work by Mazur and Hastie (1978), that raised substantial doubts that the power law was the right empirical description of the data. The resulting investigation has brought us to the present chapter. The picture that emerges is somewhat complex, though we believe at the moment that this complexity is in the phenomena, and not just in our heads as a reflection of only a momentary understanding. We summarize this picture below, starting with the data and progressing through theoretical considerations.
1. The empirical curves do not fit the exponential family. Their tails are genuinely slower than exponential learning and this shape deviation does not disappear with variation of asymptote.

2. The data do satisfactorily fit the family of generalized power functions (which includes the hyperbolic subfamily). There is little shape variance remaining in the existing data to justify looking for other empirical families. In particular, there is no reason to treat apparent systematic deviations, such as occur in Snoddy's or Seibel's data in log-log space (Figs. 1.1, 1.6), as due to special causes, distinct from their description as a generalized power function.

3. The data do not fit the simple power law (i.e., without asymptote or variable starting point). There are systematic shape deviations in log-log space (the space that linearizes the simple power law), which disappear completely under the general power law.

4. We were unable to confirm either whether the data fit within the hyperbolic subfamily or actually require the general power family. This is so despite the multitude of existing data sets, some with extremely lengthy data series (some of it as extensive as any data in psychology).

5. The major phenomenon is the ubiquity of the learning data (i.e., its common description by a single family of empirical curves). We extended the scope to all types of cognitive behavior, not just perceptual-motor skill. However, we restricted our view to performance time as the measure of performance, though learning curves measured on other criteria also yield similar curves. Also, we restricted our view to clear situations of individual learning, though some social (i.e., industrial) situations yield similar curves. Our restriction was dictated purely by the momentary need to bound the research effort.

6. Psychological models that yield the power law with arbitrary rate (α) are difficult to find. (Positive asymptotes and arbitrary starting points are, of course, immediately plausible, indeed, unavoidable.)

7. Models that yield the hyperbolic law arise easily and naturally from many sources-simple accumulation assumptions, parallelism, mixtures of exponentials, etc.

8. The various models are not mutually exclusive but provide an array of sources of the power law. Several hyperbolic mechanisms could coexist in the same learner. Independent of these, if humans learn by creating and storing chunks, as there is evidence they do, then the environmental-exhaustion effect would also operate to produce power-law learning, independent of whether there were other effects such as mixing to produce hyperbolic learning curves.

9. A maintainable option is that the entire phenomenon is due to exponential component learning yielding an effective hyperbolic law through mixing. This would cover not only the data dealt with here but probably also the data with other criteria and the data from industrial processes. However, the exponential learning of the component learners remains unaccounted for.
10. The chunking model provides a theory of the phenomena that offers qualitatively satisfactory explanations for the major phenomena. However, some of the phenomena, such as the industrial processes, probably need to be assigned to mixing. Parsimony freaks probably will not like this.

The theory is pleasantly consistent with the existing general theory of information processing and avoids making any a priori assumptions. Though power laws are not predicted for all task environments, the learning curves do closely approximate power laws.
ACKNOWLEDGMENTS
This research was sponsored in part by the Office of Naval Research under contract N00014-76-0874 and in part by the Defense Advanced Research Projects Agency (DOD), ARPA Order No. 3597, monitored by the Air Force Avionics Laboratory under contract F33615-78-C-1551. The views and conclusions in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Office of Naval Research, the Defense Advanced Research Projects Agency, or the U.S. Government.
REFERENCES

Anderson, J. Private communication, 1980.
Bower, G. H., & Winzenz, D. Group structure, coding, and memory for digit series. Experimental Psychology Monograph, 1969, 80, 1-17 (May, Pt. 2).
Calfee, R. C. Human experimental psychology. New York: Holt, Rinehart & Winston, 1975.
Card, S. K., English, W. K., & Burr, B. Evaluation of mouse, rate controlled isometric joystick, step keys, and text keys for text selection on a CRT. Ergonomics, 1978, 21, 601-613.
Card, S. K., Moran, T. P., & Newell, A. Computer text editing: An information-processing analysis of a routine cognitive skill. Cognitive Psychology, 1980, 12(1), 32-74. (a)
Card, S. K., Moran, T. P., & Newell, A. The keystroke model for user performance time with interactive systems. Communications of the ACM, 1980, 23. (In press; available as SSL-79-1, Xerox PARC.) (b)
Chase, W. G., & Simon, H. A. Perception in chess. Cognitive Psychology, 1973, 4, 55-81.
Churchill, R. V. Operational mathematics. New York: McGraw-Hill, 1972.
Cooper, E. H., & Pantle, A. J. The total-time hypothesis in verbal learning. Psychological Bulletin, 1967, 68, 221-234.
Crossman, E. R. F. W. A theory of the acquisition of speed-skill. Ergonomics, 1959, 2, 153-166.
Crowder, R. G. Principles of learning and memory. Hillsdale, N.J.: Lawrence Erlbaum Associates, 1976.
deGroot, A. D. Thought and choice in chess. The Hague: Mouton, 1965.
DeJong, R. J. The effects of increasing skill on cycle-time and its consequences for time-standards. Ergonomics, 1957, 1, 51-60.
Fitts, P. M. The information capacity of the human motor system in controlling amplitude of movement. Journal of Experimental Psychology, 1954, 47, 381-391.
Fitts, P. M. Perceptual-motor skill learning. In A. W. Melton (Ed.), Categories of human learning. New York: Academic Press, 1964.
Fitts, P. M., & Posner, M. I. Human performance. Belmont, Calif.: Brooks/Cole, 1967.
Gulliksen, H. A rational equation of the learning curve based on Thorndike's law of effect. Journal of General Psychology, 1934, 11, 395-434.
Hick, W. E. On the rate of gain of information. Quarterly Journal of Experimental Psychology, 1952, 4, 11-26.
Hirsch, W. Z. Manufacturing progress functions. Review of Economics and Statistics, 1952, 34, 143-155.
Hyman, R. Stimulus information as a determinant of reaction time. Journal of Experimental Psychology, 1953, 45, 188-196.
Johnson, N. F. Organization and the concept of a memory code. In A. W. Melton & E. Martin (Eds.), Coding processes in human memory. Washington, D.C.: Winston, 1972.
Kintsch, W. Memory and cognition. New York: Wiley, 1977.
Kolers, P. A. Memorial consequences of automatized encoding. Journal of Experimental Psychology: Human Learning and Memory, 1975, 1(6), 689-701.
LaBerge, D. Acquisition of automatic processing in perceptual and associative learning. In P. M. A. Rabbitt & S. Dornic (Eds.), Attention and performance V. New York: Academic, 1974.
Lewis, C. Speed and practice, undated.
Lewis, C. Private communication, 1980.
Lindsay, P., & Norman, D. Human information processing: An introduction to psychology (2nd ed.). New York: Academic, 1977.
Mazur, J., & Hastie, R. Learning as accumulation: A reexamination of the learning curve. Psychological Bulletin, 1978, 85(6), 1256-1274.
Miller, G. A. The magic number seven plus or minus two: Some limits on our capacity for processing information. Psychological Review, 1956, 63, 81-97.
Moran, T. P. Compiling cognitive skill, 1980 (AIP Memo 150, Xerox PARC).
Neisser, U., Novick, R., & Lazar, R. Searching for ten targets simultaneously. Perceptual and Motor Skills, 1963, 17, 955-961.
Neves, D. M., & Anderson, J. R. Knowledge compilation: Mechanisms for the automatization of cognitive skills. In J. R. Anderson (Ed.), Cognitive skills and their acquisition. Hillsdale, N.J.: Lawrence Erlbaum Associates (in press).
Newell, A. Harpy, production systems and human cognition. In R. Cole (Ed.), Perception and production of fluent speech. Hillsdale, N.J.: Lawrence Erlbaum Associates, 1980.
Posner, M. I., & Snyder, C. R. R. Attention and cognitive control. In R. L. Solso (Ed.), Information processing and cognition. Hillsdale, N.J.: Lawrence Erlbaum Associates, 1975.
Reisberg, D., Baron, J., & Kemler, D. G. Overcoming Stroop interference: The effects of practice on distractor potency. Journal of Experimental Psychology: Human Perception and Performance, 1980, 6, 140-150.
Restle, F., & Greeno, J. Introduction to mathematical psychology. Reading, Mass.: Addison-Wesley, 1970 (chap. 1).
Rigon, C. J. Analysis of progress trends in aircraft production. Aero Digest, May 1944, 132-138.
Robertson, G., McCracken, D., & Newell, A. The ZOG approach to man-machine communication. International Journal of Man-Machine Studies (in press).
Schneider, W., & Shiffrin, R. M. Controlled and automatic human information processing: I. Detection, search and attention. Psychological Review, 1977, 84, 1-66.
Seibel, R. Discrimination reaction time for a 1,023 alternative task. Journal of Experimental Psychology, 1963, 66, 215-226.
Shiffrin, R. M., & Schneider, W. Controlled and automatic human information processing: II. Perceptual learning, automatic attending, and a general theory. Psychological Review, 1977, 84, 127-190.
Simon, H. A. On a class of skew distribution functions. Biometrika, 1955, 42, 425-440.
Simon, H. A., & Gilmartin, K. A simulation of memory for chess positions. Cognitive Psychology, 1973, 5, 29-46.
Snoddy, G. S. Learning and stability. Journal of Applied Psychology, 1926, 10, 1-36.
Spelke, E., Hirst, W., & Neisser, U. Skills of divided attention. Cognition, 1976, 4, 215-230.
Stevens, J. C., & Savin, H. B. On the form of learning curves. Journal of the Experimental Analysis of Behavior, 1962, 5(1), 15-18.
Suppes, P., Fletcher, J. D., & Zanotti, M. Models of individual trajectories in computer-assisted instruction for deaf students. Journal of Educational Psychology, 1976, 68, 117-127.
Thurstone, L. L. The learning curve equation. Psychological Monographs, 1919, 26(114), 51.
Welford, A. T. Fundamentals of skill. London: Methuen, 1968.
Woodworth, R. S. Experimental psychology. New York: Holt, 1938.
CHAPTER 4
The Knowledge Level

A. Newell, Carnegie Mellon University
1. Introduction
This is the first presidential address of AAAI, the American Association for Artificial Intelligence. In the grand scheme of history, even the history of artificial intelligence (AI), this is surely a minor event. The field this scientific society represents has been thriving for quite some time. No doubt the society itself will make solid contributions to the health of our field. But it is too much to expect a presidential address to have a major impact. So what is the role of the presidential address and what is the significance of the first one? I believe its role is to set a tone, to provide an emphasis. I think the role of the first address is to take a stand about what that tone and emphasis should be, to set expectations for future addresses and to communicate to my fellow presidents. Only two foci are really possible for a presidential address: the state of the society or the state of the science. I believe the latter to be the correct focus. AAAI itself, its nature and its relationship to the larger society that surrounds it, are surely important.1 However, our main business is to help AI become a science-albeit a science with a strong engineering flavor. Thus, though a president's address cannot be narrow or highly technical, it can certainly address a substantive issue. That is what I propose to do. I wish to address the question of knowledge and representation.

* Presidential Address, American Association for Artificial Intelligence, AAAI80, Stanford University, 19 Aug 1980. Also published in the AI Magazine 2(2) 1981.
** This research was sponsored by the Defense Advanced Research Projects Agency (DOD), ARPA Order No. 3597, monitored by the Air Force Avionics Laboratory under Contract F33615-78-C-1551. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the US Government.
1. I have already provided some comments, as president, on such matters [25].

That is a little
like a physicist wishing to address the question of radiation and matter. Such broad terms designate a whole arena of phenomena and a whole armada of questions. But comprehensive treatment is neither possible nor intended. Rather, such broad phrasing indicates an intent to deal with the subject in some basic way. Thus, the first task is to make clear the aspect of knowledge and representation of concern, namely, what is the nature of knowledge. The second task will be to outline an answer to this question. As the title indicates, I will propose the existence of something called the knowledge level. The third task will be to describe the knowledge level in as much detail as time and my own understanding permits. The final task will be to indicate some consequences of the existence of a knowledge level for various aspects of AI.

2. The Problem of Representation and Knowledge

2.1. The standard view
Two orthogonal and compatible basic views of the enterprise of AI serve our field, beyond all theoretical quibbles. The first is a geography of task areas. There is puzzle solving, theorem proving, game-playing, induction, natural language, medical diagnosis, and on and on, with subdivisions of each major territory. AI, in this view, is an exploration, in breadth and in depth, of new territories of tasks with their new patterns of intellectual demands. The second view is the functional components that comprise an intelligent system. There is a perceptual system, a memory system, a processing system, a motor system, and so on. It is this second view that we need to consider to address the role of representation and knowledge. Fig. 1 shows one version of the functional view, taken from [28], neither better nor worse than many others. An intelligent agent is embedded in a task environment; a task statement enters via a perceptual component and is encoded in an initial representation. Whence starts a cycle of activity in which a recognition occurs (as indicated by the eyes) of a method to use to attempt the problem. The method draws upon a memory of general world knowledge. In the course of such cycles, new methods and new representations may occur, as the agent attempts to solve the problem. The goal structure, a component we all believe to be important, does not receive its due in this figure, but no matter. Such a picture represents a convenient and stable decomposition of the functions to be performed by an intelligent agent, quite independent of particular implementations and anatomical arrangements. It also provides a convenient and stable decomposition of the entire scientific field into subfields. Scientists specialize in perception, or problem solving methods, or representation, etc. It is clear to us all what representation is in this picture. It is the data structures that hold the problem and will be processed into a form that makes the solution available. Additionally, it is the data structures that hold the world
[FIG. 1. Functional diagram of general intelligent agent (after [28]); labeled components include the internal representation, general knowledge, and the method store.]
knowledge and will be processed to acquire parts of the solution or to obtain guidance in constructing it. The first data structures represent the problem, the second represent world knowledge. A data structure by itself is impotent, of course. We have learned to take the representation to include the basic operations of reading and writing, of access and construction. Indeed, as we know, it is possible to take a pure process view of the representation and work entirely in terms of the inputs and outputs to the read and write processes, letting the data structure itself fade into a mythical story we tell ourselves to make the memory-dependent behavior of the read and write processes coherent. We also understand, though not so transparently, why the representation represents. It is because of the totality of procedures that process the data structure. They transform it in ways consistent with an interpretation of the data structure as representing something. We often express this by saying that a data structure requires an interpreter, including in that term much more than just the basic read/write processes, namely, the whole of the active system that uses the data structure. The term representation is used clearly (almost technically) in AI and
computer science. In contrast, the term knowledge is used informally, despite its prevalence in such phrases as knowledge engineering and knowledge sources. It seems mostly a way of referring to whatever it is that a representation has. If a system has (and can use) a data structure which can be said to represent something (an object, a procedure, . . . whatever), then the system itself can also be said to have knowledge, namely the knowledge embodied in that representation about that thing.

2.2. Why is there a problem?
This seems to be a reasonable picture, which is serving us well. Why then is there a problem? Let me assemble some contrary indicators from our current scene. A first indicator comes from our continually giving to representation a somewhat magical role.2 It is a cliche of AI that representation is the real issue we face. Though we have programs that search, it is said, we do not have programs that determine their own representations or invent new representations. There is of course some substance to such statements. What is indicative of underlying difficulties is our inclination to treat representation like a homunculus, as the locus of real intelligence. A good example is our fascination with problems such as the mutilated checkerboard problem [24]. The task is to cover a checkerboard with two-square dominoes. This is easy enough to do with the regular board and clearly impossible to do if a single square is removed, say from the upper right corner. The problem is to do it on a (mutilated) board which has two squares removed, one from each of two opposite corners. This task also turns out to be impossible. The actual task, then, is to show the impossibility. This goes from apparently intractable combinatorially, if the task is represented as all ways of laying down dominoes, to transparently easy, if the task is represented as just the numbers of black and white squares that remain to be covered. Now, the crux for AI is that no one has been able to formulate in a reasonable way the problem of finding the good representation, so it can be tackled by an AI system. By implication-so goes this view-the capability to invent such appropriate representations requires intelligence of some new and different kind. A second indicator is the great theorem-proving controversy of the late sixties and early seventies. Everyone in AI has some knowledge of it, no doubt, for its residue is still very much with us. It needs only brief recounting. Early work in theorem proving programs for quantified logics culminated in 1965 with Alan Robinson's development of a machine-oriented formulation of first-order logic called Resolution [32].

2. Representation is not the only aspect of intelligent systems that has a magical quality; learning is another. But that is a different story for a different time.

There followed an immensely
productive period of exploration of resolution-based theorem-proving. This was fueled, not only by technical advances, which occurred rapidly and on a broad front [15], but also by the view that we had a general purpose reasoning engine in hand and that doing logic (and doing it well) was a foundation stone of all intelligent action. Within about five years, however, it became clear that this basic engine was not going to be powerful enough to prove theorems that are hard on a human scale, or to move beyond logic to mathematics, or to serve other sorts of problem solving, such as robot planning. A reaction set in, whose slogan was "uniform procedures will not work". This reaction itself had an immensely positive outcome in driving forward the development of the second generation of AI languages: Planner, Microplanner, QA4, Conniver, POP2, etc. [6]. These unified some of the basic mechanisms in problem solving-goals, search, pattern matching, and global data bases-into a programming language framework, with its attendant gains of involution. However, this reaction also had a negative residue, which still exists today, well after these new AI languages have come and mostly gone, leaving their own lessons. The residue in its most stereotyped form is that logic is a bad thing for AI. The stereotype is not usually met with in pure form, of course. But the mat of opinion is woven from a series of strands that amount to as much: Uniform proof techniques have been proven grossly inadequate; the failure of resolution theorem proving implicates logic generally; logic is permeated with a static view; and logic does not permit control. Any doubts about the reality of this residual reaction can be stilled by reading Pat Hayes's attempt to counteract it in [12]. A third indicator is the recent SIGART "Special issue of knowledge representation" [7]. This consisted of the answers (plus analysis) to an elaborate questionnaire developed by Ron Brachman of BBN and Brian Smith of MIT, which was sent to the AI research community working on knowledge representation. In practice, this meant work in natural language, semantic nets, logical formalisms for representing knowledge, and in the third generation of programming and representation systems, such as AIMDS, KRL, and KL-ONE. The questionnaire not only covered the basic demography of the projects and systems, but also the position of the respondent (and his system) on many critical issues of representation-quantification, quotation, self-description, evaluation vs. reference-finding, and so on. The responses were massive, thoughtful and thorough, which was impressive given that the questionnaire took well over an hour just to read, and that answers were of the order of ten single-spaced pages. A substantial fraction of the field received coverage in the 80-odd returns, since many of them represented entire projects. Although the questionnaire left much to be desired in terms of the precision of its questions, the Special Issue still provides an extremely interesting glimpse of how AI sees the issues of knowledge representation.
The main result was overwhelming diversity-a veritable jungle of opinions. There was no consensus on any question of substance. Brachman and Smith themselves highlight this throughout the issue, for it came as a major surprise to them. Many (but of course not all!) respondents themselves felt the same way. As one said, "Standard practice in the representation of knowledge is the scandal of AI". What is so overwhelming about the diversity is that it defies characterization. The role of logic and theorem proving, just described above, are in evidence, but there is much else besides. There is no tidy space of underlying issues in which respondents, hence the field, can be plotted to reveal a pattern of concerns or issues. Not that Brachman and Smith could see. Not that this reader could see.

2.3. A formulation of the problem
These three items-mystification of the role of representation, the residue of the theorem-proving controversy, and the conflicting webwork of opinions on knowledge representation-are sufficient to indicate that our views on representation and knowledge are not in satisfactory shape. However, they hardly indicate a crisis, much less a scandal. At least not to me. Science easily inhabits periods of diversity; it tolerates bad lessons from the past in concert with good ones. The chief signal these three send is that we must redouble our efforts to bring some clarity to the area. Work on knowledge and representation should be a priority item on the agenda of our science. No one should have any illusions that clarity and progress will be easy to achieve. The diversity that is represented in the SIGART Special Issue is highly articulate and often highly principled. Viewed from afar, any attempt to clarify the issues is simply one more entry into the cacophony-possibly treble, possibly bass, but in any case a note whose first effect will be to increase dissonance, not diminish it. Actually, these indicators send an ambiguous signal. An alternative view of such situations in science is that effort is premature. Only muddling can happen for the next while-until more evidence accumulates or conceptions ripen elsewhere in AI to make evident patterns that now seem only one possibility among many. Work should be left to those already committed to the area; the rest of us should make progress where progress can clearly be made. Still, though not compelled, I wish to have a go at this problem. I wish to focus attention on the question: What is knowledge? In fact, knowledge gets very little play in the three indicators just presented. Representation occupies center stage, with logic in the main supporting role. I could claim that this is already the key-that the conception of knowledge is logically prior to that of representation, and until a clear conception of the former exists, the latter will remain confused. In fact, this is not so. Knowledge is simply one particular
entry point to the whole tangled knot. Ultimately, clarity will be attained on all these notions together. The path through which this is achieved will be grist for those interested in the history of science, but is unlikely to affect our final understanding. To reiterate: What is the nature of knowledge? How is it related to representation? What is it that a system has, when it has knowledge? Are we simply dealing with redundant terminology, not unusual in natural language, which is better replaced by building on the notions of data structures, interpreters, models (in the strict sense used in logic), and the like? I think not. I think knowledge is a distinct notion, with its own part to play in the nature of intelligence.

2.4. The solution follows from practice
Before starting on matters of substance, I wish to make a methodological point. The solution I will propose follows from the practice of AI. Although the formulation I present may have some novelty, it should be basically familiar to you, for it arises from how we in AI treat knowledge in our work with intelligent systems. Thus, your reaction may (perhaps even should) be "But that is just the way I have been thinking about knowledge all along. What is this man giving me?" On the first part, you are right. This is indeed the way AI has come to use the concept of knowledge. However, this is not the way the rest of the world uses the concept. On the second part, what I am giving you is a directive that your practice represents an important source of knowledge about the nature of intelligent systems. It is to be taken seriously. This point can use expansion. Every science develops its own ways of finding out about its subject matter. These get tidied up in meta-models about scientific activity, e.g., the so-called scientific method in the experimental sciences. But these are only models; in reality, there is immense diversity in how scientific progress is made. For instance, in computer science many fundamental conceptual advances occur by (scientifically) uncontrolled experiments in our own style of computing.3 Three excellent examples are the developments of time-sharing, packet switched networks, and locally-networked personal computing. These are major conceptual advances that have broadened our view of the nature of computing. Their primary validation is entirely informal. Scientific activity of a more traditional kind certainly takes place-theoretical development with careful controlled testing and evaluation of results.

3. Computer science is not unique in having modes of progress that don't fit easily into the standard frames. In the heyday of paleontology, major conceptual advances occurred by stumbling across the bones of immense beasties. Neither controlled experimentation nor theoretical prediction played appreciable roles.

But it happens on the
details, not on the main conceptions. Not everyone understands the necessary scientific role of such experiments in computational living, nor that standard experimental techniques cannot provide the same information. How else to explain, for example, the calls for controlled experimental validation that speech understanding will be useful to computer science? When that experiment of style is finally performed there will be no doubt at all. No standard experiment will be necessary. Indeed, none could have sufficed. As an example related to the present paper, I have spent some effort recently in describing what Herb Simon and I have called the "Physical symbol system hypothesis" [26, 29]. This hypothesis identifies a class of systems as embodying the essential nature of symbols and as being the necessary and sufficient condition for a generally intelligent agent. Symbol systems turn out to be universal computational systems, viewed from a different angle. For my point here, the important feature of this hypothesis is that it grew out of the practice in AI-out of the development of list processing languages and Lisp, and out of the structure adopted in one AI program after another. We in AI were led to an adequate notion of a symbol by our practice. In the standard catechism of science, this is not how great ideas develop. Major ideas occur because great scientists discover (or invent) them, introducing them to the scientific community for testing and elaboration. But here, working scientists have evolved a new major scientific concept, under partial and alternative guises. Only gradually has it acquired its proper name. The notions of knowledge and representation about to be presented also grow out of our practice. At least, so I assert. That does not give them immunity from criticism, for in listening for these lessons I may have a tin ear. But in so far as they are wanting, the solution lies in more practice and more attention to what emerges there as pragmatically successful. Of course, the message will be distorted by many things, e.g., peculiar twists in the evolution of computer science hardware and software, our own limitations of view, etc. But our practice remains a source of knowledge that cannot be obtained from anywhere else. Indeed, AI as a field is committed to it. If it is fundamentally flawed, that will just be too bad for us. Then, other paths will have to be found from elsewhere to discover the nature of intelligence.

3. The Knowledge Level
I am about to propose the existence of something called the knowledge level, within which knowledge is to be defined. To state this clearly requires first reviewing the notion of computer systems levels.

3.1. Computer system levels
Fig. 2 shows the standard hierarchy, familiar to everyone in computer science. Conventionally, it starts at the bottom with the device level, then up to the
circuit level, then the logic level, with its two sublevels, combinatorial and sequential circuits, and the register-transfer level, then the program level (referred to also as the symbolic level) and finally, at the top, the configuration level (also called the PMS or Processor-Memory-Switch level). We have drawn the configuration level to one side, since it lies directly above both the symbol level and the register-transfer level. The notion of levels occurs repeatedly throughout science and philosophy, with varying degrees of utility and precision. In computer science, the notion is quite precise and highly operational. Table 1 summarizes its essential attributes. A level consists of a medium that is to be processed, components that provide primitive processing, laws of composition that permit components to be assembled into systems, and laws of behavior that determine how system behavior depends on the component behavior and the structure of the system. There are many variant instantiations of a given level, e.g., many programming systems and machine languages and many register-transfer systems.4 Each level is defined in two ways. First, it can be defined autonomously, without reference to any other level. To an amazing degree, programmers need not know logic circuits, logic designers need not know electrical circuits, managers can operate at the configuration level with no knowledge of programming, and so forth. Second, each level can be reduced to the level below. Each aspect of a level-medium, components, laws of composition and behavior-can be defined in terms of systems at the level next below. The architecture is the name we give to the register-transfer level system that defines a symbol (programming) level, creating a machine language and making it run as described in the programmers manual for the machine. Neither of these two definitions of a level is the more fundamental. It is essential that they both exist and agree. Some intricate relations exist between and within levels. Any instantiation of a level can be used to create any instantiation of the next higher level. Within each level, systems hierarchies are possible, as in the subroutine hierarchy at the programming level. Normally, these do not add anything special in terms of the computer system hierarchy itself. However, as we all know, at the program level it is possible to construct any instantiation within any other instantiation (modulo some rigidity in encoding one data structure into another), as in creating new programming languages. There is no need to spin out the details of each level. We live with them every day and they are the stuff of architecture textbooks [1], machine manuals and digital component catalogues, not research papers. However, it is noteworthy how radically the levels differ. The medium changes from electrons and magnetic domains at the device level, to current and voltage at the circuit level,
4. Though currently dominated by electrical circuits, variant circuit level instantiations also exist, e.g., fluidic circuits.
[FIG. 2. Computer system levels: from the bottom, the device level; the circuit level; the logic level, with its logic circuit and register-transfer sublevels; the program (symbol) level; and, to one side at the top, the configuration (PMS) level.]
to bits at the logic level (either single bits at the logic circuit level or bit vectors at the register-transfer level), to symbolic expressions at the symbol level, to amounts of data (measured in data bits) at the configuration level. System characteristics change from continuous to discrete processing, from parallel to serial operation, and so on. Despite this variety, all levels share some common features. Four of these, though transparently obvious, are important to us.

Point 1. Specification of a system at a level always determines completely a definite behavior for the system at that level (given initial and boundary conditions).

Point 2. The behavior of the total system results from the local effects of each component of the system processing the medium at its inputs to produce its outputs.

Point 3. The immense variety of behavior is obtained by system structure, i.e., by the variety of ways of assembling a small number of component types (though perhaps a large number of instances of each type).
TABLE 1. Defining aspects of a computer system level

Aspects             Register-transfer level         Symbol level
Systems             Digital systems                 Computers
Medium              Bit vectors                     Symbols, expressions
Components          Registers, functional units     Memories, operations
Composition laws    Transfer path                   Designation, association
Behavior laws       Logical operations              Sequential interpretation
Point 4. The medium is realized by state-like properties of matter, which remain passive until changed by the components.

Computer systems levels are not simply levels of abstraction. That a system has a description at a given level does not necessarily imply it has a description at higher levels. There is no way to abstract from an arbitrary electronic circuit to obtain a logic-level system. There is no way to abstract from an arbitrary register-transfer system to obtain a symbol-level system. This contrasts with many types of abstraction which can be uniformly applied, and thus have a certain optional character (as in abstracting away from the visual appearance of objects to their masses). Each computer system level is a specialization of the class of systems capable of being described at the next lower level. Thus, it is a priori open whether a given level has any physical realizations. In fact, computer systems at all levels are realizable, reflecting indirectly the structure of the physical world. But more holds than this. Computer systems levels are realized by technologies. The notion of a technology has not received the conceptual attention it deserves. But roughly, given a specification of a particular system at a level, it is possible to construct by routine means a physical system that realizes that specification. Thus, systems can be obtained to specification within limits of time and cost. It is not possible to invent arbitrarily additional computer system levels that nestle between existing levels. Potential levels do not become technologies, just by being thought up. Nature has a say in whether a technology can exist. Computer system levels are approximations. All of the above notions are realized in the real world only to various degrees. Errors at lower levels propagate to higher ones, producing behavior that is not explicable within the higher level itself. Technologies are imperfect, with constraints that limit the size and complexity of systems that can actually be fabricated. These constraints are often captured in design rules (e.g., fan-out limits, stack-depth limits, etc), which transform system design from routine to problem solving. If the complexities become too great, the means of system creation no longer constitute a technology, but an arena of creative invention. We live quite comfortably with imperfect system levels, especially at the extremes of the hierarchy. At the bottom, the device level is not complete, being used only to devise components at the circuit level. Likewise, at the top, the configuration level is incomplete, not providing a full set of behavioral laws. In fact, it is more nearly a pure level of abstraction than a true system level. This accounts for both symbol level and register transfer level systems having configuration (PMS) level abstractions (see [2] for a PMS approach to the register-transfer level). These levels provide ways of describing computer systems; they do not provide ways of describing their environments. This may seem somewhat unsatisfactory, because a level does not then provide a general closed description of an entire universe, which is what we generally expect (and get) from a
level of scientific description in physics or chemistry. However, the situation is understandable enough. System design and analysis requires only that the interface between the environment and the system (i.e., the inner side of the transducers) be adequately described in terms of each level, e.g., as electrical signals, bits, symbols or whatever. Almost never does the universe of system plus environment have to be modeled in toto, with the structure and dynamics of the environment described in the same terms as the system itself. Indeed, in general no such description of the environment in the terms of a given computer level exists. For instance, no register-transfer level description exists of the airplane in which an airborne computer resides. Computer system levels describe the internal structure of a particular class of systems, not the structure of a total world. To sum up, computer system levels are a reflection of the nature of the physical world. They are not just a point of view that exists solely in the eye of the beholder. This reality comes from computer system levels being genuine specializations, rather than being just abstractions that can be applied uniformly.

3.2. A new level
I now propose that there does exist yet another system level, which I will call the knowledge level. It is a true systems level, in the sense we have just reviewed. The thrust of this paper is that distinguishing this level leads to a simple and satisfactory view of knowledge and representation. It dissolves some of the difficulties and confusions we have about this aspect of artificial intelligence. A quick overview of the knowledge level, with an indication of some of its immediate consequences, is useful before entering into details. The system at the knowledge level is the agent. The components at the knowledge level are goals, actions, and bodies. Thus, an agent is composed of a set of actions, a set of goals and a body. The medium at the knowledge level is knowledge (as might be suspected). Thus, the agent processes its knowledge to determine the actions to take. Finally, the behavior law is the principle of rationality: Actions are selected to attain the agent's goals. To treat a system at the knowledge level is to treat it as having some knowledge and some goals, and believing it will do whatever is within its power to attain its goals, in so far as its knowledge indicates. For example:

• "She knows where this restaurant is and said she'd meet me here. I don't know why she hasn't arrived."
• "Sure, he'll fix it. He knows about cars."
• "If you know that 2 + 2 = 4, why did you write 5?"

The knowledge level sits in the hierarchy of systems levels immediately above the symbol level, as Fig. 3 shows. Its components (actions, goals, body)
[FIG. 3. New version of computer system levels: the knowledge level is added at the top, immediately above the program (symbol) level and alongside the configuration (PMS) level, with the register-transfer and logic circuit sublevels, circuit level, and device level below.]
and its medium (knowledge) can be defined in terms of systems at the symbol level, just as any level can be defined by systems at the level one below. The knowledge level has been placed side by side with the configuration level. The gross anatomical description of a knowledge-level system is simply (and only) that the agent has as parts bodies of knowledge, goals and actions. They are all connected together in that they enter into the determination of what actions to take. This structure carries essentially no information; the specification at the knowledge level is provided entirely by the content of the knowledge and the goals, not by any structural way they are connected together. In effect, the knowledge level can be taken to have a degenerate configuration level. As is true of any level, although the knowledge level can be constructed from the level below (i.e., the symbol level), it also has an autonomous formulation as an independent level. Thus, knowledge can be defined independent of the symbol level, but can also be reduced to symbol systems. As Fig. 3 makes evident, the knowledge level is a separate level from the symbol level. Relative to the existing view of the computer systems hierarchy, the symbol level has been split in two, one aspect remaining as the symbol level proper and another aspect becoming the knowledge level. The description of a system as exhibiting intelligent behavior still requires both levels, as we shall see. Intelligent systems are not to be described exclusively in terms of the knowledge level. To repeat the final remark of the prior section: Computer system levels really exist, as much as anything exists. They are not just a point of view. Thus, to claim that the knowledge level exists is to make a scientific claim, which can range from dead wrong to slightly askew, in the manner of all scientific claims. Thus, the matter needs to be put as a hypothesis:

The Knowledge Level Hypothesis. There exists a distinct computer systems level, lying immediately above the symbol level, which is characterized by knowledge as the medium and the principle of rationality as the law of behavior.
Some preliminary feeling for the nature of knowledge according to this hypothesis can be gained from the following remarks.

• Knowledge is intimately linked with rationality. Systems of which rationality can be posited can be said to have knowledge. It is unclear in what sense other systems can be said to have knowledge.
• Knowledge is a competence-like notion, being a potential for generating action.5
• The knowledge level is an approximation. Nothing guarantees how much of a system's behavior can be viewed as occurring at the knowledge level. Although extremely useful, the approximation is quite imperfect, not just in degree but in scope.
• Representations exist at the symbol level, being systems (data structures and processes) that realize a body of knowledge at the knowledge level.
• Knowledge serves as the specification of what a symbol structure should be able to do.
• Logics are simply one class of representations among many, though uniquely fitted to the analysis of knowledge and representation.

4. The Details of the Knowledge Level
We begin by defining the knowledge level autonomously, i.e., independently of lower system levels. Table 1 lists what is required, though we will take up the various aspects in a different order: first, the structure, consisting of the system, components and laws for composing systems; second, the laws of behavior (the law of rationality); and third, the medium (knowledge). After this we will describe how the knowledge level reduces to the symbol level. Against the background of the common features of the familiar computer system levels listed earlier (Section 3.1), there will be four surprises in how the knowledge level is defined. These will stretch the notion of system level somewhat, but will not break it.

4.1. The structure of the knowledge level
An agent (the system at the knowledge level) has an extremely simple structure, so simple there is no need even to picture it.

First, the agent has some physical body with which it can act in the environment (and be acted upon). We talk about the body as if it consists of a set of actions, but that is only for simplicity. It can be an arbitrary physical system with arbitrary modes of interaction with its environment.

5How almost interchangeable the two notions might be can be seen from a quotation from Chomsky [9, p. 315]: "In the past I have tried to avoid, or perhaps evade the problem of explicating the notion 'knowledge of language' by using an invented technical term, namely the term 'competence' in place of 'knowledge'".
Though this body can be arbitrarily complex, its complexity lies external to the system described at the knowledge level, which simply has the power to evoke the behavior of this physical system.

Second, the agent has a body of knowledge. This body is like a memory. Whatever the agent knows at some time, it continues to know. Actions can add knowledge to the existing body of knowledge. However, in terms of structure, a body of knowledge is extremely simple compared to a memory, as defined at lower computer system levels. There are no structural constraints to the knowledge in a body, either in capacity (i.e., the amount of knowledge) or in how the knowledge is held in the body. Indeed, there is no notion of how knowledge is held (encoding is a notion at the symbol level, not the knowledge level). Also, there are no well-defined structural properties associated with access and augmentation. Thus, it seems preferable to avoid calling the body of knowledge a memory. In fact, referring to a 'body of knowledge', rather than just to 'knowledge', is hardly more than a manner of speaking, since this body has no function except to be the physical component which has the knowledge.

Third, and finally, the agent has a set of goals. A goal is a body of knowledge of a state of affairs in the environment. Goals are structurally distinguished from the main body of knowledge. This permits them to enter into the behavior of the agent in a distinct way, namely, that which the organism strives to realize. But, except for this distinction, goal components are structurally identical to bodies of knowledge. Relationships exist between goals, of course, but these are not realized in the structure of the system, but in knowledge.

There are no laws of composition for building a knowledge level system out of these components. An agent always has just these components. They all enter directly into the laws of behavior. There is no way to build up complex agents from them. This complete absence of significant structure in the agent is the first surprise, running counter to the common feature at all levels that variety of behavior is realized by variety of system structure (Section 3.1, Point 3). This is not fortuitous, but is an essential feature of the knowledge level. The focus for determining the behavior of the system rests with the knowledge, i.e., with the content of what is known. The internal structure of the system is defined exactly so that nothing need be known about it to predict the agent's behavior. The behavior is to depend only on what the agent knows, what it wants and what means it has for interacting physically with the environment.
4.2. The principle of rationality
The behavioral law that governs an agent, and permits prediction of its behavior, is the rational principle that knowledge will be used in the service of
goals.6 This can be formulated more precisely as follows:

Principle of rationality. If an agent has knowledge that one of its actions will lead to one of its goals, then the agent will select that action.
This principle asserts a connection between knowledge and goals, on the one hand, and the selection of actions on the other, without specification of any mechanism through which this connection is made. It connects all the components of the agent together directly. This direct determination of behavior by a global principle is the second surprise, running counter to the common feature at all levels that behavior is determined bottom-up through the local processing of components (Section 3.1, Point 2). Such global principles are not incompatible with systems whose behavior is also describable by mechanistic causal laws, as testified by various global principles in physics, e.g., Fermat's principle of least time in geometrical optics, or the principle of least action in mechanics. The principles in physics are usually optimization (i.e., extremum) principles. However, the principle of rationality does not have built into it any notion of optimal or best, only the automatic connection of actions to goals according to knowledge.

Under certain simple conditions, the principle as stated permits the calculation of a system's trajectory, given the requisite initial and boundary conditions (i.e., goals, actions, initial knowledge, and acquired knowledge as determined by actions). However, the principle is not sufficient to determine the behavior in many situations. Some of these situations can be covered by adding auxiliary principles. Thus, we can think of an extended principle of rationality, building out from the central or main principle, given above.

To formulate additional principles, the phrase selecting an action is taken to mean that the action becomes a member of a candidate set of actions, the selected set, rather than being the action that actually occurs. When all principles are taken into account, if only one action remains in the selected set, the action actually taken is determined; if several candidates remain, then the action actually taken is limited to these possibilities. An action can actually be taken only if it is physically possible, given the situation of the agent's body and the resources available. Such limits to action will affect the actions selected only if the agent has knowledge of them.

The main principle is silent about what happens if the principle applies for more than one action for a given goal. This can be covered by the following auxiliary principle.
6This principle is not intended to be directly responsive to all the extensive philosophical discussion on rationality, e.g., the notion that rationality implies the ability for an agent to give reasons for what it does.
Equipotence of acceptable actions. For given knowledge, if action A1 and action A2 both lead to goal G, then both actions are selected.7
This principle simply asserts that all ways of attaining the goal are equally acceptable from the standpoint of the goal itself. There is no implicit optimality principle that selects among such candidates.

The main principle is also silent about what happens if the principle applies to several goals in a given situation. A simple auxiliary principle is the following.

Preference of joint goal satisfaction. For given knowledge, if goal G1 has the set of selected actions {A1,i} and goal G2 has the set of selected actions {A2,j}, then the effective set of selected actions is the intersection of {A1,i} and {A2,j}.
It is better to achieve both of two goals than either alone. This principle determines behavior in many otherwise ambiguous situations. If the agent has general goals of minimizing effort, minimizing cost, or doing things in a simple way, these general goals select out a specific action from a set of otherwise equipotent task-specific actions.

However, this principle of joint satisfaction still goes only a little ways further towards obtaining a principle that will determine behavior in all situations. What if the intersection of selected action sets is null? What if there are several mutually exclusive actions leading to several goals? What if the attainment of two goals is mutually exclusive, no matter through what actions attained? These types of situations too can be dealt with by extending the concept of a goal to include goal preferences, that is, specific preferences for one state of affairs over another that are not grounded in knowledge of how these states differentially aid in reaching some common superordinate goal.

Even this extended principle of rationality does not cover all situations. The central principle refers to an action leading to a goal. In the real world it is often not clear whether an action will attain a specific goal. The difficulty is not just one of the possibility of error. The actual outcome may be truly probabilistic, or the action may be only the first step in a sequence that depends on other agents' moves, etc. Again, extensions exist to deal with uncertainty and risk, such as adopting expected value calculations and principles of minimizing maximum loss.

In proposing each of these solutions, I am not inventing anything. Rather, this growing extension of rational behavior moves along a well-explored path, created mostly by modern day game theory, econometrics and decision theory [16, 39].

7For simplicity, in this principle and others, no explicit mention is made of the agent whose goals, knowledge and actions are under discussion.
It need not be retraced here. The exhibition of the first few elementary extensions can serve to indicate the total development to be taken in search of a principle of rationality that always determines the action to be taken. Complete retracing is not necessary because the path does not lead, even ultimately, to the desired principle. No such principle exists.8 Given an agent in the real world, there is no guarantee at all that his knowledge will determine which actions realize which goals. Indeed, there is no guarantee even that the difficulty resides in the incompleteness of the agent's knowledge. There need not exist any state of knowledge that would determine the action.

The point can be brought home by recalling Frank Stockton's famous short story, "The lady or the tiger?" [38]. The upshot has the lover of a beautiful princess caught by the king. He now faces the traditional ordeal of judgment in this ancient and barbaric land: In the public arena he must choose to open one of two utterly indistinguishable doors. Behind one is a ferocious tiger and death; behind the other a lovely lady, life and marriage. Guilt or innocence is determined by fate. The princess alone, spurred by her love, finds out the secret of the doors on the night before the trial. To her consternation she finds that the lady behind the door is to be her chief rival in loveliness. On the judgment day, from her seat in the arena beside the king, she indicates by a barely perceptible nod which door her lover should open. He, in unhesitating faithfulness, goes directly to that door. The story ends with a question. Did the princess send her lover to the lady or the tiger?

Our knowledge-level model of the princess, even if it were to include her complete knowledge, would not tell us which she chose. But the failure of determinacy is not the model's. Nothing says that multiple goals need be compatible, nor that the incompatibility be resolvable by any higher principles. The dilemma belongs to the princess, not to the scientist attempting to formulate an adequate concept of the knowledge level. That she resolved it is clear, but that her behavior in doing so was describable at the knowledge level does not follow.

This failure to determine behavior uniquely is the third surprise, running counter to the common feature at all levels that a system is a determinate machine (Section 3.1, Point 1). A complete description of a system at the program, logic or circuit level yields the trajectory of the system's behavior over time, given initial and boundary conditions. This is taken to be one of its important properties, consistent with being the description of a deterministic (macro) physical system. Yet, radical incompleteness characterizes the knowledge level.
Sometimes behavior can be predicted by the knowledge level description; often it cannot. The incompleteness is not just a failure in certain special situations or in some small departures. The term radical is used to indicate that entire ranges of behavior may not be describable at the knowledge level, but only in terms of systems at a lower level (namely, the symbolic level). However, the necessity of accepting this incompleteness is an essential aspect of this level.

8An adequate critical recounting of this intellectual development is not possible here. Formal systems (utility fields) can be constructed that appear to have the right property. But they work, not by connecting actions with goals via knowledge of the task environment, but by positing of the agent a complete set of goals (actually, preferences) that directly specify all action selections over all combinations of task states (or probabilistic options over task states). Such a move actually abandons the enterprise.
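As a purely illustrative sketch of the selected-set machinery described in this section (the Python rendering and the leads_to relation are hypothetical devices of the editor, not part of the original formulation), the following computes the set of actions the extended principle permits, and shows where it simply leaves behavior undetermined:

    # Minimal sketch of the extended principle of rationality (illustrative only).
    # "leads_to(action, goal)" stands in for the agent's knowledge that an action
    # leads to a goal.

    def selected_actions(actions, goals, leads_to):
        """Return the set of actions permitted by the extended principle.

        Main principle and equipotence: for each goal, every action known to
        lead to it is selected.  Joint goal satisfaction: the effective set is
        the intersection of the per-goal selected sets.  When no knowledge
        applies, or when goals conflict, behavior is left undetermined.
        """
        per_goal = [{a for a in actions if leads_to(a, g)} for g in goals]
        per_goal = [s for s in per_goal if s]    # goals the agent can do nothing about constrain nothing
        if not per_goal:
            return set(actions)                  # undetermined: any action remains possible
        joint = set.intersection(*per_goal)
        return joint if joint else set.union(*per_goal)   # fall back to the equipotent candidates

    # Hypothetical example: walking both gets the agent there and saves money.
    knows = {("walk", "arrive"), ("drive", "arrive"), ("walk", "save money")}
    print(selected_actions({"walk", "drive", "wait"},
                           ["arrive", "save money"],
                           lambda a, g: (a, g) in knows))   # -> {'walk'}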
4.3. The nature of knowledge

The ground is now laid to provide the definition of knowledge. As formulated so far, knowledge is defined to be the medium at the knowledge level, something to be processed according to the principle of rationality to yield behavior. We wish to elevate this into a complete definition:

Knowledge. Whatever can be ascribed to an agent, such that its behavior can be computed according to the principle of rationality.
Knowledge is to be characterized entirely functionally, in terms of what it does, not structurally, in terms of physical objects with particular properties and relations. This still leaves open the requirement for a physical structure for knowledge that can fill the functional role. In fact, that key role is never filled directly. Instead, it is filled only indirectly and approximately by symbol systems at the next lower level. These are total systems, not just symbol structures. Thus, knowledge, though a medium, is embodied in no medium-like passive physical structure.

This failure to have the medium at the knowledge level be a state-like physical structure is the fourth surprise, running counter to the common feature at all levels of a passive medium (Section 3.1, Point 4). Again, it is an essential feature of the knowledge level, giving knowledge its special abstract and competence-like character. One can see on the blackboard a symbolic expression (say a list of theorem names). Though the actual seeing is a tad harder, yet there is still no difficulty seeing this same expression residing in a definite set of memory cells in a computer. The same is true at lower levels-the bit vectors at the register-transfer level-marked on the blackboard or residing in a register as voltages. Moreover, the medium at one level plus additional static structure defines the medium at the next level up. The bit plus its organization into registers provides the bit-vector; collections of bit vectors plus functional specialization to link fields, type fields, etc., defines the symbolic expression. All this fails at the knowledge level. The knowledge cannot so easily be seen, only imagined as the result of interpretive processes operating on symbolic expressions. Moreover, knowledge is not just a collection
of symbolic expressions plus some static organization; it requires both processes and data structures.

The definition above may seem like a reverse, even perverse, way of defining knowledge. To understand it-to see how it works, why it is necessary, why it is useful, and why it is effective-we need to back off and examine the situation in which knowledge is used.

How it works
FIG. 4. The situation in which knowledge is used.

Fig. 4 shows the situation, which involves an observer and an agent. The observer treats the agent as a system at the knowledge level, i.e., ascribes knowledge and goals to it. Thus, the observer knows the agent's knowledge (K) and goals (G), along with his possible actions and his environment (these latter by direct observation, say). In consequence, the observer can make predictions of the agent's actions using the principle of rationality.

Assume the observer ascribes correctly, i.e., the agent behaves as if he has knowledge K and goals G. What the agent really has is a symbol system, S, that permits it to carry out the calculations of what actions it will take, because it has K and G (with the given actions in the given environment).

The observer is itself an agent, i.e., is describable at the knowledge level. There is no way in this theory of knowledge to have a system that is not an agent have knowledge or ascribe knowledge to another system. Hence, the observer has all this knowledge (i.e., K', consisting of knowledge K, knowledge that K is the agent's knowledge, knowledge G, knowledge that these are the agent's goals, knowledge of the agent's actions, etc.). But what the observer really has, of course, is a symbol system, S', that lets it calculate actions on the basis of K, goals, etc., i.e., calculate what the agent would do if it had K and G. Thus, as the figure shows, each agent (the observer and the observed) has
knowledge by virtue of a symbol system that provides the ability to act as if it had the knowledge. The total system (i.e., the dyad of the observing and observed agents) runs without there being any physical structure that is the knowledge.
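A small sketch may make the dyad concrete (the code and its names are the editor's own illustration, under the simplifying assumption that knowledge can be summarized as action-goal links; nothing here is part of the original account): the observer predicts the agent simply by running the same selection process on the knowledge K and goals G it ascribes to the agent.

    # Illustrative sketch of Fig. 4: the observer's symbol system S' predicts the
    # agent by applying the principle of rationality to the ascribed K and G,
    # exactly as the agent's own symbol system S would.

    def select(knowledge, goals, actions):
        """Pick an action that the knowledge links to one of the goals."""
        for action in actions:
            if any((action, goal) in knowledge for goal in goals):
                return action
        return None   # the knowledge level is silent; only the symbol level could say more

    # What the observer ascribes to the agent (hypothetical content).
    K = {("take umbrella", "stay dry")}
    G = ["stay dry"]
    actions = ["take umbrella", "leave umbrella"]

    agents_action = select(K, G, actions)          # what the agent's system S yields
    observers_prediction = select(K, G, actions)   # what the observer's system S' yields

    assert agents_action == observers_prediction == "take umbrella"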
Why it is necessary

Even granting the scheme of Fig. 4 to be possible, why cannot there simply be some physical structure that corresponds to knowledge? Why are not S and S' simply the knowledge? The diagram seems similar to what could be drawn between any two adjacent system levels, and in these other cases corresponding structures do exist at each level. For example, the medium at the symbol level (the symbolic expressions) corresponds to the medium at the register-transfer level (the bit vectors). The higher medium is simply a specialized use of the lower medium, but both are physical structures. The same holds when we descend from the register-transfer level (bit vectors) to the logic circuit level (single bits); the relationship is simply one of aggregation. Descending to the circuit level again involves specialization in which some feature of the circuit level medium (e.g. voltage) is taken to define the bit (e.g. a high and a low voltage level).

The answer in a nutshell is that knowledge of the world cannot be captured in a finite structure. The world is too rich and agents have too great a capability for responding.9 Knowledge is about the world. Insofar as an agent can select actions based on some truth about the world, any candidate structure for knowledge must contain that truth. Thus knowledge as a structure must contain at least as much variety as the set of all truths (i.e. propositions) that the agent can respond to.

9Agents with finite knowledge are certainly possible, but would be extraordinarily limited.

A representation of any fragment of the world (whether abstract or concrete) reveals immediately that the knowledge is not finite. Consider our old friend, the chess position. How many propositions can be stated about the position? For a given agent, only propositions need be considered that could materially affect an action of the agent relative to its goal. But given a reasonably intelligent agent who desires to win, the list of aspects of the position is unbounded. Can the Queen take the Pawn? Can it take the Pawn next move? Can it if the Bishop moves? Can it if the Rook is ever taken? Is a pin possible that will unblock the Rook attacker before a King side attack can be mounted? And so on.

A seemingly appropriate objection is: (1) the agent is finite so can't actually have an unbounded anything; and (2) the chess position is also just a finite structure. However, the objection fails, because it is not possible to state from afar (i.e. from the observer's viewpoint) a bound on the set of propositions about the position that will be available. Of course, if the observer had a model
of the agent at the symbolic process level, then a (possibly accurate) prediction might be made. But the knowledge level is exactly the level that abstracts away from symbolic processes. Indeed, the way the observer determines what the agent might know is to consider what he (the observer) can find out. And this he cannot determine in advance.

The situation here is not really strange. The underlying phenomenon is the generative ability of computational systems, which involves an active process working on an initially given data structure. Knowledge is the posited extensive form of all that can be obtained potentially from the process. This potential is unbounded when the details of processing are unknown and the gap is closed by assuming (from the principle of rationality) the processing to be whatever makes the correct selection. Of course, there are limits to this processing, namely, that it cannot go beyond the knowledge given.

What the computational system generates are selections of actions for goals, conditioned on states of the world. Each such basic means-ends relation may be taken as an element of knowledge. To have the knowledge available in extension would be to have all these possible knowledge elements for all the goals, actions and states of the world discriminable to the agent at the given moment. The knowledge could then be thought of as a giant table full of these knowledge elements, but it would have to be an infinite table. Consequently, this knowledge (i.e., these elements) can only be created dynamically in time. If generated by some simple procedure, only relatively uninteresting knowledge can be found. Interesting knowledge requires generating only what is relevant to the task at hand, i.e. generating intelligently.

Why it is useful

The knowledge level permits predicting and understanding behavior without having an operational model of the processing that is actually being done by the agent. The utility of such a level would seem clear, given the widespread need in life's affairs for distal prediction, and also the paucity of knowledge about the internal workings of humans. (And animals, some of which are usefully described at the knowledge level.) The utility is also clear in designing AI systems, where the internal mechanisms are still to be specified. To the extent that AI systems successfully approximate rational agents, it is also useful for predicting and understanding them. Indeed, the usefulness extends beyond AI systems to all computer programs.

Prediction of behavior is not possible without knowing something about the agent. However, what is known is not the processing structure, but the agent's knowledge of its external environment, which is also accessible to the observer directly (to some extent). The agent's goals must also be known, of course, and these certainly are internal to the agent. But they are relatively stable characteristics that can be inferred from behavior and (for human adults) can sometimes be conveyed by language. One way of viewing the knowledge level is as the
attempt to build as good a model of an agent's behavior as possible based on information external to the agent, hence permitting distal prediction. This standpoint makes understandable why such a level might exist even though it is radically incomplete. If such incompleteness is the best that can be done, it must be tolerated.

Why it is effective
The knowledge level, being indirect and abstract, might be thought to be excessively complex, even abstruse. It could hardly have arisen naturally. On the contrary, given the situation of Fig. 4, it arises as day follows night. The knowledge attributed by the observer to the agent is knowledge about the external world. If the observer takes to itself the role of the other (i.e. the agent), assuming the agent's goals and attending to the common external environment, then the actions it determines for itself will be those that the agent should take. Its own computational mechanisms (i.e. its symbolic system) will produce the predictions that flow from attributing the appropriate knowledge to the agent. This scheme works for the observer without requiring the development of any explicit theory of the agent or the construction of any computational model. To be more accurate, all that is needed is a single general purpose computational mechanism; namely, creating an embedding context that posits the agent's goals and symbolizes (rather than executes) the resulting actions.10 Thus, simulation turns out to be a central mechanism that enables the knowledge level and makes it useful.

10However, obtaining this is still not a completely trivial cognitive accomplishment, as indicated by the emphasis on egocentrism in the early work of Piaget [31] and the significance of taking the role of the other in the work of the social philosopher, George H. Mead [21], who is generally credited with originating the phrase.
4.4. Solutions to the representation of knowledge

The principle of rationality provides, in effect, a general functional equation for knowledge. The problem for agents is to find systems at the symbol level that are solutions to this functional equation, and hence can serve as representations of knowledge. That, of course, is also our own problem, as scientists of the intelligent. If we wish to study the knowledge of agents we must use representations of that knowledge. Little progress can be made given only abstract characterizations. The solutions that agents have found for their own purposes are also potentially useful solutions for scientific purposes, quite independently of their interest, because they are used by agents whose intelligent processing we wish to study.

Knowledge, in the principle of rationality, is defined entirely in terms of the
environment of the agent, for it is the environment that is the object of the agent's goals, and whose features therefore bear on the way actions can attain goals. This is true even if the agent's goals have to do with the agent itself as a physical system. Therefore, the solutions are ways to say things about the environment, not ways to say things about reasoning, internal information processing states, and the like. (However, control over internal processing does require symbols that designate internal processing states and structures.)

Logics are obvious candidates. They are, exactly, refined means for saying things about environments. Logics certainly provide solutions to the functional equation. One can find many situations in which the agent's knowledge can be characterized by an expression in a logic, and from which one can go through in mechanical detail all the steps implied in the principle, deriving ultimately the actions to take and linking them up via a direct semantics so the actions actually occur. Examples abound. They do not even have to be limited to mathematics and logic, given the work in AI to use predicate logic (e.g., resolution) as the symbol level structures for robot tasks of spatial manipulation and movement, as well as for many other sorts of tasks [30].

A logic is just a representation of knowledge. It is not the knowledge itself, but a structure at the symbol level. If we are given a set of logical expressions, say {Li}, of which we are willing to say that the agent "knows {Li}", then the knowledge K that we ascribe to the agent is:
The agent knows all that can be inferred from the conjunction of {Li}.

This statement simply expresses for logic what has been set out more generally above. There exists a symbol system in the agent that is able to bring any inference from {Li} to bear to select the actions of the agent as appropriate (i.e., in the service of the agent's goals). If this symbol system uses the clauses themselves as a representation, then presumably the active processes would consist of a theorem prover on the logic, along with sundry heuristic devices to aid the agent in arriving at the implications in time to perform the requisite action selections.

This statement should bring to prominence an important question, which, if not already at the forefront of concern, should be. Given that a human cannot know all the implications of an (arbitrary) set of axioms, how can such a formulation of knowledge be either correct or useful? Philosophy has many times explicitly confronted the proposition that knowledge is the logical closure of a set of axioms. It has seemed so obvious. Yet the proposal has always come to grief on the rocky question above. It is trivial to generate counterexamples that show a person cannot possibly know all that is implied. In a field where counterexamples are a primary method for making
progress, this has proved fatal and the proposal has little standing.11 Yet, the theory of knowledge being presented here embraces that the knowledge to be associated with a conjunction of logical expressions is its logical closure. How can that be?

11For example, both the beautiful formal treatment of knowledge by Hintikka [13] and the work in AI by Moore [23] continually insist that knowing P and knowing that P implies Q need not lead to knowing Q.

The answer is straightforward. The knowledge level is only an approximation, and a relatively poor one on many occasions-we called it radically incomplete. It is poor for predicting whether a person remembers a telephone number just looked up. It is poor for predicting what a person knows given a new set of mathematical axioms with only a short time to study them. And so on, through whole meadows of counterexamples. Equally, it is a good approximation in many other cases. It is good for predicting that a person can find his way to the bedroom of his own house, for predicting that a person who knows arithmetic will be able to add a column of numbers. And so on, through much of what is called common sense knowledge.

This move to appeal to approximation (as the philosophers are wont to call such proposals) seems weak, because declaring something an approximation seems a general purpose dodge, applicable to dissolving every difficulty, hence clearly dispelling none. However, an essential part of the current proposal is the existence of the second level of approximation, namely, the symbol level. We now have models of the symbol level that describe how information processing agents arrive at actions by means of search-search of problem spaces and search of global data bases-and how they map structures that are representations of given knowledge to structures that are representations of task-state knowledge in order to create representations of solutions. The discovery, development and elaboration of this second level of approximation to describing and predicting the behavior of an intelligent agent has been what AI has been all about in the quarter century of its existence. In sum, given a theory of the symbol level, we can finally see that the knowledge level is just about what seemed obvious all along.

Returning to the search for solutions to the functional equation expressed by the principle of rationality, logics are only one candidate. They are in no way privileged.12 There are many other systems (i.e., combinations of symbol structures and processes) that can yield useful solutions. To be useful an observer need only use it to make predictions according to the principle of rationality. If we consider the problem from the point of view of agents, rather than of AI scientists, then the fundamental principle of the observer must be:

To ascribe to an agent the structure S is to ascribe whatever the observer can know from structure S.

12Though the contrary might seem the case. The principle of rationality might seem to presuppose logic, or at least its formulation might. Untangling this knot requires more care than can be spent here. Note only that it is we, the observer, who formulates the principle of rationality (in some representation). Agents only use it; indeed, only approximate it.

Theories, models, pictures, physical views, remembered scenes, linguistic texts and utterances, etc., etc.: all these are entirely appropriate structures for ascribing knowledge. They are appropriate because for an observer to have these structures is also for it to have means (i.e., the symbolic processes) for extracting knowledge from them.

Not only are logics not privileged, there are difficulties with them. One, already mentioned in connection with the resolution controversy, is processing inefficiency. Another is the problem of contradiction. From an inconsistent conjunction of propositions, any proposition follows. Further, in general, the contradiction cannot be detected or extirpated by any finite amount of effort. One response to this latter difficulty takes the form of developing new logics or logic-like representations, such as non-monotonic logics [5]. Another is to treat the logic as only an approximation, with a limited scope, embedding its use in a larger symbol processing system. In any event, the existence of difficulties does not distinguish logics from other candidate representations. It just makes them one of the crowd.

When we turn from the agents themselves and consider representations from the viewpoint of the AI scientist, many candidates become problems rather than solutions. If the data structures alone are known (the pictures, language expressions, and so on), and not the procedures used to generate the knowledge from them, they cannot be used in engineered AI systems or in theoretical analyses of knowledge. The difficulty follows simply from the fact that we do not know the entire representational system.13 As a result, representations, such as natural language, speech and vision, become arenas for research, not tools to be used by the AI scientist to characterize the knowledge of agents under study. Here, logics have the virtue that the entire system of data structure and processes (i.e., rules of inference) has been externalized and is well understood.

13It is irrelevant that each of us, as agents rather than AI scientists, happens to embody some of the requisite procedures, so long as we, as AI scientists, cannot get them externalized appropriately.

The development of AI is the story of constructing many other systems that are not logics, but can be used as representations of knowledge. Furthermore, the development of mathematics, science and technology is in part the story of bringing representational structures to a degree of explicitness very close to what is needed by AI. Often only relatively modest effort has been needed to
extend such representations to be useful for AI systems. Good examples are algebraic mathematics and chemical notations.
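As a deliberately toy illustration of the preceding discussion (an editor's sketch, not the paper's; real bodies of knowledge do not have finite closures), ascribing a set of expressions {Li} ascribes their closure, which a symbol-level access process generates on demand rather than storing in extension:

    # Toy illustration: knowledge ascribed as the deductive closure of a set of
    # expressions {Li}, here propositional Horn clauses (premises, conclusion).
    # The "theorem prover" is a simple forward chainer; for a real agent the
    # closure is unbounded and can only be generated as needed.

    def closure(facts, rules):
        known = set(facts)
        changed = True
        while changed:
            changed = False
            for premises, conclusion in rules:
                if conclusion not in known and all(p in known for p in premises):
                    known.add(conclusion)
                    changed = True
        return known

    facts = {"knows arithmetic", "has column of numbers"}
    rules = [({"knows arithmetic", "has column of numbers"}, "can add the column")]

    print("can add the column" in closure(facts, rules))   # True: the ascribed knowledge includes it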
4.5. Relation of the knowledge level to the symbol level

Making clear the nature of knowledge has already required discussing the central core of the reduction of the knowledge level to the symbol level. Hence the matter can be summarized briefly. Table 2 lists the aspects of the knowledge level and shows to what each corresponds at the symbol level.

Starting at the top, the agent corresponds to the total system at the symbol level. Next, the actions correspond to systems that include external transducers, both input and output. An arbitrary amount of programmed system can surround and integrate the operations that are the actual primitive transducers at the symbolic level. As we have seen, knowledge, the medium at the knowledge level, corresponds at the symbol level to data structures plus the processes that extract from these structures the knowledge they contain. To 'extract knowledge' is to participate with other symbolic processes in executing actions, just to the extent that the knowledge leads to the selection of these actions at the knowledge level. The total body of knowledge corresponds, then, to the sum total of the memory structure devoted to such data and processes.

A goal is simply more knowledge, hence corresponds at the symbol level to data structures and processes, just as does any body of knowledge. Three sorts of knowledge are involved: knowledge of the desired state of affairs; knowledge that the state of affairs is desired; and knowledge of associated concerns, such as useful methods, prior attempts to attain the goals, etc. It is of little moment whether these latter items are taken as part of the goal or as part of the body of knowledge of the world.

The principle of rationality corresponds at the symbol level to the processes (and associated data structures) that attempt to carry out problem solving to attain the agent's goals. There is more to the total system than just the separate symbol systems that correspond to the various bodies of knowledge.
TABLE 2. Reduction of the knowledge level to the symbol level

Knowledge level                Symbol level
Agent                          Total symbol system
Actions                        Symbol systems with transducers
Knowledge                      Symbol structure plus its processes
Goals                          (Knowledge of goals)
Principle of rationality       Total problem solving process
As repeatedly emphasized, the agent cannot generate at any instant all the knowledge that it has encoded in its symbol systems that correspond to its bodies of knowledge. It must generate and bring to bear the knowledge that, in fact, is relevant to its goals in the current environment. At the knowledge level, the principle of rationality and knowledge present a seamless surface: a uniform principle to be applied uniformly to the content of what is known (i.e., to whatever is the case about the world). There is no reason to expect this to carry down seamlessly to the symbolic level, with (say) separate subsystems for each aspect and a uniform encoding of knowledge. Decomposition must occur, of course, but the separation into processes and data structures is entirely a creation of the symbolic level, which is governed by processing and encoding considerations that have no existence at the knowledge level. The interface between the problem solving processes and the knowledge extraction processes is as diverse as the potential ways of designing intelligent systems. A look at existing AI programs will give some idea of the diversity, though no doubt we still are only at the beginnings of exploration of potential mechanisms. In sum, the seamless surface at the knowledge level is most likely a pastiche of interlocked intricate structures when seen from below, much like the smooth skin of a baby when seen under a microscope.

The theory of the knowledge level provides a definition of representation, namely, a symbol system that encodes a body of knowledge. It does not provide a theory of representation, which properly exists only at the symbol level and which tells how to create representations with particular properties, how to analyse their efficiency, etc. It does suggest that a useful way of thinking about representation is according to the slogan equation

    Representation = Knowledge + Access
The representation consists of a system for providing access to a body of knowledge, i.e., to the knowledge in a form that can be used to make selections of actions in the service of goals. The access function is not a simple generator, producing one knowledge element (i.e., means-end relation) after another. Rather, it is a system for delivering the knowledge encoded in a data structure that can be used by the larger system that represents the knowledge about goals, actions, etc. Access is a computational process, hence has associated costs. Thus, a representation imposes a profile of computational costs on delivering different parts of the total knowledge encoded in the representation.
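Read as a statement about symbol-level systems, the slogan can be pictured as below (an illustrative sketch of the editor's; the class, its linear-scan access process, and its cost accounting are assumptions made for the example, not anything proposed in the text): a data structure coupled to an access process whose computational cost profile depends on the encoding.

    # Illustrative sketch of Representation = Knowledge + Access: a data structure
    # plus an access process that delivers knowledge elements (state, goal -> action)
    # on demand, at a computational cost determined by the encoding.

    class Representation:
        def __init__(self, triples):
            self._data = list(triples)   # the symbol-level data structure
            self.cost = 0                # accumulated access cost, in probe counts

        def access(self, state, goal):
            """Deliver a knowledge element: an action selected for goal in state."""
            for s, g, action in self._data:   # a linear scan: one possible cost profile
                self.cost += 1
                if (s, g) == (state, goal):
                    return action
            return None                       # this encoding cannot deliver the element

    rep = Representation([("at door", "enter", "open door"),
                          ("hungry", "be fed", "eat")])
    print(rep.access("hungry", "be fed"), rep.cost)   # -> eat 2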
Mixed systems

The classic relationship between computer system levels is that, once a level is adopted, useful analysis and synthesis can proceed exclusively in terms of that level, with only side studies on how lower level behavior might show as errors or lower level constraints might condition the types of structures that are
efficient. The radical incompleteness of the knowledge level leads to a different relationship between it and the symbol level. Mixed systems are often considered, even becoming the norm on occasions.

One way this happens is in the distal prediction of human behavior, in which it often pays to mix a few processing notions along with the pure knowledge considerations. This is what is often called man-in-the-street psychology. We recognize that forgetting is possible, and so we do not assume that knowledge once obtained is forever. We know that inferences are available only if the person thinks it through, so we don't assume that knowing X means knowing all the remote consequences of X, though we have no good way of determining exactly what inferences will be known. We know that people can only do a little processing in a short time, or do less processing when under stress. Having only crude models of the processing at the symbol level, these mixed models are neither very tidy nor uniformly effective. The major tool is the use of self as simulator for the agent. But mixed models are often better than pure knowledge-level models.

Another important case of mixed models-especially for AI and computer science-is the use of the knowledge level to characterize components in a symbol-level description of a system. Memories are described as having a given body of knowledge and messages are described as transferring knowledge from one memory to another. This carries all the way to design philosophies that work in terms of a 'society of minds' [22] and to executive systems that oversee and analyse the operation of other internal processing. The utility of working with such mixed-level systems is evident for both design and analysis. For design, it permits specifications of the behavior of components to be given prior to specifying their internal structure. For analysis, it lets complex behavior be summarized in terms of the external environments of the components (which comprises the other internal parts of the system).

Describing a component at the knowledge level treats it as an intelligent agent. The danger in this is well known; it is called the problem of the homunculus. If an actual system is produced, all knowledge-level descriptions must ultimately be replaced by symbol-level descriptions and there is no problem. As such replacement proceeds, the internal structure of the components becomes simpler, thus moving further away from a structure that could possibly realize an intelligent agent. Thus, the interesting question is how the knowledge-level description can be a good approximation even though the subsystem being so described is quite simple. The answer turns on the limited nature of the goals and environments of such agents, whose specification is also under the control of the system designer.

5. Consequences and Relationships
We have set out the theory in sufficient outline to express its general nature. Here follows some discussion of its consequences, its relations to other aspects of AI, and its relations to conceptions in other fields.
5.1. The practice of AI
At the beginning I claimed that this theory of knowledge derived from our practice in AI. Some of its formal aspects clearly do not, especially positing a distinct systems level, which splits apart the symbol and the knowledge level. Thus, it is worth exploring the matter of our practice briefly.

When we say, as we often do in explaining an action of a program, that "the program knows K" (e.g., "the theorem prover knows the distributive law"), we mean that there is some structure in the program that we view as holding K and also that this structure was involved in selecting the action in exactly the manner claimed by the principle of rationality, namely, the encoding of that knowledge is related to the goal the action is to serve, etc.

More revealing, when we talk, as we often do during the design of a program, about a proposed data structure having or holding knowledge K (e.g., "this table holds the knowledge of co-articulation effects"), we imply that some processes must exist that take that data structure as input and make selections of which we can say, "The program did action A because it knew K". Those processes may not be known to the system's designer yet, but the belief exists that they can be found. They may not be usable when found, because they take too much time or space. Such considerations do not affect whether "knowledge K is there", only whether it can be extracted usefully. Thus, our notion of knowledge has precisely a competence-like character. Indeed, one of its main uses is to let us talk about what can be done, before we have found out how to do it.

Most revealingly of all, perhaps, when we say, as we often do, that a program "can't do action A, because it doesn't have knowledge K", we mean that no amount of processing by the processes now in the program on the data structures now in the program can yield the selection of A. (E.g., "This chess program can't avoid the tie, because it doesn't know about repetition of positions".) Such a statement presupposes that the principle of rationality would lead to A given K, and no way to get A selected other than having K satisfies the principle of rationality. If in fact some rearrangement of the processing did lead to selecting A, then additional knowledge can be expected to have been imported, e.g., from the mind of the programmer who did the rearranging (though accident can confound expectation on occasion, of course).

The Hearsay II speech understanding system [11] provides a concrete example of how the concept of knowledge is used. Hearsay has helped to make widespread a notion of a system composed of numerous sources of knowledge, each of which is associated with a separate module of the program (called, naturally enough, a knowledge source), and all of which act cooperatively and concurrently to attain a solution. What makes this idea so attractive-indeed, seductive-is that such a system seems a close approximation to a system that operates purely at the knowledge level, i.e., purely in terms of simply having knowledge and bringing it to bear. It permits design by identifying first a
source of knowledge in the abstract-e.g., syntax, phonology, co-articulation, etc.-and then designing representations and processes that encode that knowledge and provide for its extraction against a common representation of the task in the blackboard (the working memory).

This theory of knowledge has not arisen sui generis from unarticulated practice. On the contrary, the fundamental insights on which the theory draws have been well articulated and are part of existing theoretical notions, not only in AI but well beyond, in psychology and the social sciences. That an adaptive organism, by the very act of adapting, conceals its internal structure from view and makes its behavior solely a function of the task environment, has been a major theme in the artificial sciences. In the work of my colleague, Herb Simon, it stretches back to the forties [36], with a concern for the nature of administrative man versus economic man. In our book on Human Problem Solving [28] we devoted an entire chapter to an analysis of the task environment, which turned on precisely this point. And Herb Simon devoted his recent talk at the formative meeting of the Cognitive Science Society to a review of this same topic [37], to which I refer you for a wider discussion and references. In sum, the present theory is to be seen as a refinement of this existing view of adaptive systems, not as a new theory.

5.2. Contributions to the knowledge level vs. the symbol level
By distinguishing sharply between the knowledge level and the symbol level the theory implies an equally sharp distinction between the knowledge required to solve a problem and the processing required to bring that knowledge to bear in real time and real space. Contributions to AI may be of either flavor, i.e., either to the knowledge level or to the symbol level. Both aspects always occur in particular studies, because experimentation always occurs in total AI systems. But looking at the major contribution to science, it is usually toward one pole or the other; only rarely is a piece of research innovative enough to make both types of contributions. For instance, the major thrust of the work on MYCIN [35] was fundamentally to the knowledge level in capturing the knowledge used by medical experts. The processing, an adaptation of well understood notions of backward chaining, played a much smaller role. Similarly, the SNAC procedure used by Berliner [3] to improve radically his Backgammon program was primarily a contribution to our understanding of the symbol level, since it discovered (and ameliorated) the effects of discontinuities in global evaluation functions patched together from many local ones. It did not add to our formulation of our knowledge about Backgammon.

This proposed separation immediately recalls the well-known distinction of John McCarthy and Pat Hayes [20] between epistemological adequacy and heuristic adequacy. Indeed, they would seem likely to be steadfast supporters of the theory presented here, perhaps even claiming much of it to be at most a
refinement on their own position. I am not completely against such an interpretation, for I find considerable merit in their position. In fact, a recent essay by McCarthy on ascribing mental qualities to machines [18] makes many points similar to those of the present paper (though without embracing the notion of a new computer systems level).

However, all is not quite so simple. I once publicly put to John McCarthy [17] the proposition that the role of logic was as a tool for the analysis of knowledge, not for reasoning by intelligent agents, and he denied it flat out. The matter is worth exploring briefly. It appears to be a prime plank in McCarthy's research program that the appropriate representation of knowledge is with a logic. The use of other forms plays little role in his analyses. Thus, the fundamental question of epistemological adequacy, namely, whether there exists an adequate explicit representation of some knowledge, is conflated with how to represent the knowledge in a logic. As observed earlier, there are many other forms in which knowledge can be represented, even setting entirely to one side forms whose semantics are not yet well enough understood scientifically, such as natural language and visual images.

Let us consider a simple example [17, p. 987], shown in Fig. 5, one of several that McCarthy has used to epitomize various problems in representation within logic.

FIG. 5. Example from McCarthy [17].
When program is told:
  "Mary has the same telephone number as Mike."
  "Pat knows Mike's telephone number."
  "Pat dialed Mike's telephone number."
Program should assert:
  "Pat dialed Mary's telephone number."
  "I do not know if Pat knows Mary's telephone number."

This one is built to show difficulties in transparency of reference. In obvious formulations, having identified Mary's and Mike's telephone numbers, it is difficult to retain the distinction between what Pat knows and what is true in fact. However, the difficulty simply does not exist if the situation is represented by an appropriate model. Let Pat and the program be modeled as agents who have knowledge, with the knowledge localized inside the agent and associated with definite data structures, with appropriate input and output actions, etc., i.e., a simple version of an information processing system. Then, there is no difficulty in keeping knowledges straight, as well as what can be and cannot be inferred.

The example exhibits a couple of wrinkles, but they do not cause conceptual problems. The program must model itself, if it is to talk about what it doesn't know. That is, it must circumscribe the data structures that are the source of its knowledge about Pat. Once done, however, its problem of ascertaining its own
knowledge is no different from its problem of ascertaining Pat's knowledge. Also, one of its utterances depends on whether from its representation of the knowledge of Pat it can infer some other knowledge. But the difficulties of deciding whether a fact cannot be inferred from a given finite base, seem no different from deciding any issue of failure to infer from given premises. To be sure, my treatment here is a little cavalier, given existing analyses. It has been pointed out that difficulties emerge in the model-based solution, if it is known only that either Pat knows Mike's telephone number or his address; and that even worse difficulties ensue from knowing that Pat doesn't know Mike's telephone number (e.g., see [23, Chapter 2]). Without pretending to an adequate discussion, it does not seem to me that the difficulties here are other than those in dealing with knowing only that a box is painted red or blue inside, or knowing that your hen will not lay a golden egg. In both cases, the representation of this knowledge by the observer must be in terms of descrip tions of the model, not of an instance of the model. But knowledge does not pose special difficulties. 14 As one more example of differences between contributions to the knowledge level and the symbol level, consider a well-known albeit somewhat controver sial case, namely, Schank's work on conceptual dependency structures [33] . I believe its main contribution to AI has been at the knowledge level. That such a view is not completely obvious, can be seen by the contrary stance of Pat Hayes in his already mentioned piece, "In defense of logic" [12] . Though not much at variance from the present paper in its basic theme, his paper exhibits Fig. 6, classifies it as pretend-it's-English, and argues that there is no effective way to know its meaning. On the contrary, I claim that conceptual dependency structures made a real contribution to our understanding at the knowledge level. The content of this contribution lies in the model indicated in Fig. 7, taken rather directly from [34] . The major claim of conceptual dependency is that the simplest causal model is adequate to give first approximation semantics of a certain fragment of natural language. This model, though really only sketched in the figure, is not in itself very problematical, though as with all formalization it is an important act to reduce it to a finite apparatus. There is a world of states filled with objects that have attributes and whose dynamics occur through actions. Some objects, called actors, have a mentality, which is to say they have representations in a long term memory and are capable of mental acts. The elementary dynamics of this world, in terms of what can produce or inhibit what, are indicated in the figure. 14Indeed, two additional recent attempts on this problem, though cast as logical systems, seem essentially to adopt a model view. One is by McCarthy himself
[19], who introduces the concept of a number as distinct from the number. Concepts seem just to be a way to get data structures to talk about. The other attempt [14] uses expressions in two languages, again with what seems the same effect.
FIG. 6. A conceptual dependency diagram from Pat Hayes [12] (its elements include IS-A:FISH, PTRANS, Cause, Impact, and Direction = Towards (Mary)).
The claim in full is that if sentences are mapped into a model of this sort in the obvious way, an interpretation of these sentences is obtained that captures a good deal of its meaning. The program Margie [33] provided the initial demonstration, using paraphrase and inference as devices to provide explicit evidence. The continued use of these mechanisms as part of the many programs that have followed Margie has added to the evidence implicitly, as it gradually has become part of the practice in AI.

Providing a simple model such as this constitutes a contribution to the knowledge level-to how to encode knowledge of the world in a representation. It is a quite general contribution, expressed in a way that makes it adaptable to a wide variety of intelligent systems.15 On the other hand, this work made relatively little contribution to the symbol level, i.e., to our notions of symbolic processing. The techniques that were used were essentially state of the art. This can be seen in the relative lack of emphasis or discussion of the internal mechanics of the program. For many of us, the meaning of conceptual dependency seemed undefined without a process that created conceptual dependency structures from sentences. Yet, when this was finally forthcoming (in Margie), there was nothing there except a large AI program containing the usual sorts of things, e.g. various ad hoc mechanisms within a familiar framework (a parser, etc.). What was missed was that the program was simply the implementation of the model in the obvious, i.e. straightforward, way. The program was not supposed to add significantly to the specification of the mapping. There would have been trouble if additions had been required, just as a computer program for partial differential equations is not supposed to add to the mathematics.

15This interpretation of conceptual dependency in terms of a model is my own; Schank and Abelson [34] prefer to cast it as a causal syntax. This latter may mildly obscure its true nature, for it seems to beg for a causal semantics as well.
World: states, actions, objects, attributes
Actors: objects with mentality (central processor plus long-term memory)
Cause:
  An act results in a state.
  A state enables an act.
  A state or act initiates a mental state.
  A mental act is the reason for a physical act.
  A state disables an act.

FIG. 7. Conceptual dependency model (after [34]).
5.3. Laying to rest the predicate calculus phobia
Let us review the theorem-proving controversy of the sixties. From all that has been said earlier, it is clear that the residue can be swept aside. Logic is the appropriate tool for analyzing the knowledge level, though it is often not the preferred representation to use for a given domain. Indeed, given a representation-e.g., a semantic net, a data structure for a chess position, a symbolic structure for some abstract problem space, a program, or whatever-to determine exactly what knowledge is in the representation and to characterize it requires the use of logic. Whatever the detailed differences represented in my discussion in the last section of the Pat and Mike example, the types of analysis being performed by McCarthy, Moore and Konolige (to mention only the names that arose there) are exactly appropriate. Just as talking of programmerless programming violates truth in packaging, so does talking of a non-logical analysis of knowledge.

Logic is of course a representation (actually, a family of them), and is therefore a candidate (along with its theorem-proving engine) for the representation to be used by an intelligent agent. Its use in such a role depends strongly on computational considerations, for the agent must gain access to very large amounts of encoded knowledge, swiftly and reliably. The lessons of the sixties taught us some things about the limitations of using logics for this role. However, these lessons do not touch the role of logic as the essential language of analysis at the knowledge level.

Let me apply this view to Nilsson's new textbook in AI [30]. Now, I am an admirer of Nilsson's book [27].16 It is the first attempt to transform textbooks in AI into the mold of basic texts in science and engineering. Nilsson's book uses the first-order predicate calculus as the lingua franca for presenting and discussing representation throughout the book. The theory developed here says that is just right; I think it is an important position for the book to have adopted. However, the book also introduces logic in intimate connection with resolution theorem proving, thus asserting to the reader that logic is to be seen as a representation for problem solving systems. That seems to me just wrong. Logic as a representation for problem solving rates only a minor theme in our textbooks. One consequence of the present theory of knowledge will be to help assign logic its proper role in AI.

16 Especially so, because Nils and I both set out at the same time to write textbooks of AI. His is now in print and I am silent about mine.

5.4. The relationship with philosophy
The present theory bears some close and curious relationships with philosophy, though only a few preliminary remarks are possible here. The nature of mind and the nature of knowledge have been classical concerns in philosophy, forming major continents of its geography. There has been increasing contact between philosophy and AI in the last decade, focussed primarily in the area of knowledge representation and natural language. Indeed, the respondents to the Brachman and Smith questionnaire, reflecting exactly this group, opined that philosophy was more relevant to AI than psychology.

Philosophy's concern with knowledge centers on the issue of certainty. When can knowledge be trusted? Does a person have privileged access to his subjective awareness, so his knowledge of it is infallible? This is ensconced in the distinction in philosophy between knowledge and belief, as indicated in the slogan phrase, knowledge is justified true belief. AI, taking all knowledge to be errorful, has seen fit to call all such systems knowledge systems. It uses the term belief only informally, when the lack of veridicality is paramount, as in political belief systems. From philosophy's standpoint, AI deals only in belief systems. Thus, the present theory of knowledge, sharing as it does AI's view of general indifference to the problems of absolute certainty, is simply inattentive to some central philosophical concerns.

An important connection appears to be with the notion of intentional systems. Starting in the work of Brentano [8], the notion was of a class of systems capable of having desires, expectations, etc.-all things that were about the external world. The major function of the formulation was to provide a way of distinguishing the physical from the mental, i.e., of providing a characterization of the mental. A key result of this analysis was to open an unbridgeable gap between the physical and the mental. Viewing a system as physical precluded being able to ascribe intentionality to it. The enterprise seems at opposite poles from work in AI, which is devoted precisely to realizing mental functions in physical systems.

In the hands of Daniel Dennett [10], a philosopher who has concerned himself rather deeply with AI, the doctrine of intentional systems has taken a form that corresponds closely to the notion of the knowledge level, as developed here. He takes pains to lay out an intentional stance, and to relate it to what he calls the subpersonal stance. His notion of stance is a system level, but ascribed entirely to the observer, i.e., to he who takes the stance.
The subpersonal stance corresponds to the symbolic or programming level, being illustrated repeatedly by Dennett with the gross flow diagrams of AI programs. The intentional stance corresponds to the knowledge level. In particular, Dennett takes the important step of jettisoning the major result-cum-assumption of the original doctrine, to wit, that the intentional is unalterably separated from the physical (i.e., subpersonal).

Dennett's formulation differs in many details from the present one. It does not mention knowledge at all, but focuses on intentions. It does not provide (in the papers I have seen) a technical analysis of intentions-one understands this class of systems more by intent than by characterization. It does not deal with the details of the reduction. It does not, as noted, assign reality to the different system levels, but keeps them in the eye of the beholder. Withal, there is little doubt that both Dennett and myself are reaching for the characterization of exactly the same class of systems. In particular, the role of rationality is central to both, and in the same way. A detailed examination of the relation of Dennett's theory to the present one is in order, though it cannot be accomplished here. However, it should at least be noted that the knowledge level does not itself explain the notion of aboutness; rather, it assumes it. The explanation occurs partly at the symbol level, in terms of what it means for symbols to designate external situations, and partly at lower levels, in terms of the mechanisms that permit designation to actually occur [26].

5.5. A generative space of rational systems
My talk on physical symbol systems to the La Jolla Cognitive Science Conference last year [26] employed a frame story that decomposed the attempt to understand the nature of mind into a large number of constraints-universal flexibility of response, use of symbols, use of language, development, real-time response, and so on. The importance of physical symbol systems was underscored by its being a single class of systems that embodied two distinct constraints, universality of functional response and symbolic behavior, and was intimately tied to a third, goal-directed behavior, as indicated by the experience in AI.

An additional indicator that AI is on the right track to understanding mind came from the notion of a generative class of systems, in analogy with the use of the term in generate and test. Designing a system is a problem precisely because there is in general no way simply to generate a system with specified properties. Always the designer must back off to some class of encompassing systems that can be generated, and then test (intelligently) whether generated candidate systems have desirable properties. The game, as we all know, is to embody as many constraints as possible in the generator, leaving as few as possible to be dealt with by testing.
Now, the remarkable property about universal, symbolic systems is that they are generative in this sense. We have fashioned a technology that lets us take for granted that whatever system we construct will be universal and will have full symbolic capability. Anyone who works in Lisp, or other similar systems, gets these constraints satisfied automatically. Effort is then devoted to contriving designs to satisfy additional constraints-real time, or learning, or whatnot. For most constraints we do not have generative classes of systems-for real time, for development, for goal-directedness, etc. There is no way to explore spaces of systems that automatically satisfy these constraints, looking for instances with additional important properties.

An interesting question is whether the present theory offers some hope of building a generative class of rational goal-directed systems. It would perforce also need to be universal and symbolic, but that can be taken for granted. It seems to me possible to glimpse what such a class might be like, though the idea is fairly speculative. First, implicit in all that has gone before, the class of rational goal-directed systems is the class of systems that has a knowledge level. Second, though systems only approximate a knowledge level description, they are rational systems precisely to the extent they do. Thus, the design form for all intelligent systems is in terms of the body of knowledge that they contain and the approximation they provide to being a system describable at the knowledge level. If the technology of symbol systems can be developed in this factored form, then it may be possible to remain always within the domain of rational systems, while exploring variants that meet the additional constraints of real time, learnability, and so forth.

6. Conclusion
I have presented a theory of the nature of knowledge and representation. Knowledge is the medium of a systems level that resides immediately above the symbol level. A representation is the structure at the symbol level that realizes knowledge, i.e., it is the reduction of knowledge to the next lower computer systems level. The nature of the approximation is such that the representation at the symbol level can be seen as knowledge plus the access structure to that knowledge.

This new level fits into the existing concept of computer systems levels. However, it has several surprising features: (1) a complete absence of structure, as characterized at the configuration level; (2) no specification of processing mechanisms, only a global principle to be satisfied by the system behavior; (3) a radical degree of approximation that does not guarantee a deterministic machine; and (4) a medium that is not realized in physical space in a passive way, but only in an active process of selection.

Both little and much flow from this theory.
This notion of knowledge and representation corresponds to how we in AI already use these terms in our (evolving) everyday scientific practice. Also, it is a refinement of some fundamental features of adaptive systems that have been well articulated, and it has nothing that is incompatible with foundation work in logic. To this extent not much change will occur, especially in the short run. We already have assimilated these notions and use them instinctively. In this respect, my role in this paper is merely that of reporter. However, as I emphasized at the beginning, I take this close association with current practice as a source of strength, not an indication that the theory is not worthwhile because it is not novel enough. Observing our own practice-that is, seeing what the computer implicitly tells us about the nature of intelligence as we struggle to synthesize intelligent systems-is a fundamental source of scientific knowledge for us. It must be used wisely and with acumen, but no other source of knowledge comes close to it in value.

Making the theory explicit will have many consequences. I have tried to point out some of these, ranging from fundamental issues to how some of my colleagues should do their business. Reiteration of that entire list would take too long. Let me just emphasize those that seem most important, in my current view.

• Knowledge is that which makes the principle of rationality work as a law of behavior. Thus, knowledge and rationality are intimately tied together.

• Splitting what was a single level (symbol) into two (knowledge plus symbol) has immense long-term implications for the development of AI. It permits each of the separate aspects to be adequately developed technically.

• Knowledge is not representable by a structure at the symbol level. It requires both structures and processes. Knowledge remains forever abstract and can never be actually in hand.

• Knowledge is a radical approximation, failing on many occasions to be an adequate model of an agent. It must be coupled with some symbol level representation to make a viable view.

• Logic is fundamentally a tool for analysis at the knowledge level. Logical formalisms with theorem proving can certainly be used as a representation in an intelligent agent, but that is an entirely separate issue (though one we already know much about, thanks to the investigations in AI and mechanical mathematics over the last fifteen years).

• The separate knowledge level may lead to constructing a generative class of rational systems, although this is still mostly hope.
As stated at the beginning, I have no illusions that yet one more view on the nature of knowledge and representation will serve to quiet the cacophony revealed by the noble surveying efforts of Brachman and Smith. Indeed, amid the din, it may not even be possible to hear another note being played. However, I know of no other way to proceed.
Of greater concern is how to determine whether this theory of knowledge is correct in its essentials, how to find the bugs in it, how to shake them out, and how to turn it to technical use.

ACKNOWLEDGMENT
I am grateful for extensive comments on an earlier draft provided by Jon Bentley, Danny Bobrow, H.T. Kung, John McCarthy, John McDermott, Greg Harris, Zenon Pylyshyn, Mike Rychener and Herbert Simon. They all tried in their several (not necessarily compatible) ways to keep me from error.

REFERENCES

1. Bell, C.G. and Newell, A., Computer Structures: Readings and Examples (McGraw-Hill, New York, 1971).
2. Bell, C.G., Grason, J. and Newell, A., Designing Computers and Digital Systems Using PDP16 Register Transfer Modules (Digital Press, Maynard, MA, 1972).
3. Berliner, H.J., Backgammon computer program beats world champion, Artificial Intelligence 14 (1980) 205-220.
4. Bobrow, D., A panel on knowledge representation, Proc. Fifth Internat. Joint Conference on Artificial Intelligence (Computer Science Department, Carnegie-Mellon University, Pittsburgh, PA, 1977).
5. Bobrow, D.G., Ed., Special issue on non-monotonic logic, Artificial Intelligence 13 (1980) 1-174.
6. Bobrow, D. and Raphael, B., New programming languages for AI research, Comput. Surveys 6 (1974) 153-174.
7. Brachman, R.J. and Smith, B.C., Special issue on knowledge representation, SIGART Newsletter 70 (1980) 1-138.
8. Brentano, F., Psychology from an Empirical Standpoint (Duncker and Humblot, Leipzig, 1874). Also: (Humanities Press, New York, 1973).
9. Chomsky, N., Knowledge of language, in: Gunderson, K., Ed., Language, Mind and Knowledge (University of Minnesota Press, Minneapolis, 1975).
10. Dennett, D.C., Brainstorms: Philosophical Essays on Mind and Psychology (Bradford, Montgomery, VT, 1978).
11. Erman, L.D., Hayes-Roth, F., Lesser, V.R. and Reddy, D.R., The Hearsay-II speech-understanding system: Integrating knowledge to resolve uncertainty, Comput. Surveys 12 (2) (1980) 213-253.
12. Hayes, P., In defence of logic, Proc. Fifth Internat. Joint Conference on Artificial Intelligence (Computer Science Department, Carnegie-Mellon University, Pittsburgh, PA, 1977).
13. Hintikka, J., Knowledge and Belief (Cornell University Press, Ithaca, NY, 1962).
14. Konolige, K., A first-order formalization of knowledge and action for a multiagent planning system, Tech. Note 232, SRI International, Dec. 1980.
15. Loveland, D.W., Automated Theorem Proving: A Logical Basis (North-Holland, Amsterdam, 1978).
16. Luce, R.D. and Raiffa, H., Games and Decisions (Wiley, New York, 1957).
17. McCarthy, J., Predicate calculus, Fifth Internat. Joint Conference on Artificial Intelligence (Computer Science Department, Carnegie-Mellon University, Pittsburgh, PA, 1977).
18. McCarthy, J., Ascribing mental qualities to machines, in: Ringle, M., Ed., Philosophical Perspectives in Artificial Intelligence (Harvester, New York, 1979).
19. McCarthy, J., First order theories of individual concepts and propositions, in: Michie, D., Ed., Machine Intelligence 9 (Edinburgh University Press, Edinburgh, 1979).
20. McCarthy, J. and Hayes, P.J., Some philosophical problems from the standpoint of artificial intelligence, in: Meltzer, B. and Michie, D., Eds., Machine Intelligence 4 (Edinburgh University Press, Edinburgh, 1969).
21. Mead, G.H., Mind, Self and Society from the Standpoint of a Social Behaviorist (University of Chicago Press, Chicago, 1934).
22. Minsky, M., Plain talk about neurodevelopmental epistemology, in: Proc. Fifth Internat. Joint Conference on Artificial Intelligence (Computer Science Department, Carnegie-Mellon University, Pittsburgh, PA, 1977).
23. Moore, R.C., Reasoning about knowledge and action, Tech. Note 191, SRI International, Oct. 1980.
24. Newell, A., Limitations of the current stock of ideas for problem solving, in: Kent, A. and Taulbee, O., Eds., Conference on Electronic Information Handling (Spartan, Washington, DC, 1965).
25. Newell, A., AAAI President's message, AI Magazine 1 (1980) 1-4.
26. Newell, A., Physical symbol systems, Cognitive Sci. 4 (1980) 135-183.
27. Newell, A., Review of Nils Nilsson, Principles of Artificial Intelligence, Contemporary Psychology 26 (1981) 50-51.
28. Newell, A. and Simon, H.A., Human Problem Solving (Prentice-Hall, Englewood Cliffs, NJ, 1972).
29. Newell, A. and Simon, H.A., Computer science as empirical inquiry: Symbols and search, Comm. ACM 19 (3) (1976) 113-126.
30. Nilsson, N., Principles of Artificial Intelligence (Tioga, Palo Alto, CA, 1980).
32. Robinson, J.A., A machine-oriented logic based on the resolution principle, J. ACM 12 (1965) 23-41.
33. Schank, R.C., Conceptual Information Processing (North-Holland, Amsterdam, 1975).
34. Schank, R. and Abelson, R., Scripts, Plans, Goals and Understanding (Erlbaum, Hillsdale, NJ, 1977).
35. Shortliffe, E.H., Computer-based Medical Consultations: MYCIN (American Elsevier, New York, 1976).
36. Simon, H.A., Administrative Behavior (Macmillan, New York, 1947).
37. Simon, H.A., Cognitive science: The newest of the artificial sciences, Cognitive Sci. 4 (1980) 33-46.
38. Stockton, F.R., The lady or the tiger?, in: A Chosen Few: Short Stories (Charles Scribner's Sons, New York, 1895).
39. Von Neumann, J. and Morgenstern, O., The Theory of Games and Economic Behavior (Princeton University Press, Princeton, NJ, 1947).
CHAPTER 5
Learning by Chunking: A Production-System Model of Practice

P. S. Rosenbloom and A. Newell, Carnegie Mellon University
Performance improves with practice. More precisely, the time to perform a task decreases as a power-law function of the number of times the task has been performed. This basic law-known as the power law of practice or the log-log linear learning law-has been known since Snoddy (1926).1 While this law was originally recognized in the domain of motor skills, it has recently become clear that it holds over the full range of human tasks (Newell and Rosenbloom 1981). This includes both purely perceptual tasks such as target detection (Neisser, Novick, and Lazar 1963) and purely cognitive tasks such as supplying justifications for geometric proofs (Neves and Anderson 1981) or playing a game of solitaire (Newell and Rosenbloom 1981).

The ubiquity of the power law of practice argues for the presence of a single common underlying mechanism. The chunking theory of learning (Newell and Rosenbloom 1981) proposes that chunking (Miller 1956) is this common mechanism-a concept already implicated in many aspects of human behavior (Bower and Winzenz 1969; Johnson 1972; DeGroot 1975; Chase and Simon 1973). Newell and Rosenbloom (1981) established the plausibility of the theory by showing that a model based on chunking is capable of producing log-log linear practice curves.2

In its present form, the chunking theory of learning is a macro theory: it postulates the outline of a learning mechanism and predicts the global improvements in task performance. This chapter reports on recent work on the chunking theory and its interaction with production-system architectures (Newell 1973).3 Our goals are fourfold: (1) fill out the details of the chunking theory; (2) show that it can form the basis of a production-system learning mechanism; (3) show that the full model produces power-law practice curves; and (4) understand the implications of the theory for production-system architectures. The approach we take is to implement and analyze a production-system model of the chunking theory in the context of a specific task-a 1,023-choice reaction-time task (Seibel 1963). The choice of task should not be critical because the chunking theory claims that the same mechanism underlies improvements on all tasks. Thus, the model, as implemented for this task, carries with it an implicit claim to generality, although the issue of generality will not be addressed.
In the remainder of this chapter we describe and analyze the task and the model. In section 5.1 we lay the groundwork by briefly reviewing the highlights of the power law of practice and the chunking theory of learning. In section 5.2 the task is described. We concentrate our efforts on investigating the control structure of task performance through the analysis of an informal experiment. In section 5.3 we derive some constraints on the form of the model. Sections 5.4, 5.5, and 5.6 describe the three components of the model: (1) the Xaps2 production-system architecture; (2) the initial performance model for the task; and (3) the chunking mechanism. Section 5.7 gives some results generated by running the complete model on a sample sequence of experimental trials. The model is too costly to run on long sequences of trials, so in addition to the simulation model, we present results from an extensive simulation of the simulation (a meta-simulation). We pay particular attention to establishing that the model does produce power-law practice curves. Finally, section 5.8 contains some concluding remarks.

5.1 Previous Work
The groundwork for this research was laid in Newell and Rosenbloom (1981). That paper primarily contains an analysis and evaluation of the empirical power law of practice, analyses of existing models of practice, and a presentation of the chunking theory of learning. Three components of that work are crucial for the remainder of this chapter and are summarized in this section. Included in this summary are some recent minor elaborations on that work.

The Structure of Task Environments
In experiments on practice subjects are monitored as they progress through a (long) sequence of trials. On each trial the subject is presented with a single task to be performed. In some experiments the task is ostensibly identical on all trials; for example, Moran (1980) had subjects repeatedly perform the same set of edits on a single sentence with a computer text editor. In other experiments the task varies across trials; for example, Seibel (1963) had subjects respond to different combinations of lights on different trials. In either case the task environment is defined to be the ensemble of tasks with which the subject must deal.
Typical task environments have a combinatorial structure (though other task structures are possible); they can be thought of as being composed from a set of elements that can vary with respect to attributes, locations, relations to other elements, and so forth. Each distinct task corresponds to one possible assignment of values to the elements. This structure plays an important role in determining the nature of the practice curves produced by the chunking theory.
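To make the combinatorial structure concrete for Seibel's task, here is a minimal sketch (Python is used purely for illustration; it plays no role in the model). It assumes the standard arrangement of ten lights, each independently On or Off, with the all-Off display excluded, which yields the 1,023 distinct trials referred to in the task's name.

```python
from itertools import product

# Each task is one assignment of On/Off to the ten light positions.
# Excluding the all-Off display leaves 2**10 - 1 = 1,023 distinct tasks.
tasks = [combo for combo in product((0, 1), repeat=10) if any(combo)]
assert len(tasks) == 1023

# A single trial presents one such combination; list the On positions of one of them.
example = tasks[0]
on_positions = [i for i, lit in enumerate(example) if lit]
print(len(tasks), on_positions)
```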
The Power Law of Practice

Practice curves are generated by plotting task performance against trial number. This cannot be done without assuming some specific measure of performance. There are many possibilities for such a measure, including such things as quantity produced per unit time and number of errors per trial. The power law of practice is defined in terms of the time to perform the task on a trial. It states that the time to perform the task (T) is a power-law function of the trial number (N):

T = B N^(-a).    (5.1)

If this equation is transformed by taking the logarithm of both sides, it becomes clear why power-law functions plot as straight lines on log-log paper:

log(T) = log(B) + (-a) log(N).    (5.2)
Figure 5.1 shows a practice curve from a 1,023-choice reaction-time task (Seibel 1963), plotted on log-log paper. Each data point represents the mean reaction time over a block of 1,023 trials. The curve is linear over much of its range but has deviations at its two ends. These deviations can be removed by using a four-parameter generalized power-law function. One of the two new parameters (A) takes into account that the asymptote of learning is unlikely to be zero. In general, there is a nonzero minimum bound on performance time-determined by basic physiological and/or device limitations-if, for example, the subject must operate a machine. The other added parameter (E) is required because power laws are not translation invariant. Practice occurring before the official beginning of the experiment-even if it consists only of transfer of training from everyday experience-will alter the shape of the curve, unless the effect is explicitly allowed for by the inclusion of this parameter. Augmenting the power-law function by these two parameters yields the following generalized function:

T = A + B (N + E)^(-a).    (5.3)
Figure 5.1
Learning in a ten-finger, 1,023-choice task (log-log coordinates). Plotted from the original data for subject JK (Seibel 1963). The fitted line shown is T = 12N^(-0.32).
A generalized power law plots as a straight line on log-log paper once the effects of the asymptote (A) are removed from the time (T), and the effective number of trials prior to the experiment (E) are added to those performed during the experiment (N):

log(T - A) = log(B) + (-a) log(N + E).    (5.4)
Figure 5.2 shows the Seibel data as fit by a generalized power-law function. It is now linear over the whole range of trials. Similar fits are found across all dimensions of human performance, whether the task involves perception, motor behavior, perceptual-motor skills, elementary decisions, memory, complex routines, or problem solving. Though these fits are impressive, it must be stressed that the power law of practice is only an empirical law. The true underlying law must resemble a power law, but it may have a different analytical form.
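The following sketch (Python, purely illustrative) evaluates the generalized power law of equation (5.3) and checks the log-log linearity of equation (5.4). The parameter values are those of the fit to the Seibel data shown in figure 5.2 below (A = 0.32, B = 1673, E = 2440, a = 0.91); any other values would behave the same way.

```python
import math

# Generalized power law of practice (equation 5.3): T = A + B * (N + E) ** (-a)
def predicted_time(N, A=0.32, B=1673.0, E=2440.0, a=0.91):
    return A + B * (N + E) ** (-a)

# Equation (5.4): log(T - A) is linear in log(N + E), with slope -a.
trials = [1, 10, 100, 1000, 10000, 100000]
points = [(math.log10(n + 2440.0), math.log10(predicted_time(n) - 0.32)) for n in trials]

# The slope between the two endpoints recovers -a, confirming the log-log linearity.
(x0, y0), (x1, y1) = points[0], points[-1]
print((y1 - y0) / (x1 - x0))   # approximately -0.91
```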
The Chunking Theory of Learning

The chunking theory of learning proposes that practice improves performance via the acquisition of knowledge about patterns in the task environment. Implicit in this theory is a model of task performance based on this pattern knowledge. These patterns are called chunks (Miller 1956). The theory thus starts from the chunking hypothesis:
Figure 5.2
The Seibel data fit by a generalized power-law function (log-log coordinates): T = 0.32 + 1673(N + 2440)^(-0.91).
0.7 (within 1.0). In this condition the type is specified by the constant symbol OneTrial, and the identifier is specified as a variable (signified by an equals sign before the name of the variable) named Identifier. These both match successfully, so the activation of the object is 0.1606 (as computed before). There is only one attribute specified, so its goodness-of-fit is simply multiplied by 0.1606
to get the activation of the condition. The value is specified by the condition as a one-sided (greater-than) real-number comparison with the interval for a successful match set to 1.0. The expected value (0.7) is compared with the dominant value of the INITIAL-LOCATION attribute (0.99), yielding a goodness of fit of 0.4609 (from equation 5.8). The activation for this match is thus 0.0740 (= 0.1606 x 0.4609).

Conflict Resolution

Conflict resolution selects which of the instantiations generated by the match should be fired on a cycle. This is done by using rules to eliminate unwanted instantiations. The first rule performs a thresholding operation:
• Eliminate any instantiation with an activation value lower than 0.0001.
The second rule is based on the hypothesis that productions are a limited resource:
• Eliminate all but the most highly activated instantiation for each production.
This rule is similar to the special case and working-memory recency rules in the Ops languages. It allows the selection of the most focussed (activated) object from a set of alternatives. Following the execution of these conflict resolution rules, there is at most one instantiation remaining for each production. While this eliminates within-production parallelism, between-production parallelism has not been restricted. It is possible for one instantiation of every production to be simultaneously executing. This provides a second locus of the parallelism required by the parallel constraint.

What happens next depends on the execution types of the productions that generated the instantiations:

1. Instantiations of Always productions are always fired.

2. Instantiations of Decay productions are always fired, but the activation of the instantiation is cut in half each time the identical instantiation fires on successive cycles. A change in the instantiation occurs when one of the variable bindings is altered. This causes the activation to be immediately restored to its full value.

3. Instantiations of Once productions are fired only on the first cycle in which the instantiation would otherwise be eligible. This is a form of refractory inhibition.
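The two conflict resolution rules can be summarized in a short sketch. This is an illustrative paraphrase in Python, not the Xaps2 implementation; the threshold value and the best-instantiation-per-production rule are taken from the description above, and instantiations are represented as plain dictionaries.

```python
def conflict_resolution(instantiations, threshold=0.0001):
    """Apply the two rules described above: discard instantiations whose
    activation falls below the threshold, then keep only the most highly
    activated instantiation for each production."""
    survivors = {}
    for inst in instantiations:
        if inst["activation"] < threshold:          # rule 1: thresholding
            continue
        best = survivors.get(inst["production"])
        if best is None or inst["activation"] > best["activation"]:
            survivors[inst["production"]] = inst    # rule 2: best per production
    return list(survivors.values())                 # the survivors then fire in parallel

# Example: two instantiations of P1 compete, and P2 falls below the threshold.
pending = [
    {"production": "P1", "activation": 0.0740},
    {"production": "P1", "activation": 0.0200},
    {"production": "P2", "activation": 0.00005},
]
print(conflict_resolution(pending))                 # only the stronger P1 instantiation remains
```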
Standard Xaps2 productions are of execution type Always, and nearly all of the productions in the model are of this type. Decay productions have found limited use as a resettable clock, while no productions of execution type Once have been employed.

Production Execution

Following conflict resolution, all of the instantiations still eligible are fired in parallel, resulting in the execution of the productions' actions. Each production may execute one or more actions, providing a third type of parallelism in the architecture. The actions look like conditions; they are partial specifications of working-memory objects. Execution of an action results in the creation of a fully specified version of the object. Variables in the action are replaced by the values bound to those variables during the match, and new symbols are created as requested by the action. Unlike the Ops languages, actions only cause modifications to working memory; they are not a means by which the model can communicate with the outside world. Communication is an explicit part of the task model.

Actions modify working memory in an indirect fashion. The effects of all of the actions on one cycle are accumulated into a single data structure representing the updates to be made to working memory. The actual updating occurs in the next (and final) stage of the production-system cycle (described in the following section). The first step in creating the preupdate structure is to assign activation levels to the components of the objects asserted by the production actions. The identifier of the objects, and all of the values asserted for attributes, are assigned an activation level equal to the activation of the production instantiation asserting them. This allows activation to flow through the system under the direction of productions-like Hpsa77 (Newell 1980) and Caps (Thibadeau, Just, and Carpenter 1982)-as opposed to the undirected flow employed by spreading-activation models (Anderson 1976). No good scheme has been developed for assigning activation to the type; currently, it is just given a fixed activation of 1.0. The activation levels can be made negative by inhibiting either the whole object or a specific value. This is analogous to the negation of conditions (and values) during the match.

If the same object is asserted by more than one action, the effects are accumulated into a single representation of the object: the type activation is set to the same fixed constant of 1.0, the identifier activations are summed and assigned as the identifier activation, and all of the activations of the same value (of the same attribute) are summed and assigned as the activation of that value.
This aggregation solves the problem of synchronizing simultaneous modifications of working memory. Activation and inhibition are commutative, allowing the actions to be executed in any order without changing the result. The same is not true of the operations of insertion and deletion, as used in the Ops languages.

After the actions have been aggregated, any external stimuli to the system are added into this structure. External stimuli are objects that come from outside of the production system, such as the lights in Seibel's task. These stimuli are inserted into the preupdate structure just as if they were the results of production actions. An activation value of 0.01 is used for them. This level is high enough for the stimuli to affect the system and low enough for internally generated objects to be able to dominate them. Following the inclusion of stimuli, the preupdate structure is normalized (just as if it were the working memory) and used to update the current working-memory state. Normalization of the preupdate structure allows for the control of the relative weights of the new information (the preupdate structure) and the old information (working memory).

Updating of Working Memory

The updates to working memory could be used simply as a replacement for the old working memory (as in Joshi 1978), but that would result in working memory being peculiarly memoryless. By combining the preupdate structure with the current working memory, we get a system that is sensitive to new information, but remembers the past, for at least a short time. Many combination rules (e.g., adding the two structures together) are possible, and many were experimented with in Xaps. In Xaps2 the two are simply averaged together. This particular choice was made because it interacts most easily with the lack of refractoriness in production firing. The updates can be thought of as specifying some desired state to which the productions are trying to drive working memory. Repetitive firing of the same set of production instantiations results in working memory asymptotically approaching the desired state. Any weighted sum of the new and the old (with the weights summing to 1) would yield similar results, with change being either slower or faster. Averaging (equal weights of 0.5) was chosen because it is a reasonable null assumption.

Anything that is in one of the states being combined but not the other is assumed to be there with an activation value of 0.0. Thus, ignoring normalization, items not currently being asserted by productions (i.e., not in the preupdate structure) exponentially decay to zero, while asserted items exponentially approach their asserted activation levels.
This applies to inhibited as well as activated items-inhibition decays to zero if it is not continually reasserted.

Averaging the two working-memory states preserves the total activation, modulo small threshold effects, so the effects of normalization are minimal when it is employed with this combination rule. It has a noticeable effect only when no item within the scope of the normalization is being asserted by a production. Without normalization, all of the items would decay to zero. With normalization this decay is reversed so that the activations of the items once again sum to 1. The result is that the activations of the items remain unchanged. Basically, items stick around as long as they have no active competition. Old items that have competition from new items will decay away.

One consequence of the gradual updating of working memory is that it often takes more than one cycle to achieve a desired effect. This typically happens when the dominant value of an attribute is being changed. Addition of new attributes can always be accomplished in one cycle, but modifying old ones may take longer. It is essential that knowledge of the desired change remain available until the change has actually been made. In fact some form of test production is frequently required to detect when the change has been completed, before allowing processing to continue.
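A minimal sketch of the update step, assuming a working memory reduced to a flat mapping from items to activation levels (the real Xaps2 objects are structured, and normalization, which is omitted here, is applied within types and attributes), shows how averaging produces the exponential decay and asymptotic approach just described.

```python
def update_working_memory(old, preupdate, weight_new=0.5):
    """Average the preupdate structure into working memory (Xaps2 uses equal
    weights of 0.5).  Items missing from either state count as activation 0.0,
    so unasserted items decay toward zero while asserted items approach their
    asserted levels."""
    keys = set(old) | set(preupdate)
    return {k: (1.0 - weight_new) * old.get(k, 0.0) + weight_new * preupdate.get(k, 0.0)
            for k in keys}

wm = {"OneTrial.STATUS=Started": 0.8}
assertion = {"OneTrial.STATUS=Done": 0.8}     # repeatedly asserted by some production
for cycle in range(4):
    wm = update_working_memory(wm, assertion)
    print(cycle, wm)
# The old value halves each cycle (0.4, 0.2, 0.1, ...) while the new one climbs
# toward 0.8 (0.4, 0.6, 0.7, ...), which is why changing a dominant value can
# take more than one cycle.
```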
5.5 The Initial Performance Model
The chunking theory has been applied to Seibel's task, yielding a model that improves its performance with practice. Not covered by this model is the initial learning of a correct method for the task. Our future plans include extending the chunking theory to the domain of method acquisition, but until then, the model must be initialized with a correct method for performing the task. We consider only a single method, based on the algorithm at the end of section 5.2, though subjects exhibit a range of methods. This method is straightforward but slow-efficiency comes from chunking.

The algorithm is implemented as a hierarchy of five goals (figure 5.10). Each goal is a working-memory type, and each goal instance is an object of the relevant type. In addition to the goal types there are two types representing the model's interfaces with the outside world, one at the stimulus end and the other at the response end. We will start with a description of these interfaces and then plunge into the details of the model's internal goal hierarchy.
Figure 5.10 The model's goal hierarchy for Seibel's task
Interfacing with the Outside World
The model interacts with the outside world through two two-dimensional Euclidean spaces. These spaces are defined in terms of object-centered coordinates. One space, the stimulus space, represents the information received by the model as to the location of the lights within the stimulus array. The other space, the response space, represents the information that the model transmits to the motor system.

The locations of objects within these spaces are specified by relative x, y coordinates. The exact coordinate system used is not critical, but this particular one has proved to be convenient. The use of rectangular coordinates allows left-to-right traversal across the lights to be accomplished by just increasing x. With relative coordinates, the left (top) edge of the space is 0.0, and the right (bottom) edge is 1.0. Since the buttons and lights in the task have been arranged so as to maximize the compatibility of their locations, using the same set of relative coordinates for the two spaces makes trivial the job of mapping stimulus locations into response locations.

The Stimulus Space
The stimulus space is a rectangle just bounding the total array of lights. To the model this array appears as a set of objects representing the lights (both On and Off). A typical On-light looks like (ignoring activations):

(External-Stimulus Object0012
  [COMPONENT-PATTERN On]
  [SPATIAL-PATTERN One]
  [MINIMUM-X 0.21] [MAXIMUM-X 0.36]
  [MINIMUM-Y 0.00] [MAXIMUM-Y 0.30])
All stimuli have the same type (External-Stimulus), but the identifier is
unique to this light on this trial. Productions must match the object by a description of it, rather than by its name.

A total of six attributes are used to describe the stimulus object. Two of the attributes (COMPONENT-PATTERN and SPATIAL-PATTERN) specify the pattern represented by the object. The particular stimulus object shown here represents just a single On-light, but stimulus objects can represent patterns of arbitrary complexity (e.g., an arrangement of multiple lights). The attribute COMPONENT-PATTERN specifies what kind of objects make up the pattern-limited to On and Off (lights) for this task. The other attribute, SPATIAL-PATTERN, specifies the spatial arrangement of those components. The value One given in object Object0012 signifies that the object consists of one On-light and nothing else. This single value suffices for the initial performance model, but others are created when new chunks are built. The remaining four attributes (MINIMUM-X, MAXIMUM-X, MINIMUM-Y, and MAXIMUM-Y) define the bounding box of the stimulus pattern. The bounding box is a rectangle just large enough to enclose the stimulus. It is specified by its minimum and maximum x and y coordinates. For example, object Object0012 is flush against the top of the stimulus space and a little left of center.

We make the simplifying assumption that the entire array of lights is constantly within the model's "visual field." This cannot be literally true for our subjects because of the large visual angle subtended by the display (16°) but was more true for Seibel's subjects, who worked with a display covering 7° of arc. Because the model is assumed to be staring at the lights at all times during performance of the task, the stimulus objects are inserted into working memory on every cycle of the production system (see the previous section for how this is done).

The Response Space

The response space is constructed analogously to the stimulus space; it is a rectangle just bounding the array of buttons. This is a response space (as opposed to a stimulus space) because the objects in it represent patterns of modifications to be made to the environment, rather than patterns of stimuli perceived in the environment. Objects in this space represent locations at which the model is going to press (or not press) the buttons. The fingers are not explicitly modeled; it is assumed that some other portion of the organism enables finger movement according to the combination of location and action. Response objects look much like stimulus objects.
For example, the response object corresponding to stimulus object Object0012 might look like:

(External-Response Object0141
  [COMPONENT-PATTERN Press]
  [SPATIAL-PATTERN One]
  [MINIMUM-X 0.21] [MAXIMUM-X 0.36]
  [MINIMUM-Y 0.00] [MAXIMUM-Y 0.30])
The only differences are the type (External-Response), the identifier, which is unique to this instance of this response, and the value of COMPONENT-PATTERN, which is Press rather than On. Response objects are created dynamically by the model as they are needed. Once they are created, response objects hang around in working memory until competition from newer ones causes them to drop out.
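The payoff of the shared coordinate system can be sketched as follows. This is an illustrative reconstruction in Python, not the model's own representation: each object is treated as a small record holding a component pattern, a spatial pattern, and a bounding box in the shared relative coordinates, and the stimulus-to-response mapping then touches only the component pattern.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class PatternObject:
    component_pattern: str   # "On", "Off", or "Press"
    spatial_pattern: str     # e.g., "One" for a single component
    min_x: float             # bounding box in the shared relative coordinates (0.0-1.0)
    max_x: float
    min_y: float
    max_y: float

def to_response(stimulus: PatternObject) -> PatternObject:
    # Because lights and buttons share the same relative coordinates, mapping a
    # stimulus into a response changes only the component pattern.
    return replace(stimulus, component_pattern="Press")

# The On-light of Object0012, and the response object derived from it.
light = PatternObject("On", "One", 0.21, 0.36, 0.00, 0.30)
print(to_response(light))
```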
The Control Structure: A Goal Hierarchy

The control structure imposed on the Xaps2 architecture is that of a goal hierarchy. This control structure is totally serial at the level of the goal. Information about proposed subgoals and suspended supergoals can coexist with the processing of a goal instance, but only one such instance can be actively pursued at a time. The choice of this tightly controlled structure is not forced by the nature of the architecture. Instead, it came from the following three motivations:

1. The control structure provides the bottleneck required by the bottleneck constraint. Though this satisfies the constraint, it does so only in a weak sense because it is not an architectural limitation. This contrasts with Hpsa77 (Newell 1980), in which the mechanism of variable binding creates a structural bottleneck in the architecture.

2. The bottleneck is only across goals, not within goals. During the processing of goal instances, productions are free to execute in parallel. The parallel constraint is therefore still met. In addition the cognitive parallelism constraint is met; goals are employed all through the performance system, so the locus of parallelism is not limited to just the sensory and motor components.

3. Complicated execution paths (e.g., iterative loops) are difficult to construct in loosely controlled systems. While such systems may be logically adequate, convincing activation-based control schemes to loop, solely on the basis of activation values, has proved difficult to accomplish.
The first requirement of a system that employs a goal hierarchy is a representation for the goals. As stated earlier, each goal is represented as an object type, and a goal instance is represented as an object with a unique identifier. Object134 represents a typical goal instance-the goal name is OneTrial, and the identifier is Object134. Because goals can be distinguished by their types, and multiple instances of the same goal can be distinguished by their identifiers, it is possible to maintain information about a number of goal instances simultaneously.

The goal hierarchy is processed in a depth-first fashion, so the second requirement is a stack in which the current state of execution can be represented. In Xaps2 working memory does not behave as a stack; more recent objects will tend to be more highly activated, but this is not sufficient for the implementation of a goal hierarchy. The primary difficulty involves simultaneously keeping the goals in the stack active and maintaining the proper ordering among them. If the stack is just left alone, subgoal activity causes the objects in the stack to decay. The oldest objects may very well decay right out of working memory. If the stack is continually refreshed by reassertion of its elements into working memory, then the ordering, which depends on activation levels, will be disturbed. Some variation on this scheme may still work, but we have instead pursued a more symbolic representation of the goal stack.

Each goal instance has a STATUS attribute. Together, the STATUS attributes (i.e., the dominant values of the STATUS attributes) of the active goal instances completely determine the control state of the model. Three common STATUS values are Start, Started, and Done. Start means that the goal is being initialized; Started signifies that initialization is complete and that the goal is being pursued; and Done signifies that the goal has completed. The stack is implemented by pointing to the current subgoal of a suspended supergoal via the STATUS attribute of the supergoal. Notice that any goal instance whose STATUS is the identifier of some other goal instance must be suspended by definition, because its STATUS is no longer Started. A goal can therefore be interrupted at any time by a production that changes its STATUS from Started to some other value. Execution of the goal resumes when the STATUS is changed back to Started.

Activating a subgoal of a currently active goal is a multistep operation. The first step is for the goal to signify that it wants to activate a subgoal of a particular type. This is accomplished by changing the STATUS of the goal to the type of the subgoal that should be started. This enables the productions that create the new subgoal instance.
Four tasks must be performed whenever a new subgoal is started:

1. The current goal instance must be blocked from further execution until the subgoal is completed.

2. A new instance of the subgoal must be created. This is a new object with its own unique identifier.

3. The parameters, if any, must be passed from the current goal instance to the subgoal instance.

4. A link, implementing the stack, must be created between the current goal instance and the new subgoal instance.
As noted earlier, the supergoal is suspended as soon as the desire for the subgoal is expressed (task 1). Changing the STATUS of the current goal instance effectively blocks further effort on the goal. The other three tasks are performed by a set of three productions. Because the type of an object (in this case the name of the goal) cannot be matched by a variable, a distinct set of productions is required for each combination of goal and subgoal. One benefit of this restriction is that the goal-instance creation productions can perform parameter passing from goal to subgoal as part of the creation of the new instance. The first production of the three performs precisely these two tasks: (2) subgoal creation and (3) parameter passing. Schematically, these productions take the following form:

Production schema Start(Goal name):
  If the current goal instance has a subgoal name as its STATUS,
  then generate a new instance of the subgoal with STATUS Start
  (parameters to the subgoal are passed as other attributes).
When such a production executes, it generates a new symbol to be used as the identifier of the object representing the goal instance. The second production builds a stack link from a goal to its subgoal (task 4), by copying this new identifier into the STATUS attribute of the current goal instance. This must be done after the first production fires, because this production must examine the newly created object to determine the identifier of the new goal instance:

Production schema CreateStackLink(Goal name):
  If the current goal instance has a subgoal name as its STATUS
  and there is an active object of that type with STATUS Start,
  then replace the goal's STATUS with the subgoal's identifier.
The third production checks that all four tasks have been correctly performed before enabling work on the subgoal:

Production schema Started(Goal name):
  If the current goal instance has a subgoal identifier as its STATUS
  and that subgoal has STATUS Start,
  then change the STATUS of the subgoal to Started.

At first glance it would appear that the action of this third production could just be added to the second production. In most production systems this would work fine, but in Xaps2 it doesn't. One production would be changing the values of two attributes at once. Since there is no guarantee that both alterations would happen in one cycle, a race condition would ensue. If the subgoal is Started before the stack link is created, the link will never be created. Generally, in Xaps2 separate productions are required to make a modification and to test that the modification has been performed.

It generally takes one cycle of the production system to express the desire for a subgoal and three cycles to activate the subgoal (one cycle for each of the three productions), for a total of four cycles of overhead for each new subgoal. This number may be slightly larger when any of the modifications requires more than one cycle to be completed.

Goals are terminated by test productions that sense appropriate conditions and change the STATUS of the goal instance to Done. If the subgoal is to return a result, then an intermediate STATUS of Result is generated by the test productions, and additional productions are employed to return the result to the parent goal and to change the STATUS of the subgoal instance to Done, once it has been checked that the result has actually been returned. The standard way of returning a result in Xaps2 is to assert it as the new value for some attribute of the parent goal instance. It may take several cycles before it becomes the dominant value, so the production that changes the STATUS to Done waits until the result has become the dominant value before firing. Terminating a goal instance generally requires one cycle, plus between zero and four cycles to return a result. The parent goal senses that the subgoal has completed by looking for an object of the subgoal type whose identifier is identical to the parent's STATUS, and whose own STATUS is Done. Once this condition is detected, the parent goal is free to request the next subgoal or to continue in any way that it sees fit.
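The three production schemas can be paraphrased as a small sketch of the subgoal protocol. This is a loose reconstruction in Python, not the Xaps2 productions themselves: goal instances are plain dictionaries, and the identifier generator stands in for the architecture's new-symbol mechanism.

```python
import itertools

_ids = itertools.count(135)

def start_subgoal(memory, subgoal_name, **parameters):
    # Paraphrase of Start(Goal name): create the subgoal instance with STATUS
    # Start, passing parameters as additional attributes.
    child = {"type": subgoal_name, "id": f"Object{next(_ids)}",
             "STATUS": "Start", **parameters}
    memory.append(child)
    return child

def create_stack_link(parent, child):
    # Paraphrase of CreateStackLink: point the suspended parent at the child.
    parent["STATUS"] = child["id"]

def started(child):
    # Paraphrase of Started: only now may work on the subgoal begin.
    child["STATUS"] = "Started"

# A OneTrial goal requesting a OnePattern subgoal (four cycles in the model:
# one to express the desire, plus one for each of the three productions).
memory = []
one_trial = {"type": "OneTrial", "id": "Object134", "STATUS": "OnePattern"}
memory.append(one_trial)
child = start_subgoal(memory, "OnePattern", INITIAL_LOCATION=0.0)
create_stack_link(one_trial, child)
started(child)
print(one_trial["STATUS"], child)   # the parent now points at the running subgoal
```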
The mechanism described so far solves the problem of maintaining the order of stacked goal instances. However, it does not prevent the objects representing these instances from decaying out of working memory. This is resolved by an additional production for each goal-subgoal combination that passes activation from the goal type to the subgoal type. The topmost goal type passes activation to itself and downward to the next level. All of the other goal types simply pass activation to their subgoal types. These productions fire on every cycle. Keeping the types (goals) active ensures that at least one instance of each goal can be retained on the stack. Multiple instances of the same goal, such as would be generated by recursion, would result in lossage of instances through competition. In order for recursion to work, either the architecture would have to be changed to fire all instantiations of a production (one goal instance per instantiation) instead of only the "best," or a separate production would be required for each instance (which must be created dynamically, as are the goal instances).

The five goal types are discussed in the following sections.

The Seibel Goal

Seibel is the top-level goal type for the task. It enters working memory as a stimulus from the outside world (see the previous section for a discussion of stimuli), corresponding to a request to perform the task. The Seibel goal type is used solely to keep the OneTrial goal active.

The OneTrial Goal

A desire for a new instance of the OneTrial goal is generated exogenously each time the stimulus array is changed, that is, once each trial. Both this desire and the new stimulus array are inserted into working memory as stimuli. The Seibel goal could have detected the presence of a new stimulus array and generated the OneTrial goal directly, but we have taken this simpler approach for the time being because we wanted to focus our attention on within-trial processing.

The OneTrial goal implements the following aspects of the performance algorithm (section 5.2):

  Focus a point to the left of the leftmost light.
  While there is an On-light to the right of the focal point
    Do (Goal OnePattern)
The point of focus is modeled as the value of an attribute (INITIAL-LOCATION) of the OneTrial goal instance. This should be thought of as the focus of attention within the visual field, rather than as the locus of eye
213
214
CHAPTER 5
fixation. Setting the initial focus takes two cycles. First the goal's STATUS is changed to Initialize, and then a production that triggers off of that STATUS sets the value of the INITIAL-LOCATION to 0.0 (the left edge of the stimulus space). The entire body of the While loop has been moved inside of a single goal (OnePattern), so the loop is implemented by repeatedly starting up OnePattern goal instances. The first instance is created when a test produc tion has determined that the INITIAL-LOCATION has been set. Subsequent instances are established whenever the active OnePattern instance has completed. The focal point gets updated between iterations because the OnePattern goal returns as its result the right edge of the light pattern that it processed. This result is assigned to the INITIAL-LOCATION attribute. What we have described so far is an infinite loop; new instances of the OnePattern goal are generated endlessly. This is converted into a While loop with the addition of a single production of the following form: Production DoneOneTrial: .lfthere is a OneTrial goal with STATUS OnePattern and there is no On-light to the right of its INITIAL-LOCATION, then the OneTrial goal is Done. The test for an On-light to the right of the INITIAL-LOCATION is performed by a one-sided (greater-than) real-number match to the MINIMUM-x values of the stimulus objects. The expected value is the INITIAL-LOCATION, and the interval is 1 .0. The match will succeed if there is another light to the right, and fail otherwise. The preceding production therefore has this test negated. The reaction time for the model on Seibel's task is computed from the total number of cycles required to complete (STATUS of Done) one instance of the OneTrial goal. Experimentally this has been determined to be a fixed overhead of approximately 1 3 cycles per trial, plus approximately 3 1 cycles for each On-light-an instance of the OnePattern goal (see section 5.7). These numbers, and those for the following goals, are from the full per formance model, which is the initial performance model with some addi tional overhead for the integration of chunking into the control structure (section 5.6). The OnePattern Goal The OnePattern goals control the four steps inside the While loop of the performance strategy:
  Locate the On-light.
  Map the light location into the location of the button under it.
  Press the button.
  Focus the right edge of the light.
Two of these steps (Map and Focus) are performed directly by the goal instance, and two (Locate and Press) are performed by subgoals (OneStimulusPattern and OneResponsePattern). At the start a typical instance of the OnePattern goal looks like (see note 10):

(OnePattern Object45 [INITIAL-LOCATION 0.0])
The first step is to locate the next stimulus pattern to process. This is accomplished by a subgoal, OneStimulusPattern, which receives as a parameter the INITIAL-LOCATION and returns the attributes of the first On-light to the right of the INITIAL-LOCATION. These attributes are added to the OnePattern instance, to yield an object like:

(OnePattern Object45
  [INITIAL-LOCATION 0.0]
  [STIMULUS-COMPONENT-PATTERN On]
  [STIMULUS-SPATIAL-PATTERN One]
  [STIMULUS-MINIMUM-X 0.21]
  [STIMULUS-MAXIMUM-X 0.36]
  [STIMULUS-MINIMUM-Y 0.00]
  [STIMULUS-MAXIMUM-Y 0.30])
The mapping between stimulus and response is currently wired directly into the performance algorithm. This is sufficient but not essential for the current model. In some follow-up work we are investigating the relationship between this model and stimulus-response compatibility. In these systems the mapping is performed in a separate subgoal. This provides flexibility and the ability to perform a considerable amount of processing during the mapping. The mapping employed in the current model is a minimal one; all that is required is turning the stimulus attributes into response attributes and changing the COMPONENT-PATTERN from On to Press. This mapping is performed by a single production to yield an object of the following form:

(OnePattern Object45
  [INITIAL-LOCATION 0.0]
  [STIMULUS-COMPONENT-PATTERN On]
  [STIMULUS-SPATIAL-PATTERN One]
  [STIMULUS-MINIMUM-X 0.21]
  [STIMULUS-MAXIMUM-X 0.36]
  [STIMULUS-MINIMUM-Y 0.00]
  [STIMULUS-MAXIMUM-Y 0.30]
  [RESPONSE-COMPONENT-PATTERN Press]
  [RESPONSE-SPATIAL-PATTERN One]
  [RESPONSE-MINIMUM-X 0.21]
  [RESPONSE-MAXIMUM-X 0.36]
  [RESPONSE-MINIMUM-Y 0.00]
  [RESPONSE-MAXIMUM-Y 0.30])
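As a rough illustration of how little work this minimal mapping does, the following sketch (our own, not the model's production notation) simply copies the six stimulus attributes to their response counterparts and translates On to Press.

```python
# Hedged sketch of the minimal stimulus-to-response mapping, operating on a
# OnePattern goal instance represented here as a plain dictionary.
def map_stimulus_to_response(goal: dict) -> dict:
    goal["RESPONSE-COMPONENT-PATTERN"] = {"On": "Press"}[goal["STIMULUS-COMPONENT-PATTERN"]]
    goal["RESPONSE-SPATIAL-PATTERN"] = goal["STIMULUS-SPATIAL-PATTERN"]
    for attr in ("MINIMUM-X", "MAXIMUM-X", "MINIMUM-Y", "MAXIMUM-Y"):
        goal["RESPONSE-" + attr] = goal["STIMULUS-" + attr]
    return goal
```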
These response parameters are passed to a subgoal, OneResponsePattern, that converts them into a new response object. Following completion of this subgoal, the OnePattern goal terminates, passing the coordinate of the right edge of the selected stimulus pattern as its result (to be used as the focal point for the next search). Not counting the time to perform its two subgoals, a typical instance of this goal requires 12 cycles of the production system, including the overhead involved in starting and finishing the goal.

The OneStimulusPattern Goal  The OneStimulusPattern goal is responsible for finding the next On-light to the right of the INITIAL-LOCATION, which it receives as a parameter from its parent OnePattern goal. This selection is made by a single production employing the same type of one-sided real-number match used to determine when the trial has completed. It looks for an On-light to the right of the INITIAL-LOCATION for which there are no other On-lights between it and the INITIAL-LOCATION. (Recall that Off-lights are ignored in this algorithm.) As long as there is some On-light to the right of the INITIAL-LOCATION, it will be found. If there is more than one, the closest one to the INITIAL-LOCATION is selected. On completion, the goal instance returns the values of the six attributes describing the selected stimulus pattern to the parent OnePattern goal. A typical instance of this goal requires nine cycles of the production system, including the overhead involved in starting and finishing the goal, though it may be higher.
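In procedural terms, the selection performed by this goal can be sketched as follows; this is an illustrative approximation of the one-sided real-number match, not the Xaps2 matcher, and the dictionary representation of stimulus objects is assumed only for the example.

```python
def select_next_on_light(stimuli: list, initial_location: float):
    # Candidates: On-lights strictly to the right of the INITIAL-LOCATION.
    candidates = [s for s in stimuli
                  if s["COMPONENT-PATTERN"] == "On"
                  and s["MINIMUM-X"] > initial_location]
    if not candidates:
        return None  # no On-light remains; the trial is done
    # If there is more than one, the closest one to the INITIAL-LOCATION wins,
    # which also guarantees no other On-light lies between it and the focus.
    return min(candidates, key=lambda s: s["MINIMUM-X"])
```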
The OneResponsePattern Goal  Conceptually the OneResponsePattern goal is the mirror image of the OneStimulusPattern goal. Given the parameters of a response object, its task is to create a new response object with those values. We assume that the motor system automatically latches on to response objects as they are created, so creation of the object is the last step actually simulated in the model. A typical instance of this goal requires ten cycles of the production system, including the overhead involved in starting and finishing the goal, though it may be higher.
5.6  The Chunking Process
The initial performance model executes an instance of the OnePattern goal for each On-light in the stimulus array. Patterns consisting of a single On-light are primitive patterns for the model; they are at the smallest grain at which the perceptual system is being modeled. Larger, or higher-level, patterns can be built out of combinations of these primitive patterns. For example, a single higher-level pattern could represent the fact that four particular lights are all On. The same holds true for response patterns, where the primitive patterns are single key presses. Higher-level response patterns that specify a combination of key presses can be built out of these primitive response patterns. According to the chunking theory of learning, chunks represent patterns experienced in the task environment. They improve performance because it is more efficient to deal with a single large pattern than a set of smaller patterns. The remainder of this section describes the design of this chunking process: how chunks represent environmental patterns and how they are acquired from task performance. As currently constituted, this is an error-free design; chunks are always acquired and used correctly. Rather than model errors directly by a bug-laden final model, the problem of errors is tackled by discussing the types of errors simple versions of the model naturally make, and the mechanisms implemented to ensure that these errors do not occur.

The Representation of Chunks
We propose that a chunk consists of three components: (1) a stimulus pattern, (2) a response pattern, and (3) a connection between the two. In contrast to systems that treat chunks as static data structures, we consider a chunk to be an active structure. A chunk is the productions that process it. The obvious implementation of this proposal involves the creation of one production per chunk. The production would have one condition for each primitive component of the stimulus pattern and one action for each primitive component of the response pattern. The connection is implemented directly by the production's condition-action link. This implementation is straightforward enough, but it is inadequate for the following reasons:

1. These productions violate the control structure of the model by linking stimuli to responses directly, without passing through the intervening
bottleneck. If such productions could be created, then it should also be possible to create the optimal algorithm of ten parallel productions, one for each light-button combination.

2. The chunking mechanism implied by these productions is nonhierarchical; a chunk is always defined directly in terms of the set of primitive patterns that it covers.

3. The direct connection of stimulus to response implies that it is impossible for the cognitive system to intervene. The mapping of stimulus to response is wired in and unchangeable.
These problems can all be solved by implementing each chunk as three productions, one for each component. The first production encodes a set of stimulus patterns into a higher-level stimulus pattern, the second production decodes a higher-level response pattern into a set of smaller response patterns, and the third production indirectly connects the higher-level stimulus pattern to the higher-level response pattern. For the acquisition of a chunk to improve the performance of the model, these productions must help overcome the bottleneck caused by the model's inability to process more than one pattern at a time. This bottleneck can be precisely located within the OnePattern goal, between the termination of the OneStimulusPattern goal and the beginning of the OneResponsePattern goal. According to the encoding constraint, encoding must occur before the bottleneck, that is, before the OneStimulusPattern goal completes and selects the pattern to use. Likewise the decoding constraint implies that decoding must occur after the bottleneck, that is, after the start of the OneResponsePattern goal. The connection production must appear somewhere in between. The model must execute an instance of the OnePattern goal for each pattern processed, approximately 31 production-system cycles. If there are four On-lights in the stimulus, then the initial performance model requires four iterations, or about 124 cycles. If one pattern can cover all four On-lights, only one iteration is required, cutting the time down to 31 cycles. If instead we had two patterns of two On-lights each, it would take two iterations, or about 62 cycles. Just as the chunking theory of learning proposes, performance can be improved through the acquisition of patterns experienced during task performance. For simplicity, the current system works only with chunks that are built out of exactly two subchunks. This is not a limitation on the theory; it is
merely the simplest assumption that still lets us investigate most of the interesting phenomena. The remainder of this section describes how the three components of a chunk are represented and how they are integrated into the model's control structure. We delay until the following section the description of how a chunk is built.

The Encoding Component  The initial performance model perceives the world only in terms of primitive stimulus patterns consisting of either a single On-light or a single Off-light. The encoding component of a chunk examines the currently perceived patterns, as reflected by the contents of working memory, and, based on what it sees, may assert a new higher-level stimulus pattern. When this new object appears in working memory, it can form the basis for the recognition of even higher-level patterns. The entire set of encoding productions thus performs a hierarchical parsing process on the stimuli. Encoding is a goal-free data-driven process in which productions fire whenever they perceive their pattern. This process is asynchronous with the goal-directed computations that make up most of the system. This works because the perceived patterns interact with the rest of the system through a filter of goal-directed selection productions. As an example, the selection production in the previous section chooses one pattern from the stimulus space based on its location and COMPONENT-PATTERN. In essence we are proposing that the traditional distinction between parallel data-driven perception and serial goal-directed cognition be modified to be a distinction between parallel data-driven chunk encoding and serial goal-directed cognition. In the remainder of this section we describe the details of this chunk encoding process.

Representation of Higher-Level Stimulus Patterns  All stimulus patterns, be they primitive or higher-level, are represented as working-memory objects of type External-Stimulus. For purposes of comparison, here are some objects representing a primitive pattern and a higher-level pattern:

(External-Stimulus Primitive-Example
  [COMPONENT-PATTERN On]
  [SPATIAL-PATTERN One]
  [MINIMUM-X 0.21]
  [MAXIMUM-X 0.36]
  [MINIMUM-Y 0.00]
  [MAXIMUM-Y 0.30])
(External-Stimulus Higher-Level-Example
  [COMPONENT-PATTERN On]
  [SPATIAL-PATTERN Spatial-Pattern-0145]
  [MINIMUM-X 0.21]
  [MAXIMUM-X 0.78]
  [MINIMUM-Y 0.00]
  [MAXIMUM-Y 0.64])
They are almost identical; what differs is the values of some attributes. The four attributes defining the bounding box are interpreted in the same fashion for all patterns. They always define the rectangle just bounding the pattern. For primitive chunks, this is a rectangle just large enough to contain the light. For higher-level chunks, it is the smallest rectangle that contains all of the lights in the pattern. The COMPONENT-PATTERN of primitive patterns is always On or Off, signifying the type of light contained in the pattern. For higher-level patterns a value of On is interpreted to mean that all of the lights contained in the pattern are On. Other values are possible for higher-level patterns, but in the current task we only deal with patterns composed solely of On-lights. This means that the Off-lights are dealt with by ignoring them, not that the Off-lights can't be there. The primary difference between primitive and higher-level patterns is in the value of the SPATIAL-PATTERN attribute. For primitive patterns it always has the value One, signifying that the entire bounding box contains just a single light. For higher-level patterns the value must indicate how many On-lights there are within the box and what their positions are. One alternative for representing this information is to store it explicitly in the object in terms of a pattern language. The pattern language amounts to a strong constraint on the variety of patterns that can be perceived. This is the tactic employed in most concept formation programs (e.g., Evans 1968; Mitchell, Utgoff, Nudel, and Banerji 1981). It is a powerful technique within the domain of the pattern language, but useless outside of it. We have taken the less-constrained approach pioneered by Uhr and Vossler (1963) in which there is little to no precommitment as to the nature of the patterns to be learned. A unique symbol is created to represent each newly perceived pattern. This symbol is stored as the value of the SPATIAL-PATTERN attribute (Spatial-Pattern-0145 in the preceding example). Instead of the meaning being determined in terms of a hard-wired pattern
language, it is determined by the productions that act on the symbol. The encoding production knows to create an object with this symbol when it perceives the appropriate lower-level patterns. Likewise the connection production knows how to create the appropriate response object for this symbol. With this scheme any pattern can be represented, but other operations on patterns, such as generalization, become difficult.

Integration of the Encoding Component into the Model  When the concept of chunking is added to the initial performance model, changes in the control structure are needed for the model to make use of the newly generated higher-level patterns. The initial performance model iterates through the lights by repeatedly selecting the first On-light to the right of the focal point and then shifting the focal point to the right of the selected light. When there are higher-level patterns, this algorithm must be modified to select the largest pattern that starts with the first On-light to the right of the focal point, while shifting the focal point to the right of the pattern's bounding box. Accomplishing this involves simply changing the selection production so that it does not care about the SPATIAL-PATTERN of the object that it selects. It then selects the most highly activated stimulus object consisting of only On-lights, with no other such object between it and the INITIAL-LOCATION. The largest pattern is selected because a higher-level pattern containing n components will be more highly activated than its components. If a production has n equally activated conditions, call the activation a, then its actions will be asserted with an activation level of √n · a (derived from n · a / √n). Originally it was intended that this selection be based solely on the match activation of the competing instantiations. The effect of size was added (via the combination of activation) to the effect of nearness to the INITIAL-LOCATION (via a real-number match). This often worked, but it did lead to omission errors in which a large pattern was preferred to a near pattern, skipping over intermediate On-lights without processing them. To avoid these errors, the more explicit location comparison process described in section 5.5 is currently employed. Selection now works correctly, that is, if the encoding process has completed by the time the selection is made. Since encoding is an asynchronous, logarithmic process, determining the time of its completion is problematic. This problem is partly solved by the data-driven nature of the encoding productions. Encoding starts as soon as the stimuli become
available, not just after the OneStimulusPattern goal has started. This head start allows encoding usually to finish in time. For the cases when this is insufficient, a pseudoclock is implemented by the combination of an Always production and a Decay production. Encoding takes an amount of time dependent on the height of the chunk hierarchy, so waiting a fixed amount of time does not work. Instead, the clock keeps track of the time between successive assertions of new stimulus patterns by encoding productions. If it has been too long since the last new one, encoding is assumed to be done. The clock is based on the relative activation levels of two particular values of an attribute. One value remains at a moderate level; the other value is reset to a high level on cycles in which a new pattern is perceived, and decays during the remainder. When the activation of this value decays below the other value, because no new encoding productions have fired, encoding is considered to be done. This mechanism is clumsy but adequate.

The Encoding Productions  Encoding productions all have the same structure, consisting of three conditions and one action. The three conditions look for the two stimulus patterns that make up the new pattern, and the absence of other On patterns between the two desired ones. The action creates a new object in working memory representing the appropriate higher-level pattern. At first glance only the first two conditions would seem to be necessary, but absence of the third condition can lead to errors of omission. Suppose that an encoding production is created for a pattern consisting of a pair of On-lights separated by an Off-light. If the middle light is Off the next time the two lights are On, there is no problem. The problem occurs when all three lights are On. Without the third condition, the production would match and the higher-level pattern would be recognized. If that pattern is then used by the performance system, it would press the buttons corresponding to the two outer lights, and then move the focal point past the right edge of the pattern's bounding box. The middle On-light would never be processed, resulting in a missing key press. By adding the third condition, the pattern is not recognized unless there is no On-light embedded between the two subpatterns. These errors are therefore ruled out. Let's look at a couple of concrete examples. In this first example we encode two primitive patterns (On-lights) separated by an Off-light. The relevant portion of working memory is:
(External-Stimulus Object0141
  [COMPONENT-PATTERN On]
  [SPATIAL-PATTERN One]
  [MINIMUM-X 0.21]
  [MAXIMUM-X 0.36]
  [MINIMUM-Y 0.00]
  [MAXIMUM-Y 0.30])

(External-Stimulus Object0142
  [COMPONENT-PATTERN Off]
  [SPATIAL-PATTERN One]
  [MINIMUM-X 0.42]
  [MAXIMUM-X 0.57]
  [MINIMUM-Y 0.00]
  [MAXIMUM-Y 0.30])

(External-Stimulus Object0143
  [COMPONENT-PATTERN On]
  [SPATIAL-PATTERN One]
  [MINIMUM-X 0.63]
  [MAXIMUM-X 0.78]
  [MINIMUM-Y 0.34]
  [MAXIMUM-Y 0.64])
Encoding the two On-lights yields a new higher-level stimulus pattern with a bounding box just big enough to contain the two On-lights; the Off-light is simply ignored. The COMPONENT-PATTERN remains On, and a new symbol is created to represent the SPATIAL-PATTERN. The object representing the pattern looks like:

(External-Stimulus Object0144
  [COMPONENT-PATTERN On]
  [SPATIAL-PATTERN Spatial-Pattern-0145]
  [MINIMUM-X 0.21]
  [MAXIMUM-X 0.78]
  [MINIMUM-Y 0.00]
  [MAXIMUM-Y 0.64])
The production that performs this encoding operation has the form:

Production Encode1:
  If there is an External-Stimulus object consisting of just one On-light,
    whose left edge is 0.21 (within 0.15), right edge is 0.36 (within 0.15),
    top edge is 0.00 (within 0.30), bottom edge is 0.30 (within 0.30),
  and there is an External-Stimulus object consisting of just one On-light,
    whose left edge is 0.63 (within 0.15), right edge is 0.78 (within 0.15),
    top edge is 0.34 (within 0.30), bottom edge is 0.64 (within 0.30),
  and there is no External-Stimulus object consisting of On-lights in any
    spatial pattern, whose left edge is left of 0.63 (within 0.27),
  then create a new External-Stimulus object consisting of On-lights in
    configuration Spatial-Pattern-0145, whose left edge is 0.21, right edge
    is 0.78, top edge is 0.0, bottom edge is 0.64.
The first condition looks for an On-light bounded by [0.21, 0.36) horizontally and [0.00, 0.30) vertically. The bounding box is matched by four two-sided real-number condition patterns. The lights may not always be positioned exactly as they were when the production was created, so the match is set up to succeed over a range of values (the interval of the real-number match). The sizes of the intervals are based on the notion that the accuracy required is proportional to the size of the pattern. The horizontal intervals are therefore set to the width of the pattern (0.36 - 0.21 = 0.15), and the vertical intervals are set to the height of the pattern (0.30 - 0.00 = 0.30). The second condition works identically to the first, with only the location of the light changed. The third condition ensures that there are no intervening On-lights. This last condition is actually testing that no On pattern starts between the right edge of the first subpattern and the left edge of the second subpattern. That this works depends on the fact that the lights are being processed horizontally and that there is no horizontal overlap between adjacent lights. Currently this knowledge is built directly into the chunking mechanism, a situation that is tolerable when only one task is being explored but intolerable in a more general mechanism. The preceding example chunked two primitive patterns together to yield a higher-level pattern, but the same technique works if the subpatterns are higher-level patterns themselves, or even if there is a mixture. In the following example a higher-level pattern is combined with a primitive pattern. Suppose the situation is the same as in the previous example, plus there is an additional On-light to the right. After the encoding production fires, working memory consists of the four objects mentioned earlier (three primitive ones and one higher-level one), plus the following object for the extra light:

(External-Stimulus Object0146
  [COMPONENT-PATTERN On]
  [SPATIAL-PATTERN One]
  [MINIMUM-X 0.84]
  [MAXIMUM-X 0.99]
  [MINIMUM-Y 0.68]
  [MAXIMUM-Y 0.98])
A higher-level pattern can be generated from this pattern and
Object0144. The new pattern covers the entire bounding box of the four lights. The encoding production for this is:

Production Encode2:
  If there is an External-Stimulus object consisting of On-lights in
    configuration Spatial-Pattern-0145, whose left edge is 0.21 (within 0.57),
    right edge is 0.78 (within 0.57), top edge is 0.00 (within 0.64),
    bottom edge is 0.64 (within 0.64),
  and there is an External-Stimulus object consisting of just one On-light,
    whose left edge is 0.84 (within 0.15), right edge is 0.99 (within 0.15),
    top edge is 0.68 (within 0.30), bottom edge is 0.98 (within 0.30),
  and there is no External-Stimulus object consisting of On-lights in any
    spatial pattern, whose left edge is left of 0.84 (within 0.06),
  then create a new External-Stimulus object consisting of On-lights in
    configuration Spatial-Pattern-0147, whose left edge is 0.21, right edge
    is 0.99, top edge is 0.0, bottom edge is 0.98.

As should be clear, this production is basically the same as production Encode1. The bounding boxes are appropriately changed, and the SPATIAL-PATTERN of one of the subpatterns is Spatial-Pattern-0145, the name for the higher-level pattern generated by production Encode1, and not One (signified in the productions by the phrase "consisting of just one On-light"). When production Encode2 fires, it creates a stimulus object of the following form:

(External-Stimulus Object0148
  [COMPONENT-PATTERN On]
  [SPATIAL-PATTERN Spatial-Pattern-0147]
  [MINIMUM-X 0.21]
  [MAXIMUM-X 0.99]
  [MINIMUM-Y 0.00]
  [MAXIMUM-Y 0.98])

The Decoding Component  Decoding productions perform the inverse operation of encoding productions. When one matches to a higher-level pattern, it generates that pattern's two subpatterns. Because decoding must occur after the start of the OneResponsePattern goal (after the bottleneck), it is defined on response patterns, rather than stimulus patterns. We assume that decoding occurs because the motor system only responds to primitive External-Response objects. When the response is specified by a higher-level
pattern, it must be decoded down to its component primitives before the response can occur. The entire set of decoding productions acts as a hierarchical decoding network for higher-level response patterns. Unlike encoding, decoding is initiated under goal-directed control. The OneResponsePattern goal's parameters describe a response pattern that is to be executed. From this description the goal builds the appropriate External-Response object, and decoding begins. Decoding can't begin until the goal has built this object, but once it has begun, it continues to completion without further need of direction from the goal. Integrating the decoding component into the performance model is thus trivial; whenever an object representing a higher-level response pattern is generated, the appropriate decoding productions will fire. The one complication is that, as with encoding, decoding requires a variable number of cycles to complete. The problem of determining when decoding is done is solved by the use of a second pseudoclock. In fact this mechanism is inadequate for this purpose, but the problem does not affect the execution of the remainder of the model, so the current scheme is being employed until a better alternative is devised. The following decoding production is the analogue of production Encode2. It has one condition that matches the higher-level response pattern corresponding to the stimulus pattern generated by production Encode2, and it has two actions which generate response patterns corresponding to the two stimulus subpatterns of production Encode2. One of the subpatterns is primitive, while the other is a higher-level pattern that must be decoded further by another production:

Production Decode2:
  If there is an External-Response object consisting of Press-keys in
    configuration Spatial-Pattern-0151, whose left edge is 0.21 (within 0.78),
    right edge is 0.99 (within 0.78), top edge is 0.0 (within 0.98),
    bottom edge is 0.98 (within 0.98),
  then create a new External-Response object consisting of Press-keys in
    configuration Spatial-Pattern-0150, whose left edge is 0.21, right edge
    is 0.78, top edge is 0.00, bottom edge is 0.64,
  and create a new External-Response object consisting of just one Press-key,
    whose left edge is 0.84, right edge is 0.99, top edge is 0.68,
    bottom edge is 0.98.
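To make the encoding-decoding symmetry concrete, here is a sketch in Python of what one encoding step and its inverse accomplish. It is our own reconstruction, not the production notation: for brevity it reuses a single table keyed by the new spatial-pattern symbol for both directions, whereas in the model the stimulus and response sides carry distinct symbols related by a connection production.

```python
import itertools

_symbols = itertools.count(1)

def encode(sub1: dict, sub2: dict, chunk_table: dict) -> dict:
    # Assert a higher-level stimulus pattern: the bounding box is the union of
    # the two subpatterns, and the spatial pattern is a newly minted symbol.
    symbol = f"Spatial-Pattern-{next(_symbols):04d}"
    chunk_table[symbol] = (sub1, sub2)   # remembered for decoding
    return {
        "COMPONENT-PATTERN": "On",
        "SPATIAL-PATTERN": symbol,
        "MINIMUM-X": min(sub1["MINIMUM-X"], sub2["MINIMUM-X"]),
        "MAXIMUM-X": max(sub1["MAXIMUM-X"], sub2["MAXIMUM-X"]),
        "MINIMUM-Y": min(sub1["MINIMUM-Y"], sub2["MINIMUM-Y"]),
        "MAXIMUM-Y": max(sub1["MAXIMUM-Y"], sub2["MAXIMUM-Y"]),
    }

def decode(response: dict, chunk_table: dict) -> list:
    # Replace a higher-level response pattern by its subpatterns, recursing
    # until only primitive (single key-press) patterns remain.
    symbol = response["SPATIAL-PATTERN"]
    if symbol not in chunk_table:
        return [response]
    decoded = []
    for sub in chunk_table[symbol]:
        decoded.extend(decode(dict(sub, **{"COMPONENT-PATTERN": "Press"}),
                              chunk_table))
    return decoded
```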
The Connection Component  A connection production links a higher-level stimulus pattern with its appropriate higher-level response pattern. The entire set of connection productions defines the stimulus-response mapping for the task. This mapping must occur under goal direction so that the type of mapping can vary according to the task being performed. It would not be a very adaptive model if it were locked into always responding the same way to the same stimulus. The connection productions need to be located after the encoding component and before the decoding component, between the end of the OneStimulusPattern goal and the start of the OneResponsePattern goal. They are situated in, and under the control of, the OnePattern goal. This goal already contains a general mechanism for mapping the description of a primitive stimulus pattern to the description of the appropriate primitive response pattern. These descriptions are local to the OnePattern goal and are stored as attributes of the object representing the goal. The connection productions extend this existing mechanism so that higher-level patterns can also be mapped. Whether a connection production fires, or the initial mechanism executes, is completely determined by the SPATIAL-PATTERN of the stimulus pattern. If it is One, the initial mechanism is used; otherwise, a connection production is required. Integration of the connection productions into the performance model is therefore straightforward. The following production connects a higher-level stimulus pattern with SPATIAL-PATTERN Spatial-Pattern-0147 to the corresponding higher-level response pattern:
Production Map-Spatial-Pattern-0147:
  If there is a OnePattern goal whose STATUS is MapOnePattern containing
    the description of a stimulus pattern of On-lights in configuration
    Spatial-Pattern-0147, whose left edge is 0.21 (within 0.78), right edge
    is 0.99 (within 0.78), top edge is 0.0 (within 0.98), bottom edge is
    0.98 (within 0.98),
  then add the description of a response pattern consisting of Press-keys
    in configuration Spatial-Pattern-0151, whose left edge is 0.21, right
    edge is 0.99, top edge is 0.00, bottom edge is 0.98.

The key to making the proper connection is that the production matches to the unique SPATIAL-PATTERN specified by the stimulus pattern (Spatial-Pattern-0147), and generates the unique SPATIAL-PATTERN for the response
pattern (Spatial-Pattern-0151). As an example, suppose working memory contains an object of the form:

(OnePattern Object131
  [STATUS MapOnePattern]
  [STIMULUS-COMPONENT-PATTERN On]
  [STIMULUS-SPATIAL-PATTERN Spatial-Pattern-0147]
  [STIMULUS-MINIMUM-X 0.21]
  [STIMULUS-MAXIMUM-X 0.99]
  [STIMULUS-MINIMUM-Y 0.00]
  [STIMULUS-MAXIMUM-Y 0.98])
The connection production would modify this element by adding the description of the corresponding response pattern. The object would then have the form:

(OnePattern Object131
  [STATUS MapOnePattern]
  [STIMULUS-COMPONENT-PATTERN On]
  [STIMULUS-SPATIAL-PATTERN Spatial-Pattern-0147]
  [STIMULUS-MINIMUM-X 0.21]
  [STIMULUS-MAXIMUM-X 0.99]
  [STIMULUS-MINIMUM-Y 0.00]
  [STIMULUS-MAXIMUM-Y 0.98]
  [RESPONSE-COMPONENT-PATTERN Press]
  [RESPONSE-SPATIAL-PATTERN Spatial-Pattern-0151]
  [RESPONSE-MINIMUM-X 0.21]
  [RESPONSE-MAXIMUM-X 0.99]
  [RESPONSE-MINIMUM-Y 0.00]
  [RESPONSE-MAXIMUM-Y 0.98])

The Acquisition of Chunks
Chunk acquisition is a task-independent, primitive capability of the architecture. The acquisition mechanism is therefore implemented as Lisp code, rather than as a set of productions within the architecture. The mechanism continually monitors the execution of the performance model and acquires new chunks from the objects appearing in working memory. It accomplishes this by building productions for the three components of the chunk. There are two principal structural alternatives for this mechanism: (1) the components can be created all at once, or (2) they can be created independently. There are clear trade-offs involved. With the all-at-once alternative the components of a chunk are all created at the same time. The primary advantage of this approach is simplicity in creating the connection component. In order to create a correct connection production, the corresponding stimulus and response SPATIAL-PATTERNS must be known. With the all-at-once alternative the SPATIAL-PATTERNS are directly available because the connection production is created concurrently with the encoding and decoding productions. With
the independent alternative, making this connection is more difficult. The connection production must determine the appropriate SPATIAL-PATTERNS, even though they are denoted by distinct symbols and may not be in working memory at the time. This is difficult, but if possible, it does lead to two advantages over the all-at-once approach. First, it places only a small demand on the capacity of working memory. When the stimulus information is around, the encoding component can be created, and likewise with the decoding component. All of the information does not have to be active at once. Second, transfer of training is possible at a smaller grain size. With the all-at-once alternative transfer of training occurs only when the entire chunk is usable in another task. With the independent alternative individual encoding and decoding components can be shared, because a new connection production can be created during the transfer task that makes use of stimulus and response patterns from the training task. Implementing the independent alternative looked hard enough that the all-at-once alternative was chosen for this initial attempt at building a chunking mechanism. Creating all of the components at once eliminates the problems of the independent alternative by forcing all of the information to be in working memory at the same time. This information exists within the instances of the OnePattern goal. Each instance describes a stimulus pattern and its associated response pattern. Given two of these instances, we have all of the information required to create a chunk. Built into the current chunking mechanism is the knowledge that chunks are based on the data in these goal instances and how patterns are encoded as attributes of the OnePattern objects. Basing the acquisition of chunks on the information in OnePattern goal instances, rather than on the raw stimuli and responses, has the consequence of limiting chunk acquisition to only those patterns that are actually employed by the model during performance of the task. The potentially explosive number of possibilities for chunking is thus constrained to the relatively small set of patterns to which the subject actually attends. Many incidental patterns may be perceived in the process, but practice only improves performance on those components of the task actually performed. Chunks are built out of the two most highly activated instances of the OnePattern goal in working memory, assuming that there are at least two present. These instances represent the two most recently processed patterns. Two of the architectural choices made in Xaps2 were motivated by the need to have two instances of this goal simultaneously active:
1. Competition among objects is limited to within types so that pursuance of other goals would not cause old instances of the OnePattern goal to disappear from working memory.

2. The working-memory threshold is set at 0.0001 so that competition from the current instance of the OnePattern goal does not wipe out the previous instance before there is a chance to chunk them together. This is adequate for the current model but will not be for cases where the patterns take longer to process. This limitation amounts to a reasonable restriction on the length of time over which the chunking process can combine two patterns.
In order to ensure that the two most highly activated OnePattern instances are both from the same trial (we don't want cross-trial chunks), working memory is flushed between trials. This is a kludge intended to simulate the effects of intertrial activity. Once a chunk has been created, we want the model to use it when appropriate, but not to recreate it. If the model were continually recreating the same chunks, production memory would quickly fill up with useless information. This problem breaks down into two subproblems: within-trial duplications, and across-trial duplications. First, consider the problem of within-trial duplication. Suppose a chunk was just created from the two most highly activated OnePattern objects; what is to stop the system from continually recreating the same chunk as long as those two objects are the most activated? To avoid this, the chunking mechanism keeps track of the identifiers of the last two instances that it chunked together. It only creates a new chunk if the identifiers of the two most highly activated instances differ from the stored identifiers. This also is an ad hoc solution necessary until we understand better what the true constraint should be. Across-trial duplications occur when a chunk is created during one trial and then recreated when similar circumstances arise on a later trial. As currently constructed the model will never produce a duplicate of this type. If a chunk already exists that combines two patterns into a higher-level pattern, then the encoding component of the chunk ensures that whenever those two patterns are perceived, the higher-level pattern will also be perceived. The higher-level pattern will be selected for processing instead of the two smaller ones, so there is no possibility of them ever again being the two most recently used (most highly activated) patterns. This does assume error-free performance by the model, a condition that we have taken pains to ensure holds.
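The acquisition step itself can be summarized in a few lines. The sketch below is our own abstraction of the Lisp mechanism described above, not the code actually used: it assumes goal instances are represented as dictionaries with id, activation, stimulus, and response entries, and it returns a plain description of the three components rather than actual Xaps2 productions.

```python
class Chunker:
    """All-at-once chunk acquisition from the two most active OnePattern instances."""
    def __init__(self):
        self.last_chunked_ids = None   # guard against within-trial duplication

    def maybe_acquire(self, one_pattern_instances: list):
        if len(one_pattern_instances) < 2:
            return None
        # The two most highly activated instances are the two most recently
        # processed patterns.
        first, second = sorted(one_pattern_instances,
                               key=lambda g: g["activation"], reverse=True)[:2]
        ids = (first["id"], second["id"])
        if ids == self.last_chunked_ids:
            return None                # same pair as last time: do not recreate
        self.last_chunked_ids = ids
        # All three components are created at once, so the stimulus and
        # response information of both patterns is available together.
        return {
            "encoding":   (first["stimulus"], second["stimulus"]),
            "decoding":   (first["response"], second["response"]),
            "connection": "new stimulus symbol -> new response symbol",
        }
```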
Trial:   1    2    3    4    5    6    7    8    9
Cycles:  106  75   72   74   44   105  74   44   75

Figure 5.11
The nine-trial sequence simulated by the model: • is On, o is Off, and - is don't care. (The figure also lists, for each trial, the task pattern, the chunks used, and the chunks acquired; only the trial numbers and cycle counts are reproduced here.)
" 5. 7
The Results
In this section we present and analyze results from simulations of the complete model, consisting of the production-system architecture, the performance model, and the chunking mechanism. These simulations demonstrate that the model works: the chunking theory can form the basis of a practice mechanism for production-system architectures. In addition these simulations provide a detailed look at the acquisition and use of chunks and verify that the model does produce power-law practice curves. In section 5.1 we showed that the chunking equation, an approximation of the full model, produces curves that are well matched by a power law. Now we can demonstrate it directly, though not analytically, for the exact model of one task.

The Results of a Simulation
The complete model has been run successfully on a specially selected sequence of nine trials for the left hand (five lights only). This sequence was devised especially to illustrate important aspects of the model. For each trial in this sequence, figure 5.11 shows the task to be performed, the chunks used, the chunks acquired, and the reaction time in number of production-system cycles. Figure 5.12 presents an alternative organization of the chunks acquired during this sequence of trials: the chunk hierarchy. Each node in this hierarchy represents one chunk that was acquired. The node's children represent the two subchunks from which that chunk was created. At the most global level these results demonstrate directly that the task
Figure 5.12
The tree of chunks created during the nine-trial simulation
was performed successfully, chunks were acquired, and they did improve the model's performance. Looking in more detail, first examine the relationship between the last column in figure 5.11, the number of cycles per trial, and the third column, the chunks used. We can see that the time to perform a trial is approximately given by

  NumberOfCycles = 13 + (31 × NumberOfPatternsProcessed).   (5.9)

A three-pattern trial takes about 106 cycles (105-106 in the data), a two-pattern trial takes about 75 cycles (72-75 in the data), and a one-pattern trial takes 44 cycles (44 in the data). A chunk is acquired for the first and second patterns used, the second and third patterns used, and so forth, up to the number of patterns in the trial. The number of chunks acquired on a trial is therefore given by

  NumberOfChunksAcquired = NumberOfPatternsProcessed - 1.   (5.10)
The rate of acquisition of chunks is one every 31 cycles, once the constant overhead of 44 cycles per trial (13 plus the time to process the first pattern on the trial) has been removed, satisfying the learning assumption (section 5.2). This learning is demonstrably too fast. For the ten-light task environment, the entire task environment can be learned within log2(10), between three and four, iterations through the task environment (at 1,023 trials per iteration). This could be remedied in one of two ways. The first possibility is to propose that there are actually more chunks to be learned than we have described. For example, the level at which primitive patterns are defined could be too high, or there may be other features of the environment that we are not capturing. The second alternative is that chunks are not learned at every opportunity. Gilmartin (1974) computed a rate of chunk
acquisition of about one every eight to nine seconds, less than one chunk per trial in this task. Without speculating as to the cause of this slowdown, we could model it by adding a parameter for the probability (< 1) that a chunk is learned when the opportunity exists. We do not know which alternative is correct but would not be surprised to find both of them implicated in the final solution. One point clearly illustrated by this sequence of trials is that chunking is hierarchical, without having a strong notion of level. Chunks can be based on primitive patterns, higher-level patterns, or a mixture. The sequence illustrates the following combinations: (1) the creation of chunks from primitive patterns (trials 1, 3, and 6), (2) the creation of chunks from higher-level patterns (trials 4 and 9), (3) the creation of chunks from one primitive pattern and one higher-level pattern (trials 2 and 7), and (4) the creation of no chunks (trials 5 and 8). The Off-lights in the chunks represent the regions in which no On-light should appear (section 5.6). Also illustrated is how the chunks created on one trial can be used on later trials. As one example, look at trials 6 through 8 in figure 5.11. All three trials employ the identical task, containing three On-lights. On trial 6 the three On-lights are processed serially (105 cycles), and two chunks are acquired for the two combinations of two successive On-lights. Notice that the two chunks share the middle On-light as a subpattern. On the following trial, trial 7, the first chunk created on trial 6 is used, taking care of the first two On-lights. All that is left is the third On-light, which is a primitive pattern. The time for trial 7 is 74 cycles, a savings of 31 over trial 6. During trial 7 a chunk is created that covers all three On-lights by combining the two patterns employed during the trial. On trial 8 only one pattern is required, and the trial takes only 44 cycles. Chunks not only improve performance on trials that are exact repetitions of earlier trials; they can also be transferred to trials that merely share a subpattern. Thorndike first described transfer along these lines: "A change in one function alters any other only in so far as the two functions have as factor identical elements." (Thorndike 1913). Trials 1 and 2 illustrate this variety of transfer of training. Both trials have the third and fifth lights On and the fourth light Off but differ in the first two lights. Nevertheless, the chunk created in the first trial is used to speed up the performance of the second trial. The same chunk is also reused in trial 4. The complete model has also been run successfully on a sequence of 20 ten-light trials, with results comparable to those for the five-light sequence.
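Equations (5.9) and (5.10) can be checked directly against the cycle counts in figure 5.11; the short script below just transcribes the two formulas.

```python
def cycles_per_trial(patterns_processed: int) -> int:
    return 13 + 31 * patterns_processed      # equation (5.9)

def chunks_acquired(patterns_processed: int) -> int:
    return patterns_processed - 1             # equation (5.10)

for n in (1, 2, 3):
    print(n, cycles_per_trial(n), chunks_acquired(n))
# 1 pattern  ->  44 cycles, 0 chunks (trials 5 and 8)
# 2 patterns ->  75 cycles, 1 chunk  (e.g., trial 2)
# 3 patterns -> 106 cycles, 2 chunks (e.g., trials 1 and 6)
```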
Figure 5.13
Practice curve predicted by the metasimulation (log-log coordinates). The 408-trial sequence performed by subject 3 (aggregated by five trials).
Simulated Practice Curves
The model is too costly computationally to run the long trial sequences required for the generation of practice curves. The execution time varies with the number of productions in the system (slowing down as chunks are added), but in the middle of the 20-trial sequence, the model took an average of 22 CPU minutes to process each pattern (approximately 31 production-system cycles) on a DecSystem 2060. This deficiency is overcome through the use of a metasimulation, a more abstract simulation of the simulation. The metasimulation is faster than the simulation because it ignores the details of the performance system. It merely keeps track of the chunks that would be created by the model and the patterns that would be used during performance. From this information, and equation (5.9), it estimates the number of cycles that the production-system model would execute. Via this metasimulation, extensive practice curves have been generated. As a start, figure 5.13 shows the practice curve generated by the metasimulation for the 408-trial sequence used for subject 3 (section 5.2). Comparing this curve with the curve for the human subject (figure 5.7), we see a basic similarity, though the human's curve is steeper and has more variability. Seibel ran his subjects for 75 blocks of 1,023 trials each (Seibel 1963).
Figure 5.14
Practice curve predicted by the metasimulation (log-log coordinates). Seventy-five data points, each averaged over a block of 1,023 trials.
To compare the model with this extensive data, the metasimulator was run for the same number of trials. A single random permutation of the 1,023 trials was processed 75 times by the metasimulation. Figure 5.14 shows the practice curve generated by the metasimulation for this sequence of trials. It is clear from this curve that creating a chunk at every opportunity leads to perfect performance much too rapidly, by the third block of trials. A much better curve can be obtained by slowing down the rate of chunk acquisition, per the second suggestion made earlier in this section. We can make a quick, back-of-the-envelope calculation to find a reasonable value for the probability of acquiring a chunk, given the opportunity. To do this, we will make three assumptions:

1. Assume that the model has the opportunity to acquire a chunk each time a pattern is processed and that there is no overhead time.

2. Assume that the time to process a pattern is in the range of times for a simple reaction time, 100 to 400 msec (Card, Moran, and Newell 1984).

3. Assume that it takes 8 to 9 sec to acquire a chunk (Gilmartin 1974).
The probability (p) of acquiring a chunk is essentially the rate of chunking, as measured in chunks per pattern. This rate can be computed by dividing the time per pattern (0.1 to 0.4 sec) by the time per chunk (8.0 to 9.0 sec). Using the extreme values for the two parameters, we find that the probability should be in the interval [0.01, 0.05].
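The arithmetic behind that interval is just the two extreme quotients; in the snippet below the variable names are ours.

```python
time_per_pattern = (0.1, 0.4)   # sec, simple reaction-time range
time_per_chunk = (8.0, 9.0)     # sec, Gilmartin's estimate

p_low = time_per_pattern[0] / time_per_chunk[1]    # 0.1 / 9.0 ~= 0.011
p_high = time_per_pattern[1] / time_per_chunk[0]   # 0.4 / 8.0  = 0.05
print(round(p_low, 3), p_high)                      # -> 0.011 0.05
```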
Figure 5.15
Practice curve predicted by the metasimulation (log-log coordinates). Seventy-five data points, each averaged over a block of 1,023 trials. The probability of creating a chunk, when there is an opportunity, is 0.02. The fitted curve is T = 270N^-.15.
We have chosen to use one value in this interval, p = 0.02. Figure 5.15 shows the results of a metasimulation in which chunk acquisition is slowed down by this factor. This curve is linear in log-log coordinates over the entire range of trials (r² = 0.993). A slight wave in the points is still detectable, but the linearity is not significantly improved by resorting to the generalized power law (r² is still 0.993). We currently have no explanation for this phenomenon. We can only comment that the deviations are indeed small and that similar waves appear to exist in the general power law fit to Seibel's data (figure 5.2), though they are somewhat obscured by noise. If a low probability of chunk acquisition is required in order to model adequately highly aggregated long sequences of trials (figure 5.15), and a high probability is required for an adequate fit to less aggregated, short trial sequences (figure 5.13), then there would be a major problem with the model. Fortunately the one value of 0.02 is sufficient for both cases. Figure 5.16 shows the same 408-trial sequence as figure 5.13, with the only difference being the reduced probability of chunk acquisition. Thus, given a reasonable value for p, the chunking model produces good power-law curves over both small and large trial ranges.
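A metasimulator of the kind described above can be approximated in a few lines. The sketch below is our own reconstruction, not the program actually used: trials are given as ordered lists of On-light indices, known chunks are stored as frozensets, the greedy left-to-right cover is a simplification of the model's selection process, and cycles come from equation (5.9).

```python
import random

def metasimulate(trials, p=0.02, seed=0):
    rng = random.Random(seed)
    chunks = set()                      # known combinations of On-lights
    times = []
    for on_lights in trials:            # each trial: ordered list of On-light indices
        remaining = list(on_lights)
        patterns = []
        while remaining:
            # Use the largest known chunk that starts at the leftmost remaining
            # light and is wholly contained in the trial; else the primitive light.
            best = max((c for c in chunks
                        if set(c) <= set(remaining) and min(c) == remaining[0]),
                       key=len, default=frozenset([remaining[0]]))
            patterns.append(best)
            remaining = [l for l in remaining if l not in best]
        times.append(13 + 31 * len(patterns))            # equation (5.9)
        for a, b in zip(patterns, patterns[1:]):          # one opportunity per pair
            if rng.random() < p:
                chunks.add(frozenset(a) | frozenset(b))
    return times
```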
Figure 5.16
Practice curve predicted by the metasimulation (log-log coordinates). The 408-trial sequence performed by subject 3 (aggregated by five trials). The probability of creating a chunk, when there is an opportunity, is 0.02.
The most important way in which figure 5.15 differs from the human data (figure 5.1) is that the power (α) of the power-law fit is lower for the metasimulation: 0.15 for the metasimulation versus 0.32 for the central linear portion of the subject's curve. One approach to resolving this discrepancy is to examine the metasimulation for parameters that can be modified to produce larger powers. Modification of p, the one parameter mentioned so far, can cause small perturbations in α but is incapable of causing the large increase required. When p was varied over [0.001, 1.0], α only varied over the range [0.03, 0.15] (see note 11). One parameter that can affect α is the number of lights (and buttons) in the task environment. Increasing this number can significantly raise α. With 20 lights and buttons, the metasimulation produced a practice curve with an α of 0.26 (see note 12). For the shorter 408-trial sequence, an α of 0.16 was generated, compared with 0.17 for subject 3 (figure 5.7). While this manipulation yields good results, it is still true that those ten extra lights and buttons don't actually exist in the task environment. An alternative interpretation is required in which these ten virtual lights and buttons are thought of as modeling unspecified other features of the task environment. Given the simulation results in this section, a rough estimate of the cycle time of the Xaps2 production-system architecture can be computed. One
method is to compute the mean response time for the human data; remove some portion of it, say half, as an estimate of the task time outside the scope of the model; and divide the remainder by the mean number of cycles per trial. The value varies with the number of lights used in the simulation (10 or 20) and whether a long simulation is being compared with the Seibel data (figure 5.1) or a short simulation is being compared to subject 3 (figure 5.7), but all four computations yield a value between 3 and 6 msec. The average value is 4 msec. One cycle every 4 msec is a very fast rate. The accepted value for production-system architectures is generally thought to be on the order of the cognitive cycle time, between 25 and 170 msec (Card, Moran, and Newell 1984). Note, however, that the model simulates the cognitive system at a smaller grain size than is normally done. The cognitive cycle is more appropriately identified with the complete processing of one pattern (one iteration through the OnePattern goal). If we ignore the implementation of the model's goal structure as productions and just look at this level of goal-directed processing, the architecture looks remarkably like a conventional serial production system. During each cycle of this higher-level "production system" (a OnePattern goal), we recognize a single pattern (a OneStimulusPattern goal) and act accordingly (a OneResponsePattern goal), approximately 31 cycles. The Xaps2 cycle time of 3 to 6 msec per cycle yields a time estimate for this higher-level cycle of between 93 and 186 msec, with a mean of 124 msec. These times are well within the accepted range for the cognitive cycle time.

5.8  Conclusion
This chapter has reported on an investigation into the implementation of the chunking theory of learning as a model of practice within a production-system architecture. Starting from the outlines of a theory, a working model capable of producing power-law practice curves has been produced. This model has been successfully simulated for one task, a 1,023-choice reaction-time task. During this research we have developed a novel, highly parallel production-system architecture, Xaps2, combining both symbolic and activation notions of processing. The design of this architecture was driven by the needs of this work, but the resulting system is a fully general production-system architecture. Most important, it meets a set of
constraints derived from an analysis of the chunking theory. These constraints must be met by any other architecture in which the chunking theory is embedded. A performance model for the reaction-time task has been implemented as a set of productions within this architecture. Though the architecture provides parallel execution of productions, the control structure of the model is a serially processed goal hierarchy, yielding a blend of serial and parallel processing. The goal hierarchy controls a loop through three tasks: (1) select a stimulus pattern to process, (2) map the stimulus pattern into an appropriate response pattern, and (3) execute the response pattern. Two of these tasks, 1 and 3, require the model to communicate with the outside world. The required stimulus and response interfaces are modeled as two-dimensional Euclidean spaces of patterns. The model perceives patterns in the stimulus space and produces patterns in the response space. As with the production-system architecture, the designs of the control structure and interfaces have been driven by the needs of this work. A second look shows that there is very little actual task dependence in these designs. The control structure, or a slightly more general variant, works for a large class of reaction-time tasks. To this model is added the chunking mechanism. Chunks are acquired from pairs of patterns dealt with by the performance model. Each chunk is composed of a triple of productions: (1) an encoding production that combines a pair of stimulus patterns into a more complex pattern, (2) a decoding production that decomposes a complex response pattern into its simpler components, and (3) a connection production that links the complex stimulus pattern with the complex response pattern. Chunks improve the model's performance by reducing the number of times the system must execute the control loop. Both simulations and metasimulations (simulations of the simulations) of the model have been run. The result is that chunking can improve performance, and it does so according to a power-law function of the number of trials. The results of this investigation have been promising, but there is much work still to be done. One open question is whether these results will hold up for other tasks. As long as the task can be modeled within the control structure described in this article, power-law learning by chunking is to be expected. For radically different tasks, the answer is less certain. To investigate this, the scope of the model needs to be extended to a wider class of tasks.
A number of aspects of the model need improvement as well. The production-system architecture needs to be better understood, especially in relation to the chunking theory and the task models. Oversimplifications in the implementation of the chunking theory, such as allowing only pairwise chunking, need to be replaced by more general assumptions. In addition a number of ad hoc decisions and mechanisms need to be replaced by more well-reasoned and supported alternatives.

Notes

This research was sponsored by the Defense Advanced Research Projects Agency (DOD), ARPA Order No. 3597, monitored by the Air Force Avionics Laboratory under Contract F33615-78-C-1551. We would like to thank John Anderson, Pat Langley, Arnold Rosenbloom, and Richard Young for their helpful comments on drafts of this chapter. Since this work was completed, a task-independent version of chunking has been developed (Rosenbloom and Newell 1986) and integrated into a general problem-solving architecture (Laird, Rosenbloom, and Newell 1986).

1. Power-law curves plot as straight lines on log-log paper.
2. For a summary of alternative models of the power law of practice, see Newell and Rosenbloom (1981). Additional proposals can be found in Anderson (1982) and Welford (1981).
3. A brief summary of this work can be found in Rosenbloom and Newell (1982).
4. When all of the lights on one hand are on, it can be treated as a single response of the whole hand, rather than as five individual responses. In fact the reaction times for these trials are much faster than would be expected if the five lights were being processed separately.
5. Xaps2 is implemented in MacLisp running on a DecSystem 2060.
6. The assumptions in Xaps2 bear a strong family resemblance to those in the Caps architecture (Thibadeau, Just, and Carpenter, 1982).
7. Our ultimate goal is to develop a task-independent implementation of chunking, but until that is possible, we must live with this unsatisfactory but necessary dependence.
8. In this and following examples, the notation has been modified for clarity of presentation. Some of the names have been expanded or modified. In addition, types have been made bold, identifiers are in the normal roman font, attributes are in SMALL CAPITALS, and values are italicized.
9. The syntax of this production has been modified slightly for presentation purposes. Symbols beginning with "=" are variables. This is the only example of the internal form of a production to appear; in the remainder of this chapter we use the paraphrase form.
10. For simplicity of presentation, an additional STATUS attribute is not shown in the following three examples.
11. The range was sampled at 0.001, 0.01, 0.02, 0.1, and 1.0.
12. Each ten-light trial, in a single random permutation of the 1,023 trials, had an additional ten random lights appended to it, to yield 1,023 twenty-light trials. This block of trials was then repeated 75 times.
References

Anderson, J. R. 1976. Language, Memory, and Thought. Hillsdale, N.J.: Lawrence Erlbaum Associates.
Anderson, J. A. 1977. Neural models with cognitive implications. In D. LaBerge and S. J. Samuels (eds.), Basic Processes in Reading. Hillsdale, N.J.: Lawrence Erlbaum Associates.
Anderson, J. R. 1982. Acquisition of cognitive skill. Psychological Review 89, 369-406.
Anderson, J. A., and Hinton, G. E. 1981. Models of information processing in the brain. In G. E. Hinton and J. A. Anderson (eds.), Parallel Models of Associative Memory. Hillsdale, N.J.: Lawrence Erlbaum Associates.
Bower, G. H., and Winzenz, D. 1969. Group structure, coding, and memory for digit series. Experimental Psychology Monograph 80, 1-17.
Card, S. K., Moran, T. P., and Newell, A. 1984. The Psychology of Human-Computer Interaction. Hillsdale, N.J.: Lawrence Erlbaum Associates, in press.
Chase, W. G., and Simon, H. A. 1973. Perception in chess. Cognitive Psychology 4, 55-81.
DeGroot, A. D. 1965. Thought and Choice in Chess. The Hague: Mouton.
Evans, T. G. 1968. A program for the solution of geometric-analogy intelligence test questions. In M. Minsky (ed.), Semantic Information Processing. Cambridge, Mass.: The MIT Press.
Forgy, C., and McDermott, J. 1977. The Ops2 Reference Manual. IPS Note No. 77-50. Department of Computer Science, Carnegie-Mellon University.
Gilmartin, K. J. 1974. An information processing model of short-term memory. Dissertation, Carnegie-Mellon University.
Johnson, N. F. 1972. Organization and the concept of a memory code. In A. W. Melton and E. Martin (eds.), Coding Processes in Human Memory. Washington, D.C.: Winston.
Joshi, A. K. 1978. Some extensions of a system for inference on partial information. In D. A. Waterman and F. Hayes-Roth (eds.), Pattern-Directed Inference Systems. New York: Academic Press.
Laird, J. E., Rosenbloom, P. S., and Newell, A. 1986. Chunking in Soar: The anatomy of a general learning mechanism. Machine Learning 1, 11-46.
McClelland, J. L., and Rumelhart, D. E. 1981. An interactive activation model of context effects in letter perception: Part 1. An account of basic findings. Psychological Review 88(5), 375-407.
Miller, G. A. 1956. The magic number seven plus or minus two: Some limits on our capacity for processing information. Psychological Review 63, 81-97.
Mitchell, T. M., Utgoff, P. E., Nudel, B., and Banerji, R. 1981. Learning problem-solving heuristics through practice. Proceedings of the Seventh International Joint Conference on Artificial Intelligence. Los Altos, Calif.: Morgan-Kaufmann.
Moran, T. P. 1980. Compiling cognitive skill. AIP Memo 150. Xerox PARC.
Neisser, U., Novick, R., and Lazar, R. 1963. Searching for ten targets simultaneously. Perceptual and Motor Skills 17, 427-432.
Neves, D. M., and Anderson, J. R. 1981. Knowledge compilation: Mechanisms for the automatization of cognitive skills. In J. R. Anderson (ed.), Cognitive Skills and Their Acquisition. Hillsdale, N.J.: Lawrence Erlbaum Associates.
Newell, A. 1973. Production systems: Models of control structures. In W. G. Chase (ed.), Visual Information Processing. New York: Academic Press.
Newell, A. 1980. Harpy, production systems and human cognition. In R. Cole (ed.), Perception and Production of Fluent Speech. Hillsdale, N.J.: Lawrence Erlbaum Associates (also available as CMU CSD Technical Report, Sep 1978).
Newell, A., and Rosenbloom, P. S. 1981. Mechanisms of skill acquisition and the law of practice. In J. R. Anderson (ed.), Cognitive Skills and Their Acquisition. Hillsdale, N.J.: Lawrence Erlbaum Associates.
Newell, A., and Simon, H. A. 1972. Human Problem Solving. Englewood Cliffs, N.J.: Prentice-Hall.
Norman, D. A. 1981. Categorization of action slips. Psychological Review 88, 1-15.
Rosenbloom, P. S. 1979. The XAPS Reference Manual.
Rosenbloom, P. S., and Newell, A. 1982. Learning by chunking: Summary of a task and a model. Proceedings of the Second National Conference on Artificial Intelligence. Los Altos, Calif.: Morgan-Kaufmann.
Rosenbloom, P. S., and Newell, A. 1986. The chunking of goal hierarchies: A generalized model of practice. In R. S. Michalski, J. G. Carbonell, and T. M. Mitchell (eds.), Machine Learning: An Artificial Intelligence Approach, Vol. 2. Los Altos, Calif.: Morgan-Kaufmann.
Rumelhart, D. E., and McClelland, J. L. 1982. An interactive activation model of context effects in letter perception: Part 2. The contextual enhancement effect and some tests and extensions of the model. Psychological Review 89, 60-94.
Rumelhart, D. E., and Norman, D. A. 1982. Simulating a skilled typist: A study of skilled cognitive-motor performance. Cognitive Science 6, 1-36.
Seibel, R. 1963. Discrimination reaction time for a 1,023-alternative task. Journal of Experimental Psychology 66(3), 215-226.
Shiffrin, R. M., and Schneider, W. 1977. Controlled and automatic human information processing: II. Perceptual learning, automatic attending, and a general theory. Psychological Review 84, 127-190.
Snoddy, G. S. 1926. Learning and stability. Journal of Applied Psychology 10, 1-36.
Thacker, C. P., McCreight, E. M., Lampson, B. W., Sproull, R. F., and Boggs, D. R. 1982. Alto: A personal computer. In D. P. Siewiorek, C. G. Bell, and A. Newell (eds.), Computer Structures: Principles and Examples. New York: McGraw-Hill.
Thibadeau, R., Just, M. A., and Carpenter, P. A. 1982. A model of the time course and content of reading. Cognitive Science 6, 157-203.
Thorndike, E. L. 1913. Educational Psychology. II: The Psychology of Learning. Bureau of Publications, Teachers College, Columbia University.
Uhr, L., and Vossler, C. 1963. A pattern-recognition program that generates, evaluates, and adjusts its own operators. In E. Feigenbaum and J. Feldman (eds.), Computers and Thought. New York: McGraw-Hill.
VanLehn, K. 1981. On the representation of procedures in repair theory. Technical Report No. CIS-16. Xerox PARC.
Welford, A. T. 1981. Learning curves for sensory-motor performance. Proceedings of the Human Factors Society, 25th Annual Meeting.
The Soar Papers 1983-1985
CHAPTER 6
A Universal Weak Method
J. E. Laird and A. Newell, Carnegie Mellon University
ABSTRACT
The weak methods occur pervasively in AI systems and may form the basic methods for all intelligent systems. The purpose of this paper is to characterize the weak methods and to explain how and why they arise in intelligent systems. We propose an organization, called a universal weak method, that provides the functionality of all the weak methods: A universal weak method is an organizational scheme for knowledge that produces the appropriate search behavior given the available task-domain knowledge. We present a problem solving architecture, called SOAR, in which we realize a universal weak method. We then demonstrate the universal weak method with a variety of weak methods on a set of tasks.

A basic paradigm in artificial intelligence (AI) is to structure systems in terms of goals and methods, where a goal represents the intention to attain some object or state of affairs, and a method specifies the behavior to attain the desired objects or states. Some methods, such as hill climbing and means-ends analysis, occur pervasively in existing AI systems. Such methods have been called weak methods. It has been hypothesized that they form the basic methods for all intelligent systems (Newell, 1969). Further, it has been hypothesized that they all are methods of searching problem spaces (Newell & Simon, 1972; Newell, 1980a). Whatever the ultimate fate of these hypotheses, it is important to characterize the nature of weak methods. That is the purpose of this paper.
We propose that the characterization does not lead, as might be expected, to an organization with a collection of the weak methods plus a method-selection mechanism. Instead, we propose a single organization, called a universal weak method, embedded within a problem-solving architecture, that responds to a situation by behaving according to the weak method appropriate for the agent's knowledge of the task.

Section 1 introduces weak methods as specifications of behavior. Section 2 introduces search in problem spaces and relates it to weak methods. The concepts in these first two sections are familiar, but it is useful to provide a coherent treatment as a foundation for the rest of the paper. Section 3 introduces a specific problem-solving architecture, based on problem spaces and implemented in a production system, that provides an appropriate organization within which to realize weak methods. Section 4 defines a universal weak method and its realization within the problem-solving architecture. Section 5 demonstrates an experimental version of the architecture and the universal weak method. Section 6 discusses the theory and relates it to other work in the field. Section 7 concludes.

1. A brief report of the results of this paper was presented at IJCAI-83 (Laird & Newell, 1983).
1. The Weak Methods
We start by positing an agent with certain capabilities and structure. We are concerned ultimately with the behavior of this agent in some environment and the extent to which this behavior is intelligent. We introduce goals and methods as ways of specifying the behavior of the agent. We then concentrate on methods that can be used when the agent has little knowledge of its environment, namely the weak methods.
1.1. Behavior specification
Let the agent be characterized by being, at any moment, in contact with a task environment (or task domain), which is in some state, out of a set of possible states. The agent possesses a set of operators that affect the state of the environment. The behavior of the agent during some period of time is the sequence of operators that occur during that period, which induces a corresponding sequence of states of the environment. For the purposes of this paper, certain complexities can be left to one side: concurrent or asynchronous operators, continuously acting operators, and autonomous behavior by the environment. The structure of the agent can be decomposed into two parts, [C, Q], where Q is the set of operators and C is the control, the mechanism that determines which operator of Q will occur at each moment, depending on the current environmental state and the past history.

Additional structure is required for a general intelligent agent, namely, that it be a symbol system (Newell, 1980a), capable of universal computations on symbolic representations. This permits, first of all, the creation of internal representations that can be processed to control the selection of external operators. These internal representations can also be cast as being in a state, out of a set of possible states, with operators that change states internally. It is normal in computer science not to distinguish sharply between behavior in an external task environment and in an internal task environment, since they all pose the same problems of control. Having a symbol system also permits the control to be further decomposed into C = [I, S], where S is a symbolic specification of the behavior to be produced in the task environment and I is an interpreter that produces the behavior from S in conjunction with the operators and the state. It is natural to take S to be a program for the behavior of the agent. However, neither the form nor the content of S is given. In particular, it should not be presumed to be limited to the constructs available from familiar programming languages: sequences, conditionals, procedures, iterations and recursions, along with various naming and abstraction mechanisms. Indeed, a basic scientific problem for AI is to discover how future behavior of an agent is represented symbolically in that agent, so as to produce intelligent action.

A critical construct for specifying behavior is a goal. A goal is a symbolic expression that designates an object or state of affairs that is to be attained. Like other control constructs it serves to guide behavior when it occurs in a specification, S, and is properly so interpreted by I. Goal objects or situations can be specified by
means of whatever descriptive mechanisms are available to the agent, and such descriptions may designate a unique situation or a class of situations. A goal does not state what behavior is to be used to attain the goal situation. This part of the specification is factored out and provided by other processes in the agent, which need not be known when the goal is created. However, there must exist a selection process to determine what behaviors will occur to attain the goal. The goal does not include the details of this selection process; however, the goal may contain auxiliary information to aid the selection, such as the history of attempts to attain the goal.

An AI method (hereafter, just a method) is simply a specification of behavior that includes goals along with all the standard programming-language constructs. The creation of a subgoal, as dictated by the method, is often divorced from the attempt to attain the subgoal. This separation gives rise to the familiar goal hierarchy, which consists of the lattice of subgoals and supergoals, and the agenda mechanism, which keeps track of which subgoals to attempt. Goals, methods, and selection processes that link goals to methods provide the standard repertoire for constructing current AI systems.2
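As a purely illustrative gloss on the goal hierarchy and agenda mechanism just mentioned (the class and variable names below are our own assumptions, not drawn from any particular AI language), a minimal Python sketch might look like this:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Goal:
    description: str
    parent: "Goal | None" = None              # supergoal, if any
    subgoals: list = field(default_factory=list)
    achieved: bool = False

    def add_subgoal(self, description: str) -> "Goal":
        sub = Goal(description, parent=self)
        self.subgoals.append(sub)
        return sub

# The agenda keeps track of which subgoals to attempt next; here a simple
# FIFO queue stands in for whatever selection policy the agent uses.
agenda = deque()

top = Goal("solve task")
agenda.append(top.add_subgoal("reach the operator's precondition"))
agenda.append(top.add_subgoal("apply the main operator"))

while agenda:
    goal = agenda.popleft()   # selection process: which subgoal to attempt
    goal.achieved = True      # stand-in for the method that attains it
```

The point of the sketch is only that the lattice of goals and the bookkeeping of which subgoal to try next are separate pieces of machinery.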
1.2. The definition of a weak method

Methods, being an enhancement of programs, can be used for behavioral specifications of all kinds. Indeed, a major objective of the high-level AI languages was to make it easy to create methods that depended intricately upon knowledge of the specific task domain. We consider here the other extreme, namely, methods that are extremely general:

A weak method is an AI method (a specification of behavior using goals and the control constructs of programming languages) with the following two additional features:
1. It makes limited demands for knowledge of the task environment.
2. It provides a schema within which domain knowledge can be used.

Any method (indeed, any behavior specification) makes certain demands on the nature of the task environment for the method to be carried out. These can be called the operational demands. Take for example the goal to find the deepest point of a lake. One method is to use a heavy weight with a rope, and keep moving the weight as long as the anchor goes deeper. One operational demand of this method is that the task environment must include a weight and a rope, and their availability must be known to the agent. Additional effectiveness demands also can exist. Even if the method can be performed (the anchor and rope are there and can be moved), the method may attain the goal only if other facts hold about the environment (the lake does not have different deep areas with shallows between them).

2. Explicit goal structures have been with AI almost from the beginning (Feigenbaum & Feldman, 1963), but did not become fully integrated with programming-language constructs until the high-level AI languages of the early 1970s (Hewitt, 1971; Rulifson, Derksen & Waldinger, 1972). Because of the flexibility of modern Lisp systems, the practice remains to construct ad hoc goal systems, rather than use an integrated goal-containing programming language (however, see Kowalski, 1979).
The fewer the operational demands, the wider the range of environments to which a method can be applied.3 Thus, highly general methods make weak demands on the task environment, and they derive their name from this feature. Presumably, however, the less that is specified about the task environment, the less effective the method will be. Thus, in general, methods that make weak demands provide correspondingly weak performance, although the relationship is not invariable.
The intrinsic character of problematical situations (situations requiring intelligence) is that the agent does not know the aspects of the environment that are relevant to attaining the goal. Weak methods are exactly those that are useful in such situations, because not much need be known about the environment to use them. Weak methods occur under all conditions of impoverished knowledge. Even if a strong method is ultimately used, an initial method is required to gather the information about the environment required by the strong method. This initial method must use little knowledge of the environment, and therefore is a weak method. Thus the major question about weak methods is not so much whether they exist, but their nature and variety.
Weak methods can use highly specific knowledge of the domain, despite their being highly general. They do this by requiring the domain knowledge to be used only for specific functions. For instance, if a weak method uses an evaluation function, then this evaluation function can involve domain-specific knowledge, and an indefinitely large amount of such knowledge.
But the role of that knowledge is strongly proscribed.
The evaluation function is used only to compare states and select the best. A weak method is in effect a method schema that admits many instantiations, depending on the domain knowledge that is used for the specific functions of the method.
1.3. The common weak methods
The two defining features of weak methods do not delineate sharply a subclass of methods. Rather, they describe the character of useful methods that seem in fact to occur in both artificial-intelligence systems and human problem solving. Figure 1-1 provides a list of common weak methods, giving for each a brief informal definition.

Consider the method of hill climbing (HC). It posits that the system is located at a point in some space and has a set of operators that permit it to move to new points in the space. It also posits the existence of an evaluation function. The goal is to find a point in the space that is the maximum of this evaluation function. The method itself consists of applying operators to the current point in the space to produce adjacent points, finally selecting one point that is higher on the evaluation function and making it the new current point. This behavior is repeated until there is no point that is higher, and this final point is taken to attain the goal.

3. There is more to it than this, e.g., whether the knowledge demanded by the method is likely to be available to a problem solver who does not understand the environment.
Generate and test (GT). Generate candidate solutions and test each one; terminate when found.
Hill climbing (HC). To find a maximum point in a space, consider the next moves (operators) from a given position and select one that increases the value.
Simple hill climbing (SHC). Hill climbing with selection of the first operator that advances.
Steepest ascent hill climbing (SAHC). Hill climbing with selection of the operator that makes the largest advance.
Heuristic search (HS). To find an object in a space generated by operators, move from the current position by considering the possible operations and selecting an untried one to apply; test new positions to determine if they are a solution. In any event, save them if they could plausibly lead toward a solution; and choose (from the positions that have been saved) likely positions from which to continue the search.
Means-ends analysis (MEA). In a heuristic search, select the next operator to reduce the difference between the current position and the desired positions (so far as they are known).
Depth-first search (DFS). To find an object in a space generated by operators, do a heuristic search, but always move from the deepest position that still has some untried operators.
Breadth-first search (BrFS). To find an object in a space generated by operators, do a heuristic search, but always move from the least-deep position that still has some untried operators.
Best-first search (BFS). To find an object in a space generated by operators, do a heuristic search, but always move from a position that seems most likely to lead to a solution and still has some untried operators.
Modified best-first search (MBFS). To find an object in a space generated by operators, do a best-first search, but once a position is selected from which to advance, try all the operators at that position before selecting a new position.
A*. Modified best-first search when it is desired to find the goal state at the minimum depth from the starting position.
Operator subgoaling (OSG). If an operator cannot be applied to a position, then set up a subgoal of finding a position (starting from the current position) at which the operator can be applied.
Match (MCH). Given a form to be modified and completed so as to be identical to a desired object, then compare corresponding parts of the form and object and modify the mismatching parts to make them equal (if possible).
Figure 1-1: Some common weak methods.

The operational demands of this method are dictated directly from the prescribed computations. It must be possible to: (1) represent a state in the space; (2) apply an operator to produce a new state; and (3) compare two adjacent states to determine which is higher on the evaluation function.4 These demands are all expressed in terms of the agent's abilities; thus, the demands on the task environment are stated only indirectly. It is a matter of analysis to determine exactly what demands are being made. Thus, hill climbing normally occurs with an explicitly given evaluation function, so that the comparison is done by evaluating each state and comparing the results. But all that is required is comparison between two states, not that an explicit value be obtainable for each state in isolation. Furthermore, this comparison need not be possible on all pairs of states, but only on adjacent states, where one arises from the other by operator application.

4. The method requires additional capabilities of the agent that do not involve the environment, such as selecting operators.
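The three operational demands just listed can be written down as an abstract task interface. The following Python Protocol is only an illustration under our own naming assumptions (the paper does not define such an interface); note that it asks only for a comparison between two adjacent states, not for a numeric value per state.

```python
from typing import Iterable, Protocol, TypeVar

S = TypeVar("S")  # a problem state, in whatever representation the agent uses

class HillClimbingTask(Protocol[S]):
    def initial_state(self) -> S:
        """(1) Represent a state in the space."""
        ...

    def neighbors(self, state: S) -> Iterable[S]:
        """(2) Apply the available operators to produce adjacent states."""
        ...

    def better(self, a: S, b: S) -> bool:
        """(3) Compare two adjacent states; an explicit value for each state
        in isolation is not required."""
        ...
```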
Each of these considerations weakens the demands on the environment, while still permitting the method to be applied. For hill climbing, the difference between operational demands and effectiveness demands is a familiar one. The state with the absolute highest value will always be reached only if the space is unimodal; otherwise only a local hill is climbed, which will yield the global optimum only if the method starts on the right hill. There also exist demands on the environment for the efficiency of hill climbing which go beyond the question of sheer success of the method, such as lack of plateaus and ridges in the space.

Hill climbing is defined in terms of general features of the task environment, namely the operators and the evaluation function. These reflect the domain structure. The method remains hill climbing, even if arbitrarily large amounts of specific domain knowledge are embodied in the evaluation function. But such knowledge only enters the method in a specific way. Even if the knowledge used in the evaluation function implies a way to go directly to the top of the hill, there is no way for such an inference to be detected and exploited.

Many variations of a weak method can exist. For example, hill climbing leaves open exactly which operators will be applied and in which order (if the system is serial); likewise, it does not specify which of the resulting points will be chosen, except that it must be higher on the hill. In simple hill climbing (SHC) the operators are generated and applied in an arbitrary order, with the first up-hill step being taken; in steepest ascent hill climbing (SAHC), all the adjacent points are examined and the one that is highest is taken. Under different environments one or the other of these will prove more efficient, although generally they will both ultimately climb the same hill. Subgoals enter into hill climbing only if the acts of selecting an operator, applying it, or evaluating the result are problematical and cannot be specified in a more definite way. More illustrative of the role of subgoaling is the method of operator subgoaling, which deliberately sets up the subgoal of finding a state in which the given operator is applicable. This subgoal is to be solved by whatever total means are available to the agent, which may include the creation of other subgoals. Operator subgoaling is only one possibility for setting up subgoals, and perhaps it should not be distinguished as a method all by itself. However, in many AI programs, operator subgoaling is the only form of subgoaling that occurs (Fikes, Hart & Nilsson, 1972; Sacerdoti, 1977).
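As a concrete, purely illustrative rendering of the two variants just described, here is a Python sketch of simple and steepest-ascent hill climbing. The function and parameter names are our own assumptions; the comparison and neighbor generation stand in for the domain knowledge that would instantiate the schema.

```python
def simple_hill_climb(state, neighbors, better, max_steps=1000):
    """SHC: take the first adjacent state that improves on the current one."""
    for _ in range(max_steps):
        for candidate in neighbors(state):
            if better(candidate, state):
                state = candidate
                break
        else:
            return state        # no up-hill step exists: a (local) maximum
    return state

def steepest_ascent_hill_climb(state, neighbors, better, max_steps=1000):
    """SAHC: examine all adjacent states and take the best of them."""
    for _ in range(max_steps):
        best = state
        for candidate in neighbors(state):
            if better(candidate, best):
                best = candidate
        if best is state:
            return state        # local maximum reached
        state = best
    return state

# Toy example: climb toward x = 7 on the integers.
final = steepest_ascent_hill_climb(
    0,
    neighbors=lambda x: (x - 1, x + 1),
    better=lambda a, b: abs(a - 7) < abs(b - 7),
)
assert final == 7
```

Both variants climb the same hill here; as the text notes, which one is more efficient depends on the environment, and both stop at the first local maximum they reach.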
1.4. Complex programs are composed of weak methods
The weak methods in Figure 1-1 occur with great frequency in practice. Many others are known, both for AI systems (e.g., iterative deepening in game playing programs) and for humans (e.g., the main method of Polya, 1945). The weak methods appear to be a mainstay of AI systems (Newell, 1969). That is, AI systems rely on weak methods for their problem-solving power. Much behavior of these systems, of course, is specified in highly constrained ways, where the program exhibits limited and stereotyped behavior. The primary exceptions to this occur in modern AI expert systems (Duda & Shortliffe, 1983), which rely as much as possible on large amounts of encoded knowledge to avoid search. But even here many of them are built on top of search methods, e.g., MYCIN.
To provide an indication of how an AI system can be viewed as a composition of weak methods, the GPS of Newell & Simon (1963) can be described as means-ends analysis plus operator subgoaling. To carry out these two weak methods, others are required: matching is used to compare the current state to the desired one; and generate and test is used to select an operator when a difference determines a subset of operators, rather than a unique one. Likewise, generate and test is used to select goals to try if a path fails (although this mechanism is not usually taken to be part of the core of GPS). Given these weak methods, little additional specification is required for GPS: the representations and their associated data operations, the mechanisms for constructing and maintaining a goal tree, a table of fixed associations between differences and operators, and a fixed ordering function on the differences. A similar story could be told for many AI programs, such as Dendral, AM, Strips, EL, and others that are less well known (for examples of earlier programs, see Newell, 1969). To substantiate the claim of the ubiquitous use of weak methods in AI would require recasting a substantial sample of AI programs explicitly in terms of compositions of weak methods, and no such attempt has yet been made. Furthermore, no formulation of weak methods yet exists that provides a notion of a basis or of completeness. In any event, this paper does not depend on such issues, only on the general fact that weak methods play an important enough role to warrant their investigation.
2. The Problem Space Hypothesis
Behavior specifications of an agent lead to behavior by being interpreted by an architecture. Many architectures are possible. However, existing families of architectures are designed primarily to meet the requirements of current programming languages, which are behavior specifications that do not include goals or methods.

A key idea on which to base the architecture for an intelligent agent is search. AI has come to understand that all intelligent action involves search (Newell & Simon, 1976). All existing AI programs that work on problems of appreciable intellectual difficulty incorporate combinatorial search in an essential way, as the examination of any AI textbook will reveal (Nilsson, 1971; Winston, 1977; Nilsson, 1980). Likewise, human problem-solving behavior seems always to exhibit search (Newell & Simon, 1972), though many forms of difficult and creative intellectual activity remain to be investigated from this viewpoint. On the other hand, the essential role of search in simple, routine or skilled behavior is more conjectural. In itself, the symbolic specification of behavior does not necessarily imply any notion of search, as typical current programming languages bear witness. They imply only the creation at one moment of time of a partial specification, to be further specified at later times until ultimately, at performance time, actual behavior is determined. Nevertheless, the case has been argued that a framework of search is involved in all human goal-directed behavior, a hypothesis that is called the Problem Space Hypothesis (Newell, 1980b). We will adopt this hypothesis of the centrality of search and will build an architecture for the weak methods around it. The essential situation is presented in Figure 2-1, which shows abstractly the structure of a general intelligent agent working on a task. As the figure shows, there are two kinds of search involved, problem search and knowledge search.

2.1. Problem search

Problem search occurs in the attempt to attain a goal (or solve a problem). The current state of knowledge of the task situation exists in the agent in some representation, which will be called a problem state (or just a state). The agent is at some initial state when a new goal is encountered. The agent must find a way of moving from its current state to a goal state. The agent has ways of transforming this representation to yield a new representation that has more (or at least different) knowledge about the situation; these transformations will be called operators. The set of possible states plus the operators that permit the movement of the agent from state to state will be called the problem space. This is similar to the situation described in Section 1.1,
except that here the state is a representation that is internal to the agent. The desired state of the goal can be specified in many ways: as a complete state, an explicit list of states, a pattern, a maximizer of a function, or a set of constraints, including in the latter, constraints on the path followed. The agent must apply a sequence of operators, starting at the initial state, to reach a state that satisfies whatever goal specification is given.

Figure 2-1: The framework for intelligent behavior as search (knowledge search and problem search).

Search of the problem space is necessarily combinatorial in general. To see this, note that the space does not exist within the problem solver as a data structure, but instead is generated state by state by means of operator applications, until a desired state is found. If, at a state, there is any uncertainty about which operator is the appropriate one to apply (either to advance along a solution path or to recover from a nonsolution path), this must ultimately translate into actual errors in selecting operators. Uncertainty at
successive states cascades the errors of operator selection and thus produces the familiar combinatorially branching search tree.5 Uncertainty over what to do is the essence of the problematic situation. It is guaranteed by the de novo generation of the space, which implies that new states cannot be completely known about in advance.6

The uncertainty at a state can be diminished by the agent's knowledge about the problem space and the goal. The task for the agent at each state is to bring this search-control knowledge to bear on the functions required at a node of the search tree in Figure 2-1. There is a fixed set of such functions to be performed in searching a problem space (Newell, 1980b):
1. Decide on success (the state is a desired state).
2. Decide on failure or suspension (the goal will not be achieved on this attempt).
3. Select a state from those directly available (if the current state is to be abandoned).
4. Select an operator to apply to the state.
5. Apply the operator to obtain the new state.
6. Decide to save the new state for future use.
In addition there are decisions to be made that determine the goals to be attempted and the problem spaces within which to attempt them. The architecture brings search-control knowledge to bear to perform these functions intelligently. Depending on how much knowledge the agent has and how effective its employment is, the search in the problem space will be narrow and focused, or broad and random.
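The six functions can be read as the skeleton of a generic problem-space search loop. The following Python sketch is our own illustration of that skeleton, not the SOAR architecture itself; every name is hypothetical, and the search-control decisions are supplied as methods on a `control` object.

```python
def problem_space_search(initial_state, operators, control, max_cycles=10_000):
    """Generic problem-space search driven by search-control knowledge.

    `control` supplies the six decisions listed above as methods:
    success(state), failure(state, saved), select_state(saved),
    select_operator(state, operators), and save(state).
    """
    current = initial_state
    saved = [initial_state]                            # states kept for future use
    for _ in range(max_cycles):
        if control.success(current):                   # 1. decide on success
            return current
        if control.failure(current, saved):            # 2. decide on failure/suspension
            return None
        current = control.select_state(saved)          # 3. select a state
        op = control.select_operator(current, operators)  # 4. select an operator
        new_state = op(current)                        # 5. apply the operator
        if control.save(new_state):                    # 6. decide to save the new state
            saved.append(new_state)
            current = new_state
    return None
```

How focused or random the resulting search is depends entirely on what knowledge the `control` object embodies, which is the point of the passage above.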
2.2. Knowledge search

Some process must select the search-control knowledge to be used to make the decisions in the problem space. In a problem solver constructed for a specific and limited task, there is little difficulty in associating the correspondingly limited search-control knowledge with the site of its application. However, a general intelligent agent attempts many tasks and therefore has a large body of potentially applicable knowledge. There is then the need to determine what knowledge is available that is applicable to controlling the search of a particular task. Knowledge is encoded in memory, hence in some extended data base. That is, extended knowledge implies extended memory. Since the problem, as represented in the problem space, is new, the agent cannot know in advance what knowledge is relevant to controlling the search for a solution.
5. In task environments that are densely enough interconnected, actual return to prior states can be avoided in favor of always moving forward from the current "bad" state, seeking a better state; but the essential combinatorial explosion remains.
6. This argument applies equally well to serial and concurrent processing, as long as there is an overall resource limit, i.e., as long as exponential parallel processing is not possible.
Necessarily, then, the data base must be searched for the relevant knowledge. Figure 2-1 shows search-control knowledge being applied only to a single node in the problem space, but knowledge search must occur at each node in the problem search. Hence, knowledge search lies in the inner loop of problem search and its efficiency is of critical importance. Knowledge search differs from problem search in at least one important respect. The data base to be searched is available in advance; hence its accessing structure may be designed to make the retrieval of knowledge as efficient as possible. In consequence, the search is not necessarily combinatorial, as is the search in the problem space. The architectural possibilities for the memory that holds the search-control knowledge are not yet well understood. Much work in AI, from semantic nets to frame systems to production systems, is in effect the exploration of designs for search-control memory.
2.3. Goals and methods in terms of search

The mapping of goals and methods into the search structure of Figure 2-1 is easy to outline. A goal leads to forming (or selecting): (1) a problem space; (2) within that space, the initial state, which corresponds to the agent's current situation; and (3) the desired states, which correspond to the desired situation of the goal. A method corresponds to a body of search-control knowledge that guides the selection and application of operators. A method with subgoals (such as operator subgoaling in Figure 1-1) leads to creating a new goal (with a new problem space and goal states), and to achieving it in the subspace. The new problem space need not actually be different from the original; e.g., in operator subgoaling it is not. Upon completion of the subgoal attempt, activity may return to the original space with knowledge from the subgoal attempt. A goal need not be solved within a single problem space, but may involve many problem spaces, with the corresponding goal hierarchy.

The commitment to use search as the basis for the architecture implies that the weak methods must all be encoded as search in a problem space. A glance at Figure 1-1 shows that many of them fit such a requirement -- heuristic search, hill climbing, means-ends analysis, best-first search, etc. This hardly demonstrates that all weak methods can be naturally so cast, but it does provide encouragement for adopting a search-based architecture.
3. A Problem Solving Architecture
In this section we give a particular architecture for problem solving, based on the search paradigm of the prior section. We will call it SOAR, for State, Operator And Result, which represents the main problem solving act of applying an operator to a state and producing a result. Such an architecture consists of a processing structure where the functions at a node of the search tree are decomposed into a discrete set of actions executable by a machine with appropriate memories and communication links.
3.1. The object context

SOAR has representations for the objects involved in the search process: goals, problem spaces, states and operators. Each primitive representation of an object can be augmented with additional information. The augmentations are an open set of unordered information, being either information about the object, an association with another object, or the problem-solving history of the object. For example, states may be augmented with an evaluation, goals may be augmented with a set of constraints a desired state must satisfy, problem spaces may be augmented with their operators, and a state may be augmented with information detailing the state and the operator that created it. As shown in Figure 3-1, the architecture consists of the current context and the stock. The current context consists of a single object of each type. These objects determine the behavior of the agent in the problem space. The goal in the current context is the current goal; the problem space is the current problem space, and so on. The stock is an unordered memory of all the available objects of each type. The objects in the current context are also part of the stock.
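A minimal way to picture the current context and the stock is sketched below in Python. This is purely illustrative: the class and field names are our assumptions, and augmentations are modeled as a simple dictionary rather than the architecture's own representation.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SoarObject:
    kind: str                  # "goal", "problem space", "state", or "operator"
    name: str
    augmentations: dict = field(default_factory=dict)   # open set of added information

@dataclass
class CurrentContext:
    goal: Optional[SoarObject] = None
    problem_space: Optional[SoarObject] = None
    state: Optional[SoarObject] = None
    operator: Optional[SoarObject] = None

stock = []   # unordered memory of all available objects of each type

s1 = SoarObject("state", "S1", augmentations={"evaluation": 3})
stock.append(s1)
context = CurrentContext(state=s1)   # objects in the current context remain in the stock
```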
to produce the behavior of the weak method. These are all either decision productions, or elaboration productions that produce concepts that are independent of a task (such as depth). The task-dependent concepts used in each weak method are in bold-italics. To complete the method, task-dependent elaboration productions had to be added to compute each such concept. In addition, in actual runs, a task-dependent decision production was added to vote for operators that would apply to the current state. All methods would have worked without this, but it is a useful heuristic that saves uninteresting search where an operator is selected, does not apply, and is then vetoed because it was attempted for the current state.

Examination of each of the method increments in the figure confirms the claim made in Section 4.1 in regard to the fourth factorization condition, namely, that method increments do not specify any control that is not directly related to the task-environment knowledge that specifies the weak method. All the control (the decision productions) deals only with the consequences of the concepts for control of the method behavior. The missing task-dependent productions do not involve any method-control at all, since they are exclusively devoted to computing the concept.12

12. On the other hand, any weak method can be implemented using subgoals to perform some of its decisions, such as operator selection.
Avoid Duplicates (AD)
State: If the current state is a duplicate of a previous state, the state is unacceptable.
Operator: If an operator is the inverse of the operator that produced the current state, veto the operator.

Operator-Selection Heuristic Search (OSHS)
State: If the state fails a goal constraint, it is unacceptable.
Operator: (These are task-specific decision productions that vote for or vote against an operator based on the current state.)

Means-Ends Analysis (MEA)
Operator: If an operator can reduce the difference between the current state and the desired state, vote for that operator.

Breadth-First Search (BrFS)
State: If a state's depth is not known, its depth is the depth of the ancestor state plus one.
State: If a state has a depth that is not larger than any other acceptable state, vote for that state.

Depth-First Search #1 (DFS)
State: If the current state is acceptable, vote for it.
State: If the current state is not acceptable, vote for the ancestor state.

Depth-First Search #2 (DFS)
State: If a state's depth is not known, its depth is the depth of the ancestor state plus one.
State: If a state has a depth that is not smaller than any other acceptable state, vote for that state.

Simple Hill Climbing (SHC)
State: If the current state is not acceptable or has an evaluation worse than the ancestor state, vote for the ancestor state.
State: If the current state is acceptable and has an evaluation better than the ancestor state, vote for the current state.

Steepest Ascent Hill Climbing (SAHC)
State: If the ancestor state is acceptable, vote for the ancestor state.
State: If the current state is not acceptable and a descendent has an evaluation that is not worse than any other descendent, vote for that descendent.
State: If the current state is acceptable, and the ancestor state is not acceptable, vote for the current state.

Best-First Search #1 (BFS)
State: If a state has an evaluation that is not worse than any other acceptable state, vote for that state.

Best-First Search #2 (BFS)
State: If a state has an evaluation that is better than another state, vote for that state.

Modified Best-First Search (MBFS)
State: If the ancestor state is acceptable, vote for the ancestor state.
State: If the current state is not acceptable, and a state has an evaluation that is not worse than any other state, vote for that state.
State: If the current state is acceptable, and the ancestor state is not acceptable, vote for the current state.

A*
State: If a state's estimated distance to the goal plus its depth is not larger than any other state, vote for that state.

Generate and Test (GT)
Note: No incremental productions; the method arises from the structure of the task.

Match (MCH)
Note: No incremental productions; the method arises from the structure of the task.

Figure 5-3: Method increments: The search control for weak methods.
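To show how small such a method increment is, here is an illustrative Python rendering of the Depth-First Search #2 rules above as state-voting functions layered on the default UWM vote. The vote/veto shorthand, dictionary-based states, and function names are our own assumptions, not the production-system notation.

```python
def uwm_state_votes(state, states):
    """Default UWM knowledge: vote for acceptable states, veto unacceptable ones."""
    return "vote" if state["acceptable"] else "veto"

def depth(state):
    """Elaboration: a state's depth is the depth of its ancestor state plus one."""
    ancestor = state.get("ancestor")
    return 0 if ancestor is None else depth(ancestor) + 1

def dfs_state_votes(state, states):
    """DFS #2 decision: vote for an acceptable state whose depth is not smaller
    than that of any other acceptable state (i.e., the deepest one)."""
    acceptable = [s for s in states if s["acceptable"]]
    if state["acceptable"] and depth(state) >= max(depth(s) for s in acceptable):
        return "vote"
    return None

# Example: with the DFS increment added, the deepest acceptable state wins.
root = {"acceptable": True, "ancestor": None}
child = {"acceptable": True, "ancestor": root}
states = [root, child]
winners = [s for s in states
           if uwm_state_votes(s, states) == "vote"
           and dfs_state_votes(s, states) == "vote"]
assert winners == [child]
```

The increment adds only knowledge about a task-independent concept (depth) and its consequence for the state-change decision, which is the factorization claim being checked in the text above.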
Figure 5-4 shows the moves for two weak methods on the Eight Puzzle. The first is simple hill climbing (SHC); the second is depth-first search (DFS). The information added was just the productions in Figure 5-3 for each method. The hill climbing productions had to be instantiated for the particular task because the evaluation function is task-specific. A simple evaluation was used, namely the number of tiles already in their correct place. Additional elaboration productions had to be added to compute this quantity for the current state. Depth-first search, on the other hand, requires no task-specific instantiation, because it depends on an aspect of the behavior of the agent (depth) that is independent of task structure. Examination of the traces will show that the system was following the appropriate behavior in each case.
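The task-specific instantiation mentioned here is just the evaluation function. A sketch of the "tiles already in their correct place" measure might look like the following; the flat-tuple board encoding, the particular start position, and the goal layout are assumptions made for illustration.

```python
GOAL = (1, 2, 3,
        8, 0, 4,
        7, 6, 5)   # 0 marks the blank; this goal layout is assumed for illustration

def tiles_in_place(board, goal=GOAL):
    """Eight Puzzle evaluation: count tiles (not the blank) sitting in their goal cells."""
    return sum(1 for tile, target in zip(board, goal) if tile != 0 and tile == target)

start = (2, 8, 3,
         1, 6, 4,
         7, 0, 5)
assert tiles_in_place(start) == 4   # tiles 3, 4, 7, and 5 are already in place
```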
Adding the simple hill-climbing knowledge to the UWM allows SOAR to solve the problem.13 A typical problem solver using simple hill climbing would reach state S2, find it to be a local maximum, and terminate the search. With the UWM, S2 is also found to be a local maximum, and simple hill climbing does not help to select another state. However, the default search control still contributes to the decision process so that all states receive votes (although S2 is vetoed). To break the tie, SOAR chooses one of the states as a winner. The oldest state always wins in a tie (because of a "feature" in the architecture) and S1 is selected. S7 and S8 are generated, but they are no better than S1. S1 is now exhausted and receives a veto. Another local maximum is reached, so the default search control comes into play again. The oldest, unexhausted state is S3 and it generates S9, which is better than S3. The hill climbing productions select states until the desired state is reached (S11). Depth-first search alone (without AD) does not solve the problem because it gets into a cycle after 12 moves and stays in it because duplicate states are not detected.

[Figure 5-4 traces: the Simple Hill Climbing (SHC) run proceeds from the initial state S1 through S11, ending in success; the Depth-First Search (DFS) run enters a cycle. The board-by-board move listings are not reproduced here.]
Figure 5-4: Behavior of Simple Hill Climbing and Depth-First Search on the Eight Puzzle.
5.4. Results of the demonstration

Figure 5-5 gives a table that shows the results of all the weak methods of Figure 5-3 against all the tasks of Figure 5-1, using the incremental productions specified in Figure 5-3. In all the cases labeled +, the behavior was that of the stipulated weak method. In the cases left blank there did not seem to be any way to define the weak method for the task in question. In the case of methods requiring state-evaluation functions, an evaluation function was created only if it had heuristic value. Although in principle there could be an issue of determining whether a method is being followed, in fact, there is no doubt at all for the runs in question.
13. Simple hill climbing may often lead the search astray in other Eight Puzzle problems. But, again, the effectiveness of these methods is not at issue, only whether the universal weak method can produce the appropriate method behavior.
The structure of the combined set of productions makes it evident that the method will occur, and the trace of the actual run simply serves to verify this.
Method columns: UWM, AD, OSHS, MEA, BrFS, DFS, SHC, SAHC, BFS, MBFS, A*. Task rows: Eight Puzzle, Tower of Hanoi, Missionaries, Water Jug, Picnic I, Picnic II, Picnic III, Labeling,14 Syllogisms, Wason, String match,14 Wizards, Root Finding I, Root Finding II. [The matrix of + entries marking which method was run on which task is not reproduced here.]
Figure 5-5: All methods versus all tasks
All fourteen tasks in Figure 5-5 are marked under UWM, indicating that they were attempted (and sometimes solved) without additional search control knowledge. For Picnic I, there is no additional search control knowledge available, so this is the only method that can be used with it. With Root Finding II, the problem (given the problem space used) is simple enough so that no search control knowledge is necessary to solve it. Four of the tasks (Eight Puzzle, Missionaries, Picnic II and III) can use a variety of search control knowledge to both select operators and states, so that it was possible to use all methods for these tasks. Tower of Hanoi, the Water Jug puzzle, and the Wason task depend on operator selection methods (and avoiding duplicate states) to constrain their searches. The Syllogism and Wizards tasks are monotonic problems, where all of the operators add knowledge to the states, so that state selection is not an issue. However, operator selection is needed to avoid an exponential blowup in the number of operator instantiations that would occur if some operators were applied repeatedly.

14. These tasks were run on a successor architecture to SOAR, being developed to explore universal subgoaling; however, it has the same essential structure with respect to the aspects relevant here.
As mentioned earlier, two weak methods occurred during the demonstration, generate and test, and match, that did not require any incremental productions. In Picnic I, each state is a candidate solution that must be tested when it is created. The UWM carries out the selection and application of the single operator to create the new states, producing a generate-and-test search of the problem space. In the Simple String Match, which is a paradigm case for MCH, the UWM itself suffices to carry out the match, given the operators. The special knowledge that accompanies MCH, that failure at any step implies failure on the task (so that backing up to try the operators in different orders is futile), is embedded in the definition of the problem space. The two examples simply illustrate that the structure of the problem space can play a part in determining the method.
6. Discussion
A number of aspects of the SOAR architecture and the universal weak method require further discussion and analysis.
6.1. Subgoaling

In demonstrating the universal weak method, we excluded all methods that used subgoals, on the grounds that a more fundamental treatment using universal subgoaling was required. Although such a treatment is beyond the scope of this paper, a few additional remarks are in order.
The field has long accepted a distinction, originally due to Amarel (1967), between two fundamentally different approaches to problem solving, the state-space approach and the problem-reduction approach (Nilsson, 1971). The first is search in a problem space. The second is the use of subgoal decomposition, as exemplified in AND/OR search. This separation has always been of concern, since it is clear that no such sharp separation exists in human problem solving or should exist in general intelligence.
In this paper we have been concerned only with one specific mode of operation: goals set up
problem spaces within which problem search occurs to attain the goals.
We need to consider modes of
operation that involve the goal structure.
New subgoals can be created at any point in searching a problem space. This requires determining that a subgoal is wanted, creating the subgoal object and voting the subgoal to become the current goal object. The new subgoal leads (in the normal case) to creating a problem space, and then achieving the goals by search in the subspace.
Reinstating the original goal leads (again, in the normal case) to reinstating the rest of the
object context and then extracting the solution objects of the subgoal from the stock. We have not described the mechanics of this process, which is the functional equivalent of a procedure call and return. It is not without interest, but can be assumed to work as described.
Tasks have been run in SOAR that use
subgoaling.
The normal mode of operation in complex problem solving involves an entire goal hierarchy. In SOAR, this takes the form of many goals in the stock, with decisions about suspending and retrying goals being made by search control voting to oust or reinstate existing goals. The evaluations of which goal to retry are made by the same elaboration-decision cycle that is used for
all other decisions, and they are subject to the same
computational limitations. Similarly, methods that involve subgoals are encoded in search control directly. Such a method operates by having search control vote in subgoals immediately, rather than apply an operator to take a step in the problem space.
,.
How such explicit methods are acquired and become part of search
A UNIVERSAL WEAK METHOD
control knowledge is one more aspect that is beyond this paper. All we have pointed out here is how goal decomposition and problem-space search combine into a single integrated problem-solving organization. In general, subgoals arise because the agent cannot accomplish what it desires with the knowledge and means immediately at hand. Thus, a major source for subgoals are the difficulties that can be detected by the agent. Operator subgoaling, where the difficulty is the inability to apply a selected operator, is the most well known example; but there are others, such as difficulties in selecting operators or difficulties in instantiating partially specified operators. A complete set of such difficulties would lay the base for a universal subgoaling scheme. Subgoals would be created whenever the pursuit of a goal runs into difficulties. Such subgoals would arise at any juncture, rather than only within the confines of prespecified methods.
Universal
subgoaling would complement the universal weak method in that both would be a response to situations that are novel and where, at least at the moment of encounter, there is an absence of extensive search-control knowledge. Central to making subgoaling work is the creation of problem spaces and goal states. Every new subgoal requires a problem space. No doubt many of these can pre-exist in a sufficiently well developed form so that all that is required is an instantiation that is within the capabilities of search control. But more substantial construction of problem spaces is clearly needed as well. The solution that flows from the architecture, and from the mechanism of universal subgoaling just sketched,
is to
create a subgoal to create the new problem
space. Following out this line to problem spaces for creating problem spaces is necessary for the present architecture to be viable. There are indications from other research that this can be successfully accomplished (Hayes & Simon, 1976), but working that out is yet one more task for the future. 6.2. The voting process The architecture uses a voting scheme, which suggests that search control balances off contenders for the decision, the one with the preponderance of weight winning.15
However, a voting scheme provides
important forms of modularity, as well as a means to adjudicate evidence. In a voting scheme, sources of knowledge operate independently of each other, so that sources can be added or deleted without knowing what other sources exist or their nature, and without disrupting the decision process (although possibly altering its outcome). If knowledge sources are highly specialized, so that only a very few have knowledge relevant to a given decision, then a voting scheme is more an opportunity for each source to control the situation it knows about than an opportunity to weigh the evidence from many sources. The balancing aspect of voting then becomes merely a way to deal with the rare cases of conflicting judgement, hopefully without strongly affecting the quality of the decisions. 15
Indeed. a common question is why we don"t admit varying weights on the votes of decision productions.
The weak methods and the tasks used for the demonstration provide a sample of voting situations that can be used to explore the functions that voting actually serves in SOAR. We can examine whether the voting was used to balance and weigh many different sources of knowledge (so that the number and weight of votes is an issue), or whether the productions are highly specialized and act as
experts
on a limited number of
decisions. The situations are limited to state and operator changes, that is, search within a problem space, with no goal or problem-space changes, but the evidence they provide is still important.
The following
analysis is based on an examination of the productions used for each method and the traces of the runs of each task using the different methods. The structure of the state changes is simple. The
UWM votes for all acceptable states and vetoes all
unacceptable states. All of the weak methods have at most one state-change voting production active at a time. It may vote for many states, but the states will all be equivalent to the method. Thus, voting on states fits the expert model, in which balancing of votes does not occur. For operator changes, the result is the same (the voting fits the expert model), but the analysis is a bit more complicated. The
UWM votes for all operators in the current problem space and vetoes all operators that
have already applied to the current state.
The weak methods may have many operator-change voting
productions active at a time. However, the final winner receives a vote from every production that was voting for an operator16 and no votes from a production voting against, or vetoing an operator. There may be many operators receiving the same number of votes, but they have votes from the same productions, and the final selection is made randomly from this set. There is never a balancing of conflicting evidence for the final decision. Each time a production is added, it only refines the decision, by shrinking the set from which the operator is randomly selected. Therefore, the weighting and balancing of votes would not affect the selection of operators.
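The decision procedure being analyzed can be made concrete with a small sketch in Python (our own illustration, not the actual SOAR/XAPS implementation; the function and production representations are invented). Operators receive votes and vetoes from search-control productions, vetoed operators are excluded, and the final selection is made randomly among the top vote-getters:

    import random

    def decide_operator(candidates, productions, context):
        """One operator-change decision: tally votes and vetoes cast by the
        search-control productions, then choose randomly among the viable
        operators with the most votes."""
        votes = {op: 0 for op in candidates}
        vetoed = set()
        for production in productions:
            # Each production returns ('for' | 'against' | 'veto', operator) pairs.
            for judgement, op in production(candidates, context):
                if judgement == 'veto':
                    vetoed.add(op)
                elif judgement == 'for':
                    votes[op] += 1
                elif judgement == 'against':
                    votes[op] -= 1
        viable = [op for op in candidates if op not in vetoed]
        if not viable:
            return None                          # no operator change decided
        best = max(votes[op] for op in viable)
        finalists = [op for op in viable if votes[op] == best]
        return random.choice(finalists)          # ties broken arbitrarily

In the runs analyzed above, the finalists always turn out to have received their votes from the same productions, which is why weighting the votes would not change the outcome.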
6.3. Unlimited memory
We have assumed an unlimited memory capacity, primarily to simplify the investigation.
In fact, each
method is characterized by a specific demand for memory. If the memory available is not sufficient for the method, or if it makes disruptive demands for access, then the method cannot be used. A useful strategy to study the nature of methods is to assume unlimited memory capacity and then investigate which methods can be used in agents with given memory structures. Figure
6-1
gives the memory requirements for the stock of
the weak methods used in the demonstration. This is the requirement beyond that for the current context, including the space for the problem space and operators. Several of the methods require an unbounded stock
16Actually, the final winner receives a vote only from every concept, rather than every production, since it is sometimes necessary to
implement a concept with more than one production, because of the limited power of the production-system language.
for states.
They differ significantly in the rate at which the stock increases and in the type of memory
management scheme needed to control and possibly truncate the stock if memory limits are exceeded.
Method                                   Stock capacity requirement

Universal Weak Method (UWM)              None
Avoid Duplication (AD)                   Unbounded: All states visited
Means-Ends Analysis (MEA)                None
Breadth-First Search (BrFS)              Unbounded: The set of states at the frontier depth
                                         (equal to (branching-factor)^depth)
Depth-First Search (DFS)                 Unbounded: The set of states on the line to the
                                         frontier (equal to the depth)
Simple Hill Climbing (SHC)               One: The ancestor state
Steepest Ascent Hill Climbing (SAHC)     Finite: The next-states corresponding to each
                                         operator, plus the ancestor state; or two, for
                                         the next-state and the best-so-far
Best-First Search (BFS)                  Unbounded: The set of all acceptable states
Modified Best-First Search (MBFS)        Unbounded: The set of all acceptable states
A*                                       Unbounded: The set of all acceptable states
Figure 6-1: Memory requirements of the methods on the stock, beyond the current context.

Capacity limits on the stock affect how well the agent solves problems, but they do not produce an agent that cannot function, i.e., cannot carry out some search in the problem space. Restricting the selection of states to a small number would cause methods such as best-first search to become more like hill climbing.
Problem solving would continue, but the effectiveness of the method might decrease.
However, the
architecture is not similarly protected against all capacity problems. In particular, if the memory is too limited for the problem space to be represented (e.g., because of the number of operators in the problem space), then the agent cannot function.
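As a rough illustration of this point (ours, with invented names), a single best-first step over a capacity-limited stock shows the graceful degradation: with an unbounded stock the code below is ordinary best-first selection, while truncating the stock to one state makes successive steps behave much like hill climbing, yet search in the problem space still goes on.

    def best_first_step(stock, evaluate, expand, capacity=None):
        """Select the best state in the stock, expand it, and keep at most
        `capacity` states for later reconsideration (a sketch; `evaluate`
        and `expand` are task-supplied functions)."""
        current = max(stock, key=evaluate)       # best acceptable state so far
        stock.remove(current)
        stock.extend(expand(current))            # new states join the stock
        stock.sort(key=evaluate, reverse=True)
        if capacity is not None:
            del stock[capacity:]                 # truncate when memory is limited
        return current, stock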
6.4. Computational limits and the uniqueness of the UWM

Processing at a state in the problem space must be computationally limited, both for speed (it lies in the inner loop of the search through the problem space) and for functionality (it must be the unintelligent subprocess whose repeated occurrences give rise to intelligent action). Neither of these constraints puts a precise form on the limitation, and we do not currently have a principled computational model for this processing. Indeed, there may not be one. Neither the architecture nor the universal weak method described here is unique, even given the problem-space hypothesis. Organizations other than the elaboration-decision application cycle could be used for processing at a node. Much more needs to be learned about the computational issues, in order to understand the nature of acceptable architectures and universal weak methods.

The limitation on functionality plays a role in defining the architecture. The central concern is that problem solving on the basic functions of search control, given originally in Section 2, must not be limited by some fixed processing. Otherwise the intelligent behavior of the system would be inherently limited by these primitive acts. Subgoals are the general mechanism for bringing to bear the full intellectual resources of the agent. Thus, the appropriate solution (and the one taken here, although not worked out in this paper) is universal subgoaling, which shifts the processing limitation to the decision to evoke subgoaling.

The temptation is strong to ignore the limit on functionality and locate the intelligence within the processing at a node, and this approach has generated an entire line of problem-solving organizations. These usually focus on an agenda mechanism for what to do next, where the items on the agenda in effect determine both the state and the operator (although the organization is often not described in terms of problem spaces) (Erman, Hayes-Roth, Lesser, & Reddy, 1980; Lenat, 1976; Nilsson, 1980). A particularly explicit version is the notion of metarules (Davis, 1980), which are rules to be used in deciding what rules are to be applied at the object-level, with of course the implied upward recursion to metametarules, etc. The motivation for such an organization is the same concern as expressed above, namely, that a given locus of selection (operators, states, etc.) be made intelligently. The two approaches can be contrasted by the one (metarules) providing a structurally distinct organization for such selections versus the other (SOAR) merging all levels into a single one, with subgoaling to permit the metadecisions to be recast at the object level.

That search control in SOAR operates with limited processing does not mean that the knowledge it brings to
bear is small. In fact, the design problem for search control is precisely to reconcile maximizing the
knowledge brought to bear with minimizing processing. This leads to casting search control as a recognitional architecture, that is, one that can recognize in the present state the appropriate facts or considerations and do so immediately, without extended inferential processing.17

17 The concept of recognition is not, of course, completely well defined, but it does convey the right flavor.

The use of a production system for search control follows upon this view.
Search control is indefinitely extendible in numbers of elaboration and decision productions. The time to select the relevant productions can be essentially independent of the number of productions, providing that the comparisons between the production conditions and the working memory elements are appropriately limited computationally.18 This leads to the notion of a learning process that continually converts knowledge held in other ways into search-control productions. This is analogous to a mechanism of practice (Anderson, Greeno, Kline & Neves, 1981; Newell & Rosenbloom, 1981; Rosenbloom & Newell, 1982) and the modular characteristics of production systems make it possible. These are familiar
properties of production systems; we mention them here because they enter into the computational characterization of search control.
The two phases of search control, elaboration and decision (voting), perform distinct functions and hence cannot be merged totally.
Elaboration converts stored knowledge (in search control) and symbolized
knowledge (in objects) into symbolized knowledge (in the current object). Voting converts stored knowledge and symbolized knowledge into an action (replacement of a current object). However, it is clearly possible to decrease the use of voting until it is a mere recognition of conventional signals associated with the objects, such as select-me and reject-me. In the other direction, shifting voting to respond directly to relevant task structure is equivalent to short circuiting the need to make all decisions explicit, with a possible increase in efficiency.
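A minimal sketch (ours; the tag and function names are invented) of the first of these directions, in which voting shrinks to recognizing conventional signals: elaboration productions attach select-me or reject-me tags, and a single generic decision production merely reads them.

    def elaborate_signals(candidates, evaluate, threshold):
        """Elaboration phase: task-specific knowledge is converted into
        conventional signals attached to the candidate objects."""
        for c in candidates:
            c['tags'].add('select-me' if evaluate(c) >= threshold else 'reject-me')

    def decide_by_recognition(candidates):
        """Decision phase reduced to recognition: veto anything tagged
        reject-me, accept the first thing tagged select-me."""
        viable = [c for c in candidates if 'reject-me' not in c['tags']]
        chosen = [c for c in viable if 'select-me' in c['tags']]
        return chosen[0] if chosen else None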
With limits on the computational power of search control, a universal weak method cannot realize all methods. It cannot even realize all versions of a given weak method. That is, there are varieties of a weak method, say hill climbing, that it can realize and varieties it cannot. It will be able to carry out the logic of the method, but not the computations required for the decisions. Thus, the correct claim for a universal weak method is that it provides for sufficiently computationally simple versions of all the weak methods, not that it can provide for all versions.
Imagine a space of all methods. There will be a distinguished null point, the default method, where no knowledge is available except that the task itself is represented. Then a universal weak method claims to realize a neighborhood of methods around this null point. These are methods that are sufficiently elementary in terms of the knowledge they demand about the task environment and the way they use this knowledge so that the knowledge can be directly converted into the appropriate action within the method. Other methods will be further away from the default point in the amount of knowledge incorporated and the inferences required to determine the method from that knowledge. To create the method from such knowledge requires
techniques analogous to program design and synthesis.

18 Time per cycle can be logarithmic in the number of productions (providing the conditions are still limited), where the units of computation are the elementary comparisons (Forgy, 1982).
Only inside some boundary around the default method will a universal weak method be able to provide the requisite method behavior. Different universal weak methods will have boundaries at different places in the space of methods, depending on how much computational power is embodied in the architecture and how much knowledge is embodied in its search control. An adequate characterization of this boundary must wait until universal subgoaling is added to the universal weak method. But it is clear that the universal weak method is not unique.
6.5. Weak methods from knowledge of the task environment

The weak methods were factored into the universal weak method plus increments for each weak method (SU + SM). The increments were encodings of the particular knowledge of the environment that underlay
the weak methods. One issue not dealt with is the conversion of task knowledge into the elaboration and decision productions of a method increment.
This would have required a scheme for representing task
knowledge independent of the agent to permit the conversion to be analyzed.
We did provide two steps
towards satisfying the condition that weak-method increments be easily derivable from task-environment knowledge. First, we required that a weak-method increment encode only the special knowledge used by the weak method.
Thus, obtaining an increment became a local process.
Second, the method increments
themselves decomposed into concepts (computed by elaboration productions) and conversions of concepts to the appropriate action for the method (computed by the decision productions). Only a few basic concepts occur in the collection of weak methods of Figure
5-3, beyond acceptability and
failure, which occur in the UWM itself. Some are defined on the search behavior of the agent and are task independent: ancestor, current, depth, descendant, previous and produce. The rest depend on the specific task situation: can reduce, difference, duplication, estimated distance to the goal, evaluation and inverse. Other concepts would gradually be added as the number of methods increased. More important than the small number of these concepts is their extreme generality, which captures the fact that only notions that are applicable to almost any task are used in the weak methods. If weak methods are generated by knowledge of the task environment, then weak methods should exist for each different state of knowledge, although conceivably an additional bit of task knowledge might not help. Also, this correspondence of knowledge to methods can only be expected within the computationally feasible region for the universal weak method. Although, as noted, we cannot explore this issue directly without the independent definition of a space of task knowledge, we can explore the issue at the level at which we do have a representation, namely, at the level of productions and their composition into method increments.
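To illustrate the decomposition just described, here is a sketch (ours, not the production-system encoding itself) of the method increment for simple hill climbing: one elaboration rule computes the task-dependent evaluation concept, and one decision rule converts a comparison of evaluations into a vote.

    def elaborate_evaluation(state, evaluate):
        """Elaboration production: compute the 'evaluation' concept once and
        store it as symbolized knowledge on the state object."""
        if 'evaluation' not in state:
            state['evaluation'] = evaluate(state)

    def vote_simple_hill_climbing(current, ancestor):
        """Decision production(s): vote for the current state when it beats
        its ancestor, otherwise fall back to the ancestor."""
        if ancestor is None or current['evaluation'] > ancestor['evaluation']:
            return [('for', current)]
        return [('for', ancestor)]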
One question is whether simply combining the productions of existing method increments produces viable new methods or some sort of degenerate behavior. If the productions from Figure 5-3 are combined with each other, it is possible that productions will interfere with each other, causing the system to thrash among a small number of operators and states without exploring the problem space. Note, to begin with, that the methods in Figure 5-3 fall into two distinct types, either selecting states or selecting operators. Thus, we can break the analysis up into cases. Combining together the SM of a state-selection method with the SM of an operator-selection method will
not cause any interference, but only enhance the search (assuming that both methods are appropriate). This corresponds to improving a state-selection method by further narrowing the operator choices. When operator-selection methods are combined, the productions will never cause the system to cease searching. It is possible that one method will vote for an operator and another vote against it, wiping each other out; but an operator will still be selected, specifically the one with the most total votes without any vetoes. Following the operator selection, the operator will be immediately applied to create a new state and cause the current operator to become undefined.

When state-selection methods are combined, the effects take two steps to work out. When state-selection methods are selecting a state, they combine in the same manner as operator-selection methods. The votes are merely totalled and one state will win and become the current state. After the state is selected, the next decision phase would normally select an operator; however, the state could change again. This second chance at state selection opens the possibility for the reselection of the prior state, setting up an infinite loop. For instance, productions need not be independent of the current state, i.e., they will not vote for a state independent of whether it, or another state, is the current state. In steepest ascent hill climbing, the following production will vote for the ancestor of the current state, but will not vote for that state once it is the current state.

State: If the ancestor state is acceptable, vote for the ancestor state.
If this production alone were combined with a best-first method, an infinite loop would develop starting with the creation of a new best state by operator application. The above production would vote for the ancestor state, while the best-first method would vote for the current state. The tie would be broken by selecting the oldest state, at which point the above rule would not apply, but the best-first rule would vote in the best state, and the cycle would continue. This looping depends on the breadth-first nature of the architecture, but analogous loops can develop with other methods of tie-breaking. This problem does not appear for the methods we have described, because they all have the property that there is the same number of votes for a state independent of whether it is the current state. This is achieved in steepest ascent hill climbing by adding a production that votes for the current state if it is acceptable and its ancestor state is not acceptable. The only
general solution is to require new productions to be added in sets that obey the above property.

Instead of combining all of the productions from one method with another, it should be possible to generate new methods by an appropriate set of productions using the same concepts. Four new methods were in fact created during the investigation. The SM productions for these methods appear in Figure 6-2. Each of these new methods consists of knowledge from previous methods, combined in new ways. All of them were implemented for at least one of the tasks. The first new method, depth-first/breadth-second search (DFBrSS), has a single rule, which states that if the current state is acceptable, it should be voted for. As long as this production is true, the search will be depth first. The current state will remain selected, an operator will be selected, and then applied to create a new current state. This new state will then stay as the current state and the process will continue until an unacceptable state is encountered. At that point, the unacceptable state will be vetoed, and all other states will receive one vote. The tie will be broken by the architecture selecting the oldest acceptable state.19 This gives the breadth-second character of the search. Following the selection by breaking the tie, the search will continue depth-first until another unacceptable state is encountered. Although this method relies on a feature of the architecture, we could add in the knowledge for breadth-first search, modified to apply when the current state is unacceptable, and achieve the same result.

The second new method is actually the implementation of simple hill climbing given earlier. It is included here because it differs from the classical implementation of simple hill climbing in that it falls back on the UWM, which produces a breadth-first search when a local maximum is reached. This can realize classical
simple hill climbing by adding a production that detects the local maximum and returns the current state as the desired state. None of the tasks
in
our demonstration were simple maximization problems, so such a
production was never needed. The third and fourth methods are variations of simple hill climbing that deal with a local maximum
in
different ways. The first will select a descendant of the maximum from which to continue, giving a depth-second search. The second will select the best state other than the maximum, giving a best-second search.
19 This was a "feature" of the XAPS2 architecture.
Depth-First/Breadth-Second Search (DFBrSS) [Used for Tower of Hanoi]
State: If the current state is acceptable, vote for it.

Simple Hill Climbing/Breadth-Second Search (SHCBrSS) [Used for Eight Puzzle, Missionaries, Picnic I and II]
State: If the current state is not acceptable or not as good as the ancestor state, vote for the ancestor state.
State: If the current state is acceptable and better than the ancestor state, vote for the current state.

Simple Hill Climbing/Depth-Second Search (SHCDSS) [Used for Eight Puzzle]
State: If the current state is not as good as the ancestor state, vote for the ancestor state.
State: If the current state is better than the ancestor state, vote for the current state.
State: If the current state is unacceptable, vote for its descendants.
State: If the current state is acceptable, and the ancestor state is unacceptable, vote for the current state.

Simple Hill Climbing/Best-Second Search (SHCBSS) [Used for Eight Puzzle]
State: If the current state is not as good as the ancestor state, vote for the ancestor state.
State: If the current state is better than the ancestor state, vote for the current state.
State: If the current state is unacceptable, vote for the best of its descendants.
State: If the current state is unacceptable, and there is only one descendant, vote for it.
State: If the current state is acceptable, and the ancestor state is unacceptable, vote for the current state.

Figure 6-2: Additional weak methods
6.6. Defining the weak methods

The universal weak method suggests a way to define the weak methods: A weak method is an AI method that is realized by a universal weak method plus some knowledge about the task environment (or the behavior of the agent). This definition satisfies the characterization given at the beginning of Section 1. First, it makes only limited demands on the knowledge of the task environment. The proposed definition starts at the limited-knowledge end of the spectrum and proceeds toward methods that require more knowledge. At some point, as discussed, the form of automatic assimilation required by a universal weak method fails as the knowledge about the task environment becomes sufficiently complex. We do not know where such a boundary lies, but it would seem plausible to take as sufficiently limited any knowledge that could be immediately assimilated. Second, it provides a framework within which domain knowledge can be used. The amount of domain-specific knowledge embedded in the concepts that enters into a weak method is limited in the first instance by the computational limits of search control. As we saw in discussing these limits, there can be an indefinite amount of recognitional knowledge. In the second instance, subgoaling permits still more elaborate computations and the use of further knowledge, providing the use of the results remains as stipulated by the search control of the method increment.
The definition implies that the set of weak methods is relative to the universal weak method. As the latter varies in its characteristics, so too presumably does the set of weak methods. The set is also relative to the knowledge framework that can be used to form the increments to the universal weak methods. Finally, the definition is in terms of the
specification of behavior, not of the
behavior itself. Thus a weak method (e.g.,
hill
climbing) can be represented both by an increment to a universal weak method and by some other specification device, e.g., a Pascal program. All three of these implications are relatively novel, so that it
is not
clear at this juncture whether they make this proposed definition of weak methods more or less attractive.
This proposed definition of weak methods must remain open until some additional parts of the total organization come into being, especially universal subgoaling. Only then can a sufficiently exhaustive set of weak methods be expressed within this architecture to provide a strong test of the proposed definition. Additional aspects must also be examined, for example, fleshing out the existing collection of weak methods in various directions, such as methods for acquiring knowledge and methods for handling the various subgoals generated by universal subgoaling.
6.7. Relation to other work on methods

The work here endeavors to provide a qualitative theory for AI methods. Thus, it does not make immediate contact with much of the recent work on methods, which has sought to apply the research paradigm of algorithmic complexity to AI methods (see Banerji (1982) for a recent review).
However, it is useful to understand our relation to the work of Nau, Kumar and Kanal
(1982).
They have
described a general form of branch and bound that they claim covers many of the search methods in AI. At a sufficiently general level, the two efforts express the same domination of a search framework. In terms of the details of the formulation, the two research efforts are complementary at the moment. They are concerned with a logical coverage of all forms of given methods under instantiation by specifying certain general functions. The intent is to integrate the analysis of a large number of methods. We are concerned with how an agent (not an analyst) can have a common organization for all members of a class of methods, under certain constraints of simplicity and directness. Their algorithm claims universal coverage of all forms of a given method; ours is limited to the neighborhood around the default search behavior as described in Section
6.7.
However, in the longer run it is clear that the two research efforts could grow to speak to identical
research questions -- if their algorithm became the base for a general problem-solving agent or if our universal weak method plus universal subgoaling came to have pretensions of extensive coverage.
Then a detailed
comparison would be useful.
A just-published note by Ernst and Banerji
(1983)
on the distinction between strong and weak problem
solvers is also relevant. They wish to associate the strength of a problem solver with the formulation of the problem space -- strong solvers from good (knowledge-intensive) and weak solvers from weak (knowledge-lean) formulations. Once the formulation (the problem space) is given, then the problem solver is just an interpreter that runs off the behavior.
This view agrees only in part with the one presented in this paper.
Certainly the amount of knowledge available governs the strength of a problem solver -- weak methods use little knowledge. Certainly, also, the entire problem solving system is usefully viewed as an interpreter of a behavior specification -- the
{S, I} of Section 1.1.
Finally, the total knowledge involved in a problem
is
distributed amongst several components: the data structures that define the states, the operators, and the search control. The view of Ernst and Banerji seems to ignore the factorization between the problem space
(state representation and operators) and the search control, treating the latter as an unimportant contributor to the success of problem solving. The theory presented here takes the opposite view -- that after the space
is
given, heuristic knowledge must still be applied to obtain successful problem solving. Thus if the problem solver is just an interpreter, as Ernst and Banerji maintain, it is nevertheless an interpreter that must still solve problems.
7. Conclusion

We have attempted in this paper to take a step towards a theory of AI methods and an appropriate architecture for a general intelligent agent. We introduced a specific problem-solving architecture, SOAR, based on the problem-space hypothesis, which treats all behavior as search and provides a form of behavior specification that factors the control into a recognition-like scheme (the elaboration-decision process) separate from the operators that perform the significant computations (steps in the problem space).
Non-search
behavior arises by the control being adequate to specify the correct operator at each step. This can be viewed as simply another programming formalism, which makes different assumptions about the default situation -- problematical rather than certain, as in standard languages.
On top of this architecture we introduced a universal weak method that provides the ability to perform as any weak method, given appropriate search-control increments that respond only to the special knowledge of the task environment. The existence of a universal weak method has implications for an agent being able to use the weak method appropriate to whatever knowledge it has about the task environment, without separate development of selection mechanisms that link knowledge of the task environment to methods. It also has implications for how weak methods are acquired, since it becomes a matter of acquiring the right elementary concepts with which to describe the environment, and does not require learning each weak method as a separate control structure.
Additional major steps are required to complete the theory. The most notable, and immediate, is universal subgoaling. Entailed therein, in ways not yet completely worked out, is the need for processes of problem space and goal-state creation, since every subgoal must lead to a problem space and description of the goal states in order to provide actual solutions. But there are other steps as well. One is driving the factorization of weak methods back to the descriptive knowledge of the task environment, rather than just the productions of Figure 5-3, which combine descriptive and normative knowledge.
A second is including the processes of
planning, namely the construction of the plan as well as its implementation interpretively. A third is studying how the scheme behaves under the conditions of large bodies of search-control knowledge, rather than the lean search-controls that have been our emphasis here.
References
Amarel, S. An approach to heuristic problem-solving and theorem-proving in the propositional calculus. In Hart, J. & Takasu, S. (Eds.), Systems and Computer Science. Toronto: University of Toronto Press, 1967.
Anderson, J. R., Greeno, J. G., Kline, P. J. & Neves, D. M. Acquisition of problem solving skill. In Anderson, J. R. (Ed.), Cognitive Skills and their Acquisition. Hillsdale, NJ: Erlbaum, 1981.
Banerji, R. B. Theory of problem solving. Proceedings of the IEEE, 1982, 70, 1428-1448.
Davis, R. Meta-rules: Reasoning about control. Artificial Intelligence, 1980, 15, 179-222.
Duda, R. O. & Shortliffe, E. H. Expert systems research. Science, 1983, 220, 261-268.
Erman, L., Hayes-Roth, F., Lesser, V., & Reddy, D. R. The Hearsay-II speech-understanding system: Integrating knowledge to resolve uncertainty. Computing Surveys, June 1980, 12, 213-253.
Ernst, G. W. & Banerji, R. B. On the relationship between strong and weak problem solvers. The AI Magazine, 1983, 4(2), 25-27.
Feigenbaum, E. A. & Feldman, J. (Eds.). Computers and Thought. New York: McGraw-Hill, 1963.
Fikes, R. E., Hart, P. E. & Nilsson, N. J. Learning and executing generalized robot plans. Artificial Intelligence, 1972, 3(4), 251-288.
Forgy, C. L. OPS5 Manual. Computer Science Department, Carnegie-Mellon University, 1981.
Forgy, C. L. Rete: A fast algorithm for the many pattern/many object pattern match problem. Artificial Intelligence, 1982, 19, 17-37.
Hayes, J. R. & Simon, H. A. Understanding complex task instructions. In Klahr, D. (Ed.), Cognition and Instruction. Hillsdale, NJ: Erlbaum, 1976.
Hewitt, C. Description and Theoretical Analysis (using Schemata) of PLANNER: A language for proving theorems and manipulating models in a robot. Doctoral dissertation, MIT, 1971.
Kowalski, R. Logic for Problem Solving. New York: North-Holland, 1979.
Laird, J. Explorations of a Universal Weak Method (Tech. Rep.). Computer Science Department, Carnegie-Mellon University, 1983. (In preparation).
Laird, J. & Newell, A. A universal weak method: Summary of results. In Proceedings of the IJCAI-83. Los Altos, CA: Kaufmann, 1983.
Lenat, D. AM: An Artificial Intelligence Approach to Discovery in Mathematics as Heuristic Search. Doctoral dissertation, Computer Science Department, Stanford University, 1976.
Nau, D. S., Kumar, V. & Kanal, L. A general paradigm for A.I. search procedures. In Proceedings of the AAAI-82. American Association for Artificial Intelligence, 1982.
Newell, A. Heuristic programming: Ill-structured problems. In Aronofsky, J. (Ed.), Progress in Operations Research, III. New York: Wiley, 1969.
Newell, A. Physical symbol systems. Cognitive Science, 1980, 4, 135-183.
Newell, A. Reasoning, problem solving and decision processes: The problem space as a fundamental category. In R. Nickerson (Ed.), Attention and Performance VIII. Hillsdale, NJ: Erlbaum, 1980.
Newell, A. & Rosenbloom, P. Mechanisms of skill acquisition and the law of practice. In Anderson, J. R. (Ed.), Learning and Cognition. Hillsdale, NJ: Erlbaum, 1981.
Newell, A. & Simon, H. A. GPS, a program that simulates human thought. In Feigenbaum, E. A. & Feldman, J. (Eds.), Computers and Thought. New York: McGraw-Hill, 1963.
Newell, A. & Simon, H. A. Human Problem Solving. Englewood Cliffs: Prentice-Hall, 1972.
Newell, A. & Simon, H. A. Computer science as empirical inquiry: Symbols and search. Communications of the ACM, 1976, 19(3), 113-126.
Newell, A., McDermott, J. & Forgy, C. L. Artificial Intelligence: A self-paced introductory course (Tech. Rep.). Computer Science Department, Carnegie-Mellon University, September 1977.
Nilsson, N. Problem-solving Methods in Artificial Intelligence. New York: McGraw-Hill, 1971.
Nilsson, N. Principles of Artificial Intelligence. Palo Alto, CA: Tioga, 1980.
Polya, G. How to Solve It. Princeton, NJ: Princeton University Press, 1945.
Polya, G. Mathematical Discovery, 2 vols. New York: Wiley, 1962.
Rosenbloom, P. S. & Newell, A. Learning by chunking: A production-system model of practice (Tech. Rep.). Computer Science Department, Carnegie-Mellon University, Oct 1982.
Rulifson, J. F., Derksen, J. A. & Waldinger, R. J. QA4: A procedural calculus for intuitive reasoning (Tech. Rep.). Stanford Research Institute Artificial Intelligence Center, 1972.
Sacerdoti, E. D. A Structure for Plans and Behavior. New York: Elsevier, 1977.
Wason, P. C. & Johnson-Laird, P. N. Psychology of Reasoning: Structure and content. Cambridge, MA: Harvard, 1972.
Waterman, D. A. & Hayes-Roth, F. (Eds.). Pattern Directed Inference Systems. New York: Academic Press, 1978.
Winston, P. Artificial Intelligence. Reading, MA: Addison-Wesley, 1977.
This research was sponsored by the Defense Advanced Research Projects Agency (DOD), ARPA Order No. 3597, monitored by the Air Force Avionics Laboratory under Contract F33615-78-C-1551. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the US Government.
CHAPTER 7
The Chunking of Goal Hierarchies: A Generalized Model of Practice P. S. Rosenbloom, Stanford University, and A. Newell, Carnegie Mellon University
Abstract

This chapter describes recent advances in the specification and implementation of a model of practice. In previous work the authors showed that there is a ubiquitous regularity underlying human practice, referred to as the power law of practice. They also developed an abstract model of practice, called the chunking theory of learning. This previous work established the feasibility of the chunking theory for a single 1023-choice reaction-time task, but the implementation was specific to that one task. In the current work a modified formulation of the chunking theory is developed that allows a more general implementation. In this formulation, task algorithms are expressed in terms of hierarchical goal structures. These algorithms are simulated within a goal-based production-system architecture designed for this purpose. Chunking occurs during task performance in terms of the parameters and results of the goals experienced. It improves the performance of the system by gradually reducing the need to decompose goals into their subgoals. This model has been successfully applied to the task employed in the previous work and to a set of stimulus-response compatibility tasks.
10.1 INTRODUCTION
How can systems-both natural and artificial-improve their own performance? At least for natural systems (people, for example), we know that practice is effective. A system is engaged in practice when it repeatedly performs one task or a set of similar tasks. Recently, Newell and Rosenbloom (1981) brought together the evidence that there is a ubiquitous law-the power law of practice-that characterizes the improvements in human performance during practice. The law states that when human performance is measured in terms of the time needed to perform a task, it improves as a power-law function of the number of times the task has been performed (called the trial number). This result holds over the entire domain of human performance, including both purely perceptual tasks, such as target detection (Neisser, Novick, and Lazar, 1963), and purely cognitive tasks, such as supplying justifications for geometric proofs (Neves and Anderson, 1981).

The ubiquity of the power law of practice suggests that it may reflect something in the underlying cognitive architecture. The nature of the architecture is of fundamental importance for both artificial intelligence and psychology (in fact, for all of cognitive science; see Newell, 1973; Anderson, 1983a). It provides the control structure within which thought occurs, determining which computations are easy and inexpensive as well as what errors will be made and when. Two important ingredients in the recent success of expert systems come from fundamental work on the cognitive architecture; specifically, the development of production systems (Newell and Simon, 1972; Newell, 1973) and goal-structured problem solving (Ernst and Newell, 1969; Newell and Simon, 1972). This chapter discusses recent efforts to take advantage of the power law of practice by using it as a generator for a general production-system practice mechanism. This is a highly constrained task because of the paucity of plausible practice models that can produce power-law practice curves (Newell and Rosenbloom, 1981).

As a beginning, Newell and Rosenbloom (1981) developed an abstract model of practice based on the concept of chunking-a concept already established to be ubiquitous in human performance-and derived from it a practice equation capable of closely mimicking a power law. It was hypothesized that this model formed the basis for the performance improvements brought about by practice. In sections 10.2 and 10.3 this work on the power law and the abstract formulation of the chunking theory is briefly summarized.

Rosenbloom and Newell (1982a, 1982b) took the analysis one step further by showing how the chunking theory of learning could be implemented for a single psychological task within a highly parallel, activation-based production system called XAPS2. This work established more securely that the theory is a viable model of human practice by showing how it could actually be applied to a task to produce power-law practice curves. By producing a working system, it also established the theory's viability as a practice mechanism for artificial systems.
The principal weakness of the work done up to that point was the heavy task dependence of the implementation. Both the representation used for describing the task to be performed and the chunking mechanism itself had built into them knowledge about the specific task and how it should be performed. The work reported here is focused on the removal of this weakness by the development of generalized, task-independent models of performance and practice. This generalization process has been driven by the use of a set of tasks that fall within a neighborhood around the task previously modeled. That task sits within an experimental domain widely used in psychology, the domain of reaction-time tasks (Woodworth and Schlosberg, 1954). Reaction-time tasks involve the presentation to a subject of a stimulus display-such as an array of lights, a string of characters on a computer terminal, or a spoken word-for which a specific "correct" response is expected. The response may be spoken, manual (such as pressing a button, pushing a lever, or typing), or something quite different. From the subject's reaction time-the time it takes to make the response-and error rate, it is possible to draw conclusions about the nature of the subject's cognitive processing.

The particular task, as performed by Seibel (1963), was a 1023-choice reaction-time task. It involved ten lights (strung out horizontally) and ten buttons, with each button right below a light. On each trial of the task some of the lights came on while the rest remained off. The subject's task was to respond as rapidly as possible by pressing the buttons corresponding to the lights that were on. There were 2^10 - 1, or 1023, possible situations with which the subject had to deal (excluding the one in which all ten lights were off). Rosenbloom (1983) showed that a general task representation based on the concept of goal hierarchies (discussed in section 10.4) could be developed for the performance of this task. In a goal hierarchy, the root node expresses a desire to do a task. At each level further down in the hierarchy, the goals at the level above are decomposed into a set of smaller goals to be achieved. Decomposition continues until the goals at some level can be achieved directly. This is a common control structure for the kinds of complex problem-solving systems found in artificial intelligence, but this is the first time they have been applied to the domain of reaction-time tasks.

Goal hierarchies also provided the basis for models of a set of related reaction-time tasks known as stimulus-response compatibility tasks. These tasks involve fixed sets of stimuli and responses and a mapping between them that is manipulated. The main phenomenon is that more complex and/or "counterintuitive" relationships between stimuli and responses lead to longer reaction times and more error. A model of stimulus-response compatibility based on goal hierarchies provides excellent fits to the human reaction-time data (Rosenbloom, 1983).

The generalized practice mechanism (described in section 10.5) is grounded in this goal-based representation of task performance. It resembles a form of store-versus-compute trade-off, in which composite structures (chunks) are created that relate patterns of goal parameters to patterns of goal results.
These chunking and goal-processing mechanisms are evaluated by implementing them as part of the architecture of the XAPS3 production system. The XAPS3 architecture is an evolutionary development from the XAPS2 architecture. Only those changes required by the needs of chunking and goal processing have been made. This architecture is described in section 10.6. From this implementation simulated practice results have been generated and analyzed for the Seibel and compatibility experiments (section 10.7).1 Following the analysis of the model, this work is placed in perspective by relating the chunking theory to previous work on learning mechanisms (section 10.8). The theory stakes out an intermediary position among four previously disparate mechanisms, bringing out an unexpected commonality among them. Before concluding and summarizing (section 10.10), some final comments are presented on ways in which the scope of the chunking theory can be expanded to cover more than just the speeding up of existing algorithms (section 10.9). Specifically, the authors describe a way in which chunking might be led to perform other varieties of learning, such as generalization, discrimination, and method acquisition.
10.2 THE POWER LAW OF PRACTICE
Performance improves with practice. More precisely, the time needed to perform a task decreases as a power-law function of the number of times the task has been performed. This basic law, the power law of practice, has been known since Snoddy (1926). This law was originally recognized in the domain of motor skills, but it has recently become clear that it holds over a much wider range of human tasks, possibly extending to the full range of human performance. Newell and Rosenbloom (1981) brought together the evidence for this law from tasks involving perceptual-motor skills (Snoddy, 1926; Crossman, 1959), perception (Kolers, 1975; Neisser, Novick, and Lazar, 1963), motor behavior (Card, English, and Burr, 1978), elementary decisions (Seibel, 1963), memory (Anderson, 1980), routine cognitive skill (Moran, 1980), and problem solving (Neves and Anderson, 1981; Newell and Rosenbloom, 1981).

Practice curves are generated by plotting task performance against trial number. This cannot be done without assuming some specific measure of performance. There are many possibilities for such a measure, including such things as quantity produced per unit time and number of errors per trial. The power law of
1 A more comprehensive presentation and discussion of these results can be found in Rosenbloom
( 1983).
practice is defined in terms of the time to perform the task on a trial. It states that the time to perform the task (T) is a power-law function of the trial number (N):

    T = B N^(-a)                                    (1)

As shown by the following log transform of Equation 1, power-law functions plot as straight lines on log-log paper:

    log(T) = log(B) + (-a) log(N)                   (2)
Figure 10-1 shows the practice curve from one subject in Kolers' study (1975) of reading inverted texts-each line of text on the page was turned upside down-as plotted on log-log paper. The solid line represents the power-law fit to this data. Its linearity is clear (r² = 0.932). Many practice curves are linear (in log-log coordinates) over much of their range but show a flattening at their two ends. These deviations can be removed by using a four-parameter generalized power-law function. One of the two new parameters (A) takes into account that the asymptote of learning can be greater than zero. In general, there is a nonzero minimum bound on performance time, determined by basic physiological limitations and/or device limitations-if, for example, the subject must operate a machine. The other added parameter (E) is required because power laws are not translation invariant. Practice occurring before the official beginning of the experiment-even if it consists only of transfer of training from everyday experience-will alter the shape of the curve, unless the effect is explicitly allowed for.
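Because a power law is a straight line in log-log coordinates, B and a in Equation 1 can be estimated by a least-squares fit to the transformed data. The following sketch is our own illustration of that procedure, not code from the chapter:

    import math

    def fit_power_law(trials, times):
        """Fit T = B * N**(-a) by least squares on log(T) = log(B) - a*log(N)."""
        xs = [math.log(n) for n in trials]
        ys = [math.log(t) for t in times]
        mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
        slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) /
                 sum((x - mean_x) ** 2 for x in xs))
        return math.exp(mean_y - slope * mean_x), -slope   # B, a

    # Data generated from T = 10 * N**-0.5 recovers B close to 10 and a close to 0.5.
    B, a = fit_power_law([1, 2, 4, 8, 16], [10 * n ** -0.5 for n in (1, 2, 4, 8, 16)])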
(DefProd SubGoal/Do-Lights/Do-Press-All-Buttons
  ((Goal <Exists> [NAME Do-Lights] [STATUS Active]
         [MINIMUM-LOCATION = ?Min-Loc] [MAXIMUM-LOCATION = ?Max-Loc])
   (Goal {<Local> <Exists>} [NAME No-Light-Off?] [STATUS Succeeded]))
  ((Goal <New-Object> [NAME Do-Press-All-Buttons] [STATUS Want] [RESULT-TYPE Response]
         [MINIMUM-LOCATION = ?Min-Loc] [MAXIMUM-LOCATION = ?Max-Loc])))
THE CHUNKING OF GOAL HIERARCHIES
The identifier in the first condition of production SubGoal/Do-Lights /Do Pres s-All-Buttons is specified by the special symbol < Exists > . Such a condition succeeds if there is any object in working memory that matches the condition. If there is more than one such object, the choice is conceptually arbitrary (actually, the first object found is used) . Only one instantiation is generated. If the identifier is specified as a variable, the condition still acts like an exists condition, but the identifier can be retrieved, allowing the object to be modified by the actions of the production. The complete opposite of an exists condition is a not-exists condition commonly called a negated condition. When the symbol < Not-Exists> appears in the identifier field, it signifies that the match succeeds only if there is no object in working memory that matches the remainder of the condition pattern. The second condition in the sample production contains the symbol < Local> as well as in the identifier field (they are conjoined syntactically by braces) . When this symbol is added to an identifier pattern of any type-constant, variable, exists, or not-exists-it signifies that the condition should be matched only against those objects in working memory that are local to the active goal. This information provided by the objects' created-by tags-is a means by which the production that works on a goal can determine which objects are part of their local context. The remainder of the condition specifies a pattern that must be met by the attribute-value pairs of working memory objects. These patterns are built out of con stants, variables, built-in predicates (such as < Greater-Than>), and general LISP computations (via an escape mechanism). In addition, any of the above forms can be negated, denoting that the condition only matches objects that do not match the pattern. There is only one kind of action in XAPS3-modifying working memory. The interface to the outside world is through working memory rather than through pro duction actions. Actions can create new objects in working memory and, under cer tain circumstances, modify existing objects. When the action has an identifier of < New-Obj ect > , a new object is created by replacing the identifier with a newly gener ated symbol, instantiating the variables with their values computed during the match, and replacing calls to LISP functions with their values (via another escape mecha nism) . An existing object can be modified by passing its identifier as a parameter from a condition. As discussed in section 10.4, only objects local to the current goal can be modified. There are no production actions that lead to the deletion of values or objects from working memory. A value can be removed only if it is superseded by another value. As discussed in section 10.4, objects go away when they are no longer part of the current context.6 No explicit mechanism for deletion has proved necessary, so none has been included.
6This mechanism is similar to the dampening mechanism in the ACT architecture (Anderson, 1976).
10.6.3 The Cycle of Execution
XAPS3 follows the traditional recognize-act cycle of production-system architectures, with a few twists thrown in by the need to process goals. The recognition phase begins with the match and finishes up with conflict resolution. The cycle number-simulated time-is incremented after the recognition phase, whether any productions are executed or not. During the act phase, productions are executed.

10.6.3.1 Recognition
The match phase is quite simple. All of the productions in the system are matched against working memory in parallel (conceptually). This process yields a set of legal instantiations with at most one instantiation per production, because each condition generates at most one instantiation. Each instantiation consists of a production name and a set of variable bindings for that production that yield a successful match. This set of instantiations is then passed in its entirety to the conflict resolution phase,7 where they are winnowed down to the set to be executed on the current cycle. This winnowing is accomplished via a pair of conflict resolution rules.

The first rule is goal-context refraction. A production instantiation can fire only once within any particular goal context. It is a form of the standard OPS refractory inhibition rule (Forgy, 1981), differing only in how the inhibition on firing is released. With the standard rule, the inhibition is released whenever one of the working memory objects on which the instantiation is predicated has been modified. With goal-context refraction, inhibition is released whenever the system leaves the context in which the instantiation fired. If the instantiation could not legally fire before the context was established but could fire both while the context was active and after the context was terminated, then the instantiation must be based, at least in part, on a result generated during the context and returned when the context was left. Therefore, the instantiation should be free to fire again to reestablish the still-relevant information.

The second rule-the parameter-passing bottleneck-states that only one parameter-passing instantiation can execute on a cycle (conceptually selected arbitrarily). This rule first appeared in a slightly different form as an assumption in the HPSA77 architecture (Newell, 1980a). It will be justified in section 10.6.5.
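A compact sketch (ours; the predicate names are invented) of how these two rules winnow the match set on each cycle:

    import random

    def resolve_conflicts(instantiations, fired_in_context, passes_parameters):
        """Apply the two conflict-resolution rules to the set of matching
        instantiations (assumed hashable), returning those allowed to fire."""
        # Goal-context refraction: an instantiation fires at most once per goal context.
        eligible = [i for i in instantiations if i not in fired_in_context]
        # Parameter-passing bottleneck: at most one parameter-passing
        # instantiation may execute on a cycle, chosen arbitrarily.
        passers = [i for i in eligible if passes_parameters(i)]
        winners = [i for i in eligible if not passes_parameters(i)]
        if passers:
            winners.append(random.choice(passers))
        return winners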
10.6.3.2 Action

All of the instantiations that make it through the conflict resolution phase are fired (conceptually) in parallel. Firing a production instantiation consists of
7 For purposes of efficiency, these two phases are actually intermingled. However, this does not change the behavior of the system at the level at which we are interested.
(1) overwriting the examined-by tags of the working memory objects matched by the instantiation, with the identifier of the active goal; (2) replacing all variables in the actions by the values determined during the match phase; (3) evaluating any LISP forms; and (4) performing the actions. This can result in the addition of new objects to working memory or the modification of existing local objects. If conflicting values are simultaneously asserted, the winner is selected arbitrarily. This does not violate the crypto-information constraint because the architecture is using no information in making the decision. If the performance system works correctly, even when it can't depend on the outcome of the decision, then the learned information will not lead to incorrect performance.

10.6.4 Goal Processing
In section 10.4, the processing of goals was described at an abstract level. In this section how that processing is implemented within the XAPS3 architecture is described. The goals themselves, unlike chunks, are just data structures in working memory. A typical XAPS3 goal goes through four phases in its life. The current phase of a goal is represented explicitly at all times by the value associated with the STATUS attribute of the working memory object representing the goal.

In the first stage of its life, the goal is desired: at some point, but not necessarily right then, the goal should be processed. Productions create goal desires by generating a new object of type Goal. Each new goal object must have a NAME attribute, a STATUS attribute with a value of Want, and a RESULT-TYPE attribute. In addition, it may have any number of other attributes, specifying explicit parameters to the goal. The value of the RESULT-TYPE attribute specifies the type of the results that are to be returned on successful completion of the goal. All local objects of that type are considered to be results of the goal. Results are marked explicitly so they won't be flushed when the context is left and so the chunking mechanism will know to include them in the chunks for the goal. Leaving the expression of goal desires under the control of productions allows goals to be processed by the architecture, while the structure of the hierarchy is still left under program (that is, production) control.

The architecture controls the transition to the second, active phase of the goal's life. At most, one goal is active at any point in time. The architecture attempts to activate a new goal whenever the system is at a loss about how to continue with the current goal. This occurs when there is an empty conflict set; that is, there is no production instantiation that can legally fire on the current cycle. When this happens, the system looks in working memory to determine if there are any subgoals of the current goal-those goals created while the current goal was active-that are desired. If such a subgoal is found, it is made the active goal, and the parent goal is suspended by replacing its STATUS with the identifier of the newly activated subgoal. If more than one desired subgoal is found, one is arbitrarily selected (actually, the last one found is used).
Suspension is the third phase in the life of a goal (it occurs only for nonterminal goals). Replacing the STATUS of the parent goal with the subgoal's identifier accomplishes two things: it halts work on the goal, because the productions that process goals all check for a STATUS of Active, and it maintains the control stack for the goal hierarchy. A suspended goal remains suspended until its active subgoal terminates, at which time it returns to being active. If a goal has more than one subgoal, the goal will oscillate between the active and suspended states.

If no progress can be made on the active goal and there are no desired subgoals, then the system has no idea how to continue making progress; it therefore terminates the active goal with a STATUS of Failed. Following termination, the goal is in its fourth and final phase of life. In addition to a failure termination, goals can be terminated with a STATUS of Succeeded. There are no uniform criteria for determining when an arbitrary goal has completed successfully, so it has been left to productions to assert that this has happened. This is done via the creation of an object of type Succeeded. When the architecture detects the presence of such an object in working memory, it terminates the current goal and reactivates its parent.

At goal termination time a number of activities occur in addition to the obvious one of changing the active goal. The first two activities occur only on the successful termination of a goal. As will be discussed in the following section, the first step is the possible creation of a chunk. The second step is to return the results of the terminating goal to its parent goal. This is accomplished by altering the created-by tags of the results so that it looks as if they were created by the parent goal. The third step is to delete all of the objects from working memory created during the processing of this goal. The fourth and final step is to enable result decoding if it is appropriate.
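The four-phase lifecycle just described can be summarized as a small state machine over goal objects. The sketch below follows the representation described in the text (goal objects carrying NAME, STATUS, and RESULT-TYPE attributes); the helper names and the dict-of-goals layout are assumptions made for illustration, not part of XAPS3.

```python
# Illustrative goal-lifecycle sketch: Want -> Active -> (suspended as parent)
# -> Succeeded/Failed.  Chunk creation and result return are elided.

goals = {}

def desire_goal(gid, name, result_type, **params):
    # Phase 1: a production creates a desired goal (STATUS = Want).
    goals[gid] = {"NAME": name, "STATUS": "Want",
                  "RESULT-TYPE": result_type, **params}

def activate_subgoal(parent_id, sub_id):
    # Phase 2/3: when the conflict set is empty, a desired subgoal becomes
    # active and the parent is suspended by storing the subgoal's identifier
    # in its STATUS slot (this is what maintains the control stack).
    goals[sub_id]["STATUS"] = "Active"
    goals[parent_id]["STATUS"] = sub_id

def terminate(sub_id, parent_id, succeeded):
    # Phase 4: terminate the subgoal and reactivate its parent.  On success,
    # chunk creation, result return, and flushing would also happen here.
    goals[sub_id]["STATUS"] = "Succeeded" if succeeded else "Failed"
    goals[parent_id]["STATUS"] = "Active"


# Example run with two invented goals.
desire_goal("G1", name="Do-Task", result_type="Response")
goals["G1"]["STATUS"] = "Active"
desire_goal("G2", name="Encode-Stimulus", result_type="Symbol")
activate_subgoal("G1", "G2")
terminate("G2", "G1", succeeded=True)
```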
10.6.5 Chunking
As was seen in section 10.5, chunking improves performance by enabling the system to use its experience with previous instances of a goal to avoid expanding the goal tree below it. This section details how that capability has been implemented within the XAPS3 architecture, yielding a working, task-independent production-system practice mechanism. It begins with a description of how chunks are used and concludes with a description of how they are acquired.
10.6.5.1 The Use of Chunks
The key to the behavior of a chunk lies in its connection component. When a goal is proposed in a situation in which a chunk has already been created for it, the connection component of that chunk substitutes for the normal processing of the goal. This is accomplished in XAPS3 by having the connection component check
that the goal's status is desired. The connection is a production containing a condition testing for the existence of a desired goal, a condition testing for the encoding of the goal's parameters, and any relevant negated (<Not-Exists>) conditions. It has two actions: one marks the goal as having succeeded, and the other asserts the goal's encoded result.

If there is a desired goal in working memory in a situation for which a connection production exists, the connection will be eligible to fire. Whether (and when) it does fire depends on conflict resolution. Connection productions are subject to the parameter-passing-bottleneck conflict resolution rule because they pass the identifier of the desired goal as a parameter to the action that marks the goal as having succeeded. This conflict resolution rule serves a dual purpose: (1) it implements part of the bottleneck constraint by insuring that only one goal can be connected at a time, and (2) it removes a source of possible error by insuring that only one connection can execute for any particular goal instance. If the connection does fire, it removes the need to activate and expand the goal, because the goal's results will be generated directly by the connection (and decoding) and the goal will be marked as having succeeded. If instead no connection production is available for the current situation of a desired goal, then eventually the production system will reach an impasse-no productions eligible to fire-and the goal will be activated and expanded. Therefore, we have just the behavior required of chunks: they replace goal activation and expansion, if they exist.

This behavior is, of course, predicated on the workings of the encoding and decoding components of the chunk.8 The encoding component of a chunk must execute before the associated connection can. In fact, it should execute even before the parent goal is activated, because the subgoal's encoded symbol should be part of the parent goal's initial state. Recall that encodings are built up hierarchically. If the parent goal shares all of the parameters of one of its subgoals, then the encoding of the parent goal's parameters should be based on the encoding generated for the subgoal. This behavior occurs in XAPS3 because the encoding components are implemented as goal-free productions that do not pass parameters. They fire (concurrently with whatever else is happening in the system) whenever the appropriate parameter values for their goal are in working memory (subject to refraction). If the parameters exist before the parent goal becomes active, as they must if they are parameters to it, then the encoded symbol becomes part of the parent goal's initial state.

As stated in section 10.4, the decoding component must decode an encoded result when it will be needed. Each decoding component is a production that keys off the nonprimitive result pattern generated by the connection production and off an
8For efficiency, the encoding and decoding components are not created if there is only one parameter or result, respectively.
object of type Decode. When the architecture determines that decoding should occur, it places a Decode object in working memory, with the type of the object to be decoded specified by the TYPE attribute.9 The actions of the decoding production generate the component results out of which the nonprimitive pattern was composed.
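To make the division of labor concrete, the three components of a single chunk can be written down schematically as condition/action rules. The rendering below is only an illustration: the attribute names, the particular symbols (P17, R17), and the rule format are invented for the example; in XAPS3 each component is an actual production.

```python
# Schematic chunk for a hypothetical "Classify" goal with one parameter
# situation.  Invented names throughout; the structure mirrors the text.

chunk = {
    # Encoding: goal-free and parameter-free; fires whenever the goal's
    # parameter values are in working memory, producing a new encoded symbol.
    "encoding": {
        "conditions": [{"type": "Stimulus", "value": 3}],
        "actions":    [{"type": "Encoded-Params", "symbol": "P17"}],
    },
    # Connection: tests for the desired goal plus the encoded parameters
    # (and any relevant negated conditions); marks the goal Succeeded and
    # asserts the encoded result.
    "connection": {
        "conditions": [{"type": "Goal", "NAME": "Classify", "STATUS": "Want"},
                       {"type": "Encoded-Params", "symbol": "P17"}],
        "actions":    [{"type": "Goal", "STATUS": "Succeeded"},
                       {"type": "Encoded-Result", "symbol": "R17"}],
    },
    # Decoding: keys off the encoded result and a Decode object, and
    # regenerates the component results.
    "decoding": {
        "conditions": [{"type": "Encoded-Result", "symbol": "R17"},
                       {"type": "Decode", "TYPE": "Response"}],
        "actions":    [{"type": "Response", "value": "press-key-3"}],
    },
}
```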
10.6.5.2 The Acquisition of Chunks

A complete specification of the chunk acquisition process must include the details of when chunks can be acquired and from what information they are built. A chunk can be acquired when three conditions are met. The first condition is that some goal must have just been completed. The system can't create a chunk for a goal that terminated at some point in the distant past, because the information on which the chunk must be based is no longer available. Chunks also are not created prior to goal completion (on partial results). Chunks are simple to create in part because they summarize all of the effects of a goal. If chunks were created partway through the processing of a goal-for partial results of the goal-then a sophisticated analysis might be required in order to determine which parameters affect which results and how. This is not really a limitation on what can be chunked, because any isolable portion of the performance can be made into a goal.

The second condition is that the goal must have completed successfully. Part of the essence of goal failure is that the system does not know why the goal failed. This means that the system does not know which parameter values have led to the failure; thus, it can't create a chunk that correctly summarizes the situation. Chunking is success-driven learning, as opposed to failure-driven learning (see, for example, Winston, 1975).

The third and final condition for chunk creation is that all of the working memory modifications occurring since the goal was first activated must be attributable to that goal, rather than to one of its subgoals. This condition is implemented by insuring that no productions were fired while any of the goal's subgoals were active. All of the subgoals must either be processed by a chunk or fail immediately after activation-failure of a subgoal, particularly of predicates, does not necessarily imply failure of the parent goal.

To summarize, a chunk is created after the system has decided to terminate a goal successfully but before anything is done about it (such as marking the goal succeeded, returning the results, or flushing the local objects from working memory). At that point the goal is still active, and all of its information is readily available. Most of the information on which the chunk is based can be found in working memory (but see below). The first piece of information needed for the creation of a chunk is the name (and identifier) of the goal that is being chunked. This information
9Matters are actually somewhat more complicated (Rosenbloom, 1983).
is found by retrieving the object representing the active goal. The goal's explicit parameters are also available as attribute-value pairs on the goal object. Given the goal's identifier, the system finds its implicit parameters by retrieving all of the objects in working memory that were part of the goal's initial state-that is, their created-by tag contains an identifier different from that of the active goal-and that were examined by a production during the processing of the active goal. This last type of information is contained in the objects' examined-by tags. The goal's results are equally easy to find. The architecture simply retrieves all of the goal's local objects that have a type equal to the goal's RESULT-TYPE.

Because the goal parameter and result information is determined from the constant objects in working memory, chunks themselves are not parameterized. Each chunk represents a specific situation for a specific goal. However, two forms of abstraction are performed during the creation of a chunk: (1) the inclusion of only the implicit parameters of a goal and not the entire initial state, and (2) the replacement of constant identifiers (found in the working memory objects) with neutral specifications. These abstractions allow the chunks to be applicable in any situation that is relevantly identical, not merely totally identical. Different chunks are needed only for relevant differences.

The one complication in the clean picture of chunk acquisition so far presented involves the use of negated conditions during the processing of a goal. When a negated condition successfully matches working memory, there is no working memory object that can be marked as having been examined. Therefore, some of the information required for chunk creation cannot be represented in working memory. The current solution for this problem is not elegant, but it works. A temporary auxiliary memory is maintained, into which is placed each nonlocal negated condition occurring on productions that fire during the processing of the goal (local negated conditions can be ignored because they do not test the initial state). This memory is reinitialized whenever the goal that is eligible to be chunked changes. Before the conditions are placed in the memory they are fully instantiated with the values bound to their variables by the other conditions in their production. As discussed in Rosenbloom (1983), including a negated condition in an encoding production can lead to performance errors, so these conditions are all included in the associated connection production.
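Under the tagging scheme just described, collecting a chunk's ingredients reduces to a few filters over working memory. The sketch below is a minimal illustration under assumed field names (created-by, examined-by, type); it ignores the negated-condition bookkeeping discussed above.

```python
# Illustrative sketch: gather explicit parameters, implicit parameters, and
# results for a goal at successful termination.  Field names are assumptions.

def collect_chunk_info(working_memory, goal):
    gid = goal["id"]

    # Explicit parameters: extra attribute-value pairs on the goal object.
    explicit_params = {a: v for a, v in goal.items()
                       if a not in ("id", "NAME", "STATUS", "RESULT-TYPE")}

    # Implicit parameters: objects from the goal's initial state (created by
    # some other goal) that were examined while this goal was active.
    implicit_params = [o for o in working_memory
                       if o["created-by"] != gid and o["examined-by"] == gid]

    # Results: local objects whose type matches the goal's RESULT-TYPE.
    results = [o for o in working_memory
               if o["created-by"] == gid and o["type"] == goal["RESULT-TYPE"]]

    return explicit_params, implicit_params, results
```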
10.7 RESULTS
This section presents some results from applying the XAPS3 architecture to a set of reaction-time tasks. A more complete presentation of these results can be found in Rosenbloom (1983). The first experiment described here is the Seibel task. Two different sequences of trials were simulated, of which the first
sequence is the same as the one used in Rosenbloom and Newell (1982a). The simulation completed 268 trials before it was terminated.10 A total of 682 productions was learned. On the second sequence of trials-from a newly generated random permutation of the 1023 possibilities-259 trials were completed before termination. For this sequence, 652 productions were learned.

Figure 10-7 shows the first sequence as fit by a general power law. Each point in the figure represents the mean value over five data points (except for the last one, which includes only three).11 For this curve, the asymptote parameter (A) has no effect. Only E, the correction for previous practice, is required to straighten out the curve. At first glance, it seems nonsensical to talk about previous practice for such a simulation, but a plausible interpretation is possible. In fact, there are two independent explanations-either or both may be responsible.

The first possibility is that the level at which the terminal goals are defined is too high (complex). If the "true" terminals are more primitive, then chunking starts at a lower level in the hierarchy. One view of what chunks are doing is that they are turning their associated (possibly nonterminal) goals into terminal goals for particular parameter situations. During preliminary bottom-up chunking, the system would eventually reach the lowest level in the current hierarchy. All of the practice prior to that point is effectively previous practice for the current simulation.
Figure 10-7: General power-law fit to 268 simulated trials of the Seibel (1963) task. The fitted curve, plotted as reaction time against trial number (N + E) on log-log axes, is T = 0.0 + 5187(N + 33)^-1.00.
10At this point the FRANZLISP system-which appeared not to be garbage collecting in the first place-refused to allocate any more memory for the simulation.

11These data appear noisier than the human data from Seibel (1963) shown in figure 10-2. This is accounted for by the fact that each point in figure 10-2 was the mean of 1023 data points, whereas each point in this figure is the mean of five data points.
The other source of previous practice is the goal hierarchy itself. This structure is posited at the beginning of the simulation, hence it is already known perfectly. However, there must exist some process of method acquisition by which the subject goes from the written (or oral) instructions to an internal goal hierarchy. Though method acquisition does not fall within the domain of what is here defined as "practice," a scheme will be proposed in section 10.9 whereby chunking may lead to a mechanism for method acquisition.

In addition to the Seibel task, the system so far has been applied to fourteen tasks from three different stimulus-response compatibility experiments (Fitts and Seeger, 1953; Morin and Forrin, 1962; Duncan, 1977). As a sample of these results, figure 10-8 shows a pair of simulated practice curves for two tasks from Fitts and Seeger (1953). These curves contain fifty trials each, aggregated by five trials per data point.

This chapter is not the appropriate place to discuss the issues surrounding whether the simulated practice curves are truly power laws or something slightly different (such as exponentials). Suffice it to say that a mixture of exponential, power-law, and ambiguous curves is obtained. These results roughly follow the predictions of the approximate mathematical analysis of the chunking theory appearing in Rosenbloom (1983). They also fit with the observation that the human curves tend to be most exponential for the simplest tasks-the compatibility tasks are among the simplest tasks for which we have human practice data. For more on this issue, see Newell and Rosenbloom (1981) and Rosenbloom (1983).
Figure 10-8: Simulated practice curves for conditions SA-RA and SB-RA from Fitts and Seeger (1953). The latter curve is the average over two hierarchy variations. The fitted curves are T = 43.6N^-0.82 (SA-RA) and T = 67.5N^-0.57 (SB-RA).
10.8 RELATIONSHIP TO PREVIOUS WORK
The current formulation of the chunking theory of learning provides an interesting point of contact among four previously disparate concepts: (1) classical chunking; (2) production composition (Lewis, 1978; Neves and Anderson, 1981; Anderson, 1982b); (3) table look-up, in the form of memo functions (Michie, 1968) and signature tables (Samuel, 1967); and (4) macro-operators (Fikes, Hart, and Nilsson, 1972; Korf, 1983). Classical chunking has already been discussed in section 10.3, so this section covers only the latter three ideas, followed by a proposal about the underlying commonality among these concepts.
10.8.1 Production Composition
Production composition (Lewis, 1978; Neves and Anderson, 1981; Anderson, 1982b) is a learning scheme whereby new productions are created through the combination of old ones. Given a pair of productions that execute successively, the system creates their composition from their conditions and actions (figure 10-9). The condition side of the new production consists of all of the conditions of the first production (C1, C2, and C3), plus those conditions from the second production that do not match actions of the first production (C5). The conditions of the second production that match actions of the first production (C4 matches A4) are not included in the composition (removing the serial dependency between the two productions). All of the actions from both productions are combined in the action side of the new production (A4 and A6). The resulting composition is a single production that accomplishes the combined effects of the older pair of productions. As learning continues, composed productions can themselves be composed, until there is a single production for an entire task.

In some recent work with the GRAPES system (Sauers and Farrell, 1982; Anderson, Farrell, and Sauers, 1982; Anderson, 1983b), production composition was integrated with goal-based processing. In GRAPES, specific goals are designated by the programmer to be ones for which composition will occur.
Figure 10-9: An example of production composition.
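The composition rule described above (and pictured in Figure 10-9) can be sketched in a few lines if conditions and actions are treated as simple patterns and "matching" is reduced to equality. This is only an illustration; a real composition mechanism must also reconcile variables across the two productions.

```python
# Illustrative sketch of composing two successively executed productions.
# w1..w3 play the role of C1..C3, w4 is the shared pattern (C4 matches A4),
# w5 plays the role of C5, and w6 of A6.

def compose(p1, p2):
    # Conditions: all of p1's, plus those of p2 not produced by p1's actions.
    conditions = list(p1["conditions"]) + \
                 [c for c in p2["conditions"] if c not in p1["actions"]]
    # Actions: everything both productions do.
    actions = list(p1["actions"]) + list(p2["actions"])
    return {"conditions": conditions, "actions": actions}

p1 = {"conditions": ["w1", "w2", "w3"], "actions": ["w4"]}
p2 = {"conditions": ["w4", "w5"],       "actions": ["w6"]}
print(compose(p1, p2))
# -> {'conditions': ['w1', 'w2', 'w3', 'w5'], 'actions': ['w4', 'w6']}
```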
When such a goal completes successfully, all of the productions that executed during that time are composed together, yielding a single production that accomplishes the goal.

Because the main effects of chunking and goal-based composition are the same-the short-circuiting of goals by composite productions-it is probably too early to know which mechanism will turn out to be the correct one for a general practice mechanism. However, there are a number of differences between them worth noting. We will focus on the three most important: (1) the knowledge required by the learning procedure; (2) the generality of what is learned; and (3) the hierarchical nature of what is learned.
10.8.1.1 Knowledge-Source Differences
With chunking, all of the information required for learning can be found in working memory (modulo negated conditions). With production composition, the information comes from production memory (and possibly from working memory). Being able to ignore the structure of productions has two advantages. The first advantage is that the chunking mechanism can be much simpler. This is both because working memory is much simpler than production memory-productions contain conditions, actions, variables, function calls, negations, and other structures and information-and because, with composition, the complex process of matching conditions of later productions to actions of previous productions is required, in order that conditions that test intermediate products not be included in the composition. Chunking accomplishes this by including only objects that are part of the goal's initial state. The second advantage of the chunking strategy, of learning from working memory, is that chunking is applicable to any goal, no matter what its internal implementation is (productions or something else). As long as the processing of the goal leaves marks on the working memory objects that it examines, chunking can work.
10.8.1.2 Generalization Differences
The products of chunking are always constant productions (except for the identifiers of objects) that apply only for the situation in which they were created (although, as already discussed, two forms of abstraction are performed). With production composition, the variables existing in the productions to be composed are retained in the new production. The newly learned material is thus more general than that learned by chunking. The chunking mechanism definitely has more of a table look-up flavor. Section 10.8.2 contains a more thorough discussion of chunking as table look-up, and section 10.9 discusses how a chunking mechanism could possibly learn parameterized information.
10.8.1.3 Hierarchical Differences
Both mechanisms learn hierarchically in that they learn for goals in a hierarchy. They differ in how they decide which goals to learn about and in whether the learned material is itself hierarchical. Chunking occurs bottom up in the goal hierarchy. Production composition-in GRAPES at least-works in isolation on any single goal in the hierarchy. For this to work, subgoals are kept as actions in the new productions. The composition approach is more flexible, but the chunking approach has two compensating advantages. The first advantage is that, with chunking, the encoding and decoding components can themselves be hierarchical, based on the encoding and decoding components of the previously chunked subgoals. Productions produced by composition tend to accumulate huge numbers of conditions and actions because they are flat structures. The second advantage is again simplicity. When information is learned about a goal at an arbitrary position in the hierarchy, its execution is intermingled with the execution of its subgoals. Knowing which information belongs in which context requires a complete historical trace of the changes made to working memory and the goals that made the changes.
10.8.2 Table Look-up
It has been seen that from one point of view chunking resembles production composition. From another point of view it resembles a table look-up scheme, in which a table of input parameters versus results is gradually learned for each goal in the system. As such, it has two important predecessors-memo functions (Michie, 1968; Marsh, 1970) and signature tables (Samuel, 1967).
10.8.2.1 Memo Functions
A memo function12 is a function with an auxiliary table added. Whenever the function is evaluated, the table is first checked to see if there is a result stored with the current set of parameter values. If there is, it is retrieved as the value of the function. Otherwise, the function is computed and the arguments and result are stored in the table. Memo functions have been used to increase the efficiency of mathematical functions (Michie, 1968; Marsh, 1970) and of tree searches (Marsh, 1970).
12Memo functions themselves are derived from the earlier work by Samuel (1959) on creating a rote memory for the values of board positions in the game of checkers.
Chunking can be seen as generating memo functions for goals. But these are hierarchical memo functions, and ones in which the arguments need not be specified explicitly. Chunking also provides a cleaner implementation of the ideas behind memo functions because the table is not simply an add-on to a different processing structure. It is implemented by the same "stuff" (productions) as is used to represent the other types of processing in the system.
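The table look-up idea behind memo functions is easy to render in a few lines. The sketch below is a modern Python rendering for illustration only (the original memo functions were not, of course, written this way); the Fibonacci function is just a stand-in example of an expensive computation.

```python
# Illustrative memo-function sketch: consult the table first, compute and
# store only on a miss.

def memoize(f):
    table = {}
    def memoed(*args):
        if args not in table:          # table miss: compute and store
            table[args] = f(*args)
        return table[args]             # table hit: retrieve the stored result
    return memoed

@memoize
def fib(n):
    # Exponential without the table; linear in the number of distinct calls with it.
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(30))   # -> 832040
```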
10.8.2.2 Signature Tables
Signature tables were developed as a means of implementing nonlinearities in an evaluation function for checkers (Samuel, 1967). The evaluation function is represented as a hierarchical mosaic of signature tables. Each signature table had between two and four parameters, each of which had between three and fifteen possible values. The parameters to the lowest-level tables were measures computed on the checkerboard. For each combination of parameter values a number was stored in the table representing how good that combination was. There were nine of these tables arranged in a three-level hierarchy. The values generated by lower tables were fed into higher tables. The final value of the evaluation function was the number generated by the root (top) table.

Signature tables capture the table look-up and hierarchical aspects of chunking, though only for encoding. There is no decoding because signature tables are not a model of action; they act simply as a classifier of board positions. Another difference between chunking and signature tables is that information is stored in the latter not as a direct function of experience, but as correlations over a number of experiences.
10.8.3 Macro-Operators
A macro-operator is a sequence of operators that can be viewed as a single operator. One classical system that makes use of macro-operators is STRIPS (Fikes, Hart, and Nilsson, 1972). STRIPS works by taking a task and performing a search to find a sequence of operators that will accomplish the task. Given a solution, STRIPS first generates a highly specific macro-operator from the sequence of operators and then generalizes it by figuring out which constants can be replaced by variables. The generalized macro-operator is used as a plan to guide the performance of the task, and it can be used as a primitive operator in the generation of a macro-operator for another task. Each STRIPS operator is much like a production: it has a set of conditions (a precondition wff) and a set of actions (consisting of an add list and a delete list). Each macro-operator is represented as a triangle table representing the conditions and actions for all subsequences of the operators in the macro-operator (preserving the
order of execution). This process is very much like a production composition scheme that takes all of the productions that fire during the processing of a goal and creates a composition for every possible subsequence of them. STRIPS differs from the mechanisms described above in exactly how it represents, selects, and uses what it learns, but it shares a common strategy of storing, with experience, meaningful (based on the task/goal) composites that can reduce the amount of processing required by subsequent tasks.

Another form of macro-operators can be found in Korf's (1983) work on macro-operator-based problem solving. Korf presents a methodology by which a table of macro-operators can be found that can collectively solve all variations on a problem. For example, one table of macro-operators is sufficient for solving any initial configuration of Rubik's cube. Korf's technique is based on having a set of differences between the goal state and the current state. The differences are ordered and then solved one at a time. During the solution of a difference, solutions to previous differences can be destroyed, but they must be reinstated before the current difference is considered solved. Rather than learn by experience, Korf's system preprocesses the task to learn the macro-operator table capable of handling all variations of the task. It does this in time proportional to what it would take to search for a solution to one variation of the task without the table of macro-operators.

Even though the macro-operators are nonvariabilized, a single table with size proportional to the product of the number of differences and the number of values per difference is totally sufficient. This is because at each point in the solution what is to be done depends only on the value of the current difference and not on any of the other differences. It is possible to be more concrete, and to bring out the relationship of this mechanism to chunking, by viewing the sequence of differences as a goal hierarchy. The top-level goal is to solve all of the differences. In general, to solve the first x + 1 differences one first processes a subgoal for solving the first x differences; one then processes a subgoal for solving the first x + 1 differences given that the first x differences have been solved. These latter conditional goals are the terminal goals in the hierarchy. Moreover, each one has only one parameter that can vary-the value of difference x + 1-so only a very few macro-operators need be created for the goal (the number of values that the difference can take). Korf's macro-operators are essentially chunks for these terminal goals. They relate the parameters of a goal (the set of differences already solved and the value of the next difference) to the composite result (the sequence of operators to be performed). Korf avoids the combinatorial explosion implicit in the tasks by creating macro-operators only for these limited terminal goals. If the chunking mechanism were doing the learning, it would begin by chunking the terminals, but it would then proceed to learn about the nonterminal goals as well. Korf's work is a good example of how choosing the right goal hierarchy (and limiting the set of goals for which learning occurs) can enable a small set of nonvariabilized macro-operators (or chunks) to solve a large class of problems.
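The table-driven solution process just described can be made concrete with a short sketch. The table contents, the representation of differences as functions of the state, and the operator representation below are all placeholders invented for illustration; the point is only that the macro applied at step i depends solely on the current value of difference i.

```python
# Illustrative sketch of Korf-style macro-operator use: differences are solved
# in a fixed order; the macro chosen at step i is looked up by (i, value of
# difference i), and is assumed to reinstate any earlier differences it
# temporarily destroys.

def solve(state, macro_table, differences):
    # differences[i] is a function returning the current value of difference i.
    plan = []
    for i, diff in enumerate(differences):
        macro = macro_table[(i, diff(state))]   # look up by (difference, value)
        for op in macro:                        # apply the stored operator sequence
            state = op(state)
            plan.append(op)
    return state, plan
```

The table's size is the product of the number of differences and the number of values each difference can take, exactly as stated in the text, because the lookup key never mentions the values of the other differences.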
10.8.4 Summary
Although classical chunking, production composition, table look-up, and macro-operators were proposed in quite different contexts and bear little obvious relationship to each other, the current formulation of the chunking theory of learning has strong ties to all four. (1) It explains how classical chunks can be created and used; (2) it results in productions similar to those generated by goal-directed production composition; (3) it caches the results of computations, as in a table look-up scheme; and (4) it unitizes sequences of operators into higher-level macro-operators. The chunking theory differs from these four mechanisms in a number of ways, but at the same time it occupies a common ground among them. This leads the authors to propose that all five mechanisms are different manifestations of a single underlying idea centered on the storage of composite information rather than its recomputation. The chunking theory has a number of useful features, but it is probably too early to know what formulation will turn out, in the long run, to be the correct one for a general practice mechanism.
10.9 EXPANDING THE SCOPE OF CHUNKING
In this work it has been shown how chunking can provide a model of practice for tasks that can already be accomplished. Performance is sped up but not qualitatively changed. It is interesting to ask whether chunking can be used to implement any of the other, more complex forms of learning, such as method acquisition, concept formation, generalization, discrimination, learning by being told, and expectation-driven learning. Chunking does not directly accomplish any of these forms of learning, and at first glance the table look-up nature of chunking would seem to preclude its use in such sophisticated ways. However, four considerations suggest the possibility that the scope of chunking may extend much further.

The first two considerations derive from the ubiquitous presence of both chunking and the power law of practice in human performance. Chunking is already implicated at least in higher-level cognitive processes, and the power law of practice has been shown to occur over all levels of human performance. Thus, if the current chunking theory turns out not to be extendable, the limitation will probably be in the details of the implementation itself rather than in the more global formulation of the theory.

The remaining two considerations stem from the interaction of chunking with problem solving. The combination of these two mechanisms has the potential for generating interesting forms of learning. The strongest evidence of this potential to date can be found in the work of Anderson (1983b). He has demonstrated how
production composition (a mechanism that is quite similar to chunking, as has been shown here), when combined with specific forms of problem solving, can effectively perform both production generalization and discrimination. Generalization comes about through the composition of an analogy process, and discrimination comes from the composition of an error-correction procedure.

The final line of evidence comes from the work of Newell and Laird on the structure of a general problem-solving system based on the problem space hypothesis (Newell, 1980b), a universal weak method (Laird and Newell, 1983), and universal subgoaling (Laird, 1983). The problem space hypothesis states that intelligent agents are always performing in a problem space. At any instant, the agent will have a goal that it is attempting to fulfill. Associated with that goal is a problem space in which the goal can be pursued. The problem space consists of a set of states, a problem (initial and desired states), a set of operators that move the agent between states, and search-control information that assists in guiding the agent efficiently from the initial to the desired state. Added to this problem space structure are (1) a universal weak method, which allows the basic problem-solving methods, such as generate-and-test and depth-first search, to arise trivially out of the knowledge of the task being performed; and (2) universal subgoaling, which enables a problem solver automatically to create subgoals for any difficulties that can arise during problem solving. The result of this line of work has been a problem-solving production-system architecture called SOAR2 that implements these three concepts (Laird, 1983).

To see the power of integrating chunking with such a problem-solving system, consider the problem of method acquisition: given an arbitrary task or problem, how does the system first construct a method (goal structure) for it? This is the prototypical case of "hard" learning. There are at least two ways in which chunking can assist SOAR2 in method acquisition: (1) by compiling complex problem-solving processes into efficient operators, and (2) by acquiring search-control information that eliminates irrelevant search paths. A single goal-chunking mechanism can acquire both of these types of information; the difference is in the types of goals that are chunked.

The compilation process is analogous to the kinds of chunks that are created in XAPS3: inefficient subgoal processing is replaced by efficient operator application. Given a task along with its associated goals and problem spaces, SOAR2 attempts to fulfill the task goal through a repeated process of elaborating the current situation with information and selecting a new goal, problem space, state, or operator. The system applies operators to a state by creating a new state, elaborating it with the results of the operator, and selecting the new state. If the application of the operator requires problem solving itself, it will not be possible to apply it to a state directly via a set of elaborations. Instead, a difficulty will arise for which SOAR2 will create a subgoal. One way this subgoal can be pursued is by the selection of a problem space within which the task of applying the problematic operator to its state can be
accomplished. The subgoal is fulfilled when the operator has been applied and a new state generated. A chunk could be created for this subgoal that would be applicable in any state that defines the same set of parameters for that operator. The next time the operator is tried it will be applied directly, so the difficulty will not occur and the subgoal will not be needed.

Another way that a difficulty can arise in SOAR2 is if there is uncertainty about which operator to apply to a state. As with all such difficulties, SOAR2 automatically creates a subgoal to work on this problem. It then employs an operator selection problem space within which it can evaluate and compare the set of candidate operators. The difficulty is finally resolved when a set of preferences-statements about which operators are preferred to which other operators-has been created that uniquely determines which of the original operators should be selected. A subgoal that deals with an operator selection problem has a set of parameters-those aspects of the goal and state that were examined during the generation of the preferences. It also has a set of results-the preferences. Should a chunk be created for this goal, it would be a piece of search control for the problem space that allows it to pick the appropriate operator directly. As the system acquires more search control, the method becomes more efficient because of the resulting reduction in the amount of search required.

One limitation of the current chunking mechanism that such a method acquisition scheme could alleviate is the inability of chunks to implement parameterized operators. Chunking always creates a totally specified piece of knowledge. As currently formulated, it cannot create the kinds of parameterized operators used as terminal nodes in the goal hierarchies. We have seen that chunking does create abstracted knowledge, and Korf's (1983) work shows that nonvariabilized macro-operators can attain a good deal of generality from the goal hierarchy itself (see section 10.8.3), but fully parameterized operators are outside the current scope. On the other hand, problem spaces are inherently parameterized by their initial and desired states. Therefore it may be that it is not necessary for chunks to create parameterized operators. These operators can come from another source (problem spaces); chunks would only be responsible for making these operators more efficient.

In summary, the ubiquity of both chunking and power-law learning indicates that the chunking model may not be limited in its scope to simple speedups. Examining three "hard" types of learning reveals that generalization and discrimination are possible via the combination of a similar learning mechanism (production composition) and specific types of problem solving; additionally, method acquisition appears feasible via chunking in problem spaces. If this does work out, it may prove possible to formulate a number of the other difficult learning problems within this paradigm. The complications would appear as problem solving in problem spaces, and the chunking mechanism would remain simple, merely recording the results generated by the problem-solving system.
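Before turning to the conclusion, the operator-selection case discussed above can be made concrete with a schematic example of what such a search-control chunk might look like. Everything in this sketch is invented for illustration (the task, the state attributes, and the preference wording); the structure is simply a rule whose conditions are the aspects of the goal and state examined while the preferences were generated, and whose action asserts the resulting preference directly.

```python
# Hypothetical search-control chunk for an operator-selection subgoal.
# All names and values below are illustrative assumptions.

search_control_chunk = {
    "conditions": [
        {"type": "Goal",     "NAME": "Eight-Puzzle"},
        {"type": "State",    "blank-position": "center"},
        {"type": "Operator", "id": "Op2", "moves-tile": "toward-goal"},
    ],
    "actions": [
        # The preference that the selection subgoal would otherwise have had
        # to produce by evaluating and comparing the candidate operators.
        {"type": "Preference", "value": "Op2 better-than Op1, Op3"},
    ],
}
```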
10.10 CONCLUSION
At the beginning of this investigation the authors set out to develop a generalized, task-independent model of practice, capable of producing power-law practice curves. The model was to be based on the concept of chunking, and it was to be used as the basis (and a source of constraint) for a production-system practice mechanism. All of this has been accomplished.

The generalized model that has been developed is based on a goal-structured representation of reaction-time tasks. Each task has its own goal hierarchy, representing an initial performance algorithm. When a goal is successfully completed, a three-part chunk can be created for it. The chunk is based on the parameters and results of the goal. The encoding component of the chunk encodes the parameters of the goal, yielding a new symbol representing their combination. The connection component of the chunk ties the encoded parameter symbol to an encoded symbol for the results of the goal. The decoding component of the chunk decodes the new result symbol to the results out of which it is composed. The chunk improves the performance of the system by eliminating the need to process the goal fully; the chunk takes care of it. The process of chunking proceeds bottom up in the goal hierarchy. Once chunks are created for all of a goal's subgoals in a specific situation, it is possible to create a chunk for the goal. This process proceeds up the hierarchy until there is a chunk for the top-level goal for every situation that it could face.

Mechanisms for goal processing and chunking have been built into a new production-system architecture that fits within a set of constraints developed for the architecture of cognition. This architecture has been applied to a number of different reaction-time tasks (though not all of these results are presented here). It is capable of producing power-law practice curves.

As currently formulated, the chunking theory stakes out a position that is intermediary among four previously disparate mechanisms: classical chunking, memo functions, production composition, and macro-operators. These five ideas are different manifestations of a single underlying idea centered on the storage of composite information rather than its recomputation. And finally, a research path has been outlined by which the chunking theory, when integrated with a problem-solving system, can potentially be expanded to cover aspects of learning, such as method acquisition, outside of the domain of pure practice.13
13For follow-up work along this path, see Laird, Rosenbloom, and Newell (1984).
ACKNOWLEDGMENTS
This research was sponsored by the Defense Advanced Research Projects Agency (DOD), ARPA Order No. 3597, monitored by the Air Force Avionics Laboratory under Contract F33615-78-C-1551. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the U.S. Government. The authors would like to thank John Laird for innumerable helpful discussions about this material.
References

Anderson, J. R., Language, Memory, and Thought, Erlbaum, Hillsdale, N.J., 1976.

---, Private communication, 1980.

---, Private communication, 1982a.

---, "Acquisition of Cognitive Skill," Psychological Review, Vol. 89, pp. 369-406, 1982b.

---, The Architecture of Cognition, Harvard University Press, Cambridge, 1983a.

---, "Knowledge Compilation: The General Learning Mechanism," Proceedings of the Machine Learning Workshop, R. S. Michalski (Ed.), Allerton House, University of Illinois at Urbana-Champaign, pp. 203-12, June 22-24, 1983b. (An updated version of this paper appears as chap. 11 of this volume.)

Anderson, J. R., Farrell, R., and Sauers, R., "Learning to Plan in LISP," Technical Report, Department of Psychology, Carnegie-Mellon University, 1982.

Bower, G. H., "Perceptual Groups as Coding Units in Immediate Memory," Psychonomic Science, Vol. 27, pp. 217-19, 1972.

Bower, G. H., and Springston, F., "Pauses as Recoding Points in Letter Series," Journal of Experimental Psychology, Vol. 83, pp. 421-30, 1970.

Bower, G. H., and Winzenz, D., "Group Structure, Coding, and Memory for Digit Series," Journal of Experimental Psychology Monograph, Vol. 80, Pt. 2, pp. 1-17, May 1969.

Card, S. K., English, W. K., and Burr, B., "Evaluation of Mouse, Rate-Controlled Isometric Joystick, Step Keys, and Text Keys for Text Selection on a CRT," Ergonomics, Vol. 21, pp. 601-13, 1978.

Chase, W. G., and Ericsson, K. A., "Skilled Memory," in Cognitive Skills and Their Acquisition, J. R. Anderson (Ed.), Erlbaum, Hillsdale, N.J., 1981.

Chase, W. G., and Simon, H. A., "Perception in Chess," Cognitive Psychology, Vol. 4, pp. 55-81, 1973.

Crossman, E. R. F. W., "A Theory of the Acquisition of Speed-Skill," Ergonomics, Vol. 2, pp. 153-66, 1959.

DeGroot, A. D., Thought and Choice in Chess, Mouton, The Hague, 1965.

Duncan, J., "Response Selection Rules in Spatial Choice Reaction Tasks," Attention and Performance VI, Erlbaum, Hillsdale, N.J., 1977.

Ernst, G. W., and Newell, A., GPS: A Case Study in Generality and Problem Solving, ACM Monograph, Academic Press, New York, 1969.

Fikes, R. E., Hart, P. E., and Nilsson, N. J., "Learning and Executing Generalized Robot Plans," Artificial Intelligence, Vol. 3, pp. 251-88, 1972.

Fitts, P. M., and Seeger, C. M., "S-R Compatibility: Spatial Characteristics of Stimulus and Response Codes," Journal of Experimental Psychology, Vol. 46, pp. 199-210, 1953.

Forgy, C. L., "OPS5 User's Manual," Technical Report CMU-CS-81-135, Department of Computer Science, Carnegie-Mellon University, July 1981.

Johnson, N. F., "Organization and the Concept of a Memory Code," in Coding Processes in Human Memory, A. W. Melton and E. Martin (Eds.), Winston, Washington, D.C., 1972.

Kolers, P. A., "Memorial Consequences of Automatized Encoding," Journal of Experimental Psychology: Human Learning and Memory, Vol. 1, No. 6, pp. 689-701, 1975.

Korf, R. E., "Learning to Solve Problems by Searching for Macro-Operators," Ph.D. diss., Carnegie-Mellon University, 1983. (Available as Technical Report No. 83-138, Department of Computer Science, Carnegie-Mellon University, 1983.)

Laird, J. E., "Universal Subgoaling," Ph.D. diss., Carnegie-Mellon University, 1983.

Laird, J. E., and Newell, A., "A Universal Weak Method," Technical Report No. 83-141, Department of Computer Science, Carnegie-Mellon University, 1983.

Laird, J. E., Rosenbloom, P. S., and Newell, A., "Towards Chunking as a General Learning Mechanism," in Proceedings of AAAI-84, Austin, Tex., pp. 188-192, 1984.

Lewis, C. H., "Production System Models of Practice Effects," Ph.D. diss., University of Michigan, 1978.

Marsh, D., "Memo Functions, the Graph Traverser, and a Simple Control Situation," in Machine Intelligence 5, B. Meltzer and D. Michie (Eds.), American Elsevier, New York, 1970.

Michie, D., "'Memo' Functions and Machine Learning," Nature, Vol. 218, pp. 19-22, 1968.

Miller, G. A., "The Magic Number Seven Plus or Minus Two: Some Limits on Our Capacity for Processing Information," Psychological Review, Vol. 63, pp. 81-97, 1956.

Moran, T. P., "The Symbolic Imagery Hypothesis: An Empirical Investigation via a Production System Simulation of Human Behavior in a Visualization Task," Ph.D. diss., Carnegie-Mellon University, 1973.

---, "Compiling Cognitive Skill," AIP Memo No. 150, Xerox PARC, 1980.

Morin, R. E., and Forrin, B., "Mixing Two Types of S-R Association in a Choice Reaction Time Task," Journal of Experimental Psychology, Vol. 64, pp. 137-41, 1962.
Neisser, U., Novick, R., and Lazar, R., "Searching for Ten Targets Simultaneously," Perceptual and Motor Skills, Vol. 17, pp. 427-32, 1963.

Neves, D. M., and Anderson, J. R., "Knowledge Compilation: Mechanisms for the Automatization of Cognitive Skills," in Cognitive Skills and Their Acquisition, J. R. Anderson (Ed.), Erlbaum, Hillsdale, N.J., 1981.

Newell, A., "Heuristic Programming: Ill-Structured Problems," in Progress in Operations Research, Vol. 3, J. Aronofsky (Ed.), Wiley, New York, 1969.

---, "Production Systems: Models of Control Structures," in Visual Information Processing, W. G. Chase (Ed.), Academic Press, New York, 1973.

---, "Harpy, Production Systems and Human Cognition," in Perception and Production of Fluent Speech, R. Cole (Ed.), Erlbaum, Hillsdale, N.J., 1980a. (Also available as Technical Report No. CMU-CS-78-140, Department of Computer Science, Carnegie-Mellon University, 1978.)

---, "Reasoning, Problem Solving and Decision Processes: The Problem Space as a Fundamental Category," in Attention and Performance VIII, R. Nickerson (Ed.), Erlbaum, Hillsdale, N.J., 1980b. (Also available as Technical Report CMU CSD, Department of Computer Science, Carnegie-Mellon University, 1979.)

Newell, A., and Rosenbloom, P. S., "Mechanisms of Skill Acquisition and the Law of Practice," in Cognitive Skills and Their Acquisition, J. R. Anderson (Ed.), Erlbaum, Hillsdale, N.J., 1981.

Newell, A., and Simon, H. A., Human Problem Solving, Prentice-Hall, Englewood Cliffs, N.J., 1972.

Nilsson, N. J., Problem-Solving Methods in Artificial Intelligence, McGraw-Hill, New York, 1971.

Rosenbloom, P. S., "The Chunking of Goal Hierarchies: A Model of Practice and Stimulus-Response Compatibility," Ph.D. diss., Carnegie-Mellon University, 1983. (Available as Technical Report No. 83-148, Department of Computer Science, Carnegie-Mellon University, 1983.)

Rosenbloom, P. S., and Newell, A., "Learning by Chunking: A Production-System Model of Practice," Technical Report No. 82-135, Department of Computer Science, Carnegie-Mellon University, 1982a.

---, "Learning by Chunking: Summary of a Task and a Model," Proceedings of AAAI-82, Pittsburgh, Pa., pp. 255-257, 1982b.

Samuel, A. L., "Some Studies in Machine Learning Using the Game of Checkers," IBM Journal of Research and Development, Vol. 3, pp. 210-29, 1959.

---, "Some Studies in Machine Learning Using the Game of Checkers, II-Recent Progress," IBM Journal of Research and Development, Vol. 11, pp. 601-17, 1967.

Sauers, R., and Farrell, R., "GRAPES User's Manual," Technical Report, Department of Psychology, Carnegie-Mellon University, 1982.

Seibel, R., "Discrimination Reaction Time for a 1,023-Alternative Task," Journal of Experimental Psychology, Vol. 66, No. 3, pp. 215-26, 1963.

Snoddy, G. S., "Learning and Stability," Journal of Applied Psychology, Vol. 10, pp. 1-36, 1926.

Waterman, D. A., and Hayes-Roth, F. (Eds.), Pattern-Directed Inference Systems, Academic Press, New York, 1978.
Winston, P. H., "Learning Structural Descriptions from Examples," in The Psychology of Computer Vision, P. H. Winston (Ed.), McGraw-Hill, New York, 1975.

Woodworth, R. S., and Schlosberg, H., Experimental Psychology, rev. ed., Holt, Rinehart and Winston, New York, 1954.
CHAPTER 8
Towards Chunking as a General Learning Mechanism
J. E. Laird, P. S. Rosenbloom, and A. Newell, Carnegie Mellon University
ABSTRACT
Chunks have long been proposed as a basic organizational unit for human memory. More recently chunks have been used to model human learning on simple perceptual-motor skills. In this paper we describe recent progress in extending chunking to be a general learning mechanism by implementing it within a general problem solver. Using the Soar problem-solving architecture, we take significant steps toward a general problem solver that can learn about all aspects of its behavior. We demonstrate chunking in Soar on three tasks: the Eight Puzzle, Tic-Tac-Toe, and a part of the R1 computer-configuration task. Not only is there improvement with practice, but chunking also produces significant transfer of learned behavior and strategy acquisition.

1 Introduction
Chunking was first proposed as a model of human memory by Miller [8], and has since become a major component of theories of cognition. More recently it has been proposed that a theory of human learning based on chunking could model the ubiquitous power law of practice [12]. In demonstrating that a practice mechanism based on chunking is capable of speeding up task performance, it was speculated that chunking, when combined with a general problem solver, might be capable of more interesting forms of learning than just simple speed-ups [14]. In this paper we describe an initial investigation into chunking as a general learning mechanism.

Our approach to developing a general learning mechanism is based on the hypothesis that all complex behavior - which includes behavior concerned with learning - occurs as search in problem spaces [11]. One image of a system meeting this requirement consists of the combination of a performance system based on search in problem spaces, and a complex, analytical, learning system also based on search in problem spaces [10]. An alternative, and the one we adopt here, is to propose that all complex behavior occurs in the problem-space-based performance system. The learning component is simply a recorder of experience. It is the experience that determines the form of what is learned.

Chunking is well suited to be such a learning mechanism because it is a recorder of goal-based experience [13, 14]. It caches the processing of a subgoal in such a way that a chunk can substitute for the normal (possibly complex) processing of the subgoal the next time the same subgoal (or a suitably similar one) is generated. It is a task-independent mechanism that can be applied to all subgoals of any task in a system. Chunks are created during performance, through experience with the goals processed. No extensive analysis is required either during or after performance.

This research was sponsored by the Defense Advanced Research Projects Agency (DOD), ARPA Order No. 3597, monitored by the Air Force Avionics Laboratory under Contract F33615-81-K-1539. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the US Government.
The essential step in turning chunking into a general learning mechanism is to combine it with a general problem-space problem solver. One candidate is Soar, a reflective problem-solving architecture that has a uniform representation and can create goals to reason about any aspect of its problem-solving behavior [5]. Implementing chunking within Soar yields four contributions towards chunking as a general learning mechanism.

1. Chunking can be applied to a general problem solver to speed up its performance.
2. Chunking can improve all aspects of a problem solver's
behavior.
3. Significant transfer of chunked knowledge is possible via the implicit generalization of chunks.
4. Chunking can perform strategy acquisition, leading to qualitatively new behavior.

Other systems have tackled individual points, but this is the first attempt to do all of them. Other work on strategy acquisition deals with the learning of qualitatively new behavior [6, 10], but it is limited to learning only one type of knowledge. These systems end up with the wandering-bottleneck problem - removal of a performance bottleneck from one part of a system means that some other locale becomes the bottleneck [10]. Anderson [1] has recently proposed a scheme of knowledge compilation as a general learning mechanism to be applied to all of cognition, although it has not yet been used on complex problem-solving or reasoning tasks that require learning about all aspects of behavior.

2 Soar - A General Problem-Solving Architecture
Soar is a problem-solving system that is based on formulating all activity (both problems and routine tasks) as heuristic search in problem spaces. A problem space consists of a set of states and a set of operators that transform one state into another. Starting from an initial state, the problem solver applies a sequence of operators in an attempt to reach a desired state. Soar uses a production system1 to implement elementary operators, tests for goal satisfaction and failure, and search control - information relevant to the selection of goals, problem spaces, states, and operators. It is possible to use a problem space that has no search control, only operators and goal recognizers. Such a space will work correctly, but will be slow because of the amount of search required.

In many cases, the directly available knowledge may be insufficient for making a search-control decision or applying an operator to a state. When this happens, a difficulty occurs that results in the automatic creation of a subgoal to perform the necessary function. In the subgoal, Soar treats the difficulty as just another problem to solve; it selects a problem space for the subgoal

1A modified version of OPS5 [3], which admits parallel execution of all satisfied productions.
in which goal attainment is interpreted as finding a state that resolves the difficulty. Thus, Soar generates a hierarchy of goals and problem spaces. The diversity of task domains is reflected in a diversity of problem spaces. Major tasks, such as configuring a computer, will have a corresponding problem space, but so also will each of the various subtasks. In addition, problem spaces will exist in the hierarchy for performing tasks generated by problems in the system's own behavior, such as the selection of an operator to apply, the application of an operator to a state, and testing for goal attainment. With such an organization, all aspects of the system's behavior are open to problem solving when necessary. We call this property universal subgoaling [5].

Figure 1 shows a small example of how these subgoals are used in Soar. This is the subgoal/problem-space structure that gets generated while trying to take steps in a task problem space. Initially (A), the problem solver is at State1 and must select an operator. If search control is unable to uniquely determine the next operator to apply, a subgoal is created to do the selection. In that subgoal (B), a selection problem space is used that reasons about the selection of objects from a set. In order to break the tie between objects, the selection problem space has operators to evaluate each candidate object.
Figure 1: Eight Puzzle subgoal/problem-space structure. [Diagram: the task goal (A) with State1; the selection subgoal (B); evaluation subgoals (C-E) that apply Operator1-Operator3 to State1, producing States 2-4; and the selection and reapplication of the chosen operator to State1 (F).]
Evaluating an operator, such as Operator1 in the task space, is a complex problem requiring a new subgoal. In this subgoal (C), the original task problem space and state (State1) are selected. Operator1 is applied, creating a new state (State2). The evaluation for State2 is used to compare Operator1 to the other operators. When Operator1 has been evaluated, the subgoal terminates, and then the whole process is repeated for the other two operators (Operator2 and Operator3 in D and E). If, for example, Operator2 creates a state with a better evaluation than the other operators, it will be designated as better than them. The selection subgoal will terminate and the designation of Operator2 will lead to its selection in the original task goal and problem space. At this point Operator2 is reapplied to State1 and the process continues (F).

3 Chunking in Soar

Chunking was previously defined [14] as a process that acquired chunks that generate the results of a goal, given the goal and its parameters. The parameters of a goal were defined to be those aspects of the system existing prior to the goal's creation that were examined during the processing of the goal. Each chunk was represented as a set of three productions: one that encoded the parameters of a goal, one that connected this encoding in the
presence of the goal to (chunked) results, and a third production that decoded the results. These chunks were learned bottom-up in the goal hierarchy; only terminal goals - goals for which there were no subgoals that had not already been chunked - were chunked. These chunks improved task performance by substituting efficient productions for complex goal processing. This mechanism was shown to work for a set of simple perceptual-motor skills based on fixed goal hierarchies [13]. At the moment, Soar does away with two of the features of chunking that existed for psychological modeling purposes: the three-production chunks, and the bottom-up nature of chunking. In Soar, single-production chunks are built for every subgoal that terminates. The power of chunking in Soar stems from Soar's ability to automatically generate goals for problems in any aspects of its problem-solving behavior: a goal to select among alternatives leads to the creation of a production that will later control search; a goal to apply an operator to a state leads to the creation of a production that directly implements the operator; and a goal to test goal-satisfaction leads to a goal-recognition production. As search-control knowledge is added, performance improves via a reduction in the amount of search. If enough knowledge is added, there is no search; what is left is a method - an efficient algorithm for the task. In addition to reducing search within a single problem space, chunks can completely eliminate the search of entire subspaces whose function is to make a search-control decision, apply an operator, or recognize goal-satisfaction.

The conditions of a chunked production need to test everything that was used in creating the results of the subgoal and that existed before the subgoal was invoked. In standard problem solvers this would consist of the name of the goal and its parameters. However, in Soar there are no fixed goal names, nor is there a fixed set of parameters. Once a subgoal is selected, all of the information from the prior goal is still available. The problem solver makes use of the information about why the subgoal was created and any of the other information that it needs to solve the problem. For each goal generated, the architecture maintains a condition-list of all data that existed before the goal was created and which was accessed in the goal. A datum is considered accessed if a production that matched it fires. Whenever a production is fired, all of the data it accessed that existed prior to the current goal are added to the goal's condition-list. When a goal terminates (for whatever reason), the condition-list for that goal is used to build the conditions of a chunk. Before being turned into conditions, the data is selectively variablized so that the conditions become tests for object descriptions instead of tests for the specific objects experienced. These variables are restricted so that two distinct variables cannot match the same object.
The actions of the chunk should be the results of the goal. In traditional architectures, a goal produces a specific predefined type of result. However, in Soar, anything produced in a subgoal can potentially be of use in the parent goal. Although the potential exists for all objects to be relevant, the reality is that only a few of them will actually be useful. In figuring out the actions of the chunk, Soar starts with everything created in the goal, but then prunes away the information that does not relate directly to objects in any supergoal.2 What is left is turned into production actions after being variablized in accordance with the conditions. At first glance, chunking appears to be simply a caching mechanism with little hope of producing results that can be used on other than exact duplicates of tasks it has already attempted. However, if a given task shares subgoals with another task, a chunk learned for one task can apply to the other, yielding across-task transfer of learning.

2 Those that are pruned are also removed from memory because they are intermediate results that will never be used again.
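The condition and action construction just described can be read as simple bookkeeping over working-memory elements: conditions come from the pre-goal data that fired productions touched, actions come from the subgoal results that connect to supergoal objects, and identifiers are variablized. The sketch below is an assumed simplification for illustration - the triple representation, the build_chunk helper, and the identifier convention are not Soar's actual working-memory or production formats.

    def build_chunk(pre_goal_accessed, subgoal_results, supergoal_objects):
        """Sketch of chunk creation for one terminated subgoal.
        pre_goal_accessed: elements that existed before the subgoal and were
            matched by productions that fired inside it (the condition-list).
        subgoal_results: elements created inside the subgoal.
        supergoal_objects: object identifiers known to some supergoal."""
        # Actions: keep only results that mention a supergoal object; the
        # rest are intermediate and are pruned (and forgotten).
        actions = [wme for wme in subgoal_results
                   if wme[0] in supergoal_objects or wme[2] in supergoal_objects]

        # Variablize object identifiers so the chunk tests object
        # descriptions rather than the specific objects experienced.
        # (The real mechanism also keeps distinct variables from matching
        # the same object; that restriction is omitted here.)
        variables = {}
        def var(term):
            if isinstance(term, str) and term.startswith("obj-"):
                variables.setdefault(term, "<v%d>" % (len(variables) + 1))
                return variables[term]
            return term
        conditions = [tuple(var(t) for t in wme) for wme in pre_goal_accessed]
        actions = [tuple(var(t) for t in wme) for wme in actions]
        return conditions, actions

    # Toy use: a selection subgoal decided that one operator is better than
    # another, and also created a scratch element that gets pruned.
    conditions, actions = build_chunk(
        pre_goal_accessed=[("obj-s1", "is-a", "state"),
                           ("obj-o2", "applies-to", "obj-s1")],
        subgoal_results=[("obj-o2", "better-than", "obj-o3"),
                         ("obj-tmp", "evaluation", 4)],
        supergoal_objects={"obj-s1", "obj-o2", "obj-o3"})
    print(conditions)  # [('<v1>', 'is-a', 'state'), ('<v2>', 'applies-to', '<v1>')]
    print(actions)     # [('<v2>', 'better-than', '<v3>')]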
Within-trial transfer of learning can occur when a subgoal arises more than once during a single attempt on a task. Generality is possible because a chunk only contains conditions for the aspects that were accessed in the subgoal. This is an implicit generalization, by which many aspects of the context - the irrelevant ones - are automatically ignored by the chunk.

4 Demonstration

In this section we describe the results of experiments on three tasks: the Eight Puzzle, Tic-Tac-Toe, and computer configuration (a part of the R1 expert system implemented in Soar [15]). These tasks exhibit: (1) speed-ups with practice; (2) within-trial transfer of learning; (3) across-task transfer of learning; (4) strategy acquisition (the learning of paths through search spaces); (5) knowledge acquisition in a knowledge-intensive system; and (6) learning of qualitatively different aspects of behavior. We conclude this section with a discussion of how chunking sometimes builds over-general productions.

4.1 Eight Puzzle

The states for the Eight Puzzle, as implemented in Soar, consist of different configurations of eight numbered tiles in a three-by-three grid; the operators move the blank space up (U), down (D), left (L), and right (R) [5]. Search-control knowledge was built that computed an evaluation of a state based on the number of tiles that were moved in and out of the desired positions from the previous state.3 At each state in the problem solving, an operator must be selected, but there is insufficient search-control knowledge to intelligently distinguish between the alternatives. This leads to the selection being made using the set of selection and evaluation goals described in Section 2. The first column of Figure 2 shows the behavior of Soar without chunking in the Eight Puzzle problem space. All of the nodes off the main path were expanded in evaluate-operator subgoals (nodes on the main path were expanded once in a subgoal, and once after being selected in the top goal).4
Figure 2: Within-trial and across-task transfer in the Eight Puzzle. [Five search trees over the moves U, D, L, and R: Task 1 with no learning, Task 1 while learning, Task 1 after learning Task 1, Task 2 while learning, and Task 1 after learning Task 2.]
3 To avoid tight loops, search control was also added that avoided applying the inverse of the operator that created a given state.
4 At two points in the search the correct operator had to be selected manually because the evaluation function was insufficient to pick out the best operator. Our purpose is not to evaluate the evaluation function, but to investigate how chunking can be used in conjunction with search-control knowledge.
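Read literally, the evaluation described above is a one-move difference measure: compare the new state to the state it came from and credit tiles that move into their desired positions while penalizing tiles that move out of them. The sketch below assumes a flat tuple representation of the board and a hypothetical evaluate function; it is meant only to make the measure concrete, not to reproduce the Soar encoding.

    def evaluate(previous, current, desired):
        """Score an Eight Puzzle state relative to its parent: +1 for each
        tile that moved into its desired position, -1 for each tile that
        moved out of one.  States are 9-tuples read row by row; 0 is the
        blank, which is not scored."""
        score = 0
        for pos in range(9):
            tile = desired[pos]
            if tile == 0:
                continue
            was_in_place = previous[pos] == tile
            now_in_place = current[pos] == tile
            if now_in_place and not was_in_place:
                score += 1
            elif was_in_place and not now_in_place:
                score -= 1
        return score

    desired  = (1, 2, 3, 8, 0, 4, 7, 6, 5)   # an example desired configuration
    previous = (1, 2, 3, 0, 8, 4, 7, 6, 5)   # the 8 sits one move from home
    current  = (1, 2, 3, 8, 0, 4, 7, 6, 5)   # moving the 8 left reaches it
    print(evaluate(previous, current, desired))   # 1: the 8 moved into place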
When Soar with chunking is applied to the task, both the selection and evaluation subgoals are chunked. During this run (second column of Figure 2), some of the newly created chunks apply to subsequent subgoals in the search. This within-trial transfer of learning speeds up performance by dramatically reducing the amount of search. The third column in the figure shows that after one run with learning, the chunked productions completely eliminate search. To investigate across-task learning, another experiment was conducted in which Soar started with a learning trial for a different task - the initial and final states are different, and none of the intermediate states were the same (the fourth column). The first task was then attempted with the productions learned from the second task, but with chunking turned off so that there would be no additional learning (the final column). The reduced search is caused by across-task transfer of learning - some subgoals in the second trial were identical in all of the relevant ways to subgoals in the first trial. This happens because of the interaction between the problem solving only accessing information relevant to the result, and the implicit generalization of chunking only recording the information accessed.

4.2 Tic-Tac-Toe

The implementation of Tic-Tac-Toe includes only the basic problem space - the state includes the board and who is on move, the operators make a mark on the board for the appropriate player and change who is on move - and the ability to detect a win, loss or draw [5]. With just this knowledge, Soar searches depth-first through the problem space by the sequence of: (1) encountering a difficulty in selecting an operator; (2) evaluating the operators in a selection subgoal; (3) applying one of the operators in an evaluation subgoal; (4) encountering a difficulty in selecting an operator to apply to the resulting state; and (5) so on, until a terminal state is reached and evaluated.
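The five-step cycle just listed amounts to a recursive look-ahead: when no move can be selected directly, each candidate move is evaluated by playing it out in a deeper subgoal until a terminal state is reached. The following compressed sketch, with an assumed board representation and none of Soar's impasse machinery, shows only that control structure, not the Soar implementation.

    def winner(board):
        """Return 'X' or 'O' if either player has three in a row, else None."""
        lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
                 (0, 3, 6), (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]
        for a, b, c in lines:
            if board[a] != ' ' and board[a] == board[b] == board[c]:
                return board[a]
        return None

    def evaluate_move(board, move, player):
        """Evaluate one candidate move depth-first: apply it, then keep
        selecting and evaluating moves in deeper 'subgoals' until a
        terminal state is reached.  Returns +1/0/-1 from player's view."""
        board = board[:move] + player + board[move + 1:]
        if winner(board) == player:
            return 1
        if ' ' not in board:
            return 0                       # draw
        opponent = 'O' if player == 'X' else 'X'
        # The opponent evaluates its own candidates the same way; its best
        # outcome is the worst outcome for the current player.
        return -max(evaluate_move(board, m, opponent)
                    for m in range(9) if board[m] == ' ')

    def select_move(board, player):
        """Break the tie among candidate moves by evaluating each one."""
        candidates = [m for m in range(9) if board[m] == ' ']
        return max(candidates, key=lambda m: evaluate_move(board, m, player))

    # X has opened in a corner (square 0) of a 9-character board; O replies.
    print(select_move("X        ", 'O'))   # 4: only the centre avoids a loss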
Chunking in Tic-Tac-Toe yields two interesting results: (1) the chunks detect board symmetries, allowing a drastic reduction in search through within-trial transfer; (2) the chunks encode search-control knowledge so that the correct moves through the space are remembered. The first result is interesting because there is no knowledge in the system about the existence of symmetries, and without chunking the search bogs down terribly by re-exploring symmetric positions. The chunks make use of symmetries by ignoring orientation information that was not used during problem solving. The second point seems obvious given our presentation of chunking, but it demonstrates the strategy acquisition [6, 10] abilities of chunking. Chunking acquires strategic information on the fly, using only its direct experience, and without complex post-processing of the complete solution path or knowledge learned from other trials. The quality of this path depends on the quality of the problem solving, not on the learning.

4.3 R1

Part of the R1 expert system [7] was implemented in Soar to investigate whether Soar can support knowledge-intensive expert systems [15]. Figure 3 shows the subgoal structure that can be built up through universal subgoaling, including both subgoals that implement complex operators (heavy lines) and subgoals that select operators (thin lines to Selection subgoals). Each box shows the problem-space operators used in the subgoal. The actual subgoal structure extends much further wherever there is an ellipsis (...). This subgoal structure does not pre-exist in Soar, but is built up as difficulties arise in selecting and applying operators. Table 1 presents statistics from the application of R1-Soar to a small configuration task. The first three runs (Min. S-C) are with a minimal system that has only the problem spaces and goal detection defined. This base system consists of 232 productions (95 productions come with Soar, 137 define R1-Soar). The final three runs (Added S-C) have 10 additional search-control productions that remove much of the search. In the table, the number of search-control decisions is used as the time metric because decisions are the basic unit of problem solving.5
Figure 3: Subgoal structure in R1-Soar. [Diagram of the subgoal/problem-space hierarchy; visible operator labels include configure backplane, place modules in BP, place module boards in BP, go to next slot, put board in current slot, put BP in current slot, and evaluate instantiation, with Selection subgoals at several points.]

Run                        Initial Prod.   Final Prod.   Decisions
Min. S-C                   232             232           1731
Min. S-C with chunking     232             291           485
Min. S-C after chunking    291             291           7
Added S-C                  242             242           150
Added S-C with chunking    242             254           90
Added S-C after chunking   254             254           7

Table 1: Run statistics for R1-Soar.

5 On a Symbolics 3600, Soar usually runs at 1 second per decision. Chunking adds an overhead of approximately 15%, mostly to compile new productions. The increased number of productions has no effect on the overall rate if the chunked productions are fully integrated into the existing production-match network.
The first run shows that with minimal search control, 1731 decisions are needed to do the task. If chunking is used, 59 productions are built during the 485 decisions it took to do this task. No prior chunking had occurred, so this shows strong within-trial transfer. After chunking, rerunning the same task takes only 7 decisions.
When Soar is run with 10 hand-crafted search-control rules, it takes only 150 decisions. This is only a little more than three times faster than Soar managed without those rules when chunking was used (485 decisions). When chunking is applied to this situation - where the additional search control already exists - it still helps, decreasing the number of decisions for the first trial to 90. A second trial on this task once again takes only 7 decisions.

4.4 Over-generalization
The within-trial and across-task transfer in the tasks we have examined was possible because of implicit generalization. Unfortunately, implicit generalization leads to over-generalization when there is special-case knowledge that was almost used in solving a subgoal. In Soar this would be a production for which most, but not all, of the conditions were satisfied during a problem-solving episode. The conditions that were not satisfied either tested for the absence of something that is available in the subgoal (using a negated condition) or for the presence of something missing in the subgoal (using a positive condition). The chunk that is built for the subgoal is over-general because it does not include the inverses of these conditions - negated conditions for the positive conditions, and positive conditions for the negated conditions. During a later episode, when all of the conditions of the special-case production would be satisfied in a subgoal, the chunk learned in the first trial bypasses the subgoal. If the special-case production would lead to a different result for the goal, the chunk is over-general and produces an incorrect result.

Figure 4 contains an example of how the problem solving and chunking in Soar can lead to over-generalization. Consider the situation where O is to move in state 1. It already has the center (E), while X is on a side (B). A tie arises between all the remaining moves (A, C, D, F-I), leading to the creation of a subgoal. The Selection problem space is chosen, in which each of the tying moves is a candidate to be evaluated. If position I is evaluated first, it leads to a line of play resulting in state 2, which is a win for O because of a fork. On return to the Selection problem space, move I is immediately chosen as the best move, the original tie-subgoal terminates, move I is made, and O goes on to win. When returning from the tie-subgoal, a chunk is created, with conditions sensitive to all aspects of the original state that were tested in productions that fired in the subgoals. All positions that have marks were tested (A-C, E, I), as well as those positions that had to be clear for O to have a fork (G, F). However, positions D and H were not tested. To see how this production is over-general, consider state 3, where O is to move. The newly chunked production, being insensitive to the X at position D, will fire and suggest position I, which leads to a loss for O.
Figure 4: Over-generalization in Tic-Tac-Toe. [Three board positions with squares labelled A-I: state 1, in which O is to move; state 2, the fork for O reached by evaluating move I; and state 3, a later position in which the over-general chunk fires incorrectly.]
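To make the failure concrete, the over-general chunk can be viewed as a set of square tests that simply omits the untested squares D and H. The board contents below are a hypothetical reconstruction for illustration (only the X at D is taken from the text), and chunk_matches is an assumed stand-in for production matching; the point is only that the chunk fires on the second position even though the X at D should have blocked it.

    def chunk_matches(state, conditions):
        """A chunk fires when every condition it recorded is satisfied.
        Squares the problem solving never examined (here D and H) do not
        appear at all, so they are ignored - implicit generalization."""
        return all(state[square] == mark for square, mark in conditions.items())

    # Conditions recorded for the subgoal that found the winning move I:
    # the marked squares it examined plus the squares that had to be clear.
    chunk_conditions = {'B': 'X', 'E': 'O',
                        'A': ' ', 'C': ' ', 'F': ' ', 'G': ' ', 'I': ' '}

    state1 = {sq: ' ' for sq in "ABCDEFGHI"}
    state1.update({'B': 'X', 'E': 'O'})        # X on a side, O in the centre
    state3 = dict(state1, D='X')               # the same, but X also holds D

    print(chunk_matches(state1, chunk_conditions))   # True: correct here
    print(chunk_matches(state3, chunk_conditions))   # True: fires anyway,
    # even though with D occupied the move to I no longer wins for O.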
Over-generalization is a serious problem for Soar if we want to encode real tasks that are able to improve with experience. However, over-generalization is a problem for any learning system that works in many different environments, and it leads to what is called negative transfer in humans. We believe that the next step in handling over-generalization is to investigate how a problem solver can recover from over-general knowledge, and then carry out problem-solving activities so that new chunks can be learned that will override the over-general chunks. This would be similar to John Anderson's work on discrimination learning using knowledge compilation [1].
5 Conclusion
In this paper we have taken several steps towards the establishment of chunking as a general learning mechanism. We have demonstrated that it is possible to extend chunking to complex tasks that require extensive problem solving. In experiments with the Eight Puzzle, Tic-Tac-Toe, and a part of the R1 computer-configuration task, it was demonstrated that chunking leads to performance improvements with practice. We have also contributed to showing how chunking can be used to improve many aspects of behavior. Though this demonstration is only partial, as not all of the different types of problem solving arose in the tasks we examined, we did see that chunking can be used for subgoals that involve the selection of operators and the application of operators. Chunking has this generality because of the ubiquity of goals in Soar. Since all aspects of behavior are open to problem solving in subgoals, all aspects are open to learning. Not only is Soar able to learn about the task (chunking the main goal), it is able to learn how to solve the task (chunking the subgoals). Because all aspects of behavior are open to problem solving, and hence to learning, Soar avoids the wandering-bottleneck problem.
In addition to leading to performance speed-ups, we have shown that the implicit generalization of chunks leads to significant within-trial and across-task transfer of learning. This was demonstrated most strikingly by the ability of chunks to use symmetries in Tic-Tac-Toe positions that are not evident to the problem-solving system. And finally, we have demonstrated that chunking, which at first glance is a limited caching function, is capable of strategy acquisition. It can acquire the search control required to turn search-based problem solving into an efficient method.
Though significant progress has been made, there is still a long way to go. One of the original goals of the work on chunking was to model human learning, but several of the assumptions of the original model have been abandoned in this attempt, and a better understanding is needed of just why they are necessary. We also need to understand better the characteristics of problem spaces that allow interesting forms of generalization, such as the use of symmetry, to take place. We have demonstrated several forms of learning, but others, such as concept formation [9], problem-space creation [4], and learning by analogy [2], still need to be covered before the proposal of chunking as a general learning mechanism can be firmly established.
References
1. Anderson, J. R. Knowledge compilation: The general learning mechanism. Proceedings of the 1983 Machine Learning Workshop, 1983.
2. Carbonell, J. G. Learning by analogy: Formulating and generalizing plans from past experience. In Machine Learning: An Artificial Intelligence Approach, R. S. Michalski, J. G. Carbonell, & T. M. Mitchell, Eds., Tioga, Palo Alto, CA, 1983.
3. Forgy, C. L. OPS5 Manual. Computer Science Department, Carnegie-Mellon University, 1981.
4. Hayes, J. R. and Simon, H. A. Understanding complex task instructions. In Cognition and Instruction, Klahr, D., Ed., Erlbaum, Hillsdale, NJ, 1976.
5. Laird, J. E. Universal Subgoaling. Ph.D. Th., Computer Science Department, Carnegie-Mellon University, 1983.
6. Langley, P. Learning effective search heuristics. Proceedings of IJCAI-83, IJCAI, 1983.
7. McDermott, J. "R1: A rule-based configurer of computer systems." Artificial Intelligence 19 (1982), 39-88.
8. Miller, G. A. "The magical number seven, plus or minus two: Some limits on our capacity for processing information." Psychological Review 63 (1956), 81-97.
9. Mitchell, T. M. Version Spaces: An Approach to Concept Learning. Ph.D. Th., Stanford University, 1978.
10. Mitchell, T. M. Learning and problem solving. Proceedings of IJCAI-83, IJCAI, 1983.
11. Newell, A. Reasoning, problem solving and decision processes: The problem space as a fundamental category. In Attention and Performance VIII, R. Nickerson, Ed., Erlbaum, Hillsdale, NJ, 1980.
12. Newell, A. and Rosenbloom, P. Mechanisms of skill acquisition and the law of practice. In Learning and Cognition, Anderson, J. A., Ed., Erlbaum, Hillsdale, NJ, 1981.
13. Rosenbloom, P. S. The Chunking of Goal Hierarchies: A Model of Practice and Stimulus-Response Compatibility. Ph.D. Th., Carnegie-Mellon University, 1983.
14. Rosenbloom, P. S. and Newell, A. The chunking of goal hierarchies: A generalized model of practice. Proceedings of the 1983 Machine Learning Workshop, 1983.
15. Rosenbloom, P. S., Laird, J. E., McDermott, J. and Newell, A. R1-Soar: An Experiment in Knowledge-Intensive Programming in a Problem-Solving Architecture. Department of Computer Science, Carnegie-Mellon University, 1984.
CHAPTER 9
R1-Soar: An Experiment in Knowledge-Intensive Programming in a Problem-Solving Architecture
P. S. Rosenbloom, J. E. Laird, J. McDermott, A. Newell, and E. Orciuch
Abstract - This paper presents an experiment in knowledge-intensive programming within a general problem-solving production-system architecture called Soar. In Soar, knowledge is encoded within a set of problem spaces, which yields a system capable of reasoning from first principles. Expertise consists of additional rules that guide complex problem-space searches and substitute for expensive problem-space operators. The resulting system uses both knowledge and search when relevant. Expertise knowledge is acquired either by having it programmed, or by a chunking mechanism that automatically learns new rules reflecting the results implicit in the knowledge of the problem spaces. The approach is demonstrated on the computer-system configuration task, the task performed by the expert system R1.
Index Terms - Chunking, computer configuration, deep and shallow reasoning, expert systems, general problem solving, knowledge acquisition, knowledge-intensive programming, problem spaces, production systems.
I. INTRODUCTION
Repeatedly in the work on expert systems, domain-dependent knowledge-intensive methods are contrasted with domain-independent general problem-solving methods [8]. Expert systems such as Mycin [19] and R1 [14] attain their power to deal with applications by being knowledge intensive. However, this knowledge characteristically relates aspects of the task directly to action consequences, bypassing more basic scientific or causal knowledge of the domain. We will call this direct task-to-action knowledge expertise knowledge (it has also been referred to as surface knowledge [3], [7]), acknowledging that no existing term is very precise. Systems that primarily use weak methods ([10], [15]), such as depth-first search and means-ends analysis, are characterized by their wide scope of applicability. However, they achieve this at the expense of efficiency, being seemingly unable to bring to bear the vast quantities of diverse task knowledge that allow an expert system to quickly arrive at problem solutions.

This paper describes R1-Soar, an attempt to overcome the limitations of both expert systems and general problem solvers by doing knowledge-intensive programming in a general weak-method problem-solving architecture. We wish to show three things: 1) a general problem-solving architecture can work at the knowledge-intensive (expert-system) end of the problem-solving spectrum; 2) such a system can integrate basic reasoning and expertise; and 3) such a system can perform knowledge acquisition by automatically transforming computationally intensive problem solving into efficient expertise-level rules.

Our strategy is to show how Soar, a problem-solving production-system architecture ([9], [12]), can deal with a portion of R1 - a large, rule-based expert system that configures Digital Equipment Corporation's VAX-11 and PDP-11 computer systems. A base representation in Soar consists of knowledge about the goal to be achieved and knowledge of the operators that carry out the search for the goal state. For the configuration task, this amounts to knowledge that detects when a configuration has been done and basic knowledge of the physical operations of configuring a computer. A system with a base representation is robust, being able to search for knowledge that it does not immediately know, but the search can be expensive. Efficiency can be achieved by adding knowledge to the system that aids in the application of difficult operators and guides the system through combinatorially explosive searches. Expertise knowledge corresponds to this non-base knowledge. With little expertise knowledge, Soar is a domain-independent problem solver; with much expertise knowledge, Soar is a knowledge-intensive system. The efficient processing due to expertise knowledge replaces costly problem solving with base knowledge when possible. Conversely, incompleteness in the expertise leads back smoothly into search in the base system.

In Soar, expertise can be added to a base system either by hand-crafting a set of expertise-level rules, or by automatic acquisition of the knowledge implicit in the base representation. Automatic acquisition of new rules is accomplished by chunking, a mechanism that has been shown to provide a model of human practice [16], [17], but is extended here to much broader types of learning.

In the remainder of this paper, we describe R1 and Soar, present the structure of the configuration task as implemented in Soar, look at the system's behavior to evaluate the claims of this work, and draw some conclusions.

Manuscript received April 15, 1985. This work was supported by the Defense Advanced Research Projects Agency (DOD) under DARPA Order 3597, monitored by the Air Force Avionics Laboratory under Contracts F33615-81-K-1539 and N00039-83-C-0136, and by Digital Equipment Corporation. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency, the US Government, or Digital Equipment Corporation. P. S. Rosenbloom is with the Departments of Computer Science and Psychology, Stanford University, Stanford, CA 94305. J. E. Laird is with the Xerox Palo Alto Research Center, Palo Alto, CA 94304. J. McDermott and A. Newell are with the Department of Computer Science, Carnegie-Mellon University, Pittsburgh, PA 15213. E. Orciuch is with the Digital Equipment Corporation.
II. R1 AND THE TASK FOR R1-SOAR

R1 is an expert system for configuring computers [14]. It provides a suitable expert system for this experiment because: 1) it contains a very large amount of knowledge, 2) its knowledge is largely pure expertise in that it simply recognizes what to do at almost every juncture, and 3) it is a highly successful application of expert systems, having been in continuous use by Digital Equipment Corporation for over four years [1]. Currently written in Ops5 [4], R1 consists of a database of over 7000 component descriptions, and a set of about 3300 production rules partitioned into 321 subtasks. The primary problem-solving technique in R1 is match - recognizing in a specific situation precisely what to do next. Where match is insufficient, R1 employs specialized forms of generate and test, multistep look-ahead, planning in an abstract space, hill climbing, and backtracking.

Given a customer's purchase order, R1 determines what, if any, modifications have to be made to the order for reasons of system functionality and produces a number of diagrams showing how the various components on the order are to be associated. In producing a complete configuration, R1 performs a number of relatively independent subtasks; of these, the task of configuring unibus modules is by far the most involved. Given a partially ordered set of modules to be put onto one or more buses and a number of containers (backplanes, boxes, etc.), the unibus configuration task involves repeatedly selecting a backplane and placing modules in it until all of the modules have been configured. The task is knowledge intensive because of the large number of situation-dependent constraints that rule out various module placements. R1-Soar can currently perform more than half of this task. Since R1 uses about one-third of its knowledge (1100 of its 3300 rules) in performing the unibus configuration task, R1-Soar has approximately one-sixth of the knowledge that it would require to perform the entire configuration task.1

R1 approaches the unibus configuration task by laying out an abstract description of the backplane demands imposed by the next several modules and then recognizing which of the candidate backplanes is most likely to satisfy those demands. Once a backplane is selected on the basis of the abstract description, R1 determines specific module placements on the basis of a number of considerations that it had previously ignored or not considered in detail. R1-Soar approaches the task somewhat differently, but for the most part makes the same judgments since it takes into account all but one of the six factors that R1 takes into account. The parts of the unibus configuration task that R1-Soar does not yet know how to perform are mostly peripheral subtasks such as configuring empty backplanes after all of the modules have been placed and distributing the boxes appropriately among cabinets. R1 typically fires about 1000 rules in configuring a computer system; the part of the task that R1-Soar performs typically takes R1 80-90 rule firings, one-twelfth of the total number. Since an order usually contains several backplanes, to configure a single backplane might take R1 20-30 rule firings, or about 3-4 s on a Symbolics 3600 Lisp machine.

1 This task requires a disproportionate share of knowledge - a sixth of the knowledge for a twelfth of the rule firings - because the unibus configuration task is more knowledge intensive than most of the other tasks R1 performs.

III. SOAR

Soar is a problem-solving system that is based on formulating all problem-solving activity as attempts to satisfy goals via heuristic search in problem spaces. A problem space consists of a set of states and a set of operators that transform one state into another. Starting from an initial state, the problem solver applies a sequence of operators in an attempt to reach a state that satisfies the goal (called a desired state). Each goal has associated with it a problem space within which goal satisfaction is being attempted, a current state in that problem space, and an operator which is to be applied to the current state to yield a new state. The search proceeds via decisions that change the current problem space, state, or operator. If the current state is replaced by a different state in the problem space - most often it is the state generated by the current operator, but it can also be the previous state, or others - normal within-problem-space search results.

The knowledge used to make these decisions is called search control. Because Soar performs all problem-solving activity via search in problem spaces, the act of applying search-control knowledge must be constrained to not involve problem solving. Otherwise, there would be an infinite regression in which making a decision requires the use of search control, which requires problem solving in a problem space, which requires making a decision using search control, and so on. In Soar, search control is limited to match - direct recognition of situations. As long as the computation required to make a decision is within the limits of search control, and the knowledge required to make the decision exists, problem solving proceeds smoothly. However, Soar often works in domains where its search-control knowledge is either inconsistent or incomplete. Four difficulties can occur while deciding on a new problem space, state, or operator: there are no objects under consideration, all of the candidate objects are unviable, there is insufficient knowledge to select among two or more candidate objects, or there is conflicting information about which object to select. When Soar reaches a decision for which one of these difficulties occurs, problem solving reaches an impasse [2] and stops. Soar's universal subgoaling mechanism [9] detects the impasse and creates a subgoal whose purpose is to obtain the knowledge which will allow the decision to be made. For example, if more than one operator can be applied to a state, and the available knowledge does not prefer one over the others, an impasse occurs and a subgoal is created to find information leading to the selection of the appropriate one.
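The four difficulties just listed can be viewed as a small decision procedure over the candidate objects and the preferences search control has produced for them. The function below is an assumed rendering for illustration - the preference vocabulary and impasse labels are simplified stand-ins, not Soar's actual decision procedure.

    def decide(candidates, preferences):
        """Return (selected object, None) or (None, description of the
        impasse).  `preferences` maps an object to a list drawn from
        'acceptable', 'best', and 'reject'."""
        if not candidates:
            return None, "impasse: no objects under consideration"
        viable, conflicted = [], []
        for obj in candidates:
            prefs = preferences.get(obj, [])
            if "best" in prefs and "reject" in prefs:
                conflicted.append(obj)        # contradictory information
            elif "reject" not in prefs:
                viable.append(obj)
        if conflicted:
            return None, "impasse: conflicting preferences"
        if not viable:
            return None, "impasse: all candidate objects unviable"
        best = [obj for obj in viable if "best" in preferences.get(obj, [])]
        choices = best or viable
        if len(choices) == 1:
            return choices[0], None
        return None, "impasse: insufficient knowledge to select (a tie)"

    # Two applicable operators, neither preferred: a tie results, and Soar
    # would create a subgoal to find knowledge that breaks it.
    print(decide(["operator1", "operator2"],
                 {"operator1": ["acceptable"], "operator2": ["acceptable"]}))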
Or, as another example, if an operator is selected which cannot be implemented directly in search control, an impasse occurs because there are no candidates for the successor state. A subgoal is created to apply the operator, and thus build the state that is the result of the operator. A subgoal is attempted by selecting a problem space for it. Should a decision reach an impasse in this new problem space, a new subgoal would be created to deal with it. The overall structure thus takes the form of a goal-subgoal hierarchy. Moreover, because each new subgoal will have an associated problem space, Soar generates a hierarchy of problem spaces as well as a hierarchy of goals. The diversity of task domains is reflected in a diversity of problem spaces. Major tasks, such as configuring a computer, have a corresponding problem space, but so also do each of the various subtasks, such as placing a module into a backplane or placing a backplane into a box. In addition, problem spaces exist in the hierarchy for many types of tasks that often do not appear in a typical task-subtask decomposition, such as the selection of an operator to apply, the implementation of a given operator in some problem space, and a test of goal attainment.

Fig. 1 gives a small example of how subgoals are used in Soar. This is a subgoal structure that gets generated while trying to take steps in many task problem spaces. Initially (A), the problem solver is at state1 and must select an operator. If search control is unable to uniquely determine the next operator to apply, a subgoal is created to do the selection. In that subgoal (B), a selection problem space is used that reasons about the selection of objects from a set. In order to break the tie between objects, the selection problem space has operators to evaluate each candidate object. When the information required to evaluate an operator (such as operator1 in the task space) is not directly available in search control (because, for example, it must be determined by further problem solving), the evaluation operator is accomplished in a new subgoal. In this subgoal (C), the original task problem space and state (state1) are selected. Operator1 is applied, creating a new state (state2). If an evaluation function (a rule in search control) exists for state2, it is used to compare operator1 to the other operators. When operator1 has been evaluated, the subgoal terminates, and then the whole process is repeated for the other two operators (operator2 and operator3 in D and E). If, for example, operator2 creates a state with a better evaluation than the other operators, it will be designated as better than them. The selection subgoal will terminate and the designation of operator2 will lead to its selection in the original task goal and problem space. At this point, operator2 is reapplied to state1 and the process continues (F).

Soar uses a monolithic production-system architecture - a modified version of Ops5 [4] that admits parallel execution of all satisfied productions - to realize its search-control knowledge and to implement its simple operators (more complex operators are encoded as separate problem spaces that are chosen for the subgoals that arise when the operator they implement has been selected to apply).
Fig. 1. A Soar subgoal structure. Each box represents one goal (the goal's name is above the box). The first row in each box is the current state for that goal. The remaining rows represent the operators that can be used for that state. Heavy arrows represent operator applications (and goals to apply operators). Light arrows represent subgoals to select among a set of objects.
Production rules elaborate the current objects under consideration for a decision (e.g., candidate operators or states). The process of elaboration results in knowledge being added to the production system's working memory about the objects, including object substructures, evaluation information, and preferences relative to other candidate objects. There is a fixed decision process that integrates the preferences and makes a selection. Each decision corresponds to an elementary step in the problem solving, so a count of the number of decisions is a good measure of the amount of problem solving performed.

To have a task formulated in Soar is to have a problem space and the ability to recognize when a state satisfies the goal of the task; that is, is a desired state. The default behavior for Soar - when it has no search-control knowledge at all - is to search in this problem space until it reaches a desired state. The various weak methods arise, not by explicit representation and selection, but instead by the addition of small amounts of search control (in the form of one or two productions) to Soar, which acts as a universal weak method [10], [11], [9]. These production rules are responsive to the small amounts of knowledge that are involved in the weak methods, e.g., the evaluation function in hill climbing or the difference between the current and desired states in means-ends analysis. In this fashion, Soar is able to make use of the entire repertoire of weak methods in a simple and elegant way, making it a good exemplar of a general problem-solving system.

The structure in Fig. 1 shows how one such weak method, steepest-ascent hill climbing - at each point in the search, evaluate the possible next steps and take the best one - can come about if the available knowledge is sufficient to allow evaluation of all of the states in the problem space. If slightly different knowledge is available, such as how to evaluate only terminal states (those states beyond which the search cannot extend), the search would be quite different, reflecting a different weak method. For example, if state2 in subgoal (C) cannot be evaluated, then subgoal (C) will not be satisfied and the search will continue under that subgoal. An operator must be selected for
state2, leading to a selection subgoal. The search will continue to deepen in this fashion until a terminal state is reached and evaluated. This leads to an exhaustive depth-first search for the best terminal state. Backtracking is not done explicitly; instead it implicitly happens whenever a subgoal terminates. A third weak method - depth-first search for the first desired state to be found - occurs when no evaluation information is available; that is, desired states can be recognized but no states can be evaluated.

In addition to the kinds of knowledge that lead to the well-known weak methods, additional search-control knowledge can be added to any problem space. The knowledge can be in the form of new object preferences or additional information that leads to new preferences. As more knowledge is added, the problem solving becomes more and more constrained until finally search is totally eliminated. This is the basic device in Soar to move toward a knowledge-intensive system. Each addition occurs simply by adding rules in the form of productions. Theoretically, Soar is able to move continuously from a knowledge-free solver (the default), through the weak methods, to a knowledge-intensive system. It is possible to eliminate entire subspaces if their function can be realized by search-control knowledge in their superspace. For instance, if a subspace is to gather information for selecting between two operators, then it may be possible to encode that information directly as a search-control rule such as the following one from R1-Soar:

If there is an acceptable put-board-in-slot operator and an acceptable go-to-next-slot operator,
then the go-to-next-slot operator is worse than the put-board-in-slot operator.
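Outside of a production system, the quoted rule is just a pairwise preference over the candidate operators. The sketch below treats operators as small dictionaries and returns "worse" preferences; the function name and data shapes are assumptions for illustration, not R1-Soar's Ops5-style rule syntax.

    def board_before_slot_preferences(candidates):
        """Hand-crafted search control for the Configure Boards space:
        whenever a put-board-in-slot operator and a go-to-next-slot
        operator are both acceptable, mark go-to-next-slot as worse."""
        names = {op["name"] for op in candidates}
        preferences = []
        if "put-board-in-slot" in names and "go-to-next-slot" in names:
            preferences.append(
                ("go-to-next-slot", "worse-than", "put-board-in-slot"))
        return preferences

    candidates = [{"name": "put-board-in-slot", "board": "b1", "slot": 2},
                  {"name": "go-to-next-slot", "slot": 2}]
    print(board_before_slot_preferences(candidates))
    # [('go-to-next-slot', 'worse-than', 'put-board-in-slot')]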
Similarly, if a subspace is to apply an operator, then specific instances of that operator might be carried out directly by rules in the higher space.

Knowledge acquisition in Soar consists of the creation of additional rules by hand coding or by a mechanism that automatically chunks the results of successful goals [13], [12]. The chunking mechanism creates new production rules that allow the system to directly perform actions that originally required problem solving in subgoals. The conditions of a chunked rule test those aspects of the task that were relevant to satisfying the goal. For each goal generated, the architecture maintains a condition list of all data that existed before the goal was created and which were accessed in the goal. A datum is considered accessed if a production that matched it fires. Whenever a production is fired, all of the data it accessed that existed prior to the current goal are added to the goal's condition list. When a goal terminates (for whatever reason), the condition list for that goal is used to build the conditions of a chunk. The actions of the chunk generate the information that actually satisfied the goal. In figuring out the actions of the chunk, Soar starts with everything created in the goal, but then prunes away the information that does not relate directly to objects in any supergoal. What is left is turned into production actions.
New rules form part of search control when they deal with the selection among objects (chunks for goals that use the selection problem space), or they form part of operator implementation when they are chunks for goals dealing with problematic operators. Because Soar is driven by the goals automatically created to deal with difficulties in its performance, and chunking works for all goals, the chunking mechanism is applicable to all aspects of Soar's problem-solving behavior.

At first glance, chunking appears to be simply a caching mechanism with little hope of producing results that can be used on other than exact duplicates of tasks it has already attempted. However, if a given task shares subgoals with another task, a chunk learned for one task can apply to the other. Generality is possible because a chunk only contains conditions for the aspects that were accessed in the subgoal. This is an implicit generalization, by which many aspects of the context - the irrelevant ones - are automatically ignored by the chunk.2

2 For comparisons of chunking to other related learning mechanisms such as memo functions, macrooperators, production composition, and analytical generalization, see [18] and [12].

IV. THE STRUCTURE OF R1-SOAR
The first step in building a knowledge-based system in Soar is to design and implement the base representation as a set of problem spaces within which the problem can be solved. As displayed in Fig. 2, R1-Soar currently consists of a hierarchy of ten task problem spaces (plus the selection problem space). These spaces represent a decomposition of the task in which the top space is given the goal to do the entire unibus configuration task; that is, to configure a sequence of modules to be put on a unibus. The other nine task spaces deal with subcomponents of this task. Each subspace implements one of the complex operators of its parent's problem space.

Each configuration task begins with a goal that uses the Unassigned Backplane problem space. This space has one operator for configuring a backplane that is instantiated with a parameter that determines which type of backplane is to be configured. The initial decision, of selecting which backplane to use next, appears as a choice between instances of this operator. Unless there is special search-control knowledge that knows which backplane should be used, no decision can be made. This difficulty (of indecision) leads to a subgoal that uses the selection problem space to evaluate the operators (by applying them to the original state and evaluating the resulting states). To do this, the evaluation operator makes recursive use of the Unassigned Backplane problem space.

The initial configuration of a backplane is accomplished in the five problem spaces rooted at the Configure Backplane space by: putting the backplane in a box (the Configure Box space), putting into the backplane as many modules as will fit (the Configure Modules space), reserving panel space in the cabinet for the module (Reserve Panel Space), putting the modules' boards into slots in the backplane (the Configure Boards space), and cabling the backplane to the previous backplane (done by an operator in the Configure Backplane space that is simple enough to be done directly by a rule, rather than requiring a new problem space).
Fig. 2. The task problem-space hierarchy for R1-Soar. [Diagram of the ten task problem spaces; visible labels include Unassigned Backplane, Configure Backplane, Unconfigure, and Reconfigure Boards.]
Each of these problem spaces contains between one and five operators. Some of the operators are simple enough to be implemented directly by rules, such as the cable-backplane operator in the Configure Backplane space, or the put-board-in-slot, go-to-next-slot, and go-to-previous-slot operators in the Configure Boards space. Others are complex enough to require problem solving in new problem spaces, yielding the problem-space hierarchy seen in Fig. 2.

In addition to containing operators, each problem space contains the knowledge allowing it to recognize the satisfaction of the goal for that problem space. Several kinds of goal detection can occur: 1) recognition of a desired state, 2) satisfaction of path constraints (avoiding illegal sequences of operators), and 3) optimization over some criterion (such as maximizing the value of the result or minimizing its cost). All these different forms of goals are realized by appropriate production rules. For example, the Configure Backplane space simply has the following goal-detection rule: if the modules have been placed in the backplane, and the backplane has been placed in a box, and the backplane has been cabled to the previous backplane (if there is one), then the goal is accomplished. In a more complicated case, the task of putting the boards from a module into slots in a backplane (the Configure Boards space) could be considered complete whenever all of the module's boards are associated with slots in the backplane. However, a two-board module can be configured by putting one board in the first slot and one in the last slot, or by putting the two boards into the first two slots, or by any one of the other combinatorial possibilities. For most modules it is desirable to put the boards as close to the front as possible to leave room for later modules (although there is one type of module that must go at the end of the backplane), so completed configurations are evaluated according to how much backplane space (first to last slot) they use. The goal is satisfied when the best completed configuration has been found.
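As an illustration of how such goal detection reduces to a recognition rule, the Configure Backplane test quoted above is just a conjunction of three checks. The attribute names on the dictionary state below are assumptions made for the sketch, not the names used in R1-Soar.

    def configure_backplane_done(state):
        """Goal detection for the Configure Backplane space: the modules
        are placed, the backplane is in a box, and it has been cabled to
        the previous backplane if one exists."""
        cabling_ok = (state["previous_backplane"] is None
                      or state["cabled_to_previous"])
        return state["modules_placed"] and state["in_box"] and cabling_ok

    state = {"modules_placed": True, "in_box": True,
             "previous_backplane": "backplane-1", "cabled_to_previous": True}
    print(configure_backplane_done(state))   # True: the goal is accomplished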
In addition to the constraints handled by evaluation functions (such as using minimum backplane space), many other constraints exist in the configuration task that complicate the task of a problem-solving system. These include possible incompatibilities between boards and slots, the limited amounts of power that the boxes provide for use by modules (a new box may be required if more power is needed), components that are needed but not ordered, restrictions on the location of a module in a backplane (at the front or back), and limits on the electrical length of the unibus (for which a unibus repeater is required). R1-Soar pursues this complex configuration problem by searching for the best configuration that meets all of the constraints, and then trying to optimize the configuration some more by relaxing one of the constraints - the ordering relationship among the modules. This relaxation (occurring in the four spaces rooted at the Reconfigure Modules space) may allow the removal of backplanes that were added over and above those on the initial order. When possible, the modules configured in these backplanes are removed (the Unconfigure Modules space), placed into unused locations in other backplanes (the Reconfigure Boards space), and the extra backplanes are removed from their boxes (the Unconfigure Box space).

As described so far, R1-Soar forms a base-reasoning system because its representation and processing are in terms of the fundamental relationships between objects in the domain. The main mode of reasoning consists of search in a set of problem spaces until the goals are achieved. One part of this search can be seen clearly in the Configure Boards space. Given a module with two boards of width 6, and a nine-slot backplane with slot widths of 4-6-6-6-6-6-6-6-4, a search proceeds through the problem space using the go-to-next-slot and put-board-in-slot operators. The search begins by making the easy decision of what to do with the first slot: it must be skipped because it is too narrow for either board. Then either one board can be placed in the second slot, or the slot can be skipped. If it is skipped, one board can be placed in the third slot, or it can be skipped, and so on. If, instead, a board is placed in the second slot, then it must go on to the third slot and decide whether to place the other board there or to skip the slot, and so on. All complete configurations are evaluated, and the path to the best one is selected. This is clearly not the most efficient way to solve this problem, but it is sufficient.

R1-Soar becomes a more knowledge-intensive system as rules are added to guide the search through the problem space and to implement special cases of operators - although the complete operator is too complex for direct rule implementation, special cases may be amenable. Most of the hand-crafted knowledge in R1-Soar is used to control the search. In the Configure Boards space, all search is eliminated (at least for modules that go in the front of the backplane) by adding three search-control rules: 1) operators that configure boards of the same width and type are equal, 2) prefer the operator that configures the widest board that will fit in the current backplane slot, and 3) prefer an operator that puts a board into a slot over one that goes to the next slot. These rules convert the search in this problem space from being depth-first to algorithmic - at each step the system knows exactly what to do next.
For the example above, the correct sequence is: go-to-next-slot, put-board-in-slot, go-to-next-slot, put-board-in-slot.
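A compact way to see the contrast is to code both behaviors over the example just given: the base system enumerates every legal pair of slots and keeps the placement with the smallest span, while the search-control rules reduce the choice at each slot to a single action. The slot and board widths come from the text; the function names and the greedy shortcut are assumptions of the sketch.

    from itertools import combinations

    SLOT_WIDTHS = [4, 6, 6, 6, 6, 6, 6, 6, 4]    # the nine-slot backplane
    BOARD_WIDTHS = [6, 6]                         # one module, two boards

    def base_search():
        """Base reasoning: enumerate every assignment of the two boards to
        distinct compatible slots, evaluate each complete configuration by
        the span of slots used, and keep the best (smallest span)."""
        compatible = [i for i, w in enumerate(SLOT_WIDTHS) if w >= 6]
        placements = combinations(compatible, len(BOARD_WIDTHS))
        return min(placements, key=lambda slots: max(slots) - min(slots))

    def guided_search():
        """With the search-control rules the choice is algorithmic: skip a
        slot only when no board fits, otherwise place a board (both boards
        have the same width, so their operators are interchangeable)."""
        placed, remaining = [], list(BOARD_WIDTHS)
        for slot, width in enumerate(SLOT_WIDTHS):
            if not remaining:
                break
            if width >= remaining[0]:        # prefer put-board-in-slot
                placed.append(slot)
                remaining.pop(0)
            # otherwise go-to-next-slot
        return tuple(placed)

    print(base_search())     # (1, 2): the second and third slots
    print(guided_search())   # (1, 2): the same answer with no search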
V. RESULTS AND DISCUSSION

In this section we evaluate how well R1-Soar supports the three objectives given in the introduction by examining its performance on four configuration tasks.
1) There is one two-board module to be put on the unibus.
2) There are three modules to be put on the unibus. One of the already configured backplanes must be undone in order to configure a unibus repeater.
3) There are six modules to be put on the unibus. Three of the modules require panel space in the cabinet.
4) There are four modules to be put on the unibus. Three of the modules will go into a backplane already ordered, and one will go into a backplane that must be added to the order. Later, this module is reconfigured into an open location in the first backplane, allowing removal of the extra backplane from the configuration.

Most of the results to be discussed here are for tasks 1) and 2), which were done in earlier versions of both Soar and R1-Soar (containing only the Unassigned Backplane, Configure Backplane, Configure Box, Configure Modules, and Configure Boards spaces, for a total of 242 rules). Tasks 3) and 4) were run in the current versions of Soar and R1-Soar (containing all of the problem spaces, for a total of 266 rules).3 Table I gives all of the results for these four tasks that will be used to evaluate the three objectives of this paper. The first line in the table shows that a system using a base representation can work, solving the rather simple task 1) after making 1731 decisions.

The first objective of this paper is to show that a general problem-solving system can work effectively at the knowledge-intensive end of the problem-solving spectrum. We examine three qualitatively different knowledge-intensive versions of R1-Soar: 1) where it has enough hand-crafted rules so that its knowledge is comparable to the level of knowledge in R1 (before learning on the full version), 2) where there are rules that have been acquired by chunking (after learning on the base version), and 3) where both kinds of rules exist (after learning on the full version). The hand-crafted expertise consists solely of search-control (operator selection) rules. The chunked expertise consists of both search-control and operator-application rules. In either case, this is expertise knowledge directly relating knowledge of the task domain to action in the task domain.

Table I shows the number of decisions required to complete each of the four configuration tasks when these three versions of R1-Soar are used. With hand-crafted search control, all four tasks were successfully completed, taking between 150 and 628 decisions. In the table, this is before learning on the full (search-control) version.
TABLE I
Number of decisions to completion for the four unibus configuration tasks. The base version (task 1) contains 232 rules, the partial version (tasks 1 and 2) contains 234 rules, and the full version contains 242 rules (tasks 1 and 2) or 266 rules (tasks 3 and 4). The number of rules learned for each task is shown in brackets in the during-learning column.

Task   Version   Before Learning   During Learning   After Learning
1      Base      1731              485 (59)          7
1      Partial   243               111 (14)          7
1      Full      150               90 (12)           7
2      Partial   1064              692 (109)         16
2      Full      479               344 (53)          16
3      Full      288               143 (20)          10
4      Full      628               -                 -
With just chunked search control, task 1) was accomplished in 7 decisions (after learning on the base version).4 A total of 3 of the 7 decisions deal with aspects outside the scope of the unibus configuration task (setting up the initial goal, problem space, and state). Soar takes about 1.4 s per decision, so this yields about 6 s for the configuration task - within a factor of 2 of the time taken by R1. It was not feasible to run the more complex task 2) without search control because the time required would have been enormous due to the combinatorial explosion - the first module alone could be configured in over 300 different ways. Tasks 3) and 4) were also more complicated than task 1), and were not attempted with the base version. With both hand-crafted and chunked search control, tasks 1)-3) required between 7 and 16 decisions (after learning on the full version). Task 4) learning had problems of overgeneralization. It should have learned that one module could not go in a particular backplane, but instead learned that the module could not go in any backplane. More discussion of overgeneralization in chunking can be found in [13].

In summary for the first objective, R1-Soar is able to do the unibus configuration task in a knowledge-intensive manner. To scale this result up to a full expert system (such as all of R1) we must know: 1) whether the rest of R1 is similar in its key characteristics to the portion already done, and 2) the effects of scale on a system built in Soar. With respect to the unibus configuration task being representative of the whole configuration task, qualitative differences between portions of R1 would be expected to manifest themselves as differences in amount of knowledge or as differences in problem-solving methods. The task that R1-Soar performs is atypical in the amount of knowledge required, but requires more knowledge, not less - 15.7 rules per subtask for R1-Soar's task, versus 10.3 for the entire task. The problem-solving methods used for the unibus configuration task are typical of the rest of R1 - predominantly match, supplemented fairly frequently with multistep look-ahead.
3 The difference in number of task rules between these two versions is actually higher because a number of the default (nontask) rules needed by earlier versions of Soar is no longer necessary.
4 For these runs it was assumed that the top-most goal in the hierarchy never terminates, and therefore is not chunked. If this assumption were changed, then the number of decisions with chunked search control would most likely be reduced to 1.
With respect to the scaling of R1-Soar up to R1's full task, Ops5, from which Soar is built, scales very well: time is essentially constant over the number of rules, and linear in the number of modifications (rather than the absolute size) of working memory [6]. Additional speed is also available in the form of the Ops83 production-system architecture, which is at least 24 times faster than Lisp-based Ops5 (on a VAX-780) [5], and a production-system machine currently being designed that is expected to yield a further multiplicative factor of between 40 and 160 [5], for a combined likely speedup of at least three orders of magnitude.

The second objective of this paper is to show how base reasoning and expertise can be combined in Soar to yield more expertise and a smooth transition to search in problem spaces when the expertise is incomplete. Toward this end we ran two more before-learning versions of R1-Soar on tasks 1) and 2): the base version, which has no search-control rules, and the partial version, which has two hand-crafted search-control rules. The base version sits at the knowledge-lean end of the problem-solving spectrum; the partial version occupies an intermediate point between the base system and the more knowledge-intensive versions already discussed. Task 1) took 1731 decisions for the base version, and 243 decisions for the partial version. Examining the trace of the problem solving reveals that most of the search in the base version goes to figuring out how to put the one module into the backplane. For the 9-slot backplane (of which 7 slots were compatible with the module's two boards), there are (7 choose 2) = 21 pairs of slots to be considered. The two search-control rules added in the partial version have already been discussed in the previous section: 1) make operators that configure boards of equal size be equal, and 2) prefer to put a board in a slot rather than skip the slot. These two rules reduce the number of decisions required for this task by almost an order of magnitude. With the addition of these two search-control rules, the second task could also be completed, requiring 1064 decisions.

In summary, the base system is capable of performing the tasks, albeit very slowly. If appropriate search control exists, search is reduced, lowering the number of decisions required to complete the task. If enough rules are added, the system acts like it is totally composed of expertise knowledge. When such knowledge is missing, as some is missing in the partial version, the system falls back on search in its problem spaces.

The third objective is to show that knowledge acquisition via Soar's chunking mechanism could compile computationally intensive problem solving into efficient rules. In Soar, chunks are learned for all goals experienced on every trial, so for exact task repetition (as is the case here), all of the learning occurs on the first trial. The during-learning column in Table I shows how many decisions were required on the trial where learning occurred. The bracketed number is the number of rules learned during that trial.
These results show that learning can improve performance by a factor of about 1.5-3, even the first time a task is attempted. This reflects a large degree of within-trial transfer of learning; that is, a chunk learned in one situation is reused in a later situation during the same trial. Some of these new rules become part of search control, establishing preferences for operators or states. Other rules become part of the implementation of operators, replacing their original implementations as searches in subspaces with efficient rules for the particular situations. In task 3), for example, three operator-implementation chunks (comprising four rules) were learned and used during the first attempt at the task. Two of the chunks were for goals solved in the Configure Boards space. Leaving out some details, the first chunk says that if the module has exactly one board and it is of width six, and the next slot in the backplane is of width six, then put the board into the next slot and move the slot pointer forward one slot. This is a macrooperator which accomplishes what previously required two operators in a lower problem space. The second chunk says that if the module has two boards, both of width six, and the current slot is of width four (too small for either board), and the two subsequent slots are of width six, then place the boards in those slots, and point to the last slot of the three as the current slot. The third chunk is a more complex one dealing with the reservation of panel space.

Comparing the number of decisions required before learning and after learning reveals savings of between a factor of 20 and 200 for the four unibus configuration tasks. In the process, between 12 and 109 rules are learned. The number of rules to be learned is determined by the number of distinct subgoals that need to be satisfied. If many of the subgoals are similar enough that a few chunks can deal with all of them, then fewer rules must be learned. A good example of this occurs in the base version of task 1), where most of the subgoals are resolved in one problem space (the Configure Boards space). Likewise, a small amount of general hand-crafted expertise can significantly reduce the number of rules to be learned. For task 1) the base version plus 59 learned rules leads to a system with 291 rules, the partial version plus 14 learned rules has 248 rules, and the full version plus 12 learned rules has 254 rules (some of the search-control rules in the full version do not help on this particular task). All three systems require the same number of decisions to process this configuration task.

In summary, chunking can generate new knowledge in the form of search-control and operator-implementation rules. These new rules can reduce the time to perform the task by nearly two orders of magnitude. For more complex tasks the benefits could be even larger. However, more work is required to deal with the problem of overgeneralization.
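As a rough illustration of what such an operator-implementation chunk encodes, the first chunk described above can be paraphrased as a condition-action rule. The Python rendering below is only a sketch: the dictionary layout, field names, and helper function are invented for the example and are not R1-Soar data structures, which are Ops5-style productions over working-memory elements.

```python
# Illustrative sketch of the first operator-implementation chunk learned in task 3):
# one board of width six, next slot of width six -> place the board, advance the slot
# pointer. The data layout here is hypothetical, chosen only to make the rule concrete.

def board_chunk_1(module, backplane, slot_index):
    """If the module has exactly one board of width 6 and the next slot is of width 6,
    put the board into that slot and move the slot pointer forward one slot."""
    boards = module["boards"]
    slot = backplane["slots"][slot_index]
    if len(boards) == 1 and boards[0]["width"] == 6 \
            and slot["width"] == 6 and slot["board"] is None:
        slot["board"] = boards[0]        # place the board in the next slot
        return slot_index + 1            # advance the slot pointer
    return None                          # the chunk does not apply

# Example use with invented data:
module = {"boards": [{"width": 6}]}
backplane = {"slots": [{"width": 6, "board": None}, {"width": 4, "board": None}]}
next_slot = board_chunk_1(module, backplane, 0)   # -> 1; the board now occupies slot 0
```

What previously required two operator applications in a lower problem space is thus compressed into a single rule firing, which is the sense in which the chunk acts as a macrooperator.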
VI. CONCLUSION

By implementing a portion of the R1 expert system with the Soar problem-solving architecture, we have provided evidence for three hypotheses: 1) a general problem-solving architecture can work at the knowledge-intensive end of the problem-solving spectrum, 2) such a system can effectively integrate base reasoning and expertise, and 3) a chunking mechanism can aid in the process of knowledge acquisition by compiling computationally intensive problem solving into efficient expertise-level rules.

The approach to knowledge-intensive programming can be summarized by the following steps: 1) design a set of base problem spaces within which the task can be solved, 2) implement the problem-space operators as either rules or problem spaces, 3) operationalize the goals via a combination of rules that test the current state, generate search-control information, and compute evaluation functions, and 4) improve the efficiency of the system by a combination of hand crafting more search control, using chunking, and developing evaluation functions that apply to more states.

REFERENCES
[1] J. Bachant and J. McDermott, "R1 revisited: Four years in the trenches," AI Mag., vol. 5, no. 3, 1984.
[2] J. S. Brown and K. VanLehn, "Repair theory: A generative theory of bugs in procedural skills," Cogn. Sci., vol. 4, pp. 379-426, 1980.
[3] B. Chandrasekaran and S. Mittal, "Deep versus compiled knowledge approaches to diagnostic problem-solving," Int. J. Man-Machine Studies, vol. 19, pp. 425-436, 1983.
[4] C. L. Forgy, OPS5 Manual. Dep. Comput. Sci., Carnegie-Mellon Univ., Pittsburgh, PA, Tech. Rep. 81-135, 1981.
[5] C. Forgy, A. Gupta, A. Newell, and R. Wedig, "Initial assessment of architectures for production systems," in Proc. Nat. Conf. Artif. Intell., Amer. Assoc. Artif. Intell., 1984.
[6] A. Gupta and C. Forgy, "Measurements on production systems," Dep. Comput. Sci., Carnegie-Mellon Univ., Pittsburgh, PA, Tech. Rep. 83-167, Dec. 1983.
[7] P. E. Hart, "Directions for AI in the eighties," SIGART Newsletter, vol. 79, pp. 11-16, 1982.
[8] F. Hayes-Roth, D. A. Waterman, and D. B. Lenat, "An overview of expert systems," in Building Expert Systems, F. Hayes-Roth, D. A. Waterman, and D. B. Lenat, Eds. Reading, MA: Addison-Wesley, 1983.
[9] J. E. Laird, "Universal subgoaling," Ph.D. dissertation, Dep. Comput. Sci., Carnegie-Mellon Univ., Pittsburgh, PA, Tech. Rep. 84-129, 1983.
[10] J. E. Laird and A. Newell, "A universal weak method," Dep. Comput. Sci., Carnegie-Mellon Univ., Pittsburgh, PA, Tech. Rep. 83-141, June 1983.
[11] J. E. Laird and A. Newell, "A universal weak method: Summary of results," in Proc. 8th Int. Joint Conf. Artif. Intell., 1983.
[12] J. E. Laird, A. Newell, and P. S. Rosenbloom, Soar: An Architecture for General Intelligence, 1985, in preparation.
[13] J. E. Laird, P. S. Rosenbloom, and A. Newell, "Towards chunking as a general learning mechanism," in Proc. Nat. Conf. Artif. Intell., Amer. Assoc. Artif. Intell., 1984.
[14] J. McDermott, "R1: A rule-based configurer of computer systems," Artif. Intell., vol. 19, Sept. 1982.
[15] A. Newell, "Heuristic programming: Ill-structured problems," in Progress in Operations Research, III, J. Aronofsky, Ed. New York: Wiley, 1969.
[16] A. Newell and P. S. Rosenbloom, "Mechanisms of skill acquisition and the law of practice," in Cognitive Skills and Their Acquisition, J. R. Anderson, Ed. Hillsdale, NJ: Erlbaum, 1981. Also in Dep. Comput. Sci., Carnegie-Mellon Univ., Pittsburgh, PA, Tech. Rep. 80-145, 1980.
[17] P. S. Rosenbloom, "The chunking of goal hierarchies: A model of practice and stimulus-response compatibility," Ph.D. dissertation, Dep. Comput. Sci., Carnegie-Mellon Univ., Pittsburgh, PA, Tech. Rep. 83-148, 1983.
[18] P. S. Rosenbloom and A. Newell, "The chunking of goal hierarchies: A generalized model of practice," in Machine Learning: An Artificial Intelligence Approach, Volume II, R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, Eds. Los Altos, CA: Morgan Kaufmann, 1985, in press.
[19] E. H. Shortliffe, Computer-Based Medical Consultation: MYCIN. New York: Elsevier, 1976.
Paul S. Rosenbloom received the B.S. degree (Phi Beta Kappa) in mathematical sciences from Stanford University, Stanford, CA, in 1976 and the M.S. and Ph.D. degrees in computer science from Carnegie-Mellon University, Pittsburgh, PA, in 1978 and 1983, respectively, with fellowships from the National Science Foundation and IBM. He did one year of graduate work in psychology at the University of California, San Diego, and was a Research Computer Scientist in the Department of Computer Science, Carnegie-Mellon University in 1983-1984. He is an Assistant Professor of Computer Science and Psychology at Stanford University. Primary research interests center around the nature of the cognitive architecture underlying artificial and natural intelligence. This work has included a model of human practice and developments toward its being a general learning mechanism (with J. E. Laird and A. Newell); a model of stimulus-response compatibility, and its use in the evaluation of problems in human-computer interaction (with B. John and A. Newell); and an investigation of how to do knowledge-intensive programming in a general, learning problem solver. Other research interests have included the application of artificial intelligence techniques to the production of world-championship caliber programs for the game of Othello.
John E. Laird received the B.S. degree in computer and communication sciences from the University of Michigan, Ann Arbor, in 1975 and the M.S. and Ph.D. degrees in computer science from Carnegie-Mellon University, Pittsburgh, PA, in 1978 and 1983, respectively. He is a Research Associate in the Intelligent Systems Laboratory at Xerox Palo Alto Research Center, Palo Alto, CA. His primary research interest is the nature of intelligence, both natural and artificial. He is currently pursuing the structure of the underlying architecture of intelligence (with A. Newell and P. Rosenbloom). Significant aspects of this research include a theory of the weak methods, a theory of the origin of subgoals, a general theory of learning, and a general theory of planning, all of which have been or are to be realized in the Soar architecture.
Allen Newell (SM '64-F'74) received the B.S. degree in physics from Stanford University, Stanford, CA, and the Ph.D. degree in industrial administration from Carnegie-Mellon University, Pittsburgh, PA, in 1957. He also studied mathematics at Princeton University, Princeton, NJ. He worked at the Rand Corporation before joining Carnegie-Mellon University in 1961. He has worked on artificial intelligence and cognitive psychology since their emergence in the mid-1950's, mostly on problem solving and cognitive architectures, as well as list processing, computer architecture, human-computer interfaces, and psychologically based models of human-computer interaction. He is the U.A. and Helen Whitaker University Professor of Computer Science at Carnegie-Mellon University. Dr. Newell received the Harry Goode Award of the American Federation of Information Processing Societies (AFIPS) and (with H. A. Simon) the A. M. Turing Award of the Association for Computing Machinery. He received the 1979 Alexander C. Williams Jr. Award of the Human Factors Society jointly with W. C. Biel, R. Chapman, and J. L. Kennedy. He is a member of the National Academy of Sciences, the National Academy of Engineering, and other related professional societies. He was the first president of the American Association for Artificial Intelligence (AAAI).
Edmund Orciuch received the B.S. degree in mathematics from Worcester State College, Worcester, MA, in 1975. He joined Digital in 1980, and helped lead Digital's pioneering effort in the area of expert systems applications. He was the Chief Knowledge Engineer on XCON, Digital's premier expert system that configures custom computer systems, and later Chief Knowledge Engineer on ISA, an expert system that schedules customer orders. He spent the 1983/1984 academic year at Carnegie-Mellon University, Pittsburgh, PA, as Visiting Scientist in the AI Apprenticeship Program sponsored by the Digital Equipment Corporation. In addition to his project contributions, he was one of the designers of Digital's Corporate AI Training program. He developed and taught a course on OPS5, a rule-based language used to build expert systems, and he has conducted seminars on expert systems design and knowledge engineering. He is currently with the Intelligent Systems Technology Group in Digital's AI Technology Center, where he is working in the areas of expert systems architectures and knowledge acquisition.
The Soar Papers 1986
CHAPTER 10
Chunking in Soar: The Anatomy of a General Learning Mechanism
J. E. Laird, Xerox Palo Alto Research Center; P. S. Rosenbloom, Stanford University; and A. Newell, Carnegie Mellon University
Key words: learning from experience, general learning mechanisms, problem solving, chunking, production systems, macro-operators, transfer
Abstract. In this article we describe an approach to the construction of a general learning mechanism based on chunking in Soar. Chunking is a learning mechanism that acquires rules from goal-based experience. Soar is a general problem-solving architecture with a rule-based memory. In previous work we have demonstrated how the combination of chunking and Soar could acquire search-control knowledge (strategy acquisition) and operator implementation rules in both search-based puzzle tasks and knowledge-based expert-systems tasks. In this work we examine the anatomy of chunking in Soar and provide a new demonstration of its learning capabilities involving the acquisition and use of macro-operators.
1. Introduction
The goal of the Soar project is to build a system capable of general intelligent behavior. We seek to understand what mechanisms are necessary for intelligent behavior, whether they are adequate for a wide range of tasks - including search-intensive tasks, knowledge-intensive tasks, and algorithmic tasks - and how they work together to form a general cognitive architecture. One necessary component of such an architecture, and the one on which we focus in this paper, is a general learning mechanism. Intuitively, a general learning mechanism should be capable of learning all that needs to be learned. To be a bit more precise, assume that we have a
general performance system capable of solving any problem in a broad set of domains. Then, a general learning mechanism for that performance system would possess the following three properties:1

• Task generality. It can improve the system's performance on all of the tasks in the domains. The scope of the learning component should be the same as that of the performance component.
• Knowledge generality. It can base its improvements on any knowledge available about the domain. This knowledge can be in the form of examples, instructions, hints, its own experience, etc.
• Aspect generality. It can improve all aspects of the system. Otherwise there would be a wandering-bottleneck problem (Mitchell, 1983), in which those aspects not open to improvement would come to dominate the overall performance effort of the system.

1 These properties are related to, but not isomorphic with, the three dimensions of variation of learning mechanisms described in Carbonell, Michalski, and Mitchell (1983) - application domain, underlying learning strategy, and representation of knowledge.
These properties relate to the scope of the learning, but they say nothing concerning the generality and effectiveness of what is learned. Therefore we add a fourth property.

• Transfer of learning. What is learned in one situation will be used in other situations to improve performance. It is through the transfer of learned material that generalization, as it is usually studied in artificial intelligence, reveals itself in a learning problem solver.
Generality thus plays two roles in a general learning mechanism: in the scope of application of the mechanism and the generality of what it learns.

There are many possible organizations for a general learning mechanism, each with different behavior and implications. Some of the possibilities that have been investigated within AI and psychology include:

• A Multistrategy Learner. Given the wide variety of learning mechanisms currently being investigated in AI and psychology, one obvious way to achieve a general learner is to build a system containing a combination of these mechanisms. The best example of this to date is Anderson's (1983a) ACT* system, which contains six learning mechanisms.
• A Deliberate Learner. Given the breadth required of a general learning mechanism, a natural way to build one is as a problem solver that deliberately devises modifications that will improve performance. The modifications are usually based on analyses of the tasks to be accomplished, the structure of the problem solver, and the system's performance on the tasks. Sometimes this problem solving is done by the performance system itself, as in Lenat's AM (1976) and Eurisko (1983) programs, or in a production system that employs a build operation (Waterman, 1975) - whereby productions can themselves create new productions - as in Anzai and Simon's (1979) work on learning by doing. Sometimes the learner is constructed as a separate critic with its own problem solver (Smith, Mitchell, Chestek, & Buchanan, 1977; Rendell, 1983), or as a set of critics as in Sussman's (1977) Hacker program.
• A Simple Experience Learner. There is a single learning mechanism that bases its modifications on the experience of the problem solver. The learning mechanism is fixed, and does not perform any complex problem solving. Examples of this approach are memo functions (Michie, 1968; Marsh, 1970), macro-operators in Strips (Fikes, Hart & Nilsson, 1972), production composition (Lewis, 1978; Neves & Anderson, 1981), and knowledge compilation (Anderson, 1983b).
The third approach, the simple experience learner, is the one adopted in Soar. In some ways it is the most parsimonious of the three alternatives: it makes use of only one learning mechanism, in contrast to a multistrategy learner; it makes use of only one problem solver, in contrast to a critic-based deliberate learner; and it requires only problem solving about the actual task to be performed, in contrast to both kinds of deliberate learner. Counterbalancing the parsimony is that it is not obvious a priori that a simple experience learner can provide an adequate foundation for the construction of a general learning mechanism. At first glance, it would appear that such a mechanism would have difficulty learning from a variety of sources of knowledge, learning about all aspects of the system, and transferring what it has learned to new situations.

The hypothesis being tested in the research on Soar is that chunking, a simple experience-based learning mechanism, can form the basis for a general learning mechanism.2

2 For a comparison of chunking to other simple mechanisms for learning by experience, see Rosenbloom and Newell (1986).

Chunking is a mechanism originally developed as part of a psychological model of memory (Miller, 1956). The concept of a chunk - a symbol that designates a pattern of other symbols - has been much studied as a model of memory organization. It has been used to explain such phenomena as why the span of short-term memory is approximately constant, independent of the complexity of the items to be remembered (Miller, 1956), and why chess masters have an advantage over novices in reproducing chess positions from memory (Chase & Simon, 1973). Newell and Rosenbloom (1981) proposed chunking as the basis for a model of
human practice and used it to model the ubiquitous power law of practice - that the time to perform a task is a power-law function of the number of times the task has been performed. The model was based on the idea that practice improves performance via the acquisition of knowledge about patterns in the task environment, that is, chunks. When the model was implemented as part of a production-system architecture, this idea was instantiated with chunks relating patterns of goal parameters to patterns of goal results (Rosenbloom, 1983; Rosenbloom & Newell, 1986). By replacing complex processing in subgoals with chunks learned during practice, the model could improve its speed in performing a single task or set of tasks.

To increase the scope of the learning beyond simple practice, a similar chunking mechanism has been incorporated into the Soar problem-solving architecture (Laird, Newell & Rosenbloom, 1985). In previous work we have demonstrated how chunking can improve Soar's performance on a variety of tasks and in a variety of ways (Laird, Rosenbloom & Newell, 1984). In this article we focus on presenting the details of how chunking works in Soar (Section 3), and describe a new application involving the acquisition of macro-operators similar to those reported by Korf (1985a) (Section 4). This demonstration extends the claims of generality, and highlights the ability of chunking to transfer learning between different situations.

Before proceeding to the heart of this work - the examination of the anatomy of chunking and a demonstration of its capabilities - it is necessary to make a fairly extensive digression into the structure and performance of the Soar architecture (Section 2). In contrast to systems with multistrategy or deliberate learning mechanisms, the learning phenomena exhibited by a system with only a simple experience-based learning mechanism is a function not only of the learning mechanism itself, but also of the problem-solving component of the system. The two components are closely coupled and mutually supportive.
2. Soar - an architecture for general intelligence
Soar is an architecture for general intelligence that has been applied to a variety of tasks (Laird, Newell, & Rosenbloom, 1985; Rosenbloom, Laird, McDermott, Newell, & Orciuch, 1985): many of the classic AI toy tasks such as the Tower of Hanoi and the Blocks World; tasks that appear to involve non-search-based reasoning, such as syllogisms, the three-wise-men puzzle, and sequence extrapolation; and large tasks requiring expert-level knowledge, such as the R1 computer configuration task (McDermott, 1982). In this section we briefly review the Soar architecture and present an example of its performance in the Eight Puzzle.
2.1 The architecture
Performance in Soar is based on the problem space hypothesis: all goal-oriented behavior occurs as search in problem spaces (Newell, 1980). A problem space for a task domain consists of a set of states representing possible situations in the task domain and a set of operators that transform one state into another one. For example, in the chess domain the states are configurations of pieces on the board, while the operators are the legal moves, such as P-K4. In the computer-configuration domain the states are partially configured computers, while the operators add components to the existing configuration (among other actions). Problem solving in a problem space consists of starting at some given initial state, and applying operators (yielding intermediate states) until a desired state is reached that is recognized as achieving the goal.

In Soar, each goal has three slots, one each for a current problem space, state, and operator. Together these four components - a goal along with its current problem space, state and operator - comprise a context. Goals can have subgoals (and associated contexts), which form a strict goal-subgoal hierarchy. All objects (such as goals, problem spaces, states, and operators) have a unique identifier, generated at the time the object was created. Further descriptions of an object are called augmentations. Each augmentation has an identifier, an attribute, and a value. The value can either be a constant value, or the identifier of another object. All objects are connected via augmentations (either directly, or indirectly via a chain of augmentations) to one of the objects in a context, so that the identifiers of objects act as nodes of a semantic network, while the augmentations represent the arcs or links.

Throughout the process of satisfying a goal, Soar makes decisions in order to select between the available problem spaces, states, and operators. Every problem-solving episode consists of a sequence of decisions, and these decisions determine the behavior of the system. Problem solving in pursuit of a goal begins with the selection of a problem space for the goal. This is followed by the selection of an initial state, and then an operator to apply to the state. Once the operator is selected, it is applied to create a new state. The new state can then be selected for further processing (or the current state can be kept, or some previously generated state can be selected), and the process repeats as a new operator is selected to apply to the selected state. The weak methods can be represented as knowledge for controlling the selection of states and operators (Laird & Newell, 1983a). The knowledge that controls these decisions is collectively called search control. Problem solving without search control is possible in Soar, but it leads to an exhaustive search of the problem space.

Figure 1 shows a schematic representation of a series of decisions. To bring the available search-control knowledge to bear on the making of a decision, each decision involves a monotonic elaboration phase. During the elaboration phase, all directly available knowledge relevant to the current situation is brought to bear. This is the act of retrieving knowledge from memory to be used to control problem solving.
Figure 1. The Soar decision cycle. (Each decision consists of an elaboration phase that runs to quiescence, followed by the decision procedure, which gathers and interprets preferences and then either replaces a context object or, at an impasse, creates a subgoal; the figure shows three successive decisions.)
In Soar, the long-term memory is structured as a production system, with all directly available knowledge represented as productions.3 The elaboration phase consists of one or more cycles of production execution in which all of the eligible productions are fired in parallel. The contexts of the goal hierarchy and their augmentations serve as the working memory for these productions. The information added during the elaboration phase can take one of two forms. First, existing objects may have their descriptions elaborated (via augmentations) with new or existing objects, such as the addition of an evaluation to a state. Second, data structures called preferences can be created that specify the desirability of an object for a slot in a context. Each preference indicates the context in which it is relevant by specifying the appropriate goal, problem space, state and operator.

When the elaboration phase reaches quiescence - when no more productions are eligible to fire - a fixed decision procedure is run that gathers and interprets the preferences provided by the elaboration phase to produce a specific decision. Preferences of type acceptable and reject determine whether or not an object is a candidate for a context. Preferences of type better, equal, and worse determine the relative worth of objects. Preferences of type best, indifferent and worst make absolute judgements about the worth of objects.4
3 We will use the terms production and rule interchangeably throughout this paper.
4 There is also a parallel preference that can be used to assert that two operators should execute simultaneously.
Starting from the oldest context, the decision procedure uses the preferences to determine if the current problem space, state, or operator in any of the contexts should be changed. The problem space is considered first, followed by the state and then the operator. A change is made if one of the candidate objects for the slot dominates (based on the preferences) all of the others, or if a set of equal objects dominates all of the other objects. In the latter case, a random selection is made between the equal objects. Once a change has been made, the subordinate positions in the context (state and operator if a problem space is changed) are initialized to undecided, all of the more recent contexts in the stack are discarded, the decision procedure terminates, and a new decision commences.

If sufficient knowledge is available during the search to uniquely determine a decision, the search proceeds unabated. However, in many cases the knowledge encoded into productions may be insufficient to allow the direct application of an operator or the making of a search-control decision. That is, the available preferences do not determine a unique, uncontested change in a context, causing an impasse in problem solving to occur (Brown & VanLehn, 1980). Four classes of impasses can arise in Soar: (1) no-change (the elaboration phase ran to quiescence without suggesting any changes to the contexts), (2) tie (no single object or group of equal objects was better than all of the other candidate objects), (3) conflict (two or more candidate objects were better than each other), and (4) rejection (all objects were rejected, even the current one). All types of impasse can occur for any of the three context slots associated with a goal - problem space, state, and operator - and a no-change impasse can occur for the goal. For example, a state tie occurs whenever there are two or more competing states and no directly available knowledge to compare them. An operator no-change occurs whenever no context changes are suggested after an operator is selected (usually because not enough information is directly available to allow the creation of a new state).

Soar responds to an impasse by creating a subgoal (and an associated context) to resolve the impasse. Once a subgoal is created, a problem space must be selected, followed by an initial state, and then an operator. If an impasse is reached in any of these decisions, another subgoal will be created to resolve it, leading to the hierarchy of goals in Soar. By generating a subgoal for each impasse, the full problem-solving power of Soar can be brought to bear to resolve the impasse. These subgoals correspond to all of the types of subgoals created in standard AI systems (Laird, Newell, & Rosenbloom, 1985). This capability to generate automatically all subgoals in response to impasses and to open up all aspects of problem-solving behavior to problem solving when necessary is called universal subgoaling (Laird, 1984). Because all goals are generated in response to impasses, and each goal can have at most one impasse at a time, the goals (contexts) in working memory are structured as a stack, referred to as the context stack.

A subgoal terminates when its impasse is resolved. For example, if a tie impasse arises, the subgoal generated for it will terminate when sufficient preferences have been created so that a single object (or set of equal objects) dominates the others. When a subgoal terminates, Soar pops the context stack, removing from working memory all augmentations created in that
subgoal that are not connected to a prior context, either directly or indirectly (by a chain of augmentations), and preferences whose context objects do not match objects in prior contexts. Those augmentations and preferences that are not removed are the results of the subgoal.

Default knowledge (in the form of productions) exists in Soar to cope with any of the subgoals when no additional knowledge is available. For some subgoals (those created for all types of rejection impasses and no-change impasses for goals, problem spaces, and states) this involves simply backing up to a prior choice in the context, but for other subgoals (those created for tie, conflict, and operator no-change impasses), this involves searches for knowledge that will resolve the subgoal's impasse. If additional non-default knowledge is available to resolve an impasse, it dominates the default knowledge (via preferences) and controls the problem solving within the subgoal.
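To make the preference scheme concrete, the following sketch shows how a single slot decision might be computed from a set of preferences. It is a simplification written for this exposition, not Soar code: it covers only acceptable, reject, best, and better preferences, and it collapses the full impasse analysis into three cases.

```python
def decide_slot(candidates, preferences):
    """candidates: objects proposed for one context slot.
    preferences: tuples such as ("acceptable", x), ("reject", x), ("best", x),
    or ("better", x, y), meaning x is better than y."""
    acceptable = {p[1] for p in preferences if p[0] == "acceptable"}
    rejected = {p[1] for p in preferences if p[0] == "reject"}
    live = [c for c in candidates if c in acceptable and c not in rejected]
    if not live:
        return ("impasse", "rejection")

    # An unrejected object with a best preference wins outright.
    best = [c for c in live if ("best", c) in preferences]
    if len(best) == 1:
        return ("select", best[0])

    # Otherwise a candidate must dominate all others via better/worse preferences.
    dominated = {p[2] for p in preferences if p[0] == "better"}
    undominated = [c for c in live if c not in dominated]
    if len(undominated) == 1:
        return ("select", undominated[0])
    return ("impasse", "conflict" if not undominated else "tie")

# With three acceptable eight-puzzle operators and no comparative knowledge,
# the result is a tie impasse, which Soar resolves by creating a subgoal.
prefs = [("acceptable", "left"), ("acceptable", "up"), ("acceptable", "down")]
print(decide_slot(["left", "up", "down"], prefs))    # -> ('impasse', 'tie')
```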
2.2 An example problem solving task
Consider the Eight Puzzle, in which there are eight numbered, movable tiles set in a 3 x 3 frame. One cell of the frame is always empty (the blank), making it possible to move an adjacent tile into the empty cell. The problem is to transform one configuration of tiles into a second configuration by moving the tiles. The states of the eight-puzzle problem space are configurations of the numbers 1-8 in a 3 x 3 grid. There is a single general operator to move adjacent tiles into the empty cell. For a given state, an instance of this operator is created for each of the cells adjacent to the empty cell. Each of these operator instances is instantiated with the empty cell and one of the adjacent cells. To simplify our discussion, we will refer to these instantiated operators by the direction they move a tile into the empty cell: up, down, left, or right. Figure 2 shows an example of the initial and desired states of an Eight Puzzle problem.

To encode this task in Soar, one must include productions that propose the appropriate problem space, create the initial state of that problem space, implement the operators of the problem space, and detect the desired state when it is achieved.
Figure 2. Example initial and desired states of the Eight Puzzle.
If no additional knowledge is available, an exhaustive depth-first search occurs as a result of the default processing for tie impasses. Tie impasses arise each time an operator has to be selected. In response to the subgoals for these impasses, alternatives are investigated to determine the best move. Whenever another tie impasse arises during the investigation of one of the alternatives, an additional subgoal is generated, and the search deepens. If additional search-control knowledge is added to provide an evaluation of the states, the search changes to steepest-ascent hill climbing. As more or different search-control knowledge is added, the behavior of the search changes in response to the new knowledge. One of the properties of Soar is that the weak methods, such as generate and test, means-ends analysis, depth-first search and hill climbing, do not have to be explicitly selected, but instead emerge from the structure of the task and the available search-control knowledge (Laird & Newell, 1983a; Laird & Newell, 1983b; Laird, 1984).

Another way to control the search in the Eight Puzzle is to break it up into a set of subgoals to get the individual tiles into position. We will look at this approach in some detail because it forms the basis for the use of macro-operators for the Eight Puzzle. Means-ends analysis is the standard technique for solving problems where the goal can be decomposed into a set of subgoals, but it is ineffective for problems such as the Eight Puzzle that have non-serializable subgoals - tasks for which there exists no ordering of the subgoals such that successive subgoals can be achieved without undoing what was accomplished by earlier subgoals (Korf, 1985a). Figure 3 shows an intermediate state in problem solving where tiles 1 and 2 are in their desired positions. In order to move tile 3 into its desired position, tile 2 must be moved out of its desired position.

Non-serializable subgoals can be tractable if they are serially decomposable (Korf, 1985a). A set of subgoals is serially decomposable if there is an ordering of them such that the solution to each subgoal depends only on that subgoal and on the preceding ones in the solution order. In the Eight Puzzle the subgoals are, in order: (1) have the blank in its correct position; (2) have the blank and the first tile in their correct positions; (3) have the blank and the first two tiles in their correct positions; and so on through the eighth tile. Each subgoal depends only on the positions of the blank and the previously placed tiles. Within one subgoal a previous subgoal may be undone, but if it is, it must be re-achieved before the current subgoal is completed.
Figure 3. Non-serializable subgoals in the Eight Puzzle.
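A minimal sketch of the eight-puzzle problem space and of the serially decomposable subgoal test may help fix the ideas. The tuple-based state encoding and helper names below are illustrative choices rather than anything taken from Soar; the goal configuration assumed puts the blank in the center (as in the desired state used later in the paper), and the exact arrangement of the numbered tiles is an assumption made only for the example.

```python
# Illustrative sketch, not Soar code: eight-puzzle states, the move operator, and
# the serially decomposable subgoals "blank and tiles 1..k are in place".

DESIRED = ((1, 2, 3),
           (8, 0, 4),     # 0 stands for the blank; its home is the center cell
           (7, 6, 5))

def find(state, tile):
    for r, row in enumerate(state):
        for c, value in enumerate(row):
            if value == tile:
                return r, c

def moves(state):
    """The single general operator, instantiated once per cell adjacent to the blank:
    slide that tile into the empty cell."""
    br, bc = find(state, 0)
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = br + dr, bc + dc
        if 0 <= r < 3 and 0 <= c < 3:
            grid = [list(row) for row in state]
            grid[br][bc], grid[r][c] = grid[r][c], grid[br][bc]
            yield tuple(tuple(row) for row in grid)

def subgoal_satisfied(state, k):
    """Subgoal k of the serial decomposition: the blank and tiles 1..k are in place.
    Earlier subgoals may be undone while a later one is pursued, but each must hold
    again when its operator terminates."""
    return all(find(state, t) == find(DESIRED, t) for t in range(k + 1))
```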
 1  G1 solve-eight-puzzle
 2  P1 eight-puzzle-sd
 3  S1
 4  O1 place-blank
 5    ==>G2 (resolve-no-change)
 6    P2 eight-puzzle
 7    S1
 8      ==>G3 (resolve-tie operator)
 9      P3 tie
10      S2 {left, up, down}
11      O5 evaluate-object (O2 (left))
12        ==>G4 (resolve-no-change)
13        P2 eight-puzzle
14        S1
15        O2 left
16        S3
17    O2 left
18    S4
19  S4
20  O8 place-1

Figure 4. A problem-solving trace for the Eight Puzzle. Each line of the trace includes, from left to right, the decision number, the identifier of the object selected, and possibly a short description of the object.
Adopting this approach does not result in new knowledge for directly controlling the selection of operators and states in the eight-puzzle problem space. Instead it provides knowledge about how to structure and decompose the puzzle. This knowledge consists of the set of serially decomposable subgoals, and the ordering of those subgoals. To encode this knowledge in Soar, we have added a second problem space, eight-puzzle-sd, with a set of nine operators corresponding to the nine subgoals.5 For example, the operator place-2 will place tile 2 in its desired position, while assuring that the blank and the first tile will also be in position. The ordering of the subgoals is encoded as search-control knowledge that creates preferences for the operators. Figure 4 shows a trace of the decisions for a short problem-solving episode for the initial and desired states from Figure 2.
5 Both place-7 and place-8 are always no-ops because once the blank and tiles 1-6 are in place, either tiles 7 and 8 must also be in place, or the problem is unsolvable. They can therefore be safely ignored.
This example is heavily used in the remainder of the paper, so we shall go through it in some detail. To start problem solving, the current goal is initialized to be solve-eight-puzzle (in decision 1). The goal is represented in working memory by an identifier, in this case G1. Problem solving begins in the eight-puzzle-sd problem space. Once the initial state, S1, is selected, preferences are generated that order the operators so that place-blank is selected. Application of this operator, and all of the eight-puzzle-sd operators, is complex, often requiring extensive problem solving. Because the problem-space hypothesis implies that such problem solving should occur in a problem space, the operator is not directly implemented as rules. Instead, a no-change impasse leads to a subgoal to implement place-blank, which will be achieved when the blank is in its desired position. The place-blank operator is then implemented as a search in the eight-puzzle problem space for a state with the blank in the correct position. This search can be carried out using any of the weak methods described earlier, but for this example, let us assume there is no additional search-control knowledge.

Once the initial state is selected (decision 7), a tie impasse occurs among the operators that move the three adjacent tiles into the empty cell (left, up and down). A resolve-tie subgoal (G3) is automatically generated for this impasse, and the tie problem space is selected. Its states are sets of objects being considered, and its operators evaluate objects so that preferences can be created. One of these evaluate-object operators (O5) is selected to evaluate the operator that moves tile 8 to the left, and a resolve-no-change subgoal (G4) is generated because there are no productions that directly compute an evaluation of the left operator for state S1. Default search-control knowledge attempts to implement the evaluate-object operator by applying the left operator to state S1. This is accomplished in the subgoal (decisions 13-16), yielding the desired state (S3). Because the left operator led to a solution for the goal, a preference is returned for it that allows it to be selected immediately for state S1 (decision 17) in goal G2, flushing the two lower subgoals (G3 and G4). If this state were not the desired state, another tie impasse would arise and the tie problem space would be selected for this new subgoal. The subgoal combination of a resolve-tie followed by a resolve-no-change on an evaluate-object operator would recur, giving a depth-first search.

Applying the left operator to state S1 yields state S4, which is the desired result of the place-blank operator in goal G1 above. The place-1 operator (O8) is then selected as the current operator. As with place-blank, place-1 is implemented by a search in the eight-puzzle problem space. It succeeds when both tile 1 and the blank are in their desired positions. With this problem-solving strategy, each tile is moved into place by one of the operators in the eight-puzzle-sd problem space. In the subgoals that implement the eight-puzzle-sd operators, many of the tiles already in place might be moved out of place; however, they must be back in place for the operator to terminate successfully.
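Continuing the sketch started after Figure 3 (and reusing its DESIRED, moves, and subgoal_satisfied helpers), the eight-puzzle-sd strategy can be outlined as an ordered pass over the place-k subgoals, each implemented by a search in the eight-puzzle space. The breadth-first search below merely stands in for the depth-first lookahead that Soar's default tie-impasse processing produces; none of this is Soar code.

```python
from collections import deque

def achieve(state, k):
    """Search the eight-puzzle space until subgoal k holds (blank and tiles 1..k in
    place). Breadth-first search is used here for simplicity; in Soar the same work
    arises from tie impasses and evaluate-object subgoals."""
    frontier, seen = deque([state]), {state}
    while frontier:
        s = frontier.popleft()
        if subgoal_satisfied(s, k):
            return s
        for nxt in moves(s):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return None

def eight_puzzle_sd(initial):
    """Apply the serially decomposable operators in order: place-blank (k = 0),
    then place-1 ... place-6; place-7 and place-8 are no-ops (see footnote 5)."""
    state = initial
    for k in range(7):
        state = achieve(state, k)
        if state is None:
            return None    # unsolvable instance
    return state
```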
3. Chunking in Soar
Soar was originally designed to be a general (non-learning) problem solver. Nevertheless, its problem-solving and memory structures support learning in a number of ways. The structure of problem solving in Soar determines when new knowledge is needed, what that knowledge might be, and when it can be acquired.
• Determining when new knowledge is needed. In Soar, impasses occur if and only if the directly available knowledge is either incomplete or inconsistent. Therefore, impasses indicate when the system should attempt to acquire new knowledge.
• Determining what to learn. While problem solving within a subgoal, Soar can discover information that will resolve an impasse. This information, if remembered, can avert similar impasses in future problem solving.
• Determining when new knowledge can be acquired. When a subgoal completes, because its impasse has been resolved, an opportunity exists to add new knowledge that was not already explicitly known.
Soar 's long-term memory, which is based on a production system and the workings of the elaboration phase, supports learning in two ways: •
• Integrating new knowledge. Productions provide a modular representation of knowledge, so that the integration of new knowledge only requires adding a new production to production memory and does not require a complex analysis of the previously stored knowledge in the system (Newell, 1973; Waterman, 1975; Davis & King, 1976; Anderson, 1983b).
• Using new knowledge. Even if the productions are syntactically modular, there is no guarantee that the information they encode can be integrated together when it is needed. The elaboration phase of Soar brings all appropriate knowledge to bear, with no requirement of synchronization (and no conflict resolution). The decision procedure then integrates the results of the elaboration phase.
Chunking in Soar takes advantage of this support to create rules that summarize the processing of a subgoal, so that in the future, the costly problem solving in the subgoal can be replaced by direct rule application. When a subgoal is generated, a learning episode begins that could lead to the creation of a chunk. During problem solving within the subgoal, information accumulates on which a chunk can be based. When the subgoal terminates, a chunk can be created. Each chunk is a rule (or set of rules) that gets added to the production memory. Chunked knowledge is brought to bear during the elaboration phase of later decisions. In the remainder of this section we look in more detail at the process of chunk creation, evaluate the scope of chunking as a learning mechanism, and examine the sources of chunk generality.
3.1 Constructing chunks
Chunks are based on the working-memory elements that are either examined or created during problem solving within a subgoal. The conditions consist of those aspects of the situation that existed prior to the goal, and which were examined during the processing of the goal, while the actions consist of the results of the goal. When the subgoal terminates,6 the collected working-memory elements are converted into the conditions and actions of one or more productions.7 In this subsection, we describe in detail the three steps in chunk creation: (1) the collection of conditions and actions, (2) the variabilization of identifiers, and (3) chunk optimization.

3.1.1 Collecting conditions and actions
The conditions of a chunk should test those aspects of the situation existing prior to the creation of the goal that are relevant to the results that satisfy the goal. In Soar this corresponds to the working-memory elements that were matched by productions that fired in the goal (or one of its subgoals), but that existed before the goal was created. These are the elements that the problem solving implicitly deemed to be relevant to the satisfaction of the subgoal. This collection of working-memory elements is maintained for each active goal in the goal's referenced-list.8 Soar allows productions belonging to any goal in the context stack to execute at any time, so updating the correct referenced-list requires determining for which goal in the stack the production fired. This is the most recent of the goals matched by the production's conditions. The production's firing affects the chunks created for that goal and all of its supergoals, but because the firing is independent of the more recent subgoals, it has no effect on the chunks built for those subgoals.
6 The default behavior for Soar is to create a chunk always; that is, every time a subgoal terminates. The major alternative to creating chunks for all terminating goals is to chunk bottom-up, as was done in modeling the power law of practice (Rosenbloom, 1983). In bottom-up chunking, only terminal goals - goals for which no subgoals were generated - are chunked. As chunks are learned for subgoals, the subgoals need no longer be generated (the chunks accomplish the subgoals' tasks before the impasses occur), and higher goals in the hierarchy become eligible for chunking. It is unclear whether chunking always or bottom-up will prove more advantageous in the long run, so to facilitate experimentation, both options are available in Soar.
7 Production composition (Lewis, 1978) has also been used to learn productions that summarize goals (Anderson, 1983b). It differs most from chunking in that it examines the actual definitions of the productions that fired in addition to the working-memory elements referenced and created by the productions.
8 If a fired production has a negated condition - a condition testing for the absence in working memory of an element matching its pattern - then the negated condition is instantiated with the appropriate variable bindings from the production's positive conditions. If the identifier of the instantiated condition existed prior to the goal, then the instantiated condition is included in the referenced-list.
No chunk is created if the subgoal's results were not based on prior information; for example, when an object is input from the outside, or when an impasse is resolved by domain-independent default knowledge.

The actions of a chunk are based on the results of the subgoal for which the chunk was created. No chunk is created if there are no results. This can happen, for example, when a result produced in a subgoal leads to the termination of a goal much higher in the goal hierarchy. All of the subgoals that are lower in the hierarchy will also be terminated, but they may not generate results.

For an example of chunking in action, consider the terminal subgoal (G4) from the problem-solving episode in Figure 4. This subgoal was created as a result of a no-change impasse for the evaluate-object operator that should evaluate the operator that will move tile 8 to the left. The problem solving within goal G4 must implement the evaluate-object operator. Figure 5 contains a graphic representation of part of the working memory for this subgoal near the beginning of problem solving (A) and just before the subgoal is terminated (B). The working memory that existed before the subgoal was created consisted of the augmentations of the goal to resolve the tie between the eight-puzzle operators, G3, and its supergoals (G2 and G1, not shown). The tie problem space is the current problem space of G3, while state S2 is the current state and the evaluate-object operator (O5) is the current operator. D1 is the desired state of having the blank in the middle, but with no constraint on the tiles in the other cells (signified by the X's in the figure). All of these objects have further descriptions, some only partially shown in the figure.

The purpose of goal G4 is to evaluate operator O2, which will move tile 8 to the left in the initial state (S1). The first steps are to augment the goal with the desired state (D1) and then select the eight-puzzle problem space (P2), the state to which the operator will be applied (S1), and finally the operator being evaluated (O2). To do this, the augmentations from the evaluate-object operator (O5) to these objects are accessed and therefore added to the referenced-list (the highlighted arrows in part (A) of Figure 5). Once operator O2 is selected, it is applied by a production that creates a new state (S3).

The application of the operator depends on the exact representation used for the states of the problem space. State S1 and desired state D1, which were shown only schematically in Figure 5, are shown in detail in Figure 6. The states are built out of cells and tiles (only some of the cells and tiles are shown in Figure 6). The nine cells (C1-C9) represent the structure of the Eight Puzzle frame. They form a 3 x 3 grid in which each cell points to its adjacent cells. There are eight numbered tiles (T2-T9), and one blank (T1). Each tile points to its name, 1 through 8 for the numbered tiles and 0 for the blank. Tiles are associated with cells by objects called bindings. Each state contains 9 bindings, each of which associates one tile with the cell where it is located. The bindings for the desired state, D1, are L1-L9, while the bindings for state S1 are B1-B9. The fact that the blank is in the center of the desired state is represented by binding L2, which points to the blank tile (T1) and the center cell (C5). All states (and desired states) in both the eight-puzzle and eight-puzzle-sd problem spaces share this same cell structure.
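The cell/tile/binding representation can be pictured as a set of flat identifier-attribute-value augmentations. The sketch below is illustrative only; the identifier and attribute names are chosen to echo Figure 6, and it builds just one binding where a full state would have nine.

```python
# Illustrative sketch of the Figure 6 structures as (identifier, attribute, value)
# augmentations; not Soar working memory, just a toy rendering of it.

def cell_name(r, c):
    return f"C{3 * r + c + 1}"          # C1..C9, row-major; C5 is the center

def make_cells():
    """Augment each of the nine cells with pointers to its adjacent cells."""
    wm = []
    for r in range(3):
        for c in range(3):
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < 3 and 0 <= cc < 3:
                    wm.append((cell_name(r, c), "cell", cell_name(rr, cc)))
    return wm

def make_state(state_id, tile_of_cell, binding_prefix):
    """A state owns one binding per tile; each binding ties a tile to the cell it occupies."""
    wm = []
    for i, (cell, tile) in enumerate(sorted(tile_of_cell.items()), start=1):
        binding = f"{binding_prefix}{i}"
        wm += [(state_id, "binding", binding),
               (binding, "cell", cell),
               (binding, "tile", tile)]
    return wm

# The binding of the desired state D1 that matters for place-blank ties the blank
# (T1) to the center cell (C5), as binding L2 does in Figure 6.
working_memory = make_cells() + make_state("D1", {"C5": "T1"}, "L")
```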
Figure 5. An example of the working-memory elements used to create a chunk. (A) shows working memory near the beginning of the subgoal to implement the evaluate-object operator. (B) shows working memory at the end of the subgoal. The circled symbols represent identifiers and the arrows represent augmentations. The identifiers and augmentations above the horizontal lines existed before the subgoal was created. Below the lines, the identifiers marked by doubled circles, and all of the augmentations, are created in the subgoal. The other identifiers below the line are not new; they are actually the same as the corresponding ones above the lines. The highlighted augmentations were referenced during the problem solving in the subgoal and will be the basis of the conditions of the chunk. The augmentation that was created in the subgoal but originates from an object existing before the subgoal (E1 → success) will be the basis for the action of the chunk.
To apply the operator and create a new state, a new state symbol is created (S3) with two new bindings, one for the moved tile and one for the blank. The binding for the moved tile points to the tile (T9) and to the cell where it will be (C4). The binding for the blank points to the blank (T1) and to the cell that will be empty (C5). All the other bindings are then copied to the new state. This processing accesses the relative positions of the blank and the moved tile, and the bindings for the remaining tiles in the current state (S1). The augmentations of the operator are tested for the cell that contains the tile to be moved.
Figure 6. Example of working-memory elements representing the state used to create a chunk. The highlighted augmentations were referenced during the subgoal.
Once the new state (S3) is selected, a production generates the operators that can apply to the new state. All cells that are adjacent to the blank cell (C2, C4, C6, and C8) are used to create operators. This requires testing the structure of the board as encoded in the connections between the cells. Following the creation of the operators that can apply to state S3, the operator that would undo the previous operator is rejected so that unnecessary backtracking is avoided. During the same elaboration phase, the state is tested to determine whether a tile was just moved into or out of its correct position. This information is used to generate an evaluation based on the sum of the number of tiles that do not have to be in place and the number of tiles that both have to be in place and are in place. This computation, whose result is represented by the object X1 with a value of 8 in Figure 5, results in the accessing of those aspects of the desired state highlighted in Figure 6. The value of 8 means that the goal is satisfied, so the evaluation (E1) for the operator has the value success. Because E1 is an identifier that existed before the subgoal was created and the success augmentation was created in the subgoal, this augmentation becomes an action.
CHUNKING IN SOAR
success had further augmentations, they would also be included as actions. The augmentations of the subgoal (G4), the new state (S3), and its sub-object (X l ) that point to objects created before the subgoal are not included as actions because they are not augmentations, either directly or indirectly, of an object that existed prior to the creation of the subgoal. When goal G4 terminates, the initial set of conditions and actions have been deter mined for the chunk. The conditions test that there exists an evaluate-object operator whose purpose is to evaluate the operator that moves the blank into its desired loca tion, and that all of the tiles are either in position or irrelevant for the current eight puzzle-sd operator. The action is to mark the evaluation as successful, meaning that the operator being evaluated will achieve the goal. This chunk should apply in similar future situations, directly implementing the evaluate-object operator, and avoiding the no-change impasse and the resulting subgoal.
3. 1 .2 Identifier variabilization
Once the conditions and actions have been determined, all of the identifiers are replaced by production (pattern-match) variables, while the constants, such as evaluate-object, eight-puzzle, and 0 are left unchanged. An identifier is a label by ' which a particular instance of an object in working memory can be referenced. It is a short-term symbol that lasts only as long as the object is in working memory. Each time the object reappears in working memory it is instantiated with a new identifier. If a chunk that is based on working-memory elements is to reapply in a later situa tion, it must not mention specific identifiers. In essence the variabilization process is like replacing an 'eq' test in Lisp (which requires pointer identity) with an 'equal' test (which only requires value identity). All occurrences of a single identifier are replaced with the same variable and all occurrences of different identifiers are replaced by different variables. This assures that the chunk will match in a new situation only if there is an identifier that appears in the same places in which the original identifier appeared. The production is also modified so that no two variables can match the same identifier. Basically, Soar is guessing which identifiers must be equal and which must be distinct, based only on the information about equality and inequality in working memory. All identifiers that are the same are assumed to require equality. All identifiers that are not the same are assumed to require inequality. Biasing the generalization in these ways assures that the chunks will not be overly general (at least because of these modifications), but they may be overly specific. The only problem this causes is that additional chunks may need to be learned if the original ones suffer from overspecialization. In practice, these modifications have not led to overly specific chunks.
367
368
CHAPTER 10
3. 1 . 3 Chunk optimization
At this point in the chunk-creation process the semantics of the chunk are deter mined . However, three additional processes are applied to the chunks to increase the efficiency with which they are matched against working memory (all related to the use in Soar of the Ops5 rule matcher (Forgy, 1 98 1 )). The first process is to remove conditions from the chunk that provide (almost) no constraint on the match process. A condition is removed if it has a variable in the value field of the augmentation that is not bound elsewhere in the rule (either in the conditions or the actions) . This pro cess recurses, so that a long linked-list of conditions will be removed if the final one in the list has a variable that is unique to that condition. For the chunk based on Figures 5 and 6, the bindings and tiles that were only referenced for copying (B l , B4, B5 , B6, B7, B8, B9, and T9) and the cells referenced for creating operator instantia tions (C2, C6, and C8) are all removed. The evaluation object, E l , in Figure 5 is not removed because it is included in the action . Eliminating the bindings does not in crease the generality of the chunk, because all states must have nine bindings. However, the removal of the cells does increase the generality, because they (along with the test of cell C4) implicitly test that there must be four cells adjacent to the one to which the blank will be moved . Only the center has four adjacent cells, so the removal of these conditions does increase the generality. This does increase slightly the chance of the chunk being over-general, but in practice it has never caused a problem, and it can significantly increase the efficiency of the match by removing unconstrained conditions. The second optimization is to eliminate potential combinatorial matches in the conditions of productions whose actions are to copy a set of augmentations from an existing object to a new object. A common strategy for implementing operators in subgoals is to create a new state containing only the new and changed information, and then to copy over pointers to the rest of the previous state. The chunks built for these subgoals contain one condition for each of the copied pointers. If, as is usually the case, a set of similar items are being copied, then the copy conditions end up dif fering only in the names of variables. Each augmentation can match each of these conditions independently, generating a combinatorial number of instantiations. This problem would arise if a subgoal were used to implement the eight-puzzle operators instead of the rules used in our current implementation. A single production would be learned that created new bindings for the moved tile and the blank, and also copied all of the other bindings. There would be seven conditions that tested for the bin dings, but each of these conditions could match any of the bindings that had to be copied, generating 7 ! (5040) instantiations. This problem is solved by collapsing the set of similar copy conditions down to one. All of the augmentations can still be copied over, but it now occurs via multiple instantiations (seven of them) of the simpler rule. Though this reduces the number of rule instantiations to linear in the number of augmentations to be copied, it still means that the other non-copying ac-
CHUNKING IN SOAR
tions are done more than once. This problem is solved by splitting the chunk into two productions. One production does everything the subgoal did except for the copying. The other production just does the copying. If there is more than one set of augmen tations to be copied, each set is collapsed into a single condition and a separate rule is created for each. 9 The final optimization process consists of applying a condition-recording algorithm to the chunk productions. The efficiency of the Rete-network matcher (Forgy, 1 982) used in Soar is sensitive to the order in which conditions are specified. By taking advantage of the known structure of Soar 's working memory, we have developed a static reordering algorithm that significantly increases the efficiency of the macth . Execution time is sometimes improved by more than an order of magnitude, almost duplicating the efficiency that would be achieved if the reordering was done by hand. This reordering process preserves the existing semantics of the chunk .
3 . 2 The scope of chunking
In Section 1 we defined the scope of a general learning mechanism in terms of three properties: task generality, knowledge generality, and aspect generality. Below we briefly discuss each of these with respect to chunking in Soar. Task generality. Soar provides a single formalism for all behavior - heuristic search of problem spaces in pursuit of goals. This formalism has been widely used in Artificial Intelligence (Feigenbaum & Feldman, 1 963 ; Nilsson, 1 980; Rich, 1 983) and it has already worked well in Soar across a wide variety of problem domains (Laird, Newell, & Rosenbloom, 1 985) . If the problem-space hypothesis (Newell, 1 980) does hold, then this should cover all problem domains for which goal-oriented behavior is appropriate. Chunking can be applied to all of the domains for which Soar is used. Though it remains to be shown that useful chunks can be learned for this wide range of domains, our preliminary experience suggests that the combination of Soar and chunking has the requisite generality. 10 Knowledge generality. Chunking learns from the experiences of the problem so!ver. At first glance, it would appear to be.unable to make use of instructions, ex amples, analogous problems, or other similar sources of knowledge. However, by using such information to help make decisions in subgoals, Soar can learn chunks that incorporate the new knowledge. This approach has worked for a simple form 9 The inelegance of this solution leads us to believe that we do not yet have the right assumptions about how new objects are to be created from old ones.
JO For demonstrations of chunking in Soar on the Eight Puzzle, Tic-Tac-Toe, and the RJ computer configuration task, see Laird, Rosenbloom, & Newell ( 1 984), Rosenbloom, Laird, McDermott, Newell,
& Orciuch ( 1 985), and van de Brug, Rosenbloom, & Newell ( 1 985).
369
370
CHAPTER 10
of user direction, and is under investigation for learning by analogy. The results are preliminary, but it establishes that the question of knowledge generality is open for Soar. Aspect generality. Three conditions must be met for chunking to be able to learn about all aspects of Soar 's problem solving. The first condition is that all aspects must be open to problem solving. This condition is met because Soar creates subgoals for all of the impasses it encounters during the problem solving process. These subgoals allow for problem solving on any of the problem solver' s functions: creating a problem space, selecting a problem space, creating an initial state, selecting a state, selecting an operator, and applying an operator. These functions are both necessary and sufficient for Soar to solve problems. So far chunking has been demonstrated for the selection and application of operators (Laird, Rosenbloom & Newell, 1 984); that is, strategy acquisition (Langley, 1 983 ; Mitchell, 1 983) and operator implemen tation. However, demonstrations of chunking for the other types of subgoals remain to be done. 1 1 The second condition is that the chunking mechanism must be able to create the long-term memory structures in which the new knowledge is to be represented. Soar represents all of its long-term knowledge as productions, and chunking acquires new productions. By restricting the kinds of condition and action primitives allowed in productions (while not losing Turing equivalence), it is possible to have a production language that is coextensive syntactically with the types of rules learned by chunking; that is, the chunking mechanism can create rules containing all of the syntactic con structs available in the language. The third condition is that the chunking mechanism must be able to acquire rules with the requisite content. In Soar, this means that the problem solving on which the requisite chunks are to be based must be understood. The current biggest limitations on coverage stem from our lack of understanding of the problem solving underlying such aspects as problem-space creation and change of representation (Hayes & Simon, 1 976; Korf, 1 980; Lenat, 1 983; Utgoff, 1 984) .
3 . 3 Chunk generality
One of the critical questions to be asked about a simple mechanism for learning from experience is the degree to which the information learned in one problem can transfer to other problems. If generality is lacking, and little transfer occurs, the learning mechanism is simply a caching scheme. The variabilization process described in Sec tion 3 . 1 .2 is one way in which cunks are made general. However, this process would by itself not lead to chunks that could exhibit non-trivial forms of transfer, All it does 11
I n part this issue i s one o f rarity. For example, selection o f problem spaces is not yet problematical, and conflict impasses have not yet been encountered.
CHUNKING IN SOAR
is allow the chunk to match another instance of the same exact situation. The prin cipal source of generality is the implicit generalization that results from basing chunks on only the aspects of the situation that were referenced during problem solv ing. In the example in Section 3. l . l , only a small percentage of the augmentations in working memory ended up as conditions of the chunk. The rest of the information, such as the identity of the tile being moved and its absolute location, and the identities and locations of the other tiles, was not examined during problem solving, and therefore had no effect on the chunk. Together, the representation of objects in working memory and the knowledge used during problem solving combine to form the bias for the implicit generalization process (Utgoff, 1 984); that is, they determine which generalizations are embodied in the chunks learned. The object representation defines a language for the implicit generalization process, bounding the potential generality of the chunks that can be learned. The problem solving determines (indirectly, by what it examines) which generalizations are actually embodied in the chunks. Consider the state representation used in Korf's (l 985a) work on the Eight Puzzle (recall Section 2.2). In his implementation, the state of the board was represented as a vector containing the positions of each of the tiles. Location 0 contained the coor dinates of the position that was blank, location l contained the coordinates of the first tile, and so on. This is a simple and concise representation, but because aspects of the representation are overloaded with more than one functional concept, it pro vides poor support for implicit generalization (or for that matter, any traditional conditition-finding method) . For example, the vector indices have two functions: they specify the identity of the tile, and they provide access to the tile's position. When using this state representation it is impossible to access the position of a tile without looking at its identity. Therefore, even when the problem solving is only dependent on the locations of the tiles, the chunks learned would test the tile iden tities, thus failing to apply in situations in which they rightly could . A second prob lem with the representation is that some of the structure of the problem is implicit in the representation. Concepts that are required for good generalizations, such as the relative positions of two tiles, cannot be captured in chunks because they are not explicitly represented in the structure of the state. Potential generality is maximized if an object is represented so that functionally independent aspects are explicitly represented and can be accessed independently. For example, the Eight Puzzle state representation shown in Figure 6 breaks each functional role into separate working memory objects. This representation, while not predetermining what generalizations are to be made, defines a class of possible generalizations that include good ones for the Eight Puzzle. The actual generality of the chunk is maximized (within the constraints established by the representation) if the problem solver only examines those features of the situa tion that are absolutely necessary to the solution of the problem. When the problem solver knows what it is doing, everything works fine, but generality can be lost when
371
372
CHAPTER 10
information that turns out to be irrelevant is accessed. For example, whenever a new state is selected, productions fire to suggest operators to apply to the state. This preparation goes on in parallel with the testing of the state to see if it matches the goal. If the state does satisfy the goal, then the preparation process was unnecessary. However, if the preparation process referenced aspects of the prior situation that were not accessed by previous productions, then irrelevant conditions will be added to the chunk. Another example occurs when false paths - searches that lead off of the solution path - are investigated in a subgoal . The searches down unsuccessful paths may reference aspects of the state that would not have been tested if only the successful path were followed . 1 2
4. A
demonstration - acquisition o f macro-operators
In this section we provide a demonstration of the capabilities of chunking in Soar involving the acquistion of macro-operators in the Eight Puzzle for serially decom posable goals (see Section 2). We begin with a brief review of Korf's ( l 985a) original implementation of this technique. We follow this with the details of its implementa tion in Soar, together with an analysis of the generality of the macro-operators learned. This demonstration of macro-operators in Soar is of particular interest because: we are using a general problem solver and learner instead of special-purpose programs developed specifically for learning and using macro-operators; and because it allows us to investigate the generality of the chunks learned in a specific application. 4. 1 Macro problem solving
Korf ( l 985a) has shown that problems that are serially decomposable can be effi ciently solved with the aid of a table of macro-operators. A macro-operator (or macro for short) is a sequence of operators that can be treated as a single operator (Fikes, Hart & Nilsson, 1 972). The key to the utility of macros for serially decom posable problems is to define each macro so that after it is applied, all subgoals that had been previously achieved are still satisfied, and one new subgoal is achieved. Means-ends analysis is thus possible when these macro-operators are used. Table 1 shows Korf's ( l 985a) macro table for the Eight Puzzle task of getting all of the tiles in order, clockwise around the frame, with the 1 in the upper left hand corner, and the blank in the middle (the desired state in Figure 3). Each column contains the macros required to achieve one of the subgoals of placing a tile. The rows give the 12 An experimental version of chunking has been implemented that overcomes these problems by per forming a dependency analysis on traces of the productions that fired in a subgoal. The production traces are used to determine which conditions were necessary to produce results of the subgoal. All of the results of this paper are based on the version of chunking without the dependency analysis.
CHUNKING IN SOAR
Table 1. Macro table for the Eight Puzzle (from Korf, 1985, Table 1). The primitive operators move a tile one step in a particular direction; u (up), d (down), I (left), and r (right).
2
0
Tiles 3
4
-
5
6
A
p 0
B
ul
c
u
rdlu
D
ur
dlurrdlu
dlur
ldrurdlu
ldru
rdllurdrul
E 0
F
dr
uldrurdldrul
lurdldru
ldrulurddlru
lurd
G
d
urdldrul
ulddru
urddluldrrul
uldr
rdlluurdldrrul
H
di
rulddrul
druuldrdlu
ruldrdluldrrul
urdluldr
uldrurdllurd
urdl
drul
rullddru
rdluldrrul
rulldr
uldrruldlurd
ruld
n
appropriate macro according to the current position of the tile, where the positions are labeled A-I as in Figure 7. For example, if the goal is to move the blank (tile 0) into the center, and it is currently in the top left corner (location B), then the operator sequence ul will accomplish it. Korf's implementation of macro problem solving used two programs: a problem solver and a learner. The problem solver could use macro tables acquired by the learner to solve serially decomposable problems efficiently. Using Table 1 , the problem-solving program could solve any Eight Puzzle problem with the same desired state (the initial state may vary). The procedure went as follows : (a) the posi tion of the blank was determined; (b) the appropriate macro was found by using this position to index into the first column of the table; (c) the operators in this macro were applied to the state, moving the blank into position; (d) the position of the first tile was determined; (e) the appropriate macro was found by using this position to index into the second column of the table; (f) the operators in this macro were applied to the state, moving the first tile (and the blank) into position; and so on until all of the tiles were in place. B
c
D
I
A
E
H
G
F
Figure 7. The positions
(A-1) in the Eight Puzzle frame.
373
374
CHAPTER 10
To discover the macros, the learner started with the desired state, and performed an iterative-deepening search (for example, see Korf, 1 985b) using the elementary tile-movement operators. 13 As the search progressed, the learner detected sequences of operators that left some of the tiles invariant, but moved others. When an operator sequence was found that left an initial sequence of the subgoals invariant - that is, for some tile k, the operator moved that tile while leaving tiles 1 through k- 1 where they were - the operator sequence was added to the macro table in the appropriate column and row. In a single search from the desired state, all macros could be found . Since the search used iterative-deepening, the first macro found was guaranteed to be the shortest for its slot in the table.
4.2 Macro problem solving in Soar
Soar 's original design criteria did not include the ability to employ serially decom posable subgoals or to acquire and use macro-operators to solve problems structured by such subgoals. However, Soar 's generality allows it to do so with no changes to the architecture (including the chunking mechanism) . Using the implementation of the Eight Puzzle described in Sections 2.2 and 3 . 1 . 1 , Soar 's problem solving and learning capabilities work in an integrated fashion to learn and use macros for serially decomposable subgoals. The opportunity to learn a macro-operator exists each time a goal for implement ing one of the eight-puzzle-sd operators, such as place-5 , is achieved. When the goal is achieved there is a stack of subgoals below it, one for each of the choice points that led up to the desired state in the eight-puzzle problem space. As described in Sec tion 2, all of these lower subgoals are terminated when the higher goal is achieved. As each subgoal terminates, a chunk is built that tests the relevant conditions and produces a preference for one of the operators at the choice point. 14 This set of chunks encodes the path that was successful for the eight-puzzle-sd operator. In future problems, these chunks will act as search-control knowledge, leading the prob lem solver directly to the solution without any impasses or subgoals. Thus, Soar learns macro-operators, not as monolithic data structures, but as sets of chunks that determine at each point in the search which operator to select next. This differs from previous realizations of macros where a single data structure contains the macro, either as a list of operators, as in Korf's work, or as a triangle table, as in Strips (Fikes, Hart & Nilsson, 1 972). Instead, for each operator in the macro-operator se13 For very deep searches, other more efficient techniques such as bidirectional search and macro operator composition were used. 14
Additional chunks are created for the subgoals resulting from no-change impasses on the evaluate
object operators, such as the example chunk in Section 3 . 1 . 1 , but these become irrelevant for this task
once the rules that embody preferences are learned.
CHUNKING IN SOAR
Without Learning
After Learning
During Learning
l
3
1 + B
Figure 8. Searches performed for the first three eight-puzzle-sd operators in an example problem. The left
column shows the search without learning. The horizontal arrows represent points in the search where no choice (and therefore no chunk) is required. The middle column shows the search during learning. A' + ' signifies that a chunk was learned that preferred a given operator. A '
-
' signifies that a chunk was learned
to avoid an operator. The boxed numbers show where a previously learned chunk was applied to avoid search during learning. The right column shows the search after learning.
quence, there is a chunk that causes it to be selected (and therefore applied) at the right time. On later problems (and even the same problem), these chunks control the search when they can, giving the appearance of macro problem solving, and when they cannot, the problem solver resorts to search . When the latter succeeds, more chunks are learned, and more of the macro table is covered. By representing macros as sets of independent productions that are learned when the appropriate problem arises, the processes of learning, storing, and using macros become both incremental and simplified . Figure 8 shows the problem solving and learning that Soar does while performing iterative-deepening searches for the first three eight-puzzle-sd operators of an exam-
375
376
CHAPTER 10
pie problem. The figure shows the searches for which the depth is sufficient to imple ment each operator. The first eight-puzzle-sd operator, place-blank, moves the blank to the center. Without learning, this yields the search shown in the left column ot' the first row. During learning (the middle column), a chunk is first learned to avoid an operator that does not achieve the goal within the current depth limit (2) . This is marked by a ' - ' and the number 1 in the figure. The unboxed numbers give the order that the chunks are learned, while the boxed numbers show where the chunks are used in later problem solving. Once the goal is achieved, signified by the darkened circle, a chunk is learned that prefers the first move over all other alternatives, marked by ' + ' in the figure. No chunk is learned for the final move to the goal since the only other alternative at that point has already been rejected, eliminating any choice, and thereby eliminating the need to learn a chunk. The right column shows that on a sec ond attempt, chunk 2 applied to select the first operator. After the operator applied, chunk 1 applied to reject the operator that did not lead to the goal. This leaves only the operator that leads to the goal, which is selected and applied. In this scheme, the chunks control the problem solving within the subgoals that implement the eight puzzle-sd operator, eliminating search , and thereby encoding a macro-operator. The examples in the second and third rows of Figure 8 show more complex searches and demonstrate how the chunks learned during problem solving for one eight puzzle-sd operator can reduce the search both within that operator and within other operators. In all of these examples, a macro-operator is encoded as a set of chunks that are learned during problem solving and that will eliminate the search the next time a similar problem is presented. In addition to learning chunks for each of the operator-selection decisions, Soar can learn chunks that directly implement instances of the operators in the eight puzzle-sd problem space. They directly create a new state where the tiles have been moved so that the next desired tile is in place, a process that usually involves many Eight Puzzle moves. These chunks would be ideal macro-operators if it were not necessary to actually apply each eight-puzzle operator to a physical puzzle in the real world . As it is, the use of such chunks can lead to illusions about having done something that was not actually done. We have not yet implemented in Soar a general solution to the problem posed by such chunks. One possible solution - whose conse quences we have not yet analyzed in depth - is to have chunking automatically turned off for any goal in which an action occurs that affects the outside world . For this work we have simulated this solution by disabling chunking for the eight-puzzle problem space. Only search-control chunks (generated for the tie problem space) are learned. The searches within the eight-puzzle problem space can be controlled by a variety of different problem solving strategies, and any heuristic knowledge that is available can be used to avoid a brute-force search . Both iterative-deepening and breadth-first
CHUNKING IN SOAR
search 15 strategies were implemented and tested. Only one piece of search control was employed - do not apply an operator that will undo the effects of the previous operator. Unfortunately, Soar is too slow to be able to generate a complete macro table for the Eight Puzzle by search . Soar was unable to learn the eight macros in columns three and five in Figure l . These macros require searches to at least a depth of eight. 1 6 The actual searches used to generate the chunks for a complete macro table were done by having a user lead Soar down the path to the correct solution. At each resolve-tie subgoal, the user specified which of the tied operators should be evaluated first, insuring that the correct path was always tried first . Because the user specified which operator should be evaluated first, and not which operator should actually be applied, Soar proceeded to try out the choice by selecting the specified evaluate object operator and entering a subgoal in which the relevant eight-puzzle operator was applied. Soar verified that the choice made by the user was correct by searching until the choice led to either success or failure. During the verification, the appro priate objects were automatically referenced so that a correct chunk was generated. This is analogous to the explanation-based learning approach (for example, see De Jong, 1981 or Mitchell, Keller, & Kedar-Cabelli, 1 986), though the explanation and learning processes differ. Soar 's inability to search quickly enough to complete the macro table autonomous ly is the one limitation on a claim to have replicated Korf's ( l 985a) results for the Eight Puzzle. This, in part, reflects a trade-off between speed (Korf's system) and generality (Soar). But it is also partially a consequence of our not using the fastest production-system technology available. Significant improvements in Soar 's perfor mance should be possible by reimplementing it using the software technology developed for Ops83 (Forgy, 1 984).
4.3 Chunk generality and transfer
Korf's ( l 985a) work on macro problem solving shows that a large class of problems - for example, all Eight Puzzle problems with the same desired state - can be solved efficiently using a table with a small number of macros. This is possible only because the macros ignore the positions of all tiles not yet in place. This degree of generality occurs in Soar as a direct consequence of implicit generalization. If the identities of the tiles not yet placed are not examined during problem solving, as they need not 15 This was actually a parallel breadth-first search in which the operators at each depth were executed in parallel. 16 Although some of the macros are fourteen operators long, not every operator selection requires a choice (some are forced moves) and, in addition, Soar is able to make use of transfer from previously
learned chunks (Section 4.3).
377
378
CHAPTER 10
be, then the chunks will also not examine them . However, this does not tap all of the possible sources of generality in the Eight Puzzle. In the remainder of this subsec tion we will describe two additional forms of transfer available in the Soar implementation. 4.3. 1 Different goal states
One limitation on the generality of the macro table is that it can only be used to solve for the specific final configuration in Figure 3 . Korf ( l 985a) described one way to overcome this limitation. For other desired states with the blank in the center it is possible to use the macro table by renumbering the tiles in the desired state to corre spond to the ordering in Figure 3, and then using the same transformation for the initial state. In the Soar implementation this degree of generality occurs automatical ly as a consequence o f implicit generalization. The problem solver must care that a tile is in its desired location, but it need not care which tile it actually is. The chunks learned are therefore independent of the exact numbering on the tiles. Instead they depend on the relationship between where the tiles are and where they should be. For desired states that have the blank in a different position, Korf ( l 985a) described a three-step solution method. First find a path from the initial state to a state with the blank in the center; second, find a path from the desired state to the same state with the blank in the middle; and third , combine the solution to the first problem with the inverse of the solution to the second problem - assuming the in verse of every operator is both defined and known - to yield a solution to the overall problem . In Soar this additional degree of generality can be achieved with the learn ing of only two additional chunks . This is done by solving the problem using the following subgoals (see Figure 9): (a) get the blank in the middle, (b) get the first six tiles into their correct positions, and (c) get the blank in its final position. The first 7 moves can be performed directly by the chunks making up the macro table, while the last step requires 2 additional chunks.
(8)
(A)
x
x
x x
x
x
1
x
x
x
x
2
3
(C)
1
4 6
5
7
2
3
8
4
6
5
Figure 9. Problems with different goals states, with different positions of the blank, can be solved by: (a)
moving the blank into the center, (b) moving the first six tiles into position, and (c) moving the blank into its desired position.
CHUNKING IN SOAR
4.3.2 Transfer between macro-operators
In addition to the transfer of learning between desired states, we can identify four different levels of generality that are based on increasing the amount of transfer that occurs between the macro-operators in the table: no transfer, simple transfer, sym metry transfer (within column), and symmetry transfer (across column). The lowest level, no transfer, corresponds to the generality provided directly by the macro table. It uses macro-operators quite generally, but shows no transfer between the macro operators. Each successive level has all of the generality of the previous level, plus one additional variety of transfer. The actual runs were done for the final level, which maximizes transfer. The number of chunks required for the other cases were com puted by hand . Let us consider each of them in turn. No transfer. The no-transfer situation is identical to that employed by Korf ( 1 985a). There is no transfer of learning between macro-operators. In Soar, a total of 230 chunks would be required for this case. 17 This is considerably higher than the number of macro-operators (35) because one chunk must be learned for each operator in the table (if there is no search control) rather than for each macro operator. If search control is available to avoid undoing the previous operator, only 1 70 chunks must be learned . Simple transfer. Simple transfer occurs when two entries in the same column of the macro table end in exactly the same set of moves. For example, in the first column of Table 1 , the macro that moves the blank to the center from the upper-right corner uses the macro-operator ur (column 0, row D in the table). The chunk learned for the second operator in this sequence, which moves the blank to the center from the position to the right of the center (by moving the center tile to the right), is dependent on the state of the board following the first operator, but independent of what the first operator actually was. Therefore, the chunk for the last half of this macro operator is exactly the chunk/macro-operator in column 0, row E of the table. This type of transfer is alway_s available in Soar, and reduces the number of chunks needed to encode the complete macro table from 1 70 to 1 1 2 . The amount of simple transfer is greater than a simple matching of the terminal sequences of operators in the macros in Table 1 would predict because different macro operators of the same length as those in the table can be found that provide greater transfer. Symmetry transfer (within column) . Further transfer can occur when two macro operators for the same subgoal are identical except for rotations or reflections. Figure 10 contains two examples of such transfer. The desired state for both is to move the 1 to the upper left corner. The X's represent tiles whose values are irrelevant to the specific subgoal and the arrow shows the path that the blank travels in order to achieve the subgoal. In (a), a simple rotation of the blank is all that is required, while in (b), two rotations of the blank must be made. Within both examples the 1 7 These numbers include only the chunks for the resolve-tie subgoals. If the chunks generated for the
evaluate-object operators were included, the chunk counts given in this section would be doubled.
379
380
CHAPTER 10
Desired State
x (a)
(b)
x
Symmetric Initial States
Symmetric Initial States
� � � x
�
1
X
X
x
1
x
x
x
X
X
x
X
x
..,._,.
x
1
X
x
x
X
X
Figure JO. Two examples of within-column symmetry transfer.
pattern of moves remains the same, but the orientation of the pattern with respect to the board changes. The ability to achieve this type of transfer by implicit general ization is critically dependent upon the representation of the states (and operators) discussed in Section 3 . 3 . The representation allows the topological relationships among the affected cells (which cells are next to which other cells) and the operators (which cells are affected by the operators) to be examined while the absolute locations of the cells and the names of the operators are ignored. This type of transfer reduces the number of required chunks from 1 1 2 to 83 over the simple-transfer case. Symmetry transfer (across column). The final level of transfer involves the carry over of learning between different subgoals . As shown by the example in Figure 1 1 , this can involve far from obvious similarities between two situations. What is im portant in this case is: ( I ) that a particular three cells are not affected by the moves (the exact three cells can vary); (2) the relative position of the tile to be placed with respect to where it should be; and (3) that a previously placed piece that is affected (b)
(a) Different Intermedi ate Subgoals Place Tile
Different I n termediate Subgoals
4
Place Tile 5
3 2 � 23 � 23 � 23 Place Tile
�
x
x
x
Place Tile
4
x
x
x
x
x
Symmetric Initial States
Figure I I. An example of across-column symmetry transfer.
x
�
4
x
x
Symmetric Initial States
x
5
CHUNKING IN SOAR
Table 2. Structure o f the chunks that encode the macro table for the Eight Puzzle.
2
0
Tiles 3
4
5
6
A p 0
0
B
2,1
c
I
4,3,/
D
2
7,6,5,4
1 5 , 14,/
E
I
10,9,8,4
18,17,16
34,33 ,32,3 1 ,30, 29,/
F
2
1 3 , 1 2 , 1 1 ,/0
2 1 ,20, 1 9,/8
40,39,38,37 ,36, 35,30
15
G
I
JO
23,22,/ 7
46,45 ,44,43 ,42, 41 ,30
18
61 ,60,59,58, 56,55,29
H
2
7
26,25,24,23
54,53 ,52,5 1 ,50, 49,48,47 ,46,29
21
40
15
I
4
28,27,22
51
23
46
18
n
by the moves gets returned to its original position. Across-column symmetry transfer reduces the number of chunks to be learned from 83 to 61 over the within-column case. 1 8 Together, the three types of transfer make it possible for Soar to learn the complete macro table in only three carefully selected trials . Table 2 contains the macro-table structure of the chunks learned when all three levels of transfer are available (and search control to avoid undoing the previous operator is included). In place of operator sequences, the table contains numbers for the chunks that encode the macros. There is no such table actually in Soar all chunks (productions) are simply stored, unordered, in production memory. The pur pose of this table is to show the actual transfer that was achieved for the Eight Puzzle. The order in which the subgoals are presented has no effect on the collection of chunks that are learned for the macro table, because if a chunk will transfer to a new situation (a different place in the macro table) the chunk that would have been learned in the new situation would be identical to the one that applied instead. -
18
The number of chunks can be reduced further, to 54, by allowing the learning of macros that are not of minimum length. This increases the total path length by 2 for 1411/o of the problems, by 4 for 2611/o of the problems and 6 for 711/o of the problems.
381
382
CHAPTER 10
Though this is not true for all tasks, it is true in this case. Therefore, we can just assume that the chunks are learned starting in the upper left corner, going top to bot tom and left to right . The first chunk learned is number 1 and the last chunk learned is number 6 1 . When the number for a chunk is highlighted, it stands for all of the chunks that followed in its first unhighlighted occurrence. For example, for tile 1 in position F, the chunks listed are 1 3 , 1 2 , 1 1 , JO. However, 10 signifies the sequence beginning with chunk 10: 10, 9, 8, 4. The terminal 4 in this sequence signifies the sequence beginning with chunk 4: 4, 3 , J. Therefore, the entire sequence for this macro is: 1 3 , 1 2 , 1 1 , 10, 9, 8, 4, 3, l . The abbreviated macro format used in Table 2 is more than j ust a notational con venience; it directly shows the transfer of learning between the macro-operators. Simple transfer and within-column symmetry transfer show up as the use of a macro that is defined in the same column . For example, the sequence starting with chunk 5 1 is learned in column 3 row H, and used in the same column in row I. The extreme case is column 0, where the chunks learned in the top row can be used for all of the other rows. Across-column symmetry transfer shows up as the reoccurrence of a chunk in a later column. For example, the sequence starting with chunk 29 is learned in column 3 (row E) and used in column 5 (row G). The extreme examples of this are columns 4 and 6 where all of the macros were learned in earlier columns of the table.
4.4 Other tasks
The macro technique can also be used in the Tower of Hanoi (Korf, l 985a). The three-peg, three-disk version of the Tower of Hanoi has been implemented as a set of serially decomposable subgoals in Soar. In a single trial (moving three disks from one peg to another), Soar learns eight chunks that completely encode Korf' s ( l 985a) macro table (six macros). Only a single trial was required because significant within and across column transfer was possible. The chunks learned for the three-peg, three disk problem will also solve the three-peg, two-disk problem . These chunks also transfer to the final moves of the three-peg, N-disk problem when the three smallest disks are out of place. Korf ( 1 985a) demonstrated the macro table technique on three additional tasks: the Fifteen Puzzle, Think-A-Dot and Rubik's Cube. The technique for learning and using macros in Soar should be applicable to all of these problems. However, the performance of the current implementation would require user directed searches for the Fifteen Puzzle and Rubik ' s Cube because of the size of the problems.
CHUNKING IN SOAR
5. Conclusion
In this article we have laid out how chunking works in Soar. Chunking is a learning mechanism that is based on the acquisition of rules from goal-based experience. As such, it is related to a number of other learning mechanisms . However, it obtains ex tra scope and generality from its intimate connection with a sophisticated problem solver (Soar) and the memory organization of the problem solver (a production system). This is the most important lesson of this research. The problem solver pro vides many things: the opportunities to learn, direction as to what is relevant (biases) and what is needed, and a consumer for the learned information. The memory pro vides a means by which the newly learned information can be integrated into the ex isting system and brought to bear when it is relevant . I n previous work w e have demonstrated how the combination of chunking and Soar could acquire search-control knowledge (strategy acquisition) and operator im plementation rules in both search-based puzzle tasks and knowledge-based expert systems tasks (Laird, Rosenbloom & Newell, 1 984; Rosenbloom, Laird, McDermott, Newell, & Orciuch, 1 985). In this paper we have provided a new demonstration of the capabilities of chunking in the context of the macro-operator learning task in vestigated by Korf ( 1 985a). This demonstration shows how: ( 1 ) the macro-operator technique can be used in a general, learning problem solver without the addition of new mechanisms; (2) the learning can be incremental during problem solving rather than requiring a preprocessing phase; (3) the macros can be used for any goal state in the problem; and (4) additional generality can be obtained via transfer of learning between macro-operators, provided an appropriate representation of the task is available. Although chunking displays many of the properties of a general learning mecha nism, it has not yet been demonstrated to be truly general. It can not yet learn new problem spaces or new representations, nor can it yet make use of the wide variety of potential knowledge sources, such as examples or analogous problems. Our approach to all of these insufficiences will be to look to the problem solving. Goals will have to occur in which new problem spaces and representations are developed, and in which different types of knowledge can be used. The knowledge can then be captured by chunking.
Acknowledgements
We would like to thank Pat Langley and Richard Korf for their comments on an earlier draft of this paper. This research was sponsored by the Defense Advanced Research Projects Agency (DOD), ARPA Order No. 3597, monitored by the Air Force Avionics Laboratory
383
384
CHAPTER 10
under contracts F336 1 5-8 1 -K- 1 539 and N00039-83-C-01 36, and by the Personnel and Training Research Programs, Psychological Sciences Division, Office of Naval Research, under contract number N0001 4-82C-0067, contract authority identifica tion number NR667-477. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency, the Office of Naval Research, or the US Government.
References Anderson, J .R. ( 1 983). The architecture of cognition. Cambridge: Harvard University Press. Anderson, J . R . ( 1 983). Knowledge compilation: The general learning mechanism. In R.S. Michalski, J .G. Carbonell, & T.M. Mitchell (Eds.). Proceedings of the 1983 Machine Learning Workshop. University of Illinois at Urbana-Champaign. Anzai, Y . , & Simon, H.A. ( 1 979). The theory of learning by doing. Psychological Review, 86, 1 24- 140. Brown, J . S . , & VanLehn, K. ( 1 980). Repair theory: A generative of bugs in procedural skills. Cognitive Science, 4, 379-426. Carbonell, J . G . , Michalski, R . S . , & Mitchell, T.M. ( 1 983). An overview of machine learning. In R . S .
Michalski, J .G. Carbonell, T.M. Mitchell (Eds.). Machine learning: A n artificial intelligence approach . Los Altos, CA: Morgan Kaufmann. Chase, W.G., & Simon, H.A. ( 1 973). Perception in chess. Cognitive Psychology, 4 55-8 1 . Davis, R . , & King, J. ( 1 976). A n overview o f production systems. I n E . W . Elcock & D . Michie (Ed.), Machine intelligence 8. New York: American Elsevier. DeJong, G. ( 1981 ). Generalizations based on explanations. Proceedings of the Seventh International Joint Conference on A rtificial Intelligence (pp. 67 - 69). Vancouver, B.C., Canada: Morgan Kaufmann. Feigenbaum, E . A . , & Feldman, J. (Eds.) ( 1 963). Computers and thought. New York: McGraw-Hill. Fikes, R . E . , Hart, P.E. & Nilsson, N . J . ( 1 972). Learning and executing generalized robot plans. A rtificial intelligence, 3, 25 1 -288. Forgy, C.L. ( 1 98 1 ) . OPS5 manual (Technical Report). Pittsburgh, PA: Computer Science Department, Carnegie-Mellon University.
Forgy, C.L. ( 1 982). Rete: A fast algorithm for the many pattern/many object pattern match problem. A rtificial intelligence, 19, 17-37. Forgy, C.L. ( 1 984). The OPS83 Report (Tech. Rep. # 84- 1 33). Pittsburgh, PA: Computer Science Department, Carnegie-Mellon University. Hayes, J . R . , & Simon, H.A. ( 1 976). Understanding complex task instructions. In Klahr, D.(Ed.), Cognition and instruction. Hillsdale, NJ: Erlbaum. Korf, R.E. ( 1 980). Toward a model of representation changes. A rtificial intelligence, 14, 41 -78. Korf, R.E.' ( 1 985). Macro-operators: A weak method for learning. A rtificial intelligence, 26, 35-77. Korf, R.E. ( 1 985). Depth-first iterative-deepening: An optimal admissable tree search. A rtificial intelligence, 27, 97 - 1 10. Laird, J.E. ( 1 984). Universal subgoaling. Doctoral dissertation, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA. Laird, J.E., & Newell , A. ( 1 983). A universal weak method: Summary of results. Proceedings of the Eighth International Joint Conference on A rtificial Intelligence (pp. 771 - 773). Karlsruhe, West Ger many: Morgan Kaufmann. Laird, J . E . , & Newell , A. ( 1 983). A universal weak method (Tech. Rep. # 83- 14 1 ) . Pittsburgh, PA: Com puter Science Department, Carnegie-Mellon University.
CHUNKING IN SOAR
Laird, J . E . , Newell, A., & Rosenbloom, P.S. (1 985). Soar: An architecture for general intelligence. In preparation.
Laird, J . E . , Rosenbloom, P.S., & Newell, A. ( 1 984). Towards chunking as a general learning mechanism. Proceedings of the National Conference on A rtificial Intelligence (pp. 188 - 192). Austin, TX: Morgan Kaufmann. Langlex, P. ( 1 983). Learning Effective Search Heuristics. Proceedings of the Eighth International Joint Conference on Artificial Intelligence (pp. 4 1 9 - 425). Karlsruhe, West Germany: Morgan Kaufmann. Lenat, D. ( 1 976). AM: A n artificial intelligence approach to discovery in mathematics as heuristic search. Doctoral dissertation, Computer Science Department, Stanford University, Stanford, CA. Lenat, D.B. ( 1 983). Eurisko: A program that learns new heuristics and domain concepts. A rtificial intel ligence, 21, 61 -98. Lewis, C . H . ( 1 978). Production system models of practice effects. Doctoral dissertation, University of Michigan, Ann Arbor, Michigan. Marsh, D. ( 1 970). Memo functions, the graph traverser, and a simple control situation. In B. Meltzer & D. Michie (Eds.), Machine intelligence 5. New York: American Elsevier.
McDermott, J. ( 1 982). R l : A rule-based configurer of computer systems. Artificial intelligence, 19, 39-88. Michie, D. ( 1 968). 'Memo' functions and machine learning. Nature, 218, 1 9-22. Miller, G.A. ( 1 956). The magic number seven, plus or minus two: Some limits on our capacity for process ing information. Psychological Review,. 63 , 8 1 -97. Mitchell, T.M. ( 1 983). Learning and problem solving. Proceedings of the Eighth International Joint Con ference on A rtificial Intelligence (pp. 1 1 39 - 1 1 5 1 ). Karlsruhe, West Germany: Morgan Kaufmann. Mitchell, T . M . , Keller, R . M . , & Kedar-Cabelli, S.T. ( 1 986). Explanation-based generalization: A unifying view. Machine learning, 1: 47 - 80. Neves, D . M . , & Anderson, J . R. ( 198 1) . Knowledge compilation: Mechanisms for the automatization of cognitive skills. In Anderson, J . R . (Ed.), Cognitive skills and their acquisition. Hillsdale, NJ: Erlbaum. Newell, A. ( 1 973). Production systems: Models of control structures. In Chase, W. (Ed.). Visual informa tion processing. New York: Academic. Newell, A. ( 1 980). Reasoning, problem solving and decision processes: The problem space as a fundamen tal category. In R. Nickerson (Ed.), A ttention and performance Vl/I. Hillsdale, N.J .: Erlbaum. (Also available as CMU CSD Technical Report, Aug 79). Newell, A . , & Rosenbloom, P .S. ( 1 98 1 ). Mechanisms of skill acquisition and the law of practice. In J. R. Anderson (Ed.), Cognitive skills and their acquisition. Hillsdale, NJ: Erlbaum. (Also available as Carnegie-Mellon University Computer Science Tech . Rep. # 80- 145). Nilsson, N. ( 1 980). Principles of artificial intelligence. Palo Alto, CA: Tioga. Rendell, L.A. ( 1 983). A new basis for state-space learning systems and a successful implementation. Ar tificial intelligence, 20, 369-392. Rich, E. (1 983). A rtificial intelligence. New York: McGraw-Hill. Rosenbloom, P.S. ( 1 983). The chunking of goal hierar S'. Knowledge compilation (KC) takes a specification of a process and its domain and produces a new, compiled process, KC(p1 , S) => p2. Process P2 is then used in place of P1 for some subset of the situations of S, called S p2 and for all Sp2 that are elements of S p2• P2(sp2) - P1 ( sp2). where - means acceptably the same. For a process that is embedded in a larger system, the situation it processes is the global state of the overall system before the process is run. The situation that results from the application of P 1 to s, s', is the global state of the system after the process has executed. In a Lisp system, where a process
would be a function, a situation would include the complete state of the Lisp, which includes but is not limited to the arguments to the function, all global variables, and the control stack.
One property of many processes is that they ignore many aspects of a situation and only make small
changes to create a new situation. For a given application of process, P1 (s) •> s', we define that part of s' that differs from s to be the output of p1 , and we define that part of s that was necessary to produce the output to be the input of P 1 .
In a Lisp system, the input to a function would be the
function's parameters, those global variables tested in the function (and any subfunctions), and possibly the control stack. The output would be the result of the function and any changes to global variables, the control stack or the external environment.
As Mostow and Cohen observed in their work on mernoizing Lisp ( 1 5], knowledge compilation can
adversely affect the behavior of the total system unless the results of the compiled process for a given
input are acceptably the same as the results of the original process. Acceptability is specific to the implementation of the global system and is defined in terms of the uses that are to be made of the
output of P2(Sp 2)- Even if the output is not exactly the same as that produced by p 1 , the output may
be sufficient to produce the same overall result of the larger system. Many examples of this arose in Mostow's and Cohen's work.
For example, while the original process may have created a copy of one
of the inputs as the output, it may be acceptable for the new process to return a pointer to the original input, as long as no destructive operations are performed on the input or output by later processing.
If knowledge compilation is to improve the efficiency of the total system, then using p2 must be more efficient than P 1 for those situations in Sp2 to which It is applied. Efficiency must be judged
according to a suitable metric, such as the time or space it takes to produce the output. Knowledge
compilation may not improve every instance of the process, but there should be a net gain. Mostow and Cohen provide a cost-model for caching that is easily applied to other forms of knowledge compilation ( 1 5]. Even w hen knowledge compilation is successful there can be problems. First, compiling a process . that is incorrect will lead to a compiled process that is also incorrect. Another problem can arise when
389
390
CHAPTER 1 1
the original process is improved through some modification. The compiled process will reproduce the original behavior and not reflect the improvement that has been made. This is essentially a cache consistency problem, and to recover from it, the compiled process must either be removed or modified.
However, this property of knowledge compilation can also be beneficial when the
compilation process saves an instance of original processing that would otherwise be irreproducible because it was based on some temporary external input.
Knowledge Compi lation I n Soar In Soar, chunking performs knowledge compilation, as defined above. The processes it compiles are problem-solving episodes in subgoals. The initial situation, s, for a process is defined by the state of working memory before the Impasse occurs. The outputs of the process are the results of the subgoal. The actual input to the process consists of those aspects of the pre-impasse situation that were used in creating the result. Chunking creates new rules based on instantiated traces of the production firings that led to the creation of the result. A production trace is only a partial description of a single situation (all of working memory) and the process (all of production memory).
It includes traces of productions that contributed to the generation of the problem spaces, states, operators, and results. It does not include traces of productions that influenced the selection of these objects by the decision phase. These selections should only influence the efficiency of the problem solving, not the ability to produce the final results. Knowledge compilation need only be sensitive to those aspects of the original process that were logically necessary, not those that contributed only to the efficiency of the process. The compiled process is a rule that summarizes the processing in the subgoal.
The creation of a rule from the trace information requires many steps, which are described in detail elsewhere [4, 5, 6]. The input to a rule is the set of working-memory elements that match its conditions. A rule's conditions test only a subset of the total situation, thereby allowing the rule to apply in many more situations than the one in which it was learned. When a rule fires in some future situation, it is able to replace the original processing in the subgoal (by adding elements to working memory that avoid an impasse). There is no guarantee in the current implementation of Soar that the additional cost of matching the rule will be less than problem solving in the subgoal. Work in other systems has shown this to be problematic [13]. In practice, the rules can be matched efficiently [3, 22] so that the overhead of adding a new rule is much less than solving the subgoal.
Overgeneralization

A compiled process is overgeneral if it applies to situations that are different from the ones on which it was based and does not produce acceptable results. A more formal definition of overgenerality is: process P2 is overgeneral if there exist situations s- and s+, elements of SP2, such that the input of s- for P2 is a proper subset of the input of s+ for P1, and P1(s+) ≉ P1(s-) while P2(s+) ≈ P2(s-), where ≈ means acceptably the same and ≉ means not acceptably the same.

A key provision for overgenerality is the subset relation, which guarantees that there is a situation (s+) with inputs that the original process (P1) is sensitive to but that the compiled process (P2) is not. When those aspects of the input are present in a situation, the original process produces different results, P1(s+) ≉ P1(s-), but the compiled process produces the same result, independent of the additional aspects of the situation, P2(s+) ≈ P2(s-). Therefore, although the compiled process performs correctly when the additional aspects are not present, P1(s-) ≈ P2(s-), it does not produce an acceptable result when they are, P1(s+) ≉ P2(s+). Overgeneralization is a more specific problem than just an incorrect compilation, which would have P2(s) ≉ P1(s) for some s that is an element of SP2. Overgeneralization also differs from the valid transfer of knowledge to new situations. With transfer, either any additional aspects of the situations not included as input to P2 are not relevant to the outcome of P1, or P2 is able to correctly handle the newly relevant inputs: P1(s+) ≈ P2(s+) and P1(s-) ≈ P2(s-).
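The definition can be restated as a small executable check over enumerated situations. The sketch below is illustrative only; it assumes the inputs are represented as sets, and that acceptably_same, input_for, and the process functions are supplied by the surrounding system.

def is_overgeneral(p1, p2, situations_p2, input_for, acceptably_same):
    """P2 is overgeneral if some pair s_minus, s_plus in its domain has
    input(s_minus) for P2 a proper subset of input(s_plus) for P1, P1
    distinguishes the two situations, but P2 does not."""
    for s_minus in situations_p2:
        for s_plus in situations_p2:
            if input_for(p2, s_minus) < input_for(p1, s_plus):   # proper subset of sets
                if (not acceptably_same(p1(s_plus), p1(s_minus))
                        and acceptably_same(p2(s_plus), p2(s_minus))):
                    return True
    return False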
Overgeneralization is possible because the compiled process is not restricted to those inputs for which
it can produce a correct result. The reason for this failure is that some aspect of the situation that was truly relevant for P2 to correctly generate its results was not captured in the input to P2. Any aspect of knowledge compilation can be to blame: the compilation process, the target representation for the compiled process, or the inputs to the compilation process. In the remainder of this section we consider each of these in turn. Our analysis is drawn from examples of overgenerality that have arisen in Soar during its implementation on a wide variety of tasks. As we shall see below, overgenerality is still possible in Soar, but only in special cases that are easily identified.
Invalid compilation.
The first class of failures arises when the knowledge compiler does not create an acceptable process given the information it has available. One cause could simply be a programming problem: the knowledge compiler is incorrect. However, it might be the case that the complexity of the compilation is such that the resources involved preclude a correct compilation. Partially correct compilations can be used that work in a vast majority of cases, but in those where the complexity is too high, an overly general compiled process may be produced.
The knowledge compiler can also lead to overgeneralization if it attempts, via an unjustified inductive leap, to generalize the compiled process so that it can be applied to a wider range of situations.
In Soar, the only deliberate act of generalization performed by the compilation process is to replace object identifiers with variables. These identifiers are temporary symbols generated for objects as they are inserted into working memory. This is the minimum amount of generalization that is necessary for the learning to have any effect. For each run of the system, these identifiers may have different values, even if the situation is the same. Their exact values are irrelevant; only their equality and inequality are important. Therefore, all instances of the same identifier are replaced by the same variable, and all distinct identifiers are replaced by distinct variables, during the creation of a chunk. This can lead to some overspecialization in that some identifiers are forced to be the same (or different) when they did not have to be. However, by itself, it does not lead to overgeneralization.
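A sketch of this identifier-to-variable replacement is shown below; the working-memory representation and the identifier test are invented for illustration and are not Soar's actual data structures.

def variablize(trace_elements, is_identifier):
    """Replace each distinct object identifier with a distinct variable,
    using the same variable for every occurrence of the same identifier."""
    mapping = {}
    result = []
    for element in trace_elements:          # each element is a tuple of symbols
        new_element = []
        for symbol in element:
            if is_identifier(symbol):
                if symbol not in mapping:
                    mapping[symbol] = f"<v{len(mapping) + 1}>"
                new_element.append(mapping[symbol])
            else:
                new_element.append(symbol)
        result.append(tuple(new_element))
    return result

# Two occurrences of g23 become the same variable; s41 gets a different one.
conditions = [("g23", "problem-space", "p7"), ("g23", "state", "s41")]
print(variablize(conditions, lambda s: s[0] in "gps" and s[1:].isdigit()))
# [('<v1>', 'problem-space', '<v2>'), ('<v1>', 'state', '<v3>')]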
Insufficient target language
A second source of overgeneralization can be the target language for the compiled process. If it is not possible to represent all of the necessary computation in the target language, the compiled process will be necessarily incomplete, and possibly overgeneral. One aspect of this is that the compiled process must be able to access and test all of the input that was relevant to the original process. For example, when information used in the original process is encoded in different modes, such as symbolically and as activation as in ACT* [1], the compiled process must be able to test all modes. In ACT*, both the original and compiled processes were encoded as sets of rules. The rules could explicitly test for only the existence of working-memory elements, but the interpreter of the rules depended on the activation of working-memory elements to choose between rules. The compiled rules cannot encode the dependency of the original process on the activations, so overgenerality is possible. The problem arises from the existence of crypto-information, information used by the interpreter that is not explicitly available to the rules [18].
In Soar this problem does not arise because all processing is based only on the
pre-impasse working memory, which is available to both the uncompiled and compiled processes. The production language is rich enough to capture any tests of the contents of working memory that arise from the production firings and decisions that occur in a subgoal.
Insufficient description of process and situations
A final class of failures arises when the process and situation descriptions used by the compiler are insufficient to determine the inputs of the compiled process. In many systems, including Soar, only a partial description of the situation and process being compiled is made available (in Soar's case both are embedded in the production trace). The descriptions of both processes and situations can be
decomposed into local features of each component of the process or situation, and global features of the process or situation. Local properties can be determined independently of the other components and thus are usually readily available. In Soar, the local properties are the working-memory elements that
matched the productions that fired on the path to the results. The local features of the process are the contents of these same productions. Global properties, such as whether any working-memory elements do not have a given feature or the reasons why no productions fired during elaboration, are more costly to compute. Knowledge compilation based solely on a trace of behaviour is prone to overgeneralization because traces usually contain only local features of the situation and processing. Of course, if the process being compiled was insensitive to the global properties of its input and itself, overgeneralization
will not be a problem. In this section we work our way from local to global features of the situation and then the process. We start with examples from Soar where the necessary information was not provided in early versions but now is. We end with some examples of where global properties of the process are missing so that overgeneralization is still possible.
Goal tests performed by search control. Overgeneralization is possible if the processing within a subgoal that should be irrelevant influences the results of the subgoal. In Soar this is possible when control information indirectly performs parts of the test for a desired state in a goal. Control information in Soar is encoded as preferences that affect the selection of problem spaces, states and operators. Theoretically, all subgoals should be correctly solvable without any control information, that is, by a search of the problem space until a desired state is achieved. However, often it is possible to simplify the operators of the problem space or the test for a desired state by using control information that explicitly directs the search. The search control is able to guarantee certain invariants so that they need not be tested in the operators or the goal. For example, if the maximum number from a set is desired, the test for the desired state should include comparing a number to all other possibilities (or at least the best found so far). By using control information, those numbers that are less than the one under consideration can be avoided in the search so that the goal is achieved when there are no other numbers to generate. This type of approach would lead to overgeneral chunks because only the processing to generate the best number would be included, not all of the tests that were performed by productions that created preferences. To eliminate this source of overgeneralization we could have outlawed the use of search-control rules that affected the validity of the result of a subgoal. However, these desired states are often difficult to test explicitly. Instead, we created two new classes of preferences, called required and prohibit. These preferences have special semantics in the decision procedure. Required means that a given object must be selected if the goal is to be achieved. Prohibit means that a goal cannot be achieved if a given object is selected. If a required or prohibit preference is used by the decision procedure on the path to a result, the production that generated that preference is considered to be necessary for the generation of the result (as are its predecessors that generated the working-memory elements that it tested). This eliminates the cause of overgeneralization.
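The following sketch illustrates why the search-control version produces an overgeneral chunk; the trace representation is invented for illustration, and the required/prohibit fix corresponds to no longer filtering out those search-control firings whose preferences were used on the path to the result.

def chunk_dependencies(firing_trace):
    """Collect the working-memory elements a chunk would test: everything
    matched by productions on the path to the result.  Search-control
    firings are excluded, mirroring the default treatment of preferences."""
    deps = set()
    for firing in firing_trace:
        if firing["role"] != "search-control":
            deps |= set(firing["matched"])
    return deps

numbers = [3, 9, 4]
# Explicit goal test: a production compares the candidate against every other number.
explicit = [{"role": "operator", "matched": [("num", n) for n in numbers]}]
# Search-control version: pruning rules reject 3 and 4; only the winner reaches the result.
pruned = [
    {"role": "search-control", "matched": [("num", 3), ("num", 9)]},
    {"role": "search-control", "matched": [("num", 4), ("num", 9)]},
    {"role": "operator", "matched": [("num", 9)]},
]
print(chunk_dependencies(explicit))  # tests all three numbers
print(chunk_dependencies(pruned))    # tests only ("num", 9): an overgeneral chunk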
Negated tests of the pre-impasse situation. In Soar, a production can test for the absence of an element in working memory. These tests are called negated conditions. Negated conditions allow Soar to realize certain types of default processing [17], which makes the processing in the subgoal defeasible [2]. A defeasible process is one where a different result can be produced when more information is available.
In Soar, additional working-memory elements can prevent a production from firing, thereby leading to a different result. We initially believed that overgenerality was directly related to defeasibility; however, as long as the reason for defeasibility (the absence of elements in working memory for Soar, and the absence of certain inconsistencies in a default logic) is included in the compiled process, overgenerality can be avoided [19]. For example, in Soar's original chunking mechanism, the production instantiations only contained those working-memory elements that matched the conditions that tested for the presence of working-memory elements.
Since the chunking mechanism did not have access to the negated
conditions, the chunks it built had the potential of being overgeneral. To eliminate this problem, Soar was modified so that it included instantiated versions of the negated conditions.
Although the negated
conditions test global properties of working memory, they are local properties of the productions and easily computed. The remaining cases of overgeneralization in Soar occur when the processing in a subgoal depends on global properties of the process being compiled, that is, on the contents of all of production memory. The information that chunking is missing consists of traces of some of the productions which did not fire. This is not provided because of the computational burden that would be involved in analyzing it during learning and because we believe that an explicit representation of procedural knowledge (in this case the production memory) may not exist. Two cases of this have been identified, and we discuss them below.
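As an illustration of the pre-impasse fix described above, the sketch below shows a negated condition being instantiated with the bindings of the firing that used it, so that the absence test is carried into the chunk; the data representation is invented for illustration.

def instantiate_negated(negated_condition, bindings):
    """Turn a production's negated condition into a chunk condition by
    substituting the variable bindings from the instantiation that fired."""
    return ("not",) + tuple(bindings.get(field, field) for field in negated_condition)

# A subgoal production tested the absence of (blocked <b>) for the block it moved.
bindings = {"<b>": "block-12"}
print(instantiate_negated(("blocked", "<b>"), bindings))
# ('not', 'blocked', 'block-12')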
Negated tests of the post-impasse situation. If a production in a subgoal contains a negated condition that tests for the absence of working-memory elements that are generated within the subgoal, it is inappropriate to include that negated condition in the chunk built for the subgoal. Instead, if the chunk built for the subgoal is to be completely correct, all productions that might have created that working-memory element in the subgoal, but didn't, must be found and the reasons for their inability to fire must be included in the chunk.
If we do not allow the examination of production memory, then
overgeneral chunks are possible. Even if all productions can be examined, the process to determine the appropriate chunks is expensive, possibly resulting in multiple productions for each result, with very large numbers of conditions for each production.
Tests for quiescence. A second global property of the production system that can influence the
processing in a subgoal is that the elaboration phase runs until quiescence, that is, until no more productions can fire.
If the productions are able to detect that a specific level of processing was
performed, and then no more was possible, they are testing a global property of the productions. One example of this arises in testing that a goal has been achieved. For some goals, detecting that all possible processing was performed without error appears to be a sufficient goal test. For example, unification succeeds when all variables successfully unify. Often the easy test to perform is that there is nothing more to do. This can be detected in Soar because an impasse will arise when all of the operators for a state are exhausted. This impasse can be used as a signal that the subgoal is finished. Testing for the impasse uses information concerning the contents of production memory. If additional information was available in the state, a production might fire and create an additional operator to perform more processing. This leads
to the creation of an overgeneral chunk that will only achieve a part of the goal in a similar situation in which there is more processing to be done. The chunk will only test and handle those aspects that were tested in the original subgoal, leaving some processing undone. To solve this problem by accessing the productions would be similar in complexity to handling the negated conditions that test internally generated working-memory elements because in both cases the chunking mechanism must detect why certain productions did not fire.
Avoiding and Recovering from Overgeneralization

If there is a potential for overgenerality because of insufficient information about the process, overgenerality can be avoided by three obvious approaches:
1. supplying the missing information;
2. eliminating that aspect of the process that is unavailable;
3. refraining from building chunks for subgoals where information is missing.
For two cases we described earlier (negated tests of the pre-impasse
situation, and goal tests performed by search control), we have modified Soar to provide more information, thereby eliminating that aspect of the process that is unavailable. For the two outstanding causes of overgeneralization (negated tests of the post-impasse situation, and tests for quiescence), we balk at supplying the information, first because we expect situations in which the contents of production memory are unavailable, and secondly, because of the computational cost of analyzing that information. One option is to eliminate the use of negated conditions for internal objects, and eliminate the ability to base behavior in a subgoal on the creation of an impasse. If we adopt this approach, we must first find ways of replacing their functionality. Another alternative is to avoid chunking subgoals in which there is processing that could lead to overgeneralization. The chunking mechanism is able to detect tests for internal negated elements and tests for impasses that arise from exhaustion. Disabling chunking for these cases appears to be a prudent step for now. Another alternative to avoiding overgenerality is to modify the subsequent processing so that the definition of acceptably-the-same is more lenient. This can be achieved by moving some of the test for a result of a goal outside the goal so that the goal produces results that only potentially satisfy the goal. The test does not get included in the chunk, but always follows the application of the chunk, screening the results. The chunks that are learned then can be overgeneral, but the following test will not use them to
eliminate the impasse unless they are appropriate. One restriction is that it must be possible to recognize an acceptable result with a simple test. This type of scheme has been implemented in Soar as part of work on abstraction planning [24], for a part of R1-Soar [21], a reimplementation of R1 [11].
In a subgoal to
evaluate one of the operators in R1-Soar, the operator is implemented at an abstract level, so many of the details are left out. A chunk learned for that implementation applies later when a complete implementation is required. The chunk partially implements the operator, but does not pass an additional test that is needed before the result is used. Therefore, an impasse arises, followed by problem solving that completes those aspects of the operator left out of the chunk. Even without an additional test, if the purpose of the subgoal is to select between operators, an
overgeneral chunk will not lead to incorrect results, although it may lead to additional search. In addition, the level above will be correctly chunked because all search-control knowledge is ignored during chunking. A third way to avoid overgenerality is to have the knowledge-compilation system modify the compiled process in anticipation of overgeneralization. This is done to a small extent in Soar for a specific type of overgeneralization that arises when operator-implementation subgoals are chunked.
Since working-memory elements cannot be changed (only added), operators are usually implemented by creating a new state, copying over all unchanged information, and adding all new information. If this is performed in a subgoal, the copying is usually handled by copying all subparts of the state that are not marked as being changed. This is a case of processing until exhaustion because the copying is independent of the number of subparts. This will lead to an overgeneral chunk that applies whenever the appropriate operator is selected and there is a state that has at least those subparts that were modified or copied in the original subgoal. If there are additional subparts to be copied, the chunk will apply, not just once, but for every appropriately sized subset of subparts, creating multiple states, none of which is complete. To avoid this problem, the chunking mechanism analyzes the rules that it is building, and if they appear to be copying substructures, splits the rule into multiple rules so that the copying is carried on independently. This then allows the copy rules to fire to exhaustion, copying all subparts as necessary. Although this has always worked in practice, the determination that the chunk should be split into multiple rules is only heuristic. This undesirable interaction between operator implementation and overgenerality suggests that a different scheme for applying operators may be needed. An
alternate approach is to explicitly modify the overgeneral process, either by further restricting its inputs, by modifying its processing, or by removing it. In a rule-based system like Soar this would mean adding new conditions to an overgeneral rule or deleting the rule from production memory [8, 23].
In other
rule-based systems that fire only some of the matched rules, additional properties of the rule can be modified. In ACT*, the strength of a rule can influence its ability to fire. When a rule is suspected of being overgeneral or incorrect, its strength is lowered [1]. We have rejected schemes that modify existing productions, preferring, for the moment, to investigate ways of learning new productions that overcome the overgeneral ones.
Summary

In this paper we have attempted to categorize most if not all of the sources of overgeneralization that can arise during knowledge compilation. We've used this categorization to analyze a form of knowledge compilation, chunking in Soar. Of all the sources of overgeneralization, one is inherent to our architectural assumption that the knowledge-compilation process does not have access to all of long-term knowledge,
which in Soar's case is its production memory. We've identified two types of processing in Soar that can give rise to overgeneralization if this information is not available and we have proposed ways of avoiding the creation of the overgeneral chunks and then recovering from them if avoidance is not possible.
References
[1] Anderson, J. R. The Architecture of Cognition. Cambridge: Harvard University Press, 1983.
[2] Batali, J. Computational Introspection. A.I. Memo 701, Massachusetts Institute of Technology, Artificial Intelligence Laboratory, February, 1983.
[3] Forgy, C. L. Rete: A fast algorithm for the many pattern/many object pattern match problem. Artificial Intelligence 19 (1982), 17-37.
[4] Laird, J. E. Soar User's Manual (Version 4). ISL-15, Xerox Palo Alto Research Center, 1986.
[5] Laird, J. E., Newell, A., & Rosenbloom, P. S. Soar: An architecture for general intelligence. In preparation.
[6] Laird, J. E., Rosenbloom, P. S., & Newell, A. Towards chunking as a general learning mechanism. Proceedings of AAAI-84, Austin, 1984.
[7] Laird, J. E., Rosenbloom, P. S., & Newell, A. "Chunking in Soar: The anatomy of a general learning mechanism". Machine Learning 1 (1986).
[8] Langley, P., & Sage, S. Conceptual clustering as discrimination learning. Proceedings of the Fifth Conference of the Canadian Society for Computational Studies of Intelligence, 1984.
[9] Lewis, C. H. Production system models of practice effects. Ph.D. Thesis, University of Michigan, 1978.
[10] Marsh, D. Memo functions, the graph traverser, and a simple control situation. In B. Meltzer & D. Michie (Eds.), Machine Intelligence 5. New York: American Elsevier, 1970.
[11] McDermott, J. "R1: A rule-based configurer of computer systems". Artificial Intelligence 19 (1982), 39-88.
[12] Michie, D. "Memo" functions and machine learning. Nature, 1968, 218, 19-22.
[13] Minton, S. Selectively generalizing plans for problem-solving. Proceedings of IJCAI-85, Los Angeles, 1985.
[14] Mitchell, T. M., Keller, R. M., & Kedar-Cabelli, S. T. "Explanation-based generalization: A unifying view". Machine Learning 1 (1986).
[15] Mostow, J. & Cohen, D. Automating program speedup by deciding what to cache. Proceedings of IJCAI-85, Los Angeles, 1985.
[16] Neves, D. M. & Anderson, J. R. Knowledge compilation: Mechanisms for the automatization of cognitive skills. In Anderson, J. R. (Ed.), Cognitive Skills and their Acquisition. Hillsdale, NJ: Erlbaum, 1981.
[17] Reiter, R. "A logic of default reasoning". Artificial Intelligence 13 (1980), 81-132.
[18] Rosenbloom, P. S. The Chunking of Goal Hierarchies: A Model of Practice and Stimulus-Response Compatibility. Ph.D. Thesis, Carnegie-Mellon University, 1983.
[19] Rosenbloom, P. S., & Laird, J. E. Mapping explanation-based generalization onto Soar. Proceedings of AAAI-86, Philadelphia, 1986.
[20] Rosenbloom, P. S., & Newell, A. The chunking of goal hierarchies: A generalized model of practice. In Machine Learning: An Artificial Intelligence Approach, Volume II, R. S. Michalski, J. G. Carbonell, & T. M. Mitchell, Eds., Morgan Kaufmann Publishers, Inc., Los Altos, CA, 1986.
[21] Rosenbloom, P. S., Laird, J. E., McDermott, J., Newell, A., & Orciuch, E. "R1-Soar: An experiment in knowledge-intensive programming in a problem-solving architecture". IEEE Transactions on Pattern Analysis and Machine Intelligence 7, 5 (1985), 561-569.
[22] Scales, D. Efficient matching algorithms for Soar/Ops5 production systems. Computer Science Tech. Report, Stanford University, 1986.
[23] Sonnenwald, D. H. Over-generalization in Soar. In preparation.
[24] Unruh, A., Rosenbloom, P. S., & Laird, J. E. Implicit procedural abstraction. In preparation.
CHAPTER 12
Mapping Explanation-Based Generalization onto Soar / P. S. Rosenbloom, Stanford University, and J. E. Laird, Xerox Palo Alto Research Center
ABSTRACT
Explanation-based generalization (EBG) is a powerful approach to concept formation in which a justifiable concept definition is acquired from a single training example and an underlying theory of how the example is an instance of the concept. Soar is an attempt to build a general cognitive architecture combining general learning, problem solving, and memory capabilities. It includes an independently developed learning mechanism, called chunking, that is similar to but not the same as explanation-based generalization. In this article we clarify the relationship between the explanation-based generalization framework and the Soar/chunking combination by showing how the EBG framework maps onto Soar, how several EBG concept formation tasks are implemented in Soar, and how the Soar approach suggests answers to some of the outstanding issues in explanation-based generalization.

I INTRODUCTION

Explanation-based generalization (EBG) is an approach to concept acquisition in which a justifiable concept definition is acquired from a single training example plus an underlying theory of how the example is an instance of the concept [1, 15, 26]. Because of its power, EBG is currently one of the most actively investigated topics in machine learning [3, 5, 6, 12, 13, 14, 16, 17, 18, 23, 24, 25]. Recently, a unifying framework for explanation-based generalization has been developed under which many of the earlier formulations can be subsumed [15]. Soar is an attempt to build a general cognitive architecture combining general learning, problem solving, and memory capabilities [9]. Numerous results have been generated with Soar to date in the areas of learning [10, 11], problem solving [7, 8], and expert systems [21]. Of particular importance for this article is that Soar includes an independently developed learning mechanism, called chunking, that is similar to but not the same as explanation-based generalization. The goal of this article is to elucidate the relationship between the general explanation-based generalization framework - as described in [15] - and the Soar approach to learning, by mapping explanation-based generalization onto Soar.² The resulting mapping increases our understanding of both approaches and
allows results and conclusions to be transferred between them. In Sections II-IV, EBG and Soar are introduced and the initial mapping between them is specified. In Sections V and VI, the mapping is refined and detailed examples (taken from [15]) of the acquisition of a simple concept and of a search-control concept are presented. In Section VII, differences between EBG and learning in Soar are discussed. In Section VIII, proposed solutions to some of the key issues in explanation-based generalization (as set out in [15]) are presented, based on the mapping of EBG onto Soar. In Section IX, some concluding remarks are presented.

II EXPLANATION-BASED GENERALIZATION
As described in [15], explanation-based generalization is based on four types of knowledge: the goal concept, the training example, the operationality constraint, and the domain theory. The goal concept is a rule defining the concept to be learned. Consider the Safe-to-Stack example from [15]. The aim of the learning system is to learn the concept of when it is safe to stack one object on top of another. The goal concept is as follows.³

safe-to-stack(x, y) ← ¬fragile(y) ∨ lighter(x, y)
(p production-name
    (condition)
    . . .
    (condition)
  -->
    (action)
    . . .
    (action))
Since this paper is concerned only with speeding up the condition matcher of OPS5, we shall only discuss the details of OPS5 conditions. The reader is referred to [3] for a complete description of OPS5 syntax. Each condition of the production is a list of match primitives. Each wme that matches the condition is said to be an instantiation of that condition. For example, the condition

(state <s> item <> 3)
matches wme's whose first field is the symbol "state", whose third field is "item", and whose fourth field is not equal to the number 3. The symbol "<s>" is the name of an OPS5 variable; a symbol is the name of a variable if and only if it begins with "<" and ends with ">". A variable in a condition matches any value of the corresponding field. However, the variable is bound to the value it matches, and any other occurrence of the same variable within the conditions of the production matches only that value. For example, the condition (operator <o> super-operator <o>) matches wme's whose first field is "operator," whose third field is "super-operator," and whose second and fourth fields are equal. The condition (operator <o> super-operator <> <o>) is similar, except the second and fourth fields must be unequal. An instantiation of a production is a list of wme's such that each wme of the list is an instantiation
of the corresponding condition of the production, and all occurrences of each variable throughout the conditions can be bound to the same value. For example, a list of two wme's A and B instantiate the conditions

(goal-context <g> problem-space <p>)
(goal-context <g> state <s>)

if the first fields of both A and B are "goal-context", the third fields of A and B are "problem-space" and "state," respectively, and the second fields of A and B are equal. If the conditions above were the conditions of a production, and such wme's A and B existed, a production instantiation, consisting of the name of the production and a list of A and B, would at some point be placed in the conflict set. An instantiation is executed by executing the actions of the production, after replacing variables in the actions by the values to which they were bound for the instantiation. In practice, wme's are often used to store information about objects, and the fields represent the values of various attributes of the objects. OPS5 provides a way of giving attribute names to the fields of different kinds of wme's. Each wme has a class name, indicated by the value of the first field of the wme, and each class can have names associated with various fields of the wme. Attribute names are indicated by a preceding up-arrow. When an attribute name appears in a condition or wme, it indicates that the following value or pattern refers to the field corresponding to that attribute. For example, the condition

(preference ↑object p0001 ↑role problem-space ↑value acceptable)

matches wme's whose class name is "preference", whose "object" field is "p0001", whose "role" field is "problem-space", and so on. OPS5 allows a condition to be negated by preceding it by a minus sign. A negated condition indicates that no wme should match that condition. That is, a set of conditions and negated conditions is matched if and only if there exist a set of wme's which match the positive conditions with consistent bindings of the variables, and, for those variable bindings, there are no wme's that match any of the negated conditions. Hence, the condition list

(goal ↑status active ↑type find ↑object block ↑block-type <t>)
-(block ↑status available ↑type <t> ↑weight < 5)

is satisfied only if there is a wme representing an active goal for finding a block of a particular type, but there is no wme representing an available block of that type with weight less than 5.
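A toy matcher conveying the semantics just described is sketched below in Python; it handles only positional, positive conditions, omits attribute names and negated conditions, and is not the OPS5 implementation.

def match_condition(condition, wme, bindings):
    """Match one positional condition against one wme, extending bindings.
    Variables are written <x>; returns the extended bindings or None."""
    if len(condition) != len(wme):
        return None
    new = dict(bindings)
    for pat, val in zip(condition, wme):
        if isinstance(pat, str) and pat.startswith("<") and pat.endswith(">"):
            if pat in new and new[pat] != val:
                return None
            new[pat] = val
        elif pat != val:
            return None
    return new

def instantiations(conditions, wmes, bindings=None):
    """All consistent instantiations of a positive condition list."""
    bindings = bindings or {}
    if not conditions:
        return [bindings]
    results = []
    for wme in wmes:
        extended = match_condition(conditions[0], wme, bindings)
        if extended is not None:
            results.extend(instantiations(conditions[1:], wmes, extended))
    return results

wmes = [("goal-context", "g1", "problem-space", "p1"),
        ("goal-context", "g1", "state", "s1")]
conds = [("goal-context", "<g>", "problem-space", "<p>"),
         ("goal-context", "<g>", "state", "<s>")]
print(len(instantiations(conds, wmes)))  # 1 consistent instantiation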
4. The Rete Algorithm and the Rete Network

The process used by OPS5 to match the conditions of productions against working memory is based on the Rete algorithm. The Rete algorithm builds a lattice structure, called the Rete network, to represent the conditions of a set of productions. The Rete network takes advantage of two properties of production systems which allow for efficient matching:
• The contents of the working memory change very slowly.¹
• There are many sequences of tests that are common to the conditions of more than one production.
The Rete algorithm exploits the first property by storing match information in the network between cycles, so that it only matches a wme against each production condition once, even if (as is likely) the wme remains in working memory for many cycles. Hence, the Rete network contains two types of nodes, test nodes and memory nodes. A test node indicates tests that should be executed to determine if a particular condition or set of conditions is matched by a wme or set of wme's, while a memory node is used to store information on partial production matches. Because of the storing of partial matches in the network, only the changes to working memory, rather than the whole contents of working memory, need be processed each cycle. Each time a wme is added to working memory, the wme is "filtered" down the network, causing new partial matches to be formed and possibly causing one or more productions to be instantiated and placed in the conflict set. An identical process occurs when a wme is removed from memory, except that partial matches are discarded as the wme filters down the network. The Rete algorithm takes advantage of the second property by sharing common test and memory nodes as it adds nodes to the network to represent the conditions of each successive production.
Because of the structure of the Rete network, the Rete algorithm can easily determine, as it is adding nodes to the network to represent a particular production, whether the nodes required already exist in the network and so can be reused rather than creating a new node.
4.1. The Details of the Rete Network
The memory nodes of the Rete network store partial match information in the form of ordered lists of wme's called tokens.
A match to a list of conditions (e.g. the left side of a production) is
represented by a token in which the first wme matches the first condition, the second wme matches the second condition, and so on.
A token stored at a particular memory node in the network
represents a successful instantiation of the list of conditions represented by the set of nodes leading into that node. In the discussion below, we often refer to the number of tokens stored in a memory node as the "size" of the node. Figure 4-1 illustrates the structure of the Rete network for the following condition list:

(state ↑identifier <s> ↑attribute list ↑value <l>)
(object ↑identifier <l> ↑attribute next ↑value <l>)
(state ↑identifier <s> ↑attribute size ↑value <= 16)

¹Miranker [13] cites production systems such as ACE in which the contents of working memory changes completely with every cycle. However, for most typical applications of production systems, including SOAR tasks, working memory changes slowly.
Figure 4-1: Structure of the Rete network for the example condition list. New wme's (+) and removed wme's (-) enter at the bus node and flow down alpha branches of test nodes that apply the intra-condition tests (class name = 'state', class name = 'object', attribute field = 'next', and tests on the identifier and value fields), followed by memory nodes.
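The bookkeeping performed by the memory nodes and an and-node can be sketched as follows; this is an illustration only, not the actual Rete code, and the example conditions and consistency test are invented.

def join(tokens, wmes, consistent):
    """An and-node: combine stored tokens (left memory) with stored wme's
    (right memory), keeping only combinations with consistent bindings."""
    return [tok + (w,) for tok in tokens for w in wmes if consistent(tok, w)]

# Two-condition example: (state <s> item <i>) followed by (object <i> colour red).
beta = [(("state", "s1", "item", "o7"),)]          # tokens matching condition 1
alpha = [("object", "o7", "colour", "red"),        # wme's passing condition 2's
         ("object", "o9", "colour", "red")]        # constant tests

consistent = lambda tok, w: tok[-1][3] == w[1]     # <i> must bind to the same value
print(join(beta, alpha, consistent))
# [(('state', 's1', 'item', 'o7'), ('object', 'o7', 'colour', 'red'))]

# When a new wme arrives it is joined only against the stored tokens,
# rather than re-matching all of working memory.
new_wme = ("object", "o7", "colour", "red")
print(join(beta, [new_wme], consistent))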
The first effect that the ordering of conditions can have is the cross-product effect, in which large numbers of partial instantiations are created, even though the production may have few or no overall instantiations. Specifically, the cross-product effect happens when several unrelated or nearly unrelated conditions occur consecutively in the conditions of a production. In that case, any (or almost any) combination of matches to each of the individual conditions will match the conjunction, so the number of matches to the conjunction will be the product of the number of matches to each of the individual conditions. Hence, any beta memory representing those conditions is likely to contain a large number of tokens. An example of the cross-product effect and the effect of varying the ordering of a set of conditions is given in Figure 9-1. In the figure, two orderings of the same condition list are shown. The conditions match "persons" that each have an "age" attribute and a "father" attribute. The conditions are instantiated by two persons <p1> and <p2> (i.e. instantiated by wme's that represent the persons) if <p2> is the father of <p1> and the age of <p2> is the square of the age of <p1>. The figure gives the number of partial instantiations created for the two orderings if there are twenty persons and fifty numbers (whose square is represented) in working memory. Because the first two conditions of the second ordering have no variables in common, the number of instantiations of the two conditions is 400, the product of the number of instantiations of each of the individual conditions. On the other hand, in the first ordering, every condition after the first is linked to a previous condition through the variable in its identifier field, so there is no cross-product effect.

First ordering:
(person ↑identifier <p1> ↑attribute age ↑value <a1>)
(person ↑identifier <p1> ↑attribute father ↑value <p2>)
(person ↑identifier <p2> ↑attribute age ↑value <a2>)
(number ↑identifier <a1> ↑attribute square ↑value <a2>)

Second ordering:
(person ↑identifier <p1> ↑attribute age ↑value <a1>)
(person ↑identifier <p2> ↑attribute age ↑value <a2>)
(person ↑identifier <p1> ↑attribute father ↑value <p2>)
(number ↑identifier <a1> ↑attribute square ↑value <a2>)
Figure 9-1: Illustration of the Cross-product Effect

The cross-product effect can easily be demonstrated by actual runs of SOAR tasks. For one eight puzzle run in which the conditions had been ordered by the SOAR reorderer, 7922 tokens were created and the matching time was 25 seconds. For five further runs in which the conditions of one of the eight puzzle productions were randomly ordered, the token counts were 65922, 29864, 31058, 64233, and 27795, and the matching times were 1520 seconds, 164 seconds, 233 seconds, 1510 seconds, and 227 seconds, respectively. Obviously, the overhead of the cross-product effect can be enormous if the conditions of a production are not ordered carefully.
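The effect is easy to reproduce in a small self-contained experiment; the Python sketch below follows the person/number example above with randomly generated data (the working-memory encoding and the counting routine are invented for illustration).

import random

def is_var(x):
    return isinstance(x, str) and x.startswith("<") and x.endswith(">")

def extend(condition, wme, env):
    if len(condition) != len(wme):
        return None
    env = dict(env)
    for pat, val in zip(condition, wme):
        if is_var(pat):
            if env.get(pat, val) != val:
                return None
            env[pat] = val
        elif pat != val:
            return None
    return env

def partial_counts(conditions, wmes):
    """Number of partial instantiations after each successive condition,
    i.e. the sizes of the beta memories in a linear Rete network."""
    envs, counts = [{}], []
    for cond in conditions:
        envs = [e2 for e in envs for w in wmes
                if (e2 := extend(cond, w, e)) is not None]
        counts.append(len(envs))
    return counts

random.seed(0)
wmes = []
for p in range(20):                                   # twenty persons
    wmes.append(("person", f"p{p}", "age", random.randint(1, 7)))
    wmes.append(("person", f"p{p}", "father", f"p{random.randrange(20)}"))
for n in range(1, 51):                                # fifty squares
    wmes.append(("number", n, "square", n * n))

good = [("person", "<p1>", "age", "<a1>"),
        ("person", "<p1>", "father", "<p2>"),
        ("person", "<p2>", "age", "<a2>"),
        ("number", "<a1>", "square", "<a2>")]
bad = [good[0], good[2], good[1], good[3]]            # two unrelated conditions first

print(partial_counts(good, wmes))   # e.g. [20, 20, 20, ...]: no cross product
print(partial_counts(bad, wmes))    # e.g. [20, 400, ...]: the cross-product effect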
A second effect that ordering can have on the Rete algorithm is that some orderings may have greater sharing of beta memories and and-nodes than others. If several productions have common sequences of conditions, then there will be greater sharing if the conditions of the productions are ordered so that these common sequences occur at the beginning of the condition lists. Third, some productions may have some conditions that match objects that are relatively static, remaining unchanged in working memory for a long time, and other conditions that match objects that are relatively dynamic, with various attributes frequently changing. Recall, in OPS5, that any modification of a wme involves removing the current version of the wme from the network and adding the new version. If a production contains conditions near the beginning of its condition list matching a frequently changing object, then whenever that object is modified, many partial instantiations of the production (tokens) including that object will be removed and then recreated.
This problem of
changes in matches to the early conditions of productions forcing recomputation of all partial instantiations based on those conditions is called the long-chain effect. This problem can be somewhat reduced if conditions matching mostly static objects are placed at the top of the condition list, and conditions matching more dynamic objects are placed near the bottom of the condition list. In this way, fewer tokens are affected by the changes in the dynamic objects. Unfortunately, the orderings of a condition list that are most efficient in each of these three respects are usually quite different. It is clear, though, that conditions should be ordered mainly to minimize the number of partial instantiations created and only secondarily to maximize sharing or to reduce the recomputation of tokens. The huge overhead that results from the cross-product effect far outweighs any duplication of processing in creating tokens that the latter two goals try to eliminate. In fact, a condition reorderer was added to SOAR mainly because of the cross-product effects caused by many of the chunks. Without the reorderer, the conditions of chunks created by SOAR have no particular order and often cause large numbers of partial instantiations, slowing down the system many times over. (Section 6 mentioned that the cross-product effect occurred in one of the tasks analyzed in Appendix A. The task was the eight puzzle task with learning on, and the cross-product effect resulted because the conditions of some of the chunks were poorly ordered by the reorderer.) The condition reorderer also runs on all task productions loaded at the start of a run, eliminating the need for the user to order the conditions of task productions "by hand."

9.2. The SOAR Condition Reorderer
The goal of the SOAR condition reorderer, then, is to order the conditions of a production so that, for a fixed number of instantiations of the production, the number of partial instantiations of the production is minimized. Clearly, the optimal ordering of the conditions of a production can vary as the contents of working memory and thus the instantiations of the individual conditions vary. Hence,
the reorderer is looking for an ordering that results in the minimum number of partial instantiations of the production, averaged over the likely contents of working memory for the task. Determining the optimal ordering strictly by this definition would be very complex and requires much information about the task that is probably not available. However, the best ordering can be approximated using static properties of the conditions that give information on the likely number of instantiations of the conditions and sets of the conditions. The SOAR reorderer determines an ordering incrementally, a condition at a time. That is, the reorderer builds up an ordered condition list out of the original unordered set of conditions by repeatedly determining the "best" condition remaining in the set of unordered conditions. This best condition is then removed from the set of unordered conditions and appended to the ordered condition list, and the cycle is repeated. The SOAR reorderer defines the best condition as the condition that will most likely add the most constraint to the condition list. That is, if there are already n conditions in the list of ordered conditions, the reorderer looks for an (n + 1)st condition such that, for a given number of instantiations of the n conditions, the likely number of instantiations of the n + 1 conditions is minimized. If more than one condition is "best" according to the information available to the reorderer, the reorderer will sometimes "look ahead" to determine which choice among the tied conditions will allow the greatest constraint among the following few conditions added to the condition list. In classifying conditions according to how much constraint they add to a condition list, we use the term augmented condition list to refer to the condition list which results from appending the condition being classified to the original condition list. Hence, the constraint that a condition adds to a condition list is defined in terms of the relation between the number of instantiations of the original condition list and that of the augmented condition list. Also, in the discussion below, we will refer to a variable in a condition as being bound with respect to a condition list if the variable occurs positively in the condition list. A variable occurs positively in a condition if the condition is itself not negated and the variable appears in the condition alone, without a predicate such as "<>" or ">=". For
example, <x> appears positively in

(object ↑identifier <x> ↑attribute height ↑value 46)

but appears nonpositively in

(state ↑identifier <s> ↑attribute location ↑value <> <x>)⁶

and

-(object ↑identifier <x> ↑attribute height ↑value 46)

⁶Note that the occurrence of "<> <x>" implies that there is a positive occurrence of <x> in a previous condition to which the "<> <x>" refers.

If a variable of a condition is bound with respect to a condition list, then the values of that variable in
instantiations of the augmented condition list are limited to the values that the variable takes on in the instantiations of the original condition list. Three kinds of information are available to the SOAR reorderer. First, it has information on the "connectivity" of the conditions being ordered.
Specifically, it "knows" which variables of the
remaining conditions are bound by the conditions already in the ordered condition list. Conditions in which all variables are already bound are more likely to add constraint than conditions that contain unbound variables in some fields.
Also, since there are only a limited number of wme's with a
particular identifier, a condition with a bound variable in the identifier field is likely to be more constraining than another condition with bound variables or constants in the other fields but an unbound variable in the identifier field. Second, the SOAR reorderer has some SOAR-specific knowledge. For instance, it "knows" that conditions with class name "goal-context" match augmentations of existing goals. Since the conditions of a production are linked together (via their identifier fields) in a hierarchy rooted in the goal conditions, it is best for the goal conditions to be near the beginning of the condition list. Additionally, since there are usually only a few goals in the goal hierarchy at one time, goal conditions are likely to have only a few instantiations and so provide more constraint than other conditions, other things being equal. Third, the SOAR reorderer may be provided by the user with task-specific knowledge about which attributes of objects are multi-attributes. A multi-attribute is any attribute of a SOAR object which can have multiple values. Multi-attributes are used frequently in SOAR tasks to represent sets of objects. The user may provide for each multi-attribute the number of values that are likely to be associated with it. Conditions may be classified broadly into two groups with respect to the amount of constraint they add to a particular condition list. First, there are conditions that add enough constraint that the number of instantiations of the augmented condition list is certain to be less than or equal to the number of instantiations of the original condition list. Second, there are all other conditions, for which the number of instantiations of the augmented condition list may be greater than that of the original condition list. In all further discussion, we consider only conditions which match augmentations.
(The classifications below can all be extended without much difficulty to include
preferences.) Then, based on the information available to the reorderer, the first group of conditions may be further divided into two classes:
• Conditions in which all fields contain constants or bound variables.
• Conditions in which the identifier field contains a bound variable⁷ and the attribute field contains a constant which is not a multi-attribute.
The second group of conditions may also be subdivided into several classes:

⁷We should actually say "bound variable or constant," but the case of a constant in the identifier field is very infrequent.
• Conditions in which the identifier field contains a bound variable, and the attribute field contains a constant attribute which is a multi-attribute.
• Conditions in which the identifier field contains a bound variable, but the attribute field contains an unbound variable. Even if the value field contains a constant or a bound variable, there may still be more than one instantiation of the condition consistent with each instantiation of the original condition list (and so more total instantiations for the augmented condition list than for the original condition list).
• Conditions which have an unbound variable in the identifier field. Again, even if constants or bound variables appear in the attribute and value fields, the number of instantiations of the augmented condition list may be greater than the number of instantiations of the original condition list.

We now give in broad terms the definition of "best condition" used by the SOAR condition reorderer that is based on the above classification and the reorderer's knowledge about goal conditions. The best condition is the condition with the lowest rank in the following ranking:
1. Any condition with a bound variable in its identifier field, and constants in both the attribute and value fields.
2. Any condition with a bound variable in its identifier field and a constant or bound variable in both the attribute and value fields.
3. Any condition with a bound variable in its identifier field, a constant in the attribute field, an unbound variable in the value field, and a class name of "goal-context". As mentioned above, conditions matching goal augmentations are preferred to other kinds of conditions, other things being equal.
4. Any condition with a bound variable in its identifier field, a constant non-multi-attribute in the attribute field, and an unbound variable in its value field.
5. Any condition with a bound variable in its identifier field and a constant multi-attribute in its attribute field. The conditions within this subgroup are ordered further according to the number of values the multi-attribute is likely to have. (A default value is used if the expected number of values of a particular multi-attribute is not provided by the user.)
6. Any other condition with a bound variable in its identifier field and an unbound variable in the attribute field.
7. Any condition with an unbound variable in its identifier field and a class name of "goal-context." This aspect of the definition leads to a goal condition, if there is one, being chosen as the first condition of the ordering (when no variables are bound).
8. Any other condition with an unbound variable in its identifier field.

(As should be clear from their definitions, ranks 1 through 4 are further subdivisions of the first group above, while ranks 5 through 8 cover conditions in the second group.) If there are multiple conditions with the lowest rank, the tie is broken randomly for all ranks except 4 and 5.
For ties with ranks 4 and 5, the tying conditions are evaluated by a look-ahead in which each
condition is temporarily added to the condition list and the remaining conditions are reclassified. The best tying condition is the one which, if added to the condition list, minimizes the rank of the next
condition added to the condition list. There may still be tied conditions after this one-step look-ahead, in which case the look-ahead is extended even further.
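A compact Python rendition of the greedy reorderer and the ranking above is sketched below; the condition encoding is simplified to a single (class, identifier, attribute, value) tuple, ties are broken arbitrarily, and the look-ahead and per-multi-attribute value counts are omitted.

def rank(condition, bound, multi_attrs):
    """Rank a condition for the greedy reorderer (lower is better), following
    the eight classes above.  Variables are written <x>; `bound` is the set
    of variables already bound by the ordered conditions."""
    def is_var(x):   return isinstance(x, str) and x.startswith("<")
    def grounded(x): return not is_var(x) or x in bound

    cls, ident, attr, value = condition
    if grounded(ident):
        if not is_var(attr) and not is_var(value):        return 1
        if grounded(attr) and grounded(value):            return 2
        if not is_var(attr) and cls == "goal-context":    return 3
        if not is_var(attr) and attr not in multi_attrs:  return 4
        if not is_var(attr):                              return 5  # multi-attribute
        return 6
    return 7 if cls == "goal-context" else 8

def reorder(conditions, multi_attrs=frozenset()):
    """Greedy ordering: repeatedly append the lowest-ranked remaining condition."""
    ordered, remaining, bound = [], list(conditions), set()
    while remaining:
        best = min(remaining, key=lambda c: rank(c, bound, multi_attrs))
        remaining.remove(best)
        ordered.append(best)
        bound |= {f for f in best if isinstance(f, str) and f.startswith("<")}
    return ordered

conds = [("state", "<s>", "on", "<x>"),
         ("goal-context", "<g>", "problem-space", "<p>"),
         ("goal-context", "<g>", "state", "<s>")]
print(reorder(conds, multi_attrs={"on"}))
# goal conditions come first, the multi-attribute condition last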
9.3. Results
Using minimal task-specific information (namely, information on multi-attributes), the SOAR condition reorderer described above does a fairly good job ordering conditions in order to minimize the cross-product effect.
From limited observation, the reorderer almost always produces orderings
which are as good or nearly as good as the orderings which would be made "by hand" by the user who understood the meaning and intended use of the productions. For example, for the same eight puzzle run in which 7922 tokens were created when conditions were ordered by the SOAR reorderer, the token count was only reduced to 7859 after several attempts to find better orderings by hand. There are, however, some cases when the SOAR reorderer orders the conditions of chunks badly. These cases suggest that the reorderer should do a better lookahead when there are several connected conditions with multi-attributes to be ordered. Also, there are some cases when a particular ordering can produce a savings by increasing sharing or reducing the long-chain effect, without causing the cross-product effect. The reorderer misses such orderings, since it has no heuristics or information for determining when such orderings may be advantageous. At this point, these cases do not seem to be frequent enough or easy enough to recognize that it is worth modifying the reorderer to handle them. The problem of ordering the conditions of a production so as to minimize the number of partial instantiations is one instance of a problem that is frequent in database and artificial intelligence work. A conjunctive problem may be defined as a set of propositions (or relations or patterns) which share variables and must be satisfied simultaneously. The work required in determining instantiations of such conjunctions often depends on the ordering of the conjuncts. Smith and Genesereth [17] have characterized several methods for determining a good ordering of the conjuncts. The SOAR reorderer may be described as following the "connectivity" heuristic in ranking conditions according to the occurrence of bound variables in the conditions. It also uses a "cheapest-first" heuristic in choosing goal-context conditions (which are likely to have few instantiations) over other conditions, and in ordering conditions with constant attributes according to the expected number of values of the attribute. By combining these two heuristics, the reorderer appears to avoid some problems with using either alone. Additionally, the SOAR reorderer has a heuristic to cover a problem not handled by either of the other two heuristics, that of choosing the first condition on which to "anchor" the rest. The SOAR reorderer "knows" that it is best to start off a condition list with a goal condition, since such conditions are likely to have few instantiations and are usually the root conditions to which the rest of the conditions are linked.
10. Non-linear Rete Networks

In this section we discuss the use of networks with structures other than the linear structure of the standard Rete network. We give only a general discussion of the advantages and disadvantages of using non-linear networks, since there has been only a preliminary investigation of their use for SOAR productions. The standard Rete network has a fixed linear structure in that each alpha branch representing a condition of a production is successively joined to the network representing the conjunction of the previous conditions. It is possible for the conditions to be joined together in a non-linear fashion. For example, a binary Rete network for a four-condition production can be built by joining the alpha branches representing the first two conditions, joining the alpha branches representing the last two conditions, and then joining the results. The functioning of the and-nodes is the same in non-linear networks as in linear networks, except that the tokens from the right input may have any size, rather than a constant size of 1. Figure 10-1 illustrates several possible network structures. The last of the three is referred to as bilinear, because each of several groups of alpha branches are joined together in a linear fashion, and the resulting subnetworks are themselves joined together in a linear network. The ordering problem of Section 9 can now be seen as a special case of a more general problem of determining the optimal structure of the Rete network for joining the alpha branches of the conditions of a production. In Section 9, the linear structure of the standard Rete network was assumed, so the problem reduced to finding the best order in which the alpha branches should be successively joined. The possible effects of differing network structures on the efficiency of the Rete algorithm are the same as the effects described in Section 9.1 of reordering conditions within the linear network structure. First, the structure of the network affects the number of partial instantiations that must be computed for a production. Thus, if four conditions are joined by a linear network, the partial instantiations that are created and stored in the beta memories are the instantiations of the first two conditions and the instantiations of the first three conditions. If the conditions are joined by a binary network, the partial instantiations are the instantiations of the first two conditions and the instantiations of the last two conditions. As Figure 9-1 illustrates, any change in the way in which conditions are joined can have a great effect on the number of partial instantiations that are computed. Second, varying the structure of the network can vary the amount of sharing of and-nodes and beta-memories between productions. In linear networks, sharing at most and-nodes, except those near the beginning of a linear chain, is very unlikely because it requires that the preceding alpha branches of the chain be shared. Thus, any conditions common to several productions must appear at the beginning of their respective condition lists for the and-node(s) that joins them to be shared.
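A toy computation, sketched below, shows how the join structure alone changes the sizes of the intermediate memories for the same final set of instantiations; the conditions and data are artificial.

from itertools import product

def join(left, right, consistent=lambda l, r: True):
    """An and-node: combine partial matches from two input memories."""
    return [l + r for l, r in product(left, right) if consistent(l, r)]

# A chain of four conditions: c1-c2 share <x>, c2-c3 share <y>, c3-c4 share <z>.
c1 = [({"x": k},) for k in range(10)]
c2 = [({"x": k, "y": k},) for k in range(10)]
c3 = [({"y": k, "z": k},) for k in range(10)]
c4 = [({"z": k},) for k in range(10)]
def linked(var):
    return lambda l, r: l[-1].get(var) == r[0].get(var)

# Linear network in a good order: every beta memory holds 10 tokens.
m12 = join(c1, c2, linked("x"))
m123 = join(m12, c3, linked("y"))
m1234 = join(m123, c4, linked("z"))

# Binary network grouped as (c1 c3)(c2 c4): the unrelated conditions inside each
# group produce 100-token cross products before the final join restores the constraint.
left, right = join(c1, c3), join(c2, c4)
final = join(left, right,
             lambda l, r: l[0]["x"] == r[0]["x"] and l[1]["y"] == r[0]["y"]
                          and l[1]["z"] == r[1]["z"])
print(len(m12), len(m123), len(m1234))    # 10 10 10
print(len(left), len(right), len(final))  # 100 100 10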
Figure 10-1: Several Possible Structures for the Rete Network

Non-linear networks allow greater sharing by breaking up the long linear chains into smaller subnetworks which can each be shared. In fact, there can be and-node sharing within a single production if it has several repeated conditions. For example, consider the condition list:

    (person ^identifier <p1> ^attribute father ^value <f1>)
    (person ^identifier <f1> ^attribute name ^value <nm>)
    (person ^identifier <p2> ^attribute father ^value <f2>)
    (person ^identifier <f2> ^attribute name ^value <> <nm>)

An instantiation of such a condition list would yield a pair of persons <p1> and <p2> whose fathers have different names. In the Rete network for the condition list, the first and third conditions would
share an alpha branch (since they have the same sequence of intra-condition tests), as would the second and fourth conditions. If a binary network is used for the condition list, there is further sharing of the and-node joining the first two conditions and the and-node joining the second two conditions.

Third, the use of a non-linear network can eliminate the long-chain effect that is characteristic of linear networks. For instance, if the conditions of a production matching distinct objects are placed in separate subnetworks of a bilinear network, then changes to one object will affect only the tokens in the subnetwork matching that object, and will not affect matches to different objects in the other subnetworks. One case in which the use of a non-linear network is particularly advantageous is for a production in which several of the conditions are intended to match one or more permanent objects in working memory. For example, one production of the eight puzzle task determines if the overall goal has been achieved by comparing the tile configuration of the current state with the tile configuration of another object in working memory representing the desired state. If a non-linear network is used for matching this production, then the conditions matching the desired tile configuration can be placed in a separate subnetwork, and the match to the desired state is only done once. However, if a linear network is used, the work of matching the conditions representing the desired state is recomputed often when the matches to preceding conditions in the production change.

Unfortunately, the cross-product effect occurs more frequently for non-linear networks than for linear networks. With linear networks, because each successive condition adds more constraint to all previous conditions, there is almost always an ordering of the conditions that avoids the cross-product effect. However, in non-linear networks in which conditions are joined first in smaller groups before all being joined together, it is sometimes impossible to get enough constraint in the subnetworks to avoid the cross-product effect. The occurrence of the cross-product effect often cancels out any of the benefits that may result from using non-linear networks to increase sharing or avoid the long-chain effect.

A logical network structure for SOAR productions would be a bilinear network in which each linear subnetwork contains the conditions matching one SOAR object. Such a network structure would allow sharing of similar groups of conditions in different productions that match the same type of SOAR object and would make the match of each object independent of the other objects in the production. The linear subnetworks could be thought of as "super alpha branches," since they completely match SOAR objects, just as alpha branches completely match OPS5 objects. The difference between such alpha branches and "super alpha branches" is that many partial instantiations may be created for each match to the "super alpha branch," whereas only one token is created for each match to an alpha branch. The disadvantage of such a bilinear network structure is that the object subnetworks may match many objects that are immediately ruled out by other conditions of the production in a linear network. Much of the constraint in linear networks of conditions results from the intermixing of the conditions representing the different objects. That source of constraint is lost if conditions representing different objects are put in different subnetworks. Hence, it is not clear whether non-linear networks are in general useful for speeding matching of SOAR productions.8 There are some special cases in which it is advantageous to build a non-linear network for a production, but there is no obvious general criterion for determining when non-linear networks are useful.
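The following sketch illustrates the grouping that such a bilinear, one-subnetwork-per-object structure implies: a production's conditions are partitioned by the variable in their identifier field. The flat-list condition format mirrors the examples in this chapter; the helper names are illustrative assumptions only.

    ;; Sketch: partition conditions by the object they match.
    (defun identifier-of (condition)
      (second (member '^identifier condition)))

    (defun group-conditions-by-object (conditions)
      "Return a list of condition groups, one group per object identifier;
    each group would become one linear subnetwork of a bilinear network."
      (let ((groups '()))
        (dolist (c conditions)
          (let ((entry (assoc (identifier-of c) groups)))
            (if entry
                (push c (cdr entry))
                (push (cons (identifier-of c) (list c)) groups))))
        (mapcar (lambda (g) (nreverse (cdr g))) (nreverse groups))))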
11. Putting More Tests on the Alpha Branches

In Section 6 we indicated that the ratio of matching tests done at the and-nodes to tests done by the alpha nodes is much greater for Rete networks built from SOAR production sets than for OPS5 production sets. The "joining" operation of the and-node is more complex than the simple tests of the alpha nodes. Hence, it seems likely that any inclusion of extra tests on the alpha branches that reduces the number of tokens entering and-nodes (but does not eliminate any legal instantiations of productions) will result in an overall speedup of the matching process.

One example of this possibility for speeding up the match has to do with conditions which match augmentations of goals. A newly-created goal initially has augmentations that indicate that the problem-space, state, and operator of the goal are undecided:

    (goal-context ^identifier g00005 ^attribute problem-space ^value undecided)
    (goal-context ^identifier g00005 ^attribute state ^value undecided)
    (goal-context ^identifier g00005 ^attribute operator ^value undecided)

As the problem-space, state, and operator are determined by the decision procedure, these augmentations are removed, and new ones created with the decided values. In a production fragment such as:

    (p eight*apply-move-tile
       (goal-context-info ^identifier <g> ^attribute problem-space ^value <p>)
       (space-info ^identifier <p> ^attribute name ^value eight-puzzle)
       (goal-context-info ^identifier <g> ^attribute state ^value <s>)
       ...)

it is clear that <p> is not intended to match "undecided". However, the first condition will match any problem-space augmentation of a goal, even if the value field contains "undecided". It is only by the conjunction of the first two conditions that the possibility that <p> matches "undecided" is eliminated, since there is no wme with identifier "undecided". Hence, the processing of the and-node joining the first two conditions can be reduced by including an extra test on the alpha branch that ensures that <p> does not bind to "undecided".
8 Non-linear networks are definitely useful in SOAR for one reason unrelated to efficiency. A non-linear network is required to represent a production that contains a conjunctive negation (negation of a conjunction of several conditions). The ability to express conjunctive negations is especially needed in SOAR, given SOAR's representation of objects.
This test can be added by changing the first condition to:

    (goal-context-info ^identifier <g> ^attribute problem-space ^value { <> undecided <p> })

The braces indicate that the "<> undecided" test and the <p> variable binding both apply to the value field of the condition. Every SOAR production begins with several conditions matching goal augmentations. If a "<> undecided" test is added to all conditions with a variable in the value field that match problem-space, state, and operator augmentations, the effect is surprisingly large. For the eight puzzle run of 143 decision cycles, the number of tokens that are created is reduced from 22242 to 15275, resulting in an 18% speedup from 33 seconds to 27 seconds. The addition of the "<> undecided" tests can be done by the production compiler, so that the user need not worry about them.
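A minimal sketch of the kind of rewrite the production compiler can perform is shown below; the flat-list condition representation and the helper names are assumptions for illustration, not the actual SOAR production-compiler code.

    ;; Sketch: add a <> undecided test to a goal-context condition whose
    ;; ^value field is a bare variable.
    (defun variable-p (x)
      (and (symbolp x) (char= (char (symbol-name x) 0) #\<)))

    (defun add-undecided-test (condition)
      "If a goal-context condition binds a bare variable in its ^value
    field, replace the field with a form that also tests <> undecided."
      (let ((tail (member '^value condition)))
        (if (and (eq (first condition) 'goal-context-info)
                 tail
                 (variable-p (second tail)))
            (append (ldiff condition (cdr tail))   ; keep everything through ^value
                    (list (list '<> 'undecided (second tail))))
            condition)))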
12. Modifying the Rete Network During a Run

Once wme's have been added to working memory, the memory nodes of the network contain tokens indicating partial matches involving the current elements of working memory. If a new production is then added to the network, any unshared memory nodes of the new production will be empty and must be "primed", i.e., updated to contain tokens representing partial matches of the current working memory contents to the new production. If the memory contents are not updated, then the tokens that should be in the memory and all tokens further down in the network that are based on these tokens will never be created. Hence, if the memories are not primed, any instantiations of the new production which are based on these tokens will never be created and added to the conflict set. OPS5 compiles in the normal way productions that are added to production memory during a run, but it does not prime the unshared memories of the new production. This is not sufficient for SOAR, which must have the capability to add chunks to production memory so that they match working memory properly.

If there were no sharing of network nodes, the priming could be done simply by saving a pointer to each alpha branch as it is built and, after the whole production is compiled, sending each wme in working memory down each of the new alpha branches in turn. Previously, SOAR took a related approach. The nodes of all productions loaded in at the beginning of a run were shared. However, the network structure built for each chunk during the run was not shared with the other productions, so the simple procedure described above could be used. Thus, no chunk was shared with any of the original productions or any other chunks. This loss of sharing increases the number of nodes in the network and the average number of tokens in the network. Such loss of sharing is especially significant because chunks tend to be complicated productions, longer than most of the original productions, and because often a pair of chunks are either completely identical or are identical up to the last few conditions.
The procedure for priming new memory nodes when the newly added productions are shared with the existing network is slightly more complicated, though it requires less processing, since shared memory nodes (i.e., nodes used in building the production that already existed before the new production was compiled) need not be updated. To describe the procedure, we first need to define the term beta branch. The beta branch of an and-node, not-node, or beta memory is the unique path from that node to the bus node which follows the left input of the two-input nodes and the (single) input of all other nodes. This path passes through a sequence of and-nodes, not-nodes, and beta memories and then follows the alpha branch of the first condition of a production to the bus node. Because not-nodes contain memories, they must be primed as well as alpha and beta memories.

Priming occurs as the network of a new production is being built, whenever a new (i.e., unshared) memory node or not-node is created. A new alpha memory is primed by sending each wme in working memory down the alpha branch leading to the new alpha memory, stopping at the alpha memory (if the wme gets that far). A new beta memory or not-node X is primed as follows. If there is no other memory node on the beta branch above X, then all wme's in working memory must be sent from the bus node down the beta branch to the new node X. Otherwise, all tokens in the memory node (or not-node) M lowest on the beta branch above X must be sent from M to X. M is already primed either because it was shared or because it was already updated earlier in building the current production. (The two cases of whether or not there is a memory node on the beta branch above X can be combined if the bus node is thought of as a memory node containing all wme's currently in working memory.)

The importance of this modification of the OPS5 Rete algorithm is shown by its effect on the running of the eight puzzle task with learning on. The size of the network resulting from the default and eight puzzle productions is 1051 nodes and would be 3347 nodes if there were no sharing. During one run of the eight puzzle with learning on, eleven chunks are created, increasing the total "unshared" size of the network to 4926 nodes. If there is no sharing among the original network and the new networks created for each of the chunks, the total number of actual nodes in the network increases to 2091, and the average number of tokens in the network is 3297. If the new chunks are integrated into the original network, the network size only increases to 1469 nodes and the mean number of tokens in the network is only 2180. The result is a speedup of the matching time for the run from 188 seconds to 166 seconds, a reduction of 12 percent. The speedup would be much greater for longer runs in which the ratio of the number of chunks created to the number of initial productions is larger.
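The priming procedure can be sketched roughly as follows; the node accessors used here (for the left input, the stored tokens, and token propagation) are assumed names, not functions from the OPS5 source.

    ;; Sketch of priming a newly created, unshared two-input memory.
    (defun nearest-memory-above (node)
      "Walk up the beta branch (always the left input) to the closest
    memory node, or to the bus node if there is none."
      (loop for n = (node-left-input node) then (node-left-input n)
            until (or (memory-node-p n) (bus-node-p n))
            finally (return n)))

    (defun prime-new-node (new-node working-memory)
      (let ((above (nearest-memory-above new-node)))
        (cond ((bus-node-p above)
               ;; No memory above: replay every wme from the bus node down
               ;; the beta branch, stopping at the new node.
               (dolist (wme working-memory)
                 (send-token above wme new-node)))
              (t
               ;; The memory above is already up to date (shared, or primed
               ;; earlier while building this production): replay its tokens.
               (dolist (token (node-tokens above))
                 (send-token above token new-node))))))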
13. Using Indexing to Eliminate Alpha Tests

By convention, all OPS5 conditions begin with a constant class name. Since the OPS5 production compiler creates the test nodes of alpha branches in the order of the tests in the condition, a test for a particular constant class name is always the first test on the alpha branch of that condition. In SOAR, not only do all conditions begin with a constant class name, but most conditions have the following form:

    (constant-class-name ^IDENTIFIER variable ^ATTRIBUTE constant-attribute-name ^VALUE value)

Such a condition matches the most common type of wme, the augmentation. The variable in the identifier slot has most likely appeared in previous conditions and almost certainly does not appear anywhere in the value field. Hence, it is not involved in any of the tests on the alpha branch corresponding to the condition. The attribute field of a SOAR condition is almost always a constant attribute name, since it is usually possible and desirable to eliminate variable attributes in conditions by using an extra level of indirection in storing information about the object.9 In the typical case, then, of a constant class name, variable identifier, and constant attribute name, the tests on the condition's alpha branch consist of:

1. A test that the class name of the wme is constant-class-name.
2. A test that the value of the attribute field of the wme is constant-attribute-name.
3. Any intra-condition test(s) associated with the value field (e.g., matching the value field to a constant).

Because the class-name test is the first on each alpha branch, this test can always be shared between any conditions with the same class name. Hence, because of sharing, there are actually only as many alpha branches leaving the bus node as there are class names, though each of these branches may separate into many alpha branches below the class name test. Any wme being filtered down the network will only satisfy one of these class name tests. The filtering process may be sped up, then, by building a table which indicates which alpha branch should be followed for each class name.

A similar technique can be used to speed up the attribute-name tests immediately below class name tests. Again, because of sharing, below any class-name test for an augmentation class, there is one branch for each possible attribute name for that class, plus possibly other branches corresponding to conditions which don't contain a constant attribute name. While all of these other branches must be traversed as usual, only one of the constant attribute branches need be traversed, as indicated by a table indexed by the attribute names.

9 There are some cases when variable attributes are both useful and needed. For example, variable attributes are necessary to reference attributes which are themselves SOAR objects with various attributes.
One simple method in LISP of implementing a table that maps class names to the corresponding alpha branch is to put a pointer to each alpha branch on the property list of the corresponding class name. Such a method uses LISP's symbol table (often a hash table) to achieve the mapping. Also, the table for each class name which maps attributes into branches can be represented in LISP by an association list, also on the property list of the class name. When indexing of class names and of attributes is implemented in this way, the matching time for the eight puzzle run is reduced from 107 seconds to 95 seconds, a decrease of 11 percent. The number of alpha branch operations is reduced from 60811 alpha node tests to 6234 indexing operations and 5481 alpha node tests.
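A sketch of this property-list indexing might look as follows; the property names and the wme accessors are illustrative assumptions, not the names used in the actual implementation.

    ;; Sketch: class-name and attribute-name indexing via property lists.
    (defun register-class-branch (class-name branch)
      (setf (get class-name 'alpha-branch) branch))

    (defun register-attribute-branch (class-name attribute branch)
      (push (cons attribute branch) (get class-name 'attribute-branches)))

    (defun dispatch-wme (wme)
      "Send a wme directly to the alpha branch for its class, and then to
    the branch for its constant attribute name.  Branches for conditions
    with variable attributes would still be traversed as usual (not shown)."
      (let ((class-branch (get (wme-class wme) 'alpha-branch)))
        (when class-branch
          (let ((entry (assoc (wme-attribute wme)
                              (get (wme-class wme) 'attribute-branches))))
            (when entry
              (send-down-branch (cdr entry) wme))))))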
14. Using Hash Tables at Memory Nodes

In this section we discuss the possibility of speeding up the Rete algorithm by storing tokens at memory nodes in hash tables rather than in simple lists. First we discuss the general advantages and disadvantages of using hash tables. Next we discuss an implementation for networks built from SOAR production sets which avoids some of the disadvantages. Finally, we give timing results on speedup of SOAR tasks when hash tables are used.
14.1. Advantages and Disadvantages of Hash Tables

Using hash tables at some or all of the memory nodes holds promise for speeding up two aspects of the Rete matching algorithm. First, removal of tokens from memory nodes will be faster if hash tables reduce the average number of tokens that must be examined before the one to be removed is found. Second, storing the tokens in hash tables could speed up processing at the and-nodes significantly. When a token comes down one input of an and-node, a sequence of and-tests is executed between that token and each token in the memory node at the other input. Suppose the tokens in the opposite memory node have been stored in a hash table according to some field of the tokens which is involved in an equality test at the and-node. Then, only tokens in the opposite memory that have the appropriate hash value (according to the equality and-test) need be checked to see if they satisfy the and-tests.

For example, the tests at the and-node joining the two conditions

    (goal-context ^identifier <g> ^attribute problem-space ^value <p>)
    (goal-context ^identifier <g> ^attribute state ^value <s2>)

to the condition

    (preference ^object <s2> ^role state ^value acceptable ^goal <g>)

are that the object attribute of the wme from the right (the preference) must equal the value attribute of the second wme of the token from the left, and the goal attribute of the right wme must equal the identifier attribute of the first wme of the left token. Suppose the tokens in the right memory are stored in a hash table according to the value of their object attribute (the <s2> field), and a token comes down on the left whose second wme has a value s0008 for its value field (i.e., for <s2>). Then, only those tokens in the right hash table that have the hash value for s0008 need be checked for possible matching with the new left token. For example, if bucket hashing is used and s0008 hashes to bucket 8, then only wme's in bucket 8 of the right memory need be tested. Tokens in any other bucket could not possibly have s0008 for their object attribute. Similarly, if the tokens in the left memory were stored according to the value of their "<s2>" or "<g>" field, then the value of the "<s2>" or "<g>" field in a token coming down from the right could be used to pick out the particular bucket in the left hash table in which matching tokens must be.

One possible problem with using hash tables at the memory nodes is that the overhead involved in hashing (computing the hash value, doing a table lookup, handling collisions) may use up more time than is saved by speeding up the remove and and-node operations. In the OPS5 Rete algorithm, the process of adding a token to a memory node consists of just adding the token to the front of a list, so that process is definitely slowed down when hash tables are used at memory nodes. Hopefully, the speedup in the remove and and-node operations will more than compensate for the additional overhead of hashing. Another overhead resulting from the use of hash tables occurs if a memory of an and-node uses a hash table, but, for some reason, no field of a token from the opposite input can be used to eliminate from consideration all wme's except those with a certain hash value. In this case, the entire contents of the hash table must be enumerated so that all the wme's in the memory can be tested. Because much of the hash table may be empty, this process of enumerating all the wme's in the memory node can take significantly longer than if the tokens in the memory node are stored in a simple list.
14.2. An Implementation of Hashing for SOAR Tasks
For general OPS5 tasks hashing would be disadvantageous at many of the memory nodes for several reasons. First, an and-node may have no equality and-tests, so there would be no field that tokens from either memory could be hashed on to reduce the processing at that particular and-node. Second, because of sharing, memory nodes (especially alpha memories) may have several and-nodes as outputs. In general, each of the and-nodes will require that the tokens at the memory node be hashed on a different field. Hence, unless sharing is reduced so that each set of and-nodes that require hashing on a particular field have their own memory node, the use of the hash table will speed up the processing at only some of the and-nodes. At the other and-nodes the entire hash table would have to be enumerated, which, as explained above, would slow down the and-node processing considerably.

SOAR productions have several properties that make hash tables more usable and advantageous. The most important property is that nearly every condition in a SOAR production (except the first) is linked to a previous condition through its identifier field. That is, almost every condition has a variable in its identifier field which appears somewhere in a previous condition. Thus, almost every and-node will have an and-test which tests that the identifier field in a wme from the right input is equal to some field of a token from the left input. Hence, for hash tables at the right memories of and-nodes, wme's should always be hashed on the value of their identifier field. For hash tables at the left memories, tokens should be hashed on the field that must match the identifier field of the right token (wme). With this scheme, it is only for left memory nodes that we run into the problem of and-nodes that share a memory requiring tokens to be hashed according to different fields. Since, as indicated by the statistics in Appendix A, only about 10-15% of beta memories (which make up most of the left memories) have more than one output, the loss of sharing that would be required in order to use hash tables at left memory nodes would not be great.

The type of hash table that should be used seems rather clear from a couple of considerations: (1) the overhead for access to the table should be low and should degrade gracefully as the table becomes more and more filled; (2) it should be easy to access all of the elements in the table with a particular hash value. The appropriate type of hash table then seems to be a hash table with buckets represented by linked lists. With bucket hash tables, the need to expand hash tables and rehash when they are full is avoided, and it is simple to pick out all the tokens that hash to a particular value.

Another issue is whether each memory node should have a local hash table, or there should be one global hash table in which all tokens from "hashed" memory nodes are stored. If each memory has its own hash table, it would be best to vary the size of the table in relation to how filled the memory will get. However, unless statistics are saved from previous runs, this information is not available when the network is built, so each memory node must be given a hash table of the same size. (It is known that alpha memories in general get more filled than beta memories, so it is probably best for alpha memory hash tables to be bigger than beta memory hash tables.) The use of a global hash table avoids the problem of having an uneven distribution of tokens in hash tables at the individual memory nodes, so the size of the global hash table can be significantly smaller than the sum of the sizes of hash tables at all the memory nodes. On the other hand, if a global hash table is used, then the hash function by which tokens are stored would have to be based on the particular memory node for which the token is being stored, as well as on the value of a field of the token. Also, each token stored in the table must be tagged to indicate which memory node it comes from, and these tags must be examined during hash table operations to ensure that a particular token in the hash table comes from the memory node of interest. Hence, the tradeoff between local hash tables and a global hash table is that the overhead for accesses is lower when local hash tables are used, but less storage is required if a global hash table is used.
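A rough sketch of such a local bucket hash table for a right memory, keyed on the identifier field, is given below; the table size and the accessor names are assumptions for illustration, not the actual implementation.

    ;; Sketch: a local bucket hash table for a right (alpha) memory.
    (defparameter *right-memory-buckets* 64)

    (defun make-right-memory ()
      (make-array *right-memory-buckets* :initial-element nil))

    (defun bucket-index (key)
      (mod (sxhash key) *right-memory-buckets*))

    (defun right-memory-add (memory wme)
      (push wme (aref memory (bucket-index (wme-identifier wme)))))

    (defun right-memory-remove (memory wme)
      (let ((i (bucket-index (wme-identifier wme))))
        (setf (aref memory i) (delete wme (aref memory i) :count 1))))

    (defun right-memory-candidates (memory left-token)
      "Only the wme's in the bucket selected by the left token's join value
    need be run through the and-tests; every other bucket can be skipped."
      (aref memory (bucket-index (token-join-value left-token))))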
14.3. Results

The implementation that was actually tested was local bucket hash tables. Timings were obtained for three variants in which hash tables were used only at left memories, only at right memories, or at all memories. The results indicate that hash tables are most effective at right memory nodes. In fact, the matcher slowed down with hash tables at the left memories. Specifically, for an eight-puzzle run in which matching time was 70 seconds without the use of hash tables, the matching time was 82 seconds with hash tables at the left memories alone, 55 seconds with hash tables at the right memories alone, and 63 seconds with hash tables at all memories. When hashing at left memories is used, the loss of sharing increases the size of the network by seven memory nodes out of 1068. To some extent, the timings obtained depend on the relative speeds of array references and other operations that the use of hash tables requires versus the speed of the and-test operations that are eliminated by the hashing. For some LISP implementations, array referencing may be much faster in relation to list and test operations, and for these, hashing at left memory nodes may in fact speed up the algorithm. In any case, it is clear that it is particularly advantageous to hash at right memory nodes.

Besides the loss of sharing when left memory hashing is used, the reason the results are better for right memory hashing than for left memory hashing seems to be that right memories (most of the alpha memories) get more filled than left memories (mostly beta memories). If a memory has only a few elements, the overhead of hashing is larger than the speedup gained for removes and and-node operations. In fact, though most tasks sped up by more than 20% with right memory hashing, one SOAR task did not speed up at all for the first 100 decision cycles, because the average size of the right memory when a token entered the left side of an and-node was only about 2.8. The task did speed up for the second 100 cycles, when the average size of the right memory on an and-left call was about 6.1.

A variant of strict bucket hashing was found to speed up and-node processing even further when right memory hashing is used. In this variant, each bucket contains a number of lists, rather than a single list of tokens. Each of the lists of the bucket is headed by a hash key that hashes to that bucket and contains all the tokens that have that particular hash key. With this arrangement, the number of tokens that must be tested is reduced even further, since only tokens with the key required by the token entering on the left need be tested against that token. For example, using the example given in Section 14.1, suppose the tokens in the right memory are stored in this variant of a bucket hash table according to their <s2> field. If a token entering on the left has a value of s0008 for its value field and s0008 hashes to the 8th bucket, then only tokens in the s0008 sublist of the 8th bucket need be tested against the entering token. This variant of bucket hashing simulates the case in which all possible keys hash to a different bucket, but uses a smaller hash table at the cost of a linear search through the sublists of the buckets. The results indicate that the increased processing required to maintain the sublists of the buckets is more than compensated by the reduced number of and-tests that must be executed. For the same eight-puzzle run given above in which the match time was 70 seconds without hash tables and 55 seconds with right memory hashing, the matching time with right memory hashing using this variant was 46 seconds, yielding an additional speedup of 16% and an overall speedup for hashing of 34%.
15. Speeding Up Removes

From a programming and algorithmic standpoint, it is convenient and even elegant that the operation of the Rete algorithm, except for the action of the memory nodes, is identical for the addition of wme's to working memory and for the removal of wme's from working memory ("removes"). However, from the standpoint of efficiency, the and-tests executed during a remove are redundant. For each set of and-tests executed between two tokens during a remove, the same and-tests were executed earlier, with the same result, whenever the newer of the two tokens was created.

Therefore, one possibility for speeding up removes is to save the outcome each time and-tests are executed to determine if two tokens are consistent. If the outcome of each such consistency test between token A and another token is stored with token A in its memory node, then none of these consistency tests will have to be executed when token A is removed. Unfortunately, such a system has many obvious disadvantages, including the large amount of extra storage required to store the outcomes.

Another possibility for speeding removes is to maintain for every wme in working memory a list of the locations of all tokens which contain that wme. Such information would make removes very fast, since the tokens containing the removed wme could be removed from their memory nodes directly without the necessity of traversing the network. Again, however, this system has a large storage overhead. Additionally, the processing overhead whenever a token is created is greatly increased, since a pointer to the memory node where the token is stored must be saved for each wme in the token.

A final possibility that requires no storage of extra information eliminates almost all processing at and-nodes during removes at the cost of increased processing at the memory nodes. Consider the situation when a token T enters an input of an and-node A during a remove operation. The beta memory B below the and-node contains one or more tokens that are the result of concatenating token T with another token, if T is consistent with tokens in the opposite memory of A, according to A's and-tests. Such tokens in B may be identified and removed by scanning B's memory for all tokens which contain T as a left or right subpart. Hence, with this modification to the Rete algorithm, whenever a token T enters one of the inputs of an and-node A during a remove, it is immediately sent on to each of the outputs of the node. Each memory node B below the and-node receives T, which has size one less than the size of the tokens that are actually stored at B, and removes all tokens that contain T as a left or right subpart. Each such token that is removed is sent on to the outputs of B, causing further removes. If no tokens are found with T as a subpart, then T was not consistent with any of the tokens in the opposite memory of A, so no tokens need be removed from B, and the traversal of the network backs up right away.

This change to the Rete algorithm eliminates (for removes) the consistency tests between a token entering an and-node and all the tokens in the opposite memory, but requires that the contents of the memory node(s) below the and-node be scanned. The scanning operation is much simpler and almost always faster than executing the consistency checks between tokens at the and-node, especially since beta memories do not tend to grow very large. The modified algorithm is especially efficient when a token is consistent with many tokens in the opposite memory, since all the resulting tokens can be removed from the beta memory below the and-node in one sweep of the memory, rather than one at a time. For one run of the eight puzzle, this modification to the remove process decreased the matching time from 46 seconds to 33 seconds, a speedup of 28%.
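A sketch of this third scheme is shown below; the token and node accessors are assumed names used only for illustration, not the functions of the actual matcher.

    ;; Sketch: remove-time processing at a beta memory, without and-tests.
    (defun quick-remove (beta-memory removed-token)
      (let ((victims '()))
        ;; One sweep of the memory finds every stored token that has the
        ;; removed token as its left or right subpart.
        (dolist (tok (node-tokens beta-memory))
          (when (or (eq (token-left-part tok) removed-token)
                    (eq (token-right-part tok) removed-token))
            (push tok victims)))
        (setf (node-tokens beta-memory)
              (set-difference (node-tokens beta-memory) victims))
        ;; Propagate each deleted token as a remove to the nodes below.
        (dolist (victim victims)
          (dolist (out (node-outputs beta-memory))
            (remove-token out victim)))))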
16. Eliminating the Interpretive Overhead of the Rete Algorithm

The Rete network is essentially a declarative representation of a program which must be interpreted by the Rete matching algorithm each time a wme is added to or removed from working memory. Its structure makes it easy to add new productions to production memory and to determine what tests that are common to more than one production can be shared. However, much of the actual processing time of the OPS5 Rete algorithm is taken up with interpreting the Rete network, rather than with actually executing the matching and storage operations. One of the manifestations of this interpretive overhead is the significant percentage of the matching time taken up by the APPLY function. LISP's APPLY function is used by the OPS5 Rete algorithm to map test and node names contained in the network into functions that actually execute the node and test functions. For one run of the eight puzzle task, the APPLY function alone used up 10% of the matching time. Other overhead in interpreting the Rete network includes accessing the information in the nodes, iterating through lists of node outputs or and-tests, and transferring control between the functions that execute the nodes and tests.

All this interpretive overhead could be eliminated by compiling the network. Ideally, compilation would yield a function or set of functions which explicitly contained all the tests and memory operations represented by the nodes of the network, such that the operations are executed in the same order as they are in the depth-first traversal of the network. As with most programming systems, the speedup gained through compiling is balanced by a number of disadvantages. First, the compiling process may be complex and time-consuming, greatly lengthening the initial production loading process. Second, the resulting compiled code may be difficult or impossible to modify as productions are added to or removed from production memory, forcing a large recompilation to be done for any changes to the production set. Such a disadvantage is crucial for SOAR, since chunks are added to production memory during runs. Third, since each node must be expanded into code that executes the functions of that node, the size of the code compiled to execute the network will certainly be bigger than the size of the memory required to represent the network.

In Section 16.1 we describe a method of eliminating much of the interpretive overhead of the LISP OPS5 Rete algorithm without compiling the network. In Section 16.2 we discuss the issues involved in actually compiling the network to executable code and present the results for several implementations.
16.1. Reducing the Interpretive Overhead Without Compiling

Much of the overhead of interpreting the Rete network can be eliminated without actually compiling the network, thereby avoiding the disadvantages described above. The APPLY calls can be replaced by more specific code that executes faster. APPLY is a general function which takes any name and executes the function with that name. However, there are only a small, fixed number of node names and test names in the network, so the APPLY calls can be replaced by code in the matcher which specifically tests a node name or a test name for each of its possible values and calls the corresponding function. Essentially, the mappings between a node or test name and the corresponding function are "hard-wired" into the matching code. Other changes that are even more dependent on the particular implementation of the OPS5 Rete algorithm10 can be made to reduce the interpretive overhead. The result is a large speedup for all tasks. For one eight-puzzle run, the matching time went down from 95 seconds to 74 seconds, a speedup of 22%.
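For example, the generic APPLY dispatch might be replaced by something like the following; the node-type keywords and the per-node function names here are illustrative, not the actual OPS5 node names.

    ;; Sketch: hard-wired dispatch on the node type instead of APPLY.
    (defun activate-node (node token direction)
      (case (node-type node)
        (:memory     (memory-node-activation node token direction))
        (:and        (and-node-activation node token direction))
        (:not        (not-node-activation node token direction))
        (:production (production-node-activation node token direction))
        (otherwise   (error "Unknown node type: ~S" (node-type node)))))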
16.2. Compiling the Network

One issue in compiling the Rete network into executable code is whether to compile the network directly into native machine code or into LISP code, which is then compiled into machine code by the LISP compiler. Obviously, the first approach provides the means for the greater efficiency, since the code for the node operations can be optimized for the machine instruction set. However, it also makes the network compiler machine-dependent. The implementations described here compile the network to LISP code, in part because there was no documentation available for the Xerox 1132 workstation on how to produce machine code from LISP.

Another issue is the complexity of the node operations. In the compilation process, every node is compiled into a piece of code that directly executes the operations of the node.
10 such as reducing the number of function calls made by the interpreter and making the code that iterates over the outputs of a node more efficient.
Hence, code to do the and-node operations is duplicated many times over in the resulting compiled code, though with each instance specialized to contain the and-tests, output calls, etc., of a particular and-node. Because the code is duplicated so many times, it is desirable to keep the code that executes the and-node operations simple. However, changes to the Rete algorithm, such as the use of hash tables at the memories, complicate the and-node processing. Hence, there is a tradeoff between using simple and-node code in order to keep down the size of the compiled code or using more complex and-node code which takes advantage of other speedups besides compiling.

Another significant issue is the number of LISP functions into which to compile the network. One consideration is that the larger the number of functions into which the network is compiled, the greater the overhead in calling between node functions. An opposite consideration is that the greater the number of node functions, the smaller the amount of recompilation that must be done when a new production is added to the network, since only the functions corresponding to new nodes or nodes that have been changed (by the addition of new outputs) must be compiled. A third consideration is that it is simplest if each two-input node is compiled into a different function. The two-input nodes must maintain some state information while tokens that they create traverse the subnetwork below them. For instance, a two-input node must maintain information on the token that activated it, on which input the token entered, and on which token in the opposite memory is being processed. Such information is most easily maintained in a few local variables in separate functions which execute each two-input node.

Based on these considerations, a compiler was built that compiles the network in either of two slightly different ways. In the first approach, every node of the network is compiled into a separate function. In the second approach, all of the alpha branches are compiled into a single function, and each two-input node and the memory below it (if there is a memory node below it) are compiled into a single function. The compiler also allows a choice of whether or not the compiled code uses hash tables at the right memories. The speedups obtained by compiling the network are summarized in Table 16-1.

                            No Hash Tables    Hash Tables
    Interpreted Network         74 sec          46 sec
    Compiled Network            63 sec          39 sec
    % speedup                   15%             15%

        Table 16-1: Times for Interpreted vs. Compiled Networks
Regardless of whether hash tables were used or not, there was no measurable difference in matching time between code in which all nodes were compiled into separate functions and code in which all the alpha branches were compiled into a single function. Hence, that factor is not included in the table.
The times given in the "Interpreted Network" line are for the Rete algorithm modified as described in Section 16.1 to reduce interpretive overhead. The speedup obtained from compiling the network is much less than was expected. Clearly, the changes of Section 16.1 eliminate the majority of the interpretive overhead of the Rete algorithm that compiling the network is intended to eliminate. Specifically, using the timings from Section 16.1 and the "No Hash Tables" column of the table, the modifications to reduce the overhead of interpreting the Rete network give (95-74)/(95-63) = 66% of the speedup that results from compiling the network.

The code size of the compiled network was large, even for small production sets. The eight puzzle network contains 1048 nodes and can be represented by about 7,000 to 10,000 bytes depending on the encoding scheme. The size of the code that executes the eight puzzle network on the Xerox 1132 workstation is 75,000 to 90,000 bytes, depending on how many functions into which the network is compiled and whether hashing is used at the right memory nodes. Though no analysis has been done, the large size of the code probably adversely affects its execution speed, since there is only a limited amount of physical memory available for the network code, LISP system functions, and the storage of the working memory elements and tokens. Because compiling the network only yields a small additional speedup and has many disadvantages, including large code size and long compiling time, this method of speeding up the Rete algorithm was not pursued further.
17. Related Work

A variety of work has been done in the area of speeding up production systems. One frequent approach has been to index certain features or combinations of features of conditions to allow quick determination of the productions that may be matched by the contents of working memory. Such indexing methods [12] have differed in the amount of indexing done and the amount of information stored between production cycles. Forgy [2] summarizes several of these indexing schemes. The Rete algorithm may be viewed as one variation in which conditions and productions are completely indexed by a discrimination network that determines exactly the productions that are instantiated by working memory.

Work specifically on speeding up the Rete algorithm has proceeded both in the areas of software and hardware. Forgy [2] proposed several ways of speeding up the Rete network, including using hash tables at the memory nodes to speed up removes and using "binary search" nodes or "index" nodes to eliminate redundant execution of mutually exclusive tests. A reimplementation in BLISS of the original LISP OPS5 resulted in a six-fold speedup [6]. The culmination of the attempts at speeding up the Rete algorithm in software was OPS83 [5], a system in which the network is compiled into machine code. OPS83 is four times faster than BLISS OPS5 [6].
Forgy [2] also proposed a machine architecture that would be capable of interpreting the Rete network at high speed. Most of the other work on hardware support for the Rete algorithm has been in the area of parallel processors. Gupta [7] provides a comprehensive summary of the expected gains from executing parts of the matching process in parallel. He concludes that the maximum speedup for any of the proposed uses of parallelism in the Rete algorithm is not likely to exceed twenty-fold. He proposes the use of hash tables at memory nodes to make processing time of and-node activations less variable, thus making parallel processing of and-node operations more advantageous. The DADO group at Columbia has also investigated various parallel production system matching algorithms for use on the highly-parallel DADO machine, including several parallel versions of the Rete algorithm [18]. Miranker [13] proposes a variant of the Rete algorithm, called TREAT, for use on DADO. The TREAT algorithm builds a network of alpha branches terminated by alpha memories, as in the Rete algorithm, but eliminates the and-nodes and beta memories. No information on instantiations of more than one condition is stored between cycles. Hence, the TREAT algorithm must, on each cycle, recompute the and-node tests (what Miranker calls a join reduction) between all of the alpha memories of any production that may be affected by the changes to working memory on the previous cycle. However, because there is no fixed order in which the alpha memories must be joined, the join reductions may be dynamically ordered according to the size of each of the alpha memories. Also, the join reductions for a particular production can be stopped whenever any of the join reductions yields zero instantiations, since then there are no possible instantiations of the whole production. One special case of this is that no computation need be done if any of the alpha memories has no tokens in it. Miranker developed the TREAT algorithm to run on a parallel machine (in which many of the join reductions could be done in parallel), but claims on the basis of preliminary tests that the TREAT algorithm runs at least as fast as the Rete algorithm on a uniprocessor.
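A rough sketch of the TREAT join loop, under the assumption of simple list-valued alpha memories and a join-reduction helper, might look like this; it illustrates the idea only and is not Miranker's implementation.

    ;; Sketch: TREAT-style dynamic join ordering with early termination.
    (defun treat-match (alpha-memories)
      (let ((ordered (sort (copy-list alpha-memories) #'<
                           :key (lambda (mem) (length (node-tokens mem))))))
        (if (some (lambda (mem) (null (node-tokens mem))) ordered)
            nil   ; an empty alpha memory means the production cannot match
            (let ((partial (node-tokens (first ordered))))
              (dolist (mem (rest ordered) partial)
                (setf partial (join-reduction partial (node-tokens mem)))
                (when (null partial)
                  (return nil)))))))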
18. Summary of Results

In this section we summarize the results we have obtained in attempting to speed up the Rete algorithm as it is used by SOAR. First, we discuss the applicability of each of the changes described to general OPS5 tasks. Then we briefly summarize the results obtained for SOAR tasks for each of the changes.
18.1. Results for OPS5

The applicability to OPS5 tasks of the changes and additions to the Rete algorithm described in the preceding sections can be summarized as follows:

• Condition ordering is not as vital for OPS5 tasks as it is for SOAR tasks. The main reason for this is that typical OPS5 conditions have more fields and specify more intra-condition tests than do SOAR conditions, and so tend to have fewer instantiations. Thus, the order in which conditions are joined does not matter as much as it does in SOAR. Similarly, it does not matter as much for OPS5 conditions whether the conditions are joined by a linear or non-linear structure. However, the cross-product effect can still occur in OPS5 productions if the conditions are ordered badly. Some of the general connectivity principles of the SOAR reorderer could probably be adapted for use in an OPS5 condition ordering algorithm.

• OPS5 rules can create new rules during a run by executing the OPS5 "build" action with a form representing the new production as an argument. Hence, the modification to the network compiler to allow new productions to be integrated into the existing network in a way that they will be matched properly is useful for general OPS5 tasks.

• Indexing can be used as described in Section 13 to speed up the class-name tests of the alpha branches. The network compiler could probably be modified to notice other sets of mutually exclusive tests on the alpha branches and use indexing to speed these up as well.

• As was explained in Section 14, there are a number of problems with using hash tables at the memory nodes for general OPS5 tasks that are lessened for SOAR tasks. Because the implementation described depended on several properties of SOAR tasks, it is not clear whether the use of hash tables would be advantageous for OPS5 tasks.

• The three ways described in Section 15 for speeding up removal of tokens from the network do not depend on any property of SOAR tasks. In particular, it is likely that a speedup would be obtained for OPS5 tasks by the use of the third method described in that section.

• Finally, the results on eliminating the interpretive overhead of the Rete algorithm are applicable to OPS5 tasks. That is, as with SOAR tasks, it is probably best to attempt to reduce the interpretive overhead through ways other than compiling the network.

18.2. Results for SOAR

The results for SOAR tasks of each of the changes investigated can be summarized very briefly:

• The ordering of conditions in SOAR productions has a great effect on the efficiency of the match. An effective algorithm for ordering conditions can be built, which uses little knowledge of the domain or the intended meaning of the productions.

• While the use of non-linear Rete networks can sometimes positively affect the efficiency of the match, it is not clear whether they are useful in general for matching SOAR productions.

• It is likely that any method of adding tests to the alpha branches that reduces the amount of testing done at the and-nodes will speed up the matching process.
• Modifying the network compiler so that chunks are integrated into the existing Rete network improves the efficiency of SOAR learning tasks.

• Indexing techniques are quite useful in reducing the number of tests executed in the alpha branches of Rete networks for SOAR tasks.

• Hash tables can be useful at right memory nodes for speeding up and-node operations.

• The removal of tokens from the network can be sped up by the modification to the Rete algorithm described in Section 15.

• It seems best to try to reduce the interpretive overhead of the Rete algorithm through ways other than compiling the network into executable code.

Table 18-1 summarizes the timing results for some of the changes described. As indicated in Section 7, all the timing results are for a particular run of the SOAR eight puzzle task. ADD time and REMOVE time refer to the matching time spent adding wme's to the network and removing wme's from the network, respectively. SOAR time refers to the total time for the run, including the matching time and the time used by the SOAR architecture. SOAR consistently uses about 25 seconds of the run, regardless of the matching time. The fifth column indicates the percent of the original matching time that the matching time of each succeeding version is.

The modifications to the Rete algorithm are referenced in the table by the following keywords. INDEX refers to the changes described in Section 13 that use indexing to reduce the number of alpha tests executed. SPEED-INTERP refers to changes described in Section 16.1 for reducing the overhead of interpreting the network. LEFT-HASH, RIGHT-HASH, and ALL-HASH refer to the modifications to the Rete algorithm in which hash tables are used at left memories, right memories, and all memories, respectively. SUBHASH refers to the change described in Section 14.3 in which tokens are grouped by hash key within the buckets of the hash table (hashing at right memories only). QUICK-REMOVE refers to the third way described in Section 15 for speeding up removes. COMPILE-NET and HCOMPILE-NET refer to the compilation of the network to executable code, without and with hash tables at the right memories, respectively. NE-UNDEC refers to the change described in Section 11 of inserting "<> undecided" tests in some goal conditions.

Line (11) of the table indicates that the cumulative effect of the INDEX, SPEED-INTERP, RIGHT-HASH, SUBHASH, QUICK-REMOVE, and NE-UNDEC changes is to reduce the matching time for the eight puzzle to a quarter of its original value. Similar speedups for the combination of these changes and for the changes individually have been obtained for a number of SOAR tasks. Additionally, all learning tasks are sped up by the modification described in Section 12 that allows chunks to be integrated into the network during a run. The four-fold speedup shown in the table is especially significant in that the final result is that the matching time for the run is about fifty percent of the total SOAR time. Hence, the matching process no longer dominates the remaining processing of the SOAR architecture.
    version                     SOAR time   ADD time   REMOVE time   MATCH time   % of (1)
    (1):  SOAR 4.0                131.0       55.11        52.01        107.12       100
    (2):  (1) + INDEX             120.0       49.09        46.28         95.37        89
    (3):  (2) + SPEED-INTERP       97.8       37.51        36.10         73.61        69
    (4):  (3) + COMPILE-NET        88.0       32.14        30.96         63.10        59
    (5):  (3) + LEFT-HASH         105.0       42.78        38.77         81.55        76
    (6):  (3) + RIGHT-HASH         78.9       28.17        26.48         54.65        51
    (7):  (3) + ALL-HASH           88.5       33.07        29.68         62.75        59
    (8):  (6) + SUBHASH            69.5       23.42        22.42         45.84        43
    (9):  (8) + HCOMPILE-NET       62.8       19.83        19.08         38.91        36
    (10): (8) + QUICK-REMOVE       56.8       23.09        10.04         33.13        31
    (11): (10) + NE-UNDEC          50.3       19.25         8.21         27.46        26

        Table 18-1: Timing Results for Some of the Modifications
Acknowledgements

I would like to thank Harold Brown, John Laird, and particularly Paul Rosenbloom for comments on drafts of this paper. Thanks also to John Laird and Paul Rosenbloom for their guidance and advice throughout the period that I was doing the work described in this paper. I was supported by a National Science Foundation Fellowship while doing the work reported in this paper. Computer facilities were partially provided by NIH grant RR-00785 to Sumex-Aim.
Appendix A. Network Statistics

The six tasks for which statistics are given in the tables below are:

• Eight puzzle (EIGHT)
• Eight puzzle with learning on (EIGHTL)
• Missionaries and cannibals (MC) - SOAR task that attempts to solve the Missionaries and Cannibals problem
• Neomycin-Soar (NEOM) - an implementation of part of Neomycin [1] in SOAR
• R1-Soar [15] (R1) - an implementation of part of R1 [11] in SOAR
• R1-Soar with learning on (R1L)

All of the tasks run not only with the task productions, but with the 51 default productions mentioned in Section 7.

    task                                      EIGHT   EIGHTL      MC    NEOM      R1     R1L
    # of productions                             11       26      15     196     262     300
    # of default productions                     51       51      51      51      51      51
    # of nodes without sharing in network      3335     5521    3501   12115   16322   23556
    # of nodes with sharing in network         1048     1591    1088    3578    4628    7149
    sharing factor                              3.2      3.5     3.2     3.4     3.5     3.3
    % of alpha nodes                           10.6      7.0    11.1     8.8     8.4     6.1
    % of alpha memories                         9.0      6.0     9.4     7.9     7.6     5.5
    avg. # of outputs of alpha memories         4.7      7.5     4.4     5.5     5.7     8.3
    % of beta memories                         34.2     38.6    33.4    34.1    34.7    38.9
    avg. # of outputs of beta memories          1.1      1.1     1.1     1.2     1.1     1.1
    % of and-nodes                             39.5     43.1    39.0    40.5    40.5    42.9
    avg. # of and-tests at and-node             1.2      1.6     1.3     1.1     1.2     1.2
    avg. # of outputs of and-nodes              1.0      1.0     1.0     1.0     1.0     1.0
    % of not-nodes                              0.9      0.6     1.1     1.8     2.1     1.7
    avg. # of not-tests at not-node             1.7      1.7     1.7     1.3     2.0     1.8
    avg. # of outputs of not-nodes              1.0      1.0     1.0     1.0     1.0     1.0
    % of production nodes                       5.9      4.8     6.1     6.9     6.8     4.9

        Table A-1: Statistics on Composition of Rete Network
In the statistics on removes in Table A-2 below, "avg number of tokens examined" refers to the average number of tokens that must be examined to find the token to be removed. "And-left calls" and "and-right calls" refer to the cases when a token enters the left input and right input of an and-node, respectively. "Null-memory" calls refer to cases when a token enters an and-node and the opposite memory is empty. The third, fourth, and fifth statistics given for and-left and and-right calls are averaged over only non-null-memory calls.
Adds
to wo r k i n g memo ry
t o wo r k i n g memo ry f r om wo r k i n g memo ry
R e mo v e s
A l pha node
N E OM
Rl
1572 1015 557
8837 46 5 1 4 1 86
3667 2271 1 39 6
4 1 24 2243 1881
608 1 1 12 . 7
30304 12 . 6
185609 11 . 4
147887 8.8
% succes s f u l
2486 15 . 6
1406 14 . 6
6 1 89 23 . 7
544 1 31 . 7
4955 24 . 9
546 1 27.8
1941 15 . 2 1. 3
835 9.5 1.2
5655 25 . 3 1.6
3449 28 . 6 3.4
4232 27.8 4.2
3534 25 . 6 3.6
cal l s
adds
n um b e r o f
R e mo v e s
1 89 8 1 5.6
52230 81.7
29672 5. 1
7 740 3.8
10056 5.7
10555 5.7
18769 6.6 4. 1
46280 79 . 0 44 . 5
29538 6.1 1.8
7 102 4.7 2.4
9558 6.8 2.2
9418 6.9 2.5
3.5 38836 8.8 1.0
0 .9 1 0 0 0 46 21.9 0.9
3.8 65835 12 . 7 0.9
39 . 6 1 79 7 0 6.8 0.9
21.8 2 2 944 4.9 0.8
21 .2 22717 5. 1 0.8
1.0
1.1
1.1
1. 1
1.1
1.2
24.6 9.3 3 6 1 6 1 3 2 45 8 2 3 8
13 . 9 9 1 1896
8.4 1 5 0 83 6
73 . 3 8 1 89 1 . 70 0.3
65 . 6 8433 12 . 1 0.6
46 . 4 36753 3.7 0.2
84 . 6 13691 2.7 0.3
79 . 4 1 7 386 2.5 0.2
8 1 .4 15767 2.5 0.2
1.0
1.0
1.0
1.0
1.0
1.0
1.4 1 1632
11.8 99086
3.6 133992
2.5 34267
2.3 3 99 2 7
2.3 36083
2 5 1 74 1 2 4 7 1 7 8 4.6 5.0
i n memo ry
toke n s
f o r a l p h a memo r i e s :
Numbe r of
remo v e s
Avg
n um b e r o f
toke n s
i n memo ry
Avg
n u mb e r o f
tokens
e x am i n e d
Adds
3667 2266 1401
f o r a l p h a memo r i e s :
N u mb e r of Avg
E IGHTL
cal l s :
N um b e r o f e a 1 1 s
Adds
Rll
MC
3117 1775 1342
E I GHT changes
Total
f o r b e t a memo r i e s
N u mb e r of
adds
A v g numbe r o f R e mo v e s
toke n s
i n memo ry
f o r b e t a memo r i e s :
N u mb e r of
r e mo v e s
Avg
n um b e r o f
toke n s
i n memo ry
Avg
n um b e r o f
toke n s
e x am i n e d
And- l eft cal l s
% o f n u l l - memo ry c a l l s # of
n o n - n u l l -memo ry c a l l s
avg
l ength
avg
# of
in avg
o f o p p o s i t e memo ry
cons i s tent
toke n s
o p p o s i te memo ry # of and-tests
executed
f o r non-con s i s t e n t toke n s a v g total total
# of t e s t s executed
# of
tests executed
6.2 6.5 142 656 147628
And- r i g h t cal l s
% o f n u l l -memo ry c a l l s # of avg
n o n - n u l l -memo ry c a l l s l e n g t h o f o p p o s i t e memo ry
avg # o f c on s i s te n t t o k e n s in
o p p o s i te memo ry
avg # o f
and-tests
executed
f o r non- con s i s t e n t toke n s a v g total total
# of
# of
tests
executed
tests executed
Table A-2: Statistics on Node Operations
References

1. Clancey, W. J. & Letsinger, R. NEOMYCIN: Reconfiguring a Rule-based Expert System for Application to Teaching. In Readings in Medical Artificial Intelligence: The First Decade, Clancey, W. J. and Shortliffe, E. H., Eds., Addison-Wesley, Reading, 1984, pp. 361-381.
2. Forgy, C. L. On the Efficient Implementation of Production Systems. Ph.D. Thesis, Computer Science Department, Carnegie-Mellon University, 1979.
3. Forgy, C. L. OPS5 User's Manual. Computer Science Department, Carnegie-Mellon University, 1981.
4. Forgy, C. L. "Rete: A Fast Algorithm for the Many Pattern/Many Object Pattern Match Problem". Artificial Intelligence 19 (September 1982), pp. 17-37.
5. Forgy, C. L. The OPS83 Report. CMU-CS-84-133, Computer Science Department, Carnegie-Mellon University, 1984.
6. Forgy, C. L., Gupta, A., Newell, A., & Wedig, R. Initial Assessment of Architectures for Production Systems. National Conference for Artificial Intelligence, AAAI, Austin, 1984.
7. Gupta, A. Parallelism in Production Systems. Ph.D. Thesis, Computer Science Department, Carnegie-Mellon University, 1986.
8. Laird, J. E. Soar User's Manual. Xerox Palo Alto Research Center, 1986.
9. Laird, J. E., Newell, A., & Rosenbloom, P. S. Soar: An Architecture for General Intelligence. In preparation.
10. Laird, J. E., Rosenbloom, P. S., & Newell, A. "Chunking in Soar: The Anatomy of a General Learning Mechanism". Machine Learning 1 (1986).
11. McDermott, J. "R1: A Rule-based Configurer of Computer Systems". Artificial Intelligence 19 (September 1982).
12. McDermott, J., Newell, A., & Moore, J. The Efficiency of Certain Production System Implementations. In Pattern-Directed Inference Systems, D. A. Waterman & F. Hayes-Roth, Eds., Academic Press, 1978.
13. Miranker, D. P. Performance Estimates for the DADO Machine: A Comparison of Treat and Rete. Fifth Generation Computer Systems, ICOT, Tokyo, 1984.
14. Newell, A. Production Systems: Models of Control Structures. In Visual Information Processing, W. Chase, Ed., Academic Press, 1973.
15. Rosenbloom, P. S., Laird, J. E., McDermott, J., & Orciuch, E. "R1-Soar: An Experiment in Knowledge-intensive Programming in a Problem-solving Architecture". IEEE Transactions on Pattern Analysis and Machine Intelligence 7, 5 (1985), pp. 561-569.
16. Rychener, M. D. Production Systems as a Programming Language for Artificial Intelligence Applications. Ph.D. Thesis, Computer Science Department, Carnegie-Mellon University, 1976.
17. Smith, D. E., & Genesereth, M. R. "Ordering Conjunctive Queries". Artificial Intelligence 26 (1985), pp. 171-215.
18. Stolfo, S. J. Five Parallel Algorithms for Production System Execution on the DADO Machine. National Conference for Artificial Intelligence, AAAI, Austin, 1984.
The Soar Papers 1987
CHAPTER 14
Learning General Search Control from Outside Guidance
A. R. Golding, Stanford University, P. S. Rosenbloom, Stanford University, and J. E. Laird, University of Michigan
Abstract
The system presented here shows how Soar, an architecture for general problem solving and learning, can acquire general search-control knowledge from outside guidance. The guidance can be either direct advice about what the system should do, or a problem that illustrates a relevant idea. The system makes use of the guidance by first formulating an appropriate goal for itself. In the process of achieving this goal, it learns general search-control chunks. In the case of learning from direct advice, the goal is to verify that the advice is correct. The verification allows the system to obtain general conditions of applicability of the advice, and to protect itself from erroneous advice. The system learns from illustrative problems by setting the goal of solving the problem provided. It can then transfer the lessons it learns along the way to its original problem. This transfer constitutes a rudimentary form of analogy.

I. Introduction
Chunking in Soar has been proposed as a general learning mechanism [Laird et al., 1986]. In previous work, it has been shown to learn search control, operator implementations, macro-operators, and other kinds of knowledge, in tasks ranging from search-based puzzles to expert systems. Up to now, though, chunking has not been shown to acquire knowledge from the outside world. The only time Soar learns anything from outside is when the user first defines a task; but this is currently done by typing in a set of productions, not by chunking. The objective of the research reported here is to show that chunking can in fact learn from interactions with the outside world that take place during problem solving.

Two particular styles of interaction are investigated in the present work. Both come into play when Soar has to choose among several courses of action. In the first, the advisor tells Soar directly which alternative to select. In the second, the advisor supplies a problem within Soar's grasp that illustrates what to do. Both styles of interaction teach Soar search-control knowledge.

In the next section, the basics of the Soar architecture are laid out. A Soar system that implements the two styles of interaction mentioned above is then described. Following that is a discussion of related work and directions for future research. The final section summarizes the contributions of this work.

* This research was sponsored by the Defense Advanced Research Projects Agency (DOD) under contract N00039-86-C-0133, by the Sloan Foundation, and by a Bell Laboratories graduate fellowship to Andrew Golding. Computer facilities were partially provided by NIH grant RR-00785 to Sumex-Aim. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency, the US Government, the Sloan Foundation, Bell Laboratories, or the National Institute of Health. The authors are grateful to Allen Newell and Haym Hirsh for comments on earlier drafts of this paper.

II. The Soar Architecture
Soar [Laird et al., 1987] is an architecture for general cognition. In Soar, all goal-oriented behavior is cast as search in a problem space. Search normally proceeds by selecting an operator from the problem space and applying it to the current state to produce a new state. The search terminates if a state is reached that satisfies the goal test. All elements of the search task - operators, problem spaces, goal tests, etc. - are implemented by productions.

Various difficulties can arise in the course of problem solving. An operator-tie impasse results when Soar is unable to decide which operator to apply to the current state. An operator no-change impasse occurs when Soar has selected an operator, but does not know how to apply it to the state. When Soar encounters any kind of impasse, a subgoal is generated automatically to resolve it. This subgoal, like the original goal, is solved by search in a problem space - thus Soar can bring its full problem-solving capabilities to bear on the task of resolving the impasse.

Chunking is the learning mechanism of Soar. It summarizes the processing of a subgoal in a chunk. The chunk is a production whose conditions are the inputs of the subgoal, and whose actions are the final results of the subgoal. Intuitively, the inputs of a subgoal are those features of the pre-subgoal situation upon which the results depend. Because certain features are omitted from the chunk - namely those that the results do not depend on - the chunk attains an appropriate degree of generality. Once a chunk is learned for a subgoal, Soar can apply it in relevant situations in the future. This saves the effort of going into another subgoal to rederive the result.

III. Design of the System
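A rough sketch of the chunking step, in Python and with invented data structures (Soar's actual implementation builds productions, not Python closures), may help: the conditions of the chunk are the pre-subgoal working-memory elements the subgoal actually examined, and its actions are the subgoal's results.

    # Schematic illustration of chunk creation: conditions come from the
    # pre-subgoal elements that the subgoal's processing examined, actions
    # from the results the subgoal returned.
    def build_chunk(pre_subgoal_wm, examined, results):
        conditions = [wme for wme in pre_subgoal_wm if wme in examined]
        actions = list(results)
        def chunk(working_memory):
            # Fires whenever all conditions hold, delivering the cached results
            # directly, so no impasse (and no subgoal) arises the second time.
            if all(cond in working_memory for cond in conditions):
                return actions
            return []
        return chunk

    # Hypothetical use: the subgoal looked at two features of the situation
    # and produced one search-control result; the third feature is dropped,
    # which is what gives the chunk its generality.
    before = {("feature", "a"), ("feature", "b"), ("feature", "c")}
    rule = build_chunk(before,
                       examined={("feature", "a"), ("feature", "b")},
                       results=[("prefer", "operator-1")])
    print(rule({("feature", "a"), ("feature", "b"), ("feature", "z")}))  # [('prefer', 'operator-1')]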
In this section, a Soar system that learns search control from outside guidance is presented. The performance task of the system is described first, and then the strategy for learning from interactions with the outside world.
A. The Performance Task

The system works in the domain of algebraic equations containing a single occurrence of an unknown, such as

    (a + x)/(b + c) = d + e.    (1)

The unknown here is x, and the solution is

    x = (d + e) · (b + c) - a.    (2)

To derive this solution, the system goes into its equations problem space and applies an appropriate sequence of transformations to the equation. Transformations are effected by isolate and commute operators. Together, these two types form a minimal sufficient set for solving any problem in the task domain of the system.

Isolate operators transfer top-level terms from one side of the equation to the other. To illustrate, two isolate operators are applicable to equation 1. One multiplies the equation by the term (b + c), yielding

    a + x = (d + e) · (b + c);    (3)

the other subtracts e from both sides and produces

    (a + x)/(b + c) - e = d.    (4)

The reason the latter subtracts e, not d, is that isolate operators always transfer right-hand arguments. Isolate operators can thus be characterized by two parameters: the operation they perform, and the side of the equation the transferred term starts out on. There are five possible operations: add, subtract, multiply, divide, and unary-minus; the side of the equation is either left or right. The operators above are isolate(multiply, left) and isolate(subtract, right), respectively.

There are just two commute operators, one for each side of the equation. They swap the arguments of the top-level operation on their side, provided the operation is abelian. Commute(right) applies to equation 1, giving

    (a + x)/(b + c) = e + d.    (5)

B. The Learning Strategy

The system's strategy for learning is to ask for guidance when needed, and to translate that guidance into a form that can be used directly in solving equations. The general procedure for doing this is to formulate an appropriate goal and then achieve it, thereby learning chunks that influence performance on the original equations task. The instantiation of the general procedure depends on the form of guidance provided. The most direct form is where the advisor tells the system which of its operators to apply. Alternatively, the advisor could suggest an easier problem whose solution illustrates what to do in the original problem. Less direct still, the advisor could refer the system to a textbook, or offer other equally hostile assistance. The current system accepts both of the simpler forms of help mentioned above - direct advice and illustrative problems. In the following sections, the general procedure is instantiated and illustrated for these forms of help.

1. Learning from Direct Advice

In the course of solving its equations, the system is likely to run into situations where it does not know which operator to apply next. This condition is signalled in Soar as an operator-tie impasse in the equations space. Soar's default behavior is to go into a selection problem space and proceed to evaluate each operator involved in the tie in an arbitrary order, until a correct operator is found.

At this point, there is an opportunity for the system to benefit from outside guidance. Rather than evaluate operators randomly, the system enters an advise problem space, where it displays all of the operators, and asks the advisor to pick one. The advisor's choice is evaluated first, in the hopes that it will be correct, allowing the evaluation process to be cut short. Normally, the system will be unable to evaluate the advisor's choice by inspection; thus it sets up a subgoal to do the evaluation. In the subgoal, it goes into another equations space and applies the advisor's choice to the current equation. If this leads to a solution, the operator is accepted as correct.

Phrased in terms of the general procedure given above, the goal that the system sets for itself is to evaluate the correctness of the advisor's guidance - in this case, the system resembles a traditional learning apprentice; this point is taken up in section IV.A. A more trusting system would simply have applied the suggested operator to its equation. However, by verifying that the recommended operator leads to a solution, the system paves the way for the learning of general conditions of applicability of the operator. This is done by the chunking mechanism, which retains those features of the original equation that were needed in the verification, and discards the rest. Moreover, if the system is given a wrong** operator, its verification will fail, and thus it will gracefully request an alternative suggestion. It even picks up valuable information from the failed verification by analyzing what went wrong; the chunks learned from this analysis give general conditions for when not to apply the operator.

2. Example of Direct Advice

Following is a description of how the system solves

    a · b = -c - x    (6)

together with direct advice from outside. A graphic depiction of the problem solving appears in Figure 1.

Since the system has no prior search-control knowledge, it cannot decide which operator to apply to the initial equation. It asks for a recommendation, and the advisor gives it isolate(add, right). The system sets up a subgoal to evaluate the correctness of this operator. In the subgoal, it tries out the operator on its equation, yielding

    a · b + x = -c.    (7)

Here the evaluation runs into a snag, as the system again cannot decide on an operator. It asks for help, and is told to apply commute(left). Accordingly, it sets a subgoal within its current evaluation subgoal to verify this advice. The first equation generated in the new subgoal is

    x + a · b = -c.    (8)

** It turns out that in the current domain, the only wrong operators are those that are inapplicable to the equation.
The final bit of guidance the system needs is that it should apply the isolate(subtract, left) operator. It goes down into a third nested subgoal to verify this, and obtains

    x = -c - a · b.    (9)
Having reached a solution, the system is satisfied that isolate(subtract, left) is correct, and Soar learns a chunk that summarizes the result. The chunk states that if the left-hand side of the equation is a binary operation with the unknown as its left argument, then the best operator to apply is the one that undoes that binary operation.

The verification of the preceding operator, commute(left), now goes through as well. The chunk for this subgoal pertains to equations with an abelian binary operation on the left-hand side, whose right argument is the unknown. It says to apply the commute(left) operator.

Finally, the first evaluation subgoal terminates, and a chunk is learned for it. This chunk requires that the equation have on its right-hand side a binary operation whose inverse is commutative, and whose right argument is the unknown. It asserts that the best operator to apply is the one that undoes the binary operation.

Having done the necessary evaluations, the system can go ahead and solve the equation. It no longer has to ask for advice, because its chunks tell it which operators to apply. These chunks may also prove useful in other problems, as demonstrated in the next two sections.
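Restated as code, the three chunks just described amount to structural tests on the current equation. The sketch below is illustrative only: the tuple encoding of equations and the function name are assumptions made for this example, not the system's actual representation.

    # The three search-control chunks from the run above, written as a single
    # selection function over a toy equation encoding: an equation is
    # ("=", lhs, rhs), a binary term is (op, left, right), and "x" is the unknown.
    INVERSE = {"+": "subtract", "-": "add", "*": "divide", "/": "multiply"}
    ABELIAN = {"+", "*"}

    def preferred_operator(equation):
        _, lhs, rhs = equation
        # Chunk 1: binary operation on the left with the unknown as its left
        # argument -> isolate with the operation that undoes it, from the left.
        if isinstance(lhs, tuple) and lhs[0] in INVERSE and lhs[1] == "x":
            return "isolate(%s, left)" % INVERSE[lhs[0]]
        # Chunk 2: abelian binary operation on the left with the unknown as its
        # right argument -> commute the left-hand side.
        if isinstance(lhs, tuple) and lhs[0] in ABELIAN and lhs[2] == "x":
            return "commute(left)"
        # Chunk 3: binary operation on the right whose inverse is commutative
        # and whose right argument is the unknown -> isolate with its inverse.
        if isinstance(rhs, tuple) and rhs[0] in INVERSE and rhs[2] == "x" \
                and INVERSE[rhs[0]] in ("add", "multiply"):
            return "isolate(%s, right)" % INVERSE[rhs[0]]
        return None

    # Equation (6), a*b = -c - x: chunk 3 applies and suggests isolate(add, right).
    print(preferred_operator(("=", ("*", "a", "b"), ("-", ("neg", "c"), "x"))))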
3. Learning from Illustrative Problems
Learning from an illustrative problem takes place in the same context as learning from direct advice - namely, when Soar is about to evaluate the operators involved in a tie. But now, instead of going into an advise problem space, the system enters another equations space. This instance of the space is for solving the illustrative problem. The initial state of this space does not contain an equation. The system detects this, and goes into a parse problem space, where it asks the advisor for an illustrative equation. It parses the equation into its tree-structured internal representation, and attaches it to the initial state; now it is ready to attempt a solution.

To instantiate the general procedure presented earlier, the goal the system sets for itself this time is to solve the illustrative example. There are several ways for it to do so - the current system can either follow direct advice, as described above, or do an exhaustive search. In the process of solving the illustrative problem, chunks will be learned that summarize each subgoal. Then, if the subgoals of the illustrative problem are sufficiently similar to those of the original problem, the chunks should apply directly, resolving the original operator-tie impasse. The learning strategy is thus to apply the lessons of one problem to another. This can be viewed as a rudimentary kind of analogical transfer, as discussed in section IV.B.

4. Example of an Illustrative Problem

In this run, the system is again asked to solve

    a · b = -c - x.    (10)
Figure 2 gives a pictorial representation of the run. As before, the system hits an operator-tie impasse at the first step, but this time the advisor helps by supplying

    r = a/y    (11)
as an illustrative problem. This problem is simpler than the original one, as it has no extraneous operations, such as a multiplication or unary minus, to distract the system. An exhaustive breadth-first search would expand 28 nodes in solving the original equation, but only 7 nodes here.

The system proceeds to solve equation 11 by brute-force search. The details are suppressed here, but the outcome is that it finds the sequence of operators isolate(multiply, right), commute(left), and isolate(divide, left). Chunks are learned for each step of this solution. These chunks are in fact identical to the chunks learned in section III.B.2; this is because equations 10 and 11 are identical in all relevant aspects. It follows that the chunks can be applied directly to solve the original problem.
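For concreteness, the two operator types used in both runs can be sketched as tree transformations. This is an illustrative Python sketch under the assumption of a simple tuple encoding of equations; it is not the system's actual operator implementation, and unary-minus is omitted.

    # Illustrative sketch of the two operator types on equations encoded as
    # ("=", lhs, rhs), with binary terms (op, left, right).
    INVERSE = {"add": "subtract", "subtract": "add", "multiply": "divide", "divide": "multiply"}
    SYMBOL = {"add": "+", "subtract": "-", "multiply": "*", "divide": "/"}
    ABELIAN = {"+", "*"}

    def isolate(equation, operation, side):
        """Transfer the right-hand argument of the top-level term on `side`
        to the other side, by applying `operation` to both sides."""
        _, lhs, rhs = equation
        src, dst = (lhs, rhs) if side == "left" else (rhs, lhs)
        op, left, right = src                     # the term being broken apart
        assert SYMBOL[INVERSE[operation]] == op   # operation must undo the term's operator
        new_dst = (SYMBOL[operation], dst, right)
        return ("=", left, new_dst) if side == "left" else ("=", new_dst, left)

    def commute(equation, side):
        """Swap the arguments of the top-level operation on `side` (abelian only)."""
        _, lhs, rhs = equation
        op, left, right = lhs if side == "left" else rhs
        assert op in ABELIAN
        swapped = (op, right, left)
        return ("=", swapped, rhs) if side == "left" else ("=", lhs, swapped)

    # Equation (1): (a + x)/(b + c) = d + e
    eq1 = ("=", ("/", ("+", "a", "x"), ("+", "b", "c")), ("+", "d", "e"))
    print(isolate(eq1, "multiply", "left"))   # a + x = (d + e) * (b + c)  -- equation (3)
    print(commute(eq1, "right"))              # (a + x)/(b + c) = e + d    -- equation (5)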
Figure 1: Subgoal structure for learning from direct advice. The levels of the diagram correspond to subgoals. Non-horizontal arrows mark the entry and exit from subgoals. Within a subgoal, a box represents a state, and a horizontal arrow stands for the application of an operator. Ovals denote operator-tie impasses. Operator no-change impasses appear as two circles separated by a gap. A wavy line symbolizes arbitrary processing.
Figure 2: Subgoal structure for learning from an illustrative problem.
IV. Discussion
The system described here represents a first step toward the construction of an agent that is able to improve its behavior merely by taking in advice from the outside. Soar appears ideally suited as a research vehicle to this end, as it provides general capabilities for problem solving and learning that were not available in previous research efforts [McCarthy, 1968; Rychener, 1983]. The relatively straightforward implementation of direct advice and illustrative problems shows that the advice-taking paradigm fits naturally into Soar. Below, these methods of taking advice are compared with related work, and extensions to the straightforward implementations are proposed.
A. Direct Advice
The system is similar to a learning apprentice [Mitchell et al., 1985] in its treatment of direct advice; instead of accepting it blindly, it first explains to itself why it is correct. Nevertheless, the system cannot accurately be called a learning apprentice, as it actively seeks advice, as opposed to passively monitoring the user's behavior. In fact, the learning-apprentice style of interaction could be considered a special case of advice-taking in which the guidance consists of a protocol of the user's problem-solving.

The limitation of direct advice is that it forces the advisor to name a particular operator; it would be desirable to allow higher-level specifications of what to do. To take the canonical example of the game of Hearts, the advisor might want to tell the system to play a card that avoids taking points, instead of spelling out exactly which card to play. To accept such indirect advice, the system would have to reduce it to a directly usable form [Mostow, 1983].
B. Illustrative Problems
The system processes an illustrative problem by applying the chunks it learns from that problem to the original one. Since it is solving the two problems in serial order, it may seem that this approach amounts to just working through a graded sequence of exercises. There are two reasons that it does not, however. First, the teacher can observe how the student fails, and take this into account in choosing a suitable illustrative problem. Second, the system is solving
the illustrative problem in service of the original one; thus it can abandon the illustrative problem as soon as it learns enough to resolve the original impasse.

A more apt way to view the system's processing of illustrative problems is as a type of analogical transfer from the illustrative to the original problem. The trouble with this type of analogy, though, is that the generalizations are based solely on the source problem, without regard for how they will apply to the target. A more effective approach would be to establish a mapping between the two problems explicitly. This forces the system to attend to commonalities between the problems, which would then be captured in its generalizations. This is in fact just the way generalizations are constructed in GRAPES [Anderson, 1986].
V. Conclusion
The system presented here shows how Soar can acquire general search-control knowledge from outside guidance. The guidance can be either direct advice about what the system should do, or a problem that illustrates a relevant idea. The system's strategy of verifying direct advice before accepting it illustrates how Soar can extract general lessons, while protecting itself from erroneous advice. This strategy could be extended by permitting the advice to be indirect; the system would then have to operationalize it. In applying the lessons learned from solving an illustrative problem to its original task, the system demonstrates an elementary form of analogical reasoning. This reasoning capability could be greatly improved, however, if the system were to take into consideration the target problem of the analogy as well as the source problem.
References

John R. Anderson. Knowledge compilation: the general learning mechanism. In Machine Learning: An Artificial Intelligence Approach, pages 289-310, Morgan Kaufmann, Los Altos, CA, 1986.
John E. Laird, Allen Newell, and Paul S. Rosenbloom. Soar: an architecture for general intelligence. Artificial Intelligence, 1987. In press.
John E. Laird, Paul S. Rosenbloom, and Allen Newell. Chunking in Soar: the anatomy of a general learning mechanism. Machine Learning, 1, 1986.
John McCarthy. Programs with common sense. In Marvin Minsky, editor, Semantic Information Processing, pages 403-418, MIT Press, Cambridge, MA, 1968.
T. Mitchell, S. Mahadevan, and L. Steinberg. LEAP: a learning apprentice for VLSI design. In Proceedings of IJCAI-85, Los Angeles, 1985.
David Jack Mostow. Machine transformation of advice into a heuristic search procedure. In Machine Learning: An Artificial Intelligence Approach, pages 367-404, Tioga, Palo Alto, CA, 1983.
Michael D. Rychener. The Instructible Production System: a retrospective analysis. In Machine Learning: An Artificial Intelligence Approach, pages 429-460, Tioga, Palo Alto, CA, 1983.
CHAPTER 15
Soar: An Architecture for General Intelligence
J. E. Laird, Xerox Palo Alto Research Center,* A. Newell, Carnegie Mellon University, and P. S. Rosenbloom, Stanford University
ABSTRACT

The ultimate goal of work in cognitive architecture is to provide the foundation for a system capable of general intelligent behavior. That is, the goal is to provide the underlying structure that would enable a system to perform the full range of cognitive tasks, employ the full range of problem solving methods and representations appropriate for the tasks, and learn about all aspects of the tasks and its performance on them. In this article we present SOAR, an implemented proposal for such an architecture. We describe its organizational principles, the system as currently implemented, and demonstrations of its capabilities.
This research was sponsored by the Defense Advanced Research Projects Agency (DOD), ARPA Order No. 3597, monitored by the Air Force Avionics Laboratory under contracts F33615-81-K-1539 and N00039-83-C-0136, and by the Personnel and Training Research Programs, Psychological Sciences Division, Office of Naval Research, under contract number N00014-82C0067, contract authority identification number NR667-477. Additional partial support was provided by the Sloan Foundation and some computing support was supplied by the SUMEX-AIM facility (NIH grant number RR-00785). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency, the Office of Naval Research, the Sloan Foundation, the National Institute of Health, or the US Government.

* Present address: Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109, U.S.A.
Introduction
SOAR is an architecture for a system that is to be capable of general intellig ence . SOAR is to be able to: ( 1 ) work on the full range of tasks, from highly routine to extremely difficult open-ended problems; (2) employ the full range of problem solving methods and representations required for these tasks; and (3) learn about all aspects of the tasks and its performance on them. SOAR has existed since mid-1982 as an experimental software system (in OPS and LISP), initially a s SOARl [31 , 32] , then as SOAR2 [29, 35] , and currently as SOAR4 [30] . SOAR realizes the capabilities of a general intelligence only in part, with significant aspects still missing. But enough has bee.n attained to make worth while an exposition of the current system. SOAR is one of many artificial intelligence (AI) systems that have attempted to provide an appropriate organization for intelligent action. It is to be compared with other organizations that have been put forth, especially recent ones: MRS [22] ; EURISKO [38 , 39] ; blackboard architectures [4, 16, 24, 56] ; PAM/PANDORA [79] and NASL [40] . SOAR is also to be compared with machine learning systems that involve some form of problem solving [10, 15, 37, 45 , 46] . Especially important are existing systems that engage in some significant form of both problem solving and learning, such as: ACT* [2] ; and repair theory [8] , embodied in a system called SIERRA [77] . ACT* and repair theory are both psychological theories of human cognition. SOAR, whose antecedents have played a strong role in cognitive theories, is also intended as the basis for a psychological theory, but this aspect is not yet well developed and is not discussed further. SOAR has its direct roots in a continuous line of research that starts back in 1956 with the "logic theorist" [53] and list processing (the IPLs) [55] . The line goes through GPS [17, 54] , the general theory of human problem solving [51] and the development of production systems, PSG [48] , PSANLS [66] and the OPS series [20, 21] . Its roots include the emergence of the concept of cognitive architecture [48] , the "instructable production system" project [67, 68] and the extension of the concept of problem spaces to routine behavior [ 49] . They also include research on cognitive skill and its acquisition [ 1 1 , 35 , 50, 63] . SOAR is the current culmination of all this work along the dimension of architectures for intelligence. SOAR's behavior has already been studied over a range of tasks and methods (Fig. 1 ) , which sample its intended range, though unsystematically. SOAR has been run on most of the standard AI toy problems [29, 31]. These tasks elicit knowledge-lean , goal-oriented behavior. SOAR has been run on a small number of routine , essentially algorithmic, tasks, such as matching forms to objects, doing elementary syllogisms, and searching for a root of a quadratic equation. SOAR has been run on knowledge-intensive tasks that are typical of current expert systems. The tactic has been to do the same task as an existing AI expert system, using the same knowledge. The main effort has been Rl-SOAR
Small, knowledge-lean tasks (typical AI toy tasks) :
Blocks world, Eight Puzzle, eight queens, labeling line drawings (constraint satisfaction), magic squares, missionaries and cannibals, monkey and bananas, picnic problem, robot location finding, three wizards problem, tic-tac-toe, Tower of Hanoi, water-jug task.
Small routine tasks :
Expression unification, root finding, sequence extrapolation, syllogisms, Wason verification task.

Knowledge-intensive expert-system tasks:

R1-SOAR: 3300 rule industrial expert system (25% coverage),
NEOMYCIN: Revision of MYCIN (initial version), DESIGNER: Designs algorithms (initial version).
Miscellaneous AI tasks:
DYPAR-SOAR: Natural language parsing program (small demo) , Version spaces: Concept formation (small demo) , Resolution theorem prover (small demo).
Multiple weak methods with variations, most used in multiple small tasks :
Generate and test, AND/OR search, hill climbing (simple and steepest-ascent), means-ends analysis, operator subgoaling, hypothesize and match, breadth-first search, depth-first search, heuristic search, best-first search, A*, progressive deepening (simple and modified), B* (progressive deepening), minimax (simple and depth-bounded), alpha-beta, iterative deepening, B*.
Multiple organizations and task representations:
Eight Puzzle, picnic problem, R1-SOAR.
Learning:
Learns on all tasks it performs by a uniform method (chunking): Detailed studies on Eight Puzzle, R1-SOAR, tic-tac-toe, Korf's macro-operators; Types of learning: Improvement with practice, within-task transfer, across-task transfer, strategy acquisition, operator implementation, macro-operators, explanation-based generalization.

Major aspects still missing:
Deliberate planning, automatic task acquisition, creating representations, varieties of learning, recovering from overgeneralization, interaction with external task environment.
FIG. 1. Summary of SOAR performance scope.
[65] , which showed how SOAR would realize a classical expert system, Rl , which configures VAX and PDP-11 computers at Digital Equipment Corporation [3, 41] . Rl is a large system and Rl-SOAR was only carried far enough in its detailed coverage (about 25% of the functionality of R l ) to make clear that it could be extended to full coverage if the effort warranted [75] . In addition, SOAR versions of other substantial systems are operational although not
complete: NEOMYCIN [13] , which itself is a reworking of the classical expert system, MYCIN [71 ] ; and DESIGNER [26] , an AI system for designing al gorithms. SOAR has also been given some tasks that have played important roles in the development of artificial intelligence : natural language parsing, concept learning, and predicate calculus theorem proving. In each case the performance and knowledge of an existing system has been adopted as a target in order to learn as much as possible by comparison : DYPAR [6] , version spaces [44] and resolution [60] . These have so far been small demonstration systems; developing them to full-scale performance has not seemed profitable. A variety of different representations for tasks and methods can be realized within SOAR's architecturally given procedural and declarative representations. Essentially all the familiar weak methods [47] have been realized with SOAR and used on several tasks [31 ] . In larger tasks, such as Rl-SOAR, different weak methods occur in different subparts of the task. Alternative decompositions of a task into subtasks [75] and alternative basic representations of a task have also been explored [31 ] , but not intensively. SOAR has a general mechanism for learning from experience [33 , 36] which applies to any task it performs. Thus, it can improve its performance in all of the tasks listed. Detailed studies of its learning behavior have been done on several tasks of varying characteristics of size and task type (gam·es, puzzles, expert system tasks) . This single learning mechanism produces a range of learning phenomena, such as improvement in related tasks (across-task trans fer) ; improvement even within the learning trial (within-trial transfer) ; and the acquisition of new heuristics, operator implementations and macro-operators. Several basic mechanisms of cognition have not yet been demonstrated with SOAR. Potentially, each such mechanism could force the modification of the architecture, although we expect most of them to be realized without major e:i,::t ension. Some of the most important missing aspects are deliberate planning, as developed in artificial intelligence systems [69] ; the automatic acquisition of new tasks [23] ; the creation of new task representations [ 1 , 27] ; extension to additional types of learning (e .g. , by analysis, instruction, example, reading) ; and the ability to recover from errors in learning (which in SOAR occurs by overgeneralization [34] ) . It is useful to list these lacunae, not just to indicate present limitations on SOAR, but to establish the intended scope of the system. SOAR is to operate throughout the entire spectrum of cognitive tasks. The first section of this paper gives a preview of the features of SOAR. The second section describes the SOAR architecture in detail. The third section discusses some examples in order to make clear SOAR's structure and opera tion. The final section concludes with a list of the principal hypotheses underlying the design of SOAR. 1. Preview
In common with the mainstream of problem solving and reasoning systems in AI, SOAR has an explicit symbolic representation of its tasks, which it
manipulates by symbolic processes. It encodes its knowledge of the task environment in symbolic structures and attempts to use this knowledge to guide its behavior. It has a general scheme of goals and subgoals for represent ing what the system wants to achieve, and for controlling its behavior. Beyond these basic communalities, SOAR embodies mechanisms and organi zational principles that express distinctive hypotheses . about the nature of the architecture for intelligence. These hypotheses are shared by other systems to varying extents, but taken together they determine SOAR's unique position in the space of possible architectures. We preview here the main distinctive characteristics of SOAR. The full details of all these features will be given in the next section on the architecture. 1 . 1 . Uniform task representation by problem spaces
In SOAR, every task of attaining a goal is formulated as finding a desired state in a problem space (a space with a set of operators that apply to a current state to yield a new state) (49] . Hence , all tasks take the form of heuristic search. Routine procedures arise , in this scheme, when enough knowledge is available to provide complete search control, i . e . , to determine the correct operator to be taken at each step. In AI, problem spaces are commonly used for genuine problem solving [18, 5 1 , 57-59, 72] , but procedural representations are com monly used for routine behavior. For instance, problem space operators are typically realized by LISP code. In SOAR, on the other hand, complex operators are implemented by problem spaces (though sufficiently simple operators can be realized directly by rules) . The adoption of the problem space as the fundamental organization for all goal"oriented symbolic activity (called the Problem Space Hypothesis [49] ) is a principal feature of SOAR. Figure 2 provides a schematic view of the important components of a problem space search for the Eight Puzzle. The lower, triangular portion of the figure represents the search in the Eight Puzzle problem space, while the upper, rectangular portion represents the knowledge involved in the definition and control of the search. In the Eight Puzzle , there are eight numbered tiles and one space on a three-by-three board. The states are different configura tions of the tiles on the board. The operators are the movements of an adjacent tile into the space (up, down, left and right) . In the figure, the states are represented by schematic boards and the operators are represented by arrows. Problem space search occurs in the attempt to attain a goal. In the Eight Puzzle the goal is a desired state representing a specific configuration of the tiles-the darkened board at the right of the figure. In other tasks, such as chess, where checkmate is the goal, there are many disparate desired states, which may then be represented by a test procedure. Whenever a new goal is encountered in solving a problem, the problem solver begins at some initial state in the new problem space. For the Eight Puzzle, the initial state is just a particular configuration of the tiles. The problem space search results from the
FIG. 2. The structure of problem space search for the Eight Puzzle.
problem solver's application of operators in an attempt to find a way of moving from its initial state to one of its desired states. Only the current position (in Fig. 2, it is the board pointed to by the downward arrow from the knowledge box) exists on the physical board, and SOAR can generate new states only by applying the operators. Likewise, the states in a problem space, except the current state and possibly a few remembered states, do not preexist as data structures in the problem solver, so they must be generated by applying operators to states that do exist. 1.2.
Any decision can be an object of goal-oriented attention
All decisions in SOAR relate to searching a problem space (selection of operators, selection of states, etc.) . The box in Fig. 2 represents the knowledge
SOAR: AN ARCHITECTURE FOR GENERAL INTELLIGENCE
that can be immediately brought to bear to make the decisions in a particular space. However, a subgoal can be set up to make any decision for which the immediate knowledge is insufficient. For instance , looking back to state Sl , three moves were possible: moving a tile adjacent to the blank left, right or down. If the knowledge was not available to select which move to try, then a subgoal to select the operator would have been set up. Or, if the operator to move a tile left had been selected, but it was not known immediately how to perform that operator, then a subgoal would have been set up to do that. (The moves in the Eight Puzzle are too simple to require this, but many operators are more complex, e.g. , an operator to factor a polynomial in an algebraic task.) Or, if the left operator had been applied and SOAR attempted to evaluate the result, but the evaluation was too complicated to compute directly, then a subgoal would have been set up to obtain the evaluation. Or, to take just one more example, if SOAR had attempted to apply an operator that was illegal at state S l , say to move tile 1 to the position of tile 2, then it could have set up a subgoal to satisfy the preconditions of the operator (that the position of tile 2 be blank) . In short, a subgoal can be set up for any problematic decision , a property we call universal subgoaling. Since setting up a goal means that a search can be conducted for whatever information is needed to make the decision , SOAR can be described as having no fixed bodies of knowledge to make any decision (as in writing a specific LISP function to evaluate a position or select among operators) . The ability to search in subgoals also implies that further subgoals can be set up within existing subgoals so that the behavior of SOAR involves a tree of subgoals and problem spaces (Fig. 3). Because many of these subgoals address how to make control decisions, this implies that SOAR can reflect [73] on its own problem solving behavior, and do this to arbitrary levels [ 64] . 1.3. Uniform representation of all long-term knowledge by a production system
There is only a single memory organization for all long-term knowledge , namely, a production system [9, 14, 25, 42, 78] . Thus, the boxes in Figs. 2 and 3 are filled in with a uniform production system. Productions deliver control knowledge, when a production action rejects an operator that leads back to the prior position. Productions also provide procedural knowledge for simple operators, such as the Eight Puzzle moves, which can be accomplished by two productions, one to create the new state and put the changes in place and one to copy the unchanged tiles. (As noted above, more complex operators are realized by operating in an implementation problem space.) The data struc tures examinable by productions-that is, the pieces of knowledge in declara tive form-are all in the production system's short-term working memory. However, the long-term storage of this knowledge is in productions which have actions that generate the data structures. SOAR employs a specialized production system (a modified version of OPS5 [20]) . All satisfied productions are fired in parallel, without conflict resolution.
FIG. 3. The tree of subgoals and their problem spaces.
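As an illustration of the two-production operator implementation mentioned above, the following schematic Python sketch (invented names; the real operators are OPS5 productions, not functions) separates an Eight Puzzle move into one rule that creates the new state with the changed bindings and one that copies the bindings the move leaves untouched.

    # Schematic version of the two productions that implement an Eight Puzzle
    # move: states map cells to tiles, with one cell holding the blank (None).
    def create_new_state(state, moved_cell, blank_cell):
        # Production 1: make the new state and put the changed bindings in place.
        return {blank_cell: state[moved_cell], moved_cell: None}

    def copy_unchanged_tiles(old_state, new_state):
        # Production 2: copy every binding the move did not touch.
        for cell, tile in old_state.items():
            new_state.setdefault(cell, tile)
        return new_state

    s1 = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): None}   # hypothetical 2x2 fragment
    s2 = copy_unchanged_tiles(s1, create_new_state(s1, (0, 1), (1, 1)))
    print(s2)   # tile 2 slides into the blank: {(1, 1): 2, (0, 1): None, ...}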
Productions can only add data elements to working memory. All modification and removal of data elements is accomplished by the architecture.

1.4. Knowledge to control search expressed by preferences
Search control knowledge is brought to bear by the additive accumulation (via production firings) of data elements in working memory. One type of data
element, the preference, represents knowledge about how SOAR should behave in its current situation (as defined by a current goal, problem space, state and operator). For instance, the rejection of the move that simply returns to the prior state (in the example above) is encoded as a rejection preference on the operator. The preferences admit only a few concepts: acceptability, rejection, better (best, worse and worst), and indifferent. The architecture contains a fixed decision procedure for interpreting the set of accumulated preferences to determine the next action. This fixed procedure is simply the embodiment of the semantics of these basic preference concepts and contains no task-dependent knowledge.
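A rough sketch of such a decision procedure, in Python with invented names (the real procedure also handles best, worst, and parallel preferences), might look like the following; when the accumulated preferences do not single out a choice, it reports an impasse rather than deciding.

    # Schematic decision procedure: interpret accumulated preferences for a slot
    # (e.g. the operator slot) and either pick the next object or report an impasse.
    def decide(preferences):
        acceptable = {p["object"] for p in preferences if p["kind"] == "acceptable"}
        rejected   = {p["object"] for p in preferences if p["kind"] == "reject"}
        candidates = acceptable - rejected
        # Discard any candidate that some other candidate is preferred to.
        worse = {p["worse"] for p in preferences
                 if p["kind"] == "better" and p["better"] in candidates}
        candidates -= worse
        if len(candidates) == 1:
            return ("select", candidates.pop())
        if not candidates:
            return ("impasse", "rejection")            # nothing acceptable remains
        indifferent = {p["object"] for p in preferences if p["kind"] == "indifferent"}
        if candidates <= indifferent:
            return ("select", sorted(candidates)[0])   # any indifferent choice will do
        return ("impasse", "tie")                      # insufficient knowledge: tie impasse

    prefs = [
        {"kind": "acceptable", "object": "left"},
        {"kind": "acceptable", "object": "right"},
        {"kind": "acceptable", "object": "down"},
        {"kind": "reject", "object": "down"},          # e.g. it undoes the previous move
    ]
    print(decide(prefs))                               # ('impasse', 'tie') between left and right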
All goals arise to cope with impasses
Difficulties arise, ultimately, from a lack of knowledge about what to do next (including of course knowledge that problems cannot be solved). In the immediate context of behaving, difficulties arise when problem solving cannot continue-when it reaches an impasse. Impasses are detectable by the architec ture, because the fixed decision procedure concludes successfully only when the knowledge of how to proceed is adequate. The procedure fails otherwise (i.e . , it detects an impasse) . A t this point the architecture creates a goal for overcoming the impasse. For example , each of the subgoals in Fig. 3 is evoked because some impasse occurs: the lack of sufficient preferences between the three task operators creates a tie impasse; the failure of the productions in the task problem space to carry out the selected task operator leads to a no-change impasse ; and so on. In SOAR, goals are created only in response to impasses. Although there are only a small set of architecturally distinct impasses (four) , this suffices to generate all the types of subgoals. Thus, all goals arise from the architecture. This principle of operation, called automatic subgoaling, is the most novel feature of the SOAR architecture, and it provides the basis for many other features. 1 .6. Continuous monitoring of goal termination
The architecture continuously monitors for the termination of all active goals in the goal hierarchy. Upon detection, SOAR proceeds immediately from the point of termination. For instance, in trying to break a tie between two operators in the Eight Puzzle, a subgoal will be set up to evaluate the operators. If in examining the first operator a preference is created that rejects it, then the decision at the higher level can, and will, be made immediately. The second operator will be selected and applied, cutting off the rest of the evaluation and comparison process. All of the working-memory elements local to the termi nated goals are automatically removed. Immediate and automatic response to the termination of any active goal is
471
472
CHAYfER 15
rarely used in AI systems because of its expense. Its (efficient) realization in SOAR depends strongly on automatic subgoaling. 1 .7.
The basic problem solving methods arise directly from knowledge of the task
SOAR realizes the so-called weak methods, such as hill climbing, means-ends analysis, alpha-beta search , etc ., by adding search control productions that express , in isolation, knowledge about the task (i.e . , about the problem space and the desired states) . The structure of SOAR is such that there is no need for this knowledge to be organized in separate procedural representations for each weak method (with a selection process to determine which one to apply) . For example, if knowledge exists about how to evaluate the states in a task, and the consequences of evaluation functions are understood (prefer operators that lead to states with higher evaluations) , then SOAR exhibits a form of hill climbing. This general capability is another novel feature of SOAR. 1.8. Continuous learning by experience through chunking
SOAR learns continuously by automatically and permanently caching the results of its subgoals as productions. Thus, consider the tie impasse between the three task operators in Fig. 3 , which leads to a subgoal to break that tie. The ultimate result of the problem solving in this subgoal is a preference (or preferences) that resolves the tie impasse in the top space and terminates the subgoal. Then a production is automatically created that will deliver that preference (or preferences) again in relevantly similar situations. If the system ever again reaches a similar situation, no impasse will occur (hence no subgoal and no problem solving in a subspace) because the appropriate preferences will be generated immediately. This mechanism is directly related to the phenomenon called chunking in human cognition [63] , whence its name. Structurally, chunking is a limited form of practice learning. However, its effects turn out to be wide-ranging. Because learning is closely tied to the goal scheme and universal subgoaling which provide an extremely fine-grained, uniformly structured, and com prehensive decomposition of tasks on which the learning can work-SOAR learns both operator implementations and search control. In addition , the combination of the fine-grained task decomposition with an ability to abstract away all but the relevant features allows SOAR to exhibit significant transfer of learning to new situations, both within the same task and between similar tasks. This ability to combine learning and problem solving has produced the most striking experimental results so far in SOAR [33 , 36, 62] . 2. The
SOAR
Architecture
In this section we describe the SOAR architecture systematically from scratch, depending on the preview primarily to have established the central role of
SOAR: AN ARCHITECTURE FOR GENERAL INTELLIGENCE
problem spaces and production systems. We will continue to use the Eight Puzzle as the example throughout. 2.1.
The architecture for problem solving
is a problem solving architecture, rather than just an architecture for symbolic manipulation within which problem solving can be realized by appropriate control. This is possible because SOAR accomplishes all of its tasks in problem spaces. To realize a task as search in a problem space requires a fixed set of task-implementation functions, involving the retrieval or generation of: ( 1 ) problem spaces, (2) problem space operators, (3) an initial state representing the current situation , and ( 4) new states that result from applying operators to existing states. To control the search requires a fixed set of search-control functions, involving the selection of: ( 1 ) a problem space, (2) a state from those directly available, and (3) an operator to apply to the state. Together, the task-implementation and search-control functions are sufficient for problem space search to occur. The quality and efficiency of the problem solving will depend on the nature of the selection functions. The task-implementation and search-control functions are usually inter leaved. Task implementation generates (or retrieves) new problem spaces, states and operators ; and then search control selects among the alternatives generated. Together they completely determine problem solving behavior in a problem space. Thus, as Fig. 4 shows, the behavior of SOAR on the Eight Puzzle can be described as a sequence of such acts. Other important functions SOAR
[Retrieve the eight-puzzle problem space] Select eight-puzzle as problem space [Generate S1 as the initial state] Select S1 as state [Retrieve the operators Down, Left, Right] Select Down as operator [Apply operator (generate S2)) Select Left as operator [Apply operator (generate S3)) Select Right as operator [Apply operator (generate S4)] Select S2 as state [Retrieve the operators Down, Left, Right] Select Down as operator [Apply operator (generate S5)) Select Left as operator [Apply operator (generate S6)) Select Right as operator [Apply operator (generate S7)) Select S7 as state
Fm.
4. Problem space trace in the Eight Puzzle. (Task-implementation steps are bracketed.)
473
474
CHAPTER 15
must be performed for a complete system: goal creation, goal selection, goal termination, memory management and learning. None of these are included in SOAR's search-control or task-implementation acts. Instead , they are handled automatically by the architecture , and hence are not objects of volition for SOAR. They are described at the appropriate places below. The deliberative acts of search-control together with the knowledge for implementing the task are the locus of intelligence in SOAR. As indicated earlier in Fig. 2, search-control and task-implementation knowledge is brought to bear on each step of the search. Depending on how much search-control knowledge the problem solver has and how effectively it is employed, the search in the problem space will be narrow and focused, or broad and random. If focused enough, the behavior is routine . Figure 5 shows a block diagram of the architecture that generates problem space search behavior. There is a working memory that holds the complete processing state for problem solving in SOAR. This has three components : ( 1 ) a Chunking Mechanism +
Production Memory
) )
Work i n g - M emory Manager
Decision Procedure
FIG.
5. Architectural structure of SOAR.
SOAR: AN ARCHITECTURE FOR GENERAL INTELLIGENCE
context stack that specifies the hierarchy of active goals, problem spaces, states and operators; (2) objects, such as goals and states (and their subobjects) ; and (3) preferences that encode the procedural search-control knowledge . The processing structure has two parts. One is the production memory, which is a set of productions that can examine any part of working . memory, add new objects and preferences, and augment existing objects, but cannot modify the context stack. The second is a fixed decision procedure that examines the preferences and the context stack, and changes the context stack. The produc tions and the decision procedure combine to implement the search-control functions. Two other fixed mechanisms are shown in the figure: a working memory manager that deletes elements from working memory, and a chunking mechanism that adds new productions. SOAR is embedded within LISP. It includes a modified version of the OPS5 production system language plus additional LISP code for the decision proce dure, chunking, the working-memory manager, and other SOAR-specific fea tures. The OPS5 matcher has been modified to significantly improve the efficiency of determining satisfied productions [70] . The total amount of LISP code involved, measured in terms of the size of the source code, is approxi mately 255 kilobytes-70 kilobytes of unmodified OPS5 code, 30 kilobytes of modified OPS5 code, and 155 kilobytes of SOAR code. SOAR runs in COMMON LISP, FRANZLISP, INTERLISP and ZETALISP on most of the appropriate hardware (UNIX VAX, VMS VAX, XEROX D-machines, Symbolics 3600s, TI Explorers, IBM RTPCs, Apollo and Sun workstations) . 2.2.
The working memory
Working memory consists of a context stack, a set of objects linked to the context stack, and preferences. Figure 6 shows a graphic depiction of a small part of working memory during problem solving on the Eight Puzzle. The context stack contains the hierarchy of active contexts (the boxed structures). Each context contains four slots, one for each of the different roles: goal, problem space, state and operator. Each slot can be occupied either by an object or by the symbol undecided, the latter meaning that no object has been selected for that slot. The object playing the role of the goal in a context is the current goal for that context; the object playing the role of the problem space is the current problem space for that context, and so on. The top context contains the highest goal in the hierarchy. The goal in each context below the top context is a subgoal of the context above it. In the figure, G1 is the current goal of the top context, P1 is the current problem space, S1 is the current state, and the current operator is undecided. In the lower context, G2 is the current goal (and a subgoal of G1). Each context has only one goal for the duration of its existence, so the context stack doubles as the goal stack.

FIG. 6. Snapshot of a fragment of working memory.

The basic representation is object-centered. An object, such as a goal or a
state, consists of a symbol, called its identifier, and a set of augmentations. An augmentation is a labeled relation (the attribute) between the object (the identifier) and another symbol (the value), i.e., an identifier-attribute-value triple. In the figure, G1 is augmented with a desired state, D1, which is itself an object that has its own augmentations (augmentations are directional, so G1 is not in an augmentation of D1, even though D1 is in an augmentation of G1). The attribute symbol may also be specified as the identifier of an object. Typically, however, situations are characterized by a small fixed set of attribute symbols (here impasse, name, operator, binding, item and role) that play no other role than to provide discriminating information. An object may have any number of augmentations, and the set of augmentations may change over time. (The extent of the memory structure is necessarily limited by the physical resources of the problem solver, but currently this is assumed not to be a problem and mechanisms have not been created to deal with it.)

A preference is a more complex data structure with a specific collection of eight architecturally defined relations between objects. Three preferences are shown in the figure, one each for objects O1, O2, and O3. The preferences in the figure do not show their full structure (shown later in Fig. 10), only the context in which they are applicable (any context containing problem space P1 and state S1).

The actual representation of objects in working memory is shown in Fig. 7 (some basic notation and structure is inherited from OPS5). Working memory is a set: attempting to add an existing element does not change working memory. Each element in working memory represents a single augmentation. To simplify the description of objects, we group together all augmentations of the same object into a single expression. For example, the first line of Fig. 7 contains a single expression for the four augmentations of goal G1. The first component of an object is a class name that distinguishes different types of objects. For example, goal, desired, problem space, and state are the class names of the first four objects in Fig. 7. Class names do not play a semantic role in SOAR, although they allow the underlying matcher to be more efficient. Following the class name is the identifier of the object. The goal has the current goal as its identifier. Following the identifier is an unordered list of attribute-value pairs, each attribute being prefaced by an up-arrow (↑). An object may have more than one value for a single attribute, as does state S1 in Fig. 7, yielding a simple representation of sets.

(goal G1 ↑problem-space P1 ↑state S1 ↑operator undecided ↑desired D1)
(desired D1 ↑binding DB1 ↑binding DB2 ...)
(problem-space P1 ↑name eight-puzzle)
(state S1 ↑binding B1 B2 B3 ...)
(binding B1 ↑cell C1 ↑tile T1)
(cell C1 ↑cell C2 ...)
(tile T1 ↑name 1)
(binding B2 ↑cell C2 ...)
(cell C2 ↑cell C1 ...)
(binding B3 ...)
(preference ↑object O1 ↑role operator ↑value acceptable ↑problem-space P1 ↑state S1)
(preference ↑object O2 ↑role operator ↑value acceptable ↑problem-space P1 ↑state S1)
(preference ↑object O3 ↑role operator ↑value acceptable ↑problem-space P1 ↑state S1)
(operator O1 ...)
(operator O2 ...)
(operator O3 ...)
(goal G2 ↑problem-space P2 ↑state undecided ↑operator undecided ↑supergoal G1 ↑role operator ↑impasse tie ↑item O3 ↑item O2 ↑item O1)
(problem-space P2 ↑name selection)

FIG. 7. Working-memory representation of the structure in Fig. 6.
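To make the set-based, object-centered representation of Fig. 7 concrete, the following is a minimal Python sketch. It is illustrative only, not SOAR code; the class and method names are hypothetical.

class WorkingMemory:
    """Working memory as a set of identifier-attribute-value augmentations."""
    def __init__(self):
        self.augmentations = set()      # set semantics: re-adding an element changes nothing

    def add(self, identifier, attribute, value):
        self.augmentations.add((identifier, attribute, value))

    def object(self, identifier):
        """Group all augmentations of one object, as the expressions of Fig. 7 do."""
        return {(a, v) for (i, a, v) in self.augmentations if i == identifier}

wm = WorkingMemory()
wm.add('G1', 'problem-space', 'P1')
wm.add('G1', 'state', 'S1')
wm.add('G1', 'operator', 'undecided')
wm.add('S1', 'binding', 'B1')
wm.add('S1', 'binding', 'B2')           # several values for one attribute represent a set
wm.add('S1', 'binding', 'B2')           # duplicate add: working memory is unchanged
print(wm.object('S1'))                  # {('binding', 'B1'), ('binding', 'B2')}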
The basic attribute-value representation in SOAR leaves open how to represent task states. As we shall see later, the representation plays a key role in determining the generality of learning in SOAR. The generality is maximized when those aspects of a state that are functionally independent are represented independently. In the Eight Puzzle, both the structure of the board and the actual tiles do not change from state to state in the real world. Only the location of a tile on the board changes, so the representation should allow a tile's location to change without changing the structure of the board or the tiles. Figure 8 contains a detailed graphic example of one representation of a state in the Eight Puzzle that captures this structure. The state it represents is shown in the lowest left-hand corner. The board in the Eight Puzzle is represented by nine cells (the 3 x 3 square at the bottom of the figure), one for each of the possible locations for the tiles. Each cell is connected via augmentations of type cell to its neighboring cells (only a few labels in the center are actually filled in). In addition, there are nine tiles (the horizontal sequence of objects just above the cells), named 1-8, and blank (represented by a small box in the figure). The connections between the tiles and cells are specified by objects called bindings. A given state, such as S1 at the top of the figure, consists of a set of nine bindings (the horizontal sequence of objects above the tiles). Each binding points to a tile and a cell; each tile points to its value; and each cell points to its adjacent cells. Eight Puzzle operators manipulate only the bindings; the representation of the cells and tiles does not change.

FIG. 8. Graphic representation of an Eight Puzzle state.

Working memory can be modified by: (1) productions, (2) the decision procedure, and (3) the working-memory manager. Each of these components has a specific function. Productions only add augmentations and preferences to working memory. The decision procedure only modifies the context stack. The working-memory manager only removes irrelevant contexts and objects from working memory.
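To illustrate the functional independence just described, here is a minimal Python sketch in which cells and tiles are fixed, shared objects and a state is nothing but a set of bindings. It only mirrors the structure of Fig. 8; it is not SOAR working-memory data, and the layout shown is chosen for illustration.

# Fixed board structure: each cell lists its adjacent cells (never changes).
cells = {(r, c): [(r + dr, c + dc)
                  for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                  if 0 <= r + dr < 3 and 0 <= c + dc < 3]
         for r in range(3) for c in range(3)}

# Fixed tiles: 1-8 plus the blank (never change).
tiles = ['blank', 1, 2, 3, 4, 5, 6, 7, 8]

def make_state(layout):
    """A state is only a set of bindings pairing a cell with a tile."""
    return [{'cell': cell, 'tile': tile} for cell, tile in layout.items()]

# An example layout (2 8 3 / 1 6 4 / 7 blank 5), expressed as bindings.
s1 = make_state({(0, 0): 2, (0, 1): 8, (0, 2): 3,
                 (1, 0): 1, (1, 1): 6, (1, 2): 4,
                 (2, 0): 7, (2, 1): 'blank', (2, 2): 5})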
2.3. The processing structure
The processing structure implements the functions required for search in a problem space: bringing to bear task-implementation knowledge to generate objects, and bringing to bear search-control knowledge to select between alternative objects. The search-control functions are all realized by a single generic control act: the replacement of an object in a slot by another object from the working memory. The representation of a problem is changed by replacing the current problem space with a new problem space. Returning to a prior state is accomplished by replacing the current state with a preexisting one in working memory. An operator is selected by replacing the current operator (often undecided) with the new one. A step in the problem space occurs when the current operator is applied to the current state to produce a new state, which is then selected to replace the current state in the context. A replacement can take place anywhere in the context stack, e.g., a new state can replace the state in any of the contexts in the stack, not just the lowest or most immediate context but any higher one as well. When an object in a slot is replaced, all of the slots below it in the context are reinitialized to undecided. Each lower slot depends on the values of the higher slots for its validity: a problem space is set up in response to a goal; a state functions only as part of a problem space; and an operator is to be applied at a state. Each context below the one where the replacement took place is terminated because it depends on the contents of the changed context for its existence (recall that lower contexts contain subgoals of higher contexts).

The replacement of context objects is driven by the decision cycle. Figure 9 shows three cycles, with the first one expanded out to reveal some of the inner structure. Each cycle involves two distinct parts. First, during the elaboration phase, new objects, new augmentations of old objects, and preferences are added to working memory. Then the decision procedure examines the accumulated preferences and the context stack, and either it replaces an existing object in some slot, i.e., in one of the roles of a context in the context stack, or it creates a subgoal in response to an impasse.
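The cycle just described can be summarized procedurally. The sketch below is illustrative Python, not the LISP/OPS5 implementation; the production, decision-procedure, and context-stack interfaces are hypothetical stand-ins.

def decision_cycle(working_memory, context_stack, productions, decide):
    # Elaboration phase: fire every newly satisfied instantiation, in (simulated)
    # parallel, until quiescence; additions are monotonic and each instantiation
    # fires at most once (refractory inhibition).
    fired = set()
    while True:
        new = [inst for p in productions
               for inst in p.instantiations(working_memory, context_stack)
               if inst not in fired]
        if not new:
            break                                       # quiescence reached
        for inst in new:
            working_memory |= inst.additions()          # only adds objects, augmentations, preferences
            fired.add(inst)
    # Decision procedure: interpret the accumulated preferences against the context stack.
    change = decide(working_memory, context_stack)
    if change is None:
        context_stack.create_subgoal_for_impasse()      # no change possible: impasse
    else:
        context_stack.replace(change.slot, change.obj)  # reinitializes all lower slots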
FIG. 9. A sequence of decision cycles (each cycle consists of an elaboration phase run to quiescence, followed by the decision procedure, which gathers and interprets preferences and either replaces a context object or, on an impasse, creates a subgoal).
2.3.1. The elaboration phase

Based on the current contents of working memory, the elaboration phase adds new objects, augmentations of existing objects, and preferences. Elaborations are generated in parallel (shown by the vertical columns of arrows in Fig. 9) but may still require multiple steps for completion (shown by the horizontal sequences of elaborations in the figure) because information generated during one step may allow other elaborations to be made on subsequent steps. This is a monotonic process (working-memory elements are not deleted or modified) that continues until quiescence is reached because there are no more elaborations to be generated. (In practice, the elaboration phase reaches quiescence quickly, in less than ten cycles; however, if quiescence is not reached after a prespecified number of iterations, typically 100, the elaboration phase terminates and the decision procedure is entered.) The monotonic nature of the elaboration phase assures that no synchronization problems will occur during the parallel generation of elaborations. However, because this is only syntactic monotonicity (data structures are not modified or deleted), it leaves open whether semantic conflicts or nonmonotonicity will occur.

The elaboration phase is encoded in SOAR as productions of the form:

    C1, C2, . . . , Cn → A1, A2, . . . , Am

The Ci are conditions that examine the context stack and the rest of the working memory, while the Ai are actions that add augmentations or preferences to memory. Condition patterns are based on constants, variables, negations, pattern-ands, and disjunctions of constants (according to the conventions of OPS5 productions). Any object in working memory can be accessed as long as there exists a chain of augmentations and preferences from the context stack to the object. An augmentation can be a link in the chain if its identifier appears either in a context or in a previously linked augmentation or preference. A preference can be a link in the chain if all the identifiers in its context fields (defined in Section 2.3.2) appear in the chain. This property of linked access plays an important role in working-memory management, subgoal termination, and chunking, by allowing the architecture to determine which augmentations and preferences are accessible from a context, independent of the specific knowledge encoded in elaborations.

A production is successfully instantiated if the conjunction of its conditions is satisfied with a consistent binding of variables. There can be any number of concurrently successful instantiations of a production. All successful instantiations of all productions fire concurrently (simulated) during the elaboration phase. The only conflict resolution principle in SOAR is refractory inhibition: an instantiation of a production is fired only once. Rather than having control exerted at the level of productions by conflict resolution, control is exerted at the level of problem solving (by the decision procedure).

2.3.2. The decision procedure

The decision procedure is executed when the elaboration phase reaches quiescence. It determines which slot in the context stack should have its content replaced, and by which object. This is accomplished by processing the context stack from the oldest context to the newest (i.e., from the highest goal to the lowest one). Within each context, the roles are considered in turn, starting with the problem space and continuing through the state and operator in order. The process terminates when a slot is found for which action is required. Making a change to a higher slot results in the lower slots being reinitialized to undecided, thus making the processing of lower slots irrelevant. This ordering on the set of slots in the context stack defines a fixed desirability ordering between changes for different slots: it is always more desirable to make a change higher up.

The processing for each slot is driven by the knowledge symbolized in the preferences in working memory at the end of the elaboration phase. Each preference is a statement about the selection of an object for a slot (or set of slots). Three primitive concepts are available to make preference statements:
acceptability: a choice is to be considered;
rejection: a choice is not to be made;
desirability: a choice is better than (worse than, indifferent to) a reference choice.

Together, the acceptability and rejection preferences determine the objects from which a selection will be made, and the desirability preferences partially order these objects. The result of processing the slot, if successful, is a single object that is: new (not currently selected for that slot); acceptable; not rejected; and more desirable than any other choice that is likewise new, acceptable and not rejected. (There is an additional preference type that allows the statement that two choices for an operator slot can be explored in parallel. This is a special option to explore parallel processing where multiple slots are created for parallel operators. For more details, see the SOAR manual [30].)

A preference encodes a statement about the selection of an object for a slot into a set of attributes and values, as shown in Fig. 10. The object is specified by the value of the object attribute. The slot is specified by the combination of a role and a context. The role is either the problem space, the state or the operator; a goal cannot be specified as a role in a preference because goals are determined by the architecture and not by deliberate decisions. The context is specified by the contents of its four roles: goal, problem space, state and operator. A class of contexts can be specified by leaving unspecified the contents of one or more of the roles. For example, if only the problem space and state roles are specified, the preference will be relevant for all goals with the given problem space and state.
Object: The object that is to occupy the slot.

Role: The role the object is to occupy (problem space, state, or operator).

Goal, Problem space, State, Operator: The context in which the preference applies (a set of contexts can be specified). The role together with the context specifies the slot.

Value:
  acceptable: The object is a candidate for the given role.
  reject: The object is not to be selected.
  best: The object is as good as any object can be.
  better: The object is better than the reference object.
  indifferent: The object is indifferent to the reference object if there is one; otherwise the object is indifferent to all other indifferent objects.
  worse: The object is worse than the reference object (the inverse of better).
  worst: The object is as bad as any object can be (the inverse of best).

Reference: The reference object for order comparison.

FIG. 10. The encoding of preferences.
The desirability of the object for the slot is specified by the value attribute of a preference, which takes one of seven alternatives. Acceptable and reject cover their corresponding concepts; the others (best, better, indifferent, worse, and worst) cover the ordering by desirability. All assertions about ordering locate the given object relative to a reference object for the same slot. Since the reference object always concerns the same slot, it is only necessary to specify the object. For better, worse, and some indifferent preferences, the reference object is another object that is being considered for the slot, and it is given by the reference attribute of the preference. For best, worst, and the remaining indifferent preferences, the reference object is an abstract anchor point, hence is implicit and need not be given. Consider an example where there are two Eight Puzzle operators, named up and left, being considered for state S1 in goal G1. If the identifier for the Eight Puzzle problem space is P1, and the identifiers for up and left are O1 and O2, then the following preference says that up is better than left:

(preference ↑object O1 ↑role operator ↑value better ↑reference O2 ↑goal G1 ↑problem-space P1 ↑state S1)
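Read as a data structure, the encoding of Fig. 10 might look like the following Python sketch. It is illustrative only; the field names follow the figure, not SOAR's internal format.

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Preference:
    object: str                           # the object that is to occupy the slot
    role: str                             # 'problem-space', 'state', or 'operator'
    value: str                            # acceptable, reject, best, better, indifferent, worse, worst
    reference: Optional[str] = None       # reference object for better/worse/binary indifferent
    goal: Optional[str] = None            # context fields: leaving one unspecified makes the
    problem_space: Optional[str] = None   # preference apply to a whole class of contexts
    state: Optional[str] = None
    operator: Optional[str] = None

# The example above: up (O1) is better than left (O2) for state S1 in goal G1.
p = Preference(object='O1', role='operator', value='better', reference='O2',
               goal='G1', problem_space='P1', state='S1')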
The decision procedure computes the best choice for a slot based on the preferences in working memory and the semantics of the preference concepts, as given in Fig. 11. The preference scheme of Fig. 11 is a modification of the straightforward application of the concepts of acceptability, rejection and desirability. The modifications arise from two sources. The first is independence. The elaboration phase consists of the contributions of independently firing individual productions, each expressing an independent source of knowledge. There is no joint constraint on what each asserts. These separate expressions must be combined, and the only way to do so is to conjoin them. Independence implies that one choice can be (and often is) both acceptable and rejected. For a decision to be possible with such preferences, rejection cannot be ¬acceptable, which would lead to a logical contradiction. Instead, rejection overrides acceptable by eliminating the choice from consideration. Independence also implies that one choice can be both better and worse than another. This requires admitting conflicts of desirability between choices. Thus, the desirability order is quite weak, being transitive, but not irreflexive or antisymmetric, and dominates must be distinguished from simply better: domination implies better without conflict. The possibility of conflicts modifies the notion of the maximal subset of a set to be those elements that no other element dominates. For example, in the set {x, y}, if (x > y) and (y > x) then the maximal subset contains both x and y.
Primitive predicates and functions on objects x, y, z, . . . :
  current: the object that currently occupies the slot.
  acceptable(x): x is acceptable.
  reject(x): x is rejected.
  (x > y): x is better than y.
  (x < y): x is worse than y (same as y > x).
  (x ~ y): x is indifferent to y.
  (x ≫ y): x dominates y: (x > y) ∧ ¬(y > x).

Reference anchors:
  indifferent(x) ⊃ ∀y [indifferent(y) ⊃ (x ~ y)].
  best(x) ⊃ ∀y [best(y) ⊃ (x ~ y)] ∧ [¬best(y) ∧ ¬(y > x) ⊃ (x > y)].
  worst(x) ⊃ ∀y [worst(y) ⊃ (x ~ y)] ∧ [¬worst(y) ∧ ¬(y < x) ⊃ (x < y)].

Basic properties:
  Desirability (x > y) is transitive, but not complete or antisymmetric.
  Indifference is an equivalence relationship and substitutes over >:
    (x > y) ∧ (y ~ z) implies (x > z).
  Indifference does not substitute in acceptable, reject, best, and worst:
    acceptable(x) ∧ (x ~ y) does not imply acceptable(y); reject(x) ∧ (x ~ y) does not imply reject(y); etc.

Default assumption:
  All preference statements that are not explicitly mentioned and not implied by transitivity or substitution are not assumed to be true.

Intermediate definitions:
  considered-choices = {x ∈ objects | acceptable(x) ∧ ¬reject(x)}.
  maximal(X) = {x ∈ X | ∀y ¬(y ≫ x)}.
  maximal-choices = maximal(considered-choices).
  empty(X) = ¬∃x ∈ X.
  mutually-indifferent(X) = ∀x, y ∈ X (x ~ y).
  random(X) = choose one element of X randomly.
  select(X) = if current ∈ X then current, else random(X).

Final choice:
  empty(maximal-choices) ∧ ¬reject(current) ⊃ final-choice(current).
  mutually-indifferent(maximal-choices) ∧ ¬empty(maximal-choices) ⊃ final-choice(select(maximal-choices)).

Impasse:
  empty(maximal-choices) ∧ reject(current) ⊃ impasse.
  ¬mutually-indifferent(maximal-choices) ⊃ impasse(maximal-choices).

FIG. 11. The semantics of preferences.
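Fig. 11 can also be read as a procedure for one slot. The following is a simplified Python sketch, illustrative only: it abbreviates the best/worst anchors and does not separate tie from conflict impasses the way the full decision procedure does.

import random

def decide_slot(current, acceptable, rejected, better, indifferent, best, worst):
    """better is a set of (x, y) pairs meaning 'x is better than y' (worse folded in as (y, x))."""
    considered = {x for x in acceptable if x not in rejected}

    def beats(x, y):                       # x > y directly, or induced by a best/worst anchor
        return ((x, y) in better
                or (x in best and y not in best and (y, x) not in better)
                or (y in worst and x not in worst and (x, y) not in better))

    def dominates(x, y):                   # domination is better without conflict
        return beats(x, y) and not beats(y, x)

    maximal = {x for x in considered
               if not any(dominates(y, x) for y in considered if y != x)}

    if not maximal:                        # no viable candidate: keep current unless it is rejected
        return ('impasse', 'rejection') if current in rejected else ('final', current)

    mutually_indifferent = all(
        x == y or (x, y) in indifferent or (y, x) in indifferent
        or (x in best and y in best) or (x in worst and y in worst)
        for x in maximal for y in maximal)
    if not mutually_indifferent:
        return ('impasse', sorted(maximal))     # tie or conflict among the maximal choices

    if current in maximal:                 # default: do not change the current choice
        return ('final', current)
    return ('final', random.choice(sorted(maximal)))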
The second source of modifications to the decision procedure is incompleteness. The elaboration phase will deliver some collection of preferences. These can be silent on any particular fact, e.g., they may assert that x is better than y, and that y is rejected, but say nothing about whether x is acceptable or not, or rejected or not. Indeed, an unmentioned object could be better than any that are mentioned. No constraint on completeness can hold, since SOAR can be in any state of incomplete knowledge. Thus, for the decision procedure to get a result, assumptions must be made to close the world logically. The assumptions all flow from the principle that positive knowledge is required to state a
preference: to state that an object is acceptable, rejected or has some desirability relation. Hence, no such assertion should be made by default. Thus, objects are not acceptable unless explicitly acceptable; are not rejected unless explicitly rejected; and are not ordered in a specific way unless explicitly ordered. To do otherwise without explicit support is to rob the explicit statements of assertional power.

Note, however, that this assumption does allow for the existence of preferences implied by the explicit preferences and their semantics. For example, two objects are indifferent if either there is a binary indifferent preference containing them, there is a transitive set of binary indifferent preferences containing both of them, they are both in unary indifferent preferences, they are both in best preferences, or they are both in worst preferences.

The first step in processing the preferences for a slot is to determine the set of choices to be considered. These are objects that are acceptable (there are acceptable preferences for them) and are not rejected (there are no reject preferences for them). Dominance is then determined by the best, better, worst, and worse preferences. An object dominates another if it is better than the other (or the other is worse) and the latter object is not better than the former object. A best choice dominates all other non-best choices, except those that are explicitly better than it through a better or worse preference. A worst choice is dominated by all other non-worst choices, except those that are explicitly worse than it through a better or worse preference. The maximal choices are those that are not dominated by any other objects.

Once the set of maximal choices is computed, the decision procedure determines the final choice for the slot. The current choice acts as a default, so that a given slot will change only if the current choice is displaced by another choice. Whenever there are no maximal choices for a slot, the current choice is maintained, unless the current choice is rejected. If the set of maximal choices is mutually indifferent (that is, all pairs of elements in the set are mutually indifferent), then the final choice is one of the elements of the set. The default is to not change the current choice, so if the current choice is an element of the set, then it is chosen; otherwise, one element is chosen at random. (In place of a random selection, there is an option in SOAR to allow the user to select from the set of indifferent choices.) The random selection is justified because there is positive knowledge, in the form of preferences, that explicitly states that it does not matter which of the mutually indifferent objects is selected.

If the decision procedure determines that the value of the slot should be changed, that is, there is a final choice different from the current object in the slot, the change is installed, all of the lower slots are reinitialized to undecided, and the elaboration phase of the next decision cycle ensues. If the current choice is maintained, then the decision procedure considers the next
slot lower in the hierarchy. If either there is no final choice, or all of the slots have been exhausted, then the decision procedure fails and an impasse occurs. (The term impasse was first used in this sense in repair theory [8]; we had originally used the term difficulty [29].) In SOAR, four impasse situations are distinguished:

(1) Tie. This impasse arises when there are multiple maximal choices that are not mutually indifferent and do not conflict. These are competitors for the same slot for which insufficient knowledge (expressed as preferences) exists to discriminate among them.

(2) Conflict. This impasse arises when there are conflicting choices in the set of maximal choices.

(3) No-change. This impasse arises when the current value of every slot is maintained.

(4) Rejection. This impasse arises when the current choice is rejected and there are no maximal choices; that is, there are no viable choices for the slot. This situation typically occurs when all of the alternatives have been tried and found wanting.

The rules at the bottom of Fig. 11 cover all but the third of these, which involves cross-slot considerations not currently dealt with by the preference semantics. These four conditions are mutually exclusive, so at most one impasse will arise from executing the decision procedure. The response to an impasse in SOAR is to set up a subgoal in which the impasse can be resolved.

2.3.3. Implementing the Eight Puzzle

Making use of the processing structure so far described, and postponing the discussion of impasses and subgoals until Section 2.4, it is possible to describe the implementation of the Eight Puzzle in SOAR. This implementation consists of both task-implementation knowledge and search-control knowledge. Such knowledge is eventually to be acquired by SOAR from the external world in some representation and converted to internal forms, but until such an acquisition mechanism is developed, knowledge is simply posited of SOAR, encoded into problem spaces and search control, and incorporated directly into the production memory.

Figures 12-14 list the productions that encode the knowledge to implement the Eight Puzzle task. (These descriptions of the productions are an abstraction of the actual SOAR productions, which are given in the SOAR manual [30].) Figure 12 contains the productions that set things up so that problem solving can begin, and detect when the goal has been achieved. For this example we assume that initially the current goal is to be augmented with the name solve-eight-puzzle, a description of the initial state, and a description of the desired state. The problem space is selected based on the description of the goal.
select-eight-puzzle-space:
  If the current goal is solve-eight-puzzle, then make an acceptable preference for eight-puzzle as the current problem space.

define-initial-state:
  If the current problem space is eight-puzzle, then create a state in this problem space based on the description in the goal and make an acceptable preference for this state.

define-final-state:
  If the current problem space is eight-puzzle, then augment the goal with a desired state in this problem space based on the description in the goal.

detect-eight-puzzle-success:
  If the current problem space is eight-puzzle and the current state matches the desired state of the current goal in each cell, then mark the state with success.

FIG. 12. Productions that set up the Eight Puzzle.
In this case, production select-eight-puzzle-space is sensitive to the name of the goal and suggests eight-puzzle as the problem space. The initial state is determined by the current goal and the problem space. Production define-initial-state translates the description of the initial state in the goal to be a state in the Eight Puzzle problem space. Similarly, define-final-state translates the description of the desired state to be a state in the Eight Puzzle problem space. By providing different initial or desired states, different Eight Puzzle problems can be attempted. Production detect-eight-puzzle-success compares the current state, tile by tile and cell by cell, to the desired state. If they match, the goal has been achieved.

The final aspect of the task definition is the implementation of the operators. For a given problem, many different realizations of essentially the same problem space may be possible. For the Eight Puzzle, there could be twenty-four operators, one for each pair of adjacent cells between which a tile could be moved. In such an implementation, all operators could be made acceptable for each state, followed by the rejection of those that cannot apply (because the blank is not in the appropriate place). Alternatively, only those operators that are applicable to a state could be made acceptable. Another implementation could have four operators, one for each direction in which tiles can be moved into the blank cell: up, down, left, and right. Those operators that do not apply to a state could be rejected. In our implementation of the Eight Puzzle, there is a single general operator for moving a tile adjacent to the blank cell into the blank cell. For a given state, an instance of this operator is created for each of the adjacent cells. We will refer to these instantiated operators by the direction they move their associated tile: up, down, left and right. To create the operator instantiations requires a single production, shown in Fig. 13. Each operator is represented in working memory as an object that is augmented with the cell containing the blank and one of the cells adjacent to the blank. When an instantiated operator is created, an acceptable preference is also created for it in the context containing the Eight Puzzle problem space and the state for which the instantiated operator was created. Since operators are created only if they can apply, an additional production that rejects inapplicable operators is not required.

instantiate-operator:
  If the current problem space is eight-puzzle and the current state has a tile in a cell adjacent to the blank's cell, then create an acceptable preference for a newly created operator that will move the tile into the blank's cell.

FIG. 13. Production for creating Eight Puzzle operator instantiations.

An operator is applied when it is selected by the decision procedure for an operator role: selecting an operator produces a context in which productions associated with the operators can execute (they contain a condition that tests that the operator is selected). Whatever happens while a given operator occupies an operator role comprises the attempt to apply that operator. Operator productions are just elaboration productions, used for operator application rather than for search control. They can create a new state by linking it to the current context (as the object of an acceptable preference), and then augmenting it. To apply an instantiated operator in the Eight Puzzle requires the two productions shown in Fig. 14. When the operator is selected for an operator slot, production create-new-state will apply and create a new state with the tile and blank in their swapped cells. The production copy-unchanged-binding copies pointers to the unchanged bindings between tiles and cells.

create-new-state:
  If the current problem space is eight-puzzle, then create an acceptable preference for a newly created state, and augment the new state with bindings that have switched the tiles from the current state that are changed by the current operator.

copy-unchanged-binding:
  If the current problem space is eight-puzzle and there is an acceptable preference for a new state, then copy from the current state each binding that is unchanged by the current operator.

FIG. 14. Productions for applying Eight Puzzle operator instantiations.
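Collapsed into ordinary code, the effect of these two productions is a swap of the two changed bindings plus a copy of the rest. The sketch below is illustrative Python over the binding representation sketched in Section 2.2, not the SOAR productions themselves; the operator field names are hypothetical.

def apply_move(state, operator):
    """operator carries the blank's cell and the adjacent cell whose tile moves."""
    blank_cell, tile_cell = operator['blank-cell'], operator['tile-cell']
    moved_tile = next(b['tile'] for b in state if b['cell'] == tile_cell)
    new_state = []
    for b in state:
        if b['cell'] == blank_cell:                    # create-new-state: the two bindings
            new_state.append({'cell': blank_cell,      # changed by the operator are swapped
                              'tile': moved_tile})
        elif b['cell'] == tile_cell:
            new_state.append({'cell': tile_cell, 'tile': 'blank'})
        else:                                          # copy-unchanged-binding: everything
            new_state.append(dict(b))                  # else is copied across unchanged
    return new_state

# Applying "down" to the example state s1 of Section 2.2 (blank at (2, 1), tile above it at (1, 1)):
# s2 = apply_move(s1, {'blank-cell': (2, 1), 'tile-cell': (1, 1)})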
The seven productions so far described comprise the task-implementation knowledge for the Eight Puzzle. With no additional productions, SOAR will start to solve the problem, though in an unfocused manner. Given enough time it will search until a solution is found. (The default search is depth-first, where the choices between competing operators are made randomly; infinite loops do not arise because the choices are made nondeterministically.) To make the behavior a bit more focused, search-control knowledge can be added that guides the selection of operators. Two simple search-control productions are shown in Fig. 15.

avoid-undo:
  If the current problem space is eight-puzzle, then create a worst preference for the operator that will move the tile that was moved by the operator that created the current state.

mea-operator-selection:
  If the current problem space is eight-puzzle and an operator will move a tile into its cell in the desired state, then make a best preference for that operator.

FIG. 15. Search-control productions for the Eight Puzzle.
Avoid-undo will avoid operators that move a tile back to its prior cell. Mea-operator-selection is a means-ends analysis heuristic that prefers the selection of an operator if it moves a tile into its desired cell. This is not a foolproof heuristic rule, and will sometimes lead SOAR to make an incorrect move.

Figure 16 contains a trace of the initial behavior using these nine productions (the top of the figure shows the state and operator involved in this trace). The trace is divided up into the activity occurring during each of the first five decision cycles (plus an initialization cycle). Within each cycle, the activity is marked according to whether it took place within the elaboration phase (E), or as a result of the decision procedure (D). The steps within the elaboration phase are also marked; for example, line 4.1E represents activity occurring during the first step of the elaboration phase of the fourth decision cycle. Each line that is part of the elaboration phase represents a single production firing. Included in these lines are the production's name and a description of what it does. When more than one production is marked the same, as in 4.2E, it means that they fire in parallel during the single elaboration step.

The trace starts where the current goal (called G1) is the only object defined. In the first cycle, the goal is augmented with an acceptable preference for eight-puzzle for the problem space role. The decision procedure then selects eight-puzzle as the current problem space. In cycle 2, the initial state, S1, is created with an acceptable preference for the state role, and the problem space is augmented with its operators. At the end of cycle 2, the decision procedure selects S1 as the current state. In cycle 3, operator instances, with corresponding acceptable preferences, are created for all of the tiles that can move into the blank cell. Production mea-operator-selection makes operator O1 (down) a best choice, resulting in its being selected as the current operator. In cycle 4, the operator is applied. First, production create-new-state creates the preference for a new state (S2) and augments it with the swapped bindings, and then production copy-unchanged-binding fills in the rest of the structure of the new state. Next, state S2 is selected as the current state and operator instances are created, with corresponding acceptable preferences, for all of the tiles that can move into the cell that now contains the blank. On the next decision cycle (cycle 5), none of the operators dominate the others, and an impasse occurs.
(The top of the figure shows state S1, the operator O1 (down), and the resulting state S2.)

Cycle 0:
  G1 is the current goal. G1 is already augmented with solve-eight-puzzle.
Cycle 1:
  1E   select-eight-puzzle-space: Make an acceptable preference for eight-puzzle.
  1D   Select eight-puzzle as problem space.
Cycle 2:
  2E   define-final-state: Augment the goal with the desired state (D1).
  2E   define-initial-state: Make an acceptable preference for S1.
  2D   Select S1 as state.
Cycle 3:
  3.1E instantiate-operator: Create O1 (down) and an acceptable preference for it.
  3.1E instantiate-operator: Create O2 (right) and an acceptable preference for it.
  3.1E instantiate-operator: Create O3 (left) and an acceptable preference for it.
  3.2E mea-operator-selection (O1 down): Make a best preference for down.
  3D   Select O1 (down) as operator.
Cycle 4:
  4.1E create-new-state: Make an acceptable preference for S2, swap bindings.
  4.2E copy-unchanged-binding (seven parallel firings): Copy over unmodified bindings.
  4D   Select S2 as state.
Cycle 5:
  5E   instantiate-operator: Create O4 (down) and an acceptable preference for it.
  5E   instantiate-operator: Create O5 (right) and an acceptable preference for it.
  5E   instantiate-operator: Create O6 (left) and an acceptable preference for it.
  5E   instantiate-operator: Create O7 (up) and an acceptable preference for it.
  5E   avoid-undo (O7 up): Make a worst preference for up.
  5D   Tie impasse, create subgoal.

FIG. 16. Trace of initial Eight Puzzle problem solving.
2.4. Impasse and subgoals
When attempting to make progress in attaining a goal, the knowledge directly available in the problem space (encoded in SOAR as productions) may be inadequate to lead to a successful choice by the decision procedure. Such a situation occurred in the last decision cycle of the Eight Puzzle example in Fig. 16. The knowledge directly available about the Eight Puzzle was incomplete: it did not specify which of the operators under consideration should be selected. In general, impasses occur because of incomplete or inconsistent information. Incomplete information may yield a rejection, tie, or no-change impasse, while inconsistent information yields a conflict impasse. When an impasse occurs, returning to the elaboration phase cannot deliver additional knowledge that might remove the impasse, for elaboration has already run to quiescence. Instead, a subgoal and a new context is created for each impasse. By responding to an impasse with the creation of a subgoal, SOAR is able to deliberately search for more information that can lead to the resolution of the impasse. All types of knowledge, task-implementation and search-control, can be encoded in the problem space for a subgoal.

If a tie impasse between objects for the same slot arises, the problem solving to select the best object will usually result in the creation of one or more desirability preferences, making the subgoal a locus of search-control knowledge for selecting among those objects. A tie impasse between two objects can be resolved in a number of ways: one object is found to lead to the goal, so a best preference is created; one object is found to be better than the other, so a better preference is created; no difference is found between the objects, so an indifferent preference is created; or one object is found to lead away from the goal, so a worst preference is created. A number of different problem solving strategies can be used to generate these outcomes, including: further elaboration of the tied objects (or the other objects in the context) so that a detailed comparison can be made; look-ahead searches to determine the effects of choosing the competing objects; and analogical mappings to other situations where the choice is clear. If a no-change impasse arises with the operator slot filled, the problem solving in the resulting subgoal will usually involve operator implementation, terminating when an acceptable preference is generated for a new state in the parent problem space. Similarly, subgoals can create problem spaces or initial states when the required knowledge is more easily encoded as a goal to be achieved through problem space search, rather than as a set of elaboration productions.

When the impasse occurs during the fifth decision cycle of the Eight Puzzle example in Fig. 16, the following goal and context are added to working memory.

(goal G2 ↑supergoal G1 ↑impasse tie ↑choices multiple ↑role operator ↑item O4 O5 O6 ↑problem-space undecided ↑state undecided ↑operator undecided)
The subgoal is simply a new symbol augmented with: the supergoal (which links the new goal and context into the context stack); the type of impasse;
whether the impasse arose because there were no choices or multiple choices in the maximal choices set; the role where the impasse arose; the objects involved in conflicts or ties (the items); and the problem space, state, and operator slots (initialized to undecided). This information provides an initial definition of the subgoal by defining the conditions that caused it to be generated and the new context. In the following elaboration phase, the subgoal can be elaborated with a suggested problem space, an initial state, a desired state or even path constraints. If the situation is not sufficiently elaborated so that a problem space and initial state can be selected, another impasse ensues and a further subgoal is created to handle it.

Impasses are resolved by the addition of preferences that change the results of the decision procedure. When an impasse is resolved, allowing problem solving to proceed in the context, the subgoal created for the impasse has completed its task and can be terminated. For example, if there is a subgoal for a tie impasse at the operator role, it will be terminated when a new preference is added to working memory that either rejects all but one of the competing operators, makes one a best choice, makes one better than all the others, etc. The subgoal will also be terminated if new preferences change the state or problem space roles in the context, because the contents of the operator role depends on the values of these higher roles. If there is a subgoal created for a no-change impasse at the operator role (usually because of an inability to implement the operator directly by rules in the problem space), it can be resolved by establishing a preference for a new state, most likely the one generated from the application of the operator to the current state. In general, any change to the context at the affected role or above will lead to the termination of the subgoal. Likewise, a change in any of the contexts above a subgoal will lead to the termination of the subgoal because it depends on the higher contexts for its existence. Resolution of an impasse terminates all goals below it.

When a subgoal is terminated, many working-memory elements are no longer of any use since they were created solely for internal processing in the subgoal. The working-memory manager removes these useless working-memory elements from terminated subgoals in essentially the same way that a garbage collector in LISP removes inaccessible CONS cells. Only the results of the subgoal are retained: those objects and preferences in working memory that meet the criteria of linked access to the unterminated contexts, as defined in Section 2.3.1. The context of the subgoal is itself inaccessible from supergoals (its supergoal link is one-way), so it is removed.

The architecture defines the concept of goal termination, not the concept of goal success or failure. There are many reasons why a goal should disappear and many ways in which this can be reflected in the preferences. For instance, the ordinary (successful) way for a subgoal implementing an operator to terminate is to create the new result state and preferences that enable it to be
selected (hence leading to the operator role becoming undecided). But sometimes it is appropriate to terminate the subgoal (with failure) by rejecting the operator or selecting a prior state, so that the operator is never successfully applied.

Automatic subgoal termination at any level of the hierarchy is a highly desirable, but generally expensive, feature of goal systems. In SOAR, the implementation of this feature is not expensive. Because the architecture creates all goals, it has both the knowledge and the organization necessary to terminate them. The decision procedure iterates through all contexts from the top, and within each context, through the different roles: problem space, state and operator. Almost always, no new preferences are available to challenge the current choices. If new preferences do exist, then the standard analysis of the preferences ensues, possibly determining a new choice. If everything remains the same, the procedure continues with the next lower slot; if the value of a slot changes, then all lower goals are terminated. The housekeeping cost of termination, which is the removal of irrelevant objects from working memory, is independent of how subgoal termination occurs.
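The architectural bookkeeping described in this section can be pictured with a small sketch. The Python below is illustrative only (the context stack is just a list of dictionaries; none of this is SOAR's actual implementation), showing how an impasse pushes a new context and how a change to a slot terminates everything below it.

import itertools
_goal_ids = itertools.count(2)

def push_subgoal(context_stack, impasse, role, items):
    """Create the goal and context for an impasse, as in the (goal G2 ...) element above."""
    goal = {'id': 'G%d' % next(_goal_ids),
            'supergoal': context_stack[-1]['goal']['id'],
            'impasse': impasse, 'role': role, 'items': list(items)}
    context_stack.append({'goal': goal, 'problem-space': 'undecided',
                          'state': 'undecided', 'operator': 'undecided'})

def install_change(context_stack, level, slot, new_object):
    """Install a decision: change one slot, reset lower slots, terminate lower contexts."""
    roles = ['problem-space', 'state', 'operator']
    context_stack[level][slot] = new_object
    for lower in roles[roles.index(slot) + 1:]:
        context_stack[level][lower] = 'undecided'
    del context_stack[level + 1:]   # every lower context depended on this one for its existence

stack = [{'goal': {'id': 'G1'}, 'problem-space': 'P1', 'state': 'S1', 'operator': 'undecided'}]
push_subgoal(stack, impasse='tie', role='operator', items=['O4', 'O5', 'O6'])
install_change(stack, level=0, slot='state', new_object='S2')   # a resolving preference arrives:
assert len(stack) == 1                                          # the tie subgoal is terminated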
2.5. Default knowledge for subgoals

An architecture provides a frame within which goal-oriented action takes place. What action occurs depends on the knowledge that the system has. SOAR has a basic complement of task-independent knowledge about its own operation and about the attainment of goals within it that may be taken as an adjunct to the architecture. A total of fifty-two productions embody this knowledge. With it, SOAR exhibits reasonable default behavior; without it (or other task knowledge), SOAR can flounder and fall into an infinitely deep series of impasses. We describe here the default knowledge and how it is represented. All of this knowledge can be overridden by additional knowledge that adds other preferences.
2.5.1. Common search-control knowledge

During the problem solving in a problem space, search-control rules are available for three common situations that require the creation of preferences.

(1) Backup from a failed state. If there is a reject preference for the current state, an acceptable preference is created for the previous state. This implements an elementary form of backtracking.

(2) Make all operators acceptable. If there are a fixed set of operators that can apply in a problem space, they should be candidates for every state. This is accomplished by creating acceptable preferences for those operators that are directly linked to the problem space.

(3) No operator retry. Given the deterministic nature of SOAR, an operator will create the same result whenever it is applied to the same state. Therefore, once an operator has created a result for a state in some context, a preference is created to reject that operator whenever that state is the current state for a context with the same problem space and goal.
2.5.2. Diagnose impasses

When an impasse occurs, the architecture creates a new goal and context that provide some specific information about the nature of the impasse. From there, the situation must be diagnosed by search-control knowledge to initiate the appropriate problem solving behavior. In general this will be task-dependent, conditional on the knowledge embedded in the entire stack of active contexts. For situations in which such task-dependent knowledge does not exist, default knowledge exists to determine what to do.

(1) Tie impasse. Assume that additional knowledge or reasoning is required to discriminate the items that caused the tie. The selection problem space (described below) is made acceptable to work on this problem. A worst preference is also generated for the problem space, so that any other proposed problem space will be preferred.

(2) Conflict impasse. Assume that additional knowledge or reasoning is required to resolve the conflict and reject some of the items that caused the conflict. The selection problem space is also the appropriate space and it is made acceptable (and worst) for the problem space role. (There has been little experience with conflict subgoals so far; thus, little confidence can be placed in the treatment of conflicts and they will not be discussed further.)
(3) No-change impasse.

(a) For goal, problem space and state roles. Assume that the next higher object in the context is responsible for the impasse, and that a new path can be attempted if the higher object is rejected. Thus, the default action is to create a reject preference for the next higher object in the context or supercontext. The default action is taken only if a problem space is not selected for the subgoal that was generated because of the impasse. This allows the default action to be overridden through problem solving in a problem space selected for the no-change impasse. If there is a no-change impasse for the top goal, problem solving is halted because there is no higher object to reject and no further progress is possible.

(b) For the operator role. Such an impasse can occur for multiple reasons. The operator could be too complex to be performed directly by productions, thus needing a subspace to implement it, or it could be incompletely specified, thus needing to be instantiated. Both of these require task-specific problem spaces and no appropriate default action based on them is available. A third possibility is that the operator is inapplicable to the given state, but that it would apply to some other state. This does admit a domain-independent response, namely attempting to find a state
in the same problem space to which the operator will apply (operator subgoaling). This is taken as the appropriate default response.

(4) Rejection impasse. The assumption is the same as for (nonoperator) no-change subgoals: the higher object is responsible and progress can be made by rejecting it. If there is a rejection impasse for the top problem space, problem solving is halted because there is no higher object.
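As a summary of the defaults above, the table below restates them in Python form. It is purely illustrative; in SOAR this knowledge is carried by default productions that create preferences, not by a lookup table.

default_response = {
    'tie':       'propose the selection problem space (with a worst preference for it)',
    'conflict':  'propose the selection problem space (with a worst preference for it)',
    'no-change': 'reject the next higher context object; for an operator no-change, '
                 'attempt operator subgoaling instead',
    'rejection': 'reject the next higher context object; halt if the impasse is at the top',
}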
2.5.3. The selection problem space

This space is used to resolve ties and conflicts. The states of the selection space contain the candidate objects from the supercontext (the items associated with the subgoal). Figure 17 shows the subgoal structure that arises in the Eight Puzzle when there is no direct search-control knowledge to select between operators (such as the mea-operator-selection production). Initially, the problem solver is at the upper-left state and must select an operator. If search control is unable to uniquely determine the next operator to apply, a tie impasse arises and a subgoal is created to do the selection. In that subgoal, the selection problem space is used.

The one operator in the selection space, evaluate-object, is a general operator that is instantiated with each tying (or conflicting) object; that is, a unique evaluate-object operator is created for each instantiation. Each state in the selection space is a set of evaluations produced by evaluate-object operators (the contents of these states is not shown in the figure).

FIG. 17. The subgoal structure for the Eight Puzzle.

In the figure,
an evaluate-object operator is created for each of the tied operators: down, left, and right. Each evaluate-object operator produces an evaluation that allows the creation of preferences involving the objects being evaluated. This requires task-specific knowledge, so either productions must exist that evaluate the contending objects, or a subgoal will be created to perform this evaluation (see below for a default strategy for such an evaluation). Indifferent preferences are created for all of the evaluate-object operators so that a selection between them can be made without an infinite regression of tie impasses. If all of the evaluate-object operators are rejected, but still no selection can be made, problem solving in the selection problem space will have failed to achieve the goal of resolving the impasse. When this happens, default knowledge (encoded as productions) exists that makes all of the tied alternatives indifferent (or, correspondingly, rejects all of the conflicting alternatives). This allows problem solving to continue.
2.5.4. The evaluation subgoal

In the selection problem space, each evaluate-object operator must evaluate the item with which it is instantiated. Task-dependent knowledge may be available to do this. If not, a no-change impasse will occur, leading to a subgoal to implement the operator. One task-independent evaluation technique is look-ahead: try out the item temporarily to gather information. This serves as the default. For this, productions reconstruct the task context (i.e., the supercontext that led to the tie), making acceptable preferences for the objects selected in the context and augmenting the new goal with information from the original goal. In Fig. 17, the original task problem space and state are selected for the evaluation subgoals. Once the task context has been reconstructed, the item being evaluated, the down operator, is selected (it receives a best preference in the evaluation subgoal). This allows the object to be tried out and possibly an evaluation to be produced based on progress made toward the goal.

When knowledge is available to evaluate the states in the task space, the new state produced in the evaluation subgoal will receive an evaluation, and that value can be backed up to serve as the evaluation for the down operator in this situation. One simple Eight Puzzle evaluation is to compute the number of tiles that are changed relative to the locations in the desired state. A value of 1 is assigned if the moved tile is out of position in the original state and in position in the result state. A value of 0 is assigned if the moved tile is out of position in both states. A value of -1 is assigned if the moved tile is in position in the original state and out of position in the result state. When an evaluation has been computed for down, the evaluation subgoal terminates, and then the whole process is repeated for the other two operators (left and right). These evaluations can be used to generate preferences among the competing operators.
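The default look-ahead evaluation can be sketched in a few lines. The Python below is illustrative only; it reuses the binding representation sketched in Section 2.2, and the operator and desired-state encodings are hypothetical. It scores a tied operator by trying it out and then backs the scores up into better and indifferent preferences.

def evaluate_operator(state, operator, desired):
    """Try the operator and apply the simple +1/0/-1 evaluation to the moved tile."""
    moved_tile = next(b['tile'] for b in state if b['cell'] == operator['tile-cell'])
    was_in_place = desired[operator['tile-cell']] == moved_tile
    now_in_place = desired[operator['blank-cell']] == moved_tile   # the tile ends up in the blank's cell
    if now_in_place and not was_in_place:
        return 1
    if was_in_place and not now_in_place:
        return -1
    return 0

def resolve_tie(tied_operators, state, desired):
    """Back the evaluations up into preferences among the tied operators."""
    scores = {op['name']: evaluate_operator(state, op, desired) for op in tied_operators}
    preferences = []
    names = sorted(scores)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if scores[a] > scores[b]:
                preferences.append((a, 'better', b))
            elif scores[a] < scores[b]:
                preferences.append((b, 'better', a))
            else:
                preferences.append((a, 'indifferent', b))
    return preferences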
FIG. 18. A trace of steepest-ascent hill climbing.
Since down creates a state with a better evaluation than the other operators, better preferences (signified by > in the figure) are created for down. An indifferent preference (signified by = in the figure) is also created for left and right because they have equal evaluations. The preferences for down lead to its selection in the original task goal and problem space, terminating the tie subgoal. At this point down is reapplied to the initial state, the result is selected and the process continues. Figure 18 shows, in a state space representation, two steps of the search that occurs within the Eight Puzzle problem space. The distinctive pattern of moves in Fig. 18 is that of steepest-ascent hill climbing, where the state being selected at each step is the best at that level according to the evaluation function. These states were generated in the attempt to solve many different subgoals, rather than from the adoption of a coordinated method of hill climbing in the original task space.

Other types of search arise in a similar way. If no knowledge to evaluate states is available except when the goal is achieved, a depth-first search arises. If it is known that every other move is made by an opponent in a two-player game, a mini-max search emerges. The emergence of methods directly from knowledge in SOAR is discussed further in Section 3.2.

2.6. Chunking
Chunking is a learning scheme for organizing and remembering ongoing experience automatically on a continuing basis. It has been much studied in psychology [7, 12, 43, 50] and it was developed into an explicit learning mechanism within a production system architecture in prior work [35, 61, 63]. The current chunking scheme in SOAR is directly adapted from this latter work. As defined there, it was a process that acquired chunks that generated the results of a goal, given the goal and its parameters. The parameters of a goal
The parameters of a goal were defined to be those aspects of the system existing prior to the goal's creation that were examined during the processing of the goal. Each chunk was represented as a set of three productions: one that encoded the parameters of a goal, one that connected this encoding in the presence of the goal to (chunked) results, and one that decoded the results. Learning was bottom-up: chunks were built only for terminal goals, goals for which there were no subgoals that had not already been chunked. These chunks improved task performance by substituting efficient productions for complex goal processing. This mechanism was shown to work for a set of simple perceptual-motor skills based on fixed goal hierarchies [61], and it exhibited the power-law speed improvement characteristic of human practice [50].

Currently, SOAR does away with one feature of this chunking scheme, the three-production chunks, and allows greater flexibility on a second, the bottom-up nature of chunking. In SOAR, single-production chunks are built either for terminal subgoals only or for every subgoal, depending on the user's option. The power of chunking in SOAR stems from SOAR's ability to generate goals automatically for problems in any aspect of its problem-solving behavior: a goal to select among alternatives leads to the creation of a chunk production that will later control search; a goal to apply an operator to a state leads to the creation of a chunk production that directly implements the operator. The occasions of subgoals are exactly the conditions where SOAR requires learning, since a subgoal is created if and only if the available knowledge is insufficient for the next step in problem solving. The subgoal is created to find the necessary knowledge, and the chunking mechanism stores away the knowledge so that under similar circumstances in the future the knowledge will be available. Actually, SOAR learns what is necessary to avoid the impasse that led to the subgoal, so that henceforth a subgoal will be unnecessary, as opposed to learning to supply results after the subgoal has been created.

As search-control knowledge is added through chunking, performance improves via a reduction in the amount of search. If enough knowledge is added, there is no search; what is left is an efficient algorithm for a task. In addition to reducing search within a single problem space, chunks can completely eliminate the search of entire subspaces whose function is to make a search-control decision or perform a task-implementation function (such as applying an operator or determining the initial state of the task).
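In spirit, this amounts to a memoization of subgoal results, keyed by whatever the subgoal's processing examined. The toy sketch below illustrates only that control relationship (an impasse leads to a subgoal, the subgoal's result is cached, and the cached knowledge prevents the impasse next time); it is not SOAR's production-system machinery, it keys crudely on the whole situation rather than on the aspects actually examined, and every name in it is invented for illustration.

    # Toy illustration of impasse-driven learning, not SOAR's implementation.
    # A "chunk" here is just a cached mapping from a situation to a result.

    chunks = {}

    def decide(situation, solve_in_subgoal):
        """Return a decision, subgoaling (and learning a chunk) only when needed."""
        key = frozenset(situation.items())
        if key in chunks:
            return chunks[key]                 # knowledge available: no impasse, no subgoal
        result = solve_in_subgoal(situation)   # impasse: deliberate problem solving in a subgoal
        chunks[key] = result                   # chunking: store what avoids this impasse
        return result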
2.6.1. The chunking mechanism

A chunk production summarizes the processing in a subgoal. The actions generate those working-memory elements that eliminated the impasse responsible for the subgoal (and thus terminated the subgoal). The conditions test those aspects of the current task that were relevant to those actions being performed. The chunk is created when the subgoal terminates, that is, when all of the requisite information is available.

The chunk's actions are based on the results of the subgoal: those working-memory elements created in the subgoal (or its subgoals) that are accessible from a supergoal. An augmentation is a result if its identifier either existed before the subgoal was created or is in another result. A preference is a result if all of its specified context objects (goal, problem space, state, and operator) either existed before the subgoal was created or are in another result.

The chunk's conditions are based on a dependency analysis of traces of the productions that fired during the subgoal. The traces are accumulated during the processing of the subgoal, and then used for condition determination at subgoal termination time. Each trace contains the working-memory elements that the production matched (condition elements) and those it generated (action elements).[10] Only productions that actually add something to working memory have their traces saved. Productions that just monitor the state (that is, only do output) do not affect what is learned, nor do productions that attempt to add working-memory elements that already exist (recall that working memory is a set).

Once a trace is created, it needs to be stored on a list associated with the goal in which the production fired. However, determining the appropriate goal is problematic in SOAR because elaborations can execute in parallel for any of the goals in the stack. The solution comes from examining the contexts tested by the production. The lowest goal in the hierarchy that is matched by conditions of the production is taken to be the one affected by the production firing. The production will affect the chunks created for that goal and possibly, as we shall see shortly, the higher goals. Because the production firing is independent of the lower goals (it would have fired whether they existed or not), it will have no effect on the chunks built for those goals.

When the subgoal terminates, the results of the subgoal are factored into independent subgroups, where two results are considered dependent if they are linked together or they both have links to a third result object. Each subgroup forms the basis for the actions of one production, and the conditions of each production are determined by an independent dependency analysis. The effect of factoring the results is to produce more productions, with fewer conditions and actions in each, and thus more generality than if a single production was created that had all of the actions together.

For each set of results, the dependency analysis procedure starts by finding those traces that have one of the results as an action element. The condition elements of these traces are then divided up into those that existed prior to the creation of the subgoal and those that were created in the subgoal. Those created prior to the subgoal become conditions of the chunk. The others are then recursively analyzed as if they were results, to determine the pre-subgoal elements that were responsible for their creation.
[10] If there is a condition that tests for the absence of a working-memory element, a copy of that negated condition is saved in the trace with its variables instantiated from the values bound elsewhere in the production.
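The backward dependency analysis can be sketched as follows, under the simplifying assumptions that each trace is a pair of sets (condition elements, action elements) of hashable working-memory elements and that the elements existing before the subgoal are given as a set; negated conditions, result factoring, and the goal-stack bookkeeping described above are omitted, and all names are illustrative.

    # Sketch of condition determination by backward dependency analysis.
    # traces: list of (condition_elements, action_elements) pairs, one per firing.
    # pre_existing: working-memory elements that existed before the subgoal.

    def chunk_conditions(results, traces, pre_existing):
        """Collect the pre-subgoal elements that the results ultimately depend on."""
        conditions, pending, seen = set(), list(results), set()
        while pending:
            element = pending.pop()
            if element in seen:
                continue
            seen.add(element)
            for condition_elements, action_elements in traces:
                if element in action_elements:      # a firing that helped create it
                    for c in condition_elements:
                        if c in pre_existing:
                            conditions.add(c)       # becomes a chunk condition
                        else:
                            pending.append(c)       # created in the subgoal: recurse
        return conditions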
Earlier versions of chunking in SOAR [36] implicitly embodied the assumption that problem solving was perfect: if a rule fired in a subgoal, then that rule must be relevant to the generation of the subgoal's results. The conditions of a chunk were based on the working-memory elements matched by all of the productions that fired in the subgoal. When the assumption was violated, as it was when the processing involved searches down paths that led to failure, overly specific chunks were created. By working backward from the results, the dependency analysis includes only those working-memory elements that were matched by the productions that actually led to the creation of the results. Working-memory elements that are examined by productions, but that turn out to be irrelevant, are not included.

A generalization process allows the chunk to apply in a future situation in which there are objects with the same descriptions, but possibly different identifiers. Once the set of chunk productions is determined, they are generalized by replacing the identifiers in the working-memory elements with variables. Each identifier serves to tie together the augmentations of an object, and serves as a pointer to the object, but carries no meaning of its own; in fact, a new identifier is generated each time an object is created. Constant symbols, those that are not used as the identifiers of objects, are not modified by this variabilization process; only the identifiers are. All instances of the same identifier are replaced by the same variable. Different identifiers are replaced by different variables, which are forced to match distinct identifiers. This scheme may sometimes be in error, creating productions that will not match when two elements just happen to have the same (or different) identifiers, but it always errs by being too constraining.

The final step in the chunk creation process is to perform a pair of optimizations on the chunk productions. The first optimization simplifies productions learned for the implementation of a complex operator. As part of creating the new state, much of the substructure of the prior state may be copied over to the new state. The chunk for this subgoal will have a separate condition, with an associated action, for each of the substructures copied. The chunk thus ends up with many condition-action pairs that are identical except for the names of the variables. If such a production were used in SOAR in a new situation, a huge number of instantiations would be created, one for every permutation of the objects to be copied. The optimization eliminates this problem by removing the conditions that copy substructure from the original production. For each type of substructure being copied, a new production is created that includes a single condition-action pair that will copy substructure of that type. Since all of the actions are additive, no ordering of the actions has to be maintained, and the resulting set of rules will copy all of the substructure in parallel.
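The variabilization step can be pictured as below, assuming working-memory elements are modeled as tuples of symbols and that the set of symbols serving as object identifiers is known; the additional constraint that distinct variables must bind distinct identifiers is only noted in a comment, and the names here are illustrative rather than SOAR's.

    # Sketch of variabilization: identifiers become variables, constants stay.
    # In SOAR, distinct variables are additionally required to match distinct
    # identifiers; that constraint is not represented in this data structure.

    def variabilize(elements, identifiers):
        """Replace each identifier consistently with a variable; leave constants alone."""
        var_for = {}
        def generalize(symbol):
            if symbol not in identifiers:
                return symbol                                    # constant symbols are untouched
            if symbol not in var_for:
                var_for[symbol] = "<v%d>" % (len(var_for) + 1)   # same identifier, same variable
            return var_for[symbol]
        return [tuple(generalize(s) for s in element) for element in elements]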
The second optimization is to order the production conditions in an attempt to make the matcher faster. Each condition acts like a query, returning all of the working-memory elements that match the condition, and the overall match process returns all of the production instantiations that match the conjunctive queries specified by the condition sides of the productions. The efficiency of such a match process is heavily dependent on the order of the queries [74]. By automatically ordering the conditions in SOAR, the number of intermediate instantiations of a production is greatly reduced and the overall efficiency improved.[11]
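The reordering SOAR performs is its own [74]; the sketch below only conveys the general idea of ordering conjunctive queries, greedily preferring conditions that introduce the fewest unbound variables (so they join against what is already matched) and, as a tie-breaker, that are expected to match the fewest elements. The attributes variables and estimated_matches are assumptions of this sketch, not SOAR data structures.

    # Greedy condition ordering in the spirit described above, not SOAR's reorderer.
    # Each condition is assumed to expose a set `variables` and an integer
    # `estimated_matches`; both attributes are invented for this sketch.

    def order_conditions(conditions):
        """Order conditions to keep intermediate instantiations small."""
        ordered, bound = [], set()
        remaining = list(conditions)
        while remaining:
            best = min(remaining,
                       key=lambda c: (len(c.variables - bound),   # new unbound variables
                                      c.estimated_matches))       # expected fan-out
            remaining.remove(best)
            ordered.append(best)
            bound |= best.variables
        return ordered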
2.6.2. An example of chunk creation

Figure 19 shows a trace of the productions that C