302 49 11MB
English Pages 217 Year 1997
Foreword
As robots are usedto perform increasinglydifficult tasks in complex and unstructured environments, it becomesmore of a challenge to reliably program their behavior. There are many sourcesof variation for a robot control program to contend with , both in the environment and in the performance characteristicsof the robot hardware and sensors. Because these are ofteQ too poorly understood or too complex to be adequately managed with static behaviors hand- coded by a programmer, robotic systemssometimeshave undesirablelimitations in their robustness, efficiency , or competence. Machine learning techniquesoffer an attractive and innovative way to overcometheselimitations. In principle, machine learning techniquescan allow a robot to success fully adapt its behavior in responseto changing circumstanceswithout the intervention of a human programmer. The exciting potential of robot learning has only recently been investigated in a handful of groundbreaking experiments. Considerably more work must be done to progress beyond simple researchprototypes to techniquesthat have practical significance. In particular, it is clear that robot learning will not be a viable practical option for engineersuntil systematicproceduresare available for designing, developing, and testing robotic systemsthat use machine learning techniques. This book is an important first step toward that goal. Marco Dorigo and Marco Colombetti are two of the pioneersmaking significant contributions to a growing body of researchin robot learning. Both authors have years of hands-on experiencedeveloping control programs for a variety of real robots and simulated artificial agents. They have also made significant technical contributions in the subfield of machine learning known are " reinforcement learning" . Their considerable knowledge about both robotics and machine learning, together
Foreword
with their sharp engineering instincts for finding reliable and practical solutions to problems, has produced several successfulimplementations of learning in robots. In the process, the authors have developedvaluable . Their insights, ideas, insights about the methodology behind their success and experiencesare summarized clearly and comprehensively in the chaptersthat follow . Robot Shaping: An Experiment in BehaviorEngineeringprovides useful guidancefor anyone interestedin understandinghow to specify, develop, and test a robotic systemthat usesmachine learning techniques. The behavior analysis and training methodology advocatedby the authors is a well-conceivedprocedurefor robot developmentbasedon three key ideas: analysis of behavior into modular components; integration of machine learning considerationsinto the basic robot design; and specification of " a training strategy involving step-by-step reinforcement (or " shaping ). While more work remains to be done in areaslike behavior assessment , the elementsof the framework describedhere provide a solid foundation for others t ~ use and build on. Additionally , the authors have tackled some of the open technical issuesrelated to reinforcement learning. The " " family of techniquesthey use, referred to as classifiersystems , typically requires considerable expertise and experienceto implement properly. The authors have devisedsomepragmatic ways to make thesetechniques both easierto use and more reliable. Overall, this book is a significant addition to the robotics and machine " learning literature. It is the first attempt to specify an orderly behavior " engineering procedurefor learning robots. One of the goals the authors set for themselveswas to help engineersdevelop high-quality robotic systems . I think this book will be instrumental in providing the guidance engineersneedto build such systems.
LashonB. Booker The MITRE Corporation
Preface
This book is about designingand building learning autonomous robots. An autonomousrobot is a remarkableexampleof a devicethat is difficult to design and program becauseit must carry out its task in an environment at leastpartially unknown and unpredictable, with which it interacts through noisy sensorsand actuators. For example, a robot might have to move in an enpronment with unknown topology, trying to avoid still objects, peoplewalking around, and so forth ; to senseits surroundings, it might have to rely on sonarsand dead reckoning, which are notoriously inaccurate. Moreover, the actual effect of its actions might be very difficult to describea priori due to the complex interactions of the robot ' s effectorswith its physical environment. There is a fairly widespreadopinion that a good way of solving such problems would be to endow autonomous robots with the ability to acquire knowledge from experience, that is, from direct interaction with their environments. This book explores the construction of a learning robot that acquires knowledge through reinforcement learning, with reinforcementsprovided by a trainer that observesthe robot and evaluates its performance. One of our goals is to clarify what the two terms in our title , robot shapingand behaviorengineering , mean and how they relate to our main researchaim: designingand building learning autonomous robots. Robot shaping uses learning to translate suggestionsfrom an external trainer into an effective control strategy that allows a robot to achieve a goal. We borrowed " shaping" from experimental psychology becausetraining an artificial robot closely resembleswhat experimental psychologistsdo when they train an experimentalsubjectto produce a predefinedresponse. Our approachdiffers from most current researchon ~ Jearninga\J.QnQmQU robots in one important respect: the trainer plays a fundamental role in
Preface
the robot learning prdcess. Most of this book is aimed at showing how to use a trainer to developcontrol systemsfor simulated and real robots. We propose the term behaviorengineeringto characterizea new technological discipline, whose objective is to provide techniques, methodologies , and tools for developing autonomous robots. In chapter 7 we " describeone suchbehavior engineeringmethodology- behavior analysis " and training or BAT. Although discussedonly at the end of the volume, the BATmethodology permeatesall our researchon behavior engineering and, as will be clear from reading this book , is a direct offspring of our robot -shapingapproach. -. The book is divided into eight chapters. Chapter 1 servesas an introduction to the researchpresentedin chapters2 through 7 and focuseson the interplay between learning agents, the environments in which the learning agents live, and the trainers with which they interact during learning. Chapter 2 provides background on ALECsys, the learning tool we have used in our experimentalwork . We first introduce LCSo, a learning clas' sifier system: trongly inspired by Holland s semina] work , and then show how we improved this systemby adding new functionalities and by parallelizing it on a transputer. Chapter 3 discusses architectural choices and shaping policies as a function of the structure of behavior in building a successfullearning system. Chapter 4 presentsthe resultsof experimentscarried out in many different simulated environments, while chapter 5 presentsthe practical resultsof our robot -shapingapproach in developingthe control systemof a real robot. Extending our model to nonreactive tasks, in particular to sequential behavior patterns, chapter 6 showsthat giving the learning systema very simple internal state allows it to learn a dynamic behavior; chapter 7 describesBAT, a structured methodology for the developmentof learning autonomoussystemsthat learn through interaction with a trainer. Finally , chapter 8 discussesrelated work , draws someconclusions, and gives hints about directions the researchpresentedin this book may take in the near future.
Acknowledgments -
Our work was carried out at the Artificial Intelligenceand Robotics Projectof Politecnicodi Milano (PM-AI&R Project). We aregratefulto all membersof the projectfor their support, and in particular, to Marco ' Somalvico , the projects founderand director; to AndreaBonarini, who in addition to researchand teachingwas responsiblefor most of the local organizationwork; and to GiuseppeBorghi, VincenzoCaglioti, GiuseppinaGi'ni, and DomenicoSorrenti, for their invaluablework in the roboticslab. . The PM-AI&R Projecthasbeenan idealenvironmentfor our research we our with , colleagues Besidesenjoyingmany stimulatingdiscussions on real madefullestuseof the laboratoryfacilitiesfor our experiments SfER.and robots. GiuseppeBorghihad an activerole inrunning the HAM CRAB the 7 for in , we experiment CRABexperiments presented chapter ; at a who spent year the could rely on the cooperationof MukeshPatel, Fellowshipfundedby the EC PM-AI&R Projecton an ERCIM Research a Prato Previde . Emanuel , of Human Capital and Mobility Programme Milano di Universiti Medicine , of , the Instituteof Psychology , Faculty with connected issues with us severalconceptual experimental discussed . psychology Our work hasreceivedthefinancialsupportof a numberof Italian and /0" grant from a " 600 Europeaninstitutions. We gratefullyacknowledge MURST (Italian Ministry for University and Scientificand Techno) to Marco Colombetti for the years 1992 1994; a logical Research " " UniversityFund grant from Politecnicodi Milano to Marco Colom betti for theyear 1995;andan IndividualEC HumanCapitaland Mobility ProgrammeFellowshipto Marco Dorigo for the years1994- 1996. Many studentshave contributedto our researchby developingthe . We software , buildingsomeof the robots, and runningthe experiments
xviii
Acknowledgments
wish to thank them all. In particular, Enrico Sirtori , Stefano Michi , Roberto Pellagatti, Roberto Piroddi, and Rino Rusconi contributed to the developmentof ALBCSYS and to the experimentspresentedin chapters 4 and 5. Sergio Barbesta, Jacopo Finocchi, and Maurizio Goria contributed to the experimentspresentedin chapter 6; Franco Dorigo and Andrea Maesani to the experiments presentedin sections 5.3 and 7.3; Enrico Radice to the experiments presented in section 7.4; Massimo Papetti to the experiments presented in section 7.5. Autono Mouse II was designed and built by Graziano Ravizza of Logic Brainstorm; Autono Mouse IV and Autono Mouse V were designed and built by Franco Dorigo . Marco Dorigo is grateful to all his colleaguesat the I Rm I A lab of Universire Libre de Bruxellesfor providing a relaxed yet highly stimulating working environment. In particular, he wishesto thank Philippe Smets and Hugues Bersini for inviting him to join their group, and Hugues Bersini, Gianluca Bontempi, Henri Darquenne, Christine Decaestecker , Christine Defrise, Gianni Di Caro, Vittorio Gorrini , Robert Kennes, Bruno Marc Hal, Philip Miller , Marco Saerens , Alessandro Saffiotti, Tristan Salome, Philippe Smets, Thierry Vande Merckt , and Hong Xu for their part in the many interesting discussionshe and they had on robotics, learning, and other aspectsof life. Someof the ideas presentedin this book originated while Marco Dor igo was a graduate student at Politecnico di Milano . Many people helped make that period enjoyable and fruitful . Marco Dorigo thanks Alberto Colorni for his advice and support in writing a good dissertation, FrancescoMaffioli and Nello Scarabottolo for their thoughtful commentson combinatorial optimization and on parallel architectures, and Stefano Crespi Reghizzi for his work and enthusiasm as coordinator of the graduate study program. Finally , a big thank-you to our families, and to our wives, Emanuela and Laura, in particular . This book is dedicatedto them, and to Luca, the newborn son of Laura and Marco Dorigo .
1 Chapter Robots Shaping
1.1 I NTRODUCfl ON
of learningrobotic on thedevelopment Thisbookis aboutour research of thePolitecnico Robotics and the Artificial at Project Intelligence agents robotic autonomous to is di Milano. Ourlong-termresearch develop goal . The results in natural environment behavior a agents capablfof complex ~ in thebookcanonlybeviewedasa firststepin thisdirection reported . one a we , , although hope significant in autonomous agentson Althoughthepotentialimpactof advances of itself the for roboticsandautomation , development autonomous speaks task, involvinghighlysophisticated robotsis a particularlydemanding mechanics , control, andthe like. , datacollectionandprocessing of agentdevelopmen for a unified is need there we believe Indeed discipline , " " this we ; hope bookwill provide engineering , whichwecall behavior . in thisdirection a contribution of similar,well-established is reminiscent Thetermbehavior engineering . Byintroducing and termslikesoftware engineering engineering knowledge involved the of the stress want to we a newterm, problems specificity the close robots of autonomous in the development , and relationship ' of artificialagentsand the naturalsciences the development between of thenotion . Althoughtherelevance of organisms studyof thebehavior see been in has robotics behavior , of ( , for example explicitlyrecognized 1990 Brooks Maes and 1994 ; Brooks1991 , ; ; Maes1990 ; Arkin 1990 1991 Steels1990 , and , Ruspini ; Saffiotti , 1993 ; DorigoandSchnepf , 1994 that , andRuspini1995 ), webelieve , Konolige ; andSaffiotti Konolige1993 been not have sciences behavioral and robotics thelinksbetween fully yet " . appreciated
Chapter1
It can be saidthat the mechanisms of learning, and the complexbalance betweenwhat is learnedand what is geneticallydetermined , arethe main concernof behavioralsciences . In our opinion, a similar situation arisesin autonomous robotics, wherea majorproblem . is in decidingwhat shouldbe explicitly designedand what shouldbe left for the robot to learnfrom experience . Because it may be very difficult, if not impossible , for a humandesignerto incorporateenoughworld knowledgeinto an agentfrom the very beginning , we believethat machinelearningtechniques will playa central role in the developmentof robotic agents . Agentscan reacha high performancelevel only if they are capableof " extractingusefulinformationfrom their " experience , that is, from the history of their interactionwith the environment . However,the development of interestingbehaviorpatternsis not merelya matterof agentselforgani . Robotic agentswill also have to perform tasksthat are " so usefulto us. Agents' behaviorwill thushaveto be properly" shaped that a predefined taskis carriedout. Borrowinga termfrom experimental " . In psychology(Skinner1938 ), we call sucha process" robot shaping this book we advocatean approachto robot shapingbasedon reinforcement learning. In addition to satisfyingintellectualcuriosity about the power of machinelearningalgorithmsto generateinterestingbehaviorpatterns , autonomousrobots are an ideal test bed for severalmachinelearning . Many journals haverecentlydedicatedspecialissuesto the techniques applicationof machinelearning techniquesto the control of robots (Gaussier1995 ; Krose 1995;Dorigo 1996; Franklin, Mitchell and Thrun 1996 ). Moreover, the useof machinelearningmethodsopensup a new rangeof possibilitiesfor the implementationof robot controllers . Architecture as neuralnetworksand learningclassifiersystems , for example , cannotbe practicallyprogrammedby hand- if we want to usethem, we haveto rely on learning. And therearegoodreasonswhy we might want to usethem . : sucharchitectures havepropertieslike robustness , reliability, and the ability to generalizethat can greatly contributeto the overall quality of the agent. Evenwhenlearningis not strictly necessary that , it is often suggested an appropriateuse of learningmethodsmight reducethe production costof robots. In our opinion, this statementhasyet to be backedwith sufficientevidence ; indeed , nontrivial learningsystemsare not easyto developand use, and the computationalcomplexityof learningcan be. veryhigh.
Robots Shaping On the other hand, it seemsto us that the reduction of production costs should not be sought too obsessively . Recent developmentsin applied in science software , computer particulary engineering(see, for example, Sommerville 1989; Ghezzi, Jazayeri, and Mandrioli 1991) have shown that a reduction of the global cost of a product, computed on its whole life cycle, cannot be obtained by reducing the costs of the development phase. On the contrary, low global costs demand high product quality , which can only be achieved through a more complex and expensive developmentprocess. The use of learning techniquesshould be viewed as a means to achievehigher behavioral quality for an autonomous agent; the usefulnessof the learning approach should thereforebe evaluatedin a wider context, taking into account all relevant facetsof agent quality . We believethis to be a strong argument in favor of behavior engineering, that is, an integrated treatment of all aspectsof agent development, including designand the use of learning techniques.
1.2 LEARNINGTO BEHAVE
The term learningis far too generic; we therefore needto delimit the kind of learning models we have in mind. From the point of view of a robot developer, an important differencebetweendifferent learning models can be found in the kind and complexity of input information specifically required to guide the learning process. In these terms, the two extreme approaches are given by supervisedlearning and by self-organizing systems . While the latter do not use any specific feedback information to implement a learning process, the former require the definition of a training set, that is, a set of labeledinput - output pairs. Unfortunately, in a realistic application the construction of a training set can be a very demanding task, and much effort can be wasted in labeling sensory configurations that are very unlikely to occur. On the other hand, selforganiza has a limited range of application. It can only be used to discover structural patterns already existing in the environment; it cannot be usedto learn arbitrary behaviors. In terms of information requirements, an intermediate case is represented by reinforcementlearning ( RL ). RL can be seenas a classof problems in which an agent learns by trial and error to optimize a function " " (often a discounted sum) of a scalar called reinforcement (Kaelbling, Littman , and Moore 1996). Therefore, algorithms for solving RL problems do not require that a complete training set be specified: they only
Chapterl
need a scalar evaluation of behavior (that is, a way to compute reinforcem ) in order to guide learning. Such an evaluation can be either produced by the agent itself as the result of its interaction with the environmen or provided by a trainer, that is, an external observercapableof how well the agent approximates the desired behavior (Dorigo judging and Colombetti 1994a, 1994b; Dorigo 1995) . In principle, of course, it would be preferableto do without an external trainer: an artificial systemable to learn an appropriate behavior by itself is the dream of any developer. However, the use of a strictly internal reinforcementprocedureposesseverelimitations. A robot can evaluateits own behavior only when certain specificeventstake place, thus signaling that a desirable or undesirable state has been reached. For example, a ' positive reinforcement, or reward, can be produced once the robot s sensors detect that a goal destination has beenreached, and a negative reinforcem , or punishment, can be generatedonce an obstaclehas beenhit . Suchdelayedreinforcementshavethen to be usedto evaluatethe suitability of the robot ' s past course of action. While this processis theoretically possible, it tefodsto be unacceptably inefficient: feedback information is often too rare and episodicfor an effectivelearning processto take place in realistic robotic applications. A way to bypass this problem is to use a trainer to continuously monitor the robot ' s behavior and provide immediatereinforcements . To produce an immediate reinforcement, the trainer must be able to judge how well each single robot move fits into the desired behavior pattern. For example, the trainer might have to evaluatewhether a specificmovement brings the robot closer to its goal destination or keepsthe robot far enough from an obstacle. Using a human trainer would amount to a direct translation of the high-level, often implicit knowledge a human being has about problem solving into a robotic control program, avoiding the problems raised by the need to make all the intermediate stepsexplicit in the design of the robotic controller. Unforeseenproblemscould arise, however. Human evaluations could be too noisy, inaccurate, or inconsistent to allow the learning algorithm to convergein reasonabletime. Also , the reaction time . of humans is orders of magnitude slower than that of simulated robots, which could significantly increasethe time requiredto achievegood performance when training can be carried out in simulation. On the other hand, this problem will probably not emergewith real robots, whose reaction time is much slower due to the dynamicsof their mechanicalparts.
ShapingRobots
More researchis necessaryto understand whether the human trainer approach is a viable one. For the time being, we limit our attention to robots that learn by exploiting the reinforcementsprovided by a reinforceme program ( RP), that is, by an artificial trainer implementedas a computer program. The useof an RP might appear as a bad idea from the point of view of our initial goals. After all , we wanted to free the designerof the need to ' completely specify the robot s task. If the designer has to implement a detailed RP, why should this be preferable to directly programming the robot ' s controller? One reason is of an experimental nature. Making our robots learn by using an RP is a first, and simpler, step in the direction of using humanbeings: a successwith RPs opensup the door to further researchin using human-generatedreinforcements. But even if we limit our attention to RPs, there are good motivations to the RP approach, as opposed to directly programming the robot controller. Writing an RP can be an easier task than pro~ amming a controller becausethe information neededin the former caseis : noreabstract. To understandwhy it is so, let us restrict our attention to reactiveagents, that is, to agentswhoseactions are a function 1 solely of the current sensoryinput . The task of a reactivecontroller is to ; produce a control signal associatedwith the signalscoming from sensors the control messageis sent to the robot effectors, which in turn produce a physical effect by interacting with the environment. Therefore, to define a controller, the designermust know the exact format and meaning of the messagesproduced by the sensorsand required by the effectors. Moreover , the designerhas to rely on the assumptionthat in all situations the physical sensorsand effectorswill be consistentlycorrect. Suppose, for example, that we want a mobile robot with two independent wheelsto move straight forward. The designerwill then specify that both wheelsmust turn at the sameangular velocity. But supposethat for somereasonone wheelhappensto be somewhatsmallerthan the other. In this situation, for the robot to move straight on, the two wheelsmust turn at different velocities. Defining a reactive controller for a real robot (not an ideal one) thus appearsto be a demanding task: it requires that the designerknow all the details of the real robot , which often differ slightly from the ideal robot that servedas a model. Now consider the definition of an RP to train the robot to move straight forward. In this case, the designermight specify a reinforcement related to the direction in which the agent is moving, so that the highest
Chapter} evaluation is produced when such a direction is straight ahead. There is no need to worry about smaller wheelsand similar problems: the system will learn to take the control action, whichever it may be, that produces the desired result by taking into account the actual interaction between the robot and its environment. As often happens, the sameproblem could be solved in a different way. For example, the designermight design a feedbackcontrol systemthat makesthe robot move straight forward even with unequal wheels. However, this implies that the designeris aware of this specific problem in order to implement a specific solution. By contrast , RL solvesthe sameproblem as a side effect of the learning process and doesnot require the designereven to be aware of it .
1.3 SHAPINGAN AGENT' SBEBAVIOR As we have pointed out in the previous section, we view learning mechanisms as a way of providing the agent with the capacity to be trained. Indeed, we h~ve found that effectivetraining is an important issueper se, both conceptually and practically. First , let us analyzethe relationship betweenthe RP and the final controller built by the agent through learning. A way of directly programming a controller, without resorting to learning, would be to describethe robot ' s behavior in some high-level task-oriented language. In such a " " language, we would like to be able to say things like Reach position P " " or Grasp object 0 . However, similar statementscannot be directly translated into the controller' s machinecode unlessthe translator can rely on a world model, which representsboth general knowledge about the physical world and specificknowledge about the particular environment the robot acts in (figure 1.1) . Therefore, world modeling is a central issue for task-basedrobot programming. Unfortunately, everybody agreesthat world modeling is not easy, and somepeople even doubt that it is possible. For example, the difficulty of world modeling has motivated Brooks to radically reject the use of symbolic " , with the argument that the world is its own best representations " model ( Brooks 1990, 13) . Of course, this statementshould not be taken too literally (a town may be its own bestmap, but we do find smallerpaper maps useful after all) . The point is that much can be achievedwithout a world model, or at least without a human- designedworld model. While there is, in our opinion, no a priori reasonwhy symbolic models should be rejected out of hand, there is a problem with the models that
Shaping Robots
-level ion high descript ofbehavior Translat or world model program control
1.1 Ji1gure
Translating high-level description of behavior into control program
human designerscan provide. Such models are biased for at least two reasons: ( 1) they tend to reflect the worldview determinedby the human sensorysystem; and (2) they inherit the structure of linguistic descriptions usedto formulate them off-line. An autonomousagent, on the other hand, should build its own worldview, basedon its sensorimotorapparatusand " " which cognitive abilities ( might not include anything resemblinga linguistic ' . ability) One possiblesolution is to usea machinelearning system. The idea is to let the agent organize its own behavior starting from its own " experience " , that is, the history of its interaction with the environment. Again, we start from a high-level description of the desiredbehavior, from which we build the RP. In other words, the RP is an implicit procedural representatio of the desiredbehavior. The learning processcan now be seenas the processof translating such a representationinto a low -level control program- a process, however, that does not rely on a world model and that is carried out through continuous interaction betweenagent and environme , then, behavioral learning with a trainer (figure 1.2). In a sense is a form of groundedtranslation of a high-level specificationof behavior into a control program. If we are interestedin training an agentto perform a complex behavior, we should look for the most effectiveway of doing so. In our experience,a good choice is to impose somedegreeof modularity on the training process . Instead of trying to teach the complete behavior at once, it is better to split the global behavior into components, to train each component separately, and then to train the agent to correctly coordinate the component behaviors. This procedureis reminiscentof the shapingprocedure that psychologistsuse in their laboratories when they train an animal to produce a predefinedresponse.
Chapter I high - level description of behavior
reinforcement
Learning system
physical environment
control program
flgare1.2 Learning as grounded translation
1.4 REINFORCEMENT LEARNING COMPUT AnON, , EVOLUDONARY ANDLEARNING CLASSIFIER SYSTEMS As we said, reinforcementlearning can be definedas the problem faced by an agent that learns by trial and error how to act in a given environment, receiving as the only source of learning information a scalar feedback known as " reinforcement" . Recently there has been much researchon algorithms that lend themselves to solving this type of problem efficiently. Typically, the result of the agentlearning activity is a policy , that is, a mapping from statesto actions that maximizes some function of the reward intake (where rewardr are positive reinforcements, while punishmentsare negativereinforcements). The study of algorithms that find optimal policies with respectto some objective function is the object of optimal control. Techniques to solve RL problems are part of " adaptive optimal control" (Sutton, Barto, and Williams 1991) and can take inspiration from existing literature (see Kaelbling, Littman , and Moore 1996). Indeed, an entire class of algorithms usedto solve RL problems modeledas Markov decisionprocesses Bertsekas 1987; Puterman 1994) took inspiration from dynamic programming ( Bellman 1957; Ross1983). To thesetechniquesbelong, amongothers, ( the " adaptive heuristic critic " ( Barto, Sutton, and Anderson 1983) and
Robots Shaping " " Q Iearning ( Watkins 1989), which try to learn an optimal policy without " " learning a model of the environment, and Dyna (Sutton 1990, " 1991) and " real-time dynamic programming ( Barto, Bradtke, and Singh 1995), which try to learn an optimal policy while learning a model of the environment at the sametime. All the previously cited approaches try to learn a value function. Informally , this meansthat they try to associatea value to eachstateor to each state-action pair such that it can be used to implement a control policy . The transition from value functions to control policies is typically very easyto implement. For example, if the value function is associatedto state-action pairs, a control policy will be defined by choosing in every state the action with the highestvalue. Another classof techniquestries to solveRL problemsby searchingthe spaceof policies instead of the spaceof value functions. In this case, the object of learning is the entire policy, which is treated as an atomic object. Examplesof thesemethodsare evolutionary algorithms to developneural network controllers ( Dress 1987; Fogel, Fogel, and Porto 1990; Beer and Gallagher'1992; Jacoby, Husbands, and Harvey 1995; Floreano and Mondada 1996; Moriarty and Mikkulainen 1996; and Tani 1996), or to develop computer programs (Cramer 1985; Dickmanns, Schmidhuber, and Winklhofer 1987; Hicklin 1986; Fujiki and Dickinson 1987; Grefenstette, Ramsey, and Schultz 1990; Koza 1992; Grefenstetteand Schultz 1994; and Nordin and Banzhaf 1996). In a learning classifiersystem ( LCS), the machine learning technique that we have investigatedand that we have usedto run all the experiments presentedin this book, the learned controller consists of a set of rules " " (called classifiers ) that associatecontrol actions with sensoryinput in order to implement the desiredbehavior. The learning classifier system was originally proposed within a context very different from optimal control theory. It was intended to build rule based systemswith adaptive capabilities in order to overcome the brittleness shown by traditional handmade expert systems ( Holland and Reitmann 1978; Holland 1986). Nevertheless , it has recently been shown that by operating a strong simplification on an LCS, one obtains a systemequivalent to Q-Iearning (Dorigo and Bersini 1994). This makes the connection betweenLCSs and adaptive optimal control much tighter. " " The LCS learns the usefulness ( strength ) of classifiersby exploiting " a " the bucket brigade algorithm, temporal differencetechnique (Sutton
Chapterl 1988) strongly related to Q-learning. Also , the LCS learns new classifiers by using a genetic algorithm . Therefore, in a LCS there are aspectsof both the approaches discussedabove: it learns a value function (the strength of the classifiers), and at the sametime it searches the spaceof possiblerules by exploiting an evolutionary algorithm . A detailed description of our implementation of an LCS is given in chapter 2.
1.5 THEAGENTS ANDTlIEIRENVIRONMENTS
Our work has been influenced by Wilson' s " Animat problem" ( 1985, 1987), that is, the problem of realizing an artificial systemable to adapt and survive in a natural environment. This meansthat we are interestedin behavioral patterns that are the artificial counterparts of basic natural , like feeding and escapingfrom predators. Our experimentsare responses therefore to be seen as possible solutions to fragments of the Animat problem. Behavior is a product of the interaction betweenan agent and its environmen . The' set of possiblebehavioral patterns is therefore determined by the structure and the dynamicsof both the agent and the environment, and by the interface between the two , that is, the agent' s sensorimotor apparatus. In this section, we briefly introduce the agents, the behavior patterns, and the environmentswe shall usein our experiments. 1.5.1 De Agents' " Bodies" Our typical artificial agent, the Autono Mouse, is a small moving robot. Two examples of Autono Mice, which we call " Autono Mouse II " and " Autono Mouse IV " are shown in figures 1.3 and 1.4. The simulated , Autono Mice used in the experimentspresentedin chapters4 and 6 are, unlessotherwisestated, the modelsof their physical counterparts. We also use Autono Mouse V , a slight variation of Autono Mouse IV , which will be describedin chapter 7. Besidesusing the homemadeAutono Mice, we have also experimented with two agentsbasedon commercial robots, which we call " HAM SIB R" " " and CRAB. HAMSTER (figure 1.5) is a mobile robot basedon Robuter, a commercial platform produced by RoboSoft. CRAB(figure 1.6) is a robotic arm based on a two-link industrial manipulator, an IBM 7547 with a ScARA geometry. More details on the robots and their environments will be given in the relevant chapters.
ShapingRobots
Film' e 1.3 Autono Mouse II
1.5.2 The Agents' " Mind " The Autono Mice are connectedto ALECsvs(A LEarning Classifier Sy Stem ), a learning classifiersystemimplementedon a network of transputers (Dorigo and Sirtori 1991; Colombetti and Dorigo 1993; Dorigo 1995), which will be presentedin detail in chapter 2. ALECsvs allows one to design the architecture of the agent controller as a distributed system. Each component of the controller, which we call a " behavioral module," can be connectedto other componentsand to the sensorimotorinterface. As a learning classifiersystemexploiting a genetic algorithm, ALECSYS is basedon the metaphor of biological evolution. This raisesthe question of whether evolution theory provides the right technical language to characterizethe learning process. At the presentstageof the research, we are inclined to think it doesnot. There are various reasonswhy the languageof evolution cannot literally apply to our agents. First, we use an evolutionary mechanismto implement individual learning rather than phylogeneticevolution. Second, the
Chapter }
F1gare1.4 Autono Mouse IV
distinction betweenphenotypeand genotype, essentialin evolution theory, is in our caserather confused; individual rules within a learning classifier systemplay both the role of a single chromosomeand of the phenotype undergoing natural selection. In our experiments, we found that we tend to consider the learning systemas a black box, able to produce stimulusresponseS-R) associationsand categorizationsof stimuli into relevant . More precisely, we expectthe learning systemto equivalenceclasses . discoveruseful associationsbetweensensoryinput and ; and responses . categorizeinput stimuli so that precisely those categorieswill emerge which are relevantly associatedto responses . Given theseassumptions, the sole preoccupationof the designeris that the interactions between the agent and the environment can produce enough relevant information for the target behavior to emerge. As it will appear from the experimentsreported in the following chapters, this concern inftuencesthe designof the artificial environment' s objectsand of the ' agent s sensoryinterface.
Robots Shaping
F1gm ' e 1.5 HAU~
Film' e 1.6 CRAB
ChapterI
1.5.3 Typesof Behavior
A first , rough classification allows one to distinguish between stimulusresponseS -R) behavior, that is, reactive responsesconnecting sensorsto effectors in a direct way, and dynamic behavior, requiring some kind of internal state to mediate betweeninput and output . Most of our experiments (chapters 4 and 5) are dedicated to S-R ]>ehavior, although in chapter 6 we investigate some forms of dynamic behavior. To go beyond simpleS -R behavior implies that the agent is endowed with some form of internal state. The most obvious candidate ' for an internal state is a memory of the agent s past experience( Lin and Mitchell 1992; Cliff and Ross 1994; Whitehead and Lin 1995) . The designerhas to decide what has to be remembered, how to remember it , and for how long. Such decisionscannot be taken without a prior understandin of relevant properties of the environment. In an experimentreported in section4.2.3, we added a sensormemory, ' that is, a memory of the past state of the agent s sensors, allowing the learning systemto exploit regularities of the environment. We found that a memory of past perceptionsfirst makesthe learning processharder, but eventually increasesthe performanceof the learned behavior. By running a number of such experiments, we confirmed an obvious expectation, namely, that the memory of past perceptionsis useful only if the relationship betweenthe agentand its environmentchangesslowly enoughto preserve a high correlation betweensubsequentstates. In other words, agents with memory are " fitter " only in reasonably predictable environments. In a second set of experiments, discussedat length in chapter 6, we improve our learning systemcapabilities by adding a module to maintain a state word, that is, an internal state. We show that, by the use of the stateword , we are able to make our Autono Mice learn sequentialbehavior patterns, that is, behaviors in which the decision of what action to perform at time t is influencedby the actions performed in the past. We believe that experimentson robotic agentsmust be carried out in the real world to be truly significant, although such experimentsare in generalcostly and time- consuming. It is therefore advisableto preselecta small number of potentially relevant experimentsto be performed in the real world. To carry out the selection, we use a simulated environment, which allows us to have accurateexpectationson the behavior of the real agent and to prune the set of possibleexperiments. One of the hypotheseswe want to explore is that relatively complex . behavioral patternscan be built bottom-up from a set of simple responses
Robots Shaping This hypothesishas already been put to test in robotics, for example, by Arkin ( 1990) and by Saffiotti, Konolige and Ruspini ( 1995) . Arkin ' s " Autonomous Robot Architecture" integrates different kinds of information data behavioral schemes , , and world knowledge) in (perceptual order to get a robot to act in a complex natural environment. The robot , like walking through a doorway, as acom generatescomplex responses bination of competing simpler responses , like moving aheadand avoiding a static obstacle (the wall , in Arkin ' s doorway example). The compositional approach to building complex behavior has been given a formal justification in the framework of multivalued logics by Saffiotti, Konolige, and Ruspini. The key point is that complex behavior can demonstrably . We have emergefrom the simultaneousproduction of simpler responses consideredfour kinds of basic responses : I . Approaching behavior, that is, getting closer to an almost still object with given features; in the natural world , this responseis a fundamental component of feeding and sexualbehavior. 2. Chasingbeha)'ior, that is, following and trying to catch a still or moving object with given features; like the approachingbehavior, this responseis important for feeding and reproduction. 3. A voidancebehavior, that is, avoiding physical contact with an object of a given kind ; this can be seenas the artificial counterpart of a behavioral pattern that allows an organism to avoid objectsthat can hurt it . 4. Escapingbehavior, that is, moving as far as possible from an object with given features; the object can be viewed as a predator. As we shall see, more complex behavioral patternscan be built from these simple responsesin many different ways.
1.5.4 The Environment Our robots are presently able to act in closed spaces , with smooth floors and constant lighting, where they interact with moving light sources, fixed obstacles, and sounds. Of course, we could fantasizefreely in simulations, by introducing virtual sensorsable to detect the desiredentities, but then resultswould not be valid for real experimentation. We thereforeprefer to adapt our goals to the actual capacitiesof the agents. The environment is not interesting per se, but for the kind of interactions that can take place betweenit and the agent. Considerthe four basic " responsesintroduced in the previous subsection. We can call them objectual ' " in that they involve the agent s relationship with an external object.
Chapter 1
Objectual responsesare . type-sensitivein that agent-object interactions are sensitiveto the type to which the object belongs(prey, obstacle, predator, etc.); and . location-sensitivein that agent-object interactions are sensitive to the relative location of the object with respectto the agent.
Type sensitivity is interesting becauseit allows for fairly complex patterns of interaction, which are, however, within the capacity of an S-R . agent It requiresonly that the agent be able to discriminate some object feature characteristic of the type. Clearly, the types of objects an S-R agent can tell apart dependon the physical interactions betweenexternal ' objects and the agent s sensoryapparatus. Note that an S-R agent is not able to identify an object, that is, discern two similar but distinct objects of the sametype. The interactions we considerdo not dependon the absolutelocation of the objects and of the agent; they depend only on the relative angular position, and sometimeson the relative distance, of the object with respect to the agent'. Again , this requirement is within the capacities of an S-R agent. In the context of shaping, differencesthat appearto an external observer can be relevant even if they are not perceivedby the agent. The reasonis that the trainer will in general basehis reinforcing activity on the observation of the agent' s interaction with the environment. Clearly, from the point of view of the agent, a singlemove of the avoidanceor the escaping behavior would be exactly the same, although in complex behavior patterns, avoidanceand escapingrelate differently to other behaviors. In general, avoidance should modulate some other movement response, whereasescapingwill be more successfulif it suppresses all competing . As we shall seein the following chapters, this fact is going to responses inftuence both the architectural design and the shaping policy for the agent. For learning to be successful , the environment must have a number of properties. Given the kind of agent we have in mind , the interaction of a ' physical object with the agentdependsonly on the object s type and on its relative position with respectto the agent. Therefore, sufficient information about object types and relative positions must be available to the agent. This problem can be solved in two ways: either the natural objects existing in the environment have sufficient distinctive featuresthat allow them to be identified and located by the agent or the artificial objectsmust
Shaping Robots
so that theycanbe identifiedandlocated. For example bedesigned , if we want the agentto chaselight LI and avoidlight L2, the two lightsmust be of differentcolor, or havea differentpolarizationplane, to be distinguished . In anycase , identificationwill bepossible by appropriatesensors if the environment . For example the rest of , if light sensing cooperates only is involved, environmentallighting shouldnot vary too muchduring the agent' s life. In order for a suitable responseto depend on an object' s position, ' objects must be still , or move slowly enough with respectto the agent s speed. This doesnot mean that a sufficiently smart agent could not evolve a successfulinteraction pattern with very fast objects, only that such a pattern could not depend on the instantaneousrelative position of the ' object, but would involve some kind of extrapolation of the object s . trajectory, which is beyond the presentcapacitiesof ALECSYS
1.6 RERA VIOR ENGINEERING AS AN EMPIRICAL ENDEA VOR We regard our own work as part of artificial intelligence (AI ). Although certainly departing from work in classical, symbolic AI , it is akin to researchin the " new wave" of AI that started in the early 1980sand is focusedon the developmentof simple but complete agentsacting in the real world . In classicalAI , computer programs have been regarded sometimesas formal theories and sometimesas experiments. We think that both views are defective. A computer program certainly is a formal object, but it lacks the degreeof conciseness , readability, and elegancerequired from a theory. Moreover, programs usually blend together high-level aspects of theoretical relevancewith low-level elementsthat are there only to make it perform at a reasonablelevel. But a program is not by itself an experiment; at most, it is the starting point to carry out experiments (see, for example, Cohen 1995). Borrowing the jargon of experimental psychologists, by instantiating a program with specific parameters, we " have an " experimental subject, whose behavior we can observe and analyze. It is surprising to note how little this point of view is familiar in computer sciencein general, and in AI in particular. The classicalknow-how of the empirical sciencesregarding experimental design, data analysis, and hypothesistesting is still almost ignored in our field. Presumably, the reason is that computer programs, as artificial objects, are erroneously
Chapterl believedto be easily understandableby their designers. It is important to realize that this is not true. Although a tiny , well-written computer program can be completely understood by analyzing it statically as a formal object, this has little to do with the behavior of complex programs, especiallly in the area of nonsymbolic processing, which lacks strong theoretical bases. In this case, a great deal of experimental activity is necessaryto understand the behavior of programs, which behavealmost like natural systems. A disciplined empirical approach appearsto be of central importance in what we have called " behavior engineering," that is, in the principled developmentof robotic agents. Not only is the agent itself too complex to be completely understood by analytical methods, but the agent also interacts with an environment that often cannot be completely characterized a priori . As we shall argue in chapter 7, the situation calls for an experimentalactivity organizedalong clear methodological principles. The work reported in this book is experimentallyoriented: it consistsof a reasonablY, large set of experimentsperformed on a set of experimental subjects, all of which are instantiations of the same software system, ALECsvs. However, we have to admit that many of our experimentswere below the standardsof the empirical sciences . For this, we have a candid, but true, explanation. When we started our work , we partly shared the common prejudice about programs, and we only partially appreciatedthe importance of disciplined experimentation. But we learnedalong the way, as we hope someof the experimentswill show.
1.7 POINTSTO REMEMBER
. Shaping a robot means exploiting the robot ' s learning capacities to translate suggestionsfrom an external trainer into an effective control strategy that allows the robot to perform predefinedtasks. . As opposedto most work in reinforcementlearning, our work stresses the importance of the trainer in making the learning pr.ocessefficient enough to be usedwith real robots. . The learning agent, the sensorimotorcapacitiesof the robot , the environme in which the robot acts, and the trainer are strongly intertwined. Any successfulapproach to building learning autonomous robots will have to considerthe relations among thesecomponents. Reinforcementlearning techniquescan be usedto implement robot controllers with interesting properties. In particular, such techniquesmake it .
Robots Shaping " " possibleto adopt architecturesbasedon soft computing methods, like neural networks and learning classifier systems, which cannot be effectively programmed by hand. . The approach to designingand building autonomousrobots advocated in this book is a combination of designand learning. The role of designis to determine the starting architecture and the input- output interfaces of the control system. The rules that will constitute the control systemare then learnedby a reinforcementlearning algorithm. . Behavior engineering, that is, the discipline of principled robot development , will have to rely heavily on an empirical approach.
Chapter2 ALECSYS
2.1 THE LEARNING CLASSIFIER SYSTEM PARADIGM Learning classifiersystemsare a classof machine learning models which were first proposed by Holland ( Holland and Reitmann 1978; Holland 1986; Holland et al. 1986; Booker, Goldberg, and Holland 1989). A learning classifier system is a kind of paralIel production rule system in which two ttnds of learning take place. First , reinforcement learning is used to changethe strength of rules. In practice, an apportionment of credit algorithm distributes positive reinforcementsreceivedby the learning systemto those rules that contributed to the attainment of goals and negativereinforcementsto those rules that causedpunished actions. Second , evolutionary learning is used to searchin the rule space. A genetic algorithm searchesfor possibly new useful rules by geneticoperationslike reproduction, selection, crossover, and mutation. The fitness function used to direct the searchis the rule strength. As proposed by Holland , the learning classifier system or LCS is a general model; a number of low-level implementation decisionsmust be made to go from the genericmodel to the running software. In this section we describe LCSo, our implementation of Holland ' s learning classifier system, which unfortunately exhibits a number of problems: rule strength oscillation, difficulty in regulating the interplay between the reinforcement system and the background genetic algorithm (GA ), rule chains . We identified two approachesto extend instability , and slow convergence LCSo with the goal of overcoming theseproblems: ( I ) increasethe power of a single LCSo by adding new operators and techniques, and (2) use many LCSoSin parallel. The first extension, presentedin section2.2, gave raise to the " improved classifier system" or ICS (Dorigo 1993), a more powerful version of LCSo that still retains its spirit . The secondextension,
Chapter 2
discussedin section 2.3, led to the realization of ALECSvs (Dorigo and Sirtori 1991; Dorigo 1992, 1995), a distributed version of I CS that allows for both low-levelparallelism, involving distribution of the ICS over a set of transputers, and high-level parallelism, that is, the definition of many concurrent cooperating I CSs. To introduce learning classifier systems, we consider the problem of a simple autonomous robot that must learn to follow a light source (this problem will be consideredin detail in chapter4) . The robot interactswith the environment by meansof light sensorsand motor actions. The LCS controls the robot by using a set of production rules, or classifiers, which representthe knowledgethe robot has of its environment and its task. The effect of control actions is evaluated by scalar reinforcement. Learning mechanismsare in charge of distributing reinforcementsto the classifiers responsiblefor performedactions, and of discoveringnew usefulclassifiers. An LCS comprisesthree functional modules that implement the controller and the two learning mechanismsinformally describedabove (see 2.1 figure ):
LCS Rule system -
discovery new
good lers
classif
lers
classif
perceptions E Performance system n v actions i r current strength changes
strengths ~ m e n
reinforcements credit
of Apportionment
t system
2.1 Ji1gare . Learningclassifiersystem
ALECSYS
1. The performancesystemgoverns the interaction of the robot with the external environment. It is a kind of parallel production system, implementing a behavioral pattern, like light following, as a set of conditionaction rules, or classifiers. 2. The apportionmentof credit system is in charge of distributing reinforceme to the rules composing the knowledge base. The algorithm used is the " bucket brigade" ( Holland 1980; in previous LCS research, and in most of reinforcement learning research, reinforcementsare provided to the learning system whenever it enters certain states; in our approach they are provided by a trainer, or reinforcement program, as discussedin section 1.2). 3. The rule discoverysystemcreatesnew classifiersby meansof a genetic algorithm.
In LCSs, learning takes place at two distinct levels. First , the apportionmen of credit systemlearns from experiencethe adaptive value of a number of given classifierswith respectto a predefinedtarget behavior. Each classifier, maintains a value, called " strength," and this value is modified by the bucket brigade in an attempt to redistribute rewards to useful classifiersand punishmentsto useless(or harmful) ones. Strength is therefore used to assessthe degreeof usefulness of classifiers. Classifiers that have all conditions satisfied are fired with a probability that is a function of their strength. Second, the rule discovery mechanism, that is, the genetic algorithm, allows the agent to explore the value of new classifiers. The genetic algorithm explores the classifiers' space, recombining classifierswith higher strength to produce possibly better offspring. Offspring are then evaluatedby the bucket brigade. It must be noted that the LCS is a generalmodel with many degreesof freedom. When implementing an LCS, it is therefore necessaryto make a number of decisionsthat transform the genericLCS into an implemented LCS. In the rest of this sectionwe presentthe three modulesof LCSo, our 1 implementation of an LCS. ICS and ALECsys, presentedin sections2.2 and 2.3, were obtained as improvementsof LCSo.
2.1.1 LCSo: The PerfonnanceSystem The LCSo perfonnance system(seefigure 2.2 and tenninology box 2.1) consistsof . the classifierset CS; . an input interfaceand an output interfacewith the environment (sensors and effectors) to receive/sendmessagesfrom / to the environment;
Chapter 2
.mes Box 2 1 Exa + Con the cla 1 01 01 . H th # se c i ; mat the wh th fir co is m , onl by me b " ' with a in se Th f d t o # any mes po sy s "care tha bot 0 or 1 m If . b c ,the , me sym p are ma so ru is a a by , me a ct is the 01 in ou is , /acti par , , str ex list at the i it f is to be i n foll ( ste in or is sen to eff e )by tha an .som act If so eff ,disc ) m pr c the con res is an co ap of the . ar me Figure2.2 LCSoperformance system
.sc Bo 2 1 Te .me M w a is a a o ru r In K K , ' LC ( ) ( ) t r cla * * I . M o n o O i a s n on O of # , , s l e } le { str } { M i t o c w th , K l ,for KV co Inf in w t cl r h u ac th c pa = i 1 2 d . ac al ca an ) E ( m sin f th b is se of b st + . 0 1 in h su p th e o An ef int a cla .me a e M is u lcla Th is m ( ) m i f s a r M wh co L , ,the c )sif (the in c s lis w c l M w tim an , ) m ( st e pr t t u to s b p pr ..Th tim st . h se C th th c m s M is e g o (ste l o m ha b of b al tm b = m -cln(M e in i-iin w f Al ,o -w c t),an w se o in h n is )a ALECSYS
. the messagelist ML , composedof the three sublists: MLI , which collects messagessent from classifiersat the precedingtime step, MLENV , which collects messagescoming from the environment through the sensors , and MLEFF , which contains messagesusedto choosethe actions at the precedingtime step; . the match set MS, that is, the multiset containing classifiersthat have both conditions matched by at least one message(seeexample box 2.1) . MS is composedof MS-eff and MS-int , that is two multisets containing classifiersin MS that want to sendtheir messagesto effectorsand to the internal messagelist MLI , respectively; . a conflict resolutionmodule, to arbitrate conflicts among rules in MS-eff that proposecontradictory actions; and . an auction module, to select which rules in MS-int are allowed to appenda messageto MLI . In simplified terms, the LCSo performance systemworks like this. At time zero, a set of classifiersis randomly created, and both the message
2 Chapter list and the match set are empty. Then the following loop is executed. Environmental messages , that is, perceptions coming from sensors, are appendedto the messagelist MLENV and matched against the condition part of classifiers. Matching classifiersare addedto the match set, and the messagelist is then emptied. If there is any messageto effectors, then , a conflict resolution module correspondingactionsare executed(if necessary is called to chooseamong conflicting actions); messagesused to select performed actions are appendedto MLEFF . If the cardinality of MS-int does not exceedthat of MLI , all messagesare appended. Otherwise, the auction module is used to draw a k -samplefrom MS-int (where k is the cardinality of MS-int ); the messagesof the classifiersin the k-sampleare then appendedto MLI . The match setis emptied and the loop is repeated. The need to run an auction is one reason an apportionment of credit algorithm is introduced. As it redistributesreinforcementsto the rules that causedthe performed actions, the algorithm changestheir strengths. This allows the systemto choosewhich rules to selectin accordanceto some measureof their usefulness. The rules' strength is also used for conflict resolution antong rules that send messagesto effectorsproposing inconsistent actions (e.g., " Go right " and " Go left " ) . Algorithm box 2.1 givesa detailed formulation of the performancealgorithm.
2.1.2
LCSo: The Apportionmentof Credit System We have said that the main task of the apportionment of credit algorithm is to classify rules in accordancewith their usefulness. In LCSo, the algorithm works as follows: to every classifierC a real value called " strength," Str(C), is assigned.At the beginning, eachclassifierhas the samestrength. When an effector classifier causesan action on the environment, a reinforcem may be generated, which is then transmitted backward to the internal classifiersthat causedthe effector classifierto fire. The backward transmissionmechanism, examined in detail later, causesthe strength of classifiersto changein time and to reflect the relevanceof each classifier to the systemperformance(with respectto the systemgoal) . Clearly, it is not possible to keep track of all the paths of activation actually followed by the rule chains (a rule chain is a set of rules activated in sequence , starting with a rule activated by environmental messagesand ending with a rule performing an action on the environment) becausethe number of thesepaths grows exponentially with the length of the path. It is then necessaryto have an appropriate algorithm that solvesthe problem using only local information , in time and in space.
2 1 .l.4 Bo Alg :o Th LC pe sy As an in to c a c t C , st ev m lis oo I2 Re en a t to M . h m in Ex th M b a op m in . co of cl C pa 3 If no cla is l i e c n , y the de = . :7 .6 MS w h e c m C , b lea on m M 5 Re al fro in . re to cl M T m Ap r ar a s t w o bo p ap th ac . co If iIoo n S , (9 M ) ( ) Le L e = :Th h the d an to ap tit d of m = k els dr an C , in s [ ] a w fro p r ~ C C h ( ) Pr / S tthe is th au { } to M t . h M m ap in d . Q a of cla ; c { o c ( ) St st } en .tha 8 Re al fro E s t V C h If ex at le on ef = ".(M \CS EA i th o t w , ,MS (cla M 0 p t co . h wi P e ap pr a s d : o us in th sy al = .ML .ma c i C ; C I } {me C = M M is li u u m = M i is lis of M I ' , ,ha + m ) ( )'s.ea . t l s lis of en l ; , t a a ca an ac t h s p -Th o c -m M e is th th f iint nt ,oi,w s e i is s n lcl m a o e co b y m -M iT e M ,p (K )= lK .w nK i m ALECSYS
Chapter 2
" Local in time" means that the information used at every computational step is coming only from a fixed recent temporal interval. " Local in space" meansthat changesin a classifier's strength are causedonly by classifiers directly linked to it (we say that classifiers Cl and C2 are " linked" if the messagepostedby Cl matchesa condition of C2). The classical algorithm used for this purpose in LCSo, as well as in most LCSs, is the bucket brigade algorithm ( Holland 1980). This algorithm models the classifiersystemas an economic society, in which every classifierpays an amount of its strength to get the privilege of appending a messageto the messagelist and receivesa payment by the classifiers activated becauseof the presencein the messagelist of the messageit 2 appendedduring the precedingtime step. Also, a classifiercan increase (decrease ) its strength wheneveran action it proposesis performed and ~ results in receiving a reward (punishment) . In this way, reinforcement flows backward from the environment (trainer) again to the environment (trainer) through a chain of classifiers. The net result is that classifiers participatingI in chains that causehighly rewardedactions tend to increase their strength. Also , our classifiers lose a small percentage of their " " strength at each cycle (what is called the life tax ) so that classifiersthat are never usedwill sooneror later be eliminated by the geneticalgorithm . In algorithm box 2.2, we report the apportionment of credit system. Becausethe bucket brigade algorithm is closely intertwined with the performan system, a detailed description of how it works is more easily given together with the description of the performance system. Bucket brigade instructions are those steps of the credit apportionment system not already presentin the performancesystem(they can be easily recognized becausewe maintained the samename for instructions in the two systems, and the bucket brigade instructions were just added in between the performancesysteminstructions) . One of the effectsof the apportionment of credit algorithm should be to contribute to the formation of default hierarchies, that is, setsof rules that . A rule categorizethe set of environmental statesinto equivalenceclasses in a default hierarchy is characterizedby its specificity. A maximally specific rule will be activated by a single type of message(see, for example, the set of four rules in the left part of figure 2.3) . A maximally general (default) rule is activated by any message(see, for example, the last rule in the right part of figure 2. 3) . Consider a default hierarchy composedof two rules, as reported in the right part of figure 2. 3. In a default hierarchy, whenevera set of rules have
ALECSYs
2 Chapter
Non
hierarch
ical
Hierarchical
set set
( homomorphic
)
( quasi
-
set set
homomorphic
00
~
11
00
~
11
01
~
OO
II
~
00
1 0
~
00
11
~
00
)
2.3 F1gore samestatecategorization with fewerrules. Hierarchicalsetimplements
all conditions matched by at least a message , it is the most specificthat is allowed to fire. Default hierarchieshave some nice properties. They can be used to build quasi-homomorphic models. Let " be the set of environmen statesthat the learning systemhas to learn to categorize. Then, the systemcould either try to find a set of rules that partition the whole set " , never making mistakes(left part of figure 2.3), or build a default hierarchy (right part of figure 2.3) . With the first approach, a homomorphic model of the environment is built ; with the second, a quasi-homomorphic model, which generally requiresfar fewer rules than an equivalent homomorphic one ( Holland 1975; Riolo 1987, 1989). Moreover, after an initial quasi-homomorphic model is built , its performancecan be improved by adding more specific rules. In this way, the whole system can learn gracefully, that is, the performance of the system is not too strongly influencedby the insertion of new rules.
2.1.3 LCSo: De RuleDiscoverySystem The rule discoverysystemalgorithmusedin LCSo, as well as in most LCSs, is the geneticalgorithm(GA). GAs are a computationaldevice
ALECSYs 2
.
2
Box Terminology .
.
of
individuals
multiset
a
is
A population .
. , a
,
genes
Typically feasible
k
or
of
a
positions individual
is
is
)
string problems
framework
are
optimization
LCS
in
the
An
( to
an
individual
individual
chromosome
an ,
a
is
when
applied
when
,
GAs
hand
other
the
.
On
solution
, .
classifier .
is
an
that
to
value
alphabet
allelic
an
can
assume
A gene
belonging =
the
in
found
be
to
are
choice
reasons
this
The
underlying been
has
It
.
the algorithm
the
of
the ,
is the
information
rule
lower
the
that
When
individuals
discovery
1975
Holland
) in
useful
are
usually
{
algorithm
.
shown (
the
of
the
algorithm
processing classifiers
A
O ,
used ,
cardinality
alphabet
contained
I
. }
that
,
genetic
efficiency
chromosomes
the
of
higher
structure
in
the
, "
' care
,
t
" don In
allelic , action
the
#
(
) of
often
with
symbol to
rules
allelic
the
while ,
# }
I ,
O ,
(
allows condition
belong I
. }
O ,
which
general
to {
then
alphabet for
default )
represented values
is
augmented be
.
LCSo
of
values
genes to
{
belong
genes
)
message (
inspired by poptllation genetics. We give here only some general aspects of their working ; the interestedreader can refer to one of the numerous books about the subject( for example, Goldberg 1989; Michalewicz 1992; Mitchell 1996) . A GA is a stochasticalgorithm that works by modifying a population of solutions3to a given problem (seeterminology box 2.2) . Solutions are coded, often using a binary alphabet, although other choicesare possible, and a function, called a " fitnessfunction," is definedto relate solutions to performance(that is, to a measureof the quality of the solution). At every cycle, a new population is createdfrom the old one, giving higher probability to reproduce(that is, to be presentagain in the new population of solutions) to solutions with a fitnesshigher than average, wherefitness is defined to be the numeric result of applying the fi ~ ess function to a solution. This new population is then modified by meansof somegenetic operators; in LCSo theseare . crossover , which recombinesindividuals (it takes two individuals, operates a cut in a randomly chosenpoint , and recombinesthe two in such a way that some of the genetic material of the first individual goes to the secondone and vice versa; seefigure 2.4); and . mutation, which randomly changessome of the allelic values of the genesconstituting an individual .
2 Chapter
F1pre2.4 Exampleof crossoveroperator: (a) two individuals(parents ) are mated and crossover is applied, to generate (b) two newindividuals(children). Algorit I UDBox 2.3 LCSo: The rule discoverysystem(geneticalgorithm) O. Generatean initial random population P. repeat 1. Appfy the Reproduction operator to P: pi := Reproduce( P) . 2. Apply the Crossoveroperator (with probability Pc) to P' : pi' := Crossover(pI) . 3. Apply the Mutation operator (with probability Pm) to P" : pi" := Mutation ( P" ). 4. P := P"' . _ to EndTest = true.
After thesemodifications, the reproduction operator is applied again, and the cycle repeatsuntil a termination condition is verified. Usually the termination condition is given by a maximum number of cycles, or by the lack of improvementsduring a fixed number of cycles. In algorithm box 2.3, we report the rule discoverysystem(geneticalgorithm) . The overall effect of the GAs' work is to direct the searchtoward areas of the solution spacewith higher valuesof the fitnessfunction. The computational speedupwe obtain using GAs with respectto random searchis due to the fact that the search is directed by the fitness function. This direction is based not on whole chromosomesbut on the parts that are strongly related to high values of the fitness function; these parts are called " building blocks" ( Holland 1975; Goldberg 1989). It has been proven ( Booker, Goldberg, and Holland 1989; Bertoni and Dorigo 1993)
ALECSYS
that GAs process at each cycle a number of building blocks at least proportional to a polynomial function of the number of individuals in the population, where the degreeof the polynomial dependson the population dimension. In LCSs, the hypothesis is that useful rules can be derived from recombination of other useful rules.4 GAs are then used as a rule discovery system. Most often, as in the caseof our system, in order to preserve the systemperformance, they are applied only to a subsetof rules: the m best rules (the rules with higher strength) are selectedto form the initial population of the GA , while the worst m rules are replacedby those arising from application of the GA to the initial population. After applying the GA , only some of the rules are replaced. The new rules will be tested by the combined action of the performance and credit apportionment algorithms. Becausetesting a rule requiresmany time steps, GAs are applied with a much lower frequency than the performance and apportionmen of credit systems. In order to improve efficiency, we added to LCSo a cover detector operator and a' covereffectoroperator (thesecover operatorsare standard in most LCSs) . The cover detector fires whenever no classifier in the classifier set is matched by any messagecoming from the environment (that is, by messages in MLENV ) . A new classifieris createdwith conditions that match at least one of the environmental messages(a random number of # s are insertedin the conditions), and the action part is randomly generated. The cover effector, which is applied with a probability PCB ' assuresthe samething on the action side. Wheneversomeaction cannot be activated by any classifier, then a classifier that generatesthe action and that has random conditions is created. In our experiments, we setPCB= 0.1. Covering is not an absolutely necessaryoperator becausethe systemis in principle capable of generating the necessaryclassifiersby using the basic GA operators(crossoverand mutation) . On the other hand, because this could require a great deal of time, covering is a useful way to speed up searchin particular situations easily detectedby the learning system.
2.2 ICS: IMPROVEDCLASSIFIER SYSTEM In this section, we describeICS, an improved version of LCSo. The following major innovations have been introduced (for the experimental evaluation of thesechangesto LCSo, seesection4.2) .
2 Chapter
2.2.1 I CS: CaUingdie GeneticAlgorithm When a SteadyState Is Reached In LCSo, as in most classic implementations of LCSs, the genetic algorithm (i.e., reproduction, crossover, and mutation) is called every q cycles, q being a constant whose optimal value is experimentally determined. A drawback of this approach is that experimentsto find q are necessary , and that , even if using the optimal value of q, the geneticalgorithm will not be called, in general, at the exact moment when rule strength accurately reflectsrule utility . This happensbecausethe optimal value of q changesin time, dependingon the dynamicsof the environment and on the stochastic processes embedded in the learning algorithms. If the geneticalgorithm is called too soon, it usesinaccurateinformation ; if it is called long after a steadystate has beenreached, there is a waste of computing time (from the GA point of view, the time spent after the steady state has been reachedis useless ) . A better solution is to call the genetic algorithm when the bucket brigade has reacheda steadystate; in this way, we can reasonably expect that when the genetic algorithm is called, the strength of every classifierreflectsits actual usefulnessto the system. The problem is hcfwto correctly evaluatethe attainment of a steadystate. We " "5 have introduced a function ELCS (t ) , called energy of the LCS at time t, defined as the sum of the strengthsof all classifiers. An LCS is said to be " at a " steadystate at time t when
' ELCS , Emax (t ) E [EDlin ], where
for all te [t - k , t),
" " Emin= min { ELCS(t ), t e [t - 2k , t - k ]} , and " " Emu = max{ ELCS (t ) , t e [t - 2k , t - k ]} ,
k being a parameter. This excludescasesin which ELCS (t) is increasingor and in those which t is still ELCS decreasing () oscillating too much. Experimentshave shown that the value of k is very robust; in our experiments (see section 4.2; and Dorigo 1993), k was set to 50, but no substantia differenceswere found for valuesof k in the range between20 and 100. 2.2.2 ICS: The MutespecOperator Mutespec is a new operator we introduced to reduce the variance in the reinforcementreceivedby default rules. A problem causedby the presence
ALECSYS
Figure2.5 Exampleof oscillatin ~: classifier
Figure2.6 operator Exampleof applicationof mutespec " of " don' t care ( # ) symbols in classifierconditions is that the sameclas, and low sifier can receivehigh rewardswhen matched by somemessages re)J'ards (or even punishmentsif we use negative rewards) when matched " " the by others. We call these classifiers oscillating classifiers.' Consider " " don t care has a classifier 2.5 the symbol ; oscillating examplein figure in the third position of the first condition. Supposethat whenever the classifieris matched by a messagewith a ,I in the position corresponding to the # in the classifiercondition, the messageI I I I is useful, and that wheneverthe matching value is a 0, the messageI I I I is harmful . As a result, the strength of that classifiercannot convergeto a steadystate but will oscillate betweenthe values that would be reachedby the two more specific classifiersin the bottom part of figure 2.6. The major problem with oscillating classifiersis that, on average, they will be activated too often when they should not be used and too seldom when they could be useful. This causesthe performance of the system to be lower than it could be. The mutespecoperator tries to solve this problem using the oscillating classifieras the parent of two offspring classifiers, one of which will have a
Chapter 2
0 in place of the # , and the other, a 1 (seefigure 2.6). This is the optimal solution when there is a single # . When the number of # symbols is greater than 1, the # symbol chosencould be the wrong one (i.e., not the one responsiblefor oscillations) . In that case, becausethe mutespecoperator did not solve the problem, it is highly probable the operator will be applied again. The mutespecoperator can be likened to the mutation operator. The main differencesare that mutespecis applied only to oscillating classifiersand that it always mutates # symbolsto Osand 1s. And while the mutation operator mutates a classifier, mutespec introduces two new, more specific, classifiers (the parent classifier remains in the population). To decidewhich classifiershould be mutated by the mutespecoperator, we monitor the variance of the rewardseach classifiergets. This variance is computed according to the following formula:
where Rc (t ) is the reward receivedby classifierC at time t. A classifierC is an oscillating classifierifvar (Rc) ~ r . var , wherevar is the averagevariance of the classifiersin the population and r is a user definedparameter(we experimentallyfound that a good value is r = 1.25) . At every cycle, mutespecis applied to the oscillating classifier with the highestvariance. ( If no classifieris an oscillating classifier, then mutespec is not applied.)
2.2.3 ICS: Dynamically Changingthe Nmnber of Claaifiers Used In ICS, the number of used rules dynamically changesat run time (i.e., the classifierset' s cardinality shrinks as the bucket brigade finds out that some rules are uselessor dangerous). The rationale for reducing the number of usedrules is that when a rule strengthdrops to very low values, its probability of winning the competition is also very low ; should it win the competition, it is very likely to propose a uselessaction. As the time spent matching rules against messagesin the messagelist is proportional to the classifier set' s cardinality, cutting down the number of matching rules results in the ICS performing a greater number of cycles (a cycle goes from one sensingaction to the next) than LCSo in the same time period. This causesa quicker convergenceto a steady state; the genetic
ALECSYS
algorithm can be called with higher frequency and therefore a greater number of rules can be tested. In our experiments, the useof a classifieris inhibited when its strength becomeslower than h . S(t ) , where S(t ) is the averagestrength of classifiersin the population at time I , and h is a user defined parameter (0.25 Sh S 0.35 was experimentally found to be a good range for h).
2.3 THE ALEcsysSYSTEM ALEcsvs is a tool for experimentingwith parallel I CSs. One of the main problems faced by I CSs trying to solve real problems is the presenceof heavy limitations on the number of rules that can be employeddue to the linear increasein the basic cycle complexity with the classifier set' s cardinality . One possibleway to increasethe amount of processedinformation without slowing down the basicelaboration cycle would be the useof " " parallel architectures, such as the Connection Machine ( Hillis 1985) or " " the transputer ( INMOS 1989) . A parallel implementation of an LCS on the Connection Machine, proposedby Robert son (1987), has demonstrated the power of such a solution, while still retaining, in our opinion, a basic limit . Becausethe Connection Machine is a single-instructionmultiple- data architecture (SIMD ; Flynn 1972), the most natural design for the parallel version of an LCS is basedon the " data parallel" modality , that is, a single flow of control applied to many data. Therefore, the resulting implementation is a more powerful, though still classic, learning classifiersystem. Becauseour main goal was to give our systemfeaturessuch as modularity , flexibility , and scalability, we implementedALE Csvson a transputer ' 6 system. The transputer s multiple-instruction-multiple- data architecture ( MIMD ) permits many simultaneouslyactive flows of control to operate on different data setsand the learning systemto grow gradually without major problems. We organized ALECsvsin such a way as to have both SIMD -like and MIMD -like forms of parallelism concurrently working in the system. The first was called " low-level parallelism" and the second, " " high-level parallelism. Low -level parallelism operateswithin the structure of a single ICS and its role is to increasethe speedof the ICS. High level parallelism allows various I CSs to work together; therefore, the complete learning task can be decomposedinto simpler learning tasks running in parallel.
Chapter 2
2.3.1 Low-Level Parallelism: A Solution to SpeedProblems We turn now to the microstructureof A LECSYS , that is, the way parallelism was used to enhancethe performance of the ICS model. ICS, like LCSo and most LCSs, can be depicted as a set of three interacting systems (see figure 2. 1) : ( 1) the performance system; (2) the credit apportionment system(bucketbrigade algorithm); and (3) the rule discoverysystem (geneticalgorithm). In order to simplify the presentationof the low-level parallelization algorithms, the various ICS systemsare discussedseparately . We show how first the performance and credit apportionment systems and then the rule discovery system (genetic algorithm) were parallelized. 2.3.1.1 The Performanceand Apportionmentof Credit Systems As we have seenin section 2.1, the basic execution cycle of a sequential learning classifier systemlike LCSo can be looked at as the result of the interaction betweentwo data structures: the list of messages , ML , and the set of classifiers, CS. Therefore, we decomposea basic execution cycle into two co~current processes, M Lprocess and CSprocess . M Lprocess communicateswith the input processSEprocess(SEnsor), and the output process, EFprocess( EFfector), as shown in figure 2.7. The processes communicate by explicit synchronization, following an algorithm whosehigh-level description is given in algorithm box 2.4. Becausesteps3 and 4 (matching and messageproduction) can be executed on eachclassifierindependently, we split CSprocessinto an array of concurrent subprocesses { CSprocessl, . . . , CSprocess ;, . . . , CSprocessn }, each taking care of Iln of CS. The higher n goes, the more intensive the concurrency is. In our transputer-based implementation, we allocated about 100- 500 rules to each processor (this range was experimentally determinedto be the most efficient; seeDorigo 1992). CSprocessescan be organized in hierarchical structures, such as a tree (seefigure 2.8) or a toroidal grid (the structure actually chosendeeply influencesthe distribution of computational loads and, therefore, the computational efficiency of the overall system; seeCamilli et ale 1990). Many other stepscan be parallelized. We propagatethe messagelist ML from M Lprocess to the i -th CSprocessand back, obtaining concurrent processingof credit assignmenton each CSprocess ;. ( The auction among triggered rules is subjectto a hierarchical distribution mechanism, and the samehierarchical approach is applied to reinforcementdistribution.)
ALBCSYS
I
reinforcements 2.7 F1gare Concurrentprocess esin ICS
2.view 4of Alaoritllm Process -oBox classifier riente4 learning systems while not do stopped 1.message M receives from S Lprocess and them messages in the Eprocess places list ML . 2 M sends ML to C . Lprocess Sprocess " " 3..C matches ML and CS Sprocess bids for each , calculatin trigg rule . 4 ..C sends M the list of Sprocess rules . Lprocess triggered S M erases the old list and Lprocess makes an auctio message amo rules the winners selected with ; triggered to their , bid are allow respect , to their own thus a post new , list ML . messages composing messa .7.E 6 M sends ML to E . Lprocess Fprocess chooses the action to and if Fprocess disca con , apply , necess from ML E receives messages ; reinforc and is Fprocess then able to calculate the reinforceme owed to each in ML messa ; this list of reinforcements is sent back to M with the , Lproces togeth ML .the remaining .9.C 8 M sends set of and Lprocess reinforce to C . messages Spro modifies the of CS Sprocess elements bids strengths , , paying assig reinforcements taxes . collecting eDdwh De,and
Chapter 2
Parail eliCS ~
~
E
~
~
~
~
~
E
~ ions percept ions -
.
reinforcements
CS4 I " =
~
~
CSp
S5 _
~
c
~ .
cs
~ ~
2.8 F1gare of figure2.7 is split into six Parallelversionof theICS. In thisexample , CSprocess . . . . C concurrentprocess es: CSprocessl S6 , , Sproces
1.3.1.1 The Rule DiscoverySystem We now illustrate briefty our parallel implementation of the rule discovery system- the geneticalgorithm or GA . A first process, GAprocess, can be assignedthe duty to select from among CS elementsthose individua that are to be either replicated or discarded. It will be up to the , after receiving GAprocess decisions, to apply genetic (split) CSprocess each ; focusing upon its own fraction of CS , single CSprocess operators . population Becauseit could affect CS strengths, upon which genetic selection is based, M Lprocessstaysidle during GA operations. Likewise, GAprocess is " dormant" when M Lprocessworks. Our parallel version of the genetic algorithm is reported in the algorithm box 2.5. Considering our parallel implementation, step I may be seen as an auction, which can be distributed over the processor network by a
A Box 2 .5 in Igori6m Parallel A LB C Sys genetic algorithm .2.Each 1 C S selects within its own subset of classif m rule to i , , process and m to n : m ote is a . replicate replace ( system ) parame Each C S sends G some data about each selec clas i process Aprocess sifier G to set a hierarch auctio base on , enabling Aprocess up in values this results 2m individ with the ; strength process selecting . overall CS population 3.parent G sends the data to each C S a i Aprocess following proces conta : classifier identifier of the itself , parent identifiers of the two , offspring . crossover point At the same time G sends to the C S a i Aprocess proces conta pos the data : for any following offspring identifier of the , offspring identifiers of the two , parents crossover . point 4.All in fractio of sen the C es that have their own CS a p rocess parents ~ of the rule to the C S that has the i parent process corre copy will crossov and muta ;this position process apply ope offspring will overwrite rules to be with rule . and replaced newly gener ALECSYS
" gathering and broadcasting mechanism, similar to the one we usedto propagatethe messagelist. Step 2 (mating of rules) is not easy to parallelize becauseit requires a central managementunit . Luckily , in LCS applications of the GA , the number of pairs is usually low and concurrency seemsto be unnecessary(at least for moderate-sized populations ) . Step 3 is the hardest to parallelize becauseof the amount of communication it requires, both between M Lprocess and the array of . Step 4 is a typical split CSprocesses and among CSprocesses themselves example of local data processing, extremely well suited to concurrent distribution. " hierarchical
2.3.2 High- Level Parallelism: A Solution to Beha'rioral Complexity Problems In section2.3.1 we presenteda method for parallelizing a singleICS, with the goal of obtaining improvementsin computing speed. Unfortunately , this approach shows its weaknesswhen an ICS is applied to problems involving multiple tasks, which seemsto be the casefor most real problems . We proposeto assignthe executionof the various tasks to different
2 Chapter
ligure 2.9 . Using ALECSYS , Exampleof concurrenthigh-level and low-level parallelism learningsystemdesignerdividesprobleminto threeICSs(high-levelparalleliza tion) andmapseachICS onto subnetof nodes(low-levelparallelization ). I CSs. This approach requiresthe introduction of a stronger form of parallelism " " , which we call high-level parallelism. Moreover, scalability problems arise in a low-level parallelized ICS. Adding a node to the transputer network implementing the ICS makes the communication 'load grow faster than computational power, which ' results in a less than linear speedup. Adding processorsto an existing network is thus decreasinglyeffective. As already said, a better way to deal with complex problems would be to code them as a set of easier the processor subproblems, each one allocated to an ICS. In ALBCSYS network is partitioned into subsets,eachhaving its own sizeand topology. To each subsetis allocated a single ICS, which can be parallelized at a low level (see figure 2.9) . Each of these I CSs learns to solve a specific subgoal, according to the inputs it receives. BecauseALEcsys is not provided with any automated way to come up with an optimal or even a good decomposition of a task into subtasks, this work is left to the learning system designer, who should try to identify independent basic tasks and coordination tasks and then assign them to different I CSs. The design approach to task decomposition is common to most of the current researchon autonomous systems(see, for example, Mahadevan and Connell 1992; and Lin 1993a, 1993b), and will be discussedin chapter 3.
ALECSYs
2.4 POINTSTO REMEMBER . The learning classifiersystem( LCS) is composedof three modules: ( I ) the performancesystem; (2) the credit apportionment system; and (3) the rule discoverysystem. . The LCS can be seenas a way of automatically synthesizinga rulebased controller for an autonomous agent. The synthesizedcontroller consistsof the performancesystemalone. . The learning componentsin an LCS are the credit apportionment system and the rule discovery system. The first evaluatesthe usefulness of rules, while the seconddiscoverspotentially useful rules. . ICS (improved classifier system) is an improved version of LCSo, a ' particular instantiation of an LCS. I CS s major improvementsare ( I ) the GA is called only when the bucket brigade has reacheda steadystate; (2) " " a new genetic operator called mutespec is introduced to overcomethe negativeeffectsof oscillating classifiers; and (3) the number of rules used by the performancesystemis dynamically reducedat runtime. . ALECSYS is ' a tool that allows for the distribution of I CSs on coarsegrained parallel computers (transputers in the current implementation) . , a designercan take advantageof both parallelizaBy meansof ALECSYS tion of a single ICS and cooperation among many I CSs.
Chapter
3
Architectures
and Shaping
Po Hcies
3.1 THE STRUCTURE OF REBAVIOR As we have noted in chapter I , behavior is the interaction of the agent with its environment. Given that our agentsare internally structured as an LCS and that LCSs operatein cycles, we might view behavior as a chain of elementary.interactions betweenthe agent and the environment, each of which is causedby an elementaryaction performed by the agent. This is, however, an oversimplified view of behavior. Behavior is not just a fiat chain of actions, but has a structure, which manifests itself in basically two ways:
I . The chain of elementary actions can be segmentedin a sequenceof different segments , each of which is a chain of actions aimed at the execution of a specifictask. For example, an agent might perform a number of actions in order to reach an object, then a number of actions to grasp the object, and finally a chain of actions to carry the object to a container. 2. A single action can be made to perform two or more tasks simultaneously . For example, an agent might make an action in a direction that allows it both to approach a goal destination and to keep away from an obstacle. Following well-establishedusage, we shall refer to the actions aimed at " accomplishinga specifictask as a behavior" ; we shall therefore speakof activities such as approaching an object, avoiding obstacles, and the like as " specificbehaviors." Some behaviors will be treated as basic in the sensethat they are not structured into simpler ones. Other behaviors, seen to be made up of simpler ones, will be regarded as complex behaviors, whose component behaviorscan be either basic or complex.
Chapter3 Complex behaviorscan be obtained from simpler behaviors through a number of different composition mechanisms. In our work , we consider the following mechanisms: . Independentsum: Two or more behaviors involving independenteffectors are performed at the sametime; for example, an agent may assumea mimetic color while chasinga prey. The independentsum of behaviors IX and p will be written as . IXIP . Combination: Two or more behaviors involving the same effectors are combined into a resulting one; for example, the movement of an agent following a prey and trying to avoid an obstacle at the same time. The combination of behaviorsIXand p will be written as
~
.
IX+ p . . Suppression : A behavior suppresses a competing one; for example, the agent may give up chasing a prey in order to escapefrom a predator. When behavi( " IXsuppressesbehavior p , we write
. Sequence : A behavior is built as a sequenceof simpler ones; for example , fetching an object involves reaching the object, grasping it , and coming back. The sequenceof behavior (Xand behavior p will be written as (X. p ; moreover, we shall write " 0' . " when a sequence0' is repeatedfor an arbitrary number of times. In general, severalmechanismscan be at work at the same time. For example, an agent could try to avoid fixed hurting objectswhile chasinga moving prey and being ready to escapeif a predator is perceived. With respectto behavior composition, the approach we have adopted so far is the following . First , the agent designerspecifiescomplex behaviors in terms of their component simpler behaviors. Second, the designer definesan architecture such that each behavior is mapped onto a single behavioral module, that is, a single LCS of ALECsys. Third , the agent is trained according to a suitable shapingpolicy, that is, a strategy for making the agent learn the complex behavior in terms of its components. In the next sections, we review the types of architectureswe have usedin
Architectures andShaping Policies
3.2 TYPESOF ARCI UTE Cf URES By using ALEcsYS, the control systemof an agent can be implementedby a network of different LCSs. It is therefore natural to consider the issue of architecture, by which we mean the problem of designingthe network that bestfits somepredefinedclassof behaviors. As we shall see, there are interestingconnectionsbetweenthe structure of behavior, the architecture of controllers, and shaping policies (i.e., the ways in which agents are trained) . In fact, the structure of behavior determinesa " natural" architecture , which in turn constrainsthe shapingpolicy . Within ALEcsys, realizable architecturescan be broadly organized in two classes : monolithic architectures, which are built by one LCS directly connectedto the agent' s sensorsand effectors; and distributed architectures , which include severalinteracting LCSs. Distributed architectures, in turn , can be mbre or less" deep" ; that is, they can beflat architectures, in which all LCSs are directly connectedto the agent' s sensorsand effectors, or hierarchicalarchitectures, built by a hierarchy of levels. In principle, any behavior that can be implemented by ALEcSYs can be implemented through a monolithic architecture, although distributed architecturesallow for the allocation of specific tasks to different behavioral modules, thus lowering the complexity of the learning task. At the , but have to be presentstage, architecturescannot be learned by ALECSYS a certain serves to include in other words an architecture , designed; amount of a priori knowledgeinto our agentsso that their learning effort is reduced. To reach this goal, however, the architecture must correctly match the structure of the global behavior to be learned.
3.2.1 MonolithicArchitectures The simplestchoiceis the monolithic architecture , with one LCS in chargeof controllingthe wholebehavior(figure3.1). As we havealready noted, this kind of architecturein principlecan implementany behavior of whichALECsysis capable , althoughfor nontrivial behaviorsthe complexity of learninga monolithiccontrollercanbe too high. Usinga monolithicarchitectureallowsthe designerto put into the system . Suchinitial knowledge aslittle initial domainknowledgeaspossible
Chapter 3
LCS
LCS
Environment
Environment
Figme 3.1 Monolithic architectures
Environment
Ji1gure3.2. Flat architectures
is embeddedin the design of the sensorimotor interface, that is, in the structure of messagescoming from sensorsand going to effectors. If the target behavior is made up of severalbasic responses , and if such basicresponsesreact only to part of the input information , the complexity of learning can be reduced even by using a monolithic architecture. Instead of wrapping up the state of all sensorsin a single message(figure 3.la ), one can distribute sensorinput into a set of independentmessages " (figure 3.lb ) . We call the latter case monolithic architecture with distributed " input . The idea is that inputs relevant to different responsescan ; in such away , input messagesare shorter, the go into distinct messages search space of different classifiersis smaller, and the overall learning effort can be reduced(seethe experimenton monolithic architecturewith distributed input in section 4.2) . The only overhead is that each input messagehas to be tagged so that the sensor it comes from can be identified. 3.2.2 Flat Architectures A distributed architectureis made up of more than one LCS. If aU LCSs ' are directly connectedto the agent s sensors, then we use the term flat architecture(figure 3.2) . The idea is that distinct LCSs implement the dif -
Architectures and ShapingPolicies
ftgure3.3 architectures Two-levelhierarchical ferent basic responsesthat make up a complex behavior pattern. There is ' a further issue, here, regarding the way in which the agent s responseis built up from the actions proposed by the distinct LCSs. If such actions are independent, they can be realized in parallel by different effectors (figure 3.2a); for example, by rotating a videocamera while moving that are not independent, on the other hand, have to straight on. Actions f be integrated into a single responsebefore they are realized (figure 3.2b) . For example, the action suggestedby an LCS in charge of following an object and the action suggestedby an LCS in chargeof avoiding obstacles have to be combined into a single physical movement. (One of the many different ways of doing this might be to regard the two actions as vectors and add them.) In the present version of ALEC SYs , the combination of different actions in a fiat architecture cannot be learned but has to be programmed in advance.
3.2.3 Hierarchical Architectures In a flat architecture, all input to LCSs comes from the sensors. In a hierarchical architecture, however, the set of all LCSs can be partitioned into a number of levels, with lower levelsallowed to feed input into higher levels. By definition, an LCS belongsto level N if it receivesinput from systems of level N - I at most, wherelevel 0 is definedas the level of sensors. An N -level hierarchical architecture is a hierarchy of LCSs having level N as the highest one; figure 3.3 shows two different two-level hierarchical architectures. In general, first-level LCSs implement basic behaviors, while higher-level LCSs implement coordination behaviors. With an LCS in a hierarchical architecturewe have two problems: first, how to receiveinput from a lower-level LCS; second, what to do with the
3 Chapter
lfigure3.4 -levelswitcharchitecture Exampleof three for Chase . / Feed/Escapebehavior Besides threebasicbehaviors es, SWIandSW2 , therearetwoswitch . output . Receiving input from a lower-level LCS is easy: both input and output messagesare bit strings of some fixed length, so that an output messageproduced by an LCS can be treated as an input messageby a different LC~. The problem of deciding what to do with the output of LCSs is more complex. In general, the output messagesfrom the lower levels go to higher-level LCSs, while the output messagesfrom the higher levels can go directly to the effectorsto produce the response(seefigure 3.3a) or can be usedto control the composition of responsesproposedby lower LCSs (seefigure 3.3b) . Most of the experimentspresentedin this book were carried out using suppressionas composition rule; we call the resulting hierarchical systems" switch architectures." In figure 3.4, we show an example of three-level switch architecture implementing an agent that has to learn a Chase/ Feed / Escape behavior, that is, a complex behavior taking place in an environment containing a " sexualmate," pieces of food, and possibly a predator. This can be defined as if
there is a predator then Escape else if hungry then F~ { i . e . , search for food } else C~ { i . e . , try to reach the mate } endif end if .
sexual
Architectures and ShapingPolicies
In this example, the coordinator of level two (SWl ) should learn to suppress the Chase behavior whenever the Feed behavior proposes an action, while the coordinator of level three (SW2) should learn to suppress SWI wheneverthe Escape behavior proposesan action.
AN ARCl DTECf URE 3.3 REALIZING
3.3.1 How to Design an Architecture: QuaUtadveCriteria The most general criterion for choosing an architecture is to make the &.\"chitecture naturally match the structure of the ,target, behavior. This meansthat each basic responseshould be assignedan LCS, and that all such LCSs should be connectedin the most natural way to obtain the global behavior. Supposethe agent is normally supposedto follow a light, while being ready to reach its nest if a specificnoise is sensed(revealing the presence of a predator) . This behavior pattern is made up of two basic responses , namely, following a light and reaching the nest, and the relationship betweenthe tWo is one of suppression. In such a case, the switch architecture is a natural choice. In general, the four mechanismsfor building complex behaviorsdefined in section 3.1 map onto different types of architecture as follows: . Independentsum can be achievedthrough a flat architecturewith independent outputs (figure 3.2a) . This choice is the most natural becauseif two behaviors exploit different effectors, they can run asynchronouslyin parallel. . Combination can be achieved either through a flat architecture with integrated outputs (figure 3.2b) or through a hierarchical architecture (figure 3.3). A flat architecture is advisablewhen the global behavior can be reducedto a number of basicbehaviorsthat interact in a uniform way; for example, by producing a linear combination of their outputs. As we have already pointed out, the combination rule has to be explicitly programmed . On the other hand, if the agent has to learn the combination rule, the only possiblechoice with ALECsysis to use a hierarchical architecture : two LCSs can sendtheir outputs to a higher-level LCS in charge of combining the lower-level outputs into the final response. However, complex combination rules may be very difficult to learn. . Suppressioncan be achievedthrough a specialkind of hierarchical architecture " we have called " switch architecture. Suppressionis a simple limit
Chapter 3
caseof combination: two lower-leveloutputs are " combined" by letting one of the two becomethe final response.In this case, the combination rule can be effectively learned by ALECsys, resulting in a switch architecture. . Sequencecan be achievedeither through a fiat or switch architecture or through a hierarchical architecturewith added internal states. As we shall seein chapter 6, there are actually two very different types of sequential behaviors. The first type, which we have called " pseudosequential behavior " , occurs when the perception of changesin the external environment is sufficient to correctly chain the relevant behaviors. For example, exploring a territory , locating food, reaching for it , and bringing it to a nest is a kind of pseudosequentialbehavior becauseeach behavior in the sequencecan be chosensolely on the basis of current sensoryinput . This type of behavior is not substantially different from nonsequential behavior and can be implemented by a fiat or switch architecture. Proper sequentialbehavior occurs when there is not enough information in the environment to chain the basic behaviors in the right way. For example, an agent might have to perform a well-definedsequenceof actions that leave no per' ceivabletrace on the environment, like moving back and forth betweentwo locations. This type of behavior requires somekind of " internal state" to keep track of the different phases the agent goes through. We have implemented proper sequential behavior by adding internal statesto agents, and by allowing an LCS to update such a state within a hierarchical architecture (seechapter 6).
3.3.2 How to Designan Architecture: Quantitative Criteria The main reasonfor introducing an architectureinto LCSs is to speedup the learning of complex behavior patterns. Speedupis the result of factoring a large searchspaceinto smaller ones, so that a distributed architecture will be useful only if the component LCSs have smaller search spacesthan a single LCS able to perform the sametask. We can turn this consideration into a quantitative criterion by observing that the size of a searchspacegrows exponentially with the length of . This implies that a hierarchical architecture can be useful only messages if the lower-level LCSs realize some kind of informational abstraction, thus transforming the input messagesinto shorter ones(see, for example, the experiment on the two-level switch architecture in section 4.4.3) . Consider an architecture in which a basic behavioral module receives from its sensorsfour -bit messagessaying where a light is located. If this basic behavioral module sendsthe upper level four -bit messagesindicat-
Architectures and ShapingPolicies
ing the proposed direction of motion , then the upper level can use the sensorialinformation directly, by-passingthe basic module. Indeed, even if this basic behavioral module learns the correct input -output mapping, it does not operate any information abstraction, and becauseit sendsthe upper level the samenumber of bits it receivesfrom its sensors, it makes the hierarchy computationally useless . We call this the " information " compression requirement. An LCS whoseonly role is to compresssensorymessages can be viewed " " as a virtual sensor, able to extract from physical sensorydata a set of relevant features. As regardsthe actual sensors,we prefer to keep them as simple as possible, even in the simulated experiments: all informational abstraction, if any, takes place within ALECsysand has to be learned. In this way, we make sure that the " intelligence" of our agents is located mainly within ALECSys and that simulated behaviors can be also be achievedby real robots endowedwith low-cost physical sensors. 3.3.3 Implementingan Architecture with ALECSYS With ALECSYSfit is possible to define two classesof learning modules: basic behaviors and coordination behaviors. Both are implemented as learning classifiersystems. Basic behaviors are directly interfaced with the environment. Each basic behavior receivesbit strings as input from sensorsand sends bit stringsto effectorsto proposeactions. Basicbehaviorscan be insertedin a hierarchical architecture; in this case, they also send bit strings to connected higher-level coordination modules. Consider Autono Mouse II and the complex behavior Chase/ Feed / Escape , which has already been ,I!lentioned in section 2.3 and will be extensivelyinvestigatedin chapter 4. Figure 3.5 showsone possibleinnate architecture of an agent with such a complex behavior. A basic behavior has been designedfor each of the three behavioral patterns used to describe the learning task. In order to coordinate basic behaviors in situations where two or more of them propose actions simultaneously, a coordination module is used. It receivesa bit string from each connected basic behavior (in this casea one-bit string, the bit indicating whether the sendingICS wants to do something or not) and proposesa coordination action. This coordination action goes into the composition rule module, which implements the composition mechanism. In this example, the composition rule used is suppression, and therefore only one of the basic actions proposedis applied.
3 Chapter Composition rule
coordination action
basicactionsproposed Environment F1gwe3.5 for three-behaviorlearningtask Exampleof innatearchitecture
inputpattern
~ positionof chasedobject
~- ~,1
...... direction of motion
to the coordinator
move/do not move (b)
lCS- Chase
outputpattern (c)
F1gm ' e 3.6 (a) Example of input message ; ( b) example of output messaae; input- output interface for ICS- Chase behavior
of (c) example
All behaviors(both basic and coordination), are implementedas I CSs. For example, the basic behavioral pattern Chase is implemented as an ICS, which for easeof referencewe call " ICS-Chase " (figure 3.5) . Figure 3.6 showsthe input -output interface of ICS-Chase. In this case, the input pattern only sayswhich sensorsseethe light . (Autono Mouse II has four binary sensors,eachof which is set to 1 if it detectsa light intensity higher than a given threshold, and to 0 otherwise.) The output pattern is composed of a proposed action, a direction of motion plus a move/ do not move command, and a bit string (in this caseof length 1) going to ICSCoordinator (figure 3.5) . The bit string is there to let the coordinator
Architectures andShaping Policies know when ICS- Chase is proposing an action. Note that the value of this bit string is not designedbut must be learned by ICS- Chase. In general, coordination behaviors receive input from lower-level behavioral modules and produce an output action that , with different modalities dependingon the composition rule used, in8uencesthe execution of actions proposedby basic behaviors.
3.4 SHAPINGPOLICIES The use of a distributed system, either flat or hierarchical, brings in the new problem of selectinga shapingpolicy, that is, the order in which the various tasks are to be learned. In our work , we identified two extreme choices:
. Holistic shaping, in which the whole learning system is consideredas a black box. The actual behavior of a single learning classifier systemis not used to evaluate how to distribute reinforcements; when the system receivesa reinrprcement, it is given to all of the LCSs within the system architecture. This procedure has the advantage of being independentof the internal architecture, but can make the learning task difficult because there can be ambiguous situations in which the trainer cannot give the correct reinforcementto the component LCSs. A correct action might be the result of two wrong messages . For example, in our Chase / Feed / task see 3.4 the LCS called SWI might in a feeding situation Escape ( figure ), choose to give control to the wrong basic LCS, that is, LCSChase , which in turn would propose a normally wrong action, " Move " away from the light, that , by chance, happenedto be correct. Nevertheless , this method of distributing reinforcementsis interestingbecauseit doesnot require accessto the internal modulesof the systemand is much more plausible from an ethological point of view. . Modular shaping, in which each LCS is trained with a different reinforcemen program, taking into account the characteristicsof the task that the consideredLCS has to learn. We have found that a good way to implement modular shapingis first to train the basicLCSs, and then, after " " they have reacheda good performance level, to freeze them (that is, basic LCSs are no longer learning, but only performing) and to start training upper-level LCSs. Training basic LCSs is usually done in a separate session(in which training of different basic LCSs can be run in parallel) .
Chapter 3
In principle, training different LCSs separately, that is, modular shaping , makes learning easier, although the shapingpolicy must be designed in a sensibleway. Hierarchical architecturesare particularly sensitiveto the shaping policy; indeed, it seemsreasonable that the coordination modules be shapedafter the lower modules have learned to produce the basicbehaviors. The experimentsin chapter 4 on two-level and three-level switch architectures show that good results are obtained, as we said above, by shaping the lower LCSs, then freezing them and shaping the coordinators, and finally letting all componentsbe free to go on learning together.
3.5 POINTS TO REMEMBER
. Complex behaviorscan be obtained from simpler behaviors through a number of composition mechanisms. . When using ALBCSYS to build the learning control systemof an autonomous agent, a choice must be made regarding the systemarchitecture. We have pre'sentedthree types of architectures: monolithic , distributed fiat , and distributed hierarchical. . The introduction of an LCS module into an architecturemust satisfy an " information " . compression criterion: the input string to the LCS must be longer than the output string sent to the upper-level LCS. . When designingan architecture, a few qualitative criteria regarding the matching betweenthe structure of the behavioral task to be learned and the type of architecture must be met. In particular, the component LCSs of the architecture should correspond to well-specifiedbehavioral tasks, and the chosen architecture should naturally match the structure of the global target behavior. . Distributed architecturescall for a shaping policy . We presentedtwo extremechoices: holistic and modular shaping.
Chapter
4
Experimentsin Simulated Worlds
4.1 I NTRODUcn ON In this chapter we present some results obtained with simulated robots. Becauseit takes far lesstime to run experimentsin simulatedworlds than in the real world , a much greater number of design possibilities can be tested. The following questionsguided the choice of the experiments: . Does ICS perform better than a more standard LCS like LCSo? . Can a simple form of memory help the Autono Mouse to solve problems that require it to rememberwhat happenedin the recent past? . Does decompositionin subtaskshelp the learning process? . How must shaping be structured? Can basic behaviors and coordination of behaviors be learned at the sametime, or is it better to split the learning processinto severaldistinct phases? . Is there any relation betweenthe agent' s architecture and the shaping policy to be used? . Can an inappropriate architectureimpede learning? . Can the different architectural approaches we used in the first experiments be composedto build more complex hierarchical structures? . Can a bad choice of sensorgranularity with respectto the agent' s task impede learning? In the rest of this section, we explain our experimental methodology, illustrate the simulated environments we used to carry out our experiments , and describethe simulated Autono Mouse we used in our experiments . In section4.2, we presentexperimentsrun in a simple environment in which our simulated Autono Mouse had to learn to chase a moving light source. The aim of theseexperimentswas to test ICS and ALEcsys,
Chapter4
and to study a simple memory mechanismwe call " sensormemory." In section4.3, we give a rather completepresentationof a multibehavior task and discussmost of the issuesthat arise when dealing with distributed architectures. In particular, we focus on the interplay betweenthe shaping policy and the architecture chosento implement the learning system. In section4.4, we slightly complicate the overall learning problem by adding a new task, feeding, and by augmentingthe dimensionof the sensoryinput , longer and adding multiple copiesof the sameclass making sensormessages of objects in the environment. We report extensively on experiments comparing different architectural choicesand shapingpolicies. Section4.5 briefly reports on an experimentrun in a slightly more complex environment which will be repeated, using this time a real robot, in chapter 5. 4.1.1 ExperimentalMethodology In all the experimentsin simulatedworldst we use p=
Number of correct responses -< I Totall : 1lnberof responses
as performancemeasure. That i Stperformanceis measuredas the ratio of correct moves to total moves performed from the beginning of the simulation . Note that the notion of " correcttt responseis implicit in the RP: a responseis correct if and only if it receivesa positive reinforcement. " Thereforet we call the above defined ratio the cumulative performance tt measureinduced by the RP. The experimentsreported in sections4.2 and 4.3 were run as follows. Each experimentconsistedof a learning phaseand a test session. The test sessionwas run after the learning phase; during the test sessionthe learning algorithm was switchedoff ( but actions were still chosenprobabilisticallyt using strength) . The index Pt measuredconsideringonly the moves done in the test sessiontwas used to compute the performance achieved by the systemafter learning. For basic behaviorst P was computed only for the movesin which the behaviorswere active; we computed the global performanceas the ratio of globally correct moves to total moves during the whole test sessiontwhere at every cycle (the interval between two sensorreadings) a globally correct move was a move correct with respect to the current goal. Thust for examplet if after ten cyclesthe Chase behavior had been active for 7 cyclest proposing a correct move 4 timest and the Escape behavior had been active for 6 cyclest proposing a correct move 3 timest then the Chase behavior performancewas 4/ 7t the
Experimentsin Simulated Worlds
Escape behavior performance 3/ 6, and the global performance (4 + 3)/ (7 + 6) = 7/ 13. To compare the performancesof different architectural or shaping policies, we usedthe Mann-Whitney nonparametric test (see, for example, Siegeland Castellan 1988), which can be regardedas the non' parametric counterpart of Student s I-test for independentsamples. The choice of nonparametric statistics was motivated by the failure of our data, even if measuredon a numeric scale, to meet the requirementsfor parametric statistics(for example, nonnaI distribution) . The experimentspresentedin section 4.4 were run somewhat differently . First , insteadof presettinga fixed number of learning cycles, we ran an experimentuntil there was at least someevidencethat the performance was unlikely to improve further; this evidencewas collectedautomatically by a steadystatemonitor routine, checkingwhether in the last k cyclesthe performance had significantly changed. In experimentsinvolving multiphase shaping strategies, a new phasewas started when the steady state monitor routine signaledthat learning had reacheda steadystate. Second, experimentswere repeatedseveraltimes (typically five), but typical results were graphedin'steadof averageresults. In fact, the useof the steadystate monitor routine made it difficult to show averagedgraphs becausenew phasesstarted at different cycles in different experiments. Nevertheless , all the graphs obtained were very similar, which makes us confident that the typical results we present are a good approximation of the average behavior of our learning system. The reasonwe usedthis different experimental approach is that we wanted to have a systemthat learned until it could profit from learning, and then was able to switch automatically to the test session. Finally , section 4.5 presents simple exploratory experiments run to provide a basis of comparison for those to be run with a real robot (see section 5.3) . In theseexperimentswe did not distinguish betweena learning and a test session, but ran a single learning sessionof appropriate length (50,000 cyclesin the first experimentsand 5,000 in the second).
4.1.2 Simulation Environments From chapter 3 it is clear that to test all the proposed architecturesand shapingpolicies, we needmany different simulated worlds. As we needa basic task for each basic behavior, in designing the experimental environments , we were guided by the necessityof building environments in which basic tasks, and their coordination, could be learned by the tested agent architecture. We usedthe following environments:
4 Chapter
(a)
(b)
*-light
V/~~~ ~2~~~(,:j ~ light
ell ~ Iat ed Autono Mouse
~
(c )
< ' t / t @ light predator
I 9 "
Figure4.1 : the AutonoMouseand its Simulatedenvironmentsetup: (a) Chaseenvironment visualsensors ; (c) Chase/ Feed/Escape environme ; (b) Chase/Escape environment : AutonoMousedoesnot seelight ; and (d) Find Hidden environment whenit is in the shadedarea. . . . .
Chase environment (single-behavior environment). Chase/ Escape environment (two-behavior environment) . Chase / Feed /Escape environment (three-behavior environment) . Find Hidden environment (two-behavior environment) .
In the Chase environment (seefigure 4.la , in which the Autono Mouse is shown with its two binary eyes, which eachcover a half-space), the task is to learn to chasea moving object. This environment was studied pri marily to test the learning classifiersystemcapabilities and as a test bed for proposedimprovementsin the LCS model. In particular, the issuewas whether the useof punishmentsis a good thing; we also testedthe optimal length for the messagelist, a parameter that can greatly influence the performanceof the learning system. The Chase environment was usedto
Experimentsin Simulated Worlds
fine-tune the performance of ICS. The environment dimensions, for this and all the following environments, were 640 x 480 pixels, an Autono Mouse' s step was set to 3 pixels, and the maximum speedof the Autono Mouse was the sameas the speedof the light source. In the Chase/Escape environment there are two objects: a light and a lair . There is also a noisy predator, which appears now and then, and which can be heard, but not seen, by the Autono Mouse. When the predator appears, the Autono Mouse, otherwise predisposed to chase the moving light source, must run into its lair and wait there until the predator has gone. The light sourceis always present. The predator appearsat random time intervals, remains in the environment for a random number of cycles, and then disappears. The predator never enters the environment : it is supposedto be far away and its only effect is to scare the Autono Mouse, which runs to hide in its lair . Figure 4.1b shows a snapshot of the environment. In the Chase/ Feed /Escape environment there are three objects: a light, a food source, and a predator. This environment is very similar to the previous onf , but for the presenceof food and for the different robotpredator interaction. As above, the Autono Mouse is predisposedto chase the moving light source. When its distance from the food source is less than a threshold, which we set to 45 pixels, then the Autono Mouse feels hungry and thus focuseson feeding. When the chasingpredator appears, then the Autonomouse's main goal is to run away from it (there is no lair in this environment) . The maximum speedof the Autono Mouse is the same as the speedof the light source and of the chasing predator. The light source and the food are always present (but the food can be seen only when closer than the 45 pixels threshold) . As above, the predator appearsat random time intervals, remains in the environment for a random number of cycles, and then disappears. Figure 4.1c showsa snapshot of the environment. In the Find Hidden environment the Autono Mouse' s task is composed of two subtasks: Chase and Search . As in the Chase environment , the Autono Mouse has to chasea moving light source. The task is complicated by the presenceof a wall. Wheneverit is interposedbetween the light and the Autono Mouse (seefigure 4.id ), the Autono Mouse cannot seethe light any longer, and must Search for the light . 4.1.3 The SimulatedAutonoMice The simulated Autono Mice are the simulated-world counterparts of the real Autono Mice in the experimentspresentedin the next chapter. Their
Chapter 4
limited capabilities are designedto let them learn the simple behavior patterns discussedin the section4.1.2.
4.1.3.1 SelMOryCapabilities In all experimentalenvironmentsexcept the Chase / Feed /Escape environmen , the simulated Autono Mouse has two eyes. Becauseeach eye covers a visual angle of 180 degreesand the two eyeshave a 90-degree " ' " overlap in the Autono Mouse s forward move direction, the eyespartition the environment into four nonoverlapping regions (seefigure 4.la ) . Autono Mouse eyescan sensethe position of the light source, of food (if it is closer than a given threshold), of the chasing predator, and of the lair " " (we say the Autono Mouse can see the light and the lair ) . We call these " the basic Autono Mouse sensorcapabilities." In our work we chose a particular disposition of sensorswe consideredreasonable. Obviously, a robot endowed with different sensorsor even a different disposition or granularity of existing sensorswould behavedifferently (see, for example, Cliff and Bullock 1993). In one of / he experimentsrun in the Chase environment (the experiment on sensormemory), the two eyes of the Autono Mouse covered a visual cone of 60 degrees ; as a result, the visual , overlapping for 30 degrees 90 in the Autono Mouse was front of , partitioned into degrees space three sectorsof 30 degreeseach. The reasonto restrict the visual field of the Autono Mouse was to make more evident the role of memory in learning to solve the task. In experimentsrun in the Chase/Escape environment, in addition to the basic sensorcapabilities, the Autono Mouse can also detect the difference betweena closelight and a distant light (the sameis true for the lair ), " " and can sensethe presenceof the noisy predator (we say it can hear " " the predator), but cannot see it . The distinction between close and " far " was necessarybecausethe Autono Mouse had to learn two different " " responseswithin the hiding behavior: When the lair is far , approach it ; " When the lair is close the Autono Mouse is into the lair do not move." ], [ The distinction between " close" and " far " was not necessaryfor the chasingbehavior, but was maintained for uniformity . In experiments run in the Chase / Feed / Escape environment, the simulated Autono Mouse has four eyes(like Autono Mouse II ; seefigure 1.3) . Becauseeacheyecoversa visual angle of 90 degreeswithout overlap, the four eyespartition the environment into four nonoverlapping regions. As in the caseof basic sensorcapabilities, Autono Mouse eyescan sense
Experimentsin Simulated Worlds
the position of the light source, of the food (if it is closer than a given threshold), and of the chasingpredator. The reason to have sensorswith different characteristicsfrom those in the previous environments is that we wanted to seewhether there was any significant changein the Autono Mouse' s performancewhen it was receivinglonger than necessarysensory messages(it actually took twice the number of bits to code the same amount of sensoryinformation using four eyesthan it did using two eyes, as in the Chase case). In the Find Hidden environment, the Autono Mouse was enriched with a sonar and two side bumpers. Both the sonar and the bumperswere used as onloff sensors. Autono Mice with two eyesrequire two bits to identify an object (light, lair , or predator) relative position, while Autono Mice with four eyes require four bits. In the experiments where the Autono Mouse differentiates betweena close and a far region, one extra bit of information is added. Moreover, one bit of information is usedfor the sonar and one for each of the bumpers. In the Chas'e environment, the Autono Mouse receivestwo bits of information from sensors to identify light position. In the Chase/ Escape environment, the Autono Mouse receivessevenbits of information from sensors: two bits to identify light position, two bits to identify lair position, one bit for light distance(close/ far), one bit for lair distance (close/ far), and one bit to signal the presenceof the predator. In the Chase/ Feed / Escape environment the Autono Mouse receives twelve bits of information from sensors: four bits to identify light position, four bits to identify lair position, and four bits to identify predator position. In the Find Hidden environment the Autono Mouse receivesfive bits of information from sensors: two bits to identify light position, one bit from the sonar, and one bit from each of the two side bumpers. Table 4.1 summarizesthe sensorycapabilities of the Autono Mouse in the different environments. Pleasenote that two bits of information were added to messagesto distinguish betweensensormessages , internal messages and motor . , messages 4.1.3.2 Motor Capabilities The Autono Mouse has a right and a left motor and it can give the " " " following movement commands to each of them: Stay still ; Move " " " " one step backward ; Move one step forward ; and Move two steps forward " . The maximum speedof the Autono Mouse, that is, two steps
4 Chapter Table4.1 environments Sensory inputin different experimental Sonar Bumpers Total
2 bits
Chaseenvironment
Chase/Escapeenvironment 3 bits Chase/Feed/Escape 4 bits environment
4 bits
tags
2 bits 7 bits 12bits
1 bit 4 bits
2 bits
Find Hidden environment
left motor
3 bits
1bit
2 bit
5 bits
The two bits going to each motor have the following meaning:
right mo~ r
00: 01: 10: 11:
one step backward stay still one step forward two steps forward
figure 4.2
of message Structure goingto motors
per cycle, was set to be the sameas the speedof the moving light or of the chasingpredator (it wasfound that settinga higher speed, say, four stepsper cycle, made the task much easier, while a lower speedmade it impossible) . At every cycle, the Autono Mouse decideswhether to move. As there are four possibleactions for each motor , a move action can be described sendsto the Autono Mouse motors by four bits. The messagesALECSYS are therefore six bits long, the two additional bits being used to identify . The structure of a messagesent to the messagesas motor messages motors is shown in figure 4.2. In the caseof the Find Hidden experiment, we useda different coding : the output interface included two bits, coding the for motor messages " " " " " four following possible moves: still , straight ahead, ahead with a " " " left turn , and aheadwith a right turn. 4.2 EXPERIMENTS IN THE Chase ENVIRONMENT
in the Chase environmentto answerthe following We ran experiments 1 : questions
Experimentsin SimulatedWorlds Table 4.2 Comparison betweenrewards and punishmentstraining policy and only rewards training policy
Rewardsandpunishments Average Std. dev. Average Only rewards Std. dev.
ML = l
ML = 2
ML = 4
ML = 8
0.904 0.021 0.744 0.022
0.883 0.023 0.729 0.028
0.883 0.026 0.726 0.027
0.851 0.026 0.682 0.025
Note: Results obtained in test session; 1,000 cycles run after 1,000 cycles of learning. Values reported are means of performance index P over 20 runs, and standard deviations.
. Is it better to useonly positive rewards, or are punishmentsalso useful? . Are internal messagesuseful? . Does ICS perform. better than LCSo? In particular,'fwe . compared the results obtained using only positive rewards with those obtained using both positive rewardsand punishments; . comparedthe results obtained using messagelists of different lengths; . evaluated the usefulness of applying the GA when the bucket brigade has reacheda steadystate; . evaluatedthe usefulnessof the mutespecoperator; and . evaluatedthe usefulnessof dynamically changing the number of classifiers used. We also ran an experimentdesignedto provide the Autono Mouse with some simple dynamic behavior capabilities, using a general mechanism basedon the idea of sensormemory. 4.2.1 De Role of PuuWlmentsand of Internal Measges Our experimental results have shown (seetable 4.2) that using rewards and punishmentsworks better than using only rewards for any number . Mann-Whitney tests showed that the differences of internal messages observedbetweenthe first and the secondrow of table 4.2 are highly significant for every messagelist (ML ) length tested(p < 0.01) . Resultshave also indicated that , for any given task, a one-messageML is the best. This is shown both by learning curves for the rewards and punishmentspolicy (seefigure 4.3) and by resultsobtained in the test session (see table 4.2, mean performance for different values of ML ). We
.0 0 9 . .0 7 ..5 0 60200 400 60 80 1
Chapter 4
e
-Q -. "~G )
Number of cycles
. - - - - - - - - - -
ML
=
1
. . . . . . . . . . . ML = 2
ML = 4
ML = 8
F1pre4.3 Learningto chaselight source(chasingbehavior ). Comparisonof differentML over20 runs. , usingrewardsandpunishments lengths trainingpolicy; averages studied the statistical significance of the differences between adjacent means by applying the Mann-Whitney test. Sampleswere ordered by increasingML length, and the choice of comparing adjacent sampleswas done a priori (which makes the application of the Mann -Whitney test a correct statistical procedure). The differencesconsideredwere everywhere found to be highly significant (p < 0.01) by the Mann -Whitney test except between ML = 2 and ML = 4. This result is what we expected, given that the consideredbehavior is a stimulus-responseone, which does not require the useof internal messages . Although l:Iolland' s original LCS used internal messages , Wilson ( 1994) has recently proposed a minimal LCS, called " ZCS," which uses none, although to learn dynamic behavior , some researchers(Cliff and Ross 1994) have recently extended Wilson' s ZCS by adding explicit memory (see also our experimentsin chapter 6) . Indeed, it is easy to understand that internal messagescan slow down the convergenceof the classifiers' strength to good values. It is also interesting to note that, using rewards and punishmentsand ML = 1, the systemachievesa good performance level (about 0.8) after only about 210 cycles(seefigure 4.3; 100cyclesdone in about 60 seconds
Experimentsin SimulatedWorlds
Table4.3 of LCSoandLCSo+ energy Comparison
Note: NC = Number of classifiers introduced by application of GA . NGA = Number of cyclesbetweentwo applications of GA . For LCSo+ energy, we report averagenumber of cycles. For LCSo, we call GA every 500 iterations (500 was experimentally found to be good value). P = Perfonnance index averagedover 10 runs.
with 300 rules on 3 transputers). This makesit feasibleto useALBCSYS to control the real Autono Mouse in real time. Given the resultspresentedin this section, all the following experiments were run using punishmentsand rewards and a one-messageML . 4.2.2
Experimental Evaluation of ICS
4.2.2.1 Energy and SteadyState In table 4.3, we report some results obtained comparing LCSo with its derived ICS, which differs only in that the GA is called automatically when the bucket brigade algorithm has reached a steady state. The attainment of a steady state was measuredby the energy variable, introduced in section2.2.1. Similar resultswere obtained in a similar environment in which Autono Mouse had slightly different sensors(Dorigo 1993) . The experiment was repeatedfor different values of the number of new
4 Chapter Table 4.4 Comparison between LCSo without and with dynamical change of number of classifiers
LCSo LCSowith.dynamical changeof number of classifiers
Maximum performance P achieved
Iterations required to achieve P = 0.67
Time required to run to ,O()() iterations (sec)
Averagetime of iteration (sec)
0.85 0.95
3,112 1,797
7,500 3,900
0.75 0.39
Note: 10,000 cyclesper run; resultsaveragedover 10 runs.
classifiersintroduced in the population as a result of applying the GA becausethis was found to be a relevant parameter. Results show that the useof energyto identify a steadystate, and the subsequentcall of the GA when the s~ dy state has been reached, improves LCSo performance, as measuredby the P index. It is also interestingto note that, the number of cycles betweentwo applications of the GA is almost always smaller for LCSo with GA called at steadystate, which meansthat a higher number of classifierswill be testedin the samenumber of cycles.
4.2.2.2 The MutespecOperator Experiments were run to evaluate the usefulness of both the mutespec operator and the dynamical change of the number of classifiers used. Adding mutespecto LCSo resulted in an improvement of both the performan level achieved, and the speedwith which that performancewas obtained. On average, after 100calls of the GA , the performanceP with and without the mutespecoperator was, respectively, 0.94 and 0.85 (the differencewas found to be statistically significant) . Also using mutespec, a performance P = 0.99 was obtained after 500 applications of the GA , while after the samenumber of GA calls, LCSowithout mutespecreached a P = 0.92. 4.2.2.3 Dyuamically Changingdie Number of Oa Bfien Used Experimentshave shown (seetable 4.4) that adding the dynamic change of the number of activatableclassifiersto LCSodecreasesthe averagetime for a cycle by about 50 percent. Moreover, the proposed method causes
Experimentsin SimulatedWorlds
4.2.3 A first Step toward Dynamic Behavior
The experimentsdescribedthus far concernS-R behavior, that is, direct . Clearly, the production of more associationsof stimuli and responses complex behavior patterns involves the ability to deal with dynamic behavior , that is, with input - output associationsthat exploit some kind of internal state. In a dynamic system, a major function of the internal state is memory. Indeed, the limit of S-R behavior is that it can relate a responseonly to the current state of the environment. It must be noted that ALE Csysis not completely without memory; both the strengthsof classifiersand the internal messagesappended to the messagelist embody information about past events. However, it is easy to think of target behaviors that require a much 'more specifickind of memory. For example, if physical objectsare still or move slowly with respectto the agent, their current position is strongly correlated with their previous position. Therefore, how an object was sensedin the past is relevant to the actions to be performed in the present, even if the object is not currently perceived. Supposethat at cycle t the agent sensesa light in the leftmost area of its visual field, and that at cycle t + 1 the light is no longer sensed. This pieceof information is useful to chasethe light becauseat cycle t + 1 ' the light is likely to be out of the agent s visual field on its left. To study whether chasing a light can be made easierby a memory of past perceptions, we have endowed our learning system with a sensor memory, that is, a kind of short-term memory of the state of the Autono ' Mouse s sensors.In order to avoid an ad hoc solution to our problem, we have adopted a sensormemory that functions uniformly for all sensors, regardlessof the task. The idea was to provide the Autono Mouse with a representationof the previous state of each sensor, for a fixed period of time. That is, at any given time t the Autono Mouse can establish, for each sensorS, whether (i ) the state of S has not changed during the last k cycles (where the memoryspank is a parameterto be set by the experimenter); (ii ) the state of S haschangedduring the last k cycles; in this case, enough information is given so that the previous state of S can be reconstructed.
Chapter 4
This design allows us to define a sensormemory that dependson the input interface, but is independentof the target behavior (with the exception of k , whose optimal value can be a function of the task). More precisely , the sensormemory is made up of
. a memory word , isomorphic to the input interface; . an algorithm that updates the memory word at each cycle, in accordance to specifications(i ) and (ii ), on the basis of the current input, the previous memory word, and the number of cycles elapsedfrom the last change; . a mechanismthat appendsthe memory word to the messagelist , with a . specifictag identifying it as a memory message The memory process is described in more detail in algorithm box 4.1. Note that, coherently with our approach, the actual " meaning" of memory messagesmust be learned by the classifier systems. In other words, memory messagesare just one more kind of messages , whose correlation with the overall task has to be discovered by the learning system. The resultsobtained in a typical simulation are reported in figures4.44.6, which compare the performances of the Autono Mouse with and without memory. The memory spanwas k = 10. Figure 4.4 shows that the performance of the Autono Mouse with memory tends to become asymptotically better than the performance of the Autono Mouse without memory, although the learning processis slower. This is easy to explain: the " intellectual task" of the Autono Mouse with memory is harder becausethe role of the memory messages has to be learned; on the other hand, the Autono Mouse can learn about the role of memory only when it becomesrelevant, that is, when the light 2 disappearsfrom sight- and this is a relatively rare event. To show that the role of memory is actually relevant, we have decomposedAutono Mouse' s performance into the pe90rmance produced when the light is not visible (and therefore memory is relevant; see figure 4.5); and the performancewhen the light is visible (and thus memory is superfluous; see 3 figure 4.6) . It can be seenthat in the former casethe performanceof the Autono Mouse with memory is better. We conclude that even a very simple memory systemcan improve the in caseswhere the target behavior is not intrinsically performanceof ALECSYS S-R.
Experimentsin Simulated Worlds Algoridlm Box 4.1 The sensormemory algorithm
Chapter 4
1 0.95
eoueuuo
--c.. 0.9 ~ 0.85 ~
-'-'-. -'-.---.--
--------------.
0.8 0.75 0.7 0.65 0.6 0
30
60
90
120
150
Number of cycles ( thousands) With memory
- - - - - - - - - - - - - Without memory
F1gm ' e 4.4 Chasingmoving light with and without sensormemory
4.3 EXPERIMENTS IN THE Chase/Escape ENVIRONMENT The centralissueinvestigatedin this sectionis the role that the system architectureand the reinforcementprogramplay in making a learning systeman efficientlearner.
In particular, we compare different architectural solutions and shaping policies. The logical structure of the learning systemconsists, in this case, of the two basic behaviors plus coordination of the basic behaviors. (Coordination can be made explicit, as in the switch architecture, or implicit, as in the monolithic architecture.) This logical structure, which can be mapped onto ALECsysin a few different ways, is testedusing both the monolithic architectureand the switch architecture. In the caseof the switch architecture, we also compare holistic and modular shaping. Another goal of this section is to help the reader understand how our experimentswere run by giving a detailed specification of the objects in the simulation environment, the target behaviors, the learning architectures built using ALECSYS , the representation used, and, finally, the reinforcementprogram.
0.95 0.8.5 .-.-----.-.0.75 0.60530 150 60120
Experimentsin SimulatedWorlds
a. . -
eoue
"~m
90
Numberof cycles( thousands )
With
memory
- - - - - - - - - - - - - Without
memory
F1gore4.5 Light - cbasingperformancewhen light is not visible
4.3.1 De Simulation Environment Simulations were run placing the simulated Autono Mouse in a two dimensional environment where there were some objects that it could perceive. The objects are as follows: A moving light source, which moves at a speedset equal to the maximum speed of the simulated Autono Mouse. The light moves along a straight line and bouncesagainst walls (the initial position of the light is random) . . A predator, which appearsperiodically and can only be heard. . The Autono Mouse' s lair , which occupiesthe upper right angle of the environment (seefigure 4.tb ) .
4.3.2 TargetBehanor The goal of the learning system is to have the simulated Autono Mouse
learn the following three behavior patterns: 1. Chasing behavior. The simulated Autono Mouse likes to chase the moving light source.
. _ ~ _ . , '0 . .0 9 :.8I,. 5 5 .0 0 7 5 .6030 560 12 15
Chapter 4
--a.. ~ ~ 8 iE 0; or l
90
Numberof cycles( thousands )
With memory
- - - - - - - - - - - - - Without memory
4.6 F1gare whenlightis visible Light-chasing perfonnance 2. Escaping behavior. The simulated Autono Mouse occasionally hears the sound of a predator. Its main goal then becomesto reach the lair as soon as possibleand to stay there until the predator goesaway. 3. Global behavior. A major problem for the learning systemis not only to learn single behaviors, but also to learn to coordinate them. As we will see, this can be accomplishedin different ways. If the learning processis successful , the resulting global behavior should be as follows: the simulatedAutono Mouse chasesthe light source; when it happensto hear a predator, it suddenlygives up chasing, runs to the lair , and staysthere until the predator goesaway. In these simulations the Autono Mouse can hear, but cannot see, the predator. Therefore, to avoid conflict situations like having the predator betweenthe agent and the lair , we make the implicit assumptionthat the Autono Mouse can hear the predator when the predator is still very far away, so that it always has the time to reach the lair before the predator entersthe computer monitor .
Experimentsin Simulated Worlds
4.3.3 The I . . eamingArchitectures: MoooUthic and Hierarchical Experimentswere run with both a monolithic architecture(seefigure 3.1) and a switch architecture (see figures 3.3 and 3.4) . In the monolithic architecture the learning systemis implementedas a single low -level par" " allellearning classifiersystem(on three transputers), called LCS-global. In the switch architecture, the learning system is implemented as a set of three classifiersystemsorganized in a hierarchy: two classifier systems learn the basicbehaviors(Chase and Escape ), while one learnsto switch betweenthe two basic behaviors. To implement the switch architecture, we took advantage of the high-level parallelism facility of ALECSvs; we used one transputer for the chasing behavior, one transputer for the escapingbehavior, and one transputer to learn to switch.
4.3.4 Representation As we have seen in chapter 2, ALECsys classifiershave rules with two condition parts and one action part . Conditions are connected by an AND operatof; a classifierentersthe activation state if and only if it has both conditions matched by at least a message . Conditions are strings on k k { O, I , # } , and actions are strings on { O, I } . The value ofk is set to be the sameas the length of the longest among sensorand motor messages . In our experiments the length of motor messagesis always six bits, while the length of sensormessagesdependson the type of architecture chosen. When using the monolithic architecture, sensormessagesare composed of nine bits, sevenbits used for sensoryinformation and two bits for the .4 ( Therefore, a tag, which identifiesthe messageas being a sensormessage . = classifierwill be 3 9 27 bits long.) In the caseof switch architecture, sensoryinformation is split to create two messages , a five-bit messagefor the chasing behavioral module (two bits for light position, one bit for light distance, and two bits as a tag), and a six-bit messagefor the escapingbehavioral module (two bits for light position, one bit for lair distance, one bit for the predator presencemessage , and two bits as a tag). When using a switch architecture, it is also necessaryto define the format of the interface messagesamong basic LCSs and switches. At every cycle each basic LCS sends, besidesthe messagedirected to motors if that is the case, a one-bit messageto the switch. This one-bit message is intended to signal the switch the intention to propose an action. A messagecomposed by all the bits coming from basic LCSs (a two-bit
4 Chapter
lair ion tags distance posit
ofdistance ligh1 light presence ion the poAit predator Fi~ 4.7 fromsensory architecture Structure of message inputusingmonolithic
tags
light light distancepo~t ion
distance
presence of the predator
posit ion
tags
messagein our experiment: one bit from the Chase LCS and one bit from the Escape LCS) goes as input to the switch. Hence the switch selectsone of the basic LCSs, which is then allowed to send its action to the Autono Mouse motors. The value and the meaning of the bit sent from the basic LCSs to the switch is not predefined, but is learned by the system. In figure 4.7, we show the format of a messagefrom sensoryinput in the monolithic architecture; in figure 4.8, the format of a messagefrom sensoryinput to basic LCSs in the switch architecture. Consider, for example, figure 4.8b, which showsthe format of the sensor messagegoing to the Escape LCS. The messagehas six bits. Bits one and two are tags that indicate whether the messagecomesfrom the environmen or was generatedby a classifier at the preceding time step (in which casethey distinguish betweena messagethat causedan action and one that did not) . Bits three and four indicate lair position using the following encoding:
in SimulatedWorlds Experiments 00: no lair , 01: lair on the right , 10: lair on the left, 11: lair ahead. Bit five indicates lair distance; it is set to 1 if the lair is far away and to 0 if the lair is very close. Bit six is set to 1 if the Autono Mouse hears the sound produced by an approachingpredator; to 0 otherwise.
4.3.5 The ReiDforcementProgram As we have seenin section 1.2, the role of the trainer (i.e., the RP), is to observethe learning agent' s actions and to provide reinforcementsafter each action. An interesting issueregarding the use of the trainer is how the RP should be structured to make the best use of the architecture. To answer this question, we ran experiments to compare the monolithic and the switch architecture. Our RP comprises two procedures called " RP-Chase " " " pod RP-Escape . RP-Chase gives the Autono Mouse a reward if it moves so that its distance from the light source does not increase. On the other hand, if the distance from the light source increases . RP-Escape , which is used when the , it punishes- A LEC SYs predator is present, rewardsthe Autono Mouse when it makesa move that diminishesthe distancefrom the lair , if the lair is far ; or when it staysstill , if the lair is close. Although the sameRP is usedfor both the monolithic architecture and the switch architecture, when using the switch architecture a decision must be made about how to shapethe learning system. 4.3.6 Choiceof an Architedure and a ShapingPolicy The number of computing cycles a learning classifier systemrequires to learn to solve a given task is a function of the length of its rules, which in turn dependson the complexity of the task. A straightforward way to reduce this complexity, which was used in the experiments reported below, is to split the task into many simpler learning tasks. Wheneverthis can be done, the learning system performance should improve. This is what is testedby the following experiments, where we comparethe monolithic and the switch architectures, which were described in chapter 3. Also, the use of a distributed behavioral architecture calls for the choice of a shaping policy; we ran experimentswith both holistic and modular shaping.
Chapter 4 Table 4.5 Comparison acrossarchitecturesduring test session Modular -switchHolistic-switch Monolithic architecture architecture long architecture A vI .
Std.dev.
Chasing 0.830 0.169 Escaping 0.744 0.228 0.784 0.121 Global
A VI .
Std.dev.
0.941 0.053 0.919 0.085 0.933 0.073
A VI .
Std.dev.
0.980 0.015 0.978 0.022 0.984 0.022
Modular -switchshort architecture A VI .
Std.dev.
0.942 0.034 0.920 0.077 0.931 0.085
Note: Values reported are for perfonnanceindex P averagedover 20 runs.
In experiment A , we compared the monolithic architecture and the " " switch architecturewith holistic shaping( holistic-switch ). Differencesin performancebetweenthesetwo architectures, if any, are due only to the different architectural organization. Each architecturewas run for 15,000 learning cycles, and was followed by a 1,000- cycle test session. In experim! nt B, we ran two subexperiments( Bl and B2) using the " " switch architecture with modular shaping ( modular-switch ). In subexperim B 1, we gavethe modular-switch architectureroughly the same amount of computing resourcesas in experimentA , so as to allow a fair comparison. We ran a total of 15,000 cycles: 5,000 cycles to train the chasing behavioral module, 5,000 cyclesto train the escapingbehavioral module, and 5,000 cyclesto train the behavior coordination module (for " easeof reference, we call the architecture of this experiment modular" switch-long ) . It is important to note that , although the number of learning ) cycleswas the sameas in experiment A , the actual time (in seconds required was shorter becausethe two basic behaviorscould be trained in parallel. Each learning phasewas followed by a 1,000-cycletest session. In subexperiment B2, we tried to find out what was the minimal amount of resourcesto give to the modular-switch architecture to let it reach the same performance level as the best performing of the two architecturesused in experimentA . We ran a total of 6,000 cycles: 2,000 cycles to train the chasing behavioral module, 2,000 cycles to train the escapingbehavioral module, and 2,000 cyclesto train the behavior coordination " " module; we refer to this architecture as modular-switch-short. - cycletest session. Again, each learning phasewas followed by a I ,OOO Results regarding the test sessionare reported in table 4.5. The bestperforming architecturewas the modular-switch. It performed best given
in Simulated Worlds Experiments
approximately the sameamount of resourcesduring learning, and it was able to achievethe sameperformance as the holistic-switch architecture using much lesscomputing time. Theseresultscan be explainedby the fact that in the switcharchitectureeachsingleLCS, having shorterclassifiers,has a smaller searchspace;5 therefore, the overall learning task is easier. The worst-performing architecture was the monolithic. Theseresults are consistent with those obtained in a similar environment, presentedin section 4.4. We studied the significanceof the difference in mean performance betweenthe monolithic and the holistic-switch architectures, and between the holistic-switch and the modular-switch-long architectures(the choiceof which architecturesto comparewas made a priori ) . Both differenceswere found to be highly significant (p < 0.01) by the Mann -Whitney test for all the behavioral modules (Chase , Escape , and the global behavior). 4.4 EXPERIMENTS IN THE Chase /Feed/ Escape ENVIRONMENT In this section , we systematically compare different architecture and shaping policY combinations using the Chase/ Feed/ Escape environ-
menteWe try to find a confirmation to some of the answerswe already had from the experimentspresentedin the previous section. Additionally , we make a first step toward studying the architecture scalability and the importance of choosinga good mapping betweenthe structured behavior and the hierarchical architecturethat implementsit .
4.4.1 MonoUtbic Architecture Becausethe monolithic architecture is the most straightforward way to apply LCSs to a learning problem, results obtained with the monolithic architecturewere usedas a referenceto evaluatewhether we would obtain improved performance by decomposing the overall task into simpler subtasks, by using a hierarchical architecture, or both. In an attempt to be impartial in comparing the different approaches, we adopted the same number of transputers in every experiment. And because , for a given number of processors , the systemperformance is dependenton the way the physical processors are interconnected, that is, on the hardware architecture, we chose our hardware architecture only after a careful experimental investigation (discussedin Piroddi and Rusconi 1992; and Camilli et ale 1990). Figure 4.9 shows the typical result. An important observation is that the performanceof the Escape behavior is higher than the performance
Chapter4
90ue
-Q -. "G i) >
1 0.9
'----._--~ ,-.---~ .,.............-..-...-....,.....
0.8 0.7 0.6
0.5 0.4 0.3 20
30
40
50
60
70
80
Number of cycles ( thousands) - - - - - . Chase
. . . . . ' Feed
Escape
Global
figure 4.9 Cumulative performanceof typical experimentusing monolithic architecture
of the Chase behavior, which in turn is higher than that of the Feed behavior. This result holds for all the experiments with all the architectures . The reasonsare twofold . The Escape behavior is easierto learn becauseour learning agent can choosethe escapingmovement among 5 out of 8 possible directions, while the correct directions to Chase an object are, for our Autono Mouse, 3 out of 8 (seefigure 4.10) . The lower performanceof the Feed behavior is explained by the fact that, in our experiments, the Autono Mouse could see the object to be chasedand the predator from any distance, while the ft?od could be seen only when closer than a given threshold. This causeda much lower frequency of activation of Feed , which resultedin a slower learning rate for that behavior. Another observationis that, after an initial , very Quickimprovement of performance, both basic and global performancessettled to an approximately constant value, far from optimality . Mer 80,000 cycles, the global performancereachedthe value 0.72 and did not change(we ran the experiment up to 300,000 cycles without observing any improvement) . Indeed, given the length of the classifiersin the monolithic architecture, the searchspace(cardinality of the set of possibledifferent classifiers) is
Experimentsin SimulatedWorlds
@ predator
~ light
light- chasing directions
predator-escaping direct ions F1gare4.10 Difference betweenchasingand escapingbehaviors
huge: the genepcalgorithm, togetherwith the apportionment of credit system , appearsunable to searchthis spacein a reasonableamount of time.
4.4.2 MonoUdlie Architecture with Distributed Input
With distributed input , environmental messagesare shorter than in the previous case, and we therefore expect a better perfonnance. More than one messagecan be appendedto the messagelist at eachcycle (maximum three messages , one for eachbasic behavior) . The results, shown in figure 4.11, confirmed our expectations: global perfonnance settled to 0.86 after 80,000 cycles, and both the Chase and Escape behaviors reachedhigher perfonnance levels than with the previous monolithic architecture. Only the Feed behavior did not improve its perfonnance. This was partially due to the shortenedrun of the experiment . In fact, in longer experiments, in which it could be tested adequately , the Feed behavior reached a higher level of perfonnance, comparable to that of the Chase behavior. It is also interesting to note that the graph qualitatively differs from that of figure 4.9; after the initial steepincrease, perfonnance continues slowly to improve,' a sign that the learning algorithms are effectivelysearchingthe classifiers space.
4.4.3 Two-level Switch Architecture
In this experimentwe useda two-level hierarchical architecture, in which the coordination behavior implementedsuppression.The results, reported
Chapter 4
1
--Q . 0.9 -as0.8 ~ 0.7
.
e
-,.." ,-
0.6 0.5 0.4 0.3
....................., . .... ... .. . .. .. . .. . .. . .' .....,..... ~ ~ ' ", .., , 0 10 20 30 40 50 60 70 80 Number of cyclesi ( thousands) . . . . . . Chase
. . . . . . Feed
Escape
Global
figure4.11 Cumulative of typicalexperiment architecture pe/rormance with usingmonolithic distributed input in figures 4.12 and 4.13, give the following interesting information . First , as shown in figure 4.12, where we report the performance of the three basic behaviorsand of the coo~dinator (switch) in the first 50,000 cycles, and the global performance from cycle 50,000 to the end of the experiment , the use of the holistic shaping policy resultsin a final performance that is comparableto that obtained with the monolithic architecture. This is because , with holistic shaping, reinforcementsobtained by each individual LCS are very noisy. With this shaping policy, we give each LCS the same reinforcement, computed from observing the global behavior, which meansthere are occasionswhen a double mistake results in acorrect , and therefore rewarded, final action. Consider the situation where Escape is active and proposesa (wrong) move toward the predator, but the coordinator fails to choosethe Escape module and choosesinstead the Chase module, which in turn proposesa move away from the chased object (wrong move), say in the direction opposite to that of the predator. The result is a correct move (away from the predator) obtained by the composition of a wrong selection of the coordinator with a wrong proposed move of two basic behaviors. It is easyto understandthat it is difficult to learn good strategieswith such a reinforcementfunction.
in SimulatedWorlds Experiments 1 '6 - :: "i > ) -Q Q ) u ; E '-0 'Q ) a..
0.9 0.8
--"."----, , , ~ , ~.~II.---"-..--~. . I _ . . 0.6 -,' ,,.~ ....,...,...........-......."- ..-. 0.5 .,J ....,.. I \~\ .r 0.4 J ' , / ~/ 0.3 0 10 20 30 40 50 60 70 80
0.7
Number of cycles ( thousands) _ _ _ _ 0
Chase
. . . . . .
Feed
Escape -
-
Switch
tic - e 4.12 Cumulativepe~ rmanceof typicalexperimentusingtwo- level . Holisticshaping
Global
switch architecture .
Second , using the modular shaping policy, performance improves, as expected. The graph of figure 4.13 shows three different phases. During the first 33,000 cycles, the three basic behaviors were independently learned. Between cycles 33,000 and 48,000, they were frozen, that is, learning algorithms were deactivated, and only the coordinator was learning. After cycle48,000, all the componentswere free to learn, and we observed the global performance. The maximum global performance value obtained with this architecturewas 0.84. Table 4.6 siJmmarizesthe results from using the monolithic and two level architecturesalready presentedin figures 4.9, 4.11, 4.12, and 4.13. The table showsthat, from the point of view of global behavior, the best resultswere obtained by the monolithic architecturewith distributed input and by the two-level switch architecture with modular shaping. The performanc of the basic behaviors in the secondwas always better. In particular , the Feed behavior achieved a much higher performance level; using the two-level switch architecture with modular shaping, each basic behavior is fully tested independently of the others, and therefore the Feed behavior has enough time to learn its task. It is also interesting to note that the monolithic architectureand the two -level switch architecture with holistic shapinghave roughly the sameperformance.
Chapter 4
9
-a.. Q >)) Q
1 ---' ~"..' ' -..-'-,... 0.9 "'1-.,-...\~-.-.r .6, 0.8 .~ ; .... ~ ~ f 0.7 0.6 0.5 0.4 0.3 0 10 20 30 40 50 60 70 80 Number of cycles ( thousands )
Chase
. . . . . .
Feed
Escape
Jiigure4.13 Cumulativepetformance of typicalexperiment Modularshaping .
-
-
Switch
Global
using two- level switch architecture .
Warlds in Simulated Experiments 1 -Q. ~ ~() m E .g If
0.9 0.8 ~ \\ - --. 1..8\~1 ---. -_...--. ...---- . . ----------_ .. . . . . . . I.. ,.'-.-,,,,-,~ ~"~.. .. . .. ,.. .~. ~ \ ~ "' - - - - - ..... - ..... - - - - 1f.,,1 ....,,",~ V ---T' , " "' ....' " ; ~ ..... .... .-. .. ... ....- ..........
0.7 0.6 0.5 0.4 0.3 0
20
60
40
80
100
120
Number of cycles ( thousands )
- - - - - . Chase - - SW1
. . . . . . Feed -
-
SW 2
Escape Global
flgm' e 4.14 Cumulative performance of typical experiment using three-level switch architecture of figure 3.4. Holistic shaping. 4.4.4
1bree- level Switch Architecture The three-level switch architecture stretches to the limit the hierarchical approach (a three-behavior architecturewith more than three levelslooks absurd) . Within this architecture, the coordinator used in the previous architecture was split into two simpler binary coordinators (see figure 3.4) . Using holistic shaping, results have shown that two- or three-level architecturesare comparable (seefigures 4.12 and 4.14, and table 4.7) . More interesting are the results obtained with modular shaping. Because we have three levels, we can organize modular shaping in two or three phases. With two-phase modular shaping, we follow basically the same procedureusedwith the two-level hierarchical architecture: in the second phase, basic behavioral modules are frozen and the two coordinators learn at the sametime. In three-phasemodular shaping, the secondphase is devotedto shapethe second-level coordinator (all the other modulesare frozen), while in the third phase, the third -level coordinator alone learns. Somewhatsurprisingly, the results show that, given the sameamount of resources(computation time in seconds ), two-phasemodular shapinggave slightly better results. With two-phasemodular shaping, both coordination
Chapter4 Table4.7 Comparisonof two- andthree-levelhierarchicalarchitectures
Note: Valuesreportedarefor performance indexP, with numberof iterationsin . underlyingpar&ntheses
behaviorsare learning for the whole learning interval, whereaswith threephasemodular shaping, the learning interval is split into two parts during which only one of the two coordinators is learning, and therefore the two switches cannot adapt to each other. The graph in figure 4.15 shows the very high performancelevel obtained with two-phasemodular shaping. Table 4.7 comparesthe results obtained with the two- and three-level switch architectures: we report the performance of basic behaviors, of switches, and of the global behavior, as measuredafter k iterations, where k is the number in parenthesesbelow eachperformancevalue. Remember that , as we already said, the number of iterations varied across experiments due to the steady state monitor routine, which automatically decidedwhen to shift to a new phaseand when to stop the experiment. We have run another experimentusing a three-level switch architecture to show that the choice of an architecture that doesnot correspondnaturally to the structure of the target behavior leads to poor performance. The architecturewe have adopted is reported in figure 4.16; it is similar to that of figure 3.4, exceptthat the two basic behaviors Feed and Escape have been swapped. It can be expectedthat the new distribution of tasks betweenSWI and SW2 will impedelearning: becauseSW2 doesnot know whether SWI is proposing a Chase or an Escape action, it cannot
in SimulatedWorlds Experiments
------. " , . , ' .. . r.,...'. I ~ . ~ , " ( ,.--.."- ..- - ~ I'.I'.'..', .I' ~ / I1,. \\
1 0.9
eoue
Q) > Q)
0.8 0.7 0.6 0.5 0.4 0.3
0 20 40 60 80 100120 Number of cycles ( thousands) - - - - - . Chase - SW 1
-
. . . . . . Feed - - SW 2
Escape Global
figure 4.15 Cumulativeperformance of typical experiment uisng three-level switch architecture of figure 3.4. Two- phasemodular shaping.
Environment
figure 4.16 Three- level modules
with " unnatural switcharchitecture
"
disposition of coordination
Chapter4 1 -a:"i > ) -Q Q ()) c tG E ~ ~~ Q ) no
0.9 0.8 0.7 0.6 0.5
-.--. .....-... ~ . .. , . . . .." .t,-"'~J"\- ,o'r- - "'"",- - ( 1\I\.'"..,---. '",.,.- - ..-.II'
0.4 0.3
0
40
60
80
100
120
Number of cycles ( thousands ) - - . . . . Chase -
-
SW 1
. . . . . . Feed -
-
SW 2
Escape Global
figure 4.17 Cumulativeperformanceof typical experimentusing " unnatural" three-level switcharchitecture . Two- phasemodularshaping . decide (and therefore learn) whether to suppress SWI or the Feed behavioral module. Resultsare shown in figures 4.17 and 4.18. As in the precedingexperiment , two- phaseshaping gave better results than three-phase. It is clear from figure 4.18 that the low level of global performance achievedwas due to the impossibility for SW2 to learn to coordinate the SWI and the Feed modules.
4.4.5 Thel8IIe of Scalability The experiment presented in this subsection combines a monolithic architecturewith multiple inputs and a two-level hierarchical architecture. We use a Chase/ Feed /Escape environment in which four instancesof each class of objects (lights, food pieces, and predators) are scattered around. Only one instancein each classis relevant for the learning agent (i .e., the Autono Mouse likes only one of the four light colon: and one of the four kinds of food, and fearsonly one of the four potenti ~ predators) . Therefore the basic behaviors, in addition to the basic behavioral pattern, have to learn to discriminate betweendifferent objectsof the sameclass.
in SimulatedWorlds Experiments
8
-a -. G )) > G
1 0.9 0.8 0.7
_.~.~ /' -"" ' . .. .. _ 4 . . / ... . , "~ II '. ',.,...
0.6 0.5 0.4 0.3
0
10 20 30 40 50 60 70 80 Number of cycles ( thousands ) - - - - - Chase - SW 1
. . . . . Feed - SW 2
Escape Global
F1pre4.18 " " Cumulativeperformance of typical experiment using unnatural
three -level
. switch architecture. Tbree- pbasemodular shaping
For example, the Escape behavior, instead of receiving a single message ), now receives indicating the position of the predator (when present " ~nimals " different , only one of which many messagesregarding many " " are is a real predator. Different animals distinguished by a tag, and Escape must learn to run away only from the real predator (it should be unresponsiveto other animals) . Our experiments show (see figure 4.19) that the Autono Mouse learns the new, more complex, task, although a greater number of cycles is necessaryto reach a performance level comparable to that obtained in the previous experiment 4.13 ). (figure 4.5 EXPERIMENTS IN THE lind Bidden
ENVIRONMENT
The aims of the Find Hidden experiment are twofold . First, we are interestedto seewhether our systemis capable of learning a reasonably complex task, involving obstacle detection by sonar and bumpers and searching for a hidden light . The target behavior is to chase a light (Chase behavior), searchingfor it behind a wall when necessary(Search
Chapter4 1 a -:"i > ) -Q Q ) u ; E ~ ~~ Q ) Q.
. , _ , . . . : ~ .,,,\'..',-?--.-.-. ,.'.-,.--,-.,. JI,.-" t' Ir rI
0.9 0.8 0.7 0.6 0.5 0.4 0.3
0 40 80 120160200240260 Number of cycles ( thousands )
Chase
. . . . . .
Feed
Escape
-
-
Switch
Global
F1~ 4.19 Cumulative ~ rformance of typical experimentwith multi -input , two - level switch architecture. Modular shaping. 1 a.. Q) > ..!
.r,
. ' . . ., . . . . r , . . " " ' , . . . . . " " . . . - .. .. . . . . .
0 .9
Q) U 08. C tU E ... 0 0 .7 ~ ~
I
J
'" .. ...
- - - - - - .. - .. -_ ft _ -~ - - - ~ ... - ' ~ " " -
I I
0 .6 0
10
20
30
40
50
Number of cycles ( thousands ) . . . . . . Chase
- - - Search
Ji1gure4. 20 Cumulative perfonnancefor Find Hidden task
Rnd Hidden
Experimentsin SimulatedWorlds
behavior) . The waIl has a fixed position, while the light automatically hides itself behind the waIl each time it is reachedby the Autono Mouse. Second, we are interestedin the relation betweenthe sensorgranularity, that is, the number of different statesa sensorcan discriminate, and the learning performance. In theseexperiments, we pay no attention to the issueof architecture, adopting a simple monolithic LCS throughout.
4.5.1 The Find Bidden Task
To shapethe Autono Mouse in the Find Hidden environment, the reinforcemen program was written with the following strategy in mind: is seen if 'a light behavior } Approach it then { C~ behavior } Search else { is sensed by sonars obstacle if a distal then Approach it is sensed by obstacle else if a proximal bumpers then Move along it else Turn consistently { go on in the same direction , turning it is } whichever endif endif endif . The distancesof the Autono Mouse from the light and from the wall are computedfrom the geometriccoordinatesof the simulatedobjects. The simulation wasrun for 50,000cycles. In figure 4.20, we separatelyshow the Chase behavior performance(when the light is visible), the Search behavior performance(when the light is not visible), and the Find Hidden performance. Chasingthe light appearsto be easierto learn than searching for it . This is easyto explain given that searchingfor the light is a rather complex task, involving moving toward the wall , moving along it , and . It will be interesting to compare turning around when no obstacleis sensed real robot (seesection5.3) . with the obtained those with theseresults
4.5.2. SeDsorGranularity and LeamiDg Performance This experiment explored the effect that different levels of sensorgranularity have on the learning speed. To make things easier, we concen-
Chapter 4
trated on the sonar; very similar results were obtained using the light sensors. As a test problem, we choseto let the simulated Autono Mouse learn a modified version of the Search task. The reinforcementprogram usedwas as follows. case distance of very _close : if close : if far : if very _far :
if
backward then reward else punish ; turn then reward else punish ; slowly forward then reward else punish ; forward then reward else quickly punish
endcase .
In our experiments, robot sensorswere used by both the trainer and the learner. In fact, the trainer always used the sensorinformation with the maximum available granularity (in the sonar case, 256 levels) becausein this way it could more accuratelyjudge the learner performance. On the other hand, die learner had to use the minimum amount of information necessaryto make learning feasiblebecausea higher amount of information would make learning slower, without improving the long-term performan . The correspondencebetween the distance variable and the sonar reading was set by the trainer as follows: distance : = very _close , if sensor _readinge [ 0 . . . 63 ] , distance distance distance
: = close , if sensor _readinge [ 64 . . . 127 ] , : = far , if sensor _readinge [ 128 . . . 191 ] , = : very _far , if sensor _readinge [ 192 . . . 256 ] .
We investigatedthe learning performancewith the simulatedrobot for the : 2, 4, 8, 16, 32, 64, 128, 256. following granularities of the sonar messages This meansthat the learner was provided with messagesfrom the sonar that ranged from a very coarse granularity (one-bit messages , that is, division of the to bit binary space) very high granularity (eight , messages that is, 256 levels, the sameas the trainer) . Figure 4.21 showsthe results obtained with the simulated robot. We also report the performance of a random controller for comparison. Results show that, as expected, when the sonar was used to discri~ ate among four levels, the performancewas the best. In fact, this granularity level matched the four levels used by the trainer to give reinforcem . It is interesting to note that, even with a too-low granularity
-gr --.:-:-~ -.:~ --~ gr -16 -..4 :~-.= ::.-~ .~ .-_ .-r .r-.2 .-~ 6 g -0~--1_ .--2 --_ .3_ .5 -ran .._.4 _ ---_
in SimulatedWorlds Experiments
-Q -. G )) > G
1
9
0.8
0.6 0.4 0.2 0
128 -
-
Numberof cycles( thousands )
4.21 F1aare P with simulatedAutonoGranularityof sonarsensorand learningperfonnance Mouse
(i .e., granularity = 2), the robot was able to perform better than in the random case. Increasingthe granularity decreasedmonotonically the performance achieved by the learner. It is also interesting to note that the learner, in those caseswhere the granularity was higher than necessary " ' " (> 4), was able to create rules that (by the use of don t care symbols matching the lower positions of input messages ) clustered together all inputs belonging to the sameoutput classas definedby the trainer policy .
4.6 POINTSTO REMEMBER . ALECSYs allows the Autono Mouse to learn a set of classifiersthat are tantamount to solving the learning problemsposedby the simple taskswe
haveworkedwith in this chapter.
. Our resultssuggestthat it is better to userewardsand punishmentsthan only rewards. . Using the mutespec operator, dynamically changing the number of cla$Sifiers, and calling the geneticalgorithm when the bucket brigade has reacheda steadystate all improve the LCS' s performance. . Autono Mouse can benefit from a simple type of memory called " sensor " memory. . The factorization of the learning task into severalsimpler learning tasks helps. This was shown by the resultsof our experiments.
4 Chapter
. The system designer must take two architectural decisions: how to decomposethe learning problem; and how to make the resulting modules interact. The first issueis a matter of efficiency: a basic behavior should not be too difficult to learn. In our system, this means that classifiers should be no longer than about 30 bits (and thereforemessagescannot be longer than 10 bits) . The secondissueis a matter of efficiency, comprehensib , and learnability. . The best shaping policy, at least in the kind of environments investigated in this chapter, is modular shaping. . The granularity of sensormessagesis best when it matches the granularity chosenby the trainer to discriminate betweendifferent reinforcement actions.
Chapter
5
Experiments
in the Real
World
S.1 I NTRODUCfI ON Moving from simulated to real environments is challenging. New issues arise: sensornoise can appear, motors can be inaccurate, and in general there can be differencesof any sort between the designedand the real function of the robot. Moreover, not only do the robot ' s sensorsand effectors become noisy, but also the RP must rely on real, and hence noisy, sensorsto evaluate the learning robot moves. It is therefore important to determine to what extent the real robot can use the ideas and the software usedin simulations. Although the above problems can be studied from many different points of view, running experimentswith real robots has constrainedus to limit the extent of our investigation. In section 5.2, we focus on the role learn a good control policy : we that the trainer plays in making ALECSYS in :ftuence the robustnessto noise and to how different trainers can study lesions in the real Autono Mouse. We present results from experiments in the real world using Autono Mouse II . In section 5.3, we repeat, with the real robot (Autono Mouse IV ), the Find Hidden experiment and the sensorgranularity experimentalready run in simulation. Results obtained with simulations presentedin the preceding chapter suggestsetting the messagelist length to 1, using 300 classifierson a threetransputerconfiguration, and using both rewardsand punishments. In particular , we experimentally found that learning tended to be faster when punishmentswere somewhat higher (in absolute value) than rewards; a typical choice would be using a reinforcementof + 10 for rewards, and of - 15 for punishments. In all the experimentspresentedin this chapter, we usedthe monolithic architecture.
5 Chapter , " left
frontal
left wheel and motor
eye
" , , ,
rear left eye
, , ,
left frontal visual cone
, ,
,\ ,, ,, frontal right eye .
frontal
~
"
-
-
Raj
-
-
rear
.
cent
C
~
"
"
,
~
raj
cent
~
'
"
eye
eye
rear right
! eye /
right frontal visual cone
right wheel and mot or cent rat wheel
F1gore S.l Functional schematic of Autono Mouse II
.5.2 EXPERIMENTS WITH AUTONOMOUSED 5.2.1 AutonoMoi Be II Hardware Our smallestrobot is Autono Mouse II (seepicture in figure 1.3 and sketch in figure 5.1) . It has four directional eyes, a microphone, and two motors. Each directional eye can sensea light source within a cone of about 60 . Each motor can stay still or move the connectedwheel one or degrees two stepsfo~ ~ d, or one step backward. Autono Mouse n is connected to a transputer board on a PC via a 9,600-baud RS-232 link . Only a small amount of processingis done onboard (the collection of data from sensors and to effectors and the managementof communications with the PC) . All the learning algorithms are run on the transputer board. The directional eyessenselight sourcewhen it is in either or both eyes' field of view. In order to changethis field of view, Autono Mouse II must rotate becausethe eyescannot do so independently; they are fixed with respectto the robot body. The sensorvalue of eacheye is either on or off , dependingon the relation of the light intensity to a threshold. The robot is also provided with two central eyes that can senselight within a halfspace; theseare usedsolely by the trainer while implementing thereward the-result policy . These central eyes evaluate the absolute increase or decreasein light intensity (they distinguish 256 levels) . The format of input messagesis given in figure 5.2. We used three messageformats
Experimentsin the Real World
tas light position (frontal eyes) (a)
position ofnoise (rear ) tags eyes tags light presence light position (frontal eyes) (b)
ion posit ofnoise presence frontal (light ) eyes
figure 5.2 in AutonoMouseII : (a) formatof sensormessages from sensors Meaningof messages in first andsecondexperiments ; AutonoMousen usesonly two frontaleyes ; in third experiment ; AutonoMouseII usestwo (b) format of sensormessages frontal eyesandmicrophone in fourth experiment ; (c) formatof sensormessages ; AutonoMousen usesall four eyesandmicrophone . becausein different experimentsAutono Mouse II neededdifferent sensory capabilities. In the experimentson noisy sensorsand motors, in the lesion studies, and in the experimenton learning to chasea moving light source(sections2.4.1, 2.4.2, and 2.4.3, respectively), the robot can seethe light only with the two frontal eyes; this explains the short input message shown in figure 5.2a. In the first of the two experimentspresentedin section 2.4.4, Autono Mouse II was also able to hear a particular environmental noise, which accountsfor the extra bit in figure 5.2b. Finally , in the secondexperiment presentedin section 2.4.4, using all the sensory capabilities resultedin the longest of the in~ut messages(figure 5.2c). Autono Mouse II moves by activating the two motors, one for the left wheel and one for the right . The available commands for engines are " " " " " " Move one step backward ; Stay still ; Move one step forward ; " " Move two stepsforward. From combinations of thesebasic commands arise forms of movementslike " Advance" ; " Retreat" ; " Rotate" ; " Rotate while advancing," and so on (there are 16 different composite movements ) . The format of messagesto effectorsis the sameas for the simulated Autono Mouse (seefigure 4.2). 5.2.2 ExperimentalMethodology In the real world , experimentswere run until either the goal was achieved or the experimenter was convinced that the robot was not going to achievethe goal (at least in a reasonabletime period) . Experiments with the real robots were repeated only occasionally becausethey took up a great deal of time. When experiments were
ChapterS
repeated, however, differencesbetweendifferent runs were shown to be marginal. The performance index used was the light intensity perceived by the frontal central eye (which increasedwith proximity to the light, up to a maximum level of 256) . To study the relation betweenthe reinforcement program and the actual performance, we plotted the averagereward over intervals of 20 cycles. 5.2.3 The Training Policies As we have seen, given that the environment usedin the simulations was designedto be as close as possible to the real environment, messages ' going from the robot s sensorsto ALECsys and from ALECsys to the ' robot s motors have a structure very similar to that of the simulated Autono Mouse. On the other hand, the real robot differs from the simulated one in that sensorinput and motor output are liable to be inaccurate . This is a major point in machine learning research: simulated environmentscan only approximate real ones. To study the effect that using real sensorsand motors has on the learning process, and its relation to the training procedure, we introduced two different training policies, called " reward-the-intention" policy, and " reward-the-result" policy (seefigure 5.3) .
5.2.3.1 The Reward-tbe-Intention Polley We say that a trainer " rewards the intention" if , to decide what reinforcem to give, it uses observationsfrom an ideal environmentpositioned betweenALECsysand the real Autono Mouse. This environment is said to be " ideal" becausethere is no interferencewith the real world (it is the same type of environment ALECSYS sensesin simulations) . The reward-the-intention trainer rewardsALECsysif it proposesa move that is correct with regard to the input - output mapping the trainer wants to teach (i.e., the reinforcement is computed observing the responsesin the ideal, simulated world) . A reward-the-intention trainer knows the desired input-output mapping and rewards the learning system if it learns that mapping, regardlessof the resulting actions in the real world. 5.2.3.2 The Reward-tbe-Result Pollcy On the other hand, we say that a trainer " rewards the result" if , to decide what reinforcementto give, it usesobservationsfrom the real world . The reward-the-result trainer rewards ALECSYS if it proposes an action that
in theRealWorld Experiments the 8brain8
the . body.
diminishes the distancefrom the goal. ( In the light -approachingexample the rewarded action is a move that causesAutono Mouse II to approach the light source.) A reward-the-result trainer has knowledge about the goal, and rewards ALECsys if the actual move is consistent with the achievementof that goal. With the reward-the-intention policy, the observedbehavior can be the desiredone only if there is a correct mapping betweenthe trainer' s interpretation of sensor(motor ) messagesand the real world significanceof Autono Mouse' s sensoryinput (motor commands) .! For example, if the two Autono Mouse' s eyes are inverted the reward-the-intention trainer will reward ALECSYS for actions that an external observerwould judge as : while the trainer intends to reward the " Turn toward the light " wrong behavior, the learned behavior will be a " Turn away from the light " one. In caseof a reward-the-result trainer, the mapping problem disappears; learns any mapping between sensormessagesand motor messages ALECSYS that maximizesthe reward receivedby the trainer.
100
ChapterS In the following section, we present the results of some experiments ' designed to test ALEcsyss learning capability under the two different reinforcementpolicies. We also introduce the following handicapsto test 's ALE CSys adaptive capabilities: inverted eyes, blindnessin one eye, and incorrect calibration of motors. All graphs presentedare relative to a single typical experiment.
5.2.4 AD ExperimentalStudy of Training Policies
In this section, we investigatewhat effect the choice of a training policy has on noisy sensorsand motors, and on a set of lesions2we deliberately in:@ icted on Autono Mouse II to study the robustnessof learning. The experimentwas set in a room in which a lamp had been arbitrarily positioned . The robot was allowed to wander freely, sensingthe environment with only its two frontal directional eyes. ALEcsvsreceiveda high reward whenever, in the caseof reward-the-result policy, the light intensity perceived by the frontal central sensorincreased, or, in the caseof rewardthe intention policy, the proposedmove was the correct one with respect to the receiv*edsensory input (according to the mapping on which the reward-the-intention trainer basedits evaluations) . Sometimes, especially in the early phasesof the learning process, Autono Mouse II lost contact with the light source (i.e., it ceasedto see the lamp) . In these cases, AL Ecsysreceiveda reward if the robot turned twice in the samedirection. This favored sequencesof turns in the samedirection, assuring that the robot would, sooner or later, seethe light again. In the following experiments , we report and comment on the results of experimentswith the standardAutono Mouse II and with versionsusing noisy or faulty sensors or motors. In all the experiments, ALEcsvs was initialized with a set of randomly generatedclassifiers.
5.2.4.1 Noisy Sensorsand Motors Figure 5.4 comparesthe performanceobtained with the two different reinforcem policies using well-calibrated sensoryor motor devices.3 Note that the perfonnance is better for the reward-the-result policy. Drops in performanceat cycle 230 for the reward-the-intention policy and at cycle 250 for the reward-the-result policy are due to suddenchangesin the light position causedby the experimentermoving the lamp far away from the robot. Figure 5.5 showsthe rewardsreceivedby ALEcsysfor the sametwo runs. Strangely enough, the averagereward is higher (and optimal after 110 cycles) for the reward-the::-intention policy . This indicates that, given
~ " . , ' I ~ / : II ' { A , " ' . ' . , , .050 100 . I , " 5 I 0
in theRealWorld Experiments 250
~(/) 200 c S .= 150 .c .2 ...J 100 50 0
150
200
250
300
Number of cycles
Rewardthe result
----- ---- ---- Rewardthe intention
FiglD'eS.4 Differencein performancebetweenreward-tbe-result and reward-tbe-intention trainingpolicies
pJBM9J
101
300
Number of cycles Reward the result
- - - - - - - - - - - - - Reward the intention
figure 5.5 Differencein averagerewardbetweenreward-the-resultand reward-the-intention trainingpolicies
102
ChapterS
Lt) 250 C \I x 200 '...0 co 150 ~ ~ 100 Q ) 50 g ... Q ) > 0 - in 100 c ) -G 150 c - .c 200 C) :,:j -250
50
' " : I , . , " . ' -1 J '.,,..20 "-'I.~ ! I Y . " , ~ t ,".:tIi".,-'II '".:,a0 II,5 ~ Numberof cycles
Lightintensity
------------- Average reward
5.6 J4igore Systemperformance (light intensity) andaveragereward(multipliedby 2Sto ease visualcomparison ) for reward-the-resulttrainingpolicy perfect information , ALBcsys learns to do the right thing . In the caseof the reward-the-result policy, however, ALECSYS does not always get the expec : ted reward. This is becausewe did not completely succeeddespite our efforts to build good, noise-free sensorsand motors, and to calibrate them appropriately. Figure 5.6 directly compareslight intensity and average reward for the reward-the-result policy (averagereward is multiplied by 25 to easevisual comparison). Theseresultsconfirm our expectation, as discussedin section2.3, that a trainer using the reward-the-intention policy, in order to be a good trainer, needsvery accurate low-level knowledge of the input -output mapping it is teaching and of how to compute this mapping, in order to be able to give the correct reinforcements. Often this is a nonreasonablerequirement becausethis accurate knowledge could be directly used to design the behavioral modules, making learning useless . On the other hand, a reward-the-result trainer only needsto be able to evaluate the moves of the learning systemwith respectto the behavioral pattern it is teaching. It is interesting that the number of cyclesrequired by the real robot to reach the light is significantly lower than the number of cyclesrequired by
Experimentsin the Real World
250 ~
103
200 150 100 50 0
A"".,""'.'iI'-",".",-" .,.,-,.".'A ."'"~ ,'".A I 100 150
200
"I'. ,'.I",\I-,'di ",--~ '--------'-'---'..~' 250300350
Number of cycles Reward the result
- - - - - - - - - - - - - Reward the intention
the simulated robot to reach a high performance in the previous experiments . This is easily explained if you think that the correct behavior is more frequent than the wrong one as soon as performanceis higher than 50 percent. The real robot starts therefore to approach ihe lIght source much before it has reacheda high frequencyof correct moves. 5.2.4.2 LesionsStudy Lesionsdiffer from noise in that they causea sensoror motor to systematically deviate from its designspecifications. In this sectionwe study the robustnessof our systemwhen altered by three kinds of lesions: inverted eyes, one blind eye, and incorrectly regulated motors. Experimentshave shown that for each type of lesion the reward-the-result training policy was able to teach the target behavior, while the reward-the-intention training policy was not. Results of the experimentin which we inverted the two frontal eyesof Autono Mouse II are shown in figures 5.7 and 5.8 (thesegraphs are analogous to thoseregarding the standard eyeconfiguration of figures 5.4 and 5.5) . While graphs obtained for the reward-the-result training policy are qualitatively comparable, graphs obtained for the reward-the-intention training policy differ greatly (seefigures 5.4 and 5.7). As discussedbefore
. , . 5 0500
Chapter 5
pJ8M9J
104
Number of cycles Reward the result
- - - - - - - - - - - - - Reward the intention
I1IUfe 5.8 Difference in averagereward betweenreward-the-result and reward-the-intention training policies for Autono Mouse n with inverted eyes
(seesection 2.3), with inverted eyesthe reward-the-intention policy fails. ( Note that in this experimentthe light sourcewas moved after 150cycles.) Similar qualitative results"were obtained using only one of the frontal directional eyes (one blind eye experiment) . Figure 5.9 shows that the reward-the-intention policy is not capable of letting the robot learn to chase the light source. In this experiment the light was never moved becausethe task was already hard enough. The reward-the-result policy, however, allows the robot to approachthe light, although it requiresmore cyclesthan with two working eyes. (At cycle 135, there is a drop in performan where Autono Mouse II lost sight of the light and turned right until it saw the light sourceagain.) As a final experiment, the robot was given badly regulated motors. In this experiment, bits going to eachmotor had the following new meaning: 00: stay still , 01: one step forward , 10: two stepsforward , 11: four stepsforward.
in the Real World Experiments
250 200 ~
105
150 100 50 0 0
150
200
250
300
Numberof cycles Rewardthe result
- - - - - - - - - - - - - Rewardthe intention
Figme 5.9 Difference in perfonnance between reward-the-result and reward-the-intention training policies I >r Autono Mouse II with one blind eye
The net effect was an Autono Mouse II that could not move backward any more, and that on the average moved much more quickly than before. The result was that it was much easier to lose contact WIth the light source, as happenedwith the reward-the-intention policy after the light was moved (in this experiment the light sourcewas moved after 90 cycles). Becausethe averagespeedof the robot was higher, however, it took fewer cycles to reach the light . Figure 5.10, the result of a typical experiment, shows that the reward-the-result training policy was better also in this casethan the reward-the-intention training policy . 5.2.4.3 Learning to Chasea Moving Light Source Learning to chasea moving light sourceintroduces a major problem; the reinforcementprogram cannot usedistance(or light intensity) changesto decide whether to give a reward or a punishment. Indeed, there are situations in which Autono Mbuse II goestoward the light sourcewhile the light sourcemovesin the sainedirection. If the light sourcespeedis higher than the robot ' s, then the distanceincreaseseven though the robot made the right move. This is not a problem if we usethe reinforcementprogram basedon the intention, but it is if the reinforcementprogram is basedon the result. Nevertheless , it 1Sclear that the reward-the-intention policy is
ChapterS
250 ~
106
200 150 100 50 0
,-" 15 -.,200
Number of cycles
Rewardthe result
----- -------- Rewardthe intention
figure 5.10 Difference in performance between reward-the-result and reward-the-intention training polici~ for Autono Mouse II with incorrect regulation of motors
less appealing becauseit is not adaptive with respect to calibration of sensorsand effectors. A possible solution is to let ALECSYS learn to approach a stationary light, then to freeze the rule set (i.e., stop the learning algorithm) and use it to control the robot. With this procedure one should be aware that , once learning has been stopped, the robot can useonly rules developedduring the learning phase, which must therefore be as complete as possible. To ensurecomplete learning, we need to give Autono Mouse II the opportunity to learn the right behavior in every possible environmental situation (in our case, every possiblerelative position of the light source). Figure 5.11 reports the result of the following experiment . At first the robot is left free to move and to learn the approaching behavior. Mter 600 cycleswe stop the learning algorithm, start to move the light source, and let the robot chaseit . Unfortunately , during the first learning phaseAutono Mouse II does not learn all the relevant rules (in particular, it doesnot learn to turn right when the light is seenby the right eye), and the resulting performance is not satisfactory. At cycle 1,300, we therefore start the learning algorithm again and, during the cycles between1,700 and 2,200, we start to presentthe lamp to Autono Mouse II on one side, and wait until the robot starts to do the right thing . This procedureis repeatedmany times, presentingthe light alternately in front ,
World
Real
the
in Experiments 250 200 >
-
(
)
/
150 c )
Q c -
.
100 c
.
)
. C -
.
50 J
-
0 learning off
on
learning
learning
learning
training
0
500
1000
1500
2000
2500
off
on
107
Number of cycles I1gure 5.11 Autono Mouse II performancein learning to chasemoving light
on the right , on the left, and in back. The exact procedure is difficult to describe, but very easy and natural to implement. What happensduring the experiment, which we find very exciting, is that the reactive system learns in real time. The experimenterhas only to repeat the procedure4 until visual feedbacktells him that ALEcsys has learned, as manifestedby the robot beginning to chasethe light . After this training phase, learning is stopped again (at cycle 2,200) and the light source is steadily moved. Figure 5.11 shows that this time the observedbehavior is much better (performanceis far from maximum, however, becausethe light sourceis moving and the robot therefore never reachesit ) . . 5.2.4.4 Learning to Switch betweenTwo Behaviors In thesetwo experiments, the learning task was made more complicated by requiring Autono Mouse n to learn to change to an alternative behavior in the presenceof a particular noise. Experimentswere run using the reward-the-result policy . In the first experiment, we usedthe two directional frontal eyesand the two behaviors Chase and Escape . During the Chase phasethe experimental setting was the same as in the first experiment with standard capabilities (that is, the experiment in section 2.4.1), except for input , which were one bit longer due to the noise bit (seefigure 5.2b). messages
108
ChapterS
250 200 150 100 50 00" ~~,~3q :'. .900 .,O ~ 600 -50 chase lightescape light chase light
> o X ..--Q _ ~ 0 .= C (Q G ).-Q ~ ) .O )G O (J..> -c I)(Q
Number of cycles Light intensity
-------------
Average revvanj
Figure 5.12 . Autono Mouse II perfonnance in learning to switch between two behaviors: chasinglight and escapinglight (robot usestwo eyes)
When the noise started, the reinforcementprogram changedaccording to the behavior that was to be taught, that is, escapingfrom the light . Again, as can be seenfrom figure 5.12, the observedbehavior was exactly the desiredone. In the secondexperiment, we used four directional eyes and the two behaviors Chase moving forward and Chase moving backward. During the Chase moving forward phase, the experimentalsetting was the same as in the first experiment with standard capabilities (that is, the experiment in section 2.4.1), except for input messages , which were three bits longer due to the noise bit and to the two extra bits for the two rear eyes (see figure 5.2c) . When the noise started, the reinforcement program changed according to the behavior that was to be taught. From figure 5.13, it is apparent that after the noise event (at cycle 152) performance fell very quickly ; 5 the robot had to make a 180- degreeturn before it could again seethe maximum intensity of light from the light source. As after the turn , ALECSYS did not have the right rules (rememberthat the rules developedin the moving forward phasewere different from the rules necessaryduring the moving backward phasebecauseof the bit that
109
in the RealWorld Experiments 250 200 ->U) -; 150 c - 100 .c. a -J 50 0 0 14
50 moving
100 forward
150 ~
200
250
moving
backward
4
300 ~
indicated presenceor absenceof noise), Autono Mouse II was in a state very similar to the one it was in at the beginning of the experiment. Therefore, for a while it moved randomly and its distancefrom the light source could increase(as happenedin the reported experiment). At the end, it again learnedthe correct set of rules and, as can be seenin the last part of the graph, started to approach the light sourcemoving backward. These two experimentshave shown that increasing the complexity of the task did not causethe performance of the system to drop to unacceptable levels. Still , the increasein complexity was very small, and further investigation will therefore be necessaryto understandthe behavior of the real Autono Mouse in environmentslike those used in simulations.
WITH AUTONOMOUSE IV 5.3 EXPERIMENTS 5.3.1 AutonoMo. . IV Hardware Autono Mouse IV (seefigure 5.14) has two directional eyes, a sonar, front and side bumpers, and two motors. Each directional eye can sensea light sourcewithin a cone of about 180degrees.The two eyestogether cover a 270- degreezone, with an overlapping of 90 degreesin front of the robot. The sonar is highly directional and can sensean object as far away as 10
110
ChapterS .' left visual
, ,
eyes
,
tracks
one sonar
beam
~
~
~
~ .
c .
c
~.
.
t .
.
t
~
~
~ , , , , , , ,
sonar
bumpers
right visual cone " ,
figure 5.14 Functional schematicof Autono MouseIV
meters. The output of the sonar can assumetwo values, either " I sensean " " " object, or I do not sensean object. Each motor can stay still or move the connected track one or two steps forward , or one step backward. Autono Mouse IV is connectedto a transputer board on a PC via a 4,800baud infrared link . 5.3.2 ExperimentalSettings The aim of the experimentspresentedin this section was to compare the results obtained in simulation with those obtained with the real robot. The task chosenwas Find Hidden , already discussedin section 4.5. The environment in which we ran the real robot experimentsconsistedof a large room containing an opaque wall , about 50 cm x 50 cm, and an ' ordinary lamp (50 W ). The wall s surfacewas pleated, in order to reflect ' the sonar s beam from a wide range of directions. The input and output interfaceswere exactly the sameas in the simulation, and so was the RP. The input from the sonar was defined in such a way that a front obstacle could be detectedwithin about 1.5 m from the robot. The first experiment presents the results obtained on the complete task, while the second experiment discusses results on the relation between sensor granularity and learning speed. Although we designedthe simulated environment and robot so as to make them very similar to their real counterparts, there were three main differencesbetweenthe real and the simulated experiments. First , in the real environment the light was moved by hand and hidden behind the wall when Autono Mouse IV got very close to it ( 10- 15 cm), as compared to the simulated environment, where the light was moved by the
Experimentsin the Real World
simulation program in a systematic way. The manual procedure introduced an element of irregularity, although, from the results of the experiments, it is not clear whether this irregularity affected the learning process. Second, the distancesof Autono Mouse IV from the light and from the wall were estimated on the basis of the outputs of the light sensorsand of the sonar, respectively. More precisely, each of the two eyesand the sonar output an eight-bit number from 0 to 255, which coded, respec tively, the light intensity and the distance from an obstacle. To estimate whether the robot got closer to the light , the total light intensity (that is, the sum of the outputs of both eyes) at cycle t was compared with the total light intensity at cycle t - 1. The sonar' s output was usedin a similar way to estimatewhether the robot got closer to the wall. The eyesand the sonar were usedin different ways by the agent and by the RP: from the point of view of the agent, all thesesensorsbehavedas on/ off devices; for the RP, the eyes and the sonar produced an output with higher discriminative power. Therefore, the samehardware devices were used as tHe trainer' s sensorsand, through a transformation of their outputs, as the sensorsof the agent (the main reason for providing the agent with a simplified binary input was to reducethe sizeof the learning ' systems searchspace, thus speedingup learning) . By exploiting the full binary output of the eyesand of the sonar, it is possibleto estimate the actual effect of a movement toward the light or the wall. However, given the sensoryapparatus of Autono Mouse IV , we cannot establishthe real effect of a left or right turn ; for thesecases, the RP basedits reward on the expectedmove, that is, on the output message sent to the effectors, and not on the actual move. In other words, we used a reward-the-intention reinforcementpolicy . To overcomethis limitation , we have added to Autono Mouse IV a tail which can senserotations of the robot. The resulting robot , which we call " Autono Mouse y ," will be the object of experimentationin chapter 7. Third and finally , for practical considerations, the experiments with Autono Mouse IV were run for about four hours, covering only 5,000 cycles, while the simulated experimentswere run for 50,000 cycles.
5.3.3 Find Bidden Task: ExperimentalResults The graph of figureS .lS showsthat Autono Mouse IV learned the target behavior reasonably well, as was intuitively clear by direct observation during the experiment. There is, however, a principal discrepancybetween
ChapterS
--Q . 0.9 )Q Q > )
90ue
112
0.8 0.7 0.6
2
- - - - -
Chase
3
4
!
Number of cycles ( thousands) - - - - Search FindHidden
F11m ' e 5.15 Find Hidden experimentwith AutonoMouseIV the results of the real and those of the simulated experiments (see figure 4.20) . In the simulation , after 5,000 cycles the Chase , Search , and Find Hidden performances had reached , respectively , the approximate values of 0.92, 0.76, and 0.81; the three corresponding values in the real experiment are lower (about 0.75) and very close to each other . To put it differently , although the real and the simulated light searching perform ances are very similar , in the simulated experiment the chasing behavior is much more effective than the searching behavior , while in the real . experiment they are about the same. We interpret this discrepancy between the real and simulated experiments as an effect of the different ways in which the distance between the and the light were estimated . In fact , the total light intensity did robot not allow for a very accurate discrimination of such a distance. A move toward the light often did not result in an increase of total light intensity large enough to be detected; therefore , a correct move was not rewarded , because the RP did not understand that the robot had gotten closer to the light . As a consequence, the rewards given by the RP with respect to the light - chasing behavior were not as consistent as in the simulated experiments .
113
in theRealWarld Experiments
-a.. 1 - - - - - - - - .-4 - 0.8 gr ....-- / .!~ 0.6 rl ~ --------,-,---g!~~~ , , ' , c 0.4 / . ' , J ' E so gr.2 - 0.2 a~ .. l 0 500
1000
1500
2000
2500
Number of cycles F1awe5.16 Sensorvirt11ali?~tion with Autono Mouse IV
. Table5.1 whenchanging of realandsimulatedAutonoMouseIV perfonnances Comparison sensor 's granularity
Sonar granularity
Performanceof simulated robot after 5,000 cycles
Performanceof real robot after 5,000 cycles
2
0 .456
0 .378
4
0 .926
0 .906
64
0 .770
0 .496
5.3.4 SensorGranularity: ExperimentalResults To further explore the relation between sonar sensor granularity and learning performance, we repeatedwith the real robot the same experiment previously run in simulation (see section 4.5.2). We set the granularity of the real robot sonar sensorto 2, 4, and 64 (seefigure 5.16) . As was the case in the simulation experiment, the best performance was obtained with granularity 4, that is, when the granularity matched the number of levels used by the trainer to provide reinforcements. It is interesting to note (seetable 5.1) that the overall performance level was lower in the real robot case. This is due to the difference between the simulated and the real sonar: the real sonar was able to detect a distant
114
ChapterS obstacle only if the sonar beam hit the obstacle with an angle very close to 90 degrees.
5.4 POINTSTO REMEMBER . Results obtained in simulation carry over smoothly to our real robots, which suggeststhat we can use simulation to develop the control systems and fine-tune them on real robots. . The kind of data used by the trainer can greatly influence the learning . a trivial point , we capabilities of the learning system. While this can seem have shown how two reasonableways to use data on a robot ' s behavior determine a different robustnessof the learning system to the robot ' s maIfuncti oDing. . As was the case in simulation experiments, sensor granularity is at best when it matchesthe number of levels used by the trainer to provide . reinforcements.
Chapter Beyond
6 Reacdve
Behavior
6.1 I NTRODUCfI ON
In the experimentspresentedin the two previous chapters, our robots were viewed as reactive systems, that is, as systemswhose actions are completely determined by current sensory input . Although our results show that reactive systemscan learn to carry out fairly complex tasks, . there are interesting behavioral patterns not determined by current perceptions alone that they simply cannot learn. In one important class of nonreactivetasks, sequentialbehaviorpatterns, the decisionof what action to perform at time t is influencedby the actions performed in the past. This chapter presentsan approach to the problem of learning sequential behavior patterns based on the coordination of separate, previously learned basic behaviors. Section 6.2 defines some technical terminology and situatesthe problem of sequentialbehavior in the context of a general approach to the developmentof autonomous agents. Section 6.3 defines the agents, the environment, and the behavior patterns on which we carried out our experimentation, and specifiesour experimental plan. Section 6.4 describesthe two different training strategieswe used in our experiments, while section 6.5 reports someof our results.
6.2 REACTIVEAND DYNAMICREBAVIORS As the simplest class of agents, reactive systemsare agents that react to their current perceptions( Wilson 1990; Littman 1993) . In a reactive " Behavior is based onthearticle"TrainingAgentsto Perform Thischapter by Sequential in Behavior 2 3 1994 The which M. Colombetti andM. Dorigo , ( ), @ appeared Adaptive MIT Press .
116
Chapter 6 system, the action a ( t) produced at time t is a function of the sensory input
s( t) at time t: a(t) = / (s( t . As argued by Whitehead and Lin ( 1995), reactive systemsare perfectly adequateto Markov environments , more specifically, when . the greedy control strategy is globally optimal, that is, choosing the locally optimal action in eachenvironmental situation leadsto a courseof actions that is globally optimal; and . the agent has knowledge about both the effectsand the costs (gains) of each possibleaction in eachpossibleenvironmental situation. Although fairly complex behaviors can be carried out in Markov environments, very often an agent cannot be assumedto have complete knowledgeabout the effectsor the costsof its own actions. Non -Markov situations are basically of two different types:
1. Hidden state environments . A hiddenstate is a part of the environmental situation dot accessibleto the agent but relevant to the effectsor to the costs of actions. H the environment includes hidden states, a reactive agent cannot selectan optimal action; for example, a reactive agent cannot choosean optimal movementto reach an object that it doesnot see. 2. Sequentialbehaviors. Supposethat at time t an agent has to choosean action as a function of the action performed at time t - 1. A reactive agent can perform an optimal choice only if the action performed at time t - 1 has somecharacteristicand observableeffect at time t , that is, only if the agent can infer which action it performed at time t - 1 by inspecting the environment at time t. For example, supposethat the agent has to put an object in a given position, and then to removethe object. H the agent is able to perceivethat the object is in the given position, it will be able to appropriately sequencethe placing and the removing actions. On the other hand, a reactive agent will not be able to act properly at time t if (i) the effectsof the action performed at time t - 1 cannot be perceivedby the agent at time t (this is a subcaseof the hidden state problem); or (ii ) no effect of the action performed at time t - 1 persists in the environmen at time t. To develop an agent able to deal with non-Markov environments, one must go beyond the simple reactive model. We say that an agent is " dynamic" if the action a(t) it perfonns at tilDe t dependsnot only on its
117
BeyondReactiveBehavior current sensoryinput s( t) but also on its state x( t) at time t ; in turn , such state and the current sensoryinput determinethe state at time t + 1:
a(t) = f (s( t), x( t ; x( t + 1) = g(s(t), x( t . In this way, the current action can dependon the past history. An agent' s states can be called " internal states," to distinguish them from the statesof the environment. They are often regardedas memories of the agent' s past or as representationsof the environment. In spite (or because ) of their rather intuitive meaning, however, terms like memory or representationcan easily be used in a confusing way. Take the packing task proposed by Lin and Mitchell ( 1992, p. 1) as an example of nonMarkov problem: : opena box, put a gift into it, [ . . . ] Considera packingtaskwhichinvolves4 steps closeit , and sealit. An agentdrivenonly by its currentvisualperceptscannot whenfacinga closedbox the agentdoesnot know accomplishthis task, because if a gift is alreadyin the box andthereforecannotdecidewhetherto sealor open . the box. It seemsthat the agent needsto rememberthat it has already put a gift into the box. In fact, the agentmust be able to assumeone of two distinct internal states, say 0 and 1, so that its controller can choose different actions when the agent is facing a closed box. We can associatestate 0 " " with " The box is empty," and state 1 with The gift is in the box. Clearly, the statemust switch from 0 to 1 when the agentputs the gift into the box. But now, the agent' s state can be regarded: " 1. as a memoryof the past action " Put the gift into the box; or " 2. as a representationof the hidden environmental state The gift is in the " box. Probably, the choice of one of theseviews is a matter of personaltaste. But consider a different problem. There are two distinct objects in the environment, say A and B. The agent has to reach A , touch it , then reach B, touch it , then reachA again, touch it , and so on. In this case, provided that touching an object doesnot leave any trace on it , there is no hidden state in the environment to distinguish the situations in which the agent should reach A from those in which it should reach B. We say that this " environment is " forgetful in that it does not keep track of the past actions of the agent. Again , the agent must be able to assumetwo distinct internal states, 0 and 1, so that its task is to reach A when in state 0, and
118
Chapter 6
to reach B when in state 1. Such internal statescannot be viewed as represent of hidden statesbecausethere are no such statesin this environme . However, we still have two possibleinterpretations: 1. Internal statesare memoriesof past actions: 0 means that B has just beentouched, and 1 meansthat A hasjust beentouched; or 2. Internal statesare goals, determining the agent' s current task: state 0 meansthat the current task is to reach A , state 1 meansthat the current task is to reach B. The conclusion we draw is that terms like memory, representation , and goal, which are very commonly used for examplein artificial intelligence, often involve a subjectiveinterpretation of what is going on in an artificial agent. The term internal state, borrowed from systemstheory, seemsto be neutral in this respect, and it describesmore faithfully what is actually going on in the agent. In this chapter, we will be concernedwith internal statesthat keep track of past actions, so that the agent' s behavior can follow a sequentialpattern. In particularI we are interestedin dynamic agentspossessinginternal states by design, and learning to usethem to produce sequentialbehavior. Then, if we interpret internal statesas goals, this amounts to learning an action plan, able to enforce the correct sequencingof actions, although this intuitive idea must be taken with somecare. Let us observethat not all behavior patterns that at first sight appear to be basedon an action plan are necessarilydynamic. Consider an example of hoarding behavior: an agent leavesits nest, chasesand graspsa prey, brings it to its nest, goesout for a new prey, and so on. This sequential behavior can be produced by a reactive system, whose stimulus-response associationsare describedby the following production rules (where only the most specificproduction whoseconditions are satisfiedis assumedto fire at eachcycle) : Rule 1: - + move randomly. Rule 2: not grasped& prey ahead - + move ahead. Rule 3: not grasped& prey at contact - + grasp. Rule 4: grasped& nest ahead - + move ahead. Rule 5: grasped& in nest - + drop. In fact, the experimentsreported in chapters4 and 5 show that a reactive can easily learn to perform tasks similar agent implementedwith ALECSYS to the one above.
119
Beyond Reactive Behavior
It is interesting to seewhy the behavior pattern describedabove, while merely reactive, appearsassequentialto an external observer. If , insteadof the agent' s behavior, we considerthe behavior of the global dynamic system constitutedby the agentand the environment, the task is actually dynamic. The relevant statesare the statesof the environment, which keepstrack of the effectsof the agent' s moves; for example, the effectsof a graspingaction are stored by the environment in the form of a graspedprey, which can then be perceivedby the agent. We call pseudosequences tasks performed by a reactive agent that are sequentialin virtue of the dynamic nature of the environment, and use the term proper sequences for tasks that can be executedonly by dynamic agents, in virtue of their internal states. Let us look at theseconsiderationsfrom another perspective. As it has already been suggestedin the literature (see, for example, Rosenschein and Kaelbling 1986; Beer 1995), the agent and the environment can be viewed as a global system, made up of two coupled subsystems . For the interactions of the two subsystemsto be sequential, at least one of them must be properly dynamic in the sensethat its actions dependon the sub' systems state. The two subsystemsare not equivalent, however. Even though the agent can be shapedto produce a given target behavior, the dynamicsof the environment are taken as given and cannot be trained. It is therefore interesting to see whether the only subsystemthat can be trained, namely, the agent, can contribute to a sequentialinteraction with statesof its own: this is what we have called a " proper sequence ." In this chapter, instead of experimenting with sequencesof single actions, we have focusedon tasksmade up of a sequenceof phases, where a phase is a subtask that may involve an arbitrary number of single actions. Again, the problem to be solved is, how can we train an agent to switch from the current phaseto the next one on the basisof both current sensoryinput and knowledgeof the current phase? One important thing to be decidedis when the phasetransition should occur. The most obvious assumption is that a transition signal is produced by the trainer or by the environment, and is perceivedby the agent. Clearly, if we want to experiment on the agent' s capacity to produce proper behavioral sequences , the transition signal must not itself convey information about which should be the next phase.
6.3 EXPERIMENTAL SETTINGS This sectionpresentstheenvironments , theagent, andthetargetbehavior for our experiments .
120
Chapter 6
...
,
- - ,
, ,
I , \ ,
\ ,
0 I "
A , - - '"
~
-
'
'
I ~
( :
,
, '
) ~ - - ...
}1aure6.1 Forgetful environment for SeQuentialbehavior
6.3.1 The Simulation Environment What is a g60d experimental setting to show that proper sequences can emerge? Clearly, one in which ( 1) agent- environmentinteractions are sequential, and (2) the sequentialnature of the interactions is not due to states of the environment. Indeed, under these conditions we have the guaranteethat the relevant states are those of the agent. Therefore, we carried out our initial experiments on sequential behavior in forgetful environments- environments that keep no track of the effects of the ' agent s move. Our experimental environment is basically an empty space containing two objects, A and B (figure 6.1) . The objects lie on a bidimensional plane in which the agent can move freely; the distance between them is approximately 100 forward stepsof the agent. In some of the experiments , both objects emit a signal when the agent entersa circular area of predefinedradius around the object (shown by the dashedcircles in the figure) . 6.3.2 De Agent' s " Body" The Autono Mouse usedin theseexperimentsis very similar to simulated Autono Mice usedfor experimentsin chapter 4. Its sensorsare two on/ off eyeswith limited visual field of 180 degreesand an on/ ofT microphone. The eyesare able to detect the presenceof an object in their visual fields, and can discriminate betweenthe two objectsA and B. The visual fields of
121
BeyondReactiveBehavior the two eyesoverlap by 90 degrees , so that the total angle coveredby the two eyesis 270 degrees, partitioned into three areas of 90 degreeseach ' (seefigure 6.1). The Autono Mouse s effectorsare two independentwheels that can either stay still , move forward one or two steps, or move backward one step.
6.3.3 Target Behavior The target behavior is the following : the Autono Mouse has to approach object A , then approach object B, then approach object A , and so on. . This target sequencecan be representedby the regular expression{ IXP } , where IXdenotesthe behavioral phasein which the Autono Mouse has to approach object A , and P the behavioral phase in which the Autono Mouse has to approach object B. We assumethat the transition from one phaseto the next occurs when the Autono Mouse sensesa transition signal . This signal tells the Autono Mouse that it is time to switch to the next phase, but doesnot tell it which phaseshould be the next. Concerning. the production of the transition signal, there are basically two possibilities: 1. External-basedtransition, where the transition signal is produced by an external source(e.g., the trainer) independentlyof the current interaction betweenthe agent and its environment; or 2. Result-basedtransition, where the transition signal is produced when a ' given situation occurs as a result of the agent s behavior; for example, a transition signal is generatedwhen the Autono Mouse has come close enough to an object. The choice between these two variants correspondsto two different intuitive conceptions of the overall task. If we choose external-based transitions, what we actually want from the Autono Mouse is that it learn to switch phaseeach time we tell it to. If , instead, we chooseresult-based transitions, we want the Autono Mouse to achievea given result, and then to switch to the next phase. Supposethat the transition signal is generated when the agent reachesa given threshold distancefrom A or from B. This meansthat we want the agent to reach object A , then to reach object B, and so on. As we shall see, the different conceptionsof the task underlying this choice in:8uencethe way in which the Autono Mouse can be trained. It would be easyto turn the environmentdescribedaboveinto a Markov environment so that a reactive agent could learn the target behavior. For example, we could assumethat A and B are two lights, alternatively
122
Chapter 6
switched on and off, exactly one light being on at each moment. In this case, a reactive Autono Mouse could learn to approach the only visible light , and a pseudosequentialbehavior would emergeas an effect of the dynamic nature of the environment.
6.3.4 De Agent' s Controller and SensorimotorInterfaces
. For the { (XP } behavior, we useda two-level hierarchical architecture(see chapter 3) that was organizedas follows. Basic modules consistedof two independentLCSs, LCS and LCSp, in charge of learning the two basic behaviors (Xand p . The coordinator consistedof one LCS, in charge of learning the sequentialcoordination of the lower-level modules. The input of each basic module representsthe relative direction in ' which the relevant object is perceived. Given that the Autono Mouse s eyes partition the environment into four angular areas, both modules have a two-bit sensoryword as input . At any cycle, each basic module proposes a motor action, which is representedby four bits coding the movement of each independentwheel (two bits code the four possible movementsor the left wheel, and two bits those of the right wheel) . Coordination is achievedby choosing for execution exactly one of the actions proposedby the lower-level modules. This choice is basedon the value of a one-bit word, which representsthe internal state of the agent, " " and which we thereforecall the state word. The effect of the stateword is hardwired: when its value is 0, the action proposedby LCS is executed; when the value is 1, it is LCSp that wins. The coordinator receivesas input the current value of the state word and one bit representingthe state of the transition signal sensor; this bit is set to 1 at the rising edgeof the transition signal and to 0 otherwise. The possible actions for the coordinator are ( 1) set the state word to 0, and (2) set the state word to 1. The task that the coordinator has to learn is " " Maintain the same phase if no transition signal is perceived, and " each time a transition " Switch signal is perceived. phase The controller architecture, describedin figure 6.2, is a dynamic system. At eachcycle t , the sensoryinput at t and the value of the state word at t jointly determineboth the action performed at t and the value of the state word at t + 1.
6.3.5 Experimental Methodology For eachexperimentreportedin this chapterwe ran twelveindependent trials, startingfrom randominitial conditions.Eachtrial included
123
Beyond ReactiveBehavior
Figure6.2 Controllerarchitectureof Autono Mouse . a basiclearningsessionof 4,000 cycles, in which the two basicbehaviors and P were learned ; . . a coordinator learning sessionof 12,000 cycles, in which learning of basic behaviorswas switchedoff and only the coordinator was allowed to learn; and . a testsessionof 4,000 cycles, where all learning was switchedoff and the perfonnanceof the agent evaluated. ' In the learning sessions , the agent s perfonnance Plcarn (t) at cycle twas computed for each trial as Plearn( t ) =
Number
of correct actions performed t
from cycle I to cycle t
,
where an action was considered as correct if it was positively reinforced . " " We called the graph of Plearn(t) for a single trial a learning curve . In the ' test session, the agent s performance Ptestwas measured for each trial as a single number : Number of correct actions performed in the test session . Ptest= 4 ,000 For each experiment we show the coordinator learning curve of a single, typical trial , and report the mean and standard deviation of the twelve Ptestvalues for ( 1) the two basic behaviors < and 1>; and (2) the " " " " ' and switch ) . It is important to two coordinator s tasks ( maintain remark that the performance of the coordinator was judged from the
124
6 Chapter overall behavior of the agent. That is, the only information available to evaluatesuchperformancewas whether the agent was actually approaching A or approaching B; no direct accessto the coordinator' s state word was allowed. Instead, to evaluatethe performanceof the basic behaviors, it was also necessaryto know at each cycle whether the action performed was suggestedby LCS or by LCSp; this fact was establishedby directly inspectingthe internal state of the agent. Finally , to establish whether different experiments resulted in significantly different performances, we computed the probability , p, that the of sample performancesproducedby the different experimentsweredrawn from the samepopulation. To computep, we usedthe Mann -Whitney test (seesection4.1.1) .
6.4 TRAININGPOLI~ As we said in chapter 1, we view training by reinforcement learning as a mechanism to translate a specification of the agent' s target behavior, , embodiedin the reinforcementpolicy, into a control program that realizes the behavior. As the learning mechanismcarries out the translation in the context of agent-environment interactions, the resulting control program can be highly sensitiveto featuresof the environment that would be difficult to model explicitly in a handwritten control program. As usual, our trainer was implementedas an RP, which embodied the specificationof the target behavior. We believe that an important quality an RP should possessis to be highly agent-independent . In other words, we want the RP to baseits judgments on high-level featuresof the agent' s behavior, without bothering too much about the details of suchbehavior. In particular, we want the RP to be as independent as possible from internal features of the agent, which are unobservable to an external observer. Let us considerthe { aP} * behavior, where a = approach object A ; p = approach object B. The transitions from a to p and from p to a should occur whenever a transition signal is perceived. The first step is to train the Autono Mouse to perform the two basic behaviorsa and p . This is a fairly easytask, given that the basicbehaviors are instancesof approaching responsesthat can be produced by a simple
125
Beyond Reactive Behavior
reactive agent. After the basic behaviors have been learned, the next ' step is to train the Autono Mouse s coordinator to generate the target . Before doing so, we have to decidehow the transition signal is sequence to be generated. We experimentedwith both external-based and resultbasedtransitions.
6.4.1 External-BasedTransitions External-based transition signals are generated by the trainer. Let us assumethat coordinator training starts with phase~. The trainer rewards the Autono Mouse if it approaches object A , and punishes it otherwise. At ran