An Introduction to Artificial Intelligence
Janet Finlay, University of Huddersfield
and
Alan Dix, Staffordshire University
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 1996 Janet Finlay and Alan Dix

CRC Press is an imprint of the Taylor & Francis Group, an informa business

No claim to original U.S. Government works

ISBN 13: 978-1-85728-399-0 (pbk)
ISBN 13: 978-1-138-43584-1 (hbk)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com
Contents
Preface

Introduction
  What is artificial intelligence?
  History of artificial intelligence
  The future for AI

1 Knowledge in AI
  1.1 Overview
  1.2 Introduction
  1.3 Representing knowledge
  1.4 Metrics for assessing knowledge representation schemes
  1.5 Logic representations
  1.6 Procedural representation
  1.7 Network representations
  1.8 Structured representations
  1.9 General knowledge
  1.10 The frame problem
  1.11 Knowledge elicitation
  1.12 Summary
  1.13 Exercises
  1.14 Recommended further reading

2 Reasoning
  2.1 Overview
  2.2 What is reasoning?
  2.3 Forward and backward reasoning
  2.4 Reasoning with uncertainty
  2.5 Summary
  2.6 Exercises
  2.7 Recommended further reading

3 Search
  3.1 Introduction
  3.2 Exhaustive search and simple pruning
  3.3 Heuristic search
  3.4 Knowledge-rich search
  3.5 Summary
  3.6 Exercises
  3.7 Recommended further reading

4 Machine learning
  4.1 Overview
  4.2 Why do we want machine learning?
  4.3 How machines learn
  4.4 Deductive learning
  4.5 Inductive learning
  4.6 Explanation-based learning
  4.7 Example: Query-by-Browsing
  4.8 Summary
  4.9 Recommended further reading

5 Game playing
  5.1 Overview
  5.2 Introduction
  5.3 Characteristics of game playing
  5.4 Standard games
  5.5 Non-zero-sum games and simultaneous play
  5.6 The adversary is life!
  5.7 Probability
  5.8 Summary
  5.9 Exercises
  5.10 Recommended further reading

6 Expert systems
  6.1 Overview
  6.2 What are expert systems?
  6.3 Uses of expert systems
  6.4 Architecture of an expert system
  6.5 Examples of four expert systems
  6.6 Building an expert system
  6.7 Limitations of expert systems
  6.8 Hybrid expert systems
  6.9 Summary
  6.10 Exercises
  6.11 Recommended further reading

7 Natural language understanding
  7.1 Overview
  7.2 What is natural language understanding?
  7.3 Why do we need natural language understanding?
  7.4 Why is natural language understanding difficult?
  7.5 An early attempt at natural language understanding: SHRDLU
  7.6 How does natural language understanding work?
  7.7 Syntactic analysis
  7.8 Semantic analysis
  7.9 Pragmatic analysis
  7.10 Summary
  7.11 Exercises
  7.12 Recommended further reading
  7.13 Solution to SHRDLU problem

8 Computer vision
  8.1 Overview
  8.2 Introduction
  8.3 Digitization and signal processing
  8.4 Edge detection
  8.5 Region detection
  8.6 Reconstructing objects
  8.7 Identifying objects
  8.8 Multiple images
  8.9 Summary
  8.10 Exercises
  8.11 Recommended further reading

9 Planning and robotics
  9.1 Overview
  9.2 Introduction
  9.3 Global planning
  9.4 Local planning
  9.5 Limbs, legs and eyes
  9.6 Practical robotics
  9.7 Summary
  9.8 Exercises
  9.9 Recommended further reading

10 Agents
  10.1 Overview
  10.2 Software agents
  10.3 Co-operating agents and distributed AI
  10.4 Summary
  10.5 Exercises
  10.6 Recommended further reading

11 Models of the mind
  11.1 Overview
  11.2 Introduction
  11.3 What is the human mind?
  11.4 Production system models
  11.5 Connectionist models of cognition
  11.6 Summary
  11.7 Exercises
  11.8 Recommended further reading
  11.9 Notes

12 Epilogue: philosophical and sociological issues
  12.1 Overview
  12.2 Intelligent machines or engineering tools?
  12.3 What is intelligence?
  12.4 Computational argument vs. Searle's Chinese Room
  12.5 Who is responsible?
  12.6 Morals and emotions
  12.7 Social implications
  12.8 Summary
  12.9 Recommended further reading

Bibliography

Index
Preface
Why another AI text-book? This book is based upon course material used by one of the authors on an introductory course in artificial intelligence (AI) at the University of York. The course was taught to MSc conversion course students who had little technical background and only basic level mathematics. Available text-books in AI either assumed too much technical knowledge or provided a very limited coverage of the subject. This book is an attempt to fill this gap. Its aim is to provide accessible coverage of the key areas of AI, in such a way that it will be understandable to those with only a basic knowledge of mathematics.

The book takes a pragmatic approach to AI, looking at how AI techniques are applied to various application areas. It is structured in two main sections. The first part introduces the key techniques used in AI in the areas of knowledge representation, search, reasoning and learning. The second part covers application areas including game playing, expert systems, natural language understanding, vision, robotics, agents and modelling cognition. The book concludes with a brief consideration of some of the philosophical and social issues relating to the subject.

It does not claim to be comprehensive: there are many books on the market which give more detailed coverage. Instead it is designed to be used to support a one-semester introductory module in AI (assuming a 12 week module with a lecture and practical session). Depending on the emphasis of the module, possible course structures include spending one week on each of the 12 main chapters or spending more time on techniques and selecting a subset of the application areas to consider.

The book does not attempt to teach any programming language. However, since it is useful to demonstrate techniques using a specific implementation we provide example PROLOG programs which are available from our WWW site (http://www.hud.ac.uk/schools/comp+maths/books/ai96). If you prefer to receive the programs on disk please send £8 to us at the address below and we will supply a disk.
Acknowledgements

As with any endeavour there are many people behind the scenes providing help and support. We would like to thank our families, friends and colleagues for their tolerance and understanding while we have been writing this book. In particular, thanks must go to Fiona for suggestions that have enhanced the readability of the book and to the anonymous reviewers for useful feedback.

Janet Finlay
School of Computing and Mathematics
University of Huddersfield
Canalside, Huddersfield HD1 3DH, UK

and

Alan Dix
Associate Dean, School of Computing
Staffordshire University
P.O. Box 334, Stafford, ST18 0DG, UK
Introduction
What is artificial intelligence?

Artificial intelligence (AI) is many different things to different people. It is likely that everyone who picks up this book has their own, albeit perhaps vague, notion of what it is. As a concept, AI has long captured the attention and imagination of journalists and novelists alike, leading both to popular renditions of current AI developments and futuristic representations of what might be just around the corner. Television and film producers have followed suit, so that AI is rarely far from the public eye. Robots, computers that talk to us in our own language and intelligent computer "experts" are all part of the future as presented to us through the media, though there is some division as to whether these developments will provide us with benign servants or sinister and deadly opponents.

But outside the realm of futuristic fiction, what is AI all about? Unfortunately there is no single answer: just like in the media representation it very much depends upon who you talk to. On the one hand, there are those who view AI in high-level terms as the study of the nature of intelligence and, from there, how to reproduce it. Computers are therefore used to model intelligence in order to understand it. Within this group there are those who believe that human intelligence is essentially computational and, therefore, that cognitive states can be reproduced in a machine. Others use computers to test their theories of intelligence: they are interested less in replicating than in understanding human intelligence. For either of these groups, it is vital that the techniques proposed actually reflect human cognitive processes.

On the other hand, there are those who view AI in more "pragmatic" terms: it is a discipline that provides engineering techniques to solve difficult problems. Whether these techniques reflect human cognition or indicate actual intelligence is not important. To this group the success of an AI system is judged on its behaviour in the domain of interest. It is not necessary for the machine to exhibit general intelligence.

A third set of people, perhaps falling somewhere between the previous two, want to develop machines that not only exhibit intelligent behaviour but are able to learn and adapt to their environment in a way similar to humans. In striving towards this, it is inevitable that insights will be gained into the nature of human intelligence and learning, although it is not essential that these are accurately reproduced.
So can we derive a definition of AI that encompasses some of these ideas? A working definition may go something like this:

    AI is concerned with building machines that can act and react appropriately, adapting their response to the demands of the situation. Such machines should display behaviour comparable with that considered to require intelligence in humans.

Such a definition incorporates learning and adaptability as general characteristics of intelligence but stops short of insisting on the replication of human intelligence. What types of behaviour would meet this definition and therefore fall under the umbrella of AI? Or, perhaps more importantly, what types of behaviour would not? It may be useful to think about some of the things we consider to require intelligence or thought in human beings. A list would usually include conscious cognitive activities: problem solving, decision making, reading, mathematics. Further consideration might add more creative activities: writing and art. We are less likely to think of our more fundamental skills - language, vision, motor skills and navigation - simply because, to us, these are automatic and do not require conscious attention.

But consider for a moment what is involved in these "everyday" activities. For example, language understanding requires recognition and interpretation of words, spoken in many different accents and intonations, and knowledge of how words can be strung together. It involves resolution of ambiguity and understanding of context. Language production is even more complex. One only needs to take up a foreign language to appreciate the difficulties involved - even for humans. On the other hand, some areas that may seem to us very difficult, such as mathematical calculation, are in fact much more formulaic and therefore require only the ability to follow steps accurately. Such behaviour is not inherently intelligent, and computers are traditionally excellent as calculators. However, this activity would not be classed as artificial intelligence. Of course, we would not want to suggest that mathematics does not require intelligence! For example, problem solving and interpretation are also important in mathematics and these aspects have been studied as domains for AI research.

From this we can see that there are a number of areas that are useful to explore in AI, among them language understanding, expert decision making, problem solving and planning. We will consider these and others in the course of this book. However, it should be noted that there are also some "grey" areas, activities that require skill and strategy when performed by humans but that can, ultimately, be condensed to a search of possible options (albeit a huge number of them). Game playing is a prime example of such an activity. In the early days, chess and other complex games were very much within the domain of humans and not computers, and were considered a valid target for AI research. But today computers can play chess at grand master level, largely due to their huge memory capacity. Many would therefore say that these are not now part of AI. However, such games have played an important part in the development of several AI techniques, notably search techniques, and are therefore still worthy of consideration. It is for this reason that we include them in this book.

However, before we move on to look in more detail at the techniques and applications of AI, we should pause to consider how it has developed up to now.
History of artificial intelligence

AI is not a new concept. The idea of creating an intelligent being was proposed and discussed in various ways by writers and philosophers centuries before the computer was even invented. The earliest writers imagined their "artificial" beings created from stone: the Roman poet Ovid wrote a story of Pygmalion, the sculptor, whose statue of a beautiful woman was brought to life (the musical My fair lady is the more recent rendition of this fable). Much later, in the age of industrial machines, Mary Shelley had Doctor Frankenstein manufacture a man from separate components and bring him to life through electricity. By the 1960s, fiction was beginning to mirror the goals of the most ambitious AI researcher. In Arthur C. Clarke's 2001, we find the computer HAL displaying all the attributes of human intelligence, including self-preservation. Other films, such as Blade runner and Terminator, present a vision of cyborg machines almost indistinguishable from humans.

Early philosophers also considered the question of whether human intelligence can be reproduced in a machine. In 1642, Descartes argued that, although machines (in the right guise) could pass as animals, they could never pass as humans. He went on to identify his reasons for this assertion, namely that machines lack the ability to use language and the ability to reason. Interestingly, although he was writing at a time when clocks and windmills were among the most sophisticated pieces of machinery, he had identified two areas that still occupy the attention of AI researchers today, and that are central to one of the first tests of machine intelligence proposed for computers, the Turing test.
Turing and the Turing test

To find the start of modern AI we have to go back to 1950, when computers were basically large numeric calculators. In that year, a British mathematician, Alan Turing, wrote a now famous paper entitled Computing machinery and intelligence, in which he posed the question "can machines think?" (Turing 1950). His answer to the question was to propose a game, the Imitation game, as the basis for a test for machine intelligence. His test is now known as the Turing test.

His proposal was as follows. Imagine you have a person able to communicate with two others, one male and one female, only through a teletype or computer link. This person must try to distinguish the male from the female simply by asking questions, while those being questioned try to fool the interrogator about their gender. In the Turing test, one of the human participants is replaced by a computer. If the computer is able to convince the interrogator it is the human, Turing argued, it can be said to be intelligent.

The Turing test has been widely criticized since it was proposed, and in Chapter 12 of this book we will look in detail at some of the objections to it. However, the main criticism is that it takes a behavioural rather than a psychological view of intelligence. Regardless of the criticisms the Turing test is significant as it provided the first concrete goal for AI research.
The development of AI

In 1956 about a dozen researchers met at Dartmouth College. They represented a number of academic disciplines including mathematics, electrical engineering and psychology but they had one interest in common: using the powerful tool of the computer to simulate human intelligence. It was here, at the Dartmouth Conference, that the name "Artificial Intelligence" was coined.

In the early days of AI, research concentrated on applications that were considered to involve intelligence and reasoning, mainly game playing and theorem proving. During the decade following the Dartmouth Conference there were significant developments in search techniques, pattern matching, heuristics and constraint satisfaction, which are still important today. However, these early applications proved to be solvable using rapid searches through all possible options, rather than the selection strategies and reasoning applied by humans, and attention turned to the problem of modelling and reproducing true understanding.

From the mid-1960s researchers concentrated on applications such as perception, language understanding and general problem solving. One such application, ELIZA (Weizenbaum 1966), was hailed by some as a breakthrough in language understanding. ELIZA appeared, on the surface, to understand and respond intelligently to human conversation, albeit in the slightly artificial context of Rogerian psychoanalysis. "Patients" would communicate with ELIZA via a keyboard and the program would respond in such a way that many patients believed that they were talking to a human. A conversation with ELIZA would run something like this (machine responses in upper case):

TELL ME ABOUT YOUR PROBLEM
men are all alike
HOW DO YOU FEEL ABOUT MEN?
none of them can be trusted
NOT ANY?
well very few
COULD YOU BE MORE PRECISE PLEASE?
my grandfather was trustworthy
TELL ME ABOUT YOUR GRANDFATHER
he is dead
HOW LONG HAS HE BEEN DEAD?
a couple of years
he was my mother's father
TELL ME ABOUT YOUR MOTHER
she is very nice
CAN YOU THINK OF A SPECIFIC EXAMPLE?
she is always glad to see me
IS YOUR FATHER ALSO ALWAYS GLAD TO SEE YOU?
most of the time
HOW DID YOUR PARENTS TREAT YOU AS A CHILD?

Such a conversation may appear convincing but does ELIZA therefore pass the Turing test? In 1966, the program's developer, Joseph Weizenbaum, wrote a paper to dispel this notion (at the same time casting some doubt on the validity of the Turing test) (Weizenbaum 1966). ELIZA actually works on the basis of some clever tricks of substitution, pattern matching and random selection. The user's sentence is searched for key words that are then either substituted into a new sentence by ELIZA or used to fire an associated response. If more than one possible response is available, ELIZA chooses between them randomly. In spite of appearances, therefore, ELIZA embodies no understanding of the English language. As Weizenbaum says, it shows "how easy it is to create and maintain the illusion of understanding" (Weizenbaum 1966). Once these tricks are understood it is very easy to fool ELIZA:

HOW DID YOUR PARENTS TREAT YOU AS A CHILD?
they gave me my grandfather clock
TELL ME ABOUT YOUR GRANDFATHER
...

However, research continued into language understanding in the early 1970s and another significant system emerged: SHRDLU. SHRDLU (the name comes from six of the 12 most commonly used letters in English) was developed by Terry Winograd (Winograd 1972). The system can understand and comply with quite complex sentences but there is a catch: it is restricted to the domain of simulated blocks. Although this may seem limited, SHRDLU still far surpassed any other system of the time. Consider an instruction such as "Find a block that is taller than the one you are holding and put it on top of the red box." What knowledge is required to interpret such a sentence? First you need to understand the concepts of relative sizes. Then you need to interpret the reference in the second clause: to what does "it" refer? Then you need to understand relative position and differentiate by colour. SHRDLU was able to interpret such instructions through the use of stored knowledge and was one of the applications of this period that led to the development of a number of methodologies for knowledge representation (discussed in Ch. 1).
The physical symbol system hypothesis

In 1976 Newell and Simon proposed a hypothesis that has become the basis of research and experimentation in AI: the physical symbol system hypothesis (Newell & Simon 1976). The hypothesis states that

    A physical symbol system has the necessary and sufficient means for general intelligent action.

So what does this mean? A symbol is a token that represents something else. For example, a word is a symbol representing an object or concept. The symbol is physical although the thing represented by it may be conceptual. Symbols are physically related to each other in symbol structures (for example, they may be adjacent). In addition to symbol structures, the system contains operators or processes that transform structures into other structures, for example copying, adding, removing them. A physical symbol system comprises an evolving set of symbol structures and the operators required to transform them. The hypothesis suggests that such a system is able to model intelligent behaviour.

The only way to test this hypothesis is by experimentation: choose an activity that requires intelligence and devise a physical symbol system to solve it. Computers are a good means of simulating the physical symbol system and are therefore used in testing the hypothesis. It is not yet clear whether the physical symbol system hypothesis will hold in all areas of intelligence. It is certainly supported by work in areas such as game playing and decision making but in lower-level activities such as vision it is possible that subsymbolic approaches (such as neural networks) will prove to be more useful. However, this in itself does not disprove the physical symbol system hypothesis, since it is clearly possible to solve problems in alternative ways.

The physical symbol system hypothesis is important as the foundation for the belief that it is possible to create artificial intelligence. It also provides a useful model of human intelligence that can be simulated and therefore tested.
More recent developments

By the late 1970s, while the physical symbol system hypothesis provided fresh impetus to those examining the nature of intelligent behaviour, some research moved away from the "grand aim" of producing general machine understanding and concentrated instead upon developing effective techniques in restricted domains. Arguably this approach has had the most commercial success, producing, amongst other things, the expert system (see Ch. 6).

More recently the development of artificial neural networks, modelled on the human brain, has been hailed by some as the basis for genuine machine intelligence and learning. Neural networks, or "connectionist" systems, have proved effective in small applications but many have huge resource requirements.
Traditional AI researchers have been slow to welcome the connectionists, being sceptical of their claims and the premises underlying neural networks. In one example, a recognition system used neural networks to learn the properties of a number of photographs taken in woodland. Its aim was to differentiate between those containing tanks and those without. After a number of test runs in which the system accurately picked out all the photographs of tanks, the developers were feeling suitably pleased with themselves. However, to confirm their findings they took another set of photographs. To their dismay the system proved completely unable to pick out the tanks. After further investigation it turned out that the first set of photographs of tanks had been taken on a sunny day while those without were cloudy. The network was not classifying the photographs according to the presence of tanks at all but according to prevailing weather conditions! Since the "reasoning" underlying the network is difficult to examine such mistakes can go unnoticed. Neural networks will be discussed in more detail in Chapter 11.
The future for AI In spite of a certain retreat from the early unbounded claims for AI, there are a number of areas where AI research has resulted in commercially successful systems and where further developments are being unveiled every year. Certainly the greatest commercial success of AI to date is the expert system. Starting from early medical and geological systems, expert system technology has been applied to a huge range of application areas, from insurance to travel advice, from vehicle maintenance to selecting a pet. It is no longer necessary for the developer of an expert system to be an AI expert: expert system shells that embody the reasoning and knowledge structures of an expert system require only the domain-specific information to be added. Other areas that are having increasing commercial success are handwriting and speech recognition. Commercial pen-based computing systems are now available that allow users to interact without the use of a keyboard, entering information by hand. Such systems currently require training in order to perform accurately, and even then can usually only understand one user's writing reliably. But even this is impressive given the ambiguity in handwriting and the number of topics that may be being discussed. Similarly, there are commercially available speech recognition systems that perform accurately for single users. Game playing was one of the areas that attracted the attention of early AI researchers, since it appeared that skilled performance of games such as chess required well-developed problem-solving skills and the application of strategy. Although it was realised that much of the skill required in such games could be distilled into the rapid searching of a set of possible moves, the resulting systems do demonstrate how well such methods work. The chess programs of today are 7
INTRODUCTION
sophisticated enough to beat a human grand master on occasion. Much of the research in AI has been of interest to the military, and none less than that concerned with computer vision and robotics. Research has developed self-navigating cars and vehicles and systems to identify types of aircraft. Pattern recognition and classification technology is particularly promising in these areas, although care should be exercised in the light of problems like the one cited above. This technology is also finding civil applications, including identification of speeding motorists through recognition of number plates. Robotics in the factory is already well developed, and attention is being given to producing effective robotic applications for the home. The notion of "smart buildings" that adapt to their environment (for example, by adjusting window shading) is also being extended to take us towards the "smart house". It is clear then that although the goals and emphases of AI may have changed over time, the subject is far from dead or historical. Developments using AI techniques are being produced all the time. Indeed, some very familiar aspects of the computer tools we use regularly, such as spelling and grammar checkers, originate in AI research. It may not be long before AI is an integral part of all our lives. In the chapters that follow we will take a relatively pragmatic approach to AI, considering first the techniques that form the building blocks for AI systems, and then showing how these can be applied in a number of important application areas, including expert systems, language understanding, computer vision and robotics.
Chapter One
Knowledge in AI
1.1 Overview
Knowledge is vital to all intelligence. In this chapter we examine four key knowledge representation schemes looking at examples of each and their strengths and weaknesses. We consider how to assess a knowledge representation scheme in order to choose one that is appropriate to our particular problem. We discuss the problems of representing general knowledge and changing knowledge.
1.2 Introduction
Knowledge is central to intelligence. We need it to use or understand language; to make decisions; to recognize objects; to interpret situations; to plan strategies. We store in our memories millions of pieces of knowledge that we use daily to make sense of the world and our interactions with it. Some of the knowledge we possess is factual. We know what things are and what they do. This type of knowledge is known as declarative knowledge. We also know how to do things: procedural knowledge. For example, if we consider what we know about the English language we may have some declarative knowledge that the word tree is a noun and that tall is an adjective. These are among the thousands of facts we know about the English language. However, we also have procedural knowledge about English. For example, we may know that in order to provide more information about something we place an adjective before the noun. Similarly, imagine you are giving directions to your home. You may have declarative knowledge about the location of your house and its transport links (for example, "my house is in Golcar", "the number 301 bus runs through Golcar", "Golcar is off the Manchester Road"). In addition you may have procedural knowledge about how to get to your house ("Get on the 301 bus").

Another distinction that can be drawn is between the specific knowledge we have on a particular subject (domain-specific knowledge) and the general or "common-sense" knowledge that applies throughout our experience (domain-independent knowledge). The fact "the number 301 bus goes to Golcar" is an example of the former: it is knowledge that is relevant only in a restricted domain - in this case Huddersfield's transport system. New knowledge would be required to deal with transport in any other city. However, the knowledge that a bus is a motorized means of transport is a piece of general knowledge which is applicable to buses throughout our experience.

General or common-sense knowledge also enables us to interpret situations accurately. For example, imagine someone asks you "Can you tell me the way to the station?". Your common-sense knowledge tells you that the person expects a set of directions; only a deliberately obtuse person would answer literally "yes"! Similarly there are thousands if not millions of "facts" that are obvious to us from our experience of the world, many acquired in early childhood. They are so obvious to us that we wouldn't normally dream of expressing them explicitly. Facts about age: a person's age increments by one each year, children are always younger than their parents, people don't live much longer than 100 years; facts about the way that substances such as water behave; facts about the physical properties of everyday objects and indeed ourselves - this is the general or "common" knowledge that humans share through shared experience and that we rely on every day.

Just as we need knowledge to function effectively, it is also vital in artificial intelligence. As we saw earlier, one of the problems with ELIZA was lack of knowledge: the program had no knowledge of the meanings or contexts of the words it was using and so failed to convince for long. So the first thing we need to provide for our intelligent machine is knowledge. As we shall see, this will include procedural and declarative knowledge and domain-specific and general knowledge. The specific knowledge required will depend upon the application. For language understanding we need to provide knowledge of syntax rules, words and their meanings, and context; for expert decision making, we need knowledge of the domain of interest as well as decision-making strategies. For visual recognition, knowledge of possible objects and how they occur in the world is needed. Even simple game playing requires knowledge of possible moves and winning strategies.
1.3 Representing knowledge
We have seen the types of knowledge that we use in everyday life and that we would like to provide to our intelligent machine. We have also seen something of the enormity of the task of providing that knowledge. However, the knowledge that we have been considering is largely experiential or internal to the human holder. In order to make use of it in AI we need to get it from the source (usually human but can be other information sources) and represent it in a form usable by the machine. Human knowledge is usually expressed through language, which, of course, cannot be accurately understood by the machine. The representation we choose must therefore be both appropriate for the computer to use and allow easy and accurate encoding from the source.

We need to be able to represent facts about the world. However, this is not all. Facts do not exist in isolation; they are related to each other in a number of ways. First, a fact may be a specific instance of another, more general fact. For example, "Spotty Dog barks" is a specific instance of the fact "all dogs bark" (not strictly true but a common belief). In a case like this, we may wish to allow property inheritance, in which properties or attributes of the main class are inherited by instances of that class. So we might represent the knowledge that dogs bark and that Spotty Dog is a dog, allowing us then to deduce by inheritance the fact that Spotty Dog barks. Secondly, facts may be related by virtue of the object or concept to which they refer. For example, we may know the time, place, subject and speaker for a lecture and these pieces of information make sense only in the context of the occasion by which they are related. And of course we need to represent procedural knowledge as well as declarative knowledge.

It should be noted that the representation chosen can be an important factor in determining the ease with which a problem can be solved. For example, imagine you have a 3 x 3 chess board with a knight in each corner (as in Fig. 1.1). How many moves (that is, chess knight moves) will it take to move each knight round to the next corner?
Figure 1.1 Four knights: how many moves?
Looking at the diagrammatic representation in Figure 1.1, the solution is not obvious, but if we label each square and represent valid moves as adjacent points on a circle (see Fig. 1.2), the solution becomes more obvious: each knight takes two moves to reach its new position so the minimum number of moves is eight.
Figure 1.2 A different representation makes the solution clearer.
In addition, the granularity of the representation can affect its usefulness. In other words, we have to determine how detailed the knowledge we represent needs to be. This will depend largely on the application and the use to which the knowledge will be put. For example, if we are building a knowledge base about family relationships we may include a representation of the definition of the relation "cousin" (given here in English but easily translatable into logic, for example):

    your cousin is a child of a sibling of your parent.

However, this may not be enough information; we may also wish to know the gender of the cousin. If this is the case a more detailed representation is required. For a female cousin:

    your cousin is a daughter of a sibling of your parent

or a male cousin

    your cousin is a son of a sibling of your parent.

Similarly, if you wanted to know to which side of the family your cousin belongs you would need different information: from your father's side

    your cousin is a child of a sibling of your father

or your mother's

    your cousin is a child of a sibling of your mother.

A full description of all the possible variations is given in Figure 1.3. Such detail may not always be required and therefore may in some circumstances be unnecessarily complex.

your cousin is a daughter of a sister of your mother
your cousin is a daughter of a sister of your father
your cousin is a daughter of a brother of your mother
your cousin is a daughter of a brother of your father
your cousin is a son of a sister of your mother
your cousin is a son of a sister of your father
your cousin is a son of a brother of your mother
your cousin is a son of a brother of your father

Figure 1.3 Full definition of the relationship "cousin".
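As an illustration, the coarse-grained definition of "cousin" above translates almost directly into PROLOG. The following is only a minimal sketch: the family facts, and the sibling helper, are invented here purely for illustration.

% Invented example facts: parent(P, C) means P is a parent of C.
parent(alice, carol).
parent(alice, david).
parent(carol, emma).
parent(david, frank).
male(david).   male(frank).
female(alice). female(carol). female(emma).

% Two people are siblings if they share a parent and are not the same person.
sibling(X, Y) :- parent(P, X), parent(P, Y), X \= Y.

% Coarse granularity: cousin(X, Y) means Y is a cousin of X,
% i.e. Y is a child of a sibling of a parent of X.
cousin(X, Y) :- parent(P, X), sibling(P, Q), parent(Q, Y).

% Finer granularity: distinguish the cousin's gender.
female_cousin(X, Y) :- cousin(X, Y), female(Y).
male_cousin(X, Y)   :- cousin(X, Y), male(Y).

% ?- cousin(emma, Who).
% Who = frank.

Moving to the finer-grained definitions is simply a matter of adding clauses; the cost is a larger, more cluttered knowledge base, which is exactly the trade-off discussed above.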
There are a number of knowledge representation methods that can be used. Later in this chapter we will examine some of them briefly, and identify the areas for which each is best suited. In later chapters of the book we will see how these methods can be used in specific application areas. But what makes a good knowledge representation scheme, and how can different schemes be evaluated against one another? Before going on to consider specific approaches to knowledge representation, we will look in more detail at what features a knowledge representation scheme should possess.
1.4 Metrics for assessing knowledge representation schemes
We have already looked at some of the factors we are looking for in a knowledge representation scheme. However, we can expand upon these and generate some metrics by which to measure the representations available to us. The main requirements of a knowledge representation scheme can be summarized under four headings: expressiveness, effectiveness, efficiency and explanation.

- expressiveness. We have already considered some of the types of knowledge that we might wish to represent. An expressive representation scheme will be able to handle different types and levels of granularity of knowledge. It will be able to represent complex knowledge and knowledge structures and the relationships between them. It will have means of representing specific facts and generic information (for example, by using variables). Expressiveness also relates to the clarity of the representation scheme. Ideally, the scheme should use a notation that is natural and usable both by the knowledge engineer and the domain expert. Schemes that are too complex for the latter to understand can result in incorrect knowledge being held, since the expert may not be able to critique the knowledge adequately. In summary, our representation scheme should be characterized by completeness and clarity of expression.
- effectiveness. The second measure of a good representation scheme is its effectiveness. In order to be effective, the scheme must provide a means of inferring new knowledge from old. It should also be amenable to computation, allowing adequate tool support.
- efficiency. Thirdly, the scheme should be efficient. The knowledge representation scheme must not only support inference of new knowledge from old but must do so efficiently in order for the new knowledge to be of use. In addition, the scheme should facilitate efficient knowledge gathering and representation.
- explicitness. Finally, a good knowledge representation scheme must be able to provide an explanation of its inferences and allow justifications of its reasoning. The chain of reasoning should be explicit.

In the rest of this chapter we will use these four metrics to compare the effectiveness of the techniques we will consider.
1.5 Logic representations
Logic representations use expressions in formal logic to represent the knowledge required. Inference rules and proof procedures can apply this knowledge to specific problems. First-order predicate calculus is the most common form of logic representation, with PROLOG being the most common language used to implement it. Logic is appealing as a means of knowledge representation, as it is a powerful formalism with known inference procedures. We can derive a new piece of knowledge by proving that it is a consequence of knowledge that is already known. The significant features of the domain can be represented as logical assertions, and general attributes can be expressed using variables in logical statements. It has the advantage of being computable, albeit in a restricted form.

So how can we use logic to represent knowledge? Facts can be expressed as simple propositions. A proposition is a statement that can have one of two values: true or false. These are known as truth values. So the statements It is raining and I am hungry are propositions whose values depend on the situation at the time. If I have just eaten dinner in a thunderstorm then the first is likely to be true and the second false. Propositions can be combined using operators such as and (∧) and or (∨). Returning to our dining example, we could combine the two statements: It is raining and I am hungry (which for convenience we will express as P ∧ Q). The truth value of the combined propositions will depend upon the truth values of the individual propositions and the operator connecting them. If the situation is still as it was then this combined propositional statement will be false, since one of the propositions (Q) is false. Figure 1.4 shows a truth table that defines the truth values of and and or.
P   Q   P ∧ Q   P ∨ Q
T   T     T       T
T   F     F       T
F   T     F       T
F   F     F       F
Figure 1.4 Truth values for simple logic operators.
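In PROLOG, which we will use later in the chapter, a table like this can be written down directly as facts and then queried. The following is a minimal sketch for illustration only; the predicate names and_value and or_value are our own.

% One fact per row of the truth table in Figure 1.4.
% and_value(P, Q, V): V is the truth value of P and Q.
and_value(true,  true,  true).
and_value(true,  false, false).
and_value(false, true,  false).
and_value(false, false, false).

% or_value(P, Q, V): V is the truth value of P or Q.
or_value(true,  true,  true).
or_value(true,  false, true).
or_value(false, true,  true).
or_value(false, false, false).

% ?- and_value(true, false, V).     V = false
% ?- or_value(P, Q, true).          enumerates every row in which P or Q is true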
Propositional logic is limited in that it does not allow us to generalize sufficiently. Common elements within propositions cannot be used to make inferences. We need to be able to extract such common elements as parameters to the propositions, in order to allow inferences with them. Parametrized propositions give us predicate logic. For example, if we wish to represent our knowledge of the members of Thunderbirds' International rescue organization we might include such facts as

    father(Jeff, Virgil)
    father(Jeff, Alan)

to mean Jeff is the father of Virgil and Jeff is the father of Alan respectively. Father is the predicate here and Jeff, Virgil and Alan are parameters. In predicate logic, parameters can also include variables. For example,

    father(Jeff, x)

where x is a variable that can be instantiated later with a value - the name of someone of whom Jeff is the father. Quantifiers (universal and existential) allow the scope of the variable to be determined unambiguously. For example, in the statement above, we do not know for certain that there is a value for x; that is, that Jeff is indeed someone's father (ignoring the two earlier facts for a moment). In the following statement we use the existential quantifier, ∃, to express the fact that Jeff is the father of at least one person:

    ∃x : father(Jeff, x)

Similarly we can express rules that apply universally using the universal quantifier, ∀:

    ∀x∀y : father(x, y) ∨ mother(x, y) → parent(x, y)
    ∀x∀y∀z : parent(x, y) ∧ parent(x, z) → sibling(y, z)

The first of these states that for all values of x and y, if x is the father of y or (∨) the mother of y then x is the parent of y. The second uses this knowledge to say something about siblings: for all values of x, y and z, if x is the parent of y (that is, the father or the mother), and (∧) x is the parent of z, then y and z are siblings.

Inference methods allow us to derive new facts from existing facts. There are a number of inference procedures for logic but we can illustrate the principle using the simple rule that we can substitute a universally quantified variable with any value in its domain. So, given the rule about parenthood and the facts we already know about the family from International rescue, we can derive new facts as shown below. Given

    ∀x∀y : father(x, y) ∨ mother(x, y) → parent(x, y)
    father(Jeff, Virgil)
    father(Jeff, Alan)

we can derive the facts (by substitution)

    parent(Jeff, Virgil)
    parent(Jeff, Alan)

Similarly, given

    ∀x∀y∀z : parent(x, y) ∧ parent(x, z) → sibling(y, z)
    parent(Jeff, Virgil)
    parent(Jeff, Alan)

we can derive the fact

    sibling(Virgil, Alan)

Facts and rules such as these can be represented easily in PROLOG. However, predicate logic and PROLOG have a limitation, which is that they operate under what is known as the closed world assumption. This means that we assume that all knowledge in the world is represented: the knowledge base is complete. Therefore any fact that is missing is assumed to be false. PROLOG uses a problem solving strategy called negation as failure, which means that it returns a result of false if it is unable to prove a goal to be true. This relies on the closed world assumption (Reiter 1978). Such an assumption is useful when all relevant facts are represented but can cause problems when the knowledge base is incomplete.

In summary, logic is

- expressive: it allows representation of facts, relationships between facts and assertions about facts. It is relatively understandable. PROLOG is less expressive since it is not possible to represent logical negation explicitly. This in turn leads to less clarity.
- effective: new facts can be inferred from old. It is also amenable to computation through PROLOG.
- efficient: the use of predicates and variables makes the representation scheme relatively efficient, although computational efficiency depends to a degree on the interpreter being used and the programmer.
- explicit: explanations and justifications can be provided by backtracking.

We will return to logic and PROLOG later in the book since it is a useful method of illustrating and implementing some of the techniques we will be considering.
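To make the translation into PROLOG concrete, here is a minimal sketch of the International rescue facts and rules above, written in PROLOG syntax (constants in lower case, variables in upper case, and each implication as a "head :- body" clause). It is an illustrative sketch rather than a complete family knowledge base.

% Facts about the family.
father(jeff, virgil).
father(jeff, alan).

% There are no mother/2 facts yet; declaring the predicate dynamic
% lets calls to it simply fail rather than raise an error.
:- dynamic mother/2.

% The disjunction father(x,y) or mother(x,y) -> parent(x,y)
% becomes two separate clauses.
parent(X, Y) :- father(X, Y).
parent(X, Y) :- mother(X, Y).

% parent(x,y) and parent(x,z) -> sibling(y,z).
% The inequality is added so that no one counts as their own sibling.
sibling(Y, Z) :- parent(X, Y), parent(X, Z), Y \= Z.

% ?- parent(jeff, Who).       Who = virgil ; Who = alan.
% ?- sibling(virgil, alan).   true.
% ?- father(jeff, gordon).    false - negation as failure: the fact is
%                             missing from the knowledge base, so it is
%                             assumed false (the closed world assumption).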
1.6 Procedural representation

Logic representations, such as we have been looking at, are declarative: we specify what we know about a problem or domain. We do not specify how to solve the problem or what to do with the knowledge. Procedural approaches, on the other hand, represent knowledge as a set of instructions for solving a problem. If a given condition is met then an associated action or series of actions is performed. The production system is an example of this (Newell & Simon 1972). A production system has three components:

- a database of facts (often called working memory)
- a set of production rules that alter the facts in the database. These rules or productions are of the form IF <condition> THEN <action>
- an interpreter that decides which rule to apply and handles any conflicts.
1.6.1 The database

The database or working memory represents all the knowledge of the system at any given moment. It can be thought of as a simple database of facts that are true of the domain at that time. The number of items in the database is small: the analogy is to human working memory, which can hold only a small number of items at a time. The contents of the database change as facts are added or removed according to the application of the rules.
1.6.2 The production rules

Production rules are operators that are applied to the knowledge in the database and change the state of the production system in some way, usually by changing the content of the database. Production rules are sometimes called condition-action rules and this describes their behaviour well. If the condition of a rule is true (according to the database at that moment), the action associated with the rule is performed. This may be, for example, to alter the contents of the database by removing a fact, or to interact with the outside world in some way. Production rules are usually unordered, in the sense that the sequence in which the rules will be applied depends on the current state of the database: the rule whose condition matches the state of the database will be selected. If more than one rule matches then conflict resolution strategies are applied. However, some production systems are programmed to apply rules in order, so avoiding conflict (this is itself a conflict resolution strategy).
1.6.3 The interpreter

The interpreter is responsible for finding the rules whose conditions are matched by the current state of the database. It must then decide which rule to apply. If there is more than one rule whose condition matches then one of the contenders must be selected using strategies such as those proposed below. If no rule matches, the system cannot proceed. Once a single rule has been selected the interpreter must perform the actions in the body of the rule. This process continues until there are no matching rules or until a rule is triggered which includes the instruction to stop.

The interpreter must have strategies to select a single rule where several match the state of the database. There are a number of possible ways to handle this situation. The most simple strategy is to choose the first rule that matches. This effectively places an ordering on the production rules, which must be carefully considered when writing the rules. An alternative strategy is to favour the most specific rule. This may involve choosing a rule that matches all the conditions of its contenders but that also contains further conditions that match, or it may mean choosing the rule that instantiates variables or qualifies a fact. For example, IF <salary is high> and <age > 40> is more specific than IF <salary is high>. Similarly IF <salary > £40,000> is more specific than IF <salary > £20,000> since fewer instances will match it.
1.6.4 An example production system: making a loan

This production system gives advice on whether to make a loan to a client (its rules are obviously very simplistic but it is useful to illustrate the technique). Initially the database contains the following default facts:
<client working? is unknown>
<client student? is unknown>
<salary is unknown>

and a single fact relating to our client:

<AMOUNT REQUESTED is £2000>

which represents the amount of money our client wishes to borrow (we will assume that this has been added using other rules). We can use the production system to find out more information about the client and decide whether to give this loan.

Rules:

1. IF <client working? is unknown>
   THEN ask "Are you working?"
        read WORKING
        remove <client working? is unknown>
        add <client working? is WORKING>

2. IF <client working? is YES> and <salary is unknown>
   THEN ask "What is your salary?"
        read SALARY
        remove <salary is unknown>
        add <salary is SALARY>

3. IF <client working? is YES> and <salary is SALARY>
      and SALARY > (5 * AMOUNT REQUESTED)
   THEN grant loan of AMOUNT REQUESTED
        clear database
        finish

4. IF <client working? is YES> and <salary is SALARY>
      and SALARY < (5 * AMOUNT REQUESTED)
   THEN grant loan of (SALARY/5)
        clear database
        finish

5. IF <client working? is NO> and <client student? is unknown>
   THEN ask "Are you a student?"
        read STUDENT
        remove <client student? is unknown>
        add <client student? is STUDENT>

6. IF <client working? is NO> and <client student? is YES>
   THEN discuss student loan
        clear database
        finish

7. IF <client working? is NO> and <client student? is NO>
   THEN refuse loan
        clear database
        finish
Imagine our client is working and earns £7500. Given the contents of the database, the following sequence occurs:

1. Rule 1 fires since the condition matches a fact in the database. The user answers YES to the question, instantiating the variable WORKING to YES. This adds the fact <client working? is YES> to the database, replacing the fact <client working? is unknown>.
   - Database contents after rule 1 fires: <client working? is YES> <client student? is unknown> <salary is unknown> <AMOUNT REQUESTED is £2000>
2. Rule 2 fires, instantiating the variable SALARY to the value given by the user. This adds this fact to the database, as above.
   - Database contents after rule 2 fires: <client working? is YES> <client student? is unknown> <salary is £7500> <AMOUNT REQUESTED is £2000>
3. Rule 4 fires since the value of SALARY is less than five times the value of AMOUNT REQUESTED. This results in an instruction to grant a loan of SALARY/5, that is £1500. The system then clears the database to the default values and finishes.
This particular system is very simple and no conflicts can occur. It is assumed that the interpreter examines the rule base from the beginning each time. To summarize, we can consider production systems against our metrics:

- expressiveness: production systems are particularly good at representing procedural knowledge. They are ideal in situations where knowledge changes over time and where the final and initial states differ from user to user (or subject to subject). The approach relies on an understanding of the concept of a working memory, which sometimes causes confusion. The modularity of the representation aids clarity in use: each rule is an independent chunk of knowledge, and modification of one rule does not interfere with others.
- effectiveness: new information is generated using operators to change the contents of working memory. The approach is very amenable to computation.
- efficiency: the scheme is relatively efficient for procedural problems, and their flexibility makes it transferable between domains. The use of features from human problem solving (such as short-term memory) means that the scheme may not be the most efficient. However, to counter this, these features make it a candidate for modelling human problem solving.
- explicitness: production systems can be programmed to provide explanations for their decisions by tracing back through the rules that are applied to reach the solution.
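To show how the interpreter's cycle of matching, selecting and acting might be driven in practice, here is a minimal PROLOG sketch of a production system, using dynamic facts as the working memory. It implements only the "first matching rule wins" conflict resolution strategy and a few of the loan rules above; the predicate names (fact, rule, run and so on) are ours, chosen purely for illustration.

% Working memory: fact/1 terms asserted into the PROLOG database.
:- dynamic fact/1.

% Initial (default) working memory for one client.
init :-
    retractall(fact(_)),
    assertz(fact(working(unknown))),
    assertz(fact(amount_requested(2000))).

% rule(Name, Condition, Action): condition-action pairs.
rule(ask_working,
     fact(working(unknown)),
     ( write('Are you working? (yes./no.) '), read(W),
       retract(fact(working(unknown))), assertz(fact(working(W))) )).

rule(ask_salary,
     ( fact(working(yes)), \+ fact(salary(_)) ),
     ( write('What is your salary? '), read(S), assertz(fact(salary(S))) )).

rule(small_loan,
     ( fact(salary(S)), fact(amount_requested(A)), S < 5 * A ),
     ( L is S / 5, write(grant_loan(L)), nl, retractall(fact(_)) )).

% The interpreter: take the first rule whose condition holds against
% working memory, perform its action, and repeat until nothing matches.
run :-
    rule(Name, Condition, Action),
    call(Condition), !,
    write(firing(Name)), nl,
    call(Action),
    run.
run :-
    write('no rule applies - stopping'), nl.

% ?- init, run.

Replacing the cut in the first run clause with a different selection strategy (for example, collecting all matching rules and choosing the most specific) would give an alternative conflict resolution scheme without touching the rules themselves, which is exactly the modularity claimed above.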
1.7 Network representations
Network representations capture knowledge as a graph, in which nodes represent objects or concepts and arcs represent relationships or associations. Relationships can be domain specific or generic (see below for examples). Networks support property inheritance. An object or concept may be a member of a class, and is assumed to have the same attribute values as the parent class (unless alternative values override). Classes can also have subclasses that inherit properties in a similar way. For example, the parent class may be Dog, which has attributes such as has tail, barks and has four legs. A subclass of that parent class may be a particular breed, say Great Dane, which consequently inherits all the attributes above, as well as having its own attributes (such as tall). A particular member (or instance) of this subclass, that is a particular Great Dane, may have additional attributes such as colour. Property inheritance is overridden where a class member or subclass has an explicit alternative value for an attribute. For example, Rottweiler may be a subclass of the parent class Dog, but may have the attribute has no tail. Alternatives may also be given at the instance level: Rottweiler as a class may inherit the property has tail but a particular dog, whose tail has been docked, may have the value has no tail overriding the inherited property.

Semantic networks are an example of a network representation. A semantic network illustrating property inheritance is given below. It includes two generic relationships that support property inheritance: is-a indicating class inclusion (subclass) and instance indicating class membership.
Figure 1.5 A fragment of a semantic network.
Property inheritance supports inference, in that we can derive facts about an object by considering the parent classes. For example, in the Dog network in Figure 1.5, we can derive the facts that a Great Dane has a tail and is carnivorous from the facts that a dog has a tail and a canine is carnivorous respectively. Note, however, that we cannot derive the fact that a Basenji can bark since we have an alternative value associated with Basenji. Note also how the network links together information from different domains (dogs and cartoons) by association. Network representations are useful where an object or concept is associated with many attributes and where relationships between objects are important. Considering them against our metrics for knowledge representation schemes:
- expressiveness: they allow representation of facts and relationships between facts. The levels of the hierarchy provide a mechanism for representing general and specific knowledge. The representation is a model of human memory, and it is therefore relatively understandable.
- effectiveness: they support inference through property inheritance. They can also be easily represented using PROLOG and other AI languages, making them amenable to computation.
- efficiency: they reduce the size of the knowledge base, since knowledge is stored only at its highest level of abstraction rather than for every instance or example of a class. They help maintain consistency in the knowledge base, because high-level properties are inherited by subclasses and not added for each subclass.
- explicitness: reasoning equates to following paths through the network, so the relationships and inference are explicit in the network links.
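The inheritance lookup itself is easy to sketch. The Python fragment below follows the dog example: an attribute value is sought on the node itself first, then up the parent links (which stand for both the instance and is-a relationships), so an override such as the Basenji's stored low in the hierarchy wins over the inherited default. The node and attribute names are illustrative.

```python
# A sketch of property inheritance over a small semantic network.
# "parent" stands for either an instance or an is-a link.

network = {
    "canine":     {"parent": None,         "attrs": {"carnivorous": True}},
    "dog":        {"parent": "canine",     "attrs": {"has tail": True, "barks": True}},
    "great dane": {"parent": "dog",        "attrs": {"tall": True}},
    "basenji":    {"parent": "dog",        "attrs": {"barks": False}},   # overrides the default
    "fido":       {"parent": "great dane", "attrs": {"colour": "grey"}}, # an instance
}

def lookup(node, attribute):
    """Walk up the hierarchy until a value for the attribute is found."""
    while node is not None:
        attrs = network[node]["attrs"]
        if attribute in attrs:
            return attrs[attribute]
        node = network[node]["parent"]
    return None   # not known anywhere in the hierarchy

print(lookup("fido", "has tail"))      # True  - inherited from dog
print(lookup("fido", "carnivorous"))   # True  - inherited from canine
print(lookup("basenji", "barks"))      # False - the local override wins
```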
1.8
Structured representations
In structured representations information is organized into more complex knowledge structures. Slots in the structure represent attributes into which values can be placed. These values are either specific to a particular instance or default values, which represent stereotypical information. Structured representations can capture complex situations or objects, for example eating a meal in a restaurant or the content of a hotel room. Such structures can be linked together as networks, giving property inheritance. Frames and scripts are the most common types of structured representation.
1.8.1 Frames
Frames are knowledge structures that represent expected or stereotypical information about an object (Minsky 1975). For example, imagine a supermarket. If you have visited one or two you will have certain expectations as to what you will find there. These may include aisles of shelves, freezer banks and check-out tills. Some information will vary from supermarket to supermarket, for example the number of tills. This type of information can be stored in a network of frames where each frame comprises a number of slots with appropriate values. A section of a frame network on supermarkets is shown in Figure 1.6. In summary, frames extend semantic networks to include structured, hierarchical knowledge. Since they can be used with semantic networks, they share the benefits of these, as well as
SUPERMARKET
  Instance of: SHOP
  Location: out of town
  Comprises: check-out till line, shelving aisles, freezer banks

CHECK-OUT TILL LINE
  Location: supermarket exit
  Number: 50
  Comprises: check-out stations, trolley parks

CHECK-OUT STATION
  Type: EPOS
  Use: billing & payment
  Comprises: till, chair, conveyor belt

Figure 1.6 Frame representation of supermarket.
- expressiveness: they allow representation of structured knowledge and procedural knowledge. The additional structure increases clarity.
- effectiveness: actions or operations can be associated with a slot and performed, for example, whenever the value for that slot is changed. Such procedures are called demons.
- efficiency: they allow more complex knowledge to be captured efficiently.
- explicitness: the additional structure makes the relative importance of particular objects and concepts explicit.
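A small sketch of a frame with inherited default values and a demon is given below, in Python. The slot names loosely follow the supermarket example; the demon itself, which simply reports a change to the tills slot, is invented for illustration.

```python
# A sketch of a frame with slots, inherited defaults, and a demon: a procedure
# attached to a slot that runs whenever that slot's value is changed.

class Frame:
    def __init__(self, name, parent=None, **slots):
        self.name, self.parent = name, parent
        self.slots, self.demons = dict(slots), {}

    def get(self, slot):
        if slot in self.slots:
            return self.slots[slot]
        return self.parent.get(slot) if self.parent else None   # inherit the default

    def set(self, slot, value):
        self.slots[slot] = value
        if slot in self.demons:
            self.demons[slot](self, value)                       # fire the demon

shop = Frame("SHOP", location="high street")
supermarket = Frame("SUPERMARKET", parent=shop, location="out of town", tills=50)
supermarket.demons["tills"] = lambda f, v: print(f"{f.name}: till count is now {v}")

print(supermarket.get("location"))   # its own value: out of town
supermarket.set("tills", 48)         # changing the slot triggers the demon
```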
1.8.2
Scripts
A script, like a frame, is a structure used to represent a stereotypical situation (Schank & Abelson 1977). It also contains slots that can be filled with appropriate values. However, where a frame typically represents knowledge of objects and concepts, scripts represent knowledge of events. They were originally proposed as a means of providing contextual information to support natural language understanding (see Ch. 7). Consider the following description:
Alison and Brian went to the supermarket. When they had got everything on their list they went home.
Although it is not explicitly stated in this description, we are likely to infer that Alison and Brian paid for their selections before leaving. We might also be able to fill in more details about their shopping trip: that they had a trolley and walked around the supermarket, that they selected their own purchases, that their list contained the items that they wished to buy. All of this can be inferred from our general knowledge concerning supermarkets, our expectations as to what is likely to happen at one. Our assumptions about Alison and Brian's experience of
shopping would be very different if the word supermarket was replaced by corner shop. It is this type of stereotypical knowledge that scripts attempt to capture, with the aim of allowing a computer to make similar inferences about incomplete stories to those we were able to make above. Schank and colleagues developed a number of programs during the 1970s and 1980s that used scripts to answer questions about stories (Schank & Abelson 1977). The script would describe likely action sequences and provide the contextual information to understand the stories. A script comprises a number of elements:
- entry conditions: these are the conditions that must be true for the script to be activated.
- results: these are the facts that are true when the script ends.
- props: these are the objects that are involved in the events described in the script.
- roles: these are the expected actions of the major participants in the events described in the script.
- scenes: these are the sequences of events that take place.
- tracks: these represent variations on the general theme or pattern of the script.
For example, a script for going to a supermarket might store the following information:
Entry conditions: supermarket open, shopper needs goods, shopper has money
Result: shopper has goods, supermarket has less stock, supermarket has more money
Props: trolleys, goods, check-out tills
Roles: shopper collects food, assistant checks out food and takes money, store manager orders new stock
Scenes: selecting goods, checking out goods, paying for goods, packing goods
Tracks: customer packs bag, assistant packs bag
Scripts have been useful in natural language understanding in restricted domains. Problems arise when the knowledge required to interpret a story is not domain specific but general, "common-sense" knowledge. Charniak (1972) used children's stories to illustrate just how much knowledge is required to interpret even simple descriptions. For example, consider the following excerpt about exchanging unwanted gifts:
Alison and Brian received two toasters at their engagement party, so they took one back to the shop.
To interpret this we need to know about toasters and why, under normal circumstances, one wouldn't want two; we also need to know about shops and their normal exchange policies. In addition, we need to know about engagements and the tradition of giving gifts on such occasions. But the situation is more complicated than it appears. If instead of toasters Alison and Brian had received two gift vouchers, two books or two £20 notes they would not have needed to exchange them. So the rule that one doesn't want two of something only applies to certain items. Such information is not specific to engagements: the same would be true of birthday presents, wedding presents or Christmas presents. So in which script do we store such information? This is indicative of a basic problem of AI, which is how to provide the computer with the general, interpretative knowledge that we glean from our experience, as well as the specific factual knowledge about a particular domain. We will consider this problem in the next section. Scripts are designed for representing knowledge in a particular context. Within this context the method is expressive and effective, except as we have seen in representing general knowledge, but it is limited in wider application. Similarly, it provides an efficient and explicit mechanism for capturing complex structured information within its limited domain.
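As a rough illustration of how a script can be used, the sketch below stores the supermarket script above as a plain data structure and "fills in" the scenes that a story does not mention, which is essentially the inference we made about Alison and Brian. The matching is deliberately naive and the function name is invented; real script-based programs did far more sophisticated matching.

```python
# The supermarket script as a data structure, plus a tiny inference that
# assumes every scene of the script happened even if the story omits it.

supermarket_script = {
    "entry conditions": ["supermarket open", "shopper needs goods", "shopper has money"],
    "results": ["shopper has goods", "supermarket has less stock", "supermarket has more money"],
    "props": ["trolleys", "goods", "check-out tills"],
    "roles": ["shopper", "assistant", "store manager"],
    "scenes": ["selecting goods", "checking out goods", "paying for goods", "packing goods"],
    "tracks": ["customer packs bag", "assistant packs bag"],
}

def fill_in(story_events, script):
    """Return the script scenes the story never mentioned explicitly."""
    mentioned = set(story_events)
    return [scene for scene in script["scenes"] if scene not in mentioned]

story = ["selecting goods"]   # "they got everything on their list and went home"
print("Probably also happened:", fill_in(story, supermarket_script))
```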
1.9
General knowledge
Most knowledge-based systems are effective in a restricted domain only because they do not have access to the deep, common knowledge that we use daily to interpret our world. Few AI projects have attempted to provide such general knowledge, the CYC project begun at MCC, Texas, by Doug Lenat (Lenat & Guha 1990), being a notable exception. The CYC project aims to build a knowledge base containing the millions of pieces of common knowledge that humans possess. It is a ten-year project involving many people, meticulously encoding the type of facts that are "obvious" to us, facts at the level of "all men are people" and "children are always younger than their parents". To us, expressing such facts seems ludicrous; for the computer they need to be represented explicitly. The project investigates whether it is possible to represent such common-sense knowledge effectively in a knowledge base, and also considers the problems of building and maintaining large-scale knowledge bases. Its critics claim that it is a waste of time and money, since such knowledge can only be gained by experience, for example the experiences children have through play. However, CYC can derive new knowledge from the facts provided, effectively learning and generalizing from its, albeit artificial, experience.
1.10
The frame problem
Throughout this chapter we have been looking at knowledge representation schemes that allow us to represent a problem at a particular point in time: a particular state. However, as we will see in subsequent chapters, representation schemes have to be able to represent sequences of problem states for use in search and planning. Imagine the problem of moving an automatic fork lift truck around a factory floor. In order to do this we need to represent knowledge about the layout of the factory and the position of the truck, together with information dictating how the truck can move (perhaps it can only move if its forks are raised above the ground). However, as soon as the truck makes one movement, the knowledge has changed and a new state has to be represented. Of course, not all the knowledge has changed; some facts, such as the position of the factory walls, are likely to remain the same. The problem of representing the facts that alter from state to state as well as those that remain the same is the essence of the frame problem (McCarthy & Hayes 1969). In some situations, where keeping track of the sequence of states is important, it is infeasible to simply store the whole state each time - doing so will soon use up memory. So it is necessary to store information about what does and does not change from state to state. In some situations even deciding what changes is not an easy problem. In our factory we may describe bricks as being on a pallet which in turn is by the door:
on(pallet, bricks)
by(door, pallet)
If we move the pallet then we infer that the bricks also move but that the door does not. So in this case at least the relationship on implies no change but by does imply a change (the pallet is no longer by the door). A number of solutions have been proposed to the frame problem. One approach is to include specific frame axioms which describe the parts that do not change when an operator is applied to move from state to state. So, for example, the system above would include the axiom
on(x, y, s1) ∧ move(x, s1, s2) → on(x, y, s2)
to specify that when an object, y, is on object x in state s1, then if the operation move is applied to move x to state s2 then object y is still on object x in the new state. Frame axioms are a useful way of making change explicit, but become extremely unwieldy in complex domains. An alternative solution is to describe the initial state and then change the state description as rules are applied. This means that the representation is always up-to-date. Such a solution is fine until the system needs to backtrack in order to explore another solution. Then there is nothing to indicate what should be done to undo the changes. Instead we could maintain the initial description but store changes each time an operator is applied. This makes backtracking easy
since information as to what has been changed is immediately available, but it is again a complex solution. A compromise solution is to change the initial state description but also store information as to how to undo the change. There is no ideal solution to the frame problem, but these issues should be considered both in selecting a knowledge representation scheme and in choosing appropriate search strategies. We will look at search in more detail in Chapter 3.
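The compromise solution just described can be sketched very simply: apply an operator by editing the state description, but keep a record of exactly which facts were added and deleted so the change can be undone on backtracking. The operator and facts below are illustrative, and the representation resembles the add/delete lists used by STRIPS-style planners rather than the frame axioms above.

```python
# Apply an operator destructively but keep an undo record, so that
# backtracking can restore the previous state description.

state = {"on(pallet, bricks)", "by(door, pallet)"}

def apply(state, add, delete):
    """Change the state in place and return what was actually added/deleted."""
    added   = {f for f in add if f not in state}
    deleted = {f for f in delete if f in state}
    state |= added
    state -= deleted
    return (added, deleted)

def undo(state, record):
    added, deleted = record
    state -= added
    state |= deleted

# move the pallet away from the door: the "by" fact changes, the "on" fact does not
record = apply(state, add={"by(window, pallet)"}, delete={"by(door, pallet)"})
print(state)      # the bricks are still on the pallet
undo(state, record)
print(state)      # back to the initial description
```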
1.11
Knowledge elicitation
All knowledge representation depends upon knowledge elicitation to get the appropriate information from the source (often human) to the knowledge base. Knowledge elicitation is the bottleneck of knowledge-based technology. It is difficult, time consuming and imprecise. This is because it depends upon the expert providing the right information, without missing anything out. This in turn often depends upon the person trying to elicit the knowledge (the knowledge engineer) asking the expert the right questions in an area that he or she may know little about. To illustrate the magnitude of the knowledge elicitation problem, think of a subject that you know something about (perhaps a hobby, a sport, a form of art or literature, a skill). Try to write down everything you know about the subject. Even more enlightening, get a friend who is not expert in the subject to question you about it, and provide answers to the questions. You will soon find that it is difficult to be precise and exhaustive in this type of activity. A number of techniques have been proposed to help alleviate the problem of knowledge elicitation. These include structured interview techniques, knowledge elicitation tools and the use of machine-learning techniques that learn concepts from examples. The latter can be used to identify key features in examples which characterize a concept. We will look in more detail at knowledge elicitation when we consider expert systems in Chapter 6.
1.12
Summary
In this chapter we have seen the importance of an appropriate knowledge representation scheme and how we can assess potential schemes according to their expressiveness, effectiveness, efficiency and explicitness. We have considered four key representation schemes - logic, production rules, network representations and structured representations - looking at examples of each and their strengths and weaknesses. We have looked at the problems of representing general knowledge and changing knowledge. Finally, we have touched on the problem of knowledge elicitation, which we will return to in Chapter 6.
1.13
Exercises
1. UK law forbids marriage between certain relatives (for example, parents and children, brothers and sisters) but allows it between others (for example, first cousins). Use a logic formalism to represent your knowledge about UK (or your own country's) marriage laws.
2. A pet shop would like to implement an expert system to advise customers on suitable pets for their circumstances. Write a production system to incorporate the following information (your system should elicit the information it needs from the customer).
A budgie is suitable for small homes (including city flats) where all the members of the family are out during the day. It is not appropriate for those with a fear of birds or who have a cat.
A guinea pig is suitable for homes with a small garden where the occupants are out all day. It is particularly appropriate for children. However, it will require regular cleaning of the cage.
A cat is suitable for most homes except high-rise flats, although the house should not be on a main road. It does not require exercise. Some people are allergic to cats.
A dog is suitable for homes with a garden or a park nearby. It is not suitable if all occupants are out all day. It will require regular exercise and grooming.
3. Construct a script for a train journey. (You can use a natural language representation but you should clearly indicate the script elements.)
4. Working in pairs, one of you should take the role of expert, the other of knowledge engineer. The expert should suggest a topic in which he or she is expert and the knowledge engineer should ask questions of the expert to elicit information on this topic. The expert should answer as precisely as possible. The knowledge engineer should record all the answers given. When enough information has been gathered, choose an appropriate representation scheme and formalize this knowledge.
1.14
Recommended further reading
Ringland, G. A. & D. A. Duce 1988. Approaches to knowledge representation: an introduction. Chichester: John Wiley. Explains the issues of knowledge representation in more detail than is possible here: a good next step.
Brachman, R. J., H. J. Levesque, R. Reiter (eds) 1992. Knowledge representation. Cambridge, MA: MIT Press. A collection of papers that makes a good follow-on from the above and covers more recent research into representation for symbolic reasoning.
Bobrow, D. G. & A. Collins (eds) 1975. Representation and understanding: studies in cognitive science. New York: Academic Press. A collection of important papers on knowledge representation.
Chapter Two
Reasoning
2.1
Overview
Reasoning is the ability to use knowledge to draw new conclusions about the world. Without it we are simply recalling stored information. There are a number of different types of reasoning, including induction, abduction and deduction. In this chapter we consider methods for reasoning when our knowledge is unreliable or incomplete. We also look at how we can use previous experience to reason about current problems.
2.2
What is reasoning?
Mention of reasoning probably brings to mind logic puzzles or "whodunit" thrillers, but it is something that we do every day of our lives. Reasoning is the process by which we use the knowledge we have to draw conclusions or infer something new about a domain of interest. It is a necessary part of what we call "intelligence": without the ability to reason we are doing little more than a lookup when we use information. In fact this is the difference between a standard database system and a knowledge-based or expert system. Both have information that can be accessed in various ways but the database, unlike the expert system, has no reasoning facilities and can therefore answer only limited, specific questions. Think for a moment about the types of reasoning you use. How do you know what to expect when you go on a train journey? What do you think when your friend is annoyed with you? How do you know what will happen if your car has a flat battery? Whether you are aware of it or not, you will use a number of different methods of reasoning depending on the problem you are considering and the information that you have before you. The three everyday situations mentioned above illustrate three key types of reasoning that we use. In the first case you know what to expect on a train journey because of your experience of numerous other train journeys: you infer that the new journey will share common features with the examples you are aware of.
This is induction, which can be summarized as generalization from cases seen to infer information about cases unseen. We use it frequently in learning about the world around us. For example, every crow we see is black; therefore we infer that all crows are black. If you think about it, such reasoning is unreliable: we can never prove our inferences to be true, we can only prove them to be false. Take the crows again. To prove that all crows are black we would have to confirm that all crows that exist, have existed or will exist are black. This is obviously not possible. However, to disprove the statement, all we need is to produce a single crow that is white or pink. So at best we can amass evidence to support our belief that all crows are black. In spite of its unreliability, inductive reasoning is very useful and is the basis of much of our learning. It is used particularly in machine learning, which we will meet in Chapter 4. The second example we suggested was working out why a friend is annoyed with you, in other words trying to find an explanation for your friend's behaviour. It may be that this particular friend is a stickler for punctuality and you are a few minutes late to your rendezvous. You may therefore infer that your friend's anger is caused by your lateness. This uses abduction, the process of reasoning back from something to the state or event that caused it. Of course this too is unreliable; it may be that your friend is angry for another reason (perhaps you had promised to telephone but had forgotten). Abduction can be used in cases where the knowledge is incomplete, for example where it is not possible to use deductive reasoning (see below). Abduction can provide a "best guess" given the evidence available. The third problem is usually solved by deduction: you have knowledge about cars such as "if the battery is flat the headlights won't work"; you know the battery is flat so you can infer that the lights won't work. This is the reasoning of standard logic. Indeed, we could express our car problem in terms of logic: given that a = the battery is flat and b = the lights won't work, and the axioms
∀x: a(x) → b(x)
a(my car)
we can deduce b(my car). Note, however, that we cannot deduce the inverse: that is, if we know b(my car) we cannot deduce a(my car). This is not permitted in standard logic, but is of course another example of abduction. If our lights don't work we may use abduction to derive this explanation. However, it could be wrong; there may be another explanation for the light failure (for example, a bulb may have blown). Deduction is probably the most familiar form of explicit reasoning. Most of us at some point have been tried with syllogisms about Aristotle's mortality and the like. It can be defined as the process of deriving the logically necessary conclusion from the initial premises. So, for example,
Elephants are bigger than dogs
Dogs are bigger than mice
Therefore Elephants are bigger than mice.
However, it should be noted that deduction is concerned with logical validity, not actual truth. Consider the following example; given the facts, can we reach the conclusion by deduction?
Some dogs are greyhounds
Some greyhounds run fast
Therefore Some dogs run fast.
The answer is no. We cannot make this deduction because we do not know that all greyhounds are dogs. The fast greyhounds may therefore be the greyhounds that are not dogs. This of course is nonsensical in terms of what we know (or more accurately have induced) about the real world, but it is perfectly valid based on the premises given. You should therefore be aware that deduction does not always correspond to natural human reasoning.
2.3
Forward and backward reasoning
As well as coming in different "flavours", reasoning can progress in one of two directions: forwards to the goal or backwards from the goal. Both are used in AI in different circumstances. Forward reasoning (also referred to as forward chaining, data-driven reasoning, bottom-up or antecedent-driven) begins with known facts and attempts to move towards the desired goal. Backward reasoning (backward chaining, goal-driven reasoning, top-down, consequent-driven or hypothesis-driven) begins with the goal and sets up subgoals which must be solved in order to solve the main goal. Imagine you hear that a man bearing your family name died intestate a hundred years ago and that solicitors are looking for descendants. There are two ways in which you could determine if you are related to the dead man. First, follow through your family tree from yourself to see if he appears. Secondly, trace his family tree to see if it includes you. The first is an example of forward reasoning, the second backward reasoning. In order to decide which method to use, we need to consider the number of start and goal states (move from the smaller to the larger - the more states there are the easier it is to find one) and the number of possibilities that need to be considered at each stage (the fewer the better). In the above example there is one start state and one goal state (unless you
are related to the dead man more than once), so this does not help us. However, if you use forward reasoning there will be two possibilities to consider from each node (each person will have two parents), whereas with backward reasoning there may be many more (even today the average number of children is 2.4; at the beginning of the century it was far more). In general, backward reasoning is most applicable in situations where a goal or hypothesis can be easily generated (for example, in mathematics or medicine), and where problem data must be acquired by the solver (for example, a doctor asking for vital signs information in order to prove or disprove a hypothesis). Forward reasoning, on the other hand, is useful where most of the data is given in the problem statement but where the goal is unknown or where there are a large number of possible goals. For example, a system which analyzes geological data in order to determine which minerals are present falls into this category.
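The two directions of reasoning can be contrasted in a short sketch. In the Python below, rules are (premises, conclusion) pairs; forward chaining adds conclusions until nothing new appears, while backward chaining starts from the goal and recursively tries to establish the premises. The rules and facts are invented, and the backward chainer assumes the rule set contains no cycles.

```python
# Forward (data-driven) versus backward (goal-driven) chaining over the
# same rule base. Rules are (set_of_premises, conclusion) pairs.

rules = [
    ({"battery flat"}, "lights fail"),
    ({"lights fail", "dark"}, "cannot drive"),
]

def forward(facts, rules):
    """Keep applying rules until no new conclusion can be added."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

def backward(goal, facts, rules):
    """Prove the goal by recursively proving the premises of some rule."""
    if goal in facts:
        return True
    return any(all(backward(p, facts, rules) for p in premises)
               for premises, conclusion in rules if conclusion == goal)

print(forward({"battery flat", "dark"}, rules))                      # derives everything it can
print(backward("cannot drive", {"battery flat", "dark"}, rules))     # True
```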
2.4
Reasoning with uncertainty
In Chapter 1 we looked at knowledge and considered how different knowledge representation schemes allow us to reason. Recall, for example, that standard logics allow us to infer new information from the facts and rules that we have. Such reasoning is useful in that it allows us to store and utilize information efficiently (we do not have to store everything). However, such reasoning assumes that the knowledge available is complete (or can be inferred) and correct, and that it is consistent. Knowledge added to such systems never makes previous knowledge invalid. Each new piece of information simply adds to the knowledge. This is called monotonic reasoning. Monotonic reasoning can be useful in complex knowledge bases since it is not necessary to check consistency when adding knowledge or to store information relating to the truth of knowledge. It therefore saves time and storage. However, if knowledge is incomplete or changing an alternative reasoning system is required. There are a number of ways of dealing with uncertainty. We will consider four of them briefly:
- non-monotonic reasoning
- probabilistic reasoning
- reasoning with certainty factors
- fuzzy reasoning
2.4.1
Non-monotonic reasoning
In a non-monotonic reasoning system new information can be added that will cause the deletion or alteration of existing knowledge. For example, imagine you have invited someone round for dinner. In the absence of any other information you may make an assumption that your guest eats meat and will like chicken. Later you discover that the guest is in fact a vegetarian and the inference that your guest likes chicken becomes invalid. We have already met two non-monotonic reasoning systems: abduction and property inheritance (see Ch. 1). Recall that abduction involves inferring some information on the basis of current evidence. This may be changed if new evidence comes to light, which is a characteristic of non-monotonic reasoning. So, for example, we might infer that a child who has spots has measles. However, if evidence comes to light to refute this assumption (for example, that the spots are yellow and not red), then we replace the inference with another. Property inheritance is also non-monotonic. An instance or subclass will inherit the characteristics of the parent class, unless it has alternative or conflicting values for that characteristic. So, as we saw in Chapter 1, we know that dogs bark and that Rottweilers and Basenjis are dogs. However, we also know that Basenjis don't bark. We can therefore infer that Rottweilers bark (since they are dogs and we have no evidence to think otherwise) but we cannot infer that Basenjis do, since the evidence refutes it. A third non-monotonic reasoning system is the truth maintenance system or TMS (Doyle 1979). In a TMS the truth or falsity of all facts is maintained. Each piece of knowledge is given a support list (SL) of other items that support (or refute) belief in it. Each piece of knowledge is labelled for reference, and an item can be supported either by another item being true (+) or being false (-). Take, for example, a simple system to determine the weather conditions:
(1) It is winter (SL ()())
(2) It is cold (SL (1+)(3-))
(3) It is warm
Statement (1) does not depend on anything else: it is a fact. Statement (2) depends on statement (1) being true and statement (3) being false. It is not known at this point what statement (3) depends on. It has no support list. Therefore we could assume that "it is cold" since we know that "it is winter" is true (it is a fact) and we have no information to suggest that it is warm (we can therefore assume that this is false). However, if "it is warm" becomes true, then "it is cold" will become false. In this way the TMS maintains the validity and currency of the information held.
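A support list is easy to model. In the sketch below the three weather statements are numbered as above; a statement is believed if everything on the "+" part of its list is believed and everything on the "-" part is not, and a statement with no support list at all is not believed. Adding a support list for "it is warm" is enough to withdraw belief in "it is cold". This is only a toy: a real TMS also records justifications so that beliefs can be restored when the evidence changes again.

```python
# A toy truth maintenance check based on support lists.
# Each statement maps to (believed_items, disbelieved_items); None means
# the statement has no support list yet and so cannot be believed.

support = {
    1: ([], []),        # "It is winter" - a fact, needs no support
    2: ([1], [3]),      # "It is cold"   - needs 1 believed and 3 disbelieved
    3: (None, None),    # "It is warm"   - no support list yet
}

def believed(n):
    plus, minus = support[n]
    if plus is None:
        return False
    return all(believed(p) for p in plus) and not any(believed(m) for m in minus)

print(believed(2))      # True: it is winter, and nothing supports "it is warm"
support[3] = ([], [])   # new evidence: "It is warm" becomes a fact
print(believed(2))      # False: belief in "It is cold" is withdrawn
```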
2.4.2
Probabilistic reasoning
Probabilistic reasoning is required to deal with incomplete data. In many situations we need to make decisions based on the likelihood of particular events, given the knowledge we have. We can use probability to determine the most likely cause. Simple probability deals with independent events. If we know the probability of event A occurring (call it p(A)) and the probability of event B occurring (p(B)), the probability that both will occur (p(AB)) is calculated as p(A) * p(B). For example, consider an ordinary pack of 52 playing cards, shuffled well. If I select a card at random, what is the likelihood of it being the king of diamonds? If we take event A to be the card being a diamond and event B to be the card being a king, we can calculate the probability as follows:
p(A) = 13/52 = 0.25 (there are 13 diamonds)
p(B) = 4/52 = 0.077 (there are four kings)
p(AB) = 13/52 * 4/52 = 1/52 ≈ 0.0192 (there is one king of diamonds)
However, if two events are interdependent and the outcome of one affects the outcome of the other, then we need to consider conditional probability. Given the probability of event A (p(A)) and that of a second event B which depends on it, p(B | A) (B given A), the probability of both occurring is p(A) * p(B | A). So, returning to our pack of cards, imagine I take two cards. What is the probability that they are both diamonds? Again, event A is the first card being a diamond, but this time event B is the second card also being a diamond:
p(A) = 13/52 = 0.25 (there are 13 diamonds)
p(B | A) = 12/51 = 0.235 (there are 12 diamonds left and 51 cards)
p(AB) = 0.25 * 0.235 = 0.058
This is the basis of Bayes theorem and several probabilistic reasoning systems. Bayes theorem calculates the probabilities of particular "causes" given observed "effects". The theorem is as follows:
p(hi | e) = p(e | hi) p(hi) / Σ (j = 1 to n) p(e | hj) p(hj)
where
p(hi | e) is the probability that the hypothesis hi is true given the evidence e
p(hi) is the probability that hi is true in the absence of specific evidence
p(e | hi) is the probability that evidence e will be observed if hypothesis hi is true
n is the number of hypotheses being considered.
For example, a doctor wants to determine the likelihood of particular causes, based on the evidence that a patient has a headache. The doctor has two hypotheses, a common cold (h1) and meningitis (h2), and one piece of evidence, the headache (e), and wants to know the probability of the patient having a cold. Suppose the probability of the doctor seeing a patient with a cold, p(h1), is 0.2 and the probability of seeing someone with meningitis, p(h2), is 0.000001. Suppose also that the probability of a patient having a headache with a cold, p(e | h1), is 0.8 and the probability of a patient having a headache with meningitis, p(e | h2), is 0.9. Using Bayes theorem we can see that the probability that the patient has a cold is very high:
p(h1 | e) = (0.8 x 0.2) / ((0.8 x 0.2) + (0.9 x 0.000001)) = 0.16 / 0.1600009 ≈ 0.999994
In reality, of course, the cost of misdiagnosis of meningitis is also very high, and therefore many more factors would have to be taken into account. Bayes theorem was used in the expert system, PROSPECTOR (Duda et al. 1979), to find mineral deposits. The aim was to determine the likelihood of finding a specific mineral by observing the geological features of an area. PROSPECTOR has been used to find several commercially significant mineral deposits. In spite of such successful uses, Bayes theorem makes certain assumptions that make it intractable in many domains. First, it assumes that statistical data on the relationships between evidence and hypotheses are known, which is often not the case. Secondly, it assumes that the relationships between evidence and hypotheses are all independent. In spite of these limitations Bayes theorem has been used as the base for a number of probabilistic reasoning systems, including certainty factors, which we will consider next.
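The calculation above is short enough to check by machine. The sketch below applies Bayes theorem to the same priors and likelihoods as the cold/meningitis example; the function name and list layout are my own.

```python
# Bayes theorem over a list of hypotheses, using the doctor's figures.

def bayes(priors, likelihoods, i):
    """p(h_i | e) given prior p(h_j) and likelihood p(e | h_j) for each hypothesis."""
    total = sum(p * l for p, l in zip(priors, likelihoods))
    return priors[i] * likelihoods[i] / total

priors      = [0.2, 0.000001]   # p(cold), p(meningitis)
likelihoods = [0.8, 0.9]        # p(headache | cold), p(headache | meningitis)

print(bayes(priors, likelihoods, 0))   # ~0.999994: almost certainly a cold
```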
2.4.3
Certainty factors
As we have seen, Bayesian reasoning assumes information is available regarding the statistical probabilities of certain events occurring. This makes it difficult to operate in many domains. Certainty factors are a compromise on pure Bayesian reasoning. The approach has been used successfully, most notably in the MYCIN expert system (Shortliffe 1976). MYCIN is a medical diagnosis system that diagnoses bacterial infections of the blood and prescribes drugs for treatment. Its knowledge is represented in rule form and each rule has an associated certainty factor. For example, a MYCIN rule looks something like this:
If (a) the gram stain of the organism is gram negative
and (b) the morphology of the organism is rod
and (c) the aerobicity of the organism is anaerobic
then there is suggestive evidence (0.5) that the identity of the organism is Bacteroides.
In this system, each hypothesis is given a certainty factor (CF) by the expert providing the rules, based on his or her assessment of the evidence. A CF takes a value between 1 and -1, where values approaching -1 indicate that the evidence against the hypothesis is strong, and those approaching 1 show that the evidence for the hypothesis is strong. A value of 0 indicates that no evidence for or against the hypothesis is available. A CF is calculated as the amount of belief in a hypothesis given the evidence (MB(h | e)) minus the amount of disbelief (MD(h | e)). The measures are assigned to each rule by the experts providing the knowledge for the system as an indication of the reliability of the rule. Measures of belief and disbelief take values between 0 and 1. Certainty factors can be combined in various ways if there are several pieces of evidence. For example, evidence from two sources can be combined to produce a CF as follows:
CF(h | e1, e2) = MB(h | e1, e2) - MD(h | e1, e2)
where
MB(h | e1, e2) = MB(h | e1) + MB(h | e2)[1 - MB(h | e1)] (or 0 if MD(h | e1, e2) = 1)
and
MD(h | e1, e2) = MD(h | e1) + MD(h | e2)[1 - MD(h | e1)] (or 0 if MB(h | e1, e2) = 1)
The easiest way to understand how this works is to consider a simple example. Imagine that we observe the fact that the air feels moist (e1). There may be a number of reasons for this (rain, snow, fog). We may hypothesize that it is foggy, with a measure of belief (MB(h | e1)) in this being the correct hypothesis of 0.5. Our disbelief in the hypothesis given the evidence (MD(h | e1)) will be low, say 0.1 (it may be dry and foggy but it is unlikely). The certainty factor for this hypothesis is then calculated as
CF(h | e1) = MB(h | e1) - MD(h | e1) = 0.5 - 0.1 = 0.4
We then make a second observation, e2, that visibility is poor, which confirms our hypothesis that it is foggy, with MB(h | e2) of 0.7. Our disbelief in the hypothesis given this new evidence is 0.0 (poor visibility is a characteristic of fog). The certainty factor for it being foggy given this evidence is
CF(h | e2) = MB(h | e2) - MD(h | e2) = 0.7 - 0.0 = 0.7
However, if we combine these two pieces of evidence we get an increase in the overall certainty factor:
MB(h | e1, e2) = 0.5 + (0.7 * 0.5) = 0.85
MD(h | e1, e2) = 0.1 + (0.0 * 0.9) = 0.1
CF(h | e1, e2) = 0.85 - 0.1 = 0.75
Certainty factors provide a mechanism for reasoning with uncertainty that does not require probabilities. Measures of belief and disbelief reflect the expert's assessment of the evidence rather than statistical values. This makes the certainty factors method more tractable as a method of reasoning. Its use in MYCIN shows that it can be successful, at least within a clearly defined domain.
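The combination rule is easily expressed in code. The sketch below reproduces the fog example; the helper name is invented, and the special case "or 0 if the opposing combined measure is 1" is omitted for brevity.

```python
# Combining measures of belief (or of disbelief) in the same hypothesis,
# then taking the certainty factor as belief minus disbelief.

def combine(m1, m2):
    """m1 + m2 * (1 - m1): the incremental combination used for MB and MD."""
    return m1 + m2 * (1 - m1)

mb1, md1 = 0.5, 0.1    # evidence e1: the air feels moist
mb2, md2 = 0.7, 0.0    # evidence e2: visibility is poor

mb = combine(mb1, mb2)             # 0.85
md = combine(md1, md2)             # 0.1
print("CF =", round(mb - md, 2))   # 0.75, higher than either piece of evidence alone
```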
2.4.4
Fuzzy reasoning
Probabilistic reasoning and reasoning with certainty factors deal with uncertainty using principles from probability to extend the scope of standard logics. An alternative approach is to change the properties of logic itself. Fuzzy sets and fuzzy logic do just that. In classical set theory an item, say a, is either a member of set A or it is not. So a meal at a restaurant is either expensive or not expensive and a value must be provided to delimit set membership. Clearly, however, this is not the way we think in real life. While some sets are clearly defined (a piece of fruit is either an orange or not an orange), many sets are not. Qualities such as size, speed and price are relative. We talk of things being very expensive or quite small. Fuzzy set theory extends classical set theory to include the notion of degree of set membership. Each item is associated with a value between 0 and 1, where 0 indicates that it is not a member of the set and 1 that it is definitely a member. Values in between indicate a certain degree of set membership. For example, although you may agree with the inclusion of Porsche and BMW in the set FastCar, you may wish to indicate that one is faster than the other. This is possible in fuzzy set theory:
{(Porsche 944, 0.9), (BMW 316, 0.5), (Vauxhall Nova 1.2, 0.1)}
Here the second value in each pair is the degree of set membership. Fuzzy logic is similar in that it attaches a measure of truth to facts. A predicate, P, is given a value between 0 and 1 (as in fuzzy sets). So, taking an element from our fuzzy set, we may have a predicate
fastcar(Porsche 944) = 0.9
Standard logic operators, such as and, or and not, can be applied in fuzzy logic and are interpreted as follows:
P ∧ Q = min(P, Q)
P ∨ Q = max(P, Q)
not P = 1 - P
So, for example, we can combine predicates and get new measures:
fastcar(Porsche 944) = 0.9
pretentiouscar(Porsche 944) = 0.6
fastcar(Porsche 944) ∧ pretentiouscar(Porsche 944) = 0.6
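These operators translate directly into code. The sketch below stores the FastCar set as a dictionary of membership degrees and applies the min/max/complement interpretation of and, or and not; the function names are invented.

```python
# Fuzzy set membership and the min/max interpretation of the logic operators.

fast = {"Porsche 944": 0.9, "BMW 316": 0.5, "Vauxhall Nova 1.2": 0.1}
pretentious = {"Porsche 944": 0.6}

def f_and(p, q): return min(p, q)
def f_or(p, q):  return max(p, q)
def f_not(p):    return 1 - p

car = "Porsche 944"
print(f_and(fast[car], pretentious[car]))   # 0.6: fast AND pretentious
print(f_or(fast[car], pretentious[car]))    # 0.9
print(f_not(fast[car]))                     # approximately 0.1
```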
2.4.5
Reasoning by analogy
Analogy is a common tool in human reasoning (Hall 1989). Given a novel problem, we might compare it with a familiar problem and note the similarities. We might then apply our knowledge of the old problem to solving the new. This approach is effective if the problems are comparable and the solutions transferable. Analogy has been applied in AI in two ways: transformational analogy and derivational analogy. Transformational analogy involves using the solution to an old problem to find a solution to a new. Reasoning can be viewed as a state space search where the old solution is the start state and operators are used (employing means-ends analysis, for example) to transform this solution into a new solution. An alternative to this is derivational analogy, where not only the old solution but the process of reaching it is considered in solving the new problem. A history of the problem-solving process is used. Where a step in the procedure is valid for the new problem, it is retained; otherwise it is discarded. The solution is therefore not a copy of the previous solution but a variation of it.
2.4.6
Case-based reasoning
A method of reasoning which exploits the principle of analogy is case-based reasoning (CBR). All the examples (called cases in CBR) are remembered in a case base. When a new situation is encountered, it is compared with all the known cases and the best match is found. If the match is exact, then the system can perform exactly the response suggested by the example. If the match is not exact, the differences between the actual situation and the case are used to derive a suitable response (see Fig. 2.1). Where there is an exact match, the CBR is acting as a rote learning system, but where there is no exact match, the combination of case selection and comparison is a form of generalization. The simplest form of CBR system may just classify the new situation, a form of concept learning. In this case, the performance of the system is determined solely by the case selection algorithm. In a more
[The figure shows a new situation being matched against the case base; the chosen case is then adapted to give the system's behaviour.]
Figure 2.1 Case-based reasoning.
complicated system, the response may be some form of desired action depending on the encountered situation. The case base consists of examples of stimulus-action pairs, and the comparison stage then has to decide how to modify the action stored with the selected case. This step may involve various forms of reasoning. Imagine we have the following situation:
situation: buy(fishmonger,cod), owner(fishmonger,Fred), cost(cod,£3)
The case base selects the following best match case:
stimulus: buy(postoffice,stamp), owner(postoffice,Dilys), cost(stamp,25p)
action: pay(Dilys,25p)
The comparison yields the following differences:
fishmonger → postoffice, cod → stamp, Fred → Dilys, £3 → 25p
The action is then modified correspondingly to give "pay(Fred,£3)". In this example, the comparison and associated modification is based on simple substitution of corresponding values. However, the appropriate action may not be so simple. For example, consider a blocks-world CBR (Fig. 2.2). The situation is:
situation: blue(A), pyramid(A), on(A,table), green(B), cube(B), on(B,table), blue(C), ball(C), on(C,B)
The CBR has retrieved the following case:
stimulus: blue(X), pyramid(X), on(X,table), green(Y), cube(Y), on(Y,table), blue(Z), cube(Z), on(Z,table)
action: move(X,Y)
[The figure shows the retrieved case alongside the current situation in the blocks world.]
Figure 2.2 Modifying cases.
A simple pattern match would see that the action only involves the first two objects, X and Y, and the situation concerning these two objects is virtually identical. So, the obvious response is "move(A,B)". However, a more detailed analysis would show that moving the blue pyramid onto the green cube is not possible because, in the current situation, the blue ball is on it. A more sophisticated difference procedure could infer that a more appropriate response would be: move(C,table), move(A,B). Note how the comparison must be able to distinguish irrelevant differences such as ball(C) vs. cube(Z) from significant ones such as on(C,B) vs. on(Z,table). This is also a problem for the selection algorithm. In practice there may be many attributes describing a situation, only a few of which are really important. If selection is based on a simple measure such as "least number of different attributes", then the system may choose "best match" cases where all the irrelevant attributes match, but none of the relevant ones! At the very least some sort of weighting is needed. For example, if one were developing a fault diagnosis system for a photocopier, the attributes would include the error code displayed, the number of copies required, the paper type, whether the automatic feeder was being used, and so on. However, one would probably give the error code a higher weighting than the rest of the attributes. Where the comparison yields differences which invalidate the response given in the case and no repair is possible, the CBR can try another close match case. So, a good selection mechanism is important, but some poor matches can be corrected. Case-based reasoning has some important advantages. The most important is that it has an obvious and clear explanation for the user: "the current situation is similar to this given case and so I did a similar response". Indeed, one option is to do no comparison at all, simply to present the user with similar cases and allow the user to do the comparison between the current situation and the selected cases. Arguably, because the human does the "intelligent" part, this is not really CBR but simply a case memory, a sort of database. Another advantage of CBR is that it is not difficult to incorporate partial descriptions, in both the cases and the presented situations. This is because it is fairly easy to generalize measures of similarity to cases where some of the attributes
are missing or unknown. For example, we could score +1 for each matching attribute, -1 for each non-match and 0 for any attributes that are missing from either the case or situation (weighted of course!). This is an important feature of CBR, as it is often the case that records are incomplete. For example, if we start to build a CBR based on past medical records, we will find that many symptoms are unrecorded - the doctor would not have taken the heart rate of someone with a skin complaint. Other reasoning methods can deal with such problems, but not so simply as CBR.
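A weighted, missing-value-tolerant match of the kind just described can be sketched as follows. The photocopier attributes, weights and cases are invented; the scoring is the +1/-1/0 scheme above, with each attribute scaled by its weight.

```python
# Weighted case selection that tolerates missing attributes: matches score
# +weight, mismatches -weight, and unknown attributes on either side score 0.

def score(situation, case, weights):
    total = 0
    for attr, weight in weights.items():
        if attr not in situation or attr not in case:
            continue                       # missing on either side: score 0
        total += weight if situation[attr] == case[attr] else -weight
    return total

def best_case(situation, cases, weights):
    return max(cases, key=lambda c: score(situation, c["stimulus"], weights))

weights = {"error code": 5, "paper type": 1, "feeder used": 1}
cases = [
    {"stimulus": {"error code": "E3", "paper type": "A4"},   "action": "clear jam"},
    {"stimulus": {"error code": "E7", "feeder used": True},  "action": "refill toner"},
]
situation = {"error code": "E3", "feeder used": True}    # paper type unrecorded
print(best_case(situation, cases, weights)["action"])    # clear jam
```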
2.5
Summary
In this chapter we have considered a number of different types of reasoning, including induction, abduction and deduction. We have seen that the knowledge that we are reasoning about is often incomplete and therefore demands reasoning methods that can deal with uncertainty. We have considered four approaches to reasoning with uncertainty: non-monotonic reasoning, probabilistic reasoning, reasoning with certainty factors and fuzzy reasoning. We have considered analogical reasoning and case-based reasoning.
2.6
Exercises
1. Distinguish between deductive, inductive and abductive reasoning, giving an example of the appropriate use of each.
2. Alison is trying to determine the cause of overheating in Brian's car. She has two theories: a leak in the radiator or a broken thermostat. She knows that leaky radiators are more common than broken thermostats: she estimates that 10% of cars have a leaky radiator while 2% have a faulty thermostat. However, 90% of cars with a broken thermostat overheat whereas only 30% overheat with a leaky radiator. Use Bayes theorem to advise Alison of the most likely cause of the problem.
3. Alison then checks the water level in Brian's car and notices it is normal.
She knows that a car with a leaky radiator is very unlikely not to lose water (perhaps 1% chance), whereas water loss is not seen in 95% of cases of faulty thermostats. How would this new evidence affect your advice to Alison? (Use Bayes theorem again and assume for simplicity that all evidence is independent.)
2.7
Recommended further reading
Shafer, G. & J. Pearl (eds) 1990. Readings in uncertain reasoning. Los Altos: Morgan Kaufmann. A collection of articles which provide a useful introduction to reasoning with uncertainty.
Riesbeck, C. K. & R. Schank 1989. Inside case-based reasoning. Hillsdale, NJ: Lawrence Erlbaum. A useful introduction to case-based reasoning.
Chapter Three
Search
3.1
Introduction
When we want to solve a problem, we consider various alternatives, some of which fail to solve the problem. Of those that succeed, we may want to find the best solution, or the easiest to perform. The act of enumerating possibilities and deciding between them is search. AI systems must search through sets of possible actions or solutions, and this chapter discusses some of the algorithms that are used. Before we go on to consider specific algorithms, we need to look at the sorts of problems that we are likely to face, as the appropriate algorithm depends on the form of the problem. The set of possible solutions is not just an amorphous bag, but typically has some structure. This structure also influences the choice of search algorithm.
3.1.1 Types of problem
State and path
In some problems we are only interested in the state representing the solution, whereas in other cases we also want to know how we got to the solution - the path. A crossword puzzle is an example of the former: the important thing is that the crossword is eventually completed; the order in which the clues were solved is only of interest to the real crossword fanatic. The eight queens problem and solving magic squares are similar problems (see Fig. 3.1). Typically with pure state-finding problems the goal state is described by some properties. In the case of the magic square, the states are the set of all 3 x 3 squares filled in with numbers between 1 and 9, and the property is that each row, column and diagonal adds up to 15. Mathematical theorem proving has always been a driving force in AI. If we consider this, we see that it is not only important that we solve the required theorem, but that the steps we take are recorded - that is, the proof. Other path problems include various route-finding problems, puzzles such as the Towers of Hanoi (Fig. 3.2) and algorithms for planning actions such as means-ends analysis (Ch. 9). In all these problems we know precisely what the goal state is to be; it is
A magic square is a square of numbers where each row, column and diagonal adds up to the same number. Usually the numbers have to be consecutive. So, for example, the 3 x 3 square would contain the numbers 1, 2, 3, 4, 5, 6, 7, 8 and 9. Here are some examples, one 3 x 3 square and one 4 x 4 square:
[Two example squares are shown: a 3 x 3 square using the numbers 1-9, in which every row, column and diagonal sums to 15, and a 4 x 4 square using 1-16, in which every line sums to 34.]
9
The 3 x 3 square is the simplest. (There are no 2 x 2 squares and the l x l square
[Tj is rather boring!)
So, when we talk about the "magic squares" problem in this chapter, we will always mean finding 3 x 3 squares. The eight queens problem is another classic placing problem. In this case we must position eight queens on a chess board so that no queen is attacking another. That is, so that no queen is on the same row, column or diagonal as any other. There are similar problems with smaller numbers of queens on smaller chess boards: for example, a solution of the four queens problem is:
Figure 3.1 Magic squares and the eight queens problem.
only the means of getting there that is required. The solution to such problems must include not just a single goal state, but instead a sequence of states visited and the moves made between them. In some problems the moves are implicit from the sequence of states visited and can hence be omitted. In fact, some route problems do not specify their goal state in advance. For example, we may want to find the fastest route from Zuata, Venezuela, to any international airport with direct flights to Sydney, Australia. In this case we want to find a route (sequence of places) where the goal state is a city that satisfies the property P(s) = " s has an international airport with direct flights to Sydney"
46
INTRODUCTION In a monastery in deepest Tibet there are three crystal columns and 64 golden rings. The rings are different sizes and rest over the columns. At the beginning of time all the rings rested on the leftmost column, and since then the monks have toiled ceaselessly moving the rings one by one between the columns. It is said that when all the rings lie on the centre column the world will end in a clap of thunder and the monks can at last rest. The monks must obey two rules: 1. They may move only one ring at a time between columns. 2. No ring may rest on top of a smaller ring. With 64 rings to move, the world will not end for some time yet. However, we can do the same puzzle with fewer rings. So, with three rings the following would be a legal move:
A IJ.—» l I I ✓ But this would not be:
In the examples in this chapter, we will consider the even simpler case of two rings!
Figure 3.2 Towers of Hanoi.
In fact, the travelling salesman problem is more complex again. Imagine a salesman has to visit a number of towns. He must plan a route that visits each town exactly once and that begins and ends at his home town. He wants the route to be as short as possible. Although the final state is given (the same as the start state), the important property is one of the whole path, namely that each place is visited exactly once. It would be no good to find a route which reached the goal state by going nowhere at all! The last chapter was all about the importance of the choice of representation. In this example, it may well be best to regard the travelling salesman problem as a state problem where the state is a path! Any solution or best solution When finding a proof to a theorem (path problem), or solving the magic square or eight queens problem (state problems), all we are interested in is finding some
47
SEARCH
solution - any one will do so long as it satisfies the required conditions (although some proofs may be more elegant than others). However, if we consider the travelling salesman problem, we now want to find the shortest route. Similarly, we may want to choose a colouring for a map that uses the fewest colours (to reduce the costs of printing), or simply be looking for the shortest path between two places. In each of these examples, we are not only interested in finding a solution that satisfies some property, we are after the best solution - that is, search is an optimization problem. The definition of best depends on the problem. It may mean making some measure as big as possible (for example, profit), or making something as small as possible (for example, costs). As profits can be seen as negative costs (or vice versa), we can choose whichever direction is easiest or whichever is normal for a particular problem type. For a state problem such as map colouring, the costs are associated with the solution obtained, whereas in a path problem it is a combination of the "goodness" of the final solution and the cost of the path: total cost = cost(route) —ben efit(goal state) However, one finds that for many path problems there is no second term; that is, all goal states are considered equally good. In general, the specification of a problem includes both a property (or con straints), which must be satisfied by the goal state (and path), and some cost measure. A state (and path) that satisfies the constraints is said to be em feasible and a feasible state that has the least cost is optimal. That is, real problems are a mixture of finding any solution (feasibility) and finding the best (optimality). However, for simplicity, the examples within this chapter fall into one camp or the other. Where constraints exist in optimization problems, they are often satisfied " by construction". For example, a constraint on map-colouring problems is that adjacent countries have different colours. Rather than constructing a colouring and then checking this condition, one can simply ensure as one adds each colour that the constraint is met. Deterministic vs. adversarial All the problems considered so far have been deterministic, that is totally under the control of the problem solver. However, some of the driving problems of AI have been to do with game playing: chess, backgammon and even simple noughts and crosses (tic-tac-toe). The presence of an adversary radically changes the search problem: as the solver tries to get to the best solution (that is, win), the adversary is trying to stop it! Most games are state based: although it is interesting to look back over the history of a game, it is the state of the chess board now that matters. However, there are some path-oriented games as well, for example bridge or poker, where the player needs to remember all past moves, both of other players and their own, in order to choose the next move. Interaction with the physical environment can be seen as a form of game 48
playing also. As the solver attempts to perform actions in the real world, new knowledge is found and circumstances may occur to help or hinder. If one takes a pessimistic viewpoint, one can think of the world as an adversary which, in the worst case, plays to one's downfall. (Readers of Thomas Hardy will be familiar with this world view!) A further feature in both game playing and real-world interaction is chance. Whereas chess depends solely on the abilities of the two players, a game like backgammon also depends on the chance outcome of the throwing of dice. Similarly, we may know that certain real-world phenomena are very unlikely and should not be given too great a prominence in our decision making. This chapter will only deal with deterministic search. Chapter 5 will deal with game playing and adversarial search. Perfect vs. good enough Finally, we must consider whether our problem demands the absolutely best solution or whether we can make do with a "good enough" solution. If we are looking for the best route from Cape Town to Addis Ababa, we are unlikely to quibble about the odd few miles. This behaviour is typical of human problem solving and is called satisficing. Satisficing can significantly reduce the resources needed to solve a problem, and when the problem size grows may be the only way of getting a solution at all. There is a parallel to satisficing when we are simply seeking any solution. In such cases, we may be satisfied with a system that replies YES - here is your solution NO - there is no solution SORRY - I'm not sure In practice theorem provers are like this. In most domains, not only is it very expensive to find proofs for all theorems, it may be fundamentally impossible. (Basically, Godel showed that in sufficiently powerful systems (like the numbers) there are always things that are true yet which can never be proved to be true (Kilmister 1967).)
3.1.2
Structuring the search space
Generate and test - combinatorial explosion

The simplest form of search is generate and test. You list each possible candidate solution in turn and check to see if it satisfies the constraints. You can either stop when you reach an acceptable goal state or, if you are after the best solution, keep track of the best so far until you get to the end. Figure 3.3 shows this algorithm applied to the 3 x 3 magic square.
Figure 3.3 Generate and test - finding solutions to the magic square.
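To make the algorithm concrete, here is a minimal sketch of generate and test for the 3 x 3 magic square in Python. The 9-tuple representation (reading order) and the helper names are choices made for this sketch rather than anything from the text.

```python
from itertools import permutations

# Generate and test: enumerate every arrangement of the digits 1-9 and
# keep those whose rows, columns and diagonals all sum to 15.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def is_magic(square):
    return all(square[a] + square[b] + square[c] == 15 for a, b, c in LINES)

solutions = [s for s in permutations(range(1, 10)) if is_magic(s)]
print(len(solutions))   # 8 - the single magic square in its eight symmetries
print(solutions[0])     # the first solution met in lexicographic order
```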
However, this is an extremely inefficient way to look for a solution. If one examines the solutions in the lexicographic order (as in the figure), the first solution is found only after rejecting 75 231 candidates. In fact, the whole search space consists of 9! = 362 880 possible squares of which only eight satisfy the goal conditions - and that is after we have been careful not to generate squares with repeated digits! This problem is called combinatorial explosion and occurs whenever there are a large number of nearly independent parameters. In practice, only the most ill-structured problems require this sledge-hammer treatment. One can structure most problems to make the search space far more tractable.

Trees

The first square in Figure 3.3 fails because 1 + 2 + 3 ≠ 15. So does the second square, the third ... in fact the first 720 squares all fail for exactly the same reason. Then the next 720 fail because 1 + 2 + 4 ≠ 15, etc. In each case, you do not need to look at the full square: the partial state is sufficient to fail it. The space of potential magic squares can be organized into a tree, where the leaf nodes are completed squares (all 362 880 of them), and the internal nodes are partial solutions starting off at the top left-hand corner. Figure 3.4 shows part of this search tree. The advantage of such a representation is that one can instantly ignore all nodes under the one starting 1 2 3, as all of these will fail. There are 504 possible first lines, of which only 52 add up to 15 (the first being 1 5 9). That is, of 504 partial solutions we only need to consider 52 of them further - an instant reduction by a factor of 10. Of course, each of the subtrees under those 52 will be able to be similarly pruned - the gains compound. There are many ways to organize the tree. Instead of doing it in reading order, we could have filled out the first column first, or the bottom right, and so on. However, some organizations are better than others. Imagine we had built the tree so that the third level of partial solution got us to partial solutions like the following:
? 2 7
1 ? 7
? ? 3
Figure 3.4 Magic square - search tree of potential solutions.
Clearly, we would not be able to prune the tree so rapidly. Choosing the best organization for a particular problem is somewhat of an art, but there are general guidelines. In particular, you want to be able to test constraints as soon as possible.

Branching factor and depth

We can roughly characterize a tree by the number of children each node has - the branching factor - and the distance from the root of the tree to the leaves (bottom nodes) - the depth. The tree for magic squares has a branching factor of 9 at the root (corresponding to the nine possible entries at the top left), and a depth of 9 (the number of entries in the square). However, the branching factor reduces as one goes down the tree: at the second level it is 8, at the third level 7, and so on. For a game of chess, the branching factor is 20 for the first move (two possibilities for each pawn and four knight moves). For Go, played on a 19 x 19 board, the branching factor is 361! For a uniform tree, if the branching factor is b there are b^n nodes at level n. That is, over 10 billion possibilities for the first four moves in Go - you can see why Go-playing computers aren't very good!
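The arithmetic is easy to check; a two-line sketch (the figures are purely illustrative of the b^n growth):

```python
def nodes_at_level(b, n):
    # b^n nodes at level n of a uniform tree with branching factor b
    return b ** n

print(nodes_at_level(20, 4))    # chess-like branching: 160 000
print(nodes_at_level(361, 4))   # Go: about 1.7 x 10^10 - "over 10 billion"
```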
Figure 3.5 Towers of Hanoi: graph of possible states and moves.
Graphs

When one considers a problem consisting of states with moves between them, it is often the case that several move sequences get one from a particular start state to the same final state. That is, the collection of moves and states can be best thought of as a directed graph, where the nodes of the graph are the states and the arcs between nodes are the moves. Figure 3.5 shows the complete graph of states of the Towers of Hanoi (with only two rings!). Notice how even such a simple puzzle has a reasonably complex graph. With the Towers of Hanoi, each arc is bidirectional, because each move between two states can be undone by a move in the reverse direction. This is not always so, for example when a piece is taken in chess; if the nodes represented states while making a cake, there would be no move backwards once the cake was cooked. When the arcs are directional, we can distinguish between the forward branching factor and the backward branching factor of a node.
When a number followed by "°C" is received, the system should multiply the number by 9/5 and add 32. Note how this is not a simple arithmetic rule. The system would have to learn that different formulae should be used depending on whether the stimulus included "°C" or "°F". In fact, in most of the
learning algorithms we will discuss, the rules learnt will be symbolic rather than numeric. However, one should not underestimate the importance of rote learning. After all, the ability to remember vast amounts of information is one of the advantages of using a computer, and it is especially powerful when combined with other techniques. For example, heuristic evaluation functions are often expensive to compute; during a search the same node in the search tree may be visited several times and the heuristic evaluation wastefully recomputed. Where sufficient memory is available a rote learning technique called memorizing can help. The first time a node is visited the computed value can be remembered. When the node is revisited this value is used instead of recomputing the function. Thus the search proceeds faster and more complex (and costly) evaluation functions can be used.
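A minimal sketch of such a cache in Python; expensive_evaluation and the tuple-based state are stand-ins for whatever heuristic and representation the search actually uses.

```python
def expensive_evaluation(state):
    # stand-in for a costly heuristic; imagine a deep analysis here
    return sum(state)

cache = {}

def evaluated(state):
    """Return the heuristic value, computing it at most once per state."""
    if state not in cache:
        cache[state] = expensive_evaluation(state)
    return cache[state]

print(evaluated((1, 2, 3)))   # computed and stored
print(evaluated((1, 2, 3)))   # looked up, not recomputed
```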
4.3.3
Inputs to training
In Figure 4.3, we identified two inputs to the training process: the training set and existing knowledge. Most of the learning algorithms we will describe are heavily example based; however, pure deductive learning (Section 4.4) uses no examples and only makes use of existing knowledge. There is a continuum (Fig. 4.4) between knowledge-rich methods that use extensive domain knowledge and those that use only simple domain-independent knowledge. The latter is often implicit in the algorithms; for example, inductive learning is based on the knowledge that if something happens a lot it is likely to be generally true.
knowledge-rich                                                              knowledge-poor
deductive learning   explanation-based learning   case-based reasoning   inductive learning   neural networks
Figure 4.4 Knowledge continuum.
Where examples are being used it is important to know what the source is. The examples may be simply measurements from the world, for example transcripts of grand master chess tournaments. If so, do they represent "typical" sets of behaviour or have they been filtered to be "representative"? If the former is true then we can infer information about the relative probability from the frequency in the training set. However, unfiltered data may also be noisy, have errors, etc., and examples from the world may not be complete, since infrequent situations may simply not be in the training set. Alternatively, the examples may have been generated by a teacher. In this case
we can assume that they are a helpful set, covering all the important cases and including near miss examples. Also, one can assume that the teacher will not be deliberately ambiguous or misleading. For example, a helpful teacher trying to teach a relationship between numbers would not give the example (2, 2, 4), as this might be multiplication or addition. Finally, the system itself may be able to generate examples by performing experiments on the world (for robots), asking an expert, or using an internal model of the world. We also have to decide on a representation for the examples. This may be partly determined by the context, but often we will have some choice. Often the choice of representation embodies quite a lot of the domain knowledge. A common representation is as a set of attribute values. For example, in Section 4.5.1, we will describe children's play tiles using four attributes: shape, colour, size and material. A particular example could be: triangle, blue, large, wood. In vision applications (see Ch. 8), the representation is often even cruder - simply a bitmap. On the other hand, more knowledge-rich learning often uses more expressive descriptions of the structure of the examples, using predicate logic or semantic nets.
4.3.4
Outputs of training
To a large extent the outputs of learning are determined by the application. What is it we want to do with our new knowledge? Many machine learning systems are classifiers. The examples they are given are from two or more classes, and the purpose of learning is to determine the common features in each class. When a new unseen example is presented, the system then uses the common features to determine in which class the new example belongs. The new knowledge is thus effectively in the form of rules such as

if example satisfies condition
then assign it to class X
In machine learning, this job of classification is often called concept learning (see Section 4.5.1). The simplest case is when there are only two classes, of which one is seen as the desired "concept" to be learnt and the other is everything else. In this case we talk about positive and negative examples of the concept. The "then" part of the rules is then always the same and so the learnt rule is simply a predicate describing the concept. The form of this predicate, or of the condition part of a more complex rule, varies between machine learning algorithms. In some it is an arbitrary logical predicate, but more commonly its form is much simpler. In Section 4.5.1 we will consider predicates that are of the form

attribute1 = value1 and attribute2 = value2 and ...
That is, conjunctions of simple tests on attributes. In Section 4.5.2 more complex predicates in the form of decision trees will be considered. We will see that there is a trade-off between the allowable set of rules and the complexity of the learning process. The desire for simple rules is determined partly by computational tractability, but also by the application of Occam's razor - always prefer simpler explanations: they are more likely to be right and more likely to generalize. Not all learning is simple classification. In applications such as robotics one wants to learn appropriate actions. In this case, the knowledge may be in terms of production rules or some similar representation. More complex rules also arise in theorem provers and planning systems. An important consideration for both the content and representation of learnt knowledge is the extent to which explanation may be required for future actions. In some cases the application is a black box. For example, in speech recognition, one would not ask for an explanation of why the system recognizes a particular word or not, one just wants it to work! However, as we shall see in Chapter 6, many applications require that the system can give a justification for decisions. Imagine you asked an expert system "is my aircraft design safe?" and it said "yes". Would you be happy? Probably not. Even worse, imagine you asked it to generate a design - it might do a very good job, but unless it could justify its decisions would you be happy? Because of this, the learnt rules must often be restricted to a form that is comprehensible to humans. This is another reason for having a bias towards simple rules.
4.3.5
The training process
As we noted, real learning involves some generalization from past experience and usually some coding of memories into a more compact form. Achieving this generalization requires some form of reasoning. In Chapter 2, we discussed the difference between deductive reasoning and inductive reasoning. This is often used as the primary distinction between machine learning algorithms. Deductive learning works on existing facts and knowledge and deduces new knowledge from the old. In contrast, inductive learning uses examples and generates hypotheses based on similarities between them. In addition, abductive reasoning may be used and also reasoning by analogy (see Ch. 2). Imagine we are analyzing road accidents. One report states that conditions were foggy, another that visibility was poor. With no deductive reasoning it would be impossible to see the similarity between these cases. However, a bit of deduction based on weather knowledge would enable us to reason that in both cases visibility was poor. Indeed, abductive reasoning would suggest that visibility being poor probably means that it was foggy anyway, so the two descriptions are in fact identical. However, using this sort of reasoning is expensive both during learning and because it is dependent on having coded much of the background knowledge. If learning is being used to reduce the
costs of knowledge elicitation, this is not acceptable. For this reason many machine learning systems depend largely on inductive reasoning based on simple attribute-value examples. One way of looking at the learning process is as search. One has a set of examples and a set of possible rules. The job of the learning algorithm is to find suitable rules that are correct with respect to the examples and existing knowledge. If the set of possible rules is finite, one could in principle exhaustively search to find the best rule. We will see later in this chapter that the sizes of the search spaces make this infeasible. We could use some of the generic search methods from Chapter 3. For example, genetic algorithms have been used for rule learning. In practice, the structure of rules suggests particular forms of the algorithms. For example, the version-space method (Section 4.5.1) can be seen as a special case of a branch and bound search. This exhaustive search works because the rules used by version spaces are very limited. Where the rule set is larger exhaustive search is not possible and the search must be extensively heuristic driven with little backtracking. For example, the inductive learning algorithm ID3 discussed in Section 4.5.2 will use an entropy-based heuristic.
4.4
Deductive learning
Deductive learning works on existing facts and knowledge and deduces new knowledge from the old. For example, assume you know that Alison is taller than Clarise and that Brian is taller than Alison. If asked whether Brian is taller than Clarise, you can use your knowledge to reason that he is. Now, if you remember this new fact and are asked again, you will not have to reason it out a second time, you will know it: you have learnt it. Arguably, deductive learning does not generate "new" knowledge at all, it simply memorizes the logical consequences of what you know already. However, by this argument virtually all of mathematical research would not be classed as learning "new" things. Note that, whether or not you regard this as new knowledge, it certainly can make a reasoning system more efficient. If there are many rules and facts, the search process to find out whether a given consequence is true or not can be very extensive. Memorizing previous results can save this time. Of course, simple memorization of past results would be a very crude form of learning, and real learning also includes generalization. Suppose a proof system has been asked to prove that 3 + 3 = 2 x 3. It reasons as follows:

3 + 3 = 1 x 3 + 1 x 3    (because for any number n, 1 x n = n)
      = (1 + 1) x 3      (distributivity of x)
      = 2 x 3
Although this looks trivial, a real proof system might find it quite difficult. The
step that uses the fact that 3 can be replaced by 1 x 3 is hardly an obvious one to use! Rather than simply remembering this result, the proof system can review the proof and try to generalize. One way to do this is simply to attempt to replace constants in the proof by variables. Replacing all the occurrences of "3" by a variable a gives the following proof:

a + a = 1 x a + 1 x a    (because for any number a, 1 x a = a)
      = (1 + 1) x a      (distributivity of x)
      = 2 x a
The proof did not depend on the particular value of 3; hence the system has learnt that in general a + a = 2 x a. The system might try other variables. For example, it might try replacing 2 with a variable to get 3 + 3 = b x 3, but would discover that for this generalization the proof fails. Hence, by studying the way it has used particular parts of a situation, the system can learn general rules. We will see further examples of deductive learning in Chapter 9, when we consider planning, and in Chapter 11, in the SOAR architecture. In this chapter, we will not look further at pure deductive learning, although explanation-based learning (Section 4.6) and case-based reasoning (Ch. 2) both involve elements of deductive learning.
4.5
Inductive learning
Rather than starting with existing knowledge, inductive learning takes examples and generalizes. For example, having seen many cats, all of which have tails, one might conclude that all cats have tails. This is of course a potentially unsound step of reasoning, and indeed Manx cats have no tails. However, it would be impossible to function without using induction to some extent. Indeed, in many areas it is an explicit assumption. Geologists talk about the "principle of uniformity" (things in the past work the same as they do now) and cosmologists assume that the same laws of physics apply throughout the universe. Without such assumptions it is never possible to move beyond one's initial knowledge - deductive learning can go a long way (as in mathematics) but is fundamentally limited. So, despite its potential for error, inductive reasoning is a useful technique and has been used as the basis of several successful systems. One major subclass of inductive learning is concept learning. This takes examples of a concept, say examples of fish, and tries to build a general description of the concept. Often the examples are described using simple attribute-value pairs. For example, consider the fish and non-fish in Table 4.1. There are various ways we can generalize from these examples of fish and non-fish. The simplest description (from the examples) is that a fish is something that does not have lungs. No other single attribute would serve to differentiate the fish. However, it is dangerous to opt for too simple a classification. From the first
              swims   has fins   flies   has lungs   is a fish
herring       yes     yes        no      no          ✓
cat           no      no         no      yes         X
pigeon        no      no         yes     yes         X
flying fish   yes     yes        yes     no          ✓
otter         yes     no         no      yes         X
cod           yes     yes        no      no          ✓
whale         yes     yes        no      yes         X

Table 4.1 Fish and non-fish.
four examples we might have been tempted to say that a fish was something that swims, but the otter shows that this is too general a description. Alternatively, we might use a more specific description. A fish is something that swims, has fins and has no lungs. However, being too specific also has its dangers. If we had not seen the example of the flying fish, we might have been tempted to say that a fish also did not fly. This trade-off between learning an overgeneral or overspecific concept is inherent in the problem. Notice also the importance of the choice of attributes. If the "has lungs" attribute were missing it would be impossible to tell that a whale was not a fish. The two inductive learning algorithms described in detail in this section - version spaces and ID3 - are examples of concept learning. However, inductive learning can also be used to learn plans and heuristics. The final part of this section will look at some of the problems of rule induction.
4.5.1
Version spaces
When considering the fish, we used our common sense to find the rule from the examples. In an AI setting we need an algorithm. This should take a set of examples such as those above and generate a rule to classify new unseen examples. We will look first at concept learning using version spaces, which uses examples to home in on a particular rule (Mitchell 1978). Consider again Table 4.1. Imagine we have only seen the first four examples so far. There are many different rules that could be used to classify the fish. A simple class of rules are those that consist of conjunctions of tests of attributes:

if attribute1 = value1 and attribute2 = value2 ...
then is a fish
Even if we restrict ourselves to these, there are seven different rules that correctly classify the fish in the first four examples:
R1. if swims = yes then is a fish
R2. if has fins = yes then is a fish
R3. if has lungs = no then is a fish
R4. if swims = yes and has fins = yes then is a fish
R5. if swims = yes and has lungs = no then is a fish
R6. if has fins = yes and has lungs = no then is a fish
R7. if swims = yes and has fins = yes and has lungs = no then is a fish
If we only had the first four examples, what rule should we use? Notice how rules R1 and R2 are more general than rule R4, which is in turn more general than R7. (By more general, one means that the rule is true more often.) One option is to choose the most specific rule that covers all the positive examples, in this case R7. Alternatively, we could look for the most general rule. Unfortunately, there is no single most general rule. The three rules R1, R2 and R3 are all "most" general in that there is no correct rule more general than them, but they are all "most" general in different ways. Figure 4.5 shows these rules as a lattice with the most general rules at the top and the most specific at the bottom.
Figure 4.5 Rule lattice (R1, R2 and R3 are the most general rules, R7 the most specific).
Further examples may restrict this set of possible rules further. If one takes the next example, the otter, it swims, but is not a fish. Therefore rule R1 can be removed from the set of candidate rules. This gives rise to an algorithm:

1. start off with the set of all rules
2. for each positive example p
   2.1. remove any rules which p doesn't satisfy
3. for each negative example n
   3.1. remove any rules which n does satisfy
4. if there are no rules left FAIL
5. when there is one rule left it is the result
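A small Python sketch of this enumerate-and-filter algorithm, run on the first four examples of Table 4.1. The "?" convention for an untested attribute and the attribute ordering are choices made for the sketch; with these four examples it finds exactly the seven rules R1-R7 listed above.

```python
from itertools import product

ATTRIBUTES = ("swims", "has fins", "flies", "has lungs")

# The first four examples of Table 4.1: attribute values and the class.
examples = [
    (("yes", "yes", "no",  "no"),  True),    # herring
    (("no",  "no",  "no",  "yes"), False),   # cat
    (("no",  "no",  "yes", "yes"), False),   # pigeon
    (("yes", "yes", "yes", "no"),  True),    # flying fish
]

def matches(rule, values):
    # '?' means the rule does not test that attribute
    return all(r == "?" or r == v for r, v in zip(rule, values))

# Every conjunctive rule: each attribute is untested ('?') or pinned to a
# value - (m + 1)^n rules in all (81 here).
all_rules = product(("?", "yes", "no"), repeat=len(ATTRIBUTES))

consistent = [r for r in all_rules
              if all(matches(r, values) == is_fish for values, is_fish in examples)]
print(len(consistent))   # 7 - the rules R1 to R7
```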
The only problem with this algorithm is that you have to keep track of all rules. If there are n attributes with m values each, then there are (m + 1)^n rules! Clearly this is infeasible for any realistic problem. Version spaces reduce this number by only keeping track of the most specific and most general rules: all the other possible rules lie somewhere between these. Positive examples change the set of most specific rules, forcing them to become more general in order to include the new examples. Negative examples change the set of the most general rules, forcing them to become more specific in order to exclude the new examples. In addition, because we are looking for a single final rule we can further prune the two sets. After a positive example we examine the set of most general rules (G) and remove any that are not above (more general than) any of those in the set of most specific rules (S). Similarly, after a negative example we can prune S to remove any which are not below some rule in G.

An example

Let's see how this would work when given the examples of tiles in Table 4.2. As a shorthand, rules will be represented by a tuple of the attributes they select. For example, the rule "if colour = blue and material = wood" is represented by the tuple (?,blue,?,wood). The question marks denote attributes which the rule doesn't test. The most general rule is (?,?,?,?), which doesn't care about any of the attributes.
       shape      colour   size    material
ex1    triangle   blue     large   wood       ✓
ex2    square     blue     small   wood       X
ex3    triangle   blue     small   plastic    ✓
ex4    triangle   green    large   plastic    X

Table 4.2 Example tiles.
After seeing the first example, the most specific rule is (triangle,blue,large,wood), which only matches ex1. The most general rule is (?,?,?,?), which matches anything. This is because we have not seen any negative examples yet and so cannot rule out anything. The state of the algorithm can thus be summarized:

set of most specific rules (S) = { (triangle,blue,large,wood) }
set of most general rules (G) = { (?,?,?,?) }
The second example is negative and so the set of most general rules must be modified to exclude it. However, the new most general rules should not contradict
the previous examples, and so only those that are more general than all those in S are allowed. This gives rise to a new state:

set of most specific rules (S) = { (triangle,blue,large,wood) }
set of most general rules (G) = { (triangle,?,?,?), (?,?,large,?), (?,?,?,wood) }
The third example is positive. It does not satisfy (triangle,blue,large,wood), so S is generalized (again consistent with G):

set of most specific rules (S) = { (triangle,blue,?,?) }
However, at this stage we can also use the pruning rules to remove the second two rules from G, as neither is more general than (triangle,blue,?,?):

set of most general rules (G) = { (triangle,?,?,?) }
Finally, we look at the fourth example, which is negative. It satisfies (triangle,?,?,?), so we must make G more specific. The only rule that is more specific than (triangle,?,?,?), but that is also more general than those in S, is (triangle,blue,?,?). Thus this becomes the new G. The set S is not changed by this new example.

set of most specific rules (S) = { (triangle,blue,?,?) }
set of most general rules (G) = { (triangle,blue,?,?) }
At this point S = G and so we can finish successfully - which is just as well as we have reached the end of our examples!

Different kinds of rules - bias

The version-space algorithm depends on being able to generate rules that are just a little more or less specific than a given rule. In fact, any class of rules which have a method of making them slightly more or less specific can be used, not just the simple conjunctions that we have dealt with so far. So, if an attribute has values that themselves have some form of generalization hierarchy, then this can be used in the algorithm. For example, assume the shape attribute has a hierarchy as in Figure 4.6. We can then generalize from two rules (circle,?,small,?) and (ellipse,?,small,?) to get (rounded,?,small,?). The rules can get even more complicated. With full boolean predicates generalization can be achieved by adding disjunctions or turning constants into variables; specialization by adding conjunctions or turning variables into constants. This sounds like a very general learning mechanism - but wait. If we allow more complicated rules, then the number of examples needed to learn those rules increases. If we are not careful, we end up with rules like

if new example = ex1 or new example = ex2 or ...

These are not only difficult to learn, but fairly useless - rote learning again.
Figure 4.6 Shape taxonomy (shapes divide into rectilinear - square, triangle, rectangle - and rounded - circle, ellipse).
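A sketch of how such a hierarchy can drive generalization. The parent table is an assumed encoding of the taxonomy in Figure 4.6, not code from the text.

```python
# Each value points to its parent in the shape taxonomy (assumed from Figure 4.6).
PARENT = {"square": "rectilinear", "triangle": "rectilinear",
          "rectangle": "rectilinear", "circle": "rounded",
          "ellipse": "rounded", "rectilinear": "shapes", "rounded": "shapes"}

def ancestors(value):
    chain = [value]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def generalize(a, b):
    """Most specific taxonomy node that covers both values ('?' if none)."""
    for node in ancestors(a):
        if node in ancestors(b):
            return node
    return "?"

print(generalize("circle", "ellipse"))   # rounded
print(generalize("circle", "square"))    # shapes
```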
A learning algorithm must have some bias - a tendency to choose certain types of rules rather than others. This reduces the set of possible rules, and in so doing makes the learning task both tractable and useful. Restricting the rules in the version-space method to conjunctions introduced just such a bias and so enabled the algorithm to learn. However, the downside of a bias is that it means that some sorts of rule cannot be learnt. In this case, we would not be able to learn rules of the form if shape = triangle or colour = blue
Noise and other problems

The version-space method has several problems. It is very sensitive to noise - if any wrong examples are given the algorithm will fail completely. It also demands a complete set of examples, in the sense that there must be exactly one rule that classifies them all. Finally, it is not well suited to multi-way classification (for example, sorting animals into fish/bird/mammal). One must effectively treat these as several yes/no distinctions.
4.5.2
ID3 and decision trees
Decision trees are another way of representing rules. For example, Figure 4.7 shows a decision tree for selecting all blue triangles. Imagine a tile coming in at the top of the tree. If it satisfies the condition at the top node it passes down the yes (Y) branch; if it doesn't it passes down the no (N) branch. It is passed down node by node until it comes to one of the leaves, which then classifies the tile. Several algorithms, of which the most well known is ID3 (Quinlan 1979), learn by building decision trees in a top-down fashion. Consider again the tiles in Table 4.2. We start off with the four examples and choose some condition to be the root of the tree, say "shape = triangle". Three of
Figure 4.7 Decision tree.
the tiles (ex1, ex3 and ex4) satisfy this, and one doesn't (ex2). The N branch has all negative examples, and so no further action is taken on that branch. The Y branch has a mixture of positive and negative examples, and so the same procedure is taken recursively (Fig. 4.8).
Figure 4.8 Starting to build a decision tree.
We now choose another condition for this branch, say "colour = blue". The three examples are sorted by this condition and now both branches have examples of one type. At this point we stop and label the leaves in the obvious manner (Fig. 4.9). A different choice of condition at the root would lead to a different tree. For example, if we had instead chosen "material = wood", we would get to the stage in Figure 4.10. This time both branches have mixed examples and we must build subtrees at each. If we chose the same condition "size = large" for each branch, we would end up with the decision tree in Figure 4.11. Note that this is not only a different tree from Figure 4.9, but also represents a completely different rule:
Figure 4.9 Completed tree.

Figure 4.10 Starting a different tree.

Figure 4.11 A different decision tree.
if material = wood and size = large
   or material ≠ wood and size ≠ large

as opposed to the original rule

if shape = triangle and colour = blue
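Written out as code the two trees are clearly different rules, even though both classify the four training tiles correctly. The function form and argument names are invented for this sketch.

```python
def classify_a(shape, colour, size, material):
    # the tree of Figure 4.9: root tests shape, then colour
    if shape == "triangle":
        return colour == "blue"
    return False

def classify_b(shape, colour, size, material):
    # the tree of Figure 4.11: root tests material, then size on both branches
    if material == "wood":
        return size == "large"
    return size != "large"

# They disagree on tiles that were not in the training set:
print(classify_a("circle", "red", "large", "wood"))   # False
print(classify_b("circle", "red", "large", "wood"))   # True
```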
How do we choose between these? Well, one way would be to find the smallest tree (or at least one of the smallest). Unfortunately, the number of trees is huge and so an exhaustive search would be impractical. Instead, the algorithms are careful about the choice of condition at each node and use a condition that looks as though it will lead to a good tree (but might not). This decision is usually based on the numbers of positive and negative examples that are sent to the Y and N branches. In Figure 4.12 these are tabulated for the two top-level conditions "shape = triangle" and "material = wood". In the first table, we see that the Y branch has two positive examples and one negative example giving three in total. The N branch has no positive examples and one negative example. In comparison the "material = wood" condition is very even handed with one positive and one negative example down each branch.
shape = triangle              material = wood
         Y     N                        Y     N
  ✓      2     0                 ✓      1     1
  X      1     1                 X      1     1

Figure 4.12 Contingency tables for different choices.
Of the two, the first is a more promising candidate as it makes the branches more uneven. Unevenness is important because we want the final tree to be very uneven - leaves must be either totally positive or totally negative. Indeed, one would expect a totally irrelevant attribute to give rise to an even split, as in the second table. Algorithms use different measures of this unevenness and use this to choose which condition to use at the node. ID3 uses an entropy-based measure. The entropy of a collection of probabilities p_i is given by

entropy = - Σ_i p_i x log2(p_i)
We calculate the entropy of each branch and then the average entropy (weighted by the number of examples sent down each branch). For example, take the "shape = triangle" table. The Y branch has entropy

- [2/3 x log2(2/3) + 1/3 x log2(1/3)] = 0.918
The N branch has entropy

- [0 x log2(0) + 1 x log2(1)] = 0

The average entropy is thus

3/4 x 0.918 + 1/4 x 0 = 0.689

(NB When calculating entropy one assumes that 0 x log2(0) = 0. This usually has to be treated as a special case to avoid an overflow error when calculating log2(0).) In contrast, the entropy of the "material = wood" decision is:

2/4 x - [0.5 x log2(0.5) + 0.5 x log2(0.5)] + 2/4 x - [0.5 x log2(0.5) + 0.5 x log2(0.5)] = - log2(0.5) = 1

Small values of entropy correspond to the greatest unevenness (the least disorder); hence the first decision would be chosen. The original ID3 algorithm did not use simple yes/no conditions at nodes; instead it chose an attribute and generated a branch for each possible value of the attribute. However, it was discovered that the entropy measure has a bias towards attributes with large numbers of values. Because of this, some subsequent systems used binary conditions at the nodes (as in the above examples). However, it is also possible to modify the entropy measure to reduce the bias. Other systems use completely different measures of unevenness similar to the χ² statistical test. In fact, the performance of decision tree inductive learning has been found to be remarkably independent of the actual choice of measure. As with the version-space method, decision tree building is susceptible to noise. If wrongly classified examples are given in training then the tree will have spurious branches and leaves to classify these. Two methods have been proposed to deal with this. The first is to stop the tree growing when no condition yields a suitable level of unevenness. The alternative is to grow a large tree that completely classifies the training set, and then to prune the tree, removing nodes that appear to be spurious. The second option has several advantages, as it allows one to use properties of the whole tree to assess a suitable cut-off point, and is the preferred option in most modern tree-building systems. The original ID3 algorithm only allowed splits based on attribute values. Subsequent algorithms have used a variety of conditions at the nodes, including tests of numerical attributes and set membership tests for attribute values. However, as the number of possible conditions increases, one again begins to hit computational problems in choosing even a single node condition. Set membership tests are particularly bad, as an attribute with m values gives rise to 2^(m-1) different possible set tests! The above description of decision tree learning used only a binary classification. However, it is easy to allow multi-way classification, and this is present in the original ID3 algorithm. The leaves of multi-way decision trees simply
record a particular classification rather than just accept/reject. During training, the measure of unevenness must be able to account for multiple classifications, but the entropy measure easily allows this.
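The entropy calculation is easy to reproduce. In the sketch below the contingency tables of Figure 4.12 are hard-coded as [positive, negative] counts for the Y and N branches; the c > 0 guard implements the 0 x log2(0) = 0 convention mentioned above.

```python
from math import log2

def entropy(counts):
    """Entropy of one branch, given [positive, negative] counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def average_entropy(split):
    """Entropy of a split, weighted by the number of examples per branch."""
    total = sum(sum(branch) for branch in split)
    return sum(sum(branch) / total * entropy(branch) for branch in split)

shape_is_triangle = [[2, 1], [0, 1]]   # Y: 2 pos, 1 neg; N: 0 pos, 1 neg
material_is_wood = [[1, 1], [1, 1]]

print(round(average_entropy(shape_is_triangle), 3))   # 0.689
print(round(average_entropy(material_is_wood), 3))    # 1.0, so shape is chosen
```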
4.5.3
Rule induction
In both the version-space method and decision tree induction, the rules that are learnt are of the form "if condition then classify". The training can see whether a rule works simply by seeing whether the response it gives matches the desired response - that is, it classifies correctly. However, in more complicated domains it is not so easy to see whether a particular rule is correct. A classic example is pole balancing (Fig. 4.13). The task is to move the railway carriage so that the upright pole does not fall over and so that the carriage stays between the buffers. At each moment, the system must choose whether to move the carriage to the right or left depending on its position and the position and velocity of the pole. However, if the pole falls over, which rule is held "responsible" - the last rule applied? In fact, in such tasks the mistake often happened much earlier, and subsequent rules might be good ones.
This problem is called the credit assignment problem. It arises in many domains. For example, in computer chess - if the computer won, which moves were the good ones? If it lost, which should be blamed? A human expert might be needed at this stage to analyze the game in order to tell the computer what went wrong. There is no simple solution to this problem. The human expert will be useful in some circumstances, but often the nature of the problem makes this undesirable or impractical - for example, a human expert would find it hard to assign credit in the pole-balancing problem. If the problem domain is internal to the computer it may be able to backtrack to each decision point and try alternatives. However, this approach will often be computationally infeasible. Sometimes there are special solutions dependent on the domain. For example, LEX, a theorem-proving
program, searches for minimal proofs of mathematical propositions. All the heuristics that give rise to a minimal proof are deemed "good" - LEX assigns credit uniformly.
4.6
Explanation-based learning
Algorithms for inductive learning usually require a very large number of examples in order to ensure that the rules learnt are reliable. Explanation-based learning addresses this problem by taking a single example and attempting to use detailed domain knowledge in order to explain the example. Those attributes which are required in the explanation are thus taken as defining the concept. Imagine you are shown a hammer for the first time. You notice that it has a long wooden handle with a heavy metal bit at the end. The metal end has one flat surface and one round one. You are told that the purpose of a hammer is to knock in nails. You explain the example as follows. The handle is there so that it can be held in the hand. It is long so that the head can be swung at speed to hit the nail. One surface of the hammer must be flat to hit the nail with. So, the essential features extracted are: a long handle of a substance that is easy to hold, and a head with at least one flat surface, made of a substance hard enough to hit nails without damage. A couple of years ago, one of the authors bought a tool in Finland. It was made of steel with rubber covering the handle. The head had one flat surface and one flat sharp edge (for cutting wood, a form of adze). Despite the strange shape and not having a wooden handle it is recognizably a hammer. Notice how explanation-based learning makes up for the small number (one!) of examples by using extensive domain knowledge: how people hold things; the hardness of nails; the way long handles can allow one to swing the end at speed. If the explanation is complete, then one can guarantee that the description is correct (or at least not overinclusive). Of course, with all that domain knowledge, a machine could, in theory, generate a design for a tool to knock in nails without ever seeing an example of a hammer. However, this suffers both from the search cost problem and because the concepts deduced in isolation may not correspond to those used by people (but it might be an interesting tool!). In addition, explanations may use reasoning steps that are not sound. Where gaps are found in the explanation an EBL system may use abduction or induction to fill them. Both forms of reasoning are made more reliable by being part of an explanation. Consider abduction first. Imagine one knows that hitting a nail with a large object will knock it into wood. If we have not been shown the hammer in use, merely told its function, we will have to use an abductive step to reason that the heavy metal head is used to knock in the nail. However, the match between features of the example and the possible cause makes it far more likely that the abductive step is correct than if we looked at causes in general (for example, that
96
the nail is driven into the wood by drilling a hole and then pushing it gently home). Similarly, the inductive steps can be made with greater certainty if they are part of an explanation. Often several examples with very different attributes require the same assumption in order to explain them. One may thus make the inductive inference that this assumption is true in general. Even if no non-deductive steps are made, explanation-based learning gives an important boost to deductive learning - it suggests useful things to learn. This is especially true if the explanation is based on a low-level, perhaps physical, model. The process of looking at examples of phenomena and then explaining them can turn this physical knowledge into higher-level heuristics. For example, given the example of someone slipping on ice, an explanation based on physical knowledge could deduce that the pressure of the person melted the ice and that the presence of the resulting thin layer of water allowed the foot to move relative to the ice. An analysis of this explanation would reveal, amongst other things, that thin layers of fluid allow things to move more easily - the principle of lubrication.
4.7
Example: Query-by-Browsing
As an example of the use of machine learning techniques we will look briefly at Query-by-Browsing (QbB). This is an experimental "intelligent" interface for databases that uses ID3 to generate queries for the user. This means that the user need only be able to recognize the right query, not actually produce it.
4.7.1 What the user sees

Initially Query-by-Browsing shows the user a list of all the records in the database. The user browses through the list, marking each record either with a tick if it is wanted or a cross if it is not (see Fig. 4.14). After a while the system guesses what sort of records the user wants, highlights them and generates a query (in SQL or an appropriate query method). The query is shown in a separate window so that the user can use the combination of the selected records and the textual form of the query to decide whether it is right (Fig. 4.15). Whereas so-called Query-by-Example works by making the user design a sort of answer template, Query-by-Browsing is really "by example": the user works from examples of the desired output.
Figure 4.14 Query-by-Browsing - user ticks interesting records.
Query: SELECT name, department, salary WHERE department = "accounts" and salary > 15000

Figure 4.15 Query-by-Browsing - system highlights inferred selection.
4.7.2 How it works

The form of examples used by ID3, attribute-value tuples, is almost exactly the same as that of the records found in a relational database. It is thus an easy job to take the positive and negative examples of records selected by the user and feed them into the ID3 algorithm. The output of ID3, a decision tree, is also reasonably easy to translate into a standard database query. In fact, QbB uses a variant of the standard ID3 algorithm in that it also allows branches based on cross-attribute tests (for example, "overdraft > overdraftlimit") as these are deemed important for effective queries. Otherwise the implementation of the basic system is really as simple as it sounds.
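The following sketch shows why the translation from tree to query is straightforward: collect the conditions along every path to an accepting leaf and join them up. It is an illustration only, not the actual QbB implementation, and the condition strings are invented.

```python
# A decision tree as (condition, yes_subtree, no_subtree); leaves are True/False.
tree = ('department = "accounts"', ('salary > 15000', True, False), False)

def to_where(node, path=()):
    if node is True:
        return [" and ".join(path)] if path else ["TRUE"]
    if node is False:
        return []
    condition, yes, no = node
    return (to_where(yes, path + (condition,)) +
            to_where(no, path + ("not (" + condition + ")",)))

print(" or ".join(to_where(tree)))
# department = "accounts" and salary > 15000
```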
4.7.3 Problems

Even a very simple implementation of QbB works very well - when the system gets it right. When it doesn't, things are rather more complicated. First of all the algorithm produces some decision tree which correctly classifies the records. However, there are typically many such trees. Sometimes the system produces a "sensible" answer, sometimes not. Although the answers are always "correct" they are not always the sort a reasonable human expert would produce. When QbB gets the wrong answer, the user can tell it and give more examples to help clarify the desired result. At some point the system generates a new query. However, the algorithm used starts from scratch each time and so there may be no obvious relationship between the first attempt and subsequent guesses. Although the earlier queries were wrong, the resulting behaviour can appear odd, and reduce one's confidence in the system. The above problems can be tackled by modifying the algorithm in various ways, but the lesson they give us is that applications of machine learning must do more than work, they must work in a way that is comprehensible to those who use them. Sometimes the machine intelligence can be hidden away in a "black box", where the mechanisms are invisible and hence don't matter, but more often than not someone will have to understand what is going on. This is a point we shall return to in the next chapter.
4.8
Summary
In this chapter, we have discussed the importance of machine learning, its general pattern, and some of the issues that arise. Several specific machine learning methods have been described, including deductive learning, inductive learning, and explanation-based learning. In particular we have examined two inductive 99
learning algorithms: the version-space method and ID3. We ended the chapter with a discussion of an experimental system that uses machine learning in an intelligent database interface.
4.9
Recommended further reading
Anzai, Y. 1992. Pattern recognition and machine learning. San Diego: Academic Press. Provides more detail of many of the algorithms discussed here including concept learning and decision trees.
Chapter Five
Game playing
5.1
Overview
Game playing has been an important part of the history of AI. The techniques for game playing can also be applied to other situations where factors are unknown but will be discovered only after action is taken. This chapter will consider algorithms for playing standard games (non-probabilistic, open, two-person, turn-taking, zero-sum games). Such games include chess, draughts, tic-tac-toe and Go. In particular, we will look at minimax search techniques and alpha-beta pruning. This builds on the search techniques studied in Chapter 3. The chapter will also consider other types of game where co-operation is important, where players can take simultaneous moves and where random events happen (such as the throw of a die). We will see in Chapter 9 that acting in the presence of uncertainty is essential for robotics and other practical planning tasks, and this chapter will show how game-playing algorithms can be used to tackle such non-gaming problems.
5.2
Introduction
Game playing has always been an important part of AI. Indeed, the earliest attempts at game-playing computer programs predate the field. Even Babbage considered programming his analytic engine to play chess. Games have been seen as a good testing ground for two reasons. First, because the mixture of reasoning and creative flair seems to epitomize the best of human intelligence. Secondly, because the constrained environment of play with clearly formulated rules is far more conducive to computation than the confused and open problems of everyday life. However, more recently this latter advantage has often been seen as a weakness of game playing as a measure of intelligence. Instead human intelligence is regarded as being more thoroughly expressed in the complexity of open problems and the subtlety of social relationships. Many now look at the current state of the art in chess programs and say that the brute force approaches that are being applied are no longer mainstream AI.
This critique of game playing should not detract from its own successes and its enormous importance in the development of the field of AI. When chess programs were still struggling at club level they were regarded as a challenge to AI; now they compete at grand master level. Game-playing programs have also led to the development of general purpose AI algorithms; for example, iterative deepening (discussed in Ch. 3) was first used in CHESS 4.5 (Slate & Atkin 1977). Game playing has also been a fertile ground for experiments in machine learning.

The single problem that has received most attention from the artificial intelligence community is the playing of chess, a game whose whole attraction is that it runs to precise rules within which billions of games are possible. As Stephen Rose, the British brain biologist, says, getting a computer to do this is not too great a wonder. Get one to play a decent game of poker, he says, and he might be more impressed.
Martin Ince, THES (1994)

Most interesting games defy pure brute force approaches because of the sheer size of their branching factor. In chess there are typically around 30 legal moves at any time (although only a few "sensible" ones), and it is estimated that there are around 10^75 legal chess games. We say "legal" games, as few would be sensible games. In order to deal with this enormous search space the computer player must be able to recognize which of the legal moves are sensible, and which of the reachable board positions are desirable. Search must be heuristic driven, and the formulation of these heuristics means that the programs must to some extent capture the strategy of a game. These factors are exemplified by the game of Go. Its branching factor is nearly 400, and a game can run to almost as many moves. Furthermore, the tactics of the game involve both local and global assessment of the board position, making heuristics very difficult to formulate. However, effective heuristics are essential to the game. The moves made in the early part of the game are critical for the final stages; effectively one needs to plan for the end game, hundreds of moves later. But the huge branching factor clearly makes it impossible to plan for the precise end game; instead one makes moves to produce the right kind of end game. Applying machine learning and neural networks to Go also encounters problems. The tactical advantage of a move is partly determined by its absolute position on the board (easy to match), but partly also by the local configuration of pieces. We will see in Chapter 8 that position independence is a major problem for pattern matching, and so this is not a parochial problem for game playing. Perhaps advances in Go and other games will give rise to general AI methods in the same way as chess and other simpler games have done over the years.
5.3
Characteristics of game playing
Game playing has an obvious difference from the searches in Chapter 3: while you are doing your best to find the best solution, your adversary is trying to stop you! One consequence of this is that the distinction between planning and acting is stronger in game play. When working out how to fill out a magic square, one could always backtrack and choose a different solution path. However, once one has made a choice in a game there is no going back. Of course, you can look ahead, guessing what your opponent's moves will be and planning your responses, but it remains a guess until you have made your move and your opponent has responded - it is then too late to change your mind. The above description of game playing is in fact only of a particular sort of game: a non-probabilistic, open, two-person, turn-taking, zero-sum game.
- non-probabilistic: no dice, cards or any other random effects.
- open: each player has complete knowledge of the current state of play, as opposed to games like "battleships" where different players have different knowledge.
- two-person: no third adversary and no team playing on your side, as opposed to, say, bridge or football.
- turn-taking: the players get alternate moves, as opposed to a game where they can take multiple moves, perhaps based on their speed of play.
- zero-sum: what one player wins, the other loses.
In addition, the games considered by AI are normally non-physical (a football-playing computer?). With a bit of effort one can think of games that have alternatives to all the above, but the "standard" style of game is most heavily studied, with the occasional addition of some randomness (for example, backgammon). As with deterministic search, we can organize the possible game states into trees or graphs, with the nodes linked by moves. However, we must also label the branches with the player who can make the choice between them. In a game tree alternate layers will be controlled by different players. Like deterministic problems, the game trees can be very big and typically have large branching factors. Indeed, if a game tree is not complex the game is likely to be boring. Even a trivial game like noughts and crosses (tic-tac-toe) has a game tree far too big to demonstrate here. Because of the game tree's size it is usually only possible to examine a portion of the total space. Two implications can be drawn from the complexity of game trees. First, heuristics are important - they are often the only way to judge whether a move is good or bad, as one cannot search as far as the actual winning or losing state. Secondly, the choice of which nodes to expand is critical. A human chess player only examines a small number of the many possible moves, but is able to identify those moves that are "interesting". This process of choosing directions to search is
knowledge rich and therefore expensive. More time spent examining each node means fewer nodes examined - in fact, the most successful chess programs have relatively simple heuristics, but examine vast numbers of moves. They attain grand master level and are clearly "intelligent", but the intelligence is certainly "artificial".
5.4
Standard games
5.4.1 A simple game tree

In order to demonstrate a complete game tree, we consider the (rather boring) game of "placing dominoes". Take a squared board such as a chess board. Each player in turn places a domino that covers exactly two squares. One player always places pieces right to left, the other always places them top to bottom. The player who cannot place a piece loses. The complete game tree for this when played on a 3 x 3 board is shown in Figure 5.1. In fact, even this tree has been simplified to fit it onto the page, and some states that are equivalent to others have not been drawn. For example, there are two states similar to b and four similar to c.
Figure 5.1 Game tree for "placing dominoes" (levels alternate between Alison's and Brian's turns; leaves score +1 or -1).
The adversaries are called Alison and Brian. Alison plays first and places her pieces left to right. Consider board position j. This is a win for Alison, as it is Brian's turn to play and there is no way to play a piece top to bottom. On the other hand, position s is a win for Brian, as although neither player can place a piece it is Alison's turn to play. We can see some of the important features of game search by looking at this tree. The leaves of the tree are given scores of +1 (win for Alison) or -1 (win for Brian - Alison loses). This scoring would of course be replaced by a heuristic value where the search is incomplete. The left-hand branch is quite simple - if Alison makes this move, Brian has only one move (apart from equivalent ones) and from there anything Alison does will win. The right branch is rather more interesting. Consider node m: Brian has only one possible move, but this leads to a win for him (and a loss for Alison). Thus position m should be regarded as a win for Brian and could be labelled "-1". So, from position e Alison has two choices, either to play to l - a win - or to play to m - a loss. If Alison is sensible she will play to l. Using this sort of argument, we can move up the tree marking nodes as win or lose for Alison. In a win-lose game either there will be a way that the first player can always win, or alternatively the second player will always be able to force a win. This game is a first-player win game; Alison is a winner! If draws are also allowed then there is the third alternative that two good players should always be able to prevent each other from winning - all games are draws. This is the case for noughts and crosses, and it is suspected that the same is true in chess. The reason that chess is more interesting to play than noughts and crosses is that no-one knows; and even if it were true that in theory the first player would always win, the limited ability to look ahead means that this does not happen in practice.
5.4.2 Heuristics and minimax search
In the dominoes game we were able to assign each leaf node as a definite win either for Alison or for Brian. By tracing back we were able to assign a similar value for each intermediate board position. As we have discussed, we will not usually have this complete information, and will have to rely instead on heuristic evaluation. As with deterministic search, the form of this will depend on the problem. Examples are:
- chess: one can use the standard scoring system where a pawn counts as 1, a knight as 3, and so on.
- noughts and crosses: one can use a sum based on the value of each square, where the middle counts most, the corners less and the sides least of all. You add up the squares under the crosses and subtract those under the noughts (a simple sketch of such an evaluation function is given below).
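A minimal Python sketch of the noughts-and-crosses heuristic follows. It is not from the book; the particular square weights (4 for the centre, 3 for corners, 2 for sides) are illustrative assumptions, chosen only to respect the ordering described above.

```python
# Illustrative square weights: centre counts most, corners next, sides least.
WEIGHTS = [3, 2, 3,
           2, 4, 2,
           3, 2, 3]

def heuristic(board):
    """board is a list of nine cells, each 'X', 'O' or ' ' (empty).
    Returns (weights under crosses) - (weights under noughts)."""
    score = 0
    for cell, weight in zip(board, WEIGHTS):
        if cell == 'X':
            score += weight
        elif cell == 'O':
            score -= weight
    return score

# Example: X holds the centre, O holds a corner, so the heuristic favours X.
print(heuristic(['O', ' ', ' ',
                 ' ', 'X', ' ',
                 ' ', ' ', ' ']))   # prints 1 (4 - 3)
```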
Note that these heuristics may give values outside the range -1 to +1, so one must either scale them suitably or choose large enough values to represent winning and losing positions.

Figure 5.2 shows an example game tree with heuristic values for each position. The heuristic values are the unbracketed numbers (ignore those in brackets for the moment). Alison's moves are shown as solid lines and Brian's moves are dashed. This is not the whole game tree, which would extend beyond the nodes shown. We will also ignore for now the difficult issue of how we decided to search this far in the tree and not, for example, to look at the children of node k. The portion of the tree that we have examined is called the search horizon.
Figure 5.2 Minimax search on a game tree (layers alternate between Alison's turn and Brian's turn; each node shows its heuristic value and, in brackets, its backed-up minimax value).
It is Alison's move. There are obviously some good positions for her (with scores 5 and 7) and some very bad ones (-10). But she cannot just decide to take the path to the best position, node j, as some of the decisions are not hers to make. If she moves to position c then Brian might choose to move to position g rather than to f. How can she predict what Brian will do and also make her own decision?

We can proceed up the tree rather as we did with the dominoes game. Consider position i. It is Brian's move, and he will obviously move to the best position for him, that is the child with the minimum score, n. Thus, although the heuristic value at node i was 2, by looking ahead at Brian's move we can predict that the actual score resulting from that move will be -3. This number is shown in brackets.

Look next at node d. It is Alison's move. If she has predicted Brian's move (using the argument above), her two possible moves are to h with score -2, or to i with score -3. She will want the best move for her, that is the maximum score. Thus the move made would be to h and position d can be given the revised score of -2. This process has been repeated for the whole tree. The numbers in brackets
show the revised scores for each node, and the solid lines show the chosen moves from each position. With this process one alternately chooses the minimum (for the adversary's move) and the maximum (for one's own move). The procedure is thus called minimax search. Pseudocode for minimax is shown in Figure 5.3.
to find minimax score of n
    find minimax score of each child of n
    if it is Alison's turn
        score of n is the maximum of the children's scores
    if it is Brian's turn
        score of n is the minimum of the children's scores
Figure 5.3 Minimax pseudocode.
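As a concrete illustration of this pseudocode, the Python sketch below runs minimax over a tiny hand-built game tree. It is hypothetical, not from the book: the node names and leaf values are invented purely for the example.

```python
# A tiny game tree as a dictionary; leaves carry heuristic values.
# Names and values are invented for illustration only.
TREE = {'a': ['b', 'c'], 'b': ['d', 'e'], 'c': ['f', 'g']}
LEAF_VALUES = {'d': 5, 'e': -2, 'f': 1, 'g': -10}

def minimax(node, maximizing):
    """Return the minimax score of `node`.
    `maximizing` is True on Alison's turn and False on Brian's."""
    if node not in TREE:                       # leaf: use its heuristic value
        return LEAF_VALUES[node]
    child_scores = [minimax(child, not maximizing) for child in TREE[node]]
    return max(child_scores) if maximizing else min(child_scores)

print(minimax('a', True))   # Alison to move at the root; prints -2
```

The alternation between max and min in the last line of the function corresponds directly to the two cases in the pseudocode of Figure 5.3.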
Note that the numbers on the positions are the worst score that you can get assuming you always take the indicated decisions. Of course you may do better if your adversary makes a mistake. For example, if Alison moves to c and Brian moves to f, Alison will be able to respond with a move to j, giving a score of 7 rather than the worst-case score of 1. However, if you don't take the indicated moves, a good opponent will fight down your score to below the minimax figure. Minimax is thus a risk-averse search.
5.4.3 Horizon problems
It is important to remember that the portion of tree examined in determining the next move is not the whole tree. So although minimax gives the worst-case score given the nodes that have been examined, the actual score may be better or worse as the game proceeds and one gets to previously unconsidered positions. For example, imagine that Alison looks ahead only two moves, to the level d-g. A minimax search at this level gives scores of 5 to b and -2 to c, so Alison will move to b, whereas by looking further ahead we know that c would be better. Looking even further ahead, our choice might change again. These rapid changes in fortune are a constant problem in determining when to stop examining the game tree.

Figure 5.4 shows a particularly dramatic example. The white draught is crowned, so it can jump in any direction, and it is white's move. A simple heuristic would suggest that black is unassailable, but looking one move further we find that white jumps all black's draughts and wins the game!
Figure 5.4 Horizon effect - simple heuristics can be wrong!
Look again at Figure 5.2. Positions a, b, d and e all have the same heuristic score. That is, they form a plateau rather like the ones we saw in hill climbing. While we only look at the positions within a plateau, minimax can tell us nothing. In the example tree, the search horizon went beyond the plateau, and so we were able to get a better estimate of the score for each position. In fact, if you examine the suggested chess heuristic, it only changes when a piece is taken. There are likely to be long play sequences with no takes, and hence plateaux in the game tree.

Plateaux cause two problems. First, as already noted, minimax cannot give us a good score. Secondly, and perhaps more critically, they give us no clue as to which nodes to examine further. If we have no other knowledge to guide our search, the best we can do is examine the tree around a plateau in a breadth first manner. In fact, one rule for examining nodes is to look precisely at those where there is a lot of change - that is, ignore the plateaux. This is based on the observation that rapid changes in the evaluation function represent interesting parts of the game.
5.4.4 Alpha-beta pruning
The minimax search can be speeded up by using branch and bound techniques. Look again at Figure 5.2. Imagine we are considering moves from d. We find that h has score -2. We then go on to look at node i - its child n has score -3. So, before we look at o we know that the minimax score for i will be no more than -3, as Brian will be minimizing. Thus Alison would be foolish to choose i, as h is going to be better than i whatever score o has.

We can see similar savings on the dominoes game tree (Fig. 5.1). Imagine we
are trying to find the move from position c. We have evaluated e and its children and f, and are about to look at the children of nodes g and h. From Brian's point of view (minimization), f is best so far. Now as soon as we look at node n we can see that the minimax score for g will be at least 1 (as Alison will play to maximize), so there is no reason to examine node o. Similarly, having seen node f, nodes p and q can be skipped.

In fact, if we look a bit further up we can see that even less search is required. Position b has a minimax score of 1. As soon as we have seen that node f has score -1 we know that Brian could choose this path and that the minimax score of c is at most -1. Thus nodes g and h can be ignored completely. This process is called alpha-beta pruning and depends on carrying around a best-so-far (α) value for Alison's choices and a worst-so-far (β) value for Brian's choices.
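A minimal Python sketch of alpha-beta pruning follows (hypothetical, not from the book). It replays the fragment of Figure 5.2 discussed above, where Alison chooses from d; the value given to node o is an invented placeholder, since the whole point is that o is never examined.

```python
import math

# Fragment of Figure 5.2: from d, Alison can move to h or i; from i, Brian can
# move to n or o. The value of o is a placeholder - it is never looked at.
TREE = {'d': ['h', 'i'], 'i': ['n', 'o']}
LEAF_VALUES = {'h': -2, 'n': -3, 'o': 99}

def alphabeta(node, maximizing, alpha=-math.inf, beta=math.inf):
    """Minimax with alpha-beta pruning.
    alpha = best score already guaranteed to the maximizer (Alison),
    beta  = best score already guaranteed to the minimizer (Brian)."""
    if node not in TREE:                       # leaf: return its heuristic value
        return LEAF_VALUES[node]
    if maximizing:
        best = -math.inf
        for child in TREE[node]:
            best = max(best, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, best)
            if alpha >= beta:                  # Brian will never let play reach here
                break
        return best
    best = math.inf
    for child in TREE[node]:
        best = min(best, alphabeta(child, True, alpha, beta))
        beta = min(beta, best)
        if alpha >= beta:                      # Alison will never let play reach here
            break
    return best

print(alphabeta('d', True))   # prints -2; node o is pruned, as argued in the text
```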
5.4.5 The imperfect opponent
Minimax and alpha-beta search both assume that the opponent is a rational player using the same sort of reasoning as the algorithm. Imagine two computers, AYE and BEE, playing against one another. AYE is much more powerful than BEE and is to move first. There are two possible moves. If one move is taken then a draw is inevitable. If the other move is taken then, by looking ahead 20-ply, AYE can see that BEE can force a win. However, all other paths lead to a win for AYE. If AYE knows that BEE can only look ahead 10-ply, then AYE should probably play the slightly risky move in the knowledge that BEE will not know the correct moves to make and so will almost certainly lose.

For a computer to play the same trick on a human player is far more risky. Even though human players can consider nowhere near as many moves as computers, they may look very far ahead down promising lines of moves (actually computers do so too). Because AYE knew that BEE's search horizon was fixed, it could effectively use probabilistic reasoning. The problem with human opponents, or less predictable computer ones, is that they might pick exactly the right path. Assuming random moves from your opponent under such circumstances is clearly foolhardy, but minimax seems somewhat unadventurous. In preventing the worst, it throws away golden opportunities.
5.5 Non-zero-sum games and simultaneous play
In this section we will relax some of the assumptions of the standard game. If we have a non-zero-sum game, there is no longer a single score for each position. Instead, we have two values representing how good the position is for each player. Depending on the rules of play, different players control different choice
points, and they seek to maximize their own score. This formulation allows one to consider not only competitive but also co-operative situations, where the choices are made independently, but where the players' ideas of "good" agree with one another. This leads into the area of distributed AI, where one considers, for example, shop-floor robots co-operating in the building of a motor car (see Ch. 10). However, there we will consider the opposite extreme, where all parties share a common goal. In this section we will consider the in-between stage, when the players' goals need not agree, but may do so. We will also examine simultaneous play, that is, when both parties make a move in ignorance of each other's choice.
5.5.1 The prisoner's dilemma
A classic problem in game theory is the prisoner's dilemma. There are several versions of this; the one discussed in Section 5.5.4 is the most common, but we will deal with a more tractable version first. The story runs as follows. Imagine two bank-robbers have been arrested by the police and are being questioned individually. The police have no evidence against them, and can only prosecute if one or the other decides to confess.
Figure 5.5 The prisoner's dilemma. Each entry gives the sentences in years as (first prisoner, second prisoner).

                           second prisoner
                       trust           renege
  first     trust      (0, 0)          (10, 1)
  prisoner  renege     (1, 10)         (5, 5)
Before they were arrested, the criminals made a pact to say nothing. Each now has the choice either to remain silent - and trust their colleague will do the same - or to renege on their promise. Is there honour among thieves? If neither confesses then the police will eventually have to let them go. If both confess then they will each get a long, five-year sentence. However, the longest sentence will be for a prisoner who doesn't confess when the other does. If the first prisoner confesses then the other prisoner will get a ten-year sentence, whereas the first prisoner will only be given a short, one-year sentence. Similarly, if the second prisoner confesses and the first does not, the first will get the ten-
year sentence. The situation is summarized in Figure 5.5; in each square the first prisoner's sentence is given first and the second prisoner's sentence second.

Let's consider the first prisoner's options. If he trusts his colleague, but she reneges, then he will be in prison for ten years. However, if he confesses, reneging on his promise, then the worst that can happen to him is a five-year sentence. A minimax strategy would suggest reneging. The second prisoner will reason in exactly the same way, so both confess.
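This worst-case reasoning can be spelt out in a short Python sketch (hypothetical, not from the book): for each of a prisoner's choices we find the worst sentence the other's choice could inflict, and then pick the choice whose worst case is least bad.

```python
# Sentences in years for (first prisoner's choice, second prisoner's choice).
SENTENCES = {
    ('trust',  'trust'):  (0, 0),
    ('trust',  'renege'): (10, 1),
    ('renege', 'trust'):  (1, 10),
    ('renege', 'renege'): (5, 5),
}
CHOICES = ['trust', 'renege']

def minimax_choice(player):
    """Pick the choice whose worst-case sentence for `player` (0 or 1) is smallest."""
    def worst_case(my_choice):
        if player == 0:
            return max(SENTENCES[(my_choice, other)][0] for other in CHOICES)
        return max(SENTENCES[(other, my_choice)][1] for other in CHOICES)
    return min(CHOICES, key=worst_case)

print(minimax_choice(0), minimax_choice(1))   # both prisoners renege
```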
5.5.2 Searching the game tree
The above problem was drawn as a matrix rather than a tree, because neither prisoner knew the other's moves. If instead the two 'played' in turn then the situation would be far better. In this case we can draw the prisoner's dilemma as a game tree (see Fig. 5.6). At each terminal node we put the two values and use a minimax-like algorithm on the tree.
Figure 5.6 Game tree for the prisoner's dilemma (the first prisoner chooses at the root, the second prisoner at the next level; terminal nodes carry both prisoners' sentences).
Imagine the first prisoner has decided not to confess, and the second prisoner knows this. Her options are then to remain silent also and stay out of prison, or to renege and have a one-year sentence. Her choice is clear. On the other hand, if the first prisoner has already reneged, then it is clear that she should also do so (honour aside!). Her choices are indicated by bold lines, and the middle nodes have been given pairs of scores based on her decisions. Assuming the first prisoner can predict his partner's reasoning, he now knows
the scores for each of his options. If he reneges he gets five years; if he stays silent he walks away free - no problem! Notice that although this is like the minimax algorithm, it differs when we consider the second prisoner's moves. She does not seek to minimize the first prisoner's score, but to maximize her own. More of a maximax algorithm?

So, the game leads to a satisfactory conclusion (for the prisoners) if the moves are open, but not if they are secret (which is why the police question them separately). In real-life decision making, for example many business and diplomatic negotiations, some of the choices are secret. For example, the Cuban missile crisis can be cast in a similar form to the prisoner's dilemma. The "renege" option here would be to take pre-emptive nuclear action. Happily, the range of options and the level of communication were substantially higher. Although there are obvious differences, running computer simulations of such games can be used to give some insight into these complex real-world decisions.

In the iterated prisoner's dilemma, the same pair of players are constantly faced with the same secret decisions. Although in any one game they have no knowledge of the other's moves, they can observe their partner's previous behaviour. A successful strategy for the iterated prisoner's dilemma is tit-for-tat, where the player "pays back" the other player for reneging (a simple simulation is sketched below). So long as there is some tendency for the players occasionally to take a risk, the play is likely to end up in extended periods of mutual trust.
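Here is a minimal Python sketch of the iterated prisoner's dilemma (hypothetical, not from the book). It pits a tit-for-tat player against a player who reneges at random 10% of the time; both the opponent strategy and that rate are arbitrary assumptions for illustration, and the payoffs reuse the sentence matrix above (lower totals are better).

```python
import random

SENTENCES = {('trust', 'trust'): (0, 0), ('trust', 'renege'): (10, 1),
             ('renege', 'trust'): (1, 10), ('renege', 'renege'): (5, 5)}

def tit_for_tat(opponent_history):
    """Trust on the first round, then copy the opponent's previous move."""
    return opponent_history[-1] if opponent_history else 'trust'

def mostly_trusting(_history):
    """Trust, but renege 10% of the time (arbitrary rate for illustration)."""
    return 'renege' if random.random() < 0.1 else 'trust'

def play(rounds=100):
    history_a, history_b, years_a, years_b = [], [], 0, 0
    for _ in range(rounds):
        move_a = tit_for_tat(history_b)        # A sees only B's past moves
        move_b = mostly_trusting(history_a)
        sa, sb = SENTENCES[(move_a, move_b)]
        years_a, years_b = years_a + sa, years_b + sb
        history_a.append(move_a)
        history_b.append(move_b)
    return years_a, years_b

print(play())   # total prison years for each player over 100 rounds (lower is better)
```

With these strategies the play settles into long stretches of mutual trust, broken by short pay-back episodes after each random defection, echoing the observation in the text.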
5.5.3 No alpha-beta pruning
Although the slightly modified version of the minimax algorithm works fine on non-zero-sum games, alpha-beta pruning cannot be used. Consider again the game tree in Figure 5.6. Imagine this time that you consider the nodes from right to left. That is, you consider each renege choice before the corresponding trust choice. The third and fourth terminal nodes are considered as before, and the node above them scored. Thus the first prisoner knows that reneging will result in five years in jail. We now move on to the second terminal node. It has a penalty of ten years for the first prisoner. If he applied alpha-beta pruning, he would see that this is worse than the reneging option, and so not bother to consider the first node at all.

Why does alpha-beta fail? The reason is that it depends on the fact that in zero-sum games the best move for one player is the worst for the other. This holds in the right-hand branch of the game tree, but not in the left-hand branch. When the first prisoner has kept silent, then the penalties for both are minimized when the second prisoner also remains silent. What's good for one is good for both.
5.5.4 Pareto-optimality
In the form of the prisoner's dilemma discussed above, the option when both remain silent was best for both. However, when there is more than one goal, it is not always possible to find a uniformly best alternative. Consider the form of the prisoner's dilemma in Figure 5.7. This might arise if the police have evidence of a lesser crime, perhaps possession of stolen goods, so that if neither prisoner confesses they will still both be imprisoned for two years. However, if only one confesses, that prisoner has been promised a lenient sentence on both charges.
Figure 5.7 Modified prisoner's dilemma. Each entry gives the sentences in years as (first prisoner, second prisoner).

                           second prisoner
                       trust           renege
  first     trust      (2, 2)          (10, 1)
  prisoner  renege     (1, 10)         (5, 5)
This time, there is no uniformly optimal solution. Neither prisoner will like the renege-renege choice, and the trust-trust one is better for both. However, it is not best overall, as each prisoner would prefer the situation when only they confess. The trust-trust situation is called Pareto-optimal. This means that there is no other situation that is uniformly better. In general, there may be several different Pareto-optimal situations favouring one or other party (a small sketch of a Pareto-optimality check is given after Figure 5.8).

Now see what happens when the prisoners make their choices. The first prisoner wonders what the second prisoner might do. If she reneges, then he certainly ought to as well. But if she stays silent it is still better for him to renege, as this will reduce his sentence from two years to one. The second prisoner reasons similarly and so they end up in the renege-renege situation.

This time, having an open, turn-taking game does not help. Figure 5.8 shows the game tree for this version of the dilemma, which also leads to the renege-renege option. Even though both prisoners would prefer the Pareto-optimal trust-trust option to the renege-renege one, the latter is still chosen. Furthermore, if they both did decide to stay silent, but were later given the option of changing their decision, both would do so. The Pareto-optimal decision is, in this case, unstable. The lesson is that, in order to get along, both computers and people have to negotiate and be able to trust one another. Some early work has begun on endowing software agents (see Ch. 10) with ideas of trust.
Figure 5.8 Non-Pareto-optimal solution: the game tree for the modified dilemma (first prisoner moves first, second prisoner next), which again leads to renege-renege.
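To make the idea concrete, the Python sketch below (hypothetical, not from the book) finds the Pareto-optimal outcomes of the modified dilemma, taking lower sentences as better.

```python
# Outcomes of the modified dilemma: (first prisoner's years, second prisoner's years).
OUTCOMES = {
    ('trust',  'trust'):  (2, 2),
    ('trust',  'renege'): (10, 1),
    ('renege', 'trust'):  (1, 10),
    ('renege', 'renege'): (5, 5),
}

def dominates(a, b):
    """True if outcome a is at least as good as b for everyone (lower is better)
    and strictly better for at least one player."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_optimal(outcomes):
    """An outcome is Pareto-optimal if no other outcome dominates it."""
    return [k for k, v in outcomes.items()
            if not any(dominates(other, v) for other in outcomes.values())]

print(pareto_optimal(OUTCOMES))
# trust-trust, trust-renege and renege-trust are all Pareto-optimal;
# only renege-renege is dominated (by trust-trust).
```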
5.5.5 Multi-party competition and co-operation
The above can easily be extended to the case of multiple players. Instead of two scores, one gets a tuple of scores, one for each player. The modified minimax algorithm can again be used. At each point, as we move up the tree, we assume each player will maximize their own part of the tuple. The same problems arise with secret moves and non-Pareto-optimal results.
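A minimal Python sketch of this multi-player generalization follows (hypothetical, not from the book; the three-player tree and its score tuples are invented). At each node, the player to move picks the child whose score tuple is best in their own component.

```python
# Invented three-player game tree: each internal node lists (player_to_move, children);
# each leaf carries a tuple of scores, one per player (higher is better).
TREE = {
    'root': (0, ['a', 'b']),
    'a':    (1, ['a1', 'a2']),
    'b':    (2, ['b1', 'b2']),
}
LEAVES = {'a1': (3, 1, 2), 'a2': (1, 4, 0), 'b1': (2, 2, 2), 'b2': (5, 0, 1)}

def evaluate(node):
    """Score tuple reached from `node` when every player maximizes their own
    component of the tuple (the modified minimax described in the text)."""
    if node in LEAVES:
        return LEAVES[node]
    player, children = TREE[node]
    return max((evaluate(child) for child in children),
               key=lambda scores: scores[player])

print(evaluate('root'))   # prints (2, 2, 2) for this invented tree
```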
5.6 The adversary is life!
Game playing is similar to interacting with the physical environment - as you act, new knowledge is found, or circumstances change to help or hinder you. In such circumstances the minimax algorithm can be used where the adversary is replaced by events from the environment. This effectively assumes that the worst thing will always happen. Consider the following coin-weighing problem:

    King Alabonzo of Arbicora has nine golden coins. He knows that one is a counterfeit (but not which one). He also knows that fake coins are slightly lighter than true ones. The local magician Berzicaan has a large and accurate balance, but demands payment in advance for each weighing required. How many weighings should the king ask for and how should he proceed?
Figure 5.9 Minimax search for King Alabonzo's counterfeit coin. The four options from the nine-coin pile: weigh 2 (1 on each pan), minimax = 7; weigh 4 (2 on each pan), minimax = 5; weigh 6 (3 on each pan), minimax = 3; weigh 8 (4 on each pan), minimax = 4.
Figure 5.9 shows the search space, expanded to one level. The numbers in bags represent the size of the pile that contains the counterfeit coin. This starts off as size 9. The king can weigh two coins (one on each side of the balance), four, six or eight. If the balance is equal the coin must be in the remaining pile; if unequal he can confine his search to the lighter side. For example, imagine the king chose to weigh four coins. If the balance was unequal, he would know that the lighter side had the fake coin in it; hence the pile to test would now consist of only two coins. If, on the other hand, the balance had been equal, the king would know that the fake coin was among the five unweighed coins. Thus if we look at the figure, the choice to weigh four coins has two branches, the "=" branch leading to a five-coin bag and the "≠" branch leading to a two-coin bag.

The balance acts as the adversary and we assume it "chooses" to weigh equal or unequal to make things as bad as possible for King Alabonzo! Alabonzo wants the pile as small as possible, so he acts as minimizer, while the balance acts as maximizer. Based on this, the intermediate nodes have been marked with their minimax values. We can see that, from this level of look-ahead, the best option appears to be weighing six coins first. In fact, this is the best option and, in this case, the number of coins remaining acts as a very good heuristic to guide us quickly to the shallowest solution.
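The same worst-case reasoning can be written as a short recursion. The Python sketch below (hypothetical, not from the book) computes, for a pile of n coins, the number of weighings that guarantees finding the counterfeit, treating the balance as an adversary that always picks the worse outcome.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def weighings_needed(n):
    """Worst-case number of weighings to find the light coin among n coins."""
    if n <= 1:
        return 0
    best = None
    for k in range(1, n // 2 + 1):            # put k coins on each pan
        # Adversarial balance: "equal" leaves n - 2k coins, "unequal" leaves k.
        worst_remaining = max(n - 2 * k, k)
        cost = 1 + weighings_needed(worst_remaining)
        best = cost if best is None else min(best, cost)
    return best

print(weighings_needed(9))   # 2: weigh 3 against 3, then 1 against 1
```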
5.7 Probability
Many games contain some element of randomness, perhaps the toss of a coin or the roll of a die. Some of the choice points are replaced by branches with probabilities attached. This may be done both for simple search trees and for game trees. There are various ways to proceed. The simplest is to take the expected value at each point and then continue much as before.
Figure 5.10 Probabilistic game tree for King Alabonzo's counterfeit coin, starting from five coins. Weigh 2 (1 on each pan): minimax = 3, average = 1.6, with p = 0.4 of leaving one coin ([0] further weighings) and p = 0.6 of leaving three coins ([1] further weighing). Weigh 4 (2 on each pan): minimax = 2, average = 1.8, with p = 0.2 of leaving one coin ([0]) and p = 0.8 of leaving two coins ([1]).
In the example of Alabonzo's coins we deliberately avoided probabilities by saying he had to pay in advance for the number of weighings, so only the worst case mattered. If instead he paid per weighing when required, he might choose to minimize the expected cost. This wouldn't necessarily give the same answer as minimax.

Figure 5.10 shows part of the tree starting with five coins. The lower branches have been labelled with the probability that they will occur. For example, if two coins are weighed then there is a probability of 2/5 that one of them will be the counterfeit and 3/5 that it will be one of the three remaining coins. At the bottom of the figure, the numbers in square brackets are the expected number of further weighings needed to find the coin. In the case of one coin remaining, that must be the counterfeit and so the number is zero. In the cases
of two or three coins, one further weighing is sufficient. With five coins the king can choose to weigh either two or four coins. The average number of weighings for each has been calculated. For example, when weighing two coins there is that weighing, and if the scales are equal (with a probability of 0.6) then a further weighing is required, giving an average number of 1.6. See how the average number of weighings required is 1.6 when two coins are weighed and 1.8 when four are weighed. So, it is better to weigh two. However, if the number of coins in the piles is used as a heuristic, the minimax score is better when four coins are weighed. In general the two methods will not give the same answer, as minimax will concentrate on the worst outcome no matter how unlikely its occurrence. (A short sketch of this expected-value calculation is given at the end of the section.)

One problem with calculating the average pay-off is that it leads to a rapid increase in the search tree. For example, in a two-dice game, like backgammon, one has to investigate game situations for all 21 different pairs of die faces (or 11 sums). One way to control this is by using a probability-based cut-off for the search. It is not worth spending a lot of effort on something that is very unlikely to happen.

Averages are not the only way to proceed. One might prefer a choice with a lower average pay-off (or higher cost) if it has less variability - that is, a strategy of risk avoidance. On the other hand, a gambler might prefer a small chance of a big win. This may not be wise against a shrewd opponent with no randomness, but may be perfectly reasonable where luck is involved.

Because of the problem with calculating probabilities, game-playing programs usually use complex heuristics rather than deep searches. So, a backgammon program will play more like a human than a chess program. However, there are some games where the calculation of probabilities can make a computer a far better player than a person. In casinos, the margin towards the house is quite narrow (otherwise people would lose their money too quickly!), so a little bit of knowledge can turn a slow loss into a steady win. In card games, the probability of particular cards occurring changes as the pack is used up. If you can remember every card that has been played, then you can take advantage of this and win. But don't try it! Counting (as this is called) is outlawed. If the management suspects, it will change the card packs, and anyone found in the casino using a pocket computer will find themselves in the local police station, if not wearing new shoes at the bottom of the river!
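As a complement to the worst-case sketch earlier, the Python fragment below (hypothetical, not from the book) computes expected numbers of weighings for the coin problem, reproducing the 1.6 versus 1.8 comparison for five coins.

```python
from fractions import Fraction
from functools import lru_cache

def expected_cost(n, k):
    """Expected weighings if we put k coins on each pan out of a pile of n."""
    p_unequal = Fraction(2 * k, n)            # counterfeit is on one of the pans
    return (1 + p_unequal * expected_weighings(k)
              + (1 - p_unequal) * expected_weighings(n - 2 * k))

@lru_cache(maxsize=None)
def expected_weighings(n):
    """Minimum expected number of weighings to find the light coin among n coins."""
    if n <= 1:
        return Fraction(0)
    return min(expected_cost(n, k) for k in range(1, n // 2 + 1))

print(float(expected_cost(5, 1)))    # weigh 2 coins out of 5: 1.6
print(float(expected_cost(5, 2)))    # weigh 4 coins out of 5: 1.8
print(float(expected_weighings(9)))  # best expected cost for the original nine coins
```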
5.8 Summary
In this chapter we have looked at algorithms for playing standard games (non-probabilistic, open, two-person, turn-taking, zero-sum games). Such games include chess, draughts, tic-tac-toe and Go. We considered minimax search
techniques and alpha-beta pruning, which relate to the search techniques studied in Chapter 3. We also discussed games where co-operation is important, where players can take simultaneous moves and where random events happen (such as the throw of a die). We will see in Chapter 9 that acting in the presence of uncertainty is essential for robotics and other practical planning tasks, and that chapter will show how game-playing algorithms can be used to tackle such non-gaming problems.
5.9 Exercises

1. Consider the alternatives to the "standard" game (the non-probabilistic, open, two-person, turn-taking, zero-sum game). Confining yourself to turn-taking games, consider all possible combinations of game types, and attempt to find a game to fit in each category. Only worry about the "zero-sum" property for two-person games, so that you should have 12 categories in all. For example, find a game that is probabilistic, open and not two-person.

2. Consider the three-person game hex-lines, a variant of "placing dominoes". A piece of paper is marked with dots in a triangular pattern. Different sizes and shapes of playing area give rise to different games. Each person in turn connects two adjacent points. However, they are only allowed to use points that have not yet been used. The players each have a direction and are only allowed to draw lines parallel to their direction. We'll assume that the first player draws lines sloping up (/), the second horizontal (-) and the third sloping down (\). If players cannot draw their direction of line then they are out of the game. When no player can draw a line the lines for each player are counted, giving the final score.
Figure 5.11 Hex-lines.
Consider an example game on a small hexagonal playing area. The board positions through the game are shown in Figure 5.11. The initial configuration is (i).
(a) First player draws sloping up (ii).
(b) Second player draws horizontal (iii).
(c) Third player cannot play and is out.
(d) First player cannot play either and is out.
(e) Second player draws again, giving (iv).

The final score is thus [1, 2, 0] (1 for the first player, 2 for the second and 0 for the third). Taking the same initial configuration, draw the complete game tree. Could the first player have done better?
5.10 Recommended further reading
Pearl, J. 1984. Heuristics: intelligent search strategies for computer problem solving. Reading, MA: Addison-Wesley. Part 3 of this book concentrates on game-playing strategies and heuristics.
Chapter Six
Expert systems
6.1 Overview
Expert systems are one of the most successful commercial applications of artificial intelligence. In this chapter we examine expert systems, looking at a basic system architecture, some classic applications, and the stages involved in building an expert system. In particular, we consider the important problem of knowledge elicitation. Finally, we discuss some of the limitations of current expert system technology and some possible solutions for the future.
6.2 What are expert systems?
Of all the applications of AI techniques, expert systems are perhaps the most familiar and are certainly, as yet, the most commercially successful. But what exactly is an expert system? What is it used for, and how is it built? This chapter will attempt to answer these questions. In our discussion we will return to several of the techniques we have already examined and see how they operate together to produce a useful artefact.

An expert system is an AI program that uses knowledge to solve problems that would normally require a human expert. The knowledge is collected from human experts and secondary knowledge sources, such as books, and is represented in some form, often using logic or production rules. The system includes a reasoning mechanism as well as heuristics for making choices and navigating around the search space of possible solutions. It also includes a mechanism for passing information to and from the user. Even from this brief overview you can probably see how the techniques that we have already considered might be used in expert system development.

In this chapter we will look at the main uses of expert systems and the components that make up an expert system, before going on to consider a number of well-known and, in a sense, classic expert systems. We will then look more closely at the process of building an expert system and the tools that are available to support this process. Finally, we will consider the problems facing expert system technology and look to the future.
6.3 Uses of expert systems
If an expert system is a program that performs the work of human experts, what type of work are we talking about? This is not an easy question to answer since the possibilities, if not endless, are extensive. Commercial expert systems have been developed to provide financial, tax-related and legal advice; to plan journeys; to check customer orders; to perform medical diagnosis and chemical analysis; to solve mathematical equations; to design everything from kitchens to computer networks and to debug and diagnose faults. And this is not a comprehensive list. Such tasks fall into two main categories: those that use the evidence to select one of a number of hypotheses (such as medical diagnosis and advisory systems) and those that work from requirements and constraints to produce a solution which meets these (such as design and planning systems).

So why are expert systems used in such areas? Why not use human experts instead? And what problems are candidates for an expert system? To take the last question first, expert systems are generally developed for domains that share certain characteristics. First, human expertise about the subject in question is not always available when it is needed. This may be because the necessary knowledge is held by a small group of experts who may not be in the right place at the right time. Alternatively it may be because the knowledge is distributed through a variety of sources and is therefore difficult to assimilate. Secondly, the domain is well defined and the problem clearly specified. At present, as we discovered in Chapter 1, AI technology cannot handle common sense or general knowledge very well, but expert systems can be very successful for well-bounded problems. Thirdly, there are suitable and willing domain experts to provide the necessary knowledge to populate the expert system. It is unfeasible to contemplate an expert system when the relevant experts are either unwilling to co-operate or are not available. Finally, the problem is of reasonable scope, covering diagnosis of a particular class of disease, for example, rather than of disease in general.

If the problem fits this profile it is likely to benefit from the use of expert system technology. In many cases the benefits are in real commercial terms such as cost reduction, which may go some way to explaining their commercial success. For example, expert systems allow the dissemination of information held by one or a small number of experts. This makes the knowledge available to a larger number of people, and less skilled (so less expensive) people, reducing the cost of accessing information. Expert systems also allow knowledge to be formalized. It can then be tested and potentially validated, reducing the costs incurred through error. They also allow integration of knowledge from different sources, again reducing the cost of searching for knowledge. Finally, expert systems can provide consistent, unbiased responses. This can be a blessing or a curse depending on which way you look at it. On the positive side, the system is not plagued by human error or prejudice (unless this is built into the knowledge and reasoning), resulting in more consistent, correct solutions. On the other hand, the system is unable to make value judgements, which makes it more inflexible
Figure 6.1 Typical expert system architecture (from the bottom up: knowledge, a reasoning mechanism and problem-solving heuristics, an explanation component, a dialogue component, and the user at the top).
than the human (for example, a human assessing a loan application can take into account mitigating circumstances when assessing previous bad debts, but an expert system is limited in what it can do).
6.4 Architecture of an expert system
An expert system comprises a number of components, several of which utilize the techniques we have considered so far (see Fig. 6.1). Working from the bottom up, we require knowledge, a reasoning mechanism and heuristics for problem solving (for example, search or constraint satisfaction), an explanation component and a dialogue component. We have considered the first three of these in previous chapters and will come back to them when we consider particular expert systems. Before that, let us look in a little more detail at the last two.
6.4.1 Explanation facility
It is not acceptable for an expert system to make decisions without being able to provide an explanation for the basis of those decisions. Clients using an expert system need to be convinced of the validity of the conclusion drawn before applying it to their domain. They also need to be convinced that the solution is appropriate and applicable in their circumstances. Engineers building the expert system also need to be able to examine the reasoning behind decisions in order to assess and evaluate the mechanisms being used. It is not possible to know if the system is working as intended (even if it produces the expected answer) if an explanation is not provided. So explanation is a vital part of expert system technology.

There are a number of ways of generating an explanation, the most common being to derive it from the goal tree that has been traversed. Here the explanation facility keeps track of the subgoals solved by the system, and reports the rules that were used to reach that point. For example, imagine the following very simple system for diagnosing skin problems in dogs.

Rule 1: IF the dog is scratching its ears
        AND the ears are waxy
        THEN the ears should be cleaned

Rule 2: IF the dog is scratching its coat
        AND if insects can be seen in the coat
        AND if the insects are grey
        THEN the dog should be treated for lice

Rule 3: IF the dog is scratching its coat
        AND if insects can be seen in the coat
        AND if the insects are black
        THEN the dog should be treated for fleas

Rule 4: IF the dog is scratching its coat
        AND there is hair loss
        AND there is inflammation
        THEN the dog should be treated for eczema

Imagine we have a dog that is scratching and has insects in its coat. A typical consultation would begin with a request for information, in an attempt to match the conditions of the first rule: "is the dog scratching its ears?", to which the response would be no. The system would then attempt to match the conditions of rule 2, asking "is the dog scratching its coat?" (yes), "can you see insects in the coat?" (yes), "are the insects grey?". If we respond yes to this question the system will inform us that our dog needs delousing. At this point, if we asked for an explanation, the following style of response would be given:

It follows from rule 2 that
If the dog is scratching
And if insects can be seen
And if the insects are grey
Then the dog should be treated for lice.

This traces the reasoning used through the consultation so that any errors can be identified and justification can be given to the client if required. However, as you can see, the explanation given is simply a restatement of the rules used, and as such is limited.

In addition to questions such as "how did you reach that conclusion?" the user may require explanatory feedback during a consultation, particularly to clarify what information the system requires. A common request is "why do you want to know that?" when the system asks for a piece of information. In this case the usual response is to provide a trace up to the rule currently being considered and a restatement of that rule. Imagine that, in our horror at discovering crawling insects on our dog, we had not noted their colour; we might then ask why the system needs this information. The response would be of the form:

You said the dog is scratching and that there are insects.
If the insects are grey then the dog should be treated for lice.

Notice that it does not present the alternative rule, rule 3, which deals with black insects. This would be useful but assumes look-ahead to other rules in the system to see which other rules may be matched.

This form of explanation facility is far from ideal, both in terms of the way that it provides the explanation and the information to which it has access. In particular it tends to regurgitate the reasoning in terms of rules and goals, which may be appropriate to the knowledge engineer but is less suitable for the user. Ideally, an explanation facility should be able to direct the explanation towards the skill level or understanding of the user. In addition, it should be able to differentiate between the domain knowledge that it uses and control knowledge, such as that used to control the search. Explanations for users are best described in terms of the domain; those for engineers in terms of control mechanisms. In addition, rule tracing only makes sense for backward reasoning systems, since in forward reasoning it is not known, at a particular point, where the line of reasoning is going.

For these reasons researchers have looked for alternative mechanisms for providing explanations. One approach is to maintain a representation of the problem-solving process used in reaching the solution as well as the domain knowledge. This provides a context for the explanation: the user knows not only which rules have been fired but what hypothesis was being considered. The XPLAIN system (Swartout 1983) takes this approach further and links the process of explanation with that of designing the expert system. The system
defines a domain model (including facts about the domain of interest) and domain principles that are heuristics and operators - meta-rules. This represents the strategy used in the system. An automatic programmer then uses these to generate the system. This approach ensures explanation is considered at the early specification stage, and allows the automatic programmer to use one piece of knowledge in several ways (problem solving, strategy, explanation). Approaches such as this recognize the need for meta-knowledge in providing explanation facilities. In order to do this successfully, expert systems must be designed for explanation.
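To illustrate the rule-trace style of explanation described above, here is a minimal Python sketch (hypothetical, not from the book) of a consultation over the dog-diagnosis rules. It asks yes/no questions in rule order and produces a "how?" explanation by restating the rule whose conditions were satisfied.

```python
# Each rule: a conclusion and the conditions that must all hold.
RULES = [
    ("the ears should be cleaned",
     ["the dog is scratching its ears", "the ears are waxy"]),
    ("the dog should be treated for lice",
     ["the dog is scratching its coat", "insects can be seen in the coat",
      "the insects are grey"]),
    ("the dog should be treated for fleas",
     ["the dog is scratching its coat", "insects can be seen in the coat",
      "the insects are black"]),
    ("the dog should be treated for eczema",
     ["the dog is scratching its coat", "there is hair loss",
      "there is inflammation"]),
]

def consult(ask):
    """Try each rule in turn; `ask(condition)` returns True or False.
    Answers are cached so the same question is never asked twice."""
    answers = {}
    def holds(condition):
        if condition not in answers:
            answers[condition] = ask(condition)
        return answers[condition]
    for conclusion, conditions in RULES:
        if all(holds(c) for c in conditions):
            how = ("It follows from the rule:\n  IF " +
                   "\n  AND ".join(conditions) +
                   "\n  THEN " + conclusion)
            return conclusion, how
    return None, "No rule applies."

if __name__ == "__main__":
    def ask(condition):
        return input(f"Is it true that {condition}? (y/n) ").strip().lower().startswith("y")
    conclusion, how = consult(ask)
    print("Conclusion:", conclusion)
    print(how)   # the "how?" explanation: a restatement of the fired rule
```

A "why?" explanation could be produced in the same style, by reporting the rule currently under consideration at the moment a question is asked.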
6.4.2 Dialogue component
The dialogue component is closely linked to the explanation component, as one side of the dialogue involves the user questioning the system at any point in the consultation in the ways we have considered. However, the system must also be able to question the user in order to establish the existence of evidence. The dialogue component has two functions. First, it determines which question to ask next (using meta-rules and the reasoning mechanism to establish what information is required to fire particular rules). Secondly, it ensures that unnecessary questions are not asked, by keeping a record of previous questions. For example, it is not helpful to request the model of a car when the user has already said that he or she doesn't know its make.

The dialogue could be one of three styles:
- system controlled, where the system drives the dialogue through questioning the user
- mixed control, where both user and system can direct the consultation
- user controlled, where the user drives the consultation by providing information to the system.

Most expert systems use the first of these, the rest the second. This is because the system needs to be able to elicit information from the user when it is needed to advance the consultation. If the user controlled the dialogue, the system might not get all the information required. Ideally a mixed dialogue should be provided, allowing the system to request further information and the user to ask for "why?" and "how?" explanations at any point.
6.5 Examples of four expert systems
To illustrate how the components that we have looked at fit together we will consider four early expert systems. Although these systems are not the most up-to-date, they were ground-breaking when they were built, and they have all been successful in their domains. As such they rank among the "classics" of expert systems, and therefore merit a closer look. In each case we will summarize the features of the expert system in terms of the five key components we have identified. This will help you to see how different expert systems can be constructed for different problems. In each case, consider the problem that the expert system was designed to solve, and why the particular components chosen are suited to that task.
6.5.1 Example 1: MYCIN
MYCIN is an expert system for diagnosing and recommending treatment of bacterial infections of the blood (such as meningitis and bacteremia) (Shortliffe 1976). It was developed at Stanford University in California in the 1970s, and has become a template for many similar rule-based systems. It is intended to support clinicians in the early diagnosis and treatment of meningitis, which can be fatal if not treated in time. However, the laboratory tests for these conditions take several days to complete, so doctors (and therefore MYCIN) have to make decisions with incomplete information. A consultation with MYCIN begins with requests for routine information such as age, medical history and so on, progressing to more specific questions as required.
- knowledge representation. Production rules (implemented in LISP).
- reasoning. Backward chaining, goal-driven reasoning. MYCIN uses certainty factors to reason with uncertain information.
- heuristics. When the general category of infection has been established, MYCIN examines each candidate diagnosis in a depth first manner. Heuristics are used to limit the search, including checking all premises of a possible rule to see if any are known to be false.
- dialogue/explanation. The dialogue is computer controlled, with MYCIN driving the consultation through asking questions. Explanations are generated through tracing back through the rules that have been fired. Both "how?" and "why?" explanations are supported.
6.5.2 Example 2: PROSPECTOR
PROSPECTOR is an expert system to evaluate geological sites for potential mineral deposits, again developed at Stanford in the late 1970s (Duda et al. 1979). Given a set of observations on the site's attributes (provided by the user), PROSPECTOR provides a list of minerals, along with probabilities of them being present. In 1984 it was instrumental in discovering a molybdenum deposit worth 100 million dollars!
- knowledge representation. Rules, semantic network.
- reasoning. Predominantly forward chaining (data-driven), with some backward chaining. Bayesian reasoning is used to deal with uncertainty.
- heuristics. Depth first search is focused using the probabilities of each hypothesis.
- dialogue/explanation. The dialogue uses mixed control. The user volunteers information at the start of the consultation, and PROSPECTOR can request additional information when required. Explanations are generated by tracing back through the rules that have been fired.
6.5.3 Example 3: DENDRAL
DENDRAL is one of the earliest expert systems, developed at Stanford during the late 1960s (Lindsay et al. 1980). It infers the molecular structure of organic compounds from chemical formulae and mass spectrography data. It is not a "stand-alone" expert, more an expert's assistant, since it relies on the input of the human expert to guide its decision making. However, it has been successful enough in this capacity to discover results that have been published as original research.
- knowledge representation. Production rules and algorithms for generating graph structures, supplemented by the expert user's knowledge.
- reasoning. Forward chaining (data-driven).
- heuristics. DENDRAL uses a variation on depth first search called generate and test, where all hypotheses are generated and then tested against the available evidence. Heuristic knowledge from the users (chemists) is also used to constrain the search.
- dialogue/explanation. The dialogue uses mixed control. The user can supply information and the system can request information as required.
6.5.4 Example 4: XCON
XCON is a commercial expert system developed by Digital Equipment Corporation to configure VAX computer systems to comply with customer orders (Barker & O'Connor 1989). The problem is one of planning and design: there are up to 100 components in any system and XCON must decide how they can best be spatially arranged to meet the specification. The design also has to meet constraints placed by the functionality of the system, and physical constraints.
- knowledge representation. Production rules.
- reasoning. Forward chaining (data-driven). Since it is possible to specify rules exactly, no uncertainty is present.
- heuristics. The main configuration task is split into subtasks which are always examined in a predetermined order. Constraint satisfaction is used to inform the search for a solution to a subtask.
- dialogue/explanation. The dialogue is less important than in the previous situations since the customer's requirements can be specified at the beginning and the system contains all the information it needs regarding other constraints.

These examples illustrate how the different techniques we have considered in previous chapters can be combined to produce a useful solution, and how different problems require different solutions. We will now look at the practical aspects of building an expert system.
6.6 Building an expert system
We have looked at some of the applications for which expert systems have proved successful, and what components an expert system will have. But how would we go about building one?

First, we need to be certain that expert system technology is appropriate to solve the problem that we have in mind. If the problem falls into one of the categories we have already mentioned, such as diagnosis, planning, design or advice giving, then it has passed the first test. The second consideration is whether the problem can be adequately solved using conventional technology. For example, can it be solved statistically or algorithmically? If the answer to this is no, we need to ask whether the problem justifies the expense and effort required to produce an expert system solution. This usually means that the expert system is expected to save costs in the long term, perhaps by making an operation more efficient or making knowledge more widely available. The problem should also be clearly defined and of reasonable size, since expert system technology cannot handle general or common-sense knowledge.
So we have examined our candidate problem and decided that an expert system would be an appropriate solution; what next? Assuming that we have considered our domain of interest carefully and defined the boundaries of the expert system, our first and most crucial stage is knowledge acquisition. Knowledge acquisition is the process of getting information out of the head of the expert or from the chosen source and into the form required by the expert system. We can identify two phases of this process: knowledge elicitation, where the knowledge is extracted from the expert, and knowledge representation, where the knowledge is put into the expert system. We considered the latter in Chapter 1. Here we will look briefly at knowledge elicitation.
6.6.1 Knowledge elicitation
The knowledge engineer (the title often given to the person developing the expert system) is probably not an expert in the domain of interest. The engineer's first task is therefore to become familiar with the domain through talking to domain experts and reading relevant background material. Once the engineer has a basic level of understanding of the domain he or she can begin knowledge elicitation. There are a number of techniques used to facilitate this. It is the job of the knowledge engineer to spot gaps in the knowledge that is being offered and fill them.

The problem of knowledge elicitation is not a trivial one. To help you to understand the magnitude of the problem, think of a subject on which you would consider yourself expert. Imagine having to formalize all this information without error or omission. Think about some behaviour in which you are skilled (a good example is driving a car): can you formalize all the actions and knowledge required to perform the necessary actions? Alternatively, imagine questioning someone on a topic on which they are expert and you are not. How do you know when information is missing? This is where concrete examples can be useful, since it is easier to spot a conceptual leap in an explanation of a specific example than it is in more general explanations.

Techniques for knowledge elicitation

The interview can capture qualitative information, which is the crux of knowledge elicitation, and therefore provides the key mechanism for acquiring knowledge. There are a number of different types of interview, each of which can be useful for eliciting different types of information. We will consider a number of variants on the interview: the unstructured interview; the structured interview; focused discussion; role reversal and think-aloud.
- unstructured interviews. The unstructured interview is open and exploratory - no fixed questions are prepared and the interviewee is allowed to cover
topics as he or she sees fit. It can be used to set the scene and gather contextual information at the start of the knowledge elicitation process. Probes, prompts and seed questions can be used to encourage the interviewee to provide relevant information. A probe encourages the expert to provide further information without indicating what that information should be. Examples of such questions are "tell me more about that", "and then?", and "yes?". Prompts are more directed and can help return the interview to a relevant topic that is incomplete. Seed questions are helpful in starting an unstructured interview. A general seed question might be: "Imagine you went into a bookshop and saw the book you wished you'd had when you first started working in the field. What would it have in it?" (Johnson & Johnson 1987).
- structured interviews. In structured interviews a framework for the interview is determined in advance. They can involve the use of check-lists or questionnaires to ensure focus is maintained. Strictly structured interviews allow the elicitor to compare answers between experts, whereas less strict interviews - perhaps more accurately termed semi-structured interviews - combine a focus on detail with some freedom to explore areas of interest. Appropriate questions can be difficult to devise without some understanding of the domain. Unstructured interviews are often used initially, followed by structured interviews and more focused techniques.
- focused discussions. A focused discussion is centred around a particular problem or scenario. This may be a case study, a critical incident or a specific solution. Case analysis considers a case study that might occur in the domain or one that has occurred. The expert explains how it would be or was solved, either verbally or by demonstration. Critical incident analysis is a variant of this that looks at unusual and probably serious incidents, such as error situations. In critiquing, the expert is asked to comment on someone else's solution to a problem or design. The expert is asked to review the design or problem solution and identify omissions or errors. This can be helpful as a way of cross-referencing the information provided by different experts, and also provides validation checks, since each solution or piece of information is reviewed by another expert.
- role reversal. Role reversal techniques place the elicitor in the expert's role and vice versa. There are two main types: teach-back interviews and Twenty Questions. In teach-back interviews the elicitor "teaches" the expert on a subject that has already been discussed. This checks the elicitor's understanding and allows the expert to amend the knowledge if necessary. In Twenty Questions, the elicitor chooses a topic from a predetermined set and the expert asks questions about the topic in order to determine which one has been selected. The elicitor can answer yes or
no. The questions asked reflect the expert's knowledge of the topic and therefore provide information about the domain.
- think-aloud. Think-aloud is used to elicit information about specific tasks. The expert is asked to think aloud while carrying out the task. Similarly, the post-task walk-through involves debriefing the expert after the task has been completed. Both techniques are better than simple observation, as they provide information on expert strategy as well as behaviour.

Tool support for knowledge elicitation

There are a number of tools to support knowledge acquisition, and some expert system shells provide this support. For example, the expert system shell MacSMARTS includes an example-based component that allows the domain expert to input information in terms of primary factors, questions and advice (instead of rules), and the system generates a knowledge base from these. This knowledge base can then be validated by the expert. Another system, Teiresias (Davis 1982) (named after the Greek sage), was designed alongside EMYCIN (van Melle et al. 1981) (an expert system shell based on MYCIN) to help the expert enter knowledge. The expert critiques the performance of the system on a small set of rules and some problems. This focuses the knowledge elicitation process on the specific problems.
6.6.2 Representing the knowledge
When the knowledge engineer has become familiar with the domain and elicited some knowledge, it is necessary to decide on an appropriate representation for the knowledge, choosing, for example, to use a frame-based or network-based scheme. The engineer also needs to decide on appropriate reasoning and search strategies. At this point the engineer is able to begin prototyping the expert system, normally using an expert system shell or a high-level AI language.

Expert system shells

An expert system shell abstracts features of one or more expert systems. For example, EMYCIN is an expert system shell derived from the MYCIN system. The shell comprises the inference and explanation facilities of an existing expert system without the domain-specific knowledge. This allows non-programmers to add their own knowledge on a problem of similar structure but to re-use the reasoning mechanisms. A different shell is required for each type of problem, for example to support data-driven or goal-driven reasoning, but one shell can be used for many different domains.
Expert system shells are very useful if the match between the problem and the shell is good, but they are inflexible. They work best in diagnostic and advice-style problems rather than design or constraint satisfaction, and are readily available for most computer platforms. This makes building an expert system using a shell relatively cheap.

High-level programming languages

High-level programming languages, designed for AI, provide a fast, flexible mechanism for developing expert systems. They conceal their implementation details, allowing the developer to concentrate on the application. They also provide inbuilt mechanisms for representation and control. Different languages support different paradigms: for example, PROLOG supports logic, LISP is a functional programming language, and OPS5 is a production system language.

The language we are using for the examples in this book is PROLOG. It provides a database that can represent rules, procedures and facts and can implement pattern matching, control (via depth first searching and backward chaining) and structured representations. It can be used to build simple expert systems, or to build a more complex expert system shell.

However, high-level languages do demand certain programming skills in the user, particularly to develop more complex systems, so they are less suitable for the "do-it-yourself" expert system developer. Some environments have been developed that support more than one AI programming language, such as POPLOG, which incorporates LISP and PROLOG. These provide even more flexibility and some programming support, but still require programming skills.

Selecting a tool

There are a number of things to bear in mind when choosing a tool to build an expert system. First, select a tool which has only the features you need. This will reduce the cost and increase the speed (both in terms of performance and development). Secondly, let the problem dictate the tool to use, where possible, not the available software. This is particularly important with expert system shells, where choosing a shell with the wrong reasoning strategy for the problem will create more difficulties than it solves. Think about the problem in the abstract first and plan your design. Consider your problem against the following abstract problem types:
- problems with a small solution space and reliable data
- problems with unreliable or uncertain data
- problems where the solution space is large but where you can reduce it, say using heuristics
- problems where the solution space is large but not reducible.
Each of these would need a different approach. Look also at example systems (such as the four we have touched on). Try to find one that is solving a similar problem to yours and look at its structure. Only when you have decided on the structure and techniques that are best for your problem should you look for an appropriate tool. Finally, choose a tool with built-in explanation and debugging facilities if possible; such tools are easier to use and test and will save time in implementation.
6.7
Limitations of expert systems
We have looked at expert systems, what they are used for and how to build one. But what are the current limitations of expert system technology that might affect our exploitation of them? We have already come across a number of limitations in our discussion, but we will reconsider them here. First, there is the problem of knowledge acquisition: it is not an easy task to develop complete, consistent and correct knowledge bases. Experts are generally poor at expressing their knowledge, and non-expert (in the domain) knowledge engineers may not know what they are looking for. Some tool support is available, and using a structured approach can alleviate the problem, but it remains a bottleneck in expert system design. A second problem is the verification of the knowledge stored. The knowledge may be internally consistent but inaccurate, due to either expert error or misunderstanding at the acquisition stage. Validation of data is usually done informally, on the basis of the performance of the system, but this makes it more difficult to isolate the cause of an observed error. Knowledge elicitation techniques such as critiquing, where the domain expert assesses the knowledge base in stages as it is developed, help to alleviate this problem, although the verification is still subjective. Thirdly, expert systems are highly domain dependent and are therefore brittle. They cannot fall back on general or common-sense knowledge or generalize their knowledge to unexpected cases. A new expert system is therefore required for each problem (although the shells can be re-used) and the solution is limited in scope. An additional problem with brittleness is that the user may not know the limitations of the system. For example, in a PROLOG-based system a goal may be proved false if the system has knowledge that it is false or if the system does not have knowledge that it is true. So the user may not know whether the goal is in fact false or whether the knowledge base is incomplete. Finally, expert systems lack meta-knowledge, that is knowledge about their own operations, so they cannot reason about their limitations or the effect of these on the decisions that are made. They cannot decide to use a different reasoning or search strategy if it is more appropriate or provide more informative explanations.
6.8
Hybrid expert systems
One possible solution to some of the limitations of expert systems is to combine the knowledge-based technology of expert systems with technologies that learn from examples, such as neural networks and inductive learning. These classify instances of an object or event according to their closeness to previously trained examples, and therefore do not require explicit knowledge representation (see Ch. 11 and Ch. 4 for more details). However, they tend to be "black-box" techniques, which are poor at providing explanations for their decisions. It has been suggested that these techniques may help in a number of areas. We have also mentioned MacSMARTS, which uses inductive learning to alleviate the knowledge elicitation problem. Other suggestions are to use the techniques to deal with uncertainty since they are more tolerant of error and incompleteness than knowledge-based systems. This would result in a "hybrid" expert system using both methods in parallel. Another area in which they could be helpful is pruning search spaces, since they could be trained to recognize successful search paths. However, this is still very much a research area and there is some suspicion on either side as to the value of the other approach.
6.9
Summary
In this chapter we looked at the main applications of expert systems and the components that we would expect to see in an expert system. We then examined the structure of four classic expert systems in order to illustrate how different types of problem can be solved. We considered the stages in building an expert system, concentrating on knowledge acquisition and choosing appropriate tools. Finally, we considered some of the limitations of current expert system technology and some possible solutions for the future.
6.10
Exercises
1. You are asked to advise on the use of expert systems for the following tasks. Outline appropriate reasoning methods and other key expert system features for each application.
- a system to advise on financial investment (to reduce enquiries to a bank's human advisor)
- a medical diagnosis system to help doctors
- a kitchen design system to be used by salesmen.
2. Working in small groups and using the information below (extending it where necessary)
- formalize the knowledge as a set of rules (of the form IF evidence THEN hypothesis)
- calculate certainty factors (see Ch. 2) for each hypothesis given the evidence (estimate measures of belief and disbelief from the statements made)
- use an expert system shell to implement this knowledge.
There are a number of reasons why a car might overheat. If the radiator is empty it will certainly overheat. If it is half full this may cause overheating but is quite likely not to. If the fan belt is broken or missing the car will again certainly overheat. If it is too tight it may cause this problem but not always. Another possible cause is a broken or jammed thermostat, or too much or too little oil. If the engine is not tuned properly it may also overheat but this is less likely. Finally, the water pump may be broken. If none of these things is the cause, the temperature gauge might be faulty (the car is not overheating at all). Also the weather and the age of the car should be considered (older cars are more likely to overheat). A combination of any of the above factors would increase the likelihood of overheating.
6.11
Recommended further reading
Jackson, P. 1990. An introduction to expert systems, 2nd edn. Wokingham: Addison-Wesley. Detailed coverage of many of the topics introduced here as well as other aspects of expert systems. An excellent next step for anyone wanting to know more about the subject.
Medsker, L. & J. Liebowitz 1994. Design and development of expert systems and neural networks. New York: Macmillan. A book that attempts to provide a balanced view of the role of traditional and connectionist techniques in the practical development of expert systems.
Goonatilake, S. & S. Khebbal (eds) 1995. Intelligent hybrid systems. Chichester: John Wiley. A collection of papers detailing some of the research in using hybrid techniques in expert systems and knowledge acquisition.
Kidd, A. (ed.) 1989. Knowledge acquisition for expert systems: a practical handbook. New York: Plenum Press. A collection of papers discussing a range of knowledge elicitation techniques. A worthwhile read for anyone wanting to gather information to build an expert system.
Chapter Seven
Natural language understanding
7.1
Overview
Natural language understanding is one of the most popular applications of artificial intelligence portrayed in fiction and the media. The idea of being able to control computers by talking to them in our own language is very attractive. But natural language is ambiguous, which makes natural language understanding particularly difficult. In this chapter we examine the stages of natural language understanding - syntactic analysis, semantic analysis and pragmatic analysis - and some of the techniques that are used to make sense of this ambiguity.
7.2
What is natural language understanding?
Whenever computers are represented in science fiction, futuristic literature or film, they invariably have the ability to communicate with their human users in natural language. By "natural language", we mean a language for human communication such as English, French, Swahili, or Urdu, as opposed to a formal "created" language (for example, a programming language or Morse code). Unlike computers in films, which understand spoken language, we will concern ourselves primarily with understanding written language, rather than speech, and with analysis rather than language generation. As we shall see, this will present enough challenges for one chapter! Understanding speech shares the same difficulties, but has additional problems with deciphering the sound signal and identifying word parts.
7.3
Why do we need natural language understanding?
Before we consider how natural language understanding can be achieved, we should be clear about the benefits that it can bring. There are a number of areas that can be helped by the use of natural language. The first is human-computer interaction, by the provision of natural language interfaces for the user. This would allow the user to communicate with computer applications in their own language, rather than in a command language or using menus. There are advantages and disadvantages to this: it is a natural form of communication that requires no specialized training, but it is inefficient for expert users and less precise than a command language. It may certainly be helpful in applications that are used by casual users (for example, tourist information) or for novice users. A second area is information management, where natural language processing could enable automatic management and processing of information, by interpreting its content. If the system could understand the meaning of a document it could, for example, store it with other similar documents. A third possibility is to provide an intuitive means of database access. At present most databases can be accessed through a query language. Some of these are very complex, demanding considerable expertise to generate even relatively common queries. Others are based on forms and menus, providing a simpler access mechanism. However, these still require the user to have some understanding of the structure of the database. The user, on the other hand, is usually more familiar with the content of the database, or at least its domain. By allowing the user to ask for information using natural language, queries can be framed in terms of the content and domain rather than the structure. We will look at a simple example of database query using natural language later in the chapter.
7.4
Why is natural language understanding difficult?
The primary problem with natural language processing is the ambiguity of language. There are a number of levels at which ambiguity may occur in natural language (of course a single sentence may include several of these levels). First, a sentence or phrase may be ambiguous at a syntactic level. Syntax relates to the structure of the language, the way the words are put together. Some word sequences make valid sentences in a given language, some do not. However, some sentence structures have more than one correct interpretation. These are syntactically ambiguous. Secondly, a sentence may be ambiguous at a lexical level. The lexical level is the word level, and ambiguity here occurs when a word can have more than one meaning. Thirdly, a sentence may be ambiguous at a referential level. This is concerned with what the sentence (or a part of the sentence) refers to. Ambiguity occurs when it is not clear what the sentence is
referring to or where it may legally refer to more than one thing. Fourthly, a sentence can be ambiguous at a semantic level, that is, at the point of the meaning of the sentence. Sometimes a sentence is ambiguous at this level: it has two different meanings. Indeed this characteristic is exploited in humour, with the use of double entendre and innuendo. Finally, a sentence may be ambiguous at a pragmatic level, that is at the level of interpretation within its context. The same word or phrase may have different interpretations depending on the context in which it occurs. To make things even more complicated some sentences involve ambiguity at more than one of these levels. Consider the following sentences; how many of them are ambiguous and how?
1. I hit the man with the hammer.
2. I went to the bank.
3. He saw her duck.
4. Fred hit Joe because he liked Harry.
5. I went to the doctor yesterday.
6. I waited for a long time at the bank.
7. There is a drought because it hasn't rained for a long time.
8. Dinosaurs have been extinct for a long time.
How did you do? In fact all the sentences above have some form of ambiguity. Let's look at them more closely.
- I hit the man with the hammer. Was the hammer the weapon used or was it in the hand of the victim? This sentence contains syntactic ambiguity: there are two perfectly legitimate ways of interpreting the sentence structure.
- I went to the bank. Did I visit a financial institution or go to the river bank? This sentence is ambiguous at a lexical level: the word "bank" has two meanings, either of which fits in this sentence.
- He saw her duck. Did he see her dip down to avoid something or the web-footed bird owned by her? This one is ambiguous at a lexical and a semantic level. The word "duck" has two meanings and the sentence can be interpreted in two completely different ways.
- Fred hit Joe because he liked Harry. Who is it that likes Harry? This is an example of referential ambiguity. Who does the pronoun "he" refer to, Fred or Joe? It is not clear from this sentence structure.
- I went to the doctor yesterday. When exactly was yesterday? This demonstrates pragmatic ambiguity. In some situations this may be clear but not in all. Does yesterday refer literally to the day preceding today or does it
refer to another yesterday (imagine I am reading this sentence a week after it was written, for example). The meaning depends on the context.
- I waited for a long time at the bank.
- There is a drought because it hasn't rained for a long time.
- Dinosaurs have been extinct for a long time.
The last three sentences can be considered together. What does the phrase for a long time mean? In each sentence it clearly refers to a different amount of time. This again is pragmatic ambiguity. We can only interpret the phrase through our understanding of the sentence context. In addition to these major sources of ambiguity, language is problematic because it is imprecise, incomplete, inaccurate, and continually changing. Think about the conversations you have with your friends. The words you use may not always be quite right to express the meaning you intend, you may not always finish a sentence, you may use analogies and comparisons to express ideas. As humans we are adept at coping with these things, to the extent that we can usually understand each other if we speak the same language, even if words are missed out or misused. We usually have enough knowledge in common to disambiguate the words and interpret them correctly in context. We can also cope quickly with new words. This is borne out by the speed with which slang and street words can be incorporated into everyday usage. All of this presents an extremely difficult problem for the computer.
7.5
An early attempt at natural language understanding: SHRDLU
We met SHRDLU briefly in the Introduction. If you recall, SHRDLU is the natural language processing system developed by Winograd at MIT in the early 1970s (Winograd 1972). It is used for controlling a robot in a restricted "blocks" domain. The robot's world consists of a number of blocks of various shapes, sizes and colours, which it can manipulate as instructed or answer questions about. All instructions and questions are given in natural language and even though the robot's domain is so limited, it still encounters the problems we have mentioned. Consider for example the following instructions:
Find a block that is taller than the one you are holding and place it in the box
How many blocks are on top of the green block?
Put the red pyramid on the block in the box
Does the shortest thing the tallest pyramid's support supports support anything green?
What problems did you spot? Again each instruction contains ambiguity of some kind. We'll leave it to you to figure them out! (The answers are given at the end of the chapter in case you get stuck.) However, SHRDLU was successful because it could be given complete knowledge about its world and ambiguity could be reduced (it only recognizes one meaning of "block" for instance and there is no need for contextual understanding since the context is given). It is therefore no use as a general natural language processor. However, it did provide insight into how syntactic and semantic processing can be achieved. We will look at techniques for this and the other stages of natural language understanding next.
7.6
How does natural language understanding work?
So given that, unlike SHRDLU, we are not able to provide complete world knowledge to our natural language processor, how can we go about interpreting language? There are three primary stages in natural language processing: syntactic analysis, semantic analysis and pragmatic analysis. Sentences can be well formed or ill formed syntactically, semantically and pragmatically. Take the following responses to the question: Do you know where the park is?
- The park is across the road. This is syntactically, semantically, and pragmatically well formed, that is, it is a correctly structured, meaningful sentence which is an appropriate response to the question.
- The park is across the elephant. This is syntactically well formed but semantically ill formed. The sentence is correctly structured but our knowledge of parks and elephants and their characteristics shows it is meaningless.
- The park across the road is. This is syntactically ill formed. It is not a legal sentence structure.
- Yes. This is pragmatically ill formed: it misses the intention of the questioner.
At each stage in processing, the system will determine whether a sentence is well formed. These three stages are not necessarily always separate or sequential. However, it is convenient to consider them as such. Syntactic analysis determines whether the sentence is a legal sentence of the language, or generates legal sentences, using a grammar and lexicon, and, if so, returns a parse tree for the sentence (representing its structure). This is the process of parsing. Take a simple sentence, "The dog sat on the rug." It has a number of constituent parts: nouns ("dog" and "rug"), a verb ("sat"), determiners ("the") and a preposition ("on"). We can also see that it has a definite structure: noun
followed by verb followed by preposition followed by noun (with a determiner associated with each noun). We could formalize this observation:
sentence = determiner noun verb preposition determiner noun
Such a definition could then be tested on other sentences. What about "The man ran over the hill."? This too fits our definition of a sentence. Looking at these two sentences, we can see certain patterns emerging. For instance, the determiner "the" always seems to be attached to a noun. We could therefore simplify our definition of a sentence by defining a sentence component called noun_phrase
noun_phrase = determiner noun
Our sentence definition would then become
sentence = noun_phrase verb preposition noun_phrase
This is the principle of syntactic grammars. The grammar is built up by examining legal sentence structures and a lexicon is produced identifying the constituent type of each word. In our case our lexicon would include
dog: noun
the: determiner
rug: noun
sat: verb
and so on. If a legal sentence is not parsed by the grammar then the grammar must be extended to include that sentence definition as well. Although our grammar looks much like a standard English grammar, it is not. Rather, we create a grammar that exactly specifies legal constructions of our language. In practice such grammars do bear some resemblance to conventional grammar, in that the symbols that are chosen to represent sentence constituents often reflect conventional word types, but do not confuse this with any grammar you learned at school! Semantic analysis takes the parse tree for the sentence and interprets it according to the possible meanings of its constituent parts. A representation of semantics may include information about different meanings of words and their characteristics. For example, take the sentence "The necklace has a diamond on it." Our syntactic analysis of this would require another definition of sentence than the one we gave above:
sentence = noun_phrase verb noun_phrase prepositional_phrase
prepositional_phrase = preposition pronoun
This gives us the structure of the sentence, but the meaning is still unclear. This is because the word diamond has a number of meanings. It can refer to a precious stone, a geometric shape, even a baseball field. The semantic analysis would
consider each meaning and match the most appropriate one according to its characteristics. A necklace is jewellery and the first meaning is the one most closely associated with jewellery, so it is the most likely interpretation. Finally, in pragmatic analysis, the sentence is interpreted in terms of its context and intention. For example, a sentence may have meanings provided by its context or social expectations that are over and above the semantic meaning. In order to understand the intention of sentences it is important to consider these. To illustrate, consider the sentence "He gave her a diamond ring." Semantically this means that a male person passed possession of a piece of hand jewellery made with precious stones over to a female person. However, there are additional likely implications of this sentence. Diamond rings are often (though of course not exclusively) given to indicate engagement, for example, so the sentence could mean the couple got engaged. Such additional, hidden meanings are the domain of pragmatic analysis.
7.7
Syntactic analysis
Syntactic analysis is concerned with the structure of the sentence. Its role is to verify whether a given sentence is a valid construction within the language, and to provide a representation of its structure, or to generate legal sentences. There are a number of ways in which this can be done. Perhaps the simplest option is to use some form of pattern matching. Templates of possible sentence patterns are stored, with variables to allow matching to specific sentences. For example, the template <the ** rides **> (where ** matches anything) fits lots of different sentences, such as the showjumper rides a clear round or the girl rides her mountain bike. These sentences have similar syntax (both are basically noun_phrase verb noun_phrase), so does this mean that template matching works? Not really. What about the sentence the theme park rides are terrifying? This also matches the template but is clearly a very different sentence structure to the first two. For a start, in the first two sentences "rides" is a verb, whereas here it is a noun. This highlights the fundamental flaw in template matching. It has no representation of word types, which essentially means it cannot ensure that words are correctly sequenced and put together. Template matching is the method used in ELIZA (Weizenbaum 1966), which, as we saw in the Introduction, fails to cope with ambiguity and so can accept (and generate) garbage. These are problems inherent in the approach: it is too simplistic to deal with a language of any complexity. However, it is a simple approach that may be useful in very constrained environments (whether such a restricted use of language could be called "natural" is another issue).
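A rough sketch of this kind of template matching in PROLOG is shown below; the template <the ** rides **> becomes a simple pattern over a list of words. The predicate name matches_template/1 and the word-list representation of sentences are our own illustrative assumptions.

    :- use_module(library(lists)).    % for append/3

    % <the ** rides **>: the sentence starts with "the" and "rides" occurs somewhere later;
    % each ** is matched by an arbitrary (possibly empty) sequence of words.
    matches_template([the | Rest]) :-
        append(_, [rides | _], Rest).

    % ?- matches_template([the, girl, rides, her, mountain, bike]).       succeeds
    % ?- matches_template([the, theme, park, rides, are, terrifying]).    also succeeds!

The second query succeeds even though rides is a noun there, which is exactly the flaw described above: with no notion of word types the template cannot tell the two structures apart.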
A more viable approach to syntactic analysis is sentence parsing. Here the input sentence is converted into a hierarchical structure indicating the sentence constituents. Parsing systems have two main components:
1. a grammar: a declarative representation of the syntactic facts about the language
2. a parser: a procedure to compare the input sentence with the grammar.
Parsing may be top down, in which case it starts with the symbol for a sentence and tries to map possible rules to the input (or target) sentence, or bottom up, where it starts with the input sentence and works towards the sentence symbol, considering all the possible representations of the input sentence. The choice of which type of parsing to use is similar to that for top-down or bottom-up reasoning; it depends on factors such as the amount of branching each will require and the availability of heuristics for evaluating progress. In practice, a combination is sometimes used. There are a number of parsing methods. These include grammars, transition networks, context-sensitive grammars and augmented transition networks. As we shall see, each has its benefits and drawbacks.
7.7.1
Grammars
We have already met grammars informally. A grammar is a specification of the legal structures of a language. It is essentially a set of rewrite rules that allow any element matching the left-hand side of the rule to be replaced by the right-hand side. So, for example, the rule A → B allows the string XAX to be rewritten XBX. Unlike template matching, it explicitly shows how words of different types can be combined, and defines the type of any given word. In this section we will examine grammars more closely, and demonstrate how they work through an example. A grammar has three basic components: terminal symbols, non-terminal symbols and rules. Terminal symbols are the actual words that make up the language (this part of the grammar is called the lexicon). So "cat", "dog", "chase" are all terminal symbols. Non-terminal symbols are special symbols designating structures of the language. There are three types:
- lexical categories, which are the grammatical categories of words, such as noun or verb
- syntactic categories, which are the permissible combinations of lexical categories, for instance "noun_phrase", "verb_phrase"
- a special symbol representing a sentence (the start symbol).
The third component of the grammar is the rules, which govern the valid combinations of the words in the language. Rules are sometimes called phrase structure rules. A rule is usually of the form
S → NP VP
where S represents the sentence, NP a noun_phrase and VP a verb_phrase. This rule states that a noun_phrase followed by a verb_phrase is a valid sentence. The grammar can generate all syntactically valid sentences in the language and can be implemented in a number of ways, for example as a production system implemented in PROLOG. We will look at how a grammar is generated and how it parses sentences by considering a detailed example.
7.7.2
An example: generating a grammar fragment
Imagine we want to produce a grammar for database queries on an employee database. We have examples of possible queries. We can generate a grammar fragment by analyzing each query sentence. If the sentence can be parsed by the grammar we have, we do nothing. If it can't, we can add rules and words to the grammar to deal with the new sentence. For example, take the queries
Who belongs to a union?
Does Sam Smith work in the IT Department?
In the case of the first sentence, Who belongs to a union?, we would start with the sentence symbol (S) and generate a rule to match the sentence in the example. To do this we need to identify the sentence constituents (the non-terminal symbols). Remember that the choice of these does not depend on any grammar of English we may have learned at school. We can choose any symbols, as long as they are used consistently. We designate the symbol RelP to indicate a relative pronoun, such as "who", "what" (a lexical category), and the symbol VP to designate a verb_phrase (a syntactic category). We then require rules to show how our lexical categories can be constructed. In this case VP has the structure V (verb) PP (prepositional phrase), which can be further decomposed as P, a preposition, followed by NP, a noun_phrase. Finally the NP category is defined as Det (determiner) followed by N (noun). The terminal symbols are associated with a lexical category to show how they can fit together in a sentence. We end up with the grammar fragment in Figure 7.1. This will successfully parse our sentence, as shown in the parse tree in Figure 7.2, which represents the hierarchical breakdown of the sentence. The root of the tree is the sentence symbol. Each branch of the tree represents a non-terminal symbol, either a syntactic category or a lexical category. The leaves of the tree are the terminal symbols. However, our grammar is still very limited. To extend the grammar, we need to analyze many sentences in this way, until we end up with a very large grammar
S → RelP VP
VP → V PP
PP → P NP
NP → Det N
who: RelP
belongs: V
to: P
a: Det
union: N
Figure 7.1 Initial grammar fragment.
Figure 7.2 Parse tree for the first sentence.
S → AuxV NP VP
NP → PN
NP → Det PN
does: AuxV
Sam Smith: PN
work: V
in: P
the: Det
IT Department: PN
Figure 7.3 Further grammar rules.
and lexicon. As we analyze more sentences, the grammar becomes more complete and, we hope, less work is involved in adding to it. We will analyze just one more sentence. Our second query was Does Sam Smith work in the IT Department? First, we check whether our grammar can parse this sentence successfully. If you recall, our only definition of a sentence so far is
S → RelP VP
Taking the VP part first, work in the IT Department does meet our definition of a verb phrase, if we interpret IT Department loosely as a noun. However, Does Sam Smith is certainly not a RelP. We therefore need another definition of a sentence. In this case a sentence is an auxiliary verb (AuxV) followed by an NP followed by a VP. Since Sam Smith is a proper noun we also need an additional definition of NP, and for good measure we will call IT Department a proper noun as well, giving us a third definition of NP. The additional grammar rules are shown in Figure 7.3. Note that we do not need to add a rule to define VP since our previous rule fits the structure of this sentence as well. A parse tree for this sentence using this grammar is shown in Figure 7.4.
Figure 7.4 Parse tree for the second sentence.
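As a concrete illustration, the grammar fragment of Figures 7.1 and 7.3 can be written almost directly in PROLOG using its definite clause grammar (DCG) notation. This is a sketch under our own assumptions about tokenization (for instance, that Sam Smith and IT Department each arrive as a single token); it is not intended as the way any particular parser is built.

    % Rules (one DCG clause per grammar rule)
    s  --> relp, vp.            % S  -> RelP VP
    s  --> auxv, np, vp.        % S  -> AuxV NP VP
    vp --> v, pp.               % VP -> V PP
    pp --> p, np.               % PP -> P NP
    np --> det, n.              % NP -> Det N
    np --> pn.                  % NP -> PN
    np --> det, pn.             % NP -> Det PN

    % Lexicon
    relp --> [who].             auxv --> [does].
    v    --> [belongs].         v    --> [work].
    p    --> [to].              p    --> [in].
    det  --> [a].               det  --> [the].
    n    --> [union].
    pn   --> ['Sam Smith'].     pn   --> ['IT Department'].

    % ?- phrase(s, [who, belongs, to, a, union]).                          succeeds
    % ?- phrase(s, [does, 'Sam Smith', work, in, the, 'IT Department']).   succeeds

PROLOG translates each DCG rule into an ordinary clause, so the grammar can be queried, traced and extended like any other program, and backtracking tries alternative rules automatically.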
7.7.3
Transition networks
The transition network is a method of parsing that represents the grammar as a set of finite state machines. A finite state machine is a model of computational behaviour where each node represents an internal state of the system and the arcs are the means of moving between the states. In the case of parsing natural 147
language, the arcs in the networks represent either a terminal or a non-terminal symbol. Rules in the grammar correspond to a path through a network. Each non-terminal is represented by a different network. To illustrate this we will represent the grammar fragment that we created earlier using transition networks. All rules are represented but to save space only some lexical categories are included. Others would be represented in the same way. In Figure 7.5 each network represents the rules for one non-terminal as paths from the initial state (I) to the final state (F). So, whereas we had three rules for NP in our grammar, here we have a single transition network, with three possible paths through it representing the three rules. To move from one state to the next through the network the parser tests the label on the arc. If it is a terminal symbol, the parser will check whether it matches the next word in the input sentence. If it is a non-terminal symbol, the parser moves to the network for that symbol and attempts to find a path through that. If it finds a path through that network it returns to the higher-level network and continues. If the parser fails to find a path at any point it backtracks and attempts another path. If it succeeds in finding a path, the sentence is a valid one. So to parse our sentence Who belongs to a union? the parser would start at the sentence network and find that the first part of a sentence is RelP. It would therefore go to the RelP network and test the first word in the input sentence "who" against the terminal symbol on the arc. These match, so that network has been traversed successfully and the parser returns to the sentence network able to cross the arc RelP. Parsing of the sentence continues in this fashion until the top-level sentence network is successfully traversed. The full navigation of the network for this sentence is shown in Figure 7.6.
Figure 7.5 Transition network.
The transition network allows each non-terminal to be represented in a single network rather than by numerous rules, making this approach more concise than grammars. However, as you can see from the network for just two sentences, the approach is not really tenable for large languages since the networks would become unworkable. Another disadvantage over grammars is that the transition network does not produce a parse tree for sentences and tracing the path through the network can be unclear for complex sentences. However, the transition network is an example of a simple parsing algorithm that forms the basis of more powerful tools, such as augmented transition networks, which we will consider in Section 7.7.6.
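The sketch below shows one way a transition network might be represented and traversed in PROLOG. The arc/4 facts, the choice of the atom f for final states and the predicate names traverse/3, path/4 and step/3 are our own illustrative conventions, and only the arcs needed for the sentence Who belongs to a union? are included.

    % arc(Network, FromState, ToState, Label): Label is word(W) or net(SubNetwork).
    arc(s,    0, 1, net(relp)).      arc(s,  1, f, net(vp)).
    arc(relp, 0, f, word(who)).
    arc(vp,   0, 1, word(belongs)).  arc(vp, 1, f, net(pp)).
    arc(pp,   0, 1, word(to)).       arc(pp, 1, f, net(np)).
    arc(np,   0, 1, word(a)).        arc(np, 1, f, word(union)).

    % traverse(Net, Words, Rest): a path through Net consumes a prefix of Words.
    traverse(Net, Words, Rest) :-
        path(Net, 0, Words, Rest).

    path(_, f, Words, Words).                % final state reached
    path(Net, State, Words0, Rest) :-
        arc(Net, State, Next, Label),
        step(Label, Words0, Words1),         % consume input for this arc
        path(Net, Next, Words1, Rest).

    step(word(W), [W | Ws], Ws).             % terminal: match the next word
    step(net(N), Words0, Words1) :-          % non-terminal: traverse the sub-network
        traverse(N, Words0, Words1).

    % ?- traverse(s, [who, belongs, to, a, union], []).   succeeds

PROLOG's backtracking gives the retry-another-path behaviour described above for free: if one arc leads to a dead end, the next matching arc is tried.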
7.7.4
Context-sensitive grammars
The grammars considered so far are context-free grammars. They allow a single non-terminal on the left-hand side of the rule. The rule may be applied to any
instance of that symbol, regardless of context. So the rule A → B will match an occurrence of A whether it occurs in the string ABC or in ZAB. The context-free grammar cannot restrict this to only instances where A occurs surrounded by Z and B. In order to interpret the symbol in context, a context-sensitive grammar is required. This allows more than one symbol on the left-hand side, and insists that the right-hand side is at least as long as the left-hand side. So in a context-sensitive grammar, we can have rules of the form
ZAB → ZBB
Context-free grammars are not sufficient to represent natural language syntax. For example, they cannot distinguish between plural and singular nouns or verbs. So in a context-free grammar, if we have a set of simple definitions
S → NP VP
NP → Det N
VP → V
and the following lexicon
dog: N
guide: V
the: Det
dogs: N
guides: V
a: Det
we would be able to generate the sentences the dog guides and the dogs guide, both legal English sentences. However, we would also be able to generate sentences such as a dogs guides, which is clearly not an acceptable sentence. By incorporating the context of agreement into the left-hand side of the rule we can provide a grammar which can resolve this kind of problem. An example is shown in Figure 7.7. The use of the symbols "Sing" and "Plur" to indicate agreement does not allow generation of sentences that violate consistency rules. For example, using the grammar in Figure 7.7 we can derive the sentence "a dog guides" but not "a dogs guides". The derivation of the former uses the following substitutions:
S
NP VP
Det AGR N VP
Det Sing N VP
a Sing N VP
a dog Sing VP
a dog Sing V
a dog guides
S → NP VP
NP → Det AGR N
AGR → Sing
AGR → Plur
Sing VP → Sing V
Plur VP → Plur V
dog Sing: Sing N
dogs Plur: Plur N
a Sing: Det Sing
the Sing: Det Sing
the Plur: Det Plur
guides: Sing V
guide: Plur V
Figure 7.7 Grammar fragment for context-sensitive grammar.
Unfortunately, context sensitivity increases the size of the grammar considerably, making it a complex method for a language of any size. Feature sets and augmented transition networks are alternative approaches to solving the context problem.
7.7.5
Feature sets
Another approach to incorporating context in syntactic processing is the use of feature sets. Feature sets provide a mechanism for subclassifying syntactic categories (noun, verb, etc.) in terms of contextual properties such as number agreement and verb tense. The descriptions of the syntactic categories are framed in terms of constraints. There are many variations of feature sets, but here we shall use one approach to illustrate the general principle - that of Pereira and
Warren's Definite Clause Grammar (Pereira & Warren 1980). In this grammar each syntactic category has an associated feature set, together with constraints that indicate what context is allowable. So, for example,
S → NP (agreement = ?a) VP (agreement = ?b) : a = b
Feature sets are a relatively efficient mechanism for representing syntactic context. However, we have still not progressed to understanding any semantics of the sentence. Augmented transition networks provide an approach that begins to bridge the gap between syntactic and semantic processing.
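In PROLOG this kind of feature passing falls out naturally from DCG arguments: each non-terminal carries an agreement feature and unification plays the role of the constraint a = b. A minimal sketch, using our own feature values sing and plur:

    s       --> np(Agr), vp(Agr).     % the shared variable Agr enforces agreement
    np(Agr) --> det(Agr), n(Agr).
    vp(Agr) --> v(Agr).

    det(_)    --> [the].              % "the" agrees with anything
    det(sing) --> [a].
    n(sing)   --> [dog].
    n(plur)   --> [dogs].
    v(sing)   --> [guides].
    v(plur)   --> [guide].

    % ?- phrase(s, [a, dog, guides]).   succeeds
    % ?- phrase(s, [a, dogs, guide]).   fails: "a" and "dogs" cannot agree

This is essentially the mechanism of Pereira and Warren's definite clause grammars, and it is also close in spirit to the contextual tests on arcs described next for augmented transition networks.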
7.7.6
Augmented transition networks
The augmented transition network provides context without an unacceptable increase in complexity (Woods 1970). It is a transition network that allows procedures to be attached to arcs to test for matching context. All terminals and non-terminals have frame-like structures associated with them that contain their contextual information. To traverse an arc, the parser tests whatever contextual features are required against these stored attributes. For example, a test on the V arc may be to check number (i.e. plural or singular). The structure for the word guides would contain, among other things, an indication that the word is singular. The sentence is only parsed successfully if all the contextual checks are consistent. Augmented transition networks can be used to provide semantic information as well as syntactic, since information about meaning can also be stored in the structures. They are therefore a bridge between syntactic analysis and the next stage in the process, semantic analysis.
7.8
Semantic analysis
Syntactic analysis shows us that a sentence is correctly constructed according to the rules of the language. However, it does not check whether the sentence is meaningful, or give information about its meaning. For this we need to perform semantic analysis. Semantic analysis enables us to determine the meaning of the sentence, which may vary depending on context. So, for example, a system for understanding children's stories and a natural language interface may assign different meanings to the same words. Take the word "run", for example. In a children's story this is likely to refer to quick movement, while in a natural language interface it is more likely to be an instruction to execute a program. There are two levels at which semantic analysis can operate: the lexical level and the sentence level. Lexical processing involves looking up the meaning of the word in the lexicon.
However, many words have several meanings within the same lexical category (for example, the noun "square" may refer to a geometrical shape or an area of a town). In addition, the same word may have further meanings under different lexical categories: "square" can also be an adjective meaning "not trendy", or a verb meaning "reconcile". The latter cases can be disambiguated syntactically but the former rely on reference to known properties of the different meanings. Ultimately, words are understood in the context of the sentences in which they occur. Therefore lexical processing alone is inadequate. Sentence-level processing, on the other hand, does take context into account. There are a number of approaches to sentence-level processing. We will look briefly at two: semantic grammars and case grammars.
7.8.1
Semantic grammars
As we have seen, syntactic grammars enable us to parse sentences according to their structure and, in the case of context-sensitive grammars, such attributes as number and tense. However, syntactic grammars provide no representation of the meaning of the sentence, so it is still possible to parse nonsense if it is written in correctly constructed sentences. In a semantic grammar (Burton 1976), the symbols and rules have semantic as well as syntactic significance. Semantic actions can also be associated with a rule, so that a grammar can be used to translate a natural language sentence into a command or query. Let us take another look at our database query system.
An example: a database query interpreter revisited
Recall the problem we are trying to address. We want to produce a natural language database query system for an employee database that understands questions such as Who belongs to a union? and Does Sam Smith work in the IT Department? We have already seen how to generate a syntactic grammar to deal with these sentences but we really need to derive a grammar that takes into account not only the syntax of the sentences but their meaning. In the context of a query interpreter, meaning is related to the form of the query that we will make to the database in response to the question. So what we would like is a grammar that will not only parse our sentence, but interpret its meaning and convert it into a database query. This is exactly what we can do with a semantic grammar. In the following grammar, a query is built up as part of the semantic analysis of the sentence: when a rule is matched, the query template associated with it (shown in square brackets) is instantiated. The grammar is generated as follows. First, sentence structures are identified. Our sentences represent two types of question: the first is looking for information (names of union members), the second for a yes/no answer. So we define two legal sentence structures, the
first seeking information and preceded by the word "who", the second seeking a yes/no response, preceded by the word "does". The action associated with these rules is to set up a query which will be whatever is the result of parsing the INFO or YN structures. Having done this we need to determine the structure of the main query parts. We will concentrate on the INFO category to simplify matters but the YN category is generated in the same way. Words are categorized in terms of their meaning to the query (rather than, for example, their syntactic category). Therefore, the words "belong to" and "work in" are semantically equivalent, because they require the same query (but with different information) to answer. Both are concerned with who is in what organization. Similarly, "union" and "department" are also classed as semantically equivalent: they are both examples of a type of organization. Obviously, such an interpretation is context dependent. If, instead of a query interpreter, we wanted our natural language processing system to understand a political manifesto, then the semantic categories would be very different. INFO is therefore a structure that consists of an AFFIL_VB (another category) followed by an ORG. Its associated action is to return the query that results from parsing AFFIL_VB. The rest of the grammar is built up in the same way down to the terminals, which return the values matched from the input sentence. The full grammar is shown in Figure 7.8. Using this grammar we can get the parses
query: is_in(PERSON, org(NAME, union))
query: is_in(Sam Smith, org(IT, Department))
for the above sentences respectively. Parse trees for these sentences are shown in Figures 7.9 and 7.10. These show how the query is built up at every stage in the parse. Instantiation of the query components works from the bottom of the tree and moves up.
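Before looking at the grammar fragment itself (Figure 7.8), here is a sketch of how such a semantic grammar might be written as a PROLOG DCG, with the query built up in the arguments of the non-terminals. The tokenization and the exact shape of the terms are our own assumptions, intended only to illustrate the idea.

    s(query(Q)) --> [who],  info(Q).
    s(query(Q)) --> [does], yn(Q).

    info(Q) --> affil_vb(Q, _Person, Org), org(Org).
    yn(Q)   --> person(P), affil_vb(Q, P, Org), org(Org).

    org(org(Name, Type))  --> det, name(Name), type(Type).
    org(org(_Name, Type)) --> det, type(Type).

    affil_vb(is_in(P, Org), P, Org) --> [belongs, to].
    affil_vb(is_in(P, Org), P, Org) --> [work, in].

    person('Sam Smith') --> ['Sam Smith'].
    name('IT')          --> ['IT'].
    type(union)         --> [union].
    type('Department')  --> ['Department'].
    det --> [a].
    det --> [the].

    % ?- phrase(s(Q), [who, belongs, to, a, union]).
    %    Q = query(is_in(_Person, org(_Name, union)))
    % ?- phrase(s(Q), [does, 'Sam Smith', work, in, the, 'IT', 'Department']).
    %    Q = query(is_in('Sam Smith', org('IT', 'Department')))

Running the grammar both checks the sentence and constructs the database query, which is exactly the translation step described above.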
S → who INFO    [query: INFO]
S → does YN    [query: YN]
INFO → AFFIL_VB ORG    [AFFIL_VB]
YN → PERSON AFFIL_VB ORG    [AFFIL_VB]
ORG → Det NAME TYPE    [org(NAME, TYPE)]
ORG → Det TYPE    [org(NAME, TYPE)]
Department: TYPE    [value]
Union: TYPE    [value]
IT: NAME    [value]
Sam Smith: PERSON    [value]
belongs to: AFFIL_VB    [is_in(PERSON, ORG)]
work in: AFFIL_VB    [is_in(PERSON, ORG)]
the: Det
a: Det
Figure 7.8 Semantic grammar fragment.
S [query: is_in(PERSON, org(NAME, union))]
Figure 7.9 Parse tree for the first sentence.
S [query: is_in(Sam Smith, org(IT, Department))]
Figure 7.10 Parse tree for the second sentence.
7.8.2
Case grammars
Semantic grammars are designed to give a structural and semantic parse of the sentence. Grammars can get very big as a result. Case grammars represent the semantics in the first instance, ignoring the syntax, so reducing the size of the grammar (Fillmore 1968). For example, a sentence such as Joe wrote the letter would be represented as
wrote(agent(Joe), object(letter))
This indicates that Joe was the active participant, the agent, who performed the action "wrote" on the object "letter". The passive version The letter was written by Joe would be represented in the same way, since the meaning of the sentences is identical. Case grammars rely on cases, which describe relationships between verbs and their arguments. A number of cases are available to build case grammar
representations. The following list is not exhaustive. Can you think of other cases?
- agent: the person or thing performing the action.
- object: the person or thing to which something is done.
- instrument: the person or thing which allows an agent to perform an action.
- time: the time at which an action occurs.
- beneficiary: the person or thing benefiting from an action.
- goal: the place reached by the action.
So, for example, the sentence At 1 pm, Paul hit the gong with the hammer for lunch would be parsed as
hit(time(1 pm), agent(Paul), object(gong), instrument(hammer), goal(lunch))
If we changed the sentence to At 1 pm, Paul hit the gong with the hammer for his father, the case representation would be
hit(time(1 pm), agent(Paul), object(gong), instrument(hammer), beneficiary(his father))
The case structures can be used to derive syntactic structures, by using rules to map from the semantic components that are present to the syntactic structures that are expected to contain these components. However, case grammars do not provide a full semantic representation, since the resulting parse will still contain English words that must be understood.
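Case frames of this kind map directly onto PROLOG terms. The sketch below stores one such frame as a fact and defines a small predicate for looking up a role; the predicate names event/1 and role/3 are our own, purely illustrative choices.

    :- use_module(library(lists)).    % for member/2

    % Case frame for "At 1 pm, Paul hit the gong with the hammer for lunch".
    event(hit([time('1 pm'), agent(paul), object(gong),
               instrument(hammer), goal(lunch)])).

    % role(Role, Event, Filler): look up the filler of a case role in an event.
    role(Role, Event, Filler) :-
        Event =.. [_Verb, Cases],          % e.g. hit(Cases)
        RoleTerm =.. [Role, Filler],       % e.g. agent(Filler)
        member(RoleTerm, Cases).

    % ?- event(E), role(agent, E, Who).        Who = paul
    % ?- event(E), role(instrument, E, What).  What = hammer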
7.9
Pragmatic analysis
The third stage in understanding natural language is pragmatic analysis. As we saw earlier, language can often only be interpreted in context. The context that must be taken into account may include both the surrounding sentences (to allow the correct understanding of ambiguous words and references) and the receiver's expectations, so that the sentence is appropriate for the situation in which it occurs. There are many relationships that can exist between sentences and phrases that have to be taken into account in pragmatic analysis. For example:
- A pronoun may refer back to a noun in a previous sentence that relates to the same object. John had an ice cream. Joe wanted to share it.
- A phrase may reference something that is a component of an object referred to previously. She looked at the house. The front door was open.
- A phrase may refer to something that is a component of an activity referred to previously. Jo went on holiday. She took the early train.
- A phrase may refer to agents who were involved in an action referred to previously. My car was stolen yesterday. They abandoned it two miles away.
- A phrase may refer to a result of an event referred to previously. There have been serious floods. The army was called out today.
- A phrase may refer to a subgoal of a plan referred to previously. She wanted a new car. She decided to get a new job.
- A phrase may implicitly intend some action. This room is cold (expects an action to warm the room).
One approach to performing this pragmatic analysis is the use of scripts (Schank & Abelson 1977). We met scripts in Chapter 1. In scripts, the expectations of a particular event or situation are recorded, and can be used to fill in gaps and help to interpret stories. The main problem with scripts is that much of the information that we use in understanding the context of language is not specific to a particular situation, but generally applicable. However, scripts have proved useful in interpreting simple stories.
7.9.1
Speech acts
When we use language our intention is often to achieve a specific goal that is reached by a set of actions. The acts that we perform with language are called speech acts (Searle 1969). Sentences can be classified by type. For example, the statement "I am cold" is a declarative sentence. It states a fact. On the other hand, the sentence "Are you cold?" is interrogative: it asks a question. A third sentence category is the imperative: "Shut the window". This makes a demand. One way to use speech acts in pragmatic analysis is to assume that the sentence type indicates the intention of the sentence. Therefore, a declarative sentence makes an assertion, an interrogative sentence asks a question and an imperative sentence issues a command. This is a simplistic approach, which fails in situations where the desired action is implied. For example, the sentence "I am hungry" may be simply an assertion or it may be a request to hurry up with the dinner. Similarly, many commands are phrased as questions ("Can you tell me what time it is?"). However, most commercial natural language processing systems ignore such complexity and use speech acts in the manner described above.
Such an approach can be useful in natural language interfaces since assertions, questions and commands map clearly onto system actions. So if I am interacting with a database, an assertion results in the updating of the data held, a question results in a search and a command results in some operation being performed.
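A very crude sketch of this mapping in PROLOG is given below; the first-word classification and the action names are drastic simplifications of our own, not a description of any real interface.

    :- use_module(library(lists)).    % for member/2

    % Classify a sentence (as a word list) by its speech-act type, then map the type to an action.
    sentence_type([First | _], question) :- member(First, [who, what, does, is, can]).
    sentence_type([First | _], command)  :- member(First, [delete, print, create, list]).
    sentence_type(_,           assertion).            % default: treat as an assertion

    action(question,  search_database).
    action(command,   perform_operation).
    action(assertion, update_database).

    respond(Sentence, Action) :-
        sentence_type(Sentence, Type),
        action(Type, Action), !.                      % take the first classification only

    % ?- respond([who, belongs, to, a, union], A).    A = search_database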
7.10
Summary
In this chapter we have looked at the issue of ambiguity, which makes natural language understanding so difficult. We have considered the key stages of natural language understanding: syntactic analysis, semantic analysis and pragmatic analysis. We have looked at grammars and transition networks as techniques for syntactic analysis; semantic and case grammars for semantic analysis; and scripts and speech acts for pragmatic analysis.
7.11
Exercises
1. For each of the sentences below generate the following:
- a syntactic grammar and parse tree
- a transition network
- a semantic grammar and parse tree
- a case grammar
What additional features would you represent if you were generating context-sensitive grammars for these sentences?
- My program was deleted by Brian
- I need a print-out of my program file
- The system administrator removed my files
- I want to create a new document file
2. Identify the ambiguity in each of the following sentences and indicate how it could be resolved.
- She was not sure if she had taken the drink
- Joe broke his glasses
- I saw the boy with the telescope
- They left to go on holiday this morning
3. Devise a script for visiting the doctor, and indicate how this would be used to interpret the statement: "Alison went to the surgery. After seeing the doctor she left."
7.12
Recommended further reading
Allen, J. 1995. Natural language understanding, 2nd edn. Redwood City, CA: Benjamin Cummings. A comprehensive text book covering all aspects of natural language understanding; a good next stage from here.
Winograd, T. & F. Flores 1987. Understanding computers and cognition. Norwood, NJ: Addison-Wesley, Ablex Corporation. Includes a discussion of speech act theory and other aspects of natural language understanding.
7.13
Solution to SHRDLU problem
1. Find a block that is taller than the one you are holding and place it in the box. This is referential ambiguity. What does the word "it" refer to?
2. How many blocks are on top of the green block? This is perhaps more tricky but it involves semantic ambiguity. Does "on top of" mean directly on top of or above (that is, it could be on top of a block that is on top of the green block)?
3. Put the red pyramid on the block in the box. This is syntactic ambiguity. Is it the block that is in the box or the red pyramid that is being put into the box?
4. Does the shortest thing the tallest pyramid's support supports support anything green? This is lexical: there are two uses of the word "support"!
Chapter Eight
Computer vision
8.1
Overview
Computer vision is one way for a computer system to reach beyond the data it is given and find out about the real world. There are many important applications, from robotics to airport security. However, it is a difficult process. This chapter starts with an overview of the typical phases of processing in computer vision. Subsequent sections (8.3-8.7) then follow through these phases in turn. At each point deeper knowledge is inferred from the raw image. Finally, in Section 8.8, we look at the special problems and opportunities that arise when we have moving images or input from several cameras. In this chapter we shall assume that the cameras are passive - we interpret what we are given. In Chapter 9 we shall look at active vision, where the camera can move or adjust itself to improve its understanding of a scene.
8.2
Introduction
8.2.1
Why computer vision is difficult
The human visual system makes scene interpretation seem easy. We can look out of a window and can make sense of what is in fact a very complex scene. This process is very difficult for a machine. As with natural language interpretation, it is a problem of ambiguity. The orientation and position of an object changes its appearance, as does different lighting or colour. In addition, objects are often partially hidden by other objects. In order to interpret an image, we need both low-level information, such as texture and shading, and high-level information, such as context and world knowledge. The former allows us to identify the object, the latter to interpret it according to our expectations.
8.2.2
Phases of computer vision
Because of these multiple levels of information, most computer vision is based on a hierarchy of processes, starting with the raw image and working towards a high-level model of the world. Each stage builds on the features extracted at the stage below. Typical stages are (see Fig. 8.1):
Figure 8.1 Phases of computer vision: digitization and signal processing; edge and region detection; image understanding (e.g. sphere(x), pyramid(y), box(z), black(z), above(x, y)).
- digitization: the analogue video signal is converted into a digital image.
- signal processing: low-level processing of the digital image in order to enhance significant features.
- edge and region detection: finding low-level features in the digital image.
- three-dimensional or two-dimensional object recognition: building lines and regions into objects.
- image understanding: making sufficient sense of the image to use it.
Note, however, that not all applications go through all the stages. The higher levels of processing are more complicated and time consuming. In any real situation one would want to get away with as low a level of processing as possible.
The rest of this chapter will follow these levels of processing, and we will note where applications exist at each level.
8.3
Digitization and signal processing
The aim of computer vision is to understand some scene in the outside world. This may be captured using a video camera, but may come from a scanner (for example, optical character recognition). Indeed, for experimenting with computer vision it will be easier to digitize photographs than to work with real-time video. Also, it is not necessary that images come from visible light. For example, satellite data may use infrared sensing. For the purposes of exposition, we will assume that we are capturing a visible image with a video camera. This image will need to be digitized so that it can be processed by a computer and also "cleaned up" by signal processing software. The next section will discuss signal processing further in the context of edge detection.
8.3.1 Digitizing images
For use in computer vision, the image must be represented in a form that the machine can read. The analogue video image is converted (by a video digitizer) into a digital image. The digital image is basically a stream of numbers, each corresponding to a small region of the image, a pixel. The number is a measure of the light intensity of the pixel, and is called a grey level. The range of possible grey levels is called a grey scale (hence grey-scale images). If the grey scale consists of just two levels (black or white) the image is a binary image.

Figure 8.2 shows an image (ii) and its digitized form (i). There are ten grey levels, from 0 (white) to 9 (black). More typically there will be 16 or 256 grey levels rather than ten, and often 0 is black (no light); however, the digits 0-9 fit better into the picture. Also, in order to print it, the image (ii) is already digitized and we are simply looking at a coarser level of digitization.

Most of the algorithms used in computer vision work on simple grey-scale images. However, sometimes colour images are used. In this case, there are usually three or four values stored for each pixel, corresponding to either primary colours (red, blue and green) or some other colour representation system.

Look again at Figure 8.2. Notice how the right-hand edge of the black rectangle translates into a series of medium grey levels. This is because the pixels each include some of the black rectangle and some of the white background. What was a sharp edge has become fuzzy. As well as this blurring of edges, other effects conspire to make the grey-scale image inaccurate. Some cameras may not generate parallel lines of pixels, the pixels may be rectangular rather than square (the aspect ratio), or the relationship between darkness and grey scale recorded may not be linear.
Figure 8.2 An image and its digitized form.
However, the most persistent problem is noise: inaccurate readings of individual pixels due to electronic fluctuations, dust on the lens or even a foggy day!
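To make this representation concrete, the following is a small illustrative sketch in Python (the book itself gives no code here); the pixel values, the function name rgb_to_grey and the luminance weighting are all assumptions chosen for the example, not part of the text.

# A minimal sketch (not from the book): a digitized image held as a 2-D array
# of grey levels, using the 0 (white) to 9 (black) convention of Figure 8.2.
# The pixel values below are invented for illustration.

image = [
    [0, 0, 1, 1, 0],
    [0, 7, 8, 8, 0],
    [0, 7, 9, 8, 1],
    [0, 0, 1, 0, 0],
]

height = len(image)       # number of rows of pixels
width = len(image[0])     # number of pixels per row

def rgb_to_grey(r, g, b, levels=10):
    """Collapse a colour pixel (0-255 per channel) to a single grey level.

    Uses one common luminance weighting (an assumption, not necessarily what
    any particular digitizer does), then rescales and inverts so that 0 is
    white and levels-1 is black, as in Figure 8.2.
    """
    luminance = 0.299 * r + 0.587 * g + 0.114 * b   # 0 (dark) .. 255 (light)
    return round((255 - luminance) * (levels - 1) / 255)

print(rgb_to_grey(255, 255, 255))   # white pixel -> 0
print(rgb_to_grey(0, 0, 0))         # black pixel -> 9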
8.3.2 Thresholding
Given a grey-scale image, the simplest thing we can do is to threshold it; that is, select all pixels whose greyness exceeds some value. This may select key significant features from the image. In Figure 8.3, we see an image (i) thresholded at three different levels of greyness. The first (ii) has the lowest threshold, accepting anything that is not pure white. The pixels of all the objects in the image are selected with this threshold. The next threshold (iii) accepts only the darker grey of the circle and the black of the rectangle. Finally, the highest threshold (iv) accepts only pure black pixels and hence only those of the obscured rectangle are selected.

This can be used as a simple way to recognize objects. For example, Loughlin (1993) shows how faults in electrical plugs can be detected using multiple threshold levels. At some levels the wires are selected, allowing one to check that the wiring is correct; at others the presence of the fuse can be verified. In an industrial setting one may be able to select lighting levels carefully in order to make this possible.

One can also use thresholding to obtain a simple edge detection: one simply follows round the edge of a thresholded image. One can do this without actually performing the thresholding, as one can simply follow pixels where the grey changes from the desired value. This is called contour following.
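As a simple illustration of thresholding, here is a minimal Python sketch (not taken from the book); the function name, the threshold levels and the toy image are invented for the example.

# A minimal sketch of thresholding, not the book's code. Grey levels follow
# the 0 (white) to 9 (black) convention; the toy image is invented.

def threshold(image, level):
    """Return a binary image: 1 where the grey level meets or exceeds
    the threshold, 0 elsewhere."""
    return [[1 if pixel >= level else 0 for pixel in row] for row in image]

image = [
    [0, 2, 2, 0],
    [0, 5, 5, 1],
    [1, 9, 9, 0],
]

for level in (1, 5, 9):     # increasingly strict thresholds, in the spirit of Figure 8.3
    for row in threshold(image, level):
        print(row)
    print()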
Figure 8.3 Thresholding.
Figure 8.4 A difficult image to threshold.
However, more generally, images resist this level of interpretation. Consider Figure 8.4. To the human eye, this also consists of three objects. However, see what two levels of thresholding, (ii) and (iii), do to the image. The combination of light and shadows means that the regions picked out by thresholding show areas of individual objects instead of distinguishing the objects. Indeed, even to the human eye, the only way we know that the sphere is not connected to the black rectangular area is because of the intervening pyramid. Contour following would give the boundary of one of these regions - not really a good start for image understanding. The more robust approaches in the next section will instead use the rate of change in intensity - slope rather than height - to detect edges. However, even that will struggle on this image. The last image (iv) in Figure 8.4 shows edges obtained by looking for sharp contrasts in greyness. See how the dark side of the sphere has merged into the black rectangle, and how the light shining on the pyramid has lost part of its boundary. There is even a little blob in the middle where the light side of the pyramid meets the dark at the point.

In fact, as a human rather than a machine, you will have inferred quite a lot from the image. You will see it as a three-dimensional image where the sphere is above the pyramid and both lie above a dark rectangle. You will recognize that the light is shining from somewhere at the top left. You will also notice from the shape of the figures and the nature of the shading that this is no photograph, but a generated image.
The algorithms we will discuss later in this chapter will get significantly beyond thresholding, but nowhere near your level of sophistication!
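To make "looking for sharp contrasts in greyness" concrete, the following Python sketch (again, not the book's algorithm - the chapter's proper edge detectors come later) marks a pixel as an edge point whenever its grey level differs from a neighbour by more than some limit; the function name, the contrast limit and the toy image are assumptions.

# A minimal sketch (not the book's method): mark a pixel as an edge point
# when its grey level jumps sharply to its right-hand or lower neighbour.

def mark_edges(image, contrast=3):
    """Return a binary map with 1 wherever the grey level changes by more
    than `contrast` between a pixel and its right or lower neighbour."""
    h, w = len(image), len(image[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            right = image[y][x + 1] if x + 1 < w else image[y][x]
            below = image[y + 1][x] if y + 1 < h else image[y][x]
            if abs(image[y][x] - right) > contrast or abs(image[y][x] - below) > contrast:
                edges[y][x] = 1
    return edges

toy = [
    [0, 0, 7, 7],
    [0, 0, 7, 7],
    [1, 0, 8, 7],
]
for row in mark_edges(toy):
    print(row)      # 1s appear along the boundary between the two regions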
8.3.3 Digital filters
We have noted some of the problems of noise, blurring and lighting effects that make image interpretation difficult. Various signal processing techniques can be applied to the image in order to remove some of the effects of noise or to enhance other features, such as edges. The application of such techniques is also called digital filtering. This is by analogy with physical filters, which enable you to remove unwanted materials, or to find desired material. Thresholding is a simple form of digital filter, but whereas thresholding processes each pixel independently, more sophisticated filters also use neighbouring pixels. Some filters go beyond this, and potentially each pixel's filtered value is dependent on the whole image. However, all the filters we will consider operate on a finite window - a fixed-size group of pixels surrounding the current pixel.

Linear filters

Many filters are linear. These work by having a series of weights for each pixel in the window. For any point in the image, the surrounding pixels are multiplied by the relevant weights and added together to give the final filtered pixel value. In Figure 8.5 we see the effect of applying a filter with a 3 x 3 window. The filter weights are shown at the top right. The initial image grey levels are at the top left. For a particular pixel the nine pixel values in the window are extracted. These are then multiplied by the corresponding weights, giving in this case the new value 1. This value is placed in the appropriate position in the new filtered image (bottom left).

The pixels around the edge of the filtered image have been left blank. This is because one cannot position a 3 x 3 window of pixels centred on the edge pixels. So, either the filtered image must be smaller than the initial image, or some special action is taken at the edges.

Notice also that some of the filtered pixels have negative values associated with them. Obviously this can only arise if some of the weights are negative. This is not a problem for subsequent computer processing, but the values after this particular filter cannot easily be interpreted as grey levels. A related problem is that the values in the final image may be bigger than the original range of values. For example, with the above weights, a zero pixel surrounded by nines would give rise to a filtered value of 36. Again, this is not too great a problem, but if the result is too large or too small (negative) then it may be too big to store - an overflow problem. Usually, the weights will be scaled to avoid this.
So, in the example above, the result of applying the filter would be divided by 8 in order to bring the output values within a similar range to the input grey scales. The coefficients are often chosen to add up to a power of 2, as dividing can then be achieved using bit shifts, which are far faster.
filter weights:       image window:

 0   1   0             0   4   6
 1  -4   1             5   5   6
 0   1   0             5   6   6

new value = 0 x 0 + 1 x 4 + 0 x 6 + 1 x 5 + (-4) x 5 + 1 x 6 + 0 x 5 + 1 x 6 + 0 x 6 = 1
Figure 8.5 Applying a digital filter.
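The following Python sketch (illustrative only; the book gives no code) applies a square window of weights in the way Figure 8.5 describes; the function name, the handling of edge pixels (they are simply omitted, so the output is smaller) and the scale argument are assumptions for the example.

# A minimal sketch of a linear (window) filter, not the book's code.
# The output is smaller than the input because pixels where the full
# window does not fit are simply omitted, as discussed in the text.

def apply_filter(image, weights, scale=1):
    """Apply a square window of weights to every interior pixel.

    image   : 2-D list of grey levels
    weights : 2-D list (e.g. 3 x 3) of filter weights
    scale   : divisor used to bring values back into a sensible range
    """
    n = len(weights)          # window size (assumed square and odd)
    half = n // 2
    h, w = len(image), len(image[0])
    out = []
    for y in range(half, h - half):
        row = []
        for x in range(half, w - half):
            total = sum(weights[j][i] * image[y - half + j][x - half + i]
                        for j in range(n) for i in range(n))
            row.append(total // scale)   # integer division, like a bit shift
        out.append(row)
    return out

# The weights from Figure 8.5; as the text suggests, a scale of 8 could be used:
laplacian_like = [[0, 1, 0],
                  [1, -4, 1],
                  [0, 1, 0]]
# e.g. filtered = apply_filter(some_image, laplacian_like, scale=8)
# (some_image is assumed to be a 2-D list of grey levels, as in the earlier sketch)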
Smoothing

The simplest type of filter is for smoothing an image. That is, surrounding pixels are averaged to give the new value of a pixel. Figure 8.6 shows a simple 2 x 2 smoothing filter applied to an image. The filter window is drawn in the middle, and its pivot cell (the one which overlays the pixel to which the window is applied) is at the top left. The filter values are all ones, and so it simply adds the pixel and its three neighbours to the right and below and averages the four (note the division by 4). The image clearly consists of two regions, one to the left with high (7 or 8) grey-scale values and one to the right with low (0 or 1) values. However, the image also has some noise in it. Two of the pixels on the left have low values and one on the right a high value.
Applying the filter has all but removed these anomalies, leaving the two regions far more uniform, and hence suitable for thresholding or other further analysis.
Figure 8.6 Applying a 2 x 2 smoothing filter.
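Smoothing fits the same pattern as the linear filter sketch given earlier: all-one weights, scaled by the number of pixels in the window, give a local average. The fragment below is illustrative only and assumes the apply_filter function and toy image from the earlier sketches; it uses a 3 x 3 window rather than the 2 x 2 of Figure 8.6 because the earlier sketch assumes an odd-sized, centred window.

# Mean (box) smoothing as a special case of the linear filter sketch above:
# all-one weights, scaled by the number of pixels in the window.
box_3x3 = [[1, 1, 1],
           [1, 1, 1],
           [1, 1, 1]]

smoothed = apply_filter(image, box_3x3, scale=9)   # local 3 x 3 average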
Because only a few pixels are averaged with the 2 x 2 filter, it is still susceptible to noise. Applying the filter would only reduce the magnitude of the noise by a factor of 4. Larger windows are used if there is more noise, or if later analysis requires a cleaner image. A larger filter will often have an uneven distribution of weights, giving more importance to pixels near the chosen one and less to those far away.

There are disadvantages to smoothing, especially when using large filters. Notice in Figure 8.6 that the boundary between the two regions has become blurred. There is a line of pixels that are at an average value between the high and low regions. Thus, the edge can become harder to trace. Furthermore, fine features such as thin lines may disappear altogether. There is no easy answer to this problem - the desire to remove noise is in conflict with the desire to retain sharp images. In the end, how do you distinguish a small but significant feature from noise?

Gaussian filters

The Gaussian filter is a special smoothing filter based on the bell-shaped Gaussian curve, well known in statistics as the "normal" distribution. One imagines a window of infinite size, where the weight, w(x, y), assigned to the pixel at position (x, y) from the centre is

w(x, y) = (1/(2πσ²)) exp[-(x² + y²)/(2σ²)]
The constant σ is a measure of the spread of the window - how much the image will be smeared by the filter. A small value of σ will mean that the weights in the filter will be small for distant pixels, whereas a large value allows more distant pixels to affect the new value of the current pixel. If noise affects groups of pixels together then one would choose a large value of σ.
 0    1    3    1    0
 1   13   30   13    1
 3   30   64   30    3
 1   13   30   13    1
 0    1    3    1    0
Figure 8.7 Gaussian filter with σ = 0.8.
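To show how such a table of integer weights might be produced, here is an illustrative Python sketch (not from the book); scaling the centre weight to 64 is an assumption chosen so that the output is comparable with Figure 8.7, and rounding means the values may differ slightly from the printed table.

import math

# A minimal sketch (not from the book): build a window of integer Gaussian
# weights for a given sigma. The centre weight is scaled to 64 purely so
# that the result can be compared with Figure 8.7.

def gaussian_weights(sigma, size=5, centre_value=64):
    half = size // 2
    weights = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            w = math.exp(-(x * x + y * y) / (2 * sigma * sigma))
            row.append(round(centre_value * w))
        weights.append(row)
    return weights

for row in gaussian_weights(0.8):
    print(row)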
Although the window for a Gaussian filter is theoretically infinite, the weights become small rapidly and so, depending on the value of