Elements of Artificial Neural Networks
Kishan Mehrotra, Chilukuri K. Mohan, and Sanjay Ranka
The MIT Press, October 1996. ISBN 0-262-13328-8. 344 pp., 144 illus.

Table of Contents

Preface

1 Introduction
  1.1 History of Neural Networks
  1.2 Structure and Function of a Single Neuron
    1.2.1 Biological neurons
    1.2.2 Artificial neuron models
  1.3 Neural Net Architectures
    1.3.1 Fully connected networks
    1.3.2 Layered networks
    1.3.3 Acyclic networks
    1.3.4 Feedforward networks
    1.3.5 Modular neural networks
  1.4 Neural Learning
    1.4.1 Correlation learning
    1.4.2 Competitive learning
    1.4.3 Feedback-based weight adaptation
  1.5 What Can Neural Networks Be Used for?
    1.5.1 Classification
    1.5.2 Clustering
    1.5.3 Vector quantization
    1.5.4 Pattern association
    1.5.5 Function approximation
    1.5.6 Forecasting
    1.5.7 Control applications
    1.5.8 Optimization
    1.5.9 Search
  1.6 Evaluation of Networks
    1.6.1 Quality of results
    1.6.2 Generalizability
    1.6.3 Computational resources
  1.7 Implementation
  1.8 Conclusion
  1.9 Exercises

2 Supervised Learning: Single-Layer Networks
  2.1 Perceptrons
  2.4 Guarantee of Success
  2.5 Modifications
    2.5.1 Pocket algorithm
    2.5.2 Adalines
    2.5.3 Multiclass discrimination
  2.6 Conclusion
  2.7 Exercises

3 Supervised Learning: Multilayer Networks I
  3.1 Multilevel Discrimination
  3.2 Preliminaries
    3.2.1 Architecture
    3.2.2 Objectives
  3.3 Backpropagation Algorithm
  3.4 Setting the Parameter Values
    3.4.1 Initialization of weights
    3.4.2 Frequency of weight updates
    3.4.3 Choice of learning rate
    3.4.4 Momentum
    3.4.5 Generalizability
    3.4.6 Number of hidden layers and nodes
    3.4.7 Number of samples
  3.5 Theoretical Results*
    3.5.1 Cover's theorem
    3.5.2 Representations of functions
    3.5.3 Approximations of functions
  3.6 Accelerating the Learning Process
    3.6.1 Quickprop algorithm
    3.6.2 Conjugate gradient
  3.7 Applications
    3.7.1 Weaning from mechanically assisted ventilation
    3.7.2 Classification of myoelectric signals
    3.7.3 Forecasting commodity prices
    3.7.4 Controlling a gantry crane
  3.8 Conclusion
  3.9 Exercises

4 Supervised Learning: Multilayer Networks II
  4.1 Madalines
  4.2 Adaptive Multilayer Networks
    4.2.6 Tiling algorithm
  4.3 Prediction Networks
    4.3.1 Recurrent networks
    4.3.2 Feedforward networks for forecasting
  4.4 Radial Basis Functions
  4.5 Polynomial Networks
  4.6 Regularization
  4.7 Conclusion
  4.8 Exercises

5 Unsupervised Learning
  5.1 Winner-Take-All Networks
    5.1.1 Hamming networks
    5.1.2 Maxnet
    5.1.3 Simple competitive learning
  5.2 Learning Vector Quantizers
  5.3 Counterpropagation Networks
  5.4 Adaptive Resonance Theory
  5.5 Topologically Organized Networks
    5.5.1 Self-organizing maps
    5.5.2 Convergence*
    5.5.3 Extensions
  5.6 Distance-Based Learning
    5.6.1 Maximum entropy
    5.6.2 Neural gas
  5.7 Neocognitron
  5.8 Principal Component Analysis Networks
  5.9 Conclusion
  5.10 Exercises

6 Associative Models
  6.1 Non-iterative Procedures for Association
  6.2 Hopfield Networks
    6.2.1 Discrete Hopfield networks
    6.2.2 Storage capacity of Hopfield networks*
    6.2.3 Continuous Hopfield networks
  6.3 Brain-State-in-a-Box Network
  6.4 Boltzmann Machines
    6.4.1 Mean field annealing
  6.5 Hetero-associators

7 Optimization Methods
    7.1.2 Solving simultaneous linear equations
    7.1.3 Allocating documents to multiprocessors
      Discrete Hopfield network
      Continuous Hopfield network
      Performance
  7.2 Iterated Gradient Descent
  7.3 Simulated Annealing
  7.4 Random Search
  7.5 Evolutionary Computation
    7.5.1 Evolutionary algorithms
    7.5.2 Initialization
    7.5.3 Termination criterion
    7.5.4 Reproduction
    7.5.5 Operators
      Mutation
      Crossover
    7.5.6 Replacement
    7.5.7 Schema Theorem*
  7.6 Conclusion
  7.7 Exercises

Appendix A: A Little Math
  A.1 Calculus
  A.2 Linear Algebra
  A.3 Statistics

Appendix B: Data
  B.1 Iris Data
  B.2 Classification of Myoelectric Signals
  B.3 Gold Prices
  B.4 Clustering Animal Features
  B.5 3-D Corners, Grid and Approximation
  B.6 Eleven-City Traveling Salesperson Problem (Distances)
  B.7 Daily Stock Prices of Three Companies, over the Same Period
  B.8 Spiral Data

Bibliography
Index
Preface

This book is intended as an introduction to the subject of artificial neural networks for readers at the senior undergraduate or beginning graduate levels, as well as professional engineers and scientists. The background presumed is roughly a year of college-level mathematics, and some amount of exposure to the task of developing algorithms and computer programs. For completeness, some of the chapters contain theoretical sections that discuss issues such as the capabilities of algorithms presented. These sections, identified by an asterisk in the section name, require greater mathematical sophistication and may be skipped by readers who are willing to assume the existence of theoretical results about neural network algorithms.

Many off-the-shelf neural network toolkits are available, including some on the Internet, and some that make source code available for experimentation. Toolkits with user-friendly interfaces are useful in attacking large applications; for a deeper understanding, we recommend that the reader be willing to modify computer programs, rather than remain a user of code written elsewhere.

The authors of this book have used the material in teaching courses at Syracuse University, covering various chapters in the same sequence as in the book. The book is organized so that the most frequently used neural network algorithms (such as error backpropagation) are introduced very early, so that these can form the basis for initiating course projects. Chapters 2, 3, and 4 have a linear dependency and, thus, should be covered in the same sequence. However, chapters 5 and 6 are essentially independent of each other and earlier chapters, so these may be covered in any relative order. If the emphasis in a course is to be on associative networks, for instance, then chapter 6 may be covered before chapters 2, 3, and 4. Chapter 6 should be discussed before chapter 7. If the "non-neural" parts of chapter 7 (sections 7.2 to 7.5) are not covered in a short course, then discussion of section 7.1 may immediately follow chapter 6. The inter-chapter dependency rules are roughly as follows.

1 → 2 → 3 → 4
1 → 5
1 → 6
3 → 5.3
6.2 → 7.1

Within each chapter, it is best to cover most sections in the same sequence as the text; this is not logically necessary for parts of chapters 4, 5, and 7, but minimizes student confusion. Material for transparencies may be obtained from the authors. We welcome suggestions for improvements and corrections. Instructors who plan to use the book in a course should
send electronic mail to one of the authors, so that we can indicate any last-minute corrections needed (if errors are found after book production). New theoretical and practical developments continue to be reported in the neural network literature, and some of these are relevant even for newcomers to the field; we hope to communicate some such results to instructors who contact us.

The authors of this book have arrived at neural networks through different paths (statistics, artificial intelligence, and parallel computing) and have developed the material through teaching courses in Computer and Information Science. Some of our biases may show through the text, while perspectives found in other books may be missing; for instance, we do not discount the importance of neurobiological issues, although these consume little ink in the book. It is hoped that this book will help newcomers understand the rationale, advantages, and limitations of various neural network models. For details regarding some of the more mathematical and technical material, the reader is referred to more advanced texts such as those by Hertz, Krogh, and Palmer (1990) and Haykin (1994).

We express our gratitude to all the researchers who have worked on and written about neural networks, and whose work has made this book possible. We thank Syracuse University and the University of Florida, Gainesville, for supporting us during the process of writing this book. We thank Li-Min Fu, Joydeep Ghosh, and Lockwood Morris for many useful suggestions that have helped improve the presentation. We thank all the students who have suffered through earlier drafts of this book, and whose comments have improved this book, especially S. K. Bolazar, M. Gunwani, A. R. Menon, and Z. Zeng. We thank Elaine Weinman, who has contributed much to the development of the text. Harry Stanton of the MIT Press has been an excellent editor to work with. Suggestions on an early draft of the book, by various reviewers, have helped correct many errors. Finally, our families have been the source of much needed support during the many months of work this book has entailed.

We expect that some errors remain in the text, and welcome comments and corrections from readers. The authors may be reached by electronic mail at [email protected], [email protected], and [email protected]. In particular, there has been so much recent research in neural networks that we may have mistakenly failed to mention the names of researchers who have developed some of the ideas discussed in this book. Errata, computer programs, and data files will be made accessible via the Internet.
1 Introduction

If we could first know where we are, and whither we are tending, we could better judge what to do, and how to do it.
—Abraham Lincoln

Many tasks involving intelligence or pattern recognition are extremely difficult to automate, but appear to be performed very easily by animals. For instance, animals recognize various objects and make sense out of the large amount of visual information in their surroundings, apparently requiring very little effort. It stands to reason that computing systems that attempt similar tasks will profit enormously from understanding how animals perform these tasks, and simulating these processes to the extent allowed by physical limitations. This necessitates the study and simulation of neural networks.

The neural network of an animal is part of its nervous system, containing a large number of interconnected neurons (nerve cells). "Neural" is an adjective for neuron, and "network" denotes a graph-like structure. Artificial neural networks refer to computing systems whose central theme is borrowed from the analogy of biological neural networks. Bowing to common practice, we omit the prefix "artificial." There is potential for confusing the (artificial) poor imitation for the (biological) real thing; in this text, non-biological words and names are used as far as possible. Artificial neural networks are also referred to as "neural nets," "artificial neural systems," "parallel distributed processing systems," and "connectionist systems." For a computing system to be called by these pretty names, it is necessary for the system to have a labeled directed graph structure where nodes perform some simple computations.

From elementary graph theory we recall that a "directed graph" consists of a set of "nodes" (vertices) and a set of "connections" (edges/links/arcs) connecting pairs of nodes. A graph is a "labeled graph" if each connection is associated with a label to identify some property of the connection. In a neural network, each node performs some simple computations, and each connection conveys a signal from one node to another, labeled by a number called the "connection strength" or "weight" indicating the extent to which a signal is amplified or diminished by a connection. Not every such graph can be called a neural network, as illustrated in example 1.1 using a simple labeled directed graph that conducts an elementary computation.

Example 1.1 The "AND" of two binary inputs is an elementary logical operation, implemented in hardware using an "AND gate." If the inputs to the AND gate are $x_1 \in \{0, 1\}$ and $x_2 \in \{0, 1\}$, the desired output is 1 if $x_1 = x_2 = 1$, and 0 otherwise. A graph representing this computation is shown in figure 1.1, with one node at which computation (multiplication) is carried out, two nodes that hold the inputs ($x_1, x_2$), and one node that holds the output.

Figure 1.1 AND gate graph: inputs $x_1, x_2 \in \{0, 1\}$ feed a multiplier node whose output is $o = x_1$ AND $x_2$.

However, this graph cannot be considered a neural network since the connections between the nodes are fixed and appear to play no other role than carrying the inputs to the node that computes their conjunction. We may modify the graph in figure 1.1 to obtain a network containing weights (connection strengths), as shown in figure 1.2.

Figure 1.2 AND gate network: weighted inputs $(w_1 x_1)$ and $(w_2 x_2)$ feed a node whose output is $o = x_1$ AND $x_2$.

Different choices for the weights result in different functions being evaluated by the network. Given a network whose weights are initially random, and given that we know the task to be accomplished by the network, a "learning algorithm" must be used to determine the values of the weights that will achieve the desired task. The graph structure, with connection weights modifiable using a learning algorithm, qualifies the computing system to be called an artificial neural network.

Example 1.2 For the network shown in figure 1.2, the following is an example of a learning algorithm that will allow learning the AND function, starting from arbitrary values of $w_1$ and $w_2$. The trainer uses the following four examples to modify the weights: $\{(x_1 = 1, x_2 = 1, d = 1), (x_1 = 0, x_2 = 0, d = 0), (x_1 = 1, x_2 = 0, d = 0), (x_1 = 0, x_2 = 1, d = 0)\}$, where $d$ denotes the desired output. Whenever the network output differs from $d$ for a sample, the process of incrementing or decrementing $w_1$ or $w_2$ may be repeated until the final result is satisfactory, with weights $w_1 = 5.0$, $w_2 = 0.2$.

Can the weights of such a net be modified so that the system performs a different task? For instance, is there a set of values for $w_1$ and $w_2$ such that a net otherwise identical to that shown in figure 1.2 can compute the OR of its inputs? Unfortunately, there is no possible choice of weights $w_1$ and $w_2$ such that $(w_1 \cdot x_1) \cdot (w_2 \cdot x_2)$ will compute the OR of $x_1$ and $x_2$. For instance, whenever $x_1 = 0$, the output value $(w_1 \cdot x_1) \cdot (w_2 \cdot x_2) = 0$, irrespective of whether $x_2 = 1$. The node function was predetermined to multiply weighted inputs, imposing a fundamental limitation on the capabilities of the network shown in figure 1.2, although it was adequate for the task of computing the AND function and for functions described by the mathematical expression $o = w_1 w_2 x_1 x_2$.

A different node function is needed if there is to be some chance of learning the OR function. An example of such a node function is $(x_1 + x_2 - x_1 x_2)$, which evaluates to 1 if $x_1 = 1$ or $x_2 = 1$, and to 0 if $x_1 = 0$ and $x_2 = 0$ (assuming that each input can take only a 0 or 1 value). But this network cannot be used to compute the AND function.

Sometimes, a network may be capable of computing a function, but the learning algorithm may not be powerful enough to find a satisfactory set of weight values, and the final result may be constrained due to the initial (random) choice of weights. For instance, the AND function cannot be learnt accurately using the learning algorithm described above if we started from initial weight values $w_1 = w_2 = 0.3$, since the solution $w_1 = 1/0.3$ cannot be reached by repeatedly incrementing (or decrementing) the initial choice of $w_1$ by 0.1.

We seem to be stuck with one node function for AND and another for OR. What if we did not know beforehand whether the desired function was AND or OR? Is there some node function such that we can simulate AND as well as OR by using different weight values? Is there a different network that is powerful enough to learn every conceivable function of its inputs? Fortunately, the answer is yes; networks can be built with sufficiently general node functions so that a large number of different problems can be solved, using a different set of weight values for each task. The AND gate example has served as a takeoff point for several important questions: what are neural networks, what can they accomplish, how can they be modified, and what are their limitations?
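The trial-and-error scheme of example 1.2 is easy to express in code. The following is a minimal sketch, not taken from the book: the four training samples and the 0.1 step size come from the text, while the convergence tolerance, the bound on iterations, and the policy of adjusting only $w_1$ are illustrative assumptions.

```python
# Toy learning for the multiplicative node o = (w1*x1) * (w2*x2).
# The samples and the 0.1 step follow example 1.2; the tolerance and
# the "adjust w1 only" policy are assumptions made for this sketch.

SAMPLES = [(1, 1, 1), (0, 0, 0), (1, 0, 0), (0, 1, 0)]  # (x1, x2, desired)

def train(w1, w2, max_epochs=1000, tol=0.001):
    for _ in range(max_epochs):
        changed = False
        for x1, x2, d in SAMPLES:
            o = (w1 * x1) * (w2 * x2)
            if abs(o - d) > tol:              # network output is erroneous
                w1 += 0.1 if o < d else -0.1  # nudge a weight by 0.1
                changed = True
        if not changed:                       # all four samples satisfactory
            break
    return w1, w2
```

Starting from $w_1 = 0.3$, $w_2 = 0.2$, the loop reaches the satisfactory setting $w_1 = 5.0$, $w_2 = 0.2$ mentioned above. Starting from $w_1 = w_2 = 0.3$, it oscillates near $w_1 w_2 \approx 1$ without ever passing the test, mirroring the text's observation that the solution cannot be reached in steps of 0.1.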
In the rest of this chapter, we review the history of research in neural networks, and address four important questions regarding neural network systems.
1. How does a single neuron work?
2. How is a neural network structured, i.e., how are different neurons combined or connected to obtain the desired behavior?
3. How can neurons and neural networks be made to learn?
4. What can neural networks be used for?

We also discuss some general issues important for the evaluation and implementation of neural networks.

1.1 History of Neural Networks
Those who cannot remember the past are condemned to repeat it.
—Santayana, "The Life of Reason" (1905-06)

The roots of all work on neural networks are in neurobiological studies that date back to about a century ago. For many decades, biologists have speculated on exactly how the nervous system works. The following century-old statement by William James (1890) is particularly insightful, and is reflected in the subsequent work of many researchers.

The amount of activity at any given point in the brain cortex is the sum of the tendencies of all other points to discharge into it, such tendencies being proportionate
1. to the number of times the excitement of other points may have accompanied that of the point in question;
2. to the intensities of such excitements; and
3. to the absence of any rival point functionally disconnected with the first point, into which the discharges may be diverted.

How do nerves behave when stimulated by different magnitudes of electric current? Is there a minimal threshold (quantity of current) needed for nerves to be activated? Given that no single nerve cell is long enough, how do different nerve cells communicate electrical currents among one another? How do various nerve cells differ in behavior? Although hypotheses could be formulated, reasonable answers to these questions could not be given and verified until the mid-twentieth century, with the advance of neurology as a science.

Another front of attack came from psychologists striving to understand exactly how learning, forgetting, recognition, and other such tasks are accomplished by animals. Psycho-physical experiments have helped greatly to enhance our meager understanding of how individual neurons and groups of neurons work. McCulloch and Pitts (1943) are credited with developing the first mathematical model of a single neuron. This model has been modified and widely applied in subsequent work.
System-builders are mainly concerned with questions as to whether a neuron model is sufficiently general to enable learning all kinds of functions, while being easy to implement, without requiring excessive computation within each neuron. Biological modelers, on the other hand, must also justify a neuron model by its biological plausibility.

Most neural network learning rules have their roots in statistical correlation analysis and in gradient descent search procedures. Hebb's (1949) learning rule incrementally modifies connection weights by examining whether two connected nodes are simultaneously ON or OFF. Such a rule is still widely used, with some modifications. Rosenblatt's (1958) "perceptron" neural model and the associated learning rule are based on gradient descent, "rewarding" or "punishing" a weight depending on the satisfactoriness of a neuron's behavior. The simplicity of this scheme was also its nemesis; there are certain simple pattern recognition tasks that individual perceptrons cannot accomplish, as shown by Minsky and Papert (1969). A similar problem was faced by the Widrow-Hoff (1960, 1962) learning rule, also based on gradient descent. Despite obvious limitations, the accomplishments of these systems were exaggerated, and incredible claims were asserted that intelligent machines had come to exist. This discredited and discouraged neural network research among computer scientists and engineers.

A brief history of early neural network activities is listed below, in chronological order.

1938 Rashevsky initiated studies of neurodynamics, also known as neural field theory, representing activation and propagation in neural networks in terms of differential equations.

1943 McCulloch and Pitts invented the first artificial model for biological neurons using simple binary threshold functions (described in section 1.2.2).

1943 Landahl, McCulloch, and Pitts noted that many arithmetic and logical operations could be implemented using networks containing McCulloch and Pitts neuron models.

1948 Wiener presented an elaborate mathematical approach to neurodynamics, extending the work initiated by Rashevsky.

1949 In The Organization of Behavior, an influential book, Hebb followed up on early suggestions of Lashley and Cajal, and introduced his famous learning rule: repeated activation of one neuron by another, across a particular synapse, increases its conductance.

1954 Gabor invented the "learning filter" that uses gradient descent to obtain "optimal" weights that minimize the mean squared error between the observed output signal and a signal generated based upon the past information.

1954 Cragg and Temperley reformulated the McCulloch and Pitts network in terms of the "spin-glass" model well known to physicists.
1956 Taylor introduced an associative memory network using Hebb's rule.

1956 Beurle analyzed the triggering and propagation of large-scale brain activity.

1956 Von Neumann showed how to introduce redundancy and fault tolerance into neural networks, and showed how the synchronous activation of many neurons can be used to represent each bit of information.

1956 Uttley demonstrated that neural networks with modifiable connections could learn to classify patterns, with synaptic weights representing conditional probabilities. He developed a linear separator in which weights were adjusted using Shannon's entropy measure.

1958 Rosenblatt invented the "perceptron," introducing a learning method for the McCulloch and Pitts neuron model.

1960 Widrow and Hoff introduced the "Adaline," a simple network trained by a gradient descent rule to minimize mean squared error.

1961 Rosenblatt proposed the "backpropagation" scheme for training multilayer networks; this attempt was unsuccessful because he used non-differentiable node functions.

1962 Hubel and Wiesel conducted important biological studies of properties of the neurons in the visual cortex of cats, spurring the development of self-organizing artificial neural models that simulated these properties.

1963 Novikoff provided a short proof for the Perceptron Convergence Theorem conjectured by Rosenblatt.

1964 Taylor constructed a winner-take-all circuit with inhibitions among output units.

1966 Uttley developed neural networks in which synaptic strengths represent the mutual information between firing patterns of neurons.

1967 Cowan introduced the sigmoid firing characteristic.

1967 Amari obtained a mathematical solution of the credit assignment problem to determine a learning rule for weights in multilayer networks. Unfortunately, its importance was not noticed for a long time.

1968 Cowan introduced a network of neurons with skew-symmetric coupling constants that generates neutrally stable oscillations in neuron outputs.

1969 Minsky and Papert demonstrated the limits of simple perceptrons. This important work is famous for demonstrating that perceptrons are not computationally universal, and infamous as it resulted in a drastic reduction in funding support for research in neural networks.
In the next two decades, the limitations of neural networks were overcome to some extent by researchers who explored several different lines of work.

1. Combinations of many neurons (i.e., neural networks) can be more powerful than single neurons. Learning rules applicable to large NN's were formulated by researchers such as Dreyfus (1962), Bryson and Ho (1969), and Werbos (1974), and popularized by McClelland and Rumelhart (1986). Most of these are still based on gradient descent.

2. Often gradient descent is not successful in obtaining a desired solution to a problem. Random, probabilistic, or stochastic methods (e.g., Boltzmann machines) have been developed to combat this problem by Ackley, Hinton, and Sejnowski (1985); Kirkpatrick, Gelatt, and Vecchi (1983); and others.

3. Theoretical results have been established to understand the capabilities of non-trivial neural networks, by Cybenko (1988) and others. Theoretical analyses have been carried out to establish whether networks can give an approximately correct solution with a high probability, even though the correct solution is not guaranteed [see Valiant (1985), Baum and Haussler (1988)].

4. For effective use of available problem-specific information, "hybrid systems" (combining neural networks and non-connectionist components) were developed, bridging the gulf between symbolic and connectionist systems [see Gallant (1986)].

In recent years, several other researchers (such as Amari, Grossberg, Hopfield, Kohonen, von der Malsburg, and Willshaw) have made major contributions to the field of neural networks, such as in self-organizing maps discussed in chapter 5 and in associative memories discussed in chapter 6.

1.2 Structure and Function of a Single Neuron

In this section, we begin by discussing biological neurons, then discuss the functions computed by nodes in artificial neural networks.

1.2.1 Biological neurons

A typical biological neuron is composed of a cell body, a tubular axon, and a multitude of hair-like dendrites, shown in figure 1.3. The dendrites form a very fine filamentary brush surrounding the body of the neuron. The axon is essentially a long, thin tube that splits into branches terminating in little end bulbs that almost touch the dendrites of other cells. The small gap between an end bulb and a dendrite is called a synapse, across which information is propagated. The axon of a single neuron forms synaptic connections with many other
neurons; the presynaptic side of the synapse refers to the neuron that sends a signal, while the postsynaptic side refers to the neuron that receives the signal. However, the real picture of neurons is a little more complicated.

Figure 1.3 A biological neuron.

1. A neuron may have no obvious axon, but only "processes" that receive and transmit information.

2. Axons may form synapses on other axons.

3. Dendrites may form synapses onto other dendrites.

The number of synapses received by each neuron ranges from 100 to 100,000. Morphologically, most synaptic contacts are of two types.

Type I: Excitatory synapses with asymmetrical membrane specializations; membrane thickening is greater on the postsynaptic side. The presynaptic side contains round bags (synaptic vesicles) believed to contain packets of a neurotransmitter (a chemical such as glutamate or aspartate).

Type II: Inhibitory synapses with symmetrical membrane specializations, with smaller ellipsoidal or flattened vesicles. Gamma-amino butyric acid is an example of an inhibitory neurotransmitter.

An electrostatic potential difference is maintained across the cell membrane, with the inside of the membrane being negatively charged. Ions diffuse through the membrane to maintain this potential difference. Inhibitory or excitatory signals from other neurons are
transmitted to a neuron at its dendrites' synapses. The magnitude of the signal received by a neuron (from another) depends on the efficiency of the synaptic transmission, and can be thought of as the strength of the connection between the neurons. The cell membrane becomes electrically active when sufficiently excited by the neurons making synapses onto this neuron. A neuron will fire, i.e., send an output impulse of about 100 mV down its axon, if sufficient signals from other neurons fall upon its dendrites in a short period of time, called the period of latent summation. The neuron fires if its net excitation exceeds its inhibition by a critical amount, the threshold of the neuron; this process is modeled by equations proposed by Hodgkin and Huxley (1952). Firing is followed by a brief refractory period during which the neuron is inactive. If the input to the neuron remains strong, the neuron continues to deliver impulses at frequencies up to a few hundred impulses per second. It is this frequency which is often referred to as the output of the neuron. Impulses propagate down the axon of a neuron and reach up to the synapses, sending signals of various strengths down the dendrites of other neurons.

1.2.2 Artificial neuron models
We begin our discussion of artificial neuron models by introducing oft-used terminology that establishes the correspondence between biological and artificial neurons, shown in table 1.1. Node output represents firing frequency when allowed to take arbitrary nonbinary values; however, the analogy with biological neurons is more direct in some artificial neural networks with binary node outputs, and a node is said to be fired when its net input exceeds a certain threshold. Figure 1.4 describes a general model encompassing almost every artificial neuron model proposed so far. Even this noncommittal model makes the following assumptions that may lead one to question its biological plausibility.

1. The position on the neuron (node) of the incoming synapse (connection) is irrelevant.

2. Each node has a single output value, distributed to other nodes via outgoing links, irrespective of their positions.
Table 1.1 Terminology

Biological Terminology        Artificial Neural Network Terminology
Neuron                        Node/Unit/Cell/Neurode
Synapse                       Connection/Edge/Link
Synaptic Efficiency           Connection Strength/Weight
Firing Frequency              Node Output
Figure 1.4 General neuron model: inputs $x_1, \ldots, x_n$ enter a node that computes $f(w_1 x_1, \ldots, w_n x_n)$.
3. All inputs come in at the same time or remain activated at the same level long enough for computation (of $f$) to occur. An alternative is to postulate the existence of buffers to store weighted inputs inside nodes.

The next level of specialization is to assume that different weighted inputs are summed, as shown in figure 1.5. The neuron output may be written as $f(w_1 x_1 + \cdots + w_n x_n)$, or $f\left(\sum_{i=1}^{n} w_i x_i\right)$, or $f(net)$, where $net = \sum_{i=1}^{n} w_i x_i$. The simplification involved here is the assumption that all weighted inputs are treated similarly, and merely summed. When examining the biological plausibility of such models, we may pose questions such as the following: If different inputs to a biological neuron come in at different locations, exactly how can these be added up before any other function ($f$) is applied to them? Some artificial neuron models do not sum their weighted inputs, but take their product, as in "sigma-pi" networks [see Feldman and Ballard (1982), Rumelhart and McClelland (1986)]. Nevertheless, the model shown in figure 1.5 is most commonly used, and we elaborate on it in the rest of this section, addressing the exact form of the function $f$.

The simplest possible functions are: the identity function $f(net) = net$; the non-negative identity function $f(net) = \max(0, net)$; and the constant functions $f(net) = c$ for some constant value $c$. Some other functions, commonly used in neural networks, are described below. Node functions whose outputs saturate (e.g., $\lim_{net \to \infty} f(net) = 1$ and $\lim_{net \to -\infty} f(net) = 0$) are of great interest in all neural network models. Only such functions will be considered in this chapter. Inputs to a neuron that differ very little are expected to produce approximately the same outputs, which justifies using continuous node functions. The motivation for using differentiable node functions will become clear when we present learning algorithms that conduct gradient descent.
Figure 1.5 Weighted input summation: the node computes $f(w_1 x_1 + \cdots + w_n x_n)$.

Figure 1.6 Step function (OFF below the threshold, ON above it).
Step functions A commonly used single neuron model is given by a simple step function, shown in figure 1.6. This function is defined in general as follows:

$$f(net) = \begin{cases} a & \text{if } net < c \\ b & \text{if } net > c \end{cases} \tag{1.1}$$

At $net = c$, $f(c)$ is sometimes defined to equal $a$, sometimes $b$, sometimes $(a+b)/2$, and sometimes 0. Common choices are $c = 0$, $a = 0$, $b = 1$; and $c = 0$, $a = -1$, $b = 1$.
The latter case is also called the signum function, whose output is $+1$ if $net > 0$, $-1$ if $net < 0$, and 0 if $net = 0$.

The step function is very easy to implement. It also captures the idea of having a minimum threshold ($= c$ in figure 1.6) for the net weighted input that must be exceeded if a neuron's output is to equal $b$. The state of the neuron in which $net > c$, so that $f(net) = b$, is often identified as the active or ON state for the neuron, while the state with $f(net) = a$ is considered to be the passive or OFF state, assuming $b > a$. Note that $b$ is not necessarily greater than $a$; it is possible that a node is activated when its net input is less than a threshold.

Though the notion of a threshold appears very natural, this model has the biologically implausible feature that the magnitude of the net input is largely irrelevant (given that we know whether the net input exceeds the threshold). It is logical to expect that variations in the magnitudes of inputs should cause corresponding variations in the output. This is not the case with discontinuous functions such as the step function. Recall that a function is continuous if small changes in its inputs produce corresponding small changes in its output. With the step function shown in figure 1.6, however, a change in $net$ from $c - \epsilon/2$ to $c + \epsilon/2$ produces a change in $f(net)$ from $a$ to $b$ that is large when compared to $\epsilon$, which can be made infinitesimally small. Biological systems are subject to noise, and a neuron with a discontinuous node function may potentially be activated by a small amount of noise, implying that this node is biologically implausible.

Another feature of the step function is that its output "saturates," i.e., does not increase or decrease to values whose magnitude is excessively high. This is desirable because we cannot expect biological or electronic hardware to produce excessively high voltages. The outputs of the step function may be interpreted as class identifiers: we may conclude that an input sample belongs to one class if and only if the net input exceeds a certain value. This interpretation of the step-functional neuron appears simplistic when a network contains more than one neuron. It is sometimes possible to interpret nodes in the interior of the network as identifying features of the input, while the output neurons compute the application-specific output based on the inputs received from these feature-identifying intermediate nodes.

Ramp functions The ramp function is shown in figure 1.7. This function is defined in general as follows:

$$f(net) = \begin{cases} a & \text{if } net < c \\ b & \text{if } net > d \\ a + \dfrac{(net - c)(b - a)}{d - c} & \text{otherwise} \end{cases} \tag{1.2}$$

Common choices are $c = 0$, $d = 1$, $a = 0$, $b = 1$; and $c = -1$, $d = 1$, $a = -1$, $b = 1$.
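Expressed in code, equations 1.1 and 1.2 look as follows. This is a small sketch, not from the book; the convention $f(c) = a$ at the step's discontinuity is just one of the choices mentioned above.

```python
def step(net, a=0.0, b=1.0, c=0.0):
    """Step function, equation 1.1; f(c) is taken to be a here."""
    return b if net > c else a

def signum(net):
    """Signum variant: +1 if net > 0, -1 if net < 0, 0 if net == 0."""
    return (net > 0) - (net < 0)

def ramp(net, a=0.0, b=1.0, c=0.0, d=1.0):
    """Ramp function, equation 1.2: linear between (c, a) and (d, b)."""
    if net < c:
        return a
    if net > d:
        return b
    return a + (net - c) * (b - a) / (d - c)
```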
Figure 1.7 Ramp function.

As in the case of the step function, $f(net) = \max(a, b)$ is identified as the ON state, and $f(net) = \min(a, b)$ is the OFF state. This node function also implies the existence of a threshold $c$ which must be exceeded by the net weighted input in order to activate the node. The node output also saturates, i.e., is limited in magnitude. But unlike the step function, the ramp is continuous; small variations in net weighted input cause correspondingly small variations (or none at all) in the output. This desirable property is gained at the loss of the simple ON/OFF description of the output: for $c < net < d$, the output lies strictly between $a$ and $b$, neither fully ON nor fully OFF.

In order to ensure that the sample $x$ is correctly classified following the weight change, we need a choice of $\Delta w = \pm \eta x$ such that $(w + \Delta w) \cdot x$ and $w \cdot x$ have opposite signs, i.e., $|\Delta w \cdot x| > |w \cdot x|$, that is,

$$\eta \, |x \cdot x| > |w \cdot x|, \quad \text{that is,} \quad \eta > \frac{|w \cdot x|}{x \cdot x}.$$
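The inequality just derived translates directly into a corrective update rule. The sketch below assumes plain list vectors and a small extra margin added to $\eta$, neither of which comes from the text:

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def corrective_update(w, x, desired, margin=0.01):
    """One corrective step for a misclassified sample x (desired is +1 or -1).

    Choosing eta > |w.x| / (x.x) guarantees that (w + delta_w).x has the
    desired sign; margin is an arbitrary surplus chosen for this sketch.
    """
    eta = abs(dot(w, x)) / dot(x, x) + margin
    sign = 1.0 if desired > 0 else -1.0
    return [wi + sign * eta * xi for wi, xi in zip(w, x)]
```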
2.3.3 Non-numeric inputs
In some problems, the input dimensions are non-numeric, and their values do not have any inherent order. For instance, the input dimension may be "color," and its values may range over the set {red, blue, green, yellow}. If these values are depicted as points on an axis, e.g., red-blue-green-yellow, the resulting figure depicts erroneous implicit interrelationships, e.g., that "red" is closer to "blue" than to "green" or "yellow." How may the perceptron method be used to solve problems with such inputs? Note that this is not a problem if the attribute "color" was allowed to take on only two values. The simplest alternative is to generate four new dimensions ("red," "blue," "green," and "yellow") instead of the single "color" dimension, and replace each original attribute-value pair by a binary vector with one component corresponding to each color. For instance,
color = "green" is represented by the input vector (0, 0, 1, 0), standing for red = 0, blue = 0, green = 1, yellow = 0. In the general case, if an attribute can take one of $n$ different values, then $n$ new dimensions are obtained, and each sample has a 1 or 0 value for each of these dimensions. The disadvantage of this approach is a drastic increase in the number of dimensions, as well as the loss of the information that the attribute has a unique value, e.g., that an object cannot be both blue and red.

Example 2.4 The day of the week (Sunday/Monday/...) is an important variable in predicting the amount of electric power consumed in a geographical region. However, there is no obvious way of sequencing weekdays, so it is inappropriate to use a single variable whose values range from 1 to 7. Instead, seven different variables should be chosen (eight, if a separate category of "holiday" is needed), and each input sample has a value of 1 for one of these coordinates, and a value of 0 for the others. It may also be reasonable to combine some of these variables into a single one, e.g., to combine Tuesday, Wednesday, and Thursday, for which the input-output mapping is similar.

Another possible approach departs from the notion that each weight must be a single real number. Using this approach, each weight is essentially a function that maps its input to a real number. For instance, for an input node intended to denote color = {red, blue, green, yellow}, the weight function on the connection from the color node to the $j$th node at the next level may be of the form $w_{j,\text{color}}(x) = c_i$ when $x$ is the $i$th color value, associating a separate constant with each color.
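The one-of-$n$ recoding described above is simple to implement; here is a minimal sketch (the attribute values and the helper's name are illustrative, not from the book):

```python
COLORS = ("red", "blue", "green", "yellow")

def one_hot(value, values=COLORS):
    """Replace one non-numeric attribute by n binary input dimensions."""
    return [1 if v == value else 0 for v in values]

print(one_hot("green"))   # [0, 0, 1, 0], i.e., red=0, blue=0, green=1, yellow=0
```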
2.4 Guarantee of Success

Since the given classes are linearly separable, there must exist a vector of weights $w^*$ that correctly separates them. This implies that $w^* \cdot i_k > 0$ whenever $i_k$ belongs to the class with desired output 1, and $w^* \cdot i_k < 0$ when $i_k$ belongs to the class with desired output $-1$. Define
$$x_k = \begin{cases} i_k & \text{if the desired output is } 1 \\ -i_k & \text{if the desired output is } -1, \end{cases}$$
so that $w^* \cdot x_k > 0$ whenever $x_k$ is correctly classified by the perceptron with weight vector $w^*$.

Theorem 2.1 Given training samples from two linearly separable classes, the perceptron training algorithm terminates after a finite number of steps, and correctly classifies all elements of the training set, irrespective of the initial random non-zero weight vector $w_0$.
Proof We assume $\eta = 1$, without loss of generality. After executing the perceptron training algorithm for $k$ steps, the resulting weight vector is

$$w_k = w_0 + x_1 + x_2 + \cdots + x_k. \tag{2.1}$$
Since the given classes are linearly separable, there must exist a vector of weights $w^*$ that correctly classifies them, with $\mathrm{sgn}(w^* \cdot i_k) = \mathrm{class}(i_k)$, where "sgn" is the step function whose outputs are 1 and $-1$, with the discontinuity located at 0. Taking the scalar product of each side of equation 2.1 with $w^*$, we have

$$w^* \cdot w_k = w^* \cdot w_0 + w^* \cdot x_1 + \cdots + w^* \cdot x_k.$$
For each input vector $i_j$, the dot product $w^* \cdot i_j$ has the same sign as $\mathrm{class}(i_j)$. Since the corresponding element of the training sequence is $x_j = \mathrm{class}(i_j)\, i_j$, we can be assured that $w^* \cdot x_j = w^* \cdot (\mathrm{class}(i_j)\, i_j) > 0$. Therefore, there exists an $\varepsilon > 0$ such that $w^* \cdot x_j > \varepsilon$ for every member $x_j$ of the training sequence. Hence

$$w^* \cdot w_k > w^* \cdot w_0 + k\varepsilon. \tag{2.2}$$
By the Cauchy-Schwarz inequality,

$$(w^* \cdot w_k)^2 \leq \|w^*\|^2 \, \|w_k\|^2. \tag{2.3}$$
Without loss of generality we may assume that $w^*$ is of unit length, i.e., $\|w^*\| = 1$, since the unit-length vector $w^*/\|w^*\|$ also correctly classifies the same samples. Using this unit-length assumption and equations 2.2, 2.3, we obtain the following lower bound for the square of the length of the weight vector.
$$\|w_k\|^2 \geq (w_0 \cdot w^* + k\varepsilon)^2 \tag{2.4}$$
On the other hand, since $w_j = w_{j-1} + x_j$, the following upper bound can be obtained for this vector's squared length.

$$\|w_j\|^2 = w_j \cdot w_j = w_{j-1} \cdot w_{j-1} + 2\, w_{j-1} \cdot x_j + x_j \cdot x_j = \|w_{j-1}\|^2 + 2\, w_{j-1} \cdot x_j + \|x_j\|^2$$
Since $w_{j-1} \cdot x_j < 0$ whenever a weight change is required by the algorithm, we have

$$\|w_j\|^2 - \|w_{j-1}\|^2 \leq \|x_j\|^2.$$

... with $L \geq 2$, containing
"hidden layers" numbered 1 to L - 1. Hidden nodes do not directly receive inputs from nor send outputs to the external environment. For convenience of presentation, we will assume that L = 2 in describing the backpropagation algorithm, implying that there is only one hidden layer, as shown in figure 3.1. The algorithm can be extended easily to cases when L ^ 2. The presentation of the algorithm also assumes that the network is strictly feedforward, i.e., only nodes in adjacent layers are directly connected; this assumption can also be done away with. Input layer nodes merely transmit input values to the hidden layer nodes, and do not perform any computation. The number of input nodes equals the dimensionality of input patterns, and the number of nodes in the output layer is dictated by the problem under consideration. For instance, if the task is to approximate a function mapping rc-dimensional input vectors to m-dimensional output vectors, the network contains n input nodes and m output nodes. An additional "dummy" input node with constant input (= 1) is also often used so that the bias or threshold term can be treated just like other weights in the network. The number of nodes in the hidden layer is up to the discretion of the network designer and generally depends on problem complexity. Each hidden node and output node applies a sigmoid function to its net input, shown in figure 3.3. As discussed briefly in chapter 1, the main reasons motivating the use of an S-shaped sigmoidal function are that it is continuous, monotonicaUy increasing, invertible, everywhere differentiable, and asymptotically approaches its saturation values as net -> ±oo. These basic properties of the sigmoidal function are more important than the specific sigmoidal function chosen in our presentation below, namely,
$$S(net) = \frac{1}{1 + e^{-net}}$$
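In code, this sigmoid and its derivative can be written as below. The identity $S'(net) = S(net)(1 - S(net))$ is a standard property of this particular function (not stated here in the text) that is convenient when implementing the derivatives used in section 3.3:

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def sigmoid_prime(net):
    s = sigmoid(net)
    return s * (1.0 - s)   # S'(net) = S(net) * (1 - S(net))
```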
3.2.2 Objectives
The algorithm discussed in this chapter is a supervised learning algorithm trained using $P$ input patterns. For each input vector $x_p$, we have the corresponding desired $K$-dimensional output vector

$$d_p = (d_{p,1}, d_{p,2}, \ldots, d_{p,K})$$

for $1 \leq p \leq P$. This collection of input-output pairs constitutes the training set $\{(x_p, d_p) : p = 1, \ldots, P\}$. The length of the input vector $x_p$ is equal to the number of inputs of the given application. The length of the output vector $d_p$ is equal to the number of outputs of the given application.
Figure 3.3 A sigmoid function.
The training algorithm should work irrespective of the weight values that preceded training, which may initially have been assigned randomly. As with the other networks, one would expect that some training samples are such that any reasonably small network is forced to make some errors, in the hope of better performance on the test data. Furthermore, many real-life problems are such that perfect classification or approximation using a small network is impossible, and not all training or test samples can be classified or approximated correctly.

As in the case of perceptrons and Adalines, the goal of training is to modify the weights in the network so that the network's output vector

$$o_p = (o_{p,1}, o_{p,2}, \ldots, o_{p,K})$$

is as close as possible to the desired output vector $d_p$, when an input vector $x_p$ is presented to the network. Towards this goal, we would like to minimize the cumulative error of the network:

$$\text{Error} = \sum_{p=1}^{P} \text{Err}(o_p, d_p). \tag{3.1}$$
As discussed in chapter 1, there are many possible choices for the error function Err above. The function Err should be non-negative, and should be small if $o_p$ is close to $d_p$ (and large otherwise). One such function is the "cross-entropy" function, $\sum_p \left( d_p \log o_p + (1 - d_p) \log(1 - o_p) \right)$, suggested by Hinton (1987). The more popular choices for Err are based on norms of the difference vector $d_p - o_p$. An error measure may be obtained by examining the difference $\ell_{p,j} = |o_{p,j} - d_{p,j}|$ between the $j$th components of the actual and desired output vectors. For the entire output vector, a measure indicating its distance from the desired vector is $\ell_p = \left( (\ell_{p,1})^u + \cdots + (\ell_{p,K})^u \right)^{1/u}$, for some $u > 0$. For the rest of the section, we will consider $u = 2$ and define the $\text{Err}(o_p, d_p)$ function to be $(\ell_p)^2$. This function is differentiable, unlike the absolute value function. It is also easy to apply gradient descent methods, since the derivative of the resulting quadratic function is linear and thus easier to compute. Hence our goal will be to find a set of weights that minimize

$$\text{Sum Square Error} = \sum_{p=1}^{P} \sum_{j=1}^{K} (\ell_{p,j})^2 \tag{3.2}$$

or mean squared error

$$MSE = \frac{1}{P} \sum_{p=1}^{P} \sum_{j=1}^{K} (\ell_{p,j})^2. \tag{3.3}$$
Backpropagation Algorithm
(3.3)
The backpropagation algorithm is a generalization of the least mean squared algorithm that modifies network weights to minimize the mean squared error between the desired and actual outputs of the network. Backpropagation uses supervised learning in which the network is trained using data for which inputs as well as desired outputs are known. Once trained, the network weights are frozen and can be used to compute output values for new input samples. The feedforward process involves presenting an input pattern to input layer neurons that pass the input values onto the first hidden layer. Each of the hidden layer nodes computes a weighted sum of its inputs, passes the sum through its activation function and presents the result to the output layer. The following is the scenario for the pih pattern in a feedforward network with L = 2.
3.3 Backpropagation Algorithm
71
1. The ith node in the input layer holds a value of xpj for the pth pattern. 2. The net input to the y'th node in the hidden layer = net{P = YH=owff)xP,i- This includes the threshold with xPto = 1; the connection from the ith input node to the jih hidden layer node is assigned a weight value wfy \ as shown in figure 3.1. 3. The output of the yth node in the hidden layer = xp1^ = S (E?=o wff)jcpj) is a sigmoid function.
where s
4. The net input to the kih node of the output layer = netj® = ]£,• (wf'Pxplj\ including the threshold; the connection from the jih hidden layer node to the kih output layer node is assigned a weight value wf'P. 5. Output of the kih node of the output layer oPtk = S (5Z,- w^ '• *:, A where S is a sigmoid function. 6. The desired output of the kth node of the output layer is dPtk, and the corresponding squared error is I2 k = \dp 2, weight wj'j l f , ) denotes the weight assigned to the link from node j in the ith layer to node k in the (i + l)th layer, and xplj denotes output of the j'th node in the ith layer for the pth pattern. The error (for pattern p) is given by Ep = £k {tp,k) • We need to discover w, the vector consisting of all weights in the network, such that the value of Ep is minimized. Suppressing the suffix p for convenience, the expression to be minimized is E = £f=1(£fc)2. One way to minimize E is based on the gradient descent method. Since o* depends on the network weights, E is also a function of the network weights. According to gradient descent, the direction of weight change of w should be in the same direction as -dE/dw. To simplify the calculation of —dE/dw, we examine the weight change in a single weight. We calculate the value of BE/dwfP for each connection from the hidden layer to the output layer. Similarly, we calculate the value of dE/dwjj ' for each connection from the input layer to the hidden layer. The connection weights are then changed by using the values so obtained; this method is also known as the generalized delta rule. In brief, the following two equations describe the suggested weight changes.
72
3 Supervised Learning: Multilayer Networks I
The derivative of E with respect to a weight wf'P associated with the link from node j of the hidden layer to the fcth node of the output layer is easier to calculate than for a weight wf:0) connecting the ith node of the input layer to the ;th node of the hidden layer. But both calculations use the same general idea—the chain rule of derivatives. We consider the chain rule in some detail in the following paragraphs as it applies to the backpropagation learning algorithm. We assume h = \dk - o*| in the following derivation. The error E depends on wfP only through 0*, i.e., no other output term ov, k'^k (2 1)
contains wk j '. Hence, for the calculations that follow, it is sufficient to restrict attention to the partial derivative of E with respect to 0* and then differentiate 0* with respect to w(k2f. Since E = J^ if + (dk - ok)2, we obtain dE — = -2(dk-ok).
(3.6)
Before differentiating 0*, we observe that the output 0* is obtained by applying a node function 8 to net^, i.e., 0* = B(netk(2)) and net^ represents the total input to a node k in the output layer, i.e.,
j
Hence Bok/Bnet(k2) = $'(netj*\ where S'(x) = dS(x)/dx. Finally, Bnet^/Bwff = xfK Consequently, the chain rule BE
_ BE
Bok Bnet^
dwff'BokdnetPdtfV
(3 7)
'
gives - 2 | i j = - 2 ( 4 - ok)$\net™)xf.
(3.8)
Next, consider the derivation of (BE/Bwfj®). The error E depends on wf:0) through netf\ which, in turn, appears in each 0* for all k = 1,..., K. Also, 0* = S(n^ 2 ) ), x*p = §(netf}), and net^ = £,- wf}0) x *,-. Therefore, using the chain rule of derivatives, we obtain
3.3
73
Backpropagation Algorithm
J £ _ _ ^ 9 £ d
dw?f~£i °i