COMPUTER SCIENCE, TECHNOLOGY AND APPLICATIONS Additional books and e-books in this series can be found on Nova’s website under the Series tab.
COMPUTER SCIENCE, TECHNOLOGY AND APPLICATIONS
NEURAL NETWORKS HISTORY AND APPLICATIONS
DOUG ALEXANDER EDITOR
Copyright © 2020 by Nova Science Publishers, Inc. All rights reserved. No part of this book may be reproduced, stored in a retrieval system or transmitted in any form or by any means: electronic, electrostatic, magnetic, tape, mechanical photocopying, recording or otherwise without the written permission of the Publisher. We have partnered with Copyright Clearance Center to make it easy for you to obtain permissions to reuse content from this publication. Simply navigate to this publication’s page on Nova’s website and locate the “Get Permission” button below the title description. This button is linked directly to the title’s permission page on copyright.com. Alternatively, you can visit copyright.com and search by title, ISBN, or ISSN. For further questions about using the service on copyright.com, please contact: Copyright Clearance Center Phone: +1-(978) 750-8400 Fax: +1-(978) 750-4470 E-mail: [email protected].
NOTICE TO THE READER The Publisher has taken reasonable care in the preparation of this book, but makes no expressed or implied warranty of any kind and assumes no responsibility for any errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of information contained in this book. The Publisher shall not be liable for any special, consequential, or exemplary damages resulting, in whole or in part, from the readers’ use of, or reliance upon, this material. Any parts of this book based on government reports are so indicated and copyright is claimed for those parts to the extent applicable to compilations of such works. Independent verification should be sought for any data, advice or recommendations contained in this book. In addition, no responsibility is assumed by the Publisher for any injury and/or damage to persons or property arising from any methods, products, instructions, ideas or otherwise contained in this publication. This publication is designed to provide accurate and authoritative information with regard to the subject matter covered herein. It is sold with the clear understanding that the Publisher is not engaged in rendering legal or any other professional services. If legal or any other expert assistance is required, the services of a competent person should be sought. FROM A DECLARATION OF PARTICIPANTS JOINTLY ADOPTED BY A COMMITTEE OF THE AMERICAN BAR ASSOCIATION AND A COMMITTEE OF PUBLISHERS. Additional color graphics may be available in the e-book version of this book.
Library of Congress Cataloging-in-Publication Data ISBN:
Published by Nova Science Publishers, Inc. † New York
CONTENTS

Preface  vii

Chapter 1  Artificial Neural Networks, Concept, Application and Types
           M. Khishe and Gh. R. Parvizi  1

Chapter 2  Emotion Recognition from Facial Expressions Using Artificial Neural Networks: A Review
           Sibel Senan, Zeynep Orman and Fulya Akcan  31

Chapter 3  Dipole Mode Index Prediction with Artificial Neural Networks
           Kalpesh R. Patil and Masaaki Iiyama  75

Chapter 4  Efficacy of Artificial Neural Networks in Differentiating Breast Cancers in Digital Mammography
           Sundaran Kada and Fuk-hay Tang  97

Chapter 5  Supervised Adjustment of Synaptic Plasticity in Spiking Neural Networks
           Saeed Solouki  113

Chapter 6  A Review on Intelligent Decision Support Systems and a Case Study: Prediction of Cryptocurrency Prices with Neural Networks
           Zeynep Orman, Emel Arslan, Burcu Ozbay and Emad Elmasri  189

Index  213

Related Nova Publications  219
PREFACE

With respect to the ever-increasing developments in artificial intelligence and artificial neural network applications in fields such as medicine, industry, biology, history, military industries, recognition science, space, and machine learning, Neural Networks: History and Applications first presents a comprehensive investigation of artificial neural networks. Next, the authors focus on studies carried out with the artificial neural network approach on emotion recognition from 2D facial expressions between 2009 and 2019. The major objective of this study is to review, identify, evaluate and analyze the performance of artificial neural network models in emotion recognition applications. This compilation also proposes a simple nonlinear approach for dipole mode index prediction, in which past values of the dipole mode index were used as inputs and future values were predicted by artificial neural networks. The study was also conducted for seasonal dipole mode index prediction, because the dipole mode index is more prominent in the Sep-Oct-Nov season. A subsequent study focuses on how mammography has a high false negative and false positive rate. As such, computer-aided diagnosis systems have been commercialized to help in micro-calcification detection and malignancy differentiation. Yet, little has been explored in differentiating
breast cancers with artificial neural networks, one example of computer-aided diagnosis systems. The authors aim to bridge this gap in research. The penultimate chapter reviews the general conditions under which synaptic plasticity most effectively takes place to support the supervised learning of a precise temporal code. Then, the accuracy of each plasticity rule with respect to its temporal encoding precision is examined, and the maximum number of input patterns it can memorize, using the precise timings of individual spikes as an indicator of storage capacity, is explored in different control and recognition tasks. In closing, a case study is presented centered on an intelligent decision support system built on a neural network model based on the Encog machine learning framework to predict cryptocurrency close prices. Chapter 1 - With respect to the ever-increasing development of the capabilities of Artificial Intelligence (AI) and Artificial Neural Network (ANN) applications in fields such as medicine, industry, biology, history, military industries, recognition science, space, and machine learning, this chapter presents a comprehensive investigation of ANNs. In this respect, not only will a full background of the emergence of ANNs be reviewed, but the terms and existing concepts regarding AI will also be reviewed. Then, the simplest kind of ANN, named the perceptron, will be introduced by modeling an artificial neuron on a biological neuron. The rest of this chapter introduces the different kinds of ANNs, such as Multi-Layer Perceptron Neural Networks (MLP NNs), Radial Basis Function Neural Networks (RBF NNs), Hopfield, Hamming, Kohonen Self-Organized Map (SOM), Time Delay Neural Network (TDNN), Deep Feed Forward (DFF), Recurrent Neural Networks (RNNs), Long-Short Term Memory Neural Networks (LSTM NNs), Auto Encoders Neural Network (AE NN), and Markov Chains Networks (MC Ns). In the end, this chapter investigates the strengths and weaknesses of the abovementioned NNs. Chapter 2 - Facial expressions are universal signs that provide semantic integrity in interpersonal communication. Changes in facial expressions are considered the most important clues in emotion psychology. Automatic analysis and classification of facial expressions are challenging problems in
many areas such as human-computer interaction, computer vision, and image processing. In addition, there is a growing need for the analysis of facial expressions in many areas such as psychology, safety, health, games, and robotics. Therefore, rapid and accurate analysis of facial expressions plays a critical role in many systems in different application areas. This chapter focuses on all the studies carried out with the Artificial Neural Network (ANN) approach on emotion recognition from 2D facial expressions between 2009 and 2019. The major objective of this study is to review, identify, evaluate and analyze the performance of ANN models in emotion recognition applications. Hence, to shed light on the implementations of artificial neural networks for facial expression analysis, this chapter reviews all the related studies based on five key parameters: (i) the ANN models used for classification, (ii) the feature extraction methods used with each ANN classification model, (iii) the accuracy rates of the ANN models for each dataset used in facial expression analysis studies, (iv) the rates at which datasets are preferred by facial expression analysis studies, and (v) the rates of studies according to publishing year. The published literature presented in this study shows the high potential of ANN models for effective emotion recognition applications. Chapter 3 - The Indian Ocean Dipole (IOD), being one of the important climatic indices, is directly linked with floods and droughts occurring in the countries of the Indian Ocean rim. IOD events are expected to occur more frequently because the warming rate of the Indian Ocean is slightly higher than that of the global ocean. It is therefore important to predict such IOD events in advance in order to efficiently mitigate their effects on local and global climate. The strength of an IOD event is defined by the dipole mode index (DMI), which is the difference of sea surface temperature anomalies (SSTA) between the western (50°-70°E, 10°S-10°N) and eastern (90°-110°E, 10°S-0°) parts. Depending on this difference of SSTA between the western and eastern parts, an IOD event is categorized as strong or weak: strong and weak IOD events show differences higher and lower than one standard deviation, respectively. Several past studies have attempted to predict IOD events with coupled ocean-atmosphere global circulation models (COAGCM), which need very accurate oceanic subsurface observations to accurately predict the DMI. Also,
COAGCMs assume that the predictability of the DMI is largely governed by variations in the Nino 3.4 (170°-120°W, 5°S-5°N) region, which is true for the western part of the Indian Ocean, whereas the predictability of the eastern part is largely governed by its intrinsic dynamics and is independent of the Nino 3.4 region. Such assumptions produce low skill in DMI prediction by COAGCMs. This study proposes a simple non-linear approach for DMI prediction in which past values of the DMI were used as inputs and future values were predicted by artificial neural networks (ANN). The DMI was composed using Hadley SST data for a long period from 1870 to the present. Anomalies were based on a 32-year period from 1981 to 2013. Apart from past values of the DMI, past values of the El Nino Southern Oscillation (ENSO) were also considered as inputs to check their contribution to the skill of DMI prediction. The study was also conducted for seasonal DMI prediction, because the DMI is more prominent in the Sep-Oct-Nov season. DMI prediction skills were compared with observed anomalies and with the persistent model (PM). The four-months-ahead prediction skill in terms of root mean square error (rmse) was observed to be lower than 0.29°C for the ANN model and higher than 0.35°C for the PM. This study suggests that the DMI can be predicted skillfully with a lead time of 3-4 months and two seasons ahead. Prediction skill compared with observed anomalies separately for Sep, Oct and Nov varies from 0.93 to 0.61 in correlation coefficient (r) and from 0.16°C to 0.35°C in rmse for one and three months ahead, respectively. The seasonal prediction skill for the Sep-Oct-Nov season as a whole is 0.81 (r) and 0.23°C (rmse) one season ahead. In addition, a few important extreme IOD events were assessed, including the 1994 positive IOD and the 2016 negative IOD event; the amplitudes of these events were predicted with 87% and 78% accuracy, respectively, with one season lead time. This study also concludes that ENSO has little effect on IOD events, because no significant improvement in DMI prediction was noticed when past ENSO values were given as inputs. Chapter 4 - The most common invasive breast cancers include invasive ductal carcinoma (IDC) and invasive lobular carcinoma (ILC). About eight
of ten invasive breast cancers are IDC and about one invasive breast cancer in ten is an ILC. As for carcinoma in situ, ductal carcinoma in situ (DCIS) is the most common non-invasive type, making up 80-90% of carcinoma in situ. Mammography has a high false negative and false positive rate. Computer-aided diagnosis (CAD) systems have been commercialized to help in micro-calcification detection and malignancy differentiation. Yet, little has been explored in differentiating breast cancers with artificial neural networks (ANNs), one example of CAD systems. The aim of this chapter was to describe how well ANNs differentiate the three most prevalent types of breast cancer from normal breasts, namely ductal carcinoma in situ (DCIS), invasive ductal carcinoma (IDC) and invasive lobular carcinoma (ILC). The authors conducted a study in which 160 digital mammograms were collected: IDC, ILC and DCIS in equal numbers (40 images of each type of cancer) plus 40 control images. All cancers were screened by the mammography unit and further proven histologically by biopsy between November 2012 and November 2015. Mammograms were analysed with a CAD system and an Image Feature Assessment Program. The CAD system determines the possible regions of interest, which are then used for feature extraction. The authors' results indicated that the accuracy for detecting IDC against normal is 97.5% (N = 40, normal = 20, abnormal = 20); for ILC against normal, 97.5% (N = 40, normal = 20, abnormal = 20); and for DCIS against normal, 76.9% (N = 39, normal = 20, abnormal = 19). One DCIS image was omitted because its image quality did not meet the authors' program criteria. The authors' study indicated that using image features in conjunction with artificial neural networks offers a promising method for differentiating different invasive breast cancers from normal breasts. Chapter 5 - Precise spike timing as a means to encode information in neural networks is biologically supported and is advantageous over frequency-based codes because input features are processed on a much shorter timescale. For these reasons, much recent attention has been focused on the development of supervised learning rules for spiking neural networks (SNN) that utilize a temporal coding scheme. However, despite significant progress
in this area, there is still a lack of rules that have a theoretical basis and yet can be considered biologically relevant. Here the authors review the general conditions under which synaptic plasticity most effectively takes place to support the supervised learning of a precise temporal code. Then, the authors examine the accuracy of each plasticity rule with respect to its temporal encoding precision, and explore the maximum number of input patterns it can memorize, using the precise timings of individual spikes as an indicator of storage capacity in different control and recognition tasks. Practically, the network should learn to distinguish patterns belonging to different classes from the temporal patterns of spikes. Chapter 6 - Intelligent Decision Support Systems (IDSS) with the integration of advanced artificial intelligence techniques have been developed to assist decision-makers during their decision management process. Due to the advances in data mining and artificial intelligence techniques, there has been a growing interest in the development of such systems. IDSS are becoming increasingly critical to the daily operations of organizations because of the need for rapid and effective decision-making. In this chapter, over twenty scientific publications related to IDSS, primarily spanning the decade between 2008 and 2018, were analyzed. The authors have presented a classification analysis of the IDSS literature based on the data mining algorithms used and the type and performance of the systems. The authors have also provided information on the current applications, benefits, and future research of these systems. When the results are considered, it can be deduced that the use of Artificial Neural Networks (ANN) increases the accuracy rate in many of the related studies. Therefore, in addition to the IDSS review, a case study of an intelligent decision support system that is built on a neural network model based on the Encog machine learning framework to predict cryptocurrency close prices is also presented in this chapter.
In: Neural Networks
Editor: Doug Alexander
ISBN: 978-1-53617-188-4 © 2020 Nova Science Publishers, Inc.
Chapter 1
ARTIFICIAL NEURAL NETWORKS, CONCEPT, APPLICATION AND TYPES

M. Khishe 1,* and Gh. R. Parvizi 2

1 Department of Electronic Engineering, Imam Khomeini University of Naval Science, Nowshahr, Iran
2 Faculty of Foreign Languages, University of Isfahan, Iran
ABSTRACT

With respect to the ever-increasing development of the capabilities of Artificial Intelligence (AI) and Artificial Neural Network (ANN) applications in fields such as medicine, industry, biology, history, military industries, recognition science, space, and machine learning, this chapter presents a comprehensive investigation of ANNs. In this respect, not only will a full background of the emergence of ANNs be reviewed, but the terms and existing concepts regarding AI will also be reviewed. Then, the simplest kind of ANN, named the perceptron, will be introduced by modeling an artificial neuron on a biological neuron. The rest of this chapter introduces the different kinds of ANNs, such as Multi-Layer Perceptron Neural Networks (MLP NNs), Radial Basis Function Neural Networks (RBF NNs), Hopfield, Hamming, Kohonen Self-Organized Map (SOM), Time Delay Neural Network (TDNN), Deep Feed Forward (DFF), Recurrent Neural Networks (RNNs), Long-Short Term Memory Neural Networks (LSTM NNs), Auto Encoders Neural Network (AE NN), and Markov Chains Networks (MC Ns). In the end, this chapter investigates the strengths and weaknesses of the abovementioned NNs.

* Corresponding Author's Email: [email protected].
Keywords: artificial intelligence, artificial neural networks, training, neuron
INTRODUCTION

The early decades of the twentieth century, the era of huge industrial development and of automobile production, caused an all-out revolution in transportation, an increase in mobility, and hundreds of new jobs and careers in the field of commerce. It seems that the symbol of the post-industrial era and of the next century's unique products is AI. Today, the topic of AI is the hottest debate among computer and information science experts, other scientists, and decision makers. Throughout history, human beings have been the center of discussions and research in the field of body and mind. But nowadays a lesser, lifeless, fabricated creature aspires to be their replacement, a prospect that most human beings reject. If AI achieves its big goals, there will be a leap in the pursuit of greater human well-being and even greater wealth. Good examples of accepted AI are already applied in the real world [1], and such achievements will continue to justify the resources needed in the future. AI critics, on the other hand, argue that spending time and other valuable resources on building a faulty product that is full of failures leads to defaming and trampling on human capabilities. The bitterest criticism is that AI is a blatant insult to the essence of human nature and the role of man. The ANN is a practical way to learn various functions, such as real-valued functions, discrete-valued functions, and vector-valued functions.
Neural network learning is robust to errors in the training data, and networks have been successfully applied to problems such as speech recognition, image recognition, interpretation, and robot learning. To understand AI, it is worthwhile to know its differences from human intelligence. The study of ANNs is largely derived from natural learning systems, in which a complex set of interconnected neurons is involved in performing a task. The human brain is made up of about $10^{11}$ neurons, and each neuron is connected to approximately $10^{4}$ other neurons. The switching speed of a neuron is on the order of $10^{-3}$ seconds, yet a human can recognize an image in 0.1 seconds. The human brain is made up of billions of cells or neural connections, and these cells are complexly interconnected [2]. Simulating the human brain can be done through hardware or software. Preliminary research has shown that brain simulation is simple and mechanical: for example, a worm has multiple neural networks, an insect has about one million neural connections, and the human brain is made up of billions of neural connections [3]. By concentrating and connecting artificial neural connections, one can create an AI unit. Human intelligence is much more complex and widespread than computer systems, and has outstanding capabilities such as reasoning, behavior, comparison, creation, and the application of concepts. Human intelligence is capable of making connections between subjects and comparing new representations. A human always makes new laws or applies old ones to new situations. Another of its qualities is the ability to create various meanings in the world around it, from extensive concepts such as cause-and-effect relationships in time down to simpler concepts such as the choice of meals (breakfast, lunch and dinner). Thinking about these concepts and applying them is specific to intelligent human behavior. AI seeks to build devices that can deliver the aforementioned abilities (reasoning, behavior, comparison, and conceptualization). What has been built so far has not been able to achieve this, although it has produced many benefits. Lastly, one of the causes of the problem with AI is its inappropriate naming. If John McCarthy [4] had called it in 1956 something like "Advanced
Planning", perhaps there would have been no war or controversy around it. AI is divided into a number of subfields and attempts to create systems and methods that imitate the intelligence and logic of decision makers. One of the main branches of AI is the ANN, so in the following we will try to introduce ANNs, their concepts, applications, and types. Simultaneously but separately, from the 19th century on, neurophysiologists tried to discover the brain's learning and analysis system, while mathematicians sought to develop a mathematical model that could be generalized and analyzed. The first attempts at simulation were made using a logic model by Warren McCulloch and Walter Pitts [5], which is today the main building block of most ANNs. This model presents hypotheses about the function of neurons; its operation is based on the sum of inputs and outputs, and if the sum of the inputs exceeds a threshold value, the neuron is said to be excited. The result of this model was the ability to perform simple functions like the "AND" and "OR" logic gates. Not only neurophysiologists but also psychologists and engineers were instrumental in the development of neural network simulations. In 1958, the MLP NN was introduced by Rosenblatt [6]. It was built from the same previously modeled units: the MLP has three layers, with a middle layer known as the hidden layer, and the system can learn to map a given input to the corresponding output. Another system is the Adaptive Linear Neuron (ADALINE), developed in 1960 by Widrow and Hoff (Stanford University) [7], the first neural network employed in real-world problems. Adaline was an electronic device made of simple components; the method used for training was different from that of the perceptron. In 1969, Minsky and Papert wrote a book outlining the limitations of single-layer and multi-layer perceptron systems [8]. The result of this book was a bias against, and the discontinuation of investment in, neural network simulation research; research was suspended for several years on the claim that the perceptron scheme could not solve any interesting problem.
Although public enthusiasm and investment had fallen to a minimum, some researchers continued their work to develop machines capable of solving problems such as pattern recognition. Among them was Stephen Grossberg [9], who introduced the Avalanche network for continuous speech recognition and robot hand control; together with Gail Carpenter, he also developed the Adaptive Resonance Theory (ART) networks, which differ from natural models. Anderson and Kohonen were also individuals who developed techniques for learning [10]. In 1976, Linnainmaa developed the error Back-Propagation (BP) training method [11], which gave multilayer perceptron networks stronger training rules. The advances made in the 1970s and 1980s were crucial in attracting attention to neural networks, and other factors also contributed, including the extensive books and conferences offered to people in a variety of disciplines. Today, there have been many developments in ANN technology, such as new meta-heuristic training methodologies [12-14], the use of fuzzy techniques [15], different topologies, and so on [16-17].
BIOLOGICAL NEURAL NETWORKS

Biological neural networks are a huge set of parallel processors, called neurons, that work in harmony to solve problems and transmit information through synapses (electrochemical connections) [1]. In these networks, if one cell is damaged, the other cells can compensate for its absence and also contribute to its regeneration. These networks are capable of learning: for example, when touch receptor cells are burned, the cells learn not to move toward the hot body, and with this mechanism the system learns to correct its error. Learning in these systems is adaptive, meaning that, by using examples, the weights of the synapses are altered in such a way as to produce a correct response when new inputs are presented to the system. A typical biological neuron is shown in Figure 1.
Figure 1. The typical structure of a biological neuron.
WHAT IS A NEURAL NETWORK?

Generally, a biological neural network consists of a set of physically or functionally interconnected neurons. Each neuron can be connected to a large number of neurons, and the total number of neurons and connections between them can be very large. The connections, called synapses, are usually made up of axons and dendrites [3]. AI and cognitive modeling try to simulate some properties of neural networks; although similar in their methods, the goal of AI is to solve specific problems, while the goal of cognitive modeling is to build mathematical models of biological neuronal systems. A neural network consists of an arbitrary number of neurons that connect the input set to the output.
ARTIFICIAL NEURONS

An artificial neuron is a system with many inputs and only one output. The neuron has two modes, training mode and operation mode. In training mode, the neuron learns whether or not to fire in response to specific input patterns. In operation mode, the corresponding output is produced when a recognized input pattern is presented; if the input is not one of the pre-identified inputs, the firing rules decide whether or not to fire. By omitting some of the properties of neurons and their interconnections, an elementary model of the neuron can be simulated by a computer. A typical artificial neuron is shown in Figure 2.
Figure 2. An artificial neuron.
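As a concrete illustration of the neuron just described (a generic sketch, not the authors' model), the snippet below computes a weighted sum of the inputs and fires when it exceeds a threshold; the weights and threshold are invented for the example.

```python
# Minimal artificial neuron: weighted sum of inputs passed through a
# threshold rule. Weights and threshold are illustrative values only.
import numpy as np

def artificial_neuron(x, w, threshold):
    """Fire (return 1) if the weighted sum of inputs exceeds the threshold."""
    return 1 if np.dot(w, x) > threshold else 0

# With these weights the neuron behaves like a logical AND gate.
w = np.array([1.0, 1.0])
print(artificial_neuron(np.array([1, 1]), w, threshold=1.5))  # 1 (fires)
print(artificial_neuron(np.array([1, 0]), w, threshold=1.5))  # 0 (silent)
```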
WHAT IS AN ANN?

ANNs are information-processing models built by mimicking biological NNs such as the human brain. The key element of this model is the novel structure of its information-processing system and its many elements (neurons), which are linked by strong internal connections and work together to solve specific problems. ANNs, by processing experimental data, transfer the knowledge or rule behind the data to the
network structure, which is called learning [18]. The ability to learn is essentially the most important feature of an intelligent system [19]. A system that can learn is more flexible and easier to program, so it can respond better to new problems and conditions. Creativity, flexibility, and parallel processing in the brain are interesting to humans, and these capabilities are highly desirable in machines; algorithmic methods are not well suited to implementing these properties in machines. Humans learn by example, just as a child can recognize a particular kind of animal after seeing different examples of that animal. In other words, an ANN is a special data processing system inspired by the human brain that distributes data processing over many small processors, which work in a networked and parallel way to solve a problem. In these networks, with the help of programming knowledge, a data structure is designed that can act as a neuron; this is called the node structure. The network is then trained by creating connections between these nodes and applying a training algorithm. In this neural network, the nodes have two states, active (on or 1) and inactive (off or 0), and each connection (synapse, or connection between nodes) has a weight. Positive-weighted connections stimulate or activate the next inactive node, and negative-weighted connections inactivate or inhibit the next node (if active).
NEURAL NETWORK STRUCTURE

As shown in Figure 3, a simple neural network consists of layers and connection weights. Network behavior also depends on the communication between its members. There are generally three types of neuron layers in neural networks:

1. Input layer: Receives the raw data fed into the network.
2. Hidden layers: The performance of these layers is determined by the inputs and the weights of the connections between them and the
hidden layers. The weights between the input and hidden units determine when a hidden unit should be activated.
3. Output layer: The output of each output unit depends on the activity of the hidden units and the weights of the connections between the hidden units and the output.
Figure 3. A typical multi-layer neural network.
There are also single-layer and multi-layer networks; the single-layer organization, in which all units are connected to one layer, is the most widely used and has more computational potential than multi-layer organizations. In multilayer networks, units are numbered by layer (instead of following a global numbering). Adjacent layers of a network are connected by weights, that is, by connections. In neural networks, there are several types of connectivity or weighting:
Feedforward: Most connections are of this kind, in which signals move in one direction only, from input to output; there is no feedback (loop), and the output of each layer does not affect that same layer.
Backward: Data is returned from the top-layer nodes to the bottom-layer nodes.
Side (lateral): The outputs of each layer's nodes are used as inputs to nodes of the same layer.
TRAINING METHODS

ANNs are divided into four categories according to the training method:

1. Fixed weights: There is no training and the weight values are not updated. Applications include information optimization, volume reduction, resizing and compression.
2. Unsupervised training: Weights are modified only on the basis of the inputs; there is no desired output with which to correct the weights by comparing the network output and computing an error value. Weights are updated only on the basis of the input pattern information. The purpose is to extract the attributes of the input patterns based on a clustering strategy, or to classify and identify similarities (forming groups of similar patterns), without specifying the outputs or classes corresponding to the input patterns. This learning is usually based on a best-matching scheme: the unsupervised network adjusts its weights based on its response to the input, so as to provide a more suitable response to that input on the next presentation. As a result, the network learns how to respond to its inputs. The winning neuron is basically selected by a dominant-neuron technique as the one with the strongest initial stimulation; therefore, in unsupervised networks, finding the dominant neuron is one of the most important tasks.
3. Supervised training: For each set of input patterns, the corresponding outputs are also presented to the network, and the weights are changed until the difference between the network outputs and the desired outputs for the training data is within an acceptable error range. In these methods, either the outputs are related directly to the weights, or the error is propagated back from the output layer to the input layer and the weights are corrected. The purpose of such a network design is first to train the network using the existing training data and then to classify new input vectors presented to the network, which the network may or may not have seen before. Such networks are widely used for pattern recognition tasks [20].
4. Reinforcement training: The quality of the system's performance is improved step by step over time. There are no training patterns; instead, using a signal called the critic, the system's behavior is evaluated as good or bad (a state between supervised and unsupervised learning).
APPLICATION OF ANNS

ANNs are growing and improving both quantitatively and in terms of structural analysis and hardware implementation, and the number of neural computing techniques continues to increase. ANNs have a wide range of applications, including aerospace, finance, industry, transportation, banking, entertainment, defense, electronics, and oil and gas.
ANN TYPES

The following are the well-known types of neural networks:

1. Multi-Layer Perceptron Neural Network (MLP NN)
2. Radial Basis Function Neural Networks (RBF NNs)
3. Hopfield Neural Network (HN)
4. Hamming Neural Network (HNN)
5. Kohonen Self-Organized Map Neural Network (KSOM NN)
6. Time Delay Neural Network (TDNN)
7. Deep Feed Forward Neural Networks (DFF NNs)
8. Recurrent Neural Networks (RNNs)
9. Long-Short Term Memory Neural Networks (LSTM NNs)
10. Auto Encoders Neural Network (AE NN)
11. Markov Chains Networks (MC Ns)
PERCEPTRON NEURAL NETWORK

This ANN is constructed from a computational unit called the perceptron. A perceptron takes real-valued inputs and calculates a linear combination of them. Perceptron neural networks, especially the MLP, are among the most practical ANNs, capable of realizing a suitable mapping with a number of neural layers and cells that is often not large.
SINGLE LAYER PERCEPTRON

The first network we consider is the single-layer perceptron, which is structured as in Figure 4.
Figure 4. Single layer perceptron.
LEARNING A PERCEPTRON

Learning a perceptron is about finding the right values for W, so the hypothesis space H of perceptron learning is the set of all possible real-valued weight vectors. A perceptron can only learn examples that are linearly separable, that is, samples that can be completely separated by a hyperplane.
Figure 5. Functions that a perceptron can learn: linearly separable versus linearly inseparable.
A perceptron can represent many Boolean functions, such as AND, OR, NAND and NOR, but it cannot represent XOR. In fact, any Boolean function can be represented by a two-layer network of perceptrons.
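To make this concrete, here is a minimal sketch of the classic perceptron learning rule (a generic textbook version, not a procedure given in the chapter): weights are nudged toward each misclassified example. The learning rate and epoch count are arbitrary illustrative choices; AND is learned because it is linearly separable, while a single perceptron would fail on XOR.

```python
# Perceptron learning rule on the AND function (linearly separable).
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=50):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1 if np.dot(w, xi) + b > 0 else 0
            w += lr * (target - pred) * xi     # nudge weights toward the target
            b += lr * (target - pred)
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])   # linearly separable: learnable
y_xor = np.array([0, 1, 1, 0])   # not linearly separable: a single perceptron fails
w, b = train_perceptron(X, y_and)
print([1 if np.dot(w, xi) + b > 0 else 0 for xi in X])  # [0, 0, 0, 1]
```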
MULTILAYER PERCEPTRON NEURAL NETWORKS An MLP NN with three layers are shown in Figure 6, where n is the number of input nodes, h is the number of hidden nodes and m is the number of output nodes. In each iteration, the output of hidden nodes are calculated by Eq.s (1) and (2) [21]. s1
X1
O1
sj
Xn
Om
sh
Figure 6. An MLP neural network with three layers with (n,h,m) structure.
$$s_j = \sum_{i=1}^{n} w_{ij}\, x_i - \theta_j, \qquad j = 1, 2, \dots, h \qquad (1)$$

$$f(s_j) = \frac{1}{1 + \exp(-s_j)}, \qquad j = 1, 2, \dots, h \qquad (2)$$

where n represents the number of input nodes, $w_{ij}$ indicates the connection weight between the i-th node in the input layer and the j-th node in the hidden layer, $\theta_j$ is the bias (threshold) of the j-th hidden node, and $x_i$ is the input of the i-th node. After calculating the outputs of the hidden nodes, the final outputs can be defined as follows:

$$o_k = \sum_{j=1}^{h} w_{kj}\, f(s_j) - \theta_k, \qquad k = 1, 2, \dots, m \qquad (3)$$

where $w_{kj}$ represents the connection weight between the j-th hidden node and the k-th output node, and $\theta_k$ is the bias (threshold) of the k-th output node. Finally, the learning error E (fitness function) is calculated as follows:

$$E_k = \sum_{i=1}^{m} \left( o_i^{k} - d_i^{k} \right)^2 \qquad (4)$$

$$E = \sum_{k=1}^{q} \frac{E_k}{q} \qquad (5)$$

where q represents the number of training samples, $d_i^{k}$ indicates the expected output of the i-th output node when the k-th training sample is used, and $o_i^{k}$ represents the actual output of the i-th output node when the k-th training sample is used. Therefore, the fitness function of the i-th training sample can be defined by Eq. (6):

$$\mathrm{Fitness}(X_i) = E(X_i) \qquad (6)$$
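To make the notation above concrete, here is a minimal NumPy sketch of the (n, h, m) forward pass and learning error of Eqs. (1)-(5). It is an illustration only: the weight matrices, biases, and data below are random placeholders, not parameters from the chapter.

```python
# Forward pass and mean learning error of a three-layer MLP, per Eqs. (1)-(5).
import numpy as np

def mlp_forward(x, W_ih, theta_h, W_ho, theta_o):
    s = W_ih.T @ x - theta_h          # hidden pre-activations, Eq. (1)
    f = 1.0 / (1.0 + np.exp(-s))      # sigmoid of Eq. (2)
    return W_ho.T @ f - theta_o       # outputs, Eq. (3)

def mean_error(X, D, params):
    # Eqs. (4)-(5): squared error per sample, averaged over the q samples.
    errs = [np.sum((mlp_forward(x, *params) - d) ** 2) for x, d in zip(X, D)]
    return np.mean(errs)

n, h, m, q = 4, 5, 2, 10
rng = np.random.default_rng(0)
params = (rng.normal(size=(n, h)), rng.normal(size=h),
          rng.normal(size=(h, m)), rng.normal(size=m))
X, D = rng.normal(size=(q, n)), rng.normal(size=(q, m))
print(mean_error(X, D, params))       # the fitness value of Eq. (6) for these weights
```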
RADIAL BASIS FUNCTION NEURAL NETWORKS

RBF NNs are Feed-Forward Neural Networks (FFNNs) composed of three layers (an input layer, a hidden layer, and an output layer). In RBF NNs, the outputs of the input layer are processed by calculating the distance between the inputs and the centers of the hidden layer, and the outputs of the second (hidden) layer are calculated by multiplying the outputs of the input layer by the related connection weights. Each neuron of the hidden layer has a center. The general description of a typical RBF NN is given by Eq. (7) [22]:

$$\hat{y}_j = \sum_{i=1}^{I} w_{ij}\, \varphi\!\left(\lVert x - c_i \rVert\right) + \beta_j \qquad (7)$$

In this chapter, the Euclidean distance is considered as the classic distance and the Gaussian basis function is considered as the RBF, as shown by Eq. (8):

$$\varphi(r) = \exp\!\left(-\sigma_i \lVert x - c_i \rVert^2\right) \qquad (8)$$
In Eqs. (7) and (8), i is defined as i = 1, 2, 3, ..., I, where I is the number of hidden neurons, $w_{ij}$ is the connection weight from the i-th neuron in the hidden layer to the j-th neuron in the output layer, $\varphi$ indicates the Gaussian basis function, $\sigma_i$ is the variance (width) parameter of the i-th hidden neuron, x is the input vector, $c_i$ is the center vector of neuron i, $\beta$ is the bias of the j-th neuron in the output layer, and y is the output of the RBF NN. Figure 7 shows an RBF NN with three layers, where the number of inputs (x) is m. In this figure, the number of hidden neurons is I, and the output of each hidden neuron is calculated in terms of the Euclidean distance between the inputs and its center vector. Each hidden neuron contains an activation function, the RBF Gaussian basis function. The outputs of the hidden layer
are transferred to the output layer through the weights ($w_1, \dots, w_I$). The output of the RBF NN is a linear combination of the outputs of the hidden layer and the bias parameter $\beta$. Finally, y is calculated as the RBF's output.
Figure 7. An RBF NN with one hidden layer.
$$E_{SSE}(w, \sigma, c, \beta) = \sum_{j=1}^{I} \left( y_j - \hat{y}_j \right)^2 \qquad (9)$$

where $y_j$ is the desired output and $\hat{y}_j$ is the calculated output. The final aim of the RBF NN training method is to minimize this error (equivalently, the RMSE).
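The following is a minimal sketch of the RBF forward pass of Eqs. (7)-(8): Gaussian basis functions over Euclidean distances to the hidden centers, combined linearly with an output bias. The centers, widths, weights, and bias below are random placeholders, not a trained network.

```python
# RBF network forward pass per Eqs. (7)-(8).
import numpy as np

def rbf_forward(x, centers, sigmas, W, beta):
    dists_sq = np.sum((centers - x) ** 2, axis=1)     # ||x - c_i||^2 for each center
    phi = np.exp(-sigmas * dists_sq)                  # Gaussian basis, Eq. (8)
    return W.T @ phi + beta                           # linear output layer, Eq. (7)

rng = np.random.default_rng(1)
I, m, outputs = 6, 3, 2                               # hidden neurons, inputs, outputs
centers = rng.normal(size=(I, m))
sigmas = np.abs(rng.normal(size=I))
W = rng.normal(size=(I, outputs))
beta = rng.normal(size=outputs)
print(rbf_forward(rng.normal(size=m), centers, sigmas, W, beta))
```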
HOPFIELD NEURAL NETWORK

As shown in Figure 8, the Hopfield network has a particular architecture that separates it from other networks. In principle, these networks have a single layer of neurons whose outputs are fed back to the inputs, so that in a sense the input neurons are also the output neurons.
Figure 8. Hopfield network with a particular architecture.
Unlike other networks, this network does not determine its weights through a training algorithm; instead, the weights are set by a specific formula, and in the recall (detection) algorithm the inputs are iteratively updated and modified in a prescribed way. At any one time only one neuron is active and the other neurons are inactive, i.e., one neuron receives input from the other neurons and changes its state while the other neurons remain stationary. The stored pattern may be an image or any other pattern.
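The chapter does not spell the weight formula out, so the sketch below uses one common choice, a Hebbian outer-product rule over stored bipolar patterns, together with an iterative recall step; the patterns and the noisy probe are illustrative only, and synchronous updates are used for brevity instead of the one-neuron-at-a-time scheme described above.

```python
# Hebbian storage and iterative recall for a small Hopfield-style network.
import numpy as np

def hopfield_weights(patterns):
    P = np.array(patterns, dtype=float)
    W = P.T @ P / len(P)              # outer-product (Hebbian) rule
    np.fill_diagonal(W, 0.0)          # no self-connections
    return W

def recall(W, state, steps=10):
    for _ in range(steps):            # synchronous updates for brevity
        state = np.sign(W @ state)
        state[state == 0] = 1
    return state

patterns = [[1, -1, 1, -1, 1], [-1, -1, 1, 1, -1]]
W = hopfield_weights(patterns)
noisy = np.array([1, -1, 1, -1, -1])  # first pattern with the last element flipped
print(recall(W, noisy))               # converges back to the first stored pattern
```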
HAMMING NEURAL NETWORK

This network was first introduced by Steinbuch in 1961 and has been revised in recent years by Lippmann [23]. This network also falls under the neural network framework, since it consists of a series of neurons as nodes and a series of transverse weights between the nodes, as in Figure 9.
Figure 9. A typical hamming network.
Each node has an activation function that produces the neuron output. The Hamming network contains both feedforward and feedback structures. Its main purpose is to identify which stored reference pattern is closest to the input pattern and then present that pattern at the network output. The Hamming network consists of three layers:

Feeder layer: The first (feedforward) layer, represented by a weight matrix, a bias vector and a linear transfer function, calculates the inner product between each reference vector and the input vector. The reference patterns are stored in the network through this weight matrix.

Recurrent (WTA) layer: The middle layer of the Hamming NN has a recurrent, competitive structure and is therefore also called the competitive layer. Its size corresponds to the number of reference vectors, i.e., the storage capacity of the network. It is initialized with the output values computed by the first layer and then, at each iteration, reduces the activity of its cells, repeating this action until all outputs except that of the winning cell (the one indicating the greatest similarity between a reference pattern and the input vector) have been suppressed. At that point the middle layer, and thus the entire Hamming NN, is in its steady state, and it is useless to continue iterating. This is called the winner-take-all (WTA) operation: the neurons in the middle layer compete, one neuron wins and the rest lose.

Third layer: This layer of the Hamming NN is a feedforward network with a weight matrix and a symmetric two-valued threshold transfer function. Its task is that, after the second layer has converged, the stored reference vector appears at the output of the network; for example, if the second layer indicates that the "apple" reference pattern matches the input, the third layer reproduces that reference vector at the network output.
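As a rough illustration of this pipeline (a simplified sketch, not the formulation used in the chapter), the snippet below scores each stored bipolar reference pattern against the input with an inner product, lets a simple argmax stand in for the iterative WTA competition, and returns the winning reference; the patterns are invented for the example.

```python
# Hamming-network-style recall: score, winner-take-all, emit stored pattern.
import numpy as np

def hamming_recall(references, x):
    R = np.array(references, dtype=float)
    scores = R @ x                     # feeder layer: inner product with each reference
    winner = int(np.argmax(scores))    # stands in for the WTA competition
    return winner, R[winner]           # third layer: emit the winning reference

references = [[1, 1, -1, -1, 1], [-1, 1, -1, 1, -1], [1, -1, 1, -1, -1]]
probe = np.array([1, 1, -1, 1, 1])     # closest (in Hamming distance) to the first reference
print(hamming_recall(references, probe))
```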
KOHONEN SELF-ORGANIZED MAP NEURAL NETWORK

Scholars consider this network one of the most difficult single-layer networks. Kohonen designed a network whose only given parameters are the input neurons, while the weights and output neurons are unknown parameters to be found [24]; it is a self-organizing network. Kohonen's method is to choose a number of output neurons and to arrange them according to a simple geometric distance. Input and output neurons are quantified with binary values. The weights are obtained by repetition, and the network operates nonlinearly. Kohonen's model is a model without a supervisor. In this model, a number of neurons, usually arranged in a flat topology, interact with each other to perform the task of organizing the network; this organization task amounts to estimating a density. Consider an input vector with a probability density over each of its dimensions. From this density space, samples are applied randomly to the grid. According to the position of the input vector in the space, the weights of the cells are changed algorithmically. This change occurs so that, eventually, the weight vectors of the cells are evenly
distributed over the input probability density space, thereby estimating the input probability density by spreading the network's cells into the input space. Cells in the input probability space can be seen as a kind of compressed information, since each cell now represents an approximate region of the space.
Figure 10. Kohonen self-organized map neural network.
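Below is a minimal sketch of this update, under common textbook assumptions rather than anything specified in the chapter: the best-matching cell and its neighbours on a one-dimensional map are pulled toward each random input, so the weight vectors spread over the input distribution. The map size, learning rate and neighbourhood radius are illustrative.

```python
# One-dimensional self-organizing map: competitive update with a neighbourhood.
import numpy as np

rng = np.random.default_rng(0)
grid = rng.uniform(size=(10, 2))          # 10 output neurons, 2-D inputs

def som_step(grid, x, lr=0.2, radius=1.5):
    bmu = np.argmin(np.linalg.norm(grid - x, axis=1))   # best-matching unit
    dist = np.abs(np.arange(len(grid)) - bmu)           # distance along the map
    h = np.exp(-(dist ** 2) / (2 * radius ** 2))        # neighbourhood function
    return grid + lr * h[:, None] * (x - grid)          # pull BMU and neighbours toward x

for _ in range(2000):
    grid = som_step(grid, rng.uniform(size=2))
print(grid.round(2))                      # weight vectors spread over the unit square
```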
TIME DELAY NEURAL NETWORK

These are kinds of multilayer neural networks that are capable of dealing with the dynamic nature of sample data and input signals. Such multilayer neural networks have the following characteristics:

1. They have multiple layers, and each layer has a sufficient number of connections between neurons, so that the network is capable of learning complex nonlinear decision surfaces.
2. The network behavior is sensitive to the temporal arrangement of the sample properties.
3. Network learning is sensitive to the exact timing of the input samples.
The TDNN was first used by Waibel in 1988 [25] and its basic structure has remained the same: it consists of three layers whose weights are coupled with time-delay cells, and the delayed inputs are weighted. In the design of neural networks, and especially of TDNNs, the designer faces the problem of choosing the right network for the design. In general, the network with the least complexity and the fewest parameters that identifies the input patterns most accurately is called a suitable network. In theory, if a problem can be solved by a particular network, it can also be solved by larger networks. However, because there is no unique solution for the optimal weights, learning algorithms for the larger network usually drive the surplus weights toward 0, so it is preferable to solve the problem with a smaller network if one can be identified.
Figure 11. A Typical Time Delay neural network.
If we reduce the number of hidden-layer neurons used for a particular problem, the network will not be able to learn, because the number of hyperplanes, and therefore of the volumes needed to divide the input space into different classes, will not be sufficient. On the other hand, a large number of hidden-layer neurons is not appropriate either, due to the increased computational volume and therefore the longer training time of the network. In addition, because network training is based on a limited set of training patterns, if the network is too large it tries to reproduce the training patterns precisely, which reduces the generalization and interpolation power used to identify new patterns outside the training set. So there is a critical number of hidden-layer neurons that must be found for any particular
application. The number of hidden-layer neurons is found by simulating different networks and measuring the accuracy and interpolation of these networks on patterns that were not in their training set. The network output, or in other words the type of output coding, must also be appropriate for the particular problem; the best way of coding the output classes is to use primitive vectors. A typical TDNN is shown in Figure 11.
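One way to picture the time-delay idea (a generic construction, not code from the chapter): each network input is the current sample stacked with its delayed copies, so the weights see a short window of temporal context. The window length below is an arbitrary illustrative choice.

```python
# Building time-delayed input windows for a TDNN-style network.
import numpy as np

def delay_windows(signal, delays=3):
    """Stack each sample with its `delays` previous samples."""
    return np.array([signal[t - delays:t + 1] for t in range(delays, len(signal))])

signal = np.sin(np.linspace(0, 4 * np.pi, 50))
X = delay_windows(signal, delays=3)       # shape (47, 4): windows fed to the network
print(X.shape)
```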
DEEP FEED FORWARD NEURAL NETWORKS

In the early 1990s, the Pandora's box of deep learning was opened by Deep Feed Forward neural networks (DFF NNs). These are simply FF NNs with more than one hidden layer. The only difference between conventional FF NNs and DFF NNs is that, when training a traditional FF network, only a small amount of error is passed back to the previous layer. Because of this, stacking more layers led to exponential growth in training times, which initially made DFFs largely impractical. Only later was a collection of approaches developed that allowed DFFs to be trained effectively; they now form the core of modern machine learning systems. They therefore serve the same purposes as FFs, but with much better results. A typical DFF NN is shown in Figure 12.
Figure 12. Deep Feed Forward neural networks.
RECURRENT NEURAL NETWORKS

RNNs introduce a different type of neuron, called the recurrent cell. The Jordan network was the first network of this type. In this type of network, each hidden recurrent cell receives its own output with a fixed delay. RNNs are mainly used when context is important, that is, when decisions from past samples or iterations can influence the current decision [26]. A typical RNN is shown in Figure 13.
Figure 13. Recurrent Neural Networks.
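A minimal sketch of this recurrence (a generic Elman-style cell, not a specific architecture from the chapter): the hidden state at each step mixes the current input with the previous state, so past samples influence the current decision. The weight matrices are random placeholders.

```python
# Running a simple recurrent cell over a short input sequence.
import numpy as np

rng = np.random.default_rng(0)
W_in, W_rec = rng.normal(size=(3, 4)), rng.normal(size=(4, 4))

def rnn_run(inputs, W_in, W_rec):
    h = np.zeros(W_rec.shape[0])
    outputs = []
    for x in inputs:                     # the same cell is reused at every time step
        h = np.tanh(x @ W_in + h @ W_rec)
        outputs.append(h)
    return np.array(outputs)

print(rnn_run(rng.normal(size=(5, 3)), W_in, W_rec).shape)  # (5, 4)
```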
LONG-SHORT TERM MEMORY NEURAL NETWORKS

This type introduces a memory cell, a special cell that can process data with time gaps (or lags). Whereas RNNs can process text by "keeping in mind" the ten previous words, LSTM NNs can process video frames by "keeping in mind" something that happened many frames ago. LSTM networks are widely used in handwriting and speech recognition. Memory cells actually consist of a few elements, called gates, which are recurrent and manage how information is remembered and forgotten. The input gate decides how much information from the last sample will be kept in memory; the output gate controls the amount of data passed to the next layer; and the forget gates
control the decay rate of the stored memory. A typical LSTM is shown in Figure 14. Note that the structure shown in Figure 14 is a very simple implementation of an LSTM cell; many other architectures exist. In one common simplification the actual composition is a bit different: all LSTM gates are combined into a so-called update gate, and a reset gate is closely tied to the input. Such cells lack an output gate, which makes it easier to repeat the same output for a given input many times, and they are currently used most in sound (music) and speech synthesis.
Figure 14. Long-Short Term Memory Neural Networks.
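Here is a minimal sketch of one LSTM step with the three gates described above, under common textbook assumptions (concatenated input and hidden state, bias terms omitted): the forget gate decays the stored cell state, the input gate admits new information, and the output gate controls what is passed on. All weight matrices are random placeholders.

```python
# One step of a simplified LSTM cell with forget, input, and output gates.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W):
    z = np.concatenate([x, h])
    f = sigmoid(W["f"] @ z)              # forget gate
    i = sigmoid(W["i"] @ z)              # input gate
    o = sigmoid(W["o"] @ z)              # output gate
    c = f * c + i * np.tanh(W["c"] @ z)  # updated cell (memory) state
    h = o * np.tanh(c)                   # output passed to the next layer/step
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = {k: rng.normal(size=(n_hid, n_in + n_hid)) for k in "fioc"}
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.normal(size=n_in), h, c, W)
print(h)
```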
AUTO ENCODERS NEURAL NETWORKS

An AE NN is a kind of ANN used for classification, clustering and feature extraction. Many classification networks use supervised learning methods; AE NNs, however, can be trained without supervision. When the number of hidden cells is smaller than the number of input cells (and the number of output cells equals the number of input cells), and the AE NN is trained so that the output is as close to the input as possible, the network is forced to generalize the data and search for common patterns. A typical AE NN is shown in Figure 15.
Figure 15. Auto Encoders Neural Network.
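The following is a minimal sketch of that idea: a bottleneck with fewer hidden units than inputs, trained so the output reconstructs the input. A plain linear encoder/decoder and simple gradient steps are used purely for illustration; the sizes, learning rate, and data are arbitrary placeholders.

```python
# Tiny linear autoencoder trained to reconstruct its input through a bottleneck.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, lr = 8, 3, 0.05
W_enc = rng.normal(size=(n_in, n_hid)) * 0.5
W_dec = rng.normal(size=(n_hid, n_in)) * 0.5
X = rng.normal(size=(200, n_in))

for _ in range(1000):
    H = X @ W_enc                        # encode into the bottleneck
    X_hat = H @ W_dec                    # decode back to the input space
    err = X_hat - X                      # reconstruction error
    W_dec -= lr * H.T @ err / len(X)     # gradient step on the decoder
    W_enc -= lr * X.T @ (err @ W_dec.T) / len(X)  # gradient step on the encoder

print(np.mean((X - (X @ W_enc) @ W_dec) ** 2))    # reconstruction MSE after training
```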
MARKOV CHAINS NETWORKS

MC Ns are a rather old concept: graphs in which each connection carries a probability. MC Ns are not neural networks in the classic sense; they can be used for classification based on probabilities (like Bayesian decision makers), for clustering, and as finite state machines. A typical MC N is shown in Figure 16.
Figure 16. A Markov Chains Network.
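As a concrete illustration of such a graph (invented example, not from the chapter): each state's outgoing connections carry probabilities that sum to one, and the next state depends only on the current one.

```python
# A two-state Markov chain: transition matrix and a short random walk.
import numpy as np

states = ["sunny", "rainy"]
P = np.array([[0.8, 0.2],    # P(next state | current = sunny)
              [0.4, 0.6]])   # P(next state | current = rainy)

rng = np.random.default_rng(0)
s = 0                         # start in "sunny"
walk = []
for _ in range(10):
    s = rng.choice(2, p=P[s])             # follow the outgoing probabilities
    walk.append(states[s])
print(walk)
```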
CONCLUSION, ADVANTAGES AND DISADVANTAGES OF ANNS

An ANN generally differs from a computer in the following ways:

1. ANNs do not execute commands sequentially, and they do not contain a memory for storing data and instructions.
2. They respond to a set of inputs in parallel.
3. They are concerned more with transformations and mappings than with algorithms and methods.
4. They are not complex computing tools; they consist of a large number of simple units that often do little more than compute a weighted sum.

ANNs have a different way of solving problems. Traditional computers use an algorithmic problem-solving approach, following a set of unambiguous instructions to solve the problem. These commands are converted into a high-level language and then into the machine language that the system can interpret. If the steps the computer has to take to solve the problem are not known in advance and there is no specific algorithm, the system will not be able to solve the problem. Computers could be much more useful if they could do things for which we have no predefined procedure. Neural networks and computers are not only competitors; they can also be complementary. There are some things that are better solved by the algorithmic method, as well as things that cannot be solved except through an ANN, and many applications use a combination of the two approaches to achieve maximum efficiency; a traditional computer is typically used to supervise the ANN. ANNs do not perform miracles, but they can do surprising things when used intelligently. To sum up, ANNs, with their remarkable ability to derive results from complex data, can be used to extract patterns and identify trends that are extremely difficult for humans and conventional computers to identify. The benefits of ANNs include:
1. Adaptive learning: The ability to learn how to perform tasks based on the information given to the network or its initial experience; this is achieved by modifying the network.
2. Self-organization: An ANN automatically organizes and represents the data it receives during training. The neurons adapt according to the learning rule, and the response to the input changes.
3. Real-time operation: Computations in an ANN can be performed in parallel by specialized hardware designed and manufactured to exploit ANN capabilities optimally.
4. Error tolerance: Partial destruction of the network degrades its performance somewhat, but some capabilities are retained even after major damage.
5. Classification: Neural networks are able to categorize their inputs to produce the appropriate output.
6. Generalization: This property enables the network to obtain a general rule from only a limited number of samples and to extend the results of this learning to observations it has not encountered before, an ability that would otherwise require remembering infinitely many facts and relationships.
7. Stability and flexibility: A neural network is both stable enough to retain its learned information and flexible and adaptable enough to accept new information without losing the previous information.

Despite the advantages of ANNs over conventional systems, there are also disadvantages that researchers in the field are trying to minimize, including:

1. There are no specific rules or guidelines for designing a network for an arbitrary application.
2. In modeling problems, it is simply not possible to understand the physics of the problem using only the ANN; in other words, it is usually impossible to associate the network parameters or structure with the process parameters.
3. The accuracy of the results depends largely on the size of the training set.
4. Network training can be difficult or even impossible.
5. Predicting the network's future performance (generalization) is simply not possible.
REFERENCES
[1] Mosavi, M.R., M. Khishe and A. Ghamgosar, "Classification of Sonar Data Set using Neural Network Trained by Gray Wolf Optimization," Journal of Neural Network World 26, No. 4, pp. 393-415, 2016.
[2] Barak, O. and M. Rigotti, "A Simple Derivation of a Bound on the Perceptron Margin using Singular Value Decomposition," Neural Computation 23, No. 8, pp. 1935-1943, 2011.
[3] Mosavi, M.R., M. Khishe and M. Akbarisani, "Neural Network Trained by Biogeography-based Optimizer with Chaos for Sonar Data Set Classification," Wireless Personal Communications 95, No. 4, pp. 1-20, 2017.
[4] McCarthy, John, "Review: Roger Penrose, The Emperor's New Mind," Bulletin (New Series) of the American Mathematical Society 23, No. 2, pp. 606-616, 1990.
[5] McCulloch, W.S. and W. Pitts, "A Logical Calculus of the Ideas Immanent in Nervous Activity," The Bulletin of Mathematical Biophysics 5, No. 4, pp. 115-133, 1943.
[6] Van Der Malsburg, C., "Frank Rosenblatt: Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms," Brain Theory, pp. 245-248, January 1986.
[7] Magoun, A., "A Nonrandom Walk down Memory Lane with Bernard Widrow," Proceedings of the IEEE 102, No. 10, pp. 1622-1629, 2014.
[8] Papert, S. and M. Minsky, "Perceptrons: An Introduction to Computational Geometry," Cambridge, Massachusetts: MIT Press, 1988.
[9] Carpenter, G.A. and S. Grossberg, "ART 2: Self-organization of Stable Category Recognition Codes for Analog Input Patterns," Applied Optics 26, No. 23, pp. 4919-4930, 1987.
[10] Anderson, B., "Kohonen Neural Networks and Language," Brain and Language 70, No. 1, pp. 86-94, 1999.
[11] Linnainmaa, S., "Taylor Expansion of the Accumulated Rounding Error," BIT Numerical Mathematics 16, No. 2, pp. 146-160, 1976.
[12] Mosavi, M.R. and M. Khishe, "Training a Feed-Forward Neural Network using Particle Swarm Optimizer with Autonomous Groups for Sonar Target Classification," Journal of Circuits, Systems, and Computers 26, No. 11, pp. 1-20, November 2017. DOI: 10.1142/S0218126617501857.
[13] Khishe, M., M.R. Mosavi and M. Kaveh, "Improved Migration Models of Biogeography-based Optimization for Sonar Data Set Classification using Neural Network," Applied Acoustics 118, pp. 15-29, 2017.
[14] Ravakhah, S., M. Khishe, M. Aghababaee and E. Hashemzadeh, "Sonar False Alarm Rate Suppression using Classification Methods Based on Interior Search Algorithm," IJCSNS International Journal of Computer Science and Network Security 17, No. 7, July 2017.
[15] Khishe, M., M.R. Mosavi and A. Moridi, "Chaotic Fractal Walk Trainer for Sonar Data Set Classification using Multi-Layer Perceptron Neural Network and Its Hardware Implementation," Applied Acoustics 137, pp. 121-139, 2018.
[16] Afrakhteh, S., M. Mosavi, M. Khishe and A. Ayatollahi, "Accurate Classification of EEG Signals using Neural Networks Trained by Hybrid Population-physic-based Algorithm," International Journal of Automation and Computing, pp. 1-15, 2018.
[17] Kaveh, M., M. Khishe and M.R. Mosavi, "Design and Implementation of a Neighborhood Search BBO Trainer for Classifying Sonar Data Set using Multi-Layer Perceptron Neural Network," Analog Integrated Circuits and Signal Processing 100, No. 2, pp. 405-428, 2019.
[18] Mosavi, M.R., M. Khishe, G.R. Parvizi, M.J. Naseri and M. Ayat, "Training Multi-Layer Perceptron Utilizing Adaptive Best-mass Gravitational Search Algorithm to Classify Sonar Dataset," Archives of Acoustics 44, No. 1, pp. 137-151, 2019.
[19] Khishe, M. and H. Mohammadi, "Sonar Target Classification using Multi-Layer Perceptron Trained by Salp Swarm Algorithm," Ocean Engineering 181, pp. 98-108, 2019.
[20] Khishe, M. and A. Saffari, "Classification of Sonar Targets using an MLP Neural Network Trained by Dragonfly Algorithm," Wireless Personal Communications 108, No. 4, pp. 2241-2260, 2019.
[21] Khishe, M. and M.R. Mosavi, "Improved Whale Trainer for Sonar Datasets Classification using Neural Network," Applied Acoustics 154, pp. 176-192, 2019.
[22] Mosavi, M.R., M. Khishe, Y. Hatam Khani and M. Shabani, "Training Radial Basis Function Neural Network using Stochastic Fractal Search Algorithm to Classify Sonar Dataset," Iranian Journal of Electrical and Electronic Engineering 13, No. 1, pp. 100-112, 2017.
[23] Lippmann, R.P., "An Introduction to Computing with Neural Nets," IEEE ASSP Magazine 4, pp. 4-22, 1987.
[24] Kohonen, T., "Self-Organized Formation of Topologically Correct Feature Maps," Biological Cybernetics 43, No. 1, pp. 59-69, 1982.
[25] Waibel, A., T. Hanazawa, G. Hinton, K. Shikano and K.J. Lang, "Phoneme Recognition Using Time-Delay Neural Networks," IEEE Transactions on Acoustics, Speech, and Signal Processing 37, No. 3, pp. 328-339, March 1989.
[26] Graves, A., M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke and J. Schmidhuber, "A Novel Connectionist System for Improved Unconstrained Handwriting Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence 31, No. 5, pp. 855-868, 2009.
In: Neural Networks Editor: Doug Alexander
ISBN: 978-1-53617-188-4 © 2020 Nova Science Publishers, Inc.
Chapter 2
EMOTION RECOGNITION FROM FACIAL EXPRESSIONS USING ARTIFICIAL NEURAL NETWORKS: A REVIEW
Sibel Senan*, Zeynep Orman and Fulya Akcan
Istanbul University, Cerrahpasa, Istanbul, Turkey
ABSTRACT
Facial expressions are universal signs that provide semantic integrity in interpersonal communication. Changes in facial expressions are considered the most important clues in emotion psychology. Automatic analysis and classification of facial expressions are challenging problems in many areas such as human-computer interaction, computer vision, and image processing. In addition, there is a growing need for the analysis of facial expressions in many areas such as psychology, safety, health, games, and robotics. Therefore, rapid and accurate analysis of facial expressions plays a critical role in many systems in different application areas. This chapter focuses on the studies carried out with the Artificial Neural Network (ANN) approach on emotion recognition from 2D facial expressions between 2009 and 2019. The major objective of this study is to review, identify, evaluate and analyze the performance of ANN models in emotion recognition applications. Hence, to shed light on the implementations of artificial neural networks in facial expression analysis, this chapter reviews the related studies based on five key parameters: (i) the ANN models used for classification, (ii) the feature extraction methods used with each ANN classification model, (iii) the accuracy rates of the ANN models for each dataset used in facial expression analysis studies, (iv) the rates of datasets preferred by the studies for facial expression analysis and (v) the rates of studies according to publishing year. The published literature presented in this study shows the high potential of ANN models for effective emotion recognition applications.
* Corresponding Author's Email: [email protected].
Keywords: artificial neural networks, facial expression analysis, emotion recognition applications
INTRODUCTION
Analyzing facial expressions has an important place in many areas such as psychological research, verbal and nonverbal communication, human-computer interaction and image processing [1]. In 1872, Charles Darwin emphasized in his book "The Expression of the Emotions in Man and Animals" that some of the innate emotions in humans and animals emerge as facial expressions and that these expressions are perceived in the same way all over the world [2]. This conclusion has been the basis for studies on facial expressions. In the early 1970s, Ekman and colleagues conducted research on human facial expressions, providing evidence to support this universality theory. These 'universal facial expressions' are defined as happiness, sadness, fear, anger, disgust, and surprise [3-5]. Apart from these, "neutral" is considered a seventh expression. These studies have inspired many researchers to analyze facial expressions for emotion recognition. In the 1990s, research on automatic facial expression analysis attracted much attention, driven by progress in fields such as image processing, pattern recognition, and machine learning. Facial expression analysis consists of three basic steps. The first step is the detection of the face: the face is detected in the given input image. If the
input is given as an image sequence, the face is detected in the first frame and tracked in the remaining frames. In the presence of large head movements, head finding, head tracking and pose estimation can be applied. The next step is feature extraction from the facial expression: the relevant facial information is extracted and tracked in order to identify the expression. Two popular families of methods are used in the literature for facial feature extraction, known as geometric-based and appearance-based methods [6]. In geometric-based methods, the facial components or facial feature points are extracted to form a feature vector that represents the face geometry. In appearance-based methods, image filters such as Gabor wavelets are applied either to the whole face or to specific regions of the face image to extract a feature vector. There are also hybrid methods in which geometric-based and appearance-based approaches are used together [7, 8]. The final step is recognition of the facial expression in the image or image sequence: the emotion recognition or classification process is performed on the extracted facial expression features. Many statistical and machine learning methods can be used to recognize facial expressions; however, the studies in the literature show that neural networks have great potential for this task because of their high performance. (A schematic code sketch of this three-step pipeline is given at the end of this introduction.) The aim of this chapter is to review the related studies that use ANN models in emotion recognition applications by considering the ANN models used for classification, the feature extraction methods used with each ANN classification model, the accuracy rates of the ANN models on each dataset used for facial expression analysis, the rates at which datasets are preferred by the studies, and the rates of studies according to publishing year. The following sections are organized as follows: in the Literature Review, several studies carried out between the years 2009-2019 on facial expression analysis from 2-dimensional facial images using artificial neural networks are examined. In the Results section, the factors affecting the success rate in emotion recognition from facial expressions are analyzed by comparing these studies according to classification methods, publishing year, datasets and accuracy rates. Finally, these studies are
evaluated in the Conclusion in terms of the datasets used and the accuracy rates obtained, in order to discuss the high potential of ANN models for effective emotion recognition applications.
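The three-step pipeline described in this introduction (face detection, feature extraction, expression classification) can be summarized in code. The sketch below is illustrative only: it uses OpenCV's stock Haar-cascade face detector, and the feature extractor and classifier are placeholders into which the methods surveyed in the next section (Gabor filters, LBP, CNN features, SVMs, and so on) can be plugged.

import cv2

# Step 1: face detection with OpenCV's bundled Haar cascade
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def recognize_expressions(image_bgr, extract_features, classifier):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    labels = []
    for (x, y, w, h) in faces:
        face = cv2.resize(gray[y:y + h, x:x + w], (48, 48))
        features = extract_features(face)                 # Step 2: feature extraction
        labels.append(classifier.predict([features])[0])  # Step 3: classification
    return labels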
LITERATURE REVIEW
In the literature, several studies have been carried out to recognize 2-dimensional facial expressions with artificial neural networks. Veena Mayya et al. studied automatic facial expression recognition using Deep Convolutional Neural Network (DCNN) features [9]. An application was developed in the UNIX environment using Python machine learning tools and the Convolutional Architecture for Fast Feature Embedding (Caffe). With this application, facial expressions could be recognized from both single images and video sequences. The CK+ and JAFFE databases were used in the study. Detection and cropping of the face were performed using OpenCV, the face attributes were extracted with the DCNN framework, and SVM (Support Vector Machine) was used for classification. Because a GPU-based Caffe module was used, the time required for feature extraction was significantly reduced. X. Chen and W. Cheng used a method based on edge detection for facial expression recognition [10]. In their study, the eyes and lips in face images located with a skin color model were detected and marked, and shape features were extracted. The commonly used edge detection operators Canny, Laplace, Sobel, and Robert were compared. According to the results, the Canny operator was complex in edge detection, the Robert operator was simple, and the Laplace and Sobel operators were ambiguous. The system was trained and tested with selected face images from the JAFFE database. According to the test results for recognizing the happiness, normal, sadness and surprise expressions, the highest values were obtained with the Canny operator. Lei Xu et al. used a CNN (convolutional neural network) to reduce the complexity of facial feature extraction in facial expression recognition [11]. First, a face detection and preprocessing step was applied. Then, the facial
expression features were extracted using trainable convolution kernels, and their dimensionality was reduced by max pooling. A softmax classifier was used for the classification process. Seven facial expressions were used for classification: disgust, anger, fear, sadness, happiness, surprise and neutral. Experiments were performed using the TensorFlow framework, with training examples selected from FER2013: 26,820 facial expression images were used for training (3300 anger, 420 disgust, 3600 fear, 7200 happiness, 4800 sadness, 3000 surprise and 4500 neutral). Each face image taken from FER2013 was cropped and normalized to a 48x48-pixel gray-level image. Optimizing the CNN structure and increasing the accuracy and speed of recognition were left for their future studies. Ismail Ari et al. developed a face tracker based on a multi-resolution, multi-exposure active shape model and recognized both head movements and facial expressions with a classifier based on the Hidden Markov Model, using the trajectory information of the tracked points [12]. The first step in the study was to detect the facial landmarks with a multi-resolution active shape model-based tracker. In the second step, local changes in certain areas of the face (forehead wrinkles, wrinkles between the eyebrows, the distance between the eyebrows, wrinkles of the cheeks, and the horizontal and vertical openings of the mouth) were calculated and the feature vectors of the expression were obtained with the help of the tracked point positions. Finally, the facial expression was determined by a classifier based on the distance of the resulting feature vector from the training set. The proposed method was shown to run successfully under partial occlusion. The contribution of this study to the literature is the ability to recognize the six universal facial expressions (surprise, anger, happiness, sadness, fear, disgust) and the neutral expression in real time. Turan Gunes and Ediz Polat used Gabor filtering for feature extraction from facial images [13], because Gabor filtering removes the negative effects of non-homogeneous illumination in the image and Gabor filters are less sensitive to small displacements and deformations. Selecting the important features ensured that the classification algorithms worked quickly, saving time and
avoiding unnecessary use of memory. In the study, two wrapper-method algorithms ("Zero Norm Attribute Selection" and "Recursive Attribute Elimination") and one filter-method algorithm ("Common Knowledge-Based Attribute Selection") were used for feature selection. These algorithms were run in the MATLAB environment using the SPIDER package. Images taken from the Cohn and Kanade DFAT-504 data set were used for facial expression analysis. This data set consists of 100 university students aged 18-30; 65% of the students are women, 15% are African-American and 3% are of Asian or Latin origin. The images were recorded from the front of the subjects with an analog S-video camera and have a size of 640x480 or 640x490 pixels with 8-bit gray levels. SVM was used for classifying multiple expressions. The RFE algorithm, which achieved the lowest error among the feature selection algorithms, was more successful than the other algorithms used. When the results obtained after feature selection were compared with those obtained before feature selection, it was observed that feature selection is very useful in reducing classification errors. Yuanyuan Liu et al. proposed a new approach to facial expression recognition (CoNERF, a conditional convolutional neural network enhanced random forest) [14]. In the learning process, a neurally connected split function was used as the node division strategy in CoNERF. The experiments were carried out on the CK+, JAFFE, multi-view BU-3DFE, and LFW datasets. The proposed method achieved 99.02% average accuracy on the CK+ and JAFFE datasets, 94.09% on the multi-view BU-3DFE dataset and 60.9% on the LFW data set. André Teixeira Lopes et al. used a combination of a CNN and specific image preprocessing steps to recognize facial expressions [15]. Experiments were performed on the CK+, BU-3DFE and JAFFE datasets. Using the Cnclass classifier, recognition rates for six facial expressions of 96.76% in the CK+ dataset, 72.89% in the BU-3DFE dataset and 53.44% in the JAFFE dataset were obtained. Wenyun Sun et al. proposed an 11-fold CNN with Visual Attention for facial expression recognition [16]. First, the local convolutional features of the face were extracted by a stack of 10 convolutional layers. Second, the regions of interest were automatically determined by the attention pattern
embedded in these local features. Third, local characteristics in these regions were collected and used to produce the emotion label. These three components were then integrated into a single network that could be trained end-to-end. Experiments were performed on aligned faces, faces in different poses, aligned unconstrained faces, and unconstrained facial groups to analyze the effect of the visual attention-based ROI detector, using the RaFD, FER-2013, and SFEW 2.0 datasets. The developed FCNN was also tested with the RaFD dataset and the HAPPEI dataset. Siyue Xie et al. presented a novel model called the Deep Attentive Multi-Path Convolutional Neural Network (DAM-CNN) for facial expression recognition [17]. The proposed model included two new modules: the Salient Expressional Region Descriptor (SERD) and the Multi-Path Variation-Suppressing Network (MPVS-Net). Feature extraction was performed with the VGG-Face network. Experimental results showed the effectiveness of the DAM-CNN model on constrained datasets (CK+, JAFFE, TFEID) and unconstrained datasets (SFEW, FER2013, BAUM-2i): recognition rates of 95.88% on the CK+ dataset, 99.22% on the JAFFE dataset, 93.65% with 6 classes on the TFEID dataset, 65.31% on the FER2013 dataset, 42.30% on the SFEW dataset, and 67.92% with 6 classes on the BAUM-2i dataset were obtained. Diah Anggraeni Pitaloka et al. made improvements to the CNN method to recognize the six basic emotions and compared several preprocessing methods to show their effect on CNN performance [18]. The compared data preprocessing methods were resizing, face detection, cropping, noise addition, and data normalization with local normalization, global contrast normalization, and histogram equalization. Compared with the other preprocessing stages and the raw datasets, face detection alone reached a remarkable recognition rate of 86.08%; however, the performance of the CNN increased further with a combination of preprocessing techniques, and a recognition rate of 97.06% was achieved. Four methods were used for preprocessing: face detection and cropping, resizing, noise addition and normalization. The JAFFE, CK+ and MUG datasets were used in the study.
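Several of the CNN-based studies summarized above (for example [11], [15] and [18]) follow the same basic pattern: a small stack of convolution and max-pooling layers applied to cropped, normalized grayscale faces (for example 48x48 pixels, as in the FER2013 experiments), followed by a softmax output over the expression classes. The following Keras sketch only illustrates that pattern; the layer sizes are not those of any particular reviewed model.

import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(48, 48, 1)),            # 48x48 grayscale face
    layers.Conv2D(32, 3, activation="relu"),    # trainable convolution kernels
    layers.MaxPooling2D(2),                     # max pooling shrinks the feature maps
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(7, activation="softmax"),      # softmax over seven expressions
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=10) with data such as FER2013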
Fengyuan Wang et al. proposed a new hybrid feature representation for facial expression analysis that combines SIFT (scale-invariant feature transform) features with deep learning features extracted from a CNN model [19]. The obtained feature combination was then classified with an SVM. Experiments were performed on the CK+ dataset. To evaluate the generalization ability of the proposed method, a cross-database experiment (training on the CK+ database, testing on the JAFFE and MMI databases) was performed. The experimental results showed that the proposed approach can achieve a better classification rate than state-of-the-art CNNs. Xiaofeng Liu et al. proposed a new FER framework called IDFERM (Identity Disentangled Facial Expression Recognition Machine) [20]. A generation-recognition scheme including an HNG (hard negative generation) network and an RML (radial metric learning) network was presented. The FER, VGG-Face and CMU Multi-PIE datasets were used for training, and testing was performed on the CK+, MMI and Oulu-CASIA datasets. With the IDFERM method, an average recognition rate of 98.35% on the 7-class CK+ dataset, 81.13% on the MMI data set, and 88.25% on the Oulu-CASIA VIS data set was obtained. Otkrist Gupta et al. presented architectures based on deep neural networks for expression recognition in videos [21]. The Viola-Jones algorithm was used for detecting faces, and a DCNN was used for expression prediction; the DCNN includes an autoencoder combined with a predictor based on the semi-supervised learning paradigm. For accuracy analysis, the CK+, MMI, and Florentine datasets were used. With the Scale-Invariant Learner, recognition rates of 65.57% in the MMI dataset, 90.52% in the CK+ dataset and 51.35% in the Florentine dataset were obtained. Su-Jing Wang et al. used 560 micro-expression video clips to evaluate their proposed TLCNN (Transferring Long-term Convolutional Neural Network) model [22]. A DCNN was used for feature extraction from each frame, and the features were then fed to an LSTM to learn the temporal sequence information of the micro-expression. Because of the small sample size of micro-expression data, TLCNN used two transfer learning stages: transfer from
expression data and transfer from single frames of micro-expression video clips. The network was first pre-trained with the KDEF, MMI, Radboud, and TFEID expression data sets, and the training stage was then performed with the SMIC, CASME and CASME 2 micro-expression databases. TLCNN achieved an average recognition rate of 71.19% with TIM32 (temporal interpolation model) and 69.12% with TIM64. Luefeng Chen et al. proposed SRDSAN (Softmax regression-based deep sparse autoencoder network) to recognize facial emotions in human-robot interaction [23]. A DSAN (Deep Sparse Autoencoder Network) was used for feature extraction and SR (Softmax regression) was used for classification. Training was performed with the JAFFE and CK+ datasets. It was observed that SRDSAN had higher average recognition accuracy than SR and CNN: 98.59% average accuracy in the JAFFE dataset and 100% average accuracy in the CK+ dataset were obtained. Zhenbo Yu et al. proposed a novel end-to-end architecture called STC-NLSTM (Spatio-Temporal Convolutional features with Nested LSTM) [24]. A 3DCNN was used to extract spatio-temporal convolutional features from image sequences representing facial expressions. The dynamics of the expression were modeled by a Nested LSTM formed by combining two lower-level LSTMs, namely T-LSTM and C-LSTM: T-LSTM modeled the temporal dynamics of the spatio-temporal features of each convolutional layer, and C-LSTM combined the outputs of all T-LSTMs to encode the multi-level features encoded in the middle layers of the network. A softmax classifier was used for the classification stage. The experiments were carried out with the CK+, Oulu-CASIA, MMI and BP4D databases, and the method was observed to perform better than state-of-the-art methods: (99.8 ± 0.2)% average accuracy in the CK+ dataset, (93.45 ± 0.43)% in the Oulu-CASIA dataset, (84.53 ± 0.67)% in the MMI dataset and a 0.58 F1 score in the BP4D dataset were obtained. Yacine Yaddaden and colleagues introduced an effective facial expression recognition approach based on a CNN architecture [25]. A CNN architecture that was inspired by LeNet-5 and optimized for AFER from the
images was proposed. The technique developed by Kazemi and Sullivan was used for feature extraction, and classification was carried out with a random forest classifier. Experiments were conducted on the JAFFE, RaFD, KDEF, MMI and CK+ datasets, with average accuracies of 95.30% ± 1.70% in the JAFFE dataset, 97.57% ± 1.33% in the RaFD dataset, 90.62% ± 1.60% in the KDEF dataset, 85.84% ± 0.86% in the MMI dataset and 96.37% ± 0.80% in the CK+ dataset. Minhaz Uddin Ahmed et al. proposed a deep learning-based facial expression recognition system using the VGG16 model, in which they applied a framework for incremental active learning [26]. Unlabeled facial expression data were collected from ITLab members at Inha University under five different conditions (good lighting, average lighting, natural lighting, close to the camera, and away from the camera). A CNN was used for feature extraction and classification was carried out with VGG. The CK+ and ITLab data sets were used for the experiments. According to the results, natural lighting and average lighting were more successful than the other conditions (natural lighting = 81%, average lighting = 74.83% average FER accuracy). With the proposed method, 91.80% average accuracy was obtained on the CK+ data set. Khadija Lekdioui et al. presented a facial expression recognition system based on an automatic and more efficient decomposition of the face into regions of interest (ROIs) [27]. First, seven ROIs (left eyebrow, right eyebrow, left eye, right eye, between the eyebrows, nose, and mouth) were extracted using the positions of landmarks detected by IntraFace (IF). Then, each ROI was resized, partitioned into blocks and characterized using texture and shape descriptors. A multi-class SVM-based classifier was used for the classification process. The CK, FEED and KDEF datasets were used for the experiments; 96.06% average accuracy in the CK+ dataset, 92.03% in the FEED dataset and 93.34% in the KDEF dataset were obtained. Qingxiang Wang et al. reported the collection of videos of depression
patients and of a control group at the Shandong Mental Health Center in China [28]. The facial features were extracted from the collected videos with a personalized active appearance model, and face detection was performed using the Viola-Jones algorithm. Depression was classified with an SVM by looking at the movement changes of the eyes, eyebrows and mouth corners. The results showed that these features were effective in the automatic classification of depression patients; an accuracy of 0.7885 was obtained. Bharat Richhariya and Deepak Gupta used Universum data for multi-class classification [29]. Because the higher training cost of the Universum model was a disadvantage, IUTWSVM (iterative Universum twin support vector machine), which uses the Newton method, was proposed. PCA, LDA, ICA, LBP, wavelet transform and phase congruency were used for feature extraction and their accuracies were compared with the proposed method; LDA gave the best result with 91.4286% accuracy. Ebenezer Owusu et al. used the Viola-Jones algorithm to detect faces in their study [30]. Gabor feature extraction techniques were used to extract hundreds of facial features representing various facial deformation patterns, and an AdaBoost-based hypothesis was formulated to select several hundred of the numerous features obtained in order to accelerate classification. The selected features were fed to a well-designed 3-layer neural network classifier trained with the back-propagation algorithm. The JAFFE and Yale facial expression databases were used for the training and testing stages. An average recognition rate of 96.83% in the JAFFE database and 92.22% in the Yale database was obtained. Yaxin Sun and Guihua Wen proposed a new classifier based on the ECGM (enhanced cognitive gravity model) together with a new dimension reduction method [31]. The developed method, named CFER (cognitive facial expression recognition), consists of three steps: feature extraction, dimension reduction and classification. PHOG (pyramid of histograms of oriented gradients) was used for feature extraction, ERF for dimension reduction, and a new ECGM-based classifier for classification. The JAFFE and CK+ databases were used for the experiments. With faces obtained by AdaBoost, a recognition rate of 96.24% with 10-
fold cross-validation on JAFFE and 76.46% with LOSO on JAFFE were obtained; in the CK+ dataset, recognition rates of 94.87% with faces obtained by AdaBoost and 97.66% with faces obtained by landmarks were obtained. Anima Majumder et al. developed a novel emotion detection model that used a system diagnostic approach [32]. A comprehensive data-driven model was developed using an extended KSOM (Kohonen self-organizing map) whose input is a 26-dimensional geometric feature vector containing eye, lip, and eyebrow feature points. The MMI database was used for training. At the end of the experiment, it was observed that the proposed method gave better results than a multi-class support vector machine; a 93.53% average recognition rate was obtained. Ying Huang et al. proposed the new EFTL method for effective facial expression recognition, which benefits from multitask learning for discriminative feature learning [33]. First, common features were extracted from the lower layers of a CNN. Then, based on the common features, expression-specific features (ESF) were learned through multitask learning for each facial expression. In order to increase the discriminability of the ESF, a joint loss (a combination of center loss and class loss) was developed which decreased intra-class variation and increased inter-class differences. Finally, all ESFs were combined in a fully-connected layer for final classification under a softmax loss; the softmax classifier was used for the classification stage. The CK+, Oulu-CASIA, MMI and FER2013 databases were used for the experiments. I. Michael Revina and W.R. Sam Emmanuel proposed the EMDBUTMF (Enhanced Modified Decision Based Unsymmetric Trimmed Median Filter) method, which removes noisy pixels from facial images, for facial expression analysis [34]. The LDN (Local Directional Number) and DGLTP (Dominant Gradient Local Ternary Pattern) methods were proposed for feature extraction and the SVM method for classification. At the end of the experiment, an accuracy rate of 88% was obtained on the JAFFE and CK databases. Hasimah Ali et al. proposed a novel method based on EMD (empirical mode decomposition) for facial expression recognition [35]. In this method, the face signal was obtained by decomposing the Radon transform projection of the 2-D image into the oscillating components called IMFs
(intrinsic mode functions) using the EMD. PCA + LDA, PCA + LFDA and KLFDA were applied separately to the EMD-based features to reduce dimensionality, and the reduced features were fed to k-NN, SVM and ELM-RBF classifiers. The proposed method was applied to the JAFFE and CK databases. In the CK database, IMF1 + PCA + LDA obtained average recognition rates of 99.11% with a k-NN classifier, 99.21% with a Gaussian SVM classifier, and 99.26% with an ELM-RBF (Extreme Learning Machine with Radial Basis Function) classifier. IMF1 + PCA + LFDA obtained 99.22% with k-NN, 99.38% with Gaussian SVM, and 99.51% with ELM-RBF. IMF1 + KLFDA obtained 99.61% with k-NN, 99.71% with Gaussian SVM, and 99.75% with ELM-RBF. In the JAFFE database, a 100% recognition rate was obtained with the k-NN, SVM and ELM-RBF classifiers by all three methods. A. Geetha et al. proposed a method for facial expression recognition [36] in which the location of the face is detected by extracting the head contour points using movement information, and a rectangular bounding box for the face region is placed using the obtained contour points. Since the eyes are the most prominent features for determining the size of a face, the visual characteristics of the face were extracted according to the positions of the eyes. SVM was used for classification. Fifteen subjects participated in the experiment, and a 98.5% recognition rate was obtained at the end of the study. Yuan Luo et al. proposed a hybrid method consisting of PCA (Principal Component Analysis) and LBP [37]: PCA was used for global feature extraction and dimensionality reduction, while LBP (Local Binary Pattern) was used for local texture feature extraction. SVM was used for classification. 350 facial expression images were used for training.
According to the results of the experiment, a higher recognition rate was obtained than with traditional recognition methods; the average recognition rate was 93.75%. Wei-Lun Chao et al. used LPQ (local phase quantization) + es-LBP-s (expression-specific local binary patterns) for feature extraction, cr-LPP (class-regularized locality preserving projection) for dimension reduction, and SVM for classification [38]. The JAFFE database was used for training; the average recognition rate was 94.88% with 10-fold cross-validation and 76.67% with LOPO. Chih-Chin Lai and Chung-Hung Ko proposed a two-stage approach for facial expression recognition based on local facial texture extraction [39]. In the first stage, TLBP (threshold local binary pattern) was used to convert the face image into a feature image; then, the most distinctive features were extracted from the feature image using block-based CS-LBP (center-symmetric local binary pattern). SVM was used for classification. The experiments were largely conducted on the CK database, using a 10-fold cross-validation scheme to assess recognition performance, and a 97.6% average recognition rate was obtained with the proposed method. Ligang Zhang et al. defined a database with various environmental and facial variations collected from television broadcasts and the World Wide Web [40], and proposed a fully automated system that used a fusion-based approach to facial expression recognition for performance evaluation. Three sources were used for collecting video segments from real multimedia material: news, TV series, and YouTube; the QUT database was generated from the collected images. SIFT (texture) and FAP (geometric) features were extracted with an ASM (active shape model), and SVM was used for classification. At the end of the experiment, the average recognition rate was 34.6% on the SFEW test data, 30.3% on the NVIE test set and 30.1% on the FEEDTUM test set. Xiaorong Pu et al. proposed a novel framework that recognizes AUs from image sequences using a two-layer random forest classifier [41]. Facial movement was measured by tracking Active Appearance Model (AAM) facial feature points with the Lucas-Kanade (LK)
optical flow. The displacement vectors between the neutral-expression frame and the peak-expression frame were used as motion characteristics of the facial expression. These features were then passed to the first-level random forest to determine the Action Units (AUs), and the detected AUs were given as input to the second-level random forest for the classification of facial expressions. Experiments were conducted on the CK+ and Oulu-CASIA databases. The random forest was observed to be more effective than the SVM method in AU recognition, and a 96.38% average recognition rate was obtained at the end of the experiment. Guoying Zhao et al. presented a novel study on dynamic facial expression recognition using NIR video sequences and LBP-TOP (local binary patterns from three orthogonal planes) feature descriptors [42]. SVM and SRC (sparse representation classifier) were used for classification, and the experiments were carried out with the Oulu-CASIA NIR & VIS database. For classification by SVM, average recognition rates of 72.09% using NIR_N for training and testing, 66.02% using NIR_N for training and NIR_W for testing, 69.90% using NIR_N for training and NIR_D for testing, 73.54% using VIS_N for training and testing, 35.44% using VIS_N for training and VIS_W for testing, and 29.13% using VIS_N for training and VIS_D for testing were obtained. For classification by SRC, the corresponding rates were 78.64%, 69.42%, 72.33%, 76.21%, 34.22% and 28.88%. Hui Fang et al. used GR (groupwise registration) for feature extraction in their study [43]. Six classifiers were used for classification: J48 (a version of ID3), FRNN (a fuzzy-rough nearest neighbor algorithm), VQNN (vaguely quantified nearest neighbor, a noise-tolerant fuzzy-rough classifier), random forest, SMO-SVM (a sequential minimal optimization approach for support vector machines) and Logistic. The MMI database was used for training. The average recognition rate obtained was 50.00%
with J48, 71.57% with FRNN, 70.58% with VQNN, 57.84% with RF, 71.56% with SMO-SVM and 69.6% with Logistic. Farkhod Makhmudkhujaev et al. proposed a novel edge-based descriptor called LPDP (Local Prominent Directional Pattern), which evaluates the statistical information of a pixel neighborhood to encode more meaningful and reliable information than existing descriptors [44]. SVM was used for classification. The experiment was carried out on the CK+, MMI, BU-3DFE, ISED, GEMEP-FERA and FACES datasets. It was observed that LPDP was more robust than other existing descriptors in extracting the various local structures resulting from facial expression changes. The average recognition rate was 94.50% in the CK+ dataset, 70.63% in the MMI dataset, 73.4% in the BU-3DFE dataset, 78.32% in the ISED dataset, 70.0% in the GEMEP-FERA dataset and 94.72% in the FACES dataset. Mohammad Mehedi Hassan et al. applied an unsupervised deep belief network (DBN) for deep-level feature extraction from fused observations of EDA (electro-dermal activity), PPG (photoplethysmogram) and zEMG (zygomaticus electromyography) sensor signals [45]. The generated DBN features were then combined with the statistical properties of EDA, PPG, and zEMG to prepare a feature fusion vector. FGSVM (Fine Gaussian Support Vector Machine) was used for classification, and the DEAP dataset was used for the experiment; an 89.53% average recognition rate was obtained. Asim Munir et al. used FFT + CLAHE (Fast Fourier Transform and Contrast Limited Adaptive Histogram Equalization) as a preprocessing step [46]. MBPC (merged binary pattern code) was used for feature extraction and PCA for dimensionality reduction. SMO, Simple Logistic, KNN (k-nearest neighbors), MLP and J48 were used for classification, and the SFEW dataset was used for the experiment. The proposed MBPC-based technique outperformed the other techniques with accuracies of 96.5% and 67.2% for the holistic and division-based approaches, respectively; average recognition rates of 93.5% with SMO, 94.1% with Simple Logistic, 87.1% with KNN, 89.4% with MLP, and 96.5% with J48 were obtained.
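Local binary patterns and their variants (LBP, LGBP, CS-LBP, LBP-TOP, LPDP and related descriptors) appear throughout the studies above. The core idea is to convert the face image into a map of local texture codes and summarize it as a histogram that is then fed to a classifier such as an SVM. A minimal sketch using scikit-image; the parameter values are illustrative only:

import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(gray_face, points=8, radius=1):
    codes = local_binary_pattern(gray_face, P=points, R=radius, method="uniform")
    n_bins = points + 2            # the "uniform" mapping yields P + 2 distinct codes
    hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins), density=True)
    return hist                    # normalized histogram used as the feature vector

In practice the face is usually divided into blocks and the per-block histograms are concatenated, as in the block-based CS-LBP approach described above.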
Anastasios Maronidis et al. investigated the robustness of view-based subspace learning techniques to geometric transformations of images [47]. DNMF (Discriminant Non-Negative Matrix Factorization) was used for feature extraction and SVM for classification; the Kanade, JAFFE and BU datasets were used for the experiment. It was also shown that person-specific training is much more accurate than generic training for recognizing facial expressions. Average recognition rates of 69.2% for Kanade with 5-fold cross-validation, 63.9% for JAFFE with 5-fold cross-validation, and 63.9% for BU with 5-fold cross-validation were obtained. Taner Danisman and colleagues presented an automated way of discovering the pixels in the facial image that improve facial expression recognition results [48]. The main contribution of their work was a practical method to improve the classification performance of classifiers by selecting the best pixels of interest. STASM (Stacked Trimmed Active Shape Model) was used for feature extraction and NN + Mask was used for classification. Experimental studies on the GENKI, JAFFE, and FERET databases showed that the proposed system improves the classification results by selecting the best pixels of interest. 50% of the GENKI dataset was used for training, while the JAFFE and FERET datasets were used for testing and development. At the end of the study, a 92.0% average recognition rate was obtained. Shui-Hua Wang et al. proposed a new intelligent emotion recognition system [49]. SWE (stationary wavelet entropy) was used for feature extraction, and a novel training method, the Jaya algorithm, was proposed; with the Jaya algorithm, a higher recognition rate was obtained than with GA, PSO, GPS, BBO, ABC, and FA. A SHLFNN (single hidden layer feedforward neural network) was used for classification, and a 700-image dataset of 20 subjects was used for the experiment; at the end of the experiment, a 96.80 ± 0.14% average recognition rate was obtained. Kaimin Yu and colleagues proposed a search-based framework to collect images of realistic facial expressions from the Web [50]. An active learning approach based on the Support Vector Machine (SVM) was presented to select relevant images from noisy image search results. A
dataset of 350 images, including the 100 most-searched images from Google search, was generated (Gw). A new facial expression feature based on the modern Weber Local Descriptor (WLD) and histogram contextualization was proposed to process the dataset; at the end of the experiment, a 59.9% average recognition rate was obtained. S. Moore and R. Bowden used LBP and LGBP for feature extraction and SVM for classification [51]. The BU3DFE and Multi-PIE databases were used for the experiment. Using LBP, average recognition rates of 65.02% in the BU3DFE database and 73.26% in the Multi-PIE database were obtained; using LGBP, the rates were 67.96% in the BU3DFE database and 80.60% in the Multi-PIE database; using LBP and LGBP together, a 71.1% average recognition rate in the BU3DFE database was obtained. Yeongjae Cheon and Daijin Kim proposed a novel natural facial expression recognition method that recognizes dynamic facial expression image sequences using AAM and manifold learning [52]. DAF (differential-AAM features) were used for feature extraction and k-NNS for classification. The CK database was used for the experiment, and an 86.49% average recognition rate was obtained. Mahdi Jampour and colleagues proposed a novel approach to recognizing facial expressions across a wide range of head poses [53]. First, the pose of the head in the input image is estimated, and then the mapping learned specifically for that pose is applied. HOG and LBP were used for feature extraction, and PSC (pose-specific classification), PSLM (pose-specific linear mapping), PSLM-SF, FPSLM-SF and KPSNM (kernel-based pose-specific non-linear mapping) were used for classification. The BU3DFE and Multi-PIE datasets were used for the experiment. Using KPSNM, average recognition rates of 79.26% in the BU3DFE-P1 dataset, 78.79% in the BU3DFE-P2 dataset, 82.43% in the Multi-PIE-P1 dataset and 83.09% in the Multi-PIE-P2 dataset were obtained.
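Many of the accuracy figures quoted in this review are obtained with k-fold cross-validation (for example, the 10-fold schemes mentioned above) or with leave-one-subject-out and leave-one-person-out protocols. A minimal sketch of a 10-fold evaluation with scikit-learn, assuming a feature matrix X and label vector y have already been prepared; the SVM settings are placeholders:

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def ten_fold_accuracy(X, y):
    scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=10)
    return scores.mean(), scores.std()   # average accuracy and its spread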
Ligang Zhang et al. proposed an approach that uses a Monte Carlo algorithm to extract a series of Gabor-based part templates from gallery images and converts these templates into template-matching distance features [54]. SVM was used for classification. Experimental results obtained on the Cohn-Kanade and JAFFE databases showed the robustness and fast processing speed of the proposed approach and provided useful information about the effects of occlusion on FER. If the training images are not occluded while the test images are occluded (a mismatch), the method cannot learn how to reduce the effect of rectangular patch-like occlusion; accordingly, there was a 65% decrease in performance in the CK dataset and a 58% decrease in the JAFFE dataset. With no occlusion in the training and test sets, 95.3% accuracy was obtained in the CK dataset and 81.2% in the JAFFE dataset. With occlusion by clear glasses in the training and test sets, 95.0% accuracy was obtained in the CK database and 79.8% in the JAFFE database, and with occlusion by solid glasses, 91.5% accuracy in the CK database and 75.1% in the JAFFE database were obtained. Kaimin Yu et al. proposed a spectral embedding based on multi-view dimension reduction to combine multiple features in facial expression recognition [55]. sMSE, a new method combining Multiscale-WLD (multiscale Weber local descriptor), LBP, SIFT, Gabor filters, and AAM, was used for feature extraction, and SVM was used for classification. Accuracies of 62.3% in the GWI dataset, 85.5% in the JAFFE dataset and 96.0% in the CK dataset were obtained. Shaohua Wan and J.K. Aggarwal presented a method of expression recognition based on robust metric learning [56]. In particular, a new metric space is learned to increase the separation between different facial expressions, so that spatially close data points are more likely to belong to the same class. PCA was used for feature extraction and RobustML (Robust Metric Learning) for classification. The MFP and CK+ datasets were used for the experiment. It was seen that directly trained RobustML was better than transferred RobustML (trRobustML) for recognizing posed expressions.
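Gabor filtering, used for feature extraction in several of the studies above ([13], [30], [54], [55]), convolves the face image with a bank of filters at different orientations and scales and summarizes the responses as a feature vector. The following OpenCV sketch is illustrative; the filter parameters are example values, not those used in any reviewed study.

import cv2
import numpy as np

def gabor_features(gray_face, sizes=(7, 11), orientations=4):
    feats = []
    for ksize in sizes:
        for k in range(orientations):
            theta = k * np.pi / orientations
            kernel = cv2.getGaborKernel((ksize, ksize), sigma=3.0, theta=theta,
                                        lambd=8.0, gamma=0.5)
            response = cv2.filter2D(gray_face.astype(np.float32), cv2.CV_32F, kernel)
            feats.extend([response.mean(), response.std()])   # simple per-filter statistics
    return np.array(feats)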
Keyu Yan et al. used SIFT and VGG features for feature extraction and UDADL (unsupervised domain adaptive dictionary learning) for classification [57]. Experiments were conducted with the Multi-PIE and BU-3DFE datasets. With RGB color SIFT and UDADL, average recognition rates of 72.55% for source Multi-PIE (0°) and target BU-3DFE (30°), 69.50% for source BU-3DFE (30°) and target Multi-PIE (0°), 71.32% for source Multi-PIE (30°) and target BU-3DFE (0°), and 64.65% for source BU-3DFE (0°) and target Multi-PIE (30°) were obtained. With VGG and UDADL, the corresponding rates were 92.25%, 89.75%, 94.50% and 89.75%. Hela Mahersia and Kamel Hamrouni detected facial expressions from the user's input images using statistical features extracted from a directional pyramid decomposition, classified by a Bayesian normalization neural network [58]. The proposed approach was evaluated in terms of recognition accuracy on two universal databases, the JAFFE database and the Cohn-Kanade facial expression database; 93.33% average recognition rate in the JAFFE dataset and 98.13% in the CK dataset were obtained. Min Hu et al. focused on the recognition of facial emotion expressions in video sequences and proposed an integrated framework of two networks: a local network based on a local enhanced motion history image (LEMHI) and a global network based on cascaded CNN-LSTM networks [59]. In the local network, the frames of the unrecognized video are aggregated into a single frame with the new LEMHI method. This approach improves the MHI by using the detected facial areas as attention areas to increase the local weight in the difference-image calculation, so that the movement of the crucial
face units could be effectively captured. This single frame was then fed to a CNN for prediction. In the global network, an improved CNN-LSTM model was used as a global feature extractor and classifier for video face emotion recognition. Finally, a random search-weighted ensemble strategy was used for the final prediction. VGG-16 was used for feature extraction. Experiments on the AFEW, CK+, and MMI datasets with subject-independent validation schemes showed that the integrated framework of the two networks performs better than the separate networks and that, compared to state-of-the-art methods, the proposed framework shows superior performance. Using the VGG-CTSLSTM and LEMHI-VGG combination, accuracies of 93.9% in the CK+ dataset, 78.4% in the MMI dataset and 51.2% in the AFEW dataset were obtained. Hao Zheng et al. proposed a multi-task facial extraction model (MT-FIM) for simultaneous face recognition and facial expression recognition [60]. MT-FIM simultaneously minimizes intra-class scatter and maximizes inter-class distance to ensure the robust performance of each task. The Cohn-Kanade, BU-3DFE and Oulu-CASIA VIS datasets were used for the experiment. With dimension = 300, accuracies of 0.9611 (FI) and 0.9396 (FE) were obtained in the Cohn-Kanade database, 0.6615 (FI) and 0.6572 (FE) in the BU-3DFE database, and 0.7927 (FI) and 0.7013 (FE) in the Oulu-CASIA VIS database. Ling Zhang et al. proposed a hybrid feature set consisting of AAM shape features, appearance features, and geometry features [61]. SVM was used for classification, and the QMI method was used for feature selection. The CAS-PEAL expression database was used for the experiment, and an 87.33% average recognition rate was obtained with 14 features. Jianlong Wu et al. presented the LLCBL (locality-constrained linear coding based bi-layer) model for learning discriminative representations for multi-view facial expression recognition [62]. SIFT was used for feature extraction and LLCBL for classification. To evaluate the proposed approach, extensive experiments were conducted on both the BU-3DFE and Multi-PIE databases; a 74.6% average recognition rate was obtained in the
BU-3DFE dataset and 86.3% in the Multi-PIE dataset. Seyed Mehdi Lajevardi et al. used Zernike moments (ZM) for feature extraction and a Naive Bayesian (NB) classifier for classification [63]. The Cohn-Kanade and JAFFE databases were used for training and testing; average recognition rates of 73.2% in the Cohn-Kanade database and 92.8% in the JAFFE database were obtained. Haibin Yan used WBLDA (weighted biased linear discriminant analysis) and WBMFA (weighted biased margin Fisher analysis) for feature extraction and an NN classifier for classification [64]. Experiments were conducted with the Cohn-Kanade and JAFFE databases. Sumeyye Bayrakdar, Devrim Akgun, and Ibrahim Yucedag performed facial expression analysis on video files using Haar-based features and cubic Bezier curves for face and facial expression detection, and presented the results statistically [65]. Their frame-based approach aims to accelerate the analysis by reducing the number of processed video frames and hence the number of operations; in this way, various videos can be analyzed quickly in terms of facial expressions and the general mood of the person can be obtained. The method is based on reducing the number of processed video frames because the facial expression does not change in every frame of a video: instead of processing all the frames, one image frame out of every N frames, according to the determined frame interval N, was used for expression analysis. The EmguCV library was used for face detection in the study. In facial expression analysis systems, facial expression recognition is performed after the related features are extracted from the facial image. In the study, four facial expressions were used: normal, happy, surprised and sad. The fear, anger and disgust expressions were not considered, because fear is frequently confused with surprise, and anger and disgust are confused with each other and sometimes with the sad expression. The training database contained six-point information for the eye and mouth curves and the width-height values of the eyes and mouth. In the test process,
the most appropriate facial expression for the test data was determined by comparing the data obtained from the eye and mouth Bézier curves with the training data in the database. If there was no sufficiently close match between the training data and the test data, the facial expression was labeled "Uncertain," and when the face or facial features could not be detected, or were detected incorrectly, an "Error" result was generated for the facial expression of the related video frame. For the test phase, the eNTERFACE'05 database, which consists of 1116 videos of 42 subjects of 14 nationalities, 81% male and the rest female, was used.
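A pattern that recurs across the reviewed studies is to reduce the dimensionality of the extracted features (for example with PCA, as in [29], [37], [46] and [56]) and then classify them with an SVM. A minimal scikit-learn sketch of that combination; the parameter values are illustrative only:

from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# X: array of flattened face images or feature vectors, y: expression labels
def build_pca_svm(n_components=50):
    return make_pipeline(PCA(n_components=n_components),
                         SVC(kernel="rbf", C=10, gamma="scale"))

# clf = build_pca_svm(); clf.fit(X_train, y_train); accuracy = clf.score(X_test, y_test)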
RESULTS
This chapter surveyed the studies carried out between the years 2009-2019 on facial expression analysis from 2-dimensional facial images using artificial neural networks. In order to analyze the factors affecting the success rate in emotion recognition from facial expressions, the studies were compared according to classification methods, years, datasets and accuracy rates. SVM, CNN, k-NN and random forest are the most preferred classification methods. The reviewed studies are compared in Table 1, Table 2, Table 3, Table 4 and Table 5 according to classification methods. The comparison of the studies by years is given in Table 6, and Table 7 presents the data sets used in the studies. The comparison of the studies according to the data sets used, in terms of classification methods and accuracy rates, is given in Table 8 to Table 20. When the tables are analyzed, it can be observed that the highest recognition rates were obtained with the JAFFE and CK datasets. In Table 1, the feature extraction methods used in studies that use the SVM method for classification are compared. According to Table 1, LBP is the most commonly used feature extraction method with the SVM classification method.
54
Table 1. Feature extraction methods used in studies using SVM method

Year | Reference No | Classification | Feature Extraction
2009 | [13] | SVM | Gabor Filter (GF)
2009 | [36] | SVM | The visual characteristics of the face were extracted according to the position of the eyes.
2011 | [51] | SVM | LBP, LGBP
2011 | [47] | SVM | DNMF
2011 | [42] | SVM, SRC | LBP-TOP
2012 | [61] | SVM | A hybrid feature set consisting of AAM shape features, appearance features and geometry features
2013 | [37] | SVM | PCA and LBP
2013 | [50] | SVM | WLD
2014 | [39] | SVM | TLBP+CS-LBP
2014 | [43] | J48, FRNN, VQNN, Random Forest, SMO-SVM, Logistic, SVM | GR
2014 | [40] | SVM | SIFT and FAP were extracted by using ASM.
2014 | [55] | SVM | sMSE
2014 | [54] | SVM | Monte-Carlo algorithm
2015 | [10] | SVM | DCNN
2015 | [38] | SVM | LBQ+es-LBP-s
2015 | [35] | k-NN, SVM, ELM-RBF | PCA+LDA, PCA+LFDA, KLFDA
2016 | [9] | SVM | DCNN
2017 | [27] | SVM | IntraFace
2018 | [28] | SVM | Viola-Jones algorithm
2018 | [34] | SVM | LDN+DGLTP
2019 | [29] | IUTWSVM | LDA
2019 | [44] | SVM | LPDP
2019 | [45] | FGSVM | DBN
2019 | [19] | SVM | SIFT+CNN
In Table 2, the feature extraction methods used in studies applying the CNN classifier are compared; according to Table 2, CNN is also the most commonly used feature extraction method with CNN-based classifiers. In Table 3, the feature extraction methods used in studies with the k-NN classifier are compared, and Table 4 does the same for the random forest classifier. In Table 5, the remaining ANN-based classifiers and their feature extraction methods are compared; here too, CNN is the most commonly used feature extraction method.

Table 2. Feature extraction methods used in studies using CNN method

Year | Reference No | Classification | Feature Extraction
2017 | [18] | CNN | CNN
2017 | [15] | CNN | CNN
2018 | [11] | CNN | Trainable convolution kernels
2018 | [16] | 11-fold CNN with Visual Attention | Convolutional layer
2018 | [33] | Softmax Classifier | ESF
2018 | [21] | DCNN | Viola-Jones algorithm
2018 | [22] | TLCNN | DCNN
2019 | [17] | DAM-CNN | VGG-Face
2019 | [59] | CNN-LSTM | VGG-16
Table 3. Feature extraction methods used in studies with the k-NN method for classification

Year | Reference No | Classification | Feature Extraction
2009 | [52] | k-NNS | DAF
2018 | [46] | SMO, Simple Logistic, k-NN, MLP, J48 | MBPC
Table 4. Feature extraction methods used in studies using the Random forest classifier

Year | Reference No | Classification | Feature Extraction
2015 | [41] | Twofold random forest classifier | AAM
2018 | [25] | Random-forest classifier | Kazemi-Sullivan technique
Table 5. Other ANN-based classifiers and feature extraction methods used in the studies Year 2010 2011
Author [63] [12]
2013 2014
[48] [30]
2014
[32]
Classification Naive Bayesian (NB) Normalization of mean distance values of the elements in the attribute vector for each class and similarity metrics NN+Mask A 3-layer neural network classifier trained by back-propagation algorithm KSOM
2014 2015
[56] [58]
RobustML Bayesian Normalization Neural Network
2016 2016 2017
[60] [64] [53]
2017 2017 2017 2018 2018 2018
[62] [65] [31] [49] [14] [57]
MT-FIM NN PSC PSLM PSLM-SF FPSLM-SF KPSNM LLCBL Bezier Curves ECGM based classifier SHLFNN CONERF UDADL
2018 2018 2018 2019
[24] [23] [26] [20]
STC-NLSTM SR VGG IDFERM
Feature Extraction Zernike moments (ZM) Tracking of triangulation points by using active shape models STASM Gabor Filter Viola-Jones algorithm ROI PCA Directional Pyramid Decomposition MT-FIM WBLDA, WBMFA HOG LBP
SIFT Adaboost Algorithm PHOG SWE CNN SIFT VGG 3DCNN DSAN CNN CNN
In Table 6, the studies are compared by year. According to this comparison, 2018 saw the largest number of studies on emotion analysis using artificial neural networks. In Table 7, the studies are compared by the datasets used; JAFFE, CK+, CK, MMI, BU-3DFE, Oulu-CASIA and Multi-PIE are, in that order, the most frequently used datasets. Table 8 shows the accuracy rates of the classification methods that used the CK database; the highest accuracy rate in the CK database, 99.75%, was obtained with IMF+KLFDA feature extraction and the ELM-RBF classifier. Table 9 shows the accuracy rates for the CK+ database; the highest accuracy rate in the CK+ database, 100%, was obtained with the SR classification method. Table 10 shows the accuracy rates for the BU-3DFE database; the highest accuracy rate in the BU-3DFE database, 79.26%, was obtained with the KPSNM classification method.

Table 6. Comparison of studies by years

Year | Publishing | Total
2009 | [13, 36, 52] | 3
2010 | [63] | 1
2011 | [12, 42, 47, 51] | 4
2012 | [61] | 1
2013 | [37, 48, 50] | 3
2014 | [30, 32, 39, 40, 43, 54, 55, 56] | 8
2015 | [10, 35, 38, 41, 58] | 5
2016 | [9, 60, 64, 66] | 4
2017 | [15, 18, 27, 31, 53, 62, 65] | 7
2018 | [11, 14, 16, 21, 22, 23, 24, 25, 26, 28, 33, 34, 46, 49, 57] | 15
2019 | [17, 19, 20, 29, 44, 45, 59] | 7
Table 7. Data sets used in the studies

Dataset | Publishing
CK | [13, 19, 27, 34, 35, 39, 47, 52, 54, 55, 58, 60, 63, 64]
CK+ | [9, 14, 15, 17, 18, 20, 21, 23, 24, 25, 26, 31, 33, 41, 44, 56, 59]
DFAT-504 | [13]
BU-3DFE | [15, 44, 47, 51, 53, 57, 60, 62]
Multi-Pie | [51, 53, 57, 62]
JAFFE | [9, 10, 14, 15, 17, 18, 19, 23, 25, 29, 30, 31, 34, 35, 38, 47, 48, 54, 55, 58, 63, 64]
Oulu-CASIA | [20, 24, 33, 41, 42, 60]
CAS-PEAL | [61]
MMI | [19, 20, 21, 22, 24, 25, 32, 33, 43, 44, 59]
SFEW | [16, 17, 40, 46]
NVIE | [40]
FEEDTUM | [40]
FEED | [27]
KDEF | [22, 25, 27]
KEEL | [29]
ISED | [44]
GEMEP-FERA | [44]
FACES | [44]
DEAP | [45]
MUG | [18]
FER2013 | [11, 17, 33]
RAFD | [16, 25]
Florentine | [21]
Radbound | [22]
TFEID | [17, 22]
BAUM-2i | [17]
AFEW | [59]
GENKI | [48]
FERET | [48]
YALE | [30]
MFP | [56]
LFW | [14]
BP4D | [24]
GWI | [55]
Shandong Mental Health Center | [28]
Real-world datasets in UCI repository | [29]
3 repetitions of 7 subjects from 5 subjects | [12]
eNTERFACE’05 | [65]
20 subjects - 700 images | [49]
Multi-View BU-3DEF | [14]
ITLab | [26]
Gw | [50]
15 subjects | [36]
350 face expression images | [37]
Table 8. Comparison of studies using CK database in terms of classification methods and accuracy Publishing [60]
Classification MT-FIM
[35]
ELM-RBF k-NN SVM
[64] [58]
NN Bayesian Normalization Neural Network SVM SVM SVM kNNS Naive Bayesian (NB) SVM SVM SVM SVM SVM
[27] [34] [19] [52] [63] [13] [47] [39] [55] [54]
Accuracy Rate In the CK database, with dimension = 300, FI accuracy = 0.9611, FE accuracy = 0.9396 CK, IMF1+KLFDA, %99.75 by using ELM-RBF CK, IMF1+KLFDA, %99.61 by using kNN, %99.71 by using SVM CK, IMF1+PCA+LDA; %99.11 by using kNN, %99.21 by using Gaussian SVM, %99.26 by using ELM-RBF CK, IMF1+PCA+LFDA, %99.22 by using kNN, %99.38 by using SVM, %99.51 by using ELM-RBF %98.13 %96.06 %88 %86.49 %73.2 %69.2 %97.6 %96 If there is no occlusion in the training and test sets; 95.3% in CK. If there is an occlusion by clear glasses on training and test sets; 95% in CK. If there is an occlusion by solid glasses on training and test sets; 91.5% in CK
Table 9. Comparison of studies using CK+ database by classification methods and accuracy rates Publishing [41] [25] [56] [31]
Classification Twofold random forest classifier Random-forest classifier RobustML ECGM based classifier
[14] [24] [23] [26] [20] [9] [44] [18] [15] [33]
CONERF STC-NLSTM SR VGG IDFERM SVM SVM CNN CNN Softmax Classifier
[21] [17] [59]
DCNN DAM-CNN CNN-LSTM
Accuracy Rate %96.38 %96.37 ± %0.80 CK+; 97.66% with faces obtained by landmark, 94.87% with faces obtained by Adaboost. %99.02 % (99.8 ± 0.2) %100 %91.80 %98.35 %96.02 %94.50 %97.06 %96.76 CK +; 96.57% with BASELINE, 97.28% with ETFL_C, 99.03% with ETFL_J. %90.52 %95.88 %93.9
Table 10. Comparison of studies using BU-3DFE database in terms of classification methods and accuracy rates Publishing [57] [62] [60]
Classification UDADL LLCBL MT-FIM
[51]
SVM
[53]
PSC KPSNM PSLM PSLM-SF FPSLM-SF SVM SVM CNN
[47] [44] [15]
Accuracy Rate %74.6 In the BU-3DFE database, with dimension = 300, FI accuracy = 0.6615, FE accuracy = 0.6572. By using LBP; 65.02% in the BU3DFE dataset By using LGBP; 67.96% in the BU3DFE dataset By using KPSNM; 79.26% in the BU3DFE-P1 dataset By using KPSNM; 78.79% in the BU3DFE-P2 dataset
%63.9 %73.4 %72.89
Table 11 shows the accuracy rates of the classification methods that used the Multi-PIE database; the highest accuracy rate in the Multi-PIE database, 86.3%, was obtained with the LLCBL classification method. Table 12 shows the accuracy rates for the JAFFE database; the highest accuracy rate in the JAFFE database, 100%, was obtained with the k-NN, SVM and ELM-RBF classification methods. Table 13 shows the accuracy rates for the Oulu-CASIA database; the highest accuracy rate in the Oulu-CASIA database, 96.38%, was obtained with the twofold random forest classifier. Table 14 shows the accuracy rates for the MMI database; the highest accuracy rate in the MMI database, 93.53%, was obtained with the KSOM classification method. Table 15 shows the accuracy rates for the SFEW database; the highest accuracy rate in the SFEW database, 96.5%, was obtained with the J48 classification method. Table 16 shows the accuracy rates for the KDEF database; the highest accuracy rate in the KDEF database, 93.34%, was obtained with the SVM classification method.

Table 11. Comparison of studies using Multi-PIE database in terms of classification methods and accuracy rates

Publishing | Classification | Accuracy Rate
[51] | SVM | By using LBP: 73.26% in the Multi-PIE dataset; by using LGBP: 80.60% in the Multi-PIE dataset
[53] | PSC, PSLM, PSLM-SF, FPSLM-SF, KPSNM | 82.43% in the Multi-PIE-P1 dataset and 83.09% in the Multi-PIE-P2 dataset
[62] | LLCBL | 86.3%
[57] | UDADL | -
Table 12. Comparison of studies using Jaffe database in terms of classification methods and accuracy rates
Publishing [47] [55] [54]
Classification SVM SVM SVM
[10] [38]
SVM SVM
[35]
k-NN SVM ELM-RBF SVM SVM IUTWSVM SVM CNN CNN DAM-CNN Random-forest classifier Naive Bayesian (NB) NN+Mask 3-layer neural network classifier trained with the back-propagation algorithm Bayesian Normalization Neural Network NN ECGM based classifier CONERF SR
[9] [34] [29] [19] [18] [15] [17] [25] [63] [48] [30]
[58]
[64] [31] [14] [23]
Accuracy Rate %63.9 %85.5 If there is no occlusion in the training and test sets; 81.2% by using JAFFE. If there is an occlusion by clear glasses on training and test sets; 79.8% by using JAFFE. If there is an occlusion by solid glasses on training and test sets; 75.1% by using JAFFE 94.88% by using 10-fold cross-validation and 76.67% by using LOPO JAFFE obtained a 100% average recognition rate with k-NN, SVM, and ELM-RBF by 3 feature extraction methods. %98.12 %88 %91.42 %97.06 %53.44 %99.22 %95.30 ± %1.70 % 92.8 %92 %96.83
%93.33
In the JAFFE dataset; the %76.44 accuracy rate was obtained by using LOSO, a %96.24 accuracy rate by using 10-fold. %99.02 %98.59
Table 13. Comparison of studies using Oulu-CASIA database in terms of classification methods and accuracy rates

Publishing | Classification | Accuracy Rate
[42] | SVM, SRC | SVM: 72.09% when using NIR_N for training and testing, 73.54% when using VIS_N for training and testing; SRC: 78.64% with NIR_N, 76.21% with VIS_N
[33] | Softmax Classifier | 83.96% with BASELINE, 85.21% with ETFL_C, 86.25% with ETFL_J
[41] | Twofold random forest classifier | 96.38%
[60] | MT-FIM | In the Oulu-CASIA VIS database, with dimension = 300, FI accuracy = 0.7927, FE accuracy = 0.7013
[24] | STC-NLSTM | 93.45% (±0.43)
[20] | IDFERM | 88.25%
Table 14. Comparison of studies using the MMI database in terms of classification methods and accuracy rates Publishing [43]
[44] [19] [33]
Classification J48 FRNN VQNN Random Forest SMO-SVM Logistic SVM SVM Softmax Classifier
[21] [22]
DCNN TLCNN
[59] [25] [32] [24] [20]
CNN-LSTM Random-forest classifier KSOM STC-NLSTM IDFERM
Accuracy Rate %50 %71.57 %70.58 %57.84 %71.56 %69.6 %70.63 78.08% by using BASELINE, 79.60% by using ETFL_C, 82.34% by using ETFL_J. %65.57 TIM32-%71.19 TIM64-%69.12 %78.4 %85.84 ± %0.86 %93.53 %84.53 (±0.67) %81.13
Table 15. Comparison of studies using the SFEW database in terms of classification methods and accuracy rates

Publishing | Classification | Accuracy Rate
[40] | SVM | 34.6%
[16] | 11-fold CNN with Visual Attention | 40%
[17] | DAM-CNN | 42.30%
[46] | SMO; Simple Logistic; k-NN; MLP; J48 | 93.5%; 94.1%; 87.1%; 89.4%; 96.5%
Table 16. Comparison of studies using KDEF database in terms of classification methods and accuracy rates

Publishing | Classification | Accuracy Rate
[27] | SVM | 93.34%
[22] | TLCNN | TIM32: 71.19%; TIM64: 69.12%
[25] | Random-forest classifier | 90.62% ± 1.60%
Table 17. Comparison of studies using FER2013 database in terms of classification methods and accuracy rates

Publishing | Classification | Accuracy Rate
[11] | CNN | 65.10%
[33] | Softmax Classifier | 71.52% with BASELINE, 72.25% with ETFL_C, 75.10% with ETFL_J
[17] | DAM-CNN | 65.31%
Table 17 shows the accuracy rates of the classification methods that used the FER2013 database; the highest accuracy rate in the FER2013 database, 75.10%, was obtained with the Softmax classifier and ETFL_J. Table 18 shows the accuracy rates for the RAFD database; the highest accuracy rate in the RAFD database, 97.57% ± 1.33%, was obtained with the random forest classification method.
Table 18. Comparison of studies using RAFD database in terms of classification methods and accuracy rates

Publishing | Classification | Accuracy Rate
[16] | 11-fold CNN with Visual Attention | RAFD-POSE: 93.1%; RAFD-FRONT: 95.2%
[25] | Random-forest classifier | 97.57% ± 1.33%
Table 19 shows the accuracy rates of the classification methods that used the TFEID database; the highest accuracy rate in the TFEID database, 93.65%, was obtained with the DAM-CNN classification method.

Table 19. Comparison of studies using TFEID database in terms of classification methods and accuracy rates

Publishing | Classification | Accuracy Rate
[22] | TLCNN | TIM32: 71.19%; TIM64: 69.12%
[17] | DAM-CNN | 93.65%
Table 20 shows the accuracy rates of the classification methods that used other databases. According to the table, the highest accuracy rate, 98.5%, was obtained with the SVM classification method on a database of 15 subjects. When the comparison tables are analyzed, it is observed that higher recognition rates were obtained on the CK and JAFFE datasets than on the other datasets; from these findings, it can be said that the JAFFE and CK datasets contribute to high recognition rates. Among CNN-based classifiers, the highest accuracy, 99.22%, was obtained with VGG-Face feature extraction and the DAM-CNN classifier on the JAFFE dataset. Among random forest classifiers, the highest accuracy, 97.57% ± 1.33%, was obtained with the Kazemi-Sullivan feature extraction method on the RAFD database.
Table 20. Comparison of studies using other datasets in terms of classification methods and accuracy rates
Dataset DFAT-504 CAS-PEAL NVIE FEEDTUM FEED Real-world datasets in KEEL + UCI repository ISED GEMEP-FERA FACES DEAP MUG Florentine Radbound
Publishing [13] [61] [40] [40] [27] [29]
Classification SVM SVM SVM SVM SVM IUTWSVM
Accuracy Rate %87.33 %30.3 %30.1 %92.03 %91.42
[44] [44] [44] [45] [18] [21] [22]
SVM SVM SVM FGSVM CNN DCNN TLCNN
BAUM-2i AFEW GENKI + FERET YALE
[17] [59] [48] [30]
MFP LFW BP4D GWI Shandong Mental Health Center 3 repetitions of 7 subjects from 5 subjects
[56] [14] [24] [55] [28] [12]
eNTERFACE’05 20 subjects - 700 images Multi-View BU-3DEF ITLab Gw 15 subjects 350 face expression images
[65] [49] [14] [26] [50] [36] [37]
DAM-CNN CNN-LSTM NN+Mask A 3-layer neural network classifier trained by backpropagation algorithm RobustML CONERF STC-NLSTM SVM SVM Normalization of mean distance values of the elements in the attribute vector for each class and similarity metrics Bezier Curves SHLFNN CONERF VGG SVM SVM SVM
%78.32 %70 %94.72 %89.53 %97.06 %51.35 TIM32-%71.19 TIM64-%69.12 %67.92 %51.2 %92 %92.22
%60.9 0.51 F1 Score %62.3 %78.85 %75.23
%96.80 %94.09 %91.80 %59.9 %98.5 %93.75
In the JAFFE database, the PCA+LDA, PCA+LFDA and KLFDA feature extraction methods each achieved 100% accuracy with the k-NN, SVM and ELM-RBF classifiers, and in the CK+ database a 100% accuracy rate was obtained with the SR classification method.
CONCLUSION
In this study, the studies on facial expression analysis from 2D facial images using artificial neural networks carried out between 2009 and 2019 were reviewed. The structure of facial expression analysis systems consists of the following phases: face detection, extraction of facial expression features, and classification. The effects of these phases on the performance of emotion recognition applications, and the effects of the methods used in these phases on the accuracy rates, were presented as a result of the literature evaluation. We also compared the results of the studies according to the presence of a feature selection process and observed that feature selection was very useful in reducing classification errors. This review also shows that the success of emotion recognition is closely related to the dataset used: some classification methods obtained high accuracy on some datasets, but the same methods showed low accuracy on others.
REFERENCES [1]
[2]
Bayrakdar S., Akgun D., Yucedag I., “A survey on automatic analysis of facial expressions,” Sakarya University Journal of Science 20, no. 2 (2016): 383-398. Darwin C., Expression of the Emotions in Man and Animals, London: John Murray, 1872.
[3]
[4] [5] [6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
Sibel Senan, Zeynep Orman and Fulya Akcan Ekman P., Friesen W. V., “Constants across cultures in the face and emotion,” Journal of Personality and Social Psychology 17, no. 2 (1971): 124-129. Ekman P., “Universals and cultural differences in facial expressions of emotion,” Nebraska Symposium on Motivation, Lincoln, 1972. Ekman P., Friesen W., Unmasking the face, Prentice-Hall, 1975. Kumaria J., Rajesh R., Poojaa K., “Facial expression recognition: A survey,” (Kochi: Second International Symposium on Computer Vision and the Internet (VisionNet’15), 2015). Gupta S. K., Agrawal S., Meena Y. K., Nain N., “A hybrid method of feature extraction for facial expression recognition,” (Dijon: Seventh International Conference on Signal-Image Technology & InternetBased Systems, 2011). Zhang L., Chen S., Wang T., Liu Z., “Automatic facial expression recognition based on hybrid features,” International Conference on Future Electrical Power and Energy Systems, 2012. Mayya V., Pai R. M., Pai Manohara M. M, “Automatic Facial Expression Recognition Using DCNN,” 6th International Conference On Advances In Computing & Communications, ICACC 2016, 6-8. Chen X., Cheng W., “Facial expression recognition based on edge detection,” International Journal of Computer Science & Engineering Survey (IJCSES) 6, no. 2, (2015): 1-9. Xu L., Fei M., Zhou W., Yang A., “Face Expression Recognition Based on Convolutional Neural Network,” Australian & New Zealand Control Conference (ANZCC), 2018. Arı I., Alsaran F. O., Akarun L., “Vision-based Real-time Emotion Recognition,” IEEE 19th Signal Processing and Communications Applications Conference (SIU), 2011. Gunes T., Polat E., “Feature selection in facial expression analysis and its effect on multi-SVM classifiers,” Journal of the Faculty of Engineering and Architecture of Gazi University 24, no. 1 (2009): 714. Liu Yuanyuan, Yuan Xiaohui, Gon Xi, Xie Zhong, Fang Fang, Luo Zhongwen, “Conditional convolution neural network enhanced
[15]
[16]
[17]
[18]
[19]
[20]
[21] [22]
[23]
random forest for facial expression recognition,” Pattern Recognition 84 (2018): 251–261. Lopes André T., De Aguiar E., De Souza Alberto F., Oliveira-Santos Thiago, “Facial expression recognition with Convolutional Neural Networks: Coping with few data and the training sample order,” Pattern Recognition 61 (2017): 610–628. Sun W., Zhao H., Jin Z., “A visual attention-based ROI detection method for facial expression recognition,” Neurocomputing 296 (2018): 12–22. Xie S., Hu H., Wu Y., “Deep multi-path convolutional neural network joint with salient region attention for facial expression recognition,” Pattern Recognition 92 (2019): 177–191. Pitaloka Diah A., Wulandari A., Basaruddin T., Liliana Dewi Y., “Enhancing CNN with Preprocessing Stage in Automatic Emotion Recognition,” (Bali: 2nd International Conference on Computer Science and Computational Intelligence 2017 (ICCSCI 2017), 13-14 October 2017). Wang F., Lv J., Ying G., Chen S., Zhang C., “Facial expression recognition from image based on hybrid features understanding,” J. Vis. Commun. Image R. 59 (2019): 84–88. Liu X., Kumar Vijaya B.V.K., Jia P., You J., “Hard negative generation for identity-disentangled facial expression recognition,” Pattern Recognition 88 (2019): 1–12. Gupta O., Raviv D., Raskar R., “Illumination invariants in deep video expression recognition,” Pattern Recognition 76 (2018): 25–35. Wang Su-Jing, Li Bing-Jun, Liu Yong-Jin, Yan Wen-Jing, Ou X., Huang X., Xu F., Fu X., “Micro-expression recognition with small sample size by transferring long-term convolutional neural network,” Neurocomputing 312 (2018): 251–262. Chen L., Zhou M., Su W., Wu M., She J., “Softmax regression-based deep sparse autoencoder network for facial emotion recognition in human-robot interaction,” Information Sciences 428 (2018): 49–61.
[24] Yu Z., Liu G., Liu Q., Deng J., “Spatio-temporal convolutional features with nested LSTM for facial expression recognition,” Neurocomputing 317 (2018): 50–57. [25] Yaddaden Y., Adda M., Bouzouane A., Gaboury S., “User action and facial expression recognition for error detection system in ambient assisted environment,” Expert Systems with Applications 112 (2018): 173–189. [26] Ahmed Uddin M., Woo Jin K., Hyeon Yeong K., Bashar Md. R., Rhee Kyu P., “Wild facial expression recognition based on incremental active learning,” Cognitive Systems Research 52 (2018): 212–222. [27] Lekdioui K., Messoussi R., Ruichek Y., Chaabi Y., Touahni R., “Facial decomposition for expression recognition using texture/shape descriptors and SVM classifier,” Signal Processing: Image Communication 58 (2017): 300–312. [28] Wang Q., Yang H., Yu Y., “Facial expression video analysis for depression detection in Chinese patients,” J. Vis. Commun. Image R. 57 (2018): 228–233. [29] Richhariya B., Gupta D., “Facial expression recognition using iterative Universum twin support vector machine,” Applied Soft Computing Journal 76 (2019): 53–67. [30] Owusu E., Zhan Y., Mao Q. R.,” A neural-AdaBoost based facial expression recognition system,” Expert Systems with Applications 41 (2014): 3383–3390. [31] Sun Y., Wen G., “Cognitive facial expression recognition with constrained dimensionality reduction,” Neurocomputing 230 (2017): 397–408. [32] Majumder A., Behera L., Subramanian Venkatesh K., “Emotion recognition from geometric facial features using self-organizing map,” Pattern Recognition 47 (2014): 1282–1293. [33] Huang Y., Yan Y., Chen S., Wang H., “Expression-targeted feature learning for effective facial expression recognition,” J. Vis. Commun. Image R. 55 (2018): 677–687. [34] Revina Michael I., Emmanuel W.R. S., “Face expression recognition using LDN and Dominant Gradient Local Ternary Pattern
[35]
[36]
[37]
[38]
[39] [40]
[41]
[42]
[43]
[44]
descriptors,” Journal of King Saud University – Computer and Information Sciences, 2018. Ali H., Hariharan M., Yaacob S., Adom A. H., “Facial emotion recognition using empirical mode decomposition,” Expert Systems with Applications 42 (2015): 1261–1277. Geetha A., Ramalingam V., Palanivel S., Palaniappan B., “Facial expression recognition – A real-time approach,” Expert Systems with Applications 36 (2009): 303–308. Luo Y., Wu Cai-ming, Zhang Yi, “Facial expression recognition based on fusion feature of PCA and LBP with SVM,” Optik 124 (2013): 2767–2770. Chao Wei-Lun, Ding Jian-Jiun, Liu Jun-Zuo, “Facial expression recognition based on improved local binary pattern and classregularized locality preserving projection,” Signal Processing 117 (2015): 1–10. Lai Chih-Chin, Ko Chung-Hung, “Facial expression recognition based on two-stage features extraction,” Optik 125 (2014): 6678–6680. Zhang L., Tjondronegoro D., Chandran V., “Facial expression recognition experiments with data from television broadcasts and the World Wide Web,” Image and Vision Computing 32 (2014): 107–119. Pu X., Fan K., Chen X., Ji L., Zhou Z., Facial expression recognition from image sequences using twofold random forest classifier, Neurocomputing 168 (2015): 1173–1180. Zhao G., Huang X., Taini M., Li S. Z., Pietikäinen M., “Facial expression recognition from near-infrared videos,” Image and Vision Computing 29 (2011): 607–619. Fang H., Parthaláin N. M., Aubrey A. J., Tam G. K. L., Borgo R., Rosin P. L., Grant Philip W., Marshall D., Chen M., “Facial expression recognition in dynamic sequences: An integrated approach,” Pattern Recognition 47 (2014): 1271–1281. Makhmudkhujaev F., Abdullah-Al-Wadud M., Iqbal M. T. B., Ryu B., Chae O., “Facial expression recognition with a local prominent directional pattern,” Signal Processing: Image Communication 74 (2019): 1–12.
[45] Hassan M. M., Alam M. G. R., Uddin M. Z., Huda S., Almogren A., Fortino G., “Human emotion recognition using deep belief network architecture,” Information Fusion 51 (2019): 10–18. [46] Munir A., Hussain A., Khan S. A., Nadeem M., Arshid S., “Illumination invariant facial expression recognition using selected merged binary patterns for real-world images,” Optik 158 (2018): 1016–1025. [47] Maronidis A., Bolis D., Tefas A., Pitas I., “Improving subspace learning for facial expression recognition using person dependent and geometrically enriched training sets,” Neural Networks 24 (2011): 814–823. [48] Danisman T., Bilasco I. M., Martinet J., Djeraba C., “Intelligent pixels of interest selection with application to facial expression recognition using multilayer perceptron,” Signal Processing 93 (2013): 1547– 1556. [49] Wang Shui-Hua, Phillips P., Dong Zheng-Chao, Zhang Yu-Dong, “Intelligent facial emotion recognition based on stationary wavelet entropy and Jaya algorithm,” Neurocomputing 272 (2018): 668–676. [50] Yu K., Wang Z., Zhuo L., Wang J., Chi Z., Feng D., “Learning realistic facial expressions from web images,” Pattern Recognition 46 (2013): 2144–2155. [51] Moore S., Bowden R., “Local binary patterns for multi-view facial expression recognition,” Computer Vision and Image Understanding 115 (2011): 541–558. [52] Cheon Y., Kim D., “Natural facial expression recognition using differential-AAM and manifold learning,” Pattern Recognition 42 (2009): 1340 – 1350. [53] Jampour M., Lepetit V., Mauthner T., Bischof H., “Pose-specific nonlinear mappings in feature space towards multi-view facial expression recognition,” Image and Vision Computing 58 (2017): 38–46. [54] Zhang L., Tjondronegoro D., Chandran V., “Random Gabor based templates for facial expression recognition in images with facial occlusion,” Neurocomputing 145 (2014): 451–464.
[55] Yu K., Wang Z., Hagenbuchner M., Feng D. D., “Spectral embedding based facial expression recognition with multiple features,” Neurocomputing 129 (2014): 136–145. [56] Wan S., Aggarwal J. K., “Spontaneous facial expression recognition: A robust metric learning approach,” Pattern Recognition 47 (2014): 1859–1868. [57] Yan K., Zheng W., Cui Z., Zong Y., Zhang T., Tang C., “Unsupervised facial expression recognition using domain adaptation-based dictionary learning approach,” Neurocomputing 319 (2018): 84–91. [58] Mahersia H., Hamrouni K., “Using multiple steerable filters and Bayesian regularization for facial expression recognition,” Engineering Applications of Artificial Intelligence 38 (2015): 190– 202. [59] Hu M., Wang H., Wang X., Yang J., Wang R., “Video facial emotion recognition based on local enhanced motion history image and CNNCTSLSTM networks,” J. Vis. Commun. Image R. 59 (2019): 176–185 [60] Zheng H., Geng X., Tao D., Jin Z., “A multi-task model for simultaneous face identification and facial expression recognition,” Neurocomputing 171 (2016): 515–523. [61] Zhang L., Chen S., Wang T., Liu Z., “Automatic Facial Expression Recognition Based on Hybrid Features,” Energy Procedia 17 (2012): 1817 – 1823. [62] Wu J., Lin Z., Zheng W., Zha H., “Locality-constrained linear coding based bi-layer model for multi-view facial expression recognition,” Neurocomputing 239 (2017): 143–152. [63] Lajevardi S. M., Hussain Zahir M., “Higher-order orthogonal moments for invariant facial expression recognition,” Digital Signal Processing 20 (2010): 1771–1779. [64] Yan H., “Biased subspace learning for misalignment-robust facial expression recognition,” Neurocomputing 208 (2016): 202–209. [65] Bayrakdar S., Akgün D., Yücedağ İ., “An accelerated approach for facial expression analysis on video files” Pamukkale University Journal of Engineering Sciences, 23 no.5 (2017): 602-613.
[66] Luo Y., Zhang T., Zhang Y., “A novel fusion method of PCA and LDP for facial expression feature extraction,” Optik 127 (2016): 718–721. [67] Lucey P., Cohn J. F., Kanade T., Saragih J., Ambadar Z., “The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit,” (San Francisco: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2010). [68] Kanade T., Cohn J. F., Tian Y., “Comprehensive database for facial expression analysis,” (Grenoble: IEEE Fourth International Conference on Automatic Face and Gesture Recognition, 2000). [69] The Japanese Female Facial Expression (JAFFE) Database, “http://www.kasrl. org/jaffe.html,” accessed April 2019. [70] The Karolinska Directed Emotional Faces (KDEF) database, http://www.emotionlab. se/kdef/, accessed May 2019. [71] MMI facial expression database, “https://mmifacedb.eu/,” accessed May 2019. [72] Multimedia Understanding Group (MUG) Database, https://mug.ee.auth.gr/fed, accessed May 2019. [73] Taiwanese Facial Expression Image Database (TFEID), http://bml.ym.edu.tw/tfeid, accessed May 2019. [74] Ekman P., Friesen W., Facial Action Coding System: A Technique for the Measurement of Facial Movement, (Palo Alto: Consulting Psychologists Press, 1978). [75] Alcalá-Fdez J., Fernandez A., Luengo J., Derrac J., García S., Sánchez L., Herrera F., KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework, J. Mult.-Valued Logic Soft Comput. 17 (2–3) (2011): 255–287.
In: Neural Networks Editor: Doug Alexander
ISBN: 978-1-53617-188-4 © 2020 Nova Science Publishers, Inc.
Chapter 3
DIPOLE MODE INDEX PREDICTION WITH ARTIFICIAL NEURAL NETWORKS

Kalpesh R. Patil¹ and Masaaki Iiyama²
¹Academic Center for Computing and Media Studies, Kyoto University, Kyoto, Japan
²Academic Center for Computing and Media Studies, Kyoto University, Kyoto, Japan
ABSTRACT
The Indian Ocean Dipole (IOD), one of the important climatic indices, is directly linked with floods and droughts in the countries of the Indian Ocean rim. IOD events are expected to occur more frequently because the Indian Ocean is warming at a slightly higher rate than the global ocean. It is therefore important to predict such IOD events in advance in order to mitigate their effects on local and global climate efficiently. The strength of an IOD event is defined by the dipole mode index (DMI), which is the difference of sea surface temperature anomalies (SSTA) between the western (50°-70°E, 10°S-10°N) and eastern (90°-110°E, 10°S-0°) parts.
Corresponding Author’s Email: [email protected].
Depending on this difference of SSTA between the western and eastern parts, an IOD event is categorized as strong or weak: strong and weak IOD events show differences higher and lower than one standard deviation, respectively. Several past studies predicted IOD events with coupled ocean-atmosphere general circulation models (COAGCM). These models need very accurate oceanic subsurface observations to predict the DMI accurately. COAGCM also assume that the predictability of the DMI is largely governed by variation in the Nino 3.4 region (170°-120°W, 5°S-5°N), which is true for the western part of the Indian Ocean, whereas the predictability of the eastern part is largely governed by its intrinsic dynamics and is independent of the Nino 3.4 region. Such assumptions produce low skill in DMI prediction by COAGCM. This study proposes a simple non-linear approach for DMI prediction in which past values of the DMI were used as inputs and future values were predicted by artificial neural networks (ANN). The DMI was composed using Hadley SST data for a long period from 1870 to the present, with anomalies based on the 1981-2013 base period. Apart from past values of the DMI, past values of the El Nino Southern Oscillation (ENSO) index were also considered as inputs to check their contribution to the skill of DMI prediction. The study was also conducted for seasonal DMI prediction, because the DMI is most prominent in the Sep-Oct-Nov season. DMI prediction skills were compared against observed anomalies and against a persistence model (PM). The four-month-ahead prediction skill in terms of root mean square error (rmse) was lower than 0.29°C for the ANN model and higher than 0.35°C for the PM. This suggests that the DMI can be predicted skillfully with a lead time of 3-4 months and two seasons. The prediction skill compared with observed anomalies separately for Sep, Oct and Nov varies from 0.93 to 0.61 in correlation coefficient (r) and from 0.16°C to 0.35°C in rmse for one and three months ahead, respectively. The seasonal prediction skill for the Sep-Oct-Nov season as a whole is 0.81 (r) and 0.23°C (rmse) at one season ahead. A few important extreme IOD events were also assessed, including the 1994 positive IOD and the 2016 negative IOD event; their amplitudes were predicted with 87% and 78% accuracy, respectively, at one season lead time. This study also concludes that ENSO has little effect on IOD events, because no significant improvement in DMI prediction was noticed when past ENSO values were given as additional inputs.
Keywords: Dipole mode index, Indian Ocean dipole, Predictability, Seasonal prediction
1. INTRODUCTION
The dipole mode index (DMI) is a time series of monthly values constructed from the difference of SSTA between the western (50-70°E, 10°S-10°N) and eastern (90-110°E, 10°S-Eq) parts of the tropical Indian Ocean [1, 2]. The DMI is one of the most significant climate indices of the Indian Ocean and represents the intensity of an IOD phenomenon. The IOD is of great importance for its influence on rainfall in India and on the Asian summer monsoon [2-4]. Apart from that, significant floods and droughts in the countries neighbouring the Indian Ocean are frequently reported during extreme intensities of the DMI, especially in the Sep-Oct-Nov season [4-7]. The prediction of the DMI is therefore highly relevant for policymaking in the environment, agriculture, water resources and climate sectors. DMI prediction is recognized as a very complex task and is usually attempted with numerical models by various scientific agencies across the world [7-9]. Numerical models involve assumptions in solving the primitive equations that define them [10, 11]. Their output often needs to be downscaled to remove bias, and such models are very sensitive to the initial forcing fields [12-14], so they need to be initialized with several conditions to form an ensemble of predictions [15, 16]. This study attempts a simple time series forecasting approach using ANN for DMI prediction, using its past values and past ENSO values, since some studies have reported that the IOD is driven by ENSO [17, 18]. The remaining parts of the chapter describe the data used for the study, the methodology adopted using ANN, various results on monthly and seasonal DMI prediction, and a discussion of further improvements.
2. DATA
The SST data used in this study are a product of the Hadley Centre Global Sea Ice and Sea Surface Temperature (HadISST) dataset. HadISST is an optimally interpolated product of SSTs from the Marine Data Bank (mostly ship
routes) and the International Comprehensive Ocean-Atmosphere Data Set (ICOADS) through 1981, and a blend of in-situ and adjusted satellite-derived SSTs from 1982 onwards [19]. The data are available from January 1870 onwards up to the month before the last month. A long-term mean calculated over the 1981-2013 base period was subtracted before estimating the spatial SST anomalies in the western and eastern tropical Indian Ocean. The difference between the western and eastern SSTA constructs the DMI index (Eq. 1). In every scenario, past values of the DMI were given as inputs to predict future values. The ENSO index was calculated from the same HadISST data after subtraction of the long-term mean for a similar period. The data are easily accessible at https://www.metoffice.gov.uk/hadobs/hadisst/data/download.html.

DMI(t) = \left( SST_{west}(t) - \overline{SST_{west}}^{\,1981\text{-}2013} \right) - \left( SST_{east}(t) - \overline{SST_{east}}^{\,1981\text{-}2013} \right)   (1)
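As a minimal illustration of Eq. 1, the sketch below computes the DMI from the HadISST file linked above. It assumes the xarray library, the standard HadISST variable and coordinate names ("sst", "latitude", "longitude", "time") and HadISST's descending latitude ordering; the box boundaries and the 1981-2013 base period come from the text, everything else is an assumption.

import xarray as xr

ds = xr.open_dataset("HadISST_sst.nc")        # assumed local file name
sst = ds["sst"].where(ds["sst"] > -100)       # mask the large negative flag values used for ice

def box_mean(da, lat_slice, lon_slice):
    # area box average over the given latitude/longitude slices
    return da.sel(latitude=lat_slice, longitude=lon_slice).mean(("latitude", "longitude"))

west = box_mean(sst, slice(10, -10), slice(50, 70))    # 50-70E, 10S-10N
east = box_mean(sst, slice(0, -10), slice(90, 110))    # 90-110E, 10S-0

def anomaly(series):
    # subtract the long-term mean over the 1981-2013 base period, as described above
    return series - series.sel(time=slice("1981-01-01", "2013-12-31")).mean("time")

dmi = anomaly(west) - anomaly(east)                    # Eq. 1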
3. METHODOLOGY
ANN is used in this study to forecast the monthly/seasonal DMI. Among the many ANN architectures available, the multilayer perceptron (MLP) is used for its simplicity and widespread generalization capability [20, 21]. More complex architectures also have potential here, but the scope of the study is limited to a time series forecasting approach for long-term prediction of the DMI rather than to a comparison of architectures. The working of the MLP is widely known, so readers are referred to [22-24] for a more detailed description. Briefly, this study employs a three-layer architecture with various combinations of past values of the monthly/seasonal DMI as inputs and various numbers of hidden neurons. The inputs for the MLP model were decided based on the partial autocorrelation function and the spectral plot of the monthly/seasonal DMI series. Based on these two plots, the combinations of past values listed in Table 1 were tried with several combinations of hidden neurons (Table 2). The number of target neurons was always one, which means that a separate model was developed for each month's/season's DMI prediction.
Table 1. Past values of inputs used for monthly/seasonal DMI predictions

Input combination | Monthly case | Seasonal case
1 | 1, 2, 3 | 1, 2
2 | 1, 2, 3, 4 | 1, 2, 3, 4
3 | 1, 2, 3, 4, 5 | 1 to 6
4 | 1 to 6 | 1 to 12
5 | 1 to 12 | 1, 2, 4, 6
6 | 1 to 24 | 1, 2, 4, 6, 8
7 | 1, 2, 3, 12, 18 | 1, 2, 3, 4, 8, 12
8 | 1, 2, 3, 12, 18, 24 | 1, 2, 3, 4, 8, 12, 16
9 | 1, 2, 3, 12, 18, 24, 30 | 1 to 6, 12, 18
10 | 1 to 6, 12, 18 | 1 to 6, 12, 18, 24
11 | 1 to 6, 12, 18, 24 | 1 to 12, 24
12 | 1 to 6, 12, 18, 24, 30 | 1 to 12, 24, 36
13 | 1 to 12, 24, 36 | 1, 2, 3, 4, 11, 15, 20, 21, 22, 23
14 | 1 to 12, 24, 36, 48, 60 | -
15 | 1 to 12, 24, 36, 48, 60, 72 | -
16 | 1 to 24, 48 | -
17 | 1 to 24, 48, 72 | -
Table 2. Number of hidden neurons combinations

Sr. No. | Number of hidden neurons
1 | 2
2 | (Number of inputs + 1)^0.5
3 | 0.75 * (Number of inputs + 1)
4 | 2 * (Number of inputs) + 1
To summarize the methodology in a nutshell, refer to Figure 1. In the monthly case, a total of 17 input combinations (Table 1) x 4 hidden-neuron combinations (Table 2) = 68 combinations were tried for each month and lead time, and the best model was retained. Similarly, in the seasonal case a total of 52 combinations were tried for each season and lead time, and the best model was retained in each case. In a further scenario of an ENSO-forced approach, extra inputs from past ENSO values were added to check their effect on DMI prediction, as ENSO is assumed to drive the DMI [18]. The past six monthly ENSO index values were considered in monthly DMI prediction, whereas in the seasonal case the past three seasonal ENSO index values were given as extra inputs along with the respective past values of the DMI series. The results from the best model in each case are presented in the following section.
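A minimal sketch of this grid search over the input combinations of Table 1 and the hidden-neuron heuristics of Table 2 is given below, using scikit-learn's MLPRegressor as a stand-in for the three-layer MLP; the training settings (solver defaults, iteration count, random seed) are assumptions, since the chapter does not specify them.

import numpy as np
from sklearn.neural_network import MLPRegressor

def make_dataset(series, lags, lead):
    # Build (X, y) pairs: X holds the chosen past values (lags, in steps back),
    # y is the value `lead` steps ahead of the most recent allowed input.
    X, y = [], []
    start = max(lags)
    for t in range(start, len(series) - lead):
        X.append([series[t - lag] for lag in lags])
        y.append(series[t + lead - 1])
    return np.array(X), np.array(y)

def hidden_sizes(n_inputs):
    # the four heuristics of Table 2
    return sorted({2,
                   int(round((n_inputs + 1) ** 0.5)),
                   int(round(0.75 * (n_inputs + 1))),
                   2 * n_inputs + 1})

def grid_search(series, lag_sets, lead, n_train, n_val):
    # chronological split: first n_train samples for training, next n_val for validation
    best = None
    for lags in lag_sets:
        X, y = make_dataset(series, lags, lead)
        X_tr, y_tr = X[:n_train], y[:n_train]
        X_va, y_va = X[n_train:n_train + n_val], y[n_train:n_train + n_val]
        for h in hidden_sizes(len(lags)):
            model = MLPRegressor(hidden_layer_sizes=(h,), max_iter=2000, random_state=0)
            model.fit(X_tr, y_tr)
            err = np.sqrt(np.mean((model.predict(X_va) - y_va) ** 2))
            if best is None or err < best[0]:
                best = (err, lags, h, model)
    return best

For the monthly case, lag_sets would hold the 17 combinations of Table 1 (for example [1, 2, 3] or list(range(1, 25)) + [48, 72]), and lead would range from 1 to 6 months.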
Figure 1. Brief methodology of the time series forecasting approach for monthly/seasonal DMI prediction.

4. RESULTS
DMI prediction was experimented with under three different scenarios: the first is the persistence model, the second is prediction with past values of the DMI only, and the third adds exogenous inputs from the past ENSO index along with the past DMI
values. Past values of the DMI and ENSO were taken from the HadISST v1 dataset, which is available for a long period of more than 140 years. In all scenarios, the models were trained on 70% of the data (111 yrs), cross-validated on 5% of the data (7.5 yrs) and tested on 20% of the data (30 yrs), except for the persistence model, which does not need any training or cross-validation. In each scenario the numbers of past values used as inputs were the same as those listed in Table 1. The models were evaluated with the rmse and r performance metrics (Eqs. 2 and 3). The detailed results for monthly and seasonal DMI predictions are explained in the following sections.
rmse = \sqrt{ \dfrac{ \sum_{i=1}^{n} \left( DMI_{pred(i)} - DMI_{obs(i)} \right)^{2} }{ n } }   (2)

r = \dfrac{ \sum_{i=1}^{n} \left( DMI_{obs(i)} - \overline{DMI_{obs}} \right) \left( DMI_{pred(i)} - \overline{DMI_{pred}} \right) }{ \sqrt{ \sum_{i=1}^{n} \left( DMI_{obs(i)} - \overline{DMI_{obs}} \right)^{2} \times \sum_{i=1}^{n} \left( DMI_{pred(i)} - \overline{DMI_{pred}} \right)^{2} } }   (3)
4.1. Monthly DMI Prediction Monthly DMI models are evaluated for a lead time of 1 to 6 months for each scenario rmse was found to be increased up to 4 months lead time and then have a constant trend, whereas r was consistently decreasing. The highest rmse was observed with persistent models followed by ENSO forced scenario and least with past DMI values as inputs. Variation of rmse and r against lead time is as depicted in Figure 2 shows error less than the value of 0.3°C up to a lead time of six months.
82
Kalpesh R. Patil and Masaaki Iiyama
Figure 2. rmse and r variation against lead time of six months for various DMI prediction models.
Figure 3. Time series plot (left) and scatter (right) of two months lead predicted DMI with observed DMI.
On the other hand, the rapid decreasing of r values suggests poor performance after three months lead time. A sample two months lead prediction of monthly DMI against its observation is shown in Figure 3. Two months lead DMI predictions are seen to follow trends in observations except missing for some higher values. Performance metrics shown in Figure 3 were collectively for all months together. But it is also important to understand the performance separately for each month. Table 3 shows performance metrics separately evaluated for various months with past DMI values as inputs scenario. It is interesting to note that a few of the months have more prediction accuracy than collective accuracy for all months. For example, overall r value for all months together is 0.62 which for Jun., Jul., Aug., Sep., Oct. and Nov. months is more than 0.82.
Lead time (months)
Table 3. Performance measure for monthly DMI predictions separately evaluated for various months against the lead time of one to six month r 1 2 3 4 5 6 rmse 1 2 3 4 5 6
Jan 0.58 0.47 0.57 0.39 0.43 0.07 Jan 0.17 0.20 0.17 0.18 0.17 0.19
Feb 0.41 0.39 0.41 0.33 0.25 0.32 Feb 0.19 0.19 0.22 0.19 0.19 0.18
Mar 0.66 0.39 -0.06 -0.01 -0.06 0.17 Mar 0.17 0.21 0.29 0.24 0.24 0.22
Apr 0.57 0.23 0.20 -0.15 0.09 -0.05 Apr 0.20 0.24 0.25 0.26 0.24 0.25
May 0.76 0.30 0.22 0.15 -0.29 0.13 May 0.18 0.27 0.28 0.29 0.30 0.28
Jun 0.82 0.61 0.44 0.10 0.24 0.02 Jun 0.16 0.23 0.26 0.29 0.28 0.28
Jul 0.69 0.47 0.42 -0.02 0.06 0.07 Jul 0.23 0.28 0.29 0.32 0.32 0.31
Aug 0.91 0.71 0.41 0.37 0.53 0.20 Aug 0.18 0.27 0.34 0.36 0.34 0.36
Sep 0.86 0.78 0.59 0.35 0.24 0.07 Sep 0.18 0.24 0.32 0.35 0.35 0.36
Oct 0.93 0.79 0.61 0.43 0.09 0.40 Oct 0.20 0.29 0.38 0.41 0.43 0.41
Nov 0.90 0.87 0.69 0.52 0.21 0.26 Nov 0.16 0.21 0.30 0.36 0.37 0.36
Dec 0.82 0.75 0.82 0.56 0.63 0.24 Dec 0.16 0.18 0.16 0.23 0.24 0.24
84
Kalpesh R. Patil and Masaaki Iiyama
And the fact is IOD is defined based on SST for Sep., Oct. and Nov. months only [1]. The prediction accuracy of Sep., Oct. and Nov. months is higher than overall accuracy and other months even at three-month lead (r > 0.6, rmse < 0.35°C).
4.2. Seasonal DMI Prediction The original monthly DMI was converted to seasonal DMI by averaging the DMI for months in four seasons MAM, JJA, SON, DJF consecutively. Therefore, each year instead of twelve months, four seasonally averaged values were left and no month from other seasons was involved in the calculation of average for the corresponding season. The original monthly DMI was shortened by four times after conversion to seasonal DMI. The importance of predicting the average seasonal DMI is from the aspect of estimating average rainfall in the Indian subcontinent in monsoon season [25] and its other local impact on local climate such as floods/droughts in Australia, Africa, and Indonesia [5-7]. As observed in Figure 4 rmse for seasonal DMI prediction was found to be lower than 0.28°C at three seasons lead. But r value was more rapidly decreased (r < 0.2, at three seasons lead) compare to monthly models.
Figure 4. rmse and r variation against the lead time of three seasons for various seasonal DMI prediction models.
Dipole Mode Index Prediction with Artificial Neural Networks
85
Figure 5 shows one season lead prediction of seasonal DMI against averaged seasonal DMI. As SON season is more important than other seasons its accuracy is separately evaluated and compared to overall accuracy. As seen in Figure 5 and Table 4 even though all seasons collective accuracy is low, SON has high predictability in seasonal models like in the monthly case. SON season has shown rmse of 0.32°C and r > 0.35 at three seasons lead which is effectively can be called 6 to 9 months lead time.
Figure 5. Time series plot (left) and scatter (right) of one season lead predicted seasonally averaged DMI with observed seasonally averaged DMI.
Lead time (seasons)
Table 4. Performance measure for seasonal DMI predictions separately evaluated for various seasons against the lead time of one to three seasons r
MAM
JJA
SON
DJF
1 2 3
0.30 0.10 -0.19
0.56 0.27 0.44
0.81 0.41 0.36
0.45 0.22 0.08
rmse
MAM
JJA
SON
DJF
1 2
0.25 0.26
0.28 0.32
0.23 0.31
0.19 0.19
3
0.27
0.32
0.32
0.19
86
Kalpesh R. Patil and Masaaki Iiyama
Such prediction accuracy is very high when compared to the monthly six lead time accuracy of Sep., Oct., and Nov. individually.
4.3. Important IOD Events IOD is defined as a difference in SST anomalies in western IO and eastern IO. It peaks in the SON season and therefore SON season IOD index has separate importance concerning its effects on local climate [2]. IOD strength is completely defined by the magnitude of DMI. Stronger events have more than 0.5 magnitude whereas dependant on the sign of DMI it is termed as positive and negative. To evaluate how monthly and seasonal models were effective in capturing the important IOD events, the model accuracy was evaluated for the SON season during eight IOD events against their averaged values. In the case of monthly models, the average value of prediction was calculated by considering the one month lead predicted value of Sep. likewise, two months lead predicted value for Oct. and three months lead predicted value for Nov.
Figure 6. DMI observed vs prediction with monthly and seasonal models for various important events. For seasonal models averaged SON anomaly was compared with the corresponding averaged forecast. In case of monthly models forecasted averaged of SON was calculated from one month ahead Sep., two months ahead Oct. and three months ahead Nov. forecasts.
Dipole Mode Index Prediction with Artificial Neural Networks
87
Whereas, in seasonal models, it was one season lead predicted value for SON season. Figure 6 shows the comparison of DMI predicted values against SON averaged values from past input values of the DMI scenario. Both monthly and seasonal models show good accuracy for moderate IOD events (2005, 2015 and 2016). Also, severe positive events of 1994 were accurately picked by both types of models. Whereas extreme 1997 positive IOD event was 50% underestimated and the latest 2018 positive was completely missed.
5. DISCUSSION Prediction of DMI with its past values as inputs were attempted from the monthly and seasonal models. Also, in the third scenario past values of the ENSO index were included as exogenous inputs. The results of the prediction were evaluated against persistence models for several lead-times. In each scenario, the persistent models found to be underperforming. Seasonal models were found to be more accurate than monthly models. Especially for important months when DMI peaks. Adding exogenous inputs from past values of the ENSO index does not improve the results. Which suggests that DMI is an independently occurring phenomenon. Despite good prediction results, few of the important IOD events were underpredicted by seasonal models. The reason for this is the models presented in this study are univariate time series models. Which means driven by only its past values. Adding more inputs from other climatological factors may have a strong influence on the prediction results. Also, the use of deep learning will further improvise the results. But the scope of this study was to assess the predictability skill of DMI from its past
88
Kalpesh R. Patil and Masaaki Iiyama
values and change in prediction performance with additional inputs from the ENSO index.
ACKNOWLEDGMENT I would like to acknowledge Prof. Masaki Iiyama for his support and guidance to complete this study. Also, the acknowledge greatly extends to the team of Met Office Hadley Centre due to whom the SST data was made available for this study.
REFERENCES [1]
[2]
[3]
[4]
[5]
[6]
Saji, N. H., B. N. Goswami, P. N. Vinayachandran and T. Yamagata. “A dipole mode in the tropical Indian Ocean”. Nature, 401, no. 6751 (1999): 360. Webster, Peter J., Andrew M. Moore, Johannes P. Loschnigg and Robert R. Leben. “Coupled ocean–atmosphere dynamics in the Indian Ocean during 1997–98”. Nature, 401, no. 6751 (1999): 356. Clark, Christina Oelfke, Julia E. Cole and Peter J. Webster. “Indian Ocean SST and Indian summer rainfall: Predictive relationships and their decadal variability”. Journal of Climate, 13, no. 14 (2000): 2503 - 2519. Yang, Jianling, Qinyu Liu, Shang‐Ping Xie, Zhengyu Liu and Lixin Wu. “Impact of the Indian Ocean SST basin mode on the Asian summer monsoon”. Geophysical Research Letters, 34, no. 2 (2007). Cai, W., T. Cowan and A. Sullivan. “Recent unprecedented skewness towards positive Indian Ocean Dipole occurrences and its impact on Australian rainfall”. Geophysical Research Letters, 36, no. 11 (2009). Qiu, Yun, Wenju Cai, Xiaogang Guo, and Benjamin Ng. “The asymmetric influence of the positive and negative IOD events on China's rainfall”. Scientific reports, 4 (2014): 4943.
Dipole Mode Index Prediction with Artificial Neural Networks [7]
[8]
[9]
[10] [11]
[12]
[13]
[14]
[15]
89
Ummenhofer, Caroline C., Matthew H. England, Peter C. McIntosh, Gary A. Meyers, Michael J. Pook, James S. Risbey, Alexander Sen Gupta and Andréa S. Taschetto. “What causes southeast Australia's worst droughts?.” Geophysical Research Letters, 36, no. 4 (2009). Kripalani, R. H., J. H. Oh and H. S. Chaudhari. “Delayed influence of the Indian Ocean Dipole mode on the East Asia–West Pacific monsoon: possible mechanism”. International Journal of Climatology: A Journal of the Royal Meteorological Society, 30, no. 2 (2010): 197 - 209. Hashizume, Masahiro, Luis Fernando Chaves and Noboru Minakawa. “Indian Ocean Dipole drives malaria resurgence in East African highlands”. Scientific reports, 2 (2012): 269. Bryan, Kirk and Michael D. Cox. “A numerical investigation of the oceanic general circulation”. Tellus, 19, no. 1 (1967): 54 - 80. Bryan, Kirk. “A numerical method for the study of the circulation of the world ocean”. Journal of computational physics, 4, no. 3 (1969): 347 - 376. Sivareddy, Sanikommu, Muthalagu Ravichandran, Madathil Sivasankaran Girishkumar and Koneru Venkata Siva Rama Prasad. “Assessing the impact of various wind forcing on INCOIS-GODAS simulated ocean currents in the equatorial Indian Ocean”. Ocean Dynamics, 65, no. 9 - 10 (2015): 1235 - 1247. Ravichandran, M., D. Behringer, S. Sivareddy, M. S. Girishkumar, Neethu Chacko and R. Harikumar. “Evaluation of the global ocean data assimilation system at INCOIS: the tropical Indian Ocean”. Ocean Modelling, 69 (2013): 123 - 135. Saha, Suranjana, Shrinivas Moorthi, Xingren Wu, Jiande Wang, Sudhir Nadiga, Patrick Tripp, David Behringer et al. “The NCEP climate forecast system version 2”. Journal of Climate, 27, no. 6 (2014): 2185 - 2208. Luo, Jing-Jia, Sebastien Masson, Swadhin Behera and Toshio Yamagata. “Experimental forecasts of the Indian Ocean dipole using a coupled OAGCM”. Journal of climate, 20, no. 10 (2007): 2178 2190.
90
Kalpesh R. Patil and Masaaki Iiyama
[16] Wang, Guomin, Debra Hudson, Yonghong Yin, Oscar Alves, Harry Hendon, Sally Langford, Guo Liu, and Faina Tseitkin. “POAMA-2 SST skill assessment and beyond”. CAWCR Research Letters, 6 (2011): 40 - 46. [17] Baquero-Bernal, Astrid, Mojib Latif and Stephanie Legutke. “On dipolelike variability of sea surface temperature in the tropical Indian Ocean”. Journal of Climate, 15, no. 11 (2002): 1358 - 1368. [18] Yu, Jin-Yi and K. M. Lau. “Contrasting Indian Ocean SST variability with and without ENSO influence: A coupled atmosphere-ocean GCM study”. Meteorology and Atmospheric Physics, 90, no. 3 - 4 (2005): 179 - 191. [19] Rayner, N. A. A., De E. Parker, E. B. Horton, Chris K. Folland, Lisa V. Alexander, D. P. Rowell, E. C. Kent and A. Kaplan. “Global analyses of sea surface temperature, sea ice, and night marine air temperature since the late nineteenth century”. Journal of Geophysical Research: Atmospheres, 108, no. D14 (2003). [20] Gardner, Matt W. and S. R. Dorling. “Artificial neural networks (the multilayer perceptron)—a review of applications in the atmospheric sciences”. Atmospheric environment, 32, no. 14 - 15 (1998): 2627 2636. [21] Ramos, Elsy Gómez and Francisco Venegas Martínez. “A Review of Artificial Neural Networks: How Well Do They Perform in Forecasting Time Series?.” Analítika: revista de análisis estadístico, 6 (2013): 7 - 18. [22] Murtagh, Fionn. “Multilayer perceptrons for classification and regression”. Neurocomputing, 2, no. 5 - 6 (1991): 183 - 197. [23] Koskela, Timo, Mikko Lehtokangas, Jukka Saarinen and Kimmo Kaski. “Time series prediction with multilayer perceptron, FIR and Elman neural networks”. In Proceedings of the World Congress on Neural Networks, pp. 491 - 496. INNS Press San Diego, USA, 1996. [24] Noriega, Leonardo. “Multilayer perceptron tutorial”. School of Computing. Staffordshire University (2005).
Dipole Mode Index Prediction with Artificial Neural Networks
91
[25] Sahai, A. K., A. M. Grimm, V. Satyan and G. B. Pant. “Long-lead prediction of Indian summer monsoon rainfall from global SST evolution”. Climate Dynamics, 20, no. 7 - 8 (2003): 855 - 863.
BIOGRAPHICAL SKETCHES Kalpesh R. Patil, PhD Affiliation: Post-doctorate Research Scholar, Academic Center for Computing and Media Studies, Kyoto University, Japan Education: PhD, Indian Institute of Technology, Bombay, India Business Address: Academic Center for Computing and Media Studies (South Bldg.), Kyoto University, Yoshidahonmachi, Sakyo Ward, Kyoto 606-8315. Research and Professional Experience: Apr. 2019 - till date: Post-doctorate Research Scholar, Academic Center for Computing and Media Studies, Kyoto University. Jan. 2014 - Feb 2018: Research Associate, Indian Institute of Technology, Bombay, India. Aug. 2013 - Dec 2013: Project Research Associate, Indian Institute of Technology, Bombay, India. Professional Appointments: Honors: Selected as a session chair, Oceans’18 MTS Kobe, IEEE conference. Invited as Guest faculty for delivering lecture on Machine learning at GreyAtom School of Data Science, Mumbai, Maharashtra, India. Active reviewer for ISH Journal of Hydraulic Engineering and Geoscience and Remote Sensing Letters.
Publications from the Last 3 Years:

Patil, K. and Iiyama, M. (2019). Residual prediction to improve the meteorological based sea surface temperature forecasts using ANN. AGU Ocean Sciences Meeting 2020, San Diego, California. (Abstract accepted).
Patil, K. and Deo, M. C. (2018). Basin-Scale Prediction of Sea Surface Temperature with Artificial Neural Networks. Journal of Atmospheric and Oceanic Technology, 35(7), 1441 - 1455. doi: https://doi.org/10.1175/JTECH-D-17-0217.1.
Patil, K. and Deo, M. C. (2018). Basin-scale Prediction of Sea Surface Temperature with Artificial Neural Networks. OCEANS'18 MTS/IEEE Kobe/Techno-Ocean, 2018, Kobe, Japan. doi: 10.1109/OCEANSKOBE.2018.8558780.
Patil, K. and Deo, M. C. (2017). Prediction of daily sea surface temperature using efficient neural networks. Ocean Dynamics, 67(3 - 4), 357 - 368. doi: https://doi.org/10.1007/s10236-017-1032-9.
Patil, K. and Deo, M. C. (2017). Real Time Prediction of Sea Surface Temperature with Soft and Hard Computing. 22nd International Conference on Hydraulics, Water Resources, Coastal and Environmental Engineering (HYDRO 2017), 1526 - 1536.
Patil, K., Deo, M. C. and Ravichandran, M. (2016). Prediction of sea surface temperature by combining numerical and neural techniques. Journal of Atmospheric and Oceanic Technology, 33(8), 1715 - 1726. doi: https://doi.org/10.1175/JTECH-D-15-0213.1.
Masaaki Iiyama, PhD

Affiliation: Associate Professor, Academic Center for Computing and Media Studies, Kyoto University, Japan

Education: PhD in Informatics, Kyoto University.
Business Address: Academic Center for Computing and Media Studies (South Bldg.), Kyoto University, Yoshidahonmachi, Sakyo Ward, Kyoto 606-8315.

Research and Professional Experience:
2015 - to date: Associate Professor, Academic Center for Computing and Media Studies, Kyoto University
2010 - 2015: Associate Professor, Graduate School of Economics, Kyoto University
2006 - 2009: Assistant Professor, Graduate School of Economics, Kyoto University
2003 - 2006: Research Associate, Academic Center for Computing and Media Studies, Kyoto University

Professional Appointments:

Honors: Invited talk on 'AI meets Fishery - Pattern Recognition Approach for Fishery Applications' at University of Science and Technology Beijing, China, 2017-11-13.

Publications from the Last 3 Years:

2019
Daisuke Matsuoka, Shiori Sugimoto, Yujin Nakagawa, Shintaro Kawahara, Fumiaki Araki, Yosuke Onoue, Masaaki Iiyama, Koji Koyamada. Automatic Detection of Stationary Fronts around Japan using a Deep Convolutional Neural Network. Scientific Online Letters on the Atmosphere: SOLA, Vol. 15, pp. 154 - 159, 2019-07. DOI: 10.2151/sola.2019-028.
Koki Sakata, Koh Kakusho, Satoshi Nishiguchi, Masaaki Iiyama. Observation Planning for Identifying Each Person by a Drone in an Indoor Daily Living Environment. HCI International, 2019. (Accepted, to appear).
Nobuyuki Hirahara, Motoharu Sonogashira, Hidekazu Kasahara and Masaaki Iiyama. Denoising and Inpainting of Sea Surface Temperature Image with Adversarial Physical Model Loss. Asian Conference on Pattern Recognition (ACPR2019), 2019-11.
Takumi Shimura, Motoharu Sonogashira, Hidekazu Kasahara, Masaaki Iiyama. Fishing spot detection using sea water temperature pattern by nonlinear clustering. Oceans, 2019, 2019-06. (Accepted).
Takashi Watabe, Hidekazu Kasahara, Masaaki Iiyama. Tourist Transition Model among Sightseeing Spot based on Trajectory Data. ENTER, 2019, Vol. 16, No. 2/3, pp. 115 - 126, 2019-02. https://journals.tdl.org/ertr/index.php/ertr/article/download/324/97.
Yu-Jung Wang, Motoharu Sonogashira, Atsushi Hashimoto and Masaaki Iiyama. Two-stage Fully Convolutional Networks for Stroke Recovery of Handwritten Chinese Character. Asian Conference on Pattern Recognition (ACPR2019), 2019-11.
Yuki Fujimura, Masaaki Iiyama, Atsushi Hashimoto, Michihiko Minoh. Photometric Stereo in Participating Media Using an Analytical Solution for Shape-Dependent Forward Scatter. IEEE Transactions on Pattern Analysis and Machine Intelligence. (To appear). DOI: 10.1109/TPAMI.2018.2889088.
Yuki Kotakehara, Koh Kakusho, Satoshi Nishiguchi, Masaaki Iiyama, Masayuki Murakami. The Classification of Different Situations in a Lecture Based on Students' Observed Postures. HCI International, 2019. (Accepted, to appear).

2018
Masaaki Iiyama, Kei Zhao, Atsushi Hashimoto, Hidekazu Kasahara, Michihiko Minoh. Fishing Spot Estimation by Sea Temperature Pattern Learning. Oceans, 2018, 2018-05.
Motoharu Sonogashira, Masaaki Iiyama, Michihiko Minoh. Variational-Bayesian Single-Image Devignetting. IEICE Transactions on Information and Systems, Vol. E101-D, No. 9, pp. 2368 - 2380, 2018-09. DOI: 10.1587/transinf.2017EDP7393.
Satoki Shibata, Masaaki Iiyama, Atsushi Hashimoto, Michihiko Minoh. Restoration of Sea Surface Temperature Satellite Images Using a Partially Occluded Training Set. International Conference on Pattern Recognition (ICPR2018), 2018-08.
Takumi Fujino, Atsushi Hashimoto, Hidekazu Kasahara, Mikihiko Mori, Masaaki Iiyama, Michihiko Minoh. Detecting Deviations from Intended Routes Using Vehicular GPS Tracks. ACM Trans. on Spatial Algorithms and Systems (TSAS), Vol. 4, No. 1, 2018-06. DOI: 10.1145/3204455.
Yuki Fujimura, Masaaki Iiyama, Atsushi Hashimoto, Michihiko Minoh. Photometric Stereo in Participating Media Considering Shape-Dependent Forward Scatter. IEEE Conference on Computer Vision and Pattern Recognition (CVPR, oral), 2018-06. http://openaccess.thecvf.com/content_cvpr_2018/papers/Fujimura_Photometric_Stereo_in_CVPR_2018_paper.pdf.

2017
Atsushi Hashimoto, Takumi Fujino, Jun Harashima, Masaaki Iiyama, Michihiko Minoh. Learning Food Appearance by a Supervision with Recipe Text. 9th Workshop on Multimedia for Cooking and Eating Activities (CEA2017), pp. 39 - 41, 2017-08. DOI: 10.1145/3106668.3106675.
Hidekazu Kasahara, Masaaki Iiyama, Michihiko Minoh. How to design smart tourism destination: From viewpoint of data. EU-Japan Workshop on Big Data for Sustainability and Tourism, 2017-03.
Hidekazu Kasahara, Masaaki Iiyama, Michihiko Minoh. Tourism Service Portfolio for Smart Destination. ENTER, 2017, 2017-01.
Hidekazu Kasahara, Masaaki Iiyama, Michihiko Minoh. Transportation Mode Inference Using Environmental Constraints. IMCOM 2017, 2017-01.
Masaaki Iiyama. AI meets Fishery - Pattern Recognition Approach for Fishery Applications. Invited Talk, University of Science and Technology, Beijing, China, 2017-11.
Motoharu Sonogashira, Takuya Funatomi, Masaaki Iiyama, Michihiko Minoh. A Variational Bayesian Approach to Multiframe Image Restoration. IEEE Transactions on Image Processing, Vol. 25, No. 5, pp. 2163 - 2178, 2017-05. DOI: 10.1109/TIP.2017.2678171.
Motoharu Sonogashira, Masaaki Iiyama, Michihiko Minoh. Shift-Variant Blind Deconvolution Using a Field of Kernels. IEICE Transactions on Information and Systems, Vol. E100-D, No. 9, pp. 1971 - 1983, 2017-09. DOI: 10.1587/transinf.2016PCP0013.
Omi, T., K. Kakusho, M. Iiyama, S. Nishiguchi. Segmentation and Tracking of Object when Grasped and Moved within Living Spaces. Proceedings of IEEE International Conference on Systems, Man, and Cybernetics, pp. 3147 - 3152, 2017-10. DOI: 10.1109/SMC.2017.8123111.
Satoki Shibata, Masaaki Iiyama, Atsushi Hashimoto, Michihiko Minoh. Restoration of Sea Surface Temperature Images by Learning-based and Optical-flow-based Inpainting. IEEE International Conference on Multimedia and Expo (ICME) 2017, pp. 193 - 198, 2017-07. DOI: 10.1109/ICME.2017.8019401.
Takuya Funatomi, Masaaki Iiyama, Koh Kakusho, Michihiko Minoh. Regression of 3D Rigid Transformations on Real-Valued Vectors in Closed Form. IEEE International Conference on Robotics and Automation (ICRA) 2017, pp. 6412 - 6419, 2017-06. DOI: 10.1109/ICRA.2017.7989757.
Tsukamoto, M., K. Kakusho, M. Iiyama, S. Nishiguchi. Estimating the Target of Interaction for Each Human in Office Space with Obstacles Using 3D Observation. Proceedings of IEEE International Conference on Systems, Man, and Cybernetics, pp. 3129 - 3134, 2017-10. DOI: 10.1109/SMC.2017.8123108.
In: Neural Networks Editor: Doug Alexander
ISBN: 978-1-53617-188-4 © 2020 Nova Science Publishers, Inc.
Chapter 4
EFFICACY OF ARTIFICIAL NEURAL NETWORKS IN DIFFERENTIATING BREAST CANCERS IN DIGITAL MAMMOGRAPHY

Sundaran Kada1, PhD and Fuk-hay Tang2,*, PhD

1 Faculty of Health and Social Sciences, Western Norway University of Applied Sciences, Bergen, Norway
2 School of Medical and Health Sciences, Tung Wah College, King's Park, Hong Kong

* Corresponding Author's Email: [email protected].
ABSTRACT

The most common invasive breast cancers are invasive ductal carcinoma (IDC) and invasive lobular carcinoma (ILC): about eight of ten invasive breast cancers are IDC and about one in ten is an ILC. Among carcinomas in situ, ductal carcinoma in situ (DCIS) is the most common non-invasive type, accounting for 80-90% of in situ carcinomas. Mammography has a high false negative and false positive rate, and computer aided diagnosis (CAD) systems have been commercialized to help in micro-calcification detection and malignancy differentiation. Yet, little has been explored in differentiating breast cancers with artificial neural networks (ANNs), one example of CAD systems. The aim of this chapter is to describe how well ANNs differentiate the three most prevalent breast cancer types, namely ductal carcinoma in situ (DCIS), invasive ductal carcinoma (IDC) and invasive lobular carcinoma (ILC), from normal breasts. We conducted a study in which 160 digital mammograms were collected: 40 images each of IDC, ILC and DCIS plus 40 control images. All cancers were screened by the mammography unit and confirmed histologically by biopsy between November 2012 and November 2015. Mammograms were analysed with a CAD system and an Image Feature Assessment Program; the CAD system determines the possible regions of interest, which are then used for feature extraction. Our results indicated that the accuracy for detection of IDC against normal was 97.5% (N = 40, normal = 20, abnormal = 20), for ILC against normal 97.5% (N = 40, normal = 20, abnormal = 20), and for DCIS against normal 76.9% (N = 39, normal = 20, abnormal = 19). One DCIS image was omitted because its quality did not meet our program criteria. Our study indicated that using image features in conjunction with artificial neural networks offers a promising method for differentiating invasive breast cancers from normal breasts.
Keywords: breast cancer, digital mammography, artificial neural networks
1. INTRODUCTION

Breast cancer is the most frequently diagnosed of all cancers, and it is the leading cause of female cancer deaths worldwide, with 1.7 million breast cancer cases and 521,900 deaths in 2012 (Torre et al., 2015). In Norway, breast cancer is the leading cancer type diagnosed in women, and it represents the second most frequent cause of cancer death (Cancer Registry of Norway, 2015). There were 3,324 new cases diagnosed in 2014, of which 34% were in women between 25 and 50 years old (Cancer Registry of Norway, 2015). Breast cancer can be classified into invasive and noninvasive cancer. The most common invasive breast cancer types are invasive ductal carcinoma (IDC) and invasive lobular carcinoma (ILC), and these make up 80% and 0.7%-20% of all invasive carcinomas, respectively. In addition,
ductal carcinoma in situ (DCIS) is the most prevalent, making up 80%-90% of all in situ carcinomas (American Cancer Society, 2017; Cancer Registry of Norway 2015). The cumulative risk of developing breast cancer by the age of 75 years old is approximately one in twelve among Norwegian women (Cancer Registry of Norway, 2015). Although there is no foolproof method to prevent breast cancer, evidence has shown that early diagnosis and treatment can significantly increase a patient’s chance of survival. For example, the 5-year survival rate increases from 27% to 98% when breast cancer is detected in the early stage (American Cancer Society, 2011-2012). The early detection of breast cancer reduces the mortality rate by 40% or more (Jemal, 2011). Routine screening has been shown to reduce breast cancer-related deaths by 40% (Research council of Norway report, 2015). Although a mammogram is an advanced and effective modality used for screening breast cancer, limitations still exist. The necessity of analysing a high number of images to detect a small number of positive cases, the complex radiographic structure of the breast, positioning errors, an inadequate mammogram technique and subtle malignancy characteristics in association with a radiologist’s fatigue or distraction can contribute to the false-negative interpretation of a mammogram image (Majid et al., 2003; Paquerault et al., 2009). In addition, the absence of previous imaging studies for comparison may also lead to misinterpretation (Majid et al., 2003; Paquerault et al., 2009). The mammographic sensitivity for detecting breast cancer ranges from 70%-90%, meaning that 10%-30% of all breast cancers are missed during a mammographic screening (Brem et al., 2003; Calas et al., 2012). The diagnostic accuracy of mammography increases when two radiologists examine the same mammogram or the same radiologist reads the same mammogram more than once (Giger, 2002). Double viewing reduces the number of false negatives by 5% to 15% (Karssemeijer et al., 2003; Sohns et al., 2010), and it significantly improves the cancer detection rate (Taylor & Potts, 2008; Duijm et al., 2009). However, one previous meta-analysis demonstrated that double reading increased the operational costs, and that substituting a second reader using computer-aided detection (CAD) was a more cost-effective strategy (Posso et al., 2017).
It has been proven that CAD can effectively reduce the number of false positive mammograms (Freer & Ulissey, 2001), improve the diagnostic performance, and reduce the interpretative variability of the radiologists (Calas et al., 2012). Moreover, previous clinical studies have demonstrated that CAD increases a radiologist's breast cancer detection sensitivity by up to 20%-21% (Brem et al., 2003; Romero et al., 2011; Baker et al., 2004; Wei et al., 2005; Melton et al., 2005; Morton et al., 2006). Artificial neural networks (ANNs), one class of classifier used in CAD systems, have been used commonly due to their effectiveness in medical decision-making (Lisboa, 2002). An ANN is capable of learning complicated patterns from data that are difficult for humans to identify (Dayhoff & DeLeo, 2001), and it is a good classifier of mass calcification and microcalcification in mammography (Mehdy et al., 2017). Different breast cancer types exhibit different mortality rates; therefore, the early differentiation of breast cancer is important. However, to the best of our knowledge, no previous studies have concentrated on differentiating between invasive and noninvasive breast cancer using ANNs. Thus, the aim of this study was to evaluate how well an ANN differentiates among DCIS, IDC and ILC. These three cancers were chosen due to their prevalence.
1.1. ANN Structure

An example of an ANN structure is shown in Figure 1. It generally consists of three layers: an input layer, a hidden layer and an output layer (Ayer et al., 2013). Each layer contains nodes, which are a set number of processing units. The parameters that could contribute to the outcome are fed into the input layer. In the hidden layer, signals are received from the previous layer, and after processing, they are relayed to the output layer. Since all the units are interconnected, each connection is weighted depending on its relative importance in generating the desired outcome.
Figure 1. Artificial neural network structure (input layer, N = 22; hidden layer, N = 20; output layer, N = 2).
Before training, each weighting is set randomly, and with each training process, the weighting is adjusted. Such a pass is described as forward propagation; learning algorithms such as the classical backpropagation method can then be used to adjust the weights. With this method, the error between the ANN outcome and the desired outcome is calculated with the current weight set for each training pass. The mean square error magnitudes are then propagated backward to adjust the weighting. The modelling aim is to generate the lowest mean square error with the training data, while utilizing the least number of nodes in both the input and hidden layers. This can avoid overtraining, which can lead to a falsely high accuracy.
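As a concrete illustration of the 22-20-2 architecture and the weight adjustment described above, the following is a minimal sketch using scikit-learn's MLPClassifier. The library choice, the random placeholder data and the training parameters are illustrative assumptions, not the study's actual implementation.

import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.random((40, 22))          # 40 mammograms x 22 texture features (placeholder data)
y = np.repeat([0, 1], 20)         # 0 = normal, 1 = abnormal

# One hidden layer of 20 nodes; weights start random and are adjusted by
# backpropagation of the error, as outlined in the text above.
net = MLPClassifier(hidden_layer_sizes=(20,), activation="logistic",
                    solver="sgd", learning_rate_init=0.01,
                    max_iter=2000, random_state=0)
net.fit(X, y)
print(net.predict(X[:5]))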
2. MATERIALS AND METHODS

The mammography images used in this paper were collected from the Haukeland University Hospital in Bergen, Norway. There were 40 digital mammography images of each cancer type (DCIS, IDC and ILC), composed of both craniocaudal and mediolateral oblique views from 20 patients. All
of the images were obtained between November 2012 and November 2015. The radiologists’ reports and biopsy results were also collected to serve as the standard with which to validate the ANN results.
2.1. Feature Extraction

The mammography image features were extracted using the Image Feature Assessment Program, in which the regions of interest (ROIs) were selected manually using the polygon selection method. The program was developed by the second author (HT). When using this method, the ROI was selected using the mouse cursor; the selected region was then converted to a binary image. The ROI was obtained by multiplying the selected binary image with the original image, and the selected regions were utilised for feature extraction. Twenty-two features, including the homogeneity, contrast, and sum variance, were numerically calculated (see Appendix A) and documented for each image.
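The Image Feature Assessment Program itself is not publicly described, so the following is only a sketch of the general idea: texture features computed from a grey-level co-occurrence matrix (GLCM) over a masked ROI. scikit-image (>= 0.19), the helper name roi_features and the reduced feature set are assumptions; the full 22-feature list used in the study is given in Appendix A.

import numpy as np
from skimage.feature import graycomatrix, graycoprops

def roi_features(image, mask):
    """image: 2-D uint8 mammogram; mask: boolean ROI from the polygon selection."""
    roi = np.where(mask, image, 0)                      # multiply binary ROI mask with the image
    glcm = graycomatrix(roi, distances=[1],
                        angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                        levels=256, symmetric=True, normed=True)
    names = ["contrast", "dissimilarity", "homogeneity", "energy", "correlation", "ASM"]
    # Average each property over the four orientations
    return {n: graycoprops(glcm, n).mean() for n in names}

example = (np.random.default_rng(1).random((64, 64)) * 255).astype(np.uint8)
mask = np.zeros((64, 64), dtype=bool)
mask[16:48, 16:48] = True
print(roi_features(example, mask))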
2.2. ANN Training and Testing

There are two types of learning methods: supervised and unsupervised. Supervised learning is a learning process that requires training ahead of time with large amounts of data, whereas the unsupervised learning process does not require a desired response (Zhengohao & Lifeng, 2010). The network is subjected to two processes, namely training and testing, in order to develop an ANN model (Saritas, 2011). We used the leave-one-out cross-validation (LOOCV) approach: a single observation from the dataset is used for validation, and the remaining observations serve as the training data. In this study, for each type of cancer there were 20 normal and 20 diseased images, making a total of 40. For LOOCV, we used 39 for training and 1 for testing, and the process was repeated 40 times. This method is particularly useful for the small samples typical of medical imaging (see Table 1). Using LOOCV, we usually obtain almost unbiased accuracy estimates.
Table 1. Number of cases used for leave-one-out cross-validation in the detection of breast tumors

Chan (1997): 41 malignant, 45 benign
Jiang (1999): 46 malignant, 58 benign
Kallergi (2004): 50 malignant, 50 benign
Huo (1998): 57 malignant, 38 benign
In our study, input Excel files (Microsoft, Redmond, WA, USA) were created, which contained only the mammograms to be trained and tested; the ANN could differentiate two categories each time. The twenty-two extracted features were documented in the Excel files, and each row corresponded to one mammogram. During each training and testing session with the different combinations, the mammograms were labelled with either a 0 or 1 to indicate the nature of the lesion with regard to the reports. The labelled data were fed into the ANN to generate the results, which were reported as the correct identification percentage. The ANN results were summed, and an average correct identification percentage was calculated.
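A sketch of the leave-one-out protocol follows. Synthetic placeholder features stand in for the Excel sheets described above; the library (scikit-learn) and network settings are assumptions made only for illustration.

import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.neural_network import MLPClassifier

# Placeholder for the 22 features per mammogram; in the study these rows
# came from the labelled Excel files (0 = normal, 1 = abnormal).
rng = np.random.default_rng(0)
X = rng.random((40, 22))
y = np.repeat([0, 1], 20)

correct = 0
for train_idx, test_idx in LeaveOneOut().split(X):   # 39 for training, 1 for testing, 40 rounds
    net = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=0)
    net.fit(X[train_idx], y[train_idx])
    correct += int(net.predict(X[test_idx])[0] == y[test_idx][0])

print("LOOCV accuracy:", correct / len(y))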
2.3. Ethical Consideration

All of the forms of data collection used in this project were approved by the Regional Ethical Committee for Medical Research in Norway. The head of the Diagnostic Imaging Department that ultimately participated in this study was initially approached regarding involvement.
2.4. Data Analysis

We performed a receiver operating characteristic (ROC) analysis, which was used to determine the accuracy of the medical diagnostic tests, and we obtained summary ROC curves for the three types of breast cancer. The area
under the ROC curve (Az) was used as a summary index of accuracy. The Az could have a value between 0.5, which represented no apparent accuracy, and 1.0, which represented perfect accuracy. When considering the successful data, the Az was expected to be greater (close to 1). A chi-squared test was used to calculate the p values, and a p value of less than 0.05 was considered to be statistically significant.
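The following sketch shows how Az and the chi-squared p value could be computed with scikit-learn and SciPy; the scores and the 2x2 contingency table are invented placeholders, not study data.

import numpy as np
from sklearn.metrics import roc_auc_score
from scipy.stats import chi2_contingency

y_true = np.repeat([0, 1], 20)                       # normal vs. cancer labels
y_score = np.clip(y_true + np.random.default_rng(2).normal(0, 0.3, 40), 0, 1)
print("Az (area under ROC curve):", roc_auc_score(y_true, y_score))

# Comparing correct/incorrect classifications of two cancer types (made-up counts)
table = np.array([[19, 1],       # type A: correct, incorrect
                  [15, 5]])      # type B: correct, incorrect
chi2, p, dof, expected = chi2_contingency(table)
print("p value:", p)             # p < 0.05 taken as statistically significant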
3. RESULTS

The accuracy for the detection of IDC against normal was 97.5% (sensitivity = 95%, specificity = 100%; N = 40, normal = 20, abnormal = 20); for ILC against normal it was 97.5% (sensitivity = 95%, specificity = 100%; N = 40, normal = 20, abnormal = 20); and for DCIS against normal it was 76.9% (sensitivity = 76%, specificity = 74%; N = 39, normal = 20, abnormal = 19). One DCIS image was omitted because the image quality did not meet our program criteria.

Table 2. The artificial neural network performances for the different breast cancer types

Area under the curve: Type I (IDC) 0.9918; Type II (ILC) 0.9922; Type III (DCIS) 0.9058
P value: Type I versus Type II 0.802; Type I versus Type III 0.154; Type II versus Type III 0.011
IDC: invasive ductal carcinoma, ILC: invasive lobular carcinoma, DCIS: ductal carcinoma in situ.
Based on the ROC analysis, our results indicated that the computer-aided detection method using the image feature approach achieved very reliable breast cancer detection rates for the ILC, IDC and DCIS, with areas under the curve (AUCs) of 0.991752, 0.9922 and 0.9058, respectively. The results are summarized in Table 2 and Figure 2. The fact that these areas were large (close to 1) indicated that the study was successful in
differentiating among the three cancer types. Further data analyses indicated that there were no differences when using the CAD method for differentiating between cancer types IDC and ILC (p > 0.05), but the CAD could differentiate between cancer types ILC and DCIS (p < 0.05).
Figure 2. Receiver operating characteristic curves of the different breast tumor types (Type 1: invasive ductal carcinoma, Type 2: invasive lobular carcinoma, Type 3: ductal carcinoma in situ); the horizontal axes show 1 - specificity (false positive rate).
4. DISCUSSION

Our study results indicated that the performance of the ANN was optimal, with high sensitivity and accuracy. The accuracy of the ANN ranged from 76.9% to 97.5% when differentiating among benign, DCIS, IDC and ILC mammograms. Overall, the accuracy was found to be satisfactory. However,
it is noted that DCIS is relatively difficult to differentiate when using an ANN. Among all of the combinations, differentiating the IDC and ILC mammograms from the normal mammograms exhibited similarly high accuracies (97.5%). One previous study, performed by Huo et al. (1998), investigated the use of an ANN for differentiating between benign and malignant breast masses while using automated classifiers, and they achieved an accuracy of 83% with an AUC of 0.94. We obtained values of 0.991752, 0.9922 and 0.9058 for the ILC, IDC and DCIS, respectively. The implementation of a perceptron-based three-layer neural network using a backpropagation algorithm was a pioneering step in ANN mammography (Mehdy et al., 2017). This suggests that an ANN with a feature approach exhibits better performance than an automated classifier method. Moreover, an ANN plays an important role in the detection of carcinogenic conditions in the breast (Mehdy et al., 2017). The deep learning model is becoming more important today. The backpropagation of an ANN was adopted in this study, and compared with other approaches, such as a general regression ANN, the training process was rather simple. Deep learning models, however, contain either many more hidden layers or neurons in different configurations (Manner A), which can complicate the process.
CONCLUSION

This study aimed to assess how well an ANN could differentiate among IDC, ILC and DCIS. To the best of our knowledge, this was the first study of its kind to be conducted in this field. The key findings of the present study demonstrated a high breast cancer type detection level (ILC, IDC and DCIS). However, further studies must be conducted to compare the accuracies of differentiating among these cancer types between radiologists and an ANN.
APPENDIX A

Table A. The 22 features

1. contrast; 2. correlation; 3. entropy; 4. homogeneity; 5. sum of square of variance; 6. difference of entropy; 7. sum of average; 8. sum of variance; 9. information measures of correlation 2; 10. cluster shade; 11. sum of entropy; 12. autocorrelation; 13. information measures of correlation 1; 14. dissimilarity; 15. maximum correlation coefficient; 16. cluster prominence; 17. angular second moment; 18. difference of variance; 19. energy; 20. maximum probability; 21. inverse difference normalized; 22. inverse difference of moment normalized.

See Ehsanirad A, Kumar YH. Leaf recognition for plant classification using GLCM and PCA methods. Oriental Journal of Computer Science & Technology 2010; 3(1):31-36.
REFERENCES

American Cancer Society (2017). Cancer facts and figures 2017. https://www.cancer.org/research/cancer-facts-statistics/all-cancer-facts-figures/cancer-facts-figures-2017.html.
American Cancer Society (2011-2012). Breast cancer facts and figures 2011-2012. https://www.cancer.org/content/dam/cancer-org/research/cancer-facts-and-statistics/breast-cancer-facts-and-figures/breast-cancer-facts-and-figures-2011-2012.pdf.
Ayer T, Chen Q, Burnside ES (2013). Artificial neural networks in mammography interpretation and diagnostic decision making. Computational and Mathematical Methods in Medicine, doi: http://dx.doi.org/10.5772/51857.
Baker JA, Lo JY, Delong DM et al. (2004). Computer-aided detection in screening mammography: variability in cues. Radiology, 233:411-17.
Baker JA, Kornguth PJ, Lo JY, Williford ME, Floyd CE (1995). Breast cancer: prediction with artificial neural network based on BI-RADS standardized lexicon. Radiology, 196(3):817-822.
Brem RF, Baum J, Lechner M et al. (2003). Improvement in sensitivity of screening mammography with computer-aided detection: a multi-institutional trial. American Journal of Roentgenology, 181(3):687-693.
Calas MJG, Gutfilen B, Pereira WCA (2012). CAD and mammography: why use this tool? Radiologia Brasileira, 45(1):46-52.
Dayhoff JE, DeLeo JM (2001). Artificial neural networks: opening the black box. Cancer, suppl. 1615-1635.
Duijm LE, Louwann MW, Groenewoud JH et al. (2009). Inter-observer variability in mammography screening and effect of type and number of readers on screening outcome. Br. J. Cancer, 100(6):195-201.
Freer TW, Ulissey MJ (2001). Screening mammography with computer-aided detection: prospective study of 12,860 patients in a community breast center. Radiology, 220:781-86.
Giger ML (2002). Computer-aided diagnosis in radiology. Acad. Radiology, 9:1-3.
Huo Z, Giger ML, Vyborny CJ, Wolverton DE, Schmidt RA, Doi K (1998). Automated computerized classification of malignant and benign masses on digitized mammograms. Academic Radiology, 5(3):155-168. doi: 10.1016/s1076-6332(98)80278-x.
Jemal A, Bray F, Center MM, Ferlay J, Ward E and Forman D (2011). Global cancer statistics. CA: A Cancer Journal for Clinicians, 61(2):69-90.
Jiang Y, Nishikawa RM, Schmidt RA, Toledano AY, Doi K (2001). Potential of computer-aided diagnosis to reduce variability in radiologists' interpretations of mammograms depicting microcalcifications. Radiology, 220(3):787-794.
Jiang Y, Nishikawa RM, Schmidt RA, Metz CE, Giger ML, Doi K (1999). Improving breast cancer diagnosis with computer aided diagnosis. Academic Radiology, 6(1):22-33.
Karssemeijer N, Otten JDM, Verbeek ALM et al. (2003). Computer-aided detection versus independent double reading of masses on mammograms. Radiology, 227:192-200.
Kallergi M (2004). Computer-aided diagnosis of mammographic microcalcification clusters. Medical Physics, 31(2):314-326.
Lisboa PJ (2002). A review of evidence of health benefit from artificial neural networks in medical intervention. Neural Networks, 15(1):11-39.
Majid A, de Paredes ES, Doherty RD et al. (2003). Missed breast carcinoma: pitfalls and pearls. Radiographics, 23:881-895.
Manner A: http://alexminnaar.com/deep-learning-basics-neural-networksbackpropagation-and-stochastic-gradient-descent.html, retrieved on 31 May 2018.
Mehdy MM, Ng PY, Shair EF, Md Saleh NI, Gomes C (2017). Artificial neural networks in image processing for early detection of breast cancer. Computational and Mathematical Methods in Medicine, Article ID 2610628, 15 pages. https://doi.org/10.1155/2017/2610628.
Melton AR, Worrel SW, Knapp J et al. (2007). Computer-aided detection with full-field digital mammography and screen-film mammography. American J. Roentgenol., 188:A36-A39.
Morton MJ, Whaley DH, Brandt KR, Amrami KK (2006). Screening mammograms: interpretation with computer-aided detection - prospective evaluation. Radiology, 239:375-83.
Norwegian Cancer Registry (2015). Cancer in Norway: cancer incidence, mortality, survival and prevalence in Norway. https://www.kreftregisteret.no/globalassets/cancer-in-norway/2015/cin_2015.pdf.
Norwegian research committee report (2015). https://www.regjeringen.no/contentassets/e0051d59fc4f48c1980a342fa18a1111/arsrapportforskningsradet-2015.pdf.
Paquerault S, Samuelsen FW, Petrick N et al. (2009). Investigation of reading mode and relative sensitivity as factors that influence reader performance when using computer-aided detection software. Acta Radiol., 227:192-200.
Posso M, Puig T, Carles M, Rue M, Canelo-Aybar C, Bonfill X (2017). Effectiveness and cost-effectiveness of double reading in digital mammography screening: A systematic review and meta-analysis. European Journal of Radiology, 96:40-49.
Romero C, Almenar A, Pinto JM et al. (2011). Impact on breast cancer diagnosis in a multidisciplinary unit after the incorporation of mammography digitalization and computer-aided detection systems. American J. Roentgenol., 197:1492-97.
Sapna S, Tamilarasi A, Kumar MP (2012). Backpropagation learning algorithm based on Levenberg-Marquardt algorithm. Comput. Sci. Inf. Technol., 2012:393-398.
Saritas I (2012). Prediction of breast cancer using artificial neural networks. J. Med. Syst., 36:2901-7.
Setiono R (2000). Generating concise and accurate classification rules for breast cancer diagnosis. Artificial Intelligence in Medicine, 18(3):205-217.
Sohns C, Angic B, Sossalla S et al. (2010). Computer assisted diagnosis in full-field digital mammography - results in dependence of readers experiences. Breast J., 16:490-97.
Taylor P & Potts HWW (2008). Computer aids and human second reading as interventions in screening mammography: two systematic reviews to compare effects on cancer detection and recall rate. Eur. J. Cancer, 44(6):798-807.
Torre LA, Bray F, Siegel RL, Ferlay J et al. (2015). Global cancer statistics, 2012. CA Cancer J. Clin., 65:87-108.
Wei J, Sahiner B, Hadjiiski LM et al. (2005). Computer-aided detection of breast masses on full field digital mammograms. Med. Phys., 32:2827-38.
In: Neural Networks Editor: Doug Alexander
ISBN: 978-1-53617-188-4 © 2020 Nova Science Publishers, Inc.
Chapter 5
SUPERVISED ADJUSTMENT OF SYNAPTIC PLASTICITY IN SPIKING NEURAL NETWORKS

Saeed Solouki

Control and Intelligent Processing Center of Excellence (CIPCE), Human Motor Control and Computational Neuroscience Laboratory, School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
Corresponding Author's Email: [email protected], [email protected].

ABSTRACT

Precise spike timing as a means to encode information in neural networks is biologically supported and is advantageous over frequency-based codes by processing input features on a much shorter time-scale. For these reasons, much recent attention has been focused on the development of supervised learning rules for spiking neural networks (SNN) that utilize a temporal coding scheme. However, despite significant progress in this area, there is still a lack of rules that have a theoretical basis and yet can be considered biologically relevant. Here we review the general conditions under which synaptic plasticity most effectively takes place to support the supervised learning of a precise temporal code. Then, we examine the accuracy of each plasticity rule with respect to its temporal encoding precision, and the maximum number of input patterns it can memorize using the precise timings of individual spikes as an indicator of storage capacity in different control and recognition tasks. Practically, the network should learn to distinguish patterns belonging to different classes from the temporal patterns of spikes.
Keywords: synaptic plasticity, supervised learning, spiking neural network, temporal encoding
1. SUPERVISED LEARNING IN SNN VIA REWARD-MODULATED SPIKE-TIMING-DEPENDENT PLASTICITY

In spite of the success of conventional artificial neural networks (ANNs) in learning complex non-linear functions, interest in spiking neural networks (SNNs) is consistently increasing due to the fact that SNNs offer many fundamental and inherent advantages, such as biological plausibility [1], rapid information processing [2, 3], and energy efficiency [4, 5]. Since conventional ANN computing units process signals in the form of continuous activation functions, these can be interpreted as average pulse frequencies over a time window. Different from this, SNNs process information in the form of spikes or pulses, which is much more analogous to the natural nervous system and therefore more biologically plausible. An advantage of this form of information processing is the possibility of not only encoding spatial information, as traditional ANNs do, but also adding temporal information in the form of the precise timing of spikes [1]. This eliminates the need for an averaging time window and allows for continuous data processing, greatly reducing response latencies [6]. Since SNNs are able to transmit and receive large volumes of data encoded by the relative timing of only a few spikes, this also leads to the possibility of very fast and energy-efficient computing. Moreover, in terms of energy efficiency, maintaining the sufficient functions of the nervous system to perform
various tasks needs a continuous energy supply [7]. However, the human brain only needs remarkably low power consumption, around 20 W [8]. Overall, SNNs have great potential to offer an efficient way to model the principles underlying neural structures devoted to motor control in living creatures. Hence, mobile robots will be able to manage their weaknesses of carrying limited computing resources and power supply using SNN-based controllers. However, training these kinds of networks is notoriously difficult, since the error back-propagation mechanism commonly used in conventional neural networks cannot be directly transferred to SNNs due to the non-differentiability at spike times. Although there are successful attempts to combine the advantages of SNNs and the back-propagation mechanism [9, 10], they basically transfer the continuous errors into probability-based or rate-based spikes, which gives away some of the inherent advantages of SNNs, such as the temporal encoding of spike occurrence. Although a variety of learning rules for SNNs have been presented in the past [11-13], solving different tasks usually involves constructing a specific network architecture suited to solve them, restricted by the lack of efficient and general training solutions. Initially, SNN-based control tasks were executed by manually setting network weights [14-17]. Although this approach is able to solve basic behavioral tasks, such as path keeping [18] or wall following [19], it is only feasible for lightweight networks with few connections or simple network architectures without hidden layers. In this section we aim to introduce the application of SNNs in robotic controllers using an indirect training method for adjusting synaptic weights among neural layers. Studies of spike-timing-dependent plasticity (STDP), one of the fundamental rules governing synaptic activity, have shown that the strength of a synaptic connection is regulated by the precise timing of pre- and post-synaptic spikes. On this basis, the STDP learning rule is used in robotic control. Experimental studies also reveal that the brain
modifies the outcome of STDP synapses using chemicals emitted by given neurons. This mechanism inspires a new method for training SNNs known as reward-modulated spike-timing-dependent plasticity (R-STDP) [20]. Since R-STDP can modulate an SNN with sparse, delayed signals, this method is well suited for mobile robotic tasks. However, there still exist many challenges for the widespread implementation of such learning rules in mobile robotic applications. First, there has been a lack of a unified learning paradigm that can be easily used in different tasks and assign neuron modulations regardless of the multi-layered SNN architecture. Second, in order to produce the desired behaviors of the robot, defining proper rewards is important but complicated. Besides, the working mechanism of the reward-based neuron modulator in multi-layered SNNs is still unclear. Third, the R-STDP learning rule, similar to reinforcement learning, needs the agents to explore and interact with the environment randomly at the beginning. Thus, improper parameter identification will consume a long period of time or even cause the tasks to fail with high probability. In the following, we explore the ability of an indirect SNN training approach based on the R-STDP learning rule and a supervised learning framework to control a simple robot navigation task. In the first step, a simulated target-reaching scenario is constructed and adapted with different traffic conditions for evaluating the suggested SNN-based controller [21], in which a Pioneer robot mounted with proximity sensors is regarded as the moving vehicle. This controller consists of a goal-approaching sub-controller, responsible for reaching a target area, and an obstacle-avoiding sub-controller, responsible for avoiding obstacles in the robot's path. Second, a supervised R-STDP learning rule is presented to train a one-hidden-layer SNN in an end-to-end learning fashion. The SNN-based controller takes the robot's proximity sensor readings and the direction of a target as inputs and the motor speed as the output. Finally, the accuracy of training results in reaching the specified target is analyzed and then other unknown scenarios are implemented to demonstrate the feasibility of this scheme.
1.1. Modeling of Spiking Neural Network and the Supervised R-STDP Learning Rule

Network model: The proposed network [21] has a simple architecture consisting of an input layer encoding the sensory input vector with integrate-and-fire (IF) neurons, a hidden layer of leaky integrate-and-fire (LIF) neurons, and an output layer of LIF neurons, which provides the output spike trains that are decoded into an output vector. The network is a fully-connected feed-forward network with specific parameter values. The input neurons can be interpreted as IF neurons without leakage. The firing threshold vth is set to 1 mV and the neuron is modeled as

\frac{dv_j}{dt} = a \cdot x_j + b, \qquad (1.1)
where vj is the neuron's membrane potential and xj is the injected current. The parameter b is used to enable the input neuron to fire even when there is no stimulus, since a spike will be generated every 1/b ms. This serves the purpose of helping to generate spikes even for low input values in the time window T and thus enabling learning for inputs that would otherwise not have fired the input neurons in T. With the factor a, the build-up of the membrane potential can be scaled, limiting the number of spikes generated in T for a maximum input to (a + b) × T. Considering a = 0.2 and b = 0.025 results in the generation of one spike per time window for no input and 11 spikes for maximum input. An example is shown in Figure 1a. The hidden and output layers consist of LIF neurons with thresholds vth,hidden = 30 mV and vth,output = 25 mV. The neurons in both layers share a refractory period τref = 3 ms and a membrane time constant τm = 10 ms. The LIF neurons are modeled as follows:

\frac{dv_j}{dt} = \left(-v_j(t) + PSP_j(t)\right)/\tau_m, \qquad (1.2)
Here, vj(t) is the jth neuron's membrane potential and PSPj(t) is the postsynaptic potential of the jth neuron. The PSP induced by a presynaptic spike has the shape of an alpha function, also referred to as an alpha synapse [22, 23]. A system of two differential equations is used to approximate the shape of the resulting PSP:

\frac{dPSP_j}{dt} = \left(-PSP_j(t) + i_j(t)\right)/\tau_s,
\frac{di_j}{dt} = -i_j(t)/\tau_s + \sum_{t_i^f} w_{ij}\, \delta(t - t_i^f), \qquad (1.3)
where ij(t) is the injected current, t_i^f is the firing time of the ith neuron connecting to the neuron, τs is a time constant controlling the decay of the PSP, and δ(·) is the Dirac delta function. For the network to successfully learn a specific task, the exact shape of the PSP is not decisive. It simply has to be a function that allows for a slow build-up of the membrane potential, so that the post-synaptic neuron fires earlier when the synaptic strength increases. Mechanisms that integrate the PSP by simply increasing the membrane potential at the arrival time of the pre-synaptic spike are not suited for this kind of network, since this would result in firing the post-synaptic neuron only at the same time the pre-synaptic neuron fires or after a certain delay has passed. Then, the STDP learning rule would not work with the delayed reward to adjust the synapses. The integration of the postsynaptic potential is chosen to be alpha-shaped, because the slow build-up allows for the integration of temporal information, which includes the timing of the post-synaptic spike, the presynaptic spike timing, and the strength of the synaptic connection. The output spike trains are decoded similarly to the leaky integrator equation. To further increase the influence of the precise timing of the output spikes and reward early spiking, it was slightly changed to

y_i = \sum_{t_i^f} \alpha\, \frac{T - t_i^f}{T}\, \exp\!\left(\beta (t_i^f - t)\right) - \gamma, \qquad (1.4)
119
𝑓
where yi is the output of the ith output neuron and 𝑡𝑖 are the firing times of that neuron. α, β, γ are the output constants. Figure 1b shows the development of the output function in the time window T for a typical spike train. Supervised R-STDP learning rule: The basic idea underlying the proposed supervised R-STDP learning rule is to calculate the reward according to the supervised learning framework and strengthen a synaptic connection based on the combination effect of the STDP function and dopamine reward, where STDP means strengthening a synaptic connection results in a faster buildup of the postsynaptic neuron potential when a presynaptic spike arrives, leading to the postsynaptic neuron firing earlier. According to this learning rule, the weight changes proposed by an STDP function are collected and a reward representing whether the output is lower or higher than the desired output is calculated after every simulation time window. Then, this reward is used to change the synaptic connections of the network under the R-STDP learning rule. The weights of the synaptic connections wij, where i and j are the indices of the pre- and the post-synaptic neurons, respectively, are updated after the simulation time window T by the following equations: 𝑤𝑖𝑗 (𝑡) = 𝑤𝑖𝑗 (𝑡 − ∆𝑡) + ∆𝑤𝑖𝑗 (𝑡)
(1.5)
∆𝑤𝑖𝑗 (𝑡) = ƞ × 𝑟𝑖𝑗 (𝑡) × 𝑆𝑇𝐷𝑃𝑖𝑗 (𝑡) × 𝑔𝑖𝑗 (𝑡),
(1.6)
where t denotes the number of the current time window and ∆t = T. The learning rate η is a constant which regulates the learning speed of the SNN. Inspired by traditional ANNs, η starts from ηmax and decreases toward ηmin, such that the weight changes are becoming finer as the training progresses. It is updated after every training episode. ƞ = ƞ𝑚𝑎𝑥 −
ƞ𝑚𝑎𝑥 −ƞ𝑚𝑖𝑛 𝑒𝑝𝑖𝑠𝑚𝑎𝑥
× 𝑒𝑝𝑖𝑠𝑐𝑢𝑟𝑟 ,
(1.7)
120
Saeed Solouki
where episcurr is the current training episode and epismax is the number of training episodes. The function gij(t) is the synapses eligibility trace. As opposed to eligibility traces in reinforcement learning, here it models a phenomenon in biological neurons, where synapses with higher efficacies produce greater weight changes [24] and can be calculated as 𝑔𝑖𝑗 (𝑡) = 1 − 𝑐1 × 𝑤𝑖𝑗 × exp (−𝑐2 ×
𝑎𝑏𝑠(𝑤𝑖𝑗 ) 𝑤𝑚𝑎𝑥
),
(1.8)
where c1 and c2 are positive constants. c1 is set to 1/wmax to make sure the eligibility traces only assume values between 0 and 1. The weight changes proposed by the STDP function are collected and represented by the term STDPij. This mechanism is modeled with the help of two variables aij,pre and aij,post that are traces of the pre- and post-synaptic activity [25]. They are governed by the following differential equations: 𝜏𝑝𝑟𝑒
𝑑𝑎𝑖𝑗,𝑝𝑟𝑒
𝜏𝑝𝑜𝑠𝑡
= −𝑎𝑖𝑗,𝑝𝑟𝑒 ,
𝑑𝑡 𝑑𝑎𝑖𝑗,𝑝𝑜𝑠𝑡 𝑑𝑡
= −𝑎𝑖𝑗,𝑝𝑜𝑠𝑡 ,
(1.9)
Upon occurrence of a pre-synaptic spike, aij,pre is updated and the proposed weight changes are modified: 𝑎𝑖𝑗,𝑝𝑟𝑒 (𝑡) = 𝑎𝑖𝑗,𝑝𝑟𝑒 (𝑡 − 𝑑𝑡) + 𝐴𝑝𝑟𝑒 , 𝑆𝑇𝐷𝑃𝑖𝑗 (𝑡) = 𝑆𝑇𝐷𝑃𝑖𝑗 (𝑡 − 𝑑𝑡) + 𝑎𝑖𝑗,𝑝𝑜𝑠𝑡 ,
(1.10)
When the post-synaptic neuron fires, aij,post is updated as follows: 𝑎𝑖𝑗,𝑝𝑜𝑠𝑡 (𝑡) = 𝑎𝑖𝑗,𝑝𝑜𝑠𝑡 (𝑡 − 𝑑𝑡) + 𝐴𝑝𝑜𝑠𝑡 , 𝑆𝑇𝐷𝑃𝑖𝑗 (𝑡) = 𝑆𝑇𝐷𝑃𝑖𝑗 (𝑡 − 𝑑𝑡) + 𝑎𝑖𝑗,𝑝𝑟𝑒 ,
(1.11)
Supervised Adjustment of Synaptic Plasticity …
121
The STDP function governed by these rules is equivalent to the STDP learning rule. This function is used to show the same behavior while being efficient and physiologically plausible as biological neurons, since they do not have a memory of all their fired spikes. The reward is represented by the term rij. It can be seen as more of an adjustment than a reward, since it determines whether the SNN output has to be increased or lowered in order to reach the desired output. After calculating the SNN’s output ySNN,k, a reward variable that represents the relative deviation of each output from the desired value ycon,k (provided by the dataset) is calculated for each neuron indexed by k. As opposed to other R-STDP learning rules, there is no global reward signal, but every synapse is assigned its individual reward as 𝑟𝑘 =
(|𝑦𝑐𝑜𝑛,𝑘 |−|𝑦𝑆𝑁𝑁,𝑘 |) 𝑦𝑚𝑎𝑥
,
(1.12)
The maximum value of output should not be exceeded ymax. Thus, the synapse connecting the jth neuron in the hidden-layer and the kth neuron in the output layer are given a reward as 𝑟𝑗𝑘 = 𝑟𝑘 ,
(1.13)
To assign a reward to the synapses connecting the input to hidden neurons, it has to be calculated differently. In this scheme, it is backpropagated through the layers: each hidden neuron has one synaptic connection to each output neuron, where the synapse is assigned a reward rk. With the help of the weights of those synaptic connections the reward of a hidden layer neuron is calculated, following Eq. 1.14. This hidden layer neuron reward can now be assigned to the synapses connecting an input neuron to this neuron. 𝑟𝑗,𝑘 = (∑𝑘|𝑤𝑗𝑘 |𝑟𝑘 )/(∑𝑘|𝑤𝑗𝑘 |),
(1.14)
122
Saeed Solouki
Here, i represents the ith input layer neuron, j indexes the jth hidden layer neuron and k denotes the kth output layer neuron. With the proposed rule for setting rewards for synapses, an SNN construed with R-STDP synapses can be trained by the supervised learning framework with a dataset. Next, the user simply has to set the size of each layer of an SNN to achieve a desired behavior.
Figure 1. (a) Input encoding function for a = 0.2, b = 0.025, and xi = 0.5 in a time window T = 50ms, the horizontal dashed line means the firing threshold. (b) Output function y(t) for α = 5, β = 0.05, and γ = 0 in a time window T = 100ms, the red dashed vertical lines denote firing times of the output neuron. (c) The Pioneer robot with its 6 on-board sonar sensors. Red line denotes the sensors used for the obstacle-avoiding task. (d) Top-view of the V-REP scene that is used to collect the data for the obstacleavoiding dataset.
Supervised Adjustment of Synaptic Plasticity …
123
Notably, during the obstacle avoidance task, while both output reward values for the hidden-output-snapses are calculated using Eq. 1.14, this was done slightly different for the goal approaching SNN. This is because the output of one neuron has to be precise, while the other neurons output simply has to be higher for our goal-approaching sub-controller to exhibit the intended behavior. This will be explained with more detail in section 1.3. For this rule, the most important part is to correctly judge whether an output has to be increased or lowered.
1.2. Reference Dataset The target-reaching controller (TR) is supposed to drive the robot to reach a target area and avoid obstacles in its path. Each task is to be solved by a sub-SNN-controller trained by the supervised R-STDP learning rule, one for obstacle avoiding (OA) and one for goal approaching (GA). Thus, two datasets were created consisting of 500 inputoutput pairs, which were later used to train the sub-controllers to approximate their obstacle avoiding and goal approaching behaviors, respectively [21]. The datasets are generated by simulating the locomotion tasks in the Virtual Robot Experimentation Platform (V-REP) [26].
1.2.1. Obstacle Avoiding Dataset The goal of the obstacle-avoiding sub-controller is to find a direction that the modile robot should take to avoid encountered obstacle. For this purpose, an obstacle-avoiding reference controller based on simple if-then fuzzy rules [27] is built. One piece of data in the dataset consists of the six sensor readings of the sensors S1–S6, the two output angles αOA,ref ,L, αOA,ref ,R and the turn to take, i.e., left or right. The inputs of the obstacle-avoiding controller are the six central front sonar-sensors of the pioneer robot (Figure 1c). The sensor reading Si is given as the distance between the detected object to the ith sensor’s position. Since sensors that do not detect anything return random values between 0 and 1 for the coordinates, an additional boolean variable detecti is used for each
124
Saeed Solouki
sensor, which is True when an object is detected and False if not. The outputs are two angles αOA,ref ,R, αOA,ref ,L, which are chosen based on the most central right and left sensors that do not detect any obstacle. Negative angles represent right turning, whereas positive angles represent left turning. To select the turning angle αturn for the robot, the angles are compared and the one with the lower absolute value is chosen. If two have the same absolute value, the sum of the right and left sensor readings are compared and the side with the lower overall sensor readings is chosen. The sensors are orientated at ±10, ±30, and ±50° at the front of the robot. To make sure the robot successfully avoids the detected obstacle, output angles are set to a value that is 10° higher than the respective sensor’s orientation angles. If all sensors on one side detect an obstacle, the output angle is set to ±90°, implying a turn away from that obstacle. This turning rule is presented in Algorithm 1. To create the dataset, the robot controlled by the reference controller drives around the training scene that can be seen in Figure 1d. Meanwhile, the robot saves the sensor readings as well as the outputs of Algorithm 1 every 200 ms, if an obstacle is encountered. This scene is chosen as the training environment, because it ensures the robot takes both left and right turns while navigating through the scene. The controller then decides on the turn to take by choosing the angle with the smaller absolute value as the output angle αOA,ref. In the event that both output angles are the same, the angle calculation algorithm additionally returns the turn to take (1 for a left turn, 2 for a right turn) based on which side’s detected obstacles are further away. If all of the sensors S1–S6 return the following readings, no obstacle will be encountered, which means the robot will not crash on anything if moving forward. 𝑆1 , 𝑆6 ≥ 0.01 { 𝑆2 , 𝑆5 ≥ 0.15 𝑆3 , 𝑆4 = 1
(1.15)
Supervised Adjustment of Synaptic Plasticity …
125
Algorithm 1. Steps to calculate the output angles for the reference obstacle avoiding controller. procedure CALCOUTPUTANGLES (S) αOA,ref ,L = 0°, αOA,ref ,R = 0° sumL = S1 + S2 + S3, sumR = S4 + S5 + S6 turn = 0 if S3 == 1 then αOA,ref ,L = 20° else if S2 == 1 then αOA,ref ,L = 40° else if S1 == 1 then αOA,ref ,L = 60° else αOA,ref ,L = 90° end if if S4 == 1 then αOA,ref ,R = −20° else if S5 == 1 then αOA,ref ,R = −40° else if S6 == 1 then αOA,ref ,R = −60° else αOA,ref ,R = −90° end if if αOA,ref ,L > |αOA,ref ,R| then turn = 1 else if αOA,ref ,L < |αOA,ref ,R| then turn = 2 else if sumOA,ref ,L < sumOA,ref ,R then turn = 1 else turn = 2 end if Return αOA,ref ,L, αOA,ref ,R, turn end procedure
1.2.2. Goal-Approaching Dataset Another dataset is created to train the goal-approaching subcontroller in order to reach a pre-set target area. This controller gets the normalized vector 𝑔⃗ = (gx, gy) from the Pioneer robot to the goal center as input and outputs a turning angle αGA,ref, which causes direct facing with the target. Figure 2 shows the Pioneer robot and its four imaginary goal positions g1–g4. It can be seen that the controller should later calculate a left turning angle (α1 > 0°) for y > 0, a right angle (α2 < 0°) for y < 0, and an angle |α3/4| > 90° for x < 0. The controller’s activity could be restrained due to being exposed to very high output angles (±180°) in training, while experiencing mostly low target angles when almost facing the goal. For this reason, all angles >90° are clipped at ±90°:
126
Saeed Solouki arcsin(𝑔𝑦 ) 𝛼𝐺𝐴,𝑟𝑒𝑓 = {
90
°
−90
𝑖𝑓 𝑔𝑥 > 0; 𝑖𝑓 𝑔𝑦 > 0, 𝑔𝑥 < 0;
°
(1.16)
𝑖𝑓 𝑔𝑦 < 0, 𝑔𝑥 < 0.
The coordinate pairs (gx, gy) are randomly generated and then normalized to create the dataset. The absolute value of the target angle is set to be >10°, because αGA,ref ≤ 10° is treated here as facing the target. From those normalized pairs (gx, gy), a target angle is calculated according to Eq. 1.16. Similar to the obstacle avoiding SNN, the goal-approaching SNN later calculates two output angles, one for each side. For this reason, the angle of the side where the target is not located is set to ±180° to be consistent. One input-output pair then consists of the two parts of the goal vector gx, gy and the two target angles αGA,ref ,R, αGA,ref ,L.
Figure 2. The Pioneer robot with different relative goal positions (g1 - g4) and the corresponding target angles (α1 - α4).
1.2.3. Calculating the Speed of the Robot Motor
Since the SNN sub-controllers and the reference datasets only provide the turning angles, it is necessary to translate them into actual motor speeds of the robot in rad/s, where vforward denotes the default motor speed when moving forward:

\[
v_{right} = v_{forward} + \Delta v(\alpha)/2, \qquad
v_{left} = v_{forward} - \Delta v(\alpha)/2,
\tag{1.17}
\]
where Δv(α) is the difference in motor speeds necessary to achieve a turn of α degrees in 1 s. The default forward speed is set to vforward = 5.0 rad/s.
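Expressed in code, Eq. 1.17 could look as follows. The gain translating a turning angle into Δv(α) depends on the Pioneer robot's wheel radius and axle length, which are not specified here, so the value below is purely a placeholder.

V_FORWARD = 5.0  # default forward speed in rad/s, as stated above

def wheel_speeds(alpha_deg, dv_per_degree=0.05):
    # dv_per_degree is a hypothetical gain mapping the desired turn of
    # alpha_deg degrees per second onto a wheel-speed difference
    dv = dv_per_degree * alpha_deg      # Δv(α)
    v_right = V_FORWARD + dv / 2.0      # Eq. 1.17
    v_left = V_FORWARD - dv / 2.0
    return v_left, v_right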
1.3. Controller
In this part, the target-reaching (TR) control architecture is presented, which consists of an obstacle-avoiding (OA) and a goal-approaching (GA) sub-controller. Figure 3 shows the target-reaching control structure. Upon starting the simulation, V-REP passes the position of the target center ptarget = (ptarget,x, ptarget,y) to the controller, and the user needs to specify the target area by setting a radius rtarget. After every simulation time window, the controller is provided with the position of the robot pp3dx = (pp3dx,x, pp3dx,y), its proximity sensor readings S1–S6, and the normalized vector to the goal 𝑔⃗. In every step it is then checked whether the robot has reached the target area by estimating its distance d from the target center as:

\[
d = \sqrt{(p_{target,x} - p_{p3dx,x})^2 + (p_{target,y} - p_{p3dx,y})^2},
\tag{1.18}
\]
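In code, the target-area test of Eq. 1.18 amounts to a simple Euclidean distance check; the function below is a straightforward illustration with hypothetical argument names.

import math

def reached_target(p_robot, p_target, r_target):
    # Eq. 1.18: distance between the robot and the target centre
    d = math.hypot(p_target[0] - p_robot[0], p_target[1] - p_robot[1])
    return d <= r_target, d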
Figure 3. Structure of the Target-Reaching controller with its SNN sub-controllers communicating with the V-REP simulator.
Then, d is compared to the initially specified target radius rtarget. If the robot is not yet in the target area, the SNN-based obstacle-avoiding sub-controller and the goal-approaching sub-controller calculate their angle outputs to drive the robot for the next step. Finally, the motor speeds are calculated from the output angle of the controller. The output layer of both sub-controllers is equipped with an additional neuron directly connected to the input layer, referred to in this review as the target-facing neuron (TN) for the goal-approaching sub-controller and the obstacle neuron (ON) for the obstacle-avoiding sub-controller. These neurons are added to the architecture to judge whether the SNN sub-controllers should take action in controlling the mobile robot. The obstacle neuron checks whether a forthcoming obstacle has to be avoided; this is the case if the output yobst of the obstacle neuron is higher than a threshold value yth,ON. Similarly, the robot is considered not to be facing the target if the target neuron output ytarget falls below a threshold value yth,TN. The turning angle is then selected as:
\[
\alpha_{turn} =
\begin{cases}
\alpha_{OA,L}, & \text{if } (y_{obst} > y_{th,ON}) \wedge (\alpha_{OA,L} < |\alpha_{OA,R}|)\\
\alpha_{OA,R}, & \text{if } (y_{obst} > y_{th,ON}) \wedge (\alpha_{OA,L} > |\alpha_{OA,R}|)\\
\alpha_{GA,L}, & \text{if } (y_{target} < y_{th,TN}) \wedge (\alpha_{GA,L} < |\alpha_{GA,R}|)\\
\alpha_{GA,R}, & \text{if } (y_{target} < y_{th,TN}) \wedge (\alpha_{GA,L} > |\alpha_{GA,R}|)\\
0°, & \text{else}
\end{cases}
\tag{1.19}
\]
where αOA,L and αOA,R are the outputs of the obstacle-avoiding controller and αGA,L and αGA,R are the outputs of the goal-approaching controller. The subscripts L and R denote turning left or right. To translate the output of each SNN sub-controller into an angle, the following equation is used:

\[
\alpha = \alpha_{min,OA/GA} + (\alpha_{max,OA/GA} - \alpha_{min,OA/GA}) \times y_{SNN} / y_{max}
\tag{1.20}
\]
In the above equation, ymax and ySNN are the maximum output and the actual output of each SNN, while αmin and αmax denote the range of the turning angle. This angle is then used to calculate the motor speeds of the robot according to Eq. 1.17. Figure 4a illustrates the topology of the goal-approaching SNN with its three input neurons gy,pos, gx,neg, and gy,neg, which are derived from the Cartesian components of the normalized goal vector 𝑔⃗ shown in Figure 2, with gx, gy ∈ [-1, 1]. Since the inputs have to be real values between 0 and 1, they are split into positive parts gx,pos, gy,pos and negative parts gx,neg, gy,neg. Initially, both the positive and negative parts are set to zero. During locomotion, if gy is positive, gy,pos is set to gy; otherwise gy,neg is set to |gy|. The unset part remains zero. Since gx and gy are related by gx² + gy² = 1, |gx| does not provide any additional information and therefore gx,pos is not fed into the network. However, gx,neg cannot be omitted, since gx,neg > 0 (i.e., gx < 0) implies that the output angle has to be >90°. As already mentioned in the previous section, all output angles larger than 90° are set to 90° in the dataset. The hidden layer contains 50 neurons.
The output neurons yGA,L and yGA,R are used for estimating the output angles αGA,L and αGA,R according to Eqs. 1.4 and 1.20. For goal approaching, αmin,GA is set to 10° and αmax,GA to 100°; ymax is the SNN's output for which αmax,GA is reached. The third output neuron, which is directly connected to the input neurons, is responsible for deciding whether the robot is facing the target or not. More details about the parameter values can be found in [21].
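The decoding of Eq. 1.20 and the selection rule of Eq. 1.19 can be combined into a short sketch. Obstacle avoidance is given priority whenever both conditions hold, which is how the ordering of the cases in Eq. 1.19 is read here; the function names are illustrative.

def decode_angle(y_snn, y_max, alpha_min, alpha_max):
    # Eq. 1.20: linear mapping of an SNN output onto a turning angle
    return alpha_min + (alpha_max - alpha_min) * y_snn / y_max

def select_turn(alpha_oa_l, alpha_oa_r, alpha_ga_l, alpha_ga_r,
                y_obst, y_target, y_th_on, y_th_tn):
    # Eq. 1.19: choose the smaller-magnitude angle of the active sub-controller
    if y_obst > y_th_on:
        return alpha_oa_l if alpha_oa_l < abs(alpha_oa_r) else alpha_oa_r
    if y_target < y_th_tn:
        return alpha_ga_l if alpha_ga_l < abs(alpha_ga_r) else alpha_ga_r
    return 0.0

For the goal-approaching network, for example, decode_angle(y, y_max, 10.0, 100.0) reproduces the αmin,GA = 10° and αmax,GA = 100° setting quoted above.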
Figure 4. Network topology of the (a) GA and (b) OA sub-SNNs together with the simulation scenes used for testing the (c) target-reaching controller and (d) obstacle-avoiding sub-controller. In each scenario, the red disc represents the goal position and the gray objects are the obstacles.
The obstacle-avoiding sub-controller works very similarly to the goal-approaching sub-controller. Its topology is shown in Figure 4b. It differs from the goal-approaching controller in two aspects, the first being the inputs fed into it. The input layer consists of six input neurons, each responsible for encoding one sensor reading. A sensor reading Si lies between 0 and 1, representing how many meters away the detected obstacle is located. Since a higher value represents a faraway obstacle and a lower value a close
obstacle, the complement S̄i = 1 − Si is fed into the network in order to ensure that network activity is higher the closer an obstacle is. The output angles of the two motor neurons are likewise calculated by Eq. 1.20. The neuron directly connected to the input layer evaluates whether an obstacle has to be avoided or not; this is the case if its output satisfies yobst > 5.
1.4. Testing Environments and the Performance of the Controller to Achieve Target-Reaching Tasks in Different Scenarios
In this first part of the review, the capability and efficiency of end-to-end learning of a spiking neural network based on supervised R-STDP are demonstrated on a target-reaching vehicle task. Within V-REP, different scenarios for the goal-approaching and obstacle-avoiding tasks are presented. Then, the training performance of the SNN-based controllers is analyzed in terms of training accuracy and error. Finally, a group of overall target-reaching tasks is conducted in unknown scenarios to examine the proposed algorithms. The core purpose of these tasks is to demonstrate a promising training method for SNNs in a general-purpose and easy-to-use way.
1.4.1. Testing Environment The environments used for testing the performance of the OA subcontroller as well as the overall TR controller as a whole are presented in Figure 4. For the goal-approaching subcontroller, a target is represented by a red platform and placed in an open environment without obstacles. This allows for testing the ability of the robot to reach the target from different orientations without having to worry about colliding with an obstacle. For the obstacle-avoiding sub-controller, the mobile robot is driving around with the potential to collide with multiple obstacles (see Figure 4d). Besides, the obstacles with different shapes and sizes (e.g., the thin pillars) have never been encountered before by the robot in the training scene (see Figure 1d). This is critical for verifying its ability to react correctly even to unknown
stimuli. To test the performance of the target-reaching control structure as a whole, the target from the goal-approaching sub-controller scene is simply added to the obstacle-avoiding testing environment (Figure 4c). The SNN models and the learning rule proposed in Section 1.1 are used to train the goal-approaching and obstacle-avoiding sub-controllers to mimic the output values for certain input vectors, both provided by their respective datasets. All controllers are trained for 100 episodes, where one episode consists of a set of 500 input-output pairs. The training accuracy denotes how often the sub-controllers chose the correct turn direction (left or right), while the error represents the deviation from the desired output value.
1.4.2. Goal-Approaching Sub-Controller
Figure 5a shows the development of the accuracy and Figure 5b the average error of the goal-approaching sub-controller over the course of training. As can be seen, the accuracy quickly rises to a value of over 90% and then keeps rising slowly during the training process. The training terminates with a final accuracy of 96.2%. It should be noted that, after approximately 70 training episodes, the accuracy stagnates at values between 94.8 and 96.6%. Similar to the accuracy, the average error per episode falls to a value below 15% after only four training episodes and gradually reduces to an error of 10.24%. While it might be expected to continue to fall over more episodes, the error usually stagnates at a value of 10 ± 0.5%. The learning rates are set by trial and error for both sub-controllers. Generally speaking, learning rates that are too low result in a much slower increase in accuracy, and the error rates stagnate at values much worse than when choosing close-to-optimal learning rates; conversely, with learning rates that are too high, the accuracy and average error usually fluctuate before stagnating at a value far from the optimum. The special neurons of each controller are trained separately by using the same learning rule to make their output approach the threshold value yth for the edge cases repeatedly until the average deviation falls under a certain level. For the target neuron (TN) the maximum deviation is set to 0.25 and the two edge cases are:
\[
\begin{aligned}
& g_{y,neg} = 0.15, \quad g_{x,neg} = 0, \quad g_{y,pos} = 0,\\
& g_{y,neg} = 0, \quad g_{x,neg} = 0, \quad g_{y,pos} = 0.15,
\end{aligned}
\tag{1.21}
\]
Since |gy| = 0.15 results in a desired output angle of approximately 8.6°, an angle close to this value will be interpreted as facing the target. Any other goal vector 𝑔⃗ that does not fall between these two edge cases (-0.15 ≤ gy ≤ 0.15) means the robot is not facing the target; consequently, the higher input value |gy| will cause the output to cross the threshold value yth. The learning parameters of the special neurons are set to the same values as for their respective sub-controllers, with the exception of A+ = 0.1, allowing for overall smaller weight changes. As can be seen in Figure 5c, the target neuron reaches an average deviation of 0.248 after approximately 340 episodes, where in one episode the edge cases are fed into the SNN five times.
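The edge-case training of the special neurons can be summarized in the following loop skeleton. Here snn.forward and snn.r_stdp_update are purely illustrative placeholders for the network simulation and the supervised R-STDP update of Section 1.1; only the structure of the procedure (five presentations of each edge case per episode, stopping once the average deviation falls below 0.25) follows the text.

EDGE_CASES = [
    {"g_y_neg": 0.15, "g_x_neg": 0.0, "g_y_pos": 0.0},   # first case of Eq. 1.21
    {"g_y_neg": 0.0, "g_x_neg": 0.0, "g_y_pos": 0.15},   # second case of Eq. 1.21
]
MAX_DEVIATION = 0.25  # stopping criterion for the target-facing neuron

def train_target_neuron(snn, y_th_tn, repeats_per_episode=5):
    episode = 0
    while True:
        deviations = []
        for _ in range(repeats_per_episode):
            for case in EDGE_CASES:
                y = snn.forward(case)                         # output of the TN
                snn.r_stdp_update(target=y_th_tn, output=y)   # push the output toward the threshold
                deviations.append(abs(y - y_th_tn))
        episode += 1
        if sum(deviations) / len(deviations) < MAX_DEVIATION:
            return episode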
Figure 5. Development of the accuracy (blue graph in a) and average error (purple graph in b) of the goal-approaching sub-controller over the training episodes for ηmax = 0.2 and ηmin = 0.05. The development of the target neuron's average deviation from yth,TN is depicted in c. The dashed red line represents the maximum deviation of 0.25 that the average deviation had to fall below. The solid blue line in c shows the moving mean of the average absolute error over the episodes.
The trajectories of the robot controlled by the goal-approaching sub-controller for different initial positions and orientations are shown in Figures 6a and 6b. It can be observed that the robot manages to quickly turn toward the goal and reach the target area for all the initial orientations. However, a problem arises during testing. When setting the radius of the target area to a
sufficiently small value such as 0.1 m, the robot drives toward the target and then, instead of moving closer to it, starts driving around the target area in circles (Figure 6c). This happens because the goal-approaching controller causes the robot to turn toward the target, but the constant forward velocity prevents it from moving closer to the target center. However, this can easily be fixed by lowering the forward velocity when the robot gets close to the target center (d < 1 m):

\[
v_{forward} =
\begin{cases}
v_{init} \times d, & \text{if } d < 1,\\
v_{init}, & \text{else,}
\end{cases}
\tag{1.22}
\]
where vinit is the initial forward velocity at the start of the simulation and d is the distance between the robot and the target as defined in Eq. 1.18. Figure 6d shows the trajectory of the robot with this modification in the same situation.
Figure 6. (a,b) The robot’s trajectories controlled by the trained goal-approaching subcontroller for two different goal positions. (c) The robot drives in circles around the target center. (d) By reducing the forward velocity according to Eq. 1.22, the robot manages to reach the target area without circling around the target center.
1.4.3. Obstacle-Avoiding Sub-Controller
For avoiding obstacles, the learning parameters are set to ηmax = 0.09, ηmin = 0.02, and A+ = 0.4. The learning process can be seen in Figure 7. Similar to the goal-approaching sub-controller, the accuracy quickly rises to over 90% and continues to increase before it stagnates at around 97.6%. The average error, however, falls much faster, reaching a value of approximately 10.5% after 30 episodes and stagnating around that rate. The accuracy after the final episode amounts to 97.6%, while the average error rate is 10.5%. The obstacle neuron undergoes the same training procedure as the target-facing neuron. The development of the average deviation per training episode can be seen in Figure 7c. The maximum deviation is set to 0.3, and the average deviation falls below it after approximately 580 episodes. For the obstacle neuron, the edge cases are the six sensor readings from Eq. 1.15, while every other sensor reading is set to 1 (not detecting anything). Since the input neurons exhibit spikes even when not detecting anything, the case where no sensor detects anything is also included in the edge cases. For this case, however, instead of approximating the value yth,ON, the neuron is trained to assume a value lower than yth,ON/2 to make sure the unconditioned firing does not cause the neuron to fire too early, which would result in falsely detecting an obstacle.
Figure 7. Development of the accuracy (blue graph in a) and average error (purple graph in b) of the obstacle-avoiding sub-controller over the training episodes. The development of the obstacle neuron's average deviation from yth,ON is depicted in c. The dashed red line in c represents the maximum deviation of 0.3 that the average deviation had to fall below.
Figure 8 shows the robot's trajectory under the obstacle-avoiding sub-controller. It manages to efficiently avoid obstacles while moving around the scene. However, there is still one rarely encountered case in which it fails. This can also be attributed to the constant forward velocity of the robot. When driving directly toward a corner, the robot detects the obstacle too late, such that there is not enough space for taking a turn in either direction, which in turn leads to a collision with an obstacle. This is shown in Figure 8b. It can be solved in a similar way as for the goal-approaching sub-controller by simply relating the default forward velocity to the sensor readings, causing a decrease in speed when getting close to an obstacle:

\[
v_{forward} = v_{init} \times \frac{1}{6}\sum_{i=1}^{6} S_i ,
\tag{1.23}
\]
Figure 8. (a) Trajectories of the robot controlled by the obstacle-avoiding sub-controller from a starting position. (b) The robot fails to avoid the obstacle. (c) The robot successfully avoids the obstacle when adjusting its forward velocity according to Eq. 1.23.
This adjustment results in the robot needing much less space for a turning maneuver and therefore allows for a smoother trajectory when avoiding obstacles. The resulting path solving this problem is shown in Figure 8c.
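Both velocity adaptations are one-liners in code; v_init is the initial forward velocity, d the distance from Eq. 1.18, and s the list of the six sensor readings. The function names are illustrative.

def forward_velocity_near_goal(v_init, d):
    # Eq. 1.22: slow down linearly within 1 m of the target centre
    return v_init * d if d < 1.0 else v_init

def forward_velocity_near_obstacle(v_init, s):
    # Eq. 1.23: scale the speed with the mean sensor reading
    # (readings near 0 indicate a close obstacle, so the robot slows down)
    return v_init * sum(s) / 6.0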
1.4.4. Overall Performance
Combining the two sub-controllers, the target-reaching controller exhibits successful goal-approaching behavior while avoiding obstacles in its path. Figure 9 illustrates the path of the robot controlled by the target-reaching controller for different scenarios. In Figure 9a, the robot maneuvers through a pool of obstacles when there is not enough space to turn toward the target. In the beginning, the robot tries to make a right turn before detecting a wall. It then moves away from it and turns to face the target. Figure 9b shows its behavior when the target is located behind a big concave-shaped obstacle. It avoids it with a left turn and then turns 180° to maneuver around it and finally reach the target. As can be observed, the robot manages to quickly reach the target in every scenario while avoiding all obstacles in its path. The previous results and performances show that the target-reaching controller as well as the embedded sub-controllers exhibit the desired behavior after being trained with the proposed learning rule. While the accuracy of both sub-controllers rises to a value higher than 95%, the error rates stagnate at approximately 10%. This is due to the relatively high step size (dt = 1.0 ms) of the simulated SNNs and therefore the limited number of spike times, which only allows for so much precision. This step size, however, increases the speed of the simulation, resulting in more updates per second. A higher precision could be achieved by lowering the step size, given that computation speed is not an issue. Moreover, considering that the target output angles provided by the dataset used to train the obstacle-avoiding sub-controller are meant as points of reference and are at a minimum 20° apart from each other, an average error of 10° is in theory acceptable for outputting all angles between -90° and +90°. Apart from this network being used to train two sub-controllers with multiple outputs on different tasks, it is also shown that the backpropagation of the rewards works well and that the rewards can be easily assigned to each side's synaptic connections, effectively resulting in the training of two different SNNs on a one-dimensional output. This underlines the ability of the proposed approach to effectively train a network on two different
outputs at the same time, yielding similar results to a network trained for a single output. The problems encountered while testing the performance of the sub-controllers can all be attributed to the constant forward velocity and could therefore be solved easily. However, the target-reaching controller does have its flaws. First, a drawback of this training procedure is the need for a reference controller or dataset providing the SNNs with a desired optimum output from which to calculate the reward. In neuroscience, however, the learning rule only requires some kind of mechanism determining whether the SNN's outputs are too low or too high. The second limitation of the controller can be observed in Figure 9b. Even though obstacle avoiding and goal approaching work well on their own, the target-reaching controller does not unify the different sub-controllers' outputs and exclusively either avoids obstacles or approaches the target. This is because the obstacle-avoiding controller is meant to choose the smaller of the two output angles for more efficient turning. Under some circumstances, however, the larger turning angle leads closer to the target. Therefore, the robot takes some unnecessary steps to reach the final target.
Figure 9. (a,b) The trajectories of the robot controlled by the TR controller for different starting and target positions.
1.5. Perspectives and Limitations of SNN-Based Supervised Controllers for Performing Target-Reaching Tasks
The first section of this review presented an approach for quickly building an SNN-based controller for robotic implementations. This approach first uses a model-based control method to shape a desired behavior of the robot as a dataset and then uses this dataset to train an SNN based on supervised learning. A robot navigation task is presented as a case study. Specifically, pre-acquired knowledge has been used for training an SNN with R-STDP synapses to achieve the desired functions. The reward can be assigned properly to all the synapses in an SNN constructed with a hidden layer. The SNN-based controller can quickly assemble the knowledge from the dataset and exhibits adaptiveness in unknown environments. The motivation of this review is to present an alternative way to train SNNs quickly for practical implementations, where SNN-based controllers are expected to exhibit their advantages on neuromorphic hardware. The presented approach requires a pre-acquired dataset to train the SNNs off-line within the supervised learning framework. However, this problem is expected to be solved when the network is equipped with memory-like functions to store its knowledge and to train itself at the same time, or afterwards in a semi-supervised or reinforcement manner. Teaching a brain-inspired spiking neural network in a general and easy way is not simple. This problem is tackled by proposing an end-to-end learning rule based on the supervised R-STDP rule and using it to train two SNNs for an autonomous target-tracking implementation. By simply changing the inputs fed into the network and slightly changing the way the reward is assigned to the output neurons, two SNNs were trained to exhibit the desired behavior successfully, and the robot was able to reach a previously set target area while avoiding obstacles. This approach not only offers a general-purpose training framework for SNNs with multiple outputs and hidden layers but also indicates how the reward can be properly backpropagated through them. Together with this, the basic idea of this learning rule also allows for potentially greatly increasing the
energy efficiency of SNNs by making them able to learn with and operate on very few and even single spikes per time window. The insights gained could help to further improve this concept up to the point of creating a general-purpose and easy-to-use spiking neural network design for training and energy-efficient control of autonomous mobile robots.
2. APPLYING SYMMETRIC STDP RULES TO DEFINE A BIOLOGICALLY PLAUSIBLE SUPERVISED LEARNING METHOD FOR SNNS
SNN training methods can basically be categorized into two classes: backpropagation-like training methods and plasticity-based learning methods. The former depend on energy-inefficient real-valued computation and non-local transmission, as also required in ANNs, while the latter are either considered biologically implausible or exhibit poor performance. Therefore, providing bio-plausible, high-performance supervised learning methods for SNNs remains challenging. In this section, we review a novel bio-plausible SNN model for supervised learning based on the symmetric spike-timing dependent plasticity (sym-STDP) rule found in neuroscience. Combining the sym-STDP rule with bio-plausible synaptic scaling and the intrinsic plasticity of a dynamic threshold increases the learning performance of the network in the benchmark handwritten digit recognition task. Visualizing both synaptic weights and layer-based activities after training with the t-distributed stochastic neighbor embedding (t-SNE) method shows distinct clusters, thereby confirming excellent classification ability. As the learning rules are bio-plausible and based purely on local spike events, this model could easily be applied to neuromorphic hardware for online training and may be helpful for understanding supervised information processing at the synaptic level in biological neural systems.
2.1. Network Architecture and Neuronal Dynamics A three-layer feedforward spiking neural network is constructed for supervised learning [28], which included an input layer, hidden layer, and supervised learning layer (Figure 10a). The structure of the first two layers was inspired by the previous model of Diehl and Cook [29]. Input patterns were coded as Poisson spike processes with firing rates proportional to the intensities of the corresponding pixels. The Poisson spike trains were then fed to the excitatory neurons in the hidden layer with all-to-all connections. The dark blue shaded area in Figure 10b shows the input connection to a specific neuron. The connection from the excitatory neurons to inhibitory neurons was one-to-one. An inhibitory neuron only received input from the corresponding excitatory neuron at the same position in the map and inhibited the remaining excitatory neurons. All excitatory neurons were fully connected to the neurons in the supervised learning layer. In the supervised layer, neurons fired with two different modes during the training and testing processes. During the training period, the label information of the current input pattern was converted to a teacher signal in a one-hot coding scheme by the ten supervised neurons. Only one supervised neuron was pushed to fire as a Poisson spike process, with the remaining supervised neurons inactivated in the resting state. In the testing mode, all supervised neurons fired according to inputs from the hidden layer.
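The connectivity described above can be sketched with NumPy weight matrices. The sizes, weight bounds, and the initialization in [0, 0.3·wmax] follow the parameter description given in Section 2.2; the fixed strengths of the excitatory-to-inhibitory and inhibitory-to-excitatory connections are illustrative assumptions.

import numpy as np

def build_connectivity(n_input=784, n_exc=400, n_out=10,
                       w_max_in=1.0, w_max_out=8.0, seed=0):
    rng = np.random.default_rng(seed)
    # input -> excitatory: all-to-all
    w_input_exc = rng.uniform(0.0, 0.3 * w_max_in, size=(n_input, n_exc))
    # excitatory -> inhibitory: one-to-one
    w_exc_inh = np.eye(n_exc)
    # inhibitory -> excitatory: each inhibitory neuron inhibits all *other* excitatory neurons
    w_inh_exc = 1.0 - np.eye(n_exc)
    # excitatory -> supervised layer: all-to-all
    w_exc_out = rng.uniform(0.0, 0.3 * w_max_out, size=(n_exc, n_out))
    return w_input_exc, w_exc_inh, w_inh_exc, w_exc_out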
Figure 10. (a) Network structure. (b) Schematic diagram of the classic STDP [30] and DA-STDP [31, 32].
The membrane potential (V) dynamics of the individual neurons can be described by [29, 33]:

\[
\tau \frac{dV}{dt} = (E_{rest} - V) + g_E (E_E - V) + g_I (E_I - V),
\tag{2.1}
\]

where gE and gI are the total excitatory and total inhibitory conductance, respectively, Erest is the resting membrane potential, EE and EI are the equilibrium potentials of the excitatory and inhibitory synapses, respectively, and τ is the time constant of membrane potential damping. The variables gE and gI in the leaky integrate-and-fire (LIF) model can be described by the following equations [29, 33]:

\[
\tau_{g_E} \frac{dg_E}{dt} = -g_E + \sum_{i=1}^{N_P} w_i^{EP} \sum_{k} \delta(t - t_i^k),
\tag{2.2}
\]

\[
\tau_{g_I} \frac{dg_I}{dt} = -g_I + \sum_{i=1}^{N_E} w_i^{EI} \sum_{k} \delta(t - t_i^k),
\tag{2.3}
\]
where NE and NP are the number of excitatory and input neurons, 𝑤𝑖𝐸𝐼 and 𝑤𝑖𝐸𝑃 are the input synapse weights of the inhibitory and excitatory neurons, τgE and τgI are the time constants of synapse conductance damping, and 𝑡𝑖𝑘 is the kth spike time from the ith neuron. Synaptic weights are modified according to the two biological plasticity rules, i.e., dopamine-modulated STDP (DA-STDP) and synaptic scaling. DA-STDP rule is a new type of symmetric STDP rule, which modifies the synaptic weight if the interval between the pre- and post-synaptic spike activities is within a narrow time-window when dopamine (DA) is present. Dopamine is an important neuromodulator and plays a critical role in learning and memory processes [34]. Here, inspired by the DA-STDP found in different brain areas, such as the prefrontal cortex and hippocampus [31, 32, 35] (Figure 10b), it is hypothesized that DA can modulate changes in synaptic weights according to the DA-STDP rule during the supervised learning process. The phenomenological model of DA-STDP can be expressed as [31, 32]:
\[
\Delta w =
\begin{cases}
A_{+} \exp\!\left(\dfrac{-\Delta t}{\tau_{+}}\right), & \Delta t > 0,\\[2mm]
A_{-} \exp\!\left(\dfrac{\Delta t}{\tau_{-}}\right), & \Delta t < 0,
\end{cases}
\tag{2.4}
\]
where Δw is the weight increment, Δt is the time difference between the pre- and post-synaptic spikes, τ+ and τ- are the time constants of the positive and negative phases of Δt, respectively, and A+ and A- are the learning rates, with A+, A- ≥ 0. Since DA-STDP can only increase synaptic strength, the synaptic scaling plasticity rule is used to introduce a competition mechanism among all input synapses of a neuron in the hidden layer (only for excitatory neurons) and the supervised layer. Synaptic scaling is a homeostatic plasticity mechanism observed in many experiments [36-38], especially in visual systems [39-41] and the neocortex [42]. Here, synaptic scaling is conducted after each pattern is trained, and the weights are adjusted according to the following equation [29]:

\[
w_{new} = w_{old}\, \frac{\alpha N_{in}}{\sum w_{old}},
\tag{2.5}
\]
where α is the scaling factor with α ∈ (0, 1) and Nin is the number of in-synapses of a neuron; the sum runs over all incoming weights of that neuron. A dynamic-threshold homeostatic plasticity mechanism is also adopted. The dynamic threshold is an intrinsic plasticity of a neuron found in different neural systems [36, 43-46]. It is introduced so that each excitatory neuron in the hidden layer generates a specific response to one class of input patterns [29, 45]; otherwise a single neuron would dominate the response pattern owing to its enlarged in-synaptic weights and the lateral inhibition. That is, the firing threshold of a neuron is increased by a certain amplitude after the neuron spikes and decays otherwise, which can be described as:

\[
\frac{d\theta}{dt} = -\frac{\theta}{\tau} + \Delta\theta \sum_{k} \delta(t - t_k),
\tag{2.7}
\]

where Δθ is a function of the current θ and is set to \(\theta_{+}\cdot \theta_{initial} / |2\theta - \theta_{initial}|\) to avoid excessive growth.
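A compact Python sketch of the three plasticity rules (Eqs. 2.4, 2.5, and 2.7) is given below. The time constants and the event-driven form of the threshold update are illustrative assumptions; only A+ = A- = 0.001 (the in-synaptic learning rate) and the structure of the update rules follow the text.

import numpy as np

def da_stdp(delta_t, a_plus=0.001, a_minus=0.001, tau_plus=20.0, tau_minus=20.0):
    # Eq. 2.4: symmetric, always-potentiating weight increment (delta_t in ms)
    if delta_t > 0:
        return a_plus * np.exp(-delta_t / tau_plus)
    if delta_t < 0:
        return a_minus * np.exp(delta_t / tau_minus)
    return 0.0

def synaptic_scaling(w_in, alpha=0.1):
    # Eq. 2.5: rescale all in-synapses of a neuron so that they sum to alpha * N_in
    return w_in * alpha * w_in.size / w_in.sum()

def update_threshold(theta, spiked, theta_plus, theta_initial, tau_theta, dt=1.0):
    # Eq. 2.7: theta decays continuously and grows on every spike,
    # with the increment damped by |2*theta - theta_initial|
    theta += -theta / tau_theta * dt
    if spiked:
        theta += theta_plus * theta_initial / abs(2.0 * theta - theta_initial)
    return theta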
2.2. Performance of the Recognition Task
The SNN model is trained on the MNIST dataset (training set: 60,000 samples; test set: 10,000 samples) using two training methods: 1) simultaneous and 2) layer-by-layer training. For simultaneous training, the supervised learning layer and the in-synapses of the hidden layer are updated simultaneously during the training process, whereas for layer-by-layer training the hidden layer is trained first and the supervised learning layer is then trained with all in-synaptic weights of the hidden layer fixed. There is no preprocessing of the data, and an SNN simulator (GeNN) [47] is used to simulate all experiments. More details about the parameter settings can be found in [28]. The in-synaptic and out-synaptic weights of the excitatory neurons in the hidden layer are restricted to the ranges [0, 1] and [0, 8], respectively. All initial weights are set to the corresponding maximum weights multiplied by uniformly distributed values in the range [0, 0.3]. For the learning rates of STDP, A+ and A- are set to the same value, which equals 0.001 and 0.002 for the in-synaptic and out-synaptic weights of the hidden layer, respectively. The firing rates of the input neurons are proportional to the intensity of the pixels of the MNIST images [29]. The maximum rates are set to 63.75 Hz, obtained by dividing the maximum pixel intensity of 255 by 4. When fewer than five spikes are found in the excitatory neurons of the hidden layer during 350 ms, the maximum input firing rates are increased by 32 Hz. To demonstrate the power of the proposed supervised learning method for different network sizes, the results of the presented learning algorithm are compared to those of the ‘Label Statistics’ learning algorithm used in previous research [29]. The ‘Label Statistics’ method has two phases, named ‘label’ and
‘statistics’. Specifically, after the unsupervised training using STDP, each excitatory neuron in the hidden layer has a specific receptive field and responds to a given sample to varying degrees. In the ‘label’ phase, each neuron is then assigned the label of the samples to which it responds most strongly. In testing, the ‘statistics’ phase accumulates the spike counts of all neurons sharing a label as the likelihood of that class. Finally, the label with the most votes is the predicted label.
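For comparison, the ‘Label Statistics’ procedure of Diehl and Cook can be sketched as follows. Here spike_counts is an array of shape (n_samples, n_exc) holding hidden-layer responses, and labels is the matching array of class labels; the helper names are illustrative.

import numpy as np

def assign_labels(spike_counts, labels, n_classes=10):
    # 'label' phase: each hidden neuron is tagged with the class to which it responds most strongly
    n_exc = spike_counts.shape[1]
    mean_per_class = np.zeros((n_classes, n_exc))
    for c in range(n_classes):
        mean_per_class[c] = spike_counts[labels == c].mean(axis=0)
    return mean_per_class.argmax(axis=0)  # one class label per neuron

def classify(spike_counts, neuron_labels, n_classes=10):
    # 'statistics' phase: accumulate the spike counts of all neurons sharing a label and vote
    votes = np.zeros((spike_counts.shape[0], n_classes))
    for c in range(n_classes):
        votes[:, c] = spike_counts[:, neuron_labels == c].sum(axis=1)
    return votes.argmax(axis=1)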
Figure 11. (a,b) Convergence property of networks with sizes Ne = 400 and 6400. The ‘Layer-by-Layer’ curve in each figure represents supervised learning (SL) performance under layer-by-layer training using STDP to train in-synaptic weights of the hidden layer in unsupervised style and then train out-synaptic weights in the supervised style. (c) Confusion matrix of test dataset results in the supervised learning layer for the network with size Ne= 6400. The darker pixel indicates stronger consistency between desired (label) and inferred (real) outputs. Data are obtained for re-trained supervised learning, as shown in b.
The model is trained with different numbers of epochs of the training dataset for different network sizes (3, 5, 7, 10, and 20 epochs for Ne = 100, 400, 1600, 6400, and 10000, respectively). During the training process, the network performance was estimated on the 10,000 samples of the test dataset. Taking networks with sizes Ne = 400 and 6400 as examples, the classification accuracies converged quickly under the two training methods (see Figure 11). Figure 11c illustrates the very high consistency between the desired (label) and inferred (real) outputs in the supervised learning layer. The results obtained by the different learning methods are summarized in Table 1.
The presented supervised learning model outperformed the ‘Label Statistics’ method for small-sized networks (Ne ≤ 1600), except for the re-training case with Ne = 100, and its performance was only slightly lower than that of the ‘Label Statistics’ method for large-sized networks (Ne = 6400 and 10000). The best performance of the supervised learning model (96.56%) is achieved in the largest network under layer-by-layer training. These results indicate that an SNN equipped with biologically realistic plasticity rules can achieve good supervised learning by pure spike-based computation.

Table 1. Performance for different sized networks with different training methods

Network size (Ne)                                   100       400       1600      6400      10000
Simultaneous training     Label Stat.*              83.11%    90.89%    91.18%    96.29%    96.81%
                          Supervised learning*      83.11%    91.31%    91.33%    95.81%    95.82%
Layer-by-layer training   Label Stat.               83.22%    91.11%    91.91%    96.34%    96.97%
                          Supervised learning       83.20%    91.55%    92.33%    96.17%    96.56%

* Label Stat. represents classification by the ‘Label Statistics’ method in the hidden layer (more details can be found in [29]). Each pair of accuracies under the ‘Label Statistics’ method and the supervised learning method is measured at the same time in the training process. The highest accuracy in each training trial is reported here.
2.3. Clustering Ability of the Model
To determine the underlying mechanisms of the supervised learning model in pattern recognition tasks, the popular dimension-reduction method t-distributed stochastic neighbor embedding (t-SNE) [48] is adopted to reveal the model's clustering ability. The t-SNE is a nonlinear dimensionality-reduction method widely used for visualizing high-dimensional data in a low-dimensional space of two or three dimensions. The original digit patterns are visualized in Figure 12a. Additionally, the spike activities of the hidden layer and the spike activities of the supervised learning layer for all samples in the test dataset are shown in Figure 12b and Figure 12c, respectively. The separability of the output information of the three layers of the model increases from the input layer to the supervised learning layer, which
indicated that the supervised learning layer served as a good classifier after training. To demonstrate why the applied supervised learning method achieved effective clustering for the hidden layer outputs, the t-SNE method is adopted to reduce the dimensions of the out-synaptic weights of the
Figure 12. Visualization of the original digit patterns (a), spike activities of the hidden layer (b), spike activities of the supervised layer (c), and clustering ability of the out-synapses of excitatory neurons in the hidden layer (Ne = 6400) to the supervised learning layer (d), using the t-SNE method. Each dot represents a digit sample and is colored by its corresponding label information. The clustering of the out-synaptic weights of the excitatory neurons is highly consistent with the clustering of their labels.
excitatory neurons. As shown in Figure 12d, the clustering of the out-synaptic weights of the excitatory neurons is highly consistent with the clustering of their label information obtained with the ‘Label Statistics’ method. This explains why the supervised learning method achieves comparably good performance to the ‘Label Statistics’ method on the classification task, even though the model does not require the ‘Label Statistics’ computation outside the network to calculate the most likely representation of a hidden neuron [29] and realizes the supervised learning process based solely on computation within the network.
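The visualizations in Figure 12 can be reproduced with scikit-learn's t-SNE implementation. In the sketch below, X may be the raw pixel vectors, the per-sample spike-count vectors of a layer, or the out-synaptic weight vectors of the excitatory neurons, and y the corresponding labels; the parameter choices are illustrative.

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(X, y, title):
    # reduce the high-dimensional vectors to 2-D and colour the points by label
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(X)
    plt.figure(figsize=(5, 5))
    plt.scatter(emb[:, 0], emb[:, 1], c=y, s=3, cmap="tab10")
    plt.title(title)
    plt.show()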
2.4. Comparison to Other SNN Models
Current SNN models for pattern recognition can be generally categorized into three classes: indirect training [9, 49-52], direct supervised training with BP [53, 54], and plasticity-based unsupervised training with supervised modules [29, 55]. Table 2 summarizes several previous SNN models trained and tested on the full training and testing sets of the dataset. In the current model, bio-plausible neuroplasticity rules (i.e., sym-STDP and synaptic scaling) and local spike-based computation are used to accomplish supervised learning, without the need for error computation or backpropagation. Thus, the proposed method provides a novel model framework to achieve efficient supervised learning in SNNs.

Table 2. Comparison of classification performance of SNN models on the recognition task

Network model          Learning method                                    Training type   (Un-)Supervised   Accuracy
Deep LIF SNN [50]      BP-ANN conversion                                  Rate-based      Supervised        98.37%
CSNN [51]              BP-ANN conversion                                  Rate-based      Supervised        99.1%
Chip-based SNN [9]     BP-ANN conversion                                  Rate-based      Supervised        99.42%
SDRN [56]              BP-ANN conversion                                  Rate-based      Supervised        99.59%
Sym-STDP SNN [28]      Sym-STDP, synaptic scaling, dynamic threshold      Spike-based     Supervised        96.56%

LIF, leaky integrate-and-fire; CSNN, convolutional spiking neural network; BP, backpropagation; MLHN, multi-layer hierarchical network; SDRN, spiking deep residual network; VDPC, voltage-driven plasticity-centric; STBP, spatio-temporal backpropagation; BA, broadcast alignment; SDNN, spiking deep neural network; R-STDP, reward-modulated STDP; SVM, support vector machine.
2.5. Comparison to STDP-Based Models with or without Backpropagation
In previous studies with indirect training, ANNs were trained using the BP algorithm based on activity rates and then transformed into equivalent SNNs based on firing rates. Although their performance was very good, these approaches ignore the temporal evolution of SNNs and spike-based learning processes. Thus, indirect training provides very little insight into how SNNs learn and encode different features of the inputs. Of the studies using direct supervised training, most adopted the BP algorithm and calculated errors based on continuous variables (currents, membrane potentials, or activity rates) to approximate spike activities and realize supervised learning [53, 57-61]. For example, Zhang et al. proposed a voltage-driven plasticity-centric SNN for supervised learning [59], with four learning stages required for training, i.e., equilibrium learning and voltage-based STDP for unsupervised learning and BP for the final
supervised learning. However, this model is highly dissimilar to biological neuronal systems and is energy inefficient. Lee et al. pre-trained multi-layer SNN systems with STDP in an unsupervised manner to obtain optimal initial weights and then used current-based BP to re-train the weights of all layers in a supervised way [61]. However, this model is bio-implausible, requires non-local computation, and is also energy inefficient. Many studies have also used STDP-based training methods without BP. These models adopted STDP-like plasticity rules for unsupervised learning and required a special supervised module for supervised learning, e.g., a classifier (SVM) [63], artificial label statistics outside the network [29, 55, 62], or an additional supervised layer [54, 64-66]. However, the first two supervised modules are bio-implausible because their computation takes place outside the SNN, so the SNN itself has no direct relationship with the supervised learning [29, 55, 63]. For example, Beyeler et al. adopted a calcium-concentration-based STDP for supervised learning [54]; although it was inspired by experiments, it showed considerably poorer performance than the present model. Hu et al. used an artificially modified STDP with a special temporal learning phase for supervised learning [65]; however, their STDP rule was artificially designed and its internal mechanism was not well explained. Another group adopted a modified STDP rule with exponential weight change and an extended depression window for supervised learning in an SNN, but its performance was relatively poor (less than 90%) [64]. Mozafari et al. used a mix of classic STDP and anti-STDP to obtain a reward-modulated STDP with a remote supervised spike, but they were not able to provide biological evidence to explain this type of learning [66].
of labeled neurons for inference in the test process. In the reviewed model, however, the same algebraic computation and reasoning is achieved using an additional layer of spiking neurons instead of the outside-network computations, which made a great progress of STDP-based SNN model for supervised learning due to the completely spike-based computation. Moreover, there are two other improvements compared to that of Diehl and Cook [29]. The first improvement was the novel sym-STDP rule rooted in DA-STDP, with DA-STDP able to give a potential explanation for the supervised learning processes occurring in the brain. It is assumed that DA may be involved in the supervised learning process and that local synaptic plasticity could be changed to sym-STDP during the whole training process. With the aid of the forced firing of a supervised neuron by the incoming teacher signal, sym-STDP could establish the relationship between the input and its teacher information after sufficient training. The second improvement is the new dynamic threshold rule, in which the multiplicative learning rate is replaced by the fixed learning rate of the threshold increment, which significantly improved performance. It should be noted that, there are also a few SNN training models that can do classification tasks in other ways, but their recognition performance is not very good [72, 73]. For example, recently, Xu et al. constructed a novel convolutional spiking neural network (CSNN) model by using the tempotron as the classifier, and attempted to take advantage of both convolutional structure and SNN’s temporal coding ability [74]. But their model’s performance was not good and only reached the maximal accuracy of 88% on a subset of dataset (Training samples: 500; Test samples: 100) when the network size equalled 1200, in contrast with the accuracy of 91.55% in the reviewed case with Ne = 400 on the full dataset. This reveals that current model could also work very well under the small network size constaint. Thus, for the above reasons, the proposed sym-STDP based SNN model could solve the lack of bio-plausible and high-performance SNN methods for spike-based supervised learning.
2.6. SNN-Based Supervised Learning and Handwritten Digit Recognition Tasks
A neural network model with biological plausibility must meet three basic requirements: the ability to integrate temporal input and generate spike output, spike-based computation for training and inference, and learning rules rooted in biological experiments. In this review, the LIF neuron model is used, all learning rules (sym-STDP, synaptic scaling, and dynamic threshold) have been observed in biological experiments, and the computation is based on spikes. Thus, the proposed SNN model meets all of the above requirements and is a truly biologically plausible neural network model. But how does the model obtain good pattern recognition performance? This is mainly because the three learning rules work synergistically to achieve feature extraction and the expected relationship between input and output. The sym-STDP rule has the considerable advantage of extracting the relationship between spike events regardless of their temporal order in two connected neurons, while synaptic scaling stabilizes the total in-synaptic weight and creates weight competition among the in-synapses of a neuron to ensure that a suitable group of synapses becomes strong. Furthermore, the dynamic threshold mechanism compels a neuron to fire for matched patterns but rarely for unmatched ones, which generates neuron selectivity to a special class of patterns. By combining the three bio-plausible plasticity rules, the reviewed SNN model establishes a strong relationship between the input signal and the supervised signal after sufficient training, ensuring an effective supervised learning implementation and good performance in the benchmark handwritten digit recognition task. The proposed model also obtains good performance when training the two layers synchronously, whereas many previous SNN models can only be trained using layer-by-layer or multi-step/multi-phase methods [57, 59, 60, 63, 65]. In the current SNN model, DA was found to be a key factor for achieving supervised learning. Dopamine plays a critical role in different learning processes and can serve as a reward signal for reinforcement learning [67-70]. A special form of dopamine-modulated STDP, different from the
symmetric one investigated here, had previously been applied to realize reinforcement learning in SNNs [71], but there is no direct experimental evidence for that kind of STDP rule. This further indicates the potentially diverse functions of DA in regulating neural networks for information processing. It is worth noting that the sym-STDP rule is a spiking version of the original Hebbian rule, that is, ‘fire together, wire together’, and a rate-based neural network with the Hebbian rule, synaptic scaling, and a dynamic bias could be expected to have similar classification ability to the presented model. However, the performance of the rate model may not be as high as that reported here; further exploration of a simplified rate model with the original Hebbian rule is needed. Since the plasticity rules used here are based purely on local spike events, in contrast to the BP method, the model not only has the potential to be applied to other machine learning tasks under the supervised learning framework but may also be suitable for online learning on programmable neuromorphic chips. Moreover, the hypothesis about the function of DA in supervised learning could serve as a potential mechanism for the synaptic information processing of supervised learning in the brain, which will need to be verified in future experiments.
3. LEARNING STRUCTURE OF SENSORY INPUTS WITH SYNAPTIC PLASTICITY LEADS TO SYNAPTIC INTERFERENCE Synaptic plasticity is often explored as a form of unsupervised adaptation in cortical microcircuits to learn the structure of complex sensory inputs and thereby improve performance of classification and prediction. The question of whether the specific structure of the input patterns is encoded in the structure of neural networks has been largely neglected. Existing studies that have analyzed input-specific structural adaptation have
used simplified, synthetic inputs in contrast to complex and noisy patterns found in real-world sensory data. In this section, input-specific structural changes are analyzed for three empirically derived models of plasticity applied to three temporal sensory classification tasks that include complex, real-world auditory and visual data. Two forms of spike-timing dependent plasticity (STDP) and the Bienenstock-Cooper-Munro (BCM) plasticity rule are used to adapt the recurrent network structure during the training process before performance is tested on the pattern recognition tasks. It is shown that synaptic adaptation is highly sensitive to specific classes of input pattern. However, plasticity does not improve the performance on sensory pattern recognition tasks, partly due to synaptic interference between consecutively presented input samples. The changes in synaptic strength produced by one stimulus are reversed by the presentation of another, thus largely preventing input-specific synaptic changes from being retained in the structure of the network. To solve the problem of interference, Chrol-Cannon and Jin suggested that models of plasticity be extended to restrict neural activity and synaptic modification to a subset of the neural circuit [75], which is increasingly found to be the case in experimental neuroscience. Recurrent neural networks consisting of biologically based spiking neuron models have only recently been applied to real-world learning tasks under a framework called reservoir computing [76, 77]. The models of this framework use a recurrently connected set of neurons driven by an input signal to create a non-linear, high-dimensional temporal transformation of the input that is used by single layer perceptrons [78] to produce desired outputs. This restricts the training algorithms to a linear regression task, while still allowing the potential to work on temporal data in a non-linear fashion. Reservoir computing is based on the principle of random projections of the input signal in which the network structure is completely independent of the input patterns. In these models, the only features learned by the trainable parameters of the perceptron readout are the correlations between the randomly projected features and the desired output signal. It is believed that learning in neural networks should go further than supervised training based on error from the output. All synapses should adapt to be able to encode the structure of the input signal and ideally, should not rely on the
presence of a desired output signal from which to calculate an error with the actual output. The neural activity generated by the input signal should provide enough information for synapses to regulate and encode properties of the signal in the network structure. By applying unsupervised adaptation to the synapses in the form of biologically derived plasticity rules [79-81], the aim is to provide the means for the recurrently connected neurons of the network to learn a structure that generates more effective features than a completely random projection that is not specific to the input data. On a conceptual level, unsupervised learning is important for understanding how synaptic adaptation occurs, because it is still unknown what the sources of supervised signals in the brain are, if any exist. From early work on synaptic self-organization [82, 83], the principle of learning has rested on correlations in neural activity becoming associated and forming assemblies that activate simultaneously. These structures are thought to encode invariances in the sensory input that are key in developing the ability to recognize previously encountered patterns. In this section we review the impact of applying several biologically derived plasticity mechanisms on three temporal sensory discrimination tasks. Two forms of spike-timing dependent plasticity (STDP) [30, 81] are tested, along with the Bienenstock-Cooper-Munro (BCM) rule [79]. The sensory tasks include real-world video of human motion and speech data. Synaptic plasticity is applied in an unsupervised pre-training step, before the supervised regression of the perceptron readout occurs. The impact that plasticity has on the performance in these tasks is compared. In addition, the specific structural adaptation of the weight matrices between each of the classes of input samples in each task is analyzed. A method is introduced to evaluate the extent to which the synaptic changes encode class-specific features in the network structure. Interference between different samples is a well-established phenomenon in sequentially trained learning models [84]. Presenting an input pattern to a learning model causes specific changes to be made in the model's parameters (in the case of neural networks, the synapses). However, during this encoding process, existing structure in the synaptic values is interfered with. In this way, consecutive input patterns disrupt previously
learned features, sometimes completely. This effect is known as the forgetting phenomenon. It is of direct concern to neural networks trained on sensory recognition tasks that consist of spatio-temporal patterns projected through a common neural processing pathway. The level of interference among the synaptic parameters is quantified for each tested plasticity model applied to each type of sensory data. Some studies have reported that adapting neural circuits with plasticity improves their performance on pattern recognition tasks [85, 86]. However, there is no analysis of how the adaptation of the synaptic parameters leads to this result. On the other hand, work that does analyze the structural adaptation of the network in detail does so using synthetic input patterns that are already linearly separable [87] or Poisson inputs projecting to single and recurrently connected neurons [88]. The experiments undertaken in this section are performed on a typical reservoir computing model with its recurrent connections adapted by plasticity. Two main aspects are analyzed: 1) the strength of input-specific synaptic adaptation and 2) the extent to which consecutive inputs interfere within the synapses. Both are assessed by analyzing the change in the weight matrix in response to each pattern.
3.1. Simulation Procedure, Network Structure and Plasticity Model
The three-step procedure for training a typical liquid state machine (LSM) with plasticity is depicted in Figure 13 and described in Algorithm 2. Some of the expressions in the pseudocode refer to equations that can be found in the subsequent subsections, where the models of the neurons, connectivity, plasticity and input preprocessing are also given. In the first step, the recurrent synaptic connections are adapted with plasticity. Input samples are selected at random (scrambled) for a total of preTrainIterations = 10,000 iterations. For a single input sample, each of the time-series frames is presented to the network in sequence by
setting the input current of the connected neurons to Win[x][c] · S[f][x] · inputScale. The inputScale is selected based on the neuron membrane model. The neural activity of the network is then simulated for a fixed frameDuration of about 30 ms. Plasticity is calculated and the weights are updated between each frame of input in a sample. Neural activity is reset before the next input sample. Secondly, the reservoir states for each sample are collected.
Figure 13. Three step process describing a reservoir computing model extended by having the recurrent connections adapted with unsupervised plasticity in a pre-training phase. First, input samples I are presented in random order while the resulting neural activity drives synaptic adaptation under plasticity. Second, each input sample is presented in sequence with the resulting neural activity decoded into a series of state vectors S. Finally, the state vectors are used as the input to train a set of perceptron readouts, one to recognize each class of sample, Cx.
The simulation procedure is essentially the same as in pre-training but iterates once for each sample in the dataset. Activity feature vectors are stored in S.fv and the weight matrix adaptation in S.dw. Finally, to determine the pattern recognition performance of the LSM, a set of readouts is trained using least mean squares regression. There is one readout to predict the presence of each possible class of input. For a total of readoutTrainIterations iterations, a randomly selected sample's state vector fv is used to adapt the readout weights. The desired signal is set to 1 for the readout matching the sample class and 0 for the others. To predict class labels on the training and testing data, the readout with the maximum value for a given fv is selected to predict the class (winner takes all). The neural network model used in this section is illustrated in Figure 14. Recurrently connected neurons, indicated by L, are stimulated by current I
that is the sum of the injected current from the input signal, Iinj, and the stimulating current from the pre-synapses, Irec. The total current I perturbs the membrane potential, which is modeled with a simple model that matches neuron spiking patterns observed in biology [89]. This method for modeling the spiking activity of a neuron is shown to reproduce most naturally occurring patterns of activity [90]. The real-valued inputs are normalized between 0 and 1 and multiplied by a scaling factor before being injected as current into L. The number of input connections is 0.2 · (network size), projected randomly onto the network nodes. Weights are uniformly initialized at random between 0 and 1. The utilized video data set consists of significantly higher-dimensional inputs—more features—than the other data sets. Therefore, in this case each feature only projects to one neuron, initially selected at random (a neuron can have connections from multiple inputs). The network activity dynamics are simulated for 30 ms for each frame of data in a time-series input sample. This value is chosen as it roughly approximates the actual millisecond delay between digital audio and video data frames. Then, the resulting spike trains produced by each of the neurons are passed through a low-pass filter, f, to produce a real-valued vector used to train a linear readout with the iterative, stochastic gradient descent method that is described below.
Figure 14. Schematic representation of the elements of the recurrent network model. I is a multi-dimensional input signal, the x vector is the neural activation state, f is the filtering of the spike trains, L nodes constitute the recurrent network, and y is the output after weighting and summation.
Algorithm 2. Three-step procedure for training a typical liquid state machine with plasticity.

1 // pre-train recurrent neurons with plasticity
for each iteration I in preTrainIterations
    select random sample S from trainingSamples
    for each frame f in S
        for each attribute x in f
            for each connection c in Cin
                c.input(Win[x][c] · S[f][x] · inputScale)
        for each timestep t in frameDuration
            neurons.simulateActivity()    // Equations 3.1, 3.2, 3.3
            synapses.applyPlasticity()    // Equations 3.7, 3.8, 3.9
    neurons.resetActivity()

2 // collect neural activation state vectors
baseWeights.value ← synapses.value
for each sample S in trainingSamples
    for each frame f in S
        for each attribute x in f
            for each connection c in Cin
                c.input(Win[x][c] · S[f][x] · inputScale)
        for each timestep t in frameDuration
            neurons.simulateActivity()    // Equations 3.1, 3.2, 3.3
            synapses.applyPlasticity()    // Equations 3.7, 3.8, 3.9
    S.fv ← neurons.filteredSpikes()       // Equation 3.4
    S.dw ← synapses.value − baseWeights.value
    neurons.resetActivity()
    synapses.value ← baseWeights.value

3 // train readouts with linear regression
for each iteration I in readoutTrainIterations
    select random feature vector fv from trainingSamples.fv
    for each class readout R in nClass
        if R.classLabel = fv.classLabel   // boost readout for matching class
            R.output ← R.lms(fv, 1)       // Equations 3.5, 3.6
        else                              // suppress other readouts
            R.output ← R.lms(fv, 0)       // Equations 3.5, 3.6
    prediction P ← max(R.output)
    if P.classLabel ≠ fv.classLabel
        errorSum ← errorSum + 1
    errorCumulative ← errorSum ÷ I
In the current experiments the network consists of 35 or 135 spiking neurons (weight matrix plots use 35, performance trials use 135) with a ratio of excitatory to inhibitory neurons of 4:1. Neurons are connected with static synapses, i.e., the delta impulse (step) function. Connectivity is formed by having N² · C synapses that each have source and target neurons drawn according to a uniform random distribution, where N is the number of neurons and C is equal to 0.1, the probability of a connection between any two neurons. Weights are drawn from two Gaussian distributions: N(−5, 0.5) for inhibitory and N(6, 0.5) for excitatory synapses. When plasticity adapts the reservoir weights, they are clamped between wmin = −10 and wmax = 10. All parameters for inhibitory and excitatory neuron membranes are taken from Izhikevich [89]. The equations for the membrane model are as follows:

v' = 0.04v^2 + 5v + 140 - u + I,  (3.1)
u' = a(bv - u),  (3.2)
The spike firing condition is:

\text{if } v > 30\,\text{mV then } \begin{cases} v \leftarrow c \\ u \leftarrow u + d \end{cases},  (3.3)
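As a minimal sketch of how Eqs. 3.1-3.3 could be advanced in discrete time, the following Python fragment uses a single Euler step of 1 ms and the standard regular-spiking/fast-spiking parameter values from Izhikevich [89]; the exact step size and parameter values used in the experiments are given in [75], so everything below should be read as an illustrative assumption rather than the original implementation.

    import numpy as np

    def izhikevich_step(v, u, I, a, b, c, d, dt=1.0):
        # One Euler step of Eqs. 3.1-3.3 for a whole population of neurons.
        v = v + dt * (0.04 * v**2 + 5.0 * v + 140.0 - u + I)   # Eq. 3.1
        u = u + dt * (a * (b * v - u))                          # Eq. 3.2
        fired = v > 30.0                                        # spike condition, Eq. 3.3
        v = np.where(fired, c, v)                               # reset membrane potential
        u = np.where(fired, u + d, u)                           # bump recovery variable
        return v, u, fired

    # Illustrative initialization: 135 neurons, 4:1 excitatory/inhibitory ratio,
    # regular-spiking (excitatory) and fast-spiking (inhibitory) parameters [89].
    N, n_exc = 135, 108
    exc = np.arange(N) < n_exc
    a = np.where(exc, 0.02, 0.1)
    b = np.full(N, 0.2)
    c = np.full(N, -65.0)
    d = np.where(exc, 8.0, 2.0)
    v, u = np.full(N, -65.0), b * -65.0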
More details about the parameter values of the excitatory and inhibitory neurons can be found in [75]. To generate a real-valued output from the discrete spiking activity, the spike train from each neuron is convolved with a decaying exponential according to Eq. 3.4. The vector of values produced is then weighted with the readout weight matrix and summed to produce a single output value, as given in Eq. 3.5.

x_i = f(S(t)) = \max\left( \sum_{t=1}^{T} \exp\left( \frac{-S(t)}{\tau} \right) \right),  (3.4)

y = \sum_{i=1}^{n} x_i \cdot w_i,  (3.5)
The state vector for a neuron is denoted by x_i, the spike train is S(t), and the filter function is f(·). T is the maximum number of time-steps in S(t) and τ is the decay constant. The maximum value is taken from the low-pass filtered values in Eq. 3.4 in order to detect the highest level of burst activity in the given neuron. This approach is taken under the assumption that burst activity is more representative of spiking neural computation than the sum total of the firing rate. The output weights are updated according to the iterative, stochastic gradient descent method, least mean squares (LMS), given in Eq. 3.6.

w_i \leftarrow w_i + \mu (y_d - y_o) x_i,  (3.6)

where y_o is the actual output, y_d is the desired output, x_i is the input taken from a neuron's filtered state, µ is a small learning rate, and w_i is the weight from x_i to the output.
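The sketch below shows one plausible implementation of the filter of Eq. 3.4, the readout of Eq. 3.5, the LMS update of Eq. 3.6 and the winner-takes-all prediction step; the decay constant, learning rate and variable names are illustrative assumptions, not values taken from the chapter.

    import numpy as np

    def state_value(spike_times, T, tau=10.0):
        # Eq. 3.4 (one plausible reading): convolve the spike train with a decaying
        # exponential and keep the peak of the filtered trace (burst activity).
        trace = np.zeros(T)
        for t in range(T):
            trace[t] = sum(np.exp(-(t - s) / tau) for s in spike_times if s <= t)
        return trace.max() if T > 0 else 0.0

    def readout_output(x, w):
        # Eq. 3.5: weighted sum of the filtered state vector x.
        return float(np.dot(x, w))

    def lms_update(w, x, y_desired, mu=0.01):
        # Eq. 3.6: one stochastic-gradient (least-mean-squares) step.
        return w + mu * (y_desired - readout_output(x, w)) * x

    def predict_class(x, W):
        # Winner-takes-all over the per-class readouts (rows of W).
        return int(np.argmax(W @ x))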
For the classification tasks of pattern recognition, y_d takes the value 0 or 1 depending on whether the class corresponding to the readout is the label of the current input sample. Three synaptic plasticity mechanisms are employed in this section, each of them based on the Hebbian postulate [82, 91] of "fire together, wire together." Each mechanism is outlined as follows:

BCM Plasticity: The BCM rule [79] is a rate-based Hebbian rule that also regulates the post-neuron firing rate to a desired level. It works on a temporal average of pre- and post-synaptic activity. The BCM rule is given in Eq. 3.7. The regulating parameter is the dynamic threshold θ_M, which changes based on the post-synaptic activity y as θ_M = E[y], where E[·] denotes a temporal average. In this case, E[·] is calculated as an exponential moving average of the post-synaptic neuron's membrane potential. There is also a uniform decay parameter ϵ_w, set very small, that slowly reduces connection strength and so provides a means for weight decay irrespective of the level of activity or of the correlation between pre-synaptic inputs and post-synaptic potential.

\Delta w = y(y - \theta_M)x - \epsilon_w,  (3.7)
Bi-phasic STDP: The STDP rule depends on the temporal correlation between pre- and post-synaptic spikes. The synaptic weight change is computed based on the delay between the firing times of the pre- and post-neuron. This is described by a fixed "learning window" in which the y-axis is the level of weight change and the x-axis is the time delay between a pre- and post-synaptic spike occurrence. The bi-phasic STDP rule consists of two decaying exponential curves [92], a negative one to depress out-of-order spikes and a positive one to potentiate in-order spikes. This rule was derived from experimental work carried out on populations of neurons in vitro [30, 80]. Bi-phasic STDP is given in Eq. 3.8.

\Delta w(\Delta t) = \begin{cases} A_+ \cdot \exp(-\Delta t / \tau_+) & \text{if } \Delta t > 0 \\ -A_- \cdot \exp(\Delta t / \tau_-) & \text{if } \Delta t \le 0 \end{cases},  (3.8)
where A_+ and A_- are the learning rates for potentiation and depression, respectively, ∆t is the delay of the post-synaptic spike occurring after the transmission of the pre-synaptic spike, and τ_+ and τ_- control the rates of the exponential decrease in plasticity across the learning window.

Tri-phasic STDP: A tri-phasic STDP learning window consists of a narrow potentiating region for closely correlated activity with depressing regions on either side: one for recently uncorrelated activity and one for correlated but late activity. This learning window has been observed in vitro, most notably in the hippocampus between areas CA3 and CA1 [81]. The tri-phasic STDP rule is given in Eq. 3.9.

\Delta w(\Delta t) = A_+ \exp\left( \frac{-(\Delta t - 15)^2}{200} \right) - A_- \exp\left( \frac{-(\Delta t - 15)^2}{2000} \right),  (3.9)
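As a minimal sketch, the three rules of Eqs. 3.7-3.9 could be expressed as the functions below; the amplitudes, time constants and decay parameter are placeholders chosen for illustration and are not the values used in the experiments.

    import numpy as np

    def bcm_update(x_pre, y_post, theta_m, eps_w=1e-4):
        # BCM rule (Eq. 3.7) with sliding threshold theta_m (a running average of
        # post-synaptic activity) and a small uniform weight decay eps_w.
        return y_post * (y_post - theta_m) * x_pre - eps_w

    def biphasic_stdp(dt, A_plus=0.1, A_minus=0.12, tau_plus=20.0, tau_minus=20.0):
        # Bi-phasic STDP (Eq. 3.8); dt is the post-minus-pre spike delay in ms.
        if dt > 0:
            return A_plus * np.exp(-dt / tau_plus)     # in-order spikes: potentiate
        return -A_minus * np.exp(dt / tau_minus)       # out-of-order spikes: depress

    def triphasic_stdp(dt, A_plus=0.1, A_minus=0.05):
        # Tri-phasic STDP (Eq. 3.9): narrow potentiation around dt ≈ 15 ms flanked
        # by a broader depressive term.
        return (A_plus * np.exp(-(dt - 15.0) ** 2 / 200.0)
                - A_minus * np.exp(-(dt - 15.0) ** 2 / 2000.0))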
To quantify interference directly between the synaptic adaptations produced by plasticity, the following formulation, based on the synaptic changes from sequentially presented samples, is used. The synaptic adaptation for a given class of sample is denoted ∆W_t and the average adaptation for all other classes ∆W_o. Interference must be calculated individually for each class of sample, I_t^{class}, and averaged together to get the overall interference, I^{total}. The equations are as follows:

I_t^{class} = \frac{1}{N} \sum_{i=1}^{N} [\Delta W_{ti} \cdot \Delta W_{oi} < 0]\,[|\Delta W_{ti}| < |\Delta W_{oi}| \cdot C_n],  (3.10)

I^{total} = \sum_{t=1}^{C_n} \frac{I_t^{class}}{C_n},  (3.11)
where I is interference, C_n is the number of competing sample classes, N is the number of synapses, and ∆W is a vector of synaptic changes. Subscript t denotes samples of a given class ("this"), subscript i denotes the parameter index, and subscript o denotes samples of all "other" classes. The first set of Iverson brackets in Eq. 3.10 returns 1 if the synaptic adaptation of a given class is of a different sign than the average adaptation of the other classes' samples. The second set of Iverson brackets returns 1 only if the magnitude
of the synaptic adaptation of a class is less than the average weight adaptation of the other classes multiplied by their total number. This yields a conservative measure in which interference within a synapse is flagged for a class of pattern only if the weight change is in a different direction from the average and is also lower in magnitude than the total weight adaptation of the other inputs.
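A direct implementation of this measure could look like the following sketch (array names are illustrative assumptions):

    import numpy as np

    def class_interference(dW_this, dW_other, C_n):
        # Eq. 3.10: a synapse counts as interfered with for this class when its
        # adaptation is opposite in sign to the average adaptation of the other
        # classes AND smaller in magnitude than that average scaled by C_n.
        opposite = (dW_this * dW_other) < 0
        smaller = np.abs(dW_this) < np.abs(dW_other) * C_n
        return float(np.mean(opposite & smaller))

    def total_interference(per_class_interference):
        # Eq. 3.11: average the per-class values over all C_n classes.
        return float(np.mean(per_class_interference))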
3.2. Data Preparation
3.2.1. Synthetic Signal Data A synthetic benchmark task is taken from a study performed with Echo State Networks [93], a similar type of network model to the one employed here, but using continuous rate-based neurons instead. The task is to predict which of three signal generating functions is currently active in producing a
Figure 15. (a) Data Plot from 49,500 samples generated according to Jaeger’s trifunction system recognition time-series task [93]. (b) Human motion samples for the six types of behavior in the KTH visual discrimination task [95]. This illustration consists of different behaviors from a single person, while the whole data set contains 25 persons. Top row: Still frames from example video samples; boxing, clapping, waving, walking, running and jogging. Bottom row: Features extracted corresponding to the samples above, according to Eq. 3.12 and 3.13. Features are the raw time-series activity used as input to the neural network.
time-varying input signal. To generate a sample of the signal at a given time step, one of the following three function types is used: (1) a sine function with a randomly selected period, (2) a chaotic iterated tent map, or (3) a randomly chosen constant. The generator is given some low probability of switching to another function at each time-step. The full method of generating the data is described in [93]. A short window of the generated signal is plotted in Figure 15a.
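A sketch of such a generator is given below; the switching probability and parameter ranges are illustrative assumptions, and the exact settings should be taken from [93].

    import numpy as np

    def generate_trifunction_signal(n_steps, p_switch=0.002, rng=None):
        # Generates a signal that switches between a sine, a tent map and a constant.
        rng = rng or np.random.default_rng()
        signal, labels = np.zeros(n_steps), np.zeros(n_steps, dtype=int)

        def new_state():
            kind = int(rng.integers(3))
            if kind == 0:                      # sine with a random period
                return kind, {"period": rng.uniform(10, 50), "phase": 0.0}
            if kind == 1:                      # chaotic iterated tent map
                return kind, {"x": rng.uniform(0.01, 0.99)}
            return kind, {"c": rng.uniform()}  # random constant

        kind, state = new_state()
        for t in range(n_steps):
            if rng.random() < p_switch:        # low probability of switching generator
                kind, state = new_state()
            if kind == 0:
                state["phase"] += 2 * np.pi / state["period"]
                signal[t] = 0.5 * (1 + np.sin(state["phase"]))
            elif kind == 1:
                x = state["x"]
                state["x"] = 2 * x if x < 0.5 else 2 * (1 - x)   # tent map iteration
                signal[t] = state["x"]
            else:
                signal[t] = state["c"]
            labels[t] = kind
        return signal, labels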
3.2.2. Speaker Recognition Data

A speaker recognition task is a classification problem dealing with mapping time-series audio input data to target speaker labels. The data set, taken from [94], consists of utterances of nine male Japanese speakers pronouncing the vowel /ae/. The task is to correctly discriminate each speaker based on the speech samples. Each sample comprises a sequence of audio frames with 12 features each. The features of each frame are linear prediction coefficients (LPCs) converted to cepstral coefficients (CCs). The dataset is divided into training and testing sets of 270 and 370 samples, respectively. Note that, unlike the benchmark data used in this review, the samples are not in a consecutive time-series, yet each sample consists of a time-series sequence of audio frames.
3.2.3. Pre-Processing of the Human Motion Data

A visual task is selected to test high-dimensional spatio-temporal input data. The KTH data set [95] consists of 2391 video files of people performing one of six actions: boxing, clapping, waving, walking, running and jogging. There are 25 different subjects and the samples cover a range of conditions that are described in more detail in [95]. Each video sample is taken at 25 frames per second and down-sampled to a resolution of 160 × 120 pixels. The raw video sequences are processed according to the following equations:

M(t) = \|[\Delta(I_1, I_2), \ldots, \Delta(I_{N-1}, I_N)]\|,  (3.12)

M(t, i) = \begin{cases} 1 & \text{if } M(t, i) \ge 0.2 \cdot \max(M(\cdot)) \\ 0 & \text{otherwise} \end{cases},  (3.13)
The final input matrix M is indexed by time-frame t and spatial sample i. The column vectors I_n are individual frames, re-shaped into one dimension. Each sample contains up to a total of N frames. In plain language, this process essentially further down-samples the frames and calculates the difference between pixels in consecutive frames, which are then used as the new input features. Each frame difference is re-shaped into a one-dimensional column vector, and these vectors are appended together to form an input matrix in which each column is used as the neural network input at consecutive time steps. Figure 15b illustrates frames extracted from an example of each type of motion along with the corresponding processed features.
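A rough sketch of this pre-processing is shown below; the additional spatial down-sampling step mentioned in the text is omitted, and the norm in Eq. 3.12 is read here as a per-pixel absolute difference, which is an assumption rather than a statement of the original implementation.

    import numpy as np

    def preprocess_video(frames, threshold=0.2):
        # frames: array of shape (N, H, W) holding grey-scale video frames.
        # Flatten each frame into a one-dimensional vector.
        flat = frames.reshape(frames.shape[0], -1).astype(float)
        # Difference between consecutive frames (Eq. 3.12); columns index time.
        M = np.abs(np.diff(flat, axis=0)).T
        # Binarize at 20% of the global maximum (Eq. 3.13).
        return (M >= threshold * M.max()).astype(float)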
3.3. Training Recurrent Networks with Plasticity

Training and analysis are performed on a typical liquid state machine (LSM) model [76] that is trained to correctly classify temporal input patterns of sensory signals. Details of the models and simulation procedure can be found in the previous section. Here we present an overview of the experimental procedure. An LSM consists of recurrently connected spiking neurons in which transient activity of the neurons is driven by time-series input sequentially exciting their membrane potential. In order for an output to be produced from the network and used to train a supervised readout, a snapshot of the transient activity, called the state vector, must be taken. This vector is weighted and summed to produce an output, the weights of which are trained with linear regression. The recurrent connections are adapted with synaptic plasticity before the state vectors used for pattern recognition are taken. The synaptic weights are changed from their initial random structure to values that are adapted to the general statistics of the input signals. After this pre-training process, the state vectors are taken for each sample in the data set and used to train a set of readouts to recognize labeled patterns in the data. Performance of pattern
recognition is only a small aspect of the analysis of synaptic adaptation through plasticity. The analysis methodology requires information on how each sample of input causes unique adaptation of the synapses. Therefore, for convenience, when collecting the liquid state vectors of a given sample from the neural activity, the synaptic change during the presentation of that sample is also computed and the weight matrix adaptation is stored. Figure 13 illustrates the three-step process just described, delineated into: 1) a pre-training phase of synaptic plasticity, 2) a collection of the liquid state vectors and weight adaptation matrices, and 3) a supervised training phase of linear readouts for pattern recognition.
3.4. Analysis of Synaptic Adaptation

Synaptic weight adaptation matrices form the basis of the analysis in this section. Figure 16 illustrates the process of these matrices being collected and used for analysis of class-specific synaptic plasticity. First, synaptic plasticity is applied to the network to adapt a baseline weight matrix that reflects the general statistics of the input patterns in the data set. Second, the weight adaptation matrix is collected for each sample, and these are grouped by class and also into two sets based on the training and test data division. Finally, the Euclidean distance is calculated between each weight matrix, with the average distance between each set plotted in a type of "confusion matrix" in which a low distance indicates high similarity between the adaptation of synaptic parameters. In the confusion matrix, if the diagonal values are lower than the others it means that synaptic plasticity is sensitive to the structural differences in input samples that are labeled as different classes. The stronger the diagonal trend, the more sensitive plasticity is to features of the input. This means that plasticity learns to distinguish class labels, such as different speakers or human actions, without ever being exposed to the labels themselves a priori. The weight adaptation matrices are also used to estimate the amount of interference between different input samples within the synaptic parameters.
Figure 16. Three step process describing the analysis of input-specific synaptic adaptations. First, the recurrent connections are adapted under plasticity in the same way as in Figure 13. Second, each input sample is presented and plasticity adapts the synapses. The change in the weight matrix is stored for each sample and grouped by the input class label, Cx and into two sets, train and test. Finally, the Euclidean distance between the matrices in train and test is calculated and the average for each class label is plotted in a confusion matrix.
3.5. Learning Input-Specific Adaptations

We test the hypothesis that synaptic plasticity encodes a distinct structure for input samples of different labels. For the speech task, these labels consist of different speakers, and for the video recognition task the labels consist of different human behaviors. The data sets are divided evenly into two. Each subset is used to train a recurrently connected network for 10,000 iterations, selecting a sample at random on each iteration. The changes to the weight matrix due to plasticity are recorded for each sample presentation. This is then used to create a class-specific average weight change for each of the class labels in both of the sample subsets. Finally, the Euclidean distance is calculated between each class in one set and each class in the other by the following equation:

Dist(C_{lab}^{X}, C_{lab}^{Y}) = \sum_{i=1}^{N} |\Delta W_i(C_{lab}^{X}) - \Delta W_i(C_{lab}^{Y})|,  (3.14)
where C_lab denotes class labels, X and Y distinguish the separated sets of samples, N is the number of synapses, ∆W is the change in the weight matrix for a presented sample, and i is the synapse index.
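The resulting "weight change confusion matrix" could be computed as in the following sketch, which follows Eq. 3.14 literally as a sum of absolute differences between class-average weight changes; array names are illustrative.

    import numpy as np

    def weight_change_distance_matrix(dW_X, dW_Y):
        # dW_X, dW_Y: arrays of shape (n_classes, n_synapses) holding the
        # class-average weight change for the two sample subsets.
        n_classes = dW_X.shape[0]
        D = np.zeros((n_classes, n_classes))
        for a in range(n_classes):
            for b in range(n_classes):
                D[a, b] = np.abs(dW_X[a] - dW_Y[b]).sum()   # Eq. 3.14
        # Low values on the descending diagonal indicate class-specific adaptation.
        return D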
This effectively produces a confusion matrix of similarity in the synaptic weight change for different classes of input. Having lower values on the descending diagonal means that there is structural adaptation that is specific to the class of that column, compared with the similarity between the structural adaptations of two different classes. Figure 17a illustrates the "weight change confusion matrices" described above, for each plasticity model applied to all sensory tasks (nine experiments in total). All of the experiments show at least some stronger similarity on the descending diagonals and most are stark in this manner. It is certainly a strong enough pattern to show that, through the many iterations of training, each of the plasticity models has become sensitive to the particular structure of the sensory input signals, so that each different class of sample gives rise to changes in synaptic strength that are distinct from other classes compared with the similarity to themselves. The class labels were not used in any way in the plasticity models themselves, and so the differences in the weight change arise from the input signals alone. There are a few exceptions to the strong diagonal patterns in Figure 17a. This means that some classes are not effectively distinguished from each other: speakers 8/9 with bi-phasic STDP, behaviors 1/2 with BCM, and behaviors 1/2/3 and 4/5/6 with tri-phasic STDP. The latter confusion corresponds to the behaviors of boxing/clapping/waving and walking/running/jogging. From the similarity of those input features shown in the lower panes of Figure 15b, it is evident why this confusion might occur.
3.6. Classification Performance with Plasticity and Synaptic Interference

Perhaps the ultimate goal of neural network methods when applied to sensory tasks is the ability to accurately distinguish different types of input sample by their patterns. The error rates achieved by the presented neural network are compared on the three sensory tasks, with and without three different forms of plasticity. Table 3 lists the error rates achieved for each of
the learning tasks with the different plasticity rules active in a pre-training phase, in addition to a static network with fixed internal synapses. From the error rates in Table 3 it is evident that pre-training the network with synaptic plasticity yields only insignificant improvements in the error rate. However, the results indicate that it can have a greater negative impact than a positive one. In the KTH human behavior data set, all three plasticity models increase the error rate by between 1.7 and 10%. Conversely, the best improvement is found on the tri-function signal recognition task with tri-phasic STDP, at only 1.5%. Accordingly, it is clear from the network output that pre-training with synaptic plasticity is not a suitable method for this class of model. This does not contradict the result that plastic synapses are learning useful, input-specific structure. However, it does suggest that the structure being learned is not effectively utilized in the generation of a network output. In the next step, interference between synaptic changes is investigated to determine if the structural learning is retained in the network or if interference is a barrier to the effective application of synaptic plasticity.

Table 3. Classification error rates and synaptic interference
            Static             BCM                STDP               TP-STDP
            Error   Interf.    Error   Interf.    Error   Interf.    Error   Interf.
Tri-func    0.153   —          0.157   0.82       0.204   0.88       0.138   0.8
KTH         0.283   —          0.3     0.92       0.333   0.93       0.383   0.96
Vowels      0.089   —          0.092   0.96       0.086   0.58       0.086   0.9

Error = classification error rate; Interf. = synaptic interference. Values averaged over 10 trials with a random seed based on the system clock. SD did not exceed 0.07 for all values. Bold values indicate the lowest error rate and interference.
When a model adapts incrementally to sequentially presented input, existing patterns that have been learned by the model parameters are prone to being overwritten by the learning of new patterns. This is known as interference. This effect has been studied previously [96] by testing the ability to
recognize previously presented input after the model has been trained on new ones, in order to estimate how much learning has been undone. When new training leaves the model unable to recognize old patterns, there is said to have been catastrophic interference and forgetting.
Figure 17. (a) Class correlation of structural synaptic adaptation. Heat map plots indicate the structure learned on each class for the three tasks under each of the plasticity rules. Essentially, it is a confusion matrix of the geometric distance between the weight matrix adaptation of each class of sample. The training data for each task is divided into two sets. Class-average adaptation is found for each set. A distance is then calculated between each class of the two sets. Lower values on the descending diagonal indicate higher correlation within a class adaptation and therefore strong class-specific structure learned. (b) The class-specific synaptic adaptation for 3 out of the 9 classes of the speaker recognition task under BCM plasticity. The main heat maps in each subplot show the adaptation of the weight matrix (synapses) after the presentation of voice input data from each speaker. Blue values show a reduction in synaptic strength and red values show an increase. Each N × N weight matrix has pre-neurons on the x-axis and post-neurons on the y-axis. The bar chart, S, shows the average neuron activation for each class. The bar chart, R, shows the learned readout weights. Labeled synapses a and b indicate key structural changes that are selective between different speakers. Each label alone can distinguish between two sets of speakers. Taken together, the labeled synapses adapt specifically to each speaker in a unique pattern, learning a distinct network structure for each one.
Meanwhile, a new method of measuring interference is introduced which directly considers synaptic parameters instead of the model output. This measure is described in detail in Section 3.1. I^{total} directly quantifies all synaptic changes that are overwritten. The interference for each of the experiments is listed in Table 3. In all but one of the experiments the interference level is between 82 and 96%. Most of the learned structure for each class of input is forgotten as consecutive samples overwrite each other's previous changes. Bi-phasic STDP applied to speaker recognition has the lowest level of interference at 58%. To further explore interference and visualize the impact of plasticity, synaptic changes are analyzed directly. Figure 17b is an illustrative example in which a reduced network size of 35 neurons is used to improve visual clarity of the plotted patterns. It is an example for the speaker recognition task with BCM plasticity. It shows the adaptation of the synaptic weight matrix produced by the ninth speaker in the voice recognition task. This is plotted against the activity level for each neuron, S, and the readout weights, R, that are trained to generate an output that is sensitive to that given speaker. These subplots are the average responses taken over all sample presentations from that speaker. This makes a whole chain of effects visible: from the synaptic change of an internal network connection, to the average neuron state for a given speaker, to the selective weights of the readout for that speaker. For all to be working well in a cohesive system, it is expected that a positive weight change should correspond with a neuron activation unique to the class, which would in turn improve the ability of the readout to identify that class. The sections of the class weight matrix marked in green in Figure 17b highlight an example in which synaptic interference is occurring between different types of pattern. Directly opposing features in the weight matrix adaptations show the samples negating each other's changes. However, the same features are also the most distinctively class-specific. Any synapse can only change in two directions, positively or negatively, which means that a single synapse can only adapt to distinguish between two mutually exclusive kinds of input pattern. If n synapses are considered in combination, then the number of input patterns that can be discriminated
becomes 2^n under ideal theoretical conditions. Figure 17b illustrates this principle in practice with regard to the nine-speaker recognition task. The adapted synapses labeled (a) can clearly distinguish speaker {#1} from speakers {#2, #3} but cannot distinguish {#2} from {#3}. Figure 17b also shows that the weight changes are not correlated with the neural activity or readout weights. For plasticity to improve the accuracy of sensory discrimination, it would be expected that synapses would strengthen for class-specific neural activity and weaken for common neural activity. This is not the case in the presented results.
3.7. Evolution of Synaptic Weights

The main conclusions are drawn from the observation that the synaptic plasticity models tested become sensitive to specific class labels during a competitive process of synaptic interference between input patterns. To generalize the conclusions to recurrent neural circuits, and liquid state machines in particular, it must be shown that synaptic weights reach some stability during pre-training and that the neural activity dynamics are working in a balanced regime. Figure 18 presents a series of plots taken at 1, 100, and 1000 input iterations that show the inter-spike intervals (ISI) and evolving distributions of synaptic weights for each of the experiments performed in the current work. In general, the plots show that between the first and 100th pattern, the synaptic weights are adapted significantly by plasticity, with a corresponding—but more subtle—change in the distribution of ISIs. While there is also some level of change in weights between the 100th and 1000th iteration, the level is far smaller, which indicates that the synapses are converging on a common structure. However, it is important to note that for simulations even up to 10,000 iterations there is always some low level of synaptic change. The plasticity models tested never stabilize to a point at which there is no further synaptic adaptation, even when a single input sample is repeatedly presented. Each of the plasticity models drives the synaptic weights to a different kind of distribution. STDP creates a bi-modal
distribution that drives most weights to the extremes, 0 and 10, with a few that are in a state of change leading up to each boundary. This leads to a structure with more full-strength synapses than zeroed ones. BCM and TP-STDP plasticity lead to sparser connectivity that drives most weights to zero. In particular, TP-STDP only maintains a small number of weak connections due to the narrow window of potentiation being surrounded by depressive regions that suppress most connections. BCM includes an implicit target level of post-synaptic activity that encourages some synapses to take larger values but does not drive them to their maximum. The distribution of ISIs represents the dynamics of the neural activity. The plots in Figure 18 show that a balance between completely saturated and sparse activity is maintained during the simulation. The shape of the ISI distributions tends to stabilize between 100 and 1000 sample presentations. These observations provide some evidence that the results presented here are not simply an artifact of a particular choice of model parameters but are observed for a normally functioning liquid state machine.
Figure 18. Plots of ISI (blue, bottom) and evolving synaptic weight (black, top) distributions given for each recognition task and each plasticity model. The plots are snapshots of the parameter distributions after 1, 100, and 1000 input samples have been presented during pre-training.
Furthermore, both BCM and STDP models adapt the synapses of the network in distinctive patterns according to which type of sample is being presented to the network. It can be concluded that presenting a training signal with the sample label is not required for plasticity to learn specific information for complex sensory inputs from different sources. This result holds for the visual, speech, and benchmark pattern recognition tasks. To achieve this feat, one hypothesis is that plasticity drives the synaptic parameters to a structure that represents an average over all input samples. Once converged, any further input stimulus will drive the synaptic parameters in a unique direction away from this average structure. Scrambled (random-order) presentation of the inputs keeps the network in this sensitive state.
3.8. Synaptic Interference and Its Impact on Learning Performance

In this section, we have shown that synaptic plasticity spends most of its effort counteracting previous changes and overwriting learned patterns. The same patterns of synaptic adaptation that distinctly characterize each class of input are the same ones that reverse adaptations made by other inputs. Plasticity is applied uniformly to all synapses. All neurons in a recurrent network produce activity when given an input stimulus. Combined, these factors mean that any input sample will cause the same synapses to change. This leads to synaptic competition, interference and, ultimately, forgetting. To overcome the problem of interference, the mechanisms of plasticity need to be restricted to adapt only a subset of the synapses for any given input stimulus. There is much existing research that supports this conclusion and a number of possible mechanisms that can restrict the locality of plasticity. It has been shown in vivo (using fMRI and neurological experiments) that synaptic plasticity learns highly specific adaptations early in the visual perceptual pathway [97, 98]. Simulated models of sensory systems have demonstrated that sparsity of activity is essential for sensitivity to input-specific features [99, 100]. In fact, in a single-layer, non-recurrent structure,
STDP is shown to promote sparsity in a model olfactory system [99]. Conversely, in recurrent networks, STDP alone is unable to learn input-specific structure because it "over-associates" [101]. Strengthened inhibition was used to overcome this problem and combined with reinforcement learning to produce selectivity in the output [101]. By promoting sparsity, the lack of activity in most of the network prevents activity-dependent models of plasticity from adapting those connections. Reward-modulated plasticity has also been widely explored in simulation [102, 103] and in biological experiments [104]. Input-specific synaptic changes are shown to be strongest in the presence of a reward signal [104]. Lasting memories (synaptic changes not subject to interference) are also seen to rely on a process of reconsolidation consisting of fear conditioning [105]. A reinforcement signal based on either reward or fear conditioning can be effectively used to restrict synaptic changes in a task-dependent context such as sensory pattern recognition. Another way to restrict synaptic changes in a task-dependent way is to rely on a back-propagated error signal, which has well-established use in artificial neural networks. This might be achieved in a biologically plausible way through axonal propagation [106] or top-down cortical projections sending signals backwards through the sensory pathways [107]. Top-down neural function in general is thought to be essential in determining structure in neural networks [108], providing a context for any adaptations. A molecular mechanism for the retro-axonal signals required for back-propagation has been proposed [109]. However, in general these retro-axonal signals are known to be important for neural development but may be too slowly acting to learn sensory input. Structural adaptation with plasticity in the pre-training phase, while specific, may not be utilized by the output produced by the network readout. This could be due to the following reasons: 1) there is a disparity in the neural code. The output from a recurrent spiking network model is currently decoded as a rate code. In contrast, synaptic plasticity updates structure in a way that depends on the precise temporal activity of neural spikes. 2) information content is reduced. While creating associations between co-activating neurons, Hebbian forms of plasticity may also increase
correlations and reduce information and separation. These can determine the computational capacity of a recurrent network model [75]. Both discrepancies could be barriers to the effective application of plasticity to improve pattern recognition. Therefore, new frameworks of neural processing should be based directly on the adapting synapses. This will lead to functional models of neural computing that are not merely improved by synaptic plasticity, but that rely on it as an integral element. This finding contrasts with some existing work that shows that pre-training with plasticity, including STDP [110] and BCM [85], can improve performance in a recurrent spiking network. To address this discrepancy, it should be noted that pre-training might improve the general computational properties of recurrent networks without learning input-specific structure. Furthermore, if this is the case, the likelihood of plasticity leading to an improvement will largely depend on how well-tuned the initial parameters of the network are before the pre-training phase begins.
REFERENCES

[1] Maass W. Networks of spiking neurons: the third generation of neural network models. Neural Networks. 1997 Dec 1;10(9):1659-71. [2] Thorpe S, Delorme A, Van Rullen R. Spike-based strategies for rapid processing. Neural Networks. 2001 Jul 9;14(6-7):715-25. [3] Wysoski SG, Benuskova L, Kasabov N. Evolving spiking neural networks for audiovisual information processing. Neural Networks. 2010 Sep 1;23(7):819-35. [4] Drubach D. The brain explained. Prentice Hall; 2000. [5] Cassidy AS, Alvarez-Icaza R, Akopyan F, Sawada J, Arthur JV, Merolla PA, Datta P, Tallada MG, Taba B, Andreopoulos A, Amir A. Real-time scalable cortical computing at 46 giga-synaptic OPS/watt. In Proceedings of the international conference for high performance computing, networking, storage and analysis 2014 Nov 16 (pp. 27-38). IEEE Press.
[6] Frémaux N, Sprekeler H, Gerstner W. Reinforcement learning using a continuous time actor-critic framework with spiking neurons. PLoS computational biology. 2013 Apr 11;9(4):e1003024. [7] Schoettle B, Sivak M. A survey of public opinion about autonomous and self-driving vehicles in the US, the UK, and Australia. [8] Attwell D, Laughlin SB. An energy budget for signaling in the grey matter of the brain. Journal of Cerebral Blood Flow & Metabolism. 2001 Oct;21(10):1133-45. [9] Esser SK, Appuswamy R, Merolla P, Arthur JV, Modha DS. Backpropagation for energy-efficient neuromorphic computing. In Advances in Neural Information Processing Systems 2015 (pp. 1117-1125). [10] Neftci EO, Augustine C, Paul S, Detorakis G. Event-driven random back-propagation: Enabling neuromorphic deep learning machines. Frontiers in neuroscience. 2017 Jun 21;11:324. [11] Ponulak F, Kasinski A. Introduction to spiking neural networks: Information processing, learning and applications. Acta neurobiologiae experimentalis. 2011;71(4):409-33. [12] Solouki S, Pooyan M. Arrangement and Applying of Movement Patterns in the Cerebellum Based on Semi-supervised Learning. The Cerebellum. 2016 Jun 1;15(3):299-305. [13] Bing Z, Meschede C, Huang K, Chen G, Rohrbein F, Akl M, Knoll A. End to end learning of spiking neural network based on r-stdp for a lane keeping vehicle. In 2018 IEEE International Conference on Robotics and Automation (ICRA) 2018 May 21 (pp. 1-8). IEEE. [14] Indiveri G. Neuromorphic analog VLSI sensor for visual tracking: Circuits and application examples. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing. 1999 Nov;46(11):1337-47. [15] Lewis MA, Etienne-Cummings R, Cohen AH, Hartmann M. Toward biomorphic control using custom aVLSI CPG chips. In Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No. 00CH37065) 2000 (Vol. 1, pp. 494-500). IEEE.
[16] Solouki S, Pooyan M. Fuzzy model reference adaptive control based on PID for fundamental and typical industrial plants. In the 3rd International Conference on Control, Instrumentation, and Automation 2013 Dec 28 (pp. 345-350). IEEE. [17] Ambrosano A, Vannucci L, Albanese U, Kirtay M, Falotico E, Hinkel G, Kaiser J, Ulbrich S, Levi P, Morillas C, Knoll A. Retina coloropponency based pursuit implemented through spiking neural networks in the neurorobotics platform. In Conference on Biomimetic and Biohybrid Systems 2016 Jul 19 (pp. 16-27). Springer, Cham. [18] Kaiser J, Tieck JC, Hubschneider C, Wolf P, Weber M, Hoff M, Friedrich A, Wojtasik K, Roennau A, Kohlhaas R, Dillmann R. Towards a framework for end-to-end control of a simulated vehicle with spiking neural networks. In 2016 IEEE International Conference on Simulation, Modeling, and Programming for Autonomous Robots (SIMPAR) 2016 Dec 13 (pp. 127-134). IEEE. [19] Wang X, Hou ZG, Tan M, Wang Y, Hu L. The wall-following controller for the mobile robot using spiking neurons. In 2009 International Conference on Artificial Intelligence and Computational Intelligence 2009 Nov 7 (Vol. 1, pp. 194-199). IEEE. [20] Izhikevich EM. Solving the distal reward problem through linkage of STDP and dopamine signaling. Cerebral cortex. 2007 Jan 13;17(10):2443-52. [21] Bing Z, Baumann I, Jiang Z, Huang K, Cai C, Knoll AC. Supervised Learning in SNN via Reward-Modulated Spike-Timing-Dependent Plasticity for a Target Reaching Vehicle. Frontiers in neurorobotics. 2019;13:18. [22] Gerstner W, Kistler WM. Spiking neuron models: Single neurons, populations, plasticity. Cambridge University Press; 2002 Aug 15. [23] Rothman JS, Silver RA. Data-driven modeling of synaptic transmission and integration. In Progress in molecular biology and translational science 2014 Jan 1 (Vol. 123, pp. 305-350). Academic Press. [24] Foderaro G, Henriquez C, Ferrari S. Indirect training of a spiking neural network for flight control via spike-timing-dependent synaptic
plasticity. In 49th IEEE Conference on Decision and Control (CDC) 2010 Dec 15 (pp. 911-917). IEEE. [25] Echeveste R, Gros C. Two-trace model for spike-timing-dependent synaptic plasticity. Neural computation. 2015 Mar;27(3):672-98. [26] Rohmer E, Singh SP, Freese M. V-REP: A versatile and scalable robot simulation framework. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems 2013 Nov 3 (pp. 1321-1326). IEEE. [27] Zadeh LA. The calculus of fuzzy if/then rules. In Proceedings of the Theorie and Praxis, Fuzzy Logik 1992 Jun 9 (pp. 84-94). Springer-Verlag. [28] Hao Y, Huang X, Dong M, Xu B. A Biologically Plausible Supervised Learning Method for Spiking Neural Networks Using the Symmetric STDP Rule. arXiv preprint arXiv:1812.06574. 2018 Dec 17. [29] Diehl PU, Cook M. Unsupervised learning of digit recognition using spike-timing-dependent plasticity. Frontiers in computational neuroscience. 2015 Aug 3;9:99. [30] Bi GQ, Poo MM. Synaptic modifications in cultured hippocampal neurons: dependence on spike timing, synaptic strength, and postsynaptic cell type. Journal of neuroscience. 1998 Dec 15;18(24):10464-72. [31] Zhang JC, Lau PM, Bi GQ. Gain in sensitivity and loss in temporal contrast of STDP by dopaminergic modulation at hippocampal synapses. Proceedings of the National Academy of Sciences. 2009 Aug 4;106(31):13028-33. [32] Brzosko Z, Schultz W, Paulsen O. Retroactive modulation of spike timing-dependent plasticity by dopamine. Elife. 2015 Oct 30;4:e09685. [33] Vogels TP, Sprekeler H, Zenke F, Clopath C, Gerstner W. Inhibitory plasticity balances excitation and inhibition in sensory pathways and memory networks. Science. 2011 Dec 16;334(6062):1569-73. [34] Tritsch NX, Sabatini BL. Dopaminergic modulation of synaptic transmission in cortex and striatum. Neuron. 2012 Oct 4;76(1):33-50.
[35] Ruan H, Saur T, Yao WD. Dopamine-enabled anti-Hebbian timingdependent plasticity in prefrontal circuitry. Frontiers in neural circuits. 2014 Apr 23;8:38. [36] Pozo K, Goda Y. Unraveling mechanisms of homeostatic synaptic plasticity. Neuron. 2010 May 13;66(3):337-51. [37] Davis GW. Homeostatic control of neural activity: from phenomenology to molecular design. Annu. Rev. Neurosci., 2006 Jul 21;29:307-23. [38] Turrigiano GG. The self-tuning neuron: synaptic scaling of excitatory synapses. Cell. 2008 Oct 31;135(3):422-35. [39] Maffei A, Turrigiano GG. Multiple modes of network homeostasis in visual cortical layer 2/3. Journal of Neuroscience. 2008 Apr 23;28(17):4377-84. [40] Keck T, Keller GB, Jacobsen RI, Eysel UT, Bonhoeffer T, Hübener M. Synaptic scaling and homeostatic plasticity in the mouse visual cortex in vivo. Neuron. 2013 Oct 16;80(2):327-34. [41] Hengen KB, Lambo ME, Van Hooser SD, Katz DB, Turrigiano GG. Firing rate homeostasis in visual cortex of freely behaving rodents. Neuron. 2013 Oct 16;80(2):335-42. [42] Turrigiano GG, Leslie KR, Desai NS, Rutherford LC, Nelson SB. Activity-dependent scaling of quantal amplitude in neocortical neurons. Nature. 1998 Feb;391(6670):892. [43] Yeung LC, Shouval HZ, Blais BS, Cooper LN. Synaptic homeostasis and input selectivity follow from a calcium-dependent plasticity model. Proceedings of the National Academy of Sciences. 2004 Oct 12;101(41):14943-8. [44] Sun QQ. Experience-dependent intrinsic plasticity in interneurons of barrel cortex layer IV. Journal of neurophysiology. 2009 Sep 9;102(5):2955-73. [45] Zhang W, Linden DJ. The other side of the engram: experience-driven changes in neuronal intrinsic excitability. Nature Reviews Neuroscience. 2003 Nov;4(11):885.
[46] Cooper LN, Bear MF. The BCM theory of synapse modification at 30: interaction of theory with experiment. Nature Reviews Neuroscience. 2012 Nov;13(11):798. [47] Yavuz E, Turner J, Nowotny T. GeNN: a code generation framework for accelerated brain simulations. Scientific reports. 2016 Jan 7;6:18854. [48] Maaten LV, Hinton G. Visualizing data using t-SNE. Journal of machine learning research. 2008;9(Nov):2579-605. [49] Cao Y, Chen Y, Khosla D. Spiking deep convolutional neural networks for energy-efficient object recognition. International Journal of Computer Vision. 2015 May 1;113(1):54-66. [50] Hunsberger E, Eliasmith C. Spiking deep networks with LIF neurons. arXiv preprint arXiv:1510.08829. 2015 Oct 29. [51] Diehl PU, Neil D, Binas J, Cook M, Liu SC, Pfeiffer M. Fastclassifying, high-accuracy spiking deep networks through weight and threshold balancing. In 2015 International Joint Conference on Neural Networks (IJCNN) 2015 Jul 12 (pp. 1-8). IEEE. [52] LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W, Jackel LD. Backpropagation applied to handwritten zip code recognition. Neural computation. 1989 Dec;1(4):541-51. [53] Wu Y, Deng L, Li G, Zhu J, Shi L. Spatio-temporal backpropagation for training high-performance spiking neural networks. Frontiers in neuroscience. 2018;12. [54] Beyeler M, Dutt ND, Krichmar JL. Categorization and decisionmaking in a neurobiologically plausible spiking network using a STDP-like learning rule. Neural Networks. 2013 Dec 1;48:109-24. [55] Querlioz D, Bichler O, Dollfus P, Gamrat C. Immunity to device variations in a spiking neural network with memristive nanodevices. IEEE Transactions on Nanotechnology. 2013 May;12(3):288-95. [56] Hu B, Tamba TA. Optimal Co-design of Industrial Networked Control Systems with State-dependent Correlated Fading Channels. arXiv preprint arXiv:1807.07681. 2018 Jul 20.
[57] Samadi A, Lillicrap TP, Tweed DB. Deep learning with dynamic spiking neurons and fixed feedback weights. Neural computation. 2017 Mar;29(3):578-602. [58] Tavanaei A, Maida A. BP-STDP: Approximating backpropagation using spike timing dependent plasticity. Neurocomputing. 2019 Feb 22;330:39-47. [59] Zhang T, Zeng Y, Zhao D, Shi M. A plasticity-centric approach to train the non-differential spiking neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence 2018 Apr 25. [60] Lee JH, Delbruck T, Pfeiffer M. Training deep spiking neural networks using backpropagation. Frontiers in neuroscience. 2016 Nov 8;10:508. [61] Lee C, Panda P, Srinivasan G, Roy K. Training deep spiking convolutional neural networks with stdp-based unsupervised pretraining followed by supervised fine-tuning. Frontiers in neuroscience. 2018;12. [62] Bill J, Legenstein R. A compound memristive synapse model for statistical learning through STDP in spiking neural networks. Frontiers in neuroscience. 2014 Dec 16;8:412. [63] Kheradpisheh SR, Ganjtabesh M, Thorpe SJ, Masquelier T. STDPbased spiking deep convolutional neural networks for object recognition. Neural Networks. 2018 Mar 1;99:56-67. [64] Shrestha A, Ahmed K, Wang Y, Qiu Q. Stable spike-timing dependent plasticity rule for multilayer unsupervised and supervised learning. In 2017 International Joint Conference on Neural Networks (IJCNN) 2017 May 14 (pp. 1999-2006). IEEE. [65] Hu Z, Wang T, Hu X. An STDP-based supervised learning algorithm for spiking neural networks. In International Conference on Neural Information Processing 2017 Nov 14 (pp. 92-100). Springer, Cham. [66] Mozafari M, Ganjtabesh M, Nowzari-Dalini A, Thorpe SJ, Masquelier T. Combining stdp and reward-modulated stdp in deep convolutional spiking neural networks for digit recognition. arXiv preprint arXiv:1804.00227. 2018 Mar.
[67] Glimcher PW. Understanding dopamine and reinforcement learning: the dopamine reward prediction error hypothesis. Proceedings of the National Academy of Sciences. 2011 Sep 13;108 (Supplement 3):15647-54. [68] Holroyd CB, Coles MG. The neural basis of human error processing: reinforcement learning, dopamine, and the error-related negativity. Psychological review. 2002 Oct;109(4):679. [69] Wise RA. Dopamine, learning and motivation. Nature reviews neuroscience. 2004 Jun;5(6):483. [70] Dayan P, Balleine BW. Reward, motivation, and reinforcement learning. Neuron. 2002 Oct 10;36(2):285-98. [71] Izhikevich EM, inventor; Neurosciences Research Foundation Inc, assignee. Solving the distal reward problem through linkage of STDP and dopamine signaling. United States patent US 8,103,602. 2012 Jan 24. [72] Lin Z, Ma D, Meng J, Chen L. Relative ordering learning in spiking neural network for pattern recognition. Neurocomputing. 2018 Jan 31;275:94-106. [73] Sporea I, Grüning A. Supervised learning in multilayer spiking neural networks. Neural computation. 2013 Feb;25(2):473-509. [74] Xu Q, Qi Y, Yu H, Shen J, Tang H, Pan G. CSNN: An Augmented Spiking based Framework with Perceptron-Inception. In IJCAI 2018 Jul 13 (pp. 1646-1652). [75] Chrol-Cannon J, Jin Y. Learning structure of sensory inputs with synaptic plasticity leads to interference. Frontiers in computational neuroscience. 2015 Aug 5;9:103. [76] Maass W, Natschläger T, Markram H. Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural computation. 2002 Nov 1;14(11):2531-60. [77] Buonomano DV, Maass W. State-dependent computations: spatiotemporal processing in cortical networks. Nature Reviews Neuroscience. 2009 Feb;10(2):113.
[78] Rosenblatt F. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review. 1958 Nov;65(6):386. [79] Bienenstock EL, Cooper LN, Munro PW. Theory for the development of neuron selectivity: orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience. 1982 Jan 1;2(1):32-48. [80] Markram H, Lübke J, Frotscher M, Sakmann B. Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science. 1997 Jan 10;275(5297):213-5. [81] Wittenberg GM, Wang SS. Malleability of spike-timing-dependent plasticity at the CA3–CA1 synapse. Journal of Neuroscience. 2006 Jun 14;26(24):6610-7. [82] Hebb DO. The organization of behavior: A neuropsychological theory. Psychology Press; 2005 Apr 11. [83] Solouki S, Bahrami F, Janahmadi M. Effects of irreversible olivary system lesion on the gain adaptation of optokinetic response eye movement: a model based study. In 2018 25th National and 3rd International Iranian Conference on Biomedical Engineering (ICBME) 2018 Nov 29 (pp. 1-6). IEEE. [84] French RM. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences. 1999 Apr 1;3(4):128-35. [85] Yin J, Meng Y, Jin Y. A developmental approach to structural selforganization in reservoir computing. IEEE transactions on autonomous mental development. 2012 Dec;4(4):273-89. [86] Xue F, Hou Z, Li X. Computational capability of liquid state machines with spike-timing-dependent plasticity. Neurocomputing. 2013 Dec 25;122:324-9. [87] Toutounji H, Pipa G. Spatiotemporal computations of an excitable and plastic brain: neuronal plasticity leads to noise-robust and noiseconstructive computations. PLoS computational biology. 2014 Mar 20;10(3):e1003512. [88] Gilson M, Burkitt A, Van Hemmen LJ. STDP in recurrent neuronal networks. Frontiers in computational neuroscience. 2010 Sep 10;4:23.
[89] Izhikevich EM. Simple model of spiking neurons. IEEE Transactions on neural networks. 2003 Nov;14(6):1569-72. [90] Izhikevich EM. Which model to use for cortical spiking neurons?. IEEE transactions on neural networks. 2004 Sep;15(5):1063-70. [91] Hebb DO. Drives and the CNS (conceptual nervous system). Psychological review. 1955 Jul;62(4):243. [92] Song S, Miller KD, Abbott LF. Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nature neuroscience. 2000 Sep;3(9):919. [93] Jaeger H. Discovering multiscale dynamical features with hierarchical echo state networks. Jacobs University Bremen; 2007 Jul 1. [94] Kudo M, Toyama J, Shimbo M. Multidimensional curve classification using passing-through regions. Pattern Recognition Letters. 1999 Nov 1;20(11-13):1103-11. [95] Laptev I, Caputo B. Recognizing human actions: a local SVM approach. Innull 2004 Aug 23 (pp. 32-36). IEEE. [96] McCloskey M, Cohen NJ. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation 1989 Jan 1 (Vol. 24, pp. 109-165). Academic Press. [97] Karni A, Sagi D. Where practice makes perfect in texture discrimination: evidence for primary visual cortex plasticity. Proceedings of the National Academy of Sciences. 1991 Jun 1;88(11):4966-70. [98] Schwartz S, Maquet P, Frith C. Neural correlates of perceptual learning: a functional MRI study of visual texture discrimination. Proceedings of the National Academy of Sciences. 2002 Dec 24;99(26):17137-42. [99] Finelli LA, Haney S, Bazhenov M, Stopfer M, Sejnowski TJ. Synaptic learning rules and sparse coding in a model sensory system. PLoS computational biology. 2008 Apr 18;4(4):e1000062. [100] Barranca VJ, Kovačič G, Zhou D, Cai D. Sparsity and compressed coding in sensory systems. PLoS computational biology. 2014 Aug 21;10(8):e1003793.
[101] Bourjaily MA, Miller P. Synaptic plasticity and connectivity requirements to produce stimulus-pair specific responses in recurrent networks of spiking neurons. PLoS computational biology. 2011 Feb 24;7(2):e1001091. [102] Darshan R, Leblois A, Hansel D. Interference and shaping in sensorimotor adaptations with rewards. PLoS computational biology. 2014 Jan 9;10(1):e1003377. [103] Gavornik JP, Shuler MG, Loewenstein Y, Bear MF, Shouval HZ. Learning reward timing in cortex through reward dependent expression of synaptic plasticity. Proceedings of the National Academy of Sciences. 2009 Apr 21;106(16):6826-31. [104] Lepousez G, Nissant A, Bryant AK, Gheusi G, Greer CA, Lledo PM. Olfactory learning promotes input-specific synaptic plasticity in adult-born neurons. Proceedings of the National Academy of Sciences. 2014 Sep 23;111(38):13984-9. [105] Li Y, Meloni EG, Carlezon WA, Milad MR, Pitman RK, Nader K, Bolshakov VY. Learning and reconsolidation implicate different synaptic mechanisms. Proceedings of the National Academy of Sciences. 2013 Mar 19;110(12):4798-803. [106] Kempter R, Leibold C, Wagner H, van Hemmen JL. Formation of temporal-feature maps by axonal propagation of synaptic learning. Proceedings of the National Academy of Sciences. 2001 Mar 27;98(7):4166-71. [107] Schäfer R, Vasilaki E, Senn W. Perceptual learning via modification of cortical top-down signals. PLoS computational biology. 2007 Aug 17;3(8):e165. [108] Sharpee TO. Function determines structure in complex neural networks. Proceedings of the National Academy of Sciences. 2014 Jun 10;111(23):8327-8. [109] Harris KD. Stability of the fittest: organizing learning through retroaxonal signals. Trends in neurosciences. 2008 Mar 1;31(3):1306.
Supervised Adjustment of Synaptic Plasticity …
187
[110] Xue F, Hou Z, Li X. Computational capability of liquid state machines with spike-timing-dependent plasticity. Neurocomputing. 2013 Dec 25;122:324-9.
In: Neural Networks Editor: Doug Alexander
ISBN: 978-1-53617-188-4 © 2020 Nova Science Publishers, Inc.
Chapter 6
A REVIEW ON INTELLIGENT DECISION SUPPORT SYSTEMS AND A CASE STUDY: PREDICTION OF CRYPTOCURRENCY PRICES WITH NEURAL NETWORKS

Zeynep Orman*, Emel Arslan, Burcu Ozbay and Emad Elmasri
Department of Computer Engineering, Istanbul University-Cerrahpasa, Istanbul, Turkey
ABSTRACT

Intelligent Decision Support Systems (IDSS), which integrate advanced artificial intelligence techniques, have been developed to assist decision-makers during the decision management process. Due to advances in data mining and artificial intelligence techniques, there has been growing interest in the development of such systems. IDSS are becoming increasingly critical to the daily operations of organizations because of the need for rapid and effective decision-making. In this chapter, more than twenty scientific publications related to IDSS, primarily spanning the decade between 2008 and 2018, were analyzed. We present a classification analysis of the IDSS literature based on the data mining algorithms used and the type and performance of the systems, and we also provide information on the current applications, benefits, and future research directions of these systems. When the results are considered, it can be deduced that the use of Artificial Neural Networks (ANN) increases the accuracy rate in many of the related studies. Therefore, in addition to the IDSS review, a case study of an intelligent decision support system built on a neural network model based on the Encog machine learning framework to predict cryptocurrency close prices is also presented in this chapter.

* Corresponding Author's Email: [email protected].
Keywords: intelligent decision support systems, artificial intelligence, artificial neural networks, cryptocurrency prices
INTRODUCTION

Decision Support Systems (DSS) are computer-based systems developed to support decision-makers in the conflicting situations they face while trying to achieve their goals. The decision-making process can be very complex because it may be influenced by many attributes. If such a complex decision-making structure is handled by an artificial intelligence technique, the system becomes an intelligent decision support system. To act intelligently, a system must be able to communicate with other systems and with its environment and, by taking advantage of learning, act so as to maximize its own benefit. An intelligent system takes decisions when it behaves in this way; therefore, most intelligent system architectures include a decision support system structure.

Decision support systems have been preferred for 30 years because of their success in supporting a rapid and correct decision-making process in response to the changing needs of decision-makers. In today's big data age, the accuracy of decision making has increased together with the growing amount of data from which knowledge can be deduced and the development of artificial intelligence applications. Because of the easy adaptability of DSS to different disciplines, these systems, which are created for the effective operation of companies or individuals, have branched into many sub-fields over the years. Information management systems that support decision-making processes related to information transfer, storage, retrieval, and applications; data warehouse systems that provide large-scale data infrastructure for decision support; enterprise reporting and analysis systems; and intelligent decision support systems can be given as examples of these sub-fields [1].

In this chapter, we focus on IDSS, a subfield of decision support systems, and examine studies involving IDSS published between 2008 and 2018. We examine and classify the architecture of these systems, the algorithms that implement them, the fields in which they are applied, their applications and the applications' success rates. In addition, we propose an IDSS case study that estimates the closing prices of three major cryptocurrencies with an ANN.

This chapter is organized as follows. First, the classification of the articles related to IDSS is discussed. Then, the basic data mining methods used in these articles are presented. To support the emerging prominence of ANN for establishing effective IDSS, a case study which predicts three major cryptocurrencies' close prices is given. Finally, conclusions and future work are presented.
RELATED WORKS

Brief information on the purpose of the developed IDSS and the methods used in previous papers is presented in this section. Prior investigation has been performed on some aspects of IDSS, and a brief history of the specific research methods is discussed. Recently, IDSS have been used in various areas. In this study, we consider IDSS that provide medical diagnosis support, financial/business IDSS designed for the effective operation of enterprises (production, logistics, and supply chain), and environment- and energy-based IDSS. These areas were determined to be the three most common fields of IDSS application, and the articles examined are selected from these domains.
Medical Diagnosis IDSS

Quality health care requires accurate diagnosis of diseases and effective treatment of patients. Wrong clinical decisions can be the main reason for unacceptable consequences, and IDSS that facilitate the clinical diagnosis phase can be a solution to these problems. In addition, hospitals want to reduce clinical testing costs, which they can achieve by using an appropriate intelligent decision support system. The health sector collects a great deal of patient information, and advanced data mining techniques can help make these data meaningful for diagnosis.

In 2008, Palaniappan S. et al. introduced an intelligent decision support system for heart disease prediction using Decision Trees, Naïve Bayes and a Neural Network (NN) [2]. In this study, the authors tried to predict the likelihood of patients having heart disease by using medical profiles including age, gender, blood pressure, and blood sugar data. The proposed system extracts hidden knowledge from a heart disease database, and the system is trained to make the test data meaningful. The authors tried to reach five goals determined within the scope of the study: Naïve Bayes could answer four of the five goals, Decision Trees achieved three goals, and the NN achieved only two. The system was developed as a web-based platform and was reliable, scalable, user-friendly and expandable. The Naïve Bayes model, with 95% correct predictions, appeared to perform better than the other two models (NN, 93.54%, and Decision Trees, 94.93%).

Decision making for heart disease detection was also examined by Arabasadi Z. et al. in 2017. This time the researchers used a hybrid NN-Genetic Algorithm model for the diagnosis of coronary artery disease [3]. The combined use of the genetic algorithm with the NN updates the weights of the network much better, and this model increases the performance of the NN by about 10%. Using the proposed method, the authors achieved accuracy, sensitivity and specificity rates of 93.85%, 97%, and 92%, respectively. As future work, they suggested using evolutionary and swarm intelligence methods instead of the Genetic Algorithm.

In a study published in 2012, F. Gorunescu et al. proposed an intelligent decision support system that evaluates the degree of hepatic fibrosis, which is the major indicator of progressive liver disease [4]. Fibroscan is one of the medical imaging methodologies for indicating the degree of hepatic fibrosis, but due to the complex interactions between the Fibroscan results and the biochemical and clinical findings, it is hard to evaluate the degree of hepatic fibrosis manually. Thus, machine learning algorithms were proposed in this study to support an automatic diagnosis system, and an evolutionary-trained NN was developed for the classification of the liver fibrosis stages. Averaged over six experiments using both the hybrid network and five well-known high-performance NNs, the accuracy ranged between 53.66% and 61.16%, with the highest accuracy obtained from the hybrid network.

I. Maglogiannis et al. proposed a Support Vector Machine (SVM) based classifier intelligent decision support system for the diagnosis of breast cancer [5]. The optimized SVM algorithm performed very well, with overall accuracy of up to 96.91%, specificity of up to 97.67% and sensitivity of up to 97.84%.

A. Masood et al. in 2018 proposed an intelligent decision support system for research in pulmonary cancer, using a deep learning-based model fed by an MBAN (Medical Body Area Network). This model, DFCNet, was tried on different datasets with varying scan conditions. The accuracies of the Convolutional Neural Network (CNN) and DFCNet were reported as 77.6% and 84.58%, respectively [6].

H. A. Haenssle et al. in 2018 aimed to diagnose dermoscopic images of lesions of melanocytic origin (melanoma, benign nevi) by using a deep learning architecture, and they compared the results against a large group of 58 dermatologists [7]. As a result of this study, the proposed system's accuracy was 95%, while the diagnostic performance of the dermatologists remained at 86.6%.

F. Lucini et al., in their study in 2017, aimed to identify the patients who should be hospitalized based on the medical records formed during doctor-patient meetings at an emergency department. They used text mining methods to deduce useful information from the medical records. Sets of words were obtained via term frequency, binary representation, and term frequency-inverse document frequency, and feature selection was performed with f-score metrics. Eight data mining algorithms were used for classification: Support Vector Machine, Multinomial Naïve Bayes, Random Forest, Extremely Randomized Tree, Decision Tree, AdaBoost, Logistic Regression and Nu-Support Vector Machine. The Nu-Support Vector Machine obtained the best overall performance, with 77.70% accuracy for predicting hospitalization [8].
Business/Financial IDSS

One of the areas where IDSS are most studied is business and finance. In this sector, IDSS aim to make optimal decisions and reduce risk with approaches based on past experience. The benefits of IDSS in this field are correct and accurate decisions in forecasting financial situations, business profitability, and the right investments. IDSS can also be used to increase the amount and speed of production, to improve logistics activities and even to protect systems from fraud.

Analyzing financial data is very important for predicting future market trends correctly, and sometimes text data can give more information than numeric data. S.W.K. Chan and J. Franklin in 2011 proposed a novel text-based intelligent decision support system to extract information from financial texts [9]. Approximately 28,000 sentences from 2,000 financial reports were analyzed and used to create a decision tree. The accuracy of the proposed system, which was supported by a Hidden Markov Model, was 89.1%.

In 2014, A. Bagheri et al. presented a new hybrid intelligent method to forecast financial time series, especially for the Foreign Exchange Market (FX) [10]. This method used historical market data to estimate investors' market trends. The intelligent decision support system was based on fuzzy logic and swarm optimization, and it achieved 68.98% accuracy.

Being successful in a specific market depends on having accurate information about that market segment. N. Lei and S.K. Moon presented an intelligent decision support system for product design based on market data in 2015 [22]. In this study, Principal Component Analysis (PCA), K-means, and AdaBoost methods were applied to US automotive market data, and accuracy between 76.1% and 93.5% was obtained for different scenarios.

The major growth of online banking fraud has increased the need for fraud detection systems. In 2015, M. Carminati et al. described BANKSEALER [11], an intelligent decision support system for online banking fraud analysis and investigation. During a training phase based on past transactions, BANKSEALER produced a model of each customer's spending habits, and the system then identified new transactions that deviated from the usual ones by using the learned profiles. BANKSEALER analyzed the data with PCA and DBSCAN algorithms.

Maintenance and equipment failure are the two reasons a production line stops, and performing timely and necessary maintenance prevents failures that may result in costly production interruptions. In this context, M. Confalonieri et al. in 2015 described an intelligent decision support system that enables the early discovery of problems that may cause production lines to stop [12], preventing the interruption of the line through the related rescue actions. The data used for decision making were collected with the help of sensors on the production lines and were processed by an ANN; the maintenance activities proposed by the system were added to the weekly schedule. The system had been implemented in the manufacturing plants of IKEA and Brembo and had prevented problems in the production line with 95% accuracy.

In a study published in 2015, Z.X. Guo et al. proposed a radio frequency identification (RFID)-based intelligent decision support system architecture to handle production monitoring and scheduling in a distributed manufacturing environment [13]. The proposed system operated in real time and collected data remotely, and intelligent optimization techniques were implemented to generate an effective production schedule among the distributed plants. With the proposed architecture, the production and logistics operations in the supply chain were improved. In order to make the system intelligent, the authors used the Hybrid Intelligent Optimization Model based on the memetic algorithms proposed by Guo et al. [14]. The system had many benefits, such as a 25% increase in production efficiency, a 12% reduction in production waste, and an 8% reduction in labor and system costs.

A multi-agent intelligent simulator (MAIS) was developed for forecasting dynamic price changes in the US wholesale power market by T. Sueyoshi and G.R. Tadiparthi in 2008 [15]. Each intelligent agent in the system represented a factor in the wholesale market. The MAIS was an intelligent decision support system trained by a combination of Neural Networks and Genetic Algorithms for evaluating new trading strategies in a competitive electricity market, and it was implemented using a data set regarding the California electricity crisis. The study indicated that the estimation accuracies were 84.33% for MAIS, 29.99% for the genetic algorithm alone and 82.98% for NNs alone.
Environment/Energy IDSS

With the rapid increase in the human population, buildings and technology applications have caused energy consumption to grow rapidly, and environmental pollution and climate change are gradually depleting the existing resources. One of the biggest problems of our age is the depletion of energy and water resources: it is estimated that half of the fossil fuel resources will be consumed within ten years if energy consumption around the world continues in this way, and more than half of the world's wetlands have already disappeared, a situation that will only get worse at the current consumption rate. Therefore, IDSS developed for the environment, with the aim of reducing natural resource consumption and using resources efficiently, are now an important field of study.

R. Dutta et al. in 2014 proposed an architecture based on unsupervised machine learning methods [16]. The aim of this study was to estimate the water balance in agricultural areas by using an intelligent decision support system and to reduce wasted water consumption. The collected data were processed by Principal Component Analysis (PCA), Fuzzy C-Means (FCM), and Neural Networks. Prediction accuracy in the range of 62.5% to 91.3% was obtained, depending on the neural network models and the data sets used in the study.

The planet's environmental challenges and finite resources are major threats to its sustainability. Thus, interest in sustainability indicators (SI), which track the current state and evolution of the planet, has increased in recent years. In 2014, S. J. Rivera et al. applied text mining methods to digital news articles to track SI [17]. News articles in San Mateo County, California were analyzed using natural language processing techniques, and the data were classified using the K-Nearest Neighbor (K-NN) classification algorithm with 86% accuracy.

K.A. Nguyen et al. in 2018 showed how smart technologies could be used to achieve operational efficiencies and water savings during the whole urban water life cycle. They developed a novel intelligent decision support software tool (Auto-flow©) using ANN to analyze the water data received from customer smart meter reports and obtained 98% accuracy [18].

Reliable detection of the current situation is an important factor for the sustainability of ecosystems. The amount of Chlorophyll a (Chl a) is an important parameter for estimating whether or not lake ecosystems are in danger; therefore, predicting the amount of Chl a can act as a warning alarm and may encourage authorities to protect lake ecosystems [19]. F. Wang et al. proposed a method that combined Wavelet Analysis and ANN (WA–ANN). A model was created using nine years of hydrological, ecological and meteorological time series data from the Baiyangdian Lake study area in North China. The proposed WA–ANN model performed with 99.9% accuracy.

Predicting energy consumption in buildings can contribute to optimizing energy consumption, but expert-based prediction increases the cost of this process. Maxim S. et al. in 2013 proposed an intelligent decision support system named EFAS [20], a web-based system developed for forecasting the electric energy consumption of buildings. This system reduced the cost of monitoring processes and, more importantly, increased people's awareness of energy management. EFAS was developed on a decision tree approach built from available variables such as electric energy consumption, the status of the building, outdoor temperature, etc. The best accuracy of EFAS, tested with datasets generated from measurements over different periods, was 85.14%.

Table 1 gives a brief summary of the intention, methods and accuracy rates of the articles mentioned above.

Table 1. Summary of the articles related to IDSS
| Area Investigated | Research | Intelligent Decision Support System | Used Method | Obtained Accuracy |
| Medical Diagnosis IDSS | Z. Arabasadi et al., 2017 | Heart Disease Detection System | Hybrid Neural Network-Genetic Algorithm | 93.85% |
| Medical Diagnosis IDSS | F. Gorunescu et al., 2012 | Progressive Liver Disease | Neural Network; Hybrid Neural Network | 53.66%; 60.81%; 61.16% |
| Medical Diagnosis IDSS | S. Palaniappan et al., 2008 | Heart Disease Prediction System | Naïve Bayes; Neural Network; Decision Trees | 95%; 93.54%; 94.93% |
| Medical Diagnosis IDSS | I. Maglogiannis et al., 2009 | Diagnosis of Breast Cancer Disease | Support Vector Machines | 96.91% |
| Medical Diagnosis IDSS | A. Masood et al., 2018 | Diagnosis of Pulmonary Cancer | DFCNet; Convolutional Neural Networks | 84.58%; 77.6% |
| Medical Diagnosis IDSS | H. A. Haenssle et al., 2018 | Diagnosis of Melanoma | Deep Learning - Convolutional Neural Networks | 95% |
| Medical Diagnosis IDSS | F. Lucini et al., 2017 | Diagnosis of Hospitalized Patients | Text mining based: Decision Tree; Random Forest; Extremely Randomized Tree; AdaBoost; Logistic Regression; Multinomial Naïve Bayes; Support Vector Machine; Nu-Support Vector Machine | 67.86%; 76.90%; 77.03%; 73.72%; 76.95%; 76.55%; 77.59%; 77.70% |
| Business/Financial IDSS | S.W.K. Chan, J. Franklin, 2011 | Financial Sequence Prediction | Text-based Decision Tree | 89.1% |
| Business/Financial IDSS | A. Bagheri et al., 2014 | Financial Forecasting | Fuzzy Logic - Swarm Optimization | 68.98% |
| Business/Financial IDSS | N. Lei, S.K. Moon, 2015 | Market-Driven Product Positioning and Design | Principal Component Analysis - K-means - AdaBoost | 76.1%-93.5% |
| Business/Financial IDSS | M. Carminati et al., 2015 | Online Banking Fraud Analysis | Principal Component Analysis - DBSCAN | 74%; 98% |
| Business/Financial IDSS | M. Confalonieri et al., 2015 | Maintenance and Production Optimization | Artificial Neural Networks | 95% |
| Business/Financial IDSS | T. Sueyoshi, G.R. Tadiparthi, 2008 | Analyzing Dynamic Price Change | Genetic Algorithms; Neural Networks; Genetic Algorithms-Neural Networks | 29.99%; 82.98%; 84.33% |
| Environment/Energy IDSS | R. Dutta et al., 2014 | Environmental Knowledge for Sustainable Agriculture | Principal Component Analysis - Fuzzy C-Means - Artificial Neural Networks | 62.5%-91.3% |
| Environment/Energy IDSS | S. J. Rivera et al., 2014 | Advancing Sustainability Indicators | Text mining based K-nearest neighbor | 86% |
| Environment/Energy IDSS | K.A. Nguyen et al., 2018 | Urban Water Management | Artificial Neural Networks - Hidden Markov Model | 93.8% |
| Environment/Energy IDSS | F. Wang et al., 2017 | Sustainable Management of Ecosystems | Wavelet Analysis - Artificial Neural Networks | 99.9% |
| Environment/Energy IDSS | Maxim S. et al., 2013 | Electric Energy Consumption Forecasting | Linear Regression; Artificial Neural Networks; Decision Tree | 96.7%; 98.8%; 85.14% |
COMPUTATIONAL INTELLIGENCE AND DATA MINING METHODS FOR IDSS

There are several computational intelligence and data mining methods for IDSS. In terms of accuracy, Artificial Neural Networks are among the most important of these methods according to the articles reviewed in this study. NNs mimic the behavior of the human brain. Their main components are neurons, interconnected in a layered parallel structure. Thanks to simple processors interconnected with each other, they store experiential knowledge and use it in a parallel distributed computing system. The training (learning) process is performed by determining and updating the weights of the interconnections (synaptic weights). After the synaptic weights have been adjusted during training, the neural model is used in the test phase to check whether the network generalizes and to provide a true measure of its performance [5]. When creating an ANN model, input variables, output variables and weights are required. There are three types of layers: the input layer, which receives data into the network; the hidden layer(s); and the output layer. The configuration of neurons and connections between the layers defines the network behavior, i.e., the relationship between inputs and outputs. The number of layers and the number of neurons in each layer are determined by the designer in a trial-and-error process [3]. Neural networks gain the ability to decide after the training process, and the network structure changes during the learning stage. There are several types of neural network configurations. The basic strength of an ANN is its learning ability: training is a procedure in which a neural network adapts itself to provide the desired output for the input data [21]. Neural networks show great success in finding patterns and clusters in data. For these reasons, they are a preferred method despite their size and complexity and, hence, long training times.

In the last few years, deep learning, which emerged from the development of artificial neural networks, has achieved very good performance in a variety of areas such as visual recognition, speech recognition, and natural language processing. Deep learning techniques have become the most intensively researched and most successful approach in artificial intelligence and machine learning. In this deep learning-based approach, the input signal is processed by consecutive processing units that attempt to map the inputs to the correct targets. The signal processing units are known as layers, which
could be convolutional or fully connected. Among the different types of deep neural networks, convolutional neural networks (CNNs) have been the most extensively studied. The convolutional layer computes feature maps of its inputs with its many kernels, and in this way it learns the inputs. Each neuron in this layer is connected to a local region of the previous layer. There is usually a pooling layer between two convolutional layers; thanks to the pooling layer, the dominant responses in the feature maps are preserved so that the important properties of the input are emphasized. After these layers, there may be one or more fully connected layers to provide high-level reasoning. A CNN terminates with the output layer [23].

Genetic Algorithms, especially in combination with ANNs, are also frequently used in IDSS. Genetic algorithms are a kind of evolutionary algorithm in which biological mechanisms such as inheritance and mutation are used. In genetic algorithms, several operators are used to combine the members of the current generation in order to produce better-qualified members. Based on the fitness criterion, the selection operator chooses the members of a generation that participate in the reproduction process; this is known as fitness-proportionate selection. The selection probability of each individual is calculated as in Eq. (1), where p_i is the selection probability of the i-th individual, f_i is the fitness of individual i, and N is the number of individuals in the population:

p_i = \frac{f_i}{\sum_{k=1}^{N} f_k}    (1)

The substitution operator provides the propagation and transfer of members from one generation to the next. Using the recombination operator, substrings of two members of a chosen generation are exchanged with each other in a crossover manner. The mutation operator is used to make changes in the genes of a member of the current generation in order to produce a new member [3].
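To make Eq. (1) concrete, the sketch below implements fitness-proportionate (roulette-wheel) selection. It is a minimal illustration written for this chapter; the class and method names are our own and are not drawn from any specific library.

```java
import java.util.Random;

public class RouletteWheelSelection {

    // Selects one individual index with probability p_i = f_i / sum(f_k), as in Eq. (1).
    static int select(double[] fitness, Random rng) {
        double total = 0.0;
        for (double f : fitness) {
            total += f;                      // N is fitness.length; f_k must be non-negative
        }
        double r = rng.nextDouble() * total; // spin the wheel
        double cumulative = 0.0;
        for (int i = 0; i < fitness.length; i++) {
            cumulative += fitness[i];
            if (r <= cumulative) {
                return i;                    // the i-th slice contains the random point
            }
        }
        return fitness.length - 1;           // numerical safety for r == total
    }

    public static void main(String[] args) {
        double[] fitness = {2.0, 1.0, 7.0};  // individual 2 should be picked about 70% of the time
        Random rng = new Random(42);
        int[] counts = new int[fitness.length];
        for (int trial = 0; trial < 10_000; trial++) {
            counts[select(fitness, rng)]++;
        }
        for (int i = 0; i < counts.length; i++) {
            System.out.printf("individual %d selected %d times%n", i, counts[i]);
        }
    }
}
```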
IDSS classify or cluster data in the decision-making process; decision support systems therefore rely heavily on classification and clustering algorithms.
Decision tree learning is one of the most practical and widely used classification methods for decision making. Decision trees are intended to correctly classify data by learning from training data labeled with a class. Most decision tree algorithms are variations of a top-down, greedy search algorithm [9]. The entire model is represented in a tree-like structure in which one of the conditions extracted from the dataset serves as the root node and new nodes are formed according to other conditions. Learning is the stage in which this tree structure is formed, and entropy impurity is the usual measure used to construct the root and sub-nodes of the tree.

Random Forest is an ensemble classifier consisting of several decision trees. Each tree is generated recursively by applying a binary classification test at each non-leaf node, using the training set of entries. The final classification of an entry is the most popular one among those obtained from all trees [8].

A Bayesian classifier is a statistical classification technique that assesses a hypothesis H (e.g., whether X is a member of class H) by computing the probability P(H|X) given by Eq. (2), according to Bayes' theorem:

P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}    (2)
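For intuition, here is a small worked example with purely hypothetical numbers (they are not taken from any of the studies cited above): assume P(H) = 0.2, P(X|H) = 0.6 and P(X|¬H) = 0.1. Then

```latex
P(X) = P(X \mid H)P(H) + P(X \mid \neg H)P(\neg H) = 0.6 \cdot 0.2 + 0.1 \cdot 0.8 = 0.20
P(H \mid X) = \frac{P(X \mid H)P(H)}{P(X)} = \frac{0.12}{0.20} = 0.60
```

so observing X raises the estimated probability of H from 0.2 to 0.6.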
Support Vector Machine (SVM) is a classification algorithm that selects a subset of the data set, the support vectors, using the information in a training data set. These vectors locate a hypersurface that separates the input data with a very good degree of generalization. The SVM training process involves optimization of a convex cost function, so there are no local minima to complicate the learning process [5]. The K-Nearest Neighbor (KNN) algorithm is a supervised machine learning technique used for classification. The algorithm finds the K training points nearest to the point whose class is being sought, measures the similarities, and assigns the point to the group in which the largest number of these neighbors lies. KNN is widely used for its simplicity, scalability, and ease of application [17].
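A minimal sketch of the K-NN rule just described, using Euclidean distance and a majority vote; the names and data are illustrative only.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

public class KNearestNeighbors {

    // Classifies 'query' by majority vote among the k training points closest to it.
    static int classify(double[][] points, int[] labels, double[] query, int k) {
        Integer[] order = new Integer[points.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        // Sort training indices by Euclidean distance to the query point.
        Arrays.sort(order, Comparator.comparingDouble(i -> distance(points[i], query)));
        Map<Integer, Integer> votes = new HashMap<>();
        for (int n = 0; n < k; n++) {
            votes.merge(labels[order[n]], 1, Integer::sum);
        }
        // Return the label with the most votes among the k neighbours.
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .get().getKey();
    }

    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) sum += (a[d] - b[d]) * (a[d] - b[d]);
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1, 2}, {8, 8}, {9, 8}, {8, 9}};
        int[] labels = {0, 0, 1, 1, 1};
        System.out.println(classify(points, labels, new double[]{7.5, 8.5}, 3)); // prints 1
    }
}
```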
AdaBoost is a machine learning method that combines several weak classifiers into a single, more accurate ensemble classifier [8]. AdaBoost repeatedly feeds a weak learning algorithm with an input training set. Initially, all sample weights are equal; after each call the weights are updated so that the weights of the misclassified samples are increased [22].

K-Means is a data mining clustering method. The algorithm separates observations into K clusters such that each observation is placed in the cluster with the nearest mean. Algorithmically, K-means uses a two-phase iterative procedure to minimize the sum of point-to-centroid distances, summed over all K clusters. The performance of the clustering depends strongly on the chosen value of K [22].

DBSCAN is a clustering algorithm that exploits zones of different density within the data. It grows regions with sufficiently high density into clusters, defined as maximal sets of density-connected points, and discovers clusters of arbitrary shape in spatial databases with noise [11]. The separated clusters are updated according to closeness at each iteration, and the algorithm continues until the clusters no longer change, so that similar characteristics are collected in the same cluster.

In data mining applications, PCA, a mathematical procedure that uses an orthogonal transformation, removes unnecessary components from high-dimensional data and converts the data into a smaller number of variables called principal components [22]. Linear Regression is a statistical method that summarizes the relationship between two quantitative variables; it gives the value of the estimated dependent variable according to the value of an independent variable.

The basis of fuzzy logic is fuzzy sets and sub-sets. In the classical approach, an entity either is or is not a member of a set: expressed mathematically, the membership value is "1" when the entity is an element of the set and "0" when it is not. In fuzzy logic, however, data can be a member of more than one set, and each entity in a fuzzy set has a membership degree, which can be any value between 0 and 1 [10].

Swarm Optimization (SO) is a random search technique based on a population of particles, inspired by the social behavior of flocks of birds and schools of fish. Each particle is represented by velocity and position vectors. In SO, the particles move in a D-dimensional space based on their own past experience. At each iteration, the particles are compared with each other to find the best position of all particles; the best position in the swarm is called the global best [10].

In the Hidden Markov Model (HMM), the state is not directly visible, but state-dependent outputs are visible. Each state has a probability distribution over the possible outputs, so the series of outputs generated by an HMM gives information about the sequence of states. An HMM is a stochastic finite-state automaton defined by the parameters λ = (π, a, b), where π is an initial state probability, a is a state transition probability and b is an observation probability, defined by a finite multivariate Gaussian mixture [18].

Text mining is a method that aims to extract meaningful data from plain text. The text is converted into quantitative data using preprocessing steps according to the text mining methodology, and classical data mining methods are then applied to this quantitative data.
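To make the clustering step concrete, the following is a bare-bones one-dimensional K-means (assignment/update) iteration; it is an illustrative sketch under simplifying assumptions, not the implementation used in any of the reviewed systems.

```java
import java.util.Arrays;
import java.util.Random;

public class KMeansSketch {

    // One-dimensional K-means for brevity: alternates assignment and centroid update.
    static double[] cluster(double[] data, int k, int iterations, Random rng) {
        double[] centroids = new double[k];
        for (int j = 0; j < k; j++) centroids[j] = data[rng.nextInt(data.length)]; // random init
        int[] assignment = new int[data.length];
        for (int it = 0; it < iterations; it++) {
            // Assignment step: each point goes to the nearest centroid.
            for (int i = 0; i < data.length; i++) {
                int best = 0;
                for (int j = 1; j < k; j++) {
                    if (Math.abs(data[i] - centroids[j]) < Math.abs(data[i] - centroids[best])) best = j;
                }
                assignment[i] = best;
            }
            // Update step: each centroid becomes the mean of its assigned points.
            double[] sum = new double[k];
            int[] count = new int[k];
            for (int i = 0; i < data.length; i++) {
                sum[assignment[i]] += data[i];
                count[assignment[i]]++;
            }
            for (int j = 0; j < k; j++) {
                if (count[j] > 0) centroids[j] = sum[j] / count[j];
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[] data = {1.0, 1.2, 0.8, 9.7, 10.1, 10.4};
        System.out.println(Arrays.toString(cluster(data, 2, 10, new Random(7))));
    }
}
```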
A CASE STUDY TO PREDICT CRYPTOCURRENCY PRICES WITH NEURAL NETWORKS

It has been seen that artificial neural networks come into prominence among the methods used in the IDSS studies in the literature. Therefore, to demonstrate the effectiveness of neural networks, we propose an IDSS that accomplishes the task of predicting the close prices of three major cryptocurrencies. This task is more complex and harder than normal stock exchange prediction because of the different nature, popularity and new internet usage culture of cryptocurrencies; cryptocurrencies seem to be influenced by more external social-psychological factors. To examine the proposed IDSS for predicting cryptocurrency close prices, we built an application that contains data collectors and a neural network model based on the Encog Machine Learning Framework, and we performed tests on the day close prices of different cryptocurrencies. The data used in the research were collected using the CryptoCompare API, which provides a practical interface for obtaining cryptocurrency data with helpful parameters for selecting suitable data, e.g., sample amount and hour/day/year-based observations [25]; a sketch of such a collector is given after this paragraph. The training set consists of 270 observations from 13-9-2017 to 10-6-2018 based on daily close price and volume values, and the predictions are made on the close prices of the 31 days between 10-6-2018 and 10-7-2018. Encog is an advanced machine learning framework that supports a variety of advanced algorithms, as well as support classes to normalize and process data; machine learning algorithms such as Support Vector Machines, Neural Networks, Bayesian Networks, Hidden Markov Models, Genetic Programming and Genetic Algorithms are supported [24]. We tested three cryptocurrencies: Bitcoin, Ethereum and LiteCoin. After performing a component analysis on the data, we chose the close price and market volume as input. We created our model and trained it with the Resilient Propagation learning algorithm (Rprop).
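The sketch below shows a minimal data collector in Java. The request URL and its parameters (fsym, tsym, limit) are assumptions based on CryptoCompare's publicly documented daily-history endpoint and should be checked against the current API documentation; JSON parsing is omitted.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DailyCloseFetcher {

    public static void main(String[] args) throws Exception {
        // Assumed endpoint and parameters (fsym = coin symbol, tsym = quote currency,
        // limit = number of daily samples); verify against the current CryptoCompare docs.
        String url = "https://min-api.cryptocompare.com/data/histoday?fsym=BTC&tsym=USD&limit=270";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        // The response is a JSON document with one entry per day (open/high/low/close, volume);
        // a JSON library would be used to extract the daily close and volume values.
        System.out.println(response.body());
    }
}
```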
Model Building and Training

A predictive neural network uses its inputs to accept information about the current data and uses its outputs to predict future data. It uses two windows, a future window and a past window. Both windows have a window size, which is the amount of data that is either predicted or needed for prediction. In this model, we use a 3-day past window as input to predict a 1-day future window as output. To obtain additional training pairs, both windows are slid forward until the training set is complete; a sketch of this construction follows.
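The sliding-window construction can be sketched as follows with plain arrays (Encog also offers helpers for temporal data sets, but the idea is the same); the window sizes and data below are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

public class WindowedDataset {

    // Turns a daily close-price series into (pastWindow -> futureWindow) training pairs,
    // e.g. pastWindow = 3 and futureWindow = 1 as in the model described above.
    static List<double[][]> buildPairs(double[] series, int pastWindow, int futureWindow) {
        List<double[][]> pairs = new ArrayList<>();
        for (int start = 0; start + pastWindow + futureWindow <= series.length; start++) {
            double[] input = new double[pastWindow];
            double[] ideal = new double[futureWindow];
            System.arraycopy(series, start, input, 0, pastWindow);
            System.arraycopy(series, start + pastWindow, ideal, 0, futureWindow);
            pairs.add(new double[][]{input, ideal});   // slide both windows forward by one day
        }
        return pairs;
    }

    public static void main(String[] args) {
        double[] closes = {10, 11, 12, 13, 14, 15};
        for (double[][] pair : buildPairs(closes, 3, 1)) {
            System.out.println(java.util.Arrays.toString(pair[0]) + " -> " + java.util.Arrays.toString(pair[1]));
        }
    }
}
```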
The data need to be normalized before being used in the neural network model. The normalization range is chosen to be [0, 1], and the equation for data normalization is given by

X^{*}_{k,n} = \frac{X_{k,n} - \min(X_k)}{\max(X_k) - \min(X_k)}    (3)
where X_k is the data series, X^{*}_{k,n} is the normalized value and X_{k,n} is the original value. We create the feed-forward network with input and output layers sized according to our past and future windows, add two hidden layers to the network, and use the Resilient Propagation learning algorithm (Rprop) for training. Rprop does not require specifying any free parameter values, as opposed to backpropagation, which needs a value for the learning rate. Rprop is based on the mathematical concept of the gradient: how the value of the error changes as the value of one weight changes while the other weights and biases are held the same. A gradient is made up of several partial derivatives. The partial derivative for a weight can be thought of as the slope of the tangent line (the slope, not the tangent line itself) to the error function at some value of that weight. The sign of the slope/partial derivative indicates which direction to move in order to reach a smaller error: a negative slope means moving in the positive weight direction, and vice versa. The steepness (magnitude) of the slope indicates how rapidly the error is changing and gives a hint about how far to move to reach a smaller error. We perform the training for 1000 epochs on the 270-day data sample.
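A minimal sketch of the network construction and Rprop training using the Encog 3 Java API is given below. The hidden-layer sizes and the toy data arrays are illustrative assumptions; in the actual study the inputs are the normalized 3-day windows and the ideal outputs are the following day's close price.

```java
import org.encog.engine.network.activation.ActivationSigmoid;
import org.encog.ml.data.MLDataSet;
import org.encog.ml.data.basic.BasicMLDataSet;
import org.encog.neural.networks.BasicNetwork;
import org.encog.neural.networks.layers.BasicLayer;
import org.encog.neural.networks.training.propagation.resilient.ResilientPropagation;

public class ClosePricePredictor {

    // Min-max normalization to [0, 1], as in Eq. (3); applied to the raw close prices
    // before the sliding-window pairs are built.
    static double normalize(double x, double min, double max) {
        return (x - min) / (max - min);
    }

    public static void main(String[] args) {
        // Placeholder data: normalized 3-day input windows and 1-day ideal outputs.
        double[][] inputs = {{0.10, 0.12, 0.15}, {0.12, 0.15, 0.14}, {0.15, 0.14, 0.18}};
        double[][] ideals = {{0.14}, {0.18}, {0.20}};
        MLDataSet trainingSet = new BasicMLDataSet(inputs, ideals);

        // Feed-forward network: input layer (3), two hidden layers, output layer (1).
        BasicNetwork network = new BasicNetwork();
        network.addLayer(new BasicLayer(null, true, 3));                      // input layer
        network.addLayer(new BasicLayer(new ActivationSigmoid(), true, 10));  // hidden layer 1
        network.addLayer(new BasicLayer(new ActivationSigmoid(), true, 10));  // hidden layer 2
        network.addLayer(new BasicLayer(new ActivationSigmoid(), false, 1));  // output layer
        network.getStructure().finalizeStructure();
        network.reset();

        // Resilient propagation needs no learning-rate parameter.
        ResilientPropagation train = new ResilientPropagation(network, trainingSet);
        for (int epoch = 1; epoch <= 1000; epoch++) {
            train.iteration();
        }
        train.finishTraining();
        System.out.println("Final training error: " + train.getError());
    }
}
```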
Experimental Results The experimental results and graphs of these results are shown in Table 2 and Figures 1-3. Cryptocurrencies abbreviations used are as follows:
BTC: Bitcoin, ETH: Ethereum, LTC: LiteCoin.

Table 2. Cryptocurrencies real and predicted values

| BTC Predicted | BTC Real | ETH Predicted | ETH Real | LTC Predicted | LTC Real |
| 7018.28 | 7899.11 | 434.67 | 494.96 | 120.19 | 125.57 |
| 7621.01 | 8079.77 | 478.34 | 511.67 | 136.29 | 128.84 |
| 8324.62 | 8301.82 | 490.52 | 567.25 | 140.95 | 146.58 |
| 7463.79 | 8823.36 | 536.34 | 621.33 | 152.40 | 146.67 |
| 8189.98 | 8873.62 | 615.01 | 617.73 | 153.22 | 144.99 |
| 8767.23 | 9351.47 | 653.41 | 683.02 | 153.84 | 151.96 |
| 8889.00 | 9077.28 | 654.46 | 670.81 | 153.94 | 147.99 |
| 8611.94 | 9699.61 | 689.03 | 784.21 | 153.65 | 168.73 |
| 8563.41 | 9377.81 | 738.96 | 752.40 | 154.50 | 164.23 |
| 8589.86 | 9032.22 | 748.25 | 723.61 | 154.62 | 148.24 |
| 8460.61 | 8709.46 | 715.69 | 729.34 | 153.96 | 144.41 |
| 8480.20 | 8344.78 | 729.78 | 706.72 | 153.73 | 139.32 |
| 8483.01 | 8249.24 | 705.60 | 696.05 | 153.25 | 135.14 |
| 7932.55 | 7992.75 | 686.46 | 640.84 | 152.56 | 128.13 |
| 7249.59 | 7475.36 | 644.73 | 584.77 | 149.39 | 119.07 |
| 7635.23 | 7118.88 | 600.81 | 512.03 | 142.14 | 111.17 |
| 7547.14 | 7502.15 | 494.99 | 577.23 | 124.11 | 118.38 |
| 6758.17 | 7719.75 | 513.05 | 619.04 | 117.81 | 125.18 |
| 7295.16 | 7661.79 | 595.11 | 606.30 | 121.66 | 121.49 |
| 7010.25 | 7513.69 | 606.61 | 593.38 | 127.51 | 117.46 |
| 7172.66 | 6556.94 | 557.24 | 494.53 | 125.59 | 100.04 |
| 6979.25 | 6396.71 | 471.95 | 487.51 | 110.56 | 95.55 |
| 7194.59 | 6714.82 | 420.71 | 517.63 | 98.55 | 98.73 |
| 6595.61 | 6720.64 | 482.80 | 525.77 | 90.72 | 96.70 |
| 6497.68 | 6157.78 | 515.80 | 455.25 | 87.80 | 80.62 |
| 6638.91 | 6141.57 | 465.33 | 441.75 | 84.54 | 80.73 |
| 6583.05 | 6385.38 | 457.68 | 453.42 | 84.78 | 81.12 |
| 6534.44 | 6509.58 | 457.50 | 461.95 | 80.40 | 84.99 |
| 6507.19 | 6602.02 | 468.17 | 469.93 | 82.31 | 83.08 |
| 6524.32 | 6668.84 | 473.29 | 471.48 | 81.07 | 80.47 |
| 6570.91 | 6456.35 | 476.27 | 446.98 | 80.71 | 76.09 |
Figure 1. Bitcoin day close prices 10-6-2017 to 10-7-2018. Real values (──), predictions (- - -).
The success rate of the predictions varies between 75% and 97.3% for Bitcoin day close price prediction, as seen in Figure 1; however, Bitcoin shows fewer variations than the other coins (Figures 2-3).
Figure 2. Ethereum day close prices 10-3-2018 to 11-4-2018, Real values (──), predictions (- - -).
Figure 3. Lite Coin day close prices 10-6-2017 to 10-7-2018, Real values (──), predictions (- - -).
CONCLUSION

DSS help decision-makers handle complex decision-making processes. IDSS, a term used for the integration of machine learning methods with decision support systems, make inferences about the future and thereby provide great support to decision-makers. Recently, methods based on IDSS have been widely used in several areas. This study reviewed intelligent systems in three different areas, namely medical diagnosis, financial/business, and environment/energy, and it also presented an intelligent decision support system case study for forecasting cryptocurrency prices.

As a result of the studies that were reviewed, it was seen that the success of the data mining algorithms is closely related to the distribution of the dataset. Some of the methods obtained very good results for some datasets in terms of accuracy, but the same methods showed low results on other datasets. In general, a pure ANN has approximately the same accuracy as the other models; however, we can deduce that the prediction accuracy of an ANN trained by genetic algorithms or by a different methodology is increased. In particular, ANN and deep learning-based CNN are highly effective in decision making. Therefore, we have also used an ANN as the intelligent decision support system in the case study to predict cryptocurrency close prices. This prediction is much more complex and harder than normal stock exchange prediction, mainly because cryptocurrencies are influenced by more external social-psychological factors and their prices change instantly. Nevertheless, the proposed system presented high accuracy. We also suggest making use of text mining technologies to analyze web-based or press news to obtain better predictions in the future.
REFERENCES

[1] Arnott, D., G. Pervan, Decision Support Systems 44 (2008) 657–672.
[2] Palaniappan, S., Intelligent Heart Disease Prediction System Using Data Mining Techniques, Computer Systems and Applications, 2008. AICCSA 2008. IEEE/ACS International Conference on.
[3] Arabasadi, Z. et al., Computer aided decision making for heart disease detection using hybrid neural network-Genetic algorithm, Computer Methods and Programs in Biomedicine 141 (2017) 19–26.
[4] Gorunescu, F. et al., Intelligent decision-making for liver fibrosis stadialization based on tandem feature selection and evolutionary-driven neural network, Expert Systems with Applications 39 (2012) 12824–12832.
[5] Maglogiannis, I. et al., An intelligent system for automated breast cancer diagnosis and prognosis using SVM based classifiers, Applied Intelligence (2009), Volume 30, Issue 1, pp. 24–36.
[6] Masood, A. et al., Computer-Assisted Decision Support System in Pulmonary Cancer detection and stage classification on CT images, Journal of Biomedical Informatics 79 (2018) 117–128.
[7] Haenssle, H. A. et al., Man against machine: diagnostic performance of a deep learning convolutional neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists, Annals of Oncology.
[8] Lucini, F. et al., Text mining approach to predict hospital admissions using early medical records from the emergency department, International Journal of Medical Informatics 100 (2017) 1-8.
[9] Chan, S. W. K., J. Franklin, A text-based decision support system for financial sequence prediction, Decision Support Systems 52 (2011) 189–198.
[10] Bagheri, A. et al., Financial forecasting using ANFIS networks with Quantum-behaved Particle Swarm Optimization, Expert Systems with Applications 41 (2014) 6235–6250.
[11] Carminati, M. et al., BANKSEALER: A decision support system for online banking fraud analysis and investigation, Computers & Security 53 (2015) 175-186.
[12] Confalonieri, M. et al., An AI based Decision Support System for preventive maintenance and production optimization in energy intensive manufacturing plants, Engineering, Technology and Innovation/International Technology Management Conference (ICE/ITMC) (2015).
[13] Guo, Z. X. et al., An RFID-based intelligent decision support system architecture for production monitoring and scheduling in a distributed manufacturing environment, Int. J. Production Economics 159 (2015) 16–28.
[14] Guo, Z. X. et al., Modeling and Pareto optimization of multi-objective order scheduling problems in production planning, Comput. Ind. Eng. 64 (2013) 972–986.
[15] Sueyoshi, T., G. R. Tadiparthi, An agent-based decision support system for wholesale electricity market, Decision Support Systems 44 (2008) 425–446.
[16] Dutta, R., Development of an intelligent environmental knowledge system for sustainable agricultural decision support, Environmental Modelling & Software 52 (2014) 264-272.
[17] Rivera, S. J. et al., A text mining framework for advancing sustainability indicators, Environmental Modelling & Software 62 (2014) 128-138.
[18] Nguyen, K. A. et al., Re-engineering traditional urban water management practices with smart metering and informatics, Environmental Modelling & Software 101 (2018) 256-267.
[19] Wang, F. et al., Chlorophyll a Simulation in a Lake Ecosystem Using a Model with Wavelet Analysis and Artificial Neural Network, Environmental Management 51 (2013) 1044–1054.
[20] Maxim, S. et al., Automated Electric Energy Consumption Forecasting System Based On Decision Tree Approach, IFAC Proceedings Volumes 46 (2013) 1027-1032.
[21] Agrawal, Shikha and Jitendra Agrawal, Neural Network Techniques for Cancer Prediction: A Survey, Procedia Computer Science 60 (2015) 769–774.
[22] Lei, N., S. K. Moon, A Decision Support System for market-driven product positioning and design, Decision Support Systems 69 (2015) 82–91.
[23] Gu, J. et al., Recent advances in convolutional neural networks, Pattern Recognition 77 (2018) 354–377.
[24] Encog Machine Learning Framework, http://www.heatonresearch.com/encog/.
[25] CryptoCompare API, https://www.cryptocompare.com/api/.
INDEX A activation state, 158, 159 activity level, 171 activity rate, 149 actual output, 14, 155, 160 AdaBoost, 41, 70, 194, 195, 198, 199, 203 ANN-based classifiers, 55, 56 appearance-based methods, 33 artificial intelligence, vii, xii, 2, 189, 190, 191, 200 Artificial Neural Networks (ANNs), v, vii, viii, ix, x, xi, xii, 1, 2, 3, 4, 7, 10, 11, 12, 26, 27, 31, 32, 33, 34, 53, 57, 67, 75, 76, 90, 92, 97, 98, 100, 109, 110, 114, 119, 140, 149, 175, 190, 199, 200, 201, 204
B Bayesian Classifier, 202 brain, 3, 4, 8, 115, 139, 142, 151, 153, 155, 176, 177, 181, 184 breast cancer, viii, x, xi, 97, 98, 99, 100, 103, 104, 106, 108, 109, 110, 193, 210
breast carcinoma, 109 breast mass, 106, 111 BU-3DFE database, 51, 57, 60
C CAD, xi, 97, 98, 99, 100, 105, 108 calcification, vii, xi, 98, 100 calcium, 150, 180 cancer, xi, 97, 98, 99, 100, 101, 102, 105, 106, 107, 108, 109, 110, 111, 193 cancer death, 98 carcinoma, x, xi, 97, 98, 104, 105 case study, viii, xii, 139, 190, 191, 209 chi-squared test, 104 CK+ database, 38, 41, 60, 67 clustering, 10, 24, 25, 94, 146, 147, 201, 203 clusters, 109, 140, 200, 203 CNN, 34, 36, 37, 38, 39, 40, 42, 50, 53, 54, 55, 56, 60, 62, 63, 64, 65, 66, 69, 73, 193, 201, 210 CNN method, 37, 55 computer-aided detection, 99, 104, 108, 109, 110
214
Index
convolutional neural network, 34, 36, 69, 181, 182, 201, 211, 212 cryptocurrency, viii, xii, 190, 205, 209 cryptocurrency prices, 190, 209
D Darwin, Charles, 32 data mining, xii, 189, 191, 192, 194, 199, 203, 204, 209 data processing, 8, 114 data set, 36, 37, 38, 39, 40, 53, 57, 158, 163, 164, 165, 166, 167, 169, 196, 197, 202 data structure, 8 database, 34, 38, 41, 42, 43, 44, 45, 48, 49, 50, 51, 52, 57, 59, 60, 61, 62, 63, 64, 65, 67, 74, 192 decision support systems, 191, 201, 209 decision trees, 202 decision-making process, 190, 191, 201, 209 deep learning, 22, 38, 40, 87, 106, 177, 193, 200, 210, 211 detection, vii, xi, 17, 32, 34, 37, 42, 52, 67, 68, 69, 70, 94, 98, 99, 100, 103, 104, 106, 108, 109, 110, 111, 192, 195, 197, 210 detection system, 110, 195 deviation, 121, 132, 133, 135 differential equations, 118, 120 digital mammography, 98, 101, 109, 110 dimensionality, 43, 46, 70, 146 dipole mode index, vii, ix, 75 discriminant analysis, 52 discrimination, 155, 163, 172, 185 discrimination tasks, 155 dopamine, 119, 142, 152, 178, 179, 183 ductal carcinoma in situ (DCIS), xi, 97, 98, 99, 100, 101, 104, 105, 106
E emotion recognition, vii, ix, 31, 32, 33, 47, 51, 53, 67, 69, 71, 72, 73 encoding, viii, xii, 114, 115, 117, 122, 130, 155 Encog machine learning, viii, xii, 190, 205, 212 energy, 107, 114, 140, 150, 177, 181, 191, 196, 197, 209, 211 energy consumption, 196, 197 energy efficiency, 114, 140 energy supply, 115 extraction, ix, xi, 24, 32, 33, 34, 35, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 62, 65, 67, 68, 74, 98, 102, 152 eye movement, 184
F Facial Action Coding, 74 facial expression, vii, viii, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 67, 68, 69, 70, 72, 73, 74 facial expression analysis, ix, 32, 33, 36, 38, 42, 52, 67, 68, 73, 74 false negative, vii, xi, 97, 99 false positive, vii, xi, 97, 100 feature extraction, ix, xi, 24, 32, 33, 34, 35, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 55, 56, 57, 62, 65, 67, 68, 74, 98, 102, 152 feature selection, 36, 51, 67, 210 features extraction, 71 FER2013 database, 42, 64 forecasting, 77, 78, 80, 194, 196, 197, 209, 211
Index
215
G genetic algorithm, 192, 196, 201, 210 geometric-based methods, 33 GPS, 47, 95
H Hidden Markov Model (HMM), 35, 194, 199, 204, 205 human actions, 166, 185 human behavior, 3, 167, 169 human brain, 3, 7, 115, 200 human nature, 2 hybrid, 33, 38, 43, 51, 54, 68, 69, 192, 193, 194, 210
I image features, xi, 98, 102 Indian Ocean dipole, 76, 89 information processing, 7, 114, 140, 153, 176 input signal, 152, 154, 200 integration, xii, 74, 118, 178, 189, 209 intelligence, xii, 3, 4, 189, 199 intelligent decision support systems, 190, 191 intelligent systems, 209 interference, 154, 156, 162, 166, 169, 171, 172, 174, 175, 183, 185 interneurons, 180 invasive ductal carcinoma (IDC), x, xi, 97, 98, 100, 101, 104, 105, 106 invasive lobular carcinoma (ILC), x, xi, 97, 98, 100, 101, 104, 105, 106 iteration, 13, 18, 159, 167, 172, 203, 204
J Jaffe database, 61, 62
K KDEF database, 61, 64 K-means, 195, 199, 203 K-Nearest Neighbor Classification (K-NN), 197 k-NN method, 55
L leave-one-out cross (LOOCV) approach, 102
M machine learning, vii, viii, xii, 1, 22, 32, 33, 34, 153, 181, 190, 193, 196, 200, 202, 203, 205, 209 mammography, vii, xi, 98, 99, 100, 101, 102, 106, 107, 108, 109, 110, 111 matrix, 18, 19, 145, 156, 157, 159, 160, 165, 166, 167, 168, 170, 171 memory, 8, 23, 26, 36, 121, 139, 142, 179 memory processes, 142 MMI database, 38, 42, 45, 61, 63 mobile robots, 115, 140 models, vii, ix, 5, 6, 32, 33, 34, 56, 76, 77, 79, 80, 81, 84, 85, 86, 87, 106, 120, 132, 146, 148, 150, 151, 152, 154, 155, 156, 165, 168, 169, 172, 174, 176, 178, 192, 197, 209 Multi-PIE database(s), 48, 51, 61
N Naïve Bayes, 192, 194, 198
216
Index
nervous system, 114, 185 networking, 21, 176 neural connection, 3 neural development, 175 neural function, 175 Neural Network (NN), v, vi, vii, viii, ix, x, xi, xii, 1, 4, 5, 6, 8, 11, 12, 13, 15, 16, 17, 18, 19, 20, 22, 23, 24, 25, 28, 29, 30, 31, 34, 37, 38, 43, 47, 52, 53, 54, 55, 56, 59, 61, 62, 66, 67, 68, 69, 72, 76, 90, 93, 98, 106, 109, 113, 117, 151, 176, 179, 181, 182, 189, 190, 192, 193, 196, 197, 198, 199, 200, 204, 205, 206, 212 neural systems, 140, 143 neuronal systems, 6, 150 neuron(s), 3, 4, 5, 6, 7, 15, 17, 19, 20, 21, 23, 27, 78, 79, 81, 106, 116, 117, 118, 119, 120, 121, 123, 128, 129, 130, 132, 133, 135, 139, 141, 142, 143, 144, 147, 151, 152, 154, 156, 157, 159, 160, 161, 163, 165, 170, 171, 174, 175, 176, 177, 178, 179, 180, 181,182, 185, 186, 200
O Oulu-CASIA database, 45, 61, 63
P pattern recognition, 5, 10, 32, 146, 148, 152, 154, 156, 157, 161, 165, 174, 175, 176, 183 plasticity, viii, xii, 114, 115, 140, 142, 143, 146, 148, 149, 150, 152, 153, 155, 156, 157, 159, 162, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 178, 179, 180, 182, 184, 185, 186, 187 Principal Component Analysis (PCA), 41, 43, 46, 49, 54, 56, 59, 67, 71, 74, 107, 195, 197, 199, 203
probability, 19, 25, 107, 115, 116, 159, 164, 201, 202, 204 probability distribution, 204
R RAFD database, 64, 65 Random Forest, 54, 55, 63, 194, 198, 202 Random Forest classifier, 56 receiver operating characteristic (ROC) analysis, 103, 104 recognition, vii, viii, ix, xii, 1, 3, 5, 23, 29, 31, 32, 33, 34, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 62, 65, 67, 68, 69, 70, 71, 72, 73, 107, 114, 140, 148, 151, 152, 154, 156, 163, 164, 166, 167, 169, 170, 171, 172, 173, 179, 181, 182, 200, 211 regression, 39, 69, 90, 106, 154, 155, 157, 159, 165 reinforcement learning, 116, 120, 152, 175, 183 repetitions, 58, 66 Resilient Propagation Learning Algorithm (Rprop), 205, 206 robot hand, 5 robotics, ix, 31
S seasonal prediction, x, 76 sensitivity, 99, 100, 104, 105, 108, 109, 174, 179, 192, 193 sensor(s), 46, 116, 122, 123, 127, 130, 135, 136, 177, 195 sensory data, 154, 156 sensory system, 174, 185 SFEW database, 61, 64 signals, 9, 20, 46, 100, 114, 116, 155, 165, 168, 175, 186
Index simulation, 3, 4, 119, 127, 130, 134, 137, 157, 165, 172, 173, 179, 181 spiking neural network, xi, 113, 114, 131, 139, 141, 149, 151, 176, 177, 178, 181, 182, 183 supervised learning, viii, xi, 11, 24, 38, 113, 114, 116, 119, 122, 139, 140, 141, 142, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 182 Support Vector Machine (SVM), 34, 36, 38, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 51, 53, 54, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 70, 71, 149, 150, 185, 193, 194, 198, 202, 205, 210 SVM method, 42, 45, 53, 54 swarm intelligence, 193 swarm optimization, 194 synaptic plasticity, viii, xii, 114, 151, 161, 165, 166, 167, 169, 172, 174, 175, 179, 180, 183, 185, 186 synaptic strength, 118, 154, 168, 170, 179 synaptic transmission, 178, 179
217 testing, 38, 41, 45, 47, 52, 63, 102, 103, 130, 131, 133, 138, 141, 145, 148, 150, 157, 164, 192 text mining, 193, 197, 204, 210, 212 TFEID database, 65 time series, 77, 78, 80, 87, 194, 197
U under the ROC curve (Az), 104 underlying mechanisms, 146
V vector, 2, 15, 18, 19, 33, 35, 41, 42, 45, 46, 56, 66, 70, 117, 125, 126, 127, 129, 133, 149, 157, 158, 159, 160, 162, 165 velocity, 134, 136, 138, 204 visual attention, 37, 69 visual system, 143
W T temporal encoding, viii, xii, 114, 115 test data, 44, 53, 144, 145, 146, 166, 192
wavelet, 47, 72 Wavelet Analysis, 197, 199, 212 worry, 131